Traceroute data correlation #17

Open
opened 2026-02-28 21:37:47 +03:00 by skobkin · 0 comments
Owner

Add bounded ingest-side traceroute correlation with timeout synthesis

Summary

Implement bounded ingest-side traceroute correlation so the app can combine:

  • a traceroute request packet
  • a later traceroute reply packet
  • a later routing error packet

into one logical traceroute run, with explicit lifecycle status such as requested, partial, completed, failed, or timed_out.

Problem

The current codebase now decodes individual TRACEROUTE_APP and ROUTING_APP packets with much richer semantics, but it still treats them as packet-local events.

That means:

  • operators must mentally correlate request, reply, and failure packets
  • a request with no visible reply on MQTT never becomes an explicit timeout result
  • reply and routing error packets are not attached to a single traceroute lifecycle
  • duplicate MQTT retransmissions are deduplicated at packet level, but there is no higher-level traceroute lifecycle dedup

This is especially important because MQTT often exposes only part of the real traceroute exchange. The app should preserve partial evidence while still producing a useful traceroute-level outcome.

Goal

Add a short-lived ingest-side correlator that tracks traceroute requests by request_id, merges later reply or routing-failure packets into the same logical run, and synthesizes a timeout when no terminal packet arrives within a bounded window.

The intent is to improve operator visibility without fabricating data that MQTT never exposed.

Non-goals

  • Do not move this logic into internal/meshtastic; parsing should stay packet-local and stateless.
  • Do not introduce unbounded in-memory tracking.
  • Do not fabricate route or return-path data that is not supported by packet evidence.
  • Do not replace current semantic packet decoding with a transport-specific state machine.

Why ingest is the right layer

Correlation is transport-observation policy, not packet parsing.

The parser should continue to answer questions like:

  • is this packet a traceroute request or reply?
  • what is the request id?
  • what forward or return path can be reconstructed from this packet alone?

The ingest layer should answer questions like:

  • did this reply correspond to a previously seen request?
  • did this traceroute fail due to routing error?
  • did this request time out because no terminal packet arrived in time?
  • has a final lifecycle event already been emitted?

That makes internal/ingest the appropriate ownership boundary.

Likely implementation area

  • internal/ingest/service.go
  • a small helper such as internal/ingest/traceroute_tracker.go

Proposed tracked state

Track active traceroute requests in an in-memory bounded map keyed by request_id.

Each tracked entry should include at least:

  • request_id
  • request packet id
  • source node id
  • destination node id
  • channel
  • first observed time
  • last updated time
  • current status
  • best known forward path
  • best known return path
  • best known forward SNR
  • best known return SNR
  • failure reason, if any
  • flags indicating whether request, final state, or timeout has already been emitted
  • source packet ids used to build the lifecycle state
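
The field list above could map onto a Go struct roughly as follows. This is a minimal sketch rather than the project's actual type: the names `trackedRun`, `newTrackedRun`, and `at`, the use of `uint32` for node and packet ids, the string channel name, and the `float32` SNR slices are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Status is the lifecycle state of one correlated traceroute run.
type Status string

const (
	StatusRequested Status = "requested"
	StatusPartial   Status = "partial"
	StatusCompleted Status = "completed"
	StatusFailed    Status = "failed"
	StatusTimedOut  Status = "timed_out"
)

// trackedRun is a hypothetical tracker entry; field names are illustrative.
type trackedRun struct {
	RequestID       uint32
	RequestPacketID uint32
	From, To        uint32 // source and destination node ids
	Channel         string
	FirstObserved   time.Time // ingest observation time, not the packet timestamp
	LastUpdated     time.Time
	Status          Status
	ForwardPath     []uint32
	ReturnPath      []uint32
	ForwardSNR      []float32
	ReturnSNR       []float32
	FailureReason   string // empty when there is no failure
	RequestEmitted  bool
	FinalEmitted    bool
	TimeoutEmitted  bool
	SourcePacketIDs []uint32 // packets that contributed to this state
}

// newTrackedRun builds the entry created when a request packet is first seen.
func newTrackedRun(requestID, packetID, from, to uint32, channel string, observed time.Time) *trackedRun {
	return &trackedRun{
		RequestID:       requestID,
		RequestPacketID: packetID,
		From:            from,
		To:              to,
		Channel:         channel,
		FirstObserved:   observed,
		LastUpdated:     observed,
		Status:          StatusRequested,
		SourcePacketIDs: []uint32{packetID},
	}
}

// at is a small helper for the example: a fixed instant, seconds since epoch.
func at(sec int64) time.Time { return time.Unix(sec, 0).UTC() }

func main() {
	run := newTrackedRun(0x1a2b, 0x1a2b, 0x11, 0x22, "LongFast", at(0))
	fmt.Println(run.Status, len(run.SourcePacketIDs))
}
```

Keeping the emitted flags on the entry itself means the dedup decision needs no second lookup structure.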

Suggested statuses

  • requested
  • partial
  • completed
  • failed
  • timed_out

Correlation rules

1. Request packet

Packet shape:

  • TRACEROUTE_APP
  • want_response = true

Behavior:

  • create a tracker entry keyed by request_id
  • if request id is absent, use the request packet id as the tracking key
  • mark lifecycle status as requested
  • store request metadata such as source, destination, channel, observed time, and packet id
  • do not treat the request packet itself as a successful traceroute result
  • optionally emit a lifecycle log row with status requested

2. Reply packet

Packet shape:

  • TRACEROUTE_APP
  • want_response = false

Correlation:

  • correlate primarily by request_id
  • use reply_id only as fallback if needed

Behavior:

  • look up the matching tracker entry
  • merge in forward path, return path, and SNR data
  • prefer explicit route arrays over inferred paths when both exist
  • preserve partial result data if that is all MQTT exposed
  • mark status completed if the reply provides a usable result
  • if result is still incomplete but useful, mark status partial
  • do not emit multiple completion records for duplicate reply packets

3. Routing error packet

Packet shape:

  • ROUTING_APP
  • request_id present
  • error_reason != NONE

Behavior:

  • look up the matching tracker entry
  • mark status failed
  • attach error_reason
  • emit one terminal failure record
  • do not treat error_reason = NONE as failure
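
The packet shapes in rules 1 to 3 can be sketched as a small classifier plus a key-selection helper. The `packet` struct below is a hypothetical, already-decoded view (it is not the type exposed by internal/meshtastic), and the string-valued port and error fields, the use of 0 for "absent", and the names `classify` and `trackingKey` are assumptions.

```go
package main

import "fmt"

// packet is a hypothetical decoded view; a zero id means "absent".
type packet struct {
	PortNum      string // "TRACEROUTE_APP" or "ROUTING_APP"
	PacketID     uint32
	WantResponse bool
	RequestID    uint32
	ReplyID      uint32
	ErrorReason  string // "NONE", "NO_ROUTE", ...; empty when not present
}

type kind int

const (
	kindIgnored kind = iota
	kindRequest
	kindReply
	kindRoutingError
)

func (k kind) String() string {
	switch k {
	case kindRequest:
		return "request"
	case kindReply:
		return "reply"
	case kindRoutingError:
		return "routing_error"
	}
	return "ignored"
}

// classify applies the packet-shape rules: a routing packet only counts as a
// failure when it carries a request_id and a reason other than NONE.
func classify(p packet) kind {
	switch p.PortNum {
	case "TRACEROUTE_APP":
		if p.WantResponse {
			return kindRequest
		}
		return kindReply
	case "ROUTING_APP":
		if p.RequestID != 0 && p.ErrorReason != "" && p.ErrorReason != "NONE" {
			return kindRoutingError
		}
	}
	return kindIgnored
}

// trackingKey picks the map key: request_id first, then the documented
// fallback (packet id for requests, reply_id for replies).
func trackingKey(p packet) (uint32, bool) {
	var key uint32
	switch classify(p) {
	case kindRequest:
		key = p.RequestID
		if key == 0 {
			key = p.PacketID
		}
	case kindReply:
		key = p.RequestID
		if key == 0 {
			key = p.ReplyID
		}
	case kindRoutingError:
		key = p.RequestID
	}
	return key, key != 0
}

func main() {
	req := packet{PortNum: "TRACEROUTE_APP", PacketID: 42, WantResponse: true}
	ack := packet{PortNum: "ROUTING_APP", RequestID: 42, ErrorReason: "NONE"}
	key, ok := trackingKey(req)
	fmt.Println(classify(req), key, ok)
	fmt.Println(classify(ack))
}
```

Classification stays a pure function of one packet, so it could live on either side of the parser/ingest boundary; only the tracker lookup that follows it is stateful.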

4. Timeout

Behavior:

  • if neither a reply nor a failure arrives within the configured timeout window, mark status timed_out
  • synthesize timeout from ingest observation time, not only packet timestamp
  • emit exactly one timeout record
  • retain the entry only long enough to avoid duplicate terminal emission, then evict it
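
One way to sketch the timeout rule is a sweep function that is safe to call repeatedly. The 60-second timeout and 30-second retention below are placeholder values, and `run`, `sweep`, and `at` are hypothetical names; the only behavior taken from the text above is that the clock is ingest observation time, the timeout is emitted once, and the entry is evicted after a short dedup window.

```go
package main

import (
	"fmt"
	"time"
)

const (
	exampleTimeout   = 60 * time.Second // assumed; the issue only asks for a bounded window
	exampleRetention = 30 * time.Second // assumed grace period for final-state dedup
)

// run is a reduced tracker entry holding only what the sweep needs.
type run struct {
	ObservedAt   time.Time // when ingest first saw the request
	Status       string
	FinalEmitted bool
}

// sweep marks overdue open runs as timed_out and returns their keys so the
// caller can emit one record each. Runs that are already final are dropped
// once the retention window has also passed.
func sweep(runs map[uint32]*run, now time.Time) []uint32 {
	var timedOut []uint32
	for id, r := range runs {
		age := now.Sub(r.ObservedAt)
		switch {
		case !r.FinalEmitted && age >= exampleTimeout:
			r.Status = "timed_out"
			r.FinalEmitted = true
			timedOut = append(timedOut, id)
		case r.FinalEmitted && age >= exampleTimeout+exampleRetention:
			delete(runs, id) // deleting during range is allowed in Go
		}
	}
	return timedOut
}

// at is a small helper for the example: seconds since epoch.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	runs := map[uint32]*run{
		1: {ObservedAt: at(0), Status: "requested"},
		2: {ObservedAt: at(50), Status: "requested"},
	}
	fmt.Println(len(sweep(runs, at(70)))) // run 1 times out
	fmt.Println(len(sweep(runs, at(70)))) // second sweep emits nothing
	sweep(runs, at(95))                   // run 1 leaves the map
	fmt.Println(len(runs))
}
```

Because `FinalEmitted` is set in the same pass that returns the key, a sweeper that fires twice cannot produce a second timeout record.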

Bounded-memory requirements

This feature must remain bounded and safe for long-running processes.

Suggested controls:

  • per-entry TTL, for example 30 to 120 seconds
  • maximum active tracked entries, for example 1,000 or 10,000
  • cleanup on normal ingest flow or via a lightweight periodic sweeper
  • eviction policy that removes expired entries first, oldest first

If the tracker is full:

  • evict expired entries first
  • if still full, evict the oldest non-final entry
  • log a warning with useful fields such as active entry count, evictions, and timeout window
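
The two-pass eviction described above might look like the sketch below. The names `entry`, `makeRoom`, and `at`, the 60-second TTL, and the returned `ok` flag are assumptions; in particular, the behavior when only final entries remain is not specified above, so this sketch simply reports that no room could be made and leaves the decision to the caller.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

const exampleTTL = 60 * time.Second // assumed value inside the suggested 30 to 120 s range

// entry is a reduced tracker entry holding only what eviction needs.
type entry struct {
	ObservedAt time.Time
	Final      bool // a terminal lifecycle record was already emitted
}

// makeRoom frees at least one slot when the map is at capacity. It returns
// how many entries were evicted and whether a new entry now fits.
func makeRoom(entries map[uint32]*entry, maxEntries int, ttl time.Duration, now time.Time) (evicted int, ok bool) {
	if len(entries) < maxEntries {
		return 0, true
	}
	// Pass 1: drop everything past its TTL.
	for id, e := range entries {
		if now.Sub(e.ObservedAt) >= ttl {
			delete(entries, id)
			evicted++
		}
	}
	// Pass 2: while still full, drop the oldest non-final entry.
	for len(entries) >= maxEntries {
		var oldestID uint32
		var oldest *entry
		for id, e := range entries {
			if e.Final {
				continue
			}
			if oldest == nil || e.ObservedAt.Before(oldest.ObservedAt) {
				oldestID, oldest = id, e
			}
		}
		if oldest == nil {
			break // only final entries remain; the caller decides what to do
		}
		delete(entries, oldestID)
		evicted++
	}
	return evicted, len(entries) < maxEntries
}

// at is a small helper for the example: seconds since epoch.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	entries := map[uint32]*entry{
		1: {ObservedAt: at(0)},
		2: {ObservedAt: at(50)},
		3: {ObservedAt: at(60), Final: true},
	}
	evicted, ok := makeRoom(entries, 3, exampleTTL, at(100))
	if evicted > 0 {
		log.Printf("traceroute tracker full: active=%d evicted=%d ttl=%s", len(entries), evicted, exampleTTL)
	}
	fmt.Println(evicted, ok, len(entries))
}
```

The linear scan for the oldest entry is acceptable at the suggested sizes of 1,000 to 10,000 entries; an ordered structure would only be worth adding if eviction turned out to be hot.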

Dedup requirements

There are two separate dedup concerns.

1. Packet-level dedup

The existing packet dedup logic already suppresses exact duplicate MQTT packet IDs.

That should remain in place.

2. Lifecycle-level dedup

The new tracker must additionally ensure it does not emit duplicate final lifecycle records when:

  • the same reply is seen more than once
  • the same routing error is seen more than once
  • timeout cleanup runs more than once
  • a late duplicate terminal packet arrives after a final state was already emitted

Each tracked entry should remember whether a final lifecycle event has already been emitted.
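
A minimal sketch of that per-entry guard, assuming a hypothetical `lifecycle` type with a `finalize` method: the first terminal transition wins and tells the caller to emit, and every later attempt is a no-op.

```go
package main

import "fmt"

// lifecycle is a reduced tracker entry holding only the dedup state.
type lifecycle struct {
	Status       string
	finalEmitted bool
}

// finalize moves the run to a terminal status and reports whether the caller
// should emit a final record. Only the first call returns true.
func (l *lifecycle) finalize(status string) bool {
	if l.finalEmitted {
		return false
	}
	l.Status = status
	l.finalEmitted = true
	return true
}

func main() {
	run := &lifecycle{Status: "requested"}
	fmt.Println(run.finalize("completed")) // first reply: emit
	fmt.Println(run.finalize("completed")) // duplicate reply: suppressed
	fmt.Println(run.finalize("timed_out")) // late timeout sweep: suppressed
	fmt.Println(run.Status)
}
```

Routing every terminal path (reply, routing error, timeout) through one function like this covers all four duplicate cases listed above with a single check.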

Merging policy

When updating a tracked traceroute state:

  • never replace richer route data with emptier later data
  • preserve partial path data if that is all MQTT exposed
  • prefer explicit route arrays over purely inferred paths
  • preserve low-level packet facts that materially affect interpretation
  • do not invent return path if reply-side evidence is absent
  • do not rewrite a partial result into a fake success
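
The first three merge rules can be sketched as a pure function over two candidate paths. The `path` type and `mergePath` name are hypothetical, and "richer" is interpreted here as "more hops", which is an assumption. The sketch also treats an empty hop list as "no evidence"; a real implementation may need to distinguish that from a direct route with no intermediate hops.

```go
package main

import "fmt"

// path is one candidate route with a marker for how it was obtained.
type path struct {
	Hops     []uint32
	Inferred bool // true when reconstructed rather than read from an explicit route array
}

// mergePath returns the path to keep when a later packet offers another one.
func mergePath(current, incoming path) path {
	switch {
	case len(incoming.Hops) == 0:
		return current // never replace existing data with emptier data
	case len(current.Hops) == 0:
		return incoming // any evidence beats none
	case current.Inferred && !incoming.Inferred:
		return incoming // explicit route array beats an inferred path
	case !current.Inferred && incoming.Inferred:
		return current
	case len(incoming.Hops) > len(current.Hops):
		return incoming // same provenance: keep the longer one
	}
	return current
}

func main() {
	explicit := path{Hops: []uint32{0x11, 0x22}}
	inferred := path{Hops: []uint32{0x11, 0x22, 0x33}, Inferred: true}
	kept := mergePath(explicit, inferred)
	fmt.Println(len(kept.Hops), kept.Inferred)
	kept = mergePath(explicit, path{})
	fmt.Println(len(kept.Hops))
}
```

Applying the same function independently to the forward and return paths keeps an absent return path absent, which matches the rule against inventing reply-side evidence.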

Suggested output strategy

Two designs are possible.

Option A: lifecycle rows plus packet rows

  • keep existing packet-level log rows
  • additionally emit correlated traceroute lifecycle rows

Pros:

  • preserves raw packet visibility
  • safer incremental rollout
  • easier to debug mismatches between packet evidence and correlation logic

Cons:

  • more log volume

Option B: lifecycle rows only for traceroute

  • suppress packet-level traceroute and routing rows once correlation exists

Pros:

  • cleaner operator view

Cons:

  • loses useful low-level packet visibility
  • harder to debug MQTT partial-observation behavior

Recommended first step: implement Option A.

Suggested lifecycle log details

Correlated traceroute lifecycle rows should include at least:

  • status
  • request_id
  • from
  • to
  • channel
  • forward_path
  • return_path
  • forward_snr
  • return_snr
  • error_reason
  • started_at
  • updated_at
  • completed_at or timeout timestamp when relevant
  • inferred_* fields where applicable
  • source_packets containing request, reply, and routing packet ids when known

Timeout policy

The implementation should follow a clear timeout policy:

  • timeout starts when request is first observed by ingest
  • timeout duration should start as a constant defined in the ingest package and may become configurable later
  • timeout emits exactly one terminal lifecycle record with status = timed_out
  • timed-out entries should remain in memory only as long as needed for final-state dedup, then be evicted

Concurrency and ownership

If ingest processing can run concurrently, the tracker must be synchronized.

A simple design is sufficient:

  • mutex-protected map
  • bounded in-memory state only
  • no database persistence required for the first implementation

Ownership should remain local to ingest.

Expected tests

Add regression coverage for at least:

  • request followed by reply -> one correlated completion
  • request followed by routing error NO_ROUTE -> one correlated failure
  • request followed only by routing NONE -> not failed
  • request with no follow-up -> timed out
  • duplicate reply packet -> no duplicate completion
  • duplicate routing error packet -> no duplicate failure
  • reply arrives without previously observed request -> handled by explicit policy
  • bounded eviction does not panic or leak state
  • partial MQTT evidence remains visible and is not rewritten into fabricated success

Acceptance criteria

Implementation is complete when all of the following are true:

  • traceroute requests are tracked in ingest by request_id
  • traceroute reply packets update the matching tracked request
  • routing error packets with non-NONE error reason fail the matching tracked request
  • timeout is synthesized after a bounded window
  • memory usage is bounded by TTL and max-entry policy
  • final lifecycle records are emitted exactly once per traceroute run
  • duplicate MQTT retransmissions do not create duplicate final lifecycle records
  • partial MQTT evidence remains visible and does not become fabricated success
  • parser-layer code remains packet-local and stateless

Implementation notes

  • Start with a small helper owned by ingest rather than introducing a broad new abstraction.
  • Keep packet semantic decoding in internal/meshtastic unchanged except where new fields are needed.
  • Prefer lifecycle correlation as an additive behavior before considering any suppression of packet-level traceroute logs.
  • If a reply arrives without a visible request, choose and document one policy explicitly:
    • log standalone partial/completed result without correlation, or
    • ignore unmatched replies, or
    • create synthetic tracker entry marked as request-missing

The safest first implementation is to keep unmatched reply evidence visible rather than dropping it silently.

skobkin self-assigned this 2026-02-28 21:37:47 +03:00
Blocks Depends on
#15 Traceroute logs details
skobkin/meshmap-lite
Reference
skobkin/meshmap-lite#17