skobkin/meshmap-lite


Traceroute data correlation #17


Open

opened 2026-02-28 21:37:47 +03:00 by skobkin · 0 comments

skobkin (Owner) commented on 2026-02-28 21:37:47 +03:00

Add bounded ingest-side traceroute correlation with timeout synthesis

Summary

Implement bounded ingest-side traceroute correlation so the app can combine:

a traceroute request packet
a later traceroute reply packet
a later routing error packet

into one logical traceroute run, with explicit lifecycle status such as requested, partial, completed, failed, or timed_out.

Problem

The current codebase now decodes individual TRACEROUTE_APP and ROUTING_APP packets with much richer semantics, but it still treats them as packet-local events.

That means:

operators must mentally correlate request, reply, and failure packets
a request with no visible reply on MQTT never becomes an explicit timeout result
reply and routing error packets are not attached to a single traceroute lifecycle
duplicate MQTT retransmissions are deduplicated at packet level, but there is no higher-level traceroute lifecycle dedup

This is especially important because MQTT often exposes only part of the real traceroute exchange. The app should preserve partial evidence while still producing a useful traceroute-level outcome.

Goal

Add a short-lived ingest-side correlator that tracks traceroute requests by request_id, merges later reply or routing-failure packets into the same logical run, and synthesizes a timeout when no terminal packet arrives within a bounded window.

The intent is to improve operator visibility without fabricating data that MQTT never exposed.

Non-goals

Do not move this logic into internal/meshtastic; parsing should stay packet-local and stateless.
Do not introduce unbounded in-memory tracking.
Do not fabricate route or return-path data that is not supported by packet evidence.
Do not replace current semantic packet decoding with a transport-specific state machine.

Why ingest is the right layer

Correlation is transport-observation policy, not packet parsing.

The parser should continue to answer questions like:

is this packet a traceroute request or reply?
what is the request id?
what forward or return path can be reconstructed from this packet alone?

The ingest layer should answer questions like:

did this reply correspond to a previously seen request?
did this traceroute fail due to routing error?
did this request time out because no terminal packet arrived in time?
has a final lifecycle event already been emitted?

That makes internal/ingest the appropriate ownership boundary.

Likely implementation area

internal/ingest/service.go
a small helper such as internal/ingest/traceroute_tracker.go

Proposed tracked state

Track active traceroute requests in an in-memory bounded map keyed by request_id.

Each tracked entry should include at least:

request_id
request packet id
source node id
destination node id
channel
first observed time
last updated time
current status
best known forward path
best known return path
best known forward SNR
best known return SNR
failure reason, if any
flags indicating whether request, final state, or timeout has already been emitted
source packet ids used to build the lifecycle state

Suggested statuses

requested
partial
completed
failed
timed_out
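The statuses could be modelled as a string type so they log cleanly. This sketch (hypothetical names) treats every status except `requested` as terminal, i.e. a reply closes the run whether its result is complete or partial. The issue does not settle that point, so treat it as an assumption.

```go
package main

// TraceStatus is a hypothetical lifecycle status type.
type TraceStatus string

const (
	StatusRequested TraceStatus = "requested"
	StatusPartial   TraceStatus = "partial"
	StatusCompleted TraceStatus = "completed"
	StatusFailed    TraceStatus = "failed"
	StatusTimedOut  TraceStatus = "timed_out"
)

// IsTerminal reports whether a run in status s is finished.
// Treating partial as terminal is an assumption of this sketch.
func (s TraceStatus) IsTerminal() bool {
	switch s {
	case StatusPartial, StatusCompleted, StatusFailed, StatusTimedOut:
		return true
	}
	return false
}

// CanTransition allows a change only from requested to a terminal status,
// so a finished run can never be rewritten (for example partial -> completed).
func (s TraceStatus) CanTransition(next TraceStatus) bool {
	return s == StatusRequested && next.IsTerminal()
}
```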

Correlation rules

1. Request packet

Packet shape:

TRACEROUTE_APP
want_response = true

Behavior:

create a tracker entry keyed by request_id
if request id is absent, use the request packet id as the tracking key
mark lifecycle status as requested
store request metadata such as source, destination, channel, observed time, and packet id
do not treat the request packet itself as a successful traceroute result
optionally emit a lifecycle log row with status requested

2. Reply packet

Packet shape:

TRACEROUTE_APP
want_response = false

Correlation:

correlate primarily by request_id
use reply_id only as fallback if needed

Behavior:

look up the matching tracker entry
merge in forward path, return path, and SNR data
prefer explicit route arrays over inferred paths when both exist
preserve partial result data if that is all MQTT exposed
mark status completed if the reply provides a usable result
if result is still incomplete but useful, mark status partial
do not emit multiple completion records for duplicate reply packets
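A sketch of reply correlation and merging. It assumes the decoder reports whether a reply is usable through a hypothetical `Complete` flag, since the issue leaves "usable result" open (an empty route array may be a legitimate zero-hop result, so path length alone is not a reliable signal). Names are illustrative.

```go
package main

// reply is a hypothetical decoded TRACEROUTE_APP packet with want_response = false.
type reply struct {
	PacketID    uint32
	RequestID   uint32 // primary correlation field; 0 when absent
	ReplyID     uint32 // fallback correlation field; 0 when absent
	ForwardPath []uint32
	ReturnPath  []uint32
	Complete    bool // assumed to be decided by the decoder
}

// run is the part of the tracked state this sketch touches.
type run struct {
	ForwardPath  []uint32
	ReturnPath   []uint32
	Status       string
	FinalEmitted bool
}

// lookup correlates by request_id first and uses reply_id only as a fallback.
func lookup(active map[uint32]*run, r reply) (*run, bool) {
	if r.RequestID != 0 {
		if e, ok := active[r.RequestID]; ok {
			return e, true
		}
	}
	if r.ReplyID != 0 {
		e, ok := active[r.ReplyID]
		return e, ok
	}
	return nil, false
}

// applyReply merges path data and reports whether a final record should be
// emitted now. Duplicate replies may still enrich the state but never emit again.
func applyReply(e *run, r reply) bool {
	// Never replace richer route data with emptier later data.
	if len(r.ForwardPath) > len(e.ForwardPath) {
		e.ForwardPath = r.ForwardPath
	}
	if len(r.ReturnPath) > len(e.ReturnPath) {
		e.ReturnPath = r.ReturnPath
	}
	if e.FinalEmitted {
		return false
	}
	if r.Complete {
		e.Status = "completed"
	} else {
		e.Status = "partial" // incomplete but useful; never reported as success
	}
	e.FinalEmitted = true
	return true
}
```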

3. Routing error packet

Packet shape:

ROUTING_APP
request_id present
error_reason != NONE

Behavior:

look up the matching tracker entry
mark status failed
attach error_reason
emit one terminal failure record
do not treat error_reason = NONE as failure
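A sketch of the routing-error rule, assuming `error_reason` is carried as the Meshtastic `Routing.Error` enum name (for example `NONE` or `NO_ROUTE`). The type and function names are hypothetical.

```go
package main

// routingError is a hypothetical decoded ROUTING_APP packet.
type routingError struct {
	PacketID    uint32
	RequestID   uint32 // 0 when absent
	ErrorReason string // Routing.Error name, e.g. "NONE" or "NO_ROUTE"
}

type run struct {
	Status       string
	ErrorReason  string
	FinalEmitted bool
}

// applyRoutingError fails the matching run for any reason other than NONE and
// reports whether a terminal failure record should be emitted.
func applyRoutingError(active map[uint32]*run, p routingError) bool {
	if p.RequestID == 0 || p.ErrorReason == "" || p.ErrorReason == "NONE" {
		return false // NONE means "no error" and must not fail the run
	}
	e, ok := active[p.RequestID]
	if !ok || e.FinalEmitted {
		return false // unmatched, or a duplicate/late packet after the final record
	}
	e.Status = "failed"
	e.ErrorReason = p.ErrorReason
	e.FinalEmitted = true
	return true
}
```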

4. Timeout

Behavior:

if neither a reply nor a routing failure arrives within the configured timeout window, mark status timed_out
synthesize timeout from ingest observation time, not only packet timestamp
emit exactly one timeout record
retain the entry only long enough to avoid duplicate terminal emission, then evict it

Bounded-memory requirements

This feature must remain bounded and safe for long-running processes.

Suggested controls:

per-entry TTL, for example 30 to 120 seconds
maximum active tracked entries, for example 1,000 or 10,000
cleanup on normal ingest flow or via a lightweight periodic sweeper
an eviction policy that removes expired entries first, oldest first

If the tracker is full:

evict expired entries first
if still full, evict the oldest non-final entry
log a warning with useful fields such as active entry count, evictions, and timeout window

Dedup requirements

There are two separate dedup concerns.

1. Packet-level dedup

The existing packet dedup logic already suppresses exact duplicate MQTT packet IDs.

That should remain in place.

2. Lifecycle-level dedup

The new tracker must additionally ensure it does not emit duplicate final lifecycle records when:

the same reply is seen more than once
the same routing error is seen more than once
timeout cleanup runs more than once
a late duplicate terminal packet arrives after a final state was already emitted

Each tracked entry should remember whether a final lifecycle event has already been emitted.

Merging policy

When updating a tracked traceroute state:

never replace richer route data with emptier later data
preserve partial path data if that is all MQTT exposed
prefer explicit route arrays over purely inferred paths
preserve low-level packet facts that materially affect interpretation
do not invent return path if reply-side evidence is absent
do not rewrite a partial result into a fake success
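A sketch of the path-merging rule under one assumed precedence: any non-empty path beats an empty one, explicit route arrays beat inferred paths, and a longer path of the same kind beats a shorter one. The issue does not fix this tie-breaking order, and `pathEvidence` is a hypothetical type.

```go
package main

// pathEvidence is a hypothetical path value that remembers whether it came
// from an explicit route array in the packet or was inferred.
type pathEvidence struct {
	Hops     []uint32
	Inferred bool
}

// better reports whether candidate should replace current.
func better(candidate, current pathEvidence) bool {
	if len(candidate.Hops) == 0 {
		return false // never replace data with emptier data, never invent a path
	}
	if len(current.Hops) == 0 {
		return true
	}
	if candidate.Inferred != current.Inferred {
		return !candidate.Inferred // explicit beats inferred
	}
	return len(candidate.Hops) > len(current.Hops)
}

// mergePath keeps the current value unless the candidate is strictly better.
func mergePath(current, candidate pathEvidence) pathEvidence {
	if better(candidate, current) {
		return candidate
	}
	return current
}
```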

Suggested output strategy

Two designs are possible.

Option A: lifecycle rows plus packet rows

keep existing packet-level log rows
additionally emit correlated traceroute lifecycle rows

Pros:

preserves raw packet visibility
safer incremental rollout
easier to debug mismatches between packet evidence and correlation logic

Cons:

more log volume

Option B: lifecycle rows only for traceroute

suppress packet-level traceroute and routing rows once correlation exists

Pros:

cleaner operator view

Cons:

loses useful low-level packet visibility
harder to debug MQTT partial-observation behavior

Recommended first step: implement Option A.

Suggested lifecycle log details

Correlated traceroute lifecycle rows should include at least:

status
request_id
from
to
channel
forward_path
return_path
forward_snr
return_snr
error_reason
started_at
updated_at
completed_at or timeout timestamp when relevant
inferred_* fields where applicable
source_packets containing request, reply, and routing packet ids when known

Timeout policy

The timeout policy should be explicit:

timeout starts when request is first observed by ingest
timeout duration should start as a constant defined in the ingest package and can be made configurable later
timeout emits exactly one terminal lifecycle record with status = timed_out
timed-out entries should remain in memory only as long as needed for final-state dedup, then be evicted

Concurrency and ownership

If ingest processing can run concurrently, the tracker must be synchronized.

A simple design is sufficient:

mutex-protected map
bounded in-memory state only
no database persistence required for the first implementation

Ownership should remain local to ingest.

Expected tests

Add regression coverage for at least:

request followed by reply -> one correlated completion
request followed by routing error NO_ROUTE -> one correlated failure
request followed only by routing NONE -> not failed
request with no follow-up -> timed out
duplicate reply packet -> no duplicate completion
duplicate routing error packet -> no duplicate failure
reply arrives without previously observed request -> handled by explicit policy
bounded eviction does not panic or leak state
partial MQTT evidence remains visible and is not rewritten into fabricated success

Acceptance criteria

Implementation is complete when all of the following are true:

traceroute requests are tracked in ingest by request_id
traceroute reply packets update the matching tracked request
routing error packets with non-NONE error reason fail the matching tracked request
timeout is synthesized after a bounded window
memory usage is bounded by TTL and max-entry policy
final lifecycle records are emitted exactly once per traceroute run
duplicate MQTT retransmissions do not create duplicate final lifecycle records
partial MQTT evidence remains visible and does not become fabricated success
parser-layer code remains packet-local and stateless

Implementation notes

Start with a small helper owned by ingest rather than introducing a broad new abstraction.
Keep packet semantic decoding in internal/meshtastic unchanged except where new fields are needed.
Prefer lifecycle correlation as an additive behavior before considering any suppression of packet-level traceroute logs.
If a reply arrives without a visible request, choose and document one policy explicitly:
- log standalone partial/completed result without correlation, or
- ignore unmatched replies, or
- create synthetic tracker entry marked as request-missing

The safest first implementation is to keep unmatched reply evidence visible rather than dropping it silently.
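The three options can be encoded as an explicit policy value so the choice is documented in code. This sketch (hypothetical names) does not pick one. Note that under Option A the packet-level row still exists even when unmatched replies are ignored at the lifecycle level.

```go
package main

// unmatchedPolicy names the three options listed above.
type unmatchedPolicy int

const (
	logStandalone   unmatchedPolicy = iota // emit the reply's own result without correlation
	ignoreUnmatched                        // no lifecycle row; the packet-level row remains
	syntheticEntry                         // create an entry flagged as request-missing
)

type run struct {
	Status         string
	RequestMissing bool
}

// handleUnmatched returns the run to record (nil when nothing is recorded)
// and whether the evidence stays visible at the lifecycle level.
func handleUnmatched(p unmatchedPolicy, complete bool) (*run, bool) {
	status := "partial"
	if complete {
		status = "completed"
	}
	switch p {
	case logStandalone:
		return &run{Status: status}, true
	case syntheticEntry:
		return &run{Status: status, RequestMissing: true}, true
	default:
		return nil, false
	}
}
```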


skobkin added the enhancement label (2026-02-28 21:37:47 +03:00)

skobkin self-assigned this (2026-02-28 21:37:47 +03:00)

skobkin added a new dependency: #15 Traceroute logs details (2026-02-28 21:37:53 +03:00)

skobkin added a new dependency: #16 Traceroute log details pretty rendering (2026-02-28 23:52:40 +03:00)