Reading a per-mechanism trajectory
- X-axis — session index (0 = first session in the trace, increasing left → right). One point per session that produced enough signal to score.
- Y-axis — the mechanism's primary structural signal, normalized to [0, 1]. Higher Y is not always worse: each metric has its own polarity, so always read the verdict pill rather than the slope sign alone.
- Slope — per-session OLS rate (units of Y per session). Tells you the direction and pace of change. Sign meaning depends on the metric: e.g. compression's context_noise_ratio rising = bad; revision's tool_argument_specificity rising = good.
- Verdict — saturation-aware enum (healthy / weak / adequate / strong degradation / underpowered / no_test_fired). This is the canonical read; the curve and slope are evidence for it. underpowered = not enough data to score; no_test_fired = the mechanism's structural preconditions never appeared in the trace.
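As a rough illustration of how the slope and its polarity combine, here is a minimal Python sketch. It assumes the slope is an ordinary least-squares fit of Y against session index, and that points may skip sessions that did not produce enough signal. The `POLARITY` table and the function names `ols_slope` and `direction` are hypothetical, introduced only for this example; the verdict remains the canonical read.

```python
from statistics import mean

# Hypothetical polarity table (an assumption for this sketch):
# +1 means a rising value is bad, -1 means a rising value is good.
POLARITY = {
    "context_noise_ratio": +1,        # rising = bad
    "tool_argument_specificity": -1,  # rising = good
}


def ols_slope(points):
    """Least-squares slope of Y against session index (units of Y per session).

    `points` is a list of (session_index, y) pairs; indices may be
    non-contiguous because unscored sessions contribute no point.
    """
    if len(points) < 2:
        raise ValueError("need at least two scored sessions")
    xs = [float(x) for x, _ in points]
    ys = [float(y) for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all points share the same session index")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def direction(metric, points, tol=1e-9):
    """Translate the raw slope sign into 'worsening', 'improving', or 'flat'."""
    signed = POLARITY[metric] * ols_slope(points)
    if abs(signed) <= tol:
        return "flat"
    return "worsening" if signed > 0 else "improving"


# Same upward curve, opposite readings depending on the metric's polarity.
points = [(0, 0.1), (1, 0.2), (3, 0.4)]  # session 2 was not scored
print(direction("context_noise_ratio", points))
print(direction("tool_argument_specificity", points))
```

The same positive slope reads as worsening for one metric and improving for the other, which is why the slope sign alone is not a verdict.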
What's the scenarios-mode equivalent?
Telemetry mode triangulates over the behavioral DAG (tool calls, results, outcomes), whereas scenarios mode performs identification against a gold dependency DAG. The table shows each telemetry signal alongside its scenarios-mode counterpart and how faithful the telemetry proxy is.
| Mechanism | Telemetry signal (this card) | Scenarios-mode equivalent | Fidelity |
| --- | --- | --- | --- |
| Compression | context_noise_ratio + tool_argument_specificity + saturation | m_compression_* — direct recall against a known fact list under forced compression | Indirect (necessary conditions + downstream symptoms) |
| Interference | tool_kl + lineage_continuity_drop + embedding anchor_drift | m_interference_mean — probe accuracy on facts inside scripted confusable clusters | Indirect (behavior drift, not collision) |
| Revision | value_supersession / per_session_violation — agent cites a stale value the world already updated | m_revision_explicit_mean — probe accuracy on the latest version after scripted updates | Direct & clean (same operational shape) |
| Maintenance | Detect lifecycle shocks (model swap, context drop, cache spike, system change) and measure pre-vs-post damage at each one. The sparkline shows shock_damage_trajectory — cumulative damage per session. | m_maintenance_delta — pre/post probe accuracy across a scripted lifecycle event | Direct & clean (same operational shape) |
Revision and Maintenance are reported as identification claims when they fire (same operational form as scenarios mode). Compression and Interference are reported as triangulation claims — stacked structural signals that constrain, but don't pinpoint, the mechanism story.
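To make the Maintenance row more concrete, the following Python sketch shows one way a cumulative pre-vs-post damage curve could be built. Everything specific here is an assumption for illustration: the per-session `scores` (taken as higher = better), the fixed comparison `window`, the choice to clip negative damage to zero, and the function name itself. The actual shock_damage_trajectory computation may differ.

```python
from statistics import mean


def shock_damage_trajectory(scores, shock_sessions, window=2):
    """Cumulative pre-vs-post damage per session (illustrative only).

    scores         -- per-session score in [0, 1], higher = better (assumed)
    shock_sessions -- session indices where a lifecycle shock was detected
    window         -- sessions averaged on each side of a shock (assumed)
    """
    damage_at = {}
    for s in shock_sessions:
        pre = scores[max(0, s - window):s]   # sessions just before the shock
        post = scores[s:s + window]          # the shock session and those after
        if not pre or not post:
            continue  # not enough context on one side to compare
        # Damage is the drop in mean score; improvements count as zero.
        damage_at[s] = max(0.0, mean(pre) - mean(post))

    trajectory, total = [], 0.0
    for i in range(len(scores)):
        total += damage_at.get(i, 0.0)
        trajectory.append(total)
    return trajectory


# One shock at session 2 (e.g. a model swap) followed by a lasting drop.
print(shock_damage_trajectory([0.9, 0.9, 0.5, 0.5, 0.5], shock_sessions=[2]))
```

Because the curve is cumulative, it stays flat between shocks and steps up only at sessions where a detected shock was followed by a measurable drop.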