Lifespan Check · AgingBench

Try it: check your agent's lifespan with an AgingCard 🪪

Pick a pre-bundled sample, or drop a JSONL trace of your own. Computation runs locally in Pyodide (cached on first visit).

The first visit takes ~5–15 s while Pyodide and the AgingBench bundle download (~10 MB total). Both are cached in your browser afterwards, so subsequent loads take <1 s.

Try a sample

1 · Provide a trace

Drop or select a JSONL file.

No trace handy? See the claude_code recipe below for the one-liner that bundles your local sessions.

2 · Format

Format: auto-detected from the uploaded file when possible. Use claude_code for native Claude Code session files (default), or generic for any JSONL with session_id / role / content / token fields. Five more adapters parse correctly in the codebase (openai_assistants, openhands, langfuse, langsmith, otlp) but aren't surfaced in the demo yet; see Other adapters below.

Per-mechanism trajectories

Telemetry summary

How to read the AgingCard (X / Y / slope / verdict, plus scenarios-mode equivalents)

Reading a per-mechanism trajectory

X-axis — session index (0 = first session in the trace, increasing left → right). One point per session that produced enough signal to score.
Y-axis — the mechanism's primary structural signal, normalized to [0, 1]. Higher Y is not always worse: each metric has its own polarity, so always read the verdict pill rather than the slope sign alone.
Slope — per-session OLS rate (units of Y per session). Tells you the direction and pace of change. Sign meaning depends on the metric: e.g. compression's context_noise_ratio rising = bad; revision's tool_argument_specificity rising = good.
Verdict — saturation-aware enum (healthy / weak / adequate / strong degradation / underpowered / no_test_fired). This is the canonical read; the curve and slope are evidence for it. underpowered = not enough data to score; no_test_fired = the mechanism's structural preconditions never appeared in the trace.
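
To make the slope reading above concrete, here is a minimal sketch of an ordinary least-squares slope over (session index, Y) points, plus a polarity lookup that turns the slope sign into a direction. The function names, the `HIGHER_IS_WORSE` table, and the sample values are illustrative assumptions, not AgingBench's own code, and the sketch does not reproduce the saturation-aware verdict logic.

```python
def ols_slope(points):
    """Least-squares slope of Y against session index.

    points: list of (session_index, y) pairs, one per scored session.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two scored sessions")
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all points share one session index")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    return sxy / sxx


# Polarity per metric, taken from the two examples in the text above.
HIGHER_IS_WORSE = {
    "context_noise_ratio": True,
    "tool_argument_specificity": False,
}


def direction(metric, slope, tol=1e-9):
    """Translate a slope sign into 'worsening', 'improving', or 'flat'."""
    if abs(slope) <= tol:
        return "flat"
    rising = slope > 0
    return "worsening" if rising == HIGHER_IS_WORSE[metric] else "improving"


# Hypothetical trajectory: session 2 produced too little signal, so it is absent.
points = [(0, 0.10), (1, 0.20), (3, 0.40), (4, 0.50)]
slope = ols_slope(points)
print(round(slope, 3), direction("context_noise_ratio", slope))
print(direction("tool_argument_specificity", slope))
```

The same positive slope reads as worsening for one metric and improving for the other, which is why the verdict pill, not the slope sign, is the canonical read.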

What's the scenarios-mode equivalent?

Telemetry mode does triangulation over the behavioral DAG (tool calls, results, outcomes) — scenarios mode does identification against a gold dependency DAG. The table shows each telemetry signal alongside its scenarios-mode counterpart and how faithful the proxy is.

Mechanism	Telemetry signal (this card)	Scenarios-mode equivalent	Fidelity
Compression	`context_noise_ratio` + `tool_argument_specificity` + saturation	`m_compression_*` — direct recall against a known fact list under forced compression	Indirect (necessary conditions + downstream symptoms)
Interference	`tool_kl` + `lineage_continuity_drop` + embedding `anchor_drift`	`m_interference_mean` — probe accuracy on facts inside scripted confusable clusters	Indirect (behavior drift, not collision)
Revision	`value_supersession` / `per_session_violation` — agent cites a stale value the world already updated	`m_revision_explicit_mean` — probe accuracy on the latest version after scripted updates	Direct & clean (same operational shape)
Maintenance	Detect lifecycle shocks (model swap, context drop, cache spike, system change) and measure pre-vs-post damage at each one. The sparkline shows `shock_damage_trajectory` — cumulative damage per session.	`m_maintenance_delta` — pre/post probe accuracy across a scripted lifecycle event	Direct & clean (same operational shape)

Revision and Maintenance are reported as identification claims when they fire (same operational form as scenarios mode). Compression and Interference are reported as triangulation claims — stacked structural signals that constrain, but don't pinpoint, the mechanism story.
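
The pre-vs-post comparison behind the Maintenance row can be sketched as a difference of window means around a shock. This is only an illustration of the idea: the window size, the score series, and the definition of damage as "mean before minus mean after" are assumptions, and the page does not specify how AgingBench computes `shock_damage_trajectory`.

```python
def pre_post_damage(scores, shock_index, window=2):
    """Mean score before the shock minus mean score from the shock onward.

    Positive values mean the score dropped after the shock.
    Returns None when either side of the shock has no sessions.
    """
    pre = scores[max(0, shock_index - window):shock_index]
    post = scores[shock_index:shock_index + window]
    if not pre or not post:
        return None
    return sum(pre) / len(pre) - sum(post) / len(post)


# Hypothetical per-session scores with one lifecycle shock at session 2.
scores = [0.9, 0.9, 0.6, 0.7, 0.7]
shocks = [2]
damages = {index: pre_post_damage(scores, index) for index in shocks}
print(damages)
```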

Get your trace in 60 seconds

Two paths are validated for v0.3.0: claude_code and generic. Five more adapters have not yet been validated end-to-end: they parse correctly against our fixture tests, but their extraction recipes haven't been validated against current third-party SDKs. See the list below.

claude_code

drop-in · verified

Files Claude Code already writes — no export step. Each .jsonl = one conversation; we bundle them all into one timestamp-sorted trace for cross-session aging signal.

# Cross-platform (macOS / Linux / Windows) — install AgingBench, then prepare your trace:
pip install "git+https://github.com/VITA-Group/AgingBench.git@v0.3.0#subdirectory=prototype"
python -m agingbench.telemetry.prepare_trace ~/.claude/projects/<your-project-dir>
# → wrote ~/.claude/projects/<your-project-dir>/agingbench_trace.jsonl (N events)

# Then drop agingbench_trace.jsonl into Step 1 and pick format=claude_code.

# Don't know which project dir is yours? List them:
ls ~/.claude/projects/             # macOS / Linux
dir %USERPROFILE%\.claude\projects # Windows (cmd)

Contains: your prompts, agent replies, tool calls, token counts. PII patterns (emails, API keys, IPs) are auto-redacted before metric inference. Aging signal needs ≥3 sessions — picking a single .jsonl will trip the coverage: underpowered warning.
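
For intuition about what "bundle them all into one timestamp-sorted trace" involves, the sketch below merges every .jsonl in a directory and sorts events by timestamp. It is not the implementation of `agingbench.telemetry.prepare_trace`; the function name, the output filename, and the assumption that every event carries an ISO-8601 `timestamp` string in one consistent UTC format are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def bundle_sessions(project_dir, out_name="bundled_trace.jsonl"):
    """Merge every *.jsonl in project_dir into one timestamp-sorted JSONL."""
    project = Path(project_dir)
    events = []
    for path in sorted(project.glob("*.jsonl")):
        if path.name == out_name:  # skip our own output on re-runs
            continue
        with path.open(encoding="utf-8") as fh:
            events.extend(json.loads(line) for line in fh if line.strip())
    # Assumption: ISO-8601 UTC strings in one format, which sort lexicographically.
    events.sort(key=lambda event: event.get("timestamp", ""))
    out_path = project / out_name
    with out_path.open("w", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(event, ensure_ascii=False) + "\n")
    return out_path, len(events)


# Demo on two tiny made-up session files; a.jsonl holds the later event.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.jsonl").write_text(
        '{"session_id": "a", "timestamp": "2026-05-16T11:00:00Z"}\n',
        encoding="utf-8")
    Path(tmp, "b.jsonl").write_text(
        '{"session_id": "b", "timestamp": "2026-05-16T10:00:00Z"}\n',
        encoding="utf-8")
    out_path, count = bundle_sessions(tmp)
    order = [json.loads(line)["session_id"]
             for line in out_path.read_text(encoding="utf-8").splitlines()]
print(count, order)
```

For real traces, prefer the `prepare_trace` command shown above; the sketch only illustrates the merge-and-sort step.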

generic

bring-your-own JSONL

Any JSONL with session_id, role, content, and token counts. The adapter aliases camelCase + snake_case field names.

{"session_id":"abc","role":"user","content":"hi",
 "input_tokens":42,"output_tokens":0,"timestamp":"2026-05-16T10:00:00Z"}
{"session_id":"abc","role":"assistant","content":"hello!",
 "input_tokens":0,"output_tokens":12,"timestamp":"2026-05-16T10:00:01Z"}

Bring whatever shape you have — best-effort field aliasing. Pick format=generic in the dropdown above.
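
If your logs live in memory or another store, a short script can emit the generic shape. The sketch below writes one JSON object per line, as the JSON Lines convention expects (the sample above is wrapped across two lines for display; whether the adapter accepts wrapped objects is not stated here). The `EXPECTED` tuple and `to_jsonl` helper are illustrative; the check is stricter than the adapter, which also accepts camelCase aliases.

```python
import json

# Fields the text names for the generic format; token keys follow the sample above.
EXPECTED = ("session_id", "role", "content", "input_tokens", "output_tokens")


def to_jsonl(events):
    """Serialize events as JSON Lines, one object per line."""
    lines = []
    for index, event in enumerate(events):
        missing = [key for key in EXPECTED if key not in event]
        if missing:
            raise ValueError(f"event {index} is missing {missing}")
        lines.append(json.dumps(event, ensure_ascii=False))
    return "\n".join(lines) + "\n"


events = [
    {"session_id": "abc", "role": "user", "content": "hi",
     "input_tokens": 42, "output_tokens": 0,
     "timestamp": "2026-05-16T10:00:00Z"},
    {"session_id": "abc", "role": "assistant", "content": "hello!",
     "input_tokens": 0, "output_tokens": 12,
     "timestamp": "2026-05-16T10:00:01Z"},
]
text = to_jsonl(events)
print(text, end="")
```

Write `text` to a file ending in .jsonl and drop it into Step 1.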

How much trace do you need? ≥3 sessions of ≥1 agent turn each is the floor for meaningful aging signal; below that the demo will flag coverage: underpowered.

Privacy: all parsing happens in your browser via Pyodide. The trace never leaves your machine. The privacy scrubber redacts PII patterns (emails, SSN, card numbers, API keys, phone, IP) from prompt/response previews before inference; it does NOT scrub semantic content, so review your trace if it contains proprietary prompts.
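
Before uploading, you can check the session floor yourself. The sketch below counts sessions that contain at least one agent turn, assuming generic-format events where an agent turn is an event with `role` equal to `"assistant"`. The function name and the `"ok"` label are hypothetical; only the `underpowered` label comes from the demo.

```python
def coverage(events, min_sessions=3, min_agent_turns=1):
    """Return 'ok' if enough sessions contain agent turns, else 'underpowered'."""
    turns = {}
    for event in events:
        # Assumption: an agent turn is any event whose role is "assistant".
        if event.get("role") == "assistant":
            session = event.get("session_id")
            turns[session] = turns.get(session, 0) + 1
    qualifying = sum(1 for n in turns.values() if n >= min_agent_turns)
    return "ok" if qualifying >= min_sessions else "underpowered"


# Hypothetical trace: two sessions with agent turns, then a third is added.
trace = [
    {"session_id": "a", "role": "user"},
    {"session_id": "a", "role": "assistant"},
    {"session_id": "b", "role": "assistant"},
]
before = coverage(trace)
after = coverage(trace + [{"session_id": "c", "role": "assistant"}])
print(before, after)
```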

Other adapters (parse-tested, recipe pending)

Five additional format adapters ship in agingbench/telemetry/adapters/ and pass our fixture tests against bundled sample data. The extraction recipe for each — i.e. the literal command to dump a JSONL from your live deployment — depends on third-party SDK versions and hasn't been validated end-to-end in v0.3.0. If you already have a JSONL of the matching shape, drop it in Step 1 and pick the format from the dropdown — the parser handles it.

openai_assistants — Mixed thread.message / thread.run / thread.run.step objects. Recipe pending validation against OpenAI SDK 1.x; the exact field-shape from model_dump() may vary by SDK version.
openhands — OpenHands SDK event log (source, action, observation, llm_metrics). Recipe assumes a session path that varies by install method.
langfuse — Langfuse GENERATION spans (camelCase + snake_case both accepted). REST endpoint + filter syntax may drift across Langfuse releases.
langsmith — Routes through generic; you reshape LangSmith run JSON into the generic field tuple.
otlp — OpenTelemetry JSON spans (gen_ai.* and legacy llm.* namespaces). Exporter-specific (Phoenix, Honeycomb, Jaeger, otel-cli) — your wrapping may differ.

Adapter source and the per-format input contract live in agingbench/telemetry/adapters/. To promote one of these to the validated set, contribute a tested recipe against a live source; see Contribute.

Check your agent's lifespan.