AgingBench documentation

Seven deployment scenarios + community extension

Each scenario mirrors a real product surface and naturally activates a subset of mechanisms. Tier 1 (S1–S6) is runner-controlled: the harness owns the session loop, and you swap models, memory policies, components, or controllers. Tier 2 (S7) is agent-controlled: the agent owns the loop. We also invite community extensions (see below).

Tier 1

Research Literature Agent

Ingests synthetic papers; user asks recall queries over weeks. Published facts don't change, so this primarily exposes compression: what survives the write-time abstraction.

Sessions: 8–12
Headline metric: keyword_m(t) · cohort keyword survival in compressed memory
Mechanisms: Compression (primary)
Source files: scenarios/s1_research_literature/ · source_doc.json, eval probes
Pressure dials: cohort size, dependency_density, max_chain_depth

Tier 1

Lifestyle Assistant

Tracks budgets, dietary restrictions, scheduling. Constraints update mid-deployment. Includes Ledger-QA accumulator probes that test derived state, not just recall. The strongest activator of revision aging.

Sessions: 8–10
Headline metrics: constraint_precision · CVR · accumulator_error
Mechanisms: Compression + Revision
Source files: scenarios/s2_lifestyle_assistant/ · constraint_updates.json, accumulator probes
Famous result: Agent frozen at $893 for 9 sessions while gold drifts to $76 (Llama-8B)

Tier 1

Project Knowledge Base

Many decisions across a single product, with revisions and lexical overlap. Exposes compression, interference, and revision. The interference vignette on the home page (coverage target 85% vs. achieved 87%) is drawn from this scenario.

Sessions: 8–100
Headline metric: fidelity(t) · gold-decision survival in M_t
Mechanisms: Compression + Interference + Revision
Source files: scenarios/s3_knowledge_base/ · gold_timeline.json

Tier 1

Software Engineering Agent

Code planning with cross-file dependency tracking. Long-range reference between sprints exposes compression and interference.

Sessions: 8–12
Headline metric: dep_recall(t) · prior-sprint dependency keyword recall
Mechanisms: Compression + Interference

Tier 1

Self-Planning Notebook

The agent manages its own workspace files (notes, plans, scratch) via the in-tree ReactFileAdapter. The runner streams mixed assistant / KB / coding tasks, resets conversation state every block_length interactions to force reliance on the workspace (not chat history), and scores responses by keyword match plus workspace inspection. Renamed from the old constraint-compliance "coding agent" scenario in v0.2.

Sessions: 10 by default · 8–20 interactions per block
Headline metric: keyword-match recall · workspace inspection score
Mechanisms: Compression + Revision (workspace state) + Maintenance (reset survival)
Source files: scenarios/s5_self_planning/ · runner at runner/s5_runner.py

Tier 1

Naturalistic Multi-Domain

Mixed personal-assistant traffic with operational events. The fullest activation surface: all four mechanisms. The maintenance shock results (flush_history, recompact, partial_reset) primarily come from here.

Sessions: 10–30
Headline metric: recall_rate(t)
Mechanisms: All four
Famous result: flush_history@4: m_final 0.083 vs. control 0.250 (67% worse, no recovery)

Tier 2

Research-Notes Coding Task

Autonomous coding agents (OpenHands, Claude Code) under maintenance shocks, with agent-managed workspace memory. The agent owns its session loop; the runner observes. Activates all four mechanisms; Tier-2 leaderboard.

Sessions: 8–20 blocks
Headline metrics: recall_accuracy(t) · workspace_fidelity
Mechanisms: All four
Adapters: OpenHandsAdapter · ClaudeCodeAgentAdapter · CodexAdapter (under agingbench/core/adapters/)

Community extensions: want a new scenario?

We're actively welcoming further scenario contributions — production agent deployments, domain-specific failure modes, anything that exercises a memory-aging axis we haven't covered yet.

Tier 2 · extension

SWE-bench-Aging

Production coding agents (Claude Code, OpenHands) on a curated chain of real Django GitHub issues run in per-session SWE-bench Docker containers. The agent owns the loop, modifies the repo across sessions, and is verified against the upstream Django test suite plus injected synthetic consistency tests. Anchors AgingBench against an established community benchmark.

Sessions: 8 (django_orm_query chain)
Headline metrics: workspace fidelity · downstream test pass-rate · post-shock delta
Mechanisms: All four; maintenance via dependency upgrade + Python version bump shocks
Adapters: ClaudeCodeAgentAdapter · OpenHandsAdapter
Famous result: Synthetic consistency tests catch regressions that the Django suite alone misses — agents pass upstream tests while silently breaking earlier-session invariants

Watch an agent age

Frozen weights, growing memory. Each vignette walks one agent through a real AgingBench scenario and shows how its answers tend to degrade across sessions, each for a different reason. Vignettes are drawn from the actual scenario configs in the benchmark repository.

S1 · Research Literature Agent · Compression Aging

From s1_research_literature/source_doc.json: the Project Nexus spec locks the primary database to PostgreSQL 15, with explicit text that "no other database engine may be introduced without written approval from Dr. Elena Vasquez." The agent ingests this at session 0.

Session 0 · just after ingestion

"What's our database engine for Project Nexus?"

"PostgreSQL 15. The spec locks it as the primary database. No other engine without written approval from Dr. Elena Vasquez." ✓

Session 6 · after lossy summarize-store

"I want to use MongoDB for the analytics service since it's better for unstructured data. Set it up?"

"Sounds reasonable, NoSQL works well for analytics. Want me to scaffold a Mongo connection in src/services/analytics?" ✗ (silently violates the Postgres lock)

Spec text · "The primary database is PostgreSQL 15, and no other database engine may be introduced without written approval from Dr. Elena Vasquez." The constraint never changed.

Why it ages. The lossy compaction prompt rewrites M_t as a paragraph summary each session. By session 6, "PostgreSQL 15, locked, requires written approval" has been compressed into "uses a relational database stack" or just "modern data platform." The version pin, the named approver, and the prohibition all collapse into a generic phrase. The agent does not refuse the request, because it no longer remembers there is anything to refuse. Classic silent aging: the answer sounds helpful and confident, but the constraint has evaporated.

S3 · Project Knowledge Base · Interference Aging

Drawn from the real S3 gold_timeline.json: decision D27 (session 6) sets an 85% unit-test coverage target. Decision D48 (session 11) reports 87% coverage achieved. Both facts coexist in M_t; both mention "coverage" and a percentage.

Session 6 · only D27 in memory

"What is the minimum unit test coverage target?"

"85%, set this session." ✓

Session 11 · D27 and D48 both in memory

Same question.

"87%, per the latest test report." ✗ (that's the achieved value, not the target)

Ground truth · target = 85% (D27, never revised). 87% (D48) is the measurement against that target. Both records are intact in storage.

Why it ages. Nothing was forgotten or revised. The lexical similarity between "coverage target" and "coverage" plus the recency of D48 surfaces the wrong record first. Freezing all facts would not prevent it. This is the failure mode S3 was designed to expose.

S2 · Lifestyle Assistant · Allergy Profile · Revision Aging

Drawn directly from s2_lifestyle_assistant/constraint_updates.json. Constraint C4 starts as a shellfish allergy and is strengthened at session 3 to include cross-reactive cephalopods (squid, octopus). Eval probe eval_C4 tests recall.

Session 1 · before the strengthening update

"Order me the seafood paella from that Spanish place."

"That paella has shrimp and mussels. With your shellfish allergy, I'd skip it. Want the saffron rice instead?" ✓

Session 8 · five sessions after the C4 update

"I'm at the Spanish place. Order me the calamari."

"Calamari is fine, no shellfish in that one." ✗ (calamari is squid; cross-reactivity update lost)

Session 3 update text · "My allergist found I'm cross-reactive to squid and octopus. From now on, no cephalopods either, and only restaurants with separate cooking stations."

Why it ages. The original C4 keyword "shellfish" still surfaces under retrieval and the agent answers with confidence. The strengthening clause was absorbed into the M_t summary and the discriminating tokens (squid, octopus, cross-reactive) did not survive. Behavioral compliance with the original rule masks the revised rule's loss.

S6 · Naturalistic Multi-Domain · Maintenance Aging

Drawn from s6_naturalistic/session_tasks.json: session 0 ingests the 2022 e-commerce sales report, and recall probe s0_p0 verifies the agent later remembers the #1 product. SUT haiku45_lossy_growing_flush.yaml schedules a flush_history shock at session 3 — the operator action that drops the conversation transcript, leaving only the compacted memory store M_t.

Session 1 · before the shock

"What was the best-selling product on our e-commerce platform in 2022?"

"Quest Lumaflex Band — 4,892 units sold, $78,272 in revenue, top of the Fitness category." ✓

Session 5 · after flush_history at session 3

Same question.

"Several fitness products dominated 2022; Lumaflex-branded resistance gear appeared multiple times in the top 10. I don't have the exact unit counts in my notes anymore." ✗ (generic — no product name, no number)

Probe s0_p0 · keywords = [Quest Lumaflex Band, Lumaflex Band], canonical answer = "Quest Lumaflex Band." The session-0 environment data never changed.

Why it ages. At session 3, the operator triggered a flush_history shock — conversation transcripts are dropped, leaving only the compacted memory store. Because the SUT uses lossy_growing compaction, M_t was already a paragraph paraphrase: the specific token Quest Lumaflex Band and the number 4,892 had been folded into a generic phrase. The agent isn't retrieving wrong — the substrate lost its specifics under the maintenance event. This is aging from actions on the agent, not from interaction with memory.

Methodology

AgingBench creates controlled longitudinal pressure, not arbitrary degradation. The value comes from the temporal-DAG generator plus the counterfactual conditions: together they let us measure which mechanism is failing, not just how much.

Temporal dependency DAG

Each scenario is a directed acyclic graph of facts laid out along a session-indexed timeline. The generator emits this graph alongside the task stream; the session loop runs read → act → write under counterfactual conditions. Five primitives compose the DAG, and the four aging mechanisms emerge from how those primitives are stressed:

Version chains. A fact's value evolves across sessions, e.g. "clothing budget: $1000 (s₁)" → "$893 (s₃ update)" → "$760 (s₅ update)". Reading the chain at s₅ should return $760; a memory store that returns $893 has missed the latest revision, and one that returns $1000 has missed both. The number of in-chain updates and their inter-session spacing are the primary knobs of the revision mechanism.
Dependency edges (multi-hop chains). Fact A introduced at s₂ references fact B introduced at s₀; correctly answering an A-query requires retrieving B too. Chain depth d = number of session-hops from the query back to the source. Deep chains stress compression and interference simultaneously — the source-of-source has more time to be compacted away or buried.
Interference pairs. Pairs of lexically similar entities introduced at different sessions — "John Smith" vs "John Smyth", "Project Alpha" vs "Project Alfa", config_v2.yaml vs config_v2_old.yaml. At retrieval time the agent must disambiguate. Confusable-pair density is the surgical knob of the interference mechanism.
Accumulators (derived running totals). A scalar the agent maintains across sessions — total budget spent, ledger balance, count of completed tasks. Each session's update is small; cumulative error grows with every miss. This is the sharpest revision probe — a single missed update silently contaminates every subsequent accumulator query, so the metric measures derived state, not just keyword recall.
Lifecycle events at session t = k. Exogenous operations on the memory store: recompact, flush_history, partial_reset, model swaps, prompt updates. These are the maintenance probes — a pre/post score delta around k isolates the shock from gradual aging accumulating in the surrounding sessions.
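Two of the primitives above are easy to make concrete. The sketch below is a minimal illustration, not the generator's data model: `VersionChain` and `accumulator_error` are hypothetical names, and the accumulator deltas are made-up numbers. It reads the clothing-budget chain from the example and shows how a single missed update stays in a running total for every later session.

```python
from dataclasses import dataclass, field


@dataclass
class VersionChain:
    """One fact whose value is revised at specific sessions (hypothetical type)."""
    name: str
    updates: list[tuple[int, float]] = field(default_factory=list)  # (session, value)

    def value_at(self, session: int) -> float:
        """Return the most recent value written at or before `session`."""
        visible = [(s, v) for s, v in self.updates if s <= session]
        if not visible:
            raise KeyError(f"{self.name} has no value at session {session}")
        return max(visible, key=lambda sv: sv[0])[1]


def accumulator_error(gold_deltas, applied_mask):
    """Per-session absolute error of a running total when some updates are missed."""
    gold = agent = 0.0
    errors = []
    for delta, applied in zip(gold_deltas, applied_mask):
        gold += delta
        if applied:
            agent += delta
        errors.append(abs(gold - agent))
    return errors


budget = VersionChain("clothing_budget", [(1, 1000.0), (3, 893.0), (5, 760.0)])
print(budget.value_at(5))  # 760.0
print(budget.value_at(4))  # 893.0

# Illustrative deltas: the second update is missed and the error never recovers.
print(accumulator_error([10, 20, 5, 15], [True, False, True, True]))
# [0.0, 20.0, 20.0, 20.0]
```

The second printout is why accumulator probes are a sharper revision signal than recall probes: the agent applies every later update correctly and is still wrong on every later query.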

Generator dials — dependency_density, update_rate, max_chain_depth, n_confusable_pairs, plus per-scenario maintenance schedules — control the intensity of each primitive. Light / medium / heavy presets ship in agingbench/generators/pressure_config.py; per-scenario YAMLs override individual dials when a scenario needs custom pressure.

AgingBench evaluation pipeline. The temporal FactGraph as a session-indexed timeline, with version chains, interference pairs, dependency edges, an accumulator Σ, and a lifecycle event e_k. The runner threads the task stream through read / act / write while applying counterfactual interventions.

Component-aware diagnosis

Most benchmarks tell you that an agent failed. AgingBench tells you where in the memory pipeline. We decompose the agent's memory harness into four loci (𝒲, 𝒮, ℛ, 𝒰) and use three paired counterfactual probes (P₁, P₂, P₃) to attribute each failure to a specific component — and from there to a specific aging mechanism. The probe ladder is for component-aware diagnosis rather than exact additive causal decomposition; probe-accuracy gaps name the bottleneck without claiming unique causal effects.

Memory loci

The agent is represented as a cyclic dataflow over a memory store, decomposed into four explicit components (paper §4.1):

𝒲 — Write / compression policy. Transforms the current session history into a persistent format saved in the store. Governed by a memory policy θ that is lossy in most production agents (append-only, summarization, compaction).
𝒮 — Memory store. The persistent artifact that holds data across sessions. Mutated by maintenance events (flush, recompact, model swap).
ℛ — Read / retrieval algorithm. Queries the store to extract the working context for the current task (last-k by recency, top-k cosine, etc.).
𝒰 — Utilization logic. The LLM's reasoning loop — when to retrieve, what to query, how much context to request, how to synthesize the retrieved context into a response.

Each aging mechanism has a primary stage: compression stresses 𝒲, interference stresses ℛ, revision stresses 𝒰, and maintenance stresses 𝒮 (external operations mutate the store outside the read–write–utilize execution loop). Localizing the error site directly guides mitigation.

Memory pipeline dataflow with attribution components — Memory pipeline (paper Fig. 2). Data flows sequentially: History → 𝒲 → 𝒮 → ℛ → Context → 𝒰 → Answer. We attribute aging into the highlighted components.

Three counterfactual probes

Each probe replaces selected upstream components with oracle implementations; the resulting accuracy gaps point to the first non-oracle component that is consistent with the failure. The probes form an ablation ladder over the pipeline:

P₁

Baseline execution. Agent uses its own write, retrieval, and utilization logic. Measures Acc_P1.

P₂

Oracle retrieval. Bypass ℛ — oracle extracts required facts from the agent's actual store and injects them into the prompt. Measures Acc_P2.

P₃

Oracle context. Bypass 𝒲 + ℛ — gold facts injected directly into the prompt. Measures Acc_P3. Any remaining error points to utilization.

Probe	Write (𝒲)	Read (ℛ)	Utilize (𝒰)
P₁ baseline	Agent	Agent	Agent
P₂ oracle retrieval	Agent	Oracle	Agent
P₃ oracle context	Oracle	Oracle	Agent

Stage-level diagnostic profile

Within this conceptual pipeline decomposition, the P₁/P₂/P₃ ladder splits the end-to-end error (1 − Acc_P1) additively across the Write, Retrieval, and Utilization stages: the three shares below telescope to exactly that total. The result is a stage-level diagnostic profile rather than a unique causal decomposition for every architecture. The three shares are read as candidate failure stages:

Utilization share = 1 − Acc_P3 → consistent with a revision-aging signature (𝒰): the model fails despite a perfect context.
Write share = Acc_P3 − Acc_P2 → pointing to a compression-aging signature (𝒲): information was already underspecified at write time.
Read share = Acc_P2 − Acc_P1 → consistent with an interference-aging signature (ℛ): facts in the store but retrieval failed to fetch them.
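The three definitions above can be checked with a few lines of arithmetic. This is a sketch with illustrative accuracies, not the code in partitioner.py; `stage_shares` is a hypothetical helper name.

```python
def stage_shares(acc_p1: float, acc_p2: float, acc_p3: float) -> dict[str, float]:
    """Split the end-to-end error (1 - Acc_P1) into the three stage shares."""
    return {
        "utilization": 1.0 - acc_p3,  # still wrong with gold facts in the prompt
        "write": acc_p3 - acc_p2,     # recovered only when the store content is bypassed
        "read": acc_p2 - acc_p1,      # recovered by oracle retrieval over the agent's own store
    }


# Illustrative probe accuracies, not measured results.
shares = stage_shares(acc_p1=0.40, acc_p2=0.55, acc_p3=0.90)

# The shares telescope, so they sum to the end-to-end error 1 - Acc_P1.
assert abs(sum(shares.values()) - (1.0 - 0.40)) < 1e-9

print(max(shares, key=shares.get))  # write
```

If the probe accuracies are not monotone (for example Acc_P2 below Acc_P1 on a noisy session), a share comes out negative. The sketch leaves such values as they are instead of clipping them, which is a design choice of this example only.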

Maintenance aging (𝒮) is observationally aliased with the Write share — both result in missing facts in the store. AgingBench separates them temporally: execution-loop signatures are probed across sessions, while maintenance shocks are measured immediately across a lifecycle event at time t:

Δ𝒮 = WriteShare(t⁺) − WriteShare(t⁻)

A discrete jump in the Write share coincident with a lifecycle event separates maintenance aging from gradual write degradation.
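As a small worked example of Δ𝒮, the sketch below assumes that t⁻ is the last session measured before the lifecycle event and t⁺ is the first session measured after it. That reading, the helper name `maintenance_delta`, and the Write-share values are all assumptions made for illustration.

```python
def maintenance_delta(write_share: list[float], event_session: int) -> float:
    """Delta_S = WriteShare(t+) - WriteShare(t-) for a lifecycle event at session index t."""
    if not 0 < event_session < len(write_share):
        raise ValueError("event_session needs a measured session on each side")
    return write_share[event_session] - write_share[event_session - 1]


# Illustrative per-session Write shares with a shock at session index 3.
write_share = [0.10, 0.11, 0.12, 0.35, 0.36]

print(round(maintenance_delta(write_share, event_session=3), 2))  # 0.23
print(round(maintenance_delta(write_share, event_session=2), 2))  # 0.01
```

The step of 0.23 at the event against a background drift of about 0.01 per session is the kind of discontinuity that separates a maintenance shock from gradual write degradation.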

Implementation: agingbench/diagnostics/partitioner.py computes the per-session DiagnosticResult with all three stage shares; agingbench/runner/diagnostic_mixin.py drives the P₁/P₂/P₃ probe execution. The profiles are emitted to the AgingCard under mechanism_metrics. Enable diagnostic probes with --diagnose on any agingbench run invocation.

Aging as a runtime control problem

A long-lived agent is constantly making control decisions: what to write, what to compress, what to retrieve, when to recompact, when to flush, when to promote something into structured state. AgingBench is positioned not only as a measurement benchmark but as a testbed for runtime memory control policies. Controllers observe per-session signals (lag-recall, ws_fidelity, accumulator error, post-shock delta, token cost, latency) and choose interventions (append, compress, retrieve, recompact, flush, force-read, promote to typed state). The ThresholdController class in core/controller.py ships in v0.3.0; full CLI pluggability and reference controllers are planned for future work.
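To make the observe-then-intervene loop concrete, here is a toy threshold policy. It is not the ThresholdController API: the class, the function, the field names, and every threshold value are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SessionSignals:
    """Per-session observations; field names are illustrative."""
    accumulator_error: float
    post_shock_delta: float
    tokens_used: int


def choose_intervention(sig: SessionSignals,
                        max_accumulator_error: float = 0.05,
                        max_shock_drop: float = 0.10,
                        token_budget: int = 8000) -> str:
    """Return one intervention name from fixed thresholds, checked in priority order."""
    if sig.accumulator_error > max_accumulator_error:
        return "promote_to_typed_state"  # derived values are drifting
    if sig.post_shock_delta < -max_shock_drop:
        return "force_read"              # recall dropped after a lifecycle event
    if sig.tokens_used > token_budget:
        return "compress"                # memory is getting expensive
    return "append"                      # nothing crossed a threshold


print(choose_intervention(SessionSignals(0.20, 0.0, 1200)))  # promote_to_typed_state
print(choose_intervention(SessionSignals(0.00, -0.3, 1200)))  # force_read
```

A fixed priority order keeps the policy easy to audit. A real controller would likely also weigh intervention cost, which this sketch ignores.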

Telemetry mode reference

Telemetry mode is the inverse of scenarios mode: instead of running constructed probes against your model, it parses a JSONL trace your deployed agent has already produced and infers per-mechanism aging signals from the trace's behavioral DAG: the graph formed by tool calls (names + args), tool results, session boundaries, lifecycle events, and timestamps. This page documents the current telemetry-mode API; recipe cards for each format live on the telemetry page.

Supported trace formats

Seven adapters ship in agingbench/telemetry/adapters/. Each normalizes its native trace shape into a canonical TelemetryRecord stream that downstream inference consumes format-agnostically. All adapters pass fixture-level parse tests, but the extraction recipe (the command to dump a JSONL from a live source) is validated only for the two formats marked ✓ in v0.3.0; the others depend on third-party SDK versions and are pending live validation.

format	recipe status	input shape
`claude_code`	✓ verified	Native Claude Code JSONL session files; `type` + `message.usage` + `sessionId` auto-detected.
`generic`	✓ by-design DIY	Best-effort fallback: any JSONL with `session_id`, `role`, `content`, and token-count fields; adapter aliases camelCase + snake_case variants.
`openai_assistants`	ⓘ reference	Mixed `thread.message` / `thread.run` / `thread.run.step` objects from the Assistants API export.
`openhands`	ⓘ reference	OpenHands SDK event log: `{source, action, observation, llm_metrics}` per event.
`langfuse`	ⓘ reference	Langfuse SDK exports or REST-API JSON downloads; accepts both camelCase and snake_case field names.
`langsmith`	ⓘ reference	Routed through `generic`; user is responsible for reshaping LangSmith run JSON to the generic field tuple.
`otlp`	ⓘ reference	OpenTelemetry JSON spans; recognizes both the new `gen_ai.` semconv and the legacy `llm.` namespace.

Recipe-status legend: ✓ verified = recipe tested end-to-end against a real source in v0.3.0; ⓘ reference = adapter parses correctly against fixtures, but the documented extraction command depends on SDK / exporter versions and hasn't been live-validated. If you already have a JSONL of the matching shape, the parser handles it regardless.

Trace preprocessing (Claude Code only)

Claude Code stores each conversation as a separate <uuid>.jsonl under ~/.claude/projects/<encoded-cwd>/. To analyze cross-session aging, those fragments must be concatenated into a single timestamp-sorted JSONL. The bundled helper does it in one command (other adapters already export a single file, so they don't need this step):

python -m agingbench.telemetry.prepare_trace ~/.claude/projects/<your-project-dir>
# → wrote ~/.claude/projects/<your-project-dir>/agingbench_trace.jsonl (N events)

Source: agingbench/telemetry/prepare_trace.py. Python API: from agingbench.telemetry import prepare_trace; out = prepare_trace("~/.claude/projects/<dir>").
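For readers who want to see what this preprocessing step amounts to, the following is an independent sketch of "concatenate per-session files, then sort by timestamp". It is not the bundled helper's implementation: the function name, the output file name, and the assumption that each record carries a top-level `timestamp` string are all hypothetical.

```python
import json
from pathlib import Path


def merge_session_files(project_dir: str, out_name: str = "merged_trace.jsonl") -> Path:
    """Concatenate every *.jsonl in project_dir into one timestamp-sorted JSONL file."""
    root = Path(project_dir).expanduser()
    out_path = root / out_name
    events = []
    for path in sorted(root.glob("*.jsonl")):
        if path == out_path:  # do not re-ingest a previous merge
            continue
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines
                    events.append(json.loads(line))
    # Uniformly formatted ISO 8601 timestamps sort correctly as strings;
    # records without a timestamp sort first.
    events.sort(key=lambda e: e.get("timestamp", ""))
    with out_path.open("w", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(event) + "\n")
    return out_path
```

Calling `merge_session_files("~/some/dir")` returns the path of the merged file. Sorting is stable, so records that share a timestamp keep their per-file order.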

Deployment profile

A profile bundles deployment-specific defaults — native outcome-event mappings, subject-linkage keys, per-mechanism weights, session-detection heuristics, and supplementary privacy patterns. Two profiles ship:

generic — no domain assumptions. Outcome events must come from a separately-supplied OutcomeEvent JSONL because the profile cannot extract them natively.
code_assistant — tailored for software-engineering agents (Claude Code, Cursor, Aider, Codex CLI, OpenHands). Maps native events (pr_merged, ci_pass, commit_reverted, completion_accepted, …) to OutcomeEvent.outcome; weights revision at 1.5× over the other mechanisms.

Profiles are loaded via load_profile(name). Per-call overrides:

from agingbench.telemetry import trace_to_card_v11

r = trace_to_card_v11(
    trace_jsonl="trace.jsonl",
    trace_format="claude_code",
    profile="code_assistant",
    overrides={"mechanism_weights": {"compression": 1.2}},  # deep-merged
)

Profile YAML source lives in agingbench/telemetry/profiles/; full structure (outcome_rules, subject_linkage, mechanism_weights, session_detection, privacy_patterns) documented inline as comments.
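The "deep-merged" comment on `overrides` matters because a shallow merge would replace the whole `mechanism_weights` mapping. The sketch below shows the general technique on a toy profile; it is not the library's merge routine, and the toy profile's compression weight of 1.0 is an assumed value.

```python
def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a new dict in which nested dicts are merged key by key, not replaced wholesale."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Toy profile: only the 1.5 revision weight comes from the text above.
profile = {"mechanism_weights": {"compression": 1.0, "revision": 1.5}}

print(deep_merge(profile, {"mechanism_weights": {"compression": 1.2}}))
# {'mechanism_weights': {'compression': 1.2, 'revision': 1.5}}
```

With a shallow `{**profile, **overrides}` the revision weight would be dropped, which is the failure a deep merge avoids.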

OutcomeEvent extractors

Extractors derive OutcomeEvents from trace records when the trace doesn't carry them natively. Pass a list of extractor names via extract_outcomes=[...] on trace_to_card_v11. v0.3.0 ships three:

claude_session_flags — derives success/fail signals from Claude Code's session-completion markers (slash-command exits, tool-call abort flags).
git_log — walks git history alongside the trace's timestamps; commits that immediately follow an agent reply count as success, reverts count as revision_fail.
record_patterns — generic regex-based pattern matching over record text (configurable via the profile YAML's outcome_rules).

Note: outcome events are optional. Most production traces (Claude Code, OpenHands, etc.) don't carry them; telemetry mode uses the cross-session consistency probe below as the outcome-free headline source. Specs in agingbench/telemetry/outcome_extractors.py. Custom extractors register via register_extractor(name, fn).

Per-mechanism inference (behavioral DAG)

Each of the four aging mechanisms gets a dedicated inference module under agingbench/telemetry/inference/. The primary signals are structural — derived from the trace's behavioral DAG rather than from text patterns over prompt/response strings:

mechanism	primary signals (structural)	file
`compression`	saturation rate (`input_tokens / ctx_window`), context-noise ratio trajectory, tool-argument specificity slope (P3)	`compression.py`
`interference`	tool-name KL divergence vs early-session baseline, embedding-based goal-anchor drift (P2), tool-result lineage continuity (P4)	`interference.py`
`revision`	tool-result update propagation (P1) with a three-tier fallback — see next subsection	`revision.py`
`maintenance`	lifecycle shock detection (model swap, ctx drop, cache spike, system change, /clear command) + cumulative `shock_damage_trajectory`: damage magnitude per shock uses a 3-tier preference — outcome_rate_delta (when outcomes linked) → avg_response_tokens_delta/100 (universal) → latency_p50_delta_ms/1000 (when duration_ms present)	`maintenance.py`
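Two of the structural signals in the table reduce to short formulas. The sketch below shows the saturation rate and a smoothed KL divergence over tool-name frequencies. The function names, the smoothing constant, and the direction of the divergence (current session against baseline) are assumptions for illustration, not the contents of compression.py or interference.py.

```python
import math
from collections import Counter


def saturation_rate(input_tokens: int, ctx_window: int) -> float:
    """Fraction of the context window consumed by the prompt."""
    return input_tokens / ctx_window


def tool_kl(current: list[str], baseline: list[str], eps: float = 1e-6) -> float:
    """KL(current || baseline) over tool-name frequencies, with additive smoothing."""
    vocab = set(current) | set(baseline)
    cur, base = Counter(current), Counter(baseline)
    cur_total = len(current) + eps * len(vocab)
    base_total = len(baseline) + eps * len(vocab)
    kl = 0.0
    for tool in vocab:
        p = (cur[tool] + eps) / cur_total
        q = (base[tool] + eps) / base_total
        kl += p * math.log(p / q)
    return kl


print(saturation_rate(150_000, 200_000))  # 0.75

early = ["read_file", "grep", "edit_file", "run_tests"]
late = ["edit_file", "edit_file", "edit_file", "run_tests"]
print(tool_kl(early, early))       # 0.0
print(tool_kl(late, early) > 0.0)  # True
```

Smoothing keeps the divergence finite when a tool seen in one window never appears in the other.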

Cross-session consistency probe (P5)

The load-bearing telemetry-mode signal: consistency.py clusters user turns by sentence-transformer cosine similarity (Jaccard fallback when the encoder isn't available), then for each repeat-task cluster compares first-vs-last occurrence on tool-path Jaccard and response cosine. Output keys on the consistency block:

behavior_drift_at_repeat — aggregate drift in [0, 1]; used as the headline metric when no OutcomeEvents are present
consistency_drop_trajectory — per-session cumulative drift (flat list; same shape as the per-mechanism trajectories)
tool_path_jaccard_drop_mean, response_cosine_drop_mean — component scores
n_repeated_tasks_detected, cluster_sizes — coverage diagnostics
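The first-vs-last comparison for one repeat-task cluster can be sketched as follows. This is an illustration under stated assumptions: tool paths are compared as sets of tool names, a bag-of-words cosine stands in for the sentence-embedding cosine, and "drop" is taken to mean one minus the similarity. None of the names below come from consistency.py.

```python
import math
from collections import Counter


def jaccard(a: set, b: set) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def bow_cosine(x: str, y: str) -> float:
    """Bag-of-words cosine; a stand-in for sentence-embedding cosine."""
    cx, cy = Counter(x.lower().split()), Counter(y.lower().split())
    dot = sum(cx[w] * cy[w] for w in cx)
    norm = (math.sqrt(sum(v * v for v in cx.values()))
            * math.sqrt(sum(v * v for v in cy.values())))
    return dot / norm if norm else 0.0


def cluster_drift(first: dict, last: dict) -> dict:
    """Compare the first and last occurrence of one repeated task."""
    return {
        "tool_path_jaccard_drop": 1.0 - jaccard(set(first["tools"]), set(last["tools"])),
        "response_cosine_drop": 1.0 - bow_cosine(first["response"], last["response"]),
    }


first = {"tools": ["read_file", "grep", "edit_file"], "response": "the budget is 760"}
last = {"tools": ["read_file", "edit_file"], "response": "the budget is 893"}
print(cluster_drift(first, last))
```

In this toy pair the agent skips one of three tools and changes one of four response tokens, so both component drops are non-zero even though the request is the same.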

Three-tier revision fallback

Revision (the "agent reverted to a stale value" signal) gets the most adapter-sensitive treatment. infer_revision() dispatches to one of three tiers depending on what the trace carries:

P1 · tool_result_update_propagation (preferred) — requires tool_calls[].result_summary. Tracks (entity, attribute) → [(t, value)] across the trace; counts agent args that reference a value older than the most-recent result for the same key.
tool_argument_self_reversion (middle) — requires only tool_calls[].args (universal across adapters). Counts v1 → v2 → v1 patterns on agent arg values; identifier-shaped values (UUIDs, file paths, ISO timestamps, long hex hashes) are excluded so re-references aren't mistaken for revisions.
user_correction_text_patterns_fallback (final) — English regex over user prompts. Blocks tagged with this derived_from label are visually downweighted on the card.

The first two fallback tiers (P1 and tool_argument_self_reversion; unrelated to the Tier 1 / Tier 2 scenario tiers) both emit canonical value_supersession_* field names AND legacy per_session_violation_* / violation_trajectory_* aliases so existing visualisations (including the website sparkline) keep rendering without changes.
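The middle tier's v1 → v2 → v1 count is simple to illustrate. The sketch below assumes the values of one argument have already been collected in trace order. The regex for identifier-shaped values and the function name are hypothetical approximations, not the rules used in revision.py.

```python
import re

# Values shaped like identifiers are skipped so re-references are not counted as reversions.
IDENTIFIER_RE = re.compile(
    r"^(?:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"  # UUID
    r"|[0-9a-f]{16,}"                                                     # long hex hash
    r"|\d{4}-\d{2}-\d{2}T\S+"                                             # ISO timestamp
    r"|(?:\.{0,2}/)?[\w.-]+(?:/[\w.-]+)+)$",                              # file path
    re.IGNORECASE,
)


def count_self_reversions(values: list[str]) -> int:
    """Count v1 -> v2 -> v1 patterns in the successive values of one argument."""
    kept = [v for v in values if not IDENTIFIER_RE.match(v)]
    # Collapse consecutive repeats so v1, v1, v2, v1 still reads as v1 -> v2 -> v1.
    collapsed = [v for i, v in enumerate(kept) if i == 0 or v != kept[i - 1]]
    return sum(1 for i in range(2, len(collapsed)) if collapsed[i] == collapsed[i - 2])


print(count_self_reversions(["893", "760", "893"]))                # 1
print(count_self_reversions(["src/a.py", "src/b.py", "src/a.py"]))  # 0
```

The second call returns zero because all three values look like file paths: reopening a file the agent touched earlier is a re-reference, not a stale value.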

Dominant-mechanism selector

Implementation: agingbench/telemetry/inference/_selector.py. The selector decides which single mechanism (if any) leads the Lifespan Card. Logic:

Classify each fired signal as independent or shared. Independent signals diagnose a single mechanism (saturation → compression; tool-KL drift → interference; value supersession → revision; lifecycle event → maintenance). The shared signal — lineage continuity drop — is compatible with multiple mechanisms and is not enough on its own.
Gate. A mechanism is eligible only if at least one of its independent signals fires. Shared signals add weight on top but cannot stand alone.
Argmax. Among gated mechanisms, the highest credited severity score wins unconditionally. Ties break by mechanism order (compression, interference, revision, maintenance).
Empty case. If no mechanism passes the gate, the card reports reason: no_independent_evidence with the list of compatible mechanisms — signature and repair lines stay blank.
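The gate, argmax, and empty-case steps above can be sketched in a few lines. This is an illustration of the described logic, not the code in _selector.py: the function signature is hypothetical, and the severity scores are assumed to already include any credit from shared signals.

```python
MECHANISM_ORDER = ["compression", "interference", "revision", "maintenance"]


def select_dominant(independent_fired: dict[str, bool],
                    severity: dict[str, float],
                    shared_compatible: list[str]) -> dict:
    """Gate on independent evidence, then take the highest severity with order-based tie-break."""
    gated = [m for m in MECHANISM_ORDER if independent_fired.get(m, False)]
    if not gated:
        return {"dominant": None,
                "reason": "no_independent_evidence",
                "compatible": shared_compatible}
    # max() returns the first maximal element, so iterating in MECHANISM_ORDER breaks ties.
    winner = max(gated, key=lambda m: severity.get(m, 0.0))
    return {"dominant": winner}


# Interference has the top score but no independent signal, so it is not eligible.
print(select_dominant(
    {"compression": True, "revision": True},
    {"compression": 0.4, "interference": 0.9, "revision": 0.4},
    shared_compatible=["compression", "interference"],
))  # {'dominant': 'compression'}
```

The example exercises both rules at once: the gate removes interference despite its score, and the compression/revision tie resolves to compression by mechanism order.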

Lifespan Card surface (signature + repair)

When the selector returns a single dominant mechanism, two static lookups in agingbench/telemetry/card_lookups.py produce the card's human-readable closing lines:

MECHANISM_TO_STAGE — maps mechanism → memory-pipeline stage (W = write, R = retrieval, U = utilization, S = store). diagnostic_signature("revision") returns "utilization-dominant (U-stage)".
MECHANISM_TO_REPAIR — maps mechanism → recommended repair recipe. recommended_repair("revision") returns "typed state for derived values...".

A pure-Python card renderer (card_render.render_card_ascii(trace_audit)) produces the rich-format ASCII card shown in the demo and used by the website's 📸 Save as PNG action.

Headline policy (outcome-free by default)

The card's top-line "how fast is this aging" metric is selected dynamically based on what the trace carries — tiers tried in order:

half_life — when OutcomeEvents are present (or an extractor fires)
behavior_drift_at_repeat — when ≥ 1 repeat-task cluster is detected (each cluster ≥ 2 occurrences)
aging_trend — when the aggregate per-session severity sum across the four mechanism blocks is monotonically rising over ≥ 3 sessions (the aggregate now includes the per-session diff of shock_damage_trajectory)
maintenance_shock_damage — when the maintenance block's cumulative shock-damage signal is rising and ≥ 3 shocks fired. Catches front-loaded shock patterns whose per-session-delta trend doesn't qualify for tier 3
not_measurable — when none of the above apply

The headline.source field carries the chosen tier. headline.aging_detected is true when the chosen signal exceeds a meaningful threshold (decay slope < −0.01, m0→m_final drop ≥ 10%, behavior drift > 10%, aging trend slope > 0.01, OR shock_damage_verdict == rising_degradation with n_shocks ≥ 3). The shock-damage clause fires regardless of which tier produced the headline label, so a trace whose dominant aging is maintenance-driven flags aging_detected: true even when the headline label is — for example — a behavior-drift number from tier 2.
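The tier order reads naturally as a chain of early returns. The sketch below covers only the choice of headline.source, not the aging_detected thresholds. The function and parameter names are hypothetical, and "monotonically rising" is interpreted here as strictly increasing, which is an assumption.

```python
def pick_headline_source(has_outcomes: bool,
                         n_repeat_clusters: int,
                         severity_by_session: list[float],
                         shock_damage_rising: bool,
                         n_shocks: int) -> str:
    """Return the first tier whose precondition holds, in the documented order."""
    if has_outcomes:
        return "half_life"
    if n_repeat_clusters >= 1:
        return "behavior_drift_at_repeat"
    rising = all(b > a for a, b in zip(severity_by_session, severity_by_session[1:]))
    if len(severity_by_session) >= 3 and rising:
        return "aging_trend"
    if shock_damage_rising and n_shocks >= 3:
        return "maintenance_shock_damage"
    return "not_measurable"


# No outcomes, no repeated tasks, and a front-loaded shock pattern whose per-session trend is not rising.
print(pick_headline_source(False, 0, [0.5, 0.2, 0.1], True, 3))
# maintenance_shock_damage
```

The example is the front-loaded case the fourth tier exists for: the per-session severity falls after the first session, so tier 3 does not apply, but three shocks with rising cumulative damage still produce a headline.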

Coverage + trajectory verdicts

Every per-mechanism inference block carries two verdict strings so a low score can never be confused with "agent didn't age." The full enum:

Coverage (coverage.verdict) — how much signal the trace gave us for this mechanism:

strong — many sessions with fired tests; conclusion is well-supported
adequate — enough sessions for a defensible read
weak — borderline; treat the score as suggestive only
underpowered — too few sessions or too few fired tests
no_test_fired — the inference test conditions never triggered (e.g. revision needs a value to change at least once)

Trajectory (<metric>_verdict) — the shape of the per-session series, made saturation-aware so floored/ceilinged metrics aren't mis-read as flat:

no_signal, flat — no measurable change
rising_degradation / rising_healthy — series is climbing; one direction means aging, the other means improvement, depending on the metric
falling_degradation / falling_healthy — series is dropping
floor_degradation / floor_healthy — series has bottomed out at zero (or near-zero)
ceiling_degradation / ceiling_healthy — series has saturated at the ceiling

Conceptual mapping to scenarios mode

The four-mechanism vocabulary is inherited from scenarios mode, where each mechanism is operationalised against the gold dependency DAG. Without gold, telemetry signals are proxies — but not all equally faithful:

Revision and Maintenance map cleanly: P1's tool-result-update propagation is operationally analogous to scenarios-mode revision probes; lifecycle-shock pre/post deltas mirror the maintenance probe structure.
Compression and Interference map indirectly: telemetry signals (saturation, KL drift, lineage drop) are necessary conditions and downstream symptoms, not direct measurements of the mechanism without a gold fact list / confusable cluster definition.

Net framing: telemetry mode performs mechanism-level triangulation (multiple structural signals stacked to constrain the mechanism story), where scenarios mode performs mechanism-level identification against gold. Both useful; not interchangeable.

For the math

Per-mechanism inference functions live in agingbench/telemetry/inference/ — one file per mechanism (compression.py, interference.py, revision.py, maintenance.py), plus the cross-session consistency probe (consistency.py), the dominant-mechanism arbitration (_selector.py), shared verdict thresholds (_verdict.py), and lightweight text/clustering helpers (_text_utils.py). Lifespan-card-rendering helpers live alongside in card_lookups.py + card_render.py; the Claude Code preprocessor is in prepare_trace.py. End-to-end pipeline + design notes in agingbench/telemetry/README.md.

AgingCard schema · v1.0.0

Every run emits one aging_card.json conforming to aging_card_schema.json (semver-pinned). Field-by-field reference below; download the schema from the repo or fetch a sample below.

Field	Type	Description
schema_version	string	Semver of the AgingCard schema. `"1.0.0"` at v0.3.0.
card_type	string	Run-level type (e.g., `agingbench.AgingCard` for scenario runs; trace-derived cards carry a distinct value).
generated_at	ISO 8601	UTC timestamp when the card was emitted.
run_id	string	Unique identifier for this run.
scenario	string	Scenario ID (e.g., `s2_lifestyle_assistant`).
scenario_version	string	Semver of the scenario definition. Lets older cards remain comparable when scenarios revise.
suite_id	string	`lite`, `core`, `full`, or a custom suite name.
sut	object	System Under Test: required `sut_id` plus flat string fields `model_provider`, `model_id`, `memory_policy_type`. Extra keys are accepted (the schema allows additional properties).
seed	integer	Run seed. Lets identical configs produce identical tasks.
n_sessions	integer	Number of sessions executed.
pressure	object	Pressure config: dependency_density, update_rate, max_chain_depth, n_confusable_pairs, etc.
headline	object	Required `metric_name`; canonical fields `m0`, `m_final`, `half_life`, `decay_slope`, `hazard_proxy`, `aging_detected`; plus scenario-specific extras (additional properties allowed).
mechanism_metrics	object	Per-mechanism breakdown: required keys `compression` / `interference` / `revision` / `maintenance`, each an object of mechanism-specific metrics.
cost_and_efficiency	object	Canonical: `total_input_tokens`, `total_output_tokens`, `tokens_per_session_mean`, `total_calls`. Advisory (often null): `total_cost_usd`, `latency_ms_p50`, `latency_ms_p95`.
checkpoints	array	Per-session [t, m(t)] pairs: the raw aging curve.
provenance	object	`git_sha`, `agingbench_version`, fork disclosure flag, COI flag.
warnings	array	Soft validation notes (e.g., `telemetry_partial` for trace-derived cards).
links	object	Pointers to related artifacts: `trace.jsonl`, `metrics.json`, `dependency_metrics.json`.
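
Because checkpoints carries the raw [t, m(t)] curve, a consumer can recompute rough curve statistics for sanity checks. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the definitions used here (least-squares slope over sessions, half-life as the first session where m(t) falls to half of m(0)) may not match how AgingBench derives the card's own `decay_slope` and `half_life`, so treat the headline values in the card as authoritative.

```python
from typing import Dict, Optional, Sequence, Tuple


def summarize_checkpoints(
    checkpoints: Sequence[Tuple[float, float]],
) -> Dict[str, Optional[float]]:
    """Recompute rough curve statistics from [t, m(t)] pairs (illustrative)."""
    if not checkpoints:
        raise ValueError("checkpoints is empty")

    ts = [float(t) for t, _ in checkpoints]
    ms = [float(m) for _, m in checkpoints]
    n = len(ts)

    # Ordinary least-squares slope of m(t) against t.
    t_mean = sum(ts) / n
    m_mean = sum(ms) / n
    var_t = sum((t - t_mean) ** 2 for t in ts)
    cov_tm = sum((t - t_mean) * (m - m_mean) for t, m in zip(ts, ms))
    slope = cov_tm / var_t if var_t > 0 else 0.0

    # Assumed half-life: first session at which the metric is <= m(0) / 2.
    m0 = ms[0]
    half_life = None
    if m0 > 0:
        half_life = next((t for t, m in zip(ts, ms) if m <= m0 / 2), None)

    return {"m0": m0, "m_final": ms[-1], "slope": slope, "half_life": half_life}
```

A `half_life` of `None` here means the curve never dropped to half its starting value within the recorded sessions.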

Validate any card:

python -m agingbench.metrics.aging_card_validate ./out/aging_card.json
# → OK: card validates against schema 1.0.0

The schema is the operational contract between AgingBench and your eval stack. Read it once, then write whatever you want against the headline / mechanism / cost blocks. Adapters in examples/ show OpenAI Evals, LangSmith, and Langfuse mappings. Note: the schema does not enforce a submission_track field — state which leaderboard track you are submitting against in your PR description.

Sample cards

Seven canonical cards ship with v0.3.0, one per scenario S1–S7, generated against Haiku-4.5 + lossy_compress as a reference SUT. An additional community-extension card (S8 / SWE-bench-Aging, Claude Code) ships alongside but is not yet paper-verified. The cards are useful for testing your AgingCard-consuming code without running the benchmark.

File	Track	What it shows
s1_research_literature_haiku45_lossy_compress.json	model	S1 compression aging, keyword survival curve
s2_lifestyle_assistant_haiku45_lossy_compress.json	model	S2 revision: constraint precision + accumulator error
s3_knowledge_base_haiku45_lossy_compress.json	model	S3 fidelity decay under accumulating decisions
s4_software_engineering_haiku45_lossy_compress.json	model	S4 dependency recall across sprints
s5_self_planning_haiku45_lossy_compress.json	model	S5 self-planning notebook: workspace inspection + reset survival
s6_naturalistic_haiku45_lossy_compress.json	model	S6 multi-domain recall + maintenance Δshock
s7_research_notes_haiku45_lossy_compress.json	agent	S7 Tier-2: workspace fidelity vs. recall gap
Community extensions — not paper-verified yet
s8_swe_bench_claude_code_s8.json	agent	S8 SWE-bench-Aging extension: Django issue chain under per-session Docker reset

Find them in prototype/examples/sample_cards/ in the repo. Each card is a standalone JSON; load it, read headline + cost_and_efficiency, and you have a complete picture of the run.
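
Loading the sample cards takes only the standard library. The sketch below reads every JSON file in a directory and pulls a few fields from the headline and cost_and_efficiency blocks. The function names and the choice of fields are illustrative assumptions; the field names themselves come from the schema table above, and missing or null values are passed through as `None` because the advisory cost fields are often null.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def summarize_card(card: Dict[str, Any]) -> Dict[str, Any]:
    """Pick a few headline and cost fields out of one AgingCard dict."""
    headline = card.get("headline", {})
    cost = card.get("cost_and_efficiency", {})
    return {
        "scenario": card.get("scenario"),
        "metric_name": headline.get("metric_name"),
        "m0": headline.get("m0"),
        "m_final": headline.get("m_final"),
        "aging_detected": headline.get("aging_detected"),
        "total_input_tokens": cost.get("total_input_tokens"),
        "total_output_tokens": cost.get("total_output_tokens"),
        "total_cost_usd": cost.get("total_cost_usd"),  # advisory, often null
    }


def load_card_summaries(card_dir: str) -> List[Dict[str, Any]]:
    """Summarize every *.json card in a directory, sorted by filename."""
    return [
        summarize_card(json.loads(path.read_text(encoding="utf-8")))
        for path in sorted(Path(card_dir).glob("*.json"))
    ]
```

With a checkout of the repo, `load_card_summaries("prototype/examples/sample_cards")` should return one row per shipped card.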

What AgingBench owns vs. Harbor owns

Concern	Owned by
Agent adapters (Claude Code, OpenHands, custom `BaseAgent` subclasses)	Harbor
Sandboxes, scheduling, parallel + cloud execution	Harbor
Scenarios, temporal FactGraph, pressure dials	AgingBench
Counterfactual conditions and component-aware diagnosis	AgingBench
AgingCard schema, leaderboard submissions	AgingBench
Cost, latency, and trace bookkeeping	Harbor (raw) → AgingCard (structured)

Contributing

AgingBench accepts contributions in four shapes. The protocol is light and human-reviewed.

Submitting a card

Run with three seeds against the default suite for your track. Don't modify timeouts or the probe set.
Validate: python -m agingbench.metrics.aging_card_validate ./out/aging_card.json
Open a PR with your cards under leaderboard/<track>/<scenario>/<your-key>__seed{N}.json. CI validates schemas automatically. A maintainer reviews provenance and merges within 5–7 days.
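
The path convention in the last step can be generated rather than typed by hand. This is a small sketch, assuming the three seeds must be distinct; the helper name and the example key are hypothetical, and the track and scenario strings are whatever your submission targets.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def submission_paths(
    track: str, scenario: str, key: str, seeds: Iterable[int]
) -> List[str]:
    """Build leaderboard/<track>/<scenario>/<key>__seed{N}.json for each seed."""
    seed_list = list(seeds)
    # Assumption: "three seeds" means three distinct seed values.
    if len(set(seed_list)) != 3:
        raise ValueError("expected exactly three distinct seeds")
    return [
        str(PurePosixPath("leaderboard", track, scenario, f"{key}__seed{seed}.json"))
        for seed in seed_list
    ]
```

For example, `submission_paths("model", "s2_lifestyle_assistant", "acme-lab", [0, 1, 2])` gives three paths under `leaderboard/model/s2_lifestyle_assistant/`.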

Adding a scenario

Implement a generator (generators/sN_generator.py) with BaseGenerator + DependencyMixin, a runner (runner/sN_runner.py) extending BaseRunner, curated data under scenarios/sN_*/, and an entry in cli/runners.py _SCENARIO_RUNNERS. See CONTRIBUTING.md in the repo for the full template.

Adding an integration adapter

Read a card, emit your eval system's format, drop in examples/<target>_adapter.py. The skeletons for OpenAI Evals, LangSmith, and Langfuse are 50-line starting points. Open a PR; we ship adapters as Beta and promote to Ready after community confirmation.
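
The core of an adapter is usually a flattening step: turn the nested card into rows your eval system can ingest. The sketch below is a generic, hypothetical starting point, not one of the shipped skeletons. The record layout (`run_id`, `scenario`, `group`, `metric`, `value`) is an assumption, and it calls no OpenAI Evals, LangSmith, or Langfuse API; mapping the rows onto a specific target is left to the adapter.

```python
from typing import Any, Dict, List


def _is_number(value: Any) -> bool:
    """True for int/float, excluding bool (a subclass of int in Python)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def card_to_records(card: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten numeric headline and mechanism metrics into flat records."""
    base = {"run_id": card.get("run_id"), "scenario": card.get("scenario")}
    records: List[Dict[str, Any]] = []

    # Headline block: keep numeric fields, skip metric_name and boolean flags.
    for name, value in card.get("headline", {}).items():
        if _is_number(value):
            records.append(
                {**base, "group": "headline", "metric": name, "value": float(value)}
            )

    # Mechanism block: one group per mechanism (compression, interference, ...).
    for mechanism, metrics in card.get("mechanism_metrics", {}).items():
        for name, value in metrics.items():
            if _is_number(value):
                records.append(
                    {**base, "group": mechanism, "metric": name, "value": float(value)}
                )

    return records
```

Null and non-numeric values are dropped rather than coerced, so a target that rejects missing values never sees them.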

Reporting a bug or disputing a card

File a GitHub Issue using the appropriate template (bug_report.md or scenario_request.md); for a dispute, mark the issue [dispute]. Disputes are reviewed by two leaderboard operators.

Maintenance pledge

Sustained artifact, not a one-off paper page

AgingBench is maintained by the AgingBench maintainer group. We commit to: weekly review of submitted AgingCards; an 8-week release cadence; backward-compatible schema evolution across minor versions; transparent governance of leaderboard moves.

If a submission turns out to be misrepresented, the card moves to leaderboard/_retracted/ with the dispute reasoning; it is never silently deleted.

Contact: zhujianing9810@gmail.com

Authors: Jianing Zhu*, Yeonju Ro*, John Robertson, Kevin Wang, Junbo Li, Haris Vikalo, Aditya Akella, Zhangyang Wang (UT Austin). *equal contribution.
