Reference & methodology

Docs

Seven scenarios in detail, the temporal-DAG measurement and counterfactual diagnosis tools for agent lifespan engineering (ALE), the AgingCard schema, contribution guide, and maintenance pledge.

Seven deployment scenarios + community extension

Each scenario mirrors a real product surface and naturally activates a subset of mechanisms. Tier 1 (S1โ€“S6) is runner-controlled: the harness owns the session loop, and you swap models, memory policies, components, or controllers. Tier 2 (S7) is agent-controlled: the agent owns the loop. We also call for Community extensions (see below).

S1
Tier 1

Research Literature Agent

Ingests synthetic papers; user asks recall queries over weeks. Published facts don't change, so this primarily exposes compression: what survives the write-time abstraction.

Sessions
8โ€“12
Headline metric
keywordm(t) ยท cohort keyword survival in compressed memory
Mechanisms
Compression (primary)
Source files
scenarios/s1_research_literature/ ยท source_doc.json, eval probes
Pressure dials
cohort size, dependency_density, max_chain_depth
S2
Tier 1

Lifestyle Assistant

Tracks budgets, dietary restrictions, scheduling. Constraints update mid-deployment. Includes Ledger-QA accumulator probes that test derived state, not just recall. The strongest activator of revision aging.

Sessions
8โ€“10
Headline metrics
constraint_precision ยท CVR ยท accumulator_error
Mechanisms
Compression + Revision
Source files
scenarios/s2_lifestyle_assistant/ ยท constraint_updates.json, accumulator probes
Famous result
Agent frozen at $893 for 9 sessions while gold drifts to $76 (Llama-8B)
S3
Tier 1

Project Knowledge Base

Many decisions across a single product, with revisions and lexical overlap. Exposes compression, interference, and revision. The interference vignette on the home page (coverage target 85% vs. achieved 87%) is drawn from this scenario.

Sessions
8โ€“100
Headline metric
fidelity(t) ยท gold-decision survival in Mt
Mechanisms
Compression + Interference + Revision
Source files
scenarios/s3_knowledge_base/ ยท gold_timeline.json
S4
Tier 1

Software Engineering Agent

Code planning with cross-file dependency tracking. Long-range reference between sprints exposes compression and interference.

Sessions
8โ€“12
Headline metric
dep_recall(t) ยท prior-sprint dependency keyword recall
Mechanisms
Compression + Interference
S5
Tier 1

Self-Planning Notebook

The agent manages its own workspace files (notes, plans, scratch) via the in-tree ReactFileAdapter. The runner streams mixed assistant / KB / coding tasks, resets conversation state every block_length interactions to force reliance on the workspace (not chat history), and scores responses by keyword match plus workspace inspection. Renamed from the old constraint-compliance "coding agent" scenario in v0.2.

Sessions
8โ€“20 interactions per block, 10 sessions default
Headline metric
keyword-match recall ยท workspace inspection score
Mechanisms
Compression + Revision (workspace state) + Maintenance (reset survival)
Source files
scenarios/s5_self_planning/ ยท runner at runner/s5_runner.py
S6
Tier 1

Naturalistic Multi-Domain

Mixed personal-assistant traffic with operational events. The fullest activation surface: all four mechanisms. The maintenance shock results (flush_history, recompact, partial_reset) primarily come from here.

Sessions
10โ€“30
Headline metric
recall_rate(t)
Mechanisms
All four
Famous result
flush_history@4: m_final 0.083 vs. control 0.250 (67% worse, no recovery)
S7
Tier 2

Research-Notes Coding Task

Closed-source autonomous agents (OpenHands, Claude Code) under maintenance shocks, with agent-managed workspace memory. The agent owns its session loop; the runner observes. Activates all four mechanisms; Tier-2 leaderboard.

Sessions
8โ€“20 blocks
Headline metrics
recall_accuracy(t) ยท workspace_fidelity
Mechanisms
All four
Adapters
OpenHandsAdapter ยท ClaudeCodeAgentAdapter ยท CodexAdapter (under agingbench/core/adapters/)
Famous result
ws_fidelity โ‰ˆ 0.85 across models, with recall spanning 0.15 (GPT-4o-mini ยท OpenHands) to 0.74 (Sonnet 4.6 ยท Claude Code). Agents store facts but consistently under-consult their own memory.

Community extensions + want a new scenario?

We're actively welcoming further scenario contributions โ€” production agent deployments, domain-specific failure modes, anything that exercises a memory-aging axis we haven't covered yet.

S8
Tier 2 ยท extension

SWE-bench-Aging

Production coding agents (Claude Code, OpenHands) on a curated chain of real Django GitHub issues run in per-session SWE-bench Docker containers. The agent owns the loop, modifies the repo across sessions, and is verified against the upstream Django test suite plus injected synthetic consistency tests. Anchors AgingBench against an established community benchmark.

Sessions
8 (django_orm_query chain)
Headline metrics
workspace fidelity ยท downstream test pass-rate ยท post-shock delta
Mechanisms
All four; maintenance via dependency upgrade + Python version bump shocks
Adapters
ClaudeCodeAgentAdapter ยท OpenHandsAdapter
Famous result
Synthetic consistency tests catch regressions that the Django suite alone misses โ€” agents pass upstream tests while silently breaking earlier-session invariants

Methodology

AgingBench creates controlled longitudinal pressure, not arbitrary degradation. The value comes from the temporal-DAG generator plus the counterfactual conditions: together they let us measure which mechanism is failing, not just how much.

Temporal dependency DAG

Each scenario is a directed acyclic graph of facts laid out along a session-indexed timeline. The generator emits this graph alongside the task stream; the session loop runs read โ†’ act โ†’ write under counterfactual conditions. Five primitives compose the DAG, and the four aging mechanisms emerge from how those primitives are stressed:

  • Version chains. A fact's value evolves across sessions โ€” e.g. "clothing budget: $1000 (s1)" โ†’ "$893 (s3 update)" โ†’ "$760 (s5 update)". Reading the chain at s5 should return $760; a memory store that returns $1000 has missed at least one revision. The number of in-chain updates and their inter-session spacing are the primary knobs of the revision mechanism.
  • Dependency edges (multi-hop chains). Fact A introduced at s2 references fact B introduced at s0; correctly answering an A-query requires retrieving B too. Chain depth d = number of session-hops from the query back to the source. Deep chains stress compression and interference simultaneously โ€” the source-of-source has more time to be compacted away or buried.
  • Interference pairs. Pairs of lexically similar entities introduced at different sessions โ€” "John Smith" vs "John Smyth", "Project Alpha" vs "Project Alfa", config_v2.yaml vs config_v2_old.yaml. At retrieval time the agent must disambiguate. Confusable-pair density is the surgical knob of the interference mechanism.
  • Accumulators (derived running totals). A scalar the agent maintains across sessions โ€” total budget spent, ledger balance, count of completed tasks. Each session's update is small; cumulative error grows with every miss. This is the sharpest revision probe โ€” a single missed update silently contaminates every subsequent accumulator query, so the metric measures derived state, not just keyword recall.
  • Lifecycle events at session t = k. Exogenous operations on the memory store: recompact, flush_history, partial_reset, model swaps, prompt updates. These are the maintenance probes โ€” a pre/post score delta around k isolates the shock from gradual aging accumulating in the surrounding sessions.

Generator dials โ€” dependency_density, update_rate, max_chain_depth, n_confusable_pairs, plus per-scenario maintenance schedules โ€” control the intensity of each primitive. Light / medium / heavy presets ship in agingbench/generators/pressure_config.py; per-scenario YAMLs override individual dials when a scenario needs custom pressure.

AgingBench evaluation pipeline
The temporal FactGraph as a session-indexed timeline, with version chains, interference pairs, dependency edges, an accumulator ฮฃ, and a lifecycle event ek. The runner threads the task stream through read / act / write while applying counterfactual interventions.

Component-aware diagnosis

Most benchmarks tell you that an agent failed. AgingBench tells you where in the memory pipeline. We decompose the agent's memory harness into four loci (๐’ฒ, ๐’ฎ, โ„›, ๐’ฐ) and use three paired counterfactual probes (P1, P2, P3) to attribute each failure to a specific component โ€” and from there to a specific aging mechanism. The probe ladder is for component-aware diagnosis rather than exact additive causal decomposition; probe-accuracy gaps name the bottleneck without claiming unique causal effects.

Memory loci

The agent is represented as a cyclic dataflow over a memory store, decomposed into four explicit components (paper ยง4.1):

  • ๐’ฒ โ€” Write / compression policy. Transforms the current session history into a persistent format saved in the store. Governed by a memory policy ฮธ that is lossy in most production agents (append-only, summarization, compaction).
  • ๐’ฎ โ€” Memory store. The persistent artifact that holds data across sessions. Mutated by maintenance events (flush, recompact, model swap).
  • โ„› โ€” Read / retrieval algorithm. Queries the store to extract the working context for the current task (last-k by recency, top-k cosine, etc.).
  • ๐’ฐ โ€” Utilization logic. The LLM's reasoning loop โ€” when to retrieve, what to query, how much context to request, how to synthesize the retrieved context into a response.

Each aging mechanism has a primary stage: compression stresses ๐’ฒ, interference stresses โ„›, revision stresses ๐’ฐ, and maintenance stresses ๐’ฎ (external operations mutate the store outside the readโ€“writeโ€“utilize execution loop). Localizing the error site directly guides mitigation.

Memory pipeline dataflow with attribution components
Memory pipeline (paper Fig. 2). Data flows sequentially: History โ†’ ๐’ฒ โ†’ ๐’ฎ โ†’ โ„› โ†’ Context โ†’ ๐’ฐ โ†’ Answer. We attribute aging into the highlighted components.

Three counterfactual probes

Each probe replaces selected upstream components with oracle implementations; the resulting accuracy gaps point to the first non-oracle component that is consistent with the failure. The probes form an ablation ladder over the pipeline:

1

P1

Baseline execution. Agent uses its own write, retrieval, and utilization logic. Measures AccP1.

2

P2

Oracle retrieval. Bypass โ„› โ€” oracle extracts required facts from the agent's actual store and injects them into the prompt. Measures AccP2.

3

P3

Oracle context. Bypass ๐’ฒ + โ„› โ€” gold facts injected directly into the prompt. Measures AccP3. Any remaining error points to utilization.

Write (๐’ฒ)Read (โ„›)Utilize (๐’ฐ)
P1 baselineAgentAgentAgent
P2 oracle retrievalAgentOracleAgent
P3 oracle contextOracleOracleAgent

Stage-level diagnostic profile

Within this conceptual pipeline decomposition, the P1/P2/P3 ladder additively accounts for the end-to-end error across the Write, Retrieval, and Utilization stages, yielding a stage-level diagnostic profile rather than a unique causal decomposition for every architecture. The three shares are read as candidate failure stages:

  • Utilization share = 1 โˆ’ AccP3 โ†’ consistent with a revision-aging signature (๐’ฐ): the model fails despite a perfect context.
  • Write share = AccP3 โˆ’ AccP2 โ†’ pointing to a compression-aging signature (๐’ฒ): information was already underspecified at write time.
  • Read share = AccP2 โˆ’ AccP1 โ†’ consistent with an interference-aging signature (โ„›): facts in the store but retrieval failed to fetch them.

Maintenance aging (๐’ฎ) is observationally aliased with the Write share โ€” both result in missing facts in the store. AgingBench separates them temporally: execution-loop signatures are probed across sessions, while maintenance shocks are measured immediately across a lifecycle event at time t:

ฮ”๐’ฎ = WriteShare(tโบ) โˆ’ WriteShare(tโป)

A discrete jump in the Write share coincident with a lifecycle event separates maintenance aging from gradual write degradation.

Implementation: agingbench/diagnostics/partitioner.py computes the per-session DiagnosticResult with all three stage shares; agingbench/runner/diagnostic_mixin.py drives the P1/P2/P3 probe execution. The profiles are emitted to the AgingCard under mechanism_metrics. Enable diagnostic probes with --diagnose on any agingbench run invocation.

Aging as a runtime control problem

A long-lived agent is constantly making control decisions: what to write, what to compress, what to retrieve, when to recompact, when to flush, when to promote something into structured state. AgingBench is positioned not only as a measurement benchmark but as a testbed for runtime memory control policies. Controllers observe per-session signals (lag-recall, ws_fidelity, accumulator error, post-shock delta, token cost, latency) and choose interventions (append, compress, retrieve, recompact, flush, force-read, promote to typed state). The ThresholdController class in core/controller.py ships in v0.3.0; full CLI pluggability and reference controllers will be landed in future work.

Telemetry mode reference

Telemetry mode is the inverse of scenario mode: instead of running constructed probes against your model, it parses a JSONL trace your deployed agent has already produced and infers per-mechanism aging signals from the trace's behavioral DAG โ€” the graph formed by tool calls (names + args), tool results, session boundaries, lifecycle events, and timestamps. This page documents the current telemetry-mode API; recipe cards for each format live on the telemetry page.

Supported trace formats

Seven adapters ship in agingbench/telemetry/adapters/. Each normalizes its native trace shape into a canonical TelemetryRecord stream that downstream inference consumes format-agnostically. Adapters all pass fixture-level parse tests; the extraction recipe (the command to dump a JSONL from a live source) is validated only for the two formats marked โœ“ in v0.3.0 โ€” the others depend on third-party SDK versions and are pending live validation.

formatrecipe statusinput shape
claude_code โœ“ verifiedNative Claude Code JSONL session files; type + message.usage + sessionId auto-detected.
generic โœ“ by-design DIYBest-effort fallback: any JSONL with session_id, role, content, and token-count fields; adapter aliases camelCase + snake_case variants.
openai_assistantsโ“˜ referenceMixed thread.message / thread.run / thread.run.step objects from the Assistants API export.
openhands โ“˜ referenceOpenHands SDK event log: {source, action, observation, llm_metrics} per event.
langfuse โ“˜ referenceLangfuse SDK exports or REST-API JSON downloads; accepts both camelCase and snake_case field names.
langsmith โ“˜ referenceRouted through generic; user is responsible for reshaping LangSmith run JSON to the generic field tuple.
otlp โ“˜ referenceOpenTelemetry JSON spans; recognizes both the new gen_ai.* semconv and the legacy llm.* namespace.

Recipe-status legend: โœ“ verified = recipe tested end-to-end against a real source in v0.3.0; โ“˜ reference = adapter parses correctly against fixtures, but the documented extraction command depends on SDK / exporter versions and hasn't been live-validated. If you already have a JSONL of the matching shape, the parser handles it regardless.

Trace preprocessing (Claude Code only)

Claude Code stores each conversation as a separate <uuid>.jsonl under ~/.claude/projects/<encoded-cwd>/. To analyze cross-session aging, those fragments must be concatenated into a single timestamp-sorted JSONL. The bundled helper does it in one command (other adapters already export a single file, so they don't need this step):

python -m agingbench.telemetry.prepare_trace ~/.claude/projects/<your-project-dir>
# โ†’ wrote ~/.claude/projects/<your-project-dir>/agingbench_trace.jsonl (N events)

Source: agingbench/telemetry/prepare_trace.py. Python API: from agingbench.telemetry import prepare_trace; out = prepare_trace("~/.claude/projects/<dir>").

Deployment profile

A profile bundles deployment-specific defaults โ€” native outcome-event mappings, subject-linkage keys, per-mechanism weights, session-detection heuristics, and supplementary privacy patterns. Two profiles ship:

  • generic โ€” no domain assumptions. Outcome events must come from a separately-supplied OutcomeEvent JSONL because the profile cannot extract them natively.
  • code_assistant โ€” tailored for software-engineering agents (Claude Code, Cursor, Aider, Codex CLI, OpenHands). Maps native events (pr_merged, ci_pass, commit_reverted, completion_accepted, โ€ฆ) to OutcomeEvent.outcome; weights revision at 1.5ร— over the other mechanisms.

Profiles are loaded via load_profile(name). Per-call overrides:

from agingbench.telemetry import trace_to_card_v11

r = trace_to_card_v11(
    trace_jsonl="trace.jsonl",
    trace_format="claude_code",
    profile="code_assistant",
    overrides={"mechanism_weights": {"compression": 1.2}},  # deep-merged
)

Profile YAML source lives in agingbench/telemetry/profiles/; full structure (outcome_rules, subject_linkage, mechanism_weights, session_detection, privacy_patterns) documented inline as comments.

OutcomeEvent extractors

Extractors derive OutcomeEvents from trace records when the trace doesn't carry them natively. Pass a list of extractor names via extract_outcomes=[...] on trace_to_card_v11. v0.3.0 ships three:

  • claude_session_flags โ€” derives success/fail signals from Claude Code's session-completion markers (slash-command exits, tool-call abort flags).
  • git_log โ€” walks git history alongside the trace's timestamps; commits that immediately follow an agent reply count as success, reverts count as revision_fail.
  • record_patterns โ€” generic regex-based pattern matching over record text (configurable via the profile YAML's outcome_rules).

Note: outcome events are optional. Most production traces (Claude Code, OpenHands, etc.) don't carry them; telemetry mode uses the cross-session consistency probe below as the outcome-free headline source. Specs in agingbench/telemetry/outcome_extractors.py. Custom extractors register via register_extractor(name, fn).

Per-mechanism inference (behavioral DAG)

Each of the four aging mechanisms gets a dedicated inference module under agingbench/telemetry/inference/. The primary signals are structural โ€” derived from the trace's behavioral DAG rather than from text patterns over prompt/response strings:

mechanismprimary signals (structural)file
compressionsaturation rate (input_tokens / ctx_window), context-noise ratio trajectory, tool-argument specificity slope (P3)compression.py
interferencetool-name KL divergence vs early-session baseline, embedding-based goal-anchor drift (P2), tool-result lineage continuity (P4)interference.py
revisiontool-result update propagation (P1) with a three-tier fallback โ€” see next subsectionrevision.py
maintenancelifecycle shock detection (model swap, ctx drop, cache spike, system change, /clear command) + cumulative shock_damage_trajectory: damage magnitude per shock uses a 3-tier preference โ€” outcome_rate_delta (when outcomes linked) โ†’ avg_response_tokens_delta/100 (universal) โ†’ latency_p50_delta_ms/1000 (when duration_ms present)maintenance.py

Cross-session consistency probe (P5)

The load-bearing telemetry-mode signal: consistency.py clusters user turns by sentence-transformer cosine similarity (Jaccard fallback when the encoder isn't available), then for each repeat-task cluster compares first-vs-last occurrence on tool-path Jaccard and response cosine. Output keys on the consistency block:

  • behavior_drift_at_repeat โ€” aggregate drift in [0, 1]; used as the headline metric when no OutcomeEvents are present
  • consistency_drop_trajectory โ€” per-session cumulative drift (flat list; same shape as the per-mechanism trajectories)
  • tool_path_jaccard_drop_mean, response_cosine_drop_mean โ€” component scores
  • n_repeated_tasks_detected, cluster_sizes โ€” coverage diagnostics

Three-tier revision fallback

Revision (the "agent reverted to a stale value" signal) gets the most adapter-sensitive treatment. infer_revision() dispatches to one of three tiers depending on what the trace carries:

  1. P1 ยท tool_result_update_propagation (preferred) โ€” requires tool_calls[].result_summary. Tracks (entity, attribute) โ†’ [(t, value)] across the trace; counts agent args that reference a value older than the most-recent result for the same key.
  2. tool_argument_self_reversion (middle) โ€” requires only tool_calls[].args (universal across adapters). Counts v1 โ†’ v2 โ†’ v1 patterns on agent arg values; identifier-shaped values (UUIDs, file paths, ISO timestamps, long hex hashes) are excluded so re-references aren't mistaken for revisions.
  3. user_correction_text_patterns_fallback (final) โ€” English regex over user prompts. Blocks tagged with this derived_from label are visually downweighted on the card.

Both Tier 1 and Tier 2 emit canonical value_supersession_* field names AND legacy per_session_violation_* / violation_trajectory_* aliases so existing visualisations (including the website sparkline) keep rendering without changes.

Dominant-mechanism selector

Implementation: agingbench/telemetry/inference/_selector.py. The selector decides which single mechanism (if any) leads the Lifespan Card. Logic:

  1. Classify each fired signal as independent or shared. Independent signals diagnose a single mechanism (saturation โ†’ compression; tool-KL drift โ†’ interference; value supersession โ†’ revision; lifecycle event โ†’ maintenance). The shared signal โ€” lineage continuity drop โ€” is compatible with multiple mechanisms and is not enough on its own.
  2. Gate. A mechanism is eligible only if at least one of its independent signals fires. Shared signals add weight on top but cannot stand alone.
  3. Argmax. Among gated mechanisms, the highest credited severity score wins unconditionally. Ties break by mechanism order (compression, interference, revision, maintenance).
  4. Empty case. If no mechanism passes the gate, the card reports reason: no_independent_evidence with the list of compatible mechanisms โ€” signature and repair lines stay blank.

Lifespan Card surface (signature + repair)

When the selector returns a single dominant mechanism, two static lookups in agingbench/telemetry/card_lookups.py produce the card's human-readable closing lines:

  • MECHANISM_TO_STAGE โ€” maps mechanism โ†’ memory-pipeline stage (W = write, R = retrieval, U = utilization, S = store). diagnostic_signature("revision") returns "utilization-dominant (U-stage)".
  • MECHANISM_TO_REPAIR โ€” maps mechanism โ†’ recommended repair recipe. recommended_repair("revision") returns "typed state for derived values...".

A pure-Python card renderer (card_render.render_card_ascii(trace_audit)) produces the rich-format ASCII card shown in the demo and used by the website's ๐Ÿ“ธ Save as PNG action.

Headline policy (outcome-free by default)

The card's top-line "how fast is this aging" metric is selected dynamically based on what the trace carries โ€” tiers tried in order:

  1. half_life โ€” when OutcomeEvents are present (or an extractor fires)
  2. behavior_drift_at_repeat โ€” when โ‰ฅ 1 repeat-task cluster is detected (each cluster โ‰ฅ 2 occurrences)
  3. aging_trend โ€” when the aggregate per-session severity sum across the four mechanism blocks is monotonically rising over โ‰ฅ 3 sessions (the aggregate now includes the per-session diff of shock_damage_trajectory)
  4. maintenance_shock_damage โ€” when the maintenance block's cumulative shock-damage signal is rising and โ‰ฅ 3 shocks fired. Catches front-loaded shock patterns whose per-session-delta trend doesn't qualify for tier 3
  5. not_measurable โ€” when none of the above apply

The headline.source field carries the chosen tier. headline.aging_detected is true when the chosen signal exceeds a meaningful threshold (decay slope < โˆ’0.01, m0โ†’m_final drop โ‰ฅ 10%, behavior drift > 10%, aging trend slope > 0.01, OR shock_damage_verdict == rising_degradation with n_shocks โ‰ฅ 3). The shock-damage clause fires regardless of which tier produced the headline label, so a trace whose dominant aging is maintenance-driven flags aging_detected: true even when the headline label is โ€” for example โ€” a behavior-drift number from tier 2.

Coverage + trajectory verdicts

Every per-mechanism inference block carries two verdict strings so a low score can never be confused with "agent didn't age." The full enum:

Coverage (coverage.verdict) โ€” how much signal the trace gave us for this mechanism:

  • strong โ€” many sessions with fired tests; conclusion is well-supported
  • adequate โ€” enough sessions for a defensible read
  • weak โ€” borderline; treat the score as suggestive only
  • underpowered โ€” too few sessions or too few fired tests
  • no_test_fired โ€” the inference test conditions never triggered (e.g. revision needs a value to change at least once)

Trajectory (<metric>_verdict) โ€” the shape of the per-session series, made saturation-aware so floored/ceilinged metrics aren't mis-read as flat:

  • no_signal, flat โ€” no measurable change
  • rising_degradation / rising_healthy โ€” series is climbing; one direction means aging, the other means improvement, depending on the metric
  • falling_degradation / falling_healthy โ€” series is dropping
  • floor_degradation / floor_healthy โ€” series has bottomed out at zero (or near-zero)
  • ceiling_degradation / ceiling_healthy โ€” series has saturated at the ceiling

Conceptual mapping to scenarios mode

The four-mechanism vocabulary is inherited from scenarios mode, where each mechanism is operationalised against the gold dependency DAG. Without gold, telemetry signals are proxies โ€” but not all equally faithful:

  • Revision and Maintenance map cleanly: P1's tool-result-update propagation is operationally analogous to scenarios-mode revision probes; lifecycle-shock pre/post deltas mirror the maintenance probe structure.
  • Compression and Interference map indirectly: telemetry signals (saturation, KL drift, lineage drop) are necessary conditions and downstream symptoms, not direct measurements of the mechanism without a gold fact list / confusable cluster definition.

Net framing: telemetry mode performs mechanism-level triangulation (multiple structural signals stacked to constrain the mechanism story), where scenarios mode performs mechanism-level identification against gold. Both useful; not interchangeable.

For the math

Per-mechanism inference functions live in agingbench/telemetry/inference/ โ€” one file per mechanism (compression.py, interference.py, revision.py, maintenance.py), plus the cross-session consistency probe (consistency.py), the dominant-mechanism arbitration (_selector.py), shared verdict thresholds (_verdict.py), and lightweight text/clustering helpers (_text_utils.py). Lifespan-card-rendering helpers live alongside in card_lookups.py + card_render.py; the Claude Code preprocessor is in prepare_trace.py. End-to-end pipeline + design notes in agingbench/telemetry/README.md.

AgingCard schema ยท v1.0.0

Every run emits one aging_card.json conforming to aging_card_schema.json (semver-pinned). Field-by-field reference below; download the schema from the repo or fetch a sample below.

FieldTypeDescription
schema_versionstringSemver of the AgingCard schema. "1.0.0" at v0.3.0.
card_typestringRun-level type (e.g., agingbench.AgingCard for scenario runs; trace-derived cards carry a distinct value).
generated_atISO 8601UTC timestamp when the card was emitted.
run_idstringUnique identifier for this run.
scenariostringScenario ID (e.g., s2_lifestyle_assistant).
scenario_versionstringSemver of the scenario definition. Lets older cards remain comparable when scenarios revise.
suite_idstringlite, core, full, or a custom suite name.
sutobjectSystem Under Test: required sut_id plus flat string fields model_provider, model_id, memory_policy_type. Extra keys are accepted (the schema allows additional properties).
seedintegerRun seed. Lets identical configs produce identical tasks.
n_sessionsintegerNumber of sessions executed.
pressureobjectPressure config: dependency_density, update_rate, max_chain_depth, n_confusable_pairs, etc.
headlineobjectRequired metric_name; canonical fields m0, m_final, half_life, decay_slope, hazard_proxy, aging_detected; plus scenario-specific extras (additional properties allowed).
mechanism_metricsobjectPer-mechanism breakdown: required keys compression / interference / revision / maintenance, each an object of mechanism-specific metrics.
cost_and_efficiencyobjectCanonical: total_input_tokens, total_output_tokens, tokens_per_session_mean, total_calls. Advisory (often null): total_cost_usd, latency_ms_p50, latency_ms_p95.
checkpointsarrayPer-session [t, m(t)] pairs: the raw aging curve.
provenanceobjectgit_sha, agingbench_version, fork disclosure flag, COI flag.
warningsarraySoft validation notes (e.g., telemetry_partial for trace-derived cards).
linksobjectPointers to related artifacts: trace.jsonl, metrics.json, dependency_metrics.json.

Validate any card:

python -m agingbench.metrics.aging_card_validate ./out/aging_card.json # โ†’ OK: card validates against schema 1.0.0

The schema is the operational contract between AgingBench and your eval stack. Read it once, then write whatever you want against the headline / mechanism / cost blocks. Adapters in examples/ show OpenAI Evals, LangSmith, and Langfuse mappings. Note: the schema does not enforce a submission_track field โ€” state which leaderboard track you are submitting against in your PR description.

Sample cards

Seven canonical cards ship with v0.3.0 โ€” one per scenario S1โ€“S7, generated against Haiku-4.5 + lossy_compress as a reference SUT. An additional community-extension card (S8 / SWE-bench-Aging, Claude Code) ships alongside but is not paper-verified yet. Useful for testing your AgingCard-consuming code without running the benchmark.

FileTrackWhat it shows
s1_research_literature_haiku45_lossy_compress.jsonmodelS1 compression aging, keyword survival curve
s2_lifestyle_assistant_haiku45_lossy_compress.jsonmodelS2 revision: constraint precision + accumulator error
s3_knowledge_base_haiku45_lossy_compress.jsonmodelS3 fidelity decay under accumulating decisions
s4_software_engineering_haiku45_lossy_compress.jsonmodelS4 dependency recall across sprints
s5_self_planning_haiku45_lossy_compress.jsonmodelS5 self-planning notebook: workspace inspection + reset survival
s6_naturalistic_haiku45_lossy_compress.jsonmodelS6 multi-domain recall + maintenance ฮ”shock
s7_research_notes_haiku45_lossy_compress.jsonagentS7 Tier-2: workspace fidelity vs. recall gap
Community extensions โ€” not paper-verified yet
s8_swe_bench_claude_code_s8.jsonagentS8 SWE-bench-Aging extension: Django issue chain under per-session Docker reset

Find them in prototype/examples/sample_cards/ in the repo. Each card is a standalone JSON; load it, read headline + cost_and_efficiency, and you have a complete picture of the run.

Contributing

AgingBench accepts contributions in four shapes. The protocol is light and human-reviewed.

Submitting a card

  1. Run with three seeds against the default suite for your track. Don't modify timeouts or the probe set.
  2. Validate: python -m agingbench.metrics.aging_card_validate ./out/aging_card.json
  3. Open a PR with your cards under leaderboard/<track>/<scenario>/<your-key>__seed{N}.json. CI validates schemas automatically. A maintainer reviews provenance and merges within 5โ€“7 days.

Adding a scenario

Implement a generator (generators/sN_generator.py) with BaseGenerator + DependencyMixin, a runner (runner/sN_runner.py) extending BaseRunner, curated data under scenarios/sN_*/, and an entry in cli/runners.py _SCENARIO_RUNNERS. See CONTRIBUTING.md in the repo for the full template.

Adding an integration adapter

Read a card, emit your eval system's format, drop in examples/<target>_adapter.py. The skeletons for OpenAI Evals, LangSmith, and Langfuse are 50-line starting points. Open a PR; we ship adapters as Beta and promote to Ready after community confirmation.

Reporting a bug or disputing a card

GitHub Issues with the appropriate template (bug_report.md, scenario_request.md, or a dispute marked [dispute]). Disputes are reviewed by two leaderboard operators.

Maintenance pledge

Sustained artifact, not a one-off paper page

AgingBench is maintained by the AgingBench maintainer group. We commit to: weekly review of submitted AgingCards; an 8-week release cadence; backward-compatible schema evolution across minor versions; transparent governance of leaderboard moves.

If a submission turns out to be misrepresented, the card moves to leaderboard/_retracted/ with the dispute reasoning, never silently deleted.

Contact: zhujianing9810@gmail.com

Authors: Jianing Zhu*, Yeonju Ro*, John Robertson, Kevin Wang, Junbo Li, Haris Vikalo, Aditya Akella, Zhangyang Wang (UT Austin). *equal contribution.