Run AgingBench in 10 minutes

Quickstart: Tier 2 in ten minutes

Tier 2 is the headline — production agents on the S7 Research-Notes Coding Task. Pick an in-tree adapter (claude_code, openhands, codex) or wire your own; AgingBench runs the session loop and emits a leaderboard-ready AgingCard.

Prerequisites for Tier 2 (S7). The claude_code adapter shells out to the official Claude Code CLI, so install it once: npm i -g @anthropic-ai/claude-code. Without it, the runner hangs while the subprocess fails to start. For the openhands adapter, install the OpenHands SDK in a separate conda env and point OPENHANDS_BRIDGE_PYTHON at its python interpreter.

# Install AgingBench (one-time)
pip install "git+https://github.com/VITA-Group/AgingBench.git@v0.3.0#subdirectory=prototype"

# Install the Claude Code CLI (one-time, for the claude_code adapter)
npm i -g @anthropic-ai/claude-code

# Set your API key (Tier-2 adapters call the model directly)
export ANTHROPIC_API_KEY=sk-ant-...

# Run Tier 2 on Claude Code
agingbench run \
  --scenario s7_research_notes \
  --sut agingbench/registry/suts/claude_code/claude_code_sonnet46_s7.yaml \
  --seeds 1 --card

The run emits an AgingCard with workspace fidelity, probe-time recall, maintenance-shock deltas, and the cost breakdown. Submit it to the Tier 2 leaderboard as a PR.

No API key or no CLI install? The in-browser telemetry demo is the friction-free path — drop a JSONL trace and get the same AgingCard back, no install, no key, no GPU. The Tier-2 scenarios on this page need a production CLI agent (Claude Code or OpenHands), which intrinsically means an API key + a CLI install.

Adapters that work today

Three Tier-2 adapters ship in-tree under agingbench/core/adapters/:

claude_code: Anthropic's Claude Code (subprocess + SDK)
openhands: All Hands AI's OpenHands (isolated conda env via subprocess bridge)
codex: OpenAI Codex CLI (codex exec in non-interactive mode)

Bring your own agent

Custom agents drop in by subclassing AgentAdapter (two required methods: send_message, reset_session) and referencing your class from the SUT YAML's adapter: block. You don't modify AgingBench code.

Start from the runnable template at examples/byo_agent_minimal.py — copy it, replace the stub send_message with your agent's call, and point a SUT YAML at it:

# my_sut.yaml
adapter:
  type: custom
  class: my_pkg.my_agent:MyAgent   # importable on PYTHONPATH
  max_turns: 30                    # any extra keys are forwarded as kwargs

agingbench run --scenario s7_research_notes \
  --sut my_sut.yaml --seeds 3 --card
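
The subclassing step above can be sketched as follows. This is a minimal illustration, not the shipped template: the AgentAdapter class here is a local stand-in, and the method signatures (a string in, a string out for send_message; no arguments for reset_session) are assumptions. Check examples/byo_agent_minimal.py for the real contract.

```python
from abc import ABC, abstractmethod


class AgentAdapter(ABC):
    """Local stand-in for AgingBench's base class (signatures are assumed)."""

    @abstractmethod
    def send_message(self, message: str) -> str:
        """Send one turn to the agent and return its reply."""

    @abstractmethod
    def reset_session(self) -> None:
        """Drop per-session conversational state."""


class MyAgent(AgentAdapter):
    """Toy agent: records each message and echoes it back."""

    def __init__(self, max_turns: int = 30, **kwargs):
        # Extra keys from the SUT YAML's adapter: block arrive as kwargs.
        self.max_turns = max_turns
        self.extra = kwargs
        self.transcript = []

    def send_message(self, message: str) -> str:
        # Illustrative turn budget; how max_turns is really used is up to you.
        if len(self.transcript) >= self.max_turns:
            return "turn budget exhausted"
        self.transcript.append(message)
        # Replace this line with a call into your own agent.
        return f"echo: {message}"

    def reset_session(self) -> None:
        self.transcript.clear()


if __name__ == "__main__":
    agent = MyAgent(max_turns=2)
    print(agent.send_message("hello"))
    agent.reset_session()
```

With the real base class imported in place of the stand-in, the class: entry in my_sut.yaml would point at this MyAgent.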

The three bundled adapters (claude_code, openhands, codex) are full reference implementations of the same ABC, useful when you want to see how an opaque CLI agent is wrapped. See docs for the interface contract.

Opaque-agent caveat: the optional get_workspace_state / get_memory_text hooks default to {} / "". That's fine — the run still emits a valid AgingCard — but probe scoring then only credits what the agent recites in its reply, so S5/S7 file-survival probes will read as more aged. If your agent writes notes or scratchpads to a known directory, return them from these hooks.
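
If your agent keeps notes in a known directory, the two hooks can be backed by helpers like the ones below. This is a sketch under assumptions: the helper names and the notes_dir argument are hypothetical, and only the return types (a dict and a string, matching the {} and "" defaults) come from the caveat above.

```python
from pathlib import Path


def read_workspace_state(notes_dir: str) -> dict:
    """Map each file's relative path to its text; {} if the directory is missing."""
    root = Path(notes_dir)
    if not root.is_dir():
        return {}
    return {
        p.relative_to(root).as_posix(): p.read_text(encoding="utf-8", errors="replace")
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def read_memory_text(notes_dir: str) -> str:
    """Concatenate all notes into one string; "" if there is nothing to report."""
    state = read_workspace_state(notes_dir)
    return "\n\n".join(f"# {name}\n{text}" for name, text in state.items())
```

An adapter's get_workspace_state and get_memory_text could then return read_workspace_state(self.notes_dir) and read_memory_text(self.notes_dir), where notes_dir is whatever directory your agent writes to.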

Three release modes

Pick the mode that matches what you need. All three emit the same AgingCard format, so downstream comparisons stay consistent.

Lite Ready

S1 + S2 + S7 · 3 seeds · fixed configs · pinned schedule. The CI-friendly suite for product teams who want a fast signal before changing model, prompt, memory policy, or scaffolding.

~$5 · ~30 min on Haiku-class · needs ANTHROPIC_API_KEY

pip install "git+https://github.com/VITA-Group/AgingBench.git@v0.3.0#subdirectory=prototype"
export ANTHROPIC_API_KEY=sk-ant-...
agingbench-lite run --suite lite --card
# default SUT: agingbench/registry/suts/haiku45/
#              haiku45_lossy_compress.yaml
# default seeds: 42, 43, 44 (3 seeds)
# override either with --sut <path.yaml> / --seeds N

Full Ready

All 7 scenarios, pressure sweeps, lifecycle shocks, controller hooks, full counterfactual spectrum. For paper reproduction and serious benchmarking. (S8 SWE-bench-Aging ships in full.yaml as a Docker-only extension; it is not pressure-swept.)

~$30–80 · 4–8 h depending on scenarios · API key required

pip install "git+https://github.com/VITA-Group/AgingBench.git@v0.3.0#subdirectory=prototype"
export ANTHROPIC_API_KEY=sk-ant-...  # or OPENAI_API_KEY, depending on SUT
agingbench run \
  --suite full \
  --sut agingbench/registry/suts/<your-sut.yaml> \
  --seeds 3 --card

Telemetry Ready

Map your production traces into AgingCard metrics without running our scenarios. Compute lag-recall, write-read gaps, accumulator drift, and post-maintenance regressions from your own JSONL. Try it in your browser at telemetry.html — no install — or call the Python API directly:

$0 · runs over existing traces · in-browser via Pyodide · no API key

pip install "git+https://github.com/VITA-Group/AgingBench.git@v0.3.0#subdirectory=prototype"
python -c "
from agingbench.telemetry import trace_to_card_v11
r = trace_to_card_v11(
  trace_jsonl='my_runs.jsonl',
  trace_format='claude_code',   # or openai_assistants, openhands,
                                #    langfuse, langsmith, otlp, generic
  profile='code_assistant')     # or generic
print(r.card['headline'])"

Cost & time, per scenario

Estimates from a single seed run on the default lossy-compress memory policy. Multiply by the number of seeds (default Lite: 3). Token counts come from cost_and_efficiency in the emitted AgingCard.

Scenario	Sessions	Tokens / run	Calls / run	$ on Haiku-4.5	$ on GPT-4o	Wall-clock
S1 · Research Literature	8	~40k	~80	$0.18	$0.42	~4 min
S2 · Lifestyle Assistant	10	~45k	~100	$0.20	$0.48	~5 min
S3 · Knowledge Base	12	~60k	~120	$0.27	$0.65	~7 min
S4 · Software Engineering	10	~55k	~110	$0.24	$0.58	~6 min
S5 · Self-Planning Notebook	10	~70k	~140	$0.31	$0.76	~8 min
S6 · Naturalistic	15	~85k	~170	$0.38	$0.92	~10 min
S7 · Research-Notes Coding Task (Tier-2)	10	~120k	~200	$0.55	$1.30	~12 min
Lite (S1+S2+S7, 3 seeds)	—	~615k	~1140	~$5	~$12	~30 min

Numbers are guidance only. Your actual cost depends on retry behavior, scaffolding overhead, and pricing changes. For any specific run, cost_and_efficiency.total_cost_usd in the emitted card is the ground truth.
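
The multiply-by-seeds rule can be checked with a few lines of arithmetic. The per-seed figures below are copied from the Haiku-4.5 column of the table; the function name and dictionary layout are just for illustration.

```python
# Per-seed estimates copied from the table above (Haiku-4.5 column).
PER_SEED = {
    "s1": {"tokens": 40_000, "calls": 80, "usd": 0.18},
    "s2": {"tokens": 45_000, "calls": 100, "usd": 0.20},
    "s7": {"tokens": 120_000, "calls": 200, "usd": 0.55},
}


def suite_estimate(scenarios, seeds):
    """Sum per-seed estimates over the given scenarios, then scale by seeds."""
    totals = {"tokens": 0, "calls": 0, "usd": 0.0}
    for name in scenarios:
        for key in totals:
            totals[key] += PER_SEED[name][key] * seeds
    totals["usd"] = round(totals["usd"], 2)
    return totals


if __name__ == "__main__":
    print(suite_estimate(["s1", "s2", "s7"], seeds=3))
```

For S1+S2+S7 at three seeds this gives 615,000 tokens and 1,140 calls, matching the Lite row. The plain dollar product is about $2.79, lower than the ~$5 quoted for Lite, so the Lite figure presumably leaves headroom for retries and overhead; that reading is an assumption, not something the table states.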

Tier 1 tracks (controlled runner)

Tier 2 is the headline (above). For deeper diagnosis on the controlled ReferenceAgent runner, three Tier 1 swap-points let you isolate one variable at a time. Track A is live today via the SUT YAML; B is live via the same YAML, with a first-class --memory-policy CLI flag planned for v1.1; C is the runtime-controller track and opens with two reference controllers in v1.1. Each track gets its own sub-tab of the Tier 1 leaderboard.

A · Model swap · Ready

Swap the LLM; hold the agent, memory policy, and seeds fixed. Does this model age slower under identical scaffolding? Pick a different SUT YAML from agingbench/registry/suts/.

--sut agingbench/registry/suts/claude_sonnet46/….yaml

B · Memory policy · v1.1 CLI

Subclass MemoryPolicy and hold the model fixed. Does this memory backbone reduce aging? The track for memory-systems research. Set memory_policy.type in the SUT YAML today; a first-class --memory-policy flag is planned for v1.1. Runnable template: examples/byo_memory_minimal.py.

C · Runtime controller · v1.1 CLI

Attach a controller that observes per-session signals and fires interventions (recompact, force-read, promote to typed state). The ThresholdController ABC ships in core/controller.py in v0.3.0; first-class --controller-import-path flag and two reference controllers (lag-recall trigger, accumulator-promotes-to-typed-state) land in v1.1.
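
To make the observe-then-intervene idea concrete, here is a toy threshold trigger. It is a hypothetical stand-in: it does not subclass the ThresholdController ABC, and the method name observe, the 0.6 threshold, the cooldown, and the "recompact" return value are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LagRecallTrigger:
    """Fire an intervention when a per-session recall signal drops too low.

    Hypothetical sketch; not the AgingBench controller interface.
    """

    threshold: float = 0.6   # assumed cut-off; tune per model
    cooldown: int = 2        # minimum sessions between two interventions
    _last_fired: Optional[int] = field(default=None, init=False)

    def observe(self, session: int, lag_recall: float) -> Optional[str]:
        """Return an intervention name, or None if nothing should fire."""
        in_cooldown = (
            self._last_fired is not None
            and session - self._last_fired < self.cooldown
        )
        if lag_recall < self.threshold and not in_cooldown:
            self._last_fired = session
            return "recompact"
        return None


if __name__ == "__main__":
    trigger = LagRecallTrigger()
    signals = [0.9, 0.7, 0.55, 0.5, 0.5, 0.8]  # made-up recall values
    print([trigger.observe(i, s) for i, s in enumerate(signals)])
```

The cooldown keeps the trigger from firing on every session while the signal stays low, which is one simple way to avoid back-to-back interventions.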

Tier 2 (full autonomous agents like Claude Code / OpenHands) is covered above in the Quickstart and has its own panel on the leaderboard. Cross-tier and cross-track comparisons are not apples-to-apples by design — the leaderboard keeps each in its own tab so the comparison stays honest.

AgingCard: a portable run summary

Every AgingBench run emits one aging_card.json with a version-pinned schema. Use it for CI gates, dashboards, internal eval pipelines, or as a leaderboard submission. The schema is the operational contract between the benchmark and the rest of your stack.

{
  "schema_version": "1.0.0",
  "scenario": "s2_lifestyle_assistant",
  "sut": {
    "sut_id": "haiku45_lossy_compress",
    "model_provider": "anthropic",
    "model_id": "claude-haiku-4-5",
    "memory_policy_type": "summarize_store"
  },
  "seed": 42, "n_sessions": 10,
  "headline": {
    "metric_name": "constraint_precision",
    "m0": 0.92, "m_final": 0.32,
    "half_life": 6.9,
    "decay_slope": -0.06,
    "aging_detected": true
  },
  "mechanism_metrics": { "compression": { /* ... */ },
    "interference": { /* ... */ },
    "revision": { /* ... */ }, "maintenance": { /* ... */ } },
  "cost_and_efficiency": {
    "total_input_tokens": 42150, "total_output_tokens": 2880,
    "tokens_per_session_mean": 4503, "total_calls": 102
  },
  "checkpoints": [[0, 1.00], [1, 0.92], /* ... */],
  "provenance": { "git_sha": "...", "agingbench_version": "0.3.0" },
  "warnings": []
}

Validate any card against the schema:

python -m agingbench.metrics.aging_card_validate ./out/aging_card.json
# → OK: card validates against schema 1.0.0
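
Once a card validates, downstream code can read it as ordinary JSON. The sketch below uses a trimmed copy of the sample card above and derives two convenience numbers; the summarize helper and the relative_drop quantity are illustrative and not part of the schema.

```python
import json

# Trimmed copy of the sample card above; real cards carry more fields.
SAMPLE_CARD = json.loads("""
{
  "scenario": "s2_lifestyle_assistant",
  "seed": 42, "n_sessions": 10,
  "headline": {"metric_name": "constraint_precision",
               "m0": 0.92, "m_final": 0.32,
               "half_life": 6.9, "aging_detected": true},
  "cost_and_efficiency": {"total_input_tokens": 42150,
                          "total_output_tokens": 2880}
}
""")


def summarize(card: dict) -> dict:
    """Pull the headline numbers and derive two convenience values."""
    h = card["headline"]
    cost = card["cost_and_efficiency"]
    total_tokens = cost["total_input_tokens"] + cost["total_output_tokens"]
    drop = (h["m0"] - h["m_final"]) / h["m0"] if h["m0"] else None
    return {
        "metric": h["metric_name"],
        "relative_drop": None if drop is None else round(drop, 3),
        "half_life": h["half_life"],
        "aging_detected": h["aging_detected"],
        "tokens_per_session": total_tokens / card["n_sessions"],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_CARD))
```

For the sample card, input plus output tokens divided by n_sessions comes to 4503, which agrees with its tokens_per_session_mean. Whether the field is defined exactly that way is an assumption; the schema reference is the authority.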

→ Field-by-field schema reference · → 8 sample cards (one per scenario) · Use AgingCard in your eval pipeline via the integration adapters.

Use AgingBench where you already evaluate

AgingCard is designed to slot into existing eval pipelines. The adapters in examples/ convert cards to the formats those systems expect. All adapters are tiny — start from the skeleton and extend.

OpenAI Evals Beta

Translates AgingCard to OpenAI Evals JSONL records. Good enough to wire AgingBench into an existing Evals pipeline; the exact payload shape depends on which eval flavor you use.

examples/openai_evals_adapter.py →

LangSmith Beta

Emits LangSmith dataset records from the AgingCard. Upload through the LangSmith API with your project key.

examples/langsmith_adapter.py →

Langfuse / OTel Beta

Produces parent + child spans in OpenTelemetry / Langfuse format. Batch through the Langfuse SDK in a real deployment.

examples/langfuse_adapter.py →

MCP Planned v1.1

Memory-event surface over MCP so production agents can stream tool / memory events into AgingBench's diagnostics live.

→ Roadmap

Writing your own adapter? The schema is stable. Read one card, emit your format, open a PR with your examples/<target>_adapter.py.
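
The "read one card, emit your format" pattern can be as small as the sketch below. The output record layout is a made-up generic target, not the format of any system listed above, and treating each checkpoints entry as a [session, value] pair is an assumption based on the sample card.

```python
import json


def card_to_records(card: dict) -> list:
    """Flatten one AgingCard into a headline record plus one record per checkpoint."""
    base = {
        "scenario": card["scenario"],
        "sut_id": card["sut"]["sut_id"],
        "seed": card["seed"],
    }
    records = [{**base, "kind": "headline", **card["headline"]}]
    for session, value in card.get("checkpoints", []):
        records.append(
            {**base, "kind": "checkpoint", "session": session, "value": value}
        )
    return records


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


if __name__ == "__main__":
    card = {
        "scenario": "s2_lifestyle_assistant",
        "sut": {"sut_id": "haiku45_lossy_compress"},
        "seed": 42,
        "headline": {"metric_name": "constraint_precision", "half_life": 6.9},
        "checkpoints": [[0, 1.00], [1, 0.92]],
    }
    print(to_jsonl(card_to_records(card)))
```

A real adapter would swap the record layout for whatever the target system expects and keep the card-reading half unchanged.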

Lite sanity check in CI

Drop this into .github/workflows/aging.yml to run a single Tier-1 scenario (S1 keyword retention on a Haiku-class SUT) on every pull request and fail the build if the half-life regresses. Cheap enough to gate every PR (~$0.50, ~3 min). The full Lite suite (S1+S2+S7, including the Tier-2 autonomous-agent run) is for local pre-release verification, not per-PR gating.

name: AgingBench sanity check (S1)

on: [pull_request]

jobs:
  aging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }

      - run: pip install "git+https://github.com/VITA-Group/AgingBench.git@v0.3.0#subdirectory=prototype"

      - name: Run S1 (Tier-1 keyword-retention, single scenario for a predictable output path)
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          agingbench run \
            --scenario s1_research_literature \
            --sut agingbench/registry/suts/haiku45/haiku45_lossy_compress.yaml \
            --card --output ./out

      - name: Validate all AgingCards produced
        run: |
          for card in $(find ./out -name aging_card.json); do
            python -m agingbench.metrics.aging_card_validate "$card"
          done

      - name: Gate on S1 headline half-life
        run: |
          python - <<'PY'
          import glob, json, sys
          paths = glob.glob('./out/**/aging_card.json', recursive=True)
          assert paths, 'no aging_card.json produced under ./out'
          for p in paths:
              with open(p) as f:
                  c = json.load(f)
              hl = c['headline'].get('half_life')
              if hl is None or hl < 5.0:
                  sys.exit(f"regression in {p}: half_life={hl} < 5.0")
              print(f"OK {p}: half_life={hl}")
          PY

      - uses: actions/upload-artifact@v4
        with: { name: aging_cards, path: ./out/**/aging_card.json }

Tune the gate thresholds to your model's baseline. The card's headline block is stable across schema-compatible versions.
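
One way to tune the gate is to derive the threshold from a handful of baseline runs instead of hard-coding 5.0. The sketch below takes a fraction of the baseline median; the 0.8 margin, the function names, and the baseline values are all illustrative assumptions.

```python
import statistics


def gate_threshold(baseline_half_lives, margin=0.8):
    """Return margin times the median of the baseline half-lives."""
    if not baseline_half_lives:
        raise ValueError("need at least one baseline half-life")
    return margin * statistics.median(baseline_half_lives)


def passes(half_life, threshold):
    """A missing half-life counts as a failure, as in the workflow above."""
    return half_life is not None and half_life >= threshold


if __name__ == "__main__":
    baseline = [6.9, 7.4, 6.5]  # made-up half-lives from earlier runs
    threshold = gate_threshold(baseline)
    print(round(threshold, 2), passes(6.0, threshold), passes(5.0, threshold))
```

Using the median keeps a single unlucky seed from dragging the threshold down, and the margin leaves room for ordinary run-to-run noise.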

Submit your result

Submissions land on the appropriate leaderboard track. State which track you are submitting against (model / memory-policy / component / controller / Tier-2 agent) in the PR description; the protocol is light and human-reviewed.

Run with three seeds. Use the suite's default session count, maintenance schedule, and probe budget. Don't modify timeouts.
Validate locally: python -m agingbench.metrics.aging_card_validate ./out/aging_card.json
Open a PR with your cards under leaderboard/<track>/<scenario>/<your-key>__seed{N}.json. Use the AgingCard submission PR template. CI re-validates schemas automatically.
Maintainer review. We check provenance (the git_sha field, the trace.jsonl reference), confirm the schema has not been tampered with, review the COI disclosure, and merge within 5–7 days. Self-reported rows ship with a SELF badge; lab re-execution promotes to VERIFIED.
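
The submission path pattern from the third step can be built with a small helper like this one. The helper name is hypothetical, and the track directory name in the example is a guess; use whatever directory names the leaderboard repository actually contains.

```python
from pathlib import PurePosixPath


def submission_path(track: str, scenario: str, key: str, seed: int) -> str:
    """Build leaderboard/<track>/<scenario>/<your-key>__seed{N}.json."""
    for part in (track, scenario, key):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    filename = f"{key}__seed{seed}.json"
    return str(PurePosixPath("leaderboard", track, scenario, filename))


if __name__ == "__main__":
    # "tier2_agent" and "my_lab_claude_code" are placeholder names.
    for seed in (42, 43, 44):
        print(submission_path("tier2_agent", "s7_research_notes",
                              "my_lab_claude_code", seed))
```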

Email zhujianing9810@gmail.com if you'd rather send a tarball than open a PR. For maintenance/governance details see the contributor docs.