v0.3.0 · last updated 2026-05-16

AgingBench Leaderboard

Tier 2 evaluates autonomous agents (Claude Code, OpenHands, custom) that own their own session loop under maintenance shocks, scored on the S7 probe suite. Tier 1 isolates one variable at a time on the ReferenceAgent runner: model swap, memory policy, runtime controller.

Tier 1 runs use a controlled runner. The ReferenceAgent harness owns the session loop and you swap one variable at a time. Three sub-tracks, each answering a different research question.

The submitter swaps the LLM; the agent (ReferenceAgent ReAct), memory policy, compaction prompt, and seeds are held constant. This is the "tbench-shaped" track: which model ages better under identical scaffolding?

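Metric groups (left to right): Compression, Interference, Revision, Maint.
Open models · lossy compression

| Status | Model | Scale | S1 half-life (kw_m) | S2 prec. | S3 fidel. | S4 dep_rec | S6 recall | S2 accum_err | S5 recall | S6 Δshock |
|---|---|---|---|---|---|---|---|---|---|---|
| VERIFIED | Llama-3.1-8B | 8B | 5.8 | 0.40 | 0.44 | 0.20 | 0.03 | 157 | 0.33 | −0.17 / +0.08 |
| VERIFIED | Qwen3-8B | 8B | 6.2 | 0.53 | 0.46 | 0.13 | 0.15 | 192 | 0.33 | +0.04 / +0.08 |
| VERIFIED | DeepSeek-7B | 7B | 5.6 | 0.67 | 0.43 | 0.28 | 0.11 | 211 | 0.60 | −0.08 / +0.00 |
| VERIFIED | Qwen3-14B | 14B | 7.9 | 0.50 | 0.52 | 0.18 | 0.22 | 64 | 0.33 | −0.13 / +0.04 |
| VERIFIED | DeepSeek-14B | 14B | 5.9 | 0.57 | 0.42 | 0.22 | 0.08 | 107 | 0.47 | +0.00 / +0.13 |
| VERIFIED | Gemma4-31B | 31B | 4.9 | 0.57 | 0.80 | 0.18 | 0.07 | 132 | 0.33 | −0.04 / −0.04 |
| VERIFIED | gpt-oss-120B | 120B | 5.4 | 0.37 | 0.42 | 0.33 | 0.21 | 124 | 0.40 | −0.21 / −0.29 |
| VERIFIED | GPT-4o | API | 7.6 | 0.43 | 0.50 | 0.10 | 0.14 | 227 | 0.27 | +0.04 / +0.08 |
| VERIFIED | Haiku-4.5 | API | 6.9 | 0.32 | 0.48 | 0.00 | 0.13 | 153 | 0.60 | +0.00 / −0.04 |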

The S1 column shows the half-life, in sessions, of the kw_m survival curve (higher = more durable; ∞ = never crossed 0.5 within the deployment window). A marker on an S1 value indicates that at least one seed gave ∞; the reported value is then the mean over the finite seeds. Two-value entries in the Maint. column are (recompact / flush_history) shock deltas.
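The half-life convention above can be illustrated with a minimal sketch. It assumes the survival curve is a list of per-session values, that survival starts at 1.0 before the first session, and that the crossing point is found by linear interpolation; none of these details are confirmed by the leaderboard text, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def half_life(curve: Sequence[float], threshold: float = 0.5) -> float:
    """Fractional session at which the curve first reaches the threshold.

    curve[i] is the survival value after session i + 1.
    Returns math.inf if the curve never reaches the threshold.
    """
    # Assumption: survival is 1.0 before the first session.
    prev_session, prev_value = 0, 1.0
    for session, value in enumerate(curve, start=1):
        if value <= threshold:
            # Linear interpolation between the last point above the
            # threshold and the first point at or below it.
            frac = (prev_value - threshold) / (prev_value - value)
            return prev_session + frac * (session - prev_session)
        prev_session, prev_value = session, value
    return math.inf


def aggregate_half_life(per_seed: Sequence[float]) -> Tuple[float, bool]:
    """Mean over the finite seeds, plus a flag if any seed gave infinity."""
    finite = [h for h in per_seed if math.isfinite(h)]
    flagged = len(finite) < len(per_seed)
    if not finite:
        return math.inf, flagged
    return sum(finite) / len(finite), flagged
```

For example, `aggregate_half_life([6.0, math.inf, 8.0])` yields a mean of 7.0 with the flag set, which corresponds to a marked cell in the S1 column.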

This track currently has only UT-Austin-run rows. Submit your model's card to land here; the protocol is the same as for Tier 2.

Submit a model card →

Same memory policy (summarize_store), two compaction prompts (lossy vs careful), across four models — the policy-contrast slice of the paper's Table 3. Use this to read off how much of the across-row spread is attributable to the prompt rather than the model. Track B opens for external MemoryPolicy submissions in v0.3.0.

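Metric groups (left to right): Compression, Interference, Revision, Maint.
Paper Table 3 · policy contrast slice (4 models × {lossy, careful})

| Status | Model | Compaction | S1 half-life (kw_m) | S2 prec. | S3 fidel. | S4 dep_rec | S6 recall | S2 accum_err | S5 recall | S6 Δshock |
|---|---|---|---|---|---|---|---|---|---|---|
| VERIFIED | Qwen3-8B | lossy | 6.2 | 0.53 | 0.46 | 0.13 | 0.15 | 192 | 0.33 | +0.04 |
| VERIFIED | Qwen3-8B | careful | 5.9 | 0.80 | 0.30 | 0.46 | 0.11 | 123 | 0.27 | +0.21 |
| VERIFIED | Gemma4-31B | lossy | 4.9 | 0.57 | 0.80 | 0.18 | 0.07 | 132 | 0.33 | −0.04 |
| VERIFIED | Gemma4-31B | careful | 7.4 | 0.40 | 0.69 | 0.18 | 0.40 | 51 | 0.33 | −0.50 |
| VERIFIED | gpt-oss-120B | lossy | 5.4 | 0.37 | 0.42 | 0.33 | 0.21 | 124 | 0.40 | −0.21 |
| VERIFIED | gpt-oss-120B | careful |  | 0.30 | 0.63 | 0.15 | 0.33 | 180 | 0.33 | −0.21 |
| VERIFIED | GPT-4o | lossy | 7.6 | 0.43 | 0.50 | 0.10 | 0.14 | 227 | 0.27 | +0.04 |
| VERIFIED | GPT-4o | careful |  | 0.53 | 0.77 | 0.18 | 0.38 | 167 | 0.27 | −0.17 |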

All eight rows come from the paper's main results table (Table 3). A marker on an S1 value indicates that at least one seed gave ∞; the reported half-life is then the mean over the finite seeds. Read the lossy → careful rows pairwise per model: e.g. Gemma4-31B picks up substantially more S2 precision headroom under careful (+0.40 abs) but its S1 half-life moves only modestly, illustrating that compaction-prompt investment pays off where the utilization margin can absorb it (paper §6 / Finding II). Track B opens for external MemoryPolicy submissions in v0.3.0; submit a custom subclass and we'll add it to this matrix.
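The pairwise reading can be done mechanically. The sketch below computes careful minus lossy for each metric, using the two Qwen3-8B rows as read from the table above; the metric key names and the `pairwise_delta` helper are illustrative and not part of the benchmark tooling.

```python
# Metric keys are illustrative labels for the eight table columns.
METRICS = (
    "s1_half_life", "s2_prec", "s3_fidel", "s4_dep_rec",
    "s6_recall", "s2_accum_err", "s5_recall", "s6_dshock",
)

# Values taken from the Qwen3-8B lossy and careful rows of the table.
qwen_lossy = dict(zip(METRICS, (6.2, 0.53, 0.46, 0.13, 0.15, 192, 0.33, 0.04)))
qwen_careful = dict(zip(METRICS, (5.9, 0.80, 0.30, 0.46, 0.11, 123, 0.27, 0.21)))


def pairwise_delta(lossy: dict, careful: dict) -> dict:
    """Per-metric change when moving from the lossy to the careful prompt."""
    return {m: round(careful[m] - lossy[m], 2) for m in lossy}


delta = pairwise_delta(qwen_lossy, qwen_careful)
for metric, change in delta.items():
    print(f"{metric:>13}: {change:+}")
```

Whether a positive delta is an improvement depends on the metric: accumulated error is better when lower, whereas precision, fidelity and recall are better when higher.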

Submitting a memory policy? Subclass MemoryPolicy, point your SUT YAML's memory_policy.type at the import path, run the lite or full suite, and note "Track B (memory policy)" in the PR description. The first-class --memory-policy CLI flag lands in v1.1.
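As a rough illustration of the shape such a submission might take, the sketch below defines a stand-in `MemoryPolicy` base class and a trivial subclass. The real base class ships with AgingBench and its method names and signatures may differ; the `compact` method, the `KeepLastN` class, and the `my_package.policies` import path are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class MemoryPolicy(ABC):
    """Stand-in for the AgingBench base class (assumed interface)."""

    @abstractmethod
    def compact(self, history: List[str]) -> List[str]:
        """Return the memory entries to carry into the next session."""


class KeepLastN(MemoryPolicy):
    """Toy policy: keep only the n most recent entries."""

    def __init__(self, n: int = 3) -> None:
        self.n = n

    def compact(self, history: List[str]) -> List[str]:
        return history[-self.n:]


# Python mirror of the relevant SUT YAML fragment: memory_policy.type
# holds the import path of the subclass (the path here is hypothetical).
sut_config = {"memory_policy": {"type": "my_package.policies.KeepLastN"}}

policy = KeepLastN(n=2)
print(policy.compact(["fact-1", "fact-2", "fact-3"]))
```

The contributor docs remain the reference for the actual abstract methods a subclass has to implement.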

How to submit a policy →

Controller track will open in future

The ThresholdController ABC ships in core/controller.py in v0.3.0. The CLI flag (--controller-import-path) and the leaderboard track will open in future with two reference controllers (lag-recall trigger, accumulator-promotes-to-typed-state). Read the runtime-control framing in the docs.

For autonomous agents that own their session loop. S7 runs 10 sessions with maintenance shocks, scoring workspace fidelity and probe-time recall against the gold FactGraph. Default sort is recall.

# Agent Model Provider Compression Interference Revision Maint.
pytest ↑ ws_fid ↑ intf. ↑ rev_ex ↑ accum_err ↓ recall ↑ |Δshock| ↓

Direction arrows show which way is better. |Δshock| sorts by absolute magnitude: closer to zero means the agent absorbed the maintenance event without large performance shifts. All values are means over the S7 probe suite (research notes + maintenance-shock probes).
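The two orderings described here can be sketched as follows. The agent names and values are made-up placeholders, since the Tier 2 table has no rows yet, and the `Row` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Row:
    agent: str
    recall: float       # higher is better
    delta_shock: float  # signed; closer to zero is better


# Placeholder rows for illustration only.
rows = [
    Row("agent-a", 0.40, -0.21),
    Row("agent-b", 0.60, 0.04),
    Row("agent-c", 0.33, 0.13),
]

# Default sort: recall, best (highest) first.
by_recall = sorted(rows, key=lambda r: r.recall, reverse=True)

# |Δshock| sort: absolute magnitude, smallest first, so the sign is ignored.
by_shock = sorted(rows, key=lambda r: abs(r.delta_shock))

print([r.agent for r in by_recall])
print([r.agent for r in by_shock])
```

Sorting on the absolute value means a row with −0.21 ranks below one with +0.13, even though it is numerically smaller.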

Have an autonomous agent? S7 is the slot. Submit through Path A (CLI wrapper) or Path B (programmatic AgentInterface).

Submit an agent →

Submitting to the leaderboard

The protocol is light. Run with three seeds, validate your AgingCards locally, and open a PR with the cards under leaderboard/<track>/<scenario>/. CI re-validates the schema; a maintainer reviews provenance and merges within 5–7 days. The Submit page has the full walkthrough, including the CI template and the file-path convention.

Verified rows ship with a VERIFIED badge after lab re-execution; self-reported rows ship with SELF and are clearly marked. We promote rows to VERIFIED on a rolling basis as we re-execute submissions.

Contact: zhujianing9810@gmail.com · Contributor docs