AgingBench Leaderboard

Tier 1 runs use a controlled runner: the ReferenceAgent harness owns the session loop, and you swap one variable at a time. There are three sub-tracks, each answering a different research question.

The submitter swaps the LLM; the agent (ReferenceAgent ReAct), memory policy, compaction prompt, and seeds are held constant. This is the "tbench-shaped" track: which model ages better under identical scaffolding?

Status	Model	Scale	Compression			Interference		Revision		Maint.
Status	Model	Scale	S1 half-life(kw_m)	S2 prec.	S3 fidel.	S4 dep_rec	S6 recall	S2 accum_err	S5 recall	S6 Δshock
Open models · lossy compression
VERIFIED	Llama-3.1-8B	8B	5.8	0.40	0.44	0.20	0.03	157	0.33	−0.17 / +0.08
VERIFIED	Qwen3-8B	8B	6.2	0.53	0.46	0.13	0.15	192	0.33	+0.04 / +0.08
VERIFIED	DeepSeek-7B	7B	5.6	0.67	0.43	0.28	0.11	211	0.60	−0.08 / +0.00
VERIFIED	Qwen3-14B	14B	7.9	0.50	0.52	0.18	0.22	64	0.33	−0.13 / +0.04
VERIFIED	DeepSeek-14B	14B	5.9	0.57	0.42	0.22	0.08	107	0.47	+0.00 / +0.13
VERIFIED	Gemma4-31B	31B	4.9^‡	0.57	0.80	0.18	0.07	132	0.33	−0.04 / −0.04
VERIFIED	gpt-oss-120B	120B	5.4^‡	0.37	0.42	0.33	0.21	124	0.40	−0.21 / −0.29
VERIFIED	GPT-4o	API	7.6	0.43	0.50	0.10	0.14	227	0.27	+0.04 / +0.08
VERIFIED	Haiku-4.5	API	6.9	0.32	0.48	0.00	0.13	153	0.60	+0.00 / −0.04

The S1 column shows the half-life, in sessions, of the kw_m survival curve (higher = more durable; ∞ = never crossed 0.5 within the deployment window). ^‡ indicates that at least one seed gave ∞; the reported value is the mean over the finite seeds. Two-value entries in the Maint. column are the (recompact / flush_history) shock deltas.
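The half-life convention described above can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the function names, the 0-based session indexing, and the linear interpolation between sessions are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def half_life(curve: Sequence[float], threshold: float = 0.5) -> float:
    """Session at which a survival curve first drops below `threshold`.

    Assumption: curve[i] is the survival value at session i (0-based), and
    the crossing point is linearly interpolated between adjacent sessions.
    Returns math.inf if the curve never drops below the threshold.
    """
    for i, value in enumerate(curve):
        if value < threshold:
            if i == 0:
                return 0.0
            prev = curve[i - 1]  # prev >= threshold > value, so prev - value > 0
            return (i - 1) + (prev - threshold) / (prev - value)
    return math.inf


def summarize_seeds(half_lives: List[float]) -> str:
    """Format per-seed half-lives the way the S1 column is described.

    Mean over the finite seeds, with a ^‡ flag if any seed gave infinity;
    a bare infinity sign if no seed ever crossed the threshold.
    """
    finite = [h for h in half_lives if math.isfinite(h)]
    if not finite:
        return "∞"
    flag = "^‡" if len(finite) < len(half_lives) else ""
    return f"{sum(finite) / len(finite):.1f}{flag}"


if __name__ == "__main__":
    print(half_life([1.0, 0.75, 0.25]))           # crosses 0.5 between sessions 1 and 2
    print(summarize_seeds([4.0, 6.0, math.inf]))  # one seed never crossed
```

Note that a flagged value is a mean over only the seeds that decayed, so it understates durability relative to an unflagged value of the same size.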

This track currently has only UT-Austin-run rows. To add a row, submit your model's card; the protocol is the same as for Tier 2.

Submit a model card →

This track holds the memory policy fixed (summarize_store) and compares two compaction prompts (lossy vs. careful) across four models: the policy-contrast slice of the paper's Table 3. Use it to read off how much of the across-row spread is attributable to the prompt rather than the model. Track B opens for external MemoryPolicy submissions in v0.3.0.

Status	Model	Compaction	Compression			Interference		Revision		Maint.
Status	Model	Compaction	S1 half-life(kw_m)	S2 prec.	S3 fidel.	S4 dep_rec	S6 recall	S2 accum_err	S5 recall	S6 Δshock
Paper Table 3 · policy contrast slice (4 models × {lossy, careful})
VERIFIED	Qwen3-8B	lossy	6.2	0.53	0.46	0.13	0.15	192	0.33	+0.04
VERIFIED	Qwen3-8B	careful	5.9	0.80	0.30	0.46	0.11	123	0.27	+0.21
VERIFIED	Gemma4-31B	lossy	4.9^‡	0.57	0.80	0.18	0.07	132	0.33	−0.04
VERIFIED	Gemma4-31B	careful	7.4^‡	0.40	0.69	0.18	0.40	51	0.33	−0.50
VERIFIED	gpt-oss-120B	lossy	5.4^‡	0.37	0.42	0.33	0.21	124	0.40	−0.21
VERIFIED	gpt-oss-120B	careful	∞	0.30	0.63	0.15	0.33	180	0.33	−0.21
VERIFIED	GPT-4o	lossy	7.6	0.43	0.50	0.10	0.14	227	0.27	+0.04
VERIFIED	GPT-4o	careful	∞	0.53	0.77	0.18	0.38	167	0.27	−0.17

All eight rows come from the paper's main results table (Table 3). ^‡ indicates that at least one seed gave ∞; the reported half-life is the mean over the finite seeds. Read the lossy → careful rows pairwise per model. For example, Qwen3-8B gains 0.27 absolute S2 precision under careful (0.53 → 0.80) while its S1 half-life barely moves (6.2 → 5.9), illustrating that compaction-prompt investment pays off where the utilization margin can absorb it (paper §6 / Finding II). Track B opens for external MemoryPolicy submissions in v0.3.0; submit a custom subclass and we'll add it to this matrix.
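The pairwise reading can be made mechanical with a short script. The sketch below copies four columns from the table above (S2 precision, S3 fidelity, S6 recall, S2 accumulated error) and computes careful minus lossy per model; the dictionary layout and key names are illustrative choices, and S1 half-life is left out because some entries are ∞.

```python
from typing import Dict, Tuple

# Values copied from the policy-contrast table above.
# Column order: (S2 prec., S3 fidel., S6 recall, S2 accum_err)
METRICS = ("s2_prec", "s3_fidel", "s6_recall", "s2_accum_err")

ROWS: Dict[Tuple[str, str], Tuple[float, float, float, float]] = {
    ("Qwen3-8B", "lossy"): (0.53, 0.46, 0.15, 192),
    ("Qwen3-8B", "careful"): (0.80, 0.30, 0.11, 123),
    ("Gemma4-31B", "lossy"): (0.57, 0.80, 0.07, 132),
    ("Gemma4-31B", "careful"): (0.40, 0.69, 0.40, 51),
    ("gpt-oss-120B", "lossy"): (0.37, 0.42, 0.21, 124),
    ("gpt-oss-120B", "careful"): (0.30, 0.63, 0.33, 180),
    ("GPT-4o", "lossy"): (0.43, 0.50, 0.14, 227),
    ("GPT-4o", "careful"): (0.53, 0.77, 0.38, 167),
}


def careful_minus_lossy() -> Dict[str, Dict[str, float]]:
    """Per-model change in each metric when moving from lossy to careful."""
    models = sorted({model for model, _ in ROWS})
    deltas: Dict[str, Dict[str, float]] = {}
    for model in models:
        lossy = ROWS[(model, "lossy")]
        careful = ROWS[(model, "careful")]
        deltas[model] = {
            name: round(c - l, 2) for name, l, c in zip(METRICS, lossy, careful)
        }
    return deltas


if __name__ == "__main__":
    for model, delta in careful_minus_lossy().items():
        print(model, delta)
```

The deltas do not all point the same way: S2 precision rises for Qwen3-8B and GPT-4o but falls for Gemma4-31B and gpt-oss-120B, so the prompt effect is best read per model rather than as a single average.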

Submitting a memory policy? Subclass MemoryPolicy, point your SUT YAML's memory_policy.type at the import path, run the lite or full suite, and note "Track B (memory policy)" in the PR description. The first-class --memory-policy CLI flag lands in v1.1.
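As a rough picture of what "subclass MemoryPolicy" might look like, here is a self-contained sketch. The real MemoryPolicy interface is not described on this page, so the stand-in base class, its single `update` method, and the list-of-strings memory representation are all assumptions for illustration; check the actual base class before writing a submission.

```python
from abc import ABC, abstractmethod
from typing import List


class MemoryPolicy(ABC):
    """Stand-in for the benchmark's base class (hypothetical interface)."""

    @abstractmethod
    def update(self, memory: List[str], session_notes: List[str]) -> List[str]:
        """Return the memory to carry into the next session."""


class KeepLastK(MemoryPolicy):
    """Toy policy: keep only the k most recent entries, with no summarization."""

    def __init__(self, k: int = 3) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k

    def update(self, memory: List[str], session_notes: List[str]) -> List[str]:
        # Append the new notes, then truncate from the front (oldest first).
        return (memory + session_notes)[-self.k:]


if __name__ == "__main__":
    policy = KeepLastK(k=2)
    print(policy.update(["fact A", "fact B"], ["fact C"]))
```

With a class like this in an importable module, the SUT YAML's memory_policy.type would point at its import path (for instance a hypothetical `my_policies.KeepLastK`).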

How to submit a policy →

Controller track will open in a future release

The ThresholdController ABC ships in core/controller.py in v0.3.0. The CLI flag (--controller-import-path) and the leaderboard track will open in a future release, with two reference controllers (lag-recall trigger, accumulator-promotes-to-typed-state). Read the runtime-control framing in the docs.

This track is for autonomous agents that own their session loop. S7 runs 10 sessions with maintenance shocks, scoring workspace fidelity and probe-time recall against the gold FactGraph. The default sort is by recall.

#	Agent	Model	Provider	Compression		Interference		Revision		Maint.
				pytest ↑	ws_fid ↑	intf. ↑	rev_ex ↑	accum_err ↓	recall ↑	|Δshock| ↓

Direction arrows show which way is better. |Δshock| sorts by absolute magnitude: closer to zero means the agent absorbed the maintenance event without large performance shifts. pytest and ws_fid are final-session values (m_F); recall, intf, rev_ex, and accum_err are means over all S7 sessions (m̄), since session-9 probes are hardcoded and collapse m_F to scenario-invariant floors.
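The two aggregation rules and the |Δshock| ordering can be sketched as below. The metric names follow the column headers; the function names, the per-session list layout, and the `delta_shock` row key are assumptions made for this example rather than the benchmark's actual data model.

```python
from statistics import mean
from typing import Dict, List

# Per the note above: these two use the final-session value (m_F) ...
FINAL_METRICS = {"pytest", "ws_fid"}
# ... and these use the mean over all S7 sessions (m-bar).
MEAN_METRICS = {"recall", "intf", "rev_ex", "accum_err"}


def aggregate(per_session: Dict[str, List[float]]) -> Dict[str, float]:
    """Collapse per-session metric values into one leaderboard value each."""
    result: Dict[str, float] = {}
    for name, values in per_session.items():
        if name in FINAL_METRICS:
            result[name] = values[-1]
        elif name in MEAN_METRICS:
            result[name] = mean(values)
        else:
            raise KeyError(f"unknown metric: {name}")
    return result


def sort_by_shock(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Order rows by absolute shock delta, smallest (best) first."""
    return sorted(rows, key=lambda row: abs(row["delta_shock"]))


if __name__ == "__main__":
    print(aggregate({"pytest": [0.5, 1.0], "recall": [0.25, 0.75]}))
    print(sort_by_shock([{"delta_shock": -0.5}, {"delta_shock": 0.1}]))
```

Sorting on the absolute value means a gain of 0.1 ranks ahead of a loss of 0.2: the column rewards stability across the maintenance event, not improvement.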

Have an autonomous agent? S7 is the slot. Submit through Path A (CLI wrapper) or Path B (programmatic AgentInterface).

Submit an agent →


Submitting to the leaderboard