v0.3.0 · last updated 2026-05-16
What are the tracks?
Tier 1 isolates one variable at a time on the ReferenceAgent runner: model swap, memory policy, or runtime controller. Tier 2 evaluates autonomous agents (Claude Code, OpenHands, custom) that own their session loop; they run under maintenance shocks and are scored on the S7 probe suite.
Tier 1 runs use a controlled runner. The ReferenceAgent harness owns the session loop and you swap one variable at a time. Three sub-tracks, each answering a different research question.
The submitter swaps the LLM; the agent (ReferenceAgent ReAct), memory policy, compaction prompt, and seeds are held constant. This is the "tbench-shaped" track: which model ages better under identical scaffolding?
Open models · lossy compression. Metric column groups, left to right: Compression, Interference, Revision, Maint.

| Status | Model | Scale | S1 half-life (kw_m) | S2 prec. | S3 fidel. | S4 dep_rec | S6 recall | S2 accum_err | S5 recall | S6 Δshock |
|---|---|---|---|---|---|---|---|---|---|---|
| VERIFIED | Llama-3.1-8B | 8B | 5.8 | 0.40 | 0.44 | 0.20 | 0.03 | 157 | 0.33 | −0.17 / +0.08 |
| VERIFIED | Qwen3-8B | 8B | 6.2 | 0.53 | 0.46 | 0.13 | 0.15 | 192 | 0.33 | +0.04 / +0.08 |
| VERIFIED | DeepSeek-7B | 7B | 5.6 | 0.67 | 0.43 | 0.28 | 0.11 | 211 | 0.60 | −0.08 / +0.00 |
| VERIFIED | Qwen3-14B | 14B | 7.9 | 0.50 | 0.52 | 0.18 | 0.22 | 64 | 0.33 | −0.13 / +0.04 |
| VERIFIED | DeepSeek-14B | 14B | 5.9 | 0.57 | 0.42 | 0.22 | 0.08 | 107 | 0.47 | +0.00 / +0.13 |
| VERIFIED | Gemma4-31B | 31B | 4.9‡ | 0.57 | 0.80 | 0.18 | 0.07 | 132 | 0.33 | −0.04 / −0.04 |
| VERIFIED | gpt-oss-120B | 120B | 5.4‡ | 0.37 | 0.42 | 0.33 | 0.21 | 124 | 0.40 | −0.21 / −0.29 |
| VERIFIED | GPT-4o | API | 7.6 | 0.43 | 0.50 | 0.10 | 0.14 | 227 | 0.27 | +0.04 / +0.08 |
| VERIFIED | Haiku-4.5 | API | 6.9 | 0.32 | 0.48 | 0.00 | 0.13 | 153 | 0.60 | +0.00 / −0.04 |
S1 column shows half-life in sessions for the kw_m survival curve (higher = more durable; ∞ = never crossed 0.5 within the deployment window). ‡ indicates that at least one seed gave ∞; reported value is the mean over the finite seeds. Two-value entries in the Maint. column are (recompact / flush_history) shock deltas.
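The ‡ convention described above can be sketched in a few lines. This is a minimal illustration, assuming per-seed half-lives are already available as floats with `math.inf` standing in for ∞. The function names and the sample values are hypothetical and are not taken from the benchmark code. Rendering an all-∞ cell as a plain "∞" is also an assumption, based on how such cells appear in the tables.

```python
import math
from statistics import mean

def aggregate_half_life(per_seed):
    """Return (value, flagged) for a list of per-seed half-lives.

    value   -- mean over the finite seeds, or math.inf if every seed is infinite
    flagged -- True when some, but not all, seeds were infinite (the ‡ case)
    """
    finite = [h for h in per_seed if math.isfinite(h)]
    if not finite:
        # Assumption: no seed crossed 0.5, so the cell is shown as a plain ∞.
        return math.inf, False
    return mean(finite), len(finite) < len(per_seed)

def format_cell(per_seed):
    """Render the S1 cell text for one model from its per-seed half-lives."""
    value, flagged = aggregate_half_life(per_seed)
    if math.isinf(value):
        return "∞"
    return f"{value:.1f}" + ("‡" if flagged else "")

# Illustrative per-seed values only (not real benchmark data).
print(format_cell([4.0, 6.0, math.inf]))  # prints 5.0‡
```

Averaging only the finite seeds means a ‡ value is a lower bound on durability rather than an estimate of the full-seed mean, which is worth keeping in mind when comparing a ‡ row against an unflagged one.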
This track currently has only UT-Austin-run rows. Submit your model's card to land here; the protocol is the same as for Tier 2.
Same memory policy (summarize_store), two compaction prompts (lossy vs careful), across four models: the policy-contrast slice of the paper's Table 3. Use this to read off how much of the across-row spread is attributable to the prompt rather than the model.
Paper Table 3 · policy contrast slice (4 models × {lossy, careful}). Metric column groups, left to right: Compression, Interference, Revision, Maint.

| Status | Model | Compaction | S1 half-life (kw_m) | S2 prec. | S3 fidel. | S4 dep_rec | S6 recall | S2 accum_err | S5 recall | S6 Δshock |
|---|---|---|---|---|---|---|---|---|---|---|
| VERIFIED | Qwen3-8B | lossy | 6.2 | 0.53 | 0.46 | 0.13 | 0.15 | 192 | 0.33 | +0.04 |
| VERIFIED | Qwen3-8B | careful | 5.9 | 0.80 | 0.30 | 0.46 | 0.11 | 123 | 0.27 | +0.21 |
| VERIFIED | Gemma4-31B | lossy | 4.9‡ | 0.57 | 0.80 | 0.18 | 0.07 | 132 | 0.33 | −0.04 |
| VERIFIED | Gemma4-31B | careful | 7.4‡ | 0.40 | 0.69 | 0.18 | 0.40 | 51 | 0.33 | −0.50 |
| VERIFIED | gpt-oss-120B | lossy | 5.4‡ | 0.37 | 0.42 | 0.33 | 0.21 | 124 | 0.40 | −0.21 |
| VERIFIED | gpt-oss-120B | careful | ∞ | 0.30 | 0.63 | 0.15 | 0.33 | 180 | 0.33 | −0.21 |
| VERIFIED | GPT-4o | lossy | 7.6 | 0.43 | 0.50 | 0.10 | 0.14 | 227 | 0.27 | +0.04 |
| VERIFIED | GPT-4o | careful | ∞ | 0.53 | 0.77 | 0.18 | 0.38 | 167 | 0.27 | −0.17 |
All eight rows come from the paper's main results table (Table 3). ‡ indicates that at least one seed gave ∞; the reported half-life is the mean over the finite seeds. Read the lossy → careful rows pairwise per model. For example, Gemma4-31B picks up substantially more S2 precision headroom under careful (+0.40 abs) but its S1 half-life moves only modestly, illustrating that compaction-prompt investment pays off where the utilization margin can absorb it (paper §6 / Finding II). Track B opens for external MemoryPolicy submissions in v0.3.0: submit a custom subclass and we'll add it to this matrix.
Submitting a memory policy? Subclass MemoryPolicy, point your SUT YAML's memory_policy.type at the import path, run the lite or full suite, and note "Track B (memory policy)" in the PR description. The first-class --memory-policy CLI flag lands in v1.1.
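As a rough picture of what such a subclass might look like, here is a minimal sketch. The `MemoryPolicy` base class below is a stand-in defined only so the example runs on its own; the real base class, its method names (`write`, `read`), and their signatures are assumptions and should be checked against the contributor docs. `KeepLastN` is likewise a hypothetical policy invented for illustration.

```python
from abc import ABC, abstractmethod
from typing import List

class MemoryPolicy(ABC):
    """Stand-in base class (assumption): the benchmark's real interface may differ."""

    @abstractmethod
    def write(self, session_id: int, text: str) -> None:
        """Store a note produced during a session."""

    @abstractmethod
    def read(self, query: str) -> List[str]:
        """Return the notes to place in context for a query."""

class KeepLastN(MemoryPolicy):
    """Hypothetical policy: keep only the N most recent notes, drop the rest."""

    def __init__(self, n: int = 3) -> None:
        self.n = n
        self._notes: List[str] = []

    def write(self, session_id: int, text: str) -> None:
        self._notes.append(text)
        self._notes = self._notes[-self.n:]  # truncate to the newest N notes

    def read(self, query: str) -> List[str]:
        return list(self._notes)  # copy, so callers cannot mutate internal state

policy = KeepLastN(n=2)
for session, note in enumerate(["a", "b", "c"], start=1):
    policy.write(session, note)
print(policy.read("anything"))  # prints ['b', 'c']
```

With a class like this in an importable module, memory_policy.type in the SUT YAML would name its import path (for instance a hypothetical `my_pkg.policies.KeepLastN`).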
The ThresholdController ABC ships in core/controller.py in v0.3.0. The CLI flag (--controller-import-path) and the leaderboard track will open in a future release with two reference controllers (lag-recall trigger, accumulator-promotes-to-typed-state). Read the runtime-control framing in the docs.
For autonomous agents that own their session loop. S7 runs 10 sessions with maintenance shocks, scoring workspace fidelity and probe-time recall against the gold FactGraph. Default sort is recall.
Metric column groups, left to right: Compression, Interference, Revision, Maint.

| # | Agent | Model | Provider | pytest ↑ | ws_fid ↑ | intf. ↑ | rev_ex ↑ | accum_err ↓ | recall ↑ | \|Δshock\| ↓ |
|---|---|---|---|---|---|---|---|---|---|---|

Direction arrows show which way is better. |Δshock| sorts by absolute magnitude: closer to zero means the agent absorbed the maintenance event without large performance shifts. All values are means over the S7 probe suite (research notes + maintenance-shock probes).
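The two orderings mentioned here (default sort by recall, and |Δshock| by absolute magnitude) can be sketched as follows. The row dictionaries, their keys, and every value in them are made-up placeholders for illustration; they are not leaderboard entries and do not reflect the AgingCard schema.

```python
# Placeholder rows (hypothetical agents and values, not real results).
rows = [
    {"agent": "agent-a", "recall": 0.40, "dshock": -0.21},
    {"agent": "agent-b", "recall": 0.55, "dshock": 0.08},
    {"agent": "agent-c", "recall": 0.40, "dshock": 0.04},
]

def sort_by_recall(rows):
    """Default view: higher recall first (the ↑ direction)."""
    return sorted(rows, key=lambda r: r["recall"], reverse=True)

def sort_by_shock(rows):
    """|Δshock| view: smallest absolute shift first (the ↓ direction)."""
    return sorted(rows, key=lambda r: abs(r["dshock"]))

print([r["agent"] for r in sort_by_recall(rows)])  # ['agent-b', 'agent-a', 'agent-c']
print([r["agent"] for r in sort_by_shock(rows)])   # ['agent-c', 'agent-b', 'agent-a']
```

Sorting on the absolute value treats a −0.21 drop and a +0.21 gain as equally large disturbances, which matches the reading that a value closer to zero means the maintenance event was absorbed.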
Have an autonomous agent? S7 is the slot. Submit through Path A (CLI wrapper) or Path B (programmatic AgentInterface).
The protocol is light. Run with three seeds, validate your AgingCards locally, open a PR with cards under leaderboard/<track>/<scenario>/. CI re-validates schema; a maintainer reviews provenance and merges within 5–7 days. The Submit page has the full walkthrough including the CI template and the file-path convention.
Verified rows ship with a VERIFIED badge after lab re-execution; self-reported rows ship with SELF and are clearly marked. We promote rows to VERIFIED on a rolling basis as we re-execute submissions.
Contact: zhujianing9810@gmail.com · Contributor docs