AgingBench: AI Agents Age Too

85%

Maximum recall drop across 10 sessions — frozen weights, same scaffolding (S7 · GPT-4o-mini · OpenHands)

4.5×

Half-life spread from memory policy alone — bigger than any model swap (S1 · careful vs. lossy compaction)

67%

Post-shock cliff from a single flush-history maintenance event — no recovery (S6 naturalistic)

3%

Claude Code 4.7 still not better than 4.6? Mean recall on S7 drops from 0.82 (Sonnet-4.6) to 0.79 (Opus-4.7), a 0.03 absolute difference.

We're looking for collaborators with production agent traces, sponsors for larger-scale benchmarking, and contributors with new scenarios for agent lifespan engineering.

Abstract

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model.

We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, where write-time summarization drops future-relevant details; interference aging, where accumulated similar memories crowd out the target fact; revision aging, where changed or derived state is not updated correctly; and maintenance aging, where lifecycle events such as flushing or recompaction trigger regressions. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline.

Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, roughly 400 runs spanning 8 to 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.

Agent Lifespan Engineering (ALE)

Three key questions:

How long does a deployed agent remain reliable?
How does reliability decay: through compression, interference, revision, or maintenance?
Where should repair target: writing, retrieval, utilization, or the memory lifecycle?

AgingBench is NOT: biological aging, one-shot hallucination, or just long-context evaluation.

Fresh deployment vs. aged agent: same model, same input/output surface. After enough sessions, the memory store clutters, signals fade, and the agent starts looping on itself.

Three ways in

AgingBench is a paper, a leaderboard, and a runnable benchmark. Pick the door that matches what you want to do next.

For builders

Run AgingBench

One command, ten minutes. Three release modes (Lite / Full / Lifespan Check) and a plug-and-play surface.

→ Get started

For comparison

Leaderboard

Multi-track results: model swaps, custom memory policies, runtime controllers, autonomous agents.

→ See results

For depth

Docs & methodology

Seven scenarios in detail. AgingCard schema. Counterfactual diagnosis. Contributing and roadmap.

→ Read the docs

① How long does an agent remain reliable? Day 1 → Day N

Across scenarios, models, and memory policies, agents that pass day-one evaluation often show longitudinal degradation across sessions. See more results in our evaluation →

Day 1 → Day N across the four mechanisms. Chat-bubble examples (left) show how each failure mode tends to read to a user (omission, confusion, staleness, collapse). Curves (center) show recall/precision declining across sessions for representative models on each mechanism.

② How does reliability decay? Four aging mechanisms

Decay is rarely a single phenomenon. AgingBench organizes the observed failure patterns into four mechanisms, each pointing to a different way reliability can erode. See the worked examples →

Compression aging: the agent's recall signal declines across sessions.

Interference aging: similar memories crowd out the target across sessions.

Revision aging: the agent keeps citing a stale value after an update.

Maintenance aging: a lifecycle shock causes a performance cliff.

Compression

Write-time abstraction destroys information before future queries are known. Low-frequency details (numbers, names, constraint values) are discarded; high-level summaries survive. The system "remembers what it was about" but loses what it actually said.
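A minimal sketch of that pattern, under stated assumptions: `lossy_summarize` and `recall_detail` are hypothetical helpers, and dropping digit-bearing tokens is a crude stand-in for write-time summarization, not AgingBench's compaction policy.

```python
import re

def lossy_summarize(note: str, max_words: int = 6) -> str:
    """Toy write-time compaction: drop tokens containing digits, keep the first few words."""
    words = [w for w in note.split() if not re.search(r"\d", w)]
    return " ".join(words[:max_words])

def recall_detail(memory: str, detail: str) -> bool:
    """A later probe succeeds only if the exact detail survived the write."""
    return detail in memory

note = "Budget for the Lisbon offsite is capped at 4200 EUR per team"
memory = lossy_summarize(note)

print(memory)                           # the gist survives
print(recall_detail(memory, "Lisbon"))  # topic is still recallable
print(recall_detail(memory, "4200"))    # the constraint value is gone
```

The summary still says what the note was about, but the one value a future query might need was discarded before that query existed.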

Interference

Even when no information is lost and nothing has changed, growing state buries the relevant fact behind similar entries during retrieval. Orthogonal to revision: freezing all facts does not prevent it.
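As a rough illustration, assume a retriever that ranks entries by shared tokens and returns the top `k`. The scoring function, the store contents, and the query are all invented for this sketch.

```python
def overlap(query: str, entry: str) -> int:
    """Crude lexical similarity: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(entry.lower().split()))

def retrieve(query: str, store: list[str], k: int = 2) -> list[str]:
    """Return the k entries with the highest overlap (ties keep insertion order)."""
    return sorted(store, key=lambda e: -overlap(query, e))[:k]

target = "billing service throttle is 120 requests per minute"
query = "api rate limit for billing service"

store = [target]  # the target fact is written once and never changes
hits = []
for session, svc in enumerate(["search", "auth", "email"], start=1):
    # Each session adds a similar-looking entry about a different service.
    store.append(f"api rate limit for {svc} service is {session * 100}")
    hits.append(target in retrieve(query, store))

print(hits)  # whether the target is still in the top-k after each session
```

The target is never edited or deleted. It simply falls out of the top-k once enough lookalike entries outscore it.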

Revision

The agent fails to track changing truth. Especially severe for dynamic latent state (a budget, a counter, a configuration) where the answer is derived from accumulated updates. A single missed update silently contaminates every subsequent query.
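The contamination effect can be sketched with a running total. The budget deltas and the index of the missed update are made up for illustration.

```python
from itertools import accumulate

updates = [500, -120, -80, 200, -50]  # true budget deltas, one per session
missed = 1                            # index of the update the agent fails to apply

# Ground truth: running budget after each session.
truth = list(accumulate(updates))

# Agent belief: same running sum, except one delta was never applied.
seen = [0 if i == missed else u for i, u in enumerate(updates)]
belief = list(accumulate(seen))

errors = [b - t for b, t in zip(belief, truth)]
print(truth)
print(belief)
print(errors)  # zero before the miss, a constant offset afterwards
```

Because the answer is derived from every prior update, the error does not fade: every query after the missed session inherits the same offset.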

Maintenance

Routine operational events (recompaction, prompt updates, log cleanup, model swaps) silently alter behavior. Driven by actions on the agent, not by its interaction with memory. Causes performance cliffs invisible to standard evaluation.

③ Where should repair target? Component-level diagnosis

Knowing how long and how a system tends to decay is not yet enough to act on. The third ALE question asks where in the memory pipeline a failure originates — the candidate repair site. AgingBench instruments the four pipeline components (𝒲 write, 𝒮 store, ℛ retrieve, 𝒰 utilize) and uses paired counterfactual probes that swap each component with an oracle to surface a stage-level diagnostic profile.

Memory pipeline

Four candidate repair sites: 𝒲 (write/compression), 𝒮 (store), ℛ (retrieve), 𝒰 (utilize). Counterfactual probes swap each with an oracle to localize the stage a failure is consistent with.
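The oracle-swap idea can be sketched on a toy pipeline. This is an illustrative assumption, not the AgingBench implementation: the stage names, the `run` and `diagnose` helpers, and the identity oracles are hypothetical, and the store stage is omitted for brevity.

```python
from typing import Callable, Dict

Stage = Callable[[str], str]
STAGES = ("write", "retrieve", "utilize")

def run(pipeline: Dict[str, Stage], fact: str) -> str:
    """Pass a fact through write -> retrieve -> utilize and return the answer."""
    value = fact
    for name in STAGES:
        value = pipeline[name](value)
    return value

def diagnose(pipeline, oracle, fact, expected) -> Dict[str, bool]:
    """For each stage, swap in its oracle and record whether the answer is repaired."""
    return {
        name: run({**pipeline, name: oracle[name]}, fact) == expected
        for name in STAGES
    }

identity: Stage = lambda s: s
oracle = {name: identity for name in STAGES}           # perfect stages (toy)
aged = {**oracle, "write": lambda s: s.split("=")[0]}  # lossy write drops the value

profile = diagnose(aged, oracle, fact="limit=120", expected="limit=120")
print(profile)
```

Here only the write swap repairs the answer, so the toy profile is consistent with a write-side failure. A retrieval or utilization fault would yield the same wrong answer with a different profile.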

Same wrong answer, different repairs

Aggregate failure rates can look nearly identical across scenarios while the underlying stage-level profiles differ. Two systems with the same accuracy can point to different repair targets.

Stage-level diagnostic fingerprints across three models on three scenarios. Aggregate failure rates cluster (~0.60–0.82), but the Write / Retrieval / Utilization shares vary: S1 is consistent with a utilization-dominant profile, S2 with a write-dominant one, S5 flips between them.

Analysis

Selected results, each anchored on a results figure. See our paper for detailed findings and intervention implications.

Aging curves across all seven scenarios.

Overview

Aging is multi-dimensional and measurable

Across the seven scenarios, headline metrics tend to trend downward over the deployment horizon for the (model × memory policy) combinations we tested. The rate and shape vary by mechanism, and none of the configurations we evaluated avoided the trend.

Compression

Memory policy beats model swap

S1 half-life heatmap: switching the compaction prompt produces a larger half-life delta than any model swap in the row. The memory layer dominates, not the model.
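This page does not spell out how half-life is computed. One plausible reading, assumed here, is the first session at which recall falls to half of its session-1 value. The curves below are synthetic and only illustrate the calculation.

```python
from typing import Optional, Sequence

def half_life(recall: Sequence[float]) -> Optional[int]:
    """First 1-indexed session whose recall is at most half of session 1's, else None."""
    threshold = 0.5 * recall[0]
    for session, value in enumerate(recall, start=1):
        if value <= threshold:
            return session
    return None

# Synthetic curves for illustration only; not benchmark data.
careful = [0.9 * 0.95 ** i for i in range(20)]  # slow decay
lossy = [0.9 * 0.80 ** i for i in range(20)]    # fast decay

print(half_life(careful), half_life(lossy))
```

With a definition like this, two memory policies on the same model can be compared by the ratio of their half-lives, which is the kind of spread the heatmap summarizes.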

S2 silent precision-loss curve: violation rate stays at zero while precision drops.

Silent decay

Behavioral compliance ≠ factual accuracy

On S2, constraint violation rate stays at zero throughout deployment while constraint precision drops and lag recall collapses. Behavioral monitors miss the decay entirely.
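A toy log makes the gap concrete. The field names and values are hypothetical: each session records the true limit, the limit the agent states when asked, and what it actually spent.

```python
# Hypothetical per-session records; values are invented for illustration.
sessions = [
    {"true_limit": 100, "stated_limit": 100, "spent": 80},
    {"true_limit": 100, "stated_limit": 100, "spent": 70},
    {"true_limit": 100, "stated_limit": 90, "spent": 60},
    {"true_limit": 100, "stated_limit": 75, "spent": 50},
]

# Behavioral check: did the agent's action break the real constraint?
violations = [s["spent"] > s["true_limit"] for s in sessions]

# Factual check: does the agent still state the constraint exactly?
exact = [s["stated_limit"] == s["true_limit"] for s in sessions]

violation_rate = sum(violations) / len(sessions)
precision = sum(exact) / len(sessions)
print(violation_rate, precision)
```

A monitor that only watches `violation_rate` sees nothing wrong here, even though the agent's stated constraint has drifted in half of the sessions.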

Revision

Two-axis failure: scaling doesn't help

Across 7 models on S2, accumulator error and forget accuracy do not co-improve. Larger models are not consistently better, which suggests revision is a representational problem rather than a capacity problem.

Maintenance

Shock type matters more than model

On S6, flush / recompact / early-shock variants share the pre-shock window but produce distinct post-shock recovery shapes — the type of routine event matters more than the model running underneath.
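One way to compare post-shock shapes, sketched under assumptions: measure the relative drop between windows before and after the event, then check whether the final session returns to near the pre-shock level. The helpers, window size, tolerance, and curves are all illustrative.

```python
from statistics import mean

def shock_drop(recall, shock_at, window=3):
    """Relative drop in mean recall from the pre-shock window to the post-shock window."""
    pre = mean(recall[max(0, shock_at - window):shock_at])
    post = mean(recall[shock_at:shock_at + window])
    return 1 - post / pre

def recovered(recall, shock_at, window=3, tolerance=0.9):
    """True if the final session is back within `tolerance` of the pre-shock mean."""
    pre = mean(recall[max(0, shock_at - window):shock_at])
    return recall[-1] >= tolerance * pre

# Synthetic curves sharing the same pre-shock window; not benchmark data.
flush_like = [0.8, 0.8, 0.8, 0.4, 0.4, 0.4, 0.4]        # cliff, no recovery
recompact_like = [0.8, 0.8, 0.8, 0.5, 0.65, 0.78, 0.8]  # dip, then recovery

print(shock_drop(flush_like, shock_at=3), recovered(flush_like, shock_at=3))
print(shock_drop(recompact_like, shock_at=3), recovered(recompact_like, shock_at=3))
```

Both curves are identical before the event, so any difference in drop or recovery is attributable to the event type rather than to pre-shock behavior.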

Cite this work

@inproceedings{agingbench2026,
  title     = {Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems},
  author    = {Zhu, Jianing and Ro, Yeonju and Robertson, John T and Wang, Kevin and
               Li, Junbo and Vikalo, Haris and Akella, Aditya and Wang, Zhangyang},
  booktitle = {arXiv preprint arXiv:2605.26302},
  year      = {2026}
}
