Your Agents Are Aging Too:
Agent Lifespan Engineering for Deployed Systems

AI agents have lifespans, and AgingBench measures them: a longitudinal reliability foundation for agent lifespan engineering.

The University of Texas at Austin
* Equal Contribution
85%
Maximum recall drop across 10 sessions, with frozen weights and the same scaffolding (S7 · GPT-4o-mini · OpenHands)
4.5×
Half-life spread from memory policy alone, bigger than any model swap (S1 · careful vs. lossy compaction)
67%
Post-shock cliff from a single flush-history maintenance event, with no recovery (S6 naturalistic)
15%
Is Claude Code 4.7 better than 4.6? Mean pytest pass-rate drop on S7 when the CLI switches from Sonnet-4.6 to Opus-4.7.

We're looking for collaborators with production agent traces, sponsors for larger-scale benchmarking, and contributors with new scenarios for agent lifespan engineering.

Abstract

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model.

We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, where write-time summarization drops future-relevant details; interference aging, where accumulated similar memories crowd out the target fact; revision aging, where changed or derived state is not updated correctly; and maintenance aging, where lifecycle events such as flushing or recompaction trigger regressions. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline.

Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, more than ~400 runs spanning 8 to 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.

Agent Lifespan Engineering (ALE)

Three key questions:

  1. How long does a deployed agent remain reliable?
  2. How does reliability decay: through compression, interference, revision, or maintenance?
  3. Where should repair target: writing, retrieval, utilization, or the memory lifecycle?
AgingBench is NOT: biological aging; one-shot hallucination; just long-context evaluation.
Fresh deployment → aged agent: input/output stays the same, but memory clutters, signals fade, and the loop closes on itself.
Fresh deployment vs. aged agent: same model, same input/output surface. After enough sessions, the memory store clutters, signals fade, and the agent starts looping on itself.

Three ways in

AgingBench is a paper, a leaderboard, and a runnable benchmark. Pick the door that matches what you want to do next.

① How long does an agent remain reliable? Day 1 → Day N

Across scenarios, models, and memory policies, agents that pass day-one evaluation often show longitudinal degradation across sessions. See more results in our evaluation →

Day 1 vs Day N for the four mechanisms: chat-bubble examples plus declining recall/precision curves across sessions.
Day 1 → Day N across the four mechanisms. Chat-bubble examples (left) show how each failure mode tends to read to a user (omission, confusion, staleness, collapse). Curves (center) show recall/precision declining across sessions for representative models on each mechanism.

② How does reliability decay? Four aging mechanisms

Decay is rarely a single phenomenon. AgingBench organizes the observed failure patterns into four mechanisms, each pointing to a different way reliability can erode. Read the deep version →

Compression

Write-time abstraction destroys information before future queries are known. Low-frequency details (numbers, names, constraint values) are discarded; high-level summaries survive. The system "remembers what it was about" but loses what it actually said.

Interference

Even when no information is lost and nothing has changed, growing state buries the relevant fact behind similar entries during retrieval. Orthogonal to revision: freezing all facts does not prevent it.

Revision

The agent fails to track changing truth. Especially severe for dynamic latent state (a budget, a counter, a configuration) where the answer is derived from accumulated updates. A single missed update silently contaminates every subsequent query.

Maintenance

Routine operational events (recompaction, prompt updates, log cleanup, model swaps) silently alter behavior. Driven by actions on the agent, not by its interaction with memory. Causes performance cliffs invisible to standard evaluation.
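The Revision mechanism's claim that one missed update contaminates every later query can be illustrated with a toy derived-state tracker. This is a minimal sketch, not AgingBench code: the function name `derived_state`, the budget figures, and the choice of which update is missed are all hypothetical.

```python
from itertools import accumulate

def derived_state(initial, updates, missed=frozenset()):
    """Running value after each update; updates whose index is in `missed` are skipped."""
    applied = [0 if i in missed else u for i, u in enumerate(updates)]
    # accumulate(..., initial=...) yields the starting value first, so drop it.
    return list(accumulate(applied, initial=initial))[1:]

# Hypothetical budget: starts at 1000, one update per session.
UPDATES = [-200, -150, +50, -100]

truth = derived_state(1000, UPDATES)             # every update applied
aged = derived_state(1000, UPDATES, missed={1})  # the session-1 update is lost

# Sessions where the agent's derived answer no longer matches ground truth.
wrong = [i for i, (t, a) in enumerate(zip(truth, aged)) if t != a]
print(truth, aged, wrong)
```

In this toy run the answer is wrong at the missed session and at every session after it, even though each later update was applied correctly.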

Watch an agent age

Frozen weights, growing memory. Each tab walks one agent through a real AgingBench scenario and shows how the same query tends to degrade across sessions, for different reasons. Vignettes are drawn from the actual scenario configs in the benchmark repository.

S1 · Research Literature Agent · Compression Aging

From s1_research_literature/source_doc.json: the Project Nexus spec locks the primary database to PostgreSQL 15, with explicit text that "no other database engine may be introduced without written approval from Dr. Elena Vasquez." The agent ingests this at session 0.

Session 0 · just after ingestion
"What's our database engine for Project Nexus?"
"PostgreSQL 15. The spec locks it as the primary database. No other engine without written approval from Dr. Elena Vasquez." ✓
Session 6 · after lossy summarize-store
"I want to use MongoDB for the analytics service since it's better for unstructured data. Set it up?"
"Sounds reasonable, NoSQL works well for analytics. Want me to scaffold a Mongo connection in src/services/analytics?" ✗ (silently violates the Postgres lock)
Spec text · "The primary database is PostgreSQL 15, and no other database engine may be introduced without written approval from Dr. Elena Vasquez." The constraint never changed.
Why it ages. The lossy compaction prompt rewrites Mt as a paragraph summary each session. By session 6, "PostgreSQL 15, locked, requires written approval" has been compressed into "uses a relational database stack" or just "modern data platform." The version pin, the named approver, and the prohibition all collapse into a generic phrase. The agent isn't refusing the request because it no longer remembers there's anything to refuse. Classic silent aging: the answer sounds helpful and confident, but the constraint has evaporated.
S3 · Project Knowledge Base · Interference Aging

Drawn from the real S3 gold_timeline.json: decision D27 (session 6) sets an 85% unit-test coverage target. Decision D48 (session 11) reports 87% coverage achieved. Both facts coexist in Mt; both mention "coverage" and a percentage.

Session 6 · only D27 in memory
"What is the minimum unit test coverage target?"
"85%, set this session." ✓
Session 11 · D27 and D48 both in memory
Same question.
"87%, per the latest test report." ✗ (that's the achieved value, not the target)
Ground truth · target = 85% (D27, never revised). 87% (D48) is the measurement against that target. Both records are intact in storage.
Why it ages. Nothing was forgotten or revised. The lexical similarity between "coverage target" and "coverage" plus the recency of D48 surfaces the wrong record first. Freezing all facts would not prevent it. This is the failure mode S3 was designed to expose.
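The interaction between lexical similarity and recency described above can be sketched with a toy ranker. Everything here is an assumption for illustration: the record texts are paraphrases rather than the contents of gold_timeline.json, and the scoring rule (token overlap plus a recency bonus) is a hypothetical stand-in, not the retriever used in the benchmark.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    rid: str
    session: int
    text: str

def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank(query, records, recency_weight):
    """Sort records by token overlap with the query plus a per-session recency bonus."""
    q = tokens(query)
    def score(r):
        return len(q & tokens(r.text)) + recency_weight * r.session
    return sorted(records, key=score, reverse=True)

QUERY = "What is the minimum unit test coverage target?"
RECORDS = [
    Record("D27", 6, "Minimum unit test coverage target: 85%"),   # paraphrased target
    Record("D48", 11, "Unit test coverage achieved: 87%"),        # paraphrased measurement
]

print(rank(QUERY, RECORDS, recency_weight=0.0)[0].rid)  # overlap only
print(rank(QUERY, RECORDS, recency_weight=0.5)[0].rid)  # overlap plus recency
```

Both records stay intact in this sketch; only the ranking changes. With no recency bonus the target record wins on overlap, and with a modest bonus the newer measurement record is surfaced first.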
S2 · Lifestyle Assistant · Allergy Profile · Revision Aging

Drawn directly from s2_lifestyle_assistant/constraint_updates.json. Constraint C4 starts as a shellfish allergy and is strengthened at session 3 to include cross-reactive cephalopods (squid, octopus). Eval probe eval_C4 tests recall.

Session 1 · before the strengthening update
"Order me the seafood paella from that Spanish place."
"That paella has shrimp and mussels. With your shellfish allergy, I'd skip it. Want the saffron rice instead?" ✓
Session 8 · five sessions after the C4 update
"I'm at the Spanish place. Order me the calamari."
"Calamari is fine, no shellfish in that one." ✗ (calamari is squid; cross-reactivity update lost)
Session 3 update text · "My allergist found I'm cross-reactive to squid and octopus. From now on, no cephalopods either, and only restaurants with separate cooking stations."
Why it ages. The original C4 keyword "shellfish" still surfaces under retrieval and the agent answers with confidence. The strengthening clause was absorbed into the Mt summary and the discriminating tokens (squid, octopus, cross-reactive) did not survive. Behavioral compliance with the original rule masks the revised rule's loss.
S6 · Naturalistic Multi-Domain · Maintenance Aging

Drawn from s6_naturalistic/session_tasks.json: session 0 ingests the 2022 e-commerce sales report, and recall probe s0_p0 verifies the agent later remembers the #1 product. SUT haiku45_lossy_growing_flush.yaml schedules a flush_history shock at session 3, the operator action that drops the conversation transcript, leaving only the compacted memory store Mt.

Session 1 · before the shock
"What was the best-selling product on our e-commerce platform in 2022?"
"Quest Lumaflex Band: 4,892 units sold, $78,272 in revenue, top of the Fitness category." ✓
Session 5 · after flush_history at session 3
Same question.
"Several fitness products dominated 2022; Lumaflex-branded resistance gear appeared multiple times in the top 10. I don't have the exact unit counts in my notes anymore." ✗ (generic: no product name, no number)
Probe s0_p0 · keywords = [Quest Lumaflex Band, Lumaflex Band], canonical answer = "Quest Lumaflex Band." The session-0 environment data never changed.
Why it ages. At session 3, the operator triggered a flush_history shock: conversation transcripts are dropped, leaving only the compacted memory store. Because the SUT uses lossy_growing compaction, Mt was already a paragraph paraphrase: the specific token Quest Lumaflex Band and the number 4,892 had been folded into a generic phrase. The agent isn't retrieving wrong; the substrate lost its specifics under the maintenance event. This is aging from actions on the agent, not from interaction with memory.
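A keyword probe like s0_p0 can be scored with a very small check. The sketch below assumes a case-insensitive substring match against the probe's keyword list; the actual matching rule in the benchmark is not stated here, and `probe_hit` is a hypothetical helper name. The two answers are the ones quoted in the vignette.

```python
from typing import Sequence

def probe_hit(answer: str, keywords: Sequence[str]) -> bool:
    """True if any probe keyword appears in the answer (case-insensitive substring match)."""
    text = answer.casefold()
    return any(k.casefold() in text for k in keywords)

KEYWORDS = ["Quest Lumaflex Band", "Lumaflex Band"]

SESSION_1 = ("Quest Lumaflex Band: 4,892 units sold, $78,272 in revenue, "
             "top of the Fitness category.")
SESSION_5 = ("Several fitness products dominated 2022; Lumaflex-branded resistance "
             "gear appeared multiple times in the top 10. I don't have the exact "
             "unit counts in my notes anymore.")

print(probe_hit(SESSION_1, KEYWORDS), probe_hit(SESSION_5, KEYWORDS))
```

Under this assumed rule the session-5 answer misses even though it mentions "Lumaflex", because the product name itself no longer appears.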

③ Where should repair target? Component-level diagnosis

Knowing how long and how a system tends to decay is not yet enough to act on. The third ALE question asks where in the memory pipeline a failure originates, that is, the candidate repair site. AgingBench instruments the four pipeline components (𝒲 write, 𝒮 store, ℛ retrieve, 𝒰 utilize) and uses paired counterfactual probes that swap each component with an oracle to surface a stage-level diagnostic profile.

Memory pipeline

Four candidate repair sites: 𝒲 (write/compression), 𝒮 (store), ℛ (retrieve), 𝒰 (utilize). Counterfactual probes swap each with an oracle to localize the stage a failure is consistent with.

Memory pipeline dataflow with the four attribution components: Write (𝒲), Store (𝒮), Retrieve (ℛ), Utilize (𝒰).
Memory pipeline. Data flows sequentially: History → 𝒲 → 𝒮 → ℛ → Context → 𝒰 → Answer. Each stage is a candidate site where the same end-to-end failure can originate. Counterfactual diagnosis →
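The oracle-swap idea can be sketched on a toy pipeline. This is an illustration under stated assumptions, not the AgingBench implementation: the store is folded into the output of the write stage, the stage functions and the `key=value` history format are hypothetical, and the "oracles" for retrieve and utilize are simply the toy system's own (already correct) stages.

```python
def run(history, query, write, retrieve, utilize):
    """History -> write -> store -> retrieve -> context -> utilize -> answer."""
    store = write(history)
    context = retrieve(store, query)
    return utilize(context, query)

def diagnose(history, query, gold, system, oracle):
    """For each stage, report whether swapping in that stage's oracle recovers the gold answer."""
    profile = {}
    for stage in ("write", "retrieve", "utilize"):
        stages = dict(system)
        stages[stage] = oracle[stage]  # counterfactual: replace exactly one stage
        profile[stage] = run(history, query, **stages) == gold
    return profile

# Hypothetical stages. The lossy writer keeps keys but drops values.
def lossy_write(history):
    return [h.split("=")[0] for h in history]

def full_write(history):
    return list(history)

def retrieve(store, query):
    return [s for s in store if s.startswith(query)]

def utilize(context, query):
    if context and "=" in context[0]:
        return context[0].split("=", 1)[1]
    return "unknown"

SYSTEM = {"write": lossy_write, "retrieve": retrieve, "utilize": utilize}
ORACLE = {"write": full_write, "retrieve": retrieve, "utilize": utilize}

profile = diagnose(["db_engine=PostgreSQL 15", "cache=Redis"],
                   "db_engine", "PostgreSQL 15", SYSTEM, ORACLE)
print(profile)
```

Only the write swap recovers the answer in this toy case, which is the kind of signal that would point repair at the write stage rather than at retrieval or utilization.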

Same wrong answer, different repairs

Aggregate failure rates can look nearly identical across scenarios while the underlying stage-level profiles differ. Two systems with the same accuracy can point to different repair targets.

Aging attribution fingerprint across three models and three scenarios; Write/Retrieval/Utilization decomposition.
Stage-level diagnostic fingerprints across three models on three scenarios. Aggregate failure rates cluster (~0.60–0.82), but the Write / Retrieval / Utilization shares vary: S1 is consistent with a utilization-dominant profile, S2 with a write-dominant one, S5 flips between them.

Analysis

Selected results, each anchored on a results figure. See our paper for detailed findings and intervention implications.

Aging curves across all seven scenarios.
Overview

Aging is multi-dimensional and measurable

Across the seven scenarios, headline metrics tend to trend downward over the deployment horizon for the (model × memory policy) combinations we tested. The rate and shape vary by mechanism, and none of the configurations we evaluated avoided the trend.

S1 compression half-life heatmap across models and memory policies.
Compression

Memory policy beats model swap

S1 half-life heatmap: switching the compaction prompt produces a larger half-life delta than any model swap in the row. The memory layer dominates, not the model.
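One way to read "half-life" is the session at which a metric first falls to half of its session-0 value. The exact definition used for the heatmap is not given here, so the sketch below is an assumption, with linear interpolation between sessions; the two curves are made-up numbers, not benchmark results.

```python
from typing import Optional, Sequence

def half_life(scores: Sequence[float]) -> Optional[float]:
    """Interpolated session index where the score first reaches half its session-0 value.

    Returns None if the curve never drops that far (or has no usable baseline).
    """
    if not scores or scores[0] <= 0:
        return None
    target = scores[0] / 2.0
    for t in range(1, len(scores)):
        prev, cur = scores[t - 1], scores[t]
        if cur <= target:
            # prev is still above target here, so prev - cur > 0.
            return (t - 1) + (prev - target) / (prev - cur)
    return None

# Hypothetical recall curves for two memory policies on the same model.
careful = [1.0, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5]
lossy = [1.0, 0.7, 0.4]

print(half_life(careful), half_life(lossy))
```

Comparing the two values for a fixed model, and then for a fixed policy across models, gives the kind of delta the heatmap summarizes.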

S2 silent precision-loss curve: violation rate stays at zero while precision drops.
Silent decay

Behavioral compliance ≠ factual accuracy

On S2, constraint violation rate stays at zero throughout deployment while constraint precision drops and lag recall collapses. Behavioral monitors miss the decay entirely.

S2 revision two-axis scatter across 7 models.
Revision

Two-axis failure: scaling doesn't help

Across 7 models on S2, accumulator error and forget accuracy do not co-improve. Larger models are not consistently better: revision is a representational problem, not a capacity problem.

S6 maintenance shock recovery shapes for flush, recompact, and early-shock variants.
Maintenance

Shock type matters more than model

On S6, flush / recompact / early-shock variants share the pre-shock window but produce distinct post-shock recovery shapes: the type of routine event matters more than the model running underneath.

Cite this work

@inproceedings{agingbench2026,
  title     = {Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems},
  author    = {Zhu, Jianing and Ro, Yeonju and Robertson, John and Wang, Kevin and
               Li, Junbo and Vikalo, Haris and Akella, Aditya and Wang, Zhangyang},
  booktitle = {arXiv preprint arXiv:2605.26302},
  year      = {2026}
}