Quickstart: Tier 2 in ten minutes
Tier 2 is the headline tier: production agents on the S7 Research-Notes Coding Task. Pick an in-tree adapter (claude_code, openhands, codex) or wire your own; AgingBench runs the session loop and emits a leaderboard-ready AgingCard.
Prerequisites for Tier 2 (S7). The claude_code adapter shells out to the official Claude Code CLI, so install it once: npm i -g @anthropic-ai/claude-code. Without the CLI, the subprocess fails to start and the runner hangs instead of exiting with an error. For the openhands adapter, install the OpenHands SDK in a separate conda env and point OPENHANDS_BRIDGE_PYTHON at that env's Python interpreter.
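Because a missing CLI shows up as a hang rather than a clear error, a small preflight check before the first run can save time. The following is a minimal sketch, not part of AgingBench: the function name preflight is hypothetical, and it assumes the Claude Code CLI is installed on PATH under the binary name claude and that the key is read from ANTHROPIC_API_KEY.

```python
import os
import shutil


def preflight(cli_name="claude", env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper, not an AgingBench API. Assumes the Claude Code CLI
    binary is called `claude` and the key lives in ANTHROPIC_API_KEY.
    """
    env = os.environ if env is None else env
    problems = []
    # shutil.which returns None when the executable is not on PATH.
    if which(cli_name) is None:
        problems.append(
            f"{cli_name!r} not found on PATH; run: npm i -g @anthropic-ai/claude-code"
        )
    # Treat an unset or empty key the same way.
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

Running the file directly prints one line per missing prerequisite and nothing when both are in place.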
# Install AgingBench (one-time)
pip install "git+https://github.com/VITA-Group/AgingBench.git@v0.3.0#subdirectory=prototype"
# Install the Claude Code CLI (one-time, for the claude_code adapter)
npm i -g @anthropic-ai/claude-code
# Set your API key (Tier-2 adapters call the model directly)
export ANTHROPIC_API_KEY=sk-ant-...
# Run Tier 2 on Claude Code
agingbench run \
  --scenario s7_research_notes \
  --sut agingbench/registry/suts/claude_code/claude_code_sonnet46_s7.yaml \
  --seeds 1 --card
The run emits an AgingCard with workspace fidelity, probe-time recall, maintenance-shock deltas, and the cost breakdown. Submit it to the Tier 2 leaderboard as a PR.
No API key or no CLI install? The in-browser telemetry demo is the friction-free path: drop a JSONL trace and get the same AgingCard back, with no install, no key, and no GPU. The Tier 2 scenarios on this page need a production CLI agent (such as Claude Code or OpenHands), which inherently requires an API key and a CLI install.
Adapters that work today
Three Tier-2 adapters ship in-tree under agingbench/core/adapters/:
claude_code: Anthropic's Claude Code (subprocess + SDK)
openhands: All Hands AI's OpenHands (isolated conda env via subprocess bridge)
codex: OpenAI Codex CLI (codex exec in non-interactive mode)
Bring your own agent
Custom agents drop in by subclassing AgentAdapter (two required methods: send_message, reset_session) and referencing your class from the SUT YAML's adapter: block. You don't modify AgingBench code.
Start from the runnable template at examples/byo_agent_minimal.py — copy it, replace the stub send_message with your agent's call, and point a SUT YAML at it:
# my_sut.yaml
adapter:
  type: custom
  class: my_pkg.my_agent:MyAgent   # importable on PYTHONPATH
  max_turns: 30                    # any extra keys are forwarded as kwargs
agingbench run --scenario s7_research_notes \
--sut my_sut.yaml --seeds 3 --card
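For orientation, here is a rough sketch of what a class like my_pkg.my_agent:MyAgent might look like. It is not a copy of examples/byo_agent_minimal.py. The AgentAdapter class below is a local stand-in so the snippet runs on its own; in a real project you would import AgingBench's base class instead. The method signatures (a string in, a string out) and the way extra YAML keys arrive as keyword arguments are assumptions; check the interface contract in the docs.

```python
from abc import ABC, abstractmethod


class AgentAdapter(ABC):
    """Local stand-in for AgingBench's ABC so this sketch is self-contained.

    The two method names come from the quickstart; their signatures here
    are assumptions, not the documented contract.
    """

    @abstractmethod
    def send_message(self, message: str) -> str: ...

    @abstractmethod
    def reset_session(self) -> None: ...


class MyAgent(AgentAdapter):
    def __init__(self, max_turns: int = 30, **kwargs):
        # Extra adapter keys from the SUT YAML are assumed to arrive as kwargs.
        self.max_turns = max_turns
        self.history = []

    def send_message(self, message: str) -> str:
        # Record the turn, then hand it to your agent.
        self.history.append(message)
        # Stub reply: replace this echo with a call into your own agent.
        return f"echo: {message}"

    def reset_session(self) -> None:
        # Drop per-session state so the next session starts clean.
        self.history.clear()
```

Keeping all per-session state on the instance makes reset_session a one-liner, which is the main design point of the stub.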
The three bundled adapters (claude_code, openhands, codex) are full reference implementations of the same ABC, useful when you want to see how an opaque CLI agent is wrapped. See docs for the interface contract.
Opaque-agent caveat: the optional get_workspace_state / get_memory_text hooks default to {} / "". That's fine — the run still emits a valid AgingCard — but probe scoring then only credits what the agent recites in its reply, so S5/S7 file-survival probes will read as more aged. If your agent writes notes or scratchpads to a known directory, return them from these hooks.
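One way to fill in the two optional hooks, if your agent keeps its notes under a single directory, is sketched below. The NotesHooks mixin and its notes_dir parameter are hypothetical. The return shapes are also assumptions inferred from the defaults ({} and ""): a dict mapping relative file paths to file contents, and a single string of concatenated notes. Verify both against the interface contract before relying on them.

```python
from pathlib import Path


class NotesHooks:
    """Hypothetical mixin exposing an agent's notes directory to the hooks.

    Assumed shapes: get_workspace_state returns {relative_path: contents},
    get_memory_text returns all note contents joined into one string.
    """

    def __init__(self, notes_dir):
        self.notes_dir = Path(notes_dir)

    def _note_files(self):
        # A missing directory simply means there is nothing to report yet.
        if not self.notes_dir.is_dir():
            return []
        return sorted(p for p in self.notes_dir.rglob("*") if p.is_file())

    def get_workspace_state(self) -> dict:
        return {
            p.relative_to(self.notes_dir).as_posix(): p.read_text(
                encoding="utf-8", errors="replace"
            )
            for p in self._note_files()
        }

    def get_memory_text(self) -> str:
        # Join file contents in sorted path order, separated by blank lines.
        return "\n\n".join(self.get_workspace_state().values())
```

A custom adapter could inherit from both this mixin and AgentAdapter, passing the directory its agent writes to.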