This article is based on Anthropic - Scaling Managed Agents and LangChain - The Anatomy of an Agent Harness, with my own understanding and examples mixed in.
1. What Is an Agent Harness
1.1 One-Sentence Definition
LangChain gives a concise formula: Agent = Model + Harness.
In one sentence: if you're not the model, then you're the Harness.
Harness is everything outside the model — the loop that calls the model, tool routing, context management, execution environment, security isolation, memory mechanisms... all the infrastructure that turns a "bare model" into an "Agent that can get work done."
1.2 Harness vs Traditional Agent Frameworks
Many early Agent frameworks (typified by early LangChain's Chain/Agent abstractions) tended to wrap model capabilities in predefined tool chains, with the developer orchestrating the call sequence.
A Harness has a different focus: it prescribes less about what the model should do and instead provides general capability primitives (filesystem, code execution, sandbox, network access), letting the model decide on its own how to combine these primitives to solve problems.
- Traditional framework ≈ giving the model an SOP manual to execute step by step
- Harness ≈ giving the model a fully-equipped workshop, letting it decide which tools to use
1.3 A Concrete Example
Claude Code itself is a typical Harness:
┌───────────────────────────────────────────────┐
│ Claude Code (Harness)                         │
├───────────────────────────────────────────────┤
│ • Calls Claude API in a loop                  │
│ • Routes tool calls (Bash, Read, Write, ...)  │
│ • Manages context window                      │
│ • Provides filesystem access                  │
│ • Isolates execution environment              │
└───────────────────────────────────────────────┘
The model is responsible for "thinking" (reasoning, planning, decision-making), and the Harness is responsible for "doing" (executing commands, reading/writing files, managing context, isolating permissions).
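The division of labor above can be sketched as a loop. This is a minimal illustration under stated assumptions, not Claude Code's implementation: `ModelStep`, `runHarness`, `scriptedModel`, and the `echo` tool are hypothetical names, and the "model" is a stub that returns canned decisions so the example runs without any API.

```typescript
// Hypothetical types: a model step is either a tool request or a final answer.
type ModelStep =
  | { kind: "tool"; name: string; input: string }
  | { kind: "final"; text: string };

type Model = (transcript: string[]) => ModelStep;
type Tool = (input: string) => string;

// The Harness: call the model in a loop, route tool calls, feed results back.
function runHarness(model: Model, tools: Map<string, Tool>, task: string, maxTurns = 10): string {
  const transcript: string[] = [`user: ${task}`];
  for (let turn = 0; turn < maxTurns; turn++) {
    const step = model(transcript); // the model "thinks"
    if (step.kind === "final") return step.text;
    const tool = tools.get(step.name); // the Harness "does"
    const result = tool ? tool(step.input) : `error: unknown tool ${step.name}`;
    transcript.push(`tool(${step.name}): ${result}`);
  }
  return "stopped: turn limit reached";
}

// Stub model (assumption): asks for one tool call, then answers using the latest tool result.
const scriptedModel: Model = (transcript) =>
  transcript.length === 1
    ? { kind: "tool", name: "echo", input: "hello" }
    : { kind: "final", text: `done after seeing "${transcript[transcript.length - 1]}"` };

const demoTools = new Map<string, Tool>([["echo", (s) => s.toUpperCase()]]);
const harnessAnswer = runHarness(scriptedModel, demoTools, "say hello");
```

Everything except the call to `model(...)` is Harness code, which is the point of the "if you're not the model, you're the Harness" framing.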
2. Core Components of a Harness
The LangChain article summarizes six core components of a Harness. I'll explain each one with Claude Code examples.
2.1 Filesystem
The filesystem is the most critical foundational primitive of a Harness, with a role far beyond "reading and writing files":
| Use Case | Description | Claude Code Example |
| --- | --- | --- |
| Persistent storage | Retain info across conversation runs | CLAUDE.md |
| Context extension | Store info beyond token limits | Large tool outputs persisted to disk; context keeps only summary and file path |
| Multi-Agent collaboration | Shared workspace | Team-shared task list file |
| Version control | Track changes, support rollback | git worktree isolation |
Side note: the persisted-output strategy for large tool outputs (second row) and the shared task list that TeamCreate binds to a newly created team (third row) are Claude Code-specific implementation details, not general Harness theory.
The filesystem is essentially the Agent's external memory — the context window is "working memory" (short-term), and the filesystem is "long-term memory."
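The "context extension" row of the table can be sketched as follows. This is an illustration only, not Claude Code's actual mechanism: an in-memory `Map` stands in for the disk, and `offloadIfLarge`, the 200-character threshold, and the `/tmp/tool-outputs/` path scheme are all assumptions made for the example.

```typescript
// An in-memory Map stands in for the real filesystem (assumption for the sketch).
type FakeDisk = Map<string, string>;

interface ContextEntry {
  text: string;      // what the model actually sees in its context
  filePath?: string; // where the full output lives, if it was offloaded
}

// Keep small outputs inline; persist large ones and leave a short pointer in context.
function offloadIfLarge(disk: FakeDisk, toolName: string, output: string, maxInlineChars = 200): ContextEntry {
  if (output.length <= maxInlineChars) return { text: output };
  const filePath = `/tmp/tool-outputs/${toolName}-${disk.size + 1}.txt`; // hypothetical path scheme
  disk.set(filePath, output);
  const preview = output.slice(0, 80);
  return {
    text: `[${output.length} chars saved to ${filePath}] preview: ${preview}...`,
    filePath,
  };
}

const fakeDisk: FakeDisk = new Map();
const smallEntry = offloadIfLarge(fakeDisk, "grep", "main.go:12: func main()");
const largeEntry = offloadIfLarge(fakeDisk, "grep", "x".repeat(5000));
```

The model can later read the file at `filePath` with an ordinary file tool if it needs the full output, so nothing is lost, only moved out of working memory.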
2.2 Code Execution
Rather than giving the model 100 pre-packaged tools, it's better to give it a general code execution capability:
Pre-packaged tool approach:
  Tool: search_files(pattern="*.go", keyword="func main")

Code execution approach:
  Bash: grep -rn "func main" --include="*.go" .

The latter can combine any shell commands.
The latter is more flexible — the model can combine arbitrary shell commands to solve problems, unconstrained by a predefined tool set. Claude Code providing a Bash tool is exactly this design philosophy: giving the model "hands."
2.3 Sandbox
The sandbox provides three layers of value:
- Security: model-generated code runs in an isolated environment, not affecting the host system
- Scalability: the sandbox can scale independently, unconstrained by the Harness process
- Pre-installed environment: language runtimes, CLI tools, browsers, etc., letting the Agent autonomously verify work results
2.4 Memory & Search
Agents need to accumulate experience across conversation runs, and also need to access real-time information beyond training data:
run 1: user corrects coding style preference
         ↓ writes to memory
run 2: Agent directly codes in the correct style
- Persistent memory files: write preferences, conventions, and experience to files for reuse across runs
- External search capability: access post-training-cutoff information through web search, MCP tools, etc.
Claude Code's memory hierarchy:
- ~/.claude/CLAUDE.md — user global preferences
- Project-level CLAUDE.md — project conventions
- Single-run working context — current conversation and tool results, not persisted to long-term memory files by default, and not auto-loaded on next launch (Claude Code persists the transcript to disk, but does not automatically inject it into the next session's context).
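The run 1 to run 2 flow and the layered files above can be sketched like this. It is a simplified model, not Claude Code's loading logic: a `Map` stands in for the files, `remember` and `buildMemoryPrompt` are hypothetical helpers, `./docs/CLAUDE.md` is a made-up path used to show a missing layer, and concatenating layers from broad to specific is my assumption.

```typescript
// Stand-in for memory files on disk: path -> lines remembered so far (assumption for the sketch).
type MemoryFiles = Map<string, string[]>;

// Run N: persist something worth keeping, for example a correction from the user.
function remember(files: MemoryFiles, path: string, note: string): void {
  files.set(path, [...(files.get(path) ?? []), note]);
}

// Run N+1: load layers from broad (user-global) to specific (project) into the starting context.
function buildMemoryPrompt(files: MemoryFiles, pathsInOrder: string[]): string {
  return pathsInOrder
    .filter((path) => (files.get(path) ?? []).length > 0) // skip layers that do not exist
    .map((path) => `# Memory from ${path}\n${(files.get(path) ?? []).join("\n")}`)
    .join("\n\n");
}

const memoryFiles: MemoryFiles = new Map();
remember(memoryFiles, "~/.claude/CLAUDE.md", "Prefer tabs over spaces.");
remember(memoryFiles, "./CLAUDE.md", "Run the test suite before committing.");

const memoryPrompt = buildMemoryPrompt(memoryFiles, [
  "~/.claude/CLAUDE.md",
  "./CLAUDE.md",
  "./docs/CLAUDE.md", // hypothetical layer with no content, so it is skipped
]);
```

The single-run working context is deliberately absent here: it only reaches the next run if something explicitly writes it into one of the memory files.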
2.5 Context Management
In long conversations, the context window will "rot" (context rot): more and more irrelevant information dilutes the effective information. The Harness needs to actively manage context:
- Compaction: compress historical conversations, retaining key information
- Offloading: store large outputs in files rather than keeping them in context
- Progressive Skill Disclosure: inject instructions on demand rather than loading everything at once
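Compaction, the first item above, can be sketched as "summarize the old part, keep the recent tail". The example below is a simplification under stated assumptions: it budgets by message count rather than tokens, `summarize` is a stand-in for what would normally be a model call, and all names are hypothetical.

```typescript
interface ChatMessage {
  role: "user" | "assistant" | "tool";
  text: string;
}

// Stand-in summarizer (assumption): a real Harness would usually ask a model to write this.
function summarize(messages: ChatMessage[]): string {
  return `Summary of ${messages.length} earlier messages: ` +
    messages.map((m) => m.text.slice(0, 20)).join(" | ");
}

// Compact when the history exceeds a budget: summarize the old part, keep the recent tail verbatim.
function compact(chatHistory: ChatMessage[], maxMessages: number, keepRecent: number): ChatMessage[] {
  if (chatHistory.length <= maxMessages) return chatHistory;
  const older = chatHistory.slice(0, chatHistory.length - keepRecent);
  const recent = chatHistory.slice(chatHistory.length - keepRecent);
  return [{ role: "assistant", text: summarize(older) }, ...recent];
}

// Twelve alternating messages, compacted with a budget of 10 and a tail of 4.
const chatHistory: ChatMessage[] = Array.from({ length: 12 }, (_, i): ChatMessage => ({
  role: i % 2 === 0 ? "user" : "assistant",
  text: `message ${i}`,
}));
const compacted = compact(chatHistory, 10, 4);
```

The budget and tail size are parameters rather than constants, so the policy can be loosened or removed without touching the loop that calls it.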
2.6 Long-Horizon Execution
In real scenarios, a task may span multiple conversation runs, lasting hours or even days. The Harness needs to support this kind of "long-horizon work":
- Filesystem as progress carrier: persist intermediate state to disk, so an interrupted run doesn't start from scratch
- Planning and self-verification: let the Agent decompose subtasks and check intermediate outputs
- Resumable execution loop: patterns like the Ralph Loop — after interruption, can continue based on task list and produced artifacts, rather than restarting
Long-horizon execution is essentially orchestrating the previous five components (filesystem, code execution, sandbox, memory/search, context management) to give the Agent the ability to "advance one thing across time periods."
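The "filesystem as progress carrier" idea can be sketched with a persisted task list. This is a generic resumable loop, not the Ralph Loop itself: a JSON string stands in for the file on disk, `budget` simulates an interruption, and `PlanTask` and `runUntilInterrupted` are hypothetical names.

```typescript
type TaskStatus = "pending" | "done";

interface PlanTask {
  id: string;
  status: TaskStatus;
  artifact?: string; // what the task produced, recorded once it is done
}

// savedPlan is the serialized task list that would live on disk between runs.
function runUntilInterrupted(savedPlan: string, doTask: (id: string) => string, budget: number): string {
  const plan: PlanTask[] = JSON.parse(savedPlan);
  let used = 0;
  for (const task of plan) {
    if (task.status === "done") continue; // skip work finished in an earlier run
    if (used >= budget) break;            // simulate an interruption
    task.artifact = doTask(task.id);
    task.status = "done";
    used++;
  }
  return JSON.stringify(plan);            // persist progress for the next run
}

const initialPlan = JSON.stringify([
  { id: "write-tests", status: "pending" },
  { id: "implement", status: "pending" },
  { id: "update-docs", status: "pending" },
]);

// Record which tasks actually execute across both runs.
const executedTasks: string[] = [];
const doWork = (id: string): string => {
  executedTasks.push(id);
  return `${id}.out`;
};

const afterRun1 = runUntilInterrupted(initialPlan, doWork, 2); // interrupted after two tasks
const afterRun2 = runUntilInterrupted(afterRun1, doWork, 2);   // resumes with the third task only
```

Because the second run reads the saved plan instead of starting from `initialPlan`, each task executes exactly once across both runs.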
3. Harness Architecture Evolution
This part is the core insight of the Anthropic article: how Harness design evolves as model capabilities progress.
3.1 Early Days: Single-Container "Pet" Mode
The original Managed Agents architecture crammed all components into a single container:
┌──────────────────── Container ─────────────────────┐
│                                                    │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐       │
│   │  Model   │   │  Tools   │   │  State   │       │
│   └──────────┘   └──────────┘   └──────────┘       │
│                                                    │
│   Everything in one container                      │
└────────────────────────────────────────────────────┘
This is like raising a "pet" — each container is unique, containing irreplaceable state. Problems:
- Container crash = session loss: all state disappears with the container
- Debugging difficulty: entering the container to debug may expose user data
- Slow startup: users have to wait for the container to be ready before sending the first message, with extremely high p95 latency
- Poor elasticity: cannot independently scale compute and storage
3.2 Three-Layer Decoupling: Session / Harness / Sandbox
Terminology clarification: Session (capitalized) below specifically refers to the event log component in Anthropic's architecture — an append-only persistent store, distinct from the "conversation session" concept used in earlier sections.
Anthropic's solution is to split the monolithic architecture into three independent interfaces:
┌────────────────────────────────────────────────────────────┐
│ Managed Agent                                              │
├──────────────┬──────────────┬──────────────────────────────┤
│ Session      │ Harness      │ Sandbox                      │
│ (event log)  │ (model call) │ (code execution)             │
│ append-only  │ stateless    │ stateless                    │
│ persistent   │ restartable  │ replaceable                  │
└──────────────┴──────────────┴──────────────────────────────┘
(Session = immutable event log; Harness = stateless service that calls the model / routes tools; Sandbox = stateless container that executes code / file operations.)
Session ↔ Harness interaction semantics: the Harness appends event logs to the Session; after restart, the new instance recovers state by replaying the existing event stream.
This design borrows from operating system thinking — just as the OS virtualizes hardware into abstractions like "processes" and "files" that outlive the hardware, Managed Agents virtualize the Agent runtime.
3.3 Brain vs Hands Separation
After decoupling, the most fundamental change is that the Brain (Claude + Harness) and the Hands (Sandbox and various execution tools) are completely separated:
The Harness is decoupled from the conversation run's state, no longer bound to a specific container (the Harness itself can still run in a container, but it becomes a stateless, restartable service). The container side becomes purely...
Session exists independently of the Harness. If the Harness crashes, no state is lost — a new Harness instance can pick up from the event log with just the sessionId:
// Pseudocode (not official SDK API, just illustrating recovery semantics)
const session = await getSession(sessionId);
const harness = new Harness({ model: 'claude' });
await harness.replay(session.events);
// Harness is now back to the state before crash
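To make the recovery semantics concrete, here is a small runnable sketch of replay as a pure reduction over an append-only log. The event shapes, `HarnessState`, and `replay` are my own assumptions for illustration; the source does not describe the real Session schema or API.

```typescript
// Hypothetical event shapes; the real Session schema is not described in the source.
type SessionEvent =
  | { type: "user_message"; text: string }
  | { type: "tool_call"; id: string; tool: string }
  | { type: "tool_result"; id: string; output: string };

interface HarnessState {
  messages: string[];
  pendingToolCalls: string[]; // tool calls with no recorded result yet
}

// Pure reducer: the same events always rebuild the same state.
function applyEvent(state: HarnessState, event: SessionEvent): HarnessState {
  switch (event.type) {
    case "user_message":
      return { ...state, messages: [...state.messages, event.text] };
    case "tool_call":
      return { ...state, pendingToolCalls: [...state.pendingToolCalls, event.id] };
    case "tool_result":
      return {
        messages: [...state.messages, event.output],
        pendingToolCalls: state.pendingToolCalls.filter((id) => id !== event.id),
      };
  }
}

function replay(events: readonly SessionEvent[]): HarnessState {
  const empty: HarnessState = { messages: [], pendingToolCalls: [] };
  return events.reduce<HarnessState>(applyEvent, empty);
}

// Append-only log: the crashed Harness wrote these before dying in the middle of tool call t2.
const sessionLog: SessionEvent[] = [
  { type: "user_message", text: "fix the failing test" },
  { type: "tool_call", id: "t1", tool: "bash" },
  { type: "tool_result", id: "t1", output: "1 test failed" },
  { type: "tool_call", id: "t2", tool: "edit" },
];
const recoveredState = replay(sessionLog);
```

After replay, `recoveredState.pendingToolCalls` contains only `t2`, so a new Harness instance knows which call was in flight when the old one crashed and can decide whether to retry it.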
3.4 Performance Benefits
Decoupling brought significant performance improvements. The key metric is TTFT (Time-to-First-Token), the time from the user sending a message to seeing the first token of the response.
Core reason (based on architectural inference): before decoupling, inference was tied to the container lifecycle, requiring the container to be ready first; after decoupling, the Harness as a stateless service can start inference immediately, no longer blocked by execution environment readiness.
| Metric | Improvement |
| --- | --- |
| p50 TTFT | ~60% reduction |
| p95 TTFT | >90% reduction |
3.5 Security Boundary
Decoupling also naturally forms security isolation:
┌──────── Harness ────────┐      ┌──────── Sandbox ─────────┐
│ • User credentials      │      │ • Code execution         │
│ • OAuth tokens          │      │ • File operations        │
│ • API keys              │ ───▶ │ • Tool results           │
│ • Sensitive config      │      │ • No credentials visible │
└─────────────────────────┘      └──────────────────────────┘
(The Harness on the left holds user credentials / OAuth tokens / API keys; the sandbox can only see operation results — credentials never enter the sandbox.)
The Anthropic article mentions two approaches to ensure credentials don't enter the sandbox (the specific implementation details below are my inferences based on architectural contracts; the original text doesn't go into this level of detail):
- Bundling at initialization: credentials are bound as resources during container initialization (e.g., a Git repo token is bound as part of the repo resource), and the sandbox can only operate on the resource, not see the raw credential.
- Proxy access (external vault): tool calls obtain credentials through a dedicated proxy, which reads from an external vault and only passes back the operation result — credentials never enter the sandbox.
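The proxy approach can be sketched as a function that lives outside the sandbox. All names here (`SecretVault`, `makeToolProxy`, the `git_push` tool, the fake service) are hypothetical and only illustrate the boundary; they are not taken from the Anthropic article.

```typescript
// Hypothetical names throughout: SecretVault, makeToolProxy, and "git_push" are illustrative only.
type SecretVault = ReadonlyMap<string, string>;

interface SandboxToolCall { tool: string; args: string; } // what the sandbox is allowed to send
interface ToolOutcome { ok: boolean; output: string; }    // what the sandbox is allowed to see

// Runs outside the sandbox: looks up the secret, performs the call, returns only the outcome.
function makeToolProxy(vault: SecretVault, callService: (token: string, args: string) => string) {
  return (call: SandboxToolCall): ToolOutcome => {
    const token = vault.get(call.tool);
    if (token === undefined) return { ok: false, output: `no credential bound for ${call.tool}` };
    return { ok: true, output: callService(token, call.args) };
  };
}

const exampleVault: SecretVault = new Map([["git_push", "example-token-123"]]);

// Stand-in for the real remote service: it checks the token but never echoes it back.
const fakeGitService = (token: string, args: string): string =>
  token.length > 0 ? `pushed ${args}` : "unauthorized";

const toolProxy = makeToolProxy(exampleVault, fakeGitService);
const pushOutcome = toolProxy({ tool: "git_push", args: "main" });
const unboundOutcome = toolProxy({ tool: "deploy", args: "prod" });
```

The types carry the security property: `SandboxToolCall` has no field for a credential and `ToolOutcome` has no field that would return one, so model-generated code has nothing to leak.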
3.6 Multiple Brains, Multiple Hands
The decoupled architecture leaves room for "many-to-many" expansion (the following structure is an inference based on interface contracts, not an explicit conclusion from the Anthropic article):
Brain A ──┐                   ┌── Hand 1 (container)
          ├── Session ────────┤
Brain B ──┘                   └── Hand 2 (container)

Multiple Brains can connect to the same Session;
Multiple Hands can be used as interchangeable tools.
- Multiple Harnesses can connect to the same Session as stateless services
- Multiple Sandboxes are treated as interchangeable tools
- In theory, Brains can also "pass" Hands to each other, enabling more complex multi-Agent collaboration
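A small sketch of "Hands as interchangeable tools", including passing a Hand from one Brain to another. The uniform `execute(toolName, input)` contract and the `Brain` class are assumptions I chose for the example, and the Hands are in-process stubs rather than containers.

```typescript
// Assumed contract for the sketch: every Hand exposes one uniform call, so Hands are interchangeable.
interface Hand {
  execute(toolName: string, input: string): string;
}

// Stand-in Hands; real ones would be containers or remote services.
const makeHand = (label: string): Hand => ({
  execute: (toolName, input) => `${label} ran ${toolName}(${input})`,
});

// A Brain holds no Hand-specific logic: it only picks a Hand by id and calls execute.
class Brain {
  private readonly hands: Map<string, Hand>;

  constructor(hands: Map<string, Hand>) {
    this.hands = hands;
  }

  run(handId: string, toolName: string, input: string): string {
    const hand = this.hands.get(handId);
    return hand ? hand.execute(toolName, input) : `error: no hand ${handId}`;
  }

  // "Passing" a Hand: remove it here and give it to another Brain.
  passHand(handId: string, to: Brain): void {
    const hand = this.hands.get(handId);
    if (!hand) return;
    this.hands.delete(handId);
    to.hands.set(handId, hand);
  }
}

const brainA = new Brain(new Map<string, Hand>([["h1", makeHand("hand-1")], ["h2", makeHand("hand-2")]]));
const brainB = new Brain(new Map<string, Hand>());
brainA.passHand("h2", brainB);
const resultFromB = brainB.run("h2", "bash", "ls");
const resultFromA = brainA.run("h2", "bash", "ls");
```

Since a Brain never depends on which concrete Hand sits behind an id, swapping or handing over a Hand requires no change to the Brain's logic.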
4. Design Philosophy and Insights
4.1 The "Expiry" Problem of Harness
Following Anthropic's "stable interfaces + replaceable implementations" line of thinking one step further, we find a more interesting question: the Harness inevitably encodes assumptions about model capabilities, and these assumptions will expire as the model improves.
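Take an easy-to-understand scenario (an illustrative hypothetical to demonstrate the "assumptions expire" pattern, not a specific case from the original text): suppose early models have "context anxiety", so that when a conversation gets long and approaches the context limit, they start forgetting or hallucinating. A Harness written for those models would likely add a workaround, such as forcing a summary or a reset once the context passes a fixed length. When a later model handles long context reliably, that workaround stops helping and instead throws away information the model could have used.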
Lesson: Harness code written for model limitations is a breeding ground for technical debt.
4.2 Design for Stable Interfaces, Not Implementations
The Anthropic article's coping strategy is Future-Proofing Through Abstraction — building a set of interfaces that outlive any specific Harness implementation:
- Strong opinions about interfaces: Session must be an append-only event stream, Sandbox must be stateless and replaceable
- Neutral about implementations: specific Harness logic (Claude Code, task-specific agents, future new forms) can be freely swapped
Stable interface layer (Session / Sandbox contracts remain unchanged)
│
├── Claude Code Harness (today's implementation)
├── Task-Specific Harness (optimized for specific scenarios)
└── Future Harness forms (new shapes we haven't imagined yet)
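In code, "strong opinions about interfaces, neutral about implementations" might look like the sketch below. The method names (`append`, `read`, `run`) and both Harness implementations are my own assumptions; the source fixes the properties of the contracts (append-only, stateless), not their signatures.

```typescript
// Assumed contracts for the sketch (the source fixes their properties, not their method names).
interface AppendOnlySession {
  append(entry: string): void;
  read(): readonly string[];
}
interface StatelessSandbox {
  run(command: string): string;
}

// Any Harness is a function over the two contracts, so implementations can be swapped freely.
type HarnessImpl = (session: AppendOnlySession, sandbox: StatelessSandbox, task: string) => void;

const directHarness: HarnessImpl = (session, sandbox, task) => {
  session.append(`result: ${sandbox.run(task)}`);
};

// A second implementation that adds a verification step on top of the same contracts.
const verifyingHarness: HarnessImpl = (session, sandbox, task) => {
  session.append(`result: ${sandbox.run(task)}`);
  session.append(`check: ${sandbox.run("verify " + task)}`);
};

function makeSession(): AppendOnlySession {
  const entries: string[] = [];
  return {
    append: (entry) => { entries.push(entry); },
    read: () => entries,
  };
}
const echoSandbox: StatelessSandbox = { run: (command) => `ran ${command}` };

// Two different Harness implementations write to the same Session through the same Sandbox.
const sharedSession = makeSession();
directHarness(sharedSession, echoSandbox, "build");
verifyingHarness(sharedSession, echoSandbox, "test");
const logEntries = sharedSession.read();
```

Replacing `directHarness` with `verifyingHarness` changes nothing about the Session or the Sandbox, which is what lets the contracts outlive any single Harness.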
4.3 Insights for Us
If you're building an Agent application, here are some points worth considering:
1. Don't encode too many assumptions in the Harness
// Bad: assuming the model is bad at long context
if (context.length > 4000) {
  context = summarize(context);
}

// Good: let the model decide
// (no hardcoded context length limits)
2. Consider conversation run persistence from day one
Even for a single-machine application, you should persist conversation history to files/database, not just keep it in memory. Benefits:
- Crash recovery
- Cross-device continuation
- Post-hoc analysis and debugging
3. Security boundary design upfront
Credentials management should be solved at the architecture level, not by "careful coding." Core principle: in environments that model-generated code can reach, there should be no sensitive information.
4. Leave room for model capability growth
Today you may need complex ReAct loops, forced Chain-of-Thought, and multi-step verification. But as model capabilities improve, these may all become redundant. When designing, ask yourself: if the model suddenly got 10x smarter, which parts of my code would become unnecessary?
5. Review and Summary
| Dimension | LangChain perspective | Anthropic perspective |
| --- | --- | --- |
| Focus | What a Harness is (components & responsibilities) | How a Harness evolves (decoupling & architecture) |
| Core idea | Agent = Model + Harness | Decouple Brain from Hands |
| Design approach | Provide general primitives, not predefined tools | Build interfaces that outlive implementations |
| Attitude toward models | Empower models to solve problems autonomously | Don't encode assumptions about models |
| Key innovation | Filesystem as core primitive | Session/Harness/Sandbox three-layer separation |
Both articles point to the same trend: the Harness is getting "thinner." As models get smarter, the Harness's responsibility gradually shifts from "guiding the model to do the right thing" to "giving the model the ability to do things." The best Harness is one that you barely notice.
References
- Scaling Managed Agents: Decoupling the Brain from the Hands - Anthropic Engineering
- The Anatomy of an Agent Harness - LangChain Blog