The market is shifting from "best model" to "best operating system around the model": harness, tools, permissions, memory, and execution boundary.
Raul Cosentino · AV Test Engineering
Mental Model
How the Four Layers Fit Together
We start with models/backends because the harness has to call something underneath it. That lower layer decides where inference runs, what it costs, how private it is, and whether the stack can stay local.
1. Model / Backend: cloud API · local server · NIM
2. Harness: agent loop, tools, workflow
3. Extension Layer: skills · hooks · plugins · MCP
4. Execution Boundary: approvals · sandbox · network
Developer / team intent (prompts, tickets, repo context, rules, and expected outcomes) flows in through the harness.
Harness
The harness chooses the model, runs the agent loop, invokes tools, and applies your policy. This is where Claude Code, Codex, Pi, OpenCode, Aider, Goose, Gemini CLI, and others actually differ.
How to read this stack: extensions plug into the harness rather than replacing it. Skills shape behavior, hooks enforce policy, plugins package capabilities, and MCP connects outside systems.
Layer 1
Models & Backends
This layer is becoming more interchangeable, but it still decides where the intelligence runs and what tradeoffs you accept.
Cloud APIs
Anthropic, OpenAI, Google, xAI, Mistral, DeepSeek, and aggregators such as OpenRouter, Together, Fireworks, and Groq Cloud. Best raw capability, easiest setup, recurring cost, and your code or prompts leave the machine unless the product offers local execution.
Local / self-hosted inference
Ollama (Jun 2023, 164K+ stars) is the easiest local path. vLLM (Feb 2023, 72K+ stars) is strong infra-first serving. LM Studio gives you a desktop-first local server. NVIDIA NIM (launched GTC, Mar 2024) is the production-oriented path for accelerated inference on NVIDIA GPUs.
Where the backend sits
Harness (Claude Code, Codex, Pi, OpenCode, Goose)
→
Backend / provider (the service the harness calls to reach a model)
→
Model runtime (the model process that actually performs inference)
The harness does not talk straight to GPU kernels or weights. It calls a backend/provider, and that backend either hosts the model itself or routes the request to a model runtime. That is why this layer controls cloud vs local, latency, privacy, and deployability.
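To make that hop concrete, here is a minimal sketch of the harness-to-backend call, assuming a local Ollama server exposing its OpenAI-compatible endpoint on localhost:11434. The model name and prompt are illustrative; any cloud provider with the same API shape would slot in by changing the URL and key.

```python
# Minimal sketch of the harness -> backend -> runtime hop: the "harness" here is
# one function that sends a chat request to an OpenAI-compatible backend (a local
# Ollama server in this case) and lets that backend drive the model runtime.
import requests

BACKEND_URL = "http://localhost:11434/v1/chat/completions"  # local Ollama; swap for any provider


def ask_backend(prompt: str, model: str = "qwen2.5-coder") -> str:
    """Send one chat turn to the backend and return the model's reply."""
    response = requests.post(
        BACKEND_URL,
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask_backend("Explain what this repo's Makefile does."))
```

Everything above this function (tool use, approvals, memory) is harness work; everything below the URL (scheduling, batching, GPU placement) belongs to the backend and runtime.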
Layer 2
The Harness Landscape
Same broad job, four lanes. The competition is around orchestration, memory, and execution — not just the model.
Cursor
Editor-first reference point. Inline editing, project rules, agent mode, background/cloud delegation from inside the IDE.
Cline
58K+ ★. Permission-conscious VS Code agent with strong tool integration and MCP support.
Copilot
The path from autocomplete into full agent mode inside the mainstream IDE + GitHub workflow.
Comparison
Feature Matrix
| Capability | Claude Code | Codex | Pi | OpenCode | Aider | Gemini CLI | Goose |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Main surface | Terminal | CLI / app / cloud / IDE | Terminal | Terminal / desktop / IDE | Terminal | Terminal | Terminal / desktop |
| Model flexibility | Anthropic-first | OpenAI-first | High | High | High | Google-first | High |
| Local backends | No | No | Yes | Yes | Yes | No | Yes |
| Project instructions | CLAUDE.md + skills | AGENTS.md | skills / templates | skills / rules | chat modes / config | GEMINI.md | config / extensions |
| Extension model | plugins + hooks + MCP | MCP + agent/server mode | extensions + packages | plugins + MCP | lightweight / git-centric | MCP | extensions + MCP |
| Approvals | tool / hook mediated | explicit approval modes | extension controlled | explicit permission config | host-terminal workflow | explicit confirmation | configurable tool permissions |
| Sandbox story | local machine + policy | built-in sandboxing | local machine + policy | local by default; external sandbox options | local repo / git safety | containerized sandbox option | local runtime; external isolation optional |
| Delegation style | subagents | remote / cloud workflows | build your own | agent patterns via plugins | single-loop pair programming | single-loop agent workflow | single agent + extensions |
Templates and rules play different roles: templates provide reusable starting structures for common tasks, while rules are ongoing instructions or constraints the harness should keep applying.
Infrastructure
MCP: Model Context Protocol
MCP (announced Nov 25, 2024 by Anthropic; donated to Linux Foundation Dec 9, 2025) standardizes how agents connect to external tools and context. The core value is simple: stop rewriting one-off adapters for every client/tool pair.
Without MCP: N clients × M tools → bespoke adapters everywhere. With MCP: N clients + M servers → one shared protocol in the middle.
MCP clients
Claude Code, Codex, Gemini CLI, Pi, OpenCode, Goose, Cursor, and others can speak MCP as clients.
What an MCP server is
A process or service that exposes tools/resources through the MCP protocol. It is not a JSON file. It is an adapter/service with its own auth, runtime, and capabilities.
Examples
Filesystem, databases, browsers, issue trackers, internal APIs, or custom business tools - all can be wrapped behind one protocol instead of many bespoke integrations.
In practice: each client learns one protocol, and each tool team publishes one server. A single MCP server can expose issue tracking, docs, chat, code hosting, storage, and search through one shared interface — one protocol in the middle, many tools at the edge.
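As a minimal sketch of what "one server per tool team" looks like, assuming the official MCP Python SDK (the `mcp` package) and its FastMCP helper; the issue-tracker tools and their return values here are illustrative stubs, not a real integration.

```python
# Minimal MCP server sketch: exposes two tools over the protocol so any MCP
# client (Claude Code, Codex, Gemini CLI, ...) can call them without a bespoke adapter.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("issue-tracker")  # server name advertised to MCP clients


@mcp.tool()
def get_ticket(ticket_id: str) -> str:
    """Return a short summary of a ticket (stubbed data for illustration)."""
    return f"Ticket {ticket_id}: 'login page 500s on empty password' (open, priority P2)"


@mcp.tool()
def search_tickets(query: str) -> list[str]:
    """Return matching ticket ids for a free-text query (stubbed data)."""
    return ["PROJ-101", "PROJ-142"]


if __name__ == "__main__":
    mcp.run()  # stdio transport by default; most harnesses launch the server as a subprocess
```

The tool signatures and docstrings become the schema the client sees, which is exactly the N + M economics described above: the tool team maintains this one process, and every client speaks to it the same way.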
Layer 3
Extensions & Delegation
Extensions make the harness yours. Delegation makes it scale. They work together — each subagent gets its own scoped skills, restricted tools, and enforced policies. In mature setups, all four primitives can ship together as one installable package.
📋
Skills
what each agent knows
🪝
Hooks
what gets enforced
🔌
Plugins
how to package it
🧠
Memory
what persists
📋 Skills → scoped expertise
Reusable domain know-how. The parent agent loads a planning skill; it delegates an Explorer subagent with only a code-search skill and read-only tools. Each agent gets exactly the knowledge it needs — nothing more.
🪝 Hooks → guardrails per agent
Lifecycle interception: run lint after every edit, require approval before destructive commands, auto-test on save. A hook on the Editor subagent runs tests after each patch. A hook on Tester blocks network access. "Please remember to X" becomes enforceable (a minimal sketch follows these cards).
🔌 Plugins → composable capabilities
Installable bundles that ship skills + hooks + MCP configs + tools together. Install a "test-engineering" plugin and every subagent in the team gets the right skills, tool permissions, and memory structure automatically.
🧠 Memory → shared context
Durable state that survives across agents and sessions. The parent reads AGENTS.md on start; subagents write findings to shared files; the next session picks up where the last one left off. Starts with markdown, grows into indexed retrieval.
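To make the hooks card concrete, here is a harness-agnostic sketch of a post-edit hook, assuming a Python project tested with pytest. It is not the hook API of any tool named above; it only shows the shape of "run the checks after every edit and feed failures back".

```python
# Harness-agnostic sketch of a post-edit hook: run the test suite after every
# file change and return failure output so the harness can inject it into the
# agent's next turn. Illustrative plumbing, not any specific harness's hook API.
import subprocess


def post_edit_hook(edited_path: str) -> str | None:
    """Run the project's tests after an edit; return failure output or None if green."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"], capture_output=True, text=True
    )
    if result.returncode != 0:
        # Returning the tail of the failure log turns "please remember to run
        # the tests" into an enforced step the agent cannot skip.
        return f"Tests failed after editing {edited_path}:\n{result.stdout[-2000:]}"
    return None
```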
Case study — generalized pattern
One Bundle, Every Layer
The strongest extension packs do not add just one feature. They bundle tool access, reusable skills, workflow rules, hooks, and packaged subagents so teams can adopt a working system instead of assembling the pieces by hand.
tool-pack <service> <command>
├── issue tracker → triage + planning
├── docs → search + summarize
├── chat → history + reporting
├── code host → reviews + diffs
├── people directory → ownership lookup
├── storage → file retrieval
└── shell → allowlisted only
one command surface · multiple business systems · one packaging model
📋 skills/
task-specific expertise and playbooks
🪝 hooks/
automation and guardrails after actions
📏 rules/
team conventions and safety policies
🤖 agents/
specialized helpers for parallel work
Team chat → structured report
"Analyze the support channel from the last 2 weeks. Group issues by topic, identify likely owners, and generate an HTML summary."
Code review workflow
"Fetch the open review, summarize the diff, flag risks, and return feedback in a consistent template."
Periodic self-review draft
"Search recent work artifacts, identify contributions and cross-team impact, then draft a structured self-review with evidence."
Triage → ticket → fix → commit
"Turn an incoming issue into a ticket, propose a plan, make the change, run checks, and prepare the patch."
Why this matters: the winning pattern is not just “more tools.” It is coherent packaging: one install gives a team the protocols, workflows, guardrails, and reusable expertise needed to turn raw model capability into repeatable work.
Layer 4
Execution Boundaries
Two different questions that people keep mixing together.
Approval policy
When must the agent stop and ask?
Codex makes approval modes explicit. Gemini CLI shows commands/diffs before confirmation. OpenCode has permission rules. Pi can intercept actions through extensions. This is the governance layer.
Technical sandbox
What can the agent technically touch even if approved?
Some harnesses bundle their own sandbox (Codex runs code in containers by default). Others need external sandbox infrastructure - that is where Daytona comes in.
Daytona (open-source sandbox infra)
Daytona is an open-source platform (Feb 2024, 63K+ stars) that spins up isolated development environments in under 90ms. It provides Python and TypeScript SDKs with file, git, and process APIs. Harnesses like OpenCode and Pi can point their execution at a Daytona sandbox instead of your local machine - so agent-generated code runs in a throwaway container, not on your workstation.
Codex (built-in sandbox)
Codex takes a different approach: the sandbox is part of the product. Agent code runs in a restricted container by default. No external infra to configure - isolation comes out of the box.
Gemini CLI
Offers a containerized sandbox option for command execution. All shell commands require explicit confirmation before running. Sandboxing is opt-in, not default.
Why this distinction matters: Daytona is infrastructure (where code runs). A harness is what orchestrates the agent. Some products bundle both; others let you compose them. Knowing the difference prevents wrong architecture decisions.
Engineering depth
Harness Engineering: The Hard Problems
Five challenges every harness has to solve. Getting these wrong is why agents feel brittle.
Context Rot
Agent performance degrades as the context window fills with stale tool outputs, old reasoning, and accumulated noise. Three mitigations:
Compaction — summarize and offload when context approaches limits, then continue in a fresh window.
Tool call offloading — keep head+tail tokens, dump full output to the filesystem. The model reads the file only if needed.
Progressive disclosure — skills load instructions on demand instead of stuffing 15 MCP server definitions into context at startup.
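A minimal sketch of the tool-call offloading idea, assuming the harness controls what tool output reaches the context; the directory name and character budget are illustrative.

```python
# Tool-call offloading sketch: keep only the head and tail of a large tool
# output in context, and write the full output to a file the agent can read
# later if it actually needs it.
from pathlib import Path

OFFLOAD_DIR = Path(".agent/tool-output")


def offload_tool_output(call_id: str, output: str, keep_chars: int = 1500) -> str:
    """Return a trimmed view of `output`; persist the full text to disk."""
    OFFLOAD_DIR.mkdir(parents=True, exist_ok=True)
    full_path = OFFLOAD_DIR / f"{call_id}.txt"
    full_path.write_text(output)
    if len(output) <= 2 * keep_chars:
        return output
    head, tail = output[:keep_chars], output[-keep_chars:]
    return (
        f"{head}\n...[{len(output) - 2 * keep_chars} chars omitted, "
        f"full output at {full_path}]...\n{tail}"
    )
```

The same trick generalizes to compaction: the summary stays in context, the raw transcript goes to disk.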
Long-Horizon Execution
Complex tasks span multiple context windows. Agents stop early, lose coherence, or forget the goal.
Ralph Loops — a hook intercepts the model's exit attempt and reinjects the original prompt into a fresh context window. The filesystem bridges windows: state persists even when context resets.
Planning files — agents maintain a plan in a file, check off steps, and re-read it after each compaction to stay on track.
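Here is a sketch of the planning-file pattern under those two ideas: the goal and the plan live on disk, so every fresh window re-reads them. The task text is illustrative, and run_agent_turn is a placeholder for whatever call executes one context window of work in a given harness.

```python
# Planning-file sketch: the agent's plan persists on disk, so a fresh context
# window (after compaction or a Ralph-style restart) can re-read the goal and
# the remaining steps instead of drifting or stopping early.
from pathlib import Path

PLAN = Path("PLAN.md")
ORIGINAL_PROMPT = "Migrate the config loader and keep all tests green."  # illustrative task


def run_agent_turn(prompt: str) -> bool:
    """Placeholder: run one agent turn; return True once the task is complete."""
    print(prompt[:200])
    return True


def ralph_loop(max_windows: int = 10) -> None:
    for _ in range(max_windows):
        plan = PLAN.read_text() if PLAN.exists() else "(no plan yet; write one first)"
        # Every window gets the original goal plus the persisted plan, so state
        # survives context resets even though the window itself does not.
        if run_agent_turn(f"{ORIGINAL_PROMPT}\n\nCurrent plan and progress:\n{plan}"):
            break


if __name__ == "__main__":
    ralph_loop()
```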
Self-Verification Loops
An agent that writes code but cannot check if it works is guessing. The harness orchestrates verification:
Write code → run tests → inspect output → fix errors → repeat. Hooks can auto-run a test suite after every edit and loop back on failure with the error message. Linting, type checking, and screenshot comparison are all forms of automated verification the harness provides — not the model.
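A minimal sketch of that loop, assuming a pytest-based project; propose_patch is a placeholder for the model call that writes or fixes code. The point is that the harness, not the model, owns the loop and the test run.

```python
# Self-verification loop sketch: edit, run the test suite, feed failures back
# to the model, repeat until green or out of attempts.
import subprocess


def run_tests() -> tuple[bool, str]:
    """Run the suite and return (passed, combined output)."""
    result = subprocess.run(["python", "-m", "pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def propose_patch(feedback: str) -> None:
    """Placeholder: ask the model for a fix given the latest test output, then apply it."""


def verify_loop(max_attempts: int = 5) -> bool:
    feedback = ""
    for _ in range(max_attempts):
        propose_patch(feedback)
        passed, output = run_tests()
        if passed:
            return True
        feedback = output[-3000:]  # the tail of the failure log is usually the useful part
    return False
```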
Filesystem as Foundational Primitive
The filesystem is arguably the most foundational harness component — not because it is fancy, but because everything else depends on it.
It enables: durable state across sessions, work offloading when context fills up, multi-agent collaboration through shared files, memory via files like AGENTS.md that get injected on start, and versioning through git (rollback, branching, experiment tracking). The filesystem is how context windows survive death.
Model-Harness Co-Training
Codex and Claude Code are now post-trained with their harness in the loop. The model is optimized for its own harness's tool-calling patterns. This creates better native performance — but also a risk of overfitting to that harness.
Proof point: Terminal Bench 2.0 shows the same model (Opus 4.6) scoring dramatically differently across harnesses. The harness is not just scaffolding — it is a performance multiplier. Choosing the right harness for your task matters as much as choosing the right model.
The core insight: these are not model problems — they are systems problems. A smarter model still needs compaction, still needs verification, still needs a filesystem to persist state. As models improve, some harness features may be absorbed (better native planning, fewer hallucinations), but the systems layer will keep mattering — just like prompt engineering still matters even as models get better at following instructions.
Your GPU as Backend
Local Inference on Your Own Hardware
Your GPU becomes the backend. Pi and OpenCode connect to local inference servers - your data never leaves the machine.
⌨️
Harness
Pi / OpenCode
🔀
Provider
router
🖥️
Inference
NIM / vLLM / Ollama
⚡
GPU
local accelerator
NIM
Production-grade model serving for accelerated deployments across workstation, data center, edge, or cloud setups.
vLLM
Flexible infra-first serving. Best for custom deployment. GPU-optimized with continuous batching.
Ollama
Easiest local dev path. One command: ollama pull deepseek-coder. Exposes localhost:11434.
Top local coding models: Qwen3-Coder (Alibaba, Jun 2025), DeepSeek-Coder-V2 (DeepSeek, Jun 2024), Codestral (Mistral, May 2024), CodeGemma (Google, Apr 2024). All run via the same Ollama/vLLM/NIM stack — the model is swappable, the infrastructure stays.
Your code stays 100% local. Zero API costs. Full privacy.
The Next Wave
Memory Is the Compounding Advantage
"Memory" is not one thing. It is a stack, and each layer needs different infrastructure.
| Memory type | What it stores | Benefit | How it's built |
| --- | --- | --- | --- |
| Working | current task, selected files, active context window | the agent can act on what's in front of it right now | the live prompt/context window — managed by the harness automatically |
| Procedural | how to work: rules, checklists, conventions, step-by-step instructions | agents follow your team's process consistently — no drift between sessions or people; new agents onboard instantly | write markdown files: CLAUDE.md, AGENTS.md, SKILL.md, prompt templates. Version-controlled, human-edited. Zero infra cost. |
| Semantic | what is true: architecture facts, entity summaries, design decisions, domain knowledge | agents recall facts they've never seen in this session — "how does the auth system work?" gets a real answer instead of hallucination | starts as markdown docs. Scales into indexed retrieval: BM25, vector embeddings, or tools like ByteRover that auto-extract and query a local context tree. |
| Episodic | what happened: past runs, experiment results, task summaries, audit trails | agents learn from prior attempts — don't repeat failed approaches, build on what worked | session logs, experiment result files (results.tsv), task summaries. Capture outcomes and decisions, not every raw token. |
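The cheapest row in that table is procedural memory, and the mechanism is just file injection at session start. A minimal sketch, assuming the file names from the table; the base prompt and extra path are illustrative.

```python
# Procedural-memory sketch: at session start, read the team's instruction files
# and prepend them to the system prompt so every agent follows the same process.
from pathlib import Path

MEMORY_FILES = ["AGENTS.md", "CLAUDE.md", ".agent/conventions.md"]  # last path is illustrative


def build_system_prompt(base: str = "You are the project coding agent.") -> str:
    sections = [base]
    for name in MEMORY_FILES:
        path = Path(name)
        if path.exists():
            sections.append(f"## {name}\n{path.read_text()}")
    return "\n\n".join(sections)
```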
Learning from Obsidian's design
Obsidian popularized a powerful pattern: every document declares its relationships via [[wikilinks]], frontmatter tags, and backlinks. This turns a folder of markdown into a traversable graph - agents navigate by relationship (neighbors, paths, clusters) instead of just searching text. The pattern works in any markdown system; you do not need Obsidian to use it.
A standardization wave
LSP (Language Server Protocol, 2016) standardized how editors get completions, diagnostics, and refactoring - any editor, any language, one protocol. MCP (2024) does the same for AI tool integrations. The knowledge-graph pattern is doing it for documentation structure. Obsidian Skills (Jan 2026, 12K+ stars) and the Agent Skills spec both build on this idea.
Tip (early-stage experiment): adding links:, cluster:, and tags: to your doc frontmatter costs nothing. It gives agents graph traversal and cluster discovery on top of regular search. We are prototyping a small Python tool that does pathfinding and validation on plain markdown - early results are promising but this is still work in progress, not a proven workflow.
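To show the kind of traversal such a tool enables (this is an illustrative sketch, not the prototype mentioned above): scan a folder of markdown notes, collect [[wikilink]] edges, and find a path between two notes so an agent can navigate by relationship rather than text search.

```python
# Wikilink-graph sketch: build an adjacency map from [[wikilinks]] in a markdown
# folder and run a breadth-first search between two note names.
import re
from collections import deque
from pathlib import Path

WIKILINK = re.compile(r"\[\[([^\]|#]+)")  # capture the link target, ignoring aliases/anchors


def build_graph(vault: Path) -> dict[str, set[str]]:
    graph: dict[str, set[str]] = {}
    for note in vault.rglob("*.md"):
        targets = {m.strip() for m in WIKILINK.findall(note.read_text(errors="ignore"))}
        graph[note.stem] = targets
    return graph


def find_path(graph: dict[str, set[str]], start: str, goal: str) -> list[str] | None:
    """Breadth-first search over wikilink edges; returns a note-name path or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```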
Making it real
Memory in Practice: Tools That Work Today
The theory is clear. Here are two concrete tools you can install this week — one for semantic memory, one for procedural.
💻
Agent
codes & learns
📝
brv curate
auto-extract context
🌳
Context Tree
local markdown
🔍
brv query
semantic retrieval
ByteRover CLI — Semantic Memory
What it solves: agents forget facts between sessions. "How does our auth system work?" gets hallucinated instead of recalled.
How it works: agents call brv curate during work to auto-extract facts and store them; future sessions call brv query "how does X work?" and get relevant knowledge back instantly. The agent retrieves what is true — architecture, decisions, domain knowledge — without needing it pre-loaded in context.
Design: CLI-first (MCP tool definitions alone ate 26% of context). Local markdown in .brv/context-tree/. Zero infra. Team sync via brv push/pull.
Skill Graphs & [[Wikilinks]] — Procedural Memory
What it solves: agents don't follow your team's process. Every session starts from scratch — no consistent checklists, conventions, or step-by-step workflows.
How it works: structured markdown files (SKILL.md, AGENTS.md, CLAUDE.md) declare how to work — rules, steps, conventions. Files link to each other via [[wikilinks]] and YAML frontmatter. Agents navigate by graph relationship (neighbors, paths, clusters) instead of searching text.
Already in use: Pi skills, Obsidian Skills (12K+ ★), Agent Skills spec, and our own test-engineering knowledge base. Zero special tooling — just markdown + links.
Adoption path: start with procedural memory (rules + skills files) — it costs nothing and compounds immediately. Add semantic search (ByteRover or equivalent) when retrieval quality matters. Layer episodic memory (experiment logs, session summaries) last. Trying to build all four memory types at once is a recipe for abandoning the effort.
Strategic view
The Autonomy Spectrum & Where the Moat Forms
Commodity
raw model access · plain chat · basic read/write/shell tools · single-model wrappers · manual context dumping
Compounding advantage (durable, hard to replicate)
Adoption path: harness first, add memory, then experiment with autonomous layers. The harness-first path is lower-risk and compounds faster.
Just released — March 6, 2026
Autoresearch: The Autonomous Experiment Loop
A small side project from Andrej Karpathy — 630 lines of Python — exploded because of what it implies for harness-driven research: an AI agent runs ML experiments overnight on a single GPU. 21K+ stars in 3 days.
📋 program.md — human writes
The agent's "skill file." Research direction, constraints, evaluation criteria. Karpathy's word: a lightweight skill. The human engineers the research org, not the code.
🧬 train.py — agent edits
Full GPT model, optimizer (Muon + AdamW), training loop. Architecture, hyperparameters, batch size — everything is fair game.
🔒 prepare.py — fixed
Data prep, BPE tokenizer, evaluation utilities. The stable foundation that never changes between experiments.
630
lines of Python
5 min
per experiment
21K+
stars in 3 days
Andrej Karpathy
"The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement."
Tobi Lütke (Shopify CEO)
"OK this thing is totally insane. Before going to bed I told my agent to read this repo and make a version for the qmd query-expansion model. Woke up to +19% score on a 0.8B model (higher than previous 1.6B) after 8 hours and 37 experiments. It's mesmerizing to read it reasoning its way through the experiments."
Why this ties to our topic: autoresearch is a minimal harness. The human provides procedural memory (program.md), the harness orchestrates execution (train.py edits), a fixed metric provides the evaluation boundary (val_bpb), and the loop runs autonomously. Your four-layer model in 630 lines — and it works on real problems overnight.
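For intuition, here is a minimal sketch of that outer loop. It is not Karpathy's code; it only mirrors the described shape (edit → short training run → evaluate → keep or revert). The propose_edit placeholder stands in for the agent call, and the assumption that train.py prints its validation bits-per-byte as its last output line is illustrative.

```python
# Autoresearch-style outer loop sketch: the agent proposes an edit to train.py,
# a short run produces a metric (val_bpb), and the loop keeps or reverts the edit.
import shutil
import subprocess


def propose_edit(history: list[str]) -> None:
    """Placeholder: have the agent rewrite train.py guided by program.md and past results."""


def run_experiment() -> float:
    """Run a short training job and parse the metric it prints (assumed: last line is val_bpb)."""
    out = subprocess.run(["python", "train.py"], capture_output=True, text=True).stdout
    return float(out.strip().splitlines()[-1])


best, history = float("inf"), []
for step in range(37):                       # e.g. an overnight budget of experiments
    shutil.copy("train.py", "train.py.bak")  # snapshot so a bad edit can be reverted
    propose_edit(history)
    score = run_experiment()
    if score < best:                         # lower bits-per-byte is better
        best = score
    else:
        shutil.copy("train.py.bak", "train.py")
    history.append(f"step {step}: val_bpb={score:.4f} (best {best:.4f})")
```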
Productivity signals
What the Workflow Gains Look Like
The effect is easier to see in workflow metrics than in a single universal score.
What to measure on your team
Time-to-first-useful-output - first good draft, query, test, or patch
Time-to-first-passing-change - from task start to green checks or validated result
Context reacquisition time - how long it takes to re-enter a repo or workflow after interruption
Human turns per task - prompts, approvals, and corrections needed before completion
Unattended completion rate - % of tasks that finish with only bounded oversight
Rework / rollback rate - fixes that looked fast but created churn later
External anchor results vary a lot
Controlled lab task: GitHub/Microsoft reported developers completed a JavaScript HTTP server task 55.8% faster with Copilot.
Experienced OSS developers in own repos: METR's 2025 randomized study found AI tools made participants 19% slower on average.
Transcript-based estimate: Anthropic estimated Claude reduced completion time by about 80% across many real conversations, but that is a model-based estimate, not a randomized trial.
Best proxy for harness quality
Less context setup, fewer back-and-forth turns, faster validated changes, and less cleanup afterward.
Use capability benchmarks carefully
SWE-bench or Terminal-Bench are useful for model+agent capability, but they are not the same thing as team productivity.
Practical evaluation pattern
A short bakeoff on real tasks works better than a generic benchmark: same repo, same issue types, same guardrails, then compare cycle time and rework.
Takeaway: the meta upgrade should be sold as better task flow and lower coordination overhead - not just "the model is smarter."
Recommendations
Best Stack for Your Case
Team adopt-now
Claude Code, Codex, or OpenCode • add MCP only where it solves a real integration problem • define procedural memory first: rules, skills, AGENTS/CLAUDE files • use hooks/approvals for guardrails
Best for fast adoption and low organizational friction.
Local power-user
Pi or OpenCode + local inference • Ollama / vLLM / NIM / LM Studio backends • skills for procedural memory • markdown + SQLite or index-backed semantic memory • hooks/extensions for custom automation
Can be near-zero API cost if you actually run local models.
Experimental autonomy
OpenHands or OpenClaw-style systems • background runs and delegated agents • stronger isolation/sandboxing required • memory plugins and retrieval layers become core infra
Worth exploring, but not the default recommendation for most teams.
How we got here - part 1
Timeline: From Shell Customization to Shared Agent Infrastructure
2009
Oh My Zsh popularizes the community-bundle pattern: plugins, themes, and sane defaults around a raw terminal tool.
2015-16
VS Code launches; LSP standardized. The Language Server Protocol decouples editor features from editors - any tool can provide completions, diagnostics, and refactoring through one protocol. The same pattern MCP will later use for AI tools.
Jun 2017
"Attention Is All You Need" introduces the Transformer architecture. Everything that follows - GPT, BERT, Codex, Claude - builds on this paper.
2018-19
First AI code completions. TabNine launches (Nov 2018) and ships GPT-2-based code completion with Deep TabNine (2019). Kite offers ML-powered Python autocomplete. IntelliCode brings AI-assisted completions to VS Code. These prove the concept but stay narrow - single-line suggestions, not agentic loops.
May 2020
GPT-3 (175B params). Demonstrates that scale unlocks qualitative capability jumps. Code generation becomes plausible at this scale, leading directly to Codex research.
Jun 29, 2021
GitHub Copilot technical preview. Powered by OpenAI's code-specialized model (confusingly also called "Codex" - a different product from the Codex CLI that launches in 2025). Makes AI pair programming mainstream, but the interaction is still suggestion / accept.
Jun 21, 2022
GitHub Copilot GA cements autocomplete and inline assistance as a normal part of coding workflows.
2023-2024
Agentic coding loops emerge. IDE and terminal tools start to read files, edit code, run commands, and iterate instead of only suggesting text.
Nov 25, 2024
MCP is introduced. Tool integration starts becoming shared infrastructure instead of one-off adapters.
The before / after
Before: editor assistance and autocomplete. After: systems that can inspect repos, edit files, run commands, call tools, and continue with feedback.
Why MCP matters here
Once clients and tools can speak one shared protocol, the ecosystem starts compounding. That is the bridge from isolated wrappers to harnesses with a real extension layer.
Best reading of this era
This was the shift from AI features inside coding tools to coding tools that behave like agents.
How we got here - part 2
Timeline: The Harness Era (2025 - now)
Feb 24, 2025
Claude Code is introduced alongside Claude 3.7 Sonnet, signaling that the model now ships with a purpose-built coding shell.
May 2025
Claude Code reaches general availability. The harness model moves from experiment to mainstream product category.
May 19, 2025
GitHub Copilot coding agent pushes background task execution and PR-style delegation into the mainstream.
Jun 25, 2025
Gemini CLI launches as an open-source terminal agent, reinforcing that the harness race is not just an Anthropic/OpenAI story.
Nov 5, 2025
Linus Torvalds on vibe coding at Open Source Summit Korea: "fairly positive" about it for learning and low-stakes work, but warned it "may be a horrible, horrible idea from a maintenance standpoint" for production code.
Dec 9, 2025
MCP is donated to the Linux Foundation's Agentic AI Foundation, a sign that the protocol is becoming foundational infrastructure.
Jan 11, 2026
Torvalds publishes AudioNoise — his hobby guitar pedal repo includes a Python visualizer "basically written by vibe-coding" using Google Antigravity. 67 days after his cautious remarks. Consistent with his position: fine for learning, not for the kernel.
Feb 2–5, 2026
Codex app and GPT-5.3-Codex launch. The story expands from "agent in a terminal" to coordinated agents, computer-use workflows, and long-running work.
Feb 2026
Claude Code turns one. At the anniversary event in SF, Anthropic's Boris Cherny (Director of Product, Claude Code) noted the team had shifted to weekly planning cycles — monthly and quarterly plans couldn't keep up with the pace of change in this space.
Mar 6, 2026
Andrej Karpathy releases Autoresearch — 630 lines of Python that let an AI agent autonomously run ML experiments on a single GPU overnight. Edit → train 5 min → evaluate → keep/discard → repeat. 21K+ stars in 3 days. The program.md is a "skill" — the human engineers the research org, the agent does the science.
Mar 10, 2026 (yesterday)
Google ships Gemini Embedding 2 — first natively multimodal embedding model. Text, images, video, audio, and PDFs map into a single embedding space. Practical gain: one query can now search across code, screenshots, architecture diagrams, and meeting recordings at once — multimodal RAG without stitching separate pipelines per format.
What actually changed
The model stopped being the whole product. The durable differentiators are now harness design, extension depth, permissions, memory, and execution boundaries.
What is coming next
More MCP standardization, more delegated/background workflows, and more memory/governance infrastructure attached to coding agents.
The shift that matters
The move from AI features inside coding tools (autocomplete, inline suggestions) to AI tools that behave like agents (read repos, run commands, iterate on feedback, delegate subtasks) happened in under four years.
Key Takeaway
The Competition Is No Longer "Which Coding Agent Is Best?"
It's which harness architecture + extension system + execution boundary fits your workflow. Memory and autonomous agents are the next wave. MCP is becoming infrastructure. Pick the layer stack that matches your team - and make it yours.