The market is shifting from "best model" to "best operating system around the model": harness, tools, permissions, memory, and execution boundary.
Raul Cosentino · AV Test Engineering
Mental Model
How the Four Layers Fit Together
We start with models/backends because the harness has to call something underneath it. That lower layer decides where inference runs, what it costs, how private it is, and whether the stack can stay local.
1. Model / Backend: cloud API · local server · NIM
2. Harness: agent loop, tools, workflow
3. Extension Layer: skills · hooks · plugins · MCP
4. Execution Boundary: approvals · sandbox · network
Developer / team intent (prompts, tickets, repo context, rules, and expected outcomes) flows in through the harness.
Harness
The harness chooses the model, runs the agent loop, invokes tools, and applies your policy. This is where Claude Code, Codex, Pi, OpenCode, Aider, Goose, Gemini CLI, and others actually differ.
How to read this stack: extensions plug into the harness rather than replacing it. Skills shape behavior, hooks enforce policy, plugins package capabilities, and MCP connects outside systems.
Layer 1
Models & Backends
This layer is becoming more interchangeable, but it still decides where the intelligence runs and what tradeoffs you accept.
Cloud APIs
Anthropic, OpenAI, Google, xAI, Mistral, DeepSeek, and aggregators such as OpenRouter, Together, Fireworks, and Groq Cloud. Best raw capability, easiest setup, recurring cost, and your code or prompts leave the machine unless the product offers local execution.
Local / self-hosted inference
Ollama (Jun 2023, 164K+ stars) is the easiest local path. vLLM (Feb 2023, 72K+ stars) is strong infra-first serving. LM Studio gives you a desktop-first local server. NVIDIA NIM (launched GTC, Mar 2024) is the production-oriented path for accelerated inference on NVIDIA GPUs.
Where the backend sits
Harness (Claude Code, Codex, Pi, OpenCode, Goose)
→
Backend / provider (the service the harness calls to reach a model)
→
Model runtime (the model process that actually performs inference)
The harness does not talk straight to GPU kernels or weights. It calls a backend/provider, and that backend either hosts the model itself or routes the request to a model runtime. That is why this layer controls cloud vs local, latency, privacy, and deployability.
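To make that hop concrete, here is a minimal sketch of the harness-to-backend call, assuming a local Ollama server exposing its OpenAI-compatible endpoint on localhost:11434. The model name and prompt are illustrative; any cloud provider with the same API shape would slot in by changing the URL and key.

```python
# Minimal sketch of the harness -> backend -> runtime hop: the "harness" here is
# one function that sends a chat request to an OpenAI-compatible backend (a local
# Ollama server in this case) and lets that backend drive the model runtime.
import requests

BACKEND_URL = "http://localhost:11434/v1/chat/completions"  # local Ollama; swap for any provider


def ask_backend(prompt: str, model: str = "qwen2.5-coder") -> str:
    """Send one chat turn to the backend and return the model's reply."""
    response = requests.post(
        BACKEND_URL,
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask_backend("Explain what this repo's Makefile does."))
```

Everything above this function (tool use, approvals, memory) is harness work; everything below the URL (scheduling, batching, GPU placement) belongs to the backend and runtime.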
Layer 2
The Harness Landscape
Same broad job, four lanes. The competition is around orchestration, memory, and execution — not just the model.
Cursor
Editor-first reference point. Inline editing, project rules, agent mode, background/cloud delegation from inside the IDE.
Cline
58K+ ★. Permission-conscious VS Code agent with strong tool integration and MCP support.
Copilot
The path from autocomplete into full agent mode inside the mainstream IDE + GitHub workflow.
Comparison
Feature Matrix
| Capability | Claude Code | Codex | Pi | OpenCode | Aider | Gemini CLI | Goose |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Main surface | Terminal | CLI / app / cloud / IDE | Terminal | Terminal / desktop / IDE | Terminal | Terminal | Terminal / desktop |
| Model flexibility | Anthropic-first | OpenAI-first | High | High | High | Google-first | High |
| Local backends | No | No | Yes | Yes | Yes | No | Yes |
| Project instructions | CLAUDE.md + skills | AGENTS.md | skills / templates | skills / rules | chat modes / config | GEMINI.md | config / extensions |
| Extension model | plugins + hooks + MCP | MCP + agent/server mode | extensions + packages | plugins + MCP | lightweight / git-centric | MCP | extensions + MCP |
| Approvals | tool / hook mediated | explicit approval modes | extension controlled | explicit permission config | host-terminal workflow | explicit confirmation | configurable tool permissions |
| Sandbox story | local machine + policy | built-in sandboxing | local machine + policy | local by default; external sandbox options | local repo / git safety | containerized sandbox option | local runtime; external isolation optional |
| Delegation style | subagents | remote / cloud workflows | build your own | agent patterns via plugins | single-loop pair programming | single-loop agent workflow | single agent + extensions |
Templates and rules play different roles: templates provide reusable starting structures for common tasks, while rules are ongoing instructions or constraints the harness should keep applying.
Infrastructure
MCP: Model Context Protocol
MCP (announced Nov 25, 2024 by Anthropic; donated to Linux Foundation Dec 9, 2025) standardizes how agents connect to external tools and context. The core value is simple: stop rewriting one-off adapters for every client/tool pair.
Without MCP: N clients × M tools → bespoke adapters everywhere. With MCP: N clients + M servers → one shared protocol in the middle.
MCP clients
Claude Code, Codex, Gemini CLI, Pi, OpenCode, Goose, Cursor, and others can speak MCP as clients.
What an MCP server is
A process or service that exposes tools/resources through the MCP protocol. It is not a JSON file. It is an adapter/service with its own auth, runtime, and capabilities.
Examples
Filesystem, databases, browsers, issue trackers, internal APIs, or custom business tools - all can be wrapped behind one protocol instead of many bespoke integrations.
In practice: each client learns one protocol, and each tool team publishes one server. A single MCP server can expose issue tracking, docs, chat, code hosting, storage, and search through one shared interface — one protocol in the middle, many tools at the edge.
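As a minimal sketch of what "one server per tool team" looks like, assuming the official MCP Python SDK (the `mcp` package) and its FastMCP helper; the issue-tracker tools and their return values here are illustrative stubs, not a real integration.

```python
# Minimal MCP server sketch: exposes two tools over the protocol so any MCP
# client (Claude Code, Codex, Gemini CLI, ...) can call them without a bespoke adapter.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("issue-tracker")  # server name advertised to MCP clients


@mcp.tool()
def get_ticket(ticket_id: str) -> str:
    """Return a short summary of a ticket (stubbed data for illustration)."""
    return f"Ticket {ticket_id}: 'login page 500s on empty password' (open, priority P2)"


@mcp.tool()
def search_tickets(query: str) -> list[str]:
    """Return matching ticket ids for a free-text query (stubbed data)."""
    return ["PROJ-101", "PROJ-142"]


if __name__ == "__main__":
    mcp.run()  # stdio transport by default; most harnesses launch the server as a subprocess
```

The tool signatures and docstrings become the schema the client sees, which is exactly the N + M economics described above: the tool team maintains this one process, and every client speaks to it the same way.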
Layer 3
Extensions & Delegation
Extensions make the harness yours. Delegation makes it scale. They work together — each subagent gets its own scoped skills, restricted tools, and enforced policies. In mature setups, all four primitives can ship together as one installable package.
📋
Skills
what each agent knows
🪝
Hooks
what gets enforced
🔌
Plugins
how to package it
🧠
Memory
what persists
📋 Skills → scoped expertise
Reusable domain know-how. The parent agent loads a planning skill; it delegates an Explorer subagent with only a code-search skill and read-only tools. Each agent gets exactly the knowledge it needs — nothing more.
🪝 Hooks → guardrails per agent
Lifecycle interception: run lint after every edit, require approval before destructive commands, auto-test on save. A hook on the Editor subagent runs tests after each patch. A hook on Tester blocks network access. "Please remember to X" becomes enforceable (a minimal sketch follows these cards).
🔌 Plugins → composable capabilities
Installable bundles that ship skills + hooks + MCP configs + tools together. Install a "test-engineering" plugin and every subagent in the team gets the right skills, tool permissions, and memory structure automatically.
🧠 Memory → shared context
Durable state that survives across agents and sessions. The parent reads AGENTS.md on start; subagents write findings to shared files; the next session picks up where the last one left off. Starts with markdown, grows into indexed retrieval.
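To make the hooks card concrete, here is a harness-agnostic sketch of a post-edit hook, assuming a Python project tested with pytest. It is not the hook API of any tool named above; it only shows the shape of "run the checks after every edit and feed failures back".

```python
# Harness-agnostic sketch of a post-edit hook: run the test suite after every
# file change and return failure output so the harness can inject it into the
# agent's next turn. Illustrative plumbing, not any specific harness's hook API.
import subprocess


def post_edit_hook(edited_path: str) -> str | None:
    """Run the project's tests after an edit; return failure output or None if green."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"], capture_output=True, text=True
    )
    if result.returncode != 0:
        # Returning the tail of the failure log turns "please remember to run
        # the tests" into an enforced step the agent cannot skip.
        return f"Tests failed after editing {edited_path}:\n{result.stdout[-2000:]}"
    return None
```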
Case study — generalized pattern
One Bundle, Every Layer
The strongest extension packs do not add just one feature. They bundle tool access, reusable skills, workflow rules, hooks, and packaged subagents so teams can adopt a working system instead of assembling the pieces by hand.
tool-pack <service> <command>
├── issue tracker → triage + planning
├── docs → search + summarize
├── chat → history + reporting
├── code host → reviews + diffs
├── people directory → ownership lookup
├── storage → file retrieval
└── shell → allowlisted only
one command surface · multiple business systems · one packaging model
📋 skills/
task-specific expertise and playbooks
🪝 hooks/
automation and guardrails after actions
📏 rules/
team conventions and safety policies
🤖 agents/
specialized helpers for parallel work
Team chat → structured report
"Analyze the support channel from the last 2 weeks. Group issues by topic, identify likely owners, and generate an HTML summary."
Code review workflow
"Fetch the open review, summarize the diff, flag risks, and return feedback in a consistent template."
Periodic self-review draft
"Search recent work artifacts, identify contributions and cross-team impact, then draft a structured self-review with evidence."
Triage → ticket → fix → commit
"Turn an incoming issue into a ticket, propose a plan, make the change, run checks, and prepare the patch."
Why this matters: the winning pattern is not just “more tools.” It is coherent packaging: one install gives a team the protocols, workflows, guardrails, and reusable expertise needed to turn raw model capability into repeatable work.
Layer 4
Execution Boundaries
Two different questions that people keep mixing together.
Approval policy
When must the agent stop and ask?
Codex makes approval modes explicit. Gemini CLI shows commands/diffs before confirmation. OpenCode has permission rules. Pi can intercept actions through extensions. This is the governance layer.
Technical sandbox
What can the agent technically touch even if approved?
Some harnesses bundle their own sandbox (Codex runs code in containers by default). Others need external sandbox infrastructure - that is where Daytona comes in.
Daytona (open-source sandbox infra)
Daytona is an open-source platform (Feb 2024, 63K+ stars) that spins up isolated development environments in under 90ms. It provides Python and TypeScript SDKs with file, git, and process APIs. Harnesses like OpenCode and Pi can point their execution at a Daytona sandbox instead of your local machine - so agent-generated code runs in a throwaway container, not on your workstation.
Codex (built-in sandbox)
Codex takes a different approach: the sandbox is part of the product. Agent code runs in a restricted container by default. No external infra to configure - isolation comes out of the box.
Gemini CLI
Offers a containerized sandbox option for command execution. All shell commands require explicit confirmation before running. Sandboxing is opt-in, not default.
Why this distinction matters: Daytona is infrastructure (where code runs). A harness is what orchestrates the agent. Some products bundle both; others let you compose them. Knowing the difference prevents wrong architecture decisions.
Engineering depth
Harness Engineering: The Hard Problems
Five challenges every harness has to solve. Getting these wrong is why agents feel brittle.
Context Rot
Agent performance degrades as the context window fills with stale tool outputs, old reasoning, and accumulated noise. Three mitigations:
Compaction — summarize and offload when context approaches limits, then continue in a fresh window.
Tool call offloading — keep head+tail tokens, dump full output to the filesystem. The model reads the file only if needed.
Progressive disclosure — skills load instructions on demand instead of stuffing 15 MCP server definitions into context at startup.
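A minimal sketch of the tool-call offloading idea, assuming the harness controls what tool output reaches the context; the directory name and character budget are illustrative.

```python
# Tool-call offloading sketch: keep only the head and tail of a large tool
# output in context, and write the full output to a file the agent can read
# later if it actually needs it.
from pathlib import Path

OFFLOAD_DIR = Path(".agent/tool-output")


def offload_tool_output(call_id: str, output: str, keep_chars: int = 1500) -> str:
    """Return a trimmed view of `output`; persist the full text to disk."""
    OFFLOAD_DIR.mkdir(parents=True, exist_ok=True)
    full_path = OFFLOAD_DIR / f"{call_id}.txt"
    full_path.write_text(output)
    if len(output) <= 2 * keep_chars:
        return output
    head, tail = output[:keep_chars], output[-keep_chars:]
    return (
        f"{head}\n...[{len(output) - 2 * keep_chars} chars omitted, "
        f"full output at {full_path}]...\n{tail}"
    )
```

The same trick generalizes to compaction: the summary stays in context, the raw transcript goes to disk.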
Long-Horizon Execution
Complex tasks span multiple context windows. Agents stop early, lose coherence, or forget the goal.
Ralph Loops — a hook intercepts the model's exit attempt and reinjects the original prompt into a fresh context window. The filesystem bridges windows: state persists even when context resets.
Planning files — agents maintain a plan in a file, check off steps, and re-read it after each compaction to stay on track.
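Here is a sketch of the planning-file pattern under those two ideas: the goal and the plan live on disk, so every fresh window re-reads them. The task text is illustrative, and run_agent_turn is a placeholder for whatever call executes one context window of work in a given harness.

```python
# Planning-file sketch: the agent's plan persists on disk, so a fresh context
# window (after compaction or a Ralph-style restart) can re-read the goal and
# the remaining steps instead of drifting or stopping early.
from pathlib import Path

PLAN = Path("PLAN.md")
ORIGINAL_PROMPT = "Migrate the config loader and keep all tests green."  # illustrative task


def run_agent_turn(prompt: str) -> bool:
    """Placeholder: run one agent turn; return True once the task is complete."""
    print(prompt[:200])
    return True


def ralph_loop(max_windows: int = 10) -> None:
    for _ in range(max_windows):
        plan = PLAN.read_text() if PLAN.exists() else "(no plan yet; write one first)"
        # Every window gets the original goal plus the persisted plan, so state
        # survives context resets even though the window itself does not.
        if run_agent_turn(f"{ORIGINAL_PROMPT}\n\nCurrent plan and progress:\n{plan}"):
            break


if __name__ == "__main__":
    ralph_loop()
```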
Self-Verification Loops
An agent that writes code but cannot check if it works is guessing. The harness orchestrates verification:
Write code → run tests → inspect output → fix errors → repeat. Hooks can auto-run a test suite after every edit and loop back on failure with the error message. Linting, type checking, and screenshot comparison are all forms of automated verification the harness provides — not the model.
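A minimal sketch of that loop, assuming a pytest-based project; propose_patch is a placeholder for the model call that writes or fixes code. The point is that the harness, not the model, owns the loop and the test run.

```python
# Self-verification loop sketch: edit, run the test suite, feed failures back
# to the model, repeat until green or out of attempts.
import subprocess


def run_tests() -> tuple[bool, str]:
    """Run the suite and return (passed, combined output)."""
    result = subprocess.run(["python", "-m", "pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def propose_patch(feedback: str) -> None:
    """Placeholder: ask the model for a fix given the latest test output, then apply it."""


def verify_loop(max_attempts: int = 5) -> bool:
    feedback = ""
    for _ in range(max_attempts):
        propose_patch(feedback)
        passed, output = run_tests()
        if passed:
            return True
        feedback = output[-3000:]  # the tail of the failure log is usually the useful part
    return False
```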
Filesystem as Foundational Primitive
The filesystem is arguably the most foundational harness component — not because it is fancy, but because everything else depends on it.
It enables: durable state across sessions, work offloading when context fills up, multi-agent collaboration through shared files, memory via files like AGENTS.md that get injected on start, and versioning through git (rollback, branching, experiment tracking). The filesystem is how context windows survive death.
Model-Harness Co-Training
Codex and Claude Code are now post-trained with their harness in the loop. The model is optimized for its own harness's tool-calling patterns. This creates better native performance — but also a risk of overfitting to that harness.
Proof point: Terminal Bench 2.0 shows the same model (Opus 4.6) scoring dramatically differently across harnesses. The harness is not just scaffolding — it is a performance multiplier. Choosing the right harness for your task matters as much as choosing the right model.
The core insight: these are not model problems — they are systems problems. A smarter model still needs compaction, still needs verification, still needs a filesystem to persist state. As models improve, some harness features may be absorbed (better native planning, fewer hallucinations), but the systems layer will keep mattering — just like prompt engineering still matters even as models get better at following instructions.
Your GPU as Backend
Local Inference on Your Own Hardware
Your GPU becomes the backend. Pi and OpenCode connect to local inference servers - your data never leaves the machine.
⌨️
Harness
Pi / OpenCode
🔀
Provider
router
🖥️
Inference
NIM / vLLM / Ollama
⚡
GPU
local accelerator
NIM
Production-grade model serving for accelerated deployments across workstation, data center, edge, or cloud setups.
vLLM
Flexible infra-first serving. Best for custom deployment. GPU-optimized with continuous batching.
Ollama
Easiest local dev path. One command: ollama pull deepseek-coder. Exposes localhost:11434.
Top local coding models: Qwen3-Coder (Alibaba, Jun 2025), DeepSeek-Coder-V2 (DeepSeek, Jun 2024), Codestral (Mistral, May 2024), CodeGemma (Google, Apr 2024). All run via the same Ollama/vLLM/NIM stack — the model is swappable, the infrastructure stays.
Your code stays 100% local. Zero API costs. Full privacy.
The Next Wave
Memory Is the Compounding Advantage
"Memory" is not one thing. It is a stack, and each layer needs different infrastructure.
| Memory type | What it stores | Benefit | How it's built |
| --- | --- | --- | --- |
| Working | current task, selected files, active context window | the agent can act on what's in front of it right now | the live prompt/context window — managed by the harness automatically |
| Procedural | how to work: rules, checklists, conventions, step-by-step instructions | agents follow your team's process consistently — no drift between sessions or people; new agents onboard instantly | write markdown files: CLAUDE.md, AGENTS.md, SKILL.md, prompt templates. Version-controlled, human-edited. Zero infra cost. |
| Semantic | what is true: architecture facts, entity summaries, design decisions, domain knowledge | agents recall facts they've never seen in this session — "how does the auth system work?" gets a real answer instead of hallucination | starts as markdown docs. Scales into indexed retrieval: BM25, vector embeddings, or tools like ByteRover that auto-extract and query a local context tree. |
| Episodic | what happened: past runs, experiment results, task summaries, audit trails | agents learn from prior attempts — don't repeat failed approaches, build on what worked | session logs, experiment result files (results.tsv), task summaries. Capture outcomes and decisions, not every raw token. |
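The cheapest row in that table is procedural memory, and the mechanism is just file injection at session start. A minimal sketch, assuming the file names from the table; the base prompt and extra path are illustrative.

```python
# Procedural-memory sketch: at session start, read the team's instruction files
# and prepend them to the system prompt so every agent follows the same process.
from pathlib import Path

MEMORY_FILES = ["AGENTS.md", "CLAUDE.md", ".agent/conventions.md"]  # last path is illustrative


def build_system_prompt(base: str = "You are the project coding agent.") -> str:
    sections = [base]
    for name in MEMORY_FILES:
        path = Path(name)
        if path.exists():
            sections.append(f"## {name}\n{path.read_text()}")
    return "\n\n".join(sections)
```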
Learning from Obsidian's design
Obsidian popularized a powerful pattern: every document declares its relationships via [[wikilinks]], frontmatter tags, and backlinks. This turns a folder of markdown into a traversable graph - agents navigate by relationship (neighbors, paths, clusters) instead of just searching text. The pattern works in any markdown system; you do not need Obsidian to use it.
A standardization wave
LSP (Language Server Protocol, 2016) standardized how editors get completions, diagnostics, and refactoring - any editor, any language, one protocol. MCP (2024) does the same for AI tool integrations. The knowledge-graph pattern is doing it for documentation structure. Obsidian Skills (Jan 2026, 12K+ stars) and the Agent Skills spec both build on this idea.
Tip (early-stage experiment): adding links:, cluster:, and tags: to your doc frontmatter costs nothing. It gives agents graph traversal and cluster discovery on top of regular search. We are prototyping a small Python tool that does pathfinding and validation on plain markdown - early results are promising but this is still work in progress, not a proven workflow.
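To show the kind of traversal such a tool enables (this is an illustrative sketch, not the prototype mentioned above): scan a folder of markdown notes, collect [[wikilink]] edges, and find a path between two notes so an agent can navigate by relationship rather than text search.

```python
# Wikilink-graph sketch: build an adjacency map from [[wikilinks]] in a markdown
# folder and run a breadth-first search between two note names.
import re
from collections import deque
from pathlib import Path

WIKILINK = re.compile(r"\[\[([^\]|#]+)")  # capture the link target, ignoring aliases/anchors


def build_graph(vault: Path) -> dict[str, set[str]]:
    graph: dict[str, set[str]] = {}
    for note in vault.rglob("*.md"):
        targets = {m.strip() for m in WIKILINK.findall(note.read_text(errors="ignore"))}
        graph[note.stem] = targets
    return graph


def find_path(graph: dict[str, set[str]], start: str, goal: str) -> list[str] | None:
    """Breadth-first search over wikilink edges; returns a note-name path or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```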
Making it real
Memory in Practice: Tools That Work Today
The theory is clear. Here are two concrete tools you can install this week — one for semantic memory, one for procedural.
💻
Agent
codes & learns
📝
brv curate
auto-extract context
🌳
Context Tree
local markdown
🔍
brv query
semantic retrieval
ByteRover CLI — Semantic Memory
What it solves: agents forget facts between sessions. "How does our auth system work?" gets hallucinated instead of recalled.
How it works: agents call brv curate during work to auto-extract facts and store them; future sessions call brv query "how does X work?" and get relevant knowledge back instantly. The agent retrieves what is true — architecture, decisions, domain knowledge — without needing it pre-loaded in context.
Design: CLI-first (MCP tool definitions alone ate 26% of context). Local markdown in .brv/context-tree/. Zero infra. Team sync via brv push/pull.
Skill Graphs & [[Wikilinks]] — Procedural Memory
What it solves: agents don't follow your team's process. Every session starts from scratch — no consistent checklists, conventions, or step-by-step workflows.
How it works: structured markdown files (SKILL.md, AGENTS.md, CLAUDE.md) declare how to work — rules, steps, conventions. Files link to each other via [[wikilinks]] and YAML frontmatter. Agents navigate by graph relationship (neighbors, paths, clusters) instead of searching text.
Already in use: Pi skills, Obsidian Skills (12K+ ★), Agent Skills spec, and our own test-engineering knowledge base. Zero special tooling — just markdown + links.
Adoption path: start with procedural memory (rules + skills files) — it costs nothing and compounds immediately. Add semantic search (ByteRover or equivalent) when retrieval quality matters. Layer episodic memory (experiment logs, session summaries) last. Trying to build all four memory types at once is a recipe for abandoning the effort.
Strategic view
The Autonomy Spectrum & Where the Moat Forms
Commodity
raw model access · plain chat · basic read/write/shell tools · single-model wrappers · manual context dumping
Compounding advantage (durable, hard to replicate)
Adoption path: harness first, add memory, then experiment with autonomous layers. The harness-first path is lower-risk and compounds faster.
Just released — March 6, 2026
Autoresearch: The Autonomous Experiment Loop
A small side project from Andrej Karpathy — 630 lines of Python — exploded because of what it implies for harness-driven research: an AI agent runs ML experiments overnight on a single GPU. 21K+ stars in 3 days.
📋 program.md — human writes
The agent's "skill file." Research direction, constraints, evaluation criteria. Karpathy's word: a lightweight skill. The human engineers the research org, not the code.
🧬 train.py — agent edits
Full GPT model, optimizer (Muon + AdamW), training loop. Architecture, hyperparameters, batch size — everything is fair game.
🔒 prepare.py — fixed
Data prep, BPE tokenizer, evaluation utilities. The stable foundation that never changes between experiments.
630
lines of Python
5 min
per experiment
21K+
stars in 3 days
Andrej Karpathy
"The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement."
Tobi Lütke (Shopify CEO)
"OK this thing is totally insane. Before going to bed I told my agent to read this repo and make a version for the qmd query-expansion model. Woke up to +19% score on a 0.8B model (higher than previous 1.6B) after 8 hours and 37 experiments. It's mesmerizing to read it reasoning its way through the experiments."
Why this ties to our topic: autoresearch is a minimal harness. The human provides procedural memory (program.md), the harness orchestrates execution (train.py edits), a fixed metric provides the evaluation boundary (val_bpb), and the loop runs autonomously. Your four-layer model in 630 lines — and it works on real problems overnight.
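For intuition, here is a minimal sketch of that outer loop. It is not Karpathy's code; it only mirrors the described shape (edit → short training run → evaluate → keep or revert). The propose_edit placeholder stands in for the agent call, and the assumption that train.py prints its validation bits-per-byte as its last output line is illustrative.

```python
# Autoresearch-style outer loop sketch: the agent proposes an edit to train.py,
# a short run produces a metric (val_bpb), and the loop keeps or reverts the edit.
import shutil
import subprocess


def propose_edit(history: list[str]) -> None:
    """Placeholder: have the agent rewrite train.py guided by program.md and past results."""


def run_experiment() -> float:
    """Run a short training job and parse the metric it prints (assumed: last line is val_bpb)."""
    out = subprocess.run(["python", "train.py"], capture_output=True, text=True).stdout
    return float(out.strip().splitlines()[-1])


best, history = float("inf"), []
for step in range(37):                       # e.g. an overnight budget of experiments
    shutil.copy("train.py", "train.py.bak")  # snapshot so a bad edit can be reverted
    propose_edit(history)
    score = run_experiment()
    if score < best:                         # lower bits-per-byte is better
        best = score
    else:
        shutil.copy("train.py.bak", "train.py")
    history.append(f"step {step}: val_bpb={score:.4f} (best {best:.4f})")
```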
Productivity signals
What the Workflow Gains Look Like
The effect is easier to see in workflow metrics than in a single universal score.
What to measure on your team
Time-to-first-useful-output - first good draft, query, test, or patch
Time-to-first-passing-change - from task start to green checks or validated result
Context reacquisition time - how long it takes to re-enter a repo or workflow after interruption
Human turns per task - prompts, approvals, and corrections needed before completion
Unattended completion rate - % of tasks that finish with only bounded oversight
Rework / rollback rate - fixes that looked fast but created churn later
External anchor results vary a lot
Controlled lab task: GitHub/Microsoft reported developers completed a JavaScript HTTP server task 55.8% faster with Copilot.
Experienced OSS developers in own repos: METR's 2025 randomized study found AI tools made participants 19% slower on average.
Transcript-based estimate: Anthropic estimated Claude reduced completion time by about 80% across many real conversations, but that is a model-based estimate, not a randomized trial.
Best proxy for harness quality
Less context setup, fewer back-and-forth turns, faster validated changes, and less cleanup afterward.
Use capability benchmarks carefully
SWE-bench or Terminal-Bench are useful for model+agent capability, but they are not the same thing as team productivity.
Practical evaluation pattern
A short bakeoff on real tasks works better than a generic benchmark: same repo, same issue types, same guardrails, then compare cycle time and rework.
Takeaway: the meta upgrade should be sold as better task flow and lower coordination overhead - not just "the model is smarter."
Recommendations
Best Stack for Your Case
Team adopt-now
Claude Code, Codex, or OpenCode • add MCP only where it solves a real integration problem • define procedural memory first: rules, skills, AGENTS/CLAUDE files • use hooks/approvals for guardrails
Best for fast adoption and low organizational friction.
Local power-user
Pi or OpenCode + local inference • Ollama / vLLM / NIM / LM Studio backends • skills for procedural memory • markdown + SQLite or index-backed semantic memory • hooks/extensions for custom automation
Can be near-zero API cost if you actually run local models.
Experimental autonomy
OpenHands or OpenClaw-style systems • background runs and delegated agents • stronger isolation/sandboxing required • memory plugins and retrieval layers become core infra
Worth exploring, but not the default recommendation for most teams.
How we got here - part 1
Timeline: From Shell Customization to Shared Agent Infrastructure
2009
Oh My Zsh popularizes the community-bundle pattern: plugins, themes, and sane defaults around a raw terminal tool.
2015-16
VS Code launches; LSP standardized. The Language Server Protocol decouples editor features from editors - any tool can provide completions, diagnostics, and refactoring through one protocol. The same pattern MCP will later use for AI tools.
Jun 2017
"Attention Is All You Need" introduces the Transformer architecture. Everything that follows - GPT, BERT, Codex, Claude - builds on this paper.
2018-19
First AI code completions. TabNine launches (Nov 2018) and ships GPT-2-based code completion with Deep TabNine (2019). Kite offers ML-powered Python autocomplete. IntelliCode brings AI-assisted completions to VS Code. These prove the concept but stay narrow - single-line suggestions, not agentic loops.
May 2020
GPT-3 (175B params). Demonstrates that scale unlocks qualitative capability jumps. Code generation becomes plausible at this scale, leading directly to Codex research.
Jun 29, 2021
GitHub Copilot technical preview. Powered by OpenAI's code-specialized model (confusingly also called "Codex" - a different product from the Codex CLI that launches in 2025). Makes AI pair programming mainstream, but the interaction is still suggestion / accept.
Jun 21, 2022
GitHub Copilot GA cements autocomplete and inline assistance as a normal part of coding workflows.
2023-2024
Agentic coding loops emerge. IDE and terminal tools start to read files, edit code, run commands, and iterate instead of only suggesting text.
Nov 25, 2024
MCP is introduced. Tool integration starts becoming shared infrastructure instead of one-off adapters.
The before / after
Before: editor assistance and autocomplete. After: systems that can inspect repos, edit files, run commands, call tools, and continue with feedback.
Why MCP matters here
Once clients and tools can speak one shared protocol, the ecosystem starts compounding. That is the bridge from isolated wrappers to harnesses with a real extension layer.
Best reading of this era
This was the shift from AI features inside coding tools to coding tools that behave like agents.
How we got here - part 2
Timeline: The Harness Era (2025 - now)
Feb 24, 2025
Claude Code is introduced alongside Claude 3.7 Sonnet, signaling that the model now ships with a purpose-built coding shell.
May 2025
Claude Code reaches general availability. The harness model moves from experiment to mainstream product category.
May 19, 2025
GitHub Copilot coding agent pushes background task execution and PR-style delegation into the mainstream.
Jun 25, 2025
Gemini CLI launches as an open-source terminal agent, reinforcing that the harness race is not just an Anthropic/OpenAI story.
Nov 5, 2025
Linus Torvalds on vibe coding at Open Source Summit Korea: "fairly positive" about it for learning and low-stakes work, but warned it "may be a horrible, horrible idea from a maintenance standpoint" for production code.
Dec 9, 2025
MCP is donated to the Linux Foundation's Agentic AI Foundation, a sign that the protocol is becoming foundational infrastructure.
Jan 11, 2026
Torvalds publishes AudioNoise — his hobby guitar pedal repo includes a Python visualizer "basically written by vibe-coding" using Google Antigravity. 67 days after his cautious remarks. Consistent with his position: fine for learning, not for the kernel.
Feb 2–5, 2026
Codex app and GPT-5.3-Codex launch. The story expands from "agent in a terminal" to coordinated agents, computer-use workflows, and long-running work.
Feb 2026
Claude Code turns one. At the anniversary event in SF, Anthropic's Boris Cherny (Director of Product, Claude Code) noted the team had shifted to weekly planning cycles — monthly and quarterly plans couldn't keep up with the pace of change in this space.
Mar 6, 2026
Andrej Karpathy releases Autoresearch — 630 lines of Python that let an AI agent autonomously run ML experiments on a single GPU overnight. Edit → train 5 min → evaluate → keep/discard → repeat. 21K+ stars in 3 days. The program.md is a "skill" — the human engineers the research org, the agent does the science.
Mar 10, 2026 (yesterday)
Google ships Gemini Embedding 2 — first natively multimodal embedding model. Text, images, video, audio, and PDFs map into a single embedding space. Practical gain: one query can now search across code, screenshots, architecture diagrams, and meeting recordings at once — multimodal RAG without stitching separate pipelines per format.
What actually changed
The model stopped being the whole product. The durable differentiators are now harness design, extension depth, permissions, memory, and execution boundaries.
What is coming next
More MCP standardization, more delegated/background workflows, and more memory/governance infrastructure attached to coding agents.
The shift that matters
The move from AI features inside coding tools (autocomplete, inline suggestions) to AI tools that behave like agents (read repos, run commands, iterate on feedback, delegate subtasks) happened in under four years.
Key Takeaway
The Competition Is No Longer "Which Coding Agent Is Best?"
It's which harness architecture + extension system + execution boundary fits your workflow. Memory and autonomous agents are the next wave. MCP is becoming infrastructure. Pick the layer stack that matches your team - and make it yours.