A Coding Agent That Never Leaves the Laptop

The premise: keep the whole loop on the laptop

The default shape of an AI coding agent in 2026 is a thin local client wrapped around a remote frontier model. You type, your prompt and the relevant slice of your repository get shipped to a data centre, a large model thinks about it, and tokens stream back. It works well. It is also a standing data-egress decision: every prompt is a copy of your source, your infrastructure config, and your intent leaving the building.

For a lot of work that trade is fine. For some clients, and for anyone who has read our writing on EU sovereign cloud and the CLOUD Act, it is not. So the question was narrow and practical: can you run a genuinely useful agentic coding loop with the model on your own silicon, nothing in flight, and still have it be fast enough to live in?

The answer, on current hardware, is yes, with caveats that are worth being precise about.

The hardware: why 48 GB of unified memory is the whole game

The machine is a MacBook Pro 14" with an M4 Pro and 48 GB of unified memory. That second number is the one that matters. On Apple Silicon the CPU and GPU share one pool of RAM, so the same 48 GB that runs your editor, your browser, and a Terraform plan is also the memory the model lives in. There is no separate VRAM ceiling to fall off.

A 35-billion-parameter model quantised to 4-bit lands somewhere around 20 GB of weights. Add the KV cache (the running memory of the conversation, which grows with every token of context) and a long session can claim another several gigabytes. On a 48 GB machine that fits with room to keep working. On a 16 GB or 24 GB laptop it does not, and that is the real reason this setup is an M4-Pro-and-up story rather than a "runs on any Mac" story.

The stack

Three pieces, each doing one job.

LM Studio, the inference server

LM Studio hosts the model and exposes it on a local OpenAI-compatible endpoint. That compatibility is the load-bearing detail: anything that knows how to talk to the OpenAI API can be pointed at localhost instead, with no awareness that the model is sitting two inches below the keyboard. The model itself is Qwen3.6-35B, large enough to follow multi-step tool-use instructions and write coherent diffs, small enough to fit the memory budget above.

OpenCode, the agent harness

OpenCode is the terminal-native coding agent: it owns the loop. It reads files, proposes edits, runs commands, reads the output, and decides what to do next. Crucially it lets you swap the model provider, so instead of a hosted frontier model it drives the local Qwen3.6-35B through LM Studio's endpoint. This is the direct, local-only replacement for a tool like Codex: same agentic shape, none of the egress.

The Cave* plugins: Caveman, Cavemem, Cavekit

This is where the loop gets sharp:

Caveman drives the agent persona and behaviour: how it plans, how aggressively it acts, how it narrates what it is doing.

Cavemem is the memory layer, and it is the most interesting of the three. A local 35B model has a smaller effective context and far less raw capability than a frontier model, so you cannot lean on "just put everything in the prompt." Cavemem is a small RAG pipeline: it persists facts, decisions, and prior context, then retrieves the relevant slice on each turn instead of replaying the entire history. That retrieval discipline is what keeps a modest local model on-task across a long session.

Cavekit is the toolkit, the concrete tools (file ops, shell, search) the agent reaches for.

The codebase under test is the real thing, not a toy: Python, Terraform, and TypeScript in one repository, application code, infrastructure-as-code, and frontend in the same loop.

The numbers: prompt processing is the cost you actually pay

Here is the part people skip when they say "I ran a local model and it was fine." With local inference, the latency you feel is almost entirely prompt processing: the prefill phase, where the model has to read and encode every token of context before it can emit the first token of its answer. It is compute-bound, and on Apple Silicon it is the bottleneck. Token generation after that is comparatively quick.

So the honest metric is time-to-first-token, measured against context size:

Cold start on a fresh conversation: up to ~20 seconds before output begins. These were not trivial prompts; they were multiline Markdown specs describing a whole feature. The bigger the opening prompt, the longer the prefill.

Deep into a long session (~100 messages): around one minute to process the accumulated context and start the output.

Read those two numbers together and you can see the shape of the trade. The cost is not in generating the answer; it is in re-encoding the conversation. This is exactly the problem Cavemem's retrieval layer exists to bound: by feeding the model a curated, relevant context instead of an ever-growing transcript, you keep the prefill from ballooning as the session ages. KV-cache reuse within a single conversation helps too, since an unchanged prefix does not have to be reprocessed turn over turn. The minute-long waits are the case where the context genuinely grew, not where the same context got re-read.

OpenCode gives you a direct lever over this as well: a compress command that condenses the conversation so far into a compact summary and continues from there. When a long thread starts paying that minute-long prefill, compressing it collapses the accumulated history back down and pulls time-to-first-token with it. It is a manual handle on exactly the cost this section is about, and in practice the loop feels faster than the raw numbers suggest because you are not dragging a hundred-message transcript through every turn.

When the task is big, it runs a team of agents

OpenCode does not cap out at a single linear conversation. For larger work it decomposes the task and spins up sub-agents to execute the pieces, then checks its own progress against the real tooling rather than against its own guesswork.

A concrete example from this codebase: the instruction was simply to run the lint command and fix everything it flagged, roughly 100 errors. The agent built a to-do list, distributed the fixes across sub-agents, and then re-ran lint between passes to see how far it had gotten. It was driving a feedback loop against ground truth (the linter's own output), closing the list down error by error instead of trying to one-shot the whole pile.

That is the behaviour that lets a local model punch above its parameter count. The orchestration layer turns a hundred-error grind into a managed, self-checking job: decompose, delegate, verify against the tool, repeat. No single prompt ever has to hold the entire problem at once, which is precisely what a 35B model on a laptop needs in order to stay reliable.

Why this is a "human in the lead" tool, not an autopilot

A local 35B model is not a frontier model, and pretending otherwise is how you get burned. It will not one-shot an ambiguous request the way a hosted giant sometimes can. What it does well is execute clearly-specified work against a codebase it has the right context for, which is precisely the human-in-the-lead workflow we already argue for.

The constraints push you toward better engineering, not worse:

The prompt has to be good. Vague intent wastes a twenty-second-to-one-minute prefill and returns mush. A precise Markdown feature spec returns something you can use. The model rewards the engineer who already knows what they want.

The memory layer has to be curated. Cavemem only helps if what it retrieves is the right context. That is an engineering decision about what matters in this repo, exactly the kind of call a model should not be making for you.

The review is non-negotiable. Smaller model, tighter leash. Every diff goes through the same gate as hand-written code.

None of that is a downgrade from how we work on cloud models. It is the same philosophy with the volume turned up: the engineer leads, the tool executes, and the tool happens to be running on the engineer's own machine.

What you actually buy

Zero egress. No prompt, no file, no infrastructure config leaves the laptop. For regulated or sovereignty-sensitive work, that is not a feature; it is the requirement.

Zero marginal cost. No per-token billing. The loop runs as long as you want it to, offline, on a plane, wherever.

A real Codex replacement. Same agentic shape (read, edit, run, observe, repeat) with the model swapped from a remote API to local silicon.

The catch is the one the numbers already told you: prefill latency scales with context, and a 35B model is not a frontier model. If your workflow is hundreds of low-latency interactions a day, the cloud still wins on raw speed. If your workflow is considered, spec-driven, review-heavy engineering on a codebase you cannot afford to ship to anyone else's GPU, this runs today, on a laptop you can close and carry home.

The takeaway

The interesting result is not that a local model can write code. It is that the entire agentic loop (planning, memory, retrieval, tool use, editing) now fits on a 48 GB MacBook Pro with usable latency, against a real Python/Terraform/TypeScript codebase. The frontier models are more capable and will stay that way. But "more capable" and "good enough to lead, on hardware you own, with nothing leaving the room" are different questions. For a growing class of work, the second answer is the one that matters.

By Capability

By Industry

A Coding Agent That Never Leaves the Laptop

The premise: keep the whole loop on the laptop

The hardware: why 48 GB of unified memory is the whole game

The stack

LM Studio, the inference server

OpenCode, the agent harness

The Cave* plugins: Caveman, Cavemem, Cavekit

The numbers: prompt processing is the cost you actually pay

When the task is big, it runs a team of agents

Why this is a "human in the lead" tool, not an autopilot

What you actually buy

The takeaway

Ready to build
something remarkable?

A Coding Agent That Never Leaves the Laptop

The premise: keep the whole loop on the laptop

The hardware: why 48 GB of unified memory is the whole game

The stack

LM Studio, the inference server

OpenCode, the agent harness

The Cave* plugins: Caveman, Cavemem, Cavekit

The numbers: prompt processing is the cost you actually pay

When the task is big, it runs a team of agents

Why this is a "human in the lead" tool, not an autopilot

What you actually buy

The takeaway

Ready to buildsomething remarkable?

Ready to build
something remarkable?