The premise: keep the whole loop on the laptop
The default shape of an AI coding agent in 2026 is a thin local client wrapped around a remote frontier model. You type, your prompt and the relevant slice of your repository get shipped to a data centre, a large model thinks about it, and tokens stream back. It works well. It is also a standing data-egress decision: every prompt is a copy of your source, your infrastructure config, and your intent leaving the building.
For a lot of work that trade is fine. For some clients, and for anyone who has read our writing on EU sovereign cloud and the CLOUD Act, it is not. So the question was narrow and practical: can you run a genuinely useful agentic coding loop with the model on your own silicon, nothing in flight, and still have it be fast enough to live in?
The answer, on current hardware, is yes, with caveats that are worth being precise about.
The hardware: why 48 GB of unified memory is the whole game
The machine is a MacBook Pro 14" with an M4 Pro and 48 GB of unified memory. That second number is the one that matters. On Apple Silicon the CPU and GPU share one pool of RAM, so the same 48 GB that runs your editor, your browser, and a Terraform plan is also the memory the model lives in. There is no separate VRAM ceiling to fall off.
A 35-billion-parameter model quantised to 4-bit lands somewhere around 20 GB of weights. Add the KV cache (the running memory of the conversation, which grows with every token of context) and a long session can claim another several gigabytes. On a 48 GB machine that fits with room to keep working. On a 16 GB or 24 GB laptop it does not, and that is the real reason this setup is an M4-Pro-and-up story rather than a "runs on any Mac" story.
The stack
Three pieces, each doing one job.
LM Studio, the inference server
LM Studio hosts the model and exposes it on a local OpenAI-compatible endpoint. That compatibility is the load-bearing detail: anything that knows how to talk to the OpenAI API can be pointed at localhost instead, with no awareness that the model is sitting two inches below the keyboard. The model itself is Qwen3.6-35B, large enough to follow multi-step tool-use instructions and write coherent diffs, small enough to fit the memory budget above.
OpenCode, the agent harness
OpenCode is the terminal-native coding agent: it owns the loop. It reads files, proposes edits, runs commands, reads the output, and decides what to do next. Crucially it lets you swap the model provider, so instead of a hosted frontier model it drives the local Qwen3.6-35B through LM Studio's endpoint. This is the direct, local-only replacement for a tool like Codex: same agentic shape, none of the egress.
The Cave* plugins: Caveman, Cavemem, Cavekit
This is where the loop gets sharp:
The codebase under test is the real thing, not a toy: Python, Terraform, and TypeScript in one repository, application code, infrastructure-as-code, and frontend in the same loop.
The numbers: prompt processing is the cost you actually pay
Here is the part people skip when they say "I ran a local model and it was fine." With local inference, the latency you feel is almost entirely prompt processing: the prefill phase, where the model has to read and encode every token of context before it can emit the first token of its answer. It is compute-bound, and on Apple Silicon it is the bottleneck. Token generation after that is comparatively quick.
So the honest metric is time-to-first-token, measured against context size:
Read those two numbers together and you can see the shape of the trade. The cost is not in generating the answer; it is in re-encoding the conversation. This is exactly the problem Cavemem's retrieval layer exists to bound: by feeding the model a curated, relevant context instead of an ever-growing transcript, you keep the prefill from ballooning as the session ages. KV-cache reuse within a single conversation helps too, since an unchanged prefix does not have to be reprocessed turn over turn. The minute-long waits are the case where the context genuinely grew, not where the same context got re-read.
OpenCode gives you a direct lever over this as well: a compress command that condenses the conversation so far into a compact summary and continues from there. When a long thread starts paying that minute-long prefill, compressing it collapses the accumulated history back down and pulls time-to-first-token with it. It is a manual handle on exactly the cost this section is about, and in practice the loop feels faster than the raw numbers suggest because you are not dragging a hundred-message transcript through every turn.
When the task is big, it runs a team of agents
OpenCode does not cap out at a single linear conversation. For larger work it decomposes the task and spins up sub-agents to execute the pieces, then checks its own progress against the real tooling rather than against its own guesswork.
A concrete example from this codebase: the instruction was simply to run the lint command and fix everything it flagged, roughly 100 errors. The agent built a to-do list, distributed the fixes across sub-agents, and then re-ran lint between passes to see how far it had gotten. It was driving a feedback loop against ground truth (the linter's own output), closing the list down error by error instead of trying to one-shot the whole pile.
That is the behaviour that lets a local model punch above its parameter count. The orchestration layer turns a hundred-error grind into a managed, self-checking job: decompose, delegate, verify against the tool, repeat. No single prompt ever has to hold the entire problem at once, which is precisely what a 35B model on a laptop needs in order to stay reliable.
Why this is a "human in the lead" tool, not an autopilot
A local 35B model is not a frontier model, and pretending otherwise is how you get burned. It will not one-shot an ambiguous request the way a hosted giant sometimes can. What it does well is execute clearly-specified work against a codebase it has the right context for, which is precisely the human-in-the-lead workflow we already argue for.
The constraints push you toward better engineering, not worse:
None of that is a downgrade from how we work on cloud models. It is the same philosophy with the volume turned up: the engineer leads, the tool executes, and the tool happens to be running on the engineer's own machine.
What you actually buy
The catch is the one the numbers already told you: prefill latency scales with context, and a 35B model is not a frontier model. If your workflow is hundreds of low-latency interactions a day, the cloud still wins on raw speed. If your workflow is considered, spec-driven, review-heavy engineering on a codebase you cannot afford to ship to anyone else's GPU, this runs today, on a laptop you can close and carry home.
The takeaway
The interesting result is not that a local model can write code. It is that the entire agentic loop (planning, memory, retrieval, tool use, editing) now fits on a 48 GB MacBook Pro with usable latency, against a real Python/Terraform/TypeScript codebase. The frontier models are more capable and will stay that way. But "more capable" and "good enough to lead, on hardware you own, with nothing leaving the room" are different questions. For a growing class of work, the second answer is the one that matters.