We Let an AI Agent Optimize Our RAG Pipeline Overnight. Here's What Happened.

When Andrej Karpathy released autoresearch, the idea was simple: give an AI agent a training script, a single metric, a fixed compute budget, and let it iterate autonomously. Wake up to a better model.

We were knee-deep in a RAG pipeline for a client — processing hundreds of enterprise documents, chunking them, embedding them, and trying to make retrieval actually work. The pipeline was built but untuned. And tuning RAG is a surprisingly tedious grind: change the chunk size, re-embed everything, run your test queries, check the metrics, revert if it got worse, try another value. Repeat.

Reading through Karpathy's approach, we realized the pattern mapped almost perfectly onto our problem.

The Pattern

Autoresearch separates what's immutable (data prep, evaluation harness) from what's tunable (model architecture, hyperparameters). An agent modifies only the tunable parts, runs a fixed-budget experiment, compares the metric, keeps or discards the change. Simple loop. Powerful results.

The constraints are the genius: a fixed budget per experiment means each run is cheap and comparable. A single metric means 'better' is unambiguous. And the separation between immutable prep and mutable experiment means the agent can't accidentally break the evaluation while trying to improve results.
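Stripped to its essentials, the loop fits in a dozen lines of Python. This is a sketch, not autoresearch's actual code: `run_experiment` and the mutation step are placeholders for whatever your harness exposes, and the parameter values are invented.

```python
import copy
import random

def propose_change(config):
    """Mutate one tunable parameter (hypothetical name and value grid)."""
    new = copy.deepcopy(config)
    new["chunk_size"] = random.choice([256, 512, 1024])
    return new

def optimize(config, run_experiment, budget):
    """Keep-or-discard loop: fixed budget per run, one scalar metric."""
    best_score = run_experiment(config)    # baseline
    for _ in range(budget):
        candidate = propose_change(config)
        score = run_experiment(candidate)  # fixed-budget experiment
        if score > best_score:             # 'better' is unambiguous
            config, best_score = candidate, score
    return config, best_score
```

Everything else in the pattern is scaffolding around this loop: caching makes each `run_experiment` call cheap, and logging makes each one recoverable.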

How We Applied It to RAG

We mapped each concept directly:

  • Document extraction (PDF parsing, OCR, format handling) → immutable setup. Run once, cache to disk. ~8 minutes we never want to repeat.
  • Chunking parameters and search weights → the tunable config. A simple JSON file the agent edits.
  • A composite retrieval score (MRR, Recall, NDCG) → the single metric. 'Better' becomes a boolean.
  • ~17 minutes per full experiment on CPU → the fixed budget. Predictable, comparable, plannable.
  • Append-only JSONL log → every experiment recorded with full config, metrics, and duration.

By caching extracted documents, every experiment starts from the chunking step. The expensive work happens once. The iteration is cheap.
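The composite metric in the mapping above might look something like this. The 0.5/0.3/0.2 weights are illustrative; the post doesn't pin down the actual weighting.

```python
def composite_score(mrr, recall, ndcg, weights=(0.5, 0.3, 0.2)):
    """Weighted blend of retrieval metrics collapsed to one scalar.
    Weights are illustrative, not the ones actually used."""
    w_mrr, w_recall, w_ndcg = weights
    return w_mrr * mrr + w_recall * recall + w_ndcg * ndcg

def is_better(new_metrics, old_metrics):
    """'Better' becomes a boolean, even when individual metrics disagree."""
    return composite_score(**new_metrics) > composite_score(**old_metrics)
```

The point of the blend is that a run where MRR rises and recall falls still resolves to a single keep-or-discard decision.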

Seven Knobs, One Overnight Run

A RAG pipeline has more tunable parameters than you'd expect; our config exposes seven. The four most consequential:

Chunk size — too small and you lose context, too large and you dilute the signal with noise.

Overlap — how much context bleeds between adjacent chunks. More isn't always better.

Dense vs. sparse search weights — the balance between semantic understanding and keyword matching. Corpus-dependent.

Retrieval depth — how many candidates to consider before ranking. Affects the precision/recall tradeoff in non-obvious ways.
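The first two knobs are easiest to see in code. A toy chunker makes the size/overlap mechanics concrete (this one splits a token list directly; the real pipeline tokenizes text first):

```python
def chunk(tokens, size, overlap):
    """Split a token list into chunks of `size`, with `overlap`
    tokens shared between adjacent chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)
            if tokens[i:i + size]]

doc = list(range(100))  # stand-in for a 100-token document
chunks = chunk(doc, size=32, overlap=8)
```

Note how overlap directly multiplies the index size: the last 8 tokens of each chunk reappear at the start of the next, which is exactly where near-duplicate chunks come from.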

Seven parameters, four to six sensible values each. ~35 one-at-a-time experiments at 17 minutes each. Just under ten hours. A perfect overnight run.
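One-at-a-time sweeps are what keep the experiment count linear in the number of values rather than exponential in the number of knobs. A sketch of the enumeration, with invented parameter names and value grids (only three of the seven knobs shown):

```python
def one_at_a_time(baseline, grid):
    """Yield configs that differ from the baseline in exactly one parameter."""
    for param, values in grid.items():
        for value in values:
            if value == baseline[param]:
                continue  # the baseline itself runs once, separately
            yield {**baseline, param: value}

baseline = {"chunk_size": 512, "overlap": 32, "dense_weight": 0.7}
grid = {
    "chunk_size": [256, 384, 512, 768, 1024],
    "overlap": [0, 16, 32, 64, 128],
    "dense_weight": [0.3, 0.5, 0.7, 0.9],
}
configs = list(one_at_a_time(baseline, grid))
# 4 + 4 + 3 = 11 variants for these three knobs; seven knobs lands near 35
```

The tradeoff is that one-at-a-time sweeps miss parameter interactions; the full grid here would be 5 × 5 × 4 = 100 runs instead of 11.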

What We Found

Chunk size has a sweet spot — not linear. Large enough to carry context, small enough to be retrievable. The optimum was 512 tokens with 32-token overlap.

Overlap is surprisingly sensitive. Beyond the optimum, additional overlap actually degraded quality. Near-duplicate chunks competed with each other in search results.

Table summarization was a surprise win. Using a local LLM to generate natural-language summaries of extracted tables — and embedding those alongside raw data — produced measurable improvement. Tables are hard for embedding models; summaries give them something semantic to work with.
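The shape of the table-summarization trick, with a trivial template standing in for the local LLM (the actual summaries came from a model, not string formatting):

```python
def summarize_table(headers, rows):
    """Turn table rows into retrievable natural-language sentences.
    A real pipeline would prompt a local LLM; a template shown for clarity."""
    sentences = []
    for row in rows:
        pairs = ", ".join(f"{h} is {v}" for h, v in zip(headers, row))
        sentences.append(f"Row where {pairs}.")
    return " ".join(sentences)

summary = summarize_table(
    ["region", "revenue"],
    [["EMEA", "4.2M"], ["APAC", "3.1M"]],
)
# The summary is embedded alongside the raw table cells,
# giving the embedding model prose instead of a grid of values.
```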

Dense-dominant search (0.7/0.3) crushed equal weighting. But the right balance depends on your corpus. Domain-specific jargon makes keyword matching more valuable than you'd expect.
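Hybrid weighting is just a convex combination of the two score lists after normalization. Min-max normalization is used here for illustration; other fusion schemes (e.g. reciprocal rank fusion) exist, and the post doesn't say which one the pipeline uses.

```python
def normalize(scores):
    """Min-max normalize so dense and sparse scores are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(dense, sparse, dense_weight=0.7):
    """Blend dense (semantic) and sparse (keyword) scores per document."""
    d, s = normalize(dense), normalize(sparse)
    return [dense_weight * di + (1 - dense_weight) * si
            for di, si in zip(d, s)]

dense = [0.9, 0.2, 0.5]   # e.g. cosine similarities
sparse = [1.0, 8.0, 3.0]  # e.g. BM25 scores
ranked = sorted(range(3), key=lambda i: hybrid_scores(dense, sparse)[i],
                reverse=True)
```

With 0.7/0.3 the semantically closest document wins even though it scores worst on keywords, which is the behavior the sweep rewarded on this corpus.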

Best configuration: 512-token chunks, 32-token overlap, 0.7 dense / 0.3 sparse weighting, top-k multiplier of 3. Composite score: 0.745 on eval, 0.61 on held-out test.

Why This Works for RAG

It's not any single design choice — it's the combination:

Separation of concerns. The agent can't corrupt source data. It only touches the cheap, fast-to-evaluate parts.

A single composite metric. No judgment calls. 'MRR went up but recall went down' is resolved by the weighted score.

Append-only logging. Nothing is lost. You can identify patterns across experiments — 'every time overlap exceeds X, recall drops.'
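Append-only JSONL logging takes only a few lines. The field names here are assumptions about the log schema, not the actual one:

```python
import json
import time

LOG = "experiments.jsonl"

def log_experiment(config, metrics, duration_s, path=LOG):
    """Append one experiment record; the file is never rewritten."""
    record = {"ts": time.time(), "config": config,
              "metrics": metrics, "duration_s": duration_s}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_log(path=LOG):
    """Read every experiment back as a list of dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def recall_by_overlap(records):
    """Pattern-mining across runs, e.g. 'overlap above X hurts recall'."""
    return sorted((r["config"]["overlap"], r["metrics"]["recall"])
                  for r in records)
```

Because each line is an independent JSON object, a crashed run can never corrupt earlier records, and the log doubles as the dataset the agent would mine to decide what to try next.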

Human-in-the-loop via the program file. Autonomous but not unsupervised. Human judgment enters through parameter ranges and phase sequencing.

Fixed budget per experiment. Predictable, comparable, and you know immediately when something goes wrong.

The Bigger Picture

Karpathy's original framing was about LLM training, but the pattern fits any optimization problem where there's an expensive immutable setup, a set of tunable parameters, a quantitative metric, and bounded experiment cost.

RAG pipelines. Prompt engineering. Embedding model selection. Reranking strategies. The domain doesn't matter. What matters is giving an AI agent a well-defined playground, a clear scoreboard, and the autonomy to explore it.

We built this for a client's enterprise document corpus. The infrastructure is in place, each experiment is a 17-minute fire-and-forget operation, and the next step is closing the loop fully — having the agent decide what to try next based on the accumulated log.

The overnight-run dream: wake up to a Pareto frontier of configurations and a clear recommendation.

By Milen Hristov, Lead Software Engineer at Looming Tech