We've shipped RAG systems for pharmaceutical intelligence, consumer product reviews, legal document management, engineering documentation, and enterprise knowledge bases. Not prototypes — production systems that real people depend on every day.

Here's what we've actually learned. No theory, no hype, just the patterns and pitfalls from building retrieval-augmented generation at scale.

1. Your chunking strategy matters more than your model choice

Everyone obsesses over which LLM to use. GPT-4 vs Claude vs Gemini. In practice, the difference between a good and bad RAG system is almost always in the retrieval layer — and that starts with how you chunk your documents.

We've tried fixed-size chunks, sentence-based splitting, page-level chunks, and hybrid approaches. The winner depends on your corpus. For regulatory pharmaceutical documents, page-level chunking with table preservation worked best. For consumer product reviews, smaller semantic chunks (512 tokens) with minimal overlap outperformed everything else.

The lesson: Run chunking experiments before touching the prompt. We built an automated sweep tool that tests chunk sizes, overlap ratios, and splitting strategies against a fixed eval set. It runs overnight and we wake up to results.
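A minimal sketch of what such a sweep looks like. The word-overlap retriever below is a toy stand-in for real embedding search, and the names (`chunk_tokens`, `score_config`, `sweep`) are illustrative, not our actual tool:

```python
from itertools import product

def chunk_tokens(tokens, size, overlap):
    """Split a token list into fixed-size chunks with the given overlap."""
    step = max(size - overlap, 1)
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def score_config(corpus, eval_set, size, overlap):
    """Toy stand-in for a real eval: fraction of queries whose best
    word-overlap chunk comes from the labeled relevant document."""
    chunks = []  # (doc_id, set of words in the chunk)
    for doc_id, text in corpus.items():
        for c in chunk_tokens(text.split(), size, overlap):
            chunks.append((doc_id, set(c)))
    hits = 0
    for query, relevant_doc in eval_set:
        q = set(query.split())
        best = max(chunks, key=lambda dc: len(q & dc[1]))
        hits += best[0] == relevant_doc
    return hits / len(eval_set)

def sweep(corpus, eval_set, sizes=(128, 256, 512), overlaps=(0, 32)):
    """Try every (size, overlap) combination; return configs best-first."""
    results = [
        ((size, overlap), score_config(corpus, eval_set, size, overlap))
        for size, overlap in product(sizes, overlaps)
    ]
    return sorted(results, key=lambda r: -r[1])
```

Swap in your real retriever and metric, schedule it overnight, and read the sorted results in the morning.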

2. Hybrid search beats pure vector search. Every time.

Pure semantic search sounds elegant — embed everything, find the closest vectors, done. In practice, it misses things. A user searching for "PCI DSS compliance requirements" needs exact keyword matching alongside semantic understanding.

In every deployment, a weighted combination of dense (semantic) and sparse (keyword) search outperformed either alone. The typical sweet spot is 70% dense, 30% sparse — but this varies by corpus. Domain-specific jargon pushes the balance toward keywords.

We typically use Pinecone, OpenSearch, or Milvus for dense vectors, combined with BM25 or sparse vectors from BGE-M3. The hybrid approach consistently delivers 15-25% better recall than pure vector search.
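The weighted combination itself is simple. Here is one common fusion scheme — min-max normalize each score list so the scales are comparable, then mix with the dense/sparse weight (reciprocal-rank fusion is a popular alternative). Function names are illustrative:

```python
def minmax(scores):
    """Normalize a {doc_id: score} map to [0, 1] so dense cosine scores
    and sparse BM25 scores become comparable before mixing."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_rank(dense, sparse, alpha=0.7):
    """Blend dense (semantic) and sparse (keyword) scores as
    alpha * dense + (1 - alpha) * sparse; return doc ids best-first."""
    dense, sparse = minmax(dense), minmax(sparse)
    docs = set(dense) | set(sparse)
    fused = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)
```

With `alpha=0.7` you get the 70/30 split described above; drop it toward 0.5 for jargon-heavy corpora where exact keyword matches carry more weight.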

3. Citation is non-negotiable in regulated industries

For our pharmaceutical client, every AI-generated answer needs to cite its sources — document name, page number, and the specific passage. Without this, the system is useless. Doctors and regulatory affairs teams won't trust an answer they can't verify.

We built a grounding system that tracks exactly which chunks contributed to each answer, preserves the source metadata through the entire pipeline, and presents citations inline. It's not glamorous engineering, but it's the difference between a demo and a system people actually use.
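The core of it is disciplined metadata plumbing. A simplified sketch (the names are for illustration, not our production system): every chunk carries its source document and page, and the context builder numbers each chunk so the model can cite [1], [2], ... while a lookup table maps those numbers back to sources.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    doc: str      # source document name
    page: int     # page number in the source
    text: str     # the passage itself

def build_context(chunks):
    """Number each retrieved chunk for inline citation and return a
    citation table mapping those numbers back to doc + page."""
    context_lines, citations = [], {}
    for i, c in enumerate(chunks, start=1):
        context_lines.append(f"[{i}] {c.text}")
        citations[i] = f"{c.doc}, p. {c.page}"
    return "\n".join(context_lines), citations
```

When the model answers with bracketed references, the UI resolves each one through the citation table, so the reader can jump straight to the passage.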

4. Evaluation is the hardest part

You can't improve what you can't measure. And measuring RAG quality is genuinely hard.

For retrieval, we track MRR (Mean Reciprocal Rank), Recall@K, and NDCG. But building the ground truth eval set — queries paired with the documents that should be retrieved — requires domain expertise and manual labeling. There's no shortcut.
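The metrics themselves are the easy half. For the record, here is what MRR and Recall@K look like in plain Python, given ranked result lists and the labeled relevant documents per query:

```python
def mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit
    per query (0 contribution if nothing relevant is retrieved)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant_sets, k):
    """Mean fraction of each query's relevant docs found in its top-k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        total += len(set(ranked[:k]) & relevant) / len(relevant)
    return total / len(ranked_lists)
```

The hard half is `relevant_sets` — that's the hand-labeled ground truth there's no shortcut for.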

We start with 20-30 hand-curated queries across different categories (factual lookups, technical specs, cross-document questions, table-heavy queries). Then we expand semi-automatically as we discover edge cases. Holding out a test set to detect overfitting is critical.

Our rule: If you don't have an eval set, you don't have a RAG system. You have a chatbot with a search bar.

5. Tables and structured data need special treatment

Most RAG tutorials assume your documents are clean prose. Real enterprise documents are full of tables, forms, headers, footers, and weird formatting. Standard text extraction butchers tables.

Our approach: extract tables separately, generate natural-language summaries using a local LLM, and embed both the raw table data and the summary. The summary gives the embedding model something semantic to work with. This single technique improved retrieval quality measurably in every deployment with table-heavy documents.
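A sketch of the indexing side of that technique. `summarize_table` here is a trivial stub standing in for the local-LLM summary call, and all names are illustrative — the point is that the raw table and its summary become two index entries sharing one source id, so citations still resolve:

```python
def summarize_table(rows, header):
    """Stub for the local-LLM step: a real deployment prompts a model;
    here we emit a one-line natural-language gloss of the table."""
    return (f"Table with columns {', '.join(header)} "
            f"and {len(rows)} rows; first row: "
            + ", ".join(f"{h}={v}" for h, v in zip(header, rows[0])))

def table_index_entries(table_id, header, rows):
    """Emit two entries per table — raw data and summary — both pointing
    back to the same table_id so retrieval hits cite the source table."""
    raw = "\n".join("\t".join(map(str, r)) for r in [header, *rows])
    summary = summarize_table(rows, header)
    return [
        {"id": f"{table_id}#raw", "text": raw, "source": table_id},
        {"id": f"{table_id}#summary", "text": summary, "source": table_id},
    ]
```

The summary entry is what the embedding model usually matches on; the raw entry is what gets handed to the LLM at generation time.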

6. Start with the retrieval, not the generation

A common mistake: teams spend weeks crafting the perfect system prompt and few-shot examples before making sure retrieval actually works. If the right documents don't appear in the context window, no amount of prompt engineering will save you.

Our process: get retrieval working first. Measure it. Iterate on chunking, embeddings, and search until the eval metrics are solid. Only then optimize the generation prompt. We've seen teams waste months on generation when the real problem was retrieval returning irrelevant chunks.

7. Cost management is an architecture decision

At scale, RAG costs add up fast. Embedding 100K+ documents, running vector search on every query, and sending large context windows to an LLM — each step has a cost.

  • Cache aggressively. If the same query hits the same documents, cache the response.
  • Use cheaper models for filtering and reranking, expensive models only for final generation.
  • Prompt/context caching (offered by both the Anthropic and OpenAI APIs) can reduce costs by 50-80% for repeated system prompts.
  • Consider on-premise vector databases (Zvec, FAISS) for sensitive data — no per-query cloud costs.
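The first point — cache aggressively — is simpler than it sounds. One approach we like: key the cache on the normalized query plus the ids of the retrieved chunks, so a corpus update (which changes the retrieved set) naturally invalidates stale answers. A minimal sketch with illustrative names:

```python
import hashlib

class RagCache:
    """Cache generated answers keyed on (query, retrieved chunk ids),
    so a stale answer is never served after the corpus changes."""
    def __init__(self):
        self._store = {}

    def _key(self, query, chunk_ids):
        # Normalize the query and sort chunk ids so equivalent
        # requests hash to the same key.
        payload = query.strip().lower() + "|" + ",".join(sorted(chunk_ids))
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, query, chunk_ids):
        return self._store.get(self._key(query, chunk_ids))

    def put(self, query, chunk_ids, answer):
        self._store[self._key(query, chunk_ids)] = answer
```

In production you'd back this with Redis or similar and add a TTL, but the keying scheme is the part that matters.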
8. The "overnight experiment" pattern changed how we tune

Inspired by Karpathy's autoresearch, we built a system that lets an AI agent sweep RAG parameters autonomously. Define the parameter ranges, set a fixed compute budget per experiment, point it at an eval set, and let it run overnight.

For one client, 24 experiments ran in six hours. The winning configuration (512-token chunks, 32-token overlap, 0.7/0.3 dense/sparse weighting) achieved 82% Recall@5 — significantly better than our hand-tuned baseline. The agent found parameter interactions we wouldn't have tested manually.

9. Production RAG needs monitoring, not just deployment

Shipping a RAG system is the beginning, not the end. Document corpora change. New content gets added. Query patterns evolve. A system that worked at launch can degrade silently.

  • Track retrieval quality metrics in production (not just latency and error rates)
  • Log queries that return no relevant results — these are your improvement opportunities
  • Re-embed and re-index when significant new content is added
  • Monitor for hallucination patterns — queries where the model confidently cites non-existent information
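The second point — logging queries that retrieve nothing useful — can be as simple as flagging any query whose best retrieval score falls below a threshold. A minimal sketch (the 0.35 cutoff and field names are illustrative; tune against your own score distribution):

```python
def flag_weak_retrievals(query_log, score_threshold=0.35):
    """Surface queries whose best retrieval score fell below a threshold:
    candidates for new content, synonyms, or chunking fixes."""
    flagged = []
    for entry in query_log:
        best = max(entry["scores"], default=0.0)
        if best < score_threshold:
            flagged.append({"query": entry["query"], "best_score": best})
    # Worst-served queries first.
    return sorted(flagged, key=lambda e: e["best_score"])
```

Reviewing this list weekly is one of the cheapest quality loops we know: the flagged queries tell you exactly where the corpus or the index is falling short.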
10. The model matters less than you think

We've deployed RAG with GPT-4, Claude Sonnet, Gemini Flash, and open-source models. The quality difference between top-tier models is smaller than the quality difference between good and bad retrieval.

Choose your model based on practical constraints: latency requirements, cost at your query volume, data residency rules, and context window size. For most enterprise use cases, Claude Sonnet or GPT-4o with good retrieval will outperform GPT-5 with mediocre retrieval.

The bottom line

RAG in production is 20% AI and 80% engineering. The sexy part — picking models, writing prompts — is the smallest piece. The hard work is in document extraction, chunking strategy, evaluation methodology, citation tracking, and operational monitoring.

Every deployment has taught us that the fundamentals matter more than the frontier. Get the retrieval right, measure everything, and iterate relentlessly. The overnight experiment dream isn't a dream anymore — it's how we work.