The four ways RAG breaks in production.
Retrieval-augmented generation looks perfect on the whiteboard and breaks on Monday morning. These are the four failure modes we see most often in client systems — and what actually fixes them.
RAG is the default architecture for LLM features that need to be grounded in a client's own knowledge. It also has a remarkably low survival rate once it meets real users. We have rebuilt enough of these systems for clients to have a consistent list of the things that go wrong. None of them are exotic. All of them are fixable.
1. The chunking is wrong
Most pipelines chunk documents by a fixed token count. That optimises for the embedding model's context window — not for the unit of meaning your users actually ask about. A 512-token chunk cut mid-paragraph loses its subject. A medical note split mid-sentence loses its diagnosis. A legal clause separated from its definition becomes gibberish.
Ask your users what a complete answer looks like and chunk toward that shape. For policy documents that is usually the clause. For product docs it is the section. For long-form content it is the paragraph plus its heading. Then overlap by one full unit of meaning, not by a handful of tokens.
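The idea above can be sketched in a few lines. This is a minimal illustration, not a library: the paragraph is assumed to be the unit of meaning, a line starting with `#` is treated as a heading that stays attached to every chunk in its section, and `chunk_paragraphs` / `units_per_chunk` are names we made up for the sketch.

```python
def chunk_paragraphs(text: str, units_per_chunk: int = 3) -> list[str]:
    """Group paragraphs into chunks, overlapping adjacent chunks by one
    full paragraph. Headings are carried into every chunk of their
    section so a chunk never loses its subject."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    heading = ""
    step = max(1, units_per_chunk - 1)  # advance by n-1: one-unit overlap
    for i in range(0, len(paragraphs), step):
        window = paragraphs[i:i + units_per_chunk]
        if window[0].startswith("#"):
            heading = window[0]
            window = window[1:]
        if not window:
            continue
        prefix = [heading] if heading else []
        chunks.append("\n\n".join(prefix + window))
    return chunks
```

For clause-level or section-level corpora, swap the paragraph splitter for whatever delimits your unit of meaning; the overlap-by-one-unit rule stays the same.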
2. The embeddings are lying
Semantic similarity and answer-relevance are not the same problem. The query "does the employee handbook allow working from home?" embeds close to "is working from home forbidden?" — fine. It also embeds close to a paragraph about equipment reimbursement that happens to share surface vocabulary. Dense retrieval will happily hand you the wrong one.
Pair embedding retrieval with a reranker. Cohere's rerank-english, Jina's rerank or a fine-tuned cross-encoder will lift top-1 accuracy by double digits on most corpora. Evaluate retrieval with labelled query-passage pairs. Not cosine numbers — actual precision at k.
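The eval the paragraph describes fits in one function. A sketch, assuming you hold labelled query-to-relevant-passage pairs and can call your retriever (dense, hybrid, or dense plus rerank) as a function; `retrieve` here is a stand-in, not any particular vendor's API.

```python
from typing import Callable

def precision_at_k(
    labelled: dict[str, set[str]],              # query -> ids of relevant passages
    retrieve: Callable[[str, int], list[str]],  # (query, k) -> ranked passage ids
    k: int = 5,
) -> float:
    """Mean fraction of the top-k retrieved passages that are labelled relevant."""
    scores = []
    for query, relevant in labelled.items():
        hits = retrieve(query, k)
        scores.append(sum(1 for h in hits if h in relevant) / k)
    return sum(scores) / len(scores)
```

Run it once on the raw dense retriever and once with the reranker in the loop; the delta between the two numbers is the reranker's actual value on your corpus.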
3. The index is stale
RAG systems that index once and never re-index produce confident, cited, wrong answers. The citation says "Policy v3"; the policy is on v7. We have seen production systems where fifteen percent of answers referenced a document that had been superseded for eighteen months.
- Run a continuous indexing job triggered by source-of-truth webhooks — never a nightly cron on everything.
- Version every chunk. Retrieval includes the version; the answer cites the version.
- Track which chunks answered which queries. When a chunk goes cold or contradicts a newer chunk, flag it for review.
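The second and third bullets amount to a small piece of bookkeeping. A sketch under stated assumptions: the `Chunk` fields and the flagging rules (superseded version, or never retrieved) are illustrative, and in production the index would live in your vector store's metadata rather than a dict.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    version: int   # retrieved with the chunk, cited alongside the answer
    text: str
    hits: int = 0  # how often this chunk answered a query

def record_hit(index: dict[str, Chunk], chunk_id: str) -> None:
    """Called whenever a chunk is used in an answer."""
    index[chunk_id].hits += 1

def flag_for_review(index: dict[str, Chunk]) -> list[str]:
    """Flag chunks superseded by a newer version of the same document,
    plus chunks that have gone cold (never answered a query)."""
    latest: dict[str, int] = {}
    for c in index.values():
        latest[c.doc_id] = max(latest.get(c.doc_id, 0), c.version)
    return [
        c.chunk_id for c in index.values()
        if c.version < latest[c.doc_id] or c.hits == 0
    ]
```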
4. The generator is over-trusted
The final LLM is not a verifier. It will synthesise an answer from whatever retrieval hands it — including three near-duplicate chunks and one obsolete policy. Wrap generation in a structured critique: does the answer actually cite a source in context, and does the source actually support the claim?
> A RAG system without a critique pass is a system that cannot tell when it is wrong. That is not a minor gap. It is the whole difference between a demo and a product.
The critique pass is cheap to run and catches most public-facing embarrassment before it happens. We typically run it as a second pass with a smaller model — the same one doing the generation, a 4-bit local model, or the fastest tier of whichever vendor you are already paying. Budget: a few cents per answer. Return: the difference between a feature you can ship and a feature you cannot.
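The shape of that second pass can be sketched as follows. The `critic` callable stands in for whichever small model you use; the prompt wording, the citation format (`[s1]`-style markers), and the YES/NO verdict parsing are all assumptions of this sketch, not a specific vendor's API. Note the structural checks run first, so no model call is spent on an answer that cites nothing.

```python
import re
from typing import Callable

def critique(
    answer: str,
    sources: dict[str, str],       # citation id -> passage text
    critic: Callable[[str], str],  # prompt -> model reply
) -> tuple[bool, str]:
    """Return (ok, reason). Fail fast on structural problems before
    asking the critic model whether the sources support the claims."""
    cited = set(re.findall(r"\[(\w+)\]", answer))
    if not cited:
        return False, "answer cites no source"
    unknown = cited - sources.keys()
    if unknown:
        return False, f"answer cites unknown sources: {sorted(unknown)}"
    context = "\n\n".join(sources[c] for c in sorted(cited))
    prompt = (
        "Do these passages support every claim in the answer? "
        "Reply YES or NO.\n\n"
        f"Passages:\n{context}\n\nAnswer:\n{answer}"
    )
    if not critic(prompt).strip().upper().startswith("YES"):
        return False, "critic judged the answer unsupported"
    return True, "ok"
```

A failing critique can route to a retry with different retrieval, a hedged answer, or a human queue — the point is that an unverified answer never ships silently.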
What actually ships
Production RAG is not one system. It is a pipeline: hybrid retrieval (BM25 plus dense), rerank, structured generation with citations, critique pass, feedback loop into your eval set. Skip any of these and you will be back here, rebuilding.
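One common way to merge the BM25 and dense rankings in that first stage — before anything reaches the reranker — is reciprocal rank fusion (RRF): each ranked list contributes 1/(k + rank) per document, and the fused score orders the candidate pool. This is a sketch of the standard technique, not our client code; k = 60 is the conventional constant.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids (e.g. BM25 and dense) into
    one ranking via reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

Documents that appear high in both lists dominate; a document found by only one retriever still survives into the pool, which is exactly what you want feeding a reranker.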