The four ways RAG breaks in production.
Retrieval-augmented generation looks perfect on the whiteboard and breaks on Monday morning. These are the four failure modes we see most often in client systems — and what actually fixes them.
RAG is the default architecture for LLM features that need to be grounded in a client's own knowledge. It also has a remarkably low survival rate once it meets real users. We have rebuilt enough of these systems for clients to have a consistent list of the things that go wrong. None of them are exotic. All of them are fixable.
1. The chunking is wrong
Most pipelines chunk documents by a fixed token count. That optimises for the embedding model's context window — not for the unit of meaning your users actually ask about. A 512-token chunk cut mid-paragraph loses its subject. A medical note split mid-sentence loses its diagnosis. A legal clause separated from its definition becomes gibberish.
Ask your users what a complete answer looks like and chunk toward that shape. For policy documents that is usually the clause. For product docs it is the section. For long-form content it is the paragraph plus its heading. Then overlap by one full unit of meaning, not by a handful of tokens.
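The idea above can be sketched in a few lines. This is a minimal illustration, not a library: the paragraph is assumed to be the unit of meaning, a line starting with `#` is treated as a heading that stays attached to every chunk in its section, and `chunk_paragraphs` / `units_per_chunk` are names we made up for the sketch.

```python
def chunk_paragraphs(text: str, units_per_chunk: int = 3) -> list[str]:
    """Group paragraphs into chunks, overlapping adjacent chunks by one
    full paragraph. Headings are carried into every chunk of their
    section so a chunk never loses its subject."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    heading = ""
    step = max(1, units_per_chunk - 1)  # advance by n-1: one-unit overlap
    for i in range(0, len(paragraphs), step):
        window = paragraphs[i:i + units_per_chunk]
        if window[0].startswith("#"):
            heading = window[0]
            window = window[1:]
        if not window:
            continue
        prefix = [heading] if heading else []
        chunks.append("\n\n".join(prefix + window))
    return chunks
```

For clause-level or section-level corpora, swap the paragraph splitter for whatever delimits your unit of meaning; the overlap-by-one-unit rule stays the same.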
2. The embeddings are lying
Semantic similarity and answer-relevance are not the same problem. The query "does the employee handbook allow working from home?" embeds close to "is working from home forbidden?" — fine. It also embeds close to a paragraph about equipment reimbursement that happens to share surface vocabulary. Dense retrieval will happily hand you the wrong one.
Pair embedding retrieval with a reranker. Cohere's rerank-english, Jina's rerank or a fine-tuned cross-encoder will lift top-1 accuracy by double digits on most corpora. Evaluate retrieval with labelled query-passage pairs. Not cosine numbers — actual precision at k.
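The eval the paragraph describes fits in one function. A sketch, assuming you hold labelled query-to-relevant-passage pairs and can call your retriever (dense, hybrid, or dense plus rerank) as a function; `retrieve` here is a stand-in, not any particular vendor's API.

```python
from typing import Callable

def precision_at_k(
    labelled: dict[str, set[str]],              # query -> ids of relevant passages
    retrieve: Callable[[str, int], list[str]],  # (query, k) -> ranked passage ids
    k: int = 5,
) -> float:
    """Mean fraction of the top-k retrieved passages that are labelled relevant."""
    scores = []
    for query, relevant in labelled.items():
        hits = retrieve(query, k)
        scores.append(sum(1 for h in hits if h in relevant) / k)
    return sum(scores) / len(scores)
```

Run it once on the raw dense retriever and once with the reranker in the loop; the delta between the two numbers is the reranker's actual value on your corpus.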
3. The index is stale
RAG systems that index once and never re-index produce confident, cited, wrong answers. The citation says "Policy v3"; the policy is on v7. We have seen production systems where fifteen percent of answers referenced a document that had been superseded for eighteen months.
- Run a continuous indexing job triggered by source-of-truth webhooks — never a nightly cron on everything.
- Version every chunk. Retrieval includes the version; the answer cites the version.
- Track which chunks answered which queries. When a chunk goes cold or contradicts a newer chunk, flag it for review.
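The second and third bullets amount to a small piece of bookkeeping. A sketch under stated assumptions: the `Chunk` fields and the flagging rules (superseded version, or never retrieved) are illustrative, and in production the index would live in your vector store's metadata rather than a dict.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    version: int   # retrieved with the chunk, cited alongside the answer
    text: str
    hits: int = 0  # how often this chunk answered a query

def record_hit(index: dict[str, Chunk], chunk_id: str) -> None:
    """Called whenever a chunk is used in an answer."""
    index[chunk_id].hits += 1

def flag_for_review(index: dict[str, Chunk]) -> list[str]:
    """Flag chunks superseded by a newer version of the same document,
    plus chunks that have gone cold (never answered a query)."""
    latest: dict[str, int] = {}
    for c in index.values():
        latest[c.doc_id] = max(latest.get(c.doc_id, 0), c.version)
    return [
        c.chunk_id for c in index.values()
        if c.version < latest[c.doc_id] or c.hits == 0
    ]
```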
4. The generator is over-trusted
The final LLM is not a verifier. It will synthesise an answer from whatever retrieval hands it — including three near-duplicate chunks and one obsolete policy. Wrap generation in a structured critique: does the answer actually cite a source in context, and does the source actually support the claim?
> A RAG system without a critique pass is a system that cannot tell when it is wrong. That is not a minor gap. It is the whole difference between a demo and a product.
The critique pass is cheap to run and catches most public-facing embarrassment before it happens. We typically run it as a second pass with a smaller model — the same one doing the generation, a 4-bit local model, or the fastest tier of whichever vendor you are already paying. Budget: a few cents per answer. Return: the difference between a feature you can ship and a feature you cannot.
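The shape of that second pass can be sketched as follows. The `critic` callable stands in for whichever small model you use; the prompt wording, the citation format (`[s1]`-style markers), and the YES/NO verdict parsing are all assumptions of this sketch, not a specific vendor's API. Note the structural checks run first, so no model call is spent on an answer that cites nothing.

```python
import re
from typing import Callable

def critique(
    answer: str,
    sources: dict[str, str],       # citation id -> passage text
    critic: Callable[[str], str],  # prompt -> model reply
) -> tuple[bool, str]:
    """Return (ok, reason). Fail fast on structural problems before
    asking the critic model whether the sources support the claims."""
    cited = set(re.findall(r"\[(\w+)\]", answer))
    if not cited:
        return False, "answer cites no source"
    unknown = cited - sources.keys()
    if unknown:
        return False, f"answer cites unknown sources: {sorted(unknown)}"
    context = "\n\n".join(sources[c] for c in sorted(cited))
    prompt = (
        "Do these passages support every claim in the answer? "
        "Reply YES or NO.\n\n"
        f"Passages:\n{context}\n\nAnswer:\n{answer}"
    )
    if not critic(prompt).strip().upper().startswith("YES"):
        return False, "critic judged the answer unsupported"
    return True, "ok"
```

A failing critique can route to a retry with different retrieval, a hedged answer, or a human queue — the point is that an unverified answer never ships silently.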
What actually ships
Production RAG is not one system. It is a pipeline: hybrid retrieval (BM25 plus dense), rerank, structured generation with citations, critique pass, feedback loop into your eval set. Skip any of these and you will be back here, rebuilding.
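One common way to merge the BM25 and dense rankings in that first stage — before anything reaches the reranker — is reciprocal rank fusion (RRF): each ranked list contributes 1/(k + rank) per document, and the fused score orders the candidate pool. This is a sketch of the standard technique, not our client code; k = 60 is the conventional constant.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids (e.g. BM25 and dense) into
    one ranking via reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

Documents that appear high in both lists dominate; a document found by only one retriever still survives into the pool, which is exactly what you want feeding a reranker.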