There is a widespread misunderstanding in how organisations approach Retrieval Augmented Generation (RAG) deployment.
The common assumption is that the quality of the final output depends primarily on the language model. If the answers are not good enough, upgrade the model. If the system hallucinates, use a model with stronger reasoning. If the responses miss nuance, try a larger, more capable foundation model.
This assumption is wrong.
In most production RAG systems, the language model is not the bottleneck. The retrieval layer is. And until retrieval quality is solved, no amount of LLM capability will make your system reliable.
This article explains why retrieval accuracy is the central quality problem in RAG, how poor retrieval manifests in production, and what practical steps organisations can take to address it.
What RAG Actually Does
To understand why retrieval matters so much, it helps to be precise about what RAG is doing.
In a traditional language model interaction, the model responds based purely on knowledge encoded during training. This creates obvious limitations: the model cannot know about documents you have created internally, recent events, or proprietary data it was never trained on.
RAG addresses this by adding a retrieval step before generation:
1. The user submits a query
2. The system searches a document corpus and retrieves the most relevant chunks
3. The retrieved content is injected into the model's context window alongside the query
4. The model generates a response grounded in the retrieved content
The key insight is that the model's response is only as good as what it receives in step 3. If the retrieval step returns irrelevant, incomplete, or misleading content, the model will generate a response that is irrelevant, incomplete, or misleading — no matter how capable the underlying model is.
Garbage in. Garbage out. The LLM is just a very sophisticated garbage processor.
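The retrieve, inject, and generate steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: the word-overlap scorer is a toy stand-in for real vector search, and `llm` is a hypothetical callable representing whatever model API you use.

```python
import re
from typing import Callable

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query (toy stand-in for vector search)."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda chunk: len(q & tokenize(chunk)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Inject the retrieved chunks into the context alongside the query."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def answer(query: str, corpus: list[str], llm: Callable[[str], str], k: int = 2) -> str:
    """Run retrieval, prompt construction, and generation in order.
    `llm` is a hypothetical callable; a real system would call a model API here."""
    return llm(build_prompt(query, retrieve(query, corpus, k)))
```

Note that the model only ever sees what `retrieve` returns. In this toy scorer, a chunk that says "refunds" shares no token with a query that says "refund", so it can lose out to a chunk that merely shares a common word such as "is".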
The RAG Quality Failure Modes
Poor retrieval quality manifests in several characteristic ways in production RAG systems. Recognising these patterns is the first step toward diagnosis.
Missing Context (False Negatives)
The most damaging retrieval failure is when the relevant content exists in your corpus but is not retrieved for a given query. The model, working only with what it receives, will either hallucinate an answer or acknowledge it cannot respond — even though the answer was theoretically available.
This failure mode is particularly insidious because it is invisible in basic testing. Simple test queries often surface easily-matched content. The failures tend to emerge with:
- Implicit or ambiguous queries (the user knows the answer exists but does not name the topic directly)
- Technical terminology variations (user says "login error", document says "authentication failure")
- Cross-document reasoning (the answer requires combining information from multiple documents)
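The terminology problem is easiest to see with plain keyword matching, where a query and the document that answers it can share no words at all. Dense embeddings narrow this gap but do not guarantee closing it, particularly for domain-specific vocabulary. The strings below are illustrative examples, not data from a real system.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Word overlap between two texts: 0.0 means no shared words, 1.0 means identical vocabularies."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

query = "login error"
relevant_doc = "Resolving an authentication failure"
unrelated_doc = "Login page design guidelines"

print(jaccard(query, relevant_doc))   # 0.0: the document that answers the query shares no words with it
print(jaccard(query, unrelated_doc))  # 0.2: the unrelated document scores higher
```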
Irrelevant Retrieval (False Positives)
The opposite problem: the retrieval system returns content that is superficially related to the query but does not actually answer it.
This happens frequently with large document corpora where multiple sections discuss related topics. A query about pricing policy might retrieve content about pricing philosophy, pricing history, and pricing templates — but not the current pricing policy document.
The model then has to reason across noisy, marginally-relevant content and often either produces a confused synthesis or confidently provides outdated or incorrect information.
Stale or Superseded Content
RAG systems that are not carefully maintained can retrieve outdated content that has been superseded by more recent documents. Without clear versioning and deprecation strategies, a query about your current product capabilities might return documentation written two product versions ago.
The model has no way to know which document is current unless that information is explicitly embedded in the content or metadata — and most organisations do not structure their document corpora with this in mind.
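One way to make currency explicit is to record it in metadata and drop superseded versions before ranking. The sketch below assumes each chunk carries two hypothetical fields, `doc_family` and `effective`; your schema may express versioning differently.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    text: str
    doc_family: str   # hypothetical field: groups all versions of one document
    effective: date   # hypothetical field: date this version took effect

def keep_current(chunks: list[Chunk]) -> list[Chunk]:
    """Keep only chunks belonging to the most recent version in each document family."""
    latest: dict[str, date] = {}
    for c in chunks:
        if c.doc_family not in latest or c.effective > latest[c.doc_family]:
            latest[c.doc_family] = c.effective
    return [c for c in chunks if c.effective == latest[c.doc_family]]
```

Applied to retrieval candidates, this means a query about current product capabilities can no longer be answered from documentation that a newer version has replaced.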
Semantic Mismatch
Standard vector similarity retrieval is based on semantic proximity — the assumption that similar meaning produces similar vector representations. This works well for many queries, but breaks down in important cases:
- Negation and opposites: "What are the limitations of our AI system?" and "What are the capabilities of our AI system?" can have very similar vector representations, so content that is on-topic but answers the opposite question gets retrieved
- Specificity: A very specific query may retrieve broadly-related content rather than the precise answer
- Domain jargon: Industry-specific terminology may not be well-represented in the embedding model's training data, causing poor similarity matching
Why Organisations Focus on the Wrong Thing
If retrieval quality is the real bottleneck, why do most teams focus on LLM selection?
Several reasons:
LLMs are visible; retrieval pipelines are not. When you interact with a RAG system, you see the generated response. The retrieval step is an intermediate process that most users never observe directly. It is natural to attribute response quality to the visible component.
LLM improvements are marketed heavily. AI model providers publish benchmarks, release notes, and case studies. The retrieval layer — databases, embedding models, chunking strategies, metadata schemas — does not have the same commercial attention.
Retrieval problems often look like model problems. When a RAG system gives a wrong answer, it looks like the model hallucinated. In many cases, the model was reasoning correctly — it simply did not have the right information to reason from.
LLM upgrades are easy to execute. Swapping one API endpoint for another takes minutes. Redesigning a retrieval pipeline, re-embedding a document corpus, and building evaluation infrastructure takes weeks or months. Teams naturally gravitate toward the faster intervention.
The Hidden Complexity of Retrieval Quality
Retrieval quality is genuinely complex to measure and improve. Unlike LLM performance, which can be assessed with standardised benchmarks, retrieval quality is deeply context-dependent.
A retrieval system that works well for your HR policy queries may perform poorly for technical support queries from the same document corpus. A chunking strategy that works for legal contracts may fail for meeting notes.
Embedding Model Selection
The embedding model converts text into vector representations. The quality of these representations directly determines retrieval accuracy — but embedding quality is highly domain-dependent.
General-purpose embedding models (the ones most commonly deployed) are trained on broad corpora. They perform well on general language but often underperform on specialised domains: legal language, medical terminology, technical jargon, or domain-specific acronyms.
Many organisations would achieve significantly better retrieval results by fine-tuning an embedding model on domain-specific content — but this requires labelled training data and infrastructure that most teams do not have.
Chunking Strategy
How you divide documents into chunks determines what units of content are available for retrieval. Poor chunking is one of the most common and most impactful sources of retrieval failures.
Common chunking mistakes:
- Fixed-size chunking that splits sentences mid-thought, creating chunks that lack coherent meaning
- Chunks that are too small, providing too little context for the model to reason from
- Chunks that are too large, diluting the relevant content with surrounding material and using up context-window space that other relevant chunks could occupy
- No overlap between chunks, so a passage that spans a chunk boundary is split in two and neither half matches the query well
Good chunking is semantically aware: chunks should map to discrete units of meaning in the source document (paragraphs, sections, procedures, or logical components), not arbitrary character counts.
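A paragraph-aware chunker with overlap can be sketched as follows. It assumes paragraphs are separated by blank lines; `max_chars` and `overlap` are illustrative parameters to tune against your own evaluation data, and a single paragraph longer than `max_chars` is kept whole rather than split.

```python
def chunk_by_paragraph(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_chars, repeating the
    last `overlap` paragraphs of each chunk at the start of the next one."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        # Close the current chunk if adding this paragraph would exceed the limit.
        if current and len("\n\n".join(current + [para])) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the trailing paragraphs forward so boundary content appears in both chunks.
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because splits only happen at paragraph boundaries, no chunk ends mid-sentence, and the overlap means a passage that straddles two chunks is still retrievable as a unit.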
Metadata and Filtering
Vector similarity alone is often insufficient for high-precision retrieval. Combining semantic search with metadata filtering (by date, document type, product version, department, or other structured attributes) narrows the candidate set before ranking and can substantially improve precision.
Most organisations under-invest in metadata schemas at the point of document ingestion. This creates a situation where every query has to search the entire corpus, making it harder to surface the right content against a background of noise.
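The idea of filtering on structured attributes before ranking by similarity can be sketched in plain Python. The index layout (a list of dicts with `id`, `vector`, and `metadata` keys) is an assumption made for illustration; a vector database would provide its own filter syntax.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(query_vec: list[float], index: list[dict], filters: dict, k: int = 3) -> list[str]:
    """Keep only items whose metadata matches every filter, then rank the rest by similarity."""
    candidates = [
        item for item in index
        if all(item["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return [item["id"] for item in ranked[:k]]
```

With an empty `filters` dict the function searches the whole corpus, which is the situation the paragraph above describes: the right document has to outrank every loosely related one on similarity alone.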
Query Understanding
The query itself is often a source of retrieval failure. Users express information needs in natural language, which may not closely match the language used in source documents.
Query expansion (generating alternative phrasings of the original query), hypothetical document embeddings (generating a likely answer and using its embedding for retrieval), and query decomposition (breaking complex queries into simpler sub-queries) are all techniques that improve retrieval quality at the query layer.
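Query expansion can be sketched as: generate variants, search with each, and merge the ranked lists. The example below makes several assumptions: the `SYNONYMS` table is hypothetical (in practice variants are often generated by an LLM or a domain thesaurus), `search` is any callable returning ranked document IDs, and the merge uses reciprocal rank fusion with the commonly used constant of 60.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical synonym table for illustration only.
SYNONYMS = {"login": ["authentication", "sign-in"], "error": ["failure"]}

def expand_query(query: str) -> list[str]:
    """Return the original query plus one variant per single-word synonym substitution."""
    variants = [query]
    words = query.lower().split()
    for i, word in enumerate(words):
        for alt in SYNONYMS.get(word, []):
            variants.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return variants

def fused_search(query: str, search: Callable[[str], list[str]], k: int = 60) -> list[str]:
    """Search with every variant and merge the rankings with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for variant in expand_query(query):
        for rank, doc_id in enumerate(search(variant), start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

A document that only matches the "authentication" phrasing now appears in the merged results even though the user typed "login".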
Building Retrieval Evaluation Infrastructure
You cannot improve what you do not measure. One of the most important investments in RAG quality is building proper evaluation infrastructure for the retrieval layer.
Ground Truth Dataset Construction
Effective retrieval evaluation requires a dataset of queries with known-correct answers and known-relevant source documents. Building this dataset is labour-intensive but irreplaceable.
Start with a representative sample of real queries your system receives. Manually identify which documents or chunks should be retrieved for each query. Use this as your evaluation dataset.
Key metrics to track:
- Recall@K: What fraction of relevant documents are retrieved in the top K results?
- Precision@K: What fraction of the top K results are actually relevant?
- Mean Reciprocal Rank (MRR): How highly is the first correct result ranked?
- NDCG (Normalised Discounted Cumulative Gain): How well does the ranking order reflect relevance?
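These metrics are simple to compute once you have, for each query, the ranked list of retrieved IDs and the set of relevant IDs. The sketch below assumes binary relevance (a document is either relevant or not); MRR is the mean of `reciprocal_rank` across all queries in the dataset.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    return len(set(retrieved[:k]) & relevant) / k if k > 0 else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain of the actual ranking divided by the ideal ranking's."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```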
Retrieval-Separated Evaluation
A critical evaluation mistake is evaluating RAG systems end-to-end only. This makes it impossible to distinguish retrieval failures from generation failures.
Establish separate evaluation pipelines for the retrieval layer and the generation layer. When end-to-end quality drops, you can immediately determine whether the problem is in retrieval, generation, or their interaction.
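A simple way to separate the two layers is to label each evaluated query by whether retrieval surfaced any relevant document and whether the final answer was judged correct. The label names below are hypothetical, and the correctness judgement is assumed to come from a human reviewer or a separate grading step.

```python
def diagnose(retrieved_ids: list[str], relevant_ids: set[str], answer_correct: bool, k: int = 5) -> str:
    """Attribute one evaluated query to the retrieval layer or the generation layer."""
    hit = bool(set(retrieved_ids[:k]) & relevant_ids)
    if answer_correct:
        # A correct answer without supporting evidence is worth flagging:
        # the model may be answering from training data rather than your corpus.
        return "ok" if hit else "correct_without_evidence"
    return "generation_failure" if hit else "retrieval_failure"
```

Counting these labels across the evaluation dataset shows at a glance where a quality drop originates.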
Continuous Monitoring in Production
Retrieval quality tends to degrade over time as document corpora grow and evolve: new documents compete with existing ones for the same queries, and older content goes stale. Build monitoring that re-runs your evaluation dataset on a regular schedule, tracks the retrieval metrics, and alerts when they fall below your baseline.
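The alerting logic can be as simple as comparing the latest evaluation run against a stored baseline. In this sketch the metric names and the `tolerance` value are illustrative assumptions; how the alerts are delivered is left to your existing monitoring stack.

```python
def check_degradation(baseline: dict[str, float], current: dict[str, float], tolerance: float = 0.05) -> list[str]:
    """Return one alert message per metric that is missing or has dropped by more than `tolerance`."""
    alerts: list[str] = []
    for metric, base in baseline.items():
        value = current.get(metric)
        if value is None:
            alerts.append(f"{metric}: missing from current run")
        elif base - value > tolerance:
            alerts.append(f"{metric}: dropped from {base:.2f} to {value:.2f}")
    return alerts
```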
Practical Improvements: Where to Start
For organisations experiencing RAG quality issues, here is a practical prioritisation:
Week 1: Diagnose before changing anything. Build a small evaluation dataset (50–100 query-document pairs) and measure your current retrieval metrics. Establish a baseline before making changes.
Week 2: Review your chunking strategy. Examine chunks that are being retrieved for failing queries. Are they coherent? Do they contain the relevant information? Adjust chunk size and overlap based on what you observe.
Week 3: Improve metadata schemas. Add structured metadata to your document corpus where it is currently absent. Implement metadata-filtered vector search (semantic similarity restricted by structured attributes) for queries where document type or recency matters.
Month 2: Evaluate embedding model alternatives. Compare your current embedding model against alternatives, especially models fine-tuned for your domain. Use your evaluation dataset to measure the impact.
Month 3: Implement query enhancement. Add query expansion or hypothetical document embedding to your retrieval pipeline. Measure the impact on your evaluation dataset.
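Every step in this plan ends with the same action: measure the change against the evaluation dataset. A small harness that scores several retriever variants on the same data makes that comparison routine. The sketch assumes each variant is a callable returning ranked document IDs and that every query in the dataset has at least one relevant document; the variant names in the usage example are hypothetical.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]
Dataset = list[tuple[str, set[str]]]  # (query, IDs of the relevant documents)

def mean_recall_at_k(retriever: Retriever, dataset: Dataset, k: int = 5) -> float:
    """Average Recall@K over the dataset; assumes each query has a non-empty relevant set."""
    if not dataset:
        return 0.0
    total = 0.0
    for query, relevant in dataset:
        top = set(retriever(query)[:k])
        total += len(top & relevant) / len(relevant)
    return total / len(dataset)

def compare(variants: dict[str, Retriever], dataset: Dataset, k: int = 5) -> dict[str, float]:
    """Score every retriever variant on the same dataset so results are directly comparable."""
    return {name: mean_recall_at_k(retriever, dataset, k) for name, retriever in variants.items()}
```

For example, `compare({"current": current_retriever, "new_chunking": candidate_retriever}, dataset)` returns one score per variant, so a change is adopted only when it improves on the baseline.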
The Model Choice Question
After all this, where does LLM selection fit?
It matters — but as a secondary consideration, not a primary one.
A highly capable model operating on poor retrieval results will produce confidently-stated wrong answers. A less capable model operating on excellent retrieval results will produce accurate, well-grounded responses.
Once your retrieval layer is functioning at a high level — with precision and recall metrics that you are satisfied with — then LLM selection becomes meaningful. At that point, differences in reasoning quality, instruction following, and response coherence are the relevant differentiators.
The practical implication: invest in retrieval quality first. When your retrieval system is reliably surfacing the right content, evaluate whether the model is making good use of it. If not, then consider model selection.
Conclusion
The RAG quality problem is primarily a retrieval quality problem.
Organisations that understand this invest in chunking strategies, embedding model selection, metadata schemas, query enhancement, and retrieval evaluation infrastructure — and achieve reliable, production-grade RAG systems.
Organisations that do not understand this cycle through increasingly expensive language models, never quite understanding why their system continues to underperform, because the actual bottleneck remains unaddressed.
Your language model is a reasoning engine. It can only reason from what it receives. Give it the right information, reliably, and it will produce the right outputs.
That is the RAG quality insight. Everything else follows from it.
Digenio Tech Ltd specialises in designing and implementing production-grade RAG and vector database architectures for B2B organisations.