Why embedding is where most RAG systems quietly go wrong


Once documents are uploaded, parsed, and split into sensible pieces, most RAG systems make a quiet assumption:
“From here on, the model will figure it out.”

This assumption is where quality begins to erode.

In this article, we look at the stage that sits between ingestion and answer generation: the part that decides which pieces of knowledge are even considered when a question is asked.

This stage is often called embedding or retrieval. Those terms sound technical, so let’s reframe them in plain language.

This is the moment where the system decides what it thinks is relevant. If that decision is wrong, everything that follows is compromised, no matter how capable the language model may be.


From “well-structured documents” to “useful answers” is not automatic

In previous articles, we focused on ingestion and chunking: how documents are broken into meaningful units, and why structure matters.

But even perfectly chunked documents can still fail at question time. Why?

Because when a user asks a question, the system must:

  • Search across thousands of chunks

  • Decide which ones might be relevant

  • Narrow that down to a small handful

  • Pass only those pieces to the language model

That narrowing step is not neutral. It is an opinionated act. And most RAG systems outsource that opinion to default settings.


The Default Retrieval Problem: “Close Enough” Content

Most off-the-shelf RAG setups work roughly like this:

  1. Turn each chunk into a numeric representation (an “embedding”)

  2. Turn the user’s question into the same kind of representation

  3. Find the chunks that are mathematically “closest” to the question

  4. Return the top few
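
The four steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the toy bag-of-words `embed` stands in for a real embedding model, and every name here is our own invention.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag of words. A real system would call an
    # embedding model here and get back a dense vector instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Standard cosine similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(chunks: list[str], question: str, k: int = 3) -> list[str]:
    q = embed(question)                                   # step 2
    scored = [(cosine(embed(c), q), c) for c in chunks]   # steps 1 and 3
    scored.sort(key=lambda p: p[0], reverse=True)
    return [c for _, c in scored[:k]]                     # step 4
```

Every off-the-shelf pipeline is, at heart, a variation of this loop; the failure modes below all live inside that final `scored[:k]`.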

This works surprisingly well, until it doesn’t. In real corpora, especially complex ones, “close” is often not “correct”. Consider what tends to happen:

  • Broad, general documents appear relevant to many questions

  • Large regulatory texts produce lots of candidate chunks

  • Generic safety language competes with specific technical clauses

  • Entire documents are returned when only one paragraph mattered

The result is not obviously wrong. It is plausible. And plausible answers are the hardest failures to detect.


Why ranking matters more than retrieval

One of the biggest misconceptions in RAG design is that the hard part is finding information.

In practice, the hard part is deciding what outranks what.

When multiple chunks are “somewhat relevant”, the system must answer questions like:

  • Should a highly specific clause beat a broad overview?

  • Should an authoritative source beat a more recent but less binding one?

  • Should multiple similar chunks from the same document crowd out others?

  • How much context can we afford to pass before the model loses focus?

These are not model questions. They are design questions. And default RAG pipelines don’t answer them; they avoid them.


Why “Top-K” is not a strategy

Many RAG demos rely on a simple idea: “Just take the top 5 (or 10, or 20) results.” That sounds reasonable until you look closely.

Top-K does not account for:

  • Document hierarchy or authority

  • Structural dominance (some documents naturally produce more chunks)

  • Context budgets (how much text the model can actually use)

  • Fairness across sources

  • The difference between related and decisive information

In practice, top-K often means:

  • Too much generic content

  • Too many chunks from the same source

  • Not enough coverage of edge cases

  • And an overloaded model trying to reason over everything at once

Again: not a model failure. A retrieval design failure.

How we think about retrieval at SnapInsight

At SnapInsight, we treat retrieval as a ranking problem, not a lookup problem. That mindset changes everything.

Instead of asking, “Which chunks are similar to this question?”, we ask, “If only a limited number of chunks can shape the answer, which ones deserve that influence?” That leads to a very different design approach.


Phase 1: Controlled semantic selection

The first step is still semantic similarity, but applied carefully. We intentionally:

  • Limit how many candidate chunks are considered

  • Discard weak matches early

  • Avoid letting sheer document size dominate results

  • Ensure retrieval remains predictable under load

This alone removes a surprising amount of noise.
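
The first two bullets can be sketched as a simple gate in front of the ranking stage. The threshold and pool size below are illustrative numbers, not tuned values from any real system.

```python
def select_candidates(scored, max_candidates=50, min_score=0.35):
    """scored: (score, chunk) pairs. Both thresholds are illustrative.

    Discards weak matches early and bounds the candidate pool, so a
    huge document cannot flood the ranking stage simply by producing
    many mediocre chunks.
    """
    strong = [(s, c) for s, c in scored if s >= min_score]  # drop weak matches
    strong.sort(key=lambda p: p[0], reverse=True)
    return strong[:max_candidates]                          # bound the pool
```

The point is not the specific numbers but the existence of the gate: without it, every downstream decision is made over an unbounded, noisy candidate set.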


Phase 2: Reinforcing meaning, not just proximity

Similarity alone is not enough.

Two pieces of text can be mathematically “close” to a question for very different reasons:

  • One may genuinely answer it

  • Another may just share vocabulary

So, we introduce additional signals that reinforce meaning:

  • Exact language matches

  • Term specificity

  • Structural relevance

Not by tagging documents manually, but by letting the text itself speak more clearly. This helps ensure that when a question is precise, the retrieved knowledge is precise too.
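
One way such signals can be blended, shown purely as a sketch under our own assumed names and weights, is to add an IDF-weighted exact-match bonus to the semantic score:

```python
def lexical_boost(question_terms, chunk_terms, idf):
    """Reward exact term overlap, weighted by term specificity (IDF):
    rare, specific words count for far more than generic ones."""
    shared = set(question_terms) & set(chunk_terms)
    return sum(idf.get(t, 0.0) for t in shared)

def hybrid_score(semantic_sim, question_terms, chunk_terms, idf, alpha=0.7):
    # Blend semantic proximity with the meaning-reinforcing lexical
    # signal; alpha is a tunable blend weight, not a magic constant.
    return alpha * semantic_sim + (1 - alpha) * lexical_boost(
        question_terms, chunk_terms, idf)
```

Under this scheme, a chunk that shares a rare, precise term with the question can outrank a chunk that is merely semantically adjacent, which is exactly the behaviour a precise question deserves.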


Why this is rarely done well

Designing retrieval this way is slower.

It requires:

  • Thinking in terms of trade-offs, not defaults

  • Testing behaviour across many questions, not just demos

  • Accepting that “more context” is often worse, not better

  • Measuring how systems behave when knowledge is incomplete

Most teams don’t do this because:

  • Retrieval failures are subtle

  • Demos don’t reveal them

  • And shortcuts look good early on

But over time, these shortcuts surface as:

  • Inconsistent answers

  • Unstable citations

  • Overconfident responses

  • And user trust slowly eroding


Embedding is not a background task

Embedding is often treated as a background process: upload the files, generate embeddings, move on. In reality, it is where the system learns how to pay attention.

Poor attention leads to:

  • Confused answers

  • Hallucinations framed as confidence

  • Missed edge cases

  • And a system that feels impressive but unreliable

Good attention leads to:

  • Focused answers

  • Predictable behaviour

  • Clear grounding

  • And a system people trust with important decisions


Why this matters for production systems

Anyone can build a RAG prototype. Very few teams build RAG systems that:

  • Scale gracefully

  • Remain stable as corpora grow

  • Behave sensibly under ambiguity

  • And continue to improve rather than drift

That difference is not about models. It is about the invisible design choices made between ingestion and generation.

This is the work we choose to do at SnapInsight. Not because it’s glamorous. Not because it’s easy to demo. But because this is where quality is actually decided.