Why embedding is where most RAG systems quietly go wrong


Once documents are uploaded, parsed, and split into sensible pieces, most RAG systems make a quiet assumption:
“From here on, the model will figure it out.”

This assumption is where quality begins to erode.

In this article, we look at the stage that sits between ingestion and answer generation: the part that decides which pieces of knowledge are even considered when a question is asked.

This stage is often called embedding or retrieval. Those terms sound technical, so let’s reframe them in plain language.

This is the moment where the system decides what it thinks is relevant. If that decision is wrong, everything that follows is compromised, no matter how capable the language model may be.


From “well-structured documents” to “useful answers” is not automatic

In previous articles, we focused on ingestion and chunking: how documents are broken into meaningful units, and why structure matters.

But even perfectly chunked documents can still fail at question time. Why?

Because when a user asks a question, the system must:

  • Search across thousands of chunks

  • Decide which ones might be relevant

  • Narrow that down to a small handful

  • Pass only those pieces to the language model

That narrowing step is not neutral. It is an opinionated act. And most RAG systems outsource that opinion to default settings.


The Default Retrieval Problem: “Close Enough” Content

Most off-the-shelf RAG setups work roughly like this:

  1. Turn each chunk into a numeric representation (an “embedding”)

  2. Turn the user’s question into the same kind of representation

  3. Find the chunks that are mathematically “closest” to the question

  4. Return the top few
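
The four steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the toy bag-of-words `embed` stands in for a real embedding model, and every name here is our own invention.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag of words. A real system would call an
    # embedding model here and get back a dense vector instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Standard cosine similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(chunks: list[str], question: str, k: int = 3) -> list[str]:
    q = embed(question)                                   # step 2
    scored = [(cosine(embed(c), q), c) for c in chunks]   # steps 1 and 3
    scored.sort(key=lambda p: p[0], reverse=True)
    return [c for _, c in scored[:k]]                     # step 4
```

Every off-the-shelf pipeline is, at heart, a variation of this loop; the failure modes below all live inside that final `scored[:k]`.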

This works surprisingly well, until it doesn’t. In real corpora, especially complex ones, “close” is often not “correct”. Consider what tends to happen:

  • Broad, general documents appear relevant to many questions

  • Large regulatory texts produce lots of candidate chunks

  • Generic safety language competes with specific technical clauses

  • Entire documents are returned when only one paragraph mattered

The result is not obviously wrong. It is plausible. And plausible answers are the hardest failures to detect.


Why ranking matters more than retrieval

One of the biggest misconceptions in RAG design is that the hard part is finding information.

In practice, the hard part is deciding what outranks what.

When multiple chunks are “somewhat relevant”, the system must answer questions like:

  • Should a highly specific clause beat a broad overview?

  • Should an authoritative source beat a more recent but less binding one?

  • Should multiple similar chunks from the same document crowd out others?

  • How much context can we afford to pass before the model loses focus?

These are not model questions. They are design questions. And default RAG pipelines don’t answer them; they avoid them.


Why “Top-K” is not a strategy

Many RAG demos rely on a simple idea: “Just take the top 5 (or 10, or 20) results.” That sounds reasonable until you look closely.

Top-K does not account for:

  • Document hierarchy or authority

  • Structural dominance (some documents naturally produce more chunks)

  • Context budgets (how much text the model can actually use)

  • Fairness across sources

  • The difference between related and decisive information

In practice, top-K often means:

  • Too much generic content

  • Too many chunks from the same source

  • Not enough coverage of edge cases

  • And an overloaded model trying to reason over everything at once

Again: not a model failure. A retrieval design failure.

How we think about retrieval at SnapInsight

At SnapInsight, we treat retrieval as a ranking problem, not a lookup problem. That mindset changes everything.

Instead of asking, “Which chunks are similar to this question?”, we ask, “If only a limited number of chunks can shape the answer, which ones deserve that influence?” That leads to a very different design approach.


Phase 1: Controlled semantic selection

The first step is still semantic similarity, but applied carefully. We intentionally:

  • Limit how many candidate chunks are considered

  • Discard weak matches early

  • Avoid letting sheer document size dominate results

  • Ensure retrieval remains predictable under load

This alone removes a surprising amount of noise.
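
The first two bullets can be sketched as a simple gate in front of the ranking stage. The threshold and pool size below are illustrative numbers, not tuned values from any real system.

```python
def select_candidates(scored, max_candidates=50, min_score=0.35):
    """scored: (score, chunk) pairs. Both thresholds are illustrative.

    Discards weak matches early and bounds the candidate pool, so a
    huge document cannot flood the ranking stage simply by producing
    many mediocre chunks.
    """
    strong = [(s, c) for s, c in scored if s >= min_score]  # drop weak matches
    strong.sort(key=lambda p: p[0], reverse=True)
    return strong[:max_candidates]                          # bound the pool
```

The point is not the specific numbers but the existence of the gate: without it, every downstream decision is made over an unbounded, noisy candidate set.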


Phase 2: Reinforcing meaning, not just proximity

Similarity alone is not enough.

Two pieces of text can be mathematically “close” to a question for very different reasons:

  • One may genuinely answer it

  • Another may just share vocabulary

So, we introduce additional signals that reinforce meaning:

  • Exact language matches

  • Term specificity

  • Structural relevance

Not by tagging documents manually, but by letting the text itself speak more clearly. This helps ensure that when a question is precise, the retrieved knowledge is precise too.
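
One way such signals can be blended, shown purely as a sketch under our own assumed names and weights, is to add an IDF-weighted exact-match bonus to the semantic score:

```python
def lexical_boost(question_terms, chunk_terms, idf):
    """Reward exact term overlap, weighted by term specificity (IDF):
    rare, specific words count for far more than generic ones."""
    shared = set(question_terms) & set(chunk_terms)
    return sum(idf.get(t, 0.0) for t in shared)

def hybrid_score(semantic_sim, question_terms, chunk_terms, idf, alpha=0.7):
    # Blend semantic proximity with the meaning-reinforcing lexical
    # signal; alpha is a tunable blend weight, not a magic constant.
    return alpha * semantic_sim + (1 - alpha) * lexical_boost(
        question_terms, chunk_terms, idf)
```

Under this scheme, a chunk that shares a rare, precise term with the question can outrank a chunk that is merely semantically adjacent, which is exactly the behaviour a precise question deserves.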


Why this is rarely done well

Designing retrieval this way is slower.

It requires:

  • Thinking in terms of trade-offs, not defaults

  • Testing behaviour across many questions, not just demos

  • Accepting that “more context” is often worse, not better

  • Measuring how systems behave when knowledge is incomplete

Most teams don’t do this because:

  • Retrieval failures are subtle

  • Demos don’t reveal them

  • And shortcuts look good early on

But over time, these shortcuts surface as:

  • Inconsistent answers

  • Unstable citations

  • Overconfident responses

  • And user trust slowly eroding


Embedding is not a background task

Embedding is often treated as a background process: upload the files, generate embeddings, move on. In reality, it is where the system learns how to pay attention.

Poor attention leads to:

  • Confused answers

  • Hallucinations framed as confidence

  • Missed edge cases

  • And a system that feels impressive but unreliable

Good attention leads to:

  • Focused answers

  • Predictable behaviour

  • Clear grounding

  • And a system people trust with important decisions


Why this matters for production systems

Anyone can build a RAG prototype. Very few teams build RAG systems that:

  • Scale gracefully

  • Remain stable as corpora grow

  • Behave sensibly under ambiguity

  • And continue to improve rather than drift

That difference is not about models. It is about the invisible design choices made between ingestion and generation.

This is the work we choose to do at SnapInsight. Not because it’s glamorous. Not because it’s easy to demo. But because this is where quality is actually decided.