
RAG in the Real World: Building a High-Precision Retrieval System for LLM Pipelines

I built this as an independent, production-style simulation in my own environment: chunk files, embed them, retrieve the top-k matches, send them to an LLM.

Then I started testing it with real repo questions and found consistent failure modes: wrong chunk selection, missed exact symbol lookups, and noisy context windows. This post is my learning path from that naive baseline to a retrieval stack that is good enough for production-style workloads.

What I Was Trying to Build

The broader attempt in this series is a technical documentation repository that AI coding agents can query to get correct, source-grounded implementation guidance.

The system-level arc is:

  1. Ingest and version technical sources reliably.
  2. Retrieve high-precision context for coding questions.
  3. Measure quality continuously so the system does not drift.

This post is Part 2: retrieval quality and ranking.

A note on terminology before we begin. The retrieval system described in this post is not an agent in the research sense. An agent implies autonomous tool use, multi-step planning, and environment interaction — executing code, navigating a file system, calling external APIs. What we built here is a high-precision knowledge lookup component: a tool that LLM-driven agents invoke. If you are building autonomous agents that edit code or run tests, you still need exactly this layer underneath them. Getting retrieval right is a prerequisite, not a substitute.


Naive RAG: Where Everyone Starts

My initial setup looked like every RAG tutorial. I had a corpus of project files — Python source, markdown documentation, JSON configs, CSV reports — and I wanted a retrieval system that could answer questions by reading them.

The four steps felt obvious:

  1. Chunk each file into passages
  2. Embed each passage with a language model
  3. At query time, embed the question and find the top-k most similar passages
  4. Feed those passages to an LLM and ask it to synthesize an answer

I chunked at 512 tokens with 50-token overlap, used text-embedding-3-large via Azure OpenAI, stored everything in Azure AI Search, and put GPT-4o at the end. The demo worked. Production-like queries did not.

Why these components

I want to be explicit here because “just use X” is not useful advice.

  • Azure AI Search: chosen because it supports vector + BM25 in one place, which reduced system complexity for a solo project.
  • Cross-encoder reranking: added only after I verified top-k quality was inconsistent.
  • Query classifier + HDE: added later for query-specific routing, not as a default starting point.

If you are starting from scratch, build dense + BM25 first. Add reranking only when your evaluation data shows a clear gap.

flowchart LR
    subgraph Ingestion["Offline: File Ingestion"]
        A[Files in Azure Blob\nCode, Docs, Configs, CSVs] --> B[Text Chunker\n512 tokens, 50 overlap]
        B --> C[Azure OpenAI Embeddings\ntext-embedding-3-large]
        C --> D[(Azure AI Search\nVector Index)]
    end

    subgraph Retrieval["Online: Query Time"]
        E[User Query] --> F[Embed Query]
        F --> G[Top-K Similarity Search\nk=5]
        G --> H[Retrieved Chunks]
        H --> I[Azure OpenAI\nGPT-4o]
        I --> J[Answer]
    end

    D --> G

    style Ingestion fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb
    style Retrieval fill:#14532d,stroke:#22c55e,color:#f9fafb

Clean. Simple. Broken in ways that took me weeks to understand.

Problem 1: Chunking Is Not a Detail

The first surprise was how catastrophically chunking strategy affects quality — not a little, catastrophically. Fixed-size chunks slice through concepts mid-sentence, separate code from its docstring, and bundle unrelated sections together when they happen to sit next to each other in a file.

Consider a Python module structured like this:

# ──────────────────────────────
# Section: Authentication helpers
# ──────────────────────────────

def verify_token(token: str) -> bool:
    """Verify a JWT access token. Returns True if valid."""
    ...
    # ... 300 tokens of implementation ...

# ──────────────────────────────
# Section: Rate limiting
# ──────────────────────────────

def check_rate_limit(user_id: str) -> bool:
    """Check if the user has exceeded their request quota."""
    ...

A 512-token chunk might end mid-function, grabbing the tail of verify_token and the head of check_rate_limit. Pass the query “how does rate limiting work?” and the system retrieves this mixed chunk — then the LLM, faithfully following its context, confuses the two.

Structured files compound the problem. A CSV with hundreds of rows gets chunked arbitrarily, stripping the column headers that give each row meaning.

The fix was semantic chunking: slice at natural boundaries — function definitions, section headers, paragraph breaks — rather than at arbitrary token counts:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticChunker:
    def __init__(self, model: SentenceTransformer, threshold: float = 0.85):
        self.model = model
        self.threshold = threshold
        self.structural_splitter = RecursiveCharacterTextSplitter(
            separators=["\n\n", "\ndef ", "\nclass ", "\n# ", ". ", " "],
            chunk_size=200,
            chunk_overlap=0,
        )

    def chunk(self, text: str) -> list[str]:
        small_chunks = self.structural_splitter.split_text(text)
        if not small_chunks:
            return []
        embeddings = self.model.encode(small_chunks)

        merged = [small_chunks[0]]
        current_emb = embeddings[0]

        for i in range(1, len(small_chunks)):
            sim = np.dot(current_emb, embeddings[i]) / (
                np.linalg.norm(current_emb) * np.linalg.norm(embeddings[i])
            )

            if sim > self.threshold and len(merged[-1]) + len(small_chunks[i]) < 1500:
                merged[-1] += "\n" + small_chunks[i]
                current_emb = (current_emb + embeddings[i]) / 2
            else:
                merged.append(small_chunks[i])
                current_emb = embeddings[i]

        return merged

For structured data I took a different approach — serialize each row with its column labels inline before chunking, so the embedding model has something meaningful to work with:

import pandas as pd

def chunk_csv(df: pd.DataFrame, max_rows_per_chunk: int = 10) -> list[str]:
    """Convert CSV rows to natural-language chunks, headers always included."""
    chunks = []
    cols = df.columns.tolist()

    for start in range(0, len(df), max_rows_per_chunk):
        batch = df.iloc[start : start + max_rows_per_chunk]
        header = "Columns: " + ", ".join(cols)
        rows = batch.apply(
            lambda r: " | ".join(f"{c}: {r[c]}" for c in cols), axis=1
        ).tolist()
        chunks.append(header + "\n" + "\n".join(rows))

    return chunks

Problem 2: Dense Retrieval Misses What Keyword Search Catches

Even with better chunking, dense retrieval had a predictable blind spot: exact lookups.

When a developer asked “show me the calculate_mrr function”, the embedding model surfaced a dozen vaguely similar functions — because semantically, all function definitions cluster together in the embedding space. What the user needed was the single chunk containing that specific function name, which a keyword index finds instantly.

This is the core tension between dense retrieval (great for semantic similarity) and sparse retrieval (great for exact terms and rare identifiers). Neither alone is enough.

I implemented hybrid search using Azure AI Search, which supports both vector and BM25 full-text search natively in a single request:

from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from azure.core.credentials import AzureKeyCredential

def reciprocal_rank_fusion(
    dense_results: list[tuple[str, float]],
    sparse_results: list[tuple[str, float]],
    k: int = 60,
) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for rank, (doc_id, _) in enumerate(dense_results):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    for rank, (doc_id, _) in enumerate(sparse_results):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

class HybridSearcher:
    def __init__(self, endpoint: str, index_name: str, api_key: str):
        self.client = SearchClient(
            endpoint=endpoint,
            index_name=index_name,
            credential=AzureKeyCredential(api_key),
        )
    def search(
        self,
        query: str,
        query_embedding: list[float],
        top_k: int = 20,
    ) -> list[dict]:
        vector_query = VectorizedQuery(
            vector=query_embedding,
            k_nearest_neighbors=top_k * 2,
            fields="embedding",
        )

        # Azure AI Search runs both in a single round-trip
        results = self.client.search(
            search_text=query,              # BM25 full-text
            vector_queries=[vector_query],  # dense vector
            top=top_k * 2,
            select=["id", "chunk_text", "source_path", "file_type"],
        )

        return list(results)

Reciprocal Rank Fusion is elegant in its simplicity: rather than normalizing scores across two fundamentally different systems, it just uses rank positions. A chunk ranked #2 in dense and #3 in sparse is almost certainly more relevant than one ranked #1 in dense and #40 in sparse. k=60 is empirically solid and rarely needs tuning.
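To make that rank-based intuition concrete, here is the arithmetic for those two chunks at k=60, restating the RRF formula above in standalone form:

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    # ranks are 1-based positions in each retriever's result list
    return sum(1.0 / (k + r) for r in ranks)

chunk_a = rrf_score([2, 3])   # ranked #2 dense, #3 sparse
chunk_b = rrf_score([1, 40])  # ranked #1 dense, #40 sparse

print(f"{chunk_a:.4f} vs {chunk_b:.4f}")  # ≈ 0.0320 vs 0.0264 — chunk A wins
```

Because every rank contributes at most 1/(k+1), one stellar rank can never outweigh two consistently good ones — which is exactly the behavior you want when fusing retrievers with incomparable score scales.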

Problem 3: Top-5 Chunks Are Not All Equal

Even with hybrid retrieval, the top-5 chunks often mixed highly relevant and tangentially relevant content. Give all five to the LLM and it treats them as equally authoritative — it may anchor on a less-relevant chunk if it contains confident-sounding language.

The solution is a cross-encoder re-ranker. Cross-encoders score a (query, passage) pair jointly — far more accurate than bi-encoders, but too slow to run over the whole corpus. The pattern: use bi-encoder retrieval to get a candidate set (top-20), then re-rank with the cross-encoder.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(
    query: str,
    candidates: list[dict],
    top_k: int = 5,
) -> list[dict]:
    pairs = [(query, c["chunk_text"]) for c in candidates]
    scores = reranker.predict(pairs)

    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return [result for result, _ in ranked[:top_k]]

First stage: recall everything potentially relevant, fast. Second stage: score only the candidates, slow but precise.
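The two-stage pattern itself reduces to a few lines of glue. In this sketch, retrieve_fn and rerank_fn are stand-ins for the hybrid searcher and cross-encoder described above, not the production wiring:

```python
from typing import Callable

def two_stage_retrieve(
    query: str,
    retrieve_fn: Callable[[str, int], list[dict]],
    rerank_fn: Callable[[str, list[dict]], list[dict]],
    recall_k: int = 20,
    precision_k: int = 5,
) -> list[dict]:
    # Stage 1: cheap, high-recall — cast a wide net over the index
    candidates = retrieve_fn(query, recall_k)
    # Stage 2: expensive, high-precision — score only the candidate set
    return rerank_fn(query, candidates)[:precision_k]
```

The key tuning knob is recall_k: too small and the reranker never sees the right chunk; too large and latency grows linearly with the number of cross-encoder calls.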

Simplified Production Architecture

flowchart TD
    A[Source files] --> B[Chunk + embed]
    B --> C[Hybrid index]
    D[User query] --> E[Hybrid retrieval top-20]
    C --> E
    E --> F[Rerank top-5]
    F --> G[LLM answer with citations]

    style A fill:#1e3a5f,stroke:#3b82f6,color:#f9fafb
    style C fill:#14532d,stroke:#22c55e,color:#f9fafb
    style F fill:#3b0764,stroke:#a855f7,color:#f9fafb
    style G fill:#7c2d12,stroke:#f97316,color:#f9fafb

A few components mattered most during iteration.

Hypothetical Document Embedding (HDE)

The problem: query embeddings and document embeddings don’t always sit in the same region of semantic space. A short question like “how does caching work in the auth layer?” may not embed close to the relevant source files, even though those files clearly answer the question.

HDE inverts this. Instead of embedding the query directly, you ask a fast LLM to generate a hypothetical code snippet or doc excerpt that would answer the query — then embed that as your search vector. Hypothetical documents embed far closer to real documents because they’re the same kind of content.

from openai import AsyncAzureOpenAI

async def hypothetical_document_embed(
    query: str,
    azure_client: AsyncAzureOpenAI,
    embed_fn,
) -> list[float]:
    response = await azure_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Write a short code snippet or documentation excerpt (2–3 sentences) "
                "that would directly answer this question. Be specific.\n\n"
                f"Question: {query}\n\nExcerpt:"
            ),
        }],
        max_tokens=150,
        temperature=0.0,
    )
    hypothetical = response.choices[0].message.content
    return embed_fn(hypothetical)

I A/B tested this against direct query embedding. For precise function lookups, HDE improved recall@5 by about 14%. For open-ended exploratory queries it occasionally hurt, so I only use it for selected query classes.

Query Classification

Not all queries are the same:

  • Keyword queries ("calculate_mrr function"): Route heavily toward BM25.
  • Conceptual queries ("how does the auth module handle token expiry?"): Route heavily toward dense + HDE.
  • Mixed queries ("rate limiting implementation in the billing service"): Balanced hybrid.

A lightweight fine-tuned classifier routes each query to the right retrieval configuration, adding about 8ms latency in exchange for meaningfully better results on identifier-heavy queries.
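I can't share the fine-tuned classifier, but a heuristic stand-in illustrates the routing logic: identifier-shaped tokens push toward keyword mode, question framing pushes toward dense retrieval. The regex and thresholds here are illustrative, not the production model:

```python
import re

# snake_case or CamelCase tokens are strong keyword-search signals
IDENTIFIER = re.compile(r"\b(?:[a-z]+_[a-z_]+|[A-Z][a-z]+[A-Z]\w*)\b")
QUESTION_WORDS = {"how", "why", "what", "when", "where", "does", "can"}

def route_query(query: str) -> str:
    has_identifier = bool(IDENTIFIER.search(query))
    words = query.lower().split()
    is_question = query.strip().endswith("?") or (words and words[0] in QUESTION_WORDS)

    if has_identifier and not is_question:
        return "keyword"    # lean on BM25
    if is_question and not has_identifier:
        return "semantic"   # dense + HDE
    return "hybrid"         # balanced fusion
```

Note that a query containing both signals ("how does TokenBucket interact with the auth flow?") falls through to hybrid — exactly the class of query a naive keyword-first rule mishandles.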

The Lesson Nobody Puts in the Tutorial

The quality of your retrieval system is mostly determined by what you retrieve, not by your choice of LLM. I spent too much time tuning generation before I had retrieval quality baselines.

Give the LLM the wrong files and it will produce a confidently wrong answer. Give it the right files and even a weaker model performs acceptably. Improve retrieval first, always — and measure it with recall@k before making any generation-side changes.
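recall@k is simple enough to implement inline — the fraction of ground-truth chunks that make it into the top-k. The chunk IDs below are illustrative:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]) & relevant_ids
    return len(hits) / len(relevant_ids)

# e.g. one of two relevant chunks surfaced in the top 5:
score = recall_at_k(["c12", "c7", "c3", "c9", "c41"], {"c7", "c88"}, k=5)
print(score)  # 0.5
```

Averaged over a labeled query set, this one number tells you whether a retrieval change helped before you spend a single token on generation.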

The other uncomfortable truth: your file quality is your ceiling. If comments are stale or docs contradict code, retrieval will surface that confusion exactly as written.

What Breaks in Production

Documenting the failure modes is as important as documenting the design.

Semantic chunking degrades on terse, heavily annotated code. The similarity-merge approach performs well on prose markdown and verbose docstrings. It degrades on files with dense decorator stacks (@pytest.mark.parametrize, @dataclass, @app.route), where each decorator is short and semantically unrelated to adjacent ones but structurally inseparable. Threshold 0.85 produces over-merged chunks that mix unrelated decorators. I fall back to AST-level function splitting for files above a decorator density threshold.

HDE degrades on multi-hop queries. “What does calculate_mrr call, and does any of those functions touch the rate limiter?” requires two retrieval passes. A single hypothetical document embedding retrieves the entry function but misses the downstream dependency. HDE improved recall@5 by 14% on single-hop factual queries and by 1–2% on multi-hop queries. I use it selectively.

Cross-encoder reranking is unreliable on very short chunks. Chunks under ~80 tokens lack enough context for the cross-encoder to score confidently. I see high score variance on these and have found it more reliable to merge sub-80-token chunks during ingestion than to rerank them at query time.
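The ingestion-time fix is a single merge pass. Token counts here are approximated by whitespace splitting, a rough stand-in for a real tokenizer:

```python
def merge_short_chunks(chunks: list[str], min_tokens: int = 80) -> list[str]:
    """Fold any chunk below min_tokens into its predecessor so the
    cross-encoder never has to score a context-free fragment."""
    merged: list[str] = []
    for chunk in chunks:
        if merged and len(chunk.split()) < min_tokens:
            merged[-1] = merged[-1] + "\n" + chunk
        else:
            merged.append(chunk)
    return merged
```

Folding forward into the predecessor (rather than backward) preserves reading order, which matters when adjacent chunks came from the same file section.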

The query classifier over-routes to keyword mode for mixed queries with identifiers. Queries like “how does TokenBucket interact with the auth flow?” contain a class name (keyword signal) but are fundamentally semantic. My classifier, fine-tuned on a 2,000-sample training set, routes ~22% of these to keyword-only mode, which has lower recall@5 for them (0.61 vs. 0.79 in hybrid mode). Expanding the training set is on the roadmap.


The next question, after you’ve built this retrieval system, is: how do you know it’s any good? We’ll cover that — RAGAS, LLM-as-judge calibration, and the failure modes that metrics don’t catch — in the next post.