RAG Retrieval Optimization: Implementing Hybrid Search and Reranking
Prologue: The Illusion and Cold Reality of Vector Search
In the current AI landscape, nine out of ten developers building applications are implementing RAG (Retrieval-Augmented Generation). I was no exception. Eager to create a chatbot that could accurately query a vast base of technical documentation, I set out to build a RAG pipeline.
Initially, I treated the combination of vector databases and embedding models like magic. Users would type a natural language question, and the vector DB would calculate the mathematical cosine similarity, returning the most semantically related chunks of text.
However, once the application was pushed to production and real user logs started rolling in, the harsh reality set in.
Users did not ask questions in clean, descriptive sentences. Often, they typed exact error codes or specific API endpoints like "Error 504", "auth-client-init", or "pg_dump".
To my surprise, semantic vector search failed miserably at retrieving documentation for these specific terms. For instance, when a user queried auth-client-init, vector search recommended general guides on "authentication best practices" or "client initialization syntax." These results were semantically close, but they completely missed the highly specific error guide the user needed to resolve their immediate bug.
"How could a system that understands human semantics so well fail to match a simple error code?"
After research, I realized that vector semantic search and traditional keyword search are fundamentally complementary. To build a production-grade RAG pipeline, you must combine both.
Concept: Hybrid Search and Reranking
To significantly boost retrieval quality, two design patterns stand out: Hybrid Search and Reranking.
1. Hybrid Search: The Marriage of BM25 and Vector Embeddings
Hybrid search merges traditional keyword-based matching algorithms like BM25 with modern semantic vector search.
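- BM25 (Sparse Vector): A ranking function that scores a document by how often each query term appears in it (term frequency), how rare that term is across the corpus (inverse document frequency), and how long the document is. It is extremely robust for exact term matching, such as error codes, product models, serial numbers, and function names.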
- Vector Search (Dense Vector): Finds text chunks based on semantic proximity, even when the query and the documents use entirely different vocabularies. It easily maps "I can't log in" to "session expiration" or "OAuth failure."
By combining these two approaches, we create an optimized retriever capable of handling both precise keywords and flexible conceptual queries.
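
To make the sparse side concrete, here is a minimal sketch of BM25 scoring over a tiny in-memory corpus. The whitespace tokenizer, the `k1 = 1.2` and `b = 0.75` defaults, the function names, and the sample documents are all illustrative assumptions; a real search engine would handle tokenization and indexing for you.

```typescript
// Minimal BM25 sketch. Tokenizer, parameter defaults, and corpus are assumptions.
function tokenize(text: string): string[] {
  // Lowercase and split on whitespace so identifiers like "auth-client-init" stay whole.
  return text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const avgLen =
    tokenized.reduce((sum, d) => sum + d.length, 0) / Math.max(docs.length, 1);

  // Document frequency: the number of documents each term appears in.
  const df = new Map<string, number>();
  for (const doc of tokenized) {
    new Set(doc).forEach((term) => {
      df.set(term, (df.get(term) ?? 0) + 1);
    });
  }

  const queryTerms = tokenize(query);
  return tokenized.map((doc) => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue; // a term that is absent contributes nothing
      const n = df.get(term) ?? 0;
      // Rare terms receive a larger inverse document frequency weight.
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      // k1 saturates repeated terms; b normalizes for document length.
      const lengthNorm = 1 - b + b * (doc.length / avgLen);
      score += (idf * (tf * (k1 + 1))) / (tf + k1 * lengthNorm);
    }
    return score;
  });
}

const corpus = [
  "authentication best practices for web clients",
  "troubleshooting the auth-client-init error during startup",
  "client initialization syntax reference",
];
const scores = bm25Scores("auth-client-init", corpus);
console.log(scores);
```

Only the second document receives a non-zero score, because BM25 matches on the literal token. Note that a tokenizer which splits on hyphens would behave differently, so how identifiers are tokenized matters for this kind of query.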
2. Reranking: Isolating Critical Context
Once a retriever returns a batch of candidate documents, sending all of them directly to the LLM is counterproductive. Too much context leads to higher token costs, latency spikes, and the "Lost in the Middle" phenomenon, where LLMs fail to pay attention to information placed in the middle of long prompts.
This is where a Rerank model comes in.
A Reranker acts as a second-stage filter. It takes the top 20 or 30 documents returned by the hybrid search and performs a computationally heavier, highly precise evaluation comparing the raw query to each candidate document. It then outputs a reordered list, allowing us to pass only the top 3 to 5 most relevant documents to the LLM.
Integrating a Reranker explained why my previous RAG pipeline suffered from hallucinations. The context was cluttered with irrelevant information. Filtering out the noise was the key to unlocking accurate answers.
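
The "retrieve wide, rerank narrow" pattern can be sketched independently of any particular model. In the sketch below, `rerankTopN`, the `RelevanceScorer` type, and the word-overlap scorer are hypothetical names introduced for illustration; the toy scorer merely stands in for a real rerank model so the example runs on its own.

```typescript
// A relevance scorer: higher means more relevant. In a real pipeline this would
// call a rerank model; this signature is an assumption for illustration.
type RelevanceScorer = (query: string, document: string) => Promise<number>;

async function rerankTopN(
  query: string,
  candidates: string[],
  scorer: RelevanceScorer,
  topN: number = 3
): Promise<string[]> {
  // Score every (query, candidate) pair. This is the expensive step,
  // which is why it only runs on the retriever's shortlist.
  const scored = await Promise.all(
    candidates.map(async (document) => ({
      document,
      relevance: await scorer(query, document),
    }))
  );
  // Keep only the few highest-scoring candidates for the LLM prompt.
  return scored
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, topN)
    .map((item) => item.document);
}

// Toy stand-in scorer: fraction of query words found in the document.
const wordOverlapScorer: RelevanceScorer = async (query, document) => {
  const words = query.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  const text = document.toLowerCase();
  const hits = words.filter((w) => text.includes(w)).length;
  return words.length === 0 ? 0 : hits / words.length;
};

rerankTopN(
  "pg_dump timeout",
  ["How to configure pg_dump", "Intro to indexing", "Fixing a pg_dump timeout"],
  wordOverlapScorer,
  2
).then((top) => console.log(top));
```

Keeping the scorer as a parameter makes it easy to swap the toy function for a hosted rerank API or a locally served cross-encoder without touching the selection logic.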
Deep Dive: Technical Mechanics of Merging and Sorting
Below are the details of the mathematical and architectural mechanisms I used to implement hybrid search and reranking.
1. Merging Scores: Reciprocal Rank Fusion (RRF)
BM25 scores (typically unbounded positive numbers) and vector similarity scores (cosine similarity falls between -1 and 1, and many systems normalize it to a 0-to-1 range) live on entirely different scales. Adding them together directly is like adding apples to oranges.
The industry-standard solution for this is Reciprocal Rank Fusion (RRF). RRF ignores the raw scores entirely and computes a new score based solely on the rank of the document within each search system.
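$$\text{RRF\_Score}(d) = \sum_{m \in M} \frac{1}{k + r_m(d)}$$
Here, $M$ is the set of retrievers (in this case BM25 and vector search), $r_m(d)$ is the rank of document $d$ in retriever $m$ (starting at 1), and $k$ is a constant (typically set to 60) that prevents top ranks from overwhelmingly dominating the score. A document missing from a retriever's list simply receives no contribution from that retriever. Documents that rank consistently high across both lists are heavily prioritized.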
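
As a quick worked example with $k = 60$ (the ranks here are made up for illustration): suppose document $A$ ranks 1st in BM25 and 3rd in vector search, while document $B$ ranks 1st in vector search but does not appear in the BM25 results at all.

$$
\begin{aligned}
\text{RRF\_Score}(A) &= \frac{1}{60 + 1} + \frac{1}{60 + 3} \approx 0.0164 + 0.0159 = 0.0323 \\
\text{RRF\_Score}(B) &= \frac{1}{60 + 1} \approx 0.0164
\end{aligned}
$$

Document $A$ scores roughly twice as high as $B$ even though $B$ holds a first-place rank, which is the behavior we want: agreement between retrievers beats a single strong vote.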
2. Cross-Encoder Reranking
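Embedding models rely on a "Bi-Encoder" architecture, where queries and documents are converted into vectors independently and compared later. This is fast, because document vectors can be precomputed at index time, but the model never sees the query and the document together, so it cannot capture word-level interactions between them.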
Reranking models use a "Cross-Encoder" architecture, processing the query and the candidate document simultaneously through the self-attention layers of a transformer. Because it analyzes the mutual relationships between words directly, it offers far superior precision and re-orders the candidate list with high accuracy.
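
The bi-encoder side of this trade-off is easy to see in code: once the vectors exist, ranking is just arithmetic. The sketch below assumes hypothetical, hand-written 3-dimensional embeddings (real models produce hundreds or thousands of dimensions), and the document IDs are invented for illustration.

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  // Treat a zero vector as having no similarity rather than dividing by zero.
  return denominator === 0 ? 0 : dot / denominator;
}

// Hypothetical document embeddings, computed once at index time.
const documentVectors: Record<string, number[]> = {
  "login-guide": [0.9, 0.1, 0.0],
  "billing-faq": [0.0, 0.2, 0.9],
};

// Hypothetical query embedding, computed at query time.
const queryVector = [1, 0, 0];

// Ranking needs no further model calls, only vector comparisons.
const ranked = Object.keys(documentVectors)
  .map((id) => ({
    id,
    similarity: cosineSimilarity(queryVector, documentVectors[id]),
  }))
  .sort((x, y) => y.similarity - x.similarity);

console.log(ranked);
```

A cross-encoder has no equivalent shortcut: its score depends on the query and document together, so nothing can be precomputed and every candidate costs one model forward pass. That is why it is applied only to the shortlist.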
Application: Implementing the Pipeline in TypeScript
I wrote a TypeScript utility to handle RRF merging and call the Cohere Rerank API to sort candidates.
interface SearchResult {
  id: string;
  content: string;
  score: number;
}
// 1. Implementing RRF
// Both input lists are assumed to be sorted best-first by their retriever.
function reciprocalRankFusion(
  vectorResults: SearchResult[],
  bm25Results: SearchResult[],
  k: number = 60
): SearchResult[] {
  const rrfScores: Record<string, { doc: SearchResult; score: number }> = {};
  // Add each document's 1 / (k + rank) contribution from one ranked list.
  const applyRrf = (results: SearchResult[]) => {
    results.forEach((doc, index) => {
      const rank = index + 1; // ranks are 1-based
      const scoreContribution = 1 / (k + rank);
      if (!rrfScores[doc.id]) {
        rrfScores[doc.id] = { doc, score: 0 };
      }
      rrfScores[doc.id].score += scoreContribution;
    });
  };
  applyRrf(vectorResults);
  applyRrf(bm25Results);
  // Sort by fused score and replace each raw retriever score with the RRF score.
  return Object.values(rrfScores)
    .sort((a, b) => b.score - a.score)
    .map(item => ({
      ...item.doc,
      score: item.score
    }));
}
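// 2. Main Retrieval Pipeline
// Placeholders for your own retrievers, assumed to be implemented elsewhere
// and to return results sorted best-first.
declare function retrieveVectorSearch(query: string, limit: number): Promise<SearchResult[]>;
declare function retrieveBM25Search(query: string, limit: number): Promise<SearchResult[]>;
// The fields this code reads from the rerank response.
interface RerankResponse {
  results: { index: number; relevance_score: number }[];
}
async function searchKnowledgeBase(query: string): Promise<SearchResult[]> {
  // Step 1: Run retrievers in parallel
  const [vectorDocs, bm25Docs] = await Promise.all([
    retrieveVectorSearch(query, 20),
    retrieveBM25Search(query, 20)
  ]);
  // Step 2: Merge using RRF
  const hybridDocs = reciprocalRankFusion(vectorDocs, bm25Docs);
  // Nothing to rerank, so skip the API call
  if (hybridDocs.length === 0) {
    return [];
  }
  // Step 3: Rerank via Cohere API
  const documentsForRerank = hybridDocs.map(d => d.content);
  const response = await fetch("https://api.cohere.com/v1/rerank", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.COHERE_API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      query: query,
      documents: documentsForRerank,
      top_n: 3,
      model: "rerank-multilingual-v3.0"
    })
  });
  // Fail loudly on HTTP errors instead of reading a missing `results` field
  if (!response.ok) {
    throw new Error(`Rerank request failed with status ${response.status}`);
  }
  const rerankData = (await response.json()) as RerankResponse;
  // Each result's `index` points back into the `documents` array we sent
  const finalContext = rerankData.results.map(r => hybridDocs[r.index]);
  return finalContext;
}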
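
To see the fusion step behave as described without wiring up real retrievers, here is a self-contained toy run. It uses a simplified, hypothetical variant (`fuseRankings`) that works on bare document IDs and accepts any number of ranked lists, mirroring the sum over all retrievers in the RRF formula. The document IDs are invented for illustration.

```typescript
// Generalized RRF over any number of ranked ID lists (each sorted best-first).
function fuseRankings(
  rankings: string[][],
  k: number = 60
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const fused: Array<{ id: string; score: number }> = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  // Highest fused score first.
  return fused.sort((a, b) => b.score - a.score);
}

// Vector search buries the error guide in third place...
const vectorRanking = ["setup-guide", "auth-best-practices", "auth-client-init-error"];
// ...while BM25 puts it first thanks to the exact keyword match.
const bm25Ranking = ["auth-client-init-error", "changelog"];

const fusedRanking = fuseRankings([vectorRanking, bm25Ranking]);
console.log(fusedRanking.map((r) => r.id));
// The first entry is "auth-client-init-error", the only ID present in both lists.
```

Because the error guide collects a contribution from both lists (1/63 + 1/61), it overtakes documents that rank first in only one list (1/61).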
I re-ran the query for "auth-client-init error" using this pipeline:
- Before (Vector Only): Retrieved basic setup docs, while the specific error guide was buried deep down.
- After (Hybrid + Reranking): BM25 captured the exact auth-client-init keyword, pushing the correct troubleshooting guide into the candidate list. The Reranker then evaluated its direct relevance and successfully ranked it first.
Summary: RAG is 90% Retrieval and 10% Generation
A common pitfall when building RAG systems is assuming that upgrading to the latest LLM (like GPT-4o or Claude 3.5 Sonnet) will magically resolve all context errors. That is the equivalent of sending a student into an exam with the wrong textbook and expecting an A+.
The output of any RAG pipeline is only as good as the context injected into the prompt. Consequently, 90% of RAG engineering is spent refining data, indexing, and optimizing the Retrieval stage.
Supplementing vector search with BM25 retrieval and cross-encoder reranking is relatively cheap and straightforward to build, yet in my pipeline it dramatically reduced hallucinations. Even in the era of AI, success in engineering still comes down to how well we implement core data structures and search fundamentals.