Building a RAG Pipeline: Document Search with Vector DB + LLM
1. Prologue — Two Critical LLM Limitations
Ask ChatGPT or Claude about your company's internal documents. You'll get one of two responses:
"I don't have access to that information."
Or worse:
"Yes, here are the internal policies for your company: ..." (followed by confidently stated, completely fabricated information)
LLMs have two fundamental limitations:
1. Knowledge Cutoff: LLMs don't know anything that happened after their training data was collected. GPT-4o's training data runs through late 2023; it knows nothing published after that cutoff, and it has never seen your proprietary documentation at all.
2. Hallucination: Ask about something they don't know and they'll make up a plausible-sounding answer — confidently and without any sense of uncertainty. This is the main reason enterprises can't just plug raw LLMs into their workflows.
RAG (Retrieval-Augmented Generation) solves both problems simultaneously.
The idea is simple: when you ask the LLM a question, first retrieve relevant documents and hand them over as "reference material." Instead of generating answers from its weights alone, the LLM reasons over the provided documents.
2. Full RAG Pipeline Architecture
RAG breaks into two phases:
[Ingestion Pipeline — Offline]
Document collection → Chunking → Embedding → Vector DB storage
[Query Pipeline — Online]
User question → Query embedding → Vector DB search → LLM generation
Ingestion Pipeline (Offline)
Raw Documents (PDF, DOCX, Web, DB)
│
▼
[Document Loader] ← format-specific parsers
│
▼
[Text Splitter] ← chunking strategy (size, overlap)
│
▼
[Embedding Model] ← text-embedding-3-small, BGE, etc.
│
▼
[Vector Store] ← Pinecone, pgvector, Weaviate, Chroma
Query Pipeline (Online)
User Question
│
▼
[Embedding Model] ← vectorize the question
│
▼
[Vector Store] ← similarity search (Top-K)
│
▼
[Retrieved Chunks] ← relevant document fragments
│
▼
[Prompt Assembly] ← "Answer based on these documents: ..."
│
▼
[LLM] ← GPT-4, Claude, Gemini
│
▼
[Generated Answer] ← grounded, citable response
3. What Are Embeddings?
To understand RAG, you need to understand embeddings first.
An embedding converts text into an array of numbers (a vector). Text with similar meaning produces similar vectors.
# Conceptually
embed("cats are cute") → [0.2, 0.8, 0.1, ..., 0.4] # 1536 dimensions
embed("猫はかわいい") → [0.21, 0.79, 0.11, ..., 0.41] # "cats are cute" in Japanese: similar!
embed("investment risk notice") → [0.9, 0.1, 0.7, ..., 0.2] # different
Distance between vectors (cosine similarity, Euclidean distance) gives you a numerical measure of semantic similarity.
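To make "distance between vectors" concrete, here is a minimal cosine-similarity implementation. The vectors are toy 3-dimensional stand-ins for real 1536-dimensional embeddings, so the numbers are illustrative, not actual model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, ~0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cats_en = [0.2, 0.8, 0.1]    # toy stand-in for embed("cats are cute")
cats_ja = [0.21, 0.79, 0.11] # toy stand-in for the Japanese equivalent
finance = [0.9, 0.1, 0.7]    # toy stand-in for an unrelated topic

print(cosine_similarity(cats_en, cats_ja))  # close to 1.0
print(cosine_similarity(cats_en, finance))  # noticeably lower
```

This is exactly the `metric="cosine"` computation a vector DB runs at scale, just without the approximate-nearest-neighbor indexing that makes it fast over millions of vectors.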
Popular Embedding Models
| Model | Provider | Dimensions | Notes |
|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Great value, multilingual |
| text-embedding-3-large | OpenAI | 3072 | High quality, expensive |
| text-embedding-ada-002 | OpenAI | 1536 | Legacy, still widely used |
| BGE-M3 | BAAI | 1024 | Open source, best multilingual |
| E5-large | Microsoft | 1024 | Open source, solid performance |
| Gemini Embedding | Google | 768 | For Gemini ecosystem |
For non-English documents, BGE-M3 is hard to beat, with strong performance across 100+ languages.
4. Chunking Strategies
Before storing documents in a vector DB, you need to split them into appropriately sized pieces — chunking.
Why chunk? LLMs have limited context windows, and packing in too much irrelevant content degrades retrieval quality.
Chunking Strategy Comparison
1. Fixed Size Chunking (simplest)
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # characters by default (pass a token counter via length_function for tokens)
chunk_overlap=50, # overlap between chunks (context continuity)
separators=["\n\n", "\n", ".", " "] # split priority order
)
Fast and simple. Can cut in the middle of sentences.
2. Semantic Chunking (meaning-aware)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(
OpenAIEmbeddings(),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95,
)
Higher quality but incurs embedding API costs.
3. Document-Aware Chunking (structure-aware)
Respects document structure — markdown headings, HTML tags, PDF sections.
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=headers_to_split_on
)
Chunk Size Guide
| Use Case | Recommended Size | Reason |
|---|---|---|
| FAQ / short Q&A | 100–200 tokens | Short question-answer pairs |
| Technical docs | 300–500 tokens | Section-level understanding |
| Legal / contracts | 500–800 tokens | Context matters a lot |
| Code | Function/class unit | Natural logical boundaries |
A good rule of thumb is an overlap of 10–20% of chunk size to preserve context across chunk boundaries.
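The size/overlap interplay is easy to see in a few lines. A character-based sketch (real pipelines usually count tokens, e.g. with tiktoken, rather than characters):

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks; each chunk repeats the last
    `overlap` characters of the previous one for context continuity."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # avoid emitting a duplicate tail chunk
        start = end - overlap
    return chunks

chunks = chunk_text("abcdefghij" * 10, chunk_size=50, overlap=10)  # 100 chars
print(len(chunks))                        # 3 chunks: [0:50], [40:90], [80:100]
print(chunks[0][-10:] == chunks[1][:10])  # True: the 10-char overlap
```

With `chunk_size=50` and `overlap=10`, the overlap is 20% of the chunk — the top of the recommended range.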
5. Choosing a Vector Database
Pinecone
Managed service. Easy setup, reliable. Great for fast starts at startups.
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="your-api-key")
pc.create_index(
name="documents",
dimension=1536,
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("documents")
index.upsert(vectors=[
{
"id": "doc-1-chunk-0",
"values": embedding_vector,
"metadata": {
"text": "chunk content",
"source": "document.pdf",
"page": 1
}
}
])
results = index.query(
vector=query_embedding,
top_k=5,
include_metadata=True
)
pgvector (PostgreSQL Extension)
The most pragmatic choice if you're already on PostgreSQL. No extra infrastructure.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_id UUID REFERENCES documents(id),
content TEXT NOT NULL,
embedding VECTOR(1536),
metadata JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
-- HNSW index for fast approximate nearest neighbor search
CREATE INDEX ON document_chunks
USING hnsw (embedding vector_cosine_ops);
-- Similarity search
SELECT
id,
content,
metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM document_chunks
ORDER BY embedding <=> $1::vector
LIMIT 5;
Chroma
Best for local development and prototyping. In-memory or local file storage. Install: pip install chromadb.
import chromadb
from chromadb.utils import embedding_functions
client = chromadb.Client() # in-memory (dev)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
api_key="your-api-key",
model_name="text-embedding-3-small"
)
collection = client.create_collection(
name="documents",
embedding_function=openai_ef
)
# Auto-generates embeddings
collection.add(
ids=["chunk-1", "chunk-2"],
documents=["chunk 1 content", "chunk 2 content"],
metadatas=[{"source": "doc.pdf"}, {"source": "doc.pdf"}]
)
results = collection.query(
query_texts=["user question"],
n_results=5
)
6. Complete RAG Pipeline
TypeScript Version (Vercel AI SDK + pgvector)
import { openai } from '@ai-sdk/openai';
import { embed, generateText } from 'ai';
import { createClient } from '@supabase/supabase-js';
const supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_SERVICE_ROLE_KEY!
);
// Ingestion: save document chunks to DB
async function ingestDocument(
content: string,
metadata: Record<string, unknown>
): Promise<void> {
const chunks = splitIntoChunks(content, 500, 50);
for (const chunk of chunks) {
const { embedding } = await embed({
model: openai.embedding('text-embedding-3-small'),
value: chunk,
});
await supabase.from('document_chunks').insert({
content: chunk,
embedding: JSON.stringify(embedding),
metadata,
});
}
}
// Retrieval: find similar chunks
async function retrieveRelevantChunks(
query: string,
limit = 5
): Promise<Array<{ content: string; metadata: unknown; similarity: number }>> {
const { embedding: queryEmbedding } = await embed({
model: openai.embedding('text-embedding-3-small'),
value: query,
});
  // match_document_chunks is a Postgres function (defined separately in SQL)
  // that orders rows by cosine distance and filters by a similarity threshold
  const { data, error } = await supabase.rpc('match_document_chunks', {
query_embedding: queryEmbedding,
match_threshold: 0.7,
match_count: limit,
});
if (error) throw error;
return data;
}
// RAG: retrieve + generate
async function answerQuestion(question: string): Promise<string> {
const chunks = await retrieveRelevantChunks(question);
if (chunks.length === 0) {
return "No relevant documents found.";
}
const context = chunks
.map((c, i) => `[Document ${i + 1}]\n${c.content}`)
.join('\n\n');
const { text } = await generateText({
model: openai('gpt-4o-mini'),
system: `You are an assistant that answers questions based solely on the provided documents.
Never speculate or add information not present in the documents.`,
prompt: `[Reference Documents]\n${context}\n\n[Question]\n${question}`,
});
return text;
}
function splitIntoChunks(text: string, chunkSize: number, overlap: number): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break; // avoid emitting a duplicate tail chunk
    start += chunkSize - overlap;
  }
  return chunks;
}
7. Improving Retrieval Quality
Hybrid Search
Combine vector search (semantic) + keyword search (BM25). BM25 excels at exact term matching; vector search handles conceptual queries.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# 60% vector, 40% keyword
ensemble_retriever = EnsembleRetriever(
retrievers=[vector_retriever, bm25_retriever],
weights=[0.6, 0.4]
)
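Under the hood, the ensemble step is rank fusion: LangChain's EnsembleRetriever merges the two ranked lists with weighted Reciprocal Rank Fusion. A minimal sketch of that fusion (the document IDs and rankings here are illustrative):

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs: each list contributes
    weight / (k + rank) per document, then sort by total score."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]  # semantic ranking
bm25_hits = ["doc1", "doc9", "doc3"]    # keyword ranking
print(weighted_rrf([vector_hits, bm25_hits], weights=[0.6, 0.4]))
# doc1 and doc3 appear in both lists, so they rise to the top
```

The constant `k` (60 is the value from the original RRF paper) dampens the gap between adjacent ranks, so a document that appears in both lists beats one that ranks first in only one.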
Reranking
Rerank vector search results using a Cross-Encoder. More accurate than initial retrieval, but slower.
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
compressor = CohereRerank(model="rerank-multilingual-v3.0")
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)
# Retrieve 20, rerank, return top 5
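The retrieve-wide-then-rerank pattern itself is simple. Here is a sketch with a stand-in scoring function in the slot where a real cross-encoder (Cohere's rerank API, or a sentence-transformers CrossEncoder) would jointly read each (query, chunk) pair:

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

# Stand-in scorer: word overlap. A real cross-encoder reads query and chunk
# together through a transformer and outputs a learned relevance score.
def overlap_score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

chunks = ["refund policy lasts 14 days", "shipping takes 3 days", "refund requires receipt"]
print(rerank("what is the refund policy", chunks, overlap_score, top_n=2))
```

The expensive part is that the scorer runs once per candidate, which is why you rerank 20 retrieved chunks rather than the whole corpus.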
8. Evaluation Metrics
Use the RAGAS framework to automatically evaluate pipeline quality across four dimensions:
| Metric | Meaning | Goal |
|---|---|---|
| Faithfulness | Is the answer grounded in retrieved docs? | Higher is better |
| Answer Relevancy | Does the answer address the question? | Higher is better |
| Context Recall | Were the right docs retrieved? | Higher is better |
| Context Precision | Are retrieved docs actually useful? | Higher is better |
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset  # ragas expects a Hugging Face Dataset, not a plain dict
dataset = Dataset.from_dict({
    "question": ["What is the refund policy?"],
    "answer": ["Within 14 days"],
    "contexts": [["Refunds accepted within 14 days of purchase..."]],
    "ground_truth": ["Refunds are available within 14 days of purchase"]
})
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(result)
9. Conclusion
RAG is currently the most practical solution to LLM knowledge cutoff and hallucination problems.
Pipeline summary:
- Document collection: PDF, DOCX, web crawl, database exports
- Chunking: 300–500 token chunks, 10–20% overlap
- Embedding: BGE-M3 for multilingual, text-embedding-3-small for cost efficiency
- Vector storage: Chroma for prototyping, pgvector or Pinecone for production
- Retrieval: Cosine similarity as baseline, hybrid + reranking for quality improvement
- Generation: System prompt must explicitly constrain the LLM to document-grounded answers
The most common mistake is ignoring retrieval quality. No matter how capable the LLM, wrong documents in means wrong answers out. Garbage In, Garbage Out applies here too. Invest time in chunking strategy and retrieval quality — that's where most RAG performance gains come from.