Building a RAG Pipeline: Document Search with Vector DB + LLM
1. Prologue — Two Critical LLM Limitations
Ask ChatGPT or Claude about your company's internal documents. You'll get one of two responses:
"I don't have access to that information."
Or worse:
"Yes, here are the internal policies for your company: ..." (followed by confidently stated, completely fabricated information)
LLMs have two fundamental limitations:
1. Knowledge Cutoff: LLMs don't know anything that happened after their training data was collected. GPT-4o's training data runs through late 2023; it knows nothing published after that cutoff, and it has never seen your proprietary documentation at all.
2. Hallucination: Ask about something they don't know and they'll make up a plausible-sounding answer — confidently and without any sense of uncertainty. This is the main reason enterprises can't just plug raw LLMs into their workflows.
RAG (Retrieval-Augmented Generation) solves both problems simultaneously.
The idea is simple: when you ask the LLM a question, first retrieve relevant documents and hand them over as "reference material." Instead of generating answers from its weights alone, the LLM reasons over the provided documents.
2. Full RAG Pipeline Architecture
RAG breaks into two phases:
[Ingestion Pipeline — Offline]
Document collection → Chunking → Embedding → Vector DB storage
[Query Pipeline — Online]
User question → Query embedding → Vector DB search → LLM generation
Ingestion Pipeline (Offline)
Raw Documents (PDF, DOCX, Web, DB)
│
▼
[Document Loader] ← format-specific parsers
│
▼
[Text Splitter] ← chunking strategy (size, overlap)
│
▼
[Embedding Model] ← text-embedding-3-small, BGE, etc.
│
▼
[Vector Store] ← Pinecone, pgvector, Weaviate, Chroma
Query Pipeline (Online)
User Question
│
▼
[Embedding Model] ← vectorize the question
│
▼
[Vector Store] ← similarity search (Top-K)
│
▼
[Retrieved Chunks] ← relevant document fragments
│
▼
[Prompt Assembly] ← "Answer based on these documents: ..."
│
▼
[LLM] ← GPT-4, Claude, Gemini
│
▼
[Generated Answer] ← grounded, citable response
3. What Are Embeddings?
To understand RAG, you need to understand embeddings first.
An embedding converts text into an array of numbers (a vector). Text with similar meaning produces similar vectors.
# Conceptually
embed("cats are cute") → [0.2, 0.8, 0.1, ..., 0.4] # 1536 dimensions
embed("猫はかわいい") → [0.21, 0.79, 0.11, ..., 0.41] # "cats are cute" in Japanese: similar!
embed("investment risk notice") → [0.9, 0.1, 0.7, ..., 0.2] # different
Distance between vectors (cosine similarity, Euclidean distance) gives you a numerical measure of semantic similarity.
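To make "distance between vectors" concrete, here is a minimal cosine-similarity implementation. The vectors are toy 3-dimensional stand-ins for real 1536-dimensional embeddings, so the numbers are illustrative, not actual model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, ~0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cats_en = [0.2, 0.8, 0.1]    # toy stand-in for embed("cats are cute")
cats_ja = [0.21, 0.79, 0.11] # toy stand-in for the Japanese equivalent
finance = [0.9, 0.1, 0.7]    # toy stand-in for an unrelated topic

print(cosine_similarity(cats_en, cats_ja))  # close to 1.0
print(cosine_similarity(cats_en, finance))  # noticeably lower
```

This is exactly the `metric="cosine"` computation a vector DB runs at scale, just without the approximate-nearest-neighbor indexing that makes it fast over millions of vectors.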
Popular Embedding Models
| Model | Provider | Dimensions | Notes |
|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | Great value, multilingual |
| text-embedding-3-large | OpenAI | 3072 | High quality, expensive |
| text-embedding-ada-002 | OpenAI | 1536 | Legacy, still widely used |
| BGE-M3 | BAAI | 1024 | Open source, best multilingual |
| E5-large | Microsoft | 1024 | Open source, solid performance |
| Gemini Embedding | Google | 768 | For Gemini ecosystem |
For non-English documents, BGE-M3 is hard to beat, with strong performance across 100+ languages.
4. Chunking Strategies
Before storing documents in a vector DB, you need to split them into appropriately sized pieces — chunking.
Why chunk? LLMs have limited context windows, and packing in too much irrelevant content degrades retrieval quality.
Chunking Strategy Comparison
1. Fixed Size Chunking (simplest)
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # characters by default (pass a token counter via length_function for tokens)
chunk_overlap=50, # overlap between chunks (context continuity)
separators=["\n\n", "\n", ".", " "] # split priority order
)
Fast and simple. Can cut in the middle of sentences.
2. Semantic Chunking (meaning-aware)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(
OpenAIEmbeddings(),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95,
)
Higher quality but incurs embedding API costs.
3. Document-Aware Chunking (structure-aware)
Respects document structure — markdown headings, HTML tags, PDF sections.
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=headers_to_split_on
)
Chunk Size Guide
| Use Case | Recommended Size | Reason |
|---|---|---|
| FAQ / short Q&A | 100–200 tokens | Short question-answer pairs |
| Technical docs | 300–500 tokens | Section-level understanding |
| Legal / contracts | 500–800 tokens | Context matters a lot |
| Code | Function/class unit | Natural logical boundaries |
A good rule of thumb is an overlap of 10–20% of chunk size to preserve context across chunk boundaries.
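The size/overlap interplay is easy to see in a few lines. A character-based sketch (real pipelines usually count tokens, e.g. with tiktoken, rather than characters):

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks; each chunk repeats the last
    `overlap` characters of the previous one for context continuity."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # avoid emitting a duplicate tail chunk
        start = end - overlap
    return chunks

chunks = chunk_text("abcdefghij" * 10, chunk_size=50, overlap=10)  # 100 chars
print(len(chunks))                        # 3 chunks: [0:50], [40:90], [80:100]
print(chunks[0][-10:] == chunks[1][:10])  # True: the 10-char overlap
```

With `chunk_size=50` and `overlap=10`, the overlap is 20% of the chunk — the top of the recommended range.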
5. Choosing a Vector Database
Pinecone
Managed service. Easy setup, reliable. Great for fast starts at startups.
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="your-api-key")
pc.create_index(
name="documents",
dimension=1536,
metric="cosine",
spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("documents")
index.upsert(vectors=[
{
"id": "doc-1-chunk-0",
"values": embedding_vector,
"metadata": {
"text": "chunk content",
"source": "document.pdf",
"page": 1
}
}
])
results = index.query(
vector=query_embedding,
top_k=5,
include_metadata=True
)
pgvector (PostgreSQL Extension)
The most pragmatic choice if you're already on PostgreSQL. No extra infrastructure.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_id UUID REFERENCES documents(id),
content TEXT NOT NULL,
embedding VECTOR(1536),
metadata JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
-- HNSW index for fast approximate nearest neighbor search
CREATE INDEX ON document_chunks
USING hnsw (embedding vector_cosine_ops);
-- Similarity search
SELECT
id,
content,
metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM document_chunks
ORDER BY embedding <=> $1::vector
LIMIT 5;
Chroma
Best for local development and prototyping. In-memory or local file storage. Install: pip install chromadb.
import chromadb
from chromadb.utils import embedding_functions
client = chromadb.Client() # in-memory (dev)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
api_key="your-api-key",
model_name="text-embedding-3-small"
)
collection = client.create_collection(
name="documents",
embedding_function=openai_ef
)
# Auto-generates embeddings
collection.add(
ids=["chunk-1", "chunk-2"],
documents=["chunk 1 content", "chunk 2 content"],
metadatas=[{"source": "doc.pdf"}, {"source": "doc.pdf"}]
)
results = collection.query(
query_texts=["user question"],
n_results=5
)
6. Complete RAG Pipeline
TypeScript Version (Vercel AI SDK + pgvector)
import { openai } from '@ai-sdk/openai';
import { embed, generateText } from 'ai';
import { createClient } from '@supabase/supabase-js';
const supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_SERVICE_ROLE_KEY!
);
// Ingestion: save document chunks to DB
async function ingestDocument(
content: string,
metadata: Record<string, unknown>
): Promise<void> {
const chunks = splitIntoChunks(content, 500, 50);
for (const chunk of chunks) {
const { embedding } = await embed({
model: openai.embedding('text-embedding-3-small'),
value: chunk,
});
await supabase.from('document_chunks').insert({
content: chunk,
embedding: JSON.stringify(embedding),
metadata,
});
}
}
// Retrieval: find similar chunks
async function retrieveRelevantChunks(
query: string,
limit = 5
): Promise<Array<{ content: string; metadata: unknown; similarity: number }>> {
const { embedding: queryEmbedding } = await embed({
model: openai.embedding('text-embedding-3-small'),
value: query,
});
  // match_document_chunks is a Postgres function (defined separately in SQL)
  // that orders rows by cosine distance and filters by a similarity threshold
  const { data, error } = await supabase.rpc('match_document_chunks', {
query_embedding: queryEmbedding,
match_threshold: 0.7,
match_count: limit,
});
if (error) throw error;
return data;
}
// RAG: retrieve + generate
async function answerQuestion(question: string): Promise<string> {
const chunks = await retrieveRelevantChunks(question);
if (chunks.length === 0) {
return "No relevant documents found.";
}
const context = chunks
.map((c, i) => `[Document ${i + 1}]\n${c.content}`)
.join('\n\n');
const { text } = await generateText({
model: openai('gpt-4o-mini'),
system: `You are an assistant that answers questions based solely on the provided documents.
Never speculate or add information not present in the documents.`,
prompt: `[Reference Documents]\n${context}\n\n[Question]\n${question}`,
});
return text;
}
function splitIntoChunks(text: string, chunkSize: number, overlap: number): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break; // avoid emitting a duplicate tail chunk
    start += chunkSize - overlap;
  }
  return chunks;
}
7. Improving Retrieval Quality
Hybrid Search
Combine vector search (semantic) + keyword search (BM25). BM25 excels at exact term matching; vector search handles conceptual queries.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# 60% vector, 40% keyword
ensemble_retriever = EnsembleRetriever(
retrievers=[vector_retriever, bm25_retriever],
weights=[0.6, 0.4]
)
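Under the hood, the ensemble step is rank fusion: LangChain's EnsembleRetriever merges the two ranked lists with weighted Reciprocal Rank Fusion. A minimal sketch of that fusion (the document IDs and rankings here are illustrative):

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs: each list contributes
    weight / (k + rank) per document, then sort by total score."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]  # semantic ranking
bm25_hits = ["doc1", "doc9", "doc3"]    # keyword ranking
print(weighted_rrf([vector_hits, bm25_hits], weights=[0.6, 0.4]))
# doc1 and doc3 appear in both lists, so they rise to the top
```

The constant `k` (60 is the value from the original RRF paper) dampens the gap between adjacent ranks, so a document that appears in both lists beats one that ranks first in only one.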
Reranking
Rerank vector search results using a Cross-Encoder. More accurate than initial retrieval, but slower.
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
compressor = CohereRerank(model="rerank-multilingual-v3.0")
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)
# Retrieve 20, rerank, return top 5
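The retrieve-wide-then-rerank pattern itself is simple. Here is a sketch with a stand-in scoring function in the slot where a real cross-encoder (Cohere's rerank API, or a sentence-transformers CrossEncoder) would jointly read each (query, chunk) pair:

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

# Stand-in scorer: word overlap. A real cross-encoder reads query and chunk
# together through a transformer and outputs a learned relevance score.
def overlap_score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

chunks = ["refund policy lasts 14 days", "shipping takes 3 days", "refund requires receipt"]
print(rerank("what is the refund policy", chunks, overlap_score, top_n=2))
```

The expensive part is that the scorer runs once per candidate, which is why you rerank 20 retrieved chunks rather than the whole corpus.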
8. Evaluation Metrics
Use the RAGAS framework to automatically evaluate pipeline quality across four dimensions:
| Metric | Meaning | Goal |
|---|---|---|
| Faithfulness | Is the answer grounded in retrieved docs? | Higher is better |
| Answer Relevancy | Does the answer address the question? | Higher is better |
| Context Recall | Were the right docs retrieved? | Higher is better |
| Context Precision | Are retrieved docs actually useful? | Higher is better |
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset  # ragas expects a Hugging Face Dataset, not a plain dict
dataset = Dataset.from_dict({
    "question": ["What is the refund policy?"],
    "answer": ["Within 14 days"],
    "contexts": [["Refunds accepted within 14 days of purchase..."]],
    "ground_truth": ["Refunds are available within 14 days of purchase"]
})
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(result)
9. Conclusion
RAG is currently the most practical solution to LLM knowledge cutoff and hallucination problems.
Pipeline summary:
- Document collection: PDF, DOCX, web crawl, database exports
- Chunking: 300–500 token chunks, 10–20% overlap
- Embedding: BGE-M3 for multilingual, text-embedding-3-small for cost efficiency
- Vector storage: Chroma for prototyping, pgvector or Pinecone for production
- Retrieval: Cosine similarity as baseline, hybrid + reranking for quality improvement
- Generation: System prompt must explicitly constrain the LLM to document-grounded answers
The most common mistake is ignoring retrieval quality. No matter how capable the LLM, wrong documents in means wrong answers out. Garbage In, Garbage Out applies here too. Invest time in chunking strategy and retrieval quality — that's where most RAG performance gains come from.