My AI Chatbot Stopped Lying (A RAG Adoption Story)

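1. The AI Lied to My Customer

Running a SaaS product, I was drowning in customer support (CS) work. The same simple questions came in every day: "How do I change my password?", "What is the refund policy?" So I thought, "Let's build a chatbot with GPT, since everyone is using it!" I confidently hooked up the OpenAI API and deployed it. Setup took less than 30 minutes.

Within an hour of deploying, a customer inquiry came in. Customer: "How do I reset my password?" My chatbot: "Go to Settings > Security > Change Password."

The problem was that our service had no 'Security' menu. The AI was lying (hallucinating) with complete confidence. It was so good at making up plausible sentences that even I nearly fell for it, thinking, "Wait, did we have that menu?" The customer got angry ("There's no such menu!"), and I hurriedly took the chatbot down.

That's when it hit me: "An LLM is not a knowledge search engine; it's a plausible-sentence generator." What I had built wasn't a chatbot but a 'friendly liar'.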

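2. The Fix: Don't Make It Memorize, Let It Peek (RAG)

At first I planned to fine-tune the LLM. Couldn't I just retrain it on all of our manual data? But when I priced it out, it was too expensive, and the bigger problem was that I would have to retrain every time the information changed. The refund policy changes? Spend millions of won on training again? No way.

The answer was RAG (Retrieval-Augmented Generation). When I first came across the concept, the analogy that clicked for me was the "open-book exam."

Plain LLM (Fine-tuning): Takes the exam after memorizing the textbook. When it can't remember, it makes up something plausible (hallucination). When new material is added, it needs a brain swap (retraining).
RAG: Takes the exam with the textbook open on the desk. When a question comes up, it finds the relevant page (Retrieval), reads it, and answers (Generation). When the content changes, you only swap the textbook page.

This "textbook" is our service's manual documents, and "finding the page" is retrieval.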

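3. Implementation: Turning Text into Numbers (Vector Embedding)

But a computer doesn't know that "password" and "passcode" mean the same thing. With plain keyword search (a LIKE query), searching for "change passcode" won't find documents that only say "password." So the text has to be converted into numeric vectors. This is called embedding.

# Converting 'apple' into a vector looks roughly like this
[0.1, 0.5, -0.3, 0.9, ...] (1536 dimensions)

This vector space feels like magic. Words with similar meanings sit close to each other in it. In a famous example, subtracting the Man vector from the King vector and adding the Woman vector lands closest to the Queen vector. In other words, the computer starts to capture the 'meaning' of words.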

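3.1. Splitting the Text (Chunking Strategies)

To vectorize documents you first have to cut them into small pieces. This is called chunking. At first I simply cut every 500 characters. This disaster followed:

Chunk 1: "...Father goes into the"
Chunk 2: "room..."

The context was cut off, so search didn't work properly. I needed a smarter strategy.

Fixed size (Fixed Size): Cut every 500 characters, but let neighboring chunks overlap by 50 characters. This is the easiest effective way to preserve context.
Semantic: Cut at paragraph boundaries or where the topic changes. An NLP model finds the points where the semantic similarity between sentences drops, and the text is cut there.
Recursive: Cut hierarchically, paragraph -> sentence -> word. LangChain's RecursiveCharacterTextSplitter is the best-known example. This is what I used.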

3.2. Metadata Filtering

Don't store only the vectors; store tags alongside them. {"content": "...", "metadata": {"category": "billing", "year": "2024"}}

When a user says "Tell me the 2024 refund policy," filter by year=2024 first, before the vector search. Narrowing the search space can raise accuracy considerably. This is called pre-filtering.

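4. Search: A Conversation with the Vector DB

When a user asks a question, this is what happens internally.

Question: "forgot my pw"
Question embedding: The question text goes into an embedding model (OpenAI text-embedding-3-small, etc.) and becomes a vector like [0.2, 0.4, ...].
Similarity search (Cosine Similarity): The vector DB (Pinecone, Weaviate, etc.) returns the 3 document chunks closest to the question vector.
- Doc A: "Send password reset link" (similarity 90%)
- Doc B: "Fix login errors" (similarity 85%)
Prompt assembly: The retrieved documents are handed to the LLM as a "cheat sheet."

[Instructions]
You are a support agent. Answer using only the [Reference Documents] below. If you don't know, say you don't know. (Don't make things up!)

[Reference Documents]
... (contents of Doc A and B fetched from the DB)

[Question]
forgot my pw

Answer: "If you'd like to reset your password, we can send a link to the email address you signed up with." (Accurate!)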

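5. The Wall of Reality: You Need "Keywords" (Hybrid Search)

I adopted RAG and cheered, but soon hit a wall. Vector search is good at finding "meaning," but it can be weak on "exact words."

For example, when a user searches for the error code "ERR-503", vector search returns semantically similar documents about "server error" or "connection failure." But what we need is the document that contains the exact term "ERR-503".

So I adopted hybrid search.

Vector search (Dense Retrieval): Captures meaning ("I can't log in" -> finds the login-problem doc)
Keyword search (Sparse Retrieval / BM25): Exact term matching ("ERR-503" -> finds the doc containing ERR-503)

Blending the two result lists (an ensemble) and ranking them raised accuracy considerably. The lists are merged with the Reciprocal Rank Fusion (RRF) algorithm.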

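5.1. Reranking: Picking the Final Winner

The retrieved candidate documents are inspected once more to check whether they are truly relevant to the user's question. This step uses a Cross-Encoder model (BGE-Reranker, etc.).

Bi-Encoder (the vector search so far): Embeds the question and the document separately and only measures the distance. (Fast, medium accuracy)
Cross-Encoder (reranking): Takes the question and the document as a pair and directly scores "how related are these two?" (Slow, highest accuracy)

The standard playbook is "2-Stage Retrieval": pull 100 candidates quickly with vector search, then have the reranker order them and keep the top 10.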

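6. The Speed War: Indexing (HNSW)

Once the data grows past a million items, comparing them one by one (linear search) is too slow. That's why you need vector indexing.

A representative algorithm is HNSW (Hierarchical Navigable Small World). The name is intimidating, but the idea is "highways and local roads."

Layer 2 (highway): Only far-apart data points are connected, sparsely.
Layer 1 (regional road): Connects regions.
Layer 0 (local road): All data points are densely connected.

A search rides the highway (Layer 2) to get near the destination quickly, then drops down to the local roads (Layer 0) to find the house. This is what makes millisecond-scale search possible even over hundreds of millions of items.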

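7. Safety Net: "If You Don't Know, Say So"

Using RAG doesn't make hallucination disappear 100%. If the DB returns the wrong documents (garbage in), the AI trusts them and gives the wrong answer (garbage out).

So setting a threshold is essential.

# If the best similarity is below 0.7, treat it as "no relevant document"
if max_similarity < 0.7:
    return "Sorry, I couldn't find any relevant information. Would you like me to connect you to an agent?"

Forcing citations in the prompt also cuts down on lies considerably. If you instruct it to "state the source at the end of each answer sentence, like [Doc ID: 12]," the AI is less inclined to make up statements that have no basis.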

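8. Wrapping Up: Don't Lend the AI a Brain, Hand It a Book

After adopting RAG, 99% of the hallucinations (lies) disappeared. Now, when a question it can't answer comes up, the chatbot knows how to say, "Sorry, that isn't in the manual." That is a hundred times better than a lie.

If you're preparing an AI service, keep this in mind. Rather than making the model smarter (fine-tuning), retrieving smartly and setting the table for it (RAG) is a far more cost-effective and reliable strategy.

After all, our goal is not a 'genius AI' but a 'trustworthy service'.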

How I Stopped My AI from Lying (RAG Implementation)

1. My AI Lied to My Customers

As support work piled up while running my SaaS, I decided to automate Customer Support (CS). "Let's build a chatbot with the trendy GPT!" I thought. Feeling confident, I hooked up the OpenAI API and deployed it. The setup took less than 30 minutes.

However, within an hour of deployment, a ticket came in. Customer: "How do I reset my password?" My Chatbot: "Go to Settings > Security > Change Password."

The problem? We didn't have a 'Security' menu. The AI was lying (Hallucinating) with absolute confidence. It was so skilled at generating plausible sentences that even I almost fell for it, thinking, "Wait, did we add that menu?" The customer, understandably, got angry, saying, "There is no such menu!" and I had to hurriedly take the chatbot offline.

That's when I realized. "LLMs are not knowledge engines; they are plausible sentence generators." I hadn't built a chatbot; I had built a 'Polite Liar'.

2. Solution: Don't Memorize, Let It Cheat (RAG)

Initially, I considered fine-tuning. Why not feed all our manuals into the model and retrain it? But the cost was prohibitive, and more importantly, I'd have to retrain every time the information changed. If our refund policy changed? Spend thousands again? Ridiculous.

The answer was RAG (Retrieval-Augmented Generation). When I first encountered this concept, the best analogy was an "Open-Book Exam."

Standard LLM (Fine-tuning): Memorizes the textbook for the exam. If it forgets, it invents facts (Hallucination). If the textbook content changes, you need a brain transplant (Re-training).
RAG: Keeps the textbook open on the desk. When a question comes up, it looks up the relevant page (Retrieval), reads it, and answers (Generation). If content changes, just swap the page.

This "textbook" is your internal documentation, and the "looking up" is Retrieval.

3. Implementation: Text to Numbers (Vector Embedding)

But computers don't know that "password" and "secret code" are related. A simple keyword search (LIKE query) for "secret code" won't find a document that only says "password." We need to convert text into numeric vectors. This is Embedding.

# 'Apple' as a vector looks something like this
[0.1, 0.5, -0.3, 0.9, ...] (1536 dimensions)

This vector space is magical. Semantically similar words sit close together in it. Famous example: Vector('King') - Vector('Man') + Vector('Woman') ≈ Vector('Queen'). Essentially, the computer starts to capture 'Meaning'.
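
To make "close together" concrete, here is a minimal sketch of cosine similarity, the closeness measure used later in this post. The three-dimensional vectors are hand-picked, hypothetical values for illustration only; a real embedding model would return vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model).
toy_vectors = {
    "password": [0.9, 0.1, 0.0],
    "secret code": [0.8, 0.2, 0.1],
    "refund": [0.0, 0.2, 0.9],
}

query = toy_vectors["secret code"]
for word in ("password", "refund"):
    # A higher score means the two vectors point in a more similar direction.
    print(word, round(cosine_similarity(query, toy_vectors[word]), 3))
```

With these toy values, "password" scores much higher than "refund" against "secret code", which is exactly the behavior keyword search cannot give you.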

3.1. Chunking Strategies

To vectorize documents, you first need to chop them up. This is Chunking. I started by simply cutting every 500 characters. Disaster ensued.

Chunk 1: "...Father goes into the"
Chunk 2: "room..."

Context was severed, and search failed. I needed smarter strategies.

Fixed Size with Overlap: 500-character chunks with a 50-character overlap between neighbors. The overlap preserves context across chunk boundaries. Simple and effective.
Semantic Chunking: Split when the topic changes. Uses NLP models to detect semantic shifts between sentences.
Recursive Splitting: Split by paragraphs -> sentences -> words. LangChain's RecursiveCharacterTextSplitter does this beautifully. I went with this.
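
The first strategy in the list is simple enough to sketch in plain Python. This is a minimal illustration, not the splitter I actually used (that was LangChain's RecursiveCharacterTextSplitter); the function name `chunk_with_overlap` is made up for this example.

```python
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks where neighbors share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks

# Small numbers make the overlap easy to see.
print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk starts with the last character of the previous one, so a sentence cut at a boundary still appears whole in at least part of a neighboring chunk when the overlap is large enough.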

3.2. Metadata Filtering

Don't just store vectors. Store tags. {"content": "...", "metadata": {"category": "billing", "year": "2024"}}

If a user asks "2024 Refund Policy", filter by year=2024 BEFORE vector search. Narrowing the search space drastically improves accuracy. This is called Pre-filtering.
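
Here is a rough in-memory sketch of pre-filtering, assuming each record carries a `vector` next to the `content` and `metadata` fields shown above. The `search` helper and the two-dimensional vectors are hypothetical; a real vector DB exposes its own filter syntax, which this does not try to imitate.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(records, query_vector, filters, top_k=3):
    """Keep only records whose metadata matches every filter, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return ranked[:top_k]

# Toy records with made-up vectors.
records = [
    {"content": "2023 refund policy", "metadata": {"category": "billing", "year": "2023"}, "vector": [1.0, 0.0]},
    {"content": "2024 refund policy", "metadata": {"category": "billing", "year": "2024"}, "vector": [0.9, 0.1]},
    {"content": "2024 login guide", "metadata": {"category": "account", "year": "2024"}, "vector": [0.1, 0.9]},
]

results = search(records, query_vector=[1.0, 0.0], filters={"year": "2024"})
print([r["content"] for r in results])
```

Note that the 2023 document is the closest vector to the query, yet it never reaches the ranking step because the filter removes it first.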

4. Search: Talking to the Vector DB

When a user asks a question, here's what happens internally:

Question: "Forgot pwd"
Question Embedding: Convert text to vector [0.2, 0.4, ...] using an embedding model (like OpenAI text-embedding-3-small).
Similarity Search (Cosine Similarity): Fetch top 3 closest chunks from the Vector DB (Pinecone, Weaviate, etc.).
- Doc A: "Send password reset link" (90% match)
- Doc B: "Login error troubleshooting" (85% match)
Prompt Assembly: Feed the retrieved docs to the LLM as a "Cheat Sheet".

[Instructions]
You are a support agent. Answer ONLY based on the [Context] below. If unsure, say "I don't know". (Do not invent facts!)

[Context]
... (Content of Doc A, B)

[Question]
Forgot pwd

Answer: "If you wish to reset your password, we can send a link to your registered email." (Accurate!)
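
The prompt assembly step can be sketched as a small string-building function. The `build_prompt` name and the `id`/`text` keys are assumptions for this example; the wording of the instructions is the same as in the template above.

```python
def build_prompt(question, docs):
    """Combine the fixed instructions, the retrieved docs, and the user question."""
    context = "\n".join(f"[Doc ID: {doc['id']}] {doc['text']}" for doc in docs)
    return (
        "[Instructions]\n"
        "You are a support agent. Answer ONLY based on the [Context] below. "
        'If unsure, say "I don\'t know". (Do not invent facts!)\n\n'
        f"[Context]\n{context}\n\n"
        f"[Question]\n{question}"
    )

retrieved = [
    {"id": "A", "text": "Send password reset link"},
    {"id": "B", "text": "Login error troubleshooting"},
]
print(build_prompt("Forgot pwd", retrieved))
```

The returned string is what gets sent to the LLM as the user or system message, depending on how your client library structures a chat request.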

5. Reality Check: Sometimes You Need Keywords (Hybrid Search)

I celebrated after implementing RAG, but hit a wall soon after. Vector search is great for "Meaning," but can be weak on "Exact Words."

If a user searches for error code "ERR-503", vector search might bring up generic "server error" or "connection failed" docs because they are semantically similar. But we need the document that specifically mentions "ERR-503".

So I implemented Hybrid Search.

Vector Search (Dense Retrieval): For meaning ("I can't login" -> Finds Login issue doc).
Keyword Search (Sparse Retrieval / BM25): For exact matches ("ERR-503" -> Finds doc with ERR-503).

Combining these (Ensemble) using Reciprocal Rank Fusion (RRF) drastically improved accuracy.
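
RRF itself is only a few lines: each document earns 1 / (k + rank) from every result list it appears in, and the sums decide the final order. The sketch below uses k = 60, a commonly used default; the document IDs are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for the query "ERR-503".
vector_hits = ["server-error", "connection-failed", "err-503"]
keyword_hits = ["err-503", "release-notes"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

The "err-503" document is only third in the vector results, but because it also tops the keyword results it ends up first after fusion. RRF works on ranks alone, so you never have to normalize BM25 scores against cosine scores.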

5.1. Reranking (Cross-Encoder)

After the first retrieval pass, we double-check whether the candidates are really relevant. We use a Cross-Encoder model (like BGE-Reranker).

Bi-Encoder (Standard Vector Search): Calculates distance between two pre-computed vectors. Fast, medium accuracy.
Cross-Encoder (Reranker): Takes (Query, Doc) pair as input and directly scores "How relevant are these two?". Slow, highest accuracy.

The golden rule is "2-Stage Retrieval":

Fetch 100 candidates fast with Vector Search.
Rerank them with the Cross-Encoder and keep the top 10.
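
The shape of the two stages can be shown without any model at all. In this sketch both scorers are toy stand-ins that I made up: `word_overlap` plays the cheap first-stage retriever and `phrase_match` plays the expensive reranker. In a real pipeline you would plug in vector similarity and a Cross-Encoder instead.

```python
def two_stage_retrieve(query, corpus, fast_score, precise_score, n_candidates=100, top_k=10):
    """Stage 1: cheap scoring over everything. Stage 2: costly scoring over the survivors."""
    candidates = sorted(corpus, key=lambda doc: fast_score(query, doc), reverse=True)[:n_candidates]
    reranked = sorted(candidates, key=lambda doc: precise_score(query, doc), reverse=True)
    return reranked[:top_k]

def word_overlap(query, doc):
    # Toy stand-in for vector similarity: number of shared words.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_match(query, doc):
    # Toy stand-in for a Cross-Encoder: looks at query and doc together.
    bonus = 2.0 if query.lower() in doc.lower() else 0.0
    return bonus + word_overlap(query, doc)

corpus = [
    "reset your password from the login page",
    "password rules and length limits",
    "how to reset password by email",
    "billing and refunds",
]

top = two_stage_retrieve("reset password", corpus, word_overlap, phrase_match, n_candidates=3, top_k=2)
print(top)
```

The expensive scorer only ever sees `n_candidates` documents, which is the whole point: its cost stays fixed no matter how large the corpus grows.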

6. Speed War: Vector Indexing (HNSW)

When you have millions of vectors, comparing them one by one (Linear Search) is too slow. You need Vector Indexing.

One of the most widely used algorithms is HNSW (Hierarchical Navigable Small World). The name is complex, but the principle is "Highways and Local Roads."

Layer 2 (Highway): Connects distant regions sparsely.
Layer 1 (Arterial): Connects neighborhoods.
Layer 0 (Local): Every house is connected.

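The search starts on the Highway (Layer 2) to get near the destination quickly, then drops down to Local Roads (Layer 0) for precision. Because it visits only a small part of the graph, it returns approximate nearest neighbors, and that is what makes millisecond-scale search possible even across hundreds of millions of vectors.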
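
The highway-then-local-roads descent can be illustrated with a toy. This is a heavily simplified sketch, not real HNSW: the points are one-dimensional, the three graphs are hand-built rather than constructed by the algorithm, and the search keeps a single current node instead of a candidate list. Use an existing library for anything real.

```python
def greedy_search(graph, points, query, entry):
    """Keep moving to the neighbor closest to the query until no neighbor is closer."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: abs(points[n] - query), default=current)
        if abs(points[best] - query) < abs(points[current] - query):
            current = best
        else:
            return current

def layered_search(layers, points, query, entry):
    """Run greedy search on each layer, top (sparse) first, reusing the result as the next entry."""
    current = entry
    for graph in layers:
        current = greedy_search(graph, points, query, current)
    return current

# Eight "houses" on a line at positions 0, 10, ..., 70.
points = {i: i * 10.0 for i in range(8)}
layer2 = {0: [4], 4: [0]}                             # highway: two far-apart nodes
layer1 = {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]}       # arterial: every other node
layer0 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}  # local: everyone
layers = [layer2, layer1, layer0]

print(layered_search(layers, points, query=52.0, entry=0))
```

Starting from node 0, the search hops to node 4 on the highway, to node 6 on the arterial layer, and finally settles on node 5 (position 50) on the local layer, touching only a handful of nodes along the way.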

7. Safety Net: "Admit Ignorance"

RAG doesn't eliminate hallucinations 100%. If retrieval fetches unrelated docs (Garbage In), the AI will hallucinate based on them (Garbage Out).

You must set a Similarity Threshold.

# If similarity is too low, don't even send it to LLM
if max_similarity < 0.7:
    return "I'm sorry, I couldn't find relevant information in the manual. Should I connect you to a human agent?"

Also, forcing Citations in the prompt drastically reduces lies. Instructing the model to "append a source like [Doc ID: 12] at the end of each sentence" makes it less likely to fabricate information without a source.
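
Citations are also useful after generation, because they can be checked mechanically. The sketch below assumes numeric IDs in the `[Doc ID: N]` format from the prompt; the `check_citations` helper is a hypothetical name, and what to do with a failed check (retry, fall back to a human) is up to you.

```python
import re

CITATION = re.compile(r"\[Doc ID: (\d+)\]")

def check_citations(answer, retrieved_ids):
    """Report whether the answer cites anything, and which cited IDs were never retrieved."""
    cited = {int(match) for match in CITATION.findall(answer)}
    return {
        "has_citation": bool(cited),
        "unknown_ids": sorted(cited - set(retrieved_ids)),
    }

good = check_citations("We can email you a reset link [Doc ID: 12].", retrieved_ids=[12, 7])
bad = check_citations("Go to Settings > Security [Doc ID: 99].", retrieved_ids=[12, 7])
print(good)
print(bad)
```

An answer with no citation, or one that cites a document that was never in the context, is a strong hint that the model made something up and should not be shown to the customer as-is.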