M·06 · AI · 2025.07.26 · 6 MIN READ

From Words to Numbers: The Art of Embedding and Vector Databases

How Machines Understand Language: Embeddings and Vector Databases

How do computers understand that 'King' - 'Man' + 'Woman' = 'Queen'? We dive deep into the evolution of NLP embeddings, from One-Hot Encoding to Word2Vec and Transformer-based models. Learn about Vector Databases, Cosine Similarity math, and how RAG (Retrieval-Augmented Generation) is reshaping modern AI applications.

codemapo

INTERDISCIPLINARY DEV · SEOUL


1. The Language of Machines

Computers are fundamentally calculators. They understand numbers, not nuances like sarcasm, metaphors, or synonyms. To bridge the gap between human language and machine understanding, we need a translation layer. This layer is Embedding.

Historically, text handling was crude. Bag of Words (BoW) simply counted word frequencies and ignored order: "Dog bites man" and "Man bites dog" looked identical to BoW because they contain exactly the same words.

One-Hot Encoding assigned a unique index to every word. With a 50,000-word vocabulary, every word became a 50,000-dimensional vector holding a single 1 and 49,999 zeros. This representation is sparse, wasteful, and carries no semantic meaning. In One-Hot space, "Car" is as distant from "Bus" as it is from "Banana", because all the vectors are mutually orthogonal.
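Both limitations are easy to reproduce. The snippet below is a minimal sketch using only the Python standard library; the three-word vocabulary and the helper names are illustrative assumptions, not part of any real NLP toolkit.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count word occurrences, ignoring order and case."""
    return Counter(sentence.lower().split())


def one_hot(word, vocabulary):
    """Return a vector with a single 1 at the word's position."""
    return [1 if w == word else 0 for w in vocabulary]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# BoW cannot tell these two sentences apart.
same_bag = bag_of_words("Dog bites man") == bag_of_words("Man bites dog")

# Toy vocabulary (an assumption for illustration).
vocabulary = ["banana", "bus", "car"]
car, bus, banana = (one_hot(w, vocabulary) for w in ("car", "bus", "banana"))

print(same_bag)          # True
print(dot(car, bus))     # 0
print(dot(car, banana))  # 0
```

A dot product of 0 for every pair of distinct words is what "orthogonal" means here: the representation has no room to say that a car is more like a bus than like a banana.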

2. Capturing Meaning: Dense Vectors

The goal of Embedding is to compress that sparse, high-dimensional space into a lower-dimensional Dense Vector space (typically a few hundred to a few thousand dimensions), where position encodes meaning.

Cosine Similarity: The Yardstick of Meaning

In this vector space, how do we gauge "similarity"? We measure the angle between vectors. The most common metric is Cosine Similarity.

similarity = cos(θ) = (A · B) / (‖A‖ · ‖B‖)

If two vectors point in the exact same direction, the angle is 0, and cosine is 1. (Identical meaning)
If they are orthogonal (90 degrees), cosine is 0. (Unrelated)
If they point in opposite directions (180 degrees), cosine is -1. (Opposite meaning, though rare in raw embeddings)
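The three cases above can be checked with a few lines of code. This is a minimal sketch in plain Python; the function name and the sample vectors are assumptions chosen to hit each case exactly.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (|A| * |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])      # same direction
orth = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # 90 degrees apart
opp = cosine_similarity([1.0, 2.0], [-1.0, -2.0])     # opposite direction

print(round(same, 6))  # 1.0
print(round(orth, 6))  # 0.0
print(round(opp, 6))   # -1.0
```

Note that the first pair differs in length but not in direction, and still scores 1: cosine similarity ignores magnitude and compares orientation only.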

This allows us to perform nearest neighbor searches. When you type a query into a modern search engine, it is often not just matching keywords; it can also match a vector representing the intent of your query against vectors representing the content of web pages.

3. The Evolution: Word2Vec to Transformers

Word2Vec (2013)

Google's Word2Vec introduced the concept that a word is defined by its neighbors. It trained a shallow neural network to predict a word given its context (CBOW) or the context given a word (Skip-gram). It gave us the famous arithmetic property: Vector("King") - Vector("Man") + Vector("Woman") ≈ Vector("Queen").

However, Word2Vec was static. The word "Bank" had one fixed vector, whether you meant a "River bank" or a "Financial bank".
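The analogy arithmetic can be illustrated with hand-made vectors. To be clear about the assumptions: the numbers below are invented for the example and are not trained Word2Vec output, and the three axes (royalty, gender, fruitiness) are far more interpretable than real embedding dimensions. The sketch only shows the mechanics: add and subtract vectors, then pick the nearest remaining word by cosine similarity.

```python
import math

# Hand-made toy vectors (assumed values, NOT trained embeddings).
# Axes: [royalty, gender (male > 0, female < 0), fruitiness]
vectors = {
    "king":   [0.9,  0.8, 0.0],
    "queen":  [0.9, -0.8, 0.0],
    "man":    [0.1,  0.8, 0.0],
    "woman":  [0.1, -0.8, 0.0],
    "prince": [0.8,  0.8, 0.0],
    "apple":  [0.0,  0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is a common convention in analogy evaluations; without it, the nearest word to the result is often one of the inputs.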

BERT and Contextual Embeddings (2018)

Transformers changed everything. Models like BERT (Bidirectional Encoder Representations from Transformers) generate embeddings dynamically. They read the entire sentence at once (using Self-Attention mechanisms) to capture the specific sense of "Bank" in that particular sentence. This is one reason modern search engines are so good at understanding complex, long-tail queries.

4. Vector Search Algorithms: How They Work (HNSW & IVF)

Storing vectors is easy; searching them is hard. If you have 1 million vectors and want to find the nearest neighbor to a query vector, you could calculate the distance to all 1 million of them (brute-force, or exact k-NN, search). This gives perfect accuracy, but the cost grows linearly with the collection size (O(N) distance computations per query), which is usually too slow for production at scale.
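For reference, exact search is only a few lines. The sketch below assumes Euclidean distance and a tiny in-memory list; the function names are illustrative. Every stored vector is scored on every query, which is exactly the linear cost described above.

```python
import heapq
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(query, vectors, k):
    """Exact k-NN: compute the distance to every vector, keep the k smallest."""
    return heapq.nsmallest(
        k, range(len(vectors)), key=lambda i: euclidean(query, vectors[i])
    )


vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
print(brute_force_knn([1.0, 1.0], vectors, k=2))  # [1, 3]
```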

Vector Databases use Approximate Nearest Neighbor (ANN) algorithms to speed this up.

HNSW (Hierarchical Navigable Small World)

This is the gold standard for vector search today (used by Pinecone, Weaviate). Imagine a multi-layered graph.

Top Layer: Has very few nodes (like express highways). You jump long distances quickly.
Bottom Layer: Has all the nodes (local streets). You fine-tune your location.

The search starts at the top layer, zooms in to the general neighborhood of the query vector, and then drills down through the lower layers for precision. In practice this gives roughly logarithmic search time (about O(log N)) with high recall.
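The layered descent can be sketched in miniature. This is a deliberately simplified illustration, not a full HNSW implementation: the three layers are hand-built over ten points on a line (an assumption for readability), whereas real HNSW assigns nodes to layers randomly, builds the links during insertion, and keeps a list of several candidates rather than a single current node. Only the core idea is shown: greedy search per layer, with each layer's result used as the entry point for the layer below.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_search(graph, points, entry, query):
    """Move to the neighbor closest to the query until no neighbor improves."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: distance(points[n], query))
        if distance(points[best], query) >= distance(points[current], query):
            return current
        current = best


def layered_search(layers, points, entry, query):
    """Search each layer from top to bottom, reusing the result as the next entry."""
    for graph in layers:
        entry = greedy_search(graph, points, entry, query)
    return entry


def chain(nodes):
    """Hand-built toy graph: link each node to its left and right neighbor."""
    graph = {n: [] for n in nodes}
    for left, right in zip(nodes, nodes[1:]):
        graph[left].append(right)
        graph[right].append(left)
    return graph


# Ten points on a line; node i sits at coordinate i.
points = {i: [float(i)] for i in range(10)}
layers = [
    chain([0, 4, 8]),          # top layer: few nodes, long jumps
    chain([0, 2, 4, 6, 8]),    # middle layer
    chain(list(range(10))),    # bottom layer: every node
]

print(layered_search(layers, points, entry=0, query=[6.7]))  # 7
```

In this run the top layer jumps 0 → 4 → 8, the middle layer steps back to 6, and the bottom layer settles on 7: five hops instead of the seven a bottom-layer-only walk from node 0 would take.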

IVF (Inverted File Index)

This technique uses clustering (like K-Means).

Training: Group your 1 million vectors into 1,000 clusters (centroids).
Indexing: Assign every vector to its nearest centroid.
Search: When a query arrives, find the nearest centroid first. Then search ONLY the vectors inside that cluster (and perhaps a few adjacent ones).

This drastically reduces the search space, but if the query falls near a cluster boundary, you might miss the true nearest neighbor (a recall issue).
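The indexing and search steps, and the boundary problem, fit in a short sketch. Two assumptions to flag: the centroids are hand-picked here rather than learned with K-Means, and the parameter name nprobe (how many clusters to search) is borrowed from common usage in ANN libraries rather than taken from the text above.

```python
import math
from collections import defaultdict


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Indexing: file each vector id under its nearest centroid."""
    lists = defaultdict(list)
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: distance(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe=1):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: distance(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return min(candidates, key=lambda i: distance(query, vectors[i]))


# Hand-picked centroids (assumed; normally produced by K-Means training).
centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = [[1.0, 0.0], [-3.0, 0.0], [5.1, 0.0], [9.0, 0.0]]
lists = build_ivf(vectors, centroids)

query = [4.8, 0.0]  # sits just on cluster 0's side of the boundary
print(ivf_search(query, vectors, centroids, lists, nprobe=1))  # 0 (misses vector 2)
print(ivf_search(query, vectors, centroids, lists, nprobe=2))  # 2 (true nearest)
```

Vector 2 is the true nearest neighbor but was filed under the other centroid, so probing a single cluster returns vector 0 instead. Raising nprobe recovers the right answer at the price of scanning more vectors, which is the basic speed versus recall trade-off of IVF.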

5. RAG: Retrieval-Augmented Generation

The hottest application of embeddings today is RAG. LLMs like GPT-4 are frozen in time. They don't know about your company's internal documents or yesterday's news. RAG bridges this gap:

Ingestion: You take your private documents (PDFs, Wikis), chunk them into small pieces, pass them through an Embedding Model (like OpenAI's text-embedding-3-small), and store the resulting vectors in a Vector Database.
Retrieval: When a user asks a question, you embed the question itself. You query the Vector DB for the "Top K" chunks that are semantically closest to the question vector.
Generation: You feed those retrieved chunks into the LLM as part of the prompt.
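The three steps above can be sketched end to end. Everything model-related here is a stand-in: toy_embed is a hypothetical word-count "embedding" that only captures word overlap (a real pipeline would call an actual embedding model), the store is a plain list instead of a Vector Database, and the final LLM call is omitted, so the sketch stops at the assembled prompt.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def ingest(chunks):
    """Ingestion: embed every chunk and keep (chunk, vector) pairs."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(question, store, k=2):
    """Retrieval: embed the question and return the top-k closest chunks."""
    q = toy_embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Generation input: place the retrieved chunks in the prompt as context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


store = ingest([
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
    "Shipping takes five business days.",
])
question = "What is the refund policy?"
top_chunks = retrieve(question, store, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Swapping toy_embed for a real embedding model and the list for a Vector Database changes the quality and scale of retrieval, but not the shape of the pipeline.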

6. Multimodal Embeddings: Beyond Text

Embeddings aren't limited to text. Models like CLIP (Contrastive Language-Image Pre-training) learn to map images and text into the same vector space. This means the vector for an image of a cat and the vector for the text "A cute kitten" will be close together. This enables:

Text-to-Image Search: Searching your photo library by typing "Birthday party on the beach".
Zero-Shot Classification: Classifying images without training a specific model for those labels.
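Zero-shot classification in a shared space reduces to a nearest-neighbor lookup between one image vector and several text vectors. The sketch below shows only that final step; the vectors are placeholder values invented for the example, standing in for what a CLIP-style image encoder and text encoder would produce.

```python
import math

# Placeholder vectors (assumed values), standing in for text-encoder outputs.
TEXT_VECTORS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

# Placeholder for the image-encoder output of a kitten photo (assumed values).
IMAGE_VECTOR = [0.8, 0.3, 0.1]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_classify(image_vector, text_vectors):
    """Pick the label whose text vector is closest to the image vector."""
    return max(text_vectors, key=lambda label: cosine(image_vector, text_vectors[label]))


print(zero_shot_classify(IMAGE_VECTOR, TEXT_VECTORS))  # a photo of a cat
```

The labels are ordinary text, so adding a new class means adding one more sentence to the dictionary rather than retraining a classifier.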

7. Case Study: Netflix's Recommendation Engine

Netflix is a pioneer of using embeddings for personalization. They don't just look at "Action" or "Comedy" genres. They create high-dimensional embeddings for every movie based on:

Metadata: Director, actors, year.
Visuals: They analyze the thumbnails using Convolutional Neural Networks (CNN) to extract visual embeddings.
Audio: They analyze the trailer's audio.
User Interactions: Who watched this? (Collaborative Filtering).

All these are combined into a dense vector representing the "vibe" of the movie. When you watch "Stranger Things", Netflix finds the nearest neighbors in this vector space. It might recommend "Dark" (German sci-fi) not because they share the same tags, but because their embeddings look similar: similar visual tone, similar suspenseful audio, and similar viewing patterns. This blend of content-based signals and collaborative filtering, expressed through embeddings, is why their recommendations can feel so magical.

8. Summary

Embeddings are the bridge between the discrete world of symbols (words) and the continuous world of mathematics (vectors). They allow computers to reason about semantic similarity, not just keyword overlap. From the simplicity of Word2Vec to the contextual power of Transformers and the scalability of HNSW-indexed Vector Databases, embeddings are the foundational block of modern AI, enabling everything from better search bars to intelligent chatbots via RAG. Understanding inputs as vectors is the first step to understanding modern AI.

#AI #NLP #Embedding #Vector Database #RAG
