Transformer: Foundation of Modern AI
1. "Why Did Google Translate Get So Smart?"
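Do you remember the news in late 2016 that Google Translate had suddenly become so good that people joked Google must be torturing aliens for technology? Before that, it mangled even a simple sentence like "The father enters the room" into broken English. Then it started producing fluent, natural sentences. That first leap came from Google's neural machine translation system, which was built on LSTMs with an attention mechanism. The architecture that took the attention idea all the way, and set off the revolution that followed, arrived a few months later in 2017: the Transformer.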
Why "Transformer"? It's not about the robots in disguise. It refers to the model's ability to transform one sequence of symbols into another sequence (e.g., French to English) with unprecedented efficiency and accuracy. Until then, sequence transduction was dominated by Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs). Transformer changed the game entirely.
At the time, I was building an LSTM (Long Short-Term Memory) chatbot at work and struggling because it fell apart whenever the input sentence got long: "It just forgets everything mentioned at the beginning." A senior colleague told me, "Hey, Google published a paper, 'Attention Is All You Need'. Check it out."
The title was provocative, and that paper completely changed my AI career.
2. Why Was I Confused at First?
When I first opened the paper, I was bewildered.
Confusion 1: "No RNN?"
Until then, the default tool for NLP (Natural Language Processing) was the RNN. To process "I go to school", you naturally read it in order, "I" -> "go" -> "to" -> "school", right? But Transformer said, "Who needs to read in order? Just feed in the whole sentence at once, in parallel." My reaction: "Wait, how can language exist without order?"
Confusion 2: "What is Attention?"
"Q, K, V... Query, Key, Value?"
Are these database terms? Why are they here?
The formula was alien stuff like softmax(QK^T / sqrt(d_k)) V, and I couldn't intuitively grasp how this led to "understanding context".
3. The 'Aha!' Moment
The analogies that finally made it click were a "Group Blind Date" and a "Library Search".
Attention = The Eye Contact Game at a Group Blind Date
Imagine a group blind date (the sentence) with several people (the words): "Cheolsu gave Yeonghui an apple."
- RNN: Listens to each person's self-introduction one by one. By the time the last person speaks, you've forgotten the first person's name.
- Transformer (Self-Attention): Everyone looks at each other simultaneously.
  - The word 'Cheolsu' stares intensely at 'gave'. (Subject-Verb relation)
  - The word 'apple' also looks at 'gave'. (Object-Verb relation)
  - 'Yeonghui' also looks at 'gave'.
Each word exchanges glances (Attention) asking, "Who is related to me?" It doesn't matter how far apart they are. You just have to look. This was the secret to breaking RNN's limitation (inability to remember distant words).
Q, K, V = Library Search
- Query: "What information am I looking for right now?" (e.g., actions related to 'Cheolsu')
- Key: "What information do you hold?" (Labels of each word)
- Value: "What is the actual meaning?" (Vector value)
'Cheolsu' (Query) scans the entire sentence and compares itself against every word's Key. The Key of 'gave' fits best, so most of what 'Cheolsu' pulls in is the meaning (Value) of 'gave', with smaller contributions from the other words. In the end, 'Cheolsu' is no longer just 'Cheolsu', but is reborn as a vector with rich context: 'Cheolsu who gave an apple to Yeonghui'.
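To make the library search concrete, here is a minimal sketch of softmax(QK^T / sqrt(d_k)) V on the example sentence. The embeddings are random stand-ins, the helper name scaled_dot_product_attention is made up for illustration, and Q, K, V are all set to the same tensor instead of going through learned projections. So the printed weights carry no linguistic meaning; only a trained model would make 'Cheolsu' look hard at 'gave'. What the sketch does show is the mechanics: every word scores every other word, and each row of weights sums to 1.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for inputs of shape (seq_len, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (seq_len, seq_len) similarities
    weights = torch.softmax(scores, dim=-1)            # each row becomes a distribution
    return weights @ V, weights                        # weighted mix of Values

torch.manual_seed(0)
tokens = ["Cheolsu", "gave", "Yeonghui", "an", "apple"]
x = torch.randn(len(tokens), 4)  # assumption: random 4-dim embeddings, not trained ones

# Simplification: no learned W_q, W_k, W_v, so Query = Key = Value = x
context, weights = scaled_dot_product_attention(x, x, x)

for tok, row in zip(tokens, weights):
    print(f"{tok:>9}: " + " ".join(f"{w:.2f}" for w in row.tolist()))
```

Each printed row is one word's "glances" at all five words, and context holds the resulting context-enriched vector for each word.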
4. Dissecting the Structure: Encoder and Decoder
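Transformer is broadly divided into an Encoder and a Decoder. Taking "Translating English to Korean" as an example, the Encoder reads the English sentence and the Decoder writes the Korean one. Both are built from the same building blocks, described in the steps below.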
Step 1: Positional Encoding (Position Info)
Since Transformer processes all words in parallel, it has no built-in sense of order. (Left alone, it would treat "School I go to" and "I go to school" the same.) So, we attach a position tag to each word: "You are No.1, you are No.2..." Mathematically, this is done elegantly by adding a vector of Sine/Cosine values, unique to each position, to the word's embedding.
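As a rough sketch of that Sine/Cosine tagging, the function below follows the formulas from the paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name and the small sizes in the usage lines are assumptions chosen for illustration.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) table of fixed sine/cosine position tags."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    # 1 / 10000^(2i / d_model), computed in log space for numerical stability
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions get sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)

# Usage: the tag is added to the word embeddings before the first attention layer
embeddings = torch.randn(50, 16)  # assumption: random stand-in embeddings
x = embeddings + pe
print(pe.shape)  # torch.Size([50, 16])
```

Because every position gets a different vector, "I go to school" and "School I go to" no longer look identical to the model.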
Step 2: Multi-Head Attention (Looking from Multiple Angles)
A single Attention gives only one way of looking at the sentence. So, we create several Eyes (Heads), 8 in the original paper.
- Eye 1: "Look at who (Subject) did it."
- Eye 2: "Look at when (Time) it happened."
- Eye 3: "Look at what (Object) was affected."
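Each head analyzes the sentence from a different perspective, and the findings are combined afterwards. This is Multi-Head Attention. (The roles above are only an illustration: what each head looks at is learned during training, not assigned by hand.)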
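One way to see the separate "eyes" is to ask PyTorch's built-in nn.MultiheadAttention for the attention map of each head instead of the average. This is a sketch under assumptions: the input is random, the layer is untrained, and it relies on the average_attn_weights argument, which is only available in reasonably recent PyTorch releases.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads, seq_len = 512, 8, 5

# batch_first=True means inputs are shaped (batch, seq_len, d_model)
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
x = torch.randn(1, seq_len, d_model)  # assumption: random stand-in for 5 word vectors

# Self-attention: the same tensor serves as query, key, and value.
# average_attn_weights=False keeps one attention map per head.
out, per_head_weights = mha(x, x, x, need_weights=True, average_attn_weights=False)

print(out.shape)               # torch.Size([1, 5, 512])
print(per_head_weights.shape)  # torch.Size([1, 8, 5, 5])
```

The second shape reads as "8 heads, each with its own 5 x 5 table of who looks at whom", while the output has already been merged back into a single 512-dimensional vector per word.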
Step 3: Feed Forward & Residual Connection
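Information gained from Attention is added to the original information (Residual Connection), then mixed well by a small per-word neural network (Feed Forward), which gets its own Residual Connection. Each addition is followed by Layer Normalization to keep the values stable. "Original Me" + "Me who understood context" = "Smarter Me".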
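Putting Steps 2 and 3 together, one encoder layer could be sketched roughly as follows. The class name EncoderLayerSketch is hypothetical, dropout is left out for brevity, and layer normalization is applied after each residual addition as in the original paper (many later models normalize before the sublayer instead). The default d_ff=2048 is the inner feed-forward size used in the paper.

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Illustrative encoder layer: attention and feed forward, each with a residual."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Feed Forward: applied to every position independently
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # "Original Me" + "Me who understood context"
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        # Same residual pattern around the feed-forward network
        x = self.norm2(x + self.ffn(x))
        return x

layer = EncoderLayerSketch(d_model=64, num_heads=4, d_ff=128)  # small sizes for a quick run
y = layer(torch.randn(2, 5, 64))
print(y.shape)  # torch.Size([2, 5, 64])
```

The output has the same shape as the input, which is what lets these layers be stacked one on top of another.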
5. Practical Example: Understanding via Code (PyTorch)
Implementing the whole structure is complex, so let's look at the core Attention part.
import math

import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        # d_model must split evenly across heads, otherwise view() in forward() fails
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # Dimension of each head

        # Linear layers for Q, K, V, O
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model)
        batch_size, seq_len, d_model = x.shape

        # 1. Generate Q, K, V (Linear Projection)
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)

        # 2. Split Heads: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # 3. Calculate Attention Score (Scaled Dot-Product)
        # Multiply Q and K to find similarity: (batch, num_heads, seq_len, seq_len)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # 4. Convert to probability with Softmax (each row sums to 1)
        attention_weights = torch.softmax(scores, dim=-1)

        # 5. Multiply with Value to extract final info
        context = torch.matmul(attention_weights, V)

        # 6. Concat Heads: back to (batch, seq_len, d_model)
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        output = self.W_o(context)
        return output
When I first wrote this code, I cried a lot over shape-mismatch errors in transpose and view.
But once I saw it running, I got goosebumps thinking, "Wow, this really understands sentences just with matrix multiplications, without a single for-loop?" I understood why GPUs love it.
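For anyone hitting the same errors, tracing the shapes by hand helps. The sketch below walks a dummy tensor through the split-heads and concat-heads steps in isolation. The sizes are arbitrary, and the extra .contiguous() copy is an assumption that mimics the freshly computed attention output, which is laid out head-first in memory.

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 5, 512, 8
d_k = d_model // num_heads  # 64

x = torch.randn(batch_size, seq_len, d_model)

# Split heads: cut the last dimension into (num_heads, d_k), then move heads forward
heads = x.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 5, 64])

# Stand-in for the attention output, stored contiguously in head-first layout
context = heads.contiguous()

# Concat heads: swap back to (batch, seq_len, num_heads, d_k) before flattening
swapped = context.transpose(1, 2)
print(swapped.is_contiguous())  # False

view_failed = False
try:
    swapped.view(batch_size, seq_len, d_model)  # view() cannot flatten this layout
except RuntimeError:
    view_failed = True
print("view without contiguous() failed:", view_failed)

# Copying into a contiguous layout first makes the flattening view() legal
merged = swapped.contiguous().view(batch_size, seq_len, d_model)
print(torch.equal(merged, x))  # True
```

The round trip returning the original tensor confirms that splitting and concatenating only rearrange the numbers; all the actual mixing happens in the matrix multiplications in between.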
6. The Monsters Transformer Birthed: BERT and GPT
This single paper rewrote the history of AI.
BERT (Bidirectional Encoder Representations from Transformers)
- Structure: Only the Encoder of Transformer.
- Specialty: "Fill in the blank". Champion of understanding context bidirectionally.
- Usage: Search engines, Sentiment Analysis, QA. (Google Search adopted it in 2019.)
GPT (Generative Pre-trained Transformer)
- Structure: Only the Decoder of Transformer.
- Specialty: "Predict the next word". Champion of making up stories.
- Usage: Chatbots, Writing, Coding. (The ChatGPT we know)
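At the attention level, the difference between the two families comes down to a mask. A BERT-style encoder lets every word look at every other word, while a GPT-style decoder hides future words so that it can only predict the next one from what came before. The sketch below illustrates this with random vectors and no learned projections; the function name attention_weights is made up for the example.

```python
import math
import torch

def attention_weights(x, causal=False):
    """Return self-attention weights for x of shape (seq_len, d_k)."""
    seq_len, d_k = x.shape
    scores = x @ x.T / math.sqrt(d_k)
    if causal:
        # True above the diagonal marks "future" positions, which get hidden
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1)

torch.manual_seed(0)
x = torch.randn(4, 8)  # assumption: 4 random stand-in word vectors

bidirectional = attention_weights(x)        # BERT-style: every entry is non-zero
causal = attention_weights(x, causal=True)  # GPT-style: zeros above the diagonal

print(bidirectional)
print(causal)
```

In the causal matrix the first word can only attend to itself, the second to the first two, and so on, which is what makes left-to-right generation possible.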
In the end, we are living in the era of Transformers.
7. Summary
Transformer is a revolutionary architecture that grasps the relationships among all words in a sentence at once (Parallel) through a mechanism called 'Attention'. It eased RNN's chronic diseases of 'Amnesia' and 'Slowness' at the same time, and it is the ancestor of nearly all of today's Large Language Models (LLMs).