
Normalization: Kill Duplicates
Foundation of DB Design. Splitting tables to prevent Anomalies. 1NF, 2NF, 3NF explained simply.


When I first built a database, I thought of it like an Excel spreadsheet. It seemed simple and convenient to dump all the necessary information into one table. In the early days, I actually did that. I put order information, customer information, and product information all into a single table. Queries were simple too. One SELECT * and I had everything.
But problems exploded when I tried to modify or delete data. I wanted to change a customer's email address, but if that customer had 100 orders, I had to update 100 rows. If I accidentally missed even one, I'd end up with the bizarre situation where the same customer had different emails. When I deleted one order, the customer information got wiped out too. I only wanted to delete the order.
That's when I encountered the concept of normalization. At first, I thought "Why complicate things by splitting a perfectly fine table?" But after experiencing anomalies firsthand, my perspective changed. Ultimately, normalization was "a systematic way to avoid storing the same information in multiple places." I'm recording this as personal notes for myself.
Studying normalization, I was bombarded with terms: First Normal Form, Second Normal Form, Third Normal Form, BCNF, Fourth Normal Form... Initially, I thought "Is this some academic theory?" Each normal form came with mathematical definitions and concepts like Functional Dependency. As someone without a CS background, these expressions didn't click for me.
When I read "eliminate partial functional dependency," I had no idea what was partial and what was whole. "Transitive dependency" was the same. I understood that A determines B and B determines C, but what was I supposed to do about it?
Then I realized: don't try to understand the theory first. Look at the actual problem situations first. When I built tables myself, inserted data, deleted it, modified it, and experienced what problems arose, I viscerally understood why normalization was needed.
The analogy that resonated most for me in understanding normalization was "organizing a room." Imagine a messy room with clothes, books, electronics, and documents all mixed together. In this state, finding anything takes forever, and the same items are scattered in multiple places, so if you organize part of it, the rest stays neglected.
Normalization is the process of sorting "clothes in the closet, books on the bookshelf, documents in the filing cabinet" with labels. Making sure each item exists in exactly one place. Then when you change clothes, you only open the closet, and when you look for a book, you only search the bookshelf. Organizing one place keeps the whole system tidy.
Denormalization is the opposite: "keep frequently used items together." For example, the watch, wallet, and keys you use every morning should technically be in separate drawers, but for convenience, you put them together in a basket by the entrance. Even if there's some duplication, being able to grab them quickly is better. I accepted normalization and denormalization through this analogy.
Before learning normalization, you need to understand anomalies first. Initially, I didn't know what anomalies were and thought "Can't I just be careful?" But after experiencing them, I realized being careful doesn't solve the problem. If the data structure itself is poorly designed, problems occur no matter how cautious you are.
Say we manage students and courses in one table.
CREATE TABLE Student_Course (
    student_id INT,
    student_name VARCHAR(100),
    department VARCHAR(100),
    course_id INT,
    course_name VARCHAR(100),
    PRIMARY KEY (student_id, course_id)
);
In this table, a student can only be registered if they're enrolled in at least one course. What if a newly admitted student hasn't registered for courses yet? You can't insert the data. Because the Primary Key is the combination of student_id and course_id, if course_id is NULL, storage is impossible. That's an insertion anomaly. You want to add student information, but you can't without course information.
Initially, I thought "Can't I just put a temporary 'TBD' for the course?" But that creates dummy data, and later you need extra work to clean up the dummies. Putting false data in the database wasn't a fundamental solution.
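The insertion anomaly is easy to reproduce. Here's a minimal sketch using Python's built-in sqlite3 module (one caveat: SQLite, unlike most engines, allows NULL in PRIMARY KEY columns unless you say otherwise, so NOT NULL is declared explicitly here to match standard SQL behavior):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Student_Course (
        student_id   INT NOT NULL,
        student_name TEXT,
        course_id    INT NOT NULL,  -- explicit: SQLite would otherwise allow NULL in a PK column
        course_name  TEXT,
        PRIMARY KEY (student_id, course_id)
    )
""")

# A newly admitted student has no enrollment yet, so there is no course_id to supply.
error = None
try:
    conn.execute("INSERT INTO Student_Course VALUES (201, 'New Student', NULL, NULL)")
except sqlite3.IntegrityError as e:
    error = e  # NOT NULL constraint failed: the insertion anomaly in action

print(error)
```

The student simply cannot be stored without inventing course data, which is exactly the problem.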
In the above table, suppose a student is enrolled in only one course. If that student drops the course? You have to delete that row, but then the student information (student_name, department) disappears along with it. I wanted to delete only the enrollment information, but the student information vanishes too.
This actually happened in production. I had customer information together in an order table, and when canceling an order, I accidentally deleted the customer information too. Luckily, there was a backup so I recovered it, but in that moment I realized "the table design is wrong."
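The deletion anomaly can be shown the same way. A minimal sqlite3 sketch: one student, one enrollment, and dropping the course wipes out the student too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Student_Course (
        student_id   INT,
        student_name TEXT,
        course_id    INT,
        course_name  TEXT,
        PRIMARY KEY (student_id, course_id)
    )
""")
# Student 301 is enrolled in exactly one course.
conn.execute("INSERT INTO Student_Course VALUES (301, 'Hong Gildong', 101, 'Math')")

# Dropping the course means deleting the only row that mentions the student.
conn.execute("DELETE FROM Student_Course WHERE student_id = 301 AND course_id = 101")

remaining = conn.execute(
    "SELECT COUNT(*) FROM Student_Course WHERE student_id = 301"
).fetchone()[0]
print(remaining)  # 0 -- the student's name vanished along with the enrollment
```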
Student 'Hong Gildong' is enrolled in 10 courses. Hong's department changed from 'Computer Engineering' to 'Software Engineering'. Then you have to update the department column in all 10 rows. If you accidentally update 9 and miss 1? The same student has two departments. Data integrity breaks.
-- To change Hong's department, you must update all rows
UPDATE Student_Course
SET department = 'Software Engineering'
WHERE student_id = 101; -- All 10 rows must be updated
I thought "Can't I just wrap it in a transaction at the application logic level?" But that's not a fundamental solution. What if there are multiple applications? What if different teams use the same database in different languages? You'd have to implement the same logic everywhere. If the database structure itself enforces duplication, no matter how careful the application is, problems will eventually arise.
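And the update anomaly: a sketch where a WHERE clause is deliberately written too narrowly, simulating the "updated 9, missed 1" bug. The duplicated department column lets the inconsistency exist at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Student_Course (
        student_id INT,
        department TEXT,
        course_id  INT,
        PRIMARY KEY (student_id, course_id)
    )
""")
# One student, three enrollments: the department is stored three times.
conn.executemany(
    "INSERT INTO Student_Course VALUES (101, 'Computer Engineering', ?)",
    [(1,), (2,), (3,)]
)

# A buggy update that misses one row (the WHERE clause is too narrow).
conn.execute(
    "UPDATE Student_Course SET department = 'Software Engineering' "
    "WHERE student_id = 101 AND course_id IN (1, 2)"
)

departments = conn.execute(
    "SELECT DISTINCT department FROM Student_Course WHERE student_id = 101"
).fetchall()
print(len(departments))  # 2 -- the same student now has two departments
```

Nothing in the schema prevents this; the duplication itself is the defect.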
I first encountered 1NF through the term "atomicity." The explanation "values that cannot be split further, like atoms" felt abstract at first. But when I saw actual examples, it was simple.
CREATE TABLE Student_Bad (
    student_id INT PRIMARY KEY,
    name VARCHAR(100),
    courses VARCHAR(500) -- Values like "Math, English, Science"
);
INSERT INTO Student_Bad VALUES (1, 'Hong Gildong', 'Math, English, Science');
In this table, how do you write a query to "find students taking English"?
SELECT * FROM Student_Bad WHERE courses LIKE '%English%';
You have to fall back on LIKE pattern matching. The problem is that it's slow: with a leading wildcard, the query can't use an index and must scan every row. And if there are courses like "English, English Conversation, Practical English," how do you distinguish them? Searching for "Science" also matches "Science and Technology." Exact searching is impossible.
Initially, I thought "Can't I just define delimiters precisely? Clearly specify commas and spaces." But that's offloading database work to the application. The database should clearly express "this student takes this course," but handling it with string parsing is poor design.
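The false-match problem is concrete, not hypothetical. A sqlite3 sketch with two students, only one of whom takes the plain "English" course:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student_Bad (student_id INT PRIMARY KEY, name TEXT, courses TEXT)"
)
conn.executemany("INSERT INTO Student_Bad VALUES (?, ?, ?)", [
    (1, 'Hong Gildong', 'Math, English, Science'),
    (2, 'Kim Cheolsu', 'Practical English, History'),  # takes no plain 'English' course
])

# Pattern matching cannot tell 'English' apart from 'Practical English'.
rows = conn.execute(
    "SELECT name FROM Student_Bad WHERE courses LIKE '%English%'"
).fetchall()
print(rows)  # both students match, even though only one takes 'English'
```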
CREATE TABLE Student (
    student_id INT PRIMARY KEY,
    name VARCHAR(100)
);
CREATE TABLE Course (
    course_id INT PRIMARY KEY,
    course_name VARCHAR(100)
);
CREATE TABLE Enrollment (
    student_id INT,
    course_id INT,
    PRIMARY KEY (student_id, course_id),
    FOREIGN KEY (student_id) REFERENCES Student(student_id),
    FOREIGN KEY (course_id) REFERENCES Course(course_id)
);
INSERT INTO Student VALUES (1, 'Hong Gildong');
INSERT INTO Course VALUES (101, 'Math'), (102, 'English'), (103, 'Science');
INSERT INTO Enrollment VALUES (1, 101), (1, 102), (1, 103);
Now the query to find "students taking English" becomes clear.
SELECT s.name
FROM Student s
JOIN Enrollment e ON s.student_id = e.student_id
JOIN Course c ON e.course_id = c.course_id
WHERE c.course_name = 'English';
You can use indexes and get exact matches. Through this, I accepted that "1NF is the foundation of relational databases." The moment you put comma-separated strings, you're already giving up the advantages of relational databases.
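To verify the difference, here's the normalized schema run end to end in sqlite3 (the "Practical English" student is added to show the false match is gone):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (student_id INT PRIMARY KEY, name TEXT);
    CREATE TABLE Course (course_id INT PRIMARY KEY, course_name TEXT);
    CREATE TABLE Enrollment (
        student_id INT, course_id INT,
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO Student VALUES (1, 'Hong Gildong'), (2, 'Kim Cheolsu');
    INSERT INTO Course VALUES (101, 'Math'), (102, 'English'), (103, 'Practical English');
    INSERT INTO Enrollment VALUES (1, 102), (2, 103);
""")

# Equality on course_name: 'Practical English' can no longer match by accident.
rows = conn.execute("""
    SELECT s.name
    FROM Student s
    JOIN Enrollment e ON s.student_id = e.student_id
    JOIN Course c ON e.course_id = c.course_id
    WHERE c.course_name = 'English'
""").fetchall()
print(rows)  # only the student actually enrolled in 'English'
```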
To understand 2NF, you first need to know "Composite Key." It's when the Primary Key is a combination of multiple columns. Initially, I didn't understand what "dependent on only part of the key, not the whole thing" meant. After seeing an example, I thought "Ah, this is it."
CREATE TABLE Enrollment_Bad (
    student_id INT,
    course_id INT,
    student_name VARCHAR(100), -- Dependent only on student_id
    department VARCHAR(100),   -- Dependent only on student_id
    course_name VARCHAR(100),  -- Dependent only on course_id
    grade CHAR(1),             -- Needs both student_id and course_id
    PRIMARY KEY (student_id, course_id)
);
This table's Primary Key is the combination (student_id, course_id). But student_name and department are determined with only student_id. course_id isn't needed. Conversely, course_name is determined with only course_id. Only grade needs both values.
This is partial functional dependency. It means there are columns determined by only part of the Primary Key. I understood this as "carrying around more information than necessary." If student 101 is enrolled in 5 courses, the student name and department are duplicated 5 times. To change the student name, you have to update all 5 rows. That's an update anomaly.
CREATE TABLE Student (
    student_id INT PRIMARY KEY,
    student_name VARCHAR(100),
    department VARCHAR(100)
);
CREATE TABLE Course (
    course_id INT PRIMARY KEY,
    course_name VARCHAR(100)
);
CREATE TABLE Enrollment (
    student_id INT,
    course_id INT,
    grade CHAR(1),
    PRIMARY KEY (student_id, course_id),
    FOREIGN KEY (student_id) REFERENCES Student(student_id),
    FOREIGN KEY (course_id) REFERENCES Course(course_id)
);
Now student information is stored exactly once in the Student table. Course information once in the Course table. The Enrollment table purely records "who took what and got what grade." To change a student name, you only update one row in the Student table.
Through this experience, I summarized "2NF is the normal form to watch out for when using composite keys." If the Primary Key is a single column, 2NF violation is impossible from the start. The concept of partial dependency doesn't even apply.
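A quick check of the payoff in sqlite3: however many enrollments the student has, changing their department now touches exactly one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (student_id INT PRIMARY KEY, student_name TEXT, department TEXT);
    CREATE TABLE Enrollment (
        student_id INT, course_id INT, grade TEXT,
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO Student VALUES (101, 'Hong Gildong', 'Computer Engineering');
    INSERT INTO Enrollment VALUES (101, 1, 'A'), (101, 2, 'B'), (101, 3, 'A');
""")

# The department lives in exactly one row, regardless of enrollment count.
cur = conn.execute(
    "UPDATE Student SET department = 'Software Engineering' WHERE student_id = 101"
)
print(cur.rowcount)  # 1 -- one row updated; a missed-row inconsistency is impossible
```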
3NF was the most confusing concept for me. When I read "if A determines B and B determines C, then A transitively determines C," I was reminded of the transitive relation "if A=B and B=C, then A=C" from math class. But I didn't immediately grasp why this was a problem in databases.
CREATE TABLE Student_Bad (
    student_id INT PRIMARY KEY,
    student_name VARCHAR(100),
    department VARCHAR(100),
    dept_office VARCHAR(100) -- Department office location
);
INSERT INTO Student_Bad VALUES
    (101, 'Hong Gildong', 'Computer Engineering', 'Engineering Hall 301'),
    (102, 'Kim Cheolsu', 'Computer Engineering', 'Engineering Hall 301'),
    (103, 'Lee Younghee', 'Electronic Engineering', 'Engineering Hall 201');
Here, the relationship student_id → department → dept_office holds. You can know the department from the student ID, and know the office location from the department. So you can indirectly know the office location from the student ID. That's transitive dependency.
What's the problem? Say there are 100 students in 'Computer Engineering'. Then 'Engineering Hall 301' is duplicated 100 times. If the Computer Engineering office moves to 'Engineering Hall 401'? You have to update all 100 rows. If you miss even one, you get the bizarre situation where the same department has different office locations.
Initially, I thought "Can't I just update them all in one transaction?" But that's not the fundamental issue. The office location is information dependent on the department, but it's stored in the student table, which is weird. Changing department information is unrelated to students, yet you have to touch the student table, which doesn't make sense.
CREATE TABLE Department (
    department VARCHAR(100) PRIMARY KEY,
    dept_office VARCHAR(100)
);
CREATE TABLE Student (
    student_id INT PRIMARY KEY,
    student_name VARCHAR(100),
    department VARCHAR(100),
    FOREIGN KEY (department) REFERENCES Department(department)
);
INSERT INTO Department VALUES
    ('Computer Engineering', 'Engineering Hall 301'),
    ('Electronic Engineering', 'Engineering Hall 201');
INSERT INTO Student VALUES
    (101, 'Hong Gildong', 'Computer Engineering'),
    (102, 'Kim Cheolsu', 'Computer Engineering'),
    (103, 'Lee Younghee', 'Electronic Engineering');
Now the office location is stored exactly once in the Department table. If Computer Engineering moves, you only update one row in the Department table. You don't need to touch the data for 100 students at all.
Through this, I summarized "3NF eliminates dependencies between non-key columns." department isn't a Primary Key, but dept_office depends on it. Separating such relationships is what 3NF does.
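The office move can be replayed in sqlite3: one UPDATE on Department, and every student sees the new location through the JOIN, with no chance of a half-updated state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (department TEXT PRIMARY KEY, dept_office TEXT);
    CREATE TABLE Student (student_id INT PRIMARY KEY, student_name TEXT, department TEXT);
    INSERT INTO Department VALUES
        ('Computer Engineering', 'Engineering Hall 301'),
        ('Electronic Engineering', 'Engineering Hall 201');
    INSERT INTO Student VALUES
        (101, 'Hong Gildong', 'Computer Engineering'),
        (102, 'Kim Cheolsu', 'Computer Engineering');
""")

# The office moves: one row changes in Department.
conn.execute(
    "UPDATE Department SET dept_office = 'Engineering Hall 401' "
    "WHERE department = 'Computer Engineering'"
)

# Every Computer Engineering student now resolves to the new office via JOIN.
offices = conn.execute("""
    SELECT DISTINCT d.dept_office
    FROM Student s JOIN Department d ON s.department = d.department
    WHERE s.department = 'Computer Engineering'
""").fetchall()
print(offices)
```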
BCNF (Boyce-Codd Normal Form) is a more strict normal form than 3NF. Initially, I thought "Isn't 3NF enough?" In practice, most cases are fine with just 3NF. But in special cases, even if you satisfy 3NF, anomalies can remain.
3NF has the condition "non-prime attributes should not be transitively dependent on candidate keys." But BCNF has the stronger condition "every determinant must be a candidate key." I didn't fully grasp this difference, but I accepted it as "there are special cases that 3NF doesn't solve." I heard it's rare in practice to need to consider BCNF.
Theory alone didn't resonate, so I actually designed an order system for an e-commerce site. At first, I put all the information in one table.
CREATE TABLE Orders_Bad (
    order_id INT PRIMARY KEY,
    customer_name VARCHAR(100),
    customer_email VARCHAR(100),
    customer_address VARCHAR(500),
    products VARCHAR(1000), -- "Product A, Product B, Product C"
    total_price DECIMAL(10, 2),
    order_date DATE
);
This table doesn't even satisfy 1NF. Multiple products go into the products column separated by commas. To find "people who ordered Product A," you'd have to use a query like LIKE '%Product A%'.
CREATE TABLE Orders (
    order_id INT PRIMARY KEY,
    customer_name VARCHAR(100),
    customer_email VARCHAR(100),
    customer_address VARCHAR(500),
    order_date DATE
);
CREATE TABLE Order_Items (
    order_id INT,
    product_id INT,
    quantity INT,
    price DECIMAL(10, 2),
    PRIMARY KEY (order_id, product_id)
);
Now product information is clearly separated. But there's still a problem. If a customer orders 10 times, the name, email, and address are duplicated 10 times. If the customer's email changes, you have to update all 10 rows.
CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(100),
    customer_email VARCHAR(100),
    customer_address VARCHAR(500)
);
CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    product_price DECIMAL(10, 2)
);
CREATE TABLE Orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE,
    FOREIGN KEY (customer_id) REFERENCES Customers(customer_id)
);
CREATE TABLE Order_Items (
    order_id INT,
    product_id INT,
    quantity INT,
    price DECIMAL(10, 2), -- Price at order time (product price might change later)
    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (order_id) REFERENCES Orders(order_id),
    FOREIGN KEY (product_id) REFERENCES Products(product_id)
);
Now customer information is stored only once in the Customers table. Product information once in Products. Order information in Orders, order details in Order_Items. Each piece of information exists in exactly one place.
Going through this process myself, I understood "normalization is classifying information by subject." Customer information, product information, order information are each independent entities. Mixing them in one table creates duplication and anomalies.
After learning normalization, I thought "normalizing to 3NF is always the answer." But in practice, it wasn't. Some queries required too many JOINs and performed terribly. Especially in read-heavy services, denormalization was essential.
For example, consider a page showing an order list. You need to show customer name, product name, order date, total amount.
SELECT
    o.order_id,
    c.customer_name,
    p.product_name,
    oi.quantity,
    oi.price,
    o.order_date
FROM Orders o
JOIN Customers c ON o.customer_id = c.customer_id
JOIN Order_Items oi ON o.order_id = oi.order_id
JOIN Products p ON oi.product_id = p.product_id;
This query JOINs 4 tables. With millions of orders, it can get quite slow. That's when you consider denormalization.
CREATE TABLE Order_Summary (
    order_id INT PRIMARY KEY,
    customer_name VARCHAR(100), -- Duplicated storage
    product_names TEXT,         -- Duplicated storage
    total_price DECIMAL(10, 2),
    order_date DATE
);
When an order is created, you also store summary information in this table. Then when retrieving the list, you only read one table without JOINs. Read performance drastically improves.
But what if a customer name changes? You have to update not only the original Customers table but also the Order_Summary table. That's the cost of denormalization. I accepted this as "paying consistency maintenance costs for convenience."
In practice, in such situations, you add synchronization logic at the application level, or create a separate read-only replica and manage it separately. There's no right answer; you have to judge based on the situation.
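One way to pay that consistency cost inside the database itself is a trigger. Here's a sketch in sqlite3; note it adds a hypothetical customer_id column to the summary table (not in the schema above) so the trigger can find which rows to refresh:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_id INT PRIMARY KEY, customer_name TEXT);
    CREATE TABLE Orders (order_id INT PRIMARY KEY, customer_id INT);
    -- customer_id here is an assumption added for the sketch, so the
    -- denormalized rows can be located when the source of truth changes.
    CREATE TABLE Order_Summary (order_id INT PRIMARY KEY, customer_id INT, customer_name TEXT);

    INSERT INTO Customers VALUES (1, 'Hong Gildong');
    INSERT INTO Orders VALUES (10, 1);
    INSERT INTO Order_Summary VALUES (10, 1, 'Hong Gildong');

    -- Propagate name changes into the denormalized copy automatically.
    CREATE TRIGGER sync_summary_name AFTER UPDATE OF customer_name ON Customers
    BEGIN
        UPDATE Order_Summary
        SET customer_name = NEW.customer_name
        WHERE customer_id = NEW.customer_id;
    END;
""")

conn.execute("UPDATE Customers SET customer_name = 'Hong G.' WHERE customer_id = 1")
name = conn.execute(
    "SELECT customer_name FROM Order_Summary WHERE order_id = 10"
).fetchone()[0]
print(name)  # the summary row stayed consistent with Customers
```

Triggers are only one option; application-level sync or a read replica, as mentioned above, are equally valid depending on the situation.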
While studying normalization, I had a moment where I realized "this is what it all comes down to." The essence of normalization is "don't store the same information twice." Every piece of information should exist in exactly one place. Then when you modify it, you only fix one place, and data integrity is automatically maintained.
1NF, 2NF, 3NF each eliminate a different kind of duplication: 1NF removes multi-valued columns, 2NF removes partial dependencies on a composite key, and 3NF removes transitive dependencies between non-key columns.
They all aim for the same goal: "One fact in one place." After accepting this principle, database design became much clearer for me. When pondering "where should I store this information?", asking "what does this information depend on?" gave me the answer.
Customer name depends on customer ID → Customers table.
Product price depends on product ID → Products table.
Grade depends on the combination of student ID and course ID → Enrollment table.
Thinking this way, table design naturally organized itself.
Normalization gave me the realization that "a database is not Excel." In Excel, you put all information in one sheet and manage it with filters and sorting. But databases are different. Relational databases should be designed around "relationships." Each entity exists independently and is connected by foreign keys.
Through normalization, I understood why anomalies occur and how to prevent them. 1NF is the foundation of relational databases, and 2NF and 3NF are methods to systematically eliminate duplication. Denormalization is a technique to intentionally allow duplication for performance, and you must consider the tradeoff between reads and writes.
In practice, you don't always normalize to 3NF. Depending on the situation, you might stop at 2NF, or apply denormalization to some tables. What's important is being able to explain "why I designed it this way." If you know normalization principles, you can break them with justification when you do.
Going forward, whenever I design a database, I'll always think about anomalies first. "Will there be problems when adding/modifying/deleting data in this structure?" If I can answer this question, the design is good.