B-Tree: The Algorithm Behind Database Indexes
1. Introduction: The Case of the Slow Query
I still remember my first internship project. I was dealing with a database of about a million orders.
In my local environment with 1,000 rows, queries ran in 0.005 seconds. But on the production server with 1,000,000 rows, the same query took 2.5 seconds.
It was 500 times slower.
I urgently added an index:
CREATE INDEX idx_user_id ON orders(user_id);
The result? 0.003 seconds. It became magically fast.
What on earth did CREATE INDEX do?
Behind this magic lies a data structure called B-Tree.
2. Why Do We Need This?
Why don't databases use simple Arrays or Binary Search Trees (BST)?
- Array: Fast reading (
O(1)), but inserting data requires shifting all elements. Write performance is terrible. - BST: Fast (
O(log N)), but relies on pointers in memory.
The critical problem is the physical limitation of the Hard Disk (HDD/SSD).
RAM vs Disk: The Great Wall
- RAM access: ~100 nanoseconds.
- Disk access: ~10 milliseconds. Disk is 100,000 times slower than RAM.
If you use a standard Binary Tree on a Disk, accessing a node means a disk seek. If the tree height is 20, you need 20 disk seeks. That's 0.2 seconds per query just for seeking. In a world where 0.003 seconds is the target, 0.2 seconds is unacceptable.
3. B-Tree Strategy: "Short and Fat"
To solve the disk slowness, B-Tree adopts a strategy to minimize disk I/O.
- High Fan-out: Unlike a binary tree with 2 children, a B-Tree node can have thousands of children.
- Low Height: By packing hundreds of keys into a single node (matching the Disk Page size of 4KB or 16KB), the tree becomes very short and wide.
A B-Tree with 1,000,000 items might only be depth 3. (Level 1: Root, Level 2: Branches, Level 3: Leaves). Finding any data takes only 3 Disk I/Os. This efficiency is why databases are fast.
4. B-Tree vs B+Tree (Crucial Comparison)
Modern databases (MySQL InnoDB, Oracle) rarely use pure B-Trees. They use B+Trees. Understanding this distinction is often a "pass/fail" question in developer interviews.
| Feature | B-Tree | B+Tree (Winner) |
|---|---|---|
| Data Storage | Internal Nodes & Leaves | Leaves Only |
| Internal Nodes | Heavy (Key + Data) | Light (Key only) |
| Range Search | Slow (Tree traversal) | Fast (Linked List at leaves) |
Why B+Tree Wins
- More Keys in RAM: Since internal nodes don't hold actual data, they are lighter. You can fit more "signposts" (Keys) in the RAM cache, increasing the cache hit ratio.
- Range Scan Optimization: Leaf nodes in a B+Tree are connected like a Linked List. To find "User ID 20 to 30", you find 20 (
O(log N)), and then just scan the list sideways (O(K)).
5. Clustered vs Non-Clustered Index
This is practical knowledge you must know to optimize DB performance.
1) Clustered Index
- Usually the Primary Key (PK).
- The actual table data (rows) is physically sorted and stored in the leaf nodes of this B+Tree.
- Analogy: A dictionary. The word "Apple" is physically followed by "Apricot".
- Pro: Fastest retrieval.
- Con: Only one per table.
2) Non-Clustered (Secondary) Index
- Any index created with
CREATE INDEX. - The leaf nodes contain the Primary Key pointing to the data, not the data itself.
- Analogy: The index at the back of a book. "Topic A: Page 123".
- Process: Search Secondary Index -> Get PK -> Search Clustered Index (Lookup).
- Trade-off: Requires two lookups, slightly slower.
Optimization Tip: Covering Index. If your query only asks for columns contained in the index (e.g., SELECT age FROM users WHERE age > 20), the DB doesn't need to look up the main table. It serves data directly from the index.
6. The Competitor: LSM Tree (Log Structured Merge Tree)
B-Trees are great for reads but expensive for writes because they require updating the sorted structure immediately (Random I/O). What if you need to handle massive write throughput, like chat logs or sensor data? Enter LSM Tree (used by Cassandra, HBase, Kafka).
- Mechanism: Writes are first buffered in RAM (MemTable). When full, they are flushed to disk as immutable sorted files (SSTables). Merging happens in the background.
- Benefit: Extremely fast writes (Sequential Write).
- Drawback: Slower reads (might need to check multiple files).
Verdict:
- Read/Write balanced (General App) -> B+Tree (RDBMS).
- Write-heavy (Logs, Big Data) -> LSM Tree (NoSQL).
7. Deep Dive: Physical Constraints & WAL
Why does B-Tree try so hard to avoid random I/Os? Because HDD physics is brutal.
1) Seek Time & Rotational Latency
An HDD is like a vinyl record player. To read data:
- Seek: The arm moves to the track.
- Rotate: The platter spins to the sector. These movements take milliseconds. In CPU time, that's an eternity. B-Tree minimizes these movements by making nodes size align with Disk Pages (4KB/16KB).
2) WAL (Write Ahead Log)
To support transactions (ACID) without waiting for slow B-Tree updates, databases use WAL. Changes are first appended to a log file (Sequential Write = Fast). The complex B-Tree reorganization happens asynchronously (Checkpointing). This ensures both durability and performance.
3) What about SSDs?
SSDs have no moving parts, so random access is much faster than HDDs. However, SSDs still write in Pages. Writing 1 byte requires rewriting a whole 4KB page (Write Amplification). Thus, the B-Tree's page-aligned design remains highly efficient and relevant even in the SSD era.
8. Development Tips: Index Optimization
1) Leftmost Prefix Rule
B-Trees sort from left to right.
If you have a composite index on (Name, Age):
WHERE Name = 'Alice'-> Hit.WHERE Name = 'Alice' AND Age = 20-> Hit.WHERE Age = 20-> Miss. You cannot look up a word in a dictionary by its second letter. The order matters.
2) Index Condition Pushdown (ICP)
This is an optimization in modern MySQL (5.6+).
Ideally, the DB engine pushes the WHERE condition down to the storage engine. Even if an index can't fully sort the result (like ZIP_CODE LIKE '%99'), the storage engine checks the index content and filters out non-matches before reading the full table row. This significantly reduces disk I/O.
9. Conclusion
B-Tree is the spine of modern databases. It is an algorithm designed out of necessity—to bridge the million-fold speed gap between RAM and Disk. Understanding B-Tree isn't just about passing an interview; it's about understanding the physical reality of the hardware your software runs on. When your query is slow, don't just blame the server. Close your eyes and imagine the B-Tree. Are you traversing it efficiently?