Replication: Availability and Read Distribution
Why I Started Learning About Replication
If you only have one DB and it goes down, the whole service stops. That simple fact is what led me to study Replication.
Reading through post-mortems and incident reports, the same pattern kept appearing: "We had a single DB setup. When it failed, the service was down for hours." The follow-up questions were obvious. Can't you just run multiple DBs? How do you keep the data in sync? Who takes over if the master dies?
Following those questions led me to Replication. The master DB's data is automatically copied to slave DBs, and if the master dies, a slave is promoted to master (Failover). On top of that, distributing read queries to slaves reduces the master's load.
What Confused Me Initially
When I first encountered replication, the most confusing part was "How does data synchronize?" When you write data to the master, does it automatically copy to slaves? Then how do you handle network latency?
Another confusion was the difference between "synchronous vs asynchronous replication." Synchronous seems safer, so why use asynchronous? For performance?
I was also curious: "How does failover happen automatically?" Who detects that the master died, and who promotes a slave to master?
The 'Aha!' Moment
The decisive analogy that helped me understand replication was "Meeting Minutes."
During a meeting, one person (Master) writes the minutes. Others (Slaves) have copies of those minutes.
Synchronous Replication:
- When Master writes a line, everyone must pause until all Slaves finish copying it.
- Pros: Everyone always has the exact same data. Consistency is guaranteed.
- Cons: Very slow. If one Slave falls asleep (network issue), the whole meeting stops.
Asynchronous Replication:
- Master writes and moves on. Slaves copy at their own pace.
- Pros: Fast. The meeting never stops.
- Cons: Slaves might be reading Line 5 while Master is at Line 10. This delay is Replication Lag.
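Many production systems default to Asynchronous (or semi-synchronous) replication for performance. The price is some lag, plus the risk of losing the most recent writes if the Master dies before the Slaves have copied them.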
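To make the meeting-minutes analogy concrete, here is a minimal in-memory sketch. The `Primary` class, the `Replica` type, and the `drain()` helper are all made up for illustration; a real database ships a log over the network instead of pushing strings into arrays.

```typescript
// Hypothetical in-memory model: a replica is just an array of log lines.
type Replica = { name: string; log: string[] };

class Primary {
  log: string[] = [];
  private replicas: Replica[];
  // Async mode only: copy operations that replicas have not applied yet.
  private pending: Array<() => void> = [];

  constructor(replicas: Replica[]) {
    this.replicas = replicas;
  }

  // Synchronous: the write returns only after every replica has the line.
  writeSync(line: string): void {
    this.log.push(line);
    for (const r of this.replicas) r.log.push(line);
  }

  // Asynchronous: the write returns immediately; copying is deferred.
  writeAsync(line: string): void {
    this.log.push(line);
    for (const r of this.replicas) {
      this.pending.push(() => {
        r.log.push(line);
      });
    }
  }

  // Replication lag, measured in lines the replica is behind the primary.
  lagOf(r: Replica): number {
    return this.log.length - r.log.length;
  }

  // Simulates replicas catching up "at their own pace".
  drain(): void {
    for (const apply of this.pending) apply();
    this.pending = [];
  }
}

const replicaA: Replica = { name: "A", log: [] };
const primary = new Primary([replicaA]);

primary.writeSync("line 1");
const lagAfterSync = primary.lagOf(replicaA); // replica is fully caught up

primary.writeAsync("line 2");
primary.writeAsync("line 3");
const lagAfterAsync = primary.lagOf(replicaA); // replica is two lines behind

primary.drain();
const lagAfterDrain = primary.lagOf(replicaA); // caught up again
```

The window between `writeAsync` and `drain` is exactly the Replication Lag from the analogy: a read served by `replicaA` in that window would miss lines 2 and 3.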
Replication Architectures
1. Master-Slave (Primary-Replica)
The most basic and common structure.
            ┌─────────┐
            │ Master  │ ← Write (INSERT, UPDATE, DELETE)
            └────┬────┘
                 │ Replication Stream
     ┌───────────┼───────────┐
     ▼           ▼           ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Slave 1 │ │ Slave 2 │ │ Slave 3 │ ← Read (SELECT)
└─────────┘ └─────────┘ └─────────┘
Key Code Pattern:
// Write: Always to Master
async function createUser(userData) {
  return await masterDb.query('INSERT INTO users ...', userData);
}

// Read: Load balance across Slaves (Round Robin)
let slaveIndex = 0;
async function getUser(userId) {
  const slave = slaves[slaveIndex % slaves.length];
  slaveIndex++;
  return await slave.query('SELECT * FROM users WHERE id = ?', [userId]);
}
2. Master-Master (Multi-Master)
Multiple masters replicate each other.
┌─────────┐          ┌─────────┐
│ Master1 │ ←──────→ │ Master2 │
└─────────┘          └─────────┘
Pros:
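- Write availability (either master can accept writes). Note that every write is still applied on both masters, so total write capacity does not simply double.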
- Geographic redundancy (One master in US, one in EU).
Cons:
- Conflict Hell: What if one client sets Name to 'Alice' on the US master while another sets the same record's Name to 'Bob' on the EU master at the same time?
- Needs Conflict Resolution strategies like Last Write Wins (LWW) or Version Vectors.
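Here is a rough sketch of how Last Write Wins could settle the Alice/Bob conflict. The `Versioned` shape and the `lwwMerge` function are my own illustration, not any particular database's API, and the timestamps are invented.

```typescript
// Hypothetical record shape: each master stamps its write with its local clock.
type Versioned<T> = { value: T; timestampMs: number; origin: string };

// Last Write Wins: keep the later timestamp. Ties are broken by origin name
// so that both masters converge on the same answer.
function lwwMerge<T>(a: Versioned<T>, b: Versioned<T>): Versioned<T> {
  if (a.timestampMs !== b.timestampMs) {
    return a.timestampMs > b.timestampMs ? a : b;
  }
  return a.origin > b.origin ? a : b;
}

const fromUs: Versioned<string> = { value: "Alice", timestampMs: 1000, origin: "us" };
const fromEu: Versioned<string> = { value: "Bob", timestampMs: 1001, origin: "eu" };

// Each master merges its own write with the one it receives from its peer.
const onUs = lwwMerge(fromUs, fromEu);
const onEu = lwwMerge(fromEu, fromUs);
```

Both sides end up with the EU write, so the masters agree, but the US write is dropped without anyone being told. That silent loss is the trade-off to weigh before picking LWW.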
3. Leaderless Replication (Dynamo-style)
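Popularized by Amazon's Dynamo design and used by Cassandra and Riak. There is no Master: each write is sent to all replicas (by the client or by a coordinator node acting for it) and succeeds once enough of them acknowledge it.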
The Quorum Formula (W + R > N)
- N: Number of replicas (e.g., 3)
- W: Write quorum (Min nodes to confirm write, e.g., 2)
- R: Read quorum (Min nodes to confirm read, e.g., 2)
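If W + R > N, every read quorum shares at least one node with every write quorum (Pigeonhole Principle), so a read always reaches at least one replica that holds the latest acknowledged write. Concurrent writes and node failures can still produce stale or conflicting values in practice, which is why these systems also need conflict handling.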
Pros: No Single Point of Failure (SPOF). Extreme availability.
Cons: Complexity in handling conflicts (Read Repair).
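The overlap claim behind W + R > N is easy to check by brute force. The sketch below is only an illustration; both function names are mine. It enumerates every possible write set and read set as bitmasks and looks for a pair with no node in common.

```typescript
// The quorum condition: true when W + R > N.
function quorumsOverlap(n: number, w: number, r: number): boolean {
  return w + r > n;
}

// Counts the set bits in a bitmask, i.e. how many nodes are in the subset.
function popcount(mask: number): number {
  let count = 0;
  for (let m = mask; m > 0; m >>= 1) count += m & 1;
  return count;
}

// Brute force: is there a W-sized write set and an R-sized read set
// (both subsets of N nodes) that share no node at all?
function existsDisjointPair(n: number, w: number, r: number): boolean {
  const subsets = 1 << n;
  for (let writeSet = 0; writeSet < subsets; writeSet++) {
    if (popcount(writeSet) !== w) continue;
    for (let readSet = 0; readSet < subsets; readSet++) {
      if (popcount(readSet) === r && (writeSet & readSet) === 0) return true;
    }
  }
  return false;
}

// N=3, W=2, R=2: no disjoint pair exists, so a read cannot miss the write.
const strictQuorumIsSafe = quorumsOverlap(3, 2, 2) && !existsDisjointPair(3, 2, 2);

// N=3, W=1, R=2: the write can land on node 0 while the read asks nodes 1 and 2.
const weakQuorumCanMiss = !quorumsOverlap(3, 1, 2) && existsDisjointPair(3, 1, 2);
```

Lowering W or R buys latency and availability, and the second case shows what is given up in exchange.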
Deep Dive: How to Resolve Conflicts?
What if two users update the same record at the exact same millisecond?
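- LWW (Last Write Wins): The one with the later timestamp wins. Simple, but the losing write is silently discarded, and clock skew between servers can pick the wrong winner.
- Vector Clocks: Carry version info [v1, v2] with each value. On conflict, ask the app to merge them.
- CRDTs (Conflict-free Replicated Data Types): Data structures designed to be mergeable mathematically. (Figma's multiplayer editing is CRDT-inspired; Google Docs uses the related Operational Transformation technique instead.)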
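Two of these strategies fit in a few lines each. What follows is a simplified sketch with names I picked myself (`compareClocks`, `mergeCounters`, `counterValue`): a vector clock comparison that detects a true conflict, and a grow-only counter (G-Counter), one of the simplest CRDTs.

```typescript
// A vector clock maps a node id to the number of writes that node has issued.
type VectorClock = Record<string, number>;
type Ordering = "before" | "after" | "equal" | "concurrent";

function compareClocks(a: VectorClock, b: VectorClock): Ordering {
  let aAhead = false;
  let bAhead = false;
  // Duplicated node ids in this list are harmless; they are just re-checked.
  for (const node of Object.keys(a).concat(Object.keys(b))) {
    const av = a[node] ?? 0;
    const bv = b[node] ?? 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent"; // neither saw the other: a real conflict
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

// G-Counter CRDT: each node increments only its own slot. Merge takes the
// per-node maximum, so merging is commutative, associative, and idempotent.
type GCounter = Record<string, number>;

function mergeCounters(a: GCounter, b: GCounter): GCounter {
  const merged: GCounter = { ...a };
  for (const node of Object.keys(b)) {
    merged[node] = Math.max(merged[node] ?? 0, b[node] ?? 0);
  }
  return merged;
}

function counterValue(c: GCounter): number {
  let total = 0;
  for (const node of Object.keys(c)) total += c[node] ?? 0;
  return total;
}

// The US and EU replicas each accepted a write the other has not seen yet.
const conflict = compareClocks({ us: 2, eu: 1 }, { us: 1, eu: 2 });

// Merging the counters in either order yields the same state.
const mergedLikes = mergeCounters({ us: 3, eu: 1 }, { us: 2, eu: 4 });
```

A "concurrent" result is the signal to hand both versions to the application, whereas the counter never needs that step because its merge rule cannot lose an increment.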
Handling Replication Lag
The biggest headache in replication. A user updates their profile, refreshes the page, and still sees the old name. Why? Because the read went to a Slave that hasn't received the update yet.
Solution 1: Read-after-Write Consistency
If a user just wrote something, force their subsequent reads to go to the Master for a short duration.
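class Database {
  // Last write time per user. A single shared timestamp would push every
  // user's reads to the Master whenever anyone writes.
  private lastWriteTime = new Map<string, number>();
  private readonly LAG_THRESHOLD = 2000; // 2 seconds; keep this above the observed replication lag

  async write(userId: string, query: string, params: unknown[]) {
    const result = await masterDb.query(query, params);
    this.lastWriteTime.set(userId, Date.now());
    return result;
  }

  async read(userId: string, query: string, params: unknown[]) {
    const timeSinceWrite = Date.now() - (this.lastWriteTime.get(userId) ?? 0);
    if (timeSinceWrite < this.LAG_THRESHOLD) {
      // This user wrote recently? Read from Master to ensure consistency.
      return await masterDb.query(query, params);
    }
    // Safe to read from Slave
    return await slaveDb.query(query, params);
  }
}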
Failover: When Master Dies
If the Master server crashes, the service can't accept writes. We need to promote a Slave to be the new Master.
Automatic Failover Steps (e.g., using Orchestrator):
- Detection: Monitor notices Master is unreachable (e.g., 3 consecutive failed health checks).
- Election: Choose the most up-to-date Slave.
- Promotion: Run STOP SLAVE; RESET SLAVE ALL; on the chosen Slave.
- Reconfiguration: Tell other Slaves to follow the new Master.
- Routing: Move the Virtual IP (VIP) to the new Master so the app doesn't need config changes.
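The detection, election, and reconfiguration steps above can be sketched as plain functions. This is not how Orchestrator itself is implemented; `FailureDetector`, `electNewMaster`, `followersOf`, and the `appliedPosition` field are hypothetical names for illustration.

```typescript
// Hypothetical view of a replica as a monitor might see it.
type ReplicaState = { host: string; appliedPosition: number; reachable: boolean };

// Detection: declare the Master down only after `threshold` failures in a row,
// so one dropped health check does not trigger a failover.
class FailureDetector {
  private consecutiveFailures = 0;
  private readonly threshold: number;

  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  // Records one health-check result; returns true once the Master counts as down.
  record(ok: boolean): boolean {
    this.consecutiveFailures = ok ? 0 : this.consecutiveFailures + 1;
    return this.consecutiveFailures >= this.threshold;
  }
}

// Election: among reachable replicas, pick the one that has applied the most
// of the Master's log, which minimizes lost writes.
function electNewMaster(replicas: ReplicaState[]): ReplicaState | undefined {
  let best: ReplicaState | undefined;
  for (const r of replicas) {
    if (!r.reachable) continue;
    if (best === undefined || r.appliedPosition > best.appliedPosition) best = r;
  }
  return best;
}

// Reconfiguration: every other reachable replica should follow the new Master.
function followersOf(newMaster: ReplicaState, replicas: ReplicaState[]): string[] {
  return replicas
    .filter((r) => r.reachable && r.host !== newMaster.host)
    .map((r) => r.host);
}

const fleet: ReplicaState[] = [
  { host: "slave-1", appliedPosition: 100, reachable: true },
  { host: "slave-2", appliedPosition: 120, reachable: false },
  { host: "slave-3", appliedPosition: 110, reachable: true },
];
const promoted = electNewMaster(fleet);
```

In this made-up fleet `slave-2` is the most up to date but unreachable, so the election settles on `slave-3`, and any writes that only `slave-2` had received are at risk. Promotion and VIP movement are left out because they depend on the database and network setup.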
Production Use Cases
- Read Scaling: If you have 100k reads/sec, a single DB may not keep up. Add 5 Read Replicas to split the load.
- Analytics: Run heavy GROUP BY queries on a dedicated Analytics Slave. This prevents slowing down the Master for live users.
- Geographic Distribution (Geo-Replication): Put a Slave in the US region for US users.
  - Latency: Speed of light matters. US user -> KR Master (200ms) vs US user -> US Slave (20ms).
  - Disaster Recovery (DR): If the KR datacenter burns down, promote the US Slave to Master immediately.
  - Traffic Routing: Use DNS (like AWS Route53) with Latency-based Routing to automatically direct users to the nearest region.
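As a toy version of latency-based routing for reads, the sketch below picks the healthy replica with the lowest measured round-trip time. The `ReplicaEndpoint` type and `pickReadReplica` function are invented for illustration; in a real setup this decision would usually sit in DNS or a proxy rather than in application code.

```typescript
// Hypothetical measurements: round-trip time from one client to each region.
type ReplicaEndpoint = { region: string; rttMs: number; healthy: boolean };

// Reads go to the lowest-latency healthy replica. Writes are not covered
// here because they must still travel to the Master's region.
function pickReadReplica(endpoints: ReplicaEndpoint[]): ReplicaEndpoint | undefined {
  let best: ReplicaEndpoint | undefined;
  for (const e of endpoints) {
    if (!e.healthy) continue;
    if (best === undefined || e.rttMs < best.rttMs) best = e;
  }
  return best;
}

// The 200ms vs 20ms example from the list, as seen by a US user.
const seenFromUs: ReplicaEndpoint[] = [
  { region: "kr", rttMs: 200, healthy: true },
  { region: "us", rttMs: 20, healthy: true },
];
const chosen = pickReadReplica(seenFromUs);
```

If the US replica is marked unhealthy, the same function falls back to KR, which is slower but still serves the read.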
One-Line Summary
Database replication is about copying data from a Master to Slaves to provide High Availability and scale reads. While Asynchronous Replication is standard for performance, you must handle Replication Lag (e.g., Read-after-Write Consistency) and plan for Failover (promoting a Slave when the Master dies). It's the backbone of most scalable systems.