Distributed Transactions in Microservices: Beyond ACID
1. The Death of ACID
In a Monolithic application, transaction management is trivial. A single Database Management System (DBMS) enforces ACID properties (Atomicity, Consistency, Isolation, Durability).
However, when we adopt Microservices Architecture (MSA), we typically adopt the Database-per-Service pattern.
- Order Service has Order DB.
- Payment Service has Payment DB.
- Inventory Service has Inventory DB.
A single business process ("Buy an item") now spans multiple databases.
Because no single DBMS controls all of these databases, a local transaction can no longer cover the whole process, and we lose the 'A' (Atomicity) in ACID. We cannot guarantee that "everything happens or nothing happens" as a single unit.
2. The Trap of Two-Phase Commit (2PC)
The traditional solution to this is the Two-Phase Commit protocol (XA Transaction).
It introduces a Transaction Coordinator.
- Prepare Phase: Coordinator asks all participants: "Can you commit?" Each node locks rows and writes to logs.
- Commit Phase: If everyone says YES, Coordinator says "COMMIT". If anyone says NO, Coordinator says "ROLLBACK".
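The two phases above can be sketched in a few lines. This is a minimal in-memory illustration, not a real XA implementation: the `Participant` class, its `canCommit` flag, and the `twoPhaseCommit` function are hypothetical names, and there is no network, timeout, or crash recovery.

```javascript
// Hypothetical participant: canCommit simulates its vote in the Prepare phase.
class Participant {
  constructor(name, canCommit = true) {
    this.name = name;
    this.canCommit = canCommit;
    this.state = "idle"; // idle -> prepared -> committed | rolled_back
  }

  // Phase 1: a real node would lock rows and write to its log here.
  prepare() {
    if (!this.canCommit) {
      this.state = "rolled_back";
      return false; // vote NO
    }
    this.state = "prepared"; // locks are held until phase 2 arrives
    return true; // vote YES
  }

  commit() {
    this.state = "committed";
  }

  rollback() {
    this.state = "rolled_back";
  }
}

// Hypothetical coordinator: asks everyone, then applies one decision everywhere.
function twoPhaseCommit(participants) {
  // Phase 1 (Prepare): collect a vote from every participant.
  const votes = participants.map((p) => p.prepare());
  const allYes = votes.every((vote) => vote === true);

  // Phase 2 (Commit or Rollback): broadcast the decision.
  for (const p of participants) {
    if (allYes) p.commit();
    else p.rollback();
  }
  return allYes ? "COMMITTED" : "ROLLED_BACK";
}
```

Note that a participant in the `prepared` state cannot decide on its own; it has to wait for phase 2, which is where the blocking behavior comes from.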
Why 2PC fails in MSA:
- Availability: It is a blocking protocol. If the Coordinator or a participant becomes unreachable after participants have voted YES, the prepared participants must wait for the final decision. They hold onto their database locks until it arrives, which can be indefinitely. This kills system throughput.
- Latency: Every transaction needs two network round trips to all participants, and each phase is only as fast as its slowest node. In a distributed system over a network, this is often unacceptable.
- CAP Theorem: 2PC chooses Consistency over Availability. In modern web-scale apps, Availability usually wins.
3. The Saga Pattern
The Saga pattern is a widely used alternative to 2PC for distributed transactions in microservices.
A Saga is a sequence of local transactions. Ideally, each local transaction updates the database and publishes a message or event to trigger the next local transaction in the saga.
If a local transaction fails because it violates a business rule (e.g., Credit Card Declined), the Saga executes a series of Compensating Transactions that undo the changes that were made by the preceding local transactions.
Real-world Example: The E-commerce Order
Imagine an order flow:
1. Order Service: Create Order (Pending).
2. Payment Service: Charge Credit Card.
3. Inventory Service: Reserve Stock.
4. Delivery Service: Create Shipment.
Scenario: Inventory is Empty.
- Order Created ✅
- Payment Charged ✅
- Inventory Check -> FAIL (Out of Stock) ❌
- (Trigger Compensation)
- Refund Payment (Compensating Transaction for step 2) ↩️
- Cancel Order (Compensating Transaction for step 1) ↩️
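The out-of-stock scenario can be sketched as a list of steps, each pairing a local transaction with its compensating transaction. This is a minimal in-memory sketch under stated assumptions: `runSaga`, the step names, and the `orderState` object are hypothetical, and real compensations would be separate database transactions that can themselves fail and need retries.

```javascript
// Runs steps in order; on failure, compensates completed steps in reverse order.
function runSaga(steps, log) {
  const completed = [];
  for (const step of steps) {
    try {
      step.action();
      log.push(step.name);
      completed.push(step);
    } catch (err) {
      log.push(`${step.name} FAILED: ${err.message}`);
      for (const done of completed.reverse()) {
        done.compensate();
        log.push(`compensate ${done.name}`);
      }
      return "COMPENSATED";
    }
  }
  return "COMPLETED";
}

// Hypothetical state standing in for the three service databases.
const orderState = { order: null, charged: false, stock: 0 };
const sagaLog = [];

const orderSteps = [
  {
    name: "CreateOrder",
    action: () => { orderState.order = "PENDING"; },
    compensate: () => { orderState.order = "CANCELLED"; },
  },
  {
    name: "ChargePayment",
    action: () => { orderState.charged = true; },
    compensate: () => { orderState.charged = false; }, // refund
  },
  {
    name: "ReserveStock",
    action: () => {
      if (orderState.stock < 1) throw new Error("Out of Stock");
      orderState.stock -= 1;
    },
    compensate: () => { orderState.stock += 1; },
  },
];

const sagaResult = runSaga(orderSteps, sagaLog);
console.log(sagaResult, sagaLog);
```

The failing step itself is not compensated, because its local transaction never committed; only the steps before it are undone.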
Implementation Styles:
- Choreography: No central coordinator. Services listen to events.
  - Payment Svc listens to OrderCreated.
  - Inventory Svc listens to PaymentSuccess.
  - Good for simple workflows. Bad for complex ones (visualizing the flow is hard).
- Orchestration: A central "Saga Orchestrator" service tells participants what to do.
  - Orchestrator sends command ChargePayment to Payment Svc.
  - Orchestrator receives ChargeSuccess.
  - Orchestrator sends command ReserveStock to Inventory Svc.
  - Good for complex workflows. Easier to monitor and manage.
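To make the choreography style concrete, here is a minimal sketch using a synchronous in-memory event bus in place of a real broker. The `EventBus` class and the `StockReserved` event are assumptions added for illustration; `OrderCreated` and `PaymentSuccess` come from the list above.

```javascript
// Hypothetical in-memory stand-in for a message broker.
class EventBus {
  constructor() {
    this.handlers = new Map(); // event type -> list of handlers
    this.history = [];         // every published event type, in order
  }

  subscribe(type, handler) {
    if (!this.handlers.has(type)) this.handlers.set(type, []);
    this.handlers.get(type).push(handler);
  }

  publish(type, payload) {
    this.history.push(type);
    for (const handler of this.handlers.get(type) ?? []) handler(payload);
  }
}

const bus = new EventBus();

// Payment Svc listens to OrderCreated and reacts on its own.
bus.subscribe("OrderCreated", (order) => {
  bus.publish("PaymentSuccess", { orderId: order.orderId });
});

// Inventory Svc listens to PaymentSuccess.
bus.subscribe("PaymentSuccess", (payment) => {
  bus.publish("StockReserved", { orderId: payment.orderId });
});

// Order Svc starts the flow; no component knows the whole sequence.
bus.publish("OrderCreated", { orderId: "o-1" });
console.log(bus.history);
```

The workflow exists only implicitly in the subscriptions, which is why it becomes hard to visualize as the number of services grows. An orchestrator would instead hold the sequence in one place and send each command itself.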
4. The Transactional Outbox Pattern
A common beginner mistake is to update the DB and publish an event as two separate steps. The two steps are not atomic, so a failure between them leaves the system inconsistent:
// BAD
await db.update(); // Success
await kafka.publish(); // Fails? DB is updated but event is lost. System is inconsistent.
If you reverse it:
// BAD
await kafka.publish(); // Success
await db.update(); // Fails? Event is sent but local data is wrong.
The Transactional Outbox Pattern solves this.
- Instead of publishing to Kafka directly, write the message to an Outbox table in the same database transaction as your business data.
- Since it's the same DB, ACID guarantees both happen or neither happens.
- A separate process (Message Relay or CDC tool like Debezium) reads the Outbox table and pushes messages to Kafka asynchronously.
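The three steps above can be sketched with an in-memory database. Everything here is a hypothetical stand-in: `InMemoryDb.transaction` imitates an ACID transaction by committing a working copy only if the callback does not throw, and `relayOutbox` plays the role of the Message Relay with a `publish` callback in place of a Kafka producer.

```javascript
// Hypothetical database with two tables: orders and outbox.
class InMemoryDb {
  constructor() {
    this.orders = [];
    this.outbox = [];
  }

  // All-or-nothing: changes to the draft are saved only if fn completes.
  transaction(fn) {
    const draft = { orders: [...this.orders], outbox: [...this.outbox] };
    fn(draft); // if this throws, the lines below never run
    this.orders = draft.orders;
    this.outbox = draft.outbox;
  }
}

// Business write and outbox write happen in the same transaction.
function createOrder(db, order) {
  db.transaction((tx) => {
    tx.orders.push(order);
    tx.outbox.push({
      id: `evt-${order.id}`,
      type: "OrderCreated",
      payload: order,
      sent: false,
    });
  });
}

// Message relay: publishes unsent rows and marks them sent only on success.
function relayOutbox(db, publish) {
  let sentCount = 0;
  for (const row of db.outbox) {
    if (row.sent) continue;
    try {
      publish(row);
    } catch {
      break; // broker unavailable: the row stays unsent and is retried later
    }
    row.sent = true;
    sentCount += 1;
  }
  return sentCount;
}
```

Because a row is marked as sent only after publishing succeeds, a crash between the two can publish the same message twice. The relay therefore gives at-least-once delivery, and consumers need to tolerate duplicates.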
5. The Dual Write Problem
This "writing to DB and Kafka separately" issue is formally known as The Dual Write Problem.
It occurs whenever you try to modify two different storage systems in a single transaction without a distributed coordination mechanism.
Two reliable ways to solve this are:
- Outbox Pattern: As described above.
- Change Data Capture (CDC): Tools like Debezium read your database's transaction log (WAL in Postgres, binlog in MySQL) and stream every committed change to Kafka. This guarantees that "If it happened in the DB, it happens in Kafka", typically with at-least-once delivery. You don't even need code to dual-write; you just write to the DB, and the infrastructure handles the rest.
6. Modern Patterns: Reliable Event Processing
Once the message is in Kafka (thanks to Outbox/CDC), how do we consume it reliably?
Idempotency is key here: processing the same message twice must leave the system in the same state as processing it once.
The Consumer might crash after updating the DB but before committing the Kafka offset. When it restarts, it reads the same message again.
- Strategies:
  - Deduplication Table: Store processed message IDs in a separate table, written in the same transaction as the business update. Check existence before processing.
  - Upsert Operations: Use INSERT ... ON CONFLICT ... DO UPDATE (Postgres syntax) instead of a plain INSERT. Even if you run it 100 times, the result is the same.
  - Version Checking: Ensure the incoming message version is > current DB version to avoid out-of-order updates (a technique also common in Event Sourcing).
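The strategies above can be combined in one consumer. The following is a minimal in-memory sketch, assuming each message carries a unique `id`, an `entityId`, and a monotonically increasing `version`; those field names and the `IdempotentConsumer` class are hypothetical. In a real database, the deduplication insert and the upsert would run in one transaction.

```javascript
class IdempotentConsumer {
  constructor() {
    this.processedIds = new Set(); // stands in for the deduplication table
    this.rows = new Map();         // entityId -> { version, data }
  }

  handle(message) {
    // Deduplication: a redelivered message is acknowledged without side effects.
    if (this.processedIds.has(message.id)) return "DUPLICATE";

    // Version check: ignore messages older than what is already stored.
    const current = this.rows.get(message.entityId);
    if (current && message.version <= current.version) {
      this.processedIds.add(message.id);
      return "STALE";
    }

    // Upsert: insert or overwrite, so repeating it gives the same result.
    this.rows.set(message.entityId, {
      version: message.version,
      data: message.data,
    });
    this.processedIds.add(message.id);
    return "APPLIED";
  }
}
```

A crash after the database update but before the Kafka offset commit then only causes a harmless `DUPLICATE` on restart.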
7. Comparison Strategy
Which pattern should you choose?
| Feature | 2PC (XA) | Saga (Choreography) | Saga (Orchestration) | TCC (Try-Confirm/Cancel) |
|---|---|---|---|---|
| Consistency | Strong (ACID) | Eventual (BASE) | Eventual (BASE) | Stronger than Saga |
| Complexity | High (for infrastructure) | Medium (hard to debug) | High (central logic) | Very High (custom logic) |
| Performance | Low (Blocking) | High (Async) | High (Async) | Medium |
| Coupling | High | Low | Medium | High |
| Use Case | Banks, Legacy Systems | Independent Services | Complex Workflows | Booking Systems |
Bonus: The Modular Monolith
Before jumping to Sagas, ask yourself: "Do we really need to split these services?"
If Order and Payment are in the same Monolith (sharing a DB), you just use BEGIN TRANSACTION.
Don't distribute your transactions if you don’t have to.
Pro Tip: Identifying the Ghost in the machine
When using Sagas, debugging becomes a nightmare because logs are scattered.
Always generate a Correlation ID (e.g., X-Request-ID) at the initial gateway and pass it down to every event and service call.
Without this ID, connecting a log in the Order Service to an error in the Inventory Service is nearly impossible. Distributed tracing tools like Jaeger or Zipkin build on the same idea, propagating a trace ID with every call to visualize the full transaction trace.
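As a rough sketch of the idea, the gateway reuses or creates the ID, and every log line and event carries it. All function names here are hypothetical, and the ID generator is a simple stand-in for a proper UUID.

```javascript
// Stand-in ID generator; a real system would typically use a UUID.
function newCorrelationId() {
  return `${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 10)}`;
}

// Gateway: keep an incoming X-Request-ID, otherwise create one.
function withCorrelationId(headers) {
  const id = headers["x-request-id"] || newCorrelationId();
  return { ...headers, "x-request-id": id };
}

// Structured log line that always includes the correlation ID.
function logLine(service, correlationId, message) {
  return JSON.stringify({ service, correlationId, message });
}

// Events carry the ID so downstream services can keep passing it on.
function buildEvent(type, payload, correlationId) {
  return { type, payload, correlationId };
}

const gatewayHeaders = withCorrelationId({ "content-type": "application/json" });
const correlationId = gatewayHeaders["x-request-id"];

const orderLog = logLine("order-service", correlationId, "order created");
const orderCreated = buildEvent("OrderCreated", { orderId: "o-1" }, correlationId);
const inventoryLog = logLine("inventory-service", orderCreated.correlationId, "out of stock");

console.log(orderLog);
console.log(inventoryLog);
```

Searching the aggregated logs for one `correlationId` value then returns both lines, even though they were written by different services.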
8. The Challenge of Isolation
Sagas lack the 'I' (Isolation) in ACID.
Intermediate states are visible.
While the Saga is running (Order Created -> Payment Charged -> ...), a user might see their order as "Created" even if it eventually gets cancelled due to an inventory error.
Also, another transaction could read or modify the data in between steps, causing anomalies such as dirty reads and lost updates. These are the consequences of the lack of isolation.
Countermeasures:
- Semantic Lock: Set a flag on the record (e.g., status: ORDER_PROCESSING). Other transactions should check this flag and block or fail if they try to touch it.
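- Commutative Updates: Design operations so that the order in which they are applied doesn't matter. For example, a debit and a credit on an account balance give the same result in either order.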
- Pessimistic View: Show the user a "Processing..." state until the Saga is fully complete to hide the intermediate states.
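The semantic lock and the pessimistic view can be sketched together. This is a minimal illustration with hypothetical function names; only the `ORDER_PROCESSING` status comes from the text above, and a real implementation would set and check the flag inside database transactions.

```javascript
// Saga start: set the semantic lock on the record.
function beginSaga(order) {
  if (order.status === "ORDER_PROCESSING") {
    throw new Error("another saga already owns this order");
  }
  order.status = "ORDER_PROCESSING";
}

// Saga end: release the lock by writing the final status.
function endSaga(order, finalStatus) {
  order.status = finalStatus;
}

// Any other transaction checks the flag and fails fast if the record is locked.
function updateAddress(order, address) {
  if (order.status === "ORDER_PROCESSING") {
    return { ok: false, reason: "LOCKED_BY_SAGA" };
  }
  order.address = address;
  return { ok: true };
}

// Pessimistic view: hide intermediate states from the user.
function displayStatus(order) {
  return order.status === "ORDER_PROCESSING" ? "Processing..." : order.status;
}

const demoOrder = { id: "o-1", status: "NEW", address: "Old Street 1" };
beginSaga(demoOrder);
const duringSaga = updateAddress(demoOrder, "New Street 2"); // rejected
const shownDuringSaga = displayStatus(demoOrder);
endSaga(demoOrder, "CONFIRMED");
const afterSaga = updateAddress(demoOrder, "New Street 2"); // accepted
```

Failing fast keeps the example simple; blocking or retrying until the flag clears is the other option mentioned above.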
9. Eventual Consistency: The Trade-off
Choosing Sagas means choosing Eventual Consistency.
The system will be in a consistent state... eventually.
This requires a mindset shift.
Developers must design systems to tolerate temporary inconsistency.
Design UIs that are "Optimistic" or handle asynchronous states gracefully.
Instead of blocking the user with a spinner for 10 seconds, say "We received your order! We'll email you when it's confirmed."
This decouples the user experience from the complex backend transaction, providing a snappier, more resilient application.
Monolith vs. Microservices: The Transaction Tax
Remember, distributed transactions are a "tax" you pay for using Microservices.
In a Monolith, consistency is cheap (ACID). In Microservices, consistency is expensive (Sagas, Complexity, Latency).
If your business requires extremely strict consistency (e.g., real-time high-frequency trading), a Monolithic architecture or a shared database might actually be superior. Distributed systems are not a silver bullet; they solve scaling issues but introduce data integrity issues. Choose your poison wisely.
Complexity is shifted from the Database layer (ACID) to the Application layer (Sagas). This is the price of Microservices.