Postmortem: Post-Incident Analysis
Why Incidents Keep Repeating
When something breaks, finding the cause and preventing it from happening again matters. Yet in practice, most people move on once the system is back up. I was the same at first.
I came across postmortem culture while reading engineering blogs from Google and Netflix. What stood out was this: "We publish our post-incident analysis to the entire team." Rather than hiding mistakes, they shared them so the whole organization could learn. That approach clicked with me.
When you're writing your own code, you tend to repeat the same kinds of mistakes. DB connection pool exhaustion is a classic example. If connections aren't properly released, they pile up until the service stops. Once you've dealt with that kind of problem, you don't want it to happen again.
So I put together this write-up on postmortems: what they are, how to write one, and why it matters.
What Is a Postmortem?
Postmortem literally means "after death." In medicine it refers to an examination performed after death; in software development, it means a document that analyzes the cause of an incident and how to prevent it from recurring.
At first, I thought, "Why is writing an incident report important? I fixed it, that's enough." But reading through the Google SRE book, I came away with this idea:
"If you don't write it now, you'll make the same mistake again. And next time, it won't just affect you — the same pain gets passed on to whoever hits it next."
That hit home. The goal is to make sure what you learned the hard way doesn't have to be learned again.
Writing a Postmortem
Looking at templates published by Google and AWS, the structure breaks down like this:
1. Summary
First, write a summary visible at a glance:
Date: 2025-01-15
Time: 03:00 - 05:00 (KST)
Duration: 2 hours
Impact: All users unable to access service
Severity: Critical
Root Cause: DB connection pool exhaustion
Writing this gave me a clear picture of the entire incident.
2. Timeline
Next, write what happened in chronological order. In my case:
03:00 - Monitoring alert: Response time exceeded 5 seconds
03:02 - Log check: Multiple "Cannot get connection from pool" errors found
03:05 - DB connection pool status check: 100/100 (all in use)
03:10 - Service restart attempt → Failed (connections still exhausted)
03:20 - Code review: Missing connection return in finally block discovered
03:30 - Hotfix deployment started
04:00 - Deployment completed
04:10 - Manual DB connection pool reset
04:30 - Service normalization confirmed
05:00 - Monitoring normal values confirmed, incident declared over
Writing this out made it crystal clear what I did wrong. At 03:10, my first instinct was to restart the service and call it done. Even if the restart had worked, the leak would have exhausted the pool again.
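The timeline is also data. Here is a minimal sketch, assuming entries are written as "HH:MM - description" within a single day, that computes how long each step took so the slowest stretch stands out. The function names parseEntry and stepDurations are hypothetical, and the input is an excerpt of the timeline above.

```javascript
// Hypothetical helper: parse "HH:MM - description" into minutes since midnight.
function parseEntry(line) {
  const match = /^(\d{2}):(\d{2}) - (.+)$/.exec(line);
  if (!match) throw new Error(`Unrecognized timeline entry: ${line}`);
  const [, hours, minutes, description] = match;
  return { minute: Number(hours) * 60 + Number(minutes), description };
}

// For each entry except the last, compute the minutes elapsed until the next entry.
function stepDurations(lines) {
  const entries = lines.map(parseEntry);
  return entries.slice(0, -1).map((entry, i) => ({
    description: entry.description,
    minutes: entries[i + 1].minute - entry.minute,
  }));
}

// Excerpt of the incident timeline (assumes all entries fall on the same day).
const timeline = [
  '03:20 - Code review: Missing connection return in finally block discovered',
  '03:30 - Hotfix deployment started',
  '04:00 - Deployment completed',
  '04:10 - Manual DB connection pool reset',
];

const durations = stepDurations(timeline);
const slowest = durations.reduce((a, b) => (b.minutes > a.minutes ? b : a));
console.log(`${slowest.description}: ${slowest.minutes} min`);
// prints "Hotfix deployment started: 30 min"
```

In this excerpt the 30-minute hotfix deployment is the longest step, which is the kind of detail that is easy to miss when the timeline only lives in your head.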
3. Root Cause
The most important part. Why did this happen?
In my case:
// Problematic code
async function getUser(userId) {
  const connection = await pool.getConnection();
  try {
    const result = await connection.query('SELECT * FROM users WHERE id = ?', [userId]);
    return result[0];
  } catch (error) {
    console.error(error);
    throw error;
  }
  // Problem: Didn't call connection.release()!
}
There was no finally block calling connection.release(), so every call to this function leaked one connection. The function was called thousands of times a day, so the pool was eventually exhausted.
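To see why a missing release() eventually stops everything, here is a small simulation. FakePool is a hypothetical in-memory stand-in, not a real database driver. It only counts checked-out connections and rejects once the limit is reached, which is a simplifying assumption (a real pool may queue requests and time out instead).

```javascript
// Hypothetical in-memory pool used only for illustration; not a real DB driver.
class FakePool {
  constructor(limit) {
    this.limit = limit;
    this.inUse = 0;
  }

  async getConnection() {
    if (this.inUse >= this.limit) {
      throw new Error('Cannot get connection from pool');
    }
    this.inUse += 1;
    let released = false;
    return {
      query: async () => [{ id: 1 }],
      release: () => {
        if (released) return; // releasing twice is a no-op in this fake
        released = true;
        this.inUse -= 1;
      },
    };
  }
}

// Same shape as the problematic code: the connection is never released.
async function leakyGetUser(pool) {
  const connection = await pool.getConnection();
  const rows = await connection.query();
  return rows[0];
}

// Call the leaky function until the pool refuses; return the number of successful calls.
async function callsUntilExhausted(pool) {
  let calls = 0;
  try {
    while (true) {
      await leakyGetUser(pool);
      calls += 1;
    }
  } catch (error) {
    return calls;
  }
}

callsUntilExhausted(new FakePool(100)).then((calls) => {
  console.log(`Pool exhausted after ${calls} calls`);
});
```

With a limit of 100, matching the 100/100 seen at 03:05, the 101st call fails. At thousands of calls a day, hitting that limit was only a matter of time.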
Root Cause:
- Direct cause: Not returning DB connections after use
- Fundamental cause: Lack of understanding of connection management
- System cause: Absence of connection pool usage monitoring
Finding multiple levels of causes is important. If you just end with "code mistake," you'll make similar mistakes again.
4. Resolution
Write how you fixed it:
// Fixed code
async function getUser(userId) {
  const connection = await pool.getConnection();
  try {
    const result = await connection.query('SELECT * FROM users WHERE id = ?', [userId]);
    return result[0];
  } catch (error) {
    console.error(error);
    throw error;
  } finally {
    // Added: Always return connection
    connection.release();
  }
}
And going further, I created a helper function to prevent this mistake:
// Helper to automate connection management
async function withConnection(callback) {
  const connection = await pool.getConnection();
  try {
    return await callback(connection);
  } finally {
    connection.release(); // Automatically returned
  }
}

// Usage example
async function getUser(userId) {
  return withConnection(async (conn) => {
    const result = await conn.query('SELECT * FROM users WHERE id = ?', [userId]);
    return result[0];
  });
}
Now it doesn't matter if a developer forgets release(). As long as queries go through withConnection, the connection is returned automatically.
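One way to gain confidence in a helper like this is to check that the connection is released even when the callback throws. The sketch below re-declares withConnection with the pool passed in as a parameter (a change from the version above, made only so the example is self-contained) and uses a hypothetical stub pool, createStubPool, that counts acquisitions and releases.

```javascript
// Variant of withConnection that takes the pool as an argument so it can be tested in isolation.
async function withConnection(pool, callback) {
  const connection = await pool.getConnection();
  try {
    return await callback(connection);
  } finally {
    connection.release();
  }
}

// Hypothetical stub pool: hands out fake connections and counts how many were released.
function createStubPool() {
  const stats = { acquired: 0, released: 0 };
  return {
    stats,
    async getConnection() {
      stats.acquired += 1;
      return {
        query: async () => [{ id: 42 }],
        release: () => { stats.released += 1; },
      };
    },
  };
}

async function demo() {
  const pool = createStubPool();

  // Success path: the result comes back and the connection is released.
  const user = await withConnection(pool, async (conn) => (await conn.query())[0]);

  // Failure path: the error still propagates, and the connection is still released.
  let failed = false;
  try {
    await withConnection(pool, async () => { throw new Error('query failed'); });
  } catch (error) {
    failed = true;
  }

  return { user, failed, ...pool.stats };
}

demo().then((result) => console.log(result));
```

After both calls, acquired and released are both 2, which is the property the original getUser lacked.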
5. Action Items
Just as important as the root cause. What will we do to prevent recurrence?
My action items were:
Immediate (within 1 week):
- Add finally block to all DB query code (Owner: Me, Done: 1/16)
- Introduce withConnection helper function (Owner: Me, Done: 1/17)
- Add connection pool usage monitoring (Owner: Me, Done: 1/18)
Short-term (within 1 month):
- Add auto-scaling logic for connection pool (Owner: Team, Due: 2/15)
- Add "resource return" item to code review checklist (Owner: Team Lead, Due: 2/1)
Long-term (within 3 months):
- Review ORM adoption (automate connection management) (Owner: Team, Due: 4/15)
- Write incident response playbook (Owner: Team, Due: 4/30)
Clearly specifying owner and deadline is important. Otherwise, it becomes "let's do it later," and it never gets done.
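Owner and deadline can also be checked mechanically. Here is a minimal sketch, assuming action items are kept as plain objects with title, owner, and due fields and that due dates use YYYY-MM-DD. That shape is an assumption made for this example, not a standard format, and the third item is invented to show a failing case.

```javascript
// Due dates are expected as YYYY-MM-DD (an assumption of this example).
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// Return a list of human-readable problems found in the action items.
function findIncompleteItems(items) {
  const problems = [];
  for (const item of items) {
    if (typeof item.owner !== 'string' || item.owner.trim() === '') {
      problems.push(`"${item.title}" has no owner`);
    }
    if (typeof item.due !== 'string' || !ISO_DATE.test(item.due)) {
      problems.push(`"${item.title}" has no valid due date`);
    }
  }
  return problems;
}

const actionItems = [
  { title: 'Add resource return item to code review checklist', owner: 'Team Lead', due: '2025-02-01' },
  { title: 'Write incident response playbook', owner: 'Team', due: '2025-04-30' },
  { title: 'Clean up old alerts', owner: '', due: 'later' }, // invented, deliberately incomplete
];

console.log(findIncompleteItems(actionItems));
```

Only the invented third item is reported, once for the missing owner and once for the "later" deadline.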
Postmortem Principle: No Blame
One principle that comes up consistently in postmortem culture: "No Blame Culture" — don't blame people.
At first, I didn't get it. "I made the mistake, isn't it my fault?" But reading through Google's SRE practices, this framing stuck:
"Yes, you made a mistake. But ask yourself why that mistake was possible. Had you ever been taught proper connection management? Was there monitoring? Did code review catch it? The system is the problem."
That reframing matters. If you blame people, they hide mistakes. And when mistakes get hidden, the whole system misses a chance to improve.
Core principles of postmortems:
1. Improve the system, not people
- Not good: "John didn't return the connection"
- Better: "There was no system to enforce connection return"
2. Share transparently
- Postmortems should be readable by everyone involved
- Don't hide mistakes — publish them
- So the same mistake isn't made twice by different people
3. Focus on learning
- "What did we learn this time?"
- "How can we do better next time?"
After Writing a Postmortem
After going through the postmortem format for the first time, a few things changed.
1. Stopped repeating the same mistakes
- Following through on action items actually prevented similar problems
- Monitoring surfaced issues earlier
2. Response got faster
- Writing out the timeline clarified the right order of steps
- Next time a similar situation came up, the response was much calmer
3. Root cause analysis got deeper
- Stopped settling for "code mistake" and traced back to system-level causes
- Action items became more concrete and actually executable
4. Less anxiety about mistakes
- Realized that documenting mistakes openly is better for growth than hiding them
- It took practice to really take "mistakes are learning opportunities" to heart
Good Postmortem vs Bad Postmortem
After writing several postmortems, I learned the difference between good and bad ones.
Bad Postmortem
Title: DB Incident
Problem: DB went down
Cause: Connection issue
Solution: Restarted
Prevention: Be careful
This is useless. Not specific, nothing to learn.
Good Postmortem
Title: 2-Hour Service Outage Due to DB Connection Pool Exhaustion
Summary:
- Date: 2025-01-15 03:00-05:00
- Impact: All users (about 1,000 people)
- Root Cause: Missing connection return in finally block
Timeline: (Detailed time-based records)
Root Cause Analysis:
- Direct cause: Missing connection.release()
- System cause: Absence of connection pool monitoring
- Process cause: Code review didn't check resource management
Resolution:
- Immediate: Add finally block
- Short-term: Introduce withConnection helper
- Long-term: Review ORM adoption
Action Items: (Owner and deadline specified)
Lessons Learned:
- Resources must be returned
- Without monitoring, problems are detected late
- Automation is safer than manual management
See the difference? Good postmortems are specific, actionable, and educational.
One-Line Summary
A postmortem is a document that analyzes the cause of an incident and helps prevent recurrence. Write the timeline, root cause, resolution, and action items in specific terms, focus on improving the system rather than blaming people, and share the result openly so the whole organization can learn. A culture that publishes mistakes instead of hiding them builds stronger systems.
Walkthrough: A Real Example
6. Digging Deeper with "5 Whys"
To avoid concluding with just "it was a mistake," the '5 Whys' technique is highly effective.
- Why did the service stop? -> No DB connections available.
- Why no connections? -> Because release() wasn't called.
- Why wasn't release() called? -> The finally block was missing, so it was skipped during error.
- Why didn't tests catch this? -> Local dev environment had a large connection pool, masking the leak.
- Why was the alert late? -> No threshold alert configured for connection count.
Asking "Why" 5 times reveals the real root causes (Environment difference, Lack of monitoring).
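The last "why" suggests a concrete guard: alert before the pool is full. Here is a minimal sketch, assuming you can read the number of in-use connections and the pool limit from your driver or metrics system (how to obtain them varies by library and is not shown). The function name poolAlertLevel and the 80% and 95% thresholds are illustrative choices, not values from the incident.

```javascript
// Classify pool usage into an alert level.
// The default thresholds are illustrative, chosen only for this example.
function poolAlertLevel(inUse, limit, { warnAt = 0.8, criticalAt = 0.95 } = {}) {
  if (limit <= 0) throw new RangeError('limit must be positive');
  const usage = inUse / limit;
  if (usage >= criticalAt) return 'critical';
  if (usage >= warnAt) return 'warning';
  return 'ok';
}

console.log(poolAlertLevel(40, 100));  // ok
console.log(poolAlertLevel(85, 100));  // warning
console.log(poolAlertLevel(100, 100)); // critical
```

With a check like this running periodically, a steady leak would raise a warning while free connections still remain, rather than being noticed only at 100/100.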
7. Metrics: MTTD and MTTR
To improve incident response, focus on reducing two numbers:
- MTTD (Mean Time To Detection): Time from incident start to detection. (Goal: < 5 mins)
- MTTR (Mean Time To Recovery): Time from detection to recovery. (Goal: < 30 mins)
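In this incident, detection took about 2 minutes (great), but recovery took 2 hours (terrible). Strictly speaking, MTTD and MTTR are averages over many incidents, so these are single data points. My next goal is to reduce recovery time. Having a pre-written playbook helps you recover quickly without panic.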
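The arithmetic behind the two metrics is simple. Here is a minimal sketch, assuming each incident record carries startedAt, detectedAt, and recoveredAt timestamps (hypothetical field names). The two incidents below are made-up sample data, not measurements from this outage.

```javascript
// Mean of a non-empty list of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Minutes between two ISO 8601 timestamps.
function minutesBetween(from, to) {
  return (Date.parse(to) - Date.parse(from)) / 60000;
}

// MTTD: incident start -> detection. MTTR: detection -> recovery (the definitions used in this post).
function incidentMetrics(incidents) {
  return {
    mttdMinutes: mean(incidents.map((i) => minutesBetween(i.startedAt, i.detectedAt))),
    mttrMinutes: mean(incidents.map((i) => minutesBetween(i.detectedAt, i.recoveredAt))),
  };
}

// Made-up sample incidents, for illustration only.
const incidents = [
  { startedAt: '2025-01-01T10:00:00Z', detectedAt: '2025-01-01T10:02:00Z', recoveredAt: '2025-01-01T12:02:00Z' },
  { startedAt: '2025-02-01T09:00:00Z', detectedAt: '2025-02-01T09:08:00Z', recoveredAt: '2025-02-01T09:38:00Z' },
];

console.log(incidentMetrics(incidents)); // { mttdMinutes: 5, mttrMinutes: 75 }
```

Detection times of 2 and 8 minutes average to 5, and recovery times of 120 and 30 minutes average to 75, so one slow recovery dominates the mean.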
One-Line Summary
Postmortems are about fixing the "System," not blaming the "Person." Use the 5 Whys to find root causes, track MTTD/MTTR metrics, and execute specific Action Items to prevent recurrence. Hidden failures repeat; shared failures become assets.