
Postmortem: Post-Incident Analysis
Postmortem purpose and writing method


When something breaks, finding the cause and preventing it from happening again matters. Yet in practice, most people move on once the system is back up. I was the same at first.
I came across postmortem culture while reading engineering blogs from Google and Netflix. What stood out was this: "We publish our post-incident analysis to the entire team." Rather than hiding mistakes, they shared them so the whole organization could learn. That approach clicked with me.
When you're writing your own code, you tend to repeat the same kinds of mistakes. DB connection pool exhaustion is a classic example. If connections aren't properly released, they pile up until the service stops. Once you've dealt with that kind of problem, you don't want it to happen again.
So I put together this write-up on postmortems: what they are, how to write one, and why it matters.
Postmortem literally means "after death examination." It's a medical term, but in development, it means a document to analyze the cause after an incident and prevent recurrence.
At first, I thought, "Why is writing an incident report important? I fixed it, that's enough." But reading through the Google SRE book, I found this:
"If you don't write it now, you'll make the same mistake again. And next time, it won't just affect you — the same pain gets passed on to whoever hits it next."
That hit home. The goal is to make sure what you learned the hard way doesn't have to be learned again.
Looking at templates published by Google and AWS, the structure breaks down like this:
First, write a summary visible at a glance:
Date: 2025-01-15
Time: 03:00 - 05:00 (KST)
Duration: 2 hours
Impact: All users unable to access service
Severity: Critical
Root Cause: DB connection pool exhaustion
Writing this gave me a clear picture of the entire incident.
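One optional habit I've found useful (my own suggestion, not part of the Google/AWS templates): keep the same summary fields as structured data in the repo, so past incidents can be searched and compared later. A minimal JavaScript sketch:

// Hypothetical structured version of the summary above
const incidentSummary = {
  id: '2025-01-15-db-connection-pool-exhaustion',
  date: '2025-01-15',
  window: { start: '03:00', end: '05:00', timezone: 'KST' },
  durationMinutes: 120,
  impact: 'All users unable to access the service',
  severity: 'Critical',
  rootCause: 'DB connection pool exhaustion',
};

// e.g. dump all summaries to JSON and filter by severity or root cause later
console.log(JSON.stringify(incidentSummary, null, 2));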
Next, write what happened in chronological order. In my case:
03:00 - Monitoring alert: Response time exceeded 5 seconds
03:02 - Log check: Multiple "Cannot get connection from pool" errors found
03:05 - DB connection pool status check: 100/100 (all in use)
03:10 - Service restart attempt → Failed (connections still exhausted)
03:20 - Code review: Missing connection return in finally block discovered
03:30 - Hotfix deployment started
04:00 - Deployment completed
04:10 - Manual DB connection pool reset
04:30 - Service normalization confirmed
05:00 - Monitoring normal values confirmed, incident declared over
Writing this out made what I did wrong crystal clear. At 03:10, I almost just restarted the service and called it done. If I had, it would have crashed again.
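As a side note, the pool status check from 03:05 doesn't have to be a manual step. Here's a rough sketch of what automating it could look like; this is my own illustration, assuming MySQL and the mysql2/promise client (the actual driver and monitoring stack aren't named in this post):

// Sketch: periodically sample server-side connection counters for monitoring
const mysql = require('mysql2/promise');

const pool = mysql.createPool({
  host: 'localhost',
  user: 'app',
  database: 'app',
  connectionLimit: 100, // the same limit that hit 100/100 at 03:05
});

// Every 30 seconds, read Threads_connected / Threads_running and log them.
// Shipping these numbers to your monitoring system lets you alert on
// saturation (e.g. connected > 80 for 5 minutes) before requests start failing.
setInterval(async () => {
  const [rows] = await pool.query(
    "SHOW STATUS WHERE Variable_name IN ('Threads_connected', 'Threads_running')"
  );
  const stats = Object.fromEntries(rows.map((r) => [r.Variable_name, Number(r.Value)]));
  console.log(`[db-health] connected=${stats.Threads_connected} running=${stats.Threads_running}`);
}, 30 * 1000);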
The most important part. Why did this happen?
In my case:
// Problematic code
async function getUser(userId) {
  const connection = await pool.getConnection();
  try {
    const result = await connection.query('SELECT * FROM users WHERE id = ?', [userId]);
    return result[0];
  } catch (error) {
    console.error(error);
    throw error;
  }
  // Problem: Didn't call connection.release()!
}
I never called connection.release() (there was no finally block at all), so connections kept piling up. This function was called thousands of times a day, and since connections were never returned, the pool eventually got exhausted.
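To make the failure mode concrete, here's a minimal reproduction sketch (my own, not code from the incident; it assumes mysql2/promise and the same users table): shrink the pool to 5 connections, leak one per call, and the sixth call hangs forever waiting for a free connection, which is exactly what production hit at 100/100.

// Reproduction sketch: one leaked connection per call exhausts the pool
const mysql = require('mysql2/promise');

const pool = mysql.createPool({
  host: 'localhost',
  user: 'app',
  database: 'app',
  connectionLimit: 5, // tiny limit so the leak shows up after a handful of calls
});

async function leakyGetUser(userId) {
  const connection = await pool.getConnection();
  const result = await connection.query('SELECT * FROM users WHERE id = ?', [userId]);
  // connection.release() is never called, so this connection stays checked out
  return result[0];
}

async function main() {
  for (let i = 1; i <= 6; i++) {
    console.log(`call ${i}`);
    await leakyGetUser(i); // the 6th call never resolves: all 5 connections are leaked
  }
}

main();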
Finding multiple levels of causes is important. If you just end with "code mistake," you'll make similar mistakes again.
Write how you fixed it:
// Fixed code
async function getUser(userId) {
  const connection = await pool.getConnection();
  try {
    const result = await connection.query('SELECT * FROM users WHERE id = ?', [userId]);
    return result[0];
  } catch (error) {
    console.error(error);
    throw error;
  } finally {
    // Added: Always return connection
    connection.release();
  }
}
And going further, I created a helper function to prevent this mistake:
// Helper to automate connection management
async function withConnection(callback) {
  const connection = await pool.getConnection();
  try {
    return await callback(connection);
  } finally {
    connection.release(); // Automatically returned
  }
}

// Usage example
async function getUser(userId) {
  return withConnection(async (conn) => {
    const result = await conn.query('SELECT * FROM users WHERE id = ?', [userId]);
    return result[0];
  });
}
Now it's okay if developers forget release(). It's handled automatically.
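Going one step further is optional, but the same pattern extends naturally to transactions, where forgetting a rollback or a release is just as easy. This is my own addition rather than part of the original fix, and it assumes a promise-based driver such as mysql2/promise whose connections expose beginTransaction/commit/rollback:

// withTransaction: commit on success, roll back on error, always release
async function withTransaction(callback) {
  const connection = await pool.getConnection();
  try {
    await connection.beginTransaction();
    const result = await callback(connection);
    await connection.commit();
    return result;
  } catch (error) {
    await connection.rollback();
    throw error;
  } finally {
    connection.release(); // returned to the pool no matter what happened
  }
}

// Usage example (hypothetical points column, for illustration only)
async function transferPoints(fromId, toId, amount) {
  return withTransaction(async (conn) => {
    await conn.query('UPDATE users SET points = points - ? WHERE id = ?', [amount, fromId]);
    await conn.query('UPDATE users SET points = points + ? WHERE id = ?', [amount, toId]);
  });
}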
Just as important: what will we do to prevent recurrence?
My action items were:
- Immediate (within 1 week): deploy the finally-block fix and add a connection pool usage alert
- Short-term: introduce the withConnection helper and add resource management to the code review checklist
- Long-term: review ORM adoption

Clearly specifying an owner and a deadline for each item is important. Otherwise it becomes "let's do it later," and it never gets done.
To avoid concluding with just "it was a mistake," the '5 Whys' technique is highly effective.
Why did the service go down? -> The DB connection pool was exhausted.
Why was the pool exhausted? -> Connections were never returned; release() wasn't called.
Why wasn't release() called? -> There was no finally block, so the release was skipped when an error was thrown.
Why did it only surface in production? -> Environment difference: the dev environment never generated enough traffic to drain the pool.
Why wasn't it caught earlier? -> There was no connection pool monitoring, so usage climbing toward the limit was invisible.
Asking "Why" five times reveals the real root causes (environment differences, lack of monitoring), not just "a code mistake."
To improve incident response, focus on reducing two numbers:
- MTTD (Mean Time To Detect): how long it takes to notice that something is wrong
- MTTR (Mean Time To Recover): how long it takes to restore normal service once the failure starts
In this incident, my MTTD was 2 minutes (great), but my MTTR was 2 hours (terrible). My next goal is to reduce MTTR. Having a pre-written playbook helps you recover quickly without panicking.
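As a toy illustration (my own, using the timeline above and my reading of when detection "counts"), both numbers fall straight out of the timestamps, which makes them easy to compute and track for every incident:

// Deriving MTTD and MTTR for this incident from the timeline (KST)
const minutesBetween = (a, b) => (new Date(b) - new Date(a)) / 60000;

const failureStart = '2025-01-15T03:00:00+09:00'; // response times start degrading
const detected     = '2025-01-15T03:02:00+09:00'; // cause confirmed in the logs
const recovered    = '2025-01-15T05:00:00+09:00'; // incident declared over

console.log('MTTD (min):', minutesBetween(failureStart, detected));  // 2
console.log('MTTR (min):', minutesBetween(failureStart, recovered)); // 120, i.e. 2 hours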
One principle that comes up consistently in postmortem culture: "No Blame Culture" — don't blame people.
At first, I didn't get it. "I made the mistake, isn't it my fault?" But reading through Google's SRE practices, this framing stuck:
"Yes, you made a mistake. But ask yourself why that mistake was possible. Had you ever been taught proper connection management? Was there monitoring? Did code review catch it? The system is the problem."
That reframing matters. If you blame people, they hide mistakes. And when mistakes get hidden, the whole system misses a chance to improve.
Core principles of postmortems:
1. Improve the system, not the people
2. Share the write-up transparently so the whole organization can learn
3. Turn the lessons into specific action items with owners

After going through the postmortem format for the first time, a few things changed.
The biggest change: I stopped repeating the same mistakes. After writing several postmortems, I also learned the difference between a good one and a bad one. A bad postmortem looks like this:
Title: DB Incident
Problem: DB went down
Cause: Connection issue
Solution: Restarted
Prevention: Be careful
This is useless. Not specific, nothing to learn.
And a good one:
Title: 2-hour Service Down Due to DB Connection Pool Exhaustion
Summary:
- Date: 2025-01-15 03:00-05:00
- Impact: All users (about 1,000 people)
- Root Cause: Missing connection return in finally block
Timeline: (Detailed time-based records)
Root Cause Analysis:
- Direct cause: Missing connection.release()
- System cause: Absence of connection pool monitoring
- Process cause: Code review didn't check resource management
Resolution:
- Immediate: Add finally block
- Short-term: Introduce withConnection helper
- Long-term: Review ORM adoption
Action Items: (Owner and deadline specified)
Lessons Learned:
- Resources must be returned
- Without monitoring, problems are detected late
- Automation is safer than manual management
See the difference? Good postmortems are specific, actionable, and educational.
Postmortems are about fixing the "System," not blaming the "Person." Use the 5 Whys to find root causes, track MTTD/MTTR metrics, and execute specific Action Items to prevent recurrence. Hidden failures repeat; shared failures become assets.