I·17 · DEVOPS · 2025.09.25 · 9 MIN READ

Postmortem: Post-Incident Analysis

The purpose of a postmortem and how to write one

codemapo

INTERDISCIPLINARY DEV · SEOUL

Why Incidents Keep Repeating

When something breaks, finding the cause and preventing it from happening again matters. Yet in practice, most people move on once the system is back up. I was the same at first.

I came across postmortem culture while reading engineering blogs from Google and Netflix. What stood out was this: "We publish our post-incident analysis to the entire team." Rather than hiding mistakes, they shared them so the whole organization could learn. That approach clicked with me.

When you're writing your own code, you tend to repeat the same kinds of mistakes. DB connection pool exhaustion is a classic example. If connections aren't properly released, they pile up until the service stops. Once you've dealt with that kind of problem, you don't want it to happen again.

So I put together this write-up on postmortems: what they are, how to write one, and why it matters.

What Is a Postmortem?

Postmortem literally means "after death." In medicine it refers to an examination performed after death; in software development, it is a document that analyzes the cause of an incident and lays out how to prevent recurrence.

At first, I thought, "Why is writing an incident report important? I fixed it, that's enough." But reading through the Google SRE book, the message I took away was roughly this:

"If you don't write it now, you'll make the same mistake again. And next time, it won't just affect you — the same pain gets passed on to whoever hits it next."

That hit home. The goal is to make sure what you learned the hard way doesn't have to be learned again.

Writing a Postmortem

Looking at templates published by Google and AWS, the structure breaks down like this:

1. Summary

First, write a summary that can be read at a glance:

Date: 2025-01-15
Time: 03:00 - 05:00 (KST)
Duration: 2 hours
Impact: All users unable to access service
Severity: Critical
Root Cause: DB connection pool exhaustion

Writing this gave me a clear picture of the entire incident.

2. Timeline

Next, write what happened in chronological order. In my case:

03:00 - Monitoring alert: Response time exceeded 5 seconds
03:02 - Log check: Multiple "Cannot get connection from pool" errors found
03:05 - DB connection pool status check: 100/100 (all in use)
03:10 - Service restart attempt → Failed (connections still exhausted)
03:20 - Code review: Missing connection return in finally block discovered
03:30 - Hotfix deployment started
04:00 - Deployment completed
04:10 - Manual DB connection pool reset
04:30 - Service normalization confirmed
05:00 - Monitoring normal values confirmed, incident declared over

Writing this out made it crystal clear what I did wrong. At 03:10, I almost treated a service restart as the fix. Even if it had worked, the service would have crashed again once the leaked connections piled back up.

3. Root Cause

The most important part. Why did this happen?

In my case:

// Problematic code
async function getUser(userId) {
  const connection = await pool.getConnection();
  try {
    const result = await connection.query('SELECT * FROM users WHERE id = ?', [userId]);
    return result[0];
  } catch (error) {
    console.error(error);
    throw error;
  }
  // Problem: Didn't call connection.release()!
}

There was no finally block and no call to connection.release(), so every call left a connection checked out. This function was called thousands of times a day, and since connections were never returned, the pool was eventually exhausted.
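
To make the failure mode concrete, here is a minimal sketch that uses a tiny in-memory stand-in for a pool. FakePool, leakyGetUser, and callsUntilExhausted are hypothetical names made up for illustration, not the API of any real driver. Real pools usually queue or time out rather than failing immediately; the sketch fails fast to stay short.

```javascript
// Hypothetical in-memory pool, for illustration only (not a real driver API).
class FakePool {
  constructor(size) {
    this.size = size;
    this.inUse = 0;
  }

  async getConnection() {
    if (this.inUse >= this.size) {
      throw new Error('Cannot get connection from pool');
    }
    this.inUse += 1;
    let released = false;
    return {
      query: async () => [{ id: 1, name: 'demo' }],
      release: () => {
        // Guard against double release so the counter stays accurate.
        if (!released) {
          released = true;
          this.inUse -= 1;
        }
      },
    };
  }
}

// Same shape as the problematic code: the connection is never released.
async function leakyGetUser(pool, userId) {
  const connection = await pool.getConnection();
  const result = await connection.query('SELECT * FROM users WHERE id = ?', [userId]);
  return result[0];
}

// Calls leakyGetUser until the pool refuses, and returns how many calls succeeded.
async function callsUntilExhausted(pool) {
  let succeeded = 0;
  try {
    while (succeeded < 1000) {
      await leakyGetUser(pool, succeeded);
      succeeded += 1;
    }
  } catch (error) {
    // Expected once every connection has leaked.
  }
  return succeeded;
}
```

With a pool of size N, exactly N calls succeed before the pool is stuck. Running a check like this against a deliberately small pool is one way to surface a leak long before production traffic does.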

Root Cause:

Direct cause: Not returning DB connections after use
Fundamental cause: Lack of understanding of connection management
System cause: Absence of connection pool usage monitoring

Finding causes at multiple levels is important. If you stop at "code mistake," you'll make similar mistakes again.

4. Resolution

Write how you fixed it:

// Fixed code
async function getUser(userId) {
  const connection = await pool.getConnection();
  try {
    const result = await connection.query('SELECT * FROM users WHERE id = ?', [userId]);
    return result[0];
  } catch (error) {
    console.error(error);
    throw error;
  } finally {
    // Added: Always return connection
    connection.release();
  }
}

And going further, I created a helper function to prevent this mistake:

// Helper to automate connection management
async function withConnection(callback) {
  const connection = await pool.getConnection();
  try {
    return await callback(connection);
  } finally {
    connection.release(); // Automatically returned
  }
}

// Usage example
async function getUser(userId) {
  return withConnection(async (conn) => {
    const result = await conn.query('SELECT * FROM users WHERE id = ?', [userId]);
    return result[0];
  });
}

Now developers who go through withConnection can't forget release(). The helper handles it automatically, on success and on error.
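
One way to convince yourself the helper holds up is to run it against a stub pool and confirm that release() is called on both the success path and the failure path. The stub pool and the stats counter below are assumptions made for this sketch; withConnection is repeated from above so the example is self-contained.

```javascript
// Stub pool that counts checkouts and releases (hypothetical, for illustration).
const stats = { acquired: 0, released: 0 };
const pool = {
  async getConnection() {
    stats.acquired += 1;
    return {
      query: async () => [{ id: 1 }],
      release: () => {
        stats.released += 1;
      },
    };
  },
};

// Same helper as above, repeated so the sketch runs on its own.
async function withConnection(callback) {
  const connection = await pool.getConnection();
  try {
    return await callback(connection);
  } finally {
    connection.release();
  }
}

async function demo() {
  // Success path: the callback's return value is passed through.
  const rows = await withConnection((conn) => conn.query('SELECT 1'));

  // Failure path: the error still propagates, but release() runs first.
  let caught = null;
  try {
    await withConnection(async () => {
      throw new Error('query failed');
    });
  } catch (error) {
    caught = error.message;
  }

  return { rows, caught, acquired: stats.acquired, released: stats.released };
}
```

If acquired and released match after both calls, the helper returned every connection it checked out, including the one whose callback threw.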

5. Action Items

Just as important as the root cause. What will we do to prevent recurrence?

My action items were:

Immediate (within 1 week):

Add finally block to all DB query code (Owner: Me, Done: 1/16)
Introduce withConnection helper function (Owner: Me, Done: 1/17)
Add connection pool usage monitoring (Owner: Me, Done: 1/18)

Short-term (within 1 month):

Add auto-scaling logic for connection pool (Owner: Team, Due: 2/15)
Add "resource return" item to code review checklist (Owner: Team Lead, Due: 2/1)

Long-term (within 3 months):

Review ORM adoption (automate connection management) (Owner: Team, Due: 4/15)
Write incident response playbook (Owner: Team, Due: 4/30)

Clearly specifying owner and deadline is important. Otherwise, it becomes "let's do it later," and it never gets done.
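
If action items live as data rather than prose, checking for missing owners or overdue work becomes mechanical. The sketch below is one possible shape, not a standard format: the field names (title, owner, due, doneOn) are assumptions, and the year 2025 is inferred from the incident date. Dates are ISO strings so that plain string comparison orders them correctly.

```javascript
// Action items from this incident as records. Field names are assumptions.
const actionItems = [
  { title: 'Add finally block to all DB query code', owner: 'Me', due: null, doneOn: '2025-01-16' },
  { title: 'Introduce withConnection helper function', owner: 'Me', due: null, doneOn: '2025-01-17' },
  { title: 'Add connection pool usage monitoring', owner: 'Me', due: null, doneOn: '2025-01-18' },
  { title: 'Add auto-scaling logic for connection pool', owner: 'Team', due: '2025-02-15', doneOn: null },
  { title: 'Add "resource return" item to code review checklist', owner: 'Team Lead', due: '2025-02-01', doneOn: null },
  { title: 'Review ORM adoption', owner: 'Team', due: '2025-04-15', doneOn: null },
  { title: 'Write incident response playbook', owner: 'Team', due: '2025-04-30', doneOn: null },
];

// Open items whose deadline is before `today` (an ISO date string).
function overdueItems(items, today) {
  return items.filter((item) => !item.doneOn && item.due !== null && item.due < today);
}

// Items that would drift into "let's do it later": no owner, or open with no deadline.
function itemsMissingOwnerOrDeadline(items) {
  return items.filter((item) => !item.owner || (!item.doneOn && !item.due));
}
```

For example, overdueItems(actionItems, '2025-02-10') returns only the code review checklist item, because its deadline has passed and it has no completion date.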

6. Digging Deeper with "5 Whys"

To avoid stopping at "it was a mistake," the 5 Whys technique is highly effective.

Why did the service stop? -> No DB connections were available.
Why were there no connections? -> Because release() wasn't called.
Why wasn't release() called? -> There was no finally block, so nothing guaranteed the release.
Why didn't tests catch this? -> The local dev environment had a large connection pool, which masked the leak.
Why was the alert late? -> No threshold alert was configured for connection count.

Asking "why" five times reveals the real root causes: the difference between environments and the lack of monitoring.
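
The last "why" points at a missing threshold alert. A minimal sketch of such a check is below. checkPoolUsage is a hypothetical helper, and the 80% and 95% thresholds are illustrative assumptions rather than recommended values. How you read the in-use count depends on your driver, so the function only takes the two numbers.

```javascript
// Classifies a pool usage snapshot against warning and critical thresholds.
// The default thresholds are assumptions for this sketch; tune them per service.
function checkPoolUsage(inUse, max, { warnAt = 0.8, criticalAt = 0.95 } = {}) {
  if (max <= 0) {
    throw new RangeError('max must be positive');
  }
  const ratio = inUse / max;
  if (ratio >= criticalAt) return { level: 'critical', ratio };
  if (ratio >= warnAt) return { level: 'warning', ratio };
  return { level: 'ok', ratio };
}
```

In this incident the pool sat at 100/100, which this check classifies as critical. A warning at 80 in-use connections would have fired before users saw errors.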

7. Metrics: MTTD and MTTR

To improve incident response, focus on reducing two numbers:

MTTD (Mean Time To Detect): Average time from incident start to detection. (Goal: < 5 mins)
MTTR (Mean Time To Recovery): Average time from detection to recovery. (Goal: < 30 mins)

In this incident, detection took about 2 minutes (great), but recovery took 2 hours (terrible). Both metrics are averages over many incidents, so this one is a single data point for each. My next goal is to reduce recovery time. Having a pre-written playbook helps you recover quickly without panic.
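
Because both metrics are means, they only become meaningful once you record start, detection, and recovery times for every incident. The sketch below shows the arithmetic, using the definitions above. The two incident records are made-up sample values, not data from this article, and the HH:MM parsing assumes each incident starts and ends on the same day.

```javascript
// Parses 'HH:MM' into minutes since midnight (same-day incidents only).
function toMinutes(hhmm) {
  const [hours, minutes] = hhmm.split(':').map(Number);
  return hours * 60 + minutes;
}

// Time to detect and time to recover for one incident, in minutes.
function incidentDurations({ started, detected, recovered }) {
  return {
    detectMinutes: toMinutes(detected) - toMinutes(started),
    recoverMinutes: toMinutes(recovered) - toMinutes(detected),
  };
}

function mean(values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// MTTD and MTTR over a list of incidents, in minutes.
function mttdAndMttr(incidents) {
  const durations = incidents.map(incidentDurations);
  return {
    mttd: mean(durations.map((d) => d.detectMinutes)),
    mttr: mean(durations.map((d) => d.recoverMinutes)),
  };
}

// Hypothetical sample data for illustration.
const sampleIncidents = [
  { started: '03:00', detected: '03:02', recovered: '05:02' },
  { started: '14:00', detected: '14:08', recovered: '14:38' },
];
```

For the sample data, detection took 2 and 8 minutes and recovery took 120 and 30 minutes, so mttdAndMttr(sampleIncidents) gives an MTTD of 5 minutes and an MTTR of 75 minutes.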

Postmortem Principle: No Blame

One principle that comes up consistently in postmortem culture is "No Blame Culture": don't blame people.

At first, I didn't get it. "I made the mistake, isn't it my fault?" But reading through Google's SRE practices, this framing stuck:

"Yes, you made a mistake. But ask yourself why that mistake was possible. Had you ever been taught proper connection management? Was there monitoring? Did code review catch it? The system is the problem."

That reframing matters. If you blame people, they hide mistakes. And when mistakes get hidden, the whole system misses a chance to improve.

Core principles of postmortems:

1. Improve the system, not people

Not good: "John didn't return the connection"
Better: "There was no system to enforce connection return"

2. Share transparently

Postmortems should be readable by everyone involved
Don't hide mistakes; publish them, so the same mistake isn't made twice by different people

3. Focus on learning

"What did we learn this time?"
"How can we do better next time?"

After Writing a Postmortem

After going through the postmortem format for the first time, a few things changed.

1. Stopped repeating the same mistakes

Following through on action items actually prevented similar problems
Monitoring surfaced issues earlier

2. Response got faster

Writing out the timeline clarified the right order of steps
Next time a similar situation came up, the response was much calmer

3. Root cause analysis got deeper

Stopped settling for "code mistake" and traced back to system-level causes
Action items became more concrete and actually executable

4. Less anxiety about mistakes

Realized that documenting mistakes openly is better for growth than hiding them
Understanding why "mistakes are learning opportunities" actually matters took practice

Good Postmortem vs Bad Postmortem

After writing several postmortems, I learned the difference between good and bad ones.

Bad Postmortem

Title: DB Incident

Problem: DB went down
Cause: Connection issue
Solution: Restarted
Prevention: Be careful

This is useless. Not specific, nothing to learn.

Good Postmortem

Title: 2-Hour Service Outage Due to DB Connection Pool Exhaustion

Summary:
- Date: 2025-01-15 03:00-05:00
- Impact: All users (about 1,000 people)
- Root Cause: Missing connection return in finally block

Timeline: (Detailed time-based records)

Root Cause Analysis:
- Direct cause: Missing connection.release()
- System cause: Absence of connection pool monitoring
- Process cause: Code review didn't check resource management

Resolution:
- Immediate: Add finally block
- Short-term: Introduce withConnection helper
- Long-term: Review ORM adoption

Action Items: (Owner and deadline specified)

Lessons Learned:
- Resources must be returned
- Without monitoring, problems are detected late
- Automation is safer than manual management

See the difference? Good postmortems are specific, actionable, and educational.
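
The gap between the two examples can be partly checked by machine. The sketch below flags missing sections and action items without an owner or deadline. The section names and the reviewPostmortem function are assumptions for illustration, and a check like this only catches what is absent, not whether what is written is any good.

```javascript
// Sections taken from the "good" example above; the key names are assumptions.
const REQUIRED_SECTIONS = [
  'summary',
  'timeline',
  'rootCauses',
  'resolution',
  'actionItems',
  'lessonsLearned',
];

// Returns a list of human-readable problems; an empty list means nothing is missing.
function reviewPostmortem(doc) {
  const problems = [];

  for (const section of REQUIRED_SECTIONS) {
    const value = doc[section];
    // Strings and arrays both expose length, so one check covers both.
    const empty = value === undefined || value === null || value.length === 0;
    if (empty) problems.push(`missing section: ${section}`);
  }

  for (const item of doc.actionItems ?? []) {
    if (!item.owner) problems.push(`action item without owner: ${item.title}`);
    if (!item.due) problems.push(`action item without deadline: ${item.title}`);
  }

  return problems;
}
```

Run against the "bad" example, which has no timeline, action items, or lessons learned, this returns three problems.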

One-Line Summary

Postmortems are about fixing the "System," not blaming the "Person." Use the 5 Whys to find root causes, track MTTD/MTTR metrics, and execute specific Action Items to prevent recurrence. Hidden failures repeat; shared failures become assets.

#postmortem #incident #sre #devops
