Incident Management: Writing Postmortems and Managing Incidents
Prologue: The 3 AM PagerDuty Alert
CRITICAL: API error rate 45% — P1 incident
The PagerDuty tone jolted me awake. Nearly half our requests were failing. Eyes half-open, hands shaking, mind blank about where to start.
At that point, our team had no clear incident process. Slack was flooded with "I'm seeing errors" messages. Nobody knew who was checking what, who should notify customers, or where we were in the response. Everyone had a different dashboard open.
We recovered after four hours. The root cause was simple — one poorly indexed database query. But it took four hours because we had no process.
After that incident, we built our incident process from scratch. Here's what we learned.
1. Incident Severity Levels
Not all alerts demand the same urgency. Defined severity levels are what make a structured response possible.
| Level | Name | Definition | Response Time |
|---|---|---|---|
| SEV-1 (P1) | Critical | Full service down, data loss risk | Immediate (24/7) |
| SEV-2 (P2) | Major | Core feature impaired, majority of users affected | Within 30 min (including off-hours) |
| SEV-3 (P3) | Minor | Partial feature issue, small user impact | During business hours |
| SEV-4 (P4) | Low | Minor bug with workaround available | Normal ticket queue |
Classification Criteria
SEV-1:
- Error rate > 10% (all requests)
- Payment/auth completely down
- Data loss or exposure risk
- Complete service unavailability
SEV-2:
- Error rate > 5% (core features)
- Latency > 3x normal
- 100% impact on specific region/segment
SEV-3:
- Error rate > 1% (non-critical features)
- Workaround exists
- Small percentage of users affected
SEV-4:
- Degraded UX
- Monitoring/internal tool issues
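The criteria above can be read as a small decision function. The following is a minimal sketch, not our production logic: the `Signal` class and its field names are hypothetical, and the "100% impact on a specific region/segment" rule is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical snapshot of what monitoring reports when an incident is declared."""
    error_rate: float            # fraction of failing requests, 0.0 to 1.0
    core_feature: bool           # True if a core feature is affected
    full_outage: bool = False    # service, payment, or auth completely down
    data_at_risk: bool = False   # data loss or exposure risk
    latency_ratio: float = 1.0   # current latency divided by normal latency


def classify(s: Signal) -> str:
    """Map a signal to a severity level using the thresholds listed above."""
    if s.full_outage or s.data_at_risk or s.error_rate > 0.10:
        return "SEV-1"
    if (s.core_feature and s.error_rate > 0.05) or s.latency_ratio > 3.0:
        return "SEV-2"
    if s.error_rate > 0.01:
        return "SEV-3"
    return "SEV-4"


print(classify(Signal(error_rate=0.45, core_feature=True)))
```

Checking the most severe conditions first means an incident that matches several levels is always assigned the highest one.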
2. Incident Roles
The biggest source of chaos during an incident is not knowing who does what. Clear role definitions are the solution.
Incident Commander (IC)
Role: Coordinates the entire response. The job is coordination, not technical debugging.
Responsibilities:
- Declare the incident and assign severity
- Assign roles and tasks to team members
- Track response progress
- Make escalation decisions
- Declare the incident resolved
Critical: The IC does not debug. They coordinate others who debug.
Communications Lead (Comms)
Role: Owns all communication outside the response channel during the incident, to customers as well as to the rest of the company.
Responsibilities:
- Update the status page
- Send customer notification emails/social posts
- Brief executives
- Post internal Slack updates (separate from IC's technical channel)
Giving Comms this ownership lets the IC stay focused on technical coordination.
Subject Matter Expert (SME)
Role: Diagnoses and resolves the actual technical problem.
- Backend SME: servers, databases, APIs
- Infrastructure SME: cloud, networking, CDN
- Frontend SME: client-side issues
Scribe
Role: Record everything.
[03:15] IC declares SEV-1. Kim as IC, Lee as backend SME.
[03:17] Error rate confirmed: 45%, mainly /api/payments endpoint
[03:22] Lee: reviewing slow query logs for payment DB
[03:31] Lee: confirmed full table scan due to unused index
[03:35] Temporary mitigation: payment feature switched to maintenance mode
[03:40] Error rate drops to 2%
[04:10] Index rebuild complete, error rate 0.1% — normalized
[04:15] IC: SEV-1 incident declared resolved
This log becomes the foundation for the postmortem.
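Because the scribe's entries follow a fixed `[HH:MM] text` shape, turning them into a postmortem timeline can be mechanical. Here is a minimal sketch assuming that exact line format; the function name `to_timeline_rows` is made up for illustration.

```python
import re

# Matches "[03:15] some text" and captures the time and the event text.
LOG_LINE = re.compile(r"^\[(\d{2}:\d{2})\]\s+(.*)$")


def to_timeline_rows(log_text: str) -> list[str]:
    """Convert '[HH:MM] text' scribe entries into Markdown table rows."""
    rows = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:  # skip anything that is not a timestamped entry
            time, event = match.groups()
            rows.append(f"| {time} | {event} |")
    return rows


scribe_log = """\
[03:15] IC declares SEV-1. Kim as IC, Lee as backend SME.
[03:17] Error rate confirmed: 45%, mainly /api/payments endpoint
"""
rows = to_timeline_rows(scribe_log)
print("\n".join(rows))
```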
3. Incident Communication
Internal Channel Structure
#incidents → Real-time response thread (technical discussion)
#incidents-updates → Polished status updates (execs, other teams)
#on-call-alerts → PagerDuty/monitoring notifications
Never share #incidents externally. Hypotheses during investigation can be misinterpreted.
Status Update Template
Post updates every 5–15 minutes during an active incident:
**[03:20] SEV-1 Update #2**
**Status**: Investigating
**Impact**: Payment API, all users
**Error rate**: 45% → 38% (slight decrease)
**Current work**: Analyzing DB slow queries
**Next update**: 03:35
**IC**: Kim / **SME**: Lee
Five status states:
- Investigating: looking for the cause
- Identified: cause found, fix in progress
- Monitoring: temporary fix in place, watching for stability
- Resolved: fully recovered
- Post-Incident: postmortem in progress
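A template is easier to follow under pressure when a script fills in the layout. The sketch below renders the update format shown above and rejects any status outside the five states. The `StatusUpdate` class and its field names are assumptions for illustration, not part of any Slack or PagerDuty API.

```python
from dataclasses import dataclass

# The five states, in the order an incident moves through them.
STATES = ["Investigating", "Identified", "Monitoring", "Resolved", "Post-Incident"]


@dataclass
class StatusUpdate:
    time: str          # "HH:MM" of this update
    severity: str      # e.g. "SEV-1"
    number: int        # running update counter
    status: str        # must be one of STATES
    impact: str
    error_rate: str
    current_work: str
    next_update: str   # "HH:MM" promised for the next update
    ic: str
    sme: str

    def render(self) -> str:
        """Render the update in the template's Markdown layout."""
        if self.status not in STATES:
            raise ValueError(f"unknown status: {self.status!r}")
        return "\n".join([
            f"**[{self.time}] {self.severity} Update #{self.number}**",
            f"**Status**: {self.status}",
            f"**Impact**: {self.impact}",
            f"**Error rate**: {self.error_rate}",
            f"**Current work**: {self.current_work}",
            f"**Next update**: {self.next_update}",
            f"**IC**: {self.ic} / **SME**: {self.sme}",
        ])


update = StatusUpdate("03:20", "SEV-1", 2, "Investigating", "Payment API, all users",
                      "45% → 38% (slight decrease)", "Analyzing DB slow queries",
                      "03:35", "Kim", "Lee")
print(update.render())
```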
Customer Communication Principles
- Communicate early: post "investigating" even before you know the cause
- Be honest: don't understate impact
- Update regularly: at least once every 15 minutes
- Post-resolution summary: "Root cause was X, we did Y"
4. Incident Runbooks
Pre-written runbooks let you take clear action at 3 AM, even half-asleep.
High Error Rate Runbook (Example)
# High Error Rate Runbook
## Trigger
Error rate > 5% for 5 minutes
## Immediate checks (within 5 minutes)
1. [ ] Which endpoint(s)?
   - Datadog: `sum:trace.web.request.errors{*} by {resource_name}`
2. [ ] Recent deployment?
   - GitHub: `/compare/HEAD~1..HEAD`
   - ArgoCD deployment history
3. [ ] External dependency status?
   - Stripe, Supabase, AWS status pages
## Decision tree
Error rate > 10%?
├── YES → Declare SEV-1, consider immediate rollback
└── NO → Declare SEV-2, investigate before acting
Recent deploy?
├── YES → Try rollback first (prioritize fast recovery)
└── NO → Check infrastructure and external services
## Rollback procedure
1. GitHub Actions: re-run with previous deployment tag
2. ArgoCD: roll back to previous version
3. Feature flags: immediately disable related flags
## Escalation
- 15 min, cause unknown → page senior engineer
- 30 min, unresolved → notify CTO
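The decision tree and escalation rules in this runbook are simple enough to encode directly, which is handy for a chat bot or a training drill. This is a sketch under assumptions: the function names and the returned strings are hypothetical, and the thresholds are copied from the runbook above.

```python
def triage(error_rate: float, recent_deploy: bool) -> tuple[str, str]:
    """Return (severity, first action) following the runbook's decision tree."""
    severity = "SEV-1" if error_rate > 0.10 else "SEV-2"
    if recent_deploy:
        action = "try rollback first"
    else:
        action = "check infrastructure and external services"
    return severity, action


def escalation(minutes_elapsed: int, cause_known: bool, resolved: bool) -> list[str]:
    """Return who should be paged or notified at this point in the incident."""
    if resolved:
        return []
    contacts = []
    if minutes_elapsed >= 15 and not cause_known:
        contacts.append("senior engineer")
    if minutes_elapsed >= 30:
        contacts.append("CTO")
    return contacts


print(triage(error_rate=0.45, recent_deploy=True))
print(escalation(minutes_elapsed=20, cause_known=False, resolved=False))
```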
5. Blameless Postmortems
After the incident closes, write a postmortem. The goal is learning, not blame.
What "Blameless" Means
"Kim deployed bad code" → whose fault?
"The deployment pipeline didn't catch this pattern in staging" → what system failed?
Blame-based cultures:
- People hide their mistakes next time
- Root cause analysis stays shallow
- Psychological safety erodes
System-focused cultures:
- Make it harder for anyone to make the same mistake
- Surface real root causes
Postmortem Template
# Postmortem: [Incident Title]
**Date**: 2026-03-20
**Severity**: SEV-1
**Duration**: 03:15 – 04:15 UTC (1 hour)
**Author**: Kim
**Reviewers**: Lee, Park
---
## Executive Summary
On March 20, 2026, payment API error rates rose to 45% for one hour.
Root cause: a recently deployed query bypassed the existing index, causing full table scans.
Estimated impact: ~2,300 failed payment attempts.
---
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 03:00 | v2.14.0 deployed (includes payment query optimization) |
| 03:12 | Datadog alert fires (5% error rate threshold) |
| 03:15 | PagerDuty pages on-call. Kim declared IC |
| 03:17 | Lee: confirms 45% error rate on /api/payments |
| 03:22 | Lee: begins slow query log analysis |
| 03:31 | Lee: confirms full table scan on user_payment_methods |
| 03:35 | Mitigation: maintenance mode enabled → error rate drops to 2% |
| 03:40 | Index rebuild started |
| 04:10 | Index rebuild complete, maintenance mode disabled |
| 04:15 | Error rate 0.1%, SEV-1 resolved |
---
## Root Cause Analysis (5 Whys)
**Q: Why did error rate hit 45%?**
A: Payment API queries were timing out.
**Q: Why were queries timing out?**
A: Full table scan on user_payment_methods.
**Q: Why full table scan?**
A: v2.14.0 query change broke the composite index prefix rule.
**Q: Why wasn't this caught before deploy?**
A: Staging DB has 1k rows vs production's 1M — the slow query didn't surface.
No query execution plan validation in CI.
**Root cause**: Staging data volume doesn't represent production scale, and
there's no automated query plan validation in the deployment pipeline.
---
## Impact
- **Users**: ~2,300 failed payment attempts (20 minutes)
- **Revenue**: ~$46,000 estimated
- **Support tickets**: 87 filed
---
## What Went Well
1. PagerDuty alerted within 3 minutes of the error spike
2. Incident channel stayed focused and clean
3. Maintenance mode quickly reduced user impact
4. Scribe's timeline made analysis straightforward
---
## What Went Wrong
1. Staging/production data scale mismatch masked the issue
2. No automated query execution plan validation
3. Enabling maintenance mode required 10 minutes of manual work
4. Status page wasn't updated for the first 30 minutes
---
## Action Items
| Item | Owner | Due | Priority |
|------|-------|-----|----------|
| Add EXPLAIN ANALYZE validation to CI pipeline | Lee | 2026-03-27 | P1 |
| Script to keep staging DB at ≥10% of production scale | Park | 2026-03-31 | P1 |
| Implement payment kill switch as feature flag | Kim | 2026-03-28 | P2 |
| Automate status page update within 5 min of SEV-1 | Park | 2026-04-07 | P2 |
| Add DB query performance anomaly detection alert | Lee | 2026-04-07 | P2 |
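The first action item in the template, query plan validation in CI, can be illustrated with a small check that fails when a query falls back to a full table scan. This sketch uses SQLite's `EXPLAIN QUERY PLAN` from Python's standard library purely as a stand-in; a real pipeline would run `EXPLAIN` against its own database engine. The column names and the index `idx_user_method` are hypothetical; only the table name comes from the example incident.

```python
import sqlite3


def full_scans(conn: sqlite3.Connection, sql: str) -> list[str]:
    """Return the plan steps that scan instead of searching via an index."""
    plan = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
    # Each plan row is (id, parent, notused, detail). In SQLite, a detail that
    # starts with SCAN reads every row; SEARCH means an index lookup is used.
    return [row[-1] for row in plan if row[-1].startswith("SCAN")]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_payment_methods (
        id INTEGER PRIMARY KEY, user_id INTEGER, method_type TEXT, token TEXT
    );
    CREATE INDEX idx_user_method ON user_payment_methods (user_id, method_type);
""")

# Filters on the leading column of the composite index, so the index applies.
uses_prefix = "SELECT * FROM user_payment_methods WHERE user_id = 1"
# Filters only on the second column, so the index prefix rule is broken.
skips_prefix = "SELECT * FROM user_payment_methods WHERE method_type = 'card'"

print(full_scans(conn, uses_prefix))
print(full_scans(conn, skips_prefix))
```

A plan-based check does not depend on how many rows staging holds, so it can catch this class of problem even when the slow query never shows up in staging timings.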
6. The 5 Whys Technique
The 5 Whys is one of the most widely used tools for root cause analysis in postmortems. It originated at Toyota.
How to Use It Correctly
Ask "why" five times. Each answer becomes the subject of the next question.
Bad example:
Q: Why did the service go down?
A: Kim deployed bad code.
Q: Why did Kim deploy bad code?
A: By mistake.
→ Dead end. "Don't make mistakes" is not an action item.
Good example:
Q: Why did the service go down?
A: The deployed code had a memory leak.
Q: Why wasn't the memory leak caught?
A: It wasn't caught in code review.
Q: Why wasn't it caught in code review?
A: Memory profiling isn't in the review checklist.
Q: Why isn't memory profiling in the checklist?
A: There's no frontend PR review checklist.
Q: Why is there no checklist?
A: We never documented our review process.
→ Action item: write a PR review checklist
5 Whys Limitations
- Assumes a single root cause (reality is often multi-causal)
- Conclusions vary based on the questioner's experience
- For complex system failures, pair with a Fishbone (Ishikawa) Diagram
7. Action Items That Actually Get Done
The most common postmortem failure: nothing gets executed.
What Makes a Bad Action Item
Bad:
- "Improve monitoring" → too vague
- "Team-wide awareness" → no owner
- "Do better code reviews" → not measurable
Good:
- "Add Datadog alert for payment error rate > 5% (Owner: Lee, Due: 3/27)"
- "Add Lighthouse score < 80 as build failure in CI (Owner: Kim, Due: 3/31)"
- "Add DB migration step to deployment runbook (Owner: Park, Due: 4/7)"
SMART Criteria
| Criterion | Description |
|---|---|
| Specific | What exactly? |
| Measurable | How do we know it's done? |
| Assignable | Who owns it? |
| Realistic | Achievable by the deadline? |
| Time-bound | When is it due? |
Tracking Action Items
Review action item progress in your weekly engineering meeting. Don't let overdue items silently expire — reschedule or explicitly deprioritize them.
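The weekly review goes faster if a script flags the items that need discussion. Below is a minimal sketch that checks each item for a missing owner, a missing due date, or a passed deadline. The `ActionItem` class is hypothetical and not tied to any ticketing system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str]
    due: Optional[date]
    done: bool = False


def problems(item: ActionItem, today: date) -> list[str]:
    """List the reasons this item needs attention in the weekly review."""
    issues = []
    if not item.owner:
        issues.append("no owner")
    if item.due is None:
        issues.append("no due date")
    elif not item.done and item.due < today:
        issues.append("overdue")
    return issues


items = [
    ActionItem("Add EXPLAIN ANALYZE validation to CI pipeline", "Lee", date(2026, 3, 27)),
    ActionItem("Improve monitoring", None, None),
]
for item in items:
    print(item.title, problems(item, today=date(2026, 4, 1)))
```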
8. On-Call Best Practices
Prevent Alert Fatigue
Noisy alerts (bad):
- CPU > 70%
- Memory > 70%
- Response time > 500ms
→ Dozens of alerts per day → everyone ignores them
Meaningful alerts (good):
- Direct user impact: error rate, P99 latency, payment failure rate
- Imminent resource exhaustion: disk predicted full within 4 hours
Rule: Every alert must require action. Alerts you can safely ignore should be deleted or demoted.
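An alert like "disk predicted full within 4 hours" needs a prediction behind it. One simple approach, assumed here for illustration, is linear extrapolation from two usage samples; real monitoring tools offer more robust forecasting. The function names are hypothetical.

```python
from typing import Optional


def hours_until_full(used_then: float, used_now: float,
                     capacity: float, hours_between: float) -> Optional[float]:
    """Extrapolate linearly from two usage samples (same unit, e.g. GB)."""
    growth_per_hour = (used_now - used_then) / hours_between
    if growth_per_hour <= 0:
        return None  # usage is flat or shrinking, so there is nothing to predict
    return (capacity - used_now) / growth_per_hour


def should_alert(used_then: float, used_now: float, capacity: float,
                 hours_between: float, horizon_hours: float = 4.0) -> bool:
    """Fire only when the disk is predicted to fill within the horizon."""
    remaining = hours_until_full(used_then, used_now, capacity, hours_between)
    return remaining is not None and remaining <= horizon_hours


# 80 GB to 90 GB in 2 hours on a 100 GB disk: 5 GB/hour, 10 GB left.
print(hours_until_full(80, 90, 100, 2))
print(should_alert(80, 90, 100, 2))
```

Compared with a static "disk > 70%" threshold, this stays quiet on a disk that is 90% full but stable, and fires on one that is 60% full but filling fast.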
On-Call Rotation Design
rotation:
  type: weekly
  team:
    - Kim (week 1)
    - Lee (week 2)
    - Park (week 3)
escalation:
  level1: on-call engineer (respond within 5 min)
  level2: team lead (escalate if no response in 15 min)
  level3: VP Engineering (escalate if unresolved in 30 min)
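Given a weekly rotation like the one above, finding the primary for any date is a matter of counting whole weeks since the rotation began. This is a sketch only: the start date is an assumption (a Monday picked for illustration), and real schedules also need overrides for swaps and holidays.

```python
from datetime import date

TEAM = ["Kim", "Lee", "Park"]
ROTATION_START = date(2026, 3, 2)  # assumed: the Monday on which week 1 began


def on_call(day: date, team: list[str] = TEAM,
            rotation_start: date = ROTATION_START) -> str:
    """Return the primary on-call engineer for the week containing `day`."""
    weeks_elapsed = (day - rotation_start).days // 7
    return team[weeks_elapsed % len(team)]


print(on_call(date(2026, 3, 4)))   # week 1
print(on_call(date(2026, 3, 23)))  # week 4 wraps around to the first engineer
```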
Building an On-Call Culture
On-call is not punishment. It's a team responsibility.
- Compensate for it: adjust work schedule after late-night pages
- Shadow on-call first: new engineers observe before being primary
- Keep runbooks current: write what needs to be known; make it findable at 3 AM
- Psychological safety: mistakes are okay — the postmortem proves it
9. Incident Metrics
You can only improve what you measure.
Key Metrics
MTTD (Mean Time to Detect)
: Time from incident occurrence to alert firing
MTTA (Mean Time to Acknowledge)
: Time from alert to IC assigned
MTTR (Mean Time to Resolve)
: Time from IC assigned to incident resolved
MTTM (Mean Time to Mitigate)
: Time from IC assigned to temporary fix in place
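Each of these metrics is an average of the gap between two timestamps, so they fall out of the postmortem timeline directly. The sketch below follows the definitions above. The `Incident` class is hypothetical, and treating the 03:00 deploy as the moment the example incident began is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime      # when the fault began (assumed: the deploy time)
    alerted: datetime      # when the alert fired
    ic_assigned: datetime  # when an IC took ownership
    mitigated: datetime    # when the temporary fix was in place
    resolved: datetime     # when the incident was declared resolved


def minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def metrics(incidents: list[Incident]) -> dict[str, float]:
    """Average each interval, in minutes, across all incidents."""
    return {
        "MTTD": mean(minutes(i.alerted - i.started) for i in incidents),
        "MTTA": mean(minutes(i.ic_assigned - i.alerted) for i in incidents),
        "MTTM": mean(minutes(i.mitigated - i.ic_assigned) for i in incidents),
        "MTTR": mean(minutes(i.resolved - i.ic_assigned) for i in incidents),
    }


def at(hour: int, minute: int) -> datetime:
    return datetime(2026, 3, 20, hour, minute)


incident = Incident(at(3, 0), at(3, 12), at(3, 15), at(3, 35), at(4, 15))
print(metrics([incident]))
```

With a single incident the "mean" is just that incident's value; the numbers become useful once several incidents are tracked over time.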
Epilogue: Incidents Are Gifts (Seriously)
It sounds absurd, but a well-handled incident is a gift to your team.
It exposes system weaknesses. It reveals blind spots in your monitoring. It surfaces undocumented tribal knowledge. And it shows how your team collaborates under pressure.
Write a proper blameless postmortem, and a single incident generates multiple concrete system improvements. Bad luck becomes a good investment.
But that only works if you have a process. At 3 AM when your brain is offline, you need a runbook to follow, clear roles to fall back on, and enough energy left to write the postmortem.
That's the heart of SRE culture. Don't fear incidents — learn from them.