
Incident Management: Writing Postmortems and Managing Incidents
Incidents will happen. What matters is how fast you recover and what you learn. From severity levels and incident roles to blameless postmortems and action items that actually get done.


CRITICAL: API error rate 45% — P1 incident
The PagerDuty tone jolted me awake. Nearly half our requests were failing. Eyes half-open, hands shaking, mind blank about where to start.
At that point, our team had no clear incident process. Slack was flooded with "I'm seeing errors" messages. Nobody knew who was checking what, who should notify customers, or where we were in the response. Everyone had a different dashboard open.
We recovered after four hours. The root cause was simple — one poorly indexed database query. But it took four hours because we had no process.
After that incident, we built our incident process from scratch. Here's what we learned.
Not all alerts demand the same urgency. You need defined severity levels for structured response.
| Level | Name | Definition | Response Time |
|---|---|---|---|
| SEV-1 (P1) | Critical | Full service down, data loss risk | Immediate (24/7) |
| SEV-2 (P2) | Major | Core feature impaired, majority of users affected | Within 30 min (including off-hours) |
| SEV-3 (P3) | Minor | Partial feature issue, small user impact | During business hours |
| SEV-4 (P4) | Low | Minor bug with workaround available | Normal ticket queue |
SEV-1:
- Error rate > 10% (all requests)
- Payment/auth completely down
- Data loss or exposure risk
- Complete service unavailability
SEV-2:
- Error rate > 5% (core features)
- Latency > 3x normal
- 100% impact on specific region/segment
SEV-3:
- Error rate > 1% (non-critical features)
- Workaround exists
- Small percentage of users affected
SEV-4:
- Degraded UX
- Monitoring/internal tool issues
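The criteria above can be sketched as a simple classifier. This is an illustrative sketch, not a drop-in tool: the `Signal` fields and thresholds mirror the lists above, but your real inputs would come from your monitoring system.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    """Snapshot of the signals used to classify severity (illustrative fields)."""
    error_rate: float        # fraction of failing requests, 0.0-1.0
    scope: str               # "all", "core", or "non_critical"
    data_loss_risk: bool
    workaround_exists: bool

def classify(sig: Signal) -> str:
    """Map a signal snapshot to a severity level per the table above."""
    if sig.data_loss_risk or (sig.scope == "all" and sig.error_rate > 0.10):
        return "SEV-1"
    if sig.scope == "core" and sig.error_rate > 0.05:
        return "SEV-2"
    if sig.error_rate > 0.01 and sig.workaround_exists:
        return "SEV-3"
    return "SEV-4"

print(classify(Signal(0.45, "all", False, False)))  # SEV-1
```

Encoding the table this way removes a judgment call at 3 AM: the person on call only has to read off the signals, not debate severity.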
The biggest source of chaos during an incident is not knowing who does what. Clear role definitions are the solution.
**Incident Commander (IC)**
Role: Coordinates the entire response. The job is coordination, not technical debugging.
Critical: the IC does not debug. They coordinate the people who do.
**Communications Lead (Comms)**
Role: Owns all external communication during the incident. By giving Comms this ownership, the IC can stay focused on technical coordination.
**Subject Matter Expert (SME)**
Role: Diagnoses and resolves the actual technical problem.
**Scribe**
Role: Records everything, with timestamps. For example:
[03:15] IC declares SEV-1. Kim as IC, Lee as backend SME.
[03:17] Error rate confirmed: 45%, mainly /api/payments endpoint
[03:22] Lee: reviewing slow query logs for payment DB
[03:31] Lee: confirmed full table scan due to unused index
[03:35] Temporary mitigation: payment feature switched to maintenance mode
[03:40] Error rate drops to 2%
[04:10] Index rebuild complete, error rate 0.1% — normalized
[04:15] IC: SEV-1 incident declared resolved
This log becomes the foundation for the postmortem.
#incidents → Real-time response thread (technical discussion)
#incidents-updates → Polished status updates (execs, other teams)
#on-call-alerts → PagerDuty/monitoring notifications
Never share #incidents externally. Hypotheses during investigation can be misinterpreted.
Post updates every 5–15 minutes during an active incident:
**[03:20] SEV-1 Update #2**
**Status**: Investigating
**Impact**: Payment API, all users
**Error rate**: 45% → 38% (slight decrease)
**Current work**: Analyzing DB slow queries
**Next update**: 03:35
**IC**: Kim / **SME**: Lee
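A template like this is worth scripting so the format stays consistent even when the author is exhausted. A minimal sketch (the function and field names are illustrative, not from any particular tool):

```python
def format_update(n, status, impact, error_rate, work, next_min, ic, sme):
    """Render a SEV-1 status update in the fixed template shown above."""
    return (
        f"**SEV-1 Update #{n}**\n"
        f"**Status**: {status}\n"
        f"**Impact**: {impact}\n"
        f"**Error rate**: {error_rate}\n"
        f"**Current work**: {work}\n"
        f"**Next update**: in {next_min} min\n"
        f"**IC**: {ic} / **SME**: {sme}"
    )

print(format_update(2, "Investigating", "Payment API, all users",
                    "45% -> 38%", "Analyzing DB slow queries", 15, "Kim", "Lee"))
```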
Five status states:
1. Investigating — looking for the cause
2. Identified — cause found, fix in progress
3. Monitoring — temporary fix in place, watching for stability
4. Resolved — fully recovered
5. Post-Incident — postmortem in progress

Pre-written runbooks let you take clear action at 3 AM, even half-asleep.
# High Error Rate Runbook
## Trigger
Error rate > 5% for 5 minutes
## Immediate checks (within 5 minutes)
1. [ ] Which endpoint(s)?
- Datadog: `sum:trace.web.request.errors{*} by {resource_name}`
2. [ ] Recent deployment?
- GitHub: `/compare/HEAD~1..HEAD`
- ArgoCD deployment history
3. [ ] External dependency status?
- Stripe, Supabase, AWS status pages
## Decision tree
Error rate > 10%?
├── YES → Declare SEV-1, consider immediate rollback
└── NO → Declare SEV-2, investigate before acting
Recent deploy?
├── YES → Try rollback first (prioritize fast recovery)
└── NO → Check infrastructure and external services
## Rollback procedure
1. GitHub Actions: re-run with previous deployment tag
2. ArgoCD: roll back to previous version
3. Feature flags: immediately disable related flags
## Escalation
- 15 min, cause unknown → page senior engineer
- 30 min, unresolved → notify CTO
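The runbook's decision tree can be encoded so the 3 AM decision is mechanical. A sketch, assuming the inputs come from your monitoring and deploy tooling:

```python
def first_action(error_rate: float, recent_deploy: bool) -> list[str]:
    """Return the ordered actions from the runbook's decision tree."""
    actions = []
    if error_rate > 0.10:
        actions.append("Declare SEV-1")
        actions.append("Consider immediate rollback")
    else:
        actions.append("Declare SEV-2")
    if recent_deploy:
        actions.append("Roll back the last deploy first (fast recovery)")
    else:
        actions.append("Check infrastructure and external services")
    return actions

print(first_action(0.45, True))
```

Whether this lives as code or as the checklist above matters less than having the branches decided in advance, while everyone is calm.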
After the incident closes, write a postmortem. The goal is learning, not blame.
"Kim deployed bad code" → whose fault? "The deployment pipeline didn't catch this pattern in staging" → what system failed?
Blame-based cultures teach engineers to hide mistakes, so the same failures repeat in silence. System-focused cultures surface failures early and fix the conditions that allowed them.
# Postmortem: [Incident Title]
**Date**: 2026-03-20
**Severity**: SEV-1
**Duration**: 03:15 – 04:15 UTC (1 hour)
**Author**: Kim
**Reviewers**: Lee, Park
---
## Executive Summary
On March 20, 2026, payment API error rates rose to 45% for one hour.
Root cause: a recently deployed query bypassed the existing index, causing full table scans.
Estimated impact: ~2,300 failed payment attempts.
---
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 03:00 | v2.14.0 deployed (includes payment query optimization) |
| 03:12 | Datadog alert fires (5% error rate threshold) |
| 03:15 | PagerDuty pages on-call. Kim declared IC |
| 03:17 | Lee: confirms 45% error rate on /api/payments |
| 03:22 | Lee: begins slow query log analysis |
| 03:31 | Lee: confirms full table scan on user_payment_methods |
| 03:35 | Mitigation: maintenance mode enabled → error rate drops to 2% |
| 03:40 | Index rebuild started |
| 04:10 | Index rebuild complete, maintenance mode disabled |
| 04:15 | Error rate 0.1%, SEV-1 resolved |
---
## Root Cause Analysis (5 Whys)
**Q: Why did error rate hit 45%?**
A: Payment API queries were timing out.
**Q: Why were queries timing out?**
A: Full table scan on user_payment_methods.
**Q: Why full table scan?**
A: v2.14.0 query change broke the composite index prefix rule.
**Q: Why wasn't this caught before deploy?**
A: Staging DB has 1k rows vs production's 1M — the slow query didn't surface.
No query execution plan validation in CI.
**Root cause**: Staging data volume doesn't represent production scale, and
there's no automated query plan validation in the deployment pipeline.
---
## Impact
- **Users**: ~2,300 failed payment attempts (20 minutes)
- **Revenue**: ~$46,000 estimated
- **Support tickets**: 87 filed
---
## What Went Well
1. PagerDuty alerted within 3 minutes of the error spike
2. Incident channel stayed focused and clean
3. Maintenance mode quickly reduced user impact
4. Scribe's timeline made analysis straightforward
---
## What Went Wrong
1. Staging/production data scale mismatch masked the issue
2. No automated query execution plan validation
3. Enabling maintenance mode required 10 minutes of manual work
4. Status page wasn't updated for the first 30 minutes
---
## Action Items
| Item | Owner | Due | Priority |
|------|-------|-----|----------|
| Add EXPLAIN ANALYZE validation to CI pipeline | Lee | 2026-03-27 | P1 |
| Script to keep staging DB at ≥10% of production scale | Park | 2026-03-31 | P1 |
| Implement payment kill switch as feature flag | Kim | 2026-03-28 | P2 |
| Automate status page update within 5 min of SEV-1 | Park | 2026-04-07 | P2 |
| Add DB query performance anomaly detection alert | Lee | 2026-04-07 | P2 |
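The first action item could look something like this: a CI step fetches the plan for each changed query (e.g. via `EXPLAIN` against a staging database) and fails the build if it falls back to a full table scan. A hedged sketch of the plan check itself; the sample plan strings are illustrative:

```python
# Hypothetical CI check. In a real pipeline you would obtain plan_text by
# running `EXPLAIN <query>` against a production-scale staging database
# and concatenating the returned rows.

def plan_is_safe(plan_text: str) -> bool:
    """Reject query plans that contain a sequential (full table) scan."""
    return "Seq Scan" not in plan_text

risky = "Seq Scan on user_payment_methods  (cost=0.00..35811.00 rows=1000000)"
safe = "Index Scan using idx_user_payment on user_payment_methods  (cost=0.42..8.44)"

assert not plan_is_safe(risky)
assert plan_is_safe(safe)
print("plan check ok")
```

Note that this check is only meaningful against production-scale data, which is why the staging-scale action item is paired with it.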
The 5 Whys, originally from Toyota, is the most powerful tool for root cause analysis in postmortems.
Ask "why" five times. Each answer becomes the subject of the next question.
Bad example:
Q: Why did the service go down?
A: Kim deployed bad code.
Q: Why did Kim deploy bad code?
A: By mistake.
→ Dead end. "Don't make mistakes" is not an action item.
Good example:
Q: Why did the service go down?
A: The deployed code had a memory leak.
Q: Why wasn't the memory leak caught?
A: It wasn't caught in code review.
Q: Why wasn't it caught in code review?
A: Memory profiling isn't in the review checklist.
Q: Why isn't memory profiling in the checklist?
A: There's no frontend PR review checklist.
Q: Why is there no checklist?
A: We never documented our review process.
→ Action item: write a PR review checklist
The most common postmortem failure: nothing gets executed.
Bad:
- "Improve monitoring" → too vague
- "Team-wide awareness" → no owner
- "Do better code reviews" → not measurable
Good:
- "Add Datadog alert for payment error rate > 5% (Owner: Lee, Due: 3/27)"
- "Add Lighthouse score < 80 as build failure in CI (Owner: Kim, Due: 3/31)"
- "Add DB migration step to deployment runbook (Owner: Park, Due: 4/7)"
| Criterion | Description |
|---|---|
| Specific | What exactly? |
| Measurable | How do we know it's done? |
| Assignable | Who owns it? |
| Realistic | Achievable by the deadline? |
| Time-bound | When is it due? |
Review action item progress in your weekly engineering meeting. Don't let overdue items silently expire — reschedule or explicitly deprioritize them.
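That weekly review is easy to script against whatever tracker exports your action items. A sketch; the data shape is illustrative:

```python
from datetime import date

action_items = [  # illustrative export from an issue tracker
    {"item": "EXPLAIN validation in CI", "owner": "Lee", "due": date(2026, 3, 27)},
    {"item": "Payment kill switch flag", "owner": "Kim", "due": date(2026, 3, 28)},
]

def overdue(items, today):
    """Items past their due date, flagged for the weekly review."""
    return [i for i in items if i["due"] < today]

for i in overdue(action_items, date(2026, 4, 1)):
    print(f"OVERDUE: {i['item']} (owner: {i['owner']}, due {i['due']})")
```

Posting the overdue list to the team channel automatically removes the awkwardness of a human chasing peers.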
Noisy alerts (bad):
- CPU > 70%
- Memory > 70%
- Response time > 500ms
→ Dozens of alerts per day → everyone ignores them
Meaningful alerts (good):
- Direct user impact: error rate, P99 latency, payment failure rate
- Imminent resource exhaustion: disk predicted full within 4 hours
Rule: Every alert must require action. Alerts you can safely ignore should be deleted or demoted.
rotation:
  type: weekly
  team:
    - Kim (week 1)
    - Lee (week 2)
    - Park (week 3)
escalation:
  level1: on-call engineer (respond within 5 min)
  level2: team lead (escalate if no response in 15 min)
  level3: VP Engineering (escalate if unresolved in 30 min)
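The escalation ladder is just timers keyed off acknowledgement. A sketch of how a pager service walks it (roles and intervals match the config above):

```python
# (threshold_minutes, who_gets_paged) — matches the escalation config above
ESCALATION = [
    (0,  "on-call engineer"),   # paged immediately
    (15, "team lead"),          # if unacknowledged after 15 min
    (30, "VP Engineering"),     # if unresolved after 30 min
]

def who_to_page(minutes_unacknowledged: int) -> str:
    """Return the highest escalation level reached so far."""
    target = ESCALATION[0][1]
    for threshold, role in ESCALATION:
        if minutes_unacknowledged >= threshold:
            target = role
    return target

print(who_to_page(20))  # team lead
```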
On-call is not punishment. It's a team responsibility.
You can only improve what you measure.
MTTD (Mean Time to Detect)
: Time from incident occurrence to alert firing
MTTA (Mean Time to Acknowledge)
: Time from alert to IC assigned
MTTR (Mean Time to Resolve)
: Time from IC assigned to incident resolved
MTTM (Mean Time to Mitigate)
: Time from IC assigned to temporary fix in place
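Given per-incident timestamps, these four metrics are straightforward to compute. A sketch; the record fields are illustrative and match the definitions above:

```python
from datetime import datetime
from statistics import mean

incidents = [  # illustrative records, one per incident (times from this postmortem)
    {"occurred":  datetime(2026, 3, 20, 3, 0),
     "detected":  datetime(2026, 3, 20, 3, 12),   # alert fired
     "acked":     datetime(2026, 3, 20, 3, 15),   # IC assigned
     "mitigated": datetime(2026, 3, 20, 3, 35),   # temporary fix in place
     "resolved":  datetime(2026, 3, 20, 4, 15)},
]

def mean_minutes(items, start_key, end_key):
    """Average interval between two timestamps across incidents, in minutes."""
    return mean((i[end_key] - i[start_key]).total_seconds() / 60 for i in items)

print("MTTD:", mean_minutes(incidents, "occurred", "detected"), "min")   # 12.0
print("MTTA:", mean_minutes(incidents, "detected", "acked"), "min")      # 3.0
print("MTTM:", mean_minutes(incidents, "acked", "mitigated"), "min")     # 20.0
print("MTTR:", mean_minutes(incidents, "acked", "resolved"), "min")      # 60.0
```

Tracking these over a quarter tells you where to invest: a high MTTD means better alerting, a high MTTM means better kill switches and rollbacks.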
It sounds absurd, but a well-handled incident is a gift to your team.
It exposes system weaknesses. It reveals blind spots in your monitoring. It surfaces undocumented tribal knowledge. And it shows how your team collaborates under pressure.
Write a proper blameless postmortem and a single incident generates multiple concrete system improvements. Bad luck becomes good investment.
But that only works if you have a process. At 3 AM when your brain is offline, you need a runbook to follow, clear roles to fall back on, and enough energy left to write the postmortem.
That's the heart of SRE culture. Don't fear incidents — learn from them.