2026.03.20 I·02 Incident Management: Writing Postmortems and Managing Incidents
Incidents will happen. What matters is how fast you recover and what you learn. This post covers severity levels, incident roles, blameless postmortems, and action items that actually get done.
Incident Management, Postmortem, SRE
2025.09.25 I·17 Postmortem: Post-Incident Analysis
The purpose of a postmortem and how to write one.
postmortem, incident, sre
2025.09.18 I·16 What Is SRE: Google's Philosophy for Turning Operations into Engineering
Running a service means failures will happen. Reading Google's SRE book made me realize that operations is a high-level engineering problem, not just toil. I walk through how the concepts of SLI, SLO, and Error Budget shift your mindset from firefighter to architect.
SRE, DevOps, Reliability
2025.08.05 I·10 Would You Drive with Your Eyes Closed? (Why You Need Monitoring)
Users complained the service was slow, but I was blindly grepping log files. I share how I moved from "driving blind" to full observability using Prometheus and Grafana, and explain Google's four golden signals of monitoring.
DevOps, Monitoring, Prometheus
2025.05.22 I·03 Chaos Engineering: Building Immunity by Breaking Things
Why would Netflix intentionally shut down its own production servers? Explore the philosophy of Chaos Engineering, the Simian Army, and practical strategies such as GameDays and automated chaos experiments for building resilient distributed systems.
DevOps, SRE, Infrastructure