What Is SRE: Google's Philosophy for Turning Operations into Engineering
1. Running a Service Means Dealing with Failures
When you run a service, failures are inevitable.
Traffic spikes fill up DB Connection Pools, Redis runs out of memory, a deployment breaks something unexpected.
When I started studying how to handle these situations, I kept running into the same question.
"Critical Alert: API Response Time > 5000ms"
When an alert like this fires, what do you actually do? Restart the server? Scale out? And if that keeps happening, where do you start fixing it at the root?
There had to be a way to build systems that fail less often—not just put out fires. Then I read Google's SRE (Site Reliability Engineering) book, and it hit me like a truck.
2. Operations is Not "Toil"
Google defines it this way:
"SRE is what happens when you ask a software engineer to design an operations function."
Repetitive incident response and manual deployments are called Toil in SRE terms.
A well-run SRE team caps Toil at 50% of its time and spends at least the other 50% on engineering work: writing code that improves the system.
"Operations is not about cleaning up after developers. It's about ensuring system reliability through code."
This sentence reframes what operations is actually for.
3. The Core Trio: SLI, SLO, Error Budget
To understand SRE, you only need these three.
3.1. SLI (Service Level Indicator): The Gauge
If someone asks "Is the server healthy?", it's hard to answer.
You need numbers.
- Availability: Percentage of requests that succeed (for example, responses that are not 5xx errors).
- Latency: The response time that 99% of requests stay under. (P99)
- Throughput: Requests Per Second (RPS)
These metrics are SLIs. Putting these three front and center on a monitoring dashboard (Grafana) is where you start.
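As a rough illustration, the three SLIs above can be computed from a list of request records. This is a minimal sketch, not a production collector: the `Request` type, the sample data, and the choice to count every non-5xx response as a success are assumptions made for the example, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code
    latency_ms: float   # time taken to serve the request


def availability(requests):
    """Share of requests that did not fail with a server error (5xx)."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def throughput(requests, window_seconds):
    """Average requests per second over the observation window."""
    return len(requests) / window_seconds


# Hypothetical sample: 100 requests observed over a 10-second window.
sample = (
    [Request(200, 100.0)] * 95
    + [Request(200, 900.0)] * 3
    + [Request(500, 3000.0)] * 2
)

sli_availability = availability(sample)                   # 0.98
sli_p99 = percentile([r.latency_ms for r in sample], 99)  # 3000.0
sli_rps = throughput(sample, window_seconds=10)           # 10.0
```

In a real setup these numbers would usually come from the metrics system rather than from raw request lists, but the definitions stay the same.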
3.2. SLO (Service Level Objective): The Goal
Once you measure an SLI, you need a target.
"Let's aim for 100% availability!"
...That sounds right, but from an SRE perspective it's the wrong goal.
100% availability is impossible in practice, and each additional nine costs far more than the last.
99.9% is enough for many services. (Allows about 43 minutes of downtime per 30-day month)
99.99% is excellent. (Allows about 4 minutes of downtime per 30-day month)
The key is defining a concrete, service-appropriate SLO—something like "Response Time P95 < 500ms".
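The downtime figures above follow from simple arithmetic. A small sketch, assuming a 30-day month (the function name `allowed_downtime_minutes` is made up for this example):

```python
def allowed_downtime_minutes(slo_percent, days=30):
    """Minutes of downtime an availability SLO permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


for slo in (99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min per 30 days")
# 99.9% -> 43.2 min per 30 days
# 99.99% -> 4.3 min per 30 days
```

Each extra nine cuts the allowance by a factor of ten, which is why the target should match what the service actually needs.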
3.3. Error Budget: The Right to Fail
This is the most interesting concept.
If SLO is 99.9%, the remaining 0.1% is "allowance for failure". This is the Error Budget.
- 43 minutes of downtime per month is acceptable.
- If there's budget left? -> Deploy aggressively, ship experimental features.
- If the budget is burned? -> Freeze all deployments, focus solely on stability.
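The policy in the list above can be sketched as a small function. This version assumes the budget is counted in failed requests rather than minutes of downtime, and the function names and the sample numbers are hypothetical:

```python
def error_budget_remaining(slo_percent, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent).

    Assumes slo_percent is below 100, so the budget is never zero.
    """
    allowed_failures = total_requests * (100 - slo_percent) / 100
    return 1 - failed_requests / allowed_failures


def release_decision(remaining):
    """Map the remaining budget to the ship-or-freeze rule."""
    return "ship" if remaining > 0 else "freeze"


# Hypothetical month: 1,000,000 requests against a 99.9% SLO,
# so roughly 1,000 failed requests are allowed.
healthy = error_budget_remaining(99.9, 1_000_000, 400)
burned = error_budget_remaining(99.9, 1_000_000, 1_200)

print(release_decision(healthy))  # ship
print(release_decision(burned))   # freeze
```

Real policies are often more graded than a single threshold (for example, slowing releases before freezing them), but the decision still comes from the same number.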
This rule gives Dev and Ops a shared, numeric basis for release decisions instead of an endless debate. Google's SRE book describes this same mechanism.
"We have budget, let's ship."
"We burned the budget, no deploys this week. Let's refactor."
4. From Firefighter to Architect
Once you internalize SRE philosophy, how you approach operations changes fundamentally.
- Automation: Write scripts that restart a server automatically when a health metric crosses a threshold. (Self-Healing)
- Postmortem: Instead of asking "Who made the mistake?", write reports analyzing "Why didn't the system prevent this mistake?".
- Canary Deployment: Instead of deploying to everyone at once, deploy to 5% of users first to validate.
The goal shifts from firefighter running around putting out fires to architect designing fire-resistant buildings where automated sprinklers kick in. That's the role SRE aims for.
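To make the canary idea from the list above concrete, here is one possible way to pick the 5%: hash the user ID into a bucket from 0 to 99. This is only a sketch. The `in_canary` helper and the user IDs are invented for illustration, and real rollouts are often handled by the load balancer or a feature-flag system instead.

```python
import hashlib


def in_canary(user_id, percent=5):
    """Deterministically place a user in the canary group.

    Hashing the ID keeps each user on the same side across requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return bucket < percent


# Hypothetical user IDs; roughly 5% should land in the canary group.
users = [f"user-{i}" for i in range(10_000)]
canary_share = sum(in_canary(u) for u in users) / len(users)
print(f"canary share: {canary_share:.1%}")
```

Because the assignment depends only on the ID, raising `percent` step by step widens the rollout without moving anyone out of the canary group.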
5. Chaos Engineering: Drilling for Fire
Taking it a step further, you reach the stage of "Setting fires on purpose".
Netflix's 'Chaos Monkey' is famous for this.
It randomly shuts down working servers.
You might think "Are they crazy?", but the logic is sound.
"Failures will happen anyway. So let's trigger them intentionally when we are in control (office hours) and verify if the system self-heals."
For example, forcibly killing one web server during business hours lets you verify that the Load Balancer automatically excludes the dead node and redistributes traffic. Running this test in a controlled setting—before production forces the issue—is the core idea behind Chaos Engineering.
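The experiment described above can be rehearsed in miniature. The following is a toy simulation, not Chaos Monkey itself: `LoadBalancer`, `chaos_experiment`, and the node names are all made up, and the "health check" is just a set lookup.

```python
import random


class LoadBalancer:
    """Toy round-robin balancer that skips nodes failing a health check."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(nodes)
        self._next = 0

    def kill(self, node):
        """Simulate a crash: the node stops passing health checks."""
        self.healthy.discard(node)

    def route(self):
        """Return the next healthy node, or raise if none are left."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes")


def chaos_experiment(lb, requests=300, seed=0):
    """Kill one random node, then check that no traffic reaches it."""
    victim = random.Random(seed).choice(sorted(lb.healthy))
    lb.kill(victim)
    served_by = [lb.route() for _ in range(requests)]
    return victim, victim not in served_by


lb = LoadBalancer(["web-1", "web-2", "web-3"])
victim, survived = chaos_experiment(lb)
print(f"killed {victim}; traffic avoided it: {survived}")
```

The value of the real exercise is the same as in the toy: the hypothesis ("traffic avoids the dead node") is written down first and then checked, rather than assumed.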
6. SRE Toolbox
If you want to start practicing SRE but don't know which tools to use, here is a starting set.
6.1. Monitoring & Visualization
- Prometheus: The standard for Time-Series Metric collection. Scrapes "CPU Usage", "Memory", etc.
- Grafana: Visualizes collected data into beautiful graphs. The most loved dashboard tool.
6.2. Log Analysis
- ELK Stack (Elasticsearch, Logstash, Kibana): Collects and searches logs. Used to find "Which error log appears the most?".
- Loki: A lightweight log system from Grafana Labs. Works great with Prometheus.
6.3. Alerting
- PagerDuty: Calls or texts the on-call engineer when an incident occurs. The tool responsible for those 2 AM wake-up calls.
- Slack Webhook: Non-critical alerts are sent to Slack channels.
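The split between paging and chat described above is a routing rule based on severity. A minimal sketch follows. It does not call PagerDuty or Slack at all, and the `Severity` levels, the `route_alert` helper, and the second sample alert are assumptions for the example.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


def route_alert(severity, page_at=Severity.CRITICAL):
    """Return "page" for alerts at or above `page_at`, otherwise "chat"."""
    return "page" if severity >= page_at else "chat"


alerts = [
    ("API Response Time > 5000ms", Severity.CRITICAL),
    ("Disk usage above 70%", Severity.WARNING),
]
for name, severity in alerts:
    print(f"{route_alert(severity):>4}  {name}")
```

Keeping the paging threshold explicit makes it easier to argue about which alerts deserve a 2 AM phone call and which can wait in a channel until morning.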
7. SRE Glossary (Cheat Sheet)
If you are new to SRE, these terms might be confusing.
- Toil (grunt work): Work that is manual, repetitive, automatable, and devoid of long-term value. SREs aim to eliminate this.
- SLA (Service Level Agreement): A contract with the customer. (e.g., "If downtime > 1 hour, we refund 10%"). SREs don't usually deal with this directly; lawyers do.
- SLO (Service Level Objective): An internal goal. Stricter than SLA. (e.g., "Let's aim for 99.9% so we don't break the SLA").
- SLI (Service Level Indicator): A metric used to check whether an SLO is being met. (e.g., Error Rate, Latency).
- MTTR (Mean Time To Recovery): Average time to fix a failure. Lower is better.
- MTBF (Mean Time Between Failures): Average time between failures. Higher is better.
- Postmortem (retrospective): A blameless written record of an incident, its root cause, and impact.
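MTTR and MTBF from the glossary above are easy to compute once incidents are recorded with timestamps. A small sketch, assuming a hypothetical incident log of (failure start, recovery) pairs, and taking MTBF to mean the uptime between one recovery and the next failure (definitions vary):

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time from the start of a failure to its recovery."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents):
    """Mean uptime between the end of one failure and the start of the next."""
    ordered = sorted(incidents)
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical incident log: (failure start, recovery) pairs.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 11, 10, 30), datetime(2024, 1, 11, 10, 40)),
    (datetime(2024, 1, 21, 10, 40), datetime(2024, 1, 21, 11, 0)),
]

print(mttr(incidents))  # 0:20:00
print(mtbf(incidents))  # 10 days, 0:00:00
```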
8. FAQ
Q. How is it different from DevOps?
A. Google says: "class SRE implements interface DevOps".
If DevOps is a cultural movement saying "Dev and Ops should collaborate", SRE provides concrete methodologies (SLO, Error Budget, etc.) on "How to actually do that".
Q. Do small startups need this?
A. You don't need a dedicated team, but you need the 'Mindset'. Even a 3-person team needs to decide "Where do we store logs?" and "Who gets alerted if the server dies?". That is the start of SRE.
Q. Do I need to be good at Math?
A. Basic statistics help. We use Percentiles (P95, P99) much more than Average. Averages are easily skewed by outliers.
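A quick example shows why. The latencies below are invented: 98 fast requests and 2 very slow ones. The sketch uses Python's standard `statistics` module.

```python
import statistics

# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies = [100] * 98 + [10_000] * 2

mean = statistics.mean(latencies)
median = statistics.median(latencies)
# quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
p99 = statistics.quantiles(latencies, n=100, method="inclusive")[98]

print(f"mean={mean:.0f}ms  median={median:.0f}ms  p99={p99:.0f}ms")
# mean=298ms  median=100ms  p99=10000ms
```

The mean of 298ms describes no real user: almost everyone saw 100ms, and the unlucky few waited 10 seconds. The median and P99 together show both groups.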
9. Conclusion: "Hope is Not a Strategy"
A famous quote from the SRE book.
"Hope is not a strategy."
Praying "Please let there be no errors this deployment" is not engineering.
Engineering is assuming failure, measuring it, and agreeing on acceptable limits.
How about your service?
When a user complains "It's slow", are you answering "It's probably just you"?
Open Grafana right now and check your SLIs.