I Deleted Production with One Click: Why IaC Matters
1. The Tragedy of Friday 5 PM
'That incident', which every developer experiences at least once, happened on a peaceful Friday afternoon.
Getting ready to leave work, I was cleaning up unused test servers to save AWS costs.
"Hmm, test-db-01, dev-api-server... gotta delete them all."
I checked the boxes in the AWS console and clicked the 'Terminate' button.
Feeling refreshed, I opened Slack, but notifications started going crazy.
"🚨 [Critical] Production Database Connection Failed"
"Customers say they can't log in!!"
"Payment Error 500 spiking!!"
Cold sweat ran down my spine. I checked the AWS console again.
Oh no. What I deleted was not test-db-01 but prod-db-01.
I made a mistake because the names were similar.
It took me a full 3 hours to restore it. The whole ordeal made painfully clear how dangerous manual infrastructure management really is.
And I learned the hard way.
"Human fingers cannot be trusted. Infrastructure must never be managed with clicks."
2. Limits of ClickOps
When we write code, we meticulously manage versions with Git. Who changed what and when is all recorded.
But why do we manage infrastructure with mouse clicks (ClickOps)?
Creating infrastructure manually in the AWS console leads to these problems:
- Human Error: Mistakes such as deleting the wrong server, as I did, or opening a Security Group port to the whole internet (0.0.0.0/0).
- Not Reproducible: "Uh, make another server exactly like the one we made last year." -> Nobody can remember which options were on or what the VPC settings were.
- No History: "Who opened the firewall?" No one knows. You have to dig through logs to piece it together.
The way to solve all of these problems at once is to manage Infrastructure as Code (IaC).
3. Terraform: Blueprints for Infrastructure
There are many IaC tools, but I love Terraform the most. (Of course, Pulumi or CloudFormation are great too.)
If you define the Desired State of your infrastructure in code, Terraform works out the changes needed to reach that state and applies them.
# main.tf
resource "aws_instance" "app_server" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
  tags = {
    # Hardcoding the name in code prevents confusion.
    Name = "Production-Server"
  }
}
resource "aws_s3_bucket" "log_bucket" {
  bucket = "my-company-logs"
  # New buckets are private by default; the inline acl argument is deprecated in AWS provider v4 and later.
}
Now, instead of "Please make a server", just type two commands in the terminal.
terraform plan: "I'm going to build it like this, want to check?" (Preview)
terraform apply: "Good, make it happen!" (Apply)
This process is like a Code Review.
"Uh? David, there's code changing the production DB instance type here? Is this intended?"
A colleague can review simply by looking at the terraform plan result. My mistake (DB deletion) would have been caught here.
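Review catches mistakes in code, but the code itself can also carry a guard. Here is a minimal sketch, assuming prod-db-01 is an EC2 instance managed by a hypothetical resource named prod_db (the resource name and the reuse of the placeholder AMI are my assumptions, not the original setup). It combines EC2 termination protection with Terraform's prevent_destroy:

```hcl
resource "aws_instance" "prod_db" {
  ami           = "ami-0c55b159cbfafe1f0" # placeholder AMI from the example above
  instance_type = "t2.micro"

  # EC2 termination protection: AWS rejects Terminate requests,
  # including ones from the console, until this is switched off.
  disable_api_termination = true

  lifecycle {
    # Terraform refuses any plan that would destroy this resource.
    prevent_destroy = true
  }

  tags = {
    Name = "prod-db-01"
  }
}
```

With both in place, deleting the server takes two deliberate code changes rather than one stray checkbox.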
4. Secret of State File: Rule of Two
The most confusing thing when first encountering IaC is the State File (terraform.tfstate).
Terraform records what is actually deployed on AWS in this JSON file. It's like a 'Ledger'.
But what if Team Member A and Team Member B run Terraform at the same time?
If A is editing the ledger and B writes over it (Race Condition), the infrastructure becomes a mess.
So Remote State Backend and Locking features are essential.
Fantasy Combo of S3 and DynamoDB
Usually, when using AWS, it's configured like this:
- Storage (S3): Save the tfstate file in a secure S3 bucket. All team members share this one file.
- Locking Device (DynamoDB): When someone runs terraform apply, Terraform marks "Work in Progress" (Lock) in the DynamoDB table. Others have to wait until the work is done.
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "ap-northeast-2"
    dynamodb_table = "terraform-locks" # Table for locking
    encrypt        = true
  }
}
Now there is no need to shout on Slack, "Hey, I'm deploying, don't touch it!" Terraform blocks it automatically.
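The lock table itself has to exist before the backend can use it. A minimal sketch of how it could be defined, assuming it lives in a small separate bootstrap configuration (the resource name terraform_locks and the on-demand billing mode are my choices, not from the original setup):

```hcl
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks" # must match dynamodb_table in the backend block
  billing_mode = "PAY_PER_REQUEST" # pay per lock operation, no capacity planning
  hash_key     = "LockID"          # the S3 backend expects this exact key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping this in its own configuration avoids the chicken-and-egg problem of a backend that depends on a table it is supposed to create.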
5. Drift: Catching Secret Changes
Even with IaC, sometimes someone (usually a rushed boss or dev) goes into the AWS console and secretly changes settings.
"I was in a rush so I opened the security group port for a sec."
This is called Drift. Code says closed, but reality says open—a dangerous state.
Terraform catches this amazingly well.
$ terraform plan
# Result:
# aws_security_group.web_sg will be updated in-place
~ ingress {
- cidr_blocks = ["0.0.0.0/0"] # Discovered someone secretly added this!
+ cidr_blocks = ["10.0.0.0/16"] # Will revert to original code
}
You can build a system that runs terraform plan every morning (Cronjob) to monitor if the code and actual infrastructure differ. This is called Drift Detection.
Thanks to this, there is no need to play the "Who changed it?" blame game.
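For reference, the code side of that diff might look something like the sketch below. The plan output above only shows the cidr_blocks value, so the port, the description, and the vpc_id variable are assumptions for illustration:

```hcl
variable "vpc_id" {
  type        = string
  description = "ID of the VPC the security group belongs to (hypothetical input)"
}

resource "aws_security_group" "web_sg" {
  name   = "web-sg"
  vpc_id = var.vpc_id

  ingress {
    description = "HTTPS from inside the VPC only" # assumed rule
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"] # the value Terraform wants to restore
  }
}
```

Because this block is the declared truth, any rule someone adds by hand in the console shows up as a difference on the next plan.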
6. Of Course, IaC Isn't Free
IaC is not a magic wand. The Learning Curve is quite steep.
You have to learn a new language called HCL (HashiCorp Configuration Language), and think about Module structure.
And "Importing Existing Infrastructure" is truly painful.
To move AWS resources already made by hand into Terraform code, you have to use the terraform import command, which is quite a grind. (The import block added in Terraform 1.5 made it a bit easier.)
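As a rough sketch of the newer approach, an import block declares which existing resource should be adopted under which address. The resource name legacy_app and the instance ID are placeholders:

```hcl
# imports.tf
# Adopt an instance that was created by hand in the console.
import {
  to = aws_instance.legacy_app
  id = "i-0123456789abcdef0" # placeholder instance ID
}

# Running the following asks Terraform to draft the matching resource block:
#   terraform plan -generate-config-out=generated.tf
# The generated file is a starting point and usually needs cleanup before apply.
```

The generation step only works while aws_instance.legacy_app is not yet defined anywhere in the configuration.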
But it is worth enduring that pain.
When a server blows up at 3 AM, you can calmly restore it with a single terraform apply instead of clicking through the AWS console with shaking hands. That peace of mind alone is reason enough to do IaC.
7. Conclusion: Infrastructure is 'Cattle', Not 'Pets'
There is a famous analogy in the DevOps world.
"Treat servers like Cattle, not Pets."
- Pets: You give them names (Prince, Princess), and nurse them carefully when sick. (Manual Management)
- Cattle: You manage them by numbers (cow-01, cow-02), and replace them quickly if issues arise. (Automated Management)
My prod-db-01 server was a Pet. Because I tweaked settings one by one with care, it became an entity no one but me could touch. That's why I despaired so much when it was deleted.
Now my servers are cattle. I can stamp out identical ones anytime with Terraform code.
Even if a fire burns down the data center, I will sip my coffee and fire a terraform apply to another Region.
Is your infrastructure a pet or cattle?
Don't bet your life on a single click anymore. Record with code, protect with code.
title_en: "I Deleted the Production Server with One Click: Why We Need IaC"
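description: "Have you ever accidentally deleted a production server from the AWS console? (You really shouldn't have...) I share the terror of managing infrastructure by hand and how I adopted Terraform to build 'Infrastructure as Code (IaC)' to solve it."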
description_en: "Have you ever accidentally deleted a production server on the AWS console? The fear of manual infrastructure management is real. I share my nightmare experience and how I adopted Terraform to build 'Infrastructure as Code (IaC)', turning chaos into version-controlled stability."
date: "2025-08-23"
tags: ["DevOps", "IaC", "Terraform", "AWS", "Infrastructure"]
category: "devops"
published: true
coverImage: "/images/blog/iac/cover.png"
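1. "Huh? Wasn't This the Right Button?"
In the early days of our startup, I was an 'AWS Console Master'.
Creating EC2 instances, configuring Security Groups, attaching load balancers: I did all of it with mouse clicks.
One day, I set out to clean up the development servers.
"We're not using this one, right?" I thought, and clicked the Terminate button.
Three seconds later, alarms started going off in Slack.
[Critical] Production Database Connection Failed
Cold sweat ran down my spine.
What I had deleted was not a development server but the NAT gateway of the private subnet connected to the production DB.
I had mixed them up because their names looked alike.
The recovery was even worse.
"What IP range did I give that subnet?" "What were the route table settings?"
Redoing every click from memory, it took three hours to restore service.
After this incident, I made up my mind.
"I will never touch infrastructure with a mouse again."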
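2. Writing Infrastructure as Code? (IaC)
So I adopted IaC (Infrastructure as Code).
Put simply, the idea is: "Let's write server configuration as text files and manage it like program code."
For the tool, I chose Terraform.
It is the most widely used, and it supports other clouds as well as AWS.
3. Drawing Infrastructure with Terraform
It felt awkward at first. "Why bother writing code when I can just click?"
But once I started writing the code, a whole new world opened up.
Example: Creating a single EC2 server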
# main.tf
provider "aws" {
  region = "ap-northeast-2" # Seoul region
}
resource "aws_instance" "app_server" {
  ami           = "ami-0e1d09d8b7c751816" # Amazon Linux 2
  instance_type = "t3.micro"
  tags = {
    Name = "MyAppServer"
    Env  = "Production"
  }
}
What happens when you write this code and run terraform apply?
Terraform calls the AWS API for you and creates the server, just like that.
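3.5. Deep Dive: The Secret of Terraform State (tfstate)
Terraform stores the current state of your infrastructure in a file called terraform.tfstate.
This file serves as the map between "the real world (AWS)" and "my code (HCL)".
Why does it matter? (State Management)
- Performance: For a large infrastructure, querying the AWS API for every resource on every run is too slow. The state file acts as a cache of what already exists: "Ah, there are already three EC2 instances."
- Dependencies: It stores resource outputs such as instance IDs and IP addresses so that other resources can reference them.
Never keep it local (Remote Backend)
When you work alone, terraform.tfstate can live on your laptop. But on a team project?
- Locking: What if B fires off apply while A is still making changes? The state gets corrupted. To prevent this, you need a lock, for example with DynamoDB.
- Secrets: The state file stores sensitive information such as DB passwords in plain text. Pushing it to GitHub would be a disaster.
So in practice, state is managed in a remote backend using the S3 (encryption) + DynamoDB (locking) combination.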
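3.9. Tool Comparison: Terraform vs Ansible vs Pulumi
Many people ask, "Then what is Ansible?"
- Terraform: An infrastructure provisioning tool. Its role is "creating the servers". (Creates AWS, Azure, and GCP resources)
- Ansible: A configuration management tool. Its role is "going into the servers that were created and configuring them". (Installing Nginx, editing config files)
- Pulumi: Similar to Terraform, but it uses general-purpose languages such as Python and TypeScript instead of HCL. It is developer-friendly, but it can make communication with the infrastructure team harder.
The current trend is the immutable infrastructure approach: "build the servers (the skeleton) with Terraform, bake the server image (AMI) ahead of time with Packer, and deploy directly without Ansible". Ansible's role is gradually shrinking.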
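3.95. Frequently Asked Questions (FAQ)
Q: Can't resources already created in the AWS console be brought into Terraform?
A: They can. Use the terraform import command. The process is fairly painful, though (it got somewhat better with the import block added in Terraform 1.5). Starting with IaC from day one is better for your sanity.
Q: If the Terraform state file is deleted, can it be recovered?
A: If you enabled S3 Versioning, yes. If not... the AWS resources are still alive, but Terraform will think "Huh? There are no resources?" and try to create everything again. A catastrophe. Backups are mandatory.
Q: Should I use IaC even for a small project?
A: For a "one-off project", you can skip it. But if "it still has to be maintained three months from now", you should use it. Don't trust the you of three months from now. Trust the code.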
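4. The Peace IaC Brought
After adopting IaC, my life changed like this.
4.1. History Is Kept (Git)
In the past, when someone asked "Who opened port 8080 on the security group?", there was no way to find the culprit. The AWS console logs are hard to read.
Now I just look at the Git commit log.
commit: Allow 8080 port for debugging (Author: Harry)
Who changed it, when, and why is fully traceable. If something goes wrong, run git revert and redeploy.
4.2. It Is Reproducible (Reproducibility)
Creating a "development server" and a "production server" with identical environments became very easy.
You only need to tweak a variable in the code.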
variable "environment" {
  default = "dev" # change to "prod" for production
}
resource "aws_s3_bucket" "b" {
  bucket = "my-app-${var.environment}"
}
Now excuses like "It works on my machine but not in production" no longer fly, because the environments are identical.
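One small addition that can make this pattern safer is a validation rule, so a typo like "prd" fails at plan time instead of quietly creating a third environment. A minimal sketch, assuming only "dev" and "prod" are valid values (that list is my assumption):

```hcl
variable "environment" {
  type    = string
  default = "dev"

  validation {
    # Reject anything that is not one of the known environments.
    condition     = contains(["dev", "prod"], var.environment)
    error_message = "The environment value must be \"dev\" or \"prod\"."
  }
}

resource "aws_s3_bucket" "b" {
  bucket = "my-app-${var.environment}"
}
```

Production would then be selected with terraform apply -var="environment=prod" or a prod.tfvars file rather than by editing the default.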
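4.3. Changes Get Reviewed
Before, I changed settings on my own, in secret (?), but now I have to open a Pull Request (PR).
When I open a PR in code saying "I'm going to increase the DB storage", a reviewer checks the content and approves it.
Infrastructure changes are no longer something to fear; they have become part of the team's collaboration.
5. Of Course, There Are Downsides
IaC is not a cure-all.
- Learning curve: You have to learn HCL (the Terraform language) syntax. Developers pick it up quickly, but it can be unfamiliar to system engineers.
- State management: Terraform stores the current infrastructure state in a file called tfstate. If this file gets tangled or lost... the AWS console and Terraform fall out of sync and disaster follows. (Backing it up in S3 is a must!)
- Importing existing infrastructure: Bringing servers that were created by hand into Terraform (terraform import) is quite painful.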
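6. Wrapping Up: Throw Away the Mouse
Are you still logging in to the AWS console and clicking the 'Launch instance' button?
That is like submitting your work reports as handwritten letters in 2025.
IaC is not just a tool. It is a change in the philosophy of handling infrastructure.
Don't raise servers one by one with names like "pets"; mass-produce and manage them with code like "cattle".
I hope you get to feel the thrill of recovering with a single terraform apply, and a smile, even after accidentally wiping out a server.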
I Deleted the Production Server with One Click: Why We Need IaC
1. "Wait, Was It the Wrong Button?"
In the early startup days, I was the 'AWS Console Master'.
Launching EC2 instances, configuring Security Groups, attaching Load Balancers—I did it all with mouse clicks.
One day, I was cleaning up development servers.
"We're not using this one, right?" I clicked Terminate.
Three seconds later, Slack alarms started ringing.
[Critical] Production Database Connection Failed
Cold sweat ran down my spine.
What I deleted wasn't a dev server, but the NAT Gateway in the private subnet connected to the Production DB.
I got confused because they had similar names.
The recovery was even more horrific.
"What was the Subnet IP range?" "What was the Route Table setting?"
Relying on memory to click through the console again, it took 3 hours to restore service.
After this incident, I swore an oath.
"I will never touch infrastructure with a mouse again."
2. Coding Infrastructure? (IaC)
So we adopted IaC (Infrastructure as Code).
Simply put, "Let's manage server configurations as text files, just like programming code."
We chose Terraform as our tool.
It is the most widely used, and it supports not only AWS but other clouds as well.
3. Terraform: Blueprints for Infrastructure
There are many IaC tools (AWS CloudFormation, Ansible, Pulumi), but I love Terraform the most.
Its declarative nature allows you to define the Desired State, and Terraform figures out how to reach it.
Example: Creating one EC2 Server
# main.tf
provider "aws" {
  region = "ap-northeast-2" # Seoul Region
}
resource "aws_instance" "app_server" {
  ami           = "ami-0e1d09d8b7c751816" # Amazon Linux 2
  instance_type = "t3.micro"
  tags = {
    Name = "Production-Server"
    Env  = "Production"
  }
}
resource "aws_s3_bucket" "log_bucket" {
  bucket = "my-company-logs-2025"
  # New buckets are private by default; the inline acl argument is deprecated in AWS provider v4 and later.
}
Now, instead of "Please make a server", just type two commands in the terminal.
terraform plan: "I'm going to build it like this, want to check?" (Preview)
terraform apply: "Good, make it happen!" (Apply)
This process is like a Code Review for infrastructure.
"Uh? David, there's code changing the production DB instance type here? Is this intended?"
A colleague can review simply by looking at the terraform plan result. My previous mistake (DB deletion) would have been caught here instantly.
4. The Secret of State File: The "Rule of Two"
The most confusing and critical concept when first encountering Terraform is the State File (terraform.tfstate).
Terraform records what is actually deployed on AWS in this JSON file. It acts as the 'Ledger' or 'Map' of your infrastructure.
When you run terraform plan:
- Terraform reads the tfstate file to know what it last recorded as deployed.
- It calls the AWS API to see what is actually there and refreshes that record.
- It compares your code with the refreshed state and shows the difference as the plan.
The Concurrency Nightmare
But what if Team Member A and Team Member B run Terraform at the same time?
If A is editing the ledger (adding a server) and B writes over it (deleting a database), the state file gets corrupted.
This is a classic Race Condition. If the state file is corrupted, Terraform loses track of your infrastructure. You might have to manually import hundreds of resources again.
The Fantasy Combo: S3 and DynamoDB
To solve this, we use a Remote State Backend with Locking.
- Storage (S3): We save the tfstate file in a secure, versioned AWS S3 bucket. All team members' Terraform clients point to this single source of truth.
- Locking Device (DynamoDB): We use a DynamoDB table to handle locks. When Terraform starts running, it writes a LockID to the table. If anyone else tries to run Terraform, it checks the table, sees the lock, and fails with "Error acquiring the state lock", along with details of who holds the lock.
# backend.tf
terraform {
  backend "s3" {
    bucket = "my-company-terraform-state-prod"
    key    = "network/terraform.tfstate"
    region = "ap-northeast-2"
    # This is the magic line for locking
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
Now there is no need to shout on Slack, "Hey, I'm deploying, don't touch it!" Terraform enforces serial execution automatically.
5. Drift: Catching Secret Changes
Even with IaC, we are human. Sometimes someone (usually a rushed boss or a developer debugging an issue) goes into the AWS console and secretly changes settings manually.
"I was in a rush so I opened the security group port 22 for a sec to SSH in."
And they forget to close it.
This is called Drift.
- Code says: Port 22 is CLOSED.
- Reality says: Port 22 is OPEN.
This difference is a massive security vulnerability.
Terraform catches this amazingly well during the refresh phase.
$ terraform plan
# Terraform detected the following changes made outside of Terraform since the last "terraform apply":
# aws_security_group.web_sg has been changed
~ ingress {
- cidr_blocks = ["0.0.0.0/0"] # Terraform discovered someone secretly added this!
+ cidr_blocks = ["10.0.0.0/16"] # Will revert to original code
}
Automated Drift Detection
You can build a system that runs terraform plan -detailed-exitcode every morning (via Cronjob or GitHub Actions).
If the exit code is 2, it means there is a diff (Drift). Trigger a Slack alert:
"🚨 Infrastructure Drift Detected! Someone manually modified the Security Group web_sg. Please investigate."
Thanks to this, there is no need to play the "Who changed it?" blame game. The code is the law.
6. Automating with CI/CD (GitHub Actions)
Running terraform apply from your laptop is still dangerous.
What if your Wi-Fi cuts out in the middle of an update? The state file could be left in a half-written, broken state.
What if you have local admin keys that get stolen?
The best practice is to remove AWS keys from developer laptops and automate this via CI/CD (GitHub Actions, GitLab CI).
- Pull Request (Plan): When a PR is opened, GitHub Actions runs terraform plan and posts the output as a comment on the PR.
- Review: Team leads verify: "Oh, you are changing the instance type from t3.micro to m5.large. This will cost $50 more. Approved."
- Merge (Apply): When the PR is merged to the main branch, GitHub Actions automatically runs terraform apply.
This is GitOps.
- Version Control: Every infrastructure change is a commit.
- Code Review: No change happens without approval.
- Audit Trail: We know exactly who changed what and when.
6.5. Terraform Best Practices: Structure and Security
Avoid spaghetti code. Follow these rules to keep your sanity.
- Isolate Environments: Separate dev, stage, and prod into different folders. If you put everything in one main.tf, you risk destroying Prod while tweaking Dev. (Workspaces exist, but folders are safer.)
- Modularize: Bundle related resources (EC2 + SG + IAM) into a Module. Use it like a function: module "web_cluster" { ... }. Reuse code, save time.
- Secrets in State: Never commit .tfstate to Git. It contains sensitive data (DB passwords) in plain text. Always use Remote Backend (S3) with encryption.
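To make the first two rules concrete, here is a rough sketch of a module and the prod folder that calls it. Two files are shown together in one block; the folder layout, the variable names, and the tag scheme are all hypothetical:

```hcl
# modules/web_cluster/main.tf (hypothetical module)
variable "environment" {
  type = string
}

variable "ami_id" {
  type = string
}

variable "instance_count" {
  type    = number
  default = 1
}

resource "aws_instance" "web" {
  count         = var.instance_count
  ami           = var.ami_id
  instance_type = "t3.micro"

  tags = {
    Name = "web-${var.environment}-${count.index + 1}"
    Env  = var.environment
  }
}

# envs/prod/main.tf (hypothetical root configuration for prod)
module "web_cluster" {
  source = "../../modules/web_cluster"

  environment    = "prod"
  ami_id         = "ami-0e1d09d8b7c751816" # AMI from the earlier example
  instance_count = 2
}
```

An envs/dev/main.tf would call the same module with its own values and its own state, so a destroy in dev cannot reach prod resources.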
6.8. Terraform vs Ansible vs Pulumi
- Terraform: Provisioning tool. Best for creating resources like VPCs, EC2s, RDS. (Declarative)
- Ansible: Configuration tool. Best for installing software (Nginx, Docker) inside existing servers. (Procedural)
- Pulumi: Infrastructure as Software. Use TypeScript/Python instead of HCL. Great if you hate learning domain-specific languages.
A common pattern is: Use Terraform to build the server, and Ansible to configure it.
6.9. Deep Dive: GitOps and Automation (Atlantis)
Running terraform apply from your laptop is fine for solo projects, but it's a ticking time bomb for teams.
If someone applies changes with an outdated local state, they might overwrite recent updates or cause conflicts.
Enter GitOps and Atlantis.
What Atlantis Does
It's a CI/CD Bot specifically for Terraform.
- Developer opens a Pull Request (PR).
- Atlantis bot automatically runs terraform plan and posts the output as a PR comment.
- Team lead reviews and approves, then comments atlantis apply on the PR.
- Atlantis executes apply on the server and posts the result. With automerge enabled, it also merges the PR.
Developers don't need AWS keys on their local machines anymore. Only the Atlantis server holds the keys. This secures your credentials and standardizes the deployment workflow.
6.95. Philosophy: Immutable Infrastructure
The ultimate goal of IaC is Immutable Infrastructure.
Logging into a server to run apt-get update or tweak configs is Mutable management. Over time, servers drift apart (Configuration Drift), creating "Snowflake Servers".
Immutable Infrastructure means "We don't fix servers; we replace them."
Need a security patch? Bake a new machine image (AMI) with the patch, terminate the old instances, and launch new ones (Blue/Green Deployment).
This eliminates phantom bugs like "It worked yesterday, why is it broken today?" because every server is an exact clone of the code.
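In Terraform terms, a simple version of "replace, don't fix" could look like the sketch below. It assumes the freshly baked AMI ID is passed in as a variable (ami_id is a hypothetical input), and it is a single-instance replacement rather than a full blue/green setup:

```hcl
variable "ami_id" {
  type        = string
  description = "ID of the freshly baked AMI, for example one built with Packer"
}

resource "aws_instance" "web" {
  # Changing the AMI forces Terraform to replace the instance instead of patching it.
  ami           = var.ami_id
  instance_type = "t3.micro"

  lifecycle {
    # Launch the replacement first, then terminate the old instance.
    create_before_destroy = true
  }
}
```

Rolling out a patch then means baking a new image and applying with the new ami_id, with no SSH session involved.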
7. The Pain Points: IaC Isn't Free
IaC is not a magic wand. It comes with trade-offs.
- High Learning Curve: You need to learn HCL. It's declarative, which is different from Python or JS. Loops and conditionals (for_each, dynamic blocks) can get tricky.
- The "Import" Nightmare: Bringing existing, manually created servers into Terraform (terraform import) is painful. You have to map every single attribute of your existing AWS resources to code manually. It requires immense patience.
- State Management Risks: If you accidentally delete the S3 bucket containing your state file, you are in big trouble. Terraform will think nothing exists and try to create everything again (duplicate resources). Always enable S3 Versioning and MFA Delete on your state bucket.
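Versioning on the state bucket can itself be declared in code. A minimal sketch, assuming the bucket is managed from a separate bootstrap configuration and reusing the bucket name from the backend example (MFA Delete is left out here because it needs extra account-level setup):

```hcl
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-company-terraform-state-prod"
}

# Keep every previous version of the state file so an overwrite
# or accidental delete of the object can be rolled back.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

This protects against losing or corrupting the state object; protecting the bucket itself is what MFA Delete and tight IAM policies are for.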
8. Conclusion: Throw Away Your Mouse
Are you still logging into AWS Console and clicking 'Launch Instance'?
That's like writing business reports with a quill pen in 2025. It works, but it's not scalable, and it's prone to ink spills.
IaC isn't just a tool adoption. It's a shift in philosophy.
Stop treating servers like "Pets" (naming them 'Prince', 'Princess', and nursing them when sick).
Start treating them like "Cattle" (managing them by numbers 'web-01', 'web-02', and replacing them instantly via code if they fail).
My prod-db-01 server was a Pet. Because I tweaked settings manually for years, it became a unique snowflake that no one else could reproduce.
Now, my servers are cattle. Even if a fire burns down the data center (ap-northeast-2), I will sip my coffee, change the region variable to us-east-1 in my code, and fire a terraform apply.
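One catch with that plan: AMI IDs are specific to a region, so a hardcoded ami value would not resolve in us-east-1. A sketch of one way around it, assuming the region is a variable and the image is looked up by name (the variable name and the name filter are illustrative choices):

```hcl
variable "region" {
  type    = string
  default = "ap-northeast-2"
}

provider "aws" {
  region = var.region
}

# Look up the newest Amazon Linux 2 AMI in whichever region the provider targets.
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

resource "aws_instance" "app_server" {
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.micro"
}
```

Switching regions then comes down to terraform apply -var="region=us-east-1", without editing any resource block.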
Is your infrastructure a Pet or Cattle?
Don't bet your career on a single mouse click. Record with code, protect with code.