# Why Your Server Crashes After Launch: A Practical Guide to Load Testing
1. "The Feature Works Fine — So Why Does the Server Die?"
Load testing first crossed my mind when I thought about a simple question:
If my service gets real traffic, will it hold?
Stories like this are common. A link gets posted on a forum, a marketing campaign lands, or something goes viral on social media. Suddenly, the server that handled a handful of concurrent users is overwhelmed by hundreds or thousands. Functional tests all passed. But the moment traffic spikes, the server starts throwing 502 Bad Gateway errors.
"Can't login."
"Page is blank."
Open the logs and you'll see Connection Timed Out flooding the output with CPU usage pegged at 100%. If you spent money on ads to drive that traffic, it's gone. That's the cost of skipping load testing.
This is what load testing is designed to prevent. Let me break it down.
## 1.5. Performance Is a Feature
There is a famous saying by Jeff Atwood: "Performance is a Feature."
It's not just a nice-to-have; it's a core requirement.
Amazon found that every 100ms of latency cost them 1% in sales.
Google found that an extra 0.5 seconds in search generation time dropped traffic by 20%.
My service crashing wasn't just a technical glitch; it was a business disaster.
Users don't care if your code is clean or if you used the latest framework.
They only care if the page opens when they click it.
If it takes more than 3 seconds, they leave. And they never come back.
## 2. The Functionality Was Perfect. So Why Did It Crash?
QA passed. No bugs. I tested the login flow over 100 times manually.
The catch: "me testing alone, 100 times in a row" is completely different from "100 people, simultaneously, once each".
Only functional testing was done. Load testing was skipped.
It's like building a sturdy bridge, verifying one truck can pass, but never checking whether it collapses when 100 trucks cross at once.
## 2.5. What a Server Crash at Peak Moment Actually Costs
It's not just a technical issue. The timing matters.
- Marketing Budget Wasted: If the server dies exactly when ad-driven traffic arrives, every dollar spent on ads is gone.
- Negative First Impression: Comments like "What is this? Can't even load" start appearing right at launch.
- Opportunity Cost: The golden window at launch — when attention is highest — gets squandered.
Performance problems don't stay technical. They become business problems quickly, and there are plenty of documented examples of this.
## 3. Tool Choice: JMeter vs. k6
Looking for load testing tools, I found two candidates.
- JMeter: Java-based, long history, has a GUI. But its XML configuration is complex and heavy.
- k6: Go-based, scripted in JavaScript, light and fast.
As a developer, I naturally gravitated toward k6.
The idea of writing test scenarios in code was too attractive to pass up.
### 3.1. Why I Ditched JMeter (Comparison)
There are many tools, but my criteria were clear:
"Can I version control it with Git?"
"Can my colleagues use it without a steep learning curve?"
| Feature | JMeter | k6 | nGrinder | Gatling |
|---|---|---|---|---|
| Language | GUI / XML | JavaScript / TS | Groovy | Scala / Java |
| Execution | Thread based | Goroutine (Light) | Thread based | Akka Actor |
| DevOps | Painful (XML Hell) | Excellent (CLI) | Average | Good |
| Report | HTML Generation Setup | CLI / Dashboard | Built-in Web UI | HTML Generation |
JMeter is a great tool, but its XML-based configuration is a nightmare for Git conflicts.
JMeter config files (.jmx) can grow into 5 MB XML monstrosities.
When a Git merge conflict hits, tracking down mismatched XML tags can take hours, and teams often end up just overwriting the file.
k6, on the other hand, uses JavaScript, which every web developer knows, and its code-as-configuration approach makes code reviews a breeze. A merge conflict is resolved in 5 minutes. That is real productivity.
## 4. Scenario Scripting: Simulating 100 Virtual Users
The first thing I did was to translate "Real User Actions" into code:
- Visit Main Page
- Login (API Call)
- List Products
- View Product Detail
```javascript
// script.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 100 }, // Ramp up from 0 to 100 users over 1 minute
    { duration: '3m', target: 100 }, // Stay at 100 for 3 minutes
    { duration: '1m', target: 0 },   // Ramp down
  ],
};

export default function () {
  // 1. Visit the main page
  const res = http.get('https://my-service.com');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // Humans pause

  // 2. Login (object bodies are sent as form data)
  const loginRes = http.post('https://my-service.com/api/login', {
    email: 'user@test.com',
    password: 'password123',
  });
  check(loginRes, { 'logged in': (r) => r.status === 200 });
  sleep(2);
}
```
### 4.1. Anatomy of the Script
* **`stages`**: Ramps up traffic gradually. Real traffic doesn't hit instantly. It grows from 0 to 100 over a minute (Ramp-up).
* **`check`**: Assertions. If status is not 200, k6 marks it as a failure.
* **`sleep`**: The most critical part. Real users don't click every millisecond. They read, scroll, and think. We simulate this **"Think Time"** to measure realistic load.
Running this (`k6 run script.js`) drew a graph in my terminal instantly.
### 4.2. Why 100 Users? (Little's Law)
Some might ask, 'Is 100 users enough?'
Here is my calculation based on **Little's Law**.
* **Target DAU:** 10,000
* **Peak Traffic:** 30% of users visit between 8 PM - 9 PM. (3,000 users / hour)
* **Avg Session Time:** 5 minutes.
**Concurrent Users (N) = Arrival Rate (λ) * Response Time (W)**
(Or simplified: Users per hour * Session duration)
3,000 * (5 / 60) = **250 Concurrent Users**.
So 100 was actually too low for the final goal, but good for a starting smoke test.
Later, I scaled this to 1000 to handle marketing spikes.
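As a quick sanity check, here is that same arithmetic as a throwaway script (the numbers are just the assumptions listed above):

```javascript
// Little's Law: N = lambda * W (concurrent users = arrival rate * time in system)
const usersPerHour = 3000;   // peak: 30% of 10,000 DAU within one hour
const sessionHours = 5 / 60; // average session time of 5 minutes
console.log(usersPerHour * sessionHours); // 250 concurrent users
```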
---
## 5. Not Just "Load": The 4 Types of Load Testing
Writing a script and blindly firing 1000 users is wrong. You must classify based on purpose.
### 1. Smoke Test
This is the most basic step. Launching 1000 users right after writing a script is suicide.
Run with **1 VU for 1 minute**. It verifies that the script logic is correct and that the server responds at all. It's like checking whether the car engine starts.
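A minimal sketch of that configuration:

```javascript
export const options = {
  vus: 1,         // a single virtual user
  duration: '1m', // just long enough to prove the script logic works
};
```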
### 2. Load Test
This is what we usually mean by load testing. It verifies if the system can handle **"Normal Traffic"** or **"Target Traffic"**.
For example, "Can we handle 100 concurrent users for 30 minutes?" It's a mandatory gate before any deployment.
### 3. Stress Test
The goal here is **"Destruction"**.
We want to find the **Breaking Point**. We increase the load incrementally: 100 -> 500 -> 1000 -> 2000...
Eventually, CPU hits 100% or errors flood. That point is your system's current capacity limit. Knowing this limit is crucial for capacity planning.
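As a sketch, that incremental ramp expressed as k6 stages (the step sizes and durations here are illustrative):

```javascript
export const options = {
  stages: [
    { duration: '2m', target: 100 },  // warm up
    { duration: '2m', target: 500 },
    { duration: '2m', target: 1000 },
    { duration: '2m', target: 2000 }, // keep climbing until something breaks
  ],
};
```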
### 4. Soak Test
As the name implies, we "soak" the system. We apply moderate load for a **long duration (12+ hours)**.
This is excellent for finding **Memory Leaks**, **Disk Space Leaks**, or **Connection Leaks** that don't show up in short tests. We usually run this overnight or over the weekend.
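A soak-test config looks almost boring, which is the point: moderate load, held for hours (numbers are illustrative):

```javascript
export const options = {
  stages: [
    { duration: '5m', target: 100 },  // gentle ramp-up
    { duration: '12h', target: 100 }, // hold overnight to surface leaks
    { duration: '5m', target: 0 },    // ramp down
  ],
};
```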
For this launch, I focused on **#2 (Load)** to ensure stability and **#3 (Stress)** to determine the server size.
### 5.1. Top 3 Beginner Mistakes
Here are the most common pitfalls I've seen.
1. **Running on Localhost:** Your laptop will crash before the server does. Or your office WiFi will choke. Always run from a separate machine like AWS EC2.
2. **Attacking Production:** Running load tests on live production without warning is not testing; it's a **DDoS Attack**. Use Staging or an isolated environment.
3. **Ignoring Think Time:** Looping without `sleep()` creates unrealistic load. Real users click, wait, and think for at least 1-3 seconds.
---
## 6. Finding Bottlenecks: The Culprit Was DB
The results were disastrous.
When concurrent users crossed 50, **Response Time (p95)** spiked from 200ms to **3 seconds**.
Error rate surged from 1% to 20%.
Analyzing causes, I found two bottlenecks.
### 1. The N+1 Problem (Why DB Screamed)
Checking APM (Datadog), a single `GET /api/products` API triggered **101 SQL queries**.
It was fetching the `product` table, then looping through each product to fetch `image` data individually.
* 1 User: 101 Queries (Fast)
* 100 Users: **10,100 Queries** (DB CPU Melting)
**Solution:** Used JPA `fetch join` to retrieve everything in a single query.
```java
// Before: N+1. The lazy @OneToMany association triggers one extra
// query per product to load its images.
@OneToMany
List<Image> images;

// After: Fetch Join retrieves products and their images in one query.
// (Use SELECT DISTINCT to avoid duplicate parents when joining a collection.)
@Query("SELECT p FROM Product p JOIN FETCH p.images")
List<Product> findAllWithImages();
```
### 2. DB Connection Pool Exhaustion (Chain Reaction)
Spring Boot's default HikariCP pool size is 10.
With 100 concurrent users, 10 get connections and the other 90 are blocked, waiting for one to free up.
This led to Connection Timeout errors and, ultimately, 502 Bad Gateway.
**Solution:**
- Increased `maximum-pool-size` from 10 to 50, sized against the DB specs (see the sketch below).
- Applied Redis caching for read-heavy data to bypass the DB entirely.
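For reference, a sketch of what that looks like in a Spring Boot application.yml (the timeout value is an assumption; size the pool against your DB's max connections):

```yaml
spring:
  datasource:
    hikari:
      maximum-pool-size: 50     # default is 10
      connection-timeout: 3000  # ms to wait for a free connection before failing fast
```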
### 3. The Trap of Averages
The most dangerous metric in load testing reports is Average Response Time.
"Average 1s" doesn't mean "Most users experienced 1s".
If 99 users took 0.1s and 1 user took 100s, the average is roughly 1s.
But for that 1 user, it was a disaster.
That's why you must look at p95 (95th Percentile) or p99.
"The maximum response time experienced by 95% of users" is the real performance metric.
In my case, while the average was 500ms, the p95 exceeded 3 seconds.
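k6 can print percentiles in its end-of-test summary instead of leaving you staring at the average; the `summaryTrendStats` option controls which statistics appear:

```javascript
export const options = {
  // Show percentiles in the summary, not just the average
  summaryTrendStats: ['avg', 'med', 'p(95)', 'p(99)', 'max'],
};
```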
### 4. OS Tuning (The Hidden Enemy)
Fixing server code isn't enough. OS configuration matters.
With high concurrency, you might run out of File Descriptors (FD).
Linux default ulimit is typically 1024. 1000 concurrent users will exhaust this immediately.
I also faced Port Exhaustion, where TIME_WAIT sockets consumed all ephemeral ports, blocking new connections.
Tuning kernel parameters like net.ipv4.tcp_tw_reuse was necessary.
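The exact values depend on your distro and workload, but the checks and knobs look roughly like this:

```bash
# Check the per-process file descriptor limit (often 1024 by default)
ulimit -n

# Raise it for the current shell (persist it via /etc/security/limits.conf)
ulimit -n 65535

# Allow reuse of sockets stuck in TIME_WAIT for new outbound connections
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
```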
## 6.5. What to Monitor? (The Golden Signals)
When load testing, don't just look at pass/fail. Watch these four signals (from the Google SRE Book):
- Latency: Time it takes to serve a request. (p95, p99).
- Traffic: Demand placed on your system (RPS).
- Errors: Rate of requests that fail (5xx).
- Saturation: How "full" your service is (CPU, Memory, I/O).
In my case, Saturation (CPU 100%) caused Errors (502), which spiked Latency (3s). They are all connected.
## 7. After the Fix: Stress-Testing at 1,000 Users
After applying the fixes, I re-ran k6.
This time aiming higher: 1000 Concurrent Users.
Result?
Avg Response Time 150ms, Error Rate 0%.
Server CPU held steady around 40%.
Seeing those numbers made it possible to deploy with confidence.
## 7.5. When Redis Becomes the Bottleneck: Flash Sales
Keep pushing with load tests and you'll find more bottlenecks. A common one: Redis dying during a flash sale.
When 1,000 users simultaneously hit "Give me the coupon!", Redis's single-threaded command processing can't keep up.
**Solution: Lua Script & Rate Limiter**
To reduce Redis round-trips, use a Lua Script to combine "Check Stock + Deduct Stock" into one atomic operation.
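A minimal sketch of that atomic check-and-deduct (the key name is hypothetical; this is the classic pattern, not a production script):

```lua
-- KEYS[1] = stock counter key, e.g. "coupon:stock"
local stock = tonumber(redis.call('GET', KEYS[1]))
if stock and stock > 0 then
  redis.call('DECR', KEYS[1])
  return 1 -- coupon granted
end
return 0 -- sold out
```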
Pair that with Rate Limiting at the Nginx level to cap per-IP request rates.
Adding a "Malicious Bot" scenario to the k6 script helps verify this protection is working.
## 8. Three Principles That Stuck
Here are three things that became clear after working through load testing.
1. **Never trust your "gut feeling".**
   Thinking "it should handle this much, right?" is wishful thinking. Performance that isn't proven by numbers is unverified. Measurement beats assumption every time.
2. **Logs don't lie.**
   When users say "it's not working", without logs you're debugging blind. Real-time log monitoring during load testing is mandatory. The logs are where the bottleneck reveals itself.
3. **Don't skimp on infrastructure.**
   The cost of underprovisioning (lost traffic, failed launches) routinely exceeds the cost of slightly larger server specs. In the early phase, over-provisioning is a valid strategy. Scale down once you have real data.
## 9. Don't Run Manually: CI/CD Automation
Fixing it once doesn't mean it's fixed forever. It can regress in the next deployment.
So I integrated k6 into GitHub Actions.
```yaml
# .github/workflows/load-test.yml
name: Load Test
on: [push]
jobs:
  k6_test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run k6
        uses: grafana/k6-action@v0.3.0
        with:
          filename: script.js
          flags: --vus 50 --duration 1m
```
And I set a "Performance Budget".
"Fail the build if response time exceeds 500ms."
This rule prevented bad code from ever reaching production.
## 9.5. Advanced k6: Constant Arrival Rate
Sometimes you want to test RPS (Requests Per Second) instead of just User Count.
Use the `constant-arrival-rate` executor.
This is perfect for capacity planning (e.g., "Can we handle 1000 RPS?").
```javascript
export const options = {
  scenarios: {
    constant_request_rate: {
      executor: 'constant-arrival-rate',
      rate: 1000,           // Maintain 1000 RPS
      timeUnit: '1s',
      duration: '1m',
      preAllocatedVUs: 100, // Initial pool of VUs
      maxVUs: 200,          // Allow more VUs if responses slow down
    },
  },
};
```
## 9.8. Thresholds: Fail the Build!
Automated testing is useless if you ignore the results.
You must set Thresholds.
```javascript
export const options = {
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% of requests must complete below 500ms
    http_req_failed: ['rate<0.01'],   // error rate must stay below 1%
  },
};
```
If a threshold fails, k6 exits with a non-zero code, and GitHub Actions will block the Pull Request.
This is the only way to protect your production environment from performance regressions.
## 10. Next Step: Distributed Testing & Visualization
So far, we tested from a single machine.
But what if we have 1 Million Users?
A single test server will choke before the target server does. (The k6 process itself becomes the bottleneck.)
This is when you need Distributed Load Testing.
You install k6 on multiple EC2 instances and orchestrate them centrally.
Or use k6 Cloud to fire traffic from 10 different global regions with one click.
Also, integrating the results with Grafana + InfluxDB creates beautiful real-time dashboards that are perfect for showing to executives.
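Streaming results into InfluxDB is a one-flag change (the endpoint here is illustrative), and Grafana then charts that database in real time:

```bash
# Send k6 metrics to a local InfluxDB as the test runs
k6 run --out influxdb=http://localhost:8086/k6 script.js
```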
In the next post, I will cover this advanced setup.
## 10.5. One Last Story: The Regex Disaster
We had a "Search" feature. It worked fine with 10 users.
But with 500 users, the CPU spiked to 100% and stayed there.
k6 showed that only the /api/search endpoint was timing out.
Investigating the code, I found a Catastrophic Backtracking Regex.
A developer wrote a regex to validate email that took exponential time when the input was long.
Because Node.js is single-threaded, one bad regex blocked the entire Event Loop.
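For illustration, here is a hypothetical pattern with the same failure mode (nested quantifiers plus input that never matches; this is not our actual regex):

```javascript
// (\w+)+ followed by a character the input never contains forces the
// engine to retry exponentially many ways of splitting the "a" run
const evil = /^(\w+)+@example\.com$/;
const input = 'a'.repeat(30) + '!'; // each extra "a" roughly doubles the time

console.time('regex');
evil.test(input); // blocks the event loop for a painfully long time
console.timeEnd('regex');
```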
k6 helped us isolate this specific endpoint because we could see the breakdown of response times per API.
This reinforced my belief: "You can't fix what you can't measure."
## 10.7. From Load Testing to Chaos Engineering
Once you are comfortable with Load Testing, the next step is Chaos Engineering.
Load Testing checks "Can we handle traffic?".
Chaos Engineering checks "Can we handle failure?".
Tools like Gremlin or Chaos Mesh can randomly kill pods or increase latency during your load tests.
This is the ultimate test of resilience.
If your system survives a 50% pod failure rate while handling 1000 RPS, you are ready for anything.
## 10.8. FAQ: Common Questions
**Q: Is k6 truly free?**
A: Yes, the CLI tool is open-source and free. You only pay for k6 Cloud if you want managed infrastructure. The open-source version is robust enough for most startups.
**Q: Can I test complex flows like OAuth login?**
A: Absolutely. You can handle tokens, cookies, and headers just like a real client. I use it to test our entire checkout flow, including (mocked) payment gateways.
**Q: Does it support WebSocket?**
A: Yes, it has native support for WebSocket and gRPC, plus an experimental Redis module. It's not just for HTTP APIs. You can test your chat server or real-time notification system effectively.
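For example, a minimal WebSocket check using k6's `k6/ws` module (the echo endpoint is a placeholder):

```javascript
import ws from 'k6/ws';
import { check } from 'k6';

export default function () {
  const res = ws.connect('wss://echo.my-service.com/ws', null, (socket) => {
    socket.on('open', () => socket.send('ping'));
    socket.on('message', (msg) => {
      check(msg, { 'echoed back': (m) => m === 'ping' });
      socket.close();
    });
  });
  check(res, { 'upgraded to websocket (101)': (r) => r && r.status === 101 });
}
```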
## 11. Conclusion: Don't Guess, Measure
Developers often fall into the illusion that "my code is perfect".
But performance comes from the real environment, not just the code.
Load testing isn't just about checking whether the server survives.
It's a process for understanding where your limit is (capacity planning) and what breaks first (bottleneck analysis).
Are you planning a launch?
Install k6 right now and run a script.
You will hear your server screaming. (Better to hear it before launch than after.)
## 🎁 Bonus: k6 Cheatsheet
```bash
# 1. Simple run
k6 run script.js

# 2. Adjust VUs & duration (CLI override)
k6 run --vus 10 --duration 30s script.js

# 3. Save results as JSON
k6 run --out json=result.json script.js

# 4. Inject environment variables
k6 run -e HOSTNAME=staging.my-site.com script.js
```