
Keep-Alive: Don't hang up yet
Establishing a TCP connection is expensive. Reuse it for multiple requests.

Investigating a slow website, opening the Network tab reveals something like this:
Result:
logo.png - 300ms (250ms Handshake)
style.css - 280ms (240ms Handshake)
app.js - 290ms (245ms Handshake)
...
"You did 100 Handshakes. Keep-Alive isn't enabled."
When a site loads several times slower than competitors, the diagnosis is often surprising: "The server is fast, but closing the connection every time makes it slow."
That's when TCP connections and Keep-Alive become worth studying.
At first, it's hard to comprehend how this single setting could impact performance so drastically. Why does "maintaining connections" matter so much? Why was it off by default in HTTP/1.0? And does enabling Keep-Alive put more burden on the server?
Most importantly: "Why was it designed so inefficiently?"
Later I learned that in the HTTP/1.0 era, web pages only needed to fetch a single HTML file, and servers couldn't handle many concurrent connections. So "closing quickly" was actually more efficient. But as the web evolved and pages started requiring dozens or hundreds of files, this approach became a bottleneck.
This analogy makes the concept click:
HTTP/1.0 (Close connection): "You have 100 questions for a friend. Call → ask 1 question → hang up. Call → ask 1 question → hang up. (Repeat 100 times.)
Calling time > answer time."
HTTP/1.1 (Keep-Alive): "Call once. Ask questions 1, 2, 3... all 100. Hang up when done.
Just 1 call!"
"Oh, it's connection reuse!"
That's when I understood. Keep-Alive was ultimately about "infrastructure efficiency." Making and ending phone calls (TCP Handshake) was far more expensive than the actual conversation (data transfer). Rather than repeating this 100 times, calling once and continuing the conversation was obviously more efficient.
I thought of another analogy with package delivery. How inefficient would it be if a delivery person came to your door, delivered one item and left, then returned 5 minutes later to deliver another item? It's much better to come once and deliver all 100 items at once.
Client → Server: SYN (connection request)
Server → Client: SYN-ACK (got it, ready)
Client → Server: ACK (OK, let's start!)
→ about 1.5 round trips to establish; in practice, roughly RTT x 2 before the first response byte arrives
Seoul → US Server
RTT (Round Trip Time) = 200ms
Handshake = 200ms x 2 = 400ms
400ms is huge!
You might not realize how significant 400ms is. But from a user experience perspective, it's critical. Google published research showing that a 0.5 second delay in search results display leads to a 20% drop in traffic. 400ms is nearly half of that.
The bigger problem is that this is just the setup time for a single file. Real web pages need to fetch dozens to hundreds of resources: HTML, CSS, JavaScript, images, fonts. Without Keep-Alive, establishing a connection for each one? Pure hell.
Also, RTT is proportional to physical distance. Seoul-US is 200ms, but Seoul-Australia can exceed 300ms. Even at the speed of light, we can't overcome physics, so the only way to reduce this cost is to "reuse connections."
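To make that cost tangible, here's a minimal sketch using Node's built-in net module that times nothing but the TCP connect step. The host is just a placeholder; swap in whatever server you want to measure.
const net = require('net');

// Time only the TCP connect (3-Way Handshake). 'example.com' is a placeholder host.
function timeHandshake(host, port = 80) {
  return new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    const socket = net.connect(port, host, () => {
      // 'connect' fires once the handshake completes; no HTTP data has been sent yet.
      const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
      socket.end();
      resolve(elapsedMs);
    });
    socket.on('error', reject);
  });
}

timeHandshake('example.com').then((ms) =>
  console.log(`TCP handshake took ~${ms.toFixed(1)}ms`)
);
Run it against a nearby server and a far-away one, and the RTT difference shows up immediately.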
Without Keep-Alive:
Request 1:
1. TCP Handshake (400ms)
2. logo.png request
3. Response
4. Close connection
Request 2:
1. TCP Handshake (400ms) ← Again!
2. style.css request
3. Response
4. Close connection
...100 files → 100 Handshakes
Total time: 400ms x 100 = 40 seconds!
Web development in the HTTP/1.0 era employed all sorts of tricks to work around this problem. CSS Sprites (combining multiple images into one), file bundling, inline styles—these were all born from this necessity. Looking back, they were workarounds, but at the time, they were essential.
I didn't experience this era directly, but when looking at legacy code, I often wondered "why was this made so complicated?" Turns out it was all an effort to minimize the pain of an era without Keep-Alive.
With Keep-Alive:
1. TCP Handshake (400ms) ← Only once!
2. logo.png request
3. Response
4. style.css request ← Keep connection
5. Response
6. app.js request
...
100. All requests complete
101. Close connection
Total time: 400ms + (file transfer time)
When I first understood this, I thought "why didn't they do this from the start?" But thinking deeper, there was a tradeoff. Maintaining connections means the server keeps occupying memory and file descriptors for those connections.
Web servers in the early 1990s weren't as powerful as today and struggled to handle thousands of concurrent connections. So "closing quickly" was safer. But as hardware improved and web pages became more complex, the benefits of Keep-Alive became overwhelming.
This single setting essentially changed the game for web performance. HTTP/1.1 came out in 1997, so we've been enjoying Keep-Alive's benefits for nearly 30 years.
In HTTP/1.0, the client had to ask for Keep-Alive explicitly:
GET /logo.png HTTP/1.0
Connection: keep-alive
Server response:
HTTP/1.0 200 OK
Connection: keep-alive
Keep-Alive: timeout=5, max=100
timeout=5: Close if idle for 5 seconds
max=100: Maximum 100 requests per connection
In HTTP/1.1, Keep-Alive is the default:
GET /logo.png HTTP/1.1
(Auto Keep-Alive without Connection header)
Only when you want to close:
Connection: close
When would you actually use this Connection: close header? It comes up occasionally in production.
In most cases, Keep-Alive is default so you don't need to worry, but when going through proxies or load balancers, the Connection header can get modified. This is the most troublesome part in production.
Initially, I mistakenly thought TCP Keep-Alive and HTTP Keep-Alive were the same thing. But they're concepts at completely different layers.
TCP Keep-Alive
Purpose: Verify the connection is still alive
Action: Periodically send empty packets (usually every 2 hours)
Configuration: OS level (sysctl)
Example:
net.ipv4.tcp_keepalive_time = 7200 # 2 hours
net.ipv4.tcp_keepalive_intvl = 75 # Retry every 75s
net.ipv4.tcp_keepalive_probes = 9 # Close after 9 failures
HTTP Keep-Alive
Purpose: Reuse the same TCP connection for multiple HTTP requests
Action: Maintain connection even after request/response completes
Configuration: HTTP headers (Connection, Keep-Alive)
Example:
Connection: keep-alive
Keep-Alive: timeout=5, max=100
It took me a while to understand this difference. TCP Keep-Alive is a "health check to detect dead connections," while HTTP Keep-Alive is an "optimization technique for connection reuse." Completely different purposes.
In production, you rarely need to adjust TCP Keep-Alive directly. The OS default (2 hours) is usually sufficient. But HTTP Keep-Alive is frequently adjusted in server/proxy configurations.
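In Node terms, the two live in completely different places. Here's a small sketch of both knobs (both calls are part of Node's standard http/net modules):
const http = require('http');

const server = http.createServer((req, res) => res.end('Hello'));

// HTTP Keep-Alive: how long an idle connection is kept open for the next HTTP request.
server.keepAliveTimeout = 5000;

// TCP Keep-Alive: OS-level probes that detect a dead peer on an otherwise silent socket.
server.on('connection', (socket) => {
  socket.setKeepAlive(true, 60_000); // start probing after 60s of silence
});

server.listen(3000);
With that distinction in mind, here's what turning HTTP Keep-Alive off versus on looks like on a Node server: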
const http = require('http');

const server = http.createServer((req, res) => {
  res.setHeader('Connection', 'close'); // ❌ Close each time
  res.end('Hello');
});

server.keepAliveTimeout = 0; // Keep-Alive off
server.listen(3000);
Result: Slow
const server = http.createServer((req, res) => {
  // Auto Keep-Alive without explicit Connection header
  res.end('Hello');
});

server.keepAliveTimeout = 5000; // Maintain for 5 seconds
server.maxRequestsPerSocket = 100; // Max 100 requests
server.listen(3000);
Result: Fast ⚡
Node.js's default keepAliveTimeout is 5 seconds. Whether that's appropriate depends on traffic patterns: clients that make frequent repeat requests benefit from a longer timeout, while sporadic traffic just leaves idle connections hanging around.
maxRequestsPerSocket is also important. If set too high, a single connection lives too long with memory leak risks; too low, and Keep-Alive effectiveness decreases. 100-1000 is usually appropriate.
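The client side has to cooperate too. Here's a sketch using Node's built-in http.Agent (the host and port are placeholders pointing at the server above) that reuses sockets across requests:
const http = require('http');

// keepAlive: true keeps sockets open after a response so later requests can reuse them.
const agent = new http.Agent({ keepAlive: true, maxSockets: 6 });

function get(path) {
  return new Promise((resolve, reject) => {
    const req = http.get({ host: 'localhost', port: 3000, path, agent }, (res) => {
      res.resume(); // drain the body so the socket can go back into the agent's pool
      res.on('end', () => resolve(req.reusedSocket)); // true when an existing socket was reused
    });
    req.on('error', reject);
  });
}

get('/a')
  .then(() => get('/b'))
  .then((reused) => console.log('second request reused the socket:', reused)); // → true
Without keepAlive: true on the agent, every request pays the handshake again no matter how the server is configured.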
After understanding HTTP Keep-Alive, I realized database connection pools follow the exact same concept.
// ❌ Bad example (assuming the mysql2/promise driver)
const mysql = require('mysql2/promise');

async function getUser(id) {
  const connection = await mysql.createConnection({
    host: 'localhost',
    user: 'root',
    password: 'pass'
  });
  const [rows] = await connection.query('SELECT * FROM users WHERE id = ?', [id]);
  await connection.end(); // Close every time!
  return rows[0];
}
// 100 calls → 100 connection creations/closures

// ✅ Good example
const pool = mysql.createPool({
  host: 'localhost',
  user: 'root',
  password: 'pass',
  connectionLimit: 10 // Maintain 10 connections
});

async function getUser(id) {
  const [rows] = await pool.query('SELECT * FROM users WHERE id = ?', [id]);
  return rows[0]; // Connection goes back to the pool (not closed)
}
// 100 calls → Reuse 10 connections
The cost of establishing a database connection is even higher than TCP Handshake. It involves authentication, permission checks, session initialization, and more. So running without Connection Pool leads to terrible performance.
These follow the same principle—a "common pattern in infrastructure design." Ultimately, it's the philosophy of "reuse expensive resources."
The difference in total load time with and without Keep-Alive is significant:
Keep-Alive OFF:
100 files load
Avg response time: 4.2s
Total Handshake time: 38s
Keep-Alive ON:
100 files load
Avg response time: 1.1s
Total Handshake time: 0.4s (once only!)
About 3.8x faster.
Users just feel that "it got faster," but from a developer's perspective it's a massive difference: server load drops and network costs drop with it.
The difference is especially stark in mobile environments. In 4G LTE with RTT around 50-100ms, loading 100 files without Keep-Alive adds 5-10 seconds. Numbers that directly translate to user abandonment.
Keep-Alive isn't free, though:
10,000 concurrent users
Keep-Alive timeout = 60s
→ Maintain 10,000 TCP connections
→ Memory usage ↑
Solution: Set short timeout (5-10s)
User received page and left
→ Server maintains connection for 60s
→ Waste!
Solution: Short timeout + appropriate max requests setting
Setting timeout to 120 seconds and having concurrent users spike can exhaust file descriptors. Linux's default ulimit -n is 1024, which gets exceeded as Keep-Alive connections accumulate.
Reducing timeout to 10 seconds and raising ulimit to 65536 resolves the issue. Keep-Alive isn't magic—it's a tradeoff that requires tuning.
A practical Nginx configuration for Keep-Alive:
http {
    # Enable Keep-Alive
    keepalive_timeout 65;       # Maintain for 65 seconds
    keepalive_requests 100;     # Max 100 requests per connection

    # Keep-Alive with upstream servers too
    upstream backend {
        server 127.0.0.1:3000;
        keepalive 32;           # Maintain pool of 32 connections
    }

    server {
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            proxy_set_header Connection "";  # Maintain Keep-Alive
        }
    }
}
The proxy_set_header Connection ""; line is crucial here. Without it, Nginx closes the connection with each backend request. Initially not knowing this, I wondered "why is it slow even with Nginx?"
keepalive 32; is the size of the connection pool Nginx maintains with backend servers. If you have multiple backend servers, increase this value. Usually set to number of backend servers x 10.
Network tab → Click file → Timing tab
Queueing: 0.5ms
Stalled: 0.2ms
DNS Lookup: 0ms ← Reused!
Initial connection: 0ms ← Keep-Alive!
SSL: 0ms ← Reused!
Request sent: 0.1ms
Waiting (TTFB): 50ms
Content Download: 10ms
Connection 0ms = Keep-Alive is working!
Browsers typically open only 6 connections per domain simultaneously. What does this mean?
Download 100 files from example.com
Keep-Alive OFF:
Files 1-6: Parallel download (each new connection)
Files 7-12: Wait → Start after 1-6 finish (each new connection)
...
→ Total 100 connection creations
Keep-Alive ON:
Files 1-6: Parallel download (6 connections)
Files 7-12: Reuse same 6 connections
...
→ Only 6 connections maintained total
This limit was a kind of courtesy from the HTTP/1.1 era to reduce server load. But combined with Keep-Alive, it delivers tremendous efficiency.
In the past, domain sharding was used to bypass this limit:
img1.example.com
img2.example.com
img3.example.com
6 connections per domain, total 18 connections possible. But after HTTP/2, such tricks became unnecessary.
HTTP/1.1 (Keep-Alive):
1 connection
Request 1 → Response 1
Request 2 → Response 2 (sequential)
Problem: Head-of-Line Blocking
HTTP/2 (Multiplexing):
1 connection
Requests 1, 2, 3 sent simultaneously
Responses 3, 1, 2 received in any order
Keep-Alive required!
HTTP/2 advanced Keep-Alive one step further. In HTTP/1.1, requests were sent sequentially on one connection, but HTTP/2 sends multiple requests simultaneously (Multiplexing) on a single connection.
This is possible because HTTP/2 introduced the "stream" concept. Each request/response has an independent stream ID and gets interleaved within a single TCP connection.
When I first understood this, I thought "then the browser's 6 connection limit is meaningless?" Correct. With HTTP/2, just 1 connection per domain is sufficient. Opening multiple connections only adds overhead.
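Here's a sketch of what multiplexing looks like with Node's built-in http2 client; the origin URL is just a placeholder. Three requests share one TCP connection, each on its own stream:
const http2 = require('http2');

const session = http2.connect('https://example.com'); // one TCP (+TLS) connection

function fetchPath(path) {
  return new Promise((resolve, reject) => {
    const stream = session.request({ ':path': path }); // each request gets its own stream ID
    let bytes = 0;
    stream.on('data', (chunk) => (bytes += chunk.length));
    stream.on('end', () => resolve(bytes));
    stream.on('error', reject);
  });
}

// All three streams are in flight at the same time over the single connection.
Promise.all([fetchPath('/'), fetchPath('/style.css'), fetchPath('/app.js')])
  .then((sizes) => console.log('bytes per resource:', sizes))
  .finally(() => session.close());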
HTTP/3 uses QUIC instead of TCP. QUIC is UDP-based, so there's no 3-Way Handshake.
HTTP/1.1 + TLS:
TCP Handshake (1 RTT) + TLS Handshake (1-2 RTT) = 2-3 RTT
HTTP/3 + QUIC:
QUIC Handshake (0-1 RTT) = 0-1 RTT
First connection: 1 RTT
Reconnection: 0 RTT (session resumption)
QUIC's 0-RTT is truly magical. If you've connected to a server before, data transmission is possible immediately without Handshake. Keep-Alive pushed to the extreme.
But QUIC still maintains connections. It's just faster at connection recovery than TCP (Connection Migration) and more resilient to IP changes in mobile environments.
Ultimately, the philosophy of Keep-Alive continues in HTTP/3. The principle of "reuse expensive connections" doesn't change.
Using a CDN multiplies the Keep-Alive effect.
User → CDN (Seoul) → Origin Server (US)
Without CDN:
User ↔ US Server (RTT 200ms)
Handshake = 400ms
With CDN:
User ↔ Seoul CDN (RTT 10ms)
Handshake = 20ms
CDN ↔ US Server maintains Keep-Alive connection pool
CDNs maintain long-lived Keep-Alive connections with origin servers. From the user's perspective, they only need to Handshake to the Seoul CDN, making it incredibly fast.
Adding a CDN like CloudFront on top of Keep-Alive typically produces a dramatic improvement: physical distance shrinks and connections get reused, both principles working together.
A common production pitfall is mismatched timeouts between a load balancer and the backend:
Client → LB → Server
LB idle timeout: 60s
Server Keep-Alive timeout: 5s
→ Server closes first!
Solution: Server timeout > LB timeout
A classic case: AWS ALB's default idle timeout is 60 seconds, while Node's default Keep-Alive timeout is only 5 seconds. The backend closes the idle connection first, the load balancer still thinks it's alive and forwards the next request onto the dead connection, and the client gets a 502 error.
Lesson: Always set the backend server's Keep-Alive timeout greater than the load balancer's idle timeout, so the LB is the side that closes idle connections.
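On a Node backend behind such a load balancer, that lesson boils down to two lines. A minimal sketch, assuming an LB idle timeout of 60 seconds:
const http = require('http');

const server = http.createServer((req, res) => res.end('OK'));

server.keepAliveTimeout = 65_000; // longer than the LB's 60s idle timeout, so the LB closes first
server.headersTimeout = 66_000;   // Node expects this to be larger than keepAliveTimeout

server.listen(3000);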
Another trap is a proxy in the middle:
Client ↔ Proxy ↔ Server
Proxy forces Connection: close
→ Keep-Alive nullified
Solution: Check proxy configuration
Squid proxy's default setting is often Connection: close. No matter how much Keep-Alive is enabled on the backend server, the proxy closes everything. Adding persistent_connection_after_error on to the config file resolves it.
Leaving out this single line in Nginx config breaks Keep-Alive silently:
proxy_set_header Connection "";
Without this, Nginx sends Connection: close when requesting from the backend. The backend wonders "why are you closing every time?"
Using HTTP/2 doesn't mean you can ignore Keep-Alive. In fact, it becomes more important.
HTTP/1.1:
6 connections per domain
Each connection with Keep-Alive
Total 6 TCP connections maintained
HTTP/2:
1 connection per domain
All requests Multiplexed on single connection
Total 1 TCP connection maintained (Keep-Alive essential!)
HTTP/2 relies on a single connection, so if that connection drops, all requests stop. That's why Keep-Alive settings become more critical.
Also, HTTP/2 has Server Push capability, which only works with a Keep-Alive connection. For the server to push resources proactively without client requests, an open connection is necessary.
| Item | HTTP/1.0 | HTTP/1.1 |
|---|---|---|
| Default | OFF | ON |
| Header | Connection: keep-alive | (Optional) |
| Close | Connection: close | Connection: close |
| Timeout | Must specify | Server default |
| Performance | Slow | Fast ⚡ |
Initially I wondered "why maintain connections?"
Now I understand:
"Connection establishment cost >> Data transfer cost"Key takeaways:
Same applies to TCP connections. Don't create new ones every time; reuse what you have. That's the essence of Keep-Alive.
This was it: the most basic yet most powerful weapon in web performance optimization, a single setting that can make things 3.8x faster. I can't imagine the web without Keep-Alive.