Thrashing: The Vicious Cycle of Page Faults

The Day My Computer Became a Brick

Back in 2020, when I was just starting my first company, I was running everything on a beat-up laptop with 4GB of RAM and a spinning hard drive. Chrome with 20 tabs, Slack, VS Code, and Docker containers, all running at once. Then one day, everything froze. The mouse cursor still moved, but nothing else responded. Clicking did nothing. And that sound, "grrrr... grrrr...", the hard drive grinding away like a coffee machine from hell.

I tried Ctrl+Alt+Del to open Task Manager. Five minutes passed. Nothing. The screen just sat there, frozen, while the hard drive kept screaming. I had no idea what was happening. The CPU usage wasn't even at 100%. It was around 20%. So why was my computer completely dead?

That was my first encounter with thrashing. And I didn't understand it at all.

The Virtual Memory Trap

At first, I thought it was just a "low memory" problem. Simple fix, right? Close some programs. But here's the thing about thrashing: you can't even close programs. To close a program, the OS has to run that program's shutdown code, and that code has to be paged into memory first. To page it in, something else gets kicked out. And the thing that got kicked out is needed again almost immediately. Infinite loop of suffering.

Virtual memory is supposed to be one of the great innovations of operating systems. It lets you run programs bigger than your physical RAM. It lets multiple programs share the same physical memory while each one thinks it has its own private address space. The OS uses a page table to translate virtual addresses to physical addresses. When a program touches a page that isn't in physical memory, the hardware raises a page fault, and the OS handles it by loading the page from disk.
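Here's a toy model of that lookup, just to make the mechanics visible. It assumes 4 KiB pages and keeps the page table in a plain Map. The names (createMemory, translate, PAGE_SIZE) are made up for this sketch, and a real MMU and kernel are far more involved.

```javascript
// Toy model of virtual-to-physical translation with demand paging.
// Assumptions: 4 KiB pages; all names are illustrative only.
const PAGE_SIZE = 4096;

function createMemory() {
  return {
    pageTable: new Map(), // virtual page number -> physical frame number
    nextFreeFrame: 0,
    pageFaults: 0,
  };
}

function translate(mem, virtualAddress) {
  const vpn = Math.floor(virtualAddress / PAGE_SIZE);
  const offset = virtualAddress % PAGE_SIZE;

  if (!mem.pageTable.has(vpn)) {
    // Page fault: the page is not resident, so the "OS" loads it from disk
    // into a free frame. (A real OS may first have to evict another page.)
    mem.pageFaults += 1;
    mem.pageTable.set(vpn, mem.nextFreeFrame);
    mem.nextFreeFrame += 1;
  }
  return mem.pageTable.get(vpn) * PAGE_SIZE + offset;
}

const mem = createMemory();
translate(mem, 8200); // page 2: first touch, page fault
translate(mem, 8300); // page 2 again: already resident, no fault
translate(mem, 100);  // page 0: first touch, page fault
console.log(mem.pageFaults); // 2
```

The second access is cheap because the page is already mapped. Thrashing is what happens when almost no access gets to be that second kind.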

Page faults are normal. They're fine. They're supposed to happen. The problem is when they happen too often.

Imagine a library with 10 seats but 100 students. Student A sits down, but wait, no more seats. Kick out Student B. Student B comes back, needs a seat. Kick out Student C. Student C comes back. Kick out Student A. Nobody gets any studying done. Everyone just spends all their time fighting for chairs.

That's thrashing. The system gets almost no actual work done. It just shuffles memory pages between RAM and disk, over and over and over, while the CPU mostly sits waiting.

The Aha Moment: Moving Furniture vs Actually Working

I finally understood thrashing after reading Peter Denning's 1968 paper on the Working Set Model. His insight was simple but profound: programs don't access memory randomly. They access certain pages repeatedly during specific time periods. The set of pages a program needs to run smoothly is its working set.

Think about a simple loop:

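int array[10000] = {0};  /* about 10 pages, assuming 4-byte int and 4 KB pages */
int sum = 0;
for (int i = 0; i < 10000; i++) {
    sum += array[i];     /* sequential reads walk the array page by page */
}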

While this loop runs, it needs:

The page containing the loop code
The page with the stack variables sum and i
The pages containing the array data

These pages are the working set. They're referenced again and again. This is called locality of reference, and it comes in two flavors:

Temporal locality: If you accessed a page recently, you'll probably access it again soon
Spatial locality: If you accessed a page, you'll probably access nearby pages soon
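To make the working set idea concrete, here is a minimal sketch that computes it over a sliding window of recent references, which is roughly how Denning's model defines it. The reference string is invented for illustration, and workingSet is a hypothetical helper, not an OS API.

```javascript
// Working set W(t, delta): the distinct pages referenced in the window of
// the last `delta` references ending at position t (0-based, inclusive).
function workingSet(references, t, delta) {
  const start = Math.max(0, t - delta + 1);
  return new Set(references.slice(start, t + 1));
}

// Hypothetical reference string: a loop over pages 1-3, then a shift to 7-8.
const refs = [1, 2, 3, 1, 2, 3, 1, 2, 7, 8, 7, 8, 7, 8];

console.log([...workingSet(refs, 7, 6)]);  // [ 3, 1, 2 ]
console.log([...workingSet(refs, 13, 6)]); // [ 7, 8 ]
```

The working set follows the program as it moves from one phase to the next, which is why it has to be estimated continuously rather than fixed once.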

If the OS gives a process enough physical memory (frames) to hold its working set, page faults are rare. Everything runs smoothly. But if the OS gives it less than the working set? Page fault cascade.

Process A: working set of 10 pages, allocated 3 frames
Process B: working set of 8 pages, allocated 2 frames
Process C: working set of 12 pages, allocated 4 frames

Result: Constant page faults in all processes
→ Disk I/O explodes
→ CPU waits for I/O
→ CPU utilization drops
→ OS thinks: "CPU is idle! Add more processes!"
→ Degree of multiprogramming increases
→ More page faults
→ Thrashing gets worse

This is the vicious cycle. The CPU utilization drops because everything is waiting for disk I/O. But the OS thinks the CPU is idle, so it tries to "help" by running more processes. Which makes everything worse.

The Multiprogramming Paradox

Normally, more processes means higher CPU utilization. While one process waits for I/O, another can use the CPU. Win-win. But this relationship only holds up to a point. Past that point, the graph falls off a cliff.

CPU Utilization
  ^
  |       Normal Zone      |  Thrashing Zone
  |                    *** |
  |                ****   *|
  |            ****        *
  |        ****            |*
  |     ***                | *
  |   **                   |  **
  | **                     |    ***
  |*_______________________|_______*****_____> Degree of Multiprogramming
  0                    Threshold

Cross the threshold, and physical memory becomes too scarce. Page replacement happens constantly. The system spends nearly all its time on page replacement instead of actual work. That's thrashing.
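The cliff is easy to reproduce in a small simulation. This is a sketch under invented assumptions: 12 physical frames split evenly among the processes, every process looping over the same-sized 4-page working set, and LRU replacement within each process. Real workloads are messier, but the shape is the same.

```javascript
// Count page faults for one reference string under LRU with `frames` frames.
function lruFaults(references, frames) {
  const resident = []; // least recently used first, most recent last
  let faults = 0;
  for (const page of references) {
    const idx = resident.indexOf(page);
    if (idx !== -1) {
      resident.splice(idx, 1); // hit: remove so it can be re-added as most recent
    } else {
      faults += 1;
      if (resident.length >= frames) resident.shift(); // evict the LRU page
    }
    resident.push(page);
  }
  return faults;
}

// Assumed workload: every process loops over a 4-page working set, and
// TOTAL_FRAMES physical frames are split evenly among the processes.
const TOTAL_FRAMES = 12;
const WORKING_SET = [0, 1, 2, 3];
const LOOPS = 250;

function faultRate(processCount) {
  const framesEach = Math.floor(TOTAL_FRAMES / processCount);
  const refs = [];
  for (let i = 0; i < LOOPS; i++) refs.push(...WORKING_SET);
  return lruFaults(refs, framesEach) / refs.length;
}

for (let n = 1; n <= 6; n++) {
  console.log(`${n} processes: ${(faultRate(n) * 100).toFixed(1)}% of references fault`);
}
```

With these numbers, up to 3 processes fault only on first touch (0.4% of references). Add a fourth, and each process drops to 3 frames for a 4-page working set: every single reference faults.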

The page replacement policy matters too. FIFO (First-In-First-Out) kicks out the oldest page. Simple, but it suffers from Belady's Anomaly—sometimes adding more frames actually increases page faults. Weird and counterintuitive.
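Belady's Anomaly is easier to believe once you count the faults yourself. A minimal FIFO simulator, run on the reference string usually used to demonstrate the anomaly (fifoFaults is a name made up for this sketch):

```javascript
// Count page faults under FIFO replacement.
function fifoFaults(references, frames) {
  const queue = []; // oldest loaded page first
  let faults = 0;
  for (const page of references) {
    if (queue.includes(page)) continue; // hit: FIFO order is not updated
    faults += 1;
    if (queue.length >= frames) queue.shift(); // evict the oldest page
    queue.push(page);
  }
  return faults;
}

// The classic reference string for demonstrating the anomaly.
const refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5];
console.log(fifoFaults(refs, 3)); // 9
console.log(fifoFaults(refs, 4)); // 10
```

One more frame, one more fault.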

LRU (Least Recently Used) kicks out the page that hasn't been used in the longest time. It leverages locality of reference, so it performs better. But in thrashing conditions, it doesn't matter which algorithm you use. If there's not enough physical memory for the working sets, thrashing is inevitable.

Thrashing in Production: A War Story

Our backend server started acting weird around 3 AM. Monitoring alert: "Server not responding." I tried SSH. Connection timeout. I had to hard reboot the server. After it came back up, I checked the kernel log. Plain dmesg only covers the current boot, so the messages from before the reboot had to come from the persisted journal (this assumes journald keeps logs across reboots):

$ journalctl -k -b -1 -o cat | grep -i "killed process"
Out of memory: Killed process 2345 (node) total-vm:4194304kB, anon-rss:3984588kB, file-rss:0kB
Out of memory: Killed process 3456 (postgres) total-vm:2097152kB, anon-rss:1992294kB, file-rss:0kB

The OOM Killer (Out-Of-Memory Killer) had struck. When Linux can no longer reclaim enough memory (RAM and swap both exhausted), it starts killing processes. It calculates a badness score for each process (visible in /proc/<pid>/oom_score), driven mostly by how much memory the process uses, and kills the highest scorer. Brutal, but it's the last resort to prevent a total system freeze.
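You can peek at the score the kernel currently assigns a process by reading /proc/<pid>/oom_score. A small sketch, with readOomScore as a made-up helper name. It only works on Linux, so it returns null everywhere else.

```javascript
const fs = require('fs');

// Returns the kernel's current OOM badness score for a PID, or null when
// the file cannot be read (non-Linux system, process gone, no permission).
function readOomScore(pid) {
  try {
    const text = fs.readFileSync(`/proc/${pid}/oom_score`, 'utf8');
    return Number.parseInt(text.trim(), 10);
  } catch (err) {
    return null;
  }
}

console.log(`oom_score of this process: ${readOomScore(process.pid)}`);
```

Higher means more likely to be killed first.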

To avoid being OOM Killer'd, you need to detect memory pressure early. You can monitor with vmstat:

$ vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  2 524288  12340  8192 102400  120  180  2400  1200  500  800 15 10 50 25  0
 4  3 589824   8192  8192 102400  340  420  5600  3200  700 1200 12  8 40 40  0
 6  5 655360   4096  8192 102400  780  920 12000  7800 1100 2000  8  5 20 67  0

Watch these columns:

si (swap in): Memory swapped in from disk per second (KiB/s by default)
so (swap out): Memory swapped out to disk per second (KiB/s by default)
bi (blocks in): Blocks read from block devices per second
bo (blocks out): Blocks written to block devices per second
wa (wait): Percentage of CPU time spent idle, waiting for I/O

In this example, si and so are climbing, and wa hits 67%. Classic thrashing symptoms.
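If you want an alert instead of eyeballing the columns, the check can be scripted. The sketch below parses vmstat text and flags samples where the system swaps in both directions while the CPU mostly waits on I/O. The function name and both thresholds are my own guesses for illustration, so tune them for your machine.

```javascript
// Flag vmstat samples that look like thrashing: swapping in both directions
// while the CPU mostly waits on I/O. The thresholds are illustrative guesses.
function findThrashingSamples(vmstatOutput, { minSwapKiB = 100, minWaitPct = 30 } = {}) {
  const lines = vmstatOutput.trim().split('\n').map((l) => l.trim());
  // The column-name row is the one that starts with "r" and "b".
  const headerIndex = lines.findIndex((l) => /^r\s+b\s/.test(l));
  if (headerIndex === -1) throw new Error('vmstat header row not found');
  const columns = lines[headerIndex].split(/\s+/);

  return lines
    .slice(headerIndex + 1)
    .filter((l) => l.length > 0)
    .map((l) => {
      const values = l.split(/\s+/).map(Number);
      return Object.fromEntries(columns.map((name, i) => [name, values[i]]));
    })
    .filter((s) => s.si >= minSwapKiB && s.so >= minSwapKiB && s.wa >= minWaitPct);
}

const sample = `
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  2 524288  12340  8192 102400  120  180  2400  1200  500  800 15 10 50 25  0
 4  3 589824   8192  8192 102400  340  420  5600  3200  700 1200 12  8 40 40  0
 6  5 655360   4096  8192 102400  780  920 12000  7800 1100 2000  8  5 20 67  0
`;

const flagged = findThrashingSamples(sample);
console.log(flagged.map((s) => s.wa)); // [ 40, 67 ]
```

The first sample is swapping too, but its I/O wait is still under the threshold, so only the last two are flagged.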

You can also check swap usage with free:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.8G        7.2G        100M        256M        500M        200M
Swap:          2.0G        1.9G        100M

Swap usage near 100% is a red flag. The available column is critical: it estimates how much memory new programs can get without pushing the system into swap. As a rule of thumb, if it drops below about 10% of total memory, thrashing is close.
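The same number is exposed in /proc/meminfo as MemAvailable, which makes the 10% rule of thumb easy to automate. A sketch that works on meminfo-style text, with availablePercent as a hypothetical helper and a snapshot invented to roughly match the free output above:

```javascript
// Percentage of RAM the kernel estimates is available without swapping,
// computed from /proc/meminfo-style text (values are in kB).
function availablePercent(meminfoText) {
  const fields = {};
  for (const line of meminfoText.split('\n')) {
    const match = /^(\w+):\s+(\d+)/.exec(line.trim());
    if (match) fields[match[1]] = Number(match[2]);
  }
  if (!fields.MemTotal || fields.MemAvailable === undefined) {
    throw new Error('MemTotal or MemAvailable missing');
  }
  return (fields.MemAvailable / fields.MemTotal) * 100;
}

// Hypothetical snapshot, roughly matching the free output above.
const snapshot = `
MemTotal:        8175616 kB
MemFree:          102400 kB
MemAvailable:     204800 kB
SwapTotal:       2097152 kB
SwapFree:         102400 kB
`;

const pct = availablePercent(snapshot);
console.log(`${pct.toFixed(1)}% available`); // 2.5% available
```

On a live Linux box you would feed it the contents of /proc/meminfo instead of the hard-coded snapshot.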

Real-time monitoring with top:

$ top
top - 03:24:15 up 10 days,  4:32,  1 user,  load average: 8.45, 7.23, 6.11
Tasks: 312 total,   4 running, 308 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.5 us,  8.3 sy,  0.0 ni, 15.2 id, 64.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7984.0 total,    200.5 free,   7681.1 used,    102.4 buff/cache
MiB Swap:   2048.0 total,    102.4 free,   1945.6 used.    195.2 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 2345 app       20   0  4.1g    3.8g   12m  D  15.3  48.9  123:45.67 node
 3456 postgres  20   0  2.1g    1.9g   8m   D   8.7  24.5   89:12.34 postgres

Low %CPU but a load average above 8, because Linux counts processes in state D (uninterruptible sleep, here waiting for disk I/O) toward the load. wa (wait) at 64%. Textbook thrashing.

How to Fix It: Attacking the Root Cause

1. The Physical Solution: Buy More RAM

Most straightforward. If you can throw money at the problem, do it. We upgraded our server from 8GB to 32GB. Thrashing vanished overnight.

2. Reduce the Process Count

Limit the number of concurrent processes. Don't spin up Docker containers willy-nilly. Run only what's necessary. For Node.js cluster mode, tune the worker count:

const cluster = require('cluster');
const os = require('os');

// cluster.isPrimary replaced the deprecated cluster.isMaster in Node.js 16
if (cluster.isPrimary) {
    // Old approach: spawn workers equal to CPU cores
    // const numWorkers = os.cpus().length;

    // Better: consider available memory
    const cpuCount = os.cpus().length;
    const FOUR_GIB = 4 * 1024 * 1024 * 1024;
    // os.freemem() is only a snapshot taken at startup
    const availableMemory = os.freemem();
    const numWorkers = availableMemory > FOUR_GIB
        ? cpuCount
        : Math.max(2, Math.floor(cpuCount / 2));

    console.log(`Starting ${numWorkers} workers`);
    for (let i = 0; i < numWorkers; i++) {
        cluster.fork();
    }
} else {
    require('./app');
}
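A fixed 4GB cutoff is crude. One possible refinement is to budget memory per worker and take whichever limit is tighter, CPU or memory. In this sketch, workersForBudget and the 1.5 GiB per-worker figure are assumptions for illustration. Measure what your own workers actually use.

```javascript
// Pick a worker count that fits both the CPU count and a memory budget.
// perWorkerBytes is an assumed figure; measure your own workers' memory use.
function workersForBudget(cpuCount, availableBytes, perWorkerBytes) {
  const byMemory = Math.floor(availableBytes / perWorkerBytes);
  return Math.max(1, Math.min(cpuCount, byMemory));
}

const GiB = 1024 ** 3;
console.log(workersForBudget(8, 6 * GiB, 1.5 * GiB));  // 4
console.log(workersForBudget(8, 32 * GiB, 1.5 * GiB)); // 8
console.log(workersForBudget(8, 1 * GiB, 1.5 * GiB));  // 1
```

In the cluster script, the call would take os.cpus().length and os.freemem() as its first two arguments.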

3. Guarantee the Working Set

The OS should estimate each process's working set and allocate enough frames to hold it. If physical memory is too scarce to accommodate all working sets, suspend entire processes. Better to have some processes completely stopped than all processes crawling.

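Linux doesn't let you reserve a working set for a process directly. What nice and ionice can do is lower a heavy process's CPU and I/O priority, so it gets less of the machine when things are tight. They don't limit how much memory it uses, and ionice classes only take effect with an I/O scheduler that supports them, such as BFQ: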

# Lower CPU priority
$ nice -n 19 ./heavy_process

# Lower I/O priority (idle class)
$ ionice -c 3 ./heavy_process

4. Page Fault Frequency Scheme

Monitor each process's page fault rate. If page faults are too frequent (e.g., 100+ per second), allocate more frames. If page faults are rare (e.g., fewer than 5 per second), reclaim frames. Dynamic adjustment prevents thrashing.
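The control loop is simple enough to sketch. The thresholds below are the example figures from the paragraph above. The one-frame step and the 'suspend-candidate' action are my own assumptions about how such a controller might behave, not a description of any real kernel.

```javascript
// Page Fault Frequency control: grow a process's frame allocation when it
// faults too often, shrink it when it rarely faults. The step size of one
// frame is an assumption made for this sketch.
const UPPER_FAULTS_PER_SEC = 100;
const LOWER_FAULTS_PER_SEC = 5;

function adjustFrames(currentFrames, faultsPerSec, freeFrames) {
  if (faultsPerSec >= UPPER_FAULTS_PER_SEC) {
    // Needs more memory. If none is free, the caller should suspend a
    // process instead of letting everyone thrash.
    return freeFrames > 0
      ? { frames: currentFrames + 1, action: 'grow' }
      : { frames: currentFrames, action: 'suspend-candidate' };
  }
  if (faultsPerSec < LOWER_FAULTS_PER_SEC && currentFrames > 1) {
    return { frames: currentFrames - 1, action: 'shrink' };
  }
  return { frames: currentFrames, action: 'hold' };
}

console.log(adjustFrames(10, 250, 4)); // { frames: 11, action: 'grow' }
console.log(adjustFrames(10, 2, 4));   // { frames: 9, action: 'shrink' }
console.log(adjustFrames(10, 40, 4));  // { frames: 10, action: 'hold' }
console.log(adjustFrames(10, 250, 0)); // { frames: 10, action: 'suspend-candidate' }
```

The last case is the important one: when a process needs frames and none are free, the right move is to stop something, not to keep stealing frames back and forth.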

5. SSD vs HDD: The Severity Gap

Thrashing on HDD versus SSD is night and day. HDDs have mechanical parts. Random I/O requires physically moving the disk head, which takes around 10ms per seek. At 100 page faults per second, that's 100 × 10ms = 1 full second of head movement every second. The disk is saturated, and every process that faults waits in line behind it.

SSDs have no moving parts. No seek time. A random read takes on the order of 0.1ms, so the same page fault frequency is served roughly 100x faster. With an SSD, thrashing becomes "sluggish but usable" instead of "completely frozen."

But don't take this as a license to ignore thrashing just because you have an SSD. SSD cells survive a limited number of write cycles, and constant swap writes from thrashing eat into that endurance. And even an SSD read is around a thousand times slower than a RAM access. Thrashing is still a performance disaster.
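The gap shows up clearly in the textbook effective access time formula: (1 - p) × memory access time + p × page fault service time, where p is the probability that an access faults. The figures below are round-number assumptions (100 ns for RAM, 10 ms per HDD fault, 0.1 ms per SSD fault), not measurements.

```javascript
// Effective access time = (1 - p) * memoryAccess + p * faultServiceTime.
// Assumed round numbers: 100 ns RAM access, 10 ms HDD fault, 0.1 ms SSD fault.
const RAM_NS = 100;
const HDD_FAULT_NS = 10e6;
const SSD_FAULT_NS = 0.1e6;

function effectiveAccessNs(faultProbability, faultServiceNs) {
  return (1 - faultProbability) * RAM_NS + faultProbability * faultServiceNs;
}

for (const p of [0.00001, 0.001]) {
  const hdd = effectiveAccessNs(p, HDD_FAULT_NS) / RAM_NS;
  const ssd = effectiveAccessNs(p, SSD_FAULT_NS) / RAM_NS;
  console.log(`p=${p}: HDD ${hdd.toFixed(1)}x slower than RAM, SSD ${ssd.toFixed(2)}x`);
}
```

Under these assumptions, one fault per thousand accesses makes memory look about 100 times slower on the HDD and about 2 times slower on the SSD. Better, but still a process running at half speed because of one access in a thousand.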

The Courage to Accept Limits

Thrashing taught me something fundamental: systems have limits. You can't run 50 Chrome tabs on 4GB RAM. You can't handle 100,000 concurrent users on an 8GB server. No matter how clever the OS, no matter how magical virtual memory seems, physics is physics.

In the early startup days, I tried to "optimize" my way out of resource constraints. Tuned code, added caching, improved algorithms. But eventually I realized: this wasn't a code problem. It was a hardware problem. Adding 16GB of RAM was more effective than two weeks of optimization work.

Optimization still matters. Wasteful code should be fixed. But fundamental resource scarcity requires adding resources. That's the fastest, most reliable, most honest solution. Thrashing is the system's way of telling you: "I can't do this anymore. I need help."

And sometimes, the bravest thing you can do is listen.