
Thrashing: Vicious Cycle of Page Faults
PC froze. Mouse moves but clicks don't work. HDD LED blinking like crazy. This is Thrashing.

Back in 2020, when I was just starting my first company, I was running everything on a beat-up laptop with 4GB RAM and a spinning hard drive. Chrome with 20 tabs, Slack, VSCode, Docker containers all running at once. Then one day, everything froze. The mouse cursor still moved, but nothing else responded. Clicking did nothing. And that sound—"grrrr... grrrr..."—the hard drive grinding away like a coffee machine from hell.
I tried Ctrl+Alt+Del to open Task Manager. Five minutes passed. Nothing. The screen just sat there, frozen, while the hard drive kept screaming. I had no idea what was happening. The CPU usage wasn't even at 100%. It was around 20%. So why was my computer completely dead?
That was my first encounter with thrashing. And I didn't understand it at all.
At first, I thought it was just a "low memory" problem. Simple fix, right? Close some programs. But here's the thing about thrashing: you can't even close programs. Closing a program means running code, both the task manager's and the program's own shutdown path, and that code has to be paged back into memory first. To page it in, you need to kick something else out. And the thing you kick out is immediately needed again. Infinite loop of suffering.
Virtual memory is supposed to be one of the great innovations of operating systems. It lets you run programs bigger than your physical RAM. It lets multiple programs share the same memory while thinking they each have their own private space. The OS uses a page table to translate virtual addresses to physical addresses. When a program touches a page that isn't in physical memory, the hardware raises a page fault and the OS loads the page from disk.
Page faults are normal. They're fine. They're supposed to happen. The problem is when they happen too often.
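To make the mechanism concrete, here's a toy sketch of a page table that counts faults. The class name, the 4096-byte page size, and the "grab the next free frame" loading policy are all assumptions for illustration. A real OS does this with hardware support and a far more complex structure.

```javascript
// Toy model: maps virtual page numbers to physical frame numbers.
// PAGE_SIZE and the next-free-frame policy are illustrative assumptions.
const PAGE_SIZE = 4096;

class ToyPageTable {
  constructor() {
    this.entries = new Map(); // virtual page number -> physical frame number
    this.nextFreeFrame = 0;
    this.pageFaults = 0;
  }

  translate(virtualAddress) {
    const page = Math.floor(virtualAddress / PAGE_SIZE);
    const offset = virtualAddress % PAGE_SIZE;
    if (!this.entries.has(page)) {
      // Page fault: the page is not resident, so "load" it into a frame.
      this.pageFaults++;
      this.entries.set(page, this.nextFreeFrame++);
    }
    return this.entries.get(page) * PAGE_SIZE + offset;
  }
}

const table = new ToyPageTable();
table.translate(0);    // first touch of page 0: fault
table.translate(100);  // same page: no fault
table.translate(8192); // page 2: fault
console.log(table.pageFaults); // 2
```

This toy never runs out of frames. Thrashing starts exactly where this sketch stops: when loading one page means evicting another that is still needed.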
Imagine a library with 10 seats but 100 students. Student A sits down, but wait, no more seats. Kick out Student B. Student B comes back, needs a seat. Kick out Student C. Student C comes back. Kick out Student A. Nobody gets any studying done. Everyone just spends all their time fighting for chairs.
That's thrashing. The CPU does zero actual work. It just shuffles memory pages between RAM and disk, over and over and over.
I finally understood thrashing after reading Peter Denning's 1968 paper on the Working Set Model. His insight was simple but profound: programs don't access memory randomly. They access certain pages repeatedly during specific time periods. The set of pages a program needs to run smoothly is its working set.
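Here's how I picture the definition in code: the working set at time t is the set of distinct pages touched in the last few references. This is a minimal sketch. The function name, the window parameter delta, and the reference string are made up for illustration.

```javascript
// Working set W(t, delta): the distinct pages referenced in the last
// `delta` references ending at position t (inclusive).
function workingSet(references, t, delta) {
  const start = Math.max(0, t - delta + 1);
  return new Set(references.slice(start, t + 1));
}

// Hypothetical reference string: page numbers touched by a process over time.
const pageRefs = [1, 2, 1, 3, 1, 2, 7, 7, 7, 8, 7, 8];

console.log([...workingSet(pageRefs, 5, 4)]);  // [ 1, 3, 2 ]
console.log(workingSet(pageRefs, 11, 4).size); // 2
```

Notice how the working set shifts from pages 1, 2, 3 to pages 7, 8 as the program moves into a different phase.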
Think about a simple loop:
int sum = 0;
for (int i = 0; i < 10000; i++) {
    sum += array[i];
}
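While this loop runs, it needs:
sum and i
array data
These pages are the working set. They're referenced again and again. This is called locality of reference, and it comes in two flavors:
Temporal locality: a page used just now will probably be used again soon, like sum and i on every iteration.
Spatial locality: addresses near a recent access will probably be touched next, like array[i + 1] right after array[i].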
If the OS gives a process enough physical memory (frames) to hold its working set, page faults are rare. Everything runs smoothly. But if the OS gives it less than the working set? Page fault cascade.
Process A: working set of 10 pages, allocated 3 frames
Process B: working set of 8 pages, allocated 2 frames
Process C: working set of 12 pages, allocated 4 frames
Result: Constant page faults in all processes
→ Disk I/O explodes
→ CPU waits for I/O
→ CPU utilization drops
→ OS thinks: "CPU is idle! Add more processes!"
→ Degree of multiprogramming increases
→ More page faults
→ Thrashing gets worse
This is the vicious cycle. The CPU utilization drops because everything is waiting for disk I/O. But the OS thinks the CPU is idle, so it tries to "help" by running more processes. Which makes everything worse.
Normally, more processes means higher CPU utilization. While one process waits for I/O, another can use the CPU. Win-win. But this relationship only holds up to a point. Past that point, the graph falls off a cliff.
CPU Utilization
     ^
     |      Normal Zone      |  Thrashing Zone
 100%|               ********|*
     |            ***        | *
     |         ***           |  *
     |      ***              |   *
     |   ***                 |    *
     |***____________________|_____*_____> Degree of Multiprogramming
     0                   Threshold
Cross the threshold, and physical memory becomes too scarce. Page replacement happens constantly. The CPU spends all its time on page replacement instead of actual work. That's thrashing.
The page replacement policy matters too. FIFO (First-In-First-Out) kicks out the oldest page. Simple, but it suffers from Belady's Anomaly—sometimes adding more frames actually increases page faults. Weird and counterintuitive.
LRU (Least Recently Used) kicks out the page that hasn't been used in the longest time. It leverages locality of reference, so it performs better. But in thrashing conditions, it doesn't matter which algorithm you use. If there's not enough physical memory for the working sets, thrashing is inevitable.
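Belady's Anomaly is easier to believe once you count the faults yourself. Below is a small simulation of both policies, run on the classic textbook reference string. The function names are mine, and this is a sketch of the policies, not how a kernel implements them.

```javascript
// Count page faults for a reference string under FIFO replacement.
function fifoFaults(references, frameCount) {
  const frames = []; // oldest page at index 0
  let faults = 0;
  for (const page of references) {
    if (frames.includes(page)) continue; // hit: FIFO order is unchanged
    faults++;
    if (frames.length === frameCount) frames.shift(); // evict the oldest page
    frames.push(page);
  }
  return faults;
}

// Count page faults for a reference string under LRU replacement.
function lruFaults(references, frameCount) {
  const frames = []; // least recently used page at index 0
  let faults = 0;
  for (const page of references) {
    const index = frames.indexOf(page);
    if (index !== -1) {
      frames.splice(index, 1); // hit: will be re-added as most recently used
    } else {
      faults++;
      if (frames.length === frameCount) frames.shift(); // evict the LRU page
    }
    frames.push(page);
  }
  return faults;
}

const classicRefs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5];
console.log(fifoFaults(classicRefs, 3), fifoFaults(classicRefs, 4)); // 9 10
console.log(lruFaults(classicRefs, 3), lruFaults(classicRefs, 4));   // 10 8
```

FIFO gets worse going from 3 to 4 frames (9 faults, then 10). LRU improves (10 faults, then 8).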
$ dmesg | grep -i "killed process"
[12345.678901] Out of memory: Killed process 2345 (node) total-vm:4GB, anon-rss:3.8GB, file-rss:0kB
[12389.123456] Out of memory: Killed process 3456 (postgres) total-vm:2GB, anon-rss:1.9GB, file-rss:0kB
The OOM Killer (Out-Of-Memory Killer) had struck. When Linux runs completely out of physical memory, it starts killing processes. It calculates a score for each process and kills the one that seems most expendable. Brutal, but it's the last resort to prevent total system freeze.
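If you want to alert on these events, a small parser is enough. This sketch assumes the log lines look like the ones shown above. The exact wording varies between kernel versions, so treat the regular expression as a starting point.

```javascript
// Extract the pid and command name from kernel log lines shaped like
// "Out of memory: Killed process 2345 (node) ...".
// Assumption: this message format; older kernels word it differently.
function parseOomKills(logText) {
  const pattern = /Out of memory: Killed process (\d+) \(([^)]+)\)/;
  const kills = [];
  for (const line of logText.split('\n')) {
    const match = pattern.exec(line);
    if (match) kills.push({ pid: Number(match[1]), name: match[2] });
  }
  return kills;
}

const sampleLog = [
  '[12345.678901] Out of memory: Killed process 2345 (node) total-vm:4GB, anon-rss:3.8GB, file-rss:0kB',
  '[12389.123456] Out of memory: Killed process 3456 (postgres) total-vm:2GB, anon-rss:1.9GB, file-rss:0kB',
].join('\n');

console.log(parseOomKills(sampleLog));
```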
To avoid being OOM Killer'd, you need to detect memory pressure early. You can monitor with vmstat:
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 2 524288 12340 8192 102400 120 180 2400 1200 500 800 15 10 50 25 0
4 3 589824 8192 8192 102400 340 420 5600 3200 700 1200 12 8 40 40 0
6 5 655360 4096 8192 102400 780 920 12000 7800 1100 2000 8 5 20 67 0
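Watch these columns:
si: memory swapped in from disk per second
so: memory swapped out to disk per second
b: processes in uninterruptible sleep, usually blocked on I/O
wa: percentage of CPU time spent waiting for I/O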
In this example, si and so are climbing, and wa hits 67%. Classic thrashing symptoms.
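The same check can be automated. Here's a rough sketch that reads vmstat output as text and flags samples with heavy swap traffic plus high I/O wait. The function name and the default limits (100 swapped per second, 30% wait) are my own guesses, not established thresholds.

```javascript
// Flag vmstat samples that look like thrashing: heavy swap traffic and high I/O wait.
// The default limits are illustrative assumptions; tune them for your system.
function findSwapPressure(vmstatOutput, limits = { swapPerSec: 100, ioWaitPct: 30 }) {
  const rows = vmstatOutput.trim().split('\n').map((line) => line.trim().split(/\s+/));
  // The column-name row is the one that contains both "si" and "wa".
  const header = rows.find((cols) => cols.includes('si') && cols.includes('wa'));
  if (!header) return [];
  const col = (name) => header.indexOf(name);
  return rows
    .filter((cols) => cols.length === header.length && /^\d+$/.test(cols[0]))
    .map((cols) => ({
      si: Number(cols[col('si')]),
      so: Number(cols[col('so')]),
      wa: Number(cols[col('wa')]),
    }))
    .filter((s) => s.si + s.so > limits.swapPerSec && s.wa > limits.ioWaitPct);
}

const vmstatSample = `
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  2 524288  12340   8192 102400  120  180  2400  1200  500  800 15 10 50 25  0
 4  3 589824   8192   8192 102400  340  420  5600  3200  700 1200 12  8 40 40  0
 6  5 655360   4096   8192 102400  780  920 12000  7800 1100 2000  8  5 20 67  0
`;

console.log(findSwapPressure(vmstatSample));
```

With these limits, the first sample passes (wa is only 25) and the last two are flagged.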
You can also check swap usage with free:
$ free -h
total used free shared buff/cache available
Mem: 7.8G 7.2G 100M 256M 500M 200M
Swap: 2.0G 1.9G 100M
Swap usage near 100% is a red flag. The available column is critical: it estimates how much memory new programs can use without pushing the system into swap. As a rough rule of thumb, if it drops below 10% of total memory, thrashing is close.
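On Linux, the same numbers are in /proc/meminfo, which is easier to parse than the free output. A minimal sketch, assuming the standard MemTotal, MemAvailable, SwapTotal and SwapFree fields. The function name and sample values are made up.

```javascript
// Compute the two ratios discussed above from /proc/meminfo-style text.
// On a real Linux box you could pass in:
//   require('fs').readFileSync('/proc/meminfo', 'utf8')
function memoryPressure(meminfoText) {
  const kb = {};
  for (const line of meminfoText.split('\n')) {
    const match = /^(\w+):\s+(\d+)/.exec(line.trim());
    if (match) kb[match[1]] = Number(match[2]);
  }
  return {
    availableRatio: kb.MemAvailable / kb.MemTotal,
    swapUsedRatio: kb.SwapTotal > 0 ? (kb.SwapTotal - kb.SwapFree) / kb.SwapTotal : 0,
  };
}

// Hypothetical values, roughly matching the free output above.
const meminfoSample = `
MemTotal:        8000000 kB
MemFree:          100000 kB
MemAvailable:     200000 kB
SwapTotal:       2000000 kB
SwapFree:         100000 kB
`;

console.log(memoryPressure(meminfoSample));
```

Here availableRatio comes out at 2.5% and swapUsedRatio at 95%, both well past the warning signs.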
Real-time monitoring with top:
$ top
top - 03:24:15 up 10 days, 4:32, 1 user, load average: 8.45, 7.23, 6.11
Tasks: 312 total, 4 running, 308 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.5 us, 8.3 sy, 0.0 ni, 15.2 id, 64.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 7984.0 total, 200.5 free, 7681.1 used, 102.4 buff/cache
MiB Swap: 2048.0 total, 102.4 free, 1945.6 used. 195.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2345 app 20 0 4.1g 3.8g 12m D 15.3 48.9 123:45.67 node
3456 postgres 20 0 2.1g 1.9g 8m D 8.7 24.5 89:12.34 postgres
Low %CPU but load average above 8. wa (wait) at 64%. Process state D (uninterruptible sleep, waiting for disk I/O). Textbook thrashing.
Adding RAM is the most straightforward fix. If you can throw money at the problem, do it. We upgraded our server from 8GB to 32GB. Thrashing vanished overnight.
Limit the number of concurrent processes. Don't spin up Docker containers willy-nilly. Run only what's necessary. For Node.js cluster mode, tune the worker count:
const cluster = require('cluster');
const os = require('os');

// cluster.isPrimary replaced cluster.isMaster in Node.js 16; fall back on older versions.
const isPrimary = cluster.isPrimary ?? cluster.isMaster;

if (isPrimary) {
  // Old approach: spawn workers equal to CPU cores
  // const numWorkers = os.cpus().length;

  // Better: consider available memory.
  // The 4 GiB cutoff is a rule of thumb; tune it for your app.
  const availableMemory = os.freemem();
  const fourGiB = 4 * 1024 * 1024 * 1024;
  const numWorkers = availableMemory > fourGiB
    ? os.cpus().length
    : Math.max(2, Math.floor(os.cpus().length / 2));

  console.log(`Starting ${numWorkers} workers`);
  for (let i = 0; i < numWorkers; i++) {
    cluster.fork();
  }
} else {
  require('./app');
}
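A slightly finer-grained variant is to budget memory per worker instead of using a single cutoff. This is a sketch under one big assumption: you've measured roughly how much resident memory one worker needs. The 0.5 GiB figure below is a placeholder, and workerCount is a hypothetical helper.

```javascript
const os = require('os');

// Cap the worker count by both CPU cores and a per-worker memory budget.
// memoryPerWorkerBytes is an assumption: measure your own app's resident size.
function workerCount(cpuCount, availableBytes, memoryPerWorkerBytes) {
  const byMemory = Math.floor(availableBytes / memoryPerWorkerBytes);
  return Math.max(1, Math.min(cpuCount, byMemory));
}

const GiB = 1024 ** 3;
console.log(workerCount(os.cpus().length, os.freemem(), 0.5 * GiB));
```

With 8 cores and 2 GiB free, this yields 4 workers, not 8. That's the point: fewer workers that fit in RAM beat more workers that swap.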
The OS should estimate each process's working set and allocate enough frames to hold it. If physical memory is too scarce to accommodate all working sets, suspend entire processes. Better to have some processes completely stopped than all processes crawling.
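As a thought experiment, here's that policy applied to the three processes from the earlier example. The function is hypothetical, and picking the largest working set as the first victim is my own simplification. A real scheduler would also weigh priority and fairness.

```javascript
// Suspend processes until the remaining working sets fit in physical frames.
// Assumption of this sketch: suspend the largest working set first.
function chooseSuspensions(processes, totalFrames) {
  const running = [...processes].sort((a, b) => a.workingSet - b.workingSet);
  const suspended = [];
  let demand = running.reduce((sum, p) => sum + p.workingSet, 0);
  while (demand > totalFrames && running.length > 0) {
    const victim = running.pop(); // largest remaining working set
    demand -= victim.workingSet;
    suspended.push(victim.name);
  }
  return { running: running.map((p) => p.name), suspended };
}

const exampleProcesses = [
  { name: 'A', workingSet: 10 },
  { name: 'B', workingSet: 8 },
  { name: 'C', workingSet: 12 },
];

console.log(chooseSuspensions(exampleProcesses, 20));
// { running: [ 'B', 'A' ], suspended: [ 'C' ] }
```

With 20 frames, suspending C lets A and B run with their full working sets. With only the 9 frames from the earlier allocation, only B would fit.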
On Linux, use nice and ionice to adjust process priorities:
# Lower CPU priority
$ nice -n 19 ./heavy_process
# Lower I/O priority (idle class)
$ ionice -c 3 ./heavy_process
This is the page fault frequency (PFF) approach. Monitor each process's page fault rate. If page faults are too frequent (e.g., 100+ per second), allocate more frames. If page faults are rare (e.g., fewer than 5 per second), reclaim frames. Dynamic adjustment keeps each process close to its working set and prevents thrashing.
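The control loop is simple enough to sketch in a few lines. The thresholds are the example numbers from the paragraph above. The function name, the step size of one frame, and the one-frame floor are assumptions.

```javascript
// Thresholds taken from the example numbers in the text.
const UPPER_FAULTS_PER_SEC = 100;
const LOWER_FAULTS_PER_SEC = 5;

// Return the new frame allocation for a process given its current fault rate.
// `step` and `minFrames` are illustrative assumptions.
function adjustFrames(currentFrames, faultsPerSec, step = 1, minFrames = 1) {
  if (faultsPerSec >= UPPER_FAULTS_PER_SEC) return currentFrames + step; // too many faults: grow
  if (faultsPerSec < LOWER_FAULTS_PER_SEC) return Math.max(minFrames, currentFrames - step); // shrink
  return currentFrames; // inside the acceptable band: leave it alone
}

console.log(adjustFrames(10, 150)); // 11
console.log(adjustFrames(10, 2));   // 9
console.log(adjustFrames(10, 50));  // 10
```

If every process is asking for more frames and none can be reclaimed, the loop has nothing left to give, and that's the signal to suspend something.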
Thrashing on HDD versus SSD is night and day. HDDs have mechanical parts. Random I/O requires physically moving the disk head, which takes around 10ms per seek. At 100 page faults per second, that's 100 × 10ms = 1 second of seeking for every second of wall-clock time. The disk is saturated by head movement alone.
SSDs have no moving parts. No seek time. Same page fault frequency, but each fault is serviced an order of magnitude or more faster. With an SSD, thrashing becomes "sluggish but usable" instead of "completely frozen."
But don't take this as a license to ignore thrashing just because you have an SSD. SSDs have limited write cycles. Constant swap activity from thrashing will dramatically shorten SSD lifespan. And even on SSDs, disk I/O is thousands of times slower than CPU operations. Thrashing is still a performance disaster.
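The back-of-the-envelope math is worth writing down. This sketch computes what fraction of each second goes to servicing page faults, if they are handled one at a time. The 10ms HDD figure comes from the text above. The 1ms SSD figure is my own deliberately conservative assumption.

```javascript
// Fraction of wall-clock time spent servicing page faults, capped at 1.
// Assumes faults are serviced serially, one after another.
function pagingBusyFraction(faultsPerSec, latencyMs) {
  return Math.min(1, (faultsPerSec * latencyMs) / 1000);
}

console.log(pagingBusyFraction(100, 10)); // HDD at 10 ms per fault: 1
console.log(pagingBusyFraction(100, 1));  // SSD at an assumed 1 ms per fault: 0.1
```

At 1, the disk has no spare capacity at all. At 0.1, the machine is slower but still answers.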
Thrashing taught me something fundamental: systems have limits. You can't run 50 Chrome tabs on 4GB RAM. You can't handle 100,000 concurrent users on an 8GB server. No matter how clever the OS, no matter how magical virtual memory seems, physics is physics.
In the early startup days, I tried to "optimize" my way out of resource constraints. Tuned code, added caching, improved algorithms. But eventually I realized: this wasn't a code problem. It was a hardware problem. Adding 16GB of RAM was more effective than two weeks of optimization work.
Optimization still matters. Wasteful code should be fixed. But fundamental resource scarcity requires adding resources. That's the fastest, most reliable, most honest solution. Thrashing is the system's way of telling you: "I can't do this anymore. I need help."
And sometimes, the bravest thing you can do is listen.