User Mode vs Kernel Mode: Dual Protection
Memories of Blue Screen
Let me recall the Windows 95 era. Blue Screens appeared daily.
The reason was simple. A buggy application could accidentally touch memory that belonged to the OS, and one app's mistake took down the whole computer. Unsaved documents? Gone.
After experiencing this disaster, I understood it this way. Without boundaries between programs, the system collapses.
To prevent this catastrophe, CPUs enforce Modes at the hardware level. This is not a software convention but a security mechanism etched into silicon.
Difference in Power: Citizen vs Police
I didn't understand this concept initially. "Why can't programs directly read the disk?" I thought imposing restrictions was inefficient.
But this analogy clicked for me.
Imagine a city.
Regular citizens can do anything in their own homes. Cook, watch TV, write code. But they cannot break into the police station and directly modify the criminal record database. To do that, they must go to the police station counter and follow "official procedures."
1. User Mode
- Status: Civilian.
- Power: Limited. Can only touch its own allocated memory.
- Trait: All code we write (Hello World, Web Browsers, VS Code, Games) runs here.
- Restriction: Absolutely NO direct access to hardware (Disk, Network Card, USB).
- Ring Level: Ring 3 (lowest privilege)
2. Kernel Mode
- Status: Police / Administrator.
- Power: Unlimited (Privileged). Can execute all CPU instructions and access all memory.
- Trait: Only the OS kernel runs here, together with the drivers and modules loaded into it.
- Privileged Instructions: HLT (halt CPU), CLI (disable interrupts), I/O port control
- Ring Level: Ring 0 (highest privilege)
This was it. Privilege Separation. The hardware itself isolates untrusted code from trusted code.
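The restriction on privileged instructions can be observed from an ordinary program. Below is a minimal sketch, assuming Linux on an x86 CPU and GCC or Clang; the function name signal_from_privileged_instruction is made up for illustration. It runs HLT in a child process so that the parent survives whatever the kernel does to the offender.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs the privileged HLT instruction in a child process and reports how the
 * child ended. Returns the terminating signal number, 0 if the child exited
 * normally, or -1 if the experiment could not be run (non-x86 CPU or a
 * failed fork). */
int signal_from_privileged_instruction(void) {
#if defined(__x86_64__) || defined(__i386__)
    pid_t child = fork();
    if (child < 0) {
        return -1;
    }
    if (child == 0) {
        __asm__ volatile("hlt"); /* Ring 3 is not allowed to execute this */
        _exit(0);                /* reached only if HLT was permitted */
    }
    int status = 0;
    if (waitpid(child, &status, 0) < 0) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
#else
    return -1; /* assumption: the demo only targets x86 */
#endif
}
```

On Linux I would expect the return value to be SIGSEGV: the CPU refuses the instruction, raises a fault, and the kernel kills the child with a signal. The child never gets to halt the machine.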
CPU Protection Rings: 4 Security Layers
Initially, I thought there were only two modes: "User Mode / Kernel Mode." I was wrong.
x86 CPUs actually define 4 Protection Rings.
Ring 0: Kernel (OS kernel)
↓
Ring 1: Device Drivers (theoretical, rarely used)
↓
Ring 2: Device Drivers (theoretical, rarely used)
↓
Ring 3: Applications (our programs)
Most modern OSes only use Ring 0 and Ring 3. Rings 1 and 2 are effectively abandoned territory, because the extra security was not worth the added complexity.
But when hardware virtualization (Intel VT-x, AMD-V) emerged, a new level appeared below Ring 0. It is informally called Ring -1.
Ring -1: Hypervisor (VMware, KVM, Xen)
↓
Ring 0: Guest OS Kernel (Linux inside a VM)
↓
Ring 3: Apps inside VM
I summarized it this way. The lower the ring number, the closer to hardware and the greater the privilege. Closer to 0 means god-like powers, closer to 3 means prisoner.
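On x86 the current ring is not hidden: the low two bits of the CS segment register hold the Current Privilege Level. The following is a small sketch, assuming an x86 CPU and GCC or Clang inline assembly; current_privilege_level is a hypothetical helper name.

```c
#include <stdint.h>

/* Returns the Current Privilege Level (0 to 3) on x86, taken from the low two
 * bits of the CS segment register, or -1 on other architectures. */
int current_privilege_level(void) {
#if defined(__x86_64__) || defined(__i386__)
    uint16_t cs;
    __asm__ volatile("mov %%cs, %0" : "=r"(cs)); /* reading CS is allowed in Ring 3 */
    return cs & 0x3;
#else
    return -1; /* assumption: only x86 exposes rings this way */
#endif
}
```

Called from a normal process, this should return 3. The same code compiled into a kernel module would see 0.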
Mode Switching Mechanism: Trap, Interrupt, Exception
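I initially mistook system calls for ordinary function calls.
I was wrong.
On classic 32-bit x86 Linux, a system call is a software interrupt. (64-bit x86 uses the dedicated syscall instruction instead, but the idea is the same: a deliberate, controlled entry into the kernel.)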
Normal function call:
int result = add(3, 5); // executes at same privilege level
System call:
int fd = open("/etc/passwd", O_RDONLY); // CPU mode switch occurs
When you call the open() function, this happens internally.
mov eax, 5 ; syscall number (open = 5 on 32-bit x86 Linux)
mov ebx, filename ; first argument
mov ecx, O_RDONLY ; second argument
int 0x80 ; <- This is key! Software interrupt
When the int 0x80 instruction executes:
- CPU immediately switches from user mode to kernel mode
- References the Interrupt Descriptor Table (IDT) to find the 0x80 handler
- Jumps to the kernel's system_call() function
- Kernel opens the file
- Returns to user mode
This clicked for me. System calls are not jumps, they're "traps." Voluntarily entering prison to ask a favor from the warden (kernel).
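The same trap can be triggered from C without writing assembly. This is a minimal sketch, assuming Linux with glibc: syscall() is the generic libc entry point that loads the call number and arguments into registers and executes the architecture's trap instruction. The names raw_getpid and raw_write are made up for this example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Asks the kernel for this process's ID by system call number, bypassing the
 * getpid() wrapper. */
long raw_getpid(void) {
    return syscall(SYS_getpid);
}

/* Writes len bytes from buf to fd with the raw write system call. Returns the
 * number of bytes written, or -1 with errno set. */
long raw_write(int fd, const void *buf, unsigned long len) {
    return syscall(SYS_write, fd, buf, len);
}
```

raw_getpid() should return the same value as getpid(), which suggests that the familiar wrapper is only a thin layer over the trap.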
Trap vs Interrupt vs Exception
I often confused these three. I accepted it this way.
| Type | Trigger | Example |
|---|---|---|
| Trap | Program intentionally triggers | System call (int 0x80, syscall) |
| Interrupt | Hardware triggers | Keyboard input, timer, network packet arrival |
| Exception | CPU detects abnormal situation | Divide by zero, Page Fault, General Protection Fault (a process sees the last two as a Segmentation Fault) |
All cause kernel mode transition. The difference is "who pulls the trigger."
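The Exception row can be watched from user space. Here is a sketch, assuming a POSIX system such as Linux: it maps one page with no access rights, writes to it, and catches the signal the kernel sends back. The function name signal_from_protected_page is hypothetical, and jumping out of a SIGSEGV handler like this is for demonstration only.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recovery_point;
static volatile sig_atomic_t caught_signal;

static void on_fault(int signo) {
    caught_signal = signo;
    siglongjmp(recovery_point, 1); /* skip past the faulting store */
}

/* Writes to a page mapped with PROT_NONE and returns the signal that was
 * delivered (0 if none), or -1 if the setup failed. */
int signal_from_protected_page(void) {
    struct sigaction action, previous;
    memset(&action, 0, sizeof(action));
    action.sa_handler = on_fault;
    sigemptyset(&action.sa_mask);
    if (sigaction(SIGSEGV, &action, &previous) != 0) {
        return -1;
    }

    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    void *mapping = mmap(NULL, page_size, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mapping == MAP_FAILED) {
        sigaction(SIGSEGV, &previous, NULL);
        return -1;
    }

    caught_signal = 0;
    if (sigsetjmp(recovery_point, 1) == 0) {
        /* The CPU raises a page fault here; the kernel's fault handler turns
         * it into SIGSEGV for this process. */
        *(volatile char *)mapping = 1;
    }

    munmap(mapping, page_size);
    sigaction(SIGSEGV, &previous, NULL);
    return caught_signal;
}
```

Nobody called the kernel here. The CPU detected the illegal access on its own, switched to kernel mode, and the kernel decided what to do with the process.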
Real System Calls: Tracing with strace
I knew the concept but had never seen it in action. So I tried strace.
strace ls
Output:
execve("/bin/ls", ["ls"], [/* env vars */]) = 0
brk(NULL) = 0x55b8f0a0e000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=95788, ...}) = 0
mmap(NULL, 95788, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f8c9c0a0000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0"..., 832) = 832
...
write(1, "file1.txt\nfile2.txt\n", 20) = 20
close(1) = 0
exit_group(0) = ?
+++ exited with 0 +++
I was shocked. A single ls command makes dozens of system calls.
openat(), fstat(), mmap(), read(), write(), close()... each one causes a user → kernel → user transition.
I understood it this way. Programs don't talk directly to hardware. They only talk through the kernel.
Cost of Mode Switching: System Call Overhead
Now I understood why system calls are expensive.
One mode switch consumes hundreds to thousands of CPU cycles.
Why?
- Save Registers: Back up the user mode CPU registers on the kernel stack
- Switch Page Tables: On kernels with KPTI (Kernel Page Table Isolation) enabled, switch to the page tables that map kernel memory
- Cache and TLB Pressure: Kernel code and data push some of the program's entries out of the CPU caches (L1, L2), and a page table switch can flush TLB entries
- Permission Check: Kernel verifies "Does this process have permission to open this file?"
- Return Process: Restore registers when returning to user mode
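The cost listed above can be estimated with a rough micro-benchmark. This is only a sketch, assuming Linux with glibc and GCC or Clang: it times getpid, which does almost no work inside the kernel, so most of the measured time is the round trip itself. The function names are made up, and the numbers vary widely between CPUs, kernels, and mitigation settings.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* A trivial user mode function to compare against. noinline keeps the
 * compiler from removing the call. */
__attribute__((noinline)) static long user_mode_work(long x) {
    return x + 1;
}

static double elapsed_ns(struct timespec start, struct timespec end) {
    return (double)(end.tv_sec - start.tv_sec) * 1e9 +
           (double)(end.tv_nsec - start.tv_nsec);
}

/* Average nanoseconds per getpid system call. */
double ns_per_syscall(long iterations) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        syscall(SYS_getpid); /* user -> kernel -> user on every iteration */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ns(start, end) / (double)iterations;
}

/* Average nanoseconds per ordinary function call, for comparison. */
double ns_per_function_call(long iterations) {
    struct timespec start, end;
    volatile long sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        sink = user_mode_work(sink); /* stays in user mode */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ns(start, end) / (double)iterations;
}
```

Comparing ns_per_syscall(1000000) with ns_per_function_call(1000000) on the same machine gives a feel for the gap between crossing the boundary and staying on one side of it.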
Doing this once is fine. But doing it 1 million times per second?
// Inefficient code
for (int i = 0; i < 1000000; i++) {
write(fd, &i, sizeof(i)); // 1 million system calls!
}
This code is terribly slow because every loop iteration causes a user → kernel → user transition.
Solution: Buffered I/O
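// Efficient code
char buffer[4096];
size_t pos = 0;
for (int i = 0; i < 1000000; i++) {
    memcpy(&buffer[pos], &i, sizeof(i)); // copy into the user-space buffer, no system call
    pos += sizeof(i);
    if (pos >= sizeof(buffer)) {
        write(fd, buffer, pos); // system call only when buffer fills
        pos = 0;
    }
}
if (pos > 0) {
    write(fd, buffer, pos); // flush the partial buffer left over after the loop
}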
I summarized it this way. Reducing system calls is key to performance optimization.
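The C standard library already implements this pattern. As a sketch, assuming a hosted C environment, the function below (write_ints_buffered is a made-up name) writes the same integers through a stdio stream with a 4096-byte buffer, so fwrite only reaches the kernel when the buffer fills or the stream is closed.

```c
#include <stdio.h>

/* Writes count integers to path through a stdio stream backed by a 4096-byte
 * buffer. Returns 0 on success, -1 on failure. */
int write_ints_buffered(const char *path, int count) {
    char buffer[4096];
    FILE *out = fopen(path, "wb");
    if (out == NULL) {
        return -1;
    }

    /* setvbuf must be called before any other operation on the stream. */
    if (setvbuf(out, buffer, _IOFBF, sizeof(buffer)) != 0) {
        fclose(out);
        return -1;
    }

    int ok = 1;
    for (int i = 0; i < count && ok; i++) {
        /* Usually just a copy into buffer; no system call is made. */
        ok = (fwrite(&i, sizeof(i), 1, out) == 1);
    }

    /* fclose flushes whatever is still in the buffer before closing. */
    if (fclose(out) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}
```

Running a program that calls this under strace should show far fewer write calls than loop iterations.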
Why Do Docker Containers Run in User Space?
While using Docker, I wondered: "Containers are isolated environments, so why are they faster than VMs?"
The answer was they share the kernel.
[VM Structure]
App A (Ring 3)
↓
Guest OS Kernel (Ring 0)
↓
Hypervisor (Ring -1)
↓
Host OS Kernel (Ring 0)
↓
Hardware
[Container Structure]
App A (Ring 3)
↓
Host OS Kernel (Ring 0) <- direct system call
↓
Hardware
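Inside a VM, privileged operations and device I/O trap out of the guest into the hypervisor (Guest → Hypervisor → Host), and every VM boots and runs its own guest kernel. In a container, a system call goes straight to the host kernel (App → Host Kernel).
That's why Docker is fast. But its isolation is weaker, because a single kernel vulnerability can make container escape possible.
I accepted it this way. A container is not a separate machine; it's an "illusion" the kernel builds with namespaces and cgroups. All containers actually share the same kernel.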
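One way to see the shared kernel is to ask the kernel to identify itself. Below is a minimal sketch, assuming a POSIX system; kernel_release is a hypothetical helper built on the standard uname() call.

```c
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

/* Copies "<system name> <kernel release>" into out. Returns 0 on success,
 * -1 on failure. */
int kernel_release(char *out, size_t out_len) {
    struct utsname info;
    if (out == NULL || out_len == 0 || uname(&info) != 0) {
        return -1;
    }
    snprintf(out, out_len, "%s %s", info.sysname, info.release);
    return 0;
}
```

On a Linux host, a program printing this string should report the same kernel release inside a Docker container as it does on the host, whatever distribution the container image is based on. Inside a VM it reports the guest kernel instead.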
Danger of Kernel Modules: Ring 0 Code Injection
I tried creating a Linux kernel module for the first time.
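// hello.c - kernel module
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

MODULE_LICENSE("GPL"); // license declaration expected by the module build

static int __init hello_init(void) {
    printk(KERN_INFO "Hello Kernel!\n");
    return 0; // a nonzero return aborts the load
}

static void __exit hello_exit(void) {
    printk(KERN_INFO "Bye Kernel!\n");
}

module_init(hello_init); // runs on insmod
module_exit(hello_exit); // runs on rmmod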
After compilation:
sudo insmod hello.ko
At this moment, my code executes at Ring 0.
I am now god. I can read and write all memory, kill any process, intercept keyboard input.
One wrong line:
*(int*)0 = 42; // NULL pointer dereference
Kernel Oops, and depending on where it happens and how the kernel is configured, a full Kernel Panic. The entire system can die.
I understood it this way. Kernel mode is absolute power, and absolute power is absolutely dangerous.
That's why Linux requires root privileges (for example via sudo) to load kernel modules. Without administrator privileges, you cannot inject Ring 0 code.
Spectre and Meltdown: Mode Boundary Collapse
In 2018, I saw news about Spectre and Meltdown vulnerabilities. Initially, I didn't understand.
"What does a CPU bug matter?"
But this was serious. User mode programs could read kernel memory.
I simplified the principle.
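// Meltdown-style attack (simplified pseudocode, not a working exploit)
// kernel_memory stands in for an address in the kernel's part of the address space
char kernel_memory[4096]; // kernel area (inaccessible from user mode)
int secret = kernel_memory[0]; // <- should raise exception here
// But CPU's "Speculative Execution" already executed it
// Instructions that depend on secret load data into cache before the exception occurs
// Attacker can infer secret value through cache timing attacks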
CPUs predict and execute ahead for performance. Later they realize "Oh, I shouldn't have executed this" and roll back, but traces remain in CPU cache.
This was it. Hardware optimization became a security hole.
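Solution for Meltdown: KPTI (Kernel Page Table Isolation). When in user mode, almost all kernel memory is removed from the page table, leaving only the small part needed to enter the kernel. But this reportedly increased system call costs by 10-30%, depending on the workload and CPU.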
I accepted it this way. Security and performance are a trade-off.
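The "cache timing" part is the measurable piece of the attack: a load from a cached line is much faster than a load from a line that has to come from memory. The following sketch only demonstrates that timing gap and reads no forbidden memory. It assumes a 64-bit x86 CPU and GCC or Clang intrinsics; measure_cache_timing and time_load are made-up names.

```c
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>

/* Cycles taken by one load from addr, measured with the timestamp counter.
 * The fences keep the load from drifting outside the timed window. */
static uint64_t time_load(const volatile char *addr) {
    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();
    (void)*addr; /* the load being timed */
    _mm_lfence();
    return __rdtsc() - start;
}
#endif

/* Stores the fastest observed load time for a cached line and for a line that
 * was just flushed from the cache. Returns 0 on success, -1 if the CPU is not
 * supported by this sketch. */
int measure_cache_timing(uint64_t *cached_cycles, uint64_t *flushed_cycles) {
#if defined(__x86_64__)
    static char line[64] __attribute__((aligned(64)));
    uint64_t best_cached = UINT64_MAX;
    uint64_t best_flushed = UINT64_MAX;

    for (int trial = 0; trial < 1000; trial++) {
        (void)*(volatile char *)line; /* make sure the line is cached */
        uint64_t cycles = time_load(line);
        if (cycles < best_cached) {
            best_cached = cycles;
        }

        _mm_clflush(line); /* evict the line from every cache level */
        _mm_mfence();      /* wait for the flush to complete */
        cycles = time_load(line);
        if (cycles < best_flushed) {
            best_flushed = cycles;
        }
    }

    *cached_cycles = best_cached;
    *flushed_cycles = best_flushed;
    return 0;
#else
    (void)cached_cycles;
    (void)flushed_cycles;
    return -1;
#endif
}
```

On typical hardware the flushed figure comes out several times larger than the cached one. An attacker uses exactly this difference to learn which cache line a speculative load touched.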
/proc/interrupts: Traces of Interrupts
I was curious how often the kernel does mode switching.
cat /proc/interrupts
Output:
CPU0 CPU1 CPU2 CPU3
0: 142 0 0 0 IO-APIC 2-edge timer
1: 9 0 0 0 IO-APIC 1-edge i8042
8: 0 0 0 0 IO-APIC 8-edge rtc0
9: 0 0 0 0 IO-APIC 9-fasteoi acpi
12: 155 0 0 0 IO-APIC 12-edge i8042
...
NMI: 123 456 789 101 Non-maskable interrupts
LOC: 5234567 5234568 5234569 5234570 Local timer interrupts
Look at LOC (Local timer interrupts). Over 5 million times.
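With a 1000 Hz tick, timer interrupts occur every 1 ms. At that rate, 5,234,567 ticks works out to about 5,200 seconds (roughly 1.5 hours) of uptime. This is only a rough estimate: many distributions build the kernel with a lower tick rate such as 250 Hz, and tickless kernels skip ticks on idle CPUs.
Up to 1000 times per second, each CPU is forced into kernel mode. Whether the program wants it or not.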
I understood it this way. Interrupts are forced summons to the CPU.
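Adding up the per-CPU columns by hand gets tedious, so here is a small parser sketch. It assumes the layout shown above (a label, a colon, one count per CPU, then a description that does not start with a digit); sum_interrupt_counts is a hypothetical helper, not part of any library.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sums the per-CPU counts on the line whose label (for example "LOC" or "0")
 * matches. Returns the total, or -1 if the file cannot be opened or the label
 * is not found. Lines longer than the 4096-byte buffer are not handled. */
long long sum_interrupt_counts(const char *path, const char *label) {
    FILE *in = fopen(path, "r");
    if (in == NULL) {
        return -1;
    }

    size_t label_len = strlen(label);
    long long total = -1;
    char line[4096];

    while (fgets(line, sizeof(line), in) != NULL) {
        char *cursor = line;
        while (isspace((unsigned char)*cursor)) {
            cursor++; /* labels are right-aligned, so skip the padding */
        }
        if (strncmp(cursor, label, label_len) != 0 || cursor[label_len] != ':') {
            continue;
        }

        cursor += label_len + 1;
        total = 0;
        for (;;) {
            char *end;
            long long count = strtoll(cursor, &end, 10);
            if (end == cursor) {
                break; /* reached the textual description */
            }
            total += count;
            cursor = end;
        }
        break;
    }

    fclose(in);
    return total;
}
```

Calling sum_interrupt_counts("/proc/interrupts", "LOC") twice, one second apart, and subtracting the results gives the number of local timer interrupts the whole machine handled in that second.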
Summary: This Was It
I summarized user mode and kernel mode this way.
- User Mode (Ring 3): Prisoner. Monitored and can only act in restricted space.
- Kernel Mode (Ring 0): Warden. Can open all doors, control all prisoners.
- System Call: Prisoner submitting a request to the warden. Expensive but safe.
- Mode Switch Cost: Consumes hundreds of cycles or more due to register saving, kernel entry and exit work, and cache and TLB pressure.
- Buffered I/O: Improve performance by reducing system call count.
- Docker: Fast because it shares kernel, but exposed to kernel vulnerabilities.
- Kernel Module: Injecting Ring 0 code. A one-line mistake can take down the entire system.
- Spectre/Meltdown: CPU speculative execution became a security hole. Patches degraded performance.
This was it. Computers are a hierarchy of trust. Ring 3 trusts Ring 0, Ring 0 trusts hardware, and hardware... trusts its designers.