
System Call: How to Ask the Kernel for Favors
Developers can't control the hard disk directly. Instead, they must 'ask' the Kernel via an API. That request is the System Call.


When I started my first startup, I built an image upload feature with Node.js. I knew fs.readFile() would read a file, but I had no idea what happened inside. I just assumed "Node handles it" and moved on.
Then production hit. File reads became so slow they bottlenecked the entire system. Why? I learned the hard way that fs.readFile() isn't just grabbing data from memory. It's asking the kernel for a favor, and that favor has a cost.
That favor mechanism is the System Call.
Imagine if every program could move the hard drive's read head directly. Program A says "read here!" and Program B immediately moves the head somewhere else yelling "no, me first!" Data gets corrupted. Security becomes zero.
So operating systems split CPU privilege modes into two levels.
1. User Mode: Where normal applications run. Limited sandbox. No direct hardware access. Can only touch your own memory.
2. Kernel Mode: Where the OS kernel runs. Full control. Hardware manipulation, total memory access, I/O device control.
Think of it like a hotel system. Guests (User Mode) only have keys to their rooms. To use shared facilities, they call the front desk (Kernel). The staff (Kernel Mode) holds master keys to everything.
That "phone call" is the System Call.
System calls are implemented with special CPU instructions.
- x86-64: syscall (modern) or int 0x80 (legacy)
- ARM: svc (supervisor call)

When you execute this instruction, a Trap occurs. The CPU declares "switching to kernel mode now." At the hardware level, the privilege level changes and the CPU jumps to a predefined kernel address.
In Linux, this entry point is entry_SYSCALL_64 in assembly. Here, the kernel asks "what system call did you request?"
The kernel maintains a System Call Table, an array mapping numbers to function pointers. Like a restaurant menu, each number points to a specific function.
// Linux kernel's system call table (simplified)
const sys_call_ptr_t sys_call_table[] = {
[0] = sys_read,
[1] = sys_write,
[2] = sys_open,
[3] = sys_close,
[57] = sys_fork,
[59] = sys_execve,
// ... 300+ more
};
When a user program invokes a system call, it puts the system call number into a register (on x86-64, that's rax). The kernel uses this number as an index to find and execute the corresponding function.
write() System Call
// C program
printf("Hello");
1. printf() internally calls the write() library function
2. write(), a wrapper provided by glibc, loads the registers:
   - rax = 1 (sys_write number)
   - rdi = 1 (file descriptor, stdout)
   - rsi = address of "Hello"
   - rdx = 5 (character count)
3. The syscall instruction switches from User Mode to Kernel Mode
4. sys_call_table[1] → runs sys_write()
5. The sysret instruction returns from Kernel Mode to User Mode

This entire round trip is a mode switch (often loosely called a context switch). The CPU's state (registers, stack pointer, etc.) must be saved and restored, which costs cycles.
Linux has over 300 system calls, but the frequently used ones are predictable.
int fd = open("/tmp/data.txt", O_RDWR | O_CREAT, 0644);
write(fd, "Hello", 5);
read(fd, buffer, 100);
close(fd);
open(): Opens a file, returns a file descriptor (integer)
read(): Reads data from the file into a buffer
write(): Writes buffer data to the file
close(): Closes the file descriptor
pid_t pid = fork(); // Clone current process
if (pid == 0) {
// Child process
execve("/bin/ls", args, env); // Replace with new program
} else {
// Parent process
wait(NULL); // Wait for child to finish
}
fork(): Duplicates the current process to create a child
exec(): Replaces the current process with a different program
wait(): Waits for child process termination
void* ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// Allocate a 4KB memory page
munmap(ptr, 4096); // Free the memory
mmap(): Maps files into memory or allocates anonymous memory
brk()/sbrk(): Adjusts heap size (used internally by malloc)
int fd = open("/dev/ttyUSB0", O_RDWR);
ioctl(fd, TIOCMGET, &status); // Read serial port status
ioctl(): Sends device-specific commands (a universal control interface)
The problem is that every OS has different system calls. Linux's open() and Windows's CreateFile() are completely different functions.
So the POSIX (Portable Operating System Interface) standard emerged. Unix-like OSes (Linux, macOS, BSD) agreed to provide the same system call interface.
For example, POSIX specifies the signature and behavior of functions like open(), read(), write(), fork(). Thanks to this, C code written on Linux can be recompiled on macOS and just work.
But Windows doesn't follow POSIX. Windows uses its own Win32 API system.
| POSIX (Linux/Mac) | Win32 API (Windows) |
|---|---|
| open() | CreateFile() |
| read() | ReadFile() |
| fork() | CreateProcess() |
| execve() | CreateProcess() |
That's why cross-platform programs typically go through libraries like libc to call OS-specific system calls.
When we use printf() in C, we don't manually put system call numbers into registers. Instead, we call functions provided by libc (the C standard library).
libc provides wrapper functions around system calls.
// glibc's write() wrapper (simplified)
ssize_t write(int fd, const void *buf, size_t count) {
    ssize_t result;
    asm volatile (
        "syscall"
        : "=a" (result)           // rax: return value
        : "0" (1),                // rax = 1 (sys_write number)
          "D" (fd),               // rdi = fd
          "S" (buf),              // rsi = buf
          "d" (count)             // rdx = count
        : "rcx", "r11", "memory"  // syscall clobbers rcx and r11
    );
    return result;
}
Why use wrappers?
- A normal C function interface: no assembly, no knowledge of register conventions
- Error handling: the kernel returns a negative value on failure, which the wrapper converts to -1 and stores in errno
- Portability: the same function name works across architectures with different syscall conventions

Curious which system calls your program makes? Use strace.
strace ls
Output:
execve("/bin/ls", ["ls"], 0x7ffd...) = 0
brk(NULL) = 0x55a1b2000000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=123456, ...}) = 0
mmap(NULL, 123456, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f8a...
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3...", 832) = 832
...
write(1, "file1.txt\nfile2.txt\n", 20) = 20
exit_group(0) = ?
Each line is a system call invocation. Running just the ls command triggers dozens of system calls.
To trace specific system calls only:
strace -e trace=open,read,write cat file.txt
You can even trace Node.js applications:
strace -e trace=read,write node app.js
That's how I first discovered that fs.readFile() internally calls openat(), fstat(), read(), and close() in sequence.
What happens when you call fs.readFile() in Node.js?
const fs = require('fs');
fs.readFile('/tmp/data.txt', 'utf8', (err, data) => {
console.log(data);
});
Internal flow:
1. JavaScript calls fs.readFile()
2. Node.js's C++ binding layer (binding.cc) takes over
3. libc wrappers are invoked: open(), fstat(), read(), close()
4. Each one executes the syscall instruction and enters the kernel
5. The kernel runs sys_openat(), sys_read(), etc. for the actual file I/O

In other words, a simple fs.readFile() traverses multiple layers and ultimately resolves to kernel system calls. This process involves at least 4 User Mode ↔ Kernel Mode switches (open, fstat, read, close).
System calls aren't free. Every User Mode to Kernel Mode transition involves saving and restoring CPU state (registers, stack pointer), switching privilege levels, and potentially evicting useful entries from CPU caches and the TLB.
Typically, a single system call takes hundreds of nanoseconds. That's 100+ times slower than a function call (a few nanoseconds).
That's why high-performance applications minimize system calls.
Bad example:
for (int i = 0; i < 1000000; i++) {
write(fd, &data[i], 1); // Write 1 byte at a time → 1 million syscalls
}
Good example:
write(fd, data, 1000000); // Write all at once → 1 syscall
This is exactly why buffering matters.
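Here's a minimal sketch of that idea: collect small writes in a userspace buffer and flush them with a single write() when the buffer fills. (stdio's fwrite() does essentially this for you; the helper names below are my own.)

```c
#include <string.h>  // memcpy
#include <unistd.h>  // write

#define BUF_SIZE 4096

static char buf[BUF_SIZE];
static size_t used = 0;

// Flush the accumulated bytes with one write() system call
static void flush_buf(int fd) {
    if (used > 0) {
        write(fd, buf, used);
        used = 0;
    }
}

// Append to the buffer (assumes len <= BUF_SIZE);
// only crosses into the kernel when the buffer fills
static void buffered_write(int fd, const char *data, size_t len) {
    if (used + len > BUF_SIZE)
        flush_buf(fd);
    memcpy(buf + used, data, len);
    used += len;
}

int main(void) {
    for (int i = 0; i < 1000000; i++)
        buffered_write(1, "x", 1);  // ~245 syscalls instead of 1,000,000
    flush_buf(1);                   // don't forget the tail
    return 0;
}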
Linux introduced vDSO (virtual Dynamic Shared Object) to reduce overhead for frequently used system calls.
vDSO is a small shared library mapped by the kernel into userspace. It contains implementations of lightweight system calls like gettimeofday(), clock_gettime(), getcpu().
These functions execute directly in userspace without kernel mode switching. The kernel periodically updates data like current time in a memory region, and user programs just read it.
Regular system call:
User Mode → syscall → Kernel Mode → sysret → User Mode
vDSO system call:
User Mode → read memory → User Mode (no switching!)
The speed difference is 10x or more.
Windows doesn't follow POSIX and has its own system call architecture.
For example, reading a file:
// Win32 API
HANDLE hFile = CreateFile("file.txt", GENERIC_READ, ...);
DWORD bytesRead;
ReadFile(hFile, buffer, 100, &bytesRead, NULL);
CloseHandle(hFile);
Internally:
1. ReadFile() calls NtReadFile() in ntdll.dll (the Native API)
2. The syscall instruction enters the kernel
3. The kernel executes its NtReadFile() handler

Windows doesn't publicly document system call numbers, and they can change between versions, so direct invocation is discouraged. Always go through the Win32 API.
When I first used fs.readFile(), I just thought "it's a function that reads files." But underneath, there are multiple layers of abstraction stacked on top of each other: the Node.js binding layer, libc wrappers, the syscall instruction, and the kernel's handlers.
Once I understood this, I started seeing why bottlenecks happen and where optimization is possible: reading 100 files separately versus batching them, or cutting system call counts with buffering.
System calls aren't just "how to talk to the kernel." They're the first gateway to understanding how your code actually moves hardware. And once you pass through that gateway, 90% of performance issues become explainable.