Journaling File System: Safe Writing
Prologue: Where Did My File Go?
Physical server environments have a well-known vulnerability: unexpected power loss. Datacenter incident reports follow a familiar pattern. The UPS batteries drain, servers die mid-operation, and on the next boot the filesystem is corrupted. The machine won't even start. Backups help, but any work in progress since the last checkpoint is gone.
The question is: "Isn't saving a file just... writing to disk? Why does a power loss corrupt everything?" It turns out saving a single file involves multiple steps across different disk locations. Interrupt the process midway, and you get inconsistent state. The solution modern filesystems use is called journaling, borrowed directly from database transaction logs.
Journaling is essentially Write-Ahead Logging (WAL) applied to filesystems. Before doing the actual work, you write a log saying "I'm about to do this." It's like making a photocopy of a contract before signing the original. If something goes wrong, you have a record of what you were trying to do.
Struggle: File Writes Are Not Atomic
When I first learned about filesystems, I naively thought write() was atomic. Just call it once, done. But reality is messier.
Creating a file and writing data involves several discrete steps:
1. Update directory entry: Add filename and inode number to directory file
2. Allocate inode: Write file size, permissions, timestamps, data block pointers
3. Allocate data blocks: Write actual file contents to disk blocks
4. Update bitmaps: Mark blocks and inodes as used in allocation bitmaps
5. Update superblock: Update filesystem-wide statistics
Each of these is a separate disk write. Disks generally write a single sector atomically, but file creation spans many sectors in different locations. If power dies during step 3, you get:
- Directory pointing to a new file
- Inode pointing to garbage data blocks
- Bitmap showing blocks as free when they're actually used
Contradictory state. Filesystem corruption.
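To make the failure concrete, here is a minimal sketch that models the five writes as five separate updates to an in-memory "disk" and cuts the power before the third one. Everything in it (the dict layout, `create_file`, `crash_after`, inode 7, block 42) is a made-up illustration, not how any real filesystem lays things out.

```python
class Crash(Exception):
    """Raised to simulate power loss between two disk writes."""


def new_disk():
    # Hypothetical in-memory stand-in for the on-disk structures.
    return {"directory": {}, "inodes": {}, "blocks": {}, "used_blocks": set(),
            "superblock": {"free": 100}}


def create_file(disk, name, contents, crash_after=None):
    """Run the five writes of a file creation; stop once `crash_after` steps are done."""
    steps = [
        lambda: disk["directory"].update({name: 7}),           # 1. directory entry -> inode 7
        lambda: disk["inodes"].update({7: {"blocks": [42]}}),  # 2. inode -> data block 42
        lambda: disk["blocks"].update({42: contents}),         # 3. file contents
        lambda: disk["used_blocks"].add(42),                   # 4. allocation bitmap
        lambda: disk["superblock"].update(free=disk["superblock"]["free"] - 1),  # 5. statistics
    ]
    for done, step in enumerate(steps):
        if crash_after is not None and done == crash_after:
            raise Crash(f"power lost after step {done}")
        step()


disk = new_disk()
try:
    create_file(disk, "report.txt", "hello", crash_after=2)  # steps 1-2 land, step 3 never does
except Crash:
    pass

inode = disk["inodes"][disk["directory"]["report.txt"]]
print("directory entry exists:", "report.txt" in disk["directory"])      # True
print("data block written:   ", inode["blocks"][0] in disk["blocks"])     # False
print("bitmap marks it used: ", inode["blocks"][0] in disk["used_blocks"])  # False
```

The directory and inode say the file exists, while the data block and bitmap say it doesn't. That disagreement is exactly the corruption described above.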
Old filesystems such as ext2 had one answer: fsck (File System Check), and FAT32 relied on equivalent tools like chkdsk. The tool scans the entire disk, checking consistency between inodes, bitmaps, and directory structures. The problem? It's slow. Checking a 1TB disk after an unclean shutdown could take hours before the system finished booting.
# Run fsck on an ext2 filesystem (runs automatically at boot after an unclean shutdown)
# The filesystem must be unmounted; time scales with disk size and can reach hours
sudo fsck.ext2 /dev/sda1
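Why is it slow? A rough sketch of the idea, under the assumption that the disk is the same kind of toy dict (`directory`, `inodes`, `used_blocks` are hypothetical names): the checker has no record of what was in flight, so it has to walk every structure and cross-check it against the others. Real fsck does far more than this, but the cost has the same shape.

```python
def toy_fsck(disk):
    """Cross-check directory, inodes, and bitmap; return a list of problems found."""
    problems = []
    referenced = set()
    for name, ino in disk["directory"].items():        # visits every directory entry
        if ino not in disk["inodes"]:
            problems.append(f"{name}: directory entry points to missing inode {ino}")
    for ino, inode in disk["inodes"].items():          # visits every inode
        for blk in inode["blocks"]:
            referenced.add(blk)
            if blk not in disk["used_blocks"]:
                problems.append(f"inode {ino}: block {blk} is in use but marked free")
    for blk in disk["used_blocks"] - referenced:       # visits every allocated block
        problems.append(f"block {blk}: marked used but no inode references it")
    return problems


# A hypothetical post-crash image with three different inconsistencies.
disk = {
    "directory": {"report.txt": 7, "ghost.txt": 9},
    "inodes": {7: {"blocks": [42]}},
    "used_blocks": {13},
}
for problem in toy_fsck(disk):
    print(problem)
```

Every loop runs over the whole filesystem, so the work grows with the number of files and blocks rather than with the amount of damage.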
I felt frustrated when I first encountered this. Filesystems seemed fundamentally fragile. But then I noticed: modern filesystems like ext4 and NTFS boot in seconds, even after crashes. How?
Aha Moment: Just Keep a Diary
One day I was reading about databases and stumbled on Write-Ahead Logging (WAL). Before changing data, databases write a log describing what they're about to change. If a crash happens mid-transaction, they replay the log to recover.
The lightbulb moment: filesystems can do the same thing. Before performing file operations, write them to a journal - a special disk area that acts like a diary of "work I'm about to do."
Here's the journaling workflow:
1. Write transaction start to journal: "Transaction #1234 starting"
2. Write changes to journal: "Will modify inode 456 like this", "Will write this data to block 789"
3. Write commit record to journal: "Transaction #1234 committed"
4. Apply changes to actual locations: Now write to real inodes and data blocks
5. Checkpoint: Delete journal entries (no longer needed)
If power dies after step 3 but during step 4? On reboot, the filesystem reads the journal. "Ah, transaction #1234 committed but wasn't applied yet." It replays the journal entries (redo). If power died during step 2, before the commit? No commit record exists, so the incomplete transaction is simply discarded. Nothing had touched the real locations yet, so there is nothing to undo.
The genius part: recovery time is independent of disk size. fsck scans the entire disk, but journaling only reads the journal (typically hundreds of MB). Recovery takes seconds, not hours.
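# Show ext4 journal information from the superblock (-h skips the per-group listing)
sudo dumpe2fs -h /dev/sda1 | grep -i journal
# Example output (exact fields vary by e2fsprogs version and filesystem):
# Journal inode: 8
# Journal backup: inode blocks
# Journal size: 128M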
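The five-step workflow and the recovery rule fit in a few lines of code. This is a toy model I'm assuming for illustration (`ToyJournalFS`, its method names, and the tuple record format are all invented), not the ext4/jbd2 implementation.

```python
class ToyJournalFS:
    """In-memory stand-in: `disk` is the real locations, `journal` is the log area."""

    def __init__(self):
        self.disk = {}      # block number -> contents (final locations)
        self.journal = []   # append-only list of log records

    def log_transaction(self, txid, updates, commit=True):
        """Steps 1-3: write start, the intended changes, and (optionally) the commit record."""
        self.journal.append(("start", txid))
        for block, data in updates.items():
            self.journal.append(("update", txid, block, data))
        if commit:
            self.journal.append(("commit", txid))

    def apply_committed(self):
        """Step 4, and also recovery: redo every update of a committed transaction."""
        committed = {rec[1] for rec in self.journal if rec[0] == "commit"}
        for rec in self.journal:
            if rec[0] == "update" and rec[1] in committed:
                _, _, block, data = rec
                self.disk[block] = data   # replaying twice is harmless (idempotent)

    def checkpoint(self):
        """Step 5: entries are no longer needed once the changes are in place."""
        self.journal.clear()

    def recover(self):
        """After a crash: redo committed transactions, drop the rest."""
        self.apply_committed()
        self.checkpoint()


fs = ToyJournalFS()
fs.log_transaction(1234, {456: "new inode", 789: "file data"})       # crash after commit, before step 4
fs.log_transaction(1235, {456: "half-written inode"}, commit=False)  # crash before commit
fs.recover()
print(fs.disk)   # {456: 'new inode', 789: 'file data'}
```

Note that `recover` only loops over `self.journal`. It never looks at the rest of the disk, which is why recovery time tracks journal size rather than disk size.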
Deep Dive: Journaling Modes and Internals
As I dug deeper, I learned there are different flavors of journaling. ext4 supports three modes:
1. Journal Mode (Safest, Slowest)
Logs both metadata and actual data to the journal. You write data twice: once to journal, once to final location. Maximum safety, but performance hit.
# Mount ext4 in journal mode
sudo mount -o data=journal /dev/sda1 /mnt
2. Ordered Mode (Default, Balanced)
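Logs only metadata to the journal, not data. But it enforces ordering: data blocks are written to their final location first, and only then is the metadata committed to the journal. This ensures committed metadata never points to blocks that were not yet written. After recovery, files are structurally consistent, although writes from the last moments before the crash can still be lost.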
# Mount ext4 in ordered mode (default)
sudo mount -o data=ordered /dev/sda1 /mnt
3. Writeback Mode (Fastest, Less Safe)
Only logs metadata, with no ordering guarantees between data and metadata writes. Data might reach the disk after the metadata that refers to it. After a crash, recovered metadata can point to blocks holding stale or garbage contents. Best performance, worst safety.
# Mount ext4 in writeback mode
sudo mount -o data=writeback /dev/sda1 /mnt
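The difference between the three modes comes down to the order in which writes hit the disk. Here is a sketch that encodes those orders as lists and checks every possible crash point. The step labels (`journal:data`, `disk:data`, and so on) and both function names are my own simplification, and real ext4 batches many operations per transaction.

```python
def write_sequences(mode):
    """Return the possible orders of physical writes for one file update."""
    if mode == "journal":
        # Data and metadata both go through the journal, then to their final locations.
        return [["journal:data", "journal:metadata", "journal:commit",
                 "disk:data", "disk:metadata"]]
    if mode == "ordered":
        # Data goes straight to its final location, strictly before the metadata commit.
        return [["disk:data", "journal:metadata", "journal:commit", "disk:metadata"]]
    if mode == "writeback":
        # No ordering between data and the metadata commit: either can land first.
        return [
            ["disk:data", "journal:metadata", "journal:commit", "disk:metadata"],
            ["journal:metadata", "journal:commit", "disk:data", "disk:metadata"],
        ]
    raise ValueError(f"unknown mode: {mode}")


def can_expose_garbage(mode):
    """True if some crash point leaves committed metadata with no data behind it."""
    for sequence in write_sequences(mode):
        for crash_point in range(len(sequence) + 1):
            done = set(sequence[:crash_point])
            committed = "journal:commit" in done
            data_recoverable = "disk:data" in done or "journal:data" in done
            if committed and not data_recoverable:
                return True
    return False


for mode in ("journal", "ordered", "writeback"):
    print(mode, "-> garbage possible after crash:", can_expose_garbage(mode))
```

Only writeback has a crash window where the commit record exists but the data it describes is nowhere on disk.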
Journal Structure
The journal operates like a circular buffer. When full, it overwrites the oldest entries (already checkpointed). Each journal entry looks roughly like this:
// Simplified, illustrative journal transaction layout (not the real jbd2 on-disk format)
#include <stdint.h>

struct block_update {
    uint32_t block_number;         // Which block to modify
    char     data[4096];           // New block contents (full copy)
};

struct journal_transaction {
    uint32_t transaction_id;       // Unique transaction ID
    uint32_t sequence_num;         // Sequence number in journal
    uint32_t num_blocks;           // Number of blocks to modify
    uint32_t commit_record;        // Commit marker (nonzero = committed); on disk it is written last
    struct block_update blocks[];  // The actual changes (a flexible array member must come last in C)
};
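The circular-buffer behavior mentioned above can be sketched like this. `CircularJournal`, `head`, `tail`, and `live` are names I'm assuming for the example. The one rule that matters is that a slot can only be reused after its record has been checkpointed.

```python
class CircularJournal:
    """Fixed-size log: `head` is the next slot to write, `tail` is the oldest live record."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0    # next slot to write
        self.tail = 0    # oldest record not yet checkpointed
        self.live = 0    # number of records still needed for recovery

    def append(self, record):
        if self.live == len(self.slots):
            raise RuntimeError("journal full: checkpoint before logging more")
        self.slots[self.head] = record
        self.head = (self.head + 1) % len(self.slots)   # wrap around at the end
        self.live += 1

    def checkpoint(self, count):
        """Mark the `count` oldest records as applied, so their slots can be reused."""
        count = min(count, self.live)
        self.tail = (self.tail + count) % len(self.slots)
        self.live -= count


journal = CircularJournal(capacity=4)
for txid in range(1, 5):
    journal.append(f"tx{txid}")
journal.checkpoint(2)    # tx1 and tx2 are now safely in their final locations
journal.append("tx5")    # wraps around and reuses tx1's slot
print(journal.slots)     # ['tx5', 'tx2', 'tx3', 'tx4']
```

The stale `tx2` entry is still physically there, but it sits outside the live range between `tail` and `head`, so recovery would skip it.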
NTFS uses similar journaling via a special file called $LogFile. Unlike ext4 in data=journal mode, NTFS only journals metadata, never file data.
Redo vs Undo Logging
There are two recovery strategies:
- Redo Logging: Replay committed transactions. "You wanted this change, let me finish it."
- Undo Logging: Roll back uncommitted transactions. "This wasn't finished, let me reverse it."
Most journaling filesystems use redo logging because it's simpler: just check if a commit record exists.
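A small side-by-side sketch of the two strategies, assuming a made-up log format of `("write", txid, block, value)` and `("commit", txid)` tuples. The key difference is what the log stores: a redo log keeps the new value, an undo log keeps the old one.

```python
def recover_redo(disk, log):
    """Redo log stores NEW values; nothing reached disk before commit, so replay committed writes."""
    committed = {txid for kind, txid, *_ in log if kind == "commit"}
    for kind, txid, *rest in log:
        if kind == "write" and txid in committed:
            block, new_value = rest
            disk[block] = new_value
    return disk


def recover_undo(disk, log):
    """Undo log stores OLD values; writes may have hit disk early, so reverse uncommitted ones."""
    committed = {txid for kind, txid, *_ in log if kind == "commit"}
    for kind, txid, *rest in reversed(log):   # newest first
        if kind == "write" and txid not in committed:
            block, old_value = rest
            disk[block] = old_value
    return disk


# Transaction 10 committed; transaction 11 was interrupted.
redo_log = [("write", 10, 1, "A"), ("commit", 10), ("write", 11, 2, "B")]
undo_log = [("write", 10, 1, "old1"), ("commit", 10), ("write", 11, 2, "old2")]

redo_result = recover_redo({1: "old1", 2: "old2"}, redo_log)  # disk untouched at crash time
undo_result = recover_undo({1: "A", 2: "B"}, undo_log)        # disk already modified in place
print(redo_result)  # {1: 'A', 2: 'old2'}
print(undo_result)  # {1: 'A', 2: 'old2'}
```

Both end in the same state: the committed change is kept and the interrupted one is gone. Redo just gets there without ever having to know the old contents.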
Copy-on-Write: Alternative to Journaling
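Newer filesystems like ZFS and Btrfs use Copy-on-Write (COW) instead of journaling. Instead of modifying data in place, they write new copies and atomically update pointers. No journal is needed for consistency, because the old version stays valid until the pointer switches. (ZFS does keep a separate intent log, the ZIL, but that is for replaying recent synchronous writes, not for repairing on-disk structure.)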
# Create ZFS filesystem (COW-based)
# WARNING: zpool create takes over /dev/sdb and destroys any existing data on it
sudo zpool create mypool /dev/sdb
sudo zfs create mypool/data
COW makes snapshots and clones cheap, because they share unchanged blocks, but it can cause fragmentation.
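Here is a minimal sketch of the copy-on-write idea, assuming a toy store where blocks are never overwritten and a single `root` pointer names the current version. `CowStore` and its methods are invented for illustration. ZFS and Btrfs use full trees of blocks, not a flat map.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten; `root` is the only mutable pointer."""

    def __init__(self):
        self.blocks = {}              # block id -> contents, written once
        self.next_id = 0
        self.root = self._write({})   # root block maps file name -> data block id

    def _write(self, contents):
        """Always allocate a fresh block; existing blocks stay untouched."""
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = contents
        return block_id

    def put(self, name, data):
        data_id = self._write(data)                          # 1. new copy of the data
        new_index = {**self.blocks[self.root], name: data_id}
        new_root = self._write(new_index)                    # 2. new copy of the index
        self.root = new_root                                 # 3. single pointer switch

    def get(self, name, root=None):
        index = self.blocks[self.root if root is None else root]
        return self.blocks[index[name]]

    def snapshot(self):
        """A snapshot is just a remembered root pointer; nothing is copied."""
        return self.root


store = CowStore()
store.put("notes.txt", "version 1")
snap = store.snapshot()
store.put("notes.txt", "version 2")
print(store.get("notes.txt"))             # version 2
print(store.get("notes.txt", root=snap))  # version 1
```

A crash anywhere before step 3 leaves `root` pointing at the old, fully valid version, and the half-written new blocks are simply unreferenced. The snapshot costs nothing because the old blocks were never going to be overwritten anyway.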
Application: Server Reliability
Now when I set up production servers, I always consider journaling modes. Most Linux distros default to ext4 ordered mode, but for critical workloads like databases, I consider journal mode.
Interestingly, databases have their own WAL, so you get double logging: filesystem journal + database WAL. Some performance hit, but maximum safety. PostgreSQL keeps its WAL in pg_wal/, and MySQL's InnoDB keeps redo logs in ib_logfile* (or in the #innodb_redo/ directory from MySQL 8.0.30 on).
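# Check PostgreSQL WAL directory (Debian/Ubuntu layout for version 14; adjust the path for your install)
ls -lh /var/lib/postgresql/14/main/pg_wal/
# Contains WAL segment files (16MB each by default)
# Check MySQL redo logs (older versions; MySQL 8.0.30+ keeps them under /var/lib/mysql/#innodb_redo/)
ls -lh /var/lib/mysql/ib_logfile*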
In cloud environments, block storage services such as EBS add their own durability mechanisms underneath the filesystem. Modern systems protect writes at multiple layers.
The speed difference between fsck and journal recovery is massive. A full fsck of a corrupted 1TB ext2 disk can take around 6 hours. After migrating to ext4, a post-crash boot takes around 10 seconds, because only the journal is replayed.
One practical counterpoint: because databases handle consistency themselves, some tuning guides suggest relaxing filesystem journaling on the database volume (for example, data=writeback rather than data=journal). The argument is that the double-write overhead isn't worth it when the database's own WAL already provides ACID guarantees. It is a tradeoff to benchmark rather than a rule, and for general-purpose servers I keep ordered mode enabled.
Another lesson: journaling isn't free. The journal area uses disk space (usually 128MB-1GB), and writing to the journal before the final location adds latency. For write-heavy workloads this can cost roughly 5-10% of throughput, though the exact figure depends on the workload and the journaling mode. But the tradeoff is worth it: crash recovery that takes seconds instead of hours means better uptime.
I've also learned to monitor journal health. A corrupted journal is bad news. On ext4, you can check journal status:
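# Check filesystem state and journal-related fields
sudo tune2fs -l /dev/sda1 | grep -iE 'state|features|journal'
# Look for "Filesystem state: clean"; a needs_recovery flag under
# "Filesystem features" means the journal still has to be replayed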
If the journal itself gets corrupted, you're back to full fsck. Thankfully, journals are small and written sequentially, so they rarely fail.
Closing: Logging Before Acting
Journaling taught me a fundamental systems design principle: log your intent before taking action. This simple idea revolutionized filesystem reliability.
As a founder, the lesson I took away is that complexity is worth it for resilience. Journaling makes writes slightly slower, but recovery goes from hours to seconds. That's a massive win for business continuity.
Even though I mostly use cloud storage now, the journaling concept remains relevant. We apply it everywhere: database design, distributed system logging, even application-level operations. The pattern of "log before acting, recover from logs" is universal.
Understanding journaling also made me more confident deploying systems. I know that unexpected power loss won't leave the filesystem in an inconsistent state, because modern filesystems keep careful diaries. (The last few unsynced writes can still be lost, which is what fsync and backups are for.) It's one less thing to worry about, and in a startup, reducing operational anxiety is valuable.
The filesystem's diary habit became my own habit: always log what you're about to do, especially in production systems. It's saved me countless times.