COMS W4118 Operating Systems I

Journaling

The Consistent Update Problem

Motivation: non-atomic updates

Recall FFS disk layout:

[Figure: ffs-layout]

Writes require several steps:

Update inode/block bitmaps
Update inode
Update data blocks

What if the system crashes between any of these steps? The disk only guarantees atomic writes one sector at a time, and the data structures being updated may not all live in the same sector. This is like a race condition in a concurrent program, except that we can’t keep a crash out with a lock :(

Example: ext2 file creation

ext2’s design closely follows FFS. Say that we currently have a small ext2 partition with just one empty directory in it.

In order to create a new empty file foo in the filesystem, we read three pieces of the filesystem into memory and mutate them:

inode bitmap: allocate a new inode for foo
inode: populate an inode for foo in the inode array
data block: add a dentry for foo in the directory’s data
no need to allocate a data block for empty file foo yet

[Figure: ext2-foo]

Let’s analyze possible crash scenarios. Define B, I, D as follows:

inode bitmap update (B)
add inode for foo (I)
add dentry for foo to dir data block (D)

And define the contents for these updates as follows:

B = 01000   ---> B' = 01010
I = garbage ---> I' = initialized
D = {., ..} ---> D' = {., .., foo}

Any subset can be written out to disk!

B   I   D  --->  Consistent (new data lost)

B'  I   D  --->  Inconsistency! Bitmap says I was allocated,
                 but no one is using it (leak)

B   I'  D  --->  As if nothing happened! We wrote to the inode,
                 but the bitmap still says it's free

B   I   D' --->  SERIOUS PROBLEMS: dentry exists, but points to garbage inode.
                 bitmap says that inode is free, can be taken by another file.

B'  I'  D  --->  Inconsistency! Bitmap says I was allocated, and we wrote to I,
                 but no one uses I.

B'  I   D' --->  MOST SERIOUS PROBLEM! FS is consistent according to bitmap and
                 dentry, but inode has garbage data.

B   I'  D' --->  Inconsistency! Dentry refers to valid I, but bitmap says I is free.
                 I can be taken by another file.

B'  I'  D' --->  Consistent (new data persisted)
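
The table above can be reproduced mechanically. The following is a minimal sketch, assuming we model each update as a boolean ("did it reach disk?"); the function names and verdict strings are made up for illustration and are not part of ext2.

```python
from itertools import product

def classify(bitmap_written, inode_written, dentry_written):
    """Classify the on-disk state after a crash, given which updates landed."""
    allocated = bitmap_written    # bitmap marks foo's inode as in use
    initialized = inode_written   # inode holds valid contents
    referenced = dentry_written   # directory has a dentry pointing at the inode

    if referenced and not initialized:
        return "dentry points to a garbage inode"
    if referenced and not allocated:
        return "dentry points to an inode the bitmap says is free"
    if allocated and not referenced:
        return "inode leaked: allocated but unreferenced"
    return "consistent"

def enumerate_crash_states():
    """Return {state label: verdict} for all 8 subsets of {B, I, D}."""
    results = {}
    for b, i, d in product([False, True], repeat=3):
        label = " ".join(name + ("'" if written else "")
                         for name, written in zip("BID", (b, i, d)))
        results[label] = classify(b, i, d)
    return results

if __name__ == "__main__":
    for label, verdict in enumerate_crash_states().items():
        print(f"{label:<10} -> {verdict}")
```

Only three of the eight states come out consistent: nothing written, only the inode written, or everything written.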

In these crash scenarios, data loss isn’t the primary concern – we care more about filesystem consistency. Ruining data structures makes the fs unusable!

fsck: file system consistency check

In the old days: reboot after a crash and scan the entire disk to make the fs consistent.
Disadvantages:
- slow to scan a large disk
- cannot correctly fix all crash scenarios (e.g. B' I D': bitmap and dentry agree, so nothing flags the garbage inode).
- no well-defined consistency (e.g. what do we do for B I D'?)
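
To make these disadvantages concrete, here is a toy fsck-style pass. It is only a sketch under simplifying assumptions: the function `toy_fsck`, the flat `dentries` dict, and the repair policy are all hypothetical, and a real fsck does far more than this.

```python
def toy_fsck(bitmap, dentries, root_ino=1):
    """Reconcile an inode bitmap with directory entries.

    bitmap   -- list of bools, True if the inode is marked allocated
    dentries -- dict mapping file name -> inode number
    Returns (repaired_bitmap, list of human-readable fixes).
    """
    # The root directory's inode is always in use, even with no dentries.
    referenced = {root_ino} | set(dentries.values())
    repaired = list(bitmap)
    fixes = []
    for ino, allocated in enumerate(bitmap):  # scans every inode: slow on big disks
        if allocated and ino not in referenced:
            repaired[ino] = False
            fixes.append(f"freed leaked inode {ino}")
        elif not allocated and ino in referenced:
            # Arbitrary policy choice: trust the dentry over the bitmap.
            repaired[ino] = True
            fixes.append(f"marked referenced inode {ino} allocated")
    return repaired, fixes

if __name__ == "__main__":
    # B' I D: bitmap 01010, but no dentry for foo -> leak gets freed
    print(toy_fsck([False, True, False, True, False], {}))
    # B' I D': bitmap and dentry agree -> nothing to fix, garbage inode survives
    print(toy_fsck([False, True, False, True, False], {"foo": 3}))
```

Note that the pass looks at every inode, that it reports nothing for B' I D', and that "trust the dentry" for B I D' is just one of several defensible choices.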

Journaling

Concept: write-ahead logging

Persistently write intent to log/journal, then update filesystem

crash before intent is committed: noop
crash after intent is committed: replay op

Better than fsck:

no need to scan entire disk
well-defined consistency

Example: ext3 physical journaling

Let’s first write all of the block updates to the journal and then update the file system:

Write dirty blocks to journal as one transaction
Write commit record (finalize journal entry)
Write dirty blocks to real file system
Reclaim journal space for transaction (we don’t need it anymore)

[Figure: ext3-journal]

If you crash after committing to the log, just replay changes from the log!
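
The four steps and the replay rule can be sketched with an in-memory model. This is a simplified illustration, not ext3's on-disk format: `ToyDisk`, the record tuples, and the function names are all assumptions made for this example.

```python
class ToyDisk:
    """In-memory stand-in for a disk: named FS blocks plus a journal area."""
    def __init__(self, blocks):
        self.blocks = dict(blocks)  # block name -> contents (the real FS)
        self.journal = []           # ("block", txid, name, contents) or ("commit", txid)

def log_transaction(disk, txid, updates, commit=True):
    """Steps 1-2: journal every dirty block, then write the commit record."""
    for name, contents in updates.items():
        disk.journal.append(("block", txid, name, contents))
    if commit:  # commit=False simulates a crash before the commit record
        disk.journal.append(("commit", txid))

def checkpoint(disk, txid):
    """Steps 3-4: copy journaled blocks into place, then reclaim journal space."""
    for kind, tx, *rest in disk.journal:
        if kind == "block" and tx == txid:
            name, contents = rest
            disk.blocks[name] = contents
    disk.journal = [rec for rec in disk.journal if rec[1] != txid]

def recover(disk):
    """After a crash: replay committed transactions, discard uncommitted ones."""
    committed = {rec[1] for rec in disk.journal if rec[0] == "commit"}
    for txid in sorted(committed):
        checkpoint(disk, txid)
    disk.journal.clear()

if __name__ == "__main__":
    old = {"B": "01000", "I": "garbage", "D": "{., ..}"}
    new = {"B": "01010", "I": "initialized", "D": "{., .., foo}"}
    disk = ToyDisk(old)
    log_transaction(disk, 1, new)  # crash right after the commit record...
    recover(disk)                  # ...so recovery replays the transaction
    print(disk.blocks)
```

Replay is safe to repeat: writing the same block contents twice leaves the file system in the same state, so a crash during recovery just means recovering again.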

This was an example of physical journaling, as opposed to logical journaling, where we write a logical record of the operation to the log:

Logical journaling is complex to implement, but may be faster and save disk space

Journaling Write Orders

Journal writes, then FS writes
- otherwise, crash will leave FS inconsistent but no journal record to patch it up
FS writes, then reclaim journal space
- otherwise, if you crash before you finish the FS write, the journal record to patch it up will already be gone!
Journal writes, then commit record, then FS writes
- we need the commit record to tell us that we journaled the entirety of the change. Otherwise, the journal may have garbage in it!
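
A user-space analogy for these ordering rules, using `os.fsync` as the barrier between steps. This is a sketch only: the two-file layout, the `COMMIT` marker, and the function names are assumptions for illustration, and a real file system enforces ordering at the block layer rather than with files.

```python
import os
import tempfile

def fsync_write(path, data, mode):
    """Write data and block until the OS reports it has reached stable storage."""
    with open(path, mode) as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

def journaled_update(journal_path, fs_path, payload):
    # 1. Journal write must be durable before the commit record.
    fsync_write(journal_path, payload, "ab")
    # 2. Commit record marks the journal entry as complete.
    fsync_write(journal_path, b"\nCOMMIT\n", "ab")
    # 3. Only now is it safe to touch the real file system.
    fsync_write(fs_path, payload, "wb")
    # 4. FS write is durable, so the journal space can be reclaimed.
    fsync_write(journal_path, b"", "wb")

if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    journal = os.path.join(workdir, "journal")
    fs = os.path.join(workdir, "fs")
    journaled_update(journal, fs, b"B' I' D'")
    print(open(fs, "rb").read(), os.path.getsize(journal))
```

Dropping any of the `fsync` calls lets writes be reordered across steps, which brings back exactly the failure cases listed above.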

ext3 Journaling Modes

Motivation: journaling is expensive. Every FS write turns into two disk writes (journal, then FS) and two seeks. Balance consistency and performance…

Data journaling: journal all writes, including file data
- Problem: expensive to journal data
Metadata journaling: journal only metadata
- Used by most FS (IBM JFS, SGI XFS, NTFS)
- Problem: after a crash, file may contain garbage data (metadata can point to blocks whose data never got written)
Ordered mode: write file data to FS first, then journal metadata
- Default mode for ext3
- Problem: if we crash before journaling the metadata, we end up with old file metadata and new file data, yet the journal says everything is OK
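
The trade-off between the three modes can be summarized in a small table-driven sketch. The `ModePolicy` fields and `journal_bytes` are hypothetical names for this example, and the cost model ignores commit records and other journal overhead. (In ext3 the three modes correspond to the `data=journal`, `data=writeback`, and `data=ordered` mount options.)

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModePolicy:
    journals_data: bool         # does file data go through the journal?
    data_before_metadata: bool  # is data forced to the FS before metadata is journaled?

POLICIES = {
    "data":     ModePolicy(journals_data=True,  data_before_metadata=False),
    "metadata": ModePolicy(journals_data=False, data_before_metadata=False),
    "ordered":  ModePolicy(journals_data=False, data_before_metadata=True),
}

def journal_bytes(mode, data_bytes, metadata_bytes):
    """Bytes written to the journal for one update (ignoring journal overhead)."""
    policy = POLICIES[mode]
    return metadata_bytes + (data_bytes if policy.journals_data else 0)

if __name__ == "__main__":
    # Hypothetical update: 100 data blocks and 3 metadata blocks of 4 KiB each
    for mode in POLICIES:
        print(mode, journal_bytes(mode, 100 * 4096, 3 * 4096))
```

Under this model, data journaling writes the file data twice, while the other two modes only pay the journaling cost for metadata.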

Last updated: 2023-04-08