Lecture 21: Filesystems: ext4, btrfs, F2FS & overlayfs¶
Overview¶
A filesystem is the layer between your application and raw storage. It is easy to treat filesystems as invisible infrastructure — you open a file, write data, close it, and assume the data persists. But the choice of filesystem has direct, measurable consequences for AI hardware systems: it determines how fast camera frames land on storage, how long a reboot takes after a crash, whether a failed OTA update bricks a device or safely rolls back, and how long an eMMC chip lasts before wearing out.
The mental model to carry through this lecture is the layered storage stack: applications talk to the VFS (Virtual File System) layer, which provides a uniform API. Below VFS sits the actual filesystem implementation (ext4, btrfs, F2FS). Below that is the block layer, which routes I/O to the physical device. Each layer makes decisions that affect the layers above it.
AI hardware engineers need to understand filesystems because edge AI devices (dashcams, robotics controllers) write continuously to flash storage, which wears out. OTA update reliability depends on filesystem snapshot and rollback capabilities. Containerized inference deployments use overlayfs for immutable base images. And the IPC mechanism that ties together all of openpilot's processes runs on a RAM-backed filesystem (tmpfs).
Filesystem Role and VFS¶
A filesystem organizes files into directories, provides persistence across power cycles, manages free space, and ensures metadata consistency after crashes.
VFS (Virtual File System)¶
Linux VFS is an abstraction layer that presents a uniform API regardless of the underlying filesystem implementation.
The layered architecture looks like this:
Application
│ open() read() write() mmap() fsync()
▼
┌─────────────────────────────────────────┐
│ VFS (Virtual File System) │
│ struct file_operations dispatch table │
│ dentry cache (name → inode) │
│ inode cache (per-file metadata) │
└──────┬──────────┬──────────┬────────────┘
│ │ │
┌────▼───┐ ┌───▼───┐ ┌───▼────┐
│ ext4 │ │ btrfs │ │ F2FS │ ... (any registered filesystem)
└────┬───┘ └───┬───┘ └───┬────┘
└─────────┴──────────┘
│
┌────────────▼────────────┐
│ Block Layer │
│ blk-mq, I/O scheduler │
└────────────┬────────────┘
│
┌────────────▼────────────┐
│ Storage Device │
│ NVMe / eMMC / UFS │
└─────────────────────────┘
Key data structures:
struct super_block: per-mounted-filesystem state; points to root inodestruct inode: per-file metadata (uid, gid, size, timestamps, block pointers); no filenamestruct dentry: name-to-inode cache entry; forms the directory tree in memorystruct file: per-open-file state; current offset, flags; points to inode
Key operations table: struct file_operations — open, read, write, mmap, ioctl, fsync, llseek. Every filesystem registers its own implementation of these callbacks. mmap is critical for AI workloads: memory-mapped dataset files avoid read() syscalls entirely.
Key Insight: When you call
open("/data/frames/frame0001.raw", O_RDONLY)on an ext4 filesystem and then again on a btrfs filesystem, you get the same file descriptor, the sameread()API, the same behavior from the application's perspective. VFS is the reason this works. What differs between filesystems is what happens underneath — how blocks are allocated, how crashes are handled, and how fast metadata operations complete.
ext4¶
ext4 is the default Linux filesystem; backward-compatible with ext2/ext3.
ext4 is a mature, well-tested filesystem that prioritizes reliability and broad hardware compatibility. It is the right choice for general-purpose Linux root partitions — including Jetson rootfs — where predictable behavior matters more than advanced features.
Key Features¶
- Extent-based allocation: replaces indirect block maps; one extent = (logical block, physical block, length); reduces metadata for large contiguous files
- 64-bit block addresses: supports volumes up to 1 EB
- Delayed allocation (
delalloc): batch block allocation at writeback time; improves spatial locality;nodelallocdisables it - dir_index (htree): large directories stored as hash B-tree; O(log n) lookup vs O(n) linear scan
Key Insight: The ext4 extent format matters for AI dataset performance. A large file described by a single extent (logical block 0 → physical block 1000, length 100K blocks) can be read in one contiguous DMA operation. A file described by hundreds of indirect block pointers requires many separate metadata lookups. For large training dataset files, ext4 with extent allocation approaches sequential read performance.
Journal Modes¶
| Mode | What is journaled | Data safety | Performance |
|---|---|---|---|
data=ordered (default) |
Metadata only; data written before commit | Good | Fast |
data=journal |
Metadata + data | Best | Slowest |
data=writeback |
Metadata only; data order not guaranteed | Weakest | Fastest |
Mount with mount -o data=ordered /dev/mmcblk0p1 /mnt. For Jetson rootfs, data=ordered is the standard.
Default journal size: 128 MB (tunable via tune2fs -J size=N). Journal is a circular log; on crash, uncommitted transactions are discarded; committed but not checkpointed are replayed.
The crash recovery sequence for data=ordered mode:
- File data is written to disk before any metadata journal entry referencing it is committed. This ensures that if the system crashes after the journal commit, the data blocks it points to are already on disk.
- Journal commit writes a commit block to the circular journal log, making the metadata change durable.
- Checkpoint: at some later time, the journaled metadata is written to its final location in the filesystem. Only then can the journal space be reused.
- On crash recovery:
e2fsckreads the journal, replays committed but uncheckpointed entries, discards uncommitted entries. A clean journal means no fsck needed.
Common Pitfall:
data=writebackis sometimes used for maximum write throughput (e.g., on a temporary data partition). The danger is that after a crash, a committed metadata entry (file size, block pointer) may reference data blocks that were not yet flushed, resulting in garbage data in files that appeared successfully written. Never usedata=writebackfor application data that must survive a crash.
btrfs¶
btrfs is a Copy-on-Write B-tree filesystem with features designed for modern storage and system management. Where ext4 prioritizes simplicity and compatibility, btrfs prioritizes advanced features — particularly snapshots and checksums — that are critical for OTA update strategies.
Core Features¶
- Copy-on-Write (CoW): writes never overwrite existing data in place; new data is written to free space, then the B-tree root pointer is atomically updated
- Snapshots: O(1) creation; a snapshot is simply a new B-tree root pointing to the same leaf nodes; used for A/B rootfs in OTA update strategies
- Subvolumes: independent filesystem trees within the same btrfs pool; can be mounted separately or snapshotted individually
- Checksums: CRC32c (default), xxHash, SHA256, or Blake2 on both data and metadata; detects silent bit rot
- RAID modes: RAID 0, 1, 10, 5, 6 across multiple devices
- Transparent compression: zstd (best ratio), lzo (fastest), zlib; per-file or per-subvolume
- send/receive: compute a delta between two snapshots and stream it to another system; enables incremental OTA updates
Key Insight: btrfs snapshot creation is O(1) because a snapshot is just a new B-tree root that shares the same leaf nodes as the original. No data is copied. The original and the snapshot both reference the same physical blocks. When either is modified, CoW creates new blocks for the modified data — the other snapshot continues to reference the original blocks undisturbed. This is what makes instant A/B rootfs switching possible.
A/B Rootfs with btrfs Snapshots¶
# Create read-only snapshot before OTA
btrfs subvolume snapshot -r /rootfs /.snapshots/$(date +%Y%m%d)
# After update, create new snapshot from updated rootfs
btrfs subvolume snapshot /rootfs_new /.snapshots/new
# Rollback: delete new rootfs, restore from snapshot
The A/B OTA flow with btrfs snapshots works as follows:
- Before the update begins: create a read-only snapshot of the current rootfs. This takes milliseconds and uses no additional disk space.
- Apply the update: write new files, packages, or a full rootfs image to a new subvolume. The snapshot of the old rootfs is untouched.
- Reboot into the new rootfs: bootloader mounts the new subvolume as the active rootfs.
- If the new rootfs fails to boot: bootloader detects the failure (boot counter exceeds threshold) and mounts the old snapshot as the active rootfs. Full rollback with zero data movement.
- After a successful boot: the old snapshot can be deleted to reclaim space.
btrfs scrub start /: verify all data blocks against stored checksums; background operation; critical for long-running edge AI devices where silent corruption is a concern.
Common Pitfall: btrfs CoW creates extra write amplification for database-style workloads (many small, random updates to the same file regions). SQLite databases and RocksDB on btrfs can perform worse than on ext4 because each small write copies an entire B-tree leaf node. Use
chattr +C(disable CoW for specific files or directories) for database files on btrfs, or place database files on a separate ext4 partition.
F2FS (Flash-Friendly File System)¶
With ext4 and btrfs understood, we turn to the filesystem designed specifically for the storage hardware used in AI edge devices: NAND flash in eMMC and UFS packages.
F2FS is designed for NAND flash storage: eMMC, UFS, and NVMe SSDs. Developed by Samsung; merged into Linux 3.8.
Design Principles¶
- Log-structured with node/data separation: node area (inodes and index) and data area are logged separately; reduces fragmentation from mixed updates
- Adaptive logging: switches between normal logging (high utilization) and threaded logging (low utilization) to balance write amplification and fragmentation
- Reduced write amplification: flash-aware allocation avoids partial block updates; aligns writes to flash erase block size
- Optimized
fsync()latency: important for database workloads (SQLite on Android); uses a roll-forward recovery mechanism to avoid full checkpoint on every fsync
Key Insight: Flash memory has a fundamental asymmetry: you can write small chunks but must erase large chunks (typically 256 KB–2 MB). A filesystem that writes small random updates forces the flash controller to read-modify-write large erase blocks (write amplification). F2FS's log-structured design collects writes sequentially, aligning them to erase block boundaries and dramatically reducing this amplification — extending the life of the flash chip.
Use cases: Android (since Android 9 default on eMMC/UFS), Chromebooks, embedded AI devices with eMMC storage.
overlayfs¶
overlayfs is the mechanism that makes containers, OTA overlays, and read-only rootfs configurations work. Understanding it is required for working with Docker containers and Jetson OTA strategies.
overlayfs overlays two directory trees into a unified view:
- lower: read-only base layer (one or more stacked layers)
- upper: writable layer; receives all modifications
- workdir: temporary directory on the same filesystem as
upper; used for atomic rename operations
On first write to a file from lower, the file is copied up to upper (copy-on-write). Subsequent writes go directly to upper.
Application sees /container/merged (unified view):
/bin/bash ← from lower layer (image)
/lib/libc.so ← from lower layer (image)
/etc/config ← from upper layer (modified by container)
/tmp/cache ← from upper layer (created by container)
Physical layout:
┌────────────────────────────────────────────────────────┐
│ upper (writable, per-container) │
│ /etc/config /tmp/cache │
├────────────────────────────────────────────────────────┤
│ lower layer 2 (image layer, read-only) │
│ /etc/config.orig (shadowed by upper /etc/config) │
├────────────────────────────────────────────────────────┤
│ lower layer 1 (base image, read-only) │
│ /bin/bash /lib/libc.so │
└────────────────────────────────────────────────────────┘
overlayfs merges these: upper takes precedence, lower fills in the rest
Mount Syntax¶
mount -t overlay overlay \
-o lowerdir=/image/layer2:/image/layer1,\ # colon-separated, top-to-bottom
upperdir=/container/rw,\ # writable layer (must be empty)
workdir=/container/work \ # must be on same fs as upperdir
/container/merged # unified view mount point
Use Cases¶
- Docker/OCI containers: image layers form the
lowerstack; the container's writable layer isupper; the container sees/mergedas its rootfs - Jetson OTA: new rootfs image as
lower; persistent user data inupper; avoids a full rootfs copy for configuration - Read-only rootfs with writable overlay: boot from read-only squashfs; mount overlayfs with tmpfs as
upper; writable rootfs in RAM without modifying flash
Common Pitfall: The
upperandworkdirdirectories must reside on the same filesystem (same mount point). This is because overlayfs uses atomic renames within workdir for safe copy-up operations. Placing workdir on a different filesystem than upper causes the mount to fail withEINVAL. A common mistake is placing workdir on tmpfs and upper on ext4 (or vice versa).
tmpfs: Filesystem in RAM¶
overlayfs often uses tmpfs as its writable upper layer for ephemeral containers. But tmpfs has its own important standalone role: it is the foundation of all shared memory IPC in Linux.
tmpfs uses anonymous pages (RAM + swap) for storage; no disk I/O for reads or writes.
/dev/shmand/runare tmpfs by default on systemd systemsshm_open()/mmap(MAP_SHARED)use tmpfs for POSIX shared memory- openpilot
cerealmsgq allocates shared memory segments viashm_open()on/dev/shm; all IPC between processes (modeld, plannerd, controlsd) traverses these RAM-backed segments
Key Insight: When openpilot's
modeldpublishes a model output (trajectory prediction, detected objects) andplannerdreads it, the data never touches the disk. The "file" created byshm_open("/cereal_modeld", ...)exists only in tmpfs — in RAM. This is how a multi-process architecture achieves the same low-latency IPC as a single-process design with direct memory writes.
TRIM and Flash Longevity¶
Every time the OS deletes a file on a flash-backed filesystem, it knows those blocks are now free — but the flash hardware does not automatically learn this. TRIM bridges this gap.
fstrim notifies the SSD of freed block ranges so the controller can reclaim them during idle time.
discardmount option: issue TRIM inline on everyunlink()(higher latency per delete)systemd-fstrim.timer: weeklyfstrimsweep (preferred; batches TRIMs)- Without TRIM, the FTL does not know which logical blocks are free; write amplification increases as the drive ages
- Critical for embedded AI devices with eMMC (dashcams, edge inference boxes) that write continuously and cannot be taken offline for maintenance
Common Pitfall: Deploying an eMMC-based AI edge device with
discarddisabled andsystemd-fstrim.timernot enabled. Over weeks of continuous camera recording, the FTL fills its mapping table with "live" blocks that the OS knows are free but the SSD does not. Write amplification grows, throughput drops, and eventually write latency spikes cause frame drops in the recording pipeline. The fix is simple — enable the timer — but diagnosing the root cause after the fact requires correlatingiostatthroughput degradation over time with device age.
Summary¶
| Filesystem | Journaling | CoW | Best for | Key limitation |
|---|---|---|---|---|
| ext4 | Yes (ordered/journal/writeback) | No | General-purpose Linux rootfs | No snapshot support |
| btrfs | No (CoW replaces journal) | Yes | A/B OTA, snapshot-based rollback | Higher CPU overhead |
| F2FS | Checkpointing | No | eMMC/UFS embedded devices | Not ideal for HDDs |
| overlayfs | Inherits from upper layer | On first write | Container rootfs, OTA overlay | upper/lower must differ |
| tmpfs | None (RAM) | No | IPC shared memory, /tmp | Lost on reboot/OOM |
| ext4 + F2FS | Yes | No | Mixed: rootfs (ext4) + data (F2FS) | Separate partitions needed |
Conceptual Review¶
-
Why does ext4 use
data=orderedmode by default instead ofdata=journal? Full data journaling (data=journal) writes every byte of data twice: once to the journal, once to its final location. For a device recording camera frames at high throughput, this doubles the write load.data=orderedprovides strong crash consistency (data is on disk before metadata commits to the journal) at the cost of only journaling metadata — a much smaller amount of data. -
What makes btrfs snapshots O(1) in time and initially O(0) in space? A btrfs snapshot is a new B-tree root node that points to the same leaf nodes as the original subvolume. Creating it requires writing only one new root node — regardless of how many files are in the filesystem. No data blocks are duplicated at snapshot time. Space is only consumed when one version (original or snapshot) diverges from the other through writes, which triggers CoW.
-
Why is F2FS preferred over ext4 on eMMC for continuous recording workloads? eMMC NAND flash requires erase-before-write at the granularity of large erase blocks. ext4's random-write patterns force the eMMC FTL to perform read-modify-write cycles frequently, multiplying the physical write count (write amplification). F2FS's log-structured design aggregates writes sequentially, matching the natural write granularity of flash and reducing write amplification, extending the device's P/E cycle budget.
-
How does overlayfs enable immutable container images? The base image layers are mounted as
lower(read-only). Any modification the container makes — installing a package, creating a log file — goes to theupper(writable) layer. The base image on disk is never modified. When the container stops, theupperlayer can be discarded, leaving the base image exactly as it was. Multiple containers can share the same base image layers simultaneously with no conflicts. -
What role does tmpfs play in openpilot's IPC architecture? openpilot processes (modeld, plannerd, controlsd, camerad) communicate via cereal msgq, which uses POSIX shared memory (
shm_open()) backed by/dev/shm— a tmpfs mount. Messages between processes are written to RAM, not disk. This gives inter-process communication the same latency as direct memory writes, with no filesystem overhead. -
What happens to write amplification on an eMMC without TRIM over time? The Flash Translation Layer (FTL) maintains a logical-to-physical block mapping. When a file is deleted, the OS frees the logical blocks. Without TRIM, the FTL still considers those physical blocks "live" and cannot use them for new writes without first erasing them. As the device fills with these zombie blocks, the FTL must increasingly perform garbage collection (erase + copy) on partially-used erase blocks before any new write can land, multiplying the effective write amplification by 10–50x in the worst case.
AI Hardware Connection¶
- btrfs snapshots provide O(1) A/B rootfs switching in openpilot Agnos and Jetson OTA update pipelines; on failed boot, the bootloader activates the previous snapshot without data movement
- F2FS on eMMC significantly reduces write amplification in embedded AI edge devices (dashcams, robotics controllers) that perform continuous camera recording
- overlayfs enables containerized TensorRT inference deployments where the base image layer is read-only and immutable while the container adds only runtime state in the upper layer
- tmpfs and
shm_open()underpin openpilot cereal msgq IPC; all model outputs, trajectory plans, and control commands between processes traverse RAM-backed shared memory segments - ext4
data=orderedmode is the standard for Jetson rootfs partitions; it provides the strongest crash consistency guarantee without the overhead of full data journaling systemd-fstrim.timeris a mandatory configuration item for any long-running eMMC-based AI edge device; without it, write amplification degrades throughput and reduces flash lifespan