Lecture Note 06 (L12, L13, L14): Virtual Memory, Page Tables & TLBs, Kernel Memory Allocation
Combines: Lecture L12 (Virtual Memory & Linux Memory Model), L13 (Page Tables, TLBs & Huge Pages), L14 (Memory Allocation: SLUB, kmalloc & CMA).
How This Note Is Organized
- Part 1 — Virtual memory: Address space; layout (x86-64/ARM64); mm_struct, VMA; mmap; demand paging; COW; mlock; RSS vs PSS.
- Part 2 — Page tables & TLBs: Page table walk; PTE bits; TLB and TLB shootdown; PCID/ASID; huge pages (2MB, 1GB); THP vs HugeTLBFS; madvise.
- Part 3 — Kernel allocation: Buddy allocator; SLUB/kmalloc; vmalloc; zones; GFP flags; CMA.
Part 1: Virtual Memory & the Linux Memory Model
Context: Each process has a private virtual address space. The MMU translates VA→PA per page. Isolation, overcommit, and demand paging are central. Page faults and swap add latency that is hard to bound, so real-time (RT) code must keep them out of the hot path.
Virtual Address Space
- Isolation: One process cannot access another's memory without kernel-mediated sharing.
- Overcommit: Total virtual memory across processes can exceed physical RAM; physical pages are allocated on demand.
- Layout (x86-64): User: text, data/BSS, heap (↑), mmap (↓), stack (↓), vDSO (kernel-provided but mapped into user space). Kernel: direct map, vmalloc, kernel text, modules. ARM64: TTBR0 = user, TTBR1 = kernel.
mm_struct & VMA
- mm_struct: One per process. Fields: pgd, mmap list, mm_rb tree (both replaced by a maple tree, mm_mt, since Linux 6.1), start_code/end_code, start_brk/brk, total_vm, locked_vm.
- VMA (vm_area_struct): Contiguous region with same permissions and backing. Fields: vm_start/vm_end, vm_flags (READ, WRITE, EXEC, SHARED, LOCKED), vm_file, vm_pgoff, vm_ops (fault, open, close).
VMA types: Text (file, R-X); data/BSS (file/zero); heap (anonymous, zero-fill); stack (anonymous, grow-down); file mmap shared (e.g. model weights); anonymous shared (e.g. VisionIPC); vDSO (kernel-provided).
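The VMA fields above map directly onto the columns of a /proc/<pid>/maps line. The following is a minimal sketch of a parser for one such line; the sample lines and the library path in them are made up for illustration, and the helper names (Vma, parse_maps_line) are not from the lecture.

```python
from typing import NamedTuple


class Vma(NamedTuple):
    start: int    # corresponds to vm_start
    end: int      # corresponds to vm_end (exclusive)
    perms: str    # e.g. "r-xp": read, no write, exec, private
    offset: int   # byte offset into the backing file
    path: str     # backing file, or "" for an anonymous mapping


def parse_maps_line(line: str) -> Vma:
    # Column layout: start-end perms offset dev inode [path]
    fields = line.split(maxsplit=5)
    start_s, end_s = fields[0].split("-")
    path = fields[5].strip() if len(fields) > 5 else ""
    return Vma(int(start_s, 16), int(end_s, 16), fields[1],
               int(fields[2], 16), path)


# Hypothetical sample lines, not taken from a real process.
TEXT_LINE = "7f2c4a000000-7f2c4a021000 r-xp 00001000 08:01 131090 /usr/lib/libexample.so"
ANON_LINE = "7ffd1000-7ffd2000 rw-p 00000000 00:00 0"

text_vma = parse_maps_line(TEXT_LINE)
anon_vma = parse_maps_line(ANON_LINE)
text_pages = (text_vma.end - text_vma.start) // 4096
print(text_vma.perms, text_pages, "pages", text_vma.path)
```

On a Linux system the same function can be applied to each line of /proc/self/maps to list the VMAs of the running interpreter.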
mmap() & Demand Paging
- mmap() creates a VMA; it does not allocate physical pages. Pages are allocated on first access (demand paging).
- Fault types: Minor — page in RAM but PTE absent (e.g. COW, demand-zero); ~1 µs. Major — page from disk (file/swap); 1–10 ms; fatal for RT.
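- mlockall(MCL_CURRENT|MCL_FUTURE) faults in and pins every page mapped at the time of the call and locks future mappings as they are created (unless MCL_ONFAULT is set). It cannot cover stack that has not grown yet or buffers allocated later, so allocate and pre-touch them (e.g. memset) before entering the RT loop to avoid faults there.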
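Demand paging and pre-touching can be observed from user space. The sketch below assumes Linux and CPython: it maps an anonymous private region, writes one byte per page, and reads the process's minor-fault counter from getrusage before and after. The helper names and the 256-page size are illustrative choices, and the exact fault counts depend on the kernel and THP settings, so no specific numbers are claimed.

```python
import mmap
import resource

PAGE = mmap.PAGESIZE


def minor_faults() -> int:
    # ru_minflt counts faults served without I/O (demand-zero, COW, ...).
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def pretouch(buf: mmap.mmap) -> int:
    """Write one byte per page so each page is backed by physical memory."""
    touched = 0
    for off in range(0, len(buf), PAGE):
        buf[off] = 0
        touched += 1
    return touched


size = 256 * PAGE
# mmap only creates the VMA; no physical pages are allocated yet.
buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

before = minor_faults()
pages = pretouch(buf)          # first touch: expect demand-zero faults
after_first = minor_faults()
pretouch(buf)                  # second touch: pages are already mapped
after_second = minor_faults()

print("pages touched:", pages)
print("faults on first pass:", after_first - before)
print("faults on second pass:", after_second - after_first)
```

The first pass typically accounts for most of the faults and the second for few or none, which is the reason to pre-touch before the RT loop rather than inside it.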
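Copy-on-Write (COW)
After fork(), parent and child share pages read-only. On the first write, the kernel allocates a new page, copies the content, and installs a writable PTE for the writer. fork() therefore copies only page tables, not data pages, so it is far cheaper than duplicating the address space (though not O(1): its cost grows with the size of the page tables). Model weights shared read-only consume one physical copy.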
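The user-visible effect of COW is that a write in the child never shows up in the parent. A minimal sketch, assuming a Unix system with os.fork; the buffer contents and function name are illustrative. It shows only the isolation semantics and does not measure how many physical pages are shared.

```python
import os


def child_write_is_private() -> bool:
    data = bytearray(b"weights")
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the write below triggers COW on the page holding `data`.
        try:
            os.close(read_fd)
            data[0:7] = b"CHANGED"
            os.write(write_fd, bytes(data))
        finally:
            os._exit(0)  # never fall through into the parent's code path
    # Parent: read what the child saw, then check our own copy.
    os.close(write_fd)
    child_view = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return child_view == b"CHANGED" and bytes(data) == b"weights"


result = child_write_is_private()
print("child write stayed private:", result)
```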
Inspecting Memory: maps, smaps, RSS vs PSS
- /proc/<pid>/maps: VMAs (range, perms, backing).
- /proc/<pid>/smaps: per-VMA RSS, PSS, swap, AnonHugePages.
- RSS: Resident set; counts each shared page in full for every process that maps it, so summing RSS double-counts. PSS: Proportional; each shared page is divided by its number of sharers, which makes it the right metric for per-process accounting.
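The difference between RSS and PSS is simple arithmetic. A small sketch, assuming 4 KB pages and a made-up process with 100 private pages plus 300 pages shared among three processes; the function name is illustrative.

```python
def rss_pss_kb(mappings, page_kb=4):
    """mappings: iterable of (resident_pages, number_of_sharers)."""
    # RSS counts every resident page in full.
    rss = sum(pages for pages, _ in mappings) * page_kb
    # PSS charges each page 1/sharers to this process.
    pss = sum(pages / sharers for pages, sharers in mappings) * page_kb
    return rss, pss


# Hypothetical process: 100 private pages, 300 pages shared by 3 processes.
mappings = [(100, 1), (300, 3)]
rss, pss = rss_pss_kb(mappings)
print(f"RSS = {rss} kB, PSS = {pss:.0f} kB")
```

Summing RSS over the three sharers would count the 300 shared pages three times, whereas summing PSS counts them once.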
Part 2: Page Tables, TLBs & Huge Pages
Context: Every access needs VA→PA translation. The TLB caches it; a miss triggers a multi-level page table walk (four dependent reads on x86-64, up to roughly 300–400 ns when each read misses the caches). Larger pages reduce TLB pressure.
Page Table Walk (x86-64)
- 48-bit VA split: PGD index (9), PUD (9), PMD (9), PTE (9), offset (12). CR3 → PGD; each level one 4 KB table (512 entries). Four memory reads on TLB miss.
- PTE bits: Present, R/W, U/S, PWT, PCD (cache disable; use for MMIO), Accessed, Dirty, Global, NX, PFN. PCD=1 for device registers (ioremap sets it).
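The 9/9/9/9/12 split of a 48-bit virtual address can be worked through with shifts and masks. A minimal sketch; the address is an arbitrary example and the function names are illustrative.

```python
INDEX_BITS = 9      # 512 entries per table
OFFSET_BITS = 12    # 4 KB pages
INDEX_MASK = (1 << INDEX_BITS) - 1


def split_va(va: int) -> dict:
    """Split the low 48 bits of a virtual address into walk indices."""
    return {
        "pgd": (va >> 39) & INDEX_MASK,
        "pud": (va >> 30) & INDEX_MASK,
        "pmd": (va >> 21) & INDEX_MASK,
        "pte": (va >> 12) & INDEX_MASK,
        "offset": va & ((1 << OFFSET_BITS) - 1),
    }


def join_va(parts: dict) -> int:
    """Inverse of split_va, used here as a consistency check."""
    return (parts["pgd"] << 39 | parts["pud"] << 30 | parts["pmd"] << 21
            | parts["pte"] << 12 | parts["offset"])


va = 0x7F1234567ABC  # arbitrary user-space address
parts = split_va(va)
print(parts)
```

Each index selects one of 512 entries in the table at that level, which is why a TLB miss costs four memory reads before the final data access.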
TLB & TLB Shootdown
- TLB: Hardware cache of VA→PA; hit in 1–6 cycles; miss → hardware walk.
- TLB shootdown: When one CPU changes a PTE, others may have stale TLB entries. Kernel sends IPI; each CPU flushes affected entries. Expensive on many CPUs; hot in mmap/munmap-heavy workloads.
PCID / ASID
- PCID (x86): Tags TLB entries with a process-context ID, so a context switch can keep the TLB: loading CR3 with the no-flush bit set preserves the entries tagged with that PCID. Reduces switch cost.
- ASID (ARM64): Same idea in TTBR0; avoids full TLB flush on switch.
Huge Pages
- 4 KB: 1 GB = 262K PTEs; exceeds typical L2 TLB.
- 2 MB (PMD): One PMD entry; 512× fewer entries for same range. HugeTLBFS: Pre-allocated, pinned; MAP_HUGETLB. THP: Kernel promotes 4 KB→2 MB via khugepaged; use madvise mode and MADV_HUGEPAGE on chosen regions to avoid unpredictable latency.
- 1 GB (PUD): Normally reserved at boot, since a free contiguous 1 GB range is rarely available later; hugepagesz=1G hugepages=N.
- madvise: MADV_HUGEPAGE, MADV_NOHUGEPAGE, MADV_WILLNEED (prefetch), MADV_DONTNEED, MADV_SEQUENTIAL/MADV_RANDOM.
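The entry counts quoted for each page size follow from dividing the region by the page size. The sketch below also computes TLB reach (entries × page size) for an assumed 1536-entry second-level TLB; that entry count is a hypothetical figure chosen for illustration, not a value from the lecture.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30


def entries_needed(region_bytes: int, page_bytes: int) -> int:
    """Translations required to cover the region (ceiling division)."""
    return -(-region_bytes // page_bytes)


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of address space the TLB can map without a miss."""
    return entries * page_bytes


ASSUMED_TLB_ENTRIES = 1536  # hypothetical L2 TLB size

for name, page in (("4 KB", 4 * KB), ("2 MB", 2 * MB), ("1 GB", 1 * GB)):
    need = entries_needed(1 * GB, page)
    reach_mb = tlb_reach(ASSUMED_TLB_ENTRIES, page) // MB
    print(f"{name}: {need} entries per GB, reach {reach_mb} MB")
```

With these assumptions, 4 KB pages cover only a few megabytes before the TLB starts missing, while 2 MB pages cover several gigabytes with the same number of entries.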
Part 3: Kernel Memory (Buddy, SLUB, vmalloc, CMA)
Context: Buddy manages physical pages; SLUB carves objects from pages; vmalloc gives virtual contiguity without physical contiguity; GFP flags and zones control where memory comes from and how the allocation may behave; CMA reserves contiguous regions for DMA.
Buddy Allocator
- Free lists by order 0..10 (2^order pages). alloc_pages(gfp, order) returns 2^order contiguous pages. Splits higher-order blocks when needed; merges buddies on free.
- Fragmentation: High-order allocation can fail despite total free memory. CMA reserves contiguous region at boot before fragmentation.
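Two pieces of buddy arithmetic are worth seeing concretely: rounding a request up to an order, and finding the buddy block that a freed block can merge with. This is a toy model in Python, not kernel code; the function names are illustrative, and the limits (4 KB pages, maximum order 10) follow the figures above.

```python
def order_for(nbytes: int, page_size: int = 4096, max_order: int = 10) -> int:
    """Smallest order whose 2**order pages hold nbytes."""
    pages = max(1, -(-nbytes // page_size))   # ceiling division
    order = (pages - 1).bit_length()          # round up to a power of two
    if order > max_order:
        raise ValueError("request exceeds the largest buddy block")
    return order


def buddy_pfn(pfn: int, order: int) -> int:
    """Page frame number of the buddy of the block starting at pfn.

    Blocks of a given order are aligned to 2**order pages, so the buddy
    differs from the block only in bit `order` of the frame number.
    """
    return pfn ^ (1 << order)


print(order_for(5 * 4096))   # 5 pages round up to an 8-page block
print(buddy_pfn(8, 3))       # the 8-page block at pfn 8 pairs with pfn 0
```

The rounding also shows the internal waste of high-order requests: asking for 5 pages consumes an 8-page block.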
SLUB & kmalloc
- SLUB: Per-CPU slabs; fast path lock-free. Partial slab list shared when CPU slab is full.
- kmem_cache: Dedicated cache for one object type (e.g. DMA descriptors); kmem_cache_alloc/kmem_cache_free.
- kmalloc(size, gfp): Size-based caches (8, 16, … 8K); above 8K uses buddy. kzalloc = zeroed. Use GFP_KERNEL in process context; GFP_ATOMIC in IRQ/spinlock (never sleep).
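The size-class behavior of kmalloc can be modeled as rounding a request up to the nearest cache size. The list below is an assumption based on a typical x86-64 SLUB configuration (powers of two from 8 to 8192 plus the 96 and 192 byte caches); actual classes depend on kernel version and configuration. The function name is illustrative.

```python
import bisect

# Assumed size classes; real kernels may differ by config and architecture.
KMALLOC_SIZES = [8, 16, 32, 64, 96, 128, 192, 256,
                 512, 1024, 2048, 4096, 8192]


def kmalloc_bucket(size: int):
    """Cache size that would serve the request, or None if it is too
    large for a slab cache and would go to the page (buddy) allocator."""
    if size <= 0:
        raise ValueError("size must be positive")
    i = bisect.bisect_left(KMALLOC_SIZES, size)
    return KMALLOC_SIZES[i] if i < len(KMALLOC_SIZES) else None


for request in (1, 65, 100, 8192, 8193):
    print(request, "->", kmalloc_bucket(request))
```

A 100-byte request lands in the 128-byte cache, wasting 28 bytes per object; when many objects of one odd size are allocated, a dedicated kmem_cache avoids that rounding.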
vmalloc & kvmalloc
- vmalloc(size): Virtually contiguous, physically scattered; page tables built in vmalloc region. Slower than kmalloc; use for large buffers that do not need DMA contiguity.
- kvmalloc / kvfree: Prefer kmalloc; fall back to vmalloc for large sizes.
Zones & GFP Flags
- Zones (x86-64): ZONE_DMA (0–16 MB), ZONE_DMA32 (16 MB–4 GB), ZONE_NORMAL (4 GB+), ZONE_MOVABLE (migratable pages only). GFP selects zone and behavior.
- GFP_KERNEL: May sleep, reclaim; process context. GFP_ATOMIC: No sleep/reclaim; IRQ/spinlock; can return NULL. GFP_DMA/GFP_DMA32 for device DMA range. Modifiers: __GFP_ZERO, __GFP_NOFAIL, __GFP_NOWARN.
CMA (Contiguous Memory Allocator)
- Reserves a contiguous physical region at boot and marks its pageblocks MIGRATE_CMA. Movable pages can use the region until a device requests it; the kernel then migrates them out. The driver gets contiguous DMA memory via dma_alloc_* or alloc_pages from CMA. Avoids runtime fragmentation failure for large DMA buffers.
Summary Tables
VMA / fault: Text/data/heap/stack/file mmap/anon shared — fault = file read, zero-fill, or swap-in. Minor ~1 µs; major 1–10 ms.
Page size / TLB: 4 KB → 262K entries/GB; 2 MB → 512/GB; 1 GB → 1/GB. THP = dynamic; HugeTLBFS = explicit, pinned.
Allocator: Buddy = pages; SLUB = objects; kmalloc = general; vmalloc = virtual contiguity; CMA = contiguous for DMA.
AI Hardware Connection
- mmap(MAP_SHARED) on DMA-BUF fd: zero-copy GPU/CPU buffer sharing. mlockall + pre-touch in RT inference to avoid faults. COW after fork shares model weights without doubling RAM.
- THP / MADV_HUGEPAGE on weight tensors reduces TLB misses. PSS in smaps for correct per-process memory accounting. PCID/ASID reduce context-switch cost.
- GFP_ATOMIC in ISR; GFP_KERNEL in process context. CMA or early allocation for large DMA buffers; avoid high-order alloc_pages on fragmented systems. slabtop for slab leaks.
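The MADV_HUGEPAGE advice for weight buffers mentioned above can be issued from user space. A hedged sketch, assuming Linux and Python 3.8 or later: the constant and the madvise method exist only where the platform supports them, so the helper (a hypothetical name) checks for both and treats a kernel refusal as "not applied". The 8 MB buffer stands in for a weight tensor.

```python
import mmap


def advise_hugepage(buf: mmap.mmap) -> bool:
    """Ask the kernel to back buf with transparent huge pages.

    Returns False when the platform lacks MADV_HUGEPAGE or the kernel
    rejects the request (for example, THP disabled).
    """
    advice = getattr(mmap, "MADV_HUGEPAGE", None)
    if advice is None or not hasattr(buf, "madvise"):
        return False
    try:
        buf.madvise(advice)
    except OSError:
        return False
    return True


size = 8 * 1024 * 1024  # stand-in for a weight buffer
weights = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
applied = advise_hugepage(weights)
print("MADV_HUGEPAGE applied:", applied)
```

A True result means only that the advice was accepted; whether 2 MB pages are actually used can be checked through the AnonHugePages field in /proc/<pid>/smaps.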