Lecture Note 07 (L15, L16): DMA, IOMMU & GPU Memory; NUMA & HPC Optimization¶
Combines: Lecture L15 (DMA, IOMMU & GPU Memory Management) and Lecture L16 (NUMA Topology & HPC Memory Optimization).
How This Note Is Organized¶
- Part 1 — DMA & IOMMU: Direct Memory Access; cache coherency (coherent vs streaming); DMA-BUF; IOMMU and security; VFIO; GPU memory and zero-copy.
- Part 2 — NUMA: Non-uniform memory access; first-touch and placement; memory policies (bind, preferred, interleave); numactl and libnuma; AutoNUMA; multi-GPU and CPU–GPU affinity.
Part 1: DMA, IOMMU & GPU Memory Management¶
Context: Devices move data to and from RAM without CPU copies. Neither the CPU nor the device may act on stale data (cache coherency). The IOMMU translates device addresses and restricts access; DMA-BUF shares one buffer across CPU, GPU, camera, and display.
DMA (Direct Memory Access)¶
- Device (NIC, NVMe, GPU, camera ISP) transfers data to/from system RAM autonomously. CPU programs descriptor (src, dst, length); device runs transfer; completion via IRQ or poll.
- Without DMA: Device → CPU copy → RAM. With DMA: Device → DMA engine → RAM; CPU is free after submitting descriptor.
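The descriptor handoff can be sketched as a toy model. This is an illustration only, not driver code: `Descriptor`, `ToyDmaEngine`, and the byte buffers are hypothetical names, and the `done` flag stands in for an IRQ or a polled status bit.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    src: int            # offset in the (hypothetical) device buffer
    dst: int            # offset in system RAM
    length: int         # number of bytes to move
    done: bool = False  # set by the engine; stands in for IRQ/poll status


class ToyDmaEngine:
    """Hypothetical engine that copies device memory into RAM per descriptor."""

    def __init__(self, device_mem: bytes, ram: bytearray):
        self.device_mem = device_mem
        self.ram = ram
        self.queue = []

    def submit(self, desc: Descriptor) -> None:
        # CPU side: programming the descriptor is the only CPU work.
        self.queue.append(desc)

    def run(self) -> None:
        # Device side: the transfer proceeds without the CPU touching the data.
        while self.queue:
            d = self.queue.pop(0)
            self.ram[d.dst:d.dst + d.length] = self.device_mem[d.src:d.src + d.length]
            d.done = True


ram = bytearray(16)
engine = ToyDmaEngine(device_mem=b"packet-payload!!", ram=ram)
desc = Descriptor(src=0, dst=4, length=6)
engine.submit(desc)
engine.run()
print(desc.done, bytes(ram[4:10]))
```

In a real system `run()` executes in hardware while the CPU does other work; the model only shows which side owns which step.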
Cache Coherency: Coherent vs Streaming¶
- Coherent DMA: dma_alloc_coherent() returns uncached or hardware-coherent memory; CPU and device always see the same data; no explicit sync. Use for small control/descriptor regions.
- Streaming DMA: CPU uses cached memory; driver synchronizes explicitly:
  - DMA_TO_DEVICE: CPU wrote data → dma_map_single() flushes cache → device reads.
  - DMA_FROM_DEVICE: Device will write → after transfer, dma_unmap_single() invalidates cache → CPU reads. Between map and unmap the buffer is "owned" by the device; the CPU must not touch it.
  - dma_map_sg / dma_unmap_sg for scatter-gather (fragmented buffers). Direction controls which cache op is used; wrong direction = silent corruption.
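Why the direction matters can be seen in a toy model of a write-back CPU cache in front of RAM with a non-coherent device. This is a sketch under simplifying assumptions (one value per address, no cache lines or evictions); `ToySystem`, `to_device`, and `from_device` are hypothetical names, and `flush`/`invalidate` stand in for the cache maintenance that the map/unmap calls perform.

```python
class ToySystem:
    """Hypothetical write-back CPU cache in front of RAM; the device sees RAM only."""

    def __init__(self, size: int):
        self.ram = bytearray(size)
        self.cache = {}     # addr -> cached value
        self.dirty = set()  # cached addrs not yet written back to RAM

    def cpu_write(self, addr, val):
        self.cache[addr] = val
        self.dirty.add(addr)

    def cpu_read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.ram[addr]  # miss: fill from RAM
        return self.cache[addr]

    def flush(self, addr):
        # Write a dirty entry back to RAM (needed before the device reads).
        if addr in self.dirty:
            self.ram[addr] = self.cache[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        # Drop the cached copy (needed before the CPU reads device-written data).
        self.cache.pop(addr, None)
        self.dirty.discard(addr)

    def device_read(self, addr):
        return self.ram[addr]

    def device_write(self, addr, val):
        self.ram[addr] = val


def to_device(sync: bool) -> int:
    """CPU writes 42, then the device reads; returns what the device sees."""
    s = ToySystem(4)
    s.cpu_write(0, 42)
    if sync:
        s.flush(0)
    return s.device_read(0)


def from_device(sync: bool) -> int:
    """Device writes 7 over a location the CPU has cached; returns what the CPU sees."""
    s = ToySystem(4)
    s.cpu_read(1)          # CPU caches the old value (0)
    s.device_write(1, 7)
    if sync:
        s.invalidate(1)
    return s.cpu_read(1)


print(to_device(True), to_device(False), from_device(True), from_device(False))
```

Skipping the flush leaves the device reading the old RAM contents, and skipping the invalidate leaves the CPU reading its old cached copy, which is the silent corruption the notes warn about.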
IOMMU (Input-Output MMU)¶
- Sits between devices and memory bus. Translates IOVA (I/O Virtual Address) to physical address using per-device/group page tables.
- Without IOMMU: Device can DMA to any physical address (security risk). With IOMMU: Only mapped IOVAs are allowed; unmapped access → fault.
- Implementations: Intel VT-d, AMD-Vi, ARM SMMU (Jetson Orin). IOMMU groups: the smallest set of devices the IOMMU can isolate from one another; for VFIO passthrough, the whole group is assigned together.
- Drivers usually use dma_map_*, which goes through the IOMMU automatically; the low-level iommu_map / iommu_unmap calls appear in framework code.
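The translate-or-fault behaviour can be sketched with a toy page table per device domain. This is an assumed single-level model with 4 KiB pages; `ToyIommu`, `IommuFault`, and the domain names are hypothetical and do not correspond to kernel data structures.

```python
PAGE_SIZE = 4096


class IommuFault(Exception):
    """Raised when a device touches an IOVA that has no mapping."""


class ToyIommu:
    """Hypothetical IOMMU: one flat page table (dict) per device domain."""

    def __init__(self):
        self.tables = {}  # domain -> {iova_page: phys_page}

    def map(self, domain, iova, phys, size):
        if iova % PAGE_SIZE or phys % PAGE_SIZE or size % PAGE_SIZE:
            raise ValueError("iova, phys and size must be page-aligned")
        table = self.tables.setdefault(domain, {})
        for off in range(0, size, PAGE_SIZE):
            table[(iova + off) // PAGE_SIZE] = (phys + off) // PAGE_SIZE

    def unmap(self, domain, iova, size):
        table = self.tables.get(domain, {})
        for off in range(0, size, PAGE_SIZE):
            table.pop((iova + off) // PAGE_SIZE, None)

    def translate(self, domain, iova):
        # Only mapped IOVAs translate; anything else faults.
        table = self.tables.get(domain, {})
        page = iova // PAGE_SIZE
        if page not in table:
            raise IommuFault(f"{domain}: unmapped IOVA {iova:#x}")
        return table[page] * PAGE_SIZE + iova % PAGE_SIZE


iommu = ToyIommu()
iommu.map("nic", iova=0x10000, phys=0x80000000, size=2 * PAGE_SIZE)
print(hex(iommu.translate("nic", 0x10010)))
```

Because each domain has its own table, a second device (say a hypothetical `"gpu"` domain) faults on the same IOVA unless it has been given its own mapping.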
DMA-BUF & Zero-Copy Pipeline¶
- DMA-BUF: Kernel abstraction to share one DMA buffer across subsystems (CPU, GPU, camera ISP, display). Exporter allocates and exports; importer attaches and maps for its device.
- Lifecycle: Exporter creates buffer, gets fd; fd passed (e.g. Unix socket); importer calls dma_buf_get(fd), dma_buf_attach, dma_buf_map_attachment → sg_table with IOVAs for that device.
- Zero-copy: V4L2 camera → DMA-BUF fd → CUDA importer → same physical pages for inference → display. No CPU copy; sync via DMA fence (dma_fence_wait). V4L2 supports V4L2_MEMORY_DMABUF; userspace passes fd in buffer.
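The "pass an fd, map the same pages" idea can be tried from userspace without any DMA hardware. The sketch below is an analogy only: an ordinary temporary file stands in for the DMA-BUF, and both ends live in one process for brevity. It shows the fd travelling over a Unix socket with SCM_RIGHTS and both sides mapping the same pages; the kernel-side importer calls are not shown. The helper names are hypothetical, and a Unix-like OS is assumed.

```python
import array
import mmap
import os
import socket
import tempfile

BUF_SIZE = 4096


def send_fd(sock, fd):
    # Pass the descriptor as ancillary data (SCM_RIGHTS) with a 1-byte payload.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))])


def recv_fd(sock):
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return fds[0]


def share_buffer():
    """Exporter writes through its mapping; importer reads through its own."""
    exporter_sock, importer_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        with tempfile.TemporaryFile() as backing:  # stand-in for the exported buffer
            os.ftruncate(backing.fileno(), BUF_SIZE)
            exporter_view = mmap.mmap(backing.fileno(), BUF_SIZE)

            send_fd(exporter_sock, backing.fileno())
            imported_fd = recv_fd(importer_sock)
            importer_view = mmap.mmap(imported_fd, BUF_SIZE)

            exporter_view[:5] = b"frame"      # producer fills the buffer
            seen = bytes(importer_view[:5])   # consumer sees it without a copy step

            importer_view.close()
            exporter_view.close()
            os.close(imported_fd)
    finally:
        exporter_sock.close()
        importer_sock.close()
    return seen


print(share_buffer())
```

A real pipeline would also need the fence-based synchronization described above; this sketch has a single thread, so ordering is trivially correct.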
GPU Memory & Unified Memory¶
- Discrete GPU: dedicated VRAM; driver manages allocations and CPU↔GPU copies. Unified memory (e.g. Jetson): CPU and GPU share physical RAM; one address space; no explicit copy for shared buffers, but bandwidth and latency depend on placement.
- Resizable BAR (SAM): GPU BAR1 can cover full VRAM; CPU can address all GPU memory (important for GPUDirect Storage, zero-copy).
Part 2: NUMA Topology & HPC Memory Optimization¶
Context: On multi-socket machines, memory attached to socket 0 is "local" to CPUs on socket 0 and "remote" to socket 1. Remote access has higher latency and lower bandwidth; placement of data and threads matters.
NUMA Architecture¶
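- Each socket (NUMA node) has local DRAM (lower latency, full bandwidth). Access to another node's DRAM goes over the interconnect (QPI/UPI, Infinity Fabric); as a rough rule of thumb, expect up to ~2× the latency and about half the bandwidth, with the exact penalty depending on the platform.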
- Discovery: numactl --hardware, lstopo, numastat, /sys/devices/system/node/nodeN/distance. Distance matrix: local = 10; remote often 20–40.
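The distance matrix can also be read programmatically from the sysfs files listed above. The following is a minimal sketch; the function names are hypothetical, and `remote_by_distance` assumes node IDs are contiguous from 0 so that column index equals node ID.

```python
import os
import re


def read_numa_distances(root="/sys/devices/system/node"):
    """Return {node_id: [distance to each node]}; {} if the sysfs tree is absent."""
    result = {}
    if not os.path.isdir(root):
        return result
    for name in os.listdir(root):
        match = re.fullmatch(r"node(\d+)", name)
        if not match:
            continue  # skip files such as "online" or "possible"
        try:
            with open(os.path.join(root, name, "distance")) as f:
                result[int(match.group(1))] = [int(x) for x in f.read().split()]
        except OSError:
            continue
    return result


def remote_by_distance(distances, node):
    """Other nodes ordered nearest first (assumes column index == node ID)."""
    row = distances[node]
    others = [n for n in distances if n != node and n < len(row)]
    return sorted(others, key=lambda n: row[n])


if __name__ == "__main__":
    for node, row in sorted(read_numa_distances().items()):
        print(f"node{node}: {row}")
```

On a single-node machine the result is one row containing only the local distance; on a machine without NUMA sysfs entries the function returns an empty dict.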
First-Touch and Placement¶
- Default (first-touch): Page is allocated on the node of the CPU that first faults it. If main thread on node 0 initializes a buffer later used only by workers on node 1, every access is remote — silent slowdown.
- Fix: Allocate (and fault) on the same node that will use the data: pin the thread to the node, set the mempolicy, then malloc + memset; or use numa_alloc_onnode() to allocate on a chosen node, or mbind(..., MPOL_MF_MOVE) to migrate pages that already exist.
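The cost of initializing on the wrong node can be made concrete with a small placement model. This is a hypothetical simulation, not a measurement: it only counts how many pages end up remote to the thread that later uses them, assuming two workers that each own half of an 8-page buffer.

```python
def place_pages(num_pages, init_node_of_page):
    """First-touch: each page lands on the node of the CPU that first writes it."""
    return [init_node_of_page(p) for p in range(num_pages)]


def remote_fraction(placement, access_node_of_page):
    """Fraction of pages whose later accessor runs on a different node."""
    remote = sum(1 for p, node in enumerate(placement) if access_node_of_page(p) != node)
    return remote / len(placement)


NUM_PAGES = 8


def worker_node(page):
    # Worker 0 (node 0) uses pages 0-3; worker 1 (node 1) uses pages 4-7.
    return 0 if page < NUM_PAGES // 2 else 1


# Main thread on node 0 initializes everything (e.g. one big memset).
serial_init = place_pages(NUM_PAGES, lambda p: 0)
# Each worker initializes the pages it will later use.
parallel_init = place_pages(NUM_PAGES, worker_node)

print(remote_fraction(serial_init, worker_node))         # half the pages are remote
print(remote_fraction(parallel_init, worker_node))       # all pages are local
print(remote_fraction(serial_init, lambda p: 1))         # all workers on node 1: all remote
```

The last case is the one described above: a buffer initialized on node 0 and used only from node 1 is remote on every access.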
Memory Policies & numactl¶
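- MPOL_DEFAULT: First-touch (allocate on the local node). MPOL_BIND: Only listed nodes; allocation fails if they are full. MPOL_PREFERRED: Prefer one node; fall back to others. MPOL_INTERLEAVE: Round-robin across nodes (useful for bandwidth-bound workloads).
- set_mempolicy sets the policy for the calling thread (inherited by child processes); mbind sets it for an address range (VMA), and MPOL_MF_MOVE migrates existing pages.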
- numactl: --cpunodebind=0 --membind=0 ./app (bind CPU and memory to node 0); --interleave=all (interleave across all nodes); --preferred=1 (prefer node 1).
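How the four policies differ in where pages land can be sketched with a toy allocator over per-node free-page counts. This is a simplified model: the function and policy strings are hypothetical, and the fallback order is an assumption that does not mirror the kernel's zonelist logic.

```python
def allocate(policy, num_pages, free, cpu_node=0, nodes=()):
    """Return the node chosen for each page under a simplified policy model.

    free: {node: free_pages}; nodes: node list for bind/preferred/interleave.
    """
    free = dict(free)  # do not mutate the caller's counts
    placement = []
    for i in range(num_pages):
        if policy == "default":        # local node first, then any other node
            candidates = [cpu_node] + [n for n in sorted(free) if n != cpu_node]
        elif policy == "bind":         # listed nodes only, no fallback
            candidates = list(nodes)
        elif policy == "preferred":    # preferred node first, then any other node
            candidates = [nodes[0]] + [n for n in sorted(free) if n != nodes[0]]
        elif policy == "interleave":   # rotate the starting node per page
            start = i % len(nodes)
            candidates = list(nodes[start:]) + list(nodes[:start])
        else:
            raise ValueError(f"unknown policy: {policy}")

        node = next((n for n in candidates if free.get(n, 0) > 0), None)
        if node is None:
            raise MemoryError(f"{policy}: no free page on allowed nodes")
        free[node] -= 1
        placement.append(node)
    return placement


free_pages = {0: 4, 1: 4}
print(allocate("default", 3, free_pages, cpu_node=1))
print(allocate("preferred", 6, free_pages, nodes=(1,)))
print(allocate("interleave", 4, free_pages, nodes=(0, 1)))
```

Requesting five pages with `allocate("bind", 5, free_pages, nodes=(0,))` raises `MemoryError` in this model, which mirrors the "fail if full" behaviour of MPOL_BIND, whereas the preferred policy spills onto the other node.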
libnuma & AutoNUMA¶
- libnuma: numa_node_of_cpu(sched_getcpu()), numa_alloc_onnode(), numa_bind(), numa_set_membind(), numa_set_interleave_mask().
- AutoNUMA: Kernel migrates "hot" pages to the node of the accessing CPU. Overhead (scan, TLB shootdown, migration); can cause tail latency spikes. Recommendation: Disable (sysctl kernel.numa_balancing=0) on RT and latency-sensitive inference; use explicit numactl/mbind.
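Whether automatic NUMA balancing is currently active can be checked by reading the sysctl file. A minimal sketch, assuming the standard procfs location `/proc/sys/kernel/numa_balancing`; the function name is hypothetical, and any nonzero value is treated as enabled.

```python
def numa_balancing_state(path="/proc/sys/kernel/numa_balancing"):
    """Return "enabled", "disabled", or "unavailable" for automatic NUMA balancing."""
    try:
        with open(path) as f:
            value = int(f.read().strip())
    except (OSError, ValueError):
        # File missing (kernel built without it, non-Linux host) or unreadable.
        return "unavailable"
    return "disabled" if value == 0 else "enabled"


print(numa_balancing_state())
```

A deployment script for a latency-sensitive service could use this as a startup check and log a warning when the state is "enabled".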
Multi-GPU & CPU–GPU Affinity¶
- Each GPU is attached to one socket's PCIe root complex. CPU↔GPU traffic from the other socket crosses the interconnect. Use nvidia-smi topo -m to see the topology; bind the process and its memory to the node that owns the GPU(s) used.
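The owning node of a PCIe device is also exposed in sysfs, so a process can pin itself next to its GPU at startup. The sketch below assumes Linux and the standard `numa_node` and `cpulist` sysfs attributes; the function names and the example PCI address are hypothetical. It sets CPU affinity only; memory allocated afterwards then lands locally through first-touch.

```python
import os


def pci_numa_node(bdf, pci_root="/sys/bus/pci/devices"):
    """NUMA node of a PCI device, or None if unknown (sysfs reports -1)."""
    try:
        with open(os.path.join(pci_root, bdf, "numa_node")) as f:
            node = int(f.read().strip())
    except (OSError, ValueError):
        return None
    return node if node >= 0 else None


def parse_cpulist(text):
    """Parse a sysfs cpulist such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def cpus_of_node(node, node_root="/sys/devices/system/node"):
    with open(os.path.join(node_root, f"node{node}", "cpulist")) as f:
        return parse_cpulist(f.read())


def pin_to_device_node(bdf):
    """Pin the calling process to the CPUs of the device's node; returns the node or None."""
    node = pci_numa_node(bdf)
    if node is None:
        return None
    cpus = cpus_of_node(node)
    if not cpus:
        return None  # memory-only node: nothing to pin to
    os.sched_setaffinity(0, cpus)
    return node


# Hypothetical PCI address; take the real one from lspci or nvidia-smi.
print(pci_numa_node("0000:65:00.0"))
```

Calling `pin_to_device_node(...)` before allocating buffers and loading weights keeps both the threads and their first-touched pages on the GPU's node.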
Summary Tables¶
DMA: Coherent = uncached/coherent, no sync. Streaming = map (flush for TO_DEVICE) → device uses → unmap (invalidate for FROM_DEVICE); CPU must not touch the buffer between map and unmap.
IOMMU: IOVA→PA; per-device/group; restricts DMA; VFIO exposes to userspace for passthrough.
NUMA: Local vs remote latency/bandwidth; first-touch; bind/preferred/interleave; numactl; disable AutoNUMA for RT.
AI Hardware Connection¶
- DMA-BUF + V4L2: Zero-copy camera → GPU; fd over IPC. Streaming DMA direction must be correct (TO_DEVICE / FROM_DEVICE) to avoid corruption.
- IOMMU: Security and isolation for GPU/VFIO; required for safe passthrough.
- NUMA: Pin inference and weights to the same node as the GPU; first-touch on the wrong node can roughly halve effective bandwidth. Use numastat -p to verify placement; disable AutoNUMA on inference servers.
- Multi-GPU: Place processes and allocations on the node that owns the target GPU; use topology tools to avoid cross-socket PCIe.