Lecture 19: Modern I/O: io_uring, DMA-BUF & Zero-Copy Pipelines¶
Overview¶
Every time your program asks the OS to read or write data, it pays a tax: a context switch into kernel mode, a data copy between kernel and user memory, and often a wait for slow hardware. At modest data rates this tax is invisible. At AI-system scale — millions of I/O operations per second, gigabytes per second of camera frames, continuous GPU inference — this overhead consumes a significant fraction of available CPU. This lecture addresses that problem.
The mental model to carry through this lecture is the copy chain: data originates in hardware (a sensor, a NIC, a storage device), and the goal is to get it to the GPU without the CPU ever touching it unnecessarily. Every copy, every syscall, and every context switch is a potential elimination target. io_uring eliminates syscall overhead for storage I/O. DMA-BUF eliminates copies between kernel subsystems. Zero-copy techniques like sendfile, DPDK, and VisionIPC eliminate copies at every remaining stage.
AI hardware engineers need to understand these mechanisms because the bottleneck in a real-time perception pipeline is often not the GPU — it is the data path feeding the GPU. A camera frame that spends 0.7 ms being copied across the PCIe bus is a frame that arrived 0.7 ms late to inference.
Traditional I/O Limitations¶
The legacy POSIX I/O model imposes overhead that becomes a bottleneck in high-throughput AI data pipelines.
- read()/write(): each operation requires at least 2 syscalls (initiate + complete) plus a kernel-to-userspace data copy
- select()/poll(): O(n) scanning of file descriptor sets; degrades linearly with fd count
- epoll: eliminates the O(n) scan but still requires one syscall per event notification; context-switch cost accumulates at high IOPS
At 1M IOPS (NVMe throughput), syscall overhead alone can consume 30–50% of CPU cycles.
Key Insight: The POSIX I/O API was designed for correctness and portability, not for throughput. Each read() call is a round-trip: the CPU drops what it is doing, enters kernel mode, copies data, then returns. At high IOPS the CPU spends more time on this overhead than on actual work.
Think of it like a warehouse where every single box must be personally handed to a supervisor (kernel), who then hands it to the delivery driver (userspace). At low volumes this is fine. At a million boxes per second, the supervisor becomes the bottleneck.
io_uring (Linux 5.1+)¶
io_uring replaces per-operation syscalls with shared memory ring buffers visible to both kernel and userspace simultaneously.
Ring Buffer Architecture¶
Two rings reside in memory mapped into both kernel and userspace:
- SQE ring (Submission Queue Entry): application writes operation descriptors here; kernel reads them
- CQE ring (Completion Queue Entry): kernel writes completion status here; application polls without a syscall
The key insight is that both the application and the kernel can read and write these rings directly — no syscall crossing required for normal operation.
┌─────────────────────────────────────────────────────────────┐
│ Shared Memory Region │
│ │
│ SQ Ring (Submission Queue) CQ Ring (Completion) │
│ ┌──────────────────────────┐ ┌──────────────────────┐ │
│ │ SQE[0]: read fd=5 │ │ CQE[0]: res=512 │ │
│ │ SQE[1]: write fd=7 │ │ CQE[1]: res=0 │ │
│ │ SQE[2]: fsync fd=5 │ │ CQE[2]: res=512 │ │
│ │ SQE[3]: (empty) │ │ CQE[3]: (empty) │ │
│ └──────────────────────────┘ └──────────────────────┘ │
│ ↑ app writes here ↑ kernel writes here │
│ ↓ kernel reads here ↓ app polls here │
└─────────────────────────────────────────────────────────────┘
Userspace Kernel
sees both rings ←── mmap ──→ sees both rings
Key Insight: The SQ and CQ rings live in memory that is simultaneously visible to both userspace and the kernel. The application never needs to "hand data to the kernel" — it just writes to a shared slot. This is the fundamental reason io_uring can reach zero syscalls in its most aggressive configuration.
Submission and Completion Flow¶
Understanding each step in this sequence is essential for tuning io_uring performance:
- Application calls io_uring_prep_read(): fills one SQE slot with the operation descriptor (fd, buffer, length, offset). No kernel involvement yet — this is a pure userspace write to shared memory.
- Application calls io_uring_submit(): this may call io_uring_enter() (one syscall for a batch of N operations) — or in SQPOLL mode, the kernel thread picks up the SQE without any syscall at all.
- Kernel processes the operation asynchronously: I/O is dispatched to the block layer, network stack, or file system. The calling thread is free to do other work.
- Kernel writes a CQE to the CQ ring: the completion result (bytes read, error code) is placed in the next available CQ slot. This is a write into shared memory — no interrupt to userspace required.
- Application polls the CQ ring: the app checks the CQ head pointer. If a new CQE is present, it reads the result directly. No syscall required for completion.
With IORING_SETUP_SQPOLL, a dedicated kernel thread continuously drains the SQ ring. The submit path becomes zero-syscall. The application only polls the CQ ring.
Common Pitfall: Forgetting to call io_uring_cqe_seen() after processing a completion entry. This advances the CQ ring head pointer. Without it, the ring fills up, new completions are dropped (unless IORING_FEAT_NODROP is set, which applies backpressure instead), and the application stalls.
Key Flags and Features¶
| Flag / Feature | Meaning |
|---|---|
| IORING_SETUP_SQPOLL | Kernel thread auto-submits; zero-syscall submit path |
| IORING_SETUP_IOPOLL | Kernel polls for completions (no IRQ); lowest latency |
| IORING_FEAT_NODROP | CQEs never dropped; backpressure instead of loss |
| Fixed buffers | Pre-registered buffers skip get_user_pages() per op |
| Multishot | Single SQE generates multiple CQEs (e.g., accept loop) |
Supported Operations¶
read, write, send, recv, accept, connect, fsync, splice, openat, statx, timeout, linkat, symlinkat, renameat, unlinkat
The breadth of supported operations is significant: io_uring is not just a storage optimization — it can replace nearly every blocking I/O syscall in a network or file server.
liburing Helper Library¶
#include <liburing.h>

struct io_uring ring;
io_uring_queue_init(256, &ring, 0); // init with depth 256 SQEs
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring); // get a free SQE slot
io_uring_prep_read(sqe, fd, buf, len, 0); // fill the SQE descriptor
io_uring_sqe_set_data(sqe, user_data_ptr); // tag with user context pointer
io_uring_submit(&ring); // single syscall submits ALL pending SQEs
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe); // blocks until one CQE ready (or poll)
// cqe->res contains bytes read, or negative errno on error
io_uring_cqe_seen(&ring, cqe); // advance CQ head — MUST call this
The io_uring_submit() call is batched: all SQEs prepared since the last submit are sent in a single syscall. Applications that pipeline many reads or writes before calling submit achieve near-zero syscall overhead.
Performance Comparison¶
| Method | IOPS (NVMe 4K rand read) | CPU utilization | Syscalls per op |
|---|---|---|---|
| pread() | ~200K | high | 1 |
| epoll + aio | ~600K | moderate | 2 |
| io_uring (default) | ~800K | low | ~0.1 (batched) |
| io_uring + SQPOLL | 1M+ | near-zero | 0 |
Key Insight: The jump from pread() at 200K IOPS to io_uring + SQPOLL at 1M+ IOPS is almost entirely CPU overhead elimination, not hardware speedup. The NVMe SSD is the same in all rows. What changes is how much CPU time is wasted on syscalls and copies.
DMA-BUF: Cross-Subsystem Buffer Sharing¶
With io_uring handling syscall overhead for storage, the next bottleneck is copying data between kernel subsystems — for example, between a camera driver and a GPU. DMA-BUF solves this.
DMA-BUF provides a file descriptor abstraction for memory buffers that multiple kernel subsystems and userspace can share without copying.
- One subsystem allocates a buffer and exports it as an fd via dma_buf_export() → dma_buf_fd()
- Another subsystem imports the fd via dma_buf_get() and maps it into its own DMA address space
- Userspace passes fds between processes via sendmsg() or direct API calls
Think of a DMA-BUF fd as a key to a locker. The camera driver puts data in the locker. The GPU driver opens the same locker with the same key. No data is moved — only the key is passed.
┌──────────────┐ exports fd ┌─────────────────────────────────┐
│ GPU Allocator│ ─────────────→ │ DMA-BUF Object │
│ (nvmap/ION) │ │ (physical pages in device mem) │
└──────────────┘ └──────────────────────────────────┘
↑ ↑
imports │ │ imports
┌─────────┘ └──────────┐
┌────┴─────┐ ┌───────────┴──┐
│ V4L2 │ │ CUDA kernel │
│ camera │ │ (device ptr) │
│ DMA eng. │ └──────────────┘
└──────────┘
ISP writes GPU reads
directly here directly here
↕ NO CPU COPY ↕
V4L2 + DMA-BUF Integration¶
- V4L2_MEMORY_DMABUF: V4L2 buffer type that accepts external DMA-BUF fds
- Application exports a DMA-BUF fd from a GPU allocator (nvmap, ION, or DRM allocator)
- Passes the fd to the camera driver via VIDIOC_QBUF with m.fd set
- Camera DMA engine writes the captured frame directly to GPU-accessible memory
- VIDIOC_EXPBUF: export a V4L2 MMAP buffer as a DMA-BUF fd for import by another device
Key Insight: V4L2_MEMORY_DMABUF inverts the normal flow. Instead of the camera driver allocating its own buffer and then copying out, the application provides a buffer that the camera engine writes into directly. The application chooses a buffer that is also visible to the GPU, making the copy physically impossible (there is only one copy of the data, in one location).
Zero-Copy Camera to Inference Pipeline (Jetson)¶
Now we combine io_uring and DMA-BUF into the full zero-copy pipeline. This section shows the concrete benefit: eliminating 1–2 memory copies from every camera frame.
Traditional pipeline: Camera → DMA → kernel buffer → copy to userspace → copy to GPU (2 extra copies).
Zero-copy pipeline via DMA-BUF:
- GPU allocates the buffer: cudaMalloc() → nvmap handle → dma_buf_fd(). The buffer lives in GPU-accessible memory from the start.
- Camera (V4L2) is given the buffer fd: VIDIOC_QBUF with V4L2_MEMORY_DMABUF; the fd is passed to the camera driver, which now knows exactly where to write the frame.
- Camera ISP DMA engine writes directly into that GPU-mapped buffer: the camera hardware DMAs the captured frame into the GPU buffer. No CPU is involved in this transfer.
- CUDA inference: the buffer is already in GPU memory; no cudaMemcpy() is needed. The CUDA kernel reads the frame from the pointer allocated in step 1.
- Access via cudaGraphicsMapResources() or a direct device pointer: standard CUDA APIs work normally; they just happen to operate on data that arrived without any CPU-side copy.
Latency reduction: eliminates 1–2 CPU-GPU memcpy operations. At PCIe Gen3 bandwidth (~12 GB/s), copying a 1080p RGBA frame (~8 MB) costs ~0.7 ms. At 30 fps with 3 cameras, eliminated copies save significant memory bandwidth.
Common Pitfall: A common mistake is allocating the camera buffer with malloc() and only later trying to import it via DMA-BUF. The buffer must be allocated through a GPU-visible allocator (nvmap, ION, or cudaMallocManaged) from the start. A malloc()'d buffer has no DMA-BUF handle and cannot be passed to the camera driver as a target.
splice and sendfile: Kernel-to-Kernel Zero-Copy¶
Once data is in the kernel, splice and sendfile allow it to be moved between kernel subsystems without ever surfacing in userspace.
Both calls move data between file descriptors without copying to userspace:
- sendfile(out_fd, in_fd, &offset, count): copy from a file or socket to a socket inside the kernel; used for HTTP video streaming and static file serving
- splice(fd_in, off_in, fd_out, off_out, len, flags): more general; works with pipes; can chain across calls
- tee(fd_in, fd_out, len, flags): duplicate pipe data without consuming it
Use case: camera recording server sending H.264 frames over HTTP without a userspace buffer copy.
Key Insight: sendfile is the reason a web server can serve a 1 GB video file at near-wire speed using almost no CPU. The data travels NVMe → page cache → NIC DMA, all without crossing the user/kernel boundary. The server process issues a single syscall and the hardware handles the rest.
DPDK: User-Space Network Driver¶
For inference-serving scenarios where network I/O must match GPU throughput, even kernel network stack overhead becomes unacceptable. DPDK eliminates it entirely.
DPDK (Data Plane Development Kit) bypasses the kernel network stack entirely:
- User-space PMD (Poll Mode Driver) owns the NIC; no kernel driver, no interrupts
- Huge pages (2 MB/1 GB) for packet buffers: eliminates TLB misses at line rate
- CPU core polling at 100% — achieves 100 Gbps with a single core
- No context switches, no syscalls, no socket buffer copies
- Used in inference-serving front-ends where network I/O must match GPU throughput
Tools: dpdk-testpmd for benchmarking; rte_mbuf for zero-copy packet buffers; rte_ring for inter-core packet passing.
Common Pitfall: DPDK dedicates a CPU core to 100% polling. This is intentional — it is the price of near-zero-latency network I/O. Running DPDK on a shared core, or alongside other workloads on that core, destroys its latency guarantees. DPDK cores must be isolated with isolcpus in the kernel boot parameters.
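A typical isolation setup looks like the fragment below. The core numbers are illustrative — pick cores on the NUMA node closest to the NIC; `nohz_full` and `rcu_nocbs` are commonly added alongside `isolcpus` to keep timer ticks and RCU callbacks off the polling cores.

```shell
# /etc/default/grub — illustrative: reserve cores 2-3 for DPDK polling
GRUB_CMDLINE_LINUX_DEFAULT="... isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3"

# after update-grub and a reboot, run the PMD only on the isolated cores, e.g.:
#   dpdk-testpmd -l 2,3 -n 4
```

With this in place the scheduler never places ordinary threads on cores 2-3, so the poll loop's latency is bounded by the hardware, not by preemption.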
VisionIPC: openpilot Zero-Copy Video IPC¶
The final piece of the zero-copy puzzle is passing video frames between processes without copying. VisionIPC demonstrates this at a production level.
openpilot replaces socket-based video transfer with shared memory:

- vipc_server allocates a pool of shared memory buffers at startup via mmap(MAP_SHARED)
- camerad (producer): fills a buffer, posts the buffer index to consumers via semaphore
- modeld and encoderd (consumers): receive the index, map the same memory region, read directly
- No video data is copied between processes; only a small integer (the buffer index) is communicated
- cereal handles all other (non-video) IPC via capnproto over msgq (also shared memory)
┌────────────┐ fills buffer[N] ┌─────────────────────────────┐
│ camerad │ ──────────────────→│ Shared Memory Buffer Pool │
│ (producer) │ posts index N │ buf[0]: 1920×1208 YUV │
└────────────┘ via semaphore │ buf[1]: 1920×1208 YUV │
↓ │ buf[2]: 1920×1208 YUV │
┌────────┴────────┐ └─────────────────────────────┘
│ │ ↑ ↑
┌────▼─────┐ ┌───────▼────┐ reads reads
│ modeld │ │ encoderd │ buffer[N] buffer[N]
│(inference│ │(H.265 enc.)│ directly directly
└──────────┘ └────────────┘
GPU kernel encoder DMA
reads buf[N] reads buf[N]
NO COPY NO COPY
Key Insight: VisionIPC shows the real-world application of every concept in this lecture. The camera fills a buffer once. Multiple consumers — the neural network inference engine and the video encoder — read from that same buffer. The GPU processes the frame in place. No data is ever duplicated. This is achievable because OS shared memory primitives (mmap(MAP_SHARED), DMA-BUF) allow multiple subsystems to reference the same physical memory simultaneously.
Summary¶
| I/O Method | Syscalls per op | Data copy | Max throughput | Use case |
|---|---|---|---|---|
| read()/write() | 1 | kernel to user | ~200K IOPS | Simple file I/O |
| epoll + callbacks | 1 per event | kernel to user | ~600K IOPS | Network servers |
| io_uring batched | ~0.1 (batched) | kernel to user | ~800K IOPS | NVMe logging |
| io_uring SQPOLL | 0 | kernel to user | 1M+ IOPS | Ultra-low latency |
| sendfile | 1 | none (in-kernel) | line rate | Video streaming |
| DMA-BUF + V4L2 | 0 (DMA) | none | hardware DMA rate | Camera to GPU |
| DPDK | 0 | none | 100 Gbps | Inference serving |
| VisionIPC | 0 (shared mem) | none | memory bandwidth | openpilot camera IPC |
Conceptual Review¶
- What is the fundamental cost of a traditional read() call? Two mode switches (user→kernel→user) plus one data copy from the kernel buffer to the user buffer. At 1M IOPS this overhead can consume 30–50% of available CPU cycles.
- How does io_uring's ring buffer eliminate syscalls? By placing the submission and completion queues in memory that is simultaneously mapped into both kernel and user address spaces, the application can post work and read results without ever entering kernel mode. With SQPOLL, a kernel thread drains the submission ring continuously.
- What problem does DMA-BUF solve that io_uring does not? io_uring reduces syscall overhead for existing I/O paths. DMA-BUF eliminates entire copies between kernel subsystems — the camera driver, GPU driver, and display driver can all reference the same physical buffer via a file descriptor, so data written by one is immediately readable by another without any copy.
- Why does the zero-copy camera-to-GPU pipeline require the GPU to allocate the buffer, not the camera driver? The buffer must be visible to GPU hardware (mapped into the GPU address space). Only the GPU allocator (nvmap, ION) produces a buffer with both a DMA-BUF fd (for the camera driver) and a CUDA device pointer (for inference). If the camera driver allocates the buffer, it lives in kernel memory with no GPU mapping.
- What is the trade-off of DPDK's polling model? A dedicated CPU core runs at 100% utilization continuously. This eliminates all interrupt and context-switch overhead, achieving line-rate packet processing. The cost is one full CPU core permanently consumed. This is acceptable in inference-serving systems where the NIC core is paired with many GPU cores.
- How does VisionIPC prevent multiple processes from racing on the same buffer? A semaphore (posted by camerad after filling a buffer) signals readiness. Consumers read from the shared memory region indexed by the buffer number received via the semaphore. The buffer is not returned to the pool until all consumers acknowledge, preventing camerad from overwriting a buffer still in use by modeld or encoderd.
AI Hardware Connection¶
- DMA-BUF with V4L2_MEMORY_DMABUF enables true zero-copy camera-to-GPU pipelines on Jetson; frames arrive in GPU memory directly from the ISP with no CPU involvement in the data path
- io_uring with IORING_SETUP_SQPOLL is applicable to high-throughput CAN bus and sensor logging in openpilot, achieving 1M+ IOPS with near-zero CPU cost
- VisionIPC demonstrates the OS-level shared memory design that eliminates inter-process video copies in production autonomous driving software
- DPDK is used in cloud inference front-ends where network I/O at 100 Gbps must be matched to GPU throughput without kernel overhead
- sendfile enables camera stream relay servers to forward encoded video to clients with zero userspace buffer involvement
- Understanding the full copy chain (camera ISP → kernel buffer → userspace → GPU) is a prerequisite for diagnosing latency in any AI perception pipeline