Lecture 4: System Calls, vDSO & eBPF¶
Overview¶
Every time user-space code needs the kernel to do something — open a file, allocate memory, send a packet, talk to a GPU — it crosses the hardware privilege boundary via a system call. The core challenge this lecture addresses is: how does this boundary crossing work, how expensive is it, and how can you observe what is happening inside the kernel without modifying source code? The mental model is a toll booth: every syscall is a controlled crossing from user space into kernel space, with a fixed cost whether the work takes 1 ns or 1 ms. For an AI hardware engineer, this matters because a 200 fps camera pipeline can spend measurable CPU time just crossing the toll booth for V4L2 ioctls, and eBPF is the instrument you use to measure and diagnose the kernel's internal behavior in production without changing a single line of application code.
The System Call Interface¶
System calls are the only sanctioned path for user-space code to request kernel services. User space cannot directly access hardware registers, allocate physical memory, or change scheduler policy — it asks the kernel via the syscall ABI.
Call Path — x86-64¶
Syscall Path — x86-64
┌─────────────────────────────────────────────────────────┐
│ User Space (Ring 3) │
│ │
│ Application code │
│ │ calls read(fd, buf, len) │
│ ▼ │
│ glibc wrapper │
│ │ mov $0, %eax (syscall number for read = 0) │
│ │ mov fd, %rdi (first argument) │
│ │ mov buf, %rsi (second argument) │
│ │ mov len, %rdx (third argument) │
│ │ SYSCALL ← hardware mode switch │
└──────┼──────────────────────────────────────────────────┘
│ Ring 3 → Ring 0 (hardware enforced)
┌──────┼──────────────────────────────────────────────────┐
│ Kernel Space (Ring 0) │
│ ▼ │
│ entry_SYSCALL_64 │
│ │ saves registers to kernel stack │
│ │ looks up sys_call_table[RAX] │
│ ▼ │
│ sys_read() │
│ │ does actual work (VFS, page cache, driver) │
│ │ return value → RAX │
│ │ SYSRET ← hardware mode switch back │
└──────┼──────────────────────────────────────────────────┘
│ Ring 0 → Ring 3 (hardware enforced)
┌──────┼──────────────────────────────────────────────────┐
│ User Space (Ring 3) │
│ ▼ │
│ glibc wrapper returns to application │
└─────────────────────────────────────────────────────────┘
ARM64 uses SVC #0 → VBAR_EL1 + 0x400 → el0_svc → sys_call_table[x8]() → ERET.
strace -c ./inference_app # count syscalls by type and cumulative time
strace -T -e mmap,ioctl ./camerad # trace specific calls with per-call duration
Syscall Overhead¶
| Cost component | Typical penalty |
|---|---|
| Mode switch (SYSCALL/SYSRET) | 50–150 ns |
| Spectre/Meltdown mitigations (IBRS, retpoline, KPTI) | 50–200 ns |
| TLB and cache effects (KPTI flushes user-space TLB entries) | 20–100 ns |
| Total round-trip (minimal kernel work) | 100–400 ns |
ARM64 mitigations (CSV2, SSBS) are lighter than x86. At 200 fps across 2 cameras, each frame requiring 4 V4L2 ioctls = 1600 syscalls/s × 300 ns = 0.5 ms/s pure mode-switch overhead. Batching and zero-copy (mmap, io_uring) reduce crossing frequency.
Key Insight: Spectre and Meltdown mitigations (KPTI, IBRS, retpoline) roughly doubled syscall overhead on x86 compared to pre-2018 systems. The kernel flushes or isolates page table entries on each user↔kernel transition to prevent speculative execution from leaking kernel memory. ARM64 mitigations are architecturally lighter — one reason embedded AI platforms often prefer ARM for latency-sensitive workloads. If you are porting code from x86 benchmarks, do not assume syscall overhead is similar.
Now that we understand the cost of crossing the privilege boundary, let's look at how the kernel avoids that crossing for the most frequently called time functions.
vDSO: Kernel Calls Without Kernel Entry¶
The virtual Dynamic Shared Object (vDSO) is a read-only ELF page mapped by the kernel into every process address space at startup. Selected time functions read from a shared vvar data page that the kernel updates — no SYSCALL instruction, no mode transition, no TLB flush.
Think of the vDSO as a memo that the kernel leaves in your address space: "Here is the current time, updated continuously by the kernel. Read it directly without asking me."
| Function | Full syscall | vDSO |
|---|---|---|
| `clock_gettime(CLOCK_MONOTONIC)` | ~200 ns | ~10–20 ns |
| `clock_gettime(CLOCK_REALTIME)` | ~200 ns | ~10–20 ns |
| `gettimeofday()` | ~200 ns | ~10–20 ns |
| `getcpu()` | ~150 ns | ~5 ns |
Calling through glibc automatically uses the vDSO when available — no application change required. Use CLOCK_MONOTONIC for interval and inter-process timing (it never jumps backward, though NTP may slew its rate); use CLOCK_MONOTONIC_RAW when you want the raw hardware counter with no rate adjustment at all.
# Verify vDSO is mapped
cat /proc/[pid]/maps | grep vdso
# Check which symbols are exported
nm /proc/[pid]/map_files/[vdso-range] 2>/dev/null | grep " T "
Common Pitfall:
`CLOCK_REALTIME` is adjusted by NTP and can jump backward. For timestamping sensor data or measuring inter-frame intervals, always use `CLOCK_MONOTONIC` (or `CLOCK_MONOTONIC_RAW` to also exclude PTP/NTP rate adjustments). Using `CLOCK_REALTIME` for sensor fusion timestamps creates subtle synchronization bugs when NTP makes a correction mid-run.
Key Syscalls for AI and Embedded Systems¶
Understanding the most important syscalls for AI hardware work means understanding what happens under the hood of every V4L2 camera operation, GPU command submission, and zero-copy buffer transfer.
mmap / munmap¶
/* Zero-copy shared memory between processes */
int fd = shm_open("/sensor_ring", O_CREAT | O_RDWR, 0600);
/* shm_open creates a named file under /dev/shm, backed by tmpfs */
void *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
/* MAP_SHARED: writes are visible to all processes that mapped this fd */
/* No copy is ever made — all processes share the same physical pages */
/* Huge page mapping for large GPU staging buffer */
void *huge = mmap(NULL, 2*1024*1024, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
/* MAP_HUGETLB: allocate 2 MB huge pages instead of 4 KB pages */
/* Reduces TLB pressure: 1 TLB entry covers 2 MB instead of 4 KB */
/* Requires vm.nr_hugepages to be pre-allocated in /proc/sys/vm/ */
- `MAP_HUGETLB` reduces TLB pressure for large staging buffers; requires `vm.nr_hugepages`
- CUDA `cudaMallocHost()` calls `mmap` on `/dev/nvidia*` for pinned host memory
- On Jetson unified memory, `cudaMalloc` uses `mmap` on `/dev/nvhost-as-gpu`
ioctl¶
Primary device control interface; nearly all hardware-specific operations use it.
Syscall Path for V4L2 Camera Buffer Dequeue
┌─────────────┐
│ camerad │ ioctl(fd, VIDIOC_DQBUF, &buf)
└──────┬──────┘
│ SYSCALL (ioctl number)
▼
┌─────────────────────────────────────────────────────────┐
│ sys_ioctl() → vfs_ioctl() → v4l2_ioctl() │
│ → video_device.ioctl_ops.vidioc_dqbuf() │
│ → driver's dequeue function │
│      → blocks in TASK_INTERRUPTIBLE until a frame arrives │
│ → returns buffer with frame pointer │
└─────────────────────────────────────────────────────────┘
│ SYSRET
▼
┌─────────────┐
│ camerad │ buf.m.userptr now points to frame data
└─────────────┘
| ioctl | Device | Purpose |
|---|---|---|
| `VIDIOC_QBUF` / `VIDIOC_DQBUF` | V4L2 camera | Queue / dequeue capture buffer |
| `VIDIOC_STREAMON` / `STREAMOFF` | V4L2 | Start / stop streaming |
| `DRM_IOCTL_GEM_*` | DRM/KMS | Buffer allocation, display scanout |
| `NVGPU_IOCTL_CHANNEL_ALLOC_GPFIFO` | NVIDIA GPU (`/dev/nvhost-gpu`) | Allocate GPU command queue |
| `RPMSG_CREATE_EPT_IOCTL` | RPMsg | IPC with Cortex-M coprocessor (i.MX8) |
| Custom `_IOWR(MAGIC, N, struct)` | FPGA PCIe driver | Submit inference workload to accelerator |
epoll¶
O(1) per-event multiplexing across many file descriptors. Used in openpilot's cereal messaging layer to multiplex CAN frames, camera V4L2 events, IMU data, and model output events.
int epfd = epoll_create1(EPOLL_CLOEXEC);
/* EPOLL_CLOEXEC: close the epoll fd automatically on exec() */
struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = camera_fd };
/* EPOLLET: edge-triggered — notify once when data arrives, not repeatedly */
epoll_ctl(epfd, EPOLL_CTL_ADD, camera_fd, &ev);
int n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
/* blocks until at least one fd is ready; returns number of ready events */
Edge-triggered mode (EPOLLET) is preferred for latency-sensitive paths — one wakeup per edge, no repeated notifications for unread data.
prctl¶
PR_SET_NAME Name thread; visible in ps/top/htop/perf
PR_SET_TIMERSLACK Reduce timer coalescing (set to 1 ns for RT threads; default 50 µs)
PR_SET_SECCOMP Apply seccomp-BPF filter to current thread
PR_SET_NO_NEW_PRIVS Prevent privilege escalation across exec()
Key Insight:
`PR_SET_TIMERSLACK` is a hidden source of timing jitter. The default 50 µs timer slack allows the kernel to coalesce nearby timer expirations to save power. An RT thread waiting for a 1 ms timer may be woken up to 50 µs late. Setting `prctl(PR_SET_TIMERSLACK, 1)` (1 ns) disables coalescing for that thread and removes this source of jitter. This is standard practice for `controlsd` and any CAN write thread on openpilot.
sched_setattr — SCHED_DEADLINE¶
struct sched_attr attr = {
    .size           = sizeof(attr),
    .sched_policy   = SCHED_DEADLINE,
    .sched_runtime  = 5000000,   /* 5 ms CPU budget per period — consumed before descheduling */
    .sched_deadline = 16666666,  /* 16.7 ms relative deadline — work must finish by this time */
    .sched_period   = 16666666,  /* 16.7 ms period — one activation per period (60 fps) */
};
/* struct sched_attr comes from <linux/sched/types.h>; older glibc has no
 * wrapper, so call via syscall(SYS_sched_setattr, 0, &attr, 0) if needed */
sched_setattr(0, &attr, 0);      /* 0 = self; requires CAP_SYS_NICE */
perf_event_open¶
Accesses hardware performance counters from userspace. Used by perf stat, Nsight Systems, and VTune.
struct perf_event_attr pe = {
    .type     = PERF_TYPE_HARDWARE,
    .size     = sizeof(pe),
    .config   = PERF_COUNT_HW_CACHE_MISSES,  /* LLC miss counter */
    .disabled = 1,                           /* start disabled; enable manually */
};
/* glibc provides no wrapper — invoke the raw syscall;
 * pid=0, cpu=-1 → measure this process on any CPU */
int fd = syscall(SYS_perf_event_open, &pe, 0, -1, -1, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);     /* start counting */
/* ... inference workload runs here ... */
long long count;
read(fd, &count, sizeof(count));         /* read accumulated miss count */
On Jetson, ARM PMU counters measure LLC miss rate during DNN inference — guides INT8 tiling decisions.
memfd_create¶
Creates an anonymous file-backed memory region — usable as shared memory without a filesystem path. Used in openpilot's msgq for zero-copy IPC between modeld and controlsd.
int fd = memfd_create("shared_tensor", MFD_CLOEXEC);
/* Creates an anonymous file in kernel memory — no filesystem path needed */
ftruncate(fd, TENSOR_SIZE);
/* Sets the file size; backing pages are allocated lazily on first access */
void *ptr = mmap(NULL, TENSOR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
/* Map the shared region into this process's address space */
/* pass fd over Unix socket to second process for its own mmap */
/* both processes now share the same physical pages — zero-copy IPC */
This pattern allows modeld to write inference outputs directly into a shared buffer that controlsd reads without any copy. The fd is passed over a Unix domain socket using SCM_RIGHTS.
Common Pitfall:
`memfd_create` buffers live in RAM (tmpfs). If the inference tensor is large (e.g., a 50 MB feature map), creating and destroying shared tensors at 30 fps consumes significant RAM bandwidth from repeated page faults. Pre-allocate the memfd buffers once at startup and reuse them, rather than creating new ones per frame.
eBPF: Kernel-Attached Verified Programs¶
eBPF programs run inside the kernel with verified safety: the kernel verifier checks bounds, loop termination, and pointer types before JIT-compiling to native code. No kernel module, no recompile. Think of eBPF as a read-only microscope you can attach to any kernel function at runtime — it observes without modifying.
Architecture¶
eBPF Program Lifecycle
┌──────────────────────────────────────────────────────────────┐
│ Development │
│ C source → clang + libbpf → BPF bytecode (.o file) │
└───────────────────────────────┬──────────────────────────────┘
│ load via bpf() syscall
▼
┌──────────────────────────────────────────────────────────────┐
│ Kernel Verification │
│ → bounds checker (no out-of-bounds memory access) │
│ → loop termination verifier (no infinite loops) │
│ → pointer type checker (no arbitrary kernel ptr dereference)│
│ → JIT compile to native code │
└───────────────────────────────┬──────────────────────────────┘
│ attach to hook
▼
┌──────────────────────────────────────────────────────────────┐
│ Runtime │
│ Event fires (syscall / tracepoint / kprobe / XDP packet) │
│ → eBPF program runs in kernel context │
│ → writes results to BPF maps (ring buffer / hash / array) │
└───────────────────────────────┬──────────────────────────────┘
│ user-space reads maps
▼
┌──────────────────────────────────────────────────────────────┐
│ User Space │
│ bpftrace / BCC tool reads BPF maps │
│ → latency histograms, counts, traces │
└──────────────────────────────────────────────────────────────┘
Attachment Points¶
| Hook | Use |
|---|---|
| `kprobe` / `kretprobe` | Any kernel function entry/return — driver internals |
| `tracepoint` | Stable kernel tracepoints: `sched:sched_switch`, `irq:irq_handler_entry`, `block:block_rq_issue` |
| `uprobe` | User-space function: TensorRT engine execution, glibc allocations |
| `XDP` | Pre-stack packet processing in driver context — 100GbE line-rate filtering |
| `TC` (traffic control) | Post-stack egress/ingress; used for latency tagging |
BCC / bpftrace Production Tools¶
runqlat # scheduler run-queue latency histogram
# First tool to run when inference latency is inconsistent
offcputime -p $(pgrep modeld) 10 # where modeld spends time blocked off-CPU
# Shows what kernel function modeld is sleeping in and for how long
# Count context switches per process
bpftrace -e 'tracepoint:sched:sched_switch { @[comm] = count(); }'
# High counts on modeld → investigate CFS preemption; consider SCHED_FIFO
# Trace ioctl latency from camerad
bpftrace -e '
tracepoint:syscalls:sys_enter_ioctl /comm == "camerad"/ { @s[tid] = nsecs; }
tracepoint:syscalls:sys_exit_ioctl /comm == "camerad"/ {
@us = hist((nsecs - @s[tid]) / 1000); delete(@s[tid]); }'
# Outputs histogram of VIDIOC_DQBUF latency in microseconds
execsnoop # trace new process executions
opensnoop # trace file opens (useful to find what config camerad loads)
biolatency # block I/O latency histogram (model loading from NVMe)
Key Insight: eBPF is uniquely powerful because it attaches to production systems without any source code changes, recompilation, or restart. You can instrument `modeld` running on a Jetson in the field, collect a latency histogram of `VIDIOC_DQBUF` calls, and detach the probe — all while the inference pipeline continues running. This is impossible with traditional profilers that require either source instrumentation or stopping the process.

Common Pitfall: eBPF programs that use `kprobe` hooks are fragile across kernel versions — kernel internal function names and signatures change between releases. Use `tracepoint` hooks instead when possible; tracepoints are stable ABIs (see `Documentation/trace/tracepoints.rst`). The `syscalls:sys_enter_ioctl` tracepoint works on any modern distribution kernel.
seccomp: Syscall Filtering¶
A BPF program evaluated on every syscall; returns ALLOW, ERRNO(N), KILL, or TRAP. Applied with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog).
Docker's default seccomp profile blocks ~44 syscalls (kexec_load, ptrace, mount, unshare, etc.). A TensorRT inference container needs fewer than 50 syscalls; a tight allowlist eliminates the rest. Prevents post-exploit lateral movement in containerized AI deployments.
seccomp Decision Flow
┌─────────────┐
│ Application│ calls read(fd, buf, len)
└──────┬──────┘
│ SYSCALL
▼
┌──────────────────────────────────┐
│ seccomp-BPF filter │
│ checks syscall number against │
│ the program's allow-list │
│ │
│ read (0)? → ALLOW → continue │
│ ptrace(101)? → KILL │
│ kexec_load? → ERRNO(EPERM) │
└──────────────────────────────────┘
Linux Capabilities¶
Root privilege is split into ~40 fine-grained grants. Drop capabilities after setup.
| Capability | Grants |
|---|---|
| `CAP_SYS_NICE` | `sched_setscheduler()`, `sched_setattr()` — set RT priority without root |
| `CAP_IPC_LOCK` | `mlockall()` — lock all memory pages; eliminates RT page-fault latency |
| `CAP_NET_ADMIN` | Network configuration, raw sockets, eBPF TC programs |
| `CAP_SYS_RAWIO` | Direct hardware I/O, PCIe MMIO access, FPGA register writes |
| `CAP_PERFMON` | `perf_event_open()` with hardware counters |
setcap cap_sys_nice+ep /opt/inference/modeld # grant RT capability; no sudo at runtime
grep Cap /proc/$(pgrep modeld)/status # inspect effective capability mask
Key Insight: `setcap` writes the capability into the ELF binary's extended attributes. When `modeld` executes, the kernel reads these attributes and grants the capabilities without root access. This is safer than running `modeld` as root because the process only has the specific capabilities it needs — it cannot, for example, mount filesystems (`CAP_SYS_ADMIN`) or load kernel modules (`CAP_SYS_MODULE`). Principle of least privilege applied to system calls.

Common Pitfall: `setcap` grants are lost when a binary is replaced (e.g., by a software update). A Makefile or install script that copies a new binary without re-applying `setcap` will silently break RT scheduling. Always include the `setcap` call in the install step and verify with `getcap /opt/inference/modeld` after deployment.
Summary¶
| Mechanism | Kernel entry? | Latency | Primary use |
|---|---|---|---|
| Syscall (SYSCALL / SVC) | Yes | 100–400 ns | All kernel services |
| vDSO (`clock_gettime`) | No | 10–20 ns | High-frequency timestamping |
| `mmap` (after setup) | No (page faults only) | Sub-ns when cached | Zero-copy buffers, shared memory |
| eBPF (JIT, kernel hook) | Runs in kernel | < 1 µs per probe | Production profiling, syscall filtering |
| seccomp filter | Per-syscall check | 5–20 ns overhead | Sandbox; attack-surface reduction |
| vDSO `getcpu()` | No | ~5 ns | Determine current CPU for lockless rings |
Conceptual Review¶
- Why does crossing from user space to kernel space cost 100–400 ns? The CPU must save registers, switch page tables (KPTI), flush TLB entries, and execute Spectre/Meltdown mitigations on x86. The actual kernel work may be trivial, but the transition itself has unavoidable hardware costs. ARM64 mitigations are architecturally lighter, which is one reason embedded AI platforms prefer ARM.
- What is the vDSO and why is it faster than a syscall? The vDSO is a kernel-maintained ELF page mapped into every process. It contains functions like `clock_gettime` that read from a shared `vvar` page the kernel updates continuously. No `SYSCALL` instruction is needed — the function executes entirely in user space, reducing latency from ~200 ns to ~15 ns.
- What is the difference between `epoll` level-triggered and edge-triggered modes? Level-triggered: `epoll_wait` returns as long as there is data available (can return the same fd multiple times). Edge-triggered: `epoll_wait` returns only when new data arrives (one notification per event). Edge-triggered is preferred for high-throughput paths like V4L2 events because it avoids repeated wakeups when the consumer is slower than the producer.
- What does eBPF's kernel verifier check? It verifies that the program has no out-of-bounds memory accesses, will always terminate (no infinite loops), and does not dereference arbitrary kernel pointers. Only programs that pass all checks are JIT-compiled and attached. This is what makes eBPF safe to run in production kernel context without source changes.
- Why use `memfd_create` instead of POSIX shared memory (`shm_open`)? `memfd_create` files are anonymous — they have no filesystem path and are automatically cleaned up when all file descriptors referencing them are closed. `shm_open` creates a file in `/dev/shm` that persists after process exit (until explicitly unlinked), which can leak shared memory between runs.
- What is seccomp and why does it matter for inference containers? seccomp filters the syscalls a process is allowed to make. A TensorRT inference container only needs ~50 syscalls (`mmap`, `ioctl`, `read`, `write`, `epoll_wait`, etc.). Blocking everything else (including `ptrace`, `kexec_load`, `mount`) means a compromised inference process cannot escalate privilege or persist on the system — the kernel refuses the dangerous syscall before any exploit code runs.
AI Hardware Connection¶
- vDSO `clock_gettime(CLOCK_MONOTONIC)` provides µs-accurate sensor timestamping at ~15 ns per call — essential for synchronizing camera frames, IMU samples, and CAN messages in openpilot's sensor fusion pipeline without kernel entry overhead.
- `ioctl(VIDIOC_DQBUF)` is the hot path in every V4L2 camera pipeline; measuring its latency distribution with `bpftrace` directly quantifies camera-to-model input delay without modifying or recompiling `camerad`.
- `bpftrace runqlat` is the first diagnostic for inference latency spikes — 5 ms scheduler run-queue latency on the model thread appears immediately in the histogram without code changes or kernel module insertion.
- `CAP_SYS_NICE` via `setcap` allows `modeld` to set its own RT scheduling policy without running as root, compatible with container security policies and Kubernetes `securityContext.capabilities`.
- seccomp allowlists on TensorRT inference containers block syscalls like `ptrace`, `kexec_load`, and `mount` that are meaningless for inference but provide significant post-exploit paths in AV deployments.
- `perf_event_open` with ARM PMU events on Jetson Orin measures LLC miss rate during DNN inference — cache miss data guides INT8 quantization and layer tiling decisions to reduce the memory working set below the L2/L3 cache size.