Track A §5 — Runtime & Driver Development for FPGA Accelerators
Connect your FPGA accelerator to the host — write the runtime, drivers, and memory management that make hardware usable from software.
Prerequisites: Track A §1–2 (Vivado, Zynq PS/PL, AXI), Phase 1 §3 (Operating Systems — drivers, memory, system calls), Phase 1 §4 (C++).
Layer mapping: Layer 3 (Runtime & Driver) of the AI chip stack. This module bridges Layer 6 (your RTL/HLS accelerator) with Layer 1 (the AI framework or application calling into it).
Role targets: FPGA Runtime Engineer · Accelerator Driver Developer · SoC Platform Engineer · Embedded Linux BSP Engineer
Why runtime & driver matters for FPGA
You can design the fastest accelerator in RTL or HLS — but without a runtime and driver, no software can use it. This module teaches you to:
- Move data between the host CPU and the FPGA accelerator via DMA
- Manage device memory (DDR attached to the PL, BRAM, on-chip buffers)
- Build a user-space API that applications call to submit inference jobs
- Write or extend Linux kernel drivers for your custom IP
1. Xilinx Runtime (XRT) Architecture
- XRT overview:
  - User-space library (`libxrt_core`) + kernel drivers (`xocl`/`zocl`).
  - Supported platforms: Alveo (PCIe), Zynq/MPSoC (embedded), Versal.
  - `xclbin` container format: bitstream + metadata + kernel signatures.
- Host programming model:
  - `xrt::device`, `xrt::bo` (buffer object), `xrt::kernel`, `xrt::run`.
  - Synchronous and asynchronous execution models.
  - Buffer allocation: device memory, host-only, host-device shared.
  - Memory bank assignment and topology awareness.
- OpenCL on FPGA (Vitis flow):
  - How Vitis wraps HLS kernels as OpenCL kernels.
  - `cl::Buffer`, `cl::Kernel`, `cl::CommandQueue`, and how they map to XRT underneath.
  - When to use the XRT native API vs the OpenCL API.
Projects:
* Write an XRT host application that loads an `xclbin`, allocates input/output buffers, runs a vector-add kernel, and verifies results.
* Profile the same kernel with `xbutil examine` and `vitis_analyzer`. Identify DMA transfer time vs compute time.
2. DMA and Memory Management
- AXI DMA engine:
  - Memory-mapped-to-stream (MM2S) and stream-to-memory-mapped (S2MM) channels.
  - Scatter-gather DMA: descriptor chains for non-contiguous transfers.
  - Simple DMA vs SG DMA: trade-offs in latency, flexibility, and CPU overhead.
- Buffer management patterns:
  - Double-buffering / ping-pong: overlap compute and transfer.
  - Zero-copy: mapping device memory into user space (`mmap` via the driver).
  - CMA (Contiguous Memory Allocator) for large physically contiguous buffers on Zynq.
- Memory hierarchy on Zynq/Alveo:
  - PS DDR ↔ PL DDR ↔ BRAM ↔ UltraRAM.
  - AXI SmartConnect / interconnect bandwidth and arbitration.
  - Cache coherency between PS (ARM) and PL: HPC ports, ACE, snoop control unit.
- IOMMU/SMMU:
  - Virtual addresses for DMA on Zynq UltraScale+ (ARM SMMU).
  - Why this matters: memory isolation, security, multi-tenant FPGA.
Projects:
* Implement a DMA transfer between PS and PL on Zynq using the AXI DMA IP. Measure throughput for different buffer sizes (1 KB–16 MB).
* Implement double-buffering: while the accelerator processes buffer A, the DMA engine fills buffer B. Measure the throughput improvement.
3. Linux Kernel Driver Development for FPGA
- Platform device driver (Zynq):
  - Device tree binding: `compatible`, `reg`, `interrupts`.
  - `probe()`/`remove()` lifecycle.
  - Memory-mapped I/O: `ioremap()`, `readl()`/`writel()` for control registers.
- Interrupt handling:
  - Registering interrupt handlers (`request_irq`).
  - Top-half / bottom-half (tasklet, workqueue) for DMA completion.
  - Waiting for accelerator completion: `wait_for_completion()` / poll.
- Character device interface:
  - Exposing the accelerator to user space via `/dev/myaccel`.
  - `file_operations`: `open`, `release`, `ioctl`, `mmap`, `poll`.
  - `ioctl` command design: submit job, query status, configure parameters.
- UIO (Userspace I/O) driver:
  - Lightweight alternative: map registers and interrupts to user space.
  - When UIO is sufficient vs when a full kernel driver is needed.
  - `generic-uio` device tree overlay for quick prototyping.
- PCIe driver (Alveo):
  - BAR (Base Address Register) mapping.
  - MSI-X interrupts.
  - XRT's `xocl` driver as a reference implementation.
Projects:
* Write a minimal platform driver for a custom AXI-Lite IP on Zynq. Expose register read/write via `/dev` + `ioctl`.
* Add interrupt-driven DMA completion to your driver. Compare latency vs polling.
* Use UIO to control an HLS accelerator from user-space Python. Benchmark against the kernel driver path.
4. User-Space Runtime Library
- Runtime API design:
  - Model the API after familiar patterns: `create_context()`, `allocate_buffer()`, `submit()`, `sync()`.
  - Asynchronous execution: command queues, events, callbacks.
  - Error handling: hardware timeouts, DMA errors, ECC faults.
- Building a C++ runtime:
  - RAII wrappers for device handles, buffers, and kernel objects.
  - Thread-safe job submission with `std::future` / completion callbacks.
  - Memory pool: pre-allocate device buffers to avoid per-inference allocation overhead.
- Python bindings:
  - `pybind11` wrapper over the C++ runtime.
  - NumPy ↔ device buffer zero-copy transfer.
  - Integration point: PyTorch custom operator or ONNX Runtime execution provider.
- Profiling and observability:
  - Timestamps at DMA start, DMA end, compute start, compute end.
  - Hardware performance counters (if designed into the RTL).
  - Exposing metrics via sysfs or the runtime API.
Projects:
* Build a C++ runtime library for your HLS matmul accelerator: `init()`, `malloc_device()`, `submit_matmul()`, `sync()`, `free()`.
* Add Python bindings. Run a PyTorch model where the matmul is offloaded to the FPGA via your runtime.
* Add profiling: measure the per-layer latency breakdown (host overhead, DMA, compute).
5. Inference Runtime Integration
- Vitis AI runtime:
  - DPU (Deep Learning Processing Unit) integration on Zynq/Alveo.
  - Model compilation: Vitis AI quantizer → compiler → `xmodel`.
  - `vart` (Vitis AI Runtime): C++/Python APIs for running compiled models.
- FINN runtime:
  - FINN-generated accelerators: stitched IP with AXI-Stream interfaces.
  - `finn-rtlib`: Python runtime for FINN accelerators.
  - Throughput vs latency trade-offs with FINN's streaming architecture.
- PYNQ overlay model:
  - Load bitstream + driver from Python.
  - Allocate buffers with `pynq.allocate()` (CMA-backed).
  - Quick prototyping path for ML accelerators.
- TVM on FPGA:
  - TVM's VTA (Versatile Tensor Accelerator) as a reference FPGA target.
  - BYOC (Bring Your Own Codegen) for offloading subgraphs to an FPGA accelerator (connection to Track C).
Projects:
* Deploy a quantized CNN on the Vitis AI DPU (Zynq or Alveo). Measure FPS and compare with a CPU baseline.
* Run a FINN-generated binary CNN accelerator. Profile streaming throughput.
* Use PYNQ to prototype: load a custom HLS overlay, run inference from Jupyter, visualize latency.
Relationship to Other Modules
| This module (A §5) | Connects to |
|---|---|
| DMA and memory management | A §2 (Zynq PS/PL, AXI interconnect) |
| Kernel driver development | Phase 1 §3 (OS — drivers, interrupts, memory) |
| User-space runtime | A §4 (HLS — the accelerator you're driving) |
| Inference runtime (Vitis AI, FINN) | Phase 3 (Neural Networks — model to deploy) |
| TVM/BYOC on FPGA | Track C (ML Compiler — custom backend) |
Build Summary
| Section | Hands-on deliverable |
|---|---|
| §1 XRT | XRT host app + profiling |
| §2 DMA | AXI DMA throughput benchmark, double-buffering |
| §3 Kernel driver | Platform driver with interrupt-driven DMA |
| §4 User-space runtime | C++ runtime + Python bindings for HLS accelerator |
| §5 Inference runtime | Vitis AI DPU deployment, FINN streaming, PYNQ prototype |