Skip to content

Track A §5 — Runtime & Driver Development for FPGA Accelerators

Connect your FPGA accelerator to the host — write the runtime, drivers, and memory management that make hardware usable from software.

Prerequisites: Track A §1–2 (Vivado, Zynq PS/PL, AXI), Phase 1 §3 (Operating Systems — drivers, memory, system calls), Phase 1 §4 (C++).

Layer mapping: Layer 3 (Runtime & Driver) of the AI chip stack. This module bridges Layer 6 (your RTL/HLS accelerator) with Layer 1 (the AI framework or application calling into it).

Role targets: FPGA Runtime Engineer · Accelerator Driver Developer · SoC Platform Engineer · Embedded Linux BSP Engineer


Why runtime & driver matters for FPGA

You can design the fastest accelerator in RTL or HLS — but without a runtime and driver, no software can use it. This module teaches you to: - Move data between host CPU and FPGA accelerator via DMA - Manage device memory (DDR attached to PL, BRAM, on-chip buffers) - Build a user-space API that applications call to submit inference jobs - Write or extend Linux kernel drivers for your custom IP


1. Xilinx Runtime (XRT) Architecture

  • XRT overview:

    • User-space library (libxrt_core) + kernel driver (xocl / zocl).
    • Supported platforms: Alveo (PCIe), Zynq/MPSoC (embedded), Versal.
    • xclbin container format: bitstream + metadata + kernel signatures.
  • Host programming model:

    • xrt::device, xrt::bo (buffer object), xrt::kernel, xrt::run.
    • Synchronous and asynchronous execution models.
    • Buffer allocation: device memory, host-only, host-device shared.
    • Memory bank assignment and topology awareness.
  • OpenCL on FPGA (Vitis flow):

    • How Vitis wraps HLS kernels as OpenCL kernels.
    • cl::Buffer, cl::Kernel, cl::CommandQueue — mapping to XRT underneath.
    • When to use XRT native API vs OpenCL API.

Projects: * Write an XRT host application that loads an xclbin, allocates input/output buffers, runs a vector-add kernel, and verifies results. * Profile the same kernel with xbutil examine and vitis_analyzer. Identify DMA transfer time vs compute time.


2. DMA and Memory Management

  • AXI DMA engine:

    • Memory-mapped (MM2S) and stream (S2MM) channels.
    • Scatter-gather DMA: descriptor chains for non-contiguous transfers.
    • Simple DMA vs SG DMA: trade-offs (latency, flexibility, CPU overhead).
  • Buffer management patterns:

    • Double-buffering / ping-pong: overlap compute and transfer.
    • Zero-copy: mapping device memory into user-space (mmap via driver).
    • CMA (Contiguous Memory Allocator) for large physically-contiguous buffers on Zynq.
  • Memory hierarchy on Zynq/Alveo:

    • PS DDR ↔ PL DDR ↔ BRAM ↔ UltraRAM.
    • AXI SmartConnect / interconnect bandwidth and arbitration.
    • Cache coherency between PS (ARM) and PL: HPC ports, ACE, snoop control unit.
  • IOMMU/SMMU:

    • Virtual addresses for DMA on Zynq UltraScale+ (ARM SMMU).
    • Why this matters: memory isolation, security, multi-tenant FPGA.

Projects: * Implement a DMA transfer between PS and PL on Zynq using AXI DMA IP. Measure throughput for different buffer sizes (1KB–16MB). * Implement double-buffering: while the accelerator processes buffer A, DMA fills buffer B. Measure throughput improvement.


3. Linux Kernel Driver Development for FPGA

  • Platform device driver (Zynq):

    • Device tree binding: compatible, reg, interrupts.
    • probe() / remove() lifecycle.
    • Memory-mapped I/O: ioremap(), readl() / writel() for control registers.
  • Interrupt handling:

    • Registering interrupt handlers (request_irq).
    • Top-half / bottom-half (tasklet, workqueue) for DMA completion.
    • Waiting for accelerator completion: wait_for_completion() / poll.
  • Character device interface:

    • Exposing accelerator to user-space via /dev/myaccel.
    • file_operations: open, release, ioctl, mmap, poll.
    • ioctl command design: submit job, query status, configure parameters.
  • UIO (Userspace I/O) driver:

    • Lightweight alternative: map registers + interrupt to user-space.
    • When UIO is sufficient vs when a full kernel driver is needed.
    • generic-uio device tree overlay for quick prototyping.
  • PCIe driver (Alveo):

    • BAR (Base Address Register) mapping.
    • MSI-X interrupts.
    • XRT's xocl driver as reference implementation.

Projects: * Write a minimal platform driver for a custom AXI-Lite IP on Zynq. Expose register read/write via /dev + ioctl. * Add interrupt-driven DMA completion to your driver. Compare latency vs polling. * Use UIO to control an HLS accelerator from user-space Python. Benchmark vs kernel driver path.


4. User-Space Runtime Library

  • Runtime API design:

    • Model the API after familiar patterns: create_context(), allocate_buffer(), submit(), sync().
    • Asynchronous execution: command queues, events, callbacks.
    • Error handling: hardware timeouts, DMA errors, ECC faults.
  • Building a C++ runtime:

    • RAII wrappers for device handles, buffers, and kernel objects.
    • Thread-safe job submission with std::future / completion callbacks.
    • Memory pool: pre-allocate device buffers to avoid allocation overhead per inference.
  • Python bindings:

    • pybind11 wrapper over C++ runtime.
    • NumPy ↔ device buffer zero-copy transfer.
    • Integration point: PyTorch custom operator or ONNX Runtime execution provider.
  • Profiling and observability:

    • Timestamps at DMA start, DMA end, compute start, compute end.
    • Hardware performance counters (if designed into RTL).
    • Exposing metrics via sysfs or runtime API.

Projects: * Build a C++ runtime library for your HLS matmul accelerator: init(), malloc_device(), submit_matmul(), sync(), free(). * Add Python bindings. Run a PyTorch model where the matmul is offloaded to FPGA via your runtime. * Add profiling: measure per-layer latency breakdown (host overhead, DMA, compute).


5. Inference Runtime Integration

  • Vitis AI runtime:

    • DPU (Deep Processing Unit) integration on Zynq/Alveo.
    • Model compilation: Vitis AI quantizer → compiler → xmodel.
    • vart (Vitis AI Runtime): C++/Python APIs for running compiled models.
  • FINN runtime:

    • FINN-generated accelerators: stitched IP with AXI-Stream interfaces.
    • finn-rtlib: Python runtime for FINN accelerators.
    • Throughput vs latency trade-offs with FINN's streaming architecture.
  • PYNQ overlay model:

    • Load bitstream + driver from Python.
    • Allocate buffers with pynq.allocate() (CMA-backed).
    • Quick prototyping path for ML accelerators.
  • TVM on FPGA:

    • TVM's VTA (Versatile Tensor Accelerator) as reference FPGA target.
    • BYOC for offloading subgraphs to FPGA accelerator (connection to Track C).

Projects: * Deploy a quantized CNN on Vitis AI DPU (Zynq or Alveo). Measure FPS and compare with CPU baseline. * Run a FINN-generated binary CNN accelerator. Profile streaming throughput. * Use PYNQ to prototype: load a custom HLS overlay, run inference from Jupyter, visualize latency.


Relationship to Other Modules

This module (A §5) Connects to
DMA and memory management A §2 (Zynq PS/PL, AXI interconnect)
Kernel driver development Phase 1 §3 (OS — drivers, interrupts, memory)
User-space runtime A §4 (HLS — the accelerator you're driving)
Inference runtime (Vitis AI, FINN) Phase 3 (Neural Networks — model to deploy)
TVM/BYOC on FPGA Track C (ML Compiler — custom backend)

Build Summary

Section Hands-on deliverable
§1 XRT XRT host app + profiling
§2 DMA AXI DMA throughput benchmark, double-buffering
§3 Kernel driver Platform driver with interrupt-driven DMA
§4 User-space runtime C++ runtime + Python bindings for HLS accelerator
§5 Inference runtime Vitis AI DPU deployment, FINN streaming, PYNQ prototype