Track B §8 — Runtime & Driver Development for GPU/Jetson Inference¶
Own the software stack between the compiled model and the GPU hardware — CUDA runtime, TensorRT engine execution, DLA scheduling, and Linux driver interfaces.
Prerequisites: Track B §1 (Jetson Platform, CUDA basics), Phase 1 §3 (Operating Systems — drivers, memory management), Phase 1 §4 (C++/CUDA).
Layer mapping: Layer 3 (Runtime & Driver) of the AI chip stack. This module connects Layer 2 (compiler output — TensorRT engines, CUDA kernels) to Layer 5 (GPU/DLA hardware) through the driver and runtime software.
Role targets: GPU/Accelerator Runtime Engineer · CUDA Runtime Engineer · Inference Platform Engineer · Embedded Linux BSP Engineer (Jetson)
Why runtime & driver matters for Jetson/GPU¶
The compiler produces optimized kernels and execution plans. The runtime makes them actually run: managing GPU memory, scheduling kernels across CUDA cores and DLA engines, handling multi-stream concurrency, and communicating with the Linux kernel driver. Understanding this layer is essential for hitting latency and throughput targets on real hardware.
1. CUDA Runtime & Driver Architecture¶
-
Two-level CUDA API:
- Runtime API (
cudart):cudaMalloc,cudaMemcpy,cudaLaunchKernel, streams, events. - Driver API (
cuda):cuCtxCreate,cuModuleLoad,cuLaunchKernel— lower-level, explicit context management. - When to use which: runtime API for most work; driver API for multi-context, dynamic module loading, or building frameworks.
- Runtime API (
-
CUDA context and streams:
- Context = per-GPU state (address space, modules, streams).
- Streams = ordered queues of GPU work. Default stream vs explicit streams.
- Multi-stream concurrency: overlap compute, DMA (H2D, D2H), and peer transfer.
- Events for synchronization and timing.
-
Memory management:
- Device memory (
cudaMalloc), pinned host memory (cudaMallocHost), unified memory (cudaMallocManaged). - Async memcpy with streams:
cudaMemcpyAsync. - Memory pools (
cudaMallocAsync/cudaMemPool) — reduce allocation overhead. - Jetson unified memory architecture: GPU and CPU share LPDDR5 — implications for zero-copy vs explicit copy.
- Device memory (
-
Kernel launch mechanics:
- Grid → block → thread hierarchy. Occupancy calculator.
- Launch configuration:
<<<grid, block, shared_mem, stream>>>. - Cooperative groups, dynamic parallelism (advanced).
Projects:
* Write a multi-stream pipeline on Jetson: stream 1 does H2D + kernel A, stream 2 does kernel B + D2H, with event-based synchronization. Measure overlap with nsys.
* Compare unified memory (cudaMallocManaged) vs explicit copy (cudaMalloc + cudaMemcpy) for a CNN inference workload on Jetson Orin Nano. Profile with Nsight Systems.
2. TensorRT Runtime Deep Dive¶
-
Engine lifecycle:
- Build phase:
IBuilder→INetworkDefinition→IBuilderConfig→ICudaEngine. - Serialization: save/load engine (
.engine/.planfile). Why engines are hardware-specific. - Execution:
IExecutionContext— one engine, multiple contexts for concurrent inference.
- Build phase:
-
Memory management in TensorRT:
- Bindings: input/output tensor addresses. Static vs dynamic shapes.
- Workspace memory: scratch space for tactics. Sizing trade-offs.
- Device memory: pre-allocate all inference buffers; avoid per-inference allocation.
-
Execution and scheduling:
- Enqueue-based async execution:
context->enqueueV3(stream). - Multi-stream inference: separate streams for different models or batch sizes.
- DLA integration:
config->setDeviceType(layer, DeviceType::kDLA). - DLA + GPU split: layers that DLA doesn't support fall back to GPU; runtime manages handoff.
- Enqueue-based async execution:
-
Plugins:
IPluginV2DynamicExtinterface: custom layers not natively supported.- Plugin lifecycle:
getOutputDimensions,enqueue,serialize. - When to write a plugin vs when to rely on TensorRT's fusion.
-
Dynamic shapes and batching:
- Optimization profiles: min/opt/max dimensions for each input.
- Runtime shape specification before each inference.
- Dynamic batching in serving: collect requests → pad to batch → infer → split results.
Projects: * Build a TensorRT engine for YOLOv8 on Jetson Orin Nano with INT8 quantization. Benchmark FPS at different batch sizes (1, 4, 8). * Write a TensorRT plugin for a custom activation function. Integrate into the engine and verify correctness. * Implement a dual-engine pipeline: detection model on GPU, classification model on DLA, running concurrently with separate CUDA streams.
3. DLA (Deep Learning Accelerator) Runtime¶
-
DLA architecture on Orin:
- Two DLA engines (NVDLA-based) on Orin Nano.
- Supported layers: Conv, deconv, pooling, LRN, batch norm, element-wise, softmax.
- Limitations: no custom ops, limited dynamic shapes, restricted concat/split behavior.
-
DLA programming model:
- Compile-time: TensorRT marks layers for DLA during engine build.
- Runtime: TensorRT runtime dispatches DLA-assigned layers to DLA engine via
nvhostdriver. - Fallback: unsupported layers run on GPU; data transfer between DLA and GPU memory.
-
DLA driver stack:
nvhost-nvdlakernel driver: submits DLA tasks, manages DLA memory.nvdla_runtimefirmware on DLA engine.- NVDLA open-source reference: github.com/nvdla — study for understanding DLA runtime concepts.
-
DLA performance tuning:
- Maximize DLA-resident layers: fewer GPU↔DLA transitions.
- DLA-friendly model design: avoid unsupported ops, prefer standard convolutions.
- Power efficiency: DLA is more power-efficient than GPU for supported ops.
Projects:
* Take a MobileNetV2 and deploy with setDeviceType(kDLA) for all supported layers. Log which layers fall back to GPU. Optimize the model to maximize DLA coverage.
* Measure power consumption (Jetson tegrastats) for GPU-only vs DLA-only vs mixed inference.
4. NVIDIA Kernel Driver Stack on Jetson¶
-
GPU driver (
nvgpu):- Open-source Tegra GPU driver (not the proprietary desktop driver).
- Device tree configuration: GPU clocks, power domains, memory carve-outs.
nvgpusysfs: frequency scaling, power capping, hardware counters.- How user-space CUDA calls reach
nvgpu:ioctlinterface, channel submission.
-
Video and camera drivers:
nvcsi→vi(Video Input) →isppipeline.- Sensor driver (
v4l2_subdev): I2C register programming, mode tables. - How sensor data flows to GPU memory for inference (zero-copy path via NvBufSurface).
-
Display and multimedia:
tegra-dc/nvdisplay: display controller driver.- NVDEC/NVENC: hardware video decode/encode. GStreamer integration via
nvv4l2decoder.
-
Power management drivers:
- DVFS (Dynamic Voltage and Frequency Scaling):
devfreqframework. - Power domains and clock tree:
clk_tegra,tegra-pmc. - Thermal management:
thermal_zone, throttling policies. nvpmodel: power mode profiles, MAX_N vs 15W vs 7W modes.jetson_clocks: pin max clocks for benchmarking.
- DVFS (Dynamic Voltage and Frequency Scaling):
-
Kernel module development on Jetson:
- Building out-of-tree modules against L4T kernel headers.
- Device tree overlay for custom hardware on carrier board.
- Debugging:
dmesg,ftrace,/sys/kernel/debug.
Projects:
* Write a minimal kernel module that reads GPU utilization from nvgpu sysfs and exposes a /proc/gpu_monitor interface.
* Add a device tree overlay for a custom SPI sensor on your Jetson carrier board. Write a v4l2_subdev stub driver that reads sensor ID over SPI.
* Profile the camera-to-inference pipeline: measure latency from sensor capture to inference result using nsys + kernel tracepoints.
5. Multi-Engine Scheduling & System-Level Runtime¶
-
Concurrent engine execution:
- GPU + DLA + NVDEC + PVA (Programmable Vision Accelerator) running simultaneously.
- Per-engine CUDA streams: avoid serialization across engines.
- Priority scheduling: high-priority streams (
cudaStreamCreateWithPriority) for latency-critical inference.
-
Real-time inference scheduling:
- Frame-rate-driven scheduling: sensor frame → preprocess → infer → postprocess → actuate.
- Deadline-aware execution: drop frames vs queue frames vs adaptive batch.
- CPU-GPU synchronization: minimizing CPU blocking with async APIs + events.
-
Multi-model deployment:
- Multiple TensorRT engines sharing GPU: time-slicing vs MPS (Multi-Process Service).
- Memory partitioning: pre-allocate buffers per model to avoid fragmentation.
- Model switching: engine loading latency, warm-up strategies.
-
DeepStream as system-level runtime:
- GStreamer-based pipeline: decode → preprocess → infer → track → display.
nvinferplugin: wraps TensorRT engine inside a GStreamer element.- Multi-stream video: process N cameras in one pipeline.
- Custom
GstBuffermetadata for passing inference results downstream.
-
Triton Inference Server on Jetson:
- Model repository, dynamic batching, concurrent model execution.
- Backend options: TensorRT, ONNX Runtime, PyTorch, custom C++ backend.
- Jetson deployment considerations: memory constraints, power budget.
Projects:
* Build a 4-camera DeepStream pipeline on Jetson: decode → detect (GPU) → classify (DLA) → track → OSD → display. Measure end-to-end latency.
* Deploy two models (detection + segmentation) on Triton Inference Server on Jetson. Configure dynamic batching and measure throughput under load.
* Implement a priority-based scheduler: safety-critical model gets high-priority stream, telemetry model gets low-priority. Verify with nsys that high-priority is never starved.
Relationship to Other Modules¶
| This module (B §8) | Connects to |
|---|---|
| CUDA runtime & memory | B §1 (Jetson Platform — CUDA basics) |
| TensorRT engine execution | B §5 (Application Development — ML/AI) |
| DLA runtime | B §1 deep dives (DLA architecture, tensor cores) |
| Kernel driver stack | B §3 (L4T Customization — kernel, device tree) |
| Power management | B §4 (FSP — SPE firmware, power control) |
| Inference runtime (Triton, DeepStream) | Phase 3 (Neural Networks — models to deploy) |
| Compiler output → runtime input | Track C (ML Compiler — generates engines/kernels) |
Build Summary¶
| Section | Hands-on deliverable |
|---|---|
| §1 CUDA runtime | Multi-stream pipeline, unified vs explicit memory comparison |
| §2 TensorRT runtime | INT8 engine build, custom plugin, dual-engine pipeline |
| §3 DLA runtime | DLA coverage maximization, power measurement |
| §4 Kernel drivers | GPU monitor module, device tree overlay + sensor driver |
| §5 System runtime | 4-camera DeepStream pipeline, Triton deployment, priority scheduler |