Jetson Orin Nano 8GB: Real-Time and Deterministic Inference Guide¶
Platform Reference¶
| Parameter | Value |
|---|---|
| SoC | NVIDIA T234 (Orin) |
| GPU | 1024-core Ampere (SM 8.7) |
| DLA | 1x NVDLA v2.0 |
| CPU | 6-core Arm Cortex-A78AE |
| Memory | 8 GB LPDDR5 (shared CPU/GPU/DLA) |
| JetPack | 6.x (L4T 36.x), CUDA 12.x, TensorRT 10.x |
1. Introduction¶
Real-time inference on edge hardware demands a fundamentally different engineering mindset than throughput-oriented data-center serving. This guide covers every layer of the stack -- from kernel scheduling to TensorRT engine construction -- required to deliver deterministic, low-latency AI inference on the Jetson Orin Nano 8GB.
1.1 What "Real-Time" Means for Edge AI¶
A system is real-time when the correctness of its output depends on when it is produced, not only on what it produces. A classification result that arrives after a robotic arm has already moved is a wrong result, regardless of its accuracy.
There are two categories:
| Property | Hard Real-Time | Soft Real-Time |
|---|---|---|
| Deadline miss consequence | System failure / safety hazard | Degraded quality of service |
| Typical domain | Autonomous braking, surgical robots | Video analytics, drone tracking |
| Latency budget | 1-10 ms | 10-100 ms |
| Jitter tolerance | < 100 us | < 5 ms |
| Typical guarantee | Worst-case execution time (WCET) | p99 latency bound |
On the Orin Nano, most practical deployments target firm real-time: deadline misses are tolerated rarely (e.g., < 0.1 % of frames) but each miss is logged and may trigger fallback behavior.
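A firm real-time loop needs bookkeeping for that miss budget. A minimal sketch in Python (the `DeadlineMonitor` class and its 0.1 % threshold are illustrative, not part of any NVIDIA API):

```python
class DeadlineMonitor:
    """Track deadline misses for a firm real-time loop (hypothetical helper)."""

    def __init__(self, deadline_ms, max_miss_rate=0.001):
        self.deadline_ms = deadline_ms
        self.max_miss_rate = max_miss_rate
        self.frames = 0
        self.misses = 0

    def record(self, latency_ms):
        """Record one frame; return True if fallback behavior should trigger."""
        self.frames += 1
        if latency_ms > self.deadline_ms:
            self.misses += 1
        return self.miss_rate() > self.max_miss_rate

    def miss_rate(self):
        return self.misses / self.frames if self.frames else 0.0


monitor = DeadlineMonitor(deadline_ms=20.0)
for latency in [8.0, 9.5, 25.0, 7.8]:  # one miss in four frames
    fallback = monitor.record(latency)
print(f"miss rate: {monitor.miss_rate():.2%}")
```

In production the fallback would be something concrete: hold the last valid output, switch to a cheaper model, or signal the control loop to coast.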
1.2 Latency vs Throughput Tradeoffs¶
Throughput-optimized:                 Latency-optimized:
  [B1 B2 B3 B4] --> GPU                 [B1] --> GPU
  large batch, high util                batch=1, lower util
  latency = batch_fill + exec           latency = exec only
Key tradeoffs on Orin Nano:
| Strategy | Throughput (FPS) | p99 Latency (ms) | When to use |
|---|---|---|---|
| Batch=8, FP16 | ~120 | ~35 | Offline video |
| Batch=1, FP16 | ~45 | ~8 | Live camera |
| Batch=1, INT8 | ~70 | ~5 | Safety-critical |
| Batch=1, DLA INT8 | ~30 | ~4 (low jitter) | Deterministic path |
1.3 End-to-End Latency Anatomy¶
Camera capture --> Pre-process --> Inference --> Post-process --> Actuator
     t_cap           t_pre          t_infer        t_post          t_act
   (2-8 ms)        (0.5-2 ms)      (3-15 ms)     (0.3-1 ms)      (0.5-2 ms)
Total end-to-end = t_cap + t_pre + t_infer + t_post + t_act
Typical target: < 20 ms for 50 Hz control loops
The inference time t_infer is usually the largest single contributor, but
capture latency and pre-processing are often underestimated.
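Summing the component ranges shows why the budget is tight: the best case fits a 50 Hz loop comfortably, but the worst case does not, which is why every stage needs attention. A quick check (ranges taken from the diagram above):

```python
# Component latency ranges from the end-to-end anatomy diagram (ms)
budget_ms = {
    "t_cap":   (2.0, 8.0),
    "t_pre":   (0.5, 2.0),
    "t_infer": (3.0, 15.0),
    "t_post":  (0.3, 1.0),
    "t_act":   (0.5, 2.0),
}
best = sum(lo for lo, _ in budget_ms.values())
worst = sum(hi for _, hi in budget_ms.values())
print(f"best case:  {best:.1f} ms")   # comfortably inside 20 ms
print(f"worst case: {worst:.1f} ms")  # misses a 50 Hz deadline
```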
2. Jetson Real-Time Capabilities¶
Deep dive: For kernel-level RT internals — PREEMPT_RT patch architecture, preemption models, interrupt threading, lock primitives under RT, priority inversion and PI mutexes, SCHED_DEADLINE, complete rt-tests suite (cyclictest, hwlatdetect, signaltest, pip_stress), ARM Cortex-A78AE RT specifics, GICv3 tuning, WCET analysis, RT-safe kernel module development, ftrace RT debugging, and production certification — see Orin Nano RT Linux Deep Dive.
2.1 PREEMPT_RT Kernel on Jetson¶
NVIDIA ships a real-time capable kernel starting with JetPack 6.0. The
PREEMPT_RT patch converts most interrupt handlers and kernel locks to
preemptible, schedulable entities.
Checking current kernel preemption model:
# Check if PREEMPT_RT is active
uname -a
# Look for "PREEMPT RT" in the output
cat /sys/kernel/realtime
# Returns "1" if PREEMPT_RT is active
# If using stock JetPack without RT:
zcat /proc/config.gz | grep PREEMPT
# CONFIG_PREEMPT_RT=y <-- full RT
# CONFIG_PREEMPT=y <-- standard preemption (not RT)
Building the RT kernel from source (JetPack 6.x):
# Download L4T kernel source
cd ~/jetson-kernel
tar xf public_sources.tbz2
cd Linux_for_Tegra/source
# Apply PREEMPT_RT patch (included in NVIDIA source)
./generic_rt_build.sh apply
# Configure
export CROSS_COMPILE=aarch64-buildroot-linux-gnu-
export ARCH=arm64
make tegra_defconfig
# Enable RT
scripts/config --enable CONFIG_PREEMPT_RT
scripts/config --disable CONFIG_PREEMPT_VOLUNTARY
scripts/config --enable CONFIG_HIGH_RES_TIMERS
scripts/config --set-val CONFIG_HZ 1000
make -j$(nproc) Image modules dtbs
2.2 Real-Time Scheduling Classes¶
Linux provides several scheduling policies. For real-time inference threads:
Priority (highest to lowest):
SCHED_DEADLINE -- EDF scheduler, specify runtime/deadline/period
SCHED_FIFO -- Fixed priority, no time slicing
SCHED_RR -- Fixed priority, round-robin time slicing
SCHED_OTHER -- CFS (default), not real-time
Setting SCHED_FIFO for an inference thread (C++):
#include <pthread.h>
#include <sched.h>
void set_realtime_priority(pthread_t thread, int priority) {
struct sched_param param;
param.sched_priority = priority; // 1-99, higher = more urgent
    int ret = pthread_setschedparam(thread, SCHED_FIFO, &param);
if (ret != 0) {
fprintf(stderr, "Failed to set RT priority: %s\n", strerror(ret));
fprintf(stderr, "Run with: sudo or set CAP_SYS_NICE\n");
}
}
// Usage in inference thread setup:
void* inference_thread(void* arg) {
set_realtime_priority(pthread_self(), 80);
// ... run inference loop ...
    return nullptr;
}
Setting SCHED_FIFO from the command line:
# Run inference process with FIFO priority 80
sudo chrt -f 80 ./my_inference_app
# Verify scheduling policy of a running process
chrt -p $(pidof my_inference_app)
# Set CPU affinity + RT priority together
sudo taskset -c 4,5 chrt -f 80 ./my_inference_app
Python equivalent using the os module (the scheduler API has been in the standard library since Python 3.3, so no ctypes is needed):
import os
def set_realtime_priority(priority=80):
    """Set SCHED_FIFO for the calling thread (PID 0 = self).
    Requires root or CAP_SYS_NICE."""
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    print(f"Set SCHED_FIFO priority {priority}")
2.3 Clock Precision and Timekeeping¶
For latency measurement, clock source matters:
# Check available clock sources
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
# arch_sys_counter  (the Arm generic timer; tsc is x86-only)
# Check current clock source
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# The Arm arch timer provides ~56 ns resolution on Orin Nano
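From Python, the monotonic clock's reported and empirically observed granularity can be checked directly (values vary by platform; on the Orin Nano the Arm arch timer backs CLOCK_MONOTONIC):

```python
import time

# Resolution the OS reports for the monotonic clock
info = time.get_clock_info("monotonic")
print(f"reported resolution: {info.resolution * 1e9:.0f} ns")

# Empirical: smallest nonzero delta between successive reads
deltas = []
for _ in range(100_000):
    t0 = time.monotonic_ns()
    t1 = time.monotonic_ns()
    if t1 > t0:
        deltas.append(t1 - t0)
print(f"min observed delta: {min(deltas)} ns")
```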
High-resolution timing in C++:
#include <time.h>
// Use CLOCK_MONOTONIC_RAW for jitter-free measurements
// (not adjusted by NTP)
struct timespec get_time() {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
return ts;
}
double elapsed_ms(struct timespec start, struct timespec end) {
return (end.tv_sec - start.tv_sec) * 1000.0 +
(end.tv_nsec - start.tv_nsec) / 1e6;
}
// In inference loop:
auto t0 = get_time();
context->enqueueV3(stream);
cudaStreamSynchronize(stream);
auto t1 = get_time();
printf("Inference: %.3f ms\n", elapsed_ms(t0, t1));
Measuring timer resolution:
# cyclictest -- the standard RT latency benchmark
sudo cyclictest -t1 -p 90 -n -i 1000 -l 10000
# T: 0 Min: 3 Act: 5 Avg: 5 Max: 18
# On PREEMPT_RT Orin Nano, expect:
# Average latency: 5-10 us
# Max latency: < 50 us (after tuning)
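A rough Python analogue of cyclictest's measurement loop -- sleep to an absolute deadline, then record how late the wakeup lands. It cannot match cyclictest's precision (no SCHED_FIFO here, and interpreter overhead inflates the numbers), but it illustrates the method:

```python
import time

def measure_wakeup_jitter(interval_us=1000, loops=2000):
    """Sleep to an absolute deadline each cycle and record how far past
    it the thread actually wakes (microseconds)."""
    overshoots = []
    next_wake = time.monotonic_ns() + interval_us * 1000
    for _ in range(loops):
        delay_ns = next_wake - time.monotonic_ns()
        if delay_ns > 0:
            time.sleep(delay_ns / 1e9)
        overshoots.append((time.monotonic_ns() - next_wake) / 1000)
        next_wake += interval_us * 1000
    overshoots.sort()
    return {"min": overshoots[0],
            "avg": sum(overshoots) / len(overshoots),
            "max": overshoots[-1]}

stats = measure_wakeup_jitter()
print(f"min {stats['min']:.1f}  avg {stats['avg']:.1f}  "
      f"max {stats['max']:.1f} us")
```

Using an absolute deadline (`next_wake += interval`) rather than a relative sleep prevents drift from accumulating, which is the same discipline a real inference loop should follow.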
3. TensorRT Engine Optimization¶
3.1 Building Optimized Engines¶
TensorRT engines are hardware-specific optimized representations of neural networks. On the Orin Nano, engine building exploits the GPU's SM 8.7 capabilities and the available DLA.
Converting an ONNX model to a TensorRT engine:
# Basic FP16 engine build
trtexec \
--onnx=model.onnx \
--saveEngine=model_fp16.engine \
--fp16 \
--memPoolSize=workspace:2048M \
--minShapes=input:1x3x224x224 \
--optShapes=input:1x3x224x224 \
--maxShapes=input:1x3x224x224 \
--verbose
# INT8 engine with calibration cache
trtexec \
--onnx=model.onnx \
--saveEngine=model_int8.engine \
--int8 \
--fp16 \
--calib=calibration_cache.bin \
--memPoolSize=workspace:2048M
# DLA engine (Orin Nano has 1 DLA core: dla0)
trtexec \
--onnx=model.onnx \
--saveEngine=model_dla.engine \
--int8 --fp16 \
--useDLACore=0 \
--allowGPUFallback
3.2 Precision Modes¶
| Precision | Speedup vs FP32 | Accuracy Impact | Memory Reduction | Use Case |
|---|---|---|---|---|
| FP32 | 1x (baseline) | None | None | Debugging |
| FP16 | 1.5-2.5x | Negligible | ~50% | Default production |
| INT8 | 2-4x | 0.1-1% mAP drop | ~75% | Latency-critical |
| INT8 on DLA | 1.5-3x | 0.1-1% mAP drop | ~75% + frees GPU | Deterministic path |
3.3 Layer Fusion¶
TensorRT automatically fuses compatible layers to reduce memory bandwidth and kernel launch overhead. Common fusion patterns on Orin Nano:
Before fusion:                        After fusion:
  Conv2D --> BatchNorm --> ReLU        CBR (single kernel)
  Conv2D --> Add --> ReLU              ConvAddReLU
  FC --> ReLU                          FC_ReLU
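The Conv+BN fusion works because inference-mode BatchNorm is just an affine transform, which can be folded into the convolution's weights and bias ahead of time. A NumPy sketch with a toy 1x1 convolution on a feature vector (all tensors random, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1x1 "convolution" on a feature vector: y = W @ x + b
W = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
# Inference-mode BatchNorm parameters: scale, shift, running stats
gamma = rng.standard_normal(4)
beta = rng.standard_normal(4)
mean = rng.standard_normal(4)
var = rng.random(4) + 0.1
eps = 1e-5
x = rng.standard_normal(3)

# Unfused: convolution, then BatchNorm
y_conv = W @ x + b
y_unfused = gamma * (y_conv - mean) / np.sqrt(var + eps) + beta

# Fused: fold BN's scale/shift into the conv weights and bias
scale = gamma / np.sqrt(var + eps)
W_fused = scale[:, None] * W
b_fused = scale * (b - mean) + beta
y_fused = W_fused @ x + b_fused

assert np.allclose(y_unfused, y_fused)
```

The fused form touches memory once instead of three times, which is exactly the bandwidth saving TensorRT's CBR fusion delivers on real convolutions.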
Inspecting fused layers with trtexec:
trtexec --onnx=model.onnx --fp16 --verbose --dumpLayerInfo 2>&1 | grep -i "fused\|fusion"
# Or use the engine inspector in code:
import tensorrt as trt
logger = trt.Logger(trt.Logger.VERBOSE)
with open("model_fp16.engine", "rb") as f:
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(f.read())
inspector = engine.create_engine_inspector()
# Print layer-by-layer info showing fusion results
print(inspector.get_engine_information(trt.LayerInformationFormat.ONELINE))
3.4 INT8 Calibration¶
INT8 calibration determines per-tensor scale factors by running representative data through the network. This is critical for accuracy.
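To make "scale factor" concrete, here is a minimal max-calibration sketch in plain NumPy. TensorRT's IInt8EntropyCalibrator2 is smarter -- it picks the range that minimizes KL divergence rather than using the raw maximum -- so treat this as a simplification of the idea, not the actual algorithm:

```python
import numpy as np

def int8_scale_maxcalib(activations):
    """Max calibration: map the observed dynamic range onto [-127, 127]."""
    return np.abs(activations).max() / 127.0

def quantize_dequantize(x, scale):
    """Round-trip through INT8 to expose the quantization error."""
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

acts = np.random.default_rng(0).standard_normal(10000).astype(np.float32)
scale = int8_scale_maxcalib(acts)
err = np.abs(acts - quantize_dequantize(acts, scale)).max()
print(f"scale={scale:.5f}, max abs error={err:.5f}")  # error <= scale/2
```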
Implementing a calibration class in Python:
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
class Int8Calibrator(trt.IInt8EntropyCalibrator2):
def __init__(self, data_loader, cache_file="calibration.cache",
batch_size=1):
super().__init__()
self.data_loader = data_loader
self.cache_file = cache_file
self.batch_size = batch_size
        self.current_index = 0
        self.data_iter = iter(data_loader)  # reset() re-creates this iterator
# Pre-allocate device buffer for calibration input
sample = next(iter(data_loader))
self.device_input = cuda.mem_alloc(sample.nbytes)
self.batch_allocation = self.device_input
def get_batch_size(self):
return self.batch_size
def get_batch(self, names):
try:
batch = next(self.data_iter)
cuda.memcpy_htod(self.device_input, batch.ravel())
return [int(self.device_input)]
except StopIteration:
return None
def read_calibration_cache(self):
try:
with open(self.cache_file, "rb") as f:
return f.read()
except FileNotFoundError:
return None
def write_calibration_cache(self, cache):
with open(self.cache_file, "wb") as f:
f.write(cache)
def reset(self):
self.data_iter = iter(self.data_loader)
def build_int8_engine(onnx_path, calibrator):
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("Failed to parse ONNX model")
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16) # Allow FP16 fallback
config.int8_calibrator = calibrator
serialized = builder.build_serialized_network(network, config)
return serialized
3.5 Builder Configuration Best Practices for Orin Nano¶
import tensorrt as trt
def create_optimized_config(builder):
config = builder.create_builder_config()
# Workspace: 2 GB is safe on 8 GB Orin Nano
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)
# Enable FP16 for all layers that benefit
config.set_flag(trt.BuilderFlag.FP16)
# Enable timing cache to speed up subsequent builds
# (engine building on Orin Nano can take 10-30 minutes)
try:
with open("timing.cache", "rb") as f:
timing_cache = config.create_timing_cache(f.read())
except FileNotFoundError:
timing_cache = config.create_timing_cache(b"")
config.set_timing_cache(timing_cache, ignore_mismatch=False)
    # Honor per-layer precision constraints where possible; layers that
    # cannot meet them fall back with a warning
    # (use OBEY_PRECISION_CONSTRAINTS to fail the build instead)
    config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
    # Fail the build if an attached IAlgorithmSelector returns no usable
    # tactics, rather than silently picking a fallback
    config.set_flag(trt.BuilderFlag.REJECT_EMPTY_ALGORITHMS)
return config, timing_cache
4. TensorRT Runtime¶
4.1 Execution Context and CUDA Streams¶
The execution context holds intermediate activation memory. For real-time inference, pre-allocate everything before entering the hot loop.
Complete C++ inference setup:
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <fstream>
#include <vector>
#include <memory>
using namespace nvinfer1;
class Logger : public ILogger {
void log(Severity severity, const char* msg) noexcept override {
if (severity <= Severity::kWARNING)
printf("[TRT] %s\n", msg);
}
} gLogger;
class RTInferenceEngine {
public:
bool loadEngine(const std::string& enginePath) {
// Read serialized engine
std::ifstream file(enginePath, std::ios::binary);
file.seekg(0, std::ios::end);
size_t size = file.tellg();
file.seekg(0, std::ios::beg);
std::vector<char> buffer(size);
file.read(buffer.data(), size);
// Deserialize
auto runtime = std::unique_ptr<IRuntime>(
createInferRuntime(gLogger));
m_engine.reset(runtime->deserializeCudaEngine(
buffer.data(), size));
if (!m_engine) return false;
// Create execution context
m_context.reset(m_engine->createExecutionContext());
if (!m_context) return false;
// Create CUDA stream for inference
cudaStreamCreateWithFlags(&m_stream, cudaStreamNonBlocking);
// Pre-allocate device buffers
allocateBuffers();
return true;
}
void allocateBuffers() {
int numIO = m_engine->getNbIOTensors();
for (int i = 0; i < numIO; i++) {
const char* name = m_engine->getIOTensorName(i);
auto dims = m_engine->getTensorShape(name);
auto dtype = m_engine->getTensorDataType(name);
size_t bytes = volume(dims) * dataTypeSize(dtype);
void* deviceMem;
cudaMalloc(&deviceMem, bytes);
m_context->setTensorAddress(name, deviceMem);
if (m_engine->getTensorIOMode(name) == TensorIOMode::kINPUT) {
m_inputBuffers[name] = deviceMem;
m_inputSizes[name] = bytes;
} else {
m_outputBuffers[name] = deviceMem;
m_outputSizes[name] = bytes;
}
}
}
// Hot path -- zero allocation
bool infer(const void* inputData, void* outputData) {
// Copy input H2D
const char* inputName = m_engine->getIOTensorName(0);
cudaMemcpyAsync(m_inputBuffers[inputName], inputData,
m_inputSizes[inputName],
cudaMemcpyHostToDevice, m_stream);
// Execute
bool status = m_context->enqueueV3(m_stream);
if (!status) return false;
// Copy output D2H
const char* outputName = m_engine->getIOTensorName(1);
cudaMemcpyAsync(outputData, m_outputBuffers[outputName],
m_outputSizes[outputName],
cudaMemcpyDeviceToHost, m_stream);
cudaStreamSynchronize(m_stream);
return true;
}
void warmup(int iterations = 50) {
// Allocate dummy input/output on host
const char* inputName = m_engine->getIOTensorName(0);
const char* outputName = m_engine->getIOTensorName(1);
std::vector<uint8_t> dummyIn(m_inputSizes[inputName], 0);
std::vector<uint8_t> dummyOut(m_outputSizes[outputName], 0);
for (int i = 0; i < iterations; i++) {
infer(dummyIn.data(), dummyOut.data());
}
printf("Warmup complete: %d iterations\n", iterations);
}
~RTInferenceEngine() {
cudaStreamDestroy(m_stream);
for (auto& [name, ptr] : m_inputBuffers) cudaFree(ptr);
for (auto& [name, ptr] : m_outputBuffers) cudaFree(ptr);
}
private:
std::unique_ptr<ICudaEngine> m_engine;
std::unique_ptr<IExecutionContext> m_context;
cudaStream_t m_stream;
std::map<std::string, void*> m_inputBuffers;
std::map<std::string, void*> m_outputBuffers;
std::map<std::string, size_t> m_inputSizes;
std::map<std::string, size_t> m_outputSizes;
size_t volume(Dims dims) {
size_t vol = 1;
for (int i = 0; i < dims.nbDims; i++) vol *= dims.d[i];
return vol;
}
size_t dataTypeSize(DataType dtype) {
switch (dtype) {
case DataType::kFLOAT: return 4;
case DataType::kHALF: return 2;
case DataType::kINT8: return 1;
case DataType::kINT32: return 4;
default: return 0;
}
}
};
4.2 Engine Deserialization Optimization¶
Engine deserialization on Orin Nano can take 500 ms to 3 seconds for large models. For real-time systems that must start quickly:
// Strategy 1: Memory-mapped engine file (avoids copy during load)
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
ICudaEngine* loadEngineMmap(const std::string& path, IRuntime* runtime) {
int fd = open(path.c_str(), O_RDONLY);
struct stat st;
fstat(fd, &st);
void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Advise kernel to read ahead
madvise(mapped, st.st_size, MADV_SEQUENTIAL);
ICudaEngine* engine = runtime->deserializeCudaEngine(mapped, st.st_size);
munmap(mapped, st.st_size);
close(fd);
return engine;
}
// Strategy 2: Pre-load engine into pinned memory
ICudaEngine* loadEnginePinned(const std::string& path, IRuntime* runtime) {
std::ifstream file(path, std::ios::binary | std::ios::ate);
size_t size = file.tellg();
file.seekg(0);
void* pinnedMem;
cudaMallocHost(&pinnedMem, size);
file.read(static_cast<char*>(pinnedMem), size);
ICudaEngine* engine = runtime->deserializeCudaEngine(pinnedMem, size);
cudaFreeHost(pinnedMem);
return engine;
}
4.3 Warm-Up Strategies¶
The first several inferences after engine load exhibit higher latency due to:
- CUDA context initialization
- JIT compilation of any remaining PTX
- GPU clock ramp-up (DVFS)
- Memory page faults on first access
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import time
def warmup_engine(context, stream, d_input, d_output, h_input, h_output,
                  iterations=100):
    """
    Run dummy inferences until latency stabilizes.
    h_output must be a host array sized for the engine's output binding.
    Returns the stable latency baseline.
    """
    latencies = []
    for i in range(iterations):
        cuda.memcpy_htod_async(d_input, h_input, stream)
        context.execute_async_v3(stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        stream.synchronize()
# Measure the last 20 iterations for baseline
for i in range(20):
start = time.perf_counter_ns()
cuda.memcpy_htod_async(d_input, h_input, stream)
context.execute_async_v3(stream_handle=stream.handle)
stream.synchronize()
end = time.perf_counter_ns()
latencies.append((end - start) / 1e6)
avg = np.mean(latencies)
std = np.std(latencies)
print(f"Warmup baseline: {avg:.2f} +/- {std:.2f} ms")
return avg
5. DLA for Deterministic Inference¶
5.1 DLA Fixed-Function Advantage¶
The Orin Nano contains one NVDLA v2.0 core. DLA is a fixed-function accelerator -- unlike the GPU, it does not share execution resources with other workloads. This makes DLA inference latency highly predictable.
GPU inference jitter:                DLA inference jitter:
  |      *                            |
  |   ***  *                          |  ****
  |  *****  *  *                      | ******
  | ****** ** ** *                    | *******
  +-------------------->              +-------------------->
    4   6   8  10  12 ms                5   5.5   6   6.5 ms
GPU: mean=6.2, std=1.8 ms            DLA: mean=5.5, std=0.2 ms
5.2 DLA-Compatible Layers¶
Not all TensorRT layers run on DLA. Unsupported layers fall back to GPU:
| Layer Type | DLA Support | Notes |
|---|---|---|
| Convolution | Yes | 3x3, 5x5, 7x7; groups supported |
| Deconvolution | Yes | Limited kernel sizes |
| Pooling (Max/Avg) | Yes | -- |
| ElementWise | Yes | Add, Sub, Mul, Min, Max |
| Activation (ReLU) | Yes | -- |
| Activation (Sigmoid) | Partial | May fall back |
| BatchNormalization | Yes | Fused with Conv |
| Softmax | No | Falls back to GPU |
| Resize/Upsample | Partial | Nearest-neighbor only |
| Transformer attention | No | Falls back to GPU |
5.3 Building a DLA Engine¶
import tensorrt as trt
def build_dla_engine(onnx_path, output_path, allow_gpu_fallback=True):
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open(onnx_path, "rb") as f:
if not parser.parse(f.read()):
for i in range(parser.num_errors):
print(parser.get_error(i))
return None
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
# Enable INT8 + FP16 (DLA requires INT8 or FP16)
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)
# Assign to DLA core 0 (Orin Nano has 1 DLA)
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0
if allow_gpu_fallback:
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
serialized = builder.build_serialized_network(network, config)
if serialized:
with open(output_path, "wb") as f:
f.write(serialized)
print(f"DLA engine saved to {output_path}")
return serialized
5.4 Measuring DLA vs GPU Jitter¶
# GPU-only engine benchmark
trtexec --loadEngine=model_fp16.engine \
--iterations=1000 --warmUp=5000 \
--percentile=50,95,99 \
--useSpinWait
# DLA engine benchmark
trtexec --loadEngine=model_dla.engine \
--iterations=1000 --warmUp=5000 \
--percentile=50,95,99 \
--useDLACore=0 \
--useSpinWait
Typical results on Orin Nano (ResNet-50 INT8, batch=1):
| Metric | GPU | DLA |
|---|---|---|
| p50 latency | 3.8 ms | 5.1 ms |
| p95 latency | 4.5 ms | 5.3 ms |
| p99 latency | 6.2 ms | 5.4 ms |
| Std deviation | 0.9 ms | 0.15 ms |
| Max latency | 12.1 ms | 5.8 ms |
DLA is slower on average but dramatically more consistent. The p99/p50 ratio is 1.06 for DLA vs 1.63 for GPU.
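The jitter ratios fall straight out of the table (values above, in ms):

```python
# p50/p99 values from the benchmark table above (ms)
gpu = {"p50": 3.8, "p99": 6.2}
dla = {"p50": 5.1, "p99": 5.4}
gpu_ratio = gpu["p99"] / gpu["p50"]
dla_ratio = dla["p99"] / dla["p50"]
print(f"GPU p99/p50: {gpu_ratio:.2f}")  # 1.63
print(f"DLA p99/p50: {dla_ratio:.2f}")  # 1.06
```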
6. Multi-Engine Scheduling¶
6.1 GPU + DLA Concurrent Execution¶
The Orin Nano can run GPU and DLA inference simultaneously because they are independent hardware blocks. This enables pipeline parallelism.
Timeline:
GPU: [---Model A---][---Model A---][---Model A---]
DLA: [--Model B--] [--Model B--] [--Model B--]
Total throughput = GPU throughput + DLA throughput
C++ concurrent GPU + DLA execution:
#include <thread>
#include <atomic>
class DualEngineScheduler {
public:
struct EngineSlot {
ICudaEngine* engine;
IExecutionContext* context;
cudaStream_t stream;
void* inputBuffer;
void* outputBuffer;
size_t inputSize;
size_t outputSize;
};
EngineSlot gpuSlot; // FP16 engine on GPU
EngineSlot dlaSlot; // INT8 engine on DLA
void init(const std::string& gpuEnginePath,
const std::string& dlaEnginePath) {
loadSlot(gpuSlot, gpuEnginePath);
loadSlot(dlaSlot, dlaEnginePath);
}
// Run both engines concurrently on different inputs
void inferConcurrent(
const void* gpuInput, void* gpuOutput,
const void* dlaInput, void* dlaOutput)
{
// Launch GPU inference
cudaMemcpyAsync(gpuSlot.inputBuffer, gpuInput,
gpuSlot.inputSize,
cudaMemcpyHostToDevice, gpuSlot.stream);
gpuSlot.context->enqueueV3(gpuSlot.stream);
cudaMemcpyAsync(gpuOutput, gpuSlot.outputBuffer,
gpuSlot.outputSize,
cudaMemcpyDeviceToHost, gpuSlot.stream);
// Launch DLA inference (independent stream)
cudaMemcpyAsync(dlaSlot.inputBuffer, dlaInput,
dlaSlot.inputSize,
cudaMemcpyHostToDevice, dlaSlot.stream);
dlaSlot.context->enqueueV3(dlaSlot.stream);
cudaMemcpyAsync(dlaOutput, dlaSlot.outputBuffer,
dlaSlot.outputSize,
cudaMemcpyDeviceToHost, dlaSlot.stream);
// Wait for both
cudaStreamSynchronize(gpuSlot.stream);
cudaStreamSynchronize(dlaSlot.stream);
}
private:
void loadSlot(EngineSlot& slot, const std::string& path) {
// ... load engine, create context, allocate buffers ...
cudaStreamCreateWithFlags(&slot.stream, cudaStreamNonBlocking);
}
};
6.2 Pipeline Parallelism¶
For multi-stage pipelines (e.g., detection then classification), assign stages to different accelerators:
Frame N: [GPU: Detect]---->[DLA: Classify]
Frame N+1: [GPU: Detect]---->[DLA: Classify]
Frame N+2: [GPU: Detect]---->[DLA: Classify]
Throughput limited by slower stage, latency = sum of stages
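That relationship is worth internalizing before writing any scheduling code; a two-line model captures it (the 8 ms / 5 ms stage times below are illustrative, not measured):

```python
def pipeline_model(stage_ms):
    """Steady-state throughput is set by the slowest stage; per-frame
    latency is the sum of all stages."""
    return sum(stage_ms), 1000.0 / max(stage_ms)

# Hypothetical stage times: GPU detect 8 ms, DLA classify 5 ms
lat, fps = pipeline_model([8.0, 5.0])
print(f"latency {lat:.0f} ms, throughput {fps:.0f} fps")
```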
import threading
import queue
import time
class PipelineScheduler:
def __init__(self, detector_engine, classifier_engine):
self.detect_queue = queue.Queue(maxsize=2)
self.classify_queue = queue.Queue(maxsize=2)
self.result_queue = queue.Queue(maxsize=2)
self.detector = detector_engine # GPU engine
self.classifier = classifier_engine # DLA engine
def detection_worker(self):
"""Runs on GPU -- detects objects in each frame."""
while True:
frame = self.detect_queue.get()
if frame is None:
break
detections = self.detector.infer(frame)
self.classify_queue.put((frame, detections))
def classification_worker(self):
"""Runs on DLA -- classifies each detected region."""
while True:
item = self.classify_queue.get()
if item is None:
break
frame, detections = item
results = []
for det in detections:
crop = extract_crop(frame, det)
cls = self.classifier.infer(crop)
results.append((det, cls))
self.result_queue.put(results)
def start(self):
self.det_thread = threading.Thread(
target=self.detection_worker, daemon=True)
self.cls_thread = threading.Thread(
target=self.classification_worker, daemon=True)
self.det_thread.start()
self.cls_thread.start()
def submit(self, frame):
self.detect_queue.put(frame)
def get_result(self, timeout=0.1):
return self.result_queue.get(timeout=timeout)
6.3 Engine Assignment Strategies¶
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| GPU-only | Single model, max throughput | Highest raw speed | Jitter under contention |
| DLA-only | Deterministic latency required | Lowest jitter | Slower, limited layers |
| GPU primary + DLA secondary | Two-model pipeline | Parallelism | Complex scheduling |
| DLA primary + GPU fallback | DLA for latency, GPU for unsupported ops | Best of both | Some layers on GPU add jitter |
7. Latency Optimization Techniques¶
7.1 Batching Strategies for Real-Time¶
For real-time applications, batch size = 1 is almost always optimal. Larger batches trade latency for throughput:
Batch=1: [frame] --> [infer 4ms] --> result total: 4 ms
Batch=4: [frame][frame][frame][frame] --> [infer 10ms] --> results
wait 3 frames (60ms @50fps) + 10ms = 70ms per first frame
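The arithmetic above generalizes: at frame interval T, the first frame of a batch of N waits (N-1)*T before inference can even start. A quick model (the helper function is illustrative):

```python
def first_frame_latency_ms(batch, frame_interval_ms, infer_ms):
    """The first frame of a batch waits for (batch - 1) later frames to
    arrive before inference can start."""
    return (batch - 1) * frame_interval_ms + infer_ms

interval = 1000 / 50  # 50 fps camera -> 20 ms between frames
print(first_frame_latency_ms(1, interval, 4))   # batch=1: 4 ms
print(first_frame_latency_ms(4, interval, 10))  # batch=4: 70 ms
```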
Dynamic batching with timeout (for variable-rate inputs):
import time
import threading
import numpy as np
class LatencyAwareBatcher:
def __init__(self, engine, max_batch=4, max_wait_ms=5.0):
self.engine = engine
self.max_batch = max_batch
self.max_wait_s = max_wait_ms / 1000.0
self.pending = []
self.lock = threading.Lock()
def submit(self, input_data):
"""Submit input; returns a Future-like event."""
event = threading.Event()
result_holder = [None]
with self.lock:
self.pending.append((input_data, event, result_holder))
if len(self.pending) >= self.max_batch:
self._flush()
# If not flushed yet, wait for timeout-based flush
event.wait(timeout=self.max_wait_s * 2)
return result_holder[0]
def _flush(self):
"""Execute batch inference on accumulated inputs."""
batch = self.pending[:self.max_batch]
self.pending = self.pending[self.max_batch:]
inputs = np.stack([item[0] for item in batch])
outputs = self.engine.infer_batch(inputs)
for i, (_, event, holder) in enumerate(batch):
holder[0] = outputs[i]
event.set()
def timeout_flusher(self):
"""Background thread that flushes on timeout."""
while True:
time.sleep(self.max_wait_s)
with self.lock:
if self.pending:
self._flush()
7.2 Input Pipeline Optimization¶
Pre-processing on CPU is a common bottleneck. Move it to GPU:
import cv2
import numpy as np
import pycuda.driver as cuda
# BAD: CPU preprocessing (5-10 ms on Orin Nano A78AE)
def preprocess_cpu(frame):
resized = cv2.resize(frame, (224, 224)) # CPU
rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB) # CPU
    normalized = rgb.astype(np.float32) / 255.0 # CPU
chw = np.transpose(normalized, (2, 0, 1)) # CPU
return np.ascontiguousarray(chw)
# GOOD: GPU preprocessing with CUDA (0.3-0.5 ms)
# Use cv2.cuda or custom CUDA kernels (see Section 11)
def preprocess_gpu(frame, d_input, stream):
gpu_frame = cv2.cuda_GpuMat()
gpu_frame.upload(frame, stream=stream)
gpu_resized = cv2.cuda.resize(gpu_frame, (224, 224), stream=stream)
# Continue with GPU-based normalize + transpose
# (custom CUDA kernel, see Section 11)
return gpu_resized
7.3 End-to-End Latency Measurement¶
import time
from collections import deque
import numpy as np
class LatencyTracker:
    def __init__(self, window_size=1000):
        # deque(maxlen=...) evicts the oldest sample in O(1);
        # list.pop(0) is O(n) and adds avoidable jitter of its own
        self.latencies = deque(maxlen=window_size)
    def record(self, latency_ms):
        self.latencies.append(latency_ms)
def report(self):
arr = np.array(self.latencies)
return {
"count": len(arr),
"mean": np.mean(arr),
"std": np.std(arr),
"min": np.min(arr),
"max": np.max(arr),
"p50": np.percentile(arr, 50),
"p95": np.percentile(arr, 95),
"p99": np.percentile(arr, 99),
"p999": np.percentile(arr, 99.9),
}
# Usage in inference loop:
tracker = LatencyTracker()
for frame_count, frame in enumerate(camera_stream):
t0 = time.perf_counter_ns()
preprocessed = preprocess(frame)
output = engine.infer(preprocessed)
result = postprocess(output)
t1 = time.perf_counter_ns()
tracker.record((t1 - t0) / 1e6)
if frame_count % 100 == 0:
stats = tracker.report()
print(f"p50={stats['p50']:.2f} p99={stats['p99']:.2f} "
f"max={stats['max']:.2f} ms")
8. Memory Optimization for Inference¶
8.1 Understanding the 8 GB Shared Memory¶
The Orin Nano's 8 GB LPDDR5 is shared between CPU, GPU, and DLA. Typical memory budget for a real-time inference application:
Total: 8192 MB
- OS + services: ~1500 MB
- CUDA runtime: ~300 MB
- TensorRT engine(s): 200-800 MB (depends on model)
- Workspace: 512-2048 MB
- Input/output buffers: 10-50 MB
- Application code: 100-500 MB
----------------------------------------
Available headroom: ~3000-5000 MB
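A sanity check of this budget, taking the pessimistic end of each range, lands at the bottom of the quoted headroom:

```python
total_mb = 8192
# Pessimistic (upper) end of each range from the budget above (MB)
budget_mb = {
    "os_services": 1500,
    "cuda_runtime": 300,
    "engines": 800,
    "workspace": 2048,
    "io_buffers": 50,
    "application": 500,
}
headroom = total_mb - sum(budget_mb.values())
print(f"headroom: {headroom} MB")
```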
8.2 Workspace Size Tuning¶
TensorRT workspace is scratch memory used during inference for tactics that need temporary storage. Too small limits optimization; too large wastes memory.
# Query actual workspace usage after build
inspector = engine.create_engine_inspector()
info = inspector.get_engine_information(trt.LayerInformationFormat.JSON)
# Parse JSON to find max workspace per layer
# Recommended approach: start large, then tighten
# Build with 2 GB workspace
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)
# After building, check actual usage via verbose build log:
# [TRT] [V] Total Activation Memory: 12345678 bytes
# Set workspace to 1.5x that value for safety margin
8.3 Engine Memory Footprint Reduction¶
# Check engine file size (approximates device memory footprint)
ls -lh model_fp16.engine
# -rw-r--r-- 1 user user 45M model_fp16.engine
# INT8 engines are smaller
ls -lh model_int8.engine
# -rw-r--r-- 1 user user 24M model_int8.engine
Strategies to reduce engine memory:
# 1. Use INT8 precision (halves weight memory vs FP16)
config.set_flag(trt.BuilderFlag.INT8)
# 2. Enable weight streaming for very large models
# (streams weights from host memory on demand)
config.set_flag(trt.BuilderFlag.WEIGHT_STREAMING)
# 3. Use a strongly typed network to prevent unnecessary FP32 copies
#    (strong typing is set at network creation, not on the builder config)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
# 4. Share execution-context (activation) memory across engines
#    that do NOT run concurrently
context_a = engine_a.create_execution_context_without_device_memory()
context_b = engine_b.create_execution_context_without_device_memory()
scratch = cuda.mem_alloc(max(engine_a.device_memory_size,
                             engine_b.device_memory_size))
context_a.device_memory = int(scratch)  # only the context currently
context_b.device_memory = int(scratch)  # executing may use the scratch
8.4 Avoiding Allocation During Inference¶
Critical rule: Never call cudaMalloc, new, malloc, or any allocating
function in the inference hot path.
// BAD: allocating in the hot loop
void infer_bad(const float* input, int size) {
float* d_input;
cudaMalloc(&d_input, size); // ALLOCATION IN HOT PATH
cudaMemcpy(d_input, input, size, cudaMemcpyHostToDevice);
context->enqueueV3(stream);
cudaFree(d_input); // DEALLOCATION IN HOT PATH
}
// GOOD: pre-allocated in constructor, reused every call
class ZeroAllocInference {
void* d_input_;
void* d_output_;
size_t input_size_;
cudaStream_t stream_;
IExecutionContext* context_;
public:
ZeroAllocInference(ICudaEngine* engine) {
// All allocation happens once at construction
cudaStreamCreate(&stream_);
context_ = engine->createExecutionContext();
const char* inName = engine->getIOTensorName(0);
const char* outName = engine->getIOTensorName(1);
input_size_ = /* compute from dims */;
cudaMalloc(&d_input_, input_size_);
cudaMalloc(&d_output_, /* output size */);
context_->setTensorAddress(inName, d_input_);
context_->setTensorAddress(outName, d_output_);
}
// Hot path: zero allocations
void infer(const void* input, void* output) {
cudaMemcpyAsync(d_input_, input, input_size_,
cudaMemcpyHostToDevice, stream_);
context_->enqueueV3(stream_);
cudaMemcpyAsync(output, d_output_, /* size */,
cudaMemcpyDeviceToHost, stream_);
cudaStreamSynchronize(stream_);
}
};
8.5 Memory Pools¶
Use CUDA memory pools to avoid repeated allocation/deallocation overhead:
// Create a memory pool for inference buffers
cudaMemPool_t memPool;
cudaDeviceGetDefaultMemPool(&memPool, 0);
// Set pool to retain allocations (avoid returning memory to OS)
uint64_t threshold = UINT64_MAX; // Never release
cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold,
&threshold);
// Allocations from pool are fast (no syscall)
void* ptr;
cudaMallocAsync(&ptr, size, stream);
// ... use ptr ...
cudaFreeAsync(ptr, stream); // Returns to pool, not OS
9. CPU Real-Time Configuration¶
9.1 CPU Isolation (isolcpus)¶
The Orin Nano has 6 Cortex-A78AE cores (cores 0-5). Isolate cores for dedicated inference threads:
# Edit /boot/extlinux/extlinux.conf on Orin Nano
# Add to APPEND line:
# isolcpus=4,5 nohz_full=4,5 rcu_nocbs=4,5
# Example extlinux.conf entry:
# APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait
# isolcpus=4,5 nohz_full=4,5 rcu_nocbs=4,5
# irqaffinity=0,1,2,3
# After reboot, verify isolation:
cat /sys/devices/system/cpu/isolated
# 4-5
# Isolated cores will not run any tasks unless explicitly pinned
taskset -p 1 # PID 1 (init) should show mask without cores 4,5
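Before pinning threads from user space, it helps to read back which cores the kernel actually isolated. A small helper to parse the kernel's CPU-list syntax (ranges like `4-5` or `0,2,4-6`); the name `parse_cpu_list` is our own, not a system API:

```python
def parse_cpu_list(s: str) -> set:
    """Parse a kernel CPU list like '4-5' or '0,2,4-6' into a set of ints."""
    cores = set()
    for part in s.strip().split(","):
        if not part:
            continue  # empty string -> no isolated cores
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return cores

# On a configured Orin Nano, this file would contain "4-5":
# with open("/sys/devices/system/cpu/isolated") as f:
#     isolated = parse_cpu_list(f.read())

print(parse_cpu_list("4-5"))      # {4, 5}
print(parse_cpu_list("0,2,4-6"))  # {0, 2, 4, 5, 6}
```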
9.2 Thread Affinity¶
Pin inference threads to isolated cores:
#include <pthread.h>
void pin_to_core(pthread_t thread, int core_id) {
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
int ret = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
if (ret != 0) {
fprintf(stderr, "Failed to set CPU affinity: %s\n", strerror(ret));
}
}
// Pin inference thread to core 4, post-processing to core 5
void* inference_thread(void* arg) {
pin_to_core(pthread_self(), 4);
set_realtime_priority(pthread_self(), 80);
// ... inference loop ...
}
void* postprocess_thread(void* arg) {
pin_to_core(pthread_self(), 5);
set_realtime_priority(pthread_self(), 70);
// ... post-processing loop ...
}
Python equivalent:
import os
def pin_to_cores(cores):
"""Pin current process to specific CPU cores."""
os.sched_setaffinity(0, cores)
actual = os.sched_getaffinity(0)
print(f"Pinned to cores: {actual}")
# Pin inference process to isolated cores
pin_to_cores({4, 5})
9.3 IRQ Affinity¶
Move hardware interrupts away from inference cores:
# List current IRQ affinity
for irq in /proc/irq/*/smp_affinity_list; do
echo "$irq: $(cat $irq)"
done
# Move all IRQs to cores 0-3 (away from inference cores 4-5)
for irq in $(ls /proc/irq/ | grep -E '^[0-9]+$'); do
echo 0-3 > /proc/irq/$irq/smp_affinity_list 2>/dev/null
done
# Verify GPU interrupt is on non-inference core
cat /proc/interrupts | grep gpu
# The GPU interrupt should fire on core 0-3
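Note that `smp_affinity` (without the `_list` suffix) takes a hexadecimal bitmask rather than a core list. A small conversion helper (the name `cores_to_mask` is ours) for building that mask:

```python
def cores_to_mask(cores) -> str:
    """Convert a set of CPU core IDs to the hex bitmask format used by
    /proc/irq/*/smp_affinity (bit N set = core N allowed)."""
    mask = 0
    for c in cores:
        mask |= 1 << c
    return format(mask, "x")

print(cores_to_mask({0, 1, 2, 3}))  # 'f'  -> cores 0-3
print(cores_to_mask({4, 5}))        # '30' -> cores 4-5
```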
9.4 Reducing Jitter Sources¶
# 1. Disable CPU frequency scaling (lock to max)
echo performance > /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
# Or use jetson_clocks to lock all clocks
sudo jetson_clocks
# 2. Disable kernel tick on isolated cores (tickless)
# (Already enabled via nohz_full=4,5 boot parameter)
# 3. Minimize kernel background work on isolated cores
echo 1 > /sys/bus/workqueue/devices/writeback/cpumask  # hex mask 0x1 = core 0 only
# 4. Disable transparent huge pages (THP causes latency spikes)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# 5. Lock memory to prevent page faults in RT threads
#include <sys/mman.h>
// Lock all current and future memory to prevent page faults
void lock_memory() {
if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
perror("mlockall failed");
}
}
9.5 Complete CPU RT Setup Script¶
#!/bin/bash
# rt_setup.sh -- Run once after boot on Orin Nano
set -e
echo "=== Orin Nano Real-Time Setup ==="
# Lock CPU frequency to maximum
sudo jetson_clocks
# Verify RT kernel
if [ "$(cat /sys/kernel/realtime 2>/dev/null)" = "1" ]; then
echo "[OK] PREEMPT_RT kernel active"
else
echo "[WARN] Not running PREEMPT_RT kernel"
fi
# Move IRQs away from isolated cores
ISOLATED=$(cat /sys/devices/system/cpu/isolated)
echo "Isolated cores: $ISOLATED"
for irq in $(ls /proc/irq/ | grep -E '^[0-9]+$'); do
echo 0-3 > /proc/irq/$irq/smp_affinity_list 2>/dev/null || true
done
echo "[OK] IRQs moved to cores 0-3"
# Disable THP
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled > /dev/null
echo "[OK] Transparent huge pages disabled"
# Cycle the GPU idle state to re-latch clocks
# (sysfs path varies by L4T release; skip if absent)
if [ -e /sys/devices/gpu.0/force_idle ]; then
    sudo bash -c 'echo 1 > /sys/devices/gpu.0/force_idle'
    sleep 0.1
    sudo bash -c 'echo 0 > /sys/devices/gpu.0/force_idle'
fi
echo "=== RT setup complete ==="
10. CUDA Real-Time Patterns¶
10.1 Persistent Kernels¶
Standard CUDA kernels launch, execute, and exit. Persistent kernels stay resident on the GPU, polling for work. This eliminates kernel launch latency.
__global__ void persistent_inference_kernel(
volatile int* work_ready,
volatile int* work_done,
float* input_buffer,
float* output_buffer,
int input_size)
{
// Each block stays resident and polls for work
while (true) {
// Spin-wait for work (low latency)
while (atomicAdd((int*)work_ready, 0) == 0) {
__nanosleep(100); // Brief pause to save power
}
// Process input
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < input_size) {
output_buffer[idx] = some_computation(input_buffer[idx]);
}
__threadfence();
        // Signal completion (simplified: assumes a single-block launch;
        // with multiple blocks this needs a grid-wide barrier, e.g.
        // cooperative groups' grid.sync())
        __syncthreads();
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            atomicExch((int*)work_done, 1);
            atomicExch((int*)work_ready, 0);
        }
}
}
Note: Persistent kernels are an advanced pattern. For most TensorRT-based inference, CUDA Graphs (Section 10.2) are the recommended approach.
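The `work_ready`/`work_done` handshake above can be modeled on the host to make the protocol concrete. A CPU-only Python sketch using threading primitives in place of device atomics (flag names mirror the CUDA code; the doubling computation stands in for `some_computation`):

```python
import threading

# Host-side model of the persistent-kernel handshake: the worker thread
# plays the resident GPU kernel, spinning until work_ready is raised.
work_ready = threading.Event()
work_done = threading.Event()
buffer = {"in": None, "out": None}
stop = False

def persistent_worker():
    while not stop:
        if not work_ready.wait(timeout=0.01):  # stand-in for spin-wait
            continue
        buffer["out"] = [x * 2 for x in buffer["in"]]  # "some_computation"
        work_ready.clear()   # consume the work item
        work_done.set()      # signal completion to the host

t = threading.Thread(target=persistent_worker, daemon=True)
t.start()

# Submit one work item: fill the buffer BEFORE raising work_ready
buffer["in"] = [1, 2, 3]
work_done.clear()
work_ready.set()
work_done.wait()
print(buffer["out"])  # [2, 4, 6]
stop = True
```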
10.2 CUDA Graphs for Reduced Launch Overhead¶
CUDA Graphs capture a sequence of operations (memcpy, kernel launches) into a single launchable unit. This eliminates per-operation CPU overhead.
Capturing TensorRT inference as a CUDA Graph:
#include <cuda_runtime.h>
class GraphInference {
cudaGraph_t graph_;
cudaGraphExec_t graphExec_;
cudaStream_t captureStream_;
cudaStream_t execStream_;
IExecutionContext* context_;
void* d_input_;
void* d_output_;
size_t inputSize_;
size_t outputSize_;
public:
void captureGraph(const void* sampleInput) {
cudaStreamCreate(&captureStream_);
cudaStreamCreate(&execStream_);
// Run a few warmup iterations first (required before capture)
for (int i = 0; i < 10; i++) {
cudaMemcpyAsync(d_input_, sampleInput, inputSize_,
cudaMemcpyHostToDevice, captureStream_);
context_->enqueueV3(captureStream_);
cudaStreamSynchronize(captureStream_);
}
// Begin graph capture
cudaStreamBeginCapture(captureStream_,
cudaStreamCaptureModeGlobal);
// Record the operations we want to capture
cudaMemcpyAsync(d_input_, sampleInput, inputSize_,
cudaMemcpyHostToDevice, captureStream_);
context_->enqueueV3(captureStream_);
// End capture
cudaStreamEndCapture(captureStream_, &graph_);
        // Instantiate executable graph (CUDA 12 signature)
        cudaGraphInstantiate(&graphExec_, graph_, 0);
printf("CUDA Graph captured successfully\n");
}
// Ultra-low-overhead inference: single API call
void infer(const void* input, void* output) {
// Update input data (graph uses same device pointers)
cudaMemcpy(d_input_, input, inputSize_, cudaMemcpyHostToDevice);
// Launch entire graph
cudaGraphLaunch(graphExec_, execStream_);
cudaStreamSynchronize(execStream_);
// Copy output
cudaMemcpy(output, d_output_, outputSize_, cudaMemcpyDeviceToHost);
}
};
Python CUDA Graph capture with TensorRT:
import tensorrt as trt
import pycuda.driver as cuda
import numpy as np
class CudaGraphInference:
def __init__(self, engine_path):
self.logger = trt.Logger(trt.Logger.WARNING)
with open(engine_path, "rb") as f:
runtime = trt.Runtime(self.logger)
self.engine = runtime.deserialize_cuda_engine(f.read())
self.context = self.engine.create_execution_context()
self.stream = cuda.Stream()
self.graph = None
self.graph_exec = None
# Allocate buffers
self._alloc_buffers()
    def _alloc_buffers(self):
        # input_size/output_size/output_shape and the tensor names
        # "input"/"output" are model-specific; query them from the engine
        self.d_input = cuda.mem_alloc(self.input_size)
        self.d_output = cuda.mem_alloc(self.output_size)
        # Host output must be pinned for the async D2H copy captured below
        self.h_output = cuda.pagelocked_empty(self.output_shape,
                                              dtype=np.float32)
        self.context.set_tensor_address("input", int(self.d_input))
        self.context.set_tensor_address("output", int(self.d_output))
def capture_graph(self, sample_input):
"""Capture inference as CUDA graph."""
# Warmup
for _ in range(10):
cuda.memcpy_htod_async(self.d_input, sample_input, self.stream)
self.context.execute_async_v3(self.stream.handle)
self.stream.synchronize()
        # Capture (requires a pycuda build with CUDA graph support)
        self.stream.begin_capture()
cuda.memcpy_htod_async(self.d_input, sample_input, self.stream)
self.context.execute_async_v3(self.stream.handle)
cuda.memcpy_dtoh_async(self.h_output, self.d_output, self.stream)
self.graph = self.stream.end_capture()
self.graph_exec = self.graph.instantiate()
print("CUDA Graph captured")
def infer(self, input_data):
"""Launch pre-captured graph."""
cuda.memcpy_htod(self.d_input, input_data)
self.graph_exec.launch(self.stream)
self.stream.synchronize()
return self.h_output.copy()
10.3 Pre-Allocated Buffers and Pinned Memory¶
Pinned (page-locked) memory enables faster H2D/D2H transfers; with pageable memory, cudaMemcpyAsync silently falls back to a synchronous copy:
// Allocate pinned host memory for input/output
float* h_input;
float* h_output;
cudaMallocHost(&h_input, input_size); // Pinned
cudaMallocHost(&h_output, output_size); // Pinned
// Transfer is ~2x faster with pinned memory on Orin Nano
// Pinned: ~12 GB/s effective bandwidth
// Pageable: ~6 GB/s effective bandwidth
// IMPORTANT: Don't over-allocate pinned memory
// It reduces available memory for the OS page cache
// Rule of thumb: < 500 MB pinned on 8 GB Orin Nano
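Using the approximate bandwidth figures above, a quick back-of-the-envelope estimator for copy cost (illustrative only; `h2d_time_ms` and the bandwidth constants are our assumptions -- real numbers should come from cudaEvent timing on the device):

```python
def h2d_time_ms(nbytes: int, pinned: bool = True) -> float:
    """Estimate host-to-device copy time from the rough Orin Nano
    bandwidths quoted above: ~12 GB/s pinned, ~6 GB/s pageable."""
    bw_bytes_per_s = (12.0 if pinned else 6.0) * 1e9
    return nbytes / bw_bytes_per_s * 1e3

# A 1x3x640x640 FP16 input tensor:
nbytes = 3 * 640 * 640 * 2
print(f"pinned:   {h2d_time_ms(nbytes, True):.3f} ms")
print(f"pageable: {h2d_time_ms(nbytes, False):.3f} ms")
```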
10.4 Stream Callbacks for Async Processing¶
// Host function triggered when all prior work in the stream completes.
// Note: cudaLaunchHostFunc expects a void (*)(void*) function, unlike the
// older cudaStreamAddCallback signature.
void CUDART_CB inferenceComplete(void* userData) {
    auto* result = static_cast<InferenceResult*>(userData);
    result->timestamp = std::chrono::steady_clock::now();
    result->ready.store(true, std::memory_order_release);
    // Post-processing can start immediately
}
// Register callback after inference
cudaMemcpyAsync(d_input, h_input, size, cudaMemcpyHostToDevice, stream);
context->enqueueV3(stream);
cudaMemcpyAsync(h_output, d_output, size, cudaMemcpyDeviceToHost, stream);
cudaLaunchHostFunc(stream, inferenceComplete, &result);
// CPU thread can do other work while GPU executes
11. Pre/Post Processing Pipeline¶
11.1 GPU-Accelerated Image Preprocessing¶
A custom CUDA kernel for the common resize+normalize+HWC-to-CHW pipeline:
// preprocess.cu
__global__ void preprocess_kernel(
const uint8_t* __restrict__ input, // HWC, BGR, uint8
float* __restrict__ output, // CHW, RGB, float32
int src_w, int src_h,
int dst_w, int dst_h,
float mean_r, float mean_g, float mean_b,
float std_r, float std_g, float std_b)
{
int dx = blockIdx.x * blockDim.x + threadIdx.x;
int dy = blockIdx.y * blockDim.y + threadIdx.y;
if (dx >= dst_w || dy >= dst_h) return;
// Bilinear interpolation coordinates
float sx = (dx + 0.5f) * src_w / dst_w - 0.5f;
float sy = (dy + 0.5f) * src_h / dst_h - 0.5f;
int x0 = (int)floorf(sx), y0 = (int)floorf(sy);
int x1 = x0 + 1, y1 = y0 + 1;
float fx = sx - x0, fy = sy - y0;
x0 = max(0, min(x0, src_w - 1));
x1 = max(0, min(x1, src_w - 1));
y0 = max(0, min(y0, src_h - 1));
y1 = max(0, min(y1, src_h - 1));
// Interpolate each channel (input is BGR)
for (int c = 0; c < 3; c++) {
float v00 = input[(y0 * src_w + x0) * 3 + c];
float v01 = input[(y0 * src_w + x1) * 3 + c];
float v10 = input[(y1 * src_w + x0) * 3 + c];
float v11 = input[(y1 * src_w + x1) * 3 + c];
float val = (1 - fx) * (1 - fy) * v00 +
fx * (1 - fy) * v01 +
(1 - fx) * fy * v10 +
fx * fy * v11;
// BGR to RGB: channel 0->2, 1->1, 2->0
int out_c = 2 - c;
// Normalize
float mean = (out_c == 0) ? mean_r : (out_c == 1) ? mean_g : mean_b;
float std = (out_c == 0) ? std_r : (out_c == 1) ? std_g : std_b;
val = (val / 255.0f - mean) / std;
// Write CHW format
output[out_c * dst_h * dst_w + dy * dst_w + dx] = val;
}
}
void launchPreprocess(
const uint8_t* d_input, float* d_output,
int src_w, int src_h, int dst_w, int dst_h,
cudaStream_t stream)
{
dim3 block(16, 16);
dim3 grid((dst_w + 15) / 16, (dst_h + 15) / 16);
// ImageNet normalization values
preprocess_kernel<<<grid, block, 0, stream>>>(
d_input, d_output, src_w, src_h, dst_w, dst_h,
0.485f, 0.456f, 0.406f, // mean RGB
0.229f, 0.224f, 0.225f); // std RGB
}
11.2 NMS (Non-Maximum Suppression) on GPU¶
For object detection post-processing, NMS on CPU can take 1-5 ms. A GPU-accelerated version:
// Simplified GPU NMS for YOLO-style detections.
// Assumes boxes arrive sorted by descending score; note that reading
// keep[j] written by other threads is racy, so production code should
// use a two-pass bitmask NMS (e.g., the EfficientNMS_TRT plugin).
__global__ void nms_kernel(
const float* boxes, // [N, 4] x1,y1,x2,y2
const float* scores, // [N]
int* keep, // [N] output mask
int num_boxes,
float iou_threshold)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= num_boxes) return;
if (scores[i] < 0.01f) {
keep[i] = 0;
return;
}
keep[i] = 1; // Assume keeping
for (int j = 0; j < i; j++) {
if (keep[j] == 0) continue;
if (scores[j] <= scores[i]) continue;
// Compute IoU
float x1 = fmaxf(boxes[i*4+0], boxes[j*4+0]);
float y1 = fmaxf(boxes[i*4+1], boxes[j*4+1]);
float x2 = fminf(boxes[i*4+2], boxes[j*4+2]);
float y2 = fminf(boxes[i*4+3], boxes[j*4+3]);
float intersection = fmaxf(0, x2-x1) * fmaxf(0, y2-y1);
float area_i = (boxes[i*4+2]-boxes[i*4+0]) *
(boxes[i*4+3]-boxes[i*4+1]);
float area_j = (boxes[j*4+2]-boxes[j*4+0]) *
(boxes[j*4+3]-boxes[j*4+1]);
float iou = intersection / (area_i + area_j - intersection);
if (iou > iou_threshold) {
keep[i] = 0;
return;
}
}
}
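When validating a GPU NMS kernel, it helps to have a deterministic CPU reference to compare against. A sequential NumPy implementation of the same greedy algorithm (the name `nms_reference` is ours):

```python
import numpy as np

def nms_reference(boxes, scores, iou_threshold=0.5):
    """Sequential reference NMS. boxes is [N, 4] as x1,y1,x2,y2;
    returns indices of kept boxes, highest score first."""
    order = np.argsort(scores)[::-1]  # descending by score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # IoU of the current best box against all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * \
                 (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]  # drop suppressed boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]],
                 dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
print(nms_reference(boxes, scores, 0.5))  # [0, 2]
```

Box 1 overlaps box 0 with IoU ≈ 0.68 and is suppressed; box 2 is disjoint and kept.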
11.3 Complete Pre/Post Pipeline Timing¶
Typical pipeline timing on Orin Nano (1080p camera, YOLOv8s):
CPU preprocess: 6.2 ms | GPU preprocess: 0.4 ms
H2D copy: 0.8 ms | (already on GPU) 0.0 ms
Inference (FP16): 7.5 ms | Inference (FP16): 7.5 ms
D2H copy: 0.3 ms | D2H copy: 0.3 ms
CPU NMS: 2.1 ms | GPU NMS: 0.2 ms
-----------------------------|-------------------------------
Total: 16.9 ms | Total: 8.4 ms
| Savings: 8.5 ms (50%)
12. Multi-Model Inference¶
12.1 Running Multiple Models Concurrently¶
On the Orin Nano, you can run up to three inference tasks simultaneously:

- 1 on GPU (via TensorRT)
- 1 on DLA (via TensorRT)
- 1 on CPU (lightweight model via ONNX Runtime or PyTorch)
import threading
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
class MultiModelManager:
def __init__(self):
self.engines = {}
self.contexts = {}
self.streams = {}
self.lock = threading.Lock()
def load_model(self, name, engine_path, device="gpu"):
"""Load a TensorRT engine."""
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open(engine_path, "rb") as f:
engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
stream = cuda.Stream()
self.engines[name] = engine
self.contexts[name] = context
self.streams[name] = stream
    def infer(self, name, input_data):
        """Run inference on a specific model.
        Assumes per-model d_inputs/d_outputs/h_outputs buffers were
        allocated in load_model (omitted for brevity)."""
        context = self.contexts[name]
        stream = self.streams[name]
        # Each model has its own stream -- can run concurrently
cuda.memcpy_htod_async(self.d_inputs[name], input_data, stream)
context.execute_async_v3(stream.handle)
cuda.memcpy_dtoh_async(self.h_outputs[name],
self.d_outputs[name], stream)
stream.synchronize()
return self.h_outputs[name].copy()
def infer_concurrent(self, tasks):
"""
Run multiple models concurrently.
tasks: list of (model_name, input_data) tuples
"""
threads = []
results = {}
def worker(name, data):
results[name] = self.infer(name, data)
for name, data in tasks:
t = threading.Thread(target=worker, args=(name, data))
threads.append(t)
t.start()
for t in threads:
t.join()
return results
12.2 Resource Partitioning¶
Orin Nano resource partitioning for 3-model pipeline:
Model A (Detection): GPU, FP16, Priority HIGH
Model B (Segmentation): DLA, INT8, Priority MEDIUM
Model C (Classification): GPU, INT8, Priority LOW
GPU timeline:
[----Model A (high)----] [--C (low)--] [----Model A----] ...
DLA timeline:
[------Model B------] [------Model B------] ...
12.3 Priority-Based Scheduling¶
#include <queue>
#include <mutex>
#include <condition_variable>
struct InferenceRequest {
int priority; // Higher = more urgent
std::string model;
void* input;
void* output;
std::function<void()> callback;
bool operator<(const InferenceRequest& other) const {
return priority < other.priority; // Max-heap
}
};
class PriorityInferenceScheduler {
std::priority_queue<InferenceRequest> queue_;
std::mutex mutex_;
std::condition_variable cv_;
std::map<std::string, RTInferenceEngine*> engines_;
bool running_ = true;
public:
void submit(InferenceRequest req) {
{
std::lock_guard<std::mutex> lock(mutex_);
queue_.push(std::move(req));
}
cv_.notify_one();
}
void worker() {
while (running_) {
InferenceRequest req;
{
std::unique_lock<std::mutex> lock(mutex_);
cv_.wait(lock, [&] {
return !queue_.empty() || !running_;
});
if (!running_) break;
req = queue_.top();
queue_.pop();
}
auto* engine = engines_[req.model];
engine->infer(req.input, req.output);
if (req.callback) req.callback();
}
}
};
13. Inference Serving¶
13.1 Triton Inference Server on Jetson¶
NVIDIA provides Triton Inference Server containers for Jetson (JetPack 6.x).
Installation:
# Pull the Jetson-optimized Triton container
# (Uses L4T base, includes TensorRT backend)
sudo docker pull nvcr.io/nvidia/tritonserver:24.08-jetpack-py3
# Create model repository structure
mkdir -p ~/triton-models/yolov8/1
cp model.engine ~/triton-models/yolov8/1/model.plan
# Create model configuration
cat > ~/triton-models/yolov8/config.pbtxt << 'PBTXT'
name: "yolov8"
platform: "tensorrt_plan"
max_batch_size: 4
input [
{
name: "images"
data_type: TYPE_FP32
dims: [ 3, 640, 640 ]
}
]
output [
{
name: "output0"
data_type: TYPE_FP32
dims: [ 84, 8400 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
dynamic_batching {
preferred_batch_size: [ 1, 2, 4 ]
max_queue_delay_microseconds: 5000
}
PBTXT
Running Triton on Orin Nano:
sudo docker run --rm --runtime=nvidia \
--network=host \
-v ~/triton-models:/models \
nvcr.io/nvidia/tritonserver:24.08-jetpack-py3 \
tritonserver \
--model-repository=/models \
--strict-model-config=false \
--min-supported-compute-capability=8.7 \
--log-verbose=0 \
--exit-on-error=false
13.2 Client Inference Request¶
import tritonclient.grpc as grpcclient
import numpy as np
def triton_infer(image_data):
client = grpcclient.InferenceServerClient(url="localhost:8001")
# Prepare input
input_tensor = grpcclient.InferInput("images", [1, 3, 640, 640], "FP32")
input_tensor.set_data_from_numpy(image_data)
# Prepare output
output = grpcclient.InferRequestedOutput("output0")
# Inference with deadline
result = client.infer(
model_name="yolov8",
inputs=[input_tensor],
outputs=[output],
        client_timeout=0.05,  # 50 ms timeout (client_timeout is in seconds)
)
return result.as_numpy("output0")
13.3 Triton Performance Tuning for Real-Time¶
# Key Triton settings for low-latency on Orin Nano:
# 1. Disable dynamic batching for lowest latency
# (Remove dynamic_batching section from config.pbtxt)
# 2. Use gRPC with shared memory for zero-copy
# (Avoids serialization overhead)
# 3. Pin Triton to specific cores
sudo taskset -c 0,1,2,3 docker run ...
# 4. Benchmark with perf_analyzer
perf_analyzer \
-m yolov8 \
-u localhost:8001 \
-i grpc \
--concurrency-range 1:1 \
--measurement-interval 10000 \
--percentile=99 \
-b 1
# Expected output on Orin Nano (YOLOv8s FP16):
# Concurrency: 1, throughput: 42 infer/sec
# p50 latency: 7.8 ms, p99 latency: 12.3 ms
14. Benchmarking and Profiling¶
14.1 trtexec Usage¶
trtexec is the primary benchmarking tool for TensorRT engines.
# Basic latency benchmark
trtexec --loadEngine=model_fp16.engine \
--iterations=1000 \
--warmUp=5000 \
--duration=0 \
--useSpinWait \
--percentile=50,95,99
# Key flags for real-time benchmarking:
# --useSpinWait Busy-wait instead of sleep (lower jitter)
# --noDataTransfers Measure compute only (no H2D/D2H)
# --streams=1 Single stream (matches RT deployment)
# --exposeDMA Show H2D/D2H timing separately
# --dumpProfile Per-layer timing breakdown
# Compare GPU vs DLA
trtexec --loadEngine=model_gpu_int8.engine --useSpinWait --percentile=50,95,99
trtexec --loadEngine=model_dla_int8.engine --useDLACore=0 --useSpinWait \
--percentile=50,95,99
# Profile with layer-level timing
trtexec --loadEngine=model_fp16.engine \
--dumpProfile --separateProfileRun \
--iterations=100
14.2 End-to-End Latency Measurement Framework¶
import time
import json
import numpy as np
import pycuda.driver as cuda
from dataclasses import dataclass, asdict
from typing import List
@dataclass
class LatencyBreakdown:
preprocess_ms: float
h2d_ms: float
inference_ms: float
d2h_ms: float
postprocess_ms: float
total_ms: float
class E2EBenchmark:
def __init__(self):
self.records: List[LatencyBreakdown] = []
    def measure(self, preprocess_fn, postprocess_fn, input_data):
        """Measure each stage of the pipeline.
        Assumes d_input/d_output/h_output, context, and stream are
        set up elsewhere (see Section 10)."""
        # Preprocess
        t0 = time.perf_counter_ns()
        preprocessed = preprocess_fn(input_data)
        t1 = time.perf_counter_ns()
        # H2D, inference, and D2H are timed separately below
        h2d_start = time.perf_counter_ns()
cuda.memcpy_htod(d_input, preprocessed)
h2d_end = time.perf_counter_ns()
infer_start = time.perf_counter_ns()
context.execute_async_v3(stream.handle)
stream.synchronize()
infer_end = time.perf_counter_ns()
d2h_start = time.perf_counter_ns()
cuda.memcpy_dtoh(h_output, d_output)
d2h_end = time.perf_counter_ns()
# Postprocess
t4 = time.perf_counter_ns()
result = postprocess_fn(h_output)
t5 = time.perf_counter_ns()
record = LatencyBreakdown(
preprocess_ms=(t1 - t0) / 1e6,
h2d_ms=(h2d_end - h2d_start) / 1e6,
inference_ms=(infer_end - infer_start) / 1e6,
d2h_ms=(d2h_end - d2h_start) / 1e6,
postprocess_ms=(t5 - t4) / 1e6,
total_ms=(t5 - t0) / 1e6,
)
self.records.append(record)
return result
def report(self):
"""Generate percentile report for each stage."""
fields = ["preprocess_ms", "h2d_ms", "inference_ms",
"d2h_ms", "postprocess_ms", "total_ms"]
report = {}
for field in fields:
values = [getattr(r, field) for r in self.records]
arr = np.array(values)
report[field] = {
"mean": f"{np.mean(arr):.2f}",
"p50": f"{np.percentile(arr, 50):.2f}",
"p95": f"{np.percentile(arr, 95):.2f}",
"p99": f"{np.percentile(arr, 99):.2f}",
"max": f"{np.max(arr):.2f}",
}
return report
def save(self, path):
with open(path, "w") as f:
json.dump(self.report(), f, indent=2)
14.3 Nsight Systems Profiling¶
# Profile inference application with nsys
nsys profile \
--trace=cuda,nvtx,osrt \
--output=inference_profile \
--duration=10 \
--sample=cpu \
./my_inference_app
# Generate summary report
nsys stats inference_profile.nsys-rep
# Key things to look for in the profile:
# 1. Gap between kernel launches (CPU overhead)
# 2. Time spent in cudaMemcpy (data transfer bottleneck)
# 3. Long gaps between inference iterations (scheduling delay)
# 4. Unexpected CPU activity during GPU execution
# Add NVTX markers in code for fine-grained profiling:
#include <nvtx3/nvToolsExt.h>
void inference_loop() {
while (running) {
nvtxRangePush("Frame");
nvtxRangePush("Preprocess");
preprocess(frame);
nvtxRangePop();
nvtxRangePush("H2D");
cudaMemcpyAsync(d_in, h_in, size, cudaMemcpyHostToDevice, stream);
nvtxRangePop();
nvtxRangePush("Inference");
context->enqueueV3(stream);
nvtxRangePop();
nvtxRangePush("D2H");
cudaMemcpyAsync(h_out, d_out, size, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
nvtxRangePop();
nvtxRangePush("Postprocess");
postprocess(h_out);
nvtxRangePop();
nvtxRangePop(); // Frame
}
}
14.4 Interpreting p50/p95/p99 Latency¶
Latency distribution for GPU FP16 inference (Orin Nano, 10000 samples):
Count
|
| *
| ***
| *****
| *******
|********* * *
+----------------------------------> Latency (ms)
3 5 7 9 11 13 15
p50 = 5.2 ms (half of inferences faster than this)
p95 = 7.8 ms (only 5% slower than this)
p99 = 11.3 ms (only 1% slower -- tail latency)
max = 15.1 ms (worst case observed)
For a 20 ms deadline:
p50 meets deadline: YES
p99 meets deadline: YES
max meets deadline: YES
--> Safe for soft real-time at 50 Hz
For a 10 ms deadline:
p50 meets deadline: YES
p99 meets deadline: NO (11.3 > 10)
--> Need INT8 or DLA to meet this deadline at p99
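The percentile-versus-deadline check above is mechanical and worth automating. A small sketch (the `deadline_report` helper and the synthetic gamma-distributed sample are our own, chosen only to roughly resemble the histogram's shape):

```python
import numpy as np

def deadline_report(latencies_ms, deadline_ms):
    """Summarize a latency sample against a deadline."""
    p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
    miss_pct = float(np.mean(np.asarray(latencies_ms) > deadline_ms)) * 100
    return {"p50": float(p50), "p95": float(p95), "p99": float(p99),
            "miss_pct": miss_pct, "p99_ok": bool(p99 <= deadline_ms)}

# Synthetic sample, roughly shaped like the distribution sketched above
rng = np.random.default_rng(0)
latencies = rng.gamma(shape=9.0, scale=0.6, size=10_000)

r = deadline_report(latencies, deadline_ms=10.0)
print(f"p50={r['p50']:.1f} ms  p99={r['p99']:.1f} ms  "
      f"miss={r['miss_pct']:.2f}%  p99_ok={r['p99_ok']}")
```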
15. Real-Time Video Inference¶
15.1 Frame-Accurate Processing¶
For real-time video, every frame has a timestamp. The system must process frames within their validity window.
import cv2
import time
import threading
from collections import deque
class RTVideoProcessor:
def __init__(self, camera_id=0, target_fps=30):
self.cap = cv2.VideoCapture(camera_id, cv2.CAP_V4L2)
self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)
self.cap.set(cv2.CAP_PROP_FPS, target_fps)
self.cap.set(cv2.CAP_PROP_BUFFERSIZE, 1) # Minimize buffering
self.target_fps = target_fps
self.frame_deadline_ms = 1000.0 / target_fps
self.latest_frame = None
self.frame_lock = threading.Lock()
self.stats = {"processed": 0, "dropped": 0, "deadline_miss": 0}
def capture_thread(self):
"""Continuously grab latest frame (drop old ones)."""
while self.running:
ret, frame = self.cap.read()
if not ret:
continue
with self.frame_lock:
self.latest_frame = (frame, time.perf_counter_ns())
def process_loop(self, engine):
"""Process frames with deadline awareness."""
self.running = True
cap_thread = threading.Thread(target=self.capture_thread,
daemon=True)
cap_thread.start()
while self.running:
            # Grab the latest frame; yield briefly if none is ready
            with self.frame_lock:
                frame_item, self.latest_frame = self.latest_frame, None
            if frame_item is None:
                time.sleep(0.001)  # avoid busy-spinning on an empty slot
                continue
            frame, capture_time = frame_item
# Check if frame is already stale
age_ms = (time.perf_counter_ns() - capture_time) / 1e6
if age_ms > self.frame_deadline_ms:
self.stats["dropped"] += 1
continue
# Process
t0 = time.perf_counter_ns()
preprocessed = preprocess(frame)
result = engine.infer(preprocessed)
detections = postprocess(result)
t1 = time.perf_counter_ns()
latency_ms = (t1 - t0) / 1e6
total_age_ms = (t1 - capture_time) / 1e6
if total_age_ms > self.frame_deadline_ms * 2:
self.stats["deadline_miss"] += 1
self.stats["processed"] += 1
# Report periodically
if self.stats["processed"] % 100 == 0:
total = (self.stats["processed"] +
self.stats["dropped"])
drop_rate = self.stats["dropped"] / max(total, 1) * 100
miss_rate = (self.stats["deadline_miss"] /
max(self.stats["processed"], 1) * 100)
print(f"Processed: {self.stats['processed']}, "
f"Dropped: {drop_rate:.1f}%, "
f"Deadline miss: {miss_rate:.1f}%, "
f"Latency: {latency_ms:.1f} ms")
15.2 GStreamer Pipeline Integration¶
For production video pipelines on Jetson, GStreamer with NVIDIA plugins provides hardware-accelerated capture and decode:
# GStreamer pipeline: CSI camera -> hardware ISP -> inference
# Use nvarguscamerasrc for CSI cameras on Orin Nano
gst-launch-1.0 \
nvarguscamerasrc sensor-id=0 ! \
'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1' ! \
nvvidconv ! \
'video/x-raw(memory:NVMM),width=640,height=640,format=RGBA' ! \
queue max-size-buffers=1 leaky=downstream ! \
nvvidconv ! \
'video/x-raw,width=640,height=640,format=RGBA' ! \
appsink max-buffers=1 drop=true
import gi
import numpy as np
gi.require_version('Gst', '1.0')
from gi.repository import Gst, GLib
class GstInferencePipeline:
def __init__(self, engine):
Gst.init(None)
self.engine = engine
        pipeline_str = (
            "nvarguscamerasrc sensor-id=0 ! "
            "video/x-raw(memory:NVMM),width=1920,height=1080,"
            "framerate=30/1 ! "
            "nvvidconv ! "
            # nvvidconv cannot output BGR directly: convert via BGRx
            "video/x-raw,width=640,height=640,format=BGRx ! "
            "videoconvert ! video/x-raw,format=BGR ! "
            "queue max-size-buffers=1 leaky=downstream ! "
            "appsink name=sink emit-signals=true max-buffers=1 drop=true"
        )
self.pipeline = Gst.parse_launch(pipeline_str)
sink = self.pipeline.get_by_name("sink")
sink.connect("new-sample", self.on_new_sample)
def on_new_sample(self, sink):
sample = sink.emit("pull-sample")
buf = sample.get_buffer()
caps = sample.get_caps()
# Extract frame data
success, map_info = buf.map(Gst.MapFlags.READ)
if success:
frame_data = np.frombuffer(map_info.data, dtype=np.uint8)
frame = frame_data.reshape((640, 640, 3))
# Run inference
result = self.engine.infer(preprocess(frame))
detections = postprocess(result)
buf.unmap(map_info)
return Gst.FlowReturn.OK
def start(self):
self.pipeline.set_state(Gst.State.PLAYING)
def stop(self):
self.pipeline.set_state(Gst.State.NULL)
15.3 Pipeline Buffering Strategies¶
Strategy 1: Drop-oldest (for latest-frame applications)
Camera: [F1][F2][F3][F4][F5]...
Buffer: [F5] (always keep only latest)
Inference gets the most recent frame
Strategy 2: Drop-newest (for sequential processing)
Camera: [F1][F2][F3][F4][F5]...
Buffer: [F1][F2][F3] (process in order, drop if full)
Inference processes frames sequentially
Strategy 3: Skip-N (for fixed-rate processing)
Camera: [F1][F2][F3][F4][F5][F6]...
Process: [F1] [F4] [F7]...
Process every Nth frame (N = camera_fps / inference_fps)
import time
from collections import deque

class FramePolicy:
    """Frame selection policies for real-time video inference.
    Buffers are deques; deadline_aware expects (frame, timestamp_ns) items."""
@staticmethod
def latest_only(buffer):
"""Drop all but the most recent frame."""
while len(buffer) > 1:
buffer.popleft()
return buffer.popleft() if buffer else None
@staticmethod
def skip_n(buffer, n=2):
"""Process every Nth frame."""
frame = None
for _ in range(min(n, len(buffer))):
frame = buffer.popleft()
return frame
@staticmethod
def deadline_aware(buffer, max_age_ms=30.0):
"""Drop frames older than max_age_ms."""
now = time.perf_counter_ns()
while buffer:
frame, timestamp = buffer[0]
age_ms = (now - timestamp) / 1e6
if age_ms > max_age_ms:
buffer.popleft() # Too old, drop
else:
return buffer.popleft()
return None
16. Production Real-Time Patterns¶
16.1 Watchdog Timers¶
A watchdog ensures the inference loop does not hang:
#include <atomic>
#include <thread>
#include <chrono>
class InferenceWatchdog {
std::atomic<bool> alive_{true};
std::atomic<uint64_t> last_heartbeat_{0};
std::thread watchdog_thread_;
uint64_t timeout_ms_;
std::function<void()> on_timeout_;
public:
InferenceWatchdog(uint64_t timeout_ms,
std::function<void()> on_timeout)
: timeout_ms_(timeout_ms), on_timeout_(on_timeout)
{
last_heartbeat_.store(now_ms());
watchdog_thread_ = std::thread([this]() {
while (alive_.load()) {
std::this_thread::sleep_for(
std::chrono::milliseconds(timeout_ms_ / 2));
uint64_t elapsed = now_ms() - last_heartbeat_.load();
if (elapsed > timeout_ms_) {
fprintf(stderr, "WATCHDOG: Inference timeout "
"(%lu ms without heartbeat)\n", elapsed);
on_timeout_();
}
}
});
}
void heartbeat() {
last_heartbeat_.store(now_ms());
}
~InferenceWatchdog() {
alive_.store(false);
if (watchdog_thread_.joinable())
watchdog_thread_.join();
}
private:
uint64_t now_ms() {
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::steady_clock::now().time_since_epoch()
).count();
}
};
// Usage:
InferenceWatchdog watchdog(100, []() { // 100 ms timeout
// Emergency action: restart inference, switch to fallback model
fprintf(stderr, "Restarting inference engine...\n");
// ... reinitialize engine ...
});
while (running) {
auto result = engine.infer(input);
watchdog.heartbeat();
process(result);
}
16.2 Deadline Monitoring¶
import time
import logging
class DeadlineMonitor:
def __init__(self, deadline_ms, model_name="default"):
self.deadline_ms = deadline_ms
self.model_name = model_name
self.violations = 0
self.total = 0
self.logger = logging.getLogger(f"deadline.{model_name}")
def check(self, latency_ms):
self.total += 1
if latency_ms > self.deadline_ms:
self.violations += 1
violation_rate = self.violations / self.total * 100
self.logger.warning(
f"Deadline violation: {latency_ms:.2f} ms > "
f"{self.deadline_ms} ms "
f"(rate: {violation_rate:.2f}%)"
)
return False
return True
def sla_compliant(self, target_rate=99.9):
"""Check if meeting SLA (e.g., 99.9% within deadline)."""
if self.total == 0:
return True
success_rate = (self.total - self.violations) / self.total * 100
return success_rate >= target_rate
def report(self):
success_rate = ((self.total - self.violations)
/ max(self.total, 1) * 100)
return {
"model": self.model_name,
"deadline_ms": self.deadline_ms,
"total_inferences": self.total,
"violations": self.violations,
"success_rate_pct": f"{success_rate:.3f}",
}
16.3 Graceful Degradation Under Load¶
class AdaptiveInference:
"""
Switch between quality tiers based on latency budget.
Tier 0: Full model (FP16, 640x640) -- highest accuracy
Tier 1: Reduced resolution (FP16, 320x320) -- faster
Tier 2: INT8 reduced model (INT8, 320x320) -- fastest
Tier 3: Skip inference, use last result -- emergency
"""
def __init__(self, engines, deadline_ms=20.0):
self.engines = engines # List of (engine, resolution) by tier
self.deadline_ms = deadline_ms
self.current_tier = 0
self.consecutive_violations = 0
self.consecutive_ok = 0
self.last_result = None
def infer(self, frame):
        if self.current_tier >= len(self.engines):
            # Emergency tier: return last known result. No latency is
            # measured here, so an external trigger (timer or watchdog
            # calling _upgrade()) is needed to leave this tier.
            return self.last_result
engine, resolution = self.engines[self.current_tier]
t0 = time.perf_counter_ns()
preprocessed = preprocess(frame, resolution)
result = engine.infer(preprocessed)
t1 = time.perf_counter_ns()
latency_ms = (t1 - t0) / 1e6
self.last_result = result
# Adapt tier based on latency
if latency_ms > self.deadline_ms * 0.9:
self.consecutive_violations += 1
self.consecutive_ok = 0
if self.consecutive_violations >= 3:
self._downgrade()
else:
self.consecutive_ok += 1
self.consecutive_violations = 0
if self.consecutive_ok >= 50:
self._upgrade()
return result
def _downgrade(self):
if self.current_tier < len(self.engines):
self.current_tier += 1
self.consecutive_violations = 0
print(f"Degrading to tier {self.current_tier}")
def _upgrade(self):
if self.current_tier > 0:
self.current_tier -= 1
self.consecutive_ok = 0
print(f"Upgrading to tier {self.current_tier}")
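The hysteresis rule above (three consecutive violations to downgrade, fifty clean inferences to upgrade) can be exercised in isolation with a stub; the `TierController` name and synthetic latencies here are illustrative, not part of the guide's API:

```python
class TierController:
    """Minimal standalone model of the tier hysteresis logic."""
    def __init__(self, n_tiers, down_after=3, up_after=50):
        self.tier, self.n_tiers = 0, n_tiers
        self.down_after, self.up_after = down_after, up_after
        self.bad = self.ok = 0

    def observe(self, latency_ms, deadline_ms):
        if latency_ms > deadline_ms * 0.9:      # within 10% of the deadline
            self.bad += 1
            self.ok = 0
            if self.bad >= self.down_after and self.tier < self.n_tiers:
                self.tier += 1
                self.bad = 0
        else:
            self.ok += 1
            self.bad = 0
            if self.ok >= self.up_after and self.tier > 0:
                self.tier -= 1
                self.ok = 0
        return self.tier

ctl = TierController(n_tiers=3)
for _ in range(3):
    ctl.observe(25.0, deadline_ms=20.0)   # three misses in a row -> downgrade
assert ctl.tier == 1
for _ in range(50):
    ctl.observe(5.0, deadline_ms=20.0)    # sustained headroom -> upgrade
assert ctl.tier == 0
```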
16.4 SLA Compliance Framework¶
import time
from collections import deque

class SLATracker:
"""
Track SLA compliance over sliding windows.
Example SLA: 99.9% of inferences under 15 ms,
measured over 1-minute windows.
"""
def __init__(self, deadline_ms, target_pct=99.9,
window_seconds=60):
self.deadline_ms = deadline_ms
self.target_pct = target_pct
self.window_seconds = window_seconds
self.measurements = deque() # (timestamp, latency_ms)
def record(self, latency_ms):
now = time.time()
self.measurements.append((now, latency_ms))
self._evict_old(now)
def _evict_old(self, now):
cutoff = now - self.window_seconds
while self.measurements and self.measurements[0][0] < cutoff:
self.measurements.popleft()
def is_compliant(self):
if len(self.measurements) < 10:
return True # Not enough data
within = sum(1 for _, lat in self.measurements
if lat <= self.deadline_ms)
pct = within / len(self.measurements) * 100
return pct >= self.target_pct
    def compliance_pct(self):
        """Percentage of windowed inferences within the deadline."""
        if not self.measurements:
            return 0.0
        within = sum(1 for _, lat in self.measurements
                     if lat <= self.deadline_ms)
        return within / len(self.measurements) * 100
    def alert_if_noncompliant(self):
        if not self.is_compliant():
            pct = self.compliance_pct()
            print(f"SLA VIOLATION: {pct:.2f}% within {self.deadline_ms}ms "
                  f"(target: {self.target_pct}%)")
return True
return False
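The sliding-window bookkeeping above can be exercised standalone with synthetic timestamps (no wall clock needed); this sketch replays two minutes of one-sample-per-second traffic through the same append-and-evict pattern:

```python
from collections import deque

window = deque()                  # (timestamp_s, latency_ms)
WINDOW_S, DEADLINE_MS = 60, 15.0

def record(now_s, latency_ms):
    window.append((now_s, latency_ms))
    # Evict samples older than the window
    while window and window[0][0] < now_s - WINDOW_S:
        window.popleft()

for t in range(120):              # fast for 60 s, then a sustained slowdown
    record(float(t), 10.0 if t < 60 else 30.0)

# After eviction only t = 59..119 remain (61 samples); all but the t = 59
# sample exceed the 15 ms deadline.
within = sum(1 for _, lat in window if lat <= DEADLINE_MS)
```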
17. Common Issues and Debugging¶
17.1 Latency Spikes¶
Symptom: Occasional inference takes 2-5x longer than normal.
Common causes and solutions:
# Cause 1: GPU clock scaling (DVFS)
# The GPU dynamically adjusts frequency based on load.
# Fix: Lock clocks with jetson_clocks
sudo jetson_clocks
sudo jetson_clocks --show
# Verify GPU frequency is locked:
cat /sys/devices/gpu.0/devfreq/17000000.ga10b/cur_freq
# Should show maximum frequency
# Cause 2: CPU throttling due to thermals
# Check current CPU frequency:
for i in $(seq 0 5); do
echo "CPU$i: $(cat /sys/devices/system/cpu/cpu$i/cpufreq/scaling_cur_freq) kHz"
done
# Check thermal zones:
for zone in /sys/class/thermal/thermal_zone*/; do
    echo "$(cat ${zone}type): $(cat ${zone}temp) millidegrees C"
done
# Cause 3: Memory pressure causing swapping
free -h
grep -E "MemAvailable|SwapTotal|SwapFree" /proc/meminfo
# If SwapTotal - SwapFree > 0, you are swapping -- reduce memory usage
Diagnostic script for latency spikes:
import subprocess
import time
class JetsonHealthMonitor:
def __init__(self):
self.warnings = []
def check_gpu_clock(self):
try:
freq = int(open(
"/sys/devices/gpu.0/devfreq/17000000.ga10b/cur_freq"
).read().strip())
max_freq = int(open(
"/sys/devices/gpu.0/devfreq/17000000.ga10b/max_freq"
).read().strip())
if freq < max_freq * 0.95:
self.warnings.append(
f"GPU clock below max: {freq/1e6:.0f} / "
f"{max_freq/1e6:.0f} MHz")
except Exception as e:
self.warnings.append(f"Cannot read GPU clock: {e}")
def check_thermals(self):
import glob
for zone in glob.glob("/sys/class/thermal/thermal_zone*"):
try:
temp = int(open(f"{zone}/temp").read().strip()) / 1000.0
zone_type = open(f"{zone}/type").read().strip()
if temp > 85.0:
self.warnings.append(
f"THERMAL WARNING: {zone_type} = {temp:.1f} C")
except Exception:
pass
def check_memory(self):
with open("/proc/meminfo") as f:
lines = f.readlines()
info = {}
for line in lines:
parts = line.split()
info[parts[0].rstrip(":")] = int(parts[1])
avail_mb = info.get("MemAvailable", 0) / 1024
swap_used = (info.get("SwapTotal", 0)
- info.get("SwapFree", 0)) / 1024
if avail_mb < 500:
self.warnings.append(
f"Low memory: {avail_mb:.0f} MB available")
if swap_used > 10:
self.warnings.append(
f"Swapping active: {swap_used:.0f} MB used")
def full_check(self):
self.warnings = []
self.check_gpu_clock()
self.check_thermals()
self.check_memory()
return self.warnings
17.2 GPU Scheduling Interference¶
Symptom: Inference latency increases when other GPU workloads are running (e.g., display compositor, video decode).
# Check what is using the GPU
sudo cat /sys/kernel/debug/gpu.0/load
# Shows GPU utilization percentage
# Identify GPU load and memory use (nvidia-smi is not available on Jetson)
sudo tegrastats
# Reports RAM use, GR3D_FREQ (GPU utilization), and per-engine activity
# Mitigation 1: Use DLA for deterministic inference
# (DLA is not affected by GPU contention)
# Mitigation 2: Disable the desktop compositor
sudo systemctl stop gdm3
sudo systemctl disable gdm3
# This frees ~200 MB GPU memory and removes display-driven GPU work
# Mitigation 3: Use separate CUDA streams with priorities
// Create a high-priority CUDA stream for inference
cudaStream_t inferenceStream;
int leastPriority, greatestPriority;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
// Note: numerically lower values mean higher priority, so
// greatestPriority is the highest (most urgent) priority value
cudaStreamCreateWithPriority(&inferenceStream,
cudaStreamNonBlocking,
greatestPriority);
// Other GPU work uses lower priority
cudaStream_t backgroundStream;
cudaStreamCreateWithPriority(&backgroundStream,
cudaStreamNonBlocking,
leastPriority);
17.3 Memory Fragmentation¶
Symptom: cudaMalloc fails or becomes slow after hours of operation, even
though total free memory appears sufficient.
// Diagnostic: check fragmented vs total free memory
size_t free_bytes, total_bytes;
cudaMemGetInfo(&free_bytes, &total_bytes);
printf("GPU memory: %zu MB free / %zu MB total\n",
free_bytes / (1 << 20), total_bytes / (1 << 20));
// Prevention: allocate all GPU memory at startup, never free
// Use a custom memory pool:
class InferenceMemoryPool {
void* pool_;
size_t pool_size_;
size_t offset_ = 0;
public:
    InferenceMemoryPool(size_t size_mb) {
        pool_size_ = size_mb << 20;
        if (cudaMalloc(&pool_, pool_size_) != cudaSuccess) {
            fprintf(stderr, "Pool allocation of %zu MB failed\n", size_mb);
            pool_ = nullptr;
            pool_size_ = 0;
        }
    }
    void* alloc(size_t bytes) {
        // Align to 256 bytes (matches cudaMalloc's alignment guarantee)
        size_t aligned = (bytes + 255) & ~static_cast<size_t>(255);
if (offset_ + aligned > pool_size_) {
fprintf(stderr, "Pool exhausted!\n");
return nullptr;
}
void* ptr = static_cast<char*>(pool_) + offset_;
offset_ += aligned;
return ptr;
}
void reset() { offset_ = 0; }
~InferenceMemoryPool() { cudaFree(pool_); }
};
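The bump-pointer alignment used by the pool is pure arithmetic and can be checked in isolation (no CUDA required); this quick sketch mirrors the mask in `alloc`:

```python
def align256(n):
    """Round a byte count up to the next multiple of 256."""
    return (n + 255) & ~255

# align256(1) == 256, align256(256) == 256, align256(257) == 512
```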
17.4 Thermal-Induced Slowdowns¶
The Orin Nano throttles CPU and GPU clocks when temperature exceeds limits:
Thermal zones on Orin Nano (T234):

| Zone | Warning | Throttle | Shutdown |
|---|---|---|---|
| CPU | 85 C | 95 C | 105 C |
| GPU | 85 C | 95 C | 105 C |
| SOC | 85 C | 95 C | 105 C |
| TBoard | 80 C | 90 C | 100 C |
# Monitor thermals in real time (POSIX-sh compatible, since watch runs sh -c
# and process substitution would fail there)
watch -n 1 'for z in /sys/class/thermal/thermal_zone*; do \
    echo "$(cat $z/type): $(($(cat $z/temp) / 1000)) C"; done'
# Check if throttling is active
cat /sys/devices/gpu.0/devfreq/17000000.ga10b/trans_stat
# Shows frequency transition history -- frequent changes = throttling
# Mitigation:
# 1. Use an active cooling fan
sudo bash -c 'echo 255 > /sys/devices/pwm-fan/target_pwm'
# 2. Use a heat sink with proper thermal compound
# 3. Set power mode to reduce heat generation
#    (mode IDs vary by module and JetPack release -- check /etc/nvpmodel.conf)
sudo nvpmodel -m 2  # lower-power mode (less heat)
sudo nvpmodel -m 0  # maximum-performance mode (most heat)
# 4. Monitor and adapt -- reduce model complexity when hot
Thermal-aware inference loop:
class ThermalAwareInference:
THROTTLE_TEMP = 82.0 # Start reducing load before HW throttle
    def __init__(self, full_engine, lite_engine):
        self.full_engine = full_engine
        self.lite_engine = lite_engine
        self.use_lite = False
        self.inference_count = 0
def get_max_temp(self):
import glob
max_temp = 0
for zone in glob.glob("/sys/class/thermal/thermal_zone*"):
try:
temp = int(open(f"{zone}/temp").read()) / 1000.0
max_temp = max(max_temp, temp)
except Exception:
pass
return max_temp
def infer(self, input_data):
# Check temperature every 100 inferences
if self.inference_count % 100 == 0:
temp = self.get_max_temp()
if temp > self.THROTTLE_TEMP and not self.use_lite:
print(f"Temperature {temp:.1f}C > {self.THROTTLE_TEMP}C, "
f"switching to lite model")
self.use_lite = True
elif temp < self.THROTTLE_TEMP - 5.0 and self.use_lite:
print(f"Temperature {temp:.1f}C cooled down, "
f"switching to full model")
self.use_lite = False
engine = self.lite_engine if self.use_lite else self.full_engine
self.inference_count += 1
return engine.infer(input_data)
17.5 Clock Scaling Surprises¶
Symptom: First inference after idle is 3-10x slower.
The GPU enters a low-power state during idle, and the first operation triggers a clock ramp-up that can take 5-20 ms.
# Check current GPU power state
cat /sys/devices/gpu.0/power/runtime_status
# "active" or "suspended"
# Prevent GPU from entering low-power state
# Method 1: jetson_clocks (locks all clocks)
sudo jetson_clocks
# Method 2: Disable GPU runtime power management (needs root)
echo on | sudo tee /sys/devices/gpu.0/power/control
# Method 3: Keep a lightweight "heartbeat" kernel running
// Heartbeat kernel to prevent GPU idle
__global__ void heartbeat_kernel() {
// Minimal work -- just enough to keep GPU "active"
if (threadIdx.x == 0) {
volatile int x = 1;
}
}
void gpu_heartbeat_thread(cudaStream_t stream) {
while (running) {
heartbeat_kernel<<<1, 1, 0, stream>>>();
cudaStreamSynchronize(stream);
std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
}
17.6 Debugging Checklist¶
When inference latency is not meeting requirements, work through this checklist:
| Step | Check | Command / Action |
|---|---|---|
| 1 | Is the PREEMPT_RT kernel active? | cat /sys/kernel/realtime |
| 2 | Are clocks locked to maximum? | sudo jetson_clocks --show |
| 3 | Is thermal throttling occurring? | Check thermal zones |
| 4 | Is the GPU being shared? | Disable desktop, check load |
| 5 | Is the right precision mode used? | trtexec --best --verbose |
| 6 | Are CPU cores isolated? | cat /sys/devices/system/cpu/isolated |
| 7 | Is the inference thread RT-scheduled? | chrt -p <pid> |
| 8 | Is memory allocated in the hot path? | Profile with nsys |
| 9 | Are there page faults? | perf stat -e page-faults ./app |
| 10 | Is swap being used? | free -h |
| 11 | Is the input pipeline on the GPU? | Profile the preprocess stage |
| 12 | Is a CUDA graph being used? | Check launch overhead in nsys |
| 13 | Are IRQs hitting inference cores? | Check /proc/interrupts |
| 14 | Was the engine built with a timing cache? | Rebuild with --timingCacheFile |
| 15 | Is warmup sufficient? | Run 100+ warmup iterations |
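Several of these probes can be automated. The sketch below reads the standard sysfs/procfs paths named in the checklist and reports "unknown" where a file is absent (as it will be on non-Jetson or non-RT systems); the `rt_checklist` helper name is illustrative:

```python
def probe(path):
    """Read a sysfs/procfs file, or return None if it does not exist."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

def rt_checklist():
    """Return a list of (check, result) pairs; result may be 'unknown'."""
    checks = []
    rt = probe("/sys/kernel/realtime")  # present and "1" on PREEMPT_RT
    checks.append(("PREEMPT_RT kernel",
                   "yes" if rt == "1" else "unknown" if rt is None else "no"))
    iso = probe("/sys/devices/system/cpu/isolated")
    checks.append(("isolated CPUs", iso if iso else "none/unknown"))
    meminfo = probe("/proc/meminfo") or ""
    swap = {}
    for line in meminfo.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in ("SwapTotal:", "SwapFree:"):
            swap[parts[0]] = int(parts[1])
    if len(swap) == 2:
        checks.append(("swap in use (kB)",
                       swap["SwapTotal:"] - swap["SwapFree:"]))
    return checks

for name, result in rt_checklist():
    print(f"{name}: {result}")
```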
17.7 Complete Production Startup Sequence¶
#!/bin/bash
# production_start.sh -- Full RT inference startup on Orin Nano
set -e
echo "=== Production RT Inference Startup ==="
# 1. Lock clocks
echo "[1/8] Locking clocks..."
sudo jetson_clocks
# 2. Set power mode for maximum performance
#    (verify the mode ID in /etc/nvpmodel.conf for your module)
echo "[2/8] Setting power mode..."
sudo nvpmodel -m 0
# 3. Set fan to maximum
echo "[3/8] Setting fan speed..."
sudo bash -c 'echo 255 > /sys/devices/pwm-fan/target_pwm' 2>/dev/null || \
echo " (no fan control available)"
# 4. Disable desktop if running
echo "[4/8] Disabling desktop compositor..."
sudo systemctl stop gdm3 2>/dev/null || true
# 5. Route IRQs away from the isolated inference cores
#    (CPU isolation itself is set via the isolcpus= boot parameter)
echo "[5/8] Configuring IRQ affinity..."
for irq in $(ls /proc/irq/ | grep -E '^[0-9]+$'); do
    echo 0-3 | sudo tee /proc/irq/$irq/smp_affinity_list > /dev/null 2>&1 || true
done
# 6. Disable THP
echo "[6/8] Disabling transparent huge pages..."
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled > /dev/null
# 7. Drop filesystem caches
echo "[7/8] Dropping caches..."
sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'
# 8. Launch inference with RT priority on isolated cores
echo "[8/8] Launching inference application..."
sudo taskset -c 4,5 chrt -f 80 ./inference_app \
--engine model_fp16.engine \
--warmup 100 \
--deadline-ms 15 \
--camera /dev/video0
echo "=== Startup complete ==="
Appendix A: Quick Reference -- Orin Nano Real-Time Tuning Parameters¶
| Parameter | Default | Real-Time Setting | How to Set |
|---|---|---|---|
| Kernel preemption | PREEMPT | PREEMPT_RT | Rebuild kernel |
| CPU governor | schedutil | performance | cpufreq sysfs |
| GPU clock | Dynamic | Locked max | jetson_clocks |
| CPU isolation | None | isolcpus=4,5 | Boot param |
| IRQ affinity | All cores | Cores 0-3 only | /proc/irq/*/smp_affinity |
| THP | always | never | sysfs |
| Scheduler | SCHED_OTHER | SCHED_FIFO (80) | chrt / sched_setscheduler |
| CUDA stream | Default | High-priority | cudaStreamCreateWithPriority |
| TRT precision | FP32 | FP16 or INT8 | Builder flags |
| Batch size | Model-dependent | 1 | Engine build shapes |
| Memory locking | Disabled | mlockall | mlockall(MCL_CURRENT) |
| Swap | Enabled | Disabled | swapoff -a |
| Desktop | Enabled | Disabled | systemctl stop gdm3 |
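The mlockall row can also be applied from a Python process via ctypes; `MCL_CURRENT=1` and `MCL_FUTURE=2` are the Linux `<sys/mman.h>` values. Without CAP_IPC_LOCK or a generous RLIMIT_MEMLOCK the call returns -1, so this sketch treats failure as advisory rather than fatal:

```python
import ctypes

MCL_CURRENT, MCL_FUTURE = 1, 2   # Linux <sys/mman.h> values

def lock_memory():
    """Try to pin all current and future pages; return True on success."""
    try:
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
    except OSError:
        return False             # not a glibc-based Linux system
    return libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0
```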
Appendix B: Latency Budget Template¶
Application: ___________________________
Target FPS: _____ (frame deadline: _____ ms)
SLA: _____ % within deadline
Stage Budget (ms) Measured p50 Measured p99
-----------------------------------------------------------------
Camera capture ________ ________ ________
Preprocessing ________ ________ ________
H2D transfer ________ ________ ________
Inference ________ ________ ________
D2H transfer ________ ________ ________
Postprocessing ________ ________ ________
Application logic ________ ________ ________
-----------------------------------------------------------------
Total ________ ________ ________
Meets SLA? [ ] Yes [ ] No
Action items: ________________________________________________
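A small helper makes the template arithmetic concrete: derive the frame deadline from the target FPS and check whether the summed per-stage p99s fit inside it. The stage names and numbers below are illustrative placeholders, not measurements:

```python
def frame_deadline_ms(fps):
    """Frame budget implied by a target frame rate."""
    return 1000.0 / fps

def budget_fits(stage_p99_ms, fps):
    """Sum the measured p99 per stage and compare against the frame budget."""
    total = sum(stage_p99_ms.values())
    return total, total <= frame_deadline_ms(fps)

stages = {"capture": 2.0, "preprocess": 1.5, "h2d": 0.5, "inference": 8.0,
          "d2h": 0.3, "postprocess": 1.2, "app_logic": 1.0}
total, ok = budget_fits(stages, fps=30)   # 30 FPS -> 33.3 ms frame budget
# total == 14.5 ms -> fits the 30 FPS budget
```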
Appendix C: TensorRT Version Compatibility¶
| JetPack | L4T | CUDA | TensorRT | DLA Compiler | Notes |
|---|---|---|---|---|---|
| 5.1.2 | 35.4.1 | 11.4 | 8.5.2 | 3.12.0 | Last JP5 |
| 6.0 | 36.3.0 | 12.2 | 8.6.2 | 3.16.0 | First JP6 |
| 6.1 | 36.4.0 | 12.6 | 10.3 | 3.17.0 | Current stable |
Engines built with one TensorRT version are not portable to another by default (TensorRT 8.6 and later offer an opt-in version-compatibility mode, which may cost performance). Always rebuild engines when upgrading JetPack.
End of guide. For updates, refer to the NVIDIA Jetson documentation at https://docs.nvidia.com/jetson/ and the TensorRT documentation at https://docs.nvidia.com/deeplearning/tensorrt/.