Orin Nano 8GB — Deep Learning Accelerator (DLA) Deep Dive

Scope: Production-level understanding of the DLA on Jetson Orin Nano 8GB — hardware architecture, memory interaction, software stack, TensorRT integration, layer support, multi-engine scheduling, performance profiling, and production deployment patterns.

Prerequisites: Familiarity with the Orin Nano memory architecture (CMA, SMMU, zero-copy) and kernel internals (driver model, module loading).

Table of Contents

What Is DLA
DLA on Orin Nano 8GB — Specifications
DLA vs GPU vs CPU — When to Use Each
DLA Hardware Architecture
DLA Memory Interaction
Software Stack — From Model to DLA Execution
TensorRT DLA Integration
Supported Layers and Precision
DLA Execution Flow — Step by Step
Multi-Engine Scheduling (DLA + GPU)
DLA Kernel Driver
DLA Memory Path — Full Pipeline
CPU/GPU/DLA/SMMU/CMA Interaction Diagram
Performance Characteristics
Profiling DLA Workloads
DLA Limitations and Fallback Behavior
Production Deployment Patterns
Common DLA Issues and Solutions
References

1. What Is DLA

DLA (Deep Learning Accelerator) is a dedicated hardware block inside the Tegra SoC designed specifically for neural network inference. It is not a GPU, not a CPU — it is a fixed-function accelerator optimized for low-power, high-efficiency AI workloads.

Key characteristics:

Purpose-built for inference — convolution, pooling, activation, normalization
Power-efficient: achieves higher TOPS/watt than the GPU
Predictable latency: DLA compute is not shared with GPU workloads, although DRAM bandwidth is shared with the rest of the SoC
Runs in parallel with CPU and GPU — true heterogeneous computing

DLA supports inference only, not training. It also supports only a subset of neural network operations; unsupported layers fall back to the GPU when GPU fallback is enabled (see Section 8).

2. DLA on Orin Nano 8GB — Specifications

Specification	Value
Number of DLA engines	1
Peak performance (INT8)	Up to 10 TOPS
Peak performance (FP16)	Up to 5 TFLOPS
Supported precisions	INT8, FP16
On-chip SRAM	Small buffer for weights/activations
Memory access	System DRAM via DMA + SMMU
Power consumption	~1–3 W during inference, versus ~5–8 W on the GPU (see Section 3)

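Note: This guide assumes the target module exposes 1 DLA engine. The engine count varies across the Orin family (NVIDIA's module datasheets list 2 engines for AGX Orin and Orin NX 16GB and 1 for Orin NX 8GB), so confirm what your module provides in its datasheet and by checking for /dev/nvhost-nvdla* device nodes before planning around DLA. Modules with 2 engines can run two models on DLA in parallel.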

3. DLA vs GPU vs CPU — When to Use Each

Feature	CPU	GPU	DLA
Architecture	General-purpose	Massively parallel	Fixed-function AI accelerator
Best workload	OS, control logic	Parallel FP/INT ops	CNN inference
Supported operations	Everything	Everything (CUDA)	Subset (conv, pool, etc.)
Power consumption	High for compute	Medium-high	Low
Latency	Highest	Lowest	Predictable; typically higher than GPU (see Section 14)
Precision	FP32/FP64	FP32/FP16/INT8/TF32	INT8/FP16 only
Programmability	Full (C/C++/Python)	Full (CUDA)	Via TensorRT only
Parallel with others	Yes	Yes	Yes

Decision Matrix

Scenario	Best Engine
Single model, maximum throughput	GPU
Single model, minimum power	DLA
Two models simultaneously	DLA + GPU
Model with many unsupported layers	GPU
Battery-powered device	DLA
Pre/post-processing + inference	CPU + DLA
Real-time video + inference + display	GPU + DLA

The Power Argument

On Orin Nano 8GB (15W TDP total):

Running inference on GPU: GPU consumes ~5–8W, leaving less for CPU and I/O
Running inference on DLA: DLA consumes ~1–3W, freeing power budget for GPU (display, encode) and CPU

In power-constrained systems (battery, solar, thermal-limited enclosures), DLA can be the difference between meeting and missing the power budget.
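The budget arithmetic above can be sketched in a few lines. The 15 W total and the per-engine ranges are the approximate figures quoted in this section; the function name and the simplification that the remaining power is just the total minus the inference engine's draw are assumptions made for illustration.

```python
# Sketch: remaining module power budget for a given inference engine.
# The wattage ranges are the approximate figures quoted above, not measurements.
TOTAL_BUDGET_W = 15.0

ENGINE_DRAW_W = {
    "gpu": (5.0, 8.0),  # approximate GPU inference draw
    "dla": (1.0, 3.0),  # approximate DLA inference draw
}


def remaining_budget(engine: str, total_w: float = TOTAL_BUDGET_W) -> tuple[float, float]:
    """Return (worst_case, best_case) watts left for everything else."""
    low, high = ENGINE_DRAW_W[engine]
    return (total_w - high, total_w - low)


for name in ("gpu", "dla"):
    worst, best = remaining_budget(name)
    print(f"{name}: {worst:.0f}-{best:.0f} W left for CPU, I/O, and other engines")
```

Replace the ranges with measured values from your own workload before using this for real budgeting.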

4. DLA Hardware Architecture

Block Diagram

┌─────────────────────────────────────────────────┐
│                  DLA Engine                      │
│                                                  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │ Convolution│  │ SDP      │  │ PDP      │       │
│  │ Core      │  │ (Single  │  │ (Planar  │       │
│  │           │  │  Data    │  │  Data    │       │
│  │ MAC array │  │  Proc.)  │  │  Proc.)  │       │
│  └─────┬─────┘  └─────┬────┘  └─────┬────┘       │
│        │              │              │            │
│  ┌─────┴──────────────┴──────────────┴─────┐     │
│  │           Internal Data Bus              │     │
│  └─────────────────┬────────────────────────┘     │
│                    │                              │
│  ┌─────────────────┴────────────────────────┐     │
│  │          SRAM Buffer (on-chip)            │     │
│  └─────────────────┬────────────────────────┘     │
│                    │                              │
│  ┌─────────────────┴────────────────────────┐     │
│  │          DMA Engine                       │     │
│  │  (reads/writes tensors from/to DRAM)      │     │
│  └─────────────────┬────────────────────────┘     │
│                    │                              │
└────────────────────┼──────────────────────────────┘
                     │
                     ↓
              SMMU → System DRAM

Core Components

Convolution Core

Contains the MAC (Multiply-Accumulate) array
Heart of DLA — performs convolution, matrix multiplication, deconvolution
Optimized for INT8 and FP16 data types
Supports various kernel sizes (1x1, 3x3, 5x5, 7x7, etc.)
Handles strided and dilated convolutions

SDP (Single Data Processor)

Performs element-wise operations after convolution
Handles: bias addition, batch normalization, ReLU, PReLU, sigmoid, tanh
Operates on the output of the convolution core before writing to memory
Fuses multiple post-convolution operations to avoid memory round-trips

PDP (Planar Data Processor)

Performs pooling operations (max pool, average pool)
Operates on 2D spatial data
Supports various pool sizes and strides

CDP (Channel Data Processor)

Performs channel-wise operations
Local Response Normalization (LRN)
Channel-wise scaling

SRAM Buffer

Small on-chip memory for staging weights and activations
Reduces DRAM bandwidth consumption for frequently accessed data
DLA compiler decides what to cache in SRAM vs. stream from DRAM

DMA Engine

Moves input tensors from DRAM into DLA processing cores
Writes output tensors from DLA back to DRAM
Uses IOVA addresses (mapped through SMMU)
Supports scatter-gather for non-contiguous buffers

5. DLA Memory Interaction

DLA does not have large dedicated memory. It uses system DRAM for all tensor storage.

Memory Flow

System DRAM (8GB LPDDR5, shared with CPU/GPU)
      ↑ ↓
    SMMU (ARM SMMU v2)
      ↑ ↓
  IOVA address space (DLA's view of memory)
      ↑ ↓
  DMA Engine (inside DLA)
      ↑ ↓
  SRAM Buffer (small, on-chip)
      ↑ ↓
  Processing Cores (Conv, SDP, PDP, CDP)

Buffer Allocation

DLA buffers are allocated by the TensorRT runtime / DLA driver:

Input tensors — allocated from CMA (contiguous) or carved-out memory
Weight tensors — loaded from the serialized TensorRT engine file
Intermediate tensors — allocated for layer-to-layer data flow
Output tensors — allocated from CMA, returned to the caller

All buffers are mapped through SMMU so DLA accesses them via IOVA.

Zero-Copy With GPU

When a DLA layer's output feeds into a GPU layer (or vice versa):

DLA output buffer (in DRAM)
   ↓
Same physical pages
   ↓
GPU SMMU maps same pages at different IOVA
   ↓
GPU reads data — no copy needed

TensorRT handles this automatically when building a hybrid DLA+GPU engine.
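The idea of one set of physical pages appearing at different IOVAs in each device's address space can be illustrated with a toy model. This is not how the SMMU driver is implemented; the class, the base addresses, and the page addresses are all hypothetical and exist only to show why no copy is needed.

```python
# Toy model of two devices mapping the same physical pages at different IOVAs.
PAGE_SIZE = 4096


class ToySmmuContext:
    """Per-device IOVA -> physical page table (illustrative only)."""

    def __init__(self, iova_base: int):
        self.next_iova = iova_base
        self.table: dict[int, int] = {}

    def map_pages(self, phys_pages: list[int]) -> int:
        """Map physical pages contiguously in IOVA space; return the base IOVA."""
        base = self.next_iova
        for i, phys in enumerate(phys_pages):
            self.table[base + i * PAGE_SIZE] = phys
        self.next_iova += len(phys_pages) * PAGE_SIZE
        return base

    def translate(self, iova: int) -> int:
        """Translate a device-visible address to a physical address."""
        page = iova - (iova % PAGE_SIZE)
        return self.table[page] + (iova % PAGE_SIZE)


# One buffer backed by three non-contiguous physical pages (made-up addresses).
buffer_pages = [0x8000_0000, 0x8004_2000, 0x8001_9000]

dla_ctx = ToySmmuContext(iova_base=0x1000_0000)
gpu_ctx = ToySmmuContext(iova_base=0x7000_0000)
dla_iova = dla_ctx.map_pages(buffer_pages)
gpu_iova = gpu_ctx.map_pages(buffer_pages)

# Different device-visible addresses, same physical byte: no copy is involved.
offset = PAGE_SIZE + 16
print(hex(dla_iova + offset), hex(gpu_iova + offset))
print(dla_ctx.translate(dla_iova + offset) == gpu_ctx.translate(gpu_iova + offset))
```

The same model also shows why scattered physical pages can look contiguous to a device: contiguity is a property of the IOVA range, not of the backing pages.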

Memory Budget Impact

DLA buffers consume system DRAM just like GPU and CPU allocations:

Component	Typical Memory
Model weights	10–100 MB
Input tensor	1–12 MB
Intermediate	10–50 MB
Output tensor	< 1 MB
Total per model	20–160 MB

On an 8GB system, this is significant. Plan memory budgets across CPU + GPU + DLA workloads.
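A budget check along these lines can be a small script. The sketch below sums the four components from the table for each model and compares the total against a planning budget; the model names, the per-component values, and the 256 MB budget are hypothetical placeholders.

```python
# Sketch: sum per-model DLA memory and compare against a planning budget.
from dataclasses import dataclass


@dataclass
class ModelFootprint:
    name: str
    weights_mb: float
    input_mb: float
    intermediate_mb: float
    output_mb: float

    @property
    def total_mb(self) -> float:
        return self.weights_mb + self.input_mb + self.intermediate_mb + self.output_mb


def fits_budget(models: list[ModelFootprint], budget_mb: float) -> bool:
    """True if the combined footprint of all models stays within budget_mb."""
    return sum(m.total_mb for m in models) <= budget_mb


# Hypothetical models using values inside the typical ranges from the table.
detector = ModelFootprint("detector", weights_mb=25, input_mb=3, intermediate_mb=20, output_mb=0.5)
classifier = ModelFootprint("classifier", weights_mb=80, input_mb=12, intermediate_mb=40, output_mb=0.5)

print(detector.total_mb, classifier.total_mb)
print(fits_budget([detector, classifier], budget_mb=256))
```

In practice the per-component numbers would come from measurements of the built engines, not from the typical ranges.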

6. Software Stack — From Model to DLA Execution

┌──────────────────────────────────────────┐
│  User Application                         │
│  (Python/C++ — inference request)         │
└──────────────────┬───────────────────────┘
                   ↓
┌──────────────────────────────────────────┐
│  TensorRT Runtime                         │
│  (engine deserialization, execution)      │
│  Selects DLA or GPU per layer             │
└──────────────────┬───────────────────────┘
                   ↓
┌──────────────────────────────────────────┐
│  libnvdla (DLA runtime library)           │
│  Programs DLA registers                   │
│  Manages DMA descriptors                  │
│  Handles synchronization                  │
└──────────────────┬───────────────────────┘
                   ↓
┌──────────────────────────────────────────┐
│  Kernel Driver (nvdla.ko / nvhost)        │
│  Allocates CMA buffers                    │
│  Creates SMMU/IOVA mappings               │
│  Submits work to DLA hardware             │
│  Handles completion interrupts            │
└──────────────────┬───────────────────────┘
                   ↓
┌──────────────────────────────────────────┐
│  DLA Hardware                             │
│  Executes neural network layers           │
│  DMA reads/writes tensors from/to DRAM    │
└──────────────────────────────────────────┘

7. TensorRT DLA Integration

Building a DLA-Enabled Engine

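import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse ONNX model and stop on parser errors
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse model.onnx")

config = builder.create_builder_config()
# TensorRT 8.4+ API; older releases used config.max_workspace_size
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB

# Enable DLA
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0  # Use DLA core 0

# Allow GPU fallback for unsupported layers
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

# Use FP16 or INT8 (DLA does not support FP32)
config.set_flag(trt.BuilderFlag.FP16)
# Or for INT8 (MyCalibrator is a user-supplied calibrator class):
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = MyCalibrator()

# build_serialized_network replaces the deprecated build_engine
serialized_engine = builder.build_serialized_network(network, config)
if serialized_engine is None:
    raise RuntimeError("Engine build failed")

# Save the serialized engine
with open("model_dla.engine", "wb") as f:
    f.write(serialized_engine)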

Key TensorRT DLA Options

Option	Purpose
`default_device_type`	Set all layers to DLA by default
`DLA_core`	Select which DLA engine (0 on a single-engine module; 0 or 1 where two engines exist)
`GPU_FALLBACK`	Allow unsupported layers to run on GPU
`FP16` / `INT8`	DLA requires reduced precision
`set_device_type(layer, device_type)`	Override per-layer device assignment

Per-Layer Device Assignment

For fine-grained control, assign specific layers to DLA or GPU:

for i in range(network.num_layers):
    layer = network.get_layer(i)
    # can_run_on_DLA checks the layer against DLA constraints for this config
    if config.can_run_on_DLA(layer):
        config.set_device_type(layer, trt.DeviceType.DLA)
    else:
        config.set_device_type(layer, trt.DeviceType.GPU)

Inspecting DLA Layer Assignment

After building the engine, check which layers run on DLA:

# Use trtexec to build and profile
trtexec --onnx=model.onnx --useDLACore=0 --fp16 --allowGPUFallback --verbose

# The verbose build log reports which layers run on DLA and which fall back to GPU.
# Illustrative summary (the actual log format varies by TensorRT version):
# Layer: conv1 ... Device: DLA
# Layer: unsupported_op ... Device: GPU (fallback)

8. Supported Layers and Precision

DLA-Supported Operations

Operation	Supported	Notes
Convolution (2D)	Yes	All kernel sizes, strides, dilation
Deconvolution	Yes	Transposed convolution
Fully connected	Yes	Via 1x1 convolution
Pooling (max, avg)	Yes	Various sizes and strides
ReLU	Yes	Fused with convolution in SDP
PReLU / Leaky ReLU	Yes	Fused in SDP
Sigmoid	Yes	Via SDP
Tanh	Yes	Via SDP
Batch Normalization	Yes	Fused with convolution in SDP
Element-wise (add/mul)	Yes	Via SDP
Concatenation	Yes	Channel concatenation
Softmax	Limited	May fall back to GPU
Resize / Upsample	Limited	Nearest-neighbor only, some constraints
Transpose	No	Falls back to GPU
Attention (QKV)	No	Falls back to GPU
Custom plugins	No	DLA only runs compiled operations
Dynamic shapes	No	Fixed input dimensions required

Precision Support

Precision	DLA Support	Notes
FP32	No	Must convert to FP16 or INT8
FP16	Yes	Default for DLA
INT8	Yes	Requires calibration data
TF32	No	GPU-only feature
BF16	No	Not supported on Orin DLA

What Happens With Unsupported Layers

If GPU_FALLBACK is enabled:

DLA layers → DLA engine
   ↓
Unsupported layer → data transfer → GPU
   ↓
GPU executes layer
   ↓
Next DLA layer → data transfer → back to DLA

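Each DLA↔GPU transition involves a memory synchronization: a sync fence, not a data copy, when zero-copy buffers are used. Each one costs roughly 0.1–0.5 ms (see Section 16), so a model that switches engines many times accumulates measurable latency.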
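To make the cost of transitions concrete, the sketch below counts engine switches in a per-layer device assignment and multiplies by an assumed per-transition cost. The layer lists are hypothetical, and the per-transition figure is a parameter because it depends on the platform and workload.

```python
# Sketch: count DLA<->GPU transitions in a per-layer device assignment.
def count_transitions(devices: list[str]) -> int:
    """Number of adjacent layer pairs that run on different devices."""
    return sum(1 for prev, cur in zip(devices, devices[1:]) if prev != cur)


def sync_overhead_ms(devices: list[str], per_transition_ms: float) -> float:
    """Estimated total sync cost; per_transition_ms is an assumed figure."""
    return count_transitions(devices) * per_transition_ms


# Hypothetical assignments for the same six layers.
interleaved = ["DLA", "GPU", "DLA", "GPU", "DLA", "DLA"]
grouped = ["DLA", "DLA", "DLA", "DLA", "GPU", "GPU"]

print(count_transitions(interleaved), count_transitions(grouped))
print(sync_overhead_ms(interleaved, per_transition_ms=0.5))
```

The interleaved assignment switches engines four times and the grouped one only once, which is why grouping unsupported layers together matters.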

Maximizing DLA Utilization

Use architectures with DLA-friendly operations (convolution-heavy CNNs)
Avoid attention mechanisms, dynamic shapes, and custom operations
Fuse batch normalization into convolution before export
Use INT8 for maximum DLA throughput
Minimize DLA↔GPU transitions by grouping unsupported layers
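Folding batch normalization into the preceding convolution is plain arithmetic on the weights and bias. The sketch below shows it for a single output channel with the kernel flattened to a list; it uses the standard batch-norm formula and is independent of any export tool. The helper names and sample values are illustrative.

```python
# Sketch: fold batch-norm parameters into a convolution's weights and bias
# for one output channel (kernel flattened to a list).
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (folded_weights, folded_bias) so that BN(conv(x)) == conv_folded(x)."""
    scale = gamma / math.sqrt(var + eps)
    folded_weights = [w * scale for w in weights]
    folded_bias = (bias - mean) * scale + beta
    return folded_weights, folded_bias


def conv_channel(x, weights, bias):
    """Dot product of one receptive field with one channel's kernel, plus bias."""
    return sum(xi * wi for xi, wi in zip(x, weights)) + bias


# Check equivalence on arbitrary values.
x = [0.5, -1.0, 2.0]
w, b = [0.2, 0.4, -0.1], 0.3
gamma, beta, mean, var = 1.5, 0.1, 0.05, 0.8

reference = gamma * (conv_channel(x, w, b) - mean) / math.sqrt(var + 1e-5) + beta
fw, fb = fold_batchnorm(w, b, gamma, beta, mean, var)
print(abs(conv_channel(x, fw, fb) - reference) < 1e-9)
```

After folding, the network has one fewer layer per convolution, and the result matches the unfused pair up to floating-point rounding.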

9. DLA Execution Flow — Step by Step

Example: Image Classification (ResNet50 on DLA)

Step 1: Application submits inference request
        Input: 224x224x3 image tensor (FP16)
             ↓
Step 2: TensorRT runtime selects DLA engine
        Deserializes compiled DLA loadable
             ↓
Step 3: libnvdla programs DLA registers
        Sets up DMA descriptors for input/weights/output
        All addresses are IOVA (mapped through SMMU)
             ↓
Step 4: DLA DMA engine reads input tensor from DRAM
        Streams data into on-chip SRAM buffer
             ↓
Step 5: Convolution core processes layer 1
        MAC array performs conv2d on input
        SDP applies batch norm + ReLU (fused)
        PDP applies max pooling
             ↓
Step 6: Intermediate result written to DRAM (or kept in SRAM)
        Next layer reads from previous output
             ↓
Step 7: Repeat for all DLA-compatible layers
        Conv → BN → ReLU → Pool → Conv → ...
             ↓
Step 8: If unsupported layer encountered:
        DLA writes intermediate to DRAM
        GPU reads same physical pages (zero-copy via SMMU)
        GPU executes unsupported layer(s)
        DLA reads GPU output and continues
             ↓
Step 9: Final output tensor written to DRAM via DMA
             ↓
Step 10: DLA raises completion interrupt
         Kernel driver signals userspace
         Application reads output tensor

10. Multi-Engine Scheduling (DLA + GPU)

Heterogeneous Execution

On Orin Nano, you can run workloads on DLA and GPU simultaneously:

┌─────────────────────────────────────────┐
│               Time →                     │
│                                          │
│  DLA:  [Model A inference ][Model A    ] │
│  GPU:  [Model B inference      ][post] │
│  CPU:  [pre-process][      ][result]    │
│                                          │
└─────────────────────────────────────────┘

Pipeline Architecture

For real-time video inference:

Frame N:    CPU pre-process → DLA inference → CPU post-process
Frame N+1:  CPU pre-process → GPU inference → CPU post-process
Frame N+2:  CPU pre-process → DLA inference → CPU post-process

DLA and GPU alternate or run different models in parallel.
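The alternating-frame schedule can be sketched with one worker thread per engine. The two `run_on_*` functions are hypothetical placeholders for real inference calls; the point is only the dispatch pattern, in which each engine processes its own frames in order while the two engines overlap.

```python
# Sketch: alternate frames between two engines and run the engines concurrently.
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def run_on_dla(frame_id: int) -> str:
    # Placeholder for a DLA-backed inference call (hypothetical).
    return f"frame {frame_id}: DLA"


def run_on_gpu(frame_id: int) -> str:
    # Placeholder for a GPU-backed inference call (hypothetical).
    return f"frame {frame_id}: GPU"


def dispatch(frame_ids):
    """Send frames alternately to the DLA lane and the GPU lane."""
    # One single-worker pool per engine keeps each engine's work serialized
    # while the two engines run concurrently.
    with ThreadPoolExecutor(max_workers=1) as dla_pool, \
            ThreadPoolExecutor(max_workers=1) as gpu_pool:
        lanes = cycle([(dla_pool, run_on_dla), (gpu_pool, run_on_gpu)])
        futures = []
        for frame_id in frame_ids:
            pool, fn = next(lanes)
            futures.append(pool.submit(fn, frame_id))
        return [f.result() for f in futures]  # results in submission order


results = dispatch(range(4))
print(results)
```

A strict round-robin like this works best when both engines take similar time per frame; otherwise a work queue that feeds whichever engine is idle may balance better.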

TensorRT Multi-Stream Execution

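import tensorrt as trt
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda

# engine_dla / engine_gpu are deserialized engines; bindings_dla / bindings_gpu
# are lists of device buffer addresses prepared by the caller.

# Create two execution contexts
context_dla = engine_dla.create_execution_context()
context_gpu = engine_gpu.create_execution_context()

# Create two CUDA streams
stream_dla = cuda.Stream()
stream_gpu = cuda.Stream()

# Execute in parallel (execute_async_v2 is the TensorRT 8.x API)
context_dla.execute_async_v2(bindings_dla, stream_dla.handle)
context_gpu.execute_async_v2(bindings_gpu, stream_gpu.handle)

# Wait for both
stream_dla.synchronize()
stream_gpu.synchronize()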

Benefits of DLA + GPU Parallelism

Metric	GPU Only	DLA Only	DLA + GPU
Throughput	1x	0.5–0.8x	1.3–1.8x
Power	High	Low	Medium
Latency	Lower per inference	Higher per inference (see Section 14)	Medium (pipelined)
GPU availability for other tasks	None during inference	Full	Partial

Running inference on DLA frees the GPU for:

Display rendering
Video encode/decode (NVENC/NVDEC)
Additional CUDA workloads
GStreamer processing

11. DLA Kernel Driver

Driver Architecture

The DLA kernel driver is part of the nvhost subsystem:

/dev/nvhost-nvdla0        ← DLA engine 0 device node
/dev/nvhost-nvdla1        ← DLA engine 1 (if present)

Driver Responsibilities

Responsibility	Details
Buffer allocation	Allocates CMA buffers for DLA DMA
SMMU mapping	Maps physical pages to DLA's IOVA space
Work submission	Programs DLA registers, starts execution
Interrupt handling	Receives completion interrupt, signals userspace
Power management	Clock gating, power gating when idle
Error handling	Detects DLA faults, reports to userspace

Module Loading

# Check if DLA driver is loaded as a module
# (no output is expected if the driver is built into the kernel)
lsmod | grep nvdla
# Example: nvdla                  12345  0

# Check device nodes
ls /dev/nvhost-nvdla*
# /dev/nvhost-nvdla0

# Check driver messages
sudo dmesg | grep nvdla
# Example: nvdla 15880000.nvdla0: probed

DLA Clock and Power

DLA has its own clock domain managed by BPMP:

# Check DLA clock rate (debugfs requires root)
sudo grep -i dla /sys/kernel/debug/clk/clk_summary

# DLA power domain (debugfs layout may differ between L4T releases)
sudo sh -c 'cat /sys/kernel/debug/bpmp/debug/regulator/*/name' | grep dla

DLA is power-gated when idle — it consumes near-zero power when not executing inference.

12. DLA Memory Path — Full Pipeline

Complete Data Flow

1. TensorRT allocates input buffer
   └→ dma_alloc_coherent() → CMA region → physical pages
   └→ SMMU maps pages → IOVA for DLA

2. Application fills input buffer (e.g., camera frame)
   └→ If from camera: DMA-BUF import (zero-copy from ISP)
   └→ If from CPU: memcpy into mapped buffer

3. TensorRT submits inference to DLA
   └→ libnvdla programs DMA descriptors with IOVAs
   └→ ioctl to /dev/nvhost-nvdla0

4. Kernel driver submits work
   └→ Writes to DLA control registers
   └→ DLA starts execution

5. DLA DMA engine reads input tensor
   └→ IOVA → SMMU translation → physical DRAM
   └→ Streams into on-chip SRAM

6. DLA processes layers
   └→ Conv core → SDP → PDP (all on-chip)
   └→ Intermediate results: SRAM or spill to DRAM

7. DLA DMA engine writes output tensor
   └→ IOVA → SMMU → physical DRAM

8. DLA raises IRQ
   └→ Kernel driver handles interrupt
   └→ Signals completion to userspace

9. Application reads output
   └→ Same mapped buffer (zero-copy to CPU)
   └→ Or GPU reads same pages (zero-copy via GPU SMMU)

13. CPU/GPU/DLA/SMMU/CMA Interaction Diagram

┌─────────────────────────────────────────────────────────────┐
│                        8GB LPDDR5                            │
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│  │ OS/Kernel │  │ CUDA     │  │ CMA      │  │ Carve-   │    │
│  │ Memory   │  │ Memory   │  │ Region   │  │ outs     │    │
│  │          │  │ (GPU)    │  │          │  │ (BPMP,   │    │
│  │          │  │          │  │          │  │  OP-TEE) │    │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └──────────┘    │
│       │              │              │                        │
└───────┼──────────────┼──────────────┼────────────────────────┘
        │              │              │
   ┌────┴────┐    ┌────┴────┐    ┌────┴────────────────────┐
   │  CPU    │    │  GPU    │    │  SMMU                    │
   │  MMU    │    │  SMMU   │    │  (shared by DLA, ISP,   │
   │         │    │  context│    │   VI, NVENC, NVDEC)      │
   │  VA→PA  │    │  IOVA→PA│    │  IOVA→PA                │
   └────┬────┘    └────┬────┘    └────┬────────┬───────────┘
        │              │              │        │
   ┌────┴────┐    ┌────┴────┐    ┌────┴───┐ ┌──┴──────┐
   │  CPU    │    │  GPU    │    │  DLA   │ │  ISP    │
   │  A78AE  │    │ Ampere  │    │ Engine │ │  Camera │
   │  cores  │    │ 1024    │    │        │ │  pipe   │
   │         │    │ cores   │    │ MAC +  │ │         │
   │         │    │         │    │ SRAM + │ │         │
   │         │    │         │    │ DMA    │ │         │
   └─────────┘    └─────────┘    └────────┘ └─────────┘

Data Flow Examples

Camera → DLA inference (zero-copy):

Camera sensor
 → NVCSI → VI → ISP
 → CMA buffer (physical pages)
 → SMMU maps to ISP IOVA (write) and DLA IOVA (read)
 → DLA reads same physical pages — zero copy
 → DLA output to CMA
 → CPU reads result — zero copy

Camera → GPU inference → DLA post-processing:

Camera → CMA buffer
 → GPU SMMU maps buffer → GPU inference
 → GPU output to CUDA memory
 → DLA SMMU maps same pages → DLA post-processing
 → DLA output to CMA → CPU reads result

DLA + GPU parallel inference (two models):

Input → CMA buffer
 ├→ DLA SMMU → DLA runs Model A
 └→ GPU SMMU → GPU runs Model B
Both access DRAM through SMMU, no copies between them
Results available simultaneously

14. Performance Characteristics

Typical Inference Latency (Orin Nano 8GB)

Model	Precision	Engine	Batch	Latency	Power
ResNet-50	INT8	DLA	1	~5–8 ms	~1.5W
ResNet-50	INT8	GPU	1	~3–5 ms	~5W
ResNet-50	FP16	DLA	1	~8–12 ms	~2W
MobileNetV2	INT8	DLA	1	~2–4 ms	~1W
MobileNetV2	INT8	GPU	1	~1–3 ms	~3W
YOLOv5s	FP16	DLA	1	~15–25 ms	~2.5W
YOLOv5s	FP16	GPU	1	~8–12 ms	~6W

Throughput vs Power Efficiency

Metric	GPU	DLA
Raw throughput (FPS)	Higher	Lower
TOPS per watt	Lower	Higher
Frames per joule	Lower	Higher

DLA wins on efficiency (TOPS/watt), GPU wins on raw speed. Choose based on your constraint — power budget or throughput target.
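The trade-off can be worked through from the latency table. The sketch below converts latency and power into frames per second and frames per joule, using the midpoints of the ResNet-50 INT8 ranges; it assumes back-to-back batch-1 inference and that the quoted power covers only the inference engine, both of which are simplifications.

```python
# Sketch: convert latency and power figures into FPS and frames per joule.
# Assumes back-to-back batch-1 inference and that the power figure covers
# only the inference engine.


def fps(latency_ms: float) -> float:
    return 1000.0 / latency_ms


def frames_per_joule(latency_ms: float, power_w: float) -> float:
    # Watts are joules per second, so FPS / W gives frames per joule.
    return fps(latency_ms) / power_w


# ResNet-50 INT8 rows from the table, using the midpoint of each latency range.
rows = {"DLA": (6.5, 1.5), "GPU": (4.0, 5.0)}
for engine, (latency_ms, power_w) in rows.items():
    print(f"{engine}: {fps(latency_ms):.0f} FPS, "
          f"{frames_per_joule(latency_ms, power_w):.0f} frames/J")
```

With these inputs the GPU reaches about 250 FPS at roughly 50 frames per joule, while the DLA reaches about 154 FPS at roughly 103 frames per joule: lower raw speed, about twice the efficiency.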

INT8 vs FP16 on DLA

Precision	Throughput	Accuracy	Calibration Required
FP16	1x	Higher	No
INT8	~2x	Lower	Yes (calibration dataset)

INT8 roughly doubles DLA throughput. Use INT8 when accuracy loss is acceptable (test with your dataset).
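Calibration is needed because INT8 inference must map each tensor's value range onto 8-bit integers, and the mapping determines how much accuracy is lost. The sketch below uses a simplified max-abs rule to pick the scale. This rule is an assumption chosen for illustration and is not the calibration algorithm TensorRT applies; the sample values are made up.

```python
# Sketch: symmetric INT8 quantization with max-abs calibration.
# TensorRT's calibrators are more sophisticated; this only shows the idea.


def calibrate_scale(samples: list[float]) -> float:
    """Choose a scale so the largest calibration magnitude maps to 127."""
    return max(abs(v) for v in samples) / 127.0


def quantize(value: float, scale: float) -> int:
    """Round to the nearest INT8 step and clamp to [-127, 127]."""
    return max(-127, min(127, round(value / scale)))


def dequantize(q: int, scale: float) -> float:
    return q * scale


# Hypothetical activation values seen during calibration.
calibration_samples = [-2.0, -0.5, 0.1, 0.75, 1.27]
scale = calibrate_scale(calibration_samples)

for value in (0.3, 1.0, 5.0):  # 5.0 lies outside the calibrated range
    q = quantize(value, scale)
    print(f"{value} -> {q} -> {dequantize(q, scale):.4f}")
```

Values inside the calibrated range come back with an error of at most half a quantization step, while values outside it are clipped. This is why an unrepresentative calibration set degrades accuracy.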

15. Profiling DLA Workloads

trtexec — Quick Profiling

# Profile DLA inference
trtexec \
    --onnx=model.onnx \
    --useDLACore=0 \
    --fp16 \
    --allowGPUFallback \
    --verbose \
    --iterations=100

# Illustrative summary of what to look for (not literal trtexec output;
# the log format varies by TensorRT version):
# [DLA] Layer conv1: 1.2ms
# [GPU] Layer unsupported_op: 0.5ms (fallback)
# [DLA] Layer conv2: 0.8ms
# Total: 5.3ms
# DLA utilization: 78%

Nsight Systems — Detailed Timeline

# --accelerator-trace=nvmedia enables DLA activity tracing
nsys profile --trace=cuda,nvtx,osrt --accelerator-trace=nvmedia \
    trtexec --loadEngine=model_dla.engine --iterations=50

# Open in Nsight Systems GUI
# Shows:
# - DLA execution blocks
# - GPU fallback blocks
# - DMA transfers
# - CPU overhead
# - DLA↔GPU sync points

DLA-Specific Metrics

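# Monitor system load while DLA inference runs
tegrastats --interval 1000
# Reports RAM, CPU, GPU, temperature, and power-rail readings; whether
# DLA-specific fields appear depends on the L4T release

# DLA clock frequency (debugfs requires root)
sudo grep -i dla /sys/kernel/debug/clk/clk_summary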

Identifying Bottlenecks

Symptom	Cause	Solution
Low DLA utilization	Many GPU fallback layers	Use DLA-friendly architecture
High DLA-GPU transition time	Frequent engine switches	Group DLA/GPU layers contiguously
DLA latency higher than GPU	Model too small for DLA	Use GPU instead (overhead > benefit)
Inconsistent DLA latency	Memory bandwidth contention	Reduce concurrent DRAM access
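The first two symptoms in the table can be checked from a per-layer timing list. The sketch below computes the share of layer time spent on DLA. The tuple format is a hypothetical structure you would fill in from your own profiling output, and the values are the illustrative per-layer timings from the trtexec example in this section.

```python
# Sketch: share of per-layer execution time spent on DLA.
# Each entry is (layer_name, device, milliseconds); the format is hypothetical.


def dla_time_share(profile: list[tuple[str, str, float]]) -> float:
    """Fraction of the summed layer time that ran on DLA (0.0 if empty)."""
    total = sum(ms for _, _, ms in profile)
    dla = sum(ms for _, device, ms in profile if device == "DLA")
    return dla / total if total else 0.0


def fallback_layers(profile: list[tuple[str, str, float]]) -> list[str]:
    """Names of layers that did not run on DLA."""
    return [name for name, device, _ in profile if device != "DLA"]


profile = [
    ("conv1", "DLA", 1.2),
    ("unsupported_op", "GPU", 0.5),
    ("conv2", "DLA", 0.8),
]

print(f"DLA share of layer time: {dla_time_share(profile):.0%}")
print("Fallback layers:", fallback_layers(profile))
```

A low share combined with a long fallback list points at the model architecture, not at the DLA itself.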

16. DLA Limitations and Fallback Behavior

Hard Limitations

No FP32 — must use FP16 or INT8
No dynamic shapes — input dimensions must be fixed at build time
No custom CUDA plugins — DLA only runs compiled operations
Limited layer support — see Section 8 for full list
Limited batch support: batch sizes above 1 are not supported on all layers
No in-place operations — every operation writes to a new buffer

Fallback Behavior

When GPU_FALLBACK is enabled and a layer is not supported on DLA:

TensorRT builds a hybrid engine with DLA and GPU sections
At runtime, DLA executes its layers, then synchronizes
GPU picks up the unsupported layers
After GPU finishes, DLA resumes (if more DLA layers follow)

Each DLA→GPU→DLA transition adds:

Memory synchronization overhead (~0.1–0.5 ms)
Context switch overhead
No data copy (zero-copy via SMMU)

When NOT to Use DLA

Transformer-heavy models (attention is not supported)
Models with many custom operations
Models requiring FP32 precision
Very small models where DLA setup overhead exceeds computation
Latency-critical single-model inference where GPU is faster

17. Production Deployment Patterns

Pattern 1: DLA-Only Inference (Maximum Power Efficiency)

Camera → pre-process (CPU) → DLA inference → post-process (CPU) → output
GPU: idle / display only

Best for: battery-powered devices, thermal-constrained systems, always-on monitoring.

Pattern 2: DLA + GPU Pipeline (Maximum Throughput)

Camera → pre-process (CPU)
   ├→ DLA: detection model (lightweight, e.g., MobileNet-SSD)
   └→ GPU: classification model (heavier, e.g., ResNet-50)
Both run simultaneously on alternating frames or different ROIs.

Best for: multi-model systems, video analytics with detection + classification.

Pattern 3: DLA Primary, GPU Fallback (Balanced)

Full model compiled for DLA with GPU fallback enabled.
DLA handles convolutions, GPU handles unsupported ops.
TensorRT manages transitions automatically.

Best for: single-model deployment where some layers are unsupported.

Pattern 4: DLA for Always-On + GPU for On-Demand

DLA: continuously running lightweight detection (person, vehicle)
GPU: idle, wakes up for heavy processing when DLA detects event
   → GPU runs detailed classification, tracking, or segmentation
   → GPU returns to idle

Best for: surveillance, smart cameras, event-driven systems. Minimizes average power.
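The control flow of Pattern 4 is a cheap always-on stage that gates an expensive on-demand stage. The sketch below shows that gating with plain callables. The detector and analyzer are hypothetical stand-ins for DLA and GPU inference calls, and the frame dictionaries and the 0.5 threshold are made up.

```python
# Sketch of Pattern 4: a cheap always-on stage gates an expensive on-demand stage.
from typing import Callable, Iterable


def process_stream(
    frames: Iterable[dict],
    detect: Callable[[dict], bool],
    analyze: Callable[[dict], str],
) -> tuple[list[str], int]:
    """Run detect on every frame; run analyze only where detect fires."""
    results, heavy_runs = [], 0
    for frame in frames:
        if detect(frame):                   # always-on stage (DLA in Pattern 4)
            results.append(analyze(frame))  # on-demand stage (GPU in Pattern 4)
            heavy_runs += 1
    return results, heavy_runs


# Hypothetical stand-ins for the two engines.
def dla_detect(frame: dict) -> bool:
    return frame["person_score"] > 0.5


def gpu_analyze(frame: dict) -> str:
    return f"frame {frame['id']}: detailed analysis"


frames = [
    {"id": 0, "person_score": 0.1},
    {"id": 1, "person_score": 0.9},
    {"id": 2, "person_score": 0.2},
    {"id": 3, "person_score": 0.7},
]
results, heavy_runs = process_stream(frames, dla_detect, gpu_analyze)
print(heavy_runs, "of", len(frames), "frames needed the GPU")
```

The lower the event rate, the more of the time the GPU stays idle, which is where the average power saving comes from.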

Engine Serialization for Production

Always serialize (save) TensorRT engines for production deployment:

# Build once (slow: compile time)
# build_serialized_network replaces the deprecated build_engine
serialized_engine = builder.build_serialized_network(network, config)

# Save the serialized engine
with open("model_dla_int8.engine", "wb") as f:
    f.write(serialized_engine)

# Deploy: deserialize (fast: load time)
runtime = trt.Runtime(logger)
with open("model_dla_int8.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

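Engine files are device-specific: an engine built for Orin Nano will not work on Orin NX or AGX Orin, and a serialized engine generally has to be loaded by the same TensorRT version that built it. Rebuild for each target and after each TensorRT or JetPack upgrade.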
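One way to avoid loading a mismatched engine is to encode the build target in the file name. The sketch below is a hypothetical naming convention, not a TensorRT feature; the function name, its parameters, and the example values are all assumptions.

```python
# Sketch: derive an engine file name that encodes everything the engine is tied to.
# This naming scheme is a hypothetical convention, not part of TensorRT.


def engine_cache_name(model: str, device: str, trt_version: str,
                      precision: str, dla_core: int) -> str:
    """Build a file name that differs whenever the build target differs."""
    safe_device = device.lower().replace(" ", "-")
    return f"{model}_{safe_device}_trt{trt_version}_{precision}_dla{dla_core}.engine"


name = engine_cache_name("resnet50", "Orin Nano 8GB", "8.5.2", "int8", 0)
print(name)
```

At startup the application can compute the expected name for the running device and TensorRT version, then rebuild the engine if no matching file exists.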

18. Common DLA Issues and Solutions

DLA Engine Build Fails

[TensorRT] ERROR: Layer X is not supported on DLA

Cause: Model contains unsupported operations.

Solution: Enable GPU_FALLBACK, or modify the model to use DLA-friendly operations.

DLA Inference Slower Than GPU

Cause: Too many DLA↔GPU transitions, or model is too small for DLA overhead.

Solution: Profile with trtexec --verbose to count transitions. If many, consider GPU-only. If model is tiny, GPU is likely faster.

DLA Accuracy Degradation (INT8)

Cause: Poor INT8 calibration or sensitive layers quantized incorrectly.

Solution:

Use a representative calibration dataset (>500 images)
Use per-channel quantization instead of per-tensor
Keep sensitive layers (first/last conv, skip connections) in FP16

DLA Buffer Allocation Failure

nvdla: failed to allocate buffer

Cause: CMA exhaustion or fragmentation.

Solution: See Memory Architecture Guide — CMA for CMA sizing and fragmentation mitigation.

DLA Not Detected

ls /dev/nvhost-nvdla*
# No output

Cause: DLA device tree node disabled, driver not loaded, or JetPack version mismatch.

Solution: Check dmesg | grep nvdla, verify DTB has DLA nodes enabled, ensure nvdla.ko is loaded.
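The device-node check can also be done from application code at startup, so the application can fall back to GPU-only inference when no DLA is present. The sketch below uses only the standard library; the function name is illustrative, and the default pattern is the device path used throughout this guide.

```python
# Sketch: detect DLA device nodes at application startup.
import glob


def find_dla_nodes(pattern: str = "/dev/nvhost-nvdla*") -> list[str]:
    """Return the sorted list of device nodes matching the DLA pattern."""
    return sorted(glob.glob(pattern))


nodes = find_dla_nodes()
if nodes:
    print("DLA device nodes:", nodes)
else:
    print("No DLA device nodes found; use GPU-only inference")
```

The presence of a node only shows that the driver probed the device. A failed engine build or inference call still needs its own error handling.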

19. References

NVIDIA TensorRT — DLA Documentation — official DLA integration guide
NVIDIA TensorRT — DLA Supported Layers — layer support matrix
NVIDIA Jetson Linux — DLA — Jetson DLA documentation
trtexec Reference — TensorRT command-line profiling tool
Main guide: Nvidia Jetson Platform Guide
Memory deep dive: Orin Nano Memory Architecture
Kernel internals: Orin Nano Kernel Internals