
NVIDIA TAO Toolkit — Concrete Guide

Goal: Use NVIDIA TAO Toolkit to take a pre-trained model from NGC, fine-tune it on a custom dataset, prune and quantize it, export to TensorRT, and deploy on Jetson Orin Nano — without training from scratch.


Table of Contents

  1. What is TAO Toolkit?
  2. Why TAO Instead of Training from Scratch?
  3. Installation
  4. Pre-Trained Models from NGC
  5. Dataset Preparation
  6. Transfer Learning — Fine-Tuning a Model
  7. Pruning
  8. Quantization-Aware Training (QAT)
  9. Exporting to TensorRT
  10. Deploying on Jetson Orin Nano
  11. DeepStream Integration
  12. TAO vs Manual Pipeline Comparison
  13. Projects
  14. Resources

1. What is TAO Toolkit?

TAO = Train, Adapt, Optimize — NVIDIA's no-code/low-code framework for building production-grade AI models with transfer learning, without needing to write training loops from scratch.

TAO Toolkit Architecture:
─────────────────────────────────────────────────────────────────
NGC Model Zoo                                     Your Custom Model
(pre-trained weights)                             (optimized for edge)
       │                                                  │
       ▼                                                  ▲
  tao download ──► tao train ──► tao prune ──► tao export
                   (fine-tune)   (compress)   (TensorRT engine)
               Your custom dataset
               (KITTI / COCO format)
─────────────────────────────────────────────────────────────────

What TAO provides:

Feature                  What it does
NGC pre-trained models   ResNet, EfficientNet, YOLO, SSD with ImageNet/COCO weights
Transfer learning        Freeze backbone, train detection head on your data
Auto-pruning             Magnitude-based channel pruning with one command
QAT                      Fake quantization during training → INT8 TRT engine
TensorRT export          Generates calibrated .engine files directly
DeepStream plugins       Drop-in nvinfer config for video pipelines

TAO Toolkit Versions

Version            Key feature                           Python API
TAO 3.x            PyTorch + TF2 backends, Docker-based  tao_toolkit
TAO 4.x            Unified launcher, 100+ models         nvidia-tao pip
TAO 5.x (current)  Cloud-native, Triton support          nvidia-tao pip

2. Why TAO Instead of Training from Scratch?

The Core Problem

Training a detection model from scratch requires:

From scratch:
  Dataset needed:  100,000–1,000,000 labeled images
  GPU time:        72–200 hours (A100)
  Cost:            $500–5,000 cloud compute
  Expertise:       Deep learning researcher

TAO transfer learning:
  Dataset needed:  500–5,000 labeled images       ← 100× less data
  GPU time:        2–8 hours (RTX 3080)           ← 10× faster
  Cost:            $10–50 cloud compute            ← 100× cheaper
  Expertise:       Engineer following docs

Why It Works: Transfer Learning

Pre-trained backbones have learned to detect edges → textures → shapes → objects. Your fine-tuning only needs to adjust the final layers to recognize your specific objects.

ResNet-50 trained on ImageNet (1.2M images, 1000 classes):
  Layer 1:  detects edges, gradients
  Layer 2:  detects corners, textures
  Layer 3:  detects patterns (wheels, faces, windows)
  Layer 4:  detects object parts (car front, person torso)
  Layer 5:  classifies ImageNet objects  ← replace this

After TAO fine-tuning on 2,000 custom images:
  Layers 1-4: frozen (ImageNet features still useful)
  Layer 5:  retrained on "forklift", "pallet", "person"

Result: 94% mAP after 4 hours of training
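
TAO drives this through the freeze_blocks option in the training spec (Section 6), but the idea is easy to see in plain PyTorch. A minimal sketch, not TAO internals: torchvision's ResNet-50 with a new 3-class head stands in for the real model.

import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained ResNet-50 and freeze the whole backbone
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False

# Replace the final classifier with a fresh head for our 3 classes.
# New parameters default to requires_grad=True, so only this head
# trains while the earlier layers keep their ImageNet features.
model.fc = nn.Linear(model.fc.in_features, 3)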

TAO Optimization Pipeline Output

Original ResNet-50 + SSD head:
  Model size:      98 MB
  Jetson FPS:      8 FPS (FP32)

After TAO:
  Prune 60%:       39 MB
  Retrain + QAT:   ACC within 1% of baseline
  Export INT8:     9.8 MB TRT engine
  Jetson FPS:      47 FPS (INT8 TRT)   ← 6× speedup

3. Installation

# Requires Python 3.8+, CUDA 11.x+
pip install nvidia-tao

# Verify
tao --version
# NVIDIA TAO Toolkit, version 5.x.x

# The launcher pulls the task-specific container on first use
tao model yolo_v4 --help

# Alternative: pull and run the TAO container directly (replace the tag with the latest)
docker pull nvcr.io/nvidia/tao/tao-toolkit:5.5.0-tf2.11.0

# Run with GPU access and workspace mounted
docker run --gpus all -it \
  -v $(pwd)/workspace:/workspace \
  -v ~/.tao:/root/.tao \
  nvcr.io/nvidia/tao/tao-toolkit:5.5.0-tf2.11.0 \
  /bin/bash

NGC API Key Setup

TAO downloads models from NGC — you need an API key.

# 1. Create account at ngc.nvidia.com
# 2. Generate API key: Setup → API key → Generate API key

# 3. Configure NGC CLI
pip install ngccli
ngc config set
# Prompts for API key and org

# 4. TAO uses NGC automatically after this
tao ngc-registry model list --filter_str object_detection

Workspace Structure

workspace/
├── data/
│   ├── train/           # training images + labels
│   ├── val/             # validation images + labels
│   └── test/            # test images (no labels needed)
├── models/
│   └── pretrained/      # downloaded NGC weights
├── specs/               # TAO YAML spec files
├── results/             # training outputs (checkpoints)
└── export/              # TensorRT engines
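
The tree takes a few lines to create; this sketch simply mirrors the structure above:

from pathlib import Path

for d in ["data/train", "data/val", "data/test",
          "models/pretrained", "specs", "results", "export"]:
    Path("workspace", d).mkdir(parents=True, exist_ok=True)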

4. Pre-Trained Models from NGC

Browse and Download Models

# List available object detection models
ngc registry model list nvidia/tao/pretrained_object_detection:*

# Common models:
#   pretrained_object_detection:resnet18      (fast, Jetson-friendly)
#   pretrained_object_detection:resnet50      (balanced)
#   pretrained_object_detection:efficientdet_ef0  (compact)
#   pretrained_classification:efficientnet_b0
#   pretrained_segmentation:vanilla_unet_resnet25

# Download ResNet-18 SSD backbone
ngc registry model download-version \
  nvidia/tao/pretrained_object_detection:resnet18 \
  --dest workspace/models/pretrained

# Directory after download:
# workspace/models/pretrained/
#   resnet18.hdf5   (Keras weights)
#   OR
#   resnet18.pth    (PyTorch weights, newer models)

Supported Task Models (TAO 5.x)

Task               Models available
Object Detection   YOLOv4, YOLO-NAS, SSD, RetinaNet, EfficientDet, DINO, Grounding DINO
Classification     ResNet-{18,34,50,101}, EfficientNet-{B0-B7}, MobileNet-V2, ViT
Segmentation       UNet, MaskRCNN, SegFormer
Face Detection     FaceDetect, FPENet (facial landmarks)
Re-identification  ReID (person tracking across cameras)
NLP                BERT, QuestionAnswering, IntentSlot
Multi-modal        CLIP, DINO-v2

5. Dataset Preparation

TAO supports KITTI format (primary) and COCO JSON format.

KITTI Format (Object Detection)

data/train/
├── images/
│   ├── 000001.jpg
│   ├── 000002.jpg
│   └── ...
└── labels/
    ├── 000001.txt   ← one label file per image, same stem name
    ├── 000002.txt
    └── ...

Label file format (one object per line):

# class_name truncated occluded alpha xmin ymin xmax ymax h w l x y z ry
# For 2D detection, only class_name and the box (xmin ymin xmax ymax) matter;
# the remaining fields are zero placeholders:

forklift 0 0 0 452 78 929 534 0 0 0 0 0 0 0
person 0 0 0 120 200 180 380 0 0 0 0 0 0 0
pallet 0 0 0 300 400 600 520 0 0 0 0 0 0 0

Convert COCO to KITTI with Python

import json
import os
from pathlib import Path

def coco_to_kitti(coco_json_path: str, output_label_dir: str):
    """Convert COCO annotation JSON to KITTI format label files."""
    with open(coco_json_path) as f:
        coco = json.load(f)

    # Build category ID → name map
    cat_map = {cat['id']: cat['name'] for cat in coco['categories']}

    # Group annotations by image ID
    annotations_by_image: dict[int, list] = {}
    for ann in coco['annotations']:
        img_id = ann['image_id']
        annotations_by_image.setdefault(img_id, []).append(ann)

    os.makedirs(output_label_dir, exist_ok=True)

    for img_info in coco['images']:
        img_id = img_info['id']
        stem = Path(img_info['file_name']).stem
        label_path = os.path.join(output_label_dir, f"{stem}.txt")

        lines = []
        for ann in annotations_by_image.get(img_id, []):
            x, y, w, h = ann['bbox']  # COCO: x_min, y_min, width, height
            xmin, ymin = x, y
            xmax, ymax = x + w, y + h
            class_name = cat_map[ann['category_id']]

            # KITTI format: class trunc occ alpha x1 y1 x2 y2 h w l x y z ry
            lines.append(
                f"{class_name} 0 0 0 {xmin:.2f} {ymin:.2f} {xmax:.2f} {ymax:.2f} "
                f"0 0 0 0 0 0 0"
            )

        with open(label_path, 'w') as f:
            f.write('\n'.join(lines))

    print(f"Converted {len(coco['images'])} images to KITTI format in {output_label_dir}")

# Usage:
coco_to_kitti("annotations/instances_train2017.json", "data/train/labels")

Dataset Validation

# TAO can validate your dataset before training
tao model yolo_v4 dataset_convert \
  -e specs/yolov4_train.yaml \
  --validate_only

# Output shows:
#   ✓ Found 2,847 images
#   ✓ Found 2,847 label files
#   ✓ Class distribution: forklift=1204, person=3891, pallet=892
#   ✗ 3 images missing labels → listed by name
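
The missing-label check is also easy to run without invoking TAO. A minimal sketch, assuming the KITTI layout from this section (adjust the glob if your images are .png):

from pathlib import Path

def check_kitti_pairs(image_dir: str, label_dir: str) -> None:
    """Report images with no matching label file, and orphan labels."""
    image_stems = {p.stem for p in Path(image_dir).glob("*.jpg")}
    label_stems = {p.stem for p in Path(label_dir).glob("*.txt")}

    print(f"{len(image_stems)} images, {len(label_stems)} label files")
    for stem in sorted(image_stems - label_stems):
        print(f"  missing label: {stem}.jpg")
    for stem in sorted(label_stems - image_stems):
        print(f"  orphan label:  {stem}.txt")

check_kitti_pairs("data/train/images", "data/train/labels")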

6. Transfer Learning — Fine-Tuning a Model

Step 1: Write a Spec File (YAML)

TAO uses YAML spec files to configure everything. This is the core interface.

# specs/yolov4_train.yaml
dataset_config:
  data_sources:
    - label_directory_path: workspace/data/train/labels
      image_directory_path: workspace/data/train/images
  target_class_mapping:
    forklift: forklift
    person: person
    pallet: pallet
  validation_data_sources:
    - label_directory_path: workspace/data/val/labels
      image_directory_path: workspace/data/val/images

model_config:
  pretrained_model_file: workspace/models/pretrained/resnet18.hdf5
  num_layers: 18                    # ResNet-18
  arch: resnet
  input_image_size: "3,544,960"     # C,H,W
  output_stride: 16
  freeze_bn: True                   # freeze BatchNorm during fine-tune
  freeze_blocks: [0, 1]             # freeze first 2 residual blocks

augmentation_config:
  output_width: 960
  output_height: 544
  randomize_input_shape_period: 0
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  random_flip: 1
  random_rotate: False
  jitter: 0.3

training_config:
  batch_size_per_gpu: 16
  num_epochs: 80
  learning_rate: 0.00261
  lr_schedule: "cosine"
  warmup_epochs: 3
  weight_decay: 0.0005
  momentum: 0.9
  pretrain_model_path: workspace/models/pretrained/resnet18.hdf5
  enable_qat: False

bbox_handler_config:
  kitti_box_utils:
    num_small_boxes_rejected: 0
  target_class_config:
    - name: forklift
      coverage_threshold: 0.005    # min object area fraction
    - name: person
      coverage_threshold: 0.005
    - name: pallet
      coverage_threshold: 0.005

Step 2: Train

# Single GPU training
tao model yolo_v4 train \
  -e specs/yolov4_train.yaml \
  -r workspace/results/yolov4_resnet18 \
  --gpus 1

# Multi-GPU (if available)
tao model yolo_v4 train \
  -e specs/yolov4_train.yaml \
  -r workspace/results/yolov4_resnet18 \
  --gpus 4

# Training output structure:
# workspace/results/yolov4_resnet18/
#   train/
#     events.out.tfevents.*   ← TensorBoard logs
#   weights/
#     yolov4_resnet18_epoch_010.hdf5
#     yolov4_resnet18_epoch_020.hdf5
#     ...
#     yolov4_resnet18_epoch_080.hdf5  ← best kept

Step 3: Evaluate

tao model yolo_v4 evaluate \
  -e specs/yolov4_eval.yaml \
  -m workspace/results/yolov4_resnet18/weights/yolov4_resnet18_epoch_080.hdf5

# Output:
# class_name       ap
# forklift         91.3%
# person           88.7%
# pallet           85.2%
# mAP:             88.4%

Step 4: Visualize with TensorBoard

tensorboard --logdir workspace/results/yolov4_resnet18/train \
            --port 6006 --bind_all
# Open http://localhost:6006
# Shows: loss curves, mAP per epoch, learning rate schedule

Training Config Knobs

Parameter           Too low              Good range        Too high
batch_size_per_gpu  slow GPU util        8–32              OOM
learning_rate       never converges      1e-4 – 1e-2       diverges
freeze_blocks       overfits small data  [0,1] or [0,1,2]  underfits
num_epochs          underfits            50–150            overfits
jitter              no augmentation      0.2–0.4           too distorted

7. Pruning

TAO pruning uses magnitude-based channel pruning: removes entire feature map channels whose L1-norm falls below a threshold. Result: a smaller model with the same architecture family but fewer channels.
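
The criterion itself is simple. A numpy sketch (illustrative, not TAO's implementation): for a conv weight of shape (out_channels, in_channels, kH, kW), rank output channels by L1 norm and keep the strongest fraction, rounded to the pruning granularity:

import numpy as np

def l1_channel_mask(weight: np.ndarray, prune_ratio: float = 0.6,
                    min_filters: int = 16, granularity: int = 8) -> np.ndarray:
    """Boolean mask of output channels to KEEP, by per-channel L1 norm."""
    l1 = np.abs(weight).sum(axis=(1, 2, 3))      # one norm per output channel
    n_keep = max(min_filters, int(len(l1) * (1 - prune_ratio)))
    n_keep = max(granularity, (n_keep // granularity) * granularity)
    mask = np.zeros(len(l1), dtype=bool)
    mask[np.argsort(l1)[-n_keep:]] = True        # strongest channels survive
    return mask

w = np.random.randn(512, 256, 3, 3)              # a 512-filter conv layer
print(int(l1_channel_mask(w).sum()), "of 512 channels kept")   # 200 at ratio 0.6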

Why Prune?

Before pruning:
  ResNet-18 + SSD:  98 MB, 47 FPS INT8 on Orin Nano

After 60% pruning + retrain:
  ResNet-18 + SSD:  38 MB, 62 FPS INT8 on Orin Nano
  mAP drop:         < 1%

Prune Command

# Spec file for pruning
cat > specs/yolov4_prune.yaml << 'EOF'
pruning_config:
  method: "l1_norm"        # magnitude of weight channels
  prune_ratio: 0.6         # remove 60% of channels by weight
  equalization_criterion: "union"
  granularity: 8           # prune in groups of 8 channels (CUDA efficiency)
  min_num_filters: 16      # never go below 16 channels per layer
  threshold: 0.1           # alternative: threshold instead of ratio
EOF

tao model yolo_v4 prune \
  -e specs/yolov4_prune.yaml \
  -m workspace/results/yolov4_resnet18/weights/yolov4_resnet18_epoch_080.hdf5 \
  -o workspace/results/yolov4_pruned/yolov4_pruned.hdf5

# Output shows:
# Original filters: 512
# Pruned filters:   198  (61.3% reduction)
# Model size: 98 MB → 37 MB

Retrain After Pruning

Pruning disrupts learned features — you must retrain the pruned model to recover accuracy.

# specs/yolov4_retrain.yaml — same as train spec but:
training_config:
  pretrain_model_path: workspace/results/yolov4_pruned/yolov4_pruned.hdf5
  num_epochs: 40           # fewer epochs — recovering, not learning from scratch
  learning_rate: 0.00065   # lower LR — fine adjustment

tao model yolo_v4 train \
  -e specs/yolov4_retrain.yaml \
  -r workspace/results/yolov4_retrained \
  --gpus 1

# After retraining, re-evaluate:
tao model yolo_v4 evaluate \
  -e specs/yolov4_eval.yaml \
  -m workspace/results/yolov4_retrained/weights/yolov4_resnet18_epoch_040.hdf5
# mAP: 87.8% (was 88.4%) — only a 0.6-point drop after 60% pruning

Iterative Pruning Strategy

Round 1: Prune 30% → Retrain → Check mAP drop
Round 2: Prune 30% more → Retrain → Check mAP drop
Round 3: Prune 20% more → Retrain → mAP drops >2% → revert round 3, stop

Final: ~60% total reduction, mAP within 1.5% of baseline

This is better than single-step aggressive pruning because the model adapts gradually.
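
The round loop is easy to script around the CLI. A hedged sketch using subprocess and the commands from this section: read_map() and the per-round output paths are hypothetical, and each round assumes the prune spec has been edited to that round's prune_ratio.

import subprocess

def read_map(log_path: str) -> float:
    """Hypothetical helper: pull the 'mAP:' line out of an evaluate log."""
    for line in open(log_path):
        if line.strip().startswith("mAP:"):
            return float(line.split(":")[1].strip().rstrip("%"))
    raise ValueError("no mAP line found")

baseline_map = 88.4
model = "workspace/results/yolov4_resnet18/weights/yolov4_resnet18_epoch_080.hdf5"

for rnd in (1, 2, 3):
    pruned = f"workspace/results/round{rnd}/pruned.hdf5"
    subprocess.run(["tao", "model", "yolo_v4", "prune",
                    "-e", "specs/yolov4_prune.yaml",
                    "-m", model, "-o", pruned], check=True)
    subprocess.run(["tao", "model", "yolo_v4", "train",
                    "-e", "specs/yolov4_retrain.yaml",
                    "-r", f"workspace/results/round{rnd}"], check=True)
    if baseline_map - read_map(f"workspace/results/round{rnd}/eval.log") > 2.0:
        break                                    # keep the previous round's model
    model = f"workspace/results/round{rnd}/weights/latest.hdf5"  # illustrative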


8. Quantization-Aware Training (QAT)

TAO can insert fake quantization nodes during training so the model learns to be robust to INT8 precision before export.

Enable QAT in Training Spec

# specs/yolov4_qat.yaml
training_config:
  pretrain_model_path: workspace/results/yolov4_retrained/weights/yolov4_resnet18_epoch_040.hdf5
  num_epochs: 15          # short fine-tune — model already trained
  learning_rate: 0.0001   # very low — just adapting to quantization noise
  enable_qat: True        # THE KEY FLAG
  batch_size_per_gpu: 8   # reduce if OOM during QAT (2× memory overhead)

tao model yolo_v4 train \
  -e specs/yolov4_qat.yaml \
  -r workspace/results/yolov4_qat \
  --gpus 1

# During QAT, TAO automatically:
# 1. Inserts FakeQuantize nodes after Conv/Linear layers
# 2. Learns quantization scale/zero-point as parameters
# 3. Backprop through STE (Straight-Through Estimator)
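
What a FakeQuantize node does is small enough to show directly. A PyTorch sketch of symmetric per-tensor INT8 fake quantization with a straight-through estimator (illustrative, not TAO's actual node):

import torch

class FakeQuantSTE(torch.autograd.Function):
    """Round to the INT8 grid in forward; pass gradients straight through."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, scale: float) -> torch.Tensor:
        q = torch.clamp(torch.round(x / scale), -128, 127)   # quantize
        return q * scale                                     # dequantize

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        return grad_output, None            # STE: treat rounding as identity

x = torch.randn(4, requires_grad=True)
scale = x.abs().max().item() / 127          # naive max calibration
y = FakeQuantSTE.apply(x, scale)
y.sum().backward()
print(x.grad)                               # all ones (gradient passed through)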

QAT vs PTQ Comparison

Post-Training Quantization (PTQ):
  Workflow:  train FP32 → calibrate on 1000 samples → INT8 engine
  mAP drop:  ~2–4% typical
  Time:      +30 minutes calibration

Quantization-Aware Training (QAT):
  Workflow:  train FP32 → QAT fine-tune 15 epochs → INT8 engine
  mAP drop:  ~0.5–1% typical
  Time:      +6 hours QAT training

Use QAT when:
  - mAP is critical and budget allows extra training time
  - PTQ drops below acceptable threshold
  - Model has depthwise separable convolutions (very sensitive to quantization)

9. Exporting to TensorRT

Export to ONNX + TRT Engine

# specs/yolov4_export.yaml
export_config:
  # Input model
  model: workspace/results/yolov4_qat/weights/yolov4_resnet18_epoch_015.hdf5
  # OR for non-QAT:
  # model: workspace/results/yolov4_retrained/weights/yolov4_resnet18_epoch_040.hdf5

  # Output
  output_file: workspace/export/yolov4_resnet18.onnx

  # Target precision
  data_type: int8               # "fp32", "fp16", "int8"

  # Calibration (for PTQ INT8 only, not needed after QAT)
  cal_image_dir: workspace/data/train/images
  cal_cache_file: workspace/export/cal.bin
  cal_batch_size: 8
  cal_batches: 20               # 20 × 8 = 160 calibration images

  # Input shape
  input_dims: [3, 544, 960]     # C, H, W

  # Batch size for engine
  batch_size: 1                 # Jetson real-time: batch=1
  max_batch_size: 8

  # Jetson target (for DLA)
  enable_dla: False             # set True to target DLA

# Export to ONNX (intermediate)
tao model yolo_v4 export \
  -e specs/yolov4_export.yaml \
  -m workspace/results/yolov4_qat/weights/yolov4_resnet18_epoch_015.hdf5 \
  -o workspace/export/yolov4_resnet18.onnx

# Convert ONNX → TensorRT engine
tao deploy tao-converter \
  -k your_ngc_api_key \
  -d 3,544,960 \
  -o BatchedNMS \
  -e workspace/export/yolov4_int8.engine \
  -p Input,1x3x544x960,4x3x544x960,8x3x544x960 \
  -t int8 \
  -c workspace/export/cal.bin \
  workspace/export/yolov4_resnet18.onnx

# Alternative: use trtexec directly
trtexec \
  --onnx=workspace/export/yolov4_resnet18.onnx \
  --saveEngine=workspace/export/yolov4_int8.engine \
  --int8 \
  --calib=workspace/export/cal.bin \
  --workspace=2048 \
  --verbose

Verify the Engine

# Run benchmark on the engine
trtexec \
  --loadEngine=workspace/export/yolov4_int8.engine \
  --batch=1 \
  --iterations=200 \
  --avgRuns=100

# Output:
# [I] mean: 21.33 ms  ← 47 FPS on workstation RTX 3080
# [I] max:  23.1 ms
# [I] GPU Compute: 20.2 ms
# [I] H2D Latency: 0.47 ms
# [I] D2H Latency: 0.63 ms
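
Before writing inference code against the engine, it helps to dump its bindings, in particular to confirm the order of the BatchedNMS outputs parsed in Section 10. A small sketch using the same TensorRT 8.x binding API as the inference class below (these accessors are deprecated in newer TensorRT releases):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("workspace/export/yolov4_int8.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Print each binding: direction, name, shape, dtype
for binding in engine:
    direction = "IN " if engine.binding_is_input(binding) else "OUT"
    print(direction, binding,
          engine.get_binding_shape(binding),
          engine.get_binding_dtype(binding))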

10. Deploying on Jetson Orin Nano

Transfer Engine to Jetson

# TRT engines are NOT portable between GPUs — must build on Jetson
# Transfer the ONNX and calibration file, then build on device

# On workstation — copy files to Jetson
scp workspace/export/yolov4_resnet18.onnx jetson@192.168.1.100:~/models/
scp workspace/export/cal.bin jetson@192.168.1.100:~/models/

# On Jetson — build INT8 engine
ssh jetson@192.168.1.100
trtexec \
  --onnx=/home/jetson/models/yolov4_resnet18.onnx \
  --saveEngine=/home/jetson/models/yolov4_int8_jetson.engine \
  --int8 \
  --calib=/home/jetson/models/cal.bin \
  --workspace=2048
# This takes ~5–15 minutes on Jetson (engine building is slow)

Python Inference on Jetson

import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import cv2

class TAOYOLOInference:
    """Runs TAO-exported YOLO TensorRT engine on Jetson."""

    CLASS_NAMES = ['forklift', 'person', 'pallet']
    INPUT_H, INPUT_W = 544, 960
    CONF_THRESHOLD = 0.4
    NMS_THRESHOLD = 0.5

    def __init__(self, engine_path: str):
        logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, 'rb') as f:
            runtime = trt.Runtime(logger)
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

        # Allocate buffers
        self.inputs, self.outputs, self.bindings, self.stream = [], [], [], cuda.Stream()
        for binding in self.engine:
            size = trt.volume(self.engine.get_binding_shape(binding))
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            if self.engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})

    def preprocess(self, frame: np.ndarray) -> np.ndarray:
        """Resize, normalize, HWC → NCHW."""
        img = cv2.resize(frame, (self.INPUT_W, self.INPUT_H))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # cv2 captures BGR; model expects RGB
        img = img.astype(np.float32) / 255.0
        img = img.transpose(2, 0, 1)   # HWC → CHW
        img = np.expand_dims(img, 0)   # add batch dim
        return np.ascontiguousarray(img)

    def infer(self, frame: np.ndarray) -> list[dict]:
        """Run inference, return list of detections."""
        # Preprocess
        inp = self.preprocess(frame)
        np.copyto(self.inputs[0]['host'], inp.ravel())

        # H2D
        cuda.memcpy_htod_async(
            self.inputs[0]['device'], self.inputs[0]['host'], self.stream)

        # Execute
        self.context.execute_async_v2(
            bindings=self.bindings, stream_handle=self.stream.handle)

        # D2H
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)
        self.stream.synchronize()

        return self._parse_detections(self.outputs, frame.shape)

    def _parse_detections(
        self, outputs: list, orig_shape: tuple
    ) -> list[dict]:
        """Parse the BatchedNMS plugin outputs.

        The BatchedNMS plugin emits four output bindings (typical order):
          num_detections [1,1], nmsed_boxes [1,keepTopK,4],
          nmsed_scores [1,keepTopK], nmsed_classes [1,keepTopK]
        Verify the order and box scaling against your engine's bindings
        (see the binding dump in Section 9).
        """
        orig_h, orig_w = orig_shape[:2]
        sx = orig_w / self.INPUT_W   # map network-input pixels back to the frame
        sy = orig_h / self.INPUT_H

        num_dets = int(outputs[0]['host'][0])
        boxes = outputs[1]['host'].reshape(-1, 4)
        scores = outputs[2]['host']
        classes = outputs[3]['host']

        detections = []
        for i in range(num_dets):
            score = float(scores[i])
            if score < self.CONF_THRESHOLD:
                continue
            x1, y1, x2, y2 = boxes[i]
            detections.append({
                'class': self.CLASS_NAMES[int(classes[i])],
                'score': score,
                'box': [int(x1 * sx), int(y1 * sy), int(x2 * sx), int(y2 * sy)]
            })
        return detections


# Usage on Jetson
detector = TAOYOLOInference('/home/jetson/models/yolov4_int8_jetson.engine')
cap = cv2.VideoCapture(0)  # or GStreamer pipeline

while True:
    ret, frame = cap.read()
    if not ret:
        break

    dets = detector.infer(frame)
    for d in dets:
        x1, y1, x2, y2 = d['box']
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"{d['class']} {d['score']:.2f}",
                    (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

    cv2.imshow('TAO Detection', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

Benchmark on Jetson

# On Jetson — set max performance mode first
sudo jetson_clocks --fan

# Benchmark
trtexec \
  --loadEngine=/home/jetson/models/yolov4_int8_jetson.engine \
  --batch=1 \
  --iterations=500 \
  --avgRuns=200 \
  --warmUp=100

# Expected results on Orin Nano 8GB:
# INT8 ResNet-18 + SSD (544×960):  ~42 FPS
# INT8 ResNet-18 + SSD (416×416):  ~68 FPS
# INT8 EfficientDet-EF0 (512×512): ~35 FPS
# FP16 ResNet-18 + SSD (544×960):  ~28 FPS

11. DeepStream Integration

TAO-exported models plug directly into DeepStream as nvinfer elements — no custom code needed.

nvinfer Config File

# config_infer_primary_yolov4.txt
[property]
gpu-id=0
net-scale-factor=0.0039215697906911373   # 1/255
model-engine-file=/home/jetson/models/yolov4_int8_jetson.engine
int8-calib-file=/home/jetson/models/cal.bin
batch-size=1
process-mode=1                           # 1=primary detector
model-color-format=0                     # 0=RGB
labelfile-path=labels.txt                # one class name per line
gie-unique-id=1
output-blob-names=BatchedNMS
num-detected-classes=3
interval=0

[class-attrs-all]
threshold=0.4
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=20
detected-min-h=20
detected-max-w=9999
detected-max-h=9999

# labels.txt
forklift
person
pallet

DeepStream Pipeline Config

# deepstream_app_config.txt  (deepstream-app group/INI syntax, not YAML)
[application]
enable-perf-measurement=1
perf-measurement-interval-sec=2

[source0]
enable=1
type=4                    # 4=RTSP
uri=rtsp://192.168.1.50/stream
num-sources=1

[sink0]
enable=1
type=4                    # 4=RTSP streaming output
sync=0
codec=1                   # H.264
bitrate=4000000
rtsp-port=8554
udp-port=5400

[osd]
enable=1
gpu-id=0

[streammux]
gpu-id=0
batch-size=1
batched-push-timeout=40000
width=960
height=544

[primary-gie]
enable=1
gpu-id=0
gie-unique-id=1
config-file=config_infer_primary_yolov4.txt

# Run the full pipeline
deepstream-app -c deepstream_app_config.txt

12. TAO vs Manual Pipeline Comparison

Aspect                       TAO Toolkit               Manual PyTorch + TensorRT
Dataset labeling format      KITTI, COCO               Any (you write the loader)
Model architecture           Fixed list (~50 models)   Any model you can write
Training code                Zero (YAML config)        Full training loop
Pruning                      One command               Custom pruning code
INT8 calibration             Built-in calibrator       Custom IInt8Calibrator
TRT export                   tao export command        torch.onnx.export + trtexec
DeepStream integration       Drop-in nvinfer config    Custom Gst plugin
Customization                Low                       Full control
Time to first working model  Hours                     Days–weeks
Custom layer support         No (use standard layers)  Yes

Use TAO when:

  - You need a standard detection/classification model fast
  - Your architecture is in the NGC catalog
  - You want a push-button optimization pipeline
  - You're deploying with DeepStream

Use manual pipeline when:

  - Custom architecture (attention heads, custom losses)
  - Research experiments
  - Need to debug gradient flow
  - Architectures not in NGC catalog (BEVFusion, PointPillars, etc.)


13. Projects

Project 1: Warehouse Object Detector

Train a 3-class detector (forklift, person, pallet) for a Jetson-powered warehouse camera.

Data collection:
  - Record 30-min warehouse video at 1080p
  - Sample every 5 seconds: ~360 frames
  - Label with CVAT (free, open source)
  - Export as KITTI format

TAO workflow:
  1. tao download pretrained_object_detection:resnet18
  2. Train 80 epochs (4 hours on RTX 3080)
  3. Evaluate: target mAP > 80%
  4. Prune 50%, retrain 40 epochs
  5. QAT 15 epochs
  6. Export INT8 engine
  7. Deploy on Jetson + DeepStream RTSP stream

Success metric: >40 FPS on Orin Nano, mAP > 78%

Project 2: Defect Detection on Conveyor Belt

Binary classifier: defect vs no-defect on manufactured parts.

TAO classification workflow:
  1. tao download pretrained_classification:efficientnet_b0
  2. Collect 1,000 defect + 1,000 no-defect images under controlled lighting
  3. Train classifier (10 epochs — classification is fast)
  4. Export FP16 engine (classification needs less optimization than detection)
  5. Jetson reads camera at 120 FPS, classify every frame
  6. Trigger alarm on 3 consecutive defect predictions (see the sketch below)

Success metric: <0.5% false positive rate, >99% recall on defects
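
Step 6's debounce is a few lines. A minimal sketch, where classify() is a hypothetical wrapper around the exported classifier engine:

from collections import deque

REQUIRED_CONSECUTIVE = 3
recent = deque(maxlen=REQUIRED_CONSECUTIVE)

def classify(frame) -> str:
    """Hypothetical: run the EfficientNet-B0 TRT engine, return a label."""
    raise NotImplementedError

def defect_alarm(frame) -> bool:
    """True once the last 3 frames in a row were classified as defects."""
    recent.append(classify(frame) == "defect")
    return len(recent) == REQUIRED_CONSECUTIVE and all(recent)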

Project 3: Multi-Camera People Counter

Use TAO ReID (Re-Identification) to track people across 4 cameras in a building.

Components:
  Camera 1–4 → DeepStream source → Primary detector (TAO YOLOv4 person)
             → Secondary ReID model (TAO ReID)
             → Tracker (NvDCF)
             → Redis person count aggregator
             → Grafana dashboard

TAO models:
  Primary: pretrained_object_detection:resnet18 (person only, 2 classes: person/background)
  Secondary: pretrained_re_identification:resnet50 (128-dim embedding)
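
The Redis aggregation step is also small. A hedged sketch with hypothetical key names, assuming a DeepStream probe hands us the set of ReID track IDs seen per camera:

import redis

r = redis.Redis(host="localhost", port=6379)

def update_camera(camera_id: int, person_ids: set) -> None:
    """Record IDs seen on one camera; entries age out after 60 s."""
    key = f"camera:{camera_id}:people"           # hypothetical key scheme
    if person_ids:
        r.sadd(key, *person_ids)
    r.expire(key, 60)

def building_count(num_cameras: int = 4) -> int:
    """Union across cameras so a person seen twice counts once."""
    keys = [f"camera:{i}:people" for i in range(1, num_cameras + 1)]
    return len(r.sunion(keys))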

14. Resources

Official Documentation

  • TAO Toolkit User Guide — docs.nvidia.com/tao/tao-toolkit/
  • TAO Toolkit API Reference — docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/
  • NGC Model Catalog — ngc.nvidia.com/catalog/models (filter: TAO)
  • TAO Toolkit GitHub — github.com/NVIDIA/tao_pytorch_backend

Tutorials

  • NVIDIA Developer Blog: TAO Toolkit Getting Started — developer.nvidia.com/blog/training-like-a-pro-with-tao-toolkit/
  • Jetson AI Lab — TAO + DeepStream — jetson-ai-lab.com
  • TAO Toolkit Quick Start Guide — catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/resources/tao_toolkit_quick_start_guide
  • DeepStream SDK — docs.nvidia.com/metropolis/deepstream/dev-guide/
  • trtexec documentation — docs.nvidia.com/deeplearning/tensorrt/developer-guide/ (section: trtexec)
  • CVAT (labeling tool) — github.com/opencv/cvat — recommended for KITTI export
  • Roboflow (roboflow.com) — cloud annotate, augment, export YOLO/COCO; good for quick iteration and edge deployment

Benchmark Reference (Orin Nano 8GB, JetPack 6.x)

Model               Task            Precision  Input    FPS
ResNet-18 + SSD     Detection       INT8       544×960  42
ResNet-18 + SSD     Detection       FP16       544×960  28
EfficientNet-B0     Classification  INT8       224×224  380
EfficientDet-EF0    Detection       INT8       512×512  35
MobileNet-V2 + SSD  Detection       INT8       300×300  95

See also: ML and AI main guide for quantization theory, TensorRT manual pipeline, and DLA usage