
NVIDIA TAO Toolkit — Concrete Guide

Goal: Use NVIDIA TAO Toolkit to take a pre-trained model from NGC, fine-tune it on a custom dataset, prune and quantize it, export to TensorRT, and deploy on Jetson Orin Nano — without training from scratch.


Table of Contents

  1. What is TAO Toolkit?
  2. Why TAO Instead of Training from Scratch?
  3. Installation
  4. Pre-Trained Models from NGC
  5. Dataset Preparation
  6. Transfer Learning — Fine-Tuning a Model
  7. Pruning
  8. Quantization-Aware Training (QAT)
  9. Exporting to TensorRT
  10. Deploying on Jetson Orin Nano
  11. DeepStream Integration
  12. TAO vs Manual Pipeline Comparison
  13. Projects
  14. Resources

1. What is TAO Toolkit?

TAO = Train, Adapt, Optimize — NVIDIA's no-code/low-code framework for building production-grade AI models with transfer learning, without needing to write training loops from scratch.

TAO Toolkit Architecture:
─────────────────────────────────────────────────────────────────
NGC Model Zoo                                     Your Custom Model
(pre-trained weights)                             (optimized for edge)
       │                                                  │
       ▼                                                  ▲
  tao download ──► tao train ──► tao prune ──► tao export
                   (fine-tune)   (compress)   (TensorRT engine)
               Your custom dataset
               (KITTI / COCO format)
─────────────────────────────────────────────────────────────────

What TAO provides:

Feature                  What it does
NGC pre-trained models   ResNet, EfficientNet, YOLO, SSD with ImageNet/COCO weights
Transfer learning        Freeze backbone, train detection head on your data
Auto-pruning             Magnitude-based channel pruning with one command
QAT                      Fake quantization during training → INT8 TRT engine
TensorRT export          Generates calibrated .engine files directly
DeepStream plugins       Drop-in nvinfer config for video pipelines

TAO Toolkit Versions

Version            Key feature                           Python API
TAO 3.x            PyTorch + TF2 backends, Docker-based  tao_toolkit
TAO 4.x            Unified launcher, 100+ models         nvidia-tao pip
TAO 5.x (current)  Cloud-native, Triton support          nvidia-tao pip

2. Why TAO Instead of Training from Scratch?

The Core Problem

Training a detection model from scratch requires:

From scratch:
  Dataset needed:  100,000–1,000,000 labeled images
  GPU time:        72–200 hours (A100)
  Cost:            $500–5,000 cloud compute
  Expertise:       Deep learning researcher

TAO transfer learning:
  Dataset needed:  500–5,000 labeled images       ← 100× less data
  GPU time:        2–8 hours (RTX 3080)           ← 10× faster
  Cost:            $10–50 cloud compute            ← 100× cheaper
  Expertise:       Engineer following docs

Why It Works: Transfer Learning

Pre-trained backbones have learned to detect edges → textures → shapes → objects. Your fine-tuning only needs to adjust the final layers to recognize your specific objects.

ResNet-50 trained on ImageNet (1.2M images, 1000 classes):
  Layer 1:  detects edges, gradients
  Layer 2:  detects corners, textures
  Layer 3:  detects patterns (wheels, faces, windows)
  Layer 4:  detects object parts (car front, person torso)
  Layer 5:  classifies ImageNet objects  ← replace this

After TAO fine-tuning on 2,000 custom images:
  Layers 1-4: frozen (ImageNet features still useful)
  Layer 5:  retrained on "forklift", "pallet", "person"

Result: 94% mAP after 4 hours of training
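
TAO drives this through the freeze_blocks option in the training spec (Section 6), but the idea is easy to see in plain PyTorch. A minimal sketch, not TAO internals: torchvision's ResNet-50 with a new 3-class head stands in for the real model.

import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained ResNet-50 and freeze the whole backbone
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False

# Replace the final classifier with a fresh head for our 3 classes.
# New parameters default to requires_grad=True, so only this head
# trains while the earlier layers keep their ImageNet features.
model.fc = nn.Linear(model.fc.in_features, 3)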

TAO Optimization Pipeline Output

Original ResNet-50 + SSD head:
  Model size:      98 MB
  Jetson FPS:      8 FPS (FP32)

After TAO:
  Prune 60%:       39 MB
  Retrain + QAT:   ACC within 1% of baseline
  Export INT8:     9.8 MB TRT engine
  Jetson FPS:      47 FPS (INT8 TRT)   ← 6× speedup

3. Installation

# Requires Python 3.8+, CUDA 11.x+
pip install nvidia-tao

# Verify
tao --version
# NVIDIA TAO Toolkit, version 5.x.x

# The launcher pulls the task-specific container on first use
tao model yolo_v4 --help

# Alternative: pull and run the TAO container directly (replace the tag with the latest)
docker pull nvcr.io/nvidia/tao/tao-toolkit:5.5.0-tf2.11.0

# Run with GPU access and workspace mounted
docker run --gpus all -it \
  -v $(pwd)/workspace:/workspace \
  -v ~/.tao:/root/.tao \
  nvcr.io/nvidia/tao/tao-toolkit:5.5.0-tf2.11.0 \
  /bin/bash

NGC API Key Setup

TAO downloads models from NGC — you need an API key.

# 1. Create account at ngc.nvidia.com
# 2. Generate API key: Setup → API key → Generate API key

# 3. Configure NGC CLI
pip install ngccli
ngc config set
# Prompts for API key and org

# 4. TAO uses NGC automatically after this
tao ngc-registry model list --filter_str object_detection

Workspace Structure

workspace/
├── data/
│   ├── train/           # training images + labels
│   ├── val/             # validation images + labels
│   └── test/            # test images (no labels needed)
├── models/
│   └── pretrained/      # downloaded NGC weights
├── specs/               # TAO YAML spec files
├── results/             # training outputs (checkpoints)
└── export/              # TensorRT engines
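
The tree takes a few lines to create; this sketch simply mirrors the structure above:

from pathlib import Path

for d in ["data/train", "data/val", "data/test",
          "models/pretrained", "specs", "results", "export"]:
    Path("workspace", d).mkdir(parents=True, exist_ok=True)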

4. Pre-Trained Models from NGC

Browse and Download Models

# List available object detection models
ngc registry model list nvidia/tao/pretrained_object_detection:*

# Common models:
#   pretrained_object_detection:resnet18      (fast, Jetson-friendly)
#   pretrained_object_detection:resnet50      (balanced)
#   pretrained_object_detection:efficientdet_ef0  (compact)
#   pretrained_classification:efficientnet_b0
#   pretrained_segmentation:vanilla_unet_resnet25

# Download ResNet-18 SSD backbone
ngc registry model download-version \
  nvidia/tao/pretrained_object_detection:resnet18 \
  --dest workspace/models/pretrained

# Directory after download:
# workspace/models/pretrained/
#   resnet18.hdf5   (Keras weights)
#   OR
#   resnet18.pth    (PyTorch weights, newer models)

Supported Task Models (TAO 5.x)

Task               Models available
Object Detection   YOLOv4, YOLO-NAS, SSD, RetinaNet, EfficientDet, DINO, Grounding DINO
Classification     ResNet-{18,34,50,101}, EfficientNet-{B0-B7}, MobileNet-V2, ViT
Segmentation       UNet, MaskRCNN, SegFormer
Face Detection     FaceDetect, FPENet (facial landmarks)
Re-identification  ReID (person tracking across cameras)
NLP                BERT, QuestionAnswering, IntentSlot
Multi-modal        CLIP, DINO-v2

5. Dataset Preparation

TAO supports KITTI format (primary) and COCO JSON format.

KITTI Format (Object Detection)

data/train/
├── images/
│   ├── 000001.jpg
│   ├── 000002.jpg
│   └── ...
└── labels/
    ├── 000001.txt   ← one label file per image, same stem name
    ├── 000002.txt
    └── ...

Label file format (one object per line):

# class_name truncated occluded alpha xmin ymin xmax ymax h w l x y z ry
# For 2D detection, only class_name and the box (xmin ymin xmax ymax) matter;
# the remaining fields are zero placeholders:

forklift 0 0 0 452 78 929 534 0 0 0 0 0 0 0
person 0 0 0 120 200 180 380 0 0 0 0 0 0 0
pallet 0 0 0 300 400 600 520 0 0 0 0 0 0 0

Convert COCO to KITTI with Python

import json
import os
from pathlib import Path

def coco_to_kitti(coco_json_path: str, output_label_dir: str):
    """Convert COCO annotation JSON to KITTI format label files."""
    with open(coco_json_path) as f:
        coco = json.load(f)

    # Build category ID → name map
    cat_map = {cat['id']: cat['name'] for cat in coco['categories']}

    # Group annotations by image ID
    annotations_by_image: dict[int, list] = {}
    for ann in coco['annotations']:
        img_id = ann['image_id']
        annotations_by_image.setdefault(img_id, []).append(ann)

    os.makedirs(output_label_dir, exist_ok=True)

    for img_info in coco['images']:
        img_id = img_info['id']
        stem = Path(img_info['file_name']).stem
        label_path = os.path.join(output_label_dir, f"{stem}.txt")

        lines = []
        for ann in annotations_by_image.get(img_id, []):
            x, y, w, h = ann['bbox']  # COCO: x_min, y_min, width, height
            xmin, ymin = x, y
            xmax, ymax = x + w, y + h
            class_name = cat_map[ann['category_id']]

            # KITTI format: class trunc occ alpha x1 y1 x2 y2 h w l x y z ry
            lines.append(
                f"{class_name} 0 0 0 {xmin:.2f} {ymin:.2f} {xmax:.2f} {ymax:.2f} "
                f"0 0 0 0 0 0 0"
            )

        with open(label_path, 'w') as f:
            f.write('\n'.join(lines))

    print(f"Converted {len(coco['images'])} images to KITTI format in {output_label_dir}")

# Usage:
coco_to_kitti("annotations/instances_train2017.json", "data/train/labels")

Dataset Validation

# TAO can validate your dataset before training
tao model yolo_v4 dataset_convert \
  -e specs/yolov4_train.yaml \
  --validate_only

# Output shows:
#   ✓ Found 2,847 images
#   ✓ Found 2,847 label files
#   ✓ Class distribution: forklift=1204, person=3891, pallet=892
#   ✗ 3 images missing labels → listed by name
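
The missing-label check is also easy to run without invoking TAO. A minimal sketch, assuming the KITTI layout from this section (adjust the glob if your images are .png):

from pathlib import Path

def check_kitti_pairs(image_dir: str, label_dir: str) -> None:
    """Report images with no matching label file, and orphan labels."""
    image_stems = {p.stem for p in Path(image_dir).glob("*.jpg")}
    label_stems = {p.stem for p in Path(label_dir).glob("*.txt")}

    print(f"{len(image_stems)} images, {len(label_stems)} label files")
    for stem in sorted(image_stems - label_stems):
        print(f"  missing label: {stem}.jpg")
    for stem in sorted(label_stems - image_stems):
        print(f"  orphan label:  {stem}.txt")

check_kitti_pairs("data/train/images", "data/train/labels")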

6. Transfer Learning — Fine-Tuning a Model

Step 1: Write a Spec File (YAML)

TAO uses YAML spec files to configure everything. This is the core interface.

# specs/yolov4_train.yaml
dataset_config:
  data_sources:
    - label_directory_path: workspace/data/train/labels
      image_directory_path: workspace/data/train/images
  target_class_mapping:
    forklift: forklift
    person: person
    pallet: pallet
  validation_data_sources:
    - label_directory_path: workspace/data/val/labels
      image_directory_path: workspace/data/val/images

model_config:
  pretrained_model_file: workspace/models/pretrained/resnet18.hdf5
  num_layers: 18                    # ResNet-18
  arch: resnet
  input_image_size: "3,544,960"     # C,H,W
  output_stride: 16
  freeze_bn: True                   # freeze BatchNorm during fine-tune
  freeze_blocks: [0, 1]             # freeze first 2 residual blocks

augmentation_config:
  output_width: 960
  output_height: 544
  randomize_input_shape_period: 0
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  random_flip: 1
  random_rotate: False
  jitter: 0.3

training_config:
  batch_size_per_gpu: 16
  num_epochs: 80
  learning_rate: 0.00261
  lr_schedule: "cosine"
  warmup_epochs: 3
  weight_decay: 0.0005
  momentum: 0.9
  pretrain_model_path: workspace/models/pretrained/resnet18.hdf5
  enable_qat: False

bbox_handler_config:
  kitti_box_utils:
    num_small_boxes_rejected: 0
  target_class_config:
    - name: forklift
      coverage_threshold: 0.005    # min object area fraction
    - name: person
      coverage_threshold: 0.005
    - name: pallet
      coverage_threshold: 0.005

Step 2: Train

# Single GPU training
tao model yolo_v4 train \
  -e specs/yolov4_train.yaml \
  -r workspace/results/yolov4_resnet18 \
  --gpus 1

# Multi-GPU (if available)
tao model yolo_v4 train \
  -e specs/yolov4_train.yaml \
  -r workspace/results/yolov4_resnet18 \
  --gpus 4

# Training output structure:
# workspace/results/yolov4_resnet18/
#   train/
#     events.out.tfevents.*   ← TensorBoard logs
#   weights/
#     yolov4_resnet18_epoch_010.hdf5
#     yolov4_resnet18_epoch_020.hdf5
#     ...
#     yolov4_resnet18_epoch_080.hdf5  ← best kept

Step 3: Evaluate

tao model yolo_v4 evaluate \
  -e specs/yolov4_eval.yaml \
  -m workspace/results/yolov4_resnet18/weights/yolov4_resnet18_epoch_080.hdf5

# Output:
# class_name       ap
# forklift         91.3%
# person           88.7%
# pallet           85.2%
# mAP:             88.4%

Step 4: Visualize with TensorBoard

tensorboard --logdir workspace/results/yolov4_resnet18/train \
            --port 6006 --bind_all
# Open http://localhost:6006
# Shows: loss curves, mAP per epoch, learning rate schedule

Training Config Knobs

Parameter           Too low              Good range        Too high
batch_size_per_gpu  slow GPU util        8–32              OOM
learning_rate       never converges      1e-4 – 1e-2       diverges
freeze_blocks       overfits small data  [0,1] or [0,1,2]  underfits
num_epochs          underfits            50–150            overfits
jitter              no augmentation      0.2–0.4           too distorted

7. Pruning

TAO pruning uses magnitude-based channel pruning: removes entire feature map channels whose L1-norm falls below a threshold. Result: a smaller model with the same architecture family but fewer channels.
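
The criterion itself is simple. A numpy sketch (illustrative, not TAO's implementation): for a conv weight of shape (out_channels, in_channels, kH, kW), rank output channels by L1 norm and keep the strongest fraction, rounded to the pruning granularity:

import numpy as np

def l1_channel_mask(weight: np.ndarray, prune_ratio: float = 0.6,
                    min_filters: int = 16, granularity: int = 8) -> np.ndarray:
    """Boolean mask of output channels to KEEP, by per-channel L1 norm."""
    l1 = np.abs(weight).sum(axis=(1, 2, 3))      # one norm per output channel
    n_keep = max(min_filters, int(len(l1) * (1 - prune_ratio)))
    n_keep = max(granularity, (n_keep // granularity) * granularity)
    mask = np.zeros(len(l1), dtype=bool)
    mask[np.argsort(l1)[-n_keep:]] = True        # strongest channels survive
    return mask

w = np.random.randn(512, 256, 3, 3)              # a 512-filter conv layer
print(int(l1_channel_mask(w).sum()), "of 512 channels kept")   # 200 at ratio 0.6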

Why Prune?

Before pruning:
  ResNet-18 + SSD:  98 MB, 47 FPS INT8 on Orin Nano

After 60% pruning + retrain:
  ResNet-18 + SSD:  38 MB, 62 FPS INT8 on Orin Nano
  mAP drop:         < 1%

Prune Command

# Spec file for pruning
cat > specs/yolov4_prune.yaml << 'EOF'
pruning_config:
  method: "l1_norm"        # magnitude of weight channels
  prune_ratio: 0.6         # remove 60% of channels by weight
  equalization_criterion: "union"
  granularity: 8           # prune in groups of 8 channels (CUDA efficiency)
  min_num_filters: 16      # never go below 16 channels per layer
  threshold: 0.1           # alternative: threshold instead of ratio
EOF

tao model yolo_v4 prune \
  -e specs/yolov4_prune.yaml \
  -m workspace/results/yolov4_resnet18/weights/yolov4_resnet18_epoch_080.hdf5 \
  -o workspace/results/yolov4_pruned/yolov4_pruned.hdf5

# Output shows:
# Original filters: 512
# Pruned filters:   198  (61.3% reduction)
# Model size: 98 MB → 37 MB

Retrain After Pruning

Pruning disrupts learned features — you must retrain the pruned model to recover accuracy.

# specs/yolov4_retrain.yaml — same as train spec but:
training_config:
  pretrain_model_path: workspace/results/yolov4_pruned/yolov4_pruned.hdf5
  num_epochs: 40           # fewer epochs — recovering, not learning from scratch
  learning_rate: 0.00065   # lower LR — fine adjustment

tao model yolo_v4 train \
  -e specs/yolov4_retrain.yaml \
  -r workspace/results/yolov4_retrained \
  --gpus 1

# After retraining, re-evaluate:
tao model yolo_v4 evaluate \
  -e specs/yolov4_eval.yaml \
  -m workspace/results/yolov4_retrained/weights/yolov4_resnet18_epoch_040.hdf5
# mAP: 87.8% (was 88.4%) — only a 0.6-point drop after 60% pruning

Iterative Pruning Strategy

Round 1: Prune 30% → Retrain → Check mAP drop
Round 2: Prune 30% more → Retrain → Check mAP drop
Round 3: Prune 20% more → Retrain → mAP drops >2% → revert round 3, stop

Final: ~60% total reduction, mAP within 1.5% of baseline

This is better than single-step aggressive pruning because the model adapts gradually.
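
The round loop is easy to script around the CLI. A hedged sketch using subprocess and the commands from this section: read_map() and the per-round output paths are hypothetical, and each round assumes the prune spec has been edited to that round's prune_ratio.

import subprocess

def read_map(log_path: str) -> float:
    """Hypothetical helper: pull the 'mAP:' line out of an evaluate log."""
    for line in open(log_path):
        if line.strip().startswith("mAP:"):
            return float(line.split(":")[1].strip().rstrip("%"))
    raise ValueError("no mAP line found")

baseline_map = 88.4
model = "workspace/results/yolov4_resnet18/weights/yolov4_resnet18_epoch_080.hdf5"

for rnd in (1, 2, 3):
    pruned = f"workspace/results/round{rnd}/pruned.hdf5"
    subprocess.run(["tao", "model", "yolo_v4", "prune",
                    "-e", "specs/yolov4_prune.yaml",
                    "-m", model, "-o", pruned], check=True)
    subprocess.run(["tao", "model", "yolo_v4", "train",
                    "-e", "specs/yolov4_retrain.yaml",
                    "-r", f"workspace/results/round{rnd}"], check=True)
    if baseline_map - read_map(f"workspace/results/round{rnd}/eval.log") > 2.0:
        break                                    # keep the previous round's model
    model = f"workspace/results/round{rnd}/weights/latest.hdf5"  # illustrative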


8. Quantization-Aware Training (QAT)

TAO can insert fake quantization nodes during training so the model learns to be robust to INT8 precision before export.

Enable QAT in Training Spec

# specs/yolov4_qat.yaml
training_config:
  pretrain_model_path: workspace/results/yolov4_retrained/weights/yolov4_resnet18_epoch_040.hdf5
  num_epochs: 15          # short fine-tune — model already trained
  learning_rate: 0.0001   # very low — just adapting to quantization noise
  enable_qat: True        # THE KEY FLAG
  batch_size_per_gpu: 8   # reduce if OOM during QAT (2× memory overhead)

tao model yolo_v4 train \
  -e specs/yolov4_qat.yaml \
  -r workspace/results/yolov4_qat \
  --gpus 1

# During QAT, TAO automatically:
# 1. Inserts FakeQuantize nodes after Conv/Linear layers
# 2. Learns quantization scale/zero-point as parameters
# 3. Backprop through STE (Straight-Through Estimator)
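
What a FakeQuantize node does is small enough to show directly. A PyTorch sketch of symmetric per-tensor INT8 fake quantization with a straight-through estimator (illustrative, not TAO's actual node):

import torch

class FakeQuantSTE(torch.autograd.Function):
    """Round to the INT8 grid in forward; pass gradients straight through."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, scale: float) -> torch.Tensor:
        q = torch.clamp(torch.round(x / scale), -128, 127)   # quantize
        return q * scale                                     # dequantize

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        return grad_output, None            # STE: treat rounding as identity

x = torch.randn(4, requires_grad=True)
scale = x.abs().max().item() / 127          # naive max calibration
y = FakeQuantSTE.apply(x, scale)
y.sum().backward()
print(x.grad)                               # all ones (gradient passed through)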

QAT vs PTQ Comparison

Post-Training Quantization (PTQ):
  Workflow:  train FP32 → calibrate on 1000 samples → INT8 engine
  mAP drop:  ~2–4% typical
  Time:      +30 minutes calibration

Quantization-Aware Training (QAT):
  Workflow:  train FP32 → QAT fine-tune 15 epochs → INT8 engine
  mAP drop:  ~0.5–1% typical
  Time:      +6 hours QAT training

Use QAT when:
  - mAP is critical and budget allows extra training time
  - PTQ drops below acceptable threshold
  - Model has depthwise separable convolutions (very sensitive to quantization)

9. Exporting to TensorRT

Export to ONNX + TRT Engine

# specs/yolov4_export.yaml
export_config:
  # Input model
  model: workspace/results/yolov4_qat/weights/yolov4_resnet18_epoch_015.hdf5
  # OR for non-QAT:
  # model: workspace/results/yolov4_retrained/weights/yolov4_resnet18_epoch_040.hdf5

  # Output
  output_file: workspace/export/yolov4_resnet18.onnx

  # Target precision
  data_type: int8               # "fp32", "fp16", "int8"

  # Calibration (for PTQ INT8 only, not needed after QAT)
  cal_image_dir: workspace/data/train/images
  cal_cache_file: workspace/export/cal.bin
  cal_batch_size: 8
  cal_batches: 20               # 20 × 8 = 160 calibration images

  # Input shape
  input_dims: [3, 544, 960]     # C, H, W

  # Batch size for engine
  batch_size: 1                 # Jetson real-time: batch=1
  max_batch_size: 8

  # Jetson target (for DLA)
  enable_dla: False             # set True to target DLA

# Export to ONNX (intermediate)
tao model yolo_v4 export \
  -e specs/yolov4_export.yaml \
  -m workspace/results/yolov4_qat/weights/yolov4_resnet18_epoch_015.hdf5 \
  -o workspace/export/yolov4_resnet18.onnx

# Convert ONNX → TensorRT engine
tao deploy tao-converter \
  -k your_ngc_api_key \
  -d 3,544,960 \
  -o BatchedNMS \
  -e workspace/export/yolov4_int8.engine \
  -p Input,1x3x544x960,4x3x544x960,8x3x544x960 \
  -t int8 \
  -c workspace/export/cal.bin \
  workspace/export/yolov4_resnet18.onnx

# Alternative: use trtexec directly
trtexec \
  --onnx=workspace/export/yolov4_resnet18.onnx \
  --saveEngine=workspace/export/yolov4_int8.engine \
  --int8 \
  --calib=workspace/export/cal.bin \
  --workspace=2048 \
  --verbose

Verify the Engine

# Run benchmark on the engine
trtexec \
  --loadEngine=workspace/export/yolov4_int8.engine \
  --batch=1 \
  --iterations=200 \
  --avgRuns=100

# Output:
# [I] mean: 21.33 ms  ← 47 FPS on workstation RTX 3080
# [I] max:  23.1 ms
# [I] GPU Compute: 20.2 ms
# [I] H2D Latency: 0.47 ms
# [I] D2H Latency: 0.63 ms
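
Before writing inference code against the engine, it helps to dump its bindings, in particular to confirm the order of the BatchedNMS outputs parsed in Section 10. A small sketch using the same TensorRT 8.x binding API as the inference class below (these accessors are deprecated in newer TensorRT releases):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("workspace/export/yolov4_int8.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Print each binding: direction, name, shape, dtype
for binding in engine:
    direction = "IN " if engine.binding_is_input(binding) else "OUT"
    print(direction, binding,
          engine.get_binding_shape(binding),
          engine.get_binding_dtype(binding))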

10. Deploying on Jetson Orin Nano

Transfer Engine to Jetson

# TRT engines are NOT portable between GPUs — must build on Jetson
# Transfer the ONNX and calibration file, then build on device

# On workstation — copy files to Jetson
scp workspace/export/yolov4_resnet18.onnx jetson@192.168.1.100:~/models/
scp workspace/export/cal.bin jetson@192.168.1.100:~/models/

# On Jetson — build INT8 engine
ssh jetson@192.168.1.100
trtexec \
  --onnx=/home/jetson/models/yolov4_resnet18.onnx \
  --saveEngine=/home/jetson/models/yolov4_int8_jetson.engine \
  --int8 \
  --calib=/home/jetson/models/cal.bin \
  --workspace=2048
# This takes ~5–15 minutes on Jetson (engine building is slow)

Python Inference on Jetson

import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import cv2

class TAOYOLOInference:
    """Runs TAO-exported YOLO TensorRT engine on Jetson."""

    CLASS_NAMES = ['forklift', 'person', 'pallet']
    INPUT_H, INPUT_W = 544, 960
    CONF_THRESHOLD = 0.4
    NMS_THRESHOLD = 0.5

    def __init__(self, engine_path: str):
        logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, 'rb') as f:
            runtime = trt.Runtime(logger)
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

        # Allocate buffers
        self.inputs, self.outputs, self.bindings, self.stream = [], [], [], cuda.Stream()
        for binding in self.engine:
            size = trt.volume(self.engine.get_binding_shape(binding))
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            if self.engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})

    def preprocess(self, frame: np.ndarray) -> np.ndarray:
        """Resize, normalize, HWC → NCHW."""
        img = cv2.resize(frame, (self.INPUT_W, self.INPUT_H))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # cv2 captures BGR; model expects RGB
        img = img.astype(np.float32) / 255.0
        img = img.transpose(2, 0, 1)   # HWC → CHW
        img = np.expand_dims(img, 0)   # add batch dim
        return np.ascontiguousarray(img)

    def infer(self, frame: np.ndarray) -> list[dict]:
        """Run inference, return list of detections."""
        # Preprocess
        inp = self.preprocess(frame)
        np.copyto(self.inputs[0]['host'], inp.ravel())

        # H2D
        cuda.memcpy_htod_async(
            self.inputs[0]['device'], self.inputs[0]['host'], self.stream)

        # Execute
        self.context.execute_async_v2(
            bindings=self.bindings, stream_handle=self.stream.handle)

        # D2H
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)
        self.stream.synchronize()

        return self._parse_detections(self.outputs, frame.shape)

    def _parse_detections(
        self, outputs: list, orig_shape: tuple
    ) -> list[dict]:
        """Parse the BatchedNMS plugin outputs.

        The BatchedNMS plugin emits four output bindings (typical order):
          num_detections [1,1], nmsed_boxes [1,keepTopK,4],
          nmsed_scores [1,keepTopK], nmsed_classes [1,keepTopK]
        Verify the order and box scaling against your engine's bindings
        (see the binding dump in Section 9).
        """
        orig_h, orig_w = orig_shape[:2]
        sx = orig_w / self.INPUT_W   # map network-input pixels back to the frame
        sy = orig_h / self.INPUT_H

        num_dets = int(outputs[0]['host'][0])
        boxes = outputs[1]['host'].reshape(-1, 4)
        scores = outputs[2]['host']
        classes = outputs[3]['host']

        detections = []
        for i in range(num_dets):
            score = float(scores[i])
            if score < self.CONF_THRESHOLD:
                continue
            x1, y1, x2, y2 = boxes[i]
            detections.append({
                'class': self.CLASS_NAMES[int(classes[i])],
                'score': score,
                'box': [int(x1 * sx), int(y1 * sy), int(x2 * sx), int(y2 * sy)]
            })
        return detections


# Usage on Jetson
detector = TAOYOLOInference('/home/jetson/models/yolov4_int8_jetson.engine')
cap = cv2.VideoCapture(0)  # or GStreamer pipeline

while True:
    ret, frame = cap.read()
    if not ret:
        break

    dets = detector.infer(frame)
    for d in dets:
        x1, y1, x2, y2 = d['box']
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"{d['class']} {d['score']:.2f}",
                    (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

    cv2.imshow('TAO Detection', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

Benchmark on Jetson

# On Jetson — set max performance mode first
sudo jetson_clocks --fan

# Benchmark
trtexec \
  --loadEngine=/home/jetson/models/yolov4_int8_jetson.engine \
  --batch=1 \
  --iterations=500 \
  --avgRuns=200 \
  --warmUp=100

# Expected results on Orin Nano 8GB:
# INT8 ResNet-18 + SSD (544×960):  ~42 FPS
# INT8 ResNet-18 + SSD (416×416):  ~68 FPS
# INT8 EfficientDet-EF0 (512×512): ~35 FPS
# FP16 ResNet-18 + SSD (544×960):  ~28 FPS

11. DeepStream Integration

TAO-exported models plug directly into DeepStream as nvinfer elements — no custom code needed.

nvinfer Config File

# config_infer_primary_yolov4.txt
[property]
gpu-id=0
net-scale-factor=0.0039215697906911373   # 1/255
model-engine-file=/home/jetson/models/yolov4_int8_jetson.engine
int8-calib-file=/home/jetson/models/cal.bin
batch-size=1
process-mode=1                           # 1=primary detector
model-color-format=0                     # 0=RGB
labelfile-path=labels.txt                # one class name per line
gie-unique-id=1
output-blob-names=BatchedNMS
num-detected-classes=3
interval=0

[class-attrs-all]
threshold=0.4
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=20
detected-min-h=20
detected-max-w=9999
detected-max-h=9999

# labels.txt
forklift
person
pallet

DeepStream Pipeline Config

# deepstream_app_config.txt  (deepstream-app group/INI syntax, not YAML)
[application]
enable-perf-measurement=1
perf-measurement-interval-sec=2

[source0]
enable=1
type=4                    # 4=RTSP
uri=rtsp://192.168.1.50/stream
num-sources=1

[sink0]
enable=1
type=4                    # 4=RTSP streaming output
sync=0
codec=1                   # H.264
bitrate=4000000
rtsp-port=8554
udp-port=5400

[osd]
enable=1
gpu-id=0

[streammux]
gpu-id=0
batch-size=1
batched-push-timeout=40000
width=960
height=544

[primary-gie]
enable=1
gpu-id=0
gie-unique-id=1
config-file=config_infer_primary_yolov4.txt

# Run the full pipeline
deepstream-app -c deepstream_app_config.txt

12. TAO vs Manual Pipeline Comparison

Aspect                       TAO Toolkit               Manual PyTorch + TensorRT
Dataset labeling format      KITTI, COCO               Any (you write the loader)
Model architecture           Fixed list (~50 models)   Any model you can write
Training code                Zero (YAML config)        Full training loop
Pruning                      One command               Custom pruning code
INT8 calibration             Built-in calibrator       Custom IInt8Calibrator
TRT export                   tao export command        torch.onnx.export + trtexec
DeepStream integration       Drop-in nvinfer config    Custom Gst plugin
Customization                Low                       Full control
Time to first working model  Hours                     Days–weeks
Custom layer support         No (use standard layers)  Yes

Use TAO when:

  - You need a standard detection/classification model fast
  - Your architecture is in the NGC catalog
  - You want a push-button optimization pipeline
  - You're deploying with DeepStream

Use manual pipeline when:

  - Custom architecture (attention heads, custom losses)
  - Research experiments
  - Need to debug gradient flow
  - Architectures not in NGC catalog (BEVFusion, PointPillars, etc.)


13. Projects

Project 1: Warehouse Object Detector

Train a 3-class detector (forklift, person, pallet) for a Jetson-powered warehouse camera.

Data collection:
  - Record 30-min warehouse video at 1080p
  - Sample every 5 seconds: ~360 frames
  - Label with CVAT (free, open source)
  - Export as KITTI format

TAO workflow:
  1. tao download pretrained_object_detection:resnet18
  2. Train 80 epochs (4 hours on RTX 3080)
  3. Evaluate: target mAP > 80%
  4. Prune 50%, retrain 40 epochs
  5. QAT 15 epochs
  6. Export INT8 engine
  7. Deploy on Jetson + DeepStream RTSP stream

Success metric: >40 FPS on Orin Nano, mAP > 78%

Project 2: Defect Detection on Conveyor Belt

Binary classifier: defect vs no-defect on manufactured parts.

TAO classification workflow:
  1. tao download pretrained_classification:efficientnet_b0
  2. Collect 1,000 defect + 1,000 no-defect images under controlled lighting
  3. Train classifier (10 epochs — classification is fast)
  4. Export FP16 engine (classification needs less optimization than detection)
  5. Jetson reads camera at 120 FPS, classify every frame
  6. Trigger alarm on 3 consecutive defect predictions (see the sketch below)

Success metric: <0.5% false positive rate, >99% recall on defects
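
Step 6's debounce is a few lines. A minimal sketch, where classify() is a hypothetical wrapper around the exported classifier engine:

from collections import deque

REQUIRED_CONSECUTIVE = 3
recent = deque(maxlen=REQUIRED_CONSECUTIVE)

def classify(frame) -> str:
    """Hypothetical: run the EfficientNet-B0 TRT engine, return a label."""
    raise NotImplementedError

def defect_alarm(frame) -> bool:
    """True once the last 3 frames in a row were classified as defects."""
    recent.append(classify(frame) == "defect")
    return len(recent) == REQUIRED_CONSECUTIVE and all(recent)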

Project 3: Multi-Camera People Counter

Use TAO ReID (Re-Identification) to track people across 4 cameras in a building.

Components:
  Camera 1–4 → DeepStream source → Primary detector (TAO YOLOv4 person)
             → Secondary ReID model (TAO ReID)
             → Tracker (NvDCF)
             → Redis person count aggregator
             → Grafana dashboard

TAO models:
  Primary: pretrained_object_detection:resnet18 (person only, 2 classes: person/background)
  Secondary: pretrained_re_identification:resnet50 (128-dim embedding)
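
The Redis aggregation step is also small. A hedged sketch with hypothetical key names, assuming a DeepStream probe hands us the set of ReID track IDs seen per camera:

import redis

r = redis.Redis(host="localhost", port=6379)

def update_camera(camera_id: int, person_ids: set) -> None:
    """Record IDs seen on one camera; entries age out after 60 s."""
    key = f"camera:{camera_id}:people"           # hypothetical key scheme
    if person_ids:
        r.sadd(key, *person_ids)
    r.expire(key, 60)

def building_count(num_cameras: int = 4) -> int:
    """Union across cameras so a person seen twice counts once."""
    keys = [f"camera:{i}:people" for i in range(1, num_cameras + 1)]
    return len(r.sunion(keys))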

14. Resources

Official Documentation

  • TAO Toolkit User Guide — docs.nvidia.com/tao/tao-toolkit/
  • TAO Toolkit API Reference — docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/
  • NGC Model Catalog — ngc.nvidia.com/catalog/models (filter: TAO)
  • TAO Toolkit GitHub — github.com/NVIDIA/tao_pytorch_backend

Tutorials

  • NVIDIA Developer Blog: TAO Toolkit Getting Started — developer.nvidia.com/blog/training-like-a-pro-with-tao-toolkit/
  • Jetson AI Lab — TAO + DeepStream — jetson-ai-lab.com
  • TAO Toolkit Quick Start Guide — catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/resources/tao_toolkit_quick_start_guide
  • DeepStream SDK — docs.nvidia.com/metropolis/deepstream/dev-guide/
  • trtexec documentation — docs.nvidia.com/deeplearning/tensorrt/developer-guide/ (section: trtexec)
  • CVAT (labeling tool) — github.com/opencv/cvat — recommended for KITTI export
  • Roboflow (roboflow.com) — cloud annotate, augment, export YOLO/COCO; good for quick iteration and edge deployment

Benchmark Reference (Orin Nano 8GB, JetPack 6.x)

Model               Task            Precision  Input    FPS
ResNet-18 + SSD     Detection       INT8       544×960  42
ResNet-18 + SSD     Detection       FP16       544×960  28
EfficientNet-B0     Classification  INT8       224×224  380
EfficientDet-EF0    Detection       INT8       512×512  35
MobileNet-V2 + SSD  Detection       INT8       300×300  95

See also: ML and AI main guide for quantization theory, TensorRT manual pipeline, and DLA usage