Jetson Orin Nano 8GB -- Video Codec Hardware, GStreamer Pipelines, and DeepStream SDK¶

Target: Jetson Orin Nano 8GB Developer Kit (T234 SoC, Ampere GPU, JetPack 6.x / L4T 36.x).

Prerequisites: Familiarity with the Orin Nano memory architecture (NVMM, zero-copy, DMA-BUF) and real-time inference (TensorRT engine building).

Table of Contents¶

Introduction
NVDEC Hardware -- Video Decoder Engine
NVENC Hardware -- Video Encoder Engine
NVJPEG -- Hardware JPEG Engine
V4L2 Codec Interface
GStreamer on Jetson -- NVIDIA Accelerated Plugins
GStreamer Pipeline Patterns
Hardware-Accelerated Transcoding
DeepStream SDK Overview
DeepStream Pipeline Construction
DeepStream with Custom Models
DeepStream Multi-Stream Processing
DeepStream Analytics and Message Brokering
Zero-Copy Video Pipeline Architecture
Performance Profiling
Production Deployment
Common Issues and Debugging

1. Introduction¶

1.1 Video Processing in Edge AI¶

Edge AI deployments overwhelmingly involve video. Surveillance cameras, autonomous robots, industrial inspection systems, and smart-city infrastructure all generate continuous video feeds that must be decoded, analyzed, and often re-encoded in real time. The Jetson Orin Nano 8GB is purpose-built for this class of workload: its T234 SoC integrates fixed-function video codec hardware alongside an Ampere-architecture GPU, enabling a pipeline where frames flow from camera sensor through decode, inference, and encode without ever touching the CPU for pixel-level work.

A representative edge video analytics pipeline:

  [IP Camera]          [IP Camera]          [USB Camera]
      |                     |                     |
      v                     v                     v
  RTSP Decode (NVDEC)  RTSP Decode (NVDEC)  V4L2 Capture
      |                     |                     |
      +----------+----------+----------+----------+
                 |                     |
                 v                     v
           Stream Muxer          nvvideoconvert
           (batching)            (sys RAM -> NVMM)
                 |                     |
                 +----------+----------+
                            |
                            v
                   TensorRT Inference (GPU)
                            |
                            v
                   Object Tracker (GPU/CPU)
                            |
                   +--------+--------+
                   |                 |
                   v                 v
              OSD Overlay       Metadata Export
              (bboxes)          (Kafka / MQTT)
                   |
                   v
              H.265 Encode (NVENC)
                   |
                   v
              RTSP Server Output

1.2 Hardware Acceleration vs Software Codecs¶

Software codecs (libx264, libx265, libvpx) execute on CPU cores. On the Orin Nano's six-core Arm Cortex-A78AE complex, a single 1080p H.265 software encode consumes 80-100% of multiple cores, leaving no headroom for application logic or inference.

Hardware codec engines (NVDEC, NVENC, NVJPEG) are dedicated ASIC blocks on the T234 die. They operate independently of both the CPU and GPU:

Metric	Software (libx265)	Hardware (NVENC H.265)
1080p30 encode CPU usage	300-400% (3-4 cores)	< 5% (control plane)
1080p30 encode GPU usage	0%	0% (separate ASIC)
1080p30 encode latency	15-40 ms/frame	3-8 ms/frame
1080p30 encode power	4-6W additional	< 1W additional
Concurrent streams	1-2 (CPU-limited)	4+ (hardware-limited)

The hardware path frees the CPU for control-plane logic, I/O, and orchestration. The GPU remains fully available for CUDA kernels and TensorRT inference. Power per encoded frame is a fraction of the software path, which is critical for 7-15W power budgets.

1.3 T234 SoC Video Engine Overview¶

The T234 contains four dedicated hardware engines for video and image processing:

+-------------------------------------------------------------------+
|                        T234 SoC Die                               |
|                                                                   |
|  +--------+  +--------+  +--------+  +--------+  +-------------+ |
|  | NVDEC  |  | NVENC  |  | NVJPEG |  |  VIC   |  | Ampere GPU  | |
|  | (1x)   |  | (1x)   |  | (1x)   |  | (1x)   |  | 1024 CUDA   | |
|  |        |  |        |  |        |  |        |  | cores       | |
|  +---+----+  +---+----+  +---+----+  +---+----+  +------+------+ |
|      |           |           |           |               |        |
|      +-----+-----+-----+----+-----+-----+               |        |
|            |                       |                     |        |
|       +----v-----------------------v---------------------v----+   |
|       |              LPDDR5 Memory Subsystem (8GB)            |   |
|       |              (102.4 GB/s bandwidth, shared)           |   |
|       +-------------------------------------------------------+   |
+-------------------------------------------------------------------+

Engine	Full Name	Function
NVDEC	Video Decoder	Fixed-function H.264/H.265/VP9/AV1 decode
NVENC	Video Encoder	Fixed-function H.264/H.265 encode
NVJPEG	JPEG Engine	Hardware JPEG encode and decode
VIC	Video Image Compositor	Scaling, color conversion, rotation, compose

All engines access DRAM via DMA through the SMMU. They output to NVMM (NVIDIA Multimedia Memory) buffers that are zero-copy readable by the GPU and DLA. This is the foundation of the zero-copy pipeline described in Section 14.

1.4 JetPack and Software Stack Versions¶

JetPack	L4T	DeepStream	GStreamer	Key Changes
5.1.2	35.4.1	6.3	1.16	Orin Nano initial support
6.0	36.3	6.4 / 7.0	1.20	AV1 decode, new tracker, DS 7.0
6.1	36.4	7.1	1.20	Performance improvements

Always match DeepStream version to JetPack exactly. Cross-version combinations cause symbol resolution errors and silent data corruption.

2. NVDEC Hardware -- Video Decoder Engine¶

2.1 Decoder Block Architecture¶

The T234 contains one NVDEC instance. This is a fixed-function hardware block that accepts compressed bitstreams and produces decoded frames in NV12 (or NV12_10LE for 10-bit content) into DRAM via DMA. The decoder runs asynchronously with respect to both the CPU and GPU.

The driver exposes NVDEC as a V4L2 Memory-to-Memory (M2M) device. The userspace flow:

  1. Open /dev/video0 (decoder device)
  2. Set OUTPUT format (compressed: H264, H265, VP9, AV1)
  3. Set CAPTURE format (raw: NV12)
  4. STREAMON on both queues
  5. Queue compressed NAL units to OUTPUT
  6. Dequeue decoded frames from CAPTURE

GStreamer's nvv4l2decoder wraps this V4L2 interface and outputs NVMM buffers.

2.2 Supported Codecs and Profiles¶

Codec	Profiles Supported	Max Bit Depth	Chroma
H.264	Baseline, Main, High, High 10	10-bit	4:2:0
H.265	Main, Main 10, Main Still Picture	10-bit	4:2:0
VP9	Profile 0, Profile 2 (10-bit)	10-bit	4:2:0
AV1	Main Profile	10-bit	4:2:0

AV1 decode requires JetPack 6.0 or later. 10-bit decode outputs NV12_10LE format; downstream conversion via VIC (nvvideoconvert) is needed before elements expecting 8-bit.

2.3 Maximum Resolution and Throughput¶

Codec	Max Resolution	Peak Decode Throughput
H.264	4096 x 4096	1x 4K30 or 2x 1080p60 or 8x 720p30
H.265	8192 x 8192	1x 4K60 or 2x 4K30 or 4x 1080p60
VP9	8192 x 8192	1x 4K60 or 2x 4K30
AV1	8192 x 4320	1x 4K30

These are hardware peak rates. Actual throughput depends on bitstream complexity (high motion / high QP streams decode faster), memory bandwidth contention with GPU and other engines, and thermal state (throttling above ~85C junction temperature).

2.4 Simultaneous Decode Streams¶

The single NVDEC engine time-slices across multiple concurrent decode sessions. The practical limit is determined by total pixel throughput, not a fixed stream count:

Total pixels/s budget (H.265): ~500 Mpixels/s

Scenario A: 4x 1080p30 H.265  = 4 * 1920 * 1080 * 30 = 249M pixels/s   [OK]
Scenario B: 8x 720p30  H.265  = 8 * 1280 *  720 * 30 = 221M pixels/s   [OK]
Scenario C: 1x 4K60    H.265  = 1 * 3840 * 2160 * 60 = 497M pixels/s   [OK]
Scenario D: 2x 4K60    H.265  = 2 * 3840 * 2160 * 60 = 995M pixels/s   [EXCEEDS]
Scenario E: 1x 4K30 + 4x 720p30 = 249M + 111M        = 360M pixels/s   [OK]

When the pixel budget is exceeded, NVDEC cannot maintain real-time and frames will be dropped or queued, causing increasing latency.

2.5 Decoder Latency¶

Configuration	Typical Decode Latency
H.264 Baseline (no B-frames)	1-2 frames (33-66 ms @30)
H.264 Main/High (with B-frames)	2-4 frames (66-133 ms @30)
H.265 Main (no B-frames)	1-3 frames (33-100 ms @30)
AV1 Main	2-4 frames (66-133 ms @30)

For low-latency applications (robotics, teleoperation), the encoder producing the stream should be configured without B-frames and with short GOP lengths. The decoder has a enable-max-performance property to lock NVDEC clocks:

gst-launch-1.0 rtspsrc location=rtsp://camera/live latency=0 ! \
  rtph265depay ! h265parse ! nvv4l2decoder enable-max-performance=true ! \
  nvvideoconvert ! nv3dsink sync=false

2.6 Querying NVDEC Capabilities¶

# List V4L2 video devices (decoder is typically /dev/video0)
ls -la /dev/video*

# Query decoder device capabilities
v4l2-ctl -d /dev/video0 --all

# List supported compressed input formats (OUTPUT queue)
v4l2-ctl -d /dev/video0 --list-formats-out
# Expected output includes: H264, H265, VP9, AV1

# List supported raw output formats (CAPTURE queue)
v4l2-ctl -d /dev/video0 --list-formats
# Expected output includes: NV12, NV12M

# Check current NVDEC clock and utilization
sudo cat /sys/kernel/debug/clk/nvdec/clk_rate
sudo tegrastats --interval 500
# Look for NVDEC% in tegrastats output

3. NVENC Hardware -- Video Encoder Engine¶

3.1 Encoder Block Architecture¶

The T234 contains one NVENC instance. Like NVDEC, this is a fixed-function ASIC block that accepts raw frames (NV12, I420, P010 for 10-bit) and produces compressed bitstreams. The encoder operates via DMA and runs independently of the CPU and GPU.

The driver exposes NVENC as a V4L2 M2M device (typically /dev/video1). The flow mirrors the decoder but in reverse: raw frames go into the OUTPUT queue, compressed NAL units come out of the CAPTURE queue.

3.2 Supported Codecs and Profiles¶

Codec	Profiles Supported	Max Bit Depth	Chroma
H.264	Baseline, Main, High	8-bit	4:2:0
H.265	Main, Main 10	10-bit	4:2:0

Unlike NVDEC, the encoder does not support VP9 or AV1 encoding. This asymmetry is common in edge SoCs -- decode support is broader than encode.

3.3 Maximum Resolution and Throughput¶

Codec	Max Encode Resolution	Peak Encode Throughput
H.264	4096 x 4096	1x 4K30 or 2x 1080p60 or 4x 1080p30
H.265	8192 x 8192	1x 4K30 or 2x 1080p60 or 4x 1080p30

3.4 Rate Control Modes¶

Mode	GStreamer Value	Description
VBR	control-rate=0	Variable bitrate. Quality varies to hit average bitrate.
CBR	control-rate=1	Constant bitrate. Best for fixed-bandwidth streaming.
CQP	control-rate=2	Constant QP. Fixed quality, variable file size.

# CBR at 4 Mbps
gst-launch-1.0 videotestsrc num-buffers=300 ! \
  'video/x-raw,width=1920,height=1080,framerate=30/1' ! nvvideoconvert ! \
  'video/x-raw(memory:NVMM),format=NV12' ! \
  nvv4l2h265enc bitrate=4000000 control-rate=1 ! \
  h265parse ! mp4mux ! filesink location=cbr_output.mp4

# VBR with peak 8 Mbps, average 4 Mbps
gst-launch-1.0 videotestsrc num-buffers=300 ! \
  'video/x-raw,width=1920,height=1080,framerate=30/1' ! nvvideoconvert ! \
  'video/x-raw(memory:NVMM),format=NV12' ! \
  nvv4l2h265enc bitrate=4000000 peak-bitrate=8000000 control-rate=0 ! \
  h265parse ! mp4mux ! filesink location=vbr_output.mp4

# CQP with explicit QP values for I, P, B frames
gst-launch-1.0 videotestsrc num-buffers=300 ! \
  'video/x-raw,width=1920,height=1080,framerate=30/1' ! nvvideoconvert ! \
  'video/x-raw(memory:NVMM),format=NV12' ! \
  nvv4l2h265enc control-rate=2 quant-i-frames=20 quant-p-frames=23 quant-b-frames=25 ! \
  h265parse ! mp4mux ! filesink location=cqp_output.mp4

3.5 B-Frame Support and GOP Structure¶

NVENC on the Orin Nano supports B-frames for both H.264 and H.265:

# 2 B-frames, IDR every 30 frames
gst-launch-1.0 videotestsrc num-buffers=300 ! \
  'video/x-raw,width=1920,height=1080,framerate=30/1' ! nvvideoconvert ! \
  'video/x-raw(memory:NVMM),format=NV12' ! \
  nvv4l2h264enc num-b-frames=2 idrinterval=30 bitrate=4000000 ! \
  h264parse ! mp4mux ! filesink location=bframe_output.mp4

GOP structure visualization:

IDR interval=30, B-frames=2:
  I B B P B B P B B P B B P B B P B B P B B P B B P B B P B B [IDR]

IDR interval=30, B-frames=0 (low-latency):
  I P P P P P P P P P P P P P P P P P P P P P P P P P P P P P [IDR]

For low-latency streaming, set num-b-frames=0 and idrinterval=15 or lower.

3.6 Encoder Properties Reference¶

Property	Type	Default	Description
`bitrate`	uint	4000000	Target bitrate in bits/sec
`peak-bitrate`	uint	0	Peak bitrate for VBR (0 = auto)
`control-rate`	enum	1 (CBR)	Rate control mode (0=VBR, 1=CBR, 2=CQP)
`preset-level`	uint	1	0=UltraFast, 1=Fast, 2=Medium, 3=Slow, 4=HQ
`idrinterval`	uint	256	Frames between IDR frames
`iframeinterval`	uint	30	Frames between I-frames
`num-b-frames`	uint	0	Number of B-frames between P-frames
`insert-sps-pps`	bool	false	Insert SPS/PPS with every IDR (needed for RTSP)
`maxperf-enable`	bool	false	Lock NVENC clock to maximum frequency
`EnableTwopassCBR`	bool	false	Two-pass CBR for higher quality
`insert-vui`	bool	false	Insert Video Usability Information
`profile`	enum	varies	H264: 0=Base,2=Main,4=High; H265: 0=Main
`quant-i-frames`	uint	0	QP for I-frames (CQP mode)
`quant-p-frames`	uint	0	QP for P-frames (CQP mode)
`quant-b-frames`	uint	0	QP for B-frames (CQP mode)

# Inspect all encoder properties
gst-inspect-1.0 nvv4l2h265enc
gst-inspect-1.0 nvv4l2h264enc

3.7 Quality Tuning Presets¶

# Ultra-low-latency streaming (robotics, teleoperation)
gst-launch-1.0 nvarguscamerasrc ! \
  'video/x-raw(memory:NVMM),width=1280,height=720,framerate=60/1' ! \
  nvv4l2h264enc preset-level=0 bitrate=3000000 control-rate=1 \
    idrinterval=15 num-b-frames=0 insert-sps-pps=true \
    maxperf-enable=true profile=0 ! \
  h264parse ! rtph264pay ! udpsink host=192.168.1.100 port=5000

# High-quality archival recording
gst-launch-1.0 nvarguscamerasrc ! \
  'video/x-raw(memory:NVMM),width=3840,height=2160,framerate=30/1' ! \
  nvv4l2h265enc preset-level=4 bitrate=15000000 control-rate=1 \
    EnableTwopassCBR=true idrinterval=60 num-b-frames=2 ! \
  h265parse ! mp4mux ! filesink location=archive.mp4

4. NVJPEG -- Hardware JPEG Engine¶

4.1 JPEG Hardware Block¶

The T234 integrates a dedicated JPEG encoder/decoder hardware block (NVJPEG). This engine handles JPEG compression and decompression entirely in hardware, independently of the CPU and GPU.

4.2 Specifications¶

Feature	Specification
Max encode resolution	32768 x 32768 (memory-limited)
Max decode resolution	32768 x 32768 (memory-limited)
Encode throughput	~350-500 Mpixels/s
Decode throughput	~500 Mpixels/s
Chroma subsampling	4:2:0, 4:2:2, 4:4:4
Quality range	1-100 (JPEG quality factor)
Profile	Baseline JPEG only (no progressive/arithmetic)

4.3 Performance Comparison¶

Metric	NVJPEG (hardware)	libjpeg-turbo (CPU)
1080p encode time	~1.5 ms	~8-12 ms
4K encode time	~5 ms	~35-50 ms
CPU utilization	Near zero	100% of 1-2 cores
Power impact	Minimal (< 0.5W)	Significant (2-4W)

4.4 Snapshot Capture¶

# Single JPEG frame from CSI camera
gst-launch-1.0 nvarguscamerasrc num-buffers=1 ! \
  'video/x-raw(memory:NVMM),width=4032,height=3040,framerate=30/1' ! \
  nvjpegenc quality=95 ! filesink location=snapshot.jpg

# Single JPEG from USB camera
gst-launch-1.0 v4l2src device=/dev/video2 num-buffers=1 ! \
  'video/x-raw,width=1920,height=1080' ! nvvideoconvert ! \
  'video/x-raw(memory:NVMM),format=NV12' ! \
  nvjpegenc quality=90 ! filesink location=snapshot.jpg

4.5 MJPEG Streaming¶

# MJPEG RTP stream from CSI camera
gst-launch-1.0 nvarguscamerasrc ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1' ! \
  nvjpegenc quality=85 ! rtpjpegpay ! \
  udpsink host=192.168.1.100 port=5000

# Receive and display on client machine
gst-launch-1.0 udpsrc port=5000 ! \
  'application/x-rtp,encoding-name=JPEG' ! rtpjpegdepay ! \
  jpegdec ! videoconvert ! autovideosink

4.6 JPEG Decode for Inference¶

# Decode directory of JPEG images for batch inference
gst-launch-1.0 multifilesrc location="images/%05d.jpg" index=0 caps="image/jpeg" ! \
  jpegparse ! nvv4l2decoder mjpeg=1 ! \
  nvvideoconvert ! 'video/x-raw(memory:NVMM),format=RGBA' ! fakesink

# In Python using jetson.utils

import jetson.utils

# Hardware-accelerated JPEG decode
img = jetson.utils.loadImage("input.jpg", format="rgb8")
print(f"Decoded: {img.width}x{img.height}, format={img.format}")

# Hardware-accelerated JPEG encode
jetson.utils.saveImage("output.jpg", img, quality=90)

5. V4L2 Codec Interface¶

5.1 Device Topology¶

On the Orin Nano, the V4L2 codec devices are exposed as Memory-to-Memory (M2M) devices:

$ ls -la /dev/video*
crw-rw---- 1 root video 81, 0 ... /dev/video0    # NVDEC (decoder)
crw-rw---- 1 root video 81, 1 ... /dev/video1    # NVENC (encoder)
crw-rw---- 1 root video 81, 2 ... /dev/video2    # NVJPEG (JPEG enc/dec)
crw-rw---- 1 root video 81, 3 ... /dev/video3    # USB camera (if connected)

Note: Actual device numbers may vary depending on USB devices and kernel configuration. Use v4l2-ctl --list-devices to identify the correct device.

$ v4l2-ctl --list-devices
NVIDIA Tegra Video Decoder (platform:15480000.nvdec):
        /dev/video0

NVIDIA Tegra Video Encoder (platform:15a80000.nvenc):
        /dev/video1

NVIDIA Tegra JPEG Encoder (platform:15380000.nvjpg):
        /dev/video2

5.2 V4L2 M2M Architecture¶

    Userspace Application
         |           ^
    [OUTPUT queue]  [CAPTURE queue]
    (compressed)    (raw NV12)
         |           |
         v           |
    +-----------------------+
    |   V4L2 M2M Driver    |
    |   (nvdec / nvenc)     |
    +-----------------------+
         |           ^
         v           |
    +-----------------------+
    |   NVDEC / NVENC HW   |
    +-----------------------+

Each M2M device has two buffer queues: - OUTPUT: Where the application sends data to be processed (compressed for decode, raw for encode). - CAPTURE: Where the application receives processed data (raw for decode, compressed for encode).

5.3 Buffer Management -- MMAP vs DMA-BUF¶

MMAP buffers: The kernel allocates buffers and maps them into userspace.

#include <linux/videodev2.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

// Request MMAP buffers on the CAPTURE queue
struct v4l2_requestbuffers req = {0};
req.count  = 4;
req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
req.memory = V4L2_MEMORY_MMAP;
ioctl(fd, VIDIOC_REQBUFS, &req);

// Map each buffer
for (int i = 0; i < req.count; i++) {
    struct v4l2_buffer buf = {0};
    struct v4l2_plane planes[1] = {0};
    buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
    buf.memory = V4L2_MEMORY_MMAP;
    buf.index  = i;
    buf.m.planes = planes;
    buf.length = 1;
    ioctl(fd, VIDIOC_QUERYBUF, &buf);

    void *ptr = mmap(NULL, planes[0].length,
                     PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, planes[0].m.mem_offset);
    // Store ptr for later use
}

DMA-BUF buffers: Zero-copy sharing between hardware engines. This is the preferred path for pipelines where decoded frames feed directly into CUDA or NVENC.

// Request DMA-BUF export on CAPTURE queue
struct v4l2_requestbuffers req = {0};
req.count  = 4;
req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
req.memory = V4L2_MEMORY_DMABUF;
ioctl(fd, VIDIOC_REQBUFS, &req);

// Export buffer as DMA-BUF fd
struct v4l2_exportbuffer expbuf = {0};
expbuf.type  = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
expbuf.index = 0;
expbuf.plane = 0;
ioctl(fd, VIDIOC_EXPBUF, &expbuf);
int dma_fd = expbuf.fd;
// dma_fd can be imported by CUDA, NVENC, or other DMA-BUF consumers

5.4 Command-Line Codec Usage with v4l2-ctl¶

# Query decoder capabilities
v4l2-ctl -d /dev/video0 --info
v4l2-ctl -d /dev/video0 --list-formats-out   # Compressed input formats
v4l2-ctl -d /dev/video0 --list-formats       # Raw output formats

# Query encoder capabilities
v4l2-ctl -d /dev/video1 --info
v4l2-ctl -d /dev/video1 --list-formats-out   # Raw input formats
v4l2-ctl -d /dev/video1 --list-formats       # Compressed output formats

# Set encoder bitrate via V4L2 control
v4l2-ctl -d /dev/video1 --set-ctrl=video_bitrate=4000000
v4l2-ctl -d /dev/video1 --set-ctrl=video_bitrate_mode=1  # CBR

# List all controls for encoder
v4l2-ctl -d /dev/video1 --list-ctrls-menus

5.5 V4L2 Decode Example (C)¶

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/videodev2.h>

#define DECODER_DEV "/dev/video0"

int main() {
    int fd = open(DECODER_DEV, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    // Set OUTPUT format (compressed H.265 input)
    struct v4l2_format fmt_out = {0};
    fmt_out.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
    fmt_out.fmt.pix_mp.width       = 1920;
    fmt_out.fmt.pix_mp.height      = 1080;
    fmt_out.fmt.pix_mp.pixelformat = V4L2_PIX_FMT_H265;
    fmt_out.fmt.pix_mp.num_planes  = 1;
    ioctl(fd, VIDIOC_S_FMT, &fmt_out);

    // Set CAPTURE format (raw NV12 output)
    struct v4l2_format fmt_cap = {0};
    fmt_cap.type = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
    fmt_cap.fmt.pix_mp.width       = 1920;
    fmt_cap.fmt.pix_mp.height      = 1080;
    fmt_cap.fmt.pix_mp.pixelformat = V4L2_PIX_FMT_NV12M;
    fmt_cap.fmt.pix_mp.num_planes  = 2;
    ioctl(fd, VIDIOC_S_FMT, &fmt_cap);

    // Request buffers, streamon, queue/dequeue loop omitted for brevity
    // See NVIDIA L4T Multimedia API samples: /usr/src/jetson_multimedia_api/

    close(fd);
    return 0;
}

The full V4L2 encode/decode examples are installed at: /usr/src/jetson_multimedia_api/samples/ on JetPack systems.

5.6 NVIDIA Multimedia API (NvMedia Alternative)¶

For applications that need finer control than GStreamer but more structure than raw V4L2, NVIDIA provides the Jetson Multimedia API:

# Sample locations on JetPack 6.x
ls /usr/src/jetson_multimedia_api/samples/
# 01_video_encode/   -- V4L2 encode example
# 02_video_decode/   -- V4L2 decode example
# 10_camera_recording/ -- Camera + encode
# 12_camera_v4l2_cuda/  -- Camera + CUDA processing

# Build a sample
cd /usr/src/jetson_multimedia_api/samples/02_video_decode
make
./video_decode H265 --disable-rendering -o decoded.yuv input.h265

6. GStreamer on Jetson -- NVIDIA Accelerated Plugins¶

6.1 Plugin Overview¶

NVIDIA provides hardware-accelerated GStreamer plugins that replace standard software elements. These plugins communicate with the hardware engines via the V4L2 driver and keep data in NVMM buffers for zero-copy throughput.

NVIDIA Plugin	Replaces	Hardware Engine	Function
`nvv4l2decoder`	`avdec_h264/h265`	NVDEC	Hardware video decode
`nvv4l2h264enc`	`x264enc`	NVENC	Hardware H.264 encode
`nvv4l2h265enc`	`x265enc`	NVENC	Hardware H.265 encode
`nvvideoconvert`	`videoconvert`	VIC / GPU	Color/format conversion, scaling
`nvjpegenc`	`jpegenc`	NVJPEG	Hardware JPEG encode
`nvjpegdec`	`jpegdec`	NVJPEG	Hardware JPEG decode
`nvarguscamerasrc`	`v4l2src` (CSI)	ISP	CSI camera via libargus
`nvv4l2camerasrc`	`v4l2src` (USB)	--	USB camera with NVMM output
`nv3dsink`	`xvimagesink`	GPU (EGL)	GPU-accelerated display
`nvegltransform`	--	GPU (EGL)	EGL transform for display
`nvstreammux`	--	--	Batch multiple streams
`nvdsosd`	`textoverlay`	GPU	Bounding box / text overlay
`nvmultistreamtiler`	--	GPU	N-stream grid composite

6.2 Checking Installed Plugins¶

# List all NVIDIA GStreamer plugins
gst-inspect-1.0 | grep -i nv

# Detailed info on a specific plugin
gst-inspect-1.0 nvv4l2decoder
gst-inspect-1.0 nvv4l2h265enc
gst-inspect-1.0 nvvideoconvert
gst-inspect-1.0 nvarguscamerasrc

# Verify plugin versions
gst-inspect-1.0 nvv4l2decoder | grep Version

6.3 Memory Negotiation -- The NVMM Caps Convention¶

NVIDIA plugins use (memory:NVMM) in GStreamer caps to indicate that buffers reside in NVMM (NVIDIA Multimedia Memory) -- physically contiguous, DMA-capable memory accessible by all hardware engines without CPU copies.

# NVMM buffer (zero-copy between hardware elements)
video/x-raw(memory:NVMM),format=NV12,width=1920,height=1080

# System memory buffer (CPU-accessible, requires copy to reach hardware)
video/x-raw,format=NV12,width=1920,height=1080

The transition between system memory and NVMM always requires nvvideoconvert:

  [v4l2src]                   [nvv4l2decoder]
  (system RAM)                (NVMM)
      |                            |
      v                            v
  nvvideoconvert              directly to
  (sys -> NVMM copy)         nvv4l2h265enc
      |                      (zero-copy)
      v
  nvv4l2h265enc

6.4 nvvideoconvert Capabilities¶

nvvideoconvert (also aliased as nvvidconv in older JetPack) handles:

Color space conversion: NV12 to RGBA, NV12 to I420, RGBA to NV12, etc.
Resolution scaling: Arbitrary input-to-output resolution change.
Memory domain transfer: System memory to NVMM and vice versa.
Pixel format conversion: Between all supported formats.
Cropping and letterboxing: Via caps negotiation.

# Scale 4K to 1080p using VIC hardware
gst-launch-1.0 videotestsrc ! \
  'video/x-raw,width=3840,height=2160' ! nvvideoconvert ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12' ! \
  nv3dsink

# Convert NV12 to RGBA for CUDA processing
gst-launch-1.0 ... ! nvv4l2decoder ! \
  nvvideoconvert ! 'video/x-raw(memory:NVMM),format=RGBA' ! ...

6.5 nvegltransform for Display¶

When displaying NVMM buffers on screen, nvegltransform converts NVMM to EGLImage for rendering. On some JetPack versions, this is required between nvvideoconvert and nveglglessink:

gst-launch-1.0 videotestsrc ! nvvideoconvert ! \
  'video/x-raw(memory:NVMM),format=RGBA' ! \
  nvegltransform ! nveglglessink

With nv3dsink, this intermediate step is not needed -- nv3dsink handles NVMM directly.

6.6 Plugin Pipeline Data Flow¶

  nvarguscamerasrc                        v4l2src (USB)
  [NVMM output]                          [sys RAM output]
       |                                       |
       |                                  nvvideoconvert
       |                                  [sys -> NVMM copy]
       |                                       |
       +-------------------+-------------------+
                           |
                    nvv4l2h265enc / nvinfer / nvdsosd
                    [all accept NVMM input]
                           |
                    nvvideoconvert
                    [NVMM, format/scale changes via VIC]
                           |
                    nv3dsink / rtph265pay
                    [display or network output]

7. GStreamer Pipeline Patterns¶

7.1 Pattern 1: Camera to Encode to Stream¶

CSI camera captured via ISP, hardware-encoded to H.265, and streamed over RTP/UDP:

# CSI camera -> H.265 encode -> RTP stream
gst-launch-1.0 nvarguscamerasrc sensor-id=0 ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1,format=NV12' ! \
  nvv4l2h265enc bitrate=4000000 insert-sps-pps=true idrinterval=30 \
    maxperf-enable=true ! \
  h265parse ! rtph265pay config-interval=1 ! \
  udpsink host=192.168.1.100 port=5000 sync=false

# Receive on client
gst-launch-1.0 udpsrc port=5000 ! \
  'application/x-rtp,media=video,encoding-name=H265,payload=96' ! \
  rtph265depay ! h265parse ! avdec_h265 ! videoconvert ! autovideosink

USB camera variant (note the required nvvideoconvert for sys-to-NVMM copy):

# USB camera -> nvvideoconvert -> H.264 encode -> file
gst-launch-1.0 v4l2src device=/dev/video2 ! \
  'video/x-raw,width=1280,height=720,framerate=30/1' ! \
  nvvideoconvert ! 'video/x-raw(memory:NVMM),format=NV12' ! \
  nvv4l2h264enc bitrate=2000000 insert-sps-pps=true ! \
  h264parse ! mp4mux ! filesink location=usb_recording.mp4

7.2 Pattern 2: File to Decode to Display¶

# MP4 file -> H.265 decode -> display
gst-launch-1.0 filesrc location=video.mp4 ! qtdemux name=d \
  d.video_0 ! h265parse ! nvv4l2decoder enable-max-performance=true ! \
  nvvideoconvert ! 'video/x-raw(memory:NVMM),format=RGBA' ! \
  nv3dsink sync=true

# With audio (demux both tracks)
gst-launch-1.0 filesrc location=video.mp4 ! qtdemux name=d \
  d.video_0 ! h265parse ! nvv4l2decoder ! nv3dsink \
  d.audio_0 ! aacparse ! avdec_aac ! audioconvert ! alsasink

7.3 Pattern 3: Decode to Inference to Encode¶

This is the core edge AI pattern -- decode a stream, run inference, overlay results, and re-encode for output:

# File -> decode -> inference (nvinfer) -> overlay -> encode -> file
gst-launch-1.0 filesrc location=traffic.mp4 ! qtdemux ! h265parse ! \
  nvv4l2decoder ! \
  m.sink_0 nvstreammux name=m batch-size=1 width=1920 height=1080 ! \
  nvinfer config-file-path=pgie_config.txt ! \
  nvvideoconvert ! 'video/x-raw(memory:NVMM),format=RGBA' ! \
  nvdsosd ! \
  nvvideoconvert ! 'video/x-raw(memory:NVMM),format=NV12' ! \
  nvv4l2h265enc bitrate=4000000 ! h265parse ! mp4mux ! \
  filesink location=output_with_detections.mp4

7.4 Pattern 4: Multi-Stream with tee¶

Split a single source into multiple outputs (display + record + stream):

# Camera -> tee -> display + H.265 file + JPEG snapshots
gst-launch-1.0 nvarguscamerasrc ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1' ! \
  tee name=t \
  t. ! queue ! nv3dsink sync=false \
  t. ! queue ! nvv4l2h265enc bitrate=4000000 ! h265parse ! \
       mp4mux ! filesink location=recording.mp4 \
  t. ! queue ! videorate ! 'video/x-raw(memory:NVMM),framerate=1/5' ! \
       nvjpegenc quality=90 ! multifilesink location="snap_%05d.jpg"

7.5 Pattern 5: Multi-Source Input¶

Multiple RTSP cameras feeding into a single DeepStream pipeline:

gst-launch-1.0 \
  rtspsrc location=rtsp://cam1/live latency=100 ! rtph265depay ! h265parse ! \
    nvv4l2decoder ! m.sink_0 \
  rtspsrc location=rtsp://cam2/live latency=100 ! rtph265depay ! h265parse ! \
    nvv4l2decoder ! m.sink_1 \
  rtspsrc location=rtsp://cam3/live latency=100 ! rtph264depay ! h264parse ! \
    nvv4l2decoder ! m.sink_2 \
  nvstreammux name=m batch-size=3 width=1920 height=1080 \
    batched-push-timeout=40000 live-source=1 ! \
  nvinfer config-file-path=pgie_config.txt ! \
  nvmultistreamtiler rows=2 columns=2 width=1920 height=1080 ! \
  nvvideoconvert ! nvdsosd ! nv3dsink sync=false

7.6 Pattern 6: RTSP Input to RTSP Output¶

End-to-end analytics pipeline with RTSP in and RTSP out:

# This requires the GStreamer RTSP server library
# Best done via Python (see Section 16) or DeepStream config file:

# deepstream_app config approach:
# [source0]
# enable=1
# type=4
# uri=rtsp://camera.local:554/live
# [sink0]
# enable=1
# type=4
# rtsp-port=8554
# udp-port=5400
# codec=1
# bitrate=4000000

7.7 Pipeline Debugging Aid: DOT Graph¶

# Generate pipeline graph for any gst-launch pipeline
export GST_DEBUG_DUMP_DOT_DIR=/tmp/gst-dots
mkdir -p /tmp/gst-dots

gst-launch-1.0 videotestsrc num-buffers=30 ! nvvideoconvert ! \
  'video/x-raw(memory:NVMM),format=NV12' ! nvv4l2h265enc ! fakesink

# Convert to PNG (requires graphviz)
dot -Tpng /tmp/gst-dots/*.dot -o pipeline_graph.png

8. Hardware-Accelerated Transcoding¶

8.1 Full Decode-Scale-Encode Pipeline¶

Transcoding from H.264 at 4K to H.265 at 1080p, entirely in hardware:

gst-launch-1.0 filesrc location=input_4k.mp4 ! qtdemux ! h264parse ! \
  nvv4l2decoder enable-max-performance=true ! \
  nvvideoconvert ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12' ! \
  nvv4l2h265enc bitrate=4000000 preset-level=3 control-rate=1 \
    idrinterval=30 maxperf-enable=true ! \
  h265parse ! mp4mux ! filesink location=output_1080p.mp4

Data flow through hardware engines:

  File (disk I/O)
       |
       v
  qtdemux (CPU: demux only, no pixel work)
       |
       v
  h264parse (CPU: parse NAL headers only)
       |
       v
  nvv4l2decoder [NVDEC hardware]
  Output: 3840x2160 NV12 in NVMM
       |
       v
  nvvideoconvert [VIC hardware]
  Scale: 3840x2160 -> 1920x1080
  Output: 1920x1080 NV12 in NVMM (zero-copy, same NVMM pool)
       |
       v
  nvv4l2h265enc [NVENC hardware]
  Input: 1920x1080 NV12 from NVMM (zero-copy)
  Output: H.265 bitstream
       |
       v
  h265parse + mp4mux (CPU: mux only)
       |
       v
  filesink (disk I/O)

CPU involvement in this pipeline is limited to demuxing, parsing, muxing, and I/O control. No CPU core touches pixel data.

8.2 Format Conversion During Transcode¶

# Transcode with format change: NV12 decode -> RGBA intermediate -> NV12 encode
# Useful when inserting CUDA processing between decode and encode
gst-launch-1.0 filesrc location=input.mp4 ! qtdemux ! h265parse ! \
  nvv4l2decoder ! \
  nvvideoconvert ! 'video/x-raw(memory:NVMM),format=RGBA' ! \
  identity name=cuda_tap ! \
  nvvideoconvert ! 'video/x-raw(memory:NVMM),format=NV12' ! \
  nvv4l2h265enc bitrate=4000000 ! h265parse ! mp4mux ! \
  filesink location=output.mp4

8.3 Resolution Scaling Options¶

nvvideoconvert supports several interpolation methods controlled by the interpolation-method property:

Value	Method	Quality	Speed
0	Nearest neighbor	Lowest	Fastest
1	Bilinear	Good	Fast
2	5-tap filter	Better	Moderate
3	10-tap filter	Best	Slowest
4	Smart (adaptive)	Best	Moderate
5	Nicest	Highest	Slowest

# High-quality downscale with 5-tap filter
gst-launch-1.0 filesrc location=4k.mp4 ! qtdemux ! h265parse ! \
  nvv4l2decoder ! \
  nvvideoconvert interpolation-method=2 ! \
  'video/x-raw(memory:NVMM),width=1280,height=720,format=NV12' ! \
  nvv4l2h264enc bitrate=2000000 ! h264parse ! mp4mux ! \
  filesink location=720p.mp4

8.4 Batch Transcoding Script¶

#!/bin/bash
# transcode_directory.sh -- Transcode all MP4s in a directory
INPUT_DIR="$1"
OUTPUT_DIR="$2"
BITRATE="${3:-4000000}"

mkdir -p "$OUTPUT_DIR"

for f in "$INPUT_DIR"/*.mp4; do
    base=$(basename "$f" .mp4)
    echo "Transcoding: $f -> $OUTPUT_DIR/${base}_h265.mp4"

    gst-launch-1.0 -e \
      filesrc location="$f" ! qtdemux ! h264parse ! \
      nvv4l2decoder ! nvvideoconvert ! \
      'video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12' ! \
      nvv4l2h265enc bitrate="$BITRATE" preset-level=3 ! \
      h265parse ! mp4mux ! \
      filesink location="$OUTPUT_DIR/${base}_h265.mp4"
done
echo "Done."

8.5 Transcode Performance Expectations¶

Transcode Scenario	Expected Speed (15W)
1080p30 H.264 -> 1080p30 H.265	Real-time (30 fps)
4K30 H.265 -> 1080p30 H.265	Real-time (30 fps)
4K30 H.265 -> 4K30 H.265 (re-encode)	Real-time (30 fps)
2x 1080p30 parallel transcode	Both real-time
4x 1080p30 parallel transcode	~20 fps each

Parallel transcoding beyond 2 streams will be limited by NVENC throughput, since only one NVENC instance exists and it must time-slice.

9. DeepStream SDK Overview¶

9.1 What is DeepStream¶

NVIDIA DeepStream SDK is a streaming analytics framework built on GStreamer. It provides a set of GStreamer plugins purpose-built for AI video analytics: inference, tracking, multi-stream batching, analytics, metadata handling, and message brokering. DeepStream manages the entire pipeline from input to output, keeping data in NVMM buffers throughout.

9.2 Architecture¶

+-------------------------------------------------------------------+
|                      DeepStream Application                       |
|  (deepstream-app, Python app, C app, or Triton-based)             |
+-------------------------------------------------------------------+
|                      DeepStream SDK Plugins                       |
|  +-----------+  +----------+  +----------+  +---------+           |
|  |nvstreammux|->| nvinfer  |->|nvtracker |->| nvdsosd |-> Sink    |
|  +-----------+  +----------+  +----------+  +---------+           |
|  | nvurisrc  |  |nvinfersvr|  |nvdsanalyt|  |nvmsgconv|           |
|  +-----------+  +----------+  +----------+  +---------+           |
+-------------------------------------------------------------------+
|                    GStreamer Framework (1.20)                      |
+-------------------------------------------------------------------+
|                  NVIDIA Accelerated Plugins                       |
|  (nvv4l2decoder, nvv4l2h265enc, nvvideoconvert, ...)              |
+-------------------------------------------------------------------+
|                  CUDA / TensorRT / cuDLA                          |
+-------------------------------------------------------------------+
|                  V4L2 Drivers, DMA-BUF, NVMM                     |
+-------------------------------------------------------------------+
|                  T234 Hardware (NVDEC, NVENC, GPU, VIC)            |
+-------------------------------------------------------------------+

9.3 Metadata System¶

DeepStream attaches metadata to every GStreamer buffer as it flows through the pipeline. The metadata hierarchy:

NvDsBatchMeta (per batch)
  |
  +-- NvDsFrameMeta[] (one per frame in batch)
        |
        +-- source_id, frame_num, buf_pts, ntp_timestamp
        |
        +-- NvDsObjectMeta[] (one per detected object)
        |     |
        |     +-- class_id, object_id (tracker), confidence
        |     +-- rect_params (bbox), text_params (label)
        |     +-- NvDsClassifierMeta[] (secondary classifier results)
        |
        +-- NvDsUserMeta[] (custom application metadata)
        |
        +-- NvDsDisplayMeta[] (OSD drawing commands)

This metadata flows alongside the video buffers without copying pixel data. Downstream elements (OSD, message converter, application probes) read and modify metadata to implement analytics logic.

9.4 Batch Processing Model¶

nvstreammux collects one frame from each of N input streams and forms a batch:

Stream 0 frame -> +
Stream 1 frame -> +----> Batch (N frames) -> nvinfer (single TensorRT enqueue)
Stream 2 frame -> +
Stream 3 frame -> +

TensorRT processes all N frames in one batched inference call.

This is far more efficient than running N separate inference calls. The batch-size property of both nvstreammux and nvinfer must match the number of streams.

9.5 DeepStream Installation Verification¶

# Check DeepStream version
deepstream-app --version-all

# Run the reference test application
deepstream-app -c /opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app/\
source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt

# List sample configs
ls /opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app/

# List sample models
ls /opt/nvidia/deepstream/deepstream/samples/models/

10. DeepStream Pipeline Construction¶

10.1 Key Elements and Their Roles¶

Element	Role
`nvurisrcbin`	Unified source: handles file, RTSP, HTTP, CSI inputs
`nvstreammux`	Batches N streams into a single batched buffer
`nvinfer`	TensorRT inference (detection, classification, segmentation)
`nvtracker`	Multi-object tracking (IOU, NvDCF, DeepSORT)
`nvdsanalytics`	Line crossing, ROI counting, direction detection
`nvdsosd`	On-screen display: bounding boxes, text, lines
`nvmultistreamtiler`	Composites N streams into a single tiled output
`nvmsgconv`	Converts metadata to JSON / protobuf payload
`nvmsgbroker`	Publishes payloads to Kafka, MQTT, AMQP, Azure IoT
`nv3dsink`	GPU-accelerated display sink
`nvrtspoutsinkbin`	RTSP server output sink

10.2 Configuration File Format (deepstream-app)¶

The deepstream-app reference application uses an INI-style configuration file:

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5

[tiled-display]
enable=1
rows=2
columns=2
width=1920
height=1080

[source0]
enable=1
type=3                          # 3=URI (file/RTSP), 4=RTSP, 5=CSI
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h265.mp4
num-sources=1
gpu-id=0
cudadec-memtype=0

[source1]
enable=1
type=4
uri=rtsp://192.168.1.10:554/live
latency=100
num-sources=1

[sink0]
enable=1
type=2                          # 2=EGL (display), 4=RTSP, 5=overlay
sync=0
gpu-id=0

[sink1]
enable=1
type=4                          # RTSP output
rtsp-port=8554
udp-port=5400
codec=1                         # 0=H.264, 1=H.265
bitrate=4000000
enc-type=0                      # 0=hardware encoder

[osd]
enable=1
text-size=15
border-width=2
border-color=0;1;0;1

[streammux]
batch-size=2
batched-push-timeout=40000
width=1920
height=1080
enable-padding=0
live-source=1                   # 1 for RTSP/camera sources

[primary-gie]
enable=1
gpu-id=0
gie-unique-id=1
nvbuf-memory-type=0
config-file=pgie_config.txt

[tracker]
enable=1
tracker-width=640
tracker-height=384
gpu-id=0
ll-lib-file=/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so
ll-config-file=tracker_config.yml
enable-batch-process=1

[secondary-gie0]
enable=1
gpu-id=0
gie-unique-id=2
operate-on-gie-id=1
operate-on-class-ids=0
config-file=sgie_config.txt

10.3 Primary Inference Configuration (pgie_config.txt)¶

[property]
gpu-id=0
net-scale-factor=0.00392157    # 1/255 for 0-1 normalization
model-engine-file=yolov8n_b4_gpu0_fp16.engine
labelfile-path=labels.txt
batch-size=4                   # Must match streammux batch-size
network-mode=2                 # 0=FP32, 1=INT8, 2=FP16
num-detected-classes=80
interval=0                     # 0=infer every frame, N=skip N frames
gie-unique-id=1
process-mode=1                 # 1=primary (full-frame)
network-type=0                 # 0=detector, 1=classifier, 2=segmentation, 3=instance-seg
cluster-mode=2                 # 2=NMS
maintain-aspect-ratio=1
symmetric-padding=1
workspace-size=1024            # MB, for TensorRT engine building
parse-bbox-func-name=NvDsInferParseYoloV8
custom-lib-path=libnvds_infercustomparser_yolov8.so

[class-attrs-all]
pre-cluster-threshold=0.25
nms-iou-threshold=0.45
topk=300

10.4 Python Pipeline Construction¶

#!/usr/bin/env python3
"""DeepStream pipeline: 2-stream detection with tiled display and RTSP output."""

import sys
import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst, GLib
import pyds

def osd_sink_pad_buffer_probe(pad, info, u_data):
    """Probe to extract detection metadata from each frame."""
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK

    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list

    while l_frame is not None:
        try:
            frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        except StopIteration:
            break

        num_objects = frame_meta.num_obj_meta
        l_obj = frame_meta.obj_meta_list

        while l_obj is not None:
            try:
                obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)
            except StopIteration:
                break

            print(f"Source {frame_meta.source_id} | "
                  f"Frame {frame_meta.frame_num} | "
                  f"Class {obj_meta.class_id} | "
                  f"Track ID {obj_meta.object_id} | "
                  f"Confidence {obj_meta.confidence:.2f} | "
                  f"BBox ({obj_meta.rect_params.left:.0f},"
                  f"{obj_meta.rect_params.top:.0f},"
                  f"{obj_meta.rect_params.width:.0f},"
                  f"{obj_meta.rect_params.height:.0f})")

            try:
                l_obj = l_obj.next
            except StopIteration:
                break

        try:
            l_frame = l_frame.next
        except StopIteration:
            break

    return Gst.PadProbeReturn.OK

def main():
    Gst.init(None)
    pipeline = Gst.Pipeline()

    # Sources
    sources = [
        "file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h265.mp4",
        "file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.mp4",
    ]

    streammux = Gst.ElementFactory.make("nvstreammux", "mux")
    streammux.set_property("batch-size", len(sources))
    streammux.set_property("width", 1920)
    streammux.set_property("height", 1080)
    streammux.set_property("batched-push-timeout", 40000)
    pipeline.add(streammux)

    for i, uri in enumerate(sources):
        src = Gst.ElementFactory.make("nvurisrcbin", f"src-{i}")
        src.set_property("uri", uri)
        pipeline.add(src)
        padname = f"sink_{i}"
        sinkpad = streammux.request_pad_simple(padname)
        src.connect("pad-added",
                     lambda src, pad, sink=sinkpad: pad.link(sink))

    # Inference
    pgie = Gst.ElementFactory.make("nvinfer", "pgie")
    pgie.set_property("config-file-path", "pgie_config.txt")
    pipeline.add(pgie)

    # Tracker
    tracker = Gst.ElementFactory.make("nvtracker", "tracker")
    tracker.set_property("tracker-width", 640)
    tracker.set_property("tracker-height", 384)
    tracker.set_property("ll-lib-file",
        "/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so")
    tracker.set_property("ll-config-file", "tracker_config.yml")
    pipeline.add(tracker)

    # Tiler
    tiler = Gst.ElementFactory.make("nvmultistreamtiler", "tiler")
    tiler.set_property("rows", 1)
    tiler.set_property("columns", 2)
    tiler.set_property("width", 1920)
    tiler.set_property("height", 540)
    pipeline.add(tiler)

    # OSD
    nvvidconv = Gst.ElementFactory.make("nvvideoconvert", "conv")
    osd = Gst.ElementFactory.make("nvdsosd", "osd")
    pipeline.add(nvvidconv)
    pipeline.add(osd)

    # Sink (display)
    sink = Gst.ElementFactory.make("nv3dsink", "sink")
    sink.set_property("sync", False)
    pipeline.add(sink)

    # Link: mux -> pgie -> tracker -> tiler -> conv -> osd -> sink
    streammux.link(pgie)
    pgie.link(tracker)
    tracker.link(tiler)
    tiler.link(nvvidconv)
    nvvidconv.link(osd)
    osd.link(sink)

    # Add probe on OSD sink pad
    osd_sinkpad = osd.get_static_pad("sink")
    osd_sinkpad.add_probe(Gst.PadProbeType.BUFFER, osd_sink_pad_buffer_probe, None)

    # Run
    pipeline.set_state(Gst.State.PLAYING)
    loop = GLib.MainLoop()

    bus = pipeline.get_bus()
    bus.add_signal_watch()
    bus.connect("message", lambda bus, msg: (
        loop.quit() if msg.type in (Gst.MessageType.EOS, Gst.MessageType.ERROR)
        else None
    ))

    try:
        loop.run()
    except KeyboardInterrupt:
        pass

    pipeline.set_state(Gst.State.NULL)

if __name__ == "__main__":
    main()

10.5 C Pipeline Construction¶

#include <gst/gst.h>
#include <glib.h>
#include "gstnvdsmeta.h"
#include "nvdsmeta.h"

static GstPadProbeReturn osd_sink_pad_probe(GstPad *pad, GstPadProbeInfo *info,
                                             gpointer user_data) {
    GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER(info);
    NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta(buf);
    NvDsMetaList *l_frame = batch_meta->frame_meta_list;

    while (l_frame != NULL) {
        NvDsFrameMeta *frame_meta = (NvDsFrameMeta *)(l_frame->data);
        NvDsMetaList *l_obj = frame_meta->obj_meta_list;

        while (l_obj != NULL) {
            NvDsObjectMeta *obj = (NvDsObjectMeta *)(l_obj->data);
            g_print("Src %d Frame %d Class %d Track %lu Conf %.2f "
                    "BBox(%.0f,%.0f,%.0f,%.0f)\n",
                    frame_meta->source_id, frame_meta->frame_num,
                    obj->class_id, obj->object_id, obj->confidence,
                    obj->rect_params.left, obj->rect_params.top,
                    obj->rect_params.width, obj->rect_params.height);
            l_obj = l_obj->next;
        }
        l_frame = l_frame->next;
    }
    return GST_PAD_PROBE_OK;
}

int main(int argc, char *argv[]) {
    gst_init(&argc, &argv);

    GstElement *pipeline  = gst_pipeline_new("ds-pipeline");
    GstElement *source    = gst_element_factory_make("nvurisrcbin",  "src-0");
    GstElement *mux       = gst_element_factory_make("nvstreammux",  "mux");
    GstElement *pgie      = gst_element_factory_make("nvinfer",      "pgie");
    GstElement *tracker   = gst_element_factory_make("nvtracker",    "tracker");
    GstElement *tiler     = gst_element_factory_make("nvmultistreamtiler", "tiler");
    GstElement *conv      = gst_element_factory_make("nvvideoconvert","conv");
    GstElement *osd       = gst_element_factory_make("nvdsosd",      "osd");
    GstElement *sink      = gst_element_factory_make("nv3dsink",     "sink");

    g_object_set(G_OBJECT(source), "uri",
        "file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h265.mp4",
        NULL);
    g_object_set(G_OBJECT(mux), "batch-size", 1, "width", 1920,
        "height", 1080, "batched-push-timeout", 40000, NULL);
    g_object_set(G_OBJECT(pgie), "config-file-path", "pgie_config.txt", NULL);
    g_object_set(G_OBJECT(tracker),
        "ll-lib-file",
        "/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so",
        "ll-config-file", "tracker_config.yml",
        "tracker-width", 640, "tracker-height", 384, NULL);
    g_object_set(G_OBJECT(tiler), "rows", 1, "columns", 1,
        "width", 1920, "height", 1080, NULL);
    g_object_set(G_OBJECT(sink), "sync", FALSE, NULL);

    gst_bin_add_many(GST_BIN(pipeline), source, mux, pgie, tracker,
                     tiler, conv, osd, sink, NULL);

    GstPad *srcpad = gst_element_get_static_pad(source, "src");
    GstPad *sinkpad = gst_element_request_pad_simple(mux, "sink_0");
    gst_pad_link(srcpad, sinkpad);
    gst_object_unref(srcpad);
    gst_object_unref(sinkpad);

    gst_element_link_many(mux, pgie, tracker, tiler, conv, osd, sink, NULL);

    /* Add probe */
    GstPad *osd_sink_pad = gst_element_get_static_pad(osd, "sink");
    gst_pad_add_probe(osd_sink_pad, GST_PAD_PROBE_TYPE_BUFFER,
                      osd_sink_pad_probe, NULL, NULL);
    gst_object_unref(osd_sink_pad);

    gst_element_set_state(pipeline, GST_STATE_PLAYING);
    GstBus *bus = gst_element_get_bus(pipeline);
    gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
        GST_MESSAGE_ERROR | GST_MESSAGE_EOS);

    gst_element_set_state(pipeline, GST_STATE_NULL);
    gst_object_unref(bus);
    gst_object_unref(pipeline);
    return 0;
}

Compile:

gcc -o ds_pipeline ds_pipeline.c \
  $(pkg-config --cflags --libs gstreamer-1.0) \
  -I/opt/nvidia/deepstream/deepstream/sources/includes \
  -L/opt/nvidia/deepstream/deepstream/lib -lnvdsgst_meta -lnvds_meta

11. DeepStream with Custom Models¶

11.1 Using TensorRT Engines¶

DeepStream's nvinfer plugin accepts pre-built TensorRT engine files or builds them from ONNX/Caffe/UFF models on first run. For production, always ship pre-built engines.

Engine building workflow:

# Build engine from ONNX (do this offline, not at runtime)
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolov8n.onnx \
  --saveEngine=yolov8n_b4_gpu0_fp16.engine \
  --fp16 \
  --minShapes=images:1x3x640x640 \
  --optShapes=images:4x3x640x640 \
  --maxShapes=images:8x3x640x640 \
  --workspace=1024

Configuration for pre-built engine:

[property]
model-engine-file=yolov8n_b4_gpu0_fp16.engine
# Do NOT set onnx-file, caffe-model, or uff-file when using pre-built engine
batch-size=4
network-mode=2                  # Must match engine precision

11.2 Custom Output Parsers¶

When using non-standard model architectures (YOLO, custom detectors), you must provide a custom parser library that converts raw tensor output to DeepStream's NvDsInferObjectDetectionInfo format.

Parser header (nvds_custom_parser.h):

#include "nvdsinfer_custom_impl.h"

#ifdef __cplusplus
extern "C" {
#endif

bool NvDsInferParseYoloV8(
    std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
    NvDsInferNetworkInfo const &networkInfo,
    NvDsInferParseDetectionParams const &detectionParams,
    std::vector<NvDsInferObjectDetectionInfo> &objectList);

#ifdef __cplusplus
}
#endif

Parser implementation (nvds_custom_parser_yolov8.cpp):

#include "nvds_custom_parser.h"
#include <cstring>
#include <algorithm>
#include <cmath>

static bool NvDsInferParseYoloV8(
    std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
    NvDsInferNetworkInfo const &networkInfo,
    NvDsInferParseDetectionParams const &detectionParams,
    std::vector<NvDsInferObjectDetectionInfo> &objectList) {

    // YOLOv8 output: [batch, 84, 8400] -> transposed to [batch, 8400, 84]
    // 84 = 4 (bbox: cx, cy, w, h) + 80 (class scores)
    const int num_classes = 80;
    const int num_boxes = 8400;

    const float *output = (const float *)outputLayersInfo[0].buffer;
    float conf_threshold = detectionParams.perClassPreclusterThreshold[0];

    for (int i = 0; i < num_boxes; i++) {
        // Find best class
        int best_class = 0;
        float best_score = 0.0f;
        for (int c = 0; c < num_classes; c++) {
            float score = output[i * (4 + num_classes) + 4 + c];
            if (score > best_score) {
                best_score = score;
                best_class = c;
            }
        }

        if (best_score < conf_threshold) continue;

        // Extract bbox (cx, cy, w, h) normalized to network input size
        float cx = output[i * (4 + num_classes) + 0];
        float cy = output[i * (4 + num_classes) + 1];
        float w  = output[i * (4 + num_classes) + 2];
        float h  = output[i * (4 + num_classes) + 3];

        NvDsInferObjectDetectionInfo obj;
        obj.classId = best_class;
        obj.detectionConfidence = best_score;
        obj.left   = (cx - w / 2.0f);
        obj.top    = (cy - h / 2.0f);
        obj.width  = w;
        obj.height = h;
        objectList.push_back(obj);
    }

    return true;
}

extern "C"
bool NvDsInferParseCustomYoloV8(
    std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
    NvDsInferNetworkInfo const &networkInfo,
    NvDsInferParseDetectionParams const &detectionParams,
    std::vector<NvDsInferObjectDetectionInfo> &objectList) {
    return NvDsInferParseYoloV8(outputLayersInfo, networkInfo,
                                 detectionParams, objectList);
}

CHECK_CUSTOM_PARSE_FUNC_PROTOTYPE(NvDsInferParseCustomYoloV8);

Build the parser:

g++ -shared -o libnvds_infercustomparser_yolov8.so \
  nvds_custom_parser_yolov8.cpp \
  -I/opt/nvidia/deepstream/deepstream/sources/includes \
  -I/usr/include/aarch64-linux-gnu \
  -fPIC -std=c++14

11.3 Output Tensor Metadata¶

For models where you need access to raw output tensors (segmentation masks, embeddings, pose keypoints), enable output tensor metadata:

[property]
output-tensor-meta=1           # Attach raw tensors to metadata

Access in Python probe:

def tensor_probe(pad, info, user_data):
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(info.get_buffer()))
    l_frame = batch_meta.frame_meta_list

    while l_frame:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        l_user = frame_meta.frame_user_meta_list

        while l_user:
            user_meta = pyds.NvDsUserMeta.cast(l_user.data)
            if user_meta.base_meta.meta_type == \
                    pyds.NvDsMetaType.NVDSINFER_TENSOR_OUTPUT_META:
                tensor_meta = pyds.NvDsInferTensorMeta.cast(
                    user_meta.user_meta_data)

                # Access output layers
                for i in range(tensor_meta.num_output_layers):
                    layer = pyds.get_nvds_LayerInfo(tensor_meta, i)
                    print(f"Layer {i}: name={layer.layerName}, "
                          f"dims={layer.inferDims.d[:layer.inferDims.numDims]}")

                    # Get pointer to tensor data
                    ptr = ctypes.cast(
                        pyds.get_ptr(layer.buffer),
                        ctypes.POINTER(ctypes.c_float))
                    # Copy to numpy for processing
                    import numpy as np
                    data = np.ctypeslib.as_array(
                        ptr, shape=layer.inferDims.d[:layer.inferDims.numDims])

            try:
                l_user = l_user.next
            except StopIteration:
                break
        try:
            l_frame = l_frame.next
        except StopIteration:
            break

    return Gst.PadProbeReturn.OK

11.4 Secondary Classifiers¶

Run a secondary model on cropped detections from the primary detector:

# sgie_vehicle_type_config.txt
[property]
gpu-id=0
net-scale-factor=1.0
model-engine-file=vehicle_type_classifier_b8_gpu0_int8.engine
labelfile-path=vehicle_type_labels.txt
batch-size=8
network-mode=1              # INT8
num-detected-classes=6
gie-unique-id=2
process-mode=2              # Secondary
operate-on-gie-id=1         # Operate on primary detector output
operate-on-class-ids=0      # Only classify class 0 (vehicle) detections
network-type=1              # Classifier
classifier-async-mode=1
input-object-min-width=64
input-object-min-height=64

11.5 Segmentation Models¶

[property]
network-type=2              # Segmentation
segmentation-threshold=0.5
output-tensor-meta=1        # Required for accessing segmentation mask
num-detected-classes=21     # Number of segmentation classes

12. DeepStream Multi-Stream Processing¶

12.1 Stream Muxing Architecture¶

nvstreammux is the central batching element. It collects one frame per source, assembles them into a NvBufSurface batch, and pushes them downstream as a single GStreamer buffer with NvDsBatchMeta attached.

  Source 0 (1080p) ---+
                       |
  Source 1 (720p)  ---+---> nvstreammux ---> Single batched buffer
                       |    (scales all to    containing N frames
  Source 2 (4K)    ---+     mux width/height)
                       |
  Source 3 (1080p) ---+

  nvstreammux properties:
    batch-size=4         # Maximum streams to batch
    width=1920           # All frames scaled to this width
    height=1080          # All frames scaled to this height
    batched-push-timeout=40000  # Timeout in microseconds
    live-source=1        # 1 for live RTSP/camera sources

12.2 Resource Allocation Per Stream¶

Each stream consumes the following resources:

Resource	Per-Stream Allocation	8-Stream Total
NVDEC bandwidth	1920108030 = 62M pixels/s	497M pixels/s
Decode buffers	3 NV12 buffers * 3 MB each	72 MB
Mux buffer	1 batch slot (1920*1080 NV12)	24 MB
TensorRT workspace	Shared across batch	200-500 MB
Tracker state	~2 MB per stream (150 targets)	16 MB
Total per stream	~12-15 MB (excluding model weights)	96-120 MB

On 8 GB total system RAM, plan for: - Model weights: 50-200 MB (depending on model) - Pipeline buffers: 120-200 MB (8 streams) - OS and system: 1-2 GB - CUDA context: 200-400 MB - Remaining for application: ~5-6 GB

12.3 Performance Scaling¶

Benchmarks on Orin Nano 8GB at 15W (JetPack 6.0, DeepStream 7.0):

Model	Resolution	Streams	FPS/stream	GPU %	NVDEC %	Notes
PeopleNet (ResNet18)	1080p	4	30	72	60	Comfortable
PeopleNet (ResNet18)	1080p	8	15	95	85	GPU-limited
YOLOv8n INT8	1080p	4	30	55	60	Comfortable
YOLOv8n INT8	1080p	6	25	80	75	Near limit
YOLOv8s INT8	1080p	4	20	90	45	GPU-limited
TrafficCamNet	720p	8	30	65	50	Comfortable
TrafficCamNet	720p	12	22	88	75	Near limit
SSD MobileNetV2	720p	12	30	58	75	Comfortable

Rules of thumb: - 4x 1080p30 with a lightweight INT8 detector is the comfortable operating point. - Beyond 4 streams at 1080p, drop to 720p or use nvinfer interval to skip frames. - The bottleneck is usually the GPU (inference), not NVDEC (decode). - Tracker compute scales with number of tracked objects, not just streams.

12.4 Frame Skipping for Higher Stream Counts¶

# In pgie_config.txt, skip every other frame for inference
# Tracker interpolates positions on skipped frames
[property]
interval=1                    # 0=every frame, 1=skip 1 (infer every 2nd frame)

With interval=1 and a good tracker, perceived detection quality drops minimally while effectively doubling inference throughput.

12.5 Multi-Source Configuration¶

# deepstream-app config for 4 RTSP cameras + 2 file sources
[source0]
enable=1
type=4
uri=rtsp://192.168.1.10:554/stream1
latency=200
num-sources=1
gpu-id=0
cudadec-memtype=0

[source1]
enable=1
type=4
uri=rtsp://192.168.1.11:554/stream1
latency=200
num-sources=1

[source2]
enable=1
type=4
uri=rtsp://192.168.1.12:554/stream1
latency=200
num-sources=1

[source3]
enable=1
type=4
uri=rtsp://192.168.1.13:554/stream1
latency=200
num-sources=1

[source4]
enable=1
type=3
uri=file:///data/videos/test1.mp4
num-sources=1

[source5]
enable=1
type=3
uri=file:///data/videos/test2.mp4
num-sources=1

[streammux]
batch-size=6
width=1920
height=1080
batched-push-timeout=40000
live-source=1

12.6 Dynamic Stream Add/Remove¶

DeepStream supports adding and removing streams at runtime without restarting the pipeline:

# Add a new source at runtime
def add_source(pipeline, streammux, uri, source_id):
    src = Gst.ElementFactory.make("nvurisrcbin", f"src-{source_id}")
    src.set_property("uri", uri)
    pipeline.add(src)

    sinkpad = streammux.request_pad_simple(f"sink_{source_id}")
    src.connect("pad-added",
                 lambda src, pad, sink=sinkpad: pad.link(sink))
    src.sync_state_with_parent()

# Remove a source at runtime
def remove_source(pipeline, streammux, source_id):
    src = pipeline.get_by_name(f"src-{source_id}")
    sinkpad = streammux.get_static_pad(f"sink_{source_id}")

    src.set_state(Gst.State.NULL)
    pipeline.remove(src)
    streammux.release_request_pad(sinkpad)

13. DeepStream Analytics and Message Brokering¶

13.1 nvdsanalytics Plugin¶

The nvdsanalytics plugin provides built-in analytics primitives without custom code:

Line crossing detection: Count objects crossing a defined line.
ROI-based analytics: Count objects within a defined region of interest.
Direction detection: Determine movement direction of tracked objects.
Overcrowding detection: Alert when ROI object count exceeds threshold.

13.2 Analytics Configuration File¶

# nvdsanalytics_config.txt

[property]
enable=1
config-width=1920
config-height=1080

# Line crossing: count vehicles crossing the intersection line
[line-crossing-stream-0]
enable=1
line-crossing-0=600;500;1400;500;    # x1;y1;x2;y2 -- horizontal line
lc-label-0=Intersection_LC
class-id=0                            # Apply to class 0 (vehicle)
extended=0
mode=balanced                         # strict, balanced, loose

# Second line crossing on same stream
line-crossing-1=960;200;960;900;     # Vertical line
lc-label-1=Center_LC
class-id=-1                           # -1 = all classes

# ROI-based counting
[roi-filtering-stream-0]
enable=1
roi-0=200;400;200;800;1000;800;1000;400;   # x1;y1;x2;y2;x3;y3;x4;y4 (polygon)
roi-label-0=Parking_Zone_A
inverse-roi=0                               # 0=count inside, 1=count outside
class-id=0

# Second ROI
roi-1=1100;400;1100;800;1800;800;1800;400;
roi-label-1=Parking_Zone_B
class-id=0

# Overcrowding alert
[overcrowding-stream-0]
enable=1
roi-0=200;400;200;800;1000;800;1000;400;
overcrowding-label-0=Zone_A_Overcrowd
object-threshold=10                          # Alert when > 10 objects
class-id=0

# Direction detection
[direction-detection-stream-0]
enable=1
direction-0=400;500;900;500;               # Reference line
direction-label-0=Northbound
class-id=0

13.3 Accessing Analytics Metadata¶

def analytics_probe(pad, info, user_data):
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(info.get_buffer()))
    l_frame = batch_meta.frame_meta_list

    while l_frame:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        l_user = frame_meta.frame_user_meta_list

        while l_user:
            user_meta = pyds.NvDsUserMeta.cast(l_user.data)
            if user_meta.base_meta.meta_type == \
                    pyds.NvDsMetaType.NVDS_ANALYTICS_FRAME_META:
                analytics_meta = pyds.NvDsAnalyticsFrameMeta.cast(
                    user_meta.user_meta_data)

                # Line crossing cumulative counts
                if analytics_meta.objLCCumCnt:
                    print(f"Line crossing counts: "
                          f"{pyds.nvds_analytics_get_lc_cum_cnt(analytics_meta)}")

                # Objects currently in each ROI
                if analytics_meta.objInROIcnt:
                    print(f"Objects in ROI: "
                          f"{pyds.nvds_analytics_get_roi_cnt(analytics_meta)}")

                # Overcrowding status
                if analytics_meta.ocStatus:
                    for label, status in analytics_meta.ocStatus.items():
                        if status:
                            print(f"OVERCROWDING ALERT: {label}")

            try:
                l_user = l_user.next
            except StopIteration:
                break

        # Per-object analytics: direction, LC status
        l_obj = frame_meta.obj_meta_list
        while l_obj:
            obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)
            l_user_obj = obj_meta.obj_user_meta_list

            while l_user_obj:
                user_meta = pyds.NvDsUserMeta.cast(l_user_obj.data)
                if user_meta.base_meta.meta_type == \
                        pyds.NvDsMetaType.NVDS_ANALYTICS_OBJ_META:
                    obj_analytics = pyds.NvDsAnalyticsObjInfo.cast(
                        user_meta.user_meta_data)

                    if obj_analytics.lcStatus:
                        print(f"Object {obj_meta.object_id} crossed line: "
                              f"{obj_analytics.lcStatus}")
                    if obj_analytics.dirStatus:
                        print(f"Object {obj_meta.object_id} direction: "
                              f"{obj_analytics.dirStatus}")
                    if obj_analytics.roiStatus:
                        print(f"Object {obj_meta.object_id} in ROI: "
                              f"{obj_analytics.roiStatus}")

                try:
                    l_user_obj = l_user_obj.next
                except StopIteration:
                    break

            try:
                l_obj = l_obj.next
            except StopIteration:
                break

        try:
            l_frame = l_frame.next
        except StopIteration:
            break

    return Gst.PadProbeReturn.OK

13.4 Message Brokering -- Sending Metadata to Cloud¶

DeepStream can publish detection metadata to external message brokers using nvmsgconv (metadata to payload) and nvmsgbroker (payload to broker).

Supported protocols:

Protocol	Adapter Library	Use Case
Kafka	`libnvds_kafka_proto.so`	High-throughput IoT
MQTT	`libnvds_mqtt_proto.so`	Lightweight IoT
AMQP	`libnvds_amqp_proto.so`	Enterprise messaging
Azure	`libnvds_azure_proto.so`	Azure IoT Hub
Redis	`libnvds_redis_proto.so`	In-memory data store

13.5 Kafka Integration¶

Pipeline configuration:

# deepstream-app config
[message-consumer0]
enable=0

[message-converter]
enable=1
msg-conv-config=msg_conv_config.txt
msg-conv-payload-type=0           # 0=DeepStream, 1=custom, 256=minimal

[message-broker]
enable=1
msg-broker-proto-lib=/opt/nvidia/deepstream/deepstream/lib/libnvds_kafka_proto.so
msg-broker-conn-str=kafka-broker.local;9092
topic=deepstream-detections
msg-broker-config=kafka_config.txt

Kafka adapter configuration (kafka_config.txt):

[message-broker]
enable=1
broker-proto-lib=/opt/nvidia/deepstream/deepstream/lib/libnvds_kafka_proto.so
broker-conn-str=kafka-broker.local;9092
topic=deepstream-detections

[message-broker-config]
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.username=api-key
sasl.password=api-secret

Python pipeline approach:

# Add message converter and broker to pipeline
msgconv = Gst.ElementFactory.make("nvmsgconv", "msgconv")
msgconv.set_property("config", "msg_conv_config.txt")
msgconv.set_property("payload-type", 0)    # DeepStream schema

msgbroker = Gst.ElementFactory.make("nvmsgbroker", "msgbroker")
msgbroker.set_property("proto-lib",
    "/opt/nvidia/deepstream/deepstream/lib/libnvds_kafka_proto.so")
msgbroker.set_property("conn-str", "kafka-broker.local;9092;deepstream-topic")
msgbroker.set_property("topic", "detections")

# Use tee to branch: one path to display, one to message broker
tee = Gst.ElementFactory.make("tee", "tee")
queue_display = Gst.ElementFactory.make("queue", "q_display")
queue_msg = Gst.ElementFactory.make("queue", "q_msg")

pipeline.add(tee)
pipeline.add(queue_display)
pipeline.add(queue_msg)
pipeline.add(msgconv)
pipeline.add(msgbroker)

# Link: ... -> osd -> tee
#                       |-> queue_display -> sink
#                       |-> queue_msg -> msgconv -> msgbroker
osd.link(tee)
tee.link(queue_display)
queue_display.link(sink)
tee.link(queue_msg)
queue_msg.link(msgconv)
msgconv.link(msgbroker)

13.6 MQTT Integration¶

broker = Gst.ElementFactory.make("nvmsgbroker", "broker")
broker.set_property("proto-lib",
    "/opt/nvidia/deepstream/deepstream/lib/libnvds_mqtt_proto.so")
broker.set_property("conn-str", "mqtt-broker.local;1883")
broker.set_property("topic", "edge/detections")

13.7 Default Message Schema¶

The DeepStream default schema (payload-type=0) produces JSON like:

{
  "@timestamp": "2026-03-05T10:30:00.000Z",
  "sensorId": "sensor-0",
  "objects": [
    {
      "id": "42",
      "bbox": {"topleftx": 100, "toplefty": 200, "bottomrightx": 300, "bottomrighty": 400},
      "classId": 0,
      "label": "person",
      "confidence": 0.92,
      "trackingId": 42,
      "direction": "north",
      "lcStatus": ["Line_A"],
      "roiStatus": ["Zone_1"]
    }
  ],
  "analyticsModule": {
    "lineCrossing": {"Line_A": {"cumCount": {"in": 15, "out": 12}}},
    "roiCount": {"Zone_1": 3}
  }
}

13.8 Custom Payload Generation¶

For application-specific schemas, use payload-type=256 (minimal) or implement a custom payload generator:

def custom_payload_probe(pad, info, user_data):
    """Generate custom JSON and publish via external client."""
    import json
    import paho.mqtt.client as mqtt

    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(info.get_buffer()))
    l_frame = batch_meta.frame_meta_list

    events = []
    while l_frame:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        l_obj = frame_meta.obj_meta_list

        while l_obj:
            obj = pyds.NvDsObjectMeta.cast(l_obj.data)
            events.append({
                "camera_id": frame_meta.source_id,
                "frame": frame_meta.frame_num,
                "timestamp": frame_meta.buf_pts / 1e9,
                "class": obj.class_id,
                "track_id": obj.object_id,
                "confidence": round(obj.confidence, 3),
                "bbox": [
                    round(obj.rect_params.left),
                    round(obj.rect_params.top),
                    round(obj.rect_params.width),
                    round(obj.rect_params.height)
                ]
            })
            try:
                l_obj = l_obj.next
            except StopIteration:
                break
        try:
            l_frame = l_frame.next
        except StopIteration:
            break

    if events:
        payload = json.dumps({"detections": events})
        # Publish via MQTT (or Kafka, HTTP, etc.)
        user_data["mqtt_client"].publish("edge/custom", payload)

    return Gst.PadProbeReturn.OK

14. Zero-Copy Video Pipeline Architecture¶

14.1 The NvBufSurface Abstraction¶

NvBufSurface is the central data structure that carries video frame data through the entire DeepStream/GStreamer pipeline on Jetson. It encapsulates one or more video frames as a batch, with each frame stored in NVMM (DMA-capable, physically contiguous memory).

typedef struct {
    uint32_t batchSize;              // Number of frames in batch
    uint32_t numFilled;              // Number of frames actually filled
    NvBufSurfaceMemType memType;     // NVBUF_MEM_DEFAULT, _CUDA_DEVICE, etc.
    uint64_t gpuId;
    NvBufSurfaceParams *surfaceList; // Array of per-frame parameters
} NvBufSurface;

typedef struct {
    uint32_t width;
    uint32_t height;
    uint32_t pitch;
    NvBufSurfaceColorFormat colorFormat;  // NVBUF_COLOR_FORMAT_NV12, _RGBA, etc.
    NvBufSurfacePlaneParams planeParams;
    int bufferDesc;                       // DMA-BUF fd
    void *dataPtr;                        // Mapped pointer (if mapped)
    uint32_t dataSize;
    void *mappedAddr;                     // CPU-mapped address (after Map)
} NvBufSurfaceParams;

14.2 Zero-Copy Data Flow¶

The key principle: video pixel data never crosses between CPU and GPU/hardware address spaces. Instead, NvBufSurface handles (pointers to DMA-BUF file descriptors) are passed between pipeline elements.

  NVDEC (hardware)
     |
     | Outputs NvBufSurface with NV12 data in NVMM
     | (DMA-BUF fd in surfaceList[i].bufferDesc)
     |
     v
  nvvideoconvert (VIC hardware)
     |
     | Reads NV12 via DMA-BUF, writes RGBA via DMA-BUF
     | (new NvBufSurface, still in NVMM, no CPU copy)
     |
     v
  nvinfer (GPU - TensorRT)
     |
     | Maps NvBufSurface as CUDA resource via EGLImage/DMA-BUF import
     | cudaGraphicsEGLRegisterImage() or cuGraphicsResourceGetMappedEGLFrame()
     | Runs inference kernel, writes metadata (not pixel data)
     |
     v
  nvdsosd (GPU)
     |
     | Reads NvBufSurface as CUDA surface, draws bboxes/text
     | (modifies pixels in-place in NVMM, no copy)
     |
     v
  nvv4l2h265enc (NVENC hardware)
     |
     | Reads NV12/RGBA from NVMM via DMA-BUF
     | Outputs compressed H.265 bitstream to system memory
     |
     v
  rtph265pay + udpsink (CPU: packetization and network I/O only)

At no point does CPU memcpy() touch pixel data. The only CPU work is: - GStreamer buffer metadata management (lightweight) - NAL unit parsing (h264parse/h265parse -- header bytes only) - Muxing/demuxing (container format, not pixel data) - Network I/O (sendto/recvfrom for RTP/RTSP)

14.3 When Zero-Copy Breaks¶

Zero-copy breaks when a non-NVMM-aware element is inserted into the pipeline:

  WRONG (breaks zero-copy):
  nvv4l2decoder ! videoconvert ! nvv4l2h265enc
                  ^^^^^^^^^^^^
                  Standard GStreamer element.
                  Forces: NVMM -> CPU copy -> CPU conversion -> CPU copy -> NVMM
                  Result: 10x slower, massive CPU usage

  RIGHT (maintains zero-copy):
  nvv4l2decoder ! nvvideoconvert ! nvv4l2h265enc
                  ^^^^^^^^^^^^^^^
                  NVIDIA element. Conversion happens in VIC/GPU via DMA-BUF.
                  Result: Zero CPU copies

Common zero-copy-breaking mistakes: - Using videoconvert instead of nvvideoconvert - Using videoscale instead of nvvideoconvert (which also scales) - Using jpegenc instead of nvjpegenc - Using avdec_h264 instead of nvv4l2decoder - Inserting appsink + appsrc to pass frames through Python/NumPy - Using capsfilter with non-NVMM caps between NVMM elements

14.4 CPU Access When Needed¶

When the CPU must access pixel data (e.g., custom OpenCV processing), use NvBufSurfaceMap / NvBufSurfaceUnMap:

#include "nvbufsurface.h"

// In a pad probe or custom element
NvBufSurface *surface = /* from GstBuffer map */;

// Map to CPU address space
NvBufSurfaceMap(surface, 0, -1, NVBUF_MAP_READ_WRITE);
NvBufSurfaceSyncForCpu(surface, 0, -1);  // Cache invalidate

// Access pixels
uint8_t *y_plane  = (uint8_t *)surface->surfaceList[0].mappedAddr.addr[0];
uint8_t *uv_plane = (uint8_t *)surface->surfaceList[0].mappedAddr.addr[1];
int pitch = surface->surfaceList[0].planeParams.pitch[0];

// Modify pixels...

NvBufSurfaceSyncForDevice(surface, 0, -1);  // Cache flush
NvBufSurfaceUnMap(surface, 0, -1);

In Python with OpenCV:

def frame_access_probe(pad, info, user_data):
    import numpy as np
    import cv2

    gst_buffer = info.get_buffer()
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list

    while l_frame:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)

        # Get NvBufSurface
        n_frame = pyds.get_nvds_buf_surface(hash(gst_buffer),
                                             frame_meta.batch_id)
        # n_frame is now a numpy array (RGBA format)
        # This DOES involve a CPU copy -- use sparingly

        # OpenCV processing
        gray = cv2.cvtColor(n_frame, cv2.COLOR_RGBA2GRAY)
        edges = cv2.Canny(gray, 100, 200)

        try:
            l_frame = l_frame.next
        except StopIteration:
            break

    return Gst.PadProbeReturn.OK

The DMA-BUF file descriptor from NvBufSurface can be shared with other subsystems:

// Import DMA-BUF into CUDA
NvBufSurface *surface = /* from pipeline */;
int dma_fd = surface->surfaceList[0].bufferDesc;

// Create EGLImage from DMA-BUF
EGLImageKHR egl_image = NvEGLImageFromFd(NULL, dma_fd);

// Register with CUDA
cudaGraphicsResource_t cuda_resource;
cudaGraphicsEGLRegisterImage(&cuda_resource, egl_image,
                              cudaGraphicsRegisterFlagsReadOnly);

// Map to get CUDA array
cudaGraphicsMapResources(1, &cuda_resource, 0);
cudaArray_t cuda_array;
cudaGraphicsSubResourceGetMappedArray(&cuda_array, cuda_resource, 0, 0);

// Use cuda_array in CUDA kernels...

cudaGraphicsUnmapResources(1, &cuda_resource, 0);
NvDestroyEGLImage(NULL, egl_image);

14.6 Buffer Pool Sizing¶

Each NVMM buffer consumes memory proportional to resolution and format:

Resolution	Format	Bytes/frame	4 buffers	8 buffers
720p	NV12	1.38 MB	5.5 MB	11.1 MB
1080p	NV12	3.11 MB	12.4 MB	24.9 MB
1080p	RGBA	8.29 MB	33.2 MB	66.4 MB
4K	NV12	12.4 MB	49.8 MB	99.5 MB

Buffer pool sizes are negotiated automatically by GStreamer, but can be tuned:

# Control decoder output buffer count
gst-launch-1.0 ... ! nvv4l2decoder num-extra-surfaces=2 ! ...
# Default is typically 4-8 buffers; reducing saves memory but risks starvation

15. Performance Profiling¶

15.1 tegrastats -- System-Level Monitoring¶

tegrastats is the primary tool for monitoring hardware engine utilization:

sudo tegrastats --interval 500

Sample output:

RAM 3200/7620MB (lfb 1234x4MB) SWAP 0/3810MB (cached 0MB)
CPU [40%@1510,35%@1510,22%@1510,18%@1510,10%@1510,8%@1510]
EMC_FREQ 3% GR3D_FREQ 72% NVDEC 55% NVENC 40% NVJPG 0%
VIC_FREQ 0% APE 150 MTS fg 0% bg 0%
TEMP CPU@45C GPU@43C SOC@44.5C tj@45C
VDD_IN 8500mW VDD_CPU_GPU_CV 3200mW VDD_SOC 2100mW

Key fields: - GR3D_FREQ: GPU utilization (inference bottleneck indicator) - NVDEC: Decoder utilization - NVENC: Encoder utilization - RAM: Total memory consumption - TEMP tj: Junction temperature (throttle at ~85C) - VDD_IN: Total board power

15.2 DeepStream Performance Measurement¶

Enable built-in FPS reporting:

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5

Output:

**PERF:  FPS 0 (Avg)    FPS 1 (Avg)    FPS 2 (Avg)    FPS 3 (Avg)
**PERF:  30.02 (30.01)  30.01 (30.00)  29.98 (29.99)  30.00 (30.00)

15.3 GStreamer Latency Tracing¶

# Enable latency tracer (measures per-element latency)
GST_TRACERS="latency(flags=element)" \
GST_DEBUG="GST_TRACER:7" \
gst-launch-1.0 filesrc location=test.mp4 ! qtdemux ! h265parse ! \
  nvv4l2decoder ! nvvideoconvert ! nv3dsink 2>&1 | grep -i latency

# Output shows per-element processing time in nanoseconds
# element-latency, element=nvv4l2decoder0, time=3456789

15.4 Pipeline DOT Graphs¶

Visualize the negotiated pipeline including caps and buffer pools:

export GST_DEBUG_DUMP_DOT_DIR=/tmp/gst-dots
mkdir -p /tmp/gst-dots

# Run pipeline (DOT files generated at state changes)
gst-launch-1.0 ... your pipeline ...

# Convert to PNG
sudo apt-get install graphviz
dot -Tpng /tmp/gst-dots/0.00.00.*PLAYING*.dot -o pipeline.png

# The DOT graph shows:
# - Element names and types
# - Negotiated caps on each link
# - Buffer pool configurations
# - Queue levels

15.5 NVIDIA Nsight Systems Profiling¶

For deep analysis of CUDA kernels, NVDEC/NVENC timing, and memory transfers:

# Profile a DeepStream application
nsys profile \
  --trace=cuda,nvtx,nvmedia,osrt \
  --output=ds_profile \
  --duration=30 \
  deepstream-app -c config.txt

# Profile a GStreamer pipeline
nsys profile \
  --trace=cuda,nvtx,osrt \
  --output=gst_profile \
  gst-launch-1.0 ... your pipeline ...

# View results
nsys-ui ds_profile.nsys-rep

In the Nsight Systems GUI, look for: - CUDA kernel timeline: Identify inference kernel duration and gaps. - NVDEC row: Shows decode operations and idle time. - NVENC row: Shows encode operations. - Memory transfers: Any unexpected cudaMemcpy indicates broken zero-copy. - CPU thread activity: High CPU usage suggests software processing in the path.

15.6 GST_DEBUG for Plugin-Level Debugging¶

# Debug levels: 0=none, 1=ERROR, 2=WARNING, 3=FIXME, 4=INFO, 5=DEBUG, 6=LOG, 7=TRACE

# Debug specific plugins
GST_DEBUG="nvv4l2decoder:5,nvv4l2h265enc:4,nvinfer:4" \
  gst-launch-1.0 ... pipeline ...

# Debug caps negotiation
GST_DEBUG="GST_CAPS:5" gst-launch-1.0 ... pipeline ...

# Debug buffer flow
GST_DEBUG="GST_BUFFER:5" gst-launch-1.0 ... pipeline ...

# Write debug output to file
GST_DEBUG="*:3" GST_DEBUG_FILE=/tmp/gst_debug.log \
  gst-launch-1.0 ... pipeline ...

15.7 Bottleneck Identification Methodology¶

Step 1: Run tegrastats during pipeline execution.

Step 2: Identify the constrained resource:
  - GPU > 90%        -> Inference is the bottleneck
  - NVDEC > 90%      -> Decode is the bottleneck (too many streams)
  - NVENC > 90%      -> Encode is the bottleneck
  - CPU > 80% total  -> CPU processing in pipeline (software element?)
  - RAM near limit   -> Memory pressure, possible swap thrashing

Step 3: Address the bottleneck:
  GPU-limited:
    - Reduce model size (YOLOv8n instead of YOLOv8s)
    - Use INT8 quantization
    - Increase nvinfer interval (skip frames)
    - Reduce input resolution to inference

  NVDEC-limited:
    - Reduce number of streams
    - Lower stream resolution at the source
    - Use more efficient codec (H.265 vs H.264)

  NVENC-limited:
    - Reduce output resolution or framerate
    - Use lower preset-level
    - Encode fewer output streams

  CPU-limited:
    - Replace software GStreamer elements with NVIDIA equivalents
    - Move CPU processing to GPU (CUDA kernels)
    - Reduce metadata processing frequency

15.8 Profiling Script¶

#!/bin/bash
# profile_pipeline.sh -- Capture tegrastats during pipeline run
DURATION=${1:-60}
CONFIG=${2:-"config.txt"}
OUTPUT_DIR="profile_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$OUTPUT_DIR"

# Start tegrastats logging in background
sudo tegrastats --interval 200 --logfile "$OUTPUT_DIR/tegrastats.log" &
TEGRA_PID=$!

# Run DeepStream with perf enabled
timeout "$DURATION" deepstream-app -c "$CONFIG" 2>&1 | \
  tee "$OUTPUT_DIR/deepstream.log"

# Stop tegrastats
sudo kill $TEGRA_PID 2>/dev/null

# Parse tegrastats for GPU/NVDEC/NVENC utilization
echo "=== GPU Utilization ==="
grep -oP 'GR3D_FREQ \K[0-9]+' "$OUTPUT_DIR/tegrastats.log" | \
  awk '{sum+=$1; n++} END {print "Avg:", sum/n "%", "Max:", max}'

echo "=== NVDEC Utilization ==="
grep -oP 'NVDEC \K[0-9]+' "$OUTPUT_DIR/tegrastats.log" | \
  awk '{sum+=$1; n++} END {print "Avg:", sum/n "%"}'

echo "=== Power ==="
grep -oP 'VDD_IN \K[0-9]+' "$OUTPUT_DIR/tegrastats.log" | \
  awk '{sum+=$1; n++} END {print "Avg:", sum/n "mW"}'

echo "Results saved to $OUTPUT_DIR/"

16. Production Deployment¶

16.1 RTSP Server Output¶

The most common production output is an RTSP server that clients (VMS, NVR, browsers) can connect to and pull the analyzed video stream.

GStreamer RTSP Server (Python):

#!/usr/bin/env python3
"""RTSP server serving DeepStream-analyzed video."""

import gi
gi.require_version('Gst', '1.0')
gi.require_version('GstRtspServer', '1.0')
from gi.repository import Gst, GstRtspServer, GLib

Gst.init(None)

class DeepStreamRTSPServer:
    def __init__(self, port=8554, mount_point="/live"):
        self.server = GstRtspServer.RTSPServer.new()
        self.server.set_service(str(port))

        factory = GstRtspServer.RTSPMediaFactory.new()
        factory.set_launch(
            '( nvarguscamerasrc sensor-id=0 ! '
            'video/x-raw(memory:NVMM),width=1920,height=1080,'
            'framerate=30/1,format=NV12 ! '
            'nvvideoconvert ! video/x-raw(memory:NVMM),format=NV12 ! '
            'nvv4l2h265enc bitrate=4000000 preset-level=1 '
            'insert-sps-pps=true idrinterval=30 '
            'maxperf-enable=true ! '
            'h265parse ! rtph265pay name=pay0 pt=96 )'
        )
        factory.set_shared(True)

        mounts = self.server.get_mount_points()
        mounts.add_factory(mount_point, factory)
        self.server.attach(None)

        print(f"RTSP server running at rtsp://localhost:{port}{mount_point}")

    def run(self):
        loop = GLib.MainLoop()
        loop.run()

if __name__ == "__main__":
    server = DeepStreamRTSPServer(port=8554, mount_point="/live")
    server.run()

DeepStream config file approach:

[sink0]
enable=1
type=4                          # RTSP output
rtsp-port=8554
udp-port=5400
codec=1                         # 0=H.264, 1=H.265
bitrate=4000000
enc-type=0                      # 0=hardware encoder
sync=0

[sink1]
enable=0                        # Disable display sink for headless
type=2

Client connection:

# VLC
vlc rtsp://jetson-ip:8554/live

# GStreamer client
gst-launch-1.0 rtspsrc location=rtsp://jetson-ip:8554/live latency=100 ! \
  rtph265depay ! h265parse ! avdec_h265 ! videoconvert ! autovideosink

# FFmpeg
ffplay -rtsp_transport tcp rtsp://jetson-ip:8554/live

16.2 Recording to Disk¶

Continuous recording with segment rotation:

# Record to segmented MP4 files (5-minute segments)
gst-launch-1.0 nvarguscamerasrc ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1' ! \
  nvv4l2h265enc bitrate=4000000 ! h265parse ! \
  splitmuxsink location="recording_%05d.mp4" \
    max-size-time=300000000000 \
    muxer-factory=mp4mux

DeepStream config for recording:

[sink2]
enable=1
type=3                          # File output
container=1                     # 1=MP4
codec=1                         # 1=H.265
enc-type=0                      # 0=hardware encoder
bitrate=4000000
output-file=recording.mp4
source-id=0                     # Record specific source

Recording with metadata sidecar (for forensic review):

import json
import time
from pathlib import Path

class MetadataRecorder:
    """Records detection metadata alongside video for later review."""

    def __init__(self, output_dir):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.current_file = None
        self.segment_start = None

    def new_segment(self, video_filename):
        if self.current_file:
            self.current_file.close()
        meta_filename = video_filename.replace('.mp4', '.jsonl')
        self.current_file = open(self.output_dir / meta_filename, 'w')
        self.segment_start = time.time()

    def record_detection(self, frame_num, timestamp, detections):
        record = {
            "frame": frame_num,
            "pts": timestamp,
            "wall_clock": time.time(),
            "detections": detections
        }
        self.current_file.write(json.dumps(record) + '\n')

    def close(self):
        if self.current_file:
            self.current_file.close()

16.3 Error Recovery and Pipeline State Management¶

Production pipelines must handle errors gracefully without human intervention.

Bus watch with automatic recovery:

import time
import threading

class ResilientPipeline:
    """Pipeline wrapper with automatic error recovery."""

    def __init__(self, pipeline, max_retries=10, retry_delay=5):
        self.pipeline = pipeline
        self.max_retries = max_retries
        self.retry_delay = retry_delay
        self.retry_count = 0
        self.running = True

        # Install bus watch
        bus = self.pipeline.get_bus()
        bus.add_signal_watch()
        bus.connect("message::error", self._on_error)
        bus.connect("message::eos", self._on_eos)
        bus.connect("message::state-changed", self._on_state_changed)

    def _on_error(self, bus, message):
        err, debug = message.parse_error()
        element = message.src.get_name()
        print(f"ERROR from {element}: {err.message}")
        print(f"Debug: {debug}")

        if self.retry_count < self.max_retries:
            self.retry_count += 1
            print(f"Attempting recovery ({self.retry_count}/{self.max_retries})...")
            self._restart()
        else:
            print("Max retries exceeded. Shutting down.")
            self.running = False

    def _on_eos(self, bus, message):
        print("End of stream. Restarting for loop playback...")
        self._restart()

    def _on_state_changed(self, bus, message):
        if message.src == self.pipeline:
            old, new, pending = message.parse_state_changed()
            if new == Gst.State.PLAYING:
                self.retry_count = 0  # Reset on successful play

    def _restart(self):
        self.pipeline.set_state(Gst.State.NULL)
        time.sleep(self.retry_delay)
        ret = self.pipeline.set_state(Gst.State.PLAYING)
        if ret == Gst.StateChangeReturn.FAILURE:
            print("Failed to restart pipeline")

    def start(self):
        self.pipeline.set_state(Gst.State.PLAYING)

    def stop(self):
        self.running = False
        self.pipeline.send_event(Gst.Event.new_eos())
        time.sleep(2)
        self.pipeline.set_state(Gst.State.NULL)

16.4 RTSP Source Reconnection¶

# DeepStream config: auto-reconnect RTSP sources
[source0]
enable=1
type=4
uri=rtsp://camera.local:554/live
rtsp-reconnect-interval-sec=5      # Reconnect every 5 seconds on failure
latency=200
num-sources=1

For programmatic control:

def handle_rtsp_source_error(src_element, source_id):
    """Handle RTSP source disconnection and reconnect."""
    print(f"Source {source_id} disconnected. Reconnecting...")
    src_element.set_state(Gst.State.NULL)
    time.sleep(3)
    src_element.set_state(Gst.State.PLAYING)

16.5 Long-Running Stability¶

Considerations for pipelines that must run 24/7:

Memory leak prevention:

# Periodically check for buffer leaks
def monitor_memory():
    import psutil
    process = psutil.Process()
    initial_rss = process.memory_info().rss

    while True:
        current_rss = process.memory_info().rss
        delta_mb = (current_rss - initial_rss) / (1024 * 1024)

        if delta_mb > 500:  # 500 MB growth threshold
            print(f"WARNING: Memory grew by {delta_mb:.0f} MB. "
                  f"Possible leak. Current RSS: "
                  f"{current_rss / (1024*1024):.0f} MB")

        time.sleep(60)

Thermal management:

# Monitor thermal zone temperatures
cat /sys/class/thermal/thermal_zone*/temp
# Values are in millidegrees Celsius (45000 = 45.0C)

# Set fan to maximum for 24/7 operation
sudo sh -c 'echo 255 > /sys/devices/pwm-fan/target_pwm'

def check_thermal():
    """Read SoC junction temperature and throttle if needed."""
    with open('/sys/class/thermal/thermal_zone0/temp') as f:
        temp_mc = int(f.read().strip())
    temp_c = temp_mc / 1000.0

    if temp_c > 85:
        print(f"THERMAL WARNING: {temp_c}C -- throttling pipeline")
        # Reduce inference interval or drop streams
        return True
    return False

Watchdog timer:

class PipelineWatchdog:
    """Restarts pipeline if no frames received within timeout."""

    def __init__(self, pipeline, timeout_sec=30):
        self.pipeline = pipeline
        self.timeout = timeout_sec
        self.last_frame_time = time.time()
        self.lock = threading.Lock()

    def frame_received(self):
        with self.lock:
            self.last_frame_time = time.time()

    def monitor(self):
        while True:
            with self.lock:
                elapsed = time.time() - self.last_frame_time

            if elapsed > self.timeout:
                print(f"WATCHDOG: No frames for {elapsed:.0f}s. Restarting...")
                self.pipeline.set_state(Gst.State.NULL)
                time.sleep(5)
                self.pipeline.set_state(Gst.State.PLAYING)
                with self.lock:
                    self.last_frame_time = time.time()

            time.sleep(5)

16.6 Systemd Service for Auto-Start¶

# /etc/systemd/system/deepstream-analytics.service
[Unit]
Description=DeepStream Video Analytics Pipeline
After=network-online.target nvpmodel.service
Wants=network-online.target

[Service]
Type=simple
User=root
Environment="DISPLAY=:0"
Environment="GST_DEBUG=2"
ExecStartPre=/usr/bin/nvpmodel -m 0
ExecStartPre=/usr/bin/jetson_clocks
ExecStart=/usr/bin/deepstream-app -c /opt/analytics/config.txt
Restart=always
RestartSec=10
WatchdogSec=120
StandardOutput=journal
StandardError=journal
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

sudo systemctl enable deepstream-analytics
sudo systemctl start deepstream-analytics
sudo journalctl -u deepstream-analytics -f   # View logs

16.7 Production Deployment Checklist¶

Item	Action
Power mode	`sudo nvpmodel -m 0` (15W max)
Clock locking	`sudo jetson_clocks` (max freq)
TensorRT engines	Pre-built, not built at runtime
Fan control	Set to max for 24/7 (`echo 255 > target_pwm`)
RTSP reconnect	`rtsp-reconnect-interval-sec=5`
Error recovery	Bus watch with auto-restart
Watchdog	Frame timeout detection
Thermal monitoring	Alert at 85C, throttle or shutdown at 90C
Memory monitoring	Track RSS growth, alert on > 500 MB growth
Log rotation	`logrotate` for GST_DEBUG and app logs
Systemd service	Auto-start, restart on failure
Health endpoint	HTTP endpoint reporting pipeline status
Metrics export	Prometheus/Grafana for FPS, latency, temperature

17. Common Issues and Debugging¶

17.1 Pipeline Negotiation Errors¶

Symptom: not-negotiated (-4) error or Internal data stream error.

ERROR from element nvv4l2decoder0: Internal data stream error.
Debug info: gstbasesrc.c(3127): gst_base_src_loop():
  streaming stopped, reason not-negotiated (-4)

Common causes and fixes:

Cause	Fix
Missing parser before decoder	Add `h264parse` or `h265parse`
Wrong caps format	Check `(memory:NVMM)` in caps filter
Incompatible pixel format	Insert `nvvideoconvert` between elements
Resolution exceeds codec maximum	Reduce resolution
Missing demuxer	Add `qtdemux` for MP4 or `matroskademux` for MKV

Debugging approach:

# Dump negotiation details
GST_DEBUG="GST_CAPS:5,nvv4l2decoder:5" gst-launch-1.0 ... 2>&1 | head -200

# Check what caps each element supports
gst-inspect-1.0 nvv4l2decoder   # Look at "Pad Templates" section

17.2 Buffer Starvation and Underruns¶

Symptom: Choppy playback, frame drops, increasing latency.

WARNING from element nvv4l2decoder0: Output buffer pool has no free buffers

Fixes:

# Increase decoder output buffer pool
gst-launch-1.0 ... ! nvv4l2decoder num-extra-surfaces=4 ! ...

# Add queue elements with larger buffers
gst-launch-1.0 ... ! queue max-size-buffers=10 max-size-time=0 max-size-bytes=0 ! ...

# For live sources, increase latency tolerance
gst-launch-1.0 rtspsrc location=... latency=500 ! ...

17.3 Codec Limitations and Workarounds¶

Limitation	Workaround
No VP9/AV1 encode	Transcode to H.265 for storage/streaming
No H.264 10-bit encode	Use H.265 Main10 for 10-bit encode
No progressive JPEG (NVJPEG)	Use libjpeg-turbo for progressive JPEG
Single NVDEC instance	Limit concurrent decode streams to pixel budget
Single NVENC instance	Time-slice encode; reduce encode streams
No interlaced video encode	Deinterlace with `nvvideoconvert` first

17.4 Memory Leaks¶

Detection:

# Monitor RSS growth over time
watch -n 5 'ps -o pid,rss,vsz,comm -p $(pgrep deepstream-app)'

# Use valgrind (slow, use on development machine, not production)
valgrind --leak-check=full --show-leak-kinds=all \
  deepstream-app -c config.txt 2>&1 | tee valgrind.log

# GStreamer's built-in leak tracer
GST_TRACERS="leaks(check-refs=true)" gst-launch-1.0 ... pipeline ...

Common leak sources:

Source	Fix
Unreleased GstBuffer refs	Ensure `gst_buffer_unref()` in every code path
Leaked pad probes	Remove probes when pipeline stops
Python pyds metadata refs	Use `try/except StopIteration` pattern
NvBufSurface not unmapped	Always pair `Map` with `UnMap`
Dynamic source pads not released	Release request pads on source removal

17.5 NVDEC/NVENC Errors¶

Symptom: Decoder or encoder returns error after working initially.

ERROR from element nvv4l2decoder0: Could not decode stream.

Diagnostic steps:

# Check if codec device is available
ls -la /dev/video0   # decoder
ls -la /dev/video1   # encoder

# Check for other processes holding the codec
sudo fuser /dev/video0
sudo fuser /dev/video1

# Check kernel log for hardware errors
dmesg | grep -i -E "nvdec|nvenc|tegra-video|fault"

# Verify clock status
sudo cat /sys/kernel/debug/clk/nvdec/clk_rate
sudo cat /sys/kernel/debug/clk/msenc/clk_rate

# Reset video engines (last resort)
sudo systemctl restart nvargus-daemon

17.6 DeepStream Model Loading Failures¶

Symptom: nvinfer fails to load model or build engine.

ERROR: Failed to create engine from model file

Fixes:

# Verify engine file matches GPU architecture
# Orin Nano = SM 8.7 (compute capability)
# Engine built on different GPU will not load

# Rebuild engine for this platform
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx \
  --saveEngine=model_orin_fp16.engine --fp16

# Check engine compatibility
python3 -c "
import tensorrt as trt
logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)
with open('model.engine', 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())
    if engine is None:
        print('Engine failed to load -- likely wrong platform')
    else:
        print(f'Engine loaded: {engine.num_bindings} bindings')
"

# Common causes:
# 1. Engine built on x86 (SM 8.6) but running on Orin (SM 8.7)
# 2. TensorRT version mismatch between build and runtime
# 3. Batch size in config exceeds engine max batch
# 4. Insufficient workspace memory

17.7 GST_DEBUG Usage Reference¶

# Environment variables for GStreamer debugging

# Set global debug level
export GST_DEBUG=3           # WARNING and above for all

# Set per-element debug level (comma-separated)
export GST_DEBUG="nvv4l2decoder:5,nvinfer:4,nvstreammux:3"

# Write to file instead of stderr
export GST_DEBUG_FILE=/tmp/gst_debug.log

# Enable color output (useful for terminal)
export GST_DEBUG_COLOR_MODE=on

# Dump pipeline graphs
export GST_DEBUG_DUMP_DOT_DIR=/tmp/gst-dots

# Debug categories for NVIDIA plugins:
# nvv4l2decoder:N  -- Hardware decoder
# nvv4l2h264enc:N  -- H.264 encoder
# nvv4l2h265enc:N  -- H.265 encoder
# nvvideoconvert:N -- Video format converter
# nvinfer:N        -- TensorRT inference
# nvtracker:N      -- Object tracker
# nvstreammux:N    -- Stream muxer
# nvdsosd:N        -- On-screen display
# nvmsgconv:N      -- Message converter
# nvmsgbroker:N    -- Message broker

# Example: debug a failing decode pipeline
GST_DEBUG="GST_CAPS:4,nvv4l2decoder:6,GST_BUFFER:4" \
  gst-launch-1.0 filesrc location=test.mp4 ! qtdemux ! h265parse ! \
  nvv4l2decoder ! nvvideoconvert ! nv3dsink 2>&1 | tee debug.log

17.8 Common Error Messages and Solutions¶

Error Message	Likely Cause	Solution
`nvbufsurface: mapping of buffer failed`	Out of NVMM memory	Reduce buffer pools or stream count
`Failed to create NvBufSurface`	Too many concurrent allocations	Reduce batch size or resolution
`nvv4l2decoder: capture plane deque buffer failed`	NVDEC overloaded	Reduce decode streams
`Error in NvBufSurfTransformAsync`	VIC queue full	Add queue before nvvideoconvert
`Pipeline doesn't want to PREROLL`	Missing element or broken link	Check all links; use DOT graph
`no element "nvv4l2decoder"`	Plugin not installed	Verify JetPack install; `gst-inspect-1.0 nvv4l2decoder`
`Could not initialize nvbufsurface`	Display server not running (headless)	Use fakesink or set `DISPLAY=:0`
`ERROR from nvinfer: TensorRT inference failed`	Engine/model error	Rebuild engine for this platform
`nvstreammux: Failed to get batch`	No sources producing frames	Check source connectivity and state
`RTSP connection timed out`	Network or camera issue	Set `rtsp-reconnect-interval-sec=5`

17.9 Performance Degradation Over Time¶

If performance degrades after hours or days of continuous operation:

# Check for thermal throttling
cat /sys/devices/virtual/thermal/thermal_zone*/temp
# Junction temp > 85000 means throttling is likely active

# Check for memory fragmentation
cat /proc/buddyinfo
free -h

# Check for swap usage (severe performance hit)
swapon --show
cat /proc/meminfo | grep Swap

# Check if clocks dropped from max
sudo cat /sys/kernel/debug/clk/gpcclk/clk_rate  # GPU clock
sudo cat /sys/kernel/debug/clk/nvdec/clk_rate    # NVDEC clock

# Re-lock clocks (may be lost after thermal throttle)
sudo jetson_clocks

# Check for zombie processes consuming resources
ps aux | grep -i -E "defunct|zombie"

# Check open file descriptors (RTSP connections can leak fds)
ls /proc/$(pgrep deepstream-app)/fd | wc -l

17.10 Quick Diagnostic Checklist¶

1. Pipeline fails to start:
   [ ] Check GST_DEBUG output for first ERROR
   [ ] Verify all elements exist: gst-inspect-1.0 <element>
   [ ] Dump DOT graph at READY state
   [ ] Check device permissions: ls -la /dev/video*
   [ ] Verify JetPack version matches DeepStream version

2. Pipeline starts but no video output:
   [ ] Check source connectivity (RTSP reachable? File exists?)
   [ ] Verify display server running (for nv3dsink)
   [ ] Check nvstreammux batch-size matches source count
   [ ] Try with fakesink to isolate display issues

3. Low FPS / poor performance:
   [ ] Run tegrastats -- which engine is saturated?
   [ ] Check nvinfer interval setting
   [ ] Verify hardware encoder/decoder (not software fallback)
   [ ] Check power mode: nvpmodel -q
   [ ] Check clocks: sudo jetson_clocks --show

4. Pipeline crashes after running:
   [ ] Check dmesg for GPU/NVDEC faults
   [ ] Monitor memory with tegrastats -- OOM?
   [ ] Check thermal -- throttling or shutdown?
   [ ] Enable core dumps: ulimit -c unlimited
   [ ] Run with GST_DEBUG=3 and check last messages before crash

References¶

NVIDIA Jetson Orin Nano Data Sheet -- https://developer.nvidia.com/embedded/jetson-orin-nano
Jetson Linux Multimedia API Reference -- https://docs.nvidia.com/jetson/archives/l4t-multimedia/index.html
NVIDIA Accelerated GStreamer Plugins -- https://docs.nvidia.com/jetson/archives/l4t-multimedia/group__gstreamer__plugins.html
DeepStream SDK Developer Guide -- https://docs.nvidia.com/metropolis/deepstream/dev-guide/
DeepStream Python Apps Repository -- https://github.com/NVIDIA-AI-IOT/deepstream_python_apps
NvBufSurface API Reference -- https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_NvBufSurface.html
TensorRT Developer Guide -- https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/
GStreamer Documentation -- https://gstreamer.freedesktop.org/documentation/
NVIDIA Nsight Systems -- https://docs.nvidia.com/nsight-systems/
Jetson Power and Performance -- https://docs.nvidia.com/jetson/archives/r36.3/DeveloperGuide/SD/PlatformPowerAndPerformance.html
V4L2 API Specification -- https://www.kernel.org/doc/html/latest/userspace-api/media/v4l/v4l2.html