Computer Vision — Complete Guide

Goal: Build a comprehensive understanding of computer vision from image processing fundamentals through modern deep learning architectures, with practical skills in annotation tools, dataset formats, and deployment on edge hardware.


Table of Contents

  1. What is Computer Vision?
  2. Image Processing Fundamentals — ISP pipeline, color spaces, filtering, edge detection
  3. Feature Extraction
  4. Image Segmentation — threshold, watershed, semantic segmentation, instance segmentation
  5. Object Detection
  6. Object Tracking
  7. 3D Vision — calibration, stereo depth, pose, multi-camera ADAS (openpilot near/wide)
  8. Advanced Deep Learning Architectures
  9. OpenCV — Core to Advanced
  10. Annotation Tools
  11. Dataset Formats
  12. Model Training Pipeline
  13. Projects
  14. Resources

1. What is Computer Vision?

Computer vision gives machines the ability to interpret and understand visual information from the world — images, videos, depth maps, point clouds.

Input: raw pixels / point cloud / depth frame
  Preprocessing (resize, normalize, augment)
  Feature extraction (learned or hand-crafted)
  Task-specific head (detection / segmentation / depth)
Output: bounding boxes / masks / pose / 3D structure

Core Tasks

Task                    Input         Output           Example
──────────────────────────────────────────────────────────────────────────────────
Classification          Image         Class label      "this is a cat"
Object Detection        Image         Boxes + labels   "car at (x1,y1,x2,y2)"
Semantic Segmentation   Image         Per-pixel class  every pixel = road/car/sky
Instance Segmentation   Image         Per-object mask  each car gets its own mask
Depth Estimation        RGB           Depth map        distance per pixel
Pose Estimation         Image         Keypoints        skeleton of a person
3D Detection            Point Cloud   3D boxes         LiDAR object detection
Optical Flow            Video         Motion field     pixel displacement between frames

Why Computer Vision is Hard

Challenges:
  Viewpoint variation    — same object, different camera angle
  Illumination           — shadows, overexposure, nighttime
  Occlusion              — object partly hidden by another
  Scale variation        — same object at different distances
  Intra-class variation  — all chairs look different
  Background clutter     — object blends into background
  Deformable objects     — humans, animals change shape

2. Image Processing Fundamentals

2.0 ISP Pipeline — From Photons to Pixels

Beginners think: "the camera gives an image." Reality: the camera gives raw physics — the ISP builds the image.

Every image a neural network processes was manufactured by an Image Signal Processor. The sensor does not output RGB. It outputs a Bayer mosaic of single-channel photon counts at 10/12/14-bit depth. The ISP pipeline transforms this raw physics into a stable, color-correct image. Every stage exists for a physical reason, and every stage can break your model if done wrong.

This is core knowledge for ADAS, embedded vision, Jetson, and camera systems.

Full RAW-to-NV12 Pipeline

Photons → Lens → Sensor (CCD/CMOS with Bayer CFA)
                        │  raw Bayer mosaic (10/12/14 bit)
              ┌───────────────────────────┐
              │  1. RAW Bayer             │  sensor output — NOT an image yet
              │  2. Linearization         │  fix sensor non-linearity
              │  3. White Balance         │  correct for lighting color
              │  4. Demosaic              │  Bayer → full RGB
              │  5. Color Correction (CCM)│  sensor RGB → standard sRGB
              │  6. Brightness / Contrast │  tone mapping, gamma
              │  7. Lens Shading          │  fix vignetting (dark edges)
              │  8. NV12 output           │  YUV for GPU / neural net
              └───────────────────────────┘

Stage 1: RAW Bayer (Sensor Output)

The sensor does NOT capture RGB.

It captures a Bayer mosaic pattern — each physical pixel has a color filter in front of it that passes only one color.

Bayer RGGB pattern (most common):

  col 0   col 1   col 2   col 3
  ┌───────┬───────┬───────┬───────┐
  │   R   │   G   │   R   │   G   │  row 0
  ├───────┼───────┼───────┼───────┤
  │   G   │   B   │   G   │   B   │  row 1
  ├───────┼───────┼───────┼───────┤
  │   R   │   G   │   R   │   G   │  row 2
  ├───────┼───────┼───────┼───────┤
  │   G   │   B   │   G   │   B   │  row 3
  └───────┴───────┴───────┴───────┘

Each pixel records ONLY ONE color channel.
50% green — human vision is most sensitive to green, so more green pixels
            maximizes signal-to-noise ratio.
25% red, 25% blue.
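
To make the sampling concrete, the sketch below simulates an RGGB sensor by keeping one channel per pixel of a full RGB array. The function names and the random 10-bit test scene are illustrative assumptions, not part of any camera API.

```python
import numpy as np

def rgb_to_bayer_rggb(rgb: np.ndarray) -> np.ndarray:
    """Keep one channel per pixel, following the RGGB layout shown above."""
    h, w, _ = rgb.shape
    mosaic = np.zeros((h, w), dtype=rgb.dtype)
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]   # R at even rows, even cols
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]   # G at even rows, odd cols
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]   # G at odd rows, even cols
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]   # B at odd rows, odd cols
    return mosaic

def bayer_channel_fractions(h: int, w: int) -> dict:
    """Fraction of photosites assigned to each color in an RGGB mosaic."""
    rows, cols = np.indices((h, w))
    red = (rows % 2 == 0) & (cols % 2 == 0)
    blue = (rows % 2 == 1) & (cols % 2 == 1)
    green = ~(red | blue)
    return {"R": red.mean(), "G": green.mean(), "B": blue.mean()}

rng = np.random.default_rng(0)
rgb = rng.integers(0, 1024, size=(4, 4, 3), dtype=np.uint16)  # fake 10-bit scene
mosaic = rgb_to_bayer_rggb(rgb)
fractions = bayer_channel_fractions(4, 4)
print(mosaic.shape, fractions)
```

The mosaic has the same height and width as the scene but only one value per pixel, which is where the bandwidth saving comes from.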

Why this design?

  • Hardware simplicity — one filter per photodiode, no beam splitters
  • Smaller bandwidth — 1 channel per pixel instead of 3
  • Standard across the industry: the Bayer pattern was patented in 1976 and is still the dominant layout

The problem:

  • This is NOT a usable image
  • A neural network cannot ingest raw Bayer directly (wrong structure, single-channel)
  • Every downstream stage exists to turn this into something usable

CCD vs CMOS:

  CCD  — Charge Coupled Device
         Shifts charge row-by-row to a single ADC at the edge
         Global shutter (all pixels exposed simultaneously — no rolling artifacts)
         Higher power, mostly legacy (replaced by CMOS)

  CMOS — Complementary Metal-Oxide-Semiconductor
         Each pixel has its own amplifier; ADCs are shared per column (column-parallel readout)
         Rolling shutter on most parts (rows read sequentially, so fast motion causes skew)
         Lower power, faster, dominant in nearly all modern cameras

ADAS reference sensors:
  AR0233AT (ON Semi)    — 2.6 MP, HDR, automotive qualified, RGGB
  AR0231AT (ON Semi)    — 2.3 MP, HDR, used in the comma three (openpilot)
  IMX390  (Sony)        — 2.3 MP, HDR, LED flicker mitigation, ADAS standard
  IMX728  (Sony)        — 8.3 MP, latest generation ADAS, stacked BSI

Stage 2: Linearization

Sensors are not perfectly linear devices. The raw pixel value coming out of the ADC is not exactly proportional to the number of photons that hit the pixel, and the deviation grows near saturation.

Why non-linear?

  Photodiode response      — charge accumulates non-linearly near saturation
  ADC transfer function    — analog-to-digital converter has its own curve
  Sensor gain stages       — programmable gain amplifiers introduce distortion

What linearization does:

  Apply a correction curve (measured during factory calibration)
  so that:  output_value = k × (number of photons)

  After linearization: double the light → double the pixel value

Why this matters:

  • ALL subsequent math (white balance, color correction, noise reduction) assumes linear light
  • Without linearization: colors are wrong, brightness is wrong, exposure control breaks
  • The correction curve is specific to each sensor model (stored in sensor firmware or ISP config)

Before linearization:
  200 photons → ADC reads 1000
  400 photons → ADC reads 1700    ← should be 2000 if linear

After linearization:
  200 photons → corrected to 1000
  400 photons → corrected to 2000  ← now proportional
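
A correction curve of this kind can be sketched as a lookup table with interpolation. The calibration points below are hypothetical values chosen to match the numbers in the example above; a real table would come from the sensor vendor or a factory calibration.

```python
import numpy as np

# Hypothetical calibration table (assumption): raw ADC codes measured at
# known light levels, paired with the codes a perfectly linear sensor would give.
measured_raw = np.array([0.0, 1000.0, 1700.0, 2300.0])
ideal_linear = np.array([0.0, 1000.0, 2000.0, 3000.0])

def linearize(raw: np.ndarray) -> np.ndarray:
    """Map raw ADC codes to linear values by piecewise-linear interpolation."""
    return np.interp(raw, measured_raw, ideal_linear)

raw_frame = np.array([[1000.0, 1700.0],
                      [500.0, 2300.0]])
linear_frame = linearize(raw_frame)
print(linear_frame)
```

Hardware ISPs typically implement the same idea as a fixed-size LUT indexed by the raw code, so the cost per pixel is a single table read.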

Stage 3: White Balance (Critical for ADAS)

Real-world lighting is not neutral. Different light sources emit different color temperatures, which shifts the color of everything in the scene.

Color temperature of common illuminants:

  1800 K    candle flame            very warm / orange
  2700 K    tungsten bulb           warm yellow
  4000 K    fluorescent tube        neutral / slightly green
  5500 K    direct sunlight (D55)   daylight standard
  6500 K    overcast sky (D65)      cool / slightly blue — sRGB reference
 10000 K    blue sky / deep shade   very cool / blue

The sensor's R, G, B channels respond differently to each illuminant. Under tungsten light, the red channel is too strong; in shade, the blue channel is too strong.

White balance corrects this by scaling each channel with a gain factor.

How it works:

  Measure the scene illuminant (auto white balance algorithms):
    - Gray World:    assume average scene color should be neutral gray
    - White Patch:   assume brightest region is white
    - Illuminant estimation: statistical model of light source

  Apply per-channel gain:
    R_corrected = R_raw × gain_R
    G_corrected = G_raw × gain_G     (usually gain_G = 1.0, reference channel)
    B_corrected = B_raw × gain_B

  Goal: a white object appears as (R=G=B) regardless of lighting
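
The Gray World variant is simple enough to sketch in a few lines. This is a minimal illustration that assumes linear-light RGB input in the range 0 to 1; the function names are made up for the example, and production AWB uses far more robust statistics.

```python
import numpy as np

def gray_world_gains(rgb: np.ndarray) -> np.ndarray:
    """Per-channel gains that equalize the channel means, with green as reference."""
    means = rgb.reshape(-1, 3).mean(axis=0)
    return means[1] / means            # gain_G is 1.0 by construction

def apply_white_balance(rgb: np.ndarray, gains: np.ndarray) -> np.ndarray:
    """Scale each channel by its gain (broadcasts over the last axis)."""
    return rgb * gains

# A neutral gray patch as seen under a warm illuminant: red high, blue low
scene = np.full((8, 8, 3), 0.4) * np.array([1.5, 1.0, 0.6])
gains = gray_world_gains(scene)
balanced = apply_white_balance(scene, gains)
print(gains, balanced[0, 0])
```

Gray World breaks down when the scene is dominated by one color (a frame filled with green foliage, for example), which is one reason real ISPs combine several estimators.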

Why this is critical for ADAS:

  • A dashcam sees constantly changing illuminants: sun, shadows, tunnels, LED taillights, streetlights
  • Bad AWB causes color shifts between frames
  • A neural net trained on D65 (overcast daylight) images will misclassify objects when the ISP outputs orange-tinted tunnel frames
  • Color shift → class confusion → missed detections

ADAS scenario:

  Highway → tunnel → exit:
    Frame 1: outdoor sunlight (5500K) → correct colors
    Frame 2: tunnel sodium lamps (2000K) → everything orange
    Frame 3: tunnel exit (mixed) → half orange, half blue

  Without good AWB: the neural net sees three different "worlds"
  With good AWB:    all three frames have consistent color → stable predictions

Stage 4: Demosaic (Most Important ISP Stage)

The raw Bayer image has incomplete color information. Each pixel recorded only ONE channel (R, G, or B). Demosaicing reconstructs the missing two channels for every pixel.

Before demosaic:                  After demosaic:

  pixel (2,3) has only G value    pixel (2,3) has (R_interpolated,
                                                    G_actual,
                                                    B_interpolated)

How interpolation works (simplified — bilinear):

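  To estimate the missing Red value at a Green pixel:
    R_estimated = average of the 2 nearest R pixels
                  (left/right on an R row, above/below on a B row)

  To estimate the missing Blue value at a Green pixel:
    B_estimated = average of the 2 nearest B pixels
                  (above/below on an R row, left/right on a B row)

  To estimate the missing Blue value at a Red pixel (or Red at a Blue pixel):
    estimate = average of the 4 diagonal neighbors

  To estimate the missing Green value at a Red or Blue pixel:
    G_estimated = average of the 4 neighboring G pixels (above, below, left, right)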

Demosaicing algorithms range in quality:

Method           Speed      Quality    Artifacts
──────────────────────────────────────────────────────────
Bilinear         fastest    poor       zipper edges, moiré, false color
VNG              medium     good       mild color fringing at sharp edges
AHD / EA         slower     very good  edge-aware — interpolates ALONG edges,
                                       not across them
Neural network   GPU        excellent  learned from ground-truth RGB pairs
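
The bilinear row of the table can be sketched with two small averaging kernels applied to the masked color planes. This is an illustrative NumPy version that assumes an RGGB layout; the helper names are hypothetical, and it is not how a hardware ISP implements the stage.

```python
import numpy as np

def _filter3x3(plane: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """3x3 filtering with reflect padding (reflection keeps the Bayer phase intact)."""
    h, w = plane.shape
    padded = np.pad(plane, 1, mode="reflect")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def demosaic_bilinear_rggb(mosaic: np.ndarray) -> np.ndarray:
    """Bilinear demosaic of an RGGB mosaic; returns float RGB of shape (H, W, 3)."""
    mosaic = mosaic.astype(np.float64)
    rows, cols = np.indices(mosaic.shape)
    r_mask = (rows % 2 == 0) & (cols % 2 == 0)
    b_mask = (rows % 2 == 1) & (cols % 2 == 1)
    g_mask = ~(r_mask | b_mask)

    # Each kernel averages whichever same-color neighbors exist around a pixel
    k_green = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0
    k_red_blue = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0

    r = _filter3x3(mosaic * r_mask, k_red_blue)
    g = _filter3x3(mosaic * g_mask, k_green)
    b = _filter3x3(mosaic * b_mask, k_red_blue)
    return np.stack([r, g, b], axis=-1)

flat_gray = np.full((6, 6), 100.0)        # mosaic of a uniform gray scene
rgb = demosaic_bilinear_rggb(flat_gray)
print(rgb.shape, rgb.min(), rgb.max())
```

A uniform scene comes back uniform, which is a quick sanity check. The zipper and false-color artifacts listed in the table only show up at edges, where averaging across the edge mixes values from both sides.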

Why quality matters enormously:

  • Bad demosaicing creates zipper artifacts at sharp edges (alternating color fringes)
  • These artifacts look like real features to a CNN
  • Moiré patterns (false periodic patterns) trigger false detections
  • Every downstream CV task inherits demosaic errors; later stages cannot undo them

ADAS tradeoff:

  Accuracy:  want AHD or neural-net demosaic (best edges)
  Speed:     need real-time (30 fps minimum)
  Solution:  hardware ISP does demosaic in dedicated silicon at pixel rate
             — no CPU/GPU cycles consumed, runs at sensor speed

Hardware ISP demosaicers:
  NVIDIA Jetson:    built into VI/ISP unit (not user-configurable)
  Qualcomm Spectra: multi-pass with chroma correction
  ARM Mali-C71:     proprietary edge-directed algorithm

Stage 5: Color Space Correction (CCM)

Sensor color filters are NOT ideal R, G, B. Each sensor model has a unique spectral response — the red filter passes some green, the blue filter leaks some red, etc.

The Color Correction Matrix (CCM) transforms sensor-native RGB into a standard color space (sRGB).

What the CCM does:

  [R_sRGB]     [  1.86  -0.64  -0.21 ]   [R_sensor]
  [G_sRGB]  =  [ -0.25   1.52  -0.26 ] × [G_sensor]
  [B_sRGB]     [  0.04  -0.57   1.53 ]   [B_sensor]

  This 3×3 matrix corrects:
    - Spectral cross-talk between filter channels
    - Sensitivity differences between R/G/B photodiodes
    - Maps sensor-specific colors to device-independent standard

The full color chain:
  Sensor RGB → CCM → CIE XYZ → matrix → linear sRGB → gamma curve → sRGB

    Sensor RGB    device-dependent
    CIE XYZ       device-independent
    linear sRGB   physical light values
    sRGB          perceptual / display-ready

Why this matters for neural networks:

  • NNs trained on sRGB images learn specific color statistics
  • If your camera's CCM differs from training data, color distribution shifts
  • Shifted colors → degraded accuracy on color-sensitive tasks (traffic light, lane color, brake lights)
  • Solution: either calibrate CCM to match training distribution, or augment training with diverse ISP outputs

Stage 6: Brightness and Contrast Control

After color correction, the image is in linear light — but linear light looks wrong to human eyes and to models trained on gamma-corrected images.

Why we need this stage:

  Linear light reality:
    - Most of the useful scene detail is compressed into a tiny range
    - Dark areas look too dark, bright areas look too bright
    - Real scenes have 120+ dB dynamic range (HDR sensors)
    - Display/JPEG supports only ~48 dB (8 bits)

What this stage does:

  1. Tone mapping (HDR → SDR)
     Compress the enormous dynamic range into 8 bits
     Preserve detail in both shadows and highlights
     Critical for dashcams: headlights + dark road in same frame

  2. Gamma correction
     Apply power curve: output = input^(1/2.2)
     Redistributes bit depth to match human perception
     Without gamma: dark regions have too few distinct values

  3. Contrast adjustment
     Local or global contrast enhancement
     Prevents losing lane markings in shadows
     Prevents saturating sky / headlight regions
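
The effect of gamma on shadow precision can be checked numerically. The sketch below counts how many distinct 8-bit codes land in the darkest 1% of the linear range, with and without the power curve; it assumes a plain 1/2.2 power law rather than the exact piecewise sRGB curve.

```python
import numpy as np

def gamma_encode(linear: np.ndarray, gamma: float = 2.2) -> np.ndarray:
    """Power-law encoding of linear values in [0, 1]."""
    return np.clip(linear, 0.0, 1.0) ** (1.0 / gamma)

def count_8bit_codes(values: np.ndarray) -> int:
    """Number of distinct 8-bit codes used after quantizing values in [0, 1]."""
    return len(np.unique(np.round(values * 255).astype(np.uint8)))

# Densely sample the darkest 1% of the linear range
shadows = np.linspace(0.0, 0.01, 10_000)
codes_linear = count_8bit_codes(shadows)
codes_gamma = count_8bit_codes(gamma_encode(shadows))
print(codes_linear, codes_gamma)
```

Linear quantization leaves only a handful of codes for these shadows, while the gamma-encoded version spreads them over several times as many, which is the redistribution of bit depth described in step 2.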

ADAS-specific concerns:

  Problem: driving toward sunset
    - Sky: massively overexposed (saturated white)
    - Road: deeply underexposed (near black)
    - Lane markings: invisible in both regions

  ISP tone mapper must:
    - Compress sky brightness so highlights do not clip or bloom
    - Lift road shadows so lane markings remain visible
    - Do this consistently frame-to-frame (no flickering)

  Bad tone mapping → lost lane markings → lateral control fails
  Good tone mapping → neural net sees lanes in all conditions

Stage 7: Lens Shading / Vignetting Correction

Every lens is optically imperfect. Light falls off toward the edges of the image — the center is brighter than the corners. This is called vignetting.

Why it happens:

  - Cos⁴ law: light intensity drops as cos⁴(angle) from optical axis
  - At 30° off-axis: brightness drops to ~56% of center
  - At 45° off-axis: brightness drops to ~25% of center
  - Worse with wide-angle lenses (ADAS cameras are typically 60–120° FOV)

What the correction does:

  Apply a per-pixel gain map (measured during lens calibration):
    pixel_corrected = pixel_raw × gain_map(x, y)

  The gain map is:
    - ~1.0 at center (no correction needed)
    - ~1.5–2.5 at edges (amplify to compensate for falloff)
    - Unique per lens + sensor combination
    - Stored in ISP firmware
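
The cos⁴ figures and the shape of the gain map can be reproduced with a short sketch. It assumes an idealized lens whose falloff follows the cos⁴ law exactly and a hypothetical half-diagonal field of view; a real gain map is measured from flat-field captures, not computed.

```python
import numpy as np

def cos4_falloff(angle_deg: float) -> float:
    """Relative illumination at a given off-axis angle under the cos^4 law."""
    return float(np.cos(np.radians(angle_deg)) ** 4)

def cos4_gain_map(h: int, w: int, half_diag_fov_deg: float) -> np.ndarray:
    """Gain map that undoes cos^4 falloff for an idealized lens."""
    ys, xs = np.indices((h, w), dtype=np.float64)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radius = np.hypot(ys - cy, xs - cx)
    # Focal length in pixels such that the image corner sits at the half-diagonal FOV
    focal_px = radius.max() / np.tan(np.radians(half_diag_fov_deg))
    angle = np.arctan(radius / focal_px)
    return 1.0 / np.cos(angle) ** 4

gain = cos4_gain_map(480, 640, half_diag_fov_deg=30.0)
print(cos4_falloff(30.0), cos4_falloff(45.0), gain.min(), gain.max())
```

Because the gain at the corners is well above 1, the correction also amplifies noise there, so corner regions end up noisier than the center even after the brightness is equalized.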

Why it matters for ADAS:

  Without lens shading correction:
    - Objects at image edges appear darker than center
    - A pedestrian in the left corner looks different from same pedestrian at center
    - Segmentation models produce inconsistent masks at image borders
    - Lane detection fails at peripheral lanes (they appear too dim)

  Especially critical for:
    - Wide-angle cameras (openpilot wide cam: ~120° FOV → severe vignetting)
    - Multi-camera stitching (brightness mismatch at overlap zones)
    - Surround-view systems (4+ cameras with edge overlap)

Stage 8: NV12 Output (Final Format)

The ISP's final step converts the processed RGB image into NV12 (a YUV 4:2:0 format) for downstream consumption by GPU, neural network, and video encoder.

What is NV12?

  Y plane:   full resolution luminance
             every pixel has its own brightness value
             size: H × W bytes

  UV plane:  half-resolution chrominance (interleaved U, V)
             shared between 2×2 pixel blocks
             size: H/2 × W bytes

  Total size: H × W × 1.5 bytes (vs H × W × 3 for RGB)
             → 50% smaller than RGB

NV12 memory layout for a 4×4 image:

  Y plane (4×4):        UV plane (2×4, interleaved):
  ┌───┬───┬───┬───┐     ┌───┬───┬───┬───┐
  │Y00│Y01│Y02│Y03│     │U00│V00│U01│V01│  ← shared by top 2 rows
  ├───┼───┼───┼───┤     ├───┼───┼───┼───┤
  │Y10│Y11│Y12│Y13│     │U10│V10│U11│V11│  ← shared by bottom 2 rows
  ├───┼───┼───┼───┤     └───┴───┴───┴───┘
  │Y20│Y21│Y22│Y23│
  ├───┼───┼───┼───┤
  │Y30│Y31│Y32│Y33│
  └───┴───┴───┴───┘
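
The layout above maps directly to byte offsets. The sketch below packs separate Y, U, and V planes into one NV12 buffer and looks up the chroma bytes for a given pixel. It assumes a tightly packed buffer (row stride equal to the width); real driver buffers often pad each row, and the function names are illustrative.

```python
import numpy as np

def pack_nv12(y: np.ndarray, u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Pack a full-res Y plane and half-res U, V planes into one NV12 byte buffer."""
    h, w = y.shape
    assert h % 2 == 0 and w % 2 == 0, "NV12 needs even dimensions"
    assert u.shape == v.shape == (h // 2, w // 2)
    uv = np.empty((h // 2, w), dtype=np.uint8)
    uv[:, 0::2] = u                      # U at even byte positions
    uv[:, 1::2] = v                      # V at odd byte positions
    return np.concatenate([y.ravel(), uv.ravel()])

def chroma_offsets(row: int, col: int, h: int, w: int) -> tuple:
    """Byte offsets of the U and V samples shared by pixel (row, col)."""
    u_offset = h * w + (row // 2) * w + (col // 2) * 2
    return u_offset, u_offset + 1

h, w = 4, 4
y = np.arange(h * w, dtype=np.uint8).reshape(h, w)
u = np.full((h // 2, w // 2), 100, dtype=np.uint8)
v = np.full((h // 2, w // 2), 200, dtype=np.uint8)
buf = pack_nv12(y, u, v)
u_off, v_off = chroma_offsets(3, 2, h, w)
print(buf.size, u_off, v_off, buf[u_off], buf[v_off])
```

For the 4×4 example the buffer holds 24 bytes (16 luma plus 8 chroma), and pixel (3, 2) reads its chroma from the U11/V11 pair in the second UV row.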

Why NV12 and not RGB?

  Bandwidth:      NV12 is 50% smaller → saves memory bus bandwidth
  GPU friendly:   NVIDIA hardware texture units read NV12 natively
  ISP native:     hardware ISPs output NV12 directly (no conversion needed)
  Neural net:     openpilot supercombo takes YUV input directly
  Video encode:   H.264/H.265 encoders expect YUV420 input
  Cache friendly:  Y plane accessed sequentially → good cache behavior

  RGB is expensive:
    3 bytes per pixel, 2× the bandwidth of NV12
    Conversion from NV12→RGB costs CPU/GPU cycles
    Many embedded inference pipelines avoid RGB entirely

Typical ADAS SoC data path:

  Sensor → MIPI CSI-2 → Hardware ISP → NV12 in DRAM
                                         ├──→ GPU/DLA (inference)
                                         ├──→ H.265 encoder (recording)
                                         └──→ Display (preview)

  The ISP output goes everywhere. No CPU touches it.
  This is why ISP quality directly determines system quality.

Why Each Stage Exists (Summary)

Every stage corrects a specific physical reality. Skip any one and the image degrades in a specific, predictable way:

  Stage                Fixes                        If wrong, neural net sees...
  ──────────────────────────────────────────────────────────────────────────────
  Linearization        sensor physics               wrong brightness relationships
  White Balance        lighting color variation      color-shifted objects (blue cars, yellow roads)
  Demosaic             Bayer sampling limitation     zipper edges, moiré false patterns
  CCM                  sensor spectral mismatch      wrong colors (red looks orange)
  Brightness/Contrast  scene dynamic range           crushed shadows, blown highlights
  Lens Shading         optical brightness falloff    dark edges, inconsistent features
  NV12 conversion      compute efficiency            (correct by construction)

ADAS / Embedded AI Perspective

Neural networks require:

  • Stable brightness frame-to-frame
  • Stable colors across lighting conditions
  • Stable contrast (no flickering exposure)
  • Minimal noise in dark regions

A well-tuned ISP provides these properties. It is the input contract for perception.

  Bad ISP + good neural net  →  unreliable perception  ✗
  Good ISP + good neural net →  robust perception      ✓

  The ISP is not optional. It is the foundation.

Jetson / Embedded Connection

On most ADAS SoCs, the ISP is a dedicated hardware block, not software. FPGA platforms such as the KV260 implement it in programmable logic instead.

  Platform              ISP Block            Notes
  ─────────────────────────────────────────────────────────────
  NVIDIA Jetson Orin    VI/ISP unit          12-bit, HDR, up to 16 cameras,
                                             Bayer → NV12, accessed via libargus
  Qualcomm Snapdragon   Spectra ISP          14-bit, triple ISP (3 cameras),
                                             HDR, multi-frame noise reduction
  Texas Instruments     VPAC                 12-bit, lens distortion correction,
    TDA4                                     DMA to C7x DSP
  Xilinx Zynq (KV260)  Soft ISP on PL       implement in HLS — full control
  Mobileye EyeQ        Integrated ISP       proprietary, not configurable

  Jetson data path:
    Sensor → MIPI CSI-2 → VI → ISP → NV12 → NVMM buffer
                                              ├──→ CUDA / TensorRT (inference)
                                              ├──→ NVENC (H.265 recording)
                                              └──→ nvvidconv (display)

  Same physics, dedicated silicon. Zero CPU involvement.

2.1 Color Spaces

import cv2
import numpy as np

img_bgr = cv2.imread('image.jpg')           # OpenCV reads as BGR

img_rgb  = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
img_gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
img_hsv  = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
img_lab  = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)

# HSV is useful for color-based segmentation (hue is color, S=saturation, V=brightness)
# LAB is perceptually uniform — good for distance-based color similarity
# Gray reduces to 1 channel — reduces computation for non-color tasks

2.2 Filtering

Spatial Filters

# Gaussian blur — smooths noise
blurred = cv2.GaussianBlur(img_gray, ksize=(5, 5), sigmaX=1.5)

# Median blur — removes salt-and-pepper noise (edge-preserving)
median = cv2.medianBlur(img_gray, ksize=5)

# Bilateral filter — edge-preserving smoothing (keeps sharp edges, blurs flat regions)
bilateral = cv2.bilateralFilter(img_bgr, d=9, sigmaColor=75, sigmaSpace=75)

# Sharpening via custom kernel
kernel_sharpen = np.array([[ 0, -1,  0],
                            [-1,  5, -1],
                            [ 0, -1,  0]])
sharpened = cv2.filter2D(img_bgr, -1, kernel_sharpen)

Adaptive Filtering

# Adaptive threshold — threshold varies per region (handles uneven lighting)
adaptive_thresh = cv2.adaptiveThreshold(
    img_gray,
    maxValue=255,
    adaptiveMethod=cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    thresholdType=cv2.THRESH_BINARY,
    blockSize=11,   # neighborhood size (must be odd)
    C=2             # constant subtracted from the Gaussian-weighted mean
)

# CLAHE — Contrast Limited Adaptive Histogram Equalization
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img_gray)

Frequency Domain Filtering

import numpy as np

def frequency_domain_filter(img_gray: np.ndarray, low_pass: bool = True,
                             radius: int = 30) -> np.ndarray:
    """Apply low-pass or high-pass filter in frequency domain."""
    rows, cols = img_gray.shape
    crow, ccol = rows // 2, cols // 2

    # Fourier transform + shift DC to center
    dft = np.fft.fft2(img_gray.astype(np.float32))
    dft_shift = np.fft.fftshift(dft)

    # Create circular mask
    Y, X = np.ogrid[:rows, :cols]
    dist = np.sqrt((X - ccol)**2 + (Y - crow)**2)
    mask = (dist <= radius).astype(np.float32)
    if not low_pass:
        mask = 1 - mask   # invert for high-pass

    # Apply mask, inverse FFT
    filtered = np.fft.ifftshift(dft_shift * mask)
    result = np.fft.ifft2(filtered)
    return np.abs(result).clip(0, 255).astype(np.uint8)

# Low-pass: removes high-freq noise (blurring effect)
# High-pass: keeps edges and fine detail only (add it back to the original to sharpen)

Wavelet Transform

# pip install PyWavelets
import numpy as np
import pywt

def wavelet_denoise(img_gray: np.ndarray, wavelet: str = 'db1',
                    level: int = 3, threshold: float = 20.0) -> np.ndarray:
    """Denoise image using wavelet soft-thresholding."""
    coeffs = pywt.wavedec2(img_gray.astype(float), wavelet, level=level)

    # Soft-threshold all detail coefficients (not approximation)
    new_coeffs = [coeffs[0]]  # keep approximation
    for detail_tuple in coeffs[1:]:
        new_coeffs.append(tuple(
            pywt.threshold(d, threshold, mode='soft') for d in detail_tuple
        ))

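    denoised = pywt.waverec2(new_coeffs, wavelet)
    # waverec2 can return one extra row/column for odd-sized inputs; crop to match
    denoised = denoised[:img_gray.shape[0], :img_gray.shape[1]]
    return denoised.clip(0, 255).astype(np.uint8)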

2.3 Edge Detection

# Canny edge detector — gradient magnitude + non-max suppression + hysteresis
edges = cv2.Canny(img_gray, threshold1=50, threshold2=150)

# Sobel — gradient in X and Y separately
sobel_x = cv2.Sobel(img_gray, cv2.CV_64F, dx=1, dy=0, ksize=3)
sobel_y = cv2.Sobel(img_gray, cv2.CV_64F, dx=0, dy=1, ksize=3)
magnitude = np.sqrt(sobel_x**2 + sobel_y**2)

# Laplacian — second derivative (finds blobs + edges)
laplacian = cv2.Laplacian(img_gray, cv2.CV_64F)

2.4 Morphological Operations

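# binary_img: 8-bit single-channel mask (0 or 255), here taken from Otsu thresholding
_, binary_img = cv2.threshold(img_gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))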

erosion  = cv2.erode(binary_img, kernel, iterations=1)   # shrinks white regions
dilation = cv2.dilate(binary_img, kernel, iterations=1)  # grows white regions
opening  = cv2.morphologyEx(binary_img, cv2.MORPH_OPEN, kernel)   # erosion→dilation (removes noise)
closing  = cv2.morphologyEx(binary_img, cv2.MORPH_CLOSE, kernel)  # dilation→erosion (fills holes)
gradient = cv2.morphologyEx(binary_img, cv2.MORPH_GRADIENT, kernel)  # dilation−erosion = outline

3. Feature Extraction

3.1 Classical Descriptors

# ── ORB (fast, free, rotation/scale invariant) ──────────────────────────────
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(img_gray, mask=None)
# descriptors: (N, 32) uint8 — binary descriptor

# ── SIFT (scale/rotation invariant, patent expired 2020) ────────────────────
sift = cv2.SIFT_create()
kp, des = sift.detectAndCompute(img_gray, None)
# des: (N, 128) float32

# ── Feature Matching ─────────────────────────────────────────────────────────
# des1, des2: ORB descriptors from two images (run detectAndCompute on each)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)   # for ORB
matches = bf.match(des1, des2)
matches = sorted(matches, key=lambda x: x.distance)

# FLANN matcher (faster for large descriptor sets)
FLANN_INDEX_LSH = 6
index_params = dict(algorithm=FLANN_INDEX_LSH, table_number=6, key_size=12)
search_params = dict(checks=50)
flann = cv2.FlannBasedMatcher(index_params, search_params)
matches = flann.knnMatch(des1, des2, k=2)

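# Lowe's ratio test: filter ambiguous matches
# (LSH can return fewer than 2 neighbors for a query, so check the length first)
good = [p[0] for p in matches
        if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]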

3.2 Texture Features

from skimage.feature import local_binary_pattern, graycomatrix, graycoprops

# LBP — Local Binary Patterns (rotation-invariant texture)
radius = 3
n_points = 8 * radius
lbp = local_binary_pattern(img_gray, n_points, radius, method='uniform')
hist, _ = np.histogram(lbp.ravel(), bins=n_points + 2,
                       range=(0, n_points + 2), density=True)

# GLCM — Gray-Level Co-occurrence Matrix
glcm = graycomatrix(img_gray, distances=[1, 3], angles=[0, np.pi/4, np.pi/2],
                    levels=256, symmetric=True, normed=True)
contrast    = graycoprops(glcm, 'contrast').mean()
homogeneity = graycoprops(glcm, 'homogeneity').mean()
energy      = graycoprops(glcm, 'energy').mean()
correlation = graycoprops(glcm, 'correlation').mean()

3.3 HOG — Histogram of Oriented Gradients

from skimage.feature import hog

features, hog_image = hog(
    img_gray,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    visualize=True,
    channel_axis=None
)
# features: 1D vector describing gradient orientation histogram
# Classic use: pedestrian detection with SVM classifier

4. Image Segmentation

4.1 Threshold-Based

# Global Otsu threshold — automatic threshold selection
_, otsu = cv2.threshold(img_gray, 0, 255,
                        cv2.THRESH_BINARY + cv2.THRESH_OTSU)

4.2 Watershed

def watershed_segment(img_bgr: np.ndarray) -> np.ndarray:
    """Segment touching objects using Watershed algorithm."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Remove noise
    kernel = np.ones((3, 3), np.uint8)
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=2)

    # Sure background (dilated)
    sure_bg = cv2.dilate(opening, kernel, iterations=3)

    # Sure foreground (via distance transform)
    dist = cv2.distanceTransform(opening, cv2.DIST_L2, 5)
    _, sure_fg = cv2.threshold(dist, 0.7 * dist.max(), 255, 0)
    sure_fg = sure_fg.astype(np.uint8)

    # Unknown region
    unknown = cv2.subtract(sure_bg, sure_fg)

    # Markers for watershed
    _, markers = cv2.connectedComponents(sure_fg)
    markers += 1
    markers[unknown == 255] = 0

    markers = cv2.watershed(img_bgr, markers)
    img_bgr[markers == -1] = [0, 0, 255]   # boundaries in red
    return markers

4.3 Semantic Segmentation

Semantic segmentation assigns a class label to every pixel in the image. Unlike object detection (bounding boxes) or instance segmentation (per-object masks), semantic segmentation treats all instances of a class as one region.

Input:  RGB image (H, W, 3)
Output: label map  (H, W)    — each pixel = class index (0=background, 1=road, 2=car …)

Architectures Overview

FCN (2015)           — first fully-convolutional net; learned (transposed-conv) upsampling
SegNet (2016)        — encoder-decoder with pooling indices for upsampling
U-Net (2015)         — encoder-decoder with skip connections; standard for medical/satellite
PSPNet (2017)        — pyramid pooling module captures multi-scale context
DeepLab v3+ (2018)   — atrous (dilated) convolutions + ASPP + decoder; PASCAL/Cityscapes SOTA
SegFormer (2021)     — transformer encoder + lightweight MLP decoder; strong on ADE20K
Mask2Former (2022)   — unified architecture for semantic/instance/panoptic

Architecture Detail: DeepLab v3+

Input image
Encoder (ResNet/MobileNet backbone with dilated conv)
ASPP — Atrous Spatial Pyramid Pooling
  ├── 1×1 conv
  ├── dilated conv rate=6
  ├── dilated conv rate=12
  ├── dilated conv rate=18
  └── global average pooling
  → concatenate → 1×1 conv → features (H/16, W/16, 256)
Decoder
  ├── bilinear upsample ×4
  ├── concat with low-level encoder features (H/4 resolution)
  └── 3×3 conv → 1×1 conv → (H/4, W/4, num_classes)
Bilinear upsample ×4 → logits (H, W, num_classes)
argmax → per-pixel class label
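
The ASPP dilation rates trade parameters for context: a 3×3 kernel with dilation r spans k + (k − 1)(r − 1) input pixels per side while keeping nine weights. The helper names below are illustrative; the arithmetic follows from the definition of dilated convolution.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span (in input pixels) covered by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def output_stride_shape(h: int, w: int, stride: int = 16) -> tuple:
    """Spatial size of the encoder feature map for a given output stride."""
    return h // stride, w // stride

aspp_spans = {rate: effective_kernel_size(3, rate) for rate in (6, 12, 18)}
feat_h, feat_w = output_stride_shape(512, 1024)
print(aspp_spans, (feat_h, feat_w))
```

On a 512×1024 crop the encoder output is 32×64, so the rate-18 branch (span 37) already covers more than the full feature-map height, while the rate-6 branch looks at local context.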

segmentation_models_pytorch — The Standard Library

# pip install segmentation-models-pytorch
import segmentation_models_pytorch as smp
import torch
import torch.nn as nn

# ── Build model ──────────────────────────────────────────────────────────────
model = smp.DeepLabV3Plus(
    encoder_name='resnet50',        # backbone: resnet18/34/50, efficientnet-b*, mit_b*
    encoder_weights='imagenet',     # pretrained weights
    in_channels=3,
    classes=19,                     # Cityscapes has 19 evaluation classes
)

# Other architectures (same API):
# smp.Unet(...)         — classic U-Net (best for small datasets)
# smp.FPN(...)          — Feature Pyramid Network decoder
# smp.PSPNet(...)       — Pyramid Scene Parsing
# smp.Segformer(...)    — transformer encoder + MLP decoder

# ── Loss functions ────────────────────────────────────────────────────────────
# smp provides standard losses that work with logits (no sigmoid/softmax needed)
dice_loss = smp.losses.DiceLoss(mode='multiclass', ignore_index=255)
ce_loss   = nn.CrossEntropyLoss(ignore_index=255)   # 255 = void/ignore label

def combined_loss(logits, targets):
    return 0.5 * ce_loss(logits, targets) + 0.5 * dice_loss(logits, targets)

# ── Training step ─────────────────────────────────────────────────────────────
model = model.cuda()         # move weights to the GPU before building the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
# total_iters counts scheduler.step() calls, i.e. epochs here
scheduler = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=100, power=0.9)

for epoch in range(100):
    model.train()
    for images, masks in train_loader:   # train_loader: DataLoader of (image, mask) batches
        images = images.cuda()   # (B, 3, H, W) float32
        masks  = masks.cuda()    # (B, H, W) long, class indices

        logits = model(images)   # (B, num_classes, H, W)
        loss   = combined_loss(logits, masks)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()             # once per epoch

# ── Inference ─────────────────────────────────────────────────────────────────
model.eval()
with torch.no_grad():
    logits = model(images)                  # (B, C, H, W)
    probs  = torch.softmax(logits, dim=1)   # (B, C, H, W)
    preds  = probs.argmax(dim=1)            # (B, H, W)

Metrics: mIoU and Pixel Accuracy

mIoU (mean Intersection-over-Union) is the standard metric. It is computed per-class then averaged.

import torch

def compute_miou(preds: torch.Tensor, targets: torch.Tensor,
                 num_classes: int, ignore_index: int = 255) -> dict:
    """
    Compute per-class IoU and mIoU.

    Args:
        preds:   (B, H, W) long — predicted class per pixel
        targets: (B, H, W) long — ground truth class per pixel
    Returns:
        dict with 'miou', 'per_class_iou', 'pixel_acc'
    """
    mask = targets != ignore_index
    preds   = preds[mask]
    targets = targets[mask]

    per_class_iou = []
    for cls in range(num_classes):
        pred_cls   = preds == cls
        target_cls = targets == cls

        intersection = (pred_cls & target_cls).sum().float()
        union        = (pred_cls | target_cls).sum().float()

        if union == 0:
            per_class_iou.append(float('nan'))   # class not present
        else:
            per_class_iou.append((intersection / union).item())

    valid_iou = [v for v in per_class_iou if not torch.isnan(torch.tensor(v))]
    miou = sum(valid_iou) / len(valid_iou) if valid_iou else 0.0
    pixel_acc = (preds == targets).float().mean().item()

    return {
        'miou': miou,
        'per_class_iou': per_class_iou,
        'pixel_acc': pixel_acc,
    }

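# Usage (preds from the inference step above, masks = ground-truth labels)
metrics = compute_miou(preds, masks, num_classes=19)
print(f"mIoU: {metrics['miou']:.4f}  |  Pixel Acc: {metrics['pixel_acc']:.4f}")

Common mIoU benchmarks (approximate; results vary with backbone and test-time settings):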
  Cityscapes val:  DeepLabV3+(R101) = 81.3%
                   SegFormer(B5)    = 84.0%
  ADE20K val:      SegFormer(B5)    = 51.8%
  PASCAL VOC:      DeepLabV3+       = 89.0%
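
Averaging per-batch mIoU can differ from the dataset-level value, because a class that appears in only a few pixels of one batch counts as much as a class that fills another. A common alternative is to accumulate a confusion matrix over the whole validation set and compute IoU once at the end. The sketch below uses plain Python lists so it runs without dependencies; `update_confusion` and `miou_from_confusion` are illustrative names, not part of any library.

```python
def update_confusion(conf, preds, targets, ignore_index=255):
    """Accumulate flat lists of predicted / true class IDs into conf[true][pred]."""
    for p, t in zip(preds, targets):
        if t != ignore_index:
            conf[t][p] += 1
    return conf


def miou_from_confusion(conf):
    """Per-class IoU = TP / (TP + FP + FN); classes that never occur are skipped."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                        # true c, predicted something else
        fp = sum(conf[r][c] for r in range(n)) - tp   # predicted c, truly something else
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


num_classes = 3
conf = [[0] * num_classes for _ in range(num_classes)]
# Two "batches" of flattened pixel labels; 255 marks an ignored pixel
update_confusion(conf, preds=[0, 0, 1, 1], targets=[0, 1, 1, 255])
update_confusion(conf, preds=[2, 2, 0],    targets=[2, 0, 0])
print(f"dataset mIoU: {miou_from_confusion(conf):.3f}")
```

With tensors, the same accumulation is usually done with a single bincount over `target * num_classes + pred`, but the arithmetic is identical.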

YOLOv8-seg — Semantic + Instance in One Model

from ultralytics import YOLO
import numpy as np
import cv2

# YOLOv8-seg does instance segmentation: one mask per detected object.
# For a semantic map (e.g. ADAS road segmentation), paint each instance mask with its class ID, as below.
model = YOLO('yolov8n-seg.pt')   # n/s/m/l/x variants

results = model('frame.jpg', conf=0.4)

for r in results:
    if r.masks is not None:
        # masks.data: (N_instances, h, w) float32 at inference resolution
        # (not necessarily the original image size), values 0–1
        masks = r.masks.data.cpu().numpy()
        classes = r.boxes.cls.cpu().numpy().astype(int)

        # Build semantic map by projecting instance masks
        H, W = r.orig_shape
        semantic_map = np.zeros((H, W), dtype=np.uint8)
        for mask, cls_id in zip(masks, classes):
            binary = (cv2.resize(mask, (W, H)) > 0.5)
            semantic_map[binary] = cls_id + 1   # 0 = background

# Fine-tune on custom segmentation dataset
model = YOLO('yolov8n-seg.pt')
model.train(
    data='seg_dataset.yaml',   # same YAML layout as detection; label files hold polygons instead of boxes
    epochs=100,
    imgsz=640,
    batch=8,
    device='cuda',
)

Dataset Format for Segmentation Training (YOLO)

# seg_dataset.yaml
path: /data/road_seg
train: images/train
val:   images/val

nc: 5
names:
  0: road
  1: sidewalk
  2: car
  3: person
  4: vegetation

# Polygon mask label (YOLO seg format)
# labels/frame001.txt
# class_id  x1 y1  x2 y2  x3 y3  ...  (normalized polygon vertices)
# The "←" notes are annotations for this guide; real label files contain only numbers.
0  0.10 0.95  0.45 0.55  0.90 0.55  0.95 0.95   ← road polygon
2  0.30 0.40  0.45 0.40  0.45 0.70  0.30 0.70   ← car bounding polygon
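
Annotation tools usually export polygons in pixel coordinates, so they have to be normalized before they match the label format above. A minimal sketch, assuming vertices arrive as `(x, y)` pixel pairs; `polygon_to_yolo_seg` is a hypothetical helper, not an Ultralytics function.

```python
def polygon_to_yolo_seg(class_id, polygon_px, img_w, img_h):
    """Turn [(x, y), ...] pixel vertices into one YOLO-seg label line."""
    if len(polygon_px) < 3:
        raise ValueError("a polygon needs at least 3 vertices")
    coords = []
    for x, y in polygon_px:
        # Normalize to [0, 1] and clamp vertices that fall slightly outside the image
        coords.append(min(max(x / img_w, 0.0), 1.0))
        coords.append(min(max(y / img_h, 0.0), 1.0))
    return f"{class_id} " + " ".join(f"{c:.6f}" for c in coords)


# Road trapezoid in a 1280x720 frame
line = polygon_to_yolo_seg(0, [(128, 684), (576, 396), (1152, 396), (1216, 684)], 1280, 720)
print(line)
```

Each object gets one such line in the label file that shares its base name with the image.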

ADAS Road Segmentation with Cityscapes Classes

import torch
import cv2
import numpy as np
import segmentation_models_pytorch as smp

# Cityscapes 19-class palette (RGB)
CITYSCAPES_PALETTE = np.array([
    [128, 64, 128],   # 0  road
    [244, 35, 232],   # 1  sidewalk
    [ 70, 70, 70],   # 2  building
    [102, 102, 156],  # 3  wall
    [190, 153, 153],  # 4  fence
    [153, 153, 153],  # 5  pole
    [250, 170,  30],  # 6  traffic light
    [220, 220,   0],  # 7  traffic sign
    [107, 142,  35],  # 8  vegetation
    [152, 251, 152],  # 9  terrain
    [ 70, 130, 180],  # 10 sky
    [220,  20,  60],  # 11 person
    [255,   0,   0],  # 12 rider
    [  0,   0, 142],  # 13 car
    [  0,   0,  70],  # 14 truck
    [  0,  60, 100],  # 15 bus
    [  0,  80, 100],  # 16 train
    [  0,   0, 230],  # 17 motorcycle
    [119,  11,  32],  # 18 bicycle
], dtype=np.uint8)

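# These are the ImageNet mean/std (RGB), which is what the ImageNet-pretrained encoder below expects
CITYSCAPES_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
CITYSCAPES_STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)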


def preprocess_frame(frame_bgr: np.ndarray, size=(1024, 512)) -> torch.Tensor:
    """Resize and normalize a BGR frame for Cityscapes-trained model."""
    img = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, size)
    img = img.astype(np.float32) / 255.0
    img = (img - CITYSCAPES_MEAN) / CITYSCAPES_STD
    return torch.from_numpy(img.transpose(2, 0, 1)).unsqueeze(0).float()


def colorize(label_map: np.ndarray) -> np.ndarray:
    """Map (H, W) class indices → (H, W, 3) RGB color image."""
    color = np.zeros((*label_map.shape, 3), dtype=np.uint8)
    for cls_id, rgb in enumerate(CITYSCAPES_PALETTE):
        color[label_map == cls_id] = rgb
    return color


def segment_frame(model, frame_bgr: np.ndarray, device='cuda') -> tuple[np.ndarray, np.ndarray]:
    """Run the full segmentation pipeline on one frame. Returns (BGR overlay, label map)."""
    H, W = frame_bgr.shape[:2]
    inp = preprocess_frame(frame_bgr).to(device)

    with torch.no_grad():
        logits = model(inp)                        # (1, 19, h, w)
        pred   = logits.argmax(1).squeeze().cpu().numpy()  # (h, w)

    # Upsample prediction to original resolution
    pred_full = cv2.resize(pred.astype(np.uint8), (W, H),
                           interpolation=cv2.INTER_NEAREST)
    color_mask = colorize(pred_full)

    # Blend with original frame
    overlay = cv2.addWeighted(frame_bgr, 0.5,
                              cv2.cvtColor(color_mask, cv2.COLOR_RGB2BGR), 0.5, 0)
    return overlay, pred_full


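# Build the model. encoder_weights='imagenet' only initializes the encoder;
# the decoder is random until you load weights fine-tuned on Cityscapes
# (via model.load_state_dict), so do that before trusting the output.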
model = smp.DeepLabV3Plus(
    encoder_name='resnet50',
    encoder_weights='imagenet',
    classes=19,
).cuda().eval()

# Run on video
cap = cv2.VideoCapture('dashcam.mp4')
while True:
    ret, frame = cap.read()
    if not ret:
        break
    overlay, pred = segment_frame(model, frame)
    road_pixels = (pred == 0).sum()
    total_pixels = pred.size
    print(f"Road coverage: {100 * road_pixels / total_pixels:.1f}%")
    cv2.imshow('Semantic Segmentation', overlay)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Quick Comparison: Semantic vs Instance vs Panoptic

Semantic segmentation:
  Output: (H, W) label map — every pixel has one class
  All cars = same "car" region
  Models: DeepLab, SegFormer, FCN

Instance segmentation:
  Output: N masks, one per object — same class objects get separate masks
  Car A mask, Car B mask, Car C mask
  Models: Mask R-CNN, YOLOv8-seg, SOLO, SOLOv2

Panoptic segmentation:
  Output: combines both — stuff (road, sky) as semantic, things (car, person) as instances
  Complete scene understanding
  Models: Panoptic FPN, Mask2Former, DETR (panoptic head)

ADAS typically needs:
  Drivable area      → semantic (road class)
  Obstacle avoidance → instance (each car/pedestrian separately)
  Full understanding → panoptic
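
To make the panoptic idea concrete, one common way to store the result in a single integer map is `class_id * 1000 + instance_index` for "things" and the plain `class_id` for "stuff" (Cityscapes instance ID images use this kind of encoding). The sketch below merges a semantic map with instance masks under that assumption; `merge_panoptic` is an illustrative name and the inputs are plain nested lists.

```python
def merge_panoptic(semantic, instances, thing_classes):
    """
    semantic:  H x W nested list of class IDs
    instances: list of (class_id, mask), mask being an H x W nested list of 0/1
    Returns an H x W map: class_id for stuff, class_id * 1000 + k for the k-th thing.
    """
    panoptic = [row[:] for row in semantic]           # start from the semantic labels
    counters = {}
    for class_id, mask in instances:
        if class_id not in thing_classes:
            continue                                   # stuff stays purely semantic
        counters[class_id] = counters.get(class_id, 0) + 1
        segment_id = class_id * 1000 + counters[class_id]
        for y, row in enumerate(mask):
            for x, inside in enumerate(row):
                if inside:
                    panoptic[y][x] = segment_id
    return panoptic


ROAD, CAR = 0, 13
semantic = [[ROAD, CAR, CAR, CAR],
            [ROAD, ROAD, ROAD, ROAD]]
car_a = [[0, 1, 0, 0], [0, 0, 0, 0]]
car_b = [[0, 0, 1, 1], [0, 0, 0, 0]]
panoptic = merge_panoptic(semantic, [(CAR, car_a), (CAR, car_b)], thing_classes={CAR})
print(panoptic)
```

Road pixels keep their class ID, while the two cars receive distinct segment IDs even though they share a class.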

4.4 Instance Segmentation with Mask R-CNN (PyTorch)

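import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# weights='DEFAULT' loads the COCO-pretrained checkpoint
# (torchvision >= 0.13; the older pretrained=True flag is deprecated)
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights='DEFAULT')
model.eval()

# img_rgb: (H, W, 3) uint8 RGB image, e.g. cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
img_tensor = to_tensor(img_rgb).unsqueeze(0)   # (1, 3, H, W) float32 in [0, 1]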

with torch.no_grad():
    predictions = model(img_tensor)[0]

# predictions keys: 'boxes', 'labels', 'scores', 'masks'
for i, score in enumerate(predictions['scores']):
    if score > 0.5:
        box  = predictions['boxes'][i].int().numpy()    # [x1, y1, x2, y2]
        mask = predictions['masks'][i, 0].numpy() > 0.5  # bool mask (H, W)
        label = predictions['labels'][i].item()

5. Object Detection

5.1 Detection Paradigms

Two-stage (high accuracy, slower):
  Region proposals → classify each proposal
  Examples: Faster R-CNN, Mask R-CNN

One-stage (real-time, slightly lower accuracy):
  Grid/anchor predictions in single pass
  Examples: YOLO, SSD, RetinaNet, FCOS

Anchor-free (modern, simpler):
  No pre-defined anchor boxes
  Examples: CenterNet, FCOS, RT-DETR

5.2 YOLO — You Only Look Once

# YOLOv8 — Ultralytics (easiest modern YOLO)
# pip install ultralytics

from ultralytics import YOLO

# Load pretrained model
model = YOLO('yolov8n.pt')   # nano (fastest), also: s, m, l, x

# Inference
results = model('image.jpg', conf=0.4, iou=0.45)

for r in results:
    for box in r.boxes:
        xyxy  = box.xyxy[0].cpu().numpy()   # [x1, y1, x2, y2]; .cpu() is needed when inference ran on GPU
        conf  = float(box.conf)
        cls   = int(box.cls)
        label = model.names[cls]
        print(f"{label}: {conf:.2f} at {xyxy}")

# Fine-tune on custom dataset
model.train(
    data='custom_dataset.yaml',
    epochs=100,
    imgsz=640,
    batch=16,
    device='cuda',
    patience=20,        # early stopping
    optimizer='AdamW',
    lr0=0.001
)

# Export to ONNX / TensorRT
model.export(format='onnx', opset=12, simplify=True)
model.export(format='engine', device=0, half=True)  # TensorRT FP16

5.3 Custom YOLO Dataset YAML

# custom_dataset.yaml
path: /data/my_dataset      # root path
train: images/train
val: images/val
test: images/test

nc: 3                       # number of classes
names:
  0: forklift
  1: person
  2: pallet

5.4 Detection Metrics

Metric       Definition
──────────────────────────────────────────────────────────────────────
IoU          Intersection over Union: box overlap quality
Precision    TP / (TP + FP): of all detections, how many are correct
Recall       TP / (TP + FN): of all GT objects, how many were found
AP           Area under the Precision-Recall curve (per class)
mAP@50       Mean AP at IoU threshold 0.5
mAP@50:95    Mean AP averaged over IoU 0.5–0.95 (COCO standard)

# Evaluate with Ultralytics
metrics = model.val(data='custom_dataset.yaml')
print(metrics.box.map)      # mAP@50:95
print(metrics.box.map50)    # mAP@50
print(metrics.box.mp)       # mean precision
print(metrics.box.mr)       # mean recall
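
The definitions in the table can be traced by hand on a small example. The sketch below computes box IoU and then precision and recall at a single IoU threshold, using greedy matching in descending score order; this is a simplified stand-in for what evaluation toolkits do, and `box_iou` / `precision_recall` are illustrative names.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections, ground_truth, iou_thr=0.5):
    """detections: [(box, score)]. Each ground-truth box can be matched at most once."""
    matched = set()
    tp = 0
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(ground_truth):
            if j in matched:
                continue
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
    fp = len(detections) - tp          # detections with no matching ground truth
    fn = len(ground_truth) - tp        # ground-truth objects that were missed
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truth else 0.0
    return precision, recall


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [((1, 1, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)]
print(precision_recall(dets, gt_boxes))
```

Sweeping the score threshold and recording (precision, recall) at each step gives the curve whose area is AP.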

6. Object Tracking

6.1 Classic Trackers in OpenCV

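import cv2

cap = cv2.VideoCapture('video.mp4')

# Create tracker. Options: CSRT (accurate), KCF (fast), MIL.
# CSRT and KCF typically require opencv-contrib-python; some 4.x builds
# expose them as cv2.legacy.TrackerCSRT_create() instead.
tracker = cv2.TrackerCSRT_create()

# Initialize with first frame + bounding box
ret, frame = cap.read()
bbox = cv2.selectROI('Select Object', frame, fromCenter=False)
tracker.init(frame, bbox)

while True:
    ret, frame = cap.read()
    if not ret:
        break                                   # end of video
    ok, bbox = tracker.update(frame)
    if ok:
        x, y, w, h = [int(v) for v in bbox]
        cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), 2)
    else:
        cv2.putText(frame, 'Tracking failure', (50, 80),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 0, 255), 2)
    cv2.imshow('Tracking', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()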

6.2 ByteTrack — Multi-Object Tracking

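ByteTrack is a widely used real-time multi-object tracker. It works on per-frame boxes and scores, so it pairs with any detector.

# Only needed for the standalone ByteTrack repo; Ultralytics bundles its own implementation
pip install lap  # linear assignment
pip install git+https://github.com/ifzhang/ByteTrack.git

from ultralytics import YOLO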

# Ultralytics has ByteTrack built-in
model = YOLO('yolov8n.pt')

# Track across video frames
results = model.track(
    source='video.mp4',
    tracker='bytetrack.yaml',
    persist=True,       # keep track IDs between frames
    conf=0.3,
    iou=0.5
)

for r in results:
    if r.boxes.id is not None:
        track_ids = r.boxes.id.int().tolist()
        boxes = r.boxes.xyxy.tolist()
        for tid, box in zip(track_ids, boxes):
            print(f"Track {tid}: {box}")
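
ByteTrack's central idea is to associate confident detections with existing tracks first, then give the unmatched tracks a second chance against low-score detections that most trackers would discard. The sketch below illustrates only that two-stage association. It swaps in greedy IoU matching where the real method uses Kalman-filter prediction and linear assignment, and the thresholds and function names are assumptions chosen for illustration.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, dets, iou_thr):
    """Return [(track_id, det_index)] using greedy best-IoU matching."""
    pairs, used = [], set()
    for tid, tbox in tracks.items():
        best, best_i = iou_thr, None
        for i, (dbox, _score) in enumerate(dets):
            if i not in used and iou(tbox, dbox) >= best:
                best, best_i = iou(tbox, dbox), i
        if best_i is not None:
            used.add(best_i)
            pairs.append((tid, best_i))
    return pairs


def byte_associate(tracks, detections, high_thr=0.5, low_thr=0.1, iou_thr=0.3):
    """tracks: {track_id: box}; detections: [(box, score)]. Returns updated {track_id: box}."""
    high = [d for d in detections if d[1] >= high_thr]
    low = [d for d in detections if low_thr <= d[1] < high_thr]

    updated = {}
    # Stage 1: existing tracks vs confident detections
    first = greedy_match(tracks, high, iou_thr)
    for tid, i in first:
        updated[tid] = high[i][0]
    # Stage 2: leftover tracks vs low-score detections (e.g. partly occluded objects)
    leftover = {t: b for t, b in tracks.items() if t not in updated}
    for tid, i in greedy_match(leftover, low, iou_thr):
        updated[tid] = low[i][0]
    # Confident detections that matched nothing start new tracks
    matched_high = {i for _, i in first}
    next_id = max(tracks, default=0) + 1
    for i, (box, _score) in enumerate(high):
        if i not in matched_high:
            updated[next_id] = box
            next_id += 1
    return updated


tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
detections = [((1, 0, 11, 10), 0.9), ((51, 50, 61, 60), 0.2), ((100, 100, 110, 110), 0.8)]
print(byte_associate(tracks, detections))
```

Track 2 survives here only because of the second stage: its detection scored 0.2, below the usual confidence cut.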

6.3 Optical Flow

# Lucas-Kanade sparse optical flow — tracks specific points
feature_params = dict(maxCorners=100, qualityLevel=0.3,
                      minDistance=7, blockSize=7)
lk_params = dict(winSize=(15, 15), maxLevel=2,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03))

ret, old_frame = cap.read()
old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)
p0 = cv2.goodFeaturesToTrack(old_gray, mask=None, **feature_params)

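while True:
    ret, frame = cap.read()
    if not ret:
        break                                   # end of video
    frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    p1, st, err = cv2.calcOpticalFlowPyrLK(old_gray, frame_gray, p0, None, **lk_params)
    if p1 is None:
        break                                   # no points could be tracked
    good_new = p1[st == 1]                      # points found in the new frame
    good_old = p0[st == 1]                      # their positions in the previous frame
    old_gray = frame_gray.copy()
    p0 = good_new.reshape(-1, 1, 2)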

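# Farneback dense optical flow: flow for every pixel
# old_gray and frame_gray must be two consecutive frames; inside the loop above,
# compute this before old_gray is overwritten with the current frame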
flow = cv2.calcOpticalFlowFarneback(
    old_gray, frame_gray, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0
)
mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])

7. 3D Vision

7.1 Camera Model and Calibration

# Checkerboard calibration
import cv2
import numpy as np

CHECKERBOARD = (9, 6)   # inner corners
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)

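# 3D corner positions in the board frame, in units of one square;
# multiply by the physical square size (e.g. in meters) to get metric translations
objp = np.zeros((CHECKERBOARD[0] * CHECKERBOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:CHECKERBOARD[0], 0:CHECKERBOARD[1]].T.reshape(-1, 2)

# calibration_images: list of paths to checkerboard photos taken from varied angles
objpoints, imgpoints = [], []
for fname in calibration_images: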
    img = cv2.imread(fname)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ret, corners = cv2.findChessboardCorners(gray, CHECKERBOARD, None)
    if ret:
        corners2 = cv2.cornerSubPix(gray, corners, (11,11), (-1,-1), criteria)
        objpoints.append(objp)
        imgpoints.append(corners2)

ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    objpoints, imgpoints, gray.shape[::-1], None, None
)
# ret: RMS reprojection error in pixels (lower is better)
# K: 3×3 intrinsic matrix [[fx,0,cx],[0,fy,cy],[0,0,1]]
# dist: [k1,k2,p1,p2,k3] distortion coefficients (k = radial, p = tangential)

# Undistort an image
h, w = img.shape[:2]
newK, roi = cv2.getOptimalNewCameraMatrix(K, dist, (w,h), 1, (w,h))
undistorted = cv2.undistort(img, K, dist, None, newK)
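
To see what `K` and `dist` mean, it helps to project a single 3D point by hand. The sketch below follows the pinhole model with the radial and tangential distortion terms, using the same coefficient order `[k1, k2, p1, p2, k3]` as above. It is a dependency-free illustration with made-up intrinsics; `project_point` is not an OpenCV function.

```python
def project_point(X, Y, Z, fx, fy, cx, cy, dist=(0.0, 0.0, 0.0, 0.0, 0.0)):
    """Project a camera-frame point (Z > 0) to pixel coordinates (u, v)."""
    k1, k2, p1, p2, k3 = dist
    x, y = X / Z, Y / Z                          # normalized image coordinates
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_d = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return fx * x_d + cx, fy * y_d + cy


# Hypothetical intrinsics for a 1280x720 camera
fx = fy = 1000.0
cx, cy = 640.0, 360.0

ideal = project_point(0.5, 0.0, 2.0, fx, fy, cx, cy)
barrel = project_point(0.5, 0.0, 2.0, fx, fy, cx, cy, dist=(-0.2, 0.0, 0.0, 0.0, 0.0))
print(ideal, barrel)
```

A negative `k1` pulls points toward the image center (barrel distortion), which is why the second projection lands slightly left of the first.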

7.2 Stereo Vision — Depth from Two Cameras

# After calibrating both cameras separately, stereo calibrate
ret, K1, D1, K2, D2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, img_pts_left, img_pts_right,
    K1, D1, K2, D2, gray.shape[::-1],
    flags=cv2.CALIB_FIX_INTRINSIC
)

# Rectify — make epipolar lines horizontal
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
    K1, D1, K2, D2, gray.shape[::-1], R, T
)

# Compute disparity map
stereo = cv2.StereoSGBM_create(
    minDisparity=0, numDisparities=128, blockSize=11,
    P1=8*3*11**2, P2=32*3*11**2,
    disp12MaxDiff=1, uniquenessRatio=15,
    speckleWindowSize=100, speckleRange=32
)
disparity = stereo.compute(left_rect, right_rect).astype(np.float32) / 16.0

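# Convert disparity to depth: depth = (baseline × focal_length) / disparity
# T is in the units of the calibration object points (meters only if objp was scaled by the square size)
baseline = np.linalg.norm(T)
focal_px = P1[0, 0]                   # focal length of the rectified left camera, in pixels
valid = disparity > 0                 # SGBM marks unmatched pixels with negative values
depth = np.full_like(disparity, np.nan)
depth[valid] = baseline * focal_px / disparity[valid]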

# Or use the Q matrix from stereoRectify
points_3D = cv2.reprojectImageTo3D(disparity, Q)  # (H, W, 3) XYZ per pixel
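
The depth formula also explains why stereo accuracy falls off with distance: a fixed disparity error maps to a depth error that grows with the square of depth. A minimal worked example, assuming a hypothetical rig with a 1000 px focal length and a 12 cm baseline; the 1/16 px step mirrors the fixed-point scaling divided out above.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """depth = f * B / d; non-positive disparity means no valid match."""
    if disparity_px <= 0:
        return float('inf')
    return focal_px * baseline_m / disparity_px


def depth_step(depth_m, focal_px, baseline_m, disparity_step_px=1 / 16):
    """Approximate depth change caused by one disparity step: dz = z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_step_px


FOCAL_PX, BASELINE_M = 1000.0, 0.12      # assumed rig parameters

print(depth_from_disparity(12.0, FOCAL_PX, BASELINE_M))     # metres
for z in (10.0, 40.0):
    print(f"at {z:.0f} m one disparity step is about {depth_step(z, FOCAL_PX, BASELINE_M):.3f} m")
```

Quadrupling the distance makes each disparity step sixteen times coarser, which is one reason long-range ADAS depth often relies on a longer baseline or a narrower lens.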

7.3 Pose Estimation (solvePnP)

# Given 3D object points and their 2D image projections, find camera pose
# object_points: (N, 3) float32 — 3D coordinates in object frame
# image_points:  (N, 2) float32 — corresponding 2D pixel coordinates

success, rvec, tvec = cv2.solvePnP(
    object_points, image_points, K, dist,
    flags=cv2.SOLVEPNP_ITERATIVE
)

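# rvec: rotation vector (Rodrigues format) → convert to matrix
R, _ = cv2.Rodrigues(rvec)   # 3×3 rotation matrix
# (R, tvec) map object-frame points into the camera frame: X_cam = R @ X_obj + tvec
# tvec is therefore the object origin expressed in camera coordinates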

# Project 3D points back to image
projected, _ = cv2.projectPoints(object_points, rvec, tvec, K, dist)

# Draw axes on object
cv2.drawFrameAxes(img, K, dist, rvec, tvec, length=0.05)
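
The rotation vector encodes an axis (its direction) and an angle (its length). The sketch below expands it with Rodrigues' formula in plain Python and then recovers the camera position in the object frame, C = -Rᵀt, which is often the quantity actually wanted. It is a dependency-free illustration of the math, not a replacement for `cv2.Rodrigues`.

```python
import math


def rodrigues(rvec):
    """Rotation vector (axis * angle) → 3x3 rotation matrix as nested lists."""
    rx, ry, rz = rvec
    theta = math.sqrt(rx * rx + ry * ry + rz * rz)
    if theta < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = rx / theta, ry / theta, rz / theta      # unit rotation axis
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v,      kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v,      ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


def camera_position(R, tvec):
    """Camera center in object coordinates: C = -R^T t."""
    return [-sum(R[r][col] * tvec[r] for r in range(3)) for col in range(3)]


R_z90 = rodrigues((0.0, 0.0, math.pi / 2))    # 90 degrees about the z axis
print([[round(x, 6) for x in row] for row in R_z90])
print(camera_position(R_z90, (1.0, 0.0, 0.0)))
```

Applying `X_cam = R @ C + t` to the recovered camera position gives the zero vector, which is a quick sanity check on any pose.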

7.4 Multi-Camera ADAS: openpilot Near + Wide Camera

openpilot (comma.ai) is a widely used open-source ADAS stack. It uses two forward-facing cameras with deliberately different fields of view, a design that exposes a basic trade-off in ADAS perception: field of view versus angular resolution.

Why Two Cameras?

Single camera problem:
  Wide FOV (120°) → short focal length → poor resolution at distance
                  → lane lines 50m ahead are only a few pixels wide
                  → lead vehicle at 80m is too small to detect reliably

  Narrow FOV (60°) → long focal length → good long-range resolution
                   → misses adjacent lanes and near-field cut-ins
                   → blind to pedestrians at crossings

Solution: run both simultaneously, fuse their outputs per task

comma 3 / 3X camera layout:
  ┌─────────────────────────────────────────────┐
  │         windshield mount (top center)        │
  │                                              │
  │  [Wide road cam]   [Road cam]   [Driver cam] │
  │   ~120° HFOV       ~60° HFOV   (inward)     │
  │   1.71 mm lens     2.2 mm lens               │
  └─────────────────────────────────────────────┘

Camera Specifications

Property                 Road (Narrow)                             Wide Road
────────────────────────────────────────────────────────────────────────────────────────────────────────
Sensor                   OV8856                                    OV8856
Focal length (approx)    2.2 mm                                    1.71 mm
HFOV                     ~60°                                      ~120°
Native resolution        1928×1208                                 1928×1208
Model input crop         1152×1152                                 1152×1152
Resized for model        512×256 → 128×256                         512×256 → 128×256
Primary use              Lead vehicle, lane geometry, long range   Adjacent lanes, near-field, wider view

Intrinsic Matrix for Each Camera

import numpy as np

# comma 3 road camera intrinsics (approximate — calibrated per device)
# K = [[fx,  0, cx],
#       [ 0, fy, cy],
#       [ 0,  0,  1]]

K_road = np.array([
    [2648.0,    0.0,  964.0],
    [   0.0, 2648.0,  604.0],
    [   0.0,    0.0,    1.0],
], dtype=np.float64)

K_wide = np.array([
    [1036.0,    0.0,  964.0],
    [   0.0, 1036.0,  604.0],
    [   0.0,    0.0,    1.0],
], dtype=np.float64)

# Focal lengths: road (2648 px) vs wide (1036 px)
# Lower fx/fy = shorter focal length = wider field of view
# Both cameras at same image center (cx=964, cy=604) assuming no lens offset

def pixel_to_ray(u: float, v: float, K: np.ndarray) -> np.ndarray:
    """Convert pixel (u, v) to unit direction ray in camera frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    return ray / np.linalg.norm(ray)

# Road camera: pixel to the right of image center (u=1200, v=604)
ray_road = pixel_to_ray(1200, 604, K_road)
angle_road = np.degrees(np.arctan2(ray_road[0], ray_road[2]))
print(f"Road cam angle: {angle_road:.1f}°")   # ~5.1° off-center

# Wide camera: same pixel
ray_wide = pixel_to_ray(1200, 604, K_wide)
angle_wide = np.degrees(np.arctan2(ray_wide[0], ray_wide[2]))
print(f"Wide cam angle: {angle_wide:.1f}°")   # ~12.8° off-center
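
The same intrinsics also quantify the trade-off between the two cameras. Under a pinhole model, the horizontal field of view is 2·atan(W / (2·fx)), and an object of width w at distance Z covers about fx·w / Z pixels. The sketch below applies both formulas to the approximate focal lengths above; the 0.15 m lane-line width is an assumed example value.

```python
import math


def hfov_deg(fx, image_width_px):
    """Horizontal field of view of an ideal pinhole camera."""
    return math.degrees(2 * math.atan((image_width_px / 2) / fx))


def pixels_on_target(object_width_m, distance_m, fx):
    """Approximate image width in pixels of an object facing the camera."""
    return fx * object_width_m / distance_m


FX_ROAD, FX_WIDE, IMAGE_WIDTH = 2648.0, 1036.0, 1928   # from K_road / K_wide above

for name, fx in (("road", FX_ROAD), ("wide", FX_WIDE)):
    print(f"{name}: HFOV {hfov_deg(fx, IMAGE_WIDTH):.1f} deg, "
          f"0.15 m lane line at 50 m spans {pixels_on_target(0.15, 50.0, fx):.1f} px")
```

With these placeholder matrices the pinhole formula gives roughly 40° and 86°, narrower than the nominal ~60° and ~120° figures quoted earlier. That gap is a reminder to treat the matrices as approximations and rely on per-device calibration; very wide lenses are also often better described by a fisheye model than by a pinhole.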

YUV420 Frame Packing — supercombo Input

The supercombo model takes input_imgs of shape (1, 12, 128, 256). These 12 channels encode two consecutive YUV420 frames (temporal context for motion estimation).

YUV420 memory layout for one model-resolution frame (H=256, W=512):
  Y  plane: 256 × 512 = 131072 bytes  (full-res luma)
  U  plane: 128 × 256 =  32768 bytes  (half-res chroma)
  V  plane: 128 × 256 =  32768 bytes  (half-res chroma)
  Total: 196608 bytes per frame

Packed into 6 channels at (128, 256), a pure rearrangement (6 × 128 × 256 = 196608):
  ch 0: Y[0::2, 0::2]   (even rows, even cols)
  ch 1: Y[0::2, 1::2]   (even rows, odd cols)
  ch 2: Y[1::2, 0::2]   (odd rows, even cols)
  ch 3: Y[1::2, 1::2]   (odd rows, odd cols)
  ch 4: U plane (already 128×256)
  ch 5: V plane (already 128×256)

Two frames (t and t-1) → 6 × 2 = 12 channels

import numpy as np
import cv2


def yuv420_to_6ch(yuv420_frame: np.ndarray, out_h: int = 128, out_w: int = 256) -> np.ndarray:
    """
    Convert an I420 (planar YUV420) frame to the 6-channel layout described above.

    Args:
        yuv420_frame: (H * 3 // 2, W) uint8 array in cv2's I420 layout
                      (Y plane, then U plane, then V plane); H and W must be even.
                      For comma 3: H=886 (after crop), W=1152.
        out_h, out_w: spatial size of each output channel (128, 256 for supercombo)

    Returns:
        (6, out_h, out_w) float32 array, values in [0, 1]
    """
    h_full = yuv420_frame.shape[0] * 2 // 3
    w_full = yuv420_frame.shape[1]

    # Split planes on the flat buffer: U and V are each (H/2, W/2), stored contiguously after Y
    flat = yuv420_frame.reshape(-1)
    n_y = h_full * w_full
    n_c = (h_full // 2) * (w_full // 2)
    Y = flat[:n_y].reshape(h_full, w_full)
    U = flat[n_y:n_y + n_c].reshape(h_full // 2, w_full // 2)
    V = flat[n_y + n_c:n_y + 2 * n_c].reshape(h_full // 2, w_full // 2)

    # Bring the frame to model resolution: luma at (2*out_h, 2*out_w), chroma at (out_h, out_w)
    Y = cv2.resize(Y, (2 * out_w, 2 * out_h), interpolation=cv2.INTER_AREA)
    U = cv2.resize(U, (out_w, out_h), interpolation=cv2.INTER_AREA)
    V = cv2.resize(V, (out_w, out_h), interpolation=cv2.INTER_AREA)

    # Split luma into its four 2×2 sampling phases; each is (out_h, out_w)
    channels = np.stack([
        Y[0::2, 0::2], Y[0::2, 1::2],
        Y[1::2, 0::2], Y[1::2, 1::2],
        U, V,
    ], axis=0)                                    # (6, out_h, out_w) uint8
    return channels.astype(np.float32) / 255.0


def build_supercombo_input(frame_t: np.ndarray, frame_t1: np.ndarray) -> np.ndarray:
    """
    Stack two consecutive YUV420 frames into supercombo input_imgs.

    Args:
        frame_t:  current frame YUV420 raw array
        frame_t1: previous frame YUV420 raw array

    Returns:
        (1, 12, 128, 256) float32 — ready for supercombo inference
    """
    ch_t  = yuv420_to_6ch(frame_t)    # (6, 128, 256)
    ch_t1 = yuv420_to_6ch(frame_t1)   # (6, 128, 256)
    stacked = np.concatenate([ch_t, ch_t1], axis=0)   # (12, 128, 256)
    return stacked[np.newaxis]                         # (1, 12, 128, 256)


# Load two consecutive dashcam frames (BGR → YUV420)
frame_bgr_0 = cv2.imread('frame_000.jpg')
frame_bgr_1 = cv2.imread('frame_001.jpg')

yuv_0 = cv2.cvtColor(frame_bgr_0, cv2.COLOR_BGR2YUV_I420)
yuv_1 = cv2.cvtColor(frame_bgr_1, cv2.COLOR_BGR2YUV_I420)

input_imgs = build_supercombo_input(yuv_0, yuv_1)
print(input_imgs.shape)  # (1, 12, 128, 256)
print(input_imgs.min(), input_imgs.max())  # both within [0.0, 1.0]

Road vs Wide: Per-Task Routing

Task                     Primary camera       Reason
──────────────────────────────────────────────────────────────────────
Lane detection (far)     Road (narrow)        High focal length → lane lines at 50m are wider
Lead vehicle distance    Road (narrow)        Accurate size → distance estimation
Adjacent lane cuts       Wide                 FOV covers 2 lanes either side
Near-field pedestrians   Wide                 Pedestrians at <10m outside road cam FOV
Traffic sign reading     Road (narrow)        Signs need text resolution at distance
Blind spot monitoring    Wide                 ~120° catches what road cam misses at sides
Ego-motion estimation    Road (narrow)        Stable horizon → cleaner optical flow

openpilot's supercombo model receives only the road camera frame as input_imgs. The wide camera feeds a separate path (e.g., wide-angle object detection); driver monitoring uses the inward-facing driver camera, not the wide camera. The two outputs are fused in controlsd: the lateral controller uses road-camera-derived lane predictions while wide-angle detections update the obstacle map.

Projecting Between Camera Coordinate Systems

import cv2
import numpy as np


def project_road_to_wide(
    pts_road_cam: np.ndarray,   # (N, 3) 3D points in road camera frame
    R_r2w: np.ndarray,          # (3, 3) rotation: road cam → wide cam
    t_r2w: np.ndarray,          # (3,) translation: road cam → wide cam
    K_wide: np.ndarray,         # (3, 3) wide cam intrinsics
    D_wide: np.ndarray = None,  # (5,) wide cam distortion (optional)
) -> np.ndarray:
    """
    Project 3D points seen in road camera frame into wide camera pixel coordinates.
    Used to cross-validate detections across cameras.

    Returns:
        (N, 2) pixel coordinates in wide camera image
    """
    # Transform points to wide camera frame
    pts_wide_cam = (R_r2w @ pts_road_cam.T).T + t_r2w   # (N, 3)

    # Project to wide image plane
    if D_wide is not None:
        pts_2d, _ = cv2.projectPoints(
            pts_wide_cam.astype(np.float64),
            np.zeros(3), np.zeros(3),   # identity (already in cam frame)
            K_wide, D_wide
        )
        return pts_2d.squeeze()   # (N, 2)
    else:
        # Simple pinhole projection (no distortion)
        z = pts_wide_cam[:, 2:3].clip(0.01, None)
        xy = pts_wide_cam[:, :2] / z
        uv = (K_wide[:2, :2] @ xy.T + K_wide[:2, 2:3]).T
        return uv   # (N, 2)


# Example: project lead vehicle center from road cam 3D into wide cam 2D
# Camera frame convention (OpenCV): x right, y down, z forward
# Lead vehicle 30m ahead, ~0m lateral, at camera height
lead_vehicle_3d = np.array([[0.0, 0.0, 30.0]])   # (N, 3) in road cam frame

# Extrinsic: wide cam is ~5mm to the left, same orientation (approx)
R_r2w = np.eye(3)   # cameras are nearly parallel
# A camera sitting 5mm to the left sees road-frame points shifted +5mm in x
t_r2w = np.array([0.005, 0.0, 0.0])

uv_in_wide = project_road_to_wide(lead_vehicle_3d, R_r2w, t_r2w, K_wide)
print(f"Lead vehicle projects to wide cam pixel: {uv_in_wide}")

Camera-to-Ground Homography (Flat Road Assumption)

def camera_to_ground_homography(K: np.ndarray,
                                 camera_height_m: float = 1.22,
                                 pitch_deg: float = 0.0) -> np.ndarray:
    """
    Compute homography mapping image pixels → ground plane (Y=0) coordinates.

    World frame: X right, Y down, Z forward, origin on the road directly below
    the camera, so the camera center is at (0, -camera_height_m, 0).

    Assumes flat road. Used for:
      - Lane width estimation in meters
      - Lead vehicle distance from pixel height
      - Free-space estimation

    Args:
        K:                camera intrinsics
        camera_height_m:  camera height above road (comma 3: ~1.22m)
        pitch_deg:        downward pitch of camera (positive = looking down)

    Returns:
        H: (3, 3) homography: pixel (u,v,1) → ground (X, Z, 1) in meters.
           Only meaningful for pixels below the horizon.
    """
    pitch = np.radians(pitch_deg)
    # World → camera rotation for a camera tilted downward by pitch
    R = np.array([
        [1,           0,            0],
        [0,  np.cos(pitch), -np.sin(pitch)],
        [0,  np.sin(pitch),  np.cos(pitch)],
    ])
    # X_cam = R @ (X_world - C), so t = -R @ C with camera center C above the ground
    C = np.array([0.0, -camera_height_m, 0.0])
    t = -R @ C

    # P = K [R | t]; ground points have Y = 0, so the Y column of P drops out
    P = K @ np.hstack([R, t[:, None]])
    H_ground_to_image = P[:, [0, 2, 3]]     # maps (X, Z, 1) → pixel
    return np.linalg.inv(H_ground_to_image)


def pixel_to_ground(u: float, v: float, H: np.ndarray):
    """Convert image pixel to ground plane coordinates (meters from camera)."""
    p = H @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2]   # (X_meters, Z_meters)


H_road = camera_to_ground_homography(K_road, camera_height_m=1.22, pitch_deg=1.5)
H_wide = camera_to_ground_homography(K_wide, camera_height_m=1.22, pitch_deg=1.5)

# Where does bottom-center of image touch the road in road cam?
x, z = pixel_to_ground(964, 1100, H_road)
print(f"Road cam bottom-center → ({x:.2f}m lateral, {z:.2f}m ahead)")

# Same pixel in wide cam
x_w, z_w = pixel_to_ground(964, 1100, H_wide)
print(f"Wide cam bottom-center → ({x_w:.2f}m lateral, {z_w:.2f}m ahead)")
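
For a point straight ahead on a flat road, the geometry also has a closed form that is handy as a sanity check: the ray through image row v makes an angle atan((v - cy) / fy) below the optical axis, the camera pitch adds to it, and the forward distance is the camera height divided by the tangent of the total. A minimal sketch with illustrative function names, using the road-camera values from above:

```python
import math


def ground_distance(v, fy, cy, camera_height_m, pitch_deg=0.0):
    """Forward distance (m) to the road point imaged at row v; inf at or above the horizon."""
    angle_below_axis = math.atan((v - cy) / fy)               # ray angle below the optical axis
    angle_below_horizon = angle_below_axis + math.radians(pitch_deg)
    if angle_below_horizon <= 0:
        return float('inf')
    return camera_height_m / math.tan(angle_below_horizon)


def horizon_row(fy, cy, pitch_deg):
    """Image row where the flat-road horizon appears (moves up as the camera pitches down)."""
    return cy - fy * math.tan(math.radians(pitch_deg))


FY_ROAD, CY_ROAD = 2648.0, 604.0
d = ground_distance(1100, FY_ROAD, CY_ROAD, camera_height_m=1.22, pitch_deg=1.5)
print(f"row 1100 is about {d:.2f} m ahead; horizon at row {horizon_row(FY_ROAD, CY_ROAD, 1.5):.1f}")
```

Distance grows very quickly as v approaches the horizon row, so a one-pixel error near the horizon corresponds to many meters. This is the main limitation of flat-road ranging at long distance.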

Auto Exposure (AE) in openpilot — How the Camera Decides Brightness

Auto Exposure is the control loop that decides how bright each frame should be. It sits above the ISP — it tells the sensor how long to expose and how much gain to apply, then the ISP processes whatever the sensor captures.

In openpilot, AE is not left to the sensor's built-in logic. openpilot runs its own AE algorithm in software because the default camera AE optimizes for "nice-looking photos" — but ADAS needs a fundamentally different exposure strategy.

Why default AE fails for ADAS:

  Phone / consumer camera AE:
    Goal: make the image look pleasing to a human
    Strategy: expose for the center-weighted average brightness
    Result: faces look good, sky is white, shadows are crushed

  ADAS AE:
    Goal: maximize information for the neural network
    Strategy: preserve detail in BOTH road AND sky simultaneously
    Result: may look "ugly" to humans but neural net sees everything

  Specific failure cases with consumer AE:
    - Driving toward sunset: AE exposes for bright sky → road is black → lane lines invisible
    - Tunnel entry: AE averages bright exterior + dark tunnel → both look bad
    - Night with headlights: AE exposes for headlight → everything else is black
    - Snow: AE underexposes (thinks scene is too bright) → gray snow, dark objects

The exposure controls: three variables the AE controller adjusts (the lens aperture is fixed, so it is not one of them):

  1. Exposure time (shutter speed)
     How long the sensor collects photons per frame
     Longer = brighter but motion blur increases
     ADAS constraint: must stay below 1/framerate (33 ms at 30 fps)
     Typical ADAS: 1–20 ms

  2. Analog gain (ISO)
     Amplify the sensor signal before ADC
     Higher gain = brighter but noise increases proportionally
     ADAS constraint: keep gain low when possible (noise hurts NN accuracy)
     Typical range: ×1 (ISO 100) to ×16 (ISO 1600)

  3. Digital gain (applied after ADC)
     Software multiplication of pixel values
     Same as analog gain but applied digitally — amplifies both signal AND quantization noise
     Used only when analog gain is maxed out (nighttime)
     Last resort — worst quality

  Priority order (openpilot strategy):
    First:  increase exposure time (up to motion-blur limit)
    Second: increase analog gain (up to noise limit)
    Last:   increase digital gain (emergency only)
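
The priority order above can be written as a small allocation routine: given the total brightness multiplier the scene needs, spend it on exposure time first, then analog gain, and only then digital gain. This is a sketch of the stated strategy, not openpilot's code; the 20 ms and ×16 limits are taken from the typical ranges listed above, and `allocate_exposure` is a hypothetical name.

```python
def allocate_exposure(total_gain, base_exposure_ms=1.0, max_exposure_ms=20.0,
                      max_analog_gain=16.0):
    """
    Split a required brightness multiplier (relative to base_exposure_ms at gain x1)
    into (exposure_ms, analog_gain, digital_gain), in that priority order.
    """
    # 1) Exposure time: the cleanest way to collect more light, capped by motion blur
    exposure_ms = min(base_exposure_ms * total_gain, max_exposure_ms)
    remaining = total_gain / (exposure_ms / base_exposure_ms)
    # 2) Analog gain: adds noise, so only cover what exposure could not
    analog_gain = min(max(remaining, 1.0), max_analog_gain)
    # 3) Digital gain: last resort for whatever is still missing
    digital_gain = max(remaining / analog_gain, 1.0)
    return exposure_ms, analog_gain, digital_gain


for need in (8, 80, 640):
    print(need, allocate_exposure(need))
```

Going from bright to dark, the exposure time saturates first, then the analog gain, and digital gain rises above 1 only in the darkest case.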

How openpilot's AE works:

The AE control loop runs every frame:

  1. Capture frame with current exposure settings
  2. Compute frame statistics
     - Mean luminance of the frame (Y channel average)
     - Luminance histogram (distribution of brightness values)
     - Region-of-interest weighting (road region matters more than sky)
  3. Compare to target brightness
     - Target is NOT "middle gray" (unlike phone cameras)
     - Target is tuned per camera:
         Road cam:  optimize for road surface visibility
         Wide cam:  optimize for near-field obstacle visibility
  4. Compute exposure adjustment
     - If frame is too dark:  increase exposure time, then gain
     - If frame is too bright: decrease exposure time first
     - Use a PID-like controller (not bang-bang) to avoid oscillation
     - Rate-limit changes to prevent flickering between frames
  5. Send new exposure settings to the sensor over I2C (the control bus that accompanies the MIPI CSI-2 data link)
     - Takes effect 2–3 frames later (sensor pipeline delay)
     - AE must predict where brightness is going, not just react
  6. Next frame arrives → repeat from step 1
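
Steps 2 to 4 of the loop can be sketched as a single update function. The version below is deliberately reduced: it uses only a proportional term on the brightness error measured in stops (log2), clamps the per-frame change to limit flicker, and ignores the 2–3 frame sensor delay. All constants, including the target of 110, are assumptions for illustration rather than openpilot's tuning.

```python
import math


def ae_step(current_exposure_ms, measured_luma, target_luma=110.0,
            kp=0.5, max_step_stops=0.25, min_ms=0.05, max_ms=20.0):
    """One AE update: proportional control on the log2 brightness error, rate-limited."""
    measured_luma = max(measured_luma, 1.0)                   # avoid log(0) on black frames
    error_stops = math.log2(target_luma / measured_luma)      # > 0 means the frame is too dark
    step = max(-max_step_stops, min(max_step_stops, kp * error_stops))
    new_exposure = current_exposure_ms * 2.0 ** step
    return max(min_ms, min(max_ms, new_exposure))


# Toy scene: luminance is proportional to exposure (hypothetical 20 luma units per ms)
exposure = 2.0
for frame in range(12):
    luma = min(255.0, 20.0 * exposure)
    exposure = ae_step(exposure, luma)
print(f"exposure after 12 frames: {exposure:.2f} ms")
```

The first few frames move by the full rate limit and the rest close the gap geometrically, so the loop settles near the exposure that yields the target luminance without overshooting.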

Key AE concepts:

  Metering regions:
    Consumer camera: center-weighted or face-priority
    ADAS camera:     lower-half-weighted (road is more important than sky)

    openpilot uses the bottom 60% of the frame for metering:
    ┌──────────────────────────────┐
    │          sky / trees         │  ← low weight (don't expose for this)
    │                              │
    ├──────────────────────────────┤
    │      road / lanes / cars     │  ← HIGH weight (expose for this)
    │      obstacles / signs       │
    └──────────────────────────────┘

    This means: even if the sky is overexposed (white), the road surface
    has correct brightness → lane lines are visible → NN succeeds

  Target luminance:
    Not a fixed value — adapts to scene conditions
    Bright daylight: target = high (use full dynamic range)
    Night:           target = lower (accept darker image to keep noise down)
    Tunnel:          ramp quickly (don't wait for slow exponential adaptation)

  Temporal smoothing:
    AE changes must be gradual — sudden brightness jumps confuse the NN
    openpilot applies exponential moving average to exposure settings
    Typical time constant: ~0.5 seconds (15 frames at 30 fps)
    Exception: tunnel entry/exit → faster adaptation allowed
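
Two of these concepts reduce to a few lines of arithmetic. The sketch below computes a metering value that weights the bottom 60% of the frame heavily, and derives the per-frame smoothing factor that corresponds to a given time constant. The sky weight of 0.1 and the function names are assumptions; the 60% split and the 0.5 s constant come from the text above.

```python
import math


def weighted_luma(row_means, road_fraction=0.6, road_weight=1.0, sky_weight=0.1):
    """row_means: mean luminance per image row, top to bottom. Bottom rows dominate."""
    n = len(row_means)
    first_road_row = n - int(round(n * road_fraction))
    total = weight_sum = 0.0
    for i, luma in enumerate(row_means):
        w = road_weight if i >= first_road_row else sky_weight
        total += w * luma
        weight_sum += w
    return total / weight_sum


def ema_alpha(time_constant_s, fps):
    """Per-frame blend factor of a first-order low-pass filter with this time constant."""
    return 1.0 - math.exp(-1.0 / (time_constant_s * fps))


rows = [250.0] * 4 + [90.0] * 6          # bright sky on top, darker road below
print(f"plain mean:    {sum(rows) / len(rows):.1f}")
print(f"weighted mean: {weighted_luma(rows):.1f}")

alpha = ema_alpha(0.5, 30)
smoothed, requested = 10.0, 14.0         # exposure in ms
smoothed += alpha * (requested - smoothed)
print(f"alpha = {alpha:.4f}, smoothed exposure after one frame = {smoothed:.2f} ms")
```

The plain mean would tell the controller the frame is too bright because of the sky, while the weighted value tracks the road region it actually needs to expose for.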

HDR and AE — solving the dynamic range problem:

  Real driving scenes often have extreme contrast:
    - Sunlit road: 100,000 lux
    - Shadow under bridge: 500 lux
    - Headlight beam: 1,000,000+ lux
    - Dark lane markings in shadow: 200 lux

  Single exposure cannot capture all of this.

  HDR sensors (IMX390, AR0233AT) solve this with DOL-HDR:
    DOL = Digital Overlap

    How DOL-HDR works:
      Sensor captures TWO exposures per frame:
        Long exposure  (20 ms) → good for dark regions
        Short exposure (1 ms)  → good for bright regions

      ISP merges them:
        For each pixel:
          if long_exposure pixel is not saturated → use it (dark region, best SNR)
          else → use short_exposure pixel scaled by the exposure ratio (bright region)

      Result: single frame with much wider dynamic range
      Neural net sees both road surface AND headlights in same image

  openpilot on comma 3:
    OV8856 does NOT have DOL-HDR
    Relies entirely on software AE to choose the best single exposure
    This is why AE quality matters even more on non-HDR sensors
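
The two-exposure merge can be sketched per pixel: keep the long exposure wherever it is below saturation, since it has the better signal-to-noise ratio, and otherwise fall back to the short exposure scaled by the exposure ratio so both land on one linear scale. The 12-bit saturation level is an assumption, the 20:1 ratio follows the 20 ms / 1 ms example above, and real ISPs blend smoothly near the threshold rather than switching hard.

```python
def merge_hdr_pixel(long_px, short_px, exposure_ratio, saturation=4095):
    """Return a linear value on the long-exposure scale."""
    if long_px < saturation:
        return float(long_px)                      # long exposure is valid: lowest noise
    return float(short_px) * exposure_ratio        # clipped: rescale the short exposure


def merge_hdr(long_frame, short_frame, exposure_ratio, saturation=4095):
    """Merge two flat lists of raw pixel values."""
    return [merge_hdr_pixel(l, s, exposure_ratio, saturation)
            for l, s in zip(long_frame, short_frame)]


long_frame = [100, 2000, 4095]      # shadow, mid-tone, clipped headlight
short_frame = [5, 100, 900]
merged = merge_hdr(long_frame, short_frame, exposure_ratio=20.0)
print(merged)
```

The merged values exceed the 12-bit range, which is why HDR pipelines tone-map or compand the result before it reaches an 8-bit model input.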

AE failure modes in ADAS and how openpilot handles them:

  Failure                    Cause                     Solution
  ──────────────────────────────────────────────────────────────────────
  Flickering exposure        AE oscillates between     PID controller with
                             two brightness targets    damping + rate limiting

  Tunnel blindness           sudden dark→bright or     fast-ramp mode when
                             bright→dark transition    luminance delta > threshold

  Headlight bloom            oncoming headlights       lower-half metering ignores
                             saturate AE               top-of-frame bright sources

  Snow blindness             high reflectance scene    histogram-based target
                             causes underexposure      (don't target mean, target
                                                       percentile)

  LED flicker                traffic lights / signs    exposure time should be an
                             often pulse at 100/120 Hz integer multiple of 1/100 or
                             (2× the 50/60 Hz mains)   1/120 s (sensor anti-flicker
                             → can appear OFF in       mode)
                             some frames

  Sunrise/sunset glare       direct sun in camera      nothing AE can fully fix —
                             saturates entire region   need physical sun visor or
                                                       polarizing filter
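Three of the mitigations in the table (lower-half metering, a percentile target instead of the mean, and rate limiting) can be combined in a few lines. The sketch below is illustrative only: the function names, the 50th-percentile choice, the target of 110, and the 1.25x per-frame limit are assumptions, not openpilot parameters.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def exposure_gain(frame, target=110.0, pct=50.0, max_step=1.25):
    """Gain that moves the chosen percentile of the lower half toward target.

    frame: list of rows of 8-bit luma values. Only the lower half is metered,
    so bright sources near the top of the frame do not drive the exposure.
    The result is clamped to [1/max_step, max_step] to limit the change per frame.
    """
    lower_half = frame[len(frame) // 2:]
    pixels = [p for row in lower_half for p in row]
    measured = max(1.0, float(percentile(pixels, pct)))
    gain = target / measured
    return min(max_step, max(1.0 / max_step, gain))

frame = [
    [255, 255, 255, 255],   # overexposed sky / headlights
    [255, 255, 255, 255],
    [50, 55, 60, 250],      # road, with one specular highlight
    [50, 55, 60, 65],
]
print(exposure_gain(frame))                 # 1.25 (requested 2.0, rate-limited)
print(exposure_gain(frame, max_step=4.0))   # 2.0
```

Because the percentile ignores the single 250 highlight and the metering window ignores the saturated top rows, the requested gain depends only on the road surface.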

AE and the neural network — the contract:

  The NN was trained on images with a specific brightness distribution.

  If the AE produces frames that are:
    Too dark    → NN sees shadows as obstacles, misses lane markings
    Too bright  → NN sees saturated white where objects should be
    Flickering  → NN confidence oscillates, control becomes jerky
    Inconsistent between cameras → road cam and wide cam disagree on scene

  openpilot's AE goal:
    Produce frames where the road region has mean luminance ≈ 100–120
    (on a 0–255 scale) regardless of weather, time of day, or lighting

    This is the input contract between camera and neural network.
    Break it → perception degrades → vehicle control degrades.
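The input contract can be expressed as a simple check on the road region. This is a sketch under stated assumptions: the region-of-interest tuple and both function names are hypothetical, and the 100 to 120 band is the range quoted above.

```python
def road_mean_luma(frame, roi):
    """Mean 8-bit luminance inside roi = (top, bottom, left, right), end-exclusive."""
    top, bottom, left, right = roi
    total, count = 0, 0
    for row in frame[top:bottom]:
        for p in row[left:right]:
            total += p
            count += 1
    return total / count

def meets_contract(frame, roi, lo=100.0, hi=120.0):
    """True if the road region's mean luminance lies in the target band."""
    return lo <= road_mean_luma(frame, roi) <= hi

frame = [
    [250, 250, 250, 250],   # bright sky
    [250, 250, 250, 250],
    [105, 110, 115, 110],   # road surface
    [100, 120, 110, 110],
]
road_roi = (2, 4, 0, 4)
print(road_mean_luma(frame, road_roi))       # 110.0
print(meets_contract(frame, road_roi))       # True
print(meets_contract(frame, (0, 4, 0, 4)))   # False: whole-frame mean is 180.0
```

The last line shows why the metering region matters: the same frame passes when measured on the road and fails when the sky is included.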

openpilot Model Input Summary

supercombo inputs:
  input_imgs:             (1, 12, 128, 256)  ← 2 consecutive road cam YUV420 frames
  desire:                 (1, 8)             ← one-hot: lane change L/R, keep, turn L/R
  traffic_convention:     (1, 2)             ← [RHD, LHD] traffic direction
  lateral_control_params: (1, 2)             ← [v_ego, roll]
  nav_features:           (1, 64)            ← map/route embedding
  nav_instructions:       (1, 150)           ← turn-by-turn instruction vector

supercombo output:
  outputs: flat float32 vector (~6504 values)
  Parsed by openpilot modeldata.py into:
    lead:         lead vehicle position, velocity, acceleration
    path:         lane line polynomials (4 points × 33 time steps)
    desire_state: predicted driver intent
    meta:         model confidence, disengagement probability
    pose:         ego velocity and orientation

Wide camera (separate model / separate input stream):
  Feeds wide-angle obstacle detection (pedestrians, cyclists at ±60° off-center)
  Output merged in controlsd with supercombo lead predictions
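The 12-channel image input follows from how a YUV420 frame is packed. The sketch below reproduces the shape arithmetic only. It assumes a 256×512 source frame and a 6-plane packing (four half-resolution Y planes from 2×2 subsampling plus U and V). Both are assumptions about the layout, and the function names are hypothetical.

```python
def yuv420_to_channels(height, width):
    """Shape (C, H, W) of one YUV420 frame packed into half-resolution planes."""
    assert height % 2 == 0 and width % 2 == 0
    # 4 Y planes (one per position in each 2x2 block) + 1 U plane + 1 V plane
    return (6, height // 2, width // 2)

def supercombo_input_shapes(frame_h=256, frame_w=512, n_frames=2):
    """Input tensor shapes as listed in the summary above."""
    c, h, w = yuv420_to_channels(frame_h, frame_w)
    return {
        "input_imgs": (1, n_frames * c, h, w),
        "desire": (1, 8),
        "traffic_convention": (1, 2),
        "lateral_control_params": (1, 2),
        "nav_features": (1, 64),
        "nav_instructions": (1, 150),
    }

shapes = supercombo_input_shapes()
print(shapes["input_imgs"])   # (1, 12, 128, 256)
```

Stacking two consecutive frames along the channel axis gives the network a short temporal window without any recurrent input for the images themselves.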

8. Advanced Deep Learning Architectures

8.1 Vision Transformer (ViT)

# ViT splits image into fixed patches, treats each as a "token"
# Self-attention captures global relationships between any two patches

from transformers import ViTForImageClassification, ViTImageProcessor
import torch

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
model.eval()

from PIL import Image
img = Image.open('image.jpg').convert('RGB')   # force 3 channels (handles grayscale / RGBA files)
inputs = processor(images=img, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits
pred_class = logits.argmax(-1).item()
print(model.config.id2label[pred_class])
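The "patches as tokens" idea can be made concrete with the arithmetic for the checkpoint used above (16×16 patches on a 224×224 image). The helper name is hypothetical, and the extra token is the classification token that ViT prepends to the patch sequence.

```python
def vit_token_layout(image_size=224, patch_size=16, channels=3):
    """Sequence layout of a ViT for a square image and square patches."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    return {
        "patches_per_side": per_side,
        "num_patches": num_patches,
        "sequence_length": num_patches + 1,                # +1 for the [CLS] token
        "patch_dim": patch_size * patch_size * channels,   # flattened pixels per patch
    }

print(vit_token_layout())
# {'patches_per_side': 14, 'num_patches': 196, 'sequence_length': 197, 'patch_dim': 768}
```

Self-attention cost grows with the square of the sequence length, so halving the patch size to 8 would quadruple the patch count and raise attention cost roughly sixteen-fold.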

8.2 DETR — Detection Transformer

from transformers import DetrImageProcessor, DetrForObjectDetection
import torch
from PIL import Image

processor = DetrImageProcessor.from_pretrained('facebook/detr-resnet-50')
model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')

img = Image.open('image.jpg').convert('RGB')   # force 3 channels
inputs = processor(images=img, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Post-process to get boxes and labels
target_sizes = torch.tensor([img.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes)[0]

for score, label, box in zip(results['scores'], results['labels'], results['boxes']):
    print(f"{model.config.id2label[label.item()]}: {score:.2f} {box.tolist()}")
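DETR predicts boxes as normalized (cx, cy, w, h), and the post-processing step rescales them to pixel corners using `target_sizes`, given as (height, width), which is why `img.size` is reversed above. A minimal sketch of that conversion and of the score filter is shown below. The function names and sample values are illustrative, not the library's internals.

```python
def cxcywh_norm_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)

def filter_detections(scores, boxes, img_w, img_h, threshold=0.9):
    """Keep boxes whose score exceeds the threshold, in pixel coordinates."""
    return [(s, cxcywh_norm_to_xyxy(b, img_w, img_h))
            for s, b in zip(scores, boxes) if s > threshold]

scores = [0.97, 0.40]
boxes = [(0.5, 0.5, 0.25, 0.5), (0.1, 0.1, 0.1, 0.1)]
print(filter_detections(scores, boxes, img_w=640, img_h=480))
# [(0.97, (240.0, 120.0, 400.0, 360.0))]
```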

8.3 RT-DETR — Real-Time Detection Transformer

RT-DETR combines transformer accuracy with YOLO-level speed (runs in real-time on GPU).

from ultralytics import RTDETR

model = RTDETR('rtdetr-l.pt')   # large variant
results = model.predict('image.jpg', conf=0.4)
model.export(format='engine')   # TensorRT export

8.4 SAM — Segment Anything Model

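# pip install segment-anything opencv-python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry['vit_b'](checkpoint='sam_vit_b_01ec64.pth')
predictor = SamPredictor(sam)

# SamPredictor expects an RGB uint8 array of shape (H, W, 3)
img_rgb = cv2.cvtColor(cv2.imread('image.jpg'), cv2.COLOR_BGR2RGB)
predictor.set_image(img_rgb)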

# Prompt with point (x, y) + foreground/background label
input_point = np.array([[500, 375]])
input_label = np.array([1])   # 1=foreground, 0=background

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,  # returns 3 masks (small/medium/large)
)
best_mask = masks[np.argmax(scores)]   # (H, W) bool

9. OpenCV — Core to Advanced

9.1 DNN Module — Run Models Without PyTorch/TF

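import cv2

# Load any ONNX model
net = cv2.dnn.readNetFromONNX('yolov8n.onnx')

# Pick ONE backend/target pair; a later call overrides an earlier one.
# Option A: CUDA backend on an NVIDIA GPU (needs an OpenCV build with CUDA)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

# Option B: OpenVINO for Intel CPUs (needs an OpenCV build with OpenVINO)
# net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
# net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

# Preprocess image into blob
img_bgr = cv2.imread('image.jpg')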
blob = cv2.dnn.blobFromImage(
    img_bgr,
    scalefactor=1.0/255,
    size=(640, 640),
    mean=(0, 0, 0),
    swapRB=True,        # BGR→RGB
    crop=False
)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

9.2 CUDA Acceleration

import cv2

# Check CUDA support at runtime
print(cv2.cuda.getCudaEnabledDeviceCount())   # 0 if no CUDA build

# Upload mat to GPU
img_gray = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)
gpu_mat = cv2.cuda_GpuMat()
gpu_mat.upload(img_gray)

# GPU operations
gpu_blurred = cv2.cuda.createGaussianFilter(
    cv2.CV_8UC1, cv2.CV_8UC1, (5, 5), 0
).apply(gpu_mat)

# Run Canny on the blurred image, not on the raw upload
gpu_edges = cv2.cuda.createCannyEdgeDetector(50, 150).detect(gpu_blurred)

# Download back to CPU
result = gpu_edges.download()

9.3 Video Pipeline with GStreamer

# CSI camera on Jetson (zero-copy path)
pipeline = (
    "nvarguscamerasrc sensor-id=0 ! "
    "video/x-raw(memory:NVMM), width=1920, height=1080, framerate=30/1 ! "
    "nvvidconv flip-method=0 ! "
    "video/x-raw, format=BGRx ! "
    "videoconvert ! video/x-raw, format=BGR ! appsink"
)
cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)

# RTSP stream input
rtsp_pipeline = "rtspsrc location=rtsp://192.168.1.50/stream ! decodebin ! videoconvert ! appsink"
cap = cv2.VideoCapture(rtsp_pipeline, cv2.CAP_GSTREAMER)

9.4 G-API — Graph API for Pipelines

G-API lets you define your vision pipeline as a computation graph, enabling automatic optimization and backend switching.

import cv2 as cv

# Define the graph
g_in  = cv.GMat()
g_gray = cv.gapi.BGR2Gray(g_in)
g_blur = cv.gapi.gaussianBlur(g_gray, (5, 5), 0)
g_edges = cv.gapi.Canny(g_blur, 50, 150)

# Compile
pipeline = cv.GComputation(cv.GIn(g_in), cv.GOut(g_edges))
compiled = pipeline.compileStreaming()

compiled.setSource(cv.gin(cv.gapi.wip.make_capture_src(0)))
compiled.start()

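while True:
    ok, (edges,) = compiled.pull()
    if not ok:
        break
    cv.imshow('Edges', edges)      # the module is imported as cv, not cv2
    if cv.waitKey(1) == 27:        # waitKey lets the window redraw; Esc quits
        break
compiled.stop()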

10. Annotation Tools

Data annotation is the foundation of supervised learning. The quality and format of your annotations directly determine model performance. Choosing the right tool depends on your scale, team size, and output format requirements.

Overview Comparison

Tool            Type               Price     AI-Assist    Formats                                Best for
X-AnyLabeling   Desktop            Free      ✓ GPU        COCO, VOC, YOLO, DOTA, MOT, MASK       AI-assisted single/batch annotation
CVAT            Web / Self-hosted  Free                   Pascal VOC, COCO, YOLO, Datumaro       Team collaboration, video tracking
LabelImg        Desktop            Free                   Pascal VOC, YOLO                       Quick single-annotator labeling
RectLabel       Desktop (macOS)    Paid                   COCO, VOC, YOLO                        macOS users, fast polygon annotation
VGG VIA         Browser            Free                   JSON, CSV                              Lightweight, no install needed
Labelbox        Cloud SaaS         Freemium               COCO, VOC, YOLO + custom               Enterprise scale, team management
Roboflow        Cloud              Freemium               YOLO, COCO, VOC, TFRecord, TensorFlow  Dataset versioning, augmentation, train & deploy, edge export
COCO Annotator  Web / Self-hosted  Free      ✓ (limited)  COCO JSON                              COCO-format segmentation datasets

10.1 X-AnyLabeling

X-AnyLabeling is one of the most feature-rich open-source annotation tools: it embeds AI inference engines directly, enabling one-click predictions on images and videos.

GitHub: github.com/CVHub520/X-AnyLabeling

Key Features

AI-Assisted Annotation:
  Embedded models: YOLOv5/v8/v9/v10, SAM, GroundingDINO, DINO,
                   RT-DETR, CLIP, DepthAnything, and more
  GPU acceleration via ONNX Runtime with CUDA EP
  Single-frame prediction → manually correct → next frame
  Batch prediction → annotate entire folder in one click

Annotation Types:
  Polygons                  (fine-grained segmentation)
  Rectangles                (bounding boxes)
  Rotated boxes             (DOTA / aerial imagery)
  Circles
  Lines / polylines
  Points / keypoints

Format Support:
  Import:  COCO, VOC, YOLO, DOTA, MOT, MASK, LabelMe
  Export:  COCO, VOC, YOLO, DOTA, MOT, MASK, LabelMe, ODVG

Video Support:
  Frame-by-frame annotation
  Auto-tracking using embedded trackers
  Batch inference on video files

Installation

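# Note: the PyPI package "anylabeling" is the separate upstream AnyLabeling project,
# not X-AnyLabeling. Install X-AnyLabeling from source or from a release binary.
# The steps below follow the 2.x source layout and may differ in newer releases;
# check the repository's installation guide for your version.
git clone https://github.com/CVHub520/X-AnyLabeling.git
cd X-AnyLabeling
pip install -r requirements.txt          # CPU
# pip install -r requirements-gpu.txt    # GPU (ONNX Runtime CUDA)

# Or via release binary:
# Download from github.com/CVHub520/X-AnyLabeling/releases
# Available: .exe (Windows), .AppImage (Linux), .dmg (macOS)

# Run
python anylabeling/app.py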

Workflow

1. Open X-AnyLabeling
2. File → Open Dir → select your image folder
3. AI Models → select a model (e.g., YOLOv8n, SAM)
   - First use: model downloads automatically
4. Predict:
   - Single frame: Ctrl+M or click "Run Model"
   - Batch: Tools → Auto Labeling → Run on All Images
5. Review predictions, correct mistakes manually
6. File → Export → choose format (COCO / YOLO / VOC)

Export Formats from X-AnyLabeling

# YOLO format export example output:
# labels/image001.txt
# 0 0.512 0.334 0.241 0.189     ← class cx cy w h (normalized 0–1)

# COCO format export:
# annotations.json — single file with all boxes + images list

# VOC format export:
# Annotations/image001.xml — one XML per image

# DOTA format (rotated boxes for aerial imagery):
# labels/image001.txt
# 100 200 150 200 150 250 100 250 vehicle 0   ← 4 corner points + class + difficult

10.2 CVAT

CVAT (Computer Vision Annotation Tool) is a widely used open-source annotation platform, originally developed by Intel and now maintained by CVAT.ai, and used at scale by ML teams.

GitHub: github.com/cvat-ai/cvat
Hosted version: app.cvat.ai

Key Features

Annotation Types:
  Bounding boxes (2D + 3D)
  Polygons / polylines / points
  Masks (brush tool, superpixel segmentation)
  Keypoints + skeletons (pose estimation)
  Cuboids (3D objects in 2D images)
  LiDAR point cloud annotation (3D bboxes)

Video Annotation:
  Frame-by-frame OR annotate keyframes → interpolation fills gaps
  Semi-automatic tracking (SiamMask, OpenCV trackers)
  Object track IDs across frames (MOT format)

AI Assistance:
  SAM (Segment Anything) integration — click → mask
  Detection models via Nuclio serverless functions
  Interactors: DEXTR, f-BRS for polygon assistance

Collaboration:
  Role-based access: annotator / reviewer / owner
  Task assignment across multiple annotators
  Review workflow with accept/reject per annotation

Format Support:
  Export: Pascal VOC, COCO JSON, YOLO, TFRecord, MOT,
          Cityscapes, KITTI, LFW, Wider Face, VGGFace2
  Import: All of the above

Self-Hosted Setup

# Clone and start with Docker Compose
git clone https://github.com/cvat-ai/cvat.git
cd cvat

# Start all services
docker compose up -d

# Create admin user (manage.py lives in the server user's home directory)
docker exec -it cvat_server bash -ic 'python3 ~/manage.py createsuperuser'

# Access at http://localhost:8080

CVAT Python SDK — Programmatic Annotation

# pip install cvat-sdk

from cvat_sdk import make_client
from cvat_sdk.core.proxies.tasks import ResourceType

with make_client(host='http://localhost:8080',
                 credentials=('user', 'password')) as client:
    # Create the task and upload local images in one call
    # (the image paths are placeholders for your own files)
    task = client.tasks.create_from_data(
        spec={
            'name': 'My Detection Task',
            'labels': [
                {'name': 'car',    'color': '#ff0000'},
                {'name': 'person', 'color': '#00ff00'},
            ],
        },
        resource_type=ResourceType.LOCAL,
        resources=['images/image001.jpg', 'images/image002.jpg'],
        data_params={'image_quality': 95},
    )

    # Export annotations
    task.export_dataset(
        format_name='COCO 1.0',
        filename='annotations.zip',
        include_images=False
    )

Export from CVAT — YOLO Format Structure

cvat_export/
├── obj_train_data/
│   ├── image001.txt       ← YOLO label file
│   ├── image002.txt
│   └── ...
├── obj.data               ← paths config
├── obj.names              ← class names, one per line
└── train.txt              ← list of image paths

10.3 LabelImg

LabelImg is the classic, lightweight, single-annotator tool — simple, battle-tested, and universally supported.

GitHub: github.com/HumanSignal/labelImg

Installation

pip install labelImg

# Or from source
git clone https://github.com/HumanSignal/labelImg.git
cd labelImg
pip install pyqt5 lxml
python labelImg.py

Key Shortcuts

Key      Action
W        Create rectangle box
D        Next image
A        Previous image
Ctrl+S   Save annotation
Ctrl+R   Change save directory
Del      Delete selected box

Workflow

1. Open Dir → select image folder
2. Change Save Dir → set label output folder
3. Toggle format: PascalVOC or YOLO (bottom-left of window)
4. Press W → draw box → enter class name → confirm
5. Ctrl+S → save → D (next image) → repeat
6. Result:
   YOLO:  labels/image001.txt   (class cx cy w h normalized)
   VOC:   Annotations/image001.xml

Pascal VOC XML Format

<!-- Annotations/image001.xml -->
<annotation>
  <folder>images</folder>
  <filename>image001.jpg</filename>
  <size>
    <width>1920</width><height>1080</height><depth>3</depth>
  </size>
  <object>
    <name>car</name>
    <difficult>0</difficult>
    <bndbox>
      <xmin>452</xmin><ymin>78</ymin>
      <xmax>929</xmax><ymax>534</ymax>
    </bndbox>
  </object>
</annotation>

10.4 RectLabel

RectLabel is a polished commercial annotation tool for macOS, focused on speed for individual annotators.

Website: rectlabel.com

Key Features

Annotation Types:
  Bounding boxes
  Polygons
  Polylines
  Points
  Keypoints with skeleton templates
  Oriented bounding boxes

Unique Capabilities:
  Core ML integration — annotate with Apple Neural Engine
  Pre-labeling via Core ML detection models
  Tracking between video frames
  Label history / auto-complete class names
  Attribute support (color, material, truncated, occluded)

Export Formats:
  COCO JSON
  Pascal VOC XML
  YOLO
  CreateML JSON
  CSV

macOS Features:
  Native SwiftUI app (fast, no Electron overhead)
  Drag-and-drop image/video import
  Full keyboard shortcut customization

When to Use RectLabel

✓ Solo annotator on macOS
✓ Need fast Core ML pre-labeling (uses Apple Silicon Neural Engine)
✓ Dataset has oriented/rotated objects (satellite, aerial)
✓ Need keypoint/skeleton annotation (sports, medical)

✗ Team collaboration (single user license)
✗ Linux/Windows (macOS only)
✗ Free budget required

10.5 VGG Image Annotator (VIA)

VIA is a zero-install browser-based annotator — just open an HTML file and start labeling. Developed by Oxford VGG.

Website: robots.ox.ac.uk/~vgg/software/via/
GitHub: github.com/ox-vgg/via

Key Features

No installation:
  Single HTML file — open in any browser, works offline
  Load local images directly (no server upload needed)
  All data stays on your machine

Annotation Types:
  Bounding boxes
  Polygons
  Polylines
  Points
  Circles / ellipses

Attributes:
  Define custom attributes per region (class, color, quality, etc.)
  Checkbox / radio / text / number attribute types

Export Formats:
  VIA JSON (native format)
  CSV (one row per annotation)
  COCO JSON (via VIA3 version)

Usage

1. Download via.html from robots.ox.ac.uk/~vgg/software/via/
2. Open in browser (no server needed)
3. Add Files → select images from your disk
4. Define attributes (e.g., "class": type=radio, options=car,person,bike)
5. Draw regions: select shape → draw → set attribute values
6. Annotations → Export → JSON or CSV

Parse VIA JSON in Python

import json

with open('via_annotations.json') as f:
    via_data = json.load(f)

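# Project files nest annotations under '_via_img_metadata';
# "Export Annotations (JSON)" files are keyed by image directly.
metadata = via_data.get('_via_img_metadata', via_data)

for file_key, file_data in metadata.items():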
    filename = file_data['filename']
    for region in file_data['regions']:
        shape = region['shape_attributes']
        attrs = region['region_attributes']
        class_name = attrs.get('class', 'unknown')

        if shape['name'] == 'rect':
            x, y = shape['x'], shape['y']
            w, h = shape['width'], shape['height']
            print(f"{filename}: {class_name} box ({x},{y}) {w}×{h}")
        elif shape['name'] == 'polygon':
            xs = shape['all_points_x']
            ys = shape['all_points_y']
            print(f"{filename}: {class_name} polygon with {len(xs)} points")

10.6 Labelbox

Labelbox is an enterprise cloud platform for large-scale data labeling — collaborative, API-driven, and integrated with MLOps pipelines.

Website: labelbox.com

Key Features

Scale:
  Manage thousands of annotators
  Built-in quality control (review workflows, consensus)
  Nested task queues with priority

Automation:
  Model-assisted labeling (MAL): run your model, human corrects
  DINO, SAM integration for pre-labeling
  Active learning: prioritize uncertain/valuable examples

API-First:
  Full REST + GraphQL API
  Python SDK for programmatic dataset management
  Webhooks for real-time events

Export Formats:
  COCO, Pascal VOC, YOLO
  Custom format via API
  Direct integration: HuggingFace, Roboflow, AWS SageMaker

Pricing:
  Free: 5 users, 1 project, limited storage
  Team: $0.06–0.12/label
  Enterprise: custom pricing

Labelbox Python SDK

# pip install labelbox

import labelbox as lb

client = lb.Client(api_key='YOUR_API_KEY')

# Create dataset
dataset = client.create_dataset(name='MyDetectionDataset')

# Upload images from URLs
dataset.create_data_rows([
    {'row_data': 'https://example.com/image001.jpg', 'global_key': 'img001'},
    {'row_data': 'https://example.com/image002.jpg', 'global_key': 'img002'},
])

# Create labeling project
ontology_builder = lb.OntologyBuilder(
    tools=[
        lb.Tool(tool=lb.Tool.Type.BOUNDING_BOX, name='car'),
        lb.Tool(tool=lb.Tool.Type.BOUNDING_BOX, name='person'),
        lb.Tool(tool=lb.Tool.Type.POLYGON, name='road'),
    ]
)
ontology = client.create_ontology('Detection Ontology', ontology_builder.asdict())

project = client.create_project(
    name='Warehouse Detection',
    media_type=lb.MediaType.Image
)
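project.connect_ontology(ontology)   # replaces the deprecated setup_editor()
# Queue the uploaded rows for labeling by their global keys (priority 5)
project.create_batch('batch-1', global_keys=['img001', 'img002'], priority=5)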

# Export completed labels
export_task = project.export(params={
    'data_row_details': True,
    'metadata_fields': True,
    'attachments': False,
    'project_details': True,
    'performance_details': True,
    'label_details': True
})
export_task.wait_till_done()

for row in export_task.get_buffered_stream():
    label = row.json
    for annotation in label.get('projects', {}).get(project.uid, {}).get('labels', []):
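        for obj in annotation.get('annotations', {}).get('objects', []):
            # Boxes carry 'bounding_box'; polygon tools such as 'road' carry 'polygon'
            print(obj['value'], obj.get('bounding_box') or obj.get('polygon'))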

10.7 Roboflow

Roboflow is a cloud platform for building computer vision datasets and pipelines — annotate, augment, version, and export to YOLO/COCO/VOC, then train in the cloud or export for local/edge deployment (Jetson, TensorRT, ONNX, TFLite).

Website: roboflow.com

Key Features

Dataset Management:
  Upload images/video; annotate in browser (boxes, polygons, keypoints)
  Dataset versioning — track changes and augmentations per version
  Train/valid/test split with one click
  Public dataset catalog (COCO, Open Images, etc.) and custom uploads

Augmentation & Preprocessing:
  Built-in augmentations (flip, rotate, brightness, mosaic, etc.)
  Auto-orientation and resize for target model input
  Generate new versions with different augmentations without re-labeling

Export & Deployment:
  Export: YOLO (v5/v8/v11), COCO, Pascal VOC, TFRecord, TensorFlow, Create ML
  Train in Roboflow (YOLOv8, etc.) or download dataset for local training
  Deploy: Roboflow API, ONNX, TensorRT, TFLite, Core ML, browser (JavaScript)
  Edge: direct export for Jetson, Raspberry Pi, and embedded targets

API & Integrations:
  Python SDK (roboflow) for upload, annotation, version, and inference
  REST API for pipelines and MLOps
  Integrates with Labelbox, CVAT (import/export), and major frameworks

When to Use Roboflow

  • You want one place to version datasets, augment, and export to YOLO/COCO for training and edge deployment.
  • You need quick iteration: annotate → augment → export → train (locally or in cloud) → deploy to Jetson/edge.
  • You prefer a managed pipeline over self-hosting CVAT or managing raw files and scripts.

Quick Start (Python)

# pip install roboflow

from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("my-workspace").project("my-project")
dataset = project.version(1).download("yolov8")   # or "coco", "voc", etc.

# Inference with hosted model
model = project.version(1).model
pred = model.predict("image.jpg")
print(pred.json())

10.8 COCO Annotator

COCO Annotator is an open-source web-based annotation tool built specifically for the COCO dataset format — supports segmentation masks natively.

GitHub: github.com/jsbroks/coco-annotator

Key Features

Annotation Types:
  Bounding boxes
  Segmentation masks (polygon)
  Keypoints

COCO-Native:
  Stores directly in COCO JSON format
  Supports supercategory hierarchy
  Proper instance segmentation with "iscrowd" flag

Smart Tools:
  Magic Wand (color-based region fill)
  Superpixel-based selection (SLIC segments)
  Polygon simplification

Collaboration:
  User accounts with dataset permissions
  Multi-user concurrent annotation
  Export at any time (live COCO JSON)

Docker Setup

git clone https://github.com/jsbroks/coco-annotator.git
cd coco-annotator

# Start with Docker Compose (includes MongoDB + Flask + Vue frontend)
docker-compose up -d

# Access at http://localhost:5000
# Default credentials: admin / admin

# Mount your images directory
# Edit docker-compose.yml:
# volumes:
#   - /path/to/your/images:/datasets

Export and Load COCO JSON

import json

# Load COCO format annotations
with open('annotations.json') as f:
    coco = json.load(f)

# COCO structure:
# {
#   "images":      [...],   ← image metadata (id, file_name, width, height)
#   "annotations": [...],   ← boxes/masks (image_id, category_id, bbox, segmentation, area)
#   "categories":  [...],   ← class list (id, name, supercategory)
# }

# Build lookup maps
cat_map = {c['id']: c['name'] for c in coco['categories']}
img_map = {i['id']: i['file_name'] for i in coco['images']}

# Iterate annotations
for ann in coco['annotations']:
    img_name = img_map[ann['image_id']]
    class_name = cat_map[ann['category_id']]
    x, y, w, h = ann['bbox']   # COCO bbox: x_min, y_min, width, height
    area = ann['area']
    segmentation = ann.get('segmentation', [])   # list of polygon vertex lists
    is_crowd = ann.get('iscrowd', 0)             # 1=RLE mask, 0=polygon

    print(f"{img_name}: {class_name} at [{x:.0f},{y:.0f},{x+w:.0f},{y+h:.0f}]")

10.9 Annotation Workflow Best Practices

Quality Control

# Check annotation quality: count boxes per image, flag outliers
import json
from collections import Counter

with open('annotations.json') as f:
    coco = json.load(f)

boxes_per_image = Counter(ann['image_id'] for ann in coco['annotations'])
img_names = {i['id']: i['file_name'] for i in coco['images']}

# Flag images with too few or too many boxes.
# Iterate over all images so that unannotated ones (count 0) are caught too.
avg = sum(boxes_per_image.values()) / max(1, len(img_names))
for img_id, name in img_names.items():
    count = boxes_per_image.get(img_id, 0)
    if count < avg * 0.2 or count > avg * 3:
        print(f"REVIEW: {name} has {count} boxes (avg={avg:.1f})")

Dataset Splitting

import random
from pathlib import Path
import shutil

def split_dataset(image_dir: str, label_dir: str,
                  train=0.7, val=0.2, test=0.1, seed=42):
    """Split YOLO-format dataset into train/val/test."""
    images = list(Path(image_dir).glob('*.jpg')) + \
             list(Path(image_dir).glob('*.png'))
    random.seed(seed)
    random.shuffle(images)

    n = len(images)
    splits = {
        'train': images[:int(n*train)],
        'val':   images[int(n*train):int(n*(train+val))],
        'test':  images[int(n*(train+val)):],
    }

    for split_name, imgs in splits.items():
        img_out = Path(f'dataset/{split_name}/images')
        lbl_out = Path(f'dataset/{split_name}/labels')
        img_out.mkdir(parents=True, exist_ok=True)
        lbl_out.mkdir(parents=True, exist_ok=True)

        for img_path in imgs:
            lbl_path = Path(label_dir) / (img_path.stem + '.txt')
            shutil.copy(img_path, img_out / img_path.name)
            if lbl_path.exists():
                shutil.copy(lbl_path, lbl_out / lbl_path.name)

    print(f"Split: train={len(splits['train'])}, "
          f"val={len(splits['val'])}, test={len(splits['test'])}")

split_dataset('images', 'labels')

11. Dataset Formats

11.1 COCO JSON

The most complete format — supports detection, segmentation, keypoints, captions.

{
  "info": {"description": "My dataset", "version": "1.0", "year": 2025},
  "licenses": [],
  "categories": [
    {"id": 1, "name": "car",    "supercategory": "vehicle"},
    {"id": 2, "name": "person", "supercategory": "human"}
  ],
  "images": [
    {"id": 1, "file_name": "image001.jpg", "width": 1920, "height": 1080}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [452, 78, 477, 456],
      "area": 217512,
      "segmentation": [[452,78, 929,78, 929,534, 452,534]],
      "iscrowd": 0
    }
  ]
}
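The `bbox` and `area` fields of a polygon annotation can be derived from its `segmentation` vertices. The sketch below uses the shoelace formula for the area. The function names are hypothetical, and the polygon is the rectangle from the example above.

```python
def polygon_area(flat_xy):
    """Area of a simple polygon given as [x0, y0, x1, y1, ...] (shoelace formula)."""
    xs, ys = flat_xy[0::2], flat_xy[1::2]
    n = len(xs)
    twice_area = 0.0
    for i in range(n):
        j = (i + 1) % n                     # next vertex, wrapping around
        twice_area += xs[i] * ys[j] - xs[j] * ys[i]
    return abs(twice_area) / 2.0

def polygon_bbox(flat_xy):
    """COCO-style [x_min, y_min, width, height] enclosing the polygon."""
    xs, ys = flat_xy[0::2], flat_xy[1::2]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]

segmentation = [452, 78, 929, 78, 929, 534, 452, 534]
print(polygon_bbox(segmentation))   # [452, 78, 477, 456]
print(polygon_area(segmentation))   # 217512.0
```

Running a check like this over every annotation is a cheap way to catch hand-edited files where `area` or `bbox` no longer matches the mask.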

11.2 YOLO Format

One .txt file per image. Class indices are 0-based. All values normalized 0–1.

# labels/image001.txt
# class_id  cx      cy      width   height
0            0.5130  0.2833  0.2474  0.4222
1            0.1250  0.5370  0.0625  0.1667

# Convert YOLO label to pixel coordinates
def yolo_to_pixel(cx, cy, w, h, img_w, img_h):
    x1 = int((cx - w/2) * img_w)
    y1 = int((cy - h/2) * img_h)
    x2 = int((cx + w/2) * img_w)
    y2 = int((cy + h/2) * img_h)
    return x1, y1, x2, y2
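The inverse direction, from pixel corners to a YOLO label line, is the one needed when converting VOC or COCO boxes. A minimal sketch follows, using the 1920×1080 box from the Pascal VOC example. The function name and the class index 0 are illustrative.

```python
def pixel_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert pixel corners to normalized YOLO (cx, cy, w, h)."""
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return cx, cy, w, h

cx, cy, w, h = pixel_to_yolo(452, 78, 929, 534, img_w=1920, img_h=1080)
label_line = f"0 {cx:.4f} {cy:.4f} {w:.4f} {h:.4f}"
print(label_line)   # 0 0.3596 0.2833 0.2484 0.4222
```

Note that converting back with integer truncation can shift a corner by one pixel, so repeated round trips should keep the float values rather than re-deriving them from integers.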

11.3 Pascal VOC XML

<annotation>
  <folder>images</folder>
  <filename>image001.jpg</filename>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox><xmin>452</xmin><ymin>78</ymin><xmax>929</xmax><ymax>534</ymax></bndbox>
  </object>
</annotation>
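A VOC file like the one above can be read with the standard library alone. The sketch below parses an inline copy of the example. The function name and the returned dictionary layout are choices made for illustration.

```python
import xml.etree.ElementTree as ET

VOC_XML = """
<annotation>
  <folder>images</folder>
  <filename>image001.jpg</filename>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox><xmin>452</xmin><ymin>78</ymin><xmax>929</xmax><ymax>534</ymax></bndbox>
  </object>
</annotation>
"""

def parse_voc(xml_text):
    """Return filename, (width, height) and a list of labeled boxes."""
    root = ET.fromstring(xml_text.strip())
    size = root.find("size")
    objects = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        objects.append({
            "name": obj.findtext("name"),
            "difficult": int(obj.findtext("difficult", default="0")),
            "box": tuple(int(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax")),
        })
    return {
        "filename": root.findtext("filename"),
        "size": (int(size.findtext("width")), int(size.findtext("height"))),
        "objects": objects,
    }

print(parse_voc(VOC_XML))
```

VOC boxes are absolute pixel corners, so converting to YOLO needs the image size from the same file.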

11.4 MOT Format (Multi-Object Tracking)

# gt/gt.txt (one file per sequence, not per image)
# frame_id, track_id, x, y, w, h, confidence, class, visibility
1, 1, 452, 78, 477, 456, 1.0, 1, 1.0
1, 2, 120, 200,  60, 180, 0.9, 2, 0.8
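MOT rows are usually regrouped by track ID before training or evaluating a tracker. A small parser for the column layout above is sketched here. The function name is hypothetical, and the third sample row (frame 2) is invented for illustration.

```python
from collections import defaultdict

MOT_TEXT = """\
1, 1, 452, 78, 477, 456, 1.0, 1, 1.0
1, 2, 120, 200,  60, 180, 0.9, 2, 0.8
2, 1, 460, 80, 477, 456, 1.0, 1, 1.0
"""

def parse_mot(text):
    """Group MOT rows into {track_id: [(frame, x, y, w, h), ...]}."""
    tracks = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # skip blanks and comment lines
        fields = [v.strip() for v in line.split(",")]
        frame, track_id = int(fields[0]), int(fields[1])
        x, y, w, h = (float(v) for v in fields[2:6])
        tracks[track_id].append((frame, x, y, w, h))
    return dict(tracks)

tracks = parse_mot(MOT_TEXT)
for track_id, boxes in sorted(tracks.items()):
    print(track_id, [b[0] for b in boxes])   # frames in which each track appears
```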

11.5 Format Conversion Utilities

# Roboflow — free web tool for format conversion
# Or use supervision library:
# pip install supervision

import supervision as sv

# Load COCO
ds = sv.DetectionDataset.from_coco(
    images_directory_path='images/',
    annotations_path='annotations.json'
)

# Save as YOLO
ds.as_yolo(
    images_directory_path='yolo/images/',
    annotations_directory_path='yolo/labels/',
    data_yaml_path='yolo/data.yaml'
)

# Save as Pascal VOC
ds.as_pascal_voc(
    images_directory_path='voc/images/',
    annotations_directory_path='voc/Annotations/'
)

12. Model Training Pipeline

End-to-End Workflow

1. Collect images
2. Annotate (X-AnyLabeling / CVAT)
3. Export in target format (YOLO / COCO)
4. Split dataset (70/20/10 train/val/test)
5. Augment (Albumentations / YOLOv8 built-in)
6. Train (YOLOv8 / TAO Toolkit / custom PyTorch)
7. Evaluate (mAP@50, mAP@50:95, confusion matrix)
8. Optimize (TensorRT / tinygrad quantization)
9. Deploy (Jetson / DeepStream / ROS2)
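The evaluation step rests on Intersection over Union: under mAP@50 a prediction counts as a true positive when its IoU with a ground-truth box of the same class is at least 0.5. A minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) form follows. The function names are illustrative.

```python
def iou_xyxy(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # zero if the boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """True if the prediction overlaps the ground truth enough to count as a match."""
    return iou_xyxy(pred, gt) >= threshold

gt = (0, 0, 10, 10)
print(round(iou_xyxy((5, 0, 15, 10), gt), 3))   # 0.333 -> miss at IoU 0.5
print(round(iou_xyxy((2, 0, 12, 10), gt), 3))   # 0.667 -> match at IoU 0.5
```

mAP@50:95 repeats the matching at thresholds from 0.5 to 0.95 and averages the results, which rewards tighter localization.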

Augmentation with Albumentations

import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
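    # Positional height/width is the albumentations < 1.4 signature;
    # newer releases take size=(640, 640)
    A.RandomResizedCrop(640, 640, scale=(0.5, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1, p=0.8),
    # var_limit is replaced by std_range in albumentations 2.x
    A.GaussNoise(var_limit=(10, 50), p=0.3),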
    A.MotionBlur(blur_limit=5, p=0.2),
    A.RandomRain(p=0.1),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
], bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']))

13. Projects

Project 1: Full Annotation-to-Deployment Pipeline

Build a warehouse safety detector (person, forklift, hard hat) end-to-end.

1. Collect 500 images from warehouse video
2. Annotate with X-AnyLabeling (YOLOv8 pre-label → human correction)
3. Export as YOLO format, split 70/20/10
4. Fine-tune YOLOv8n for 100 epochs
5. Evaluate: target mAP@50 > 85%
6. Export to TensorRT INT8 for Jetson
7. Run at 30+ FPS on Jetson Orin Nano with ROS2 publisher

Deliverable: GitHub repo + annotated dataset on Roboflow Universe

Project 2: Multi-Camera People Re-Identification

Track the same person across 4 cameras without overlapping FOVs.

Components:
  4 cameras → YOLOv8 person detection
           → Re-ID embedding (OSNet, MobileNetV2 backbone)
           → Hungarian algorithm cross-camera matching
           → Redis store for active identities

Annotation: label 1000 person crops per camera with CVAT
Training: contrastive loss on person pairs (same/different identity)
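Cross-camera matching reduces to an assignment problem on a distance matrix between embeddings. The sketch below is not the Hungarian algorithm: it brute-forces every assignment, which gives the same optimal result for a handful of identities but scales factorially. For real workloads, `scipy.optimize.linear_sum_assignment` is the usual choice. The embeddings, the 0.3 distance gate, and the function names are assumptions.

```python
import math
from itertools import permutations

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def best_assignment(embs_a, embs_b, max_dist=0.3):
    """Minimum-cost pairing of camera-A identities to camera-B identities.

    Assumes len(embs_a) <= len(embs_b). Pairs whose distance exceeds
    max_dist are dropped, since they are probably different people.
    """
    cost = [[cosine_distance(a, b) for b in embs_b] for a in embs_a]
    best_perm, best_cost = (), float("inf")
    for perm in permutations(range(len(embs_b)), len(embs_a)):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return [(i, j) for i, j in enumerate(best_perm) if cost[i][j] <= max_dist]

cam_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cam_b = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0], [0.0, 0.0, 1.0]]
print(best_assignment(cam_a, cam_b))   # [(0, 1), (1, 0)]
```

The third camera-B embedding stays unmatched, which is the case where a new identity would be created in the store.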

Project 3: Aerial Object Detection with Rotated Boxes

Detect vehicles and buildings in satellite imagery using oriented bounding boxes.

Dataset: DOTA dataset (aerial images, 15 categories, rotated boxes)
Tool: X-AnyLabeling DOTA export
Model: YOLOv8-OBB (oriented bounding box variant)
Metric: mAP@50 with rotated IoU

yolo obb train model=yolov8n-obb.pt data=DOTAv1.yaml epochs=100

Project 4: Real-Time Pose Estimation

# YOLOv8 pose — 17 keypoints (COCO skeleton)
from ultralytics import YOLO

model = YOLO('yolov8n-pose.pt')
results = model('video.mp4', stream=True)

for r in results:
    if r.keypoints is not None:
        kpts = r.keypoints.xy.cpu().numpy()   # (N_persons, 17, 2); .cpu() is required when inference runs on GPU
        # kpts[0, 0] = nose, [0, 5] = left shoulder, [0, 11] = left hip
        for person_kpts in kpts:
            nose = person_kpts[0]
            left_shoulder = person_kpts[5]
            right_shoulder = person_kpts[6]
            # Compute angle, detect fall, measure posture
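The "compute angle, detect fall" step can be sketched with plain trigonometry on (x, y) keypoints. Both helpers below are illustrative: the names and sample coordinates are assumptions, and image coordinates are assumed to have y increasing downward.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at point b between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return None                                  # degenerate: two points coincide
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def torso_tilt(shoulder_mid, hip_mid):
    """Torso angle from vertical in degrees: 0 = upright, 90 = lying flat."""
    dx = abs(shoulder_mid[0] - hip_mid[0])
    dy = abs(shoulder_mid[1] - hip_mid[1])
    return math.degrees(math.atan2(dx, dy))

print(joint_angle((0, 0), (0, 1), (1, 1)))   # right angle at the middle point
print(torso_tilt((100, 50), (100, 150)))     # upright torso
print(torso_tilt((50, 100), (150, 100)))     # horizontal torso
```

A simple fall heuristic would flag a person whose torso tilt stays above some threshold for several consecutive frames. The threshold and frame count need tuning on real footage.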

Project 5: 3D Scene Understanding with PointNet

# pip install torch   (plain PyTorch is enough; torch-geometric is not used below)
import torch
import torch.nn as nn

class PointNet(nn.Module):
    """Classify a point cloud into one of N classes."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.mlp1 = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):  # x: (B, 3, N)
        x = self.mlp1(x)
        x = x.max(dim=-1).values   # global max pooling: (B, 1024)
        return self.fc(x)
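
The global max pooling is what lets this network accept an unordered point set: a kernel-size-1 convolution applies the same weights to every point, and taking the maximum over points discards their order. The toy sketch below shows that property in plain Python, with a single random linear layer plus ReLU standing in for `mlp1`. The helper names, layer width, and cloud size are assumptions chosen for illustration.

```python
import random


def shared_mlp(point, weights):
    """Apply the same linear layer + ReLU to one (x, y, z) point."""
    return [max(0.0, sum(w * v for w, v in zip(row, point))) for row in weights]


def global_feature(points, weights):
    """Per-point features followed by a channel-wise max over all points."""
    feats = [shared_mlp(p, weights) for p in points]
    return [max(channel) for channel in zip(*feats)]


rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)]   # 3 -> 8 channels
cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(100)]   # 100 points

shuffled = cloud[:]
rng.shuffle(shuffled)

feature = global_feature(cloud, weights)
feature_shuffled = global_feature(shuffled, weights)
print(feature == feature_shuffled)
```

Because the pooled vector is identical for any ordering of the points, the fully connected classifier that follows sees the same input no matter how the sensor or file format happens to order them.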

14. Resources

Foundational

  • "Computer Vision: Algorithms and Applications" — Richard Szeliski (free online): the most complete CV textbook, covers everything from pinhole cameras to neural nets
  • "Programming Computer Vision with Python" — Jan Erik Solem: practical OpenCV projects
  • CS231n: Convolutional Neural Networks for Visual Recognition (Stanford, free): the canonical CNN course, lecture notes are excellent
  • "Deep Learning for Vision Systems" — Mohamed Elgendy: CNN architectures with practical code

Papers

| Topic | Paper |
|---|---|
| YOLO original | Redmon et al., "You Only Look Once" (CVPR 2016) |
| YOLOv8 | Jocher et al., Ultralytics YOLOv8 (2023) |
| Faster R-CNN | Ren et al., "Faster R-CNN" (NeurIPS 2015) |
| ViT | Dosovitskiy et al., "An Image is Worth 16×16 Words" (ICLR 2021) |
| DETR | Carion et al., "End-to-End Object Detection with Transformers" (ECCV 2020) |
| SAM | Kirillov et al., "Segment Anything" (ICCV 2023) |
| DeepLab v3+ | Chen et al., "Encoder-Decoder with Atrous Separable Convolution" (ECCV 2018) |
| SegFormer | Xie et al., "SegFormer: Simple and Efficient Design for Semantic Segmentation" (NeurIPS 2021) |
| Mask2Former | Cheng et al., "Masked-attention Mask Transformer" (CVPR 2022) |
| BEVFusion | Liu et al., "BEVFusion" (ICRA 2023) |
| ByteTrack | Zhang et al., "ByteTrack" (ECCV 2022) |

Tools and Libraries

  • OpenCV — docs.opencv.org — core library
  • Halide — github.com/halide/Halide — language for fast, portable data-parallel image processing (CPU, GPU, FPGA)
  • Ultralytics YOLOv8 — docs.ultralytics.com — modern detection/segmentation/pose
  • Albumentations — albumentations.ai — augmentation
  • Supervision — supervision.roboflow.com — detection utilities, visualization
  • Roboflow — roboflow.com — dataset hosting, format conversion, augmentation

Annotation Tools Summary

| Tool | Link |
|---|---|
| X-AnyLabeling | github.com/CVHub520/X-AnyLabeling |
| CVAT | app.cvat.ai or github.com/cvat-ai/cvat |
| LabelImg | github.com/HumanSignal/labelImg |
| RectLabel | rectlabel.com |
| VGG VIA | robots.ox.ac.uk/~vgg/software/via |
| Labelbox | labelbox.com |
| Roboflow | roboflow.com |
| COCO Annotator | github.com/jsbroks/coco-annotator |

Conferences

  • CVPR — IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • ICCV — International Conference on Computer Vision
  • ECCV — European Conference on Computer Vision
  • NeurIPS — ML/AI (strong CV track)
