
Lecture 3: Advanced Perception and AI for Robotics

Overview

This lecture has three parts:

  • Part A — Applied perception, tracking, and control (ROS 2): Closing the loop from pixels to actuation with standard messages, estimators, and middleware—what many applied research / perception engineer roles require.
  • Part B — Advanced perception: Deep learning 2D/3D perception, semantics, VIO, and multi-modal sensing for manipulation and navigation.
  • Part C — Robot learning: RL, imitation, foundation models, and whole-body / legged control—how learning stacks sit above or beside classical ROS 2 stacks.

By the end of this lecture you should be able to:

  • Sketch a ROS 2 graph for detector → tracker → planner/controller, with vision_msgs and TF2.
  • Explain Kalman prediction + association for multi-object tracking at a systems level.
  • Name when learned depth fails and geometry (stereo, structure-from-motion) still matters.
  • Describe sim-to-real levers (randomization, latency, calibration) for RL policies.
  • Place LLM / VLM planners as high-level supervisors over ROS 2 primitives.

Part A — Applied perception, tracking, and control (ROS 2 focus)

This block maps detection → tracking → estimation → actuation to ROS 2 primitives: nodes, topics, QoS, launch, ros2 bag, and bridges to non-ROS services.

A.1 Detection in the ROS graph

Goal: Turn a camera or LiDAR stream into stable, typed messages for downstream nodes.

| Piece | Role |
| --- | --- |
| sensor_msgs/Image | Raw image (often compressed with image_transport) |
| cv_bridge | Convert ROS images ↔ OpenCV without copy mistakes |
| vision_msgs/Detection2D / Detection3D | Standard bounding boxes + class ids; use message_filters for sync |

Deployment path: Train in PyTorch → export to ONNX → build a TensorRT engine on Jetson → a thin ROS 2 node that only runs inference and publishes.

A.2 Multi-object tracking (MOT)

Core loop: Predict each track with a Kalman (or constant-velocity) model → associate detections to tracks (Hungarian / IoU / Mahalanobis gating) → create / delete tracks.
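The loop above can be sketched in plain Python/NumPy (no ROS) with a constant-velocity predict step and greedy IoU association; a production tracker would use the Hungarian algorithm (e.g. scipy.optimize.linear_sum_assignment) plus Mahalanobis gating, so treat this as a minimal illustration:

```python
import numpy as np

def predict(track, dt=1.0):
    """Constant-velocity predict for state [x, y, vx, vy]."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    track["x"] = F @ track["x"]
    track["P"] = F @ track["P"] @ F.T + np.eye(4) * 0.1  # process noise (illustrative)
    return track

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, det_boxes, iou_min=0.3):
    """Greedy highest-IoU matching; returns (track_idx, det_idx) pairs."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(track_boxes)
         for di, d in enumerate(det_boxes)),
        reverse=True)
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_min or ti in used_t or di in used_d:
            continue
        matches.append((ti, di))
        used_t.add(ti); used_d.add(di)
    return matches
```

Unmatched detections spawn new tracks; tracks unmatched for N consecutive frames are deleted.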

Learning-assisted trackers (e.g. ByteTrack-style) add association robustness when detections flicker. In ROS 2, publish track IDs and markers for RViz2 so debugging is visual.

Roadmap deep dive: Phase 3 — Multi-Object Tracking guide (Kalman + assignment + ROS 2 patterns).

A.3 State estimation and robot_localization

Same mathematics as Lecture 1, applied to perception-driven systems: fuse wheel odom, IMU, visual odometry, or GPS (outdoor). TF2 must stay consistent—a wrong odom → base_link transform corrupts both tracking and Nav2.

A.4 Semantic layer and behavior trees

Nav2 already uses BehaviorTree.CPP. For semantic goals (“only drive on labeled floor”), feed segmentation into costmap plugins or BT conditions that branch on class labels.
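A hedged sketch of the semantic-costmap idea in plain NumPy (not an actual Nav2 costmap plugin): map per-pixel class labels to costs so that only the floor class is traversable. The class ids and cost values below are illustrative, following the Nav2-style convention of 0 = free, 254 = lethal, 255 = unknown:

```python
import numpy as np

# Hypothetical class ids from a segmentation model.
FLOOR, OBSTACLE, UNKNOWN = 0, 1, 2

# Costmap convention (Nav2-style): 0 = free, 254 = lethal, 255 = unknown.
CLASS_TO_COST = {FLOOR: 0, OBSTACLE: 254, UNKNOWN: 255}

def labels_to_costs(label_grid: np.ndarray) -> np.ndarray:
    """Map a 2D grid of class labels to a cost grid."""
    cost = np.full(label_grid.shape, 255, dtype=np.uint8)
    for cls, c in CLASS_TO_COST.items():
        cost[label_grid == cls] = c
    return cost
```

A real plugin would also handle resolution, frame alignment, and temporal decay of stale labels.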

A.5 Control and visual servoing

Visual servoing: Regulate image features (points, lines) to desired positions; control law outputs twist or joint velocities. Always transform setpoints through TF2 (geometry_msgs) so the controller operates in the correct frame.
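A minimal image-based visual servoing sketch: a proportional law drives the pixel-feature error to zero, and the camera twist comes from the pseudo-inverse of the interaction matrix. This assumes a single point feature in normalized image coordinates with known depth Z; a real controller would then transform the twist into the correct frame via TF2:

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Interaction (image Jacobian) matrix for one normalized point feature."""
    return np.array([
        [-1/Z, 0,    x/Z, x*y,     -(1 + x*x),  y],
        [0,   -1/Z,  y/Z, 1 + y*y, -x*y,       -x],
    ])

def ibvs_twist(feat, feat_des, Z, gain=0.5):
    """Proportional IBVS: camera twist [vx, vy, vz, wx, wy, wz]."""
    e = np.asarray(feat, dtype=float) - np.asarray(feat_des, dtype=float)
    L = interaction_matrix(feat[0], feat[1], Z)
    return -gain * np.linalg.pinv(L) @ e
```

When the feature reaches its desired position the commanded twist is zero; the gain trades convergence speed against overshoot.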

Aerial / PX4: Use PX4 ↔ ROS 2 (micro-ROS / uXRCE-DDS) rather than reinventing the autopilot; ROS 2 supplies missions, offboard setpoints, or perception hooks.

A.6 Monocular geometry

Monocular systems lack absolute scale; IMU or known object size provides scale. VIO packages (ORB-SLAM3, OpenVINS, Kimera-VIO) expose ROS interfaces—treat outputs as noisy and rate-limited for fusion.
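For instance, a known object height resolves metric scale under a pinhole model; a sketch, where f_px is the focal length in pixels:

```python
def depth_from_known_height(f_px: float, real_height_m: float,
                            pixel_height: float) -> float:
    """Pinhole model: Z = f * H / h, with h the object's height in pixels."""
    return f_px * real_height_m / pixel_height
```

With f_px = 600, a 1.8 m object spanning 180 pixels sits at 6 m; the same geometry underlies why scale drifts when no such reference is available.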

A.7 Simulation and sim-to-real

  • Gazebo + ros_gz: ground robots, sensors.
  • PX4 SITL: aerial stacks before flight.

Sim-to-real: Inject latency, noise, and miscalibration in sim before claiming hardware readiness.
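One way to inject latency and noise, as a hedged stdlib sketch: wrap the simulated sensor stream in a delay buffer plus Gaussian noise before the controller or policy sees it:

```python
import random
from collections import deque

class DelayedNoisySensor:
    """Delays scalar readings by `delay_steps` ticks and adds Gaussian noise."""
    def __init__(self, delay_steps=2, noise_std=0.01, seed=0):
        self.buf = deque()
        self.delay_steps = delay_steps
        self.noise_std = noise_std
        self.rng = random.Random(seed)

    def step(self, reading):
        self.buf.append(reading)
        if len(self.buf) <= self.delay_steps:
            return None  # nothing old enough to deliver yet
        stale = self.buf.popleft()
        return stale + self.rng.gauss(0.0, self.noise_std)
```

If the stack still closes the loop with this wrapper active, it has a better chance on hardware, where such delays are unavoidable.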

A.8 Service integration (FastAPI, NATS)

Pattern: ROS 2 owns real-time-ish sensing and control; FastAPI exposes REST/WebSocket for dashboards; NATS fans out events to analytics. Bridge with small nodes; do not starve the DDS thread with blocking HTTP.
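The non-blocking handoff can be sketched with a bounded queue: the (ROS-side) callback only enqueues and drops on overflow, while a separate worker thread does the slow HTTP/NATS publishing. Names here are illustrative, not a specific bridge package:

```python
import queue
import threading

events = queue.Queue(maxsize=100)

def on_ros_message(msg):
    """Fast path: never block the DDS executor; drop if the bridge lags."""
    try:
        events.put_nowait(msg)
    except queue.Full:
        pass  # better to drop telemetry than stall control callbacks

def bridge_worker(publish, stop):
    """Slow path: drain the queue and publish (HTTP/NATS) off-thread."""
    while not stop.is_set():
        try:
            msg = events.get(timeout=0.1)
        except queue.Empty:
            continue
        publish(msg)  # blocking I/O is acceptable here
```

The bounded maxsize is the design choice: backpressure becomes explicit drops on the telemetry side rather than latency on the control side.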

QoS: Tune DDS QoS per stream—best-effort with shallow history for high-rate sensor streams, reliable delivery for commands.

A.9 GPU, Docker, GStreamer

NVIDIA Container Toolkit for GPU nodes in Docker. gscam or pipeline shims when v4l2 is not enough for camera ingest.

How Part A connects to other lectures

| Topic | Where |
| --- | --- |
| Nav2, SLAM, robot_localization | Lecture 1 — Advanced ROS; Lecture 2 — Industrial |
| Multi-robot | Lecture 4 — Multi-Robot |

Projects (Part A)

  1. Detector → tracker → RViz2: vision_msgs + track markers + ros2 bag.
  2. PX4 SITL or Gazebo + Nav2: Compare command latency sim vs hardware.
  3. Bridge: ROS 2 state → FastAPI + NATS events; measure end-to-end delay.

Part B — Advanced perception and AI (expanded track)

B.1 Deep learning–based perception

  • 2D detection: YOLO-family, RT-DETR, etc.—optimize for latency on Jetson (TensorRT).
  • 6D pose: FoundationPose, DenseFusion-style methods for grasp—outputs must feed MoveIt 2 via TF and collision-aware planning.
  • 3D point clouds: Open3D / PCL for classical geometry; PointNet++ / voxel nets for learned segmentation in structured scenes.

B.2 VIO / SLAM

VIO fuses IMU (high rate, biased) with camera (lower rate, rich). Failure modes: motion blur, rolling shutter, textureless regions. Always compare against wheel odometry or mocap when possible.
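Such comparisons are usually reported as absolute trajectory error (ATE); a minimal RMSE sketch over already-time-aligned 2D positions (real evaluations also align the frames with a rigid transform, e.g. Umeyama, before computing the error):

```python
import math

def ate_rmse(est, gt):
    """RMSE of position error between two aligned (x, y) trajectories."""
    assert len(est) == len(gt) and est, "trajectories must match in length"
    sq = [(ex - gx) ** 2 + (ey - gy) ** 2
          for (ex, ey), (gx, gy) in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))
```

Reporting one scalar hides where the error accumulates, so plotting per-pose error alongside the RMSE is good practice.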

B.3 Semantics and scene graphs

Semantic segmentation (e.g. drivable vs obstacle) feeds Nav2 costmaps. 3D scene graphs attach objects and relations for task planning and HRI (“the cup on the left table”).

Open-vocabulary detectors (Grounding DINO, CLIP-based) reduce retraining but require latency and grounding validation on your robot.

B.4 Tactile and multi-modal sensing

Tactile arrays estimate slip and contact; fusion with vision helps in-hand manipulation. Audio can flag collision or motor anomalies—treat as asynchronous cues to supervisors, not hard real-time control unless validated.

Resources (Part B)

  • Open3D
  • Siciliano et al., Robotics: Modelling, Planning and Control
  • Berkeley Robot Sensing (BRS) line of work (papers)

Projects (Part B)

  • 6D pose + grasp: Jetson Orin + table-top objects + MoveIt 2.
  • VIO benchmark: ORB-SLAM3 vs ground truth.
  • Open-vocabulary pick-and-place: language → detection → grasp.

Part C — Robot learning and autonomous behaviors

C.1 Reinforcement learning

Sim-to-real: Randomize dynamics, friction, sensor noise, latency; domain randomization reduces overfitting to one simulator build.
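A hedged sketch of per-episode randomization: draw dynamics and sensing parameters from ranges at every reset. The parameter names and ranges below are illustrative, not tuned values:

```python
import random

RANGES = {                        # illustrative randomization ranges
    "friction":      (0.5, 1.2),
    "mass_scale":    (0.8, 1.2),
    "motor_delay_s": (0.00, 0.04),
    "imu_noise_std": (0.00, 0.02),
}

def sample_episode_params(rng=random):
    """Draw one set of physics/sensing parameters for an episode."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()}
```

Resampling at each reset forces the policy to be robust across the whole range rather than to any single simulator configuration.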

Algorithms: PPO and SAC are common for continuous control; TD-MPC and other model-based variants are more sample-efficient in some setups.

Frameworks: Stable-Baselines3, RLlib, Isaac Lab for GPU-heavy training.

C.2 Imitation and offline RL

Behavior cloning is fragile out of distribution; DAgger reduces covariate shift by mixing expert and policy data. Diffusion Policy outputs smooth multi-modal action trajectories.
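The DAgger idea in a few lines (a sketch that iterates over a fixed list of states rather than stepping a real environment): act with the expert with decaying probability beta, but label every visited state with the expert, then aggregate into the dataset:

```python
import random

def dagger_rollout(states, expert, policy, beta, dataset, rng=random):
    """One DAgger pass: act with expert w.p. beta, always label with expert."""
    for s in states:
        act = expert(s) if rng.random() < beta else policy(s)
        dataset.append((s, expert(s)))  # expert label regardless of who acted
        _ = act  # in a real loop, `act` would step the environment
```

Because the dataset grows with states the *policy* actually visits, the cloned model sees its own mistakes with expert corrections, which is exactly what reduces covariate shift.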

C.3 Foundation models for robotics

Vision-language-action (VLA) models aim to map images + language to actions. In deployment, LLMs often act as high-level planners that call ROS 2 skills (navigate, pick, place)—verify each step with executable checks.
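One form of "executable check", sketched here with illustrative skill names: validate the model's proposed skill sequence against a registry of available ROS 2 skills and their argument counts before dispatching any action call:

```python
# Illustrative skill registry: name -> required argument count.
SKILLS = {"navigate": 1, "pick": 1, "place": 1}

def validate_plan(plan):
    """Return (ok, error) for a plan given as a list of (skill, args) tuples."""
    for i, (skill, args) in enumerate(plan):
        if skill not in SKILLS:
            return False, f"step {i}: unknown skill '{skill}'"
        if len(args) != SKILLS[skill]:
            return False, f"step {i}: '{skill}' expects {SKILLS[skill]} args"
    return True, None
```

Rejecting a malformed plan before execution is cheap; recovering from a half-executed one is not.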

C.4 Whole-body and legged control

WBC coordinates many DoF under constraints (contacts, COM). Legged systems blend model-based (MPC, WBC) with RL policies. Contact-rich manipulation uses hybrid force/position control.

Resources (Part C)

  • Isaac Lab
  • Sutton & Barto, Reinforcement Learning: An Introduction
  • LeRobot (Hugging Face)

Projects (Part C)

  • Sim-to-real locomotion: Isaac Sim → real quadruped (document transfer).
  • Diffusion policy: Teleop demos → train → evaluate on arm.
  • LLM task planner: LLM outputs skill sequence executed via ROS 2 actions/services.

Self-check (whole lecture)

Part A: (1) What does vision_msgs buy you vs a custom float array topic? (2) Name one reason to keep FastAPI out of the critical DDS callback path.

Part B: When does monocular depth fail outdoors at high speed?

Part C: What is one failure mode of behavior cloning without DAgger?
