
Module 4B — ML Engineering & MLOps

Parent: Phase 3 — Artificial Intelligence · Track B

Build and operate ML training/inference pipelines — model lifecycle, data pipelines, serving infrastructure.

Prerequisites: Module 2 (Frameworks — PyTorch fluency), Module 3B (Agentic AI — understand LLM workloads).

Role targets: Machine Learning Engineer · MLOps Engineer · AI/ML Engineer · ML Platform Engineer


Why This Matters for AI Hardware

ML engineers define the training and serving workloads that hardware must support:

  • Distributed training (data/model/pipeline parallelism) → L3: NCCL, multi-GPU runtime
  • Model serving (latency SLAs, throughput targets) → L1a: TensorRT, Triton, vLLM
  • Experiment tracking and model versioning → infrastructure that uses GPU clusters (Phase 5A)
  • Data pipelines → I/O bandwidth requirements that drive GPUDirect Storage (Phase 5B)


1. Training Pipelines

  • Data pipelines: PyTorch DataLoader, NVIDIA DALI, streaming datasets
  • Distributed training: DDP (DistributedDataParallel), FSDP (Fully Sharded Data Parallel), DeepSpeed
  • Mixed precision: torch.amp (formerly torch.cuda.amp), BF16, FP8 training
  • Checkpointing: model saving, resume from checkpoint, elastic training
  • Hyperparameter tuning: Optuna, Ray Tune, grid/random/Bayesian search
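
To make the FP32 vs BF16 trade-off concrete, here is a minimal sketch that rounds a float32 value to BF16 precision using only the standard library. BF16 keeps float32's 8 exponent bits but only 7 explicit mantissa bits, so it is the top 16 bits of the float32 pattern. The function name `round_to_bf16` is illustrative and the code assumes finite inputs; in a real training run the framework's autocast machinery does this casting for you.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round a finite float to the nearest BF16 value (ties to even)."""
    # Reinterpret the value as its 32-bit IEEE 754 bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # BF16 is the upper 16 bits of float32. Adding this bias before
    # masking turns plain truncation into round-to-nearest-even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


if __name__ == "__main__":
    # Near 1.0, BF16 values are spaced 2**-7 apart, so smaller
    # increments are lost while 2**-7 itself survives.
    for value in (1.0, 1.0 + 2**-7, 1.0 + 2**-10, 3.14159265):
        print(f"{value!r:>22} -> {round_to_bf16(value)!r}")
```

Small weight updates that fall below this spacing vanish in BF16, which is one reason mixed-precision setups usually keep an FP32 master copy of the weights.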

Projects:

  1. Train a model with DDP across 2+ GPUs. Measure scaling efficiency.
  2. Add mixed precision training. Compare FP32 vs BF16 training speed and accuracy.
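
For the scaling measurement in project 1, one common definition is efficiency = T_N / (N × T_1), where T_1 is single-GPU throughput and T_N is throughput on N GPUs. The sketch below assumes throughput is measured in samples per second; the function name and the numbers in the demo are hypothetical, not measured results.

```python
def scaling_efficiency(single_gpu_throughput: float,
                       multi_gpu_throughput: float,
                       num_gpus: int) -> float:
    """Return the fraction of ideal linear speedup achieved (1.0 = perfect)."""
    if single_gpu_throughput <= 0:
        raise ValueError("single_gpu_throughput must be positive")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    speedup = multi_gpu_throughput / single_gpu_throughput
    return speedup / num_gpus


if __name__ == "__main__":
    # Hypothetical measurements: 1000 samples/s on 1 GPU, 3600 on 4 GPUs.
    eff = scaling_efficiency(1000.0, 3600.0, 4)
    print(f"scaling efficiency: {eff:.0%}")
```

Measure both throughputs with the same per-GPU batch size and after warm-up steps, otherwise the ratio mixes scaling effects with batch-size effects.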


2. Experiment Tracking & Model Registry

  • Experiment tracking: MLflow, Weights & Biases (W&B), Neptune
  • Model registry: versioning, staging, production promotion
  • Dataset versioning: DVC, LakeFS
  • Reproducibility: environment tracking, seed management, deterministic training
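
Seed management can be sketched as a single helper called at the start of every run. The name `seed_everything` is illustrative. NumPy and PyTorch are seeded only if they happen to be installed, and deterministic kernels are an opt-in assumption here because they can slow training or raise errors for operations that have no deterministic implementation.

```python
import os
import random


def seed_everything(seed: int, deterministic: bool = False) -> None:
    """Seed the random number generators a typical training script touches."""
    random.seed(seed)
    # Only affects Python subprocesses started after this point; the
    # current interpreter's hash seed was fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.

    try:
        import torch
        torch.manual_seed(seed)
        if deterministic:
            # Ask PyTorch to use deterministic algorithms where available.
            torch.use_deterministic_algorithms(True)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed.


if __name__ == "__main__":
    seed_everything(42)
    first = [random.random() for _ in range(3)]
    seed_everything(42)
    second = [random.random() for _ in range(3)]
    print("reproducible:", first == second)
```

Logging the seed alongside hyperparameters in the experiment tracker is what makes a run repeatable later.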

Projects:

  1. Set up MLflow tracking for a training run. Log hyperparameters, metrics, and artifacts.
  2. Register a model in MLflow. Create a staging → production promotion workflow.
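
The staging → production idea can be prototyped before touching a real registry. The sketch below is a toy in-memory registry, not the MLflow API: the class `ToyModelRegistry`, the stage names, and the allowed transitions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical stage transitions: a version must pass through staging
# before it can reach production.
ALLOWED_TRANSITIONS = {
    "none": {"staging"},
    "staging": {"production", "archived"},
    "production": {"archived"},
    "archived": set(),
}


@dataclass
class ToyModelRegistry:
    """Maps model version numbers to their current stage."""
    versions: Dict[int, str] = field(default_factory=dict)

    def register(self) -> int:
        """Add a new version in the initial 'none' stage and return its number."""
        version = len(self.versions) + 1
        self.versions[version] = "none"
        return version

    def promote(self, version: int, target: str) -> None:
        """Move a version to the target stage if the transition is allowed."""
        current = self.versions[version]
        if target not in ALLOWED_TRANSITIONS[current]:
            raise ValueError(f"cannot move v{version} from {current} to {target}")
        if target == "production":
            # Archive the previous production version so only one serves traffic.
            for other, stage in self.versions.items():
                if stage == "production":
                    self.versions[other] = "archived"
        self.versions[version] = target


if __name__ == "__main__":
    registry = ToyModelRegistry()
    v1 = registry.register()
    registry.promote(v1, "staging")
    registry.promote(v1, "production")
    print(registry.versions)
```

Enforcing the transitions in code, rather than by convention, is what prevents an unevaluated model from reaching production.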


3. Model Serving & Inference Infrastructure

  • Serving frameworks: Triton Inference Server, TorchServe, vLLM, TensorRT-LLM
  • Batching: static batching, dynamic batching, continuous batching (for LLMs)
  • Scaling: horizontal (replicas), vertical (larger GPU), model parallelism for large models
  • Monitoring: latency percentiles (p50/p99), throughput (QPS), error rates, GPU utilization
  • A/B testing: canary deployments, shadow mode, traffic splitting
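
Dynamic batching trades a small queueing delay for larger batches. The sketch below simulates that policy on a list of request arrival times: a batch is dispatched when it is full or when the next request arrives after the oldest queued request has waited longer than the allowed delay. The function and parameter names are illustrative, loosely modeled on the batch-size and queue-delay knobs that serving frameworks expose, and a real server would flush on a timer rather than on the next arrival.

```python
from typing import List


def dynamic_batches(arrivals_ms: List[float],
                    max_batch_size: int,
                    max_queue_delay_ms: float) -> List[List[float]]:
    """Group request arrival times into batches under size and delay limits."""
    batches: List[List[float]] = []
    current: List[float] = []
    for t in sorted(arrivals_ms):
        batch_full = len(current) >= max_batch_size
        # current[0] is the oldest request waiting in the open batch.
        waited_too_long = bool(current) and t - current[0] > max_queue_delay_ms
        if current and (batch_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    arrivals = [0, 1, 2, 3, 10, 11, 30]  # hypothetical arrival times in ms
    for batch in dynamic_batches(arrivals, max_batch_size=4, max_queue_delay_ms=5):
        print(batch)
```

Static batching corresponds to waiting for a full batch regardless of delay, while continuous batching for LLMs admits new requests between decoding steps instead of between whole batches.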

Projects:

  1. Deploy a model on Triton with dynamic batching. Load test and measure p50/p99 latency.
  2. Deploy an LLM on vLLM. Compare throughput with different batch sizes and quantization levels.
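
Load-test output usually arrives as a list of per-request latencies. A minimal sketch for turning that into p50/p99 and QPS follows; it uses the nearest-rank percentile definition, which is one of several conventions, and the helper names and sample data are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float], duration_s: float) -> Dict[str, float]:
    """Summarize a load test: median, tail latency, and completed requests per second."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "qps": len(latencies_ms) / duration_s,
    }


if __name__ == "__main__":
    # Hypothetical latencies: 1 ms through 100 ms, collected over 10 seconds.
    print(summarize(list(range(1, 101)), duration_s=10.0))
```

Report p99 alongside p50 because batching and queueing mostly hurt the tail; a mean hides exactly the requests that break a latency SLA.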


4. MLOps & CI/CD for Models

  • CI/CD: GitHub Actions / GitLab CI for model training, testing, deployment
  • Container orchestration: Docker, Kubernetes, NVIDIA GPU Operator
  • Model testing: unit tests for preprocessing, integration tests for inference, data drift detection
  • Feature stores: Feast, Tecton — manage features for training and serving consistency
  • Orchestration: Kubeflow Pipelines, Apache Airflow, Prefect
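
Data drift detection needs a number to alert on. One option, chosen here purely for illustration, is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against recent serving data. The equal-width binning, the `eps` smoothing for empty bins, and the function name are assumptions of this sketch.

```python
import math
from typing import List, Sequence


def population_stability_index(reference: Sequence[float],
                               current: Sequence[float],
                               num_bins: int = 10,
                               eps: float = 1e-6) -> float:
    """PSI = sum over bins of (cur - ref) * ln(cur / ref), using bin fractions."""
    lo, hi = min(reference), max(reference)
    # Fall back to a unit width if every reference value is identical.
    width = ((hi - lo) / num_bins) or 1.0

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * num_bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(idx, 0), num_bins - 1)
            counts[idx] += 1
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


if __name__ == "__main__":
    training = [i / 100 for i in range(100)]
    serving = [v + 0.5 for v in training]  # hypothetical shifted feature
    print("no drift:", population_stability_index(training, training))
    print("shifted: ", population_stability_index(training, serving))
```

The alert threshold is a project-specific choice; calibrate it on historical windows where the model is known to have behaved well.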

Projects:

  1. Build a CI/CD pipeline: on push → train → evaluate → if improved → deploy to Triton.
  2. Set up GPU-enabled Kubernetes with NVIDIA GPU Operator. Deploy a model serving workload.
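
The "if improved" step in project 1 is a small decision function that the CI job can call after evaluation. The sketch below assumes a single scalar metric and a minimum improvement margin to avoid deploying on noise; the name `should_deploy` and the default values are hypothetical.

```python
from typing import Optional


def should_deploy(candidate_metric: float,
                  production_metric: Optional[float],
                  min_improvement: float = 0.0,
                  higher_is_better: bool = True) -> bool:
    """Return True if the candidate beats production by more than the margin."""
    if production_metric is None:
        # Nothing is deployed yet, so the first evaluated model goes out.
        return True
    if higher_is_better:
        delta = candidate_metric - production_metric
    else:
        delta = production_metric - candidate_metric
    return delta > min_improvement


if __name__ == "__main__":
    # Hypothetical accuracy values for a candidate and the current production model.
    print(should_deploy(0.92, 0.90, min_improvement=0.01))   # clear improvement
    print(should_deploy(0.905, 0.90, min_improvement=0.01))  # within the noise margin
```

In a pipeline, the evaluation step can exit with a non-zero status when this returns False so that the deploy step is skipped.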


Resources

Resource | What it covers
MLflow Documentation | Experiment tracking, model registry
Triton Inference Server | Production model serving
vLLM | LLM serving engine
DeepSpeed | Distributed training
Designing Machine Learning Systems (Huyen) | ML systems design

Next

Module 5B — LLM Application Development