Module 4B — ML Engineering & MLOps¶

Parent: Phase 3 — Artificial Intelligence · Track B

Build and operate ML training/inference pipelines — model lifecycle, data pipelines, serving infrastructure.

Prerequisites: Module 2 (Frameworks — PyTorch fluency), Module 3B (Agentic AI — understand LLM workloads).

Role targets: Machine Learning Engineer · MLOps Engineer · AI/ML Engineer · ML Platform Engineer

Why This Matters for AI Hardware¶

ML engineers define the training and serving workloads that hardware must support: - Distributed training (data/model/pipeline parallelism) → L3: NCCL, multi-GPU runtime - Model serving (latency SLAs, throughput targets) → L1a: TensorRT, Triton, vLLM - Experiment tracking and model versioning → infrastructure that uses GPU clusters (Phase 5A) - Data pipelines → I/O bandwidth requirements that drive GPUDirect Storage (Phase 5B)

1. Training Pipelines¶

Data pipelines: PyTorch DataLoader, NVIDIA DALI, streaming datasets
Distributed training: DDP (Data Distributed Parallel), FSDP (Fully Sharded), DeepSpeed
Mixed precision: torch.cuda.amp, BF16, FP8 training
Checkpointing: model saving, resume from checkpoint, elastic training
Hyperparameter tuning: Optuna, Ray Tune, grid/random/Bayesian search

Projects: 1. Train a model with DDP across 2+ GPUs. Measure scaling efficiency. 2. Add mixed precision training. Compare FP32 vs BF16 training speed and accuracy.

2. Experiment Tracking & Model Registry¶

Experiment tracking: MLflow, Weights & Biases (W&B), Neptune
Model registry: versioning, staging, production promotion
Dataset versioning: DVC, LakeFS
Reproducibility: environment tracking, seed management, deterministic training

Projects: 1. Set up MLflow tracking for a training run. Log hyperparameters, metrics, and artifacts. 2. Register a model in MLflow. Create a staging → production promotion workflow.

3. Model Serving & Inference Infrastructure¶

Serving frameworks: Triton Inference Server, TorchServe, vLLM, TensorRT-LLM
Batching: static batching, dynamic batching, continuous batching (for LLMs)
Scaling: horizontal (replicas), vertical (larger GPU), model parallelism for large models
Monitoring: latency percentiles (p50/p99), throughput (QPS), error rates, GPU utilization
A/B testing: canary deployments, shadow mode, traffic splitting

Projects: 1. Deploy a model on Triton with dynamic batching. Load test and measure p50/p99 latency. 2. Deploy an LLM on vLLM. Compare throughput with different batch sizes and quantization levels.

4. MLOps & CI/CD for Models¶

CI/CD: GitHub Actions / GitLab CI for model training, testing, deployment
Container orchestration: Docker, Kubernetes, NVIDIA GPU Operator
Model testing: unit tests for preprocessing, integration tests for inference, data drift detection
Feature stores: Feast, Tecton — manage features for training and serving consistency
Orchestration: Kubeflow Pipelines, Apache Airflow, Prefect

Projects: 1. Build a CI/CD pipeline: on push → train → evaluate → if improved → deploy to Triton. 2. Set up GPU-enabled Kubernetes with NVIDIA GPU Operator. Deploy a model serving workload.

Resources¶

Resource	What it covers
MLflow Documentation	Experiment tracking, model registry
Triton Inference Server	Production model serving
vLLM	LLM serving engine
DeepSpeed	Distributed training
Designing Machine Learning Systems (Huyen)	ML systems design

Next¶

→ Module 5B — LLM Application Development