1. GPU Infrastructure (Phase 5)
Timeline: 12–24 months for the fundamentals; 24–48 months for the advanced phase.
Prerequisites: Phase 4 Track B (Jetson, CUDA stack), Phase 4 Track C (ML compiler + DL inference optimization).
What this track covers
HPC = high-performance computing: solving problems that need massive compute and memory by using many machines (or many GPUs) working together. In AI, "HPC" usually means large-scale training and high-throughput inference on GPU clusters, not single workstations.
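As a rough illustration of why such workloads outgrow a single workstation, the sketch below estimates how many GPUs are needed just to hold a model's weights. The function name `min_gpus_for_weights` and the example figures (70 billion parameters, 2 bytes per parameter, 80 GiB per GPU) are assumptions chosen for illustration, not values from this track. The estimate counts weights only; activations, optimizer state, and KV cache add to the real requirement, so treat the result as a lower bound.

```python
import math


def min_gpus_for_weights(num_params: float, bytes_per_param: int, gpu_mem_gib: float) -> int:
    """Smallest GPU count whose combined memory can hold the model weights alone.

    This is a lower bound: it ignores activations, optimizer state,
    KV cache, and any framework overhead.
    """
    weight_bytes = num_params * bytes_per_param
    gpu_bytes = gpu_mem_gib * 1024**3  # GiB -> bytes
    return max(1, math.ceil(weight_bytes / gpu_bytes))


# Hypothetical inputs: a 70e9-parameter model stored in 16-bit precision
# (2 bytes per parameter) on accelerators with 80 GiB of memory each.
print(min_gpus_for_weights(70e9, 2, 80))  # 140e9 bytes of weights do not fit in one 80 GiB device
```

Once the answer exceeds one device, the interconnect, scheduler, and storage topics in the sub-tracks below become part of the problem.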
This track is organized around the main accelerator platforms you will evaluate in real HPC and large-scale AI work:
| Sub-track | Focus | Guide |
|---|---|---|
| Nvidia GPU | CUDA, NCCL, NVLink/NVSwitch, InfiniBand, GPUDirect, Slurm/K8s, multi-GPU/multi-node clusters | Nvidia GPU → |
| AMD GPU | ROCm, HIP, RCCL, AMD Instinct (MI300X), RDNA/CDNA architecture, porting CUDA workloads | AMD GPU → |
| Distributed AI Interconnects | vLLM, PyTorch distributed, UCX, UCC, NCCL/RCCL, topology-aware multi-node inference, and broken-fabric debugging | Distributed AI Interconnects → |
| Accelerator Platform Evaluation | Practical comparison of NVIDIA GPUs, AMD GPUs, and Google accelerators for training, inference, scaling, tooling, and cost/performance tradeoffs | Accelerator Platform Evaluation → |
For the full CUDA-X library ecosystem (cuBLAS, cuDNN, CUTLASS, TensorRT, NCCL, RAPIDS, and 40+ more), see Phase 5B — High Performance Computing.
How to use this track
- Start with Nvidia GPU — the dominant ecosystem for AI HPC. Covers fundamentals, virtualization, interconnects, storage, distributed training, and performance modeling.
- Add AMD GPU for portability, alternative hardware, and an understanding of the growing AMD AI ecosystem (MI300X, ROCm 6+).
- Study Distributed AI Interconnects when your project crosses vLLM, PyTorch distributed, UCX/UCC, NCCL/RCCL, and non-trivial physical topologies.
- Use Accelerator Platform Evaluation when you need to choose between NVIDIA, AMD, and Google accelerator paths for real workloads and real cluster operations.
These sub-tracks assume you've completed Phase 4 Track C (compiler + inference optimization). The HPC content here focuses on infrastructure, multi-GPU scaling, and distributed systems; the compiler and kernel optimization skills come from Track C.