HPC with Nvidia GPU¶

Timeline: 12–24 months (fundamentals and deep dives); 24–48 months for advanced phase.

Basic concepts: what "HPC with Nvidia GPU" means¶

HPC = high-performance computing: solving problems that need massive compute and memory by using many machines (or many GPUs) working together. In AI, "HPC" usually means large-scale training and high-throughput inference on GPU clusters, not single workstations.

Why Nvidia GPUs? Nvidia GPUs are the dominant hardware for training and deploying large models. They offer the best-supported software stack: CUDA (programming model and runtime), cuDNN and CUTLASS (high-performance kernels — cuDNN for conv/RNN/attention; CUTLASS for customizable GEMM, matrix multiply, and epilogue fusion used by frameworks and custom kernels), TensorRT (inference optimization and deployment), and NCCL (multi-GPU collectives). Add the fastest inter-GPU links (NVLink, NVSwitch) and the architectures (Hopper, Blackwell) that ML frameworks target first, and you see why AI infrastructure and kernel-level optimization work almost always involves Nvidia GPUs in data centers.

What this track covers:

Single GPU → many GPUs — From one GPU (e.g. Jetson, which you saw in Phase 4 Track B) to multi-GPU nodes and multi-node clusters. You need to understand how jobs are placed, how data and gradients move, and how to avoid communication becoming the bottleneck.
Two main workloads:
- Training — One big model, one big dataset; you split work across GPUs (data parallelism, model parallelism, pipeline parallelism). Performance is about throughput (samples/sec) and scaling to hundreds or thousands of GPUs.
- Inference — Many requests, one (or many) deployed model; you care about latency and throughput under load. At scale this means batching, KV-cache, and often multi-GPU or multi-node serving (e.g. TensorRT-LLM, vLLM).
The stack you must understand:
- Hardware: GPUs (A100, H100/H200, L40S, etc.), NVLink/NVSwitch inside a node, InfiniBand or Ethernet across nodes.
- Software: CUDA, drivers, containers (NGC), orchestrators (Slurm, Kubernetes), and collective libraries (NCCL) for multi-GPU communication.
- Storage and I/O: Getting data to GPUs fast (dataloaders, GPUDirect Storage, high-throughput disks) so the GPU is not waiting.

Key terms (used in this track)¶

Order: basic ops → attention → distributed.

Term	Meaning
Matrix multiply	Compute A·B for matrices A, B (often plus bias or activation). The core operation in linear layers and most heavy compute. "GEMM" is the standard name in libraries.
GEMM	General Element-wise Matrix Multiply: C = α(A·B) + βC. The BLAS/cuBLAS/CUTLASS interface for matrix multiply. GPU GEMM kernels (tiling, tensor cores) dominate training and inference time.
Epilogue fusion	In a GEMM kernel, the epilogue is what you do with the result (add bias, ReLU/GELU, write to memory). Fusion = doing that in the same kernel as the multiply. Saves memory bandwidth and launch overhead; CUTLASS and similar libraries support it.
Attention	In transformers: each token has Q, K, V (from linear layers); attention = softmax(Q·K^T/√d)·V. Lets the model focus on relevant tokens. The matmuls are heavy for long sequences; optimized attention kernels and KV-cache are critical.
FlashAttention	A family of attention kernels that reduce memory traffic by tiling and keeping Q,K,V in SRAM, and avoid materializing the full Q·K^T matrix. Faster and more memory-efficient than naive attention; standard in LLM training and inference (e.g. FlashAttention-2, -3).
KV-cache	In transformer attention: keys and values for previous tokens are cached so you don't recompute them. KV-cache = that cache. Long context → huge cache → memory and bandwidth become the bottleneck; paging/sharding and efficient kernels matter.
Data / model / pipeline parallelism	Data: same model on every GPU, different data; sync gradients. Model: split the model across GPUs. Pipeline: different layers on different GPUs, pass activations in a pipeline.
Collectives	Multi-GPU operations: AllReduce (everyone gets the same sum), AllGather (everyone gets all pieces), ReduceScatter. Used to sync gradients (data parallel) or exchange activations (model/pipeline parallel).
NCCL	Nvidia Collective Communications Library. Implements collectives (AllReduce, AllGather, etc.) for multi-GPU training and inference. Often the bottleneck at scale.
NVLink / NVSwitch	NVLink: high-bandwidth GPU↔GPU (and GPU↔CPU) link inside a node. NVSwitch: switch connecting many GPUs in one node. Much faster than PCIe for GPU↔GPU traffic.

What this track covers (after reorganization)¶

This track now focuses on HPC infrastructure and multi-GPU operations. The DL Inference Optimization content (graph optimization, kernel engineering, compiler stack, quantization, inference runtimes, tinygrad deep dive) has moved to Phase 4 Track C — ML Compiler & Graph Optimization as Part 2, since those skills are foundational for all hardware tracks, not just HPC.

Part	Description	Guide
HPC Setup	Fundamentals, virtualization, interconnects, advanced CUDA/distributed training/performance — plus hardware-specific deep dives (8x H200, L40S, NCCL, CUDA Advanced, GPUDirect Storage)	HPC Setup →
~~DL Inference Optimization~~	Moved to Phase 4 Track C Part 2	Track C →

How to use this track¶

Complete Phase 4 Track C first — covers compiler fundamentals and DL inference optimization (graph ops, kernels, compiler stack, quantization, runtimes).
Then HPC Setup — Covers Nvidia GPU HPC fundamentals, virtualization (vGPU, KVM), interconnects and storage (InfiniBand, GDS, Slurm, Kubernetes), and Phase 2 advanced topics (advanced CUDA, distributed training, performance modeling). Use the deep dives (8x H200, L40S, NCCL, CUDA Advanced, GDS) for your target hardware and stack.

Prerequisite: Phase 4 Track B (Jetson, TensorRT, CUDA) and Phase 4 Track C (compiler + inference optimization).