HPC Setup¶

Part of: High Performance Computing — Nvidia GPU (Phase 5)

This guide combines HPC fundamentals, virtualization, interconnects, and advanced topics with hardware-specific deep dives for real-world GPU cluster setup and optimization. Study the fundamentals first, then use the deep dives for your target hardware (8x H200, L40S, NCCL, CUDA, GDS).

Hardware & Stack Deep Dives¶

Detailed guides for specific GPU cluster configurations and subsystems:

Setup	Use Case	Guide
8x H200 SXM5	Large model training & inference (1.1 TB HBM3e, NVLink 4.0)	8x-H200-Training-Inference/
L40S x12	Cost-efficient inference deployment (576 GB GDDR6, PCIe)	L40S-x12-Inference/
NCCL Deep Dive	GPU-to-GPU communication: algorithms, tuning, debugging, 1T-scale	NCCL-Deep-Dive/
CUDA Advanced Optimization	CUDA Graphs, Cooperative Groups, Persistent Kernels, Fusion, Warp Specialization	CUDA-Advanced-Optimization/
GPUDirect Storage (GDS)	Direct NVMe→GPU DMA, NVMe-oF, WD OpenFlex + RapidFlex, libcufile API	GPUDirect-Storage/

8x H200 — Topics¶

01 Hardware Architecture — GH100 die, HBM3e, NVLink 4.0, NVSwitch topology
02 Training Setup — FSDP, DeepSpeed ZeRO-3, Megatron-LM, 3D parallelism
03 Inference Setup — vLLM, TensorRT-LLM, FP8, speculative decoding
04 Memory Management — KV cache, PagedAttention, GQA, profiling
05 Performance Optimization — Roofline, CUDA Graphs, kernel fusion, NCCL tuning
06 Benchmarks & Validation — MFU, MBU, latency/throughput targets

L40S x12 — Topics¶

01 Hardware Architecture — AD102 die, GDDR6, PCIe topology, no-NVLink constraints
02 Inference Optimization — GPTQ, AWQ, FP8 quantization, vLLM, continuous batching
03 Multi-GPU Strategy — PCIe parallelism, pipeline vs tensor parallel, InfiniBand
04 Deployment Guide — systemd, Docker, Kubernetes, NGINX load balancing, monitoring
05 Benchmarks — throughput tables, L40S vs H200 comparison, load testing

NCCL Deep Dive — Topics¶

01 Fundamentals — AllReduce, Broadcast, AllGather, ReduceScatter, AllToAll explained with diagrams
02 Algorithms & Bandwidth — Ring vs Tree vs Double Binary Tree, how 900 GB/s is achieved, bandwidth math
03 Framework Integration — how PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron call NCCL internally
04 Configuration & Tuning — every important env var, per-topology recipes (H200 / L40S / multi-node)
05 Multi-Node Clusters — hierarchical AllReduce, InfiniBand, GPUDirect RDMA, SHARP in-network compute
06 Debugging — hangs, errors, XID codes, fault tolerance, recovery patterns
07 Trillion-Parameter Scale — 3D parallelism NCCL patterns, MoE AllToAll, communication budgets at scale

CUDA Advanced Optimization — Topics¶

01 CUDA Graphs — capture/replay pipelines, PyTorch patterns, bucketing for dynamic shapes, profiling
02 Cooperative Groups — thread block, warp, tiled partition, coalesced, grid-wide sync with examples
03 Persistent Kernels — always-resident GPU workers, GPU-side work queues, zero-overhead dispatch
04 Kernel Fusion — HBM round-trip elimination, Triton, torch.compile, FlashAttention as fusion example
05 Warp Specialization — producer/consumer warpgroups, TMA, WGMMA, software pipelining, CUTLASS 3.x

GPUDirect Storage (GDS) — Topics¶

01 Architecture & Data Path — CPU vs GDS data paths, PCIe topology, NUMA pinning, 3 transport modes
02 Hardware Setup — WD OpenFlex reference config (A100 + CX-7 + SN3700), PCIe layout, version matrix
03 Software Stack — OFED 5.8, GDS 2.17.3, libcufile install, gdscheck verification, cufile.json config
04 libcufile API — cuFileRead/Write, buffer registration, batch I/O, PyTorch DataLoader integration
05 Performance Tuning — 512-byte alignment, optimal transfer size, queue depth, buffer pool, benchmarks
06 Disaggregated Storage — NVMe-oF over RoCEv2, WD OpenFlex + RapidFlex, 75 GB/s scale-out, lossless config

1. Nvidia GPU HPC Fundamentals¶

GPU Architecture for HPC:
- CUDA and Tensor Cores: Master CUDA programming for HPC workloads. Understand Tensor Core utilization for mixed-precision compute (FP16, BF16, TF32) in scientific and AI applications.
- NVLink and NVSwitch: Learn about high-bandwidth GPU interconnect technologies for multi-GPU systems. Understand NVLink topology, NVSwitch for scalable GPU clusters, and bandwidth optimization.
- GPU Memory Hierarchy: Deep dive into GPU memory—global memory, shared memory, L1/L2 cache, and unified memory. Optimize memory access patterns for HPC workloads.
Multi-GPU Programming:
- NCCL (Nvidia Collective Communications Library): Master NCCL for efficient multi-GPU and multi-node collective operations—all-reduce, broadcast, all-gather. Understand NCCL topology detection and tuning for optimal performance.
- CUDA Multi-Process Service (MPS): Learn MPS for sharing GPUs across multiple processes, improving utilization in HPC and inference workloads.
- MPI + CUDA: Combine MPI for distributed computing with CUDA for GPU acceleration. Implement hybrid MPI-CUDA applications for large-scale HPC clusters.

Resources: Nvidia NCCL Documentation · Nvidia vGPU Documentation · "Professional CUDA C Programming" by Cheng et al.

Projects: Implement a Multi-GPU Training Pipeline with NCCL all-reduce; benchmark NVLink vs. PCIe.

2. Virtualization and Cloud HPC (vGPU, KVM)¶

Nvidia vGPU (Virtual GPU):
- vGPU Architecture: Understand vGPU technology for sharing physical GPUs across multiple virtual machines. Learn vGPU types (e.g., vComputeServer, vPC, vApp) and licensing.
- vGPU Deployment: Deploy and configure vGPU on hypervisors. Understand GPU partitioning, time-slicing, and MIG (Multi-Instance GPU) for fine-grained sharing.
- vGPU for HPC and AI: Configure vGPU environments for HPC workloads, ML training, and inference in virtualized data centers.
KVM and GPU Passthrough:
- GPU Passthrough (VFIO): Learn PCIe passthrough for dedicating physical GPUs to VMs. Understand IOMMU groups, VFIO drivers, and SR-IOV for GPU virtualization.
- KVM with Nvidia GPUs: Configure KVM-based virtualization with Nvidia GPUs. Explore nested virtualization and GPU resource management.
- Orchestration: Integrate GPU VMs with Kubernetes, Slurm, or other HPC job schedulers for resource allocation.
Containerization for HPC:
- Nvidia Container Toolkit: Use the Nvidia Container Toolkit to run GPU workloads in Docker and Podman containers.
- Singularity/Apptainer: Deploy HPC applications with Singularity/Apptainer for GPU-accelerated containerized workloads in shared clusters.

Resources: Nvidia vGPU Software Documentation · Linux VFIO and IOMMU Documentation · Nvidia Container Toolkit.

Projects: Deploy a vGPU environment; configure GPU passthrough with KVM (VFIO).

3. HPC Interconnects and Storage¶

High-Speed Interconnects:
- InfiniBand: Master InfiniBand for low-latency, high-bandwidth HPC networking. Understand RDMA (Remote Direct Memory Access), GPUDirect RDMA, and topology design.
- RoCE (RDMA over Converged Ethernet): Explore RoCE for Ethernet-based RDMA in HPC and cloud environments.
- GPUDirect Storage: Learn GPUDirect Storage (GDS) for direct GPU-to-NVMe data access, bypassing CPU for I/O-bound workloads. See GPUDirect-Storage deep dive above.
Parallel File Systems and I/O:
- Lustre and GPFS: Understand parallel file systems for HPC storage. Optimize I/O patterns for large-scale scientific applications.
- DAOS (Distributed Asynchronous Object Storage): Explore DAOS for next-generation HPC storage with native GPU support.
Job Scheduling and Orchestration:
- Slurm with GPU Support: Configure Slurm for GPU resource management, GRES (Generic Resources), and multi-node GPU jobs.
- Kubernetes for HPC/AI: Use Kubernetes with Nvidia GPU operator for orchestrating GPU workloads in hybrid HPC/cloud environments.

Resources: Nvidia GPUDirect Documentation · Slurm GPU Configuration · TOP500 and Green500.

Projects: Build a multi-node GPU cluster with InfiniBand or high-speed Ethernet, NCCL, and Slurm; optimize I/O with GPUDirect Storage.

Phase 2: Advanced HPC (24–48 months)¶

1. Advanced CUDA Programming for HPC¶

CUDA Memory Optimization: Memory access coalescing, shared memory and L1 cache, pinned and unified memory. Analyze with Nsight Compute; restructure layouts (AoS → SoA).
Warp-Level and Thread-Level Optimization: Warp divergence, warp shuffle intrinsics (__shfl_sync, __ballot_sync, __reduce_sync), Tensor Cores (WMMA/CUTLASS).
CUDA Graphs and Streams: Overlap compute and transfer with streams; capture/replay with CUDA Graphs; cooperative groups. See CUDA-Advanced-Optimization deep dive.

Resources: Nsight Compute and Nsight Systems · CUTLASS · "CUDA Programming" by Shane Cook.

Projects: Optimize a GEMM kernel (tiling, shared memory, Tensor Cores) vs cuBLAS; overlapped pipeline with streams; CUDA Graph for inference.

2. Distributed Training and Large-Scale AI¶

Parallel Training Strategies: Data parallelism (NCCL all-reduce), model parallelism (tensor + pipeline), 3D parallelism (DeepSpeed + Megatron).
Frameworks and Infrastructure: PyTorch DDP and FSDP, DeepSpeed ZeRO (1/2/3), Megatron-LM. See 8x-H200 Training Setup and NCCL.
Monitoring and Fault Tolerance: Weights & Biases / TensorBoard; distributed checkpointing; PyTorch Elastic for node failure and resizing.

Resources: DeepSpeed Documentation · Megatron-LM GitHub · "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (paper).

Projects: 3D-parallel training run (>1B params); ZeRO-3 memory analysis; fault-tolerant training with Elastic.

3. HPC Performance Modeling and Application Optimization¶

Roofline Model and Performance Analysis: Compute-bound vs memory-bound; arithmetic intensity (FLOPs/Byte); hierarchical memory rooflines (L1, L2, HBM). See 8x-H200 Performance Optimization.
Scientific Computing Applications: Molecular dynamics (GROMACS, AMBER), CFD (sparse solvers, CG, GMRES), cuFFT for FFT-based applications.
Compiler and Auto-Tuning: TVM for GPU, Triton for custom kernels, cuBLAS/cuDNN tuning (workspace, algorithm selection, math modes).

Resources: Nsight Compute Roofline Analysis · "Programming Massively Parallel Processors" (Kirk & Hwu) · OpenAI Triton Documentation.

Projects: Roofline analysis of a scientific kernel; custom Triton kernel (e.g. layer norm + dropout); auto-tuned cuFFT pipeline.