05 — NCCL in Multi-Node GPU Clusters¶
1. The Two-Level Communication Problem¶
A single node with 8x H200 communicates at NVLink speeds (900 GB/s). Adding a second node introduces a fundamentally slower link between them.
Node 0 (8x H200): Node 1 (8x H200):
GPU0 ──┐ GPU8 ──┐
GPU1 ──┤ GPU9 ──┤
GPU2 ──┤ NVLink 900 GB/s GPU10──┤ NVLink 900 GB/s
GPU3 ──┤ (intra-node) GPU11──┤ (intra-node)
GPU4 ──┤ GPU12──┤
GPU5 ──┤ GPU13──┤
GPU6 ──┤ GPU14──┤
GPU7 ──┘ GPU15──┘
│ │
└────────── InfiniBand HDR ──────────┘
200 Gb/s = 25 GB/s
(inter-node)
Speed ratio: NVLink / IB = 900 / 25 = 36×
→ Inter-node bandwidth is the critical bottleneck for multi-node training
2. Hierarchical AllReduce¶
NCCL automatically uses hierarchical AllReduce when mixing NVLink and network:
Step 1: Intra-node Reduce-Scatter (NVLink, fast)
Each node reduces its 8 GPUs' gradients to per-GPU shards
Node0: [G0, G1, G2, G3, G4, G5, G6, G7] → Node0 GPU0 holds G0_reduced
→ Node0 GPU1 holds G1_reduced
→ ...
Step 2: Inter-node AllReduce (InfiniBand, slow but minimized)
Only N/world_size fraction of data crosses IB
Node0 GPU0 ↔ Node1 GPU0: exchange G0_reduced shards
Node0 GPU1 ↔ Node1 GPU1: exchange G1_reduced shards (in parallel)
...
Step 3: Intra-node AllGather (NVLink, fast)
Reconstruct full reduced gradient on all GPUs within each node
Total data sent over IB = S × N_nodes / (N_nodes × N_gpus) × 2 = S × 2 / N_gpus
For 2 nodes × 8 GPUs: IB carries only 2/8 = 25% of gradient data
This hierarchy is why NCCL achieves nearly optimal bandwidth even with mixed interconnects.
3. InfiniBand Setup¶
Hardware Requirements¶
| Component | Specification | Notes |
|---|---|---|
| HCA | Mellanox ConnectX-7, HDR 200 Gb/s | One per GPU recommended |
| Switch | Mellanox Quantum-2 (NDR 400 Gb/s) or Quantum (HDR 200 Gb/s) | Fat-tree topology |
| Cables | QSFP112 (NDR) or QSFP56 (HDR) | Length matters for signal integrity |
| Topology | Fat-tree, Dragonfly, or Torus | Fat-tree most common |
Software Installation¶
# Install MLNX_OFED (Mellanox OpenFabrics Enterprise Distribution)
wget https://www.mellanox.com/downloads/ofed/MLNX_OFED-23.10-3.2.2.0/...
./mlnxofedinstall --all --without-fw-update
# Verify installation
ibstat # shows HCA status and port info
ibv_devinfo # InfiniBand device details
ib_write_bw # bandwidth test
# Check IB link speed
ibstat | grep -A2 "Port 1"
# State: Active
# Physical state: LinkUp
# Rate: 200 Gb/s ← HDR
GPUDirect RDMA — GPU Memory Directly Over InfiniBand¶
Without GPUDirect RDMA:
GPU HBM → CPU RAM (PCIe) → NIC buffer (PCIe) → IB cable → NIC → CPU RAM → GPU HBM
← PCIe copy → ← CPU involvement → ← PCIe copy →
Latency: 2 PCIe transfers + CPU processing = ~10 µs extra
With GPUDirect RDMA:
GPU HBM → IB cable → GPU HBM
↑ direct DMA ↑
No CPU involved, no PCIe double-copy
Latency reduction: ~5 µs, bandwidth: full IB speed to GPU memory
# Enable GPUDirect RDMA
export NCCL_NET_GDR_LEVEL=5 # enable for all GPU-IB distances
export NCCL_IB_HCA=mlx5_0 # specify HCA connected closest to GPU (check with nvidia-smi topo)
# Verify GPUDirect is working
nv_peer_mem status # should show "Module is loaded"
# or
ibv_devinfo -d mlx5_0 | grep dm_size # non-zero = GPUDirect works
# Test with NCCL tests across nodes
mpirun -np 16 \
-H node0:8,node1:8 \
-x NCCL_NET_GDR_LEVEL=5 \
-x NCCL_IB_HCA=mlx5_0 \
-x NCCL_DEBUG=INFO \
./build/all_reduce_perf -b 1G -e 8G -f 2 -g 1
4. SHARP — In-Network Computing¶
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) moves the AllReduce computation into the InfiniBand switch itself, reducing GPU and network load.
Without SHARP:
Node0 GPU → IB switch → Node1 GPU → AllReduce in GPU → back over IB
Data travels full round-trip
With SHARP:
Node0 GPU → IB switch → (AllReduce computed inside switch) → each GPU gets result
Data travels half the distance, reduction done in-switch
SHARP Performance Gains¶
AllReduce of 1 GB across 16 nodes (128 GPUs):
Without SHARP: 1 GB × 2(N-1)/N × 1/IB_BW = 1 GB × 1.98 / 25 GB/s = 79 ms
With SHARP: ~40 ms (50% reduction, especially for small-medium message sizes)
SHARP Configuration¶
# SHARP requires: MLNX_OFED with SHARP support + Mellanox Quantum switch
# Enable SHARP in NCCL
export NCCL_ALGO=CollNetDirect # use SHARP-based CollNet algorithm
# or
export NCCL_ALGO=CollNetChain
# Verify SHARP is active
export NCCL_DEBUG=INFO
# Look for: "NCCL INFO Using SHARP" in output
# SHARP works best for: 1 KB - 128 MB messages, small-medium clusters
# For very large messages (>1 GB): Ring AllReduce still better due to switch memory limits
5. Multi-Node Launch Patterns¶
torchrun (Elastic Distributed Training)¶
# Node 0 (master):
torchrun \
--nnodes=4 \ # 4 nodes total
--nproc_per_node=8 \ # 8 GPUs per node
--node_rank=0 \ # this node's rank
--master_addr=192.168.1.100 \ # node 0 IP
--master_port=29500 \
train.py
# Nodes 1, 2, 3 (identical, change --node_rank):
torchrun \
--nnodes=4 \
--nproc_per_node=8 \
--node_rank=1 \ # 1 for node 1, 2 for node 2, etc.
--master_addr=192.168.1.100 \
--master_port=29500 \
train.py
MPI Launch (for nccl-tests and HPC workflows)¶
# Using OpenMPI
mpirun \
-np 32 \ # 32 total processes (4 nodes × 8 GPUs)
--host node0:8,node1:8,node2:8,node3:8 \
-x NCCL_DEBUG=INFO \
-x NCCL_IB_HCA=mlx5_0 \
-x NCCL_NET_GDR_LEVEL=5 \
-bind-to none \ # don't pin to CPU cores (GPU-heavy workload)
./build/all_reduce_perf -b 1G -e 8G -f 2 -g 1
Slurm + torchrun (HPC Cluster)¶
#!/bin/bash
# sbatch script: train_job.sh
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --time=48:00:00
#SBATCH --partition=gpu-cluster
# Export NCCL settings
export NCCL_IB_HCA=mlx5_0,mlx5_1
export NCCL_NET_GDR_LEVEL=5
export NCCL_IB_TC=106
export NCCL_DEBUG=WARN
export NCCL_TIMEOUT=600
# Get master address from Slurm node list
export MASTER_ADDR=$(scontrol show hostname "$SLURM_NODELIST" | head -1)
export MASTER_PORT=29500
srun torchrun \
--nnodes=$SLURM_NNODES \
--nproc_per_node=$SLURM_NTASKS_PER_NODE \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
--node_rank=$SLURM_NODEID \
train.py
# Submit: sbatch train_job.sh
6. Multi-Node Communication Optimization¶
Minimize Inter-Node Traffic¶
Principle: keep as much communication as possible on NVLink (intra-node)
Strategy for 3D Parallelism across nodes:
TP (tensor parallel): keep WITHIN a node (NVLink, frequent small all-reduces)
PP (pipeline parallel): split ACROSS nodes (IB, infrequent large activation tensors)
DP (data parallel): split ACROSS nodes (IB, one all-reduce per step, large)
Example: 2 nodes × 8 GPUs = 16 GPUs total
TP=8 (within each node), PP=2 (one stage per node), DP=1
→ Only pipeline activations cross IB (much smaller than gradient all-reduce)
# Megatron-LM 3D parallel setup for 2 nodes
# Node 0: GPU 0-7 = stage 0 of pipeline, fully tensor-parallel
# Node 1: GPU 8-15 = stage 1 of pipeline, fully tensor-parallel
# torchrun launch:
torchrun \
--nnodes=2 \
--nproc_per_node=8 \
pretrain_gpt.py \
--tensor-model-parallel-size 8 \ # TP=8, stays within node
--pipeline-model-parallel-size 2 \ # PP=2, crosses nodes
--data-parallel-size 1 # DP=1
Gradient Compression for Slow Inter-Node Links¶
# PowerSGD: low-rank gradient approximation
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD
state = powerSGD.PowerSGDState(
process_group=dist.get_world_group(),
matrix_approximation_rank=16, # lower = more compression, less quality
start_powerSGD_iter=100, # start compression after 100 iters (warmup)
warm_start=True,
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)
# Compression ratio: 8192×8192 matrix (536 MB) → rank-16 approx:
# U: 8192×16 + V: 8192×16 = 4 MB → 134× compression
# Typical loss in quality: < 0.5% on perplexity
Gradient Accumulation to Amortize Communication¶
# Accumulate gradients over multiple microbatches before AllReduce
# This reduces communication frequency at the cost of more memory
GRAD_ACCUM_STEPS = 8 # accumulate 8 microbatches before AllReduce
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
# Only sync every GRAD_ACCUM_STEPS
with model.no_sync() if (step % GRAD_ACCUM_STEPS != GRAD_ACCUM_STEPS - 1) else nullcontext():
loss = model(**batch).loss / GRAD_ACCUM_STEPS
loss.backward()
if (step + 1) % GRAD_ACCUM_STEPS == 0:
# AllReduce fires here — effective batch size is 8× larger
optimizer.step()
optimizer.zero_grad()
7. Benchmarking Multi-Node NCCL¶
# 2-node AllReduce benchmark (16 GPUs)
mpirun -np 16 -H node0:8,node1:8 \
-x NCCL_NET_GDR_LEVEL=5 \
-x NCCL_IB_HCA=mlx5_0 \
./build/all_reduce_perf -b 8M -e 4G -f 2 -g 1
# Expected results (H200 × 16, IB HDR):
# size algbw busbw
# 8 MB 12 GB/s 22 GB/s ← small: latency bound
# 256 MB 22 GB/s 41 GB/s ← medium
# 4 GB 24 GB/s 45 GB/s ← large: approaches IB peak (~25 GB/s unidirectional)
# Test latency for small ops (important for inference)
./build/all_reduce_perf -b 8 -e 4096 -f 2 -g 1
# Target: < 10 µs for 8-byte messages (control messages in inference serving)