02 — NCCL Algorithms & How It Achieves 900 GB/s¶

1. The Core Problem: AllReduce Algorithms¶

There are multiple ways to implement AllReduce across N GPUs. NCCL picks the best one based on topology and message size. Understanding them explains why NCCL performance varies with GPU count and buffer size.

2. Ring AllReduce¶

The most bandwidth-efficient algorithm for large messages. Used by NCCL on NVLink systems for messages > ~1 MB.

Step-by-Step with 4 GPUs, tensor [A, B, C, D]¶

Phase 1: Reduce-Scatter (N-1 steps)

Each GPU holds one chunk. In each step, each GPU sends its current chunk to the next GPU and receives the next chunk.

Initial state:
  GPU0: [A0, B0, C0, D0]
  GPU1: [A1, B1, C1, D1]
  GPU2: [A2, B2, C2, D2]
  GPU3: [A3, B3, C3, D3]

Step 1 (each GPU sends chunk i to GPU i+1, receives chunk i-1):
  GPU0: sends D0 → GPU1, receives C3 from GPU3 → computes [C2+C3]
  GPU1: sends A1 → GPU2, receives D0 from GPU0 → computes [D0+D1]
  GPU2: sends B2 → GPU3, receives A1 from GPU1 → computes [A1+A2]
  GPU3: sends C3 → GPU0, receives B2 from GPU2 → computes [B2+B3]

Step 2:
  GPU0: [B1+B2+B3], GPU1: [C2+C3+C0], ...

After N-1=3 steps:
  GPU0 has fully reduced chunk D: [D0+D1+D2+D3]
  GPU1 has fully reduced chunk A: [A0+A1+A2+A3]
  GPU2 has fully reduced chunk B: [B0+B1+B2+B3]
  GPU3 has fully reduced chunk C: [C0+C1+C2+C3]

Phase 2: AllGather (N-1 steps)

Each GPU distributes its fully-reduced chunk to all others.

Step 1:
  GPU0 sends [D_sum] → GPU1
  GPU1 sends [A_sum] → GPU2
  GPU2 sends [B_sum] → GPU3
  GPU3 sends [C_sum] → GPU0

After N-1=3 steps: all GPUs have all chunks
  GPU0: [A_sum, B_sum, C_sum, D_sum]  ✓
  GPU1: [A_sum, B_sum, C_sum, D_sum]  ✓
  ...

Ring AllReduce Bandwidth Analysis¶

Data sent per GPU per step: S/N  (S = tensor size, N = GPU count)
Number of steps: 2(N-1)
Total data sent per GPU: 2(N-1) × S/N ≈ 2S (for large N)

Bus bandwidth used: S × 2(N-1)/N / time
Fraction of peak bandwidth used: (N-1)/N → approaches 1.0 as N grows

For 8 GPUs: (8-1)/8 = 87.5% of peak NVLink bandwidth
For 16 GPUs: 93.75% of peak

This is why Ring AllReduce is "bandwidth optimal" — it uses nearly all available bandwidth regardless of GPU count.

3. Tree AllReduce¶

Better for small messages or when latency matters more than bandwidth. NVSwitch clusters often use this.

Reduce phase (gather to root):
       GPU0  (root)
      /    \
   GPU1    GPU2
   /  \
GPU3  GPU4

Step 1: GPU3→GPU1, GPU4→GPU1, GPU2→GPU0 (all parallel)
Step 2: GPU1→GPU0

AllGather phase (reverse direction):
Step 1: GPU0→GPU1, GPU0→GPU2
Step 2: GPU1→GPU3, GPU1→GPU4

Latency: O(log N) steps vs O(N) for Ring. Bandwidth: Less efficient than Ring for large messages — data must pass through intermediate nodes.

NCCL switches between Ring and Tree based on message size automatically.

4. Double Binary Tree (NCCL's Default for NVSwitch)¶

NCCL uses two overlapping binary trees that share the reduction work. This achieves full bandwidth utilization with O(log N) latency — combining the best of both worlds.

Tree A (left-leaning):
    0
   / \
  1   2
 / \
3   4

Tree B (right-leaning, complement):
    4
   / \
  3   0
     / \
    2   1

Reduce-Scatter uses Tree A, AllGather uses Tree B.
Every link is used in both trees → 100% bandwidth utilization
Every GPU participates equally → no root bottleneck

This is why NCCL 2.x on NVSwitch systems is so efficient.

5. How NCCL Achieves 900 GB/s on 8x H200¶

H200 NVLink 4.0 specs: - 18 NVLink 4.0 lanes per GPU - 50 GB/s bidirectional per lane - Total: 900 GB/s bidirectional per GPU

Why Full Bandwidth Is Achievable¶

Full NVSwitch mesh topology:

Every GPU has a direct NVLink path to every other GPU (no hops).

GPU0 ←→ GPU1: 900 GB/s direct
GPU0 ←→ GPU2: 900 GB/s direct
GPU0 ←→ GPU7: 900 GB/s direct

With Ring AllReduce across 8 GPUs: - Each GPU sends/receives S/8 per step - 2(N-1) = 14 steps total - Each step uses the full NVLink bandwidth between adjacent ring GPUs

Measured bus bandwidth formula:

bus_bandwidth = algbw × (2 × (N-1) / N)

where algbw = (message_size / time_taken)

For N=8: bus_bw = algbw × (14/8) = algbw × 1.75

Measured with nccl-tests:

./build/all_reduce_perf -b 1G -e 8G -f 2 -g 8

# Output columns:
# size   algbw(GB/s)   busbw(GB/s)
# 1 GB   257 GB/s      449 GB/s
# 4 GB   280 GB/s      490 GB/s
# 8 GB   291 GB/s      509 GB/s  ← approaches 512 GB/s = 900 GB/s × 8/8 × 0.57

# Why not 900 GB/s?
# Bus bandwidth per GPU (900 GB/s) is bidirectional peak
# AllReduce uses each link for both send+receive simultaneously
# 509 GB/s busbw ≈ 57% of 900 GB/s (bidirectional divided by 2 for half-duplex accounting)
# Full-duplex: 509 × 2 = 1018 GB/s ≈ ~900 GB/s per GPU ✓

6. Bandwidth Roofline for AllReduce¶

AllReduce communication time = (2 × S × (N-1)) / (N × BW)

where:
  S  = tensor size in bytes
  N  = number of GPUs
  BW = per-GPU NVLink bandwidth (one direction)

For H200 (BW = 450 GB/s unidirectional per GPU):
  8 GPUs, 1 GB gradient:
  Time = (2 × 1 GB × 7) / (8 × 450 GB/s)
       = 14 / 3600
       = 3.9 ms

  This matches measured values (~4 ms for 1 GB on 8x H200).

7. Algorithm Selection in NCCL¶

NCCL picks the algorithm based on message size and topology:

Message size:
  < 1 KB:     NCCL uses simple P2P (direct GPU-to-GPU send/recv)
  1 KB - 1 MB: Tree algorithm (lower latency for small messages)
  > 1 MB:     Ring algorithm (maximum bandwidth for large messages)

Topology:
  NVSwitch (full mesh): Double Binary Tree preferred
  PCIe ring: Ring AllReduce preferred
  Multi-node: Hierarchical (NVLink intra-node, IB inter-node)

Override manually:

export NCCL_ALGO=Ring         # force Ring AllReduce
export NCCL_ALGO=Tree         # force Tree AllReduce
export NCCL_PROTO=Simple      # protocol: Simple / LL / LL128

8. Protocols: Simple, LL, LL128¶

NCCL uses three protocols optimized for different scenarios:

Protocol	Optimized For	Mechanism
LL (Low Latency)	Small messages, latency	Packs data + flag in 8-byte chunks, polling
LL128	Medium messages	128-byte chunks, better bandwidth
Simple	Large messages, bandwidth	Standard DMA transfers, best bandwidth

# NCCL auto-selects based on message size
# Force protocol manually:
export NCCL_PROTO=Simple      # best for training (large gradients)
export NCCL_PROTO=LL          # best for inference token sync (small, latency-sensitive)

9. Measuring Your Actual NCCL Bandwidth¶

# Install nccl-tests
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests
make -j MPI=0 CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr

# AllReduce bandwidth sweep (1 MB to 8 GB)
./build/all_reduce_perf \
    -b 1M -e 8G -f 2 \
    -g 8 \                   # 8 GPUs
    -n 20 \                  # 20 iterations
    -w 5                     # 5 warmup iterations

# AllGather (important for FSDP parameter gathering)
./build/all_gather_perf -b 1M -e 8G -f 2 -g 8

# ReduceScatter (important for ZeRO gradient sharding)
./build/reduce_scatter_perf -b 1M -e 8G -f 2 -g 8

# Broadcast (model weight distribution)
./build/broadcast_perf -b 1M -e 8G -f 2 -g 8

Interpreting output:

#                                                              out-of-place                       in-place
#       size         count    type   redop    root     time   algbw    busbw #wrong     time   algbw    busbw #wrong
#        (B)    (elements)                             (us)  (GB/s)   (GB/s)            (us)  (GB/s)   (GB/s)
   1048576       262144   float     sum      -1    382.1     2.75     4.81      0    381.5     2.75     4.81      0
  67108864     16777216   float     sum      -1   2431.6    27.60    48.30      0   2426.5    27.66    48.41      0
 536870912    134217728   float     sum      -1  14985.5   358.28   626.98      0  14894.2   360.47   630.82      0

# algbw: algorithm bandwidth = size / time
# busbw: bus bandwidth = algbw × 2(N-1)/N  ← the number to compare against NVLink spec
# Target busbw for 8x H200: > 450 GB/s for large messages

10. Gradient Communication as a Bottleneck¶

Training step time = compute_time + communication_time

Without overlap:
  Total = T_compute + T_comm

With overlap (async AllReduce during backward):
  Total = max(T_compute, T_comm) + T_serial_overhead

Overlap efficiency = T_comm / T_compute
  If T_comm < T_compute: communication is completely hidden (ideal)
  If T_comm > T_compute: communication is the bottleneck

For H200 (1979 TFLOPS BF16) training 70B model, batch=4, seq=2048:
  T_compute ≈ 6 × 70B params × 4 × 2048 tokens / 1979 TFLOPS ≈ ~1.7 ms/step
  T_comm (gradient AllReduce, 140 GB BF16): 140e9 / (450e9 × 7/8) = ~355 ms/step

  → Communication dominates massively!
  → This is why ZeRO-3 is necessary: it replaces AllReduce with ReduceScatter (only S/N data)