
04 — NCCL Configuration & Tuning

1. The Most Important Environment Variables

NCCL is configured almost entirely through environment variables, and these settings often affect performance more than any code change.

Debugging and Logging

# Log level: VERSION < WARN < INFO < TRACE
export NCCL_DEBUG=INFO          # recommended while bringing up / validating a cluster
export NCCL_DEBUG=WARN          # quieter; only warnings and errors
export NCCL_DEBUG=TRACE         # very verbose; use only for debugging

# Filter debug output to specific subsystems
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG_SUBSYS=INIT   # topology detection and communicator setup
export NCCL_DEBUG_SUBSYS=GRAPH  # algorithm and ring/tree selection
export NCCL_DEBUG_SUBSYS=TUNING # algorithm performance tuning

# Write debug output to a file (one per rank)
export NCCL_DEBUG_FILE=/tmp/nccl_debug_%h_%p.txt
# %h = hostname, %p = process ID

Reading NCCL_DEBUG=INFO output:

NCCL INFO Bootstrap: Using [0] eth0:192.168.1.100<0>
NCCL INFO NCCL version 2.20.5+cuda12.2
NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7
NCCL INFO Ring 00 : 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 0
NCCL INFO Trees [0] 3/-1/-1->0->-1|0->1/3/-1
NCCL INFO Setting affinity for GPU 0 to ffff
  • Channel 00/08: 8 channels (parallel rings) for throughput
  • Ring 00: the ring order (GPU IDs) for channel 0
  • Trees [0]: the tree structure for the tree algorithm

Algorithm and Protocol

# Force algorithm (NCCL auto-selects by default)
export NCCL_ALGO=Ring           # Ring AllReduce
export NCCL_ALGO=Tree           # Tree AllReduce
export NCCL_ALGO=CollNetDirect  # CollNet (in-network computing, requires switch support)
export NCCL_ALGO=CollNetChain   # CollNet chain variant
export NCCL_ALGO=NVLS           # NVLink SHARP (H100/H200 NVSwitch only)
export NCCL_ALGO=NVLSTree       # NVLink SHARP tree variant

# Force protocol
export NCCL_PROTO=Simple        # best for large messages (>512 KB)
export NCCL_PROTO=LL128         # medium messages
export NCCL_PROTO=LL            # latency-optimized for small messages

# Number of channels (parallel rings)
export NCCL_MIN_NCHANNELS=4     # minimum channels
export NCCL_MAX_NCHANNELS=8     # maximum channels (default: auto)
# More channels = higher bandwidth for large messages
# Too many channels = wasted SM resources for small messages

P2P (GPU-to-GPU) Communication

# Use NVLink for P2P (strongly recommended for NVLink systems)
export NCCL_P2P_LEVEL=NVL      # P2P only over NVLink
export NCCL_P2P_LEVEL=SYS      # P2P everywhere, even across sockets (NVLink + PCIe)
export NCCL_P2P_LEVEL=PHB      # P2P within the same NUMA node only
export NCCL_P2P_LEVEL=LOC      # never use P2P
export NCCL_P2P_DISABLE=1      # disable P2P entirely (fallback to shared memory or network)

# SHM (shared memory) transport for same-host fallback
export NCCL_SHM_DISABLE=0      # enable (default)
export NCCL_SHM_DISABLE=1      # disable (force network transport)

# Buffer size for SHM
export NCCL_BUFFSIZE=8388608   # 8 MB (default is 4 MB)
# Larger buffer = better bandwidth for large messages
# Smaller buffer = less memory used

InfiniBand (Multi-Node)

# Select InfiniBand HCA (Host Channel Adapter)
export NCCL_IB_HCA=mlx5_0                    # specific HCA
export NCCL_IB_HCA=mlx5_0,mlx5_1            # multiple HCAs (bonding)
export NCCL_IB_HCA=^mlx5_2                  # exclude specific HCA

# Enable/disable InfiniBand
export NCCL_IB_DISABLE=0                     # enable IB (default)
export NCCL_IB_DISABLE=1                     # force Ethernet

# GPUDirect RDMA level
export NCCL_NET_GDR_LEVEL=0   # never use GPUDirect RDMA (LOC)
export NCCL_NET_GDR_LEVEL=1   # NIC and GPU on the same PCIe switch (PIX)
export NCCL_NET_GDR_LEVEL=2   # connected through PCIe switches (PXB)
export NCCL_NET_GDR_LEVEL=5   # any topology (recommended when IB is present)
# Newer NCCL versions also accept the path names LOC/PIX/PXB/PHB/SYS directly

# Traffic class (for QoS on IB fabric)
export NCCL_IB_TC=106         # IB traffic class (maps to DSCP on RoCE fabrics)
export NCCL_IB_SL=0           # service level (0-15)
export NCCL_IB_GID_INDEX=3    # GID index (3 = RoCEv2 for Ethernet-based RDMA)

# Timeout (seconds) for IB operations
export NCCL_IB_TIMEOUT=23     # 23 ≈ 34 seconds (4.096 µs × 2^23)
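The timeout value is exponential, which makes it easy to misread; a small helper (ours, applying the InfiniBand local-ACK formula of 4.096 µs × 2^value) converts it to seconds:

```python
def ib_timeout_seconds(value: int) -> float:
    """Convert an NCCL_IB_TIMEOUT value to seconds (IB spec: 4.096 us * 2**value)."""
    return 4.096e-6 * (2 ** value)

for v in (18, 20, 23):
    print(v, f"{ib_timeout_seconds(v):.2f}s")  # 18 -> 1.07s, 20 -> 4.29s, 23 -> 34.36s
```

Each increment doubles the timeout, so small changes to this value matter.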

Timeout and Fault Tolerance

# NCCL itself has no generic watchdog variable; hang detection is handled by
# the PyTorch process-group timeout (default: 30 minutes), set at init time:
#   dist.init_process_group('nccl', timeout=datetime.timedelta(minutes=5))

# Async error handling: surface NCCL failures as exceptions instead of hanging
export NCCL_ASYNC_ERROR_HANDLING=1        # older PyTorch releases
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1  # renamed variable in PyTorch >= 2.2
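On the PyTorch side, the hang-detection timeout is passed to init_process_group rather than read from an NCCL variable. A minimal sketch (the helper name is ours):

```python
from datetime import timedelta

def nccl_pg_kwargs(timeout_minutes: int = 5) -> dict:
    """Kwargs for dist.init_process_group with a shorter hang-detection
    timeout (the PyTorch default is 30 minutes)."""
    return {"backend": "nccl", "timeout": timedelta(minutes=timeout_minutes)}

# Usage inside a torchrun-launched script:
#   import torch.distributed as dist
#   dist.init_process_group(**nccl_pg_kwargs(5))
print(nccl_pg_kwargs(5)["timeout"].total_seconds())  # 300.0
```

A shorter timeout turns a silent multi-hour hang into a fast, debuggable failure.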

2. Topology-Specific Tuning Recipes

Recipe: NVSwitch System (e.g., 8x H100/H200)

# Maximum performance on an NVSwitch system
export NCCL_P2P_LEVEL=NVL
export NCCL_NET_GDR_LEVEL=5
export NCCL_ALGO=NVLS               # NVLink SHARP (in-switch reduction)
# or
export NCCL_ALGO=Tree               # Double binary tree (if NVLS unavailable)
export NCCL_MIN_NCHANNELS=8         # use all 8 channels
export NCCL_BUFFSIZE=16777216       # 16 MB buffer
export NCCL_DEBUG=WARN              # production verbosity

Recipe: PCIe-Only System

# PCIe-constrained system
export NCCL_P2P_LEVEL=SYS          # use PCIe P2P
export NCCL_SHM_DISABLE=0          # use shared memory for same-host
export NCCL_ALGO=Ring              # Ring preferred for PCIe
export NCCL_MAX_NCHANNELS=4        # fewer channels (PCIe can't support many)
export NCCL_BUFFSIZE=8388608       # 8 MB
export NCCL_SOCKET_IFNAME=eth0     # correct network interface

Recipe: Multi-Node with InfiniBand (HDR 200 Gb/s)

export NCCL_IB_HCA=mlx5_0,mlx5_1  # use both IB ports
export NCCL_NET_GDR_LEVEL=5        # GPUDirect RDMA
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3         # RoCEv2
export NCCL_IB_TC=106
export NCCL_IB_SL=0
export NCCL_SOCKET_IFNAME=ib0
export NCCL_IB_TIMEOUT=23          # generous IB retry timeout (~34 s) for large clusters

# Hierarchical: NVLink intra-node, IB inter-node
# NCCL detects this automatically with correct IB + NVLink configuration
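NCCL reads these variables when the communicator is created, so recipes must be applied before init_process_group. One way to keep a recipe in code (values taken from the NVSwitch recipe above; the helper itself is ours):

```python
import os

# Values from the NVSwitch recipe above
NVSWITCH_RECIPE = {
    "NCCL_P2P_LEVEL": "NVL",
    "NCCL_NET_GDR_LEVEL": "5",
    "NCCL_MIN_NCHANNELS": "8",
    "NCCL_BUFFSIZE": str(16 * 1024 * 1024),  # 16 MB
    "NCCL_DEBUG": "WARN",
}

def apply_recipe(recipe: dict) -> None:
    """Export NCCL settings; must run before dist.init_process_group."""
    os.environ.update(recipe)

apply_recipe(NVSWITCH_RECIPE)
print(os.environ["NCCL_BUFFSIZE"])  # 16777216
```

Setting these after the process group is initialized has no effect on existing communicators.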

3. Topology Detection and Verification

Checking Detected Topology

# NCCL topology detection (set DEBUG=INFO before your job)
export NCCL_DEBUG=INFO
cat > /tmp/nccl_topo_check.py <<'EOF'
import os, torch
import torch.distributed as dist

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
dist.init_process_group('nccl')
dist.barrier()  # first collective forces communicator creation, printing the topology
dist.destroy_process_group()
EOF
torchrun --nproc_per_node=8 /tmp/nccl_topo_check.py 2>&1 | grep -E "Ring|Tree|Channel|NVLink|IB"

Expected output for 8x H200:

NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7     ← 8 parallel channels
NCCL INFO Ring 00 : 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 0
NCCL INFO Using NVLink  ← NVLink detected
NCCL INFO Trees [0] 3/-1/-1->0->-1|0->1/3/-1

Expected output for 8x L40S (PCIe):

NCCL INFO Channel 00/04 : 0 1 2 3 4 5 6 7     ← 4 channels (PCIe limited)
NCCL INFO Ring 00 : 0 -> 2 -> 4 -> 6 -> 1 -> 3 -> 5 -> 7 -> 0  ← non-sequential (PCIe-aware)
NCCL INFO Using shared memory for GPU-to-GPU  ← SHM fallback for same-host

nvidia-smi Topology Check

# Full topology matrix
nvidia-smi topo -m

# Example for 8x H200:
#         GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
# GPU0      X  NV18 NV18 NV18 NV18 NV18 NV18 NV18
# GPU1    NV18   X  NV18 NV18 NV18 NV18 NV18 NV18
# ...
# NV18 = 18 NVLink links = full NVLink 4.0 connectivity

# PCIe system:
#         GPU0 GPU1 GPU2 GPU3
# GPU0      X  PXB  PHB  SYS
# GPU1    PXB   X   PHB  SYS
# PXB = same PCIe switch (fast), PHB = host bridge (slower), SYS = cross NUMA (slow)
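The link types form a rough speed hierarchy; a helper (ours, a qualitative heuristic rather than measured speeds) can flag GPU pairs on slow paths:

```python
# Qualitative ranking of nvidia-smi topo link types (lower = faster).
# NVn = n NVLink links; PIX/PXB = PCIe switch(es); PHB = host bridge;
# NODE/SYS = cross-NUMA paths.
def link_rank(link: str) -> int:
    if link.startswith("NV"):
        return 0
    return {"PIX": 1, "PXB": 2, "PHB": 3, "NODE": 4, "SYS": 5}.get(link, 6)

# Hypothetical pairs read off a `nvidia-smi topo -m` matrix:
pairs = {("GPU0", "GPU1"): "NV18", ("GPU0", "GPU3"): "SYS"}
slow = [p for p, link in pairs.items() if link_rank(link) >= 3]
print(slow)  # pairs crossing a host bridge or NUMA boundary
```

Pairs that cross a host bridge or NUMA boundary are the ones most likely to bottleneck ring traffic.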

4. Bandwidth Tuning by Message Size

NCCL performance varies significantly with message size. Tune based on your gradient/parameter sizes.

# Find your typical gradient tensor sizes
python - <<'EOF'
from transformers import LlamaForCausalLM, LlamaConfig
import torch

config = LlamaConfig(hidden_size=8192, num_hidden_layers=80)
with torch.device("meta"):  # avoid allocating ~140 GB of real weights
    model = LlamaForCausalLM(config)

total_bytes = 0
for name, param in model.named_parameters():
    size_mb = param.numel() * 2 / 1e6  # BF16 = 2 bytes
    print(f"{name:60s}: {size_mb:.1f} MB")
    total_bytes += param.numel() * 2

print(f"\nTotal: {total_bytes/1e9:.1f} GB")
# This tells you the AllReduce size for DDP
# For ZeRO-3: each AllGather is parameter_size / world_size
EOF

Tuning for Small Gradients (< 1 MB)

# Small gradients: prioritize latency
export NCCL_PROTO=LL          # low-latency protocol
export NCCL_ALGO=Tree         # tree has better latency (log N steps)
export NCCL_MAX_NCHANNELS=1   # single channel (less overhead)

Tuning for Large Gradients (> 100 MB)

# Large gradients: maximize bandwidth
export NCCL_PROTO=Simple      # best bandwidth
export NCCL_ALGO=Ring         # bandwidth-optimal
export NCCL_MIN_NCHANNELS=8   # all channels for parallelism
export NCCL_BUFFSIZE=16777216  # 16 MB buffer
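The two recipes above can be summarized as a size-based selector. This is purely an illustration of the thresholds in this section; NCCL's built-in tuner also weighs topology and scale:

```python
def suggest_env(message_bytes: int) -> dict:
    """Map a message size to the env settings suggested in this section.
    Illustrative only: NCCL's internal tuner is more nuanced."""
    if message_bytes < 1 * 1024 * 1024:       # < 1 MB: latency-bound
        return {"NCCL_PROTO": "LL", "NCCL_ALGO": "Tree", "NCCL_MAX_NCHANNELS": "1"}
    if message_bytes > 100 * 1024 * 1024:     # > 100 MB: bandwidth-bound
        return {"NCCL_PROTO": "Simple", "NCCL_ALGO": "Ring", "NCCL_MIN_NCHANNELS": "8"}
    return {}                                  # mid-sized: let NCCL auto-select

print(suggest_env(512 * 1024))  # small gradient -> LL/Tree
```

In practice, forcing NCCL_ALGO/NCCL_PROTO is worth it only after benchmarking shows the auto-selection is wrong for your message sizes.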

5. DDP Bucket Size Tuning

Bucket size controls how gradients are grouped for AllReduce. Larger buckets = fewer NCCL calls = better bandwidth utilization.

from torch.nn.parallel import DistributedDataParallel as DDP

# Default: 25 MB per bucket
# For large models on NVLink: increase significantly
model = DDP(
    model,
    device_ids=[rank],
    bucket_cap_mb=200,             # 200 MB per bucket: fewer, larger AllReduces
    gradient_as_bucket_view=True,  # avoid an extra buffer copy
    static_graph=True,             # safe when the graph does not change across iterations
)

# Find optimal bucket size:
# Too small: many small NCCL calls → high overhead
# Too large: all-or-nothing sync → less overlap with backward
# Sweet spot: ~1-2 buckets covering the largest parameter groups
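A quick back-of-envelope check for how many AllReduce launches a given cap implies (the helper is ours):

```python
import math

def num_buckets(total_grad_mb: float, bucket_cap_mb: int) -> int:
    """Approximate number of AllReduce launches DDP issues per backward pass."""
    return math.ceil(total_grad_mb / bucket_cap_mb)

# ~70B parameters in BF16 ~= 140,000 MB of gradients
for cap_mb in (25, 200):
    print(cap_mb, num_buckets(140_000, cap_mb))  # 25 -> 5600, 200 -> 700
```

Going from 25 MB to 200 MB buckets cuts launches by 8x, at the cost of coarser overlap with the backward pass.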

6. Profile NCCL Operations

# Profile with Nsight Systems (include NCCL in the trace)
# Note: inline comments after a trailing backslash would break the command
nsys profile \
    --trace=cuda,nvtx,nccl \
    --stats=true \
    --output=nccl_profile \
    torchrun --nproc_per_node=8 train.py

# View nccl_profile.nsys-rep in nsys-ui
# NCCL ops appear as colored blocks in the GPU timeline
# Look for:
#   ncclAllReduce → how long does it take?
#   GPU idle between AllReduce and next compute → communication gap
# Mark NCCL calls with NVTX for easier profiling
import torch
import torch.cuda.nvtx as nvtx
import torch.distributed as dist

def profile_allreduce(tensor, name="grad"):
    nvtx.range_push(f"AllReduce:{name}")
    dist.all_reduce(tensor)
    torch.cuda.synchronize()  # make the range cover the actual GPU work
    nvtx.range_pop()
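When reading the profile, a common sanity check is converting a measured AllReduce time into bus bandwidth, using the same formula nccl-tests reports (algbw × 2(n-1)/n for ring AllReduce); the helper below is ours:

```python
def allreduce_busbw_gbps(nbytes: int, seconds: float, world_size: int) -> float:
    """Bus bandwidth in GB/s, as reported by nccl-tests: algbw * 2*(n-1)/n."""
    algbw = nbytes / seconds                      # bytes/s of payload per rank
    return algbw * 2 * (world_size - 1) / world_size / 1e9

# Example: a 1 GiB AllReduce across 8 GPUs measured at 5 ms
print(round(allreduce_busbw_gbps(1 << 30, 5e-3, 8), 1))  # 375.8
```

Compare the result against the fabric's expected bandwidth (NVLink vs PCIe vs IB) to tell whether the collective is actually link-limited.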

References