06 - Debugging NCCL Failures in Production
1. The Three Categories of NCCL Failures
Category 1: HANGS (most common in production)
  All processes freeze. No error and no timeout, unless a process-group
  timeout is configured (see "Timeout" in section 4).
  Root cause: one or more processes never reach a collective.
Category 2: ERRORS (second most common)
  A process crashes with ncclInternalError, ncclSystemError, etc.
  Root cause: network issues, GPU errors, software bugs.
Category 3: SILENT CORRUPTION (hardest to detect)
  Training runs but produces wrong results.
  Root cause: ECC errors, NVLink errors, numerical overflow.
2. First Response Checklist
When NCCL hangs or crashes, run this checklist immediately:
# 1. Check GPU health on all nodes
nvidia-smi -q | grep -E "ECC|Uncorrected|Temperature|Power"
nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total --format=csv
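# Non-zero uncorrected ECC errors = hardware problem
# 2. Check NVLink errors (for NVLink systems)
nvidia-smi nvlink --errorcounters -i 0 # repeat for each GPU index (0-7 on an 8-GPU node)
# Counters that are non-zero, and especially ones that keep growing between checks, point to a link problem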
# 3. Check dmesg for PCIe or GPU hardware errors
dmesg | grep -E "NVRM|nvidia|PCIe|AER" | tail -50
# Look for: "PCIe Bus Error", "IOMMU", "Xid" (NVIDIA error codes)
# 4. Check NCCL debug output (restart with NCCL_DEBUG=INFO to capture)
export NCCL_DEBUG=INFO
export NCCL_DEBUG_FILE=/tmp/nccl_%h_%p.log
# Rerun job, collect all log files from all nodes
# 5. Check if all ranks are alive
ps aux | grep python # are all 8 processes still running?
# Kill orphan processes before retry:
pkill -f "torchrun|python.*train" # pkill takes an extended regex, so alternation is a plain |
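The ECC query in step 1 is easy to automate across a fleet. The following is a minimal sketch, assuming the CSV layout that `nvidia-smi --format=csv` prints for the two fields queried above; the function names are hypothetical helpers, not part of any NVIDIA tooling.

```python
import csv
import io
import subprocess

QUERY = "index,ecc.errors.uncorrected.volatile.total"


def parse_ecc_csv(csv_text):
    """Return {gpu_index: uncorrected_error_count} from nvidia-smi CSV output.

    GPUs with ECC disabled report a non-numeric value such as "[N/A]";
    those are mapped to None.
    """
    counts = {}
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        index, value = row[0].strip(), row[1].strip()
        counts[int(index)] = int(value) if value.isdigit() else None
    return counts


def gpus_with_ecc_errors(counts):
    """Return the sorted GPU indices whose uncorrected error count is non-zero."""
    return sorted(index for index, count in counts.items() if count)


def query_local_gpus():
    """Run nvidia-smi on this node and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    return parse_ecc_csv(result.stdout)
```

On a GPU node, `gpus_with_ecc_errors(query_local_gpus())` returns the indices that need attention; an empty list means no uncorrected errors were reported.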
3. Diagnosing Hangs
Pattern 1: One Rank is Missing
Symptom: Job hangs immediately at dist.init_process_group().
Node 0: Rank 0, 1, 2, 3, 4, 5, 6, 7 all call ncclCommInitRank()
        Each rank waits for all 16 ranks to connect
Node 1: Only Rank 8, 9, 10, 11, 12, 13, 14 start (Rank 15 never launches)
→ All 15 other ranks hang indefinitely waiting for Rank 15
Diagnosis:
# Check all processes are running
for node in node0 node1; do
    ssh "$node" "ps aux | grep python | grep -v grep | wc -l"
done
# Should show 8 on each node
# Check master port is reachable
nc -zv $MASTER_ADDR $MASTER_PORT
# Should connect; if not, firewall issue
Fix: ensure all processes launch, fix firewall rules, check Slurm job allocation.
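The port check can also run from inside the launcher, before `init_process_group()` is called, so a firewall problem fails fast instead of hanging. This is a minimal sketch using only the standard library; `port_reachable` is a hypothetical helper name.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Typical use on a non-master node, assuming the usual torchrun variables:
#   import os
#   ok = port_reachable(os.environ["MASTER_ADDR"], int(os.environ["MASTER_PORT"]))
```

The check only succeeds once the master process is already listening, so a failure right at job start may just mean rank 0 has not come up yet; retrying a few times before giving up avoids false alarms.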
Pattern 2: Hang During Training (Not Init)
Symptom: Training runs for N steps then freezes at a specific layer/operation.
# Attach gdb to find where each rank is stuck
gdb -p $(pgrep -f "python train.py" | head -1)
(gdb) thread apply all bt # show stack trace for all threads
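# Or use Python's faulthandler (add this near the top of train.py)
import faulthandler
import signal
faulthandler.enable() # dumps stack traces on fatal signals (SIGSEGV, SIGABRT, SIGBUS, ...)
faulthandler.register(signal.SIGUSR1, all_threads=True) # dump on SIGUSR1 and keep running
# Then signal all processes to print their stack traces.
# Without the register() call above, SIGUSR1 terminates the process.
kill -USR1 $(pgrep -f "python train.py")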
# Check which NCCL operation is pending
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL # COLL logs every collective call
# Look for the last NCCL op logged before the hang:
# AllReduce → gradient sync hang
# AllGather → FSDP parameter gather hang
# Broadcast → model weight broadcast hang
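With one log file per rank, the rank that fell behind can be found mechanically. The sketch below assumes collective calls are logged as `NCCL INFO <Op>: opCount <hex> ...`, which is what NCCL prints when `NCCL_DEBUG_SUBSYS` includes `COLL` in the versions I am aware of; adjust the regex if your logs differ. The helper names are hypothetical, and `opCount` is only comparable between ranks that share a communicator.

```python
import re

# Assumed log line shape (NCCL_DEBUG=INFO, NCCL_DEBUG_SUBSYS=COLL):
#   host:pid:tid [0] NCCL INFO AllReduce: opCount 1a sendbuff 0x... count 1024 ...
COLL_RE = re.compile(r"NCCL INFO (\w+): opCount ([0-9a-fA-F]+)")


def last_collective(log_text):
    """Return (op_name, op_count) for the last collective in a log, or None."""
    last = None
    for match in COLL_RE.finditer(log_text):
        last = (match.group(1), int(match.group(2), 16))  # opCount is hex
    return last


def find_lagging_ranks(logs_by_rank):
    """Return (lagging_ranks, last_ops).

    A rank is lagging if the last opCount it logged is lower than the
    highest opCount logged by any rank. Ranks with no collectives logged
    are treated as furthest behind.
    """
    last_ops = {rank: last_collective(text) for rank, text in logs_by_rank.items()}
    counts = {rank: (op[1] if op else -1) for rank, op in last_ops.items()}
    newest = max(counts.values())
    lagging = sorted(rank for rank, count in counts.items() if count < newest)
    return lagging, last_ops
```

A lagging rank is usually the one to attach gdb to first: the other ranks are waiting inside the collective it never entered.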
Common cause: unbalanced control flow — one rank takes a different code path and skips a collective:
# WRONG: can hang if the condition differs across ranks
if loss.item() > threshold:  # loss.item() may differ between ranks!
    dist.all_reduce(some_tensor)  # only some ranks execute this
# CORRECT: synchronize the condition first
is_above = torch.tensor(int(loss.item() > threshold), device="cuda")
dist.all_reduce(is_above, op=dist.ReduceOp.SUM)  # every rank runs this collective
if is_above.item() > 0:  # now all ranks agree
    dist.all_reduce(some_tensor)
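Pattern 3: Network Hang (Multi-Node)
# Test basic network connectivity between nodes
ssh node1 ping -c 4 node0
# Test IB connectivity (ibping addresses a port by LID, not by hostname)
ibstat mlx5_0 1 | grep "Base lid" # run on node1 to find its port LID
ibping -S -C mlx5_0 -P 1 # start server on node1
ibping -C mlx5_0 -P 1 <node1_lid> # ping from node0
# Expect round-trip times of tens of microseconds; ibping uses management
# datagrams, so measure RDMA latency with ib_write_lat instead
# Test bandwidth
ib_write_bw -d mlx5_0 # start server on node1
ib_write_bw -d mlx5_0 node1 # run on node0, point to node1
# Should show ~23 GB/s for HDR 200 Gb/s
# Check for dropped IB packets
perfquery -C mlx5_0 -P 1 # shows error counters for the local port
# Non-zero SymbolErrorCounter, LinkDownedCounter, or PortRcvErrors = physical link problem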
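Scanning `perfquery` output by eye across many nodes is error-prone. Here is a minimal sketch that extracts the non-zero error counters, assuming the usual `Name:.....value` layout that `perfquery` prints; the counter list and function name are illustrative choices, not an official classification.

```python
import re

# Counters that usually indicate link or fabric trouble when non-zero.
ERROR_COUNTERS = (
    "SymbolErrorCounter",
    "LinkErrorRecoveryCounter",
    "LinkDownedCounter",
    "PortRcvErrors",
    "PortRcvRemotePhysicalErrors",
    "PortRcvSwitchRelayErrors",
    "PortXmitDiscards",
    "LocalLinkIntegrityErrors",
    "ExcessiveBufferOverrunErrors",
)

# Matches lines such as "SymbolErrorCounter:..............12"
COUNTER_RE = re.compile(r"^(\w+):\.*(\d+)\s*$", re.MULTILINE)


def nonzero_error_counters(perfquery_output):
    """Return {counter_name: value} for error counters that are non-zero."""
    found = {}
    for name, value in COUNTER_RE.findall(perfquery_output):
        if name in ERROR_COUNTERS and int(value) > 0:
            found[name] = int(value)
    return found
```

Because these counters accumulate since the last reset, comparing two readings taken a few minutes apart is more informative than a single snapshot.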
4. Common NCCL Errors and Fixes
ncclSystemError: System/Network Issue
# Likely causes:
# 1. Out of shared memory
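df -h /dev/shm # how full is it?
ls -la /dev/shm/ # look for stale nccl-* files
# NCCL uses /dev/shm for intra-node communication; if it is full, communication fails
# Fix (only when no other job is running on the node):
rm /dev/shm/nccl-* # clean stale NCCL shared memory files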
# 2. InfiniBand device error
ibstat | grep State
# Should show "Active" for all ports
# 3. CUDA out of memory during NCCL buffer allocation
# NCCL can need a few hundred MB per GPU for communication buffers; the exact
# amount depends on the number of communicators and channels
# Check free GPU memory with nvidia-smi before launching
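The shared-memory check can be folded into a pre-flight script. This is a minimal sketch using the standard library; the `nccl-*` file pattern comes from the cleanup command above, and the function names are hypothetical.

```python
import glob
import os
import shutil


def shm_usage_fraction(path="/dev/shm"):
    """Return the fraction (0.0 to 1.0) of the filesystem at path that is used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def stale_nccl_files(directory="/dev/shm"):
    """List files matching nccl-* in directory (candidates for cleanup)."""
    return sorted(glob.glob(os.path.join(directory, "nccl-*")))


# Example pre-flight check before launching a job:
#   if shm_usage_fraction() > 0.9:
#       print("WARNING: /dev/shm is nearly full:", stale_nccl_files())
```

The sketch only reports candidates; deleting them is left to the operator, since files that belong to a job still running on the node must not be removed.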
ncclInvalidArgument: Wrong Usage
Common causes: tensor not contiguous, wrong data type, mismatched sizes across ranks.
# Ensure tensors are contiguous before NCCL
tensor = tensor.contiguous()
dist.all_reduce(tensor)
# Ensure all ranks have the same tensor size
assert tensor.numel() == expected_size, f"Rank {rank}: got {tensor.numel()}, expected {expected_size}"
# Ensure data types match
tensor = tensor.to(torch.float32) # BF16 collectives need NCCL 2.10 or newer; fall back to FP32 on older versions
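The assertion above only helps if `expected_size` is known in advance. An alternative is to compare what each rank is actually about to send. The sketch below shows only the comparison step as a pure function; in a real job the per-rank metadata could be collected with `torch.distributed.all_gather_object`, and `find_mismatched_ranks` is a hypothetical helper name.

```python
from collections import Counter


def find_mismatched_ranks(metas):
    """Return the ranks whose tensor metadata differs from the majority.

    metas[i] is a hashable description of rank i's tensor, for example
    (tuple(tensor.shape), str(tensor.dtype)).
    """
    majority, _ = Counter(metas).most_common(1)[0]
    return [rank for rank, meta in enumerate(metas) if meta != majority]


# Sketch of how the metadata might be gathered (requires an initialized
# process group; shown as a comment because it cannot run standalone):
#   meta = (tuple(tensor.shape), str(tensor.dtype))
#   metas = [None] * dist.get_world_size()
#   dist.all_gather_object(metas, meta)
#   bad = find_mismatched_ranks(metas)
```

Gathering Python objects is slow compared with a tensor collective, so a check like this belongs behind a debug flag rather than in the hot path.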
ncclUnhandledCudaError: GPU Error
# Sanity-check that a fresh process can create a CUDA context and run a kernel
python -c "import torch; torch.zeros(1, device='cuda'); torch.cuda.synchronize(); print('CUDA OK')"
# Run with CUDA error checking enabled
CUDA_LAUNCH_BLOCKING=1 torchrun ... # makes CUDA errors synchronous and easier to trace
# Check for GPU hardware errors
nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total --format=csv
# Non-zero = bad GPU; replace hardware
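Timeout (Long-Running Hangs)
# PyTorch process groups accept a timeout for collectives
import datetime
import torch.distributed as dist
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(seconds=300),  # 5-minute timeout
)
# After 300 s of a hung collective, the NCCL watchdog aborts the communicator and
# raises an error with a "Watchdog caught collective operation timeout" message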
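A process-group timeout tells you that a collective hung, but not where the Python code was at the time. A complementary sketch, using only the standard library, arms `faulthandler.dump_traceback_later` around each training step so that a slow step dumps every thread's stack. `step_watchdog` is a hypothetical helper, and the timeout value is an assumption to tune per workload.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def step_watchdog(timeout_s, file=None):
    """Dump all thread stacks if the wrapped block runs longer than timeout_s.

    The dump repeats every timeout_s seconds until the block finishes.
    The process is not killed; this only adds diagnostics.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        faulthandler.cancel_dump_traceback_later()


# Usage inside a training loop (train_step is a placeholder):
#   for batch in loader:
#       with step_watchdog(120):
#           train_step(batch)
```

Setting the watchdog shorter than the process-group timeout means the stack dump lands in the logs before the job is torn down.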
5. XID Codes: NVIDIA GPU Error IDs
When dmesg shows NVRM: Xid ..., the NVIDIA driver is reporting a GPU error:
dmesg | grep "Xid"
# Examples:
# Xid 43: GPU stopped processing → Usually a fault in the user application
# Xid 48: DBE (uncorrected) → Double-bit ECC error → GPU memory failing
# Xid 74: NVLink error → NVLink hardware problem
# Xid 79: GPU has fallen off the bus → GPU no longer reachable over PCIe
# Xid 94: Contained ECC error → Uncorrectable ECC error confined to one application
# Xid 119: GSP RPC timeout → Driver-firmware communication failure
| XID | Meaning | Action |
|---|---|---|
| 43 | GPU stopped processing | Usually an application fault; debug the application before suspecting hardware |
| 48 | Uncorrected (double-bit) ECC error | GPU memory failing; reset the GPU and replace it if errors recur |
| 74 | NVLink error | Check nvidia-smi nvlink --errorcounters, check cables |
| 79 | GPU has fallen off the bus | Check PCIe seating, power, and cooling; reboot the node; replace the GPU if it recurs |
| 94 | Contained ECC error | Restart the affected application; monitor ECC counters |
| 119 | GSP timeout | Driver issue, reload driver or reboot |
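To triage a node quickly, Xid events can be pulled out of the kernel log and counted per GPU. This is a minimal sketch, assuming the common `NVRM: Xid (PCI:<bus id>): <code>, ...` line format; `count_xid_events` is a hypothetical helper name.

```python
import re
from collections import Counter

# Example of the assumed format:
#   [1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def count_xid_events(dmesg_text):
    """Return a Counter mapping (pci_bus_id, xid_code) to occurrence count."""
    events = Counter()
    for bus_id, code in XID_RE.findall(dmesg_text):
        events[(bus_id, int(code))] += 1
    return events


# Usage on a node (needs permission to read the kernel log):
#   import subprocess
#   text = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
#   for (bus_id, code), n in count_xid_events(text).most_common():
#       print(f"GPU {bus_id}: Xid {code} x{n}")
```

The same code repeating on one bus ID points at that GPU or its slot, while the same code across every GPU in a node more often suggests a driver or platform issue.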
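# Enable ECC so memory errors are detected and reported (single-bit errors are
# corrected, double-bit errors raise Xid 48) instead of silently corrupting data
sudo nvidia-smi -e 1 # enable ECC (takes effect after a reboot or GPU reset)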
6. Debugging NCCL with NVTX and Nsight
# Add NVTX markers around the DDP forward pass for timeline visualization
import torch
import torch.cuda.nvtx as nvtx
class DebugDDP(torch.nn.parallel.DistributedDataParallel):
    # _run_ddp_forward is a private hook; its name may change between PyTorch versions
    def _run_ddp_forward(self, *args, **kwargs):
        nvtx.range_push("DDP_forward")
        try:
            return super()._run_ddp_forward(*args, **kwargs)
        finally:
            nvtx.range_pop()  # close the range even if forward raises
# Profile with Nsight Systems:
nsys profile \
    --trace=cuda,nvtx \
    --output=nccl_debug \
    torchrun --nproc_per_node=8 train.py
# NCCL annotates its collectives with NVTX ranges, so they show up under the nvtx trace
# Open nccl_debug.nsys-rep: NCCL ops appear as colored blocks
# Look for: long gaps between compute and NCCL = communication bottleneck
7. Fault Tolerance: Recovering from NCCL Failures
PyTorch Elastic Training (Automatic Recovery)
# torchrun with elastic training: automatically restarts on node failure
#   --nnodes=4:8          accept 4-8 nodes (elastic range)
#   --max_restarts=3      retry up to 3 times
#   --rdzv_backend=etcd   use etcd for rendezvous (fault-tolerant)
# Comments cannot follow a trailing backslash, so they are listed above the command.
torchrun \
    --nnodes=4:8 \
    --nproc_per_node=8 \
    --max_restarts=3 \
    --rdzv_backend=etcd \
    --rdzv_endpoint=etcd-server:2379 \
    train.py
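# Training code must checkpoint so that a restarted worker group can resume
import os
import torch
import torch.distributed as dist
CKPT_PATH = "checkpoint_latest.pt"  # must live on storage visible to every node
def train(model, optimizer, batches, total_steps):
    # Load from checkpoint if restarting
    start_step = 0
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_step = ckpt["step"] + 1
    for step in range(start_step, total_steps):
        optimizer.zero_grad()
        loss = compute_loss(model, next(batches))  # compute_loss is user-defined
        loss.backward()
        optimizer.step()
        # Save checkpoint every 100 steps, from rank 0 only
        if step % 100 == 0 and dist.get_rank() == 0:
            torch.save(
                {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
                CKPT_PATH,
            )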
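A job that dies while a checkpoint is being written can leave a truncated file behind, and the restart then fails on load. One common mitigation is to write to a temporary file and rename it into place. The sketch below uses only the standard library; `atomic_write` is a hypothetical helper, and the rename is atomic only when the temporary file and the target are on the same POSIX filesystem.

```python
import os
import tempfile


def atomic_write(path, write_fn):
    """Write a file so that readers see either the old or the new version.

    write_fn receives a binary file object and writes the payload to it.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            write_fn(f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)  # atomic rename over the old checkpoint
    except BaseException:
        # Clean up the partial temp file and leave the old checkpoint intact.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# With PyTorch, which accepts a file object in torch.save:
#   atomic_write("checkpoint.pt", lambda f: torch.save(model.state_dict(), f))
```

On network filesystems the atomicity guarantee can be weaker, so keeping the previous checkpoint as a fallback is still worthwhile.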
Manual Checkpoint-Restart
#!/bin/bash
# retry_training.sh: simple checkpoint-restart loop
MAX_RETRIES=5
RETRY=0
while [ $RETRY -lt $MAX_RETRIES ]; do
    torchrun --nnodes=4 --nproc_per_node=8 train.py \
        --resume-from-checkpoint latest
    EXIT_CODE=$?
    if [ $EXIT_CODE -eq 0 ]; then
        echo "Training complete"
        break
    else
        RETRY=$((RETRY+1))
        echo "Training failed (exit $EXIT_CODE), retry $RETRY/$MAX_RETRIES"
        sleep 30 # wait before retry
    fi
done
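The same loop can be written in Python when the supervisor needs more logic, for example a growing delay between attempts so that a flapping node has time to recover. This is a minimal sketch; `run_with_retries` and its default delays are illustrative assumptions, not part of torchrun.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=5, base_delay_s=30.0,
                     max_delay_s=600.0, sleep=time.sleep):
    """Run cmd until it exits with 0; return the attempt number that succeeded.

    The delay doubles after each failure (30 s, 60 s, 120 s, ...) up to
    max_delay_s. Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0:
            return attempt
        if attempt < max_attempts:
            delay = min(base_delay_s * 2 ** (attempt - 1), max_delay_s)
            print(f"Training failed (exit {exit_code}), "
                  f"attempt {attempt}/{max_attempts}, retrying in {delay:.0f}s")
            sleep(delay)
    raise RuntimeError(f"training failed after {max_attempts} attempts")


# Example, mirroring the shell script:
#   run_with_retries(["torchrun", "--nnodes=4", "--nproc_per_node=8",
#                     "train.py", "--resume-from-checkpoint", "latest"])
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.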
8. Debugging Checklist Summary
NCCL Hang:
[ ] Are all N processes running? (ps aux count matches world_size)
[ ] Is master port reachable? (nc -zv MASTER_ADDR MASTER_PORT)
[ ] Is there a conditional collective? (all ranks must call every collective)
[ ] Check NCCL_DEBUG=INFO for last operation before hang
[ ] Check IB link status: ibstat
[ ] Set a process-group timeout (timeout= in init_process_group) to force a timeout + error message
NCCL Error:
[ ] Check nvidia-smi for ECC errors
[ ] Check dmesg for Xid codes
[ ] Check /dev/shm/ for stale NCCL files
[ ] Run with CUDA_LAUNCH_BLOCKING=1 for synchronous error reporting
[ ] Check tensor is contiguous and correct dtype
Performance Issue:
[ ] Run nccl-tests to verify expected bus bandwidth
[ ] Check NCCL_DEBUG=INFO for algorithm selected (Ring/Tree)
[ ] Profile with Nsight Systems to find communication gaps
[ ] Verify NCCL_P2P_LEVEL=NVL for NVLink systems
[ ] Check bucket_cap_mb in DDP, reduce_bucket_size in DeepSpeed
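When comparing nccl-tests output with hardware limits, it helps to know how bus bandwidth is derived from the measured time. The sketch below follows the correction factors described in the nccl-tests performance notes (for all-reduce, bus bandwidth is algorithm bandwidth times 2(n-1)/n); the function name and the operation keys are my own.

```python
def bus_bandwidth_gb_per_s(op, size_bytes, time_s, n_ranks):
    """Convert a measured collective time into bus bandwidth in GB/s.

    algbw = size / time. busbw rescales algbw by a per-operation factor
    so that the result is comparable with the hardware link speed
    regardless of the number of ranks.
    """
    n = n_ranks
    factors = {
        "all_reduce": 2 * (n - 1) / n,
        "all_gather": (n - 1) / n,
        "reduce_scatter": (n - 1) / n,
        "broadcast": 1.0,
        "reduce": 1.0,
    }
    algbw = size_bytes / time_s / 1e9  # GB/s
    return algbw * factors[op]
```

For example, an all-reduce of 1 GB across 8 ranks that takes 10 ms has an algorithm bandwidth of 100 GB/s and a bus bandwidth of 100 × 14/8 = 175 GB/s.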