05 — Performance Optimization on 8x H200

1. Performance Targets and Roofline

H200 Hardware Limits

| Bound | Metric | H200 value |
|---|---|---|
| Compute (BF16) | TFLOPS | 1,979 |
| Compute (FP8) | TFLOPS | 3,958 |
| Memory bandwidth | TB/s | 4.8 |
| NVLink bandwidth | GB/s (bidirectional) | 900 per GPU |
| PCIe 5.0 bandwidth | GB/s | ~128 (host link) |

Roofline Analysis

Arithmetic Intensity (AI) = FLOPs / Bytes transferred from HBM

If AI < Ridge Point:  → MEMORY BOUND → optimize data movement
If AI > Ridge Point:  → COMPUTE BOUND → optimize FLOPs efficiency

H200 Ridge Point = 1,979 TFLOPS / 4.8 TB/s ≈ 412 FLOP/Byte (BF16)

Operation AIs:
  GEMM (large):     ~1000+ FLOP/Byte → COMPUTE BOUND ✓ (targets Tensor Cores)
  Attention (small batch): ~10 FLOP/Byte → MEMORY BOUND (use Flash Attention)
  Element-wise ops: ~1-4 FLOP/Byte → MEMORY BOUND (fuse them)
  All-reduce:       ~0 FLOPs, pure bandwidth → NVLink BOUND
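These AI figures are easy to sanity-check with a few lines of arithmetic. The helper below (illustrative, not from the original) computes the AI of a square BF16 GEMM and compares it against the H200 ridge point:

```python
def gemm_ai(m, n, k, bytes_per_elem=2):
    """Arithmetic intensity of an (m x k) @ (k x n) GEMM; BF16 = 2 bytes/elem."""
    flops = 2 * m * n * k                                  # one multiply-accumulate = 2 FLOPs
    hbm_bytes = bytes_per_elem * (m * k + k * n + m * n)   # read A and B, write C
    return flops / hbm_bytes

RIDGE = 1979e12 / 4.8e12   # ~412 FLOP/Byte on H200 (BF16)

ai = gemm_ai(4096, 4096, 4096)
print(f"AI = {ai:.0f} FLOP/Byte ->",
      "compute bound" if ai > RIDGE else "memory bound")
```

For the 4096-cubed case, AI reduces to 4096/3 ≈ 1365 FLOP/Byte, comfortably above the ridge point, which is why large GEMMs land on the compute roof.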
# Run roofline analysis with Nsight Compute
ncu --set roofline --target-processes all \
    --output roofline_report \
    python -c "import torch; a=torch.randn(4096,4096,device='cuda',dtype=torch.bfloat16); torch.mm(a,a)"

# Open in Nsight Compute GUI for visualization
ncu-ui roofline_report.ncu-rep

2. CUDA Graphs for Inference

Inference has a fixed computation graph (same shapes every step). CUDA Graphs eliminate CPU kernel launch overhead.

import torch

# Standard inference: CPU launches each kernel individually
# ~5-50 µs overhead per kernel → adds up for many small kernels

# CUDA Graph: capture once, replay instantly
def setup_cuda_graph(model, sample_input):
    static_input = sample_input.clone()

    # Warmup on a side stream (fills cuDNN/cuBLAS caches, avoids capture failures);
    # the stream must be synchronized with the default stream on both sides
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture: kernels launched inside this block are recorded, not executed
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)

    return g, static_input, static_output

def graph_inference(g, static_input, static_output, new_input):
    static_input.copy_(new_input)
    g.replay()
    return static_output.clone()

# Speedup: 1.5-3× for small batch sizes (< 32)
# Largest impact at batch_size=1 (pure latency mode)

CUDA Graphs with Dynamic Shapes (Bucketing)

# For variable sequence lengths, maintain a graph per "bucket"
BUCKETS = [128, 256, 512, 1024, 2048, 4096]

graphs = {}
for seq_len in BUCKETS:
    dummy_input = torch.zeros(1, seq_len, dtype=torch.long, device="cuda")
    g, si, so = setup_cuda_graph(model, dummy_input)
    graphs[seq_len] = (g, si, so)

def bucketed_inference(input_ids):
    seq_len = input_ids.shape[1]
    bucket = next(b for b in BUCKETS if b >= seq_len)  # StopIteration if seq_len > 4096
    padded = torch.nn.functional.pad(input_ids, (0, bucket - seq_len))
    g, si, so = graphs[bucket]
    return graph_inference(g, si, so, padded)
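Bucket selection itself is plain host-side Python and worth unit-testing in isolation; a sketch using bisect with an explicit error for over-long sequences (an assumption, not the original helper):

```python
import bisect

BUCKETS = [128, 256, 512, 1024, 2048, 4096]

def pick_bucket(seq_len, buckets=BUCKETS):
    """Smallest bucket >= seq_len; fail loudly if the sequence is too long."""
    i = bisect.bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"seq_len {seq_len} exceeds largest bucket {buckets[-1]}")
    return buckets[i]

print(pick_bucket(300))   # -> 512
```

Exact bucket hits (e.g. 128) map to themselves, so already-aligned inputs incur no padding.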

3. Kernel Fusion

Fusing multiple element-wise operations into a single kernel eliminates redundant HBM reads/writes.

Manual Fusion with Triton

import triton
import triton.language as tl

@triton.jit
def fused_gelu_add_kernel(
    x_ptr, bias_ptr, out_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    x = tl.load(x_ptr + offsets, mask=mask)
    bias = tl.load(bias_ptr + offsets, mask=mask)

    # Exact GELU (erf formulation); 1.41421356 ≈ sqrt(2)
    val = x + bias
    gelu = val * 0.5 * (1.0 + tl.math.erf(val / 1.41421356))

    tl.store(out_ptr + offsets, gelu, mask=mask)

# 2 HBM reads (x, bias) + 1 write, vs. 3 reads + 2 writes for separate add and GELU kernels
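The kernel's math can be spot-checked on the CPU before touching a GPU; a pure-Python reference (a validation sketch, with math.erf standing in for tl.math.erf):

```python
import math

def gelu_erf(v):
    """Exact erf-form GELU, matching the formula inside the Triton kernel."""
    return v * 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

# Known behavior: gelu(0) = 0, gelu(x) -> x for large positive x, -> 0 for large negative x
print(gelu_erf(0.0), gelu_erf(6.0), gelu_erf(-6.0))
```

Agreeing with a scalar reference like this catches sign and constant errors cheaply, which is where most hand-written fused kernels go wrong first.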

torch.compile for Automatic Fusion

import torch

@torch.compile(mode="max-autotune")  # most aggressive optimization
def forward(x, w, bias):
    h = torch.mm(x, w) + bias
    return torch.nn.functional.gelu(h)

# torch.compile uses Triton to auto-fuse element-wise ops after GEMM
# mode options: "default", "reduce-overhead" (CUDA graphs), "max-autotune"

4. Communication Optimization

Overlap Compute and Communication

# Without overlap:
# [ALL-REDUCE gradient][forward pass][ALL-REDUCE gradient]...
#  ← communication →  ← compute →  ← communication →

# With overlap (DDP default):
# [forward][backward for layer N][ALL-REDUCE layer N, while computing layer N-1]
# Communication is hidden behind backward computation

# FSDP with prefetching
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    BackwardPrefetch,
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    forward_prefetch=True,     # prefetch next layer's params during forward
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # prefetch during backward
    limit_all_gathers=True,    # limit concurrent all-gathers to avoid OOM
)

Gradient Compression

# PowerSGD: low-rank gradient compression for bandwidth reduction
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

state = powerSGD.PowerSGDState(
    process_group=None,
    matrix_approximation_rank=32,   # rank of approximation (lower = more compression)
    warm_start=True,
)
model.register_comm_hook(state, powerSGD.batched_powerSGD_hook)
# Reduces gradient communication by 4-10× (with small quality trade-off)
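The quoted 4-10× figure depends on layer shapes and the chosen rank: PowerSGD replaces an n×m gradient with rank-r factors P (n×r) and Q (m×r). A rough per-matrix calculator (my sketch; it ignores error-feedback state and tensors left uncompressed):

```python
def powersgd_ratio(n, m, rank):
    """Dense gradient elements vs. low-rank factor elements for one n x m matrix."""
    dense = n * m
    low_rank = rank * (n + m)   # P: n x rank, Q: m x rank
    return dense / low_rank

print(powersgd_ratio(4096, 4096, 32))   # -> 64.0
```

A square 4096 layer at rank 32 compresses ~64×, so the end-to-end 4-10× reflects the mix of compressible and uncompressible tensors, not the per-matrix ratio.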

NCCL Tuning for H200

# Force NVLink for all inter-GPU communication
export NCCL_P2P_LEVEL=NVL
export NCCL_NET_GDR_LEVEL=5          # use GPUDirect RDMA where available
export NCCL_ALGO=Tree                 # Tree or Ring; Tree often better on NVSwitch
export NCCL_PROTO=Simple              # Simple/LL/LL128

# Tune all-reduce chunk size
export NCCL_BUFFSIZE=8388608          # 8 MB

# Profile NCCL operations
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

5. Mixed Precision Best Practices

# Preferred: BF16 (H200 native, stable numerics)
with torch.autocast("cuda", dtype=torch.bfloat16):
    loss = model(inputs)

# For FP8 (maximum throughput on H200):
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

fp8_recipe = DelayedScaling(
    margin=0,
    interval=1,
    fp8_format=Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    loss = model(inputs)

# FP8 gains: ~1.8× speedup on GEMM-heavy workloads vs BF16

6. torch.compile Optimization Modes

# Progressive optimization levels:

# Level 1: Default — safe fusions, no kernel search
model = torch.compile(model, mode="default")

# Level 2: Reduce overhead — adds CUDA graphs automatically
model = torch.compile(model, mode="reduce-overhead")

# Level 3: Max autotune — searches best kernel configs (slow first run)
model = torch.compile(model, mode="max-autotune")

# Backend options:
# "inductor" (default): Triton-based fused kernels
# "cudagraphs": CUDA graph capture only
# "aot_eager": AOT Autograd without optimization (debugging)

# For dynamic shapes (e.g., variable-length generation), mark shapes dynamic
# to avoid a fresh recompile for every new input shape
model = torch.compile(model, dynamic=True)

7. Profiling Workflow

Step 1: High-Level Timeline (Nsight Systems)

nsys profile \
    --trace=cuda,nvtx,python \
    --gpu-metrics-device=all \
    --stats=true \
    --output=profile_run \
    torchrun --nproc_per_node=8 train.py

# View: nsys-ui profile_run.nsys-rep
# Look for: GPU idle gaps, PCIe transfers, NCCL synchronization

Step 2: Kernel-Level Analysis (Nsight Compute)

ncu \
    --set detailed \
    --kernel-name "regex:gemm" \
    --launch-count 5 \
    --target-processes all \
    --output kernel_analysis \
    python inference.py

# Key metrics:
# sm__throughput.avg.pct_of_peak_sustained_elapsed → SM utilization
# dram__throughput.avg.pct_of_peak_sustained_elapsed → HBM utilization
# l1tex__t_sector_hit_rate.pct → L1/TEX cache hit rate

Step 3: PyTorch Profiler Integration

import torch
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
    record_shapes=True,
    profile_memory=True,
    schedule=torch.profiler.schedule(wait=1, warmup=3, active=5),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./prof"),
) as prof:
    for step, batch in enumerate(dataloader):
        with record_function("forward"):
            loss = model(**batch).loss
        with record_function("backward"):
            loss.backward()
        prof.step()

# Visualize in TensorBoard: tensorboard --logdir=./prof

8. Common Bottlenecks and Fixes

| Symptom | Root cause | Fix |
|---|---|---|
| GPU util < 40% (training) | Data loading bottleneck | Increase DataLoader workers, use DALI |
| GPU util < 40% (inference) | Batch size too small | Increase batch size or use continuous batching |
| All-reduce > 30% of step time | Gradient sync dominates | Reduce DP replicas, increase gradient accumulation |
| MFU < 20% | Memory-bound kernels | Enable Flash Attention, fuse ops, use larger batch |
| OOM during compilation | torch.compile tracing spikes memory | Compile with dynamic=True, reduce batch for the compile step |
| NVLink not used | NCCL falls back to PCIe | Set NCCL_P2P_LEVEL=NVL, verify nvidia-smi topo -m |
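The MFU row can be made concrete with the standard ≈6·N FLOPs-per-token approximation for forward plus backward (a back-of-envelope helper; the 6N rule and 8-GPU BF16 peak are assumptions, not measurements):

```python
def mfu(tokens_per_sec, n_params, peak_flops_per_gpu=1979e12, n_gpus=8):
    """Model FLOPs Utilization: achieved training FLOP/s over aggregate peak."""
    achieved = 6 * n_params * tokens_per_sec   # ~6 FLOPs per parameter per token (fwd + bwd)
    return achieved / (peak_flops_per_gpu * n_gpus)

# Example: a 70B-parameter model at 10k tokens/s on 8x H200 (BF16 peak)
print(f"MFU = {mfu(10_000, 70e9):.1%}")
```

Numbers well below the 20% threshold in the table above usually indicate memory-bound kernels or communication stalls rather than a slow GEMM path.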
