06 — Benchmarks & Validation on 8x H200
1. Standard Benchmark Suite
Run these benchmarks in order to validate hardware health, software stack, and end-to-end AI performance.
Level 0: Hardware Health Check
# GPU info and clock state
nvidia-smi -q | grep -E "Product Name|Memory Total|Clocks|Power"
# Verify all 8 GPUs visible
nvidia-smi --query-gpu=index,name,memory.total --format=csv
# NVLink health
nvidia-smi nvlink --status -i 0 # repeat for 0-7
nvidia-smi nvlink --errorcounters -i 0
# Thermal stress — should stay < 83°C at 700W TDP
nvidia-smi dmon -s pucvt -d 2
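The inventory check above is easy to automate. A minimal sketch, assuming `nvidia-smi` CSV output is piped or captured into it; the helper names `parse_gpu_inventory` and `check_inventory` are hypothetical, not part of any NVIDIA tool:

```python
def parse_gpu_inventory(csv_text):
    """Parse `nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader` output."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, mem = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "memory": mem})
    return gpus

def check_inventory(gpus, expected_count=8, expected_name="H200"):
    """Fail loudly if a GPU is missing or the wrong part is installed."""
    assert len(gpus) == expected_count, f"expected {expected_count} GPUs, saw {len(gpus)}"
    for gpu in gpus:
        assert expected_name in gpu["name"], f"GPU {gpu['index']}: {gpu['name']}"

# Example against captured output (live: feed the real nvidia-smi stdout in):
sample = "\n".join(f"{i}, NVIDIA H200, 143771 MiB" for i in range(8))
check_inventory(parse_gpu_inventory(sample))
print("inventory OK")
```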
Level 1: GPU Compute (HPL / DGEMM)
# NVIDIA HPL benchmark (standard Top500 benchmark)
docker run --gpus all --rm nvcr.io/nvidia/hpc-benchmarks:23.10 \
hpl.sh --dat /workspace/hpl-dgx-h100/HPL.dat
# Expected: ~400+ TFLOPS for 8x H200 at FP64 (FP64 Tensor Core peak is ~67 TFLOPS/GPU)
# (HPL tests FP64; AI workloads use BF16/FP8)
# Quick GEMM bandwidth test
python - <<'EOF'
import torch, time
M, N, K = 4096, 4096, 4096
a = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
b = torch.randn(K, N, device="cuda", dtype=torch.bfloat16)
for _ in range(10): torch.mm(a, b) # warmup
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100): torch.mm(a, b)
torch.cuda.synchronize()
t1 = time.perf_counter()
tflops = 2 * M * N * K * 100 / (t1 - t0) / 1e12
print(f"GEMM TFLOPS (BF16): {tflops:.0f}")
# Target: > 650 TFLOPS (~66% of the 989 TFLOPS dense BF16 peak; 1,979 assumes 2:4 sparsity)
EOF
Level 2: NVLink Bandwidth
# NCCL all-reduce bandwidth test
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests && make -j MPI=0 CUDA_HOME=/usr/local/cuda
# All-reduce across 8 GPUs
./build/all_reduce_perf -b 1G -e 8G -f 2 -g 8
# Expected: ~450+ GB/s bus bandwidth (NVLink 4 peaks at 900 GB/s bidirectional per GPU)
# Lower numbers indicate NVLink issues or NCCL misconfiguration
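The "bus bandwidth" nccl-tests prints is the raw algorithmic bandwidth (bytes / time) scaled by the ring all-reduce traffic factor 2(n-1)/n. A sketch of that conversion; the 31 ms timing in the example is hypothetical:

```python
def allreduce_busbw(size_bytes, time_s, n_gpus):
    """Convert an all-reduce wall time into nccl-tests-style bus bandwidth (GB/s).

    algbw = size / time; busbw = algbw * 2*(n-1)/n, the per-link traffic
    factor for a ring all-reduce.
    """
    algbw = size_bytes / time_s / 1e9
    return algbw * 2 * (n_gpus - 1) / n_gpus

# 8 GB all-reduce across 8 GPUs finishing in 31 ms (hypothetical timing):
print(f"{allreduce_busbw(8e9, 0.031, 8):.0f} GB/s")  # ~452 GB/s, i.e. a healthy run
```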
Level 3: Memory Bandwidth
python - <<'EOF'
import torch, time
N = 2 * 1024**3 // 2 # 2 GB of BF16 data
x = torch.randn(N, device="cuda", dtype=torch.bfloat16)
y = torch.empty_like(x)
for _ in range(5): y.copy_(x) # warmup
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100): y.copy_(x)
torch.cuda.synchronize()
t1 = time.perf_counter()
bw = (2 * N * 2 * 100) / (t1 - t0) / 1e12 # read + write, BF16 = 2 bytes
print(f"HBM Bandwidth: {bw:.2f} TB/s")
# Target: > 4.0 TB/s (83% of 4.8 TB/s peak)
EOF
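A roofline sanity check explains why the copy test measures bandwidth while the GEMM test measures compute: compare a kernel's arithmetic intensity (FLOP/byte) to the H200 ridge point, roughly 989 dense BF16 TFLOPS / 4.8 TB/s ≈ 206 FLOP/byte. A sketch; `is_compute_bound` is an illustrative helper, not a library call:

```python
def is_compute_bound(flops, bytes_moved, peak_tflops=989.0, peak_bw_tbs=4.8):
    """Roofline check: arithmetic intensity vs the H200 ridge point (~206 FLOP/byte)."""
    intensity = flops / bytes_moved
    ridge = peak_tflops * 1e12 / (peak_bw_tbs * 1e12)
    return intensity > ridge

# 4096^3 BF16 GEMM: 2*M*N*K FLOPs over three 2-byte-element matrices
M = N = K = 4096
gemm_flops = 2 * M * N * K
gemm_bytes = 2 * (M * K + K * N + M * N)
print(is_compute_bound(gemm_flops, gemm_bytes))  # True: GEMM sits above the ridge
print(is_compute_bound(0, 1))                    # False: elementwise copy is memory-bound
```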
2. Training Benchmarks
MFU (Model FLOP Utilization)
import torch, time
from transformers import LlamaForCausalLM, LlamaConfig
# Proxy model for benchmarking
config = LlamaConfig(
    hidden_size=8192, num_hidden_layers=80,
    num_attention_heads=64, num_key_value_heads=8,
    intermediate_size=28672, max_position_embeddings=4096,
)
# BF16 weights for this 70B shape are ~140 GB; weights plus gradients exceed one
# 141 GB H200, so shrink num_hidden_layers for a single-GPU run (the MFU formula
# below is size-independent). Cast before moving to avoid staging a 280 GB FP32 copy.
model = LlamaForCausalLM(config).to(torch.bfloat16).to("cuda:0")
batch_size, seq_len = 4, 2048
inputs = torch.randint(0, 32000, (batch_size, seq_len), device="cuda:0")
# Warmup
for _ in range(3):
    loss = model(inputs, labels=inputs).loss
    loss.backward()
    model.zero_grad()
torch.cuda.synchronize()
# Benchmark
t0 = time.perf_counter()
N_STEPS = 20
for _ in range(N_STEPS):
    loss = model(inputs, labels=inputs).loss
    loss.backward()
    model.zero_grad()
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
# MFU calculation
params = sum(p.numel() for p in model.parameters())
flops_per_token = 6 * params # ≈ 2N forward + 4N backward per token
tokens_per_step = batch_size * seq_len
total_flops = flops_per_token * tokens_per_step * N_STEPS
achieved_tflops = total_flops / elapsed / 1e12
mfu = achieved_tflops / 989.0 # H200 dense BF16 peak (no sparsity)
print(f"Achieved: {achieved_tflops:.0f} TFLOPS | MFU: {mfu*100:.1f}%")
# Target single-GPU MFU: > 45%
Multi-GPU Scaling Efficiency
# Run training script with 1, 2, 4, 8 GPUs and measure throughput
for N in 1 2 4 8; do
torchrun --nproc_per_node=$N train_benchmark.py \
--batch-size $((4 * N)) \
--seq-len 2048 \
--steps 50 \
--output scaling_N${N}.json
done
# Expected scaling efficiency (vs linear):
# 2 GPUs: > 95%
# 4 GPUs: > 92%
# 8 GPUs: > 88%
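Scaling efficiency is simply measured throughput at N GPUs divided by N times the single-GPU throughput. A sketch for post-processing the sweep output; the throughput numbers in the example are hypothetical:

```python
def scaling_efficiency(throughputs):
    """throughputs: {n_gpus: tokens_per_second}. Efficiency vs linear scaling from 1 GPU."""
    base = throughputs[1]
    return {n: tps / (n * base) for n, tps in throughputs.items()}

# Hypothetical measured numbers from the sweep above:
eff = scaling_efficiency({1: 2500, 2: 4800, 4: 9300, 8: 17800})
for n, e in sorted(eff.items()):
    print(f"{n} GPUs: {e * 100:.0f}%")  # 100%, 96%, 93%, 89%
```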
Throughput vs Batch Size Curve
| Batch Size | Tokens/s (Llama-70B, 8x H200, BF16) | GPU Mem Used |
|---|---|---|
| 1 | ~1,200 | ~200 GB |
| 4 | ~4,500 | ~240 GB |
| 16 | ~14,000 | ~400 GB |
| 64 | ~18,000 | ~900 GB |
| 128 | OOM (need FP8 or quant) | — |
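The memory column grows with batch size because of KV cache, not weights. A rough estimator using the Llama-70B GQA shape (80 layers, 8 KV heads, head dim 128, BF16); real totals also include activations, context length, and allocator overhead, so treat this as a lower bound:

```python
def kv_cache_bytes(batch, seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Per-token KV cache = 2 (K and V) * layers * kv_heads * head_dim * dtype bytes."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return batch * seq_len * per_token

gb = kv_cache_bytes(64, 4096) / 1e9
print(f"KV cache at BS=64, 4K ctx: {gb:.0f} GB")  # ~86 GB on top of ~140 GB of weights
```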
3. Inference Benchmarks
Latency vs Throughput Trade-off
# vLLM benchmark suite (flag names vary across vLLM releases)
python benchmarks/benchmark_serving.py \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 8 \
--request-rate 10 \
--num-prompts 500 \
--input-len 512 \
--output-len 256
# Sweep request rates
for RATE in 1 5 10 20 50 100; do
python benchmarks/benchmark_serving.py \
--request-rate $RATE \
[other args] \
--output-json rate_${RATE}.json
done
Key Inference Metrics
| Metric | Definition | H200 Target (Llama-70B, TP=8) |
|---|---|---|
| TTFT | Time to first token (prefill) | < 100 ms (512 input tokens) |
| TPOT | Time per output token (decode) | < 20 ms/token |
| Throughput | Total output tokens/second | > 15,000 tokens/s at BS=64 |
| ITL | Inter-token latency (streaming) | < 30 ms P99 |
| MBU | Memory bandwidth utilization | > 75% (decode is memory-bound) |
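TTFT and TPOT compose into end-to-end request latency: one prefill, then one decode step per remaining output token. A sketch; `request_latency_ms` is an illustrative helper:

```python
def request_latency_ms(ttft_ms, tpot_ms, output_tokens):
    """End-to-end generation latency: prefill (TTFT) plus one TPOT per
    subsequent output token (the first token is covered by TTFT)."""
    return ttft_ms + tpot_ms * (output_tokens - 1)

# At the H200 targets: 100 ms TTFT, 20 ms TPOT, 256 output tokens
print(f"{request_latency_ms(100, 20, 256):.0f} ms")  # 100 + 20*255 = 5200 ms
```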
Decode MBU Calculation
def compute_mbu(model_bytes, tokens_per_second, batch_size,
                n_gpus=8, hbm_bw_tbs=4.8):
    """
    Decode is memory-bound: each decode step streams the full model weights
    (sharded across n_gpus under tensor parallelism) once and produces
    batch_size tokens. KV-cache traffic is ignored, so this slightly
    underestimates true bandwidth at large batch sizes.
    model_bytes: total model size in bytes (e.g., 70B * 2 for BF16 = 140e9)
    """
    steps_per_second = tokens_per_second / batch_size
    bytes_per_gpu_per_step = model_bytes / n_gpus
    achieved_tb_s = bytes_per_gpu_per_step * steps_per_second / 1e12
    return achieved_tb_s / hbm_bw_tbs

# 70B BF16: 140 GB weights, 15,000 tokens/s total at batch size 64, TP=8
mbu = compute_mbu(140e9, 15000, 64)
print(f"MBU: {mbu*100:.1f}%") # ~85%, consistent with the > 75% decode target
4. Validation Checklist
Before Production Deployment
Hardware:
[ ] All 8 GPUs detected by nvidia-smi (driver version ≥ 550)
[ ] NVLink errors = 0 on all links (nvidia-smi nvlink --errorcounters)
[ ] GPU temperatures stable under load (< 83°C)
[ ] Power readings ~700W per GPU at full load
[ ] ECC errors = 0 (nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total)
Software stack:
[ ] CUDA 12.3+, cuDNN 9+, NCCL 2.20+
[ ] PyTorch ≥ 2.3 (FlashAttention-2 via scaled_dot_product_attention)
[ ] vLLM or TRT-LLM latest release
[ ] Transformer Engine installed for FP8
Networking:
[ ] NCCL all-reduce latency < 100 µs for 1 MB payload
[ ] NVLink bus bandwidth > 400 GB/s (nccl-tests all_reduce_perf)
[ ] No PCIe retrain errors in dmesg
Model serving:
[ ] Health check endpoint responds < 50 ms
[ ] TTFT within SLA for p50/p99
[ ] Memory utilization stable (no slow leak)
[ ] Graceful handling of max_model_len exceeded requests
ECC and Error Monitoring
# Check for GPU hardware errors
nvidia-smi --query-gpu=gpu_name,ecc.errors.corrected.volatile.total,\
ecc.errors.uncorrected.volatile.total --format=csv
# Set ECC mode (requires reboot). On HBM parts like H200, ECC is on by default
# and its capacity/bandwidth cost is negligible; leave it enabled.
sudo nvidia-smi -e 1 # enable ECC
sudo nvidia-smi -e 0 # disable ECC (no error correction; not recommended)
# For training: ECC ON (data integrity)
# For inference: ECC ON (production reliability)
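Deployments can be gated on the ECC query output. A hypothetical helper that parses the CSV and flags any GPU reporting nonzero uncorrected errors (the last queried column):

```python
def ecc_errors_ok(csv_text):
    """Check the ECC-query CSV: every uncorrected count (last column) must be 0.
    Returns the names of offending GPU rows (empty list means healthy)."""
    bad = []
    for line in csv_text.strip().splitlines()[1:]:  # skip the CSV header row
        fields = [f.strip() for f in line.split(",")]
        if fields[-1] not in ("0", "[N/A]"):
            bad.append(fields[0])
    return bad

sample = ("name, ecc.errors.corrected.volatile.total, ecc.errors.uncorrected.volatile.total\n"
          + "\n".join("NVIDIA H200, 0, 0" for _ in range(8)))
print(ecc_errors_ok(sample))  # [] -> safe to deploy
```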
5. Comparative Baselines
H200 vs H100 vs A100
| Benchmark | A100 SXM4 80GB | H100 SXM5 80GB | H200 SXM5 141GB |
|---|---|---|---|
| BF16 TFLOPS (dense) | 312 | 989 | 989 |
| HBM Bandwidth | 2.0 TB/s | 3.35 TB/s | 4.8 TB/s |
| Llama-70B TTFT (512 in) | ~400 ms | ~150 ms | ~90 ms |
| Llama-70B throughput | ~5K tok/s | ~10K tok/s | ~18K tok/s |
| Max context (fits in VRAM) | ~32K tokens | ~32K tokens | ~128K tokens |
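Because decode is memory-bound, a first-order cross-generation estimate is just the HBM bandwidth ratio. A sketch; note that the measured 70B throughput gap (~10K vs ~18K tok/s) exceeds the ~1.43x bandwidth ratio because the H200's 141 GB also permits larger batches:

```python
def decode_speedup(bw_new_tbs, bw_old_tbs):
    """First-order decode speedup: throughput scales roughly with HBM
    bandwidth when compute and model are held equal."""
    return bw_new_tbs / bw_old_tbs

print(f"H200 vs H100: ~{decode_speedup(4.8, 3.35):.2f}x")  # ~1.43x
print(f"H200 vs A100: ~{decode_speedup(4.8, 2.0):.2f}x")   # ~2.40x
```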