01 — H200 Hardware Architecture¶
1. Die and Process¶
- GPU die: GH100 (Hopper architecture, TSMC 4N)
- Transistors: 80 billion
- SMs: 132 Streaming Multiprocessors
- CUDA cores: 16,896
- Tensor Cores: 4th generation (FP8, FP16, BF16, TF32, INT8, INT4)
- FP8 peak: ~3,958 TFLOPS (dense)
- BF16 peak: ~1,979 TFLOPS (dense)
- TF32 peak: ~989 TFLOPS
The H200 uses the same GH100 die as H100 but replaces HBM3 with HBM3e stacks for higher capacity and bandwidth.
2. HBM3e Memory Subsystem¶
| Property | H100 SXM5 | H200 SXM5 |
|---|---|---|
| Memory type | HBM3 | HBM3e |
| Capacity | 80 GB | 141 GB |
| Bandwidth | 3.35 TB/s | 4.8 TB/s |
| Stacks | 5 | 6 |
Why HBM3e Matters for AI¶
- Larger KV caches for long-context LLM inference (fit 128K+ tokens in memory)
- Bigger model shards → fewer pipeline stages → less pipeline bubble overhead
- 43% more bandwidth → memory-bound ops (attention, embedding lookups) run faster
Memory Access Best Practices¶
# Profile actual HBM bandwidth utilization
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
# Always use BF16 for training (numerically stable, uses Tensor Cores)
model = model.to(torch.bfloat16)
# Pin CPU buffers for async H2D/D2H transfers
buffer = torch.zeros(size, pin_memory=True)
3. NVLink 4.0 and NVSwitch 3rd Gen¶
NVLink 4.0 Specs¶
- Links per GPU: 18 NVLink 4.0 lanes
- Bandwidth per link: 50 GB/s bidirectional
- Total GPU-to-GPU bandwidth: 900 GB/s bidirectional per GPU
NVSwitch 3rd Gen (8-GPU Node Topology)¶
An 8-GPU SXM5 node uses four NVSwitch 3.0 chips forming a full all-to-all mesh:
GPU0 ──┐
GPU1 ──┤
GPU2 ──┤ NVSwitch 0 ←→ NVSwitch 1
GPU3 ──┤ ↕ ↕
GPU4 ──┤ NVSwitch 2 ←→ NVSwitch 3
GPU5 ──┤
GPU6 ──┤
GPU7 ──┘
Every GPU has direct full-bandwidth path to every other GPU.
No multi-hop penalty unlike ring or tree topologies.
Why Full Mesh Matters¶
- All-reduce across 8 GPUs stays at full 900 GB/s (no bottleneck GPUs)
- Tensor parallel all-reduce for attention heads completes in ~1 µs for typical sizes
- Point-to-point P2P transfers for pipeline parallelism are lossless
Verifying NVLink in Practice¶
# Check NVLink topology
nvidia-smi topo -m
# Monitor NVLink traffic per GPU
nvidia-smi dmon -s u -d 1
# NCCL topology detection
NCCL_DEBUG=INFO torchrun --nproc_per_node=8 your_script.py 2>&1 | grep "NCCL"
4. SXM5 Baseboard and Host Connectivity¶
- PCIe: PCIe 5.0 x16 per GPU (via NVLink C2C bridge to CPU on Grace-Hopper)
- NVMe: Direct GPUDirect Storage paths to local NVMe over PCIe 5.0
- Thermal: Liquid-cooled SXM5 module; GPU junction temperature target < 83°C
CPU-GPU Memory (Grace Hopper Superchip variant)¶
On GH200 (Grace + H200), CPU and GPU share a unified 900 GB/s NVLink-C2C fabric — CPU LPDDR5x and GPU HBM3e appear in the same address space. This eliminates PCIe bottlenecks for CPU-GPU data movement.
5. Power and Cooling¶
| Property | Value |
|---|---|
| TDP per H200 | 700 W |
| 8-GPU node TDP | ~5,600 W (GPUs only) |
| Cooling method | Direct liquid cooling (DLC) |
| Inlet coolant temp | ≤ 45°C recommended |
Design your rack PDU for at least 7.5 kW per node (accounting for CPUs, NVSwitches, networking).
6. Key Architectural Features for AI¶
Transformer Engine¶
The H200 includes a Transformer Engine that automatically selects FP8 or BF16 precision per layer:
# PyTorch + Transformer Engine (TE)
import transformer_engine.pytorch as te
# Replace standard Linear with TE Linear — auto FP8
layer = te.Linear(in_features, out_features, bias=True)
# TE handles scaling factors, amax history, and E4M3/E5M2 selection
In-Network Computing via NVSwitch¶
NVSwitch 3.0 supports SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) in-network reductions — all-reduce operations are partially computed inside the switch fabric, reducing GPU cycles spent on communication.