01 — H200 Hardware Architecture

1. Die and Process

  • GPU die: GH100 (Hopper architecture, TSMC 4N)
  • Transistors: 80 billion
  • SMs: 132 Streaming Multiprocessors
  • CUDA cores: 16,896
  • Tensor Cores: 4th generation (FP8, FP16, BF16, TF32, INT8, INT4)
  • FP8 peak: ~3,958 TFLOPS (with 2:4 sparsity; ~1,979 dense)
  • BF16 peak: ~1,979 TFLOPS (with 2:4 sparsity; ~990 dense)
  • TF32 peak: ~989 TFLOPS (with 2:4 sparsity; ~495 dense)
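
The three peaks are related by simple factors: 2:4 structured sparsity doubles Tensor Core throughput, and each halving of precision roughly doubles the dense rate. A quick sanity check, assuming the datasheet convention that quoted peaks include sparsity:

```python
fp8_sparse = 3958.0            # TFLOPS, quoted peak with 2:4 sparsity
fp8_dense = fp8_sparse / 2     # structured sparsity doubles throughput
bf16_dense = fp8_dense / 2     # BF16 runs at half the FP8 dense rate
tf32_dense = bf16_dense / 2    # TF32 at half the BF16 dense rate
print(fp8_dense, bf16_dense, tf32_dense)  # 1979.0 989.5 494.75
```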

The H200 uses the same GH100 die as H100 but replaces HBM3 with HBM3e stacks for higher capacity and bandwidth.

2. HBM3e Memory Subsystem

| Property    | H100 SXM5 | H200 SXM5 |
|-------------|-----------|-----------|
| Memory type | HBM3      | HBM3e     |
| Capacity    | 80 GB     | 141 GB    |
| Bandwidth   | 3.35 TB/s | 4.8 TB/s  |
| Stacks      | 5         | 6         |

Why HBM3e Matters for AI

  • Larger KV caches for long-context LLM inference (fit 128K+ tokens in memory)
  • Bigger model shards → fewer pipeline stages → less pipeline bubble overhead
  • 43% more bandwidth → memory-bound ops (attention, embedding lookups) run faster
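
As a concrete sizing example for the first bullet: KV cache per token is 2 (K and V) × layers × kv_heads × head_dim × dtype bytes. The model shape below (80 layers, 8 KV heads of dim 128, FP8 weights, BF16 cache) is a hypothetical 70B-class configuration, not a figure from this document:

```python
layers, kv_heads, head_dim = 80, 8, 128   # hypothetical 70B-class model
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K+V in BF16 (2 B)

hbm = 141e9       # H200 capacity in bytes
weights = 70e9    # ~70B parameters stored in FP8 (1 B each)
tokens = (hbm - weights) / kv_bytes_per_token
print(f"{kv_bytes_per_token / 1024:.0f} KiB/token, ~{tokens / 1e3:.0f}K tokens of KV cache")
```

Under these assumptions a single H200 holds roughly 217K tokens of cache after weights, versus ~30K on an 80 GB H100 with the same model.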

Memory Access Best Practices

# Enable TF32 Tensor Core math for FP32 matmuls (a safe default on Hopper)
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Prefer BF16 for training: FP32 exponent range, runs on Tensor Cores
model = model.to(torch.bfloat16)

# Pin CPU buffers so H2D/D2H transfers can run asynchronously
buffer = torch.zeros(size, pin_memory=True)
device_copy = buffer.to("cuda", non_blocking=True)  # overlaps with compute

3. NVLink 4.0 Interconnect

  • Links per GPU: 18 NVLink 4.0 links
  • Bandwidth per link: 50 GB/s bidirectional (25 GB/s per direction)
  • Total GPU-to-GPU bandwidth: 900 GB/s bidirectional per GPU
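
A quick sanity check on what 900 GB/s buys for collectives. In a textbook ring all-reduce each GPU sends 2(N−1)/N times the message size; the numbers below (1 GB gradient bucket, 8 GPUs, 450 GB/s per direction) are illustrative, not measured:

```python
def ring_allreduce_time_s(message_bytes: float, n_gpus: int, link_gbs: float) -> float:
    """Bandwidth-limited lower bound for a ring all-reduce (ignores latency)."""
    traffic = 2 * (n_gpus - 1) / n_gpus * message_bytes  # bytes sent per GPU
    return traffic / (link_gbs * 1e9)

# 1 GB of gradients across the 8-GPU NVLink domain
t = ring_allreduce_time_s(1e9, 8, 450.0)
print(f"{t * 1e3:.2f} ms")  # ~3.9 ms, bandwidth-bound lower bound
```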

NVSwitch 3rd Gen (8-GPU Node Topology)

An 8-GPU SXM5 node uses four NVSwitch 3.0 chips forming a full all-to-all mesh:

GPU0 ──┐
GPU1 ──┤
GPU2 ──┤   NVSwitch 0   ←→   NVSwitch 1
GPU3 ──┤         ↕               ↕
GPU4 ──┤   NVSwitch 2   ←→   NVSwitch 3
GPU5 ──┤
GPU6 ──┤
GPU7 ──┘

Every GPU has a direct, full-bandwidth path to every other GPU, with none of the multi-hop penalty that ring or tree topologies incur.

Why Full Mesh Matters

  • All-reduce across 8 GPUs sustains the full 900 GB/s per GPU (no bottleneck links)
  • Tensor-parallel all-reduce for attention layers completes at microsecond scale for typical message sizes
  • Point-to-point (P2P) transfers for pipeline parallelism run over dedicated full-bandwidth paths

# Check NVLink topology
nvidia-smi topo -m

# Monitor per-GPU utilization (SM/memory) at 1 s intervals
nvidia-smi dmon -s u -d 1

# NCCL topology detection
NCCL_DEBUG=INFO torchrun --nproc_per_node=8 your_script.py 2>&1 | grep "NCCL"

4. SXM5 Baseboard and Host Connectivity

  • PCIe: PCIe 5.0 x16 per GPU to the host CPU (on Grace Hopper systems, the NVLink-C2C link replaces this path)
  • NVMe: Direct GPUDirect Storage paths to local NVMe over PCIe 5.0
  • Thermal: Liquid-cooled SXM5 module; GPU junction temperature target < 83°C

CPU-GPU Memory (Grace Hopper Superchip variant)

On GH200 (Grace + H200), CPU and GPU share a unified 900 GB/s NVLink-C2C fabric — CPU LPDDR5x and GPU HBM3e appear in the same address space. This eliminates PCIe bottlenecks for CPU-GPU data movement.
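
To see what the C2C fabric eliminates, a back-of-envelope comparison of moving a host-side buffer to the GPU. The 20 GB shard size is an arbitrary example, and 64 GB/s is the theoretical per-direction peak of PCIe 5.0 x16:

```python
def transfer_time_s(n_bytes: float, gbs: float) -> float:
    """Time to move n_bytes over a link with gbs GB/s of one-way bandwidth."""
    return n_bytes / (gbs * 1e9)

shard = 20e9  # 20 GB of host-resident data (illustrative)
pcie = transfer_time_s(shard, 64.0)    # PCIe 5.0 x16, ~64 GB/s per direction
c2c = transfer_time_s(shard, 450.0)    # NVLink-C2C, 450 GB/s per direction
print(f"PCIe: {pcie * 1e3:.0f} ms, C2C: {c2c * 1e3:.0f} ms")
```

Roughly a 7x reduction in transfer time, before accounting for the removal of explicit staging copies.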

5. Power and Cooling

| Property           | Value                       |
|--------------------|-----------------------------|
| TDP per H200       | 700 W                       |
| 8-GPU node TDP     | ~5,600 W (GPUs only)        |
| Cooling method     | Direct liquid cooling (DLC) |
| Inlet coolant temp | ≤ 45°C recommended          |

Design your rack PDU for at least 7.5 kW per node (accounting for CPUs, NVSwitches, networking).
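
The 7.5 kW figure can be sanity-checked with a small budget. The non-GPU line items below (CPUs, NVSwitches, NICs/storage, fans and pumps) are rough assumptions, not measured values:

```python
gpu_w = 700          # TDP per H200
n_gpus = 8
overhead_w = {       # illustrative non-GPU draw per node (assumed)
    "cpus": 2 * 350,
    "nvswitches": 4 * 75,
    "nics_storage": 600,
    "fans_pumps": 300,
}
node_w = gpu_w * n_gpus + sum(overhead_w.values())
print(f"{node_w / 1000:.1f} kW per node")  # prints 7.5 kW per node
```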

6. Key Architectural Features for AI

Transformer Engine

The H200 includes a Transformer Engine that automatically selects FP8 or BF16 precision per layer:

# PyTorch + Transformer Engine (TE)
import torch
import transformer_engine.pytorch as te

# Replace a standard nn.Linear with the TE equivalent
layer = te.Linear(4096, 4096, bias=True).cuda()

# FP8 execution is opted into via the autocast context; TE manages the
# scaling factors, amax history, and E4M3/E5M2 format selection
with te.fp8_autocast(enabled=True):
    out = layer(torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16))

In-Network Computing via NVSwitch

NVSwitch 3.0 supports SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) in-network reductions — all-reduce operations are partially computed inside the switch fabric, reducing GPU cycles spent on communication.
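
The benefit can be approximated by comparing per-GPU traffic: a ring all-reduce sends 2(N−1)/N × S bytes per GPU, while with in-network reduction each GPU ideally sends its contribution once and receives the reduced result. A rough sketch, ignoring latency and protocol overhead:

```python
def per_gpu_send_bytes(msg_bytes: float, n_gpus: int, in_network: bool) -> float:
    """Bytes each GPU sends during an all-reduce (bandwidth terms only)."""
    if in_network:
        return msg_bytes                              # send once; switch reduces
    return 2 * (n_gpus - 1) / n_gpus * msg_bytes      # classic ring schedule

ring = per_gpu_send_bytes(1e9, 8, in_network=False)   # 1.75 GB
sharp = per_gpu_send_bytes(1e9, 8, in_network=True)   # 1.00 GB
print(f"ring sends {ring / sharp:.2f}x the data of in-network reduction")
```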
