04 — Memory Management on 8x H200¶
1. Memory Budget Planning¶
Training Memory Breakdown (per GPU, Mixed Precision BF16)¶
Model parameters: P bytes (BF16 = 2 bytes/param)
Gradients: P bytes (BF16)
Optimizer states: 6P bytes (FP32 AdamW: master params + m + v = 12 bytes/param)
Activations: A bytes (depends on sequence length, batch size)
─────────────────────────────────────────────────────────────────────────
Total (no ZeRO): 8P + A
With ZeRO-3 (8 GPUs): 8P / 8 + A per GPU (activations are not sharded)
For a 70B model:
- P = 70B × 2 bytes = 140 GB
- Gradients = 140 GB
- Optimizer = 70B × 12 bytes = 840 GB
- Total without ZeRO = 1120 GB + activations → needs ZeRO-3
- With ZeRO-3 across 8 GPUs: ~140 GB of sharded states per GPU, which leaves almost no headroom on a 141 GB H200 → combine with activation checkpointing and/or optimizer CPU offload (Section 6)
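The arithmetic above can be packaged into a small helper. This is a minimal sketch assuming the standard mixed-precision AdamW accounting (2 bytes params + 2 bytes grads + 12 bytes optimizer state per parameter) and that ZeRO-3 shards all three state types but not activations:

```python
def training_mem_gb(num_params: float, zero3_shards: int = 1,
                    activations_gb: float = 0.0) -> float:
    """Per-GPU training memory (GB) for BF16 training with FP32 AdamW states."""
    params_gb = num_params * 2 / 1e9    # BF16 weights
    grads_gb = num_params * 2 / 1e9     # BF16 gradients
    optim_gb = num_params * 12 / 1e9    # FP32 master copy + m + v
    sharded = (params_gb + grads_gb + optim_gb) / zero3_shards
    return sharded + activations_gb     # activations are not sharded by ZeRO

# 70B model: all states on one GPU vs sharded across 8
print(f"{training_mem_gb(70e9):.0f} GB total")               # 1120 GB
print(f"{training_mem_gb(70e9, zero3_shards=8):.0f} GB/GPU")  # 140 GB, states only
```

Note the per-GPU figure excludes activations, CUDA context, and NCCL buffers, so real headroom on a 141 GB H200 is tighter than it looks.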
Inference Memory Breakdown¶
Weights: W bytes (FP16 = 2 bytes/param)
KV Cache: K bytes (2 [K and V] × layers × kv_heads × head_dim × seq_len × batch × dtype_bytes)
CUDA workspace: ~2 GB
Activations: ~batch × seq × hidden × 2 bytes
─────────────────────────────────────────────────────────────────────────
Total: W + K + overhead
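Summing the terms above gives a quick budget check. A sketch, taking the ~2 GB CUDA workspace from the breakdown as a fixed placeholder and using the rough activation estimate batch × seq × hidden × 2 bytes (hidden = 8192 here, the Llama-3 70B value):

```python
def inference_mem_gb(num_params: float, kv_cache_gb: float,
                     batch: int = 1, seq_len: int = 8192, hidden: int = 8192,
                     workspace_gb: float = 2.0) -> float:
    """Total inference memory (GB): weights + KV cache + activations + workspace."""
    weights_gb = num_params * 2 / 1e9            # FP16 weights
    act_gb = batch * seq_len * hidden * 2 / 1e9  # rough activation estimate
    return weights_gb + kv_cache_gb + act_gb + workspace_gb

# 70B model, 86 GB KV cache (batch 32, 8K context, FP16)
print(f"{inference_mem_gb(70e9, kv_cache_gb=86.0, batch=32):.0f} GB")
```

The total here is for the whole model; with tensor parallelism across 8 GPUs, weights and KV cache are split roughly evenly per GPU.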
2. HBM3e Memory Allocation Strategy¶
import torch

def get_memory_stats(device: int = 0) -> None:
    t = torch.cuda.get_device_properties(device).total_memory / 1e9
    r = torch.cuda.memory_reserved(device) / 1e9
    a = torch.cuda.memory_allocated(device) / 1e9
    print(f"Total: {t:.1f} GB | Reserved: {r:.1f} GB | "
          f"Allocated: {a:.1f} GB | Free: {t - r:.1f} GB")

# Monitor before and after model load
get_memory_stats()
model = load_model()  # your model loading routine
get_memory_stats()
CUDA Memory Pool Configuration¶
# Cap the process allocation to leave ~5% of HBM for the CUDA runtime / NCCL
import torch
torch.cuda.set_per_process_memory_fraction(0.95, device=0)

# Allocator tuning via environment variable (set before importing torch)
# to reduce fragmentation:
# PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# For vLLM — controls how much GPU memory the engine (weights + KV cache pool)
# may use: pass gpu_memory_utilization=0.90 to LLM(), or
# --gpu-memory-utilization 0.90 on the server CLI
3. KV Cache Deep Dive¶
KV Cache Size Formula¶
def kv_cache_size_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    max_batch_size: int,
    dtype_bytes: int = 2,  # FP16 = 2, FP8 = 1
) -> float:
    # 2 for K and V
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * max_seq_len * max_batch_size * dtype_bytes)
    return total_bytes / 1e9

# Llama-3 70B: 80 layers, 8 GQA kv-heads, head_dim=128
kv = kv_cache_size_gb(
    num_layers=80,
    num_kv_heads=8,  # GQA reduces from 64 query heads to 8 KV heads
    head_dim=128,
    max_seq_len=8192,
    max_batch_size=32,
)
print(f"KV Cache: {kv:.1f} GB")  # ~86 GB with FP16 (halved with FP8)
Grouped Query Attention (GQA) Memory Savings¶
| Attention Type | KV Heads | KV Cache Size | Used By |
|---|---|---|---|
| MHA | 64 | 100% | GPT-3, early transformers |
| GQA | 8 | 12.5% | Llama-3, Mistral |
| MQA | 1 | 1.6% | PaLM, Falcon |
GQA is the best trade-off: near-MHA quality at much lower KV memory cost.
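The table's ratios follow directly from the KV-cache formula. A self-contained check, reusing the Llama-3 70B layer/head/dim figures from above:

```python
def kv_bytes(num_kv_heads: int, num_layers: int = 80, head_dim: int = 128,
             seq_len: int = 8192, batch: int = 32, dtype_bytes: int = 2) -> int:
    # 2 accounts for the separate K and V tensors
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes

mha = kv_bytes(64)  # one KV head per query head
gqa = kv_bytes(8)   # Llama-3-style grouped KV heads
mqa = kv_bytes(1)   # single shared KV head
print(f"GQA/MHA = {gqa / mha:.1%}, MQA/MHA = {mqa / mha:.1%}")
# → GQA/MHA = 12.5%, MQA/MHA = 1.6%
```

Since KV-cache size is linear in the number of KV heads, the savings are exactly the head-count ratio regardless of the other dimensions.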
4. PagedAttention (vLLM)¶
Traditional KV cache allocation:
Max sequence length is pre-allocated, even if unused.
Seq A (actual 200 tokens, allocated 4096): [200 used ][3896 wasted]
Seq B (actual 3800 tokens, allocated 4096): [3800 used][296 wasted ]
→ ~30–50% GPU memory waste → fewer concurrent sequences
PagedAttention:
Memory divided into fixed-size "blocks" (16 tokens per block by default)
Seq A: [Block 3][Block 7][Block 12] (only 13 blocks used)
Seq B: [Block 0][Block 1]...[Block 237] (238 blocks used)
Blocks are allocated on demand, freed immediately when sequence ends
→ Near-zero waste → 2-4× more concurrent sequences
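The block-table idea sketched above can be shown with a toy allocator. This assumes a simple free-list; vLLM's real allocator additionally handles prefix caching, copy-on-write, and CPU swapping:

```python
import math

class ToyBlockAllocator:
    """Toy paged KV allocator: blocks handed out on demand, freed on sequence end."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids not in use
        self.tables = {}                      # seq_id -> list of physical block ids

    def allocate(self, seq_id: str, num_tokens: int) -> list:
        need = math.ceil(num_tokens / self.block_size)  # only what the seq uses
        blocks = [self.free.pop() for _ in range(need)]
        self.tables[seq_id] = blocks
        return blocks

    def release(self, seq_id: str) -> None:
        self.free.extend(self.tables.pop(seq_id))  # blocks reusable immediately

alloc = ToyBlockAllocator(num_blocks=1024)
blocks = alloc.allocate("seq_a", 200)
print(len(blocks))  # 13 blocks, vs a 256-block (4096-token) static reservation
alloc.release("seq_a")
```

The per-sequence waste is bounded by one partially filled block (< 16 tokens), which is where the "near-zero waste" claim comes from.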
# Tune block size for your workload
llm = LLM(
    model="...",
    block_size=16,  # 16 tokens per block (default)
    # 32 tokens/block = higher throughput, less flexibility
    # 8 tokens/block = more flexibility, more metadata overhead
)
5. Activation Checkpointing Trade-offs¶
Without checkpointing:
Forward pass saves ALL activations for backward.
Memory = O(layers × batch × seq × hidden)
With checkpointing (every k layers):
Forward pass saves only checkpoint activations.
Backward recomputes k layers when needed.
Memory = O((layers/k) × batch × seq × hidden)
Compute overhead ≈ one extra forward pass for recomputed layers (~33% more FLOPs)
For H200 (large memory): checkpoint only when nearing OOM
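The O() trade-off above, as a rough calculator. Constant factors are omitted and real activation memory also includes attention intermediates, so treat the outputs as relative, not absolute:

```python
def act_mem_gb(layers: int, batch: int, seq: int, hidden: int,
               bytes_per: int = 2, ckpt_every: int = None) -> float:
    """Activation memory per the asymptotics above (constant factors omitted)."""
    per_layer = batch * seq * hidden * bytes_per / 1e9
    if ckpt_every is None:
        return layers * per_layer               # store every layer's activations
    return (layers / ckpt_every) * per_layer    # store only checkpointed layers

full = act_mem_gb(80, batch=8, seq=8192, hidden=8192)                # ~86 GB
ckpt = act_mem_gb(80, batch=8, seq=8192, hidden=8192, ckpt_every=8)  # ~11 GB
print(f"no ckpt: {full:.1f} GB, ckpt every 8: {ckpt:.1f} GB")
```

Larger k saves more memory but recomputes more per backward step; k is tuned to land just under the HBM budget.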
# Selective checkpointing — only checkpoint expensive ops
import torch
import torch.nn as nn
import torch.utils.checkpoint as cp

class TransformerLayer(nn.Module):
    def forward(self, x):
        # Only checkpoint the attention (expensive memory-wise)
        attn_out = cp.checkpoint(self.attention, x, use_reentrant=False)
        # Don't checkpoint the MLP (cheaper to store)
        return self.mlp(attn_out)
6. CPU Offloading Strategy¶
For H200, CPU offloading is rarely needed for 70B inference. But for training very large models:
# DeepSpeed CPU offloading for optimizer states
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,     // pinned memory for fast H2D transfers
      "buffer_count": 4,      // prefetch buffers
      "fast_init": false
    },
    "offload_param": {
      "device": "none"        // H200: keep params on GPU
    }
  }
}
Bandwidth Requirement for CPU Offload¶
CPU RAM ↔ GPU: PCIe 5.0 x16 = ~64 GB/s (H200 host)
NVLink C2C (GH200): 900 GB/s (no PCIe bottleneck)
With PCIe 5.0: for a 70B model optimizer (840 GB FP32),
full round-trip = 840 GB / 64 GB/s = ~13 seconds per update step
→ CPU offload is a LAST RESORT for PCIe-attached systems
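The round-trip figure is simple arithmetic over the optimizer-state size and link bandwidth:

```python
# Back-of-envelope check of the figures above
optimizer_gb = 70e9 * 12 / 1e9  # FP32 master + m + v for a 70B model = 840 GB
pcie5_gbps = 64                 # ~PCIe 5.0 x16 effective bandwidth, GB/s
nvlink_c2c_gbps = 900           # GH200 CPU<->GPU link, GB/s

print(f"PCIe 5.0:   {optimizer_gb / pcie5_gbps:.1f} s per full transfer")   # ~13.1 s
print(f"NVLink C2C: {optimizer_gb / nvlink_c2c_gbps:.2f} s per full transfer")
```

In practice DeepSpeed overlaps transfers with compute and only moves the shard each GPU owns, so the observed stall is smaller, but the bandwidth gap between PCIe and C2C remains the deciding factor.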
7. Memory Profiling Tools¶
# PyTorch memory snapshot (detailed allocation trace)
import torch

torch.cuda.memory._record_memory_history(max_entries=100000)
# ... run workload ...
torch.cuda.memory._dump_snapshot("mem_snapshot.pickle")
# Visualize at: https://pytorch.org/memory_viz
# Nsight Systems — system-level memory timeline
nsys profile \
--trace=cuda,nvtx,osrt \
--gpu-metrics-device=all \
--output=mem_profile \
python train.py
# Nsight Compute — per-kernel memory analysis
ncu --metrics l1tex__t_bytes.sum,dram__bytes.sum \
    --target-processes all \
    python train.py
# nvidia-smi memory monitoring during training
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.free,utilization.gpu \
--format=csv,noheader,nounits
8. OOM Debugging Checklist¶
1. Check reserved vs allocated:
torch.cuda.memory_reserved() >> torch.cuda.memory_allocated() ?
→ Memory fragmentation: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
2. Unexpected allocations:
→ Use torch.cuda.memory_snapshot() to find surprise allocations
3. Gradient accumulation with DDP:
→ with model.no_sync(): inside accumulation loop prevents premature all-reduce
4. KV cache growing unbounded (inference):
→ Set max_model_len and max_num_seqs limits in vLLM
→ Check for missing attention mask / infinite generation loops
5. NCCL broadcast allocations:
→ NCCL uses ~300 MB per GPU; account for this in memory budget
6. torch.compile memory increase:
→ First iteration may OOM during graph compilation; use smaller batch for warmup
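Item 1 of the checklist can be turned into a quick diagnostic. The 0.3 threshold below is a rule of thumb, not a PyTorch-defined constant, and the helper takes raw byte counts so it can be fed from `torch.cuda.memory_reserved()` / `memory_allocated()`:

```python
def fragmentation_ratio(reserved_bytes: float, allocated_bytes: float) -> float:
    """Fraction of reserved memory the allocator holds but no tensor owns."""
    return (reserved_bytes - allocated_bytes) / max(reserved_bytes, 1.0)

def looks_fragmented(reserved_bytes: float, allocated_bytes: float,
                     threshold: float = 0.3) -> bool:
    # threshold is a heuristic cutoff; past it, try
    # PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    return fragmentation_ratio(reserved_bytes, allocated_bytes) > threshold

print(looks_fragmented(80e9, 40e9))  # half of reserved memory unowned → True
print(looks_fragmented(80e9, 75e9))  # allocator overhead is modest → False
```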