Chapter 4: Multi-B200, GB200 Superchip, and NVL72¶
Overview¶
Single-B200 deployment covers Qwen2.5-72B at FP4 cleanly. It doesn't cover: dense models larger than 192 GB, MoE models with huge expert pools, extreme-batch long-context serving, or sub-30 ms p99 latency on 72B-class chat at scale. For all of those you go multi-B200 — HGX B200 (8-GPU board), GB200 superchip (2 B200 + 1 Grace, coherent), or NVL72 (72 B200 + 36 Grace in one rack).
This chapter is the multi-GPU scaling story for Qwen-class workloads on Blackwell. It covers the fabric, the TP/PP choices, the NCCL hot path, and the new memory tiers that GB200/NVL72 unlock.
By the end you should be able to:
- Pick the right Blackwell platform (HGX vs GB200 vs NVL72) for a given Qwen workload.
- Compute NCCL bandwidth in the decode hot path for TP=8 / TP=16 / TP=72.
- Use Grace LPDDR coherent memory for KV spill, prefix caching, or MoE expert offload.
- Predict serving throughput for Qwen2.5-72B and hypothetical Qwen3-300B+ at rack scale.
1. The Blackwell Platform Ladder¶
┌────────────────────────────────────────────────────────────────┐
│ Tier 1: Single B200 (~$30k chip + carrier) │
│ 192 GB HBM · 8 TB/s · 1.8 TB/s NVLink-5 │
│ Fits Qwen2.5-72B-MX-FP4 entirely │
└────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Tier 2: HGX B200 (8 × B200, SXM) │
│ 1.54 TB HBM total │
│ 64 TB/s aggregate HBM bandwidth │
│ NVSwitch fabric, 130 TB/s bisection │
│ Drop-in HGX H100/H200 replacement │
└────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Tier 3: GB200 Superchip (2 × B200 + 1 Grace, NVLink-C2C) │
│ 384 GB HBM + 480 GB LPDDR5X coherent │
│ Grace ↔ B200: 900 GB/s coherent │
│ Building block of NVL72 racks │
└────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Tier 4: NVL72 Rack (36 × GB200, liquid-cooled) │
│ 72 × B200 + 36 × Grace │
│ 13.8 TB HBM + 17.3 TB LPDDR = ~30 TB coherent │
│ 576 TB/s aggregate HBM bandwidth │
│ NVLink-5 fabric: 130 TB/s bisection │
│ ~120 kW per rack │
└────────────────────────────────────────────────────────────────┘
Decision rule for Qwen workloads:
| Workload | Right tier |
|---|---|
| Single-user low-latency Qwen2.5-72B | Tier 1 (single B200, intra-package TP=2) |
| Multi-tenant chat, 100s of concurrent users | Tier 2 (HGX B200 8-GPU) |
| Long-context KV spill / prefix cache / MoE expert pool | Tier 3 (GB200, uses Grace coherent memory) |
| Frontier model serving (Qwen3-300B+ / Qwen-MoE) | Tier 4 (NVL72) |
| Trillion-parameter dense models | Tier 4 (NVL72), TP=72 |
Most production Qwen2.5-72B serving lives at Tier 1 or Tier 2. NVL72 is overkill for current Qwen lineups but inevitable for what's coming.
2. HGX B200 — The 8-GPU Standard¶
The HGX B200 board is the direct successor to HGX H100 SXM and HGX H200. Eight B200 GPUs on a board, all connected through NVSwitch-5.
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ B200 #0 │ │ B200 #1 │ │ B200 #2 │ │ B200 #3 │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │
└──────────────┴───┬──────────┴──────────────┘
│
┌───────┴───────┐
│ NVSwitch │ (130 TB/s bisection)
│ fabric │
└───────┬───────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌────┴────┐ ┌────────┴────┐ ┌────────┴────┐ ┌─────────┐
│ B200 #4 │ │ B200 #5 │ │ B200 #6 │ │ B200 #7 │
└─────────┘ └─────────────┘ └─────────────┘ └─────────┘
Aggregate: 1.54 TB HBM3e, 64 TB/s HBM bandwidth, 130 TB/s NVLink fabric
For Qwen2.5-72B at TP=8 the model partitions across all 8 cards:
- Each GPU holds 1/8 of weights (~5.6 GB at MX-FP4-mixed).
- Each GPU holds 1/8 of KV per stored sequence.
- AllReduce after attention and FFN happens on NVLink-5 / NVSwitch.
The key change vs HGX H100: collective overhead drops dramatically. NVLink-5 is 2× the per-GPU bandwidth of NVLink-4, and NVSwitch-5 has full bisection at 130 TB/s. For Qwen2.5-72B per-token:
Per layer collectives (TP=8):
AllReduce post-attention: d_model=8192 × 2 bytes = 16 KB
AllReduce post-FFN: d_model=8192 × 2 bytes = 16 KB
= 32 KB / layer
× 80 layers = 2.56 MB / token
NVLink-5 bandwidth per GPU: 1.8 TB/s
Time per 16-KB AllReduce on NVLink-5: ~4 µs (vs ~10 µs on NVLink-4)
Per-token NVLink-5 time: 2 × 80 × 4 µs = 0.64 ms
That 0.64 ms is the new latency floor on collectives. Compare to H100 SXM TP=8's ~1.6 ms floor. The collective overhead is now a smaller fraction of the (faster) compute step, so amortized scaling is even cleaner.
2.1 Expected HGX B200 numbers on Qwen2.5-72B¶
| Metric | TP=4 | TP=8 |
|---|---|---|
| Single-stream decode | ~450 tok/s | ~620 tok/s |
| Batch=32 aggregate decode | ~9,500 tok/s | ~14,000 tok/s |
| Batch=128 aggregate decode | ~22,000 tok/s | ~32,000 tok/s |
| TTFT @ 2k prompt, B=1 | ~15 ms | ~12 ms |
| TTFT @ 32k prompt, B=8 | ~280 ms | ~180 ms |
A single HGX B200 8-GPU box handles production traffic equivalent to 4–6 HGX H100 8-GPU boxes for Qwen-72B-class serving. Operationally and economically, this is the inflection point.
3. The Tensor Parallel Cadence Beyond TP=8¶
Above TP=8 you start losing the natural alignment with n_kv_heads = 8 for Qwen2.5-72B. Options:
- TP=16 — pair off KV heads (each pair of GPUs shares one KV head's slice). Doubles KV memory cost per slice (it's replicated), worth it only when you need the per-stream latency.
- TP=72 — for frontier models, requires PP-like fragmentation or expert-parallelism (MoE). Pure dense Qwen2.5-72B at TP=72 wastes resources; reserved for Qwen3-300B+ class.
The general rule: TP should divide n_kv_heads evenly. Qwen2.5-72B with n_kv_heads = 8 is happy at TP ∈ {1, 2, 4, 8}. Hypothetical Qwen3-MoE with 16 KV heads would extend to TP=16 cleanly.
4. GB200 Superchip and Coherent Grace Memory¶
Above HGX sits the GB200 superchip: 2 × B200 + 1 × Grace ARM CPU, with NVLink-C2C between the Grace and the B200s.
┌──────────────────────────────────────────────────────────┐
│ GB200 Superchip │
│ ┌────────────────────┐ NVLink-C2C ┌──────────────┐ │
│ │ B200 (2 dies) │◄══════════════►│ Grace │ │
│ │ 192 GB HBM3e │ 900 GB/s │ 72-core │ │
│ │ 8 TB/s │ coherent │ 480 GB │ │
│ │ │ │ LPDDR5X │ │
│ └────────────────────┘ └──────────────┘ │
│ ┌────────────────────┐ ┌──────────────┐ │
│ │ B200 (2 dies) │◄══════════════►│ Grace │ │
│ │ 192 GB HBM3e │ │ 72-core │ │
│ │ 8 TB/s │ │ 480 GB │ │
│ │ │ │ LPDDR5X │ │
│ └────────────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────┘
The crucial property: Grace LPDDR is GPU-addressable at memory-coherent speeds. This creates a two-tier memory hierarchy for inference:
- Hot tier: HBM3e — 8 TB/s, 192 GB per B200.
- Warm tier: Grace LPDDR — 900 GB/s, 480 GB per Grace.
4.1 What you put in the warm tier¶
Three production-relevant uses for warm-tier Grace memory:
(a) Long-context KV spill. When a sequence's KV exceeds HBM budget, page the cold blocks (oldest tokens, lowest-attention-weight blocks) to Grace LPDDR. Hot blocks (recent tokens, high-attention ones) stay in HBM. Production frameworks handle this with policies like LRU or attention-weight-based eviction.
Single sequence at 256k context:
KV per layer per token (FP8) = 2 KB
× 80 layers × 256k = 41 GB
→ All in HBM? Yes, but eats 20% of budget for one sequence.
→ Spill cold blocks to Grace: keep latest 32k in HBM, page older.
(b) Prefix cache for shared system prompts. Multi-tenant deployments often share a long system prompt across many requests (RAG context, agent system message, persona). Compute KV once, store in Grace, paste into HBM for each new request. Cuts TTFT for prompt-shared traffic by 10–100×.
(c) MoE expert pool offload. For Qwen-MoE variants (or hypothetical Qwen3-MoE), only ~10% of experts are active per token. Keep all expert weights in Grace; stream active experts to HBM each step. Trade some bandwidth for the ability to host much larger expert pools per GPU.
4.2 Numbers for warm-tier KV¶
A worked example: Qwen2.5-72B serving 16 concurrent users at 128k average context.
KV per user @ MX-FP8: 80 layers × 2 × 8 KV-heads × 128 head-dim × 128k × 1 byte
= 21 GB per user
16 users × 21 GB = 336 GB
Single-B200 HBM budget for KV: ~120 GB
GB200 superchip HBM (2 B200): ~240 GB
GB200 Grace LPDDR: 960 GB
Strategy:
- Hot 16k tokens per user in HBM: 16 × 2.6 GB = 42 GB (fits HBM)
- Cold 112k tokens per user in Grace LPDDR: 16 × 18.4 GB = 295 GB (fits Grace)
- Total: 337 GB, comfortable.
Bandwidth cost per decode:
Hot KV read (HBM): ~2.6 GB × 16 = 42 GB at 8 TB/s = 5 ms aggregate
Cold KV read (Grace): ~18.4 GB × 16 = 295 GB at 900 GB/s = 330 ms aggregate
→ Cold-tier reads dominate if you re-read every step.
This is why production frameworks don't re-read cold KV every step. Instead, recompute KV scores for the cold tier sparingly (e.g., every 16 tokens), or use sliding-window attention for the cold tier so it never gets re-read at full bandwidth. Both patterns are how 256k+ context becomes affordable on GB200.
5. NVL72 — The Rack-Scale Computer¶
NVL72 Rack:
- 36 GB200 superchips (72 B200, 36 Grace)
- 18 compute trays, each holding 2 GB200
- 9 NVSwitch trays
- Liquid cooling loop
- ~120 kW power, ~3000 lb
- 13.8 TB HBM3e + 17.3 TB Grace LPDDR = ~30 TB coherent
- 576 TB/s aggregate HBM bandwidth
- 130 TB/s NVLink-5 bisection (full crossbar through NVSwitch)
The defining property: all 72 GPUs are NVLink-5 peers. Any GPU can read another GPU's HBM at full NVLink-5 speed. Any GPU can read any Grace LPDDR with the same coherent protocol. The entire rack presents as one logical accelerator with 30 TB of memory and 576 TB/s of bandwidth.
5.1 What NVL72 is actually for¶
Not for serving Qwen2.5-72B in default config. That fits in one B200. Putting it on NVL72 wastes 71 cards.
Yes for:
- Hypothetical Qwen3-Max / Qwen-1T — dense models bigger than the GB200's 384 GB. TP=72 puts a 1T-parameter model in one logical device.
- Massive Qwen-MoE serving — total expert pool in 100s of GB, with router-driven activation choosing ~10% per token. NVL72 holds the entire MoE in HBM for full bandwidth.
- Trillion-token concurrent context — 10,000 concurrent users at 32k context each. Aggregate KV = 320 TB at FP16 (210 TB at FP8). Mostly Grace-resident, with hot tiers in HBM.
- Extreme-batch frontier serving — batch=2048+ on Qwen-class large models with continuous batching.
5.2 NCCL on NVLink-5 fabric¶
NVL72's NCCL collectives use the NVSwitch fabric. For TP=72 on a 1T-parameter model (d_model = 16384):
Per-layer AllReduce: 16384 floats × 2 bytes = 32 KB
× ~100 layers (1T-class) = 3.2 MB / token
NVLink-5 bandwidth in the rack: 130 TB/s bisection
Time per 32-KB AllReduce: ~10 µs at TP=72 (latency-bound)
Per-token NCCL time: 2 × 100 × 10 µs = 2 ms
Compute time per token estimate: ~25 ms
NCCL fraction: 2/27 ≈ 7%
Healthy. Above ~TP=72 you'd start being collective-latency-bound, but within one NVL72 rack you have headroom.
6. Choosing the Right Platform — The Decision Matrix¶
| Qwen Workload | Recommended Platform | Why |
|---|---|---|
| Qwen2.5-72B, single-user low-latency chat | Single B200, intra-package TP=2 | 280 tok/s in one chip |
| Qwen2.5-72B, multi-tenant chat (100s users) | HGX B200 (1 box, TP=8) | 32k tok/s aggregate, simplest |
| Qwen2.5-72B, multi-tenant with 128k+ context | GB200 superchip | Grace LPDDR for KV spill |
| Qwen-MoE serving | GB200 (small) or NVL72 (large) | Grace holds expert pool |
| Qwen3-Max-class dense (hypothetical 300B+) | NVL72 partial, TP=16-32 | Doesn't fit smaller platforms |
| Frontier 1T+ dense or MoE | NVL72 full rack | TP=72 with NVSwitch crossbar |
| Trillion-token concurrent serving | NVL72 multi-rack | Need coherent memory at scale |
| Research/training adjacency | NVL72 — same hardware as DGX SuperPOD | Reuses the training fleet |
7. Cost Economics — When Multi-GPU Pays Off¶
Approximate mid-2026 pricing (chip + integration, not list, not on-prem TCO):
| Platform | Cost | Per-GPU effective cost |
|---|---|---|
| Single B200 carrier | ~$45k | $45k |
| HGX B200 8-GPU board | ~$280k | $35k |
| GB200 superchip | ~$100k | $50k |
| NVL72 rack | ~$3.0M | $42k |
For pure Qwen2.5-72B serving where one B200 suffices, single-GPU deployment has the best $/tok/s. For workloads that need multi-GPU collectives (frontier models, extreme concurrency), HGX B200 amortizes the multi-GPU premium across throughput. NVL72 is only economic when the workload demands its scale; for anything smaller, an 8-GPU HGX is cheaper per token served.
The rough heuristic: stay on the smallest platform that fits your workload at the throughput you need. Multi-GPU is for "the model literally doesn't fit" or "I need more aggregate bandwidth than 8 TB/s." Don't scale up because it's available.
8. Diagnostics for Multi-B200 Deployments¶
# 1. Confirm NVLink topology
nvidia-smi topo -m
# HGX B200 expected: NV18 (18-link NVLink-5) between every GPU pair
# 2. Confirm NCCL using NVLink not PCIe
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL trtllm-bench ...
# Look for: "NCCL INFO Channel 00 : ... [NVLink]"
# 3. Check NCCL bandwidth in isolation
nccl-tests/build/all_reduce_perf -b 16K -e 1M -f 2 -g 8
# Healthy on HGX B200: 32 KB AllReduce ≤ 6 µs
# 4. On GB200, check Grace coherent path
numactl --hardware
# Should show 2 NUMA nodes per superchip with low cross-node distance
# 5. NVL72: check rack-level NVLink reach
nvidia-smi nvlink --status -l 0
# All 18 links should be Active with 100 Gbps each
If any of these reveal degraded connectivity, single-GPU performance can be fine while multi-GPU collapses. Always validate the fabric before bench-marking the model.
Key Takeaways¶
| Takeaway | Why it matters |
|---|---|
| HGX B200 8-GPU is the natural Qwen2.5-72B serving platform | One box replaces 4–6 HGX H100 boxes for the same throughput |
| GB200 adds 480 GB of coherent Grace LPDDR per GPU | Unlocks long-context KV spill, prefix caching, MoE pools |
| NVL72 is rack-scale: 30 TB coherent memory, 576 TB/s HBM | The platform for frontier dense and MoE models |
TP should divide n_kv_heads — 8 for Qwen2.5-72B |
TP=1,2,4,8 are the natural Qwen-72B configurations |
| NVLink-5 collective latency is half NVLink-4's | Collective floor drops to ~0.6 ms/token on HGX B200 |
| Cost-efficient deployment = smallest platform that fits | Don't scale up just because the rack is available |
| Always validate the fabric before benchmarking the model | NCCL on PCIe destroys multi-GPU performance silently |
Resources¶
- NVIDIA GB200 NVL72 Architecture Brief: Official rack-scale platform.
- HGX B200 Datasheet: 8-GPU board spec.
- NVIDIA NCCL Documentation: Updated for NVLink-5 / NVSwitch-5.
- Chapter 3 — Single-B200 Qwen Inference: When you don't need multi-GPU.
- Chapter 5 — Blackwell Kernel Engineering: The kernels backing these collectives.
- Phase 5 — NCCL Deep Dive: Companion deep dive on the collective layer.