Blackwell B200 for Qwen Transformer Inference
A 6-chapter special course on running Qwen-class transformer models on NVIDIA's Blackwell B200 generation: single B200, GB200 superchip, and the NVL72 rack-scale system. Inference only — no training, no fine-tuning.
This series is a Blackwell counterpart to the Qwen Inference Optimization series in Edge AI. That series covered Qwen3-4B on Jetson and Qwen2.5-72B on 4×H100. This series picks up where the H100 left off: what changes when you move Qwen2.5-72B (and bigger) to Blackwell.
The short version: a single B200 holds 192 GB of HBM3e at 8 TB/s and adds FP4 tensor cores, so a Qwen2.5-72B-class model fits on one GPU, and its bandwidth-bound decode throughput takes roughly 4–8 H100s to match. The NVL72 rack scales the same model to very large batch sizes and long contexts inside a single 72-GPU NVLink domain. The software has to keep up: Transformer Engine 2, FP4 microscaling, FlashAttention-3 on 5th-gen tensor cores, TMA enhancements, and persistent kernels.
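The capacity and bandwidth figures above can be turned into a quick sanity check. The following is a minimal sketch, not a benchmark: it assumes exactly 72e9 parameters at uniform precision, batch-1 decode that reads every weight once per token, no KV-cache or activation traffic, and an H100 SXM bandwidth of 3.35 TB/s taken from the public spec sheet. The helper names (`weight_gb`, `decode_ceiling`, `h100_equivalents`) are hypothetical and exist only for this illustration.

```python
# Back-of-envelope sizing for a dense 72B-parameter model on a single B200.
# Assumptions (illustrative, not taken from the course chapters):
#   - exactly 72e9 parameters, all stored at the same precision
#   - batch-1 decode reads every weight once per token
#   - KV cache, activations, and kernel overheads are ignored
#   - H100 SXM bandwidth of 3.35 TB/s (public spec sheet figure)

PARAMS = 72e9
B200_HBM_GB = 192.0   # capacity quoted in the text
B200_BW_TBS = 8.0     # bandwidth quoted in the text
H100_BW_TBS = 3.35    # assumption, see above

# MX formats share one 8-bit scale per 32-element block: +8/32 bits per weight.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "mxfp8": 8.0 + 8 / 32,
    "mxfp4": 4.0 + 8 / 32,
}


def weight_gb(params: float, bits: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


def decode_ceiling(params: float, bits: float, bandwidth_tbs: float) -> float:
    """Upper bound on batch-1 decode tokens/s when weight reads dominate."""
    return bandwidth_tbs * 1e12 / (params * bits / 8)


def h100_equivalents(b200_bits: float, h100_bits: float) -> float:
    """B200 ceiling divided by one H100's ceiling, assuming ideal scaling."""
    return (decode_ceiling(PARAMS, b200_bits, B200_BW_TBS)
            / decode_ceiling(PARAMS, h100_bits, H100_BW_TBS))


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        gb = weight_gb(PARAMS, bits)
        fits = gb < B200_HBM_GB
        tok_s = decode_ceiling(PARAMS, bits, B200_BW_TBS)
        print(f"{name:6s} {gb:7.2f} GB  fits={fits}  <= {tok_s:6.1f} tok/s")
    print(f"vs H100 FP8 : {h100_equivalents(4.25, 8.0):.1f}x")
    print(f"vs H100 FP16: {h100_equivalents(4.25, 16.0):.1f}x")
```

Under these assumptions the MXFP4 ceiling on one B200 comes out at roughly 4.5 times a single H100 reading 8-bit weights and roughly 9 times one reading FP16 weights, which is in the same range as the 4–8 figure quoted above. Real throughput will be lower once KV-cache reads, attention, and multi-GPU communication on the H100 side are included.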
| Chapter | Title | Focus |
|---|---|---|
| 01 | Blackwell Architecture | Dual-die package, NVLink-C2C, HBM3e, 5th-gen tensor cores, NVL72 |
| 02 | FP4 Numerics & Transformer Engine 2 | MX-FP4/FP6/FP8 microscaling, dynamic per-block quant, quality vs bandwidth |
| 03 | Qwen on a Single B200 | Qwen2.5-72B fitting in one dual-die package, layout, per-die TP via the NV-HBI die-to-die link |
| 04 | Multi-B200 and NVL72 | GB200 superchip, NVL72 architecture, TP=16/72, NCCL on NVLink-5 |
| 05 | Blackwell Kernel Engineering | 5th-gen tensor-core MMA (tcgen05, the successor to Hopper's WGMMA), TMA-2, async warp specialization, FlashAttention-3, CUTLASS 4 |
| 06 | Production Serving on Blackwell | TRT-LLM 0.20+, vLLM Blackwell backend, FP4 in prod, benchmarks, cost |
Prerequisites
- Phase 5 — Edge LLM Inference Internals — GEMV vs GEMM roofline math.
- Phase 5 — Qwen Inference Optimization (full series) — especially Lecture 04 (Qwen2.5-72B Multi-GPU FP16 on H100).
- Phase 5 — NCCL Deep Dive — collective math for the multi-GPU chapters.
- Phase 5 — CUDA Advanced Optimization — TMA, persistent kernels, warp specialization patterns.
Scope
- In scope: Qwen2.5-72B-Instruct, Qwen3-32B (dense), Qwen3-30B-A3B (MoE), and hypothetical Qwen3-300B+ class models on B200 silicon. FP4/FP6/FP8/FP16 dtypes. Single-GPU through NVL72 rack-scale.
- Out of scope: training, fine-tuning, and LoRA. Hopper (H100/H200) recipes are covered in Lecture 04 of the Qwen series. AMD MI300X has a different fabric and a different software stack, and belongs in a different chapter.
A note on dates and software versions
This series assumes Blackwell-aware software stacks from mid-2026: CUDA 13, cuBLAS 13, TensorRT-LLM 0.20+, vLLM Blackwell backend (PR series merged Q1 2026), Transformer Engine 2.x, FlashAttention 3.x. If you're reading on an earlier toolchain, expect missing kernels and slower-than-quoted numbers.