Lecture 4: Qwen2.5-72B-Instruct FP16 — Multi-GPU Inference¶
Overview¶
Qwen2.5-72B-Instruct at FP16 is ~145 GB of weights. No 80 GB-class GPU holds that in HBM, so on H100, A100, and L40S boxes the model only runs by splitting across multiple GPUs: tensor parallel within a layer, optionally pipeline parallel across layers, all coordinated by NCCL collectives that sit in the decode hot path. (A 192 GB MI300X can hold the weights on a single device; Section 10 covers that case.)
This lecture is the inference engineer's view of getting Qwen2.5-72B to serve at production quality on realistic boxes:
- 4 × H100 80 GB (NVL or SXM) — the sweet spot.
- 8 × A100 80 GB — common in 2024-vintage deployments.
- 8 × L40S 48 GB — cheap, no NVLink.
- 4 × MI300X 192 GB — increasingly common.
No quantization. Pure FP16 (the model ships BF16; we cast to FP16 for inference). The goal is correct partitioning, minimal collective overhead in decode, and good numbers on both axes that matter in serving: single-stream latency and batched throughput.
By the end you should be able to:
- Compute exactly which slices of which tensors live on which GPU under tensor parallel.
- Predict NCCL bandwidth requirements per decoded token.
- Choose TP vs PP vs hybrid for a given (model, hardware, latency goal) triple.
- Diagnose where vLLM / SGLang / TRT-LLM is sitting on the roofline.
1. Footprint — Where Does Qwen2.5-72B Fit?¶
Per-tensor footprint at FP16:
| Tensor group | Per-layer FP16 bytes | Total (× 80 layers) |
|---|---|---|
| attn_q.weight (8192 × 8192) | 134 MB | 10.7 GB |
| attn_k.weight (8192 × 1024) | 16.8 MB | 1.3 GB |
| attn_v.weight (8192 × 1024) | 16.8 MB | 1.3 GB |
| attn_o.weight (8192 × 8192) | 134 MB | 10.7 GB |
| attn_qkv.bias (8192 + 1024 + 1024 values) | ~20 KB | 1.6 MB |
| attn_norm.weight | 16 KB | 1.3 MB |
| ffn_gate.weight (8192 × 29568) | 484 MB | 38.8 GB |
| ffn_up.weight (8192 × 29568) | 484 MB | 38.8 GB |
| ffn_down.weight (29568 × 8192) | 484 MB | 38.8 GB |
| ffn_norm.weight | 16 KB | 1.3 MB |
| Per-layer total | ~1.76 GB | 140.4 GB |
| token_embd.weight (152064 × 8192) | n/a | 2.49 GB |
| output.weight (152064 × 8192, untied) | n/a | 2.49 GB |
| output_norm.weight | n/a | 16 KB |
| Grand total | n/a | ~145.4 GB |
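The table can be recomputed from the model shapes alone. The sketch below is a minimal recomputation, assuming the shape constants quoted in this section (d_model 8192, 80 layers, 64 query heads, 8 KV heads, head_dim 128, FFN width 29568, vocab 152064, untied output head); the function names are illustrative and not taken from any runtime.

```python
# Shape constants as quoted in this section (assumed, not read from a checkpoint).
D_MODEL, N_LAYERS = 8192, 80
N_HEADS, N_KV_HEADS, HEAD_DIM = 64, 8, 128
D_FFN, VOCAB = 29568, 152064
BYTES_FP16 = 2


def per_layer_params() -> int:
    """Parameter count of one transformer block."""
    q = D_MODEL * N_HEADS * HEAD_DIM
    k = D_MODEL * N_KV_HEADS * HEAD_DIM
    v = k
    o = N_HEADS * HEAD_DIM * D_MODEL
    qkv_bias = (N_HEADS + 2 * N_KV_HEADS) * HEAD_DIM
    norms = 2 * D_MODEL                 # attn_norm + ffn_norm
    ffn = 3 * D_MODEL * D_FFN           # gate, up, down
    return q + k + v + o + qkv_bias + norms + ffn


def total_params() -> int:
    """All blocks plus embedding, untied LM head, and final norm."""
    embed = VOCAB * D_MODEL
    lm_head = VOCAB * D_MODEL
    return N_LAYERS * per_layer_params() + embed + lm_head + D_MODEL


layer_gb = per_layer_params() * BYTES_FP16 / 1e9
total_gb = total_params() * BYTES_FP16 / 1e9
print(f"per layer: {layer_gb:.2f} GB, all weights: {total_gb:.1f} GB")
```

The result lands at roughly 72.7 B parameters and about 145 GB in decimal units, which is the figure quoted in the overview.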
KV cache (per token, all layers):
2 (K+V) × 80 layers × 8 KV heads × 128 head_dim × 2 bytes = 327 680 bytes
At 32 k context (32 768 tokens), single sequence: 10.7 GB (10.0 GiB)
At 131 k context (131 072 tokens), single sequence: 42.9 GB (40.0 GiB)
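The KV arithmetic above can be wrapped in a small helper to check other context lengths or batch sizes. This is a sketch under the same assumptions (80 layers, 8 KV heads, head_dim 128, 2-byte elements); the helper names are hypothetical.

```python
def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """K and V for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_bytes(context_tokens, batch=1):
    """Total KV cache for `batch` sequences of `context_tokens` each."""
    return batch * context_tokens * kv_bytes_per_token()


for ctx in (32_768, 131_072):
    b = kv_bytes(ctx)
    print(f"{ctx:>7} tokens: {b / 1e9:.1f} GB ({b / 2**30:.1f} GiB)")
```

Printing both decimal GB and binary GiB makes it clear which unit a quoted figure uses; the weight table is in decimal units.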
You need weights + KV + activations + CUDA contexts + NCCL buffers in aggregate VRAM:
| Box | Aggregate VRAM | Headroom after ~145 GB weights | KV for batch=1 @ 32k | Notes |
|---|---|---|---|---|
| 8 × L40S 48 GB | 384 GB | 239 GB | trivial | No NVLink → PCIe-only collectives |
| 8 × A100 40 GB | 320 GB | 175 GB | trivial | NVLink 3, 600 GB/s |
| 4 × A100 80 GB | 320 GB | 175 GB | trivial | NVLink 3 |
| 4 × H100 80 GB | 320 GB | 175 GB | trivial | NVLink 4, 900 GB/s |
| 8 × H100 80 GB | 640 GB | 495 GB | trivial | NVLink 4 + NVSwitch |
| 4 × MI300X 192 GB | 768 GB | 623 GB | trivial | Infinity Fabric |
Even the smallest boxes here (320 GB aggregate) have ~175 GB free after weights. The constraint is not capacity; it is the inter-GPU fabric (mostly its latency) during decode.
2. Tensor Parallelism: Split Each Layer Across GPUs¶
Tensor parallel (TP) splits each matrix multiply across the GPUs that share the layer. Each GPU holds a slice of the weights and computes a slice of the output.
2.1 TP on QKV¶
Standard pattern (Megatron-style):
- Q, K, V are sharded by head. For Qwen2.5-72B with n_heads = 64, n_kv_heads = 8:
TP=2: GPU0 gets 32 Q heads + 4 KV heads
GPU1 gets 32 Q heads + 4 KV heads
TP=4: each GPU gets 16 Q heads + 2 KV heads
TP=8: each GPU gets 8 Q heads + 1 KV head ← matches n_kv_heads exactly
TP=16: each GPU gets 4 Q heads; 8 KV heads cannot be split 16 ways, so each KV head is replicated on a pair of GPUs
The natural TP degree is one that divides n_kv_heads evenly. Qwen2.5-72B's 8 KV heads make TP=8 the largest "clean" split. Above TP=8 you must duplicate each KV head across several GPUs, which wastes VRAM but is still done in practice for latency reasons.
- Output projection (W_O) is sharded by input. Each GPU produces a partial output that gets summed across the TP group with AllReduce.
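The head split can be computed rather than memorized. The sketch below assumes contiguous assignment of query heads to ranks and standard GQA grouping (each KV head serves n_heads / n_kv_heads consecutive query heads); `shard_heads` is a hypothetical helper, not a function from vLLM or Megatron.

```python
def shard_heads(tp, n_heads=64, n_kv_heads=8):
    """Return one (q_head_ids, kv_head_ids) pair per TP rank."""
    if n_heads % tp != 0:
        raise ValueError("TP degree must divide the number of query heads")
    q_per_rank = n_heads // tp
    group = n_heads // n_kv_heads          # query heads served by one KV head
    plan = []
    for rank in range(tp):
        q_ids = list(range(rank * q_per_rank, (rank + 1) * q_per_rank))
        # A rank needs every KV head that any of its query heads reads.
        kv_ids = sorted({q // group for q in q_ids})
        plan.append((q_ids, kv_ids))
    return plan


for tp in (2, 4, 8, 16):
    q_ids, kv_ids = shard_heads(tp)[0]
    print(f"TP={tp:>2}: rank 0 holds {len(q_ids)} Q heads, KV heads {kv_ids}")
```

At TP=16 ranks 0 and 1 both report KV head 0, which is the replication described above: the KV cache for that head is stored twice.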
2.2 TP on FFN¶
- Gate and Up are sharded by output (column-parallel). Each GPU holds intermediate / TP columns. After SwiGLU, each GPU has a slice of the intermediate vector.
- Down is sharded by input (row-parallel). Each GPU multiplies its slice of the intermediate by its slice of W_down. Partials get summed across the TP group with AllReduce.
So per layer, you have two AllReduce operations (one after attention's W_O, one after FFN's W_down), each on a vector of size d_model = 8192.
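To see why a column-parallel Gate/Up followed by a row-parallel Down needs exactly one sum at the end, here is a toy pure-Python sketch with tiny dimensions. The loop over `rank` stands in for separate GPUs and the final list sum stands in for the AllReduce; all names and sizes are illustrative assumptions.

```python
import math
import random


def matvec(x, W):
    """x: [n], W: n rows of m columns -> [m]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def silu(z):
    return z / (1.0 + math.exp(-z))


def ffn(x, W_gate, W_up, W_down):
    gate, up = matvec(x, W_gate), matvec(x, W_up)
    hidden = [silu(g) * u for g, u in zip(gate, up)]   # SwiGLU
    return matvec(hidden, W_down)


def ffn_tensor_parallel(x, W_gate, W_up, W_down, tp):
    d_ffn = len(W_down)
    assert d_ffn % tp == 0
    step = d_ffn // tp
    partials = []
    for rank in range(tp):                              # one iteration per "GPU"
        cols = range(rank * step, (rank + 1) * step)
        gate_r = [[row[j] for j in cols] for row in W_gate]   # column slice
        up_r = [[row[j] for j in cols] for row in W_up]       # column slice
        down_r = [W_down[j] for j in cols]                    # row slice
        partials.append(ffn(x, gate_r, up_r, down_r))         # full-width partial
    # Stand-in for AllReduce(sum) over the TP group.
    return [sum(p[k] for p in partials) for k in range(len(x))]


random.seed(0)
d_model, d_ffn = 8, 12
rand = lambda r, c: [[random.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
x = [random.uniform(-1, 1) for _ in range(d_model)]
W_gate, W_up, W_down = rand(d_model, d_ffn), rand(d_model, d_ffn), rand(d_ffn, d_model)
full = ffn(x, W_gate, W_up, W_down)
sharded = ffn_tensor_parallel(x, W_gate, W_up, W_down, tp=4)
print("max abs difference:", max(abs(a - b) for a, b in zip(full, sharded)))
```

SwiGLU is elementwise, so it runs on each rank's slice with no communication; only the row-parallel Down projection produces partial sums that must be reduced. The two results differ only by floating-point summation order.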
2.3 NCCL bandwidth in the hot path¶
Per token, per layer:
AllReduce post-attention: 8192 FP16 values × 2 bytes = 16 KB
AllReduce post-FFN: 8192 FP16 values × 2 bytes = 16 KB
= 32 KB / layer
× 80 layers = 2.56 MB / token
At a 50 tok/s target this is 128 MB/s, trivially within NVLink-4 (900 GB/s aggregate per GPU) or even PCIe Gen4 x16 (~32 GB/s per direction).
But AllReduce has a latency floor. On NVLink-4 a 16 KB AllReduce takes ~10 µs. On PCIe-only (8 × L40S without NVLink), it can take 50–100 µs. Times 2 ops per layer × 80 layers:
- NVLink-4: ~1.6 ms latency floor per token
- NVLink-3 (A100): ~2.4 ms
- PCIe-only (L40S): 8–16 ms
On NVLink boxes the collective latency floor caps single-stream decode at roughly 600 tok/s (NVLink-4) or 400 tok/s (NVLink-3), far above any throughput you would actually hit. On PCIe-only L40S, the floor caps you at ~60–120 tok/s, which can matter when you're optimizing the rest of the stack hard.
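The payload and latency-floor arithmetic in this section is easy to parameterize. A minimal sketch, assuming two AllReduces per layer and the per-call latencies quoted above (which are this lecture's estimates, not measurements); the function names are illustrative.

```python
def allreduce_bytes_per_token(d_model=8192, n_layers=80, bytes_per_elem=2, per_layer=2):
    """Logical AllReduce payload for one decoded token (batch=1)."""
    return per_layer * n_layers * d_model * bytes_per_elem


def latency_floor_tok_s(allreduce_us, n_layers=80, per_layer=2):
    """Upper bound on single-stream tok/s if collectives were the only cost."""
    floor_s = allreduce_us * 1e-6 * per_layer * n_layers
    return 1.0 / floor_s


print(f"payload per token: {allreduce_bytes_per_token() / 1024:.0f} KiB")
for name, us in (("NVLink-4", 10), ("NVLink-3", 15), ("PCIe best", 50), ("PCIe worst", 100)):
    print(f"{name:>10}: {latency_floor_tok_s(us):6.1f} tok/s ceiling")
```

The payload stays in the low megabytes per token while the ceiling moves by an order of magnitude with per-call latency, which is why the fabric's latency matters more than its bandwidth for decode.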
2.4 Sequence parallelism for the activations¶
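A subtle TP optimization: the operations that sit between a row-parallel output and the next column-parallel input (RMSNorm, residual add) are per-token and need no cross-GPU data, so each GPU only has to hold a 1/TP slice of the tokens there. Sequence parallelism exploits this by replacing each AllReduce with a ReduceScatter (after the row-parallel GEMM) followed by an AllGather (before the next column-parallel GEMM). Total traffic is the same, but activation memory in those regions scales with 1/TP of the sequence instead of the full sequence on every GPU.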
For Qwen2.5-72B at long context with high batch, this is a real memory win. vLLM, TRT-LLM, and DeepSpeed-Inference all support it; whether they enable it by default is version-dependent. Worth checking.
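The collective rewrite behind sequence parallelism can be checked with a toy model. In this sketch each rank holds a partial sum for every token, and each token's hidden vector is collapsed to a single integer for brevity; the three functions simulate the collectives in plain Python and are not NCCL calls.

```python
def all_reduce(partials):
    """Every rank ends up with the full sum for every token."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]


def reduce_scatter(partials):
    """Rank r ends up with the summed values for its 1/TP slice of tokens only."""
    tp, n = len(partials), len(partials[0])
    assert n % tp == 0
    step = n // tp
    return [
        [sum(p[t] for p in partials) for t in range(r * step, (r + 1) * step)]
        for r in range(tp)
    ]


def all_gather(shards):
    """Every rank reassembles the full token axis from all slices."""
    full = [v for shard in shards for v in shard]
    return [list(full) for _ in shards]


tp, n_tokens = 4, 8
partials = [[rank * 10 + t for t in range(n_tokens)] for rank in range(tp)]
scattered = reduce_scatter(partials)     # norm / residual would run here, on 2 tokens per rank
gathered = all_gather(scattered)
print(gathered[0] == all_reduce(partials)[0])
```

Between the two calls each rank holds `n_tokens / tp` entries instead of `n_tokens`, which is where the activation-memory saving comes from.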
3. Pipeline Parallelism: Split Layers Across GPUs¶
Pipeline parallel (PP) puts entire transformer blocks on different GPUs:
PP=2: GPU0 = layers 0..39 GPU1 = layers 40..79
PP=4: GPU0 = 0..19 GPU1 = 20..39 GPU2 = 40..59 GPU3 = 60..79
Pros:
- No collectives during the layer computation; only point-to-point handoff between adjacent stages.
- Each stage holds fewer weights, so per-GPU memory pressure is lower.
Cons:
- Latency adds in series. Decoding one token means passing through all N stages one after another.
- For batch=1 decode, a pipeline is strictly slower than tensor parallel because there's no batch dimension to fill bubbles.
- The classic "fill the pipeline" trick only works with batch >> stages, which is throughput-mode serving.
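The layer split and the bubble argument can be sketched in a few lines. This assumes contiguous, near-equal stages and an idealized steady state in which every stage takes the same time and handoff cost is zero; both helpers are hypothetical.

```python
def pp_stage_layers(n_layers, pp):
    """Contiguous (first_layer, last_layer) range for each pipeline stage."""
    base, extra = divmod(n_layers, pp)
    stages, start = [], 0
    for s in range(pp):
        size = base + (1 if s < extra else 0)
        stages.append((start, start + size - 1))
        start += size
    return stages


def stage_utilization(in_flight, pp):
    """Idealized fraction of time a stage is busy with `in_flight` micro-batches."""
    return min(1.0, in_flight / pp)


print(pp_stage_layers(80, 4))
for in_flight in (1, 2, 4, 16):
    print(f"PP=4, {in_flight:>2} in flight: {stage_utilization(in_flight, 4):.0%} busy")
```

With a single sequence in flight each stage idles three quarters of the time at PP=4, which is the batch=1 penalty listed under Cons.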
Recommendation: for Qwen2.5-72B inference, use TP within a node, optionally PP across nodes. Never use PP within a single 4-or-8-GPU NVLink island unless you have a very specific reason (which usually turns out to be wrong).
4. Recipe Table for Production Boxes¶
| Hardware | Recommended TP/PP | Why |
|---|---|---|
| 4 × H100 SXM 80 GB | TP=4 (one node) | Cleanest setup; weights ~36 GB/GPU; NVLink-4 fast collectives |
| 8 × H100 SXM 80 GB | TP=8 | One KV head per GPU with no replication; highest batch capacity |
| 8 × A100 80 GB SXM | TP=8 | Same logic as H100; slightly slower collectives |
| 4 × A100 40 GB | TP=4 | Very tight: weights are ~36 GB/GPU of 40 GB, leaving under 4 GB for KV cache, activations, and CUDA/NCCL buffers |
| 8 × L40S 48 GB | TP=8 over PCIe + NCCL P2P enabled | Cost-effective but watch collective latency floor |
| 2 × node × 4 × H100 | TP=4 in-node, PP=2 across | Useful only for batch >> 16 |
| 4 × MI300X 192 GB | TP=4 | ROCm + Infinity Fabric; vLLM-ROCm works as of 2026 |
5. Decoding One Token — The Annotated Wall Clock¶
Take TP=4 on 4 × H100 SXM, batch=1, ctx=2 k. Per-token wall clock breakdown (approximate):
Step Time What dominates
─────────────────────────────────────────────────────────────
Token embedding lookup 5 µs L2 cache hit
Per layer × 80:
RMSNorm + residual 10 µs GPU compute
QKV GEMV (fused, sharded) 45 µs HBM bandwidth (W_QKV slice)
RoPE on Q,K 5 µs GPU compute
KV append + FlashAttention-decode 80 µs HBM bandwidth (KV cache slice)
AllReduce post-W_O 10 µs NVLink (small payload)
RMSNorm + residual 10 µs GPU compute
Gate+Up GEMV (fused, sharded) 95 µs HBM (intermediate matrices)
SwiGLU 5 µs GPU compute
Down GEMV (sharded) 95 µs HBM
AllReduce post-W_down 10 µs NVLink
─────────
365 µs / layer
× 80 layers = 29.2 ms
LM head GEMV (sharded) 850 µs HBM (huge W_lm_head slice)
Softmax + sample 30 µs
─────────
Total per token ≈ 30.1 ms = 33 tok/s
That's the steady-state single-stream number you should expect from a competent runtime on H100. vLLM in mid-2026 reports ~30–38 tok/s on this configuration; TRT-LLM with custom AllReduce kernels has shipped ~42 tok/s in published benchmarks.
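For batch=8, the wall-clock time per decode step stays nearly the same, because you are bandwidth-bound on weights and each weight read is reused across the whole batch. Aggregate throughput therefore goes to ~250 tok/s (the ideal is 8 × 33 ≈ 264; per-stream KV-cache reads and scheduling overhead take the rest).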
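The breakdown above is simple enough to keep as a script, which makes it easy to ask what changes if one line moves. The sketch below just re-adds the approximate per-step figures from this section; the step names and the grouping into HBM-bound and collective steps are labels chosen here for illustration.

```python
# Approximate per-layer step times in microseconds, copied from the table above.
PER_LAYER_US = {
    "rmsnorm_1": 10, "qkv_gemv": 45, "rope": 5, "attention": 80,
    "allreduce_attn": 10, "rmsnorm_2": 10, "gate_up_gemv": 95,
    "swiglu": 5, "down_gemv": 95, "allreduce_ffn": 10,
}
N_LAYERS = 80
EMBED_US, LM_HEAD_US, SAMPLE_US = 5, 850, 30

layer_us = sum(PER_LAYER_US.values())
token_ms = (EMBED_US + N_LAYERS * layer_us + LM_HEAD_US + SAMPLE_US) / 1000
tok_s = 1000 / token_ms

hbm_steps = ("qkv_gemv", "attention", "gate_up_gemv", "down_gemv")
hbm_share = sum(PER_LAYER_US[s] for s in hbm_steps) / layer_us
collective_share = (PER_LAYER_US["allreduce_attn"] + PER_LAYER_US["allreduce_ffn"]) / layer_us

print(f"{layer_us} us/layer, {token_ms:.1f} ms/token, {tok_s:.0f} tok/s")
print(f"HBM-bound share: {hbm_share:.0%}, collective share: {collective_share:.1%}")
```

With these inputs the HBM-bound steps account for most of each layer and the two AllReduces for a few percent, so this configuration sits on the memory-bandwidth side of the roofline.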
6. Continuous Batching and Paged Attention¶
The single biggest production runtime difference between "8-tokens-per-second" and "300-tokens-per-second" Qwen2.5-72B serving is continuous batching with paged attention (iteration-level scheduling goes back to the Orca paper, paged attention to the vLLM paper; both are now table stakes).
6.1 Why static batching fails¶
Naive batching: collect a batch of N requests, run them together until the slowest finishes, then start the next batch. Two killers:
- Requests have wildly different completion lengths.
- The batch holds the GPU until its longest sequence finishes, so slots freed by short sequences sit idle.
The result: GPU utilization often below 30% on production traffic.
6.2 Continuous batching¶
Treat each request as a sequence of decoding steps. At each step, insert new requests into the batch as old ones finish. The forward pass operates on a "ragged" batch where sequences have arbitrary lengths and ages.
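A toy scheduler makes the difference between the two policies concrete. This sketch counts decode steps only, under simplifying assumptions: every request occupies one batch slot per step, prefill is free, and requests are admitted in arrival order. The function names are hypothetical.

```python
def static_steps(lengths, slots):
    """Static batching: each group of `slots` requests runs until its longest finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_steps(lengths, slots):
    """Continuous batching: a freed slot is refilled at the very next step."""
    queue, running, steps = list(lengths), [], 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.pop(0))                 # admit waiting requests
        steps += 1                                       # one decode step for the batch
        running = [r - 1 for r in running if r > 1]      # drop finished requests
    return steps


lengths = [2, 10, 3, 1, 8, 2, 9, 1]    # output tokens per request
slots = 4
for name, fn in (("static", static_steps), ("continuous", continuous_steps)):
    steps = fn(lengths, slots)
    print(f"{name:>10}: {steps} steps, slot utilization {sum(lengths) / (steps * slots):.0%}")
```

On this small trace continuous batching finishes in 12 steps against 19 for static batching; the gap widens as the spread of output lengths grows.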
6.3 Paged attention¶
The naive KV-cache implementation allocates [max_seq_len × n_kv_heads × head_dim] per sequence up front — wastes memory on short sequences. Paged attention manages the KV cache as fixed-size blocks (typical: 16 tokens) and has a per-sequence block table mapping logical positions to physical blocks.
Logical KV for sequence A: [block_3][block_7][block_9][block_2]
Logical KV for sequence B: [block_4][block_1][block_8]
Physical block pool: { 1: ..., 2: ..., 3: ..., 4: ..., ... }
When a sequence finishes, its blocks go back into the pool. New sequences allocate blocks as needed. This gives near-100% KV-cache utilization in practice.
The kernel impact: FlashAttention-decode has to follow the block table during the K and V reads. There's a custom kernel variant for this — vLLM ships their own, TRT-LLM has its own, SGLang has its own.
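The block-table bookkeeping can be sketched without any GPU code. The class below is a minimal illustration of the idea, assuming 16-token blocks and a simple free list; it is not vLLM's allocator, and `BlockPool` and its methods are hypothetical names.

```python
class BlockPool:
    """Toy paged KV allocator: fixed-size blocks plus a per-sequence block table."""

    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(n_blocks))   # physical block ids not in use
        self.tables = {}                    # seq_id -> list of physical block ids
        self.lengths = {}                   # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve space for one more token, taking a new block only when needed."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:        # first token, or last block is full
            if not self.free:
                raise MemoryError("KV block pool exhausted")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def lookup(self, seq_id, pos):
        """Map a logical token position to (physical block, offset in block)."""
        return self.tables[seq_id][pos // self.block_size], pos % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]


pool = BlockPool(n_blocks=4)
for _ in range(17):
    pool.append_token("A")
print(pool.tables["A"], pool.lookup("A", 16))
```

`lookup` is the indirection a paged attention kernel performs on every K and V read. A 17-token sequence occupies two blocks here, so the only waste is the unused tail of the last block.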
7. Long Context — YaRN and Chunked Prefill¶
Qwen2.5-72B's native context is 32 k. With rope_scaling: yarn, it extends to 131 k.
7.1 YaRN at inference time¶
At inference, YaRN is "applied" by:
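1. Computing rotary frequencies theta_i = 1e6 ^ (-2i/d_head) for i = 0..d_head/2 − 1.
2. Scaling those frequencies by a frequency-dependent (per-dimension) factor: high-frequency dimensions keep their native value, low-frequency dimensions are divided by the scale factor (4 for 32 k → 131 k), and a linear ramp blends the two regimes in between.
3. Applying RoPE with the scaled frequencies, with the cos / sin values multiplied by a small attention-temperature factor (0.1 · ln(scale) + 1).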
No additional weights, no inference-time cost. The runtime just needs to compute the right cos / sin table at startup.
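The startup-time table computation can be sketched directly. The version below follows the structure of the YaRN reference formulation, in which the blend is a linear ramp over rotary dimensions; the defaults beta_fast = 32 and beta_slow = 1 are assumed here rather than taken from this lecture, and the function name is illustrative.

```python
import math


def yarn_inv_freq(head_dim=128, base=1e6, factor=4.0, orig_ctx=32768,
                  beta_fast=32.0, beta_slow=1.0):
    """Return (per-dimension inverse frequencies, attention scale) for YaRN."""
    half = head_dim // 2

    def correction_dim(n_rotations):
        # Rotary dimension whose wavelength fits n_rotations into the original context.
        return head_dim * math.log(orig_ctx / (n_rotations * 2 * math.pi)) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), head_dim - 1)
    if low == high:
        high += 0.001                      # avoid division by zero in the ramp

    inv_freq = []
    for i in range(half):
        plain = base ** (-2 * i / head_dim)                     # native RoPE frequency
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)     # 0 = keep, 1 = interpolate
        inv_freq.append(plain * (1 - ramp) + (plain / factor) * ramp)

    attn_scale = 0.1 * math.log(factor) + 1.0   # multiplied into the cos/sin tables
    return inv_freq, attn_scale


inv_freq, attn_scale = yarn_inv_freq()
print(f"first: {inv_freq[0]:.3g}, last: {inv_freq[-1]:.3g}, scale: {attn_scale:.3f}")
```

The fastest-rotating dimensions come out unchanged and the slowest are divided by the full factor. Everything is computed once at startup, so decode pays nothing extra.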
Catch: runtimes that hard-code "max context = 32 k" without checking rope_scaling will silently truncate. Verify your config flag is --rope-scaling yarn or equivalent.
7.2 Chunked prefill¶
Prefilling a 100 k-token prompt in a single forward pass needs on the order of 100 GB of activation memory, which is too much. Instead, split prefill into chunks of e.g. 2048 tokens and process them sequentially; the KV cache grows incrementally as each chunk completes.
Implementation detail: the same kernel that does decode-step seq_len=1 attention can be reused for prefill chunks — the chunk's queries attend to all of the KV accumulated so far. Generalized FlashAttention does this in one kernel form.
vLLM, SGLang, and TRT-LLM all do chunked prefill automatically. Whether they overlap it with ongoing decode steps (which improves serving TTFT) is a per-runtime decision.
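That chunking changes memory use but not results can be checked on a toy single-head attention. The sketch below is plain Python with tiny dimensions, assuming causal attention and an append-only KV cache; it shows the equivalence only and is not how a fused kernel is written.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(len(values[0]))]


def prefill_full(Q, K, V):
    """One pass: token t attends to tokens 0..t."""
    return [attend(Q[t], K[:t + 1], V[:t + 1]) for t in range(len(Q))]


def prefill_chunked(Q, K, V, chunk):
    """Process `chunk` tokens at a time while the KV cache grows."""
    k_cache, v_cache, out = [], [], []
    for start in range(0, len(Q), chunk):
        end = min(start + chunk, len(Q))
        k_cache.extend(K[start:end])
        v_cache.extend(V[start:end])
        for t in range(start, end):        # this chunk's queries see all KV so far
            out.append(attend(Q[t], k_cache[:t + 1], v_cache[:t + 1]))
    return out


random.seed(0)
n_tokens, dim = 10, 4
vec = lambda: [random.uniform(-1, 1) for _ in range(dim)]
Q, K, V = ([vec() for _ in range(n_tokens)] for _ in range(3))
print(prefill_full(Q, K, V) == prefill_chunked(Q, K, V, chunk=3))
```

The outputs match, while the live score block shrinks from n_tokens × n_tokens to chunk × n_tokens, which is the memory saving chunked prefill buys.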
8. Choosing a Production Runtime in 2026¶
| Runtime | Best at | Notes |
|---|---|---|
| vLLM | OpenAI-API-compatible serving, broad model support, AWQ/GPTQ | The default. Marlin kernels for quantized models. |
| SGLang | Programmatic generation control, structured output | Beats vLLM on cached-prefix workloads |
| TensorRT-LLM | Best raw throughput, NVIDIA-only, harder ops | Used at production scale (integrates with NVIDIA Triton Inference Server) |
| DeepSpeed-Inference (MII) | Microsoft-stack integration | Faded since 2024 but still used in enterprise |
| LMDeploy (TurboMind) | Optimal for Qwen specifically — InternLM-team origin | Best out-of-box numbers on Qwen2.5-72B historically |
| vLLM-ROCm / Aiter | MI300X support | Catching up to CUDA path in throughput |
For Qwen2.5-72B specifically, the InternLM/Alibaba ecosystem has shipped optimized recipes for LMDeploy that often outperform other runtimes by 15–25% on Qwen models. If you're locked to Qwen and NVIDIA, evaluate LMDeploy before committing to vLLM.
9. Practical Deployment Recipe — 4 × H100 SXM¶
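# Pull the model
huggingface-cli download Qwen/Qwen2.5-72B-Instruct --local-dir ./qwen72b
# vLLM. The checkpoint ships BF16; --dtype float16 forces the FP16 cast this lecture assumes.
# YaRN is only needed for prompts beyond the native 32 k window: to serve up to 131 k,
# raise --max-model-len to 131072; to stay at 32 k, drop --rope-scaling.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v "$(pwd)/qwen72b":/model \
  vllm/vllm-openai:latest \
  --model /model \
  --dtype float16 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --served-model-name qwen2.5-72b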
Expected numbers from a clean vLLM deployment in mid-2026:
| Metric | Value |
|---|---|
| Single-stream decode | ~30 tok/s |
| Batch=8 aggregate decode | ~220 tok/s |
| Batch=32 aggregate decode | ~560 tok/s |
| TTFT @ 2 k prompt | ~250 ms |
| TTFT @ 32 k prompt (chunked) | ~2.5 s |
| Peak VRAM/GPU | ~75 GB (out of 80) |
If you're significantly off these, the usual suspects:
- NCCL not using NVLink — check NCCL_P2P_LEVEL=NVL and that nvidia-smi topo -m shows NVLink between all four GPUs.
- Chunked prefill disabled — explicitly enable.
- CUDA Graphs disabled in vLLM (--enforce-eager should NOT be set).
- GPU not at full power — nvidia-smi -q -d POWER should show ~700 W/GPU at saturation.
10. AMD MI300X Path¶
Brief, but it changes the playbook:
- vLLM-ROCm has caught up to ~85% of CUDA path throughput as of 2026.
- MI300X holds 192 GB HBM per GPU — Qwen2.5-72B FP16 fits in one GPU. Pure TP=1 inference works.
- For latency-sensitive serving, TP=2 across two MI300X still helps because per-GPU bandwidth is the limiter even on 192 GB cards.
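- Infinity Fabric provides ~896 GB/s of aggregate peer-to-peer bandwidth per GPU (seven links at 128 GB/s each on the 8-GPU platform), competitive with NVLink-4 in aggregate; any single GPU-to-GPU link is much narrower than that total.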
Hands-On Exercises¶
- TP-degree sweep. On a 4×H100 or 8×A100 box, run Qwen2.5-72B at TP=2, 4, and (if 8 available) 8. Measure single-stream decode and batch=16 aggregate decode. Confirm single-stream tok/s saturates near TP=8 and batch throughput scales nearly linearly with TP up to NCCL bandwidth limits.
- NCCL collective trace. Run NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python serve.py and identify, for each decoded token, the size and count of AllReduce calls. Confirm they match 2 × n_layers × d_model × 2 bytes.
- Chunked vs unchunked prefill. On a 32 k prompt, run with --enable-chunked-prefill and without. Measure TTFT and peak VRAM. Quantify the win.
- YaRN coherence test. Generate a 100-token completion at the end of a 64 k-token prompt (with --rope-scaling yarn). Compare to the same generation at the end of a 4 k-token prompt. The 64 k case should still produce coherent, grammatical output. If it doesn't, your YaRN config isn't being applied; check the runtime flag.
- vLLM vs LMDeploy. Deploy Qwen2.5-72B on the same hardware under both runtimes. Run an identical 5-minute load test (e.g., 50 concurrent users, mixed prompt lengths). Compare aggregate tok/s, p95 latency, and TTFT. Identify which workload class favors which runtime.
- Paged-attention KV utilization. Watch vllm metrics (Prometheus endpoint) during a real load test. Confirm KV-cache utilization climbs to >90% with continuous batching. If it's < 50%, you're under-batched or paged attention is misconfigured.
Key Takeaways¶
| Takeaway | Why it matters |
|---|---|
| 145 GB FP16 weights: TP is mandatory on 80 GB GPUs | No single 80 GB GPU holds the model; this is a system-engineering exercise |
| TP=8 is the natural degree (n_kv_heads=8) | Cleanly partitions KV without replication |
| Two AllReduces per layer, on tiny tensors | NVLink latency floor matters more than bandwidth |
| PP only helps when batch >> stages | Within an NVLink island, TP wins |
| Continuous batching + paged attention is non-negotiable | 3-10× throughput multiplier over static batching |
| YaRN gives you 131 k context with no weight changes | Verify your runtime applies it |
| LMDeploy often beats vLLM on Qwen models specifically | Test both before locking in |
Resources¶
- vLLM paper — Efficient Memory Management for LLM Serving with PagedAttention: The continuous-batching + paged-attention reference.
- Megatron-LM TP paper: Original tensor parallelism design.
- FlashAttention-2 paper: Inference kernel reference.
- FasterTransformer / TRT-LLM kernel guides: Custom AllReduce and decode kernels for production deployments.
- NCCL Tuning Guide: P2P configuration, algorithm selection.
- LMDeploy / TurboMind: Qwen-optimized inference engine.
- YaRN paper: The context-extension math.
- Qwen2.5 technical report: Original release with config and benchmarks.
- vLLM-ROCm: AMD path.
- Phase 5 — GPU Infrastructure — NCCL Deep Dive: The companion deep dive on the collective layer.