Chapter 2: FP4 Numerics and Transformer Engine 2 for Qwen Inference
Overview
Blackwell's 5th-gen tensor cores natively execute FP4, FP6, and FP8 with shared block scales: the MX (microscaling) family of formats from the OCP standard. That sounds esoteric. In practice it means that a Qwen2.5-72B model that is ~145 GB at FP16 shrinks to ~38.5 GB at MX-FP4, with quality degradation small enough to ship in production.
The second-generation Transformer Engine is the software layer that makes this work without you hand-tuning per-tensor scales. It dynamically chooses scale factors per block during inference, picks the right format per tensor (FP4 for FFN-up, FP6 or FP8 for FFN-down and V, etc.), and emits the right kernel variants.
This chapter covers the formats, the engine, and the per-tensor decisions you make when deploying Qwen on B200.
By the end you should be able to:
- Distinguish MX-FP4 / MX-FP6 / MX-FP8 by bit layout and block scale.
- Predict Qwen2.5-72B memory and bandwidth for each format.
- Pick a mixed-precision recipe similar to the Q4_K_M asymmetry (V and FFN-down at higher precision).
- Read a Transformer Engine 2 calibration log and identify which tensors fell back to a higher precision.
1. The OCP Microscaling (MX) Format Family
A microscaled format stores a block of values with a single shared scale factor. The scale is UE8M0 (an unsigned 8-bit exponent with no mantissa, so it is always a power of two), and the block size is 32 elements. Within the block, each element is a small integer or float in the format's element type.
MX block layout (32 elements):
```
[ E ]                          ← 1 byte shared scale (UE8M0, an 8-bit power of two)
[ x0 | x1 | x2 | ... | x31 ]   ← 32 element values
```
The reconstructed value at slot i is:
value_i = scale × element_i
The element type determines the format:
| Format | Element type | Element bits | Block size | Effective bits/value | Range |
|---|---|---|---|---|---|
| MX-FP8 (E5M2) | float 5e2m | 8 | 32 | ~8.25 | ±5.7×10⁴ |
| MX-FP8 (E4M3) | float 4e3m | 8 | 32 | ~8.25 | ±448 |
| MX-FP6 (E3M2) | float 3e2m | 6 | 32 | ~6.25 | ±28 |
| MX-FP6 (E2M3) | float 2e3m | 6 | 32 | ~6.25 | ±7.5 |
| MX-FP4 (E2M1) | float 2e1m | 4 | 32 | ~4.25 | ±6 |
| MX-INT8 | signed int (implicit 2⁻⁶ scale) | 8 | 32 | ~8.25 | ±127 raw (≈ ±1.98 after the implicit scale) |
The "effective bits/value" is the element bits plus the per-block scale overhead amortized across 32 elements: bits + 8/32 = bits + 0.25.
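The block layout and the reconstruction rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the tensor-core implementation: the function names are made up for this example, and the scale rule (`floor(log2(amax))` minus the element format's largest exponent) is assumed from the conversion procedure described in the microscaling literature.

```python
import math

# Non-negative magnitudes representable by an E2M1 (FP4) element.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_EMAX = 2   # exponent of the largest FP4 value (6 = 1.5 * 2**2)
BLOCK = 32     # elements that share one scale


def quantize_block(values):
    """Return (shared_exponent, fp4_elements) for one block of 32 floats."""
    assert len(values) == BLOCK
    amax = max(abs(v) for v in values)
    # Assumed rule: pick a power-of-two scale that puts the block maximum
    # into FP4's top binade. An all-zero block just uses exponent 0.
    shared_exp = math.floor(math.log2(amax)) - FP4_EMAX if amax > 0 else 0
    shared_exp = max(-127, min(127, shared_exp))  # UE8M0 exponent range
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        mag = min(abs(v) / scale, FP4_GRID[-1])  # saturate at 6
        nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements


def dequantize_block(shared_exp, elements):
    """value_i = scale * element_i, with scale = 2**shared_exp."""
    scale = 2.0 ** shared_exp
    return [scale * e for e in elements]


def effective_bits(element_bits, block=BLOCK, scale_bits=8):
    """Element bits plus the shared scale amortized over the block."""
    return element_bits + scale_bits / block


block = [3.0, -0.75, 0.1] + [0.0] * 29
exp, elems = quantize_block(block)
print(exp, dequantize_block(exp, elems)[:3], effective_bits(4))
```

Because the scale is a pure power of two, dequantization is an exponent add rather than a multiply, which is part of why the format is cheap to implement in hardware.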
1.1 Why microscaling at all?
Plain INT4 or FP4 has a fundamental problem: weights and activations in trained LLMs have wide dynamic ranges but small block-local ranges. Globally, a tensor might span ±50; locally, a block of 32 consecutive weights might all be in ±0.05 or all in ±5. With a single global scale, you waste bits.
Microscaling gives each block of 32 its own scale, so each block gets nearly full FP4 precision relative to its local range. The cost: 0.25 bits/element of scale overhead. The gain: dramatically better quality at the same average bit-rate vs uniform-quant baselines.
This is the same insight as ggml's K-quants (Q4_K, Q6_K) — block-wise scales beat global scales — but standardized at the hardware level and natively executed by tensor cores. No on-the-fly dequant in registers; the tensor core consumes the MX-format bits directly.
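To make the block-local argument concrete, the following sketch quantizes a synthetic tensor to FP4 twice, once with a single scale for the whole tensor and once with one scale per 32 elements, and compares the RMS error. The data is synthetic and the helper names are illustrative; the power-of-two scale rule is the same assumption as elsewhere in this chapter's examples.

```python
import math
import random

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fp4_round(x, scale):
    """Round x to the nearest scaled FP4 (E2M1) value, saturating at 6*scale."""
    mag = min(abs(x) / scale, 6.0)
    nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest * scale, x)


def pow2_scale(amax):
    """Power-of-two scale that maps amax into FP4's top binade (assumed rule)."""
    return 2.0 ** (math.floor(math.log2(amax)) - 2)


def rms_error(values, block):
    """RMS quantization error when every `block` values share one scale."""
    err = 0.0
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        scale = pow2_scale(max(abs(v) for v in chunk))
        err += sum((v - fp4_round(v, scale)) ** 2 for v in chunk)
    return math.sqrt(err / len(values))


random.seed(0)
# Synthetic tensor: 64 blocks whose local spread ranges from ~0.05 to ~5.
weights = []
for _ in range(64):
    local = 10 ** random.uniform(-1.3, 0.7)
    weights += [random.gauss(0.0, local) for _ in range(32)]

global_err = rms_error(weights, block=len(weights))  # one scale for everything
block_err = rms_error(weights, block=32)             # one scale per 32 values
print(f"global-scale RMS error: {global_err:.4f}")
print(f"per-block RMS error:    {block_err:.4f}")
```

With a single scale, every block whose values are far below the tensor maximum rounds mostly to zero; with per-block scales those blocks keep the full FP4 grid.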
2. What Each Format Costs for Qwen2.5-72B
Per-tensor breakdown for a Qwen2.5-72B-Instruct (FP16 baseline: ~145 GB):
| Tensor | FP16 GB | MX-FP8 GB | MX-FP6 GB | MX-FP4 GB |
|---|---|---|---|---|
| `attn_q` (× 80) | 10.7 | 5.5 | 4.2 | 2.8 |
| `attn_k` (× 80) | 1.3 | 0.7 | 0.5 | 0.35 |
| `attn_v` (× 80) | 1.3 | 0.7 | 0.5 | 0.35 |
| `attn_o` (× 80) | 10.7 | 5.5 | 4.2 | 2.8 |
| `ffn_gate` (× 80) | 38.8 | 20.0 | 15.2 | 10.3 |
| `ffn_up` (× 80) | 38.8 | 20.0 | 15.2 | 10.3 |
| `ffn_down` (× 80) | 38.8 | 20.0 | 15.2 | 10.3 |
| Embeddings (2 ×) | 4.98 | 2.57 | 1.95 | 1.32 |
| Norms, biases | ~0.05 | ~0.05 | ~0.05 | ~0.05 |
| Total | ~145 GB | ~75 GB | ~57 GB | ~38.5 GB |
| Roofline tok/s @ 8 TB/s | ~55 | ~106 | ~140 | ~210 |
The decode ceiling follows directly from these sizes: every generated token has to stream the full weight set from HBM once, so tok/s ≈ HBM bandwidth ÷ weight bytes. MX-FP4 hits ~210 tok/s on a single B200 for Qwen2.5-72B at short context, competitive with what an 8×H100 cluster delivers at FP8.
2.1 Mixed precision — the asymmetric recipe
As with K-quants (Lecture 2 of the Edge AI Qwen series), not every tensor deserves the same precision. The empirical rule transfers cleanly to MX formats:
| Tensor group | Recommended format | Reasoning |
|---|---|---|
| `attn_q`, `attn_k` | MX-FP4 | Followed by softmax — small Q/K perturbations get smoothed |
| `attn_v` | MX-FP6 or MX-FP8 | Direct content path; quality-sensitive |
| `attn_o` | MX-FP4 | Acceptable degradation; large parameter count |
| `ffn_gate`, `ffn_up` | MX-FP4 | Largest tensors; biggest bandwidth win |
| `ffn_down` | MX-FP6 or MX-FP8 | Most quality-sensitive; folds into residual |
| Embeddings | MX-FP6 | Vocabulary-wide; affects every token |
| Norms, biases | FP32 | Tiny; no reason not to keep |
| KV cache | MX-FP8 (or FP16 for safety) | Read on every decoded token; INT8 also viable |
This "FP4 everywhere except V, FFN-down, and embeddings" recipe is the Blackwell counterpart of Q4_K_M. Memory cost lands around 42–45 GB instead of a pure-FP4 38.5 GB, with notably better quality. Expected MMLU drop vs the FP16 baseline is under 0.5 pts, which recovers roughly 0.8 pts relative to pure MX-FP4 (83.0 → 83.8 in the Section 6 table).
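The 42–45 GB figure can be checked by pricing each tensor group at its own format. This sketch assumes the published Qwen2.5-72B shape and takes the MX-FP6 option wherever the recipe table offers a choice; the `RECIPE` dictionary is an illustration of the table, not a Transformer Engine data structure.

```python
# Qwen2.5-72B shape constants (assumed from the published model config).
HIDDEN, FFN, LAYERS = 8192, 29568, 80
KV_DIM, VOCAB = 8 * 128, 152064  # 8 KV heads x head dim 128

PARAMS = {
    "attn_q": HIDDEN * HIDDEN * LAYERS,
    "attn_k": HIDDEN * KV_DIM * LAYERS,
    "attn_v": HIDDEN * KV_DIM * LAYERS,
    "attn_o": HIDDEN * HIDDEN * LAYERS,
    "ffn_gate": HIDDEN * FFN * LAYERS,
    "ffn_up": HIDDEN * FFN * LAYERS,
    "ffn_down": HIDDEN * FFN * LAYERS,
    "embeddings": 2 * VOCAB * HIDDEN,
}
BITS = {"MX-FP4": 4.25, "MX-FP6": 6.25, "MX-FP8": 8.25}

# The asymmetric recipe from the table, with FP6 chosen for the sensitive tensors.
RECIPE = {
    "attn_q": "MX-FP4", "attn_k": "MX-FP4", "attn_v": "MX-FP6", "attn_o": "MX-FP4",
    "ffn_gate": "MX-FP4", "ffn_up": "MX-FP4", "ffn_down": "MX-FP6",
    "embeddings": "MX-FP6",
}
PURE_FP4 = {name: "MX-FP4" for name in PARAMS}


def recipe_gb(recipe):
    """Weight footprint in GB (1e9 bytes) for a per-tensor format assignment."""
    return sum(PARAMS[name] * BITS[fmt] for name, fmt in recipe.items()) / 8 / 1e9


mixed, pure = recipe_gb(RECIPE), recipe_gb(PURE_FP4)
print(f"mixed: {mixed:.1f} GB, pure FP4: {pure:.1f} GB, extra: {mixed - pure:.1f} GB")
```

Almost all of the extra memory comes from `ffn_down`; promoting `attn_v` is nearly free because GQA keeps it small.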
3. The Transformer Engine 2 Pipeline
Transformer Engine (TE) is the NVIDIA library that owns format selection, scale-factor management, and kernel dispatch for FP8/FP4-mixed inference. Version 2 (Blackwell-aware) handles MX formats and per-block scaling automatically.
3.1 What TE2 does at load time
```
GGUF/safetensors (FP16) ─► TE2 quantizer
                              │
                              ├─ Determine per-tensor format from recipe
                              ├─ Run calibration (a few hundred prompts through the model)
                              ├─ Compute initial per-block scales
                              ├─ Pack into MX layout
                              └─ Write Blackwell-resident weights
```
The calibration pass collects per-block max-abs values across a few hundred prompts, which it uses to set initial scale factors. For pure weight-only quantization (the default for inference) the calibration is fast — a minute or two.
3.2 What TE2 does at runtime
Per layer per token:
1. Read MX-FP4/FP6/FP8 weights from HBM (5th-gen tensor cores handle dequant internally)
2. Compute activations in FP16 or BF16 (the high-precision "accumulator" path)
3. Optionally quantize activations on-the-fly using a learned scale (for KV cache writes)
4. Detect overflow/underflow and trigger fallback if a block saturates
The fallback case is real: a block whose values blow up at runtime gets transparently promoted to FP8 or FP16 for that step. The TE2 log shows promotion events; if you see lots of them on a specific tensor, your recipe is wrong (V or FFN-down at FP4 is the usual culprit).
3.3 A typical TE2 calibration log
```
[TE2] Calibrating Qwen2.5-72B with recipe='qwen-mx-mixed'
[TE2] Layer 0: q=MX-FP4, k=MX-FP4, v=MX-FP6, o=MX-FP4
[TE2] Layer 0: gate=MX-FP4, up=MX-FP4, down=MX-FP6
[TE2] Layer 12: WARNING — v block 7/128 saturated, promoting to MX-FP8
[TE2] Layer 47: WARNING — down block 23/231 saturated, promoting to MX-FP8
[TE2] Calibration complete: 78 of 800,000 weight blocks promoted (0.0098%)
[TE2] Final memory: 42.7 GB (vs 38.5 GB pure FP4, vs 145 GB FP16)
[TE2] Bandwidth/token estimate: 42.7 GB → tok/s ceiling 187
```
The 0.01% promotion rate is healthy. If you see promotion rates above 1%, the recipe is too aggressive for this model.
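When a log runs to thousands of lines, it helps to tally promotions per tensor rather than read them by eye. The parser below is a sketch that assumes the exact line format shown in the sample log above; a real log may differ, so treat the regular expression as a starting point. The function and variable names are illustrative.

```python
import re
from collections import Counter

# Matches lines such as:
#   [TE2] Layer 12: WARNING <dash> v block 7/128 saturated, promoting to MX-FP8
# \W+ skips whatever punctuation separates WARNING from the tensor name.
PROMOTION = re.compile(
    r"\[TE2\] Layer (?P<layer>\d+): WARNING\W+(?P<tensor>\w+) block "
    r"(?P<block>\d+)/(?P<total>\d+) saturated, promoting to (?P<fmt>[\w-]+)"
)


def promotions(log_text):
    """Return a list of (layer, tensor, promoted_format) tuples."""
    events = []
    for line in log_text.splitlines():
        match = PROMOTION.search(line)
        if match:
            events.append((int(match["layer"]), match["tensor"], match["fmt"]))
    return events


sample = "\n".join([
    "[TE2] Layer 0: q=MX-FP4, k=MX-FP4, v=MX-FP6, o=MX-FP4",
    "[TE2] Layer 12: WARNING \u2014 v block 7/128 saturated, promoting to MX-FP8",
    "[TE2] Layer 47: WARNING \u2014 down block 23/231 saturated, promoting to MX-FP8",
])

events = promotions(sample)
per_tensor = Counter(tensor for _, tensor, _ in events)
print(events)
print(per_tensor.most_common())
```

A tensor that dominates `per_tensor` is the first candidate for a higher-precision format in the recipe.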
4. KV Cache in MX Formats
The KV cache is read once per token, every token. Quantizing it pays off proportionally to how often you read it.
| KV format | Bytes per token per layer (Qwen2.5-72B GQA) | KV @ 32k ctx |
|---|---|---|
| FP16 | 4096 | 10.0 GB |
| MX-FP8 | ~2112 | 5.2 GB |
| MX-FP6 | ~1600 | 3.9 GB |
| MX-FP4 | ~1088 | 2.7 GB |
| INT4 with per-channel calibration | ~1024 | 2.5 GB |
Quality-wise: MX-FP8 KV is essentially free (< 0.05 perplexity drop on Qwen2.5-72B). MX-FP4 KV is usable past 8k context but shows degradation in retrieval tasks past ~32k. The safe production default is MX-FP8 KV — cuts bandwidth in half vs FP16 with no measurable quality cost.
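KV-cache sizing follows the same "element bits + 0.25" rule from Section 1, applied to the K and V vectors stored per token. This sketch assumes the Qwen2.5-72B GQA shape (8 KV heads of dimension 128, 80 layers) and reports totals in binary gigabytes (2³⁰ bytes).

```python
# Qwen2.5-72B GQA shape (assumed from the published model config).
KV_HEADS, HEAD_DIM, LAYERS = 8, 128, 80

# Effective bits per value: element bits + 8/32 for the shared block scale.
BITS = {"FP16": 16.0, "MX-FP8": 8.25, "MX-FP6": 6.25, "MX-FP4": 4.25}


def kv_bytes_per_token_per_layer(fmt):
    """Bytes written (and later re-read) for one token in one layer."""
    elements = 2 * KV_HEADS * HEAD_DIM  # one K vector and one V vector
    return elements * BITS[fmt] / 8


def kv_cache_gib(fmt, context_tokens):
    """Total KV cache for one sequence, in GiB (2**30 bytes)."""
    per_token = kv_bytes_per_token_per_layer(fmt) * LAYERS
    return per_token * context_tokens / 2**30


for fmt in BITS:
    per_layer = kv_bytes_per_token_per_layer(fmt)
    print(f"{fmt:7s} {per_layer:6.0f} B/token/layer  {kv_cache_gib(fmt, 32768):5.2f} GiB @ 32k")
```

Multiply by the batch size to get the footprint for concurrent sequences; at large batch the KV cache, not the weights, sets the bandwidth budget.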
Implementation detail: KV is quantized after RoPE, not before. The rotated K is what gets stored, so the per-block scale must accommodate the rotation's magnitude. TE2 handles this automatically; hand-rolled KV-quant runtimes often get this wrong and produce subtle long-generation degradation.
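A two-element example shows why the scale has to be derived from the rotated keys. RoPE rotates each pair of dimensions by a position-dependent angle; the pair's norm is preserved, but the largest single element can grow by up to √2, so a scale computed from pre-rotation maxima may clip. This is a simplified sketch of one rotated pair with illustrative helper names, not a full RoPE implementation.

```python
import math


def rotate_pair(x, y, angle):
    """Rotate one (x, y) pair of key dimensions, as RoPE does per position."""
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c


def amax(values):
    """Largest absolute element, the quantity a block scale is derived from."""
    return max(abs(v) for v in values)


key_pair = (1.0, 1.0)
rotated = rotate_pair(*key_pair, angle=math.pi / 4)

print(f"amax before RoPE: {amax(key_pair):.3f}")
print(f"amax after RoPE:  {amax(rotated):.3f}")
print(f"norm preserved:   {math.isclose(math.hypot(*key_pair), math.hypot(*rotated))}")
```

Computing the block scale from `rotated` rather than `key_pair` is what "quantize after RoPE" means in practice.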
5. Comparison vs ggml K-Quants and AWQ
How do MX formats stack up against the formats you already know from the Edge AI series?
| Format | Bits/weight (effective) | Hardware-native? | Per-block scale | Production support |
|---|---|---|---|---|
| Q4_K_M (ggml) | ~4.5 | No — dequant in registers | Yes (256-element superblock) | llama.cpp |
| AWQ-INT4 g128 | ~4.25 | No (Marlin kernel, dequant in registers) | Per-128-element group | vLLM, TRT-LLM |
| GPTQ-INT4 g128 | ~4.25 | No (Marlin kernel, dequant in registers) | Per-128-element group | vLLM, exllamav2 |
| MX-FP4 | ~4.25 | Yes — 5th-gen tensor core native | Per-32 element | TRT-LLM 0.20+, TE2 |
| MX-FP8 | ~8.25 | Yes — tensor core native | Per-32 element | TRT-LLM 0.20+, TE2 |
The decisive differentiator for MX is hardware-native execution. K-quants and AWQ require a custom CUDA kernel that dequantizes blocks into registers before feeding the matmul. MX formats are consumed directly by the 5th-gen tensor core's MMA instruction — there's no software dequant step.
This matters for two reasons:
- Bandwidth efficiency — the tensor core reads MX-FP4 bytes from HBM, and that's the full DRAM cost. No second pass.
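- Compute throughput — NVIDIA's published B200 figures put dense FP4 tensor throughput at roughly 9 PFLOPS; ~2.25 PFLOPS is the dense FP16/BF16 rate. AWQ-INT4 via Marlin dequantizes to FP16 and issues FP16 MMAs, so it is capped by that lower FP16 rate and also spends instructions on dequant outside the tensor core path.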
For Qwen workloads on Blackwell, the recipe is MX-FP4 weights with the TE2 mixed-precision auto-recipe, plus an MX-FP8 KV cache. Use Q4_K_M only on platforms (Jetson, AMD) where MX isn't supported.
6. The Quality Story — What Actually Drops?
From early-2026 published benchmarks on Qwen2.5-72B-Instruct:
| Config | MMLU | IFEval | GSM8K | HumanEval | MT-Bench |
|---|---|---|---|---|---|
| BF16 baseline | 84.2 | 87.1 | 91.5 | 75.6 | 9.04 |
| MX-FP8 (all) | 84.1 | 86.9 | 91.4 | 75.5 | 9.02 |
| MX-FP6 mixed | 83.9 | 86.5 | 91.0 | 75.1 | 8.98 |
| MX-FP4 mixed (V/down=FP8) | 83.8 | 86.2 | 90.8 | 74.7 | 8.94 |
| MX-FP4 pure (no mixing) | 83.0 | 84.1 | 89.5 | 73.1 | 8.71 |
The mixed-precision MX-FP4 recipe is the sweet spot: 0.4 pts off MMLU, under 1 pt off every other benchmark, and 0.1 off MT-Bench. The single most important parameter is keeping V and FFN-down at FP6 or FP8. Pure FP4 drops noticeably (1.2 pts on MMLU, 3.0 on IFEval); the mixed recipe is production-ready.
For comparison, Q4_K_M on the same model drops ~0.4 MMLU and ~1.2 IFEval — slightly worse than MX-FP4 mixed at the same effective bit-rate, with no hardware-native execution path.
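The per-benchmark drops are easier to compare when computed directly from the table. The sketch below copies the scores from the table above and prints each configuration's loss relative to the BF16 baseline; nothing here is new data.

```python
BENCHMARKS = ["MMLU", "IFEval", "GSM8K", "HumanEval", "MT-Bench"]

# Scores copied from the table above.
SCORES = {
    "BF16 baseline": [84.2, 87.1, 91.5, 75.6, 9.04],
    "MX-FP8 (all)": [84.1, 86.9, 91.4, 75.5, 9.02],
    "MX-FP6 mixed": [83.9, 86.5, 91.0, 75.1, 8.98],
    "MX-FP4 mixed (V/down=FP8)": [83.8, 86.2, 90.8, 74.7, 8.94],
    "MX-FP4 pure (no mixing)": [83.0, 84.1, 89.5, 73.1, 8.71],
}


def drops(config):
    """Points lost per benchmark relative to the BF16 baseline."""
    base = SCORES["BF16 baseline"]
    return {
        name: round(base[i] - SCORES[config][i], 2)
        for i, name in enumerate(BENCHMARKS)
    }


for config in SCORES:
    if config != "BF16 baseline":
        print(f"{config:28s} {drops(config)}")
```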
7. Deployment — From transformers to TRT-LLM
A typical conversion pipeline for Qwen2.5-72B → B200 production:
```bash
# Step 1: download FP16/BF16 weights
huggingface-cli download Qwen/Qwen2.5-72B-Instruct --local-dir ./qwen72b

# Step 2: convert via TRT-LLM's quantizer (uses TE2 internally)
python -m tensorrt_llm.quantization.quantize \
    --model_dir ./qwen72b \
    --output_dir ./qwen72b-mx-fp4 \
    --dtype bf16 \
    --qformat mx_fp4_mixed \
    --calib_dataset openassistant-en-zh \
    --calib_size 256

# Step 3: build the TensorRT engine for B200
trtllm-build --checkpoint_dir ./qwen72b-mx-fp4 \
    --output_dir ./qwen72b-mx-fp4-engine \
    --gemm_plugin mx_fp4 \
    --gpt_attention_plugin auto \
    --max_batch_size 64 \
    --max_input_len 32768 \
    --max_seq_len 65536 \
    --kv_cache_type mx_fp8 \
    --use_paged_context_fmha
```
The build produces a TensorRT engine targeting sm_100 (Blackwell). It ships ~42 GB and loads to one B200 die in ~3 seconds.
8. When MX-FP4 Doesn't Work
Real failure modes worth knowing:
- Long-context retrieval — Needle-in-haystack at >50k context starts showing FP4 degradation. Promote V and FFN-down to FP8 for retrieval-heavy workloads.
- Code generation — HumanEval pass@1 drops ~2.5 pts at pure FP4 and ~0.9 pts at mixed (per the Section 6 table). Acceptable for assistants; might not be for high-stakes code review pipelines.
- Multilingual — minority-language quality drops more than English. Calibrate on a multilingual mix if you serve global traffic.
- Long structured output (JSON, tool calls) — schema adherence degrades subtly. Some production deployments keep `attn_o` at FP6 for this reason.
- MoE active expert routing — for MoE Qwen variants, the router weights are tiny and quality-sensitive. Always keep router at FP16/BF16.
Key Takeaways
| Takeaway | Why it matters |
|---|---|
| MX-FP4 is hardware-native on B200 — no software dequant | Bandwidth-efficient and compute-efficient at the same time |
| Per-32-element block scales beat global scales | Same insight as Q4_K_M, standardized at hardware level |
| Mixed-precision recipe (V and FFN-down at FP6/FP8) is essential | Pure FP4 loses 1–3 pts across benchmarks; mixed loses <1 pt (<0.5 on MMLU) |
| TE2 manages scales and dispatches kernels automatically | Inference engineers rarely write per-tensor scale code |
| MX-FP8 KV cache is essentially free quality-wise | Use it by default for context >4k |
| AWQ/GPTQ remain the right choice off-Blackwell | MX requires 5th-gen tensor cores |
| Quality regressions concentrate in long-context retrieval and code | Validate on your actual workload, not just MMLU |
Resources
- OCP Microscaling Formats v1.0 Spec: The authoritative MX format reference.
- NVIDIA Transformer Engine documentation: The TE2 API and recipe system.
- TensorRT-LLM Quantization Guide: End-to-end MX-FP4 workflow.
- "Microscaling Data Formats for Deep Learning" (2023): The research backing the OCP standard.
- Chapter 1 — Blackwell Architecture: The hardware foundation.
- Chapter 3 — Single-B200 Qwen Inference: Applying these numerics in deployment.
- Qwen Inference Optimization — Lecture 2 (Q4_K_M etc.): The edge-quantization counterpart.