Chapter 6: Production Serving of Qwen on Blackwell

Overview

You have an HGX B200 box. You have a Qwen2.5-72B-MX-FP4-mixed engine that benchmarks well. Now you need to serve a thousand concurrent users at p99 < 500 ms and not have a 3 a.m. page every Tuesday. This chapter is the production engineering layer — runtime selection, batching, observability, capacity planning, cost economics, and the failure modes that show up only under real load.

Most of what's in the production-serving lecture of the Qwen Inference Optimization series (Edge AI / Lecture 05) still applies on Blackwell. This chapter focuses on what changes when you move from H100/H200 to B200 — and what doesn't change but is worth re-validating.

By the end you should be able to:

Choose between TRT-LLM, vLLM (Blackwell backend), SGLang, and LMDeploy for a given workload.
Set up observability that catches Blackwell-specific failure modes (kernel fallback to FP8, NVLink-5 degradation, Grace memory paths).
Plan capacity for a Qwen2.5-72B chat product with realistic peak factors.
Quantify the cost economics vs H100/H200 — and identify the workloads where Hopper still wins.

1. Production Runtime Choices, Mid-2026

Runtime	Blackwell support	Best at	Caveat
TensorRT-LLM 0.20+	First-class, NVIDIA-tested	Highest raw throughput, MX-FP4 mature, full Triton-Inference-Server integration	NVIDIA-only; build flow is heavier
vLLM (Blackwell backend)	Merged Q1 2026	OpenAI-API serving, broad model support, AWQ/GPTQ fallbacks	MX-FP4 path is younger; less mature on edge cases
SGLang on B200	Active port, Q2 2026	Programmatic generation, structured outputs, prefix caching	Smaller team, slower release cadence
LMDeploy / TurboMind	Late-2026 Blackwell port	Qwen-specific optimizations, InternLM-team focus	Catching up to TRT-LLM/vLLM on Blackwell
Custom (CUTLASS 4 + persistent kernels)	Roll-your-own	Research, novel quant formats, model archs	Don't unless stock runtimes fall short

For Qwen2.5-72B on a single B200 or HGX B200 box in mid-2026 production: TRT-LLM 0.20+ is the default choice. vLLM is competitive on most workloads and easier to deploy in a Kubernetes-native shop. SGLang wins for prefix-cache-heavy workloads (RAG with long shared system prompts).
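
The rule of thumb above can be written down as a tiny decision helper. This is only a sketch of the paragraph's guidance, not a benchmark-backed selector; the function name and the two boolean inputs are illustrative assumptions.

```python
def pick_runtime(prefix_cache_heavy: bool, kubernetes_native: bool) -> str:
    """Encode the section's mid-2026 rule of thumb for Qwen2.5-72B on B200."""
    if prefix_cache_heavy:
        # RAG with long shared system prompts: prefix caching dominates.
        return "SGLang"
    if kubernetes_native:
        # Competitive on most workloads and easier to deploy on Kubernetes.
        return "vLLM"
    # Default: highest raw throughput and the most mature MX-FP4 path.
    return "TensorRT-LLM"


if __name__ == "__main__":
    print(pick_runtime(prefix_cache_heavy=False, kubernetes_native=False))
```

Treat the output as a starting point for your own benchmark, since the table's caveats (maturity, release cadence) can outweigh the defaults.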

2. Deployment Pattern: Triton Inference Server + TRT-LLM

The reference NVIDIA stack:

┌──────────────────────────────────────────────────────────┐
│   Triton Inference Server (port 8000)                    │
│   ┌────────────────────────────────────────────────┐    │
│   │  TRT-LLM backend                               │    │
│   │   - Continuous batching                        │    │
│   │   - Paged KV cache                             │    │
│   │   - Chunked prefill                            │    │
│   │   - Speculative decoding (EAGLE-2)             │    │
│   │   - Streaming responses                        │    │
│   └──────────────────┬─────────────────────────────┘    │
│                      │                                   │
│                      ▼                                   │
│        Qwen2.5-72B-MX-FP4 engine                         │
│        (sm_100, persistent kernels)                      │
│                      │                                   │
│                      ▼                                   │
│             B200 (or HGX B200 × N)                       │
└──────────────────────────────────────────────────────────┘
                      │
                      │ REST/gRPC
                      ▼
              Application layer
              (OpenAI-compatible API)

Deployment manifest (Docker Compose excerpt):

services:
  triton:
    image: nvcr.io/nvidia/tritonserver:25.06-trtllm-python-py3
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 8                    # HGX B200 full board
              capabilities: [gpu]
    environment:
      - NCCL_P2P_LEVEL=NVL
      - NCCL_DEBUG=WARN
      - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
      - LD_LIBRARY_PATH=/opt/tritonserver/backends/tensorrtllm
    volumes:
      - ./qwen72b-mx-fp4-engine-tp8:/models/qwen
    command: >
      tritonserver
        --model-repository=/models
        --http-port=8000
        --grpc-port=8001
        --log-verbose=1
        --strict-readiness=true
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v2/health/ready"]
      interval: 30s
      timeout: 10s
      retries: 3
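
A deploy hook usually wants the same readiness check from outside the container before it sends traffic. The following is a minimal sketch using only the standard library; it assumes the HTTP port from the manifest above, and the function names, retry count, and delay are illustrative choices rather than anything Triton prescribes.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def triton_ready(base_url: str = "http://localhost:8000", timeout_s: float = 5.0) -> bool:
    """Return True if Triton's /v2/health/ready endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or `attempts` are exhausted."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

In a deploy script you would call `wait_until_ready(triton_ready)` and only then send warm-up requests. Note that Triton ties this endpoint to model availability only when strict readiness is enabled.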

Expected baseline numbers from this deployment (mid-2026):

Metric	HGX B200, TP=8
Single-stream decode	~620 tok/s
Batch=32 aggregate decode	~14,000 tok/s
Batch=128 aggregate decode	~32,000 tok/s
TTFT @ 2k prompt, B=8	25 ms
TTFT @ 32k prompt (chunked)	180 ms
HBM utilization at saturation	~90%
GPU compute utilization	70-85%
Power per GPU at saturation	950-1000 W
Total box power	8-9 kW
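
The aggregate rows hide what each user sees. A quick sketch of the arithmetic, assuming every slot in the batch is decoding and tokens are spread evenly across streams (a simplification; the function name is illustrative):

```python
def per_stream(aggregate_tok_s: float, batch: int) -> tuple[float, float]:
    """Return (tokens/s per stream, mean inter-token latency in ms)."""
    rate = aggregate_tok_s / batch
    return rate, 1000.0 / rate


if __name__ == "__main__":
    # (batch size, aggregate decode tok/s) taken from the table above.
    for batch, aggregate in [(1, 620), (32, 14_000), (128, 32_000)]:
        rate, itl_ms = per_stream(aggregate, batch)
        print(f"batch={batch:>3}  per-stream={rate:6.1f} tok/s  ITL={itl_ms:.2f} ms")
```

At batch=128 this works out to 250 tok/s per stream, or 4 ms between tokens, so the higher aggregate throughput costs each user some decode speed.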

3. Observability: What to Watch

The dashboards from Edge AI / Qwen / Lecture 05 still apply. What Blackwell adds:

3.1 Blackwell-specific metrics

MX format kernel hit rate: fraction of GEMM calls hitting the MX-FP4 path vs the FP8 fallback path. TRT-LLM exposes this via kernel_dispatch_fp4_count vs kernel_dispatch_fp8_count. If FP8 fallbacks exceed 1%, your recipe is too aggressive.
Transformer Engine 2 promotion events: TE2 logs each time a block's quantization is promoted at runtime (overflow detection). Counter: te2_block_promotion_count. Healthy: < 100/hour under steady load.
NVLink-5 link health: nvidia-smi nvlink --errorcounters -i <gpu> should show zero CRC errors on every link of every GPU. Errors here silently degrade throughput by 20–50% for the affected GPU.
Grace coherent memory access ratio (GB200/NVL72): fraction of memory reads that cross NVLink-C2C from B200 to Grace. Above ~10% suggests KV spill dominates and the KV tiering is worth re-tuning.
TMA-2 multicast utilization: Nsight metric tma_multicast_loads_per_sec. Should be high in attention kernels; if it is absent, FA-3 isn't actually active.
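
The first metric in the list is a ratio of two counters, so an alert needs the arithmetic spelled out. A minimal sketch, assuming you feed it the increase of each dispatch counter over the same window (counters are cumulative, so use deltas between scrapes); the function names are illustrative and the 1% default mirrors the threshold quoted above.

```python
def fp8_fallback_fraction(fp4_dispatches: int, fp8_dispatches: int) -> float:
    """Fraction of GEMM dispatches that took the FP8 fallback path."""
    total = fp4_dispatches + fp8_dispatches
    if total == 0:
        raise ValueError("no GEMM dispatches recorded in this window")
    return fp8_dispatches / total


def fallback_alert(fp4_dispatches: int, fp8_dispatches: int, max_fallback: float = 0.01) -> bool:
    """True when FP8 fallbacks exceed the allowed fraction of dispatches."""
    return fp8_fallback_fraction(fp4_dispatches, fp8_dispatches) > max_fallback


if __name__ == "__main__":
    # Hypothetical window: 980 FP4 dispatches, 20 FP8 fallbacks (2% fallback).
    print(fallback_alert(980, 20))
```

The MX-FP4 hit rate is simply one minus the fallback fraction, so the same two counters drive both views.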

3.2 The Prometheus scrape

A minimal Prometheus query set for a Triton + TRT-LLM Blackwell deployment:

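# Latency (end-to-end request duration, aggregated across series)
histogram_quantile(0.50, sum by (le) (rate(triton_inference_request_duration_us_bucket[1m])))
histogram_quantile(0.95, sum by (le) (rate(triton_inference_request_duration_us_bucket[1m])))
histogram_quantile(0.99, sum by (le) (rate(triton_inference_request_duration_us_bucket[1m])))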

# Throughput
rate(trtllm_decoded_tokens_total[1m])

# KV cache health
trtllm_kv_cache_used_blocks / trtllm_kv_cache_total_blocks

# GPU
DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_MEM_COPY_UTIL
DCGM_FI_DEV_POWER_USAGE
DCGM_FI_DEV_NVLINK_BANDWIDTH_L0   # NVLink utilization

# Blackwell-specific
trtllm_mx_fp4_kernel_dispatches_total
trtllm_te2_promotion_events_total
trtllm_tma2_multicast_loads_total
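
To pull any of these expressions into a script (for a deploy gate or a capacity report), the Prometheus HTTP API's instant-query endpoint is enough. A minimal standard-library sketch; the Prometheus base URL is a placeholder for your own server, the helper names are illustrative, and the metric names are the ones listed above.

```python
import json
import urllib.parse
import urllib.request

# KV cache utilization expression from the query set above.
KV_CACHE_UTILIZATION = "trtllm_kv_cache_used_blocks / trtllm_kv_cache_total_blocks"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query (/api/v1/query)."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def run_instant_query(base_url: str, promql: str, timeout_s: float = 10.0) -> list:
    """Run an instant query and return the list under data.result."""
    with urllib.request.urlopen(instant_query_url(base_url, promql), timeout=timeout_s) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]
```

For example, `run_instant_query("http://prometheus:9090", KV_CACHE_UTILIZATION)` would return one sample per series, assuming a Prometheus server is reachable at that hypothetical address.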

3.3 Failure modes you'll see in production

Symptom	Cause	Fix
Tok/s drops 30% overnight without code change	Driver update reset the kernel cache; first requests take cold paths	Add a warm-up step to the deploy hook
Specific GPU shows 60% util while others 80%	NVLink-5 CRC errors on that GPU's links	Re-seat or RMA; rebalance shard
MX-FP4 hit rate < 80%	Engine built without `--gemm_plugin mx_fp4`	Rebuild with correct flags
TE2 promotion events spike on layer 47	Activation distribution drift on a tensor	Re-calibrate with broader dataset
Grace memory access ratio climbs over time	KV cache fragmentation in Grace tier	Restart serving instance; investigate paging policy
TTFT p99 spikes correlate with long prompts	Chunked prefill disabled or chunk size too small	Enable chunked prefill or increase chunk size
Cold-start TTFT >10s after deploy	TensorRT graph compile from scratch	Cache compiled engine; preload at boot
OOM at certain concurrency despite headroom	Activation working set spike, paged-KV race	Lower max_num_tokens; raise free fraction
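
The "one GPU at 60% while others sit at 80%" row is easy to miss on a dashboard and easy to check in code. A minimal sketch that flags GPUs trailing the median utilization; the 15-point gap is an assumed threshold to tune, and the input is whatever per-GPU utilization you already scrape (for example DCGM_FI_DEV_GPU_UTIL).

```python
from statistics import median


def lagging_gpus(util_by_gpu: dict[int, float], max_gap_pct: float = 15.0) -> list[int]:
    """Return GPU indices whose utilization trails the median by more than max_gap_pct points."""
    if not util_by_gpu:
        return []
    mid = median(util_by_gpu.values())
    return sorted(gpu for gpu, util in util_by_gpu.items() if mid - util > max_gap_pct)


if __name__ == "__main__":
    # Hypothetical snapshot of an 8-GPU box under tensor-parallel load.
    snapshot = {0: 80, 1: 81, 2: 79, 3: 60, 4: 80, 5: 82, 6: 80, 7: 78}
    print(lagging_gpus(snapshot))
```

A flagged GPU is a prompt to check that GPU's NVLink error counters, not a diagnosis in itself.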

4. Capacity Planning: Worked Example

Same scenario as the Edge AI / Qwen / Lecture 05 example, re-derived for Blackwell:

Scenario:

5,000 daily active users (5× the original example).
50 turns/user/day.
300 input + 400 output tokens average.
40% to Qwen2.5-72B-class cloud (60% on edge for free).

Daily cloud load:

Queries:     5000 × 50 × 0.40 = 100,000 / day
Output tok:  100,000 × 400    = 40 M / day
Avg tok/s:   40 M / 86,400    ≈ 460 tok/s
Peak tok/s (5× peak factor):  ~2,300 tok/s
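
The same arithmetic as a reusable function, so you can plug in your own traffic numbers. Like the figures above, this sketch sizes on output tokens only and ignores prefill load; the function and key names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def cloud_load(
    daily_users: int,
    turns_per_user: int,
    cloud_fraction: float,
    output_tokens_per_query: int,
    peak_factor: float,
) -> dict[str, float]:
    """Derive daily query volume and average/peak decode load in tokens/s."""
    queries = daily_users * turns_per_user * cloud_fraction
    output_tokens = queries * output_tokens_per_query
    avg_tok_s = output_tokens / SECONDS_PER_DAY
    return {
        "queries_per_day": queries,
        "output_tokens_per_day": output_tokens,
        "avg_tok_s": avg_tok_s,
        "peak_tok_s": avg_tok_s * peak_factor,
    }


if __name__ == "__main__":
    load = cloud_load(5_000, 50, 0.40, 400, peak_factor=5)
    for name, value in load.items():
        print(f"{name:>22}: {value:,.0f}")
```

With the scenario's inputs the unrounded results are about 463 tok/s average and about 2,315 tok/s peak, which the text rounds to 460 and 2,300.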

Sizing options:

Platform	Capacity at p95 SLA	Boxes needed	Capex	Yearly OPEX (rented)
HGX H100 (FP8, TP=8)	~5,000 tok/s sustained	1	~$250k	~$170k
HGX H200 (FP8, TP=8)	~7,000 tok/s sustained	1	~$280k	~$190k
HGX B200 (MX-FP4, TP=8)	~15,000 tok/s sustained	1	~$280k	~$220k
Single B200 (intra-TP=2)	~3,000 tok/s sustained	1	~$45k	~$45k

For a 2,300 tok/s peak you actually have several viable options:

Single B200 with intra-package TP=2 fits the peak with ~25% headroom. Cheapest by a wide margin if you don't need redundancy.
HGX H100/H200 also works, with roughly 2–3× headroom over this peak, but at HGX-class cost and with a lower capacity ceiling than HGX B200.
HGX B200 is overkill for this load but gives you 6× growth headroom — pick this if you expect to 5× users in 12 months.
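
The headroom figures in these options come from dividing each platform's sustained capacity by the 2,300 tok/s peak. A small sketch of that calculation using the capacities from the sizing table; the function name is illustrative.

```python
PEAK_TOK_S = 2_300

# Sustained capacity at p95 SLA, from the sizing table above.
CAPACITY_TOK_S = {
    "HGX H100 (FP8, TP=8)": 5_000,
    "HGX H200 (FP8, TP=8)": 7_000,
    "HGX B200 (MX-FP4, TP=8)": 15_000,
    "Single B200 (intra-TP=2)": 3_000,
}


def headroom(capacity_tok_s: float, peak_tok_s: float) -> tuple[float, float]:
    """Return (capacity as a multiple of peak, fraction of capacity left spare at peak)."""
    return capacity_tok_s / peak_tok_s, 1.0 - peak_tok_s / capacity_tok_s


if __name__ == "__main__":
    for platform, capacity in CAPACITY_TOK_S.items():
        multiple, spare = headroom(capacity, PEAK_TOK_S)
        print(f"{platform:<26} {multiple:4.1f}x peak, {spare:5.1%} spare at peak")
```

Rerun it with your own peak estimate before committing to a platform; a modest change in peak factor moves the single-B200 option from comfortable to saturated.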

For redundancy/failover, always pair: e.g., 2 × single B200 in active/passive, or 1 HGX B200 + 1 spare B200 for graceful degradation. Don't run production single-instance.

5. Cost Economics: Where Blackwell Wins, Where Hopper Persists

Per-million-tokens-served cost is the production-relevant metric. Mid-2026 numbers (rented cloud capacity, fully-loaded):

Platform	Cost per 1M output tokens, batch-32 steady state
Single H100 SXM	$0.85
HGX H100 8-GPU, TP=8	$0.34
HGX H200 8-GPU, TP=8	$0.28
Single B200	$0.30
HGX B200 8-GPU, TP=8	$0.18
OpenAI gpt-4.1-mini API list price	$0.60

At $0.18 per 1M output tokens, the HGX B200 8-GPU result is about 1.6× cheaper than HGX H200 ($0.28), about 1.9× cheaper than HGX H100 ($0.34), and well below the API list price for a hosted model. This is the cost-economics inflection point that drove the rapid market shift to Blackwell through 2026.
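
To turn per-token prices into a budget line, multiply by the daily output volume. The sketch below applies the table's prices to the 40 M output tokens/day from the capacity example in Section 4. It assumes the batch-32 steady-state price holds for the whole day, which flatters lightly loaded hardware, so read the results as a lower bound; the function name is illustrative.

```python
# USD per 1M output tokens, from the table above.
PRICE_PER_MILLION = {
    "Single H100 SXM": 0.85,
    "HGX H100 8-GPU, TP=8": 0.34,
    "HGX H200 8-GPU, TP=8": 0.28,
    "Single B200": 0.30,
    "HGX B200 8-GPU, TP=8": 0.18,
    "OpenAI gpt-4.1-mini API list price": 0.60,
}


def daily_cost_usd(output_tokens_per_day: float, usd_per_million: float) -> float:
    """Daily serving cost for a given output-token volume and per-1M-token price."""
    return output_tokens_per_day / 1_000_000 * usd_per_million


if __name__ == "__main__":
    tokens_per_day = 40_000_000  # from the Section 4 scenario
    for platform, price in PRICE_PER_MILLION.items():
        print(f"{platform:<36} ${daily_cost_usd(tokens_per_day, price):6.2f}/day")
```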

5.1 When Hopper still wins on cost

Workloads that don't use FP4 — if your eval shows MX-FP4 regression you can't tolerate, you're running FP8 on B200. The cost advantage shrinks to ~1.4× over H200 — still meaningful, but less commanding.
Latency-tier-0 with low concurrency — at single-user serving, the per-GPU cost dominates and Hopper's lower chip price wins. Single H100 chat costs about half what single B200 chat costs.
Existing fleet write-off — if you already paid for a Hopper fleet, the marginal cost of running it through depreciation is small. Refresh on capacity-add, not on full-replace.
Compliance/contracts — some procurement processes are locked to specific SKUs. Migration timelines are quarters, not weeks.

6. The Production Checklist

Before shipping Qwen on Blackwell to real traffic:

7. The Next 12 Months: Speculative

What changes between mid-2026 and mid-2027 in this space:

B300 / Blackwell Ultra likely lands in volume. Same MX format support, marginally faster memory, more HBM capacity (288 GB+). Drop-in for B200 in most deployments.
FP6 maturity: currently FP4-mixed is the sweet spot; FP6-pure may become the production default as kernels mature. Cuts memory ~25% vs FP8 at near-zero quality loss.
Better speculative decoding: EAGLE-3 ports to Blackwell; throughput multipliers in the 2-3× range likely.
Cross-arch portability of quantized models: running an MX-FP4 quantized model on Hopper via emulation is slow today; this gap closes.
MoE serving optimization: for Qwen3-MoE and similar, the Grace LPDDR expert pool pattern matures, dropping the cost of expert serving by 2-3×.
NVL576 (multi-rack): frontier customers stack NVL72 racks via additional NVLink generations. Trillion-parameter inference at scale becomes routine.
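
The memory figure in the FP6 item can be sanity-checked with bits-per-weight arithmetic. This sketch assumes the comparison baseline is FP8, approximates the model as 72 billion parameters, and ignores block-scale metadata and non-weight memory such as the KV cache, so treat the numbers as rough.

```python
def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes), excluding scale metadata."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 72e9  # rough parameter count for a 72B-class model (assumption)
    fp8 = weight_gb(params, 8)
    for name, bits in [("FP8", 8), ("FP6", 6), ("FP4", 4)]:
        size = weight_gb(params, bits)
        print(f"{name}: {size:5.1f} GB  ({1 - size / fp8:.0%} smaller than FP8)")
```

Under these assumptions pure FP6 weights come to 54 GB against 72 GB for FP8, which is the 25% reduction; a mixed recipe lands somewhere between its constituent formats.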

Key Takeaways

Takeaway	Why it matters
TRT-LLM 0.20+ is the production default for Qwen on B200 in mid-2026	Mature MX-FP4 path, Triton-server integration, NVIDIA-tested
HGX B200 8-GPU replaces about 3 HGX H100 boxes for the same throughput	The fleet refresh inflection point
MX-FP4 hit rate is a top-tier production metric	Below 80% means you're not getting the Blackwell advantage
NVLink-5 CRC errors silently degrade throughput	Add to your alerts; check `nvidia-smi nvlink --errorcounters`
Blackwell wins per-token cost ~1.6× over HGX H200, ~1.9× over HGX H100	Drives the market shift through 2026
Single B200 beats Hopper on cost for many workloads with no multi-GPU overhead	Don't default to HGX-class boxes if a single chip suffices
Hopper still wins for already-paid-for fleets and FP4-incompatible workloads	Migration is a quarter-scale project, not a week-scale one
Capacity planning hasn't changed — measure, then size	Same TTFT/ITL/KV/util metrics as Hopper

Resources

TensorRT-LLM 0.20 Release Notes: Blackwell support and MX-FP4.
Triton Inference Server User Guide: Front-end and orchestration.
NVIDIA DCGM exporter: Prometheus exporter for GPU metrics.
vLLM Blackwell backend tracking issue: Status of vLLM Blackwell features.
SGLang documentation: Structured generation on B200.
LMDeploy / TurboMind: Qwen-optimized inference engine.
Phase 5 — Edge AI / Qwen Inference Optimization / Lecture 05: The original production-serving playbook.
Chapter 1 — Blackwell Architecture: Where this all starts.