Qwen Inference Optimization: 6-Lecture Series
A hands-on, hardware-first series on getting real throughput out of two specific Qwen models that bracket the deployment spectrum:
- Qwen3-4B-Instruct (Q4_K_M) — edge target, runs on Jetson Orin Nano 8 GB.
- Qwen2.5-72B-Instruct (FP16) — datacenter target, needs multi-GPU.
Same architecture family, completely different hardware engineering. The series teaches both endpoints, then unifies them through cross-model strategies (speculative decoding, edge↔cloud routing).
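A quick back-of-the-envelope calculation shows why these two targets land on such different hardware. The sketch below estimates weight-only memory; the nominal parameter counts (4e9 and 72e9) and the figure of roughly 4.85 bits per weight for Q4_K_M are assumptions for illustration, and the helper `weight_gib` is hypothetical, not part of any library. KV cache, activations, and runtime overhead are ignored.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GiB (no KV cache, no activations)."""
    return n_params * bits_per_weight / 8 / 2**30


# Assumed nominal sizes and bit widths, not measured values.
edge = weight_gib(4e9, 4.85)  # Q4_K_M mixes 4- and 6-bit blocks; ~4.85 bpw assumed
dc = weight_gib(72e9, 16)     # FP16 stores 16 bits per weight

print(f"Qwen3-4B Q4_K_M : ~{edge:.1f} GiB of weights")
print(f"Qwen2.5-72B FP16: ~{dc:.1f} GiB of weights")
```

Under these assumptions the edge model needs a little over 2 GiB for weights, which leaves headroom on an 8 GB Jetson whose memory is shared between CPU and GPU. The 72B model needs about 134 GiB for weights alone, more than a single 80 GB accelerator can hold, hence the multi-GPU requirement.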
Scope: inference only. Training and fine-tuning are out of scope.
| Lecture | Title | Focus |
|---|---|---|
| 01 | Qwen Architecture Deep Dive | Config, shapes, GQA, RoPE, SwiGLU, tokenizer |
| 02 | Quantizing Qwen3-4B to Q4 | AWQ, GPTQ, K-quants, calibration, GGUF layout |
| 03 | Decode Optimization on Jetson | GEMV chain, KV cache, fusion, CUDA Graphs, INT8 KV |
| 04 | Qwen2.5-72B Multi-GPU FP16 | TP/PP, NCCL hot path, paged attention, YaRN |
| 05 | Cross-Model & Production Serving | Speculative decoding, vLLM/TRT-LLM, observability, hybrid |
| 06 | Batched GEMM vs Normal GEMM | cuBLAS API forms, layout, tensor cores, bit-exact reproducibility |
Prerequisites:
- Phase 5 — Edge AI — Edge LLM Inference Internals (GEMV vs GEMM, roofline)
- Phase 4 Track B — Jetson Real-Time Inference
- Phase 4 Track C — Quantization