04 — Quantization & Low-Precision Inference¶
Order: Fourth. After you have kernels and compiler context (01–03), you add low-precision kernels and deployment.
Role target: DL Inference Optimization Engineer · MTS Kernels (INT8/INT4 kernels, quantization-aware implementations).
Why this comes fourth¶
Quantization (INT8, INT4, etc.) is a major lever for latency and throughput. It touches graph transforms, kernel implementations (integer matmul, quantized attention), and runtime integration. Doing 01–03 first lets you see where quantization plugs in and how it affects kernel choice and fusion.
1. Post-training quantization (PTQ)¶
- INT8 / INT4 — Calibration (representative data), per-tensor vs per-channel scales, calibration datasets.
- Tooling — TensorRT INT8, ONNX Runtime QDQ, PyTorch
torch.quantization. How each represents scales and zero-points. - Kernel impact — Runtimes dispatch to quantized kernels (e.g. INT8 GEMM, INT4 matmul); understanding the kernel path for quantized models.
2. Quantization-aware training (QAT)¶
- Fake quantization — Simulate quantization in the forward pass; straight-through estimator for gradients.
- Accuracy recovery — Fine-tuning to recover accuracy after quantization.
- When to use QAT vs PTQ — Accuracy vs effort; production trade-offs.
3. Advanced formats and methods¶
- Precision — FP16/BF16 on GPUs; INT4/INT8 on CPUs and accelerators.
- LLM-focused — SmoothQuant, GPTQ, AWQ for LLM inference (concepts and where they hook into the stack).
Resources¶
- TensorRT Quantization
- PyTorch Quantization
- tinygrad — Search for quantization passes and kernel support.
Projects¶
- INT8 with TensorRT — Quantize a CNN to INT8 using TensorRT. Report accuracy change and latency/speedup vs FP32.
- PTQ vs QAT — Compare PTQ and QAT on the same model and target. Document accuracy, latency, and engineering effort for each.
- Kernel path — For one quantized model (e.g. TensorRT or ONNX Runtime), identify which kernels run for a few key layers (e.g. INT8 conv/GEMM). Document the dispatch path.
Next¶
→ 05 — Inference Runtimes and Deployment — Production deployment (TensorRT, ONNX Runtime, Triton server) and measurable outcomes.