05 — Inference Runtimes & Deployment Targets¶
Order: Fifth. After graph, kernels, compiler, and quantization (01–04), you deploy and measure in production-like settings.
Role target: DL Inference Optimization Engineer · MTS Kernels (deployment, production reliability, measurable outcomes).
Why this comes fifth¶
Your kernels and optimizations only matter if they run correctly in a runtime and meet latency/throughput goals. This unit covers the main inference runtimes and how to measure and compare them.
1. Runtimes¶
- TensorRT — Engine build, plugins, dynamic shapes, DLA. How your kernels and graph optimizations show up in the engine.
- ONNX Runtime — Execution providers (CUDA, TensorRT, OpenVINO). Graph optimizations and provider selection.
- Triton Inference Server — Batching, model concurrency, metrics. Serving multiple models and dynamic batching.
2. TensorRT-LLM — Production LLM Inference¶
TensorRT-LLM is NVIDIA's open-source library purpose-built for LLM inference on NVIDIA GPUs. It extends TensorRT with LLM-specific optimizations that general-purpose runtimes cannot match. If you deploy large language models on NVIDIA hardware, this is the production path.
Why TensorRT-LLM exists¶
Standard TensorRT handles vision and small models well, but LLMs have unique challenges: - KV-cache management — grows linearly with context length, must be paged and reused across requests. - Autoregressive decoding — each token depends on all previous tokens; batching is complex. - Multi-GPU serving — models that don't fit on one GPU need tensor/pipeline parallelism. - Mixed workloads — prefill (compute-bound) and decode (memory-bound) phases have opposite bottlenecks.
TensorRT-LLM solves all of these with a Python API that compiles LLMs into optimized TensorRT engines with LLM-specific runtime features.
Core features¶
-
In-flight batching (continuous batching):
- Batch requests as they arrive — don't wait for a full batch. New requests join while others are mid-generation.
- Maximizes GPU utilization by mixing prefill and decode phases across requests.
-
Paged KV-cache:
- Inspired by vLLM's PagedAttention — allocates KV-cache in fixed-size blocks, not contiguous per-sequence.
- Eliminates memory fragmentation; enables serving more concurrent sequences.
-
Quantization:
- FP8 (Hopper+), INT8 (SmoothQuant), INT4 (AWQ, GPTQ) — all with fused dequantize in GEMM kernels.
- FP4 on Blackwell for maximum throughput.
-
Tensor parallelism and pipeline parallelism:
- Split model across GPUs: tensor parallel (split within layers) or pipeline parallel (split across layers).
- NCCL-based communication, overlapped with compute.
-
Speculative decoding:
- Draft model generates candidate tokens; main model verifies in one forward pass.
- Reduces time-to-first-token and overall latency.
-
Custom Hopper/Blackwell kernels:
- CUTLASS-based GEMM kernels optimized for each GPU generation.
- Warp specialization, persistent kernels, Transformer Engine integration.
-
CUDA Graphs:
- Captures the decode loop as a CUDA graph — eliminates per-token kernel launch overhead.
Build and deploy workflow¶
# Install
pip install tensorrt-llm
# Step 1: Convert model checkpoint to TRT-LLM format
python convert_checkpoint.py \
--model_dir ./llama-3-8b \
--output_dir ./trt_ckpt \
--dtype float16 \
--tp_size 2 # tensor parallel across 2 GPUs
# Step 2: Build TRT-LLM engine
trtllm-build \
--checkpoint_dir ./trt_ckpt \
--output_dir ./trt_engine \
--gemm_plugin float16 \
--max_batch_size 64 \
--max_input_len 2048 \
--max_seq_len 4096 \
--paged_kv_cache enable \
--use_fused_mlp enable
# Step 3: Run inference
python run.py \
--engine_dir ./trt_engine \
--tokenizer_dir ./llama-3-8b \
--input_text "Explain how a systolic array works"
# Step 4: Serve with Triton Inference Server
# TRT-LLM integrates with Triton via the TRT-LLM backend
# → in-flight batching, streaming, multi-model serving
TensorRT-LLM vs vLLM¶
| TensorRT-LLM | vLLM | |
|---|---|---|
| Approach | Compile model to optimized engine (ahead-of-time) | JIT with PyTorch + custom CUDA kernels |
| Performance | Highest throughput on NVIDIA GPUs (custom Hopper/Blackwell kernels) | Very good; slightly lower peak but faster iteration |
| Quantization | FP8, INT8, INT4, FP4 with fused kernels | GPTQ, AWQ, FP8 via external libraries |
| Multi-GPU | TP + PP via NCCL | TP via NCCL |
| Setup complexity | Higher — build step required | Lower — load and serve |
| Model support | Major LLMs (Llama, Mistral, GPT, Falcon, etc.) | Broader model support via HuggingFace |
| Hardware | NVIDIA only | NVIDIA + AMD (ROCm) |
| Best for | Maximum throughput in production on NVIDIA | Rapid prototyping, AMD support, flexibility |
Key concepts to internalize¶
- Prefill vs decode phases — Prefill processes the entire prompt in one pass (compute-bound, high arithmetic intensity). Decode generates one token at a time (memory-bound, low arithmetic intensity). TRT-LLM optimizes both with different kernel strategies.
- KV-cache sizing — For a 7B model at FP16 with 4096 context: ~2 GB KV-cache per sequence. With paged KV-cache, 80 GB H100 can serve ~30 concurrent sequences. Understanding this math is essential.
- Engine build trade-offs —
max_batch_size,max_input_len,max_seq_lenare baked into the engine. Larger = more memory reserved, fewer concurrent engines. Size for your actual workload.
3. Measurable outcomes¶
- Latency — p50, p99; what to measure (single request, batch). For LLMs: time-to-first-token (TTFT) and inter-token latency (ITL).
- Throughput — QPS, tokens/s; how batch size and concurrency affect it. For LLMs: output tokens/s across all concurrent requests.
- Memory footprint — Peak GPU/system memory; impact of batching and precision. For LLMs: model weights + KV-cache + activation memory.
- Methodology — Reproducible benchmarks; A/B comparison (e.g. before/after kernel change, or TensorRT-LLM vs vLLM).
Resources¶
- TensorRT Best Practices
- TensorRT-LLM GitHub — Source, examples, model support matrix.
- TensorRT-LLM Documentation — Build, deploy, and optimize guides.
- vLLM — Alternative LLM serving engine for comparison.
- Triton Inference Server — Production serving with TRT-LLM backend.
- MLPerf Inference — Reference benchmarks and methodology.
Projects¶
- Runtime comparison — Deploy the same model with ONNX Runtime and TensorRT (same hardware). Compare latency and throughput; document configuration and measurement method.
- Triton server — Set up a minimal Triton server with dynamic batching. Measure QPS vs batch size and document how batching affects latency and throughput.
- Benchmark report — For one model and one runtime, produce a one-page benchmark report: latency (p50/p99), throughput, memory, and exact environment (GPU, driver, runtime version).
- TensorRT-LLM engine build — Build a TensorRT-LLM engine for Llama-3-8B with FP16 and INT8 quantization. Measure tokens/s, TTFT, and memory usage. Compare with vLLM on the same hardware.
- Multi-GPU LLM serving — Deploy a 70B model across 2+ GPUs with tensor parallelism using TensorRT-LLM. Measure scaling efficiency (tokens/s per GPU) vs single-GPU with a smaller model.
- TRT-LLM + Triton — Deploy TensorRT-LLM engine behind Triton Inference Server with in-flight batching enabled. Load test with concurrent clients and measure p50/p99 latency under load.
Next¶
→ 06 — tinygrad Deep Dive (optional) — Hands-on compiler/kernel interface: IR, scheduler, backends, and adding a simple optimization.