01 — Graph and Operator Optimization

Order: First (foundation). You need to know what you're optimizing before writing kernels.

Role target: Step 2 — DL Inference Optimization Engineer · MTS Kernels

Before this unit: Read Basic concepts in the DL Inference Optimization guide (LLM inference, TensorRT-LLM/vLLM, distributed training, KV-cache, and why new hardware changes kernel design).

Why this comes first

Before writing or tuning kernels, you must:

Understand the graph — which ops run, in what order, and how they connect.
Find bottlenecks — which ops or layers are compute-bound vs memory-bound.
Know fusion opportunities — which op chains can become a single kernel (e.g. Conv–BN–ReLU).

This unit gives you the graph/operator view and profiling skills that every kernel engineer uses daily.

1. Graph-level optimizations

Constant folding — Evaluate subgraphs whose inputs are all constants at build time instead of at every inference (e.g. shape computations, arithmetic on fixed weights).
Dead code elimination — Remove ops whose outputs are never used.
Common subexpression elimination (CSE) — Compute an identical subexpression once and reuse the result instead of recomputing it.
Operator fusion — Combine multiple ops into one kernel:
- Conv–BN–ReLU, Linear–Activation, Attention (Q/K/V + softmax + matmul).
- Reduces memory traffic and kernel launch overhead.
Layout and shape transformations — NCHW vs NHWC, transpose folding, reshape/expand for hardware-friendly layouts.
Framework graph formats — ONNX, TorchScript, TensorFlow SavedModel; how optimization passes are applied in each.
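The first three passes above can be sketched on a toy graph. This is a minimal illustration in plain Python, not how ONNX, TorchScript, or TensorFlow implement their passes: the `Node` class, the op names, and the `optimize` helper are all hypothetical, and the graph is assumed to be stored in topological order.

```python
from dataclasses import dataclass
from typing import Optional
import operator


@dataclass(frozen=True)
class Node:
    op: str                        # "input", "const", "add", or "mul" (toy op set)
    inputs: tuple = ()             # names of the producer nodes
    value: Optional[float] = None  # only set for "const" nodes


OPS = {"add": operator.add, "mul": operator.mul}


def constant_fold(graph):
    """Replace ops whose inputs are all constants with one const node."""
    out = {}
    for name, node in graph.items():  # relies on topological order
        if node.op in OPS and all(out[i].op == "const" for i in node.inputs):
            args = [out[i].value for i in node.inputs]
            node = Node("const", (), OPS[node.op](*args))
        out[name] = node
    return out


def eliminate_common_subexpressions(graph, outputs):
    """Merge nodes that apply the same op to the same inputs."""
    seen = {}   # (op, inputs, value) -> first node that computes it
    alias = {}  # duplicate node name -> surviving node name
    out = {}
    for name, node in graph.items():
        inputs = tuple(alias.get(i, i) for i in node.inputs)
        key = (node.op, inputs, node.value)
        if node.op != "input":       # distinct inputs must never be merged
            if key in seen:
                alias[name] = seen[key]
                continue
            seen[key] = name
        out[name] = Node(node.op, inputs, node.value)
    return out, [alias.get(o, o) for o in outputs]


def eliminate_dead_code(graph, outputs):
    """Keep only the nodes reachable from the graph outputs."""
    live, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name not in live:
            live.add(name)
            stack.extend(graph[name].inputs)
    return {n: node for n, node in graph.items() if n in live}


def optimize(graph, outputs):
    graph = constant_fold(graph)
    graph, outputs = eliminate_common_subexpressions(graph, outputs)
    return eliminate_dead_code(graph, outputs), outputs


graph = {
    "x": Node("input"),
    "two": Node("const", value=2.0),
    "three": Node("const", value=3.0),
    "six": Node("mul", ("two", "three")),  # foldable: 2 * 3
    "a": Node("mul", ("x", "six")),
    "b": Node("mul", ("x", "six")),        # duplicate of "a"
    "unused": Node("add", ("x", "two")),   # never reaches the output
    "y": Node("add", ("a", "b")),
}

optimized, final_outputs = optimize(graph, ["y"])
print(list(optimized))  # ['x', 'six', 'a', 'y']
```

Eight nodes shrink to four: `six` becomes a constant, `b` is merged into `a`, and `two`, `three`, and `unused` are dropped once nothing live reads them. The order of the passes matters here, because folding and CSE are what make those nodes dead. This sketch does not treat `add(x, c)` and `add(c, x)` as the same expression; a real pass would typically canonicalize commutative ops first.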

Concepts to internalize: A single "layer" in a model often becomes many ops in the graph; fusion turns them back into fewer, faster kernels.
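To make the fusion idea concrete, here is a sketch of folding an inference-mode batch norm into the preceding layer and then applying ReLU in the same pass. It makes two assumptions: a small dense layer stands in for the convolution (the per-output-channel algebra is the same), and the batch-norm statistics are fixed, as they are at inference time. All function names and numbers are illustrative.

```python
import math


def linear(x, weight, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]


def relu(y):
    return [max(0.0, v) for v in y]


def fold_batch_norm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the layer: W' = W * s, b' = (b - mean) * s + beta,
    where s = gamma / sqrt(var + eps) is computed per output channel."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    new_weight = [[w * s for w in row] for row, s in zip(weight, scale)]
    new_bias = [(b - m) * s + be
                for b, m, s, be in zip(bias, mean, scale, beta)]
    return new_weight, new_bias


def fused_linear_relu(x, weight, bias):
    """One pass over the outputs; no intermediate tensors are materialized."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


x = [1.0, -2.0, 0.5]
weight = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
bias = [0.05, -0.2]
gamma, beta = [1.5, 0.8], [0.1, -0.3]
mean, var = [0.2, -0.1], [0.9, 1.6]

# Three separate ops, two intermediate results.
reference = relu(batch_norm(linear(x, weight, bias), gamma, beta, mean, var))

# BN folded into the weights at build time, ReLU applied in the same pass.
folded_weight, folded_bias = fold_batch_norm(weight, bias, gamma, beta, mean, var)
fused = fused_linear_relu(x, folded_weight, folded_bias)

print(reference, fused)
```

The two results agree up to floating-point rounding. The folding happens once, at build time, so the fused version does less work at run time: the batch-norm arithmetic disappears and the ReLU no longer needs its own read and write of the activation.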

2. Operator-level optimization

Kernel selection and dispatch — How runtimes choose an implementation for each op: cuBLAS, cuDNN, oneDNN, or custom kernels. This includes algorithm selection (e.g. which convolution algorithm to use), driven by heuristics or by timing the candidates.
Memory planning — Buffer allocation, in-place ops where safe (same buffer for input/output), reducing peak memory.
Batching and dynamic batching — Batching requests for inference servers; trade-offs between latency and throughput.
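The memory-planning item can be illustrated with a greedy buffer-reuse planner driven by tensor lifetimes. This is a sketch under simplifying assumptions (one tensor per buffer at a time, lifetimes known ahead of time, no alignment or fragmentation); the `Tensor` class and `plan_buffers` function are hypothetical, and production planners use more sophisticated strategies.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size: int   # bytes
    start: int  # index of the op that produces the tensor
    end: int    # index of the last op that reads it


def plan_buffers(tensors):
    """Assign each tensor to a buffer, reusing buffers whose tenant is dead."""
    buffers = []     # each entry: {"size": bytes, "free_at": last reader index}
    assignment = {}
    for t in sorted(tensors, key=lambda t: t.start):
        # A buffer is reusable only if its last reader ran strictly before
        # the op that produces t. Equality would be the in-place case, which
        # is safe only for some ops, so this sketch conservatively skips it.
        candidates = [i for i, b in enumerate(buffers)
                      if b["free_at"] < t.start and b["size"] >= t.size]
        if candidates:
            idx = min(candidates, key=lambda i: buffers[i]["size"])  # best fit
        else:
            buffers.append({"size": t.size, "free_at": -1})
            idx = len(buffers) - 1
        buffers[idx]["free_at"] = t.end
        assignment[t.name] = idx
    return assignment, sum(b["size"] for b in buffers)


# Activations of a four-op chain: each one is read only by the next op.
activations = [
    Tensor("act0", 400, start=0, end=1),
    Tensor("act1", 400, start=1, end=2),
    Tensor("act2", 200, start=2, end=3),
    Tensor("act3", 400, start=3, end=4),
]

assignment, planned_bytes = plan_buffers(activations)
naive_bytes = sum(t.size for t in activations)
print(assignment, planned_bytes, naive_bytes)
```

For this chain the planner ping-pongs between two buffers, so the planned footprint is 800 bytes against 1400 bytes when every tensor gets its own allocation. Allowing the in-place case for ops where it is safe would reduce this further.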

3. Profiling and bottleneck identification

Tools:
- Nsight Systems — Timeline view: kernel launches, memory copies, CPU–GPU overlap.
- Nsight Compute — Per-kernel: occupancy, memory throughput, compute utilization.
- PyTorch profiler — Op-level and kernel-level timing in Python.
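- ONNX Runtime profiler — Per-operator timing, broken down by execution provider.
Roofline-style analysis — For each major op/layer, estimate its arithmetic intensity (FLOPs per byte moved) and compare it against the hardware roofline to decide whether the op is compute-bound or memory-bound.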
End-to-end latency breakdown — Data loading → preprocess → inference (per layer) → postprocess. Where does time go?
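An end-to-end breakdown can start as simply as timing each stage with a context manager. In the sketch below the four stage functions are trivial placeholders for a real pipeline, and the `stage` helper is hypothetical. One caveat that matters on GPUs: kernel launches are asynchronous, so a wall-clock timer only measures GPU work if the device is synchronized before the clock is read.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def stage(name):
    """Accumulate wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


# Placeholder stages standing in for a real pipeline.
def load():
    return list(range(50_000))


def preprocess(data):
    return [v / 255.0 for v in data]


def infer(data):
    return sum(v * v for v in data)


def postprocess(score):
    return round(score, 3)


with stage("data load"):
    raw = load()
with stage("preprocess"):
    batch = preprocess(raw)
with stage("inference"):
    score = infer(batch)
with stage("postprocess"):
    result = postprocess(score)

total = sum(timings.values())
shares = {name: 100.0 * seconds / total for name, seconds in timings.items()}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12} {seconds * 1e3:8.3f} ms  {shares[name]:5.1f}%")
```

Sorting by time puts the largest segment first, which is the one to investigate before touching any kernel. Once that segment is the model itself, a profiler gives the per-layer view that a coarse timer cannot.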

Goal: From a single model run, you should be able to name the top 3–5 bottlenecks and say whether each one is compute-bound or memory-bound.
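The compute-bound versus memory-bound call can be made with a few lines of arithmetic. The sketch below assumes FP16 operands (2 bytes per element), counts each operand as moved to or from memory exactly once, and uses round placeholder hardware numbers (100 TFLOP/s, 1 TB/s) that do not describe any specific GPU. The function names are illustrative.

```python
PEAK_FLOPS = 100e12     # placeholder peak compute, FLOP/s (not a real GPU)
PEAK_BANDWIDTH = 1e12   # placeholder peak memory bandwidth, bytes/s


def matmul_cost(m, n, k, bytes_per_elem=2):
    """FLOPs and minimum memory traffic for C[m,n] = A[m,k] @ B[k,n]."""
    flops = 2 * m * n * k                                # one multiply + one add
    traffic = bytes_per_elem * (m * k + k * n + m * n)   # read A and B, write C
    return flops, traffic


def roofline(flops, traffic, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BANDWIDTH):
    intensity = flops / traffic                     # FLOPs per byte moved
    attainable = min(peak_flops, intensity * peak_bw)
    ridge = peak_flops / peak_bw                    # intensity where the roofs meet
    bound = "compute" if intensity >= ridge else "memory"
    return intensity, attainable, bound


# A large square matmul versus a batch-1 matrix-vector product.
large_gemm = roofline(*matmul_cost(4096, 4096, 4096))
gemv = roofline(*matmul_cost(1, 4096, 4096))

for label, (intensity, attainable, bound) in [("4096^3 GEMM", large_gemm),
                                              ("1x4096x4096 GEMV", gemv)]:
    print(f"{label:<18} {intensity:9.2f} FLOP/B  "
          f"{attainable / 1e12:7.2f} TFLOP/s attainable  {bound}-bound")
```

With these placeholder peaks the ridge point is 100 FLOP/byte. The square matmul sits far above it (about 1365 FLOP/byte) and is compute-bound, while the matrix-vector product sits near 1 FLOP/byte and can reach only about 1% of peak compute no matter how well the kernel is written. Measured traffic from a profiler is usually higher than this lower bound, which pushes real ops further toward the memory roof.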

Resources

TensorRT Developer Guide — Graph optimization and layer fusion.
ONNX Runtime Performance Tuning — Graph and execution provider tuning.
PyTorch Profiler — Profiling PyTorch models.
NVIDIA Nsight Systems / Compute — GPU profiling.

Projects

Fuse and measure — Take a ResNet-style model (or a small transformer). Fuse Conv–BN–ReLU (or the equivalent Linear–Activation chain in a transformer) in ONNX or TorchScript, or use a framework that does it for you. Measure latency before and after, and document the speedup.
Profile and report — Profile a transformer block (attention + FFN) with Nsight Systems and PyTorch profiler. Identify the top 3 bottlenecks; for each, state whether it is compute-bound or memory-bound and why.
End-to-end breakdown — For one inference pipeline (e.g. image → model → result), break down time into: data load, preprocess, each major graph region, postprocess. Draw a simple timeline and note the largest segment.
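For the before/after measurements in these projects, a small harness with warmup iterations and a median is usually more trustworthy than a single timed run. The sketch below is one possible shape for it: `benchmark` is a hypothetical helper, and the two workloads are pure-Python stand-ins for an unfused and a fused op chain, not real model code. When timing GPU work, synchronize the device inside the timed region.

```python
import statistics
import time


def benchmark(fn, *args, warmup=5, iters=30):
    """Median wall-clock seconds per call, after discarding warmup runs."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def unfused(xs):
    """Three passes over the data, two intermediate lists."""
    scaled = [2.0 * x for x in xs]
    shifted = [s + 1.0 for s in scaled]
    return [max(0.0, s) for s in shifted]


def fused(xs):
    """Same arithmetic in one pass, no intermediates."""
    return [max(0.0, 2.0 * x + 1.0) for x in xs]


data = [i / 1000.0 - 5.0 for i in range(10_000)]

# A fusion must not change the result, so check that before timing anything.
assert unfused(data) == fused(data)

before = benchmark(unfused, data)
after = benchmark(fused, data)
speedup = before / after
print(f"before {before * 1e3:.3f} ms, after {after * 1e3:.3f} ms, "
      f"speedup {speedup:.2f}x")
```

Report the median together with the iteration count and the hardware used; the speedup alone is hard to interpret without them.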

Next

→ 02 — Kernel Engineering — Design and write the high-performance kernels that implement these ops (Triton, CUTLASS, Flash-Attention).