03 — Compiler Stack for Inference (IR, Scheduling, Codegen)

Order: Third. After graph/ops (01) and kernel authoring (02), you see how compilers generate and schedule kernels.

Role target: DL Inference Optimization Engineer · MTS Kernels (Member of Technical Staff, Kernels — roles focused on code generation, compiler–hardware mapping, and owning kernel/backend implementation at scale).

Why this comes third

Kernels (02) are either hand-written or compiler-generated. This unit covers how compilers represent the model (IR), decide fusion and placement (scheduling), and emit code (codegen). You need this to co-design with frameworks and to add or tune backends.

1. Intermediate representation (IR)

Graph IR vs linearized IR — Graph: nodes = ops, edges = tensors. Linearized: list of ops in execution order (e.g. tinygrad's linearized ops).
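SSA form — Static single assignment: each value is defined exactly once, so def-use chains are explicit. This simplifies dataflow analysis and gives memory and alias analysis a clean starting point.
Memory and alias analysis — Determines which buffers may overlap in memory, and therefore when fusion or an in-place update is safe.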
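The three items above can be sketched in a few lines of plain Python. This is a toy illustration, not tinygrad's or any other compiler's IR: the graph, the op names, and the helpers `linearize`, `last_uses`, and `can_run_in_place` are all hypothetical. It turns a small graph IR into a linearized, SSA-style op list, then uses last-use information as a necessary (not sufficient) condition for overwriting an input buffer in place.

```python
from graphlib import TopologicalSorter

# Toy graph IR (hypothetical): node name -> (op, names of input nodes).
graph = {
    "x": ("load", []),
    "w": ("load", []),
    "mm": ("matmul", ["x", "w"]),
    "act": ("relu", ["mm"]),
    "out": ("add", ["act", "x"]),
}

def linearize(graph):
    """Return (dest, op, srcs) tuples in a valid execution order, SSA style."""
    deps = {name: set(inputs) for name, (_, inputs) in graph.items()}
    ssa_name = {}
    program = []
    for i, node in enumerate(TopologicalSorter(deps).static_order()):
        op, inputs = graph[node]
        ssa_name[node] = f"%{i}"  # each value gets exactly one definition
        program.append((ssa_name[node], op, [ssa_name[s] for s in inputs]))
    return program

def last_uses(program):
    """Map each SSA value to the index of the last instruction that reads it."""
    last = {}
    for idx, (_, _, srcs) in enumerate(program):
        for src in srcs:
            last[src] = idx
    return last

def can_run_in_place(program, idx, src):
    """True if instruction idx is the final reader of src.

    This is only the liveness part of the check: a real compiler would also
    require matching shapes/dtypes and an op that reads each element before
    writing it (elementwise ops qualify, matmul does not).
    """
    return last_uses(program).get(src) == idx

for dest, op, srcs in linearize(graph):
    print(dest, "=", op, ", ".join(srcs))
```

In this graph the relu is the last reader of the matmul result, so it passes the liveness check, while the first input `x` is read again by the final add and must stay intact until then.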

Takeaway: The IR is the contract between "model graph" and "kernel backend." Your kernels are targets for lowering from this IR.

2. Scheduling and lowering

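tinygrad — The scheduler groups ops into kernels (this is where fusion is decided); BEAM search then explores per-kernel optimization choices such as tiling, unrolling, and use of local memory. Compare one op per kernel against fused kernels, and look at which options BEAM explores for each kernel.
TVM — TIR (Tensor IR) for loop-level programs; AutoTVM and AutoScheduler search for schedules that map them to hardware. Schedule primitives include tile, vectorize, and parallel.
MLIR — linalg/tensor dialects; progressive lowering, e.g. linalg → loops (scf/affine) → vector → gpu, with the exact pipeline depending on the target. Follow how high-level ops become loops and then GPU kernels.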
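The search-based tools above share one idea: enumerate schedule choices, score them, and keep the best few. Below is a minimal beam-search sketch under stated assumptions. The schedule state `(tile_m, tile_n, unroll)`, the `ACTIONS` table, and the analytic `toy_cost` are invented for illustration; real systems usually score candidates by compiling and timing them or by using a learned cost model, and none of this mirrors a specific library's API.

```python
def toy_cost(state):
    """Hypothetical cost model, lower is better. Not a real hardware model."""
    tile_m, tile_n, unroll = state
    if tile_m * tile_n > 4096:  # pretend the tile no longer fits in fast memory
        return float("inf")
    memory_traffic = 1.0 / tile_m + 1.0 / tile_n  # bigger tiles reuse more data
    loop_overhead = 0.1 / unroll                  # unrolling amortizes branches
    register_pressure = 0.01 * unroll             # but costs registers
    return memory_traffic + loop_overhead + register_pressure

# Each action rewrites the schedule state; the names are illustrative only.
ACTIONS = {
    "tile_m*2": lambda s: (s[0] * 2, s[1], s[2]),
    "tile_n*2": lambda s: (s[0], s[1] * 2, s[2]),
    "unroll*2": lambda s: (s[0], s[1], s[2] * 2),
}

def beam_search(start, cost, width=2, max_steps=32):
    """Keep the `width` cheapest schedules per step; stop when nothing improves."""
    best = (cost(start), start, [])
    beam = [best]
    for _ in range(max_steps):
        candidates = {}
        for _, state, history in beam:
            for name, apply in ACTIONS.items():
                nxt = apply(state)
                c = cost(nxt)
                if c != float("inf") and nxt not in candidates:
                    candidates[nxt] = (c, nxt, history + [name])
        if not candidates:
            break
        beam = sorted(candidates.values(), key=lambda t: t[0])[:width]
        if beam[0][0] >= best[0]:
            break  # the best candidate of this step did not improve
        best = beam[0]
    return best

cost, state, history = beam_search((1, 1, 1), toy_cost, width=2)
print("schedule:", state, "cost:", round(cost, 4), "steps:", len(history))
```

A wider beam explores more alternatives per step at a higher search cost, which is the same trade-off you tune when raising the beam width in a real autotuner.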

Takeaway: Scheduling decides which kernels run (fused or not) and how they're tiled/parallelized; codegen then emits CUDA/LLVM/etc.
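To make "how they're tiled" concrete, here is a small pure-Python sketch of what a tile primitive does to a matmul loop nest. The function names and the `tile` parameter are illustrative, not taken from TVM or tinygrad. The point is that a schedule changes the loop structure and memory access order, not the result.

```python
def matmul_naive(A, B):
    """Reference loop nest: one pass over i, j, k."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, tile=2):
    """Same computation with each loop split into an outer (tile) and inner loop."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tile):          # outer loops walk over tiles
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                # inner loops stay within one tile; min() handles ragged edges
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

A = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
B = [[1, 0, 2, 1], [0, 1, 3, 1], [2, 2, 0, 1], [1, 3, 1, 0], [4, 0, 1, 2]]
print(matmul_tiled(A, B, tile=2) == matmul_naive(A, B))
```

On a GPU the outer tile loops would typically map to thread blocks and the inner loops to threads or vector lanes; that mapping is the part a scheduler or autotuner searches over.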

3. Code generation

Backend codegen — From IR/schedule to CUDA, OpenCL, LLVM, or custom target. Role of codegen in Triton, TVM, tinygrad.
Kernel fusion and tile selection — How the compiler chooses tile sizes and fusion sets for GPUs and accelerators.
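At its simplest, backend codegen is a function from a scheduled op list to target source text. The sketch below emits CUDA C for a fused chain of unary elementwise ops. It is a toy under stated assumptions: the op vocabulary, the `TEMPLATES` table, and `emit_fused_elementwise` are hypothetical and do not reflect how Triton, TVM, or tinygrad structure their backends. It only generates a string; compiling and launching the kernel is out of scope here.

```python
# Hypothetical op vocabulary: each op becomes one CUDA C statement on `v`.
TEMPLATES = {
    "relu": "v = fmaxf(v, 0.0f);",
    "neg": "v = -v;",
    "exp": "v = expf(v);",
    "scale2": "v = v * 2.0f;",
}

def emit_fused_elementwise(name, ops):
    """Emit CUDA C source for one kernel that applies `ops` in order.

    Fusion shows up as a single load and a single store around the whole
    chain, instead of one kernel (and one round trip to memory) per op.
    Unknown op names raise KeyError.
    """
    body = "\n".join(f"    {TEMPLATES[op]}" for op in ops)
    return (
        f'extern "C" __global__ void {name}(const float* in, float* out, int n) {{\n'
        "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "  if (i < n) {\n"
        "    float v = in[i];\n"
        f"{body}\n"
        "    out[i] = v;\n"
        "  }\n"
        "}\n"
    )

print(emit_fused_elementwise("fused_exp_relu", ["exp", "relu"]))
```

The launch configuration (block size, grid size) is deliberately left out: in a real stack it comes from the schedule, which is why tile selection and codegen are usually designed together.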

Resources

tinygrad — IR, scheduler, BEAM, backends.
TVM Documentation — TIR, AutoTVM, BYOC.
MLIR Tutorial — Dialects and lowering.

Projects

BEAM in tinygrad — Run tinygrad on a small model with BEAM search enabled (via the BEAM environment variable). Compare kernel count and runtime against the default heuristic optimizations; document which kernels changed and what BEAM chose for them.
Fusion pass — Implement a simple fusion pass (e.g. Conv+ReLU) in a graph you control (ONNX or tinygrad). Measure impact on kernel count and latency.
Trace lowering — Pick one op (e.g. matmul) in TVM or tinygrad and trace from high-level op to generated code. Document the lowering steps.
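As a possible starting point for the fusion-pass project, here is a minimal Conv+ReLU fusion over a toy graph you control. Everything in it is an assumption made for illustration: the `Node` class, the op names, and `fuse_conv_relu` are hypothetical and are not ONNX or tinygrad APIs. The pass fuses only when the conv result has exactly one consumer (the relu) and is not itself a graph output, which is the usual legality condition for this pattern.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str     # name of the value this node produces
    op: str       # e.g. "conv", "relu", "add"
    inputs: list  # names of consumed values (graph inputs are not nodes)

def fuse_conv_relu(nodes, graph_outputs=()):
    """Return a new node list with eligible conv -> relu pairs fused."""
    by_name = {n.name: n for n in nodes}
    consumers = {}
    for n in nodes:
        for src in n.inputs:
            consumers.setdefault(src, []).append(n.name)

    fused = {}  # conv name -> relu node that absorbs it
    for n in nodes:
        if n.op != "relu" or len(n.inputs) != 1:
            continue
        producer = by_name.get(n.inputs[0])
        if (producer is not None and producer.op == "conv"
                and consumers[producer.name] == [n.name]
                and producer.name not in graph_outputs):
            fused[producer.name] = n

    absorbed = {relu.name for relu in fused.values()}
    result = []
    for n in nodes:
        if n.name in fused:
            # The fused node keeps the relu's name so downstream inputs stay valid.
            result.append(Node(fused[n.name].name, "conv_relu", list(n.inputs)))
        elif n.name not in absorbed:
            result.append(n)
    return result

nodes = [
    Node("c1", "conv", ["x", "w1"]),
    Node("r1", "relu", ["c1"]),
    Node("c2", "conv", ["r1", "w2"]),
    Node("r2", "relu", ["c2"]),
    Node("s", "add", ["c2", "r1"]),  # second consumer of c2 blocks that fusion
]
for n in fuse_conv_relu(nodes, graph_outputs=["s"]):
    print(n.name, "=", n.op, n.inputs)
```

Node count before and after the pass is a first proxy for kernel count; measuring latency still requires running the fused graph on a backend that actually implements the fused op.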

Next

→ 04 — Quantization — Low-precision inference (PTQ, QAT) and how it affects kernels and deployment.