Phase 4 — Track C: ML Compiler & Graph Optimization (6–12 months)¶
The bridge between AI models and hardware — learn how compilers lower neural-network graphs to efficient, hardware-specific code, then apply that knowledge to build and optimize real inference pipelines.
Prerequisites: Phase 1 §4 (C++ and Parallel Computing), Phase 3 (Neural Networks). Recommended: Phase 1 §3 (Operating Systems — memory, processes).
Layer mapping: Primarily Layer 2 (Compiler & Graph Optimization) of the AI chip stack, with connections into Layer 1 (framework graphs) and Layer 3 (runtime that executes compiled artifacts).
Role targets: AI Compiler Engineer · DL Graph Optimization Engineer · ML Compiler Backend Engineer · DL Inference Optimization Engineer · MTS Kernels
Why a dedicated compiler track¶
Tracks A (FPGA) and B (Jetson) teach you to deploy on specific hardware. This track teaches you how models become hardware instructions — the compiler stack that sits between a PyTorch/ONNX graph and the kernel code running on any accelerator. Whether you target GPU, FPGA, or a custom NPU, you need to understand IR design, graph optimization, scheduling, and code generation. This is also the fastest-growing hiring area in AI infrastructure.
Track structure¶
This track has two parts. Part 1 covers compiler fundamentals (IR, graph optimization, LLVM, MLIR, compilation pipelines, fusion, custom backends). Part 2 applies these concepts to real DL inference workloads (profiling, kernel engineering, quantization, runtimes, tinygrad deep dive). Together they take you from theory to production.
| Part | Focus | Sections |
|---|---|---|
| Part 1 — Compiler Fundamentals | How compilers work, from IR to hardware code | §1–§7 below |
| Part 2 — DL Inference Optimization | Applying compiler + kernel skills to real inference | DL Inference Optimization → (6 units) |
Recommended order: Work through Part 1 first (or at least §1–§2 and §5), then Part 2 in order. Part 2 units 01–03 reinforce and deepen Part 1 concepts with GPU-specific practice; units 04–06 add quantization, deployment, and hands-on tinygrad.
Part 1 — Compiler Fundamentals¶
1. Graph Representation & Intermediate Representation (IR)¶
-
Computational graphs:
- How PyTorch, TensorFlow, and ONNX represent models as directed acyclic graphs (DAGs).
- Tracing vs scripting vs export:
torch.export,torch.compile, ONNX export. - Graph-level metadata: shapes, dtypes, memory layout (NCHW vs NHWC).
-
Intermediate representations:
- Graph IR vs linearized IR — Graph: nodes = ops, edges = tensors. Linearized: list of ops in execution order (tinygrad's approach).
- SSA (Single Static Assignment) form — Each value defined once; enables clean alias and memory analysis.
- ONNX as interchange IR — Opsets, shape inference, version converters.
-
tinygrad IR study:
- Trace a model from
Tensorops toLazyBufferto linearized ops to generated code. - Understand how tinygrad's IR differs from graph-based IRs (TorchFX, ONNX).
- Trace a model from
Projects:
* Export a CNN (ResNet-18) to ONNX. Visualize the graph with Netron. Identify redundant ops.
* Trace a matmul through tinygrad: Tensor → LazyBuffer → scheduled ops → generated CUDA kernel. Document every IR boundary.
2. Graph Optimization Passes¶
-
Algebraic simplifications:
- Constant folding, dead code elimination, common sub-expression elimination (CSE).
- Strength reduction (e.g., replace
x / 2withx * 0.5, replacepow(x, 2)withx * x).
-
Operator fusion:
- Why fusion matters: reduces memory traffic and kernel launch overhead.
- Vertical fusion: Conv → BatchNorm → ReLU merged into a single kernel.
- Horizontal fusion: Independent ops run in one kernel to saturate compute.
- Fusion in practice: TensorRT's layer fusion, tinygrad's BEAM-based fusion, XLA's fusion heuristics.
-
Layout and memory optimizations:
- Data layout transformations (NCHW ↔ NHWC) to match hardware preference.
- In-place operation detection via alias analysis.
- Memory planning: operator-level liveness analysis → buffer reuse → reduced peak memory.
-
Quantization as a graph pass:
- Inserting quantize/dequantize nodes (PTQ).
- Folding BN into Conv weights before quantization.
- Calibration: collecting activation ranges to set scale/zero-point.
Projects:
* Implement a Conv+BN+ReLU fusion pass on an ONNX graph using onnx + onnxruntime Python APIs. Measure kernel count reduction.
* Write a memory planning pass: given an op schedule, compute minimum buffer allocation using liveness intervals.
3. LLVM Fundamentals for AI Compilers¶
-
Three-phase compiler design:
- Frontend → Optimizer → Backend. Why this modularity matters for AI targets.
- LLVM IR: types, instructions, SSA, basic blocks, functions.
-
LLVM IR in depth:
- Address spaces (important for GPU/accelerator memory models).
- Intrinsics: how hardware-specific operations are exposed in IR.
- Metadata and debug info.
-
Optimization passes:
- Canonicalization, loop unrolling, vectorization, dead store elimination.
- Writing a custom LLVM pass (C++): register a pass, walk the IR, transform.
-
Backend and code generation:
- Instruction selection via SelectionDAG / GlobalISel.
- TableGen: declaratively describing target instructions.
- Register allocation, instruction scheduling.
- How GPU targets (NVPTX, AMDGPU) use LLVM.
Resources: * LLVM Language Reference * LLVM Programmer's Manual * Writing an LLVM Pass
Projects:
* Write and compile a small function to LLVM IR (clang -emit-llvm). Read the IR, identify SSA values, trace through opt passes.
* Write a custom LLVM pass that counts floating-point multiply-accumulate (FMA) operations in a function — a proxy for FLOPS estimation.
4. MLIR: Multi-Level Intermediate Representation¶
-
Why MLIR:
- LLVM has one IR level; AI compilers need many. MLIR provides a framework for building dialects at every abstraction level.
- Progressive lowering: high-level tensor ops → loop nests → vector instructions → hardware code.
-
Core MLIR concepts:
- Dialects:
tensor,linalg,memref,affine,vector,scf,gpu,llvm. - Operations, regions, blocks — the MLIR data model.
- Types and attributes — extensible type system.
- Dialects:
-
Key dialects for AI:
linalg— Named ops (conv, matmul, pooling) and generic ops on tensors. Tiling, fusion, promotion.affine— Polyhedral loop analysis. Dependence analysis, loop interchange, tiling.tensor/memref— Bufferization: converting value-semantics tensors to in-memory buffers.vector— Target-independent SIMD representation.gpu— GPU kernel launch, thread/block mapping.
-
Progressive lowering walkthrough:
tosa(from ONNX/TF) →linalg→affine/scf→vector→gpuorllvm.- Bufferization pass: when and how tensors become memrefs.
-
Custom dialect development:
- Defining a custom NPU dialect with ODS (Operation Definition Specification).
- Writing lowering passes from
linalgto your custom dialect.
Resources: * MLIR Documentation * MLIR Tutorial: Creating a Dialect * Toy Tutorial — end-to-end custom language in MLIR.
Projects:
* Complete the MLIR Toy tutorial (chapters 1–7). Understand how to define ops, write lowering passes, and generate LLVM IR.
* Write a minimal NPU dialect with a single npu.matmul op. Lower linalg.matmul to npu.matmul with a tiling strategy.
5. ML-to-Hardware Compilation Pipelines¶
-
TVM:
- Relay (graph-level IR) → TIR (Tensor IR, loop-level) → target code.
- Schedule primitives:
tile,vectorize,parallel,unroll,reorder. - AutoTVM / AutoScheduler (Ansor): Search-based tuning for operator schedules.
- BYOC (Bring Your Own Codegen): Offloading subgraphs to custom accelerators.
-
tinygrad compiler:
- Scheduler: how ops are grouped into kernels.
- BEAM search: exploring fusion choices to minimize runtime.
- Backends: CUDA, OpenCL, Metal, LLVM, custom.
- Adding a new backend to tinygrad.
-
Production compilers:
- torch.compile + Inductor: TorchFX → Triton kernels. How
torch._inductorschedules and generates code. - XLA (Accelerated Linear Algebra): Used by JAX and TF. HLO IR → target code (TPU, GPU, CPU).
- IREE: MLIR-based compiler for heterogeneous deployment (CPU, GPU, DSP).
- Triton-MLIR: Python-level kernel writing compiled via MLIR pipeline.
- torch.compile + Inductor: TorchFX → Triton kernels. How
-
End-to-end flow comparison:
| Pipeline | Input | IR(s) | Scheduling | Target |
|---|---|---|---|---|
| TVM | ONNX/Relay | Relay → TIR | AutoTVM / Ansor | CUDA, LLVM, custom |
| tinygrad | tinygrad graph | Linearized ops | BEAM search | CUDA, OpenCL, Metal, LLVM |
| torch.compile | TorchFX | FX → Inductor IR | Triton codegen | CUDA (Triton), CPU |
| XLA | HLO | HLO → LLO | XLA scheduler | TPU, GPU, CPU |
| IREE | MLIR (tosa/linalg) | linalg → vector → spirv/llvm | Tile + distribute | CPU, GPU, DSP |
Projects:
* Compile a ResNet-18 with TVM for your GPU. Use AutoTVM to tune 3 key ops (conv2d, dense, batch_matmul). Compare latency before/after tuning.
* Study tinygrad BEAM: run with BEAM=3 on a small model. Compare kernel count and latency vs BEAM=0.
* Use torch.compile on a transformer block. Read the generated Triton kernel for the attention op. Annotate the tiling strategy.
6. Kernel Fusion & Tiling Strategies¶
-
Tiling for memory hierarchy:
- Why tile: make working sets fit in L1/L2/shared memory/SRAM scratchpad.
- Tile size selection: auto-tuning vs analytical models (roofline-guided).
- Multi-level tiling: thread-block tiles → warp tiles → register tiles (GPU); array tiles → PE tiles (accelerator).
-
Fusion strategies:
- Producer-consumer fusion: Fuse ops where one's output is the other's input (Conv → ReLU).
- Parallel fusion: Fuse independent ops sharing inputs to reduce reads.
- Reduction fusion: Fuse element-wise ops with reductions (softmax components).
- Fusion legality: when aliasing or data dependencies prevent fusion.
-
Dataflow scheduling:
- Weight-stationary, output-stationary, row-stationary, no-local-reuse.
- How the compiler maps these strategies to hardware tiling parameters.
- Connection to Layer 5 (hardware architecture): the compiler must know the hardware's dataflow.
Projects: * Implement a 2-level tiled matmul in C++ (outer tiles for L2, inner tiles for L1). Benchmark vs naive and compare with vendor BLAS. * In tinygrad or TVM, modify a fusion heuristic. Measure the impact on a real model (kernel count, total latency, memory traffic).
7. Custom Backend Development¶
-
What is a custom backend:
- The code generator that takes compiler IR and produces instructions for your specific hardware (FPGA, NPU, custom ASIC).
-
TVM BYOC path:
- Annotate subgraph → partition → custom codegen function → emit code or runtime calls.
- Build a minimal BYOC backend for a simulated accelerator.
-
tinygrad custom backend:
- Implement
RuntimeandCompilerclasses for a new target. - Map linearized ops to hardware instructions.
- Implement
-
MLIR custom lowering:
- Define target dialect → write lowering pass from
linalg→ emit target-specific code. - Integration with LLVM backend or standalone code emitter.
- Define target dialect → write lowering pass from
-
Connection to other tracks:
- Track A (FPGA): your compiler backend generates HLS directives or RTL control sequences.
- Track B (Jetson): your compiler backend targets CUDA, DLA, or TensorRT.
Projects:
* Build a minimal TVM BYOC backend that offloads nn.dense to a Python-simulated accelerator. Verify correctness.
* Add a simple tinygrad backend for a simulated 4×4 systolic array (compute matmul tiles, verify output).
* (Advanced) Write an MLIR lowering pass from linalg.matmul to a custom dialect that emits tiled DMA + compute commands.
Part 2 — DL Inference Optimization¶
Apply compiler and kernel skills to real inference workloads: profile, write kernels, quantize, and deploy.
This part was previously a Phase 5 specialization track. It now lives here because the skills (profiling, kernel authoring, quantization, runtime integration) are the direct application of Part 1's compiler fundamentals — and they're needed before the advanced Phase 5 specializations (HPC infrastructure, AI chip design).
Role target: Step 2 — DL Inference Optimization Engineer · MTS Kernels (Member of Technical Staff, Kernels)
Prerequisites for Part 2: Part 1 (at least §1–§2 and §5), Phase 4 Track B (Jetson, TensorRT, CUDA).
Study the units in order. Each folder is one unit; do them one by one.
| Order | Unit | What you learn | Guide |
|---|---|---|---|
| 1 | Graph & Operator Optimization | What we're optimizing: graph, ops, fusion, profiling, bottlenecks. Reinforces Part 1 §2 with GPU-specific profiling. | 01 → |
| 2 | Kernel Engineering | How to write and own kernels: Triton, CUTLASS/CuTe, Flash-Attention, long-context, NCCL. Core of MTS Kernels. | 02 → |
| 3 | Compiler Stack | How compilers produce kernels: IR, scheduling (BEAM), codegen, TVM, MLIR. Reinforces Part 1 §3–§5 with hands-on projects. | 03 → |
| 4 | Quantization | Low-precision inference: PTQ, QAT, INT8/INT4, kernel and runtime integration. | 04 → |
| 5 | Inference Runtimes & Deployment | Production: TensorRT, ONNX Runtime, Triton server; latency, throughput, methodology. | 05 → |
| 6 | tinygrad Deep Dive | Optional: hands-on IR, scheduler, backends; compiler-kernel interface. | 06 → |
Basic concepts (read before Part 2)¶
Before diving into the units, you need the vocabulary and mental model of modern LLM inference. Key concepts:
- In-flight batching — Batch requests as they arrive; improves throughput without killing latency.
- Paged KV-cache — Attention needs key/value cache per token; paging and reuse make it memory-efficient.
- Speculative decoding — Draft multiple tokens with a small model, verify with the big model; fewer forward passes.
- Throughput vs latency — Batch more → higher throughput, worse latency. Tune batching and scheduling to the SLA.
- CuTe DSL — C++ header library inside CUTLASS that defines layouts and copy operations for tiling, used in production GEMM/attention kernels.
- New architectures — Every GPU generation (Hopper, Blackwell) changes execution model, memory hierarchy, and instruction throughput. Old tile sizes become suboptimal.
See the DL Inference Optimization overview for the full vocabulary (LLM inference, distributed inference, production systems).
Relationship to Other Tracks¶
| This track (C) provides | Track A (FPGA) uses it for | Track B (Jetson) uses it for |
|---|---|---|
| Graph optimization passes | Mapping ONNX → HLS-friendly subgraphs | TensorRT graph optimization understanding |
| MLIR / TVM compilation | Vitis AI / FINN compilation flow | torch.compile + Inductor on GPU |
| Custom backend development | FPGA backend in TVM or tinygrad | DLA/TensorRT backend integration |
| Tiling & dataflow scheduling | HLS pragma-driven tiling | CUDA kernel tiling strategies |
| IR & SSA fundamentals | Understanding Vivado synthesis IR | Understanding NVPTX code generation |
| Kernel engineering (Part 2) | — | Triton/CUTLASS kernels for GPU inference |
| Quantization (Part 2) | Vitis AI quantizer understanding | TensorRT INT8/INT4 deployment |
| Inference runtimes (Part 2) | Vitis AI / FINN runtime | TensorRT, Triton server, DeepStream |
Build Summary¶
Part 1 — Compiler Fundamentals¶
| Module | Hands-on deliverable |
|---|---|
| §1 IR | ONNX graph analysis + tinygrad IR trace |
| §2 Graph opts | Conv+BN+ReLU fusion pass, memory planner |
| §3 LLVM | Custom LLVM pass (FMA counter) |
| §4 MLIR | Toy tutorial + minimal NPU dialect |
| §5 Pipelines | TVM AutoTVM tuning, BEAM comparison, torch.compile analysis |
| §6 Fusion/tiling | Tiled matmul, fusion heuristic modification |
| §7 Custom backend | TVM BYOC or tinygrad backend for simulated accelerator |
Part 2 — DL Inference Optimization¶
| Unit | Hands-on deliverable |
|---|---|
| 01 Graph & ops | Fusion + measure, profiling report, end-to-end breakdown |
| 02 Kernel engineering | Triton fused kernel, long-context attention, NCCL at scale |
| 03 Compiler stack | BEAM in tinygrad, fusion pass, lowering trace |
| 04 Quantization | INT8 with TensorRT, PTQ vs QAT comparison, kernel path trace |
| 05 Runtimes | Runtime comparison, Triton server setup, benchmark report |
| 06 tinygrad (optional) | Pipeline trace, add optimization, backend hook |