Module 2 — Deep Learning Frameworks

Parent: Phase 3 — Artificial Intelligence

Understand how software generates the workloads your hardware must run — from autograd to GPU kernels.

Prerequisites: Module 1 (Neural Networks — understand what a forward/backward pass computes).

Layer mapping: L1 (Application) — you use frameworks to build models. L2 (Compiler) — tinygrad exposes the compiler pipeline that Phase 4C teaches you to build.


Why a Dedicated Frameworks Module

Module 1 teaches you what neural networks compute. This module teaches you how — the software machinery that turns model(x) into GPU kernel launches. Understanding this machinery is essential because:

  • L2 (Compiler): You can't build an ML compiler without understanding what frameworks produce (computational graphs, ops, tensors)
  • L5 (Architecture): You can't design an accelerator without knowing which ops dominate real workloads
  • L6 (RTL): You can't build a PE array without understanding the precision and data flow of actual training/inference

Three-Framework Mental Model

Study these three frameworks in order. Each teaches a different level of the stack.

| Framework | What it teaches | Size | Your learning goal |
|-----------|-----------------|------|--------------------|
| micrograd | How autograd works — reverse-mode differentiation from scratch | ~100 lines | Build it yourself. Understand backprop at the code level. |
| PyTorch | Industry-standard API — tensors, modules, optimizers, data loading | Millions of lines | Use it fluently. Train models, export ONNX, profile with torch.profiler. |
| tinygrad | How a compiler turns tensor ops into GPU kernels — IR, scheduler, backends | ~10,000 lines | Read the source. Trace from Tensor to generated CUDA/OpenCL code. |
micrograd          PyTorch              tinygrad
(education)        (production)         (hackable production)
    │                  │                     │
    ▼                  ▼                     ▼
 Autograd          Full API             Compiler pipeline
 from scratch      industry standard    IR → scheduler → codegen
    │                  │                     │
    └──────────────────┴─────────────────────┘
              Understanding grows left → right

1. micrograd — Autograd from Scratch

micrograd by Andrej Karpathy. ~100 lines of Python. Implements:

  • A Value class that tracks computation history
  • Reverse-mode automatic differentiation (backpropagation)
  • A tiny neural network API (Neuron, Layer, MLP)

What you'll build:

from micrograd.engine import Value

# Forward pass
x = Value(2.0)
y = Value(3.0)
z = x * y + y ** 2  # z = 2*3 + 9 = 15

# Backward pass (autograd)
z.backward()
print(x.grad)  # dz/dx = y = 3.0
print(y.grad)  # dz/dy = x + 2y = 2 + 6 = 8.0
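
How does z.backward() know what to do? Each operation records its inputs and a closure that applies the chain rule; backward() then walks the graph in reverse topological order. A condensed sketch of the idea (simplified from micrograd, showing only multiplication):

# Condensed sketch of micrograd's core idea (only * is shown; the real
# Value also implements +, **, relu, and a few more ops)
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # pushes this node's grad to its inputs
        self._prev = set(_children)     # edges of the computation graph

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # chain rule: d(out)/d(self) = other.data, d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topologically sort the graph, then apply each node's chain rule
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()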

Why it matters for hardware: Every training accelerator must implement this backward pass. Understanding the computation graph and gradient flow tells you what memory access patterns and operations the hardware must support.

Project: Implement micrograd from scratch (don't copy — type it yourself). Train an MLP on a 2D classification dataset. Visualize the computation graph.
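
The project roughly takes this shape, assuming your from-scratch implementation mirrors micrograd's nn API (MLP, parameters()); the dataset and hyperparameters below are toy placeholders:

# Hypothetical training loop for the project, assuming an API that
# mirrors micrograd's nn module (MLP, parameters())
from micrograd.nn import MLP

model = MLP(2, [16, 16, 1])    # 2 inputs -> 2 hidden layers -> 1 output
xs = [[2.0, 3.0], [-1.0, -1.0], [3.0, -1.0], [0.5, 1.0]]  # toy 2D points
ys = [1.0, -1.0, -1.0, 1.0]    # binary labels in {-1, +1}

for step in range(100):
    preds = [model(x) for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys))  # MSE loss
    for p in model.parameters():
        p.grad = 0.0             # zero grads before every backward pass
    loss.backward()
    for p in model.parameters():
        p.data -= 0.05 * p.grad  # plain SGD step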


2. PyTorch — Industry Standard

PyTorch is the framework most models are written in. You need fluency here because:

  • Models you deploy on hardware are written in PyTorch
  • ONNX export comes from PyTorch (torch.onnx.export)
  • torch.compile (Inductor) is a production ML compiler
  • Profiling tools (torch.profiler, Nsight) show you where time is spent

Key concepts to master:

| Concept | Why it matters for hardware |
|---------|-----------------------------|
| torch.Tensor | The data structure accelerators process. Shape, dtype, layout (contiguous, channels-last). |
| nn.Module | How models are structured. Layers → forward pass → computational graph. |
| Autograd (loss.backward()) | Generates the backward graph that training hardware executes. |
| Data loading (DataLoader) | CPU-GPU pipeline. Bottleneck if not overlapped with compute. |
| torch.onnx.export() | How models leave PyTorch and enter the compiler/runtime stack (Phase 4C). |
| torch.compile() | PyTorch's built-in compiler (Inductor). Generates Triton kernels. Connection to Phase 4C. |
| torch.profiler | Where is time spent? Kernel launches, memory copies, CPU overhead. |
| Mixed precision (torch.cuda.amp) | FP16/BF16 training — what tensor cores accelerate. |
| Quantization (torch.ao.quantization) | INT8 inference — what L6 PE arrays must support. |
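
To ground the table, here is what these concepts converge on in practice: a minimal eager-mode training step (a generic sketch with a toy model, not code from any specific tutorial):

# Minimal eager-mode training step (generic sketch; model and data are toy)
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 32)             # torch.Tensor: shape (batch, features)
y = torch.randint(0, 10, (8,))     # integer class labels

opt.zero_grad()
loss = loss_fn(model(x), y)        # forward pass records the autograd graph
loss.backward()                    # autograd builds and runs the backward graph
opt.step()                         # parameter update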

Projects:

  1. Train a CNN (ResNet-18) on CIFAR-10 from scratch. Profile with torch.profiler. Identify the top-3 time-consuming operations. A hedged starting point is sketched below.
  2. Export the trained model to ONNX. Visualize the graph with Netron. Count the total number of ops and parameters.
  3. Apply post-training quantization (PTQ) to INT8. Measure accuracy drop and inference speedup on CPU.
  4. Use torch.compile() on a transformer block. Compare eager vs compiled execution time.
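
A hedged starting point for projects 1 and 2 (resnet18 from torchvision, the output file name, and opset 17 are illustrative choices, not requirements):

# Sketch for projects 1-2: profile a model, then export it to ONNX
import torch
from torch.profiler import profile, ProfilerActivity
from torchvision.models import resnet18

model = resnet18().eval()
x = torch.randn(1, 3, 224, 224)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(x)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=3))

torch.onnx.export(model, x, "resnet18.onnx", opset_version=17)
# open resnet18.onnx in Netron to count ops and parameters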


3. tinygrad — The Hackable Compiler

tinygrad is a minimal DL framework (~10K lines) that exposes the entire compiler pipeline in readable Python. It's the ideal codebase for understanding what happens between loss.backward() and the GPU kernel that actually runs.

Why tinygrad is uniquely valuable for this roadmap:

  • It's the inference engine inside openpilot (Phase 5E)
  • It exposes the IR, scheduler, and code generation that Phase 4C teaches you to build
  • You can add a custom backend (Phase 4C §7) — targeting your own accelerator
  • It runs on CUDA, OpenCL, Metal, LLVM, and custom targets

Key concepts:

| Concept | What it teaches | Connection to stack |
|---------|-----------------|---------------------|
| Lazy evaluation | Nothing runs until .realize() | L2: compiler decides when to execute |
| 3 operation types | Elementwise, Reduce, Movement (25 primitives total) | L5: what the PE array must support |
| ShapeTracker | Zero-copy reshapes and transposes | L2: memory layout optimization |
| UOp IR | The intermediate representation before code generation | L2: same concept as MLIR/TVM IR |
| BEAM search | Explores fusion choices to minimize runtime | L2: auto-tuning for kernel optimization |
| Backends | How the same IR generates CUDA, OpenCL, or LLVM code | L2: multi-target compilation |
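
Lazy evaluation is easy to see first-hand. A small sketch using tinygrad's public Tensor API (assuming a recent tinygrad; run the script with DEBUG=4 to print the generated kernels):

# Small sketch of tinygrad's laziness.
# Run as `DEBUG=4 python lazy.py` to see the scheduled/generated kernels.
from tinygrad import Tensor

a = Tensor.rand(64, 64)
b = Tensor.rand(64, 64)
c = (a @ b).relu()   # no kernel has run yet -- c is just a lazy graph
c.realize()          # scheduler runs: matmul and relu may fuse into one kernel
print(c.numpy()[0, :4])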

Projects:

  1. Trace a matmul through tinygrad: Tensor → lazy buffer → scheduled ops → generated CUDA kernel. Document every step.
  2. Run a small model with DEBUG=4 to see the generated kernels. Count the number of kernel launches.
  3. Run with BEAM=3 and compare kernel count and latency vs BEAM=0.
  4. (Advanced) Add a minimal logging backend that prints each kernel launch — verify which ops fuse.

Deep dive: The full tinygrad learning path (11 parts, 7 projects) is in Phase 5E — Autonomous Vehicles / tinygrad.


How Frameworks Connect to the Rest of the Roadmap

| Framework skill | Where it leads |
|-----------------|----------------|
| micrograd autograd | Phase 4C: understand what the compiler must differentiate |
| PyTorch model export (ONNX) | Phase 4C §1: graph IR as compiler input |
| PyTorch quantization | Phase 4C Part 2 §4: quantization passes |
| torch.compile (Inductor) | Phase 4C §5: production ML compiler pipeline |
| tinygrad IR and scheduler | Phase 4C §5: BEAM search, fusion strategies |
| tinygrad backends | Phase 4C §7: custom backend for your accelerator |
| PyTorch profiling | Phase 4C Part 2 §1: graph/operator optimization |

Resources

| Resource | What it covers |
|----------|----------------|
| Andrej Karpathy — micrograd video | Build autograd from scratch (2 hours) |
| PyTorch Tutorials | Official beginner → advanced tutorials |
| tinygrad GitHub | Source code — read it |
| tinygrad Discord | Community, contributions, help |
| Deep Learning with PyTorch (Stevens, Antiga, Viehmann) | Comprehensive PyTorch book |

Next

Module 3 — Computer Vision — the perception workloads that drive edge AI and autonomous systems.