Module 2 — Deep Learning Frameworks¶
Parent: Phase 3 — Artificial Intelligence
Understand how software generates the workloads your hardware must run — from autograd to GPU kernels.
Prerequisites: Module 1 (Neural Networks — understand what a forward/backward pass computes).
Layer mapping: L1 (Application) — you use frameworks to build models. L2 (Compiler) — tinygrad exposes the compiler pipeline that Phase 4C teaches you to build.
Why a Dedicated Frameworks Module¶
Module 1 teaches you what neural networks compute. This module teaches you how — the software machinery that turns model(x) into GPU kernel launches. Understanding this machinery is essential because:
- L2 (Compiler): You can't build an ML compiler without understanding what frameworks produce (computational graphs, ops, tensors)
- L5 (Architecture): You can't design an accelerator without knowing which ops dominate real workloads
- L6 (RTL): You can't build a PE array without understanding the precision and data flow of actual training/inference
Three-Framework Mental Model¶
Study these three frameworks in order. Each teaches a different level of the stack.
| Framework | What it teaches | Size | Your learning goal |
|---|---|---|---|
| micrograd | How autograd works — reverse-mode differentiation from scratch | ~100 lines | Build it yourself. Understand backprop at the code level. |
| PyTorch | Industry-standard API — tensors, modules, optimizers, data loading | Millions of lines | Use it fluently. Train models, export ONNX, profile with torch.profiler. |
| tinygrad | How a compiler turns tensor ops into GPU kernels — IR, scheduler, backends | ~10,000 lines | Read the source. Trace from Tensor to generated CUDA/OpenCL code. |
```
 micrograd             PyTorch               tinygrad
(education)          (production)      (hackable production)
     │                    │                     │
     ▼                    ▼                     ▼
 Autograd              Full API         Compiler pipeline
from scratch      industry standard  IR → scheduler → codegen
     │                    │                     │
     └────────────────────┴─────────────────────┘
          Understanding grows left → right
```
1. micrograd — Autograd from Scratch¶
micrograd by Andrej Karpathy. ~100 lines of Python. Implements:
- A Value class that tracks computation history
- Reverse-mode automatic differentiation (backpropagation)
- A tiny neural network API (Neuron, Layer, MLP)
What you'll build:
```python
from micrograd.engine import Value

# Forward pass
x = Value(2.0)
y = Value(3.0)
z = x * y + y ** 2  # z = 2*3 + 9 = 15

# Backward pass (autograd)
z.backward()
print(x.grad)  # dz/dx = y = 3.0
print(y.grad)  # dz/dy = x + 2y = 2 + 6 = 8.0
```
Why it matters for hardware: Every training accelerator must implement this backward pass. Understanding the computation graph and gradient flow tells you what memory access patterns and operations the hardware must support.
Project: Implement micrograd from scratch (don't copy — type it yourself). Train an MLP on a 2D classification dataset. Visualize the computation graph.
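To make the project concrete, here is a minimal sketch of the kind of Value class you will build, supporting only + and * (micrograd itself adds pow, relu, and more):

```python
class Value:
    """Toy scalar autograd node: a simplified sketch of micrograd's Value."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # how to push this node's grad to its children
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # d(a+b)/da = 1
            other.grad += out.grad      # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x, y = Value(2.0), Value(3.0)
z = x * y + y * y    # same z = 15 as above, written with * instead of **
z.backward()
print(x.grad, y.grad)  # 3.0 8.0
```

Note how gradients accumulate with `+=`: y appears twice in the graph, so its gradient is the sum of contributions from both uses. This accumulation pattern is exactly what training hardware must implement.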
2. PyTorch — Industry Standard¶
PyTorch is the framework most models are written in. You need fluency here because:
- Models you deploy on hardware are written in PyTorch
- ONNX export comes from PyTorch (torch.onnx.export)
- torch.compile (Inductor) is a production ML compiler
- Profiling tools (torch.profiler, Nsight) show you where time is spent
Key concepts to master:
| Concept | Why it matters for hardware |
|---|---|
| torch.Tensor | The data structure accelerators process. Shape, dtype, layout (contiguous, channels-last). |
| nn.Module | How models are structured. Layers → forward pass → computational graph. |
| Autograd (loss.backward()) | Generates the backward graph that training hardware executes. |
| Data loading (DataLoader) | CPU-GPU pipeline. Bottleneck if not overlapped with compute. |
| torch.onnx.export() | How models leave PyTorch and enter the compiler/runtime stack (Phase 4C). |
| torch.compile() | PyTorch's built-in compiler (Inductor). Generates Triton kernels. Connection to Phase 4C. |
| torch.profiler | Where is time spent? Kernel launches, memory copies, CPU overhead. |
| Mixed precision (torch.cuda.amp) | FP16/BF16 training — what tensor cores accelerate. |
| Quantization (torch.ao.quantization) | INT8 inference — what L6 PE arrays must support. |
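A minimal sketch tying the first three concepts together (the model and shapes are illustrative, not from any particular project):

```python
import torch
import torch.nn as nn

# A tiny MLP: the kind of nn.Module whose forward and backward
# graphs an accelerator ultimately executes.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(16, 4)              # batch of 16, feature dim 4
target = torch.randint(0, 2, (16,))

loss = nn.functional.cross_entropy(model(x), target)
loss.backward()                     # autograd builds and runs the backward graph

# Every parameter now holds a gradient tensor of matching shape;
# these are the tensors training hardware must produce.
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.grad is not None)
```

From here, torch.profiler wraps the same forward/backward to show where time goes, and torch.onnx.export serializes the forward graph for the compiler stack.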
Projects:
1. Train a CNN (ResNet-18) on CIFAR-10 from scratch. Profile with torch.profiler. Identify the three most time-consuming operations.
2. Export the trained model to ONNX. Visualize the graph with Netron. Count the total number of ops and parameters.
3. Apply post-training quantization (PTQ) to INT8. Measure accuracy drop and inference speedup on CPU.
4. Use torch.compile() on a transformer block. Compare eager vs compiled execution time.
3. tinygrad — The Hackable Compiler¶
tinygrad is a minimal DL framework (~10K lines) that exposes the entire compiler pipeline in readable Python. It's the ideal codebase for understanding what happens between loss.backward() and the GPU kernel that actually runs.
Why tinygrad is uniquely valuable for this roadmap:
- It's the inference engine inside openpilot (Phase 5E)
- It exposes the IR, scheduler, and code generation that Phase 4C teaches you to build
- You can add a custom backend (Phase 4C §7) — targeting your own accelerator
- It runs on CUDA, OpenCL, Metal, LLVM, and custom targets
Key concepts:
| Concept | What it teaches | Connection to stack |
|---|---|---|
| Lazy evaluation | Nothing runs until .realize() | L2: compiler decides when to execute |
| 3 operation types | Elementwise, Reduce, Movement (25 primitives total) | L5: what the PE array must support |
| ShapeTracker | Zero-copy reshapes and transposes | L2: memory layout optimization |
| UOp IR | The intermediate representation before code generation | L2: same concept as MLIR/TVM IR |
| BEAM search | Explores fusion choices to minimize runtime | L2: auto-tuning for kernel optimization |
| Backends | How the same IR generates CUDA, OpenCL, or LLVM code | L2: multi-target compilation |
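Lazy evaluation is the load-bearing idea. Here is a toy illustration of the pattern — recording ops into a graph and only executing on realize() — using made-up class names, not tinygrad's actual internals:

```python
class LazyBuffer:
    """Toy stand-in for a lazy tensor: records ops, computes only on realize()."""

    def __init__(self, data=None, op=None, srcs=()):
        self.data, self.op, self.srcs = data, op, srcs

    def __add__(self, other):
        return LazyBuffer(op="ADD", srcs=(self, other))  # nothing computed yet

    def __mul__(self, other):
        return LazyBuffer(op="MUL", srcs=(self, other))  # just graph-building

    def realize(self, log):
        if self.data is None:                   # interior node: compute children first
            vals = [s.realize(log) for s in self.srcs]
            log.append(self.op)                 # a real scheduler would fuse ops here
            self.data = vals[0] + vals[1] if self.op == "ADD" else vals[0] * vals[1]
        return self.data

a, b = LazyBuffer(2.0), LazyBuffer(3.0)
c = a * b + b          # builds a 2-op graph; no arithmetic has run yet
log = []
print(c.realize(log))  # 9.0
print(log)             # ['MUL', 'ADD']
```

Because execution is deferred until realize(), the compiler sees the whole graph at once and can decide fusion, scheduling, and memory layout — the decisions tinygrad's scheduler and BEAM search actually make.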
Projects:
1. Trace a matmul through tinygrad: Tensor → lazy buffer → scheduled ops → generated CUDA kernel. Document every step.
2. Run a small model with DEBUG=4 to see the generated kernels. Count the number of kernel launches.
3. Run with BEAM=3 and compare kernel count and latency vs BEAM=0.
4. (Advanced) Add a minimal logging backend that prints each kernel launch — verify which ops fuse.
Deep dive: The full tinygrad learning path (11 parts, 7 projects) is in Phase 5E — Autonomous Vehicles / tinygrad.
How Frameworks Connect to the Rest of the Roadmap¶
| Framework skill | Where it leads |
|---|---|
| micrograd autograd | Phase 4C: understand what compiler must differentiate |
| PyTorch model export (ONNX) | Phase 4C §1: graph IR as compiler input |
| PyTorch quantization | Phase 4C Part 2 §4: quantization passes |
| torch.compile (Inductor) | Phase 4C §5: production ML compiler pipeline |
| tinygrad IR and scheduler | Phase 4C §5: BEAM search, fusion strategies |
| tinygrad backends | Phase 4C §7: custom backend for your accelerator |
| PyTorch profiling | Phase 4C Part 2 §1: graph/operator optimization |
Resources¶
| Resource | What it covers |
|---|---|
| Andrej Karpathy — micrograd video | Build autograd from scratch (2 hours) |
| PyTorch Tutorials | Official beginner → advanced tutorials |
| tinygrad GitHub | Source code — read it |
| tinygrad Discord | Community, contributions, help |
| Deep Learning with PyTorch (Stevens, Antiga, Viehmann) | Comprehensive PyTorch book |
Next¶
→ Module 3 — Computer Vision — the perception workloads that drive edge AI and autonomous systems.