Module 3 — tinygrad for Inference¶
Parent: Phase 5 — Autonomous Driving
A structured, hands-on path from first tensor to custom backend
Tinygrad is a minimal deep learning framework where the entire compiler and IR are visible and hackable in Python. It is the ideal codebase for understanding what happens between `loss.backward()` and the GPU kernel that actually runs.
Time estimate: 3–6 months for Parts 1–6 · Ongoing for Part 7
Prerequisites¶
- Python proficiency (functions, classes, decorators, generators)
- Basic linear algebra (matrix multiply, dot product, broadcasting)
- Familiarity with neural network training (forward pass, loss, backprop, optimizer)
- Optional but helpful: CUDA/OpenCL basics for backend sections
Part 1: Why Tinygrad?¶
The Problem with PyTorch for Learning¶
PyTorch's autograd, JIT, and kernel dispatch live in thousands of lines of C++/CUDA. You can't read them. You can only observe their effects. Tinygrad solves this:
- The entire framework is ~5,000 lines of Python
- Every optimization, IR transformation, and kernel is readable
- You can set `DEBUG=4` and watch every stage of compilation
Tinygrad's Position in the Ecosystem¶
- micrograd (Karpathy): pure autograd, no tensors, no GPU — great for understanding backprop
- tinygrad: lazy tensors, real GPU backends, hackable compiler — great for understanding frameworks
- PyTorch: full production framework, opaque internals — great for building things
Why It Matters for AI Hardware Engineers¶
Tinygrad is the software interface between ML workloads and custom hardware. Understanding it lets you:
- Know exactly what operations your accelerator must support
- Design hardware for the actual compute patterns (matmul, reduction, elementwise)
- Write compiler backends that target your custom chip
- Understand why openpilot uses tinygrad for production ADAS inference
Part 2: Setup and First Steps¶
Installation¶
# Option 1: pip (stable)
pip install tinygrad
# Option 2: from source (recommended for learning — you can read/modify it)
git clone https://github.com/tinygrad/tinygrad
cd tinygrad
pip install -e ".[dev]" # editable install so changes take effect immediately
Verifying Your Setup¶
from tinygrad import Tensor, Device
print(Device.DEFAULT) # shows your default backend (CUDA, METAL, CLANG, etc.)
print(Tensor([1,2,3]).numpy()) # [1 2 3] — dtype is inferred from the Python ints
Debug Environment Variables¶
These are your most important tools for understanding what tinygrad does:
| Variable | Value | What You See |
|---|---|---|
| `DEBUG` | `1` | Kernel count and timing |
| `DEBUG` | `2` | Kernel names and shapes |
| `DEBUG` | `3` | Generated kernel source code |
| `DEBUG` | `4` | Full UOp IR at every optimization stage |
| `VIZ` | `1` | Opens a browser graph visualizer for UOps |
| `BEAM` | `2` | Enable BEAM search kernel optimization |
| `NOOPT` | `1` | Disable optimizations (see unoptimized IR) |
Part 3: Tensor API¶
Creating Tensors¶
from tinygrad import Tensor
import numpy as np
# From Python lists
t = Tensor([1, 2, 3, 4])
# From numpy
t = Tensor(np.array([[1.0, 2.0], [3.0, 4.0]]))
# Factory methods
t = Tensor.zeros(3, 4)
t = Tensor.ones(3, 4)
t = Tensor.randn(3, 4) # normal distribution
t = Tensor.rand(3, 4) # uniform [0, 1)
t = Tensor.arange(0, 10, 2) # [0, 2, 4, 6, 8]
t = Tensor.eye(4) # identity matrix
Operations (PyTorch-Compatible API)¶
a = Tensor.randn(4, 4)
b = Tensor.randn(4, 4)
# Elementwise
c = a + b
c = a * b
c = a.relu()
c = a.exp()
c = a.log()
c = a.sqrt()
# Reduction
s = a.sum()
s = a.sum(axis=0) # sum along rows → shape (4,)
m = a.max(axis=1, keepdim=True)
# Matrix operations
c = a @ b # matmul → shape (4,4)
c = a.T # transpose
# Shape manipulation
c = a.reshape(2, 8)
c = a.permute(1, 0) # like numpy.transpose
c = a.expand(2, 4, 4) # broadcast-expand (zero-copy)
c = a.unsqueeze(0) # add dim at position 0
c = a.squeeze(0) # remove dim of size 1
Getting Values Back (Materializing)¶
# .realize() executes the lazy graph and returns the same tensor
t = Tensor.randn(3, 3)
t = t.realize()
# .numpy() realizes AND copies to numpy
arr = t.numpy() # triggers .realize() if not already done
# .item() for scalar tensors
loss_val = loss.item()
Part 4: Lazy Evaluation — The Core Concept¶
This is the most important concept to internalize. Nothing runs until you call .realize() or .numpy().
What "Lazy" Means¶
import os
os.environ['DEBUG'] = '2' # must be set BEFORE tinygrad is imported
from tinygrad import Tensor
a = Tensor.randn(4, 4) # no computation
b = a + 1 # no computation — records ADD in graph
c = b * 2 # no computation — records MUL in graph
d = c.relu() # no computation — records RELU in graph
# Only now does tinygrad compile all 3 ops into ONE kernel
d.realize() # prints: "1 kernels, X.Xms"
Tinygrad fused a+1, *2, and relu() into a single kernel — no intermediate buffers needed. This is op fusion, a key optimization.
The UOp: Tinygrad's IR Node¶
Every tensor has a .uop attribute — a node in the computation graph:
from tinygrad import Tensor
x = Tensor([1.0, 2.0, 3.0])
y = x + 1
print(type(y.uop)) # a UOp (the exact module path varies by version)
print(y.uop.op) # the operation type
print(y.uop.src) # input UOps (the graph edges)
UOps form a DAG (Directed Acyclic Graph). The compiler walks this graph to:
1. Fuse operations
2. Apply algebraic rewrites (e.g., x * 1 → x)
3. Generate kernel code
Part 5: The Three Operation Types¶
Everything in tinygrad decomposes to three kinds of primitive operations. No conv2d primitive. No matmul primitive. They are built from these three:
1. ElementwiseOps¶
Operate on each element independently — trivially parallelizable:
| Category | Operations |
|---|---|
| UnaryOps | SQRT, LOG2, EXP2, SIN, NEG, RECIP, CAST |
| BinaryOps | ADD, MUL, SUB, DIV, MAX, MOD, CMPLT |
| TernaryOps | WHERE (if/else), MULACC |
# All of these lower to ElementwiseOps:
x.relu() # WHERE(x > 0, x, 0) → TernaryOp(WHERE)
x.sigmoid() # 1 / (1 + exp(-x)) → chain of UnaryOps + BinaryOps
x.exp() # EXP2(x * log2(e)) → BinaryOp(MUL) + UnaryOp(EXP2)
2. ReduceOps¶
Collapse a dimension — require communication across elements:
x = Tensor.randn(1024, 1024)
# These are ReduceOps:
x.sum(axis=0) # SUM reduce along axis 0
x.max(axis=1) # MAX reduce along axis 1
x.mean() # SUM / size → ReduceOp + ElementwiseOp
ReduceOps are the hard part of GPU programming — they require careful parallel reduction patterns.
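To see why, here is a conceptual tree reduction in plain Python — not tinygrad code, just the log₂(n) combining pattern a GPU reduction kernel parallelizes across threads:

```python
# Conceptual tree reduction: each round halves the number of active values.
# Padding with 0 is valid for SUM; a MAX reduce would pad with -inf instead.
def tree_sum(vals):
    vals = list(vals)
    while len(vals) > 1:
        if len(vals) % 2: vals.append(0)  # pad to even length
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

assert tree_sum(range(10)) == sum(range(10))
```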
3. MovementOps (via ShapeTracker)¶
Reshape, permute, expand — zero-copy because they only change how indices map to memory:
x = Tensor.randn(4, 8, 16)
# These don't copy data — they update the ShapeTracker:
y = x.reshape(32, 16) # just relabels dimensions
y = x.permute(2, 0, 1) # reorders access pattern
y = x.expand(10, 4, 8, 16) # broadcasts (adds a new dimension)
y = x[1:3, :, ::2] # slice + stride
# Data only moves when a materialization (realize/numpy) is needed
ShapeTracker stores the strides and offsets that define how a logical index maps to a physical buffer index. This is how zero-copy views work.
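To make this concrete, here is a toy model of stride arithmetic in plain Python — illustrative only, not tinygrad's actual ShapeTracker API:

```python
# Row-major strides for shape (4, 8, 16): strides[i] = prod(shape[i+1:])
shape   = (4, 8, 16)
strides = (128, 16, 1)

def physical_index(idx, strides):
    # a logical index maps to the buffer offset sum(i * stride_i)
    return sum(i * s for i, s in zip(idx, strides))

# permute(2, 0, 1) just permutes shape and strides — the buffer is untouched.
permuted_strides = (1, 128, 16)  # view shape is now (16, 4, 8)

# View index (2, 1, 3) addresses the same element as original index (1, 3, 2):
assert physical_index((2, 1, 3), permuted_strides) == physical_index((1, 3, 2), strides)
```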
Part 6: Autograd¶
How tinygrad Implements Backprop¶
Tinygrad uses reverse-mode automatic differentiation (backprop). Every op that needs a gradient has a corresponding backward function.
from tinygrad import Tensor
# Gradient tracking is enabled with requires_grad=True
# (or automatically when a leaf tensor needs grad)
x = Tensor([2.0, 3.0], requires_grad=True)
y = (x * x).sum() # y = sum(x^2)
y.backward() # computes dy/dx = 2x
print(x.grad.numpy()) # [4. 6.] — correct: d(sum(x^2))/dx = 2x
Training Loop Pattern¶
from tinygrad import Tensor
from tinygrad.nn.optim import Adam, SGD
# Model (simple 2-layer MLP)
class MLP:
  def __init__(self):
    self.l1 = Tensor.randn(784, 128) * 0.01
    self.l2 = Tensor.randn(128, 10) * 0.01
  def __call__(self, x):
    return x.linear(self.l1).relu().linear(self.l2)

model = MLP()
optim = Adam([model.l1, model.l2], lr=0.001)

with Tensor.train():  # optimizer steps require training mode to be on
  for step in range(1000):
    x = Tensor.randn(32, 784)       # fake batch
    y_target = Tensor.zeros(32, 10) # fake labels
    optim.zero_grad()
    y_pred = model(x)
    loss = (y_pred - y_target).pow(2).mean() # MSE loss
    loss.backward()
    optim.step()
    if step % 100 == 0:
      print(f"step {step}, loss={loss.item():.4f}")
tinygrad's nn Module¶
from tinygrad.nn import Linear, BatchNorm, Conv2d
from tinygrad.nn.optim import Adam, SGD, AdamW
from tinygrad.nn.state import get_parameters, get_state_dict, load_state_dict, safe_save, safe_load
# Linear layer
layer = Linear(128, 64, bias=True)
# Conv2d
conv = Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
# Get all parameters of a model
params = get_parameters(model)
# Save/load weights
state = get_state_dict(model)
safe_save(state, "model.safetensors")
state = safe_load("model.safetensors")
load_state_dict(model, state)
Part 7: The Compiler Pipeline¶
This is where tinygrad becomes uniquely educational. Set DEBUG=4 and watch every stage.
Stage 1: Schedule Generation¶
After you call .realize(), tinygrad first creates a schedule — a list of kernels that need to run:
from tinygrad import Tensor
x = Tensor.randn(4, 4)
y = Tensor.randn(4, 4)
z = (x @ y).relu().sum(axis=0)
# Get the schedule without executing
sched = z.schedule()
print(f"{len(sched)} kernels")
for item in sched:
print(item.ast) # the abstract syntax tree for each kernel
Fused ops appear as a single schedule item. Multiple reductions or ops across buffer boundaries become separate items.
Stage 2: UOp Lowering and Optimization¶
Each schedule item's AST is lowered to a UOp tree and then optimized through a series of pattern-matching rewrite rules:
Key rewrite rules include:
- Constant folding: x + 0 → x, x * 1 → x
- Algebraic identities: log(exp(x)) → x
- Loop fusion: merge adjacent loops with compatible access patterns
- Linearization: convert the tree into a flat list of operations for codegen
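To make the rewrite idea concrete, here is a toy constant-folding pass over a tuple-encoded expression tree — illustrative only, not tinygrad's actual PatternMatcher/UPat machinery:

```python
# Toy algebraic rewriter: recursively simplify, bottom-up.
def rewrite(node):
    if not isinstance(node, tuple): return node  # leaf: variable or constant
    op, srcs = node[0], tuple(rewrite(s) for s in node[1:])
    if op == "MUL" and srcs[1] == 1: return srcs[0]  # x * 1 -> x
    if op == "ADD" and srcs[1] == 0: return srcs[0]  # x + 0 -> x
    return (op,) + srcs

expr = ("ADD", ("MUL", "x", 1), 0)  # (x * 1) + 0
print(rewrite(expr))                # -> 'x'
```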
Stage 3: Kernel Optimization (BEAM Search)¶
Set BEAM=2 to enable BEAM search — tinygrad tries multiple loop orderings and tile sizes, benchmarks them, and picks the fastest:
BEAM search explores:
- Loop ordering: which dimension to parallelize, which to tile
- Work-group sizes: how many threads per block
- Unrolling and vectorization: SIMD width
Stage 4: Code Generation¶
The optimized UOp tree is converted to target-specific code:
import os
os.environ['DEBUG'] = '3'
os.environ['CLANG'] = '1' # use CPU backend to see readable C
from tinygrad import Tensor
x = Tensor.randn(4, 4)
y = Tensor.randn(4, 4)
(x @ y).realize()
# Prints: actual C function that implements the matmul
Backends: CLANG (C), CUDA (PTX/SASS), METAL (MSL), AMD (HIP), OpenCL
Part 8: Backends¶
Existing Backend Structure¶
In the tinygrad source, backends live in tinygrad/runtime/:
tinygrad/runtime/
ops_gpu.py # OpenCL backend
ops_cuda.py # NVIDIA CUDA backend
ops_metal.py # Apple Metal backend
ops_clang.py # CPU/Clang backend (simplest — read this first)
ops_hsa.py # AMD HIP backend
Each backend implements three things:
1. Allocator: allocate/free device memory, copy to/from host
2. Compiler: take generated source code, compile to binary (PTX, SPIR-V, etc.)
3. Runtime: load a compiled program and execute it with given buffers
Reading the Clang Backend¶
ops_clang.py is the simplest — it generates C, compiles with clang, and runs as a shared library. Start here:
# Simplified structure of ops_clang.py:
class ClangAllocator(Allocator):
  def _alloc(self, size): return (ctypes.c_float * size)()
  def _free(self, buf): del buf
  def copyin(self, dst, src): ctypes.memmove(dst, src, src.nbytes)
  def copyout(self, dst, src): ctypes.memmove(dst, src, ctypes.sizeof(src))

class ClangCompiler(Compiler):
  def compile(self, src: str) -> bytes:
    # Write C to temp file, compile with clang, return .so bytes
    ...

class ClangRuntime(Runner):
  def __call__(self, *bufs, global_size, local_size):
    # Call the compiled C function with the given buffers
    ...
Adding a Custom Backend (Skeleton)¶
# my_backend.py — minimal custom backend skeleton
from tinygrad.device import Compiled, Allocator, Compiler, Runner
from tinygrad.renderer.cstyle import CStyleLanguage
class MyAllocator(Allocator):
  def _alloc(self, size: int, options):
    # Allocate `size` bytes on your device
    # Return a handle (pointer, buffer object, etc.)
    ...
  def _free(self, buf, options):
    # Free the buffer
    ...
  def copyin(self, dst, src: memoryview):
    # Copy from host (numpy array) to device buffer
    ...
  def copyout(self, dst: memoryview, src):
    # Copy from device buffer to host (numpy array)
    ...

class MyCompiler(Compiler):
  def compile(self, src: str) -> bytes:
    # src is the generated C/kernel source as a string
    # Compile it to binary and return bytes
    ...

class MyRunner(Runner):
  def __init__(self, name: str, lib: bytes):
    # Load the compiled binary
    ...
  def __call__(self, *bufs, global_size, local_size, wait=False):
    # Execute the kernel with given buffers and launch dimensions
    ...

class MyDevice(Compiled):
  def __init__(self, device: str):
    super().__init__(
      device,
      MyAllocator(),
      CStyleLanguage(), # reuse the C code generator
      MyCompiler(),
      MyRunner,
    )
# Register the device:
# from tinygrad.device import Device
# Device._devices["MYDEVICE"] = MyDevice
Part 9: Projects¶
Work through these in order. Each builds on the previous.
Project 1: Tensor Basics and Lazy Evaluation¶
Goal: Understand that tinygrad doesn't compute until .realize(), and see op fusion in action.
File: projects/01_tensor_basics.py
Tasks:
1. Create tensors with all factory methods. Check shapes and dtypes.
2. Set DEBUG=1. Run (a + b).relu().sum() and count how many kernels run (should be 1 — fused).
3. Set DEBUG=3. Read the generated kernel. Find the ADD, RELU, and SUM operations in the C code.
4. Break fusion: call .realize() after each op. Count kernels now (should be 3). Compare speed.
5. Use .schedule() to print the schedule before and after a .realize() call.
Key insight: Op fusion is free performance — tinygrad does it automatically when you let the graph grow before realizing.
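A minimal starting point for tasks 2 and 4 — the kernel counts in the comments are the expected outcome, not guaranteed on every version or backend (note that `Tensor.randn` launches kernels of its own):

```python
# DEBUG must be set before tinygrad is imported.
import os; os.environ["DEBUG"] = "1"
from tinygrad import Tensor

a, b = Tensor.randn(256, 256), Tensor.randn(256, 256)
(a + b).relu().sum().realize()                       # fused: expect 1 compute kernel
a.add(b).realize().relu().realize().sum().realize()  # fusion broken: expect 3 kernels
```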
Project 2: The Three Op Types¶
Goal: Understand ElementwiseOps, ReduceOps, and MovementOps by inspecting what tinygrad generates for each.
File: projects/02_op_types.py
Tasks:
1. ElementwiseOps: Run x.relu(), x * 2, x.exp() with DEBUG=3. Find the corresponding operations in the C output.
2. ReduceOps: Run x.sum(), x.max(axis=0) with DEBUG=3. Observe the loop structure in the reduction kernel.
3. MovementOps: Run x.reshape(...), x.permute(...), x.expand(...). Use DEBUG=2 — notice that movement ops alone produce 0 kernels. They are zero-copy.
4. Mixed: Build (x.permute(1,0) @ y.reshape(4,4)).sum(axis=1). Count kernels with DEBUG=1. Try to predict how many before running.
5. Implement matmul manually without @: use expand + * + sum. Verify it matches @.
Key insight: Movement ops are free because ShapeTracker only changes index math. Fusing movement ops into downstream compute ops is how tinygrad avoids unnecessary memory copies.
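One way to do task 5, using only movement + elementwise + reduce ops:

```python
from tinygrad import Tensor

x = Tensor.randn(4, 8)
y = Tensor.randn(8, 16)

# (4,8,1) and (1,8,16) expand to (4,8,16); multiply, then reduce the shared axis.
manual = (x.reshape(4, 8, 1).expand(4, 8, 16)
        * y.reshape(1, 8, 16).expand(4, 8, 16)).sum(axis=1)  # -> (4, 16)

assert (manual - (x @ y)).abs().max().numpy() < 1e-4
```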
Project 3: Autograd from Scratch¶
Goal: Understand how tinygrad implements reverse-mode autodiff.
File: projects/03_autograd.py
Tasks:
1. Manually verify gradients:
- y = x.pow(2).sum() → dy/dx = 2x (verify with .grad)
- y = (x * w).sum() → dy/dw = x (verify)
- y = x.sigmoid() → dy/dx = sigmoid(x) * (1 - sigmoid(x)) (verify)
2. Build a 2-layer MLP and train it to overfit a 10-sample XOR dataset. Verify loss reaches near zero.
3. Read the tinygrad source: tinygrad/tensor.py, search for def _broadcasted and class Function. Read 3 backward functions (e.g., Mul, Add, Sum). Write a comment explaining each.
4. Implement a custom loss function: Huber loss. Verify its gradient numerically using finite differences (f(x+ε) - f(x-ε)) / 2ε.
Key insight: Autograd is just a chain of backward() functions stored in the graph. Each op records how to propagate the gradient back through it.
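A hedged sketch of task 4 — Huber loss built from primitives plus a central-difference check (`delta`, the test values, and the tolerances are arbitrary choices):

```python
from tinygrad import Tensor
import numpy as np

def huber(x: Tensor, delta: float = 1.0) -> Tensor:
    # quadratic inside |x| < delta, linear outside
    absx = x.abs()
    return (absx < delta).where(0.5 * x * x, delta * (absx - 0.5 * delta)).sum()

x = Tensor(np.array([0.3, -2.0, 0.9], dtype=np.float32), requires_grad=True)
huber(x).backward()
grad = x.grad.numpy()

eps = 1e-4
for i in range(3):
    xp, xm = x.numpy().copy(), x.numpy().copy()
    xp[i] += eps; xm[i] -= eps
    numeric = (huber(Tensor(xp)).item() - huber(Tensor(xm)).item()) / (2 * eps)
    assert abs(numeric - grad[i]) < 1e-2  # autograd matches finite differences
```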
Project 4: Training MNIST¶
Goal: Train a real model end-to-end. Understand the full training loop, data loading, and evaluation.
File: projects/04_mnist.py
Tasks:
1. Download MNIST: tinygrad provides from tinygrad.nn.datasets import mnist (or download manually).
2. Build a CNN with Conv2d, BatchNorm, relu, and Linear.
3. Train for 5 epochs with Adam. Target: >98% test accuracy.
4. Profile the training loop with DEBUG=1. How many kernels per step? Which ones take the most time?
5. Enable BEAM=2. Does it speed up training? By how much?
6. Save the model with safe_save and reload it. Verify accuracy is identical after reload.
Reference: tinygrad/examples/mnist.py in the tinygrad source — read it before writing your own.
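A possible skeleton for task 2 — the layer sizes are illustrative choices, not taken from the reference example:

```python
from tinygrad import Tensor
from tinygrad.nn import Conv2d, BatchNorm, Linear

class SmallCNN:
    def __init__(self):
        self.c1, self.b1 = Conv2d(1, 32, 3, padding=1), BatchNorm(32)
        self.c2, self.b2 = Conv2d(32, 64, 3, padding=1), BatchNorm(64)
        self.fc = Linear(64 * 7 * 7, 10)  # 28 -> 14 -> 7 after two 2x2 pools
    def __call__(self, x: Tensor) -> Tensor:
        x = self.b1(self.c1(x)).relu().max_pool2d()
        x = self.b2(self.c2(x)).relu().max_pool2d()
        return self.fc(x.flatten(1))
```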
Project 5: Inspecting the Compiler¶
Goal: Understand the full pipeline from Python tensor ops to GPU kernel.
File: projects/05_compiler_pipeline.py
Tasks:
1. Schedule inspection: Build a computation graph with 5+ ops. Call .schedule(). Print the AST for each kernel item. Explain why some ops are fused and others aren't.
2. UOp tree: Set VIZ=1 and run a matmul. Navigate the browser graph visualizer. Find the elementwise, reduce, and loop nodes.
3. IR stages with DEBUG=4: Run a small matmul with DEBUG=4. Identify and describe:
- The initial UOp IR (before optimization)
- The IR after constant folding
- The IR after loop analysis
- The final linearized IR
4. Algebraic rewrites: Set NOOPT=1 and run x * 1 + 0. Note the kernel. Remove NOOPT=1 and run again. Observe the optimized kernel.
5. BEAM search: Run a 1024×1024 matmul with BEAM=0 and BEAM=4. Compare runtime. What tile size did BEAM choose?
Project 6: Custom Operations and Backend Extensions¶
Goal: Extend tinygrad with a custom op and understand the backend interface.
File: projects/06_custom_ops.py
Tasks:
1. Compose custom ops: Implement these from primitives only (no torch-like built-ins):
- gelu(x): x * 0.5 * (1 + (x / sqrt(2)).erf()) — use tinygrad's erf or approximate it
- rms_norm(x): x / sqrt((x*x).mean() + 1e-6)
- swiglu(x, gate): x * gate.sigmoid()
2. Verify each custom op matches a reference PyTorch implementation numerically.
3. Write gradients test for rms_norm: compute gradient via autograd, verify with finite differences.
4. Inspect what they compile to: Use DEBUG=3 to see the kernel for each. Count operations.
5. Fuse vs. unfused: Implement gelu step-by-step with .realize() between each step vs. all at once. Compare kernel count and runtime.
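Possible starting points for task 1, built only from tinygrad primitives. The tanh-based GELU is an approximation, used here in case `erf` is unavailable on your version; `rms_norm` normalizes over the last axis, one common convention:

```python
from tinygrad import Tensor

def rms_norm(x: Tensor, eps: float = 1e-6) -> Tensor:
    return x / ((x * x).mean(axis=-1, keepdim=True) + eps).sqrt()

def swiglu(x: Tensor, gate: Tensor) -> Tensor:
    return x * gate.sigmoid()

def gelu_tanh(x: Tensor) -> Tensor:
    # tanh approximation; 0.7978845608 = sqrt(2/pi)
    return 0.5 * x * (1 + (0.7978845608 * (x + 0.044715 * x ** 3)).tanh())
```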
Project 7: Implement a Custom Backend¶
Goal: Build a functional custom backend that runs tinygrad ops.
File: projects/07_custom_backend/
This is the capstone project. You'll implement a backend that targets the CPU via ctypes (simpler than CUDA, but teaches the full interface).
Part A — Allocator:
- Implement _alloc(size) using ctypes.create_string_buffer
- Implement copyin and copyout using ctypes.memmove
- Test: allocate a buffer, copy a numpy array in, copy it back out. Verify round-trip.
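A sketch of the Part A round-trip test using raw ctypes (buffer size and values are arbitrary):

```python
import ctypes
import numpy as np

buf = ctypes.create_string_buffer(4 * 16)         # room for 16 float32 values
src = np.arange(16, dtype=np.float32)
ctypes.memmove(buf, src.ctypes.data, src.nbytes)  # copyin: host -> "device"
out = np.empty(16, dtype=np.float32)
ctypes.memmove(out.ctypes.data, buf, out.nbytes)  # copyout: "device" -> host
assert (out == src).all()                         # round-trip is lossless
```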
Part B — Compiler:
- Implement compile(src: str) → bytes by calling clang as a subprocess on the generated C source
- Return the compiled .so as bytes
- Test: compile a trivial C kernel void kernel(float* a) { a[0] = 42.0f; } and verify it builds
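One way to implement Part B's `compile`, assuming `clang` is on your PATH — a sketch, not `ops_clang.py` verbatim:

```python
import pathlib, subprocess, tempfile

def compile_c(src: str) -> bytes:
    # Write the generated C to a temp file, build a shared object, return its bytes.
    with tempfile.TemporaryDirectory() as d:
        c_path, so_path = pathlib.Path(d) / "kernel.c", pathlib.Path(d) / "kernel.so"
        c_path.write_text(src)
        subprocess.run(["clang", "-shared", "-O2", "-fPIC",
                        "-o", str(so_path), str(c_path)], check=True)
        return so_path.read_bytes()
```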
Part C — Runner:
- Load the compiled .so bytes into memory with ctypes.CDLL
- Implement __call__(*bufs, global_size, local_size) to invoke the compiled function
- Handle the global_size loop in Python (single-threaded — focus on correctness)
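A Part C sketch: persist the compiled bytes and load them with `ctypes.CDLL`. It assumes C-style kernels that contain their own loops (as a C renderer typically emits), so a single call per kernel suffices:

```python
import ctypes, tempfile

class MyRunner:
    def __init__(self, name: str, lib: bytes):
        with tempfile.NamedTemporaryFile(delete=False, suffix=".so") as f:
            f.write(lib)                     # CDLL needs a file on disk
        self.fxn = getattr(ctypes.CDLL(f.name), name)
    def __call__(self, *bufs, global_size=None, local_size=None, wait=False):
        self.fxn(*bufs)                      # bufs are ctypes buffers/addresses
```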
Part D — Integration:
- Assemble MyAllocator, MyCompiler, and MyRunner into a MyDevice(Compiled) class
- Register it: Device._devices["MY"] = MyDevice
- Run Tensor([1,2,3], device="MY") + Tensor([4,5,6], device="MY") and get [5,7,9]
Part E — Benchmark:
- Run a 256×256 matmul on MY, CLANG, and (if available) CUDA
- Compare GFLOPS. Document the overhead of your Python runtime vs. the C clang runtime.
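For the GFLOPS comparison, recall that a square N×N matmul costs about 2·N³ flops. A sketch of a timing harness — the device names are whatever you registered above, and the warm-up call absorbs compile time:

```python
import time
from tinygrad import Tensor

def bench(device: str, n: int = 256) -> float:
    a, b = Tensor.randn(n, n, device=device), Tensor.randn(n, n, device=device)
    (a @ b).realize()                 # warm-up: triggers kernel compilation
    t0 = time.perf_counter()
    (a @ b).realize()
    dt = time.perf_counter() - t0
    return 2 * n**3 / dt / 1e9        # GFLOPS

for dev in ["MY", "CLANG"]:
    print(dev, f"{bench(dev):.2f} GFLOPS")
```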
Part 10: Reading the Source¶
After completing the projects, read these files in the tinygrad source in order:
| File | What It Teaches |
|---|---|
| `tinygrad/tensor.py` | Tensor API and autograd — the user-facing layer |
| `tinygrad/engine/lazy.py` | LazyBuffer — how lazy evaluation is implemented |
| `tinygrad/engine/schedule.py` | How the execution schedule is built from the graph |
| `tinygrad/codegen/uops.py` | UOp IR definition and all op types |
| `tinygrad/codegen/lowerer.py` | Lowering schedule AST to UOp IR |
| `tinygrad/codegen/linearizer.py` | Linearization — UOp tree to flat kernel instructions |
| `tinygrad/renderer/cstyle.py` | C/CUDA/Metal code generation from linear UOps |
| `tinygrad/runtime/ops_clang.py` | Simplest backend — Allocator, Compiler, Runner |
| `tinygrad/runtime/ops_cuda.py` | CUDA backend — compare with clang backend |
| `tinygrad/nn/__init__.py` | Standard layers (Linear, Conv2d, BatchNorm) |
Study method: Pick a function, add print() statements, run an example with DEBUG=4, and trace the execution path from Tensor.add() all the way to the compiled C string.
Part 11: Contributing to Tinygrad¶
Where to Start¶
- Read `CONTRIBUTING.md` in the tinygrad source
- Run the test suite: `python -m pytest test/ -x` — understand what's tested
- Find a "good first issue" on the GitHub issue tracker
Types of Contributions¶
- New ops or nn layers: Implement a missing activation, loss, or layer
- Backend improvements: Optimize kernel generation for a specific GPU architecture
- New backend: Add support for a new device (e.g., RISC-V simulator, custom FPGA)
- Bug fixes: Fix a correctness issue with a specific op/dtype combination
- Documentation: Write examples or clarify existing docs
Testing Your Changes¶
# Run tests for a specific file
python -m pytest test/test_tensor.py -x -v
# Run tests for a specific test
python -m pytest test/test_ops.py::TestOps::test_relu -v
# Test with a specific backend
CLANG=1 python -m pytest test/ -x
# Run the full CI suite (slow)
python -m pytest test/ -x --timeout=300
The tinygrad Standard¶
The core team is strict about code quality. When contributing:
- Keep changes minimal — tinygrad values simplicity above all
- No new dependencies
- All tests must pass
- New features need tests
- Code style: no type annotations in function bodies, minimal comments
Resources¶
| Resource | What It's For |
|---|---|
| tinygrad GitHub | Source, issues, discussions |
| tinygrad Docs | API reference and quickstart |
| tinygrad Discord | Community, Q&A, contributor help |
| tinygrad-notes | Community deep-dives on internals |
| George Hotz streams (Twitch/YouTube) | Live coding tinygrad — watch the compiler evolve |
| abstractions.py | Annotated walkthrough of tinygrad's architecture |
| `hacking-tinygrad.md` (this folder) | Code snippets for inspecting internals |
See the projects/ folder for runnable Python scripts for each project above.