
Neural Networks

Phase 3 — Artificial Intelligence (after Phase 1 §4 — C++ and Parallel Computing). Optional after Phase 2 if you prefer embedded first.

Goal: Build a concrete, ground-up understanding of what AI is, what artificial neural networks are, how they learn, and how to implement everything hands-on using tinygrad — the minimal ML framework that exposes every fundamental operation.

Related (separate topic): Edge AI — where models run on-device, latency/privacy tiers, train → optimize → deploy.


Table of Contents

  1. Training vs the edge
  2. What is Artificial Intelligence?
  3. What is a Neural Network?
  4. The Neuron: Building Block
  5. Activation Functions
  6. The Multi-Layer Perceptron (MLP)
  7. Forward Pass: Making a Prediction
  8. Loss Functions: Measuring Error
  9. Backpropagation: How Networks Learn
  10. Gradient Descent and Optimizers
  11. The Training Loop
  12. Convolutional Neural Networks (CNNs)
  13. Regularization and Generalization
  14. Hands-On with tinygrad
  15. Projects
  16. Resources

Deep Dive Subfolders

Subfolder                 Description
─────────                 ───────────
pytorch-and-micrograd/    Build autograd from scratch with micrograd, master PyTorch, then bridge to tinygrad — the essential prerequisite for deep tinygrad mastery

1. Training vs the edge

This guide focuses on training and inference math (usually on a workstation or cloud with enough RAM and FLOPs). On-device constraints—MCU vs SBC vs Jetson, latency, privacy, and the full train → quantize → deploy loop—live in Edge AI. You can read that track after the sections below, or skim it first for motivation.


2. What is Artificial Intelligence?

The Core Idea

Classical programming:

Rules + Data → Program → Output

Machine learning (the dominant AI approach today):

Data + Output (labels) → Learning Algorithm → Rules (model weights)

You don't write the rules. The algorithm learns the rules from examples.

Types of Machine Learning

Supervised Learning:
  Input: labeled pairs, e.g. (image, "cat"), (image, "dog")
  Learns: mapping from inputs to labels
  Examples: image classification, speech recognition, fraud detection

Unsupervised Learning:
  Input: unlabeled data
  Learns: hidden structure (clusters, patterns)
  Examples: customer segmentation, anomaly detection

Reinforcement Learning:
  Agent learns by trial and error in an environment
  Reward signal guides behavior
  Examples: game playing (AlphaGo), robot locomotion, autonomous driving

Where Neural Networks Fit

Neural networks are a family of supervised (and self-supervised) learning algorithms that are especially powerful for:

  • Images and video (CNNs)
  • Text and sequences (Transformers, RNNs)
  • Audio (CNNs + Transformers)
  • Tabular data (MLP)
  • Graphs (GNNs)

They dominate modern AI because they can learn arbitrarily complex mappings given enough data and compute.


3. What is a Neural Network?

Biological Inspiration

The brain contains ~86 billion neurons. Each neuron:

  • Receives electrical signals from other neurons via dendrites
  • Sums up the signals
  • If the sum exceeds a threshold, it fires (sends a signal to other neurons) via the axon

The strength of a connection between neurons is called its synaptic weight.

Learning = changing synaptic weights.

The Artificial Analogy

Biological Neuron          Artificial Neuron
─────────────────          ─────────────────
Dendrites                  Inputs  x₁, x₂, ..., xₙ
Synaptic weights           Weights w₁, w₂, ..., wₙ
Cell body summation        z = Σ(wᵢ · xᵢ) + b
Firing threshold           Activation function: a = f(z)
Axon output                Output a

A Network of Neurons

Neurons are organized into layers:

Input Layer     Hidden Layers      Output Layer
    x₁  ─────→  [neuron]  ─────→
    x₂  ─────→  [neuron]  ─────→  [neuron] → ŷ
    x₃  ─────→  [neuron]  ─────→
               [neuron]
  • Input layer: raw features (pixel values, sensor readings, etc.)
  • Hidden layers: learn intermediate representations
  • Output layer: final prediction (class probabilities, regression value)

The "deep" in deep learning = many hidden layers.


4. The Neuron: Building Block

Mathematical Definition

Given inputs x = [x₁, x₂, ..., xₙ]:

Step 1 — Linear combination (weighted sum):
  z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
    = w·x + b          (dot product notation)

Step 2 — Activation:
  a = f(z)             (apply nonlinearity)

Where:

  • w = weight vector (learned parameters)
  • b = bias (learned scalar offset)
  • f = activation function

Why the Bias?

Without bias, the decision boundary must pass through the origin. With bias:

z = w·x + b

The bias shifts the activation threshold, letting the neuron learn when to fire regardless of input magnitude. Think of it as the neuron's "default activation level."

Concrete Example

A single neuron classifying whether a tumor is malignant:

Inputs: x₁ = tumor size (cm), x₂ = patient age (years)
Weights: w₁ = 0.8, w₂ = 0.3
Bias: b = -2.0

z = 0.8 × 3.5 + 0.3 × 45 + (-2.0)
  = 2.8 + 13.5 - 2.0
  = 14.3

a = sigmoid(14.3) ≈ 0.9999  → malignant (high probability)
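
The same computation in a few lines of NumPy — a minimal sketch using the values from the example above, with sigmoid written inline since no framework has been introduced yet:

import numpy as np

# Single-neuron forward pass with the values from the example above
x = np.array([3.5, 45.0])          # tumor size (cm), patient age (years)
w = np.array([0.8, 0.3])           # learned weights
b = -2.0                           # learned bias

z = np.dot(w, x) + b               # weighted sum → 14.3
a = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation → ≈ 0.9999
print(z, a)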


5. Activation Functions

Without activation functions, a stack of linear layers collapses to a single linear function. Nonlinearity is what gives neural networks their expressive power.

Sigmoid

σ(z) = 1 / (1 + e^(-z))

Range: (0, 1)
Use: binary classification output
Problem: vanishing gradients for large |z|
σ(z)
 1.0 |          ___________
 0.5 |        /
 0.0 |_______/
      -5   0   5     z

Tanh

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

Range: (-1, 1)
Use: hidden layers (historically), RNNs
Better than sigmoid: zero-centered
Problem: still has vanishing gradient

ReLU (Rectified Linear Unit)

ReLU(z) = max(0, z)

Range: [0, ∞)
Use: default for hidden layers in modern networks
Advantage: no vanishing gradient for z > 0, computationally fast
Problem: "dying ReLU" — neurons stuck at 0 if z always negative
ReLU(z)
  |          /
  |         /
  |        /
  |_______/
  -2  -1  0  1  2   z

Leaky ReLU

LeakyReLU(z) = z        if z > 0
              = 0.01·z   if z ≤ 0

Fixes dying ReLU by allowing small negative gradient

GELU (Gaussian Error Linear Unit)

GELU(z) ≈ z · σ(1.702·z)

Used in: Transformers (BERT, GPT)
Smooth approximation of ReLU

Softmax (Output Layer for Multi-class)

softmax(z)ᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)

Converts raw scores to probabilities that sum to 1
Use: multi-class classification output
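
A minimal NumPy sketch of the activations above; subtracting the max inside softmax is a standard numerical-stability trick, not part of the formula itself:

import numpy as np

def sigmoid(z):                    # range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):                       # max(0, z), element-wise
    return np.maximum(0.0, z)

def softmax(z):                    # exponentiate shifted scores, normalize to sum to 1
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0])
print(sigmoid(scores), relu(scores), softmax(scores), softmax(scores).sum())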

Which to Use?

Layer type             Recommended activation
──────────             ──────────────────────
Hidden (MLP/CNN)       ReLU or GELU
Output (binary)        Sigmoid
Output (multi-class)   Softmax
Output (regression)    None (linear)
RNN gates              Tanh + Sigmoid

6. The Multi-Layer Perceptron (MLP)

Architecture

An MLP with 2 hidden layers:

Input        Hidden 1      Hidden 2      Output
(3 neurons)  (4 neurons)   (4 neurons)   (2 neurons)

  x₁ ─┐
  x₂ ─┼──→ [h₁¹]          [h₁²]
  x₃ ─┘    [h₂¹] ──────→  [h₂²] ──────→ [o₁]
            [h₃¹]          [h₃²]          [o₂]
            [h₄¹]          [h₄²]

Matrix Form

For a layer with n inputs and m neurons:

Z = X @ W + b

X: input matrix    shape [batch_size, n_inputs]
W: weight matrix   shape [n_inputs, n_neurons]
b: bias vector     shape [n_neurons]
Z: output          shape [batch_size, n_neurons]

Then: A = f(Z)   (apply activation element-wise)

This single matrix multiplication computes all neurons in a layer at once — highly parallelizable on GPU.

Parameter Count

Layer (n_in → n_out):
  Weights: n_in × n_out
  Biases:  n_out
  Total:   n_in × n_out + n_out

Example MLP: 784 → 256 → 128 → 10
  Layer 1: 784×256 + 256 = 200,960
  Layer 2: 256×128 + 128 = 32,896
  Layer 3: 128×10  + 10  = 1,290
  Total:   235,146 parameters
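
The count above can be checked with a short sketch, assuming the same layer sizes:

# Parameter count for an MLP defined by its layer sizes
layers = [784, 256, 128, 10]
total = sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))
print(total)   # 235,146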

7. Forward Pass: Making a Prediction

The forward pass computes the output for a given input, layer by layer.

Step-by-Step (2-layer MLP)

# Pseudocode
def forward(x):
    # Layer 1
    z1 = x @ W1 + b1       # linear
    a1 = relu(z1)           # activation

    # Layer 2
    z2 = a1 @ W2 + b2      # linear
    a2 = relu(z2)           # activation

    # Output layer
    z3 = a2 @ W3 + b3      # linear
    output = softmax(z3)    # probabilities

    return output

What the Network Learns Layer by Layer

For image classification (MNIST digits):

Layer 1 (edges):    detects horizontal, vertical, diagonal edges
Layer 2 (shapes):   combines edges into curves, corners
Layer 3 (parts):    detects parts of digits (loops, lines)
Output (class):     combines parts into digit predictions

Each layer learns increasingly abstract representations of the input.


8. Loss Functions: Measuring Error

The loss (or cost) function measures how wrong the model's predictions are. Training = minimizing loss.

Mean Squared Error (MSE) — Regression

MSE = (1/N) Σᵢ (yᵢ - ŷᵢ)²

yᵢ  = true value
ŷᵢ  = predicted value
N   = number of samples

Good for: predicting continuous values (house price, temperature)

Binary Cross-Entropy — Binary Classification

BCE = -(1/N) Σᵢ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]

yᵢ ∈ {0, 1}       true label
ŷᵢ ∈ (0, 1)       predicted probability (after sigmoid)

Intuition: penalizes confident wrong predictions heavily
  Predicted 0.99 when truth is 0 → very high loss
  Predicted 0.5  when truth is 0 → moderate loss

Categorical Cross-Entropy — Multi-class Classification

CCE = -(1/N) Σᵢ Σₖ yᵢₖ log(ŷᵢₖ)

yᵢₖ = 1 if sample i belongs to class k, else 0  (one-hot)
ŷᵢₖ = predicted probability for class k (after softmax)

Most common loss for image classification

Why Cross-Entropy?

Cross-entropy comes from information theory. Minimizing it is equivalent to maximum likelihood estimation — finding parameters that make the observed data most probable. It penalizes the model proportionally to how surprised it was by the correct answer.
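
A NumPy sketch of MSE and binary cross-entropy as defined above; the epsilon clipping is only there to avoid log(0):

import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-7):
    y_hat = np.clip(y_hat, eps, 1 - eps)       # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# confident wrong predictions are penalized heavily
print(binary_cross_entropy(np.array([0.0]), np.array([0.99])))   # ≈ 4.6
print(binary_cross_entropy(np.array([0.0]), np.array([0.5])))    # ≈ 0.69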


9. Backpropagation: How Networks Learn

The Core Idea

Backpropagation = chain rule of calculus applied to compute gradients of the loss with respect to every weight in the network.

∂Loss/∂W = how much does the loss change when W changes slightly?

We want: decrease the loss
Strategy: move W in the direction that decreases Loss
Update:   W ← W - α · (∂Loss/∂W)

Chain Rule Foundation

For a composed function f(g(x)):

df/dx = (df/dg) · (dg/dx)

In a neural network, the loss flows through many composed functions:

Loss = L(softmax(W₃ · relu(W₂ · relu(W₁ · x + b₁) + b₂) + b₃))

Backprop applies chain rule backwards from output to input.

Computational Graph

Think of the network as a graph of operations:

x → [W₁ matmul] → z₁ → [relu] → a₁ → [W₂ matmul] → z₂ → [loss] → L

Forward pass: compute and cache every intermediate value. Backward pass: compute gradient at each node using cached values + chain rule.

Step-by-Step Backward Pass (1 layer example)

Forward:
  z = x·w + b
  a = relu(z)
  L = MSE(a, y)

Backward (chain rule):
  dL/da = 2(a - y)/N                    (MSE gradient)
  dL/dz = dL/da · d(relu)/dz           (relu gradient: 1 if z>0 else 0)
  dL/dw = dL/dz · x                    (matmul gradient)
  dL/db = dL/dz · 1                    (bias gradient)
  dL/dx = dL/dz · w                    (input gradient, for prev layer)
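
A scalar Python sketch of this backward pass, with a finite-difference check on dL/dw; the specific numbers are arbitrary illustration values:

x, w, b, y = 1.5, 0.4, 0.1, 2.0      # arbitrary scalars (N = 1)

# forward
z = x * w + b
a = max(0.0, z)                      # relu
L = (a - y) ** 2                     # squared error

# backward (chain rule, mirroring the steps above)
dL_da = 2 * (a - y)
dL_dz = dL_da * (1.0 if z > 0 else 0.0)
dL_dw = dL_dz * x
dL_db = dL_dz * 1.0

# finite-difference check on dL/dw
eps = 1e-6
L_plus = (max(0.0, x * (w + eps) + b) - y) ** 2
print(dL_dw, (L_plus - L) / eps)     # the two numbers should nearly match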

Automatic Differentiation (Autograd)

Modern frameworks (including tinygrad) implement autograd: they build the computational graph during the forward pass, then automatically compute all gradients during .backward().

You never implement backprop by hand — you just define the forward pass.

# tinygrad autograd example
from tinygrad.tensor import Tensor

x = Tensor([2.0], requires_grad=True)
w = Tensor([3.0], requires_grad=True)
b = Tensor([1.0], requires_grad=True)

z = x * w + b          # forward: builds graph
loss = z.pow(2).mean() # loss computation

loss.backward()         # backward: compute all gradients

print(w.grad)  # dL/dw computed automatically

Vanishing and Exploding Gradients

Vanishing gradients: in deep networks, gradients become extremely small as they propagate back through many layers.
  • Cause: sigmoid/tanh saturate (derivative ≈ 0)
  • Fix: ReLU, residual connections (ResNets), batch normalization

Exploding gradients: gradients grow exponentially as they propagate back.
  • Cause: large weights × many layers
  • Fix: gradient clipping, weight initialization schemes
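
As a concrete illustration of the exploding-gradient fix, a small sketch of gradient clipping by global norm (the max_norm value is an illustrative choice):

import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # scale all gradients down together if their combined norm exceeds max_norm
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
print(clip_by_global_norm(grads))                  # rescaled to norm 1.0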


10. Gradient Descent and Optimizers

Gradient Descent (Base Algorithm)

For each training step:
  1. Compute loss on a batch of data
  2. Compute gradients via backprop
  3. Update every parameter:
     W ← W - α · ∂Loss/∂W

α = learning rate (hyperparameter, typically 1e-4 to 1e-3)

Batch Variants

Variant               Batch size     Pros                                  Cons
───────               ──────────     ────                                  ────
Batch GD              Full dataset   Stable gradients                      Slow, memory-intensive
Stochastic GD (SGD)   1 sample       Fast updates                          Very noisy
Mini-batch GD         32–256         Balance of both; industry standard    —

SGD with Momentum

vₜ = β·vₜ₋₁ + (1-β)·∇L     (momentum term, β ≈ 0.9)
W ← W - α·vₜ

Intuition: gradient as a ball rolling downhill, accumulates velocity
Benefit: faster convergence, escapes shallow local minima

Adam (Adaptive Moment Estimation)

The most widely used optimizer:

mₜ = β₁·mₜ₋₁ + (1-β₁)·∇L          (1st moment: mean of gradients)
vₜ = β₂·vₜ₋₁ + (1-β₂)·(∇L)²       (2nd moment: variance of gradients)

m̂ₜ = mₜ/(1-β₁ᵗ)                   (bias correction)
v̂ₜ = vₜ/(1-β₂ᵗ)

W ← W - α · m̂ₜ / (√v̂ₜ + ε)

Defaults: β₁=0.9, β₂=0.999, ε=1e-8, α=1e-3

Adam adapts the learning rate per parameter based on historical gradients. Works well out of the box for most tasks.
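
A NumPy sketch of one Adam step using the defaults above; this mirrors the equations, not tinygrad's implementation (Project 4 below asks for a full version):

import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g              # 1st moment: running mean of gradients
    v = b2 * v + (1 - b2) * g * g          # 2nd moment: running mean of squared gradients
    m_hat = m / (1 - b1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([0.5, -0.3])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):
    g = 2 * w                              # gradient of the toy loss Σ w²
    w, m, v = adam_step(w, g, m, v, t)
print(w)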

Learning Rate Scheduling

Constant LR:       α = 0.001 throughout
Step decay:        α = α₀ × 0.1 every 10 epochs
Cosine annealing:  α follows a cosine curve
Warmup + decay:    Start low, increase, then decrease (Transformers)
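
Minimal sketches of the schedules above as plain functions of the epoch number (α₀ and the decay constants are illustrative):

import math

def step_decay(epoch, lr0=1e-3, drop=0.1, every=10):
    return lr0 * (drop ** (epoch // every))

def cosine_annealing(epoch, lr0=1e-3, total_epochs=100):
    return lr0 * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))

print(step_decay(25))          # 1e-5 after two drops
print(cosine_annealing(50))    # half of α₀ at the midpoint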

11. The Training Loop

Complete Training Loop

# Pseudocode (PyTorch-style API; the tinygrad loop in section 14 follows the same five steps)

model = MyNetwork()
optimizer = Adam(model.parameters(), lr=1e-3)
loss_fn = CrossEntropyLoss()

for epoch in range(num_epochs):
    # ── Training phase ──────────────────────────────
    model.train()
    for batch_x, batch_y in train_loader:

        # 1. Forward pass
        predictions = model(batch_x)

        # 2. Compute loss
        loss = loss_fn(predictions, batch_y)

        # 3. Zero gradients (clear previous step's gradients)
        optimizer.zero_grad()

        # 4. Backward pass (compute gradients)
        loss.backward()

        # 5. Update weights
        optimizer.step()

    # ── Validation phase ─────────────────────────────
    model.eval()
    val_loss, val_acc = evaluate(model, val_loader)
    print(f"Epoch {epoch}: loss={val_loss:.4f}, acc={val_acc:.2%}")

Key Concepts in the Loop

Epoch: one complete pass through the entire training dataset.

Batch: a subset of the dataset processed together (e.g., 32 images). Batching enables:
  • GPU parallelism
  • Gradient averaging (more stable than single-sample updates)
  • Fitting large datasets in memory

Overfitting: model memorizes training data, fails on new data.

Training loss:   ↓ ↓ ↓ ↓ ↓ 0.01
Validation loss: ↓ ↓ ↑ ↑ ↑ 0.45   ← overfitting after this point

Underfitting: model too simple, fails on both training and validation.

Training loss:   0.40 (stays high)
Validation loss: 0.42 (also high)

Hyperparameters vs Parameters

Parameters (learned by training):
  - Weights W
  - Biases b

Hyperparameters (set by you before training):
  - Learning rate α
  - Batch size
  - Number of layers
  - Number of neurons per layer
  - Number of epochs
  - Dropout rate
  - Weight decay

12. Convolutional Neural Networks (CNNs)

Why Not Just Use MLP for Images?

A 224×224 RGB image = 224×224×3 = 150,528 inputs. MLP first hidden layer with 1024 neurons = 150,528 × 1024 = 154 million parameters for one layer.

Problems:
  • No spatial structure captured (neighboring pixels matter together)
  • No weight sharing (the same edge detector must be relearned at every location)
  • Too many parameters → overfitting

The Convolution Operation

A filter (kernel) slides over the input, computing a weighted sum at each position:

Input (5×5):          Filter (3×3):      Output (3×3):
 1  2  3  4  5         1  0  1            ?  ?  ?
 5  6  7  8  9         0  1  0            ?  ?  ?
 1  2  3  4  5         1  0  1            ?  ?  ?
 5  6  7  8  9
 1  2  3  4  5

At top-left position:
  (1×1)+(2×0)+(3×1) + (5×0)+(6×1)+(7×0) + (1×1)+(2×0)+(3×1)
  = 1+3+6+1+3 = 14

The filter is a learned pattern detector. The network learns filters that detect edges, textures, shapes automatically.
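
A NumPy sketch of the sliding-window operation above (technically cross-correlation, which is what CNN layers compute), reproducing the worked top-left value:

import numpy as np

inp = np.array([[1, 2, 3, 4, 5],
                [5, 6, 7, 8, 9],
                [1, 2, 3, 4, 5],
                [5, 6, 7, 8, 9],
                [1, 2, 3, 4, 5]], dtype=float)

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)

out = np.zeros((3, 3))
for i in range(3):                         # slide the 3×3 filter over the 5×5 input
    for j in range(3):
        out[i, j] = np.sum(inp[i:i+3, j:j+3] * kernel)

print(out[0, 0])   # 14.0, matching the worked position above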

CNN Architecture Components

Input Image
Conv Layer (learns feature maps)
Activation (ReLU)
Pooling (downsample, reduce size)
... (repeat)
Flatten
FC (Fully Connected) Layers
Output (Softmax)

Pooling

Max pooling reduces spatial dimensions, retaining the most prominent features:

Input (4×4):       Max Pool 2×2:
 1  3  2  4
 5  6  7  8    →     6  8
 3  1  4  2          8  6
 7  8  5  6
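
The same pooling in NumPy — a sketch using the reshape trick for non-overlapping 2×2 windows:

import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 1, 4, 2],
              [7, 8, 5, 6]], dtype=float)

# split into 2×2 blocks, then take the max of each block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6. 8.]
                #  [8. 6.]]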

CNN vs MLP for Images

MLP:
  - Every pixel connects to every neuron
  - No spatial structure preserved
  - Millions of redundant parameters

CNN:
  - Local connectivity (filter slides over image)
  - Weight sharing (same filter at all positions)
  - Translation invariant (cat in corner = cat in center)
  - Far fewer parameters

Classic CNN Architectures

Model          Year   Layers   Params   Top-5 Acc
─────          ────   ──────   ──────   ─────────
LeNet-5        1998   7        60K      —
AlexNet        2012   8        61M      84.6%
VGG-16         2014   16       138M     92.7%
ResNet-50      2015   50       25M      95.3%
MobileNetV2    2018   53       3.4M     93.4%
EfficientNet   2019   varies   varies   97.1%

MobileNet and EfficientNet are designed for edge deployment — accuracy vs. compute tradeoff.


13. Regularization and Generalization

The Bias-Variance Tradeoff

Total Error = Bias² + Variance + Irreducible Noise

High Bias (underfitting):   model too simple, misses patterns
High Variance (overfitting): model too complex, memorizes noise

Dropout

Randomly zero out neurons during training:

# During training:
  For each neuron, with probability p, set output to 0
  Scale remaining neurons by 1/(1-p)

# During inference:
  No dropout (all neurons active)

Effect: Forces network to learn redundant representations
        Acts as ensemble of many sub-networks
Typical p: 0.2–0.5
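
A sketch of inverted dropout as described above; the 1/(1-p) scaling keeps the expected activation unchanged between training and inference:

import numpy as np

def dropout(a, p=0.5, training=True):
    if not training:
        return a                                        # inference: all neurons active
    mask = (np.random.rand(*a.shape) >= p).astype(a.dtype)
    return a * mask / (1.0 - p)                         # zero out, then rescale

a = np.ones((2, 4))
print(dropout(a, p=0.5))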

Weight Decay (L2 Regularization)

Add a penalty to the loss for large weights:

L_total = L_task + λ · Σ w²

λ = regularization strength (e.g., 1e-4)

Effect: Keeps weights small, reduces overfitting
Equivalent to Gaussian prior on weights (Bayesian view)

Batch Normalization

Normalize the inputs to each layer:

For a mini-batch of activations z:
  μ = mean(z)
  σ² = var(z)
  z_norm = (z - μ) / √(σ² + ε)
  output = γ · z_norm + β        (γ, β are learned)

Benefits:
  - Faster training (higher learning rates)
  - Less sensitive to weight initialization
  - Mild regularization effect
  - Reduces internal covariate shift
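
A training-time sketch of the normalization above for one batch of activations (γ and β initialized to 1 and 0; the running statistics used at inference are omitted):

import numpy as np

def batchnorm(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                      # per-feature mean over the batch
    var = z.var(axis=0)                      # per-feature variance over the batch
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta             # γ, β are learned parameters

z = np.random.randn(32, 8) * 3.0 + 5.0       # batch of 32 samples, 8 features
out = batchnorm(z, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ≈ 0 and ≈ 1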

Early Stopping

Monitor validation loss during training
Stop training when validation loss stops improving

Epoch: 1  train_loss=2.3  val_loss=2.3
Epoch: 5  train_loss=1.2  val_loss=1.3
Epoch:10  train_loss=0.5  val_loss=0.8   ← save checkpoint here
Epoch:15  train_loss=0.2  val_loss=1.1   ← overfitting, stop
Epoch:20  train_loss=0.1  val_loss=1.4

Data Augmentation

Artificially expand dataset by transforming existing samples:

Image augmentations:
  - Horizontal/vertical flip
  - Random crop and resize
  - Color jitter (brightness, contrast, saturation)
  - Random rotation
  - Gaussian noise

Effect: Model sees more variety → better generalization
Edge AI benefit: reduces need for large datasets

14. Hands-On with tinygrad

Why tinygrad?

tinygrad is a minimal ML framework (~1000 lines of core code) that exposes every fundamental operation clearly. Unlike PyTorch or TensorFlow:

  • No magic — you can read and understand the entire source
  • Teaches you exactly what happens inside a neural network
  • Compiles to CPU, GPU, CUDA, Metal, WebGPU
  • Used in production at comma.ai (autonomous driving)

tinygrad Core Concepts

from tinygrad.tensor import Tensor
import numpy as np

# Tensor creation
x = Tensor([[1.0, 2.0, 3.0]])           # shape [1, 3]
w = Tensor.kaiming_uniform(3, 4)        # shape [3, 4]
w.requires_grad = True                  # needed so backward() populates w.grad

# Operations (lazy by default)
z = x.matmul(w)                         # shape [1, 4]
a = z.relu()

# Realize (execute computation)
result = a.numpy()                       # triggers actual computation

# Gradient computation
loss = a.sum()
loss.backward()
print(w.grad.numpy())                   # ∂loss/∂w

Implementing a Neuron from Scratch

from tinygrad.tensor import Tensor

class Neuron:
    def __init__(self, n_inputs):
        # Initialize weights with small random values
        self.w = Tensor.randn(n_inputs, 1) * 0.01
        self.b = Tensor.zeros(1)

    def __call__(self, x):
        z = x.matmul(self.w) + self.b
        return z.relu()

    def parameters(self):
        return [self.w, self.b]

# Test
neuron = Neuron(3)
x = Tensor([[0.5, 1.2, -0.3]])
output = neuron(x)
print(output.numpy())  # shape [1, 1]

Implementing a Layer

class Linear:
    def __init__(self, n_in, n_out):
        # Kaiming initialization for ReLU
        self.w = Tensor.kaiming_uniform(n_in, n_out)
        self.b = Tensor.zeros(n_out)

    def __call__(self, x):
        return x.matmul(self.w) + self.b

    def parameters(self):
        return [self.w, self.b]

Implementing an MLP

class MLP:
    def __init__(self, layers):
        # layers = [784, 256, 128, 10]
        self.linears = [
            Linear(layers[i], layers[i+1])
            for i in range(len(layers)-1)
        ]

    def __call__(self, x):
        for i, layer in enumerate(self.linears[:-1]):
            x = layer(x).relu()          # hidden layers: ReLU
        x = self.linears[-1](x)          # output layer: no activation
        return x.softmax()               # softmax for probabilities

    def parameters(self):
        params = []
        for layer in self.linears:
            params.extend(layer.parameters())
        return params

Training on MNIST

from tinygrad.tensor import Tensor
from tinygrad.nn.optim import Adam
import numpy as np

# Load MNIST (use fetch from tinygrad)
from tinygrad.helpers import fetch
import gzip

def load_mnist():
    base = "https://storage.googleapis.com/cvdf-datasets/mnist/"
    X_train = np.frombuffer(gzip.open(fetch(base+"train-images-idx3-ubyte.gz")).read(), np.uint8, offset=16).reshape(-1, 784)
    Y_train = np.frombuffer(gzip.open(fetch(base+"train-labels-idx1-ubyte.gz")).read(), np.uint8, offset=8)
    X_test  = np.frombuffer(gzip.open(fetch(base+"t10k-images-idx3-ubyte.gz")).read(), np.uint8, offset=16).reshape(-1, 784)
    Y_test  = np.frombuffer(gzip.open(fetch(base+"t10k-labels-idx1-ubyte.gz")).read(), np.uint8, offset=8)
    return X_train/255.0, Y_train, X_test/255.0, Y_test

X_train, Y_train, X_test, Y_test = load_mnist()

# Define model
model = MLP([784, 256, 128, 10])
optimizer = Adam(model.parameters(), lr=1e-3)

# Training loop
Tensor.training = True   # some tinygrad versions require this flag before optimizer.step()
BATCH = 64
EPOCHS = 10

for epoch in range(EPOCHS):
    # Shuffle
    idx = np.random.permutation(len(X_train))
    X_train, Y_train = X_train[idx], Y_train[idx]

    total_loss = 0
    for i in range(0, len(X_train), BATCH):
        xb = Tensor(X_train[i:i+BATCH].astype(np.float32))
        yb = Y_train[i:i+BATCH]

        # Forward pass
        out = model(xb)

        # Cross-entropy loss
        # One-hot encode labels
        yb_onehot = np.zeros((len(yb), 10), dtype=np.float32)
        yb_onehot[np.arange(len(yb)), yb] = 1.0
        yb_t = Tensor(yb_onehot)

        loss = -(yb_t * out.log()).sum(axis=1).mean()

        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.numpy()

    # Validation
    test_out = model(Tensor(X_test.astype(np.float32)))
    preds = test_out.numpy().argmax(axis=1)
    acc = (preds == Y_test).mean()
    print(f"Epoch {epoch+1}: loss={total_loss/(len(X_train)//BATCH):.4f}, test_acc={acc:.2%}")

Understanding tinygrad's Autograd

# tinygrad builds a computation graph lazily
# Every operation on a Tensor records itself

x = Tensor([2.0], requires_grad=True)
y = x * x       # records: mul(x, x)
z = y + x       # records: add(y, x)
loss = z.sum()

# .backward() traverses graph in reverse (topological sort)
# applying chain rule at each node
loss.backward()

print(x.grad)   # d(loss)/d(x) = d(x²+x)/dx = 2x+1 = 5.0

Implementing a CNN in tinygrad

from tinygrad.tensor import Tensor
from tinygrad.nn import Conv2d, BatchNorm2d
from tinygrad.nn.optim import Adam

class SimpleCNN:
    def __init__(self):
        # Conv layers
        self.c1 = Conv2d(1, 32, 3, padding=1)   # 1 channel in, 32 out, 3×3 kernel
        self.c2 = Conv2d(32, 64, 3, padding=1)

        # Fully connected
        self.fc1 = Linear(64 * 7 * 7, 128)
        self.fc2 = Linear(128, 10)

    def __call__(self, x):
        # x shape: [batch, 1, 28, 28]
        x = self.c1(x).relu().max_pool2d()       # → [batch, 32, 14, 14]
        x = self.c2(x).relu().max_pool2d()       # → [batch, 64, 7, 7]
        x = x.reshape(x.shape[0], -1)            # flatten → [batch, 3136]
        x = self.fc1(x).relu()
        x = self.fc2(x)
        return x.softmax()

    def parameters(self):
        params = []
        for layer in [self.c1, self.c2, self.fc1, self.fc2]:
            params.extend(layer.parameters() if hasattr(layer, 'parameters') else [layer.weight, layer.bias])
        return params

Exploring tinygrad Internals

Study these files in the tinygrad source to deeply understand how ML works:

tinygrad/
  tensor.py          ← Tensor class, all ops, autograd engine
  lazybuffer.py      ← Lazy evaluation / computation graph
  ops.py             ← All primitive operations
  nn/
    optim.py         ← SGD, Adam, AdaGrad implementations
  runtime/
    ops_cpu.py       ← How ops execute on CPU
    ops_gpu.py       ← How ops execute on GPU

Reading tensor.py is one of the best ways to understand how autograd and deep learning frameworks work internally.


15. Projects

Project 1: Neuron from Scratch (No Framework)

Implement a single neuron in pure Python/NumPy. Manually compute gradients and verify with finite differences.

Goal: understand forward pass, loss, gradient by hand
Dataset: XOR problem (4 samples)
Deliverable: working neuron with manual backprop

Project 2: MLP for MNIST with tinygrad

Train a fully connected network on MNIST digit classification.

Goal: achieve >97% test accuracy
Architecture: 784 → 256 → 128 → 10
Deliverable: training script + loss/accuracy curves

Project 3: CNN for MNIST/CIFAR-10 with tinygrad

Replace MLP with a convolutional network.

Goal: understand convolution, pooling, feature maps
Dataset: MNIST (>99%) or CIFAR-10 (>85%)
Deliverable: CNN training script, visualize learned filters

Project 4: Implement Adam from Scratch

Implement the Adam optimizer manually in tinygrad (without using tinygrad's built-in Adam).

Goal: understand optimizer internals
Deliverable: custom Adam class matching tinygrad's results

Project 5: Read tinygrad's Backprop

Trace through tinygrad source code for a simple operation (e.g., x * x) and document exactly what happens during backward().

Goal: understand autograd mechanics
Deliverable: annotated source code walkthrough
Files: tensor.py, ops.py

Project 6: MNIST on Jetson (Edge Deployment)

Train on desktop, export model weights, run inference on Jetson Nano.

Goal: complete edge AI pipeline
Steps:
  1. Train CNN on desktop
  2. Export weights as numpy arrays
  3. Load weights in tinygrad on Jetson
  4. Measure inference latency: CPU vs GPU
  5. Quantize to INT8, compare accuracy/speed

16. Resources

Foundational Theory

  • 3Blue1Brown — Neural Networks (YouTube series): Best visual introduction to how neural networks work. Watch all 4 episodes before writing code.
  • "But what is a neural network?"
  • "Gradient descent, how neural networks learn"
  • "What is backpropagation really doing?"
  • "Backpropagation calculus"

  • Andrej Karpathy — micrograd (GitHub + YouTube): Build autograd from scratch in ~150 lines. Essential for understanding backprop deeply. Then compare with tinygrad's implementation.

  • https://github.com/karpathy/micrograd

  • CS231n: Convolutional Neural Networks for Visual Recognition (Stanford): The definitive CNN course. Lecture notes are excellent even without watching videos.

  • https://cs231n.github.io/

  • The Deep Learning Book (Goodfellow, Bengio, Courville): Free online. Chapters 6-9 cover MLP, backprop, regularization, CNNs rigorously.

  • https://www.deeplearningbook.org/

tinygrad-Specific

  • tinygrad source code: The best tinygrad documentation is the source itself.
  • tensor.py for ops and autograd
  • examples/ for training scripts
  • test/ for understanding expected behavior

  • tinygrad MNIST example: tinygrad/examples/mnist.py — canonical starting point

  • tinygrad CNN example: tinygrad/examples/efficientnet.py

Edge AI Context

  • TinyML book (Pete Warden, Daniel Situnayake): Running ML on microcontrollers. Chapters 1-3 give strong context for why edge AI design choices matter.

  • AI at the Edge (Daniel Situnayake, Jenny Plunkett): End-to-end edge AI system design.

Math Prerequisites

  • Linear Algebra: 3Blue1Brown "Essence of Linear Algebra" — especially matrices as transformations
  • Calculus: Chain rule, partial derivatives — Khan Academy calculus or 3Blue1Brown "Essence of Calculus"
  • Statistics: Probability, distributions — needed for loss functions and regularization

Practice Datasets

Dataset         Task                      Samples   Input size   Baseline accuracy
───────         ────                      ───────   ──────────   ─────────────────
MNIST           Digit classification      70,000    28×28×1      99.7%
Fashion-MNIST   Clothing classification   70,000    28×28×1      94%
CIFAR-10        Object classification     60,000    32×32×3      95%+
Iris            Flower classification     150       4 features   97%

Quick Reference: Key Equations

Neuron forward:
  z = Wx + b
  a = f(z)

MSE Loss:
  L = (1/N) Σ(y - ŷ)²

Cross-entropy Loss:
  L = -(1/N) Σ y·log(ŷ)

Gradient descent:
  W ← W - α·∂L/∂W

Adam update:
  m = β₁m + (1-β₁)g
  v = β₂v + (1-β₂)g²
  W ← W - α·m̂/√(v̂+ε)

Convolution output size:
  out = floor((in + 2p - k) / s) + 1
  in=input, p=padding, k=kernel, s=stride
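
A quick sketch of the output-size formula, checked against shapes used earlier in this guide:

def conv_out(in_size, k, p=0, s=1):
    # out = floor((in + 2p - k) / s) + 1
    return (in_size + 2 * p - k) // s + 1

print(conv_out(5, k=3))            # 3  → the 5×5 input / 3×3 filter example
print(conv_out(28, k=3, p=1))      # 28 → "same" padding keeps the MNIST size
print(conv_out(28, k=2, s=2))      # 14 → 2×2 max pooling halves the size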

Deep Dive: Subfolders

PyTorch + micrograd → tinygrad

Before diving deep into tinygrad internals or porting models, you need a rock-solid foundation in how autograd actually works. This subfolder provides exactly that:

  • Part 1 — micrograd: Reimplement Karpathy's scalar autograd engine from scratch. Every +, *, tanh, exp carries a _backward closure. You'll see chain rule as running code, not math notation.
  • Part 2 — PyTorch: Aligned with the OpenCV PyTorch Bootcamp curriculum. Tensors, autograd, nn.Module, CNNs, transfer learning, object detection heads — with the same mental model as micrograd.
  • Part 3 — Bridge to tinygrad: Side-by-side comparison of the same MLP in all three frameworks. What PyTorch hides (lazy evaluation, kernel fusion) that tinygrad exposes via DEBUG=4.

Why this matters for tinygrad mastery:

micrograd  →  you understand every gradient operation
PyTorch    →  you understand the production API and patterns
tinygrad   →  you see what frameworks hide, control the scheduler


Next (Phase 4 Track B): ML and AI — quantization, pruning, TensorRT, TFLite