Digital Design Fundamentals¶
The circuits inside every AI accelerator — tensor cores, SRAM banks, systolic arrays — are built from the same primitives covered here. This guide takes you from bits to memory hierarchies, with deliberate attention to the representations and building blocks that show up every day in hardware design.
Before diving into the technical building blocks, let's understand the big picture: how a chip idea becomes a physical piece of silicon.
0. From Design to Silicon — How Chips Are Made¶
0.1 The Tapeout Flow¶
"Tapeout" is the moment a chip design team hands off the final design files to a foundry (TSMC, Samsung, Intel Foundry, GlobalFoundries) for manufacturing. The name comes from the old days when designs were literally written to magnetic tape. Today it means delivering a set of verified GDSII/OASIS files — the geometric description of every transistor, wire, and via on the chip.
The full journey from idea to working silicon:
┌─────────────────────────────────────────────────────────────────────────┐
│ CHIP DESIGN FLOW │
│ │
│ Specification │
│ │ │
│ ▼ │
│ Architecture Design "What does the chip do?" │
│ │ ISA, block diagram, memory map │
│ ▼ │
│ RTL Design (Verilog/VHDL) "Describe behavior in HDL" │
│ │ modules, FSMs, datapaths │
│ ▼ │
│ Functional Verification "Does the RTL match the spec?" │
│ │ testbenches, UVM, formal verification │
│ ▼ │
│ Logic Synthesis "Convert RTL to gates" │
│ │ tool: Synopsys Design Compiler │
│ │ input: RTL + standard cell library │
│ │ output: gate-level netlist │
│ ▼ │
│ Place and Route (PnR) "Put gates on chip, connect wires" │
│ │ tool: Cadence Innovus / Synopsys ICC2 │
│ │ floorplan → placement → CTS → routing │
│ ▼ │
│ Sign-off Checks "Is it manufacturable and correct?" │
│ │ ├── Timing (STA) static timing analysis, all corners │
│ │ ├── DRC design rule check (min spacing, width) │
│ │ ├── LVS layout vs schematic (geometry = netlist?) │
│ │ ├── ERC electrical rule check │
│ │ └── IR drop / EM power integrity, electromigration │
│ ▼ │
│ ★ TAPEOUT ★ "Ship GDSII to foundry" │
│ │ │
│ ▼ │
│ Fabrication (TSMC etc.) 12–18 weeks for advanced nodes │
│ │ │
│ ▼ │
│ Packaging & Test die → package → burn-in → final test │
│ │ │
│ ▼ │
│ Silicon Bring-up first boot, characterization │
└─────────────────────────────────────────────────────────────────────────┘
0.2 RTL to Gates: Logic Synthesis¶
You write hardware in Verilog or VHDL — a behavioral description of what each register, MUX, and ALU does. The synthesis tool (Synopsys Design Compiler is the industry standard) maps this to actual logic gates from a standard cell library provided by the foundry.
Your RTL (Verilog): Synthesized gates:
assign y = a & b | c; → NAND2 → INV → NOR2 → ...
(mapped to TSMC N5 cell library)
always @(posedge clk)
if (en) q <= d; → DFF with enable (EDFCND2)
A standard cell library contains hundreds of pre-designed, pre-characterized cells:
| Cell type | Examples |
|---|---|
| Logic gates | INV, NAND2, NAND3, NOR2, AOI22, MUX2 |
| Sequential | DFF, DFF with reset, DFF with enable |
| Buffers/drivers | BUF, CLKBUF, CLKINV |
| Special | Tie-high, tie-low, filler cells |
Each cell comes with: - Layout (geometric shapes for each process layer) - Timing model (.lib) — delay, setup, hold at various voltage/temperature corners - Power model — leakage + switching power - Parasitic model — capacitance, resistance
The synthesis tool optimizes for your constraints — typically timing (target clock frequency), area, and power — making trade-offs like using faster (larger) cells on critical paths and slower (smaller) cells elsewhere.
0.3 Place and Route¶
After synthesis, you have a netlist: a list of ~billion gates and their connections. Place and route turns this into a physical layout.
Step 1: Floorplanning
┌──────────────────────────────────┐
│ ┌──────┐ ┌────────┐ ┌──────┐ │
│ │ SRAM │ │Compute │ │ SRAM │ │
│ │Block │ │ Core │ │Block │ │
│ └──────┘ └────────┘ └──────┘ │
│ ┌────────────────────────────┐ │
│ │ I/O ring │ │
│ └────────────────────────────┘ │
└──────────────────────────────────┘
Decide where major blocks go, power grid, pin placement
Step 2: Placement
Standard cells placed in rows between power rails (VDD/VSS)
┌─VDD─────────────────────────────┐
│ [NAND][INV][DFF][BUF][MUX]... │
├─VSS─────────────────────────────┤
│ [NOR][AOI][DFF][INV][NAND]... │
├─VDD─────────────────────────────┤
│ ... │
└─────────────────────────────────┘
Step 3: Clock Tree Synthesis (CTS)
Build a balanced tree of clock buffers so CLK arrives
at every flip-flop within ~50ps skew
Step 4: Routing
Connect all signals using metal layers (M1–M15+ on advanced nodes)
Lower metals: short local wires
Upper metals: long global wires, power distribution
0.4 Fabrication at TSMC¶
Once the GDSII file arrives at the foundry, it goes through photolithography — printing the design layer by layer onto a silicon wafer.
Wafer cross-section (simplified, TSMC N5):
══════════ Metal 15 (thick, global power/signal)
... (13 more metal layers)
────────── Metal 2
────────── Metal 1 (thin, local connections)
────────── Contact layer
▓▓▓▓▓▓▓▓▓ Gate (FinFET or GAA nanosheet at N3/N2)
░░░░░░░░░ Fin / channel
▒▒▒▒▒▒▒▒▒ Silicon substrate (300mm wafer)
Each layer requires:
1. Deposit material (metal, oxide, etc.)
2. Spin photoresist
3. Expose with EUV light through reticle (mask)
4. Develop (remove exposed/unexposed resist)
5. Etch pattern into material
6. Strip remaining resist
7. Repeat for next layer (~80-100 mask steps for N5)
Process node comparison (as of 2025–2026):
| Node | Foundry | Transistor type | Density (MTr/mm²) | Used in |
|---|---|---|---|---|
| N7 | TSMC | FinFET | ~91 | A100, AMD Zen 2 |
| N5 | TSMC | FinFET | ~171 | H100, Apple M2, Zen 4 |
| N4P | TSMC | FinFET | ~180 | B100/B200 |
| N3E | TSMC | FinFET | ~208 | Apple M4, Vera (2026) |
| N2 | TSMC (2025) | GAA nanosheet | ~250+ | Next-gen AI chips |
| 20A/18A | Intel | RibbonFET (GAA) | ~200+ | Intel next-gen |
| 2nm | Samsung | GAA | ~200+ | Foundry customers |
FinFET vs GAA (Gate-All-Around):
FinFET (N7–N3): GAA Nanosheet (N2 and below):
Gate Gate (wraps all sides)
┌───┐ ┌─────────────┐
│ │ │ ┌─────────┐ │
│ F │ ← gate wraps │ │nanosheet│ │ ← gate wraps
│ I │ 3 sides │ └─────────┘ │ ALL 4 sides
│ N │ │ ┌─────────┐ │
│ │ │ │nanosheet│ │ (stacked sheets)
└─┬─┘ │ └─────────┘ │
│ └──────┬──────┘
─────┴───── ───────┴───────
substrate substrate
GAA advantage: better electrostatic control → less leakage,
higher drive current, better scaling below 3nm
0.5 Packaging¶
The raw die is cut from the wafer and placed into a package that provides: - Electrical connections (power, ground, I/O signals) - Mechanical protection - Heat dissipation path
Modern AI chip packaging (CoWoS — Chip on Wafer on Substrate):
┌─── Heat spreader / lid ───┐
│ │
┌────┴──────┬───────┬──────┬─────┴───┐
│ HBM │ GPU │ HBM │ HBM │ ← dies on interposer
│ stack │ die │ stack│ stack │
└───┬────────┴───┬───┴──────┴────┬────┘
│ Silicon interposer │ ← passive routing layer
┌───┴────────────┴───────────────┴────┐
│ Organic substrate │ ← BGA ball grid below
└─────────────────────────────────────┘
│││││
solder balls → PCB
CoWoS enables:
- GPU die + HBM stacks on same interposer (short, wide buses)
- 1024-bit HBM interface (impossible with PCB traces)
- H100: ~814mm² GPU die + 6× HBM3 stacks on ~2× larger interposer
Packaging technologies comparison:
| Package | Description | Used in |
|---|---|---|
| Wire bond | Gold wires from die pads to package leads | Low-cost, legacy chips |
| Flip-chip | Solder bumps, die face-down | CPUs, GPUs (standard) |
| 2.5D (CoWoS) | Multiple dies on silicon interposer | AI GPUs (H100, MI300X) |
| 3D stacking | Dies stacked vertically with TSVs | HBM, AMD 3D V-Cache |
| Chiplets | Multiple small dies in one package | AMD EPYC, Intel Ponte Vecchio |
0.6 Testing and Yield¶
Not every die on a wafer works — defects from particles, lithography errors, and process variation kill some transistors.
300mm wafer with ~100 GPU dies:
┌─────────────────────────────┐
│ ○ ○ ○ ○ ○ ○ ○ ○ ○ │
│ ○ ● ○ ○ ● ○ ○ ○ ○ ○ │ ○ = good die
│ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ │ ● = defective die
│ ○ ○ ○ ● ○ ○ ○ ○ ○ ○ │
│ ○ ○ ○ ○ ○ ○ ● ○ ○ ○ ○ │ Edge dies = partial/wasted
│ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ │
│ ○ ○ ○ ○ ○ ○ ○ ○ ○ │ Yield = good dies / total dies
└─────────────────────────────┘
H100 die: ~814mm² → ~60 dies per 300mm wafer
At 80% yield → ~48 good dies per wafer
Wafer cost at N5: ~$16,000–$20,000
Cost per good die: ~$330–$415 (before packaging, test, profit margin)
Defect tolerance — binning and salvage:
NVIDIA harvests partially defective dies by disabling broken SMs: - H100 SXM: 132 SMs designed, 114 enabled (disable 18 defective ones) - This dramatically improves effective yield
Test flow:
1. Wafer sort (probe test) — test each die on wafer, mark bad dies
2. Die singulation — cut wafer into individual dies
3. Packaging — bond good dies into packages
4. Final test (ATE) — automated test at speed, burn-in
5. Binning — sort by max frequency / power tier
(e.g., RTX 4090 vs 4080 = same die, different bin)
0.7 Timeline and Cost¶
A modern AI chip from architecture to production:
| Phase | Duration | Key deliverable |
|---|---|---|
| Architecture & spec | 3–6 months | Microarchitecture document |
| RTL design | 12–18 months | Verilog/VHDL codebase |
| Verification | Overlaps RTL | 60–70% of total effort |
| Synthesis + PnR | 3–6 months | GDSII layout |
| Tapeout → first silicon | 3–4 months | Engineering samples (ES) |
| Validation & bring-up | 3–6 months | Working chips |
| Total | ~2–3 years | Production silicon |
Cost to design a chip at N5: $300M–$500M+ (including EDA tools, IP licenses, masks, engineering team). This is why only a handful of companies (NVIDIA, AMD, Apple, Google, etc.) design leading-edge chips — the economics require massive volume to amortize the NRE (Non-Recurring Engineering) cost.
Why this matters for AI hardware engineers: understanding the tapeout flow tells you why design decisions are made the way they are. When an architect chooses a simpler pipeline over a faster but more complex one, it's often because verification effort scales superlinearly with complexity. When NVIDIA uses the same die for multiple GPU tiers (binning), it's yield economics. When HBM is on an interposer instead of on-package DRAM, it's because the 1024-bit bus is physically impossible with PCB routing. Every section in this guide — from Boolean algebra through MIPS pipelines — is a building block in this flow.
1. Number Systems¶
1.1 Binary, Hexadecimal, and Positional Notation¶
Every number in a digital system is a pattern of bits. Positional notation makes the value of each bit depend on its position:
Decimal 173 = 1×10² + 7×10¹ + 3×10⁰
Binary 10101101 = 1×2⁷ + 0×2⁶ + 1×2⁵ + 0×2⁴ + 1×2³ + 1×2² + 0×2¹ + 1×2⁰
= 128 + 32 + 8 + 4 + 1 = 173
Hex 0xAD = 10×16¹ + 13×16⁰ = 160 + 13 = 173
Hex is used everywhere in hardware because one hex digit maps exactly to four bits — easier to read, easier to type.
Conversion drill (do these by hand until they feel automatic):
| Decimal | Binary | Hex |
|---|---|---|
| 0 | 0000 0000 | 0x00 |
| 127 | 0111 1111 | 0x7F |
| 128 | 1000 0000 | 0x80 |
| 255 | 1111 1111 | 0xFF |
| 256 | 0001 0000 0000 | 0x100 |
1.2 Signed Integers: Two's Complement¶
Three ways to represent negative numbers have been tried in hardware. Only one survived.
| Scheme | +5 | −5 | Problem |
|---|---|---|---|
| Sign-magnitude | 0000 0101 | 1000 0101 | Two zeros (+0, −0); complex adder |
| One's complement | 0000 0101 | 1111 1010 | Two zeros; end-around carry |
| Two's complement | 0000 0101 | 1111 1011 | One zero; same adder for add/sub |
Two's complement rule: flip all bits, add 1.
Verify by addition — a correct negation adds to zero:
Range for N-bit two's complement: −2^(N−1) to +2^(N−1) − 1. For 8-bit: −128 to +127. For 32-bit: −2,147,483,648 to +2,147,483,647.
Sign extension — when widening a value (e.g., INT8 → INT32), copy the sign bit into all new high bits:
INT8 −5 = 1111 1011
INT32 −5 = 1111 1111 1111 1111 1111 1111 1111 1011
↑ all new bits filled with sign bit (1)
AI hardware connection: matrix multiply accumulate units (MMA/WMMA) operate on INT8 with INT32 accumulators. The sign extension from INT8 to INT32 is automatic in hardware, but matters when you mix signed/unsigned operands.
1.3 IEEE 754 Floating-Point¶
Floating-point trades the fixed range of integers for a sliding window of precision.
| Format | Width | Sign | Exponent | Mantissa | Bias | ~Range | ~Precision |
|---|---|---|---|---|---|---|---|
| FP32 | 32b | 1 | 8 | 23 | 127 | ±3.4×10^38 | ~7 digits |
| FP64 | 64b | 1 | 11 | 52 | 1023 | ±1.8×10^308 | ~15 digits |
| FP16 | 16b | 1 | 5 | 10 | 15 | ±65504 | ~3 digits |
| BF16 | 16b | 1 | 8 | 7 | 127 | same as FP32 | ~2 digits |
| FP8 E4M3 | 8b | 1 | 4 | 3 | 7 | ±448 | ~1 digit |
| FP8 E5M2 | 8b | 1 | 5 | 2 | 15 | ±57344 | ~0.5 digit |
FP32 example — representing 0.15625:
0.15625 = 0.00101 in binary = 1.01 × 2^(−3)
S = 0
Exponent field = −3 + 127 = 124 = 0111 1100
Mantissa = 01 followed by 21 zeros
Bit pattern: 0 01111100 01000000000000000000000
S exponent mantissa
Special values (every hardware engineer must know these):
| Pattern | Meaning |
|---|---|
| Exp=0, Mantissa=0 | ±Zero |
| Exp=0, Mantissa≠0 | Subnormal |
| Exp=255, Mantissa=0 | ±Infinity |
| Exp=255, Mantissa≠0 | NaN |
Rounding modes (four are standard; round-to-nearest-even is default):
Exact result → nearest representable value
Ties (exactly halfway) → choose the one with even last bit
Rounding modes matter in AI because accumulated rounding error over millions of operations can cause training instability. Stochastic rounding (rounding up or down randomly, weighted by proximity) is used in some FP8 training flows.
1.4 AI Float Formats: FP16, BF16, FP8¶
Modern AI training and inference use reduced-precision formats to fit more data in SRAM and reduce multiply-add energy.
FP32 [S|EEEEEEEE|MMMMMMMMMMMMMMMMMMMMMMM] 32 bits
BF16 [S|EEEEEEEE|MMMMMMM] 16 bits (truncated FP32 mantissa)
FP16 [S|EEEEE|MMMMMMMMMM] 16 bits (different exponent width)
FP8 [S|EEEE|MMM] or [S|EEEEE|MM] 8 bits
BF16 vs FP16 trade-off:
| Property | FP16 | BF16 |
|---|---|---|
| Exponent bits | 5 | 8 (same as FP32) |
| Range | ±65504 | ±3.4×10^38 |
| Precision | ~3 decimal digits | ~2 decimal digits |
| FP32 conversion | Needs range clamp | Truncate 16 mantissa bits |
| Usage | Inference, some training | Mixed-precision training |
FP8 — two variants for different roles:
- E4M3 (4-bit exp, 3-bit mantissa): more precision, less range → weights and activations
- E5M2 (5-bit exp, 2-bit mantissa): more range, less precision → gradients
FP8 training requires loss scaling and careful handling of overflow — the range is too small for raw gradients without scaling.
1.5 Fixed-Point and Quantization¶
Fixed-point represents Q = integer × 2^(−f) where f is the number of fractional bits.
Q8.8 format (8 integer bits, 8 fractional bits, 16 bits total)
Value 3.75:
Integer part: 3 = 0000 0011
Fraction part: 0.75 = 0.11 in binary → stored as 1100 0000
Stored: 0000 0011 1100 0000 = 0x03C0
Quantization maps floating-point weights to INT8 for efficient inference:
INT8 = round(FP32_weight / scale)
scale = max(|weights|) / 127 ← per-tensor
scale = max(|row|) / 127 ← per-row (better quality)
Dequantize: FP32 ≈ INT8 × scale
The multiply-accumulate (MAC) still runs in INT32 (INT8 × INT8 → INT32 accumulate), then the result is scaled back. This is exactly what CUDA's Tensor Core IMMA instructions implement.
1.6 Error Detection and Correction¶
Parity — add one bit so the total number of 1s is even (even parity) or odd (odd parity):
Parity detects any single-bit error but cannot locate or correct it.
Hamming(7,4) — encode 4 data bits into 7 bits with 3 parity bits positioned at powers of 2:
Positions: p1 p2 d1 p4 d2 d3 d4
1 2 3 4 5 6 7
p1 covers positions 1,3,5,7 (odd positions)
p2 covers positions 2,3,6,7
p4 covers positions 4,5,6,7
Syndrome = (check p1)(check p2)(check p4) → 3-bit number = position of error
Hamming(7,4) corrects any single-bit error and detects any double-bit error. Extended Hamming adds a fourth parity bit to detect (not correct) double errors.
CRC (Cyclic Redundancy Check) — treats the data as a polynomial and divides by a generator polynomial:
Data polynomial D(x) × x^r mod G(x) = R(x) (remainder = CRC)
Transmitted: D(x) concat R(x)
Receiver: divide received polynomial by G(x) — nonzero remainder means error
CRC-32 (used in Ethernet, ZIP, PNG) uses G(x) = x^32 + x^26 + x^23 + ... + 1. ECC DRAM uses a Hamming-based SECDED (Single Error Correct, Double Error Detect) code across the 64-bit data bus plus 8 ECC bits.
2. Boolean Algebra and Logic Gates¶
2.1 Gates and Truth Tables¶
Every digital function reduces to NAND gates (or NOR gates) — they are functionally complete.
AND gate: OR gate: NOT gate: NAND gate:
A──┐ A──┐ A──○──Y A──┐
├──Y ├──Y ├──○──Y
B──┘ B──┘ B──┘
Truth tables:
A B | AND OR NAND NOR XOR XNOR
0 0 | 0 0 1 1 0 1
0 1 | 0 1 1 0 1 0
1 0 | 0 1 1 0 1 0
1 1 | 1 1 0 0 0 1
Universal gates — NAND implements everything:
NOT A = NAND(A, A)
A AND B = NAND(NAND(A,B), NAND(A,B)) i.e. NOT(NAND(A,B))
A OR B = NAND(NAND(A,A), NAND(B,B)) i.e. NOT(NOT A) NAND NOT(NOT B)
2.2 Boolean Laws¶
| Law | AND form | OR form |
|---|---|---|
| Identity | A · 1 = A | A + 0 = A |
| Null | A · 0 = 0 | A + 1 = 1 |
| Idempotent | A · A = A | A + A = A |
| Complement | A · Ā = 0 | A + Ā = 1 |
| Double negative | ¬(¬A) = A | |
| Commutative | A · B = B · A | A + B = B + A |
| Associative | (A·B)·C = A·(B·C) | (A+B)+C = A+(B+C) |
| Distributive | A·(B+C) = A·B + A·C | A+(B·C) = (A+B)·(A+C) |
| De Morgan's | ¬(A·B) = Ā + B̄ | ¬(A+B) = Ā · B̄ |
| Absorption | A·(A+B) = A | A + A·B = A |
De Morgan's is the most used law in logic design — it lets you push negations through gates and convert AND↔OR.
2.3 Karnaugh Maps¶
K-maps group minterms visually to find the minimal sum-of-products (SOP) expression.
4-variable K-map layout (Gray code order — adjacent cells differ by one bit):
CD
AB 00 01 11 10
00 | 0 | 1 | 3 | 2 |
01 | 4 | 5 | 7 | 6 |
11 |12 |13 |15 |14 |
10 | 8 | 9 |11 |10 |
Rules for grouping: - Group size must be a power of 2: 1, 2, 4, 8, 16 - Groups must be rectangular (wrap around edges) - Use the largest possible groups - Cover every 1; don't cover 0s - Each group eliminates the variables that change within it
Example — minimize F(A,B,C,D) = Σm(0,2,4,5,6,7,8,10,13,15):
CD
AB 00 01 11 10
00 | 1 | 0 | 0 | 1 | → Group of 4: cells 0,2,8,10 → B̄D̄
01 | 1 | 1 | 1 | 1 | → Group of 4: cells 4,5,6,7 → AB̄
11 | 0 | 1 | 1 | 0 | → Group of 2: cells 13,15 → ACD
10 | 1 | 0 | 0 | 1 |
F = B̄D̄ + AB̄ + ACD
Without K-map: 10 minterms = up to 10 four-literal terms. K-map: 3 terms.
2.4 CMOS Implementation¶
CMOS (Complementary MOS) uses PMOS pull-up networks and NMOS pull-down networks. The two networks are duals of each other.
CMOS NAND gate (2-input):
VDD
|
─────┤ PMOS (A) ← parallel pull-up
│ └──┐
└──────┤ PMOS (B)
└─── Y
┌──
┌──────┤ NMOS (A) ← series pull-down
│ ┌──┘
─────┤ NMOS (B)
|
GND
Logic: Y = NAND(A,B) = ¬(A · B)
If A=1 AND B=1: both NMOS on (pull Y low), both PMOS off → Y = 0 ✓
If A=0 OR B=0: one PMOS on (pull Y high), NMOS chain broken → Y = 1 ✓
Key CMOS properties:
| Property | Value |
|---|---|
| Static power | Near zero (only leakage) |
| Dynamic power | C × V² × f (scales with switching activity) |
| Noise margin | ~40% of VDD |
| Fan-out | Theoretically unlimited (MOS gate is capacitive) |
| Inversion | Every CMOS gate inverts (NAND, NOR, NOT) |
Why only NAND/NOR? AND = NAND + NOT = two gate stages. An extra inversion costs delay and area. Modern synthesis tools always work in NAND/NOR/NOT internally.
3. Combinational Logic¶
3.1 Multiplexers and Decoders¶
MUX (2:1) — select one of two data inputs based on a control signal:
4:1 MUX from two layers of 2:1 MUX:
D0 ─┐ S1=0: top 2:1 → D0 or D1
D1 ─┤ 2:1 MUX ─┐ S1=1: bot 2:1 → D2 or D3
└──(S0) │
D2 ─┐ ├─ 2:1 MUX ── Y
D3 ─┤ 2:1 MUX ─┘ └──(S1)
└──(S0)
Decoder (2-to-4) — activate exactly one of N outputs:
Inputs: A1, A0
Outputs: Y0..Y3
Y0 = Ā1·Ā0 (active when A=00)
Y1 = Ā1·A0 (active when A=01)
Y2 = A1·Ā0 (active when A=10)
Y3 = A1·A0 (active when A=11)
Decoders appear as row/column address decoders in every SRAM and register file.
3.2 Adders: Ripple-Carry vs Carry-Lookahead¶
Full adder (1-bit):
A ──┐
B ──┼── XOR ──────── Sum = A ⊕ B ⊕ Cin
Cin─┘
Carry-out = A·B + Cin·(A⊕B)
= A·B + Cin·(A XOR B)
Ripple-carry adder (RCA) — chain N full adders:
A3 B3 A2 B2 A1 B1 A0 B0
│ │ │ │ │ │ │ │
┌┴──┴┐ ┌┴──┴┐ ┌┴──┴┐ ┌┴──┴┐
│ FA │◄──│ FA │◄──│ FA │◄──│ FA │◄── Cin=0
└────┘ └────┘ └────┘ └────┘
Cout S3 S2 S1 S0
Delay: N × t_FA (linear in word width)
32-bit: ~32 × 0.5ns = 16ns at 1 GHz (too slow for the critical path)
Carry-Lookahead Adder (CLA) — compute all carries in parallel:
Generate: Gi = Ai · Bi (this bit produces a carry regardless of Cin)
Propagate: Pi = Ai ⊕ Bi (this bit passes Cin to Cout)
C1 = G0 + P0·C0
C2 = G1 + P1·G0 + P1·P0·C0
C3 = G2 + P2·G1 + P2·P1·G0 + P2·P1·P0·C0
C4 = G3 + P3·G2 + P3·P2·G1 + P3·P2·P1·G0 + P3·P2·P1·P0·C0
All four carries computed in 2 gate delays (one AND, one OR).
Delay comparison:
| Adder type | Delay (N-bit) | Area |
|---|---|---|
| Ripple-carry | O(N) | O(N) |
| Carry-lookahead | O(log N) | O(N log N) |
| Carry-select | O(√N) | O(N) |
| Kogge-Stone (tree) | O(log N) | O(N log N) |
Modern ALUs use Kogge-Stone or Han-Carlson prefix trees to achieve O(log N) delay with regular, pipeable structure.
3.3 Multipliers and MAC Units¶
Array multiplier (4×4 unsigned):
A3 A2 A1 A0
× B3 B2 B1 B0
─────────────────
A3B0 A2B0 A1B0 A0B0 ← partial product row 0 (AND gates)
A3B1 A2B1 A1B1 A0B1 ← partial product row 1 (shift 1)
A3B2 A2B2 A1B2 A0B2 ← partial product row 2 (shift 2)
A3B3 A2B3 A1B3 A0B3 ← partial product row 3 (shift 3)
─────────────────────────────────
P7 P6 P5 P4 P3 P2 P1 P0
Partial products are summed with an adder tree (Wallace tree reduces rows in O(log N) delay).
MAC (Multiply-Accumulate):
ACC ← ACC + (A × B)
A ─┐
├── MULTIPLIER ─┐
B ─┘ ├── ADDER ── ACC ──┐
│ │
ACC ─────┘ │
↑ │
└──────────────────────────┘
One MAC = one ×, one +, one register write
The MAC is the fundamental operation in matrix multiplication. A systolic array tiles N×N MACs to compute a full matrix product:
A[i,k] flows right → B[k,j] flows down ↓
Each PE: acc += A × B, then pass A right and B down.
After N cycles, PE[i,j] holds C[i,j] = Σ_k A[i,k]×B[k,j].
3.4 ALU Design¶
A minimal ALU selects among operations via an opcode:
┌──────────────────────────────────────────────┐
A[31:0] ─────► │ ADD SUB AND OR XOR SLT SHR SHL │
B[31:0] ─────► │ │
└──────────────────┬───────────────────────────┘
│ 8:1 MUX
OpCode[2:0] ┘
▼
Result[31:0]
+ Flags: Zero, Carry, Overflow, Negative
Flag bits:
| Flag | Meaning | Use |
|---|---|---|
| Zero (Z) | Result == 0 | Branch equal/not-equal |
| Carry (C) | Unsigned overflow / borrow | Multi-word arithmetic |
| Overflow (V) | Signed overflow | Signed arithmetic check |
| Negative (N) | Result MSB is 1 (negative in 2's C) | Signed comparisons |
3.5 Propagation Delay and Critical Path¶
Every gate has a propagation delay t_p — the time from input change to output stable.
Critical path: the longest delay path from any input to any output.
Minimum clock period = critical path delay + setup time + clock skew.
Example: 32-bit ripple-carry adder
t_FA = t_XOR + t_AND + t_OR ≈ 3 × 100ps = 300ps per stage
Critical path = 32 × 300ps = 9.6ns → max freq ≈ 104 MHz
Same adder, 4-bit CLA blocks (4 blocks):
CLA block delay ≈ 2 gate delays = 200ps
Carry ripple between 4 blocks: 4 × 200ps = 800ps
Sum out: 800ps + 300ps ≈ 1.1ns → max freq ≈ 900 MHz
Pipelining cuts the critical path by adding registers mid-path:
Before pipelining (one long combinational path):
Input → [Stage A: 3ns] → [Stage B: 4ns] → [Stage C: 2ns] → Output
Throughput: 1 result / 9ns → ~111 MHz
After pipelining (registers between stages):
Input → [A: 3ns] → REG → [B: 4ns] → REG → [C: 2ns] → Output
Clock period = max(3, 4, 2) + setup ≈ 4.2ns → ~238 MHz
Throughput: 1 result / 4.2ns (latency is 3× longer, throughput ~2.1× better)
4. Sequential Logic¶
4.1 D Flip-Flop¶
The D flip-flop (DFF) is the canonical memory element. It captures D on the rising clock edge.
D ──┤D Q├── Q
CLK ──┤CLK │
│ Q̄├── Q̄ (complement)
└──────┘
Timing:
___ ___ ___
CLK: ___| |___| |___| |___
↑ ↑ ← capture edges
D: ──[stable]──[stable]──[stable]──
Q: ──────────[D₀]─────[D₁]────────
^ after t_clk-to-q
Critical timing parameters:
| Parameter | Symbol | Definition |
|---|---|---|
| Setup time | t_su | D must be stable this long BEFORE clock edge |
| Hold time | t_h | D must be stable this long AFTER clock edge |
| Clock-to-Q | t_cq | Time from clock edge to Q valid |
| Propagation | t_p | Combinational logic delay between DFFs |
Setup time violation (D changes too close to clock edge → Q becomes metastable):
CLK: _____|‾‾‾|_____
D: ────[old]X[new] ← X = transition inside setup window
Q: ──────────[???] ← metastable! may settle to wrong value
Minimum period constraint: t_clk ≥ t_cq + t_p + t_su
4.2 Registers, Shift Registers, and LFSR¶
N-bit register — N DFFs sharing a clock and optional load-enable:
Shift register — DFFs chained; data shifts one position per cycle:
D_in → [DFF₀] → [DFF₁] → [DFF₂] → [DFF₃] → D_out
Q₀ Q₁ Q₂ Q₃
Use: serial-to-parallel conversion, FIFO, delay line, CRC computation
LFSR (Linear Feedback Shift Register) — XOR taps from specific positions feed back to input:
Fibonacci LFSR, polynomial x⁴ + x³ + 1 (taps at positions 4 and 3):
┌─── XOR ◄──────────────────────────────┐
│ ↑ │
│ [DFF₄] ← [DFF₃] ← [DFF₂] ← [DFF₁] │
│ Q₄ Q₃ Q₂ Q₁ │
│ │
└──────────────────────────────── Q₄ → output
Period: 2⁴ - 1 = 15 (maximal length for this polynomial)
Output sequence: 1111, 0111, 1011, 1101, 0110, 1010, 0101, 1010, ...
LFSRs are used in: pseudo-random number generation, BIST (Built-In Self-Test), spread spectrum clocking, CRC hardware, and scrambling in PCIe/USB.
4.3 Finite State Machines¶
Two types:
| Type | Output depends on | Outputs change |
|---|---|---|
| Moore | State only | After clock edge |
| Mealy | State AND current inputs | Immediately |
Traffic light FSM (Moore) — 4 states:
States: GREEN_NS, YELLOW_NS, GREEN_EW, YELLOW_EW
Input: timer_expired (1 = time to change)
timer_expired
GREEN_NS ──────────────► YELLOW_NS
↑ │
│ always │
│ ▼
YELLOW_EW ◄──────── GREEN_EW
always
HDL-style state register (synthesizable pattern):
// State register — always sequential
always @(posedge clk or posedge rst) begin
if (rst) state <= GREEN_NS;
else state <= next_state;
end
// Next-state logic — always combinational
always @(*) begin
case (state)
GREEN_NS: next_state = timer_expired ? YELLOW_NS : GREEN_NS;
YELLOW_NS: next_state = GREEN_EW;
GREEN_EW: next_state = timer_expired ? YELLOW_EW : GREEN_EW;
YELLOW_EW: next_state = GREEN_NS;
default: next_state = GREEN_NS;
endcase
end
// Output logic — Moore: outputs depend on state only
assign light_ns = (state == GREEN_NS) ? GREEN :
(state == YELLOW_NS) ? YELLOW : RED;
assign light_ew = (state == GREEN_EW) ? GREEN :
(state == YELLOW_EW) ? YELLOW : RED;
State encoding strategies:
| Encoding | Bits needed | Power | Speed | Use case |
|---|---|---|---|---|
| Binary | log₂(N) | Low | Moderate | Area-constrained FPGAs, ASICs |
| One-hot | N | High | Fast | FPGA (abundant flip-flops) |
| Gray code | log₂(N) | Low | Moderate | Reduces glitches in output logic |
AI hardware connection: the control unit inside a GPU SM is a large FSM. The warp scheduler tracks warp state (ready, waiting on memory, waiting on dependency) and transitions each warp on every cycle. Up to 64 warps = 64 simultaneous FSM instances running in parallel.
4.4 Timing Analysis and Critical Path¶
Setup analysis — find the critical path across all register-to-register paths:
REG A → [combo logic, delay t₁] → REG B
Setup check: t₁ + t_su ≤ t_clk - t_skew
(data must arrive before clock arrives at destination minus skew)
Hold analysis — ensure data doesn't change too fast:
Clock skew — the difference in clock arrival time at two flip-flops:
CLK source
│
┌──┴──┐
│ │ ← routing delays differ
REG A REG B
CLK CLK
t₁ t₂ skew = t₂ - t₁
Positive skew (t₂ > t₁): relaxes setup, tightens hold
Negative skew (t₂ < t₁): tightens setup, relaxes hold
Clock gating — disable the clock to a register bank when idle to save dynamic power:
CLK ───┐
├── AND ──► gated CLK → register bank
EN ───┘
Saves: C_register × V² × f per gated cycle
Risk: glitches if EN changes while CLK=1 (use latch-based clock gate)
5. Memory Technologies¶
5.1 SRAM — Static Random Access Memory¶
SRAM stores a bit in a cross-coupled inverter pair (the 6T cell):
VDD
│
─────┼──────────────────────────────
│ PMOS₁ PMOS₂ │
│ │ │ │
│ ├── Q ─────── Q̄ ───┤ │
│ NMOS₁ ┌───┐ NMOS₂ │
│ │ │INV│ │ │
│ │ └───┘ │ │
─────NMOS₃ NMOS₄──────
│ │ │ │
BL WL─────────────────WL BL̄
6T SRAM cell:
- 2 PMOS (pull-up)
- 2 NMOS (pull-down) forming the latch
- 2 NMOS access transistors (NMOS₃, NMOS₄) controlled by Word Line
Operation:
| Phase | WL | BL | BL̄ | Action |
|---|---|---|---|---|
| Standby | 0 | Pre | Pre | Latch holds state; access transistors off |
| Read | 1 | Pre | Pre | Cell discharges one BL; sense amp detects |
| Write | 1 | D | D̄ | Driver overrides latch; cell flips |
SRAM characteristics:
| Property | Value |
|---|---|
| Access time | 0.5–5 ns (L1: ~1ns, LLC: ~10ns) |
| Density | 6T cell ≈ 120–150 F² (F = feature size) |
| Power | Static leakage + dynamic read/write |
| Retention | Holds data as long as VDD applied |
| On-chip use | Registers, caches, scratchpad, FIFOs |
5.2 DRAM — Dynamic RAM¶
DRAM stores charge on a capacitor — one transistor, one capacitor (1T1C):
BL
│
NMOS ← access transistor (WL controls)
│
CAP ← ~20–30 fF capacitor
│
GND
Charged cap (≥ Vth/2) = logic 1
Discharged cap (≤ Vth/2) = logic 0
Key differences from SRAM:
| Property | SRAM | DRAM |
|---|---|---|
| Cell size | 6 transistors | 1 transistor + 1 capacitor |
| Density | Low | 4–8× higher than SRAM |
| Speed | ~1–10 ns | ~50–100 ns (row access) |
| Refresh | Not needed | Every ~64 ms (capacitor leaks) |
| Destructive read | No | Yes (must restore after read) |
| Cost/GB | ~50–100× more than DRAM | Baseline |
DRAM access sequence:
1. Row activate (RAS): open one row (wordline) into row buffer (~45 ns)
2. Column access (CAS): select columns from row buffer (~15 ns)
3. Precharge: close row, precharge bit lines
4. Refresh: periodically rewrite all rows before charge leaks away
DRAM bandwidth = (bus width × frequency × burst length) / 8
DDR5-6400: 64-bit × 3200 MT/s × burst-8 = 51.2 GB/s per channel
5.3 HBM — High Bandwidth Memory¶
HBM stacks DRAM dies vertically on silicon interposer, connected via Through-Silicon Vias (TSVs):
GPU Die ──────────────────────────── HBM stack
Silicon interposer
HBM stack cross-section:
┌─────────────┐ ← DRAM die 4
├─────────────┤ ← DRAM die 3
├─────────────┤ ← DRAM die 2
├─────────────┤ ← DRAM die 1
└─────────────┘ ← Base die (logic)
│││ ← TSVs (Through-Silicon Vias)
───────────────── ← Silicon interposer
│││
┌─────────────┐ ← GPU/accelerator die
HBM vs GDDR6 vs DDR5:
| Spec | DDR5-6400 | GDDR6X | HBM3 |
|---|---|---|---|
| Bus width/stack | 64-bit | 32-bit | 1024-bit |
| Bandwidth/stack | 51 GB/s | 96 GB/s | ~900 GB/s |
| Capacity/stack | Unlimited | 16–24 GB | 16–48 GB |
| Power efficiency | Moderate | Moderate | High |
| On-package | No | No | Yes (interposer) |
| Used in | CPU systems | Gaming GPUs | AI GPUs/TPUs |
A100 has 80 GB HBM2e across 5 stacks at ~2 TB/s. H100 SXM uses HBM3 at ~3.35 TB/s. The interconnect between the GPU die and HBM is the single biggest memory bandwidth bottleneck in AI compute.
5.4 Flash Memory¶
Flash stores charge in a floating gate or charge trap layer — no power needed for retention.
NAND Flash cell (floating gate):
Control gate (CG)
│
─────┼──── ← tunnel oxide
Floating gate (FG) ← charge stored here
─────┼──── ← gate oxide
Source Drain
│ │
└─────── ────────┘
Channel
Writing: tunnel electrons onto FG (Fowler-Nordheim tunneling) → raises Vt → reads as 0
Erasing: remove electrons from FG → lowers Vt → reads as 1
Flash cell types by bits per cell:
| Type | Bits/cell | Voltage levels | Endurance | Speed | Use |
|---|---|---|---|---|---|
| SLC | 1 | 2 | 100K P/E | Fast | Enterprise SSD |
| MLC | 2 | 4 | 10K P/E | Medium | Consumer SSD |
| TLC | 3 | 8 | 1K P/E | Slower | Bulk storage |
| QLC | 4 | 16 | 300 P/E | Slow | Archival, cold data |
Flash is not byte-addressable for writes — writes go to a page (4–16 KB), erasures go to a block (128–512 pages). This asymmetry drives the Flash Translation Layer (FTL) in SSD controllers.
5.5 Cache Hierarchy¶
A cache exploits spatial and temporal locality to bridge the speed gap between CPU/GPU registers and main memory.
GPU/CPU Memory Hierarchy (approximate, H100/server class):
Registers ~0.2 ns ~256 KB per SM (64K 32-bit regs × 108 SMs)
L1 / Shared mem ~5 ns ~228 KB per SM (configurable split)
L2 cache ~50 ns ~50 MB on-chip, shared
HBM3 ~100 ns ~80 GB off-chip, on-package
System DDR5 ~200 ns ~TBs off-package
NVMe SSD ~100 μs ~TBs PCIe storage
Bandwidth (rough):
Register file → FPU: 100+ TB/s
L1/Shared mem → register: ~15 TB/s per SM
L2 → SM: ~12 TB/s total
HBM3 → chip: ~3.35 TB/s
PCIe5 x16 → host: ~64 GB/s
Cache organization:
Direct-mapped (1-way): Set-associative (4-way):
Tag | Index | Offset Tag | Index | Offset
One slot per set Four slots per set (4 ways)
Fast, simple Fewer conflict misses
Index selects row Index selects set, all 4 ways checked in parallel
Tag must match Any way whose tag matches → hit
Replacement policies:
| Policy | Evicts | Hardware cost |
|---|---|---|
| LRU | Least recently used way | O(log W) bits per set |
| PLRU | Pseudo-LRU (tree approximation) | W−1 bits per set |
| Random | Random way | LFSR counter |
| FIFO | Oldest loaded way | Pointer per set |
GPU L1 uses a simplified LRU or PLRU. CPU last-level caches often use QLRU (quad-age LRU) at 16-way associativity.
5.6 On-Chip Scratchpad and Tiling¶
Unlike a cache (hardware-managed, transparent), shared memory in GPUs is software-managed SRAM — the programmer explicitly controls what lives there.
Why tiling matters — matrix multiply on GPU without tiling:
Naive: each thread loads A[row, k] and B[k, col] from HBM for every k
For M=N=K=1024, C=A×B:
Loads from HBM: M × N × K × 2 (A and B) = 2 × 10⁹ float reads
At 3.35 TB/s HBM bandwidth: 2 × 10⁹ × 4B / 3.35TB/s ≈ 2.4 ms
Compute at 312 TFLOPS: 2 × 10⁹ FLOPs / 312 TFLOPS ≈ 0.006 ms
→ memory-bound by 400×
With tiling (shared memory):
Block size: 16×16 output tile
Load tile of A (16×K_chunk) → shared memory } one time per block
Load tile of B (K_chunk×16) → shared memory }
Each thread computes its dot product from shared memory (fast)
Arithmetic intensity = FLOPs / bytes from HBM
Naive: 2 FLOPs per 8 bytes (FMA = 2 ops, loads A and B) → 0.25 FLOP/byte
Tiled: 2 × T FLOPs per (2 × T × 4) bytes → T/4 FLOP/byte (T = tile size)
T=16: 4 FLOP/byte
T=128: 32 FLOP/byte → compute-bound at H100 (312 TFLOPS / 3350 GB/s = 93)
Shared memory bank conflicts — 32 banks, each 4 bytes wide; threads in a warp access conflict-free if they hit different banks:
No conflict: thread i accesses address i × 4 (each thread → different bank)
2-way conflict: threads 0 and 16 both access bank 0 (serialized → 2× slower)
Broadcast: all threads read same address (one bank, no conflict → hardware broadcasts)
Fix: pad shared memory arrays by one element
float tile[16][16+1]; ← +1 padding shifts each row to different bank
6. MIPS Processor Design¶
Building a processor from the components in sections 1–5 is the definitive digital design exercise. MIPS is the standard teaching ISA because its fixed 32-bit instruction format maps cleanly onto hardware. We build up from a single-cycle datapath through a 5-stage pipeline — the same pipeline that every modern processor (including GPU shader cores) descends from.
6.1 MIPS ISA Subset¶
A minimal core supports three instruction formats:
R-type (register): [opcode 6][rs 5][rt 5][rd 5][shamt 5][funct 6]
I-type (immediate): [opcode 6][rs 5][rt 5][imm 16]
J-type (jump): [opcode 6][address 26]
Key instructions for the datapath:
| Instruction | Format | Operation | Example |
|---|---|---|---|
add |
R | rd = rs + rt | add $t0, $s1, $s2 |
sub |
R | rd = rs − rt | sub $t0, $s1, $s2 |
and |
R | rd = rs & rt | and $t0, $s1, $s2 |
or |
R | rd = rs | rt | or $t0, $s1, $s2 |
slt |
R | rd = (rs < rt) ? 1 : 0 | slt $t0, $s1, $s2 |
lw |
I | rt = Mem[rs + imm] | lw $t0, 4($s1) |
sw |
I | Mem[rs + imm] = rt | sw $t0, 4($s1) |
beq |
I | if (rs == rt) PC += imm<<2 | beq $s1, $s2, L1 |
addi |
I | rt = rs + sign_ext(imm) | addi $t0, $s1, 10 |
j |
J | PC = | j target |
Register file — 32 registers, $0 hardwired to zero:
$zero ($0) = always 0 $t0-$t9 = temporaries
$at ($1) = assembler temp $s0-$s7 = saved (callee-preserved)
$v0-$v1 = return values $sp ($29)= stack pointer
$a0-$a3 = arguments $ra ($31)= return address
6.2 Single-Cycle Datapath¶
Every instruction completes in one clock cycle. The datapath connects the building blocks from sections 3–5:
┌────────────────────────────────────────────────────┐
│ Control Unit │
│ (opcode, funct → RegDst, ALUSrc, MemtoReg, │
│ RegWrite, MemRead, MemWrite, Branch, ALUOp, Jump)│
└──────┬──────┬──────┬──────┬──────┬──────┬─────────┘
│ │ │ │ │ │
┌──────┐ ┌──────────┐ ┌─────┴──┐ MUX MUX │ │ │
│ │ │ Register │ │ │ │ │ │ │ │
│ PC ├──►│ File ├──►│ ALU ├───┘ │ ┌─┴──┐ │ │
│ │ │ (32×32) │ │ │ └───►│Data│ │ │
└──┬───┘ │ │ └────────┘ │Mem │ │ │
│ │ rd1─────┼───► A │ │ │ │
│ │ rd2─────┼───► B / MUX ◄── imm │ │ │ │
▼ └──────────┘ └────┘ │ │
┌──────┐ │ │
│Instr │ │ │
│ Mem │◄────────────────────────────────────────────────────┘ │
└──────┘ │
│ │
└─── PC+4 / branch target / jump target ◄──────────────────────┘
Datapath walk-through for lw $t0, 8($s1):
Cycle:
1. Fetch: Instr Mem[PC] → instruction word
2. Decode: rs=$s1(17), rt=$t0(8), imm=8
Register File reads: rd1 = Reg[17] = value of $s1
3. Execute: ALU computes: $s1 + sign_ext(8) = address
4. Memory: Data Mem[address] → loaded word
5. Writeback: MUX selects memory output → Reg[8] ($t0)
6. PC update: PC ← PC + 4
All in ONE cycle. Clock period = longest path (lw: through Instr Mem + Reg File + ALU + Data Mem + MUX).
Control signals for each instruction type:
| Signal | R-type | lw |
sw |
beq |
addi |
j |
|---|---|---|---|---|---|---|
| RegDst | 1 (rd) | 0 (rt) | X | X | 0 (rt) | X |
| ALUSrc | 0 (rt) | 1 (imm) | 1 (imm) | 0 (rt) | 1 (imm) | X |
| MemtoReg | 0 (ALU) | 1 (Mem) | X | X | 0 (ALU) | X |
| RegWrite | 1 | 1 | 0 | 0 | 1 | 0 |
| MemRead | 0 | 1 | 0 | 0 | 0 | 0 |
| MemWrite | 0 | 0 | 1 | 0 | 0 | 0 |
| Branch | 0 | 0 | 0 | 1 | 0 | 0 |
| Jump | 0 | 0 | 0 | 0 | 0 | 1 |
| ALUOp | 10 | 00 | 00 | 01 | 00 | XX |
ALU control — two-level decode:
ALUOp (from main control) + funct field (from instruction) → ALU operation
ALUOp=00 → ADD (lw, sw, addi: address calculation)
ALUOp=01 → SUB (beq: compare by subtraction)
ALUOp=10 → look at funct:
funct=100000 → ADD
funct=100010 → SUB
funct=100100 → AND
funct=100101 → OR
funct=101010 → SLT
6.3 Single-Cycle Timing Problem¶
The single-cycle design's clock period equals the slowest instruction:
lw critical path:
Instr Mem + Reg Read + ALU + Data Mem + Reg Write MUX
200ps + 150ps + 200ps + 200ps + 25ps
= 775ps → max freq ≈ 1.29 GHz
But EVERY instruction — even a simple add (375ps) — takes 775ps.
add waste: 775 - 375 = 400ps idle per add
This wastes significant time on fast instructions. The solution: pipelining.
6.4 Five-Stage Pipeline¶
Break the datapath into 5 stages, each taking one clock cycle. Pipeline registers separate stages:
Stage 1 Stage 2 Stage 3 Stage 4 Stage 5
IF ID EX MEM WB
┌──────┐ ┌──────────┐ ┌────────┐ ┌────────┐ ┌──────────┐
│Instr │ │ Register │ │ │ │ Data │ │ Write │
│Fetch │ ►REG► Decode │►REG► ALU │►REG►Memory │►REG► Back │
│ │ │ + Read │ │ │ │ │ │ │
└──────┘ └──────────┘ └────────┘ └────────┘ └──────────┘
PC+4 rs, rt, imm ALU result Mem data rd ← result
instr control sigs branch addr
Clock period = max(stage delay) + register overhead
= 200ps + 20ps = 220ps → max freq ≈ 4.5 GHz
Throughput = 1 instruction per 220ps (vs 775ps single-cycle → 3.5× faster)
Latency = 5 × 220ps = 1100ps per instruction (worse, but throughput wins)
Pipeline timing diagram — 5 instructions flowing through:
Time (cycles): 1 2 3 4 5 6 7 8 9
Instr 1: IF ID EX MEM WB
Instr 2: IF ID EX MEM WB
Instr 3: IF ID EX MEM WB
Instr 4: IF ID EX MEM WB
Instr 5: IF ID EX MEM WB
After cycle 5, one instruction completes EVERY cycle.
Steady-state CPI = 1 (Cycles Per Instruction).
6.5 Pipeline Hazards¶
Three types of hazards break the ideal CPI = 1:
Data Hazards¶
A later instruction reads a register that an earlier instruction hasn't written yet:
add $s0, $t0, $t1 # WB writes $s0 in cycle 5
sub $t2, $s0, $t3 # ID reads $s0 in cycle 3 — STALE value!
Time: 1 2 3 4 5
add: IF ID EX MEM WB ← $s0 written here
sub: IF ID ←reads $s0 here (old value!)
Solution 1 — Forwarding (bypassing): route the ALU result directly back to the ALU input without waiting for WB:
┌──────── forward path ────────┐
│ │
add: IF ID EX ─┤─ MEM WB │
sub: IF ID ─┘─ EX ◄──┘ MEM WB
↑
ALU gets $s0 from EX/MEM register
instead of register file
Forwarding unit checks: if (EX/MEM.rd == ID/EX.rs) → forward EX/MEM.ALUresult
Solution 2 — Load-use hazard (forwarding can't fix):
lw $s0, 0($t0) # data available after MEM (cycle 4)
add $t2, $s0, $t1 # EX needs $s0 in cycle 3 — too early!
Must stall (insert bubble) for 1 cycle:
Time: 1 2 3 4 5 6
lw: IF ID EX MEM WB
add: IF ID stall EX MEM WB
↑ ↑
hazard forward from MEM/WB register
detected
The stall costs 1 cycle. Compilers reorder instructions to fill the load-delay slot when possible.
Control Hazards¶
Branch outcome isn't known until EX (or MEM), but the next instruction is already fetched:
beq $s0, $s1, target # branch resolved in EX (cycle 3)
add ... # fetched in cycle 2 — might be wrong path!
or ... # fetched in cycle 3 — might be wrong path!
Solutions:
| Strategy | Penalty | Hardware cost | Description |
|---|---|---|---|
| Stall (flush) | 1–2 cycles | Minimal | Kill wrong-path instructions |
| Branch prediction | 0 if correct | Moderate | Predict taken/not-taken |
| Delayed branch | 0 | Compiler | Always execute next instruction (MIPS) |
| Early branch resolution | 1 cycle | MUX in ID | Move comparison to ID stage |
Static prediction: predict not-taken (sequential fetch). If wrong, flush 1 instruction.
Dynamic prediction (branch history table):
1-bit predictor: remember last outcome → predict same
Problem: loops always mispredict twice (entering and exiting)
2-bit saturating counter:
States: Strongly Taken (11) → Weakly Taken (10) → Weakly Not (01) → Strongly Not (00)
Must mispredict twice to switch direction → better loop behavior
Structural Hazards¶
Two stages need the same hardware unit in the same cycle (e.g., single memory for both IF and MEM):
Solution: use separate instruction memory and data memory (Harvard architecture)
or use a multi-ported cache
The classic MIPS pipeline uses separate instruction and data memories — no structural hazards in the base design.
6.6 Complete Pipelined Datapath¶
┌─────┐ IF/ID ┌─────┐ ID/EX ┌─────┐ EX/MEM ┌─────┐ MEM/WB ┌─────┐
│ │ reg │ │ reg │ │ reg │ │ reg │ │
│ IF ├────────►│ ID ├────────►│ EX ├───────►│ MEM ├───────►│ WB │
│ │ │ │ │ │ │ │ │ │
└──┬──┘ └──┬──┘ └──┬──┘ └─────┘ └──┬──┘
│ │ │ │
│ ┌──┴──┐ ┌──┴──┐ │
│ │Hazard│ │Fwd │ │
│ │Detect│ │Unit │◄──────────────────────────┘
│ └──┬──┘ └─────┘ (forward paths)
│ │
│ stall/flush
│
┌─┴──────────────────────────┐
│ PC MUX │
│ (PC+4 / branch / jump) │
└────────────────────────────┘
Pipeline registers store ALL signals needed by later stages:
IF/ID: instruction, PC+4
ID/EX: control signals, rs data, rt data, sign-ext imm, rd/rt dest
EX/MEM: control signals, ALU result, rt data (for sw), dest register
MEM/WB: control signals, ALU result OR memory data, dest register
Forwarding unit logic (simplified):
// Forward from EX/MEM stage
if (EX/MEM.RegWrite && EX/MEM.rd != 0 && EX/MEM.rd == ID/EX.rs)
ForwardA = EX/MEM.ALUresult
// Forward from MEM/WB stage
if (MEM/WB.RegWrite && MEM/WB.rd != 0 && MEM/WB.rd == ID/EX.rs
&& !(EX/MEM.RegWrite && EX/MEM.rd == ID/EX.rs)) // EX/MEM has priority
ForwardA = MEM/WB.WriteData
// Same logic for ForwardB (rt input)
Hazard detection unit (load-use stall):
if (ID/EX.MemRead && (ID/EX.rt == IF/ID.rs || ID/EX.rt == IF/ID.rt))
stall pipeline:
- IF/ID register holds (re-decode same instruction)
- insert NOP into ID/EX (bubble)
- PC holds (re-fetch same PC+4)
6.7 Performance Metrics¶
Execution time = IC × CPI × T_clk
IC = Instruction Count (depends on ISA and compiler)
CPI = Cycles Per Instruction (depends on pipeline + hazards)
T_clk = Clock period (depends on critical path in longest stage)
Single-cycle: CPI = 1, T_clk = 775ps → time/instr = 775ps
Pipelined: CPI ≈ 1.2 (stalls + flushes), T_clk = 220ps
→ time/instr = 264ps → ~2.9× speedup
Ideal speedup from N-stage pipeline = N (limited by hazard stalls)
5-stage MIPS: ideal 5×, typical 3–4× in practice
CPI breakdown example (gcc on MIPS):
| Component | CPI contribution |
|---|---|
| Base | 1.0 |
| Load-use stalls | +0.05 |
| Branch mispredicts | +0.12 |
| Cache misses | +0.07 |
| Total CPI | 1.24 |
AI hardware connection: GPU shader cores use a much deeper pipeline (20+ stages in modern GPUs) but hide latency through massive thread-level parallelism (TLP) rather than branch prediction. When a warp stalls on a cache miss, the scheduler switches to another ready warp — same principle as pipeline interleaving, but across thousands of threads instead of individual instructions. The MIPS 5-stage pipeline is the conceptual foundation: once you understand forwarding, hazards, and CPI, the GPU's approach is a natural extension.
Resources¶
| Resource | Type | Focus |
|---|---|---|
| Digital Design and Computer Architecture — Harris & Harris | Textbook | Gates through ISA, HDL examples |
| Computer Organization and Design — Patterson & Hennessy | Textbook | RISC-V, memory hierarchy, pipelining |
| Digital Design and Computer Architecture: ARM Edition — Harris & Harris | Textbook | Single-cycle + pipelined processor, HDL |
| Modern VLSI Design — Wolf | Textbook | CMOS, timing, place-and-route |
| CMOS VLSI Design — Weste & Harris | Textbook | Transistor-level design, layout, fabrication |
| Sam Zeloof (YouTube) | Video | DIY semiconductor fab — see the process hands-on |
| Chips and Cheese (chipsandcheese.com) | Blog | Deep dives into real chip microarchitectures |
| Nandland (nandland.com) | Online | FPGA/Verilog/VHDL hands-on |
| EEVblog / Ben Eater | Video | Intuition for digital circuits, breadboard CPU |
| JEDEC JESD79-5B (DDR5 spec) | Spec | DRAM timing parameters |
| JEDEC HBM3 spec | Spec | HBM architecture and interface |
| IEEE 754-2019 | Standard | Floating-point standard |
Projects¶
| Project | Skills | Outcome |
|---|---|---|
| Binary/BCD/Gray code converter | Number systems, K-map | Combinational logic in Verilog |
| 32-bit CLA adder in Verilog | Generate/propagate, prefix tree | RTL design + timing analysis |
| IEEE 754 FP32 adder | Alignment shift, normalization, rounding | Understand FP hardware |
| INT8 MAC unit | Fixed-point, two's complement, accumulation | Core of quantized inference |
| LFSR-based PRNG | Shift registers, XOR feedback, polynomial | BIST, test stimulus generation |
| 4-state FSM (Moore + Mealy) | State encoding, next-state logic | Control unit design pattern |
| Tiled matrix multiply (CUDA) | Shared memory, bank conflict, tiling | Connect theory to GPU programming |
| Single-cycle MIPS in Verilog | Datapath, control unit, ALU, register file | Complete processor from gates to ISA |
| Pipelined MIPS in Verilog | Pipeline registers, forwarding, hazard detection | Understand CPI, stalls, bypassing |
| Cache simulator (Python/C++) | Direct-map vs set-assoc, LRU, miss rates | Memory hierarchy intuition |
| RTL-to-GDS with OpenLane | Synthesis, PnR, DRC/LVS, SkyWater PDK | Experience the full tapeout flow (open-source) |