Skip to content

Digital Design Fundamentals

The circuits inside every AI accelerator — tensor cores, SRAM banks, systolic arrays — are built from the same primitives covered here. This guide takes you from bits to memory hierarchies, with deliberate attention to the representations and building blocks that show up every day in hardware design.

Before diving into the technical building blocks, let's understand the big picture: how a chip idea becomes a physical piece of silicon.


0. From Design to Silicon — How Chips Are Made

0.1 The Tapeout Flow

"Tapeout" is the moment a chip design team hands off the final design files to a foundry (TSMC, Samsung, Intel Foundry, GlobalFoundries) for manufacturing. The name comes from the old days when designs were literally written to magnetic tape. Today it means delivering a set of verified GDSII/OASIS files — the geometric description of every transistor, wire, and via on the chip.

The full journey from idea to working silicon:

┌─────────────────────────────────────────────────────────────────────────┐
│                        CHIP DESIGN FLOW                                 │
│                                                                         │
│  Specification                                                          │
│      │                                                                  │
│      ▼                                                                  │
│  Architecture Design          "What does the chip do?"                  │
│      │                        ISA, block diagram, memory map            │
│      ▼                                                                  │
│  RTL Design (Verilog/VHDL)    "Describe behavior in HDL"                │
│      │                        modules, FSMs, datapaths                  │
│      ▼                                                                  │
│  Functional Verification      "Does the RTL match the spec?"            │
│      │                        testbenches, UVM, formal verification     │
│      ▼                                                                  │
│  Logic Synthesis              "Convert RTL to gates"                    │
│      │                        tool: Synopsys Design Compiler            │
│      │                        input: RTL + standard cell library        │
│      │                        output: gate-level netlist                │
│      ▼                                                                  │
│  Place and Route (PnR)        "Put gates on chip, connect wires"        │
│      │                        tool: Cadence Innovus / Synopsys ICC2     │
│      │                        floorplan → placement → CTS → routing     │
│      ▼                                                                  │
│  Sign-off Checks              "Is it manufacturable and correct?"        │
│      │   ├── Timing (STA)     static timing analysis, all corners       │
│      │   ├── DRC              design rule check (min spacing, width)    │
│      │   ├── LVS              layout vs schematic (geometry = netlist?) │
│      │   ├── ERC              electrical rule check                     │
│      │   └── IR drop / EM     power integrity, electromigration         │
│      ▼                                                                  │
│  ★ TAPEOUT ★                  "Ship GDSII to foundry"                   │
│      │                                                                  │
│      ▼                                                                  │
│  Fabrication (TSMC etc.)      12–18 weeks for advanced nodes            │
│      │                                                                  │
│      ▼                                                                  │
│  Packaging & Test             die → package → burn-in → final test      │
│      │                                                                  │
│      ▼                                                                  │
│  Silicon Bring-up             first boot, characterization              │
└─────────────────────────────────────────────────────────────────────────┘

0.2 RTL to Gates: Logic Synthesis

You write hardware in Verilog or VHDL — a behavioral description of what each register, MUX, and ALU does. The synthesis tool (Synopsys Design Compiler is the industry standard) maps this to actual logic gates from a standard cell library provided by the foundry.

Your RTL (Verilog):                     Synthesized gates:

  assign y = a & b | c;        →       NAND2 → INV → NOR2 → ...
                                        (mapped to TSMC N5 cell library)

  always @(posedge clk)
    if (en) q <= d;             →       DFF with enable (EDFCND2)

A standard cell library contains hundreds of pre-designed, pre-characterized cells:

Cell type Examples
Logic gates INV, NAND2, NAND3, NOR2, AOI22, MUX2
Sequential DFF, DFF with reset, DFF with enable
Buffers/drivers BUF, CLKBUF, CLKINV
Special Tie-high, tie-low, filler cells

Each cell comes with: - Layout (geometric shapes for each process layer) - Timing model (.lib) — delay, setup, hold at various voltage/temperature corners - Power model — leakage + switching power - Parasitic model — capacitance, resistance

The synthesis tool optimizes for your constraints — typically timing (target clock frequency), area, and power — making trade-offs like using faster (larger) cells on critical paths and slower (smaller) cells elsewhere.

0.3 Place and Route

After synthesis, you have a netlist: a list of ~billion gates and their connections. Place and route turns this into a physical layout.

Step 1: Floorplanning
  ┌──────────────────────────────────┐
  │  ┌──────┐  ┌────────┐  ┌──────┐ │
  │  │ SRAM │  │Compute │  │ SRAM │ │
  │  │Block │  │ Core   │  │Block │ │
  │  └──────┘  └────────┘  └──────┘ │
  │  ┌────────────────────────────┐  │
  │  │        I/O ring            │  │
  │  └────────────────────────────┘  │
  └──────────────────────────────────┘
  Decide where major blocks go, power grid, pin placement

Step 2: Placement
  Standard cells placed in rows between power rails (VDD/VSS)
  ┌─VDD─────────────────────────────┐
  │ [NAND][INV][DFF][BUF][MUX]...  │
  ├─VSS─────────────────────────────┤
  │ [NOR][AOI][DFF][INV][NAND]...  │
  ├─VDD─────────────────────────────┤
  │ ...                             │
  └─────────────────────────────────┘

Step 3: Clock Tree Synthesis (CTS)
  Build a balanced tree of clock buffers so CLK arrives
  at every flip-flop within ~50ps skew

Step 4: Routing
  Connect all signals using metal layers (M1–M15+ on advanced nodes)
  Lower metals: short local wires
  Upper metals: long global wires, power distribution

0.4 Fabrication at TSMC

Once the GDSII file arrives at the foundry, it goes through photolithography — printing the design layer by layer onto a silicon wafer.

Wafer cross-section (simplified, TSMC N5):

  ══════════  Metal 15 (thick, global power/signal)
     ...      (13 more metal layers)
  ──────────  Metal 2
  ──────────  Metal 1 (thin, local connections)
  ──────────  Contact layer
  ▓▓▓▓▓▓▓▓▓  Gate (FinFET or GAA nanosheet at N3/N2)
  ░░░░░░░░░  Fin / channel
  ▒▒▒▒▒▒▒▒▒  Silicon substrate (300mm wafer)

Each layer requires:
  1. Deposit material (metal, oxide, etc.)
  2. Spin photoresist
  3. Expose with EUV light through reticle (mask)
  4. Develop (remove exposed/unexposed resist)
  5. Etch pattern into material
  6. Strip remaining resist
  7. Repeat for next layer (~80-100 mask steps for N5)

Process node comparison (as of 2025–2026):

Node Foundry Transistor type Density (MTr/mm²) Used in
N7 TSMC FinFET ~91 A100, AMD Zen 2
N5 TSMC FinFET ~171 H100, Apple M2, Zen 4
N4P TSMC FinFET ~180 B100/B200
N3E TSMC FinFET ~208 Apple M4, Vera (2026)
N2 TSMC (2025) GAA nanosheet ~250+ Next-gen AI chips
20A/18A Intel RibbonFET (GAA) ~200+ Intel next-gen
2nm Samsung GAA ~200+ Foundry customers

FinFET vs GAA (Gate-All-Around):

FinFET (N7–N3):                     GAA Nanosheet (N2 and below):

      Gate                              Gate (wraps all sides)
     ┌───┐                            ┌─────────────┐
     │   │                            │ ┌─────────┐ │
     │ F │  ← gate wraps             │ │nanosheet│ │  ← gate wraps
     │ I │     3 sides                │ └─────────┘ │     ALL 4 sides
     │ N │                            │ ┌─────────┐ │
     │   │                            │ │nanosheet│ │  (stacked sheets)
     └─┬─┘                            │ └─────────┘ │
       │                              └──────┬──────┘
  ─────┴─────                         ───────┴───────
   substrate                            substrate

GAA advantage: better electrostatic control → less leakage,
               higher drive current, better scaling below 3nm

0.5 Packaging

The raw die is cut from the wafer and placed into a package that provides: - Electrical connections (power, ground, I/O signals) - Mechanical protection - Heat dissipation path

Modern AI chip packaging (CoWoS — Chip on Wafer on Substrate):

        ┌─── Heat spreader / lid ───┐
        │                           │
   ┌────┴──────┬───────┬──────┬─────┴───┐
   │  HBM      │  GPU  │  HBM │  HBM    │  ← dies on interposer
   │  stack     │  die  │ stack│  stack   │
   └───┬────────┴───┬───┴──────┴────┬────┘
       │   Silicon interposer       │      ← passive routing layer
   ┌───┴────────────┴───────────────┴────┐
   │         Organic substrate           │  ← BGA ball grid below
   └─────────────────────────────────────┘
                  │││││
            solder balls → PCB

CoWoS enables:
  - GPU die + HBM stacks on same interposer (short, wide buses)
  - 1024-bit HBM interface (impossible with PCB traces)
  - H100: ~814mm² GPU die + 6× HBM3 stacks on ~2× larger interposer

Packaging technologies comparison:

Package Description Used in
Wire bond Gold wires from die pads to package leads Low-cost, legacy chips
Flip-chip Solder bumps, die face-down CPUs, GPUs (standard)
2.5D (CoWoS) Multiple dies on silicon interposer AI GPUs (H100, MI300X)
3D stacking Dies stacked vertically with TSVs HBM, AMD 3D V-Cache
Chiplets Multiple small dies in one package AMD EPYC, Intel Ponte Vecchio

0.6 Testing and Yield

Not every die on a wafer works — defects from particles, lithography errors, and process variation kill some transistors.

300mm wafer with ~100 GPU dies:

  ┌─────────────────────────────┐
  │   ○ ○ ○ ○ ○ ○ ○ ○ ○        │
  │  ○ ● ○ ○ ● ○ ○ ○ ○ ○      │    ○ = good die
  │ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○     │    ● = defective die
  │  ○ ○ ○ ● ○ ○ ○ ○ ○ ○      │
  │ ○ ○ ○ ○ ○ ○ ● ○ ○ ○ ○     │    Edge dies = partial/wasted
  │  ○ ○ ○ ○ ○ ○ ○ ○ ○ ○      │
  │   ○ ○ ○ ○ ○ ○ ○ ○ ○       │    Yield = good dies / total dies
  └─────────────────────────────┘

  H100 die: ~814mm²  →  ~60 dies per 300mm wafer
  At 80% yield → ~48 good dies per wafer
  Wafer cost at N5: ~$16,000–$20,000
  Cost per good die: ~$330–$415 (before packaging, test, profit margin)

Defect tolerance — binning and salvage:

NVIDIA harvests partially defective dies by disabling broken SMs: - H100 SXM: 132 SMs designed, 114 enabled (disable 18 defective ones) - This dramatically improves effective yield

Test flow:

1. Wafer sort (probe test)    — test each die on wafer, mark bad dies
2. Die singulation            — cut wafer into individual dies
3. Packaging                  — bond good dies into packages
4. Final test (ATE)           — automated test at speed, burn-in
5. Binning                    — sort by max frequency / power tier
                                (e.g., RTX 4090 vs 4080 = same die, different bin)

0.7 Timeline and Cost

A modern AI chip from architecture to production:

Phase Duration Key deliverable
Architecture & spec 3–6 months Microarchitecture document
RTL design 12–18 months Verilog/VHDL codebase
Verification Overlaps RTL 60–70% of total effort
Synthesis + PnR 3–6 months GDSII layout
Tapeout → first silicon 3–4 months Engineering samples (ES)
Validation & bring-up 3–6 months Working chips
Total ~2–3 years Production silicon

Cost to design a chip at N5: $300M–$500M+ (including EDA tools, IP licenses, masks, engineering team). This is why only a handful of companies (NVIDIA, AMD, Apple, Google, etc.) design leading-edge chips — the economics require massive volume to amortize the NRE (Non-Recurring Engineering) cost.

Why this matters for AI hardware engineers: understanding the tapeout flow tells you why design decisions are made the way they are. When an architect chooses a simpler pipeline over a faster but more complex one, it's often because verification effort scales superlinearly with complexity. When NVIDIA uses the same die for multiple GPU tiers (binning), it's yield economics. When HBM is on an interposer instead of on-package DRAM, it's because the 1024-bit bus is physically impossible with PCB routing. Every section in this guide — from Boolean algebra through MIPS pipelines — is a building block in this flow.


1. Number Systems

1.1 Binary, Hexadecimal, and Positional Notation

Every number in a digital system is a pattern of bits. Positional notation makes the value of each bit depend on its position:

Decimal 173  =  1×10² + 7×10¹ + 3×10⁰
Binary 10101101  =  1×2⁷ + 0×2⁶ + 1×2⁵ + 0×2⁴ + 1×2³ + 1×2² + 0×2¹ + 1×2⁰
               =  128 + 32 + 8 + 4 + 1  =  173
Hex 0xAD       =  10×16¹ + 13×16⁰  =  160 + 13  =  173

Hex is used everywhere in hardware because one hex digit maps exactly to four bits — easier to read, easier to type.

Binary group:  1010  1101
Hex digit:      A     D      →  0xAD

Conversion drill (do these by hand until they feel automatic):

Decimal Binary Hex
0 0000 0000 0x00
127 0111 1111 0x7F
128 1000 0000 0x80
255 1111 1111 0xFF
256 0001 0000 0000 0x100

1.2 Signed Integers: Two's Complement

Three ways to represent negative numbers have been tried in hardware. Only one survived.

Scheme +5 −5 Problem
Sign-magnitude 0000 0101 1000 0101 Two zeros (+0, −0); complex adder
One's complement 0000 0101 1111 1010 Two zeros; end-around carry
Two's complement 0000 0101 1111 1011 One zero; same adder for add/sub

Two's complement rule: flip all bits, add 1.

+5  =  0000 0101
flip   1111 1010
+1     1111 1011   ← this is −5 in 8-bit two's complement

Verify by addition — a correct negation adds to zero:

  0000 0101   (+5)
+ 1111 1011   (−5)
-----------
  0000 0000   (0, carry out discarded)  ✓

Range for N-bit two's complement: −2^(N−1) to +2^(N−1) − 1. For 8-bit: −128 to +127. For 32-bit: −2,147,483,648 to +2,147,483,647.

Sign extension — when widening a value (e.g., INT8 → INT32), copy the sign bit into all new high bits:

INT8  −5  =  1111 1011
INT32 −5  =  1111 1111 1111 1111 1111 1111 1111 1011
                ↑ all new bits filled with sign bit (1)

AI hardware connection: matrix multiply accumulate units (MMA/WMMA) operate on INT8 with INT32 accumulators. The sign extension from INT8 to INT32 is automatic in hardware, but matters when you mix signed/unsigned operands.

1.3 IEEE 754 Floating-Point

Floating-point trades the fixed range of integers for a sliding window of precision.

Value = (−1)^S  ×  1.Mantissa  ×  2^(Exponent − Bias)
Format Width Sign Exponent Mantissa Bias ~Range ~Precision
FP32 32b 1 8 23 127 ±3.4×10^38 ~7 digits
FP64 64b 1 11 52 1023 ±1.8×10^308 ~15 digits
FP16 16b 1 5 10 15 ±65504 ~3 digits
BF16 16b 1 8 7 127 same as FP32 ~2 digits
FP8 E4M3 8b 1 4 3 7 ±448 ~1 digit
FP8 E5M2 8b 1 5 2 15 ±57344 ~0.5 digit

FP32 example — representing 0.15625:

0.15625 = 0.00101 in binary = 1.01 × 2^(−3)

S = 0
Exponent field = −3 + 127 = 124 = 0111 1100
Mantissa = 01 followed by 21 zeros

Bit pattern: 0  01111100  01000000000000000000000
             S  exponent  mantissa

Special values (every hardware engineer must know these):

Pattern Meaning
Exp=0, Mantissa=0 ±Zero
Exp=0, Mantissa≠0 Subnormal
Exp=255, Mantissa=0 ±Infinity
Exp=255, Mantissa≠0 NaN

Rounding modes (four are standard; round-to-nearest-even is default):

Exact result  →  nearest representable value
Ties (exactly halfway)  →  choose the one with even last bit

Rounding modes matter in AI because accumulated rounding error over millions of operations can cause training instability. Stochastic rounding (rounding up or down randomly, weighted by proximity) is used in some FP8 training flows.

1.4 AI Float Formats: FP16, BF16, FP8

Modern AI training and inference use reduced-precision formats to fit more data in SRAM and reduce multiply-add energy.

FP32   [S|EEEEEEEE|MMMMMMMMMMMMMMMMMMMMMMM]   32 bits
BF16   [S|EEEEEEEE|MMMMMMM]                   16 bits  (truncated FP32 mantissa)
FP16   [S|EEEEE|MMMMMMMMMM]                   16 bits  (different exponent width)
FP8    [S|EEEE|MMM]  or  [S|EEEEE|MM]          8 bits

BF16 vs FP16 trade-off:

Property FP16 BF16
Exponent bits 5 8 (same as FP32)
Range ±65504 ±3.4×10^38
Precision ~3 decimal digits ~2 decimal digits
FP32 conversion Needs range clamp Truncate 16 mantissa bits
Usage Inference, some training Mixed-precision training

FP8 — two variants for different roles:

  • E4M3 (4-bit exp, 3-bit mantissa): more precision, less range → weights and activations
  • E5M2 (5-bit exp, 2-bit mantissa): more range, less precision → gradients

FP8 training requires loss scaling and careful handling of overflow — the range is too small for raw gradients without scaling.

1.5 Fixed-Point and Quantization

Fixed-point represents Q = integer × 2^(−f) where f is the number of fractional bits.

Q8.8 format  (8 integer bits, 8 fractional bits, 16 bits total)

Value 3.75:
  Integer part:  3  = 0000 0011
  Fraction part: 0.75 = 0.11 in binary → stored as 1100 0000

Stored: 0000 0011  1100 0000  = 0x03C0

Quantization maps floating-point weights to INT8 for efficient inference:

INT8 = round(FP32_weight / scale)
scale = max(|weights|) / 127          ← per-tensor
scale = max(|row|) / 127              ← per-row (better quality)

Dequantize: FP32 ≈ INT8 × scale

The multiply-accumulate (MAC) still runs in INT32 (INT8 × INT8 → INT32 accumulate), then the result is scaled back. This is exactly what CUDA's Tensor Core IMMA instructions implement.

1.6 Error Detection and Correction

Parity — add one bit so the total number of 1s is even (even parity) or odd (odd parity):

Data: 1011 0110   →  five 1s  →  odd
Even parity bit: 1  (makes six 1s → even)
Transmitted: 1011 0110 1

Parity detects any single-bit error but cannot locate or correct it.

Hamming(7,4) — encode 4 data bits into 7 bits with 3 parity bits positioned at powers of 2:

Positions:  p1  p2  d1  p4  d2  d3  d4
            1   2   3   4   5   6   7

p1 covers positions 1,3,5,7  (odd positions)
p2 covers positions 2,3,6,7
p4 covers positions 4,5,6,7

Syndrome = (check p1)(check p2)(check p4) → 3-bit number = position of error

Hamming(7,4) corrects any single-bit error and detects any double-bit error. Extended Hamming adds a fourth parity bit to detect (not correct) double errors.

CRC (Cyclic Redundancy Check) — treats the data as a polynomial and divides by a generator polynomial:

Data polynomial D(x) × x^r  mod  G(x)  =  R(x)   (remainder = CRC)
Transmitted: D(x) concat R(x)
Receiver: divide received polynomial by G(x) — nonzero remainder means error

CRC-32 (used in Ethernet, ZIP, PNG) uses G(x) = x^32 + x^26 + x^23 + ... + 1. ECC DRAM uses a Hamming-based SECDED (Single Error Correct, Double Error Detect) code across the 64-bit data bus plus 8 ECC bits.


2. Boolean Algebra and Logic Gates

2.1 Gates and Truth Tables

Every digital function reduces to NAND gates (or NOR gates) — they are functionally complete.

AND gate:          OR gate:          NOT gate:         NAND gate:
  A──┐              A──┐              A──○──Y           A──┐
     ├──Y              ├──Y                             ├──○──Y
  B──┘              B──┘                             B──┘

Truth tables:

A B | AND  OR  NAND  NOR  XOR  XNOR
0 0 |  0   0    1    1    0    1
0 1 |  0   1    1    0    1    0
1 0 |  0   1    1    0    1    0
1 1 |  1   1    0    0    0    1

Universal gates — NAND implements everything:

NOT A    =  NAND(A, A)
A AND B  =  NAND(NAND(A,B), NAND(A,B))   i.e. NOT(NAND(A,B))
A OR B   =  NAND(NAND(A,A), NAND(B,B))   i.e. NOT(NOT A) NAND NOT(NOT B)

2.2 Boolean Laws

Law AND form OR form
Identity A · 1 = A A + 0 = A
Null A · 0 = 0 A + 1 = 1
Idempotent A · A = A A + A = A
Complement A · Ā = 0 A + Ā = 1
Double negative ¬(¬A) = A
Commutative A · B = B · A A + B = B + A
Associative (A·B)·C = A·(B·C) (A+B)+C = A+(B+C)
Distributive A·(B+C) = A·B + A·C A+(B·C) = (A+B)·(A+C)
De Morgan's ¬(A·B) = Ā + B̄ ¬(A+B) = Ā · B̄
Absorption A·(A+B) = A A + A·B = A

De Morgan's is the most used law in logic design — it lets you push negations through gates and convert AND↔OR.

2.3 Karnaugh Maps

K-maps group minterms visually to find the minimal sum-of-products (SOP) expression.

4-variable K-map layout (Gray code order — adjacent cells differ by one bit):

        CD
AB    00  01  11  10
  00 | 0 | 1 | 3 | 2 |
  01 | 4 | 5 | 7 | 6 |
  11 |12 |13 |15 |14 |
  10 | 8 | 9 |11 |10 |

Rules for grouping: - Group size must be a power of 2: 1, 2, 4, 8, 16 - Groups must be rectangular (wrap around edges) - Use the largest possible groups - Cover every 1; don't cover 0s - Each group eliminates the variables that change within it

Example — minimize F(A,B,C,D) = Σm(0,2,4,5,6,7,8,10,13,15):

        CD
AB    00  01  11  10
  00 | 1 | 0 | 0 | 1 |   →  Group of 4: cells 0,2,8,10 → B̄D̄
  01 | 1 | 1 | 1 | 1 |   →  Group of 4: cells 4,5,6,7 → AB̄
  11 | 0 | 1 | 1 | 0 |   →  Group of 2: cells 13,15 → ACD
  10 | 1 | 0 | 0 | 1 |

F = B̄D̄ + AB̄ + ACD

Without K-map: 10 minterms = up to 10 four-literal terms. K-map: 3 terms.

2.4 CMOS Implementation

CMOS (Complementary MOS) uses PMOS pull-up networks and NMOS pull-down networks. The two networks are duals of each other.

CMOS NAND gate (2-input):

        VDD
         |
    ─────┤  PMOS (A)     ← parallel pull-up
    │    └──┐
    └──────┤  PMOS (B)
            └─── Y
            ┌──
    ┌──────┤  NMOS (A)   ← series pull-down
    │    ┌──┘
    ─────┤  NMOS (B)
         |
        GND

Logic:  Y = NAND(A,B) = ¬(A · B)

If A=1 AND B=1:  both NMOS on (pull Y low), both PMOS off → Y = 0  ✓
If A=0 OR  B=0:  one PMOS on (pull Y high), NMOS chain broken → Y = 1  ✓

Key CMOS properties:

Property Value
Static power Near zero (only leakage)
Dynamic power C × V² × f (scales with switching activity)
Noise margin ~40% of VDD
Fan-out Theoretically unlimited (MOS gate is capacitive)
Inversion Every CMOS gate inverts (NAND, NOR, NOT)

Why only NAND/NOR? AND = NAND + NOT = two gate stages. An extra inversion costs delay and area. Modern synthesis tools always work in NAND/NOR/NOT internally.


3. Combinational Logic

3.1 Multiplexers and Decoders

MUX (2:1) — select one of two data inputs based on a control signal:

  D0 ──┐
        ├── Y   Y = S̄·D0 + S·D1
  D1 ──┘
        S (select)

  S=0 → Y=D0
  S=1 → Y=D1

4:1 MUX from two layers of 2:1 MUX:

  D0 ─┐                  S1=0: top 2:1 → D0 or D1
  D1 ─┤ 2:1 MUX ─┐       S1=1: bot 2:1 → D2 or D3
       └──(S0)    │
  D2 ─┐           ├─ 2:1 MUX ── Y
  D3 ─┤ 2:1 MUX ─┘         └──(S1)
       └──(S0)

Decoder (2-to-4) — activate exactly one of N outputs:

Inputs: A1, A0
Outputs: Y0..Y3

Y0 = Ā1·Ā0   (active when A=00)
Y1 = Ā1·A0   (active when A=01)
Y2 = A1·Ā0   (active when A=10)
Y3 = A1·A0   (active when A=11)

Decoders appear as row/column address decoders in every SRAM and register file.

3.2 Adders: Ripple-Carry vs Carry-Lookahead

Full adder (1-bit):

  A ──┐
  B ──┼── XOR ──────── Sum  =  A ⊕ B ⊕ Cin
  Cin─┘

  Carry-out  =  A·B  +  Cin·(A⊕B)
             =  A·B  +  Cin·(A XOR B)

Ripple-carry adder (RCA) — chain N full adders:

  A3 B3    A2 B2    A1 B1    A0 B0
   │  │     │  │     │  │     │  │
  ┌┴──┴┐   ┌┴──┴┐   ┌┴──┴┐   ┌┴──┴┐
  │ FA │◄──│ FA │◄──│ FA │◄──│ FA │◄── Cin=0
  └────┘   └────┘   └────┘   └────┘
  Cout S3     S2       S1       S0

Delay: N × t_FA   (linear in word width)
32-bit: ~32 × 0.5ns = 16ns at 1 GHz  (too slow for the critical path)

Carry-Lookahead Adder (CLA) — compute all carries in parallel:

Generate: Gi = Ai · Bi        (this bit produces a carry regardless of Cin)
Propagate: Pi = Ai ⊕ Bi      (this bit passes Cin to Cout)

C1 = G0 + P0·C0
C2 = G1 + P1·G0 + P1·P0·C0
C3 = G2 + P2·G1 + P2·P1·G0 + P2·P1·P0·C0
C4 = G3 + P3·G2 + P3·P2·G1 + P3·P2·P1·G0 + P3·P2·P1·P0·C0

All four carries computed in 2 gate delays (one AND, one OR).

Delay comparison:

Adder type Delay (N-bit) Area
Ripple-carry O(N) O(N)
Carry-lookahead O(log N) O(N log N)
Carry-select O(√N) O(N)
Kogge-Stone (tree) O(log N) O(N log N)

Modern ALUs use Kogge-Stone or Han-Carlson prefix trees to achieve O(log N) delay with regular, pipeable structure.

3.3 Multipliers and MAC Units

Array multiplier (4×4 unsigned):

           A3  A2  A1  A0
         ×  B3  B2  B1  B0
         ─────────────────
         A3B0 A2B0 A1B0 A0B0     ← partial product row 0 (AND gates)
    A3B1 A2B1 A1B1 A0B1          ← partial product row 1 (shift 1)
A3B2 A2B2 A1B2 A0B2              ← partial product row 2 (shift 2)
A3B3 A2B3 A1B3 A0B3              ← partial product row 3 (shift 3)
─────────────────────────────────
P7   P6   P5   P4   P3  P2  P1  P0

Partial products are summed with an adder tree (Wallace tree reduces rows in O(log N) delay).

MAC (Multiply-Accumulate):

ACC ← ACC + (A × B)

  A ─┐
     ├── MULTIPLIER ─┐
  B ─┘               ├── ADDER ── ACC ──┐
                      │                  │
             ACC ─────┘                  │
              ↑                          │
              └──────────────────────────┘

One MAC = one ×, one +, one register write

The MAC is the fundamental operation in matrix multiplication. A systolic array tiles N×N MACs to compute a full matrix product:

A[i,k] flows right →      B[k,j] flows down ↓

Each PE: acc += A × B, then pass A right and B down.
After N cycles, PE[i,j] holds C[i,j] = Σ_k A[i,k]×B[k,j].

3.4 ALU Design

A minimal ALU selects among operations via an opcode:

                  ┌──────────────────────────────────────────────┐
  A[31:0] ─────►  │  ADD   SUB   AND   OR   XOR   SLT   SHR   SHL │
  B[31:0] ─────►  │                                                │
                  └──────────────────┬───────────────────────────┘
                                     │ 8:1 MUX
                         OpCode[2:0] ┘
                                 Result[31:0]
                                 + Flags: Zero, Carry, Overflow, Negative

Flag bits:

Flag Meaning Use
Zero (Z) Result == 0 Branch equal/not-equal
Carry (C) Unsigned overflow / borrow Multi-word arithmetic
Overflow (V) Signed overflow Signed arithmetic check
Negative (N) Result MSB is 1 (negative in 2's C) Signed comparisons

3.5 Propagation Delay and Critical Path

Every gate has a propagation delay t_p — the time from input change to output stable.

Critical path: the longest delay path from any input to any output.
Minimum clock period = critical path delay + setup time + clock skew.

Example: 32-bit ripple-carry adder
  t_FA = t_XOR + t_AND + t_OR ≈ 3 × 100ps = 300ps per stage
  Critical path = 32 × 300ps = 9.6ns  →  max freq ≈ 104 MHz

Same adder, 4-bit CLA blocks (4 blocks):
  CLA block delay ≈ 2 gate delays = 200ps
  Carry ripple between 4 blocks: 4 × 200ps = 800ps
  Sum out: 800ps + 300ps ≈ 1.1ns  →  max freq ≈ 900 MHz

Pipelining cuts the critical path by adding registers mid-path:

Before pipelining (one long combinational path):
  Input → [Stage A: 3ns] → [Stage B: 4ns] → [Stage C: 2ns] → Output
  Throughput: 1 result / 9ns  →  ~111 MHz

After pipelining (registers between stages):
  Input → [A: 3ns] → REG → [B: 4ns] → REG → [C: 2ns] → Output
  Clock period = max(3, 4, 2) + setup ≈ 4.2ns  →  ~238 MHz
  Throughput: 1 result / 4.2ns  (latency is 3× longer, throughput ~2.1× better)

4. Sequential Logic

4.1 D Flip-Flop

The D flip-flop (DFF) is the canonical memory element. It captures D on the rising clock edge.

       D ──┤D    Q├── Q
    CLK ──┤CLK   │
           │    Q̄├── Q̄ (complement)
           └──────┘

Timing:
                ___     ___     ___
  CLK:      ___|   |___|   |___|   |___
                     ↑           ↑       ← capture edges
  D:    ──[stable]──[stable]──[stable]──
  Q:    ──────────[D₀]─────[D₁]────────
                     ^ after t_clk-to-q

Critical timing parameters:

Parameter Symbol Definition
Setup time t_su D must be stable this long BEFORE clock edge
Hold time t_h D must be stable this long AFTER clock edge
Clock-to-Q t_cq Time from clock edge to Q valid
Propagation t_p Combinational logic delay between DFFs

Setup time violation (D changes too close to clock edge → Q becomes metastable):

  CLK:    _____|‾‾‾|_____
  D:      ────[old]X[new]    ← X = transition inside setup window
  Q:      ──────────[???]    ← metastable! may settle to wrong value

Minimum period constraint: t_clk ≥ t_cq + t_p + t_su

4.2 Registers, Shift Registers, and LFSR

N-bit register — N DFFs sharing a clock and optional load-enable:

  D[N-1:0] ──► [DFF] [DFF] ... [DFF] ──► Q[N-1:0]
                CLK shared across all

Shift register — DFFs chained; data shifts one position per cycle:

  D_in → [DFF₀] → [DFF₁] → [DFF₂] → [DFF₃] → D_out
           Q₀       Q₁       Q₂       Q₃

  Use: serial-to-parallel conversion, FIFO, delay line, CRC computation

LFSR (Linear Feedback Shift Register) — XOR taps from specific positions feed back to input:

  Fibonacci LFSR, polynomial x⁴ + x³ + 1 (taps at positions 4 and 3):

  ┌─── XOR ◄──────────────────────────────┐
  │    ↑                                   │
  │  [DFF₄] ← [DFF₃] ← [DFF₂] ← [DFF₁]  │
  │    Q₄       Q₃       Q₂       Q₁      │
  │                                        │
  └──────────────────────────────── Q₄ → output

  Period: 2⁴ - 1 = 15 (maximal length for this polynomial)
  Output sequence: 1111, 0111, 1011, 1101, 0110, 1010, 0101, 1010, ...

LFSRs are used in: pseudo-random number generation, BIST (Built-In Self-Test), spread spectrum clocking, CRC hardware, and scrambling in PCIe/USB.

4.3 Finite State Machines

Two types:

Type Output depends on Outputs change
Moore State only After clock edge
Mealy State AND current inputs Immediately

Traffic light FSM (Moore) — 4 states:

States: GREEN_NS, YELLOW_NS, GREEN_EW, YELLOW_EW
Input:  timer_expired (1 = time to change)

         timer_expired
GREEN_NS ──────────────► YELLOW_NS
   ↑                          │
   │                   always │
   │                          ▼
YELLOW_EW ◄──────── GREEN_EW
              always

HDL-style state register (synthesizable pattern):

// State register — always sequential
always @(posedge clk or posedge rst) begin
    if (rst)  state <= GREEN_NS;
    else      state <= next_state;
end

// Next-state logic — always combinational
always @(*) begin
    case (state)
        GREEN_NS:  next_state = timer_expired ? YELLOW_NS : GREEN_NS;
        YELLOW_NS: next_state = GREEN_EW;
        GREEN_EW:  next_state = timer_expired ? YELLOW_EW : GREEN_EW;
        YELLOW_EW: next_state = GREEN_NS;
        default:   next_state = GREEN_NS;
    endcase
end

// Output logic — Moore: outputs depend on state only
assign light_ns = (state == GREEN_NS)  ? GREEN  :
                  (state == YELLOW_NS) ? YELLOW : RED;
assign light_ew = (state == GREEN_EW)  ? GREEN  :
                  (state == YELLOW_EW) ? YELLOW : RED;

State encoding strategies:

Encoding Bits needed Power Speed Use case
Binary log₂(N) Low Moderate Area-constrained FPGAs, ASICs
One-hot N High Fast FPGA (abundant flip-flops)
Gray code log₂(N) Low Moderate Reduces glitches in output logic

AI hardware connection: the control unit inside a GPU SM is a large FSM. The warp scheduler tracks warp state (ready, waiting on memory, waiting on dependency) and transitions each warp on every cycle. Up to 64 warps = 64 simultaneous FSM instances running in parallel.

4.4 Timing Analysis and Critical Path

Setup analysis — find the critical path across all register-to-register paths:

  REG A → [combo logic, delay t₁] → REG B

  Setup check: t₁ + t_su ≤ t_clk - t_skew
  (data must arrive before clock arrives at destination minus skew)

Hold analysis — ensure data doesn't change too fast:

  Hold check: t_cq + t_min_combo ≥ t_h + t_skew
  (data must stay stable long enough after clock edge)

Clock skew — the difference in clock arrival time at two flip-flops:

   CLK source
   ┌──┴──┐
   │     │   ← routing delays differ
 REG A  REG B
  CLK   CLK
  t₁    t₂     skew = t₂ - t₁

  Positive skew (t₂ > t₁): relaxes setup, tightens hold
  Negative skew (t₂ < t₁): tightens setup, relaxes hold

Clock gating — disable the clock to a register bank when idle to save dynamic power:

  CLK ───┐
          ├── AND ──► gated CLK → register bank
  EN  ───┘

  Saves: C_register × V² × f per gated cycle
  Risk: glitches if EN changes while CLK=1 (use latch-based clock gate)

5. Memory Technologies

5.1 SRAM — Static Random Access Memory

SRAM stores a bit in a cross-coupled inverter pair (the 6T cell):

        VDD
    ─────┼──────────────────────────────
    │    PMOS₁                PMOS₂   │
    │      │                    │     │
    │      ├── Q ─────── Q̄ ───┤      │
    │    NMOS₁       ┌───┐  NMOS₂    │
    │      │         │INV│    │      │
    │      │         └───┘    │      │
    ─────NMOS₃            NMOS₄──────
    │      │                    │    │
    BL     WL─────────────────WL    BL̄

6T SRAM cell:
  - 2 PMOS (pull-up)
  - 2 NMOS (pull-down) forming the latch
  - 2 NMOS access transistors (NMOS₃, NMOS₄) controlled by Word Line

Operation:

Phase WL BL BL̄ Action
Standby 0 Pre Pre Latch holds state; access transistors off
Read 1 Pre Pre Cell discharges one BL; sense amp detects
Write 1 D Driver overrides latch; cell flips

SRAM characteristics:

Property Value
Access time 0.5–5 ns (L1: ~1ns, LLC: ~10ns)
Density 6T cell ≈ 120–150 F² (F = feature size)
Power Static leakage + dynamic read/write
Retention Holds data as long as VDD applied
On-chip use Registers, caches, scratchpad, FIFOs

5.2 DRAM — Dynamic RAM

DRAM stores charge on a capacitor — one transistor, one capacitor (1T1C):

       BL
       NMOS  ← access transistor (WL controls)
       CAP   ← ~20–30 fF capacitor
       GND

Charged cap (≥ Vth/2) = logic 1
Discharged cap (≤ Vth/2) = logic 0

Key differences from SRAM:

Property SRAM DRAM
Cell size 6 transistors 1 transistor + 1 capacitor
Density Low 4–8× higher than SRAM
Speed ~1–10 ns ~50–100 ns (row access)
Refresh Not needed Every ~64 ms (capacitor leaks)
Destructive read No Yes (must restore after read)
Cost/GB ~50–100× more than DRAM Baseline

DRAM access sequence:

1. Row activate (RAS): open one row (wordline) into row buffer (~45 ns)
2. Column access (CAS): select columns from row buffer (~15 ns)
3. Precharge: close row, precharge bit lines
4. Refresh: periodically rewrite all rows before charge leaks away

DRAM bandwidth = (bus width × frequency × burst length) / 8
DDR5-6400: 64-bit × 3200 MT/s × burst-8 = 51.2 GB/s per channel

5.3 HBM — High Bandwidth Memory

HBM stacks DRAM dies vertically on silicon interposer, connected via Through-Silicon Vias (TSVs):

  GPU Die ──────────────────────────── HBM stack
              Silicon interposer

  HBM stack cross-section:

    ┌─────────────┐   ← DRAM die 4
    ├─────────────┤   ← DRAM die 3
    ├─────────────┤   ← DRAM die 2
    ├─────────────┤   ← DRAM die 1
    └─────────────┘   ← Base die (logic)
           │││         ← TSVs (Through-Silicon Vias)
    ─────────────────  ← Silicon interposer
           │││
    ┌─────────────┐   ← GPU/accelerator die

HBM vs GDDR6 vs DDR5:

Spec DDR5-6400 GDDR6X HBM3
Bus width/stack 64-bit 32-bit 1024-bit
Bandwidth/stack 51 GB/s 96 GB/s ~900 GB/s
Capacity/stack Unlimited 16–24 GB 16–48 GB
Power efficiency Moderate Moderate High
On-package No No Yes (interposer)
Used in CPU systems Gaming GPUs AI GPUs/TPUs

A100 has 80 GB HBM2e across 5 stacks at ~2 TB/s. H100 SXM uses HBM3 at ~3.35 TB/s. The interconnect between the GPU die and HBM is the single biggest memory bandwidth bottleneck in AI compute.

5.4 Flash Memory

Flash stores charge in a floating gate or charge trap layer — no power needed for retention.

NAND Flash cell (floating gate):

    Control gate (CG)
    ─────┼────   ← tunnel oxide
    Floating gate (FG) ← charge stored here
    ─────┼────   ← gate oxide
    Source           Drain
       │                │
       └─────── ────────┘
                Channel

Writing: tunnel electrons onto FG (Fowler-Nordheim tunneling) → raises Vt → reads as 0
Erasing: remove electrons from FG → lowers Vt → reads as 1

Flash cell types by bits per cell:

Type Bits/cell Voltage levels Endurance Speed Use
SLC 1 2 100K P/E Fast Enterprise SSD
MLC 2 4 10K P/E Medium Consumer SSD
TLC 3 8 1K P/E Slower Bulk storage
QLC 4 16 300 P/E Slow Archival, cold data

Flash is not byte-addressable for writes — writes go to a page (4–16 KB), erasures go to a block (128–512 pages). This asymmetry drives the Flash Translation Layer (FTL) in SSD controllers.

5.5 Cache Hierarchy

A cache exploits spatial and temporal locality to bridge the speed gap between CPU/GPU registers and main memory.

GPU/CPU Memory Hierarchy (approximate, H100/server class):

Registers          ~0.2 ns   ~256 KB     per SM (64K 32-bit regs × 108 SMs)
L1 / Shared mem   ~5 ns     ~228 KB     per SM (configurable split)
L2 cache          ~50 ns    ~50 MB      on-chip, shared
HBM3              ~100 ns   ~80 GB      off-chip, on-package
System DDR5       ~200 ns   ~TBs        off-package
NVMe SSD          ~100 μs   ~TBs        PCIe storage

Bandwidth (rough):
  Register file   → FPU:         100+ TB/s
  L1/Shared mem  → register:    ~15 TB/s per SM
  L2              → SM:          ~12 TB/s total
  HBM3            → chip:        ~3.35 TB/s
  PCIe5 x16       → host:        ~64 GB/s

Cache organization:

Direct-mapped (1-way):     Set-associative (4-way):
  Tag | Index | Offset       Tag | Index | Offset

  One slot per set          Four slots per set (4 ways)
  Fast, simple              Fewer conflict misses

  Index selects row         Index selects set, all 4 ways checked in parallel
  Tag must match            Any way whose tag matches → hit

Replacement policies:

Policy Evicts Hardware cost
LRU Least recently used way O(log W) bits per set
PLRU Pseudo-LRU (tree approximation) W−1 bits per set
Random Random way LFSR counter
FIFO Oldest loaded way Pointer per set

GPU L1 uses a simplified LRU or PLRU. CPU last-level caches often use QLRU (quad-age LRU) at 16-way associativity.

5.6 On-Chip Scratchpad and Tiling

Unlike a cache (hardware-managed, transparent), shared memory in GPUs is software-managed SRAM — the programmer explicitly controls what lives there.

Why tiling matters — matrix multiply on GPU without tiling:

Naive: each thread loads A[row, k] and B[k, col] from HBM for every k

For M=N=K=1024, C=A×B:
  Loads from HBM: M × N × K × 2 (A and B) = 2 × 10⁹ float reads
  At 3.35 TB/s HBM bandwidth: 2 × 10⁹ × 4B / 3.35TB/s ≈ 2.4 ms
  Compute at 312 TFLOPS: 2 × 10⁹ FLOPs / 312 TFLOPS ≈ 0.006 ms
  → memory-bound by 400×

With tiling (shared memory):

Block size: 16×16 output tile
  Load tile of A (16×K_chunk) → shared memory    } one time per block
  Load tile of B (K_chunk×16) → shared memory    }
  Each thread computes its dot product from shared memory (fast)

Arithmetic intensity = FLOPs / bytes from HBM
  Naive: 2 FLOPs per 8 bytes (FMA = 2 ops, loads A and B) → 0.25 FLOP/byte
  Tiled: 2 × T FLOPs per (2 × T × 4) bytes → T/4 FLOP/byte (T = tile size)
  T=16: 4 FLOP/byte
  T=128: 32 FLOP/byte  →  compute-bound at H100 (312 TFLOPS / 3350 GB/s = 93)

Shared memory bank conflicts — 32 banks, each 4 bytes wide; threads in a warp access conflict-free if they hit different banks:

No conflict: thread i accesses address i × 4   (each thread → different bank)
2-way conflict: threads 0 and 16 both access bank 0 (serialized → 2× slower)
Broadcast: all threads read same address (one bank, no conflict → hardware broadcasts)

Fix: pad shared memory arrays by one element
  float tile[16][16+1];   ← +1 padding shifts each row to different bank

6. MIPS Processor Design

Building a processor from the components in sections 1–5 is the definitive digital design exercise. MIPS is the standard teaching ISA because its fixed 32-bit instruction format maps cleanly onto hardware. We build up from a single-cycle datapath through a 5-stage pipeline — the same pipeline that every modern processor (including GPU shader cores) descends from.

6.1 MIPS ISA Subset

A minimal core supports three instruction formats:

R-type (register):   [opcode 6][rs 5][rt 5][rd 5][shamt 5][funct 6]
I-type (immediate):  [opcode 6][rs 5][rt 5][imm 16]
J-type (jump):       [opcode 6][address 26]

Key instructions for the datapath:

Instruction Format Operation Example
add R rd = rs + rt add $t0, $s1, $s2
sub R rd = rs − rt sub $t0, $s1, $s2
and R rd = rs & rt and $t0, $s1, $s2
or R rd = rs | rt or $t0, $s1, $s2
slt R rd = (rs < rt) ? 1 : 0 slt $t0, $s1, $s2
lw I rt = Mem[rs + imm] lw $t0, 4($s1)
sw I Mem[rs + imm] = rt sw $t0, 4($s1)
beq I if (rs == rt) PC += imm<<2 beq $s1, $s2, L1
addi I rt = rs + sign_ext(imm) addi $t0, $s1, 10
j J PC = j target

Register file — 32 registers, $0 hardwired to zero:

$zero ($0)  = always 0          $t0-$t9  = temporaries
$at   ($1)  = assembler temp    $s0-$s7  = saved (callee-preserved)
$v0-$v1     = return values     $sp ($29)= stack pointer
$a0-$a3     = arguments         $ra ($31)= return address

6.2 Single-Cycle Datapath

Every instruction completes in one clock cycle. The datapath connects the building blocks from sections 3–5:

                            ┌────────────────────────────────────────────────────┐
                            │                    Control Unit                    │
                            │  (opcode, funct → RegDst, ALUSrc, MemtoReg,       │
                            │   RegWrite, MemRead, MemWrite, Branch, ALUOp, Jump)│
                            └──────┬──────┬──────┬──────┬──────┬──────┬─────────┘
                                   │      │      │      │      │      │
  ┌──────┐    ┌──────────┐   ┌─────┴──┐  MUX   MUX    │      │      │
  │      │    │ Register │   │        │   │     │      │      │      │
  │  PC  ├──►│  File    ├──►│  ALU   ├───┘     │    ┌─┴──┐   │      │
  │      │    │ (32×32)  │   │        │         └───►│Data│   │      │
  └──┬───┘    │          │   └────────┘              │Mem │   │      │
     │        │  rd1─────┼───► A                     │    │   │      │
     │        │  rd2─────┼───► B / MUX ◄── imm       │    │   │      │
     ▼        └──────────┘                           └────┘   │      │
 ┌──────┐                                                     │      │
 │Instr │                                                     │      │
 │ Mem  │◄────────────────────────────────────────────────────┘      │
 └──────┘                                                            │
     │                                                               │
     └─── PC+4 / branch target / jump target ◄──────────────────────┘

Datapath walk-through for lw $t0, 8($s1):

Cycle:
  1. Fetch:     Instr Mem[PC] → instruction word
  2. Decode:    rs=$s1(17), rt=$t0(8), imm=8
                Register File reads: rd1 = Reg[17] = value of $s1
  3. Execute:   ALU computes: $s1 + sign_ext(8) = address
  4. Memory:    Data Mem[address] → loaded word
  5. Writeback: MUX selects memory output → Reg[8] ($t0)
  6. PC update: PC ← PC + 4

All in ONE cycle. Clock period = longest path (lw: through Instr Mem + Reg File + ALU + Data Mem + MUX).

Control signals for each instruction type:

Signal R-type lw sw beq addi j
RegDst 1 (rd) 0 (rt) X X 0 (rt) X
ALUSrc 0 (rt) 1 (imm) 1 (imm) 0 (rt) 1 (imm) X
MemtoReg 0 (ALU) 1 (Mem) X X 0 (ALU) X
RegWrite 1 1 0 0 1 0
MemRead 0 1 0 0 0 0
MemWrite 0 0 1 0 0 0
Branch 0 0 0 1 0 0
Jump 0 0 0 0 0 1
ALUOp 10 00 00 01 00 XX

ALU control — two-level decode:

ALUOp (from main control) + funct field (from instruction) → ALU operation

ALUOp=00 → ADD  (lw, sw, addi: address calculation)
ALUOp=01 → SUB  (beq: compare by subtraction)
ALUOp=10 → look at funct:
             funct=100000 → ADD
             funct=100010 → SUB
             funct=100100 → AND
             funct=100101 → OR
             funct=101010 → SLT

6.3 Single-Cycle Timing Problem

The single-cycle design's clock period equals the slowest instruction:

lw critical path:
  Instr Mem  +  Reg Read  +  ALU  +  Data Mem  +  Reg Write MUX
    200ps    +   150ps    + 200ps +   200ps    +     25ps
  = 775ps  →  max freq ≈ 1.29 GHz

But EVERY instruction — even a simple add (375ps) — takes 775ps.
  add waste:  775 - 375 = 400ps idle per add

This wastes significant time on fast instructions. The solution: pipelining.

6.4 Five-Stage Pipeline

Break the datapath into 5 stages, each taking one clock cycle. Pipeline registers separate stages:

Stage 1       Stage 2        Stage 3       Stage 4        Stage 5
  IF             ID             EX            MEM             WB
┌──────┐    ┌──────────┐   ┌────────┐    ┌────────┐    ┌──────────┐
│Instr │    │ Register │   │        │    │  Data  │    │  Write   │
│Fetch │ ►REG► Decode  │►REG► ALU  │►REG►Memory │►REG► Back    │
│      │    │ + Read   │   │        │    │        │    │          │
└──────┘    └──────────┘   └────────┘    └────────┘    └──────────┘
  PC+4       rs, rt, imm   ALU result   Mem data       rd ← result
  instr      control sigs  branch addr

Clock period = max(stage delay) + register overhead
             = 200ps + 20ps = 220ps  →  max freq ≈ 4.5 GHz
Throughput = 1 instruction per 220ps (vs 775ps single-cycle → 3.5× faster)
Latency = 5 × 220ps = 1100ps per instruction (worse, but throughput wins)

Pipeline timing diagram — 5 instructions flowing through:

Time (cycles):  1     2     3     4     5     6     7     8     9
Instr 1:       IF    ID    EX    MEM   WB
Instr 2:             IF    ID    EX    MEM   WB
Instr 3:                   IF    ID    EX    MEM   WB
Instr 4:                         IF    ID    EX    MEM   WB
Instr 5:                               IF    ID    EX    MEM   WB

After cycle 5, one instruction completes EVERY cycle.
Steady-state CPI = 1 (Cycles Per Instruction).

6.5 Pipeline Hazards

Three types of hazards break the ideal CPI = 1:

Data Hazards

A later instruction reads a register that an earlier instruction hasn't written yet:

add $s0, $t0, $t1    # WB writes $s0 in cycle 5
sub $t2, $s0, $t3    # ID reads $s0 in cycle 3 — STALE value!

Time:      1     2     3     4     5
add:      IF    ID    EX    MEM   WB ← $s0 written here
sub:            IF    ID ←reads $s0 here (old value!)

Solution 1 — Forwarding (bypassing): route the ALU result directly back to the ALU input without waiting for WB:

                       ┌──────── forward path ────────┐
                       │                               │
add:   IF    ID    EX ─┤─ MEM    WB                    │
sub:         IF    ID  ─┘─ EX ◄──┘  MEM    WB
                    ALU gets $s0 from EX/MEM register
                    instead of register file

Forwarding unit checks: if (EX/MEM.rd == ID/EX.rs) → forward EX/MEM.ALUresult

Solution 2 — Load-use hazard (forwarding can't fix):

lw  $s0, 0($t0)    # data available after MEM (cycle 4)
add $t2, $s0, $t1  # EX needs $s0 in cycle 3 — too early!

Must stall (insert bubble) for 1 cycle:

Time:      1     2     3     4     5     6
lw:       IF    ID    EX    MEM   WB
add:            IF    ID   stall  EX    MEM    WB
                       ↑          ↑
                   hazard      forward from MEM/WB register
                  detected

The stall costs 1 cycle. Compilers reorder instructions to fill the load-delay slot when possible.

Control Hazards

Branch outcome isn't known until EX (or MEM), but the next instruction is already fetched:

beq $s0, $s1, target   # branch resolved in EX (cycle 3)
add ...                 # fetched in cycle 2 — might be wrong path!
or  ...                 # fetched in cycle 3 — might be wrong path!

Solutions:

Strategy Penalty Hardware cost Description
Stall (flush) 1–2 cycles Minimal Kill wrong-path instructions
Branch prediction 0 if correct Moderate Predict taken/not-taken
Delayed branch 0 Compiler Always execute next instruction (MIPS)
Early branch resolution 1 cycle MUX in ID Move comparison to ID stage

Static prediction: predict not-taken (sequential fetch). If wrong, flush 1 instruction.

Dynamic prediction (branch history table):

1-bit predictor: remember last outcome → predict same
  Problem: loops always mispredict twice (entering and exiting)

2-bit saturating counter:
  States: Strongly Taken (11) → Weakly Taken (10) → Weakly Not (01) → Strongly Not (00)
  Must mispredict twice to switch direction → better loop behavior

Structural Hazards

Two stages need the same hardware unit in the same cycle (e.g., single memory for both IF and MEM):

Solution: use separate instruction memory and data memory (Harvard architecture)
          or use a multi-ported cache

The classic MIPS pipeline uses separate instruction and data memories — no structural hazards in the base design.

6.6 Complete Pipelined Datapath

 ┌─────┐  IF/ID  ┌─────┐  ID/EX  ┌─────┐ EX/MEM ┌─────┐ MEM/WB ┌─────┐
 │     │  reg    │     │  reg    │     │  reg   │     │  reg   │     │
 │ IF  ├────────►│ ID  ├────────►│ EX  ├───────►│ MEM ├───────►│ WB  │
 │     │         │     │         │     │        │     │        │     │
 └──┬──┘         └──┬──┘         └──┬──┘        └─────┘        └──┬──┘
    │               │               │                              │
    │            ┌──┴──┐         ┌──┴──┐                           │
    │            │Hazard│        │Fwd  │                           │
    │            │Detect│        │Unit │◄──────────────────────────┘
    │            └──┬──┘        └─────┘         (forward paths)
    │               │
    │          stall/flush
  ┌─┴──────────────────────────┐
  │   PC MUX                    │
  │   (PC+4 / branch / jump)   │
  └────────────────────────────┘

Pipeline registers store ALL signals needed by later stages:
  IF/ID:  instruction, PC+4
  ID/EX:  control signals, rs data, rt data, sign-ext imm, rd/rt dest
  EX/MEM: control signals, ALU result, rt data (for sw), dest register
  MEM/WB: control signals, ALU result OR memory data, dest register

Forwarding unit logic (simplified):

// Forward from EX/MEM stage
if (EX/MEM.RegWrite && EX/MEM.rd != 0 && EX/MEM.rd == ID/EX.rs)
    ForwardA = EX/MEM.ALUresult

// Forward from MEM/WB stage
if (MEM/WB.RegWrite && MEM/WB.rd != 0 && MEM/WB.rd == ID/EX.rs
    && !(EX/MEM.RegWrite && EX/MEM.rd == ID/EX.rs))   // EX/MEM has priority
    ForwardA = MEM/WB.WriteData

// Same logic for ForwardB (rt input)

Hazard detection unit (load-use stall):

if (ID/EX.MemRead && (ID/EX.rt == IF/ID.rs || ID/EX.rt == IF/ID.rt))
    stall pipeline:
      - IF/ID register holds (re-decode same instruction)
      - insert NOP into ID/EX (bubble)
      - PC holds (re-fetch same PC+4)

6.7 Performance Metrics

Execution time = IC × CPI × T_clk

  IC    = Instruction Count (depends on ISA and compiler)
  CPI   = Cycles Per Instruction (depends on pipeline + hazards)
  T_clk = Clock period (depends on critical path in longest stage)

Single-cycle:  CPI = 1,  T_clk = 775ps    →  time/instr = 775ps
Pipelined:     CPI ≈ 1.2 (stalls + flushes), T_clk = 220ps
               →  time/instr = 264ps  →  ~2.9× speedup

Ideal speedup from N-stage pipeline = N (limited by hazard stalls)
5-stage MIPS: ideal 5×, typical 3–4× in practice

CPI breakdown example (gcc on MIPS):

Component CPI contribution
Base 1.0
Load-use stalls +0.05
Branch mispredicts +0.12
Cache misses +0.07
Total CPI 1.24

AI hardware connection: GPU shader cores use a much deeper pipeline (20+ stages in modern GPUs) but hide latency through massive thread-level parallelism (TLP) rather than branch prediction. When a warp stalls on a cache miss, the scheduler switches to another ready warp — same principle as pipeline interleaving, but across thousands of threads instead of individual instructions. The MIPS 5-stage pipeline is the conceptual foundation: once you understand forwarding, hazards, and CPI, the GPU's approach is a natural extension.


Resources

Resource Type Focus
Digital Design and Computer Architecture — Harris & Harris Textbook Gates through ISA, HDL examples
Computer Organization and Design — Patterson & Hennessy Textbook RISC-V, memory hierarchy, pipelining
Digital Design and Computer Architecture: ARM Edition — Harris & Harris Textbook Single-cycle + pipelined processor, HDL
Modern VLSI Design — Wolf Textbook CMOS, timing, place-and-route
CMOS VLSI Design — Weste & Harris Textbook Transistor-level design, layout, fabrication
Sam Zeloof (YouTube) Video DIY semiconductor fab — see the process hands-on
Chips and Cheese (chipsandcheese.com) Blog Deep dives into real chip microarchitectures
Nandland (nandland.com) Online FPGA/Verilog/VHDL hands-on
EEVblog / Ben Eater Video Intuition for digital circuits, breadboard CPU
JEDEC JESD79-5B (DDR5 spec) Spec DRAM timing parameters
JEDEC HBM3 spec Spec HBM architecture and interface
IEEE 754-2019 Standard Floating-point standard

Projects

Project Skills Outcome
Binary/BCD/Gray code converter Number systems, K-map Combinational logic in Verilog
32-bit CLA adder in Verilog Generate/propagate, prefix tree RTL design + timing analysis
IEEE 754 FP32 adder Alignment shift, normalization, rounding Understand FP hardware
INT8 MAC unit Fixed-point, two's complement, accumulation Core of quantized inference
LFSR-based PRNG Shift registers, XOR feedback, polynomial BIST, test stimulus generation
4-state FSM (Moore + Mealy) State encoding, next-state logic Control unit design pattern
Tiled matrix multiply (CUDA) Shared memory, bank conflict, tiling Connect theory to GPU programming
Single-cycle MIPS in Verilog Datapath, control unit, ALU, register file Complete processor from gates to ISA
Pipelined MIPS in Verilog Pipeline registers, forwarding, hazard detection Understand CPI, stalls, bypassing
Cache simulator (Python/C++) Direct-map vs set-assoc, LRU, miss rates Memory hierarchy intuition
RTL-to-GDS with OpenLane Synthesis, PnR, DRC/LVS, SkyWater PDK Experience the full tapeout flow (open-source)