Computer Architecture: From ISA to Microarchitecture
A comprehensive exploration of how modern processors work — from Instruction Set Architecture (ISA) through microarchitecture, pipelining, caching, and multi-core systems. Designed for engineers building AI accelerators, embedded systems, and custom hardware.
1. Instruction Set Architecture (ISA) Fundamentals
Core Concepts
What is an ISA?
- The contract between software and hardware: it defines the instruction formats, registers, memory access model, and addressing modes that a CPU implements.
- Hardware can change internally (microarchitecture), but as long as the ISA remains compatible, software runs unchanged.
- Examples: x86-64, ARM64 (AArch64), RISC-V, MIPS.
ISA Classification:
- RISC (Reduced Instruction Set Computer):
  - Load-store architecture: Only load/store instructions access memory; arithmetic operates on registers.
  - Simple, uniform instructions (typically 32 bits).
  - Fewer instructions, but easier to optimize and parallelize.
  - Examples: ARM, RISC-V, MIPS, PowerPC.
  - Advantage: Simpler hardware design, better for pipelining.
- CISC (Complex Instruction Set Computer):
  - Memory-oriented: Instructions can operate directly on memory.
  - Variable-length instructions (x86: 1-15 bytes).
  - Larger instruction set (100s of instructions).
  - Examples: x86, x86-64, VAX.
  - Advantage: Denser code (fewer instructions), reduces memory footprint.
  - Disadvantage: Complex decoder.
ISA Components:
- Registers: Storage locations (fast, on-chip).
  - General-purpose registers (GPRs): For data and addresses.
  - Special-purpose registers: Program counter (PC), stack pointer (SP), status register.
  - Floating-point registers: For FP arithmetic (separate in some ISAs, shared in others).
  - Example: ARM64 has 31 general-purpose 64-bit registers, plus a separate SP and PC.
- Data Types: Integer (8/16/32/64-bit), floating-point (FP32/FP64), vectors (SIMD).
- Instructions: Arithmetic (ADD, SUB, MUL), logic (AND, OR, XOR), memory (LOAD, STORE), control flow (BRANCH, JUMP), special (NOP, SYSCALL).
- Addressing Modes: How to specify operands.
  - Register: ADD R1, R2, R3 (add R2 and R3, store in R1).
  - Immediate: ADD R1, R2, 5 (add 5 to R2, store in R1).
  - Indirect: LOAD R1, [R2] (load from the memory address in R2).
  - Indexed: LOAD R1, [R2 + offset] (base + offset).
  - Auto-increment: LOAD R1, [R2++] (load from R2, then increment R2).
- Memory Model:
  - Byte-addressable: Each byte has a unique address (x86, ARM, RISC-V).
  - Endianness: Byte order (little-endian vs. big-endian). x86 is little-endian; ARM64 supports both byte orders for data but is run little-endian in practice.
- Instruction Encoding: How bits map to opcode, registers, immediates.
  - Fixed-length (RISC, e.g., 32 bits for ARM64): Simpler decode, easier to parallelize.
  - Variable-length (CISC, e.g., x86): Denser code, complex decode.
ISA Specification Examples:
- ARM64 (ARMv8 and later): 32-bit fixed-length instructions, 31 GPRs, load-store architecture, one shared register file for scalar floating-point and SIMD.
- x86-64: 1-15 byte variable-length instructions, 16 GPRs in 64-bit mode (8 in legacy 32-bit mode), register-memory operations (an ALU instruction can take one operand from memory), separate x87 FP and SIMD registers.
- RISC-V: Modular ISA with a base integer ISA (RV32I/RV64I) plus optional extensions (M=multiply/divide, F=single-precision float, D=double-precision float, etc.). Base instructions are 32 bits; the C extension adds 16-bit compressed encodings.
Instruction Formats & Encoding
ARM64 Example Instruction Formats:
Register-to-Register (arithmetic), ADD (shifted register):
Bits: 31=sf | 30=op | 29=S | 28-24=01011 | 23-22=shift | 21=0 | 20-16=Rm | 15-10=imm6 | 9-5=Rn | 4-0=Rd
ADD X0, X1, X2 → Fixed bits select the operation; Rd=X0 is the destination, Rn=X1 and Rm=X2 are the sources (encodes as 0x8B020020).
Load from Memory:
LDR X0, [X1] → Load from address in X1, store in X0.
Branch:
B label → Branch to address specified by label (26-bit word offset, giving a range of ±128 MB).
x86-64 Example:
89 C8 → MOV EAX, ECX (copy ECX to EAX)
48 89 C8 → MOV RAX, RCX (64-bit version; the REX.W prefix 0x48 selects 64-bit operands)
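As a rough check on the ARM64 register-to-register format, the sketch below pulls the fields out of one 32-bit instruction word. It assumes the A64 ADD/SUB (shifted register) layout (sf, op, S, the fixed bits 01011, shift, Rm, imm6, Rn, Rd); the function name `decode_add_shifted_register` is invented for this illustration.

```python
def decode_add_shifted_register(word: int) -> dict:
    """Split a 32-bit A64 ADD/SUB (shifted register) word into its fields."""
    # Bits 28-24 must be 01011 and bit 21 must be 0 for this instruction class.
    if (word >> 24) & 0x1F != 0b01011 or (word >> 21) & 1 != 0:
        raise ValueError("not an ADD/SUB (shifted register) encoding")
    return {
        "sf": (word >> 31) & 1,       # 1 = 64-bit X registers, 0 = 32-bit W registers
        "op": (word >> 30) & 1,       # 0 = ADD, 1 = SUB
        "S": (word >> 29) & 1,        # 1 = also update the condition flags
        "shift": (word >> 22) & 0x3,  # shift type applied to Rm
        "Rm": (word >> 16) & 0x1F,    # second source register
        "imm6": (word >> 10) & 0x3F,  # shift amount
        "Rn": (word >> 5) & 0x1F,     # first source register
        "Rd": word & 0x1F,            # destination register
    }


fields = decode_add_shifted_register(0x8B020020)  # ADD X0, X1, X2
print(fields)
```

Every field sits at a fixed bit position, which is why a fixed-length decoder can extract register numbers before it even knows the exact operation.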
Resources
- "Computer Organization and Design: ARM Edition" by Patterson & Hennessy: ISA fundamentals with ARM64 focus.
- "ARM Architecture Reference Manual (ARMv8 & ARMv9)": Official ISA specification.
- "RISC-V ISA Specification": Official open-source ISA guide.
- "x86-64 Application Binary Interface (ABI)": System V ABI for x86-64.
2. CPU Design & Microarchitecture
From ISA to Hardware
Microarchitecture = How the ISA is implemented in silicon.
The same ISA can have multiple microarchitectures with different performance/power/area characteristics:
- Apple M4: Efficient, high-performance ARM64 implementation (wide, OoO).
- Cortex-A53: Budget ARM64 in-order implementation.
- Both execute the same ARM64 instructions but with vastly different performance.
Single-Cycle Datapath
Concept: The simplest CPU executes each instruction in a single (long) clock cycle.
┌─────────────────────────────────────────────────────────────┐
│ Instruction Fetch (IF) │
│ PC += 4; read instruction from memory │
├─────────────────────────────────────────────────────────────┤
│ Register Fetch (RF), then Execute (EX) │
│ Reads source registers; ALU computes results and addresses │
├─────────────────────────────────────────────────────────────┤
│ Memory (MEM) │
│ Load/Store operations │
├─────────────────────────────────────────────────────────────┤
│ Write-Back (WB) │
│ Store result in register │
└─────────────────────────────────────────────────────────────┘
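Problems:
- Critical path: The entire datapath (fetch + execute + memory + write-back) must complete in one clock cycle, so the slowest instruction (typically a load) sets the clock period for every instruction.
- Poor utilization: Each unit is busy for only a fraction of the long cycle, and no unit can be reused within an instruction, so the design needs separate instruction and data memories and dedicated adders.
3. Pipelining
Concept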
Pipeline = Break execution into stages; multiple instructions progress in parallel, each at a different stage.
Cycle:   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
Instr 1: | IF  | RF  | EX  | MEM | WB  |     |     |     |
Instr 2: |     | IF  | RF  | EX  | MEM | WB  |     |     |
Instr 3: |     |     | IF  | RF  | EX  | MEM | WB  |     |
Instr 4: |     |     |     | IF  | RF  | EX  | MEM | WB  |
Advantages:
- Throughput: ~1 instruction per cycle in steady state (instead of 5 cycles per instruction).
- Frequency: Each stage is shorter; clock can run faster.
Disadvantages:
- Pipeline Hazards: Data, control, and structural hazards cause stalls.
- Latency: Each instruction still takes 5 cycles from fetch to retirement; pipelining improves throughput, not latency.
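The throughput claim can be made concrete with a small calculation. This is a minimal sketch assuming an ideal k-stage pipeline in which a new instruction enters every cycle; the function names and the `stall_cycles` parameter are illustrative, not taken from any tool.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Each instruction occupies the whole datapath for n_stages cycles."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int = 5, stall_cycles: int = 0) -> int:
    """The first instruction needs n_stages cycles; each later one finishes one cycle after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


for n in (4, 1000):
    speedup = unpipelined_cycles(n) / pipelined_cycles(n)
    print(f"{n} instructions: {pipelined_cycles(n)} cycles pipelined, speedup {speedup:.2f}x")
```

For 4 instructions the pipeline needs 8 cycles; for long instruction streams the speedup approaches the stage count (5x), provided stalls stay rare.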
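Types of Hazards
1. Data Hazards
A later instruction needs a result that an earlier instruction has not yet written back (read-after-write). For example, ADD R1, R2, R3 followed immediately by SUB R4, R1, R5 would read R1 before the ADD reaches write-back.
Solutions:
- Forwarding (Bypass): Route the EX result directly to the next instruction's input.
- Stalling: Insert a bubble (NOP) to delay the dependent instruction.
- Out-of-order execution: Execute independent instructions first.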
2. Control Hazards
Branch instruction changes PC; next instruction unknown until branch resolves:
BEQ R1, Label # Resolved in EX (cycle 3)
AND R5, R6, R7 # What instruction goes here? Fetched in cycle 2!
Solutions:
- Branch Prediction: Guess next instruction (covered in Section 7).
- Delay Slots (MIPS, older ISAs): Instruction after branch always executes.
3. Structural Hazards
Two instructions need the same hardware resource in the same cycle (single memory port, single ALU, etc.).
Solution: Duplicate resources or serialize access.
Pipeline Stalling & Forwarding
Example: 5-Stage Pipeline (IF → RF → EX → MEM → WB)
Data Dependency Problem:
Cycle 1: LDR X0, [X1] # Load X0 from memory
Cycle 2: ADD X2, X0, X3 # Use X0 (not yet in register!)
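Without forwarding: Stall 2 cycles until X0 has been written back (assuming the register file is written in the first half of a cycle and read in the second half).
With forwarding: The loaded value is routed from the end of the MEM stage directly to the ADD's EX input, but the load-use dependency still costs 1 stall cycle because the data does not exist until MEM completes.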
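The stall counts for a read-after-write dependency can be derived from stage timing. The sketch below assumes the classic 5-stage pipeline (IF, RF, EX, MEM, WB), a register file that is written in the first half of a cycle and read in the second half, and full forwarding paths into EX when `forwarding` is true; `raw_stall_cycles` is a hypothetical helper name.

```python
def raw_stall_cycles(distance: int, producer_is_load: bool, forwarding: bool) -> int:
    """Stall cycles for a consumer issued `distance` instructions after its producer.

    distance=1 means the consumer immediately follows the producer.
    """
    if distance < 1:
        raise ValueError("consumer must come after the producer")
    if not forwarding:
        # The consumer's RF stage may not come before the producer's WB stage,
        # which is three stages later than the producer's own RF stage.
        return max(0, 3 - distance)
    if producer_is_load:
        # Loaded data exists only at the end of MEM, one stage after EX.
        return max(0, 2 - distance)
    # ALU results are forwarded from the end of EX straight into the next EX.
    return 0


print(raw_stall_cycles(1, producer_is_load=True, forwarding=False))  # LDR then ADD, no forwarding
print(raw_stall_cycles(1, producer_is_load=True, forwarding=True))   # LDR then ADD, with forwarding
print(raw_stall_cycles(1, producer_is_load=False, forwarding=True))  # ADD then ADD, with forwarding
```

Under these assumptions the load followed by a dependent ADD costs 2 stall cycles without forwarding and 1 with it, while back-to-back ALU instructions need none.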
4. Superscalar Architecture
Concept
Superscalar = Fetch & execute multiple instructions per cycle, not just one.
Cycle 1:
IF: Fetch instr 1, 2, 3 (3-wide fetch)
EX: Instr 4, 5 can execute in parallel (if independent)
MEM: Instr 6, 7 can access memory in parallel
WB: Instr 8, 9, 10 write results
Width: 2-wide (2 instructions/cycle), 4-wide, 6-wide, and wider (recent high-end cores, such as Apple's M-series P-cores, are reported to decode 8 or more instructions per cycle).
Requirements:
1. Multiple instruction fetch/decode units.
2. Multiple functional units (ALUs, FP units, load/store units).
3. A larger, multi-ported register file (to avoid contention).
4. Renaming hardware to avoid false dependencies.
Limitations
- Instruction-level parallelism (ILP) limit: Code may not have 4+ independent instructions per cycle.
- Complexity: Dependency checking, the bypass network, and register-file ports grow roughly quadratically with width, so each extra slot costs more and returns less.
- Power consumption: More hardware = more power.
Real-world widths (decode width; figures for undocumented cores are third-party estimates):
- Intel Core (x86): 4-8 wide depending on generation.
- ARM Cortex-A72: 3-wide.
- Apple M4: reported as 8-wide or wider.
5. Memory Hierarchy & Caching
The Memory Hierarchy
┌─────────────────────────────────────────────────────────┐
│ Registers (CPU) │ ~1 cycle access, KB size │
├─────────────────────────────────────────────────────────┤
│ L1 Cache (on-chip) │ ~4 cycles, 32-64 KB │
├─────────────────────────────────────────────────────────┤
│ L2 Cache (on-chip) │ ~12 cycles, 256 KB - 1 MB │
├─────────────────────────────────────────────────────────┤
│ L3 Cache (on-chip, shared)│ ~40 cycles, 8-32 MB │
├─────────────────────────────────────────────────────────┤
│ RAM (off-chip) │ ~100-200 cycles, GB size │
├─────────────────────────────────────────────────────────┤
│ Disk/SSD (storage) │ ms latency, TB size │
└─────────────────────────────────────────────────────────┘
Cache Organization
Direct-Mapped Cache:
Address: [Tag (high bits)] [Index (middle bits)] [Offset (low bits)]
↓ ↓
Compare Tag Selects cache line
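Memory Address 0x12345678:
Tag: 0x123, Index: 0x45, Offset: 0x678
(An illustrative split on hex-digit boundaries: 12 tag bits, 8 index bits, 12 offset bits. Real caches usually have 64-byte lines, i.e. 6 offset bits.)
→ Cache line at index 0x45 must have tag 0x123 to hit.
- Pros: Simple, fast.
- Cons: Conflict misses. Addresses that share an index evict each other even when the rest of the cache is empty, so alternating between two such lines misses every time.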
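The tag/index/offset split is plain bit slicing. The helper below is a minimal sketch; `split_address` and its parameters are names chosen for this example. It reproduces the hex-digit split used above and then the split for an assumed, more typical geometry of 64-byte lines and 256 sets.

```python
def split_address(addr: int, offset_bits: int, index_bits: int) -> tuple[int, int, int]:
    """Return (tag, index, offset) for a cache with the given field widths."""
    offset = addr & ((1 << offset_bits) - 1)                  # low bits: byte within the line
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)   # middle bits: which line/set
    tag = addr >> (offset_bits + index_bits)                  # remaining high bits
    return tag, index, offset


# The illustrative split from the text: 12 offset bits, 8 index bits.
print([hex(x) for x in split_address(0x12345678, offset_bits=12, index_bits=8)])

# Assumed realistic geometry: 64-byte lines (6 offset bits), 256 sets (8 index bits).
print([hex(x) for x in split_address(0x12345678, offset_bits=6, index_bits=8)])
```

Two addresses conflict in a direct-mapped cache exactly when they produce the same index but different tags.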
Set-Associative Cache:
Address → [Index] selects SET (N-way set)
Cache checks all N entries in set for tag match
(N = 2, 4, 8 for 2-way, 4-way, 8-way associativity)
- Pros: Better conflict avoidance than direct-mapped.
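- Cons: All N tags in the set are compared in parallel, which costs more power and a slightly slower lookup than direct-mapped.
Fully-Associative Cache:
- A line can be placed anywhere; every tag is checked in parallel.
- Ideal hit rate for a given capacity, but expensive (rarely used except for small structures such as TLBs).
Cache Policies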
Replacement Policies (what to evict when full):
- LRU (Least Recently Used): Evict the line that has gone unused the longest. Good for temporal locality.
- FIFO: Evict the oldest line. Simpler hardware.
- Random: Pseudo-random eviction. Low overhead.
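To see how associativity and LRU replacement interact, here is a small simulator sketch. The class `SetAssociativeCache` and its parameters are illustrative assumptions; it tracks only tags (no data) and uses true LRU within each set. Setting `ways=1` gives a direct-mapped cache.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tag-only cache model with LRU replacement inside each set."""

    def __init__(self, num_sets: int, ways: int, line_bytes: int = 64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One OrderedDict per set; insertion order doubles as recency order.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Access one address; return True on a hit."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)       # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)    # evict the least recently used tag
        lines[tag] = True
        return False


# Two addresses that collide in a 4-line cache, accessed alternately.
trace = [0x000, 0x100, 0x000, 0x100]

direct = SetAssociativeCache(num_sets=4, ways=1)
two_way = SetAssociativeCache(num_sets=2, ways=2)  # same capacity, 2-way
for a in trace:
    direct.access(a)
    two_way.access(a)
print("direct-mapped:", direct.hits, "hits,", direct.misses, "misses")
print("2-way:", two_way.hits, "hits,", two_way.misses, "misses")
```

With the same total capacity, the direct-mapped configuration misses on every access of this trace, while the 2-way configuration misses only on the two cold accesses.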
Write Policies:
- Write-Through: Write to cache AND memory on every store. Simple and keeps memory up to date, but uses more memory bandwidth.
- Write-Back: Write to cache only and mark the line dirty; flush to memory on eviction. Faster but more complex.
Cache Performance
Cache Hit Rate (rough figures; strongly workload-dependent):
- L1: 90-95% typical (tight working set).
- L2: varies widely; measured globally (hits in L1 or L2 over all accesses), 99%+ is common.
- L3: varies (90-99% depending on application).
Cache Misses:
1. Compulsory (Cold): First access to a line; must load from memory.
2. Capacity: Cache too small for the working set.
3. Conflict: Too many active lines map to the same set (e.g., two hot addresses sharing an index in a direct-mapped cache).
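Hit rates and latencies combine into an average memory access time (AMAT). The sketch below assumes each level's latency is an additional cost paid by every access that reaches that level, and that hit rates are local (measured over the accesses arriving at that level). The specific rates in the example are made-up inputs, not measurements.

```python
def amat(levels: list[tuple[float, float]], memory_latency: float) -> float:
    """Average memory access time in cycles.

    levels: (added_latency_cycles, local_hit_rate) per cache level, L1 first.
    memory_latency: added cycles for accesses that miss every cache level.
    """
    total = 0.0
    reach = 1.0  # fraction of all accesses that get as far as this level
    for latency, hit_rate in levels:
        total += reach * latency
        reach *= 1.0 - hit_rate
    return total + reach * memory_latency


# Hypothetical workload: L1 95%, L2 80%, L3 75% local hit rates.
hierarchy = [(4, 0.95), (12, 0.80), (40, 0.75)]
print(f"AMAT = {amat(hierarchy, memory_latency=200):.2f} cycles")
```

Even though DRAM costs 200 cycles in this example, only 0.25% of accesses reach it, so the average stays close to the L1 latency.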
Virtual Memory & TLB
Virtual Memory:
- Programs use virtual addresses (a 32-bit process sees a 4 GB address space, 0x00000000-0xFFFFFFFF).
- The OS maintains page tables that map virtual → physical addresses; the CPU's MMU performs the translation on every access.
- Allows: protection, isolation, larger address space than physical RAM.
Translation Lookaside Buffer (TLB):
- Cache for virtual → physical mappings.
- Typical: 4-way, 256-512 entries.
- Large TLB = fewer misses; small TLB = lower cost.
TLB Miss Penalty: 50-500 cycles to walk page tables and load TLB.
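Address translation with a TLB can be sketched in a few lines. This toy model assumes 4 KiB pages, a flat dictionary standing in for the page tables, and a fully-associative LRU TLB; `TinyMMU` and its fields are hypothetical names, and a real page walk traverses multiple table levels.

```python
from collections import OrderedDict

PAGE_BITS = 12  # assumed 4 KiB pages


class TinyMMU:
    """Toy MMU: a dict page table fronted by a small LRU TLB."""

    def __init__(self, page_table: dict[int, int], tlb_entries: int = 4):
        self.page_table = page_table      # virtual page number -> physical frame number
        self.tlb_entries = tlb_entries
        self.tlb = OrderedDict()
        self.tlb_hits = 0
        self.tlb_misses = 0

    def translate(self, vaddr: int) -> int:
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        if vpn in self.tlb:
            self.tlb_hits += 1
            self.tlb.move_to_end(vpn)
        else:
            self.tlb_misses += 1
            pfn = self.page_table[vpn]    # "page walk"; KeyError models a page fault
            if len(self.tlb) >= self.tlb_entries:
                self.tlb.popitem(last=False)  # evict least recently used entry
            self.tlb[vpn] = pfn
        return (self.tlb[vpn] << PAGE_BITS) | offset


mmu = TinyMMU({0x1: 0x80, 0x2: 0x81})
print(hex(mmu.translate(0x1ABC)))  # first touch of page 1: TLB miss
print(hex(mmu.translate(0x1010)))  # same page again: TLB hit
print(mmu.tlb_hits, mmu.tlb_misses)
```

The page offset passes through unchanged; only the page number is translated, which is why one TLB entry covers every byte in a page.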
Resources
- "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson: Cache hierarchy, policies, and design.
- "What Every Programmer Should Know About Memory" by Ulrich Drepper: Practical cache optimizations.
- Cache Simulation Tools (Dinero IV, CACTI): Model cache behaviors.
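6. Out-of-Order (OoO) Execution
Motivation
In-order execution bottleneck: If an ADD is stalled waiting for an operand (for example, a load that missed in the cache), an independent SUB behind it must wait as well, even though its inputs are ready.
OoO solution: Execute the SUB without waiting for the ADD: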
Fetch: ADD, SUB (in order)
Rename: Assign temporary registers (remove false dependencies)
Execute: SUB first (if hardware available)
Retire: In original order
Key Components
1. Instruction Window (Re-Order Buffer / Reservation Station)
- Decouples fetch/decode (in-order) from execution (out-of-order).
- Holds 32-256 instructions waiting to execute.
- Larger window = more parallelism potential, but more complexity.
2. Register Renaming
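Original code:
ADD R1, R2, R3
SUB R1, R4, R5 (also writes R1: a write-after-write hazard, not a true data dependency)
Renamed (internal):
ADD P8, P2, P3 (R1 now maps to physical reg P8)
SUB P9, P4, P5 (R1 now maps to P9; sources are unchanged)
No false dependency: P8 and P9 are different registers, so the two instructions can execute in either order.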
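Renaming boils down to a map table plus a free list. The following is a minimal sketch under simplifying assumptions: 8 architectural registers initially mapped to P0-P7, a free list starting at P8, and no recycling of physical registers at retirement. The function `rename` is an illustrative name.

```python
def rename(instructions, num_arch_regs: int = 8, num_phys_regs: int = 16):
    """Rename (op, dst, src1, src2) tuples from architectural to physical registers."""
    map_table = {f"R{i}": f"P{i}" for i in range(num_arch_regs)}
    free_list = [f"P{i}" for i in range(num_arch_regs, num_phys_regs)]
    renamed = []
    for op, dst, src1, src2 in instructions:
        # Sources are looked up BEFORE the destination is remapped, so an
        # instruction such as ADD R1, R1, R2 reads the old value of R1.
        p1, p2 = map_table[src1], map_table[src2]
        # Every write gets a fresh physical register; an empty free list
        # raises IndexError here, where real hardware would stall.
        pd = free_list.pop(0)
        map_table[dst] = pd
        renamed.append((op, pd, p1, p2))
    return renamed


program = [
    ("ADD", "R1", "R2", "R3"),
    ("SUB", "R1", "R4", "R5"),  # WAW on R1 with the ADD
    ("MUL", "R6", "R1", "R2"),  # true dependency on the SUB's R1
]
for instr in rename(program):
    print(instr)
```

After renaming, the ADD and SUB write different physical registers, and the MUL reads the SUB's result because the map table pointed R1 at the most recent writer.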
3. Reservation Station (RS)
- Holds instructions waiting for operands.
- When operands are ready → send to execution unit.
- Multiple RSs (ALU RS, FP RS, Load/Store RS) for parallelism.
4. Common Data Bus (CDB)
- Broadcasts execution results to all RSs.
- Implements forwarding at scale.
Execution Model
Instruction sequence (in program order):
1. ADD R1, R2, R3
2. MUL R4, R1, R5 (depends on ADD)
3. SUB R6, R7, R8 (independent)
4. DIV R9, R4, R10 (depends on MUL)
Out-of-order timeline (assuming single-cycle latencies and enough execution units):
Cycle 1: Fetch & rename all 4.
Cycle 2: ADD and SUB execute (both have their operands ready).
Cycle 3: MUL executes (R1 now available via CDB).
Cycle 4: DIV executes (R4 now available).
Retire (in-order): SUB finishes in cycle 2 but waits in the ROB until ADD and MUL have retired.
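The earliest cycle at which each instruction can execute follows from its true dependencies alone. This sketch assumes renaming has removed false dependencies, unlimited execution units, single-cycle latency for every operation, and that execution can begin in cycle 2 (after fetch and rename in cycle 1); `schedule` is a hypothetical helper.

```python
def schedule(instructions, latency: int = 1, first_execute_cycle: int = 2):
    """Return (name, execute_cycle) pairs for a dataflow-limited schedule.

    instructions: (name, destination_register, [source_registers]) in program order.
    """
    ready_at = {}  # register -> first cycle in which its newest value can be consumed
    result = []
    for name, dst, srcs in instructions:
        # An instruction executes as soon as all of its sources are available.
        start = max([first_execute_cycle] + [ready_at.get(r, first_execute_cycle) for r in srcs])
        ready_at[dst] = start + latency
        result.append((name, start))
    return result


program = [
    ("ADD", "R1", ["R2", "R3"]),
    ("MUL", "R4", ["R1", "R5"]),   # depends on ADD
    ("SUB", "R6", ["R7", "R8"]),   # independent
    ("DIV", "R9", ["R4", "R10"]),  # depends on MUL
]
for name, cycle in schedule(program):
    print(f"{name} executes in cycle {cycle}")
```

Under these idealized assumptions the dependency chain ADD → MUL → DIV sets the total time, and the independent SUB executes alongside the ADD. Real multiply and divide units take several cycles, which the `latency` parameter only models uniformly.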
Commit/Retire
- Results are computed OoO but committed to architectural state in-order.
- Handles exceptions/mispredictions: If instr causes fault, later instructions must be discarded.
- Precise exceptions: Architectural state reflects all instructions up to faulting instruction.
OoO Limitations
- Complexity: Renaming logic, speculation, many data structures.
- Power consumption: Reservation stations, CDB, large register file.
- Width limits: As width grows, rename and dispatch logic become the bottleneck, with diminishing returns per added slot.
- Memory ordering: Must track loads/stores to ensure correct order.
7. Branch Prediction & Speculation
The Branch Problem (Revisited)
Cycle 1: Fetch BEQ R1, Label
Don't know next PC yet; can't fetch instr 2!
Cycle 3: Branch resolves in EX stage.
Lost 2 cycles of fetch (cycles 2 and 3)!
With 4-wide superscalar: Lost ~8 instruction slots.
Prediction Mechanisms
1. Two-Level Branch Predictor (Correlating)
Pattern History Table (PHT):
n-bit history of recent branches → lookup table of predictions.
Example: 2-bit history + 2-bit saturating counter
History = "T, T" (last 2 branches taken)
→ Look up counter for (history="11")
→ Predict "taken" or "not taken" based on counter value.
State diagram (2-bit saturating):
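Strongly Not Taken (00) ↔ Weakly Not Taken (01) ↔ Weakly Taken (10) ↔ Strongly Taken (11)
Each taken outcome moves one state toward 11; each not-taken outcome moves one state toward 00. States 10 and 11 predict taken.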
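The 2-bit saturating counter is small enough to model directly. This is a minimal sketch; the class name, the initial state (weakly not taken), and the loop-branch trace are assumptions made for the example.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state: int = 1):
        self.state = state  # assumed start: weakly not taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward 3 on taken, toward 0 on not taken, and saturate.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)


def count_correct(outcomes) -> int:
    counter = TwoBitCounter()
    correct = 0
    for taken in outcomes:
        correct += counter.predict() == taken
        counter.update(taken)
    return correct


# A loop branch: taken 9 times, then falls through once; the loop runs 10 times.
loop_trace = ([True] * 9 + [False]) * 10
print(count_correct(loop_trace), "of", len(loop_trace), "predicted correctly")
```

Because one not-taken outcome only moves the counter from strongly to weakly taken, the predictor mispredicts the loop exit but not the first iteration of the next pass, which is the advantage over a 1-bit scheme.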
2. Gshare (Global-History + XOR)
Global history (last N branches) XORed with PC bits
→ Index into PHT.
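XORing in the branch address lets different branches with the same recent history use different counters, while the global history captures correlation between branches that a per-branch (local) history misses.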
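A gshare predictor fits in a few lines once the index function is fixed. The sketch below assumes a 10-bit global history, a table of 2-bit counters initialized to weakly not taken, and 4-byte-aligned branch addresses (so the low two PC bits are dropped); these choices and the class name are illustrative rather than taken from any specific CPU.

```python
class GsharePredictor:
    """Global-history predictor: index = (PC bits) XOR (global history)."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # most recent outcome in bit 0
        self.counters = [1] * (1 << index_bits)   # 2-bit counters, weakly not taken

    def _index(self, pc: int) -> int:
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the actual outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


# A branch that strictly alternates taken / not taken.
predictor = GsharePredictor(index_bits=10)
pc = 0x1000
correct_late = 0
for i in range(200):
    taken = i % 2 == 0
    if i >= 100:  # score only after warm-up
        correct_late += predictor.predict(pc) == taken
    predictor.update(pc, taken)
print(correct_late, "of the last 100 predictions correct")
```

A single 2-bit counter cannot follow a strictly alternating branch, but with history in the index the two phases of the pattern train two different counters.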
3. Branch Target Buffer (BTB)
Stores (branch_address → target_address) pairs.
Fast lookup: BEQ at 0x1000 target = 0x2000
→ Predict target immediately without computing.
4. Return Stack Buffer (RSB)
Stack of return addresses.
CALL pushes return address; RET pops from stack.
Perfectly predicts most function returns (nested ~16 levels typical).
Speculation & Recovery
Speculative Execution:
Predict branch not taken; fetch instr A.
A is speculatively executed.
If prediction wrong: Discard A, fetch correct instr B.
(Uses OoO pipeline to discard; physical registers freed).
Misprediction Cost:
- Pipeline flush: 10-20+ cycle penalty for deep pipelines.
- Example: M4 misprediction = ~8-12 cycle penalty (an estimate; Apple does not publish this figure).
Branch Prediction Rates:
- Conditional branches (~80% of all branches): Modern predictors reach 94-98% accuracy.
- Indirect branches (~20%): Harder; BTB-based target prediction reaches 85-95%.
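Accuracy and flush penalty combine into a simple CPI estimate. The model below is a first-order sketch: it assumes every misprediction costs a fixed number of cycles and ignores overlap with other stalls. The 20% branch fraction and the other inputs are illustrative assumptions, not measured values.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  accuracy: float, penalty_cycles: float) -> float:
    """CPI including the average cost of branch mispredictions."""
    mispredicts_per_instruction = branch_fraction * (1.0 - accuracy)
    return base_cpi + mispredicts_per_instruction * penalty_cycles


# Assumed: 1 in 5 instructions is a branch, 15-cycle flush, base CPI of 0.25 (4-wide).
for acc in (0.90, 0.95, 0.99):
    cpi = effective_cpi(base_cpi=0.25, branch_fraction=0.2, accuracy=acc, penalty_cycles=15)
    print(f"accuracy {acc:.0%}: CPI = {cpi:.3f}")
```

On a wide core the base CPI is small, so even a few percent of mispredicted branches can dominate the total, which is why predictor accuracy matters more as width grows.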
Hardware Details
Apple M4 Branch Prediction (Example; Apple does not publish these details, so treat the figures as estimates):
- ~8,000-entry BTB.
- 2-level predictor with global history.
- RSB for return prediction.
- Prediction latency: 1 cycle (integrated into fetch).
Limits & Challenges
- Misprediction power: Work done on the wrong path is wasted energy.
- Unpredictable branches: Data-dependent branches (e.g., on hash values) and indirect calls with many targets → high misprediction rate.
- Hardware cost: BTB, pattern history tables consume area/power.
8. Multi-Core & Cache Coherence
Multi-Core Architecture
┌──────────────┐ ┌──────────────┐
│ Core 0 │ │ Core 1 │
│ L1I / L1D │ │ L1I / L1D │
│ L2 │ │ L2 │
└──────────────┘ └──────────────┘
↓ ↓
┌─────────────────────────┐
│ L3 Cache (Shared) │
└─────────────────────────┘
↓
Memory Controller & Main RAM
Multi-Core Benefits:
- 2 cores → up to 2x throughput (for independent tasks).
- Shared L3 cache → reduced bandwidth to main memory.
- Effective TLP (Thread-Level Parallelism).
Challenges:
- Cache Coherence: If Core 0 writes to address X, Core 1 must see the updated value (not a stale cached copy).
- Synchronization: Atomic operations, locks, barriers.
Cache Coherence Protocols
MESI Protocol (MSI + Exclusive state):
States per cache line:
M (Modified): This core wrote it (dirty).
E (Exclusive): Loaded; unmodified; no other core has it.
S (Shared): Multiple cores have valid copy.
I (Invalid): Stale/not loaded.
Transitions:
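Read miss: I → S if another core holds the line (a remote M or E copy is downgraded to S, and a dirty M copy is written back or supplied); I → E if no other core holds it.
Write: Any state → M (invalidate other cores' copies).
MOESI Protocol (MESI + Owned state):
- O (Owned): This core holds a dirty copy that other cores may also hold in S; the owner supplies the data and writes it back on eviction, which avoids a memory write-back each time a modified line is shared.
- Used in AMD, some ARM systems.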
Example: Two cores, shared variable X
Core 0: X = 5 (writes)
→ X's line: M in Core 0; I in Core 1.
Core 1: Read X
→ Coherence request; Core 0 sends data.
→ X's line: S in both cores.
Core 0: X = 10 (write again)
→ Invalidate Core 1's copy.
→ X's line: M in Core 0; I in Core 1.
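The two-core example can be replayed with a tiny state machine. This sketch tracks the MESI state of a single cache line across cores in a snooping-style model; it ignores data values, write-backs, and bus timing, and `MesiLine` is a name invented here.

```python
class MesiLine:
    """MESI state of ONE cache line as seen by each core."""

    def __init__(self, num_cores: int):
        self.state = ["I"] * num_cores

    def read(self, core: int) -> None:
        if self.state[core] != "I":
            return  # read hit in M, E, or S: no state change
        holders = [c for c, s in enumerate(self.state) if c != core and s != "I"]
        if holders:
            # A remote M or E copy is downgraded; everyone ends up Shared.
            for c in holders:
                self.state[c] = "S"
            self.state[core] = "S"
        else:
            self.state[core] = "E"  # no other copy exists

    def write(self, core: int) -> None:
        # Writing requires exclusive ownership: invalidate all other copies.
        for c in range(len(self.state)):
            if c != core:
                self.state[c] = "I"
        self.state[core] = "M"


x = MesiLine(num_cores=2)
x.write(0)
print("Core 0 writes X:", x.state)
x.read(1)
print("Core 1 reads X: ", x.state)
x.write(0)
print("Core 0 writes X:", x.state)
```

The second write is the expensive one in practice: Core 0 holds the line only in S, so it must broadcast an invalidation before it can modify the data.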
Coherence Cost:
- Snooping-based (bus): All cores watch all writes (broadcast). Works for 2-8 cores.
- Directory-based (point-to-point): A directory tracks which cores hold each line. For 16+ cores.
9. Real-World ISA & Architecture Case Studies
ARM64 (ARMv8 & ARMv9)
ISA Characteristics:
- Load-store RISC architecture.
- 31 general-purpose 64-bit registers, NZCV condition flags, and a PC that is not directly accessible as a general register.
- Fixed 32-bit instructions (AArch64 has no 16-bit encoding; Thumb/Thumb-2 exists only in 32-bit AArch32).
- Unified floating-point and SIMD register file (FP32/FP64 scalars and vectors share the V registers).
Example Instructions:
ADD X0, X1, X2 # X0 = X1 + X2
LDR X0, [X1] # Load from address in X1
STR X0, [X1] # Store to address in X1
B my_label # Unconditional branch
B.EQ my_label # Branch if equal (Z flag set)
MOVI V0.4S, #1 # SIMD: Set 4x 32-bit values to 1
Microarchitecture Examples:
- Cortex-A53: In-order, 2-wide, low-power (embedded).
- Cortex-A72: Out-of-order, 3-wide, mid-range power/perf.
- Cortex-A77: OoO, 4-wide, high-performance.
- Apple M4: OoO, very wide P-cores (reported 8-wide or wider decode) for highest ILP extraction; narrower efficiency cores (E-cores).
Advantages:
- Clean ISA, load-store simplicity.
- Excellent for energy efficiency (mobile/embedded).
Disadvantages:
- Code is typically somewhat less dense than x86, so programs need slightly more instruction memory.
x86-64
ISA Characteristics:
- CISC architecture (evolved from the 16-bit 8086 and 32-bit IA-32).
- Variable-length instructions (1-15 bytes); complex decoder.
- 16 general-purpose 64-bit registers (some with implicit roles in certain instructions).
- Separate floating-point (x87 FP stack) + SIMD (XMM/YMM/ZMM) registers.
Example Instructions:
mov rax, [rbx] # Load from memory (x86 mem-to-reg)
add rax, rbx # Register-to-register
add rax, 5 # Immediate operand
mov [rax], rbx # Store to memory (address in rax)
Microarchitecture Examples:
- Intel Core Ultra (Lunar Lake, 2024): P-cores (Lion Cove, wide OoO), E-cores (Skymont).
- AMD Ryzen 9000 (Zen 5, 2024): Wide OoO, strong ILP extraction.
Advantages:
- Backward compatibility (40+ years of x86 code still runs).
- Memory-to-register instructions; denser code.
- Dominant ecosystem.
Disadvantages:
- Complex decoder (high power, area cost).
- Variable-length instructions complicate fetching.
RISC-V
ISA Characteristics:
- Modular, open-source RISC ISA.
- Base instructions are 32 bits wide in both RV32 and RV64; RV64 widens registers and addresses to 64 bits.
- Extensions: M (multiply), F (float), D (double), A (atomic), V (vector), etc.
Example Instructions (RV64I base):
add x1, x2, x3 # x1 = x2 + x3
ld x1, 8(x2) # Load doubleword from [x2 + 8]
beq x1, x2, label # Branch if equal
Advantages:
- Radical simplicity: fewer than 50 base integer instructions (RV32I).
- Modular extensions (pick what you need).
- Open-source; no licensing fees.
Disadvantages:
- Young ecosystem (software still maturing).
- No dominant commercial implementation yet (SiFive, Ventana, others emerging).
Comparison
| Feature | ARM64 | x86-64 | RISC-V |
|---|---|---|---|
| ISA Type | RISC | CISC | RISC |
| Instruction Length | 32-bit fixed | 1-15 bytes | 32-bit (16-bit with C extension) |
| Primary Market | Mobile, embedded, servers | Desktops, servers | Emerging, research |
| Software Ecosystem | Mature | Most mature | Growing |
| Encoding Density | Medium | High | Medium |
| Decoder Complexity | Simple | Complex | Very simple |
| Energy Efficiency | Excellent | Good | Excellent |
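10. Advanced Topics
Speculative Execution & Security
Spectre/Meltdown Context:
- CPUs fetch & execute speculatively before a branch resolves.
- An attacker steers speculation so that secret-dependent accesses leave traces in the cache, then recovers them through a timing side channel.
- Modern mitigations: LFENCE speculation barriers, retpolines and RSB stuffing, kernel page-table isolation (KPTI, for Meltdown), plus CET (Control-Flow Enforcement), which hardens against control-flow hijacking.
SIMD & Vector Execution
Single Instruction, Multiple Data: One instruction applies the same operation to every lane of a vector register; for example, a single 128-bit add performs four FP32 additions.
Modern vector ISAs:
- ARM NEON: 128-bit vectors (2x FP64 or 4x FP32).
- x86 AVX-512: 512-bit vectors (16x FP32).
- RISC-V V extension: Vector length is chosen by the implementation; scalable to 1024+ bits.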
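The lane counts quoted above are just the vector width divided by the element width. The sketch below models that arithmetic and a lane-wise add in plain Python; it is a conceptual model only (real SIMD hardware performs all lanes in one instruction), and both helper names are invented for the example.

```python
def lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def simd_add(a: list, b: list) -> list:
    """Model of one vector add: the same operation applied to every lane."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


print("NEON FP32 lanes:   ", lanes(128, 32))
print("AVX-512 FP32 lanes:", lanes(512, 32))
print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))  # one 4-lane add
```

A scalar loop would need four add instructions for the last line; a 128-bit vector unit needs one, which is where the throughput gain comes from.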
Hardware Security
- Control-flow integrity: CET (x86) / Pointer Authentication and BTI (ARM).
- Cryptographic acceleration: AES-NI (x86), ARMv8 Cryptographic Extension.
- Trusted execution: SGX (x86), TrustZone (ARM).
Learning Resources
Textbooks
- "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson (Essential)
  - Gold standard; covers pipelining, caching, superscalar, OoO in depth.
  - ~1000 pages but skip chapters as needed.
- "Computer Organization and Design" by Patterson & Hennessy (Alternative)
  - ARM64 edition available; more accessible than above.
- "Modern Processor Design" by Shen & Lipasti
  - Superscalar & OoO design in detail; excellent figures.
- "Structured Computer Organization" by Tanenbaum
  - Layered approach; good for big-picture understanding.
Online References
- ARM Architecture Reference Manual (ARMv8/ARMv9): Official ISA spec.
- RISC-V ISA Specification: Open-source; modular.
- x86-64 System V ABI: Calling conventions, linking.
- Wikichip / Chips & Cheese: Community-sourced microarchitecture analysis.
- AnandTech Deep Dives: CPU reviews with architectural analysis.
Simulation & Tools
- SimpleSimulator / SimpleScalar: Cycle-accurate CPU simulation.
- gem5: Full-system CPU simulator (complex but flexible).
- Dinero IV: Cache simulation.
- CACTI: Cache area/power estimation.
- LLVM / GCC: Compiler toolchains; build code for ARM/x86/RISC-V.
Projects & Labs
Lab 1: Single-Cycle CPU in Verilog
Design a minimal RISC CPU that executes one instruction per cycle.
- Support: LDR, STR, ADD, SUB, BEQ.
- Components: Fetch unit, decoder, ALU, register file, memory.
- Simulate with Icarus Verilog or ModelSim.
- Deliverable: Verilog code + test benches; cycle trace showing instruction execution.
Lab 2: 5-Stage Pipeline Implementation
Extend Lab 1 with pipelining:
- Break into IF→RF→EX→MEM→WB stages.
- Implement forwarding to resolve data hazards.
- Add NOPs for unresolvable hazards (or stalling logic).
- Measure: CPI (cycles per instruction), throughput.
- Deliverable: Pipeline simulator + benchmark showing ~1 CPI for independent instructions.
Lab 3: Branch Prediction Simulator
Build a cycle-accurate predictor simulator:
- Implement 2-level correlating predictor.
- Add BTB (Branch Target Buffer).
- Simulate on real branch traces (SPEC benchmarks).
- Measure: Prediction accuracy, misprediction penalty, speedup vs. stalling.
- Deliverable: Prediction accuracy report; analysis of branch patterns.
Lab 4: Cache Performance Analysis
Measure cache behavior on real code:
- Use Cachegrind / PAPI (Performance API).
- Run benchmarks (matrix multiply, merge sort, hash table).
- Analyze: Hit rate, misses, working set size.
- Simulate different cache configs (size, associativity, line size).
- Deliverable: Cache miss charts; optimization suggestions.
Lab 5: Multi-Core Coherence Simulator
Simulate cache coherence in a 2-4 core system:
- Implement MESI protocol.
- Simulate memory access patterns from multiple threads.
- Measure: Coherence miss rate, cache-to-cache transfers.
- Compare MESI vs. MOESI, and snooping vs. directory-based implementations.
- Deliverable: Coherence miss analysis; protocol trade-offs.
Lab 6: ISA Comparison Project
Compare x86-64 vs. ARM64 vs. RISC-V:
- Compile same C/C++ code for all three ISAs.
- Measure: Code size, cycle count, memory bandwidth.
- Analyze: Instruction selection, register usage, memory reference patterns.
- Deliverable: Comparative benchmark report; architecture lessons.
Lab 7: CPU Microarchitecture Study
Reverse-engineer a real CPU (e.g., M4 Pro):
- Use tools: Geekbench, SPEC CPU, custom microbenchmarks.
- Deduce: Cache hierarchy, line size, prefetching, branch predictor.
- Measure: IPC, branch misprediction rate, cache miss rate.
- Deliverable: Microarchitecture model + analysis; validation against published specifications where available.
Capstone: Simple Out-of-Order CPU Simulator
Build an OoO CPU simulator (C++/Python):
- Instruction fetch, decode, rename, dispatch to reservation stations.
- Out-of-order execute; in-order commit.
- Support data hazard resolution, branch speculation, memory ordering.
- Simulate on benchmark suite.
- Deliverable: Functional simulator; IPC comparison vs. in-order.
Next Steps
After mastering this material:
1. Proceed to Phase 2 (Embedded Systems): PCB design, real ARM/RISC-V processors, RTOSes, device drivers, embedded Linux.
2. Specialize in FPGA design (Phase 3): Implement custom CPUs on Xilinx.
3. Move to AI accelerators (roadmap Phase 4 Track B: Jetson/GPU, and optionally Track A: FPGA): GPU architectures, tensor operations, memory optimization.
Recommended progression:
- Weeks 1-2: ISA fundamentals, simple CPU design (Lab 1).
- Weeks 3-4: Pipelining, hazards, forwarding (Lab 2).
- Weeks 5-6: Branch prediction, advanced execution (Lab 3).
- Weeks 7-8: Caching, memory hierarchy (Lab 4).
- Weeks 9-10: Multi-core, coherence (Lab 5).
- Weeks 11-12: ISA comparison, architecture analysis (Labs 6-7).
- Ongoing: Capstone OoO simulator (in parallel).