Hardware platforms and components¶
A comprehensive guide to modern computer hardware — CPUs, memory, storage, GPUs, and I/O — across laptops, workstations, and servers, with coverage through CES 2026. Part of Computer Architecture and Hardware (Phase 1 §2); start from Guide.md for the full section map.
1. Central Processing Unit (CPU)¶
Core Concepts¶
- Instruction Set Architecture (ISA): x86-64 (Intel/AMD) vs. ARM (Apple, Qualcomm, Ampere) vs. RISC-V (emerging). Understand how ISA choice affects software compatibility, performance, and power efficiency.
- Microarchitecture: How CPUs implement the ISA — pipelines, out-of-order execution, branch prediction, superscalar execution, and speculative execution.
- Process Node & Transistor Density: Understand nanometer designations (TSMC N3E, Intel 18A, Samsung 2nm GAA) and their impact on performance, power, and thermal design.
- Core Count vs. Clock Speed: Multi-core scaling, Amdahl's Law, and when single-threaded performance matters more than core count.
- Hybrid Architecture: Performance cores (P-cores) and efficiency cores (E-cores) — thread scheduling, power management, and OS-level task assignment.
- Cache Hierarchy: L1/L2/L3 cache design, cache coherency protocols (MESI, MOESI), and impact on latency-sensitive workloads.
- Chiplet & Tile Architecture: Multi-die designs (AMD chiplets, Intel tiles, Apple UltraFusion) — inter-die interconnects, yield advantages, and scalability.
Intel¶
- Core Ultra 200V (Lunar Lake, 2024): 4 P-cores (Lion Cove) + 4 E-cores (Skymont), integrated NPU (48 TOPS), LPDDR5x on-package. Laptop-focused, 17W TDP.
- Core Ultra 200S (Arrow Lake, 2024): Desktop platform, up to 24 cores (8P+16E), LGA 1851, DDR5-5600, PCIe 5.0. Arc integrated GPU or discrete.
- Core Ultra 200HX (Arrow Lake-HX, 2025): High-performance mobile, up to 24 cores, aimed at mobile workstations and gaming laptops.
- Xeon 6 (Granite Rapids / Sierra Forest, 2024–2025): Server platform — Granite Rapids (P-cores, up to 128 cores) for compute-intensive workloads; Sierra Forest (E-cores, up to 288 cores) for cloud-native density.
- Intel 18A Process (2025): Intel's foundry process with RibbonFET (GAA transistors) and PowerVia (backside power delivery), targeting Panther Lake client and Clearwater Forest server chips.
- CES 2026 — Panther Lake (Expected): Next-gen client CPUs on Intel 18A, expected to feature improved NPU (>60 TOPS), Wi-Fi 7, and Thunderbolt 5 integration.
AMD¶
- Ryzen 9000 Series (Zen 5, 2024): Desktop CPUs on TSMC 4nm, up to 16 cores (Ryzen 9 9950X), AM5 socket, DDR5, PCIe 5.0. Improved IPC over Zen 4.
- Ryzen AI 300 Series (Strix Point, 2024–2025): Laptop APUs with Zen 5 + RDNA 3.5 iGPU + XDNA 2 NPU (up to 50 TOPS). Targeting Copilot+ PC requirements.
- Ryzen 9000X3D (2025): 3D V-Cache variants for gaming and professional workloads, stacking additional L3 cache for massive hit-rate improvements.
- EPYC 9005 (Turin, 2024–2025): Server CPUs with up to 192 Zen 5 cores (Turin Dense with Zen 5c), SP5 socket, 12-channel DDR5, PCIe 5.0 (160 lanes), CXL 2.0.
- CES 2026 — Ryzen AI Max (Strix Halo): Monolithic APU with up to 16 Zen 5 cores + RDNA 3.5 (40 CUs) + 256-bit LPDDR5x (up to 128 GB unified memory). Targeting mobile workstations and compact AI PCs.
- CES 2026 — Ryzen Z2 Series: Handheld gaming processors (Z2, Z2 Go, Z2 Extreme) for next-gen portable gaming devices.
Apple Silicon¶
- M4 Family (2024–2025): TSMC N3E process.
- M4: 10-core CPU (4P+6E), 10-core GPU, 16-core Neural Engine (38 TOPS). MacBook Pro 14", iMac, Mac Mini.
- M4 Pro: 14-core CPU (10P+4E), 20-core GPU, Thunderbolt 5, up to 48 GB unified memory.
- M4 Max: 16-core CPU (12P+4E), 40-core GPU, up to 128 GB unified memory, 546 GB/s memory bandwidth.
- M4 Ultra (2025): UltraFusion die-to-die interconnect, up to 32-core CPU, 80-core GPU, up to 512 GB unified memory. Mac Studio, Mac Pro.
- Unified Memory Architecture (UMA): CPU, GPU, and Neural Engine share the same memory pool — eliminates data copying, critical for on-device AI/ML workflows.
- CES 2026 / Expected — M5 Family: TSMC N2 (2nm) process, expected improvements in power efficiency and Neural Engine throughput for local LLM inference.
Qualcomm (Windows on ARM)¶
- Snapdragon X Elite / X Plus (2024): Oryon custom CPU cores (up to 12 cores), Adreno GPU, Hexagon NPU (45 TOPS). Targeting Windows Copilot+ PCs.
- Snapdragon X2 (Expected 2025–2026): Next-gen PC platform with improved Oryon cores and NPU performance, deeper Windows integration.
- CES 2026 — Snapdragon Compute Expansion: Broader OEM adoption across laptops, 2-in-1s, and always-connected PCs. Enhanced developer toolchain for ARM-native Windows apps.
Emerging: RISC-V in Desktops/Servers¶
- RISC-V for Compute: Companies like SiFive (P870 core), Ventana Micro (Veyron server CPUs), and Tenstorrent exploring RISC-V for data center and edge computing.
- Timeline: Still early for mainstream desktop/server adoption but rapidly maturing in 2025–2026 with improved software ecosystem (Linux, LLVM, Android).
CPU Comparison by Form Factor¶
| Segment | Intel | AMD | Apple | Qualcomm |
|---|---|---|---|---|
| Ultrabook/Thin Laptop | Core Ultra 200V | Ryzen AI 300 | M4 | Snapdragon X Elite |
| Gaming/High-Perf Laptop | Core Ultra 200HX | Ryzen 9000HX | M4 Pro/Max | — |
| Desktop | Core Ultra 200S | Ryzen 9000 / 9000X3D | — | — |
| Workstation | Xeon W-3500 | Ryzen Threadripper 7000 | M4 Ultra | — |
| Server | Xeon 6 (Granite/Sierra) | EPYC 9005 (Turin) | — | — |
2. Memory (RAM)¶
Core Concepts¶
- DDR5 vs. DDR4: DDR5 doubles bandwidth (4800–8800+ MT/s vs. DDR4's 2133–3600 MT/s), supports on-die ECC, and uses lower voltage (1.1V vs. 1.2V).
- Channels and Ranks: Dual/quad/octa-channel configurations. More channels = more bandwidth. Servers use 8–12 channels per socket.
- ECC (Error-Correcting Code): Critical for servers and workstations — detects and corrects single-bit errors, detects multi-bit errors. Standard in EPYC/Xeon, supported on Ryzen Pro/Threadripper.
- LPDDR5x: Low-power variant for laptops and mobile — up to 8533 MT/s, soldered on-package for reduced latency and board space.
- Unified Memory (Apple Silicon): Shared memory pool between CPU, GPU, and Neural Engine. Eliminates PCIe bottleneck for GPU memory access.
Memory by Form Factor¶
- Laptops:
- Mainstream: LPDDR5x (soldered), 16–32 GB typical in 2025–2026.
- Gaming/Workstation: SO-DIMM DDR5-5600, up to 64 GB.
- Apple: Unified LPDDR5, 16–128 GB (M4 Max), up to 512 GB (M4 Ultra).
- Desktops:
- DDR5-5600 baseline (Intel/AMD), enthusiast kits at DDR5-8000+ with XMP/EXPO profiles.
- Dual-channel standard, 32–64 GB typical for prosumer use.
- Workstations:
- DDR5-4800 ECC RDIMM, quad-channel (Threadripper) or octa-channel (Xeon W).
- Up to 2 TB per socket with 256 GB RDIMMs.
- Servers:
- DDR5-5600 ECC RDIMM/LRDIMM, 8–12 channels per socket.
- EPYC Turin: 12 channels, up to 6 TB per socket.
- Xeon 6: 8 channels (MCC), up to 4 TB per socket.
Emerging Memory Technologies¶
- CXL (Compute Express Link) Memory: CXL 2.0/3.0 enables memory expansion and pooling over PCIe — attach terabytes of shared memory to servers, disaggregate compute and memory. Supported on EPYC 9005 and Xeon 6.
- HBM (High Bandwidth Memory): Used alongside GPUs and AI accelerators (HBM3/HBM3e). Not for general CPU use but critical for AI workstation/server configs.
- MRDIMM (Multiplexed Rank DIMM): Next-gen server DIMMs pushing DDR5 to 8800+ MT/s for bandwidth-hungry AI/HPC workloads.
3. Storage¶
Core Concepts¶
- NVMe SSD Architecture: NVMe protocol over PCIe — understand namespaces, queues (up to 64K queues × 64K commands), and controller design.
- NAND Flash Types: SLC, MLC, TLC, QLC, PLC — trade-offs between endurance (DWPD), speed, and cost per GB.
- DRAM Cache vs. HMB (Host Memory Buffer): High-end SSDs use DRAM cache for mapping tables; budget drives use HMB (host RAM) — performance implications for random writes.
Consumer / Laptop Storage¶
- PCIe Gen 5 NVMe SSDs (2024–2026): Sequential reads up to 14,000+ MB/s (e.g., Samsung 990 EVO Plus, Crucial T705, WD Black SN8100). Requires PCIe 5.0 x4 M.2 slot.
- PCIe Gen 4 NVMe: Still mainstream in mid-range laptops, 7,000 MB/s sequential reads, excellent price/performance.
- CES 2026 Highlights: Larger capacity consumer drives (8 TB M.2), improved Gen 5 thermals with integrated heatsink designs, and early PCIe Gen 6 controller announcements.
Workstation Storage¶
- NVMe RAID: Multiple M.2 or U.2/U.3 NVMe drives in RAID 0/1/5 for performance or redundancy. Hardware RAID vs. software RAID (mdadm, Storage Spaces).
- High-Endurance SSDs: Enterprise/workstation SSDs with 3+ DWPD for sustained write workloads (e.g., video editing scratch disks, database logs).
Server / Data Center Storage¶
- Enterprise NVMe (U.2/U.3/EDSFF): Hot-swappable NVMe in E1.S and E3.S form factors (EDSFF) replacing 2.5" U.2 in modern servers.
- PCIe Gen 5 Enterprise SSDs: Samsung PM9D5a, Solidigm D7-PS1010 — targeting AI training data pipelines with 13+ GB/s sequential reads.
- CXL-Attached Storage: Emerging CXL storage devices blurring the line between memory and storage tiers.
- Computational Storage: SSDs with onboard compute (FPGA/ARM cores) for near-data processing — compression, encryption, database filtering offloaded to the drive.
HDD (Still Relevant)¶
- Capacity Drives: 24–32 TB CMR HDDs (Seagate Exos, WD Ultrastar) for bulk storage, NAS, and archival.
- HAMR (Heat-Assisted Magnetic Recording): Seagate HAMR drives shipping 30+ TB, targeting 40+ TB by 2026.
- SMR vs. CMR: Shingled Magnetic Recording for sequential/archival workloads; Conventional Magnetic Recording for random I/O.
4. Graphics Processing Unit (GPU)¶
Core Concepts¶
- GPU Architecture: Streaming multiprocessors (NVIDIA) / Compute Units (AMD) / Xe-cores (Intel). Massive parallelism with thousands of cores optimized for throughput.
- VRAM (Video RAM): GDDR6/GDDR6X/GDDR7 for consumer GPUs; HBM3/HBM3e for data center. VRAM capacity and bandwidth are critical for AI model size.
- Ray Tracing & Rasterization: Hardware RT cores (NVIDIA), Ray Accelerators (AMD) for real-time ray tracing. Hybrid rendering pipelines.
- AI/Tensor Cores: Dedicated matrix multiplication hardware — FP16, BF16, INT8, FP8, FP4 operations for deep learning training and inference.
NVIDIA¶
- GeForce RTX 50 Series (Blackwell, CES 2025–2026):
- RTX 5090: 21,760 CUDA cores, 32 GB GDDR7, 1,792 GB/s bandwidth, 575W TDP. Flagship consumer GPU.
- RTX 5080: 10,752 CUDA cores, 16 GB GDDR7. High-end gaming/content creation.
- RTX 5070 Ti / 5070: Mid-range Blackwell with DLSS 4 (Multi Frame Generation), improved ray tracing.
- CES 2026 — RTX 5060 / Laptop Lineup: Mainstream and mobile Blackwell GPUs expanding the lineup. Laptop variants with Max-Q efficiency profiles.
- DLSS 4: AI-powered upscaling, frame generation (up to 4x frame multiplication), and ray reconstruction. Transformer-based models replacing CNN.
- RTX PRO Series (Workstation):
- RTX PRO 6000 (Blackwell): 96 GB GDDR7 ECC, for CAD/simulation/AI development.
- RTX PRO 4000/2000: Mid-range professional cards with ISV certification.
- Data Center / AI:
- B200 / GB200: Blackwell GPU for AI training, 192 GB HBM3e, FP4 training support, NVLink 5.0 (1.8 TB/s).
- GB200 NVL72: Full-rack AI supercomputer (72 GPUs + 36 Grace CPUs), liquid-cooled, 720 PFLOPs FP4.
- H200: Hopper refresh with 141 GB HBM3e, widely deployed for LLM inference in 2025.
AMD¶
- Radeon RX 9070 XT / 9070 (RDNA 4, 2025):
- New Compute Units with improved ray tracing, GDDR6 (16 GB), targeting mainstream-to-high-end gaming.
- FSR 4 with ML-based upscaling (first time using machine learning in AMD's upscaler).
- Radeon PRO W7900/W7800 (RDNA 3, Workstation): 48/32 GB GDDR6 ECC, DisplayPort 2.1, AV1 encode/decode.
- Instinct MI300X / MI325X (CDNA 3, Data Center): 192/256 GB HBM3e, for LLM training and inference. Competitive with NVIDIA H100/H200 for AI workloads.
- CES 2026 — Instinct MI350 (CDNA 4, Expected): Next-gen AI accelerator on advanced packaging, targeting 2x inference performance over MI300X.
Intel¶
- Arc B-Series (Battlemage, 2024–2025):
- Arc B580/B570: Budget-to-mid-range gaming GPUs with Xe2 architecture. XeSS 2 AI upscaling.
- Arc Pro (Workstation): Entry-level professional GPUs with AV1 encode, ISV certification for CAD.
- Gaudi 3 (Data Center AI): Intel's AI training accelerator, 128 GB HBM2e, competing in the AI training market alongside NVIDIA and AMD.
GPU Comparison by Use Case¶
| Use Case | NVIDIA | AMD | Intel |
|---|---|---|---|
| Gaming (Mainstream) | RTX 5060/5070 | RX 9070 | Arc B580 |
| Gaming (Enthusiast) | RTX 5080/5090 | RX 9070 XT | — |
| Content Creation | RTX 5080/5090 | RX 9070 XT | — |
| Workstation (CAD/Sim) | RTX PRO 6000 | Radeon PRO W7900 | Arc Pro |
| AI Training | B200/GB200 | Instinct MI325X | Gaudi 3 |
| AI Inference | H200/B200 | Instinct MI300X | Gaudi 3 |
5. I/O, Connectivity & Expansion¶
Bus & Interconnect Standards¶
- PCIe 5.0 (Current Mainstream): 32 GT/s per lane, x16 = 64 GB/s. Standard on Intel 13th/14th gen+, AMD Ryzen 7000+, EPYC 9004+.
- PCIe 6.0 (2025–2026 Early Adoption): 64 GT/s per lane, PAM4 signaling. First controllers and switches appearing in server/HPC. Consumer adoption expected 2027+.
- CXL 2.0/3.0 (Compute Express Link): Memory-semantic protocol over PCIe physical layer — enables memory pooling, sharing, and expansion. Critical for AI/HPC data centers.
- NVLink 5.0 (NVIDIA): 1.8 TB/s GPU-to-GPU interconnect for multi-GPU AI training (GB200 NVL72).
- Infinity Fabric (AMD): Inter-chiplet and inter-socket interconnect for Ryzen, Threadripper, and EPYC.
- UltraFusion (Apple): Die-to-die interconnect for M-series Ultra chips, 2.5 TB/s bandwidth.
External Connectivity¶
- Thunderbolt 5 (2024–2026): 80 Gbps bidirectional (120 Gbps with Bandwidth Boost). Supports dual 6K displays, eGPU, NVMe storage, and 240W USB PD. Available on Intel Core Ultra 200, Apple M4 Pro/Max.
- USB4 v2.0: 80 Gbps, tunneling DisplayPort 2.1 and PCIe. Broadly adopted in 2025–2026 laptops.
- USB 3.2 Gen 2x2: 20 Gbps, USB-C. Common for external SSDs and docking stations.
- Wi-Fi 7 (802.11be): 320 MHz channels, 4096-QAM, MLO (Multi-Link Operation). Up to 46 Gbps theoretical. Standard in 2025–2026 laptops and routers.
- Wi-Fi 8 (802.11bn, Expected 2026+): Early announcements at CES 2026 — coordinated AP operation, improved latency for AR/VR.
- Bluetooth 6.0 (2025): Channel sounding for precision distance measurement, improved LE Audio.
- Ethernet:
- Consumer: 2.5 GbE standard on modern motherboards.
- Workstation: 10 GbE becoming common.
- Server: 25/100/400 GbE; 800 GbE emerging for AI clusters.
Display Output¶
- DisplayPort 2.1a (UHBR20): 80 Gbps, supports 8K@60Hz with DSC, 4K@240Hz. On NVIDIA RTX 50 series, AMD RDNA 3/4.
- HDMI 2.1a: 48 Gbps, 4K@120Hz, 8K@60Hz, VRR, ALLM.
- Thunderbolt 5 Display Chaining: Daisy-chain multiple high-resolution monitors from a single port.
Expansion Slots & Form Factors¶
- M.2 (NVMe/SATA): M.2 2230 (compact, Steam Deck/laptops), 2242, 2280 (standard desktop/laptop).
- U.2/U.3/EDSFF (Enterprise): Hot-swappable NVMe for servers.
- PCIe Slots: x16 (GPU), x4 (NVMe adapter, capture card), x1 (NICs, sound cards). PCIe 5.0 x16 standard in current platforms.
- OCuLink (2025–2026): External PCIe connection for eGPUs on mini PCs and handhelds, up to PCIe 4.0 x4 (64 Gbps).
6. Architectural Fundamentals¶
Understanding how processors are designed — from ISA specification through microarchitecture implementation — is essential for predicting where performance comes from and why different CPUs behave differently. This section bridges the gap between abstract instruction sets and the silicon reality described in Section 1.
6.1 Instruction Set Architecture (ISA) Design¶
An Instruction Set Architecture is the contract between software and hardware. It defines which operations a programmer can request, how those operations are encoded into binary, and what results to expect. Three elements define an ISA:
- Operation vocabulary: What instructions exist (ADD, LOAD, BRANCH, etc.)
- Encoding: How instructions are represented in memory (instruction formats, bit layouts)
- Semantics: What each instruction does to registers, memory, and CPU state
ISA Comparison: x86-64 vs. ARM64 vs. RISC-V
| Aspect | x86-64 | ARM64 (aarch64) | RISC-V |
|---|---|---|---|
| Instruction Count | ~1,500+ (complex) | ~300 (simpler) | ~150 core (extensible) |
| Registers | 16 general (RAX–R15) | 31 general + SP | 31 general + SP |
| Addressing Modes | Rich (8+ modes) | Limited (3 modes) | Minimal (2 modes) |
| Variable Instr. Length | Yes (1–15 bytes) | Fixed 4-byte | Typically 4-byte, compressed 2-byte |
| Floating Point | Integrated (x87, SSE) | Separate (NEON, SVE) | Separate (F/D extensions) |
| Conditional Codes | Flag register (EFLAGS) | Per-instruction condition | Separate branch comp. registers |
Why ISA Matters for AI Hardware:
- Register Pressure: Matrix operations on x86 with 16 registers vs. ARM64 with 31 registers means different register allocation strategies
- SIMD Width: x86 AVX-512 (512-bit) vs. ARM SVE (up to 2048-bit) vs. RISC-V Vector (configurable) affects throughput for tensor operations
- Memory Addressing: RISC-V's simple addressing encourages compiler optimization for deep pipelines; x86's complex modes reduce instruction count but complicate hardware
Key Insight: Changing ISA requires recompilation of all software and OS support — it's a strategic choice made once per product line, not lightly modified. Apple chose ARM to enable tight integration with custom silicon; Intel stays with x86 for ecosystem compatibility; RISC-V attracts greenfield projects without legacy constraints.
6.2 Microarchitecture Fundamentals: The Pipeline¶
Microarchitecture answers "how does hardware implement the ISA?" The most visible aspect is the pipeline — breaking instruction execution into stages to increase throughput.
Simple In-Order Pipeline (5 stages, like ARM Cortex-A57):
┌──────────┬────────┬─────────┬────────┬──────────┐
│ Fetch │ Decode │ Execute │ Memory │ Writeback│
│ │ │ │ │ │
│ Fetch │ Parse │ ALU ops │ L1/L2 │ Reg │
│ from L1i │ instr. │ calcs │ access │ update │
└──────────┴────────┴─────────┴────────┴──────────┘
One instruction per stage → up to 1 instruction per cycle (IPC = 1.0 best case)
Pipeline Hazards (causes of stalls):
- Data Hazard: Instruction needs result from previous instruction still in pipeline
- Example:
R1 = R2 + R3followed byR4 = R1 * R5— must wait for R1 -
Mitigation: Forwarding paths bypass later pipeline stages to earlier stages
-
Control Hazard: Branch outcome unknown until execute stage, but fetch already chose wrong path
- Example:
if (x) { ... }with delayed branch prediction penalty -
Mitigation: Branch prediction (discussed Section 6.4)
-
Structural Hazard: Multiple instructions competing for same hardware resource
- Example: Two instructions need ALU in same cycle
- Mitigation: Replicate hardware (multiple ALUs) or serialize
Deeper Pipeline (Intel P-cores: 12–15 stages): * Longer pipelines increase clock frequency (less logic per stage = faster propagation) * Trade-off: Branch misprediction penalty larger (must flush more pipeline stages) * Benefit: Higher frequency compensates (e.g., 5 GHz with 15-stage pipeline beats 3 GHz with 5-stage)
6.3 Out-of-Order Execution¶
Problem: A stall in a 5-stage pipeline wastes 80% of throughput while waiting for data. Example: branch misprediction waiting for memory load.
Solution: Fetch and decode many instructions (wide front-end), execute them out of original program order, but commit results in order to maintain correctness. This is Out-of-Order (OoO) execution.
Key Hardware Components:
- Instruction Window: Tracks 50–300 unfinished instructions (Intel: 256-entry Reorder Buffer; ARM A78: 128-entry)
- Reservation Stations: Queue instructions waiting for operands, issue when ready
- Load/Store Queue: Separate tracking for memory operations to detect conflicts
- Reorder Buffer (ROB): Maintains original program order for committing results
Example: OoO Execution with Memory Stall
Program order: 1: R1 = LOAD(addr) [memory miss, stalls 100 cycles]
2: R2 = R3 + R4 [independent, executes immediately]
3: R5 = R6 * R7 [independent, executes immediately]
OoO CPU execution order: 2 → 3 → ... → 1 (when memory returns)
But commits: 1 → 2 → 3
Trade-off Table:
| Feature | In-Order | Out-of-Order |
|---|---|---|
| Complexity | Simple (~1M transistors) | Very complex (~5M transistors) |
| Power Efficiency | High (less logic) | Lower (more logic, larger window) |
| IPC Potential | 1.0–2.0 | 2.0–5.0+ |
| Latency (fixed work) | Same if no stalls | Same (execution latency unchanged) |
| Latency (with hazards) | Stalls directly visible | Masked if other work available |
| Example Chips | ARM Cortex-M, older ARM A55 | Intel P-cores, ARM A78/A715, AMD Zen |
Real Example: ARM Cortex-A78 OoO Configuration * 8-wide fetch, 4-wide decode, 4-wide execute * 128-entry reorder buffer * 4 ALU units + 2 load/store units + 2 multiply-accumulate units * Mispredict latency: 11 cycles (penalty for recovering from wrong path) * Achievable IPC on real code: 2.5–3.5 (depends on memory stalls and branch frequency)
Key Insight: OoO execution masks memory stalls by finding independent work. If no independent work ("in-order" scenario), OoO provides no benefit—this is why memory bandwidth matters as much as frequency.
6.4 Branch Prediction¶
A branch changes program flow based on a condition (if, for, while). Until the condition is computed (cycle 4+ in pipeline), the CPU doesn't know which instruction to fetch next.
Problem: Fetching wrong path wastes the entire pipeline depth worth of instructions (15+ cycles for modern CPUs).
Solution: Branch Predictor guesses the outcome before it's computed, allows fetching to continue. If guess is wrong, flush speculative work and restart.
Predictor Designs (Complexity vs. Accuracy Trade-off):
| Type | Accuracy | Latency | Method |
|---|---|---|---|
| Always-Taken | 50% (on random) | 0 cycles | Assume taken |
| 1-bit History | 80% | 1 cycle | "Same as last time" |
| 2-bit Saturating Counter | 85% | 2 cycles | Hysteresis: need 2 mispredicts to flip |
| Branch History Table | 92–94% | 2–3 cycles | Use recent branch pattern as index |
| Gshare | 95% | 3 cycles | Blend global + local history |
| TAGE (Tagless Geometric) | 97%+ | 4–5 cycles | Multiple geometric tables, championship-winning design (modern CPUs) |
Spectre/Meltdown Note: These exploits abused the fact that modern CPUs speculatively execute past branch mispredicts, leaving traces in caches. Fixes: speculation barriers, microcode updates, CET (Control-flow Enforcement Technology).
Real-World Impact on AI Workloads:
- Tight loops (common in kernels): Branch is highly predictable (95%+ success) → minimal penalty
- Irregular graphs (graph neural nets): Variable patterns → 90% accuracy → 1–2 cycle average penalty per branch
- Deep decision trees: Less predictable → can approach 10–15% IPC loss
Key Insight: Modern branch predictors exceed human intuition in accuracy (95%+). Focus optimization effort elsewhere unless branch-heavy code dominates (uncommon in compute kernels).
6.5 Superscalar Execution & Instructions Per Cycle (IPC)¶
Superscalar means executing multiple instructions per cycle. Achieved by: 1. Wide fetch (grab multiple instructions per cycle) 2. Independent execution units in parallel 3. Out-of-order issuing to avoid blocking on hazards
IPC Calculation:
IPC = Instructions Completed / CPU Cycles
Peak IPC = width of front-end (Intel 8-wide fetch → max 8 IPC, but in practice 3–4)
Actual IPC = depends on dependencies and memory stalls
Real-World IPC Ranges:
| Scenario | Typical IPC | Reason |
|---|---|---|
| Memory-bound (cache misses) | 0.8–1.2 | Waiting for main memory (100+ cycle latency) |
| Integer-heavy (loops) | 1.5–2.0 | ALU saturation, branch penalty |
| Floating-point (no memory) | 2.5–3.5 | Multiply-accumulate units can run in parallel |
| Perfect (all independent ops) | 4–8 | Limited by fetch/decode width, not execution |
Why It Matters for AI:
Matrix multiply has high arithmetic intensity (many FLOPs per memory access) → compute-bound → achieves 3+ IPC → utilizes multiple execution units → saturates compute throughput.
Image preprocessing (memory-bound) → achieves 1.0 IPC → CPUs waste execution units → GPUs better suited.
Key Insight: IPC = (Frequency × Width × Efficiency). You can't exceed width; you can't approach width if dependencies are high. AI accelerators side-step the IPC problem by hardwiring the dataflow (systolic arrays, etc.).
7. Memory Hierarchy Deep Dive¶
Modern CPUs trade memory speed for size and cost. The hierarchy from fastest-smallest to slowest-largest is: registers → L1 cache → L2 cache → L3 cache → main DRAM → SSD → HDD. Understanding this hierarchy is critical for optimization.
7.1 Cache Hierarchy Design¶
Caches are small, fast memories that hold copies of frequently-used data. Each level has different size, latency, and bandwidth.
Typical 3-Level Cache Hierarchy:
┌──────────────────────────────────────────────┐
│ CPU Die (1 socket) │
│ ┌────────────────────────────────────────┐ │
│ │ Core 0 ┌─────────────────────┐ │ │
│ │ Registers │ L1i Cache (32 KB) │ │ │
│ │ (256 bytes) │ L1d Cache (32 KB) │ │ │
│ │ └─────────────────────┘ │ │
│ │ │ L2 Cache (256 KB) │ │ │
│ │ └─────────────────────┘ │ │
│ └────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────┐ │
│ │ Core 1 Similar to Core 0 │ │
│ │ ... │ │
│ └────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────┐ │
│ │ L3 Cache (2–32 MB, shared) │ │
│ │ (all cores access) │ │
│ └────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
│
│ System Interconnect (100s of GB/s)
↓
┌──────────────────┐
│ Main DRAM (8GB) │ (10–20 GB/s bandwidth)
└──────────────────┘
Cache Specifications by Architecture:
| Level | Intel Xeon | AMD EPYC | Apple M4 Pro |
|---|---|---|---|
| L1i | 32 KB / core | 32 KB / core | 16 KB / core |
| L1d | 32 KB / core | 32 KB / core | 16 KB / core |
| L1 Latency | 4 cycles | 3 cycles | 3–4 cycles |
| L2 | 1 MB / core | 1 MB / core | 256 KB / core |
| L2 Latency | 11 cycles | 10 cycles | 9 cycles |
| L3 | 36 MB / socket | 32 MB / socket | 12 MB / cluster |
| L3 Latency | 42 cycles | 40 cycles | 35 cycles |
| Line Size | 64 bytes | 64 bytes | 64 bytes |
Cache Line and Spatial Locality:
Cache operates at 64-byte granularity (one L1→L2 transfer moves 64 bytes). Accessing adjacent bytes in the same cache line is free (already loaded). Algorithms that exploit spatial locality (sequential memory access, matrix tiling) benefit enormously.
Example: Sequential vs. Random Access
// Sequential (good spatial locality)
for (int i = 0; i < N; i++) {
sum += array[i]; // Each address in different 64-byte line
} // Cache line prefetch: 1 miss per 8 integers
// Random (poor spatial locality)
for (int i = 0; i < N; i++) {
sum += array[random() % N]; // Random addresses
} // Cache line prefetch useless: 1 miss per integer
Result: Sequential: ~5 GB/s throughput; Random: ~200 MB/s (25x slower!).
7.2 Cache Coherency Protocols (MESI)¶
Problem: With private L1/L2 caches per core, how does the system ensure that all cores see the same value for shared data?
Solution: Cache Coherency Protocol (e.g., MESI, MOESI) ensures that whenever one core modifies data, other cores' caches are invalidated or updated.
MESI State Diagram:
┌─────────────────────┐
│ Modified (M) │ Cache line recently written by this core
│ This core's data is │ Other caches must be invalidated
│ the only valid copy │
└─────────────────────┘
↕
┌─────────────────────┐
│ Exclusive (E) │ Unmodified, only in this cache
│ Only this core has │ Can write without bus traffic
│ the line (unmodified)
└─────────────────────┘
↕
┌─────────────────────┐
│ Shared (S) │ Unmodified, in multiple caches
│ Other cores may │ Writing requires invalidating others
│ also have it │ (transition to M)
└─────────────────────┘
↕
┌─────────────────────┐
│ Invalid (I) │ Not in this cache
│ Must fetch from │ Occupied cache lines can be evicted
│ memory or another │
│ cache to use │
└─────────────────────┘
Coherency Traffic Overhead:
Coherency messages (invalidations, acknowledgments) consume memory bus bandwidth. On a high-core-count system (16 cores all updating shared data), coherency can account for 30–50% of bus traffic.
Real-World Impact:
- Fully Coherent (Apple M4 Max, Intel 12th gen+): All cores see consistent memory → easier programming, higher coherency overhead
- NUMA (AMD EPYC): Local DRAM fully coherent; remote DRAM access slower → requires
numactlbinding or performance penalty
Key Insight: Coherency is automatic but not free. If all cores constantly modify the same cache line, coherency traffic dominates and speedup plateaus (Amdahl's Law effect).
7.3 DRAM Design & Refresh¶
While caches are speed-optimized SRAM, main memory uses DRAM (Dynamic RAM) for cost. DRAM cells leak charge and must be refreshed every 64ms.
DRAM Cell (1T-1C Design):
Wordline ──────┬─────────────
│ Transistor
│ (access gate)
│
=== Capacitor (1 bit: charged = 1, discharged = 0)
│
Bitline ───────┴─────────────
Access Timing: * Row Activate (tRCD): ~15 ns (open row buffer) * Column Select (tCAS): ~12 ns (selected column output) * Row Precharge (tRP): ~15 ns (prepare for next different row) * Refresh (tREF): ~64 ms (every cell must be read and rewritten)
Addressing:
DRAM organized as 2D array (rows × columns):
Example: 8 GB DRAM, 64-bit bus, 8 ns cycle
Bank 0 (4 GB):
Row address (16 bits): 64K rows
Column address (10 bits): 1K columns
= 64K × 1K × 8 bytes = 512 MB per bank
Usually 8–16 banks per channel
Row Buffer Locality:
If consecutive accesses hit the same row, no precharge needed → 3x faster (skip tRP). This is why sequential access is much faster than random.
Refresh Overhead:
Refreshing 1 device every 64 ms on an 8 Gb (1 GB) DRAM: * ~8,000 rows × 15ns/row ≈ 2% bandwidth overhead
Real: DDR5 Specifications
| Metric | DDR5-5600 | DDR5-6400 | DDR5-8000 |
|---|---|---|---|
| Transfer Rate | 5600 MT/s | 6400 MT/s | 8000 MT/s |
| Bandwidth / Ch | 44.8 GB/s | 51.2 GB/s | 64 GB/s |
| tCAS | 42 ns | 37.5 ns | 30 ns |
| tRCD | 38 ns | 34 ns | 30 ns |
| Latency (RAS-to-CAS) | ~90 ns | ~75 ns | ~60 ns |
| Per Server (8ch) | 358 GB/s | 410 GB/s | 512 GB/s |
Key Insight: DRAM latency hasn't improved much (fundamental physics: charge/recharge cycles). Row buffer locality is your best friend for DRAM performance; sequential streaming beats random.
Common Pitfall: Assuming all memory accesses cost the same. Row buffer hits 3x cache lines faster; random access to different rows costs 3x more.
7.4 Multi-Socket Coherency & NUMA Latency¶
Most servers and workstations use multi-socket configurations (2+ CPUs on same motherboard) to scale core count and memory capacity.
NUMA (Non-Uniform Memory Access):
Each socket has local DRAM. Accessing local DRAM is fast (~60 ns); accessing remote (another socket's DRAM) is slow (~120–180 ns) due to inter-socket protocol (QPI on Intel, Infinity Fabric on AMD).
Topology Example: Dual-Socket EPYC:
┌─────────────────────────────────────┐
│ EPYC Socket 0 (96 cores) │
│ ┌──────────────────────────────────┤
│ │ Local DRAM: 6 TB / 12 channels │ Latency: ~70 ns
│ │ (192 GB DDR5-5600 / DRAM module │
│ └──────────────────────────────────┤
└──────────────┬──────────────────────┘
│ Infinity Fabric
│ (128 GB/s bidirectional)
┌──────────────┴──────────────────────┐
│ EPYC Socket 1 (96 cores) │
│ ┌──────────────────────────────────┤
│ │ Local DRAM: 6 TB / 12 channels │ Latency: ~70 ns
│ │ Remote access to Socket 0: ~150ns│ (2.1x slowdown)
│ └──────────────────────────────────┤
└─────────────────────────────────────┘
Performance Impact:
Thread pinned to Socket 0 accessing local DRAM: 70 ns RTT
for (int i = 0; i < 1B; i++) {
remote_sum += remote_array[i]; // 2.1x loop time due to memory latency
}
Solutions:
-
numactlBinding: Pin threads + allocate memory to same socket -
NUMA-Aware Data Layout: Partition data across NUMA domains:
-
CXL Memory Pooling (emerging): CXL 2.0 allows disaggregated memory attached over PCIe, partially hiding NUMA.
Key Insight: NUMA-aware programming is tedious but necessary for workloads larger than single-socket capacity. AI training (multi-GPU, multi-node) must consider NUMA scaling.
7.5 Unified Memory Architecture (Apple Silicon)¶
Apple's approach eliminates the NUMA problem by integrating CPU, GPU, and Neural Engine into a single die with shared memory.
Traditional Discrete CPU+GPU:
┌────────────┐ ┌─────────────┐
│ Xeon CPU │ │ RTX 3090 │
│ 64 GB DDR5 │─────PCIe─────│ 24 GB GDDR6│
└────────────┘ (16 GB/s) │ (mem copy) │
└─────────────┘
Bottleneck: PCIe × 16 transfers at 16 GB/s, latency ~5 µs per transfer
Apple M4 Max Unified Memory:
┌──────────────────────────────────────────┐
│ M4 Max (Single Die) │
│ ┌───────────┬──────────────┬─────────┐ │
│ │ CPU │ GPU │ NE │ │
│ │ 12 cores │ 40 cores │ 16 core │ │
│ │ (RISC SoC)│ (streaming │ (Tensor)│ │
│ └───────────┴──────────────┴─────────┘ │
│ Shared 128 GB LPDDR5 │
│ (120 GB/s bandwidth) │
│ (500 ns latency) │
└──────────────────────────────────────────┘
Zero PCIe overhead; GPU accesses CPU memory without copy
CPU accesses GPU memory without copy
Impact on Machine Learning:
- Traditional (copy): CPU→GPU:
- Prepare tensor on CPU (1 ms)
- Copy to GPU (0.5 ms for 8 GB)
- Execute on GPU (10 ms)
- Copy back (0.5 ms)
-
Total: 12 ms; 83% GPU-time
-
Unified Memory (M4 Max):
- Execute on GPU (10 ms)
- No copy overhead
- Total: 10 ms; 100% GPU-time
For small tensors (inference models): unified memory eliminates 1–3 ms of transfer latency per pass → significant for latency-critical applications (AR, robotics).
Key Insight: Unified memory requires custom silicon integration (Apple's advantage). Discrete GPUs never achieve this; CXL attempts to approach it.
8. Performance Analysis & Measurement¶
Optimizing code requires measurement before optimization. This section covers tools, methodologies, and mental models for identifying bottlenecks.
8.1 Benchmarking Fundamentals¶
Microbenchmark: Isolated measurement of one component (cache latency, memory bandwidth, branch misprediction cost)
Application Benchmark: Real workload running end-to-end
Reproducibility Checklist:
- Disable CPU turbo boost (
echo 1 | tee /sys/devices/system/cpu/intel_pstate/no_turbo) - Pin threads to specific cores (
taskset -c 0-7 ./program) - Warm cache by running once before measuring
- Run multiple iterations; report median/stddev
- Close other processes; silence background noise
8.2 Roofline Model¶
The Roofline Model is a visual way to identify whether code is compute-bound or memory-bound.
Axes:
- Horizontal (X): Arithmetic Intensity = FLOPs per byte loaded from DRAM
- Vertical (Y): Performance = GFLOPs achieved
Two Constraints:
- Compute Roofline:
Performance = Peak FLOPs(doesn't scale with intensity) -
Example: 3 TFLOPS (Orin Nano GPU) → horizontal line at Y=3
-
Memory Roofline:
Performance = Memory Bandwidth × Arithmetic Intensity - Example: 50 GB/s × Arithmetic Intensity = Memory limited throughput
- 50 GB/s × 1 FLOP/byte = 50 GFLOPS max if intensity=1
Visual:
Performance (GFLOPS)
↑
3.0 ├────┬─────────────────────── Compute Roofline (Peak TFLOPS)
│ │\
1.0 ├────┤ \ Memory Roofline
│ │ \ (Slope = Bandwidth)
0.5 ├────┤ \___
│ │ \___
└────┴─────────────\──────→ Arithmetic Intensity
0 1 4 (FLOPs per byte)
Interpretation:
* Intensity < 1: Memory-bound (dot below roofline)
* Intensity > 1: Approaches compute roofline; ideally at compute roofline
Case Study: Matrix Multiply on Orin Nano
Orin Nano GPU: * Peak: 1.5 TFLOPS * Memory: 51.2 GB/s (LPDDR5x, 4-channel)
For N×N matrix multiply: * FLOPs: 2×N³ (multiply + add) * Memory: 3×N² bytes loaded (A, B, + partial C) * Arithmetic Intensity = 2N³ / 3N² = 2N/3
For N=512: Intensity = 341 FLOPS/byte → Should reach 1.5 TFLOPS (compute-bound ✓) For N=64: Intensity = 43 FLOPS/byte → Roofline limits to ~2.1 GFLOPS (memory-bound) For N=16: Intensity = 11 FLOPS/byte → Roofline limits to ~0.55 GFLOPS
Optimization Strategy: * Large N: Compute-bound; optimize kernel * Small N: Memory-bound; batch, fuse operations
Key Insight: Roofline explains 80% of performance problems. Measure roofline once per hardware platform; use it to predict scalability before optimizing.
8.3 Amdahl's Law & Parallel Scaling¶
Not all code parallelizes. Amdahl's Law predicts the maximum speedup from parallel execution.
Formula:
Speedup = 1 / (1 - P + P/S)
P = fraction of code that is parallel (0 to 1)
S = speedup of parallel portion (number of cores)
Examples (2% Serial, 98% Parallel):
| Cores | Speedup | % Peak |
|---|---|---|
| 1 | 1.0 | 100% |
| 4 | 3.5 | 87% |
| 8 | 6.2 | 78% |
| 16 | 10.9 | 68% |
| 32 | 18.5 | 58% |
The serial 2% (a few synchronization barriers, I/O operations) becomes the bottleneck as core count increases.
For AI Workloads:
- Data Parallelism (batching): Near-linear scaling if batch processing doesn't require global synchronization
- Model Parallelism (split model across GPUs): Requires all-reduce communications → lower scaling efficiency due to synchronization
- Pipeline Parallelism (each GPU stage N): Scales linearly if stages are balanced
Key Insight: Multi-GPU scaling rarely exceeds 80% efficiency due to synchronization. Doubling GPUs → expect 1.6x–1.8x throughput (not 2x).
8.4 Profiling Tools & Workflow¶
CPU Profiling (Linux):
# Record CPU samples
sudo perf record -F 99 -g ./myprogram
# Analyze: which functions consume time
perf report
# Flamegraph visualization
perf script | ~/prof/flamegraph/stackcollapse-perf.pl | flamegraph.pl > profile.svg
GPU Profiling (NVIDIA, Jetson):
# Profile CUDA kernels, memory transfers
nsys profile -o profile.nsys-rep ./myprogram
# Wait for completion, open in UI
nsys-ui profile.nsys-rep
# Output: Timeline of kernels, memory transfers, CPU-GPU sync points
Memory Profiling (Cache Behavior):
# Measure L1/L2/L3 misses, memory traffic
valgrind --tool=cachegrind ./myprogram
# Generates cachegrind.out.PID with detailed cache stats
cg_annotate cachegrind.out.PID | head -50
Custom Microbenchmarks:
// Measure FLOPS
for (int i = 0; i < ITERATIONS; i++) {
sum += a[i] * b[i]; // 1 FLOP per iteration
}
double gflops = (double)ITERATIONS / elapsed_seconds / 1e9;
// Measure Memory Bandwidth
for (int i = 0; i < N; i++) {
dst[i] = src[i]; // Load + Store = 16 bytes
}
double gb_s = (double)(2 * N * sizeof(float)) / elapsed_seconds / 1e9;
8.5 Case Studies: Bottleneck Analysis¶
Case 1: Vector Dot Product (Memory-Bound)
# Python pseudocode
result = 0
for i in range(N):
result += a[i] * b[i] # 1 Load + 1 Load + 1 FLOP + 1 Store
- Arithmetic Intensity: 1 FLOP / 16 bytes loaded (2 loads) = 0.0625
- Roofline (assume 500 GB/s CPU memory): 500 × 0.0625 = 31 GFLOPS
- Actual Performance: ~30 GFLOPS
- Conclusion: Memory-bound (can't improve with wider vectors; fix with algorithmic changes like blocked GEMM)
Case 2: Matrix Multiply (Compute-Bound if Blocked)
// Naive (N³ memops, poor cache reuse)
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++)
C[i][j] += A[i][k] * B[k][j];
// Blocked (fits small matrices in L3 cache)
// Tile size: 64×64 (if B_tile fits in L3)
// Intensity: (2 × 64³) / (3 × 64²) = 85 FLOPS/byte → compute-bound
- Naive: ~50 GFLOPS (memory-bound, intensity ≈ 0.25)
- Blocked: 450 GFLOPS (compute-bound, intensity ≈ 85)
- 9x improvement from cache locality, not SIMD or clock speed
Case 3: Sparse Matrix-Vector Multiply (Unpredictable Memory)
// A is sparse (CSR format), v is dense
for (int i = 0; i < M; i++) {
y[i] = 0;
for (int j = row_ptr[i]; j < row_ptr[i+1]; j++) {
y[i] += A_val[j] * v[col_idx[j]]; // v[] is random access
}
}
- Random accesses to v[] → no prefetch, cache misses
- Memory bandwidth utilization: ~10% (stalls waiting for L1 on every 10th byte loaded)
- Performance: 10–50 GFLOPS (despite 200+ GFLOPS peak)
- Fix: Reorder matrix to improve locality, use sparse GPU kernels, decompress if sparsity is >90%
Bottleneck Diagnosis Table:
| Symptom | Root Cause | Fix |
|---|---|---|
| Using <20% of peak FLOPS | Memory-bound | Batch, increase arithmetic intensity |
| High cache miss rate L3 | Poor locality | Tile/block, prefetch |
| Low IPC (<1.5) | Dependencies | Parallelize independent ops |
| Branch mispredict % high | Irregular control | Branchless code, predication |
| Many context switches | Contention | Pin threads, reduce sharing |
Key Insight: The roofline model explains 80% of problems; the other 20% require profiling-based diagnosis.
9. Hands-On Labs & Projects¶
Understanding architecture in the abstract is useful; validating theory experimentally is essential. This section specifies labs that directly test the concepts from Sections 6–8.
9.1 Lab 1: Cache Behavior Analysis (6–8 hours)¶
Objective: Understand how cache size, line size, and associativity affect real performance.
Tools: valgrind --tool=cachegrind (open source, accurate L1/L2/L3 simulation)
Setup:
# Install valgrind (Linux)
sudo apt-get install valgrind
# Write test program in C
cat > cache_test.c << 'EOF'
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
int size = atoi(argv[1]); // Array size in bytes
int stride = atoi(argv[2]); // Access stride
int *array = malloc(size);
int sum = 0;
clock_t start = clock();
for (int i = 0; i < size / sizeof(int); i += stride / sizeof(int)) {
sum += array[i];
}
double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
printf("Size: %d, Stride: %d, Sum: %d, Time: %.3f\n", size, stride, sum, elapsed);
free(array);
return 0;
}
EOF
gcc -O2 -o cache_test cache_test.c
Tasks:
- Vary array size, measure cache misses:
Expected output:
Size 32K (L1): Low miss rate (~1%)
Size 256K (L2): Low miss rate (~2%)
Size 2M (L3): Mid miss rate (~5%)
Size 4M (beyond L3): High miss rate (~50%)
- Measure stride impact:
# Sequential (stride = 1 integer = 4 bytes) valgrind --tool=cachegrind ./cache_test 1048576 4 # Strided (stride = 64 bytes, skips within cache line) valgrind --tool=cachegrind ./cache_test 1048576 64 # Very strided (stride = 256 bytes, multiple cache lines) valgrind --tool=cachegrind ./cache_test 1048576 256
Expected: Sequential prefetch works; stride >64 bytes defeats prefetch.
- Construct L1/L2/L3 miss rate matrix:
| Array Size | L1 MR | L2 MR | L3 MR |
|---|---|---|---|
| 8 KB | <1% | <1% | <1% |
| 32 KB | 1% | <1% | <1% |
| 256 KB | 30% | 2% | <1% |
| 1 MB | 40% | 20% | 1% |
| 4 MB | 45% | 35% | 10% |
| 16 MB | 47% | 40% | 25% |
Success Metric: Identify L1/L2/L3 cache size boundaries ±10% (should be 32KB/256KB/8MB on typical system).
Deliverables: * C program with variable-size array access * Valgrind output showing L1/L2/L3 miss rates * Analysis: Where does miss rate inflect (cache boundary)?
9.2 Lab 2: Branch Predictor Simulation (8–10 hours)¶
Objective: Implement a branch predictor and measure real branch traces.
Tools: Linux perf (branch tracing), Python (simulator), GCC
Setup:
# Collect branch trace on real code
cat > branch_test.c << 'EOF'
#include <stdio.h>
#include <stdlib.h>
int binary_search(int *arr, int n, int target) {
int low = 0, high = n - 1;
while (low <= high) { // BRANCH: loop condition
int mid = (low + high) / 2;
if (arr[mid] == target) // BRANCH: equality check
return mid;
else if (arr[mid] < target) // BRANCH: less-than check
low = mid + 1;
else
high = mid - 1;
}
return -1;
}
int main() {
int arr[1000];
for (int i = 0; i < 1000; i++) arr[i] = i;
long sum = 0;
for (int i = 0; i < 100000; i++) {
sum += binary_search(arr, 1000, rand() % 1000);
}
printf("Sum: %ld\n", sum);
return 0;
}
EOF
gcc -O2 -o branch_test branch_test.c
Tasks:
-
Collect branch trace:
-
Implement predictors in Python:
# 2-bit saturating counter predictor class TwoBitPredictor: def __init__(self): self.counters = {} # PC -> counter (0,1,2,3) def predict(self, pc): counter = self.counters.get(pc, 2) # Default: weakly taken return counter >= 2 # True if taken def update(self, pc, taken): counter = self.counters.get(pc, 2) if taken: counter = min(3, counter + 1) else: counter = max(0, counter - 1) self.counters[pc] = counter # Parse branch trace and evaluate correct = 0 total = 0 for line in branches: pc, actual_taken = parse_branch(line) predicted = predictor.predict(pc) if predicted == actual_taken: correct += 1 predictor.update(pc, actual_taken) total += 1 accuracy = correct / total print(f"2-bit Predictor: {accuracy:.1%} accuracy") -
Try multiple predictor models:
- Always-taken
- 1-bit history
- 2-bit saturating counter
- Bimodal (16K entries)
- Gshare (8K entries, use global history)
Success Metric: Achieve >95% accuracy on conditional branches (realistic), demonstrate how accuracy improves with history.
Deliverables: * Branch trace from perf * Python simulator with 3+ predictor models * Accuracy comparison table * Analysis: Which prediction scheme works best for binary search?
9.3 Lab 3: Memory Bandwidth Measurement (4–6 hours)¶
Objective: Measure actual memory bandwidth for sequential vs. random access.
Tools: Custom C program, taskset (CPU affinity)
Setup:
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>
#define MB (1024 * 1024)
#define ITERATIONS 100
double measure_bandwidth(int *data, int size, int stride, int iterations) {
int sum = 0;
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
for (int iter = 0; iter < iterations; iter++) {
for (int i = 0; i < size / sizeof(int); i += stride / sizeof(int)) {
sum += data[i];
}
}
clock_gettime(CLOCK_MONOTONIC, &end);
double elapsed = (end.tv_sec - start.tv_sec) +
(end.tv_nsec - start.tv_nsec) / 1e9;
double bytes = (long long)iterations * size;
return bytes / elapsed / 1e9; // GB/s
}
int main() {
int *data = malloc(256 * MB);
memset(data, 0, 256 * MB);
printf("Sequential Read Bandwidth:\n");
for (int mb = 1; mb <= 256; mb *= 2) {
double bw = measure_bandwidth(data, mb * MB, 4, ITERATIONS);
printf("%d MB: %.1f GB/s\n", mb, bw);
}
printf("\nRandom Read Bandwidth:\n");
for (int mb = 1; mb <= 256; mb *= 2) {
double bw = measure_bandwidth(data, mb * MB, 256, ITERATIONS);
printf("%d MB: %.1f GB/s\n", mb, bw);
}
free(data);
return 0;
}
Tasks:
-
Compile and run:
-
Measure scaling:
- Sequential read (stride = 4 bytes): Should reach peak bandwidth (~50 GB/s on DDR5-5600)
- Stride = 64 bytes: Should still be close to peak (prefetcher works)
- Stride = 256 bytes: Should drop significantly (~20–30% of peak)
- Random (stride = random): Should drop to membar ~10% of peak
Success Metric: Demonstrate 5–10x bandwidth difference between sequential and random access.
Deliverables: * C program with bandwidth measurement * Output table: size vs. stride vs. achieved bandwidth * Graph: sequential vs. random bandwidth
9.4 Lab 4: Roofline Analysis of Real Kernel (8–10 hours)¶
Objective: Profile a real algorithm (matrix multiply), measure roofline, validate theory.
Tools: CUDA or OpenMP, custom timing + FLOPs counter
Setup:
// Matrix multiply in C (CPU version)
void matmul(float *C, float *A, float *B, int N) {
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
float sum = 0.0f;
for (int k = 0; k < N; k++) {
sum += A[i*N + k] * B[k*N + j];
}
C[i*N + j] = sum;
}
}
}
// Measure FLOPs + bandwidth
int main() {
for (int N = 64; N <= 1024; N *= 2) {
float *A = malloc(N*N*sizeof(float));
float *B = malloc(N*N*sizeof(float));
float *C = malloc(N*N*sizeof(float));
// Initialize...
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
matmul(C, A, B, N);
clock_gettime(CLOCK_MONOTONIC, &end);
double elapsed = (end.tv_sec - start.tv_sec) +
(end.tv_nsec - start.tv_nsec) / 1e9;
double flops = 2LL * N * N * N; // Multiply + add
double gflops = flops / elapsed / 1e9;
// Roofline metrics
double bytes_loaded = 3LL * N * N * sizeof(float); // A, B, C
double intensity = flops / bytes_loaded;
printf("N=%d: %.1f GFLOPS, Intensity=%.1f FLOP/byte\n", N, gflops, intensity);
free(A); free(B); free(C);
}
}
Tasks:
-
Measure for different N:
-
Plot on roofline:
- X-axis: Arithmetic Intensity
- Y-axis: GFLOPS
- Draw roofline: Compute roofline (peak GFLOPS) + Memory roofline (bandwidth × intensity)
-
Plot measurements as points
-
Interpret:
- Lower N points should lie below memory roofline (memory-bound)
- Higher N points should approach compute roofline (compute-bound)
Success Metric: Observe transition from memory-bound to compute-bound as N increases; validate roofline prediction within ±10%.
Deliverables: * C code measuring matrix multiply (N=64 to 1024) * Plot: measured GFLOPS + roofline model * Analysis: At what N does code become compute-bound?
9.5 Lab 5: NUMA & Multi-Socket Impact (Optional, 6–8 hours)¶
Objective: Measure performance on local vs. remote DRAM in multi-socket system.
Tools: numactl, perf, Linux multi-socket system (Threadripper, Xeon, EPYC)
Setup:
# Check NUMA topology
numactl --hardware
# Example output:
# available: 2 nodes (0-1)
# node 0 cpus: 0-31
# node 0 memory: 96128 MB
# node 1 cpus: 32-63
# node 1 memory: 96128 MB
Tasks:
-
Measure local access:
-
Measure remote access:
-
Measure latency impact:
Success Metric: Demonstrate 2–3x latency penalty for remote DRAM, significant bandwidth reduction.
Deliverables: * NUMA topology output * Local vs. remote bandwidth table * Latency measurement (address-dependent chain) * Analysis: NUMA impact on matrix multiply with poor thread/memory binding
10. Form Factor Deep Dives¶
In practice, computers come in different form factors optimized for different use cases. Understanding form factor constraints shapes architecture decisions.
Laptops (2025–2026)¶
- Copilot+ PC & AI PC: Microsoft's specification requires NPU with 40+ TOPS. Intel Core Ultra, AMD Ryzen AI, and Qualcomm Snapdragon X all qualify. See dedicated ARM PC / Copilot+ PC section below.
- On-Device AI: Local LLM inference (Microsoft Copilot, ChatGPT, Gemini), AI-powered image/video editing, real-time translation.
- Thin & Light Trends: LPDDR5x soldered memory, single-fan or fanless designs, 70+ Wh batteries, OLED/Mini-LED displays.
- Gaming Laptops: RTX 50-series mobile GPUs (CES 2026), 240Hz+ QHD displays, per-key RGB, advanced cooling (vapor chamber + liquid metal).
- Mobile Workstations: ISV-certified GPUs (RTX PRO mobile), ECC memory support, color-accurate displays (DCI-P3 100%, Delta E<1), Thunderbolt 5 docking.
- CES 2026 Highlights:
- Lenovo, HP, Dell, ASUS, Acer all launched Copilot+ PCs with Intel Panther Lake and AMD Ryzen AI Max.
- Samsung Galaxy Book 5 with integrated AI features.
- ASUS ROG and MSI gaming laptops with RTX 5070/5080 mobile.
- Ultra-thin designs under 1 kg with fanless Snapdragon X2.
Workstations (2025–2026)¶
- Tower Workstations:
- Intel: Xeon W-3500 (Sapphire Rapids), LGA 4677, 8-channel DDR5 ECC, PCIe 5.0.
- AMD: Ryzen Threadripper 7000 (Storm Peak), up to 96 cores, sTR5, quad-channel DDR5 ECC.
- Apple: Mac Pro (M4 Ultra), Mac Studio (M4 Max/Ultra), unified memory architecture.
- Use Cases: CAD/CAM (SolidWorks, CATIA), simulation (ANSYS, COMSOL), VFX/3D rendering (Blender, Houdini), AI/ML development (local training with RTX PRO/Instinct GPUs).
- Key Differentiation from Desktops: ECC memory, ISV-certified GPUs, IPMI/remote management, higher reliability components, longer product lifecycle.
Servers (2025–2026)¶
- Dual-Socket & Multi-Node:
- Intel Xeon 6: P-core (Granite Rapids) for HPC/AI, E-core (Sierra Forest) for cloud/microservices density.
- AMD EPYC 9005 (Turin): Up to 192 cores/socket, 12-channel DDR5, 160 PCIe 5.0 lanes.
- AI Servers:
- NVIDIA DGX/HGX B200: 8x B200 GPUs with NVLink 5.0, 1.5 TB HBM3e total.
- NVIDIA GB200 NVL72: Rack-scale AI system.
- AMD Instinct MI300X platforms: OAM form factor, ROCm software stack.
- Edge Servers: Compact form factors for retail, manufacturing, telecom. NVIDIA Jetson AGX, Intel Xeon D, AMD EPYC Embedded.
- Liquid Cooling: Direct-to-chip and immersion cooling becoming standard for AI/HPC racks (>40 kW per rack). CES 2026 showcased rear-door heat exchangers and CDU (Coolant Distribution Unit) solutions from CoolIT, Asetek, and Vertiv.
- CES 2026 Server Trends:
- Supermicro, Dell, HPE, and Lenovo showcased GB200 NVL72-based AI platforms.
- EPYC Turin-based platforms dominating cloud deployments.
- CXL memory expansion modules for memory-intensive AI inference.
ARM PCs & Copilot+ PCs¶
The ARM PC ecosystem has matured dramatically in 2024–2026, shifting from a niche experiment to a mainstream computing platform alongside x86.
What is a Copilot+ PC?¶
- Microsoft's Definition: A Windows PC with an integrated NPU delivering 40+ TOPS, 16 GB+ RAM, and 256 GB+ SSD. Enables on-device AI features powered by Windows AI runtime.
- Qualifying Platforms: Qualcomm Snapdragon X Elite/Plus, Intel Core Ultra 200V+, AMD Ryzen AI 300+. All three architectures can earn Copilot+ branding, but Snapdragon X was the launch platform (June 2024).
- Key AI Features:
- Windows Recall: AI-powered timeline search that indexes everything you've seen on screen (privacy-gated, opt-in).
- Cocreator in Paint: Real-time AI image generation directly in Paint using the NPU.
- Live Captions with Translation: Real-time subtitle translation across 40+ languages, powered entirely on-device.
- Windows Studio Effects: AI-powered camera effects — background blur, eye contact correction, auto framing — processed by the NPU.
- Copilot Integration: Context-aware AI assistant with deep OS integration, file search, and app orchestration.
- Third-Party NPU Apps: Adobe Premiere Pro (AI-powered scene detection), DaVinci Resolve (NPU-accelerated denoising), and developer tools like ONNX Runtime and DirectML leveraging the NPU.
ARM PC Hardware Platforms¶
- Qualcomm Snapdragon X Elite (2024):
- Oryon custom cores (Nuvia-derived): 12 cores, up to 3.8 GHz (dual-core boost 4.3 GHz).
- Adreno X1 GPU: ~3.8 TFLOPS, supports Vulkan 1.3, DX12.
- Hexagon NPU: 45 TOPS (INT8). Dedicated AI accelerator separate from CPU/GPU.
- LPDDR5x-8448 (up to 64 GB), PCIe Gen 4, Wi-Fi 7, Bluetooth 5.4.
- Battery life advantage: 20–28 hours claimed in many OEM designs.
- Qualcomm Snapdragon X Plus (2024):
- 8-core or 10-core Oryon variants at lower TDP.
- Same Hexagon NPU (45 TOPS), slightly lower GPU CU count.
- Targets $799–$999 Copilot+ PCs from Lenovo, HP, Dell, Samsung, ASUS.
- Snapdragon X2 (Expected 2025–2026):
- Next-gen Oryon cores with IPC and clock improvements.
- Enhanced NPU (60+ TOPS expected) for next-gen Copilot+ features.
- Improved GPU with ray tracing support.
- Better app compatibility via improved x86 emulation and native ARM app ecosystem.
- Apple Silicon (macOS ARM):
- M4 family: Industry-leading single-thread performance and power efficiency.
- Not Windows Copilot+ (macOS only), but established the ARM PC viability that inspired Windows on ARM push.
- Developer-friendly: Universal Binary 2, Rosetta 2 for x86 translation.
- MediaTek Kompanio (Chromebook ARM):
- Kompanio 1380/1300T for premium Chromebooks.
- ARM-based ChromeOS — mature Linux-based ARM PC ecosystem.
- NVIDIA Grace (Server/Workstation ARM):
- 72 Neoverse V2 cores, LPDDR5x (up to 480 GB), designed for AI/HPC workloads.
- Grace Hopper Superchip: Grace CPU + Hopper GPU with NVLink-C2C coherent interconnect.
- Not a consumer PC platform but extends ARM into the data center alongside x86 EPYC/Xeon.
Software Compatibility & Emulation¶
- Prism (x86 Emulation on Windows ARM):
- Microsoft's x86/x64 emulation layer for Snapdragon X PCs.
- Runs most x86 Windows apps without modification — Office, Chrome, Steam, Adobe Creative Suite.
- Performance: Typically 70–90% of native x86 speed for most productivity apps; some gaming and heavy workloads see larger penalties.
- Improved in Windows 11 24H2+: Better compatibility with anti-cheat, kernel drivers, and virtualization.
- Native ARM64 Apps (Growing Ecosystem):
- Major apps with native ARM64 builds (2025–2026): Microsoft Office, Edge, Chrome, Firefox, Slack, Zoom, Spotify, WhatsApp, VLC, OBS Studio, Visual Studio Code, Visual Studio 2022, Adobe Photoshop/Lightroom/Premiere Pro, Blender (partial), AutoCAD.
- Developer tools: Node.js, Python, Git, Docker Desktop (ARM containers), WSL2 (ARM Linux), .NET 8+, Java (Adoptium ARM64).
- Games: Native ARM ports growing slowly. Most games run via Prism emulation; performance varies. Anti-cheat compatibility improving but still a challenge for some competitive titles.
- Rosetta 2 (macOS x86 Emulation):
- Apple's highly optimized x86-to-ARM translation for macOS.
- Near-native performance for most apps (often within 5–20% of native ARM).
- Mature ecosystem: Nearly all major macOS apps are now Universal Binary (native ARM + x86).
- Linux on ARM PCs:
- Ubuntu, Fedora, and Debian fully support ARM64 (aarch64).
- Snapdragon X Linux support improving (kernel 6.8+ with Qualcomm mainline patches) but still experimental in 2025 — display, Wi-Fi, and GPU drivers maturing.
- Asahi Linux: Excellent macOS-to-Linux on Apple Silicon, with GPU acceleration (AGX driver).
ARM PC vs. x86 PC — When to Choose What¶
| Factor | ARM PC (Snapdragon X / Apple Silicon) | x86 PC (Intel / AMD) |
|---|---|---|
| Battery Life | Excellent (20–28 hrs typical) | Good (10–16 hrs typical) |
| Fanless/Thin Design | Common (sub-1 kg possible) | Limited to low-power chips |
| Single-Thread Perf | Competitive (Apple leads) | Intel/AMD still strong |
| Multi-Thread Perf | Good (12 cores Snapdragon X) | Superior (up to 24 cores desktop) |
| App Compatibility | ~95% via emulation, growing native | 100% native |
| Gaming | Limited (emulation overhead, driver gaps) | Full ecosystem |
| AI/NPU Features | Core differentiator (Copilot+) | Intel/AMD catching up with NPUs |
| Enterprise/IT | Growing (Intune, SCCM support) | Mature, established |
| Developer Tools | Good (VS Code, Docker, WSL2) | Complete |
| Peripheral Support | Most USB/BT work; some driver gaps | Universal |
| Price | Competitive ($799+ for Copilot+) | Broad range ($400+) |
ARM Server & Cloud (Extending ARM Beyond PCs)¶
- AWS Graviton4: Custom ARM Neoverse cores, 30–40% better price-performance than x86 for cloud workloads. Most popular ARM server instance.
- Ampere Altra Max / AmpereOne: Up to 192 ARM cores per socket. Used by Oracle Cloud, Microsoft Azure, Google Cloud.
- Microsoft Azure Cobalt 100: ARM-based custom chip for Azure VMs, optimizing cloud-native workloads.
- NVIDIA Grace CPU: ARM for AI/HPC servers (see above).
- Implications for PC Developers: Software developed on ARM PCs (Snapdragon X, Apple Silicon) can run natively in ARM cloud environments — a unified ARM development-to-deployment pipeline.
CES 2026 ARM PC / Copilot+ Highlights¶
- Qualcomm Snapdragon X2 Announcements: Improved Oryon cores, 60+ TOPS NPU, ray tracing GPU — featured in new designs from Lenovo ThinkPad, HP EliteBook, Dell Latitude, and Samsung Galaxy Book.
- OEM Expansion: ARM Copilot+ PCs now available from 10+ OEMs across consumer, business, and education segments.
- Microsoft Copilot+ Updates: New AI features exclusive to Copilot+ PCs — enhanced Recall with app-level integration, Copilot Actions for task automation, and improved Windows Studio Effects (gesture recognition, object removal in video calls).
- Developer Ecosystem: Arm announced Windows on Arm developer kit updates, improved Prism emulation, and partnerships with game studios for native ARM64 game ports.
- Enterprise Adoption: Lenovo and HP showcased ARM-based enterprise fleets with Intune management, BitLocker, and Windows Autopilot support — positioned as x86 alternatives for knowledge workers.
11. Motherboard & Platform¶
Desktop Platforms¶
- Intel LGA 1851 (Core Ultra 200S): Z890/B860 chipsets, DDR5, PCIe 5.0 (x16 GPU + x4 SSD), Thunderbolt 4, Wi-Fi 7.
- AMD AM5 (Ryzen 7000/9000): X870E/X870/B850 chipsets, DDR5, PCIe 5.0, USB4. Long-lived socket (AMD committed through 2025+).
Workstation Platforms¶
- Intel LGA 4677 (Xeon W-3500): W790 chipset, 8-channel DDR5 ECC, 112 PCIe 5.0 lanes.
- AMD sTR5 (Threadripper 7000): TRX50 chipset, quad-channel DDR5 ECC, 88 PCIe 5.0 lanes.
Server Platforms¶
- Intel LGA 7529 (Xeon 6): Multi-chipset options, CXL 2.0, DDR5 RDIMM/MRDIMM.
- AMD SP5 (EPYC 9005): Dual-socket capable, 12-channel DDR5, CXL 2.0, PCIe 5.0 x160.
Key Chipset Features (2025–2026)¶
- Thunderbolt 5 integration (Intel 200-series)
- Wi-Fi 7 standard on mid-range and above
- USB4 v2.0 on flagship boards
- Integrated 5 GbE on high-end desktop boards
12. Power Supply & Thermal Management¶
Power Supply¶
- ATX 3.1 / ATX12VO: Updated PSU standards with 12V-2x6 (600W) GPU power connector, replacing legacy Molex/SATA power. Required for RTX 50-series GPUs.
- 80 PLUS Efficiency: Titanium (96%) > Platinum (94%) > Gold (92%). Higher efficiency = less waste heat and lower electricity costs.
- Wattage Trends: High-end desktop builds need 850W–1200W+ for RTX 5090 + modern CPUs. Workstations may need 1600W+ for multi-GPU.
Thermal Management¶
- Air Cooling: Tower coolers (Noctua NH-D15 G2, DeepCool Assassin IV) still competitive for ≤200W TDP CPUs.
- AIO Liquid Cooling: 240mm–420mm radiators for high-TDP desktop CPUs (360mm sweet spot for 250W+ chips).
- Direct-Die Cooling: Enthusiast technique — removing IHS for direct contact with cooling solution.
- Laptop Thermals: Vapor chamber, liquid metal TIM, multi-fan designs with dedicated GPU and CPU heat pipes.
- Server/Data Center: Direct-to-chip liquid cooling (cold plates), rear-door heat exchangers, immersion cooling (single-phase and two-phase). 40–100+ kW per rack for AI workloads.
- CES 2026: Thermaltake, Corsair, and NZXT showcased AIO coolers with integrated LCD displays and AI-driven fan curves.
13. CES 2026 Key Announcements Summary¶
| Category | Notable Announcements |
|---|---|
| CPUs | Intel Panther Lake (18A), AMD Ryzen AI Max (Strix Halo), AMD Ryzen Z2, Qualcomm Snapdragon X2 |
| GPUs | NVIDIA RTX 5060/5050 mobile, AMD RDNA 4 mobile variants |
| Memory | MRDIMM DDR5-8800 for servers, LPDDR5x-8533 standard in laptops, CXL 2.0 memory expanders |
| Storage | 8 TB consumer M.2 NVMe, PCIe Gen 6 controller demos, computational storage announcements |
| AI PCs | Copilot+ PC ecosystem expansion, 60+ TOPS NPUs, on-device LLM inference demos |
| Displays | OLED monitors (27" 4K 240Hz), 8K Mini-LED, transparent OLED monitors (Samsung, LG) |
| Connectivity | Wi-Fi 7 ubiquitous, Thunderbolt 5 on more laptops, Wi-Fi 8 previews, OCuLink eGPU docks |
| Cooling | AI-optimized fan curves, 420mm AIO with LCD, immersion cooling for prosumers |
| Servers | GB200 NVL72 rack deployments, EPYC Turin platforms, CXL memory pooling demos |
Resources¶
Books¶
- "Computer Organization and Design: RISC-V Edition" by Patterson & Hennessy: Foundational text on how CPUs work, covering pipelining, caches, and memory hierarchy.
- "Structured Computer Organization" by Andrew S. Tanenbaum: Layered approach to understanding computer hardware from digital logic to OS.
Online Resources¶
- AnandTech Archive / TechInsights: Deep-dive CPU/GPU architecture analysis.
- ServeTheHome: Server, workstation, and data center hardware reviews and analysis.
- Chips and Cheese: Detailed microarchitecture analysis for Intel, AMD, Apple, and ARM.
- WikiChip / WikiChip Fuse: CPU architecture database and news.
- Tom's Hardware / GamersNexus: Consumer hardware reviews with technical depth.
- CES 2026 Coverage: The Verge, Ars Technica, AnandTech for product announcements and hands-on.
Vendor Technical Documentation¶
- Intel ARK & Developer Zone: Specifications, whitepapers, and optimization guides.
- AMD Developer Resources: EPYC tuning guides, RDNA/CDNA architecture docs.
- Apple Platform Documentation: Apple Silicon architecture and performance guides.
- NVIDIA Developer: CUDA programming guides, GPU architecture whitepapers.
Projects¶
- Build a Desktop PC: Assemble a modern DDR5/PCIe 5.0 system (AMD AM5 or Intel LGA 1851). Document component selection trade-offs and benchmark with Cinebench, 3DMark, and CrystalDiskMark.
- Benchmark CPU Architectures: Compare single-thread vs. multi-thread performance across Intel, AMD, and Apple Silicon using Geekbench 6, SPEC CPU, and real-world workloads (compilation, video encode).
- Storage Performance Analysis: Benchmark PCIe Gen 4 vs. Gen 5 NVMe SSDs with fio (random 4K IOPS, sequential throughput, latency under load). Compare consumer vs. enterprise drives.
- GPU Compute Comparison: Run AI inference benchmarks (Stable Diffusion, LLM inference with llama.cpp) on NVIDIA RTX vs. AMD Radeon vs. Apple M4 GPU. Compare TOPS, VRAM utilization, and power efficiency.
- Server Hardware Lab: Set up a home lab with a used server (Dell PowerEdge, HPE ProLiant) — configure RAID, IPMI/BMC remote management, and run virtualization (Proxmox/ESXi).
- Memory Bandwidth Analysis: Measure memory bandwidth and latency across DDR5 configurations (single vs. dual channel, different speeds) using AIDA64 or Intel MLC. Demonstrate impact on real workloads.
- Thermal Analysis Project: Compare cooling solutions (stock, tower, AIO) using HWiNFO64 logging. Measure CPU thermals, clock speeds, and sustained performance under Prime95/OCCT stress tests.
- ARM PC Compatibility Lab: Set up a Snapdragon X Copilot+ PC — test x86 app compatibility via Prism emulation, benchmark native ARM64 vs. emulated apps (Geekbench, Cinebench, real workloads), evaluate NPU features (Windows Studio Effects, Copilot), and document app compatibility gaps.
- Cross-Architecture Development: Build and test the same application on x86 (Intel/AMD), ARM (Snapdragon X or Apple Silicon), and Linux ARM (Raspberry Pi 5 or cloud Graviton). Compare build toolchains, performance, and deployment workflows.