AI Hardware Engineer Roadmap
From firmware & AI applications to ML compilers — and ultimately, custom silicon.

This roadmap treats a custom AI chip as an 8-layer vertical stack and builds working literacy across all 8 layers, from AI applications and ML compilers down to RTL design and silicon fabrication.
The 8-Layer AI Chip Stack
| Layer | Name | Key technologies |
|---|---|---|
| L1 | AI Application & Framework | PyTorch, ONNX, Agentic AI, MLOps, quantization |
| L2 | Compiler & Graph Optimization | MLIR dialects, TVM, LLVM, custom NPU lowering |
| L3 | Runtime & Driver | C++ runtime, Linux kernel driver, CUDA-like API |
| L4 | Firmware & OS | RTOS (FreeRTOS), bootloader, embedded Linux |
| L5 | Hardware Architecture | Systolic arrays, HBM controllers, NoC, power domains |
| L6 | RTL & Logic Design | SystemVerilog, UVM verification, FPGA prototyping |
| L7 | Physical Implementation | EDA tools, place & route, timing closure |
| L8 | Fabrication & Packaging | Foundry process, CoWoS, post-silicon bring-up |
L1–L6: Hands-on projects throughout the curriculum | L7–L8: Theory and guided labs (OpenROAD, TinyTapeout)
How Phases Map to Layers
| Phase | L1 Application & Framework | L2 Compiler & Graph | L3 Runtime & Driver | L4 Firmware & OS | L5 Hardware Architecture | L6 RTL & Logic | L7 Physical Implementation | L8 Fab & Packaging |
|---|---|---|---|---|---|---|---|---|
| Phase 1 | ░░░ | | ░░░ | ░░░ | ███ | ███ | | |
| Phase 2 | | | ░░░ | ███ | | | | ░ |
| Phase 3 | ███ | | | | | | | |
| Phase 4A | | | ███ | | ██░ | ███ | | |
| Phase 4B | | | ███ | ██░ | ░░░ | | | ░ |
| Phase 4C | ░░░ | ███ | ░░░ | | | | | |
| Phase 5 | ░░░ | ░░░ | ░░░ | | ███ | ██░ | ██░ | ░░░ |

███ = primary coverage · ██░ = strong supporting · ░░░ = background · (blank) = minimal
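
The coverage chart above can also be read as a small lookup table. The sketch below transcribes its cells into a Python dictionary and answers "which phases give a layer a certain level of coverage". The names `LAYERS`, `COVERAGE`, and `phases_for` are illustrative and not part of the roadmap itself; the glyphs are copied as they appear in the chart, including the single `░` in the L8 column.

```python
# Coverage glyphs copied from the chart; "" means minimal coverage.
LAYERS = ["L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8"]

COVERAGE = {
    "Phase 1":  ["░░░", "",    "░░░", "░░░", "███", "███", "",    ""],
    "Phase 2":  ["",    "",    "░░░", "███", "",    "",    "",    "░"],
    "Phase 3":  ["███", "",    "",    "",    "",    "",    "",    ""],
    "Phase 4A": ["",    "",    "███", "",    "██░", "███", "",    ""],
    "Phase 4B": ["",    "",    "███", "██░", "░░░", "",    "",    "░"],
    "Phase 4C": ["░░░", "███", "░░░", "",    "",    "",    "",    ""],
    "Phase 5":  ["░░░", "░░░", "░░░", "",    "███", "██░", "██░", "░░░"],
}


def phases_for(layer, level="███"):
    """Return the phases whose chart cell for `layer` equals `level`."""
    column = LAYERS.index(layer)
    return [phase for phase, row in COVERAGE.items() if row[column] == level]
```

For example, `phases_for("L3")` returns `['Phase 4A', 'Phase 4B']`, and `phases_for("L7", "██░")` returns `['Phase 5']`.
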
5-Phase Curriculum
Phase 1: Digital Foundations (6–12 months)
The language of hardware — from gates and Verilog to CUDA kernels.
| Topic | Key Skills | Layer |
|---|---|---|
| Digital Design and HDL | Number systems, logic, memory; Verilog, testbenches, synthesis | L6 |
| Computer Architecture | ISA, pipelines, caches, OoO, coherence; modern CPUs/GPUs/memory | L5 |
| Operating Systems | Processes, threads, scheduling, memory management, drivers | L3/L4 |
| C++ and Parallel Computing | C++ & SIMD · OpenMP & oneTBB · CUDA & SIMT · ROCm & HIP · OpenCL & SYCL | L1/L3 |
Phase 2: Embedded Systems (6–12 months)
The boards and buses that sit next to inference — MCUs, RTOS, and embedded Linux.
| Topic | Key Skills | Layer |
|---|---|---|
| Embedded Software | ARM Cortex-M, FreeRTOS, SPI/UART/I2C/CAN, power, OTA | L4 |
| Embedded Linux | Yocto, PetaLinux, kernel, rootfs | L3/L4 |
Phase 3: Artificial Intelligence (6–12 months)
The workloads your hardware must run. Core + two tracks. · Hub: Phase 3 — Artificial Intelligence
Core (mandatory):
| # | Topic | Layer |
|---|---|---|
| 1 | Neural Networks — MLPs, CNNs, training, backpropagation | L1 |
| 2 | Deep Learning Frameworks — micrograd → PyTorch → tinygrad | L1/L2 |
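
Core topic 2 starts from micrograd-style scalar automatic differentiation. As a rough illustration of that idea, here is a minimal reverse-mode autodiff sketch. It is not micrograd's actual code: the `Value` class below is a hypothetical teaching reduction that supports only `+` and `*` between `Value` objects.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        """Run the chain rule from this node back to every input."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)  # parents are appended before children

        visit(self)
        self.grad = 1.0
        for node in reversed(order):  # outputs first, inputs last
            node._backward()
```

With `x = Value(2.0)`, `w = Value(3.0)`, `b = Value(1.0)` and `y = x * w + b`, calling `y.backward()` leaves `x.grad == 3.0`, `w.grad == 2.0`, and `b.grad == 1.0`. Frameworks such as PyTorch and tinygrad apply the same chain-rule bookkeeping to tensors rather than scalars.
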
Track A — Hardware & Edge AI (→ Phase 4):
| # | Topic | Layer |
|---|---|---|
| 3 | Computer Vision — detection, segmentation, 3D vision, OpenCV | L1 |
| 4 | Sensor Fusion — camera/LiDAR/IMU, Kalman, BEVFusion | L1 |
| 5 | Voice AI — STT (Whisper), TTS (Piper), VAD, keyword spotting | L1 |
| 6 | Edge AI & Model Optimization — quantization, pruning, deployment | L1 |
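
Quantization appears in Track A topic 6 and again in Phases 4C and 5. As a hedged illustration of what the term means at its simplest, the sketch below performs symmetric per-tensor int8 quantization in plain Python. The function names are illustrative, and real toolchains add calibration, per-channel scales, and zero points that are not shown here.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats; each value is off by at most about scale / 2."""
    return [q * scale for q in quantized]
```

Storing `quantized` plus a single `scale` takes roughly a quarter of the memory of 32-bit floats, which is the main reason the technique matters for edge deployment.
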
Track B — Agentic AI & ML Engineering (→ Phase 5 HPC/GenAI):
| # | Topic | Layer |
|---|---|---|
| 3 | Agentic AI & GenAI — LLM agents, RAG, tool use | L1 |
| 4 | ML Engineering & MLOps — training pipelines, model serving | L1 |
| 5 | LLM Application Development — fine-tuning, RAG architecture | L1 |
Phase 4: Hardware Deployment & Compilation (6–12 months per track)
Deploy AI on real silicon and learn how compilers bridge models to hardware.
Track A — Xilinx FPGA
| Topic | Key Skills | Layer |
|---|---|---|
| Xilinx FPGA Development | Vivado, IP, timing, ILA/VIO | L6 |
| Zynq UltraScale+ MPSoC | PS/PL, Linux on Zynq | L5/L6 |
| Advanced FPGA Design | CDC, floorplanning, power, PR | L6 |
| HLS | C→RTL, dataflow, pipelining | L5/L6 |
| Runtime & Drivers | XRT, DMA, kernel drivers, Vitis AI/FINN | L3 |
| Projects | 1080p→4K wireless video (VCU, MIPI, TDMA) | L3–L6 |
Track B — NVIDIA Jetson
| Topic | Key Skills | Layer |
|---|---|---|
| Jetson Platform | Orin Nano, JetPack 6.2.2, L4T, CUDA | L3 |
| Carrier Board | Schematic, PCB, thermal, bring-up | L5 |
| L4T Customization | Rootfs, kernel/DT, OTA | L3/L4 |
| FSP Firmware | FreeRTOS on SPE/AON | L4 |
| App Development | ML/AI, ROS 2, multimedia | L1/L3 |
| Security & OTA | Secure boot, OP-TEE, A/B OTA | L4 |
| Manufacturing | FCC/CE, DFM, production flash | L8 |
| Runtime & Drivers | CUDA runtime, TensorRT, DLA, DeepStream | L3 |
Track C — ML Compiler
| Topic | Key Skills | Layer |
|---|---|---|
| Compiler Fundamentals | Graph IR, LLVM, MLIR, TVM/tinygrad, custom backends | L2 |
| DL Inference Optimization | Triton, CUTLASS, Flash-Attention, TensorRT-LLM, quantization | L1/L2 |
Phase 5: Specialization Tracks (ongoing)
| Track | Focus | Guide |
|---|---|---|
| A: GPU Infrastructure | NVIDIA GPU (NCCL, NVLink) · AMD GPU (ROCm, HIP, MI300X) | Guide → |
| B: HPC | CUDA-X Libraries: 40+ GPU-accelerated libraries | Guide → |
| C: Edge AI | Efficient nets, quantization, Holoscan, real-time pipelines | Guide → |
| D: Robotics | Nav2, MoveIt, planning | Guide → |
| E: Autonomous Vehicles | openpilot, tinygrad, BEV, safety, Lauterbach | Guide → |
| F: AI Chip Design | Systolic arrays, dataflow, tinygrad↔hardware, ASIC path | Guide → |
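
Track F centers on systolic arrays. To make the term concrete, the following is a minimal cycle-level sketch of an output-stationary systolic array computing `C = A × B`, written in plain Python. It is a teaching model under simplifying assumptions (one multiply-accumulate per processing element per cycle, no memory system), and the function name and register layout are illustrative rather than taken from any specific chip.

```python
def systolic_matmul(A, B):
    """Simulate an M x N grid of PEs; PE(i, j) accumulates C[i][j] in place."""
    M, K, N = len(A), len(A[0]), len(B[0])
    acc = [[0] * N for _ in range(M)]    # stationary partial sums
    a_reg = [[0] * N for _ in range(M)]  # A operands moving left to right
    b_reg = [[0] * N for _ in range(M)]  # B operands moving top to bottom
    cycles = M + N + K - 2               # last operand pair reaches PE(M-1, N-1)

    for t in range(cycles):
        # Shift A one PE to the right; row i is fed with a skew of i cycles.
        for i in range(M):
            for j in range(N - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < K else 0
        # Shift B one PE down; column j is fed with a skew of j cycles.
        for j in range(N):
            for i in range(M - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < K else 0
        # Every PE multiplies the pair it currently holds and accumulates.
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]

    return acc, cycles
```

For `A = [[1, 2], [3, 4]]` and `B = [[5, 6], [7, 8]]`, the function returns `([[19, 22], [43, 50]], 4)`. The input skew is what lets each PE see matching `A[i][k]` and `B[k][j]` operands in the same cycle without any global broadcast.
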
Layer → Job Title Quick Reference
| Layer | Job titles | Primary phases |
|---|---|---|
| L1 | ML Inference Eng · Edge AI Eng · Agentic AI Eng · GenAI Eng · MLOps Eng | Phase 3, 4B, 4C |
| L2 | AI Compiler Eng · Graph Optimization Eng · Kernel Eng | Phase 4C |
| L3 | GPU Runtime Eng · Linux Kernel Eng · Embedded Linux BSP Eng | Phase 4A§5, 4B§8 |
| L4 | Firmware Eng · Embedded SW Eng · Embedded Linux Eng · IoT Eng | Phase 2, 4B§4 |
| L5 | AI Accelerator Architect · SoC Platform Eng | Phase 1§2, 4A, 5F |
| L6 | RTL Design Eng · FPGA Eng · DV Eng | Phase 1§1, 4A |
| L7 | Theory: OpenROAD, GDS flow | Phase 5F |
| L8 | Theory: chiplets, CoWoS, TinyTapeout | Phase 5F |

Cross-layer roles:
- Autonomous Vehicles HW/SW Engineer — L1 through L4 (Phase 4B + Phase 5E)
- AI Hardware Engineer (Full-Stack) — L1 through L6 (the signature role this roadmap targets)

Reference Projects
| Project | What it teaches | Used in |
|---|---|---|
| tinygrad | Minimal DL framework — IR, scheduler, BEAM, backends | Phase 3, 4C, 5E |
| openpilot | Production ADAS — camera→ISP→modeld→planning→CAN | Phase 4B, 5E |
Additional Resources
- Roles & Market Analysis — 23 sub-layers, salary data, job postings, remote %, hiring priorities
A community-driven educational roadmap for AI hardware engineering. · Star ⭐ if you find this useful