AI Hardware Engineer Roadmap

ai-hpc/ai-hardware-engineer-roadmap

CUDA Advanced Optimization: Deep Dive

Five techniques that GPU engineers at companies such as NVIDIA, OpenAI, and Meta use to push inference and HPC kernels toward hardware limits. They go well beyond basic CUDA programming and are often what separates a "working" GPU kernel from a "production" one.

Why These Techniques Matter

Rough, workload-dependent figures (illustrative, not benchmarks):

Naive CUDA kernel:         ~30% of hardware peak  (common)
With kernel fusion:        ~50% of hardware peak
With CUDA Graphs:          10-30% lower latency
With cooperative groups:   enables grid-wide and sub-block synchronization that __syncthreads() alone cannot express
With persistent kernels:   near-zero kernel launch overhead
With warp specialization:  ~70-85% of hardware peak  (elite)

Production LLM inference engines and kernel libraries (TensorRT-LLM, vLLM, FasterTransformer, FlashAttention) rely on combinations of these techniques, though not every project uses all five.

Topic Index

#	Topic	What It Solves
01	CUDA Graphs	CPU launch overhead kills latency at small batch sizes
02	Cooperative Groups	Thread block boundary limits synchronization flexibility
03	Persistent Kernels	Repeated kernel launches waste SM setup time
04	Kernel Fusion	Separate kernels waste HBM bandwidth on intermediate results
05	Warp Specialization	Compute and memory latency are not overlapped inside a kernel
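As a minimal sketch of the kernel fusion row above (topic 04), the CUDA example below compares two separate elementwise kernels, which write an intermediate buffer to global memory, with one fused kernel that keeps the intermediate in a register. The kernel names, the scale-add-ReLU operation, and the helper `max_abs_diff_fused_vs_unfused` are hypothetical and chosen only for illustration; error checking is omitted for brevity.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Unfused step 1: tmp = a * x (intermediate goes to global memory).
__global__ void scale_kernel(const float* x, float* tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i];
}

// Unfused step 2: y = relu(tmp + b) (intermediate is read back).
__global__ void add_relu_kernel(const float* tmp, const float* b, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(tmp[i] + b[i], 0.0f);
}

// Fused: same math in one pass; the intermediate never leaves a register.
__global__ void fused_scale_add_relu(const float* x, const float* b, float* y,
                                     float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(a * x[i] + b[i], 0.0f);
}

// Hypothetical helper: runs both variants and returns the largest
// absolute difference between their outputs.
float max_abs_diff_fused_vs_unfused(int n) {
    std::vector<float> hx(n), hb(n), y_unfused(n), y_fused(n);
    for (int i = 0; i < n; ++i) {
        hx[i] = 0.001f * (i % 1000) - 0.5f;
        hb[i] = 0.25f - 0.002f * (i % 300);
    }

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *x, *b, *tmp, *ya, *yb;
    cudaMalloc(&x, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&tmp, bytes);
    cudaMalloc(&ya, bytes);
    cudaMalloc(&yb, bytes);
    cudaMemcpy(x, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, hb.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    const float a = 2.0f;

    // Two launches plus a round-trip through tmp.
    scale_kernel<<<blocks, threads>>>(x, tmp, a, n);
    add_relu_kernel<<<blocks, threads>>>(tmp, b, ya, n);
    // One launch, no intermediate buffer.
    fused_scale_add_relu<<<blocks, threads>>>(x, b, yb, a, n);

    // Blocking copies on the default stream also wait for the kernels.
    cudaMemcpy(y_unfused.data(), ya, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(y_fused.data(), yb, bytes, cudaMemcpyDeviceToHost);

    float max_diff = 0.0f;
    for (int i = 0; i < n; ++i)
        max_diff = fmaxf(max_diff, fabsf(y_unfused[i] - y_fused[i]));

    cudaFree(x); cudaFree(b); cudaFree(tmp); cudaFree(ya); cudaFree(yb);
    return max_diff;
}

int main() {
    printf("max |unfused - fused| = %g\n", max_abs_diff_fused_vs_unfused(1 << 20));
    return 0;
}
```

In this sketch the unfused pair moves roughly 5n floats through global memory (read x, write tmp, read tmp, read b, write y), while the fused kernel moves about 3n (read x, read b, write y). For a memory-bound elementwise operation, that difference in traffic is where most of the benefit would be expected to come from.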

How They Relate

CUDA Graphs          → reduces CPU↔GPU interface overhead
Cooperative Groups   → enables flexible intra-kernel synchronization
Persistent Kernels   → amortizes kernel launch overhead across many work items
Kernel Fusion        → reduces HBM round-trips between operations
Warp Specialization  → overlaps compute and memory within a single kernel

Combined (e.g. FlashAttention-3):
  Persistent kernel + warp specialization + cooperative groups
  → up to ~75% of H100 FP16 peak reported for attention kernels
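To make the CUDA Graphs entry in the list above more concrete, here is a minimal stream-capture sketch: several small kernel launches are recorded into a graph once, and the whole sequence is then replayed with a single `cudaGraphLaunch` call per iteration. The kernel `add_one` and the helper `run_graph_demo` are hypothetical names for illustration. The sketch assumes CUDA 11.4 or newer (for `cudaGraphInstantiateWithFlags`) and omits error checking.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Tiny kernel standing in for one step of a longer launch sequence.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: captures `kernels_per_graph` launches into one graph,
// replays the graph `replays` times, and returns the first element.
float run_graph_demo(int n, int kernels_per_graph, int replays) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemset(d, 0, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Record the launch sequence; nothing executes during capture.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < kernels_per_graph; ++k)
        add_one<<<blocks, threads, 0, stream>>>(d, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once; this is the expensive step that gets amortized.
    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);

    // Each replay submits the whole sequence with one CPU-side call.
    for (int r = 0; r < replays; ++r)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    float first = -1.0f;
    cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return first;
}

int main() {
    // 4 kernels per graph, 10 replays: each element is incremented 40 times.
    printf("data[0] = %.1f\n", run_graph_demo(1 << 10, 4, 10));
    return 0;
}
```

The design choice here is stream capture instead of building the graph node by node, since capture lets existing launch code be reused largely unchanged. The latency benefit would be expected mainly when the individual kernels are short enough that CPU launch cost is a visible fraction of the total time.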

LLM inference latency too high? → 01-CUDA-Graphs
Writing a custom reduction/scan? → 02-Cooperative-Groups
Kernel launch overhead visible in profile? → 03-Persistent-Kernels
GPU memory bandwidth bottleneck? → 04-Kernel-Fusion
Want to write FlashAttention-style kernels? → 05-Warp-Specialization