C++ and Parallel Computing (Phase 1 §4)¶
From a single instruction to a thousand GPU threads — how computing became parallel, and why every AI hardware engineer must think in parallel.
Layer mapping: L1 (application — you write the code that runs on hardware), L3 (runtime — CUDA runtime, OpenCL runtime are the bridge to the GPU driver).
Prerequisites: Phase 1 §2 (Computer Architecture — memory hierarchy, pipelining, caches), Phase 1 §3 (Operating Systems — threads, processes, scheduling).
What comes after: Phase 3 (Neural Networks — tensors map directly to these execution models), Phase 4 Track B (CUDA on Jetson with power limits), Phase 4 Track A (HLS/RTL — you design the hardware these kernels run on).
Why This Module Exists¶
Every AI inference chip — from NVIDIA's H100 to your future custom NPU — exists because sequential computing hit a wall. Understanding why parallelism is necessary, how it evolved, and how to program it is the foundation for everything in this roadmap.
If you can't think in parallel, you can't design hardware that runs parallel workloads.
Part 1 — How Computing Became Parallel (4 Steps)¶
Before diving into code, understand the historical pressure that created the hardware you'll design.
Step 1: Sequential Computing (Single-Core Era)¶
The simple model: one instruction executes, then the next, then the next.
For decades, performance came from clock speed scaling — run the same sequential code faster: - 1980s: 4.77 MHz (Intel 8088) - 1990s: 200 MHz (Pentium) - 2000s: 3+ GHz (Pentium 4)
What stopped it: The power wall. Power scales roughly as P = C * V^2 * f (capacitance x voltage-squared x frequency). By ~2004, clock speeds plateaued at 3-4 GHz because chips couldn't dissipate the heat. This is called Dennard scaling breakdown.
Connection to L5/L7/L8: This is why AI chips exist. If clock speed still scaled freely, you'd just run everything on a faster CPU. The power wall forced the industry into parallelism and specialization — which is exactly what you're building toward in this roadmap.
Step 2: Instruction-Level Parallelism (ILP / SIMD)¶
Key idea: do multiple things inside a single core per clock cycle.
Pipelining — overlap stages of different instructions:
One instruction completes per cycle even though each takes multiple cycles. (You studied this in Phase 1 §2 — Computer Architecture.)Superscalar execution — issue multiple independent instructions per cycle. Modern CPUs can execute 4-6 instructions per cycle if they don't depend on each other.
SIMD (Single Instruction, Multiple Data) — apply one operation to a vector of data:
Without SIMD: With SIMD (4-wide):
a[0] = b[0] + c[0] a[0..3] = b[0..3] + c[0..3] ← ONE instruction
a[1] = b[1] + c[1]
a[2] = b[2] + c[2]
a[3] = b[3] + c[3]
This is the first taste of data parallelism — the same concept that GPUs take to the extreme.
Connection to the stack: - L1/L2: AI compilers (TVM, MLIR) emit SIMD instructions when targeting CPUs. - L5: When you design an accelerator's vector unit, you're designing custom SIMD hardware.
Step 3: Multi-Core / Shared Memory Parallelism¶
Key idea: put multiple independent CPU cores on one chip.
When clock speed stopped scaling, chip designers added more cores: - 2005: Intel Core 2 Duo (2 cores) - 2010: Intel Core i7 (4 cores) - 2024: AMD EPYC 9004 (96 cores)
But software must explicitly use multiple cores. A single-threaded program uses one core; the rest sit idle. This is why OpenMP, pthreads, and oneTBB exist.
Core 0: Process chunk [0..N/4]
Core 1: Process chunk [N/4..N/2] ← all running simultaneously
Core 2: Process chunk [N/2..3N/4]
Core 3: Process chunk [3N/4..N]
The challenge: shared memory means race conditions, deadlocks, and cache coherence — all topics from Phase 1 §3 (OS).
Connection to the stack: - L4: Firmware on your AI chip's embedded cores (ARM/RISC-V) uses multi-core scheduling. - L5: NoC (Network-on-Chip) connects multiple compute tiles — same shared-memory model at hardware level.
Step 4: Heterogeneous Computing (CPU + GPU)¶
Key idea: use specialized hardware for specific workload types.
CPUs are optimized for latency — do one thing fast (branch prediction, out-of-order execution, large caches). GPUs are optimized for throughput — do many simple things simultaneously (thousands of small ALUs, minimal control logic).
| CPU | GPU | |
|---|---|---|
| Cores | 4–96 (complex) | 1,000–16,000 (simple) |
| Optimized for | Latency (one task fast) | Throughput (many tasks at once) |
| Control logic | ~50% of die area | ~5% of die area |
| Cache per core | Large (MB) | Small (KB) |
| Best for | Sequential code, branching | Data-parallel: matrix math, image processing |
Why this matters for AI: Neural network inference is almost entirely matrix multiplication and element-wise operations — perfectly data-parallel. A GPU can process 1000x more multiply-accumulate operations per second than a CPU at the same power.
Connection to the stack: - L5: Your custom AI chip is the next step beyond GPU — even more specialized for tensor operations. - L6: The systolic array you'll design in Phase 5 (AI Chip Design) is purpose-built for the exact workloads GPUs handle with general-purpose SIMT.
Part 2 — The Five Sub-Tracks¶
Study these in order. Each builds on the previous, and each maps to a layer of the chip stack.
| Order | Sub-track | What you learn | Layer | Guide |
|---|---|---|---|---|
| 1 | C++ and SIMD | Data-level parallelism inside a single core | L1 | Guide → |
| 2 | OpenMP and oneTBB | Thread-level parallelism across CPU cores | L1/L3 | Guide → |
| 3 | CUDA and SIMT | Massive parallelism on NVIDIA GPU (the main focus) | L1/L3 | Guide → |
| 4 | ROCm and HIP | AMD GPU programming — portable CUDA-like API | L1/L3 | Guide → |
| 5 | OpenCL and SYCL | Portable compute across GPU/FPGA/CPU, modern C++ abstraction | L1/L3 | Guide → |
Sub-Track 1: C++ and SIMD¶
The gateway to GPU thinking — same operation, multiple data.
What SIMD is: CPU vector instructions that process 4, 8, 16, or 32 data elements in one instruction.
| Instruction Set | Width | Data per instruction |
|---|---|---|
| SSE (1999) | 128-bit | 4x float32 |
| AVX (2011) | 256-bit | 8x float32 |
| AVX-512 (2017) | 512-bit | 16x float32 |
| ARM NEON | 128-bit | 4x float32 |
Manual intrinsics example:
#include <immintrin.h>
void add_vectors(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_load_ps(&a[i]); // Load 8 floats
__m256 vb = _mm256_load_ps(&b[i]); // Load 8 floats
__m256 vc = _mm256_add_ps(va, vb); // Add 8 pairs at once
_mm256_store_ps(&c[i], vc); // Store 8 results
}
}
Key concepts:
- Aligned memory: SIMD loads require 16/32/64-byte alignment (alignas(32))
- Auto-vectorization: Compilers can auto-vectorize simple loops (-O2 -march=native)
- Manual intrinsics: For control over exactly which instructions execute
Why it matters for AI hardware:
- cuDNN and CUTLASS use vectorized memory loads internally
- When you design an accelerator's vector unit (L5), you're designing custom SIMD
- MLIR's vector dialect (Phase 4C) targets exactly this level of abstraction
Projects:
- Implement vector addition with raw loops vs AVX intrinsics. Benchmark the speedup.
- Write a dot product using _mm256_fmadd_ps (fused multiply-add). Compare with scalar.
Sub-Track 2: OpenMP and oneTBB¶
Scale from one core to many — the CPU parallelism layer.
OpenMP — Pragma-Based (Easiest Start)¶
Add one line to a loop, and OpenMP distributes iterations across all CPU cores:
The OpenMP runtime creates a thread pool (one thread per core), divides the loop range equally, and joins when done. You control scheduling strategy:
// Static: divide N evenly across cores (predictable, best for uniform work)
#pragma omp parallel for schedule(static)
// Dynamic: each thread grabs a chunk when idle (better for uneven work)
#pragma omp parallel for schedule(dynamic, 64)
// Reduction: safely sum across threads
double total = 0.0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < N; i++) {
total += data[i];
}
OpenMP strengths: zero boilerplate, easy to add to existing code, widely supported (GCC, Clang, MSVC).
OpenMP weakness: limited control over task decomposition. For irregular workloads (tree traversals, graph algorithms, nested parallelism), you need something smarter.
oneTBB — Task-Based (More Control)¶
oneTBB (oneAPI Threading Building Blocks) is a C++ template library developed by Intel for parallel programming on multi-core processors. Originally known as TBB, it was rebranded as oneTBB in 2020 to integrate into the broader oneAPI ecosystem.
Key difference from OpenMP: Instead of parallelizing loops, oneTBB parallelizes tasks. You describe what can run in parallel — the runtime decides how to schedule it across cores.
What oneTBB provides:
| Component | What it does | Example |
|---|---|---|
| Parallel algorithms | Ready-to-use parallel patterns | parallel_for, parallel_reduce, parallel_sort, parallel_pipeline |
| Concurrent containers | Thread-safe data structures | concurrent_hash_map, concurrent_vector, concurrent_queue |
| Task-based runtime | Breaks work into small tasks, schedules across cores | Work-stealing scheduler (see below) |
| Flow graph | Dataflow-style parallelism | Connect processing nodes into a DAG |
Basic parallel_for:
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/blocked_range.h>
// Simple lambda version
tbb::parallel_for(0, N, [&](int i) {
C[i] = A[i] + B[i];
});
// blocked_range version (more control over grain size)
tbb::parallel_for(
tbb::blocked_range<size_t>(0, N, /*grain_size=*/1024),
[&](const tbb::blocked_range<size_t>& r) {
for (size_t i = r.begin(); i < r.end(); i++) {
C[i] = A[i] + B[i];
}
}
);
parallel_reduce (thread-safe accumulation):
#include <oneapi/tbb/parallel_reduce.h>
double total = tbb::parallel_reduce(
tbb::blocked_range<size_t>(0, N),
0.0, // identity value
[&](const tbb::blocked_range<size_t>& r, double partial_sum) {
for (size_t i = r.begin(); i < r.end(); i++) {
partial_sum += data[i];
}
return partial_sum;
},
std::plus<double>() // combine partial results
);
parallel_sort:
#include <oneapi/tbb/parallel_sort.h>
std::vector<int> data(10000000);
// ... fill data ...
tbb::parallel_sort(data.begin(), data.end());
// Automatically parallelizes across all cores
The Work-Stealing Scheduler — How oneTBB Actually Works¶
The work-stealing scheduler is the "brain" of oneTBB. It's what makes oneTBB automatically balance uneven workloads across cores — and it's a fundamental concept in parallel computing that appears in many systems (Go runtime, Tokio in Rust, Java ForkJoinPool).
How it works — step by step:
Initial state: 4 cores, work divided into tasks
Core 0 queue: [T1] [T2] [T3] [T4] [T5] ← lots of work
Core 1 queue: [T6] [T7] ← some work
Core 2 queue: [T8] ← little work
Core 3 queue: (empty) ← idle!
Step 1 — Private queues: Each worker thread (one per CPU core) maintains its own private deque (double-ended queue) of tasks.
Step 2 — LIFO execution (own work): A thread pops tasks from the bottom of its own deque (newest first). This is like a stack — Last-In, First-Out. Why? The most recently created task likely has its data still in the CPU's L1/L2 cache. LIFO maximizes cache locality.
Step 3 — The "steal" (when idle): When a thread finishes all its tasks, it doesn't sleep. Instead, it becomes a thief and looks at other threads' queues.
Step 4 — FIFO stealing (from the top): The thief steals from the top of another thread's deque (oldest task). This is critical: - The thief steals the oldest task (likely a large chunk of work, worth stealing) - The victim continues working on the newest task (undisturbed, cache-hot) - Minimal interference between thief and victim (they access opposite ends of the deque)
Core 3 is idle → steals T1 from Core 0's queue (top/oldest)
Core 0 queue: [T2] [T3] [T4] ← didn't notice, still working on T4
Core 3 queue: [T1] ← now has work!
Why this is better than a central queue:
| Approach | How it works | Problem |
|---|---|---|
| Central queue | All threads grab tasks from one shared queue | Contention: every thread fights for the lock |
| Static partition | Divide work evenly at start | Imbalance: some chunks finish faster than others |
| Work-stealing | Each thread has private queue; steal only when idle | Only incurs synchronization overhead when actually out of work |
Why it matters: - Dynamic load balancing: Handles irregular workloads where some tasks take 10x longer than others (tree traversals, sparse matrix, uneven image regions) - Scalability: Programs automatically scale to however many cores the machine has — no code changes needed - Efficiency: Near-zero overhead when all cores are busy; only the idle core pays the cost of stealing
Connection to the stack: - L4 (Firmware): Your AI chip's command processor needs a task scheduler for DMA, compute, and I/O tasks — work-stealing is one approach - L3 (Runtime): CUDA's stream scheduler and TensorRT's engine executor solve similar problems on GPU
OpenMP vs oneTBB — When to Use Which¶
| OpenMP | oneTBB | |
|---|---|---|
| Model | Pragma annotations on existing code | C++ task scheduler with templates |
| Ease of use | Very easy (one #pragma line) |
Moderate (C++ lambda + template patterns) |
| Control | Low (schedule policy, num threads) | High (grain size, task graphs, flow graphs, custom partitioners) |
| Load balancing | Static or dynamic (loop-level) | Work-stealing (automatic, task-level) |
| Nested parallelism | Awkward (nested parallel for creates too many threads) |
Natural (tasks spawn sub-tasks, scheduler handles it) |
| Concurrent containers | None (use critical sections) |
Built-in (concurrent_hash_map, concurrent_vector) |
| Best for | Regular loops, quick parallelization | Irregular work, graph algorithms, production C++ libraries |
| Compiler support | GCC, Clang, MSVC (built-in) | Separate library (install via package manager or oneAPI) |
Rule of thumb: Start with OpenMP for simple loops. Switch to oneTBB when you need irregular parallelism, concurrent containers, or fine-grained control.
Key Concepts (Both OpenMP and oneTBB)¶
- Data races: Two threads writing to the same memory → undefined behavior. Use
#pragma omp critical,std::mutex, or oneTBB's concurrent containers. - False sharing: Two threads writing to adjacent cache lines → cache thrashing even though they access different data. Fix: pad data structures to cache line boundaries (
alignas(64)). - Amdahl's Law: If 10% of your code is sequential, maximum speedup is 10x — no matter how many cores you add. Always parallelize the bottleneck first.
- Grain size: The minimum chunk of work per task. Too small → scheduling overhead dominates. Too large → poor load balancing. oneTBB's
auto_partitionertunes this automatically.
Why it matters for AI hardware: - Phase 4B Jetson: Cortex-A78AE cores use OpenMP for CPU-side preprocessing - Phase 2: FreeRTOS tasks are a form of multi-core parallelism on embedded SoCs - L4 (Firmware): Your AI chip's command processor uses multi-core scheduling — work-stealing is directly applicable
Projects:
1. OpenMP matmul — Parallelize matrix multiplication with #pragma omp parallel for. Measure speedup from 1 to N cores. Plot the scaling curve.
2. oneTBB parallel_reduce — Sum 100M floats using tbb::parallel_reduce. Compare with OpenMP reduction(+:sum). Measure both.
3. oneTBB parallel_sort — Sort 10M integers with tbb::parallel_sort. Benchmark against std::sort (single-threaded). Measure speedup.
4. Work-stealing visualization — Implement parallel merge sort with tbb::parallel_invoke. Use tbb::task_scheduler_observer to log which core runs which task. Observe work-stealing in action.
5. False sharing demo — Write a program where N threads increment N counters in a shared array. First version: counters are adjacent. Second version: counters are padded to 64 bytes. Measure the performance difference.
Bridge to GPU: CPU parallelism scales to 8–96 cores. For 10,000+ parallel operations (like a 4096-element vector times a 4096x4096 matrix), you need a GPU.
Sub-Track 3: CUDA and SIMT (Main Focus)¶
Thousands of threads, one program — the execution model behind every modern AI workload.
This is the most important sub-track in Phase 1. CUDA is the programming model for NVIDIA GPUs, which run the vast majority of AI training and inference worldwide. Understanding CUDA is essential for every role in the AI hardware stack.
GPU Architecture (The Hardware You're Programming)¶
Streaming Multiprocessor (SM): - The basic compute unit on a GPU - Contains: CUDA cores (ALUs), tensor cores, shared memory, register file, warp schedulers - NVIDIA A100: 108 SMs. H100: 132 SMs.
CUDA core: A simple ALU that can do one multiply-add per clock. An SM has 64-128 CUDA cores.
Tensor core: A specialized unit that does a 4x4 matrix multiply-accumulate in one instruction. This is what makes GPUs fast for AI — and what you'll eventually design your own version of (Phase 5F — AI Chip Design).
CUDA Execution Model¶
Hierarchy: Thread → Warp → Block → Grid
Grid (the entire job)
├── Block 0
│ ├── Warp 0 (threads 0-31)
│ ├── Warp 1 (threads 32-63)
│ └── ...
├── Block 1
│ ├── Warp 0
│ └── ...
└── ...
Key rules: - A warp = 32 threads that execute the same instruction simultaneously (SIMT) - A block = group of threads that share memory and can synchronize - A grid = all blocks for one kernel launch - You choose block size (typically 128 or 256 threads) and grid size (total work / block size)
Warp divergence: If threads in a warp take different if/else branches, both branches execute serially. Avoid branching in hot paths.
Memory Hierarchy¶
This is the most important concept for GPU performance. The difference between a fast and slow CUDA kernel is almost always memory access.
| Memory | Speed | Size | Scope | Managed by |
|---|---|---|---|---|
| Registers | Fastest (~0 cycles) | 64K per SM | Per-thread | Compiler |
| Shared memory | Fast (~5 cycles) | 48-164 KB per SM | Per-block | Programmer |
| L1 cache | Fast | 128 KB per SM | Per-SM | Hardware |
| L2 cache | Medium | 4-50 MB | All SMs | Hardware |
| Global memory (HBM) | Slow (~400 cycles) | 16-80 GB | All threads | Programmer |
Golden rule: Minimize global memory access. Keep data in shared memory or registers.
Coalesced memory access: When 32 threads in a warp access 32 consecutive addresses, the hardware combines them into one transaction. Random access = 32 separate transactions = 32x slower.
// GOOD: coalesced — thread i accesses element i
int val = data[threadIdx.x];
// BAD: strided — threads access every 32nd element
int val = data[threadIdx.x * 32];
CUDA Programming¶
A minimal kernel:
// Runs on GPU — one instance per thread
__global__ void vector_add(float* a, float* b, float* c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
c[i] = a[i] + b[i];
}
}
// Host code — launches the kernel
int main() {
int n = 1000000;
float *d_a, *d_b, *d_c;
// Allocate GPU memory
cudaMalloc(&d_a, n * sizeof(float));
cudaMalloc(&d_b, n * sizeof(float));
cudaMalloc(&d_c, n * sizeof(float));
// Copy data from CPU to GPU
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);
// Launch kernel: 256 threads per block, enough blocks to cover n elements
int block_size = 256;
int grid_size = (n + block_size - 1) / block_size;
vector_add<<<grid_size, block_size>>>(d_a, d_b, d_c, n);
// Copy result back to CPU
cudaMemcpy(h_c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
Key qualifiers:
- __global__ — kernel function, callable from CPU, runs on GPU
- __device__ — callable only from GPU code
- __host__ — runs on CPU (default)
Streams and async: Overlap data transfer with computation:
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream);
kernel<<<grid, block, 0, stream>>>(d_a, d_b, d_c);
cudaMemcpyAsync(h_c, d_c, size, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
The real bottleneck: CPU ↔ GPU data transfer over PCIe. A well-optimized program minimizes transfers and keeps data on the GPU.
Matrix Multiplication — The Core of AI¶
Every neural network layer is a matrix multiply. This is the operation you'll optimize in Phase 4C and eventually design hardware for in Phase 5F.
Naive CUDA matmul:
__global__ void matmul_naive(float* A, float* B, float* C, int N) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < N && col < N) {
float sum = 0.0f;
for (int k = 0; k < N; k++) {
sum += A[row * N + k] * B[k * N + col];
}
C[row * N + col] = sum;
}
}
Why this is slow: Each thread reads an entire row of A and column of B from global memory. N=4096 → each thread does 4096 global memory reads.
Tiled matmul with shared memory:
__global__ void matmul_tiled(float* A, float* B, float* C, int N) {
__shared__ float As[TILE][TILE];
__shared__ float Bs[TILE][TILE];
int row = blockIdx.y * TILE + threadIdx.y;
int col = blockIdx.x * TILE + threadIdx.x;
float sum = 0.0f;
for (int t = 0; t < N / TILE; t++) {
// Cooperatively load one tile into shared memory
As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
__syncthreads();
// Compute partial product from shared memory (fast!)
for (int k = 0; k < TILE; k++) {
sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
}
__syncthreads();
}
C[row * N + col] = sum;
}
Why this is faster: Each global memory element is loaded once into shared memory and reused TILE times. For TILE=32, that's 32x fewer global memory accesses.
Connection to the stack: - L2 (Compiler): This tiling is exactly what TVM, MLIR, and tinygrad do automatically - L5 (Architecture): Systolic arrays do this in hardware — data flows between PEs, each value is reused - L6 (RTL): In Phase 4A HLS, you'll implement tiled matmul as an FPGA accelerator
Projects:
1. Vector add — write a CUDA kernel, benchmark vs CPU.
2. SAXPY — y = a*x + y on GPU. Measure bandwidth (GB/s) and compare to theoretical peak.
3. Naive matmul — implement the naive version, measure GFLOPS.
4. Tiled matmul — add shared memory tiling. Measure the speedup vs naive.
5. CPU golden — for every GPU kernel, write a CPU reference and verify correctness.
Sub-Track 4: ROCm and HIP¶
AMD's answer to CUDA — write GPU code that runs on both NVIDIA and AMD hardware.
Full guide: ROCm and HIP Guide →
What you'll learn: AMD's two GPU architectures (CDNA for data center AI vs RDNA for gaming), HIP programming (95% identical to CUDA), HIPIFY toolchain for automatic CUDA→HIP conversion, AMD Compute Unit architecture, ROCm software stack and library ecosystem.
Key topics in the sub-guide: 1. CDNA vs RDNA — why MI300X (CDNA, compute-only, 192 GB HBM3) is completely different from Radeon RX (RDNA, gaming) 2. HIP API — CUDA-to-HIP mapping table, wavefront (64) vs warp (32) differences, code examples 3. HIPIFY deep dive — hipify-clang vs hipify-perl, what auto-converts, what needs manual work, full project porting workflow 4. CDNA Compute Unit — internal architecture diagram, SM vs CU comparison, optimization differences 5. ROCm stack — software layers, CUDA-X → ROCm library mapping (rocBLAS, MIOpen, RCCL, CK), profiling (Omniperf, Omnitrace)
Projects: HIPIFY a CUDA kernel, wavefront vs warp reduction, Omniperf roofline, rocBLAS benchmark, port a real CUDA project
Sub-Track 5: OpenCL and SYCL¶
Write once, run anywhere — portable compute across GPU, FPGA, and CPU.
OpenCL — The Multi-Vendor Standard¶
What OpenCL is: An open standard for parallel programming across heterogeneous devices. Unlike CUDA (NVIDIA only) or HIP (AMD+NVIDIA), OpenCL runs on NVIDIA, AMD, Intel, ARM, and FPGAs.
When to use OpenCL: - Targeting non-NVIDIA/non-AMD hardware (Intel GPUs, ARM Mali, Xilinx FPGAs) - Phase 4 Track A: Xilinx Vitis uses OpenCL as the host API for FPGA kernels - Embedded/mobile GPUs that only support OpenCL
OpenCL kernel:
__kernel void vector_add(__global float* a, __global float* b, __global float* c) {
int i = get_global_id(0);
c[i] = a[i] + b[i];
}
CUDA vs HIP vs OpenCL comparison:
| Concept | CUDA | HIP | OpenCL |
|---|---|---|---|
| Kernel qualifier | __global__ |
__global__ |
__kernel |
| Thread ID | threadIdx.x + blockIdx.x * blockDim.x |
Same | get_global_id(0) |
| Shared memory | __shared__ |
__shared__ |
__local |
| Launch | kernel<<<grid, block>>>() |
Same | clEnqueueNDRangeKernel() |
| Vendor | NVIDIA | AMD + NVIDIA | All vendors |
| Performance | Best on NVIDIA | Best on AMD | ~80-90% of native |
| Host API | C/C++ | C/C++ | C (verbose) |
OpenCL's weakness: The host API is extremely verbose — creating contexts, command queues, building programs, setting arguments requires dozens of lines of boilerplate. This is what SYCL solves.
SYCL — Modern C++ for Heterogeneous Computing¶
What SYCL is: A Khronos standard that provides a single-source C++ programming model for heterogeneous computing. Write host and device code in the same C++ file, using standard C++ features (lambdas, templates, RAII).
Why SYCL matters: - Single source: No separate kernel files or string-based kernels - Modern C++: Lambdas, templates, type safety — unlike OpenCL's C99 kernels - Multi-target: Compile the same code for CPU, NVIDIA GPU, AMD GPU, Intel GPU, or FPGA - Growing ecosystem: Intel oneAPI (DPC++), AdaptiveCpp (hipSYCL), Codeplay ComputeCpp
SYCL example:
#include <sycl/sycl.hpp>
int main() {
sycl::queue q; // Auto-selects best device (GPU if available)
float* a = sycl::malloc_shared<float>(N, q);
float* b = sycl::malloc_shared<float>(N, q);
float* c = sycl::malloc_shared<float>(N, q);
// Initialize a, b on host...
q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
c[i] = a[i] + b[i];
}).wait();
sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
}
Compare with the equivalent CUDA code — SYCL is significantly less boilerplate while being portable across vendors.
CUDA vs HIP vs SYCL side-by-side:
| Feature | CUDA | HIP | SYCL |
|---|---|---|---|
| Language | CUDA C++ | HIP C++ | Standard C++ |
| Source model | Separate .cu files |
Separate .hip.cpp files |
Single source |
| Memory model | Manual cudaMalloc/Memcpy |
Manual hipMalloc/Memcpy |
Unified shared memory (USM) or buffers |
| Kernel syntax | <<<grid, block>>> |
Same | parallel_for lambda |
| Targets | NVIDIA only | AMD + NVIDIA | CPU, NVIDIA, AMD, Intel, FPGA |
| Maturity | 17 years, dominant | 7 years, growing | 5 years, emerging |
| Best for | Peak NVIDIA perf | AMD perf + portability | Maximum portability |
SYCL implementations:
| Implementation | Vendor | Targets |
|---|---|---|
| Intel oneAPI (DPC++) | Intel | Intel GPU, CPU, FPGA, NVIDIA (plugin), AMD (plugin) |
| AdaptiveCpp (hipSYCL) | Open source | NVIDIA (CUDA), AMD (ROCm), Intel, CPU |
| Codeplay ComputeCpp | Codeplay | Multiple backends |
Connection to the stack: - L2 (Compiler): SYCL compilers use LLVM/SPIR-V — the same IR infrastructure as MLIR (Phase 4C) - L3 (Runtime): SYCL runtimes manage device selection, memory, and scheduling — same concepts as XRT (Phase 4A) and CUDA runtime (Phase 4B) - Future direction: As AI inference moves beyond NVIDIA-only (AMD, Intel, custom NPUs), portable APIs like SYCL become essential
Projects:
1. Port CUDA vector add to OpenCL — experience the verbose host API. Run on CPU and GPU.
2. Same kernel in SYCL — rewrite the same vector add using SYCL parallel_for. Compare code size and readability.
3. SYCL matmul — implement tiled matrix multiplication in SYCL with local memory (sycl::local_accessor). Run on CPU and GPU, compare performance with your CUDA tiled matmul.
4. Multi-backend test — compile the same SYCL kernel for CPU (OpenMP backend), NVIDIA GPU (CUDA backend via AdaptiveCpp), and Intel GPU (Level Zero). Compare performance across all three.
Part 3 — How This Connects to the Rest of the Roadmap¶
| What you learn here | Where it leads |
|---|---|
| SIMD / vectorization | Phase 4C: MLIR vector dialect, compiler auto-vectorization |
| OpenMP / multi-core | Phase 2: FreeRTOS multi-core on Jetson SPE, Zynq PS |
| CUDA kernels | Phase 4B: Jetson inference, Phase 4C: Triton/CUTLASS kernel engineering |
| ROCm / HIP | Phase 5A: AMD GPU infrastructure (MI300X), portable kernel engineering |
| OpenCL | Phase 4A: Xilinx Vitis FPGA host API, embedded GPU compute |
| SYCL | Future portable compute — CPU, GPU, FPGA, custom NPU from one source |
| Memory hierarchy thinking | Phase 4A: FPGA BRAM/URAM tiling, Phase 5F: scratchpad design for AI chip |
| Tiled matmul | Phase 4A: HLS matmul accelerator, Phase 5F: systolic array architecture |
| GPU architecture model | Phase 5B: CUDA-X libraries, Phase 5F: design something better |
The big picture: - Phase 1 §4 teaches you to program parallel hardware - Phase 4 teaches you to optimize and deploy on real parallel hardware - Phase 5F teaches you to design new parallel hardware
You're learning the workload first. Then you'll build the machine that runs it.
Key Takeaways¶
- Parallelism exists at multiple levels — instruction (SIMD), thread (OpenMP), massive (CUDA/HIP)
- CPU vs GPU is latency vs throughput — different tools for different jobs
- Memory is the real bottleneck — not compute. This is true for CUDA kernels, FPGA accelerators, and custom AI chips.
- CUDA is the industry standard for GPU compute and AI inference — learn it first
- HIP makes your CUDA skills portable — same code runs on AMD and NVIDIA
- Tiling is the universal optimization — from shared memory in CUDA to systolic arrays in silicon
- Future = heterogeneous + portable — SYCL targets CPU, GPU, FPGA, and custom accelerators from one codebase
Next¶
→ Phase 3 — Neural Networks — the workloads that make all this parallelism necessary.