CUDA and SIMT¶
Part of Phase 1 section 4 — C++ and Parallel Computing.
Goal: Master NVIDIA's SIMT programming model — GPU architecture, thread hierarchy, memory spaces, kernel writing, and performance optimization — so you can read and write production CUDA code and reason about hardware behavior.
Official reference: CUDA C++ Programming Guide
1. Why GPU? — The Transistor Budget¶
CPUs dedicate most transistors to control logic and cache (latency minimization). GPUs dedicate most transistors to arithmetic units (throughput maximization).

Source: NVIDIA CUDA Programming Guide
| | CPU | GPU |
|---|---|---|
| Core count | 8–128 | 1,000s–10,000s |
| Clock speed | 3–5 GHz | 1.5–2.5 GHz |
| Design goal | Low latency (single thread) | High throughput (many threads) |
| Cache | Large (MB per core) | Small (KB per SM) |
| Control flow | Complex OOO, branch prediction | Simple, in-order per lane |
| Memory BW | ~50–200 GB/s | ~900–3,500 GB/s (HBM) |
The key insight: GPUs hide memory latency by switching to other warps while waiting, not by predicting and prefetching. This requires thousands of in-flight threads.
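A rough Little's-law sketch of why so many in-flight warps are needed (the 600-cycle latency and the instructions-between-loads figures below are illustrative assumptions, not spec values):

```cpp
#include <cassert>

// Little's law: concurrency = latency x issue rate. If each warp issues
// a few independent instructions and then stalls on a ~600-cycle load,
// the scheduler needs many other resident warps to stay busy.
// All numbers are illustrative assumptions.
int warps_needed(int latency_cycles, int independent_instrs_per_warp) {
    // Ceiling division: how many stalled warps' worth of work fills the gap
    return (latency_cycles + independent_instrs_per_warp - 1)
           / independent_instrs_per_warp;
}
```

With ~600 cycles of memory latency and ~10 independent instructions per warp between loads, a scheduler needs on the order of 60 resident warps to avoid idling.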
2. Heterogeneous Programming Model¶
A CUDA application always starts on the CPU (host). The host copies data to the GPU (device), launches kernels, and waits for results.

Source: NVIDIA CUDA Programming Guide
Host (CPU)                      Device (GPU)
──────────                      ────────────
Serial code runs here           Parallel kernels run here
Allocates device memory         Executes kernel with thousands of threads
Copies data H→D                 Each thread runs the same function
Launches kernel ──────────────► Grid of blocks of threads
Waits for completion            Results written to device memory
Copies results D→H ◄──────────  Done
Key rules:
- CPU and GPU have separate DRAM — data must be explicitly transferred (unless using Unified Memory)
- Kernel launches are asynchronous — the CPU continues while the GPU works
- cudaDeviceSynchronize() blocks the CPU until all GPU work completes
3. GPU Hardware Architecture¶
3.1 The Full Chip¶
A GPU die is organized as a hierarchy:
GPU Die
└── GPC (Graphics Processing Cluster) × N
└── TPC (Texture Processing Cluster) × M
└── SM (Streaming Multiprocessor) × 2
├── Warp Schedulers × 4
├── Dispatch Units × 8
├── CUDA Cores (FP32) × 128
├── Tensor Cores × 4
├── Register File (256 KB)
├── Shared Memory / L1 Cache (up to 228 KB)
└── Load/Store Units, Special Function Units

H100 full GPU — 144 SMs organized into GPCs. Source: NVIDIA Hopper Architecture
3.2 The Streaming Multiprocessor (SM)¶
The SM is the fundamental execution unit. All threads in one block run on one SM. The SM executes warps — groups of 32 threads.

H100 SM: 4 warp schedulers, 128 FP32 CUDA cores, 4 Tensor Cores (4th gen), 256 KB register file, up to 228 KB shared mem / L1. Source: NVIDIA
H100 SM internals:
- 4 warp schedulers — each selects one ready warp per cycle to issue
- 8 dispatch units — issue 2 instructions per warp scheduler per cycle
- 128 FP32 CUDA cores — execute floating-point operations
- 64 INT32 units — integer arithmetic (can run simultaneously with FP32)
- 4 Tensor Cores (4th gen) — matrix multiply-accumulate for ML workloads
- 256 KB register file — fastest memory, compiler-allocated per thread
- 228 KB unified data cache — partitioned into shared memory + L1 (configurable)
3.3 GPU Architecture Generations¶
| Generation | Architecture | Compute Cap | Key Feature | Example GPU |
|---|---|---|---|---|
| 2017 | Volta | 7.0 | Tensor Cores (1st gen), Independent Thread Scheduling | V100 |
| 2018 | Turing | 7.5 | RT Cores, INT8/INT4 Tensor Cores | RTX 2080 |
| 2020 | Ampere | 8.0 / 8.6 | 3rd gen Tensor Cores, TF32, BF16, MIG, sparsity | A100, RTX 3090 |
| 2022 | Hopper | 9.0 | 4th gen Tensor Cores, TMA, Thread Block Clusters, FP8, Transformer Engine | H100 |
| 2024 | Blackwell | 10.x | 5th gen Tensor Cores, FP4, NVLink 5, confidential compute | B100, B200 |
4. Thread Hierarchy¶
CUDA organizes threads in a 3-level hierarchy: Grid → Block → Thread, with warps as an implicit hardware grouping of 32 threads.

Source: NVIDIA CUDA Programming Guide
4.1 Grid, Block, Warp, Thread¶
Grid (one per kernel launch)
├── Block (0,0) ── 1024 threads max ── runs on one SM
│ ├── Warp 0 ── threads 0-31
│ ├── Warp 1 ── threads 32-63
│ └── Warp 31 ── threads 992-1023
├── Block (1,0)
├── Block (0,1)
└── ...

Thread blocks are dispatched to SMs. Multiple blocks can reside on one SM simultaneously (limited by registers, shared memory, and SM capacity). Source: NVIDIA
Built-in variables every kernel sees:
| Variable | Type | Meaning |
|---|---|---|
| threadIdx.x/y/z | uint3 | Thread index within its block |
| blockIdx.x/y/z | uint3 | Block index within the grid |
| blockDim.x/y/z | dim3 | Block dimensions (threads per block) |
| gridDim.x/y/z | dim3 | Grid dimensions (blocks per grid) |
| warpSize | int | Always 32 |
Compute a global 1D thread ID:
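In a kernel this is the one-liner `int i = blockIdx.x * blockDim.x + threadIdx.x;`. The host-side function below models the same arithmetic so it can be checked without a GPU:

```cpp
#include <cassert>

// Host-side model of the kernel expression
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
int global_id_1d(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}
```

Thread 5 of block 3 with 256-thread blocks gets global index 3 * 256 + 5 = 773.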
Compute a global 2D thread ID:
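For 2D grids the kernel computes `x = blockIdx.x * blockDim.x + threadIdx.x`, `y = blockIdx.y * blockDim.y + threadIdx.y`, then flattens row-major. A host-side model of the same arithmetic:

```cpp
#include <cassert>

// Host-side model of the 2D kernel expressions
//   int x = blockIdx.x * blockDim.x + threadIdx.x;
//   int y = blockIdx.y * blockDim.y + threadIdx.y;
//   int idx = y * width + x;   // row-major flattening
int global_id_2d(int bx, int bdx, int tx,
                 int by, int bdy, int ty, int width) {
    int x = bx * bdx + tx;
    int y = by * bdy + ty;
    return y * width + x;
}
```

With 16×16 blocks, thread (1, 2) of block (3, 4) in a 1024-wide image lands at x = 49, y = 66, idx = 66 * 1024 + 49 = 67633.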
4.2 Constraints and Limits¶
Max threads per block: 1024
Max blocks per grid (x): 2,147,483,647
Max shared memory per block: 48 KB (default) → 228 KB (Hopper, if requested)
Max registers per thread: 255
Warp size: 32 (always)
Block size rule of thumb:
- Must be a multiple of 32 (warp size) — avoid partial warps
- 128, 256, or 512 are common choices
- Use cudaOccupancyMaxPotentialBlockSize() to find the optimal size programmatically
4.3 Thread Block Clusters (Compute Capability 9.0+ / Hopper)¶
Hopper adds a new level between grid and block: clusters.

Thread Block Clusters: multiple blocks scheduled simultaneously on the same GPC, sharing distributed shared memory. Source: NVIDIA
// Launch with clusters (Hopper, compute capability 9.0)
cudaLaunchConfig_t config = {};
config.gridDim = grid;
config.blockDim = block;
cudaLaunchAttribute attr;
attr.id = cudaLaunchAttributeClusterDimension;
attr.val.clusterDim.x = 2; // 2 blocks per cluster
attr.val.clusterDim.y = 1;
attr.val.clusterDim.z = 1;
config.attrs = &attr;
config.numAttrs = 1;
cudaLaunchKernelEx(&config, my_kernel, args...);
Threads in a cluster can access distributed shared memory — the shared memory of all blocks in the cluster — enabling much larger on-chip data exchange without going to global memory.
5. Writing Kernels¶
5.1 Function Qualifiers¶
// Runs on GPU, called from CPU (kernel)
__global__ void my_kernel(float* a, float* b, float* c, int N);
// Runs on GPU, called from GPU only
__device__ float helper(float x);
// Runs on CPU (default)
void host_function();
// Compiles for both CPU and GPU
__host__ __device__ float clamp(float x, float lo, float hi) {
return x < lo ? lo : (x > hi ? hi : x);
}
5.2 Launching a Kernel¶
// Syntax: kernel<<<grid_dim, block_dim, shared_mem_bytes, stream>>>(args)
int N = 1 << 20; // 1M elements
int block_size = 256;
int grid_size = (N + block_size - 1) / block_size; // ceil division
saxpy<<<grid_size, block_size>>>(a, b, c, N);
// 2D launch for matrix ops
dim3 block(16, 16); // 256 threads per block
dim3 grid((W + 15) / 16, (H + 15) / 16);
matmul<<<grid, block>>>(A, B, C, M, N, K);
5.3 Complete SAXPY Example¶
#include <cuda_runtime.h>
#include <cstdio>
// Kernel: c[i] = alpha * a[i] + b[i]
__global__ void saxpy(float alpha, const float* a, const float* b, float* c, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) // bounds check — last block may be partial
c[i] = alpha * a[i] + b[i];
}
int main() {
const int N = 1 << 20;
const size_t bytes = N * sizeof(float);
// Allocate host memory
float *h_a = new float[N], *h_b = new float[N], *h_c = new float[N];
for (int i = 0; i < N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }
// Allocate device memory
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
// Copy H → D
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
// Launch
int block = 256;
int grid = (N + block - 1) / block;
saxpy<<<grid, block>>>(2.0f, d_a, d_b, d_c, N);
// Copy D → H
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
// Verify
for (int i = 0; i < N; i++) {
if (h_c[i] != 4.0f) { printf("WRONG at %d\n", i); break; }
}
printf("OK: c[0] = %.1f\n", h_c[0]); // expected: 4.0
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
delete[] h_a; delete[] h_b; delete[] h_c;
}
5.4 Error Checking¶
// Macro to check any CUDA call
#define CUDA_CHECK(call) do { \
cudaError_t err = (call); \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d — %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(1); \
} \
} while(0)
// Usage
CUDA_CHECK(cudaMalloc(&d_a, bytes));
CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
// Check kernel errors (kernels don't return error codes)
my_kernel<<<grid, block>>>(args);
CUDA_CHECK(cudaGetLastError()); // check launch error
CUDA_CHECK(cudaDeviceSynchronize()); // wait + check execution error
Always check errors in development. In production, drop the extra cudaDeviceSynchronize() (it blocks the CPU) but keep cudaGetLastError().
6. Memory Spaces¶
Every variable in CUDA lives in a specific memory space. Understanding this is the most important skill for optimization.

Source: NVIDIA CUDA Programming Guide
6.1 Memory Space Summary¶
| Memory | Location | Scope | Lifetime | Latency | Bandwidth | Size |
|---|---|---|---|---|---|---|
| Register | On-chip (SM) | 1 thread | Kernel | 0 cycles | N/A | 256 KB/SM |
| Local | Off-chip (DRAM) | 1 thread | Kernel | ~600 cycles | ~same as global | Per thread |
| Shared | On-chip (SM) | All threads in block | Kernel | ~20–40 cycles | ~19 TB/s (H100) | Up to 228 KB/SM |
| L1 cache | On-chip (SM) | 1 SM | Automatic | ~20–40 cycles | Same as shared | Part of unified cache |
| L2 cache | On-chip (GPU) | All SMs | Automatic | ~200 cycles | ~TB/s | 50 MB (H100) |
| Global | Off-chip (HBM) | All threads | Application | ~600 cycles | ~3.35 TB/s (H100) | ~80 GB |
| Constant | Off-chip (cached) | All threads (read-only) | Application | ~20 cycles (cached) | — | 64 KB |
| Texture | Off-chip (cached) | All threads (read-only) | Application | ~20 cycles (cached) | — | Bound to global memory |
6.2 Register and Local Memory¶
__global__ void kernel() {
int x = 5; // register (fast, private per thread)
float arr[10]; // may spill to local memory if too large
// local memory = per-thread DRAM — very slow, avoid large on-stack arrays
}
Register spilling: If a thread uses too many registers, the compiler stores the overflow in slow local memory (off-chip DRAM). Detect with nvcc --ptxas-options=-v.
6.3 Shared Memory¶
Shared memory is the most important optimization tool in CUDA. All threads in a block share it, it's on-chip, and it's ~30× faster than global memory.
__global__ void use_shared(float* in, float* out, int N) {
__shared__ float tile[256]; // allocated at compile time
int tid = threadIdx.x;
int gid = blockIdx.x * blockDim.x + threadIdx.x;
// Load from global into shared
tile[tid] = (gid < N) ? in[gid] : 0.0f;
__syncthreads(); // ← CRITICAL: all threads must finish loading
// Now process from shared (fast)
tile[tid] = tile[tid] * 2.0f;
__syncthreads();
// Write back to global
if (gid < N) out[gid] = tile[tid];
}
Dynamic shared memory (size determined at launch):
__global__ void kernel(float* data) {
extern __shared__ float smem[]; // size determined at launch
smem[threadIdx.x] = data[...];
// ...
}
// Launch: third argument = shared memory bytes
kernel<<<grid, block, 256 * sizeof(float)>>>(data);
Request more than 48 KB per block:
// Opt-in required for > 48 KB dynamic shared memory (Volta and later)
cudaFuncSetAttribute(my_kernel,
cudaFuncAttributeMaxDynamicSharedMemorySize,
96 * 1024); // 96 KB
my_kernel<<<grid, block, 96 * 1024>>>(args);
6.4 Global Memory Allocation¶
float *d_data;
cudaMalloc(&d_data, N * sizeof(float)); // allocate
cudaFree(d_data); // free
// Transfers
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // H→D
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // D→H
cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice); // D→D
// Zero-initialize
cudaMemset(d_data, 0, bytes);
Pinned (page-locked) host memory — faster H↔D transfers:
float *h_data;
cudaMallocHost(&h_data, bytes); // allocate pinned host memory
// OR: cudaHostAlloc(&h_data, bytes, cudaHostAllocDefault);
// Use exactly like regular memory
// Transfer speed: ~2× faster than pageable memory
cudaFreeHost(h_data);
Unified Memory — single pointer accessible from both CPU and GPU:
float *data;
cudaMallocManaged(&data, bytes); // managed allocation
// Use on CPU
for (int i = 0; i < N; i++) data[i] = 1.0f;
// Use on GPU — CUDA runtime migrates pages automatically
kernel<<<grid, block>>>(data, N);
cudaDeviceSynchronize();
// Use on CPU again — migrates back
printf("%f\n", data[0]);
cudaFree(data);
6.5 Constant Memory¶
For read-only data shared by all threads — cached with broadcast (one read serves all 32 threads in a warp):
__constant__ float weights[1024];
__global__ void apply_weights(float* data, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
// All lanes in a warp read the SAME address → one broadcast read
// (lanes reading different constant addresses would serialize)
if (i < N) data[i] *= weights[blockIdx.x % 1024];
}
// Host-side: copy to constant memory
cudaMemcpyToSymbol(weights, h_weights, 1024 * sizeof(float));
7. Warp Execution and SIMT¶
7.1 SIMT — Single Instruction, Multiple Threads¶
A warp is 32 threads. The SM issues one instruction to all 32 lanes simultaneously. This is hardware-level SIMD, but each thread has its own registers and can follow its own control flow.

Active vs inactive lanes in a warp. Inactive lanes (diverged threads) are masked off — they consume time but produce no output. Source: NVIDIA
7.2 Warp Divergence¶
When threads in a warp take different branches, they execute both paths serially — inactive lanes are masked off.
__global__ void divergent(float* data, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i % 2 == 0) // ← half the warp goes here
data[i] *= 2.0f;
else // ← other half goes here (serialized!)
data[i] += 1.0f;
}
// Result: warp takes TWO passes — both branches run, half masked each time
// 2× slower than a non-divergent version
No-divergence version (branchless):
__global__ void no_divergence(float* data, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
float even = data[i] * 2.0f;
float odd = data[i] + 1.0f;
data[i] = (i % 2 == 0) ? even : odd; // select, not branch
}
When divergence is unavoidable: if (i < N) bounds checks diverge only the last partial warp — acceptable.
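The replay cost can be modeled on the host: a warp re-executes a branch once per distinct path its lanes take. A small sketch (my own model for illustration, not a CUDA API):

```cpp
#include <cassert>
#include <set>

// Number of serialized passes a warp needs for a branch: one per
// distinct path taken by its 32 lanes (inactive lanes are masked off).
int warp_passes(const int (&path_per_lane)[32]) {
    std::set<int> paths;
    for (int lane = 0; lane < 32; lane++)
        paths.insert(path_per_lane[lane]);
    return static_cast<int>(paths.size());
}
```

The i % 2 pattern above yields 2 passes; a warp-uniform condition (e.g. branching on blockIdx.x) yields 1.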
7.3 Occupancy — Hiding Latency¶
The GPU hides memory latency by switching to other resident warps while one warp waits. More resident warps = more latency hiding = higher throughput.
SM capacity: 2048 resident threads (H100) = 64 warps
If block size = 256 → 8 warps/block → 8 blocks resident per SM = 64 warps = 100% occupancy
If block size = 32 → 1 warp/block → capped at the 32-blocks-per-SM limit → only 32 warps = 50% occupancy
If block size = 1024 → 32 warps/block → 2 blocks = 64 warps = 100% occupancy
Occupancy is limited by:
1. Registers per thread — more registers = fewer threads fit
2. Shared memory per block — more shared mem = fewer blocks fit
3. Max threads per SM — hard limit
Find optimal config programmatically:
int block_size, min_grid;
cudaOccupancyMaxPotentialBlockSize(&min_grid, &block_size, my_kernel, 0, 0);
printf("Optimal block size: %d\n", block_size);
// Query actual occupancy
int active_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&active_blocks, my_kernel, block_size, 0);
cudaDeviceProp props;
cudaGetDeviceProperties(&props, 0); // needed for maxThreadsPerMultiProcessor
float occupancy = (active_blocks * block_size) / (float)props.maxThreadsPerMultiProcessor;
printf("Occupancy: %.1f%%\n", occupancy * 100);
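The three limits interact; a back-of-envelope model shows which resource caps residency. The per-SM limits below are assumptions modeled on H100-class hardware (64K registers, 228 KB shared, 2048 threads, 32 blocks per SM):

```cpp
#include <algorithm>
#include <cassert>

// Resident blocks per SM = min over each resource constraint.
// Per-SM limits are assumptions modeled on H100-class hardware.
int resident_blocks(int threads_per_block, int regs_per_thread,
                    int smem_per_block) {
    const int regs_per_sm    = 64 * 1024;   // 32-bit registers
    const int smem_per_sm    = 228 * 1024;  // bytes
    const int threads_per_sm = 2048;
    const int blocks_per_sm  = 32;
    int by_regs    = regs_per_sm / (threads_per_block * regs_per_thread);
    int by_smem    = smem_per_block > 0 ? smem_per_sm / smem_per_block
                                        : blocks_per_sm;
    int by_threads = threads_per_sm / threads_per_block;
    return std::min({by_regs, by_smem, by_threads, blocks_per_sm});
}
```

256 threads/block at 64 registers/thread and 16 KB shared memory: registers cap you at 4 blocks (1024 threads, 50% occupancy); halving register use to 32 restores 8 blocks (100%).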
8. Memory Coalescing¶
Global memory is accessed in 128-byte cache line chunks. If 32 threads in a warp access 32 contiguous floats (128 bytes), that's one memory transaction. If they access scattered addresses, that's up to 32 transactions — 32× slower.

32 threads access 32 consecutive 4-byte floats → 1 transaction (128 bytes). Source: NVIDIA

Scattered accesses → multiple transactions, wasted bandwidth. Source: NVIDIA
Coalesced (good):
// Thread i accesses a[i] — consecutive, perfectly coalesced
__global__ void coalesced(float* a, float* b, float* c, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) c[i] = a[i] + b[i];
}
Strided (bad):
// Thread i accesses a[i * stride] — stride = 4 means 1/4 cache lines used
__global__ void strided(float* a, float* b, float* c, int stride, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i * stride < N) c[i] = a[i * stride] + b[i * stride];
}
// With stride 4, each 128-byte line holds 32 floats but only 8 are used
// → 4× the traffic; at stride ≥ 32, only 1 of 32 is used → 97% wasted
Fix strided access with transpose or SoA layout (see AoS vs SoA in SIMD guide).
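The transaction count can be modeled by counting the distinct 128-byte lines a warp touches. This is a simplification (real hardware also sectors lines into 32-byte segments), but it captures the stride effect:

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Count the 128-byte lines touched by a warp of 32 four-byte loads
// at a given element stride — a simplified coalescing model.
int transactions_for_stride(int stride) {
    std::set<std::uint64_t> lines;
    for (int lane = 0; lane < 32; lane++) {
        std::uint64_t byte_addr = std::uint64_t(lane) * stride * 4;
        lines.insert(byte_addr / 128); // which cache line this load hits
    }
    return static_cast<int>(lines.size());
}
```

Stride 1 → 1 transaction per warp; stride 4 → 4 transactions; stride ≥ 32 → 32 transactions (one full line fetched per lane).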
9. Shared Memory and Bank Conflicts¶
Shared memory is divided into 32 banks, each 4 bytes wide. Bank k holds addresses k, k+32, k+64, etc.
If multiple threads in a warp access different addresses in the same bank, that's a bank conflict — serialized. If all threads access the same address in one bank, it's a broadcast — no conflict.

Source: NVIDIA CUDA Programming Guide
// No conflict: thread i accesses bank i
__shared__ float smem[64];
float a = smem[threadIdx.x]; // OK: thread 0→bank 0, thread 1→bank 1, ...
// 2-way bank conflict: threads 0 and 16 both hit bank 0
float b = smem[threadIdx.x * 2]; // BAD: stride 2 → 2-way conflict
// Note: +1 padding fixes COLUMN accesses in 2D tiles (see the transpose
// below) — it does not fix a strided 1D access like this one
// Broadcast: all threads read the same element
float c = smem[0]; // OK — hardware broadcasts, no conflict
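The bank math can be checked on the host: bank = (word index) mod 32, and the conflict degree is the largest number of distinct addresses mapping to one bank. A sketch (my own model for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>

// Worst-case conflict degree for a warp accessing 4-byte shared-memory
// words: lanes hitting distinct addresses in one bank serialize;
// lanes hitting the SAME address broadcast for free.
int conflict_degree(const int (&word_per_lane)[32]) {
    std::map<int, std::set<int>> bank_to_words; // bank -> distinct words
    for (int lane = 0; lane < 32; lane++) {
        int w = word_per_lane[lane];
        bank_to_words[w % 32].insert(w);
    }
    int worst = 1;
    for (const auto& [bank, words] : bank_to_words)
        worst = std::max(worst, static_cast<int>(words.size()));
    return worst;
}
```

Stride 1 → degree 1 (conflict-free); stride 2 → degree 2 (lanes 0 and 16 share bank 0 at different addresses); all lanes reading word 0 → degree 1 (broadcast).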
Matrix transpose (classic bank conflict example):
#define TILE 32
__global__ void transpose(float* out, const float* in, int width, int height) {
__shared__ float tile[TILE][TILE + 1]; // +1 padding avoids conflicts
int x = blockIdx.x * TILE + threadIdx.x;
int y = blockIdx.y * TILE + threadIdx.y;
if (x < width && y < height)
tile[threadIdx.y][threadIdx.x] = in[y * width + x]; // coalesced read
__syncthreads();
x = blockIdx.y * TILE + threadIdx.x;
y = blockIdx.x * TILE + threadIdx.y;
if (x < height && y < width)
out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
10. Synchronization¶
10.1 Within a Block¶
__syncthreads(); // barrier: all threads in block reach this before any continues
__syncwarp(); // barrier: all threads in a warp (lighter, within one warp)
__syncwarp(mask); // barrier for a subset of warp lanes (mask = bitfield)
__syncthreads() in conditional code is undefined behavior if not all threads reach it:
// WRONG: some threads may not reach __syncthreads()
if (threadIdx.x < 16) {
    smem[threadIdx.x] = data[threadIdx.x];
    __syncthreads(); // ← UB: threads 16-31 never reach this
}
// CORRECT: __syncthreads() outside the conditional
smem[threadIdx.x] = (threadIdx.x < 16) ? data[threadIdx.x] : 0;
__syncthreads(); // all threads hit this
10.2 Atomic Operations¶
// Atomic add (global or shared memory)
atomicAdd(&counter, 1);
atomicAdd(&shared_sum, val);
// Other atomics
atomicSub(&counter, 1);
atomicMax(&max_val, val);
atomicMin(&min_val, val);
atomicCAS(&lock, 0, 1); // compare-and-swap: if *lock==0, set to 1, return old
// Warp-level reduction intrinsics (compute capability 8.0+, integer types only)
unsigned warp_sum = __reduce_add_sync(0xFFFFFFFF, uval);
unsigned warp_max = __reduce_max_sync(0xFFFFFFFF, uval);
// For float reductions, use warp shuffles (section 10.3)
10.3 Warp Shuffle — Pass Data Between Threads Without Shared Memory¶
// Each thread gets the value from thread src_lane in the same warp
float val = __shfl_sync(0xFFFFFFFF, local_val, src_lane);
// Shift: thread i receives from thread i+delta
float val = __shfl_down_sync(0xFFFFFFFF, local_val, delta);
float val = __shfl_up_sync(0xFFFFFFFF, local_val, delta);
// XOR exchange (butterfly for reductions)
float val = __shfl_xor_sync(0xFFFFFFFF, local_val, mask);
// Warp reduction using shuffles (no shared memory, no sync needed)
__device__ float warp_reduce_sum(float val) {
for (int offset = warpSize / 2; offset > 0; offset >>= 1)
val += __shfl_down_sync(0xFFFFFFFF, val, offset);
return val; // thread 0 holds the sum
}
11. Streams and Asynchronous Execution¶
Streams allow overlapping GPU computation with host execution and with data transfers.
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// Overlapping compute and copy
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream1>>>(d_a, d_out, N);
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream1);
// Different work on stream2, runs concurrently with stream1
other_kernel<<<grid, block, 0, stream2>>>(d_b, d_out2, N);
// Synchronize
cudaStreamSynchronize(stream1); // wait for stream1
cudaStreamSynchronize(stream2); // wait for stream2
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
Stream timeline (overlapping copy + compute):
Stream 1: [copy H→D] [kernel] [copy D→H]
Stream 2: [copy H→D] [kernel] [copy D→H]
─────────────────────────────────────────► time
Events for precise timing and cross-stream sync:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, stream1);
kernel<<<grid, block, 0, stream1>>>(args);
cudaEventRecord(stop, stream1);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel time: %.3f ms\n", ms);
cudaEventDestroy(start);
cudaEventDestroy(stop);
12. CUDA Graphs¶
For workloads that repeat the same sequence of kernels and transfers, CUDA graphs eliminate per-launch CPU overhead by capturing and replaying the entire execution graph.
// Step 1: capture a sequence of operations
cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel_a<<<grid, block, 0, stream>>>(d_a, N);
kernel_b<<<grid, block, 0, stream>>>(d_a, d_b, N);
cudaMemcpyAsync(h_out, d_b, bytes, cudaMemcpyDeviceToHost, stream);
cudaStreamEndCapture(stream, &graph);
// Step 2: instantiate (compile the graph)
cudaGraphExec_t graph_exec;
cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);
// Step 3: launch repeatedly (minimal CPU overhead)
for (int iter = 0; iter < 1000; iter++) {
cudaGraphLaunch(graph_exec, stream);
cudaStreamSynchronize(stream);
}
// Cleanup
cudaGraphExecDestroy(graph_exec);
cudaGraphDestroy(graph);
When to use graphs:
- Same kernel sequence repeated many times (training loops, inference batches)
- Many small kernels where launch overhead dominates
- CPU overhead reduction from ~5 µs/launch → ~1 µs/graph replay
13. Parallel Reduction — Complete Example¶
Reduction is the canonical shared memory + sync pattern.
// Parallel sum reduction — each block reduces its chunk to one value
__global__ void reduce_sum(const float* in, float* partial, int N) {
extern __shared__ float smem[];
int tid = threadIdx.x;
int gid = blockIdx.x * blockDim.x * 2 + threadIdx.x; // × 2: each thread loads 2
// Load two elements per thread
float val = 0.0f;
if (gid < N) val += in[gid];
if (gid + blockDim.x < N) val += in[gid + blockDim.x];
smem[tid] = val;
__syncthreads();
// Tree reduction in shared memory
for (int stride = blockDim.x / 2; stride > 32; stride >>= 1) {
if (tid < stride)
smem[tid] += smem[tid + stride];
__syncthreads();
}
// Final warp reduction with shuffles — the old volatile-pointer idiom
// relied on warps running in lockstep, which Volta's independent
// thread scheduling no longer guarantees
if (tid < 32) {
    float v = smem[tid] + smem[tid + 32]; // requires blockDim.x >= 64
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xFFFFFFFF, v, offset);
    // Lane 0 writes this block's partial sum
    if (tid == 0) partial[blockIdx.x] = v;
}
}
// Host: launch reduce_sum, then sum the partial results (needs <vector> and <numeric>)
float gpu_sum(const float* d_in, int N) {
int block = 256;
int grid = (N + block * 2 - 1) / (block * 2);
float *d_partial;
cudaMalloc(&d_partial, grid * sizeof(float));
reduce_sum<<<grid, block, block * sizeof(float)>>>(d_in, d_partial, N);
// Recursively reduce if needed, or just copy partial to host and sum there
std::vector<float> h_partial(grid);
cudaMemcpy(h_partial.data(), d_partial, grid * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_partial);
return std::accumulate(h_partial.begin(), h_partial.end(), 0.0f);
}
14. Tiled Matrix Multiply — Shared Memory Optimization¶
#define TILE_SIZE 16
__global__ void tiled_matmul(
const float* A, const float* B, float* C,
int M, int N, int K) // C[M×N] = A[M×K] * B[K×N]
{
__shared__ float As[TILE_SIZE][TILE_SIZE];
__shared__ float Bs[TILE_SIZE][TILE_SIZE];
int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
float sum = 0.0f;
// Iterate over tiles of K
for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
// Cooperative load: each thread loads one element
int a_col = t * TILE_SIZE + threadIdx.x;
int b_row = t * TILE_SIZE + threadIdx.y;
As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
__syncthreads();
// Compute partial dot product for this tile
for (int k = 0; k < TILE_SIZE; k++)
sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
__syncthreads();
}
if (row < M && col < N)
C[row * N + col] = sum;
}
Why tiling works:
Naive matmul:
Each C[i][j] loads K elements from A (row i) and K from B (col j)
Total global loads: M*N*2K = O(MNK) → memory bound for large K
Tiled matmul with TILE=16:
Each 16×16 block of C cooperatively loads one 16×16 tile from A and B
Each load reused 16 times within the tile
Memory traffic reduced by ~TILE_SIZE = 16×
Compute:memory ratio increased → compute bound (much faster)
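The traffic arithmetic above can be sketched directly (counts are idealized — caches and partial tiles ignored):

```cpp
#include <cassert>

// Idealized global-memory float loads for C[MxN] = A[MxK] * B[KxN].
// Naive: every C element streams a full row of A and column of B.
long long naive_loads(long long M, long long N, long long K) {
    return M * N * 2 * K;
}
// Tiled: each loaded tile element is reused TILE times from shared memory.
long long tiled_loads(long long M, long long N, long long K, long long tile) {
    return M * N * 2 * K / tile;
}
```

For 1024³ matrices with TILE = 16, global traffic drops 16× (from ~2.1 G loads to ~134 M).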
15. Tensor Cores¶
Tensor Cores are dedicated matrix-multiply-accumulate units introduced in Volta (CC 7.0). They operate on small matrix tiles in a single instruction.
#include <mma.h>
using namespace nvcuda::wmma;
// Tensor Core: 16×16×16 GEMM (FP16 inputs, FP32 accumulator)
// All wmma operations are warp-wide collectives — launch this sketch
// with one warp (32 threads) per block
__global__ void tensor_core_matmul(
    const half* A, const half* B, float* C, int M, int N, int K)
{
fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
fragment<accumulator, 16, 16, 16, float> c_frag;
fill_fragment(c_frag, 0.0f);
for (int k = 0; k < K; k += 16) {
load_matrix_sync(a_frag, A + blockIdx.y * 16 * K + k, K);
load_matrix_sync(b_frag, B + k * N + blockIdx.x * 16, N);
mma_sync(c_frag, a_frag, b_frag, c_frag); // D = A*B + C
}
store_matrix_sync(C + blockIdx.y * 16 * N + blockIdx.x * 16, c_frag, N, mem_row_major);
}
Tensor Core throughput vs CUDA cores (H100 SXM, whole GPU — FP16-and-narrower Tensor Core figures include 2:4 structured sparsity; dense rates are half):
| Precision | CUDA Cores | Tensor Cores |
|---|---|---|
| FP64 | 34 TFLOPS | 67 TFLOPS |
| TF32 | — | 989 TFLOPS |
| FP16/BF16 | 134 TFLOPS | 1,979 TFLOPS |
| FP8 | — | 3,958 TFLOPS |
| INT8 | 268 TOPS | 3,958 TOPS |
In practice: use cuBLAS or cuDNN — they use Tensor Cores automatically when shapes are multiples of 16.
16. Dynamic Parallelism¶
Kernels can launch other kernels from the GPU (no CPU round-trip required):
__global__ void child_kernel(float* data, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) data[i] *= 2.0f;
}
__global__ void parent_kernel(float* data, int N) {
// Each block launches its own child kernel
int chunk = N / gridDim.x;
int offset = blockIdx.x * chunk;
child_kernel<<<1, chunk>>>(data + offset, chunk); // chunk must be ≤ 1024 here
// Note: device-side cudaDeviceSynchronize() was deprecated in CUDA 11.6
// and removed in CUDA 12 — order child work with tail launches
// (cudaStreamTailLaunch) instead
}
Requires compute capability 3.5+. Each device-side launch adds latency. Best for irregular recursive problems (trees, AMR meshes).
17. Cooperative Groups¶
Cooperative groups let you express sync and reduction at any granularity — warp, block, multi-block, or grid.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void flexible_reduction(float* data, float* out, int N) {
auto block = cg::this_thread_block(); // all threads in this block
auto warp = cg::tiled_partition<32>(block); // this warp
auto tile = cg::tiled_partition<4>(block); // group of 4 threads (unused below, shown for the API)
float val = data[block.thread_rank()];
// Warp reduce
for (int i = warp.size() / 2; i > 0; i >>= 1)
val += warp.shfl_down(val, i);
// Write warp result
if (warp.thread_rank() == 0)
atomicAdd(out, val);
}
// Grid-wide sync (requires cudaLaunchCooperativeKernel)
__global__ void grid_sync_kernel(float* data) {
auto grid = cg::this_grid();
// ... do phase 1 ...
grid.sync(); // all threads in grid synchronize
// ... do phase 2 ...
}
18. Automatic Scalability¶
CUDA programs scale automatically across different GPU sizes. The same grid runs on a GPU with 4 SMs or 144 SMs — the runtime schedules blocks to available SMs.

The same program scales automatically: 2 SMs run 2 blocks at a time, 4 SMs run 4 blocks at a time. Source: NVIDIA
This is why you write for maximum parallelism and let the hardware decide — don't hardcode grid sizes to a specific GPU.
19. Performance Optimization Checklist¶
Step 1 — Profile First¶
# NVIDIA Nsight Systems (timeline, CPU/GPU overlap)
nsys profile --stats=true ./my_program
# NVIDIA Nsight Compute (kernel-level metrics)
ncu --set full ./my_program
Step 2 — Roofline Analysis¶
Every kernel is either compute-bound or memory-bound:
Arithmetic Intensity (AI) = FLOPs / bytes accessed
Performance │           Tensor Core roof (3,958 TFLOPS)
(FLOP/s)    │         ╔═══════════════════════════════════╗
            │         ║ memory-bound    │ compute-bound   ║
            │         ║ AI < ridge      │ AI > ridge      ║
            │         ╚═════════════════╧═════════════════╝
            └─────────────────┴─────────────────────► AI (FLOP/byte)
                     ridge = Peak FLOPS / Memory BW
H100 ridge at the FP8 Tensor Core roof: 3,958 TFLOPS / 3.35 TB/s ≈ 1,180 FLOP/byte; at the FP32 roof (67 TFLOPS) it is ≈ 20 FLOP/byte.
If your kernel's AI sits below the relevant ridge → it's memory-bound → focus on coalescing and reuse.
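The ridge arithmetic generalizes to any roof; the figures in this usage example are the H100-class numbers used throughout this section:

```cpp
#include <cassert>
#include <cmath>

// Ridge point in FLOP/byte = peak FLOP/s divided by memory bandwidth.
// A kernel whose arithmetic intensity is below the ridge is memory-bound.
double ridge_flop_per_byte(double peak_tflops, double mem_bw_tbs) {
    return peak_tflops / mem_bw_tbs; // TFLOP/s over TB/s cancels to FLOP/byte
}
```

At the FP32 roof (~67 TFLOPS, ~3.35 TB/s) the ridge is ~20 FLOP/byte. SAXPY does 2 FLOPs per 12 bytes moved (AI ≈ 0.17), far below the ridge → firmly memory-bound, no matter how you tune the arithmetic.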
Step 3 — Common Optimizations¶
| Issue | Symptom | Fix |
|---|---|---|
| Low occupancy | Kernel uses many registers | Reduce register pressure; use __launch_bounds__ |
| Uncoalesced access | High DRAM traffic in Nsight | Restructure to SoA; transpose before processing |
| Bank conflicts | Shared mem replay in Nsight | Pad shared arrays (+1 column) |
| Warp divergence | Low warp execution efficiency | Branchless arithmetic; sort inputs first |
| Too many atomics | High contention | Use per-warp/per-block local accumulators first |
| Launch overhead | Kernel takes < 50 µs but lots of launches | Use CUDA Graphs |
| CPU-GPU idle time | GPU idle in timeline | Async transfers + streams |
Step 4 — Launch Bounds¶
// Tell the compiler max threads per block and min blocks per SM
// Helps compiler tune register allocation for better occupancy
__global__ void __launch_bounds__(256, 4) // max 256 threads/block, at least 4 blocks/SM
my_kernel(float* data, int N) { ... }
20. Compute Capability Quick Reference¶
| Arch | CC | GPU Examples | FP16 TC | BF16 | FP8 | TMA | Clusters |
|---|---|---|---|---|---|---|---|
| Volta | 7.0 | V100 | ✓ | — | — | — | — |
| Turing | 7.5 | RTX 2080, T4 | ✓ | — | — | — | — |
| Ampere | 8.0 | A100 | ✓ | ✓ | — | — | — |
| Ampere | 8.6 | RTX 3090, A10 | ✓ | ✓ | — | — | — |
| Hopper | 9.0 | H100 | ✓ | ✓ | ✓ | ✓ | ✓ |
| Blackwell | 10.x | B100, B200 | ✓ | ✓ | ✓ | ✓ | ✓ |
SM resource limits (selected):
| Resource | Volta 7.0 | Ampere 8.0 | Hopper 9.0 |
|---|---|---|---|
| Max warps/SM | 64 | 64 | 64 |
| Max threads/SM | 2048 | 2048 | 2048 |
| Max blocks/SM | 32 | 32 | 32 |
| Registers/SM | 64K | 64K | 64K |
| Shared mem/SM | 96 KB | 164 KB | 228 KB |
| L2 cache | 6 MB | 40 MB | 50 MB |
| FP64 cores/SM | 32 | 32 | 64 |
| FP32 cores/SM | 64 | 64 | 128 |
21. Suggested Projects (in order)¶
| # | Project | Key Skills |
|---|---|---|
| 1 | Vector add / SAXPY | Launch config, indexing, bounds check, H↔D copy |
| 2 | Parallel reduction (sum) | Shared memory, __syncthreads, tree reduction, warp shuffles |
| 3 | 2D grayscale transform | 2D threadIdx/blockIdx, image stride |
| 4 | Matrix transpose | Shared memory, bank conflict padding |
| 5 | Naive matmul → tiled matmul | Tiling, shared mem reuse, compare vs cuBLAS |
| 6 | Histogram | Atomics, per-block private histogram, reduce-merge |
| 7 | Prefix scan | Blelloch scan, upsweep/downsweep in shared mem |
| 8 | Multi-stream pipeline | Streams, async copy, overlap compute + transfer |
Keep CPU golden references for every kernel. Validate before optimizing.
Resources¶
| Resource | What it covers |
|---|---|
| CUDA C++ Programming Guide | The authoritative reference — thread hierarchy, memory, streams, graphs |
| CUDA C++ Best Practices Guide | Optimization strategies, memory patterns, profiling |
| NVIDIA Nsight Compute | Kernel profiler — roofline, memory, warp stats |
| NVIDIA Nsight Systems | System profiler — CPU/GPU timeline, stream overlap |
| Hopper Architecture In-Depth | H100 SM, TMA, FP8, Thread Block Clusters |
| Ampere Architecture In-Depth | A100 SM, MIG, sparsity, TF32 |
| CUDA Samples | Reference implementations: reduction, matmul, scan |
| Programming Massively Parallel Processors (Hwu, Kirk, Hajj) | GPU architecture + CUDA optimization textbook |
Next¶
→ OpenCL and SYCL — portable compute across vendors.