CUDA and SIMT¶

Part of Phase 1 section 4 — C++ and Parallel Computing.

Goal: Master NVIDIA's SIMT programming model — GPU architecture, thread hierarchy, memory spaces, kernel writing, and performance optimization — so you can read and write production CUDA code and reason about hardware behavior.

Official reference: CUDA C++ Programming Guide

1. Why GPU? — The Transistor Budget¶

CPUs dedicate most transistors to control logic and cache (latency minimization). GPUs dedicate most transistors to arithmetic units (throughput maximization).

GPU vs CPU transistor allocation

Source: NVIDIA CUDA Programming Guide

	CPU	GPU
Core count	8–128	1,000s–10,000s
Clock speed	3–5 GHz	1.5–2.5 GHz
Design goal	Low latency (single thread)	High throughput (many threads)
Cache	Large (MB per core)	Small (KB per SM)
Control flow	Complex OOO, branch prediction	Simple, in-order per lane
Memory BW	~50–200 GB/s	~900–3,500 GB/s (HBM)

The key insight: GPUs hide memory latency by switching to other warps while waiting, not by predicting and prefetching. This requires thousands of in-flight threads.

2. Heterogeneous Programming Model¶

A CUDA application always starts on the CPU (host). The host copies data to the GPU (device), launches kernels, and waits for results.

Heterogeneous programming

Source: NVIDIA CUDA Programming Guide

Host (CPU)                        Device (GPU)
──────────                        ────────────
Serial code runs here             Parallel kernels run here
Allocates device memory           Executes kernel with thousands of threads
Copies data H→D                   Each thread runs the same function
Launches kernel ──────────────►   Grid of blocks of threads
Waits for completion              Results written to device memory
Copies results D→H ◄──────────   Done

Key rules: - CPU and GPU have separate DRAM — data must be explicitly transferred (unless using Unified Memory) - Kernel launches are asynchronous — the CPU continues while the GPU works - cudaDeviceSynchronize() blocks the CPU until all GPU work completes

3. GPU Hardware Architecture¶

3.1 The Full Chip¶

A GPU die is organized as a hierarchy:

GPU Die
└── GPC (Graphics Processing Cluster) × N
    └── TPC (Texture Processing Cluster) × M
        └── SM (Streaming Multiprocessor) × 2
            ├── Warp Schedulers × 4
            ├── Dispatch Units × 8
            ├── CUDA Cores (FP32) × 128
            ├── Tensor Cores × 4
            ├── Register File (256 KB)
            ├── Shared Memory / L1 Cache (up to 228 KB)
            └── Load/Store Units, Special Function Units

Full H100 GPU with 144 SMs

H100 full GPU — 144 SMs organized into GPCs. Source: NVIDIA Hopper Architecture

3.2 The Streaming Multiprocessor (SM)¶

The SM is the fundamental execution unit. All threads in one block run on one SM. The SM executes warps — groups of 32 threads.

H100 SM internal diagram

H100 SM: 4 warp schedulers, 128 FP32 CUDA cores, 4 Tensor Cores (4th gen), 256 KB register file, up to 228 KB shared mem / L1. Source: NVIDIA

H100 SM internals: - 4 warp schedulers — each selects one ready warp per cycle to issue - 8 dispatch units — issue 2 instructions per warp scheduler per cycle - 128 FP32 CUDA cores — execute floating-point operations - 64 INT32 units — integer arithmetic (can run simultaneously with FP32) - 4 Tensor Cores (4th gen) — matrix multiply-accumulate for ML workloads - 256 KB register file — fastest memory, compiler-allocated per thread - 228 KB unified data cache — partitioned into shared memory + L1 (configurable)

3.3 GPU Architecture Generations¶

Generation	Architecture	Compute Cap	Key Feature	Example GPU
2017	Volta	7.0	Tensor Cores (1st gen), Independent Thread Scheduling	V100
2018	Turing	7.5	RT Cores, INT8/INT4 Tensor Cores	RTX 2080
2020	Ampere	8.0 / 8.6	3rd gen Tensor Cores, TF32, BF16, MIG, sparsity	A100, RTX 3090
2022	Hopper	9.0	4th gen Tensor Cores, TMA, Thread Block Clusters, FP8, Transformer Engine	H100
2024	Blackwell	10.x	5th gen Tensor Cores, FP4, NVLink 5, confidential compute	B100, B200

4. Thread Hierarchy¶

CUDA organizes threads in a 3-level hierarchy: Grid → Block → Thread, with warps as an implicit hardware grouping of 32 threads.

Grid of thread blocks

Source: NVIDIA CUDA Programming Guide

4.1 Grid, Block, Warp, Thread¶

Grid  (one per kernel launch)
├── Block (0,0)  ── 1024 threads max ── runs on one SM
│   ├── Warp 0   ── threads 0-31
│   ├── Warp 1   ── threads 32-63
│   └── Warp 31  ── threads 992-1023
├── Block (1,0)
├── Block (0,1)
└── ...

Thread block scheduling

Thread blocks are dispatched to SMs. Multiple blocks can reside on one SM simultaneously (limited by registers, shared memory, and SM capacity). Source: NVIDIA

Built-in variables every kernel sees:

Variable	Type	Meaning
`threadIdx.x/y/z`	`uint3`	Thread index within its block
`blockIdx.x/y/z`	`uint3`	Block index within the grid
`blockDim.x/y/z`	`uint3`	Block dimensions (total threads per block)
`gridDim.x/y/z`	`uint3`	Grid dimensions (total blocks)
`warpSize`	`int`	Always 32

Compute a global 1D thread ID:

int tid = blockIdx.x * blockDim.x + threadIdx.x;

Compute a global 2D thread ID:

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

4.2 Constraints and Limits¶

Max threads per block:      1024
Max blocks per grid (x):    2,147,483,647
Max shared memory per block: 48 KB (default) → 228 KB (Hopper, if requested)
Max registers per thread:   255
Warp size:                  32  (always)

Block size rule of thumb: - Must be a multiple of 32 (warp size) — avoid partial warps - 128, 256, or 512 are common choices - Use cudaOccupancyMaxPotentialBlockSize() to find the optimal size programmatically

4.3 Thread Block Clusters (Compute Capability 9.0+ / Hopper)¶

Hopper adds a new level between grid and block: clusters.

Grid of clusters

Thread Block Clusters: multiple blocks scheduled simultaneously on the same GPC, sharing distributed shared memory. Source: NVIDIA

// Launch with clusters (Hopper, compute capability 9.0)
cudaLaunchConfig_t config = {};
config.gridDim = grid;
config.blockDim = block;

cudaLaunchAttribute attr;
attr.id = cudaLaunchAttributeClusterDimension;
attr.val.clusterDim = {2, 1, 1};   // 2 blocks per cluster
config.attrs = &attr;
config.numAttrs = 1;

cudaLaunchKernelEx(&config, my_kernel, args...);

Threads in a cluster can access distributed shared memory — the shared memory of all blocks in the cluster — enabling much larger on-chip data exchange without going to global memory.

5. Writing Kernels¶

5.1 Function Qualifiers¶

// Runs on GPU, called from CPU (kernel)
__global__ void my_kernel(float* a, float* b, float* c, int N);

// Runs on GPU, called from GPU only
__device__ float helper(float x);

// Runs on CPU (default)
void host_function();

// Compiles for both CPU and GPU
__host__ __device__ float clamp(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

5.2 Launching a Kernel¶

// Syntax: kernel<<<grid_dim, block_dim, shared_mem_bytes, stream>>>(args)

int N = 1 << 20;   // 1M elements
int block_size = 256;
int grid_size = (N + block_size - 1) / block_size;  // ceil division

saxpy<<<grid_size, block_size>>>(a, b, c, N);

// 2D launch for matrix ops
dim3 block(16, 16);         // 256 threads per block
dim3 grid((W + 15) / 16, (H + 15) / 16);

matmul<<<grid, block>>>(A, B, C, M, N, K);

5.3 Complete SAXPY Example¶

#include <cuda_runtime.h>
#include <cstdio>

// Kernel: c[i] = alpha * a[i] + b[i]
__global__ void saxpy(float alpha, const float* a, const float* b, float* c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)   // bounds check — last block may be partial
        c[i] = alpha * a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Allocate host memory
    float *h_a = new float[N], *h_b = new float[N], *h_c = new float[N];
    for (int i = 0; i < N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy H → D
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch
    int block = 256;
    int grid  = (N + block - 1) / block;
    saxpy<<<grid, block>>>(2.0f, d_a, d_b, d_c, N);

    // Copy D → H
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Verify
    for (int i = 0; i < N; i++) {
        if (h_c[i] != 4.0f) { printf("WRONG at %d\n", i); break; }
    }
    printf("OK: c[0] = %.1f\n", h_c[0]);   // expected: 4.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
}

5.4 Error Checking¶

// Macro to check any CUDA call
#define CUDA_CHECK(call) do {                                    \
    cudaError_t err = (call);                                    \
    if (err != cudaSuccess) {                                    \
        fprintf(stderr, "CUDA error at %s:%d — %s\n",           \
                __FILE__, __LINE__, cudaGetErrorString(err));    \
        exit(1);                                                 \
    }                                                            \
} while(0)

// Usage
CUDA_CHECK(cudaMalloc(&d_a, bytes));
CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));

// Check kernel errors (kernels don't return error codes)
my_kernel<<<grid, block>>>(args);
CUDA_CHECK(cudaGetLastError());       // check launch error
CUDA_CHECK(cudaDeviceSynchronize());  // wait + check execution error

Always check errors in development. Suppress cudaDeviceSynchronize() in production (it blocks the CPU) but keep cudaGetLastError().

6. Memory Spaces¶

Every variable in CUDA lives in a specific memory space. Understanding this is the most important skill for optimization.

CUDA Memory Hierarchy

Source: NVIDIA CUDA Programming Guide

6.1 Memory Space Summary¶

Memory	Location	Scope	Lifetime	Latency	Bandwidth	Size
Register	On-chip (SM)	1 thread	Kernel	0 cycles	N/A	256 KB/SM
Local	Off-chip (DRAM)	1 thread	Kernel	~600 cycles	~same as global	Per thread
Shared	On-chip (SM)	All threads in block	Kernel	~20–40 cycles	~19 TB/s (H100)	Up to 228 KB/SM
L1 cache	On-chip (SM)	1 SM	Automatic	~20–40 cycles	Same as shared	Part of unified cache
L2 cache	On-chip (GPU)	All SMs	Automatic	~200 cycles	~TB/s	50 MB (H100)
Global	Off-chip (HBM)	All threads	Application	~600 cycles	~3.35 TB/s (H100)	~80 GB
Constant	Off-chip (cached)	All threads (read-only)	Application	~20 cycles (cached)	—	64 KB
Texture	Off-chip (cached)	All threads (read-only)	Application	~20 cycles (cached)	—	Up to 2D

6.2 Register and Local Memory¶

__global__ void kernel() {
    int x = 5;        // register (fast, private per thread)
    float arr[10];    // may spill to local memory if too large
    // local memory = per-thread DRAM — very slow, avoid large on-stack arrays
}

Register spilling: If a thread uses too many registers, the compiler stores the overflow in slow local memory (off-chip DRAM). Detect with nvcc --ptxas-options=-v.

6.3 Shared Memory¶

Shared memory is the most important optimization tool in CUDA. All threads in a block share it, it's on-chip, and it's ~30× faster than global memory.

__global__ void use_shared(float* in, float* out, int N) {
    __shared__ float tile[256];   // allocated at compile time

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load from global into shared
    tile[tid] = (gid < N) ? in[gid] : 0.0f;
    __syncthreads();   // ← CRITICAL: all threads must finish loading

    // Now process from shared (fast)
    tile[tid] = tile[tid] * 2.0f;
    __syncthreads();

    // Write back to global
    if (gid < N) out[gid] = tile[tid];
}

Dynamic shared memory (size determined at launch):

__global__ void kernel(float* data) {
    extern __shared__ float smem[];   // size determined at launch
    smem[threadIdx.x] = data[...];
    // ...
}

// Launch: third argument = shared memory bytes
kernel<<<grid, block, 256 * sizeof(float)>>>(data);

Request more than 48 KB per block:

// Required for > 48 KB shared memory (Ampere+)
cudaFuncSetAttribute(my_kernel,
    cudaFuncAttributeMaxDynamicSharedMemorySize,
    96 * 1024);  // 96 KB

my_kernel<<<grid, block, 96 * 1024>>>(args);

6.4 Global Memory Allocation¶

float *d_data;
cudaMalloc(&d_data, N * sizeof(float));   // allocate
cudaFree(d_data);                          // free

// Transfers
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // H→D
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // D→H
cudaMemcpy(d_dst, d_src,   bytes, cudaMemcpyDeviceToDevice); // D→D

// Zero-initialize
cudaMemset(d_data, 0, bytes);

Pinned (page-locked) host memory — faster H↔D transfers:

float *h_data;
cudaMallocHost(&h_data, bytes);   // allocate pinned host memory
// OR: cudaHostAlloc(&h_data, bytes, cudaHostAllocDefault);

// Use exactly like regular memory
// Transfer speed: ~2× faster than pageable memory

cudaFreeHost(h_data);

Unified Memory — single pointer accessible from both CPU and GPU:

float *data;
cudaMallocManaged(&data, bytes);   // managed allocation

// Use on CPU
for (int i = 0; i < N; i++) data[i] = 1.0f;

// Use on GPU — CUDA runtime migrates pages automatically
kernel<<<grid, block>>>(data, N);
cudaDeviceSynchronize();

// Use on CPU again — migrates back
printf("%f\n", data[0]);

cudaFree(data);

6.5 Constant Memory¶

For read-only data shared by all threads — cached with broadcast (one read serves all 32 threads in a warp):

__constant__ float weights[1024];

__global__ void apply_weights(float* data, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) data[i] *= weights[i % 1024];   // all threads read same value: broadcast
}

// Host-side: copy to constant memory
cudaMemcpyToSymbol(weights, h_weights, 1024 * sizeof(float));

7. Warp Execution and SIMT¶

7.1 SIMT — Single Instruction, Multiple Threads¶

A warp is 32 threads. The SM issues one instruction to all 32 lanes simultaneously. This is hardware-level SIMD, but each thread has its own registers and can follow its own control flow.

Active warp lanes

Active vs inactive lanes in a warp. Inactive lanes (diverged threads) are masked off — they consume time but produce no output. Source: NVIDIA

7.2 Warp Divergence¶

When threads in a warp take different branches, they execute both paths serially — inactive lanes are masked off.

__global__ void divergent(float* data, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)    // ← half the warp goes here
        data[i] *= 2.0f;
    else               // ← other half goes here (serialized!)
        data[i] += 1.0f;
}

// Result: warp takes TWO passes — both branches run, half masked each time
// 2× slower than a non-divergent version

No-divergence version (branchless):

__global__ void no_divergence(float* data, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float even = data[i] * 2.0f;
    float odd  = data[i] + 1.0f;
    data[i] = (i % 2 == 0) ? even : odd;  // select, not branch
}

When divergence is unavoidable: if (i < N) bounds checks diverge only the last partial warp — acceptable.

7.3 Occupancy — Hiding Latency¶

The GPU hides memory latency by switching to other resident warps while one warp waits. More resident warps = more latency hiding = higher throughput.

SM capacity: 2048 resident threads (H100) = 64 warps
If block size = 256 → 8 warps/block → 8 blocks resident per SM
If block size = 32  → 1 warp/block  → 64 blocks resident per SM (but tiny blocks waste overhead)
If block size = 1024 → 32 warps/block → 2 blocks = 64 warps = 100% occupancy

Occupancy is limited by: 1. Registers per thread — more registers = fewer threads fit 2. Shared memory per block — more shared mem = fewer blocks fit 3. Max threads per SM — hard limit

Find optimal config programmatically:

int block_size, min_grid;
cudaOccupancyMaxPotentialBlockSize(&min_grid, &block_size, my_kernel, 0, 0);
printf("Optimal block size: %d\n", block_size);

// Query actual occupancy
int active_blocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&active_blocks, my_kernel, block_size, 0);
float occupancy = (active_blocks * block_size) / (float)props.maxThreadsPerMultiProcessor;
printf("Occupancy: %.1f%%\n", occupancy * 100);

8. Memory Coalescing¶

Global memory is accessed in 128-byte cache line chunks. If 32 threads in a warp access 32 contiguous floats (128 bytes), that's one memory transaction. If they access scattered addresses, that's up to 32 transactions — 32× slower.

Coalesced memory access

32 threads access 32 consecutive 4-byte floats → 1 transaction (128 bytes). Source: NVIDIA

Uncoalesced memory access

Scattered accesses → multiple transactions, wasted bandwidth. Source: NVIDIA

Coalesced (good):

// Thread i accesses a[i] — consecutive, perfectly coalesced
__global__ void coalesced(float* a, float* b, float* c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) c[i] = a[i] + b[i];
}

Strided (bad):

// Thread i accesses a[i * stride] — stride = 4 means 1/4 cache lines used
__global__ void strided(float* a, float* b, float* c, int stride, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < N) c[i] = a[i * stride] + b[i * stride];
}
// Each 128-byte load fetches 32 floats, but only uses 1 → 97% wasted bandwidth

Fix strided access with transpose or SoA layout (see AoS vs SoA in SIMD guide).

9. Shared Memory and Bank Conflicts¶

Shared memory is divided into 32 banks, each 4 bytes wide. Bank k holds addresses k, k+32, k+64, etc.

If multiple threads in a warp access different addresses in the same bank, that's a bank conflict — serialized. If all threads access the same address in one bank, it's a broadcast — no conflict.

Shared memory bank conflicts

Source: NVIDIA CUDA Programming Guide

// No conflict: thread i accesses bank i
__shared__ float smem[32];
float val = smem[threadIdx.x];          // OK: thread 0→bank 0, thread 1→bank 1, ...

// 2-way bank conflict: thread 0 and 16 both access bank 0
float val = smem[threadIdx.x * 2];     // BAD: stride 2 → 2-way conflict

// Fix: pad the array
__shared__ float smem[32 + 1];         // +1 pad shifts all addresses
float val = smem[threadIdx.x * 2];     // Now stride-2 is conflict-free

// 32-way conflict (broadcast): all threads read the same element
float val = smem[0];                   // OK — hardware broadcasts, no conflict

Matrix transpose (classic bank conflict example):

#define TILE 32

__global__ void transpose(float* out, const float* in, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;

    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}

10. Synchronization¶

10.1 Within a Block¶

__syncthreads();          // barrier: all threads in block reach this before any continues
__syncwarp();             // barrier: all threads in a warp (lighter, within one warp)
__syncwarp(mask);         // barrier for a subset of warp lanes (mask = bitfield)

__syncthreads() in conditional code is undefined behavior if not all threads reach it:

// WRONG: some threads may not reach __syncthreads()
if (threadIdx.x < 16) {
    smem[threadIdx.x] = data[threadIdx.x];
    __syncthreads();   // ← UB: threads 16-31 never reach this
}

// CORRECT: syncthreads outside the conditional
smem[threadIdx.x] = (threadIdx.x < 16) ? data[threadIdx.x] : 0;
__syncthreads();       // all threads hit this

10.2 Atomic Operations¶

// Atomic add (global or shared memory)
atomicAdd(&counter, 1);
atomicAdd(&shared_sum, val);

// Other atomics
atomicSub(&counter, 1);
atomicMax(&max_val, val);
atomicMin(&min_val, val);
atomicCAS(&lock, 0, 1);      // compare-and-swap: if *lock==0, set to 1, return old

// Warp-level reduction (compute capability 8.0+)
float warp_sum = __reduce_add_sync(0xFFFFFFFF, val);
float warp_max = __reduce_max_sync(0xFFFFFFFF, val);

10.3 Warp Shuffle — Pass Data Between Threads Without Shared Memory¶

// Each thread gets the value from thread src_lane in the same warp
float val = __shfl_sync(0xFFFFFFFF, local_val, src_lane);

// Shift: thread i receives from thread i+delta
float val = __shfl_down_sync(0xFFFFFFFF, local_val, delta);
float val = __shfl_up_sync(0xFFFFFFFF, local_val, delta);

// XOR exchange (butterfly for reductions)
float val = __shfl_xor_sync(0xFFFFFFFF, local_val, mask);

// Warp reduction using shuffles (no shared memory, no sync needed)
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    return val;   // thread 0 holds the sum
}

11. Streams and Asynchronous Execution¶

Streams allow overlapping GPU computation with host execution and with data transfers.

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Overlapping compute and copy
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream1>>>(d_a, d_out, N);
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream1);

// Different work on stream2, runs concurrently with stream1
other_kernel<<<grid, block, 0, stream2>>>(d_b, d_out2, N);

// Synchronize
cudaStreamSynchronize(stream1);   // wait for stream1
cudaStreamSynchronize(stream2);   // wait for stream2

cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

Stream timeline (overlapping copy + compute):

Stream 1:  [copy H→D] [kernel] [copy D→H]
Stream 2:             [copy H→D] [kernel] [copy D→H]
           ─────────────────────────────────────────► time

Events for precise timing and cross-stream sync:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream1);
kernel<<<grid, block, 0, stream1>>>(args);
cudaEventRecord(stop, stream1);

cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

12. CUDA Graphs¶

For workloads that repeat the same sequence of kernels and transfers, CUDA graphs eliminate per-launch CPU overhead by capturing and replaying the entire execution graph.

// Step 1: capture a sequence of operations
cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

kernel_a<<<grid, block, 0, stream>>>(d_a, N);
kernel_b<<<grid, block, 0, stream>>>(d_a, d_b, N);
cudaMemcpyAsync(h_out, d_b, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamEndCapture(stream, &graph);

// Step 2: instantiate (compile the graph)
cudaGraphExec_t graph_exec;
cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

// Step 3: launch repeatedly (minimal CPU overhead)
for (int iter = 0; iter < 1000; iter++) {
    cudaGraphLaunch(graph_exec, stream);
    cudaStreamSynchronize(stream);
}

// Cleanup
cudaGraphExecDestroy(graph_exec);
cudaGraphDestroy(graph);

When to use graphs: - Same kernel sequence repeated many times (training loops, inference batches) - Many small kernels where launch overhead dominates - CPU overhead reduction from ~5 µs/launch → ~1 µs/graph replay

13. Parallel Reduction — Complete Example¶

Reduction is the canonical shared memory + sync pattern.

// Parallel sum reduction — each block reduces its chunk to one value
__global__ void reduce_sum(const float* in, float* partial, int N) {
    extern __shared__ float smem[];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x * 2 + threadIdx.x;  // × 2: each thread loads 2

    // Load two elements per thread
    float val = 0.0f;
    if (gid < N)         val += in[gid];
    if (gid + blockDim.x < N) val += in[gid + blockDim.x];
    smem[tid] = val;
    __syncthreads();

    // Tree reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 32; stride >>= 1) {
        if (tid < stride)
            smem[tid] += smem[tid + stride];
        __syncthreads();
    }

    // Final warp reduction (no sync needed within a warp)
    if (tid < 32) {
        volatile float* s = smem;   // volatile prevents compiler from caching
        s[tid] += s[tid + 32];
        s[tid] += s[tid + 16];
        s[tid] += s[tid + 8];
        s[tid] += s[tid + 4];
        s[tid] += s[tid + 2];
        s[tid] += s[tid + 1];
    }

    // Thread 0 writes this block's partial sum
    if (tid == 0) partial[blockIdx.x] = smem[0];
}

// Host: launch reduce_sum, then sum the partial results
float gpu_sum(const float* d_in, int N) {
    int block = 256;
    int grid  = (N + block * 2 - 1) / (block * 2);

    float *d_partial;
    cudaMalloc(&d_partial, grid * sizeof(float));

    reduce_sum<<<grid, block, block * sizeof(float)>>>(d_in, d_partial, N);

    // Recursively reduce if needed, or just copy partial to host and sum there
    std::vector<float> h_partial(grid);
    cudaMemcpy(h_partial.data(), d_partial, grid * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_partial);

    return std::accumulate(h_partial.begin(), h_partial.end(), 0.0f);
}

14. Tiled Matrix Multiply — Shared Memory Optimization¶

#define TILE_SIZE 16

__global__ void tiled_matmul(
    const float* A, const float* B, float* C,
    int M, int N, int K)  // C[M×N] = A[M×K] * B[K×N]
{
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;

    float sum = 0.0f;

    // Iterate over tiles of K
    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Cooperative load: each thread loads one element
        int a_col = t * TILE_SIZE + threadIdx.x;
        int b_row = t * TILE_SIZE + threadIdx.y;

        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col]  : 0.0f;
        __syncthreads();

        // Compute partial dot product for this tile
        for (int k = 0; k < TILE_SIZE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = sum;
}

Why tiling works:

Naive matmul:
  Each C[i][j] loads K elements from A (row i) and K from B (col j)
  Total global loads: M*N*2K = O(MNK) → memory bound for large K

Tiled matmul with TILE=16:
  Each 16×16 block of C cooperatively loads one 16×16 tile from A and B
  Each load reused 16 times within the tile
  Memory traffic reduced by ~TILE_SIZE = 16×
  Compute:memory ratio increased → compute bound (much faster)

15. Tensor Cores¶

Tensor Cores are dedicated matrix-multiply-accumulate units introduced in Volta (CC 7.0). They operate on small matrix tiles in a single instruction.

#include <mma.h>
using namespace nvcuda::wmma;

// Tensor Core: 16×16×16 GEMM (FP16 input, FP32 accumulator)
__global__ void tensor_core_matmul(
    const half* A, const half* B, float* C, int M, int N, int K)
{
    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, float>         c_frag;

    fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        load_matrix_sync(a_frag, A + blockIdx.y * 16 * K + k, K);
        load_matrix_sync(b_frag, B + k * N + blockIdx.x * 16, N);
        mma_sync(c_frag, a_frag, b_frag, c_frag);   // D = A*B + C
    }

    store_matrix_sync(C + blockIdx.y * 16 * N + blockIdx.x * 16, c_frag, N, mem_row_major);
}

Tensor Core throughput vs CUDA Cores (H100 SXM, per SM):

Precision	CUDA Cores	Tensor Cores
FP64	67 TFLOPS	134 TFLOPS
TF32	—	989 TFLOPS
FP16/BF16	134 TFLOPS	1,979 TFLOPS
FP8	—	3,958 TFLOPS
INT8	268 TOPS	3,958 TOPS

In practice: use cuBLAS or cuDNN — they use Tensor Cores automatically when shapes are multiples of 16.

16. Dynamic Parallelism¶

Kernels can launch other kernels from the GPU (no CPU round-trip required):

__global__ void child_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parent_kernel(float* data, int N) {
    // Each block launches its own child kernel
    int chunk = N / gridDim.x;
    int offset = blockIdx.x * chunk;

    child_kernel<<<1, chunk>>>(data + offset, chunk);
    cudaDeviceSynchronize();   // wait for children (device-side sync)
}

Requires compute capability 3.5+. Adds latency per launch. Best for irregular recursive problems (trees, AMR meshes).

17. Cooperative Groups¶

Cooperative groups let you express sync and reduction at any granularity — warp, block, multi-block, or grid.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void flexible_reduction(float* data, float* out, int N) {
    auto block = cg::this_thread_block();  // all threads in this block
    auto warp  = cg::tiled_partition<32>(block);  // this warp
    auto tile  = cg::tiled_partition<4>(block);   // group of 4 threads

    float val = data[block.thread_rank()];

    // Warp reduce
    for (int i = warp.size() / 2; i > 0; i >>= 1)
        val += warp.shfl_down(val, i);

    // Write warp result
    if (warp.thread_rank() == 0)
        atomicAdd(out, val);
}

// Grid-wide sync (requires cudaLaunchCooperativeKernel)
__global__ void grid_sync_kernel(float* data) {
    auto grid = cg::this_grid();
    // ... do phase 1 ...
    grid.sync();   // all threads in grid synchronize
    // ... do phase 2 ...
}

18. Automatic Scalability¶

CUDA programs scale automatically across different GPU sizes. The same grid runs on a GPU with 4 SMs or 144 SMs — the runtime schedules blocks to available SMs.

Automatic scalability

The same program scales automatically: 2 SMs run 2 blocks at a time, 4 SMs run 4 blocks at a time. Source: NVIDIA

This is why you write for maximum parallelism and let the hardware decide — don't hardcode grid sizes to a specific GPU.

19. Performance Optimization Checklist¶

Step 1 — Profile First¶

# NVIDIA Nsight Systems (timeline, CPU/GPU overlap)
nsys profile --stats=true ./my_program

# NVIDIA Nsight Compute (kernel-level metrics)
ncu --set full ./my_program

Step 2 — Roofline Analysis¶

Every kernel is either compute-bound or memory-bound:

Arithmetic Intensity (AI) = FLOPs / bytes accessed
                                                         │ Tensor Core roof (3958 TFLOPS)
Performance (FLOP/s)                                     │
     ╔═══════════════════════════════════════════════════╗
     ║ memory-bound │               compute-bound        ║
     ║  AI < ridge  │           AI > ridge point         ║
     ╚══════════════╧═══════════════════════════════════╝
                ridge ← Memory BW (3.35 TB/s) / Peak FLOPS

H100 ridge point: ~3,958 TFLOPS / 3,350 GB/s ≈ 1.18 FLOP/byte

If your kernel's AI < 1.18 FLOP/byte on H100 → it's memory-bound → focus on coalescing and reuse.

Step 3 — Common Optimizations¶

Issue	Symptom	Fix
Low occupancy	Kernel uses many registers	Reduce register pressure; use `__launch_bounds__`
Uncoalesced access	High DRAM traffic in Nsight	Restructure to SoA; transpose before processing
Bank conflicts	Shared mem replay in Nsight	Pad shared arrays (+1 column)
Warp divergence	Low warp execution efficiency	Branchless arithmetic; sort inputs first
Too many atomics	High contention	Use per-warp/per-block local accumulators first
Launch overhead	Kernel takes < 50 µs but lots of launches	Use CUDA Graphs
CPU-GPU idle time	GPU idle in timeline	Async transfers + streams

Step 4 — Launch Bounds¶

// Tell the compiler max threads per block and min blocks per SM
// Helps compiler tune register allocation for better occupancy
__launch_bounds__(256, 4)   // max 256 threads/block, at least 4 blocks/SM
__global__ void my_kernel(float* data, int N) { ... }

20. Compute Capability Quick Reference¶

Arch	CC	GPU Examples	FP16 TC	BF16	FP8	TMA	Clusters
Volta	7.0	V100	✓	—	—	—	—
Turing	7.5	RTX 2080, T4	✓	—	—	—	—
Ampere	8.0	A100	✓	✓	—	—	—
Ampere	8.6	RTX 3090, A10	✓	✓	—	—	—
Hopper	9.0	H100	✓	✓	✓	✓	✓
Blackwell	10.x	B100, B200	✓	✓	✓	✓	✓

SM resource limits (selected):

Resource	Volta 7.0	Ampere 8.0	Hopper 9.0
Max warps/SM	64	64	64
Max threads/SM	2048	2048	2048
Max blocks/SM	32	32	32
Registers/SM	64K	64K	64K
Shared mem/SM	96 KB	164 KB	228 KB
L2 cache	6 MB	40 MB	50 MB
CUDA cores/SM	64 FP64	64 FP64	64 FP64
FP32 cores/SM	64	128	128

21. Suggested Projects (in order)¶

#	Project	Key Skills
1	Vector add / SAXPY	Launch config, indexing, bounds check, H↔D copy
2	Parallel reduction (sum)	Shared memory, `__syncthreads`, tree reduction, warp shuffles
3	2D grayscale transform	2D `threadIdx`/`blockIdx`, image stride
4	Matrix transpose	Shared memory, bank conflict padding
5	Naive matmul → tiled matmul	Tiling, shared mem reuse, compare vs cuBLAS
6	Histogram	Atomics, per-block private histogram, reduce-merge
7	Prefix scan	Blelloch scan, upsweep/downsweep in shared mem
8	Multi-stream pipeline	Streams, async copy, overlap compute + transfer

Keep CPU golden references for every kernel. Validate before optimizing.

Resources¶

Resource	What it covers
CUDA C++ Programming Guide	The authoritative reference — thread hierarchy, memory, streams, graphs
CUDA C++ Best Practices Guide	Optimization strategies, memory patterns, profiling
NVIDIA Nsight Compute	Kernel profiler — roofline, memory, warp stats
NVIDIA Nsight Systems	System profiler — CPU/GPU timeline, stream overlap
Hopper Architecture In-Depth	H100 SM, TMA, FP8, Thread Block Clusters
Ampere Architecture In-Depth	A100 SM, MIG, sparsity, TF32
CUDA Samples	Reference implementations: reduction, matmul, scan
Programming Massively Parallel Processors (Hwu, Kirk, Hajj)	GPU architecture + CUDA optimization textbook

Next¶

→ OpenCL and SYCL — portable compute across vendors.