01 — CUDA Graphs¶
1. The Problem: CPU Launch Overhead¶
Every CUDA kernel launch involves the CPU:
CPU side:
1. Validate kernel arguments
2. Compute grid/block dimensions
3. Write launch packet to driver queue
4. Driver forwards to GPU command queue
5. CPU returns (kernel runs asynchronously)
Cost per launch: 5–20 µs
LLM decode step with 200 kernels:
200 × 10 µs = 2 ms wasted on launch overhead alone
At batch_size=1, TTFT target = 50 ms:
2 ms overhead = 4% of budget — significant
At higher throughput: GPU waits for CPU to submit next kernel
This is called CPU-GPU launch gap — the GPU sits idle between kernel launches:
Timeline (without CUDA Graphs):
CPU: [submit K1][submit K2][submit K3][submit K4]...
GPU: [K1 runs][ K2 runs ][K3][K4]
↑ idle gaps between kernels ↑
CUDA Graphs solve this by submitting the entire operation sequence in one shot.
2. How CUDA Graphs Work¶
CUDA Graphs have three phases:
Phase 1 — CAPTURE:
Run your code once in "recording" mode.
CUDA captures every kernel launch, memcpy, and sync event
into a directed acyclic graph (DAG).
Phase 2 — INSTANTIATE:
CUDA compiles the graph into an executable object (cudaGraphExec_t).
Pre-allocates all resources, validates topology.
This takes ~10–50 ms — done once at startup.
Phase 3 — REPLAY:
Launch the entire graph with a single API call.
GPU executes all operations with no further CPU involvement.
Cost: ~3–5 µs (just the launch, not the work).
Graph DAG Structure¶
[memcpy H2D]
│
[embedding kernel]
/ \
[attn] [mlp] ← independent nodes run in parallel
\ /
[residual add]
│
[layernorm]
│
[memcpy D2H]
Dependencies are tracked automatically — CUDA Graphs preserve exact execution order AND enable parallel execution of independent branches.
3. CUDA Graphs in C++¶
Manual Graph Construction¶
#include <cuda_runtime.h>
cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStream_t stream;
cudaStreamCreate(&stream);
// === PHASE 1: CAPTURE ===
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
// All operations on `stream` are now captured, not executed
cudaMemcpyAsync(d_input, h_input, size, cudaMemcpyHostToDevice, stream);
embeddingKernel<<<grid, block, 0, stream>>>(d_input, d_embed, vocab_size);
attentionKernel<<<grid, block, 0, stream>>>(d_embed, d_qkv, seq_len);
mlpKernel<<<grid, block, 0, stream>>>(d_embed, d_out, hidden_size);
cudaMemcpyAsync(h_output, d_out, size, cudaMemcpyDeviceToHost, stream);
cudaStreamEndCapture(stream, &graph);
// === PHASE 2: INSTANTIATE ===
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
cudaGraphDestroy(graph); // can free the graph after instantiation
// === PHASE 3: REPLAY (inference loop) ===
for (int request = 0; request < num_requests; request++) {
// Update input data
update_input_buffer(h_input, request);
// Launch entire pipeline with one API call
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
process_output(h_output);
}
// Cleanup
cudaGraphExecDestroy(graphExec);
cudaStreamDestroy(stream);
Graph Node API (Explicit Construction)¶
For more control, build the graph node-by-node without capture:
cudaGraph_t graph;
cudaGraphCreate(&graph, 0);
// Add kernel node
cudaGraphNode_t kernelNode;
cudaKernelNodeParams kernelParams = {};
kernelParams.func = (void*)myKernel;
kernelParams.gridDim = dim3(grid_x, grid_y);
kernelParams.blockDim = dim3(block_x, block_y);
kernelParams.kernelParams = args;
cudaGraphAddKernelNode(&kernelNode, graph, nullptr, 0, &kernelParams);
// Add memcpy node with dependency on kernel
cudaGraphNode_t memcpyNode;
cudaGraphAddMemcpyNode1D(&memcpyNode, graph,
&kernelNode, 1, // depends on kernelNode
h_out, d_out, size, cudaMemcpyDeviceToHost);
// Instantiate and launch as before
4. CUDA Graphs in PyTorch¶
PyTorch wraps CUDA Graphs cleanly. Two patterns:
Pattern A: torch.cuda.graph¶
import torch
# Must use static (same-shape) tensors for graphs
batch_size, seq_len, hidden = 1, 512, 4096
static_input = torch.zeros(batch_size, seq_len, dtype=torch.long, device="cuda")
static_output = torch.zeros(batch_size, seq_len, hidden, device="cuda")
# Warmup (fills CUDA caches, JIT compiles ops)
s = torch.cuda.Stream()
with torch.cuda.stream(s):
for _ in range(3):
static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)
# Capture
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
static_output = model(static_input)
# Inference loop — GPU executes graph, no CPU kernel submissions
def graph_inference(new_input: torch.Tensor) -> torch.Tensor:
static_input.copy_(new_input) # update input in-place
g.replay() # re-execute captured graph
return static_output.clone() # clone before next replay overwrites it
Pattern B: torch.compile(mode="reduce-overhead")¶
import torch
# torch.compile automatically wraps in CUDA Graph
model = torch.compile(model, mode="reduce-overhead")
# First call: traces + captures graph (~slow)
out = model(inputs)
# Subsequent calls: replay graph (~3-5 µs overhead)
for batch in dataloader:
out = model(batch) # graph replay, fast
mode="reduce-overhead" is mode="default" + automatic CUDA Graph capture.
Pattern C: Bucketed Graphs for Dynamic Shapes¶
LLM decode has variable sequence lengths — one graph per bucket:
BUCKETS = [128, 256, 512, 1024, 2048, 4096]
graphs = {}
static_ins = {}
static_outs = {}
# Pre-capture one graph per bucket
for seq_len in BUCKETS:
dummy = torch.zeros(1, seq_len, dtype=torch.long, device="cuda")
# Warmup
for _ in range(3):
model(dummy)
torch.cuda.synchronize()
# Capture
g = torch.cuda.CUDAGraph()
s_in = dummy.clone()
s_out = torch.zeros(1, seq_len, model.config.vocab_size, device="cuda")
with torch.cuda.graph(g):
s_out = model(s_in)
graphs[seq_len] = g
static_ins[seq_len] = s_in
static_outs[seq_len] = s_out
def bucketed_decode(input_ids: torch.Tensor) -> torch.Tensor:
L = input_ids.shape[1]
bucket = next(b for b in BUCKETS if b >= L)
# Pad to bucket size
padded = torch.nn.functional.pad(input_ids, (0, bucket - L))
static_ins[bucket].copy_(padded)
graphs[bucket].replay()
return static_outs[bucket][:, :L, :].clone()
5. CUDA Graph Constraints and Limitations¶
Works well with:
✓ Fixed shapes (same tensor sizes every call)
✓ Repeated inference (same model, many requests)
✓ Training with fixed batch size and sequence length
✓ Physics simulation time steps
Does NOT work with:
✗ Dynamic control flow (Python if/else based on tensor values)
✗ Variable-length inputs without bucketing
✗ Operations that allocate new memory each call
✗ CPU-GPU synchronization points inside the graph
✗ Python callbacks inside the capture region
Common pitfall:
# WRONG — Python branching based on GPU tensor value breaks graph
if output.argmax() == stop_token: # requires D2H sync, breaks capture
break
# CORRECT — implement stopping criteria outside the graph replay loop
g.replay()
next_token = static_output.argmax(dim=-1) # D2H after replay
if next_token.item() == stop_token:
break
6. CUDA Graph Updates (Without Re-capture)¶
From CUDA 11.1+, you can update parts of a graph without full re-capture:
// Update kernel parameters in existing graph
cudaGraphExecKernelNodeSetParams(graphExec, kernelNode, &newParams);
// Update memcpy source/destination
cudaGraphExecMemcpyNodeSetParams(graphExec, memcpyNode, &newMemcpyParams);
// Re-upload to GPU (fast, ~100 µs)
cudaGraphUpload(graphExec, stream);
Useful when only tensor contents change (not shapes) — avoids full re-instantiation.
7. Profiling CUDA Graphs¶
# CUDA Graphs appear as "CUDA Graph" events in Nsight Systems
nsys profile \
--trace=cuda,nvtx \
--gpu-metrics-device=all \
python inference.py
# In Nsight Systems GUI:
# - CUDA Graph launches appear as a single block
# - Expand to see individual node executions inside the graph
# - Compare: kernel-by-kernel vs graph — look for removed launch gaps
# Mark graph capture and replay with NVTX for visibility
import torch.cuda.nvtx as nvtx
with nvtx.range("graph_capture"):
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
out = model(inp)
for i in range(N):
with nvtx.range(f"graph_replay_{i}"):
g.replay()
8. Expected Performance Gains¶
| Workload | Without Graphs | With Graphs | Speedup |
|---|---|---|---|
| Llama-3 8B, BS=1, seq=512 | ~45 ms | ~32 ms | 1.4× |
| ResNet-50, BS=1 | ~3.2 ms | ~1.8 ms | 1.8× |
| BERT-large, BS=1 | ~12 ms | ~8 ms | 1.5× |
| Training step (fixed BS) | ~820 ms | ~780 ms | 1.05× |
Graphs have the largest impact at batch_size=1 (pure latency mode) where kernel launch overhead is a large fraction of total time. At large batch sizes, kernel launch overhead is amortized over more work.