03 — Persistent Kernels¶
1. The Problem: Kernel Launch Overhead at Extreme Frequency¶
CUDA Graphs reduce CPU overhead for a fixed sequence of kernels. But some workloads need to dynamically dispatch work to the GPU — such as online inference servers where requests arrive continuously with unknown timing.
Standard inference serving model:
Request arrives → CPU prepares batch → launch kernels → wait for result → next request
Timeline:
[CPU prep 50µs][kernel 2ms][CPU prep 50µs][kernel 2ms]...
↑ repeated kernel launch setup ↑
For low-latency streaming inference (robotics, real-time control), even CUDA Graphs have limits: - Graphs require fixed shapes - Each request still requires a CPU-side graph replay call - Multi-stream contention from concurrent requests
Persistent kernels solve this with a different approach: the kernel never exits. It runs forever, polling for new work.
2. What is a Persistent Kernel?¶
A persistent kernel is a kernel that: 1. Launches once at startup with all SMs 2. Loops forever polling a GPU-side work queue 3. Processes work items as they arrive 4. The CPU submits work by writing to GPU memory (zero kernel launch overhead)
Standard model: Persistent model:
CPU launches kernel → | GPU kernel runs forever
GPU runs kernel | CPU writes work to GPU queue
GPU kernel exits | GPU polls queue, processes work
CPU launches again | CPU writes more work
GPU runs kernel | GPU processes again
GPU kernel exits | ...
... |
3. Persistent Kernel Architecture¶
GPU memory:
┌─────────────────────────────────────────────────────┐
│ Work Queue (ring buffer) │
│ ┌──────┬──────┬──────┬──────┬──────┬──────┐ │
│ │ W0 │ W1 │ W2 │ empty│ empty│ empty│ │
│ └──────┴──────┴──────┴──────┴──────┴──────┘ │
│ ↑ head ↑ tail │
│ │
│ Work Item: │
│ { input_ptr, output_ptr, batch_size, status } │
└─────────────────────────────────────────────────────┘
CPU thread: GPU persistent kernel:
while (serving): while (running):
prepare_input() work = poll_queue()
write_to_queue() if work is valid:
wait_for_output() process(work)
signal_done(work)
4. Persistent Kernel Implementation¶
Work Queue Structure (GPU-accessible)¶
struct WorkItem {
float* input;
float* output;
int batch_size;
int seq_len;
int status; // 0=empty, 1=pending, 2=done
int padding[2];
};
#define MAX_QUEUE_SIZE 64
struct WorkQueue {
WorkItem items[MAX_QUEUE_SIZE];
int head; // next slot to read (GPU side)
int tail; // next slot to write (CPU side)
int running;
};
The Persistent Kernel¶
__global__ void persistent_inference_kernel(
WorkQueue* queue,
ModelWeights* weights,
int num_sms
) {
// Only run one thread block per SM for maximum control
if (blockIdx.x >= num_sms) return;
int sm_id = blockIdx.x;
while (queue->running) {
// Spin-wait for work (polling — burns cycles but zero latency)
int slot = -1;
if (threadIdx.x == 0) {
// Thread 0 polls for available work
int expected = sm_id % MAX_QUEUE_SIZE; // round-robin assignment
if (atomicCAS(&queue->items[expected].status, 1, 0) == 1) {
slot = expected;
}
}
// Broadcast slot to all threads in block
slot = __shfl_sync(0xffffffff, slot, 0);
if (slot >= 0) {
WorkItem* work = &queue->items[slot];
// Run the actual inference
run_transformer_layer(
work->input, work->output,
work->batch_size, work->seq_len,
weights
);
// Signal completion (CPU waits on this)
__threadfence_system(); // ensure output visible to CPU
atomicExch(&work->status, 2); // mark done
}
// Brief pause to avoid excessive power draw from spin-wait
// __nanosleep(100); // CUDA 11.1+ on Ampere+
}
}
CPU Submit Path (Zero Overhead)¶
class PersistentInferenceServer {
WorkQueue* d_queue;
WorkQueue* h_queue; // pinned host mirror
int submit(float* input, float* output, int batch, int seq) {
// Find empty slot
int slot = allocate_slot();
// Write work item directly to GPU memory via pinned mapping
WorkItem& item = d_queue->items[slot];
item.input = input;
item.output = output;
item.batch_size = batch;
item.seq_len = seq;
// Memory fence then set status — GPU sees this atomically
__atomic_thread_fence(__ATOMIC_SEQ_CST);
atomicExch(&item.status, 1); // mark pending
return slot;
}
void wait(int slot) {
// Poll until GPU signals done (spin-wait on CPU side)
while (atomicLoad(&d_queue->items[slot].status) != 2)
_mm_pause(); // CPU spin hint (reduces power)
}
};
5. Persistent Kernels for LLM Token Streaming¶
The most important production use case: streaming token generation without kernel launch per token.
Standard vLLM/TRT-LLM decode:
For each token:
CPU: launch attention kernel (10µs)
GPU: attention (5ms)
CPU: launch FFN kernel (10µs)
GPU: FFN (5ms)
Total per token: 10.04ms, of which 0.02ms = CPU overhead (0.2%)
← overhead is small for normal decode
For speculative decoding with tiny draft model (0.5ms compute):
CPU: launch kernel (10µs)
GPU: tiny draft model (0.5ms)
CPU overhead: 2% — now matters
With persistent kernel for draft model:
__global__ void draft_model_persistent(
TokenStream* token_stream,
TinyModelWeights* weights
) {
auto grid = cooperative_groups::this_grid();
while (token_stream->active) {
// Poll for next input token
int token = spin_wait_for_token(token_stream);
// Run tiny draft model (no kernel launch overhead)
int proposed_tokens[5];
draft_forward(token, weights, proposed_tokens, 5);
// Write proposals for verifier
write_proposals(token_stream, proposed_tokens);
grid.sync(); // sync all SMs before signaling
signal_proposals_ready(token_stream);
}
}
6. Persistent Kernels in Production Frameworks¶
TensorRT-LLM Persistent GEMM¶
TRT-LLM uses persistent kernels for its GEMM operations when the decoder runs in "streaming" mode:
// From TRT-LLM source (simplified)
// The GEMM kernel stays resident, processing one token's projection per iteration
template<typename T>
__global__ void persistent_gemm_kernel(
const T* A, const T* B, T* C,
int M, int N, int K,
volatile int* tile_counter // atomically-incremented work counter
) {
// Tiles are processed until all are done
while (true) {
int tile_id = atomicAdd(const_cast<int*>(tile_counter), 1);
if (tile_id >= total_tiles) break; // all work done
int tile_m = tile_id / num_tiles_n;
int tile_n = tile_id % num_tiles_n;
compute_gemm_tile(A, B, C, tile_m, tile_n, M, N, K);
}
}
FlashAttention Persistent Attention (Hopper)¶
FlashAttention-3 on H200 uses persistent warps that overlap compute and I/O:
Warpgroup A (Producer): continuously loads tiles from HBM
Warpgroup B (Consumer): computes GEMM on loaded tiles
Warpgroup C (Consumer): computes softmax + accumulate
These run concurrently via warp specialization (see topic 05)
The kernel persists for the entire attention computation
7. Persistent Kernel Trade-offs¶
| Factor | Persistent Kernel | Standard Kernel |
|---|---|---|
| Launch overhead | Zero (kernel already running) | 5–20 µs per launch |
| GPU idle between work items | Zero (polling) | Kernel launch gap |
| Power consumption | Higher (SM busy spinning) | Lower (SM sleeps between launches) |
| Flexibility | Fixed SM occupancy | Dynamic SM allocation |
| Debugging difficulty | Very high (infinite loop) | Standard |
| Best for | High-frequency, low-latency workloads | Batch compute, throughput workloads |
| Not suitable for | Low-frequency tasks (wasteful spinning) | Latency-critical streaming |
Power consideration¶
Persistent kernel spinning uses ~150W more per GPU vs idle between kernels.
On 8x H200, always-on persistent kernels: ~1.2 kW extra (out of 5.6 kW total)
→ 21% more power for zero-launch-overhead
Production decision: use persistent kernels ONLY for latency-critical paths
(e.g., draft model in speculative decoding, streaming token generation)
Use standard kernels for batch throughput workloads (training, offline inference)
8. Softer Alternative: Thread Block Specialization¶
For cases where full persistence is too aggressive, persistent thread blocks stay alive between tasks using __grid_constant__ and device-side work queues:
__global__ void semi_persistent_kernel(WorkQueue* queue) {
// Process multiple work items per launch (amortize launch cost)
// but kernel DOES eventually exit
while (true) {
WorkItem* work = try_dequeue(queue);
if (work == nullptr) break; // no more work, exit cleanly
process(work);
}
}
// Launch once with enough blocks to drain the queue efficiently
// Re-launch when queue refills (much less frequent than per-item launch)
This is what TensorRT-LLM inflight batching uses — work items accumulate in a queue, one launch drains the current queue, giving ~10–50× fewer kernel launches than naive per-request launching.