
Lecture 43 - MLSys 2026 Kernel Contest: AI-Assisted Blackwell LLM Kernel Optimization

Course: Agentic AI & GenAI


The MLSys 2026 FlashInfer AI Kernel Generation Contest is a compact map of where AI systems work is going.

It asks participants to create high-performance GPU kernels for modern LLM inference operations on NVIDIA Blackwell B200 GPUs.

The important part is not only the kernel work.

The contest explicitly welcomes:

expert-crafted seed kernels with agent-assisted evolution
fully agent-generated kernel solutions

That makes it a case study in:

GPU kernel optimization
  + LLM inference runtime internals
  + AI-assisted systems engineering

This is the layer below ordinary agent apps.

It is where agent workloads become:

  • Tensor Core instructions
  • HBM traffic
  • shared-memory layouts
  • sparse indexers
  • expert routing
  • register pressure
  • occupancy tradeoffs
  • benchmark win rates

Learning objectives

By the end of this lecture, you should be able to:

  1. Explain why the MLSys 2026 contest is an AI systems/runtime signal, not a normal hackathon.
  2. Identify the three competition tracks and the inference bottleneck each targets.
  3. Understand why NVIDIA Blackwell B200 changes the optimization target.
  4. Connect FP8 MoE, sparse attention, and Gated Delta Net kernels to modern LLM inference.
  5. Explain how FlashInfer-Bench evaluates correctness, speed, and win rate against baselines.
  6. Distinguish human-written, agent-assisted, and fully agent-generated kernel workflows.
  7. Design a kernel optimization loop with profiling, correctness checks, and benchmark discipline.
  8. Map this contest to career paths in AI runtime, GPU compiler, and accelerator software engineering.

1. What the contest is

The contest page describes the challenge as:

create optimized CUDA kernels for cutting-edge LLM operations
for NVIDIA Blackwell B200 GPUs

The evaluation platform is FlashInfer-Bench.

Submissions compete on:

  • correctness
  • speed
  • win rate against FlashInfer baselines

The contest tracks target three operations that matter in modern inference:

Track A:
  Fused MoE with FP8 support

Track B:
  DeepSeek Sparse Attention

Track C:
  Gated Delta Net

Participants can use:

  • CUDA
  • Triton
  • CuTe DSL
  • TileLang
  • cuTile
  • other kernel programming systems

That tool list is the real signal.

The frontier of inference engineering is not one API.

It is a stack of DSLs, compilers, profilers, runtimes, and benchmark harnesses.


2. Why this belongs in an agent course

Agentic AI is usually taught from the top:

prompts
tools
RAG
workflows
agent orchestration

But production agents create inference demand.

That demand becomes:

low latency
high throughput
long context
multi-user concurrency
tool-loop responsiveness
local or edge deployment
cost per generated token

Those requirements eventually land on kernels.

The contest sits at the bottom of the stack:

agent workload
  -> model server
  -> inference runtime
  -> attention / MoE / recurrent-state kernels
  -> GPU architecture

If you want to build serious agent infrastructure, you need to understand the lower layers.

Not every agent engineer writes CUDA.

But the best systems engineers know which kernel bottleneck they are paying for.


3. Track A: Fused FP8 MoE

Mixture-of-experts models activate only a subset of experts per token.

That saves compute, but it creates systems problems:

  • dynamic routing
  • irregular memory access
  • token-to-expert dispatch
  • expert load imbalance
  • small or fragmented GEMMs
  • scaling and reduction overhead
  • top-k expert selection
  • FP8 quantization and dequantization

The contest's Track A focuses on fused MoE kernels with FP8 support.

The optimization target:

routing + dispatch + GEMM + scaling + output assembly
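The first two stages of that pipeline, routing and dispatch, can be sketched in plain Python. This is a minimal illustration, not the contest's kernel spec: the function names, the router scores, and the imbalance metric are all assumptions chosen for the example.

```python
from collections import defaultdict

def route_top_k(router_logits, k):
    """Return, per token, the indices of its k highest-scoring experts."""
    routes = []
    for logits in router_logits:
        ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
        routes.append(ranked[:k])
    return routes

def group_by_expert(routes):
    """Invert token -> experts into expert -> tokens (the dispatch step)."""
    groups = defaultdict(list)
    for token, experts in enumerate(routes):
        for expert in experts:
            groups[expert].append(token)
    return dict(groups)

def imbalance(groups, num_experts):
    """Largest expert load divided by the mean load (1.0 means perfectly balanced)."""
    loads = [len(groups.get(e, [])) for e in range(num_experts)]
    mean = sum(loads) / num_experts
    return max(loads) / mean if mean else 0.0

# Hypothetical router scores: 4 tokens, 4 experts, top-2 routing.
logits = [
    [0.9, 0.1, 0.8, 0.0],
    [0.7, 0.2, 0.1, 0.6],
    [0.1, 0.9, 0.0, 0.8],
    [0.8, 0.0, 0.7, 0.1],
]
routes = route_top_k(logits, k=2)
groups = group_by_expert(routes)
print(groups, imbalance(groups, num_experts=4))
```

Each per-expert token list becomes one GEMM. An expert with one token gives a tiny GEMM, and an expert with most of the tokens sets the tail latency, which is why grouping and balance show up in the performance questions.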

The reason to fuse:

fewer kernel launches
less HBM traffic
better Tensor Core utilization
less intermediate materialization

Important performance questions:

Are tokens grouped efficiently by expert?
Are FP8 block scales loaded efficiently?
Are GEMMs large enough to use Tensor Cores well?
Is expert imbalance causing tail latency?
Are intermediate buffers hitting HBM unnecessarily?
Does fusion increase register pressure too much?
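For readers new to block-scaled FP8, the sketch below shows what a block scale is and why it has to travel with the data. It is a simplified simulation under stated assumptions: `fake_e4m3` only models the 3-bit mantissa and the 448 maximum of the E4M3 format (subnormals and the minimum exponent are ignored), and the block size and function names are hypothetical, not the contest's layout.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def fake_e4m3(x):
    """Round x to 1 implicit + 3 explicit mantissa bits and clamp to the E4M3 range.

    Simplification: subnormals and the minimum exponent are ignored.
    """
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(abs(x))   # abs(x) = mantissa * 2**exponent, mantissa in [0.5, 1)
    rounded = round(mantissa * 16) / 16       # keep 4 significant bits
    value = min(math.ldexp(rounded, exponent), E4M3_MAX)
    return math.copysign(value, x)

def quantize_blocks(values, block_size):
    """Return (quantized values, one scale per block)."""
    quantized, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / E4M3_MAX if amax > 0 else 1.0   # map the block's max onto the FP8 max
        scales.append(scale)
        quantized.extend(fake_e4m3(v / scale) for v in block)
    return quantized, scales

def dequantize_blocks(quantized, scales, block_size):
    """Multiply each value by the scale of the block it belongs to."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]

# Hypothetical weights: one block of small values, one block of large values.
weights = [0.01, -0.02, 0.03, 0.04, 100.0, -200.0, 300.0, 400.0]
q, scales = quantize_blocks(weights, block_size=4)
restored = dequantize_blocks(q, scales, block_size=4)
```

In a fused kernel, every tile of FP8 data needs its scale at the point where the accumulator is rescaled, so scale loads sit on the critical path of the GEMM epilogue.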

This connects directly to modern MoE models:

  • DeepSeek-style expert systems
  • Mixtral-style routing
  • Qwen MoE variants
  • small active-parameter models such as Lecture 40's ZAYA1-8B

MoE is not "free sparsity."

It is a routing and memory-layout problem.


4. Track B: Sparse Attention

Dense attention scales poorly with sequence length: prefill compute grows quadratically, and every decode step reads the whole KV cache.

The basic cost pattern:

full attention:
  every query attends over many keys
  KV cache traffic grows with context

Sparse attention reduces work by selecting a subset of relevant tokens or blocks.

The contest's Track B targets DeepSeek Sparse Attention with separate indexer and attention kernels.

That split matters:

indexer:
  choose which blocks/tokens matter

attention:
  compute attention over selected sparse structure
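The indexer/attention split can be sketched as two small functions. This is an illustration only: a real indexer uses its own lightweight scoring path, while this sketch reuses the query-key dot product for brevity, and all names and inputs are hypothetical.

```python
import math

def indexer_top_k(query, keys, k):
    """Indexer: score every key and keep the k best positions (sorted for locality)."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    best = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(best)

def sparse_attention(query, keys, values, selected):
    """Attention: scaled softmax over the selected positions only."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[i])) for i in selected]
    peak = max(logits)                                  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
            for d in range(dim)]

# Hypothetical data: 4 cached positions, head dimension 2, keep 2 positions.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [3.0, 2.0], [5.0, 5.0]]
query = [1.0, 0.0]
selected = indexer_top_k(query, keys, k=2)
out = sparse_attention(query, keys, values, selected)
```

The attention function reads `keys[i]` and `values[i]` through an index list. On a GPU that indirection is exactly where irregular KV reads and coalescing problems come from.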

Sparse attention bottlenecks:

  • top-k index generation
  • block sparse metadata
  • irregular KV reads
  • memory coalescing
  • cache locality
  • branch divergence
  • load balancing across queries
  • interaction with paged KV cache
  • FP8 data paths

Sparse attention is hard because it trades arithmetic for control flow and memory indirection.

The key question:

Does the sparsity save enough memory traffic and compute
to pay for indexer overhead and irregular access?
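A rough way to reason about that question is a traffic-only break-even model. The sketch below is an assumption-laden simplification: it ignores compute, and `keep_fraction`, `indexer_bytes`, and `irregular_penalty` are hypothetical inputs that must be measured for a real kernel.

```python
def sparse_pays_off(dense_kv_bytes, keep_fraction, indexer_bytes, irregular_penalty):
    """Compare modeled memory traffic of dense attention with indexer + sparse attention.

    irregular_penalty >= 1.0 models bandwidth wasted by uncoalesced, indirect reads.
    Returns (sparse wins?, sparse traffic as a fraction of dense traffic).
    """
    sparse_bytes = dense_kv_bytes * keep_fraction * irregular_penalty + indexer_bytes
    return sparse_bytes < dense_kv_bytes, sparse_bytes / dense_kv_bytes

# Hypothetical numbers: keeping 10% of the KV cache wins even with a 2x access penalty,
# but keeping 50% with the same penalty does not.
print(sparse_pays_off(1000, 0.1, 50, 2.0))
print(sparse_pays_off(1000, 0.5, 100, 2.0))
```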

This connects to:

  • Lecture 36: FP8 KV-cache
  • Lecture 37: trace-driven performance analysis
  • long-context agent sessions
  • DeepSeek-style inference systems

For long-context agents, sparse attention is a direct path to lower memory traffic.

But it must be profiled, not assumed.


5. Track C: Gated Delta Net

Gated Delta Net (GDN) is a sequence-modeling approach used in Qwen3-Next.

The contest includes decode and prefill kernels.

This matters because GDN-style systems represent a broader shift:

not every future long-context model will be standard dense attention

State-space, recurrent, delta-rule, and hybrid models try to reduce attention cost by maintaining compact state.

The GPU problems change:

  • state update kernels
  • recurrent dependency handling
  • scan-like computation
  • chunking
  • prefill/decode split
  • memory layout for persistent state
  • low-latency decode
  • high-throughput prefill
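To see why the state update dominates, here is a single decode step in plain Python. It assumes the gated delta rule as published for Gated DeltaNet, S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T with output o_t = S_t q_t. The contest's actual kernel spec (heads, normalization, layouts) may differ, and the function name is hypothetical.

```python
def gated_delta_step(state, q, k, v, alpha, beta):
    """One decode step: decay the state, erase what is stored under k, write v, read with q.

    state is a d_v x d_k matrix stored as a list of rows.
    """
    d_v, d_k = len(v), len(k)
    # S @ k: what the state currently returns for key k.
    recalled = [sum(state[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    new_state = [
        [alpha * (state[i][j] - beta * recalled[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    output = [sum(new_state[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return new_state, output

# With alpha = beta = 1 and a unit-norm key, a second write under the same key
# replaces the first value instead of adding to it.
state = [[0.0, 0.0], [0.0, 0.0]]
state, out1 = gated_delta_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0], alpha=1.0, beta=1.0)
state, out2 = gated_delta_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[5.0, 6.0], alpha=1.0, beta=1.0)
```

Each step needs the state produced by the previous one. Decode is therefore a small latency-bound update, while prefill has to break that dependency with chunking or scan-like formulations to keep the GPU busy.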

Track C is therefore not just "one more kernel."

It is a signal that inference runtimes must support model families beyond standard transformers.

For systems engineers, that means:

runtime design must be model-architecture aware

An inference runtime optimized only for dense attention may underperform on hybrid architectures.


6. Why Blackwell B200 matters

The contest targets NVIDIA Blackwell B200 GPUs.

That matters because kernel optimization is hardware-specific.

Questions change by architecture:

  • Tensor Core throughput
  • FP8/FP4 paths
  • memory bandwidth
  • shared memory capacity and behavior
  • register file pressure
  • scheduler behavior
  • warp-group MMA support
  • TMA or async copy behavior
  • occupancy tradeoffs

Optimizing for B200 is not the same as optimizing for A100 or H100.

A kernel that wins on one generation may lose on another because:

  • the compute/memory balance changes
  • Tensor Core instruction choices change
  • memory hierarchy changes
  • compiler lowering changes
  • occupancy limits change

That is why official evaluation on bare metal matters.

The contest page notes that Modal scores are reference-only because clock frequency cannot be locked, while official evaluations run on bare-metal machines.

This is a good performance-engineering rule:

cloud convenience is useful for iteration
bare metal is needed for final claims

7. FlashInfer-Bench

FlashInfer-Bench is the contest evaluation platform.

The benchmark model:

kernel spec
  -> candidate kernel
  -> correctness tests
  -> speed measurement
  -> comparison against FlashInfer baseline
  -> win rate

This is the right structure for kernel work.

A kernel is not useful because it compiles.

It must:

  • match numerical expectations
  • handle shape variations
  • beat or match a strong baseline
  • avoid pathological cases
  • be reproducible
  • be explainable in a writeup
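A minimal sketch of how correctness, speed, and win rate can combine into one result. The scoring rule here is an assumption for illustration, not the official FlashInfer-Bench formula, and the record fields are hypothetical.

```python
from statistics import geometric_mean

def score_against_baseline(results):
    """results: list of dicts with 'correct', 'candidate_ms', 'baseline_ms' per workload.

    A workload counts as a win only if the candidate is correct and strictly faster.
    """
    wins, speedups = 0, []
    for r in results:
        if not r["correct"]:
            continue  # an incorrect kernel earns nothing, however fast it is
        speedup = r["baseline_ms"] / r["candidate_ms"]
        speedups.append(speedup)
        wins += speedup > 1.0
    return {
        "win_rate": wins / len(results),
        "geomean_speedup": geometric_mean(speedups) if speedups else 0.0,
    }

# Hypothetical measurements over four workloads.
results = [
    {"correct": True, "candidate_ms": 1.0, "baseline_ms": 2.0},
    {"correct": True, "candidate_ms": 2.0, "baseline_ms": 1.0},
    {"correct": False, "candidate_ms": 0.1, "baseline_ms": 1.0},
    {"correct": True, "candidate_ms": 1.0, "baseline_ms": 4.0},
]
summary = score_against_baseline(results)
```

The third workload is the fastest run in the set and still scores zero, which is the point of gating speed on correctness.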

The contest requires tagged GitHub commits for evaluation and a technical report.

That encourages the correct discipline:

code
  -> benchmark
  -> profile
  -> explain
  -> reproduce

8. Human plus agent kernel generation

The contest explicitly allows two approaches:

expert-crafted seed kernels with agent-assisted evolution
fully agent-generated solutions

These are different workflows.

Human plus agent

The human writes:

  • baseline kernel structure
  • tiling strategy
  • memory layout
  • correctness harness
  • profiling plan

The agent helps:

  • generate variants
  • tune constants
  • try layout changes
  • refactor code
  • run benchmarks
  • summarize profiling results

Fully agent-generated

The agent owns more of the loop:

  • read kernel spec
  • generate code
  • run tests
  • benchmark
  • mutate kernel
  • select winners
  • produce reproducibility scripts

The contest page requires agent solutions to open-source scripts that reproduce kernels.

That is important.

For AI-generated systems code, the artifact is not only the final kernel.

The artifact is also:

the generation process

That process must be reproducible.


9. The optimization loop

A serious kernel optimization loop looks like this:

1. Read the kernel spec.
2. Implement a correct baseline.
3. Build correctness tests.
4. Benchmark against the reference.
5. Profile the bottleneck.
6. Generate one controlled optimization variant.
7. Re-test correctness.
8. Re-benchmark.
9. Record the delta.
10. Repeat.

Do not change five things at once.

Kernel work is full of traps:

  • a faster kernel may be numerically wrong
  • a change may help one shape and hurt another
  • a local benchmark may not match official clocks
  • register pressure may erase fusion gains
  • memory coalescing may be broken by a layout change
  • branch divergence may dominate sparse kernels

The right unit of progress:

one hypothesis
one variant
one measurement
one retained or rejected change
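That unit of progress can be encoded as a small record plus a decision rule. This is a sketch under assumptions: the `Trial` fields, the 2% threshold, and the use of the median are illustrative choices, not contest requirements.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Trial:
    hypothesis: str            # the single change being tested
    passed_correctness: bool
    baseline_ms: list          # repeated timings of the current best kernel
    variant_ms: list           # repeated timings of the one changed variant

def decide(trial, min_gain=0.02):
    """Keep the variant only if it is correct and its median latency improves by min_gain."""
    if not trial.passed_correctness:
        return "reject: incorrect"
    base, new = median(trial.baseline_ms), median(trial.variant_ms)
    gain = (base - new) / base
    return f"keep: {gain:.1%} faster" if gain >= min_gain else f"reject: {gain:.1%} gain"

# Hypothetical trial: one hypothesis, one variant, repeated measurements.
trial = Trial("vectorize scale loads", True, [1.0, 1.1, 1.05], [0.9, 0.95, 0.92])
print(trial.hypothesis, "->", decide(trial))
```

Storing the raw timing lists rather than a single number makes it possible to re-check a decision later, for example when official clocks differ from local ones.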

10. Profiling checklist

Use Nsight Compute, Nsight Systems, FlashInfer-Bench outputs, and trace tools where appropriate.

Measure:

  • achieved occupancy
  • register count
  • shared memory usage
  • Tensor Core utilization
  • memory throughput
  • L2 hit rate
  • global load/store efficiency
  • warp divergence
  • instruction mix
  • kernel launch overhead
  • achieved TFLOPS or effective bandwidth
  • latency distribution across shapes

Interpretation examples:

low Tensor Core utilization:
  tiling, dtype path, or GEMM shape may be poor

high HBM bandwidth with low compute:
  memory-bound kernel; optimize layout and traffic

high register pressure:
  fusion or unrolling may be too aggressive

high divergence:
  sparse indexing or routing path may be unbalanced

good average but bad tail:
  expert imbalance or sparse metadata distribution may matter
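The memory-bound versus compute-bound reading can be checked with a simple roofline calculation. The sketch below takes the hardware peaks as arguments. The example values are placeholders, not B200 specifications, and the function name is hypothetical.

```python
def classify_roofline(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Return (bound, attainable FLOP/s) from a simple roofline model."""
    intensity = flops / bytes_moved                  # FLOPs per byte of HBM traffic
    ridge = peak_flops_per_s / peak_bytes_per_s      # intensity where the two roofs meet
    attainable = min(peak_flops_per_s, intensity * peak_bytes_per_s)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    return bound, attainable

# Placeholder peaks: 1000 FLOP/s and 100 B/s, so the ridge point is 10 FLOP/B.
print(classify_roofline(flops=200, bytes_moved=100, peak_flops_per_s=1000, peak_bytes_per_s=100))
print(classify_roofline(flops=5000, bytes_moved=100, peak_flops_per_s=1000, peak_bytes_per_s=100))
```

If a kernel lands well below the attainable number for its intensity, the gap is the optimization budget. If it is already near the memory roof, only reducing traffic will help.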

This connects to Lecture 37:

performance claims need trace and profiling evidence

11. Toolchain choices

The contest allows multiple implementation paths.

Do not treat them as interchangeable.

Each tool sits at a different abstraction level:

FlashInfer:
  LLM inference kernel/runtime library and benchmark baseline

CUDA:
  lowest-level mainstream NVIDIA GPU programming interface

Triton:
  Python-like GPU kernel DSL optimized for fast iteration

TileLang:
  tile-level kernel DSL for composable AI kernels

CuTe DSL:
  Python DSL around CUTLASS/CuTe layout and tensor abstractions

cuTile:
  NVIDIA CUDA Tile Python DSL targeting Tile IR and portable tensor-core kernels

OpenEvolve:
  evolutionary coding agent for automated program optimization

Modal:
  serverless GPU/cloud execution environment for iteration and reference runs

Blackwell B200:
  target hardware that determines what "fast" actually means

The practical rule:

choose the tool that gives you the fastest correct iteration
for the bottleneck you actually have

But for this contest, the real skill is knowing when to drop down a level.


11.1 FlashInfer

FlashInfer is the center of gravity for this contest.

It is an LLM serving kernel library focused on high-performance inference primitives:

  • attention
  • paged KV-cache operations
  • sampling
  • normalization
  • MoE-related kernels
  • quantization-aware serving paths
  • decode and prefill helpers

In the contest, FlashInfer plays three roles:

reference implementation:
  the baseline your kernel must beat

benchmark environment:
  FlashInfer-Bench measures correctness and speed

runtime context:
  the operations are real LLM serving bottlenecks, not toy kernels

Why this matters:

If you beat a naive PyTorch baseline, that proves little.

If you beat FlashInfer on a real inference primitive,
that is a meaningful systems result.

What to study in FlashInfer:

  • API shape for decode and prefill kernels
  • tensor layout conventions
  • page/block abstractions for KV cache
  • supported dtypes and quantization paths
  • baseline kernel behavior
  • benchmark input distributions
  • numerical tolerances

Common mistake:

optimizing against the wrong mental model

FlashInfer kernels are already specialized. You need to understand what the baseline is doing before assuming an optimization opportunity exists.

For example, if a sparse attention kernel is slow, the bottleneck may not be the attention math. It may be:

  • sparse index generation
  • metadata reads
  • poor memory coalescing
  • shape-specific occupancy
  • scale loading
  • page table indirection

FlashInfer is where LLM theory becomes concrete runtime layout.


11.2 CUDA

CUDA is the most direct and controllable path.

Use CUDA when you need:

  • explicit thread/block mapping
  • warp-level control
  • shared memory control
  • vectorized global memory access
  • explicit synchronization
  • custom Tensor Core instruction paths
  • fine-grained launch configuration
  • maximum control over register pressure and occupancy

CUDA is the right tool when the kernel is blocked by details that a higher-level DSL does not expose.

Examples:

FP8 MoE:
  custom token grouping, expert dispatch, scale loading, epilogue fusion

Sparse attention:
  custom sparse metadata traversal and memory coalescing

Gated Delta Net:
  specialized state update and decode loop scheduling

CUDA gives you control, but it also gives you enough rope to hang yourself.

Common CUDA failure modes:

  • out-of-bounds memory access
  • bank conflicts
  • uncoalesced loads
  • excessive register use
  • low occupancy
  • branch divergence
  • poor Tensor Core utilization
  • excessive synchronization
  • numerically wrong accumulation
  • shape-specific regressions
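The link between register use and low occupancy in that list can be estimated before profiling. This is a simplified sketch: it ignores register and shared-memory allocation granularity, and the `limits` values in the example are hypothetical, so look up the real per-SM limits for your compute capability in the CUDA documentation.

```python
def theoretical_occupancy(regs_per_thread, threads_per_block, smem_per_block, limits):
    """Fraction of an SM's thread slots that can be resident, and the limiting resource.

    Simplification: ignores register and shared-memory allocation granularity.
    """
    candidates = {
        "threads": limits["max_threads_per_sm"] // threads_per_block,
        "blocks": limits["max_blocks_per_sm"],
        "registers": limits["regs_per_sm"] // (regs_per_thread * threads_per_block),
    }
    if smem_per_block > 0:
        candidates["shared memory"] = limits["smem_per_sm"] // smem_per_block
    limiter = min(candidates, key=candidates.get)    # resource that allows the fewest blocks
    resident_threads = candidates[limiter] * threads_per_block
    return resident_threads / limits["max_threads_per_sm"], limiter

# Hypothetical per-SM limits, used only to make the example runnable.
limits = {"max_threads_per_sm": 2048, "max_blocks_per_sm": 32,
          "regs_per_sm": 65536, "smem_per_sm": 200_000}
print(theoretical_occupancy(128, 256, 0, limits))   # heavy register use
print(theoretical_occupancy(32, 256, 0, limits))    # light register use
```

Low occupancy is not automatically bad: a kernel with few resident warps can still be fast if each warp keeps Tensor Cores fed. The estimate is a prompt for profiling, not a verdict.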

CUDA debugging discipline:

1. correctness first
2. one optimization at a time
3. inspect generated SASS/PTX when needed
4. profile register and shared-memory usage and the resulting occupancy
5. test multiple shapes, not one cherry-picked shape

Use CUDA when you need a hand-tuned final kernel or when you need to understand exactly why a higher-level kernel is losing.


11.3 Triton

Triton is a Python-like language for writing GPU kernels.

It is valuable because it shortens the edit-test-profile loop.

Use Triton when you need:

  • fast prototyping
  • parameterized kernels
  • autotuning
  • Python-native iteration
  • easier agent-generated variants
  • compact expression of tile-level math

Triton is often a good first implementation path for:

  • elementwise fusion
  • reductions
  • small GEMM-like kernels
  • custom attention prototypes
  • shape-specialized kernels

Why agents like Triton:

less boilerplate than CUDA
Python syntax
shorter kernels
faster mutation loop
easier benchmark automation

Where Triton can struggle:

  • extremely irregular sparse access
  • kernels that need full control over warp-level primitives
  • newest NVIDIA architecture features before compiler support catches up
  • complex Tensor Core scheduling
  • cases where generated code choices are opaque

Triton is not "slower CUDA."

It is a different abstraction boundary.

For agentic kernel search, a practical path is:

Triton prototype
  -> find algorithm and tiling idea
  -> benchmark shapes
  -> port hot winner to CUDA or CuTe if deeper control is needed

11.4 TileLang

TileLang is a tile-oriented DSL for writing high-performance AI kernels.

The key abstraction is that you describe computation in terms of tiles, rather than manually managing every thread-level operation.

This is useful because AI kernels usually have tiled structure:

  • matrix multiplication
  • attention blocks
  • reductions
  • normalization
  • block sparse compute
  • fused epilogues

Use TileLang when you want:

  • a higher-level tiled programming model
  • composable kernel descriptions
  • faster search over tile sizes and schedules
  • kernels that are easier for agents to mutate than low-level CUDA

The mental model:

CUDA:
  think threads, warps, shared memory, synchronization

TileLang:
  think tiles, loops over tiles, memory scopes, schedule choices
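The tile-level mental model can be illustrated without any GPU library. The sketch below is plain Python, not TileLang syntax: it only shows the loop structure that tile DSLs let you express directly, and the function name and tile size are hypothetical.

```python
def matmul_tiled(a, b, tile):
    """C = A @ B computed tile by tile; each (i0, j0) output tile maps to one thread block."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):              # loop over output tile rows
        for j0 in range(0, n, tile):          # loop over output tile columns
            for k0 in range(0, k, tile):      # march along the reduction dimension
                # On a GPU, the A and B tiles would be staged in shared memory here.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = 0.0
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] += acc
    return c

# Small hypothetical example: a 3x2 matrix times a 2x3 matrix with 2x2 tiles.
a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
c = matmul_tiled(a, b, tile=2)
```

The three outer loops are the schedule, and the three inner loops are the per-tile math. A tile DSL lets you or an agent change the outer part (tile sizes, loop order, staging) without rewriting the inner part.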

Why this matters for the contest:

agent-generated kernels benefit from abstractions
that reduce the number of ways to write invalid code

TileLang can be a better target for automated exploration because the search space is closer to the math.

But you still need profiling.

A tile abstraction can hide:

  • bad memory layout
  • excessive register use
  • poor generated code
  • compiler limitations
  • shape-specific scheduling failures

Use TileLang as a rapid kernel-generation layer, then validate with FlashInfer-Bench and Nsight.


11.5 CuTe DSL

CuTe DSL is NVIDIA's Python DSL around CUTLASS/CuTe concepts.

CuTe is fundamentally about tensor layouts and tiled tensor algebra.

This matters because many high-performance kernels are layout problems.

A good kernel is not only:

do the right math

It is:

map the math to hardware-friendly tiled layouts

CuTe-style thinking emphasizes:

  • layout composition
  • tile shapes
  • memory hierarchy
  • tensor views
  • copy atoms
  • MMA atoms
  • pipeline stages
  • warp and warpgroup organization

Use CuTe DSL when:

  • the kernel is GEMM-like
  • Tensor Core utilization is central
  • layouts are complex
  • you need CUTLASS-grade abstractions without raw C++ template pain
  • you care about SM90/SM100-style features and structured tiling

Cost:

  • steep learning curve
  • layout algebra is unforgiving
  • compiler/runtime stack maturity matters
  • debugging requires understanding generated lower-level behavior

Why it matters for Blackwell:

Blackwell optimization is heavily about feeding tensor cores
and moving data through the memory hierarchy correctly.

CuTe DSL gives you a way to express that more directly than Triton in some cases, while staying higher level than hand-written CUDA.

For contest work, CuTe DSL is most relevant to Track A FP8 MoE and any GEMM-like subproblem.


11.6 cuTile

cuTile is NVIDIA's Python implementation of the CUDA Tile programming model.

The official docs describe cuTile as a Python-based DSL where kernels operate on tiles, using functions such as:

  • ct.load
  • ct.store
  • tile arithmetic
  • reductions
  • matrix multiply
  • ct.launch

Important distinction:

arrays:
  global-memory objects passed from host

tiles:
  immutable kernel-local tensor-like values with compile-time shapes

This is a major abstraction shift.

Instead of writing:

thread i loads element i

you write:

load this tile
operate on this tile
store this tile

Why cuTile matters:

  • it targets NVIDIA's tile programming model
  • it aims to expose hardware features through tile abstractions
  • it can be easier to modify than deep CUDA/CUTLASS templates
  • it is aligned with agent-assisted kernel translation work

This connects directly to Lecture 35, where cuTile Python to cuTile.jl translation was used as a concrete Agent Skills example.

For the MLSys contest:

cuTile may be useful when you want tile-level expression
without writing full CUDA boilerplate.

But treat it as a young, hardware-sensitive toolchain.

Validate:

  • supported GPU architecture
  • CUDA Toolkit version
  • generated kernel performance
  • limitations for your target op
  • debugging workflow

11.7 OpenEvolve

OpenEvolve is not a GPU kernel language.

It is an evolutionary coding agent framework.

Its role in this contest is search.

A normal kernel workflow:

human writes variant
human benchmarks variant
human reads result
human writes next variant

An OpenEvolve-style workflow:

population of candidate kernels
  -> correctness filter
  -> benchmark score
  -> select winners
  -> mutate/crossover/generate new candidates
  -> repeat
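The loop above can be sketched generically. This is not OpenEvolve's API. It is a toy select-and-mutate loop with hypothetical stand-ins: a candidate is a (tile_m, tile_n) pair, the validity check plays the role of the correctness filter, and the score is a synthetic latency.

```python
import random

def evolve(score, is_valid, seed_population, mutate, generations, keep, rng):
    """Select-and-mutate loop: lower score is better; invalid candidates are dropped."""
    population = [c for c in seed_population if is_valid(c)]
    for _ in range(generations):
        survivors = sorted(population, key=score)[:keep]          # select winners
        children = [mutate(parent, rng) for parent in survivors]  # generate new candidates
        population = survivors + [c for c in children if is_valid(c)]
    return min(population, key=score)

# Toy stand-ins (all hypothetical).
def toy_score(c):
    return abs(c[0] - 128) + abs(c[1] - 64)       # synthetic latency, best at (128, 64)

def toy_valid(c):
    return 16 <= c[0] <= 256 and 16 <= c[1] <= 256

def toy_mutate(c, rng):
    step = rng.choice([-16, 16])
    return (c[0] + step, c[1]) if rng.random() < 0.5 else (c[0], c[1] + step)

best = evolve(toy_score, toy_valid, [(32, 32), (64, 128)], toy_mutate,
              generations=50, keep=2, rng=random.Random(0))
```

Because survivors are carried into the next generation, the best score never gets worse. In a real setup, `score` would launch the benchmark and `is_valid` would run the correctness tests, which is where almost all the wall-clock time goes.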

Why this fits GPU kernels:

  • performance landscapes are rugged
  • small code changes can produce large speed changes
  • many variants fail correctness
  • many optimizations are shape-specific
  • humans cannot manually explore the full schedule space

What OpenEvolve needs to be useful:

  • deterministic benchmark command
  • fast correctness check
  • objective score
  • saved artifacts
  • mutation boundaries
  • timeout policy
  • rollback policy
  • result database

For kernel optimization, the fitness function should not be just:

fastest single run

It should include:

  • correctness
  • median latency
  • variance
  • shape coverage
  • compile success
  • code size or complexity
  • no forbidden APIs
  • no benchmark cheating
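One way to fold several of those terms into a single number is shown below. The specific combination is an assumption for illustration (correctness and coverage as hard gates, a noise penalty, worst-shape scoring); the names and record layout are hypothetical.

```python
from statistics import median, pstdev

def fitness(runs_by_shape, required_shapes, baseline_ms):
    """runs_by_shape: {shape: {"correct": bool, "latencies_ms": [...]}}.

    Higher is better; 0.0 disqualifies the candidate.
    """
    if any(shape not in runs_by_shape for shape in required_shapes):
        return 0.0                                   # missing shape coverage
    ratios = []
    for shape in required_shapes:
        run = runs_by_shape[shape]
        if not run["correct"]:
            return 0.0                               # correctness is a gate, not a weighted term
        lat = run["latencies_ms"]
        noise = pstdev(lat) / median(lat)            # penalize unstable timings
        ratios.append(baseline_ms[shape] / (median(lat) * (1.0 + noise)))
    return min(ratios)                               # score by the worst shape, not the best

# Hypothetical runs over two shapes.
runs = {
    "small": {"correct": True, "latencies_ms": [1.0, 1.0, 1.0]},
    "large": {"correct": True, "latencies_ms": [2.0, 2.0, 2.0]},
}
baseline = {"small": 2.0, "large": 2.0}
print(fitness(runs, ["small", "large"], baseline))
```

Scoring by the worst shape removes the incentive to specialize for one favorable input, which is the most common way an automated search ends up cherry-picking.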

This is where the contest's fully agent-generated approach becomes serious.

The best agent is not the one that writes the prettiest CUDA.

It is the one that can run a disciplined generate-test-profile-select loop.


11.8 Modal

Modal is a serverless cloud platform for running compute-intensive workloads.

In this contest context, Modal is useful for iteration and reference execution.

Use Modal for:

  • repeatable containerized benchmark runs
  • GPU access without owning hardware
  • quick starter-kit experiments
  • parallel search jobs
  • artifact collection
  • CI-like evaluation pipelines

But the contest page explicitly warns that Modal scores are reference-only because GPU clocks cannot be locked.

That means:

Modal is good for iteration.
Bare metal is required for official performance claims.

Good Modal workflow:

1. package benchmark container
2. run correctness tests
3. run rough performance screen
4. collect logs and artifacts
5. promote promising candidates to bare-metal validation

Do not overfit to Modal timing noise.

Use it to reduce the number of bad candidates, not to prove final speed.


11.9 Blackwell B200

Blackwell B200 is the target hardware.

That determines what optimizations matter.

NVIDIA's Blackwell tuning guide identifies B200 as compute capability 10.0 and describes the architecture as targeting generative AI and accelerated computing.

The practical B200 concerns for this contest:

  • FP8 and lower-precision Tensor Core paths
  • high HBM bandwidth
  • large memory capacity
  • Blackwell-specific scheduling behavior
  • architecture-specific compiler lowering
  • Tensor Core feeding and pipeline design
  • shared-memory/L1 behavior
  • register pressure and occupancy balance
  • support for newer CUDA features

Do not assume Hopper instincts transfer perfectly.

Questions to ask on B200:

Does this kernel use the right dtype path?
Does it keep Tensor Cores busy?
Is it memory bandwidth bound?
Is shared memory helping or hurting?
Is register pressure limiting occupancy?
Does the compiler generate Blackwell-appropriate instructions?
How does performance differ between B200 and H100?

For Track A, Blackwell matters because FP8 MoE wants strong Tensor Core utilization and efficient scale handling.

For Track B, Blackwell matters because sparse attention may be memory and metadata bound.

For Track C, Blackwell matters because recurrent-state kernels may stress different latency and memory paths than dense attention.

The key lesson:

hardware generation is part of the algorithm.

An inference kernel is not just math. It is math mapped onto a specific machine.


11.10 Tool selection matrix

Use this matrix as a starting point.

Need                                      | Best first tool | Why
------------------------------------------+-----------------+-------------------------------------------------------------
Understand contest baseline               | FlashInfer      | It defines the reference runtime and benchmark target
Maximum low-level control                 | CUDA            | Explicit control over threads, memory, and synchronization
Fast prototype and autotune               | Triton          | Shorter Python-like kernels and fast iteration
Tiled AI kernel exploration               | TileLang        | Tile-level abstraction for AI workloads
Tensor Core / layout-heavy GEMM-like work | CuTe DSL        | Strong layout and tiled tensor abstractions
Pythonic CUDA Tile experiments            | cuTile          | Tile IR-oriented Python DSL for NVIDIA GPUs
Automated variant search                  | OpenEvolve      | Evolutionary loop around code generation and benchmark scores
Cloud iteration                           | Modal           | Convenient GPU jobs and artifact collection
Final performance claim                   | Bare-metal B200 | Official target with controlled clocks and reproducibility

The strongest workflow combines tools:

FlashInfer-Bench:
  tells you whether you are winning

Triton / TileLang / cuTile:
  help explore algorithmic variants quickly

CUDA / CuTe DSL:
  help turn the winning idea into a hardware-tuned kernel

OpenEvolve:
  scales search over variants

Modal:
  scales early experimentation

Bare-metal B200:
  validates the final result

12. Agentic kernel optimization workflow

A useful AI-assisted workflow:

human:
  defines benchmark target and constraints

agent:
  reads spec and prior results
  proposes variants
  edits one kernel at a time
  runs correctness tests
  runs benchmark
  records deltas
  rejects bad variants

human:
  reviews profiling evidence
  adjusts search direction

Required guardrails:

  • deterministic benchmark scripts
  • strict correctness tests
  • one-variant-at-a-time discipline
  • no silent benchmark cherry-picking
  • raw results stored
  • code diffs reviewed
  • shape coverage maintained

This is where agent skills matter.

A good kernel-optimization skill should encode:

  • profiling checklist
  • common CUDA failure modes
  • benchmark protocol
  • correctness rules
  • allowed search space
  • reporting template

Then use Lecture 39's skill eval pattern to test whether the skill improves kernel work.


13. What to learn before competing

Core prerequisites:

GPU architecture:
  SMs, warps, occupancy, memory hierarchy, Tensor Cores

CUDA programming:
  blocks, threads, shared memory, synchronization, vectorized loads

LLM inference:
  prefill, decode, KV cache, paged attention, MoE routing

Numerics:
  FP8 formats, block scaling, accumulation, error tolerances

Profiling:
  Nsight Systems, Nsight Compute, benchmark discipline

Runtime systems:
  FlashInfer, vLLM, TensorRT-LLM concepts

Agent workflows:
  reproducible code generation, patching, test loops

If you lack these, start with a smaller kernel:

  • vector add
  • layernorm
  • small GEMM
  • attention score kernel
  • top-k indexer

Then move toward the contest kernels.


14. Career signal

The contest points toward roles such as:

  • AI runtime engineer
  • GPU kernel engineer
  • inference infrastructure engineer
  • GPU compiler engineer
  • accelerator software engineer
  • AI systems researcher
  • autonomous optimization systems engineer

These roles sit between:

model architecture
hardware architecture
compiler/runtime software
production inference serving

They are rarer than AI application roles because they require:

  • low-level performance instincts
  • ML workload knowledge
  • systems debugging
  • mathematical numerics
  • hardware awareness
  • benchmark integrity

If you want to move from AI application work into AI systems work, this contest is a strong practice target.


15. How this maps to this roadmap

Relevant earlier lectures:

  • Lecture 32: LLM internals and inference mechanics
  • Lecture 35: agent skills for GPU kernel translation
  • Lecture 36: FP8 KV-cache and attention quantization
  • Lecture 37: TraceLens and trace-driven performance analysis
  • Lecture 38: AutoSP and compiler-generated sequence parallelism
  • Lecture 40: small MoE reasoning model deployment tradeoffs
  • Lecture 42: durable agent harnesses for tool-oriented long-running work

The contest combines all of them:

kernel optimization:
  low-level GPU work

LLM inference:
  real model bottlenecks

agent generation:
  automated search and code synthesis

benchmarking:
  correctness and performance evidence

runtime thinking:
  kernels as part of the serving stack

This is a bridge from agent engineering to AI hardware/software co-design.


Mini-lab: build a contest preparation plan

Pick one track:

Track A: FP8 MoE
Track B: Sparse Attention
Track C: Gated Delta Net

Write a preparation plan:

Track:
Target kernel:
Hardware:
Baseline:
Correctness tests:
Benchmark command:
Profiler:
First bottleneck hypothesis:
First three variants:
Expected risk:
Acceptance threshold:
Writeup evidence:
Agent role:
Human review points:

Then write a one-week schedule:

Day 1:
  reproduce starter kit

Day 2:
  understand baseline and shapes

Day 3:
  profile bottleneck

Day 4:
  implement first variant

Day 5:
  benchmark and reject/keep

Day 6:
  agent-assisted search

Day 7:
  write results and next plan

The goal is not to win immediately.

The goal is to build a disciplined kernel optimization loop.


Key takeaways

  • The MLSys 2026 FlashInfer contest is about real LLM inference kernels on NVIDIA Blackwell B200 GPUs.
  • The tracks target high-value inference bottlenecks: FP8 MoE, sparse attention, and Gated Delta Net.
  • MoE optimization is a routing, memory-layout, Tensor Core, and fusion problem.
  • Sparse attention trades dense compute for indexing and irregular memory access.
  • Gated Delta Net points toward alternatives to the standard transformer and toward recurrent-state inference.
  • FlashInfer-Bench emphasizes correctness, speed, and win rate against strong baselines.
  • CUDA gives maximum control; Triton, TileLang, CuTe DSL, and cuTile trade some control for faster iteration and higher-level tiled abstractions.
  • OpenEvolve-style agents are useful when correctness and benchmark scripts define a reliable evolutionary search loop.
  • Modal is useful for iteration, but final claims need controlled bare-metal B200 measurement.
  • Blackwell B200 changes the optimization target; do not assume Hopper-tuned kernels transfer unchanged.
  • Agent-generated kernels must be reproducible, not just fast once.
  • Serious kernel work requires hypothesis-driven profiling and benchmark integrity.
  • This contest is a strong career signal for AI runtime, GPU compiler, and inference infrastructure engineering.
