7. ML Systems Engineering (Phase 5)¶
Track 5G · ML Systems Engineering
Build the runtimes, distributed systems, kernels, schedulers, and infrastructure that make modern AI workloads train and serve reliably.
Layer mapping: L3-L8. This track connects model math, CUDA kernels, runtime scheduling, distributed communication, cluster orchestration, compiler/runtime work, serving systems, and observability.
Role targets: ML Systems Engineer · AI Infrastructure Engineer · Inference Systems Engineer · Training Systems Engineer · GPU Runtime Engineer · Edge AI Runtime Engineer
Prerequisites: Operating Systems, C++ and Parallel Computing, Neural Networks, Deep Learning Frameworks, one Phase 4 deployment path, and enough GPU Infrastructure to run and debug GPU benchmarks.
What comes after: a public systems artifact: inference runtime, distributed training runbook, CUDA/Triton kernel benchmark suite, scheduler prototype, compiler/runtime demo, or edge MLSys case study with reproducible measurements.
Course Contract¶
This is not a course about using ML frameworks as black boxes. It is a course about the systems below those frameworks.
By the end, you should be able to:
- trace a transformer through tensor shapes, memory traffic, kernel launches, and communication calls
- separate prefill, decode, training forward pass, backward pass, optimizer state, KV cache, and scheduler overhead
- build a small inference runtime with batching, streaming, cancellation, and backpressure
- run and profile DDP/FSDP/ZeRO training experiments instead of only reading about them
- write and benchmark at least one CUDA or Triton kernel used by a transformer path
- explain when a workload is compute-bound, memory-bound, communication-bound, or scheduler-bound
- use profiler traces, counters, and logs as the default source of truth
- ship a reproducible benchmark that another engineer can run
The point is not to become a generic model trainer. The point is to become the engineer who knows why a model is slow, expensive, unstable, underutilized, or hard to deploy.
Why This Belongs In An AI Hardware Roadmap¶
Hardware is only useful when the software stack can feed it.
MLSys is where workload reality meets hardware reality:
model architecture
-> framework graph
-> runtime scheduler
-> memory planner
-> kernels
-> collectives
-> cluster fabric
-> observability
-> production behavior
If you want to design accelerators, edge AI platforms, or AI infrastructure, you need to understand what the workload actually asks the machine to do:
- how many bytes the KV cache consumes
- how attention changes with sequence length
- why decode often becomes memory-bandwidth limited
- why training stalls on all-reduce or activation memory
- why one extra synchronization point can destroy scaling
- why a scheduler can make a fast kernel look slow
- why a topology mismatch can break distributed inference
MLSys is the discipline of finding those constraints, measuring them, and changing the system so the hardware does useful work more often.
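To make the first question in the list above concrete, here is a minimal sketch of the usual KV-cache size estimate. The function name and the example dimensions (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are illustrative assumptions, not measurements of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Estimate KV-cache size.

    Each token stores one K vector and one V vector (the factor of 2)
    per layer and per KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical model shape, chosen only to show the arithmetic.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch_size=1)
total = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=16)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total / 2**30:.1f} GiB for 16 sequences of 4096 tokens")
```

The estimate grows linearly in both sequence length and batch size, which is why long-context serving tends to run out of memory before it runs out of compute.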
The Core Mental Model¶
Most MLSys work reduces one of five costs:
For inference:
For training:
For memory:
M_training = M_weights + M_activations + M_gradients + M_optimizer + M_workspace
M_inference = M_weights + M_kv_cache + M_workspace + M_batch_state
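The training equation above can be turned into a rough calculator. This sketch assumes one common mixed-precision Adam setup: 2-byte weights, 2-byte gradients, and 12 bytes of optimizer state per parameter (FP32 master weights plus two FP32 moments). Those byte counts and the function names are assumptions; activation and workspace sizes depend on the model and batch, so they are passed in as measured values.

```python
GIB = 2**30


def training_state_bytes(num_params, weight_bytes=2, grad_bytes=2,
                         optimizer_bytes=12):
    """Per-parameter state: weights, gradients, and optimizer state."""
    return {
        "weights": num_params * weight_bytes,
        "gradients": num_params * grad_bytes,
        "optimizer": num_params * optimizer_bytes,
    }


def training_memory_bytes(num_params, activation_bytes, workspace_bytes):
    """M_training = M_weights + M_gradients + M_optimizer
                    + M_activations + M_workspace."""
    state = training_state_bytes(num_params)
    return sum(state.values()) + activation_bytes + workspace_bytes


# Hypothetical 1B-parameter model.
state = training_state_bytes(1_000_000_000)
for name, size in state.items():
    print(f"{name:>10}: {size / GIB:.2f} GiB")
```

Under these assumptions the optimizer state is several times larger than the weights, which is the motivation for sharding it in ZeRO and FSDP.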
For distributed scaling:
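One common way to quantify this, given here as an assumption rather than the course's own definition, is scaling efficiency relative to a single-device baseline, where $X(N)$ is the measured throughput on $N$ GPUs:

$$
E(N) = \frac{X(N)}{N \cdot X(1)}
$$

An efficiency of 1 means perfect linear scaling. Communication, synchronization, and stragglers push it below 1.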
Good MLSys engineering starts with a hypothesis and ends with measurement:
Avoid vague claims like "this is faster." Use concrete claims:
- p95 TTFT dropped from 420 ms to 260 ms at 32 concurrent requests
- decode throughput increased from 38 to 51 tok/s after KV-cache compaction
- all-reduce time fell from 31 percent to 18 percent of step time after bucket tuning
- peak training memory fell by 42 percent after activation checkpointing
- GPU utilization improved because queue starvation was fixed, not because kernels changed
Course Map¶
| Stage | Focus | Core artifact |
|---|---|---|
| 0 | Measurement discipline | benchmark harness and profiling template |
| 1 | Systems runtime foundations | async inference-like server with backpressure |
| 2 | Transformer execution internals | transformer shape, memory, and throughput report |
| 3 | GPU kernels and CUDA performance | CUDA/Triton kernel benchmark with roofline notes |
| 4 | Inference serving systems | continuous batching or paged KV-cache prototype |
| 5 | Distributed training systems | DDP/FSDP/ZeRO benchmark and bottleneck report |
| 6 | AI infrastructure and orchestration | reproducible cluster/runbook with failure recovery |
| 7 | Compiler and runtime layer | graph lowering, fusion, or memory-planning demo |
| 8 | Research and source-code loop | paper reproduction or source-code deep dive |
Recommended order:
- If you already build edge inference runtimes: Stage 0 -> 1 -> 2 -> 3 -> 5.
- If you want AI infrastructure roles: Stage 0 -> 1 -> 4 -> 5 -> 6.
- If you want compiler/runtime roles: Stage 0 -> 2 -> 3 -> 7 -> 8.
- If you want edge MLSys: Stage 0 -> 1 -> 2 -> 3 -> 4, then add Stage 5 selectively.
Stage 0: Measurement Discipline¶
Why It Matters¶
MLSys work without measurement turns into tool tourism. Before changing runtimes or kernels, build the habit of collecting comparable numbers.
Learn¶
- latency percentiles: p50, p95, p99
- throughput: requests/sec, tokens/sec, samples/sec, tokens/sec/GPU
- GPU counters: occupancy, memory bandwidth, tensor-core utilization, SM activity
- memory metrics: peak allocated, fragmentation, KV-cache blocks, activation memory
- distributed metrics: collective time, bandwidth, GPU idle time, rank skew
- reliability metrics: error rate, retry rate, checkpoint restore time, failed-node recovery
Build It¶
Create a small benchmark harness that can run the same workload repeatedly and emit:
- command line used
- hardware and driver summary
- git commit hash
- model or synthetic workload config
- concurrency level or batch size
- warmup count and measured iteration count
- CSV or JSON result file
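A minimal sketch of such a harness follows. It covers the command line, a platform summary, the git commit, the workload config, and warmup and iteration counts. The function names and the synthetic workload are hypothetical. A GPU and driver summary is not included here because it depends on which vendor tools are installed.

```python
import json
import platform
import subprocess
import sys
import time


def git_commit():
    """Return the current commit hash, or 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def run_benchmark(workload, config, warmup=3, iterations=10):
    """Run warmup iterations, then time each measured iteration."""
    for _ in range(warmup):
        workload(**config)
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload(**config)
        latencies.append(time.perf_counter() - start)
    return {
        "command": sys.argv,
        "platform": platform.platform(),
        "python": platform.python_version(),
        "git_commit": git_commit(),
        "config": config,
        "warmup": warmup,
        "iterations": iterations,
        "latencies_s": latencies,
    }


def synthetic_workload(size):
    """Stand-in for a real model call."""
    return sum(i * i for i in range(size))


result = run_benchmark(synthetic_workload, {"size": 10_000})
print(json.dumps(result, indent=2))
```

Redirect the JSON into a file under `results/` so that every number in a report can be traced back to a recorded run.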
Use It In The Real Stack¶
Use the harness for every later stage. Do not hand-copy benchmark numbers into notes. Generate them from scripts.
Measure It¶
Your harness should report:
- mean and percentile latency
- throughput
- peak memory
- CPU and GPU utilization if available
- profiler trace path
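The latency part of that report can be sketched with the standard library alone. This version uses the nearest-rank percentile definition, which is an assumption: other tools interpolate between samples, so state which definition you use when you compare numbers.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Mean and tail percentiles for one benchmark run."""
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Synthetic samples: 1.0 ms through 100.0 ms.
samples = [float(x) for x in range(1, 101)]
print(summarize(samples))
```

With few samples the tail percentiles collapse onto the maximum, so record the iteration count next to every p99.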
Ship It¶
Ship bench/, results/, and reports/ directories with one reproducible baseline. The baseline can be synthetic, but it must be rerunnable.
Stage 1: Systems Runtime Foundations¶
Why It Matters¶
An inference or training service is still a distributed Linux program. It queues work, moves bytes, schedules threads, manages memory, handles failure, and emits telemetry.
Learn¶
- Linux processes, threads, signals, cgroups, namespaces, and filesystems
- memory mapping, page faults, huge pages, pinned memory, NUMA, and zero-copy paths
- concurrency with threads, locks, atomics, queues, work stealing, cancellation, and backpressure
- networking with TCP, HTTP streaming, gRPC, `epoll`, `io_uring`, and RDMA concepts
- CPU performance: cache locality, SIMD basics, `perf`, flamegraphs, and tracing
- production runtime basics: structured logs, metrics, traces, health checks, and graceful shutdown
Languages:
- Python for ML ecosystem integration
- C++ for runtime and CUDA integration
- Rust for infrastructure and systems components when it fits the project
Build It¶
Build an inference-shaped server without a real model first:
- Accept requests with prompt length, max tokens, priority, and deadline.
- Put requests into a scheduler queue.
- Batch compatible requests every few milliseconds.
- Stream fake tokens back to clients.
- Support cancellation.
- Apply backpressure when queues or memory budgets are exceeded.
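The steps above can be sketched with `asyncio`. This is a minimal illustration, not a production design: the class names, the fixed batching window, and the fake token strings are all assumptions. Backpressure is modeled as immediate rejection when the bounded queue is full, and cancellation as a flag the scheduler checks at each step.

```python
import asyncio


class Overloaded(Exception):
    """Raised when the scheduler queue is full (backpressure signal)."""


class Request:
    def __init__(self, req_id, max_tokens):
        self.req_id = req_id
        self.max_tokens = max_tokens
        self.tokens = asyncio.Queue()  # streamed output; None marks the end
        self.cancelled = False


class Scheduler:
    def __init__(self, max_queue=8, max_batch=4, window_s=0.005):
        self.queue = asyncio.Queue(maxsize=max_queue)
        self.max_batch = max_batch
        self.window_s = window_s

    def submit(self, request):
        """Enqueue a request, or reject it when the queue is full."""
        try:
            self.queue.put_nowait(request)
        except asyncio.QueueFull:
            raise Overloaded(request.req_id) from None

    async def _next_batch(self):
        batch = [await self.queue.get()]    # wait for at least one request
        await asyncio.sleep(self.window_s)  # batching window
        while len(batch) < self.max_batch and not self.queue.empty():
            batch.append(self.queue.get_nowait())
        return batch

    async def run(self):
        while True:
            active = await self._next_batch()
            step = 0
            while active:
                await asyncio.sleep(0)      # stand-in for one decode step
                still_active = []
                for r in active:
                    if r.cancelled or step >= r.max_tokens:
                        r.tokens.put_nowait(None)
                    else:
                        r.tokens.put_nowait(f"tok{step}")
                        still_active.append(r)
                active = still_active
                step += 1


async def collect(request):
    """Client side: read the stream until the end marker."""
    out = []
    while (tok := await request.tokens.get()) is not None:
        out.append(tok)
    return out


async def demo():
    sched = Scheduler(max_queue=2, max_batch=2)
    a, b, c = Request("a", 3), Request("b", 5), Request("c", 1)
    sched.submit(a)
    sched.submit(b)
    try:
        sched.submit(c)        # queue holds two requests; the third is rejected
        rejected = False
    except Overloaded:
        rejected = True
    b.cancelled = True         # cancel before the batch starts
    worker = asyncio.create_task(sched.run())
    results = await asyncio.gather(collect(a), collect(b))
    worker.cancel()
    return results, rejected


results, rejected = asyncio.run(demo())
print(results, rejected)
```

The scheduler always waits out the full window, even when the batch is already full. That keeps the sketch short, and it is also a simple way to observe the cost that batching adds to latency at low load.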
Then add a second component:
- tokenizer runtime with zero-copy request parsing, or
- mini tensor runtime with explicit allocation and shape tracking, or
- memory arena for request state and KV-cache-like blocks.
Use It In The Real Stack¶
Compare your design to the control-plane responsibilities in vLLM, SGLang, Ray Serve, and Triton Inference Server. Focus on what the runtime schedules and what the GPU kernels actually execute.
Measure It¶
- p50/p95/p99 latency under increasing concurrency
- queue wait time versus execution time
- throughput under different batching windows
- allocation count and peak RSS
- CPU utilization, lock contention, and context switches
- cancellation latency
Ship It¶
Ship an async runtime with a load generator, dashboard or metrics endpoint, and a short report explaining when batching helps and when it hurts tail latency.
Stage 2: Transformer Execution Internals¶
Why It Matters¶
MLSys engineers do not need to invent every model architecture, but they must understand the compute and memory flow of the models they serve or train.
Transformer inference flow:
Transformer training flow:
Attention:
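The standard scaled dot-product attention, which is presumably what this line refers to, is:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

With sequence length $S$ and head dimension $d_k$, the score matrix $Q K^{\top}$ has $S \times S$ entries per head. That is why naive attention compute and memory grow quadratically with sequence length during prefill.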
Learn¶
- embeddings, attention, MLP, residual paths, RMSNorm/layer norm, logits, and sampling
- QKV projection, grouped-query attention, RoPE, ALiBi-style position handling, and KV cache
- prefill versus decode
- activation memory and gradient flow
- optimizer state memory
- mixed precision, loss scaling, and gradient accumulation
- batching, speculative decoding, quantization, and long-context behavior
Build It¶
Build a small transformer path that can run both training and inference:
- Print tensor shapes at every major operation.
- Track activation memory during training.
- Track KV-cache memory during inference.
- Separate prefill and decode timing.
- Add a minimal sampler.
- Add gradient accumulation and mixed precision.
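Before writing the real model, the shape flow can be checked on paper. The sketch below computes, rather than executes, the tensor shapes of one attention block with grouped-query attention and contrasts prefill with a single decode step. The layout `(batch, heads, tokens, head_dim)` and the example dimensions are assumptions chosen for illustration.

```python
import math


def attention_shapes(batch, q_len, kv_len, d_model, n_heads, n_kv_heads):
    """Return (name, shape) pairs for one attention block.

    q_len is the number of new tokens; kv_len is the number of tokens
    visible in the KV cache, including the new ones.
    """
    head_dim = d_model // n_heads
    return [
        ("hidden_in", (batch, q_len, d_model)),
        ("q", (batch, n_heads, q_len, head_dim)),
        ("k_cache", (batch, n_kv_heads, kv_len, head_dim)),
        ("v_cache", (batch, n_kv_heads, kv_len, head_dim)),
        ("scores", (batch, n_heads, q_len, kv_len)),
        ("context", (batch, n_heads, q_len, head_dim)),
        ("hidden_out", (batch, q_len, d_model)),
    ]


# Hypothetical configuration.
cfg = dict(batch=4, d_model=1024, n_heads=16, n_kv_heads=4)
prefill = attention_shapes(q_len=512, kv_len=512, **cfg)
decode = attention_shapes(q_len=1, kv_len=513, **cfg)

for phase, shapes in (("prefill", prefill), ("decode", decode)):
    print(phase)
    for name, shape in shapes:
        print(f"  {name:<10} {shape}  elements={math.prod(shape)}")
```

In decode the query shrinks to one token while the cache keeps its full length, so the step reads a large K and V for very little arithmetic.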
Use It In The Real Stack¶
Map your toy implementation to PyTorch modules and to production runtime concepts:
| Concept | Training system concern | Inference system concern |
|---|---|---|
| activations | memory and recomputation | usually not retained |
| gradients | all-reduce/reduce-scatter | not present |
| optimizer state | often larger than weights | not present |
| KV cache | not central in normal training | primary serving memory pressure |
| batch size | throughput and convergence | throughput and latency |
| sequence length | activation and attention cost | KV cache and prefill cost |
Measure It¶
- tokens/sec for prefill and decode
- samples/sec or tokens/sec during training
- activation memory by layer
- KV-cache bytes per token
- effect of batch size and sequence length
- numerical drift across precision modes
Ship It¶
Ship a transformer execution report with diagrams or tables for shape flow, memory flow, and throughput. Include at least one surprising bottleneck you found from measurement.
Stage 3: GPU Kernels And CUDA Performance¶
Why It Matters¶
This is where MLSys becomes hardware-shaped. You need enough GPU knowledge to know whether a bottleneck is bandwidth, launch overhead, occupancy, synchronization, or tensor-core utilization.
Memory hierarchy:
Learn¶
- CUDA grids, blocks, warps, streams, events, and synchronization
- warp execution, divergence, occupancy, memory coalescing, and bank conflicts
- shared memory, registers, L2 behavior, and HBM bandwidth
- tensor cores and matrix tiling
- reductions, scans, softmax, normalization, and matmul kernels
- kernel launch overhead, CUDA graphs, persistent kernels, and kernel fusion
- profiler workflow with Nsight Systems, Nsight Compute, and simple CUDA events
Build It¶
Build a small kernel ladder:
- vector add baseline
- reduction kernel
- layer norm or RMSNorm
- tiled matrix multiply
- fused RMSNorm + residual or fused bias + activation
- Triton version of one kernel
Then add a transformer-shaped benchmark:
- compare naive attention, tiled attention, and library attention where possible
- compare single kernel versus fused path
- compare launch-by-launch execution versus CUDA graph capture when applicable
Use It In The Real Stack¶
Read source with one question in mind: what memory traffic did this code remove?
Study:
- CUTLASS for tiled GEMM and template-based kernel structure
- FlashAttention for attention memory traffic reduction
- TensorRT-LLM kernels for production LLM inference paths
- vLLM or SGLang for scheduler/runtime interaction with kernels
- llama.cpp CUDA paths for smaller, readable inference kernels
Measure It¶
- achieved bandwidth
- achieved FLOP/s
- occupancy
- global memory transactions
- shared-memory bank conflicts
- tensor-core utilization
- kernel launch count
- numerical error versus reference implementation
Ship It¶
Ship a kernel benchmark suite with before/after numbers, profiler screenshots or exported reports, and a short roofline-style explanation for each kernel.
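A roofline-style explanation starts from arithmetic intensity, the number of FLOPs per byte moved to or from device memory. The sketch below classifies two kernels against a hypothetical device with 100 TFLOP/s peak compute and 1 TB/s memory bandwidth. Both peak numbers are placeholders, and the matmul byte count assumes each matrix crosses the memory bus exactly once.

```python
def roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify a kernel with the roofline model.

    Returns (intensity, ridge, attainable_flops, bound), where ridge is
    the intensity at which the compute and memory ceilings meet.
    """
    intensity = flops / bytes_moved
    ridge = peak_flops / peak_bandwidth
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    return intensity, ridge, attainable, bound


PEAK_FLOPS = 100e12  # hypothetical device peak, FLOP/s
PEAK_BW = 1e12       # hypothetical memory bandwidth, bytes/s

# FP32 vector add: n FLOPs; two reads and one write of 4 bytes per element.
n = 1 << 24
vector_add = roofline(n, 12 * n, PEAK_FLOPS, PEAK_BW)

# FP16 square matmul: 2*m^3 FLOPs; three m*m matrices of 2-byte elements.
m = 4096
matmul = roofline(2 * m**3, 3 * m * m * 2, PEAK_FLOPS, PEAK_BW)

for name, result in (("vector_add", vector_add), ("matmul", matmul)):
    intensity, ridge, attainable, bound = result
    print(f"{name}: {intensity:.2f} FLOP/B (ridge {ridge:.0f}), "
          f"{attainable / 1e12:.2f} TFLOP/s attainable, {bound}")
```

Compare the attainable number with what the profiler reports. A large gap below the roofline usually points at launch overhead, poor coalescing, or low occupancy rather than at the hardware limit.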
Stage 4: Inference Serving Systems¶
Why It Matters¶
Serving is where kernels become product infrastructure. The runtime must control queueing, memory, streaming, fairness, overload, placement, and observability.
Serving latency decomposes roughly into:
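A common first-order decomposition, stated here as an assumption rather than the course's exact formula, separates time to first token (TTFT) from steady-state decoding. Here $N_{\text{out}}$ is the number of generated tokens and $T_{\text{ITL}}$ is the mean inter-token latency:

$$
T_{\text{TTFT}} \approx T_{\text{queue}} + T_{\text{prefill}}
$$

$$
T_{\text{request}} \approx T_{\text{TTFT}} + (N_{\text{out}} - 1)\, T_{\text{ITL}}
$$

Queue time and prefill dominate short responses, while inter-token latency dominates long ones.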
Throughput is often constrained by:
Learn¶
- prefill versus decode scheduling
- batching and continuous batching
- request admission and overload control
- streaming responses and cancellation
- paged KV cache and block allocation
- prefix caching and prompt sharing
- speculative decoding
- tensor-parallel and pipeline-parallel inference
- autoscaling, placement, health checks, and drain logic
- observability: TTFT, inter-token latency, queue depth, active sequences, KV blocks, tokens/sec/GPU
Build It¶
Build a serving prototype in layers:
- request queue with deadlines and priorities
- continuous batching scheduler
- KV-cache block allocator
- streaming token output
- cancellation and eviction path
- admission control based on memory budget
- metrics endpoint
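The KV-cache block allocator and the memory-based admission check from the list above can be prototyped without a GPU. This sketch is in the spirit of paged KV caches but is not taken from any specific runtime. The class name, the 16-token block size, and the method names are assumptions.

```python
class OutOfBlocks(Exception):
    """Raised when a sequence cannot get the blocks it needs."""


class BlockAllocator:
    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))  # free list of physical block ids
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def blocks_needed(self, num_tokens):
        return -(-num_tokens // self.block_tokens)  # ceiling division

    def can_admit(self, prompt_tokens):
        """Admission control: is there room for the whole prompt?"""
        return self.blocks_needed(prompt_tokens) <= len(self.free)

    def admit(self, seq_id, prompt_tokens):
        need = self.blocks_needed(prompt_tokens)
        if need > len(self.free):
            raise OutOfBlocks(seq_id)
        self.tables[seq_id] = [self.free.pop() for _ in range(need)]
        self.lengths[seq_id] = prompt_tokens

    def append_token(self, seq_id):
        """Reserve space for one decoded token, growing by a block if full."""
        capacity = len(self.tables[seq_id]) * self.block_tokens
        if self.lengths[seq_id] == capacity:
            if not self.free:
                raise OutOfBlocks(seq_id)
            self.tables[seq_id].append(self.free.pop())
        self.lengths[seq_id] += 1

    def release(self, seq_id):
        """Return all blocks of a finished, cancelled, or evicted sequence."""
        self.free.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]

    def utilization(self):
        used = sum(len(t) for t in self.tables.values())
        total = used + len(self.free)
        return used / total if total else 0.0


alloc = BlockAllocator(num_blocks=4)
alloc.admit("a", prompt_tokens=20)  # needs 2 blocks
alloc.admit("b", prompt_tokens=16)  # needs 1 block, which is exactly full
alloc.append_token("b")             # crosses a block boundary, takes the last block
print(alloc.can_admit(1), alloc.utilization())
alloc.release("a")
print(alloc.can_admit(32), alloc.utilization())
```

Because a sequence can fail to grow mid-decode, the scheduler needs an eviction or preemption policy in addition to the admission check.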
Optional advanced additions:
- speculative decoding path with draft and target model
- distributed router that selects replicas by queue depth and health
- tensor-parallel toy runtime with explicit communication calls
Use It In The Real Stack¶
Study:
- vLLM for PagedAttention, continuous batching, and serving abstractions
- SGLang for structured generation runtime ideas
- TensorRT-LLM for optimized NVIDIA inference paths
- Triton Inference Server for production model serving patterns
- Ray Serve for distributed service orchestration
- llama.cpp for local and edge inference constraints
Measure It¶
- p50/p95/p99 time-to-first-token
- p50/p95/p99 inter-token latency
- requests/sec and tokens/sec/GPU
- prefill throughput versus decode throughput
- KV-cache utilization and fragmentation
- active sequence count
- scheduler overhead
- tail latency under overload
Ship It¶
Ship a serving report that can answer:
- What is the bottleneck at low concurrency?
- What is the bottleneck at high concurrency?
- How much memory does the KV cache consume per active request?
- When does batching improve throughput but hurt latency?
- What does the system do when overloaded?
Stage 5: Distributed Training Systems¶
Why It Matters¶
Training systems teach the part of MLSys that inference alone does not: gradient synchronization, activation memory, optimizer-state partitioning, checkpointing, data loading, and failure recovery.
Basic update:
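In its simplest form (plain SGD, shown here as an assumption about what this line refers to), one training step updates the parameters $\theta$ with learning rate $\eta$:

$$
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)
$$

Optimizers such as Adam replace the raw gradient with a function of running moment estimates, which is where optimizer-state memory comes from.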
Step time:
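A first-order model, given as an assumption, treats step time as compute plus whatever communication and input time is not hidden behind compute:

$$
T_{\text{step}} \approx T_{\text{forward}} + T_{\text{backward}} + T_{\text{optimizer}} + T_{\text{comm, exposed}} + T_{\text{data, exposed}}
$$

Overlapping gradient communication with the backward pass shrinks the exposed communication term without changing the total bytes sent.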
Training memory:
M_training = M_weights + M_activations + M_gradients + M_optimizer + M_workspace
Learn¶
- autograd, activation memory, and recomputation
- mixed precision, gradient scaling, and gradient accumulation
- optimizer state memory and sharding
- data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and expert parallelism
- DDP, FSDP, ZeRO, and DTensor/DeviceMesh concepts
- all-reduce, reduce-scatter, all-gather, broadcast, and point-to-point send/recv
- NCCL topology, NVLink/NVSwitch, PCIe, InfiniBand, RoCE, and RDMA concepts
- checkpointing, elastic recovery, rank failure, and restart behavior
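For the collectives in the list above, a bandwidth-latency estimate is a useful first hypothesis before profiling. The sketch below models ring all-reduce, in which each rank sends and receives $2(N-1)/N$ times the message size over $2(N-1)$ steps. The link bandwidth and latency values are hypothetical, and a real library may choose a different algorithm and overlap communication with compute.

```python
def ring_allreduce_seconds(message_bytes, world_size, link_bandwidth,
                           link_latency=0.0):
    """Estimate ring all-reduce time.

    link_bandwidth is per-direction bytes/s between neighbouring ranks;
    link_latency is the per-step latency in seconds.
    """
    if world_size < 2:
        return 0.0
    steps = 2 * (world_size - 1)  # reduce-scatter phase + all-gather phase
    bytes_per_rank = steps / world_size * message_bytes
    return bytes_per_rank / link_bandwidth + steps * link_latency


# Hypothetical case: 1B parameters with 2-byte gradients on 8 GPUs.
grad_bytes = 2 * 1_000_000_000
estimate = ring_allreduce_seconds(grad_bytes, world_size=8,
                                  link_bandwidth=100e9, link_latency=5e-6)
print(f"estimated all-reduce time: {estimate * 1e3:.2f} ms")
```

If the measured collective time is far above this estimate, look at topology, bucket sizes, and rank skew before blaming the interconnect.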
Build It¶
Build the training progression in this order:
- single-GPU transformer training loop
- memory profile with activations and optimizer states
- DDP run on 2 or more GPUs
- FSDP or ZeRO run on the same model
- activation checkpointing experiment
- distributed checkpoint save and restore
- NCCL debug/profile run
If you only have one GPU locally, use rented GPUs for the distributed step. The artifact matters more than owning the cluster.
Use It In The Real Stack¶
Study:
- PyTorch Distributed for process groups, DDP, FSDP, DTensor, and DeviceMesh
- DeepSpeed for ZeRO and optimizer/memory partitioning
- Megatron-LM for tensor, pipeline, and sequence parallelism patterns
- Ray Train for job orchestration and distributed training ergonomics
- Slurm or Kubernetes for scheduling real GPU jobs
Measure It¶
- samples/sec or tokens/sec per GPU
- step time breakdown
- scaling efficiency
- all-reduce/reduce-scatter time
- GPU idle time
- data-loader stall time
- activation memory
- optimizer-state memory
- checkpoint write and restore time
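Two of these metrics are simple ratios that are easy to compute inconsistently. A minimal sketch follows. The phase names and all numbers are hypothetical placeholders, not results.

```python
def step_breakdown(phases_ms):
    """Fraction of step time spent in each measured phase."""
    total = sum(phases_ms.values())
    return {name: ms / total for name, ms in phases_ms.items()}


def scaling_efficiency(throughput_n, throughput_1, num_gpus):
    """Measured throughput on N GPUs relative to N times one GPU."""
    return throughput_n / (num_gpus * throughput_1)


# Hypothetical per-step timings in milliseconds.
phases = {"forward": 30, "backward": 60, "all_reduce": 25,
          "optimizer": 5, "data_wait": 5}
shares = step_breakdown(phases)
for name, share in shares.items():
    print(f"{name:>10}: {share:.0%}")

# Hypothetical throughputs in tokens/sec.
efficiency = scaling_efficiency(6400, 1000, num_gpus=8)
print(f"scaling efficiency: {efficiency:.2f}")
```

Feed the breakdown only with time that is not overlapped with other phases. Otherwise the shares no longer describe where the step actually waits.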
Ship It¶
Ship a training-systems report comparing single GPU, DDP, and FSDP/ZeRO. Include profiler traces and a clear explanation of which cost dominated each run.
Stage 6: AI Infrastructure And Orchestration¶
Why It Matters¶
MLSys does not stop at kernels and frameworks. Real systems need scheduling, deployment, isolation, storage, monitoring, rollout, and recovery.
Learn¶
- GPU scheduling with Slurm, Kubernetes, Ray, or a smaller custom scheduler
- placement constraints: GPU type, memory, topology, MIG, NUMA, and network reachability
- container images, driver compatibility, CUDA runtime compatibility, and reproducible environments
- data pipeline throughput and storage locality
- checkpoint storage, artifact versioning, and restore paths
- autoscaling and admission control
- multi-tenant isolation and quota policies
- metrics, tracing, logging, alerts, and SLOs
- incident debugging: hangs, OOM, bad nodes, slow links, clock throttling, and version skew
Build It¶
Build a small but realistic runbook:
- define a container image for training and serving
- run a benchmark job locally or on one node
- run the same benchmark through a scheduler
- collect logs, metrics, and profiler traces
- simulate one failure: OOM, killed process, lost worker, failed checkpoint, or bad config
- document the recovery path
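The failure and recovery steps can be rehearsed on a laptop before touching a cluster. This sketch simulates a worker that dies mid-run and then resumes from its last checkpoint. The JSON state, the file layout, and the `crash_at` parameter are illustrative assumptions. The checkpoint is written to a temporary file and renamed into place, so a crash during the write cannot leave a partial checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write atomically: readers see the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def train(path, total_steps, crash_at=None):
    """Toy training loop that checkpoints after every step."""
    state = load_checkpoint(path, {"step": 0, "loss_sum": 0.0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated worker failure")
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for real work
        save_checkpoint(state, path)
    return state


with tempfile.TemporaryDirectory() as workdir:
    clean = train(os.path.join(workdir, "clean.json"), total_steps=10)
    path = os.path.join(workdir, "resumed.json")
    try:
        train(path, total_steps=10, crash_at=4)
    except RuntimeError as err:
        print(f"failure: {err}")
    resumed = train(path, total_steps=10)  # restart picks up at step 4
    print(clean == resumed)
```

The comparison against an uninterrupted run is the part worth keeping in a real runbook: recovery is only verified when the resumed job reaches the same state.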
Use It In The Real Stack¶
Tie the infrastructure to the workload:
- training jobs need checkpoint/restart and efficient data loading
- inference services need health checks, draining, and overload behavior
- multi-GPU jobs need topology-aware placement
- edge systems need thermal, power, and storage constraints in the deployment plan
Measure It¶
- job startup time
- image size and cold-start time
- GPU allocation efficiency
- failed-job recovery time
- checkpoint restore time
- data-loader throughput
- service availability during rollout
Ship It¶
Ship an operations-grade runbook with exact commands, configs, expected metrics, and a failure-mode table.
Stage 7: Compiler And Runtime Layer¶
Why It Matters¶
Compiler/runtime work connects model graphs to hardware execution. It is where high-level operations become fused kernels, memory plans, and backend-specific code.
Compiler path:
Learn¶
- PyTorch graph capture, export, and compile paths
- graph optimization: constant folding, dead-code elimination, layout changes, and operator fusion
- memory planning and buffer reuse
- MLIR dialects and passes
- TVM schedules and auto-tuning concepts
- XLA graph compilation
- TensorRT graph optimization
- Triton language for custom kernels
- correctness testing across rewritten graphs
Build It¶
Pick one:
- fuse two simple tensor ops and measure launch-count reduction
- write a Triton kernel for RMSNorm, softmax, or a small matmul
- lower a toy tensor op through MLIR
- compare TensorRT output against eager PyTorch for a small model fragment
- build a static memory planner for a fixed graph
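As a starting point for the last option, here is a minimal sketch of a greedy static planner. It places the largest buffers first and gives each one the lowest offset that does not collide with any buffer alive at the same time. The tensor names, sizes, and lifetimes are hypothetical, and production planners also handle alignment and in-place operators.

```python
def plan_memory(tensors):
    """Assign arena offsets to tensors with known lifetimes.

    tensors: list of (name, size_bytes, first_use, last_use), with
    lifetimes inclusive. Returns (offsets, arena_size).
    """
    placed = []   # (offset, size, first_use, last_use)
    offsets = {}
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Only buffers whose lifetimes overlap this one constrain it.
        conflicts = sorted((o, s) for o, s, f, l in placed
                           if not (l < first or last < f))
        offset = 0
        for o, s in conflicts:
            if offset + size <= o:
                break                    # fits in the gap before this buffer
            offset = max(offset, o + s)  # otherwise skip past it
        offsets[name] = offset
        placed.append((offset, size, first, last))
    arena = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, arena


# Hypothetical chain x -> a -> b -> c: each tensor is live from the step
# that produces it through the step that consumes it.
tensors = [
    ("x", 4096, 0, 1),
    ("a", 8192, 1, 2),
    ("b", 8192, 2, 3),
    ("c", 4096, 3, 4),
]
offsets, arena = plan_memory(tensors)
naive = sum(size for _, size, _, _ in tensors)
print(offsets)
print(f"arena: {arena} bytes, naive: {naive} bytes")
```

Buffer reuse only helps when lifetimes are known ahead of time, which is one reason dynamic shapes make memory planning harder.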
Use It In The Real Stack¶
Connect this stage back to hardware:
- fusion reduces memory traffic and launch overhead
- layout changes can make kernels faster or slower
- dynamic shapes increase runtime complexity
- quantization changes both graph structure and kernel selection
- compiler wins are only real if numerical quality and deployment constraints survive
Measure It¶
- operator count before/after
- kernel launch count
- memory traffic
- latency and throughput
- compile time
- peak memory
- numerical error versus reference
Ship It¶
Ship a compiler/runtime artifact with a before/after benchmark and correctness tests.
Stage 8: Research And Source-Code Loop¶
Why It Matters¶
At senior MLSys level, papers and production code become inputs to engineering decisions. The goal is not to collect papers. The goal is to turn papers into measurements and design choices.
Research loop:
Read¶
Prioritize systems venues and systems-heavy ML work:
- MLSys
- OSDI
- NSDI
- ASPLOS
- SOSP
- NeurIPS systems, efficiency, and infrastructure papers
Study Source Code¶
Use this reading pattern:
- Identify the hot path.
- Find the scheduler or runtime boundary.
- Find memory allocation and cache policy.
- Find communication calls.
- Find the kernel launch path.
- Reproduce a small benchmark.
- Change one setting and measure the effect.
Good source-code targets:
| Area | Systems |
|---|---|
| Inference | vLLM, SGLang, llama.cpp, TensorRT-LLM, Triton Inference Server |
| Distributed training | PyTorch Distributed, DeepSpeed, Megatron-LM, Horovod |
| Infrastructure | Ray, Ray Serve, Ray Train, Kubernetes, Slurm, Kubeflow |
| Kernels | FlashAttention, CUTLASS, xFormers, FlashInfer |
| Compiler/runtime | Triton language, TVM, MLIR, XLA, TensorRT |
| Edge | Jetson Linux, TensorRT, Holoscan, ONNX Runtime, llama.cpp |
Ship It¶
Every paper or source-code study should produce one artifact:
- reproduction
- benchmark
- implementation note
- diagram
- profiler trace
- bug report
- small patch
- clear negative result
The 6-Month Training Systems Milestone¶
If your current strength is inference/runtime work, this is the best next milestone. Keep it systems-heavy.
| Month | Focus | Artifact |
|---|---|---|
| 1 | transformer training internals, autograd, activation memory | single-GPU training loop with memory profile |
| 2 | mixed precision, gradient accumulation, optimizer states | throughput and memory report across precision modes |
| 3 | DDP and NCCL basics | 2-8 GPU DDP benchmark with step-time breakdown |
| 4 | FSDP/ZeRO and checkpointing | memory scaling comparison and restore test |
| 5 | DeepSpeed or Megatron-LM internals | annotated runbook for one realistic model config |
| 6 | custom optimization | fused kernel, scheduler improvement, checkpoint improvement, or communication tuning |
Minimum acceptable output:
- one repo
- one model config
- three training modes
- profiler traces
- memory tables
- scaling chart
- written bottleneck analysis
Do not stop at "it runs." The milestone is complete when you can explain why it scales or fails to scale.
Edge MLSys Specialization¶
For this roadmap, the strongest niche is Edge MLSys + Inference Runtime Engineering.
This combines:
- Jetson and embedded Linux
- local AI and robotics inference
- low-power deployment
- memory-efficient serving
- multimodal runtime work
- scheduler design
- CUDA/TensorRT optimization
- Rust/C++ runtime engineering
- MLIR/Triton compiler paths
- observability on constrained devices
Good edge MLSys projects:
- Jetson LLM serving runtime with continuous batching and KV-cache accounting
- low-memory LoRA or adapter-training experiment
- edge model adaptation pipeline with checkpoint recovery
- multimodal scheduler for camera, audio, and text workloads
- local/private AI appliance runtime with observability and overload control
- CUDA kernel optimization report on Orin versus desktop GPU
- thermal-aware inference scheduler that changes batch/concurrency under power limits
The edge niche is valuable because it joins skills that are usually split across different engineers: embedded Linux, GPU optimization, AI inference, runtime engineering, and production deployment.
Capstone Options¶
Choose one. A good capstone is narrow enough to finish and deep enough to prove systems ability.
Option A: Edge Inference Runtime¶
Build a Jetson-focused runtime with:
- tokenizer path
- request scheduler
- continuous batching
- KV-cache accounting
- streaming output
- metrics endpoint
- Nsight profile
- power and thermal notes
Success criteria:
- reproducible benchmark
- p50/p95 TTFT and inter-token latency
- tokens/sec under at least three concurrency levels
- memory report for weights, KV cache, and workspace
- overload behavior documented
Option B: Distributed Training Systems Report¶
Build a training benchmark suite with:
- single-GPU baseline
- DDP run
- FSDP or ZeRO run
- activation checkpointing experiment
- checkpoint restore test
- profiler traces
Success criteria:
- tokens/sec or samples/sec per GPU
- step-time breakdown
- scaling efficiency
- memory comparison
- communication-cost analysis
- one concrete tuning recommendation
Option C: Compiler/Kernel Runtime Demo¶
Build a model-fragment optimization with:
- reference PyTorch implementation
- custom CUDA or Triton kernel
- graph fusion or lowering path
- correctness tests
- benchmark harness
Success criteria:
- before/after latency
- kernel launch count
- memory traffic estimate
- numerical error table
- explanation of when the optimization stops helping
Portfolio Standard¶
A strong MLSys portfolio artifact includes:
- architecture diagram
- exact hardware and software versions
- reproducible setup commands
- benchmark harness
- profiler traces
- raw result files
- summary charts
- bottleneck analysis
- failure modes
- next optimization hypothesis
Weak artifact:
Strong artifact:
Continuous batching improved throughput from 410 to 690 tok/s at 64 concurrent
requests, but p95 TTFT increased from 380 ms to 610 ms. The profiler shows
prefill bursts starving decode, so the next experiment limits prefill tokens per
scheduling iteration.
Career Positioning¶
This track supports titles like:
- ML Systems Engineer
- AI Infrastructure Engineer
- Inference Systems Engineer
- Training Systems Engineer
- GPU Runtime Engineer
- Edge AI Runtime Engineer
- LLM Runtime Optimization Engineer
Strong positioning:
ML Systems Engineer | GPU Runtime Optimization | CUDA | TensorRT-LLM |
Distributed Inference | Edge AI Infrastructure | Jetson | MLIR | C++ | Rust
or:
Inference Systems Engineer | LLM Runtime Optimization | CUDA Kernels |
Tensor Parallelism | Edge AI | Jetson | TensorRT-LLM | ML Systems
The title matters less than public proof. Publish benchmark graphs, latency profiles, memory reports, architecture diagrams, profiler traces, and small runtime components.
Official References¶
Use official docs and primary sources first:
- PyTorch Distributed Overview
- vLLM Documentation
- NVIDIA TensorRT-LLM Documentation
- NVIDIA CUDA C++ Best Practices Guide
- NVIDIA NCCL Documentation
- DeepSpeed ZeRO Documentation
- Ray Train Documentation
- NVIDIA Megatron-LM
- Triton Language Documentation
- MLIR Documentation
- Apache TVM Documentation
- Triton Inference Server Documentation
Exit Criteria¶
You are ready to claim MLSys competency when you can:
- explain transformer training and inference as shape, memory, kernel, and communication flows
- profile a workload before proposing an optimization
- write and benchmark at least one custom GPU kernel
- debug a distributed training run limited by memory, communication, or synchronization
- explain how continuous batching and paged KV cache affect serving throughput and tail latency
- connect runtime decisions to hardware constraints
- operate a small training or inference system with logs, metrics, and recovery steps
- ship a reproducible benchmark artifact
The outcome is not a certificate. It is a body of systems work that proves you can make AI workloads run faster, cheaper, and more reliably.