Lecture 37 - TraceLens: Trace-Driven AI Performance Analysis
Course: Agentic AI & GenAI | Previous: Lecture 36 | Next: Lecture 38
Performance optimization is not a vibes problem.
For AI systems, the hard part is often not collecting data. It is turning a huge profiler trace into a specific diagnosis:
What was slow?
Where in the model did it happen?
Was the GPU computing, communicating, copying, or waiting?
Did the new kernel actually help?
Can we reproduce the slow op without the full model?
TraceLens is useful because it treats profiler traces as structured data, not screenshots for humans to manually inspect.
It consumes traces from frameworks such as PyTorch and JAX, then produces summaries, comparisons, roofline-style metrics, collective communication analysis, and minimal operation reproducers.
For this course, TraceLens is the missing evidence layer between:
agent workload
-> model server
-> GPU kernels and collectives
-> performance claim
-> trace-backed diagnosis
Learning objectives
By the end of this lecture, you should be able to:
- Explain why raw AI profiler traces are hard to analyze manually.
- Understand Trace2Tree as a hierarchical trace intermediate representation.
- Use top-down performance breakdowns to locate bottlenecks.
- Interpret GPU timeline, operator category, dispatch-level, and shape-level reports.
- Understand TraceLens roofline metrics such as TFLOPS/s and TB/s.
- Diagnose multi-GPU scaling with communication latency, bandwidth, and synchronization skew.
- Use trace comparison to quantify regressions across hardware, drivers, kernels, or software versions.
- Understand event replay as a way to produce minimal, IP-safe performance reproducers.
- Apply TraceLens to OpenClaw-style agent serving and vLLM optimization workflows.
1. The profiler trace problem
Modern AI workloads generate large traces:
- Python module calls
- PyTorch or JAX framework operations
- CPU dispatches
- runtime launch events
- GPU kernels
- memory copies
- collective communication
- synchronization gaps
Tools such as Perfetto are useful for visual inspection, but the manual workflow does not scale.
The engineer ends up asking:
Which kernels belong to this model layer?
Which aten op launched this slow kernel?
Is this memcpy from my code or from AMP/framework behavior?
Did communication overlap with compute?
Did a new software version shift time into a different op?
Raw traces are too flat.
They show events, but not enough intent.
TraceLens adds structure.
2. Trace2Tree: from flat trace to hierarchy
The central idea is Trace2Tree.
TraceLens converts flat events into a tree that connects Python modules, CPU dispatches, runtime launches, and GPU kernels.
Example shape:
nn.Module: Linear
-> aten::linear
-> aten::to
-> aten::copy_
-> elementwise kernel
-> aten::addmm
-> matrix multiply kernel
This matters because many expensive events are hidden at the Python level.
Common examples:
- automatic mixed precision casts
- implicit copies
- layout conversions
- unfused bias additions
- framework-generated elementwise kernels
- backend-specific convolution or attention kernels
Without a tree, a kernel is just a name and timestamp.
With a tree, the kernel has ownership: it can be traced back to the framework op and the module that launched it.
That is the difference between tracing and diagnosis.
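The ownership idea can be sketched in a few lines. This is a minimal illustration, not TraceLens's actual implementation: it assumes each event is a hypothetical `(name, start, duration)` tuple and that events are properly nested in time. Real traces also link CPU launches to asynchronous GPU kernels through correlation IDs, which this sketch ignores.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    start: float  # microseconds
    dur: float    # microseconds
    children: list = field(default_factory=list)

    @property
    def end(self) -> float:
        return self.start + self.dur


def build_tree(events):
    """Nest flat (name, start, dur) events by time containment."""
    root = Node("root", 0.0, float("inf"))
    stack = [root]
    # Sort by start time; on ties, the longer (enclosing) event comes first.
    for name, start, dur in sorted(events, key=lambda e: (e[1], -e[2])):
        node = Node(name, start, dur)
        # Pop finished ancestors until the top of the stack encloses this event.
        while node.start >= stack[-1].end:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def owner_path(root, target):
    """Return the chain of names from the top-level event down to `target`."""
    def walk(node, path):
        path = path + [node.name]
        if node.name == target:
            return path
        for child in node.children:
            found = walk(child, path)
            if found:
                return found
        return None

    for child in root.children:
        found = walk(child, [])
        if found:
            return found
    return None


# Illustrative timings for the Linear example above (made-up numbers).
events = [
    ("nn.Module: Linear", 0, 100),
    ("aten::linear", 5, 90),
    ("aten::to", 10, 20),
    ("aten::copy_", 12, 15),
    ("elementwise kernel", 14, 10),
    ("aten::addmm", 35, 55),
    ("matrix multiply kernel", 40, 45),
]
tree = build_tree(events)
print(" -> ".join(owner_path(tree, "matrix multiply kernel")))
```

With the tree in place, "which aten op launched this kernel?" becomes a lookup instead of a manual search through a timeline.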
3. Top-down bottleneck analysis
TraceLens reports are designed to move from coarse to precise.
The useful progression is:
GPU timeline
-> operator category
-> framework op
-> unique input shape
-> exact kernel or replay case
This is the right order.
Do not start with kernel names.
Start by asking how the system spent time.
4. GPU timeline: was the GPU doing useful work?
The GPU timeline report separates time into categories such as:
- computation
- exposed communication
- exposed memory copy
- busy time
- idle time
- total communication
- total memory copy
This answers the first diagnostic question: was the GPU doing useful work, or was it communicating, copying, or waiting?
Example interpretations:
high computation time
-> optimize kernels, shapes, dtype, fusion, algorithm
high exposed communication time
-> improve overlap, collective strategy, tensor parallel layout, network path
high exposed memcpy time
-> inspect dtype casts, host/device movement, layout conversions
high idle time
-> inspect CPU dispatch, dataloader, synchronization, scheduling, dynamic shapes
The ROCm blog's Llama FSDP example shows a GPU timeline where the GPU is mostly busy, but total communication is still a large fraction of the run. That distinction matters: communication can be present but hidden under compute, or it can be exposed and directly hurt wall time.
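The difference between total and exposed time can be made concrete with interval arithmetic. The sketch below is an assumption about how such a breakdown could be computed, not TraceLens's code: it takes hypothetical lists of `(start, end)` intervals per category and treats communication as exposed only where no compute kernel is running, and memory copy as exposed only where neither compute nor communication is running.

```python
def merge(intervals):
    """Union of (start, end) intervals as a sorted, non-overlapping list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total(intervals):
    return sum(end - start for start, end in intervals)


def subtract(a, b):
    """Parts of merged intervals `a` that are not covered by merged intervals `b`."""
    out = []
    for start, end in a:
        cur = start
        for bs, be in b:
            if be <= cur or bs >= end:
                continue  # no overlap with the remaining part of this interval
            if bs > cur:
                out.append((cur, bs))
            cur = max(cur, be)
        if cur < end:
            out.append((cur, end))
    return out


def gpu_timeline(compute, comm, memcpy, wall):
    comp, com, cpy = merge(compute), merge(comm), merge(memcpy)
    busy = merge(comp + com + cpy)
    return {
        "computation": total(comp),
        "total_comm": total(com),
        "exposed_comm": total(subtract(com, comp)),
        "total_memcpy": total(cpy),
        "exposed_memcpy": total(subtract(cpy, merge(comp + com))),
        "busy": total(busy),
        "idle": wall - total(busy),
    }


# Made-up intervals in microseconds over a 140 us window.
report = gpu_timeline(
    compute=[(0, 50), (60, 100)],
    comm=[(40, 70), (100, 120)],
    memcpy=[(120, 125)],
    wall=140,
)
print(report)
```

In this toy run, 50 us of communication happens in total, but only 30 us is exposed; the remaining 20 us overlaps with compute and does not add to wall time.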
5. Operator category view: what kind of work dominates?
After the timeline, group work by operation family:
- GEMM
- attention / SDPA
- convolution
- elementwise
- reduce
- Triton-generated kernels
- multi-tensor apply
- other backend kernels
This tells you where to spend engineering effort.
If GEMM and attention dominate most of the time, optimizing a small elementwise kernel will not move end-to-end performance unless it blocks fusion, causes copies, or sits on the critical path.
For transformer training and serving, the usual dominant categories are GEMM and attention.
The point is not that other ops are irrelevant.
The point is prioritization.
Performance work should follow the trace.
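A category view is essentially a group-by over op names. The following sketch shows the idea with hypothetical name patterns and made-up durations; the patterns are assumptions for illustration and are not the rules TraceLens uses.

```python
import re
from collections import defaultdict

# Hypothetical patterns, checked in order; the first match wins.
CATEGORY_PATTERNS = [
    ("GEMM", re.compile(r"aten::(mm|addmm|bmm|matmul|linear)$")),
    ("attention / SDPA", re.compile(r"attention|sdpa", re.IGNORECASE)),
    ("convolution", re.compile(r"conv", re.IGNORECASE)),
    ("reduce", re.compile(r"aten::(sum|mean|max|min)$")),
    ("elementwise", re.compile(r"aten::(add|mul|relu|gelu|copy_)$")),
]


def categorize(op_name):
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(op_name):
            return category
    return "other"


def category_shares(op_times_us):
    """Return each category's fraction of total time, largest first."""
    totals = defaultdict(float)
    for op_name, dur in op_times_us:
        totals[categorize(op_name)] += dur
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: -kv[1])
    return {category: t / grand_total for category, t in ranked}


# Made-up per-op GPU time in microseconds.
ops = [
    ("aten::mm", 600.0),
    ("aten::addmm", 200.0),
    ("aten::scaled_dot_product_attention", 150.0),
    ("aten::add", 30.0),
    ("aten::layer_norm", 20.0),
]
shares = category_shares(ops)
for category, share in shares.items():
    print(f"{category:20s} {share:6.1%}")
```

Here GEMM takes 80% of the time, so a perfect elementwise kernel would recover at most 3%.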
6. Dispatch-level view: stable names across backends
GPU kernel names change across:
- CUDA vs ROCm
- driver versions
- compiler versions
- library backends
- hardware generations
- autotuning choices
Framework dispatch names are often more stable.
Examples include aten::mm and aten::convolution.
TraceLens uses this level to make comparisons more meaningful.
Instead of asking whether one backend-specific kernel name got slower, you can ask:
Did aten::convolution become slower?
Did aten::mm improve for this shape?
Did flash attention regress after the backend update?
That abstraction is important when comparing CUDA and ROCm, or comparing two software versions on the same platform.
7. Shape-level view: the real root cause
Operator names are not enough.
The same operation can have very different performance depending on:
- tensor dimensions
- strides
- dtype
- layout
- batch size
- sequence length
- head count
- head dimension
- padding
- transposition
TraceLens breaks down operations by unique input shape and arguments.
This is where many real optimizations start.
Example: the same op is fast for most input shapes but slow for one of them.
Possible causes:
- bad tile shape
- small batch dimension
- unfavorable alignment
- memory-bound shape
- layout conversion nearby
- low occupancy
- poor backend algorithm selection
For agent workloads, this is especially useful because context length and output length change request by request.
A model may be fast at one sequence length and slow at another.
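A shape-level report can be thought of as a group-by on the op name plus its input shapes and dtype. The sketch below assumes a hypothetical event dictionary layout (`name`, `shapes`, `dtype`, `dur_us`); the numbers are made up.

```python
from collections import defaultdict


def shape_breakdown(events):
    """Aggregate durations by (op name, input shapes, dtype), largest total first."""
    groups = defaultdict(list)
    for ev in events:
        key = (ev["name"], tuple(map(tuple, ev["shapes"])), ev["dtype"])
        groups[key].append(ev["dur_us"])
    rows = [
        {
            "op": name,
            "shapes": shapes,
            "dtype": dtype,
            "count": len(durs),
            "total_us": sum(durs),
            "mean_us": sum(durs) / len(durs),
        }
        for (name, shapes, dtype), durs in groups.items()
    ]
    return sorted(rows, key=lambda row: -row["total_us"])


events = [
    {"name": "aten::mm", "shapes": [[4096, 4096], [4096, 4096]], "dtype": "bfloat16", "dur_us": 500.0},
    {"name": "aten::mm", "shapes": [[4096, 4096], [4096, 4096]], "dtype": "bfloat16", "dur_us": 520.0},
    {"name": "aten::mm", "shapes": [[7, 4096], [4096, 4096]], "dtype": "bfloat16", "dur_us": 90.0},
]
rows = shape_breakdown(events)
for row in rows:
    print(row["op"], row["shapes"], row["count"], row["mean_us"])
```

In this toy data the 7-row input does several hundred times less arithmetic than the 4096-row input but is only about six times faster, which is the kind of row that deserves a closer look.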
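8. Roofline-style compute modeling
Kernel duration answers how long an operation took.
It does not answer whether the hardware was used efficiently.
TraceLens estimates theoretical work from operator arguments.
For a GEMM that multiplies an M x K matrix by a K x N matrix, that is roughly 2 * M * N * K floating-point operations, plus the bytes needed to read both inputs and write the output.
Then it combines that with actual trace timing: achieved TFLOPS/s is the estimated FLOPs divided by the measured duration, and achieved TB/s is the estimated bytes divided by the same duration.
This gives a roofline-style view: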
high arithmetic intensity + low TFLOPS/s
-> compute kernel may be underutilizing the GPU
low arithmetic intensity + high TB/s
-> operation may be memory-bandwidth bound
unexpected bytes or copies nearby
-> inspect layout, dtype, or framework-generated movement
Important distinction:
TraceLens estimates useful theoretical work from framework semantics.
Hardware profilers measure what the GPU actually executed.
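Use both: the framework-level estimate for the work the model asked for, and hardware counters for the work the GPU actually did.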
The gap between those two is often where the optimization lives.
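As a worked example of the roofline-style arithmetic, the sketch below uses the standard GEMM cost model: 2 * M * N * K floating-point operations, and one read of each input plus one write of the output. The function name and the byte model (no cache reuse, no bias) are simplifying assumptions, not TraceLens's exact formulas.

```python
def gemm_roofline(m, n, k, bytes_per_elem, dur_us):
    """Estimate achieved throughput for an (m x k) @ (k x n) matrix multiply."""
    flops = 2 * m * n * k                                   # one multiply + one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C
    seconds = dur_us * 1e-6
    return {
        "flops": flops,
        "bytes": bytes_moved,
        "tflops_per_s": flops / seconds / 1e12,
        "tb_per_s": bytes_moved / seconds / 1e12,
        "arithmetic_intensity": flops / bytes_moved,        # FLOPs per byte
    }


# A 4096^3 BF16 GEMM (2 bytes per element) measured at a made-up 500 us.
result = gemm_roofline(4096, 4096, 4096, bytes_per_elem=2, dur_us=500.0)
print(f"{result['tflops_per_s']:.1f} TFLOPS/s, "
      f"{result['tb_per_s']:.3f} TB/s, "
      f"{result['arithmetic_intensity']:.0f} FLOPs/byte")
```

Comparing the achieved TFLOPS/s against the device's peak for that dtype tells you whether a long-running kernel is merely large or actually inefficient.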
9. Multi-GPU communication analysis
Distributed training and serving add another failure mode: collective operations can include synchronization skew.
One rank may enter the collective late because:
- its compute was slower
- its batch was heavier
- it hit a local scheduling delay
- it had a CPU dispatch stall
- it waited on a prior dependency
If you only look at total collective duration, you may blame the network for workload imbalance.
TraceLens separates:
- payload size
- collective latency
- algorithmic bandwidth
- bus bandwidth
- synchronization skew
Useful diagnostic patterns:
high communication latency, low skew
-> likely network or collective implementation bottleneck
high skew, reasonable bandwidth
-> likely imbalance or late-arriving ranks
high exposed communication time
-> overlap is insufficient
high total communication but low exposed communication
-> communication exists but is mostly hidden under compute
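The separation of skew from communication can be sketched for a single all-reduce. The input layout is hypothetical (per-rank entry times and a common finish time), and treating the time after the last rank arrives as pure communication is an assumption for illustration. The bus bandwidth factor 2 * (n - 1) / n is the all-reduce convention used by nccl-tests.

```python
def allreduce_metrics(payload_bytes, rank_start_us, end_us):
    """Split one all-reduce into synchronization skew and communication time."""
    n = len(rank_start_us)
    last_arrival = max(rank_start_us)
    skew_us = last_arrival - min(rank_start_us)
    # Assumption: data can only move at full speed once every rank has arrived.
    comm_us = end_us - last_arrival
    algbw_gbs = payload_bytes / (comm_us * 1e-6) / 1e9
    busbw_gbs = algbw_gbs * 2 * (n - 1) / n
    return {
        "skew_us": skew_us,
        "comm_us": comm_us,
        "algbw_gbs": algbw_gbs,
        "busbw_gbs": busbw_gbs,
    }


# Made-up example: 8 ranks, 1 GB payload, rank 3 arrives about 500 us late.
metrics = allreduce_metrics(
    payload_bytes=1_000_000_000,
    rank_start_us=[100, 105, 110, 600, 102, 101, 108, 104],
    end_us=10_600,
)
print(metrics)
```

Measured from the first arrival, this collective would look 500 us slower than the network actually was; the skew column attributes that time to the late rank instead.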
For hardware engineers, this matters because AI scaling bottlenecks are often misdiagnosed.
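The right question is not whether the network is slow.
The better question is how much of the collective time is pure communication and how much is ranks waiting for each other.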
10. Trace comparison: prove the delta
Most performance work is comparative:
old kernel vs new kernel
CUDA vs ROCm
H100 vs MI300X
driver A vs driver B
BF16 vs FP8
FlashAttention backend vs FlashInfer backend
baseline vLLM vs optimized vLLM
TraceLens can compare reports and identify where time changed.
For simple cases, matching happens at the same operator level.
For more complex cases, TraceLens can use morphological comparison: it aligns trees and finds the lowest point where the call stacks diverge.
That is useful when two backends implement the same framework op differently.
For example, two backends can run different kernel subtrees under the same framework op: the framework-level intent is the same, but the backend subtree differs.
Trace comparison lets you say:
This change improved convolution backward by X ms.
This change regressed aten::mm by Y ms.
This backend moved time from one subtree into another.
That is the standard you should use for optimization claims.
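Both comparison ideas can be sketched briefly. The first function diffs per-op time between two runs; the second finds where two call stacks stop agreeing. The data layout, function names, and example stacks are illustrative assumptions and do not reflect TraceLens's internal comparison algorithm.

```python
def compare_ops(baseline_ms, candidate_ms):
    """Per-op time delta between two runs, largest absolute change first."""
    rows = []
    for op in sorted(set(baseline_ms) | set(candidate_ms)):
        before = baseline_ms.get(op, 0.0)
        after = candidate_ms.get(op, 0.0)
        rows.append({"op": op, "baseline_ms": before,
                     "candidate_ms": after, "delta_ms": after - before})
    return sorted(rows, key=lambda row: -abs(row["delta_ms"]))


def divergence_point(stack_a, stack_b):
    """Return the shared call-stack prefix; the stacks differ right below it."""
    shared = []
    for a, b in zip(stack_a, stack_b):
        if a != b:
            break
        shared.append(a)
    return shared


baseline = {"aten::mm": 120.0, "aten::convolution_backward": 80.0, "aten::add": 5.0}
candidate = {"aten::mm": 135.0, "aten::convolution_backward": 60.0, "aten::add": 5.0}
deltas = compare_ops(baseline, candidate)
for row in deltas:
    print(f"{row['op']:30s} {row['delta_ms']:+.1f} ms")

# Illustrative stacks for the same framework op on two backends.
stack_a = ["nn.Module: Conv2d", "aten::convolution", "aten::cudnn_convolution", "cudnn kernel"]
stack_b = ["nn.Module: Conv2d", "aten::convolution", "aten::miopen_convolution", "miopen kernel"]
shared = divergence_point(stack_a, stack_b)
print("diverges below:", shared[-1])
```

The shared prefix is the level at which the two runs can be compared fairly; everything below it is backend-specific.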
11. Event replay: isolate the slow operation
Finding a slow operation is only half the job.
You still need a reproducer.
Full model reproducers are often difficult to share because they include:
- proprietary model architecture
- private weights
- production input data
- distributed launch setup
- complex environment state
TraceLens event replay generates a minimal script from trace metadata:
- operation type
- tensor shapes
- dtypes
- strides
- relevant arguments
This produces a focused benchmark case.
Why this matters:
model team finds a slow op
-> event replay creates minimal reproducer
-> kernel/compiler/vendor team debugs it
-> fix is validated against original trace
This is how you turn model-level performance debugging into systems engineering.
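To show what "a minimal script from trace metadata" can look like, the sketch below generates a small PyTorch benchmark as text from a hypothetical event dictionary. It is not TraceLens's replay generator: the metadata layout and `make_replay_script` are assumptions, only tensor inputs are handled, and the emitted script relies on `torch.empty_strided` and `torch.ops.aten.*` to rebuild inputs with random values rather than real data.

```python
def make_replay_script(event, warmup=5, iters=20):
    """Emit a standalone PyTorch benchmark for one traced op (tensor inputs only)."""
    lines = [
        "import time",
        "import torch",
        "",
        "device = 'cuda' if torch.cuda.is_available() else 'cpu'",
    ]
    names = []
    for i, tensor in enumerate(event["inputs"]):
        name = f"x{i}"
        names.append(name)
        shape, strides = tuple(tensor["shape"]), tuple(tensor["strides"])
        # Random values only: no weights or production data leave the model team.
        lines.append(
            f"{name} = torch.empty_strided({shape}, {strides}, "
            f"dtype=torch.{tensor['dtype']}, device=device).normal_()"
        )
    op_path = event["op"].replace("::", ".")  # "aten::mm" -> "aten.mm"
    call = f"torch.ops.{op_path}({', '.join(names)})"
    lines += [
        "",
        "def sync():",
        "    if device == 'cuda':",
        "        torch.cuda.synchronize()",
        "",
        f"for _ in range({warmup}):",
        f"    {call}",
        "sync()",
        "start = time.perf_counter()",
        f"for _ in range({iters}):",
        f"    {call}",
        "sync()",
        f"print('mean ms:', (time.perf_counter() - start) / {iters} * 1e3)",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical metadata for one slow matrix multiply taken from a trace.
event = {
    "op": "aten::mm",
    "inputs": [
        {"shape": [7, 4096], "strides": [4096, 1], "dtype": "bfloat16"},
        {"shape": [4096, 4096], "strides": [4096, 1], "dtype": "bfloat16"},
    ],
}
script = make_replay_script(event)
print(script)
```

The generated file contains only shapes, strides, dtypes, and the op name, which is what makes it shareable with a kernel or vendor team.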
12. Where TraceLens fits in the agent stack
Agent systems need performance evidence at multiple layers:
application:
user latency, tool latency, session throughput
agent runtime:
planning time, tool-call count, retry count, context growth
model server:
TTFT, ITL, output tokens/s, queueing, KV-cache memory
GPU:
kernels, copies, collectives, idle time, roofline metrics
TraceLens focuses on the lower layers, but it should be connected to the higher layers.
For OpenClaw-style systems, a useful workflow is:
run representative agent workload
-> collect model-server trace
-> generate TraceLens report
-> identify bottleneck
-> create event replay if needed
-> change model/backend/kernel/config
-> compare traces
-> attach report to deployment decision
This pairs directly with Lecture 36.
If you enable FP8 KV-cache, do not only report a single headline number.
Report:
TTFT
ITL
output tokens/s
GPU memory
operator-level time
attention kernel time
copy time
communication exposure
accuracy
trace comparison delta
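13. Practical workflow
Install TraceLens by following the instructions in the TraceLens GitHub repository (see References).
Generate a performance report from a PyTorch trace with the report generator documented there.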
Compare two reports:
TraceLens_compare_perf_reports_pytorch \
baseline.xlsx candidate.xlsx \
--names baseline candidate \
--sheets all \
-o comparison.xlsx
Generate a multi-rank collective report:
TraceLens_generate_multi_rank_collective_report_pytorch \
--trace_dir /path/to/traces \
--world_size 8
For ROCm rocprofv3 Perfetto-style traces:
rocprofv3 \
--hip-trace \
--kernel-trace \
--memory-copy-trace \
--rccl-trace \
--output-format pftrace \
-d ./v3_traces \
-- python3 your_app.py
TraceLens_generate_perf_report_pftrace_hip_activity \
--trace_path sample.pftrace \
--write_md
Use these commands as starting points, not a complete profiling policy.
For reliable comparisons, also pin:
- model revision
- input workload
- batch/concurrency
- sequence lengths
- dtype policy
- GPU clocks if relevant
- driver/runtime versions
- backend flags
- warmup count
- random seeds if accuracy is involved
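One way to make the pinning list above checkable is to store it next to each trace as a small manifest and verify that two runs differ only in the variable under test. The field names and values below are hypothetical; adapt them to whatever your benchmark harness records.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunManifest:
    model_revision: str
    workload: str
    concurrency: int
    max_seq_len: int
    dtype_policy: str
    driver_version: str
    backend_flags: tuple
    warmup_iters: int
    seed: int


def differing_fields(a, b):
    """Names of manifest fields whose values differ between two runs."""
    da, db = asdict(a), asdict(b)
    return sorted(key for key in da if da[key] != db[key])


# Placeholder values for illustration only.
baseline = RunManifest(
    model_revision="rev-a",
    workload="agent-replay-v1",
    concurrency=16,
    max_seq_len=8192,
    dtype_policy="bf16",
    driver_version="driver-x",
    backend_flags=("flag-a",),
    warmup_iters=10,
    seed=0,
)
candidate = replace(baseline, dtype_policy="fp8-kv-cache")

changed = differing_fields(baseline, candidate)
print("fields that differ:", changed)
if changed != ["dtype_policy"]:
    raise SystemExit("comparison is confounded: more than the intended variable changed")
```

If more than one field differs, the comparison report cannot attribute the delta to a single change.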
14. Diagnostic playbook
Use TraceLens reports to choose the next action.
Case: high idle time
Likely causes:
- CPU dataloader bottleneck
- Python overhead
- synchronization
- dynamic shape recompilation
- queue starvation
- model server scheduling gap
Next actions:
- inspect CPU dispatch timeline
- check batch construction
- reduce Python in the hot path
- verify warmup and compile behavior
- correlate with request-level telemetry
Case: GEMM dominates but TFLOPS/s is low
Likely causes:
- small or awkward matrix shapes
- poor alignment
- bad layout
- backend algorithm choice
- unnecessary transposes
- low occupancy
Next actions:
- inspect shape-level rows
- compare against hardware peak and expected roofline
- generate event replay for the slow shape
- test alternative kernels or layouts
Case: attention dominates long-context serving
Likely causes:
- KV-cache memory traffic
- head dimension issue
- backend selection
- context length distribution
- low-precision conversion overhead
Next actions:
- compare BF16 vs FP8 traces
- inspect attention kernel time
- test skip policies for sliding-window layers
- validate accuracy on real prompts
- connect to Lecture 36's TTFT/ITL benchmark plan
Case: exposed communication is high
Likely causes:
- poor overlap
- collective algorithm issue
- rank imbalance
- network bottleneck
- tensor-parallel or FSDP layout mismatch
Next actions:
- inspect skew vs pure communication time
- compare algorithmic and bus bandwidth
- inspect per-rank timelines
- adjust sharding, bucket sizes, or overlap strategy
Case: regression after a software update
Likely causes:
- backend kernel selection changed
- compiler changed generated code
- framework changed dispatch path
- dtype/layout behavior changed
- fusion changed
Next actions:
- run TraceLens comparison
- identify the largest positive deltas
- inspect lowest divergent subtree
- replay the slow op
- keep the trace report with the change review
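The first step of the playbook, choosing which case to open, can be written as a small rule over GPU timeline numbers. The sketch below is one possible triage rule, not part of TraceLens: the dictionary keys and the 20% threshold are hypothetical and should be tuned per workload.

```python
def triage(timeline_us, threshold=0.2):
    """Return playbook cases whose share of wall time meets the threshold, largest first."""
    wall = timeline_us["wall"]
    checks = [
        ("high idle time", timeline_us["idle"]),
        ("exposed communication is high", timeline_us["exposed_comm"]),
        ("exposed memcpy is high", timeline_us["exposed_memcpy"]),
    ]
    flagged = [(name, t / wall) for name, t in checks if t / wall >= threshold]
    flagged.sort(key=lambda item: -item[1])
    # If nothing is flagged, the run is mostly compute: go to categories and shapes.
    return [name for name, _ in flagged] or ["compute dominates: inspect categories and shapes"]


# Made-up timeline totals in microseconds.
cases = triage({"wall": 1000, "idle": 350, "exposed_comm": 250, "exposed_memcpy": 20})
print(cases)
```

A rule like this only picks the starting point; the likely causes and next actions for each case still come from the drill-down reports.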
15. How this connects to agent reliability
Earlier lectures focused on agent skills, structured tools, and verification.
TraceLens applies the same idea to performance:
claim:
"This optimization improves long-context serving."
required evidence:
trace report
before/after comparison
workload description
bottleneck explanation
accuracy check
deployment decision
This matters because agent systems invite vague performance claims:
- "The model is slow."
- "The GPU is the bottleneck."
- "Communication is killing scaling."
- "FP8 is faster."
- "The new backend regressed."
TraceLens helps turn those into falsifiable statements:
On this workload, aten::mm for this shape regressed by X ms.
On this run, exposed communication is Y% and skew is Z us.
On this model, attention kernel time dropped but copy time increased.
On this backend, GPU idle time is dominated by CPU dispatch gaps.
That is the engineering standard.
Mini-lab: Trace a model change
Choose one small but real workload:
- a PyTorch transformer block
- a vLLM benchmark
- a multi-GPU training step
- a custom CUDA/ROCm kernel path
- an OpenClaw agent workload that calls a local model server
Run two configurations that differ in one variable, for example:
BF16 vs FP8
old backend vs new backend
CUDA vs ROCm
single GPU vs multi-GPU
eager vs compiled
with fusion vs without fusion
Collect traces, then generate TraceLens reports.
Write a short performance note:
Workload:
Hardware:
Software versions:
Main bottleneck:
Biggest improvement:
Biggest regression:
Evidence:
Deployment decision:
Next experiment:
If you find one slow operation, generate an event replay and treat it as a kernel debugging task.
Key takeaways
- Raw profiler traces are too large and flat for reliable manual analysis.
- TraceLens turns framework traces into hierarchical, queryable evidence.
- Trace2Tree connects Python modules, CPU dispatches, runtime launches, and GPU kernels.
- Start diagnosis at the GPU timeline, then drill into categories, ops, shapes, and kernels.
- Roofline-style metrics help distinguish "slow" from "inefficient."
- Multi-GPU collective time must be separated into pure communication and synchronization skew.
- Trace comparison is the right way to prove a performance change.
- Event replay turns trace metadata into minimal reproducers for kernel and backend debugging.
- For agent systems, performance claims should be backed by traces, comparisons, and workload descriptions.
References
- AMD ROCm Blog, "TraceLens: Democratizing AI Performance Analysis": https://rocm.blogs.amd.com/software-tools-optimization/tracelens/README.html
- TraceLens GitHub repository: https://github.com/AMD-AGI/TraceLens
- PyTorch profiler documentation: https://docs.pytorch.org/docs/stable/profiler.html
- Perfetto trace viewer: https://ui.perfetto.dev
- Lecture 36 - FP8 KV-Cache in vLLM: Lecture-36.md
Next: Lecture 38 - AutoSP