Lecture 37 - TraceLens: Trace-Driven AI Performance Analysis

Course: Agentic AI & GenAI

Performance optimization is not a vibes problem.

For AI systems, the hard part is often not collecting data. It is turning a huge profiler trace into a specific diagnosis:

What was slow?
Where in the model did it happen?
Was the GPU computing, communicating, copying, or waiting?
Did the new kernel actually help?
Can we reproduce the slow op without the full model?

TraceLens is useful because it treats profiler traces as structured data, not screenshots for humans to manually inspect.

It consumes traces from frameworks such as PyTorch and JAX, then produces summaries, comparisons, roofline-style metrics, collective communication analysis, and minimal operation reproducers.

For this course, TraceLens is the missing evidence layer between:

agent workload
  -> model server
  -> GPU kernels and collectives
  -> performance claim
  -> trace-backed diagnosis

Learning objectives

By the end of this lecture, you should be able to:

Explain why raw AI profiler traces are hard to analyze manually.
Understand Trace2Tree as a hierarchical trace intermediate representation.
Use top-down performance breakdowns to locate bottlenecks.
Interpret GPU timeline, operator category, dispatch-level, and shape-level reports.
Understand TraceLens roofline metrics such as TFLOPS/s and TB/s.
Diagnose multi-GPU scaling with communication latency, bandwidth, and synchronization skew.
Use trace comparison to quantify regressions across hardware, drivers, kernels, or software versions.
Understand event replay as a way to produce minimal, IP-safe performance reproducers.
Apply TraceLens to OpenClaw-style agent serving and vLLM optimization workflows.

1. The profiler trace problem

Modern AI workloads generate large traces:

Python module calls
PyTorch or JAX framework operations
CPU dispatches
runtime launch events
GPU kernels
memory copies
collective communication
synchronization gaps

Tools such as Perfetto are useful for visual inspection, but the manual workflow does not scale.

The engineer ends up asking:

Which kernels belong to this model layer?
Which aten op launched this slow kernel?
Is this memcpy from my code or from AMP/framework behavior?
Did communication overlap with compute?
Did a new software version shift time into a different op?

Raw traces are too flat.

They show events, but not enough intent.

TraceLens adds structure.

2. Trace2Tree: from flat trace to hierarchy

The central idea is Trace2Tree.

TraceLens converts flat events into a tree:

Python module / function
  -> framework op
    -> CPU dispatch
      -> runtime launch
        -> GPU kernel

Example shape:

nn.Module: Linear
  -> aten::linear
    -> aten::to
      -> aten::copy_
        -> elementwise kernel
    -> aten::addmm
      -> matrix multiply kernel

This matters because many expensive events are hidden at the Python level.

Common examples:

automatic mixed precision casts
implicit copies
layout conversions
unfused bias additions
framework-generated elementwise kernels
backend-specific convolution or attention kernels

Without a tree, a kernel is just a name and timestamp.

With a tree, the kernel has ownership:

This exact module launched this exact framework op,
which launched this exact kernel.

That is the difference between tracing and diagnosis.
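
The nesting idea can be sketched in a few lines. This is a minimal illustration, not TraceLens's implementation: it assumes Chrome-trace-style events with `name`, `ts`, and `dur` fields on a single CPU thread, nests them by time containment, and attaches GPU kernels to runtime launches through a shared `correlation` ID. The event values and the helper names `build_tree` and `path_to` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    ts: float
    dur: float
    children: list = field(default_factory=list)

    @property
    def end(self):
        return self.ts + self.dur


def build_tree(cpu_events, gpu_events):
    """Nest CPU events by time containment, then attach kernels by correlation ID."""
    root = Node("root", 0.0, float("inf"))
    stack = [root]
    launches = {}
    # Sort by start time; on ties, longer events first so parents precede children.
    for ev in sorted(cpu_events, key=lambda e: (e["ts"], -e["dur"])):
        node = Node(ev["name"], ev["ts"], ev["dur"])
        while node.ts >= stack[-1].end:  # pop events that already finished
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
        if "correlation" in ev:
            launches[ev["correlation"]] = node
    for kernel in gpu_events:
        parent = launches.get(kernel["correlation"])
        if parent is not None:
            parent.children.append(Node(kernel["name"], kernel["ts"], kernel["dur"]))
    return root


def path_to(node, target, prefix=()):
    """Return the chain of names from the root down to the node named `target`."""
    here = prefix + (node.name,)
    if node.name == target:
        return list(here)
    for child in node.children:
        found = path_to(child, target, here)
        if found:
            return found
    return None


cpu_events = [
    {"name": "nn.Module: Linear", "ts": 0, "dur": 100},
    {"name": "aten::linear", "ts": 5, "dur": 90},
    {"name": "aten::to", "ts": 10, "dur": 20},
    {"name": "aten::copy_", "ts": 12, "dur": 15},
    {"name": "cudaLaunchKernel", "ts": 15, "dur": 5, "correlation": 1},
    {"name": "aten::addmm", "ts": 40, "dur": 50},
    {"name": "cudaLaunchKernel", "ts": 45, "dur": 5, "correlation": 2},
]
gpu_events = [
    {"name": "elementwise_kernel", "ts": 30, "dur": 10, "correlation": 1},
    {"name": "gemm_kernel", "ts": 60, "dur": 200, "correlation": 2},
]

root = build_tree(cpu_events, gpu_events)
print(" -> ".join(path_to(root, "gemm_kernel")[1:]))
```

The printed chain is the ownership path for the matrix multiply kernel. Real traces span several threads and streams, so a production tool has to do considerably more bookkeeping than this sketch.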

3. Top-down bottleneck analysis

TraceLens reports are designed to move from coarse to precise.

The useful progression is:

GPU timeline
  -> operator category
  -> framework op
  -> unique input shape
  -> exact kernel or replay case

This is the right order.

Do not start with kernel names.

Start by asking how the system spent time.

4. GPU timeline: was the GPU doing useful work?

The GPU timeline report separates time into categories such as:

computation
exposed communication
exposed memory copy
busy time
idle time
total communication
total memory copy

This answers the first diagnostic question:

Is the GPU compute-bound, communication-bound, copy-bound, or idle?

Example interpretations:

high computation time
  -> optimize kernels, shapes, dtype, fusion, algorithm

high exposed communication time
  -> improve overlap, collective strategy, tensor parallel layout, network path

high exposed memcpy time
  -> inspect dtype casts, host/device movement, layout conversions

high idle time
  -> inspect CPU dispatch, dataloader, synchronization, scheduling, dynamic shapes

The ROCm blog's Llama FSDP example shows a GPU timeline where the GPU is mostly busy, but total communication is still a large fraction of the run. That distinction matters: communication can be present but hidden under compute, or it can be exposed and directly hurt wall time.
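
The difference between total and exposed time is interval arithmetic. The sketch below is an illustration under stated assumptions, not TraceLens's code: it treats communication as exposed when no compute kernel overlaps it, and memory copy as exposed when neither compute nor communication overlaps it. The interval values and function names are hypothetical.

```python
def merge(intervals):
    """Union of possibly overlapping (start, end) intervals as a sorted disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total(intervals):
    """Total covered time, counting overlapping regions once."""
    return sum(end - start for start, end in merge(intervals))


def subtract(a, b):
    """Parts of `a` that are not covered by any interval in `b`."""
    result = []
    b = merge(b)
    for start, end in merge(a):
        cursor = start
        for b_start, b_end in b:
            if b_end <= cursor or b_start >= end:
                continue
            if b_start > cursor:
                result.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < end:
            result.append((cursor, end))
    return result


def timeline_breakdown(compute, comm, memcpy, window):
    start, end = window
    busy = total(compute + comm + memcpy)
    return {
        "computation": total(compute),
        "total_comm": total(comm),
        "exposed_comm": total(subtract(comm, compute)),
        "total_memcpy": total(memcpy),
        "exposed_memcpy": total(subtract(memcpy, compute + comm)),
        "busy": busy,
        "idle": (end - start) - busy,
    }


# Hypothetical GPU activity in arbitrary time units over a 100-unit window.
compute = [(0, 40), (50, 70)]
comm = [(30, 60)]
memcpy = [(35, 38), (70, 75)]

report = timeline_breakdown(compute, comm, memcpy, window=(0, 100))
for key, value in report.items():
    print(f"{key:>15}: {value}")
```

In this sample, communication runs for 30 units but only 10 are exposed, because the rest is hidden under compute. Computation, exposed communication, exposed memory copy, and idle time together add up to the full window.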

5. Operator category view: what kind of work dominates?

After the timeline, group work by operation family:

GEMM
attention / SDPA
convolution
elementwise
reduce
Triton-generated kernels
multi-tensor apply
other backend kernels

This tells you where to spend engineering effort.

If GEMM and attention account for most of the time, optimizing a small elementwise kernel will not move end-to-end performance unless that kernel blocks fusion, causes copies, or sits on the critical path.
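
A category view is a group-by over op names. The sketch below assumes a list of (op name, GPU time) rows for ops that launched kernels directly. The substring rules are hypothetical and much cruder than a real classifier; they only illustrate the prioritization step.

```python
from collections import defaultdict

# Hypothetical substring rules. The first match wins, so GEMM is listed before
# elementwise to keep "aten::addmm" from matching the "aten::add" rule.
CATEGORY_RULES = [
    ("attention", ("flash_attn", "scaled_dot_product_attention")),
    ("GEMM", ("aten::mm", "aten::addmm", "aten::bmm")),
    ("convolution", ("convolution",)),
    ("elementwise", ("aten::add", "aten::mul", "aten::copy_")),
]


def categorize(op_name):
    for category, patterns in CATEGORY_RULES:
        if any(pattern in op_name for pattern in patterns):
            return category
    return "other"


def category_breakdown(rows):
    """Return (category, total time, percent of total) sorted by time, largest first."""
    totals = defaultdict(float)
    for op_name, gpu_time_us in rows:
        totals[categorize(op_name)] += gpu_time_us
    grand_total = sum(totals.values())
    table = [(cat, t, 100 * t / grand_total) for cat, t in totals.items()]
    return sorted(table, key=lambda row: row[1], reverse=True)


# Made-up GPU times in microseconds.
rows = [
    ("aten::mm", 500.0),
    ("aten::addmm", 300.0),
    ("flash_attn::_flash_attn_forward", 150.0),
    ("aten::add", 30.0),
    ("aten::copy_", 20.0),
]

breakdown = category_breakdown(rows)
for category, time_us, percent in breakdown:
    print(f"{category:<12} {time_us:8.1f} us  {percent:5.1f}%")
```

With these made-up numbers, GEMM takes 80% of GPU time and elementwise work takes 5%, so even a perfect elementwise optimization is bounded by that 5%.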

For transformer training and serving, the usual dominant categories are:

GEMM
attention
communication
memory movement

The point is not that other ops are irrelevant.

The point is prioritization.

Performance work should follow the trace.

6. Dispatch-level view: stable names across backends

GPU kernel names change across:

CUDA vs ROCm
driver versions
compiler versions
library backends
hardware generations
autotuning choices

Framework dispatch names are often more stable.

Examples:

aten::mm
aten::linear
aten::copy_
flash_attn::_flash_attn_forward
flash_attn::_flash_attn_backward

TraceLens uses this level to make comparisons more meaningful.

Instead of asking:

Why did kernel igemm_fwd_gtcx3_... change?

you can ask:

Did aten::convolution become slower?
Did aten::mm improve for this shape?
Did flash attention regress after the backend update?

That abstraction is important when comparing CUDA and ROCm, or comparing two software versions on the same platform.

7. Shape-level view: the real root cause

Operator names are not enough.

The same operation can have very different performance depending on:

tensor dimensions
strides
dtype
layout
batch size
sequence length
head count
head dimension
padding
transposition

TraceLens breaks down operations by unique input shape and arguments.

This is where many real optimizations start.

Example:

aten::mm with shape A
  -> high TFLOPS/s

aten::mm with shape B
  -> poor TFLOPS/s

Possible causes:

bad tile shape
small batch dimension
unfavorable alignment
memory-bound shape
layout conversion nearby
low occupancy
poor backend algorithm selection

For agent workloads, this is especially useful because context length and output length change request by request.

A model may be fast at one sequence length and slow at another.
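
A shape-level breakdown groups calls by op name, input shapes, and dtype rather than by op name alone. The sketch below assumes each event carries `name`, `input_dims`, `dtype`, and `dur_us` fields. Those field names and the sample durations are hypothetical, chosen only to show the grouping.

```python
from collections import defaultdict


def shape_key(event):
    """Hashable key: op name, nested input dims as tuples, and dtype."""
    dims = tuple(tuple(d) for d in event["input_dims"])
    return (event["name"], dims, event["dtype"])


def group_by_shape(events):
    """Aggregate call count, total time, and mean time per unique shape."""
    durations = defaultdict(list)
    for event in events:
        durations[shape_key(event)].append(event["dur_us"])
    rows = [
        {
            "name": name,
            "input_dims": dims,
            "dtype": dtype,
            "count": len(d),
            "total_us": sum(d),
            "mean_us": sum(d) / len(d),
        }
        for (name, dims, dtype), d in durations.items()
    ]
    return sorted(rows, key=lambda r: r["total_us"], reverse=True)


# Same op, two shapes: a large prefill-like GEMM and a small decode-like GEMM.
events = [
    {"name": "aten::mm", "input_dims": [[4096, 4096], [4096, 4096]], "dtype": "bf16", "dur_us": 1000},
    {"name": "aten::mm", "input_dims": [[4096, 4096], [4096, 4096]], "dtype": "bf16", "dur_us": 1020},
    {"name": "aten::mm", "input_dims": [[8, 4096], [4096, 4096]], "dtype": "bf16", "dur_us": 40},
    {"name": "aten::mm", "input_dims": [[8, 4096], [4096, 4096]], "dtype": "bf16", "dur_us": 44},
    {"name": "aten::mm", "input_dims": [[8, 4096], [4096, 4096]], "dtype": "bf16", "dur_us": 42},
]

shape_rows = group_by_shape(events)
for row in shape_rows:
    print(row["name"], row["input_dims"], row["count"], row["total_us"], row["mean_us"])
```

Grouped only by name, these five calls would collapse into a single `aten::mm` row and the two very different shapes would be indistinguishable.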

8. Roofline-style compute modeling

Kernel duration answers:

How long did it take?

It does not answer:

Was that good?

TraceLens estimates theoretical work from operator arguments.

For GEMM:

FLOPs = 2 * M * N * K
Bytes = (M*K + K*N + M*N) * element_size

Then it combines that with actual trace timing:

TFLOPS/s = FLOPs / time
TB/s     = bytes / time

This gives a roofline-style view:

high arithmetic intensity + low TFLOPS/s
  -> compute kernel may be underutilizing the GPU

low arithmetic intensity + high TB/s
  -> operation may be memory-bandwidth bound

unexpected bytes or copies nearby
  -> inspect layout, dtype, or framework-generated movement
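
The GEMM formulas above can be worked through directly. In this sketch the kernel times and the peak numbers (`PEAK_TFLOPS`, `PEAK_TB_S`) are placeholders chosen for illustration, not specifications of any real GPU. The ridge point is the arithmetic intensity at which the compute and bandwidth roofs meet.

```python
def gemm_roofline(m, n, k, element_size, time_s, peak_tflops, peak_tb_s):
    """Roofline-style metrics for an (M x K) @ (K x N) matrix multiply."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * element_size
    intensity = flops / bytes_moved  # FLOPs per byte
    ridge = peak_tflops / peak_tb_s  # FLOPs per byte; the tera prefixes cancel
    return {
        "flops": flops,
        "bytes": bytes_moved,
        "tflops_per_s": flops / time_s / 1e12,
        "tb_per_s": bytes_moved / time_s / 1e12,
        "arithmetic_intensity": intensity,
        "bound": "compute" if intensity >= ridge else "memory",
    }


# Placeholder peaks, NOT real hardware numbers.
PEAK_TFLOPS = 1000.0
PEAK_TB_S = 5.0

# Large square GEMM in BF16 (2 bytes per element), assumed to take 1 ms.
large = gemm_roofline(4096, 4096, 4096, 2, 1e-3, PEAK_TFLOPS, PEAK_TB_S)

# Decode-like GEMM with a batch dimension of 8, assumed to take 20 us.
small = gemm_roofline(8, 4096, 4096, 2, 20e-6, PEAK_TFLOPS, PEAK_TB_S)

for label, r in (("large", large), ("small", small)):
    print(
        f"{label}: {r['tflops_per_s']:.1f} TFLOPS/s, {r['tb_per_s']:.3f} TB/s, "
        f"intensity {r['arithmetic_intensity']:.1f} FLOPs/byte, {r['bound']}-bound"
    )
```

The square GEMM has an arithmetic intensity of about 1365 FLOPs/byte, while the batch-8 GEMM has about 8 FLOPs/byte. The second shape cannot approach peak TFLOPS/s however good the kernel is, so TB/s is the number to judge it by.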

Important distinction:

TraceLens estimates useful theoretical work from framework semantics.

Hardware profilers measure what the GPU actually executed.

Use both:

TraceLens:
  "What work did the model intend?"

Hardware counters:
  "What did the GPU actually do?"

The gap between those two is often where the optimization lives.

9. Multi-GPU communication analysis

Distributed training and serving add another failure mode:

total collective time != pure network time

Collective operations can include synchronization skew.

One rank may enter the collective late because:

its compute was slower
its batch was heavier
it hit a local scheduling delay
it had a CPU dispatch stall
it waited on a prior dependency

If you only look at total collective duration, you may blame the network for workload imbalance.

TraceLens separates:

payload size
collective latency
algorithmic bandwidth
bus bandwidth
synchronization skew

Useful diagnostic patterns:

high communication latency, low skew
  -> likely network or collective implementation bottleneck

high skew, reasonable bandwidth
  -> likely imbalance or late-arriving ranks

high exposed communication time
  -> overlap is insufficient

high total communication but low exposed communication
  -> communication exists but is mostly hidden under compute
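
The separation of skew from transfer time can be illustrated with per-rank timestamps for one all-reduce. The sketch below makes two assumptions that may differ from TraceLens's exact definitions: skew is the spread of rank arrival times, and communication time is measured from the last arrival to completion. The bus bandwidth factor 2(n-1)/n follows the convention used by the nccl-tests benchmarks for all-reduce. All timestamps are made up.

```python
def allreduce_metrics(size_bytes, starts_us, ends_us):
    """Split one all-reduce into synchronization skew and communication time."""
    n = len(starts_us)
    skew_us = max(starts_us) - min(starts_us)
    # Data cannot finish moving before the last rank arrives, so time after the
    # last arrival approximates the pure communication cost.
    comm_us = max(ends_us) - max(starts_us)
    # Bytes per microsecond is MB/s; divide by 1e3 for GB/s.
    algbw_gb_s = size_bytes / comm_us / 1e3
    busbw_gb_s = algbw_gb_s * 2 * (n - 1) / n
    return {
        "skew_us": skew_us,
        "comm_us": comm_us,
        "longest_rank_us": max(e - s for s, e in zip(starts_us, ends_us)),
        "algbw_gb_s": algbw_gb_s,
        "busbw_gb_s": busbw_gb_s,
    }


# Eight ranks; rank 7 arrives 4 ms late, and every rank finishes at 14 ms.
starts_us = [0, 0, 0, 0, 0, 0, 0, 4000]
ends_us = [14000] * 8

metrics = allreduce_metrics(1_000_000_000, starts_us, ends_us)
for key, value in metrics.items():
    print(f"{key:>16}: {value}")
```

Here the early ranks each report a 14 ms collective, but 4 ms of that is waiting for the late rank. Judging bandwidth from the 14 ms figure would understate the interconnect and hide the imbalance.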

For hardware engineers, this matters because AI scaling bottlenecks are often misdiagnosed.

The right question is not:

How much time did all-reduce take?

The better question is:

How much pure communication time was exposed on the critical path,
and how much was rank skew?

10. Trace comparison: prove the delta

Most performance work is comparative:

old kernel vs new kernel
CUDA vs ROCm
H100 vs MI300X
driver A vs driver B
BF16 vs FP8
FlashAttention backend vs FlashInfer backend
baseline vLLM vs optimized vLLM

TraceLens can compare reports and identify where time changed.

For simple cases, matching happens at the same operator level.

For more complex cases, TraceLens can use morphological comparison: it aligns trees and finds the lowest point where the call stacks diverge.

That is useful when two backends implement the same framework op differently.

Example:

aten::_convolution
  -> ROCm backend subtree
  -> CUDA backend subtree

The framework-level intent is the same, but the backend subtree differs.

Trace comparison lets you say:

This change improved convolution backward by X ms.
This change regressed aten::mm by Y ms.
This backend moved time from one subtree into another.

That is the standard you should use for optimization claims.
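
For the simple case where ops match by name, the comparison reduces to a keyed diff. The sketch below assumes two dictionaries that map an op name to total GPU time in milliseconds. The numbers are invented, and tree alignment for diverging backends is not covered.

```python
def compare_reports(baseline, candidate):
    """Per-op time deltas, largest absolute change first."""
    rows = []
    for op in sorted(set(baseline) | set(candidate)):
        base_ms = baseline.get(op, 0.0)  # an op missing from one side counts as 0 ms
        cand_ms = candidate.get(op, 0.0)
        rows.append(
            {
                "op": op,
                "baseline_ms": base_ms,
                "candidate_ms": cand_ms,
                "delta_ms": cand_ms - base_ms,
            }
        )
    return sorted(rows, key=lambda r: abs(r["delta_ms"]), reverse=True)


baseline = {"aten::mm": 12.0, "aten::convolution_backward": 8.0, "aten::copy_": 1.0}
candidate = {
    "aten::mm": 13.5,
    "aten::convolution_backward": 5.0,
    "aten::copy_": 1.0,
    "aten::to": 0.4,
}

deltas = compare_reports(baseline, candidate)
for row in deltas:
    print(f"{row['op']:<28} {row['baseline_ms']:6.1f} -> {row['candidate_ms']:6.1f}  ({row['delta_ms']:+.1f} ms)")
```

Sorting by absolute delta puts the largest improvement and the largest regression at the top of the same table, so a net win does not bury a regression.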

11. Event replay: isolate the slow operation

Finding a slow operation is only half the job.

You still need a reproducer.

Full model reproducers are often difficult to share because they include:

proprietary model architecture
private weights
production input data
distributed launch setup
complex environment state

TraceLens event replay generates a minimal script from trace metadata:

operation type
tensor shapes
dtypes
strides
relevant arguments

This produces a focused benchmark case.
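
To make the idea concrete, the sketch below renders a standalone PyTorch benchmark script from one op's metadata. It is a hand-written illustration, not TraceLens's replay format or API: the field names (`input_dims`, `input_strides`, `input_types`), the dtype string mapping, and `make_replay_script` are assumptions. The generator itself needs only the standard library; the script it emits needs PyTorch and a GPU.

```python
# Assumed mapping from profiler dtype strings to torch dtype names.
DTYPES = {
    "c10::BFloat16": "torch.bfloat16",
    "c10::Half": "torch.float16",
    "float": "torch.float32",
}


def make_replay_script(event, warmup=10, iters=100):
    """Render a benchmark script from one op's recorded shapes, strides, and dtypes."""
    namespace, op = event["name"].split("::")
    lines = ["import torch", ""]
    arg_names = []
    inputs = zip(event["input_dims"], event["input_strides"], event["input_types"])
    for i, (dims, strides, dtype) in enumerate(inputs):
        arg_names.append(f"a{i}")
        # Random data with the recorded layout; no real weights or inputs needed.
        lines.append(
            f"a{i} = torch.empty_strided({tuple(dims)}, {tuple(strides)}, "
            f'dtype={DTYPES[dtype]}, device="cuda").normal_()'
        )
    call = f"op({', '.join(arg_names)})"
    lines += [
        f"op = torch.ops.{namespace}.{op}",
        f"for _ in range({warmup}):  # warmup",
        f"    {call}",
        "start = torch.cuda.Event(enable_timing=True)",
        "end = torch.cuda.Event(enable_timing=True)",
        "start.record()",
        f"for _ in range({iters}):",
        f"    {call}",
        "end.record()",
        "torch.cuda.synchronize()",
        f'print(start.elapsed_time(end) / {iters}, "ms per call")',
    ]
    return "\n".join(lines) + "\n"


# Hypothetical metadata for a slow matmul whose second operand is transposed.
slow_mm = {
    "name": "aten::mm",
    "input_dims": [[8, 4096], [4096, 4096]],
    "input_strides": [[4096, 1], [1, 4096]],
    "input_types": ["c10::BFloat16", "c10::BFloat16"],
}

script = make_replay_script(slow_mm)
print(script)
```

The emitted script contains only an op name, shapes, strides, and dtypes, which is why a reproducer of this kind can be shared without exposing the model or its data.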

Why this matters:

model team finds a slow op
  -> event replay creates minimal reproducer
  -> kernel/compiler/vendor team debugs it
  -> fix is validated against original trace

This is how you turn model-level performance debugging into systems engineering.

12. Where TraceLens fits in the agent stack

Agent systems need performance evidence at multiple layers:

application:
  user latency, tool latency, session throughput

agent runtime:
  planning time, tool-call count, retry count, context growth

model server:
  TTFT, ITL, output tokens/s, queueing, KV-cache memory

GPU:
  kernels, copies, collectives, idle time, roofline metrics

TraceLens focuses on the lower layers, but it should be connected to the higher layers.

For OpenClaw-style systems, a useful workflow is:

run representative agent workload
  -> collect model-server trace
  -> generate TraceLens report
  -> identify bottleneck
  -> create event replay if needed
  -> change model/backend/kernel/config
  -> compare traces
  -> attach report to deployment decision

This pairs directly with Lecture 36.

If you enable FP8 KV-cache, do not only report:

It feels faster.

Report:

TTFT
ITL
output tokens/s
GPU memory
operator-level time
attention kernel time
copy time
communication exposure
accuracy
trace comparison delta

13. Practical workflow

Install TraceLens:

pip install git+https://github.com/AMD-AGI/TraceLens.git

Generate a performance report from a PyTorch trace:

TraceLens_generate_perf_report_pytorch \
  --profile_json_path path/to/your/trace.json

Compare two reports:

TraceLens_compare_perf_reports_pytorch \
  baseline.xlsx candidate.xlsx \
  --names baseline candidate \
  --sheets all \
  -o comparison.xlsx

Generate a multi-rank collective report:

TraceLens_generate_multi_rank_collective_report_pytorch \
  --trace_dir /path/to/traces \
  --world_size 8

For ROCm rocprofv3 Perfetto-style traces:

rocprofv3 \
  --hip-trace \
  --kernel-trace \
  --memory-copy-trace \
  --rccl-trace \
  --output-format pftrace \
  -d ./v3_traces \
  -- python3 your_app.py

TraceLens_generate_perf_report_pftrace_hip_activity \
  --trace_path sample.pftrace \
  --write_md

Use these commands as starting points, not a complete profiling policy.

For reliable comparisons, also pin:

model revision
input workload
batch/concurrency
sequence lengths
dtype policy
GPU clocks if relevant
driver/runtime versions
backend flags
warmup count
random seeds if accuracy is involved
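
One way to enforce that list is to store a small manifest next to each trace and diff the manifests before diffing the traces. This is a sketch of that habit, not a TraceLens feature. The `RunManifest` fields mirror a subset of the list above, and every value shown is a placeholder.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunManifest:
    model_revision: str
    workload: str
    concurrency: int
    sequence_lengths: tuple
    dtype_policy: str
    driver_runtime: str
    backend_flags: tuple
    warmup_iters: int
    seed: int


def unexpected_differences(a, b, allowed=()):
    """Fields that differ between two runs, excluding the intended change."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k] and k not in allowed)


# Placeholder values for illustration only.
baseline = RunManifest(
    model_revision="rev-placeholder",
    workload="agent-replay-set-A",
    concurrency=16,
    sequence_lengths=(1024, 8192),
    dtype_policy="bf16",
    driver_runtime="runtime-placeholder",
    backend_flags=(),
    warmup_iters=20,
    seed=0,
)
candidate = replace(baseline, dtype_policy="fp8-kv-cache")
sloppy = replace(candidate, warmup_iters=0)

# Store the manifest alongside the trace so the comparison can be audited later.
print(json.dumps(asdict(candidate), indent=2))
print(unexpected_differences(baseline, candidate, allowed=("dtype_policy",)))
print(unexpected_differences(baseline, sloppy, allowed=("dtype_policy",)))
```

A non-empty result means the two traces differ in more than the change under test, so any delta between them cannot be attributed cleanly.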

14. Diagnostic playbook

Use TraceLens reports to choose the next action.

Case: high idle time

Likely causes:

CPU dataloader bottleneck
Python overhead
synchronization
dynamic shape recompilation
queue starvation
model server scheduling gap

Next actions:

inspect CPU dispatch timeline
check batch construction
reduce Python in the hot path
verify warmup and compile behavior
correlate with request-level telemetry

Case: GEMM dominates but TFLOPS/s is low

Likely causes:

small or awkward matrix shapes
poor alignment
bad layout
backend algorithm choice
unnecessary transposes
low occupancy

Next actions:

inspect shape-level rows
compare against hardware peak and expected roofline
generate event replay for the slow shape
test alternative kernels or layouts

Case: attention dominates long-context serving

Likely causes:

KV-cache memory traffic
head dimension issue
backend selection
context length distribution
low-precision conversion overhead

Next actions:

compare BF16 vs FP8 traces
inspect attention kernel time
test skip policies for sliding-window layers
validate accuracy on real prompts
connect to Lecture 36's TTFT/ITL benchmark plan

Case: exposed communication is high

Likely causes:

poor overlap
collective algorithm issue
rank imbalance
network bottleneck
tensor-parallel or FSDP layout mismatch

Next actions:

inspect skew vs pure communication time
compare algorithmic and bus bandwidth
inspect per-rank timelines
adjust sharding, bucket sizes, or overlap strategy

Case: regression after a software update

Likely causes:

backend kernel selection changed
compiler changed generated code
framework changed dispatch path
dtype/layout behavior changed
fusion changed

Next actions:

run TraceLens comparison
identify the largest positive deltas
inspect lowest divergent subtree
replay the slow op
keep the trace report with the change review

15. How this connects to agent reliability

Earlier lectures focused on agent skills, structured tools, and verification.

TraceLens applies the same idea to performance:

claim:
  "This optimization improves long-context serving."

required evidence:
  trace report
  before/after comparison
  workload description
  bottleneck explanation
  accuracy check
  deployment decision

This matters because agent systems invite vague performance claims:

"The model is slow."
"The GPU is the bottleneck."
"Communication is killing scaling."
"FP8 is faster."
"The new backend regressed."

TraceLens helps turn those into falsifiable statements:

On this workload, aten::mm for this shape regressed by X ms.
On this run, exposed communication is Y% and skew is Z us.
On this model, attention kernel time dropped but copy time increased.
On this backend, GPU idle time is dominated by CPU dispatch gaps.

That is the engineering standard.

Mini-lab: Trace a model change

Choose one small but real workload:

a PyTorch transformer block
a vLLM benchmark
a multi-GPU training step
a custom CUDA/ROCm kernel path
an OpenClaw agent workload that calls a local model server

Run two configurations:

baseline
candidate

Examples:

BF16 vs FP8
old backend vs new backend
CUDA vs ROCm
single GPU vs multi-GPU
eager vs compiled
with fusion vs without fusion

Collect traces, then generate TraceLens reports.

Write a short performance note:

Workload:
Hardware:
Software versions:
Main bottleneck:
Biggest improvement:
Biggest regression:
Evidence:
Deployment decision:
Next experiment:

If you find one slow operation, generate an event replay and treat it as a kernel debugging task.

Key takeaways

Raw profiler traces are too large and flat for reliable manual analysis.
TraceLens turns framework traces into hierarchical, queryable evidence.
Trace2Tree connects Python modules, framework ops, CPU dispatches, runtime launches, and GPU kernels.
Start diagnosis at the GPU timeline, then drill into categories, ops, shapes, and kernels.
Roofline-style metrics help distinguish "slow" from "inefficient."
Multi-GPU collective time must be separated into pure communication and synchronization skew.
Trace comparison is the right way to prove a performance change.
Event replay turns trace metadata into minimal reproducers for kernel and backend debugging.
For agent systems, performance claims should be backed by traces, comparisons, and workload descriptions.

References

AMD ROCm Blog, "TraceLens: Democratizing AI Performance Analysis": https://rocm.blogs.amd.com/software-tools-optimization/tracelens/README.html
TraceLens GitHub repository: https://github.com/AMD-AGI/TraceLens
PyTorch profiler documentation: https://docs.pytorch.org/docs/stable/profiler.html
Perfetto trace viewer: https://ui.perfetto.dev
Lecture 36 - FP8 KV-Cache in vLLM: Lecture-36.md
