Lecture 1: LLVM IR & Architecture¶
Overview¶
Every AI compiler — TVM, Triton, tinygrad, XLA, torch.compile — eventually needs to turn high-level tensor operations into machine instructions that run on a specific chip. LLVM is the infrastructure that makes this possible without writing a full compiler from scratch for every new hardware target. The core challenge this lecture addresses is: how does LLVM represent programs in a way that is both hardware-independent (so optimizations apply everywhere) and hardware-aware (so the final code exploits specific chip features)? The mental model to carry forward is that LLVM is a universal translation layer: frontends (C, Rust, CUDA, ML compilers) lower their programs into LLVM IR, LLVM optimizes the IR, and then a target-specific backend generates native code. For an AI chip designer, understanding LLVM IR means you can write a backend for your custom accelerator and immediately inherit decades of compiler optimizations — without reimplementing constant folding, dead code elimination, or loop unrolling.
Why LLVM Matters for AI Hardware¶
| Without LLVM | With LLVM |
|---|---|
| Every new chip needs a full compiler from scratch | Write only the backend; inherit the optimization pipeline |
| Optimizations reimplemented per target | ~70 optimization passes shared across all targets |
| ML frameworks must generate assembly directly | Frameworks emit LLVM IR; LLVM handles instruction selection, register allocation, scheduling |
| No ecosystem tooling | Get debuggers (LLDB), profilers, sanitizers, LTO for free |
Concrete examples in AI: - TVM uses LLVM to generate CPU kernels (x86 AVX-512, ARM NEON/SVE) and GPU kernels (NVPTX for CUDA, AMDGPU for ROCm) - Triton (used by torch.compile) lowers to LLVM IR → NVPTX backend → PTX → cubin - Julia (used in Flux.jl ML framework) compiles to LLVM IR - MLIR (next lectures) is built on top of LLVM's infrastructure - Custom AI accelerator compilers (Cerebras, SambaNova, Graphcore) use LLVM backends
LLVM Architecture: The Three-Phase Design¶
┌──────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Frontend │ │ Middle-End │ │ Backend │
│ │ │ (Optimizer) │ │ (CodeGen) │
│ C → Clang │ │ │ │ │
│ Rust → rustc│────▶│ LLVM IR │────▶│ x86-64 │
│ CUDA → nvcc │ │ Optimization │ │ AArch64 │
│ TVM → codegen│ │ Passes │ │ NVPTX (GPU) │
│ Triton │ │ │ │ AMDGPU │
│ │ │ │ │ RISC-V │
│ │ │ │ │ Your custom chip │
└──────────────┘ └──────────────────┘ └──────────────────┘
The key insight is separation of concerns: - Frontends only need to emit valid LLVM IR — they don't know about registers, instruction encoding, or pipeline hazards - The optimizer works on LLVM IR and is target-independent (mostly) — it doesn't know whether the code came from C or a neural network compiler - Backends only need to translate LLVM IR to machine code — they don't know whether the program is a web server or a matrix multiply kernel
Key Insight: This three-phase design is why LLVM dominates compiler infrastructure. Adding a new source language means writing one frontend. Adding a new hardware target means writing one backend. The N×M problem (N languages × M targets) becomes N+M. For AI hardware, this means your custom chip gets a compiler for free once you write the backend — and every ML framework that emits LLVM IR can target your chip.
LLVM IR: The Language Between Languages¶
LLVM IR is a typed, SSA-based, low-level virtual instruction set. Each of these properties matters:
Static Single Assignment (SSA)¶
Every variable is assigned exactly once. Instead of mutating variables, you create new ones.
; NOT SSA (imperative style — LLVM doesn't allow this):
; x = 5
; x = x + 3 ← x is assigned twice
; SSA form (what LLVM requires):
%x1 = add i32 5, 3 ; %x1 = 8, assigned once
%x2 = mul i32 %x1, 2 ; %x2 = 16, assigned once
Why SSA? It makes optimization trivially correct. If %x1 is defined exactly once, every use of %x1 sees the same value — no need to track which assignment reaches which use. This enables constant propagation, dead code elimination, and common subexpression elimination to be simple graph transformations.
When control flow merges (e.g., after an if-else), SSA uses phi nodes to select which value to use:
; if (cond) { a = 1; } else { a = 2; }
; result = a;
entry:
br i1 %cond, label %then, label %else
then:
br label %merge
else:
br label %merge
merge:
%a = phi i32 [ 1, %then ], [ 2, %else ]
; %a is 1 if we came from %then, 2 if from %else
Type System¶
LLVM IR is strongly typed. Every value has an explicit type. This catches errors early and enables type-based optimizations.
| Type | Syntax | AI Relevance |
|---|---|---|
| Integer | i1, i8, i16, i32, i64 |
INT8/INT4 quantized weights, boolean masks |
| Float | half (f16), bfloat (bf16), float (f32), double (f64) |
Model precision: FP16/BF16 training, FP32 accumulation |
| Vector | <4 x float>, <16 x i8> |
SIMD: AVX-512 processes 16 floats, NEON processes 16 bytes |
| Pointer | ptr (opaque since LLVM 15) |
Memory addresses for tensor buffers |
| Array | [1024 x float] |
Fixed-size tensor dimensions |
| Struct | { i32, float, ptr } |
Tensor metadata (shape, stride, data pointer) |
Key Instructions¶
Arithmetic:
%sum = add i32 %a, %b ; integer add
%prod = fmul float %x, %y ; floating-point multiply
%mac = call float @llvm.fmuladd.f32(float %a, float %b, float %c)
; fused multiply-add: a*b + c
; maps to FMA instruction on hardware that supports it
Memory:
%ptr = alloca [1024 x float] ; allocate 1024 floats on stack
%val = load float, ptr %ptr ; load from memory
store float %result, ptr %out_ptr ; store to memory
%elem = getelementptr [1024 x float], ptr %ptr, i64 0, i64 %idx
; pointer arithmetic for array indexing
Control flow:
br i1 %cond, label %true_bb, label %false_bb ; conditional branch
br label %loop_header ; unconditional branch
ret float %result ; return
Vector operations (critical for AI):
; Load 8 floats from memory as a vector
%vec = load <8 x float>, ptr %tensor_ptr
; Element-wise multiply two 8-float vectors
%mul = fmul <8 x float> %vec_a, %vec_b
; Horizontal reduction (sum all 8 elements)
%sum = call float @llvm.vector.reduce.fadd.v8f32(float 0.0, <8 x float> %mul)
; Shuffle: rearrange elements (used in transpose, gather operations)
%shuf = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
LLVM IR in Practice: A Dot Product¶
Consider computing the dot product of two 4-element vectors — the fundamental operation inside every matrix multiply, which is the core of neural network computation.
C source:
float dot4(float *a, float *b) {
float sum = 0.0f;
for (int i = 0; i < 4; i++)
sum += a[i] * b[i];
return sum;
}
LLVM IR (simplified, after optimization):
define float @dot4(ptr %a, ptr %b) {
entry:
; Load all 4 elements as vectors
%va = load <4 x float>, ptr %a, align 16
%vb = load <4 x float>, ptr %b, align 16
; Element-wise multiply
%vmul = fmul <4 x float> %va, %vb
; Horizontal sum (reduce)
%sum = call float @llvm.vector.reduce.fadd.v4f32(float 0.0, <4 x float> %vmul)
ret float %sum
}
What the backends produce:
| Target | Generated instructions |
|---|---|
| x86-64 (AVX) | vmovaps → vmulps → vhaddps → vhaddps |
| AArch64 (NEON) | ld1 → fmul → faddp → faddp |
| NVPTX (CUDA) | ld.global.v4.f32 → fma.rn.f32 (×4) |
The same LLVM IR produces optimal code for three completely different architectures. This is the power of the three-phase design.
LLVM IR Representations: Three Forms¶
LLVM IR exists in three equivalent forms:
| Form | Extension | Use Case |
|---|---|---|
| Human-readable text | .ll |
Debugging, learning, manual inspection |
| Bitcode (binary) | .bc |
On-disk storage, LTO (Link-Time Optimization) |
| In-memory C++ objects | — | Programmatic construction by frontends and passes |
# Compile C to LLVM IR (text)
clang -S -emit-llvm -O2 dot4.c -o dot4.ll
# Compile to bitcode
clang -c -emit-llvm -O2 dot4.c -o dot4.bc
# Convert between forms
llvm-as dot4.ll -o dot4.bc # text → bitcode
llvm-dis dot4.bc -o dot4.ll # bitcode → text
# Compile bitcode to native object
llc dot4.bc -o dot4.o -filetype=obj
# Run optimization passes on bitcode
opt -O2 dot4.bc -o dot4_opt.bc
Key Insight: ML compilers like TVM construct LLVM IR programmatically using the C++ API (or the C bindings). They never write
.llfiles — they buildllvm::Module,llvm::Function,llvm::BasicBlock, andllvm::Instructionobjects in memory, then call the backend to emit machine code directly. Understanding the text form is essential for debugging, but the programmatic API is what you'll use in practice.
Module Structure¶
An LLVM IR module is a complete compilation unit. For an AI kernel, a module typically contains one or a few functions (the kernel entry points) plus metadata.
; Module-level declarations
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
; Global constant (e.g., lookup table for activation function)
@relu_lut = internal constant [256 x i8] [ ... ]
; Function declaration (external — will be linked later)
declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)
; Function definition (the actual kernel)
define void @matmul_4x4(ptr %A, ptr %B, ptr %C) #0 {
entry:
; ... kernel body ...
ret void
}
; Function attributes
attributes #0 = { nounwind "target-cpu"="skylake" "target-features"="+avx2,+fma" }
; Metadata (debug info, optimization hints)
!llvm.module.flags = !{!0}
!0 = !{i32 1, !"wchar_size", i32 4}
Key fields for AI compilers:
- target datalayout — endianness, pointer sizes, alignment. Your custom chip defines its own.
- target triple — arch-vendor-os. For NVIDIA GPUs: nvptx64-nvidia-cuda. For your chip: myaccel-unknown-unknown.
- attributes — tell the optimizer which hardware features are available (AVX-512, FMA, etc.)
Address Spaces¶
LLVM supports multiple address spaces via the pointer type, which is critical for GPU and accelerator programming where different memory regions have different performance characteristics.
| Address Space | GPU Meaning | AI Relevance |
|---|---|---|
| 0 (default) | Generic | CPU memory, or generic GPU memory |
| 1 | Global | Tensor data in VRAM (weights, activations) |
| 3 | Shared | Tile buffers in on-chip SRAM (shared memory on GPU) |
| 4 | Constant | Read-only data (quantization scale factors, LUTs) |
| 5 | Local (private) | Per-thread registers/stack |
; Load from global memory (GPU VRAM)
%val = load float, ptr addrspace(1) %global_ptr
; Store to shared memory (on-chip SRAM)
store float %val, ptr addrspace(3) %shared_ptr
For a custom AI accelerator, you define your own address space mapping to describe your memory hierarchy (e.g., address space 1 = weight SRAM, 2 = activation SRAM, 3 = accumulator registers).
Intrinsics: Hardware-Specific Operations¶
Intrinsics are LLVM's mechanism for exposing hardware-specific operations that have no equivalent in generic IR. They look like function calls but compile to specific instructions.
; Fused multiply-add (maps to FMA instruction)
%fma = call float @llvm.fmuladd.f32(float %a, float %b, float %c)
; Vector reduction (maps to horizontal add on x86, FADDP on ARM)
%sum = call float @llvm.vector.reduce.fadd.v8f32(float 0.0, <8 x float> %v)
; Matrix multiply intrinsic (AMX on Sapphire Rapids)
call void @llvm.x86.tamx.tdpbf16ps(i8 %dst, i8 %src1, i8 %src2)
; NVIDIA-specific: warp shuffle (requires NVPTX backend)
%shfl = call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %val, i32 %lane, i32 31)
For custom accelerator design: You define your own intrinsics. If your chip has a dedicated MAC (multiply-accumulate) unit that operates on 8×8 INT8 tiles, you'd define:
The backend pattern-matches this intrinsic to your hardware instruction.
Building LLVM IR Programmatically¶
This is how ML compilers (TVM, Triton, etc.) actually construct LLVM IR — not by writing text files, but by using the C++ API.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
// Create context and module
LLVMContext ctx;
Module mod("ai_kernel", ctx);
IRBuilder<> builder(ctx);
// Define function: void relu(float* input, float* output, int n)
Type *floatPtrTy = PointerType::get(Type::getFloatTy(ctx), 0);
Type *i32Ty = Type::getInt32Ty(ctx);
FunctionType *fnTy = FunctionType::get(
Type::getVoidTy(ctx),
{floatPtrTy, floatPtrTy, i32Ty},
false
);
Function *relu = Function::Create(fnTy, Function::ExternalLinkage, "relu", mod);
// Create basic blocks
BasicBlock *entry = BasicBlock::Create(ctx, "entry", relu);
BasicBlock *loop = BasicBlock::Create(ctx, "loop", relu);
BasicBlock *exit = BasicBlock::Create(ctx, "exit", relu);
// Entry block: initialize loop counter
builder.SetInsertPoint(entry);
builder.CreateBr(loop);
// Loop block: load, compare, select (ReLU), store
builder.SetInsertPoint(loop);
PHINode *i = builder.CreatePHI(i32Ty, 2, "i");
i->addIncoming(ConstantInt::get(i32Ty, 0), entry);
// ... load input[i], compute max(0, val), store to output[i] ...
// ... increment i, branch back or exit ...
Key Insight: TVM's
codegen_llvm.ccdoes exactly this — it walks the TVM IR (TIR) and emits LLVMIRBuildercalls for each node. Understanding the IRBuilder API is essential if you're building a compiler for a custom AI accelerator. You don't need to be a compiler expert — the API is straightforward once you understand the IR structure.
Hands-On Exercises¶
-
Inspect ML compiler output: Install TVM, compile a simple model (e.g., element-wise ReLU on a 1024-element tensor) for the
llvmtarget. Extract the LLVM IR withmod.get_source("ll"). Identify the loop structure, vectorization, and any intrinsics used. -
Write a dot product in LLVM IR: Write a
.llfile that computes the dot product of two 8-element float vectors using vector instructions. Compile withllcfor both x86-64 and AArch64 targets. Compare the generated assembly. -
Explore address spaces: Write LLVM IR that uses address space 1 (global) and address space 3 (shared). Compile for the NVPTX target with
llc -march=nvptx64. Examine the PTX output — you should seeld.globalandld.sharedinstructions. -
Custom intrinsic sketch: Design the intrinsic signature for a hypothetical accelerator that has a fused Conv2D+ReLU instruction operating on 3×3 INT8 patches. What are the input types, output types, and side effects?
Key Takeaways¶
| Concept | Why It Matters for AI Hardware |
|---|---|
| SSA form | Enables the optimization passes that make generated code fast |
| Type system | Maps directly to hardware data types (INT8, FP16, BF16) |
| Vector types | The IR representation of SIMD — how AI compilers express parallelism within a thread |
| Address spaces | Model the memory hierarchy of GPUs and custom accelerators |
| Intrinsics | The escape hatch for hardware-specific operations (tensor cores, MAC units) |
| Three-phase design | Write one backend → get all ML frameworks for free |
Resources¶
- LLVM Language Reference Manual: The authoritative specification of LLVM IR — every type, instruction, and intrinsic.
- LLVM Programmer's Manual: How to use the C++ API (IRBuilder, PassManager, etc.).
- "Getting Started with LLVM Core Libraries" by Bruno Cardoso Lopes and Rafael Auler: Practical book for building tools on LLVM.
- Mapping High-Level Constructs to LLVM IR: Community guide for translating programming patterns to IR.
- TVM source:
src/target/llvm/codegen_llvm.cc: Real-world example of an ML compiler generating LLVM IR programmatically.