LLM Optimization on Jetson: From Cloud Techniques to Edge Reality

Goal: Take the optimization techniques used by cloud LLM platforms (vLLM, TensorRT-LLM, RunInfra, etc.) and adapt them to the extreme constraints of the Jetson Orin Nano Super 8 GB: 102 GB/s memory bandwidth, up to 67 TOPS (sparse INT8) from the GPU alone (the Orin Nano has no DLA; that accelerator starts at Orin NX), memory shared between CPU and GPU, and a 7–25W power envelope.

Pre-Flight: System Check Before Starting

Before any LLM work, verify your Jetson's software stack. JetPack version determines which CUDA, TensorRT, and cuDNN versions you have — and which LLM tools are compatible.

Quick System Audit (copy-paste this)

#!/bin/bash
echo "═══════════════════════════════════════════════"
echo "  Jetson System Audit for LLM Deployment"
echo "═══════════════════════════════════════════════"

echo ""
echo "▸ JetPack / L4T version:"
cat /etc/nv_tegra_release 2>/dev/null || echo "  (not found — check dpkg)"
dpkg-query --show nvidia-l4t-core 2>/dev/null | awk '{print "  L4T:", $2}'

echo ""
echo "▸ CUDA version:"
nvcc --version 2>/dev/null | grep release || echo "  nvcc not found"

echo ""
echo "▸ TensorRT version:"
dpkg -l | grep tensorrt | head -1 | awk '{print " ", $3}'

echo ""
echo "▸ cuDNN version:"
dpkg -l | grep cudnn | head -1 | awk '{print " ", $3}'

echo ""
echo "▸ Python version:"
python3 --version

echo ""
echo "▸ Total RAM:"
free -m | awk '/Mem:/ {print "  " $2 " MB total"}'
echo "▸ Free RAM:"
free -m | awk '/Mem:/ {print "  " $7 " MB available"}'

echo ""
echo "▸ CMA allocation:"
grep Cma /proc/meminfo | awk '{print "  " $0}'

echo ""
echo "▸ GPU info:"
# devfreq node is 17000000.ga10b on JetPack 5 and 17000000.gpu on JetPack 6; the glob covers both
cat /sys/class/devfreq/17000000.*/cur_freq 2>/dev/null \
    | awk '{print "  GPU freq: " $1/1000000 " MHz"}'

echo ""
echo "▸ Power mode:"
sudo nvpmodel -q 2>/dev/null | head -2 | sed 's/^/  /'

echo ""
echo "▸ Disk space:"
df -h / | tail -1 | awk '{print "  Root: " $4 " free of " $2}'
df -h /dev/nvme0n1p1 2>/dev/null | tail -1 | awk '{print "  NVMe: " $4 " free of " $2}'

echo ""
echo "▸ Thermal:"
cat /sys/devices/virtual/thermal/thermal_zone*/temp 2>/dev/null | head -3 \
    | awk '{print "  Zone: " $1/1000 "°C"}'

echo "═══════════════════════════════════════════════"

Example output (Orin Nano Super, JetPack 6.1):

═══════════════════════════════════════════════
  Jetson System Audit for LLM Deployment
═══════════════════════════════════════════════

▸ JetPack / L4T version:
  # R36 (release), REVISION: 4.0
  L4T: 36.4.0-20241031080721

▸ CUDA version:
  Cuda compilation tools, release 12.6, V12.6.77

▸ TensorRT version:
  10.3.0.30-1+cuda12.6

▸ cuDNN version:
  9.3.0.75-1+cuda12.6

▸ Python version:
  Python 3.10.12

▸ Total RAM:
  7633 MB total
▸ Free RAM:
  5814 MB available

▸ CMA allocation:
  CmaTotal:      786432 kB
  CmaFree:       654321 kB

▸ GPU info:
  GPU freq: 624 MHz

▸ Power mode:
  NV Power Mode: MAXN
  Power Mode: 25W

▸ Disk space:
  Root: 42G free of 100G

▸ Thermal:
  Zone: 38.5°C

═══════════════════════════════════════════════

JetPack → CUDA → TensorRT Compatibility Matrix

This matrix determines which LLM tools work on your Jetson:

JetPack	L4T	CUDA	TensorRT	cuDNN	Python	llama.cpp	Ollama	TRT-LLM
6.1	R36.4	12.6	10.3	9.3	3.10	Yes	Yes	Yes (0.15+)
6.0	R36.3	12.2	8.6	8.9	3.10	Yes	Yes	Yes (0.9+)
5.1.3	R35.5	11.4	8.5	8.6	3.8	Yes	Yes	Limited
5.1.1	R35.3	11.4	8.5	8.6	3.8	Yes	Older	No
5.0.2	R35.1	11.4	8.4	8.4	3.8	Yes	No	No

Recommendation: Use JetPack 6.1 (R36.4) for LLM work. It has CUDA 12.6, TensorRT 10.3, and full TensorRT-LLM support. If you're on JetPack 5.x, consider upgrading — the CUDA 12.x ecosystem (llama.cpp, Ollama, PyTorch 2.x) is significantly better.

Pre-Flight Checklist

Before deploying any LLM on Jetson:

□ JetPack version
  □ JetPack 6.0+ for TensorRT-LLM
  □ JetPack 5.1+ minimum for llama.cpp / Ollama

□ Available memory
  □ Run: free -m → note "available" column
  □ Expect ~5.5–6 GB free on stock Orin Nano Super 8 GB
  □ If < 5 GB: disable GUI (sudo systemctl set-default multi-user.target)
  □ If < 4 GB: reduce CMA, disable unnecessary services

□ Storage
  □ NVMe recommended (models are 1–5 GB each)
  □ SD card works but slower model loading
  □ At least 20 GB free for models + build cache

□ Power mode
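  □ sudo nvpmodel -m 0  (highest-power mode is 25W class on Orin Nano Super; confirm the mode ID with sudo nvpmodel -q --verbose, since IDs differ between JetPack releases)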
  □ sudo jetson_clocks   (lock to max frequency)

□ Thermal
  □ Active cooling attached (fan or heatsink with fan)
  □ Ambient temperature < 35°C for sustained workloads
  □ Monitor: tegrastats --interval 1000

Disable GUI to Free ~500 MB RAM

For headless LLM deployment, disable the desktop environment:

# Switch to text-only mode (saves ~500 MB RAM)
sudo systemctl set-default multi-user.target
sudo reboot

# To re-enable GUI later:
sudo systemctl set-default graphical.target
sudo reboot

# Verify RAM freed:
free -m   # "available" should increase by ~500 MB

This single change can be the difference between a 3B model fitting comfortably and running out of memory.

0. The Optimization Stack: Cloud vs Jetson

Cloud platforms like RunInfra, Together AI, and Fireworks AI deploy LLMs using a layered optimization stack. Most of these techniques have a Jetson equivalent (multi-GPU tensor parallelism is the obvious exception), but the priorities are reordered because single-stream inference on Jetson is limited by memory capacity and memory bandwidth long before it is limited by compute.

Cloud GPU (H100 80GB, 3,350 GB/s, 989 TFLOPS):
  ┌─────────────────────────────────────────────┐
  │ 1. Quantization (FP8, AWQ 4-bit)           │ ← saves VRAM, improves throughput
  │ 2. FlashAttention-2                         │ ← keeps attention tiles in SRAM, cuts HBM traffic
  │ 3. Fused Kernels (RMSNorm, rotary, SwiGLU) │ ← fewer kernel launches
  │ 4. PagedAttention (vLLM)                    │ ← KV cache memory efficiency
  │ 5. Speculative Decoding                     │ ← higher tokens/sec
  │ 6. Batching (continuous batching)           │ ← amortize compute over requests
  │ 7. Tensor Parallelism (multi-GPU)          │ ← scale beyond 1 GPU
  └─────────────────────────────────────────────┘

Jetson Orin Nano Super 8GB (102 GB/s, up to 67 TOPS sparse INT8, GPU only; no DLA on Orin Nano):
  ┌─────────────────────────────────────────────┐
  │ 1. Quantization (INT4/INT8) ★★★★★          │ ← MANDATORY: model must fit in 5 GB
  │ 2. Model Selection ★★★★★                    │ ← choose models that fit (≤3B params)
  │ 3. KV Cache Management ★★★★                │ ← memory is the #1 constraint
  │ 4. FlashAttention / Fused Ops ★★★★         │ ← reduce bandwidth pressure
  │ 5. TensorRT-LLM Engine ★★★                 │ ← compiled, optimized execution
  │ 6. Speculative Decoding ★★★                │ ← higher tokens/sec within power budget
  │ 7. Batching ★★                              │ ← limited by memory, not compute
  └─────────────────────────────────────────────┘
  ★ = importance on Jetson (more ★ = more critical)

1. Quantization: The Most Important Optimization

On cloud GPUs, quantization is optional (saves cost). On Jetson, quantization is mandatory — without it, nothing fits.

1.1 Why Quantization Matters More on Jetson

Model: Llama 3.2 3B parameters

FP16:  3B × 2 bytes = 6.0 GB   ← won't fit (only ~5 GB free after OS/CMA)
INT8:  3B × 1 byte  = 3.0 GB   ← fits, but tight
INT4:  3B × 0.5 byte = 1.5 GB  ← fits comfortably, room for KV cache

Model: Phi-3 Mini 3.8B parameters

FP16:  3.8B × 2 bytes = 7.6 GB  ← impossible on 8 GB
INT4:  3.8B × 0.5 byte = 1.9 GB ← fits with room for context
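
The arithmetic above generalizes to any parameter count and bit width. A minimal sketch in Python (the function name is illustrative, and it deliberately ignores quantization scales, zero points, and file metadata, which add a few percent in practice):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Ideal weight footprint in GB (10^9 bytes), ignoring scales and metadata."""
    return params_billion * bits_per_weight / 8.0


# Reproduce the two examples above.
for name, params in [("Llama 3.2 3B", 3.0), ("Phi-3 Mini", 3.8)]:
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{name:13s} {label}: {weight_memory_gb(params, bits):.1f} GB")
```

Treat the result as a lower bound: real 4-bit files are somewhat larger because each weight group also stores a scale.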

1.2 Quantization Methods Ranked for Jetson

Method	Bits	Quality loss	Speed on Jetson	When to use
AWQ (Activation-Aware Weight)	4-bit	Very low	Fast (INT4 GEMM)	Best quality/size for Jetson
GPTQ	4-bit	Low	Fast	Alternative to AWQ, well-supported
INT8 PTQ (TensorRT)	8-bit	Minimal	Fastest	If model fits at INT8
FP8 (E4M3)	8-bit	Minimal	No hardware path on Orin (FP8 Tensor Cores arrive with Ada/Hopper)	Cloud GPUs; on Jetson Orin prefer INT8 or 4-bit
GGUF (llama.cpp)	2–8 bit	Mixed-precision	Good (CPU+GPU)	Easy deployment, any model
SqueezeLLM	3-4 bit	Low	Moderate	Extreme compression

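1.3 AWQ Quantization (Recommended for Jetson)

AWQ preserves quality by protecting salient weight channels: the roughly 1% of weights that matter most for output quality, identified from activation statistics on a small calibration set and scaled before quantization so they lose less precision.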

# Quantize on your workstation (not on Jetson — too slow)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"
quant_path = "phi3-mini-awq-int4"

model = AutoAWQForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# AWQ calibration (needs ~128 samples)
model.quantize(
    tokenizer,
    quant_config={
        "zero_point": True,
        "q_group_size": 128,
        "w_bit": 4,           # 4-bit weights
        "version": "GEMM"     # optimized GEMM kernel
    }
)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

1.4 GGUF with llama.cpp (Easiest Path)

llama.cpp runs on Jetson with CUDA support and handles mixed-precision quantization:

# On Jetson: install llama.cpp with CUDA
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Download a pre-quantized GGUF model
# (Llama 3.2 3B in Q4_K_M = ~2 GB, good quality/size balance)
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/\
Llama-3.2-3B-Instruct-Q4_K_M.gguf

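# Run inference
#   -ngl 99  offload all layers to the GPU
#   -c 2048  context length in tokens
# (a comment after a trailing backslash breaks the line continuation, so flags are documented here)
./build/bin/llama-cli \
    -m Llama-3.2-3B-Instruct-Q4_K_M.gguf \
    -ngl 99 \
    -c 2048 \
    -p "Explain how Jetson unified memory works:"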

GGUF quantization levels for Jetson:

Quantization	Bits	Size (3B model)	Quality	Recommended?
Q2_K	2.6	~1.0 GB	Poor	Only if nothing else fits
Q3_K_M	3.4	~1.3 GB	Acceptable	Memory-critical deployments
Q4_K_M	4.5	~1.8 GB	Good	Best balance for Jetson
Q5_K_M	5.5	~2.2 GB	Very good	If memory allows
Q6_K	6.6	~2.5 GB	Excellent	Best quality that fits
Q8_0	8.0	~3.0 GB	Near-FP16	Only for small models

2. Model Selection: The Complete Landscape

The most important "optimization" is choosing the right model. A well-quantized small model beats a poorly-fitting large model every time.

2.1 Full Model Catalog (Q4_K_M Quantization, Orin Nano Super 8 GB)

All values at Q4_K_M — a 4-bit mixed-precision quantization that reduces memory by ~75% with minimal quality loss.

Tier 1 — Runs comfortably (< 3 GB, room for long context + KV cache)

Model	Params	Q4_K_M size	Context	Gen. tok/s (est.)	Best for
TinyLlama 1.1B	1.1B	0.6 GB	2K	~65	Ultra-lightweight; draft model for speculative decoding when the target shares its Llama 2 tokenizer
Llama 3.2 1B	1.3B	0.7 GB	128K	~55	Lightweight chat, classification, tool calling
StableLM 2 1.6B	1.6B	0.9 GB	4K	~45	Compact, fast edge assistant
Gemma 3 1B	1B	0.6 GB	32K	~60	Multilingual, Google ecosystem
Gemma 2 2B	2.6B	1.5 GB	8K	~35	Multilingual, general purpose
Qwen 3 1.7B	1.7B	1.0 GB	32K	~50	Chinese + English, thinking mode
SmolLM2 1.7B	1.7B	1.0 GB	8K	~50	Hugging Face, compact, well-trained
Llama 3.2 3B	3.2B	1.8 GB	128K	~25	General chat, summarization, tool use
Qwen 2.5 3B	3B	1.7 GB	32K	~28	Bilingual general purpose

Tier 2: Fits but tight (2.2–4.9 GB, shorter context recommended)

Model	Params	Q4_K_M size	Context	Gen. tok/s (est.)	Best for
Phi-4 Mini	3.8B	2.3 GB	4K/128K	~20	Reasoning, math, code
Phi-3 Mini	3.8B	2.2 GB	4K/128K	~20	Reasoning, code
Qwen 3 4B	4B	2.4 GB	32K	~18	Thinking + non-thinking modes
Gemma 3 4B	4B	2.5 GB	128K	~17	Vision + language (multimodal)
Llama 3.1 8B	8B	4.7 GB	128K	~9	General purpose (needs short ctx; also listed in Tier 3). Llama 3.3 was released only at 70B
Mistral 7B v0.3	7B	4.1 GB	32K	~11	Chat, function calling
Qwen 3 8B	8B	4.9 GB	32K	~9	Bilingual, thinking mode

Tier 3 — Barely fits (4.5+ GB, heavily constrained)

Model	Params	Q4_K_M size	Max ctx	Gen. tok/s (est.)	Notes
Llama 3.1 8B	8B	4.7 GB	512–1K	~9	Functional but cramped
Mistral Nemo 12B	12B	7.0 GB	256	~5	Needs Q3_K or Q2_K to fit
Phi-4 14B	14B	8.4 GB	—	Won't fit	Use Orin NX 16 GB

Rule of thumb: Model Q4_K_M size + 2.5 GB (OS/CMA/CUDA) + KV cache must be < 8 GB. For Tier 2 models, cap context at 1–2K tokens. For Tier 3, consider Q3_K_M or Q2_K quantization.
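
The rule of thumb can be turned into a quick pre-download check. This is a sketch only: the function name is hypothetical, and the 2.5 GB overhead and 8 GB total are the figures from the rule above, not measured values.

```python
def fits_on_jetson(model_gb, kv_cache_gb, overhead_gb=2.5, total_gb=8.0):
    """Return (fits, headroom_gb) for: model + overhead + KV cache < total."""
    headroom = total_gb - (model_gb + overhead_gb + kv_cache_gb)
    return headroom > 0, headroom


# Sizes taken from the tier tables above; KV cache figures are rough guesses.
for name, model_gb, kv_gb in [
    ("Llama 3.2 3B", 1.8, 0.1),
    ("Llama 3.1 8B", 4.7, 0.3),
    ("Mistral Nemo 12B", 7.0, 0.1),
]:
    ok, headroom = fits_on_jetson(model_gb, kv_gb)
    print(f"{name:18s} fits={ok}  headroom={headroom:+.1f} GB")
```

A small positive headroom is not a guarantee; measure the real "available" figure with free -m and substitute it for the defaults.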

2.2 Models That Need Larger Jetson Hardware

For reference — what runs on bigger Jetson modules:

Model	Params	Q4_K_M	Min Jetson module	Notes
Phi-4 14B	14B	8.4 GB	Orin NX 16 GB	Good reasoning
Qwen 2.5 Coder 14B	14B	8.6 GB	Orin NX 16 GB	Code generation
Mistral Small 24B	24B	14 GB	AGX Orin 32 GB	Strong general purpose
DeepSeek-R1 Distill 32B	32B	19 GB	AGX Orin 32 GB	Reasoning chains
Llama 3.3 70B	70B	42 GB	AGX Orin 64 GB	Near-GPT-4 quality
Qwen 3 235B (MoE)	235B	~140 GB	Multi-GPU / cloud	22B active params
DeepSeek V3 671B (MoE)	671B	~380 GB	Multi-GPU cluster	37B active params

2.3 Latest Models Worth Watching (2025–2026)

The edge LLM landscape moves fast. Models to evaluate as they release:

Model family	Why it matters for Jetson
Qwen 3 (0.6B–235B)	Thinking + non-thinking modes, great small variants (1.7B, 4B)
Gemma 3 (1B–27B)	Multimodal (vision+text), good 1B and 4B for edge
Phi-4 Mini (3.8B)	Microsoft's strong reasoning at small size
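Llama 4 Scout/Maverick	MoE architecture with 17B active params, but the full weight set (109B for Scout) must be resident, far beyond 8 GB; watch for smaller MoE variants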
SmolLM2 (135M–1.7B)	Hugging Face's ultra-compact series
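Step 3.5 Flash (196B MoE, 11B active)	StepFun's open-source MoE reasoning model; only 11B active per token, 262K context, speed-optimized. The full ~110 GB weight set must still be resident, so it does not fit on an 8 GB Jetson (see the MoE section below)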
MiMo-V2-Pro (1T+ MoE)	Xiaomi's flagship — agentic scenarios, OpenClaw-compatible, 1M context, approaches Opus-4.6 quality. Too large for Jetson directly, but distilled/smaller MiMo variants are edge targets
MiMo (series, 1.5B–7B)	Xiaomi's edge-optimized models — MiMo-7B has strong math/code benchmarks, 1.5B fits Jetson easily
DeepSeek-R1 Distill (1.5B–70B)	Distilled reasoning — 1.5B variant fits easily
Nemotron Nano (series)	NVIDIA's own edge-optimized models

MoE Models: Why Active Parameters Matter for Jetson

MoE (Mixture of Experts) models activate only a fraction of their total parameters per token. This changes the memory math:

Dense model (Llama 3.2 3B):
  Total params = Active params = 3B
  Q4_K_M size: 1.8 GB  (must load ALL weights per token)
  Memory bandwidth per token: 1.8 GB

MoE model (Step 3.5 Flash 196B, 11B active):
  Total params: 196B → Q4_K_M: ~110 GB  (won't fit in 8 GB!)
  BUT active params per token: 11B → needs ~6.5 GB of the 110 GB

  Challenge: even though only 11B are active, the FULL 110 GB model
  must be in memory because different tokens may route to different experts.
  → MoE models need the full weight set loaded, not just active params.

  For Jetson: MoE only helps if the TOTAL model fits.
  Step 3.5 Flash (110 GB) → won't fit on 8 GB Jetson
  A hypothetical MoE with 6B total / 1.5B active (~3.6 GB at Q4_K_M) → would fit and be fast!

MoE models that could work on Jetson (if available in small total size):

Model	Total params	Active params	Q4_K_M total	Fits 8 GB?
Mixtral 8x0.5B (hypothetical)	4B	0.5B	~2.4 GB	Yes
Llama 4 Scout small variant	TBD	TBD	TBD	Watch for small MoE releases
Step Flash distilled	TBD	TBD	TBD	If StepFun releases smaller variant

The real edge MoE opportunity: models with <8B total parameters and <2B active. These would read only as many weight bytes per token as a ~2B dense model while keeping the capacity of a larger one, at the cost of holding every expert in memory. This is an active research area.
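
Because single-stream decode is bandwidth-bound, a rough ceiling on tokens/sec is memory bandwidth divided by the weight bytes read per token. The sketch below applies that to a dense model, to Step 3.5 Flash, and to the hypothetical small MoE from the table. The function and field names are illustrative, the 0.6 GB per billion parameters factor is an approximation for Q4_K_M, the 5 GB free-memory default is an assumption, and real throughput lands well below the ceiling.

```python
GB_PER_BILLION_Q4_K_M = 0.6  # assumption: roughly 4.8 bits per weight


def moe_edge_estimate(total_params_b, active_params_b,
                      free_memory_gb=5.0, bandwidth_gb_s=102.0):
    """Check whether all weights fit, and bound decode speed by active weights."""
    total_gb = total_params_b * GB_PER_BILLION_Q4_K_M    # must be resident
    active_gb = active_params_b * GB_PER_BILLION_Q4_K_M  # read per token
    return {
        "total_gb": total_gb,
        "fits": total_gb < free_memory_gb,
        "tok_s_ceiling": bandwidth_gb_s / active_gb,
    }


for name, total_b, active_b in [
    ("Dense 3B", 3.0, 3.0),
    ("Step 3.5 Flash", 196.0, 11.0),
    ("Hypothetical 8x0.5B MoE", 4.0, 0.5),
]:
    est = moe_edge_estimate(total_b, active_b)
    print(f"{name:24s} total={est['total_gb']:.1f} GB  fits={est['fits']}  "
          f"ceiling={est['tok_s_ceiling']:.0f} tok/s")
```

The two outputs answer different questions: total size decides whether the model loads at all, and active size decides how fast it can decode once loaded.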

2.4 Choosing the Right Model: Decision Flowchart

What's your use case?
│
├── Simple classification / extraction / tool calling
│   └── Llama 3.2 1B or Qwen 3 1.7B  (< 1 GB, ~50+ tok/s)
│
├── General chat / assistant
│   └── Llama 3.2 3B or Gemma 2 2B  (1.5–1.8 GB, ~25–35 tok/s)
│
├── Reasoning / math / code
│   └── Phi-4 Mini 3.8B or Qwen 3 4B  (2.3–2.5 GB, ~18–20 tok/s)
│
├── Multimodal (image + text)
│   └── Gemma 3 4B  (2.5 GB, supports image input)
│
├── Bilingual (Chinese + English)
│   └── Qwen 3 4B or Qwen 2.5 3B  (1.7–2.4 GB)
│
├── Code generation
│   └── Phi-4 Mini or Qwen 2.5 Coder 3B
│
├── Maximum quality (willing to accept slower speed)
│   └── Llama 3.1 8B Q4_K_M with 1K context  (4.7 GB, ~9 tok/s)
│
└── Need long context (8K+ tokens)
    └── Llama 3.2 3B (128K native) or Gemma 3 1B (32K)
        Cap actual context to fit KV cache budget

2.5 Ollama on Jetson: Easiest Deployment

Ollama runs natively on Jetson with CUDA support:

# Install Ollama on Jetson
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model (automatically selects Q4_K_M)
ollama pull llama3.2:3b
ollama run llama3.2:3b

# Or for smaller/faster:
ollama pull qwen3:1.7b
ollama run qwen3:1.7b

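# Serve as API (OpenAI-compatible format)
# The install script normally registers Ollama as a systemd service that is already
# listening on port 11434; start it by hand only if it is not running.
ollama serve &
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]}'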

Ollama advantages on Jetson:

- Automatic CUDA detection and GPU offloading
- Built-in model management (pull, delete, list)
- OpenAI-compatible API endpoint
- Handles quantization automatically
- One-command install, no Python dependencies

# Check GPU utilization while running
tegrastats --interval 1000

# List downloaded models and sizes
ollama list

# Remove a model to free space
ollama rm llama3.2:3b
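
The OpenAI-compatible endpoint shown earlier in this section can also be called from Python with only the standard library. This is a minimal sketch: the helper names are hypothetical, and it assumes Ollama is listening on the default port and returns the usual OpenAI-style `choices[0].message.content` structure.

```python
import json
import urllib.request


def build_chat_request(prompt, model="llama3.2:3b",
                       base_url="http://localhost:11434"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage on the Jetson, with the model already pulled:
#   print(chat("Explain how Jetson unified memory works in two sentences."))
```

Keeping request construction separate from the network call makes the payload easy to inspect or log when debugging on a headless device.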

2.6 Offline Model Transfer: Jetson Without Internet

Production Jetson devices often have no internet access (air-gapped, DNS issues, factory floor). Transfer models from an internet-connected machine via USB or LAN.

Step 1 — Download on your workstation/VM:

# On internet-connected machine (workstation, cloud VM, etc.)
cd /tmp

# Download GGUF model from Hugging Face
wget "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/\
Llama-3.2-3B-Instruct-Q4_K_M.gguf"

# Or for Nemotron (NVIDIA's edge model):
wget "https://huggingface.co/bartowski/Nemotron-Mini-4B-Instruct-GGUF/resolve/main/\
Nemotron-Mini-4B-Instruct-Q4_K_M.gguf"

Step 2 — Transfer to Jetson via USB or LAN:

# Method A: USB device-mode (Jetson acts as USB Ethernet at 192.168.55.1)
scp Llama-3.2-3B-Instruct-Q4_K_M.gguf user@192.168.55.1:/opt/models/

# Method B: Ethernet/WiFi LAN (replace with Jetson's IP)
scp Llama-3.2-3B-Instruct-Q4_K_M.gguf user@192.168.1.100:/opt/models/

# Method C: USB flash drive (if no network)
# Mount USB drive on workstation, copy model, unmount, plug into Jetson
cp Llama-3.2-3B-Instruct-Q4_K_M.gguf /media/usb_drive/
# On Jetson:
sudo mount /dev/sda1 /mnt
cp /mnt/Llama-3.2-3B-Instruct-Q4_K_M.gguf /opt/models/

Step 3 — Verify on Jetson:

# Check file exists and size is correct
ls -lh /opt/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
# Expected: ~2 GB for Llama 3.2 3B at Q4_K_M

# Verify integrity (optional — compare SHA256 with HuggingFace page)
sha256sum /opt/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Check disk space
df -h /opt/models/
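
The integrity check can be scripted so it runs the same way on every device. A minimal sketch in Python, assuming you have copied the expected SHA-256 from the model's download page; the function names are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-GB models never sit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected_hex):
    """True when the file's SHA-256 matches the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (the expected value is a placeholder you must supply):
#   verify_model("/opt/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf", "<sha256 from the download page>")
```

Streaming the file matters on an 8 GB device: reading a 2 GB model in one call would compete with the memory you are trying to keep free.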

Step 4 — Run inference:

# With llama.cpp:
./llama-cli -m /opt/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -ngl 99 -c 2048

# With Ollama (import local GGUF):
# Create a Modelfile
echo 'FROM /opt/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf' > Modelfile
ollama create my-llama -f Modelfile
ollama run my-llama

USB device-mode setup (if not working):

# Check if USB device-mode service is running
systemctl status nv-l4t-usb-device-mode

# Enable if disabled
sudo systemctl enable --now nv-l4t-usb-device-mode

# Verify Jetson has USB IP
ip addr show usb0    # should show 192.168.55.1

# On workstation, verify connectivity
ping 192.168.55.1

Transfer speed reference:

Method	Speed	Time for 2 GB model
USB 2.0 device-mode	~30 MB/s	~67 sec
USB 3.0 flash drive	~100 MB/s	~20 sec
Gigabit Ethernet	~110 MB/s	~18 sec
WiFi (802.11ac)	~40 MB/s	~50 sec

3. KV Cache Management: The Hidden Memory Consumer

During autoregressive generation, the KV cache stores key/value tensors for every token in the context. This can consume more memory than the model itself.

3.1 KV Cache Size Calculation

KV cache size = 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element

Llama 3.2 3B (INT8 KV cache, 2048 context):
  = 2 × 28 layers × 8 kv_heads × 128 head_dim × 2048 tokens × 1 byte
  = 2 × 28 × 8 × 128 × 2048
  = ~117 MB

Same model at 8192 context:
  = ~470 MB  ← significant on 8 GB!

Same model at 128K context:
  = ~7.5 GB  ← impossible, exceeds total free memory
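
The formula is easy to evaluate for any model and context length. A minimal sketch in Python; the function name is illustrative, and the example assumes 28 layers, 8 KV heads, and a head dimension of 128 for Llama 3.2 3B, so substitute the values from your model's config.json.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_length, bytes_per_element=1):
    """KV cache size in bytes; the leading 2 covers keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_length * bytes_per_element)


# Assumed Llama 3.2 3B shape with an INT8 cache (1 byte per element).
for context in (2048, 8192, 131072):
    size = kv_cache_bytes(28, 8, 128, context, bytes_per_element=1)
    print(f"ctx={context:>6}: {size / 1e6:8.0f} MB")
```

Pass bytes_per_element=2 for an FP16 cache; every figure then doubles.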

3.2 KV Cache Optimization Techniques

Technique	Memory saving	Quality impact	Jetson support
INT8 KV cache	2× vs FP16	Minimal	llama.cpp, TensorRT-LLM
INT4 KV cache	4× vs FP16	Small	llama.cpp (experimental)
GQA (Grouped Query Attention)	4–8× vs MHA	None (model-level)	Built into modern models
Sliding window attention	Bounded	Loses long context	Mistral, some models
KV cache eviction	Bounded	Loses old context	Custom implementation
PagedAttention (vLLM)	No fragmentation waste	None	Limited (TensorRT-LLM offers a paged KV cache; vLLM itself targets servers)

GQA is the most important: Llama 3.2 3B uses GQA with 8 KV heads shared across 24 query heads, so its KV cache is 3× smaller than it would be with traditional MHA (models with 32 query heads and 8 KV heads, such as Llama 3.1 8B, get 4×). Always prefer GQA models on Jetson.

3.3 Context Length Budget

Orin Nano 8GB Memory Budget for LLM:

Total DRAM:                    8.0 GB
  - Firmware carveouts:       -0.4 GB
  - OS + kernel:              -0.5 GB
  - CMA:                      -0.5 GB (reduced for LLM workload)
  - CUDA runtime:             -0.3 GB
  ────────────────────────────────────
  Available for LLM:           6.3 GB

Model (Llama 3.2 3B INT4):   -1.5 GB
KV cache (INT8, ctx=2048):   -0.1 GB
Activation memory:            -0.2 GB
  ────────────────────────────────────
  Remaining:                   4.5 GB  ← room for longer context or larger model

Model (Phi-3 Mini INT4):     -1.9 GB
KV cache (INT8, ctx=4096):   -0.8 GB  (Phi-3 Mini uses full MHA: 32 layers × 32 KV heads × 96 head_dim)
Activation memory:            -0.3 GB
  ────────────────────────────────────
  Remaining:                   3.3 GB  ← still comfortable
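
The budget can also be read in the other direction: given the memory you are willing to spend on the KV cache, how many tokens of context fit? A minimal sketch in Python, with an illustrative function name and the same assumed Llama 3.2 3B shape (28 layers, 8 KV heads, head dimension 128) as a worked example.

```python
def max_context_tokens(kv_budget_gb, num_layers, num_kv_heads, head_dim,
                       bytes_per_element=1):
    """Largest context whose KV cache fits in kv_budget_gb (GB = 10^9 bytes)."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return int(kv_budget_gb * 1e9 // bytes_per_token)


# How far does 1 GB of KV cache go at INT8 versus FP16?
for label, width in (("INT8", 1), ("FP16", 2)):
    tokens = max_context_tokens(1.0, 28, 8, 128, bytes_per_element=width)
    print(f"{label}: about {tokens} tokens per GB of KV cache")
```

Use the result as the value for the -c flag (or the runtime's equivalent) instead of the model's advertised maximum context.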

4. FlashAttention and Fused Kernels

4.1 Why Fused Kernels Matter on Jetson

On H100, fused kernels cut HBM traffic and improve Tensor Core utilization. On Jetson, they cut traffic to the shared LPDDR5 DRAM, which is the #1 bottleneck.

Unfused attention (naive):
  Q × K^T → write attention scores to DRAM → read back → softmax → write →
  read back → multiply by V → write output

  Total DRAM traffic: ~4× the minimum

FlashAttention-2 (fused):
  Q × K^T → softmax → × V  (all in SRAM/registers, ONE read + ONE write)

  Total DRAM traffic: ~1× the minimum → 4× less bandwidth used

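On the Orin Nano Super's 102 GB/s, attention traffic competes with weight loading for the same bandwidth. The saving grows with context length: at short contexts weight reads dominate decode and the gain is modest, while long prompts and long contexts benefit the most.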

4.2 FlashAttention on Jetson

# llama.cpp: FlashAttention kernels are selected with -fa
# (older builds treat it as a switch; newer ones accept on/off/auto, so check --help)
./llama-cli -m model.gguf -ngl 99 -fa

# TensorRT-LLM compiles fused attention into the engine.
# Flag names change between releases; confirm with trtllm-build --help.
trtllm-build --checkpoint_dir ./checkpoint --output_dir ./engine \
    --use_fused_mlp enable \
    --context_fmha enable \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_seq_len 2560

4.3 Other Fused Operations

Each fused operation eliminates a DRAM round-trip:

Fused operation	What it combines	Bandwidth saved
Fused RMSNorm	Norm + scale in one kernel	~2× less traffic
Fused SwiGLU	Gate + activation + multiply	~3× less traffic
Fused Rotary Embedding	Position encoding + Q/K projection	~2× less traffic
Fused Add + Norm	Residual connection + layer norm	~2× less traffic

TensorRT-LLM enables these automatically. llama.cpp has many fused CUDA kernels built in.

5. TensorRT-LLM on Jetson

TensorRT-LLM is NVIDIA's optimized inference engine for LLMs. It compiles the model into a TensorRT engine with fused kernels, quantization, and Tensor Core usage.

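5.1 Build a TensorRT-LLM Engine for Jetson

# Install TensorRT-LLM. JetPack 6.x ships TensorRT itself; the LLM layer is separate.
# On Jetson (aarch64) the generic wheel may not be available, in which case build from
# NVIDIA's Jetson branch of the TensorRT-LLM repository or use a prebuilt container.
pip install tensorrt-llm --extra-index-url https://pypi.nvidia.com

# Convert Hugging Face model to TensorRT-LLM checkpoint
# (convert_checkpoint.py lives in examples/llama/ of the TensorRT-LLM repository)
python convert_checkpoint.py \
    --model_dir ./Llama-3.2-3B-Instruct \
    --output_dir ./checkpoint \
    --dtype float16 \
    --tp_size 1          # single GPU on Jetson

# Build optimized engine. Flag names vary by release (some older releases also
# required --strongly_typed); confirm with trtllm-build --help.
trtllm-build \
    --checkpoint_dir ./checkpoint \
    --output_dir ./engine \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --use_fused_mlp enable \
    --max_batch_size 1 \
    --max_input_len 1024 \
    --max_seq_len 2048 \
    --max_num_tokens 2048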

# Run inference
python run.py \
    --engine_dir ./engine \
    --tokenizer_dir ./Llama-3.2-3B-Instruct \
    --max_output_len 256 \
    --input_text "How does Jetson unified memory work?"

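5.2 INT4 AWQ Engine for Minimum Memory

# Produce an INT4 AWQ checkpoint first
# (quantize.py lives in examples/quantization/ of the TensorRT-LLM repository;
#  run it on a workstation if calibration is too slow on the Jetson)
python quantize.py \
    --model_dir ./Llama-3.2-3B-Instruct \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir ./checkpoint-awq

# Build the engine; the weight precision comes from the quantized checkpoint
trtllm-build \
    --checkpoint_dir ./checkpoint-awq \
    --output_dir ./engine-int4 \
    --gemm_plugin auto \
    --max_batch_size 1 \
    --max_input_len 1024 \
    --max_seq_len 2048 \
    --use_fused_mlp enable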

5.3 TensorRT-LLM vs llama.cpp on Jetson

	TensorRT-LLM	llama.cpp
Setup complexity	High (build engine)	Low (download GGUF, run)
Performance	Best (compiled kernels)	Very good (hand-tuned CUDA)
Quantization	FP16, INT8, INT4 AWQ/GPTQ	Q2–Q8 mixed precision (GGUF)
Flexibility	Fixed engine (rebuild for changes)	Dynamic (change at runtime)
Memory efficiency	Excellent (preallocated)	Good (dynamic allocation)
Model support	Major models (Llama, Phi, Mistral, etc.)	Almost everything on HF
Best for	Production deployment	Prototyping + production

Recommendation: Start with llama.cpp for prototyping (5-minute setup). Switch to TensorRT-LLM for production when you need maximum tokens/sec.

6. Speculative Decoding: Free Speed

Speculative decoding uses a small draft model to guess N tokens, then the large target model verifies all N in one forward pass. If the guess is correct, you get N tokens for the price of 1.

Without speculative decoding:
  Target model: generate token 1 → token 2 → token 3 → token 4
  Time: 4 forward passes × 50ms = 200ms

With speculative decoding (draft model guesses 4 tokens):
  Draft model:  generate 4 candidate tokens (fast, ~5ms total)
  Target model: verify all 4 in ONE forward pass (~55ms)
  If 3/4 accepted: 3 tokens in 60ms instead of 150ms → 2.5× faster

Speedup: 1.5–3× depending on acceptance rate
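
How much speedup to expect depends on the acceptance rate and on how fast the draft model is relative to the target. The sketch below uses the standard simplifying assumption that each drafted token is accepted independently with the same probability, and that verification costs one target forward pass; it also counts the extra token the target contributes on every pass. Function names are illustrative, and the 15 ms and 40 ms per-token timings are assumptions derived from the catalog's ~65 tok/s and ~25 tok/s estimates.

```python
def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per target pass under independent acceptance."""
    if acceptance_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


def speculative_speedup(acceptance_rate, draft_len, draft_ms, target_ms):
    """Speedup over plain decoding, where one token costs target_ms."""
    step_ms = draft_len * draft_ms + target_ms   # draft k tokens, verify once
    tokens = expected_tokens_per_step(acceptance_rate, draft_len)
    return tokens * target_ms / step_ms


for rate in (0.5, 0.7, 0.9):
    speedup = speculative_speedup(rate, draft_len=4, draft_ms=15.0, target_ms=40.0)
    print(f"acceptance={rate:.1f}: {speedup:.2f}x")
```

A result below 1.0 means the draft model costs more time than it saves, which is the usual outcome when the draft is not much faster than the target.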

6.1 On Jetson

# llama.cpp supports speculative decoding. The draft must share the target's
# tokenizer, so use Llama 3.2 1B here (TinyLlama has the Llama 2 vocabulary).
# -m target, -md draft, -ngl/-ngld GPU layers for each, --draft 8 tokens ahead
./llama-speculative \
    -m Llama-3.2-3B-Q4_K_M.gguf \
    -md Llama-3.2-1B-Q4_K_M.gguf \
    -ngl 99 \
    -ngld 99 \
    --draft 8 \
    -p "Write a comprehensive guide to..."

Memory budget for speculative decoding:

- Target: Llama 3.2 3B INT4 = 1.5 GB
- Draft: Llama 3.2 1B INT4 = 0.7 GB
- Total: 2.2 GB, fits easily on 8 GB

When NOT to use on Jetson: if the draft model pushes you over the memory budget, speculative decoding hurts more than it helps.

7. Runtime Optimizations

7.1 Power Mode Selection

# Check available power modes
sudo nvpmodel -q --verbose

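# Set to maximum performance. Mode IDs and names differ between modules and JetPack
# releases (the top mode is 25W class on Orin Nano Super), so confirm the ID in the
# verbose listing above before relying on -m 0.
sudo nvpmodel -m 0
sudo jetson_clocks    # lock GPU/CPU at max frequency

# Or set a power-efficient mode (7W class); again, check the listing for its ID
sudo nvpmodel -m 1   # fewer CPU cores, lower GPU clock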

Higher power mode = higher clock = more tokens/sec. But thermal design must support it.

7.2 GPU Frequency and Memory Clock

# Check current clocks
tegrastats --interval 500

# Lock GPU to max clock (prevents dynamic frequency scaling during inference)
sudo jetson_clocks

# Check GPU clock (node is 17000000.ga10b on JetPack 5, 17000000.gpu on JetPack 6)
cat /sys/class/devfreq/17000000.*/cur_freq

Dynamic frequency scaling adds latency jitter. For consistent inference latency, lock clocks.

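7.3 CPU Pinning (AGX Orin)

Jetson modules expose a single unified memory pool, so there is no NUMA placement to tune, even on AGX Orin with its larger memory. What can help is pinning the inference process to a fixed set of CPU cores so its host threads stop migrating between clusters:

# Pin process to a fixed set of CPU cores (check lscpu for your cluster layout)
taskset -c 0-5 ./llama-cli -m model.gguf -ngl 99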

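7.4 Swap / zram for Emergency Overflow

If a model only just exceeds available memory, zram (compressed RAM swap) can help:

# Enable 4 GB zram (compressed in-memory swap).
# JetPack usually has zram devices active already, so use the device that
# zramctl reports instead of assuming /dev/zram0.
ZRAM_DEV=$(sudo zramctl --find --size 4G --algorithm zstd)
sudo mkswap "$ZRAM_DEV"
sudo swapon "$ZRAM_DEV" -p 5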

# Now models slightly over RAM can run (with performance penalty)

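zram compresses pages in memory. Quantized weights are close to incompressible, and memory allocated through CUDA is pinned and cannot be swapped, so the benefit comes mainly from compressing idle pages of other processes and CPU-side buffers. A model that exceeds free memory by a few hundred MB might run this way, but expect a noticeable speed penalty from compression/decompression overhead whenever swapped pages are touched.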

8. Kernel-Level Optimization: Where the Real Gains Are

Sections 1–7 cover model-level and system-level optimizations. This section goes deeper, into the GPU kernels themselves. This is where cloud platforms like RightNow Forge and RunInfra report 3–7× speedups over baseline inference.

8.1 The GPU Utilization Problem

Unoptimized inference can leave most GPU cycles unused. The illustration below shows roughly 16% SM utilization before tuning:

Typical unoptimized LLM inference on GPU:

SM·00░·····█░░░░·····█░░░░·····█░     █ = compute
SM·01····██░░░·····██░░░·····██░░     ░ = memory I/O (waiting for DRAM)
SM·02···█░░░░·····█░░░░·····█░░░░     · = idle (nothing scheduled)
SM·03·██░░░·····██░░░·····██░░░··
  ~16% SM utilization

After kernel optimization:

SM·00██·█████░·█████░████████████     Same hardware, 5× more useful work
SM·01░░████████████████·██████·██
SM·02██████████··█████░██████░███
SM·03███·█████░░████████████████·
  ~88% SM utilization

Why this happens: Default PyTorch/ONNX kernels are generic — they work on any GPU but optimize for none. Each operation (attention, norm, quantized GEMM) launches a separate kernel, reads from DRAM, computes, writes back. The GPU spends most of its time waiting for memory.

On Jetson this matters even more: 102 GB/s shared bandwidth (vs 3,350 GB/s on H100) means the GPU is frequently starved for data. Kernel optimization directly determines tokens/sec.

8.2 The Three Levels of Kernel Optimization

Level 1: Operator Fusion (easiest, biggest win)
  Combine multiple operations into one kernel → fewer DRAM round-trips
  Example: RMSNorm + residual add + SwiGLU → one kernel, one read, one write

Level 2: Hardware-Specific Tuning (moderate difficulty)
  Tune tile sizes, thread block dimensions, shared memory usage for YOUR specific GPU
  Example: Orin Nano Ampere SM has different optimal tile size than H100 Hopper SM

Level 3: Custom Kernel Generation (hardest, maximum performance)
  Write or generate Triton/CUDA kernels specifically for your model + GPU + precision
  Example: INT4 dequantize-fused-GEMM kernel for Ampere with 128-thread blocks

8.3 Profiling: Find the Bottleneck First

Never optimize blind. Profile to find which kernels consume the most time:

# On Jetson: profile with Nsight Systems
nsys profile --trace=cuda,nvtx -o llm_profile ./llama-cli -m model.gguf -ngl 99 -p "test"

# Analyze the trace
nsys stats llm_profile.nsys-rep

# Example output (typical LLM breakdown):
# Kernel                          Time%    Time       Calls
# ─────────────────────────────────────────────────────────
# attention_fwd                   41.2%    18.4ms     26      ← #1 bottleneck
# quantized_gemm_w4a16           22.8%    10.2ms     78
# rmsnorm_kernel                  14.1%     6.3ms     52
# silu_mul_kernel                  8.3%     3.7ms     26
# rotary_embedding                 5.1%     2.3ms     52
# others                           8.5%     3.8ms     ...

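On Jetson, attention's share grows faster with context length than on server GPUs, because streaming K and V from DRAM costs proportionally more at 102 GB/s. At short contexts in single-stream decode, expect the quantized GEMM kernels to take a larger share than in this example, since every weight is read once per token.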

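8.4 Triton Kernels: Writing Custom Optimized Ops

Triton is an open-source, Python-based GPU kernel language originally developed at OpenAI (not to be confused with NVIDIA's Triton Inference Server). It's much easier than raw CUDA and can generate code close to hand-tuned kernels.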

Example: Fused RMSNorm + Residual Add (common LLM bottleneck):

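import torch
import triton
import triton.language as tl

@triton.jit
def fused_rmsnorm_residual_kernel(
    X_ptr, Residual_ptr, Weight_ptr, Out_ptr,
    N: tl.constexpr, eps: tl.constexpr,
    BLOCK_SIZE: tl.constexpr
):
    """RMSNorm(X + Residual) * Weight in one kernel: each input read once, output written once."""
    row = tl.program_id(0)                 # one program instance per row
    offsets = tl.arange(0, BLOCK_SIZE)     # BLOCK_SIZE must be a power of two >= N
    mask = offsets < N

    # Load X and Residual (one DRAM read each); compute in FP32 for stability
    x = tl.load(X_ptr + row * N + offsets, mask=mask, other=0.0).to(tl.float32)
    r = tl.load(Residual_ptr + row * N + offsets, mask=mask, other=0.0).to(tl.float32)

    # Fused: add residual + compute RMS norm + scale by weight
    h = x + r                                          # residual add
    mean_sq = tl.sum(h * h, axis=0) / N               # mean of squares (masked lanes add 0)
    rrms = 1.0 / tl.sqrt(mean_sq + eps)               # reciprocal RMS
    w = tl.load(Weight_ptr + offsets, mask=mask, other=1.0).to(tl.float32)
    out = h * rrms * w                                 # normalize + scale

    # One DRAM write (tl.store casts to the output tensor's dtype)
    tl.store(Out_ptr + row * N + offsets, out, mask=mask)


def fused_rmsnorm_residual(x, residual, weight, eps=1e-6):
    """Host-side launcher for contiguous 2D CUDA tensors of shape (rows, N)."""
    rows, n = x.shape
    out = torch.empty_like(x)
    fused_rmsnorm_residual_kernel[(rows,)](
        x, residual, weight, out,
        N=n, eps=eps, BLOCK_SIZE=triton.next_power_of_2(n),
    )
    return out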

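Without fusion: 3 separate kernels (residual add, RMSNorm, weight multiply) make about 7 row-sized DRAM accesses: 3 for the add (read X, read Residual, write the sum), 2 for the norm, and 2 for the scale. With fusion: 1 kernel makes 3 (read X, read Residual, write Out). The small weight vector is read in both cases. That is roughly 2.3× less memory traffic.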
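Before benchmarking a fused kernel it is worth checking it numerically. The following is a minimal plain-Python reference for one row, assuming the same definition as the kernel above (`RMSNorm(x + residual) * weight`). The function names are hypothetical, and a real check would compare against the kernel's output with a tolerance suited to FP16.

```python
import math

def rmsnorm_residual_ref(x, residual, weight, eps=1e-6):
    """Reference for one row: RMSNorm(x + residual) * weight, in float64."""
    h = [a + b for a, b in zip(x, residual)]
    mean_sq = sum(v * v for v in h) / len(h)        # mean of squares
    rrms = 1.0 / math.sqrt(mean_sq + eps)           # reciprocal RMS
    return [v * rrms * w for v, w in zip(h, weight)]

def max_abs_error(got, expected):
    """Largest elementwise difference between two rows."""
    return max(abs(g - e) for g, e in zip(got, expected))

# With unit weights and eps=0 the output row has a mean square of exactly 1.
row = rmsnorm_residual_ref([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0, 0.0],
                           [1.0] * 4, eps=0.0)
print(row)
```

In practice, `max_abs_error(kernel_row, reference_row)` would be compared against a small threshold for each row of the kernel's output.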

8.5 Autokernel — Automated Kernel Generation¶

RightNow Autokernel automates the process of generating optimized Triton/CUDA kernels for specific GPU hardware:

# Install autokernel (package name and flags as given here; confirm against the project's README)
pip install autokernel

# Generate optimized kernels for your model + GPU
autokernel optimize \
    --model "Llama-3.2-3B" \
    --gpu "orin-nano" \
    --precision "int4" \
    --output ./optimized_kernels/

What autokernel does:

1. Profiles your model to find the slowest kernels (attention, GEMM, norm)
2. Generates Triton kernel variants with different tile sizes and thread configurations
3. Benchmarks all variants on your specific GPU
4. Selects the fastest configuration
5. Verifies numerical correctness against a reference implementation
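The generate, benchmark, select, and verify steps can be sketched as a small loop. This is not autokernel's implementation: it is a toy, CPU-only illustration in which a tiled matrix multiply stands in for a GPU kernel, the tile size is the only tuning parameter, and all function names are hypothetical.

```python
import time

def matmul_ref(a, b):
    """Straightforward reference implementation."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_tiled(a, b, tile):
    """Same product, computed tile by tile (the 'kernel variant')."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += a_ip * b[p][j]
    return c

def autotune(a, b, tiles=(4, 8, 16, 32), repeats=3):
    timings = {}
    for tile in tiles:                      # generate variants and benchmark each
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            matmul_tiled(a, b, tile)
            best = min(best, time.perf_counter() - t0)
        timings[tile] = best
    reference = matmul_ref(a, b)
    for tile in sorted(timings, key=timings.get):    # fastest first
        if matmul_tiled(a, b, tile) == reference:    # verify before accepting
            return tile, timings
    raise RuntimeError("no variant matched the reference")

# Small integer-valued inputs keep float arithmetic exact, so == is safe here.
size = 24
a = [[float((i + 2 * j) % 5) for j in range(size)] for i in range(size)]
b = [[float((3 * i + j) % 4) for j in range(size)] for i in range(size)]
best_tile, timings = autotune(a, b)
print("fastest verified tile:", best_tile)
```

A real tuner searches a much larger space (tile shape, warps, pipeline stages) and compares results with a tolerance rather than exact equality.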

The key insight: optimal kernel parameters differ dramatically between GPUs:

Parameter	H100 (Hopper)	A100 (Ampere)	Orin Nano (Ampere)
Best GEMM tile	256×128	128×128	64×64 (fewer SMs, small batches)
Thread block	256 threads	256 threads	128 threads (fewer warps)
Shared mem usage	164 KB	164 KB	48 KB (default per-block limit without opt-in)
Optimal batch	64+	32+	1–4 (memory limited)

A kernel tuned for H100 can be 2–3× slower on Orin Nano than a kernel tuned for Orin Nano specifically.
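One reason tile sizes do not transfer between GPUs is shared-memory footprint. The sketch below estimates the shared memory a GEMM tile needs, assuming FP16 operands, one A tile plus one B tile per pipeline stage, and the 48 KB Orin Nano figure from the table. The `block_k` and `stages` values are hypothetical tuning choices, not values from the table.

```python
SMEM_BUDGET_BYTES = 48 * 1024  # the 48 KB Orin Nano figure from the table above

def tile_smem_bytes(block_m, block_n, block_k, dtype_bytes=2, stages=1):
    """Shared memory for an A tile (block_m x block_k) plus a B tile (block_k x block_n)."""
    a_tile = block_m * block_k
    b_tile = block_k * block_n
    return (a_tile + b_tile) * dtype_bytes * stages

# block_k and stages are illustrative assumptions.
candidates = {
    "256x128": dict(block_m=256, block_n=128, block_k=64, stages=3),
    "128x128": dict(block_m=128, block_n=128, block_k=32, stages=4),
    "64x64": dict(block_m=64, block_n=64, block_k=32, stages=2),
}

for name, cfg in candidates.items():
    need = tile_smem_bytes(**cfg)
    verdict = "fits" if need <= SMEM_BUDGET_BYTES else "too large"
    print(f"{name:8s} {need / 1024:6.0f} KB  {verdict}")
```

Under these assumptions only the 64×64 configuration stays within the 48 KB budget, which matches the direction of the table.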

8.6 RightNow Forge — Enterprise Kernel Optimization¶

RightNow Forge is the enterprise platform that automates the full kernel optimization pipeline:

Input:  model = "Llama-3.2-3B"
        gpu = "Jetson Orin Nano"
        baseline = "llama.cpp default"

Forge pipeline:
  1. Profile all kernels on target GPU
  2. Identify bottlenecks:
     ▲ attention       41% of total → generate FlashAttention variant for Ampere
     ▲ quantized GEMM  23% of total → generate INT4 dequant-fused GEMM
     ▲ rmsnorm         14% of total → generate fused RMSNorm+residual
  3. Compile optimized Triton kernels for Orin Nano SM
  4. Verify correctness (bit-accurate vs reference)
  5. Output: drop-in replacement kernels

Result:
  TTFT (Time to First Token): 320ms → 42ms (7.6× faster)
  Throughput: 15 tok/s → 45 tok/s (3× faster)
  SM utilization: 16% → 72%

8.7 Manual Kernel Optimization Checklist¶

If you're writing your own optimized kernels (Phase 5F territory), here's the priority order for Jetson:

Priority 1: Reduce DRAM traffic (Jetson's #1 bottleneck)
  □ Fuse consecutive elementwise ops (norm + add + activation)
  □ Fuse dequantize into GEMM (don't write dequantized weights to DRAM)
  □ Use FlashAttention (fuse Q×K + softmax + ×V into one kernel)
  □ Compute in registers/shared memory, write final result once

Priority 2: Maximize Tensor Core utilization
  □ Use WMMA/MMA instructions for matrix multiply (not scalar CUDA cores)
  □ Pad matrices to multiples of 16 for Tensor Core alignment
  □ Keep data in FP16/INT8 format that Tensor Cores consume directly

Priority 3: Tune for Orin Nano's specific SM
  □ Smaller tile sizes (64×64 vs 128×128 on server GPUs)
  □ Fewer threads per block (128 vs 256 — fewer warps available)
  □ Account for the 48 KB default shared memory limit per block (larger dynamic allocations need explicit opt-in)
  □ Fewer SMs (8 on Orin Nano 8 GB, from 1024 CUDA cores at 128 per SM, vs 132 on H100), so far fewer blocks in flight

Priority 4: Minimize kernel launch overhead
  □ Fuse small kernels into larger ones
  □ Use CUDA graphs to batch kernel launches
  □ Pre-allocate all buffers (no cudaMalloc during inference)
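The padding item in Priority 2 is simple arithmetic. A small sketch, assuming a multiple of 16 as in the checklist and using hypothetical helper names, shows how to round dimensions up and how much extra work the padding adds:

```python
def pad_to_multiple(n, multiple=16):
    """Round n up to the next multiple (n itself if already aligned)."""
    return ((n + multiple - 1) // multiple) * multiple

def padded_shape(rows, cols, multiple=16):
    """Pad both matrix dimensions for Tensor Core alignment."""
    return pad_to_multiple(rows, multiple), pad_to_multiple(cols, multiple)

def padding_overhead(rows, cols, multiple=16):
    """Fraction of extra elements introduced by padding."""
    p_rows, p_cols = padded_shape(rows, cols, multiple)
    return (p_rows * p_cols) / (rows * cols) - 1.0

for shape in [(100, 3072), (1, 3072), (128, 3072)]:
    print(shape, "->", padded_shape(*shape),
          f"overhead {padding_overhead(*shape):.0%}")
```

Padding a 100-row matrix costs 12% extra elements, while a single-row (batch 1) input padded to 16 rows is mostly padding, which is one reason decode at batch 1 makes poor use of Tensor Cores.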

8.8 CUDA Graphs — Eliminate Launch Overhead¶

Each CUDA kernel launch has ~5–10 µs of overhead. An LLM forward pass with 100+ kernel launches therefore spends roughly 0.5–1 ms per token on launch overhead alone. CUDA graphs capture the entire sequence once and replay it in one call:

// Capture the inference graph once
cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

// All kernel launches are recorded, not executed
attention_kernel<<<grid, block, 0, stream>>>(q, k, v, out);
rmsnorm_kernel<<<grid, block, 0, stream>>>(out, norm_out);
ffn_kernel<<<grid, block, 0, stream>>>(norm_out, ffn_out);
// ... all layers ...

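cudaStreamEndCapture(stream, &graph);

// Executable graph handle (must be declared before instantiation)
cudaGraphExec_t instance;
// CUDA 12 (JetPack 6) signature; CUDA 11 (JetPack 5) takes (&instance, graph, NULL, NULL, 0)
cudaGraphInstantiate(&instance, graph, 0);

// Replay the entire forward pass with ONE launch
for (int token = 0; token < max_tokens; token++) {
    // Captured kernel arguments are frozen, so per-token state (position,
    // KV cache offsets) must live in device buffers the kernels read through
    update_input_pointers(token);
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
}
// CPU-side launch cost: one graph launch (a few µs) per token instead of ~1 ms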

TensorRT-LLM uses CUDA graphs internally. llama.cpp has experimental CUDA graph support.
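Whether CUDA graphs are worth the effort depends on how large the launch overhead is relative to the time per token. A quick estimate, using the per-launch range quoted above and hypothetical helper names:

```python
def launch_overhead_ms(num_launches, per_launch_us):
    """Total CPU-side launch overhead per forward pass, in milliseconds."""
    return num_launches * per_launch_us / 1000.0

def overhead_fraction(tokens_per_s, overhead_ms):
    """Share of the per-token time budget consumed by launch overhead."""
    token_ms = 1000.0 / tokens_per_s
    return overhead_ms / token_ms

overhead = launch_overhead_ms(num_launches=100, per_launch_us=10)
for tps in (25, 65):
    share = overhead_fraction(tps, overhead)
    print(f"{tps} tok/s: {overhead:.1f} ms of launches = {share:.1%} of each token")
```

The faster the model already runs, the larger the relative gain: 1 ms of launch overhead is 2.5% of a 40 ms token but 6.5% of a 15 ms token.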

9. Complete Optimization Checklist¶

Before deployment — run through this checklist:

□ Model Selection
  □ Model fits in INT4 with room for KV cache
  □ GQA-based model preferred (smaller KV cache)
  □ Context length budgeted against available memory

□ Quantization
  □ AWQ or GPTQ 4-bit for best quality/size
  □ GGUF Q4_K_M for llama.cpp deployment
  □ Calibration data representative of production input

□ KV Cache
  □ INT8 KV cache enabled
  □ Maximum context length capped to fit memory
  □ GQA model chosen to minimize KV memory

□ Inference Engine
  □ llama.cpp with -ngl 99 (full GPU offload)
  □ Or TensorRT-LLM engine compiled for target batch/context
  □ FlashAttention enabled

□ Kernel Optimization
  □ Profile with nsys to find top 3 bottleneck kernels
  □ Fused ops enabled (RMSNorm+residual, SwiGLU, rotary)
  □ CUDA graphs enabled for decode loop (reduce launch overhead)
  □ Tile sizes appropriate for Orin Nano SM (64×64, not 256×128)
  □ Consider autokernel/Forge for automated kernel tuning

□ System Configuration
  □ nvpmodel set to appropriate power mode
  □ jetson_clocks to lock frequencies
  □ CMA reduced (LLM doesn't need large CMA)
  □ Unnecessary services disabled (GUI, bluetooth)

□ Profiling
  □ tegrastats monitored during inference
  □ Tokens/sec measured at steady state
  □ Memory usage verified (no slow growth / leak)
  □ Thermal verified (no throttling under sustained load)
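For the "tokens/sec measured at steady state" item, one possible approach is to record a timestamp per generated token and discard the first few, which include prompt processing and warm-up. The sketch below assumes such a list of timestamps in seconds. The function names and the warm-up length are illustrative choices, not part of any tool's API.

```python
def steady_state_tps(timestamps_s, warmup_tokens=8):
    """Tokens/sec over the tokens that arrive after the warm-up period."""
    steady = timestamps_s[warmup_tokens:]
    if len(steady) < 2:
        raise ValueError("need at least two post-warm-up timestamps")
    return (len(steady) - 1) / (steady[-1] - steady[0])

def throughput_drift(timestamps_s, warmup_tokens=8):
    """Second-half tokens/sec divided by first-half tokens/sec (1.0 = stable)."""
    steady = timestamps_s[warmup_tokens:]
    mid = len(steady) // 2
    first, second = steady[:mid + 1], steady[mid:]
    if len(first) < 2 or len(second) < 2:
        raise ValueError("not enough samples to compare halves")
    tps_first = (len(first) - 1) / (first[-1] - first[0])
    tps_second = (len(second) - 1) / (second[-1] - second[0])
    return tps_second / tps_first

# Synthetic timestamps: three slow warm-up tokens, then one token every 40 ms.
stamps = [0.0, 0.5, 0.9] + [0.9 + 0.04 * i for i in range(1, 21)]
print(f"{steady_state_tps(stamps, warmup_tokens=3):.1f} tok/s")
print(f"drift ratio: {throughput_drift(stamps, warmup_tokens=3):.2f}")
```

A drift ratio that falls well below 1.0 over a long run is a hint of thermal throttling or growing KV cache pressure, and is worth cross-checking against tegrastats.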

10. Benchmark Reference¶

Expected performance on Orin Nano Super 8 GB (25W mode, Q4_K_M, llama.cpp):

Model	Params	GGUF size	Prompt eval	Generation	Context
TinyLlama 1.1B	1.1B	0.6 GB	~200 tok/s	~65 tok/s	2048
Llama 3.2 1B	1.3B	0.7 GB	~170 tok/s	~55 tok/s	2048
Gemma 2 2B	2.6B	1.5 GB	~85 tok/s	~35 tok/s	2048
Llama 3.2 3B	3.2B	1.8 GB	~65 tok/s	~25 tok/s	2048
Phi-3 Mini 3.8B	3.8B	2.2 GB	~50 tok/s	~20 tok/s	2048

These are estimates for Orin Nano Super at 25W. The improvement of up to ~1.7× over the original Orin Nano comes from higher memory bandwidth (102 GB/s vs 68 GB/s, a 1.5× increase) combined with higher clock speeds. Actual performance depends on power mode, thermal design, context length, and prompt content. Always benchmark your specific configuration.
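A useful sanity check for generation numbers is the memory-bandwidth ceiling. Assuming every generated token streams the full set of weights from DRAM once (and ignoring KV cache reads and activations, which lower the ceiling further), the upper bound is bandwidth divided by model size. The sketch below applies this to the table's estimates; the helper name is hypothetical.

```python
BANDWIDTH_GB_S = 102.0  # Orin Nano Super peak memory bandwidth

def decode_ceiling_tps(model_gb, bandwidth_gb_s=BANDWIDTH_GB_S):
    """Upper bound on tokens/sec if each token reads all weights once."""
    return bandwidth_gb_s / model_gb

# (model, GGUF size in GB, estimated generation tok/s) from the table above
rows = [
    ("TinyLlama 1.1B", 0.6, 65),
    ("Llama 3.2 1B", 0.7, 55),
    ("Gemma 2 2B", 1.5, 35),
    ("Llama 3.2 3B", 1.8, 25),
    ("Phi-3 Mini 3.8B", 2.2, 20),
]

for name, size_gb, est_tps in rows:
    ceiling = decode_ceiling_tps(size_gb)
    print(f"{name:16s} ceiling {ceiling:6.1f} tok/s, "
          f"estimate is {est_tps / ceiling:.0%} of ceiling")
```

If a measured number exceeds the ceiling, the measurement is suspect. If it falls far below, there is likely headroom in kernels, power mode, or GPU offload.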

11. Projects¶

#	Project	What you learn
1	llama.cpp on Jetson	Download Llama 3.2 3B Q4_K_M, build llama.cpp with CUDA, measure tokens/sec at different context lengths
2	Quantization comparison	Run same model at Q2_K, Q4_K_M, Q6_K, Q8_0. Measure tokens/sec, memory, and output quality (perplexity)
3	TensorRT-LLM engine	Build a TRT-LLM engine for Phi-3 Mini INT4. Compare latency with llama.cpp on same prompts
4	Speculative decoding	Set up TinyLlama as draft for Llama 3.2 3B. Measure acceptance rate and speedup
5	Memory budget audit	Run tegrastats during inference. Map every MB: OS, CMA, model, KV cache, activations. Verify against Section 3.3
6	Power vs performance	Benchmark the same model in the 7W, 15W, and 25W power modes. Plot tokens/sec vs power. Calculate tokens/joule
7	Context length scaling	Measure tokens/sec at context 512, 1024, 2048, 4096. Plot. Identify where KV cache pressure causes degradation
8	Production chatbot	Build a REST API serving Llama 3.2 3B on Jetson using llama.cpp server mode. Measure P50/P95 latency under concurrent requests
9	Nsight profile analysis	Profile llama.cpp inference with `nsys`. Identify top 3 kernels by time. Calculate SM utilization. Find memory-bound vs compute-bound kernels
10	Fused RMSNorm Triton kernel	Write the fused RMSNorm+residual kernel from Section 8.4. Benchmark against unfused PyTorch version. Measure DRAM traffic reduction
11	Autokernel on Jetson	Use autokernel to generate optimized kernels for a small model on Orin Nano. Compare throughput before/after. Document which kernels changed
12	CUDA graphs	Wrap the decode loop of a simple transformer in a CUDA graph. Measure kernel launch overhead before/after. Target: <10 µs total launch per token
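For project 6, tokens/joule is simply throughput divided by average power, since one watt is one joule per second. A minimal sketch follows. The numbers in `runs` are placeholders for illustration only, not measurements, and should be replaced with your own tokens/sec and tegrastats power readings.

```python
def tokens_per_joule(tokens_per_s, avg_power_w):
    """Energy efficiency: (tokens/s) / (joules/s) = tokens/joule."""
    if avg_power_w <= 0:
        raise ValueError("average power must be positive")
    return tokens_per_s / avg_power_w

# PLACEHOLDER values, not measurements: (tokens/sec, average watts) per mode.
runs = {
    "7W": (10.0, 6.5),
    "15W": (18.0, 13.0),
    "25W": (25.0, 21.0),
}

for mode, (tps, watts) in runs.items():
    print(f"{mode:>4s}: {tokens_per_joule(tps, watts):.2f} tokens/joule")
```

Use the average power measured during generation rather than the nominal power-mode cap, since the board rarely sits exactly at the cap.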

12. Resources¶

Resource	What it covers
llama.cpp	Best open-source LLM inference engine for edge
TensorRT-LLM	NVIDIA's optimized LLM engine
NVIDIA Jetson AI Lab	Pre-built containers and tutorials for LLMs on Jetson
Jetson Generative AI Playground	NVIDIA's LLM deployment guides for Jetson
AutoAWQ	AWQ quantization library
FlashAttention-2 paper	Algorithm behind fused attention
Speculative Decoding paper	Original speculative decoding paper
vLLM paper (PagedAttention)	KV cache memory management (server reference)
Orin Nano Memory Architecture	Unified memory deep dive (this roadmap)
RightNow Autokernel	Open-source automated GPU kernel optimization
RightNow Forge	Enterprise kernel optimization platform (profile → generate → verify)
Triton Language	Python-based GPU kernel language (easier than CUDA)
CUDA Graphs Programming Guide	Capture and replay kernel sequences for reduced launch overhead
RunInfra	Cloud LLM optimization platform (reference for techniques)

Next¶

→ Back to ML and AI hub