
LLM Optimization on Jetson — From Cloud Techniques to Edge Reality


Goal: Take the optimization techniques used by cloud LLM platforms (vLLM, TensorRT-LLM, RunInfra, etc.) and adapt them to the extreme constraints of the Jetson Orin Nano Super 8 GB: 102 GB/s memory bandwidth, up to 67 TOPS (sparse INT8) from the GPU with no DLA on this module, memory shared between CPU and GPU, and a 7–25 W power envelope.


Pre-Flight: System Check Before Starting

Before any LLM work, verify your Jetson's software stack. JetPack version determines which CUDA, TensorRT, and cuDNN versions you have — and which LLM tools are compatible.

Quick System Audit (copy-paste this)

#!/bin/bash
echo "═══════════════════════════════════════════════"
echo "  Jetson System Audit for LLM Deployment"
echo "═══════════════════════════════════════════════"

echo ""
echo "▸ JetPack / L4T version:"
cat /etc/nv_tegra_release 2>/dev/null || echo "  (not found — check dpkg)"
dpkg-query --show nvidia-l4t-core 2>/dev/null | awk '{print "  L4T:", $2}'

echo ""
echo "▸ CUDA version:"
nvcc --version 2>/dev/null | grep release || echo "  nvcc not found"

echo ""
echo "▸ TensorRT version:"
dpkg -l | grep tensorrt | head -1 | awk '{print " ", $3}'

echo ""
echo "▸ cuDNN version:"
dpkg -l | grep cudnn | head -1 | awk '{print " ", $3}'

echo ""
echo "▸ Python version:"
python3 --version

echo ""
echo "▸ Total RAM:"
free -m | awk '/Mem:/ {print "  " $2 " MB total"}'
echo "▸ Free RAM:"
free -m | awk '/Mem:/ {print "  " $7 " MB available"}'

echo ""
echo "▸ CMA allocation:"
grep Cma /proc/meminfo | awk '{print "  " $0}'

echo ""
echo "▸ GPU info:"
cat /sys/devices/17000000.ga10b/devfreq/17000000.ga10b/cur_freq 2>/dev/null \
    | awk '{print "  GPU freq: " $1/1000000 " MHz"}'

echo ""
echo "▸ Power mode:"
sudo nvpmodel -q 2>/dev/null | head -2 | sed 's/^/  /'

echo ""
echo "▸ Disk space:"
df -h / | tail -1 | awk '{print "  Root: " $4 " free of " $2}'
df -h /dev/nvme0n1p1 2>/dev/null | tail -1 | awk '{print "  NVMe: " $4 " free of " $2}'

echo ""
echo "▸ Thermal:"
cat /sys/devices/virtual/thermal/thermal_zone*/temp 2>/dev/null | head -3 \
    | awk '{print "  Zone: " $1/1000 "°C"}'

echo "═══════════════════════════════════════════════"

Example output (Orin Nano Super, JetPack 6.1):

═══════════════════════════════════════════════
  Jetson System Audit for LLM Deployment
═══════════════════════════════════════════════

▸ JetPack / L4T version:
  # R36 (release), REVISION: 4.0
  L4T: 36.4.0-20241031080721

▸ CUDA version:
  Cuda compilation tools, release 12.6, V12.6.77

▸ TensorRT version:
  10.3.0.30-1+cuda12.6

▸ cuDNN version:
  9.3.0.75-1+cuda12.6

▸ Python version:
  Python 3.10.12

▸ Total RAM:
  7633 MB total
▸ Free RAM:
  5814 MB available

▸ CMA allocation:
  CmaTotal:      786432 kB
  CmaFree:       654321 kB

▸ GPU info:
  GPU freq: 624 MHz

▸ Power mode:
  NV Power Mode: MAXN
  Power Mode: 25W

▸ Disk space:
  Root: 42G free of 100G

▸ Thermal:
  Zone: 38.5°C

═══════════════════════════════════════════════

JetPack → CUDA → TensorRT Compatibility Matrix

This matrix determines which LLM tools work on your Jetson:

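| JetPack | L4T | CUDA | TensorRT | cuDNN | Python | llama.cpp | Ollama | TRT-LLM |
|---|---|---|---|---|---|---|---|---|
| 6.1 | R36.4 | 12.6 | 10.3 | 9.3 | 3.10 | Yes | Yes | Yes (0.15+) |
| 6.0 | R36.3 | 12.2 | 8.6 | 8.9 | 3.10 | Yes | Yes | Yes (0.9+) |
| 5.1.3 | R35.5 | 11.4 | 8.5 | 8.6 | 3.8 | Yes | Yes | Limited |
| 5.1.1 | R35.3 | 11.4 | 8.5 | 8.6 | 3.8 | Yes | Older | No |
| 5.0.2 | R35.1 | 11.4 | 8.4 | 8.4 | 3.8 | Yes | No | No |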

Recommendation: Use JetPack 6.1 (R36.4) for LLM work. It has CUDA 12.6, TensorRT 10.3, and full TensorRT-LLM support. If you're on JetPack 5.x, consider upgrading — the CUDA 12.x ecosystem (llama.cpp, Ollama, PyTorch 2.x) is significantly better.

Pre-Flight Checklist

Before deploying any LLM on Jetson:

□ JetPack version
  □ JetPack 6.0+ for TensorRT-LLM
  □ JetPack 5.1+ minimum for llama.cpp / Ollama

□ Available memory
  □ Run: free -m → note "available" column
  □ Expect ~5.5–6 GB free on stock Orin Nano Super 8 GB
  □ If < 5 GB: disable GUI (sudo systemctl set-default multi-user.target)
  □ If < 4 GB: reduce CMA, disable unnecessary services

□ Storage
  □ NVMe recommended (models are 1–5 GB each)
  □ SD card works but slower model loading
  □ At least 20 GB free for models + build cache

□ Power mode
  □ sudo nvpmodel -m 0  (MAXN = 25W for Orin Nano Super)
  □ sudo jetson_clocks   (lock to max frequency)

□ Thermal
  □ Active cooling attached (fan or heatsink with fan)
  □ Ambient temperature < 35°C for sustained workloads
  □ Monitor: tegrastats --interval 1000

Disable GUI to Free ~500 MB RAM

For headless LLM deployment, disable the desktop environment:

# Switch to text-only mode (saves ~500 MB RAM)
sudo systemctl set-default multi-user.target
sudo reboot

# To re-enable GUI later:
sudo systemctl set-default graphical.target
sudo reboot

# Verify RAM freed:
free -m   # "available" should increase by ~500 MB

This single change can be the difference between a 3B model fitting comfortably and running out of memory.


0. The Optimization Stack — Cloud vs Jetson

Cloud platforms like RunInfra, Together AI, and Fireworks AI deploy LLMs using a layered optimization stack. Every technique has a Jetson equivalent — but the priorities are reversed because Jetson is severely memory-bandwidth-bound rather than compute-bound.

Cloud GPU (H100 80GB, 3,350 GB/s, 989 TFLOPS):
  ┌─────────────────────────────────────────────┐
  │ 1. Quantization (FP8, AWQ 4-bit)           │ ← saves VRAM, improves throughput
  │ 2. FlashAttention-2                         │ ← keeps attention tiles in SRAM, cuts HBM traffic
  │ 3. Fused Kernels (RMSNorm, rotary, SwiGLU) │ ← fewer kernel launches
  │ 4. PagedAttention (vLLM)                    │ ← KV cache memory efficiency
  │ 5. Speculative Decoding                     │ ← higher tokens/sec
  │ 6. Batching (continuous batching)           │ ← amortize compute over requests
  │ 7. Tensor Parallelism (multi-GPU)          │ ← scale beyond 1 GPU
  └─────────────────────────────────────────────┘

Jetson Orin Nano Super 8GB (102 GB/s, up to 67 TOPS sparse INT8 on the GPU, no DLA):
  ┌─────────────────────────────────────────────┐
  │ 1. Quantization (INT4/INT8) ★★★★★          │ ← MANDATORY: model must fit in 5 GB
  │ 2. Model Selection ★★★★★                    │ ← choose models that fit (≤3B params)
  │ 3. KV Cache Management ★★★★                │ ← memory is the #1 constraint
  │ 4. FlashAttention / Fused Ops ★★★★         │ ← reduce bandwidth pressure
  │ 5. TensorRT-LLM Engine ★★★                 │ ← compiled, optimized execution
  │ 6. Speculative Decoding ★★★                │ ← higher tokens/sec within power budget
  │ 7. Batching ★★                              │ ← limited by memory, not compute
  └─────────────────────────────────────────────┘
  ★ = importance on Jetson (more ★ = more critical)

1. Quantization — The Most Important Optimization

On cloud GPUs, quantization is optional (saves cost). On Jetson, quantization is mandatory — without it, nothing fits.

1.1 Why Quantization Matters More on Jetson

Model: Llama 3.2 3B parameters

FP16:  3B × 2 bytes = 6.0 GB   ← won't fit (only ~5 GB free after OS/CMA)
INT8:  3B × 1 byte  = 3.0 GB   ← fits, but tight
INT4:  3B × 0.5 byte = 1.5 GB  ← fits comfortably, room for KV cache

Model: Phi-3 Mini 3.8B parameters

FP16:  3.8B × 2 bytes = 7.6 GB  ← impossible on 8 GB
INT4:  3.8B × 0.5 byte = 1.9 GB ← fits with room for context
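
The arithmetic above generalizes to any parameter count and bit width. A small sketch (the helper name `weight_gb` is hypothetical; it counts weights only and ignores KV cache, activations, and quantization metadata):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


for name, params in [("Llama 3.2 3B", 3.0), ("Phi-3 Mini", 3.8)]:
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{name:13s} {label}: {weight_gb(params, bits):.1f} GB")
```

Real quantized files come out slightly larger than this estimate because scales and zero points are stored alongside the weights.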

1.2 Quantization Methods Ranked for Jetson

| Method | Bits | Quality loss | Speed on Jetson | When to use |
|---|---|---|---|---|
| AWQ (Activation-aware Weight Quantization) | 4-bit | Very low | Fast (INT4 GEMM) | Best quality/size for Jetson |
| GPTQ | 4-bit | Low | Fast | Alternative to AWQ, well-supported |
| INT8 PTQ (TensorRT) | 8-bit | Minimal | Fastest | If model fits at INT8 |
| FP8 (E4M3) | 8-bit | Minimal | No native FP8 Tensor Core path on Orin (Ampere); needs Ada or Hopper | Server GPUs; on Jetson prefer INT8 |
| GGUF (llama.cpp) | 2–8 bit | Mixed-precision | Good (CPU+GPU) | Easy deployment, any model |
| SqueezeLLM | 3–4 bit | Low | Moderate | Extreme compression |

AWQ preserves quality by protecting salient weight channels — the 1% of weights that matter most for output quality.

# Quantize on your workstation (not on Jetson — too slow)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"
quant_path = "phi3-mini-awq-int4"

model = AutoAWQForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# AWQ calibration (needs ~128 samples)
model.quantize(
    tokenizer,
    quant_config={
        "zero_point": True,
        "q_group_size": 128,
        "w_bit": 4,           # 4-bit weights
        "version": "GEMM"     # optimized GEMM kernel
    }
)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

1.4 GGUF with llama.cpp (Easiest Path)

llama.cpp runs on Jetson with CUDA support and handles mixed-precision quantization:

# On Jetson: install llama.cpp with CUDA
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Download a pre-quantized GGUF model
# (Llama 3.2 3B in Q4_K_M = ~2 GB, good quality/size balance)
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/\
Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Run inference
#   -ngl 99  offload all layers to GPU
#   -c 2048  context length
# (comments cannot follow a trailing backslash, so they are listed here)
./build/bin/llama-cli \
    -m Llama-3.2-3B-Instruct-Q4_K_M.gguf \
    -ngl 99 \
    -c 2048 \
    -p "Explain how Jetson unified memory works:"

GGUF quantization levels for Jetson:

| Quantization | Bits | Size (3B model) | Quality | Recommended? |
|---|---|---|---|---|
| Q2_K | 2.6 | ~1.0 GB | Poor | Only if nothing else fits |
| Q3_K_M | 3.4 | ~1.3 GB | Acceptable | Memory-critical deployments |
| Q4_K_M | 4.5 | ~1.8 GB | Good | Best balance for Jetson |
| Q5_K_M | 5.5 | ~2.2 GB | Very good | If memory allows |
| Q6_K | 6.6 | ~2.5 GB | Excellent | Best quality that fits |
| Q8_0 | 8.0 | ~3.0 GB | Near-FP16 | Only for small models |

2. Model Selection — The Complete Landscape

The most important "optimization" is choosing the right model. A well-quantized small model beats a poorly-fitting large model every time.

2.1 Full Model Catalog (Q4_K_M Quantization, Orin Nano Super 8 GB)

All values at Q4_K_M — a 4-bit mixed-precision quantization that reduces memory by ~75% with minimal quality loss.

Tier 1 — Runs comfortably (< 3 GB, room for long context + KV cache)

| Model | Params | Q4_K_M size | Context | Gen. tok/s (est.) | Best for |
|---|---|---|---|---|---|
| TinyLlama 1.1B | 1.1B | 0.6 GB | 2K | ~65 | Ultra-lightweight, draft model for speculative decoding |
| Llama 3.2 1B | 1.3B | 0.7 GB | 128K | ~55 | Lightweight chat, classification, tool calling |
| StableLM 2 1.6B | 1.6B | 0.9 GB | 4K | ~45 | Compact, fast edge assistant |
| Gemma 3 1B | 1B | 0.6 GB | 32K | ~60 | Multilingual, Google ecosystem |
| Gemma 2 2B | 2.6B | 1.5 GB | 8K | ~35 | Multilingual, general purpose |
| Qwen 3 1.7B | 1.7B | 1.0 GB | 32K | ~50 | Chinese + English, thinking mode |
| SmolLM2 1.7B | 1.7B | 1.0 GB | 8K | ~50 | Hugging Face, compact, well-trained |
| Llama 3.2 3B | 3.2B | 1.8 GB | 128K | ~25 | General chat, summarization, tool use |
| Qwen 2.5 3B | 3B | 1.7 GB | 32K | ~28 | Bilingual general purpose |

Tier 2 — Fits but tight (3–4.5 GB, shorter context recommended)

| Model | Params | Q4_K_M size | Context | Gen. tok/s (est.) | Best for |
|---|---|---|---|---|---|
| Phi-4 Mini | 3.8B | 2.3 GB | 4K/128K | ~20 | Reasoning, math, code |
| Phi-3 Mini | 3.8B | 2.2 GB | 4K/128K | ~20 | Reasoning, code |
| Qwen 3 4B | 4B | 2.4 GB | 32K | ~18 | Thinking + non-thinking modes |
| Gemma 3 4B | 4B | 2.5 GB | 128K | ~17 | Vision + language (multimodal) |
| Llama 3.3 8B | 8B | 4.6 GB | 128K | ~10 | General purpose (needs short ctx) |
| Mistral 7B v0.3 | 7B | 4.1 GB | 32K | ~11 | Chat, function calling |
| Qwen 3 8B | 8B | 4.9 GB | 32K | ~9 | Bilingual, thinking mode |

Tier 3 — Barely fits (4.5+ GB, heavily constrained)

| Model | Params | Q4_K_M size | Max ctx | Gen. tok/s (est.) | Notes |
|---|---|---|---|---|---|
| Llama 3.1 8B | 8B | 4.7 GB | 512–1K | ~9 | Functional but cramped |
| Mistral Nemo 12B | 12B | 7.0 GB | 256 | ~5 | Needs Q3_K or Q2_K to fit |
| Phi-4 14B | 14B | 8.4 GB | Won't fit | n/a | Use Orin NX 16 GB |

Rule of thumb: Model Q4_K_M size + 2.5 GB (OS/CMA/CUDA) + KV cache must be < 8 GB. For Tier 2 models, cap context at 1–2K tokens. For Tier 3, consider Q3_K_M or Q2_K quantization.
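
The rule of thumb is easy to encode as a quick fit check. A minimal sketch, with hypothetical helper names and the 2.5 GB overhead taken from the rule above:

```python
def headroom_gb(model_gb: float, kv_cache_gb: float,
                overhead_gb: float = 2.5, total_gb: float = 8.0) -> float:
    """Memory left after model, OS/CMA/CUDA overhead and KV cache."""
    return total_gb - (model_gb + overhead_gb + kv_cache_gb)


def fits(model_gb: float, kv_cache_gb: float) -> bool:
    """True when the rule of thumb leaves any headroom at all."""
    return headroom_gb(model_gb, kv_cache_gb) > 0


# Q4_K_M sizes from the tier tables; KV cache sizes are illustrative guesses.
for name, size, kv in [("Llama 3.2 3B", 1.8, 0.1),
                       ("Llama 3.1 8B", 4.7, 0.3),
                       ("Mistral Nemo 12B", 7.0, 0.1)]:
    print(f"{name:17s} fits={fits(size, kv)}  headroom={headroom_gb(size, kv):+.1f} GB")
```

A positive but small headroom (a few hundred MB) is the Tier 3 situation: the model loads, but any other allocation can trigger the OOM killer.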

2.2 Models That Need Larger Jetson Hardware

For reference — what runs on bigger Jetson modules:

| Model | Params | Q4_K_M | Min Jetson module | Notes |
|---|---|---|---|---|
| Phi-4 14B | 14B | 8.4 GB | Orin NX 16 GB | Good reasoning |
| Qwen 2.5 Coder 14B | 14B | 8.6 GB | Orin NX 16 GB | Code generation |
| Mistral Small 24B | 24B | 14 GB | AGX Orin 32 GB | Strong general purpose |
| DeepSeek-R1 Distill 32B | 32B | 19 GB | AGX Orin 32 GB | Reasoning chains |
| Llama 3.3 70B | 70B | 42 GB | AGX Orin 64 GB | Near-GPT-4 quality |
| Qwen 3 235B (MoE) | 235B | ~140 GB | Multi-GPU / cloud | 22B active params |
| DeepSeek V3 671B (MoE) | 671B | ~380 GB | Multi-GPU cluster | 37B active params |

2.3 Latest Models Worth Watching (2025–2026)

The edge LLM landscape moves fast. Models to evaluate as they release:

| Model family | Why it matters for Jetson |
|---|---|
| Qwen 3 (0.6B–235B) | Thinking + non-thinking modes, great small variants (1.7B, 4B) |
| Gemma 3 (1B–27B) | Multimodal (vision+text), good 1B and 4B for edge |
| Phi-4 Mini (3.8B) | Microsoft's strong reasoning at small size |
| Llama 4 Scout/Maverick | MoE architecture; per-token cost follows active params, but the total weights must still fit |
| SmolLM2 (135M–1.7B) | Hugging Face's ultra-compact series |
| Step 3.5 Flash (196B MoE, 11B active) | StepFun's open-source MoE reasoning model: only 11B active per token, 262K context, speed-optimized. The full ~110 GB of Q4_K_M weights must stay resident, so it does not fit an 8 GB Jetson |
| MiMo-V2-Pro (1T+ MoE) | Xiaomi's flagship: agentic scenarios, OpenClaw-compatible, 1M context, approaches Opus-4.6 quality. Too large for Jetson directly, but distilled/smaller MiMo variants are edge targets |
| MiMo (series, 1.5B–7B) | Xiaomi's edge-optimized models: MiMo-7B has strong math/code benchmarks, 1.5B fits Jetson easily |
| DeepSeek-R1 Distill (1.5B–70B) | Distilled reasoning; the 1.5B variant fits easily |
| Nemotron Nano (series) | NVIDIA's own edge-optimized models |

MoE Models — Why Active Parameters Matter for Jetson

MoE (Mixture of Experts) models activate only a fraction of their total parameters per token. This changes the memory math:

Dense model (Llama 3.2 3B):
  Total params = Active params = 3B
  Q4_K_M size: 1.8 GB  (must load ALL weights per token)
  Memory bandwidth per token: 1.8 GB

MoE model (Step 3.5 Flash 196B, 11B active):
  Total params: 196B → Q4_K_M: ~110 GB  (won't fit in 8 GB!)
  BUT active params per token: 11B → needs ~6.5 GB of the 110 GB

  Challenge: even though only 11B are active, the FULL 110 GB model
  must be in memory because different tokens may route to different experts.
  → MoE models need the full weight set loaded, not just active params.

  For Jetson: MoE only helps if the TOTAL model fits.
  Step 3.5 Flash (110 GB) → won't fit on 8 GB Jetson
  A hypothetical MoE with 7B total / 2B active → would fit (~4 GB at Q4_K_M) and be fast!

MoE models that could work on Jetson (if available in small total size):

| Model | Total params | Active params | Q4_K_M total | Fits 8 GB? |
|---|---|---|---|---|
| Mixtral 8x0.5B (hypothetical) | 4B | 0.5B | ~2.4 GB | Yes |
| Llama 4 Scout small variant | TBD | TBD | TBD | Watch for small MoE releases |
| Step Flash distilled | TBD | TBD | TBD | If StepFun releases smaller variant |

The real edge MoE opportunity: models with <8B total parameters and <2B active. The memory footprint is that of a dense model of the same total size, while the weights read per token (and therefore decode speed) track the much smaller active set. This is an active research area.
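
The dense-versus-MoE memory math can be sketched numerically. The snippet below is a rough model under stated assumptions: 4.5 bits per weight for Q4_K_M (from the GGUF table), decode speed bounded only by weight reads over the 102 GB/s bus, and no allowance for KV cache traffic or compute. The function names are hypothetical, and the result is a ceiling, not a prediction.

```python
def q_size_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate quantized weight size in GB."""
    return params_billion * bits_per_weight / 8


def decode_ceiling_tok_s(weights_read_gb: float,
                         bandwidth_gb_s: float = 102.0) -> float:
    """Upper bound on tokens/sec if every token must stream these weights."""
    return bandwidth_gb_s / weights_read_gb


# Dense: resident size and per-token reads are the same 1.8 GB.
dense_gb = 1.8
print(f"dense 3B : resident {dense_gb:.1f} GB, "
      f"ceiling {decode_ceiling_tok_s(dense_gb):.0f} tok/s")

# MoE: all experts stay resident, only the active ones are read per token.
moe_resident = q_size_gb(196)
moe_active = q_size_gb(11)
print(f"MoE 196B : resident {moe_resident:.0f} GB, reads {moe_active:.1f} GB/token, "
      f"ceiling {decode_ceiling_tok_s(moe_active):.0f} tok/s if it fit")
```

Measured throughput lands well below the ceiling (the catalog estimates ~25 tok/s for the dense 3B model), but the ratio between models is a useful first filter.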

2.4 Choosing the Right Model — Decision Flowchart

What's your use case?
├── Simple classification / extraction / tool calling
│   └── Llama 3.2 1B or Qwen 3 1.7B  (≤ 1 GB, ~50+ tok/s)
├── General chat / assistant
│   └── Llama 3.2 3B or Gemma 2 2B  (1.5–1.8 GB, ~25–35 tok/s)
├── Reasoning / math / code
│   └── Phi-4 Mini 3.8B or Qwen 3 4B  (2.3–2.5 GB, ~18–20 tok/s)
├── Multimodal (image + text)
│   └── Gemma 3 4B  (2.5 GB, supports image input)
├── Bilingual (Chinese + English)
│   └── Qwen 3 4B or Qwen 2.5 3B  (1.7–2.4 GB)
├── Code generation
│   └── Phi-4 Mini or Qwen 2.5 Coder 3B  (if available at 3B)
├── Maximum quality (willing to accept slower speed)
│   └── Llama 3.3 8B Q4_K_M with 1K context  (4.6 GB, ~10 tok/s)
└── Need long context (8K+ tokens)
    └── Llama 3.2 3B (128K native) or Gemma 3 1B (32K)
        Cap actual context to fit KV cache budget

2.5 Ollama on Jetson — Easiest Deployment

Ollama runs natively on Jetson with CUDA support:

# Install Ollama on Jetson
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model (automatically selects Q4_K_M)
ollama pull llama3.2:3b
ollama run llama3.2:3b

# Or for smaller/faster:
ollama pull qwen3:1.7b
ollama run qwen3:1.7b

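# Serve as API (compatible with OpenAI format).
# The install script normally starts Ollama as a service; run `ollama serve &`
# only if nothing is listening on port 11434 yet.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]}'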

Ollama advantages on Jetson:
- Automatic CUDA detection and GPU offloading
- Built-in model management (pull, delete, list)
- OpenAI-compatible API endpoint
- Handles quantization automatically
- One-command install, no Python dependencies
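
Because the endpoint follows the OpenAI chat format, it can be called from application code with the standard library alone. A minimal sketch, assuming Ollama is listening on its default port and the model has already been pulled; `build_chat_request` and `chat` are hypothetical helper names:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat request for a local Ollama server."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt),
                                timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires a running Ollama server:
# print(chat("llama3.2:3b", "Hello"))
```

The generous timeout is deliberate: the first request after a cold start also pays for loading the model into memory.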

# Check GPU utilization while running
tegrastats --interval 1000

# List downloaded models and sizes
ollama list

# Remove a model to free space
ollama rm llama3.2:3b

2.6 Offline Model Transfer — Jetson Without Internet

Production Jetson devices often have no internet access (air-gapped, DNS issues, factory floor). Transfer models from an internet-connected machine via USB or LAN.

Step 1 — Download on your workstation/VM:

# On internet-connected machine (workstation, cloud VM, etc.)
cd /tmp

# Download GGUF model from Hugging Face
wget "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/\
Llama-3.2-3B-Instruct-Q4_K_M.gguf"

# Or for Nemotron (NVIDIA's edge model):
wget "https://huggingface.co/bartowski/Nemotron-Mini-4B-Instruct-GGUF/resolve/main/\
Nemotron-Mini-4B-Instruct-Q4_K_M.gguf"

Step 2 — Transfer to Jetson via USB or LAN:

# Method A: USB device-mode (Jetson acts as USB Ethernet at 192.168.55.1)
scp Llama-3.2-3B-Instruct-Q4_K_M.gguf user@192.168.55.1:/opt/models/

# Method B: Ethernet/WiFi LAN (replace with Jetson's IP)
scp Llama-3.2-3B-Instruct-Q4_K_M.gguf user@192.168.1.100:/opt/models/

# Method C: USB flash drive (if no network)
# Mount USB drive on workstation, copy model, unmount, plug into Jetson
cp Llama-3.2-3B-Instruct-Q4_K_M.gguf /media/usb_drive/
# On Jetson:
sudo mount /dev/sda1 /mnt
cp /mnt/Llama-3.2-3B-Instruct-Q4_K_M.gguf /opt/models/

Step 3 — Verify on Jetson:

# Check file exists and size is correct
ls -lh /opt/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
# Expected: ~1.8 GB for 3B Q4_K_M

# Verify integrity (optional — compare SHA256 with HuggingFace page)
sha256sum /opt/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Check disk space
df -h /opt/models/
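
Where the check needs to run inside a deployment script rather than by hand, the same verification can be done in Python. A sketch, assuming the expected digest was copied from the model's download page on the connected machine; the helper names are hypothetical:

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MB chunks so multi-GB models never sit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare against a published digest, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example (path from the steps above; the digest is a placeholder you supply):
# ok = matches("/opt/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf", "<expected sha256>")
```

Chunked reading matters on an 8 GB device: loading a 2 GB file whole just to hash it would compete with the model for memory.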

Step 4 — Run inference:

# With llama.cpp:
./llama-cli -m /opt/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -ngl 99 -c 2048

# With Ollama (import local GGUF):
# Create a Modelfile
echo 'FROM /opt/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf' > Modelfile
ollama create my-llama -f Modelfile
ollama run my-llama

USB device-mode setup (if not working):

# Check if USB device-mode service is running
systemctl status nv-l4t-usb-device-mode

# Enable if disabled
sudo systemctl enable --now nv-l4t-usb-device-mode

# Verify Jetson has USB IP
ip addr show usb0    # should show 192.168.55.1

# On workstation, verify connectivity
ping 192.168.55.1

Transfer speed reference:

| Method | Speed | Time for 2 GB model |
|---|---|---|
| USB 2.0 device-mode | ~30 MB/s | ~67 sec |
| USB 3.0 flash drive | ~100 MB/s | ~20 sec |
| Gigabit Ethernet | ~110 MB/s | ~18 sec |
| WiFi (802.11ac) | ~40 MB/s | ~50 sec |

3. KV Cache Management — The Hidden Memory Consumer

During autoregressive generation, the KV cache stores key/value tensors for every token in the context. This can consume more memory than the model itself.

3.1 KV Cache Size Calculation

KV cache size = 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element

Llama 3.2 3B (INT8 KV cache, 2048 context):
  = 2 × 28 layers × 8 kv_heads × 128 head_dim × 2048 tokens × 1 byte
  = 2 × 28 × 8 × 128 × 2048
  = ~117 MB

Same model at 8192 context:
  = ~470 MB  ← significant on 8 GB!

Same model at 128K context:
  = ~7.5 GB  ← impossible, exceeds total free memory
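
The formula translates directly into a helper for sizing any model. In this sketch the shape (32 layers, 8 KV heads, head_dim 128) is an assumed example of a GQA model, not a value from this guide; read the real numbers from the model's `config.json`.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_element: int) -> int:
    """KV cache size; the leading 2 covers one K and one V tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * bytes_per_element)


# Assumed example shape; bytes_per_element: 2 = FP16, 1 = INT8.
for ctx in (2048, 8192, 131072):
    int8_mb = kv_cache_bytes(32, 8, 128, ctx, 1) / 1e6
    fp16_mb = kv_cache_bytes(32, 8, 128, ctx, 2) / 1e6
    print(f"ctx={ctx:>6}: INT8 {int8_mb:8.0f} MB   FP16 {fp16_mb:8.0f} MB")
```

Because the size is linear in context length, capping the context is the cheapest memory lever available at runtime.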

3.2 KV Cache Optimization Techniques

| Technique | Memory saving | Quality impact | Jetson support |
|---|---|---|---|
| INT8 KV cache | 2× vs FP16 | Minimal | llama.cpp, TensorRT-LLM |
| INT4 KV cache | 4× vs FP16 | Small | llama.cpp (experimental) |
| GQA (Grouped Query Attention) | 4–8× vs MHA | None (model-level) | Built into modern models |
| Sliding window attention | Bounded | Loses long context | Mistral, some models |
| KV cache eviction | Bounded | Loses old context | Custom implementation |
| PagedAttention (vLLM) | No fragmentation waste | None | Not on Jetson (vLLM = server) |

GQA is the most important: Llama 3.2 3B uses GQA with 8 KV heads against 24 query heads (the 1B model uses 8 against 32), so its KV cache is 3× smaller than it would be with traditional MHA (4× for the 1B). Always prefer GQA models on Jetson.

3.3 Context Length Budget

Orin Nano 8GB Memory Budget for LLM:

Total DRAM:                    8.0 GB
  - Firmware carveouts:       -0.4 GB
  - OS + kernel:              -0.5 GB
  - CMA:                      -0.5 GB (reduced for LLM workload)
  - CUDA runtime:             -0.3 GB
  ────────────────────────────────────
  Available for LLM:           6.3 GB

Model (Llama 3.2 3B INT4):   -1.5 GB
KV cache (INT8, ctx=2048):   -0.1 GB
Activation memory:            -0.2 GB
  ────────────────────────────────────
  Remaining:                   4.5 GB  ← room for longer context or larger model

Model (Phi-3 Mini INT4):     -1.9 GB
KV cache (INT8, ctx=4096):   -0.8 GB  (no GQA: 2 × 32 layers × 32 kv_heads × 96 head_dim × 4096)
Activation memory:            -0.3 GB
  ────────────────────────────────────
  Remaining:                   3.3 GB  ← still comfortable
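
The budget above is plain subtraction, which makes it easy to re-run for a different model or CMA setting. A sketch with hypothetical function names; the default values are the ones from the budget in this section:

```python
def llm_budget_gb(total: float = 8.0, carveouts: float = 0.4,
                  os_kernel: float = 0.5, cma: float = 0.5,
                  cuda_runtime: float = 0.3) -> float:
    """Memory left for the LLM after system overheads."""
    return total - carveouts - os_kernel - cma - cuda_runtime


def remaining_gb(available: float, model: float,
                 kv_cache: float, activations: float) -> float:
    """What is left once weights, KV cache and activations are placed."""
    return available - model - kv_cache - activations


def max_context_tokens(spare_gb: float, kv_bytes_per_token: int) -> int:
    """How many more tokens of KV cache the spare memory could hold."""
    return int(spare_gb * 1e9 // kv_bytes_per_token)


available = llm_budget_gb()
spare = remaining_gb(available, model=1.5, kv_cache=0.1, activations=0.2)
print(f"available {available:.1f} GB, remaining {spare:.1f} GB")
```

`max_context_tokens` takes the per-token KV cost from the formula in section 3.1 (2 × layers × KV heads × head_dim × bytes per element), which turns spare memory into a concrete context cap.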

4. FlashAttention and Fused Kernels

4.1 Why Fused Kernels Matter on Jetson

On H100, fused kernels cut HBM traffic by keeping intermediates in on-chip SRAM, and improve Tensor Core utilization. On Jetson, they save DRAM bandwidth, which is the #1 bottleneck.

Unfused attention (naive):
  Q × K^T → write attention scores to DRAM → read back → softmax → write →
  read back → multiply by V → write output

  Total DRAM traffic: ~4× the minimum

FlashAttention-2 (fused):
  Q × K^T → softmax → × V  (all in SRAM/registers, ONE read + ONE write)

  Total DRAM traffic: ~1× the minimum → 4× less bandwidth used

On Jetson Orin Nano Super's 102 GB/s bandwidth, this is the difference between 20 tokens/sec and 50 tokens/sec.
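
The trick that lets fused attention skip the DRAM round-trips is the online softmax: scores are consumed block by block while a running maximum and sum are maintained, so the full score row is never stored. The pure-Python sketch below illustrates the idea for a single query vector; it is a teaching aid rather than the FlashAttention-2 kernel, and it omits the usual 1/sqrt(d) scaling in both versions for brevity.

```python
import math


def naive_attention(q, keys, values):
    """Unfused pattern: materialize every score, then softmax, then mix V."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def streaming_attention(q, keys, values, block=2):
    """Fused pattern: process K/V in blocks with a running max, sum, output."""
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescale earlier partials
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.3], [2.0, 0.5], [-0.5, 0.4]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.0, 0.0]]
print(naive_attention(q, keys, values))
print(streaming_attention(q, keys, values, block=2))
```

Both functions print the same vector up to rounding; the streaming version only ever holds one block of scores, which is what allows a GPU kernel to keep the working set in SRAM and registers.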

4.2 FlashAttention on Jetson

# Enable FlashAttention in llama.cpp with -fa
# (recent builds take -fa on|off|auto and default to auto; check ./llama-cli --help)
./llama-cli -m model.gguf -ngl 99 -fa

# TensorRT-LLM compiles fused attention into the engine.
# Flag names change between releases; confirm with trtllm-build --help.
trtllm-build --checkpoint_dir ./checkpoint --output_dir ./engine \
    --use_fused_mlp enable \
    --context_fmha enable \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_seq_len 2560

4.3 Other Fused Operations

Each fused operation eliminates a DRAM round-trip:

| Fused operation | What it combines | Bandwidth saved |
|---|---|---|
| Fused RMSNorm | Norm + scale in one kernel | ~2× less traffic |
| Fused SwiGLU | Gate + activation + multiply | ~3× less traffic |
| Fused Rotary Embedding | Position encoding + Q/K projection | ~2× less traffic |
| Fused Add + Norm | Residual connection + layer norm | ~2× less traffic |

TensorRT-LLM enables these automatically. llama.cpp has many fused CUDA kernels built in.


5. TensorRT-LLM on Jetson

TensorRT-LLM is NVIDIA's optimized inference engine for LLMs. It compiles the model into a TensorRT engine with fused kernels, quantization, and Tensor Core usage.

5.1 Build a TensorRT-LLM Engine for Jetson

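# Install TensorRT-LLM (JetPack 6.x includes TensorRT, add LLM extension).
# If no wheel matches the Jetson platform, build from NVIDIA's Jetson-specific
# TensorRT-LLM branch or use a prebuilt container instead.
pip install tensorrt-llm --extra-index-url https://pypi.nvidia.com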

# Convert Hugging Face model to TensorRT-LLM checkpoint
python convert_checkpoint.py \
    --model_dir ./Llama-3.2-3B-Instruct \
    --output_dir ./checkpoint \
    --dtype float16 \
    --tp_size 1          # single GPU on Jetson

# Build optimized engine
trtllm-build \
    --checkpoint_dir ./checkpoint \
    --output_dir ./engine \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 1024 \
    --max_seq_len 2048 \
    --max_num_tokens 2048 \
    --use_fused_mlp enable \
    --context_fmha enable \
    --strongly_typed

# Run inference
python run.py \
    --engine_dir ./engine \
    --tokenizer_dir ./Llama-3.2-3B-Instruct \
    --max_output_len 256 \
    --input_text "How does Jetson unified memory work?"

5.2 INT4 AWQ Engine for Minimum Memory

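# Build from an INT4 AWQ checkpoint. The quantization itself is applied when
# ./checkpoint-awq is produced (for example with the quantize.py script in the
# TensorRT-LLM examples, using --qformat int4_awq), not by trtllm-build.
trtllm-build \
    --checkpoint_dir ./checkpoint-awq \
    --output_dir ./engine-int4 \
    --gemm_plugin auto \
    --max_batch_size 1 \
    --max_input_len 1024 \
    --max_seq_len 2048 \
    --use_fused_mlp enable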

5.3 TensorRT-LLM vs llama.cpp on Jetson

| Aspect | TensorRT-LLM | llama.cpp |
|---|---|---|
| Setup complexity | High (build engine) | Low (download GGUF, run) |
| Performance | Best (compiled kernels) | Very good (hand-tuned CUDA) |
| Quantization | FP16, INT8, INT4 AWQ/GPTQ | Q2–Q8 mixed precision (GGUF) |
| Flexibility | Fixed engine (rebuild for changes) | Dynamic (change at runtime) |
| Memory efficiency | Excellent (preallocated) | Good (dynamic allocation) |
| Model support | Major models (Llama, Phi, Mistral, etc.) | Almost everything on HF |
| Best for | Production deployment | Prototyping + production |

Recommendation: Start with llama.cpp for prototyping (5-minute setup). Switch to TensorRT-LLM for production when you need maximum tokens/sec.


6. Speculative Decoding — Free Speed

Speculative decoding uses a small draft model to guess N tokens, then the large target model verifies all N in one forward pass. If the guess is correct, you get N tokens for the price of 1.

Without speculative decoding:
  Target model: generate token 1 → token 2 → token 3 → token 4
  Time: 4 forward passes × 50ms = 200ms

With speculative decoding (draft model guesses 4 tokens):
  Draft model:  generate 4 candidate tokens (fast, ~5ms total)
  Target model: verify all 4 in ONE forward pass (~55ms)
  If 3/4 accepted: 3 tokens in 60ms instead of 150ms → 2.5× faster

Speedup: 1.5–3× depending on acceptance rate
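
The timing argument above can be turned into a small calculator for trying other acceptance counts. This sketch follows the same simplified accounting as the example (it counts only accepted draft tokens and ignores the extra token the target model emits during verification); the function name is hypothetical.

```python
def speculative_speedup(target_ms: float, draft_ms_total: float,
                        verify_ms: float, accepted_tokens: int) -> float:
    """Speedup of one speculative round versus plain decoding."""
    if accepted_tokens <= 0:
        return 0.0  # the whole round produced no accepted draft tokens
    baseline_ms = accepted_tokens * target_ms      # plain decoding cost
    speculative_ms = draft_ms_total + verify_ms    # draft + one verify pass
    return baseline_ms / speculative_ms


for accepted in (1, 2, 3, 4):
    print(accepted, "accepted ->",
          round(speculative_speedup(50, 5, 55, accepted), 2), "x")
```

With only one accepted token per round the ratio drops below 1, which is why a poorly matched draft model makes generation slower rather than faster.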

6.1 On Jetson

# llama.cpp supports speculative decoding.
#   -m       target model (3B)
#   -md      draft model (1B); it must share the target's tokenizer/vocabulary,
#            so pair Llama 3.2 3B with Llama 3.2 1B rather than TinyLlama
#   --draft  speculate 8 tokens ahead
./llama-speculative \
    -m Llama-3.2-3B-Q4_K_M.gguf \
    -md Llama-3.2-1B-Q4_K_M.gguf \
    -ngl 99 \
    --draft 8 \
    -p "Write a comprehensive guide to..."

Memory budget for speculative decoding:
- Target: Llama 3.2 3B INT4 = 1.5 GB
- Draft: Llama 3.2 1B INT4 = 0.7 GB
- Total: 2.2 GB, which fits easily on 8 GB

When NOT to use on Jetson: if the draft model pushes you over the memory budget, speculative decoding hurts more than it helps.


7. Runtime Optimizations

7.1 Power Mode Selection

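# Check available power modes (mode IDs differ between modules and JetPack releases)
sudo nvpmodel -q --verbose

# Set to maximum performance (mode 0 in this guide: 25W MAXN on Orin Nano Super;
# confirm the ID against the list printed above)
sudo nvpmodel -m 0
sudo jetson_clocks    # lock GPU/CPU at max frequency

# Or set a power-efficient mode (7W; mode 1 on some configurations)
sudo nvpmodel -m 1   # fewer CPU cores, lower GPU clock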

Higher power mode = higher clock = more tokens/sec. But thermal design must support it.

7.2 GPU Frequency and Memory Clock

# Check current clocks
tegrastats --interval 500

# Lock GPU to max clock (prevents dynamic frequency scaling during inference)
sudo jetson_clocks

# Check GPU clock
cat /sys/devices/17000000.ga10b/devfreq/17000000.ga10b/cur_freq

Dynamic frequency scaling adds latency jitter. For consistent inference latency, lock clocks.

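7.3 CPU Core Pinning (AGX Orin)

AGX Orin exposes a single unified memory pool shared by CPU and GPU, so there is no NUMA node placement to tune. Pinning the inference process to a fixed set of CPU cores can still help by avoiding scheduler migrations between clusters:

# Pin process to a fixed set of CPU cores
taskset -c 0-5 ./llama-cli -m model.gguf -ngl 99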

7.4 Swap / zram for Emergency Overflow

If a model barely doesn't fit, zram (compressed RAM swap) can help:

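# Enable 4 GB zram (compressed in-memory swap).
# JetPack usually has zram devices active already, so use the device that
# --find reports instead of assuming /dev/zram0.
ZRAM_DEV=$(sudo zramctl --find --size 4G --algorithm zstd)
sudo mkswap "$ZRAM_DEV"
sudo swapon "$ZRAM_DEV" -p 5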

# Now models slightly over RAM can run (with performance penalty)

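zram compresses swapped pages in memory. A ~2:1 ratio is typical for ordinary process memory; quantized model weights are already dense and compress far less. Buffers allocated through CUDA are generally pinned and cannot be swapped on Jetson, so zram mostly helps by pushing idle CPU-side pages out of the way. A 6 GB model on 5.5 GB available might work this way, but expect a 30–50% speed penalty from compression/decompression overhead.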


8. Kernel-Level Optimization — Where the Real Gains Are

Sections 1–7 cover model-level and system-level optimizations. This section goes deeper — into the GPU kernels themselves. This is where cloud platforms like RightNow Forge and RunInfra achieve their 3–7× speedups over baseline inference.

8.1 The GPU Utilization Problem

Most AI inference wastes 80%+ of available GPU cycles:

Typical unoptimized LLM inference on GPU:

SM·00░·····█░░░░·····█░░░░·····█░     █ = compute
SM·01····██░░░·····██░░░·····██░░     ░ = memory I/O (waiting for DRAM)
SM·02···█░░░░·····█░░░░·····█░░░░     · = idle (nothing scheduled)
SM·03·██░░░·····██░░░·····██░░░··
  ~16% SM utilization

After kernel optimization:

SM·00██·█████░·█████░████████████     Same hardware, 5× more useful work
SM·01░░████████████████·██████·██
SM·02██████████··█████░██████░███
SM·03███·█████░░████████████████·
  ~88% SM utilization

Why this happens: Default PyTorch/ONNX kernels are generic — they work on any GPU but optimize for none. Each operation (attention, norm, quantized GEMM) launches a separate kernel, reads from DRAM, computes, writes back. The GPU spends most of its time waiting for memory.

On Jetson this matters even more: 102 GB/s shared bandwidth (vs 3,350 GB/s on H100) means the GPU is frequently starved for data. Kernel optimization directly determines tokens/sec.

8.2 The Three Levels of Kernel Optimization

Level 1: Operator Fusion (easiest, biggest win)
  Combine multiple operations into one kernel → fewer DRAM round-trips
  Example: RMSNorm + residual add + SwiGLU → one kernel, one read, one write

Level 2: Hardware-Specific Tuning (moderate difficulty)
  Tune tile sizes, thread block dimensions, shared memory usage for YOUR specific GPU
  Example: Orin Nano Ampere SM has different optimal tile size than H100 Hopper SM

Level 3: Custom Kernel Generation (hardest, maximum performance)
  Write or generate Triton/CUDA kernels specifically for your model + GPU + precision
  Example: INT4 dequantize-fused-GEMM kernel for Ampere with 128-thread blocks

8.3 Profiling — Find the Bottleneck First

Never optimize blind. Profile to find which kernels consume the most time:

# On Jetson: profile with Nsight Systems
nsys profile --trace=cuda,nvtx -o llm_profile ./llama-cli -m model.gguf -ngl 99 -p "test"

# Analyze the trace
nsys stats llm_profile.nsys-rep

# Example output (typical LLM breakdown):
# Kernel                          Time%    Time       Calls
# ─────────────────────────────────────────────────────────
# attention_fwd                   41.2%    18.4ms     26      ← #1 bottleneck
# quantized_gemm_w4a16           22.8%    10.2ms     78
# rmsnorm_kernel                  14.1%     6.3ms     52
# silu_mul_kernel                  8.3%     3.7ms     26
# rotary_embedding                 5.1%     2.3ms     52
# others                           8.5%     3.8ms     ...

On Jetson, attention dominates even more than on server GPUs because the memory-bandwidth cost of loading Q, K, V from DRAM is proportionally higher.
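
A profile like the one above also tells you how much an optimization can pay off overall. Amdahl's law gives the end-to-end speedup when only one kernel gets faster; the sketch below uses the time fractions from the example breakdown, and the per-kernel speedup factors are hypothetical.

```python
def overall_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: end-to-end gain when `fraction` of time gets faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Fractions from the example profile; the 2x factor is an assumption.
for name, frac in [("attention_fwd", 0.412),
                   ("quantized_gemm_w4a16", 0.228),
                   ("rotary_embedding", 0.051)]:
    print(f"{name:22s} 2x faster -> {overall_speedup(frac, 2.0):.2f}x overall")
```

Doubling the speed of a kernel that takes 5% of the time barely moves the total, which is the quantitative reason to start with the top entry in the profile.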

8.4 Triton Kernels — Writing Custom Optimized Ops

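Triton is an open-source, Python-based GPU kernel language originally developed at OpenAI (unrelated to NVIDIA's Triton Inference Server). It's much easier than raw CUDA and often generates code close to hand-tuned performance.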

Example: Fused RMSNorm + Residual Add (common LLM bottleneck):

import triton
import triton.language as tl

@triton.jit
def fused_rmsnorm_residual_kernel(
    X_ptr, Residual_ptr, Weight_ptr, Out_ptr,
    N: tl.constexpr, eps: tl.constexpr,
    BLOCK_SIZE: tl.constexpr
):
    """RMSNorm(X + Residual) * Weight — one kernel, one DRAM read, one write."""
    row = tl.program_id(0)
    offsets = tl.arange(0, BLOCK_SIZE)
    mask = offsets < N

    # Load X and Residual (one DRAM read each)
    x = tl.load(X_ptr + row * N + offsets, mask=mask, other=0.0)
    r = tl.load(Residual_ptr + row * N + offsets, mask=mask, other=0.0)

    # Fused: add residual + compute RMS norm + scale by weight
    h = x + r                                          # residual add
    mean_sq = tl.sum(h * h, axis=0) / N               # mean of squares
    rrms = 1.0 / tl.sqrt(mean_sq + eps)               # reciprocal RMS
    w = tl.load(Weight_ptr + offsets, mask=mask, other=1.0)
    out = h * rrms * w                                 # normalize + scale

    # One DRAM write
    tl.store(Out_ptr + row * N + offsets, out, mask=mask)

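Without fusion: 3 separate kernels (residual add, RMSNorm, weight multiply), each reading its input from DRAM and writing its result back, for about 6 activation-sized DRAM transfers (7 if both operands of the add are counted). With fusion: 1 kernel reads X and Residual once and writes the output once, with every intermediate kept in registers. That is roughly 2–3× less memory traffic.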
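The traffic comparison can be checked with a few lines of arithmetic. The sketch below counts every activation-sized read and write explicitly, including both operands of the add, so its tally differs slightly from the per-kernel shorthand above while reaching the same conclusion. The hidden size of 3072 and FP16 activations are assumptions chosen for illustration, and the small per-row weight vector is ignored.

```python
# Count activation-sized DRAM transfers for RMSNorm(X + Residual) * Weight.
# The weight vector is ignored: it is small and reused across rows.

def unfused_passes() -> int:
    residual_add = 2 + 1  # read X, read Residual, write H
    rmsnorm = 1 + 1       # read H, write normalized H
    weight_mul = 1 + 1    # read normalized H, write Out
    return residual_add + rmsnorm + weight_mul


def fused_passes() -> int:
    return 2 + 1          # read X, read Residual, write Out


def traffic_bytes(passes: int, rows: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Total bytes moved, assuming FP16 activations by default."""
    return passes * rows * hidden * bytes_per_elem


rows, hidden = 1, 3072    # assumed: one decode token, hidden size 3072
unfused = traffic_bytes(unfused_passes(), rows, hidden)
fused = traffic_bytes(fused_passes(), rows, hidden)
print(f"unfused: {unfused} B, fused: {fused} B, ratio: {unfused / fused:.2f}x")
```

Under this stricter count the saving is about 2.3× per norm layer; the exact figure depends on which transfers are counted and on how much the L2 cache absorbs.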

8.5 Autokernel — Automated Kernel Generation

RightNow Autokernel automates the process of generating optimized Triton/CUDA kernels for specific GPU hardware:

# Install autokernel
pip install autokernel

# Generate optimized kernels for your model + GPU
autokernel optimize \
    --model "Llama-3.2-3B" \
    --gpu "orin-nano" \
    --precision "int4" \
    --output ./optimized_kernels/

What autokernel does:
1. Profiles your model to find the slowest kernels (attention, GEMM, norm)
2. Generates Triton kernel variants with different tile sizes and thread configurations
3. Benchmarks all variants on your specific GPU
4. Selects the fastest configuration
5. Verifies numerical correctness against a reference implementation
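The generate, benchmark, select, verify loop described above can be illustrated without a GPU. The following is a toy sketch, not autokernel's implementation: `blocked_sum_sq`, `reference_sum_sq`, and `autotune` are hypothetical names, and a CPU sum of squares with a block-size parameter stands in for a Triton kernel with a tile-size parameter.

```python
import time


def reference_sum_sq(xs):
    """Reference implementation: straightforward sum of squares."""
    return sum(x * x for x in xs)


def blocked_sum_sq(xs, block):
    """Candidate 'kernel': same result, computed block by block."""
    total = 0.0
    for start in range(0, len(xs), block):
        total += sum(x * x for x in xs[start:start + block])
    return total


def autotune(kernel, reference, xs, candidates, repeats=3):
    """Return (best_config, best_seconds) among configs that match the reference."""
    expected = reference(xs)
    best_cfg, best_time = None, float("inf")
    for cfg in candidates:
        # Correctness check: reject any variant whose result is numerically wrong.
        if abs(kernel(xs, cfg) - expected) > 1e-9 * max(1.0, abs(expected)):
            continue
        # Benchmark: take the best of a few timed runs for this variant.
        elapsed = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(xs, cfg)
            elapsed = min(elapsed, time.perf_counter() - t0)
        # Selection: keep the fastest correct variant seen so far.
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time


xs = [float(i % 7) for i in range(4096)]   # integer-valued data keeps sums exact
candidates = [16, 64, 256, 1024]           # stand-ins for tile sizes
best_cfg, best_time = autotune(blocked_sum_sq, reference_sum_sq, xs, candidates)
print(f"best block size: {best_cfg} ({best_time * 1e6:.1f} us)")
```

The winning configuration depends on the machine the loop runs on, which is the point: the same search has to be repeated on the target GPU rather than copied from another one.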

The key insight: optimal kernel parameters differ dramatically between GPUs:

| Parameter | H100 (Hopper) | A100 (Ampere) | Orin Nano (Ampere) |
|---|---|---|---|
| Best GEMM tile | 256×128 | 128×128 | 64×64 (smaller SMs) |
| Thread block | 256 threads | 256 threads | 128 threads (fewer warps) |
| Shared mem usage | 164 KB | 164 KB | 48 KB (default per-block limit) |
| Optimal batch | 64+ | 32+ | 1–4 (memory limited) |

A kernel tuned for H100 can be 2–3× slower on Orin Nano than a kernel tuned for Orin Nano specifically.
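One reason tile sizes do not transfer between GPUs is the shared memory each tile needs. The sketch below estimates that footprint for an FP16 GEMM that stages an A tile and a B tile per pipeline stage, and compares it with the 48 KB figure from the table. The K-slice sizes and stage counts are assumptions for illustration; real kernels also spend shared memory on other buffers.

```python
def gemm_tile_smem_bytes(block_m, block_n, block_k, stages, bytes_per_elem=2):
    """Shared memory for staging an A tile (M x K) and a B tile (K x N)."""
    per_stage = (block_m * block_k + block_k * block_n) * bytes_per_elem
    return stages * per_stage


SMEM_BUDGET = 48 * 1024  # 48 KB figure from the table above

# (block_m, block_n, block_k, stages); block_k and stages are assumed values
configs = {
    "64x64, K=32, 2 stages": (64, 64, 32, 2),
    "128x128, K=32, 2 stages": (128, 128, 32, 2),
    "256x128, K=64, 3 stages": (256, 128, 64, 3),
}
for name, cfg in configs.items():
    used = gemm_tile_smem_bytes(*cfg)
    verdict = "fits" if used <= SMEM_BUDGET else "too large"
    print(f"{name}: {used // 1024} KiB ({verdict})")
```

Under these assumptions the 64×64 tile uses 16 KiB and leaves room for several resident blocks, while the H100-style 256×128 configuration needs 144 KiB and cannot launch within a 48 KB budget at all.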

8.6 RightNow Forge — Enterprise Kernel Optimization

RightNow Forge is the enterprise platform that automates the full kernel optimization pipeline:

Input:  model = "Llama-3.2-3B"
        gpu = "Jetson Orin Nano"
        baseline = "llama.cpp default"

Forge pipeline:
  1. Profile all kernels on target GPU
  2. Identify bottlenecks:
     ▲ attention       41% of total → generate FlashAttention variant for Ampere
     ▲ quantized GEMM  23% of total → generate INT4 dequant-fused GEMM
     ▲ rmsnorm         14% of total → generate fused RMSNorm+residual
  3. Compile optimized Triton kernels for Orin Nano SM
  4. Verify correctness (bit-accurate vs reference)
  5. Output: drop-in replacement kernels

Result:
  TTFT (Time to First Token): 320ms → 42ms (7.6× faster)
  Throughput: 15 tok/s → 45 tok/s (3× faster)
  SM utilization: 16% → 72%

8.7 Manual Kernel Optimization Checklist

If you're writing your own optimized kernels (Phase 5F territory), here's the priority order for Jetson:

Priority 1: Reduce DRAM traffic (Jetson's #1 bottleneck)
  □ Fuse consecutive elementwise ops (norm + add + activation)
  □ Fuse dequantize into GEMM (don't write dequantized weights to DRAM)
  □ Use FlashAttention (fuse Q×K + softmax + ×V into one kernel)
  □ Compute in registers/shared memory, write final result once

Priority 2: Maximize Tensor Core utilization
  □ Use WMMA/MMA instructions for matrix multiply (not scalar CUDA cores)
  □ Pad matrices to multiples of 16 for Tensor Core alignment
  □ Keep data in FP16/INT8 format that Tensor Cores consume directly

Priority 3: Tune for Orin Nano's specific SM
  □ Smaller tile sizes (64×64 vs 128×128 on server GPUs)
  □ Fewer threads per block (128 vs 256 — fewer warps available)
  □ Account for the 48 KB default shared memory limit per block (larger allocations need an explicit opt-in)
  □ Fewer SMs (8 on Orin Nano 8 GB vs 132 on H100), so fewer blocks in flight

Priority 4: Minimize kernel launch overhead
  □ Fuse small kernels into larger ones
  □ Use CUDA graphs to batch kernel launches
  □ Pre-allocate all buffers (no cudaMalloc during inference)
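The padding item under Priority 2 has a cost that is easy to quantify. The helper below is a minimal sketch with hypothetical names (`round_up`, `padded_gemm_shape`, `useful_fraction`): it rounds each GEMM dimension up to a multiple of 16 and reports what fraction of the padded work is real. It models padding only and says nothing about how a particular kernel or library handles small M internally.

```python
def round_up(n: int, multiple: int = 16) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def padded_gemm_shape(m: int, n: int, k: int, multiple: int = 16):
    """Pad every GEMM dimension to the Tensor Core alignment."""
    return tuple(round_up(d, multiple) for d in (m, n, k))


def useful_fraction(m: int, n: int, k: int, multiple: int = 16) -> float:
    """Share of the padded multiply-accumulate work that is not padding."""
    pm, pn, pk = padded_gemm_shape(m, n, k, multiple)
    return (m * n * k) / (pm * pn * pk)


# Batch-1 decode (assumed 3072x3072 projection): M=1 is padded to 16.
print(padded_gemm_shape(1, 3072, 3072))
print(f"{useful_fraction(1, 3072, 3072):.4f}")
```

For a batch-1 decode step only 1/16 of the padded tile along M is real work, which is one reason decode on Jetson is limited by memory bandwidth rather than Tensor Core throughput.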

8.8 CUDA Graphs — Eliminate Launch Overhead

Each CUDA kernel launch has ~5–10 µs of overhead. An LLM forward pass with 100+ kernel launches therefore spends roughly 0.5–1 ms per token on launch overhead alone. CUDA graphs capture the entire sequence once and replay it with a single launch call:

// Capture the inference graph once
cudaGraph_t graph;
cudaGraphExec_t instance;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

// All kernel launches are recorded, not executed
attention_kernel<<<grid, block, 0, stream>>>(q, k, v, out);
rmsnorm_kernel<<<grid, block, 0, stream>>>(out, norm_out);
ffn_kernel<<<grid, block, 0, stream>>>(norm_out, ffn_out);
// ... all layers ...

cudaStreamEndCapture(stream, &graph);
// Legacy 5-argument form (CUDA 11.x); CUDA 12 uses cudaGraphInstantiate(&instance, graph, flags)
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);

// Replay the entire forward pass with ONE launch
for (int token = 0; token < max_tokens; token++) {
    // Kernel arguments are frozen at capture time. Per-token state must live in
    // device buffers the captured kernels already read (e.g. a position counter),
    // or be patched into the graph with cudaGraphExecKernelNodeSetParams.
    update_input_pointers(token);  // update KV cache pointers
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
}
// Launch overhead: ~5µs total instead of ~1ms

TensorRT-LLM uses CUDA graphs internally. llama.cpp has experimental CUDA graph support.
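Whether launch overhead matters depends on how long a token takes. The sketch below turns the figures from this section (about 100 launches at up to 10 µs each) into a fraction of per-token time at two generation speeds taken from the benchmark table in Section 10. The function names are hypothetical, and the model ignores any overlap between launch latency and GPU execution.

```python
def launch_overhead_ms(num_kernels: int, per_launch_us: float) -> float:
    """Total launch overhead per forward pass, in milliseconds."""
    return num_kernels * per_launch_us / 1000.0


def overhead_fraction(num_kernels: int, per_launch_us: float, tokens_per_s: float) -> float:
    """Share of per-token time spent on kernel launches."""
    token_time_ms = 1000.0 / tokens_per_s
    return launch_overhead_ms(num_kernels, per_launch_us) / token_time_ms


for tok_s in (25, 65):
    frac = overhead_fraction(100, 10.0, tok_s)
    print(f"{tok_s} tok/s: launch overhead is {frac:.1%} of each token")
```

The faster the model, the larger the share: about 2.5% at 25 tok/s but about 6.5% at 65 tok/s, so CUDA graphs pay off most for the smallest models.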


9. Complete Optimization Checklist

Before deployment — run through this checklist:

□ Model Selection
  □ Model fits in INT4 with room for KV cache
  □ GQA-based model preferred (smaller KV cache)
  □ Context length budgeted against available memory

□ Quantization
  □ AWQ or GPTQ 4-bit for best quality/size
  □ GGUF Q4_K_M for llama.cpp deployment
  □ Calibration data representative of production input

□ KV Cache
  □ INT8 KV cache enabled
  □ Maximum context length capped to fit memory
  □ GQA model chosen to minimize KV memory

□ Inference Engine
  □ llama.cpp with -ngl 99 (full GPU offload)
  □ Or TensorRT-LLM engine compiled for target batch/context
  □ FlashAttention enabled

□ Kernel Optimization
  □ Profile with nsys to find top 3 bottleneck kernels
  □ Fused ops enabled (RMSNorm+residual, SwiGLU, rotary)
  □ CUDA graphs enabled for decode loop (reduce launch overhead)
  □ Tile sizes appropriate for Orin Nano SM (64×64, not 256×128)
  □ Consider autokernel/Forge for automated kernel tuning

□ System Configuration
  □ nvpmodel set to appropriate power mode
  □ jetson_clocks to lock frequencies
  □ CMA reduced (LLM doesn't need large CMA)
  □ Unnecessary services disabled (GUI, bluetooth)

□ Profiling
  □ tegrastats monitored during inference
  □ Tokens/sec measured at steady state
  □ Memory usage verified (no slow growth / leak)
  □ Thermal verified (no throttling under sustained load)
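The model-selection and KV-cache items in the checklist come down to one formula: KV bytes = 2 × layers × KV heads × head dimension × context × bytes per element. The sketch below applies it to a GQA model. The configuration used (28 layers, 8 KV heads, head dimension 128) is an assumption intended to approximate Llama 3.2 3B; check it against the model's own config file before relying on the numbers.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache added per token; the factor 2 covers K and V."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_elem):
    """Total KV cache size for a full context window."""
    return kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem) * context


def max_context(budget_bytes, layers, kv_heads, head_dim, bytes_per_elem):
    """Longest context that fits in a given KV cache budget."""
    return budget_bytes // kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem)


MODEL = dict(layers=28, kv_heads=8, head_dim=128)  # assumed GQA config, see lead-in
MIB = 1024 * 1024

fp16 = kv_cache_bytes(context=2048, bytes_per_elem=2, **MODEL)
int8 = kv_cache_bytes(context=2048, bytes_per_elem=1, **MODEL)
print(f"FP16 KV @ 2048 ctx: {fp16 // MIB} MiB")
print(f"INT8 KV @ 2048 ctx: {int8 // MIB} MiB")
print(f"Max INT8 context in 1 GiB: {max_context(1024 * MIB, bytes_per_elem=1, **MODEL)} tokens")
```

With these assumptions the INT8 cache halves the footprint from 224 MiB to 112 MiB at 2048 tokens, which is the margin the checklist asks you to budget for.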

10. Benchmark Reference

Expected performance on Orin Nano Super 8 GB (25W mode, Q4_K_M, llama.cpp):

| Model | Params | GGUF size | Prompt eval | Generation | Context |
|---|---|---|---|---|---|
| TinyLlama 1.1B | 1.1B | 0.6 GB | ~200 tok/s | ~65 tok/s | 2048 |
| Llama 3.2 1B | 1.3B | 0.7 GB | ~170 tok/s | ~55 tok/s | 2048 |
| Gemma 2 2B | 2.6B | 1.5 GB | ~85 tok/s | ~35 tok/s | 2048 |
| Llama 3.2 3B | 3.2B | 1.8 GB | ~65 tok/s | ~25 tok/s | 2048 |
| Phi-3 Mini 3.8B | 3.8B | 2.2 GB | ~50 tok/s | ~20 tok/s | 2048 |

These are estimates for Orin Nano Super at 25W. The ~1.7× improvement over the original Orin Nano comes from 102 GB/s bandwidth (vs 68 GB/s) and higher clock speeds. Actual performance depends on power mode, thermal design, context length, and prompt content. Always benchmark your specific configuration.
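A quick sanity check for any measured generation speed is the bandwidth ceiling: if every weight is read once per token, tokens/sec cannot exceed memory bandwidth divided by model size. The sketch below applies that bound to the table above, using the 102 GB/s figure and the listed GGUF sizes. It is a rough upper bound under stated assumptions: KV cache reads, activation traffic, and achievable (as opposed to peak) bandwidth are all ignored.

```python
BANDWIDTH_GB_S = 102.0  # Orin Nano Super peak memory bandwidth

# name: (GGUF size in GB, listed generation speed in tok/s), from the table above
MODELS = {
    "TinyLlama 1.1B": (0.6, 65),
    "Llama 3.2 1B": (0.7, 55),
    "Gemma 2 2B": (1.5, 35),
    "Llama 3.2 3B": (1.8, 25),
    "Phi-3 Mini 3.8B": (2.2, 20),
}


def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if all weights are read once per token."""
    return bandwidth_gb_s / weights_gb


for name, (size_gb, listed) in MODELS.items():
    ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, size_gb)
    print(f"{name}: ceiling {ceiling:.1f} tok/s, listed {listed} ({listed / ceiling:.0%} of ceiling)")
```

The listed estimates land between roughly 38% and 51% of the ceiling. A measurement far above the ceiling points to a benchmarking error, and one far below it points to a configuration problem such as partial GPU offload or thermal throttling.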


11. Projects

| # | Project | What you learn |
|---|---|---|
| 1 | llama.cpp on Jetson | Download Llama 3.2 3B Q4_K_M, build llama.cpp with CUDA, measure tokens/sec at different context lengths |
| 2 | Quantization comparison | Run same model at Q2_K, Q4_K_M, Q6_K, Q8_0. Measure tokens/sec, memory, and output quality (perplexity) |
| 3 | TensorRT-LLM engine | Build a TRT-LLM engine for Phi-3 Mini INT4. Compare latency with llama.cpp on same prompts |
| 4 | Speculative decoding | Set up TinyLlama as draft for Llama 3.2 3B. Measure acceptance rate and speedup |
| 5 | Memory budget audit | Run tegrastats during inference. Map every MB: OS, CMA, model, KV cache, activations. Verify against Section 3.3 |
| 6 | Power vs performance | Benchmark same model at 7W, 15W, 25W (Orin Nano Super or Orin NX). Plot tokens/sec vs power. Calculate tokens/joule |
| 7 | Context length scaling | Measure tokens/sec at context 512, 1024, 2048, 4096. Plot. Identify where KV cache pressure causes degradation |
| 8 | Production chatbot | Build a REST API serving Llama 3.2 3B on Jetson using llama.cpp server mode. Measure P50/P95 latency under concurrent requests |
| 9 | Nsight profile analysis | Profile llama.cpp inference with nsys. Identify top 3 kernels by time. Calculate SM utilization. Find memory-bound vs compute-bound kernels |
| 10 | Fused RMSNorm Triton kernel | Write the fused RMSNorm+residual kernel from Section 8.4. Benchmark against unfused PyTorch version. Measure DRAM traffic reduction |
| 11 | Autokernel on Jetson | Use autokernel to generate optimized kernels for a small model on Orin Nano. Compare throughput before/after. Document which kernels changed |
| 12 | CUDA graphs | Wrap the decode loop of a simple transformer in a CUDA graph. Measure kernel launch overhead before/after. Target: <10 µs total launch per token |
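For the memory budget audit (project 5) and the leak check in the deployment checklist, a small parser over saved tegrastats output is often enough. The sketch below assumes each line contains a field of the form `RAM used/totalMB`; the sample lines are made up for illustration, and the exact tegrastats format can differ between JetPack releases, so adjust the regular expression to what your device prints.

```python
import re

RAM_RE = re.compile(r"RAM (\d+)/(\d+)MB")  # assumed field format, see lead-in


def parse_ram_mb(line: str):
    """Return (used_mb, total_mb) from one tegrastats line, or None if absent."""
    match = RAM_RE.search(line)
    return (int(match.group(1)), int(match.group(2))) if match else None


def slope_per_sample(values) -> float:
    """Least-squares slope of values against sample index (MB per sample)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Hypothetical log excerpt: RAM usage creeping up by 10 MB per sample
sample_log = [
    "RAM 3400/7620MB (lfb 12x4MB) SWAP 0/3810MB",
    "RAM 3410/7620MB (lfb 12x4MB) SWAP 0/3810MB",
    "RAM 3420/7620MB (lfb 11x4MB) SWAP 0/3810MB",
    "RAM 3430/7620MB (lfb 11x4MB) SWAP 0/3810MB",
]
used = [parse_ram_mb(line)[0] for line in sample_log]
growth = slope_per_sample(used)
print(f"RAM growth: {growth:.1f} MB per sample")
```

A slope that stays clearly positive over a long steady-state run suggests a leak or an unbounded KV cache; a slope near zero after warm-up is what a healthy deployment should show.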

12. Resources

| Resource | What it covers |
|---|---|
| llama.cpp | Best open-source LLM inference engine for edge |
| TensorRT-LLM | NVIDIA's optimized LLM engine |
| NVIDIA Jetson AI Lab | Pre-built containers and tutorials for LLMs on Jetson |
| Jetson Generative AI Playground | NVIDIA's LLM deployment guides for Jetson |
| AutoAWQ | AWQ quantization library |
| FlashAttention-2 paper | Algorithm behind fused attention |
| Speculative Decoding paper | Original speculative decoding paper |
| vLLM paper (PagedAttention) | KV cache memory management (server reference) |
| Orin Nano Memory Architecture | Unified memory deep dive (this roadmap) |
| RightNow Autokernel | Open-source automated GPU kernel optimization |
| RightNow Forge | Enterprise kernel optimization platform (profile → generate → verify) |
| Triton Language | Python-based GPU kernel language (easier than CUDA) |
| CUDA Graphs Programming Guide | Capture and replay kernel sequences for reduced launch overhead |
| RunInfra | Cloud LLM optimization platform (reference for techniques) |

Next

→ Back to ML and AI hub