Testing Guide: jetson-llm

Step-by-step testing from first build to production validation.

Step 0: Recommended Test Model

TinyLlama 1.1B Q4_K_M — the best model for initial testing:

Property	Value	Why it's good for testing
Parameters	1.1B	Small enough to load fast, debug easily
Q4_K_M size	669 MB	Leaves ~5 GB free — no OOM risk
Vocab	32,000 (Llama tokenizer)	Standard, well-tested GGUF format
Architecture	Llama-style	Same tensor names as Llama 3.2 (our target)
Context	2048	Short — fast prefill
Quality	Decent for 1B	Can verify coherent output vs garbage
Download	~670 MB	Fast transfer over USB

# Download on internet-connected machine
wget -O TinyLlama-1.1B-Q4_K_M.gguf \
  "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

# Transfer to Jetson
scp TinyLlama-1.1B-Q4_K_M.gguf user@192.168.55.1:/opt/models/

After TinyLlama works, graduate to the real target:

Stage	Model	Size	Purpose
1. First boot	TinyLlama 1.1B Q4_K_M	669 MB	Validate loading, tokenizer, basic output
2. Correctness	Llama 3.2 1B Q4_K_M	750 MB	Verify modern GGUF format + GQA
3. Target model	Llama 3.2 3B Q4_K_M	1.8 GB	The actual deployment model
4. Stress test	Phi-3 Mini 3.8B Q4_K_M	2.3 GB	Larger model, tighter memory
5. Limit test	Llama 3.1 8B Q4_K_M	4.6 GB	Edge of what fits; tests OOM guard

Step 1: Build on Jetson

# SSH into Jetson
ssh user@192.168.55.1

# Clone or copy the project
cd /opt
git clone <your-repo-url> jetson-llm-runtime
cd jetson-llm-runtime

# First-time setup (sets power mode, checks system)
./scripts/setup_jetson.sh

# Or manual build
cmake -B build \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="87" \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Expected output:

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- The CUDA compiler identification is NVIDIA 12.6
-- CUDA architectures: 87
...
[100%] Built target jetson-llm

If the build fails:

- nvcc not found → install the toolkit: sudo apt install cuda-toolkit-12-6
- aarch64 check fails → you're cross-compiling on x86 (not supported; build on the Jetson)
- cuda_fp16.h not found → CUDA toolkit incomplete: sudo apt install cuda-*-12-6

Step 2: Test Memory Subsystem (No Model Needed)

./build/test_memory

Expected output:

╔══════════════════════════════════╗
║   JLLM Memory Budget             ║
╠══════════════════════════════════╣
║ Total DRAM:      7633 MB         ║
║ OS + kernel:    - 500 MB         ║
║ CMA reserved:  - 768 MB         ║
║ CUDA context:  - 300 MB         ║
║ Model weights: -   0 MB         ║
║ KV cache:      -   0 MB         ║
║ Scratch:       -   0 MB         ║
║ Safety margin: - 256 MB         ║
╠══════════════════════════════════╣
║ FREE:           5809 MB         ║
╚══════════════════════════════════╝
PASS: probe_system_memory
Free: 5814 MB
PASS: OOMGuard
PASS: ScratchPool

All memory tests passed.

What to check:

- Total DRAM should be ~7633 MB (not 8192; carveouts take the rest)
- FREE should be >5000 MB on a stock Orin Nano Super (GUI disabled)
- If FREE < 4000 MB: disable the GUI, reduce CMA, kill unnecessary services
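The budget table is plain subtraction, so it is easy to re-derive by hand when a number looks off. The sketch below reproduces the FREE figure from the sample output; `free_budget_mb` is a hypothetical helper written for this guide, not a function in jetson-llm, and the reservation values are simply copied from the table above.

```python
# Reservations copied from the sample test_memory output (all values in MB).
TOTAL_DRAM_MB = 7633
RESERVED_MB = {
    "os_kernel": 500,
    "cma": 768,
    "cuda_context": 300,
    "model_weights": 0,
    "kv_cache": 0,
    "scratch": 0,
    "safety_margin": 256,
}

def free_budget_mb(total_mb, reserved):
    """Return what is left after subtracting every reservation (hypothetical helper)."""
    return total_mb - sum(reserved.values())

free_mb = free_budget_mb(TOTAL_DRAM_MB, RESERVED_MB)
print(f"FREE: {free_mb} MB")  # 7633 - 500 - 768 - 300 - 256 = 5809

# Same budget once TinyLlama's 669 MB of weights are accounted for.
with_model = dict(RESERVED_MB, model_weights=669)
free_with_model_mb = free_budget_mb(TOTAL_DRAM_MB, with_model)
print(f"FREE with TinyLlama loaded: {free_with_model_mb} MB")
```

If the FREE line printed on your board differs from this arithmetic, the likely cause is a different CMA size or OS footprint rather than a bug in the test.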

Step 3: Test CUDA Kernels (No Model Needed)

./build/test_kernels

Expected output:

=== jetson-llm kernel tests ===

GPU: Orin (SM 8.7, 16 SMs)

PASS: softmax (sum = 1.000000)
PASS: rope (pos=0 identity, val[0]=1.0000)
PASS: fused_rmsnorm (val[0]=1.0000, expected ~1.0)
PASS: fp16_to_int8 (scale=0.003937, expected ~0.00394)
PASS: swiglu (val=1.4624, expected ~1.4621)

All kernel tests passed.

If a test fails:

- CUDA error: no kernel image is available → wrong arch; rebuild with -DCMAKE_CUDA_ARCHITECTURES="87"
- CUDA error: out of memory → something else is using the GPU; check tegrastats
- Wrong numerical result → kernel bug; file an issue with the exact values
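When a kernel reports a wrong numerical result, a CPU reference helps decide whether the kernel or the expectation is at fault. The following is a minimal pure-Python sketch of the five operations. The inputs are assumptions chosen so the results line up with the sample values above; the actual inputs used by test_kernels are not shown in this guide.

```python
import math

def softmax(xs):
    """Numerically stable softmax; the outputs should sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(xs, weight, eps=1e-5):
    """RMSNorm: scale each element by weight / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [w * x / rms for x, w in zip(xs, weight)]

def rope_pair(x0, x1, pos, freq):
    """Rotate one (x0, x1) pair by pos * freq radians; pos=0 is the identity."""
    angle = pos * freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c

def swiglu(gate, up):
    """SiLU(gate) * up, where SiLU(x) = x * sigmoid(x)."""
    return gate / (1.0 + math.exp(-gate)) * up

def int8_scale(xs):
    """Symmetric INT8 quantization scale: max |x| mapped to 127."""
    return max(abs(x) for x in xs) / 127.0

# Assumed inputs, picked to match the sample output above.
print(f"softmax sum = {sum(softmax([1.0, 2.0, 3.0])):.6f}")
print(f"rope pos=0  = {rope_pair(1.0, 0.0, pos=0, freq=0.5)}")
print(f"rmsnorm     = {rmsnorm([1.0] * 4, [1.0] * 4)[0]:.4f}")
print(f"int8 scale  = {int8_scale([0.5, -0.25, 0.1]):.6f}")
print(f"swiglu      = {swiglu(1.0, 2.0):.4f}")
```

The sample log shows swiglu returning 1.4624 against an expected ~1.4621; a gap of that size is typical of FP16 arithmetic, so compare with a tolerance rather than exact equality.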

Step 4: Test Model Loading (Needs GGUF File)

./build/test_model_load /opt/models/TinyLlama-1.1B-Q4_K_M.gguf

Expected output:

=== jetson-llm model loading test ===

Test 1: System probe
╔══════════════════════════════════════╗
║   Jetson LLM Runtime v0.1            ║
║ L4T:    36.4       CUDA: 12.6       ║
║ SMs:    16          Cores: 1024      ║
║ RAM:    7633  MB    CMA: 768  MB    ║
╚══════════════════════════════════════╝
PASS

Test 2: Memory budget
PASS

Test 3: GGUF config parsing
  name:         TinyLlama-1.1B-Chat-v1.0
  n_layers:     22
  n_heads:      32
  n_kv_heads:   4
  head_dim:     64
  hidden_dim:   2048
  inter_dim:    5632
  vocab_size:   32000
  max_seq_len:  2048
  rope_theta:   10000
  GQA group:    8
PASS

Test 4: Weight size estimate
  Estimated: ~630 MB
  Will fit in 5809 MB free? YES
PASS

Test 5: KV cache context calculation
  Max context (FP16 KV): ~18000 tokens
  Max context (INT8 KV): ~36000 tokens
  KV per token (INT8):   11264 bytes
PASS

Test 6: Tokenizer
  Loaded: yes
  Vocab size: 32000
  BOS ID: 1
  EOS ID: 2
  'Hello' → 3 tokens: 1 15043 29871
  Decoded: 'Hello '
PASS

Test 7: Weight loading
  Loaded: yes (669 MB)
  tok_embd:    0x7f8a000000
  output_norm: 0x7f8a100000
  output:      0x7f8a200000
  Layers with QKV: 22 / 22
PASS

Test 8: Power and thermal
  Power mode: 25W, GPU @ 1300 MHz
  GPU: 42.5°C, CPU: 41.0°C, throttling: no
  Backoff: 0 µs
PASS

═══════════════════════════════════════
  All model loading tests passed!
═══════════════════════════════════════

What to check at each test:

Test	What it validates	Red flag
3	GGUF config	`n_layers=0` or `vocab_size=0` → parser failed
5	Context calc	`Max context < 100` → memory too tight
6	Tokenizer	`Loaded: no` → vocab not found in GGUF
6	Encode	`'Hello' → 1 token` → tokenizer broken (should be 2-3)
7	Weight load	`tok_embd: (nil)` → tensor mapping failed
7	Layers with QKV	`0 / 22` → tensor names don't match
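The "KV per token" figure in Test 5 can be checked by hand from the Test 3 config: each layer stores one K and one V vector of n_kv_heads × head_dim elements per token. The sketch below reproduces the 11264-byte INT8 value. The 400 MiB budget is an assumption for illustration only (the runtime's real KV budget is not stated here), and the formula ignores any per-block quantization scales.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """K and V each hold n_kv_heads * head_dim elements per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def max_context(budget_bytes, per_token_bytes):
    """Largest whole number of tokens whose KV entries fit in the budget."""
    return budget_bytes // per_token_bytes

# TinyLlama values from the Test 3 output.
int8_per_token = kv_bytes_per_token(n_layers=22, n_kv_heads=4, head_dim=64, bytes_per_elem=1)
fp16_per_token = kv_bytes_per_token(n_layers=22, n_kv_heads=4, head_dim=64, bytes_per_elem=2)
print(f"KV per token (INT8): {int8_per_token} bytes")  # 2 * 22 * 4 * 64 = 11264
print(f"KV per token (FP16): {fp16_per_token} bytes")

# Hypothetical budget: 400 MiB reserved for the KV cache.
budget = 400 * 1024 * 1024
print(f"Max context (INT8): {max_context(budget, int8_per_token)} tokens")
print(f"Max context (FP16): {max_context(budget, fp16_per_token)} tokens")
```

INT8 halves the bytes per token, which is why the INT8 context in the sample output is twice the FP16 one.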

If Test 7 shows 0 / 22 layers mapped: This means the GGUF tensor names don't match our pattern (blk.N.attn_q.weight). Run this to see actual names:

# Inspect tensor names in GGUF (using python)
pip install gguf
python3 -c "
from gguf import GGUFReader
reader = GGUFReader('/opt/models/TinyLlama-1.1B-Q4_K_M.gguf')
for t in reader.tensors:
    print(f'{t.name:50s} {t.tensor_type.name:10s} {t.shape}')
" | head -30

Compare the output names with what load_and_map_weights() expects and fix the pattern matching.
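To narrow down which layers fail to map, it can help to run the dumped names through the same kind of pattern offline. The sketch below is a stand-in for the matching logic, not the actual code of load_and_map_weights(): the regex and the helper `layers_with_qkv` are hypothetical, and the attn_k / attn_v names are assumed to follow the same llama.cpp-style GGUF convention as blk.N.attn_q.weight.

```python
import re
from collections import defaultdict

# Assumed naming convention: blk.<layer>.attn_{q,k,v}.weight
LAYER_TENSOR = re.compile(r"^blk\.(\d+)\.(attn_q|attn_k|attn_v)\.weight$")

def layers_with_qkv(tensor_names):
    """Return the sorted layer indices that have all three of Q, K and V."""
    found = defaultdict(set)
    for name in tensor_names:
        match = LAYER_TENSOR.match(name)
        if match:
            found[int(match.group(1))].add(match.group(2))
    return sorted(layer for layer, parts in found.items() if len(parts) == 3)

# Example input: layer 1 is missing attn_v on purpose.
names = [
    "token_embd.weight",
    "blk.0.attn_q.weight", "blk.0.attn_k.weight", "blk.0.attn_v.weight",
    "blk.1.attn_q.weight", "blk.1.attn_k.weight",
    "output_norm.weight",
]
mapped = layers_with_qkv(names)
print(f"Layers with QKV: {len(mapped)} / 2 -> {mapped}")
```

Feed it the names printed by the GGUFReader snippet above. A result of zero layers usually points to a different naming scheme; some architectures may store a single fused QKV tensor instead of three separate ones, in which case the pattern needs a separate branch.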

Step 5: First Generation Test

# Short generation — verify output is coherent (not garbage)
./build/jetson-llm \
    -m /opt/models/TinyLlama-1.1B-Q4_K_M.gguf \
    -p "What is 2+2?" \
    -n 32 \
    -v

Good output (coherent text):

What is 2+2? The answer is 4.

--- Stats ---
Prompt:  5 tokens, 85.2 tok/s (58 ms)
Decode:  12 tokens, 28.4 tok/s (422 ms)
Memory:  peak 1823 MB
Thermal: peak 48.2°C

Bad output (bugs present):

What is 2+2?ÿÿÿÿÿÿÿÿÿÿ...        ← garbage: tensor offsets wrong (bug #1)
What is 2+2?                          ← no output: decode_step returns EOS immediately
What is 2+2? é«ÿ∂π∑...               ← random tokens: attention accumulator bug (#6)
Segfault                               ← null weight pointer: mapping failed

Debug steps if output is wrong:

# 1. Check tensor offsets are sane
./build/test_model_load model.gguf 2>&1 | grep -E "tensor|Mapped|WARNING"

# 2. Profile to see which kernels are running
nsys profile --trace=cuda -o debug \
    ./build/jetson-llm -m model.gguf -p "Hi" -n 5
nsys stats debug.nsys-rep

# 3. Check memory during inference
./build/jetson-llm -m model.gguf -p "Hi" -n 10 &
PID=$!; sleep 1; cat /proc/$PID/status | grep VmRSS; wait $PID

# 4. Compare tokenizer output with reference
python3 -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0')
ids = tok.encode('What is 2+2?')
print('Reference tokens:', ids)
print('Decoded:', tok.decode(ids))
"

Step 6: Interactive Chat Test

./build/jetson-llm -m /opt/models/TinyLlama-1.1B-Q4_K_M.gguf -i

> Hello, who are you?
I'm TinyLlama, a small language model...
[12 tokens, 28.4 tok/s, peak 1823 MB, 48.2°C]

> What is the capital of France?
Paris is the capital of France...
[15 tokens, 27.1 tok/s, peak 1830 MB, 49.1°C]

> quit

What to check:

- tok/s is stable across turns (no degradation)
- Memory doesn't grow (peak MB stays roughly constant)
- Temperature stays below 80°C
- No OOM warnings
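For longer sessions these checks are easier to apply to a saved transcript than by eye. The sketch below parses the per-turn footer lines shown in the sample session and compares every turn with the first. The 15% speed-drop and 50 MB growth thresholds are assumptions; only the 80°C limit comes from this guide. Both function names are hypothetical.

```python
import re

# Matches the per-turn footer, e.g. "[12 tokens, 28.4 tok/s, peak 1823 MB, 48.2°C]".
FOOTER = re.compile(r"\[(\d+) tokens, ([\d.]+) tok/s, peak (\d+) MB, ([\d.]+)°C\]")

def parse_footers(log_text):
    """Extract one dict of stats per chat turn."""
    return [
        {"tokens": int(t), "tok_s": float(s), "peak_mb": int(m), "temp_c": float(c)}
        for t, s, m, c in FOOTER.findall(log_text)
    ]

def check_session(turns, max_speed_drop=0.15, max_mem_growth_mb=50, max_temp_c=80.0):
    """Compare every turn against the first one and return a list of problems."""
    if not turns:
        return ["no stat footers found"]
    problems = []
    first = turns[0]
    for i, turn in enumerate(turns, start=1):
        if turn["tok_s"] < first["tok_s"] * (1.0 - max_speed_drop):
            problems.append(f"turn {i}: tok/s fell to {turn['tok_s']}")
        if turn["peak_mb"] - first["peak_mb"] > max_mem_growth_mb:
            problems.append(f"turn {i}: peak memory grew to {turn['peak_mb']} MB")
        if turn["temp_c"] >= max_temp_c:
            problems.append(f"turn {i}: temperature reached {turn['temp_c']} C")
    return problems

log = """
[12 tokens, 28.4 tok/s, peak 1823 MB, 48.2°C]
[15 tokens, 27.1 tok/s, peak 1830 MB, 49.1°C]
"""
turns = parse_footers(log)
print(check_session(turns))  # [] for the two sample turns
```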

Step 7: API Server Test

# Terminal 1: start server
./build/jetson-llm-server -m /opt/models/TinyLlama-1.1B-Q4_K_M.gguf -p 8080

# Terminal 2: test endpoints
# Health check
curl -s http://localhost:8080/health | python3 -m json.tool

# Chat completion
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"What is 2+2?"}]}' \
    | python3 -m json.tool

# Model list
curl -s http://localhost:8080/v1/models | python3 -m json.tool

Expected /health response:

{
  "status": "ok",
  "model": "TinyLlama-1.1B-Chat-v1.0",
  "memory": { "total_mb": 7633, "free_mb": 4200, "model_mb": 669, "kv_mb": 45 },
  "thermal": { "gpu_c": 48.5, "cpu_c": 47.0, "throttling": false },
  "power": { "mode": "25W", "gpu_mhz": 1300 },
  "gpu_util_pct": 75
}
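The /health payload is convenient for scripted monitoring. Below is a minimal sketch that fetches it with the standard library and flags a few conditions. The field names are taken from the expected response above; the thresholds (1000 MB free, 80°C) and both function names are assumptions for illustration.

```python
import json
from urllib.request import urlopen

def fetch_health(base_url="http://localhost:8080"):
    """GET /health from a running jetson-llm-server and decode the JSON body."""
    with urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)

def check_health(health, min_free_mb=1000, max_gpu_c=80.0):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if health.get("status") != "ok":
        problems.append(f"status is {health.get('status')!r}")
    free_mb = health.get("memory", {}).get("free_mb", 0)
    if free_mb < min_free_mb:
        problems.append(f"only {free_mb} MB free")
    thermal = health.get("thermal", {})
    if thermal.get("throttling", False):
        problems.append("GPU is throttling")
    if thermal.get("gpu_c", 0.0) >= max_gpu_c:
        problems.append(f"GPU at {thermal['gpu_c']} C")
    return problems

# The expected response from this guide, used here instead of a live server.
sample = json.loads("""
{"status": "ok", "model": "TinyLlama-1.1B-Chat-v1.0",
 "memory": {"total_mb": 7633, "free_mb": 4200, "model_mb": 669, "kv_mb": 45},
 "thermal": {"gpu_c": 48.5, "cpu_c": 47.0, "throttling": false},
 "power": {"mode": "25W", "gpu_mhz": 1300}, "gpu_util_pct": 75}
""")
print(check_health(sample))  # []
```

Against a live server, replace `sample` with `fetch_health()`.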

Step 8: Benchmark

./scripts/bench.sh /opt/models/TinyLlama-1.1B-Q4_K_M.gguf

Record the numbers and compare across changes:

Metric	Target (TinyLlama 1.1B)	Your result
Prompt eval (512 tok)	>80 tok/s	_____ tok/s
Decode (128 tok)	>30 tok/s	_____ tok/s
Peak memory	<2500 MB	_____ MB
Peak temperature	<75°C	_____ °C
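To make the comparison repeatable across changes, the targets can be kept in a small script and checked automatically. This is a sketch only: the output format of bench.sh is not documented here, so results are entered by hand, and the numbers below are placeholders borrowed from the Step 5 sample output (which used a 5-token prompt, not 512). The metric keys and the `compare` helper are hypothetical.

```python
# Targets for TinyLlama 1.1B from the table above: (direction, limit).
TARGETS = {
    "prompt_tok_s": (">", 80.0),
    "decode_tok_s": (">", 30.0),
    "peak_mem_mb": ("<", 2500.0),
    "peak_temp_c": ("<", 75.0),
}

def compare(results, targets=TARGETS):
    """Return {metric: True/False} saying whether each target is met."""
    report = {}
    for metric, (direction, limit) in targets.items():
        value = results[metric]
        report[metric] = value > limit if direction == ">" else value < limit
    return report

# Placeholder numbers; replace them with your own bench.sh results.
results = {
    "prompt_tok_s": 85.2,
    "decode_tok_s": 28.4,
    "peak_mem_mb": 1823,
    "peak_temp_c": 48.2,
}
report = compare(results)
for metric, ok in report.items():
    print(f"{metric:14s} {results[metric]:>8} {'PASS' if ok else 'FAIL'}")
```

With these placeholder values the decode rate of 28.4 tok/s misses the >30 tok/s target, so that row reports FAIL while the other three pass.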

Step 9: Graduate to Target Model

Once TinyLlama works, test the actual target:

# Download Llama 3.2 3B
wget -O Llama-3.2-3B-Q4_K_M.gguf \
  "https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf"

scp Llama-3.2-3B-Q4_K_M.gguf user@192.168.55.1:/opt/models/

# Run tests
./build/test_model_load /opt/models/Llama-3.2-3B-Q4_K_M.gguf
./build/jetson-llm -m /opt/models/Llama-3.2-3B-Q4_K_M.gguf -p "Hello" -n 32
./scripts/bench.sh /opt/models/Llama-3.2-3B-Q4_K_M.gguf

Metric	Target (Llama 3.2 3B)	Your result
Prompt eval (512 tok)	>40 tok/s	_____ tok/s
Decode (128 tok)	>15 tok/s	_____ tok/s
Peak memory	<4000 MB	_____ MB

Step 10: Stress Test

# Long generation — tests KV cache growth and thermal stability
./build/jetson-llm -m /opt/models/Llama-3.2-3B-Q4_K_M.gguf \
    -p "Write a very long and detailed essay about the history of computing." \
    -n 1024 -v

# Check for:
#   - Memory stable (no growth after initial load)
#   - No OOM guard triggers
#   - No thermal throttling
#   - tok/s consistent throughout

# Overnight stability (production validation); needs the server from Step 7 running
while true; do
    curl -s http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"messages":[{"role":"user","content":"Tell me a joke"}],"max_tokens":64}' \
        | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('jetson',{}).get('decode_tok_s','?'), 'tok/s')"
    sleep 2
done
# Run for 24 hours. Check tegrastats for memory leaks.
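Spotting a slow leak in a 24-hour tegrastats log by eye is error-prone, so a trend line helps. The sketch below pulls the used-RAM figure out of each tegrastats line (the `RAM used/totalMB` field) and fits a least-squares slope over the samples. The sample lines are illustrative, and the helper names and any threshold you apply to the slope are assumptions.

```python
import re

# tegrastats prints memory as "RAM <used>/<total>MB".
RAM = re.compile(r"RAM (\d+)/(\d+)MB")

def ram_used_mb(lines):
    """Collect the used-RAM value (MB) from every line that has one."""
    used = []
    for line in lines:
        match = RAM.search(line)
        if match:
            used.append(int(match.group(1)))
    return used

def slope_per_sample(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

# Illustrative log lines; a real log has many more fields per line.
lines = [
    "RAM 1800/7633MB (lfb 4x4MB) SWAP 0/3816MB",
    "RAM 1810/7633MB (lfb 4x4MB) SWAP 0/3816MB",
    "RAM 1820/7633MB (lfb 4x4MB) SWAP 0/3816MB",
    "RAM 1830/7633MB (lfb 4x4MB) SWAP 0/3816MB",
]
used = ram_used_mb(lines)
growth = slope_per_sample(used)
print(f"{len(used)} samples, growth {growth:.1f} MB per sample")
```

A log for this can be captured with `tegrastats --interval 60000 --logfile soak.log`. A slope that stays near zero over the full run suggests stable memory, while a steady positive slope is worth investigating.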

Quick Reference: All Test Commands

# 1. No model needed
./build/test_memory
./build/test_kernels

# 2. Needs GGUF model
./build/test_model_load model.gguf

# 3. Generation
./build/jetson-llm -m model.gguf -p "Hello" -n 32
./build/jetson-llm -m model.gguf -i

# 4. Server
./build/jetson-llm-server -m model.gguf -p 8080
curl http://localhost:8080/health

# 5. Benchmark + profile
./scripts/bench.sh model.gguf
./scripts/profile.sh model.gguf