Skip to content

jetson-llm Roadmap

Current State: v0.1-alpha (code-complete, needs hardware testing)

4,500+ lines | 31 files | 0 known bugs | builds clean on Jetson Orin Nano Super

v0.1 — First Tokens (Target: Week 1)

Goal: generate coherent text from a real GGUF model on Jetson hardware.

□ Build on Jetson (cmake + nvcc SM 8.7)
□ Run test_memory — verify budget reads /proc/meminfo correctly
□ Run test_kernels — all 5 kernel correctness tests pass
□ Download TinyLlama 1.1B Q4_K_M (669 MB)
□ Run test_model_load — config, tokenizer, weight mapping all pass
□ Run jetson-llm -m tinyllama.gguf -p "What is 2+2?" -n 32
□ Output is coherent English (not garbage/random tokens)
□ No segfaults, no OOM, no CUDA errors

If output is garbage: debug tensor offset mapping (model.cpp parse_tensor_infos). Use Python gguf library to verify tensor names match our patterns.

If segfault: check null weight pointers in test_model_load output. Run with cuda-memcheck.

Deliverable: screenshot of first coherent generation on Jetson.


v0.2 — Benchmark Baseline (Target: Week 2)

Goal: measure performance and compare against llama.cpp.

□ Install llama.cpp on same Jetson for baseline comparison
    git clone https://github.com/ggerganov/llama.cpp
    cmake -B build -DGGML_CUDA=ON && cmake --build build -j$(nproc)

□ Run bench.sh with TinyLlama 1.1B — record:
    Prompt eval: ___ tok/s
    Decode:      ___ tok/s
    Peak memory: ___ MB
    Peak temp:   ___ °C

□ Run same model on llama.cpp for comparison:
    ./llama-bench -m tinyllama.gguf -ngl 99

□ Run profile.sh — identify top 3 kernel bottlenecks:
    #1: _____ (___% of time)
    #2: _____ (___% of time)
    #3: _____ (___% of time)

□ Graduate to Llama 3.2 3B Q4_K_M — repeat benchmarks
□ Test context lengths: 512, 1024, 2048 — record tok/s at each
□ Test power modes: 7W, 15W, 25W — record tok/s and tokens/joule

Deliverable: performance comparison table (jetson-llm vs llama.cpp).


v0.3 — Kernel Optimization (Target: Week 3–4)

Goal: >20% faster decode than stock llama.cpp on Llama 3.2 3B.

Based on profiling results from v0.2:

□ Optimize #1 bottleneck kernel (likely gemv_q4):
    □ Tune thread block size (try 64, 128, 256)
    □ Tune elements per thread (try 4, 8, 16)
    □ Add vectorized loads (float4 / int4)
    □ Profile: register count, occupancy, memory throughput

□ Optimize #2 bottleneck kernel (likely attention):
    □ Tune ATTN_TILE_KV (try 32, 64, 128)
    □ Test INT8 vs FP16 KV cache performance difference
    □ Profile shared memory utilization

□ Optimize #3 bottleneck kernel (likely fused_norm):
    □ Verify fusion is working (compare 1-kernel vs 3-kernel time)
    □ Try different block sizes for different hidden_dim

□ Enable CUDA graphs for decode loop:
    □ Verify graph capture works (no host-side ops inside capture)
    □ Measure launch overhead reduction (before/after)

□ Re-run bench.sh — measure improvement:
    Before: ___ tok/s (decode)
    After:  ___ tok/s (decode)
    Speedup: ___×

Deliverable: >20% decode speedup over v0.2, profiling evidence.


v0.4 — Memory Stability (Target: Week 5)

Goal: stable operation for 1000+ tokens with no memory growth.

□ Generate 1000 tokens continuously — monitor memory:
    tegrastats --interval 1000 | tee stability.log &
    ./jetson-llm -m model.gguf -p "Write a long essay..." -n 1000

□ Verify no memory growth:
    Plot RAM usage over time from stability.log
    Delta between token 10 and token 1000 should be < 5 MB

□ Test KV cache eviction:
    Set context limit to 512, generate 1000 tokens
    Verify overflow pool works (eviction messages in stderr)

□ Test OOM guard:
    Load Llama 3.3 8B Q4_K_M (4.6 GB — tight fit)
    Verify OOM guard stops gracefully (no crash, no OOM killer)
    Verify message: "[oom_guard] Stopping at token N"

□ Test thermal stability:
    Run 30 minutes continuous generation at 25W
    Monitor temperature — should stay below 85°C with active cooling
    Verify thermal backoff activates if temp exceeds 80°C

□ Stress test: rapid start/stop:
    for i in {1..100}; do
      ./jetson-llm -m model.gguf -p "Hi" -n 10 2>/dev/null
    done
    Verify no memory leak (free -m before and after)

Deliverable: stability report — memory graph, thermal graph, OOM test results.


v0.5 — Server + Streaming (Target: Week 6)

Goal: production-ready HTTP API with streaming.

□ Implement streaming SSE in http_server.cpp:
    POST /v1/chat/completions with "stream": true
    Returns: data: {"choices":[{"delta":{"content":"token"}}]}\n\n
    Final: data: [DONE]\n\n

□ Add chat template formatting:
    Llama: <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n{prompt}<|eot_id|>
    Generic: <|user|>\n{prompt}\n<|assistant|>\n

□ Add request timeout (60 second default)

□ Test with real clients:
    □ curl with streaming: curl -N http://jetson:8080/v1/chat/completions ...
    □ Python OpenAI SDK: client.chat.completions.create(stream=True)
    □ Browser fetch with ReadableStream

□ Add systemd service file:
    /etc/systemd/system/jetson-llm.service
    Restart=always, After=network.target
    EnvironmentFile for model path and port

□ Auto-start on boot, auto-restart on crash

□ Health endpoint enhanced:
    Add: uptime, total_requests, avg_tok_s, kv_cache_tokens_used

Deliverable: streaming API working with OpenAI SDK, systemd service running.


v0.6 — Multi-Model Support (Target: Week 7–8)

Goal: test and document performance across all Tier 1–2 models.

□ Test each model, record in performance table:

    Model              | Q4_K_M | Decode tok/s | Prompt tok/s | Peak RAM | Max ctx
    ────────────────────────────────────────────────────────────────────────────
    TinyLlama 1.1B     | 669 MB |    ___       |    ___       |   ___    |  ___
    Llama 3.2 1B       | 750 MB |    ___       |    ___       |   ___    |  ___
    Gemma 2 2B          | 1.5 GB |    ___       |    ___       |   ___    |  ___
    Qwen 3 1.7B        | 1.0 GB |    ___       |    ___       |   ___    |  ___
    Llama 3.2 3B       | 1.8 GB |    ___       |    ___       |   ___    |  ___
    Phi-4 Mini 3.8B    | 2.3 GB |    ___       |    ___       |   ___    |  ___
    Gemma 3 4B          | 2.5 GB |    ___       |    ___       |   ___    |  ___
    Llama 3.3 8B       | 4.6 GB |    ___       |    ___       |   ___    |  ___

□ Verify tokenizer works for each model family:
    □ Llama tokenizer (BPE)
    □ Gemma tokenizer (SentencePiece)
    □ Qwen tokenizer (tiktoken-style)
    □ Phi tokenizer

□ Document any model-specific issues:
    □ Different tensor name patterns?
    □ Different RoPE theta values?
    □ GQA group sizes?

□ Update README with tested models + performance table

Deliverable: performance table with 8+ models, all verified working.


v0.7 — Speculative Decoding (Target: Week 9–10)

Goal: 1.5–2× faster decode using draft model.

□ Implement speculative decode:
    Draft model: TinyLlama 1.1B (fast, ~65 tok/s)
    Target model: Llama 3.2 3B (slow, ~25 tok/s)

    Algorithm:
    1. Draft generates N candidate tokens (N=4–8)
    2. Target verifies all N in one forward pass
    3. Accept matching tokens, reject from first mismatch
    4. Expected: accept 60–80% → 1.5–2× speedup

□ Memory budget for both models:
    Draft:  ~670 MB (TinyLlama 1.1B Q4_K_M)
    Target: ~1.8 GB (Llama 3.2 3B Q4_K_M)
    KV caches × 2: ~200 MB
    Total:  ~2.7 GB → fits in 5.5 GB available ✓

□ Implement draft-verify loop in engine
□ Share tokenizer between draft and target
□ Measure acceptance rate at draft lengths 2, 4, 6, 8
□ Measure effective tok/s vs single-model decode
□ Add --draft-model CLI flag

Deliverable: speculative decoding working, measured speedup.


v0.8 — Multi-Turn Conversation (Target: Week 11)

Goal: KV cache persistence across turns.

□ Implement conversation state:
    Keep KV cache between turns (don't re-prefill system prompt)
    Only prefill new user message each turn

□ System prompt support:
    Pre-load system prompt into KV cache at startup
    New turns only process user message + generate response

□ Context window management:
    When KV cache exceeds limit:
    Option A: truncate oldest messages (sliding window)
    Option B: summarize old context (future)

□ Conversation API:
    POST /v1/chat/completions with messages array
    Server maintains conversation_id → KV cache state

□ Test: 10-turn conversation
    Verify later turns reference earlier context
    Verify memory stable across turns
    Measure: time-to-first-token for turn 1 vs turn 10

Deliverable: multi-turn chat working, KV reuse verified.


v1.0 — Production Release (Target: Week 12)

Goal: stable, documented, deployable.

□ 24-hour stability test:
    Continuous requests every 5 seconds for 24 hours
    No OOM, no crash, no thermal shutdown
    Memory delta < 10 MB over 24 hours

□ Documentation complete:
    □ README: features, quickstart, performance table
    □ docs/: all 10 documents up to date
    □ TESTING.md: reflects actual test results
    □ CHANGELOG.md: all versions documented

□ Packaging:
    □ Single tar.gz with binary + scripts + docs
    □ Or: Dockerfile for Jetson (l4t-base image)

□ Release:
    □ Tag v1.0.0
    □ GitHub release with binary + docs
    □ Blog post: "Building a Memory-First LLM Runtime for Jetson"

Deliverable: tagged v1.0.0 release, 24-hour stability proven.


Future (Post v1.0)

Feature Description Effort
INT4 KV cache Halve KV memory → 2× more context 1 week
Tensor Core prefill WMMA for batch matmul during prefill 2 weeks
Vision-language models Gemma 3 4B image+text 2 weeks
DLA offload Run some layers on DLA, free GPU for others 2 weeks
Multiple concurrent models Hot-swap models without restart 1 week
WebSocket API Real-time bidirectional streaming 1 week
ONNX Runtime fallback Support non-GGUF models 2 weeks
Jetson Orin NX 16 GB Extend to larger Jetson for 7B models 1 week
Benchmark dashboard Live Grafana/Prometheus metrics 1 week
Model quantization on-device Quantize FP16→INT4 directly on Jetson 2 weeks

Timeline Summary

Week 1:   v0.1  First Tokens         □ build → test → first coherent output
Week 2:   v0.2  Benchmark            □ measure → profile → compare vs llama.cpp
Week 3-4: v0.3  Kernel Optimization  □ tune top 3 kernels → >20% speedup
Week 5:   v0.4  Memory Stability     □ 1000 tokens stable → OOM guard → thermal
Week 6:   v0.5  Server + Streaming   □ SSE → chat templates → systemd
Week 7-8: v0.6  Multi-Model          □ test 8+ models → performance table
Week 9-10:v0.7  Speculative Decode   □ draft+target → 1.5-2× speedup
Week 11:  v0.8  Multi-Turn           □ KV persistence → conversation state
Week 12:  v1.0  Production Release   □ 24-hour test → package → release