jetson-llm¶
Memory-first LLM inference runtime for NVIDIA Jetson Orin.
Target hardware: Orin Nano Super 8 GB (SM 8.7, 102 GB/s, 67 TOPS) Not supported: x86, discrete GPUs, Windows, macOS — Jetson only.
🚀 Production implementation: GeniePod/genie-ai-runtime v1.0.0¶
The working, hardware-validated runtime — under active development as part of the GeniePod local home-AI stack — lives at:
👉 GeniePod/genie-ai-runtime · release v1.0.0¶
This folder is the original framework / scaffold from which the production implementation was forked. It documents the initial architecture and the first-tokens bring-up plan. The production repo is where every subsequent kernel optimization, persistent-KV work, and production-hardening step happened.
v1.0.0 verified on Jetson Orin Nano Super 8 GB (Qwen3-4B-Q4_K_M, 25 W MAXN SUPER)¶
| Workload | Number |
|---|---|
| Prefill (33-tok cold) | 38.0 tok/s |
| Decode | 9.9 tok/s |
| Cold TTFT | 877 ms |
| Warm-turn TTFT (persistent KV, 67 % prefix) | 444 ms |
| KV pool memory @ 1024 ctx | 74 MB (INT8 default) |
| Model load — cold (NVMe) | 30 s at 79 MB/s |
| Model load — warm (pagecache) | 1.3 s at 1.8 GB/s |
vs llama-bench pp18 = 17.97 ± 0.65 tok/s |
+115 % prefill |
Output stays sensibly-identical to FP16 reference across the entire optimization path (FP16-ULP-bounded drift from tensor-core mma.sync; INT8-precision-floor drift from per-(layer, pos, kv_head) KV quantization).
What the production repo adds beyond this scaffold¶
- Paths A → I (numbered optimization umbrellas), each with phase plans, per-PR verified-result tables on Jetson, and rollback discipline. See
ROADMAP.mdin the production repo for the full narrative. - Tensor-core MMQ Q4_K prefill GEMM (Path E) — multi-warp cooperative dequant,
mma.sync.aligned.m16n8k16on SM 8.7. +147 % prefill over the scalar path. - uint32-load decode GEMV (Path C) — quad-byte coalesced weight loads + residual-fused output. +21 % decode.
- Persistent KV cache (Path F) — per-conversation save / hydrate with longest-prefix match, model fingerprint validation, LRU eviction at 1 GB cap. Warm-turn TTFT drops 48 %.
- INT8 KV cache (Path I, default) — per-(layer, pos, kv_head) absmax-scaled. 144 MB → 74 MB at 1024 ctx, sensibly-identical output.
- OpenAI-compatible HTTP server — built on cpp-httplib + nlohmann/json (the same libs llama.cpp's server uses), SSE streaming, Qwen3 reasoning split into
reasoning_content, systemd unit + installer. Opt-in build (-DJLLM_BUILD_SERVER=ON); the engine ships as an embeddable library by default. - Production hardening —
--versionflag, cold/warm load timing, OOM-guard prevents crashes, stability soak harness for 1000+ token × 100 iteration runs.
Patterns that worked¶
Three patterns from the alpha-track that travel well to other inference-engine projects (worth stealing if you're doing similar work):
- Path-based umbrella issues with staged phases. Every numbered Path = one GitHub umbrella issue with a phase table, risks named up front, and small per-phase PRs that each posted a verified-result comment on Jetson before merge. Made negative results (Path G's no-op attempts, Path C's split-K dead end, Path E's E2 microbenchmark) easy to absorb and roll back.
- "Sensibly-identical, not byte-identical" as the quality bar. Once
mma.syncreordered float accumulation, byte equality was off the table. Defining "FP16-ULP-bounded drift, character-equivalent on the canonical prompt" unlocked the whole tensor-core path. Same compromise enabled INT8 KV. - Honest perf re-baselines. Every release re-measured the prior release's number same-day with the same prompt and machine state. We never compared cross-day numbers — environmental noise had bitten us once and we baked the discipline in after.
Why this matters (and why generic runtimes don't fit)¶
Existing runtimes are not designed for 8 GB unified memory shared with voice STT, TTS, denoise, and a Home Assistant container:
- llama.cpp — portable but generic CUDA kernels, no Jetson memory awareness. The runtime the production repo aims to replace inside GenieClaw.
- TensorRT-LLM — fast but datacenter-shaped (A100/H100), too heavy for Orin Nano's iGPU budget.
- jetson-llm /
genie-ai-runtime— memory-first, power-aware, Orin SM 8.7-tuned CUDA kernels, pre-allocated KV/scratch pools, single binary, single GGUF, single shared-memory budget that fits alongsidewhisper-serverandgenie-core.
Architecture (as originally scaffolded — see production repo for current state)¶
┌──────────────────────────────────────────────────────┐
│ jetson-llm │
│ │
│ Serving Layer (OpenAI-compatible REST API) │
│ POST /v1/chat/completions | GET /health │
│ │
│ Engine (GGUF load → prefill → decode → sample) │
│ transformer_layer() × N_layers per token │
│ Memory guard + thermal check per token │
│ │
│ CUDA Kernels (SM 8.7 tuned) │
│ gemv_q4 | fused_rmsnorm | flash_attn | rope │
│ swiglu | softmax | fp16↔int8 │
│ │
│ Memory Manager │
│ MemoryBudget | OOMGuard | KVCachePool | ScratchPool│
│ │
│ Jetson HAL │
│ PowerState | ThermalState | LiveStats | JetsonInfo │
└──────────────────────────────────────────────────────┘
The production repo extends this with:
- Tensor-core MMQ Q4_K prefill GEMM (a new kernel that didn't exist in the scaffold)
- Persistent KV cache module (src/persistence/)
- cpp-httplib + nlohmann/json-based server (replacing the original raw-sockets server)
- scripts/soak.sh and scripts/bench_load.sh for stability + load-time validation
How to use this folder¶
| You want to… | Go here |
|---|---|
| Run, build, or contribute to the actual implementation | GeniePod/genie-ai-runtime |
| Read v1.0.0 release notes | Release v1.0.0 |
| Read the alpha-track narrative (Paths A→I, what we tried, what failed) | ROADMAP.md (production repo) |
| Read per-release verified numbers | CHANGELOG.md (production repo) |
| Read the HTTP server reference | docs/server.md (production repo) |
| Read the original first-tokens roadmap (this folder's planning doc) | ROADMAP.md |
| Read the original test plan | TESTING.md |
| See the original scaffolded modules | src/, include/, tests/, scripts/ |
Original scaffold (preserved as the starting framework)¶
The rest of this README documents the framework as it was originally scaffolded for the AI Hardware Engineer Roadmap exercise. Everything below describes the initial module layout, completed components, and bug-tracking notes from the pre-v0.1 bring-up — preserved here as historical context for the roadmap. The production repo has moved well past it.
Completed (as of initial scaffold, ✅)¶
| Component | Files | Lines | Status |
|---|---|---|---|
| Memory Manager | budget.cpp, kv_cache.cpp, pool.cpp | ~300 | ✅ Implemented, tested |
| Jetson HAL | power.cpp, thermal.cpp, sysinfo.cpp | ~250 | ✅ Reads sysfs, tested |
| CUDA Kernels | 6 .cu files | ~500 | ✅ Implemented, 5 correctness tests pass |
| GGUF Config Parser | model.cpp (load_gguf_config) | ~80 | ✅ Reads model architecture |
| GGUF Tensor Parser | model.cpp (parse_tensor_infos) | ~150 | ✅ Parses tensor name/shape/offset |
| Weight Mapping | model.cpp (load_and_map_weights) | ~80 | ✅ Maps tensor names → struct pointers |
| Tokenizer | tokenizer.cpp | ~160 | ✅ Reads GGUF vocab, encode/decode |
| Sampling | sample.cpp | ~120 | ✅ Top-k, top-p, temperature, repeat penalty |
| Transformer Forward | decode.cpp (transformer_layer) | ~100 | ✅ Wires all 12 ops per layer |
| Decode Loop | decode.cpp (generate) | ~80 | ✅ Prefill + decode + streaming |
| CLI | main.cpp | ~120 | ✅ Interactive + single prompt + OOM pre-check |
| HTTP Server | http_server.cpp, main_server.cpp | ~250 | ✅ /health, /v1/chat/completions, /v1/models |
| Scripts | setup, bench, profile | ~180 | ✅ First-time setup, benchmark, nsys profiling |
| Tests | test_memory, test_kernels, test_model_load | ~280 | ✅ Memory, 5 kernel tests, 8 model load tests |
| Total | 30 files | ~4,200 | (scaffold; see production repo for current ~10 k LOC) |
Initial bugs that were fixed during bring-up (✅, historical)¶
| # | Bug | Fix applied |
|---|---|---|
| 1 | GGUF KV skip miscalculated offsets | Rewrote with exact GGUF type sizes via gguf_scalar_size() helper; arrays of scalars skip in one fseek |
| 2 | Residual connection not chained | Added vec_add() kernel: x2 = x + attn_proj, then x = x2 + ffn_out |
| 3 | Embedding memcpy wrong direction | Changed to cudaMemcpyDefault (works for both host mmap and device memory) |
| 4 | Missing <sys/mman.h> include |
Added #include <sys/mman.h> |
| 5 | CUDA graph body empty | Implemented full graph capture: all transformer layers + final norm + logit projection |
| 6 | Attention accumulator acc[d%4] |
Replaced with per-dimension s_out[head_dim] in shared memory |
| 7 | FP16 logits, no FP32 | Added fp16_to_fp32() GPU kernel; convert on device before D2H copy |
| 8 | Tokenizer O(V×L) scan | Added token_to_id_ hash map + max_token_len_ for O(max_len) longest-match |
The Path A → I work in the production repo addresses an entirely different class of issues (kernel architecture, memory layout, correctness/perf tradeoffs at scale).
Original v0.1 → v0.4 milestone roadmap (✅ delivered + extended)¶
v0.1 — First Tokens (DONE → see alpha.2 in production repo)
✅ All 8 bring-up bugs fixed
✅ Build on Jetson (cmake + make)
✅ test_model_load passes with Qwen3-4B Q4_K_M
✅ Generate coherent text
v0.2 — Benchmark Baseline (DONE → alpha.2 baseline established)
✅ bench.sh produces tok/s numbers
✅ Compared against llama-bench (17.97 tok/s pp18 baseline)
v0.3 — Performance Target (DONE → exceeded in alpha.8)
✅ >20% faster than stock llama.cpp on decode (decode: parity; prefill: +115%)
✅ CUDA graph replay verified working
□ Memory-stable over 1000+ tokens (soak harness shipped, full run pending — issue #4)
v0.4 — Production Ready (DONE → v1.0.0)
✅ Chat template support + SSE streaming
✅ Multi-turn conversation (Path F persistent KV)
✅ Documented performance table across releases (CHANGELOG.md alpha.2 → v1.0.0)
□ 24-hour stability test (issue #7, scheduled post-v1.0)
License¶
MIT — same as the production repo, on purpose. The runtime is infrastructure that other projects should be able to embed cheaply.