Module 5B — LLM Application Development¶
Parent: Phase 3 — Artificial Intelligence · Track B
Ship GenAI products — from prompt engineering to production deployment.
Prerequisites: Module 3B (Agentic AI), Module 4B (ML Engineering).
Role targets: AI Engineer · GenAI Engineer · LLM Application Developer · Full-Stack AI Engineer
Why This Matters for AI Hardware¶
LLM applications are the largest consumer of GPU inference capacity in 2025–2026:

- ChatGPT serves 200M+ weekly users → massive GPU fleet
- Enterprise RAG deployments → GPU-accelerated vector search + LLM inference
- Code assistants → long-context attention, streaming generation

Understanding these patterns helps hardware engineers design chips that serve real demand.
1. Prompt Engineering (Advanced)¶
- System prompts: persona, constraints, output format specification
- Few-shot learning: example selection, dynamic few-shot, chain-of-thought
- Structured output: JSON mode, function calling, tool use schemas
- Prompt optimization: iterative refinement, A/B testing prompts, automated evaluation
- Long-context strategies: context window management, chunking, summarization chains
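The dynamic few-shot idea above can be sketched in plain Python: pick the examples most similar to the incoming query and splice them into a system prompt. The `EXAMPLES` data and the lexical-overlap similarity are toy assumptions for illustration; production systems typically rank examples by embedding similarity instead.

```python
# Sketch: dynamic few-shot selection by naive lexical overlap.
# EXAMPLES and the similarity heuristic are illustrative assumptions.

EXAMPLES = [
    {"q": "What is the GPU memory of an H100?", "a": "80 GB HBM3."},
    {"q": "What interconnect links GPUs in a DGX node?", "a": "NVLink."},
    {"q": "Which format does TensorRT optimize?", "a": "ONNX, among others."},
]

def lexical_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens (toy stand-in for embeddings)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def build_prompt(query: str, k: int = 2) -> str:
    """Assemble a prompt with the k most similar few-shot examples."""
    ranked = sorted(EXAMPLES, key=lambda e: lexical_overlap(query, e["q"]),
                    reverse=True)
    shots = "\n".join(f"Q: {e['q']}\nA: {e['a']}" for e in ranked[:k])
    return ("You are a concise hardware QA assistant. Answer in one sentence.\n\n"
            f"{shots}\n\nQ: {query}\nA:")

prompt = build_prompt("What memory does the H100 GPU use?")
print(prompt)
```

The same skeleton extends naturally to A/B testing: keep two prompt templates, log which one produced each response, and compare downstream quality scores.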
2. Fine-Tuning LLMs¶
- When to fine-tune vs prompt engineering vs RAG
- LoRA / QLoRA: parameter-efficient fine-tuning, adapter merging
- Full fine-tuning: when you need maximum quality and have enough data/compute
- Data preparation: instruction formatting, chat templates, quality filtering
- Evaluation: perplexity, task-specific metrics, human eval, LLM-as-judge
Projects:

1. Fine-tune Llama-3-8B with QLoRA on a domain-specific Q&A dataset. Evaluate against the base model.
2. Merge LoRA adapters and export to ONNX for deployment.
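The adapter-merging step in project 2 is just arithmetic on weight matrices: LoRA learns a low-rank update, and merging bakes it into the base weight as W' = W + (α/r)·B·A. A minimal pure-Python sketch with toy shapes and values (all assumptions, chosen for illustration):

```python
# LoRA update sketch: W' = W + (alpha/r) * B @ A,
# where A is r x d_in and B is d_out x r with r << min(d_in, d_out).
# Toy dimensions and constant adapter values, purely for illustration.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d_out, d_in, r, alpha = 4, 4, 2, 4
W = [[1.0 if i == j else 0.0 for j in range(d_in)]  # base weight (frozen)
     for i in range(d_out)]
A = [[0.1] * d_in for _ in range(r)]                # down-projection (r x d_in)
B = [[0.5] * r for _ in range(d_out)]               # up-projection (d_out x r)

scale = alpha / r
delta = matmul(B, A)                                # low-rank update B @ A
W_merged = [[W[i][j] + scale * delta[i][j] for j in range(d_in)]
            for i in range(d_out)]

# Only r * (d_in + d_out) adapter parameters are trained,
# versus d_in * d_out for full fine-tuning.
print(W_merged[0])
```

This is why merged adapters add zero inference cost: after the merge, the deployed model has exactly the same shape as the base model, so it exports to ONNX like any other checkpoint.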
3. Production RAG Architecture¶
- Advanced retrieval: hybrid search (dense + BM25), re-ranking (cross-encoder), query expansion
- Chunking optimization: recursive splitting, semantic chunking, parent-child retrieval
- Multi-modal RAG: images + text, document layout understanding
- Evaluation framework: RAGAS, context precision/recall, faithfulness scoring
- Scaling: distributed vector stores, caching, embedding batch processing
Projects:

1. Build a production RAG system with hybrid retrieval + re-ranking. Evaluate with RAGAS.
2. Add citation tracking — every generated claim linked to source chunks.
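One common way to combine the dense and BM25 rankings from project 1 is reciprocal rank fusion (RRF). The document IDs and rankings below are toy data; the fusion function itself is the standard RRF formula:

```python
# Sketch: reciprocal rank fusion (RRF) for hybrid search.
# Rankings are toy data standing in for vector-store and BM25 results.

def rrf(rankings, k: int = 60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc_a", "doc_b", "doc_c"]   # e.g. from a vector store
bm25_ranking  = ["doc_b", "doc_d", "doc_a"]   # e.g. from a keyword index

fused = rrf([dense_ranking, bm25_ranking])
print(fused)  # doc_b ranks first: it appears near the top of both lists
```

In a full pipeline, a cross-encoder re-ranker would then rescore the fused top-k before the chunks reach the LLM; RRF's job is only to produce a cheap, robust candidate set.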
4. Production Deployment¶
- API design: streaming responses, structured output, error handling
- Scaling patterns: load balancing, auto-scaling, GPU right-sizing
- Cost optimization: caching (semantic cache, exact cache), model routing (small → large), prompt compression
- Observability: token usage tracking, latency monitoring, quality scoring
- Safety: content filtering, PII detection, output validation, rate limiting
Projects:

1. Deploy a RAG application with streaming, caching, and cost tracking. Measure tokens/$ efficiency.
2. Implement semantic caching — cache similar queries to reduce GPU inference calls by 30%+.
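The semantic cache in project 2 reduces to a nearest-neighbor lookup over query embeddings: if a cached query is similar enough, return its answer and skip the LLM call. A minimal sketch, assuming a toy bag-of-words `embed` function (real systems use an embedding model) and a hypothetical fixed vocabulary:

```python
# Sketch of a semantic cache. `embed` is a toy bag-of-words stand-in
# over a tiny hypothetical vocabulary; everything here is illustrative.
import math

def embed(text: str):
    vocab = ["gpu", "memory", "h100", "price", "bandwidth"]
    tokens = text.lower().split()
    return [float(tokens.count(w)) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, query: str):
        q = embed(query)
        for vec, answer in self.entries:
            if cosine(q, vec) >= self.threshold:
                return answer  # cache hit: skip the GPU inference call
        return None

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("h100 gpu memory", "80 GB HBM3")
print(cache.get("memory gpu h100"))  # paraphrase hits the cached entry
```

The key tuning knob is the similarity threshold: too low and users get stale or wrong answers for genuinely different questions; too high and the hit rate (and GPU savings) collapses. At scale the linear scan is replaced by an approximate nearest-neighbor index.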
Connection to Hardware¶
| Application pattern | Hardware implication |
|---|---|
| Long-context attention (128K tokens) | HBM bandwidth, KV-cache memory |
| Streaming token generation | Low-latency kernel scheduling |
| Batch inference serving | In-flight batching, GPU utilization |
| Vector search (RAG retrieval) | cuVS / FAISS on GPU |
| Multi-model routing | Multi-GPU scheduling, MIG partitioning |
Resources¶
| Resource | What it covers |
|---|---|
| Anthropic API Documentation | Claude API, tool use, streaming |
| OpenAI Cookbook | GPT API patterns and best practices |
| LlamaIndex | RAG framework |
| RAGAS | RAG evaluation framework |
| Building LLM Applications (various) | End-to-end LLM app development |