Lecture 12 — Production Deployment¶
Track B · Agentic AI & GenAI | ← Lecture 11 | Next → Lab 01
Learning Objectives¶
By the end of this lecture you will be able to:
- Wrap an AI agent in a production-grade FastAPI endpoint.
- Stream tokens to the client using Server-Sent Events (SSE).
- Implement semantic caching with Redis and embeddings to cut repeat-query cost.
- Route requests to fast/cheap or powerful models based on query complexity.
- Apply rate limiting and exponential-backoff retry to third-party API calls.
- Filter outputs for safety: content moderation and PII detection.
- Add health-check endpoints and handle graceful shutdown.
1. FastAPI Wrapper for an Agent Endpoint¶
# pip install fastapi uvicorn pydantic openai python-dotenv
from __future__ import annotations
import os
import time
import uuid
from contextlib import asynccontextmanager
from typing import AsyncGenerator
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import JSONResponse
from pydantic import BaseModel, Field
from openai import AsyncOpenAI
# ── Models ────────────────────────────────────────────────────────────────────
class ChatRequest(BaseModel):
message: str = Field(..., min_length=1, max_length=4096)
session_id: str = Field(default_factory=lambda: uuid.uuid4().hex)
model: str = Field(default="gpt-4o-mini")
class ChatResponse(BaseModel):
session_id: str
answer: str
model_used: str
latency_ms: float
tokens_used: int
# ── Application state ─────────────────────────────────────────────────────────
class AppState:
client: AsyncOpenAI | None = None
ready: bool = False
state = AppState()
@asynccontextmanager
async def lifespan(app: FastAPI):
# Startup
state.client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY", "sk-fake"))
state.ready = True
print("Agent service started")
yield
# Shutdown
state.ready = False
print("Agent service shutting down gracefully")
app = FastAPI(title="AI Agent API", version="1.0.0", lifespan=lifespan)
# ── Health checks ─────────────────────────────────────────────────────────────
@app.get("/health")
async def health():
return {"status": "ok" if state.ready else "starting", "timestamp": time.time()}
@app.get("/ready")
async def readiness():
if not state.ready:
raise HTTPException(status_code=503, detail="Service not ready")
return {"status": "ready"}
# ── Chat endpoint ─────────────────────────────────────────────────────────────
SYSTEM_PROMPT = (
"You are a helpful AI assistant specializing in hardware engineering. "
"Answer questions clearly and concisely."
)
@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest):
start = time.perf_counter()
response = await state.client.chat.completions.create(
model=req.model,
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": req.message},
],
temperature=0.2,
)
answer = response.choices[0].message.content
latency_ms = (time.perf_counter() - start) * 1000
return ChatResponse(
session_id=req.session_id,
answer=answer,
model_used=req.model,
latency_ms=round(latency_ms, 1),
tokens_used=response.usage.total_tokens,
)
Run with:
2. Streaming with Server-Sent Events (SSE)¶
SSE lets the client receive tokens as they are generated, making long responses feel interactive.
# pip install sse-starlette
from sse_starlette.sse import EventSourceResponse
from fastapi import FastAPI
from pydantic import BaseModel
import asyncio
class StreamRequest(BaseModel):
message: str
session_id: str = ""
@app.post("/chat/stream")
async def chat_stream(req: StreamRequest):
async def token_generator() -> AsyncGenerator[dict, None]:
try:
stream = await state.client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": req.message},
],
stream=True,
temperature=0.2,
)
async for chunk in stream:
delta = chunk.choices[0].delta
if delta.content:
yield {"event": "token", "data": delta.content}
yield {"event": "done", "data": "[DONE]"}
except Exception as e:
yield {"event": "error", "data": str(e)}
return EventSourceResponse(token_generator())
Client-side JavaScript (for reference):
const evtSource = new EventSource("/chat/stream?message=What+is+HBM3");
evtSource.addEventListener("token", (e) => process.stdout.write(e.data));
evtSource.addEventListener("done", () => evtSource.close());
Python client:
import httpx
with httpx.stream("POST", "http://localhost:8000/chat/stream",
json={"message": "Explain HBM3"}) as response:
for line in response.iter_lines():
if line.startswith("data:"):
token = line[5:].strip()
if token != "[DONE]":
print(token, end="", flush=True)
print()
3. Semantic Caching with Redis + Embeddings¶
Semantic caching stores (query_embedding → answer) pairs. On a new query, if it is within a similarity threshold of a cached query, return the cached answer without calling the LLM.
# pip install redis sentence-transformers numpy
import json
import time
import numpy as np
import redis
from sentence_transformers import SentenceTransformer
class SemanticCache:
"""
Redis-backed semantic cache.
Keys: 'cache:emb:{id}' (vector as JSON) and 'cache:ans:{id}' (answer string).
"""
def __init__(
self,
redis_url: str = "redis://localhost:6379",
model_name: str = "all-MiniLM-L6-v2",
similarity_threshold: float = 0.95,
ttl_seconds: int = 3600,
):
self.redis = redis.from_url(redis_url)
self.model = SentenceTransformer(model_name)
self.threshold = similarity_threshold
self.ttl = ttl_seconds
self._index_key = "cache:index" # list of all cache entry ids
def _embed(self, text: str) -> np.ndarray:
return self.model.encode(text, normalize_embeddings=True)
def get(self, query: str) -> str | None:
"""Return cached answer if a sufficiently similar query exists."""
q_vec = self._embed(query)
ids = self.redis.lrange(self._index_key, 0, -1)
best_sim = -1.0
best_id = None
for id_bytes in ids:
cache_id = id_bytes.decode()
emb_json = self.redis.get(f"cache:emb:{cache_id}")
if not emb_json:
continue
cached_vec = np.array(json.loads(emb_json))
sim = float(np.dot(q_vec, cached_vec))
if sim > best_sim:
best_sim = sim
best_id = cache_id
if best_sim >= self.threshold and best_id:
print(f"[Cache HIT] similarity={best_sim:.4f}")
answer = self.redis.get(f"cache:ans:{best_id}")
return answer.decode() if answer else None
print(f"[Cache MISS] best_similarity={best_sim:.4f}")
return None
def set(self, query: str, answer: str):
"""Store a new cache entry."""
cache_id = f"{time.time():.0f}_{hash(query) % 100000}"
q_vec = self._embed(query)
self.redis.setex(f"cache:emb:{cache_id}", self.ttl, json.dumps(q_vec.tolist()))
self.redis.setex(f"cache:ans:{cache_id}", self.ttl, answer)
self.redis.rpush(self._index_key, cache_id)
print(f"[Cache SET] id={cache_id}")
def stats(self) -> dict:
n = self.redis.llen(self._index_key)
return {"cached_entries": n, "ttl_seconds": self.ttl, "threshold": self.threshold}
# Integrate with the FastAPI endpoint
cache = SemanticCache(similarity_threshold=0.95)
async def cached_chat(message: str) -> tuple[str, bool]:
"""Return (answer, from_cache)."""
cached = cache.get(message)
if cached:
return cached, True
response = await state.client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": message}],
temperature=0,
)
answer = response.choices[0].message.content
cache.set(message, answer)
return answer, False
4. Model Routing¶
Route cheap/simple queries to a fast model and complex queries to a more capable one.
import re
# Routing rules
FAST_MODEL = "gpt-4o-mini" # cheap, fast
SMART_MODEL = "gpt-4o" # expensive, powerful
COMPLEXITY_SIGNALS = [
r"\bcompare\b", r"\bdifference\b", r"\bwhy\b", r"\bexplain\b",
r"\banalyze\b", r"\bimplement\b", r"\bdesign\b", r"\barchitecture\b",
r"\btradeoff\b", r"\bpros and cons\b",
]
def estimate_complexity(message: str) -> str:
"""
Returns 'simple' or 'complex' based on heuristics.
For production, replace with a small classifier LLM call.
"""
msg_lower = message.lower()
word_count = len(message.split())
# Rule 1: very short messages are simple
if word_count <= 5:
return "simple"
# Rule 2: complexity signal keywords
for pattern in COMPLEXITY_SIGNALS:
if re.search(pattern, msg_lower):
return "complex"
# Rule 3: long messages tend to be complex
if word_count > 50:
return "complex"
return "simple"
def select_model(message: str) -> str:
complexity = estimate_complexity(message)
model = FAST_MODEL if complexity == "simple" else SMART_MODEL
print(f"[Router] complexity={complexity} → {model}")
return model
@app.post("/chat/smart", response_model=ChatResponse)
async def smart_chat(req: ChatRequest):
start = time.perf_counter()
model = select_model(req.message)
response = await state.client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": req.message},
],
temperature=0.2,
)
answer = response.choices[0].message.content
latency_ms = (time.perf_counter() - start) * 1000
return ChatResponse(
session_id=req.session_id,
answer=answer,
model_used=model,
latency_ms=round(latency_ms, 1),
tokens_used=response.usage.total_tokens,
)
5. Rate Limiting and Retry with Exponential Backoff¶
# pip install tenacity
import asyncio
from tenacity import (
retry,
stop_after_attempt,
wait_exponential,
retry_if_exception_type,
before_sleep_log,
)
import logging
from openai import RateLimitError, APITimeoutError, APIConnectionError
logger = logging.getLogger(__name__)
RETRYABLE_ERRORS = (RateLimitError, APITimeoutError, APIConnectionError)
@retry(
retry=retry_if_exception_type(RETRYABLE_ERRORS),
wait=wait_exponential(multiplier=1, min=1, max=60),
stop=stop_after_attempt(5),
before_sleep=before_sleep_log(logger, logging.WARNING),
reraise=True,
)
async def resilient_completion(client: AsyncOpenAI, messages: list[dict], model: str) -> str:
"""LLM call with automatic retry on transient errors."""
response = await client.chat.completions.create(
model=model,
messages=messages,
temperature=0,
timeout=30.0,
)
return response.choices[0].message.content
# Per-IP rate limiting using a token bucket
from collections import defaultdict
class RateLimiter:
"""Simple in-memory token bucket rate limiter."""
def __init__(self, requests_per_minute: int = 60):
self.rpm = requests_per_minute
self._buckets: dict[str, list[float]] = defaultdict(list)
def is_allowed(self, client_id: str) -> bool:
now = time.time()
window = 60.0
timestamps = self._buckets[client_id]
# Remove timestamps older than 1 minute
self._buckets[client_id] = [t for t in timestamps if now - t < window]
if len(self._buckets[client_id]) >= self.rpm:
return False
self._buckets[client_id].append(now)
return True
rate_limiter = RateLimiter(requests_per_minute=30)
from fastapi import Header
@app.post("/chat/limited", response_model=ChatResponse)
async def rate_limited_chat(req: ChatRequest, x_client_id: str = Header(default="anonymous")):
if not rate_limiter.is_allowed(x_client_id):
raise HTTPException(status_code=429, detail="Rate limit exceeded. Try again in 60 seconds.")
start = time.perf_counter()
answer = await resilient_completion(
state.client,
[{"role": "user", "content": req.message}],
model="gpt-4o-mini",
)
latency_ms = (time.perf_counter() - start) * 1000
return ChatResponse(
session_id=req.session_id,
answer=answer,
model_used="gpt-4o-mini",
latency_ms=round(latency_ms, 1),
tokens_used=0, # usage not available without full response object
)
6. Safety Filters¶
6.1 Content Moderation¶
async def check_moderation(text: str) -> dict:
"""Use OpenAI moderation API to flag unsafe content."""
response = await state.client.moderations.create(input=text)
result = response.results[0]
return {
"flagged": result.flagged,
"categories": {k: v for k, v in result.categories.__dict__.items() if v},
}
@app.post("/chat/safe", response_model=ChatResponse)
async def safe_chat(req: ChatRequest):
# Moderate the input
moderation = await check_moderation(req.message)
if moderation["flagged"]:
raise HTTPException(
status_code=400,
detail=f"Input flagged by content moderation: {moderation['categories']}",
)
start = time.perf_counter()
answer = await resilient_completion(
state.client,
[{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": req.message}],
model="gpt-4o-mini",
)
# Moderate the output as well
output_mod = await check_moderation(answer)
if output_mod["flagged"]:
answer = "I'm unable to provide that response."
latency_ms = (time.perf_counter() - start) * 1000
return ChatResponse(
session_id=req.session_id,
answer=answer,
model_used="gpt-4o-mini",
latency_ms=round(latency_ms, 1),
tokens_used=0,
)
6.2 PII Detection¶
# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
PII_ENTITIES = ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD", "US_SSN", "IP_ADDRESS"]
def detect_pii(text: str) -> list[dict]:
"""Return detected PII entities."""
results = analyzer.analyze(text=text, language="en", entities=PII_ENTITIES)
return [
{"type": r.entity_type, "score": r.score, "start": r.start, "end": r.end}
for r in results
]
def anonymize_pii(text: str) -> str:
"""Replace PII with placeholder tokens."""
results = analyzer.analyze(text=text, language="en", entities=PII_ENTITIES)
if not results:
return text
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
return anonymized.text
# Example
text = "My name is John Smith. Email me at john@example.com or call 555-123-4567."
pii_found = detect_pii(text)
print(f"PII detected: {pii_found}")
clean_text = anonymize_pii(text)
print(f"Anonymized: {clean_text}")
# Output: "My name is <PERSON>. Email me at <EMAIL_ADDRESS> or call <PHONE_NUMBER>."
7. Health Checks and Graceful Shutdown¶
# ── Full application with all features ────────────────────────────────────────
import signal
import asyncio
from contextlib import asynccontextmanager
# Track background tasks for graceful cleanup
background_tasks: set[asyncio.Task] = set()
@asynccontextmanager
async def lifespan(app: FastAPI):
# ── Startup ────────────────────────────────────────────────────────────────
state.client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY", "sk-fake"))
state.ready = True
print("Startup complete. Service is ready.")
yield # application runs here
# ── Shutdown ───────────────────────────────────────────────────────────────
state.ready = False
print("Shutdown signal received. Draining in-flight requests...")
# Cancel and await all background tasks
for task in background_tasks:
task.cancel()
if background_tasks:
await asyncio.gather(*background_tasks, return_exceptions=True)
print("Graceful shutdown complete.")
@app.get("/health/detailed")
async def health_detailed():
"""Detailed health check for orchestrators (k8s liveness probe)."""
checks = {
"api_client": state.client is not None,
"ready": state.ready,
}
# Check Redis connectivity
try:
cache.redis.ping()
checks["cache"] = True
except Exception:
checks["cache"] = False
all_ok = all(checks.values())
return JSONResponse(
content={"status": "ok" if all_ok else "degraded", "checks": checks},
status_code=200 if all_ok else 503,
)
# Kubernetes probe endpoints
@app.get("/livez")
async def liveness():
"""k8s liveness probe — process is alive."""
return {"alive": True}
@app.get("/readyz")
async def readiness_probe():
"""k8s readiness probe — ready to serve traffic."""
if not state.ready:
raise HTTPException(status_code=503)
return {"ready": True}
7.1 Putting It All Together¶
# Complete minimal production agent service
# Save as: agent_service.py
# Run with: uvicorn agent_service:app --host 0.0.0.0 --port 8000
"""
Agent service startup checklist:
1. Set OPENAI_API_KEY environment variable
2. Start Redis: docker run -d -p 6379:6379 redis:alpine
3. uvicorn agent_service:app --host 0.0.0.0 --port 8000 --workers 4
Endpoints:
GET /health — basic health check
GET /health/detailed — full dependency health check
GET /livez — k8s liveness probe
GET /readyz — k8s readiness probe
POST /chat — synchronous chat (JSON response)
POST /chat/stream — streaming chat (SSE)
POST /chat/smart — chat with automatic model routing
POST /chat/safe — chat with content moderation + PII detection
POST /chat/limited — chat with per-client rate limiting
"""
# Test with curl:
# curl -X POST http://localhost:8000/chat \
# -H "Content-Type: application/json" \
# -d '{"message": "What is HBM3?"}'
# Streaming test:
# curl -X POST http://localhost:8000/chat/stream \
# -H "Content-Type: application/json" \
# -H "Accept: text/event-stream" \
# -d '{"message": "Explain NVLink 4.0"}'
Key Takeaways¶
- FastAPI + async OpenAI is the standard production stack. Use
lifespanfor clean startup/shutdown andpydanticfor request validation. - SSE streaming dramatically improves perceived latency for long responses — implement it from day one.
- Semantic caching with a 0.95 similarity threshold can reduce LLM calls by 30–60% on typical support/QA workloads.
- Model routing saves 10–20x on cost by sending simple queries to cheap models. Even a rule-based classifier is a good start.
- Exponential backoff with
tenacityhandles transient API errors gracefully. Always set amax_attemptsceiling. - Moderate both input and output. Input moderation prevents abuse; output moderation prevents liability.
- PII anonymization before sending user data to external LLM APIs is a compliance requirement in most jurisdictions.
- Three probe endpoints (
/health,/livez,/readyz) are the minimum for Kubernetes deployment.
Exercises¶
Exercise 1 — Streaming Chat Client¶
Build a Python command-line chat client that connects to the /chat/stream endpoint using the httpx library. The client should maintain a list of previous messages locally and prepend them as conversation history in each request. The client should print tokens as they stream in and show the total latency and token count after each response.
Exercise 2 — Cache Benchmark¶
Set up the SemanticCache (you can mock Redis with a dict-based in-memory store if needed). Create 20 queries — 10 unique and 10 that are near-paraphrases of the first 10. Measure the latency of the first call (cache miss, LLM call) vs the second call (cache hit, no LLM call) for each pair. Plot the latency distribution and report the cache hit rate and average speedup ratio.
Exercise 3 — Production Hardening¶
Start with the basic /chat endpoint and add the following hardening features one by one:
1. Request ID header (X-Request-ID) that is echoed in the response.
2. Request timeout — if the LLM call takes more than 10 seconds, return HTTP 504.
3. Circuit breaker — after 5 consecutive LLM failures, return HTTP 503 without calling the API until a 30-second cooldown passes.
Write integration tests (using httpx.AsyncClient and pytest-asyncio) that verify each behavior.
Next: Lab 01 — Research Agent