Lecture 11 — Evaluation & Observability
Track B · Agentic AI & GenAI
Learning Objectives
By the end of this lecture you will be able to:
- Score LLM outputs for correctness and quality using an LLM-as-judge pattern.
- Compute RAGAS metrics (faithfulness, answer relevancy, context precision, context recall) on a RAG system.
- Trace LLM calls with LangSmith and understand what to log.
- Build a cost-tracking decorator that accumulates token spend per session.
- Log every LLM call with full input/output/token/latency details.
- Construct an evaluation dataset from production traffic and run A/B prompt tests.
1. LLM-as-Judge
The "LLM-as-judge" pattern uses a capable LLM to evaluate the output of another LLM call. It is cheaper and faster than human annotation, and with a well-designed rubric its scores tend to correlate well with human judgment.
# pip install openai
import os
import json
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "sk-fake"))

JUDGE_PROMPT = """You are an expert evaluator. Score the answer on the following rubric.
Return a JSON object with keys: score (0-5), reasoning (one sentence).
Rubric:
5 = Completely correct, well-cited, nothing to add.
4 = Correct but missing minor details.
3 = Partially correct with one factual error.
2 = Mostly wrong or misleading.
1 = Completely wrong.
0 = Refused to answer or empty.
Question: {question}
Reference Answer: {reference}
Model Answer: {answer}
"""


def judge_answer(question: str, reference: str, answer: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)


# Example evaluation
test_cases = [
    {
        "question": "What is the memory bandwidth of the H100 SXM5?",
        "reference": "3.35 TB/s using HBM3 memory.",
        "answer": "The H100 SXM5 provides approximately 3.35 terabytes per second of memory bandwidth via HBM3.",
    },
    {
        "question": "What is the memory bandwidth of the H100 SXM5?",
        "reference": "3.35 TB/s using HBM3 memory.",
        "answer": "The H100 SXM5 has 2 TB/s memory bandwidth.",  # wrong
    },
]

for tc in test_cases:
    result = judge_answer(tc["question"], tc["reference"], tc["answer"])
    print(f"Score: {result['score']}/5 — {result['reasoning']}")
    print(f"Answer was: '{tc['answer'][:60]}...'")
    print()
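Requesting JSON output does not guarantee that the judge returns an integer score inside the rubric's 0-5 range. The sketch below shows one way to validate a raw judge reply before it is aggregated. `parse_judge_output` is a hypothetical helper introduced here for illustration; it is not part of the OpenAI SDK or of the lecture's code.

```python
import json


def parse_judge_output(raw: str) -> dict:
    """Validate a raw judge reply (hypothetical helper, not a library API)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool):
        raise ValueError("score must be a number, not a boolean")
    try:
        score_f = float(score)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"score is missing or not numeric: {score!r}") from exc
    # The range check runs first so NaN and infinity never reach int().
    if not (0 <= score_f <= 5) or score_f != int(score_f):
        raise ValueError(f"score must be an integer in 0-5, got {score!r}")

    return {
        "score": int(score_f),
        "reasoning": str(data.get("reasoning", "")).strip(),
    }


print(parse_judge_output('{"score": 4, "reasoning": "Correct, minor detail missing."}'))
```

Rejecting a malformed reply (and retrying or logging it) keeps one bad judge call from silently skewing a mean score.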
1.1 Batch Evaluation
from concurrent.futures import ThreadPoolExecutor, as_completed


def batch_evaluate(test_cases: list[dict], max_workers: int = 4) -> list[dict]:
    """Evaluate multiple test cases in parallel."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its test case so results can be merged.
        futures = {
            executor.submit(judge_answer, tc["question"], tc["reference"], tc["answer"]): tc
            for tc in test_cases
        }
        for future in as_completed(futures):
            tc = futures[future]
            result = future.result()
            results.append({**tc, **result})
    results.sort(key=lambda x: x["score"], reverse=True)
    if results:  # avoid ZeroDivisionError on an empty test set
        avg = sum(r["score"] for r in results) / len(results)
        print(f"\nMean score: {avg:.2f}/5.0 over {len(results)} cases")
    return results
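Before trusting batch scores, it helps to check how closely the judge tracks human labels on a small annotated sample. The following is a minimal sketch, assuming you already have paired integer scores on the same 0-5 rubric; `agreement_report` and the sample scores are hypothetical and only illustrate the arithmetic.

```python
def agreement_report(judge_scores: list[int], human_scores: list[int]) -> dict:
    """Compare judge scores with human scores for the same examples."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(judge_scores)
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return {
        "n": n,
        "exact_match": sum(d == 0 for d in diffs) / n,   # identical score
        "within_one": sum(d <= 1 for d in diffs) / n,    # off by at most 1 point
        "mean_abs_error": sum(diffs) / n,
    }


# Made-up scores for five examples, purely for illustration.
report = agreement_report([5, 4, 2, 0, 3], [5, 3, 2, 1, 5])
print(report)
```

A low `within_one` rate is usually a sign that the rubric is ambiguous, which is worth fixing before scaling the judge to thousands of examples.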
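2. RAGAS Metrics
RAGAS (Retrieval Augmented Generation Assessment) defines four complementary metrics for RAG systems. The first two, faithfulness and answer relevancy, can be computed without ground-truth answers; context precision and context recall compare the retrieved context against a ground-truth reference.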
# pip install ragas langchain-openai datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Prepare evaluation data
# Each row: question, answer (from RAG), contexts (list of retrieved chunks), ground_truth
# Note: these column names follow the schema used in this lecture; other ragas
# releases may expect different names, so check the docs for your installed version.
eval_data = {
    "question": [
        "What is the H100 memory bandwidth?",
        "What does HBM3 stand for?",
        "How does NVLink 4.0 compare to NVLink 3.0?",
    ],
    "answer": [
        "The H100 SXM5 delivers 3.35 TB/s of memory bandwidth using HBM3.",
        "HBM3 stands for High Bandwidth Memory generation 3.",
        "NVLink 4.0 provides 900 GB/s bidirectional bandwidth, 1.5x NVLink 3.0's 600 GB/s.",
    ],
    "contexts": [
        ["H100 SXM5 uses HBM3 providing 3.35 TB/s bandwidth.", "The H100 PCIe offers 2 TB/s."],
        ["HBM3 is the third generation of High Bandwidth Memory DRAM standard."],
        ["NVLink 4.0 delivers 900 GB/s total bidirectional bandwidth.", "NVLink 3.0 provided 600 GB/s."],
    ],
    "ground_truth": [
        "3.35 TB/s",
        "High Bandwidth Memory generation 3",
        "NVLink 4.0 raises bandwidth to 900 GB/s, 1.5x NVLink 3.0's 600 GB/s",
    ],
}
dataset = Dataset.from_dict(eval_data)

# Configure LLM and embeddings for RAGAS internal evaluation
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=llm,
    embeddings=embeddings,
)
print("\nRAGAS Evaluation Results:")
print(results.to_pandas().to_string(index=False))
2.1 Understanding Each Metric
| Metric | Measures | Range | Formula |
|---|---|---|---|
| Faithfulness | Is every claim in the answer supported by the retrieved context? | 0–1 | supported claims / total claims in the answer |
| Answer Relevancy | Does the answer address the question? | 0–1 | mean cosine similarity between the original question and questions regenerated from the answer |
| Context Precision | Are the relevant chunks ranked near the top of the retrieved list? | 0–1 | mean of precision@k over the ranks k that hold a relevant chunk |
| Context Recall | Are all ground-truth facts covered by the context? | 0–1 | ground-truth facts found in context / total ground-truth facts |
# Manual faithfulness calculation (illustrative)
import json


def compute_faithfulness_manual(answer: str, context: str, llm) -> float:
    """Decompose answer into claims and check each against context."""
    # Step 1: extract claims from the answer
    claims_prompt = f"List every factual claim in this answer as a JSON array of strings:\n{answer}"
    claims_response = llm.invoke(claims_prompt).content
    try:
        claims = json.loads(claims_response)
    except (json.JSONDecodeError, TypeError):
        claims = [answer]  # fallback: treat the whole answer as one claim
    if not isinstance(claims, list):
        claims = [answer]  # the model returned JSON, but not an array
    if not claims:
        return 1.0  # no claims means nothing can contradict the context

    # Step 2: verify each claim against context
    supported = 0
    for claim in claims:
        verify_prompt = (
            f"Context: {context}\n\n"
            f"Claim: {claim}\n\n"
            "Is this claim supported by the context? Reply only YES or NO."
        )
        verdict = llm.invoke(verify_prompt).content.strip().upper()
        if verdict.startswith("YES"):
            supported += 1
    return supported / len(claims)
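The two retrieval metrics can be worked through by hand in the same spirit. The sketch below assumes the per-chunk relevance flags and per-fact coverage flags are already known (in RAGAS they come from LLM judgments); the function names are hypothetical. Context precision is computed in the rank-aware form described in the RAGAS documentation: precision@k is averaged over the ranks that hold a relevant chunk, so relevant chunks ranked first score higher.

```python
def context_precision_manual(relevance: list[bool]) -> float:
    """Rank-aware precision over retrieved chunks, in retrieval order."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k, counted only at relevant ranks
    return total / hits if hits else 0.0


def context_recall_manual(fact_covered: list[bool]) -> float:
    """Share of ground-truth facts that appear in the retrieved context."""
    if not fact_covered:
        return 0.0
    return sum(fact_covered) / len(fact_covered)


# The same single relevant chunk scores higher when it is ranked first.
print(context_precision_manual([True, False]))   # relevant chunk at rank 1
print(context_precision_manual([False, True]))   # relevant chunk at rank 2
print(context_recall_manual([True, True, False]))
```

Comparing the two precision calls shows why reordering retrieved chunks (for example with a reranker) can raise context precision without retrieving anything new.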
3. Tracing with LangSmith
LangSmith records every LangChain invocation with full input/output, timing, and token counts. When no API key is available, we can mock the tracing interface.
# With a real LangSmith account:
# export LANGCHAIN_TRACING_V2=true
# export LANGCHAIN_API_KEY=ls__...
# export LANGCHAIN_PROJECT=my-rag-project
# All subsequent LangChain calls are automatically traced.
import os


def setup_tracing(project_name: str = "rag-evaluation"):
    """Configure LangSmith tracing if credentials are available."""
    api_key = os.environ.get("LANGCHAIN_API_KEY")
    if api_key:
        os.environ["LANGCHAIN_TRACING_V2"] = "true"
        os.environ["LANGCHAIN_PROJECT"] = project_name
        print(f"LangSmith tracing enabled → project: {project_name}")
    else:
        print("LANGCHAIN_API_KEY not set — tracing disabled (running locally)")


setup_tracing()
3.1 Manual Run Logging (Mock)
When LangSmith is not available, log runs to a local JSONL file:
import json
import time
import uuid
from pathlib import Path
from functools import wraps
from typing import Any, Callable

TRACE_FILE = Path("./traces.jsonl")


def trace_llm_call(run_name: str):
    """Decorator that logs LLM calls to a local JSONL trace file."""
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args, **kwargs) -> Any:
            run_id = str(uuid.uuid4())[:8]
            start = time.perf_counter()
            error = None
            result = None
            try:
                result = fn(*args, **kwargs)
                return result
            except Exception as e:
                error = str(e)
                raise
            finally:
                # Runs on success and on failure, so every call leaves a record.
                elapsed = time.perf_counter() - start
                record = {
                    "run_id": run_id,
                    "name": run_name,
                    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
                    "latency_s": round(elapsed, 4),
                    "inputs": {"args": str(args)[:200], "kwargs": str(kwargs)[:200]},
                    "output": str(result)[:500] if result is not None else None,
                    "error": error,
                }
                with TRACE_FILE.open("a") as f:
                    f.write(json.dumps(record) + "\n")
        return wrapper
    return decorator


@trace_llm_call("summarize")
def summarize(text: str) -> str:
    # Mocked — replace with real LLM call
    return f"Summary of: {text[:50]}..."


summarize("The H100 GPU achieves 3.35 TB/s memory bandwidth using HBM3...")
print(f"Trace written to {TRACE_FILE}")
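Once traces accumulate, the JSONL file can be read back to answer basic observability questions. The sketch below computes latency percentiles from records that carry a `latency_s` field, as the trace records above do. `latency_percentiles` is a hypothetical helper, and the nearest-rank percentile method is one reasonable choice among several.

```python
import json
import math
from pathlib import Path


def latency_percentiles(trace_file: Path, percentiles=(50, 95)) -> dict:
    """Return nearest-rank latency percentiles from a JSONL trace file."""
    latencies = []
    if trace_file.exists():
        with trace_file.open() as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip corrupt lines instead of failing the report
                if isinstance(record, dict) and isinstance(record.get("latency_s"), (int, float)):
                    latencies.append(float(record["latency_s"]))
    if not latencies:
        return {}
    latencies.sort()
    report = {}
    for p in percentiles:
        # Nearest-rank: the smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p * len(latencies) / 100))
        report[f"p{p}"] = latencies[rank - 1]
    return report


print(latency_percentiles(Path("./traces.jsonl")))
```

Tail percentiles such as p95 are usually more informative than the mean, because a few slow LLM calls dominate the user-visible latency.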
4. Cost Tracking
import os
import time
import uuid
from dataclasses import dataclass, field
from openai import OpenAI

# OpenAI pricing per million tokens (as of early 2025)
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
    "text-embedding-3-small": {"input": 0.02, "output": 0.0},
}


@dataclass
class CostTracker:
    session_id: str = field(default_factory=lambda: str(uuid.uuid4())[:8])
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_cost_usd: float = 0.0
    calls: list[dict] = field(default_factory=list)

    def record(self, model: str, input_tokens: int, output_tokens: int, latency_s: float):
        # Models missing from PRICING are costed at zero, so keep the table current.
        pricing = PRICING.get(model, {"input": 0.0, "output": 0.0})
        cost = (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens
        self.total_cost_usd += cost
        self.calls.append({
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": round(cost, 6),
            "latency_s": round(latency_s, 4),
        })

    def summary(self) -> dict:
        return {
            "session_id": self.session_id,
            "total_calls": len(self.calls),
            "total_input_tokens": self.total_input_tokens,
            "total_output_tokens": self.total_output_tokens,
            "total_cost_usd": round(self.total_cost_usd, 6),
        }


# Global tracker for the session
tracker = CostTracker()


def tracked_completion(messages: list[dict], model: str = "gpt-4o-mini") -> str:
    """OpenAI chat completion with automatic cost tracking."""
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "sk-fake"))
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=0
    )
    elapsed = time.perf_counter() - start
    usage = response.usage
    tracker.record(model, usage.prompt_tokens, usage.completion_tokens, elapsed)
    return response.choices[0].message.content


# Demo
# answer = tracked_completion([{"role": "user", "content": "What is HBM3?"}])
# print(tracker.summary())
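The same accounting can be packaged as a decorator, so any function that returns a completion-like object is costed without changing its body. This is a minimal sketch under stated assumptions: `track_cost`, `PRICES_PER_MTOK`, and `fake_completion` are hypothetical names, the wrapped function is assumed to return an object exposing `.model`, `.usage.prompt_tokens`, and `.usage.completion_tokens` (as OpenAI chat completion responses do), and a stand-in response replaces the real API call so the example runs offline.

```python
import time
from functools import wraps
from types import SimpleNamespace

# Assumed price table (USD per million tokens), mirroring the gpt-4o-mini entry above.
PRICES_PER_MTOK = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}


def track_cost(ledger: list):
    """Decorator factory: append one cost record per call to `ledger`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            response = fn(*args, **kwargs)
            latency = time.perf_counter() - start
            usage = response.usage
            price = PRICES_PER_MTOK.get(response.model)
            # Unknown models get cost None rather than 0, so gaps are visible.
            cost = None if price is None else (
                usage.prompt_tokens * price["input"]
                + usage.completion_tokens * price["output"]
            ) / 1_000_000
            ledger.append({
                "model": response.model,
                "input_tokens": usage.prompt_tokens,
                "output_tokens": usage.completion_tokens,
                "cost_usd": cost,
                "latency_s": round(latency, 4),
            })
            return response
        return wrapper
    return decorator


ledger: list[dict] = []


@track_cost(ledger)
def fake_completion(prompt: str):
    # Stand-in for client.chat.completions.create(...); no network call is made.
    usage = SimpleNamespace(prompt_tokens=1200, completion_tokens=300)
    return SimpleNamespace(model="gpt-4o-mini", usage=usage)


fake_completion("What is HBM3?")
# Worked arithmetic: (1200 * 0.15 + 300 * 0.60) / 1_000_000 = 0.00036 USD
print(ledger[-1])
```

One caveat with reading the model name from the response: real API responses may report a dated model identifier that is not a key in the price table, in which case this sketch records `None` for the cost instead of guessing.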
5. Structured LLM Call Logger
import json
import logging
import os
import time
import uuid
from pathlib import Path
from openai import OpenAI

# Note: this block reuses the PRICING table defined in Section 4.

# Configure structured logging
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm_logger")
log_file = Path("llm_calls.jsonl")


class LLMLogger:
    """Logs every LLM call to both console and a JSONL file."""

    def __init__(self, model: str = "gpt-4o-mini", log_path: Path = log_file):
        self.client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "sk-fake"))
        self.model = model
        self.log_path = log_path

    def _write(self, record: dict):
        # Use a context manager so the file handle is always closed.
        with self.log_path.open("a") as f:
            f.write(json.dumps(record) + "\n")
        # Brief console log
        logger.info(
            f"[LLM] call_id={record['call_id']} "
            f"model={record['model']} "
            f"tokens={record['usage']['total_tokens']} "
            f"latency={record['latency_s']:.3f}s "
            f"cost=${record['cost_usd']:.5f}"
        )

    def complete(self, messages: list[dict], **kwargs) -> str:
        call_id = uuid.uuid4().hex[:8]
        # Default to temperature 0, but let callers override it without a
        # "multiple values for keyword argument" error.
        kwargs.setdefault("temperature", 0)
        start = time.perf_counter()
        response = self.client.chat.completions.create(
            model=self.model, messages=messages, **kwargs
        )
        latency = time.perf_counter() - start
        usage = response.usage
        content = response.choices[0].message.content or ""  # content can be None
        pricing = PRICING.get(self.model, {"input": 0.0, "output": 0.0})
        cost = (
            usage.prompt_tokens * pricing["input"]
            + usage.completion_tokens * pricing["output"]
        ) / 1_000_000
        record = {
            "call_id": call_id,
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "model": self.model,
            "messages": [{"role": m["role"], "content": m["content"][:300]} for m in messages],
            "response": content[:500],
            "usage": {
                "prompt_tokens": usage.prompt_tokens,
                "completion_tokens": usage.completion_tokens,
                "total_tokens": usage.total_tokens,
            },
            "latency_s": round(latency, 4),
            "cost_usd": round(cost, 6),
        }
        self._write(record)
        return content


# Usage
# llm = LLMLogger()
# answer = llm.complete([{"role": "user", "content": "Explain HBM3 in one sentence."}])
6. Building an Evaluation Dataset from Production Traffic
import json
import random
from pathlib import Path
from datetime import datetime, timezone

TRACE_FILE = Path("./traces.jsonl")
EVAL_DATASET_FILE = Path("./eval_dataset.json")


def sample_production_traces(
    trace_file: Path,
    sample_size: int = 50,
    min_response_length: int = 50,
) -> list[dict]:
    """Sample error-free traces with sufficiently long outputs for an eval dataset."""
    traces = []
    if not trace_file.exists():
        print(f"Trace file not found: {trace_file}")
        return []
    with trace_file.open() as f:
        for line in f:
            try:
                record = json.loads(line)
                if (
                    record.get("output")
                    and len(record["output"]) >= min_response_length
                    and not record.get("error")
                ):
                    traces.append(record)
            except json.JSONDecodeError:
                continue
    sample = random.sample(traces, min(sample_size, len(traces)))
    print(f"Sampled {len(sample)} traces from {len(traces)} total")
    return sample


def build_eval_dataset(traces: list[dict]) -> list[dict]:
    """
    Convert production traces into evaluation examples.
    In production: send these to human annotators for labeling.
    Here: label and score are left as None placeholders.
    """
    dataset = []
    for trace in traces:
        dataset.append({
            "id": trace.get("run_id", "unknown"),
            "question": _extract_question(trace),
            "answer": trace.get("output", ""),
            "label": None,  # to be filled by human annotator
            "score": None,  # to be filled by judge
            # datetime.utcnow() is deprecated since Python 3.12; use an aware timestamp.
            "sampled_at": datetime.now(timezone.utc).isoformat(),
        })
    EVAL_DATASET_FILE.write_text(json.dumps(dataset, indent=2))
    print(f"Saved {len(dataset)} eval examples → {EVAL_DATASET_FILE}")
    return dataset


def _extract_question(trace: dict) -> str:
    """Pull the user question from a trace record."""
    inputs = trace.get("inputs", {})
    if isinstance(inputs, dict):
        for key in ("question", "query", "input", "user_message"):
            if key in inputs:
                return inputs[key]
    # Fallback: traces that only store stringified args/kwargs end up here.
    return str(inputs)[:200]
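Production traffic tends to contain many near-identical questions, which can let a handful of popular queries dominate an evaluation set. A minimal sketch of one possible cleanup step follows: it drops examples whose question matches an earlier one after lowercasing and whitespace normalization. `normalize_question` and `dedupe_examples` are hypothetical helpers, and the normalization rule is an assumption that you may want to loosen or tighten for your data.

```python
import hashlib
import re


def normalize_question(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedupe_examples(examples: list[dict]) -> list[dict]:
    """Keep the first example for each normalized question, preserving order."""
    seen: set[str] = set()
    unique = []
    for example in examples:
        normalized = normalize_question(str(example.get("question", "")))
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        unique.append(example)
    return unique


examples = [
    {"id": "a1", "question": "What is HBM3?"},
    {"id": "b2", "question": "  what is   hbm3? "},
    {"id": "c3", "question": "What is NVLink?"},
]
print([ex["id"] for ex in dedupe_examples(examples)])
```

Exact-match deduplication is deliberately conservative; paraphrases survive it, so a later pass (for example, embedding similarity) may still be useful.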
7. A/B Testing Prompts
import hashlib
import random
from typing import Callable

PROMPT_A = """Answer the following question concisely using only facts from the context.
Context: {context}
Question: {question}"""

PROMPT_B = """You are a precise technical assistant. Using ONLY the provided context,
answer the question. If unsure, say "I don't know."
Context: {context}
Question: {question}"""


def ab_router(user_id: str, variants: list[str], weights: list[float] | None = None) -> str:
    """
    Assign a user to a prompt variant.
    Without weights: deterministic, based on the user ID hash, so the same
    user always gets the same variant (sticky assignment).
    With weights: random per call, so assignment is NOT sticky.
    """
    if weights:
        # Weighted random assignment (not deterministic, for gradual rollout)
        return random.choices(variants, weights=weights)[0]
    # Deterministic: hash user_id to choose variant
    hash_int = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return variants[hash_int % len(variants)]


class PromptABTest:
    def __init__(self, variant_a: str, variant_b: str):
        self.variants = {"A": variant_a, "B": variant_b}
        self.results: dict[str, list[float]] = {"A": [], "B": []}

    def run(self, user_id: str, context: str, question: str, judge_fn: Callable) -> dict:
        variant_key = ab_router(user_id, ["A", "B"])
        # In a real run this prompt is sent to the LLM; the mock below ignores it.
        prompt = self.variants[variant_key].format(context=context, question=question)
        # Generate answer (mock here)
        answer = f"[Variant {variant_key}] Answer about: {question[:40]}..."
        # Score the answer
        score = judge_fn(question=question, reference="ground truth placeholder", answer=answer)
        self.results[variant_key].append(score["score"])
        return {"variant": variant_key, "answer": answer, "score": score["score"]}

    def report(self) -> dict:
        report = {}
        for key, scores in self.results.items():
            if scores:
                report[key] = {
                    "n": len(scores),
                    "mean_score": round(sum(scores) / len(scores), 2),
                    "min": min(scores),
                    "max": max(scores),
                }
        # Highest mean only; this does not test whether the gap is significant.
        winner = max(report, key=lambda k: report[k]["mean_score"]) if report else None
        return {"variants": report, "winner": winner}


# Demo
# ab = PromptABTest(PROMPT_A, PROMPT_B)
# for i in range(20):
#     ab.run(f"user_{i}", context="H100 has 3.35 TB/s bandwidth.", question="H100 bandwidth?", judge_fn=judge_answer)
# print(ab.report())
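A gradual rollout does not have to give up sticky assignment. The sketch below maps each user's hash to a point in [0, 1) and picks the variant whose cumulative weight range contains that point, so a weighted split stays deterministic per user. `sticky_weighted_router` and its `salt` parameter are hypothetical names introduced here; this is an alternative to the weighted branch of `ab_router`, not a description of it.

```python
import hashlib


def sticky_weighted_router(
    user_id: str,
    variants: list[str],
    weights: list[float],
    salt: str = "prompt-ab-v1",  # change the salt to reshuffle users for a new experiment
) -> str:
    """Deterministic weighted assignment: same user and salt give the same variant."""
    if not variants or len(variants) != len(weights):
        raise ValueError("variants and weights must be non-empty and the same length")
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and sum to a positive value")

    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    point = int(digest[:8], 16) / 16**8  # uniform-looking value in [0, 1)

    threshold = 0.0
    for variant, weight in zip(variants, weights):
        threshold += weight / total
        if point < threshold:
            return variant
    return variants[-1]  # guards against floating-point rounding at the top end


# Roughly 10% of users land on B, and each user's assignment never changes.
assignments = [sticky_weighted_router(f"user_{i}", ["A", "B"], [0.9, 0.1]) for i in range(1000)]
print(assignments.count("A"), assignments.count("B"))
```

Raising B's weight later only moves users from A to B (never the reverse) as long as the variant order and salt stay fixed, which keeps earlier B users on B.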
Key Takeaways
- LLM-as-judge scales evaluation to thousands of examples quickly. Use a rubric with clear integer scores (0-5) and require JSON output for reliable parsing.
- RAGAS gives you four complementary RAG metrics. Faithfulness catches hallucinations; context recall diagnoses retrieval gaps.
- LangSmith tracing needs no code changes once the environment variables are set. Trace in staging even if you do not trace in production.
- Cost tracking should accumulate per-session and log per-call. Early visibility on costs prevents budget surprises.
- Structured call logging (JSONL) is your flight recorder — essential for debugging failures and building eval datasets.
- A/B testing with sticky user assignment keeps each user on one variant, which makes experiments reproducible. Collect enough samples for a statistically meaningful comparison before declaring a winner.
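To make the last takeaway concrete, the sketch below estimates whether a gap in mean judge scores between two variants could plausibly be chance. It uses a two-sided permutation test, chosen here because it needs only the standard library and makes no normality assumption about 0-5 scores; `permutation_p_value` is a hypothetical helper and the sample scores are made up for illustration.

```python
import random


def _mean(values: list[float]) -> float:
    return sum(values) / len(values)


def permutation_p_value(
    scores_a: list[float],
    scores_b: list[float],
    n_permutations: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided permutation test on the difference in mean scores."""
    if not scores_a or not scores_b:
        raise ValueError("both variants need at least one score")
    rng = random.Random(seed)  # seeded so the result is reproducible
    observed = abs(_mean(scores_a) - _mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign scores to the two groups
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up judge scores for two prompt variants.
variant_a = [4, 5, 3, 4, 4, 5, 3, 4]
variant_b = [4, 4, 3, 5, 4, 4, 3, 5]
print(f"p-value: {permutation_p_value(variant_a, variant_b):.3f}")
```

A common convention is to declare a winner only when the p-value falls below a threshold fixed before the experiment (0.05 is typical); otherwise keep collecting samples.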
Exercises
Exercise 1 — Multi-Criteria Judge
Extend the LLM-as-judge to evaluate on three separate criteria: (1) factual accuracy, (2) completeness, and (3) conciseness. Each criterion should return a score of 0–5 with a one-sentence justification. Build a MultiCriteriaJudge class that averages the three scores and returns a combined verdict. Test it on at least 5 question-answer pairs from a domain you know well.
Exercise 2 — RAGAS on Your Own Data
Create a small corpus of 5–10 documents on any technical topic you choose. Write 5 test questions with known ground-truth answers. Build a simple RAG system (from Lecture 09/10), run it on all 5 questions, and evaluate using RAGAS. Report all four metric scores in a table. For any metric below 0.7, identify one concrete change to the RAG system that would improve it.
Exercise 3 — Cost Dashboard
Build a CostDashboard class that reads from the JSONL log file produced by LLMLogger and renders a summary report. The report should include: total spend, spend by model, average cost per call, top 5 most expensive calls (with their inputs), and a call count time series grouped by hour. The report should be printable as a formatted text table.