Lecture 01: LLM Fundamentals for Agents

Track B · Agentic AI & GenAI

Learning Objectives

By the end of this lecture you will be able to:

Explain how transformer inference works at a high level (prefill vs. decode)
Calculate token counts and context window costs
Choose the right model for a given agent task
Explain why latency, throughput, and time to first token (TTFT) matter for agentic loops

1. How Transformers Generate Text

An LLM does one thing: given a sequence of tokens, predict the next token. An agent is just a loop that keeps calling this function.

Input tokens → [Transformer] → Logits → Sample → Output token
                                                        ↓
                                              Append to context
                                                        ↓
                                              Repeat until stop
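
The loop in the diagram can be sketched in a few lines. This is a toy illustration only: `TOY_LOGITS` is a hypothetical lookup table standing in for the transformer, and it conditions on the last token alone, whereas a real model conditions on the whole context. Sampling is reduced to greedy selection (pick the highest score).

```python
# Toy autoregressive loop. TOY_LOGITS is a made-up stand-in for a transformer:
# it maps the most recent token to scores (logits) over possible next tokens.
TOY_LOGITS = {
    "<s>": {"the": 2.0, "a": 1.0},
    "the": {"agent": 3.0, "loop": 1.5},
    "a": {"loop": 2.0, "agent": 1.0},
    "agent": {"calls": 2.5, "</s>": 0.5},
    "calls": {"tools": 3.0, "</s>": 1.0},
    "tools": {"</s>": 4.0},
    "loop": {"</s>": 4.0},
}


def next_token(context: list[str]) -> str:
    """Greedy 'sampling': return the highest-scoring next token."""
    logits = TOY_LOGITS[context[-1]]
    return max(logits, key=logits.get)


def generate(prompt: list[str], max_new_tokens: int = 10, stop: str = "</s>") -> list[str]:
    """Predict a token, append it to the context, repeat until stop."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == stop:
            break
        context.append(token)
    return context


print(generate(["<s>"]))  # → ['<s>', 'the', 'agent', 'calls', 'tools']
```

Note that `max_new_tokens` plays the same role as the API's `max_tokens`: it ends the loop even if the stop token never appears.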

Two phases of inference:

Phase	What happens	Bottleneck
Prefill	Process all input tokens in parallel (matrix multiply)	Compute (FLOP-bound)
Decode	Generate one token at a time (autoregressive)	Memory bandwidth

Hardware implication: Prefill saturates GPU compute. Decode is bottlenecked by how fast you can stream the weights (and the growing KV cache) from memory such as HBM. This is why some inference accelerators (for example Groq and Etched) emphasize memory bandwidth, not just FLOPS.
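
A back-of-envelope estimate makes the two bottlenecks concrete. The sketch below assumes the common approximation of about 2 FLOPs per parameter per token, batch size 1, and ignores the KV cache and any overlap of compute with memory traffic. The model size and hardware numbers are hypothetical round figures, not the specs of any real chip.

```python
def prefill_seconds(params: float, prompt_tokens: int, flops_per_s: float) -> float:
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token."""
    return 2 * params * prompt_tokens / flops_per_s


def decode_seconds_per_token(params: float, bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound estimate: every weight is read once per generated token."""
    return params * bytes_per_param / bandwidth_bytes_per_s


# Hypothetical setup: 7B-parameter model, 16-bit weights,
# accelerator with 1e15 FLOP/s peak compute and 3e12 bytes/s memory bandwidth.
PARAMS = 7e9
BYTES_PER_PARAM = 2
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

prefill = prefill_seconds(PARAMS, prompt_tokens=2000, flops_per_s=PEAK_FLOPS)
per_token = decode_seconds_per_token(PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH)

print(f"Prefill of 2000 tokens: {prefill * 1000:.1f} ms")
print(f"Decode per token:       {per_token * 1000:.2f} ms ({1 / per_token:.0f} tokens/s)")
```

Under these assumptions, the whole 2,000-token prompt is processed in about 28 ms, while each generated token costs about 4.7 ms, so a 500-token reply takes far longer than reading the prompt did.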

2. Tokens and Context Windows

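Tokens ≠ words. Rule of thumb for English text: 1 token ≈ 0.75 words, or about 4 characters. Code and non-English text often tokenize less efficiently, so measure rather than guess when it matters.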

import os
import anthropic

client = anthropic.Anthropic()

# Count tokens before sending
response = client.messages.count_tokens(
    model=os.environ.get("ANTHROPIC_MODEL", "your-model-id"),
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(response.input_tokens)  # exact count depends on the model's tokenizer
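
When you cannot call the API (offline planning, quick budgeting), the 4-characters-per-token rule of thumb gives a rough estimate. The helper below is a hypothetical sketch of that heuristic, not a tokenizer; treat its output as an approximation and prefer `count_tokens` for anything billing-related.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English-prose heuristic)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


print(estimate_tokens("Hello, how are you?"))  # → 5 (19 characters / 4, rounded up)
```

The API count for the same message will usually be higher, because the request also includes role and formatting tokens that the heuristic does not see.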

Context windows change quickly. Think in categories instead of memorizing one snapshot:

Category	Typical use	Engineering concern
Small/fast chat model	routing, classification, short summaries	low latency and low cost
Balanced agent model	tool use, JSON extraction, code review	reliable structure and good reasoning
Long-context model	repo analysis, large documents, multi-file tasks	context cost, retrieval quality, memory pressure
Local/open-weight model	edge inference, privacy, offline demos	VRAM, quantization, throughput
Embedding model	RAG indexing and retrieval	vector dimension, recall, index cost

Always verify current context limits in the provider documentation before designing a production agent around a specific window size.

Why context size matters for agents:

- Multi-step reasoning accumulates tokens fast
- Tool call results land in context
- Long documents fed to a RAG agent must fit
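
One way to reason about these pressures is an explicit token budget. The sketch below is a hypothetical helper (the class name, fields, and the 100,000-token window are illustrative assumptions, not provider values): it subtracts each consumer of context from the window and reserves room for the reply.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    window: int           # model context limit in tokens (verify in provider docs)
    reserved_output: int  # tokens kept free for the model's reply

    def remaining(self, system: int, history: int, tool_results: int, documents: int) -> int:
        """Tokens left after all inputs and the reserved output are accounted for."""
        used = system + history + tool_results + documents
        return self.window - self.reserved_output - used

    def fits(self, system: int, history: int, tool_results: int, documents: int) -> bool:
        return self.remaining(system, history, tool_results, documents) >= 0


# Hypothetical numbers for a mid-run agent step.
budget = ContextBudget(window=100_000, reserved_output=4_096)
usage = dict(system=500, history=12_000, tool_results=30_000, documents=60_000)

print(budget.remaining(**usage))  # → -6596
print(budget.fits(**usage))       # → False
```

A negative remainder means something has to give before the call is made: summarize the history, truncate tool output, or retrieve fewer document chunks.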

3. Inference Parameters

import os
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model=os.environ.get("ANTHROPIC_MODEL", "your-model-id"),
    max_tokens=1024,
    temperature=0.0,    # 0 = near-deterministic (good for agents/tools)
                        # 1 = more varied (good for writing)
    top_p=1.0,
    messages=[{"role": "user", "content": "What is 2+2?"}]
)

Parameter	Effect	Agent recommendation
`temperature`	Randomness of sampling	0.0–0.3 for tool use / reasoning
`top_p`	Nucleus sampling cutoff	Leave at 1.0 (let temperature do the work)
`max_tokens`	Hard output limit	Set generously — truncation breaks JSON

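Pro tip: For tool-use agents, use temperature=0 or close to it. Randomness in function-call generation makes malformed JSON and unpredictable behavior more likely. Even at 0, outputs are not guaranteed to be identical across runs, so validate tool arguments regardless.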
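
To see what these two parameters do to the sampling distribution, here is a minimal pure-Python sketch of the textbook definitions of temperature scaling and nucleus (top-p) filtering. It is an illustration under those standard definitions; a provider's actual sampler may differ in details.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.5]
for t in (0.0, 0.2, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

print(top_p_filter([0.5, 0.3, 0.2], top_p=0.7))  # drops the least likely token
```

At temperature 0.2 nearly all probability sits on the top token, which is why low temperatures make tool calls more repeatable; with `top_p=1.0` no token is filtered out, so temperature alone shapes the distribution.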

4. The Anatomy of an API Call

import os
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model=os.environ.get("ANTHROPIC_MODEL", "your-model-id"),
    max_tokens=2048,
    system="You are a helpful assistant.",       # system prompt
    messages=[
        {"role": "user",    "content": "Tell me about CUDA."},
        {"role": "assistant","content": "CUDA is..."},  # prior turn
        {"role": "user",    "content": "How does it compare to ROCm?"},
    ]
)

print(response.content[0].text)
print(f"Input tokens:  {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Stop reason:   {response.stop_reason}")  # end_turn | tool_use | max_tokens

stop_reason values for agents:

Value	Meaning
`end_turn`	Model finished naturally
`tool_use`	Model wants to call a tool — your loop must handle this
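`max_tokens`	Hit the output limit: raise max_tokens or handle the truncation gracefully
`stop_sequence`	Model emitted one of the custom stop sequences you supplied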
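
The table translates into a small dispatch loop. The sketch below is one possible shape, not a definitive implementation: `call_model` is a hypothetical callable you would write as a thin wrapper around `client.messages.create(...)`, and `tools` is a hypothetical dict mapping tool names to Python functions. The `tool_use` and `tool_result` block fields follow the Messages API content-block format.

```python
from typing import Any, Callable


def run_agent_loop(
    call_model: Callable[[list[dict]], Any],
    tools: dict[str, Callable[..., Any]],
    messages: list[dict],
    max_steps: int = 5,
) -> Any:
    """Call the model until it stops asking for tools, or give up after max_steps."""
    for _ in range(max_steps):
        response = call_model(messages)

        if response.stop_reason == "tool_use":
            # Echo the assistant turn back, then answer every tool_use block.
            messages.append({"role": "assistant", "content": response.content})
            results = []
            for block in response.content:
                if block.type == "tool_use":
                    output = tools[block.name](**block.input)
                    results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(output),
                    })
            messages.append({"role": "user", "content": results})
            continue

        if response.stop_reason == "max_tokens":
            raise RuntimeError("Output truncated: raise max_tokens or shorten the task")

        return response  # end_turn (or another terminal stop reason)

    raise RuntimeError("Agent exceeded max_steps without finishing")
```

Injecting `call_model` keeps the loop testable with a fake model, and the `max_steps` cap guards against a model that keeps requesting tools indefinitely.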

5. Model Selection for Agent Tasks

Not every task needs the most powerful model. Cost and latency add up in multi-step loops.

# Router pattern: use fast/cheap model for simple steps
def route_model(task_type: str) -> str:
    fast_model = "provider-fast-model"
    balanced_model = "provider-balanced-agent-model"
    reasoning_model = "provider-reasoning-model"

    routing = {
        "classification": fast_model,
        "summarization": fast_model,
        "tool_use": balanced_model,
        "complex_reasoning": reasoning_model,
        "coding": balanced_model,
    }
    return routing.get(task_type, balanced_model)

Task	Recommended model class	Why
Simple Q&A, routing	fast model	low latency and cost
Tool use, JSON extraction	balanced agent model	reliable structured output
Complex reasoning, long context	reasoning or long-context model	stronger planning and larger working set
Embeddings	embedding model	specialized vector representation

6. Streaming for Responsive Agents

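In agentic UIs, streaming improves perceived responsiveness: the user starts reading after the time to first token (TTFT) instead of waiting for the full response to finish decoding.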

import os
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model=os.environ.get("ANTHROPIC_MODEL", "your-model-id"),
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain transformer attention."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Access the final message (with usage stats) while the stream is still open
    final = stream.get_final_message()

print(f"\nTokens used: {final.usage.input_tokens} in, {final.usage.output_tokens} out")
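
Streaming also makes TTFT easy to measure. The helper below is a hypothetical sketch that times any iterable of text chunks, so it should work with `stream.text_stream` inside the `with` block; here it is exercised with a fake generator so it runs without network access.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a text stream, recording time to first chunk and total time."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token (chunk)
        parts.append(chunk)
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "text": "".join(parts)}


def fake_stream() -> Iterator[str]:
    """Stand-in for a model stream: a slow first chunk, then faster ones."""
    time.sleep(0.05)  # simulated prefill delay
    yield "Attention "
    for word in ("is ", "all ", "you ", "need."):
        time.sleep(0.01)  # simulated per-token decode delay
        yield word


stats = measure_stream(fake_stream())
print(f"TTFT: {stats['ttft_s']:.3f}s, total: {stats['total_s']:.3f}s")
print(stats["text"])
```

The timer starts when `measure_stream` is called, so to include request latency in TTFT, call it immediately after opening the stream.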

7. Cost Estimation

# Rough cost calculator.
# Do not hardcode provider prices in production. Load this from a config file
# maintained from the provider pricing page.
PRICING = {
    "provider-fast-model": {"input": 0.15, "output": 0.60},       # example only, per 1M tokens
    "provider-balanced-agent-model": {"input": 3.00, "output": 15.00},
    "provider-reasoning-model": {"input": 15.00, "output": 75.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 10-step agent loop with a balanced agent model
steps = 10
per_step_input  = 2000   # context grows each step
per_step_output = 500
total = sum(
    estimate_cost("provider-balanced-agent-model", per_step_input * i, per_step_output)
    for i in range(1, steps + 1)
)
print(f"Estimated loop cost: ${total:.4f}")

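Key insight: In a 10-step agent loop where each step adds a fixed amount of context, per-step input grows linearly: step 10 has 10× the input tokens of step 1. Because every step re-sends the whole context, total billed input grows quadratically with the number of steps. In the example above, the loop bills 2,000 × (1 + 2 + ... + 10) = 110,000 input tokens, not 20,000, for an estimated total of about $0.405 at the example prices. This is why context management (summarization, pruning) is critical in production.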
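
The growth pattern can be checked with a closed form. The sketch below assumes each step adds the same number of tokens and that the whole context is re-sent every step; the windowed variant assumes a hypothetical pruning policy that simply caps the context at `max_context` tokens.

```python
def cumulative_input_tokens(per_step: int, steps: int) -> int:
    """Total input tokens billed when step i sends per_step * i tokens.

    Sum over i = 1..n of per_step * i = per_step * n * (n + 1) / 2.
    """
    return per_step * steps * (steps + 1) // 2


def cumulative_input_tokens_windowed(per_step: int, steps: int, max_context: int) -> int:
    """Same loop, but the context is pruned so it never exceeds max_context tokens."""
    return sum(min(per_step * i, max_context) for i in range(1, steps + 1))


print(cumulative_input_tokens(2000, 10))                 # → 110000
print(cumulative_input_tokens(2000, 20))                 # → 420000
print(cumulative_input_tokens_windowed(2000, 10, 8000))  # → 68000
```

Doubling the number of steps roughly quadruples the billed input, while capping the context turns the growth back into a linear one once the cap is reached.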

Key Takeaways

LLM inference = prefill (compute-bound) + decode (memory-bandwidth-bound)
Use temperature=0 for tool-use agents; reserve higher values for creative tasks
Check stop_reason — tool_use means your loop must call the tool and continue
Route tasks to cheaper models where possible; the cost compounds in multi-step loops
Context grows each step — plan for summarization or windowing in long-running agents

Exercises

Write a script that counts tokens for a 10-page PDF before sending it to the API.
Build a simple cost logger that wraps client.messages.create and prints cumulative cost.
Implement a model router that uses a fast model for tasks under 200 input tokens and a balanced agent model otherwise.

Next: Lecture 02 — Prompt Engineering & Structured Output