Lecture 33 - Structured Tools Beat Computer Use: Interface Hierarchy for Agents
Course: Agentic AI & GenAI | Previous: Lecture 32 | Next: Lecture 34
Computer-use agents are visually impressive.
They can look at screenshots, click buttons, type into fields, and operate software like a human.
That does not mean vision is the right primary interface for agents.
The Reflex benchmark gives a useful data point:
- same app
- same task
- same model family
- vision path: 53 ± 13 steps, ~551k input tokens, ~17 minutes
- API path: 8 calls, ~12k input tokens, ~20 seconds
The conclusion is not "never use vision."
The conclusion is:
Vision-based computer use is a fallback interface.
Structured tools are the primary interface when you control the system.
For OpenClaw-style architecture, this is not a small optimization.
It is a tool-design principle.
Learning objectives
By the end of this lecture, you should be able to:
- Explain why screenshot-driven agents are expensive and nondeterministic.
- Compare structured APIs, CLI/direct execution, DOM/accessibility, and vision interfaces.
- Explain the Reflex benchmark and its limitations.
- Design an interface hierarchy for agent tools.
- Identify when vision is appropriate and when it is wasteful.
- Connect structured tools to verification, security, and auditability.
- Apply the principle to OpenClaw Gateway tools, node commands, exec, and App SDK surfaces.
1. The benchmark in plain terms
Reflex tested two ways for Claude Sonnet to operate the same admin panel.
Task:
find the customer named Smith with the most orders
locate their most recent pending order
accept all of their pending reviews
mark the order as delivered
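Path A: a vision agent that operates the admin panel through screenshots, clicks, and typing.
Path B: an API agent that calls the app's structured endpoints.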
Result:
| Metric | Vision agent | API agent |
|---|---|---|
| Steps / calls | 53 ± 13 | 8 ± 0 |
| Wall-clock time | 1003s ± 254s | 19.7s ± 2.8s |
| Input tokens | 550,976 ± 178,849 | 12,151 ± 27 |
| Output tokens | 37,962 ± 10,850 | 934 ± 41 |
The API path completed in 8 calls every time.
The vision path initially missed work because not all pending reviews were visible on screen. It needed a 14-step UI walkthrough to complete successfully.
That walkthrough is itself engineering cost.
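The gap in the results table is easier to read as ratios. The sketch below uses only the mean values from that table; the dictionary keys and variable names are illustrative choices, not part of the benchmark.

```python
# Mean values copied from the results table above.
vision = {"steps": 53, "seconds": 1003, "input_tokens": 550_976, "output_tokens": 37_962}
api = {"steps": 8, "seconds": 19.7, "input_tokens": 12_151, "output_tokens": 934}

# Ratio of vision cost to API cost for each metric.
ratios = {metric: vision[metric] / api[metric] for metric in vision}

# Average input tokens consumed per step on each path.
vision_tokens_per_step = vision["input_tokens"] / vision["steps"]
api_tokens_per_call = api["input_tokens"] / api["steps"]

for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.1f}x")
print(f"input tokens per vision step: {vision_tokens_per_step:,.0f}")
print(f"input tokens per API call: {api_tokens_per_call:,.0f}")
```

The input-token ratio comes out at roughly 45x, which matches the headline figure in the Reflex article title. The wall-clock ratio is larger still, at roughly 51x.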
2. Why vision is expensive
A vision agent pays for perception every step.
Each loop looks like:
screenshot
-> interpret pixels
-> infer UI state
-> decide action
-> click/type
-> wait
-> screenshot again
Costs compound because:
- screenshots are large inputs
- UI state must be rediscovered repeatedly
- scrolling/pagination may be invisible
- every action is sequential
- each screen transition adds another model call
- the model must infer meaning from layout and pixels
Structured APIs avoid most of that.
API loop: call a structured endpoint, read the typed response, decide the next call.
The agent reads the data directly instead of re-deriving it from pixels.
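As a contrast with the screenshot loop, here is a minimal sketch of a structured loop. Everything in it is hypothetical: `ORDERS`, `get_order`, `set_order_status`, and `run_plan` stand in for whatever endpoints a real app would expose, and the plan is hard-coded where a real agent would have the model choose each call.

```python
from typing import Any, Callable

# Hypothetical in-memory stand-in for an app's structured endpoints.
ORDERS: dict[int, dict[str, str]] = {101: {"customer": "Smith", "status": "pending"}}


def get_order(order_id: int) -> dict[str, Any]:
    """Return the order as structured data; no pixels involved."""
    return {"order_id": order_id, **ORDERS[order_id]}


def set_order_status(order_id: int, status: str) -> dict[str, Any]:
    """Apply the state change and report the before/after values."""
    old = ORDERS[order_id]["status"]
    ORDERS[order_id]["status"] = status
    return {"order_id": order_id, "old_status": old, "new_status": status}


TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "get_order": get_order,
    "set_order_status": set_order_status,
}


def run_plan(plan: list[tuple[str, dict[str, Any]]]) -> list[dict[str, Any]]:
    """Execute each call and keep the structured result as a trace."""
    trace = []
    for name, args in plan:
        result = TOOLS[name](**args)
        trace.append({"tool": name, "args": args, "result": result})
    return trace


trace = run_plan([
    ("get_order", {"order_id": 101}),
    ("set_order_status", {"order_id": 101, "status": "delivered"}),
])
```

Each step yields a small dictionary that the next decision can read directly, so nothing has to be rediscovered between steps.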
3. Interface bandwidth hierarchy
Use the highest-bandwidth interface available.
- Structured API / typed tool call: best
- CLI / direct execution: good
- DOM / accessibility tree: acceptable
- Vision / screenshots: fallback
Why this hierarchy exists:
| Interface | Signal quality | Determinism | Cost | Security shape |
|---|---|---|---|---|
| Structured API | high | high | low | scoped contracts |
| CLI/direct exec | high if command is stable | medium/high | low/medium | command policy required |
| DOM/accessibility | medium | medium | medium | app/UI state dependent |
| Vision | low/medium | low | high | broad visible authority |
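The rule: use the highest-bandwidth interface the target system offers, and drop down a level only when the level above is unavailable.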
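One way to encode the hierarchy is as an ordered enum plus a selector that picks the best available option. This is a sketch; the `Interface` enum and `choose_interface` function are names invented for illustration.

```python
from enum import IntEnum


class Interface(IntEnum):
    # Lower value means higher bandwidth, so it is preferred.
    STRUCTURED_API = 1
    CLI = 2
    DOM = 3
    VISION = 4


def choose_interface(available: set[Interface]) -> Interface:
    """Pick the highest-bandwidth interface the target system offers."""
    if not available:
        raise ValueError("no interface available for this target")
    return min(available)


# A third-party UI with no API or CLI falls back to the DOM, not to vision.
choice = choose_interface({Interface.VISION, Interface.DOM})
```

Vision is selected only when it is the sole member of the set, which is the fallback behavior the hierarchy describes.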
4. Structured tools are more than cheaper
Cost is only one dimension.
Structured tools also improve:
Determinism
A typed tool call with explicit arguments is more deterministic than a sequence of clicks inferred from screenshots.
Composability
Tool calls can be chained, retried, validated, and logged.
Verifiability
You can assert on fields in the structured result and preserve exact evidence.
Security
Structured tools can expose narrow capabilities, such as updating the status of one order.
Vision exposes whatever the agent can see and click.
Auditability
Structured logs tell you exactly what happened: which tool was called, with which arguments, and what it returned.
Screenshots require reconstruction.
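The properties in this section can be seen together in a small wrapper that retries a tool call, validates the result, and logs every attempt. This is a sketch under assumptions: `TransientError`, `call_with_retry`, and the flaky demo tool are hypothetical and do not come from any specific agent framework.

```python
import json
from typing import Any, Callable


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def call_with_retry(
    tool: Callable[..., dict[str, Any]],
    args: dict[str, Any],
    validate: Callable[[dict[str, Any]], None],
    log: list[str],
    attempts: int = 3,
) -> dict[str, Any]:
    """Call a tool, retry transient failures, validate and log the result."""
    for attempt in range(1, attempts + 1):
        try:
            result = tool(**args)
        except TransientError as exc:
            log.append(json.dumps({"attempt": attempt, "ok": False, "error": str(exc)}))
            continue
        validate(result)  # raises if the structured result is malformed
        log.append(json.dumps({"attempt": attempt, "ok": True, "result": result}, sort_keys=True))
        return result
    raise RuntimeError(f"tool failed after {attempts} attempts")


calls = {"count": 0}


def flaky_get_status(order_id: int) -> dict[str, Any]:
    """Demo tool that fails once, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientError("temporary backend error")
    return {"order_id": order_id, "status": "pending"}


def require_status_field(result: dict[str, Any]) -> None:
    if "status" not in result:
        raise ValueError("result is missing 'status'")


audit_log: list[str] = []
result = call_with_retry(flaky_get_status, {"order_id": 7}, require_status_field, audit_log)
```

Retrying a click sequence would mean re-reading the screen to find out what state the UI is in. Here the retry is a plain function call and the log lines are already machine-readable.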
5. Why the vision path failed first
The benchmark's first vision attempt missed pending reviews below the visible fold.
That is not primarily a model-intelligence problem.
It is an interface problem.
The UI showed a partial rendered state.
The API returned structured pagination and result data.
The agent using screenshots had to infer:
is this all the data?
is there pagination?
should I scroll?
did the filter apply?
what changed after the click?
The API agent read the pagination and result data directly from the response.
The same application logic existed underneath both paths.
Only one path exposed it directly.
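The below-the-fold failure can be illustrated with a paginated response. The sketch is hypothetical: `list_reviews`, its `total` and `has_more` fields, and the page size of 10 are assumptions for illustration, not the actual Reflex endpoint shape.

```python
from typing import Any

# Hypothetical data set: 25 pending reviews, more than fit on one page.
REVIEWS = [{"id": i, "status": "pending"} for i in range(1, 26)]


def list_reviews(status: str, page: int = 1, page_size: int = 10) -> dict[str, Any]:
    """Return one page of reviews plus explicit pagination metadata."""
    matching = [r for r in REVIEWS if r["status"] == status]
    start = (page - 1) * page_size
    return {
        "items": matching[start:start + page_size],
        "page": page,
        "total": len(matching),
        "has_more": start + page_size < len(matching),
    }


def fetch_all(status: str) -> list[dict[str, Any]]:
    """Follow has_more until every matching review has been read."""
    page, collected = 1, []
    while True:
        response = list_reviews(status, page=page)
        collected.extend(response["items"])
        if not response["has_more"]:
            return collected
        page += 1


first_page = list_reviews("pending")   # roughly what one visible screen shows
all_pending = fetch_all("pending")     # what the task actually requires
```

The first page carries `has_more` and `total`, so the caller knows the view is partial. A screenshot of the same ten rows carries no such signal unless the agent notices a scrollbar or a pager.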
6. When vision still makes sense
Do not remove computer use entirely.
Vision is useful when:
- you do not control the target system
- there is no API
- the workflow is a third-party SaaS UI
- you are doing UX/QA validation
- you are reverse-engineering a legacy workflow
- the task is inherently visual
- you need human-parity behavior
Good use:
vision agent explores unknown workflow
-> extract actions and state
-> design structured tools
-> structured agent executes at scale
Bad use:
vision agent repeatedly operates an internal app you control
even though the app can expose handlers or endpoints
Vision is an exploration and fallback layer.
It should not be the default execution layer for owned systems.
7. OpenClaw architecture implication
OpenClaw's Gateway/tool direction is aligned with this benchmark.
The preferred path:
Agent
-> typed tool schema
-> Gateway RPC / tool invoke
-> policy and approvals
-> execution layer
-> structured result
-> session/artifact evidence
Vision/computer-use should plug in as a fallback adapter, not as the primary route.
Example interface priority:
| Task | Preferred interface |
|---|---|
| run local command | exec / node system.run with policy |
| query app state | Gateway RPC / structured API |
| update internal record | typed tool call |
| inspect rendered UI | DOM/accessibility snapshot |
| operate unknown SaaS UI | vision fallback |
The stable principle: agents should use structured interfaces wherever they exist.
They should behave like humans only when forced to.
8. Tool schema layer
A structured-first agent platform needs a tool schema layer.
Example:
{
  "name": "update_order_status",
  "description": "Update one order's delivery status.",
  "input_schema": {
    "type": "object",
    "required": ["order_id", "status"],
    "properties": {
      "order_id": { "type": "integer" },
      "status": {
        "type": "string",
        "enum": ["pending", "delivered", "cancelled"]
      }
    }
  }
}
Execution path:
tool schema
-> validation
-> auth scope check
-> approval policy if needed
-> handler call
-> structured response
-> audit event
This is the opposite of screenshot automation.
The model expresses intent through a narrow typed action.
The runtime decides whether the action is allowed.
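The execution path above can be sketched end to end in a few dozen lines. The schema is the one from this section, but everything else is an assumption made for illustration: the hand-rolled `validate` function (a real system would more likely use a JSON Schema library), the `orders:write` scope name, the rule that cancellations need approval, and the in-memory `ORDERS` store.

```python
from typing import Any

# Schema copied from the example in this section.
SCHEMA: dict[str, Any] = {
    "name": "update_order_status",
    "input_schema": {
        "type": "object",
        "required": ["order_id", "status"],
        "properties": {
            "order_id": {"type": "integer"},
            "status": {"type": "string", "enum": ["pending", "delivered", "cancelled"]},
        },
    },
}

PY_TYPES = {"integer": int, "string": str}
ORDERS = {123: "pending"}  # hypothetical in-memory store
AUDIT_LOG: list[dict[str, Any]] = []


def validate(args: dict[str, Any]) -> None:
    """Minimal check of required fields, types, and enum values."""
    spec = SCHEMA["input_schema"]
    missing = [key for key in spec["required"] if key not in args]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    for key, value in args.items():
        prop = spec["properties"].get(key)
        if prop is None:
            raise ValueError(f"unexpected field: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PY_TYPES[prop["type"]]):
            raise ValueError(f"{key} must be of type {prop['type']}")
        if "enum" in prop and value not in prop["enum"]:
            raise ValueError(f"{key} must be one of {prop['enum']}")


def invoke(args: dict[str, Any], scopes: set[str], approved: bool = False) -> dict[str, Any]:
    """validation -> scope check -> approval policy -> handler -> audit event."""
    validate(args)
    if "orders:write" not in scopes:  # hypothetical scope name
        raise PermissionError("caller lacks the orders:write scope")
    if args["status"] == "cancelled" and not approved:  # hypothetical policy
        raise PermissionError("cancelling an order requires approval")
    old_status = ORDERS[args["order_id"]]
    ORDERS[args["order_id"]] = args["status"]
    result = {"order_id": args["order_id"], "old_status": old_status, "new_status": args["status"]}
    AUDIT_LOG.append({"tool": SCHEMA["name"], "args": dict(args), "result": result})
    return result


result = invoke({"order_id": 123, "status": "delivered"}, scopes={"orders:write"})
```

Every rejection happens before the handler runs, so a malformed or unauthorized request never touches state. That ordering is what the runtime gains from receiving intent as a typed action rather than as a click.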
9. Auto-generating tools
The Reflex article matters partly because Reflex 0.9 can expose event handlers as HTTP endpoints, reducing the engineering cost of the API surface.
General pattern:
OpenAPI -> tool schemas
GraphQL -> tool schemas
internal service definitions -> tool schemas
CLI specs -> tool wrappers
typed app handlers -> agent-callable endpoints
The decision flips when API generation is cheap.
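Old assumption: building an API surface for agents is costly, so driving the existing UI is the pragmatic default.
New possibility: the API surface is generated from handlers the app already has, so structured tools become the default.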
For OpenClaw-like systems, this suggests:
- prefer Gateway RPC method surfaces
- expose typed node commands
- generate tool wrappers from API specs
- keep vision as a fallback adapter
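To show why generation can be cheap, the sketch below derives a tool schema from a typed Python handler using the standard `inspect` and `typing` modules. It is not how Reflex implements its endpoint exposure; `tool_schema_from_handler`, `JSON_TYPES`, and the `approve_review` handler are hypothetical, and only four primitive types are mapped.

```python
import inspect
from typing import Any, Callable, get_type_hints

# Minimal mapping from Python annotations to JSON Schema type names.
JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def tool_schema_from_handler(handler: Callable[..., Any]) -> dict[str, Any]:
    """Build a tool schema from a handler's signature and docstring."""
    hints = get_type_hints(handler)
    properties: dict[str, dict[str, str]] = {}
    required: list[str] = []
    for name, param in inspect.signature(handler).parameters.items():
        # Raises KeyError for annotations outside the small mapping above.
        properties[name] = {"type": JSON_TYPES[hints[name]]}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": handler.__name__,
        "description": inspect.getdoc(handler) or "",
        "input_schema": {"type": "object", "required": required, "properties": properties},
    }


def approve_review(review_id: int, note: str = "") -> dict:
    """Approve one pending review."""
    return {"review_id": review_id, "status": "approved", "note": note}


schema = tool_schema_from_handler(approve_review)
```

The handler already existed for the UI; the schema is a by-product of its type annotations. That is the sense in which the cost of the API surface can approach zero.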
10. Security boundary comparison
Structured tools:
typed schema
validation
auth scope check
approval policy if needed
structured response
audit event
Vision agent:
screenshot
natural-language reasoning
click/type action
implicit UI permissions
harder-to-parse evidence
Structured tools are easier to secure because the action is explicit before execution.
You can ask:
Who called this?
Which scope allowed it?
Was approval required?
Which object changed?
What was the before/after state?
With screenshots, the action is often a low-level click.
The semantic meaning must be reconstructed.
That is weaker for audit and incident response.
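The five audit questions map naturally onto fields of a record. The `AuditEvent` dataclass below is a hypothetical shape, not an OpenClaw type; the field names and the sample values are assumptions chosen to mirror the questions.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class AuditEvent:
    caller: str                  # Who called this?
    scope: str                   # Which scope allowed it?
    approval_required: bool      # Was approval required?
    approved_by: Optional[str]   # Who approved it, if anyone?
    object_ref: str              # Which object changed?
    before: dict[str, Any]       # State before the action
    after: dict[str, Any]        # State after the action


def changed_fields(event: AuditEvent) -> dict[str, tuple[Any, Any]]:
    """Return {field: (old, new)} for every field that differs."""
    return {
        key: (event.before.get(key), value)
        for key, value in event.after.items()
        if event.before.get(key) != value
    }


event = AuditEvent(
    caller="agent:session-1",    # illustrative identifiers only
    scope="orders:write",
    approval_required=False,
    approved_by=None,
    object_ref="order/123",
    before={"status": "pending"},
    after={"status": "delivered"},
)
diff = changed_fields(event)
record = asdict(event)  # plain dict, ready to serialize
```

With a click-level log, each of these fields would have to be inferred after the fact from coordinates and screenshots.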
11. Verification and Agent Skills
Lecture 29 argued:
Structured tools make evidence easier.
Example:
{
  "tool": "update_order_status",
  "result": {
    "order_id": 123,
    "old_status": "pending",
    "new_status": "delivered",
    "updated_at": "2026-05-06T12:00:00Z"
  }
}
That result can be:
- logged
- asserted
- replayed
- attached to a session
- shown in an App SDK UI
- checked by a final-answer hook
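As an illustration of the last item, a final-answer hook could check the structured result before the agent is allowed to claim success. The function below is a sketch; `check_delivery_evidence` and its rules are hypothetical, and the record is the example result from this section.

```python
from typing import Any


def check_delivery_evidence(record: dict[str, Any], expected_order_id: int) -> list[str]:
    """Return a list of problems; an empty list means the evidence holds."""
    problems = []
    result = record.get("result", {})
    if record.get("tool") != "update_order_status":
        problems.append("evidence comes from an unexpected tool")
    if result.get("order_id") != expected_order_id:
        problems.append("evidence refers to a different order")
    if result.get("new_status") != "delivered":
        problems.append("order was not marked delivered")
    if result.get("old_status") == result.get("new_status"):
        problems.append("no state change was recorded")
    return problems


record = {
    "tool": "update_order_status",
    "result": {
        "order_id": 123,
        "old_status": "pending",
        "new_status": "delivered",
        "updated_at": "2026-05-06T12:00:00Z",
    },
}
problems = check_delivery_evidence(record, expected_order_id=123)
```

The same check against a screenshot would need OCR or another model call, and its answer would be probabilistic rather than exact.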
Vision evidence is heavier:
- screenshots
- cursor actions
- OCR
- natural-language descriptions
- brittle UI state
Use vision when necessary.
Do not choose it when structured evidence is available.
12. Design rule for OpenClaw tools
Adopt this rule: any workflow that recurs should graduate from vision exploration into a typed, policy-gated tool.
Lifecycle:
one-off human action
-> vision exploration
-> manual CLI/API discovery
-> typed tool schema
-> policy/approval
-> test fixture
-> production tool
Example:
Use computer_use to learn how an admin workflow behaves.
Then build update_customer, list_reviews, approve_review, update_order tools.
Then stop using computer_use for that workflow.
The result is faster, cheaper, safer, and more reviewable.
13. Mini-lab: convert a UI workflow into tools
Pick one internal workflow:
- approve a user
- create a device pairing
- update an order
- run a deployment
- capture a node screenshot
- restart a service
Step 1: write the UI path:
Step 2: identify the underlying state transition:
Step 3: design the tool:
Step 4: define when vision is still allowed:
14. Benchmark exercise
Recreate the Reflex comparison on a small app you control.
Measure:
| Metric | Vision/DOM path | Structured tool path |
|---|---|---|
| steps | | |
| wall-clock time | | |
| input tokens | | |
| output tokens | | |
| failure modes | | |
| required prompt instructions | | |
| audit quality | | |
Then answer:
What did the vision agent need to rediscover?
What did the structured tool expose directly?
What evidence was easy to capture?
Which path would you trust in production?
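A small harness makes it easier to fill in the measurement table consistently for both paths. This is a sketch: `RunMetrics` and `measure` are hypothetical names, and the token counts in the demo run are placeholders, not measurements. In a real run they would come from the usage figures your model provider returns with each response.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunMetrics:
    path: str
    steps: int = 0
    input_tokens: int = 0
    output_tokens: int = 0
    seconds: float = 0.0
    failures: list[str] = field(default_factory=list)

    def record_step(self, input_tokens: int, output_tokens: int) -> None:
        """Call once per model round trip, for either path."""
        self.steps += 1
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens


def measure(path: str, run: Callable[[RunMetrics], None]) -> RunMetrics:
    """Time one run and capture failures instead of aborting the benchmark."""
    metrics = RunMetrics(path)
    start = time.perf_counter()
    try:
        run(metrics)
    except Exception as exc:  # broad on purpose: every failure mode is data here
        metrics.failures.append(f"{type(exc).__name__}: {exc}")
    metrics.seconds = time.perf_counter() - start
    return metrics


def demo_structured_run(metrics: RunMetrics) -> None:
    # Placeholder numbers only; replace with real usage from each call.
    metrics.record_step(input_tokens=100, output_tokens=10)
    metrics.record_step(input_tokens=120, output_tokens=12)


metrics = measure("structured", demo_structured_run)
```

Run each path several times and report mean and spread, as the Reflex table does, since a single vision run can differ widely from the next.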
Key takeaways
- Vision-based computer use is a fallback, not the primary interface for owned systems.
- The Reflex benchmark measured a large gap: roughly 551k input tokens and 53 steps for vision versus about 12k input tokens and 8 calls for API.
- The vision path required a 14-step walkthrough to succeed; writing that walkthrough is engineering cost the benchmark metrics do not count.
- The right hierarchy is structured API, CLI/direct execution, DOM/accessibility, then vision.
- Structured tools are cheaper, faster, more deterministic, easier to secure, and easier to audit.
- OpenClaw's Gateway/tool direction is aligned with this principle.
- Recurring workflows should graduate from vision exploration into typed, policy-gated tools.
References
- Reflex, "Computer use is 45x More Expensive Than Structured APIs": https://reflex.dev/blog/computer-use-is-45x-more-expensive-than-structured-apis/
- Reflex benchmark repo: https://github.com/reflex-dev/agent-benchmark
- Lecture 23 - Gateway RPC Protocol: Lecture-23.md
- Lecture 29 - Agent Skills: Lecture-29.md
- Lecture 31 - Runtime Strategy: Lecture-31.md
Next: Lecture 34 - Nemotron 3 Nano Omni: Multimodal Perception Sub-Agents for Agent Systems