
Lecture 39 - Agent Skills Eval: Benchmarking SKILL.md Files

Course: Agentic AI & GenAI


Skills are code-adjacent infrastructure.

If a skill changes how an agent behaves, it needs tests.

Agent Skills gives agents portable, version-controlled workflows through folders such as:

my-skill/
  SKILL.md
  references/
  scripts/
  assets/
  evals/

But a SKILL.md file can look good and still fail in practice.

The right question is not:

Does this skill read well?

The right question is:

Does this skill measurably improve the agent on the task it claims to support?

agent-skills-eval is a test runner for that question.

It runs the same eval prompt twice:

with_skill
without_skill

Then a judge model grades both outputs against expected behavior and assertions.

That gives you an evidence-backed pass/fail result rather than a subjective prompt review.


Learning objectives

By the end of this lecture, you should be able to:

  1. Explain why skills need regression tests.
  2. Understand the with_skill versus without_skill baseline pattern.
  3. Write basic evals/evals.json cases for a skill.
  4. Interpret judge-graded pass/fail results.
  5. Use deterministic assertions for tool-call skills.
  6. Understand the artifact layout produced by agent-skills-eval.
  7. Design CI gates for OpenClaw-style skill repositories.
  8. Identify common failure modes in LLM-judged skill evaluation.

1. Why skill evaluation matters

Agent skills package procedural knowledge.

Examples:

  • triage GitHub issues
  • summarize weather data
  • operate a CRM
  • write release notes
  • inspect GPU traces
  • follow a security review checklist
  • translate kernels between DSLs

This is powerful because skills are reusable.

It is dangerous because skills can silently degrade.

A bad skill can:

  • add irrelevant context
  • over-constrain the model
  • cause wrong tool calls
  • increase token cost without quality lift
  • make the agent slower
  • hide stale instructions
  • make a task worse than baseline

Without evaluation, every skill PR becomes a taste debate.

With evaluation, the conversation becomes:

This skill improved 7/9 evals.
It regressed the pagination case.
The judge evidence points to missing tool-call criteria.
The report includes both outputs and timing.

That is a better review standard.


2. The baseline pattern

The core design is simple:

same prompt
  -> target model without skill
  -> target model with skill
  -> judge compares both against assertions

This matters because absolute output quality is not enough.

You want to know skill lift:

output with skill passes
output without skill fails
  -> skill likely helps

both pass
  -> skill may be unnecessary for this eval

both fail
  -> the skill is insufficient, or the eval is too hard or mis-specified

with skill fails, baseline passes
  -> skill regressed behavior

The baseline mode prevents a common mistake:

The skill produced a good answer, therefore the skill is useful.

Maybe the model already produced the same answer without the skill.

The eval needs to measure the delta.
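
The four outcomes above reduce to a small decision table. Here is a minimal Python sketch of it; the names Lift and classify_lift are illustrative and are not part of agent-skills-eval.

```python
from enum import Enum


class Lift(Enum):
    HELPS = "skill likely helps"
    NO_LIFT = "skill may be unnecessary for this eval"
    INSUFFICIENT = "skill or eval is insufficient"
    REGRESSED = "skill regressed behavior"


def classify_lift(with_skill_passed: bool, without_skill_passed: bool) -> Lift:
    """Map one eval's pair of verdicts onto the four baseline outcomes."""
    if with_skill_passed and not without_skill_passed:
        return Lift.HELPS
    if with_skill_passed and without_skill_passed:
        return Lift.NO_LIFT
    if not with_skill_passed and not without_skill_passed:
        return Lift.INSUFFICIENT
    # Only remaining case: baseline passed, skill run failed.
    return Lift.REGRESSED
```

Keeping the labels in one place makes it harder for a report and a CI gate to disagree about what counts as a regression.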


3. Quickstart

Run directly with npx:

npx agent-skills-eval ./skills \
  --target gpt-4o-mini \
  --judge gpt-4o-mini \
  --baseline \
  --strict

Install if you want it in a project:

npm install agent-skills-eval

The key flags:

--target
  model being evaluated

--judge
  model grading outputs

--baseline
  run without_skill as comparison

--strict
  enforce skill/spec validation

For OpenClaw contributors, this is the useful mental model:

skills are tested like code
eval artifacts are reviewed like logs
reports are attached to PRs

4. Skill layout

A minimal evaluated skill looks like:

skills/
  weather-summary/
    SKILL.md
    evals/
      evals.json

Example SKILL.md:

---
name: weather-summary
description: Summarize weather forecasts and call out operational risks.
license: MIT
compatibility: Works with text-capable chat models.
---

When summarizing weather, identify the location, time range, precipitation risk,
temperature extremes, wind risk, and one practical recommendation.

Example evals/evals.json:

{
  "skill_name": "weather-summary",
  "evals": [
    {
      "id": "storm-risk",
      "name": "storm risk summary",
      "prompt": "Summarize this forecast for an outdoor robotics test: thunderstorms after 2pm, wind gusts to 35 mph, high 91F.",
      "expected_output": "The response should identify thunderstorm timing, wind risk, heat risk, and recommend moving the test earlier or indoors.",
      "assertions": [
        "The output mentions thunderstorm risk after 2pm.",
        "The output mentions wind gusts or wind risk.",
        "The output mentions heat risk from the 91F high.",
        "The output gives a practical scheduling or safety recommendation."
      ]
    }
  ]
}

The assertions are the contract.

If they are vague, the eval will be vague.
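
One way to catch vague or incomplete cases before spending model calls is a small pre-flight lint. The sketch below assumes only the fields visible in the example above (skill_name, evals, id, prompt, expected_output, assertions). The function lint_evals and the vague-word list are hypothetical, and the sketch says nothing about what the tool's own --strict validation checks.

```python
REQUIRED_EVAL_FIELDS = ("id", "prompt", "expected_output", "assertions")
VAGUE_WORDS = {"good", "well", "nice"}  # illustrative heuristic, tune per repo


def lint_evals(doc: dict) -> list[str]:
    """Return human-readable problems found in a parsed evals.json document."""
    problems = []
    if not doc.get("skill_name"):
        problems.append("missing skill_name")
    if not doc.get("evals"):
        problems.append("no evals defined")
    seen_ids = set()
    for index, case in enumerate(doc.get("evals", [])):
        label = case.get("id") or f"evals[{index}]"
        for field in REQUIRED_EVAL_FIELDS:
            if not case.get(field):
                problems.append(f"{label}: missing {field}")
        if case.get("id") in seen_ids:
            problems.append(f"{label}: duplicate id")
        if case.get("id"):
            seen_ids.add(case["id"])
        for assertion in case.get("assertions", []):
            words = assertion.lower().rstrip(".").split()
            # Very short assertions or ones built on "good"/"well" rarely
            # give a judge anything checkable.
            if len(words) < 5 or VAGUE_WORDS.intersection(words):
                problems.append(f"{label}: vague assertion: {assertion!r}")
    return problems
```

To use it, load the file with json.load and treat a non-empty list as a reason to revise the eval before running it.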


5. Artifact layout

A run creates a workspace similar to:

agent-skills-workspace/
  iteration-1/
    meta.json
    benchmark.json
    eval-basic/
      with_skill/
      without_skill/
    report/
      index.html

Important artifacts:

  • benchmark.json: rolled-up pass/fail results
  • with_skill/: output, timing, grading
  • without_skill/: baseline output, timing, grading
  • report/index.html: static report for review
  • JSON/JSONL events: useful for dashboards or CI history

This artifact-first design is important.

You can diff runs over time.

You can attach reports to pull requests.

You can detect regressions after changing SKILL.md.
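
Diffing runs over time can be sketched as a comparison of two per-eval verdict maps. The lecture does not show the schema of benchmark.json, so the input shape here (eval id mapped to a with_skill pass flag) is an assumption. You would need to extract that mapping from whatever the real file contains.

```python
def diff_runs(previous: dict[str, bool], current: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two runs, each given as eval id -> with_skill pass flag.

    The input shape is hypothetical; adapt the extraction step to the
    actual benchmark.json layout.
    """
    shared = previous.keys() & current.keys()
    return {
        "regressed": sorted(e for e in shared if previous[e] and not current[e]),
        "fixed": sorted(e for e in shared if not previous[e] and current[e]),
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
    }
```

Reporting added and removed evals separately matters: deleting a failing eval should not look like an improvement.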


6. Judge model grading

The judge sees:

  • eval prompt
  • expected output
  • assertions
  • target model output

Then it grades pass/fail with evidence.

This is useful, but it is not perfect.

LLM judges can:

  • be inconsistent
  • over-reward fluent answers
  • miss subtle tool-call failures
  • leak bias from expected output phrasing
  • pass outputs that satisfy wording but not intent
  • disagree across model versions

Mitigations:

  • keep judge temperature at 0
  • use explicit assertions
  • add deterministic checks where possible
  • review failures and unexpected passes manually
  • pin judge and target model versions when possible
  • keep raw artifacts
  • rerun flaky evals before blocking a release

The judge is a tool, not an authority.
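
The "rerun flaky evals" mitigation can be made concrete by collecting the verdicts from several runs of the same eval and summarizing them. This is a sketch under the assumption that you gather the repeated verdicts yourself; summarize_verdicts is a hypothetical helper, not a feature of the tool.

```python
from collections import Counter


def summarize_verdicts(verdicts: list[bool]) -> dict:
    """Summarize repeated judge verdicts for one eval."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    passes, fails = counts[True], counts[False]
    return {
        "passes": passes,
        "fails": fails,
        # Any disagreement across reruns marks the eval as flaky.
        "flaky": passes > 0 and fails > 0,
        "majority_pass": passes > fails,
    }
```

A flaky result is a signal to sharpen the assertions or read the artifacts by hand; a majority vote on its own can hide the disagreement.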


7. Tool-call assertions

Many useful skills are not pure text-generation skills.

OpenClaw-style skills often affect tool behavior:

  • call gh commands
  • invoke gateway RPCs
  • query weather APIs
  • inspect logs
  • read files
  • create issues
  • update docs

For those, text-only grading is weak.

You want deterministic tool-call assertions:

Did the agent call the expected tool?
Did it use the expected method?
Did it pass safe arguments?
Did it avoid a destructive command?
Did it include the required idempotency key?

Use LLM judging for semantic quality.

Use deterministic assertions for protocol behavior.

That split is critical:

semantic correctness:
  judge model

tool contract correctness:
  deterministic assertions
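
A deterministic tool-call check can be as simple as a function over the recorded calls. The record format below (a list of {"tool": ..., "args": {...}} dictionaries), the tool names, and the destructive-marker list are all assumptions for illustration. The lecture does not specify how agent-skills-eval records or asserts on tool calls.

```python
# Hypothetical record format: each call is {"tool": str, "args": dict}.
DESTRUCTIVE_MARKERS = ("--force", "rm -rf", "delete")  # illustrative, not exhaustive


def check_tool_calls(calls, expected_tool, required_arg=None):
    """Return a list of failure messages; an empty list means the contract held."""
    failures = []
    matching = [call for call in calls if call["tool"] == expected_tool]
    if not matching:
        failures.append(f"no call to expected tool: {expected_tool}")
    for call in matching:
        if required_arg and required_arg not in call.get("args", {}):
            failures.append(f"{expected_tool} call is missing argument: {required_arg}")
    for call in calls:
        # Scan every call, not only the expected one, for unsafe arguments.
        flat_args = " ".join(str(value) for value in call.get("args", {}).values())
        for marker in DESTRUCTIVE_MARKERS:
            if marker in flat_args:
                failures.append(f"destructive marker {marker!r} in call to {call['tool']}")
    return failures
```

Because the check is plain code, it gives the same answer on every run, which is the property the judge model cannot guarantee.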

8. Config file for CI

For repeated runs, use a config file:

# agent-skills-eval.yaml
root: ./skills
workspace: ./agent-skills-workspace
baseline: true
target: gpt-4o-mini
judge: gpt-4o-mini
baseUrl: https://api.openai.com/v1
apiKeyEnv: OPENAI_API_KEY
include:
  - "skills/**"
exclude:
  - "**/draft-*"
concurrency: 4
layout: iteration
strict: true
report:
  enabled: true
  title: Agent Skills Report
targetParams:
  temperature: 0
judgeParams:
  temperature: 0

Run:

OPENAI_API_KEY=... npx agent-skills-eval --config agent-skills-eval.yaml

In CI, do not rely only on console output.

Persist:

  • benchmark.json
  • judge grading files
  • generated report
  • JSONL event logs

Those artifacts are the evidence.


9. OpenClaw skill testing workflow

OpenClaw-style systems can use skills for:

  • GitHub issue triage
  • release note generation
  • weather and schedule planning
  • Gateway runbooks
  • node troubleshooting
  • app SDK testing
  • security review
  • GPU performance analysis

A practical workflow:

1. Contributor edits SKILL.md.
2. Contributor adds or updates evals/evals.json.
3. CI runs agent-skills-eval with baseline.
4. Report is uploaded as an artifact.
5. PR review checks pass rate, regressions, and judge evidence.
6. Maintainer decides whether the behavior change is acceptable.

Suggested PR rule:

No skill behavior change without at least one eval proving the intended behavior.
No regression accepted without an explicit note explaining why.

This mirrors how code should be tested.
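
The two PR rules can be turned into a CI gate. The sketch below assumes you have already extracted per-eval verdicts as (with_skill_passed, without_skill_passed) pairs and that waived regressions are listed explicitly, for example from a note in the PR. The gate function and the waiver mechanism are hypothetical.

```python
def gate(results: dict[str, tuple[bool, bool]], waived: set[str]) -> tuple[int, list[str]]:
    """Return (exit_code, messages) for a skill PR.

    results maps eval id -> (with_skill_passed, without_skill_passed).
    waived holds eval ids whose regression has an explicit explanatory note.
    """
    messages = []
    for eval_id, (with_skill, without_skill) in sorted(results.items()):
        if without_skill and not with_skill and eval_id not in waived:
            messages.append(f"regression without explanatory note: {eval_id}")
    # Rule 1: at least one eval must demonstrate the intended behavior.
    if not any(with_skill for with_skill, _ in results.values()):
        messages.append("no eval passes with the skill enabled")
    return (1 if messages else 0), messages
```

A CI step could print the messages and pass the exit code to sys.exit, so the job fails on an unexplained regression.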


10. Designing good skill evals

A good eval is narrow.

It tests one behavior.

Bad eval:

Prompt: "Use the GitHub skill to manage issues well."
Assertion: "The answer is good."

Good eval:

Prompt: "Given these three issue titles and labels, identify which one is a bug, which one is a feature request, and which one needs more information."
Assertions:
  - The output classifies all three issues.
  - The output asks for reproduction steps or other missing details for the issue that needs more information.
  - The output does not propose closing any issue without evidence.

Good skill evals should cover:

  • happy path
  • ambiguous input
  • missing data
  • adversarial instruction
  • unsafe action request
  • tool-call behavior
  • regression case from a real bug

Do not make the eval suite huge at first.

Start with the three cases most likely to break user trust.


11. Common failure modes

The skill adds no lift

Both with_skill and without_skill pass.

Interpretation:

The model may already know this task,
or the eval is too easy.

Fix:

  • make the eval more specific
  • test domain-specific constraints
  • test tool-call behavior
  • test edge cases

The skill makes output worse

Baseline passes, skill fails.

Interpretation:

The skill is too broad, stale, misleading, or over-prescriptive.

Fix:

  • shorten the skill
  • remove stale rules
  • improve trigger description
  • add counterexamples
  • split into smaller skills

The judge is unreliable

Repeated runs disagree.

Fix:

  • lower the judge temperature (ideally to 0)
  • sharpen assertions
  • add deterministic checks
  • use a stronger judge
  • manually review artifacts

The eval tests formatting instead of behavior

Fix:

  • assert outcome, not prose style
  • use schema checks for structured outputs
  • separate style tests from correctness tests
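
A schema check for structured output does not need a judge at all. This sketch uses only the Python standard library and assumes the skill is expected to emit a JSON object. The function name and the example keys are illustrative.

```python
import json


def check_structured_output(raw: str, required: dict[str, type]) -> list[str]:
    """Check that raw is a JSON object with the required keys and value types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    failures = []
    for key, expected_type in required.items():
        if key not in data:
            failures.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            failures.append(f"{key} should be {expected_type.__name__}")
    return failures
```

For example, requiring {"classification": str, "needs_info": bool} tests the outcome of a triage skill without asserting anything about its prose style.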

12. How this connects to earlier lectures

Lecture 29 introduced skills as a workflow discipline.

This lecture adds the missing test loop:

skill design
  -> eval prompt
  -> with/without comparison
  -> judge + deterministic assertions
  -> artifact review
  -> skill revision

Lecture 30 argued that tests are the durable asset in agentic software development.

Skill evals are the tests for agent behavior.

Lecture 35 showed skills for GPU kernel translation.

agent-skills-eval is how you test whether those translation rules actually improve outputs.

Lecture 37 used traces as evidence for performance claims.

Skill evals are evidence for prompt/workflow claims.

Same principle:

No evidence, no claim.

Mini-lab: Add evals to one OpenClaw-style skill

Pick one skill:

  • GitHub issue triage
  • weather planning
  • Gateway troubleshooting
  • node pairing runbook
  • app SDK testing
  • GPU trace analysis

Create:

SKILL.md
evals/evals.json
agent-skills-eval.yaml

Run:

npx agent-skills-eval ./skills \
  --target gpt-4o-mini \
  --judge gpt-4o-mini \
  --baseline \
  --strict

Then write a short report:

Skill:
Eval count:
Pass rate with skill:
Pass rate without skill:
Cases improved:
Cases regressed:
Judge concerns:
Deterministic assertions needed:
Decision:
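
The countable fields of this report can be derived from per-eval verdicts. The sketch assumes the same hypothetical input as a hand-built table: eval id mapped to a (with_skill_passed, without_skill_passed) pair. The judgment fields (judge concerns, deterministic assertions needed, decision) still need to be written by a person.

```python
def lab_report(skill: str, results: dict[str, tuple[bool, bool]]) -> str:
    """Render the countable report fields.

    results maps eval id -> (with_skill_passed, without_skill_passed).
    """
    total = len(results)
    with_pass = sum(1 for with_skill, _ in results.values() if with_skill)
    without_pass = sum(1 for _, without_skill in results.values() if without_skill)
    improved = sorted(e for e, (w, b) in results.items() if w and not b)
    regressed = sorted(e for e, (w, b) in results.items() if b and not w)
    return "\n".join([
        f"Skill: {skill}",
        f"Eval count: {total}",
        f"Pass rate with skill: {with_pass}/{total}",
        f"Pass rate without skill: {without_pass}/{total}",
        f"Cases improved: {', '.join(improved) or 'none'}",
        f"Cases regressed: {', '.join(regressed) or 'none'}",
    ])
```

Equal pass rates can still hide one improved case and one regressed case, which is why the two lists are reported alongside the totals.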

If the skill does not beat baseline, do not ship it as-is.

Either improve the skill or admit the skill is unnecessary.


Key takeaways

  • Skills need tests because they change agent behavior.
  • with_skill versus without_skill is the core pattern for measuring skill lift.
  • LLM judges are useful when paired with explicit assertions and stored artifacts.
  • Deterministic assertions are required for tool-call and protocol behavior.
  • Artifact output makes skill reviews repeatable and CI-friendly.
  • OpenClaw contributors can use this pattern to validate skill changes before merge.
  • Good evals test concrete behavior, edge cases, safety constraints, and regressions.
  • A skill that does not beat baseline is not automatically worth carrying.
