eval-harness

affaan-m/everything-claude-code

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

Installs: 976
Install command
npx skills add https://github.com/affaan-m/everything-claude-code --skill eval-harness
Security audits
Gen Agent Trust Hub: PASS
Socket: PASS
Snyk: PASS
About this skill
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

Eval-Driven Development treats evals as the "unit tests of AI development":

Test if Claude can do something it couldn't before:
Ensure changes don't break existing functionality:
Deterministic checks using code:
Use Claude to evaluate open-ended outputs:
Flag for manual review:

"At least one success in k attempts"
"All k trials succeed"

Write code to pass the defined evals.

Creates eval definition file at .claude/evals/feature-name.md
Runs current evals and reports status
Generates full eval report

Store evals in project:

Use product evals when behavior quality cannot be captured by unit tests alone.

Recommended thresholds:

- Setting up eval-driven development (EDD) for AI-assisted workflows
- Defining pass/fail criteria for Claude Code task completion
- Measuring agent reliability with pass@k metrics
- Creating regression test suites for prompt or agent changes
- Benchmarking agent performance across model versions
- Define expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics for reliability measurement
- pass@1: First attempt success rate
- pass@3: Success within 3 attempts
- Typical target: pass@3 > 90%
- Higher bar for reliability
- pass^3: 3 consecutive successes
- Use for...
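
The two reliability metrics quoted above ("at least one success in k attempts" and "all k trials succeed") can be illustrated with a small sketch. This is not the skill's own implementation: the function names `pass_at_k`, `pass_hat_k`, and `rate`, the task names, and the trial data are all hypothetical, and the sketch assumes each task's trials are recorded as an ordered list of booleans.

```python
from typing import Callable, Mapping, Sequence


def pass_at_k(trials: Sequence[bool], k: int) -> bool:
    """pass@k: at least one success within the first k attempts."""
    if not 1 <= k <= len(trials):
        raise ValueError("k must be between 1 and the number of trials")
    return any(trials[:k])


def pass_hat_k(trials: Sequence[bool], k: int) -> bool:
    """pass^k: all of the first k trials succeed."""
    if not 1 <= k <= len(trials):
        raise ValueError("k must be between 1 and the number of trials")
    return all(trials[:k])


def rate(
    results: Mapping[str, Sequence[bool]],
    metric: Callable[[Sequence[bool], int], bool],
    k: int,
) -> float:
    """Fraction of tasks for which the given metric holds."""
    outcomes = [metric(trials, k) for trials in results.values()]
    return sum(outcomes) / len(outcomes)


# Hypothetical trial outcomes: three attempts per task, in order.
results = {
    "login-flow": [False, True, True],
    "export-csv": [True, True, True],
    "retry-logic": [False, False, False],
    "parse-config": [True, False, True],
}

print(f"pass@1: {rate(results, pass_at_k, 1):.0%}")   # 50%
print(f"pass@3: {rate(results, pass_at_k, 3):.0%}")   # 75%
print(f"pass^3: {rate(results, pass_hat_k, 3):.0%}")  # 25%
```

With this made-up data, pass@3 comes out at 75%, which would fall short of the "pass@3 > 90%" typical target mentioned in the listing, while pass^3 shows how much stricter the all-trials-succeed bar is.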

Source description provided by the upstream skill listing. Community reviews and install context appear in the sections below.

Community Reviews

Latest reviews

No community reviews yet. Be the first to review.

FAQ
What does eval-harness do?

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

Is eval-harness good?

eval-harness does not have approved reviews yet, so SkillJury cannot publish a community verdict.

Which agents does eval-harness work with?

eval-harness currently lists compatibility with codex, gemini-cli, opencode, cursor, github-copilot, claude-code.

What are alternatives to eval-harness?

Skills in the same category include telegram-bot-builder, flutter-app-size, sharp-edges, iterative-retrieval.

How do I install eval-harness?

npx skills add https://github.com/affaan-m/everything-claude-code --skill eval-harness
