
llm-evaluation

Systematic evaluation of LLM applications using automated metrics, human feedback, and statistical testing.

SkillJury keeps community verdicts, source metadata, and external repository signals in separate lanes so ranking data never pretends to be a review.

SkillJury verdict: Pending (no approved reviews yet)

Would recommend: Pending (waiting on enough review volume)

Install signal: 6 (weekly or total install activity from catalog data)

0 review requests
Install command
npx skills add https://github.com/wshobson/agents --skill llm-evaluation
SkillJury does not have enough approved reviews to publish a community verdict yet. Source metadata and repository proof are still available above.
SkillJury Signal Summary

As of Apr 30, 2026, llm-evaluation has 6 weekly installs and 0 community reviews on SkillJury. Community votes currently stand at 0 upvotes and 0 downvotes. Source: wshobson/agents. Canonical URL: https://skills.sh/wshobson/agents/llm-evaluation.

Security audits
Gen Agent Trust Hub: PASS
Socket: PASS
Snyk: PASS
About this skill
Systematic evaluation of LLM applications using automated metrics, human feedback, and statistical testing. Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

- Covers three evaluation approaches (illustrative sketches of each follow below): automated metrics (BLEU, ROUGE, BERTScore, accuracy, precision/recall) for fast, repeatable, scalable evaluation using computed scores; human evaluation, meaning manual assessment of quality aspects that are difficult to automate, across dimensions like accuracy and coherence; and LLM-as-Judge, which uses stronger LLMs to evaluate weaker model outputs via pointwise, pairwise, and reference-based scoring
- Includes implementations for text generation, classification, and retrieval (RAG) evaluation with ready-to-use metric functions and custom metric support
- Provides an A/B testing framework with statistical significance testing, effect size calculation, and regression detection to catch performance drops before deployment
- Integrates with LangSmith for dataset management and experiment tracking, plus benchmarking utilities for tracking progress over time

Use it when:
- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building...
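As a concrete picture of the automated-metrics lane, the sketch below computes ROUGE for a generated sentence and accuracy plus macro precision/recall for a classification run, using the off-the-shelf rouge_score and scikit-learn packages. This is a generic sketch of the technique with invented example strings and labels, not the skill's own metric functions.

```python
# Illustrative only: standard implementations of two of the automated
# metrics the skill covers, not the skill's bundled code.
from rouge_score import rouge_scorer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Text generation: ROUGE between a reference and a model output.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(
    "The cat sat on the mat.",        # reference text
    "A cat was sitting on the mat.",  # model output
)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# Classification: accuracy and macro-averaged precision/recall.
y_true = ["positive", "negative", "positive", "neutral"]
y_pred = ["positive", "negative", "neutral", "neutral"]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(accuracy_score(y_true, y_pred), round(precision, 3), round(recall, 3))
```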
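The LLM-as-Judge approach can be sketched as below for the pairwise case. Everything here is hypothetical scaffolding: call_judge is a stand-in for whatever stronger model you wire up, and PAIRWISE_TEMPLATE is an invented prompt, not the skill's. The technique itself is just asking a stronger model to pick the better of two answers, trying both orderings to reduce position bias.

```python
# Sketch of pairwise LLM-as-Judge scoring. call_judge and
# PAIRWISE_TEMPLATE are hypothetical placeholders, not part of the skill.
def call_judge(prompt: str) -> str:
    """Hypothetical: send `prompt` to a strong judge model, return its text."""
    raise NotImplementedError("wire up your LLM client here")

PAIRWISE_TEMPLATE = """You are an impartial judge. Given a question and two
answers, reply with exactly "A" or "B" for the better answer.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Better answer:"""

def pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    # Run both orderings; only a consistent winner counts, else a tie.
    first = call_judge(PAIRWISE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b)).strip()
    second = call_judge(PAIRWISE_TEMPLATE.format(
        question=question, answer_a=answer_b, answer_b=answer_a)).strip()
    if first == "A" and second == "B":
        return "A"
    if first == "B" and second == "A":
        return "B"
    return "tie"
```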
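For the A/B-testing framework, the general shape is a significance test plus an effect size over per-example scores from two variants. A minimal sketch, assuming a two-sided t-test and Cohen's d with made-up score lists; the skill's framework may use different tests and thresholds:

```python
# Sketch of A/B regression detection: flag variant B as a regression when
# it scores significantly worse than baseline A. Assumed test choices.
from statistics import mean, stdev
from scipy import stats

def cohens_d(a: list[float], b: list[float]) -> float:
    # Effect size of B relative to A, using the pooled standard deviation.
    pooled = (((len(a) - 1) * stdev(a) ** 2 + (len(b) - 1) * stdev(b) ** 2)
              / (len(a) + len(b) - 2)) ** 0.5
    return (mean(b) - mean(a)) / pooled if pooled else 0.0

def detect_regression(scores_a, scores_b, alpha=0.05):
    # Two-sided independent t-test on per-example scores.
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
    regression = p_value < alpha and mean(scores_b) < mean(scores_a)
    return {"p_value": p_value,
            "effect_size": cohens_d(scores_a, scores_b),
            "regression": regression}

baseline  = [0.82, 0.79, 0.85, 0.81, 0.80, 0.84, 0.83, 0.78]  # invented
candidate = [0.74, 0.71, 0.76, 0.73, 0.70, 0.75, 0.72, 0.69]  # invented
print(detect_regression(baseline, candidate))
```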

Source description provided by the upstream listing. Community review signal and install context stay separate from this narrative layer.

Community reviews

Latest reviews

No community reviews yet. Be the first to review.

FAQ
What does llm-evaluation do?

Systematic evaluation of LLM applications using automated metrics, human feedback, and statistical testing.

Is llm-evaluation good?

llm-evaluation does not have approved reviews yet, so SkillJury cannot publish a community verdict.

Which AI agents support llm-evaluation?

llm-evaluation currently lists compatibility with Skills CLI.

Is llm-evaluation safe to install?

llm-evaluation has been scanned by the security audit providers tracked on SkillJury. Check the security audits section on this page for detailed results from Gen Agent Trust Hub, Socket.dev, and Snyk.

What are alternatives to llm-evaluation?

Skills in the same category include grimoire-morpho-blue, conversation-memory, second-brain-ingest, zai-tts.

How do I install llm-evaluation?

Run the following command to install llm-evaluation: npx skills add https://github.com/wshobson/agents --skill llm-evaluation

Related skills

More from wshobson/agents

Alternatives in Software Engineering