
llm-evaluation

wshobson/agents

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

Installs: 3
Install command
npx skills add https://github.com/wshobson/agents --skill llm-evaluation
Security audits
- Gen Agent Trust Hub: PASS
- Socket: FAIL
- Snyk: PASS
About this skill
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

When to use:
- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior

Automated metrics: fast, repeatable, scalable evaluation using computed scores.

Text Generation:
- BLEU: N-gram overlap (translation)
- ROUGE: Recall-oriented (summarization)
- METEOR: Semantic similarity
- BERTScore: Embedding-based similarity
- Perplexity: Language model confidence

Classification:
- Accuracy: Percentage correct
- Precision/Recall/F1: Class-specific performance
- Confusion Matrix: Error patterns
- AUC-ROC: Ranking quality

Retrieval (RAG):
- MRR: Mean Reciprocal Rank
- NDCG: Normalized Discounted Cumulative Gain
- Precision@K: Relevant in top K
- Recall@K: Coverage in top K

Human evaluation: manual assessment for quality aspects difficult to automate.

Dimensions:
- Accuracy: Factual correctness
- Coherence: Logical flow
- Relevance: Answers the question
- Fluency: Natural language quality
- Safety: No harmful content
- Helpfulness: Useful to the user

LLM-as-judge: use stronger LLMs to evaluate weaker model outputs.

Approaches:
- Pointwise: ...
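The retrieval (RAG) metrics named above have simple closed-form definitions. As a minimal sketch, assuming binary relevance labels for MRR and Precision@K and graded relevance scores for NDCG (function names are illustrative, not from this skill or any library):

```python
import math

def mrr(ranked_relevance: list[list[bool]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for ranks in ranked_relevance:
        for i, rel in enumerate(ranks, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_relevance)

def precision_at_k(ranks: list[bool], k: int) -> float:
    """Fraction of the top-k retrieved results that are relevant."""
    return sum(ranks[:k]) / k

def ndcg_at_k(gains: list[float], k: int) -> float:
    """Normalized Discounted Cumulative Gain over graded relevance scores."""
    def dcg(scores):
        # Log-discount each position: gain / log2(position + 1), 1-indexed.
        return sum(g / math.log2(i + 2) for i, g in enumerate(scores))
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```

For example, `mrr([[False, True], [True, False]])` averages ranks 2 and 1 to give 0.75, and a result list already in ideal order yields an NDCG of 1.0.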
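A pointwise LLM-as-judge setup typically sends each answer with a rubric to a stronger judge model and parses a numeric score from its reply. A hypothetical sketch of that plumbing (the rubric wording and the idea of a separate judge-model call are assumptions, not part of this skill):

```python
import re

# Hypothetical rubric prompt; the actual skill does not prescribe this wording.
RUBRIC = """Rate the assistant's answer from 1 (poor) to 5 (excellent)
for accuracy, relevance, and fluency. Reply as: SCORE: <n>

Question: {question}
Answer: {answer}"""

def build_pointwise_prompt(question: str, answer: str) -> str:
    """Fill the rubric template for one (question, answer) pair."""
    return RUBRIC.format(question=question, answer=answer)

def parse_score(judge_reply: str) -> int | None:
    """Extract the 1-5 integer score from the judge model's free-text reply."""
    m = re.search(r"SCORE:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None
```

The prompt would be sent to whatever judge model you use; `parse_score` tolerates extra commentary before the `SCORE:` line and returns `None` when no score is found, so malformed judge replies can be retried rather than silently miscounted.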

Source description provided by the upstream skill listing. Community reviews and install context appear in the sections below.

Community Reviews

Latest reviews


No community reviews yet.

FAQ
What does llm-evaluation do?

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

Is llm-evaluation good?

llm-evaluation does not have approved reviews yet, so SkillJury cannot publish a community verdict.

What agent does llm-evaluation work with?

Agent compatibility for llm-evaluation has not been published yet.

What are alternatives to llm-evaluation?

Skills in the same category include telegram-bot-builder, flutter-app-size, sharp-edges, iterative-retrieval.

How do I install llm-evaluation?

npx skills add https://github.com/wshobson/agents --skill llm-evaluation

Related skills

More from wshobson/agents


Alternatives in Software Engineering