DeepEval

Pytest-style framework for evaluating LLM outputs with built-in metrics — hallucination, answer relevancy, faithfulness, G-Eval, and more.

llm-evaluation-frameworks · Recently released

Hero Score: 71

Popularity: 76
Performance: 70
Ecosystem: 75
Maturity: 61
Dev Experience: 75

⭐ 16,789 stars · ⬇ 1.3M downloads/wk · First release: Aug 2023 · Last release: Jul 2026

Async Support: Yes · Plugin Extensions: High · Speed: Medium · Doc Quality: High · Learning Curve: Easy

Pros

• Pytest-style API makes LLM evals feel like familiar unit tests
• Rich built-in metric library (G-Eval, hallucination, faithfulness, relevancy)
• CI-friendly with dataset support and Confident AI dashboard integration
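
To make the "evals as unit tests" idea concrete, here is a minimal sketch of the pattern in plain Python: a test function scores an output and asserts that the score clears a threshold. It deliberately does not use DeepEval's own classes; `EvalCase` and `keyword_coverage` are hypothetical stand-ins, and the keyword check replaces what would normally be an LLM-judged metric.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """Hypothetical container for one evaluation example (not a DeepEval class)."""
    question: str
    actual_output: str
    expected_keywords: tuple[str, ...]


def keyword_coverage(case: EvalCase) -> float:
    """Toy stand-in metric: fraction of expected keywords found in the output."""
    text = case.actual_output.lower()
    hits = sum(1 for kw in case.expected_keywords if kw.lower() in text)
    return hits / len(case.expected_keywords)


def test_refund_answer_mentions_policy_terms():
    case = EvalCase(
        question="What is your refund policy?",
        actual_output="You can request a full refund within 30 days of purchase.",
        expected_keywords=("refund", "30 days"),
    )
    # The eval passes or fails like any other unit test, so pytest can collect it.
    assert keyword_coverage(case) >= 0.5
```

Saved in a file such as `test_evals.py` (an assumed name), this would be picked up by a normal `pytest` run, which is what makes the style easy to wire into CI.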

Cons

• Most metrics rely on LLM-as-judge — token cost and latency add up
• Judge-model choice and prompts can shift scores between runs
• Customizing metrics beyond built-ins requires reading the internals
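
The first two cons can be roughed out with simple arithmetic. The sketch below estimates judge cost for one eval run and summarizes how much a score moves across repeated runs. All numbers are placeholders rather than real pricing or measured scores, the function names are hypothetical, and the cost model assumes one judge call per metric per test case, which real metrics may exceed.

```python
import statistics


def estimate_judge_cost(num_cases, metrics_per_case, tokens_per_call, usd_per_1k_tokens):
    """Rough cost of one eval run, assuming one judge call per metric per case."""
    calls = num_cases * metrics_per_case
    return calls * tokens_per_call / 1000 * usd_per_1k_tokens


def score_spread(scores):
    """Mean and sample standard deviation of one case's score across repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder inputs: 500 cases, 3 metrics each, 1200 tokens per judge call.
cost = estimate_judge_cost(num_cases=500, metrics_per_case=3,
                           tokens_per_call=1200, usd_per_1k_tokens=0.01)

# Placeholder scores for the same test case evaluated four times.
mean, spread = score_spread([0.82, 0.78, 0.85, 0.80])

print(f"estimated cost per run: ${cost:.2f}")
print(f"score mean={mean:.3f}, stdev={spread:.3f}")
```

If the spread is comparable to the margin between a score and its pass threshold, a single run is not enough to tell a regression from judge noise.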
