Ragas
A Python library for evaluating RAG (Retrieval-Augmented Generation) pipelines. It measures dimensions that generic eval tools don’t cover well:
- Faithfulness — is the generated answer grounded in the retrieved context, or does it hallucinate?
- Context recall — did the retrieval step find the relevant documents?
- Context precision — are the retrieved documents actually useful, or is there noise?
- Answer relevance — does the final answer address the original question?
Best suited when your primary concern is RAG quality rather than general prompt evaluation.
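To make the two retrieval-side dimensions concrete, here is a minimal sketch of context precision and context recall as plain set ratios. It assumes you already have hand-labeled relevant document IDs for each question. The function names, document IDs, and labels are all hypothetical and chosen for illustration. Ragas computes its scores differently (typically with an LLM as judge), so this shows the idea behind the metrics, not the library's implementation.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved documents that are relevant (low value = noisy retrieval)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant documents that the retriever actually found."""
    if not relevant:
        # Nothing to find, so nothing was missed (a convention assumed here).
        return 1.0
    found = relevant.intersection(retrieved)
    return len(found) / len(relevant)


# Hypothetical example: the retriever returned four documents,
# and a human labeled three documents as relevant to the question.
retrieved_ids = ["doc1", "doc4", "doc7", "doc9"]
relevant_ids = {"doc1", "doc2", "doc7"}

precision = context_precision(retrieved_ids, relevant_ids)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved_ids, relevant_ids)        # 2 of 3 relevant were retrieved

print(f"context precision: {precision:.2f}")
print(f"context recall:    {recall:.2f}")
```

In this toy case half of the retrieved context is noise (precision) and one relevant document was missed (recall). These are the two failure modes the bullets above distinguish.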
See also
- RAG — the architecture Ragas evaluates
- Promptfoo — general-purpose eval tool (has some RAG metrics but less depth)
- Prompt Evaluation — the broader eval landscape