Ragas

A Python library for evaluating RAG (Retrieval-Augmented Generation) pipelines. It measures dimensions that generic eval tools don’t cover well:

  • Faithfulness — is the generated answer grounded in the retrieved context, or does it hallucinate?
  • Context recall — did the retrieval step find the relevant documents?
  • Context precision — are the retrieved documents actually useful, or is there noise?
  • Answer relevance — does the final answer address the original question?
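
To make the dimensions above concrete, here is a minimal sketch of three of them computed as simple ratios over hand-labeled data. It is an illustration only and does not use the Ragas API: the `Sample` class, its fields, and the three functions are hypothetical names invented for this note, and the hand labels stand in for the LLM-based judgments that Ragas produces itself. Answer relevance is left out because it needs a semantic comparison between question and answer that a ratio over labels cannot capture.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Sample:
    """One hand-labeled RAG interaction (hypothetical structure, not a Ragas type)."""
    retrieved_ids: List[str]      # document IDs returned by the retriever
    relevant_ids: Set[str]        # document IDs a human marked as relevant
    claims_supported: List[bool]  # one flag per claim in the answer: backed by context?


def context_recall(s: Sample) -> float:
    """Share of the relevant documents that the retriever actually found."""
    if not s.relevant_ids:
        return 0.0
    return len(s.relevant_ids & set(s.retrieved_ids)) / len(s.relevant_ids)


def context_precision(s: Sample) -> float:
    """Share of the retrieved documents that are relevant (the rest is noise)."""
    if not s.retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in s.retrieved_ids if doc_id in s.relevant_ids)
    return hits / len(s.retrieved_ids)


def faithfulness(s: Sample) -> float:
    """Share of the answer's claims that are grounded in the retrieved context."""
    if not s.claims_supported:
        return 0.0
    return sum(s.claims_supported) / len(s.claims_supported)


# Assumed example data: 4 documents retrieved, 3 relevant overall, 4 claims in the answer.
sample = Sample(
    retrieved_ids=["d1", "d2", "d3", "d4"],
    relevant_ids={"d1", "d3", "d7"},
    claims_supported=[True, True, False, True],
)

scores = {
    "context_recall": context_recall(sample),        # 2 of 3 relevant docs found
    "context_precision": context_precision(sample),  # 2 of 4 retrieved docs relevant
    "faithfulness": faithfulness(sample),            # 3 of 4 claims supported
}
print(scores)
```

With these labels the sketch gives a context recall of 2/3, a context precision of 0.5, and a faithfulness of 0.75. The split shows why the metrics are reported separately: the retriever can miss a relevant document (recall), return noise (precision), and the generator can still add an unsupported claim (faithfulness), each independently of the others.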

Best suited when your primary concern is RAG quality rather than general prompt evaluation.

See also

  • RAG — the architecture Ragas evaluates
  • Promptfoo — general-purpose eval tool (has some RAG metrics but less depth)
  • Prompt Evaluation — the broader eval landscape