Evaluating RAG with Ragas

Retrieval-Augmented Generation (RAG) systems combine a retriever and a generator. Measuring quality on the final answer alone is often insufficient: failures may come from irrelevant or incomplete retrieval, from generation that drifts away from sources, or from both.

Ragas (Retrieval-Augmented Generation Assessment) is a Python framework that scores RAG behavior with metrics for faithfulness to context, answer relevance, retrieval quality, and more—depending on which metrics are enabled and which columns are available in the evaluation set.

This page focuses on using the Ragas SDK directly in a notebook or batch job. Integration with other evaluation platforms is optional and not covered here.

What to record for each evaluation example

A typical single-turn row includes:

  • user_input: the user query (or equivalent).
  • retrieved_contexts: the retrieved passages for that row, as a list of strings.
  • response: the model output to score.
  • reference: a reference answer or key facts; required only for some metrics (for example context recall).
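Concretely, one evaluation row can be represented as a plain dictionary with these fields (the values here are purely illustrative):

```python
# One illustrative single-turn evaluation row. The field names follow the
# modern Ragas sample layout; the values are made up for demonstration.
row = {
    "user_input": "What is the capital of France?",
    "retrieved_contexts": [
        "Paris is the capital and largest city of France.",
        "France is a country in Western Europe.",
    ],
    "response": "The capital of France is Paris.",
    # Only needed for reference-based metrics such as context recall.
    "reference": "Paris is the capital of France.",
}

print(sorted(row))  # field names in alphabetical order
```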

Ragas metrics overview

Ragas exposes a large catalog of metrics. Only a subset is needed for a typical RAG evaluation pass. Each metric expects specific dataset columns (for example question, contexts, answer, ground_truth) and may require an LLM, embeddings, both, or neither. Names and import paths evolve across releases; confirm the installed-version guidance in the Ragas metrics documentation.

The lists below summarize intent and common use; they are not a substitute for upstream API details.

Core RAG metrics

These metrics are most directly aligned with retrieval-augmented generation quality and are usually the first set to track:

Unless otherwise noted, classes in this section are from ragas.metrics.collections.

  • Faithfulness (class Faithfulness; requires user_input, retrieved_contexts, response): whether claims in the answer are supported by the retrieved contexts (grounding / hallucination control).
  • Answer relevancy / Response relevancy (class AnswerRelevancy; requires user_input, response): whether the generated answer addresses the user query.
  • Context precision (class ContextPrecision; requires user_input, retrieved_contexts, reference): whether the retrieved chunks are relevant and useful for answering the query.
  • Context recall (class ContextRecall; requires user_input, retrieved_contexts, reference): whether the retrieved contexts cover the information required to answer correctly.
  • Context entity recall (class ContextEntityRecall; requires reference, retrieved_contexts): whether key entities from the reference are present in the retrieved contexts.
  • Context utilization (class ContextUtilization; requires user_input, response, retrieved_contexts): how much of the retrieved context the answer actually uses.

Field and import requirements can vary by Ragas version and metric variant. Confirm against the installed version in the Ragas metrics documentation.

Optional RAG metrics

These metrics are useful in specific evaluation setups, especially when reference answers are available or when robustness checks are needed:

  • Answer correctness (class AnswerCorrectness; requires user_input, response, reference): alignment of the generated answer with a reference answer.
  • Answer similarity / Semantic similarity (class SemanticSimilarity; requires response, reference): semantic closeness between generated and reference answers (embedding-based).
  • Factual correctness (class FactualCorrectness; requires response, reference): fact-level agreement with a reference or expected facts.
  • Noise sensitivity (class NoiseSensitivity; requires user_input, response, reference, retrieved_contexts): stability when distractors or noise are introduced into context or inputs.

For metrics that are weakly related to RAG core evaluation (for example generic text-overlap metrics, rubric-based custom metrics, agent/tool metrics, SQL metrics, or multimodal metrics), refer to the Ragas metrics documentation.

Choosing a minimal RAG set

A practical default for many RAG benchmarks is faithfulness, answer relevancy, context precision, and context recall (context recall and some context precision variants need a reference or ground_truth column). Add answer correctness or semantic similarity when a reference answer is available. Match metrics to the columns present in the dataset and to cost constraints: LLM-heavy metrics are slower and more expensive.
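This matching step can be sketched with a small helper. The helper and its metric-to-column mapping are hypothetical (written for this page, not part of the Ragas API); the column requirements mirror the tables above:

```python
# Hypothetical mapping from metric name to the dataset columns it requires,
# covering the minimal RAG set described above.
MINIMAL_RAG_SET = {
    "faithfulness": {"user_input", "retrieved_contexts", "response"},
    "answer_relevancy": {"user_input", "response"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
}

def supported_metrics(columns):
    """Return the metrics of the minimal set that the given columns support."""
    cols = set(columns)
    return sorted(name for name, needed in MINIMAL_RAG_SET.items() if needed <= cols)

# Without a reference column, only the reference-free metrics qualify.
print(supported_metrics(["user_input", "retrieved_contexts", "response"]))
# ['answer_relevancy', 'faithfulness']
```

Adding a reference column to the dataset unlocks the two retrieval metrics as well.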

Calling the Ragas SDK

For modern Ragas usage, instantiate metrics from ragas.metrics.collections and score each row using ascore() (or score() in synchronous scripts).

  1. Prepare OpenAI-compatible clients (AsyncOpenAI) for the LLM and for embeddings, then pass them into llm_factory and OpenAIEmbeddings (see the sample notebook for environment-variable configuration).

  2. Instantiate metrics with explicit dependencies (llm, embeddings where required).

  3. Iterate through rows and call metric.ascore(...) with metric-specific arguments.

    from openai import AsyncOpenAI
    from ragas.embeddings import OpenAIEmbeddings
    from ragas.llms import llm_factory
    from ragas.metrics.collections import AnswerRelevancy, Faithfulness
    
    llm_client = AsyncOpenAI(
        api_key="...",
        base_url="https://your-openai-compatible-endpoint/v1",  # or None for provider default
    )
    embed_client = AsyncOpenAI(
        api_key="...",  # often same key as LLM when using one gateway
        base_url="https://your-embedding-endpoint/v1",  # optional; can match llm_client
    )
    
    llm = llm_factory("your-llm-model", client=llm_client)
    embeddings = OpenAIEmbeddings(model="your-embedding-model", client=embed_client)
    
    faithfulness = Faithfulness(llm=llm)
    answer_relevancy = AnswerRelevancy(llm=llm, embeddings=embeddings)
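Step 3 can then be sketched as a per-row loop. StubMetric below is a hypothetical stand-in so the pattern runs without API access; real metric instances (such as faithfulness and answer_relevancy above) are awaited the same way via ascore(...), each with only its required arguments:

```python
import asyncio

# StubMetric imitates the awaitable ascore(...) interface of modern Ragas
# metrics so the loop pattern can run offline; it is not a Ragas class.
class StubMetric:
    def __init__(self, name):
        self.name = name

    async def ascore(self, **kwargs):
        # Real metrics issue judge-LLM calls here and return a score result.
        return 1.0

rows = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital of France."],
        "response": "Paris.",
    },
]

async def evaluate(rows, metrics):
    results = []
    for row in rows:
        per_row = {}
        for metric in metrics:
            # Pass only the arguments this metric requires.
            per_row[metric.name] = await metric.ascore(
                user_input=row["user_input"],
                response=row["response"],
                retrieved_contexts=row["retrieved_contexts"],
            )
        results.append(per_row)
    return results

scores = asyncio.run(evaluate(rows, [StubMetric("faithfulness")]))
print(scores)  # [{'faithfulness': 1.0}]
```

In a notebook, where an event loop is already running, await evaluate(...) directly instead of calling asyncio.run(...).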

When selecting metrics, the following differences affect how the scoring call is prepared:

  • Required columns: each metric expects specific arguments (for example reference for many reference-based retrieval metrics); missing fields cause validation errors before scoring starts.
  • LLM vs. embeddings vs. neither: LLM-based metrics need a language model, embedding-based metrics need an embedding model, and lexical metrics may need no model at all. In the modern API, dependencies are passed explicitly when creating metric instances.
  • Metric variants: different classes implement the same intent with different scorers (for example LLM-based vs. non-LLM context precision), so metric imports and selection change accordingly.
  • Constructor configuration: rubrics, aspect critics, discrete or numeric custom metrics, and specialized faithfulness variants require instantiation arguments or extra setup.
  • ID-based or multi-turn data: ID-based precision/recall expect ID columns in the dataset, and multi-turn or agent/tool metrics require different sample layouts; these are outside the single-turn notebook flow.

In practice, this means the main work is to align dataset fields and metric selection, then score rows with the chosen metric instances.

Prerequisites

  • Python 3.10+ recommended.
  • Network access to an LLM API (and to an embeddings API for metrics that need embeddings). The sample notebook assumes an OpenAI-compatible setup and supports configuring credentials and an optional base URL for compatible gateways.
  • Awareness that evaluation issues many model calls; cost and latency scale with rows and metrics.
  • Version pinning: Ragas APIs and metric classes change between releases. For reproducible benchmarks, pin ragas (and related packages) in the environment or notebook; see the commented install line in the sample notebook.
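To budget a run, a back-of-envelope call count helps. The numbers below are purely illustrative, and the calls-per-metric factor varies by metric (faithfulness, for instance, typically issues more than one judge call per row):

```python
rows = 500            # evaluation examples
llm_metrics = 4       # LLM-based metrics enabled
calls_per_metric = 2  # illustrative average judge calls per metric per row

# Total judge-LLM calls scales multiplicatively with rows and metrics.
total_judge_calls = rows * llm_metrics * calls_per_metric
print(total_judge_calls)  # 4000
```

Cost and latency grow in direct proportion to this count, which is why trimming the metric set or sampling rows is often the first lever for large datasets.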

Runnable notebook

Download and open the notebook in JupyterLab or another Jupyter environment.

The notebook opens with a short SDK recap focused on modern metrics (ragas.metrics.collections) and explicit LLM/embedding setup. The canonical explanation is the Calling the Ragas SDK section on this page.

The notebook:

  1. Installs dependencies (with an optional commented version pin for reproducibility).
  2. Creates a small datasets.Dataset with user_input, retrieved_contexts, response, and reference.
  3. Runs baseline evaluation with faithfulness and answer relevancy using modern metric classes.
  4. Adds optional retrieval-focused metrics (context precision and context recall) using modern metric classes.
  5. Shows aggregate and per-row results, followed by a short troubleshooting section.

Troubleshooting

  • Credentials or endpoint configuration: configure LLM API credentials (and an optional base URL for compatible gateways). If embeddings use a separate endpoint, configure embeddings credentials as well, then pass separate AsyncOpenAI clients into llm_factory and OpenAIEmbeddings.
  • Dataset validation errors: verify required arguments for selected metrics and ensure dataset keys align with modern examples (user_input, retrieved_contexts, response, reference).
  • Notebook async execution: the sample notebook uses await metric.ascore(...). For synchronous scripts, use metric.score(...) or wrap async code with asyncio.run(...).
  • Version-related warnings: metric classes and signatures can change across Ragas versions. Pin package versions for reproducible runs and confirm behavior against the installed version documentation.

Interpreting results

  • Compare scores only under the same dataset and evaluation configuration (judge LLM, embeddings, and prompts); otherwise shifts may reflect configuration changes rather than RAG quality.
  • For retrieval-oriented evaluation, use the same embedding model as the production RAG retriever whenever possible to reduce metric drift caused by mismatched embedding spaces.
  • Use aggregate scores for trend tracking or quality gates, and per-row scores for diagnosis (for example missing context, hallucination, or irrelevant retrieval). Treat metric values as directional signals, not absolute truth.

Further reading