Evaluating RAG: Using LLMs to automate benchmarking of retrieval augmented generation systems


Christopher Frenchi

AI Research Engineer

NOTE: This article is Part III in our series on LLM benchmarking. In Part I, we established how to benchmark the accuracy of LLM-based systems to prevent system degradation. In Part II, we took a deeper dive into evaluating truthfulness as an element of LLM accuracy. If you need help setting up a cost-efficient evaluation process for your Generative AI applications, the DART team at TELUS Digital is ready to help. Get started by learning about our Accelerator program.

Introduction

Having established the foundations of large language model (LLM) benchmarking in Parts I and II of this series, this article will explore using an LLM to evaluate the results of retrieval augmented generation (RAG) systems.

With retrieval augmented generation, you essentially hook up a database to a large language model and then bias a chatbot or AI-enabled assistant to retrieve information stored in that database instead of broader external knowledge. We’re then using a separate LLM to evaluate whether or not the RAG is performing as intended. More LLMs testing LLMs!

As we begin testing RAG systems, generating data and interpreting the results, manual comparisons can become challenging, raising questions about bias, truth and utility. Our premise for this experiment is: Can we automate the process to create a self-evaluating generative AI system?

In this article, we walk through how the Data and AI Research Team (DART) at TELUS Digital created an evaluation framework to compare the output of an end-to-end RAG chatbot against a known ground truth. Using this evaluation methodology, we can better understand how to efficiently test a RAG system.

What is retrieval augmented generation (RAG)?

RAG systems utilize embeddings and semantic search to retrieve stored knowledge that augments an LLM’s external knowledge (i.e., world knowledge). We can think of embeddings as a mathematical representation of objects that capture similarities or relationships.

Essentially operating as AI-powered search engines, RAG systems retrieve and summarize specific data based on semantic similarities. By storing information that the LLM doesn’t necessarily know in a database, we can use RAG to query organization-specified documents and return the needed information for a chatbot or AI assistant to answer questions. Manual and automated evaluation of these systems matters because it verifies the accuracy and effectiveness of their responses.
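To make embeddings and semantic search concrete, here is a minimal sketch that ranks stored chunks by cosine similarity to a query. The three-dimensional vectors are made-up toy values rather than output from a real embedding model, and the function and variable names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (hypothetical values); a real system would obtain
# these from an embedding model and keep them in a vector store.
chunks = {
    "photosynthesis chunk": [0.9, 0.1, 0.0],
    "billing policy chunk": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]

def most_similar(query, store):
    """Return the name of the stored chunk closest to the query embedding."""
    return max(store, key=lambda name: cosine_similarity(query, store[name]))

best_chunk = most_similar(query_embedding, chunks)
```

The retrieved chunk is what gets handed to the LLM as context, which is why the quality of retrieval shapes the quality of the final answer.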

But how do we evaluate the RAG LLM response? How do we know what we are getting back is correct? Let’s consider how we would manually test this before showing how to automate it.

Manual testing would involve the following steps:

  1. Create a series of questions, expected answers and expected sources — your “gold standard dataset.”
  2. Ask the chatbot a question from your dataset.
  3. Receive the chatbot LLM response.
  4. Compare the LLM response to the expected answer.
  5. Assign a Yes/No determination or score to the LLM response (subject to personal bias).
  6. Compare the expected source to the source returned by the LLM.
  7. Document and share results.

We would then duplicate steps 2 through 7 for the next question for that page source, start over with step 1 for the next source page, rinse and repeat.

In short, manually creating and evaluating hundreds of prompts and responses is downright awful.

Now, let’s explore how to transform these manual processes into an automated series of tests using the evaluation framework.

Building an evaluation framework for RAG

We'll break down the following walkthrough into steps to better illustrate the automated process.

To set the stage, let's recap our goal when evaluating our RAG system:

Test the LLM response to ensure that the information returned from the LLM is an appropriate, correct and accurate response to the user query.

  • This testing often involves reviewing results and determining whether the LLM passed back an accurate and appropriate response to the user. We can use several metrics, but for this experiment, we'll focus on accuracy and sources.

1. Context — Understanding what is stored in the “knowledge” database

In our example, we will grab some data regarding the earliest evidence of photosynthetic organisms, using this information as a stand-in for the kind of information an organization might include in its database.

For this experiment, we'll consider a data chunk below, taken from an article in the scientific journal Plant Physiology and hosted in the National Library of Medicine’s database, which claims photosynthetic organisms may have been present as early as 3.5 billion years ago. While this claim is perhaps controversial, we are not debating the scientific accuracy of this data chunk compared to others; we’re explicitly configuring a RAG system to treat this passage as our source of truth.

A model relying only on its world knowledge might find other sources (for instance, this article from Wikipedia) that suggest evidence of photosynthetic organisms appearing only as early as 3.2 billion years ago. We don’t want the LLM drawing from such a source but rather from the specified sources.

This is the point of using RAG systems: For many use cases and specific tasks, especially in highly regulated industries like financial services or healthcare, organizations strive to retain total control over the knowledge database an LLM references. Doing so ensures the system responds only with approved information that conforms to regulatory frameworks.

We’re using this paragraph from Plant Physiology as an example of an expected chunk we might find in our knowledge store that we DO want our LLM to use as a source. In other words, we are evaluating to ensure the LLM is NOT referencing broader world knowledge and returning answers based on, in this case, Wikipedia.

2. Gold standard dataset — Questions, answers, sources

We need a gold standard (GS) dataset against which to evaluate RAG. This gold standard will be our ground truth question/answer set. There are several ways to create this dataset: we can derive entries directly from our knowledge chunks, mine our chatbot usage logs for frequently asked questions, or run internal bug bashes to find edge cases. For this illustration, let’s create a simple example of questions, answers and sources based on the data chunk outlined above. We will create a correct and an incorrect dataset to test against, establishing our source of truth for the evaluation.

As Part II of this series explains, we can create our own question/answer datasets or use an LLM to help us generate these datasets. By passing in knowledge chunks and asking an LLM to develop questions and answers, we can quickly generate large question/answer datasets across multiple pages. While the automatic generation of these datasets makes the process easier, these datasets do need to be reviewed by humans and, therefore, incur some costs.

Note: Consider storing your question/answer datasets in a .csv file or database. If your RAG system returns source information, you might also consider storing source information in your question/answer dataset. We will hard-code some variables to showcase the system and make it easier to understand.
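As an illustration, the hard-coded gold standard entries for this walkthrough could look like the sketch below. The `GoldStandardItem` class, its field names and the `to_csv` helper are assumptions made for this example; the question, answers and sources are the ones used throughout this article.

```python
import csv
import io
from dataclasses import dataclass, asdict

@dataclass
class GoldStandardItem:
    question: str
    answer: str
    source: str

QUESTION = "When did photosynthetic organisms emerge?"

# Correct entry: what we expect the RAG chatbot to return.
correct_item = GoldStandardItem(
    question=QUESTION,
    answer="Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.",
    source="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949000/",
)

# Deliberately incorrect entry, used to confirm the evaluator can fail a response.
incorrect_item = GoldStandardItem(
    question=QUESTION,
    answer="Photosynthetic organisms emerged between 3.2 and 2.4 billion years ago.",
    source="https://en.wikipedia.org/wiki/History_of_Earth",
)

def to_csv(items):
    """Serialize gold standard items to CSV text, one row per question."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["question", "answer", "source"])
    writer.writeheader()
    writer.writerows(asdict(item) for item in items)
    return buffer.getvalue()

gold_standard_csv = to_csv([correct_item, incorrect_item])
```

Keeping the incorrect entry alongside the correct one gives the evaluator a known failure case, so a run that passes everything is immediately suspicious.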

3. RAG request/response — Generate an answer from the chatbot to compare with the gold standard dataset

Now that we understand our context and question/answer/sources dataset, we can pass our generated question(s) to the RAG chatbot to get a response that we can compare to the gold standard answer. We don’t need to worry about embeddings or the RAG semantic search when making the request, because all of that is abstracted away behind the chatbot during our evaluation.

Let’s assume your RAG chatbot can be called through an HTTP API. If we iterate through our question dataset, a typical request/response could look like the following:
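A minimal sketch of such a call, using only the Python standard library. The endpoint URL, the request payload and the response field names (`answer`, `sources`, `cost`) are hypothetical; substitute whatever your chatbot API actually exposes.

```python
import json
import urllib.request

CHATBOT_URL = "https://chatbot.example.com/api/ask"  # hypothetical endpoint

def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))

def ask_chatbot(question, post=post_json):
    """Send one question and return [answer, sources, cost] for evaluation.

    The response keys used here are assumptions about the API's schema.
    """
    body = post(CHATBOT_URL, {"question": question})
    return [body["answer"], body["sources"], body["cost"]]
```

Passing a stub for `post` lets you exercise the rest of the evaluation pipeline without network access or API charges.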

One thing to note in the response is the callout to cost. It’s important to track and understand the cost and/or “tokens” used during the chatbot API calls to effectively calculate the cost of running the chatbot and evaluating it. Let’s look at what's returned and what we'll store for evaluation.

['Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.', ['https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949000/'], 0.001662]

4. Evaluation — Does the chatbot response match the expectation?

Ultimately, RAG evaluation compares two pieces of text based on a metric. We will compare the gold standard question/answer/source to the response from our chatbot. There are a few ways we can evaluate responses with known tools, such as Ragas or MLflow. In this post, we will take a behind-the-scenes look at implementing a simple evaluator with an LLM call.

Let’s define a metric we will showcase for evaluation. In this example, we will use accuracy. Depending on the metric in question, this can be specific to the type of response you want to test against. In addition, multiple metrics can be used, but be aware of the costs associated with more extensive or additional LLM calls.
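A simple accuracy evaluator could be sketched as follows. The prompt wording follows the evaluator prompts quoted in this article; `call_llm` is a hypothetical stand-in for whichever LLM client you use, assumed to take a prompt string and return the completion text.

```python
ACCURACY_PROMPT = (
    "Looking at the gold standard answer, '{gold}', and the language model's "
    "answer, '{response}', would you state that the language model's answer "
    "is completely accurate?"
)

def build_accuracy_prompt(gold_answer, llm_response):
    """Fill the accuracy prompt with the two answers being compared."""
    return ACCURACY_PROMPT.format(gold=gold_answer, response=llm_response)

def evaluate_accuracy(gold_answer, llm_response, call_llm):
    """Ask the evaluator LLM whether the response matches the gold standard.

    `call_llm` is a hypothetical callable: prompt string in, completion text out.
    """
    prompt = build_accuracy_prompt(gold_answer, llm_response)
    return call_llm(prompt)
```

Injecting `call_llm` keeps the evaluator independent of any one provider and makes it easy to swap in a cheaper model if evaluation costs grow.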

4a. Augmentation Check

The first test checks the augmentation step, that is, the answer the LLM wrote from the retrieved context. We will compare the GS answer against the LLM response from our chatbot. In the following examples, we will showcase both a correct and an incorrect answer during the comparison. Passing both answers to the evaluator shows how it works.

Looking at the gold standard answer, “Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago,” and the language model’s answer, “Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago,” would you state that the language model’s answer is completely accurate?

[“Yes\n The language model’s answer is completely accurate because it matches the gold standard answer exactly.”]

Looking at the gold standard answer, “Photosynthetic organisms emerged between 3.2 and 2.4 billion years ago,” and the language model's answer, “Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago,” would you state that the language model's answer is completely accurate?

[“No\n The language model’s answer is not completely accurate because it states that photosynthetic organisms emerged between 3.2 and 3.5 billion years ago, which is different from the gold standard answer that states they emerged between 3.2 and 2.4 billion years ago.”]
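Because the evaluator's output starts with a Yes/No verdict followed by its reasoning, it can be split into a pass/fail flag and an explanation. The sketch below assumes that output shape (verdict on the first line); `parse_verdict` is an illustrative name, and a production parser would need to handle responses that do not follow the format.

```python
def parse_verdict(evaluation_text):
    """Split 'Yes\\n reasoning...' into (passed, reasoning)."""
    verdict, _, reasoning = evaluation_text.strip().partition("\n")
    passed = verdict.strip().lower().startswith("yes")
    return passed, reasoning.strip()
```

Storing the reasoning next to the flag makes failed evaluations much easier to triage later.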

4b. Retrieval Check

The next thing we want to test is the sources returned.

LLM response: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949000/
GS Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949000/
The source matches the source in the prompt.

LLM response: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2949000/
GS Source: https://en.wikipedia.org/wiki/History_of_Earth
The source does NOT match the source in the prompt.
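Unlike the accuracy check, the source comparison does not need an LLM; a plain string comparison is enough. A minimal sketch, assuming the chatbot returns a list of source URLs (the function names and the trailing-slash normalization are illustrative choices):

```python
def normalize_url(url):
    """Trim whitespace and any trailing slash so equivalent URLs compare equal."""
    return url.strip().rstrip("/")

def sources_match(returned_sources, gold_source):
    """True if the gold standard source is among the sources the chatbot returned."""
    expected = normalize_url(gold_source)
    return expected in {normalize_url(source) for source in returned_sources}
```

Because this check is deterministic and free, it is worth running on every response even when the LLM-based accuracy check is sampled to save cost.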

5. Review results

5a. Cost

It’s important to understand the cost of using an LLM to evaluate the output of a RAG query. Asking the evaluator to explain the reasoning behind its verdict increases the completion tokens used, which increases the cost.

RAG Cost: $0.001662

Chatbot Evaluation:
Input tokens = 136
Completion tokens = 19
Evaluation Cost: $0.001548

Bad Chatbot Evaluation:
Input tokens = 136
Completion tokens = 60
Bad Evaluation Cost: $0.004008

Total Cost: $0.007218
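The arithmetic behind these figures can be reproduced as shown below. The per-token rates are back-solved from the two evaluation costs above (they are the only pair of rates consistent with both) and are not taken from any published price list; check your provider's current pricing before relying on them.

```python
from decimal import Decimal

# Assumed rates in dollars per token, inferred from the figures above.
INPUT_RATE = Decimal("0.000003")
COMPLETION_RATE = Decimal("0.00006")

def evaluation_cost(input_tokens, completion_tokens):
    """Cost of one evaluator call in dollars."""
    return input_tokens * INPUT_RATE + completion_tokens * COMPLETION_RATE

rag_cost = Decimal("0.001662")                # reported by the chatbot API
good_eval_cost = evaluation_cost(136, 19)     # short "Yes" explanation
bad_eval_cost = evaluation_cost(136, 60)      # longer "No" explanation
total_cost = rag_cost + good_eval_cost + bad_eval_cost
```

Both evaluations use the same number of input tokens, so the difference in cost comes entirely from the longer explanation in the failing case.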

5b. Results

Sharing results is crucial once we evaluate a RAG response. Where do we save our results — a .csv, a db or into our experiment tracker? Ultimately, it is up to the team to determine where these metrics can be reviewed and shared during development.

The breakdown of the different outputs is meant to showcase what we need when reviewing the evaluator responses. When it comes to sharing our results, the findings must be actionable. How is your team saving and sharing these results with developers and stakeholders? What do we do when something fails or behaves incorrectly? If you're using a fine-tuned model, how do you improve it?

Question: When did photosynthetic organisms emerge?
Answer: Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.
LLM response: Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.
Accuracy Pass/Fail: Yes
Reasoning: The language model's answer is completely accurate because it matches the gold standard answer exactly.
Source Pass/Fail: True
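One lightweight option is to write each evaluation as a row in a .csv file. The sketch below assumes column names that mirror the record above and demonstrates the writer against an in-memory buffer; the `write_results` helper is illustrative, and an experiment tracker or database would follow the same pattern.

```python
import csv
import io

FIELDNAMES = ["question", "answer", "llm_response",
              "accuracy_pass", "reasoning", "source_pass"]

def write_results(records, handle):
    """Write evaluation records as CSV rows to any text stream."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(records)

example_record = {
    "question": "When did photosynthetic organisms emerge?",
    "answer": "Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.",
    "llm_response": "Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.",
    "accuracy_pass": "Yes",
    "reasoning": "The language model's answer is completely accurate because "
                 "it matches the gold standard answer exactly.",
    "source_pass": True,
}

# In-memory demonstration; to write a file instead, pass a handle such as
# open("rag_eval_results.csv", "w", newline="") (hypothetical file name).
buffer = io.StringIO()
write_results([example_record], buffer)
results_csv = buffer.getvalue()
```

A flat file like this is easy to attach to a CI run or load into a spreadsheet for stakeholders who do not have access to the development environment.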

Conclusion

Using an evaluation framework to measure the responses from our RAG chatbot provides insights into the quality of the LLM's responses. We are using an evaluation framework because the results from an LLM are not deterministic. We can, however, use this to our advantage and have shown above that an LLM can evaluate two answers and provide a meaningful determination for different metrics. Accuracy is a simple metric for comparing a response against a desired result.

Like writing automated tests, using an evaluation method on your RAG chatbot helps ensure quality. Also, similar to tests, the results are only meaningful if we’ve constructed good-quality datasets and metrics to use during evaluation.

With the code above as a guideline, a Python file can be created and run to evaluate your RAG system. This file can be used locally or through CI/CD. The next steps are to consider larger datasets and understand how your team will use and share these results with developers and stakeholders.

If you need help setting up a cost-efficient evaluation process for your generative AI and agentic AI applications, the DART team at TELUS Digital is ready to help. Get started by learning about our Agentic AI Accelerator program.

If you haven’t reviewed Part I and Part II of this series on LLM benchmarking, you may find additional answers to your questions in those articles.
