How to evaluate RAG with classification metrics

No more expensive approaches like LLM-as-a-Judge or outdated metrics like ROUGE or BLEURT. RAG with Citations (RAG+C) is how we build our chatbots at helvia, and it is how we make our lives easier when evaluating different LLMs.

RAG: A standard industry practice for Question-Answering GenAI Chatbots

Retrieval-Augmented Generation (RAG) is the go-to approach for enabling LLMs to perform Question Answering on proprietary documents, such as business knowledge bases. Here's how classic RAG works:

1) The user query is converted into a text embedding.
2) A semantic search retrieves the top-K most relevant chunks from the embedded knowledge base using cosine similarity.
3) These chunks (the "context") are concatenated with the user query and passed to the LLM as input.
4) The LLM generates an answer using the context-augmented input.
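To make these four steps concrete, here is a minimal Python sketch of the classic RAG loop. It assumes the knowledge-base chunks have already been embedded into a NumPy matrix (in production you would query a vector database instead), and it uses OpenAI's text-embedding-3-large and a chat model purely for illustration; the helper names (embed, retrieve_top_k, answer) are ours, not part of any library.

```python
# Minimal sketch of the classic RAG loop (steps 1-4 above).
# Assumes `kb_embeddings` is a NumPy matrix with one embedding per chunk
# and `chunks` is the parallel list of chunk texts.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # Step 1: turn text into an embedding vector.
    resp = client.embeddings.create(
        model="text-embedding-3-large", input=text, dimensions=1024
    )
    return np.array(resp.data[0].embedding)

def retrieve_top_k(query_vec: np.ndarray, kb_embeddings: np.ndarray, chunks: list[str], k: int = 5) -> list[str]:
    # Step 2: cosine similarity between the query and every chunk embedding.
    sims = (kb_embeddings @ query_vec) / (
        np.linalg.norm(kb_embeddings, axis=1) * np.linalg.norm(query_vec)
    )
    top_idx = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in top_idx]

def answer(query: str, kb_embeddings: np.ndarray, chunks: list[str]) -> str:
    # Steps 3-4: concatenate the retrieved context with the query and call the LLM.
    context = "\n\n".join(retrieve_top_k(embed(query), kb_embeddings, chunks))
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer the user using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content
```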
The main struggle: How to evaluate RAG?

While there are different business/research questions behind RAG, one is the most important: "Did we answer the user correctly?"

The key problem in typical RAG systems is the lack of standardized, robust metrics to evaluate LLM chatbot responses. Some people propose using an LLM-as-a-Judge (e.g., asking a different LLM API to score responses on relevance or groundedness), and while this is promising, it is expensive to scale, especially for high-volume tasks. Other people suggest using traditional Natural Language Generation (NLG) metrics like ROUGE or BLEURT, but these are limited too. For instance, let's say you have:

- a user query like "How often can I take vacation days?"
- an LLM-generated response such as "You can take vacations as long as you have enough PTO days, but requests must be approved by your department."
- a "gold" response, which might be a bit more lengthy, such as "There is no limit to how often you can take vacation days, as long as you have enough PTO (Paid Time Off) days remaining and your vacation requests are approved."

Even though the semantic meaning of the two responses is the same, traditional metrics like BLEURT will score poorly here because of the different number of tokens. Such metrics may suffice for simple yes/no queries but fail in evaluating complex question-answering scenarios, where responses often depend on nuanced interactions and longer contexts.

Wouldn't it be convenient to have easy and reliable evaluation metrics for the answers we give to the user, metrics that also don't cost a fortune? When evaluating a retriever, you can easily measure Recall@k. But when evaluating an LLM, it seems like you can't do anything other than calling other (expensive) LLMs to evaluate your already LLM-generated answer. In this post, we show exactly that: how to use classification metrics in RAG, with a simple trick.

Typical RAG example (with arbitrary chunks)

In the standard RAG approach, the raw documents from the Knowledge Base (KB) are segmented into chunks, often splitting content mid-sentence or without meaningful context. Consider the example user query: "How often can I have vacation?"

The retriever retrieves the top 5 chunks:

Chunk 1: "Employees can request time off as long as they have sufficient PTO balance.."
Chunk 2: "..Vacation requests must be approved by the employee's department.."
Chunk 3: "There are different types of leaves, like maternity leave.."
Chunk 4: "..Other types of leaves include the Sick Leave, where, according to the labor policy, you are able to.."
Chunk 5: "Employees must submit vacation requests via the company portal, and HR policies must be followed.."

These chunks are concatenated and passed as the "context" to the LLM, which is asked to provide an answer based on it, finally responding with:

"You can take vacations as long as you have enough PTO days, but requests must be approved by your department."

When we look at this single user query and the LLM's answer, it seems, and indeed is, correct. However, there is no easy way to automate this manual response evaluation at a large scale, other than using the expensive LLM-as-a-Judge approach. In LLM-as-a-Judge, you typically call a second LLM API and instruct it to assess the quality of the predicted answer, often by comparing it with a gold answer provided by a human annotator or by asking the LLM to review the predicted answer along several dimensions (such as response relevance). This costs a lot of money and time, especially when you have thousands of user queries.

Also, the answer currently lacks clear grounding in a single authoritative source. Of course, we could just instruct the LLM to also provide citations. In that case, the LLM would need to cite both Chunk #1 and Chunk #2, which, in practice, come from the same original document, and this could be a bit "confusing" for our end-user, who has to look at "multiple" sources.

Our approach: RAG with Citations (RAG+C)

RAG+C builds on standard RAG but with one key difference: the LLM is prompted to select the single most relevant (self-contained) article from the retrieved set and ground its response on it, including an inline citation. This enables us to measure the RAG system's performance with standard classification metrics, by checking whether the cited article matches the labeled one. If the LLM selected the correct citation article, then it (most likely) also gave the correct answer to the user. By looking at the cited article, you can measure metrics like accuracy, without resorting to expensive methodologies like LLM-as-a-Judge.

Let's consider an example of a user asking about their vacation days. The user query is "How often can I have vacation?" The retriever then identifies the top 3 relevant self-contained articles (instead of arbitrary chunks):

ID 17: D. Employee Benefits - 1. Vacation Benefits (PTO) (this article includes previous Chunk #1, Chunk #2, and others)
ID 20: D. Employee Benefits - 4. Other Types of Leave (Maternity, etc.) (this article includes previous Chunk #3, Chunk #4, and others)
ID 66: H. FAQ - 2. Platform Requests to HR (this article includes previous Chunk #5 and others)

The LLM is then prompted, selects ID 17 as the most relevant article, and generates its response based on it:

"There is no limit to how often you can take a vacation as long as you still have PTO days left. Vacation days must be approved by the department. [ID: 17]"
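For illustration, a RAG+C prompt can be assembled along the following lines. This is only a sketch under our own assumptions: the instruction wording, the article fields, and the [ID: N] citation format are examples, not the exact production prompt.

```python
# Illustrative sketch: build a RAG+C prompt from retrieved self-contained articles.
# The article texts are elided ("..."); the instruction wording is an example only.

articles = [
    {"id": 17, "title": "D. Employee Benefits - 1. Vacation Benefits (PTO)", "text": "..."},
    {"id": 20, "title": "D. Employee Benefits - 4. Other Types of Leave", "text": "..."},
    {"id": 66, "title": "H. FAQ - 2. Platform Requests to HR", "text": "..."},
]

def build_rag_c_prompt(query: str, articles: list[dict]) -> str:
    # Present every retrieved article with its ID so the LLM can cite it inline.
    formatted = "\n\n".join(
        f"[ID: {a['id']}] {a['title']}\n{a['text']}" for a in articles
    )
    return (
        "You are an HR assistant. Below are knowledge-base articles.\n"
        "Pick the SINGLE most relevant article, answer the question using only "
        "that article, and end your answer with its citation in the form [ID: <number>]. "
        "If no article contains the answer, say you cannot answer.\n\n"
        f"Articles:\n{formatted}\n\nQuestion: {query}"
    )
```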
By grounding responses in a single cited source, RAG+C enables the easy evaluation of GenAI chatbot outputs, "converting" this "complex" question-answering task into a multi-class classification task that can be monitored with metrics like accuracy. All we need to do is:

- Split the Knowledge Base into self-contained segments to be embedded in the vector database, instead of smaller chunks (more on that later)
- Instruct the LLM to select only one segment and report it in the generated response

If these two things are done, then we can easily determine whether an answer is "correct" just by checking, with a classic regular expression, whether the LLM returned [ID: 17] in its response or not. If the LLM cited some other article, then we automatically know that the answer is "wrong", since that other article does not contain information relevant to the user query. This is a consequence of the self-contained design of our Knowledge Base. Remember, in RAG+C, articles are self-contained, and this is the key property that lets us measure accuracy on the chatbot's answers easily. With typical arbitrary chunking, this is not possible, since the answer is split across many chunks and you do not have that "user query to correct article" alignment.

But, wait... what if the LLM indeed selects and cites the correct article, but hallucinates in its answer? Can that happen?

Theoretically, yes. However, when you use a leading model (such as OpenAI's GPT-4o) and "ground" it on a business document inside the instruction prompt (using RAG), hallucinations practically do not happen. To validate this, we even ran some extra experiments that also included smaller, weaker ("and dumber") LLMs. After reviewing the results manually, we verified our intuition: if the LLM selects the correct (self-contained) article in RAG+C, then it also provides a correct answer to the user query.

Design and Preparation of a Knowledge Base (KB) with self-contained articles: The Key to RAG+C Success

As noted previously, for RAG+C to work effectively, the knowledge base must be carefully designed and prepared. This begins with splitting raw company documents into self-contained articles that can be embedded in a vector database. Without this crucial step, the system cannot reliably ground responses or support metrics like accuracy. This is the core element that enables RAG+C to work. We believe that splitting/chunking complex documents into self-contained articles allows the system to work better, i.e., to have better retrieval and response generation. Don't believe us? Check the manual benchmark from JinaAI on "Finding Optimal Breakpoints In Long Documents" at https://jina.ai/news/finding-optimal-breakpoints-in-long-documents-using-small-language-models.

Of course, splitting documents in a self-contained way takes more time than doing arbitrary multi-sentence chunking. To address that, we collaborate closely with our customers and have developed automated, human-in-the-loop tools to streamline the process.

Results on a real-world RAG+C dataset

By utilizing the RAG+C approach, we can now calculate, at scale and at no cost, how good an LLM is at answering user queries. All we need for our test set is a mapping from each user query to the self-contained article that contains the answer. If the LLM picks the correct article, then it knows how to ground its response there and responds correctly. We can then measure the accuracy of the LLM.
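Concretely, the evaluation loop can be as simple as the sketch below. The regular expression, the test-set fields, and the chatbot_answer callable are illustrative assumptions; the point is that a citation match turns the evaluation into plain classification accuracy.

```python
# Illustrative sketch: score a RAG+C test set with classification accuracy.
# `test_set` maps each user query to the gold self-contained article ID;
# `chatbot_answer` stands for whatever function produces the RAG+C response.
import re

CITATION_RE = re.compile(r"\[ID:\s*(\d+)\]")

def cited_article_id(response: str) -> int | None:
    # Extract the inline citation, e.g. "[ID: 17]" -> 17.
    match = CITATION_RE.search(response)
    return int(match.group(1)) if match else None

def accuracy(test_set: list[dict], chatbot_answer) -> float:
    correct = 0
    for example in test_set:
        response = chatbot_answer(example["query"])
        if cited_article_id(response) == example["gold_article_id"]:
            correct += 1
    return correct / len(test_set)

# Example of a single labeled test case:
# test_set = [{"query": "How often can I have vacation?", "gold_article_id": 17}]
```

With a labeled test set of this form, the same loop extends naturally to other classification metrics, such as per-article precision and recall or a confusion matrix.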
At helvia, we have built our own real-world benchmarks to assess the performance of LLMs. Here is an example of how different LLMs score on our own HR support chatbot, which operates on our own HR documents. For the test set, we manually compiled ~700 labeled user queries and mapped each one to the "correct self-contained citation document" it should be matched with. In this example, retrieval uses top_k=3 and OpenAI's text-embedding-3-large model with 1024 dimensions.

RAG with Citations (RAG+C) enables us to measure the accuracy of LLMs in our GenAI chatbots, instead of using expensive methodologies like LLM-as-a-Judge. As one would expect, OpenAI's most expensive models (GPT-4, GPT-o1) are at the top in performance, while, interestingly, Google's small model, Gemini 1.5 Flash, is also really competitive. On the other side, smaller models like Llama 3.1-8b and Mistral-7b fail to correctly identify the citation articles that contain the answers for most user queries.

In a nutshell, why RAG+C?

Approaching RAG with citations is valuable for several more reasons:

- Streamlining Action Workflows: With citations included in the answers, we can automate various actions through APIs. For instance, if a user receives an answer citing an article about "Support Issues", we can automatically send a tailored follow-up email.
- User Transparency: If you asked ChatGPT for something crucial, would you trust it if you couldn't verify it? Providing citations allows users to cross-check answers.
- No more LLM hallucinations: The LLM is specifically prompted to answer with citation-grounded responses. If it cannot find relevant and concrete information inside an article, it will be conservative and will not respond, thus decreasing hallucinations. This is also the case in typical RAG.
- ..but, most importantly, easier scientific evaluation: By incorporating citations, we can measure an LLM's performance by treating the problem largely as a classification task. Instead of using traditional, outdated metrics (like BLEURT or ROUGE) or expensive methods (like LLM-as-a-Judge), we can calculate metrics like the easily explained accuracy, on thousands of user queries, at no cost.

In the next blog post of this series, we are going to dive deeper into more findings from our own benchmarks, along with some more RAG+C metrics we designed, which allow us to understand the chatbot's answers from different points of view.