In this blog post, we introduce GRACE, which stands for Grounded Retrieval-Augmented Citation Evaluation, a technique that helps us evaluate LLMs effectively and affordably using simple classification metrics (like accuracy), simply by comparing the LLM-selected citation against the annotated citation for each LLM-generated answer. To enable GRACE and use citations for evaluation, we need to follow a specific RAG architecture with specific Knowledge Base preprocessing, which we call Grounded RAG with Citations (Grounded RAG+C), explained later. Using GRACE, we can benchmark, evaluate, and identify the best LLM for each use case. GRACE avoids the high cost of methodologies like LLM-as-a-Judge and the limitations of older metrics like ROUGE or BLEURT, which work well only for short and simple chatbot responses.

The main issue: How to evaluate RAG?

Retrieval-Augmented Generation (RAG) is the go-to approach for enabling LLMs to perform Question Answering on proprietary documents, such as business knowledge bases.

While there are different business/research questions behind RAG, one is the most important: Did we provide the correct answer to the user?

The key problem in typical RAG systems is the lack of standardized, robust metrics to evaluate LLM chatbot responses. One approach is to use an LLM-as-a-Judge (i.e., asking a different LLM to score responses on relevance or groundedness), and while this is promising, it is expensive to scale, especially for high-volume tasks.

Another approach is to use traditional Natural Language Generation (NLG) metrics like ROUGE or BLEURT, but these are limited too. For instance, let's say you have a user query like "How often can I take vacation days?" and the following responses:

  • an LLM-generated response, e.g., "You can take vacations as long as you have enough PTO days, but requests must be approved by your department."
  • and a "gold" response, which might be a bit more lengthy, such as "There is no limit to how often you can take vacation days, as long as you have enough PTO (Paid Time Off) days remaining and your vacation requests are approved."

Even when the semantic meaning of these two responses is the same, traditional metrics like BLEURT will score this pair poorly due to the difference in sentence length. Such metrics may suffice for simple yes/no queries but fail to evaluate more complex question-answering scenarios, where responses often depend on nuanced interactions and longer contexts.

Wouldn't it be great to have simple and reliable evaluation metrics to measure the answers generated by an LLM, without incurring significant costs? In this post, we show exactly that by introducing a simple trick that allows you to leverage classification metrics for evaluating RAG systems effectively, using GRACE.

Typical RAG example

In the standard RAG approach, raw documents from the Knowledge Base (KB) are divided into chunks, often splitting content mid-paragraph or excluding meaningful context.

Here's how classic RAG works:

  1. The user query is converted into a text embedding.
  2. A semantic search retrieves the top-K most relevant chunks from the embedded knowledge base using cosine similarity.
  3. These chunks (the 'context') are concatenated with the user query and passed to the LLM as input.
  4. The LLM generates an answer using the context-augmented input.
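
To make the flow concrete, here is a minimal sketch of these steps in Python, assuming a generic `embed()` helper and a precomputed matrix of chunk embeddings (both are placeholders for illustration, not tied to any specific library):

```python
import numpy as np

# Minimal sketch of the classic RAG retrieval step.
# `embed()` stands in for any text-embedding call; `kb_embeddings` is a
# hypothetical precomputed (num_chunks x dim) matrix of chunk embeddings.

def retrieve_top_k(query: str, kb_embeddings: np.ndarray, chunks: list[str],
                   embed, k: int = 5) -> list[str]:
    q = embed(query)                                     # 1) embed the user query
    q = q / np.linalg.norm(q)
    kb = kb_embeddings / np.linalg.norm(kb_embeddings, axis=1, keepdims=True)
    scores = kb @ q                                      # 2) cosine similarity vs. every chunk
    top_idx = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_idx]                  # top-K most relevant chunks

def build_prompt(query: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)                # 3) concatenate chunks as 'context'
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."
    # 4) this prompt is then passed to the LLM, which generates the answer
```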

Consider the example user query: "How often can I take a vacation?". Following standard RAG, the retriever returns the top 5 chunks:

  • Chunk 1: "Employees can request time off as long as they have sufficient PTO balance.."
  • Chunk 2: "..Vacation requests must be approved by the employee's department.."
  • Chunk 3: "There are different types of leaves, like maternity leave.."
  • Chunk 4: "..Other types of leaves include the Sick Leave, where, according to the labor policy, you are able to .."
  • Chunk 5: "Employees must submit vacation requests via the company portal, and HR policies must be followed.."

These chunks are concatenated and passed as "context" to the LLM along with the user query; the LLM is asked to provide an answer based on this context and finally responds with:
"You can take vacations as long as you have enough PTO days, but requests must be approved by your department."

When we look at this single user query and the LLM's answer, it seems fine, and it is indeed correct. However, there is no easy way to automate this manual response evaluation, other than perhaps utilizing the expensive LLM-as-a-Judge approach, which comes with its own limitations.

Also, currently, the answer lacks clear grounding in a single authoritative source. Of course, we could just instruct the LLM to also provide citations. In that case, the LLM would need to cite both Chunk #1 and Chunk #2, which, in practice, come from the same original document, and this could be confusing for end users, who would have to look at multiple sources.

Our approach: Grounded RAG with Citations (Grounded RAG+C) architecture

In order to evaluate LLMs in RAG easily, we follow a custom architecture which we call Grounded RAG with Citations. Grounded RAG+C builds upon standard RAG by introducing one key improvement: the explicit use of citations, where the LLM is prompted to ground its response on a single retrieved article and include an inline citation.

This Grounded RAG+C approach enables us to evaluate system performance based on citations (a technique we call GRACE), using classification metrics like accuracy. More specifically, with GRACE, we check whether the cited article matches a predefined correct article for the query. Simply put, if the correct article is cited, the LLM has likely given the correct answer. In pure accuracy terms: we count a hit if the generated answer contains the correct citation, and a miss if it does not. This way, GRACE eliminates the need for expensive evaluation methodologies like LLM-as-a-Judge.
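
As a rough illustration (the function and variable names here are ours, purely for the sketch), the GRACE score then boils down to a few lines:

```python
def grace_accuracy(predicted_ids: list[str], gold_ids: list[str]) -> float:
    """Fraction of test queries where the LLM cited the annotated correct article."""
    hits = sum(pred == gold for pred, gold in zip(predicted_ids, gold_ids))
    return hits / len(gold_ids)

# e.g., grace_accuracy(["17", "20", "17"], ["17", "17", "17"]) -> 0.666...
```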

The cornerstone of GRACE is the way the Knowledge Base (KB) is organized in the Grounded RAG+C approach. Instead of arbitrary chunks, we split the KB into self-contained articles. A self-contained article is structured so that it provides complete information on a topic (i.e., the article is "grounded"), allowing the LLM to answer user questions without citing or needing to rely on additional articles.

Difference between Typical RAG and Grounded RAG+C. In typical RAG, evaluating the LLM-generated answer is a complex task, since the answer depends on many arbitrary chunks. In Grounded RAG+C, the self-contained documents of a Knowledge Base can be cited individually by the LLM. Thus, we can evaluate whether the LLM-generated answer is correct by checking whether the LLM picked the correct citation. We call this evaluation method GRACE: Grounded Retrieval-Augmented Citation Evaluation.

Let's consider the previous example of a user asking about their vacation days. The user query is "How often can I take a vacation?" Following Grounded RAG+C, the retriever returns the top 3 relevant self-contained articles (instead of arbitrary chunks):

  • ID 17: D. Employee Benefits - 1. Vacation Benefits (PTO)
    • (this self-contained article includes previous chunks #1, #2, and others)
  • ID 20: D. Employee Benefits - 4. Other Types of Leave (Maternity, etc.)
    • (this self-contained article includes previous chunks #3, #4, and others)
  • ID 66: H. FAQ - 2. Platform Requests to HR
    • (this self-contained article includes previous chunks #5 and others)

Then, the LLM is prompted and selects ID 17 as the most relevant article, generating the response based on this self-contained article:
"There is no limit to how often you can take a vacation as long as you still have PTO days left. Vacation days must be approved by the department. [ID: 17]"

By grounding responses in a single cited source, Grounded RAG+C enables the straightforward evaluation of GenAI chatbot outputs. Essentially, the design of Grounded RAG+C converts the complex evaluation problem into a multi-class classification task of comparing citations (GRACE), where we can use simple, explainable metrics like accuracy. In the Grounded RAG+C architecture, all we need to do is:

  1. Split the Knowledge Base into self-contained articles to be embedded in the vector database instead of smaller chunks (more on that later)
  2. Instruct the LLM to select exactly one article and report it in the generated response
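
For step 2, the instruction can be as simple as a system prompt along the following lines (the wording below is an illustrative assumption, not our production prompt):

```python
# Illustrative system prompt for step 2; the exact wording is an assumption,
# not the production prompt used in Grounded RAG+C.
SYSTEM_PROMPT = """You are an HR assistant. You are given a user question and
several self-contained articles, each with a numeric ID.
- Answer using exactly ONE article: the single most relevant one.
- Ground every statement in that article; do not use outside knowledge.
- End your answer with the citation in the form [ID: <article_id>].
- If no article contains the answer, say you cannot answer and cite nothing."""
```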

If these two things are done, then, using classic regular expressions or a structured JSON output, we can extract the citation ID that the LLM selected to base its response on. Then we can measure a "correct" answer simply by verifying whether the LLM cited the article [ID: 17] in its response or not (GRACE). If the LLM returned some other cited article, then we automatically know that the answer should be classified as "wrong", since that other article does not contain information relevant to that user query. This is due to our self-contained design of the Knowledge Base. With typical chunking strategies, this is not possible, since the answer is split across many chunks and you do not have that "user query to correct article" alignment.
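
For example, with the inline [ID: ...] format used in the responses above, the extraction and check could look like this sketch (the pattern and helper name are illustrative):

```python
import re

CITATION_PATTERN = re.compile(r"\[ID:\s*(\d+)\]")   # matches e.g. "[ID: 17]"

def extract_citation(answer: str) -> str | None:
    """Pull the cited article ID out of an LLM-generated answer, if present."""
    match = CITATION_PATTERN.search(answer)
    return match.group(1) if match else None

answer = ("There is no limit to how often you can take a vacation as long as you "
          "still have PTO days left. Vacation days must be approved by the "
          "department. [ID: 17]")
assert extract_citation(answer) == "17"   # hit: the gold article for this query is ID 17
```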

But, wait... what if the LLM indeed selects and cites the correct self-contained article, but hallucinates in its answer? Can that happen?

Theoretically, yes. However, when using a leading model such as OpenAI's GPT-4o and "grounding" it on a self-contained business document inside the instruction prompt via RAG, hallucinations practically do not happen. To validate this, we ran additional experiments that also included some smaller, weaker LLMs. We checked the results manually, and they verified our intuition: if the LLM selects the correct self-contained article in Grounded RAG+C, then it also provides a correct answer to the user query.

Design and Preparation of a Knowledge Base (KB) with self-contained articles

As noted previously, for GRACE to work, the knowledge base should be optimized in a certain way. This starts with splitting raw company documents into grounded, self-contained articles that can be embedded in a vector database. Without this crucial step, the system cannot reliably ground responses or support metrics like accuracy. This is the core element that enables our evaluation technique, GRACE, to work.

Here is what our internal HR Knowledge Base looks like after preprocessing. For all topics, like Organization Description or Work Conditions & Hours (left side), there are self-contained articles (right side). These articles are the ones used in the retrieval phase of the Grounded RAG+C "architecture", which helps us evaluate LLMs using citations (GRACE). Since most knowledge bases come in a single .pdf or .doc file, we have built automatic tools to preprocess such documents into a KB with self-contained articles, like the one shown here.
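
To make this concrete, a single preprocessed self-contained article could be stored as a record like the following (the field names are illustrative, not a fixed schema):

```python
# Illustrative record for one self-contained article; field names are assumptions.
article = {
    "id": 17,
    "topic": "D. Employee Benefits",
    "title": "1. Vacation Benefits (PTO)",
    "body": (
        "Employees can request time off as long as they have sufficient PTO balance. "
        "Vacation requests must be approved by the employee's department. ..."
    ),
}
# The `body` is what gets embedded into the vector database, and the `id`
# is what the LLM is asked to cite inline, e.g. "[ID: 17]".
```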

We believe that splitting/chunking complex documents into self-contained articles allows the system to work better, i.e., to have better retrieval and response generation. This is an ongoing research direction; for instance, check JinaAI's benchmarks on "Finding Optimal Breakpoints In Long Documents" at https://jina.ai/news/finding-optimal-breakpoints-in-long-documents-using-small-language-models.

Of course, splitting the documents into self-contained articles will take more time than just chunking at an arbitrary multi-sentence boundary. To address that, we collaborate closely with our customers and have developed automated, human-in-the-loop tools to streamline the process.

Using GRACE to get results in a real-world Grounded RAG+C dataset

By utilizing the Grounded RAG+C architecture (i.e., splitting the documents into grounded, self-contained articles about one topic each), we can use GRACE to compare predicted vs. annotated citations. Having done that, we can now calculate, at scale and at very low cost, how good an LLM is at answering user queries. All we need is a mapping (annotation) from each user query in our test set to the self-contained article containing the relevant information. If the LLM picks the correct article, then it knows how to ground its response there and responds correctly. We can then measure the accuracy of the LLM.
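
In practice, the test-set annotation can be as simple as a list of (user query, correct article ID) pairs, for example (the layout and the second query below are illustrative):

```python
# One possible test-set layout; the exact fields are an assumption.
test_set = [
    {"query": "How often can I take a vacation?", "gold_article_id": "17"},
    {"query": "What other types of leave exist besides vacation?", "gold_article_id": "20"},
]
# For each query: run Grounded RAG+C, extract the cited ID from the answer,
# and compare it against gold_article_id to compute GRACE accuracy.
```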

At Helvia, we built our own real-world benchmarks in order to assess the performance of LLMs. Here is an example of how different LLMs score in our internal HR support chatbot, which operates on our own HR documents (as seen in the previous section). For the test set, we manually compiled over 700 hard-difficulty, real-world user queries and linked each one to the "correct self-contained article" it should be matched with. For retrieval in this example, we use OpenAI's text-embedding-3-large model with 1024 dimensions, and the top 3 self-contained articles returned each time serve as context to our Grounded RAG+C system.

Grounded RAG with Citations enables us to measure the accuracy of LLMs in our GenAI chatbots using citations (GRACE), instead of using expensive methodologies like LLM-as-a-Judge. As one would expect, OpenAI's most expensive models (GPT-4, GPT-o1) are top in performance, while, interestingly, Google's small model, Gemini 1.5 Flash, is also really competitive. On the other side, smaller models like Llama 3.1-8b and Mistral-7b fail to correctly identify the citation articles that contain the answers for most user queries.

Bonus: More reasons to do Grounded RAG+C

Using Grounded RAG with Citations has additional advantages, beyond citation-based evaluation:

  1. 🌐 Streamlining Action Workflows: With citations included in the responses, we can automate various actions through APIs. For instance, if a user receives an answer related to a citation article about "Support Issues", we can automatically send a tailored email to follow up on that.
  2. 🤝 User Transparency: If you asked ChatGPT for something crucial, would you trust it if you couldn't verify it? Providing citations allows users to cross-check answers.
  3. ✅ No more LLM hallucinations: The LLM is explicitly prompted to answer with citation-grounded responses. If it can't find relevant and concrete information inside an article, it will be conservative and will not respond, thus decreasing hallucinations.

Takeaways

  • 🚀 We created GRACE, which stands for Grounded Retrieval-Augmented Citation Evaluation. GRACE helps us evaluate LLM performance in GenAI RAG-based chatbots by comparing predicted vs. annotated citations in LLM-generated answers.
  • 🔍 In order to evaluate RAG with GRACE, you need to do two things: a) optimize your Knowledge Base to have self-contained articles, each able to answer user questions on one specific topic, and b) explicitly instruct the LLM to pick the single best article (and its citation) to answer the user question. This chatbot design builds upon typical RAG and we call it "Grounded RAG with Citations" (Grounded RAG+C), since we "ground" the LLMs to use self-contained documents and their citations.
  • 💡 GRACE works like this: if the LLM selects and cites the correct article, then the provided answer is likely (automatically!) correct. GRACE (and the Grounded RAG+C chatbot design) enables low-cost, accurate LLM performance measurement without expensive methods like LLM-as-a-Judge or outdated methods like BLEURT. As a downside, though, it needs annotations beforehand, which LLM-as-a-Judge does not.
  • 📊 We use this strategy since it allows us to easily compare different LLMs using straightforward metrics like accuracy. In our real-world HR dataset, we found leading proprietary models like GPT-4 and Google's Gemini Flash to have the best accuracy, whereas smaller open-weight models like Mistral-7b and Llama-3.1-8b instruct had the worst performance, failing to answer around 50% of the user questions.

In the next blog post of this series, we are going to dive deeper into more findings from our own benchmarks, along with some more GRACE metrics we designed, which allow us to understand the chatbot's answers from different points of view.