<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Helvia.ai Labs]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://helvia.ai/labs/</link><image><url>https://helvia.ai/labs/favicon.png</url><title>Helvia.ai Labs</title><link>https://helvia.ai/labs/</link></image><generator>Ghost 5.73</generator><lastBuildDate>Tue, 14 Apr 2026 20:51:06 GMT</lastBuildDate><atom:link href="https://helvia.ai/labs/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[How to evaluate RAG with GRACE using classification metrics]]></title><description><![CDATA[<p>In this blog post, we introduce <strong>GRACE</strong>, which stands for <strong><em>Grounded Retrieval-Augmented Citation Evaluation</em></strong>, a technique that helps us <strong>evaluate LLMs</strong> effectively and affordably using simple classification metrics (like accuracy), just <strong>by comparing the LLM-selected citations vs. the annotated citations</strong> of an LLM-generated answer. 
In order to enable GRACE and</p>]]></description><link>https://helvia.ai/labs/rag-with-citations-and-grace/</link><guid isPermaLink="false">6732196f32e5a70008466166</guid><category><![CDATA[Posts]]></category><dc:creator><![CDATA[Lefteris Loukas]]></dc:creator><pubDate>Thu, 16 Jan 2025 15:19:00 GMT</pubDate><media:content url="https://helvia.ai/labs/content/images/2024/12/RAG_Blog_post_image-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://helvia.ai/labs/content/images/2024/12/RAG_Blog_post_image-1.png" alt="How to evaluate RAG with GRACE using classification metrics"><p>In this blog post, we introduce <strong>GRACE</strong>, which stands for <strong><em>Grounded Retrieval-Augmented Citation Evaluation</em></strong>, a technique that helps us <strong>evaluate LLMs</strong> effectively and affordably using simple classification metrics (like accuracy), just <strong>by comparing the LLM-selected citations vs. the annotated citations</strong> of an LLM-generated answer. In order to enable GRACE and use citations for evaluation, we need to follow a specific RAG architecture with specific Knowledge Base preprocessing, which we call Grounded RAG with Citations (Grounded RAG+C), explained later. 
<strong>Using GRACE, we can benchmark, evaluate, and identify the best LLM for each use case.</strong> GRACE avoids the high cost of methodologies like LLM-as-a-Judge or the limitations of older metrics like ROUGE or BLEURT, which work well only for short and simple chatbot responses.</p><h2 id="the-main-issue-how-to-evaluate-rag">The main issue: How to evaluate RAG?</h2><p><strong>Retrieval-Augmented Generation (RAG)</strong> is the go-to approach for enabling LLMs to perform Question Answering on proprietary documents, such as business knowledge bases.</p><p>While there are different<strong> business/research questions </strong>behind RAG, one is the most important:<em> <strong>Did we provide the correct answer to the user?</strong></em></p><p><strong>The key problem in typical RAG systems is the lack of standardized, robust metrics to evaluate LLM chatbot responses. </strong>One approach is to use an <strong>LLM-as-a-Judge</strong> (i.e., asking a different LLM to score responses on relevance or groundedness), and while this is promising, it is <strong>expensive to scale</strong>, especially for high-volume tasks. </p><p>Another approach is to use traditional Natural Language Generation (<strong>NLG</strong>) <strong>metrics</strong> like <strong>ROUGE</strong> or <strong>BLEURT</strong>, but these are <strong>limited</strong> <strong>too</strong>. 
For instance, let&apos;s say you have a user query like<em> &quot;How often can I take vacation days?&quot; </em>and the following responses:</p><ul><li>an LLM <em>generated</em> response, e.g., <em>&quot;You can take vacations as long as you have enough PTO days, but requests must be approved by your department.&quot;</em></li><li><em>and a &quot;gold&quot; response</em>, which might be a bit more lengthy, such as <em>&quot;There is no limit to how often you can take vacation days, as long as you have enough PTO (Paid Time Off) days remaining and your vacation requests are approved.&quot;</em></li></ul><p></p><p>Even though these two responses have the same semantic meaning, traditional metrics like BLEURT will score them poorly due to the difference in sentence length. Such metrics may suffice for simple yes/no queries but fail in evaluating more complex question-answering scenarios, where responses often depend on nuanced interactions and longer contexts. </p><p><strong>Wouldn&apos;t it be great to have simple and reliable evaluation metrics to measure the answers generated by an LLM, without incurring significant costs?</strong> In this post, <u>we show exactly that, by introducing a simple trick that allows you to leverage classification metrics for evaluating RAG systems effectively, using GRACE.</u></p><h2 id="typical-rag-example">Typical RAG example</h2><p>In the <strong>standard</strong> <strong>RAG</strong> approach, raw <strong>documents</strong> from the Knowledge Base (KB) are <strong>divided</strong> into chunks, <strong>often splitting content mid-paragraph or excluding meaningful context. 
</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://helvia.ai/labs/content/images/2024/07/Screenshot-2024-07-12-at-10.59.51-AM.png" class="kg-image" alt="How to evaluate RAG with GRACE using classification metrics" loading="lazy" width="1152" height="274" srcset="https://helvia.ai/labs/content/images/size/w600/2024/07/Screenshot-2024-07-12-at-10.59.51-AM.png 600w, https://helvia.ai/labs/content/images/size/w1000/2024/07/Screenshot-2024-07-12-at-10.59.51-AM.png 1000w, https://helvia.ai/labs/content/images/2024/07/Screenshot-2024-07-12-at-10.59.51-AM.png 1152w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Here&#x2019;s how classic RAG works. 1) The user query is converted into a text embedding. 2) A semantic search retrieves the top-K most relevant chunks from the embedded knowledge base using cosine similarity. 3) These chunks (the &apos;context&apos;) are concatenated with the user query and passed to the LLM as input. 
4) The LLM generates an answer using the context-augmented input.</em></i></figcaption></figure><p>Consider the example user query:<em> &quot;How often can I take a vacation?&quot;.</em> Following standard RAG, the retriever retrieves the top 5 chunks:</p><ul><li><strong>Chunk 1:</strong> <em>&quot;Employees can request time off as long as they have sufficient PTO balance..&quot;</em></li><li><strong>Chunk 2:</strong> <em>&quot;..Vacation requests must be approved by the employee&apos;s department..&quot;</em></li><li><strong>Chunk 3:</strong> <em>&quot;There are different types of leaves, like maternity leave..&quot;</em></li><li><strong>Chunk 4:</strong> <em>&quot;..Other types of leaves include the Sick Leave, where, according to the labor policy, you are able to ..&quot;</em></li><li><strong>Chunk 5:</strong> <em>&quot;Employees must submit vacation requests via the company portal, and HR policies must be followed..&quot;</em></li></ul><p></p><p>These chunks are concatenated and passed as <em>&quot;context&quot;</em> to the LLM along with the user query, where the <strong>LLM</strong> is asked to provide an <strong>answer</strong> based on it, finally responding with:<br><em>&quot;You can take vacations as long as you have enough PTO days, but requests must be approved by your department.&quot;</em></p><p>When we look at this single user query and the LLM&apos;s answer, it seems fine and it is indeed correct.<strong> However, there is no easy way to automate this manual response evaluation</strong>, other than perhaps utilizing the expensive LLM-as-a-Judge approach, which comes with its own limitations.</p><p><strong>Also,</strong> currently, <strong>the answer lacks clear grounding in a single authoritative source. </strong>Of course, we could just instruct the LLM to also provide citations. 
In that case, the LLM would need to cite <u>both</u> Chunk #1 and Chunk #2, which, in practice, come from the same original document, and this could be confusing for end-users, who would have to look at multiple sources.</p><h2 id="our-approach-grounded-rag-with-citations-grounded-ragc-architecture">Our approach: Grounded RAG with Citations (Grounded RAG+C) architecture </h2><p>In order to evaluate LLMs in RAG easily, we follow a custom architecture which we call <strong><em>Grounded RAG with Citations</em></strong>. Grounded RAG+C builds upon standard RAG by introducing one key improvement: the explicit use of <strong>citations</strong>, where the <strong>LLM is prompted to ground its response on a single retrieved article</strong> and <strong>include an inline citation</strong>.</p><p>This Grounded RAG+C approach enables us to evaluate the system performance based on citations (a technique we call GRACE), using classification metrics like accuracy. More specifically, with GRACE, <u>we check whether the cited article matches a predefined correct article for the query.</u> Simply put, <strong>if the correct article is cited, the LLM has likely given the correct answer. </strong>In pure accuracy terms: we count a hit if the generated answer contains the correct citation article, and we count a miss if it does not. 
This way GRACE <u>eliminates the need</u> for expensive evaluation methodologies like <u>LLM-as-a-Judge</u>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://helvia.ai/labs/content/images/2024/12/Screenshot-2024-12-06-at-3.40.15-PM.png" class="kg-image" alt="How to evaluate RAG with GRACE using classification metrics" loading="lazy" width="703" height="405" srcset="https://helvia.ai/labs/content/images/size/w600/2024/12/Screenshot-2024-12-06-at-3.40.15-PM.png 600w, https://helvia.ai/labs/content/images/2024/12/Screenshot-2024-12-06-at-3.40.15-PM.png 703w"><figcaption><span style="white-space: pre-wrap;">Difference between Typical RAG and Grounded RAG+C. In typical RAG, evaluating the LLM-generated answer is a complex task since the answer depends on many arbitrary chunks. In Grounded RAG+C, the self-contained documents of a Knowledge Base can be cited individually by the LLM. Thus, we can evaluate whether the LLM-generated answer is correct by checking whether the LLM picked the correct citation. We call this evaluation method GRACE: Grounded Retrieval-Augmented Citation Evaluation.</span></figcaption></figure><p><strong>The cornerstone of GRACE is the way the Knowledge Base (KB) is organized in the Grounded RAG+C approach</strong>. Instead of arbitrary chunks, we split the KB into <strong>self-contained articles</strong>. A self-contained article is structured so that it provides <u>complete information on a topic (i.e., the article is &quot;grounded&quot;),</u> allowing the LLM to answer user questions without <em>citing</em> or needing to rely on additional articles. </p><p>Let&#x2019;s consider the previous example of a user asking about their vacation days. The user query is <em>&quot;How often can I take a vacation?&quot;</em> Following Grounded RAG+C, the retriever retrieves the top 3 relevant <em>self-contained</em> articles (instead of arbitrary chunks):</p><ul><li><strong>ID 17:</strong> <em>D. Employee Benefits - 1. 
Vacation Benefits (PTO) </em><ul><li>(this self-contained article includes previous chunks #1, #2, and others)</li></ul></li><li><strong>ID 20:</strong><em> D. Employee Benefits - 4. Other Types of Leave (Maternity, etc.)</em><ul><li>(this self-contained article includes previous chunks #3, #4, and others)</li></ul></li><li><strong>ID 66:</strong> <em>H. FAQ - 2. Platform Requests to HR</em><ul><li>(this self-contained article includes previous chunks #5 and others)</li></ul></li></ul><p></p><p>Then, the LLM is prompted and selects <strong>ID 17</strong> as the most relevant article, generating the response based on this self-contained article:<br><em>&quot;There is no limit to how often you can take a vacation as long as you still have PTO days left. Vacation days must be approved by the department. <strong>[ID: 17]</strong>&quot;</em></p><p><u>By grounding responses in a single cited source</u>, Grounded RAG+C enables the straightforward evaluation of GenAI chatbot outputs. Essentially, the <strong>design of Grounded RAG+C converts the complex evaluation problem</strong> to a <strong>multi-class classification task of comparing citations (GRACE), where we can use simple and explainable metrics like accuracy, etc.</strong> In the Grounded RAG+C architecture, all we need to do is:</p><ol><li>Split the Knowledge Base into self-contained articles to be embedded in the vector database instead of smaller chunks (more on that later)</li><li>Instruct the LLM to select exactly one article and report it in the generated response</li></ol><p></p><p>If these two things are done, then, using classic regular expressions or a structured JSON output, we can extract the citation ID that the LLM selected to base its response on. 
Then <strong>we can measure a &quot;correct&quot; answer simply by verifying whether the LLM cited the article <u>[ID: 17]</u> in its response or not (GRACE)</strong>. <u>If the LLM returned some other cited article, then we automatically know that the answer should be classified as &quot;wrong&quot;</u>, since that other article does not contain information relevant to that user query. <u>This is due to our self-contained design in the Knowledge Base.</u> In typical chunking strategies, this is not possible, since the answer is split across many chunks and you do not have that &quot;user query to correct article&quot; alignment.</p><h3 id="but-wait-what-if-the-llm-indeed-selects-and-cites-the-correct-self-contained-article-but-hallucinates-in-its-answer-can-that-happen">But, wait... what if the LLM indeed selects and cites the correct self-contained article, but hallucinates in its answer? Can that happen?</h3><p>Theoretically, yes. However, when using a leading model such as OpenAI&apos;s GPT-4o, you &quot;ground&quot; the model on a self-contained business document inside the instruction prompt using RAG, and, practically, hallucinations do not happen. To validate this, we even ran additional experiments including some smaller, weaker LLMs. After checking the results manually, we verified our intuition: <u>if the LLM selects the correct self-contained article in Grounded RAG+C, then the LLM also provides a correct answer to the user query.</u></p><h2 id="design-and-preparation-of-a-knowledge-base-kb-with-self-contained-articles"><strong>Design and Preparation of a Knowledge Base (KB) with self-contained articles</strong></h2><p>As noted previously, for GRACE to work, the knowledge base should be optimized in a certain way. This starts with <u>splitting raw company documents into grounded, self-contained articles</u> that can be embedded in a vector database. 
Without this crucial step, the system cannot reliably ground responses or support metrics like accuracy. <u>This is the core element that enables our evaluation technique, GRACE, to work</u>. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://helvia.ai/labs/content/images/2024/11/example_of_self_contained_KB_cropped.png" class="kg-image" alt="How to evaluate RAG with GRACE using classification metrics" loading="lazy" width="695" height="374" srcset="https://helvia.ai/labs/content/images/size/w600/2024/11/example_of_self_contained_KB_cropped.png 600w, https://helvia.ai/labs/content/images/2024/11/example_of_self_contained_KB_cropped.png 695w"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Here is how our internal HR Knowledge Base looks after preprocessing it. For all topics like Organization Description or Work Conditions &amp; Hours (left side), there are self-contained articles (right side). These articles are the ones used in the retrieval phase of the Grounded RAG+C &quot;architecture&quot;, which helps us evaluate LLMs using citations (GRACE). Since most knowledge bases come in a single .pdf or .doc file, we have built automatic tools to preprocess such documents into a KB with self-contained articles, like the one here.</em></i></figcaption></figure><p><strong>We believe that splitting/chunking complex documents into self-contained articles allows the system to work better</strong>, i.e., have better retrieval and response generation. This is an ongoing research direction. 
For instance, check JinaAI&apos;s benchmarks on &quot;Finding Optimal Breakpoints In Long Documents&quot; at <a href="https://jina.ai/news/finding-optimal-breakpoints-in-long-documents-using-small-language-models?ref=helvia.ai">https://jina.ai/news/finding-optimal-breakpoints-in-long-documents-using-small-language-models</a>.</p><p>Of course, splitting the documents into self-contained articles will take more time than just chunking at an arbitrary multi-sentence boundary. To address that, we collaborate closely with our customers and have developed automated, human-in-the-loop tools to streamline the process.</p><h2 id="using-grace-to-get-results-in-a-real-world-grounded-ragc-dataset">Using GRACE to get results in a real-world Grounded RAG+C dataset </h2><p>By utilizing the Grounded RAG+C architecture (i.e., splitting the documents into grounded and self-contained articles about one topic), we can use GRACE to compare predicted vs. annotated citations. With that done, we can now calculate, at scale and at very low cost, how good an LLM is at providing answers to the user queries. All we need to do is have a mapping (annotation) of the user query to the self-contained article containing this information for our test set. <strong>If the LLM picks the correct article, then it knows how to ground its response there and responds correctly.</strong> We can then measure the accuracy of the LLM.</p><p>At Helvia, we built our own real-world benchmarks in order to assess the performance of LLMs. Here is an example of how different LLMs score in our internal HR support chatbot, which operates on our own HR documents (as seen in the previous section). 
For the test set, we have manually compiled<strong> over 700 hard-difficulty real-world user queries</strong> <strong>and linked them with the &quot;correct self-contained article&quot; they should be matched with.</strong> In this example, for the retrieval, we use OpenAI&apos;s <em>text-embedding-3-large</em> model with <em>1024</em> dimensions and we use the top 3 self-contained articles returned each time as context to our RAG+C system.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://helvia.ai/labs/content/images/2025/03/Screenshot-2025-03-26-at-11.35.57-AM.png" class="kg-image" alt="How to evaluate RAG with GRACE using classification metrics" loading="lazy" width="1500" height="618" srcset="https://helvia.ai/labs/content/images/size/w600/2025/03/Screenshot-2025-03-26-at-11.35.57-AM.png 600w, https://helvia.ai/labs/content/images/size/w1000/2025/03/Screenshot-2025-03-26-at-11.35.57-AM.png 1000w, https://helvia.ai/labs/content/images/2025/03/Screenshot-2025-03-26-at-11.35.57-AM.png 1500w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Grounded RAG with Citations enables us to measure the accuracy of LLMs in our GenAI chatbots using citations (GRACE), instead of using expensive methodologies like LLM-as-a-Judge. As one would expect, OpenAI&apos;s most expensive model engines (GPT-4, GPT-o1) are top in performance, while interestingly, Google&apos;s small model, Gemini 1.5 Flash, is also really competitive. 
On the other hand, smaller models like Llama 3.1-8b and Mistral-7b fail to correctly identify the citation articles that contain the answers for most user queries.</em></i><span style="white-space: pre-wrap;"> See the complete leaderboard with more experiments at </span><a href="https://helvia.ai/labs/grace-leaderboard/"><span style="white-space: pre-wrap;">https://helvia.ai/labs/grace-leaderboard/</span></a><span style="white-space: pre-wrap;">.</span></figcaption></figure><h2 id="bonus-more-reasons-to-do-grounded-ragc">Bonus: More reasons to do Grounded RAG+C</h2><p>Using Grounded RAG with Citations has additional advantages, other than evaluating the citations:</p><ol><li>&#x1F310;<strong> Streamlining Action Workflows: </strong>With citations included in the responses, we can automate various actions through APIs. For instance, if a user receives an answer related to a citation article about &quot;Support Issues&quot;, we can automatically send a tailored email to follow up on that.</li><li>&#x1F91D;<strong> User Transparency: </strong>If you asked ChatGPT for something <u>crucial</u>, would you trust it if you couldn&apos;t verify it?<strong> </strong>Providing citations allows users to cross-check answers.</li><li>&#x2705; <strong>Fewer LLM hallucinations: </strong>The LLM is <em>explicitly prompted</em> to answer with <u>citation-grounded responses.</u> If it can&apos;t find relevant and concrete information inside an article, it will be conservative and will not respond, thus decreasing hallucinations. </li></ol><h2 id="takeaways">Takeaways</h2><ul><li>&#x1F680; We created <strong>GRACE</strong>, which stands for <em>Grounded Retrieval-Augmented Citation Evaluation. </em>GRACE helps us <strong>evaluate LLM performance in GenAI RAG-based chatbots by comparing predicted vs. 
annotated citations in LLM-generated answers.</strong></li><li>&#x1F50D; In order to evaluate RAG with GRACE, you need to do two things: a) <u>optimize your Knowledge Base to have self-contained articles</u> that can be used to answer user questions about one specific topic, and b) <u>instruct the LLM explicitly to pick the single best article (and its citation) to answer the user question.</u> This chatbot design builds upon typical RAG and we call it &quot;Grounded RAG with Citations&quot; (Grounded RAG+C), since we &quot;ground&quot; the LLMs to use self-contained documents and their citations.</li><li>&#x1F4A1; GRACE works like this: <strong>If the LLM selects and cites the correct article, then the provided answer is likely (automatically!) correct.</strong> <u>GRACE</u> (and the Grounded RAG+C chatbot design) <u>enable low-cost, accurate LLM performance measurement</u> without expensive methods like LLM-as-a-Judge or outdated methods like BLEURT. As a downside, though, it needs annotations beforehand, which LLM-as-a-Judge would not require.</li><li>&#x1F4CA; We use this strategy since it allows us to perform <u>easy comparison of different LLM models</u> using straightforward metrics like accuracy. 
In our real-world HR dataset, we found leading proprietary models like <u>GPT-4 and Google&apos;s Gemini Flash to have the best accuracy.</u> However, <u>smaller open-weight models like Mistral-7b and Llama-3.1-8b instruct had the worst performance, failing to answer around 50% of the user questions.</u></li><li>&#x1F52C; See more experiments at our GRACE leaderboard: <a href="https://helvia.ai/labs/grace-leaderboard/">https://helvia.ai/labs/grace-leaderboard/</a></li></ul><p></p><p>In the next blog post of this series, we are going to dive deeper into more findings from our own benchmarks, along with some more GRACE metrics we designed, which allow us to understand the chatbot&apos;s answer from different points of view.</p>]]></content:encoded></item><item><title><![CDATA[GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek]]></title><description><![CDATA[https://arxiv.org/pdf/2412.08520]]></description><link>https://helvia.ai/labs/gr-nlp-toolkit-an-open-source-nlp-toolkit-for-modern-greek/</link><guid isPermaLink="false">678924b2c0d4bf0008c63fe5</guid><category><![CDATA[publication_pdf]]></category><dc:creator><![CDATA[Lefteris Loukas]]></dc:creator><pubDate>Thu, 16 Jan 2025 12:56:00 GMT</pubDate><media:content url="https://helvia.ai/labs/content/images/2025/01/gr_nlp_toolkit.png" medium="image"/><content:encoded><![CDATA[<img src="https://helvia.ai/labs/content/images/2025/01/gr_nlp_toolkit.png" alt="GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek"><p>@ COLING 2025</p>]]></content:encoded></item><item><title><![CDATA[Still all Greeklish to me: Greeklish to Greek Transliteration]]></title><description><![CDATA[https://aclanthology.org/2024.lrec-main.1330.pdf]]></description><link>https://helvia.ai/labs/still-all-greeklish-to-me-greeklish-to-greek-transliteration/</link><guid isPermaLink="false">67892413c0d4bf0008c63fd5</guid><category><![CDATA[publication_pdf]]></category><dc:creator><![CDATA[Stavros Vassos]]></dc:creator><pubDate>Mon, 20 May 
2024 15:23:00 GMT</pubDate><media:content url="https://helvia.ai/labs/content/images/2025/01/still_all_greeklisth_to_me.png" medium="image"/><content:encoded><![CDATA[<img src="https://helvia.ai/labs/content/images/2025/01/still_all_greeklisth_to_me.png" alt="Still all Greeklish to me: Greeklish to Greek Transliteration"><p>@ LREC-COLING 2024</p>]]></content:encoded></item><item><title><![CDATA[Cache me if you Can: an Online Cost-aware Teacher-Student Framework to Reduce the Calls to Large Language Models (EMNLP 2023)]]></title><description><![CDATA[We propose a framework for reducing calls to LLMs by caching previous LLM responses and using them to train a local inexpensive model. We measure the tradeoff between performance and cost. Experimental results show that significant cost savings can be obtained with only slightly lower performance.]]></description><link>https://helvia.ai/labs/cache-me-if-you-can-an-online-cost-aware-teacher-student-framework-to-reduce-the-calls-to-large-language-models/</link><guid isPermaLink="false">654e04bdcb95db0008ec1920</guid><category><![CDATA[Posts]]></category><dc:creator><![CDATA[Ilias Stogiannidis]]></dc:creator><pubDate>Mon, 19 Feb 2024 10:27:00 GMT</pubDate><media:content url="https://helvia.ai/labs/content/images/2024/01/EMNLP_2023_Wide.png" medium="image"/><content:encoded><![CDATA[<img src="https://helvia.ai/labs/content/images/2024/01/EMNLP_2023_Wide.png" alt="Cache me if you Can: an Online Cost-aware Teacher-Student Framework to Reduce the Calls to Large Language Models (EMNLP 2023)"><p><em>Ilias Stogiannidis, Stavros Vassos, Prodromos Malakasiotis, Ion Androutsopoulos</em></p>

<!--kg-card-begin: html-->
<div class="myBTNdiv" style="text-align: center;">
<a href="https://aclanthology.org/2023.findings-emnlp.1000/?ref=helvia.ai" class="myButton">Paper</a>
  
<a href="https://github.com/stoyian/OCaTS?ref=helvia.ai" class="myButton">Code</a>
</div>
<!--kg-card-end: html-->
<h2 id="abstract">Abstract</h2>
<p>Prompting Large Language Models (LLMs) performs impressively in zero- and few-shot settings. Hence, small and medium-sized enterprises (SMEs) that cannot afford the cost of creating large task-specific training datasets, but also the cost of pretraining their own LLMs, are increasingly turning to third-party services that allow them to prompt LLMs. However, such services currently require a payment per call, which becomes a significant operating expense (OpEx). Furthermore, customer inputs are often very similar over time, hence SMEs end up prompting LLMs with very similar instances. We propose a framework that allows reducing the calls to LLMs by caching previous LLM responses and using them to train a local inexpensive model on the SME side. The framework includes criteria for deciding when to trust the local model or call the LLM, and a methodology to tune the criteria and measure the tradeoff between performance and cost. For experimental purposes, we instantiate our framework with two LLMs, GPT-3.5 or GPT-4, and two inexpensive students, a k-NN classifier or a Multi-Layer Perceptron, using two common business tasks, intent recognition and sentiment analysis. Experimental results indicate that significant OpEx savings can be obtained with only slightly lower performance.</p>
<h2 id="architecture">Architecture</h2>
<figure>
    <img src="https://helvia.ai/labs/content/images/2023/11/architecture_new-1-1.png" alt="Cache me if you Can: an Online Cost-aware Teacher-Student Framework to Reduce the Calls to Large Language Models (EMNLP 2023)">
    <figcaption>
        <p>This is the OCaTS architecture. The incoming customer query is first processed by the student model. If some criteria are met, we provide the student&apos;s output to the customer. If not, we prompt the teacher to respond. The teacher&apos;s output is then cached alongside the query, and the cached pairs are used to periodically retrain the student.</p>
    </figcaption>
</figure>
<h2 id="framework">Framework</h2>
<p>We present <em>OCaTS</em> (Online Cost-aware Teacher Student Framework), a framework designed to train a local inexpensive model (student) using the responses of a more expensive model (teacher) in an online setting. Our approach is inspired by the teacher-student schema, but with the additional consideration of the cost associated with utilizing the teacher. This makes OCaTS a suitable solution for small and medium enterprises that want to leverage powerful and easily accessible Large Language Models via API, while minimizing operational expenses (OpEx). OCaTS consists of three main components: a <em>teacher</em>, which is typically a resource-intensive model that produces high-quality results; a <em>student</em>, which is a cost-effective model that is much smaller and simpler than the teacher; and a <em>cache</em>, which is a repository of incoming queries that have already been processed by the teacher.</p>
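The loop described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration rather than the implementation from the paper: the class and method names, the toy k-NN student, and the unanimous-vote confidence rule are simplifications we introduce here (OCaTS itself uses the distance and entropy criteria presented below).

```python
# Hypothetical sketch of the OCaTS loop. Names and the unanimous-vote
# confidence rule are illustrative simplifications, not the authors' code.
from collections import Counter


class OCaTS:
    def __init__(self, teacher_fn, k=3):
        self.teacher_fn = teacher_fn   # expensive LLM call (paid per query)
        self.k = k
        self.cache = []                # cached (query_vector, label) pairs
        self.teacher_calls = 0

    def _student_predict(self, v):
        """Toy k-NN student over the cache; returns (label, is_confident)."""
        if len(self.cache) < self.k:
            return None, False
        nearest = sorted(
            self.cache,
            key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], v)),
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        label, count = votes.most_common(1)[0]
        return label, count == self.k  # "confident" only on a unanimous vote

    def answer(self, v):
        label, confident = self._student_predict(v)
        if confident:
            return label               # free: the student handles the query
        label = self.teacher_fn(v)     # paid: delegate to the teacher
        self.teacher_calls += 1
        self.cache.append((v, label))  # cache the pair for future retraining
        return label
```

Once a few similar queries have been cached, repeated near-duplicates are served by the student at no cost, which is exactly the OpEx saving the framework targets.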
<h2 id="cost-awareness">Cost-awareness</h2>
<p>To integrate the cost aspect into the framework, we introduce a novel evaluation measure for such settings, called <em>discounted metric</em>. This metric, denoted as $\hat{\phi}$, aims to capture the trade-off between performance and cost. It is computed using the following equation: $$\hat{\phi} = \phi - \lambda \cdot \rho = \phi - \lambda \cdot \frac{M}{N}.$$<br>
In this equation, $\phi$ represents a conventional evaluation metric such as accuracy. Parameter $\lambda$ is a weighting factor that determines the importance of cost (higher values indicate that cost is considered more significant for the SME). The variables $M$ and $N$ correspond to the total number of calls made to the teacher model and the total number of queries handled, respectively. The discounted metric penalizes the overall performance of the framework based on the rate of calls made to the teacher model and the associated cost for the company. Intuitively, by making this metric the objective of the framework, it learns to maximize performance by allowing the student to respond only when confident enough and prompt the teacher for their response otherwise.</p>
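For concreteness, the discounted metric is a one-liner in code. The function name is ours; the example values plugged in are the Banking77 figures reported in the Results section below (accuracy 83.05%, 1050 teacher calls over 3080 queries, $\lambda = 0.05$).

```python
def discounted_metric(phi, teacher_calls, total_queries, lam):
    """phi_hat = phi - lambda * (M / N): the conventional metric phi,
    penalized by the rate of (paid) teacher calls, weighted by lambda."""
    return phi - lam * (teacher_calls / total_queries)


# Figures from the Results section: accuracy 0.8305,
# 1050 teacher calls out of 3080 queries, lambda = 0.05.
score = discounted_metric(phi=0.8305, teacher_calls=1050,
                          total_queries=3080, lam=0.05)
print(round(score, 4))  # 0.8135
```

A higher $\lambda$ makes each teacher call costlier in $\hat{\phi}$, pushing the tuned thresholds toward trusting the student more often.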
<h2 id="choosing-between-student-teacher">Choosing between Student &amp; Teacher</h2>
<p>We focus on applying the framework to a text classification problem. In order to determine whether to rely on the student&apos;s response or prompt the teacher to respond, the framework incorporates two criteria inspired by Active Learning. If both criteria fall below their respective thresholds, the student&apos;s response is trusted; otherwise, the query is delegated to the teacher for handling.<br>
<br> The <strong>first criterion</strong> is to ensure that the new query is well represented by the queries already cached from the teacher. This is achieved by determining the similarity between the new query and the $k$ most similar cached queries. Let the <em>weighted centroid vector</em> $c$ of the $k$ nearest neighbors be $c = \sum_{i=1}^{k}\hat{w}_i \cdot v_i$ and $\hat{w}_i = w_i/\sum_{j=1}^{k} w_j$, where $w_i$ represents the weight assigned by a distance weighting algorithm to the $i$-th neighbor, and $v_i$ corresponds to the vector representation of the neighbor. The first criterion states that the distance between the new query and $c$ must be below a threshold $t_c$. Essentially, this condition ensures that the student has previous experience with similar cached queries.<br>
<br> The <strong>second criterion</strong> ensures the confidence of the student given the retrieved cached queries. To establish this condition, let $C$ represent the set of labels (classes) of the text classification problem. The probability $p_c$ for each $c \in C$ is defined as follows: $$p_c = \frac{\exp(W_c)}{\sum_{c&apos; \in C} \exp(W_{c&apos;})},$$ where $W_c$ can be the weight assigned by the $k$-NN algorithm or the logits of an MLP. The <em>entropy</em> $\mathcal{H}$ of the label probabilities $p_c$ is given by: $$\mathcal{H} = -\sum_{c \in C} p_c \log{p_c}.$$ The second criterion states that $\mathcal{H}$ must be below a threshold $t_\mathcal{H}$. Essentially, this condition ensures that the student is confident about its response.</p>
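<p>The two criteria can be sketched as follows (a minimal NumPy illustration; the threshold values are placeholders, not the tuned ones from the paper):</p>

```python
import numpy as np

def student_is_trusted(query_vec, neighbor_vecs, neighbor_weights,
                       class_weights, t_c=0.5, t_H=0.7):
    """Check the two criteria described above.

    Criterion 1 (representativeness): the distance between the new query
    and the weighted centroid c of its k nearest cached neighbors must be
    below t_c. Criterion 2 (confidence): the entropy H of the softmaxed
    class weights W_c must be below t_H. Threshold defaults are illustrative.
    """
    # Weighted centroid: c = sum_i (w_i / sum_j w_j) * v_i
    w_hat = neighbor_weights / neighbor_weights.sum()
    c = (w_hat[:, None] * neighbor_vecs).sum(axis=0)
    representative = np.linalg.norm(query_vec - c) < t_c

    # Softmax over the class weights, then entropy H = -sum_c p_c log p_c
    p = np.exp(class_weights - np.max(class_weights))
    p = p / p.sum()
    H = -np.sum(p * np.log(p))
    confident = H < t_H

    return bool(representative and confident)
```

If either check fails, the query is delegated to the teacher LLM.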
<h2 id="results">Results</h2>
<p>We evaluate the framework on an intent recognition task for four indicative $\lambda$ values, which determine the importance of cost in the discounted metric $\hat{\phi}$ we introduced. We utilize the <a href="https://huggingface.co/datasets/banking77?ref=helvia.ai">Banking77 dataset</a>, a basic $k$-NN student, and GPT-4 as the teacher. As depicted in the figure below, OCaTS effectively manages the trade-off between the frequency of contacting the teacher and the level of accuracy. Specifically:</p>
<ul>
<li><strong>Left part: Calls to the Teacher</strong>
<ul>
<li>Using OCaTS significantly reduces the calls to the teacher and, hence, the OpEx.</li>
<li>As $\lambda$ increases, the number of calls made to the teacher decreases.</li>
</ul>
</li>
<li><strong>Middle part: Trade-off between accuracy &amp; OpEx</strong>
<ul>
<li>At $\lambda=0.05$, OCaTS achieves accuracy close to that of the GPT-4 teacher (83.05% vs. 82.68%), with only one-third of the queries reaching the teacher (1050 out of 3080).</li>
<li>Increasing $\lambda$ leads to a decrease in accuracy but a smaller number of teacher calls.</li>
</ul>
</li>
<li><strong>Right part: Discounted Accuracy ($\hat\phi$) Comparison:</strong>
<ul>
<li>The right side of the figure compares the discounted accuracy ($\hat\phi$) of OCaTS (solid lines) with always contacting the GPT-4 teacher (dashed lines).</li>
<li>OCaTS consistently surpasses the GPT-4 teacher&apos;s discounted accuracy, highlighting its OpEx efficiency.</li>
</ul>
</li>
<li><strong>Conclusion on OCaTS Superiority:</strong>
<ul>
<li>OCaTS is superior in terms of OpEx compared to constantly reaching out to the teacher.</li>
<li>The difference favoring OCaTS becomes more pronounced as $\lambda$ increases, indicating a stronger focus on reducing OpEx.</li>
</ul>
</li>
</ul>
<figure>
    <img src="https://helvia.ai/labs/content/images/2023/11/knn-banking-1.png" alt="Cache me if you Can: an Online Cost-aware Teacher-Student Framework to Reduce the Calls to Large Language Models (EMNLP 2023)">
    <figcaption>
        <p>Number of calls to the teacher (left), accuracy (middle), discounted accuracy (right), using a GPT-4 teacher and a k-NN student, for various &#x3BB; values, on Banking77 data. The larger the &#x3BB; the more the SME prefers fewer calls at the expense of increased customer frustration. Dashed lines show the discounted accuracy when calling GPT-4 for all incoming queries. OCaTS has a better discounted accuracy than always calling the GPT-4 teacher.</p>
    </figcaption>
</figure>
<h2 id="takeaways">Takeaways</h2>
<p>This is, to the best of our knowledge, the first study to optimize API requests to commercial LLMs according to a cost-aware metric. Some takeaways:</p>
<ul>
<li>We introduce a framework for decreasing API requests to commercial LLMs like OpenAI&apos;s GPT-4 while maintaining performance standards, by caching responses.</li>
<li>We introduce a discounted metric that measures the trade-off between performance and cost.</li>
<li>We employ a smaller, more efficient student model to respond to queries similar to the ones previously handled by the teacher LLM.</li>
<li>In our experiments, we match the performance of OpenAI&apos;s GPT-4, scoring only 0.37 percentage points lower, while effectively cutting down API costs by calling the LLM teacher for only one-third of the incoming queries (1050 out of 3080).</li>
</ul>
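<p>Putting the takeaways together, the online loop can be sketched as follows (a minimal sketch; the function names and signatures are illustrative, not the exact API of our implementation):</p>

```python
def ocats_loop(queries, teacher_fn, student_fn, trust_fn, cache):
    """Simplified online loop of the teacher-student framework.

    For each incoming query, the student answers when trust_fn deems it
    confident enough (based on the cache of past teacher answers);
    otherwise the teacher LLM is called and its answer is cached.
    """
    answers, teacher_calls = [], 0
    for q in queries:
        if cache and trust_fn(q, cache):
            answers.append(student_fn(q, cache))  # cheap local answer
        else:
            label = teacher_fn(q)       # expensive LLM call
            cache.append((q, label))    # grow the student's experience
            teacher_calls += 1
            answers.append(label)
    return answers, teacher_calls
```

A toy run with scalar "embeddings", a distance-based trust check, and a nearest-neighbor student shows that only the first query of each region reaches the teacher.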
<h2 id="bibtex">Bibtex</h2>
<pre><code>@inproceedings{stogiannidis-etal-2023-cache,
    title = &quot;Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models&quot;,
    author = &quot;Stogiannidis, Ilias  and
      Vassos, Stavros  and
      Malakasiotis, Prodromos  and
      Androutsopoulos, Ion&quot;,
    editor = &quot;Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika&quot;,
    booktitle = &quot;Findings of the Association for Computational Linguistics: EMNLP 2023&quot;,
    month = dec,
    year = &quot;2023&quot;,
    address = &quot;Singapore&quot;,
    publisher = &quot;Association for Computational Linguistics&quot;,
    url = &quot;https://aclanthology.org/2023.findings-emnlp.1000&quot;,
    pages = &quot;14999--15008&quot;
}
</code></pre>
<h2 id="acknoweledgements">Acknowledgements</h2>
<p>This work was supported by Google&#x2019;s <a href="https://sites.research.google/trc/about/?ref=helvia.ai">TPU Research Cloud (TRC)</a> and was carried out in collaboration with <a href="https://nlp.cs.aueb.gr/?ref=helvia.ai">AUEB&apos;s NLP Group</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking (ACM ICAIF 2023)]]></title><description><![CDATA[Standard Full-Data classifiers in NLP demand thousands of labeled examples, which is impractical in data-limited domains. Few-shot methods offer an alternative, utilizing contrastive learning techniques that can be effective with as little as 20 examples per class.]]></description><link>https://helvia.ai/labs/making-llms-worth-every-penny-resource-limited-text-classification-in-banking/</link><guid isPermaLink="false">656f38c348ed79000869e569</guid><category><![CDATA[Posts]]></category><dc:creator><![CDATA[Lefteris Loukas]]></dc:creator><pubDate>Thu, 07 Dec 2023 08:13:08 GMT</pubDate><media:content url="https://helvia.ai/labs/content/images/2023/12/LLM_Cost_Reduction_003.png" medium="image"/><content:encoded><![CDATA[<img src="https://helvia.ai/labs/content/images/2023/12/LLM_Cost_Reduction_003.png" alt="Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking (ACM ICAIF 2023)"><p><em>Lefteris Loukas, Ilias Stogiannidis, Odysseas Diamantopoulos, Prodromos Malakasiotis, Stavros Vassos</em></p>
<p>Read the full paper here: <a href="https://arxiv.org/abs/2311.06102?ref=helvia.ai">https://arxiv.org/abs/2311.06102</a></p>
<h2 id="abstract">Abstract</h2>
<p>Standard Full-Data classifiers in NLP demand thousands of labeled examples, which is impractical in data-limited domains. Few-shot methods offer an alternative, utilizing contrastive learning techniques that can be effective with as little as 20 examples per class. Similarly, Large Language Models (LLMs) like GPT-4 can perform effectively with just 1-5 examples per class. However, the performance-cost trade-offs of these methods remain underexplored, a critical concern for budget-limited organizations. Our work addresses this gap by studying the aforementioned approaches over the Banking77 financial intent detection dataset, including the evaluation of cutting-edge LLMs by OpenAI, Cohere, and Anthropic in a comprehensive set of few-shot scenarios. We complete the picture with two additional methods: first, a cost-effective querying method for LLMs based on retrieval-augmented generation (RAG), able to reduce operational costs multiple times compared to classic few-shot approaches, and second, a data augmentation method using GPT-4, able to improve performance in data-limited scenarios. Finally, to inspire future research, we provide a human expert&#x2019;s curated subset of Banking77, along with extensive error analysis.</p>
<h2 id="motivation">Motivation</h2>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://helvia.ai/labs/content/images/2023/12/blogpost_motivation.png" class="kg-image" alt="Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking (ACM ICAIF 2023)" loading="lazy" width="860" height="631" srcset="https://helvia.ai/labs/content/images/size/w600/2023/12/blogpost_motivation.png 600w, https://helvia.ai/labs/content/images/2023/12/blogpost_motivation.png 860w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">We study how we can approach text classification effectively in terms of both performance and cost. We use the Banking77 dataset (</span><a href="https://huggingface.co/datasets/PolyAI/banking77?ref=helvia.ai"><span style="white-space: pre-wrap;">https://huggingface.co/datasets/PolyAI/banking77</span></a><span style="white-space: pre-wrap;">), composed of customer support dialogs and their labels. We mainly study Few-shot Settings, where we have limited samples per class (resource-limited scenario), typically 1 to 20 samples per class. For the sake of completeness, we also present some results in the Full-Data Setting, where one can fine-tune models on thousands of samples (which is often impractical).</span></figcaption></figure><p>To the best of our knowledge, this is the first study to investigate the performance-cost trade-off of LLMs versus MLMs. Many companies tend to adopt the most modern proprietary LLMs (e.g., OpenAI&apos;s GPT-4), which come at a pretty heavy cost, without comparing their performance against cheaper, older models that might perform equally well.</p>
<p>After benchmarking the performance-cost trade-offs of LLMs and MLMs on <strong>Banking77</strong> (a real-life conversational dataset of a bank&apos;s customer support, with 77 labels), we introduce a <strong>cost-effective LLM inference method</strong> based on retrieval, similar to how <strong>RAG (Retrieval-Augmented Generation)</strong> is performed nowadays in question-answering chatbots. This method can reduce LLM costs by more than 3x in real-life business settings. We then follow up with an extra study showing <strong>how much synthetic data one can generate</strong> in such resource-limited scenarios.</p>
<h2 id="methodology-outline">Methodology Outline</h2>
<p>We tackle text classification in Few-Shot Settings (where we have limited samples per class) in 2 ways:</p>
<ul>
<li><strong>Contrastive Learning (SetFit)</strong> with Masked Language Models <strong>(MLMs)</strong></li>
<li><strong>In-Context Learning (Prompting)</strong> with Large Language Models <strong>(LLMs)</strong></li>
</ul>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://helvia.ai/labs/content/images/2023/12/blogpost_setfit.png" class="kg-image" alt="Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking (ACM ICAIF 2023)" loading="lazy" width="971" height="218" srcset="https://helvia.ai/labs/content/images/size/w600/2023/12/blogpost_setfit.png 600w, https://helvia.ai/labs/content/images/2023/12/blogpost_setfit.png 971w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">An overview of Contrastive Learning (SetFit), as used with MLMs. SetFit was first introduced by HuggingFace (Tunstall et al., 2022). It utilizes Sentence Transformers (like MPNet) in a Siamese + supervised fine-tuning manner, with an objective function that minimizes the distance between samples of the same label. The result is that it produces rich vector representations, even when providing only 10 to 20 samples per class for your text classification problem.</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://helvia.ai/labs/content/images/2023/12/blogpost_icl.png" class="kg-image" alt="Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking (ACM ICAIF 2023)" loading="lazy" width="833" height="400" srcset="https://helvia.ai/labs/content/images/size/w600/2023/12/blogpost_icl.png 600w, https://helvia.ai/labs/content/images/2023/12/blogpost_icl.png 833w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">An overview of In-Context Learning, as used with LLMs. We leverage the pre-trained knowledge of LLMs and extend it with our specific task instructions and a few examples per class. This is done for each inference sample. We use a variety of proprietary LLMs, like OpenAI&apos;s GPT-3.5 and GPT-4, Anthropic&apos;s Claude 2, and Cohere&apos;s Command-Nightly.</span></figcaption></figure><h2 id="results-1">Results (#1)</h2>
<p>We then employ the SetFit methodology on the MPNet-v2 model, a state-of-the-art sentence transformer based on BERT, according to <a href="https://sbert.net/?ref=helvia.ai">https://sbert.net/</a>.<br>
We also utilize In-Context Learning with multiple proprietary LLMs, such as OpenAI&apos;s GPT-3.5 and GPT-4, Anthropic&apos;s Claude 1 &amp; 2, and Cohere&apos;s Command-nightly.</p>
<p>For the MPNet models (and the SetFit technique), we use different settings of 3/5/10/15/20 samples per class, as SetFit typically requires around 10-20 samples to work well.</p>
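<p>For intuition, the pair generation behind SetFit-style contrastive fine-tuning can be sketched as follows (a simplified illustration; the actual sampling strategy lives inside the setfit library):</p>

```python
import itertools
import random

def make_contrastive_pairs(samples_by_class, seed=0):
    """Build (text_a, text_b, label) training pairs from a few samples.

    Pairs with label 1.0 share a class (their embeddings get pulled
    together), pairs with label 0.0 do not (pushed apart). A simplified
    illustration of SetFit-style pair generation, not the library's code.
    """
    rng = random.Random(seed)
    pairs = []
    classes = list(samples_by_class)
    for cls in classes:  # positive pairs: within-class combinations
        for a, b in itertools.combinations(samples_by_class[cls], 2):
            pairs.append((a, b, 1.0))
    for ca, cb in itertools.combinations(classes, 2):  # negative pairs
        pairs.append((rng.choice(samples_by_class[ca]),
                      rng.choice(samples_by_class[cb]), 0.0))
    return pairs
```

Even a handful of samples per class yields many pairs, which is why contrastive fine-tuning works in such low-resource settings.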
<p>For the LLMs, we use 1 and 3 samples per class, due to context length limitations (OpenAI&apos;s models had a 4K context limit at the time of development). Also, for the LLMs, we use both random samples from the dataset and &quot;representative&quot; samples selected by a domain expert. The intuition here is that representative samples will outperform randomly sampled ones, and we believe it is feasible for a company to pick 3 &quot;good&quot; samples for each class in a dataset.</p>
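<p>A classic N-shot classification prompt can be assembled along these lines (the template is illustrative, not the exact prompt used in the paper):</p>

```python
def build_few_shot_prompt(demonstrations, labels, query):
    """Assemble a classic N-shot intent-classification prompt.

    demonstrations: (text, label) pairs, e.g. 1 or 3 per class (random or
    expert-picked "representative" samples); labels: the allowed intents;
    query: the message to classify.
    """
    lines = ["Classify the customer message into one of these intents:",
             ", ".join(sorted(labels)), ""]
    for text, label in demonstrations:
        lines.append(f"Message: {text}\nIntent: {label}\n")
    lines.append(f"Message: {query}\nIntent:")
    return "\n".join(lines)
```

With 77 classes, even 3 demonstrations per class makes this prompt very long, which is exactly the cost problem addressed below.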
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://helvia.ai/labs/content/images/2023/12/blogpost_results.png" class="kg-image" alt="Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking (ACM ICAIF 2023)" loading="lazy" width="577" height="677"><figcaption><span style="white-space: pre-wrap;">First, let&apos;s focus on LLMs. GPT-4 performs the best across 1-shot settings, outperforming competitors like Anthropic Claude and Cohere&apos;s Command-nightly. In the 3-shot setting, GPT-4 also works the best. Surprisingly, GPT-3.5&apos;s performance in the 3-shot setting drops, compared to the 1-shot setting, probably due to GPT-3.5 getting &quot;Lost In The Middle&quot; (</span><a href="https://arxiv.org/abs/2307.03172?ref=helvia.ai"><span style="white-space: pre-wrap;">https://arxiv.org/abs/2307.03172</span></a><span style="white-space: pre-wrap;">) when having a bigger context. As expected, the representative samples work better than random samples in all of our ablation experiments (using OpenAI models). On the other side of MLMs, MPNet might start with a low of 57.4 in the 1-shot setting (vs GPT-4&apos;s 80.4) but has a comparable 76.7 in the 3-shot setting (vs GPT-4&apos;s 83.1). After providing more samples to the MLM, something which is impossible to the LLMs (due to maximum 4K context capacity), the MPNet models reach a top 91.2 micro-F1 Score, which is 3 points lower than fine-tuning in the typical Full Data Setting with hundreds/thousands of samples per class (94.1)</span></figcaption></figure><h2 id="cost-analysis">Cost Analysis</h2>
<p>Proprietary LLMs may work well, but their per-token API pricing makes them costly. Thus, apart from their performance, we also analyze their costs. This is the first time this industrial point of view (the performance/cost trade-off) has been reported.</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://helvia.ai/labs/content/images/2023/12/blogpost_costs.png" class="kg-image" alt="Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking (ACM ICAIF 2023)" loading="lazy" width="678" height="271" srcset="https://helvia.ai/labs/content/images/size/w600/2023/12/blogpost_costs.png 600w, https://helvia.ai/labs/content/images/2023/12/blogpost_costs.png 678w"><figcaption><span style="white-space: pre-wrap;">In the 1-shot setting, where we show 1 example per class to the LLM, GPT-4 achieves an 80.4 micro-F1 score but costs $620, while Anthropic&apos;s Claude 2 costs only $15 with a 76.8 micro-F1. The choice depends on your priorities (cost savings vs. a slight performance gain). Also, in the 3-shot setting, GPT-4 outperforms GPT-3.5 by nearly 20 points, but costs around 10 times more. We perform 3,080 queries on the test set, one for each inference sample.</span></figcaption></figure><h2 id="rag-or-dynamic-few-shot-prompting-for-cost-effective-llm-inference">RAG (or Dynamic Few-Shot Prompting) for Cost-Effective LLM Inference</h2>
<p>So far, we feed the LLM N examples per class (the classic N-shot setting). For example, in the 3-shot setting with 77 classes, <strong>we feed the model</strong> 3x77 = <strong>231 samples, hitting the limit of the 4K context window and incurring a high API cost with OpenAI and other LLM providers.</strong></p>
<p>Instead of feeding so many samples to the model each time we want to classify a test sample, we found that, during inference, <strong>we can retrieve only the top-K most similar examples (and their labels), and perform better while reducing the context size (and the associated costs).</strong></p>
<p>This is called <strong>Dynamic Few-Shot Prompting</strong>, since we dynamically change the examples we show to the LLM via the prompt, or <strong>RAG (Retrieval-Augmented Generation)</strong>, since the LLM generates its answer after the prompt has been augmented through a retrieval step (as in today&apos;s classic question-answering tasks). We retrieve the most similar examples and their labels using cosine similarity over sentence embeddings (encoded with MPNet).</p>
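<p>The retrieval step can be sketched as follows (a minimal NumPy illustration; the embeddings are assumed to come from a sentence encoder such as MPNet):</p>

```python
import numpy as np

def retrieve_top_k(query_emb, train_embs, train_labels, k=5):
    """Retrieve the k most similar training examples by cosine similarity.

    Only the retrieved examples (and their labels) are put in the prompt,
    instead of N examples for every one of the 77 classes.
    """
    q = query_emb / np.linalg.norm(query_emb)
    T = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = T @ q                       # cosine similarity to each example
    top = np.argsort(-sims)[:k]        # indices of the k best matches
    return [(train_labels[i], float(sims[i])) for i in top]
```

The retrieved (text, label) pairs then replace the static demonstrations in the prompt, shrinking the context from 231 samples to K.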
<figure class="kg-card kg-image-card"><img src="https://helvia.ai/labs/content/images/2023/12/blogpost_rag_icl.png" class="kg-image" alt="Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking (ACM ICAIF 2023)" loading="lazy" width="605" height="572" srcset="https://helvia.ai/labs/content/images/size/w600/2023/12/blogpost_rag_icl.png 600w, https://helvia.ai/labs/content/images/2023/12/blogpost_rag_icl.png 605w"></figure><h2 id="results-2-with-ragdynamic-few-shot-prompting">Results (#2) with RAG/Dynamic Few-Shot Prompting</h2>
<p>After performing RAG (or Dynamic Few-Shot Prompting) with the K=5/10/20 most similar examples (and their labels) from the training set, one call per inference/test sample, we report the results and their (spoiler alert!) heavily reduced dollar costs.</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://helvia.ai/labs/content/images/2023/12/blogpost_rag_examples-1.png" class="kg-image" alt="Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking (ACM ICAIF 2023)" loading="lazy" width="673" height="236" srcset="https://helvia.ai/labs/content/images/size/w600/2023/12/blogpost_rag_examples-1.png 600w, https://helvia.ai/labs/content/images/2023/12/blogpost_rag_examples-1.png 673w"><figcaption><span style="white-space: pre-wrap;">Comparing GPT-4 results with RAG (this table) and without RAG (previous table), it seems better (and cheaper) to use LLMs with this dynamic prompting approach (5/10/20 samples in total), instead of a classic Few-Shot approach where one shows 3 samples per class (3x77 = 231 samples in total). Also, Claude 2 with the K=20 most similar examples (RAG) yields </span><u><span class="underline" style="white-space: pre-wrap;">85.5% at only $42</span></u><span style="white-space: pre-wrap;"> vs. GPT-4&#x2019;s original </span><u><span class="underline" style="white-space: pre-wrap;">83.1% at $740</span></u><span style="white-space: pre-wrap;"> (previous table) &#x1F92F; We perform 3,080 queries on the test set, one for each inference sample.</span></figcaption></figure><h2 id="extra-are-llms-capable-for-synthetic-data-generation">Extra: Are LLMs capable of synthetic data generation?</h2>
<p>Data augmentation, or synthetic data generation, is especially important here: one mostly resorts to Few-Shot approaches precisely because of being data-limited. And, as always, the more data, the better.</p>
<p>So, we tested if we can trust LLMs for synthetic data generation and the answer is: <strong>yes, but up to a point.</strong></p>
<p>Previous reports show that data augmentation is difficult for tasks with large and overlapping label sets (see <a href="https://aclanthology.org/2022.nlp4convai-1.5/?ref=helvia.ai">https://aclanthology.org/2022.nlp4convai-1.5/</a>). For this reason, we performed a semantic clustering of the 77 labels (each with 3 of its examples) into N=10 groups. Then, we fed each group to GPT-4 (label + 3 examples), asking it to generate 20 more.</p>
<p>&#x1F4A1; The intuition behind this is that the LLM will grasp the subtle differences between the 77 overlapping labels and their examples, and will be able to create synthetic data that can be differentiated from one class to another.</p>
<p>After doing that, we put them to the test with the MPNet models (using the SetFit Few-Shot approach).</p>
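<p>The label-grouping step described above can be sketched with a toy k-means over precomputed label embeddings (a simplified stand-in for the semantic clustering we used; in practice, one would first embed each label and its examples with a sentence encoder):</p>

```python
import numpy as np

def group_labels(label_embs, n_groups=10, iters=25, seed=0):
    """Cluster label embeddings into semantic groups (toy k-means).

    A simplified stand-in for the semantic clustering of the 77 labels
    into N=10 groups; each group is then handed to the LLM for generation.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(label_embs, dtype=float)
    centers = X[rng.choice(len(X), n_groups, replace=False)].copy()
    for _ in range(iters):
        # Assign each label to its nearest center, then recompute centers
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(n_groups):
            members = X[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return assign
```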
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://helvia.ai/labs/content/images/2023/12/blogpost_synthetic_data.png" class="kg-image" alt="Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking (ACM ICAIF 2023)" loading="lazy" width="568" height="558"><figcaption><span style="white-space: pre-wrap;">In this result, we suppose that we have at least N=3 real samples per class, and we want to compare how synthetic/augmented data perform vs. actual real ones. Following the </span><b><strong style="white-space: pre-wrap;">black </strong></b><span style="white-space: pre-wrap;">line, we do see an increase when using 5 and 10 augmented samples in the Few-Shot Scenario. However, the performance drops after 10 samples, which seems to be the sweet spot for this experiment. For reference, we also plot the real data as a green line, which indicates that real data is better than GPT-4-generated data.</span></figcaption></figure><h2 id="takeaways">Takeaways</h2>
<p>Our work provides a practical rule of thumb for text classification in settings with lots of classes, such as intent detection in chatbot use cases:</p>
<ul>
<li>If you have more than 5 examples per class, it&apos;s better to fine-tune a pretrained model such as MPNet using a contrastive learning technique such as SetFit.</li>
<li>If you have fewer than 5 examples per class, it&apos;s better to use LLMs.</li>
<li>To reduce the costs of LLMs, one can employ &quot;dynamic&quot; few-shot prompting (i.e., RAG), which performs better and costs a fraction of regular few-shot prompting.</li>
<li>Synthetic data can be used to enhance performance, but we found that it hurts results after incorporating N=7 synthetic examples. As expected, though, real data is much better than GPT-4-generated data.</li>
</ul>
<h2 id="citation">Citation</h2>
<pre><code>@inproceedings{10.1145/3604237.3626891,
author = {Loukas, Lefteris and Stogiannidis, Ilias and Diamantopoulos, Odysseas and Malakasiotis, Prodromos and Vassos, Stavros},
title = {Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking},
year = {2023},
isbn = {9798400702402},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3604237.3626891},
doi = {10.1145/3604237.3626891},
pages = {392&#x2013;400},
numpages = {9},
keywords = {Anthropic, Cohere, OpenAI, LLMs, NLP, Claude, GPT, Few-shot},
location = {Brooklyn, NY, USA},
series = {ICAIF &apos;23}
}
</code></pre>
<h2 id="resources">Resources</h2>
<ul>
<li>Paper: <a href="https://arxiv.org/abs/2311.06102?ref=helvia.ai">https://arxiv.org/abs/2311.06102</a></li>
<li>Kudos (ACM Showcase) Blogpost: <a href="https://www.growkudos.com/publications/10.1145%25252F3604237.3626891/reader?ref=helvia.ai">https://www.growkudos.com/publications/10.1145%25252F3604237.3626891/reader</a></li>
<li>Dataset (Banking77): <a href="https://huggingface.co/datasets/PolyAI/banking77?ref=helvia.ai">https://huggingface.co/datasets/PolyAI/banking77</a></li>
<li>Representative Samples curated by a domain expert (3 samples per class): <a href="https://huggingface.co/datasets/helvia/banking77-representative-samples?ref=helvia.ai">https://huggingface.co/datasets/helvia/banking77-representative-samples</a></li>
</ul>
<h3 id="acknowledgments">Acknowledgments</h3>
<p>This work has received funding from European Union&#x2019;s Horizon 2020 research and innovation programme under grant agreement No 101021714 (&quot;LAW GAME&quot;). Also, we would like to sincerely thank the Hellenic Artificial Intelligence Society (EETN) for their sponsorship.</p>
]]></content:encoded></item><item><title><![CDATA[Cache me if you Can: an Online Cost-aware Teacher-Student Framework to Reduce the Calls to Large Language Models]]></title><description><![CDATA[https://aclanthology.org/2023.findings-emnlp.1000.pdf]]></description><link>https://helvia.ai/labs/cache-me-if-you-can-an-online-cost-aware-teacher-student-framework-to-reduce-the-calls-to-large-language-models-2/</link><guid isPermaLink="false">67892577c0d4bf0008c63ffb</guid><category><![CDATA[publication_pdf]]></category><dc:creator><![CDATA[Stavros Vassos]]></dc:creator><pubDate>Wed, 06 Dec 2023 15:28:00 GMT</pubDate><media:content url="https://helvia.ai/labs/content/images/2025/01/cache_me_if_you_can.png" medium="image"/><content:encoded><![CDATA[<img src="https://helvia.ai/labs/content/images/2025/01/cache_me_if_you_can.png" alt="Cache me if you Can: an Online Cost-aware Teacher-Student Framework to Reduce the Calls to Large Language Models"><p>@ EMNLP 2023</p>]]></content:encoded></item><item><title><![CDATA[Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking]]></title><description><![CDATA[https://arxiv.org/pdf/2311.06102]]></description><link>https://helvia.ai/labs/making-llms-worth-every-penny-resource-limited-text-classification-in-banking-2/</link><guid isPermaLink="false">6789251ac0d4bf0008c63ff0</guid><category><![CDATA[publication_pdf]]></category><dc:creator><![CDATA[Lefteris Loukas]]></dc:creator><pubDate>Sat, 25 Nov 2023 15:27:00 GMT</pubDate><media:content url="https://helvia.ai/labs/content/images/2025/01/making_llms_worth_any_penny.png" medium="image"/><content:encoded><![CDATA[<img src="https://helvia.ai/labs/content/images/2025/01/making_llms_worth_any_penny.png" alt="Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking"><p>@ ACM ICAIF 2023</p>]]></content:encoded></item><item><title><![CDATA[AI-assisted Serious Games: Dialogue Management with Generative 
AI]]></title><description><![CDATA[https://public-storage.helvia.ai/labs/EDGE2023_PanopoulouAversaVassos.pdf]]></description><link>https://helvia.ai/labs/ai-assisted-serious-games-dialogue-management-with-generative-ai/</link><guid isPermaLink="false">678e5564f76c900008e6d266</guid><category><![CDATA[publication_pdf]]></category><dc:creator><![CDATA[Stavros Vassos]]></dc:creator><pubDate>Thu, 19 Oct 2023 13:53:00 GMT</pubDate><media:content url="https://helvia.ai/labs/content/images/2025/01/Screenshot-2025-01-20-at-5.53.23-PM.png" medium="image"/><content:encoded><![CDATA[<img src="https://helvia.ai/labs/content/images/2025/01/Screenshot-2025-01-20-at-5.53.23-PM.png" alt="AI-assisted Serious Games: Dialogue Management with Generative AI"><p>@ EDGE 2023</p>]]></content:encoded></item><item><title><![CDATA[AI models for classifying green plastics patents]]></title><description><![CDATA[Helvia's Stavros Vassos (CEO) and Odysseas Papadiamantopoulos (ML Engineer) joined the AI4EPO team to develop novel AI models for the European Patent Office CodeFest on Green Plastics and won first place.]]></description><link>https://helvia.ai/labs/ai-models-for-classifying-green-plastics-patents/</link><guid isPermaLink="false">6622228957e5a60008e0ce82</guid><category><![CDATA[Competitions]]></category><dc:creator><![CDATA[Stavros Vassos]]></dc:creator><pubDate>Thu, 23 Feb 2023 14:57:00 GMT</pubDate><media:content url="https://helvia.ai/labs/content/images/2024/04/AI4EPO-Green-Plastics-EPO-Codefest.png" medium="image"/><content:encoded><![CDATA[<img src="https://helvia.ai/labs/content/images/2024/04/AI4EPO-Green-Plastics-EPO-Codefest.png" alt="AI models for classifying green plastics patents"><p>The AI4EPO team developed novel AI models for the European Patent Office CodeFest on Green Plastics.</p><p>Responding to the European Patent Office (EPO) first ever CodeFest on Green Plastics, Helvia&apos;s CEO Dr. Stavros Vassos and ML Engineer Odysseas Diamantopoulos joined forces with Dr. 
Dimitrios Skraparlis from EPO, NL and Dr. Prodromos Malakasiotis from AUEB, GR, forming the AI4EPO team. The purpose was to apply state-of-the-art AI to develop models for automating the classification of patents as green plastics, tackling one of today&apos;s key sustainability challenges.</p><figure class="kg-card kg-image-card"><img src="https://lh4.googleusercontent.com/LF21nz6MggBxse72n3ynzn8P0eKLmUkuiIYgebZrT77eB4r4TC8_mJQxl_KZGX6STWJtwNql1RACOQoPwmrKnC65ejXuDr06lyzsDs6w5RevM3Vwh6HeojumA5YS1Wsy9vzqloFTppqJpi3xjlR-qxE" class="kg-image" alt="AI models for classifying green plastics patents" loading="lazy"></figure><h2 id="the-challenges-and-the-approach">The challenges and the approach</h2><p>The team first had to agree on the definition of &#x2018;green plastics&#x2019;. As there is no standard definition, the team decided to rely on an EPO report and a green plastics cartography, as identified by experts.</p><p>The other challenge was that there are no labeled data on patent examples that belong to green plastics. To tackle this, the team generated lists of patents based on the cartography of the report and labeled them with the respective categories.</p><p>The third challenge was that the patents are too long, and there is no standard method to extract brief relevant information. The approach for this was to use automated summarization, as well as combinations of full-text title, abstract, description, and claims.</p><h2 id="the-methodology">The methodology</h2><p>The methodology the team followed consists of the below six steps:</p><p>1. Define green plastics</p><p>2. Label patent examples with respect to green plastics categories</p><p>3. Preprocess patents to extract a &#x201C;patent DNA&#x201D; per example</p><p>4. Train state-of-the-art AI pipelines for text classification</p><p>5. Evaluate the results and select the winning approach</p><p>6. Refine the winning approach toward a practical MVP<br></p><h2 id="1-define-green-plastics">1. 
Define green plastics</h2><p>The team relied on the categorization (cartography) laid out by the experts of the study &#x201C;<a href="https://documents.epo.org/projects/babylon/eponet.nsf/0/069F978FE569055EC125876F004FFBB1/$File/patents_for_tomorrows_plastics_study_en.pdf?ref=helvia.ai">Patents for tomorrow&#x2019;s plastics</a>&#x201D;.</p><figure class="kg-card kg-image-card"><img src="https://lh5.googleusercontent.com/zXoukZEsBYd63tJ_3vMq0R-BxvfHA3cUpYtu_0kvDFEDIMFUxTE6eXMWaz4NTPdKN2MDjohCGHjOP0mnKvZxzRx7wuTVgPEBvQ0AM44OlLf95fXoK6tiHIzy7oU8PAahAei0wMxm6aJ19hZ1q4eAH-0" class="kg-image" alt="AI models for classifying green plastics patents" loading="lazy"></figure><h2 id="2-label-patent-examples">2. Label patent examples</h2><p>To label the patent examples, the team curated a list of green plastics examples by executing queries on Google Patents advanced search for each 3rd-level entry of the cartography. Google Patents advanced search was chosen because of its open support for searching the full text of patents using Boolean syntax, proximity operators, wildcards, and classification markings. The created queries combine CPC subclass allocations and keyword constructs carefully selected to correspond to primary search strategies with a similar or narrower search scope than the published queries used in the study &#x201C;Patents for tomorrow&#x2019;s plastics&#x201D;.</p><p>Complexity constraints of Google Patents were worked around through careful query and query-part building.</p><p>Following that, they created lists of &#x201C;near-miss&#x201D; examples to be used as negative examples that don&#x2019;t belong to green plastics. 
The queries used combined CPC subclass allocations with targeted keyword negations.</p><p>The resulting dataset effectively built upon samples of queries and CPC allocations generated and verified by human experts.</p><figure class="kg-card kg-image-card"><img src="https://lh3.googleusercontent.com/H4OtVmqNZJCGrolIxUwi-tudoRoIOZrsv0zBSrmmBQCTvF-UjYU2pjpLnQXp8ssD8EhENBe56QSvN4Eu_yXxuN0ydfhfxi2KueQHxtnJdKmWa2ZOPs6LAQlzvO08mn97hHrdD0-9Ut4aSvFWWe0B7t4" class="kg-image" alt="AI models for classifying green plastics patents" loading="lazy"></figure><h2 id="3-preprocess-patent-examplesextracting-a-%E2%80%9Cpatent-dna%E2%80%9D">3. Preprocess patent examples - extracting a &#x201C;patent DNA&#x201D;</h2><p>To preprocess the patent examples, the team followed these steps:</p><ul><li>For each patent ID, extract the title, abstract, description, and claims as text, along with metadata, using the EPO OPS service</li><li>Translate* into English all parts not already available in English</li><li>Summarize** the title &amp; abstract into 75 words, the description into 180 words, and the claims into 150 words</li></ul><p>The result was a new balanced dataset for green plastics classification totaling 4.3k patents: 2.2k positive and 2.1k negative.</p><p>The dataset includes three versions of the extracted &#x201C;patent DNA&#x201D; for each patent: small (400 words), medium (1000 words), and large (1500 words), using summaries and full text, with a total size of 597 MB.
The three sizes of &#x201C;patent DNA&#x201D; enable the application of AI language models of varying complexity.</p><figure class="kg-card kg-image-card"><img src="https://helvia.ai/blog/content/images/2023/02/image.png" class="kg-image" alt="AI models for classifying green plastics patents" loading="lazy" width="888" height="482"></figure><p>[* Google Translate was used for automated translation]</p><p>[** OpenAI davinci-003 was used for automated summarization]</p><h2 id="4-train-state-of-the-art-ai-pipelines">4. Train state-of-the-art AI pipelines</h2><p>To train the AI pipelines, the team harnessed the power of Large Language Models (LLMs) using OpenAI and Cohere managed infrastructure and APIs:</p><ul><li>Zero-shot: no examples, only a definition of the task is given to the LLM</li><li>Few-shot (in-context) learning: 1 or 2 examples are given per class to the LLM, multiple trials are executed, and a majority vote is taken</li><li>Fine-tuning: the dataset is used to fine-tune the LLM</li><li>Custom MLP neural network: the dataset is used to train a Multi-Layer Perceptron that employs LLM embeddings as its input</li></ul><p>In addition, there were two pipelines per approach:</p><ul><li>Binary: decide whether a patent is green plastics or not (yes/no)</li><li>Multi-label: decide which 2nd- and 3rd-level cartography class a patent belongs to (pick a class, or NEG otherwise)<br></li></ul><figure class="kg-card kg-image-card"><img src="https://lh3.googleusercontent.com/ylkg-km83XSJX7mVm6oL3lchY30Sur7E20Hv-nUobm_R1l7scbjPjw4Kzuv_xT6KUQQ1_uyR7VyeS_3XbIafBFfQSK20UeGfbABS7kkR9mMEZyiVz0DcAaV3P8pfwxvFSbmfQN4kPDjrdNYFrq8KUys" class="kg-image" alt="AI models for classifying green plastics patents" loading="lazy"></figure><h2 id="5-evaluate-the-results-and-select-the-winning-approach">5.
Evaluate the results and select the winning approach</h2><p>The winning approach was <em>E2 &#x2013; MLP with ada-002 embeddings</em>, which was trained on the dataset for multi-label classification.</p><figure class="kg-card kg-image-card"><img src="https://lh6.googleusercontent.com/E-jspbevBkyJveU8UrZp12JIQWzTTlb6nxoo0yv9ekNcDP6aPnp_XEzOTKSo180Nob80xMyMTT2ASkJ0RqIVApIoLRT5bfs6VsTisBiBydvtdAI22MdlAOxL1fCU505-yXhX-kLv5SRjdbsT6arLTYM" class="kg-image" alt="AI models for classifying green plastics patents" loading="lazy"></figure><p>Note that in the table above, for E2 and E3 we report the &#x201C;aggregate&#x201D; results: the AI model was trained to select a 2nd- or 3rd-level category, but we only count whether it correctly decided, with high confidence, that the patent belongs to green plastics or not. In this way we get the binary decision (&#x201C;Is it green plastics or not?&#x201D;) along with some hints about why the model classified it that way.</p><p>The evaluation findings showed that:</p><ul><li>Automated summaries are weak: Crucial details for deciding on green plastics seem to be missing. As a result, smaller models such as the BERT family, which accept up to ~350 words, are not expected to work well with automated summaries</li><li>Text generation LLMs do not perform well: Perhaps significantly larger datasets are needed; fine-tuning with 4.3k examples did not yield decent performance</li><li>Text classification LLM embeddings are powerful: Using a custom Multi-Layer Perceptron on top leads to near-perfect binary classification, i.e.
deciding whether a patent belongs to green plastics</li><li>Multi-label classification performs similarly to binary: With multi-label classification we can also get an explanation of the &#x201C;yes&#x201D; response in terms of the 2nd- or 3rd-level cartography classes<br></li></ul><p>The table below shows some indicative results using E2, which also provides an explanation:</p><figure class="kg-card kg-image-card"><img src="https://lh5.googleusercontent.com/ZHKAZzwUWVINtDQou0GV1LwTOS4zdUWcTMweCy33mCfXj5dkM-uQWme73sLmt6husU5tV_gYqa47Wm-0_0X-WlV2FLOanWXK_Rnu9kc6JN7AePUTiBSbCtaeEGKx3vlj5aFx6CEg0aeC_Dk3YSz_-M8" class="kg-image" alt="AI models for classifying green plastics patents" loading="lazy"></figure><h2 id="6-refine-the-winning-approach-toward-a-practical-mvp"><br>6. Refine the winning approach toward a practical MVP</h2><p>The use of &#x201C;patent DNA&#x201D; of various sizes enables the exploration of cost-accuracy tradeoffs. The team investigated further modifications of the winning solution, including the use of medium-sized patent DNA. Using the medium-sized patent DNA, approach E5 was introduced, which aims at lower cost and latency in production due to a smaller input token count.
Compared against the winner (E2), the results were the following:</p><ul><li>E5 cost savings vs E2: ~30% smaller input token counts</li><li>E5 accuracy penalty vs E2: small on the binary decision (96.69% vs 98.8%), but significant on the 2nd-level decision, as the following reports show</li></ul><figure class="kg-card kg-image-card"><img src="https://lh4.googleusercontent.com/pI9F80piAhu02VQEb_lSj0E7EhnscSrG34A5c2_5OvPVCoG5Bg0wNnqth0WkM_MtHijzsddL71VtnakZUQVCD3lCkVD1EyZYdCQ_c9VlNTZtPNBJKB06D0ws6bkbp96MbBwZopHLHGP2EBJerU" class="kg-image" alt="AI models for classifying green plastics patents" loading="lazy"></figure><h2 id="conclusions">Conclusions</h2><p>The solution contains a comprehensive analysis of traditional and modern, state-of-the-art models and approaches, utilizing all available published expert information (green plastics cartography, queries, CPC subclasses) to create a new dataset. All proposed and tested AI pipelines of AI4EPO are, by design, directly transferable to other base models and datasets.</p><p>The team proposes E2, which employs state-of-the-art LLM embeddings* combined with a custom Multi-Layer Perceptron neural network, producing excellent results on binary yes/no decisions, i.e. detecting whether a patent relates to green plastics or not. In addition, it offers information on <em>why </em>by classifying patents into the cartography entries of green plastics.</p><p>[* OpenAI model text-embedding-ada-002, published on 15/12/2022]</p><h2 id="next-steps">Next steps</h2><p>The next steps include evaluation of the results on ground truth data using green plastics experts, employing a &#x201C;human-in-the-loop&#x201D; approach for generating a premium dataset and continuous improvement, following the experience of a similar project (A challenge on large-scale biomedical semantic indexing and question answering, http://bioasq.org/).</p><p>Large dataset generation may be further streamlined and optimized using powerful EPO internal tools.
Additionally, further optimization of the input token size (&#x201C;patent DNA&#x201D;) can improve performance-accuracy tradeoffs.</p><p>Lastly, generating a premium dataset may further facilitate the multi-class approach and reduce the confusion between green plastics categories.</p>]]></content:encoded></item><item><title><![CDATA[Transformation through Provocation? Designing a ‘Bot of Conviction’ to Challenge Conceptions and Evoke Critical Reflection]]></title><description><![CDATA[https://saraperry.wordpress.com/wp-content/uploads/2019/05/roussou_et_al_2019_chipaper627.pdf]]></description><link>https://helvia.ai/labs/transformation-through-provocation-designing-a-bot-of-conviction-to-challenge-conceptions-and-evoke-critical-reflection/</link><guid isPermaLink="false">678a38c7e237550008b809ea</guid><category><![CDATA[publication_pdf]]></category><dc:creator><![CDATA[Stavros Vassos]]></dc:creator><pubDate>Sat, 04 May 2019 11:02:00 GMT</pubDate><media:content url="https://helvia.ai/labs/content/images/2025/01/Screenshot-2025-01-17-at-1.07.52-PM.png" medium="image"/><content:encoded><![CDATA[<img src="https://helvia.ai/labs/content/images/2025/01/Screenshot-2025-01-17-at-1.07.52-PM.png" alt="Transformation through Provocation? Designing a &#x2018;Bot of Conviction&#x2019; to Challenge Conceptions and Evoke Critical Reflection"><p>@ CHI 2019</p>]]></content:encoded></item></channel></rss>