*Ilias Stogiannidis, Stavros Vassos, Prodromos Malakasiotis, Ion Androutsopoulos*

## Abstract

Prompting Large Language Models (LLMs) performs impressively in zero- and few-shot settings. Hence, small and medium-sized enterprises (SMEs) that cannot afford the cost of creating large task-specific training datasets, but also the cost of pretraining their own LLMs, are increasingly turning to third-party services that allow them to prompt LLMs. However, such services currently require a payment per call, which becomes a significant operating expense (OpEx). Furthermore, customer inputs are often very similar over time, hence SMEs end up prompting LLMs with very similar instances. We propose a framework that allows reducing the calls to LLMs by caching previous LLM responses and using them to train a local inexpensive model on the SME side. The framework includes criteria for deciding when to trust the local model or call the LLM, and a methodology to tune the criteria and measure the tradeoff between performance and cost. For experimental purposes, we instantiate our framework with two LLMs, GPT-3.5 or GPT-4, and two inexpensive students, a k-NN classifier or a Multi-Layer Perceptron, using two common business tasks, intent recognition and sentiment analysis. Experimental results indicate that significant OpEx savings can be obtained with only slightly lower performance.

## Framework

We present *OCaTS* (Online Cost-aware Teacher Student Framework), a framework designed to train a local inexpensive model (student) using the responses of a more expensive model (teacher) in an online setting. Our approach is inspired by the teacher-student schema, but with the additional consideration of the cost associated with utilizing the teacher. This makes OCaTS a suitable solution for small and medium enterprises that want to leverage powerful and easily accessible Large Language Models via API, while minimizing operational expenses (OpEx). OCaTS consists of three main components: a *teacher*, which is typically a resource-intensive model that produces high-quality results; a *student*, which is a cost-effective model that is much smaller and simpler than the teacher; and a *cache*, which is a repository of incoming queries that have already been processed by the teacher.
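
The interaction between the three components can be sketched as follows. This is an illustrative toy simulation under stated assumptions: the `ToyTeacher` (an oracle standing in for a paid LLM API) and `ToyStudent` (an exact-match lookup standing in for the local model) are hypothetical stand-ins, not the paper's actual models.

```python
class ToyTeacher:
    """Stands in for a paid LLM API: always returns the correct label."""
    def __init__(self, oracle):
        self.oracle = oracle
        self.calls = 0  # number of paid API calls made

    def predict(self, query):
        self.calls += 1
        return self.oracle[query]


class ToyStudent:
    """Stands in for a cheap local model: exact-match lookup on the cache."""
    def __init__(self):
        self.cache = {}

    def confident(self, query):
        return query in self.cache

    def predict(self, query):
        return self.cache[query]

    def learn(self, query, label):
        self.cache[query] = label


def ocats_answer(query, student, teacher):
    """Answer with the cheap student when possible; otherwise call the
    expensive teacher and cache its response for future queries."""
    if student.confident(query):
        return student.predict(query)   # free local answer
    label = teacher.predict(query)      # paid API call
    student.learn(query, label)         # cache the teacher's response
    return label


oracle = {"reset my pin": "card_issue", "open account": "onboarding"}
teacher, student = ToyTeacher(oracle), ToyStudent()
stream = ["reset my pin", "open account", "reset my pin", "reset my pin"]
answers = [ocats_answer(q, student, teacher) for q in stream]
# Only 2 of the 4 queries reach the teacher; the repeats are served locally.
```

In the real framework, of course, the student generalizes to *similar* (not identical) queries and is periodically retrained on the growing cache; the criteria for trusting it are described below.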

## Cost-awareness

To integrate the cost aspect into the framework, we introduce a novel evaluation measure for such settings, called *discounted metric*. This metric, denoted as $\hat{\phi}$, aims to capture the trade-off between performance and cost. It is computed using the following equation: $$\hat{\phi} = \phi - \lambda \cdot \rho = \phi - \lambda \cdot \frac{M}{N}.$$

In this equation, $\phi$ represents a conventional evaluation metric such as accuracy. Parameter $\lambda$ is a weighting factor that determines the importance of cost (higher values indicate that cost is considered more significant for the SME). The variables $M$ and $N$ correspond to the total number of calls made to the teacher model and the total number of queries handled, respectively. The discounted metric penalizes the overall performance of the framework based on the rate of calls made to the teacher model and the associated cost for the company. Intuitively, by making this metric the objective of the framework, the framework learns to maximize performance by letting the student respond only when it is confident enough, and prompting the teacher otherwise.
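
As a quick sanity check, the discounted metric is straightforward to compute. The numbers in the example reuse the experimental figures reported on this page (82.68% accuracy with 1050 teacher calls out of 3080 queries at $\lambda = 0.05$):

```python
def discounted_metric(phi, teacher_calls, total_queries, lam):
    """Discounted metric phi_hat = phi - lambda * (M / N)."""
    return phi - lam * (teacher_calls / total_queries)

# 82.68% accuracy, 1050 teacher calls out of 3080 queries, lambda = 0.05
score = discounted_metric(0.8268, 1050, 3080, lam=0.05)  # roughly 0.8098

# Always calling the teacher means M = N, so the penalty is exactly lambda:
always_teacher = discounted_metric(0.8305, 3080, 3080, lam=0.05)  # 0.7805
```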

## Choosing between Student & Teacher

We focus on applying the framework to a text classification problem. To determine whether to rely on the student's response or prompt the teacher, the framework incorporates two criteria inspired by Active Learning. If both criteria are satisfied, the student's response is trusted; otherwise, the query is delegated to the teacher.

The **first criterion** ensures that the new query is well represented by the cached queries the student has been trained on. This is achieved by measuring how close the new query is to its $k$ most similar cached queries. Let the *weighted centroid vector* $c$ of the $k$ nearest neighbors be $c = \sum_{i=1}^{k}\hat{w}_i \cdot v_i$, with $\hat{w}_i = w_i/\sum_{j=1}^{k} w_j$, where $w_i$ is the weight assigned by a distance weighting scheme to the $i$-th neighbor and $v_i$ is the vector representation of that neighbor. The first criterion requires the distance between the query's vector representation and $c$ to be below a threshold $t_c$. Essentially, this condition ensures that the student has previous experience with similar cached queries.
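
A minimal sketch of this criterion, assuming plain Python lists as vectors and Euclidean distance (the actual distance weighting scheme and embedding space are implementation details not fixed here):

```python
from math import dist  # Euclidean distance (Python 3.8+)


def weighted_centroid(neighbor_vecs, weights):
    """Weighted centroid c = sum_i w_hat_i * v_i of the k nearest neighbors,
    with w_hat_i = w_i / sum_j w_j."""
    total = sum(weights)
    w_hat = [w / total for w in weights]  # normalized neighbor weights
    dim = len(neighbor_vecs[0])
    return [sum(w_hat[i] * neighbor_vecs[i][d] for i in range(len(weights)))
            for d in range(dim)]


def passes_distance_criterion(query_vec, neighbor_vecs, weights, t_c):
    """First criterion: the distance between the query and the weighted
    centroid of its k nearest cached neighbors must be below t_c."""
    c = weighted_centroid(neighbor_vecs, weights)
    return dist(query_vec, c) < t_c
```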

The **second criterion** ensures that the student is sufficiently confident in its prediction. To establish it, let $C$ be the set of labels (classes) of the text classification problem. The probability $p_c$ of each label $c \in C$ is defined as: $$p_c = \frac{\exp(W_c)}{\sum_{c' \in C} \exp(W_{c'})},$$ where $W_c$ can be the weight assigned to label $c$ by the $k$-NN algorithm or the corresponding logit of an MLP. The *entropy* $\mathcal{H}$ of the label probabilities $p_c$ is: $$\mathcal{H} = -\sum_{c \in C} p_c \log{p_c}.$$ The second criterion requires $\mathcal{H}$ to be below a threshold $t_\mathcal{H}$. Essentially, this condition ensures that the student is confident about its response.
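
The softmax and entropy computations above can be sketched directly; this is a small standalone illustration, not the paper's implementation:

```python
from math import exp, log


def softmax(class_weights):
    """p_c = exp(W_c) / sum_c' exp(W_c') over the class weights/logits."""
    m = max(class_weights)                 # subtract max for numerical stability
    exps = [exp(w - m) for w in class_weights]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """H = -sum_c p_c log p_c; low entropy means a confident student."""
    return -sum(p * log(p) for p in probs if p > 0)


def passes_entropy_criterion(class_weights, t_h):
    """Second criterion: the entropy of the student's label distribution
    must be below the threshold t_H."""
    return entropy(softmax(class_weights)) < t_h
```

For example, a student that strongly favors one class (weights `[10, 0]`) yields near-zero entropy and passes the criterion, whereas a uniform distribution (weights `[0, 0]`) yields the maximum entropy $\log|C|$ and fails it.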

## Results

We evaluate the framework on an intent recognition task for four indicative $\lambda$ values, which determine the importance of cost in the discounted metric $\hat{\phi}$ we introduced. We use the Banking77 dataset, a basic k-NN student, and GPT-4 as the teacher. As depicted in the figure below, OCaTS effectively manages the tradeoff between the frequency of teacher calls and accuracy. Specifically:

**Left part: Calls to the Teacher**
- Using OCaTS significantly reduces the calls to the teacher, and hence the OpEx.
- As $\lambda$ increases, the number of calls made to the teacher decreases.

**Middle part: Trade-off between accuracy & OpEx**
- At $\lambda=0.05$, OCaTS achieves accuracy close to that of the GPT-4 teacher (82.68% vs. 83.05%) with only about one-third of the teacher calls (1050 out of 3080).
- Increasing $\lambda$ lowers accuracy but further reduces the number of teacher calls.

**Right part: Discounted Accuracy ($\hat\phi$) Comparison**
- The right side of the figure compares the discounted accuracy ($\hat\phi$) of OCaTS (solid lines) with that of always calling the GPT-4 teacher (dashed lines).
- OCaTS consistently surpasses the teacher's discounted accuracy, highlighting its OpEx efficiency.

**Conclusion on OCaTS Superiority**
- OCaTS is superior in terms of OpEx to constantly reaching out to the teacher.
- The difference in favor of OCaTS becomes more pronounced as $\lambda$ increases, i.e., as reducing OpEx becomes more important.

## Takeaways

To the best of our knowledge, this is the first study to optimize API requests to commercial LLMs according to a cost-aware metric. Some takeaways:

- We introduce a framework for decreasing API requests to commercial LLMs like OpenAI's GPT-4 while maintaining performance standards, by caching responses.
- We introduce a discounted metric that measures the trade-off between performance and cost.
- We employ a smaller, more efficient student model to respond to queries similar to those previously handled by the teacher LLM.
- In our experiments, OCaTS matches the performance of OpenAI's GPT-4, scoring only 0.37 percentage points lower, while effectively cutting down API costs by calling the LLM teacher for only one-third of the incoming queries (1050 out of 3080).

## Bibtex

```
@inproceedings{stogiannidis-etal-2023-cache,
    title = "Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models",
    author = "Stogiannidis, Ilias and
      Vassos, Stavros and
      Malakasiotis, Prodromos and
      Androutsopoulos, Ion",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.1000",
    pages = "14999--15008"
}
```

## Acknowledgements

This work was supported by Google’s TPU Research Cloud (TRC) and was carried out in collaboration with AUEB's NLP Group.