NÓBREGA, H. L.; http://lattes.cnpq.br/5444210624277381; NÓBREGA, Henrique Lopes.
Abstract:
Large Language Models (LLMs) such as ChatGPT, Claude, and Llama 2 have revolutionized natural language processing, enabling many new use cases for applications that incorporate these models into their workflows. However, the high computational demands of these models lead to cost and latency issues, preventing LLM-based features from scaling to many services and products, especially when they depend on models with stronger reasoning capabilities, such as GPT-4 or Claude 3 Opus. Additionally, many queries sent to these models are duplicates. Traditional caching is a natural solution to this problem, but its inability to determine whether two queries are semantically equivalent leads to low cache hit rates. In this work, we propose exploring the use of semantic caching, which considers the meaning of queries rather than their exact wording, to improve the efficiency of LLM-based applications. We conducted an experiment using a real dataset from Alura, a Brazilian EdTech company, in a scenario where a student answers a question and GPT-4 corrects the answer. The results showed that 45.1% of the requests made to the LLM could have been served from the cache using a similarity threshold of 0.98, with a 4-10x improvement in latency. These results demonstrate the potential of semantic caching to improve the efficiency of LLM-based features, reducing costs and latency while maintaining the benefits of advanced language models like GPT-4. This approach could enable LLM-based features to scale to a wider range of applications, advancing the adoption of these powerful models across various domains.
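To make the idea of semantic caching concrete, the sketch below shows a minimal in-memory cache that looks up queries by embedding similarity instead of exact string match. Only the 0.98 similarity threshold comes from the abstract; the embedding model, storage structure, and class/method names are illustrative assumptions, not the implementation evaluated in this work.

```python
# Minimal sketch of a semantic cache for LLM responses.
# Assumptions: sentence-transformers for embeddings and a simple in-memory store;
# only the 0.98 similarity threshold is taken from the abstract above.
import numpy as np
from sentence_transformers import SentenceTransformer


class SemanticCache:
    def __init__(self, threshold: float = 0.98):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
        self.threshold = threshold
        self.embeddings: list[np.ndarray] = []  # cached query embeddings
        self.responses: list[str] = []          # cached LLM responses

    def _embed(self, text: str) -> np.ndarray:
        vec = self.model.encode(text)
        return vec / np.linalg.norm(vec)  # normalize so dot product equals cosine similarity

    def get(self, query: str) -> str | None:
        """Return a cached response if a semantically similar query was seen before."""
        if not self.embeddings:
            return None
        q = self._embed(query)
        sims = np.stack(self.embeddings) @ q
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        """Store the query embedding and the LLM's response for future reuse."""
        self.embeddings.append(self._embed(query))
        self.responses.append(response)
```

On a cache miss (`get` returns `None`), the application would call the LLM and store the result with `put`; with a high threshold such as 0.98, only near-identical queries reuse a cached response, trading off hit rate against the risk of returning an answer that does not fit the new query.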