As 2023 comes to a close, many enterprise teams have successfully deployed LLMs in production, while many others have committed to deploying Generative AI products in 2024. However, when we talk to enterprise AI teams, the biggest hurdle to deploying production-ready Generative AI products remains the fear of model hallucinations – a catch-all phrase for when the model generates incorrect or fabricated text. Hallucinations can stem from several causes, such as the model’s limited capacity to memorize all of the information it was trained on, errors in the training data, and training data that has simply gone out of date.
What makes tackling hallucinations so tricky is the variability and nuance that comes with generative AI. LLMs are not one-size-fits-all when it comes to hallucinations – the dataset, the task type (with or without context), and the LLM itself all influence how often hallucinations occur. Furthermore, every business is different: some are optimizing for cost while others are optimizing for sheer performance.
So, with this Hallucination Index benchmark across popular LLMs, we wanted to accomplish three things:
This Index aims to provide a framework to help teams select the right LLM for their project and use case.
Have a look at the Hallucination Index here!
The first question you’re probably asking is “why another benchmark?” After all, there is no shortage of benchmarks such as Stanford CRFM’s Foundation Model Transparency Index and Hugging Face’s Open LLM Leaderboard.
While existing benchmarks do much to advance the adoption of LLMs, they have a few critical blind spots.
We designed the Hallucination Index to address the practical needs of teams building real-world applications. The Hallucination Index was built with four principles in mind.
While LLM-based applications can take many shapes and sizes, in conversations with enterprise teams we noticed that generative AI applications typically fall into one of three task types: Question & Answer without RAG, Question & Answer with RAG, and Long-form Text Generation.
To evaluate each LLM’s performance for each task type, the Index uses seven popular and rigorous benchmarking datasets. The datasets effectively challenge each LLM's capabilities relevant to the task at hand. For instance, for Q&A without RAG, we utilized broad-based knowledge datasets like TruthfulQA and TriviaQA to evaluate how well these models handle general inquiries.
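To make the setup concrete, here is a minimal sketch of what evaluating an LLM on a Q&A-without-RAG dataset like TruthfulQA can look like. It assumes the Hugging Face datasets library and an OpenAI-compatible client, and is an illustration of the task type rather than the Index’s actual evaluation harness.

```python
# Minimal sketch of the Q&A-without-RAG task type, not the Index's actual harness.
# Assumes the Hugging Face `datasets` library and an OpenAI-compatible client.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# TruthfulQA's "generation" config pairs open-ended questions with reference answers.
truthful_qa = load_dataset("truthful_qa", "generation", split="validation")

def answer(question: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the model a closed-book question (no retrieved context)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.0,
    )
    return response.choices[0].message.content

# Collect (question, reference, model answer) triples for downstream hallucination scoring.
samples = [
    {
        "question": row["question"],
        "reference": row["best_answer"],
        "prediction": answer(row["question"]),
    }
    for row in truthful_qa.select(range(10))  # small slice, for illustration only
]
```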
This approach makes it easy for readers to identify the ideal LLM for their specific task type.
While the debate rages on about the best approach to use when building generative AI applications, retrieval augmented generation (RAG) is increasingly becoming a method of choice for many enterprises.
When it comes to RAG, each LLM responds differently when provided with context, for reasons such as context length limits and variability in how effectively the model uses its context window. To evaluate the impact of context on an LLM’s performance, the Index measured each LLM’s propensity to hallucinate when provided with context. To do this, we used datasets that come with verified context.
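For illustration, the Q&A-with-RAG setup boils down to giving the model the verified context and asking it to answer only from that context. The prompt wording below is a hypothetical sketch, not the prompt used by the Index.

```python
# Minimal sketch of the Q&A-with-RAG setup: answer only from the supplied context.
# The prompt template is illustrative, not the one used by the Hallucination Index.
from openai import OpenAI

client = OpenAI()

RAG_PROMPT = """Answer the question using only the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def answer_with_context(question: str, context: str, model: str = "gpt-3.5-turbo") -> str:
    """Q&A with RAG: the model sees verified context and should stay grounded in it."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": RAG_PROMPT.format(context=context, question=question)}],
        temperature=0.0,
    )
    return response.choices[0].message.content
```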
While existing benchmarks stick to traditional statistical metrics, reliably detecting hallucinations has more to do with capturing qualitative nuances in the model’s output that are specific to each task type. Asking GPT-4 to detect hallucinations is a popular (albeit expensive) approach, but ChainPoll has emerged as a superior method for detecting hallucinations in your model’s outputs, with high correlation to human evaluations.
The Hallucination Index is explicitly focused on addressing those qualitative aspects of the desired generated output.
Derived from ChainPoll, the Index applies two key metrics – Correctness and Context Adherence – to reliably compare and rank the world’s most popular LLMs on their propensity to hallucinate across the key task types.
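As a rough illustration of how a ChainPoll-style score can be computed: a judge LLM is asked, with chain-of-thought reasoning, whether an answer is hallucinated; the question is polled several times and the votes are aggregated. The judge prompt, verdict parsing, and defaults below are assumptions for the sketch – see the ChainPoll paper for the actual prompts and aggregation.

```python
# Simplified ChainPoll-style scorer. The judge prompt, verdict parsing, and defaults
# here are illustrative assumptions; see the ChainPoll paper for the real method.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Does the answer below contain hallucinations, i.e. claims not supported
by the provided context (or by well-established facts if no context is given)?
Think step by step, then finish with a single line: VERDICT: yes or VERDICT: no.

Context: {context}
Question: {question}
Answer: {answer}"""

def chainpoll_score(question: str, answer: str, context: str = "",
                    n_polls: int = 5, judge_model: str = "gpt-3.5-turbo") -> float:
    """Fraction of chain-of-thought judgments that flag a hallucination (0.0 = clean)."""
    votes = 0
    for _ in range(n_polls):
        reply = client.chat.completions.create(
            model=judge_model,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                context=context or "(none)", question=question, answer=answer)}],
            temperature=1.0,  # sampling variation is what makes repeated polling informative
        ).choices[0].message.content
        if "verdict: yes" in reply.lower():
            votes += 1
    return votes / n_polls
```

In this sketch, passing retrieved context to the judge approximates a context-adherence check, while leaving it empty approximates a correctness check on open-domain answers.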
When an LLM hallucinates, it generates responses that seem plausible but are either undesirable, incorrect, or contextually inappropriate. Since reliably detecting these errors is hard and nuanced, this is not something you can just ask another GPT to do for you (...yet!).
Today, the most trusted approach for detecting hallucinations remains human evaluation.
To validate the efficacy of the Correctness and Context Adherence metrics, we invested in thousands of human evaluations spanning the task types and datasets described above.
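Validating an automated metric against human judgments typically comes down to measuring agreement or correlation. The snippet below shows one common way to do that with placeholder numbers – it does not use data from the Index.

```python
# Illustrative only: measuring agreement between an automated hallucination score
# and human judgments. The numbers below are placeholders, not data from the Index.
from scipy.stats import spearmanr

human_labels  = [1, 0, 1, 1, 0, 0, 1]                # 1 = annotator flagged a hallucination
metric_scores = [0.8, 0.2, 0.6, 1.0, 0.0, 0.4, 0.8]  # e.g. ChainPoll-style scores per sample

correlation, p_value = spearmanr(human_labels, metric_scores)
print(f"Spearman correlation with human judgments: {correlation:.2f} (p = {p_value:.3f})")
```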
The Index ranks LLMs by task type, and the findings were quite intriguing! For Question & Answer without RAG, OpenAI's GPT-4 outshines its competitors, demonstrating exceptional accuracy. For Question & Answer with RAG, OpenAI's GPT-3.5-turbo models present a cost-effective yet high-performing alternative.
The Index also underscores the potential of open-source models. Meta’s Llama-2-70b-chat and Hugging Face's Zephyr-7b emerge as robust alternatives for long-form text generation and Question & Answer with RAG tasks, respectively. These open-source options not only offer competitive performance but also significant cost savings.
With that, be sure to dig into the results and our methodology at www.rungalileo.io/hallucinationindex.
While we don’t expect teams to treat the results of the Hallucination Index as gospel, we do hope the Index serves as an extremely thorough starting point to help teams kick-start their Generative AI efforts.
And we understand the LLM landscape is constantly evolving. With this in mind, we plan to update the Hallucination Index on a regular cadence. If you’d like to see additional LLMs covered or simply want to stay in the loop, reach out to us.
Explore our research-backed evaluation metric for hallucination – read our paper on ChainPoll.