Many enterprise teams have already deployed LLMs in production, and many others have committed to shipping Generative AI products in 2024. However, for enterprise AI teams, the biggest hurdle to deploying production-ready Generative AI remains the fear of model hallucinations – a catch-all term for when a model generates text that is incorrect or fabricated. Hallucinations can have several causes, such as the model's limited capacity to memorize all of the information it was trained on, errors in the training data, and outdated training data.
A number of LLM benchmarks exist today. While they do much to advance the adoption of LLMs, they have a few critical blind spots.
The Hallucination Index offers a structured approach to assessing and measuring hallucinations, with the goal of helping teams build more trustworthy GenAI applications.
Why
There has yet to be an LLM benchmark report that provides a comprehensive measurement of LLM hallucinations. After all, measuring hallucinations is difficult, as LLM performance varies by task type, dataset, context and more. Further, there isn’t a consistent set of metrics for measuring hallucinations.
What
The Hallucination Index ranks popular LLMs based on their propensity to hallucinate across three common task types: question & answer without RAG, question & answer with RAG, and long-form text generation.
How
The Index ranks the performance of 11 leading LLMs across three task types. The LLMs were evaluated using seven popular datasets. To measure hallucinations, the Hallucination Index employs two metrics, Correctness and Context Adherence, both built with the state-of-the-art evaluation method ChainPoll.
20k+ rows of text
11 popular LLMs
3 task types
To learn more, see the Methodology section below.
Hallucination Index
Question & Answer with RAG: A model that, when presented with a question, retrieves relevant information from a given dataset, database, or set of documents to provide an accurate answer. This approach is akin to looking up information in a reference book or searching a database before responding, making it well suited to tasks that require domain-specific information.
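As a rough illustration, the sketch below shows the retrieve-then-answer pattern described above. The toy keyword retriever and the `llm` callable are placeholders for this sketch only; they are not part of the Index's actual setup.

```python
# Minimal retrieve-then-answer sketch of the Q&A with RAG task type.
# The retriever and the `llm` callable are illustrative placeholders.

from typing import List


def retrieve(question: str, documents: List[str], top_k: int = 3) -> List[str]:
    """Toy retriever: rank documents by naive keyword overlap with the question."""
    terms = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: -len(terms & set(d.lower().split())))
    return ranked[:top_k]


def answer_with_rag(question: str, documents: List[str], llm) -> str:
    """Ground the model's answer in retrieved context rather than its memory."""
    context = "\n".join(retrieve(question, documents))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)  # `llm` is any callable mapping a prompt to a completion
```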
Metric Used
Context Adherence:
Context Adherence evaluates the degree to which a model's response aligns strictly with the given context, serving as a metric to gauge closed-domain hallucinations, wherein the model generates content that deviates from the provided context.
The higher the Context Adherence score (i.e., a value of 1 or close to 1), the more likely the response contains only information from the context provided to the model.
The lower the Context Adherence score (i.e., a value of 0 or close to 0), the more likely the response contains information not included in the context provided to the model.
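As a small, hypothetical example of how these scores might be used downstream, the snippet below flags responses whose Context Adherence is close to 0 as likely closed-domain hallucinations. The 0.5 threshold is an arbitrary choice for this sketch, not a value defined by the Index.

```python
# Illustrative use of Context Adherence scores: surface responses whose score
# suggests content unsupported by the provided context. The threshold is an
# assumption for this sketch.

def flag_low_adherence(scored_responses, threshold=0.5):
    """Return (response, score) pairs whose adherence falls below the threshold."""
    return [(resp, score) for resp, score in scored_responses if score < threshold]


examples = [
    ("Answer grounded entirely in the provided context.", 0.96),
    ("Answer that adds details not found in the context.", 0.12),
]
print(flag_low_adherence(examples))
# -> [('Answer that adds details not found in the context.', 0.12)]
```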
Datasets Used
Snippets from prompting LLMs
Dataset Used: Hotpot QA
Example prompts:
- What album did John Lennon release before the one that contained the song "How?"
- Frank Blake's longtime protégé was also the chairman and CEO of what company in addition to The Home Depot?
- Edward Marszewski is the editor-in-chief of Lumpen, who is the editor-in-chief of Saveur?
Dataset
We use standard datasets and create appropriate prompts for each model.
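A minimal sketch of what such prompt construction could look like for a HotpotQA-style sample is shown below; the template wording and field names (`context`, `question`) are assumptions for illustration, not the Index's actual model-specific templates.

```python
# Hypothetical prompt templating for a HotpotQA-style RAG sample. The exact
# per-model templates used for the Index are not reproduced here.

RAG_TEMPLATE = (
    "Use the following passages to answer the question.\n\n"
    "Passages:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(sample: dict) -> str:
    """Fill the template from a dataset row with 'context' and 'question' fields."""
    context = "\n".join(sample["context"])
    return RAG_TEMPLATE.format(context=context, question=sample["question"])
```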
Generation
Generations are produced with similar, model-specific prompts, without chain-of-thought (CoT) prompting, and with the same text generation configuration (i.e., hyper-parameters) for every model.
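The sketch below illustrates the idea of holding one generation configuration fixed across all models; the specific hyper-parameter values are assumed for illustration and are not stated in the Index.

```python
# Sketch of a fixed text-generation configuration shared across models.
# The values below are assumptions; the Index only states that the same
# hyper-parameters were used for every model.

GENERATION_CONFIG = {
    "temperature": 0.0,  # deterministic decoding for comparability (assumed)
    "max_tokens": 256,   # cap on response length (assumed)
    "top_p": 1.0,
}


def generate(llm, prompt: str) -> str:
    """Generate a response without any chain-of-thought instructions."""
    return llm(prompt, **GENERATION_CONFIG)
```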
Evaluation
We use an LLM-based evaluation for scalability in both cost and time. Specifically, we use the state-of-the-art ChainPoll metric to evaluate propensity for hallucination.
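A simplified sketch of a ChainPoll-style check follows: a judge LLM is polled several times with a chain-of-thought prompt asking whether the response is supported by the context, and the yes votes are averaged into a 0-1 score. The judge prompt, answer parsing, and poll count here are assumptions, not Galileo's exact implementation.

```python
# Simplified ChainPoll-style scorer: poll a judge LLM several times and average
# the yes/no verdicts. The prompt wording and parsing are illustrative only.

JUDGE_PROMPT = (
    "Context:\n{context}\n\nResponse:\n{response}\n\n"
    "Think step by step, then finish with 'yes' if every claim in the response "
    "is supported by the context, or 'no' otherwise."
)


def chainpoll_score(judge_llm, context: str, response: str, n_polls: int = 5) -> float:
    """Fraction of polls in which the judge deems the response context-supported."""
    verdicts = []
    for _ in range(n_polls):
        judgement = judge_llm(JUDGE_PROMPT.format(context=context, response=response))
        final_line = judgement.strip().splitlines()[-1].lower()
        verdicts.append(1.0 if "yes" in final_line else 0.0)
    return sum(verdicts) / n_polls
```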
ChainPoll Efficacy
We leverage extensive human annotation to confirm the reliability of the ChainPoll metric for each task type.
Task score
The final task score is the mean of the dataset scores for that task. Each dataset score is the mean of the ChainPoll scores for the samples in that dataset. We emphasize that this is an LLM-based score, not a human evaluation score.
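In code, the aggregation described above is simply a mean of means; the dataset names and per-sample scores below are made up for illustration.

```python
# Task score = mean over datasets of (mean over samples of ChainPoll score).

from statistics import mean


def task_score(per_dataset_sample_scores: dict) -> float:
    """Average the per-dataset means into a single task-level score."""
    dataset_scores = [mean(scores) for scores in per_dataset_sample_scores.values()]
    return mean(dataset_scores)


# Example with made-up per-sample scores for two datasets:
print(task_score({"hotpot_qa": [1.0, 0.8, 0.6], "other_qa": [0.9, 0.7]}))  # ~0.8
```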
ChainPoll
ChainPoll, developed by Galileo Labs, is an innovative and cost-effective hallucination detection method for large language models (LLMs), and RealHall is a set of challenging, real-world benchmark datasets. Our extensive comparisons show ChainPoll's superior performance in detecting LLM hallucinations, outperforming existing metrics such as SelfCheck-Bertscore, G-Eval, and GPTScore by a significant margin in accuracy, transparency, and efficiency, while also introducing new metrics for evaluating LLMs' adherence and correctness in complex reasoning tasks.
| Metric | Aggregate AUROC |
|---|---|
| ChainPoll-GPT-4o | 0.86 |
| SelfCheck-Bertscore | 0.74 |
| SelfCheck-NGram | 0.70 |
| G-Eval | 0.70 |
| Max pseudo-entropy | 0.77 |
| GPTScore | 0.65 |
| Random Guessing | 0.60 |