
LLM Hallucination Index

A Ranking & Evaluation Framework For LLM Hallucinations

đź‘‹ Welcome to the Hallucination Index!

Many enterprise teams have already successfully deployed LLMs in production, and many others have committed to deploying Generative AI products in 2024. However, for enterprise AI teams, the biggest hurdle to deploying production-ready Generative AI products remains the fear of model hallucinations – a catch-all term for when a model generates text that is incorrect or fabricated. Hallucinations can have several causes, such as the model's limited capacity to memorize all of the information it was trained on, errors in the training data, and outdated training data.

Why another benchmark?

Several LLM benchmarks exist today. While these benchmarks do much to advance the adoption of LLMs, they have a few critical blind spots.

  • Not focused on LLM output quality: Existing benchmarks provide a generic evaluation of LLM attributes and performance rather than a focused evaluation of the quality of an LLM's output (its likelihood of hallucinating). As a result, these benchmarks do not leverage metrics that measure the actual quality of LLM outputs – one of the top concerns for enterprise GenAI teams today.
  • Not focused on task type: A benchmark that is practically useful for enterprise GenAI teams needs to account for variability across task types. For instance, a model that works well for chat might not be great at text summarization.
  • Not focused on the power of context: Retrieval-augmented generation (RAG) is a popular technique teams use to provide LLMs with useful context. Today's LLM benchmarks ignore how models perform with context – granted, there is nuance here with regard to the quality of that context, but measuring the variability in LLM performance across RAG and non-RAG tasks is critical.

The Hallucination Index offers a structured approach to assess and measure hallucinations as an endeavor to help teams build more trustworthy GenAI applications.

About the index

Why

There has yet to be an LLM benchmark report that provides a comprehensive measurement of LLM hallucinations. After all, measuring hallucinations is difficult: LLM performance varies by task type, dataset, context, and more. Further, there isn't a consistent set of metrics for measuring hallucinations.

What

The Hallucination Index ranks popular LLMs based on their propensity to hallucinate across three common task types: question & answer without RAG, question & answer with RAG, and long-form text generation.

How

The Index ranks the performance of 11 leading LLMs across three task types. The LLMs were evaluated using seven popular datasets. To measure hallucinations, the Hallucination Index employs two metrics, Correctness and Context Adherence, which are built with the state-of-the-art evaluation method ChainPoll.

  • 20k+ rows of text
  • 11 popular LLMs
  • 3 task types

To learn more about our methodology, see the Evaluation Methodology section below.

Hallucination Index

LLM Rankings by Task Type

Question & Answer with RAG

A model that, when presented with a question, retrieves relevant information from a given dataset, database, or set of documents to provide an accurate answer. This approach is akin to looking up information in a reference book or searching a database before responding, making it well suited to tasks that require domain-specific information.
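As a rough illustration of this setup, the sketch below shows the basic retrieve-then-generate loop behind Q&A with RAG. The `search_index` and `llm_complete` helpers are hypothetical stand-ins for a document retriever and an LLM completion call; they are not part of the Index's evaluation harness.

```python
# Minimal sketch of the retrieve-then-generate loop behind Q&A with RAG.
# `search_index` and `llm_complete` are hypothetical stand-ins for a document
# retriever and an LLM completion call; they are not part of the Index itself.

def answer_with_rag(question: str, search_index, llm_complete, top_k: int = 3) -> str:
    # 1. Retrieve the passages most relevant to the question.
    passages = search_index(question, top_k=top_k)
    context = "\n\n".join(passages)

    # 2. Ask the model to answer using only the retrieved context.
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_complete(prompt)
```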

Developer | Model | Context Adherence Score
OpenAI | gpt-4-0613 | 0.76
OpenAI | gpt-3.5-turbo-0613 | 0.75
OpenAI | gpt-3.5-turbo-1106 | 0.74
Hugging Face | zephyr-7b-beta | 0.71
OpenAI | gpt-3.5-turbo-instruct | 0.68
Meta | llama-2-70b-chat | 0.68
Meta | llama-2-13b-chat | 0.68
Mistral AI | mistral-7b-instruct-v0.1 | 0.67
Meta | llama-2-7b-chat | 0.65
TII UAE | falcon-40b-instruct | 0.60
MosaicML | mpt-7b-instruct | 0.58

🪄 Insights

  • OpenAI's GPT-4-0613 performed the best and was the least likely to hallucinate for Question & Answer with RAG.
  • While GPT-4-0613 performed the best, the faster and more affordable GPT-3.5-turbo-0613 and -1106 models performed nearly identically to it.
  • Hugging Face's Zephyr-7b-beta was the best-performing open-source model, outperforming Meta's 10x larger Llama-2-70b-chat, showing that larger models are not always better.
  • We found TII UAE's Falcon-40b and MosaicML's MPT-7b performed the worst for this task type.
  • Recommendation: GPT-3.5-turbo-0613

Metric Used

Context Adherence:

Context Adherence evaluates the degree to which a model's response aligns strictly with the given context. It serves as a metric for gauging closed-domain hallucinations, i.e., cases where the model generates content that deviates from the provided context.

A high Context Adherence score (at or near 1) indicates the response is likely grounded in the context provided to the model.

A low Context Adherence score (at or near 0) indicates the response likely contains information not included in the context provided to the model.
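To make the 0-to-1 scale concrete, here is a rough, hypothetical sketch of how a ChainPoll-style adherence score can be produced: a judge LLM is asked (with chain-of-thought) whether the response is supported by the context, it is polled several times, and the fraction of "yes" verdicts becomes the score. The `judge_llm` callable and the prompt are assumptions for this example; Galileo's production implementation may differ.

```python
# Hypothetical sketch of a ChainPoll-style context adherence check.
# `judge_llm` is an assumed callable returning the judge model's text output;
# the real implementation's prompt, judge model, and details may differ.

JUDGE_PROMPT = """Context:
{context}

Response:
{response}

Think step by step: is every claim in the response supported by the context?
End your answer with a single word, YES or NO."""

def context_adherence(context: str, response: str, judge_llm, n_polls: int = 5) -> float:
    votes = 0
    for _ in range(n_polls):
        verdict = judge_llm(JUDGE_PROMPT.format(context=context, response=response))
        votes += verdict.strip().upper().endswith("YES")
    # 1.0 means every poll judged the response grounded; 0.0 means none did.
    return votes / n_polls
```

In this framing, the score is directly interpretable as the share of judge votes that found the response grounded, which is why values near 1 and 0 map onto the interpretations above.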

🤦🏽‍♀️ Hallucination Blooper Reel

Snippets from prompting LLMs

Dataset Used: Hotpot QA

Prompt: What album did John Lennon release before the one that contained the song "How?"

  • GPT-4-0613 (Context Adherence score: 0.82): John Lennon/Plastic Ono Band
  • Llama-2-7b-chat (Context Adherence score: 0.55): John Lennon released "How?" on his album "Imagine" in 1971.
  • GPT-3.5-turbo-0613 (Context Adherence score: 0.17): Rock 'n' Roll

Prompt: Frank Blake's longtime protégé was also the chairman and CEO of what company in addition to The Home Depot?

  • GPT-4-0613 (Context Adherence score: 0.93): The context does not provide information on which company Frank Blake's longtime protégé was also the chairman and CEO of.
  • GPT-3.5-turbo-instruct (Context Adherence score: 0.51): Hughes Supply Incorporated
  • Llama-2-7b-chat (Context Adherence score: 0.18): Frank Blake's longtime protégé was also the chairman and CEO of Coca-Cola.

Prompt: Edward Marszewski is the editor-in-chief of Lumpen; who is the editor-in-chief of Saveur?

  • GPT-4-0613 (Context Adherence score: 0.92): James Oseland
  • Llama-2-7b-chat (Context Adherence score: 0.88): Edward Marszewski is the editor-in-chief of Lumpen.
  • Llama-2-70b-chat (Context Adherence score: 0.12): Colman Andrew

Evaluation Methodology


Dataset
We use standard datasets and create appropriate prompts for each model.

Generation
Generations are done with similar, model-specific prompts, without Chain-of-Thought (CoT) prompting, and using the same text generation configuration (i.e., hyperparameters) for every model.

Evaluation
We use an LLM-based evaluation for scalability in both cost and time. Specifically, we use the state-of-the-art ChainPoll metric to evaluate propensity for hallucination.

ChainPoll Efficacy
We leverage extensive human annotation to confirm the reliability of the ChainPoll metric for each task type.

Task score
The final task score is the mean of the dataset scores for that task, where each dataset score is the mean of the ChainPoll scores across the samples in the dataset. We emphasize that this is an LLM-based score, not a human evaluation score.
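The mean-of-means aggregation described above amounts to a few lines of code. The sample scores below are made up purely to illustrate the arithmetic; they are not Index data.

```python
# Mean-of-means aggregation: each dataset score is the mean of per-sample
# ChainPoll scores, and the task score is the mean of the dataset scores.
# The scores below are illustrative only.

def task_score(per_dataset_scores: dict[str, list[float]]) -> float:
    dataset_means = [sum(s) / len(s) for s in per_dataset_scores.values()]
    return sum(dataset_means) / len(dataset_means)

example = {
    "dataset_a": [0.9, 0.7, 0.8],  # per-sample ChainPoll scores (made up)
    "dataset_b": [0.6, 0.8],
}
print(round(task_score(example), 2))  # (0.8 + 0.7) / 2 = 0.75
```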

ChainPoll

ChainPoll, developed by Galileo Labs, is a cost-effective hallucination detection method for large language models (LLMs), and RealHall is its accompanying suite of challenging, real-world benchmark datasets. In our extensive comparisons, ChainPoll outperformed existing metrics such as SelfCheckGPT variants, G-Eval, and GPTScore by a significant margin in accuracy, transparency, and efficiency, while also introducing new metrics for evaluating LLMs' adherence and correctness in complex reasoning tasks.

Learn More


Metric | Aggregate AUROC
ChainPoll-GPT-4o | 0.86
SelfCheck-Bertscore | 0.74
SelfCheck-NGram | 0.70
G-Eval | 0.70
Max pseudo-entropy | 0.77
GPTScore | 0.65
Random Guessing | 0.60
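For readers who want to compute an aggregate AUROC like those above for their own detector, the sketch below shows the standard calculation with scikit-learn, given binary human hallucination labels and per-response detector scores. The labels and scores here are toy values, not data from the Index.

```python
# Illustrative AUROC computation for a hallucination detector, in the spirit of
# the comparison above. Labels and scores are toy values, not Index results.
from sklearn.metrics import roc_auc_score

# 1 = the response was hallucinated (per human annotation), 0 = it was faithful.
human_labels = [1, 0, 1, 0, 0, 1, 0, 1]
# Detector output per response, where higher means "more likely hallucinated"
# (e.g., 1 - Context Adherence).
detector_scores = [0.9, 0.2, 0.7, 0.4, 0.1, 0.8, 0.3, 0.6]

print(roc_auc_score(human_labels, detector_scores))  # 1.0 on this toy data
```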

đź”® Read the full report

You're on your way to learning:

  • Hallucination rankings by task type
  • Correctness and Context Adherence for each model
  • Evaluation methodology for hallucinations