Best Practices For Creating Your LLM-as-a-Judge

Pratik Bhavsar, Galileo Labs
Bogdan Gheorghe, Machine Learning Engineer

4 min read · October 22, 2024

This is part two of our blog series on LLM-as-a-Judge!

Part 1: LLM-as-a-Judge vs Human Evaluation

Part 2: Best Practices For Creating Your LLM-as-a-Judge

Part 3: Tricks to Improve LLM-as-a-Judge

We hope you enjoyed our last post on LLM vs. human evaluation. This post digs into the process of actually implementing an LLM-as-a-Judge system. Don't worry; we will also show you how to check that your AI judge is performing at its best, because even an AI judge needs a performance review!

How to Create LLM-as-a-Judge

First, let's dive into the core building blocks that turn a regular LLM into a powerful evaluator.

1. Determine the Evaluation Approach

Crafting an effective LLM-as-a-judge system begins with determining the most appropriate evaluation approach. This initial decision involves choosing between ranking multiple answers or assigning an absolute score. If you opt for an absolute scoring system, consider what supplementary information might aid the LLM in making more informed decisions. This could include extra context, explanations, or relevant metadata to enhance the evaluation process.
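
To make the distinction concrete, here is a minimal sketch of the two approaches expressed as prompt templates. The wording, placeholder names, and the build_prompt helper are illustrative assumptions, not a prescribed format.

# Minimal sketch of the two evaluation approaches as prompt templates.
# The wording, placeholders, and helper below are illustrative assumptions.

PAIRWISE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Which answer is better? Reply with exactly "A" or "B"."""

ABSOLUTE_PROMPT = """You are grading a single answer using the provided context.
Question: {question}
Context: {context}
Answer: {answer}
Is the answer correct and grounded in the context? Reply with exactly "True" or "False"."""

def build_prompt(mode: str, **fields) -> str:
    # mode is either "pairwise" (ranking) or "absolute" (scoring).
    template = PAIRWISE_PROMPT if mode == "pairwise" else ABSOLUTE_PROMPT
    return template.format(**fields)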

2. Establish Evaluation Criteria

Once the approach is determined, the next crucial step is establishing clear evaluation criteria to guide the LLM's assessment process. When comparing outputs, you'll need to consider various factors:

  • Should the focus be on factual accuracy or stylistic quality?
  • How important is the clarity of explanation?
  • Should the answer come only from the context given?
  • Are there specific output format requirements, such as JSON or YAML with specific fields?
  • Should the output be in phrases or complete sentences?
  • Is the response free from restricted keywords?
  • Does the response answer all of the questions asked?

These criteria will form the foundation of your evaluation framework.
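
One lightweight way to apply them is to spell the criteria out as an explicit rubric appended to the judge prompt, so the model checks each one before giving a verdict. The sketch below is only illustrative; the criterion wording is an assumption based on the questions above.

# Illustrative only: encode the evaluation criteria as a numbered rubric
# that gets appended to the judge prompt. Criterion wording is an assumption.

CRITERIA = [
    "Is the answer factually accurate?",
    "Is the answer grounded only in the provided context?",
    "Is the explanation clear?",
    "Does the output follow the required format?",
    "Is the response free of restricted keywords?",
    "Does the response address every question asked?",
]

def rubric_block() -> str:
    # Number the criteria so the judge can reference them in its explanation.
    return "Check each criterion before answering:\n" + "\n".join(
        f"{i + 1}. {c}" for i, c in enumerate(CRITERIA)
    )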

3. Define the Response Format

Defining the response format is equally important in creating an effective LLM-as-a-judge system. This involves carefully considering how the judge LLM should rate the LLM output. When choosing an appropriate scale, it's best to prioritize discrete scales with limited values, such as boolean (True/False) or categorical (Disagree/Neutral/Agree) options. These tend to be more reliable than star ratings or 1-10 point scales.

Additionally, specifying a clear output format ensures easy extraction of required values. For instance, you might request a JSON format that includes both an explanation and a boolean True/False value.
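
A minimal sketch of that pattern might look like the snippet below; the exact schema (an explanation plus a boolean "pass" field) and the defensive parsing are assumptions for illustration rather than a fixed standard.

import json

# Sketch: ask for a constrained JSON verdict, then parse it defensively.
# The schema (explanation + boolean "pass") is an assumed example format.

OUTPUT_INSTRUCTIONS = """Respond only with JSON in exactly this format:
{"explanation": "<one or two sentences>", "pass": true}"""

def parse_verdict(raw: str) -> dict:
    # Some models wrap JSON in code fences or extra prose; extract the object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"No JSON object found in judge output: {raw!r}")
    verdict = json.loads(raw[start:end + 1])
    if not isinstance(verdict.get("pass"), bool):
        raise ValueError("Judge output is missing a boolean 'pass' field")
    return verdict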

With all these elements in place, you're ready to craft the evaluation prompt! Creating the prompt is an iterative process, and refining it usually takes most of the time spent building an LLM-as-a-judge.

4. Choose the Right LLM for Your Judge

Once your prompt is refined, the next critical decision is choosing the appropriate LLM. This choice involves balancing several factors:

  • Performance vs. Cost: Stronger LLMs generally offer superior performance but at a higher cost. For simpler tasks, more modest models may suffice and should be prioritized to optimize resource allocation.
  • Task Specificity: Consider the complexity and nuance of your evaluation task. Some evaluations may benefit from the advanced capabilities of top-tier models, while others may perform equally well with more accessible options.
  • API Availability: Various LLM APIs are available, each with its strengths and limitations. Research and test different options to find the best fit for your specific use case.
  • Fine-Tuning Potential: In some cases, a task-specific fine-tuned LLM may offer superior performance. Evaluate whether the potential benefits of fine-tuning justify the additional time and resources required.
  • Prompt Adaptation: Remember that changing the model often necessitates adjustments to your prompt. Each LLM may have unique quirks or preferences in how it interprets instructions, so be prepared to fine-tune your prompt to align with the chosen model's characteristics.
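
One way to keep this flexibility cheap is to hide the judge model behind a thin interface so per-model prompt tweaks live in one place, as in the hypothetical sketch below. The class and parameter names are placeholders, not a real SDK.

from typing import Protocol

# Hypothetical sketch: keep the judge model swappable behind a tiny interface
# so per-model prompt adaptations stay in one place. Names are placeholders.

class JudgeModel(Protocol):
    def generate(self, prompt: str) -> str: ...

class ConfiguredJudge:
    def __init__(self, model: JudgeModel, prompt_suffix: str = ""):
        self.model = model
        # Per-model adaptation, e.g. a stricter formatting reminder.
        self.prompt_suffix = prompt_suffix

    def judge(self, prompt: str) -> str:
        return self.model.generate(prompt + self.prompt_suffix)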

5. Other Considerations

Bias detection: Regularly check for any systematic biases in the validator's judgments across different categories or types of content.
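
A simple starting point is to compare the judge's average score per content category and flag categories that drift far from the overall mean. The sketch below assumes scores in [0, 1], and the 0.15 threshold is an arbitrary example.

from collections import defaultdict

# Rough sketch of a category-level bias check. Assumes judge scores in [0, 1];
# the 0.15 threshold is an arbitrary example, not a recommended value.

def category_bias_report(records, threshold=0.15):
    # records: iterable of (category, judge_score) pairs
    by_category = defaultdict(list)
    for category, score in records:
        by_category[category].append(score)
    means = {c: sum(s) / len(s) for c, s in by_category.items()}
    overall = sum(means.values()) / len(means)
    # Flag categories whose mean score deviates sharply from the overall mean.
    return {c: round(m - overall, 3) for c, m in means.items() if abs(m - overall) > threshold}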

Consistency over time: Ensure the validator maintains consistent performance as it's exposed to new data or as the underlying LLM is updated.

Edge case handling: Test the validator with extreme or unusual cases to ensure it can handle a wide range of scenarios.

Interpretability: Strive for validator outputs that not only provide judgments but also explain the reasoning behind them.

Scalability: Ensure your validation process can handle increasing amounts of data as your needs grow.
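
If judge calls go through a network API, a thread pool is often enough to keep throughput up as volume grows. The sketch below assumes judge_fn wraps a rate-limit-safe call; eight workers is just an example setting.

from concurrent.futures import ThreadPoolExecutor

# Minimal scalability sketch: fan judgment calls out over a thread pool.
# Assumes judge_fn wraps an API call that handles its own rate limiting.

def judge_in_parallel(judge_fn, items, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge_fn, items))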

Addressing these aspects can help you develop a robust validation process for your LLM-as-a-Judge, ensuring its reliability and effectiveness across various applications.

How to Validate LLM-as-a-Judge

To validate an LLM acting as a judge, we must follow a structured process that ensures the model's reliability across various scenarios. The first step is to select data representative of the domain or task you're concerned with. This data can be either objective (with clear right or wrong answers) or subjective (open to interpretation), depending on your evaluation needs.

Next, generate LLM outputs for this selected data. These outputs will serve as the content to be judged by your validator LLM. It's crucial to ensure these outputs cover a wide range of quality and complexity to truly test the validator's capabilities.

Choosing the right evaluation metric is critical. For objective tasks, you might use straightforward statistical metrics. For subjective tasks, human annotation might be necessary to establish a ground truth. The choice between these depends on the nature of your task and the resources available.

Once you have your data and metrics in place, obtain judgments from your validator LLM. These judgments should be comprehensive and cover all aspects of the evaluation criteria you've established.

To assess the validator's performance, calculate several metrics that compare its judgments against the ground truth. Each of these serves a different purpose:

  • Precision (P) and Recall (R) are useful for understanding the validator's accuracy in identifying correct and incorrect outputs. Precision tells you how many of the validator's positive judgments are correct, while recall indicates how many of the truly positive instances the validator correctly identified.
  • The Area Under the Receiver Operating Characteristic curve (AUROC) provides a more holistic view of the validator's performance across different threshold settings. It's particularly useful when you need to balance sensitivity and specificity in your evaluations.
  • Cohen's Kappa is excellent for measuring agreement between the validator and human judgments, especially for subjective tasks. It accounts for the possibility of agreement by chance, providing a more robust measure than simple agreement percentages.

Each of these metrics has its pros and cons. Precision and recall are intuitive but can be misleading if used in isolation. AUROC provides a more comprehensive view but can be less intuitive to interpret. Cohen's Kappa is great for subjective tasks but requires careful interpretation in contexts where disagreement might be valid rather than erroneous.

Example: Creating and Validating an LLM-as-a-Judge

The example below ties these steps together: a judge scores generated summaries, and the validation function computes precision, recall, AUROC, and Cohen's Kappa against human scores.

import json
from typing import List, Tuple

import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score, cohen_kappa_score


class LLMJudge:
    def __init__(self, model):
        self.model = model

    def judge_summary(self, original_text: str, summary: str) -> float:
        # Ask the judge model for a single quality score between 0 and 1.
        prompt = f"""Evaluate the quality of the following summary on a scale of 0 to 1, where 0 is poor and 1 is excellent. Consider accuracy, completeness, and conciseness.

        Original text:
        {original_text}

        Summary:
        {summary}

        Quality score:"""

        response = self.model.generate(prompt)
        # Assumes the model returns a bare number; add parsing guards for real models.
        score = float(response.strip())
        return score


def read_prompts_from_file(filename: str) -> List[dict]:
    with open(filename, 'r') as file:
        return json.load(file)


def generate_summaries(data: List[dict], summarizer) -> List[Tuple[str, str, float]]:
    # Pair each source text with a generated summary and its human-annotated score.
    summaries = []
    for item in data:
        original_text = item['text']
        summary = summarizer.summarize(original_text)
        human_score = item['human_score']
        summaries.append((original_text, summary, human_score))
    return summaries


def validate_llm_judge(judge: LLMJudge, data: List[Tuple[str, str, float]], threshold: float = 0.5):
    true_scores = []
    predicted_scores = []

    for original, summary, human_score in data:
        predicted_score = judge.judge_summary(original, summary)
        true_scores.append(human_score)
        predicted_scores.append(predicted_score)

    # Binarize human and judge scores at the threshold for precision, recall, and kappa.
    true_binary = [1 if score >= threshold else 0 for score in true_scores]
    pred_binary = [1 if score >= threshold else 0 for score in predicted_scores]

    precision = precision_score(true_binary, pred_binary)
    recall = recall_score(true_binary, pred_binary)
    # AUROC needs binary ground-truth labels and the judge's continuous scores.
    auroc = roc_auc_score(true_binary, predicted_scores)
    kappa = cohen_kappa_score(true_binary, pred_binary)

    return {
        "precision": precision,
        "recall": recall,
        "auroc": auroc,
        "cohen_kappa": kappa
    }


class MockLLM:
    def generate(self, prompt: str) -> str:
        # Stand-in for a real LLM call: returns a random score as text.
        return str(np.random.random())


class MockSummarizer:
    def summarize(self, text: str) -> str:
        return f"Summary of: {text[:50]}..."


# Usage example
mock_llm = MockLLM()
judge = LLMJudge(mock_llm)
summarizer = MockSummarizer()

# Read prompts from file
prompts = read_prompts_from_file('prompts.json')

# Generate summaries
summaries = generate_summaries(prompts, summarizer)

# Validate the LLM judge
results = validate_llm_judge(judge, summaries)

print("Validation Results:")
for metric, value in results.items():
    print(f"{metric}: {value:.4f}")

Conclusion

Whew! We've covered a lot of ground in creating our LLM judge. It's quite the journey from picking the right approach to choosing the best LLM for the job. Your first attempt probably won't be perfect, and that's okay! Keep tweaking those prompts, run those validation metrics, and don't hesitate to switch things up. At Galileo, we have invested countless hours honing our LLM-as-Judge approaches to get high-fidelity metrics. Connect with us to learn more about our state-of-the-art evaluation capabilities.