Getting to know LLM-as-a-Judge
What Are Evals
Matt Pocock wrote an article about what evals are. Normal software is deterministic: if you capitalize a single word in an app menu, you can be fairly confident in the outcome of that change. But capitalizing a single word in a prompt can create massive ripple effects. In AI systems, no change is small.
Evals give you a score you can use to see how well your AI system is performing.
- Deterministic evals are traditional pass/fail checks.
- Another technique is to pass the results of your prompts into another LLM, and use that LLM as a judge.
- Some outputs can only be usefully evaluated by humans. These involve long-form text generation and certain types of factuality.
For example, a single eval can mix deterministic and LLM-based scorers:
scorers: [
// Checks if output is long enough
length,
// Uses an LLM to check if it's accurate
factualAccuracy,
// Uses an LLM to check writing style
writingStyle,
],
What is LLM-as-a-Judge
LLM-as-a-Judge is a technique that uses an LLM to evaluate LLM responses against any criteria of your choice. Instead of relying on human judgment, validation is delegated to a second model, typically a larger, cloud-based LLM that is likely to have stronger reasoning capabilities.
It lets you assess outputs the way a human would, without requiring costly human time. The method was introduced in the paper “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”.
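As a minimal sketch of the idea in Python: the call_llm(model, prompt) helper below is hypothetical and stands in for whichever client you use to reach the judge model, and the prompt wording and model name are illustrative, not from any of the cited articles.

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical helper: replace with your provider's chat/completions client."""
    raise NotImplementedError

JUDGE_PROMPT = (
    "You are evaluating an answer to a user question.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate how well the answer addresses the question on a scale of 1 to 4. "
    "Reply with the number only."
)

def judge(question: str, answer: str, judge_model: str = "large-judge-model") -> int:
    # Delegate the quality judgment to a second, stronger model.
    reply = call_llm(judge_model, JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())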
The article “Compare LLM capability with summarization” shows how to evaluate models’ summarization capabilities:
- We compared two of Google’s open models that can run client-side, Gemma and Gemma 2, both at their 2-billion-parameter size.
- We used a dataset of 2,225 BBC articles and had each selected model generate a summary of every article. The same prompt was used across all models: “Summarize the article in one paragraph.” (A sketch of this generation step follows the list.)
- To analyze summary quality, we used Gemini 1.5 Flash to judge the summaries created by Gemma 2B and Gemma 2 2B.
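A rough sketch of the generation step, reusing the hypothetical call_llm helper from the earlier snippet; the model identifiers are placeholders, and only the one-paragraph prompt comes from the article.

SUMMARIZE_PROMPT = "Summarize the article in one paragraph.\n\n{article}"

def generate_summaries(articles, models=("gemma-2b", "gemma-2-2b")):
    # Same prompt for every model, one summary per (model, article) pair.
    return {
        model: [call_llm(model, SUMMARIZE_PROMPT.format(article=article)) for article in articles]
        for model in models
    }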
We broke the evaluation process into two steps, both handled by the judge model (a sketch of both steps follows the list below). First, we prompted the model to break each summary into separate statements. Then, we prompted the model to determine whether each statement is supported by the original article text. The model classified each statement’s validity as:
- Yes. The statement is supported by the original text.
- No. The statement contradicts the original text.
- Idk. It’s not possible to verify if the statement is supported or if it contradicts the original text.
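Here is a sketch of the two-step judging flow, again using the hypothetical call_llm helper from above; the prompt wording and the JSON output format are assumptions, and only the Yes / No / Idk labels come from the article.

import json

EXTRACT_PROMPT = (
    "Break the following summary into a list of separate factual statements. "
    "Return a JSON array of strings.\n\nSummary:\n{summary}"
)
VERIFY_PROMPT = (
    "Article:\n{article}\n\nStatement:\n{statement}\n\n"
    "Is the statement supported by the article? Answer with exactly one of: Yes, No, Idk."
)

def judge_summary(article: str, summary: str, judge_model: str = "gemini-1.5-flash"):
    # Step 1: ask the judge to split the summary into individual statements.
    statements = json.loads(call_llm(judge_model, EXTRACT_PROMPT.format(summary=summary)))
    # Step 2: classify each statement against the original article as Yes / No / Idk.
    verdicts = [
        call_llm(judge_model, VERIFY_PROMPT.format(article=article, statement=s)).strip()
        for s in statements
    ]
    return statements, verdicts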
This process resulted in two metrics that can be used to compare the models:
- Alignment: how often the statements included in a summary are supported by the original content the summary is based on. Gemma 2 2B achieves a respectable score of 78.64%, indicating a good level of accuracy, while the earlier Gemma 2B has a lower alignment score, meaning it is more prone to including information not supported by the original text.
- Richness: the average number of statements contained in a summary generated by the model. Gemma 2 2B has the highest richness score at 9.1, indicating that its summaries include more details and key points, while Gemma 2B’s lower richness score suggests it may not capture as much of the important information from the original text.
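With the per-statement verdicts in hand, both metrics reduce to simple counting. This is one plausible reading of the definitions above: alignment as the share of “Yes” verdicts across all statements, richness as the average number of statements per summary.

def alignment_and_richness(verdicts_per_summary):
    """verdicts_per_summary: one list of "Yes"/"No"/"Idk" labels per summary."""
    total_statements = sum(len(v) for v in verdicts_per_summary)
    supported = sum(v.count("Yes") for v in verdicts_per_summary)
    alignment = supported / total_statements                   # share of statements backed by the source
    richness = total_statements / len(verdicts_per_summary)    # average statements per summary
    return alignment, richness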
Improve the LLM judge
A few prompt changes make the judge more reliable: use a small integer scale, spell out what each grade means, and require the judge to write its rationale before giving the rating. The improved prompt below does exactly that:
IMPROVED_JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.
Here is the scale you should use to build your answer:
1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
2: The system_answer is mostly not helpful: misses some key aspects of the question
3: The system_answer is mostly helpful: provides support, but still could be improved
4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question
Provide your feedback as follows:
Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 4)
You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
Now here are the question and answer.
Question: {question}
Answer: {answer}
Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Feedback:::
Evaluation: """
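To use the prompt, fill in the placeholders and pull the rating back out of the judge’s reply. The call_llm helper (from the earlier sketch) and the parsing regex are assumptions; the Feedback / Evaluation / Total rating layout is the one the prompt itself asks for.

import re

def rate_answer(question: str, answer: str, judge_model: str = "large-judge-model"):
    prompt = IMPROVED_JUDGE_PROMPT.format(question=question, answer=answer)
    reply = call_llm(judge_model, prompt)
    # The prompt instructs the judge to end with "Total rating: <1-4>"; extract that number.
    match = re.search(r"Total rating:\s*([1-4])", reply)
    rating = int(match.group(1)) if match else None
    return reply, rating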