Evaluation of LLM output is a non-trivial problem. Many people from a traditional, deterministic software engineering background treat models like deterministic systems that can be “programmed” with a prompt. However, responses are non-deterministic and subject to significant variation between runs. Evaluating LLMs at scale is therefore essential to understand whether the models are actually solving real problems.

For traditional classes of problems like text classification and token classification, we can simply apply metrics like the F1 score. Summarisation is a particularly challenging task to measure: metrics like BLEU and ROUGE capture overlap with reference material but not qualitative aspects like readability.
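
The gap between the two regimes is visible in code: classification scores come straight from label comparison, while summary scores need a reference text and still say nothing about readability. A minimal sketch, assuming scikit-learn and the rouge-score package are installed; the labels and texts are made up for illustration:

```python
# Classification vs. summarisation metrics (illustrative data only).
from sklearn.metrics import f1_score
from rouge_score import rouge_scorer

# Classification: predictions can be scored directly against gold labels.
y_true = ["spam", "ham", "spam", "ham"]
y_pred = ["spam", "spam", "spam", "ham"]
print("F1:", f1_score(y_true, y_pred, pos_label="spam"))

# Summarisation: ROUGE measures n-gram overlap with a reference summary,
# but says nothing about readability or fluency of the generated text.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "The central bank raised interest rates by half a percentage point."
candidate = "Interest rates were raised by half a point by the central bank."
print(scorer.score(reference, candidate))
```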

LLMs as Judges

More attention is now being given to the use of LLMs as judges as per this tweet from @chipro:

A big issue I see with AI systems is that people aren’t spending enough time evaluating their evaluation pipeline.

  1. Most teams use more than one metric (3-7 metrics in general) to evaluate their applications, which is a good practice. However, very few are measuring the correlation between these metrics.

If two metrics are perfectly correlated, you probably don’t need both of them. If two metrics strongly disagree with each other, either this reveals something important about your system, or your metrics just aren’t trustworthy.

  2. Many (I estimate 60 - 70%?) use AI to evaluate AI responses, with common criteria being conciseness, relevance, coherence, faithfulness, etc. I find AI-as-a-judge very promising, and expect to see more of this approach in the future.

AI-as-a-judge scores aren’t deterministic the way classification F1 scores or accuracy are. They depend on the judge’s model, the judge’s prompt, and the use case. Many AI judges are good, but many are bad.

Yet, very few are doing experiments to evaluate their AI judges. Are good responses given better scores? How reproducible are the scores: if you ask the judge twice, do you get the same score? Is the judge’s prompt optimal? Some aren’t even aware of the prompts their applications are using, because they use prompts created by eval tools or by other teams.

Also fun fact I learned from a (small) poll yesterday: some teams are spending more money on evaluating responses than on generating responses 🤯
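
The first point in the thread, checking whether your metrics actually measure different things, is easy to automate once per-response scores have been collected. A minimal sketch, assuming pandas; the metric names and scores below are illustrative:

```python
# Pairwise correlation between evaluation metrics (illustrative scores).
import pandas as pd

scores = pd.DataFrame({
    "conciseness": [4, 3, 5, 2, 4, 1],
    "relevance":   [5, 3, 5, 2, 4, 2],
    "coherence":   [3, 4, 2, 5, 3, 4],
})

# Spearman is a reasonable default for ordinal, judge-style scores.
print(scores.corr(method="spearman"))

# Near-perfectly correlated pairs are candidates for dropping one metric;
# strong disagreement means the system or the metrics need a closer look.
```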

This post enumerates a number of useful tips and techniques for evaluating LLM output for different use cases.
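
One of the experiments suggested in the thread, asking the judge to score the same response more than once and looking at the spread, is cheap to run. A minimal sketch, assuming the OpenAI Python client; the model name, rubric and prompt are placeholder choices, not a recommended setup:

```python
# Reproducibility check for an AI judge: score the same answer several times.
# Assumes the openai package (v1 client) and an OPENAI_API_KEY are configured.
from statistics import mean, stdev

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate the relevance of the answer to the question on a 1-5 scale.
Reply with a single integer only.

Question: {question}
Answer: {answer}"""


def judge_score(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return int(response.choices[0].message.content.strip())


question = "What does ROUGE measure?"
answer = "ROUGE measures n-gram overlap between a summary and its reference."

# Even at temperature 0 the scores can drift; a large spread is a warning sign.
scores = [judge_score(question, answer) for _ in range(5)]
print("scores:", scores, "mean:", mean(scores), "stdev:", stdev(scores))
```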

PHUDGE is a semi-automated approach to using LLMs for evaluation that demonstrates reasonable results with small models as judges.
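
In the same spirit, a small off-the-shelf model can be prompted as a judge. The sketch below is not the PHUDGE pipeline itself (which fine-tunes Phi-3 for judging); it simply asks a Phi-3 checkpoint for a score via transformers, with an illustrative prompt:

```python
# Prompting a small open model as a judge (not the actual PHUDGE method).
# Assumes transformers is installed and the checkpoint fits in local memory;
# older transformers versions may additionally need trust_remote_code=True.
from transformers import pipeline

judge = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")

prompt = (
    "Rate the relevance of the answer to the question on a 1-5 scale. "
    "Reply with a single integer only.\n"
    "Question: What does ROUGE measure?\n"
    "Answer: ROUGE measures n-gram overlap between a summary and its reference.\n"
    "Score:"
)

result = judge(prompt, max_new_tokens=5, return_full_text=False)
print("judge output:", result[0]["generated_text"].strip())
```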