Evaluating the output of machine learning models is a non-trivial problem. There are many dimensions along which we can measure performance, and selecting the correct metric for the task at hand is important.

Evaluation of LLMs

Evaluation has become particularly important in the age of LLMs. Many people from a traditional deterministic Software Engineering background treat models like deterministic systems that can be “programmed” with a prompt. However, responses are non-deterministic and subject to significant variation between runs. Therefore evaluation of LLMs at scale is incredibly important to truly understand whether the models are solving real problems or not.

Evaluating Classical Problems

For traditional classes of problems like text classification and token classification, we can simply apply AI Metrics like F1 score. Measuring LLM Summarisation performance is particularly challenging since metrics like BLEU and ROUGE measure information content with respect to the source material but not qualitative aspects like readability.
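As a small illustration of that gap, here is a minimal sketch using the rouge-score package. The example texts are made up: the candidate contains the same words as the reference but is unreadable, yet unigram ROUGE still scores it highly.

```python
# Minimal sketch: ROUGE rewards n-gram overlap with a reference,
# but says nothing about readability or coherence.
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

reference = "The committee approved the budget after a long debate."
candidate = "budget approved committee the after debate long a"  # same words, unreadable

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f} "
          f"recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```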

Summarisation

See Measuring LLM Summarisation performance

Metrics for Classification

When we measure classification we are interested in how many items the model has placed in the correct class, but we are likely also interested in which items were placed in the incorrect class and, in a multi-class problem, why that might be.
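A minimal sketch with scikit-learn (the labels and predictions are made up): the classification report gives per-class precision, recall and F1, and the confusion matrix shows exactly which classes the misclassified items ended up in.

```python
# Minimal sketch: per-class metrics plus a confusion matrix to see
# *which* classes the misclassified items were assigned to.
from sklearn.metrics import classification_report, confusion_matrix

labels = ["spam", "ham", "promo"]
y_true = ["spam", "ham", "promo", "ham", "spam", "promo", "ham"]
y_pred = ["spam", "ham", "ham",   "ham", "promo", "promo", "ham"]

print(classification_report(y_true, y_pred, labels=labels))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred, labels=labels))
```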

Metrics for Ranking

Normalised Discounted Cumulative Gain

One of the best technical explanations of nDCG and how it works that I’ve ever come across is this explainer by Dale Lane.
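The core idea is that DCG sums the graded relevance of the returned items with a logarithmic position discount, and nDCG normalises that by the DCG of the ideal ordering. A minimal sketch (the relevance grades are made up), with scikit-learn's ndcg_score as a cross-check:

```python
# Minimal sketch: DCG discounts relevance by rank position;
# nDCG divides by the best achievable DCG for those items.
import numpy as np
from sklearn.metrics import ndcg_score

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(relevances) + 1)
    return np.sum(relevances / np.log2(positions + 1))

# Graded relevance of the results in the order the ranker returned them.
ranked_relevance = [3, 2, 0, 1]
ideal_relevance = sorted(ranked_relevance, reverse=True)

ndcg_manual = dcg(ranked_relevance) / dcg(ideal_relevance)
print(f"manual nDCG: {ndcg_manual:.3f}")

# Cross-check: sklearn takes true relevance and predicted scores per query.
true_rel = np.asarray([[3, 2, 0, 1]])
pred_scores = np.asarray([[4, 3, 2, 1]])  # scores reproducing the ranking above
print(f"sklearn nDCG: {ndcg_score(true_rel, pred_scores):.3f}")
```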

Mean Average Precision

Behavioural Testing

Behavioural testing methods like CheckList can provide a more insightful and revealing way to evaluate models by highlighting specific behaviours and shortcomings of trained models.
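A minimal sketch of the idea, assuming a hypothetical model_predict(text) function standing in for a sentiment classifier: a minimum functionality test checks obvious cases the model should always get right, and an invariance test checks that an irrelevant change (here, swapping a name) does not flip the prediction.

```python
# Minimal sketch of CheckList-style behavioural tests.
# `model_predict` is a hypothetical stand-in for the model under test.
def model_predict(text: str) -> str:
    # Placeholder: replace with a real sentiment classifier.
    return "positive" if "great" in text.lower() else "negative"

def minimum_functionality_test():
    # MFT: simple cases the model should always get right.
    cases = [("This film was great.", "positive"),
             ("This film was terrible.", "negative")]
    return [(text, expected) for text, expected in cases
            if model_predict(text) != expected]

def invariance_test():
    # INV: changing an irrelevant detail (a name) should not change the label.
    pairs = [("Alice thought the film was great.",
              "Priya thought the film was great.")]
    return [(a, b) for a, b in pairs
            if model_predict(a) != model_predict(b)]

print("MFT failures:", minimum_functionality_test())
print("INV failures:", invariance_test())
```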

Metrics for Data Purity

See Perplexity

Evaluating Compound Problems

See LLMs as judges

Evaluating Computational Performance

Use GuideLLM to load test an LLM endpoint.