Evaluating the output of machine learning models is a non-trivial problem. There are many dimensions along which we can measure performance, and selecting the correct metric for the task at hand is important.
Evaluation of LLMs
Evaluation has become particularly important in the age of LLMs. Many people from a traditional, deterministic Software Engineering background treat models like deterministic systems that can be “programmed” with a prompt. However, responses are non-deterministic and subject to significant variation between runs. Therefore evaluation of LLMs at scale is incredibly important for understanding whether the models are truly solving real problems or not.
Evaluating Classical Problems
For traditional classes of problems like text classification and token classification, we can simply apply AI Metrics like the F1 score. Measuring LLM Summarisation performance is particularly challenging, since metrics like BLEU and ROUGE measure information content with respect to the source material but not qualitative aspects like readability.
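As a minimal sketch of the classical case, assuming scikit-learn is available and using illustrative placeholder labels, the F1 score can be computed directly from gold and predicted labels:

```python
# Minimal sketch: F1 score for a text classification task with scikit-learn.
# The labels below are illustrative placeholders, not real data.
from sklearn.metrics import f1_score

y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

# Macro-averaged F1 treats every class equally, regardless of class frequency.
print(f1_score(y_true, y_pred, average="macro"))
```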
Summarisation
See Measuring LLM Summarisation performance
Metrics for Classification
When we measure classification, we are interested in how many items the model has correctly placed in their class, but we are likely also interested in which items were placed in the incorrect class and, in a multi-class problem, perhaps why that might be.
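One way to see exactly where misclassifications land (a sketch assuming scikit-learn, with placeholder labels) is a confusion matrix alongside a per-class report:

```python
# Sketch: inspecting where misclassifications occur in a multi-class problem.
# The labels below are illustrative placeholders.
from sklearn.metrics import confusion_matrix, classification_report

labels = ["cat", "dog", "bird"]
y_true = ["cat", "dog", "bird", "bird", "cat", "dog"]
y_pred = ["cat", "dog", "dog", "bird", "cat", "cat"]

# Rows are true classes, columns are predicted classes, so the off-diagonal
# cells show which class each mistake was assigned to.
print(confusion_matrix(y_true, y_pred, labels=labels))
print(classification_report(y_true, y_pred, labels=labels))
```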
Metrics for Ranking
Normalised Discounted Cumulative Gain
One of the best technical explanations of nDCG and how it works that I’ve ever come across is this explainer by Dale Lane.
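As a rough sketch of the underlying calculation (plain Python, no library assumed): DCG discounts the graded relevance of each result by its rank position, and nDCG normalises that against the ideal ordering of the same results.

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain for relevance scores listed in rank order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """nDCG: DCG of the actual ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Illustrative relevance judgements for a system's results, in the order returned.
print(ndcg([3, 2, 3, 0, 1, 2]))
```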
Mean Average Precision
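A sketch of the computation (plain Python, binary relevance assumed): average precision takes the precision at each rank where a relevant item appears, and MAP averages that value over queries.

```python
def average_precision(relevant, ranked):
    """Average precision for one query: mean precision at each rank where a hit occurs."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(rel, ranked) for rel, ranked in queries) / len(queries)

# Illustrative placeholder data: (relevant set, ranked results) per query.
queries = [
    ({"d1", "d3"}, ["d1", "d2", "d3", "d4"]),
    ({"d2"}, ["d1", "d2", "d3"]),
]
print(mean_average_precision(queries))
```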
Behavioural Testing
Behavioural testing methods like CheckList can provide a more insightful and revealing way to evaluate models by highlighting specific behaviours and shortcomings of trained models.
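A minimal hand-rolled sketch of the idea (the `predict_sentiment` function is a hypothetical stand-in for whatever model is under test, not part of the CheckList library itself):

```python
# Sketch of a CheckList-style minimum functionality test and invariance test.
# `predict_sentiment` is a hypothetical placeholder for the model being evaluated.

def predict_sentiment(text: str) -> str:
    raise NotImplementedError("swap in the real model under test")

def minimum_functionality_test():
    # Simple cases the model should always get right.
    cases = [
        ("I love this film.", "positive"),
        ("This was a terrible experience.", "negative"),
    ]
    return [(text, expected, predict_sentiment(text)) for text, expected in cases]

def invariance_test():
    # A perturbation (here, swapping a named entity) should not change the prediction.
    original = "The service at Acme was fantastic."
    perturbed = "The service at Globex was fantastic."
    return predict_sentiment(original) == predict_sentiment(perturbed)
```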
Metrics for Data Purity
See Perplexity
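A quick sketch of the calculation (plain Python, assuming per-token log-probabilities are already available from a language model): perplexity is the exponential of the average negative log-likelihood per token, and lower values mean the text looks more expected to the model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each token in a sequence."""
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)

# Illustrative placeholder log-probabilities for a short sequence.
print(perplexity([-0.3, -1.2, -0.5, -2.0]))
```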
Evaluating Compound Problems
See LLMs as judges
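A rough sketch of the pattern (the `call_llm` function and the prompt wording are hypothetical placeholders, not a specific library API): a second model is asked to grade the first model's answer against a rubric.

```python
# Sketch of an LLM-as-judge evaluation: a judge model scores a candidate answer.
# `call_llm` is a hypothetical placeholder; substitute your actual client call.

JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Answer: {answer}
Score the answer from 1 (poor) to 5 (excellent) and reply with only the number."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("swap in a real LLM client")

def judge(question: str, answer: str) -> int:
    response = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(response.strip())
```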
Evaluating Computational Performance
Use GuideLLM to load test an LLM endpoint.