Evaluation of LLM output is a non-trivial problem. Many people from a traditional deterministic Software Engineering background treat models like deterministic systems that can be "programmed" with a prompt. However, responses are non-deterministic and subject to significant variation between runs. Therefore evaluation of LLMs at scale is incredibly important for understanding whether the models are actually solving real problems.

For traditional classes of problems like text classification and token classification, we can simply apply metrics like F1 score. Measuring LLM summarisation performance is particularly challenging, since metrics like BLEU and ROUGE measure information overlap with respect to source material but not qualitative aspects like readability.
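To make the contrast concrete, here is a minimal sketch of both kinds of metric, implemented from scratch rather than via a library: a binary F1 score for classification, and a simplified ROUGE-1 F-measure based on unigram overlap. The function names and the `positive` label are illustrative choices, not part of any standard API.

```python
from collections import Counter

def f1_score(preds, labels, positive="pos"):
    # Binary F1 from scratch: preds and labels are parallel lists of class names.
    tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
    fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
    fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def rouge1_f(candidate: str, reference: str) -> float:
    # Simplified ROUGE-1 F-measure: unigram overlap between candidate summary
    # and reference. It rewards shared content words, but a high score says
    # nothing about readability or fluency, which is exactly the limitation
    # described above.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)
```

A summary that copies the right keywords in a garbled order can score as highly under `rouge1_f` as a fluent one, which is why overlap metrics alone are insufficient for judging summarisation quality.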

See Also