Summarisation is a particularly difficult task to evaluate because success is largely subjective. Similarity-based metrics “cannot represent informativeness and are unsuitable in such an LLM era, where responses from LLMs are flexible and may contain synonyms and different syntax” [fanEVAScoreEvaluationLongform2024].
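
To make that weakness concrete, the sketch below (my own illustration, assuming the `rouge-score` package rather than anything from the cited paper) scores a paraphrase against a reference: the meaning is preserved, but n-gram overlap still punishes the rewording.

```python
# Minimal illustration of why n-gram similarity metrics miss paraphrase.
# Assumes the `rouge-score` package (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "The company's profits rose sharply in the third quarter."
paraphrase = "Earnings at the firm climbed steeply during Q3."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, paraphrase)

for name, result in scores.items():
    print(f"{name}: F1 = {result.fmeasure:.2f}")
# Despite near-identical meaning, ROUGE F1 stays low because
# almost no surface tokens are shared.
```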

There are a number of different dimensions that we might want to measure:

Summary Factuality

We want to be sure that the summary is factually consistent with its source document: every claim the summary makes should be supported by the original text, with nothing hallucinated. A minimal NLI-based consistency check is sketched below.
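
The following is a rough, zero-shot illustration in the spirit of SummaC-ZS (not a reproduction of it): each summary sentence is scored by its maximum entailment probability over the source sentences, and the per-sentence scores are averaged. The model name and label order are assumptions based on the public roberta-large-mnli checkpoint.

```python
# Sketch of an NLI-based factual-consistency score, in the spirit of
# SummaC-ZS. Assumes `transformers`, `torch`, and the public
# roberta-large-mnli checkpoint (labels: contradiction/neutral/entailment).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def entailment(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis`."""
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    return probs[2].item()  # index 2 = ENTAILMENT for this checkpoint

def consistency_score(source_sents: list[str], summary_sents: list[str]) -> float:
    """Mean over summary sentences of the best-supporting source sentence."""
    per_sentence = [
        max(entailment(src, summ) for src in source_sents)
        for summ in summary_sents
    ]
    return sum(per_sentence) / len(per_sentence)
```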

Tasks:

  • AGGREFACT
  • FActScore
  • SummaC - both a benchmark (a suite of factual-consistency datasets recast as binary classification) and an NLI-based scoring method; the sketch above follows its zero-shot variant

Methods:

  • FENICE - claim extraction combined with NLI-based verification (see the sketch after this list)
  • EVA-Score - a preprint method; its described use of GPT function calling appears to rest on a misunderstanding of how the feature works.
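
FENICE and FActScore share a decompose-then-verify pattern: split the summary into atomic claims, then check each claim against the source. The sketch below is a loose illustration of that pattern, not either paper's implementation; the prompt, the model choice, and the OpenAI client usage are my assumptions, and `entailment` is the NLI helper from the earlier sketch.

```python
# Loose sketch of the decompose-then-verify pattern used (in different
# forms) by FENICE and FActScore. Prompt and model are illustrative
# assumptions, not the papers' implementations.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_claims(summary: str) -> list[str]:
    """Ask an LLM to split a summary into short, self-contained claims."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": ("Split the following summary into atomic factual "
                        "claims, one per line, no numbering:\n\n" + summary),
        }],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

def claim_level_factuality(source: str, summary: str, threshold: float = 0.5) -> float:
    """Fraction of extracted claims entailed by the source document."""
    claims = extract_claims(summary)
    # `entailment` is the NLI helper defined in the earlier sketch.
    supported = sum(entailment(source, claim) > threshold for claim in claims)
    return supported / len(claims) if claims else 0.0
```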

References