Summarisation is a particularly difficult task to evaluate because success is largely subjective. Similarity-based metrics “cannot represent informativeness and are unsuitable in such an LLM era, where responses from LLMs are flexible and may contain synonyms and different syntax” [fanEVAScoreEvaluationLongform2024].
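
To make that weakness concrete, the sketch below (my own illustration, assuming the `rouge-score` package rather than anything from the cited paper) scores a paraphrase against a reference: the meaning is preserved, but n-gram overlap still punishes the rewording.

```python
# Minimal illustration of why n-gram similarity metrics miss paraphrase.
# Assumes the `rouge-score` package (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "The company's profits rose sharply in the third quarter."
paraphrase = "Earnings at the firm climbed steeply during Q3."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, paraphrase)

for name, result in scores.items():
    print(f"{name}: F1 = {result.fmeasure:.2f}")
# Despite near-identical meaning, ROUGE F1 stays low because
# almost no surface tokens are shared.
```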

There are a number of different dimensions that we might want to measure:

Summary Factuality

We want to be sure that the summary is factually consistent with its source document: every claim the summary makes should be supported by the original text, with nothing hallucinated. A minimal NLI-based consistency check is sketched below.
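
The following is a rough, zero-shot illustration in the spirit of SummaC-ZS (not a reproduction of it): each summary sentence is scored by its maximum entailment probability over the source sentences, and the per-sentence scores are averaged. The model name and label order are assumptions based on the public roberta-large-mnli checkpoint.

```python
# Sketch of an NLI-based factual-consistency score, in the spirit of
# SummaC-ZS. Assumes `transformers`, `torch`, and the public
# roberta-large-mnli checkpoint (labels: contradiction/neutral/entailment).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def entailment(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis`."""
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    return probs[2].item()  # index 2 = ENTAILMENT for this checkpoint

def consistency_score(source_sents: list[str], summary_sents: list[str]) -> float:
    """Mean over summary sentences of the best-supporting source sentence."""
    per_sentence = [
        max(entailment(src, summ) for src in source_sents)
        for summ in summary_sents
    ]
    return sum(per_sentence) / len(per_sentence)
```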

Tasks:

  • AGGREFACT
  • FActScore
  • SummaC - both a benchmark (a suite of factual-consistency datasets recast as binary classification) and an NLI-based scoring method; the sketch above follows its zero-shot variant

Methods:

  • FENICE - claim extraction combined with NLI-based verification (see the sketch after this list)
  • EVA-Score - a preprint method; its described use of GPT function calling appears to rest on a misunderstanding of how the feature works.
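
FENICE and FActScore share a decompose-then-verify pattern: split the summary into atomic claims, then check each claim against the source. The sketch below is a loose illustration of that pattern, not either paper's implementation; the prompt, the model choice, and the OpenAI client usage are my assumptions, and `entailment` is the NLI helper from the earlier sketch.

```python
# Loose sketch of the decompose-then-verify pattern used (in different
# forms) by FENICE and FActScore. Prompt and model are illustrative
# assumptions, not the papers' implementations.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_claims(summary: str) -> list[str]:
    """Ask an LLM to split a summary into short, self-contained claims."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": ("Split the following summary into atomic factual "
                        "claims, one per line, no numbering:\n\n" + summary),
        }],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

def claim_level_factuality(source: str, summary: str, threshold: float = 0.5) -> float:
    """Fraction of extracted claims entailed by the source document."""
    claims = extract_claims(summary)
    # `entailment` is the NLI helper defined in the earlier sketch.
    supported = sum(entailment(source, claim) > threshold for claim in claims)
    return supported / len(claims) if claims else 0.0
```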

References