Evaluate is a Hugging Face library for evaluating machine learning models. It integrates with Transformers and Hugging Face Optimum.
Label Encoding
The evaluate library expects labels to be integers, so all text labels have to be encoded as ints (see the related issue in the SetFit library here and the proposed solution).
 
import datasets
import evaluate

# Map each text label to an integer id; sorting keeps the ordering stable
labels = ['Positive', 'Negative', 'Neutral']
lbl2idx = {lbl: i for i, lbl in enumerate(sorted(labels))}
idx2label = {v: k for k, v in lbl2idx.items()}
 
 
ds = datasets.Dataset.from_csv("./sentiment_testset.csv.zip")
ds = ds.add_column('label_enc',  [ lbl2idx[x] for x in ds['label'] ])
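
As a quick sanity check on the encoding step, here is a minimal, library-free sketch. The `raw` list is a hypothetical stand-in for the dataset's 'label' column, and the early check for unknown labels is an added assumption, not part of the original workflow.

```python
labels = ["Positive", "Negative", "Neutral"]
lbl2idx = {lbl: i for i, lbl in enumerate(sorted(labels))}
idx2label = {i: lbl for lbl, i in lbl2idx.items()}

# Hypothetical rows standing in for the 'label' column of a dataset
raw = ["Neutral", "Positive", "Negative", "Positive"]

# Fail early if the data contains a label that the mapping does not cover
unknown = sorted(set(raw) - set(lbl2idx))
if unknown:
    raise ValueError(f"labels missing from lbl2idx: {unknown}")

encoded = [lbl2idx[x] for x in raw]        # text -> int
decoded = [idx2label[i] for i in encoded]  # int -> text (round trip)
```

Because the labels are sorted before enumeration, the ids come out as Negative=0, Neutral=1, Positive=2 regardless of the order in which `labels` was written.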
 
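# `pipe` is the text-classification pipeline being evaluated;
# `task_evaluator` is created with evaluate.evaluator(...) as shown further down
task_evaluator.compute(pipe, data=ds, metric="f1", label_mapping=lbl2idx, label_column='label_enc')

Using multi-class F1 with the Evaluator class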
For multi-class problems, F1 requires an averaging strategy (e.g. micro or macro) to combine the per-class scores into a single number.
With an Evaluator object, we can pass the average in via the evaluator's METRIC_KWARGS attribute:
task_evaluator = evaluate.evaluator("text-classification")
task_evaluator.METRIC_KWARGS['average'] = 'macro'
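
To make the difference between the averaging strategies concrete, here is a small pure-Python sketch that computes per-class, macro, and micro F1 by hand. It is an illustration of the definitions only, not the implementation used by evaluate; the function name `f1_scores` and the sample data are hypothetical.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (per_class_f1, macro_f1, micro_f1) for integer-encoded labels."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # the true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    # Macro: unweighted mean of per-class F1, so rare classes count equally
    macro = sum(per_class.values()) / len(per_class)
    # Micro: pool the counts across classes first, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, micro


# Hypothetical data with the encoding Negative=0, Neutral=1, Positive=2
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 2]
per_class, macro, micro = f1_scores(y_true, y_pred)
```

In this sample the single Neutral example is misclassified, which pulls the macro score well below the micro score. Macro averaging is therefore the more sensitive choice when minority classes matter.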