Argilla is a FOSS data annotation tool.

Argilla in Docker

  • By default it seems to want to use Elasticsearch
  • It does not have any credentials for Elasticsearch
  • Adding an env var to disable security on the Elasticsearch container seems to work:
environment:
    - xpack.security.enabled=false
  • The default Argilla login credentials are argilla and 1234

Argilla User Management

Mainly done via the CLI as documented here.

Log in to Argilla

The user API key is available on the user settings page.

argilla login --api-url http://localhost:6900/ --api-key argilla.apikey

Create a User

argilla users create \
  --username example \
  --role admin \
  --password <secret> \
  --workspace <workspace>

There is currently no way for users to set their own passwords or API keys, so choose a secure password or generate one.
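One way to generate a credential is Python's standard secrets module. This is a minimal sketch: the helper names and the default lengths are illustrative assumptions, not Argilla requirements.

```python
import secrets
import string


def generate_password(length: int = 20) -> str:
    """Return a random password made of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def generate_token(num_bytes: int = 32) -> str:
    """Return a URL-safe random token, e.g. for use as an API key."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print("password:", generate_password())
    print("token:   ", generate_token())
```

The generated password can then be passed to --password when creating the user.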

Setting --workspace during creation saves running a second command later. This argument can be given multiple times to add the user to multiple workspaces.
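If users are created from a script, the repeated --workspace flag can be assembled programmatically. The sketch below only builds and prints the command shown above; the helper name, the password and the workspace names are hypothetical.

```python
import shlex


def build_create_user_command(username, role, password, workspaces):
    """Build the `argilla users create` argument list, one --workspace per entry."""
    args = [
        "argilla", "users", "create",
        "--username", username,
        "--role", role,
        "--password", password,
    ]
    for workspace in workspaces:
        args += ["--workspace", workspace]
    return args


# Hypothetical values for illustration only.
cmd = build_create_user_command("example", "annotator", "change-me", ["team-a", "team-b"])
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run(cmd, check=True) on a machine where the CLI is already logged in.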

Changing a user’s role

This can only be done on the server itself, via the server's database users command.

$ docker-compose exec argilla
...
 
$ python -m argilla_server database users update example <role>

Adding Users to a Workspace

Can be done from the client CLI:

$ argilla workspaces --name <workspacename> add-user <username>

Datasets

Create Dataset

Define your dataset in Python, run push_to_argilla and use the resulting RemoteDataset to manipulate data entries:

import argilla as rg

# 'sample' is assumed to be a pandas DataFrame whose 'Text' column holds the label names
ds = rg.FeedbackDataset.for_text_classification(
    labels=sample['Text'].unique().tolist(),
    vectors_settings=[
        rg.VectorSettings(name="nomic_v15_vector", dimensions=768)
    ],
)

# we can also override the 'workspace' kwarg here
rds = ds.push_to_argilla(name='example')

Add data to Dataset

We can send embeddings alongside our text, which enables Argilla to “find similar” examples and so on.

from sentence_transformers import SentenceTransformer

embedding_model_id = 'nomic-ai/nomic-embed-text-v1.5'

# 'text' is assumed to be a list of strings to annotate
embedding_model = SentenceTransformer(embedding_model_id, trust_remote_code=True)
embeddings = embedding_model.encode(text, show_progress_bar=True)

records = []
for txt, emb in zip(text[:20], embeddings[:20]):
    records.append(
        rg.FeedbackRecord(
            fields={"text": txt},
            # the key must match the name given in VectorSettings
            vectors={'nomic_v15_vector': emb.tolist()},
        )
    )

# push the records to the remote dataset created above
rds.add_records(records)
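To make the idea behind “find similar” concrete, here is a toy nearest-neighbour lookup using cosine similarity. It is only a sketch of the concept: in practice the search is done by Argilla's backend, and the three-dimensional vectors and example texts below are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, vectors, top_k=1):
    """Return the top_k texts whose vectors are closest to the query text's vector."""
    query_vec = vectors[query]
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for text, vec in vectors.items()
        if text != query
    ]
    return [text for _score, text in sorted(scored, reverse=True)[:top_k]]


# Made-up vectors standing in for real embeddings.
vectors = {
    "Dogs bark.": [0.9, 0.1, 0.0],
    "Puppies yap.": [0.8, 0.2, 0.1],
    "Stocks fell.": [0.0, 0.1, 0.9],
}

print(most_similar("Dogs bark.", vectors))
```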

Moving Responses Around

Moving responses from one copy of a dataset to another is possible but tricky. The documented dataset.update_records() method does not seem to handle upserts on responses.

I was able to achieve the use case I wanted by taking a backup and then deleting and re-creating the records in the combined dataset.

Let’s assume we want to copy all of the test user’s annotations from their personal dataset source_ds into a group dataset called target_ds. This code assumes that both datasets contain a common set of records where the text and metadata are the same.

user = rg.User.from_name("test")

source_ds = rg.FeedbackDataset.from_argilla(name="my_dataset", workspace="test")

target_ds = rg.FeedbackDataset.from_argilla(name="group_dataset", workspace="argilla")

# we use a dict as a hash lookup for matching text content to records in the target
# a better way would be to set a metadata id on all records on first insert and use those instead
text2rec = {record.fields['text']: record for record in target_ds.records}

to_remove = []
to_recreate = []

for record in source_ds.records:
    # responses that the user gave on this record in their personal dataset
    user_responses = [resp for resp in record.responses if resp.user_id == user.id]
    text = record.fields['text']
    if not user_responses or text not in text2rec:
        continue

    r = text2rec[text]
    # argilla doesn't currently appear to remap responses to be 'local' on conversion so we have to do this ourselves
    local_responses = [resp.to_local() for resp in list(r.responses) + user_responses]
    local_r = r.to_local()
    local_r.responses = local_responses

    to_remove.append(r)
    to_recreate.append(local_r)

# delete the old target records and re-create them with the merged responses
target_ds.delete_records(to_remove)
target_ds.add_records(to_recreate)
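The code comment above suggests matching on a metadata id rather than on raw text. A minimal sketch of that idea follows, assuming a hash of the normalised text is an acceptable id. The record_key helper and the record_id metadata key are hypothetical names, and plain dicts stand in for Argilla records.

```python
import hashlib
import unicodedata


def record_key(text: str) -> str:
    """Derive a stable id from record text (normalised, then SHA-256 hashed)."""
    normalised = unicodedata.normalize("NFC", text).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:16]


# Plain dicts stand in for Argilla records here.
source = [{"text": "The cat sat."}, {"text": "Dogs bark."}]
target = [{"text": "Dogs bark. "}, {"text": "Birds sing."}]

# Attach the id once, at insert time, so later matching does not depend on exact text.
for rec in source + target:
    rec["metadata"] = {"record_id": record_key(rec["text"])}

target_by_id = {rec["metadata"]["record_id"]: rec for rec in target}
matches = [
    (rec["text"], target_by_id[rec["metadata"]["record_id"]]["text"])
    for rec in source
    if rec["metadata"]["record_id"] in target_by_id
]
print(matches)
```

Because the text is stripped before hashing, records that differ only in surrounding whitespace still match.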

Evaluation

Merging Answers for Evaluation

Related to Moving Responses Around above, we may want to take sets of responses from individual datasets in separate workspaces and merge them together in order to make use of Argilla’s built-in Krippendorff’s alpha functionality.

users = [u for u in rg.User.list() if u.role == "annotator"]
 
# build a collection of records to be annotated
all_snippets = {}
 
for user in users:
 
    ds = rg.FeedbackDataset.from_argilla(name="my_dataset", workspace=user.username)
    records = sum([1 for _ in ds.records])
    resps = 0
    for record in ds.records:
 
        if record.fields['text'] not in all_snippets:
            all_snippets[record.fields['text']] = record.to_local()
 
        local : rg.FeedbackRecord = all_snippets[record.fields['text']]
        
 
        for resp in record.responses:
            if resp.user_id == user.id:
                local.responses.append(resp)
                resps += 1
 
    print("--------")
    print(user.username)
    print(f"Records: {records} Annotations: {resps}")
    

Then we create a local dataset with the same properties as the remote datasets that we’ve been iterating through:

local_set = rg.FeedbackDataset(fields=ds.fields, questions=ds.questions, metadata_properties=ds.metadata_properties)

We add the records and responses that we collected above and compute the agreement metric for the question we care about:

local_set.add_records(all_snippets.values())
local_set.compute_agreement_metrics(metric_names=['alpha'], question_name="label")

Per-Label Krippendorff’s Alpha for Multi-Label Data

Argilla offers the ability to calculate an overall Krippendorff’s Alpha score for the full dataset but does not currently support per-label calculations. These are useful if you need to understand which labels annotators find hardest to get their heads around.

When calculating K-A for multi-label data, we treat each label as a binary yes/no problem. We build a matrix m of NUM_USERS x NUM_DATAPOINTS and iterate over the responses. We assign position m[user][dpoint] a 1 if the user tagged the given data point with the label and 0 if they didn’t.

We can tell the difference between a user applying no labels to a data point and a user skipping it, because in the former case Argilla will store a response with an empty array for its value. Positions with no response at all are left as NaN.

Refer to the above code snippets for getting a local dataset with all responses in one place. Then we can use something like the following:

 
import krippendorff
import numpy as np
import pandas as pd

QUESTION_NAME = 'label'

# map each annotator's id to a row of the matrix
user_offset_map = {user.id: offset for offset, user in enumerate(users)}

k_alpha_metrics = []

for label in local_set.question_by_name(QUESTION_NAME).labels:

    # NaN marks "no response from this user for this record"
    label_responses = np.full(shape=(len(users), len(local_set.records)), fill_value=np.nan)

    for i, record in enumerate(local_set.records):
        for response in record.responses:
            row = user_offset_map[response.user_id]
            if QUESTION_NAME not in response.values:
                # the user responded but gave no value for this question
                label_responses[row, i] = 0
            else:
                label_responses[row, i] = 1 if label in response.values[QUESTION_NAME].value else 0

    k_alpha_metrics.append({"Label": label, "alpha": krippendorff.alpha(label_responses, level_of_measurement='nominal')})

pd.DataFrame(data=k_alpha_metrics).to_csv("metrics.csv", index=False)
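To show what the per-label number measures, here is a from-scratch sketch of nominal Krippendorff's alpha using the coincidence-matrix formulation, applied to a toy set of multi-label responses. It is not the krippendorff package's implementation, and the annotator names, labels and responses are made up. None plays the role that NaN plays above.

```python
import math
from collections import Counter
from itertools import permutations


def nominal_alpha(matrix):
    """Krippendorff's alpha (nominal) for a raters x units matrix; None = missing."""
    coincidences = Counter()
    for u in range(len(matrix[0])):
        values = [row[u] for row in matrix if row[u] is not None]
        if len(values) < 2:
            continue  # a unit rated by fewer than two people cannot show (dis)agreement
        weight = 1.0 / (len(values) - 1)
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += weight

    totals = Counter()
    for (a, _b), count in coincidences.items():
        totals[a] += count
    n = sum(totals.values())
    if n <= 1:
        return math.nan

    observed = sum(c for (a, b), c in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        return math.nan  # only one value was ever used, alpha is undefined
    return 1.0 - observed / expected


# Toy data: per annotator, the set of labels chosen for each of four records.
# An empty set means "no label applies"; None means the record was skipped.
labels = ["bug", "ui"]
responses = {
    "alice": [{"bug"}, {"bug", "ui"}, set(), set()],
    "bob": [{"bug"}, {"ui"}, set(), None],
}

alphas = {}
for label in labels:
    matrix = [
        [None if given is None else int(label in given) for given in per_record]
        for per_record in responses.values()
    ]
    alphas[label] = nominal_alpha(matrix)

print(alphas)
```

In this toy data the annotators always agree on "ui" and disagree once on "bug", so "bug" gets the lower score.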
 

Storing Data Locally

The easiest way to export a dataset is to convert it to a Hugging Face datasets object and from there to a pandas DataFrame, which we can then format as required:

ds = local_set.format_as('datasets')
ds.to_pandas().to_json(filename, lines=True, orient='records')
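The export above writes JSON Lines (one JSON object per line). As a rough sketch, such a file can be read back with only the standard library. The rows and the file name below are hypothetical stand-ins for a real export.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical rows standing in for an exported dataset.
rows = [
    {"text": "Dogs bark.", "label": ["animals"]},
    {"text": "Stocks fell.", "label": []},
]

# Write the rows in the same one-object-per-line layout, then read them back.
path = os.path.join(tempfile.mkdtemp(), "export.jsonl")
with open(path, "w", encoding="utf-8") as handle:
    for row in rows:
        handle.write(json.dumps(row, ensure_ascii=False) + "\n")

loaded = list(read_jsonl(path))
print(len(loaded), "records read back")
```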

Prepare Dataset for Training

Multi-Label Dataset with records that have no label

If you have a multi-label problem where none of the labels apply to some records and you try to use prepare_for_training, you will get a KeyError. The code tries to look up the question’s key in the values property of a response that has no value for that question.

For example:

argds.prepare_for_training('setfit', TrainingTask.for_text_classification(text=argds.field_by_name('text'), label=argds.question_by_name('label')))
│   547 │   │   │   responses = [resp for resp in rec.responses if resp.status == "submitted"]     │
│   548 │   │   │   # get responses with a value that is most frequent                             │
│   549 │   │   │   for resp in responses:                                                         │
│ ❱ 550 │   │   │   │   if isinstance(resp.values[question].value, list):                          │
│   551 │   │   │   │   │   for value in resp.values[question].value:                              │
│   552 │   │   │   │   │   │   counter.update([value])                                            │
│   553 │   │   │   │   else:                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'label'

The workaround I used was to add a NotRelevant label to the multi-label question so that every row always has a response. The NotRelevant rows can then be removed in post-processing if we want. However, it may be helpful to have a NotRelevant output on a multi-class classifier anyway.

I haven’t yet figured out if there is an easy way to remove records with no label from the dataset before training but this would probably be a good option too.

Records that don’t have a corresponding label are easy to find with a few lines of code:

argds = rds.pull()
for r in argds.records:
    for resp in r.responses:
        try:
            lbl = resp.values['label'].value
        except KeyError:
            print("No label")
            print(r)
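The same check can be turned into a filter. The sketch below shows only the filtering logic, with plain dicts standing in for Argilla records and responses. The has_label helper is hypothetical, and applying it to a real dataset would need attribute access (resp.values, resp.status) and has not been tested against Argilla.

```python
def has_label(record, question="label"):
    """True if the record has submitted responses and all of them answer `question`."""
    submitted = [r for r in record["responses"] if r["status"] == "submitted"]
    # Records without any submitted response are dropped too: they have no label either.
    return bool(submitted) and all(question in r["values"] for r in submitted)


# Plain dicts standing in for records; "values" mirrors the lookup in the traceback.
records = [
    {"text": "a", "responses": [{"status": "submitted", "values": {"label": ["x"]}}]},
    {"text": "b", "responses": [{"status": "submitted", "values": {}}]},
    {"text": "c", "responses": []},
]

trainable = [r for r in records if has_label(r)]
print([r["text"] for r in trainable])
```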

Multi-label prep where some labels have no records

The prepare_for_training method currently breaks label indexing by silently removing labels that have zero examples from the binarized_label object that is prepared. The issue is documented here.

The current workaround is to manually manipulate the MultiLabelQuestion.labels property and remove unused labels:

import itertools
from collections import Counter

from argilla.feedback import TrainingTask
 
 
question = rds.pull().question_by_name('label')
 
ds = rds.format_as('datasets')
ds_df = ds.to_pandas()
label_count = Counter(list(itertools.chain(*ds_df['label'].apply(lambda x: x[0]['value']))))
 
new_q = question.copy()
new_q.labels = sorted(label_count.keys())
 
len(new_q.labels )
# gives 16
 
len(rds.question_by_name('label').labels)
# gives 18
 
setfit_ds = argds.prepare_for_training('setfit', TrainingTask.for_text_classification(text=rds.field_by_name('text'), label=new_q), train_size=0.7, seed=42)
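A toy illustration of why silently dropping unused labels breaks indexing: once a label disappears from the vocabulary, every label after it moves to a different position in the binarized vector. This is a sketch with made-up labels, not Argilla's internal code.

```python
from collections import Counter

# Made-up label set; "account" is defined on the question but never used.
all_labels = ["account", "billing", "bug", "feature"]
annotations = [["bug"], ["bug", "feature"], ["billing"], []]

label_count = Counter(label for labels in annotations for label in labels)
used_labels = sorted(label_count)
unused_labels = [label for label in all_labels if label not in label_count]


def binarize(labels, vocabulary):
    """One-hot style vector: 1 where the vocabulary entry is among the labels."""
    return [int(entry in labels) for entry in vocabulary]


full = binarize(["feature"], all_labels)
trimmed = binarize(["feature"], used_labels)

print("unused:", unused_labels)
print("index of 'feature' with all labels: ", all_labels.index("feature"))
print("index of 'feature' with used labels:", used_labels.index("feature"))
```

Keeping the question's label list and the binarized vectors built from the same vocabulary, as the workaround above does, avoids the mismatch.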

Rules and ElasticSearch

It is possible to define Elasticsearch queries that automatically apply labels to documents via rules in the platform.

Active Learning

Argilla provides active learning via the Small Text library.

Few-Shot Learning

Assuming you have a small number of annotated examples to start with, you can use SetFit and Python to train a model.

Once we have a trained model we can use it to provide a best-guess label for each row of a dataset:

 
from setfit import SetFitModel

model = SetFitModel.from_pretrained('path/to/model_dir')
 
# assuming sample is our dataframe that we want to label
sample['suggested_label'] = model.predict(sample['text'].tolist(), show_progress_bar=True)