Fig. 3: Performance comparison in terms of class-average F1, Accuracy (Acc), Area Under the Receiver Operating Characteristic Curve (AUC), Recall, and Precision (Prec).

Performance is compared on the test subset, using the model that achieved the best F1 score on the validation subset. For each metric, the values are normalized by their deviation from the best performance on that metric.
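The selection and normalization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to produce the figure: it assumes "deviation from the best performance" means each score minus the best score for that metric, so the best method sits at 0 and the others are negative. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
# Hypothetical test-set scores; illustrative values only, not results from the paper.
scores = {
    "model_a": {"F1": 0.80, "Acc": 0.90, "AUC": 0.95},
    "model_b": {"F1": 0.75, "Acc": 0.92, "AUC": 0.93},
}


def select_by_validation_f1(val_f1):
    """Return the checkpoint name with the highest validation F1 score."""
    return max(val_f1, key=val_f1.get)


def deviation_from_best(scores):
    """For each model and metric, return the score minus the best score on that metric.

    Assumed reading of the normalization: the best model per metric maps to 0.0,
    and every other model maps to a negative value.
    """
    metrics = list(next(iter(scores.values())).keys())
    best = {m: max(s[m] for s in scores.values()) for m in metrics}
    return {
        name: {m: s[m] - best[m] for m in metrics}
        for name, s in scores.items()
    }


# Hypothetical validation F1 per checkpoint, used only for model selection.
chosen = select_by_validation_f1({"epoch_1": 0.60, "epoch_2": 0.70})
normalized = deviation_from_best(scores)
```

Under this reading, each metric has at least one method at exactly 0, which makes gaps comparable across metrics that have different absolute ranges.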