Fig. 3: Heatmap of cross-dataset performance evaluation (level 2).

The diagonal shows micro-averaged F1 scores when training and testing on the same dataset with level 2 annotations. Off-diagonal cells show F1 scores when training on one dataset and testing on another.
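
As a rough illustration of how such a matrix could be filled in, the sketch below computes a micro-averaged F1 score for every (training dataset, test dataset) pair. It is a minimal sketch under stated assumptions, not the evaluation code behind the figure: the names `micro_f1`, `cross_dataset_matrix`, and `train_fn`, the `"train"`/`"test"` split keys, and the representation of annotations as label sets are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

# Hypothetical example format: (input, gold label set).
Example = Tuple[str, Set[str]]
# Hypothetical model interface: maps an input to a predicted label set.
Model = Callable[[str], Set[str]]


def micro_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Micro-averaged F1: pool TP, FP, and FN over all examples and labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and present in the gold set
        fp += len(p - g)  # labels predicted but absent from the gold set
        fn += len(g - p)  # gold labels that were not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cross_dataset_matrix(
    datasets: Dict[str, Dict[str, List[Example]]],
    train_fn: Callable[[List[Example]], Model],
) -> Dict[Tuple[str, str], float]:
    """Return {(train_name, test_name): micro-F1} for every dataset pair.

    Entries with train_name == test_name correspond to the diagonal.
    """
    matrix: Dict[Tuple[str, str], float] = {}
    for train_name, train_splits in datasets.items():
        # Train once per dataset, then evaluate on every test split.
        model = train_fn(train_splits["train"])
        for test_name, test_splits in datasets.items():
            test = test_splits["test"]
            gold = [labels for _, labels in test]
            pred = [model(text) for text, _ in test]
            matrix[(train_name, test_name)] = micro_f1(gold, pred)
    return matrix
```

Each model is trained once and then scored on every test split, so the diagonal and off-diagonal entries come from the same trained model per training dataset.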