Fig. 1: Model comparison for three-level distinctions.

Each black dot shows the accuracy of one model on one train-test pair. Lines connect the results of different models on the same train-test pair. Red points (and their labels) show each model's mean accuracy over all train-test pairs. Random baseline performance is given in the title.