Table 4 Evaluation results of pre-trained Vision Transformers (ViTs) and their ensembles using weighted and soft voting.

From: A triple pronged approach for ulcerative colitis severity classification using multimodal, meta, and transformer based learning

| Models/methods | Accuracy | Precision | Recall | F1 | MCC | Inference time |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-base [39] | 0.80 | 0.81 | 0.80 | 0.77 | 0.46 | 0.8 ms |
| ViT-large [39] | 0.87 | 0.87 | 0.87 | 0.87 | 0.62 | 15.5 ms |
| Swin-tiny [41] | 0.85 | 0.85 | 0.85 | 0.84 | 0.65 | 4.7 ms |
| Swin-base [41] | 0.90 | 0.90 | 0.90 | 0.89 | 0.77 | 3.1 ms |
| DeiT-small [40] | 0.83 | 0.82 | 0.83 | 0.82 | 0.60 | 10.7 ms |
| DeiT-base [40] | 0.87 | 0.87 | 0.87 | 0.87 | 0.64 | 4.7 ms |
| Vision transformer ensemble | | | | | | |
| Weighted voting ensemble | 0.91 | 0.91 | 0.91 | 0.90 | 0.76 | 31.56 ms |
| Soft voting ensemble | 0.93 | 0.94 | 0.93 | 0.93 | 0.77 | 7.0 ms |
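
For readers unfamiliar with the two ensemble rows, the sketch below illustrates one common formulation of soft voting (an unweighted average of each model's class probabilities) and weighted voting (the same average with per-model weights). It is a minimal illustration under stated assumptions, not the article's implementation: the table does not specify how the weights were chosen or whether the weighted scheme operates on probabilities or on hard labels, and the function names, the three-class setup, and all probability and weight values are hypothetical.

```python
from typing import List, Sequence


def weighted_vote(probs: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[float]:
    """Weighted average of per-model class-probability vectors.

    probs[m][c] is the probability model m assigns to class c.
    The weights are normalised by their sum, so they need not add up to 1.
    """
    if len(probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs)) / total
        for c in range(n_classes)
    ]


def soft_vote(probs: Sequence[Sequence[float]]) -> List[float]:
    """Soft voting: every model contributes with equal weight."""
    return weighted_vote(probs, [1.0] * len(probs))


def predict(avg_probs: Sequence[float]) -> int:
    """Index of the class with the highest combined probability."""
    return max(range(len(avg_probs)), key=lambda c: avg_probs[c])


# Hypothetical softmax outputs of three models for one image, three classes.
model_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
]

print(predict(soft_vote(model_probs)))                       # equal weights
print(predict(weighted_vote(model_probs, [0.6, 0.2, 0.2])))  # favour model 0
```

With these made-up numbers the two schemes disagree: equal weighting selects class 1, while weighting the first model more heavily selects class 0. This is why the choice of weights can move an ensemble's metrics in either direction relative to plain soft voting.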