Table 4 Evaluation results of pre-trained Vision Transformers (ViTs) and their ensembles using weighted and soft voting.
Models/methods | Accuracy | Precision | Recall | F1 | MCC | Inference time |
---|---|---|---|---|---|---|
ViT-base [39] | 0.80 | 0.81 | 0.80 | 0.77 | 0.46 | 0.8 ms |
ViT-large [39] | 0.87 | 0.87 | 0.87 | 0.87 | 0.62 | 15.5 ms |
Swin-tiny [41] | 0.85 | 0.85 | 0.85 | 0.84 | 0.65 | 4.7 ms |
Swin-base [41] | 0.90 | 0.90 | 0.90 | 0.89 | 0.77 | 3.1 ms |
DeiT-small [40] | 0.83 | 0.82 | 0.83 | 0.82 | 0.60 | 10.7 ms |
DeiT-base [40] | 0.87 | 0.87 | 0.87 | 0.87 | 0.64 | 4.7 ms |
Vision transformer ensemble | | | | | | |
Weighted voting ensemble | 0.91 | 0.91 | 0.91 | 0.90 | 0.76 | 31.56 ms |
Soft voting ensemble | 0.93 | 0.94 | 0.93 | 0.93 | 0.77 | 7.0 ms |
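
For readers unfamiliar with the two ensemble strategies compared in the table, the following is a minimal sketch of how soft voting and weighted voting are commonly computed from per-model class probabilities. It is an illustration under stated assumptions, not the authors' implementation: the function name `vote`, the toy probabilities, and the weights `[0.6, 0.2, 0.2]` are all hypothetical, and the weighting scheme actually used for the reported results is not given in this table.

```python
from typing import List, Optional, Sequence


def vote(probs: Sequence[Sequence[Sequence[float]]],
         weights: Optional[Sequence[float]] = None) -> List[int]:
    """Combine per-model class probabilities and return predicted labels.

    probs[m][i][c] is model m's probability for class c on sample i.
    weights=None gives soft voting (plain average of probabilities);
    otherwise each model's probabilities are scaled by its weight.
    """
    n_models = len(probs)
    if weights is None:
        weights = [1.0] * n_models  # soft voting: every model counts equally
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")

    total = float(sum(weights))
    n_samples = len(probs[0])
    n_classes = len(probs[0][0])

    labels = []
    for i in range(n_samples):
        # Weighted mean probability of each class across models.
        avg = [
            sum(w * probs[m][i][c] for m, w in enumerate(weights)) / total
            for c in range(n_classes)
        ]
        # Predicted label is the class with the highest mean probability.
        labels.append(max(range(n_classes), key=avg.__getitem__))
    return labels


# Hypothetical toy data: 3 models, 2 samples, 2 classes.
toy = [
    [[0.9, 0.1], [0.4, 0.6]],  # model 0
    [[0.4, 0.6], [0.3, 0.7]],  # model 1
    [[0.4, 0.6], [0.9, 0.1]],  # model 2
]

soft = vote(toy)                               # equal weights
weighted = vote(toy, weights=[0.6, 0.2, 0.2])  # assumed, illustrative weights
```

On this toy input the two strategies disagree on the second sample, which shows how the choice of weights can change the ensemble's prediction even when the underlying models are identical.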