Table 3 Performance of the employed multimodal approaches: pre-trained multimodal models, multimodal ensembles using soft voting, and multimodal-based feature extraction for ML ensemble classifiers.
Method | Models | Accuracy | F1 | Precision | Recall | Inference time |
---|---|---|---|---|---|---|
Pre-trained | CLIP B/16 [18] | 0.65 | 0.66 | 0.67 | 0.65 | 17.8 ms |
Pre-trained | CLIP B/32 [18] | 0.66 | 0.67 | 0.69 | 0.66 | 21.5 ms |
Pre-trained | CLIP L/14 [18] | 0.70 | 0.70 | 0.71 | 0.70 | 27.8 ms |
Pre-trained | BLIP [22] | 0.65 | 0.66 | 0.66 | 0.65 | 54.3 ms |
Pre-trained | FLAVA [28] | 0.65 | 0.67 | 0.72 | 0.65 | 32.7 ms |
Multimodal ensemble | ||||||
Soft voting ensemble | CLIP B/32, BLIP, FLAVA | 0.64 | 0.66 | 0.69 | 0.64 | 27.33 ms |
Soft voting ensemble | CLIP B/32, CLIP L/14, CLIP B/16 | 0.73 | 0.74 | 0.76 | 0.73 | 22.36 ms |
Soft voting ensemble | CLIP B/32, CLIP L/14, CLIP B/16, BLIP, FLAVA | 0.70 | 0.71 | 0.74 | 0.70 | 30.82 ms |
Machine learning classifiers ensemble | ||||||
Soft voting ensemble | Encoder: CLIP B/32; base classifiers: KNN, SVM, RF; additional classifiers: LR, GB, GNB | 0.83 | 0.82 | 0.82 | 0.83 | 157.13 ms |
Soft voting ensemble | Encoder: EVA-CLIP B/16; base classifiers: KNN, SVM, RF; additional classifiers: LR, GB, GNB | 0.82 | 0.80 | 0.82 | 0.82 | 140.35 ms |
Stacking ensemble | Encoder: CLIP B/32; base classifiers: KNN, SVM, RF; meta classifier: LR | 0.82 | 0.81 | 0.81 | 0.82 | 240.1 ms |
Stacking ensemble | Encoder: EVA-CLIP B/16; base classifiers: KNN, SVM, RF; meta classifier: LR | 0.82 | 0.79 | 0.80 | 0.81 | 94.37 ms |
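
The soft voting rows in Table 3 combine the class probabilities produced by several models. As a minimal sketch of that idea, the Python snippet below averages per-class probabilities across models and picks the class with the highest mean. It assumes unweighted averaging, which the table does not confirm, and the model names and probability values are hypothetical placeholders rather than outputs of CLIP, BLIP, or FLAVA.

```python
from typing import List, Sequence


def soft_vote(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities across models (unweighted soft voting)."""
    if not model_probs:
        raise ValueError("need the probabilities of at least one model")
    n_classes = len(model_probs[0])
    if any(len(p) != n_classes for p in model_probs):
        raise ValueError("all models must score the same set of classes")
    n_models = len(model_probs)
    return [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]


def predict(model_probs: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged probability."""
    averaged = soft_vote(model_probs)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Hypothetical probabilities for one sample over three classes (assumed values).
probs_by_model = {
    "model_a": [0.50, 0.30, 0.20],  # alone, this model would pick class 0
    "model_b": [0.20, 0.60, 0.20],
    "model_c": [0.30, 0.50, 0.20],
}

averaged = soft_vote(list(probs_by_model.values()))
label = predict(list(probs_by_model.values()))
```

In this toy case the ensemble selects class 1 even though one member prefers class 0. That is the mechanism by which a soft voting ensemble can outperform its individual members, but it only helps when the members' errors are not strongly correlated.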