Table 3 Performance of the employed multimodal approaches: pre-trained multimodal models, multimodal ensembles using soft voting, and multimodal-based feature extraction for ML ensemble classifiers.

Method	Models	Accuracy	F1	Precision	Recall	Inference time
Pre-trained	CLIP B/16¹⁸	0.65	0.66	0.67	0.65	17.8 ms
Pre-trained	CLIP B/32¹⁸	0.66	0.67	0.69	0.66	21.5 ms
Pre-trained	CLIP L/14¹⁸	0.70	0.70	0.71	0.70	27.8 ms
Pre-trained	BLIP²²	0.65	0.66	0.66	0.65	54.3 ms
Pre-trained	FLAVA²⁸	0.65	0.67	0.72	0.65	32.7 ms
Multimodal ensemble
Soft voting ensemble	CLIP B/32, BLIP, FLAVA	0.64	0.66	0.69	0.64	27.33 ms
Soft voting ensemble	CLIP B/32, CLIP L/14, CLIP B/16	0.73	0.74	0.76	0.73	22.36 ms
Soft voting ensemble	CLIP B/32, CLIP L/14, CLIP B/16, BLIP, FLAVA	0.70	0.71	0.74	0.70	30.82 ms
Machine learning classifiers ensemble
Soft voting ensemble | encoder: CLIP B/32	Base classifiers: KNN, SVM, RF; additional classifiers: LR, GB, GNB	0.83	0.82	0.82	0.83	157.13 ms
Soft voting ensemble | encoder: EVA-CLIP B/16	Base classifiers: KNN, SVM, RF; additional classifiers: LR, GB, GNB	0.82	0.80	0.82	0.82	140.35 ms
Stacking ensemble | encoder: CLIP B/32	Base classifiers: KNN, SVM, RF; meta classifier: LR	0.82	0.81	0.81	0.82	240.1 ms
Stacking ensemble | encoder: EVA-CLIP B/16	Base classifiers: KNN, SVM, RF; meta classifier: LR	0.82	0.79	0.80	0.81	94.37 ms
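
For readers unfamiliar with the soft voting scheme named in the table, the following is a minimal sketch of the general technique, not the authors' implementation. It assumes each model returns a per-class probability vector for every sample; the function name `soft_vote` and the probability values are hypothetical and chosen only for illustration.

```python
import numpy as np


def soft_vote(prob_list, weights=None):
    """Average per-model class probabilities and return the argmax class.

    prob_list: one array-like per model, each of shape (n_samples, n_classes).
    weights:   optional per-model weights; equal weighting if None.
    """
    # Stack to shape (n_models, n_samples, n_classes).
    probs = np.stack([np.asarray(p, dtype=float) for p in prob_list])
    # Weighted mean over the model axis.
    avg = np.average(probs, axis=0, weights=weights)
    return avg.argmax(axis=1), avg


# Hypothetical probabilities from three models, two samples, three classes.
p_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
p_b = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
p_c = [[0.2, 0.7, 0.1], [0.3, 0.3, 0.4]]

labels, avg = soft_vote([p_a, p_b, p_c])
print(labels)  # predicted class index per sample
```

Unlike majority (hard) voting, this averaging lets one confident model outweigh two uncertain ones, which is one plausible reason ensemble results can differ from those of the individual members.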
