Table 16 Performance comparison with recent SOTA models (NASA + Fire Videos datasets).

From: Real time fire and smoke detection using vision transformers and spatiotemporal learning

| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | AUC-ROC (%) | FPS (GPU) |
|---|---|---|---|---|---|---|
| Proposed hybrid model | 98.8 | 98.6 | 98.3 | 98.4 | 98.9 | 32 |
| YOLOv13 [28] | 97.8 | 97.5 | 97.9 | 97.7 | 98.2 | 28 |
| YOLO-NAS [40] | 98.1 | 98.0 | 97.8 | 97.9 | 98.5 | 29 |
| MobileViT [30] | 96.5 | 96.0 | 96.3 | 96.1 | 97.2 | 35 |
| EfficientViT [25] | 96.8 | 96.4 | 96.7 | 96.5 | 97.4 | 33 |
| FireViTNet [14] | 97.2 | 97.0 | 97.1 | 97.0 | 97.8 | 27 |
| Smoke detection transformer [7] | 97.5 | 97.2 | 97.3 | 97.2 | 98.0 | 26 |
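As an illustrative cross-check (not part of the paper's code), the F1-score column follows from the precision and recall columns via the standard harmonic-mean formula, F1 = 2PR / (P + R):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    return 2 * precision * recall / (precision + recall)

# Proposed hybrid model row: precision 98.6%, recall 98.3%
print(f"{f1_score(98.6, 98.3):.1f}")  # 98.4, matching the table

# YOLOv13 row: precision 97.5%, recall 97.9%
print(f"{f1_score(97.5, 97.9):.1f}")  # 97.7, matching the table
```

The same check reproduces the remaining F1 entries to within one-decimal rounding.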

  1. The proposed hybrid model corresponds to the full configuration integrating Vision Transformers (ViTs), 3D-CNNs, Transformer attention, and multi-task learning. The reported 98.8% accuracy is the average performance on the combined test set (NASA + Fire Videos datasets): individually, the model achieved 99.2% on the NASA dataset and 98.3% on the Fire Videos dataset, for an average of 98.75%, which rounds to the 98.8% shown in the table.
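The averaging in the note can be verified directly. The sketch below assumes an unweighted mean over the two datasets (the paper's combined test set may instead weight by dataset size) and uses `decimal` to avoid binary floating-point rounding surprises:

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-dataset accuracies from the note (percent).
nasa = Decimal("99.2")         # NASA dataset
fire_videos = Decimal("98.3")  # Fire Videos dataset

# Unweighted mean across the two datasets.
average = (nasa + fire_videos) / 2
print(average)  # 98.75

# Rounded to one decimal place, this gives the 98.8% table entry.
print(average.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))  # 98.8
```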