Table 8 Comparison with transformer-based detectors.

From: Advanced gesture recognition in Indian sign language using a synergistic combination of YOLOv10 with Swin Transformer model

Model                 | mAP (%) | F1-score (%) | FPS  | Architecture type
DETR (ResNet-50)      | 94.8    | 93.2         | 7.1  | Transformer + CNN
DINO (Swin-L)         | 96.1    | 94.8         | 5.6  | Transformer (encoder-decoder)
ViTDet (ViT-L)        | 96.3    | 95.1         | 6.2  | Vision Transformer
YOLOv10-ST (Proposed) | 97.62   | 96.58        | 48.7 | Swin
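The FPS column is easier to compare once it is converted to per-frame latency and speedup ratios, alongside the accuracy gaps in percentage points. Below is a minimal Python sketch using only the values in Table 8. It assumes that all FPS figures were measured on the same hardware with the same input settings, which the table does not state. The names RESULTS, latency_ms and compare are illustrative helpers, not part of the paper.

```python
# Values transcribed from Table 8. Assumption: all FPS numbers are directly
# comparable (same hardware, resolution, batch size), which the table does not state.
RESULTS = {
    "DETR (ResNet-50)": {"map": 94.8, "f1": 93.2, "fps": 7.1},
    "DINO (Swin-L)": {"map": 96.1, "f1": 94.8, "fps": 5.6},
    "ViTDet (ViT-L)": {"map": 96.3, "f1": 95.1, "fps": 6.2},
    "YOLOv10-ST (Proposed)": {"map": 97.62, "f1": 96.58, "fps": 48.7},
}


def latency_ms(fps: float) -> float:
    """Average time per frame, in milliseconds, implied by a throughput in FPS."""
    return 1000.0 / fps


def compare(reference: str, results: dict = RESULTS) -> dict:
    """Accuracy gains (percentage points) and FPS speedup of `reference` over each other model."""
    ref = results[reference]
    out = {}
    for name, r in results.items():
        if name == reference:
            continue  # skip comparing the reference model with itself
        out[name] = {
            "map_gain_pp": round(ref["map"] - r["map"], 2),
            "f1_gain_pp": round(ref["f1"] - r["f1"], 2),
            "speedup": round(ref["fps"] / r["fps"], 2),
        }
    return out


if __name__ == "__main__":
    for name, r in RESULTS.items():
        print(f"{name:24s} {latency_ms(r['fps']):7.1f} ms/frame")
    for name, d in compare("YOLOv10-ST (Proposed)").items():
        print(name, d)
```

Under that assumption, 48.7 FPS corresponds to about 20.5 ms per frame, versus roughly 140 to 180 ms for the three transformer-based detectors. That works out to a speedup of about 6.9x over DETR, 8.7x over DINO and 7.9x over ViTDet, while mAP is 1.32 to 2.82 points higher.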