Table 6. Quantitative results for different backbone combinations without metric learning, using either an LSTM or a ViT encoder block as the encoder architecture, on the MSVD benchmark dataset.
| VGG | ResNet152 | SE_ResNet152 | ResNeXt-101 | I3D | B4 (LSTM) | B4 (ViT) | M (LSTM) | M (ViT) | R (LSTM) | R (ViT) | C (LSTM) | C (ViT) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| \(\checkmark\) | | | | \(\checkmark\) | 52.9 | 53.5 | 35.0 | 35.0 | 71.9 | 70.8 | 92.3 | 93.9 |
| \(\checkmark\) | | | \(\checkmark\) | \(\checkmark\) | 52.6 | 54.0 | 34.3 | 34.9 | 72.6 | 73.6 | 94.8 | 95.0 |
| | \(\checkmark\) | | | \(\checkmark\) | 54.7 | 53.5 | 34.8 | 35.0 | 72.9 | 72.3 | 93.2 | 94.2 |
| | \(\checkmark\) | | \(\checkmark\) | \(\checkmark\) | 54.0 | 55.1 | 35.1 | 34.2 | 72.0 | 72.0 | 94.5 | 95.9 |
| | | \(\checkmark\) | | \(\checkmark\) | 52.1 | 53.2 | 35.0 | 33.6 | 70.9 | 73.5 | 96.0 | 96.6 |
| | | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | 54.8 | 54.6 | 33.9 | 35.8 | 73.0 | 72.9 | 96.3 | 97.0 |