Table 4 Quantitative results of different backbone combinations on the MSR-VTT benchmark dataset, without metric learning, using either an LSTM or a ViT Encoder Block as the encoder architecture.
Backbones: VGG, ResNet152, SE_ResNet152, ResNeXt-101, I3D. Scores: B4, M, R, C, each reported for the LSTM and ViT encoders.

| VGG | ResNet152 | SE_ResNet152 | ResNeXt-101 | I3D | B4 (LSTM) | B4 (ViT) | M (LSTM) | M (ViT) | R (LSTM) | R (ViT) | C (LSTM) | C (ViT) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| \(\checkmark\) | | | | \(\checkmark\) | 37.50 | 36.92 | 27.90 | 26.92 | 58.61 | 58.30 | 41.12 | 42.28 |
| \(\checkmark\) | | | \(\checkmark\) | \(\checkmark\) | 37.90 | 41.30 | 26.93 | 27.15 | 56.92 | 58.89 | 42.09 | 42.74 |
| | \(\checkmark\) | | | \(\checkmark\) | 38.40 | 34.89 | 27.30 | 27.80 | 58.46 | 59.82 | 44.73 | 50.49 |
| | \(\checkmark\) | | \(\checkmark\) | \(\checkmark\) | 39.90 | 42.00 | 28.61 | 26.60 | 59.70 | 61.49 | 50.22 | 50.20 |
| | | \(\checkmark\) | | \(\checkmark\) | 39.50 | 38.65 | 27.33 | 27.10 | 61.20 | 61.00 | 50.20 | 51.00 |
| | | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | 41.20 | 41.70 | 27.40 | 27.31 | 61.20 | 61.30 | 51.08 | 52.20 |