Table 9 Quantitative results with a fixed backbone network and metric learning, comparing different encoders on the MSR-VTT and MSVD benchmark datasets.

From: Semantic guidance network for video captioning

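| Backbones | Dataset | Encoder | Score: B4 | Score: M | Score: R | Score: C |
| --- | --- | --- | --- | --- | --- | --- |
| SE_ResNet152+ResNeXt-101+I3D | MSR-VTT | LSTM | 41.50 | 28.40 | 61.80 | 52.80 |
| SE_ResNet152+ResNeXt-101+I3D | MSR-VTT | ViT Encoder Block | 42.20 | 28.90 | 62.16 | 54.30 |
| SE_ResNet152+ResNeXt-101+I3D | MSVD | LSTM | 54.80 | 35.80 | 73.30 | 97.50 |
| SE_ResNet152+ResNeXt-101+I3D | MSVD | ViT Encoder Block | 55.30 | 36.10 | 74.20 | 98.40 |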