Table 10 Quantitative results of the metric learning module with a fixed backbone network and encoder architecture on the MSR-VTT and MSVD benchmark datasets.

From: Semantic guidance network for video captioning

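| Backbones | Encoder | Dataset | Metric learning | B4 | M | R | C |
|---|---|---|---|---|---|---|---|
| SE_ResNet152+ResNeXt101+I3D | ViT | MSR-VTT | No | 41.70 | 27.31 | 61.30 | 52.20 |
| SE_ResNet152+ResNeXt101+I3D | ViT | MSR-VTT | Yes | 42.20 | 28.90 | 62.16 | 54.30 |
| SE_ResNet152+ResNeXt101+I3D | ViT | MSVD | No | 54.60 | 35.80 | 72.90 | 97.00 |
| SE_ResNet152+ResNeXt101+I3D | ViT | MSVD | Yes | 55.30 | 36.10 | 74.20 | 98.40 |

Score columns: B4 = BLEU-4, M = METEOR, R = ROUGE-L, C = CIDEr.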
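
The effect of the metric learning module is easiest to read as a per-metric difference between the "Yes" and "No" rows for each dataset. The following is a minimal sketch of that arithmetic, not code from the paper: the scores are copied from Table 10, while the names RESULTS and metric_learning_gains are hypothetical and introduced only for illustration.

```python
# Scores copied from Table 10.
# Key: (dataset, metric learning used?) -> scores per metric.
# RESULTS and metric_learning_gains are illustrative names, not from the paper.
RESULTS = {
    ("MSR-VTT", False): {"B4": 41.70, "M": 27.31, "R": 61.30, "C": 52.20},
    ("MSR-VTT", True): {"B4": 42.20, "M": 28.90, "R": 62.16, "C": 54.30},
    ("MSVD", False): {"B4": 54.60, "M": 35.80, "R": 72.90, "C": 97.00},
    ("MSVD", True): {"B4": 55.30, "M": 36.10, "R": 74.20, "C": 98.40},
}


def metric_learning_gains(results):
    """Return the absolute score change (with minus without metric learning)."""
    gains = {}
    datasets = sorted({dataset for dataset, _ in results})
    for dataset in datasets:
        without = results[(dataset, False)]
        with_ml = results[(dataset, True)]
        # Round to two decimals to match the precision reported in the table.
        gains[dataset] = {
            metric: round(with_ml[metric] - without[metric], 2)
            for metric in without
        }
    return gains


if __name__ == "__main__":
    for dataset, deltas in metric_learning_gains(RESULTS).items():
        print(dataset, deltas)
```

Every difference computed this way is positive, so with the backbone and encoder held fixed, enabling metric learning raises all four scores on both datasets.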