Table 6 Quantitative results for different backbone combinations, without metric learning, using either an LSTM or a ViT encoder block as the encoder architecture on the MSVD benchmark dataset.

From: Semantic guidance network for video captioning

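The first five columns mark the backbones used in each combination (\(\checkmark \)); the remaining columns give the scores (B4, M, R, C) for the LSTM and ViT encoders.

| VGG | ResNet152 | SE_ResNet152 | ResNeXt-101 | I3D | B4 (LSTM) | B4 (ViT) | M (LSTM) | M (ViT) | R (LSTM) | R (ViT) | C (LSTM) | C (ViT) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| \(\checkmark \) | | | | \(\checkmark \) | 52.9 | 53.5 | 35.0 | 35.0 | 71.9 | 70.8 | 92.3 | 93.9 |
| \(\checkmark \) | | | \(\checkmark \) | \(\checkmark \) | 52.6 | 54.0 | 34.3 | 34.9 | 72.6 | 73.6 | 94.8 | 95.0 |
| | \(\checkmark \) | | | \(\checkmark \) | 54.7 | 53.5 | 34.8 | 35.0 | 72.9 | 72.3 | 93.2 | 94.2 |
| | \(\checkmark \) | | \(\checkmark \) | \(\checkmark \) | 54.0 | 55.1 | 35.1 | 34.2 | 72.0 | 72.0 | 94.5 | 95.9 |
| | | \(\checkmark \) | | \(\checkmark \) | 52.1 | 53.2 | 35.0 | 33.6 | 70.9 | 73.5 | 96.0 | 96.6 |
| | | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | 54.8 | 54.6 | 33.9 | 35.8 | 73.0 | 72.9 | 96.3 | 97.0 |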