Scientific Reports

Table 6 Quantitative results for different backbone combinations, without metric learning, using either an LSTM or a ViT encoder block as the encoder architecture on the MSVD benchmark dataset.

From: Semantic guidance network for video captioning

VGG	ResNet152	SE_ResNet152	ResNeXt-101	I3D	B4 (LSTM)	B4 (ViT)	M (LSTM)	M (ViT)	R (LSTM)	R (ViT)	C (LSTM)	C (ViT)
✓	-	-	-	✓	52.9	53.5	35.0	35.0	71.9	70.8	92.3	93.9
✓	-	-	✓	✓	52.6	54.0	34.3	34.9	72.6	73.6	94.8	95.0
-	✓	-	-	✓	54.7	53.5	34.8	35.0	72.9	72.3	93.2	94.2
-	✓	-	✓	✓	54.0	55.1	35.1	34.2	72.0	72.0	94.5	95.9
-	-	✓	-	✓	52.1	53.2	35.0	33.6	70.9	73.5	96.0	96.6
-	-	✓	✓	✓	54.8	54.6	33.9	35.8	73.0	72.9	96.3	97.0
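
To make the LSTM versus ViT comparison in Table 6 easier to inspect, the rows can be transcribed into a small script. The following is a minimal sketch, not part of the original article: the names `ROWS`, `best_per_column`, and `vit_wins` are hypothetical, and the metric labels B4, M, R, and C are kept exactly as abbreviated in the table.

```python
# Rows transcribed from Table 6: (backbones used, scores), with scores ordered as
# B4-LSTM, B4-ViT, M-LSTM, M-ViT, R-LSTM, R-ViT, C-LSTM, C-ViT.
ROWS = [
    (("VGG", "I3D"), (52.9, 53.5, 35.0, 35.0, 71.9, 70.8, 92.3, 93.9)),
    (("VGG", "ResNeXt-101", "I3D"), (52.6, 54.0, 34.3, 34.9, 72.6, 73.6, 94.8, 95.0)),
    (("ResNet152", "I3D"), (54.7, 53.5, 34.8, 35.0, 72.9, 72.3, 93.2, 94.2)),
    (("ResNet152", "ResNeXt-101", "I3D"), (54.0, 55.1, 35.1, 34.2, 72.0, 72.0, 94.5, 95.9)),
    (("SE_ResNet152", "I3D"), (52.1, 53.2, 35.0, 33.6, 70.9, 73.5, 96.0, 96.6)),
    (("SE_ResNet152", "ResNeXt-101", "I3D"), (54.8, 54.6, 33.9, 35.8, 73.0, 72.9, 96.3, 97.0)),
]
METRICS = ("B4", "M", "R", "C")   # abbreviations as printed in the table header
ENCODERS = ("LSTM", "ViT")


def column(metric, encoder):
    """Index of the score column for a given metric and encoder."""
    return 2 * METRICS.index(metric) + ENCODERS.index(encoder)


def best_per_column(rows=ROWS):
    """Map (metric, encoder) to (top score, backbones that achieved it)."""
    best = {}
    for metric in METRICS:
        for encoder in ENCODERS:
            col = column(metric, encoder)
            # max() returns the first row with the highest score in this column.
            backbones, scores = max(rows, key=lambda row: row[1][col])
            best[(metric, encoder)] = (scores[col], backbones)
    return best


def vit_wins(metric, rows=ROWS):
    """Count backbone combinations where ViT scores strictly higher than LSTM."""
    lstm, vit = column(metric, "LSTM"), column(metric, "ViT")
    return sum(1 for _, scores in rows if scores[vit] > scores[lstm])


if __name__ == "__main__":
    for key, (score, backbones) in best_per_column().items():
        print(key, score, "+".join(backbones))
    for metric in METRICS:
        print(metric, "ViT > LSTM in", vit_wins(metric), "of", len(ROWS), "rows")
```

Strict comparison is used in `vit_wins`, so tied rows (for example 35.0 versus 35.0 on M in the first row) count for neither encoder.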
