Table 9 Performance comparison with CIDEr optimization.

From: MSSA: memory-driven and simplified scaled attention for enhanced image captioning

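| Model | B@1 | B@2 | B@3 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|
| LSTM [1] | - | - | - | 31.9 | 25.5 | 54.3 | 106.3 | - |
| SCST [34] | - | - | - | 34.2 | 26.7 | 55.7 | 114.0 | - |
| Up-Down [35] | 79.8 | - | - | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |
| AoANet [37] | 80.2 | - | - | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 |
| SGAE [36] | 80.8 | - | - | 38.4 | 28.4 | 58.6 | 127.8 | 22.1 |
| X-Transformer [18] | 80.9 | - | - | 39.7 | 29.5 | 59.1 | 132.8 | 23.4 |
| \(M^2\) Transformer [38] | 80.8 | - | - | 39.1 | 29.1 | 58.4 | 131.2 | 22.6 |
| DLCT [45] | 81.4 | - | - | 39.8 | 29.5 | 59.1 | 133.8 | 23.0 |
| TAAIC [46] | 71.0 | - | - | 27.7 | 23.8 | 51.1 | 93.2 | 18.3 |
| MAN [27] | 80.6 | - | - | 39.3 | 28.4 | 58.5 | 126.5 | - |
| SCD-Net [25] | 81.3 | - | - | 39.4 | 29.2 | 59.1 | 131.6 | 23.0 |
| ADF [43] | 81.0 | 64.3 | 49.3 | 37.4 | 28.3 | 58.1 | 123.3 | 21.6 |
| TCCTN [42] | 81.3 | - | - | 39.4 | 29.2 | 58.9 | 132.8 | - |
| ICEAP [44] | 81.1 | 64.5 | - | 37.4 | 28.5 | 58.2 | 123.8 | 21.7 |
| X-LAN [18] | 80.8 | 65.6 | 51.4 | 39.5 | 29.5 | 59.2 | 132.0 | 23.4 |
| MSSA | 81.1 | 66.1 | 51.9 | 40.0 | 29.5 | 59.3 | 131.4 | 23.2 |

B@N: BLEU-N; M: METEOR; R: ROUGE-L; C: CIDEr; S: SPICE.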

  1. Significant values are in [bold].