Table 3 Comparison of the proposed concept-based models (CM-LSTM, CM-META) with state-of-the-art topic-based approaches to image captioning on the MSCOCO dataset.

From: Novel concept-based image captioning models using LSTM and multi-encoder transformer architecture

| Captioning Model | B@1 \(\uparrow \) | B@2 \(\uparrow \) | B@3 \(\uparrow \) | B@4 \(\uparrow \) | METEOR \(\uparrow \) | CIDEr \(\uparrow \) | ROUGE \(\uparrow \) | SPICE \(\uparrow \) | Training Time \(\downarrow \) |
|---|---|---|---|---|---|---|---|---|---|
| NumCap [48] | 66.9 | 49.4 | 36.5 | 27.3 | 24.1 | 85.3 | 50.7 | 17.0 | – |
| Topic-based captioning [7] | 67.6 | 49.4 | 34.8 | 24.3 | 22.7 | 80.8 | 49.3 | – | 6 h (GTX 1080) |
| Topic-sensitive [30] | 72.1 | 53.4 | 40.6 | 24.1 | 20.1 | 67.3 | – | – | – |
| Show and tell more [22] | 72.3 | 54.1 | 39.2 | 28.9 | 23.0 | 90.3 | 52.1 | – | – |
| What Topics Do Images Say [23] | 73.3 | 56.0 | 41.1 | 30.1 | 25.2 | 98.6 | 53.4 | – | – |
| Topic-oriented (NeuralTalk2-T-oe) [28] | 73.9 | 57.2 | 43.2 | 32.6 | 26.1 | 103.8 | 54.4 | – | – |
| Topic-guided Attention (VA) [8] | 75.2 | 56.16 | 41.4 | 30.4 | 27.0 | 109.2 | 58.1 | – | 8 h (single NVIDIA TITAN X GPU) |
| Proposed CM-LSTM | 73.7 | 56.8 | 41.9 | 30.5 | 25.8 | 99.1 | 53.9 | 19.3 | 4 h (GTX 1080) |
| Proposed CM-META | 75.8 | 59.6 | 45.3 | 34.1 | 27.4 | 110.1 | 56.0 | 20.6 | 5 h (GTX 1080) |

Dashes indicate values not reported by the corresponding work.
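The B@1 to B@4 columns are corpus-level BLEU scores over 1- to 4-grams, reported (as is standard for MSCOCO) multiplied by 100. As a minimal sketch of how such scores are conventionally computed, the snippet below uses NLTK's `corpus_bleu` on an invented caption pair. This is an illustration only, not the authors' evaluation pipeline; MSCOCO results such as those in Table 3 are typically produced with the pycocoevalcap toolkit, which also covers METEOR, CIDEr, ROUGE, and SPICE.

```python
# Illustrative BLEU computation (NOT the authors' evaluation code).
# The captions below are hypothetical examples.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One hypothesis with its set of reference captions, all pre-tokenized.
references = [
    [
        "a man is riding a wave on a surfboard".split(),
        "a surfer rides a large wave in the ocean".split(),
    ]
]
hypotheses = ["a man riding a wave on a surfboard".split()]

smooth = SmoothingFunction().method1  # avoids zero scores on short toy inputs
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"B@{n}: {100 * score:.1f}")  # scaled by 100, as in Table 3
```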