Table 3 Comparison of the proposed concept-based models (CM-LSTM, CM-META) with state-of-the-art topic-based approaches for image captioning on the MSCOCO dataset.
| Captioning Model | B@1 \(\uparrow\) | B@2 \(\uparrow\) | B@3 \(\uparrow\) | B@4 \(\uparrow\) | METEOR \(\uparrow\) | CIDEr \(\uparrow\) | ROUGE \(\uparrow\) | SPICE \(\uparrow\) | Training Time \(\downarrow\) |
|---|---|---|---|---|---|---|---|---|---|
| NumCap [48] | 66.9 | 49.4 | 36.5 | 27.3 | 24.1 | 85.3 | 50.7 | 17.0 | – |
| Topic-based captioning [7] | 67.6 | 49.4 | 34.8 | 24.3 | 22.7 | 80.8 | 49.3 | – | 6 h (GTX 1080) |
| Topic-sensitive [30] | 72.1 | 53.4 | 40.6 | 24.1 | 20.1 | 67.3 | – | – | – |
| Show and tell more [22] | 72.3 | 54.1 | 39.2 | 28.9 | 23.0 | 90.3 | 52.1 | – | – |
| What Topics Do Images Say [23] | 73.3 | 56.0 | 41.1 | 30.1 | 25.2 | 98.6 | 53.4 | – | – |
| Topic-oriented (NeuralTalk2-T-oe) [28] | 73.9 | 57.2 | 43.2 | 32.6 | 26.1 | 103.8 | 54.4 | – | – |
| Topic-guided Attention (VA) [8] | 75.2 | 56.16 | 41.4 | 30.4 | 27.0 | 109.2 | 58.1 | – | 8 h (single NVIDIA TITAN X GPU) |
| Proposed CM-LSTM | 73.7 | 56.8 | 41.9 | 30.5 | 25.8 | 99.1 | 53.9 | 19.3 | 4 h (GTX 1080) |
| Proposed CM-META | 75.8 | 59.6 | 45.3 | 34.1 | 27.4 | 110.1 | 56.0 | 20.6 | 5 h (GTX 1080) |
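
The metrics in Table 3 (BLEU-1 to BLEU-4, METEOR, CIDEr, ROUGE, SPICE) are the standard COCO captioning scores. The sketch below is a minimal illustration of how such scores could be computed with the `pycocoevalcap` toolkit; the package, the example image ID, and the sample captions are assumptions for illustration only, not the authors' evaluation pipeline.

```python
# Minimal sketch: scoring generated captions with pycocoevalcap
# (assumed dependency; METEOR and SPICE additionally require Java).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

# Hypothetical example data: image_id -> list of tokenized caption strings.
# `references` holds ground-truth MSCOCO captions; `hypotheses` holds the
# single caption produced by a model such as CM-LSTM or CM-META.
references = {
    "391895": ["a man riding a motorcycle on a dirt road",
               "a person on a motorbike travels down a country road"],
}
hypotheses = {
    "391895": ["a man rides a motorcycle down a dirt road"],
}

scorers = [
    (Bleu(4), ["B@1", "B@2", "B@3", "B@4"]),
    (Meteor(), "METEOR"),
    (Rouge(), "ROUGE"),
    (Cider(), "CIDEr"),
    (Spice(), "SPICE"),
]

for scorer, name in scorers:
    # compute_score returns the corpus-level score(s) and per-image scores.
    score, _ = scorer.compute_score(references, hypotheses)
    if isinstance(name, list):  # Bleu returns one score per n-gram order
        for n, s in zip(name, score):
            print(f"{n}: {s * 100:.1f}")
    else:
        print(f"{name}: {score * 100:.1f}")
```

Scores are multiplied by 100 to match the scale used in the table. In a full evaluation, captions would first be normalized with the toolkit's PTBTokenizer; that step is omitted here for brevity.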