Table 11. Human evaluation scores: relevance and fluency of captions generated by Baseline MLP, W2VV, and W2VV with Attention and Contrastive Loss.
| Model | Relevance | Fluency |
|---|---|---|
| Baseline MLP | 4.1 | 3.9 |
| W2VV | 4.5 | 4.3 |
| W2VV + Attention + Contrastive Loss | 4.6 | 4.4 |