Fig. 1
From: Collaborative positional attention for image to English question answering

VQA examples illustrating the two primary task types: (a) multiple-choice, where the model selects an answer from a predefined list, and (b) open-ended, where the model generates a natural-language answer based on the visual and textual inputs.