Fig. 5: Response accuracy of VLMs on image-based test questions.

Models are grouped and arranged by four scenarios: no image (baseline), VLM caption + question, image + question, and human description + question. Stacked bars show the response categories: correct, 2-option selection (2OP), external option selection (EOP), no option selection (NOP), errors, and incorrect responses. Chi-square tests compared each scenario against the no-image baseline; the resulting p-values are reported next to the corresponding bars. For models tested through both the API and the web interface, the API results are shown because performance was similar across the two interfaces.
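As an illustration of the scenario-versus-baseline comparison, the following is a minimal sketch of a Pearson chi-square test on a 2x2 table. It assumes the response categories are collapsed into correct versus not correct, which the caption does not state; the function name `chi_square_2x2` and all counts are hypothetical and are not taken from the figure.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]].

    Rows are scenarios (baseline, comparison); columns are (correct, not correct).
    Returns the test statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts (assumption, for illustration only): (correct, not correct).
baseline = (40, 60)     # no-image scenario
with_image = (60, 40)   # image + question scenario

stat, p = chi_square_2x2(*baseline, *with_image)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")
```

If the authors compared the full distribution over all six response categories instead, the table would be 2x6 with 5 degrees of freedom, and a library routine such as SciPy's `chi2_contingency` would be the more natural choice.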