Fig. 5: Response accuracy of VLMs on image-based test questions. | npj Digital Medicine

From: Benchmarking proprietary and open-source language and vision-language models for gastroenterology clinical reasoning

Models are grouped by four scenarios: no image (baseline), VLM caption + question, image + question, and human description + question. Stacked bars show the response categories: correct, 2-option selection (2OP), external option selection (EOP), no option selection (NOP), errors, and incorrect responses. Chi-square tests compared each scenario against the no-image baseline, and the resulting p-values are reported next to the corresponding bars. For models tested through both the API and the web interface, the API results are shown because performance was similar between the two interfaces.
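As a rough illustration of the kind of comparison described in the caption, the sketch below runs a Pearson chi-square test on a 2x2 table of correct versus not-correct responses for one scenario against the no-image baseline. The caption does not say how the contingency tables were built or whether a continuity correction was applied, so the 2x2 layout, the lack of correction, the function name `chi_square_2x2`, and the counts in the usage lines are all assumptions made for illustration, not the authors' procedure or data.

```python
import math


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    Rows are the two scenarios being compared; columns are
    correct vs. not-correct responses. Returns (statistic, p_value).
    Assumption: responses are collapsed to correct / not correct.
    """
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_totals = [total_a, total_b]
    col_totals = [
        observed[0][0] + observed[1][0],
        observed[0][1] + observed[1][1],
    ]
    grand_total = total_a + total_b

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected == 0:
                raise ValueError("expected count is zero; test is undefined")
            statistic += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts, not taken from the paper:
# baseline (no image) vs. image + question, 100 questions each.
stat, p = chi_square_2x2(40, 100, 60, 100)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```

With small expected counts, an exact test or a continuity correction may be more appropriate than the uncorrected statistic used here.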