Fig. 2: Response accuracy of LLMs on text-based test questions. | npj Digital Medicine

From: Benchmarking proprietary and open-source language and vision-language models for gastroenterology clinical reasoning

Models are grouped by family and, within each family, ordered chronologically from oldest to newest. Stacked bars show the response categories: correct, 2-option selection (2OP), external option selection (EOP), no option selection (NOP), errors, and incorrect responses. For models tested through both the API and the web interface, this figure shows the API version, because performance was similar across the two interfaces.
