Fig. 6: Stratified analysis of top two VLMs. | npj Digital Medicine

From: Benchmarking proprietary and open-source language and vision-language models for gastroenterology clinical reasoning

Results for two VLMs, (a) Llama-3.2-11b and (b) Claude-3-Sonnet, are shown under three scenarios. The left column presents baseline performance; the middle and right columns show the change in performance after providing the image directly and after providing the human description, respectively. Bars represent the percentage of correct answers, with 95% confidence intervals estimated by bootstrapping. Performance is stratified by question topic, text- or image-based format, question length, patient care phase, inclusion of laboratory results in the question, and difficulty (Q1 represents challenging questions, based on the average percentage of humans answering correctly).
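The caption does not say which bootstrap variant was used, so the following is only a minimal sketch of one common choice, the percentile bootstrap, applied to a list of graded answers (1 for correct, 0 for incorrect). The function name `bootstrap_accuracy_ci`, the number of resamples, and the fixed seed are illustrative assumptions, not details taken from the article.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for percentage accuracy.

    `correct` is a list of 0/1 outcomes, one per question in a stratum.
    Returns (point_estimate, lower, upper), all in percent.
    NOTE: hypothetical helper; the article's exact procedure is not specified.
    """
    if not correct:
        raise ValueError("need at least one graded answer")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)

    # Resample the questions with replacement and recompute accuracy each time.
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)
        stats.append(100.0 * sum(sample) / n)
    stats.sort()

    # Read the alpha/2 and 1 - alpha/2 percentiles off the sorted resamples.
    lo_idx = max(0, round((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)

    point = 100.0 * sum(correct) / n
    return point, stats[lo_idx], stats[hi_idx]


if __name__ == "__main__":
    # Made-up outcomes for one stratum: 70 correct answers out of 100.
    outcomes = [1] * 70 + [0] * 30
    point, lower, upper = bootstrap_accuracy_ci(outcomes)
    print(f"accuracy = {point:.1f}% (95% CI {lower:.1f} to {upper:.1f})")
```

For a stratified view like the one in the figure, the same function would be called once per stratum (for example, once per question topic or difficulty quartile), each time with only that stratum's outcomes.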