Fig. 4: Stratified analysis of top LLM in each model family. | npj Digital Medicine


From: Benchmarking proprietary and open-source language and vision-language models for gastroenterology clinical reasoning


Performance is stratified by question topic, question format (text- or image-based), question length, patient care phase, whether the question includes laboratory data, and difficulty. Difficulty is based on the average percentage of humans answering correctly: Q1 denotes challenging questions and Q4 denotes easy questions. Bars show the percentage of correct answers, with 95% confidence intervals estimated by bootstrapping.
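
The caption does not say which bootstrap variant or how many resamples were used. As a rough illustration of how a bootstrapped 95% confidence interval for accuracy can be computed, the sketch below assumes a percentile bootstrap with 2000 resamples. The function name `bootstrap_accuracy_ci`, its parameters, and the example data are hypothetical and are not taken from the article.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy %, lower %, upper %) using a percentile bootstrap.

    `correct` is a list of 0/1 flags, one per question (1 = answered correctly).
    The variant and resample count are assumptions, not the article's settings.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)
    resampled = []
    for _ in range(n_resamples):
        # Draw n questions with replacement and score the resample.
        sample = rng.choices(correct, k=n)
        resampled.append(100.0 * sum(sample) / n)
    resampled.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical percentiles.
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    point = 100.0 * sum(correct) / n
    return point, lower, upper


# Hypothetical stratum: 70 correct answers out of 100 questions.
outcomes = [1] * 70 + [0] * 30
point, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy = {point:.1f}% (95% CI {lower:.1f} to {upper:.1f})")
```

Running the function once per stratum (for example, once per question topic or difficulty group) would give one bar and one interval per group. Strata with few questions produce wider intervals.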
