Fig. 4: Stratified analysis of the top LLM in each model family.

Performance is reported stratified by question topic, format (text- or image-based), question length, patient care phase, whether the question includes laboratory information, and difficulty, defined by the average percentage of humans answering correctly (Q1: most challenging questions; Q4: easiest questions). Bars show the percentage of correct answers, with 95% confidence intervals estimated by bootstrapping.
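As an illustration of how a bootstrapped 95% confidence interval for accuracy within one stratum could be obtained, the following is a minimal sketch rather than the procedure used for the figure. It assumes a percentile bootstrap over questions; the function name `bootstrap_accuracy_ci`, the number of resamples, the seed, and the example data are all hypothetical.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy, in percent.

    `correct` is a list of 0/1 flags, one per question in a stratum
    (1 = the model answered correctly). The resample count and the
    percentile method are assumptions, not details from the figure.
    """
    rng = random.Random(seed)
    n = len(correct)
    resampled = []
    for _ in range(n_resamples):
        # Draw n questions with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        resampled.append(100.0 * sum(sample) / n)
    resampled.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical percentiles.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    point = 100.0 * sum(correct) / n
    return point, resampled[lo_idx], resampled[hi_idx]


# Hypothetical stratum: 100 questions, 70 answered correctly.
flags = [1] * 70 + [0] * 30
point, lower, upper = bootstrap_accuracy_ci(flags)
print(f"accuracy = {point:.1f}% (95% CI {lower:.1f} to {upper:.1f})")
```

Running the function once per stratum (for example, per topic or per difficulty group) would give one bar and one interval each. Strata with few questions yield wider intervals, which matters when comparing bars across groups of unequal size.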