Fig. 6: Stratified analysis of top two VLMs.

Results for two VLMs, a Llama-3.2-11b and b Claude-3-Sonnet, are shown for three scenarios. The left column shows baseline performance; the middle and right columns show the change in performance after providing the image directly and the human description, respectively. Bars represent the percentage of correct answers, with 95% confidence intervals estimated by bootstrapping. Performance is stratified by question topic, text- or image-based format, question length, patient care phase, whether laboratory data are included in the question, and difficulty (Q1 denotes challenging questions, as defined by the average percentage of humans answering correctly).
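
The caption does not say which bootstrap variant was used. As an illustration only, the sketch below assumes a percentile bootstrap over per-question correctness within a single stratum; the function name `bootstrap_accuracy_ci`, the number of resamples, the seed, and the example data are all hypothetical and not taken from the study.

```python
import numpy as np


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy, in percent.

    `correct` is a sequence of 0/1 flags, one per question in a stratum.
    Assumption: percentile bootstrap; the source only says "bootstrapping".
    """
    correct = np.asarray(correct, dtype=float)
    rng = np.random.default_rng(seed)
    n = correct.size
    # Resample question indices with replacement, one row per resample.
    idx = rng.integers(0, n, size=(n_resamples, n))
    # Accuracy (in percent) of each resampled set of questions.
    accs = correct[idx].mean(axis=1) * 100.0
    lower, upper = np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return correct.mean() * 100.0, lower, upper


# Hypothetical stratum: 30 correct answers out of 40 questions.
outcomes = [1] * 30 + [0] * 10
point, ci_low, ci_high = bootstrap_accuracy_ci(outcomes)
print(f"accuracy = {point:.1f}% (95% CI {ci_low:.1f} to {ci_high:.1f})")
```

Running the function once per stratum and per scenario would give one bar with its interval for each cell of the figure; small strata yield visibly wider intervals.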