Fig. 4: Normalized score distribution by question type. | npj Digital Medicine

From: Autonomous medical evaluation for guideline adherence of large language models

This violin plot shows the distribution of AI model performance across the clinical tasks in the AMEGA benchmark. The y-axis lists question types and the x-axis shows normalized scores; each violin represents the score distribution for one question type. Basic tasks (Primary Working Diagnosis, Extracted Symptoms, Extracted Risk Factors) show consistently high performance, with median scores of 1.00 and narrow distributions. Complex tasks such as Differential Diagnoses (median 0.40) and Treatment Strategies (median 0.50) show lower scores and wider distributions, indicating greater difficulty and more variable model performance. Intermediate tasks (e.g., Immediate Diagnostic Procedures, Therapeutic Strategies) have median scores between 0.73 and 0.86 with moderate distribution widths. Sample sizes vary: most tasks have 340 evaluations, whereas Treatment Strategies has only 51 because this question type appears in only 3 of the 20 cases.
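The sample sizes quoted in the caption can be cross-checked with a little arithmetic. The sketch below assumes that every case contributes the same number of evaluations to a question type; the caption does not state this, and the variable names are illustrative only.

```python
# Figures quoted in the caption.
TOTAL_CASES = 20
EVALS_PER_TYPICAL_TASK = 340
CASES_WITH_TREATMENT_STRATEGIES = 3

# Assumption (not stated in the caption): each case contributes an equal
# number of evaluations to a question type that appears in all 20 cases.
evals_per_case, remainder = divmod(EVALS_PER_TYPICAL_TASK, TOTAL_CASES)

# Under that assumption, a question type present in only 3 cases should
# have 3 times the per-case evaluation count.
expected_treatment_evals = evals_per_case * CASES_WITH_TREATMENT_STRATEGIES

print(evals_per_case, remainder, expected_treatment_evals)  # 17 0 51
```

Under this assumption the numbers are consistent: 340 evaluations over 20 cases gives 17 per case, and 3 cases give the 51 evaluations reported for Treatment Strategies.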
