Fig. 3: Evaluation results on the AMEGA benchmark: score distribution at the case level.
From: Autonomous medical evaluation for guideline adherence of large language models

The boxplots visualize the performance distribution of the evaluated language models across the AMEGA benchmark cases, with models ordered left to right by increasing overall performance. The y-axis shows the per-case score (0 to 50); the x-axis lists the individual models. Each box spans the interquartile range with a line at the median, and the whiskers cover the remaining range excluding outliers, which are plotted as individual points. Models are color-coded by category: medical vs. non-medical, open-source vs. proprietary, and small vs. large variants. The plot shows that larger and proprietary models tend to achieve higher and more consistent scores. Notably, the medical open-source models (far left) underperform relative to their non-medical counterparts. Top-performing models such as the GPT-4 variants and llama-3-70b combine high median scores with tight distributions, indicating consistent performance across the diverse medical cases in the benchmark.
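For readers who want to reproduce this style of per-case boxplot from their own evaluation runs, the following is a minimal sketch. The input layout (a CSV named amega_case_scores.csv with one row per model/case pair and columns "model" and "case_score") is an assumption for illustration, not the authors' actual pipeline.

```python
# Minimal sketch: per-case score boxplot per model, ordered by median score.
# File name and column names ("model", "case_score") are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.read_csv("amega_case_scores.csv")  # one row per (model, case)

# Order models left to right by median case score, mirroring the figure.
order = (
    scores.groupby("model")["case_score"]
    .median()
    .sort_values()
    .index
)
data = [scores.loc[scores["model"] == m, "case_score"] for m in order]

fig, ax = plt.subplots(figsize=(12, 5))
# Default matplotlib boxplot: box = interquartile range, line = median,
# whiskers = range excluding outliers, fliers = outlier points.
ax.boxplot(data, labels=list(order), flierprops={"marker": "."})
ax.set_ylabel("AMEGA score per case (0-50)")
ax.set_xlabel("Model")
ax.tick_params(axis="x", rotation=90)
fig.tight_layout()
fig.savefig("amega_case_boxplot.png", dpi=300)
```

Ordering by median rather than mean keeps the ranking robust to the outlier cases visible in the figure.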