
Fig. 3: Evaluation results on the AMEGA benchmark, score distribution at the case level.

From: Autonomous medical evaluation for guideline adherence of large language models


The boxplots visualize the performance distribution of each language model across the AMEGA benchmark cases. Models are arranged from left to right in order of generally increasing performance. The y-axis shows scores from 0 to 50; the x-axis lists the models. Each box marks a model's interquartile range and median score, with whiskers spanning the full range excluding outliers (plotted as individual points). Models are color-coded by type: medical versus non-medical, open-source versus proprietary, and small versus large variants. The plot shows that larger and proprietary models tend to achieve higher and more consistent scores. Notably, the medical open-source models (far left) underperform their non-medical counterparts. Top-performing models such as the GPT-4 variants and llama-3-70b show high median scores with tight distributions, indicating consistent performance across the benchmark's diverse medical cases.
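For readers who want to reproduce this style of visualization, the sketch below shows how a case-level score boxplot of this kind could be drawn with matplotlib. The model names, score distributions, and color assignments are illustrative placeholders, not the authors' actual data, code, or color scheme.

```python
# A minimal sketch of a per-model, case-level score boxplot in the style
# of Fig. 3. All data here is synthetic and purely illustrative.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical per-case scores (0-50) for a few models, ordered roughly
# from weakest to strongest to mirror the left-to-right layout.
models = {
    "medical-7b (open, medical)": rng.normal(18, 7, 20),
    "generic-7b (open)": rng.normal(25, 6, 20),
    "llama-3-70b (open)": rng.normal(38, 4, 20),
    "gpt-4 (proprietary)": rng.normal(41, 3, 20),
}
scores = [np.clip(s, 0, 50) for s in models.values()]

fig, ax = plt.subplots(figsize=(8, 4))
boxes = ax.boxplot(scores, labels=list(models.keys()), patch_artist=True)

# Color-code each box by model type (here: medical open-source,
# non-medical open-source, proprietary).
colors = ["#d95f02", "#7570b3", "#7570b3", "#1b9e77"]
for patch, color in zip(boxes["boxes"], colors):
    patch.set_facecolor(color)

ax.set_ylabel("AMEGA case score (0-50)")
ax.set_ylim(0, 50)
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()
```

By default, matplotlib's boxplot draws the box at the interquartile range with a line at the median and whiskers at 1.5 times the IQR, plotting points beyond the whiskers as outliers, which matches the conventions described in the caption.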
