Fig. 2: Evaluation results on the AMEGA benchmark. | npj Digital Medicine
From: Autonomous medical evaluation for guideline adherence of large language models

This figure presents the final scores of various large language models (LLMs) on the AMEGA benchmark, averaged across 20 medical cases. The models fall into three categories: open-source non-medical models (blue), open-source medical models (orange), and proprietary models (green). Scores are reported out of a possible 50 points. WizardLM-2 (8x22B) leads the open-source models with a score of 41.3. Among proprietary models, GPT-4-1106-preview scores highest at 41.9. The largest performance gap separates the top-performing models from specialized medical models such as MedLlama-2, Gema, and Meditron, which score substantially lower.