Fig. 5: Scatter plot comparing expert-evaluation and automated-metric scores for each sample in the MedThink-Bench dataset.
From: Automating expert-level medical reasoning evaluation of large language models

The plot includes results from GPT-4o, Llama-3.3-70B, and MedGemma-27B; GPT-4o-mini serves as the judge model for the two LLM-based evaluation metrics. Each point represents an individual sample, and the dashed line marks where the expert and automated scores are equal. For LLM-w-Rationale, data points cluster closely around the dashed line, suggesting strong agreement with expert evaluations. In contrast, BLEURT, BERTScore, and LLM-w/o-Rationale diverge more widely from the dashed line, indicating weaker alignment with expert assessments.
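As a minimal sketch of how such an agreement plot can be produced, the Python example below draws a per-sample scatter of automated scores against expert scores with a dashed identity line. The synthetic data, metric names, and the common 0–1 score scale are assumptions for illustration only, not details taken from the paper's evaluation pipeline.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-sample scores on a common 0-1 scale; in practice these
# would be loaded from the benchmark's expert and automated evaluation outputs.
rng = np.random.default_rng(0)
expert = rng.uniform(0, 1, size=200)
metrics = {
    "LLM-w-Rationale": np.clip(expert + rng.normal(0, 0.05, 200), 0, 1),
    "BERTScore":       np.clip(expert + rng.normal(0, 0.25, 200), 0, 1),
}

fig, axes = plt.subplots(1, len(metrics), figsize=(4 * len(metrics), 4))
for ax, (name, automated) in zip(np.atleast_1d(axes), metrics.items()):
    ax.scatter(expert, automated, s=10, alpha=0.6)
    # Dashed identity line: a point on it means the automated score
    # equals the expert score for that sample.
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray")
    ax.set_xlabel("Expert score")
    ax.set_ylabel(f"{name} score")
    ax.set_title(name)
plt.tight_layout()
plt.show()
```

Points tightly hugging the dashed line (as in the first panel) correspond to the strong expert agreement described for LLM-w-Rationale; the wider spread in the second panel mimics the weaker alignment reported for the embedding-based metrics.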