Fig. 4: Correlation analysis between expert and automated evaluation of medical reasoning performance.

From: Automating expert-level medical reasoning evaluation of large language models

a Pearson correlation between expert assessments of predicted rationales and various automated metrics. These metrics include text-similarity measures (BLEU, ROUGE-L, METEOR, BLEURT, BERTScore), LLM-w/o-Rationale (which does not use ground-truth rationales as a reference), and LLM-w-Rationale (which uses our annotated fine-grained rationales as a reference). GPT-4o-mini serves as the judge model. Warmer colors (red tones) denote stronger positive correlations, while cooler colors (blue tones) indicate weaker or negative correlations. The results show a strong correlation between LLM-w-Rationale and expert evaluations, whereas LLM-w/o-Rationale and the text-similarity metrics correlate only weakly with expert assessments. b Kendall's tau correlation between model rankings derived from expert evaluations and those derived from the automated metrics. With GPT-4o-mini as the judge model, LLM-w-Rationale shows a very strong positive correlation with expert rankings (τ = 0.88), whereas LLM-w/o-Rationale (τ = 0.06) and the text-similarity metrics (τ values ranging from −0.39 to 0, indicating negative or weak correlations) align much less closely with expert-derived model performance rankings.
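Both statistics in the figure are standard correlation measures. The sketch below is illustrative only, assuming hypothetical per-sample scores and model-level means (none of these numbers come from the paper); it shows how a sample-level Pearson correlation (panel a) and a ranking-level Kendall's tau (panel b) could be computed with scipy.

```python
# Minimal sketch, assuming hypothetical score data (not the paper's results).
# Expert and automated-metric scores are paired 1-D arrays per rationale.
import numpy as np
from scipy.stats import pearsonr, kendalltau

# Hypothetical per-sample quality scores for the same set of rationales
expert_scores   = np.array([4.5, 3.0, 4.0, 2.5, 5.0, 3.5])
llm_w_rationale = np.array([4.3, 3.2, 4.1, 2.4, 4.8, 3.6])   # reference-guided judge scores
bleu_scores     = np.array([0.21, 0.35, 0.18, 0.40, 0.25, 0.30])  # text-similarity metric

# Panel a: sample-level Pearson correlation with expert assessments
r_judge, _ = pearsonr(expert_scores, llm_w_rationale)
r_bleu, _  = pearsonr(expert_scores, bleu_scores)
print(f"Pearson r (LLM-w-Rationale vs. expert): {r_judge:.2f}")
print(f"Pearson r (BLEU vs. expert):            {r_bleu:.2f}")

# Panel b: Kendall's tau on model rankings. Each entry is one model's
# mean score under expert evaluation vs. under an automated metric.
expert_model_means = np.array([4.2, 3.8, 3.1, 2.9, 2.4])
metric_model_means = np.array([4.0, 3.9, 3.0, 3.1, 2.2])
tau, _ = kendalltau(expert_model_means, metric_model_means)
print(f"Kendall's tau on model ranking: {tau:.2f}")
```

A high tau in panel b means the automated metric orders the evaluated models almost identically to the expert raters, even if the absolute scores differ.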
