Fig. 3: Medical reasoning performance on the MedThink-Bench dataset.
From: Automating expert-level medical reasoning evaluation of large language models

a Comparison of overall medical reasoning performance as measured by expert evaluation, five text-similarity metrics, and the proposed LLM-w-Rationale framework under zero-shot prompting. Automated reasoning assessments were obtained by comparing the ground-truth reasoning annotations with the models' predicted annotations. Error bars represent the 95% confidence interval (CI) of the mean, estimated via bootstrapping. b Breakdown of medical reasoning performance across the ten medical domains in the MedThink-Bench dataset.
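The bootstrapped 95% CI reported in panel a can be reproduced with a simple percentile-bootstrap routine. The sketch below is a generic illustration under assumed inputs (a vector of per-question scores), not the authors' exact procedure; the `bootstrap_ci` function and `example_scores` data are hypothetical.

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a 1-D array of scores.

    A generic sketch: resample with replacement, compute the mean of each
    resample, and report the (alpha/2, 1 - alpha/2) percentiles.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lower, upper)

# Hypothetical per-question evaluation scores in [0, 1] (e.g. expert or
# LLM-w-Rationale ratings); stands in for the real MedThink-Bench results.
example_scores = np.random.default_rng(1).uniform(0.6, 0.9, size=200)
mean, (lo, hi) = bootstrap_ci(example_scores)
print(f"mean = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```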