
Fig. 1: Overview of the MedThink-Bench dataset.

From: Automating expert-level medical reasoning evaluation of large language models

a Data collection. Medical questions were sourced from ten publicly available datasets, each accompanied by ground-truth answers. b Data preprocessing. Duplicate entries and questions involving medical images were removed, and medical experts then manually curated a subset of complex questions requiring multi-step reasoning. c Expert annotation. A team of ten medical experts categorized the questions into ten distinct medical domains and collaboratively generated fine-grained reasoning trajectories through consensus. d Medical reasoning evaluation. We rigorously evaluated the medical reasoning capabilities of twelve LLMs using expert evaluations, text-similarity metrics, LLM-as-a-Judge, and reference-based LLM-as-a-Judge (LLM-w-Rationale). Additionally, we analyzed the correlation between these automated metrics and expert evaluations. Icons adapted from flaticon.com, used under royalty-free license.
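As an illustration of the correlation analysis described in panel d, the sketch below computes Spearman's rank correlation between automated-metric scores and expert evaluation scores. The score values, metric names, and the use of scipy.stats.spearmanr are assumptions for illustration only, not the paper's actual implementation or data.

```python
# Minimal sketch: correlating automated metric scores with expert evaluations.
# All score values and metric names below are hypothetical placeholders.
from scipy.stats import spearmanr

# Hypothetical per-question scores assigned by each evaluation method
expert_scores = [0.9, 0.6, 0.8, 0.3, 0.7]          # expert evaluation (reference standard)
automated_scores = {
    "text_similarity": [0.7, 0.5, 0.6, 0.4, 0.6],  # e.g., a lexical-overlap metric
    "llm_as_judge":    [0.8, 0.5, 0.9, 0.2, 0.7],  # judge LLM without reference rationales
    "llm_w_rationale": [0.9, 0.6, 0.8, 0.3, 0.8],  # judge LLM guided by expert rationales
}

# Spearman's rank correlation between each automated metric and expert scores
for name, scores in automated_scores.items():
    rho, p_value = spearmanr(scores, expert_scores)
    print(f"{name}: rho={rho:.2f}, p={p_value:.3f}")
```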
