Fig. 8: Case study of medical reasoning assessment.
From: Automating expert-level medical reasoning evaluation of large language models

This case demonstrates that although the model under evaluation, Llama-3.3-70B, produced an incorrect final answer, it followed partially correct medical reasoning trajectories (highlighted in red). This underscores the advantage of the LLM-w-Rationale framework, which, combined with the fine-grained rationale annotations in MedThink-Bench, yields a more nuanced assessment of LLMs' medical reasoning abilities than final-answer accuracy alone.
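To make the rationale-level scoring idea concrete, the sketch below grades a model's free-text reasoning against each expert rationale separately, so a wrong final answer can still earn partial reasoning credit, as in this case. It is a minimal illustration, not the paper's implementation: the function names, the keyword-overlap judge (a crude stand-in for the framework's LLM-based judging), and the clinical snippet are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    answer_correct: bool        # final-answer accuracy signal
    rationale_hits: list[bool]  # one verdict per expert rationale


def judge_rationale(model_reasoning: str, rationale: str) -> bool:
    """Placeholder judge: the actual framework uses an LLM to decide
    whether the model's reasoning covers the expert rationale; here a
    crude keyword-overlap heuristic stands in for illustration only."""
    key_terms = {w.lower() for w in rationale.split() if len(w) > 4}
    found = {w.lower() for w in model_reasoning.split()}
    return len(key_terms & found) >= max(1, len(key_terms) // 2)


def evaluate_case(model_answer: str, gold_answer: str,
                  model_reasoning: str,
                  expert_rationales: list[str]) -> CaseResult:
    """Score final-answer accuracy and per-rationale coverage separately."""
    hits = [judge_rationale(model_reasoning, r) for r in expert_rationales]
    return CaseResult(model_answer.strip() == gold_answer.strip(), hits)


# A wrong final answer can still earn partial rationale credit,
# mirroring the Llama-3.3-70B case shown in the figure.
res = evaluate_case(
    model_answer="B",
    gold_answer="C",
    model_reasoning=("The patient's hyponatremia and elevated urine "
                     "osmolality suggest SIADH ..."),
    expert_rationales=[
        "Recognize hyponatremia with inappropriately concentrated urine",
        "Consider SIADH in the differential diagnosis",
    ],
)
print(f"answer correct: {res.answer_correct}, "
      f"rationale recall: {sum(res.rationale_hits)}/{len(res.rationale_hits)}")
```

Reporting the two signals side by side is what allows a case like this one to be distinguished from a model that is wrong for entirely wrong reasons.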