Fig. 8: Case study of medical reasoning assessment. | npj Digital Medicine

From: Automating expert-level medical reasoning evaluation of large language models

This case demonstrates that although the prediction model, Llama-3.3-70B, produced an incorrect final answer, it followed partially correct medical reasoning trajectories (highlighted in red). This underscores the advantage of the LLM-w-Rationale framework: combined with the fine-grained rationale annotations in MedThink-Bench, it provides a more nuanced evaluation of the medical reasoning abilities of LLMs than prediction accuracy alone.