Fig. 7: Performance robustness analysis of LLM-w-Rationale.
From: Automating expert-level medical reasoning evaluation of large language models

a Robustness of LLM-w-Rationale across different judge models. Predicted rationales from Llama-3.3-70B were evaluated using 10 LLMs of varying scales as judge models. b Sensitivity of LLM-w-Rationale to prompt variations. Five semantically similar prompt variants were tested to assess the framework’s robustness. Error bars represent the 95% confidence interval (CI) of the mean, calculated via bootstrapping.
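The percentile-bootstrap CI of the mean reported in the caption can be sketched as follows; the `scores` values are hypothetical placeholders, not data from the figure, and the resample count is an assumption:

```python
import random
import statistics

def bootstrap_ci_mean(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean: resample with replacement,
    recompute the mean each time, and take the alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), (lo, hi)

# Hypothetical per-item evaluation scores (illustrative only).
scores = [0.72, 0.81, 0.78, 0.69, 0.75, 0.80, 0.77, 0.74]
mean, (lo, hi) = bootstrap_ci_mean(scores)
```

The percentile method is one common choice; BCa or studentized intervals would be drop-in alternatives if the score distribution is skewed.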