Fig. 7: Performance robustness analysis of LLM-w-Rationale.

From: Automating expert-level medical reasoning evaluation of large language models

a Robustness of LLM-w-Rationale across different judge models. Predicted rationales from Llama-3.3-70B were evaluated using 10 LLMs of varying scales as judge models. b Sensitivity of LLM-w-Rationale to prompt variations. Five semantically similar prompt variations were tested to assess the framework's robustness. Error bars represent the 95% confidence interval (CI) of the mean, calculated via bootstrapping.
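
The caption states that the error bars are bootstrapped 95% CIs of the mean. As a minimal sketch of how such an interval could be computed (an assumed implementation, not the authors' code; the function name, resample count, and example scores are hypothetical):

```python
import numpy as np

def bootstrap_ci_mean(scores, n_resamples=10_000, ci=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample with replacement and record the mean of each resample.
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.percentile(means, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return scores.mean(), lo, hi

# Hypothetical per-case evaluation scores for one judge model or prompt variant.
scores = [0.82, 0.79, 0.91, 0.85, 0.77, 0.88, 0.90, 0.81]
mean, lo, hi = bootstrap_ci_mean(scores)
print(f"mean = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```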