Fig. 4: Alignment between human rater and LLM-based judge. | npj Digital Medicine

From: Benchmarking large language models for personalized, biomarker-based health intervention recommendations

a Mean balanced accuracies achieved by the models across all validation requirements, as assessed by the human rater and the LLM-based judge. b Overall accuracies per model. Both subplots additionally illustrate Cohen’s kappa scores, which are used as a measure of alignment between the human rater and the LLM-based judge. Error bars indicate variability in alignment.
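
As an illustration of the two statistics named in the caption, the following minimal sketch computes Cohen's kappa between two sets of verdicts and balanced accuracy against a reference labeling. The function names and the toy labels (`human`, `judge`) are hypothetical and not taken from the article; the article's own computation may differ in detail.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


def balanced_accuracy(reference, predicted):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    recalls = []
    for label in set(reference):
        indices = [i for i, r in enumerate(reference) if r == label]
        hits = sum(predicted[i] == label for i in indices)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


# Toy data (hypothetical): 1 = requirement met, 0 = requirement not met.
human = [1, 1, 0, 1, 0, 0, 1, 1]
judge = [1, 1, 0, 0, 0, 1, 1, 1]

kappa = cohen_kappa(human, judge)
bal_acc = balanced_accuracy(human, judge)
```

In this toy case the two raters agree on 6 of 8 items (0.75), while chance agreement is 0.53125, giving a kappa of 7/15, roughly 0.47. Kappa near 1 indicates close alignment between the human rater and the LLM-based judge; kappa near 0 indicates agreement no better than chance.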