Extended Data Fig. 5: Sleep and fitness case study human evaluation results by principle.
From: A personal health large language model for sleep and fitness coaching

Mean expert ratings for each case study evaluation principle across all sections in the sleep (a) and fitness (b) domains. The principles are ordered according to the rubric presented in Supplementary Table 9. ‘*’ indicates a statistically significant difference (p < 0.05; two-sided Wilcoxon rank-sum test with multiple hypothesis testing correction). Error bars represent 95% confidence intervals bootstrapped over 1,000 iterations. Within each bar, n denotes the number of principle ratings per conversation source, and circles show the proportion of scores at each Likert rating.
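
The statistics named in this legend can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes a percentile bootstrap for the confidence interval of the mean, a normal approximation with tie correction (and no continuity correction) for the rank-sum test, and a Bonferroni adjustment as a stand-in for the multiple hypothesis testing correction, which the legend does not name. The function names and the Likert ratings below are hypothetical.

```python
import math
import random
from statistics import mean


def bootstrap_ci(ratings, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (assumed variant of the bootstrap)."""
    rng = random.Random(seed)
    n = len(ratings)
    # Resample with replacement n_iter times and record each resample's mean.
    means = sorted(mean(rng.choices(ratings, k=n)) for _ in range(n_iter))
    lo = means[round(alpha / 2 * n_iter)]
    hi = means[round((1 - alpha / 2) * n_iter) - 1]
    return lo, hi


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value via a tie-corrected normal approximation."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values pooled[i:j] and give each its midrank.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        t = j - i
        tie_term += t**3 - t
        i = j
    mu = n1 * (n + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:  # every rating identical: no evidence of a difference
        return 1.0
    z = (rank_sum_x - mu) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni(p_values):
    """Bonferroni adjustment (an assumption; the legend does not name the method)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical 1-5 Likert ratings for two conversation sources on two principles.
source_a = {"principle_1": [4, 5, 4, 3, 5, 4, 4, 5], "principle_2": [3, 3, 4, 2, 3, 4, 3, 3]}
source_b = {"principle_1": [4, 4, 5, 4, 3, 5, 4, 4], "principle_2": [4, 5, 4, 4, 5, 5, 4, 4]}

raw_p = [rank_sum_p(source_a[k], source_b[k]) for k in source_a]
adjusted_p = bonferroni(raw_p)
for name, p_adj in zip(source_a, adjusted_p):
    ci = bootstrap_ci(source_a[name])
    star = "*" if p_adj < 0.05 else ""
    print(f"{name}: mean={mean(source_a[name]):.2f}, 95% CI={ci}, adjusted p={p_adj:.3f} {star}")
```

With small samples and heavily tied Likert data, an exact or permutation-based test may be preferable to the normal approximation used here.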