Extended Data Fig. 8: Effect of fine-tuning data scale on model performance in coaching recommendations.
From: A personal health large language model for sleep and fitness coaching

Mean ratings were generated using our best AutoEval models for the holdout case study subsections in the sleep (a) and fitness (b) domains. ‘PH-LLM’ denotes the performance of the model fine-tuned on the full training dataset, while ‘Subsampled 25%’ and ‘Subsampled 50%’ denote responses from models trained on 25% and 50% of the training dataset, respectively. ‘Gemini Ultra’ denotes untuned baseline performance (that is, Gemini Ultra 1.0 trained on 0% of the training dataset). Within each section, a ‘*’ indicates a statistically significant difference (p < 0.05) from the top-rated response type under the two-sided Wilcoxon rank-sum test with multiple hypothesis testing correction. Error bars represent 95% confidence intervals bootstrapped over 1,000 iterations. Within each bar, n denotes the number of principle ratings per conversation source and circles show the proportion of scores at a given Likert rating.
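The statistics described above (a two-sided Wilcoxon rank-sum test with multiple hypothesis correction, and bootstrapped 95% confidence intervals over 1,000 iterations) can be illustrated with a minimal sketch; this is not the authors' code, the rating arrays and the choice of Bonferroni correction are hypothetical, and only the procedures named in the legend are reproduced.

```python
# Minimal sketch of the legend's statistics; ratings and the correction method
# (Bonferroni here) are illustrative assumptions, not the authors' pipeline.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-principle Likert ratings (1-5) for the top-rated response
# type and one comparator response type.
ratings_top = np.array([5, 4, 5, 3, 4, 5, 4, 4, 5, 3])
ratings_other = np.array([3, 4, 2, 3, 4, 3, 2, 4, 3, 3])

# Two-sided Wilcoxon rank-sum test between the two response types.
stat, p_value = stats.ranksums(ratings_top, ratings_other)

# Multiple hypothesis correction over k comparisons (method not specified in
# the legend; Bonferroni shown as one possibility).
k = 3  # e.g. number of response types compared against the top-rated one
p_corrected = min(p_value * k, 1.0)

# 95% confidence interval for the mean rating, bootstrapped over 1,000 iterations.
boot_means = [
    rng.choice(ratings_top, size=ratings_top.size, replace=True).mean()
    for _ in range(1_000)
]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"rank-sum p = {p_value:.3f}, corrected p = {p_corrected:.3f}")
print(f"mean = {ratings_top.mean():.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```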