Extended Data Fig. 8: Effect of fine-tuning data scale on model performance in coaching recommendations. | Nature Medicine

From: A personal health large language model for sleep and fitness coaching

Mean ratings were generated using our best AutoEval models for the holdout case study subsections in the sleep (a) and fitness (b) domains. ‘PH-LLM’ denotes standard performance, while ‘Subsampled 25%’ and ‘Subsampled 50%’ denote responses from models trained on 25% and 50% of the training dataset, respectively. ‘Gemini Ultra’ denotes untuned baseline performance (that is, Gemini Ultra 1.0 trained on 0% of the training dataset). Within each section, a ‘*’ indicates a statistically significant difference (p < 0.05) from the top-rated response type, assessed using the two-sided Wilcoxon rank-sum test with multiple hypothesis testing correction. Error bars represent 95% confidence intervals bootstrapped over 1,000 iterations. Within each bar, n denotes the number of principle ratings per conversation source and circles show the proportion of scores at a given Likert rating.
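The statistical procedure described in the legend (bootstrapped 95% confidence intervals over 1,000 iterations, plus two-sided Wilcoxon rank-sum comparisons against the top-rated response type with multiple hypothesis testing correction) can be sketched as follows. This is a minimal illustration, not the authors' code: the Likert ratings are synthetic, and the correction method (a simple Bonferroni adjustment) is an assumption, since the legend does not specify which correction was applied.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)

def bootstrap_ci(ratings, n_boot=1000, alpha=0.05, rng=rng):
    """95% CI of the mean rating via bootstrap resampling (1,000 iterations)."""
    ratings = np.asarray(ratings)
    means = np.array([
        rng.choice(ratings, size=ratings.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Hypothetical 1-5 Likert ratings per response source (synthetic data).
ratings = {
    "PH-LLM": rng.integers(3, 6, size=100),
    "Subsampled 50%": rng.integers(2, 6, size=100),
    "Gemini Ultra": rng.integers(1, 5, size=100),
}

# Compare each source against the top-rated one with a two-sided
# Wilcoxon rank-sum test, then apply a Bonferroni correction (assumed).
top = max(ratings, key=lambda k: np.mean(ratings[k]))
others = [k for k in ratings if k != top]
pvals = np.array([ranksums(ratings[top], ratings[k]).pvalue for k in others])
pvals_adj = np.minimum(pvals * len(pvals), 1.0)
significant = pvals_adj < 0.05  # sources that would receive a '*'

for name, r in ratings.items():
    lo, hi = bootstrap_ci(r)
    print(f"{name}: mean={np.mean(r):.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

A design note: `ranksums` tests whether two independent samples come from the same distribution, which matches comparing rating sets from different response sources; the bootstrap CI here resamples ratings with replacement and takes percentiles of the resampled means.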
