Extended Data Fig. 1: Case study creation and curation, model training, and response evaluation workflow. | Nature Medicine

From: A personal health large language model for sleep and fitness coaching

a, Case studies were selected from anonymized Fitbit production data from individuals who provided consent for research purposes. Two sets of case studies were generated: one set for model training, validation, and testing, and a separate holdout set for final evaluation. To facilitate rapid development of high-quality answers, candidate responses for the train/validation/test set were generated by Gemini and then edited and rewritten by domain experts. To enable comparison of human and model-derived responses, responses for the holdout set were written solely by the domain experts. b, For model training, each case study was split into multiple prompt/answer pairs, one per section of the case study: N = 3 for sleep (insights, etiology, and recommendations sections) and N = 5 for fitness (demographics, training load, sleep metrics, health metrics, and assessment sections) (Methods). Gemini Ultra 1.0 underwent full fine-tuning on those examples to create PH-LLM. c, Expert evaluation was performed independently on the holdout dataset by the same set of domain experts responsible for generating the expert responses. For each case study in the holdout set, one or more experts who did not write the corresponding expert response graded the candidate responses (expert-written response, Gemini Ultra 1.0 response, and PH-LLM response); for 94 of the 100 case studies, all three candidate responses were graded by a single expert.
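
The per-section split described in panel b can be illustrated with a minimal Python sketch. Only the section names come from the caption; the `Example` class, the `split_case_study` function, and the prompt wording are hypothetical and are not the authors' implementation (the actual prompt construction is described in Methods).

```python
from dataclasses import dataclass

# Section names as listed in panel b of the caption.
SECTIONS = {
    "sleep": ["insights", "etiology", "recommendations"],
    "fitness": [
        "demographics",
        "training load",
        "sleep metrics",
        "health metrics",
        "assessment",
    ],
}


@dataclass
class Example:
    """One fine-tuning example (hypothetical container)."""
    prompt: str
    answer: str


def split_case_study(vertical, case_data, section_answers):
    """Return one prompt/answer pair per section of a case study.

    The prompt template below is an assumption made for illustration only.
    """
    examples = []
    for section in SECTIONS[vertical]:
        prompt = (
            f"[{vertical} case study]\n{case_data}\n\n"
            f"Write the {section} section."
        )
        examples.append(Example(prompt=prompt, answer=section_answers[section]))
    return examples
```

Under these assumptions, a sleep case study yields three examples and a fitness case study yields five, matching the N values given in the caption.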
