Fine-tuning LLMs in behavioral psychology for scalable health coaching

Mantena, Sriya; Johnson, Anders; Oppezzo, Marily; Schütz, Narayan; Tolas, Alexander; Doijad, Ritu; Mattson, C. Mikael; Lawrie, Allan; Ramirez-Posada, Mariana; Schmiedmayer, Paul; Linos, Eleni; King, Abby C.; Rodriguez, Fatima; Kim, Daniel Seung; Ashley, Euan A.

doi:10.1038/s44325-025-00083-5

Download PDF

Article
Open access
Published: 22 September 2025

Fine-tuning LLMs in behavioral psychology for scalable health coaching

Sriya Mantena^1,2,
Anders Johnson¹,
Marily Oppezzo^3,4,
Narayan Schütz^1,5,6,
Alexander Tolas¹,
Ritu Doijad¹,
C. Mikael Mattson¹,
Allan Lawrie⁷,
Mariana Ramirez-Posada^8,9,
Paul Schmiedmayer¹,
Eleni Linos^8,9,
Abby C. King^3,4,
Fatima Rodriguez^1,9,10,11,
Daniel Seung Kim ORCID: orcid.org/0000-0003-2971-1909^1,9,10,11,12^na1 &
…
Euan A. Ashley ORCID: orcid.org/0000-0001-9418-9577^{1,9,10,11,13,14}^na1

npj Cardiovascular Health volume 2, Article number: 48 (2025) Cite this article

1153 Accesses
1 Altmetric
Metrics details

Subjects

Abstract

Personalized, smartphone-based coaching improves physical activity but relies on static, human-crafted messages. We introduce My Heart Counts (MHC)-Coach, a large language model fine-tuned on the Transtheoretical Model of Change. MHC-Coach generates messages tailored to an individual’s psychology (their “stage of change”), providing personalized support to foster long-term physical activity behavior change. To evaluate MHC-Coach’s efficacy, 632 participants compared human-expert and MHC-Coach interventions encouraging physical activity. Among messages matched to an individual’s stage of change, 68.0% (N = 430) preferred MHC-Coach-generated messages (P < 0.001). Blinded behavioral science experts (N = 2) rated MHC-Coach messages higher than human-expert messages for perceived effectiveness (4.4 vs. 2.8) and Transtheoretical Model alignment (4.1 vs. 3.5) on a 5-point Likert scale. This work demonstrates how language models can operationalize behavioral science frameworks for personalized health coaching, showing the potential for promoting long-term physical activity and reducing cardiovascular disease risk at scale.

Differences in molecular phenotype in mouse and human hypertrophic cardiomyopathy

Article Open access 23 June 2021

Shared genetic pathways contribute to risk of hypertrophic and dilated cardiomyopathies with opposite directions of effect

Article 25 January 2021

Machine learning-based prediction model for myocardial ischemia under high altitude exposure: a cohort study

Article Open access 06 January 2024

Introduction

Physical inactivity is one of the strongest risk factors for nearly all chronic diseases of aging^1,2. Despite the well-documented benefits of regular physical activity, a majority of adults in the United States remain sedentary³. Mean daily steps is a commonly used proxy for physical activity, with even modest, short-term increases (+500 steps) associated with cardiovascular benefits⁴. However, recent large-scale studies show that the average American achieves only 4700 daily steps, well below the recommended threshold for health benefits^4,5. These findings emphasize the need for scalable interventions that increase physical activity and support long-term adherence to ensure sustained health benefits over time.

Personalized coaching has consistently demonstrated efficacy in increasing physical activity by addressing individuals’ unique needs, motivations, and barriers^6,7,8. Yet, sustaining long-term behavior change is a complex and dynamic process shaped by multiple interacting factors. Behavioral science offers structured, evidence-based frameworks for personalization to the individual by addressing the psychological, social, and environmental factors that influence behavior change⁹. One well-established framework is the Transtheoretical Model (TTM)¹⁰, which focuses on the stages of change individuals undergo when adopting new healthy behaviors. The TTM categorizes individuals into five distinct stages of change: pre-contemplation, contemplation, preparation, action, and maintenance¹⁰. The TTM has proven effective in guiding behavior change for chronic disease prevention, with tailored interventions at each stage demonstrating higher efficacy in increasing physical activity¹¹. Because the model’s interventions evolve with users as they progress through the stages of change (from precontemplation to maintenance) it supports both short- and long-term improvements in physical activity¹⁰. Moreover, coaching strategies grounded in the TTM framework have fostered sustained physical activity across diverse populations^12,13,14. Despite its success in fostering behavior change, TTM-based coaching has largely relied on manual creation and delivery, limiting scalability. Digital health tools, such as smartphone apps and wearables, offer a promising avenue for expanding access to TTM-based interventions.

Digital health tools, particularly smartphone applications and wearable devices such as smartwatches, have been employed to encourage physical activity^15,16. Interventions using these digital health tools have demonstrated a positive effect on physical activity, with a recent meta-analysis reporting an average increase of 1,850 steps per day when digital interventions include personalized feedback and notifications¹⁷. While these findings highlight the potential of digital interventions, existing tools are currently limited in that they focus on short-term behavior change, with a lack of attention on sustained engagement and long-term impact. As a result, their effects on chronic disease prevention and treatment, such as cardiovascular disease, remain unknown. Moreover, while several digital trials are being conducted to assess the efficacy of TTM-based personalized coaching, they remain constrained by using preprogrammed, static messages designed by human experts and localized, center-based recruitment, limiting their scalability and adaptability^18,19.

In response to the challenge of balancing scalability and personalization in digital health, recent research has begun leveraging large language models (LLMs) to deliver personalized health coaching at scale. By enabling the automated generation of tailored, high-quality messages, LLMs make it feasible to reach large, diverse populations without requiring manual input for each user. Recent studies have explored the use of various off-the-shelf LLMs, including GPT-4, LamDa, and LLaMA-2-7B-chat to answer health-related questions and facilitate goal-setting conversations through chat interfaces. Messages generated through structured prompting techniques were consistently rated as more behaviorally aligned, empathetic, and actionable than standard outputs. For example, Jörke et al. demonstrated that GPT-4 could simulate motivational interviewing and goal-setting with expert-level fluency using template-based prompts incorporating user data (e.g., activity levels, goals, barriers)²⁰. Hegde et al. used a chat-based format in which LaMDA responded to user-described behavioral challenges framed by capability, opportunity, and motivation, and messages reranked by a BERT-based model were rated as more motivational than unrevised outputs²¹. Ong et al. directly compared LLaMA-2-7B-chat and human-generated messages in the sleep domain, finding that users rated the LLM responses as similarly empathetic and helpful²².

Despite these advances, current LLM-based systems remain reactive, responding through multi-turn conversations only when users initiate interaction, rather than proactively delivering timely and motivationally tailored messages informed by behavioral psychology. Proactive paradigms such as Just-in-Time Adaptive Interventions (JITAIs)²³, context-aware recommender systems²⁴, and ecological momentary interventions²⁵ have shown promise in improving engagement and adherence. However, many of these systems rely on rules-based triggers from physiological or environmental sensors and fail to account for a user’s evolving motivational profile. Hence, there remains a critical need to integrate behavioral science frameworks with LLMs to enable a new class of interventions that draw on the architectural logic of proactive paradigms to deliver adaptive support by grounding messages in users’ evolving motivational profiles.

We pursued this integration by building upon the My Heart Counts (MHC) Cardiovascular Health Study, a large-scale digital health platform designed to promote physical activity. The My Heart Counts Cardiovascular Health Study was launched in 2015 as one of the first smartphone applications using Apple’s ResearchKit. On launch, it became the fastest-recruiting medical study of all time with >40,000 participants in the first two weeks^{26,27,28,29,30,31}. MHC is a smartphone-based observational study designed to gather frequent, remote data on fitness, activity, and sleep³⁰. It has since enabled several digital trials, including a personalized intervention that increased step counts using handcrafted messages tailored to individual activity levels³¹. However, that intervention was limited in both scale and duration, focusing only on short-term behavioral change over a 7-day period.

To address the lack of proactive, scalable, and psychologically grounded personalization to support sustained behavior change in digital health, we developed MHC-Coach, a fine-tuned LLM³² trained on TTM behavioral science principles³³ and domain-specific knowledge in physical activity and cardiovascular health. MHC-Coach autonomously generates personalized, text-based coaching messages to increase physical activity, evolving with users’ changing needs, motivations, and readiness for behavior change. To assess this approach, we conduct three evaluations. First, unlike prior LLM physical activity coaching studies that relied on small evaluator groups (5 to 46 participants)^{20,21,34,35,36,37,38,39,40,41,42,43}, we conducted a large-scale survey (N = 632) with participants of the MHC study, comparing their preference for MHC-Coach messages vs. traditional human expert-designed messages. Second, behavioral science experts systematically rated MHC-Coach and expert messages based on their motivational effectiveness and alignment with behavioral science principles. Finally, we compared the linguistic and stylistic features of messages generated by MHC-Coach, the unmodified LLaMA3-70B model, retrieval-augmented generation (RAG)⁴⁴, few-shot prompting⁴⁵, and human experts to better understand how each method expresses motivation and behavioral specificity. Together, these evaluations assess the potential of integrating behavioral science frameworks with LLM-driven coaching to improve personalization, enhance motivational quality, and enable scalable, long-term behavior change.

Results

Qualitative evaluation of fine-tuned MHC-Coach and Base LLaMA Model

MHC-Coach was developed by fine-tuning LLaMA 3-70B on a dataset of 3268 human expert and peer-generated messages written in response to stage-specific scenarios³³, a domain-specific corpus on health and exercise to provide background information on the importance of physical activity in preventing cardiovascular disease^{1,10,46,47,48,49,50,51,52,53}, and detailed information about the TTM model and characteristics of individuals in each stage of change^10,54 (see Methods and Fig. 1). The combined fine-tuning dataset contained 215,553 tokens.

Fig. 1: Schematic representation of the study design, including the fine-tuning process for LLaMA 3-70B, the generation of stage-specific motivational messages, and the evaluation through participant surveys.

To assess the impact of this fine-tuning approach, we qualitatively compared MHC-Coach outputs to those of the base LLaMA 3-70B model, evaluating their integration of behavioral science principles and ability to generate stage-specific motivational messages (see Fig. 2). For example, when asked to produce a motivation message for a user in the contemplation stage (i.e., those beginning to think about how to change their physical activity behavior), the base LLM generated: “Take the first step towards a healthier you! Schedule exercise into your daily routine and start feeling the benefits in just a few weeks.” While this message encourages users to plan their exercise, it is overly general. It does not specify the type of exercise, the duration, or when to start, which makes it harder for users to visualize or act on the suggestion.

**Fig. 2: Qualitative comparison of Fine-Tuned LLM (MHC-Coach) and Base LLM (LLaMA3-70b) messages across the Contemplation and Action stages.**

In contrast, MHC-Coach generated the following message: “Ready to feel great? Schedule 10-minute daily walks into your calendar & reap energy, mood, & health benefits. Start Monday!” This version demonstrates a significant improvement in actionable specificity by recommending “10-minute daily walks” and incorporating practical planning advice, such as integrating the activity “into your calendar.” The use of immediacy (“Start Monday!”) gives a clear starting point, while the explicit mention of benefits such as “energy, mood, & health benefits” creates a stronger connection between the action and its immediate outcomes. By combining actionable specificity, real-life planning, and outcome clarity, the fine-tuned message effectively integrates TTM behavioral science principles, making it more likely to motivate users to take action. Figure 2 illustrates these improvements with annotated examples from the contemplation and action stages of the TTM.

Comparative evaluation of MHC-Coach and human expert messages

Survey invitations were sent via email to 62,616 users of the MHC smartphone app who had previously consented to re-contact. Surveys were conducted using the Qualtrics platform, and only participants who completed the entire survey were included in the analysis. Among the 1004 initial survey respondents, 853 met the inclusion criteria of being at least 18 years old and English-speaking. Of these 853 eligible participants, 632 provided complete responses and were included in the final analysis (see Fig. 3). The survey respondents included 79.4% males and 19.8% females, with the majority of participants (50.7%) aged 60+ years. The racial/ethnic distribution was predominantly White (80.8%), with smaller proportions identifying as Asian (6.1%), Hispanic/Latino (4.3%), and Black/African American (1.6%). Demographic characteristics of survey participants are summarized in Table 1.

**Fig. 3: The recruitment process for the My Heart Counts study by comparing LLM-generated messages with expert-designed messages.**

Table 1 Participant Demographics: Participant Demographics: Summary of the study population’s demographic characteristics, including age, gender, ethnicity (with the option to select multiple ethnicities), and native language

Full size table

To evaluate the efficacy of MHC-Coach in generating motivational messages, we surveyed participants with a series of forced-choice A/B questions. Figure 4 illustrates the overall survey flow along with example questions. All participants received five forced-choice A/B questions comparing general motivational messages that were not tailored to a specific stage of change, with 85.4% (N = 540) expressing a preference for the general MHC-Coach-generated messages and 14.6% (N = 92) preferring the human expert-designed messages. Across the general motivational messages, there was a significant preference for the MHC-Coach-generated messages (Chi-squared P < 0.001, see Table 2 and Fig. 5).

**Fig. 4: Overview of the survey design for comparing user preferences between LLM-generated and human expert-designed motivational messages.**

**Fig. 5: Comparisons between human expert-generated messages and fine-tuned LLM-generated messages across multiple evaluations.**

Table 2 Preferences over coaching messages generated by LLMs versus human expert-crafted messages, stratified by stages of change

Full size table

Users were then stratified to a specific TTM stage of change and received three forced-choice A/B questions on which message they preferred: an MHC-Coach-generated or a human expert-written coaching message, tailored to their individual stage of change. Across the stage-matched messages, 68.0% (N = 430) of users showed a preference for the MHC-Coach-generated messages, and 32.0% (N = 202) preferred the human expert messages (Chi-squared P < 0.001, see Table 2). This preference was significant and consistent across all stages of change in a sensitivity analysis, with sample sizes ranging from N = 37 for the pre-contemplation stage to N = 318 for the maintenance stage. Detailed breakdowns of preferences across specific stages of change, along with associated statistical results, are presented in Table 2 and Fig. 5.

Notably, female participants and those under 50 showed significantly stronger preferences for LLM-generated messages (P < 0.001 for both), with effect sizes in the small-to-moderate range (Cramér’s V = 0.069 and 0.086, respectively). A post hoc power analysis using the observed preference rate of 68% (vs. 50% under the null) and sample size (N = 632) yielded a power estimate greater than 0.999.

Two behavioral science experts reviewed and rated the MHC-Coach-generated messages and human expert-written messages (see Fig. 5). Across all stages of change, the average perceived effectiveness rating was 2.8 for the human-written messages and 4.4 for the MHC-Coach-generated messages on a 5-point Likert scale. Similarly, the average rating for how well the messages matched the requested TTM stage of change was 3.5 for the human-written messages and 4.1 for the MHC-Coach-generated messages. The most pronounced difference was observed in the pre-contemplation stage, where the human-written messages received an effectiveness rating of 2.2 and a match rating of 2.2, compared to the MHC-Coach-generated messages, which scored 4.3 for effectiveness and 3.1 for the stage alignment. A detailed breakdown of these ratings across all stages of change is presented in Fig. 5.

Linguistic and stylistic feature analysis across generation methods

To further compare the outputs of MHC-Coach with other generation strategies and expert-written messages, we analyzed seven linguistic and stylistic features that are commonly associated with effective behavior change communication: word count, sentiment, action-verb frequency, temporal reference, lexical diversity (type–token ratio), exclamation usage, and readability^55,56,57,58. For each feature, values were averaged across 20 messages generated by each method, four messages for each of the five stages of change. These descriptive comparisons revealed notable differences in stylistic tendencies across generation types (See Table 3). For example, fine-tuned messages exhibited the highest action-verb count per message (mean = 5.2) and the most temporal anchoring (75% of messages), while expert messages had the lowest use of exclamation marks (0.15 per message) and relatively high readability (Flesch score = 68.26). Representative examples from each method are shown in Supplementary Table 1.

Table 3 Summary of averaged linguistic features by generation method (N = 20 messages per method)

Full size table

To assess whether these observed differences were statistically significant, we conducted Kruskal–Wallis tests across generation methods for each feature. Generation method had a significant effect on word count (H(4) = 60.65, p < 0.001), action-verb frequency (H(4) = 36.55, p < 0.001), temporal reference (H(4) = 16.58, p = 0.011), and exclamation usage (H(4) = 54.65, p < 0.001). No significant differences were observed for sentiment (p = 0.069), lexical diversity (p = 0.282), or readability (p = 0.098). Full results are reported in Table 4.

Table 4 Kruskal–Wallis test results comparing linguistic features across generation methods (N = 100 messages, 20 per method)

Full size table

Discussion

Digital health offers the potential of a paradigm shift in scalability and individualization for physical activity coaching⁵⁹. Here, we developed and tested a tool, MHC-Coach, with substantial potential to address the gap in personalization via fine-tuning of a large language model to incorporate an evidence-based and validated psychological framework for behavior change. We show that MHC-Coach can provide autonomous, proactive, and personalized text-based coaching interventions, based on a user’s underlying behavioral psychology. Finally, we showed that MHC-Coach-generated messages were judged preferable by both behavioral science experts and participants in the My Heart Counts Cardiovascular Health study, as compared to human expert-written messages.

A key aspect of this work is demonstrating how fine-tuning LLMs with behavioral science principles enhances their ability to support behavior change. Generic LLMs, despite their fluency and coherence, lack an inherent understanding of psychological motivation, resulting in responses that may be readable but not effective for long-term engagement. By embedding the TTM framework into an LLM’s latent space, the MHC-Coach model delivers structured, stage-specific coaching tailored to users’ behavioral readiness, surpassing both the base LLM and the human expert messages it was trained on. Fine-tuning also appears to more effectively internalize behavior change principles than prompting-based methods like RAG or few-shot learning, yielding messages with more consistent motivational features. Despite being crafted by experienced behavior-change professionals and academics who could draw on diverse motivational strategies, the expert messages were still outperformed by MHC-Coach’s TTM-grounded outputs, highlighting the advantage of structured, theory-driven messaging. Building on this behavioral backbone, future versions of MHC-Coach can layer personalization using real-time activity data, contextual signals, and language preferences^60,61,62. Pairing MHC-Coach with emerging techniques like agentic systems⁶³, memory-augmented architectures⁶⁴, and reinforcement learning from user feedback⁶⁵ could enable deeper integration of these datastreams, enabling ever-more-personalized and proactive health coaching at scale.

We further note that our study is unique amongst studies of LLM and health behavior for its scale and recruitment approach, reaching and engaging 632 MHC users seeking health interventions in the final analysis. This contrasts with the majority of preceding LLM-based physical activity coaching studies, which relied on small, pre-selected groups of fewer than 46 annotators^{20,21,34,35,36,37,38,39,40,41,42,43}. We have collected user preference data and have made it openly available for further development of LLM-based health coaching leveraging the TTM framework (see Methods). Additionally, while there has been growing development of LLM applications in healthcare focused on clinical decision support, this work addresses preventive, behavior-focused coaching that can complement such systems^66,67.

A few limitations of our study should be noted. First, our cohort reflects the user characteristics of the original My Heart Counts Cardiovascular Health Study and may not fully capture the diversity of the broader United States population^{26,27,28,29,30,31}. The survey respondents skewed older than the general MHC cohort²⁹, with half of the participants over the age of 60, which may limit generalizability to younger groups. Moreover, MHC-Coach-generated messages were in English and hence restricted participation to English speakers. Research is ongoing to assess the efficacy of MHC-Coach-generated messages in non-English speakers, of whom there are significant numbers within the United States. The expert evaluation included only two raters, limiting interrater diversity, and was intended as a supplementary analysis to triangulate findings from the larger user preference study. While our findings offer promising signals of motivational alignment, future work is needed to evaluate downstream behavioral and cardiovascular impacts. Finally, while MHC-Coach was fine-tuned on scientific health data to enhance accuracy, we did not conduct a systematic evaluation of hallucinations. This represents an important area for future work, especially as models begin to incorporate patient-specific information or deliver more targeted health advice^68,69.

Concerning implementation, our fine-tuning approach is model-agnostic and compatible with open-source LLMs of varying sizes, enabling flexible, on-device deployment without reliance on external infrastructure. This further allows for stronger data privacy with user data kept local and enabling organizational control over data use⁷⁰. Additionally, unlike prompting-based methods, fine-tuning eliminated the need for runtime context engineering, simplifying integration and lowering latency.

In summary, MHC-Coach demonstrates the promise of integrating behavioral science frameworks with LLMs to deliver scalable, stage-specific health coaching. In contrast to generic LLMs, MHC-Coach goes beyond providing information about behavioral science and exercise; it operationalizes these principles in the language it generates, ensuring that coaching messages align with motivation-sensitive, stage-specific interventions. This study builds a foundation for future efforts, which include integrating MHC-Coach into the My Heart Counts smartphone application, expanding coaching to multilingual and culturally diverse settings, incorporating real-time activity data for automated TTM stage classification, and evaluating long-term adherence and cardiovascular health outcomes through longitudinal studies.

Methods

MHC-Coach message generation

A fine-tuning approach was selected for MHC-Coach as it supports low-latency, single-turn interactions without the runtime overhead of multi-step prompting, making it well-suited for mobile-health deployment.

Data

Our fine-tuning dataset consists of 3,268 motivational messages grounded in the TTM framework, designed to promote exercise behavior change across the five stages of change. These messages were collected through a separate study conducted in 2017³³. In that study, 377 messages were authored by 25 human experts, including fitness coaches, behavioral coaches, and researchers in health psychology who were recruited through direct outreach. The experts were selected based on their applied or academic experience in exercise promotion and behavior change. The remaining messages were written by a larger group of peer contributors recruited via Amazon Mechanical Turk. Although not explicitly trained in the TTM for the study, they were presented with scenarios of individuals in different stages of behavioral change and instructed to create messages that they believed would be most motivating for each. As such, they were free to draw on any behavioral strategy or approach from their professional experience, not limited to the TTM alone⁷¹.

In addition to these human-designed messages, we incorporated content from ten manuscripts that were chosen to capture the seminal Transtheoretical-Model theory, reflect consensus guideline statements on exercise and cardiovascular risk, and provide diverse empirical perspectives, ranging from longitudinal cohort data to global surveillance and equity analyses on physical-activity measurement and outcomes. We limited the set to ten manuscripts^{1,10,46,47,48,49,50,51,53,54} to maintain a focused, coherent corpus, a common practice in domain-specific fine-tuning that helps the model internalize key constructs without diluting them⁷². These manuscripts thus enriched the training corpus with evidence-based insights linking physical activity, behavior change, and cardiovascular outcomes.

The combined fine-tuning dataset, consisting of human-written messages, content extracted from research papers, and detailed information about the TTM model and characteristics of users in each phase^10,54, contained a total of 215,553 tokens. This dataset was used to fine-tune MHC-Coach, enabling it to generate contextually relevant and personalized messages for behavior change in physical activity (see Fig. 1). By integrating expert-designed messages with research-based content, we aimed to ensure accurate health information aligned with clinical and behavioral science guidelines, while grounding the language in evidence-based psychology tailored to each TTM stage.

Language model

LLaMA 3-70B³² was selected as the primary LLM architecture in this study due to its strong instruction-following abilities, open-source weights and model architecture, and a favorable balance between model size and performance for mobile health deployment. It has also demonstrated competitive performance on reasoning and language understanding tasks³², along with recent findings showing its responses to health-related questions matched or exceeded those of human coaches in quality, empathy, and accuracy, further supporting its suitability for health coaching applications²².

LLaMA 3-70B is a decoder-only transformer mode with >70 billion parameters, optimized for diverse tasks such as reasoning, instruction following, and dialog generation. For our experiments, we utilized LLaMA 3-70B with the following hyperparameters: temperature of 0.8 for response diversity, maximum token length of 800, and top-p (nucleus sampling) set to 0.9.

Fine-tuning

We used an instruction-tuning⁷³ variant to fine-tune LLaMA 3-70B on our dataset. Given the unstructured nature of the research papers and the need to align expert messages with the appropriate stage of change, we developed a custom template to guide the fine-tuning process. We structured the input data using a custom template designed to train the model to generate stage-specific motivational messages. Each training example consisted of three key components: (1) a prompt section, describing the user’s stage of change and requesting a tailored motivational message, (2) a stage section label specifying the user’s stage of change, and (3) a completion section, containing the human-designed motivational message as the target response. To ensure clarity and consistency, a designated token was used to separate messages (see Fig. 1). To provide additional context, we appended unstructured text from manuscripts detailing the impact of regular exercise on cardiovascular health^{1,10,46,47,48,49,50,51,53,54}, along with a detailed explanation of the TTM and the characteristics of individuals at each stage of change^10,54.

We used Low-Rank Adaptation⁷⁴ for fine-tuning, a technique that updates pre-trained models. This approach enables efficient fine-tuning by adding trainable low-rank matrices as additional parameters, while keeping the original model weights frozen. We performed a parameter search over learning rates (5e-5, 1e-5), LoRA rank (16, 32) with corresponding LoRA alpha values set to twice the rank (32, 64), and weight decay (0, 0.01), testing each configuration over one epoch. The final configuration consisted of a learning rate of 1e-5, a LoRA rank of 32 with alpha set to 64, LoRA applied across all linear layers without dropout, a cosine learning rate scheduler, a warmup ratio of 0.03, and a batch size of 8 over 5 epochs. Training was performed on 8×A100 80GB GPUs.

Large-scale user study design and participant recruitment

Ethical approval for the study was obtained from Stanford University’s Research Compliance Office (IRB-75836) and was conducted in accordance with the Declaration of Helsinki. Participants provided informed consent to join the My Heart Counts Cardiovascular Health Study, including agreement to be re-contacted for future research. Participation in the follow-up survey was voluntary.

Participant recruitment and eligibility criteria

The My Heart Counts (MHC) smartphone application was first launched in March 2015 via the Apple App Store for United States, United Kingdom, and Hong Kong residents^{26,27,28,29,30,31}. Individuals aged 18 years or older who could read and understand English were eligible to participate in the My Heart Counts Cardiovascular Health Study.

Survey invitations were sent via email to MHC users who consented to being contacted for future research. Survey responses were collected using the Qualtrics survey platform. Respondents who did not complete the full survey were excluded from the analysis, as complete responses were required for group-level analyses. Of the initial 1,004 respondents, 632 met the inclusion criteria, completed the entire survey, and were retained for the final analyses (see Fig. 3).

Survey design

Throughout the survey, participants were shown pairs of messages, one human expert-designed message and an MHC-Coach-generated message, in a blinded fashion and without knowledge that the study involved the TTM. For each message pair, they made a forced A/B choice, selecting the message they felt was more empathetic, realistic, actionable, and encouraging. The human expert-designed messages used for all subsequent comparisons were sourced from the same crowdsourcing dataset used to fine-tune the model, involving 25 human experts in fitness, behavior, and health psychology³³. To match the 20-word length requested in model inputs, expert messages were randomly sampled from the top 50% of messages by word count, improving length comparability while preserving sufficient sample size. All participants were shown the same set of message pairs. This design allowed us to directly evaluate whether the model could reproduce the set of expert-quality messages it was trained on. Examples of both human expert-crafted and MHC-Coach-generated messages that users compared are provided in Supplementary Table 2.

Participants performed two sets of comparisons, one with generally motivating messages and one with messages that were tailored to their stage of change. The first set of questions asked participants to choose their preferred coaching message in an A/B forced-choice set of five questions featuring general motivational messages. These messages were created by a human expert³³ or generated by MHC-Coach with the sole objective of encouraging regular physical activity, independent of the stage of change.

Participants were then stratified into one of five stages of change using an evidence-based TTM stage-identification question, which assessed the duration and consistency of their regular exercise habits⁷⁵. This question, illustrated in Fig. 4, defined regular exercise as planned physical activities (e.g., brisk walking, jogging, swimming) performed 3–5 times per week for 20–60 min to improve fitness. Regular exercise was defined as planned physical activities—such as brisk walking, aerobics, jogging, bicycling, swimming, or rowing—performed 3 to 5 times per week for 20–60 min per session to improve physical fitness⁷⁵. After categorization, participants answered five questions comparing messages tailored to their specific stage of change. In this section, both the MHC-Coach-generated and human expert messages were stage-aligned: expert messages were selected from the dataset based on the stage-specific scenarios used during data collection³³, while MHC-Coach was prompted with the corresponding stage label and descriptive context (see Fig. 1).

Statistical analysis

The Chi-squared test was used to statistically evaluate differences in participants’ preferences for human expert-designed versus MHC-Coach-generated messages, under the null hypothesis that preferences for both types of messages were equal. Analyses were conducted across the five specific stages of change (pre-contemplation, contemplation, preparation, action, and maintenance) and for generic motivational messages. To account for the six comparisons performed (five stage-specific tests and one generic message test), a Bonferroni correction was applied to control the family-wise error rate, adjusting the significance threshold from α = 0.05 to α_adjusted = 0.0083 (0.05/6). Participants were classified as preferring MHC-Coach if they selected a majority of MHC-Coach-generated messages, and as preferring the human expert if they selected a majority of human expert-generated messages in A/B forced-choice questions in analyses of the general motivational and TTM stage-specific question sets. To test the robustness of our findings, we conducted sensitivity analyses stratified by gender (male, female) and age group (above and below 50 years), using chi-squared tests with Cramér’s V⁷⁶ to estimate effect sizes. All P-values were two-sided and Holm-adjusted for multiple comparisons. Finally, a post hoc power analysis was performed using the observed effect size and sample size.

Expert evaluation of coaching messages

We engaged two expert reviewers with active research in applying the TTM to independently evaluate and rate both the fine-tuned MHC-Coach-generated messages and the expert-written messages (selection procedure described in Methods Section 2) in a blinded fashion³³. The evaluators were blinded as to whether a given message was MHC-Coach- or expert-generated. They assessed each message on a 5-point Likert scale across two criteria: (1) the perceived effectiveness of the message in promoting increased physical activity and (2) the degree to which the message aligned with the recipient’s current stage of change.

Linguistic and stylistic features across generation methods

To compare generation strategies suitable for proactive, single-turn coaching, we created a set of twenty messages (four for each of the five TTM stages) using five techniques: base LLaMA 3-70B model, MHC-Coach, few-shot prompting⁷³, retrieval-augmented generation (RAG)⁴⁴, and human experts.

Pretrained baseline

The unmodified LLaMA 3-70B-chat model was prompted with a brief TTM stage description and a request to generate a 20-word motivational message.

Few-shot prompting

We first inserted a brief system instruction: “You are a health coach trained in the Transtheoretical Model of Behavior Change.” This was followed by the stage-specific task prompt used in the baseline condition and then three demonstration messages matched to that stage. These exemplars were drawn from the expert corpus³³ but were withheld from the subset of expert messages used for quantitative comparisons. This single concatenated prompt was provided to the unmodified LLaMA 3-70B model for each stage.

Retrieval-augmented generation

We embedded the full fine-tuning corpus, consisting of 3268 stage-labeled expert messages from the crowdsourcing study³³ and the exercise-science excerpts as outlined earlier in the data portion of the Methods, using the BAAI/bge-small-en-v1.5 encoder and indexed it with Llama-Index (version 0.10.38). At inference time, the index returned the two 1000-token segments with the highest cosine similarity to the stage-specific query. These retrieved segments were prepended to the 20-word prompt used for the baseline condition, and the full context was then passed to the unmodified LLaMA-3-70B model for generation.

Expert messages

Twenty expert-written messages were randomly sampled from the crowdsourced corpus³³ (selection procedure described in Methods Section 2).

Prompts used for each message generation strategy are provided in Supplementary Table 1.

Corpus and preprocessing

We performed a systematic comparison of the linguistic and stylistic properties of the messages produced in each generation technique, as described in the data portion of the Methods. Each of the five generation methods produced 20 messages, yielding a corpus of 100 messages.

All analyses were performed in Python 3.1. Linguistic annotation, including part-of-speech tagging and dependency parsing, was performed with spaCy (version 3.7.2)⁷⁷. Sentiment analysis used VADER (version 0.3.3)⁷⁸, and readability scores were computed using Textstat (version 0.7.3)⁷⁹.

Feature definition and extraction

The selected features reflect complementary dimensions of message quality, including length, lexical richness, emotional tone, behavioral focus, temporal specificity, and readability. Each feature is defined below, along with the method used for its extraction.

Word count

Calculates the length of each message, using spaCy (version 3.7.2, en_core_web_sm model) by counting alphabetic tokens only, excluding numerals and punctuation.

Compound sentiment score

Assesses the overall emotional tone of a message, using the VADER sentiment analysis tool (version 0.3.3). VADER outputs a composite polarity score ranging from –1 (strongly negative) to +1 (strongly positive).

Action-verb frequency

Measures the number of syntactically central actions in a message. Action verbs were dentified using spaCy by selecting tokens with part-of-speech tag VERB, and excluding those with auxiliary or copular dependency roles. This excludes verbs serving grammatical support functions (e.g., is, have), focusing on core predicates expressing meaningful actions.

Temporal reference

Indicates whether the message is temporally anchored. This binary feature was assigned a value of 1 if spaCy recognized a DATE or TIME entity, or if the text included predefined temporal keywords (e.g., today, tomorrow, routine). Messages without such markers were assigned a value of 0.

Type–token ratio

Measures lexical diversity, calculated as the ratio of unique lowercase alphabetic tokens to the total number of such tokens in the message. Tokenization was performed using spaCy.

Exclamation count

Captures emphatic punctuation usage, quantified by the number of exclamation marks in each message.

Flesch Reading-Ease Xscore

Estimates message readability using the Textstat 0.7.3 library, where higher scores correspond to simpler, more accessible text⁸⁰.

For descriptive comparisons across generation methods, each of the seven features was averaged across the 20 messages produced by a given technique.

Statistical comparison across methods

To complement the descriptive comparisons, we conducted a statistical test to assess whether feature distributions differed significantly across generation methods. For each feature, we first tested the normality of per-message values within each method using the Shapiro–Wilk test⁸¹. When distributions met the assumption of normality, we applied one-way ANOVA; otherwise, we used the non-parametric Kruskal–Wallis test. All tests examined the main effect of the generation method on individual feature distributions. All statistical analyses were conducted in Python using the SciPy library (version 1.13)⁸².

Data availability

The fine-tuning corpus used to train MHC-Coach included 3,268 stage-specific motivational messages collected in a separate 2017 study³³. These messages were authored by human experts and crowd contributors and are not publicly available. Researchers interested in accessing this dataset may contact the author of the original study. Additional fine-tuning content included ten research papers^{1,10,46,47,48,49,50,51,53,54} focused on behavior change and exercise. Seven were open-access publications, and the remaining three were accessed through our institution’s academic subscription. The full texts were used solely for academic research under fair use provisions and are not redistributed or publicly released.The health coaching data that was collected and analyzed during the survey study is available on GitHub (https://github.com/SriyaM/MHC-Coach). The dataset has been anonymized to ensure participant privacy, and no personally identifiable information is included. The repository includes code for data processing and customization to enable further analyses. The set of messages and participant responses from the large-scale survey study are available within the GitHub repository in the preference_survey folder. The set of messages generated across all five techniques (MHC-Coach, base LLaMA, few-shot prompting, RAG, and expert) and used for the linguistic and stylistic analysis is available in the linguistic_comparison folder.

Code availability

The code used to generate RAG messages and to conduct the quantitative linguistic analysis comparing all five generation techniques (MHC-Coach, base LLaMA, few-shot prompting, RAG, and human expert) is available on GitHub (https://github.com/SriyaM/MHC-Coach) in the linguistic_comparison folder. In addition, the fine-tuned MHC-Coach model, including weights and an inference example, is openly hosted on the Hugging Face Hub (https://huggingface.co/SriyaM/MHC-Coach). Both the data and code are provided under the MIT License, allowing users to modify, reuse, and redistribute the materials in accordance with the license terms. All software versions, key hyperparameters, and generation settings used in the study are detailed in the Methods.

References

Gibbs, B. Physical activity as a critical component of first-line treatment for elevated blood pressure or cholesterol: who, what, and how?: a scientific statement from the American Heart Association. Hypertension 78, e26–e37 (2021).
Google Scholar
Booth, F. W., Roberts, C. K., Thyfault, J. P., Ruegsegger, G. N. & Toedebusch, R. G. Role of inactivity in chronic diseases: Evolutionary insight and pathophysiological mechanisms. Physiol. Rev. 97, 1351–1402 (2017).
Article PubMed PubMed Central Google Scholar
Nazik, E. A. & Kramarow, C. Physical activity among adults aged 18 and over: United States, 2020. Centers for Disease Control and Prevention (2020).
Hall, K. S. et al. Systematic review of the prospective association of daily step counts with risk of mortality, cardiovascular disease, and dysglycemia. Int. J. Behav. Nutr. Phys. Act. 17, 78 (2020).
Article PubMed PubMed Central Google Scholar
Tudor-Locke, C. et al. How many steps/day are enough? For older adults and special populations. Int. J. Behav. Nutr. Phys. Act. 8, 80 (2011).
Article PubMed PubMed Central Google Scholar
Monteiro-Guerra, F., Rivera-Romero, O., Fernandez-Luque, L. & Caulfield, B. Personalization in real-time physical activity coaching using mobile applications: A scoping review. IEEE J. Biomed. Health Inform. 24, 1738–1751 (2020).
Article PubMed Google Scholar
Bäccman, C., Bergkvist, L. & Wästlund, E. Personalized coaching via texting for behavior change to understand a healthy lifestyle intervention in a naturalistic setting: Mixed methods study. JMIR https://doi.org/10.2196/preprints.47312 (2023).
Article Google Scholar
Brink, S. M., Wortelboer, H. M., Emmelot, C. H., Visscher, T. L. S. & van Wietmarschen, H. A. Developing a personalized integrative obesity-coaching program: A systems health perspective. Int. J. Environ. Res. Public Health 19, 882 (2022).
Article PubMed PubMed Central CAS Google Scholar
Sucala, M., Ezeanochie, N. P., Cole-Lewis, H. & Turgiss, J. An iterative, interdisciplinary, collaborative framework for developing and evaluating digital behavior change interventions. Transl. Behav. Med. 10, 1538–1548 (2020).
PubMed Google Scholar
Prochaska, J. O. & Velicer, W. F. The transtheoretical model of health behavior change. Am. J. Health Promot. 12, 38–48 (1997).
Article PubMed CAS Google Scholar
Hashemzadeh, M., Rahimi, A., Zare-Farashbandi, F., Alavi-Naeini, A. M. & Daei, A. Transtheoretical model of health behavioral change: A systematic review. Iran. J. Nurs. Midwifery Res. 24, 83–90 (2019).
Article PubMed PubMed Central Google Scholar
Liu, K. T., Kueh, Y. C., Arifin, W. N., Kim, Y. & Kuan, G. Application of the transtheoretical model on behavioral changes, and amount of physical activity among university’s students. Front. Psychol. 9, 2402 (2018).
Article PubMed PubMed Central Google Scholar
Pirzadeh, A., Mostafavi, F., Ghofranipour, F. & Feizi, A. Applying transtheoretical model to promote physical activities among women. Iran. J. Psychiatry Behav. Sci. 9, e1580 (2015).
Article PubMed PubMed Central Google Scholar
Jiménez-Zazo, F., Romero-Blanco, C., Castro-Lemus, N., Dorado-Suárez, A. & Aznar, S. Transtheoretical Model for physical activity in older adults: Systematic review. Int. J. Environ. Res. Pub. Health 17, 9262 (2020).
Longhini, J. et al. Wearable devices to improve physical activity and reduce sedentary behaviour: An umbrella review. Sports Med. Open 10, 9 (2024).
Article PubMed PubMed Central Google Scholar
Harries, T. et al. Effectiveness of a smartphone app in increasing physical activity amongst male adults: a randomised controlled trial. BMC Pub. Health 16, 925 (2016).
Article Google Scholar
Laranjo, L. et al. Do smartphone applications and activity trackers increase physical activity in adults? Systematic review, meta-analysis and metaregression. Br. J. Sports Med. 55, 422–432 (2021).
Article PubMed Google Scholar
Mohd Saad, N., Mohamad, M. & Mat Ruzlin, A. N. Web-based intervention to act for weight loss in adults with type 2 diabetes with obesity (Chance2Act): Protocol for a nonrandomized controlled trial. JMIR Res. Protoc. 13, e48313 (2024).
Article PubMed PubMed Central Google Scholar
Takahashi, A. et al. Effectiveness of a lifestyle improvement support app in combination with a wearable device in Japanese people with type 2 diabetes mellitus: STEP-DM study. Diab. Ther. 15, 1187–1199 (2024).
Article CAS Google Scholar
Jörke, M. et al. GPTCoach: Towards LLM-based physical activity coaching. in Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems 1–46 (ACM, New York, NY, USA, 2025). https://doi.org/10.1145/3706598.3713819.
Hegde, N. et al. Infusing behavior science into large language models for activity coaching. PLOS Digit. Health 3, e0000431 (2024).
Article PubMed PubMed Central Google Scholar
Ong, Q. C. et al. Advancing health coaching: A comparative study of large language model and health coaches. Artif. Intell. Med. 157, 103004 (2024).
Article PubMed Google Scholar
Nahum-Shani, I. et al. Just-in-time adaptive interventions (JITAIs) in mobile health: Key components and design principles for ongoing health behavior support. Ann. Behav. Med. 52, 446–462 (2018).
Article PubMed Google Scholar
van Dantzig, S., Bulut, M., Krans, M., van der Lans, A. & de Ruyter, B. Enhancing physical activity through context-aware coaching. in Proceedings of the 12th EAI International Conference on Pervasive Computing Technologies for Healthcare 187–190 (ACM, New York, NY, USA, 2018). https://doi.org/10.1145/3240925.3240928.
Dao, K. P. et al. Smartphone-delivered ecological momentary interventions based on ecological momentary assessments to promote health behaviors: Systematic review and adapted checklist for reporting ecological momentary assessment and intervention studies. JMIR MHealth UHealth 9, e22890 (2021).
Article PubMed PubMed Central Google Scholar
Johnson, A. et al. Mobile health study incorporating novel fitness test. J. Cardiovasc. Transl. Res. 16, 569–580 (2023).
Article PubMed Google Scholar
Shcherbina, A. et al. The effect of digital physical activity interventions on daily step count: a randomised controlled crossover substudy of the MyHeart Counts Cardiovascular Health Study. Lancet Digit. Health 1, e344–e352 (2019).
Article PubMed Google Scholar
Kim, D. S. et al. Unlocking insights: Clinical associations from the largest 6-minute walk test collection via the my Heart Counts Cardiovascular Health Study, a fully digital smartphone platform. Prog. Cardiovasc. Dis. https://doi.org/10.1016/j.pcad.2025.01.010 (2024).
Article Google Scholar
Hershman, S. G. et al. Physical activity, sleep and cardiovascular health data for 50,000 individuals from the MyHeart Counts Study. Sci. Data 6, 24 (2019).
Article PubMed PubMed Central Google Scholar
McConnell, M. V. et al. Feasibility of obtaining measures of lifestyle from a smartphone app: The MyHeart Counts cardiovascular health study. JAMA Cardiol. 2, 67–76 (2017).
Article PubMed Google Scholar
Javed, A. et al. Personalized digital behaviour interventions increase short-term physical activity: a randomized control crossover trial substudy of the MyHeart Counts Cardiovascular Health Study. Eur. Heart J. Digit. Health 4, 411–419 (2023).
Article PubMed PubMed Central Google Scholar
Grattafiori, A. et al. The Llama 3 herd of models. arXiv [cs.AI] (2024).
De Vries, R. A. J., Zaga, C., Bayer, F., Drossaert, C. H. C., Truong, K. P. & Evers, V. Experts get me started, peers keep me going: Comparing crowd- versus expert-designed motivational text messages for exercise behavior change. Proc. 11th EAI Int. Conf. Pervasive Computing Technologies for Healthcare, 155–162. https://doi.org/10.1145/3154862.3154875 (2017).
Meywirth, S., Janson, A. & Söllner, M. Personalized coaching for lifestyle behavior change through large language models: A qualitative study. Proc. Hawaii Int. Conf. System Sciences (HICSS), 3379–3388 (2025).
Lim, S., Schmälzle, R. & Bente, G. Artificial social influence: Rapport-building, LLM-based embodied conversational agents for health coaching. Proc. CONNECT Workshop, 24th ACM Int. Conf. on Intelligent Virtual Agents (IVA 2024).
Huang, Z. et al. Comparing large language model AI and human-generated coaching messages for behavioral weight loss. J. Technol. Behav. Sci. https://doi.org/10.1007/s41347-025-00491-5 (2025).
Article Google Scholar
Løvås, S.E. AI chatbots in health: Implementing an LLM-based solution to promote physical activity. Master’s thesis, UiT The Arctic (University of Norway 2024).
Song, H. et al. Investigating the relationship between physical activity and tailored behavior change messaging: Connecting contextual bandit with large language models. arXiv preprint arXiv:2506.07275 [cs.LG].
Mitchell, E. G. et al. T2 coach: A qualitative study of an automated health coach for diabetes self-management. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems 1–17 (ACM, New York, NY, USA, 2025). https://doi.org/10.1145/3706598.3714404.
Haag, D. et al. The last JITAI? Exploring large language models for issuing just-in-time adaptive interventions: Fostering physical activity in a prospective cardiac rehabilitation setting. in Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems 1–18 (ACM, New York, NY, USA, 2025). https://doi.org/10.1145/3706598.3713307.
Akrimi, S., Schwensfeier, L., Düking, P., Kreutz, T. & Brinkmann, C. ChatGPT-4o-generated exercise plans for patients with type 2 diabetes mellitus-assessment of their safety and other quality criteria by coaching experts. Sports 13, 92 (2025).
Article PubMed PubMed Central Google Scholar
Moore, R., Al-Tamimi, A.-K. & Freeman, E. Investigating the potential of a conversational agent (Phyllis) to support adolescent health and overcome barriers to physical activity: Co-design study. JMIR Form. Res. 8, e51571 (2024).
Article PubMed PubMed Central Google Scholar
Fang, C. M. et al. PhysioLLM: Supporting personalized health insights with wearables and large language models. arXiv [cs.HC] (2024).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv [cs.CL] (2020).
Brown, T. B. et al. Language Models are Few-Shot Learners. arXiv [cs.CL] (2020).
Gonzalez-Jaramillo, N. et al. Systematic review of physical activity trajectories and mortality in patients with coronary artery disease. J. Am. Coll. Cardiol. 79, 1690–1700 (2022).
Article PubMed Google Scholar
Ding, D. et al. Towards better evidence-informed global action: lessons learnt from the Lancet series and recent developments in physical activity and public health. Br. J. Sports Med. 54, 462–468 (2020).
Article PubMed Google Scholar
Kraus, W. E. et al. Daily step counts for measuring physical activity exposure and its relation to health. Med. Sci. Sports Exerc. 51, 1206–1212 (2019).
Article PubMed PubMed Central Google Scholar
Hawes, A. M. et al. Disentangling race, poverty, and place in disparities in physical activity. Int. J. Environ. Res. Public Health 16, 1193 (2019).
Article PubMed PubMed Central Google Scholar
Rippe, J. M. Physical activity and health: Key findings from the physical activity guidelines for Americans 2018. in Lifestyle Medicine, Fourth Edition 200–208 (CRC Press, Boca Raton, 2024). https://doi.org/10.1201/9781003227793-20.
Althoff, T. et al. Large-scale physical activity data reveal worldwide activity inequality. Nature 547, 336–339 (2017).
Article PubMed PubMed Central CAS Google Scholar
Carlson, S. A., Fulton, J. E., Pratt, M., Yang, Z. & Adams, E. K. Inadequate physical activity and health care expenditures in the United States. Prog. Cardiovasc. Dis. 57, 315–323 (2015).
Article PubMed Google Scholar
Myers, J. et al. Physical activity and cardiorespiratory fitness as major markers of cardiovascular risk: their independent and interwoven importance to health status. Prog. Cardiovasc. Dis. 57, 306–314 (2015).
Article PubMed Google Scholar
Velicer, W. F., Prochaska, J. O., Fava, J. L., Norman, G. J. & Redding, C. A. Detailed overview of the transtheoretical model. Homeostasis 38, 216–233 (1998).
Google Scholar
Hawkins, R. P., Kreuter, M., Resnicow, K., Fishbein, M. & Dijkstra, A. Understanding tailoring in communicating about health. Health Educ. Res. 23, 454–466 (2008).
Article PubMed Google Scholar
Ferrer, R. A. & Mendes, W. B. Emotion, health decision making, and health behaviour. Psychol. Health 33, 1–16 (2018).
Article PubMed Google Scholar
Kreuter, M. W. & Farrell, D. W. Tailoring Health Messages. (Routledge, London, England, 2017).
Mohamadlo, A. et al. Analysis and review of the literatures in the field of health literacy. JMIS 6, 58–72 (2020).
Article Google Scholar
Kim, D. S., Eltahir, A. A., Ngo, S. & Rodriguez, F. Bridging the gap: How accounting for social determinants of health can improve digital health equity in cardiovascular medicine. Curr. Atheroscler. Rep. 27, 9 (2024).
Article PubMed Google Scholar
Choi, Y. J., Lee, M. & Lee, S. Toward a multilingual conversational agent: Challenges and expectations of code-mixing multilingual users. in Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems vol. 85 1–17 (ACM, New York, NY, USA, 2023).
Chatterjee, A., Pahari, N., Prinz, A. & Riegler, M. AI and semantic ontology for personalized activity eCoaching in healthy lifestyle recommendations: a meta-heuristic approach. BMC Med. Inform. Decis. Mak. 23, 278 (2023).
Article PubMed PubMed Central Google Scholar
Mandyam, A., Jörke, M., Denton, W., Engelhardt, B. E. & Brunskill, E. Adaptive interventions with user-defined goals for health behavior change. arXiv [cs.LG] (2023).
Park, J. S. et al. Generative agents: Interactive simulacra of human behavior. arXiv [cs.HC] (2023).
Wang, W. et al. Augmenting language models with long-term Memory. arXiv [cs.CL] (2023).
Ouyang, L. et al. Training language models to follow instructions with human feedback. arXiv [cs.CL] (2022).
Gaber, F. et al. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. NPJ Digit. Med. 8, 263 (2025).
Article PubMed PubMed Central Google Scholar
Ting, D. S. W. et al. Development and testing of a novel Large Language Model-based Clinical Decision Support Systems for medication safety in 12 clinical specialties. Res. Sq. https://doi.org/10.21203/rs.3.rs-4023142/v1 (2024).
Article PubMed PubMed Central Google Scholar
Huang, L. et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. https://doi.org/10.1145/3703155 (2024).
Article Google Scholar
Ahmad, M., Yaramic, I. & Roy, T.D. Creating trustworthy LLMs: Dealing with hallucinations in healthcare AI. https://doi.org/10.20944/preprints202310.1662.v1 (2023).
Wiest, I. C. et al. Privacy-preserving large language models for structured medical information retrieval. NPJ Digit. Med. 7, 257 (2024).
Article PubMed PubMed Central Google Scholar
de Vries, R. A. J., Truong, K. P., Kwint, S., Drossaert, C. H. C. & Evers, V. Crowd-designed motivation. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (ACM, New York, NY, USA, 2016). https://doi.org/10.1145/2858036.2858229.
Sun, J. et al. Dial-insight: Fine-tuning large language models with high-quality domain-specific data preventing capability collapse. arXiv [cs.CL] (2024).
Wei, J. et al. Finetuned language models are zero-shot learners. arXiv [cs.CL] (2021).
Hu, E. J. et al. LoRA: Low-Rank Adaptation of large language models. arXiv [cs.CL] (2021).
Marcus, B. H., Selby, V. C., Niaura, R. S. & Rossi, J. S. Self-efficacy and the stages of exercise behavior change. Res. Q. Exerc. Sport 63, 60–66 (1992).
Article PubMed CAS Google Scholar
Cramer, H. Mathematical Methods of Statistics (PMS-9), Volume 9. (Princeton University Press, Princeton, NJ, 1999).
spaCy · Industrial-strength Natural Language Processing in Python. https://spacy.io.
Hutto, C. J. & Gilbert, E. VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Proc. Int. AAAI Conf. Web Soc. Media 8, 216–225 (2014).
Article Google Scholar
Textstat: :memo: Python Package to Calculate Readability Statistics of a Text Object - Paragraphs, Sentences, Articles. (Github).
Flesch, R. A new readability yardstick. J. Appl. Psychol. 32, 221–233 (1948).
Article PubMed CAS Google Scholar
Shapiro, S. S. & Wilk, M. B. An analysis of variance test for normality (complete samples). Biometrika 52, 591–611 (1965).
Article Google Scholar
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Article PubMed PubMed Central CAS Google Scholar

Download references

Acknowledgements

D.S.K. is supported by the Wu-Tsai Human Performance Alliance as a Clinician-Scientist Fellow, the Stanford Center for Digital Health as a Digital Health Scholar, the Pilot Grant from the Stanford Center for Digital Health, the Robert A. Winn Excellence in Clinical Trials Career Development Award, NIH 1L30HL170306, the American Heart Association Career Development Award (25CDA1436622), and the American Diabetes Association (ADA) Pathway to Stop Diabetes Initiator Award (7-25-INI-11). F.R. was funded by grants from the NIH National Heart, Lung, and Blood Institute (R01HL168188; R01HL167974, R01HL169345), the American Heart Association/Harold Amos Medical Faculty Development Program, and the Doris Duke Foundation (Grant #2022051). E.L. is supported by NIH grants K24AR075060, K12TR004930, and R01AR082109. A.C.K. is supported by NIH grants 5R01AG07149002 and 1R44AG071211, and a seed grant from the Stanford Institute for Human-Centered Artificial Intelligence. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

These authors contributed equally: Daniel Seung Kim, Euan A. Ashley.

Authors and Affiliations

Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
Sriya Mantena, Anders Johnson, Narayan Schütz, Alexander Tolas, Ritu Doijad, C. Mikael Mattson, Paul Schmiedmayer, Fatima Rodriguez, Daniel Seung Kim & Euan A. Ashley
Department of Computer Science, Stanford University, Stanford, CA, USA
Sriya Mantena
Department of Epidemiology & Population Health, Stanford University School of Medicine, Stanford, CA, USA
Marily Oppezzo & Abby C. King
Stanford Prevention Research Center, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
Marily Oppezzo & Abby C. King
Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA, USA
Narayan Schütz
Clinical Excellence Research Center (CERC), Stanford University, Stanford, CA, USA
Narayan Schütz
National Heart and Lung Institute, Imperial College London, London, UK
Allan Lawrie
Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA
Mariana Ramirez-Posada & Eleni Linos
Center for Digital Health, Stanford University School of Medicine, Stanford, CA, USA
Mariana Ramirez-Posada, Eleni Linos, Fatima Rodriguez, Daniel Seung Kim & Euan A. Ashley
Wu Tsai Human Performance Alliance, Stanford University School of Medicine, Stanford, CA, USA
Fatima Rodriguez, Daniel Seung Kim & Euan A. Ashley
Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
Fatima Rodriguez, Daniel Seung Kim & Euan A. Ashley
Division of Cardiology, Department of Medicine, University of Washington, Seattle, WA, USA
Daniel Seung Kim
Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
Euan A. Ashley
Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
Euan A. Ashley

Authors

Sriya Mantena
View author publications
Search author on:PubMed Google Scholar
Anders Johnson
View author publications
Search author on:PubMed Google Scholar
Marily Oppezzo
View author publications
Search author on:PubMed Google Scholar
Narayan Schütz
View author publications
Search author on:PubMed Google Scholar
Alexander Tolas
View author publications
Search author on:PubMed Google Scholar
Ritu Doijad
View author publications
Search author on:PubMed Google Scholar
C. Mikael Mattson
View author publications
Search author on:PubMed Google Scholar
Allan Lawrie
View author publications
Search author on:PubMed Google Scholar
Mariana Ramirez-Posada
View author publications
Search author on:PubMed Google Scholar
Paul Schmiedmayer
View author publications
Search author on:PubMed Google Scholar
Eleni Linos
View author publications
Search author on:PubMed Google Scholar
Abby C. King
View author publications
Search author on:PubMed Google Scholar
Fatima Rodriguez
View author publications
Search author on:PubMed Google Scholar
Daniel Seung Kim
View author publications
Search author on:PubMed Google Scholar
Euan A. Ashley
View author publications
Search author on:PubMed Google Scholar

Contributions

D.S.K. and E.A.A. led the overall study design. S.M. fine-tuned the model, generated messages using five different techniques, and conducted the quantitative comparisons and data analysis from the large-scale survey study. A.J. managed participant recruitment and developed the survey for the large-scale user study. M.O. contributed to the behavioral science framing, including the design of stage classification questions and survey development. M.R.-P. conducted initial message review. P.S., E.L., A.C.K., F.R., A.T., and R.D. provided ongoing consultation on behavioral science, mobile health app requirements, and inclusive digital health research design, which informed the overall approach and interpretation. All authors reviewed and approved the final manuscript.

Corresponding authors

Correspondence to Daniel Seung Kim or Euan A. Ashley.

Ethics declarations

Competing interests

D.S.K. reports grant support from Amgen and the Bristol Myers Squibb Foundation (via the Robert A. Winn Excellence in Clinical Trials Career Development Award). F.R. reports equity from Carta Healthcare and HealthPals, and consulting fees from HealthPals, Novartis, Novo Nordisk, Esperion Therapeutics, Movano Health, Kento Health, Inclusive Health, Edwards, Arrowhead Pharmaceuticals, HeartFlow, and iRhythm outside the submitted work. E.A.A. reports advisory board fees from SequenceBio, Foresite Labs, Pacific Biosciences, and Versant Ventures. E.A.A. has ownership interest in Personalis, Deepcell, Svexa, Candela, Parameter Health, Saturnus Bio, outside the submitted work. E.A.A. is a non-executive director of AstraZeneca and Svexa. E.A.A. receives collaborative research support from Illumina, Pacific Biosciences, Oxford Nanopore, Cache, Cellsonics, outside the submitted work. The remainder of the authors report no potential conflicts of interest. S.M., A.J., M.O., N.S., A.T., R.D., C.M.M., A.L., M.R.-P., P.S., E.L., and A.C.K. declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations

Supplementary information

Supplementary LLM and Stages of Change - npj Revision

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Mantena, S., Johnson, A., Oppezzo, M. et al. Fine-tuning LLMs in behavioral psychology for scalable health coaching. npj Cardiovasc Health 2, 48 (2025). https://doi.org/10.1038/s44325-025-00083-5

Download citation

Received: 21 February 2025
Accepted: 05 August 2025
Published: 22 September 2025
DOI: https://doi.org/10.1038/s44325-025-00083-5

Subjects

Abstract

Similar content being viewed by others

Differences in molecular phenotype in mouse and human hypertrophic cardiomyopathy

Shared genetic pathways contribute to risk of hypertrophic and dilated cardiomyopathies with opposite directions of effect

Machine learning-based prediction model for myocardial ischemia under high altitude exposure: a cohort study

Introduction

Results

Qualitative evaluation of fine-tuned MHC-Coach and Base LLaMA Model

Comparative evaluation of MHC-Coach and human expert messages

Linguistic and stylistic feature analysis across generation methods

Discussion

Methods

MHC-Coach message generation

Data

Language model

Fine-tuning

Large-scale user study design and participant recruitment

Participant recruitment and eligibility criteria

Survey design

Statistical analysis

Expert evaluation of coaching messages

Linguistic and stylistic features across generation methods

Pretrained baseline

Few-shot prompting

Retrieval-augmented generation

Expert messages

Corpus and preprocessing

Feature definition and extraction

Word count

Compound sentiment score

Action-verb frequency

Temporal reference

Type–token ratio

Exclamation count

Flesch Reading-Ease Xscore

Statistical comparison across methods

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Additional information

Supplementary information

Supplementary LLM and Stages of Change - npj Revision

Rights and permissions

About this article

Cite this article

Share this article

Search

Quick links