Introduction

Rates of depression increase significantly during adolescence, marking this as a critical developmental period for intervention [1]. Yet, even first-line treatments, such as cognitive behavioral therapy, are not consistently effective for youth [2]. Anhedonia, or loss of interest and pleasure, is a core feature of depression that predicts poorer course of illness and worse treatment outcomes [3]. Treatments that target anhedonia may lead to reductions in other depressive symptoms [4].

Behavioral Activation (BA) therapy aims to reduce anhedonia by targeting patterns of avoidance and withdrawal and increasing engagement with rewarding activities [5]. Given its relative simplicity and clear focus on behavioral reinforcement of rewards, BA may be especially well-suited to efficiently target adolescent anhedonia, and promising evidence exists in youth samples [6, 7]. A core assumption of BA is that increased “activation” (in this article, we use “Behavioral Activation” or “BA” to refer to the manualized treatment approach described by [5] and [6]; by contrast, we use “activation” to refer to the construct of behavioral activation, i.e., the putative mechanism through which patients become activated in this treatment; see [8] for a review of how this construct is often measured) in daily life leads to symptom change [9], but few studies have rigorously measured whether this is the case [8]. Most of the studies that do assess activation rely on self-report questionnaires, which are limited by recall bias, social desirability, and participant burden [8]. There is a pressing need for objective, low-burden, and ecologically valid methods to track activation in daily life. Tools that assess activation between sessions could help therapists monitor treatment progress and make timely adjustments, consistent with calls for data-informed psychotherapy [10].

Passive sensing through smartphones offers a low-burden, continuous, and objective way to assess real-world behavior [11]. Mobility-related passive sensing features are particularly relevant as behavioral proxies for activation, such as activity levels (accelerometer) or percent of time spent outside the home (GPS). Growing evidence suggests smartphone sensors can be used to predict daily and even hourly fluctuations in depressed mood [12], including in youth [13,14,15]. One BA study [16] demonstrated congruence between questionnaire-assessed activation and several smartphone metrics, including step count, time spent at home, time in conversation, and screen unlocks, albeit in a very small sample (three case studies) that requires replication. Despite substantial advantages, passive sensing measures alone lack qualitative context and therefore provide limited insight into the emotional or motivational aspects of behavior.

Unobtrusive language samples offer another key source of information, providing both qualitative context and quantitative features (e.g., sentiment) that can be used in conjunction with passive sensing to estimate or forecast emotional states and behavior. For example, one study [17] obtained promising results by applying Linguistic Inquiry and Word Count (LIWC) to measure activation in online therapy chat logs. However, LIWC required a lexicon of activation-relevant words derived from that specific dataset, limiting generalizability. In contrast, large language models (LLMs) have been shown to accurately identify psychological constructs in text while requiring no sample-specific training data [18]. Recent work has demonstrated the potential of LLMs to estimate emotion from daily free-response text [19]. Additionally, frequent self-reports of daily activities provide ideal material for LLMs to rate activation, with less risk of typical self-report biases.

This proof-of-concept study aimed to evaluate the validity and utility of two novel, technology-based measures of activation: LLM-derived ratings of daily text entries and smartphone-based passive sensing of mobility patterns. These measures were examined in relation to daily positive and negative emotion, as well as weekly anhedonia, depressive symptoms, and a traditional self-report measure of activation. We hypothesized that the LLM-derived and passive sensing indicators of activation would be positively associated with each other and with self-reported activation, supporting their convergent validity. Furthermore, we hypothesized that these measures would also demonstrate criterion validity, such that higher activation as captured by these methods would be associated with increased daily positive emotion, decreased daily negative emotion, and greater weekly improvement in anhedonia and depression symptoms.

Materials and methods

Participants and procedure

Participants were 38 adolescents recruited from the greater Boston area for a BA treatment trial for adolescent anhedonia from January 2016 to November 2021. See supplement (Table S1) for demographic and clinical characteristics. For more information on study design and inclusion/exclusion criteria, see [7]. Approval was obtained from the Mass General Brigham IRB, and all participants gave informed consent and/or assent.

Participants received 12 weekly, 60-minute, individual therapy sessions following the BA manual of [6]. Before each session, participants rated their anhedonia using the 14-item Snaith-Hamilton Pleasure Scale (SHAPS; [20, 21]), depression using the 20-item Center for Epidemiological Studies Depression Scale (CES-D; [22]), and activation using the 9-item Behavioral Activation for Depression Scale - Short Form (BADS-SF; [23]). (See Supplement Section S1 for more details about the weekly self-report measures.) During treatment, participants completed 5-day bursts of ecological momentary assessment (EMA; 2-3 surveys per day) every other week using the MetricWire app. At each prompt, participants rated their positive affect (PA) and negative affect (NA) by responding to 6 items, each anchored by a 5-point Likert scale. Next, they were asked to provide free-text responses about what they were doing right before the prompt, with whom they were interacting, and the most enjoyable and most stressful events since the previous prompt. (See Supplement Section S1, https://clinicaltrials.gov/study/NCT02498925, and [4] for further details about the EMA items and sampling protocol.) Passive sensor data were collected continuously using the Beiwe smartphone platform [24] for a subset of participants (n = 13), as this data collection method was added to the protocol after initial participant enrollment had begun.

Feature derivation

EMA-derived PA was calculated as the mean of the “happy,” “interested,” and “excited” items; NA was the mean of the “sad,” “nervous,” and “angry” items. Typed free-text responses provided by participants as part of their EMA surveys (see Section S2 for the specific open questions) were analyzed using OpenAI’s GPT-4o model, accessed via the Python openai package. Using the API ensured application of the same standardized prompt, fixed model parameters (e.g., temperature), and consistent, reproducible outputs across the dataset. The standardized prompt included the clinical definition of behavioral activation (Dimidjian et al., 2011), rating criteria (1 = least active/passive/avoidant to 5 = very high activity with strong reinforcement of pleasure, mastery, or problem-solving), and guidelines for evaluating current activities, social interactions, enjoyable events, and responses to stressful situations (see Supplement Section S2 for the full prompt). The model was instructed to output a single numeric rating (1-5) along with a rationale, based on explicit rating criteria (e.g., passivity vs. activity, social engagement, mastery/enjoyment, and problem-solving vs. avoidance). See Table S8 for examples of participant text with corresponding GPT ratings and explanations. Participant responses were manually deidentified prior to analysis by removing or replacing personal identifiers (e.g., names, places). To evaluate validity, a trained human rater applied the same prompt and criteria to a random 25% subsample (n = 478) of EMA responses. Human and GPT-4o ratings demonstrated substantial agreement (weighted Cohen’s κ = 0.77, 95% CI [0.74, 0.81]), supporting the validity of the automated ratings.
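To illustrate the rating pipeline described above, the following minimal Python sketch shows how a de-identified EMA response could be scored with the openai package and how agreement with the human rater could be checked. The abbreviated system prompt, the temperature value, and the quadratic kappa weighting are illustrative assumptions; the full standardized prompt appears in Supplement Section S2.

```python
# Minimal sketch of the GPT-4o activation-rating step. The system prompt is an
# abbreviated placeholder; temperature and kappa weighting are assumptions.
import re
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_PROMPT = (
    "Rate the behavioral activation reflected in this adolescent's EMA text "
    "on a 1-5 scale (1 = least active/passive/avoidant; 5 = very high activity "
    "with strong reinforcement of pleasure, mastery, or problem-solving). "
    "Reply with the numeric rating followed by a brief rationale."
)  # placeholder; see Supplement Section S2 for the full standardized prompt

def rate_activation(ema_text: str) -> int:
    """Return a 1-5 GPT-rated activation score for one de-identified EMA entry."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # fixed for reproducibility (assumed value)
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ema_text},
        ],
    )
    # Take the first digit 1-5 in the reply as the numeric rating.
    match = re.search(r"[1-5]", resp.choices[0].message.content)
    if match is None:
        raise ValueError("No 1-5 rating found in the model output")
    return int(match.group())

def agreement(human_ratings, gpt_ratings) -> float:
    """Weighted Cohen's kappa between human and GPT ratings (weighting assumed)."""
    return cohen_kappa_score(human_ratings, gpt_ratings, weights="quadratic")
```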

Passive sensor features were extracted from participants’ smartphone data to characterize behavioral patterns linked to mood and functioning using the publicly available DPlocate and DPphone software packages [25]. Derived measures included: an hourly activity score (ActScore), derived from aggregated accelerometer data to reflect physical activity intensity throughout the day; daily percent home (percentHome), indicating the proportion of time spent at the inferred home location based on nighttime GPS clustering; daily distance from home (homDist), computed as the mean Euclidean distance between GPS samples and the home location; daily mobility area (radiusMobility), representing the spatial dispersion of movement via radius of gyration; and places visited daily (DayPlaces), estimated using spatiotemporal clustering to identify distinct locations visited each day. These features were selected based on prior evidence linking mobility, location stability, and physical activity to affective states. (See Supplement Section S3 for detailed definitions of each passive sensor variable.)
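As a concrete illustration of two of these mobility summaries, the sketch below computes percent of time at home and the radius of gyration from a day's GPS samples. The actual features were derived with the DPlocate/DPphone pipeline; the equirectangular projection and the 100-meter home radius used here are simplifying assumptions for illustration only.

```python
# Illustrative daily mobility summaries from raw GPS samples (the study used
# DPlocate/DPphone; the projection and home-radius threshold are assumptions).
import numpy as np

EARTH_RADIUS_M = 6_371_000

def to_local_meters(lat, lon, lat0, lon0):
    """Crude equirectangular projection of lat/lon degrees to x/y meters around (lat0, lon0)."""
    x = np.radians(np.asarray(lon) - lon0) * np.cos(np.radians(lat0)) * EARTH_RADIUS_M
    y = np.radians(np.asarray(lat) - lat0) * EARTH_RADIUS_M
    return np.column_stack([x, y])

def percent_home(lat, lon, home_lat, home_lon, radius_m=100.0):
    """Percent of the day's GPS samples falling within radius_m of the inferred home."""
    xy = to_local_meters(lat, lon, home_lat, home_lon)
    return 100.0 * np.mean(np.linalg.norm(xy, axis=1) <= radius_m)

def radius_of_gyration(lat, lon):
    """Spatial dispersion (meters) of the day's GPS samples around their centroid."""
    xy = to_local_meters(lat, lon, np.mean(lat), np.mean(lon))
    centroid = xy.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum((xy - centroid) ** 2, axis=1))))
```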

Analytic strategy

First, we computed daily means of the hourly passive sensing variables, EMA affect variables, and GPT ratings. These daily variables were used to examine associations with NA and PA at the daily level. For week-level analyses, we predicted BADS, SHAPS, and CES-D scores using weekly means of the GPT ratings and passive sensing variables. For all analyses, we used multilevel modeling (MLM) to estimate both within-person (person-mean centered) and between-person (participant mean) effects. Models included random intercepts for participants and were conducted using the lme4 package in R. To assess convergent validity, we examined associations between GPT ratings, passive sensing variables, and BADS scores. (We also conducted repeated measures correlation (rmcorr) analyses. Results from rmcorr were nearly identical to those obtained from the multilevel models; therefore, we report them in the supplement only (see Supplementary Figures S1 to S4).) To assess criterion validity, we tested whether higher GPT- and passive sensor-derived activation was associated with more positive and less negative daily affect, controlling for prior-day affect, and with lower weekly symptoms of anhedonia and depression, controlling for prior-week scores. Models were fit using restricted maximum likelihood (REML), which provides unbiased variance component estimates in multilevel models, especially with small or unbalanced samples. Missing data were handled through REML estimation, which uses all available observations under the missing at random assumption, allowing participants to contribute data even when some time points were missing [26, 27]. Comprehensive evaluations of missing data mechanisms are reported in Supplement Section S4. No a priori sample size calculation was conducted, as this was a secondary proof-of-concept analysis using data from an existing treatment trial. The analytic sample consisted of all participants who completed daily EMA (n = 38) and the subset who contributed passive smartphone sensing data (n = 13). All analysis code is publicly available at https://osf.io/jhczf/files/osfstorage.
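For readers who wish to reproduce the general modeling approach, the sketch below shows a Python analogue of these random-intercept models using statsmodels; the published analyses were fit with lme4 in R (code at the OSF link above), and the data frame and column names here are hypothetical.

```python
# Python analogue of the within/between multilevel models (the actual analyses
# used lme4 in R); the data frame and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

def fit_within_between(df: pd.DataFrame, outcome: str, predictor: str):
    """Random-intercept model with person-mean-centered (within-person) and
    person-mean (between-person) versions of the predictor, estimated by REML."""
    d = df.dropna(subset=[outcome, predictor]).copy()
    d["between"] = d.groupby("participant_id")[predictor].transform("mean")
    d["within"] = d[predictor] - d["between"]
    model = smf.mixedlm(f"{outcome} ~ within + between", data=d,
                        groups=d["participant_id"])
    return model.fit(reml=True)

# Example: weekly BADS scores predicted by weekly mean GPT-rated activation.
# result = fit_within_between(weekly_df, outcome="bads", predictor="gpt_activation")
# print(result.summary())
```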

Results

The 38 participants completed an average of 50.24 EMA surveys (SD = 27.52) across the 12-week treatment period. EMA compliance decreased over the course of treatment, from 62% in the first two weeks to 52% in the last two weeks (Table S2). For the 13 participants with passive sensor data, a total of 46,056 hourly observations were collected (1919 days of data total; M = 147.62 days per participant; SD = 99.82 days). Descriptive statistics for the daily smartphone passive sensing features and GPT-derived ratings are presented in the supplement (Table S3).

Convergent validity: associations between activation indicators

Passive sensors and GPT-derived ratings

On days with higher-than-usual smartphone-derived activity scores (Est. = 0.178, p = 0.024) and more places visited daily (Est. = 0.045, p = 0.021), participants tended to have higher GPT-rated activation, while spending more time at home predicted lower GPT-rated activation that day (Est. = −0.007, p = 0.012). No significant between-person associations were found (see Table 1 and Figure S1).

Table 1 Multilevel Model Results of Passive Sensing Variables Predicting Daily GPT-Derived Activation.

Predicting self-report activation

In weeks when participants’ GPT-rated activation was higher, their BADS scores were also higher (Est. = 2.05, p = 0.005; see Table 2). Figure S2 shows individual trajectories of GPT-rated activation and BADS scores over time. For passive sensing, at the within-person level, weeks with a greater number of places visited and less time spent at home were associated with higher BADS scores (all ps < 0.01; see Table 3 and Figure S3). At the between-person level, participants who generally spent more time at home had lower BADS scores. See Tables S5 and S6 for models using the BADS activation and avoidance subscales.

Table 2 Multilevel Model Results of GPT-Derived Activation Ratings Predicting Weekly Self-Reported Activation and Symptom Measures.
Table 3 Multilevel Model Results of Passive Sensing Variables Predicting Weekly Self-Reported Activation.

Criterion validity: associations with daily affect and weekly symptoms

Predicting daily affect

Higher GPT-rated activation was associated with lower same-day negative affect (Est. = −0.09, p < 0.001) and higher positive affect (Est. = 0.16, p < 0.001; see Table 4). Among passive sensing features, only daily mobility radius significantly predicted daily positive affect at the within-person level (Est. = 0.002, p = 0.024; see Table 4 and Figure S4). (Including lagged (t-1) affect scores in the models substantially reduced the number of observations, so we report results from models without lagged predictors; however, results were consistent when controlling for lagged affect (see Supplementary Table S4).)

Table 4 Multilevel Model Results of Passive Sensing Variables and GPT-Derived Activation Predicting Daily Negative and Positive Affect.

Predicting weekly symptoms

At the weekly level, GPT-derived activation ratings were not significantly associated with changes in SHAPS or CES-D scores (see Table 2). For passive sensing variables, in weeks with a greater number of places visited, participants had lower SHAPS (Est. = −0.82, p = 0.022) and CES-D scores (Est. = −2.51, p < 0.001). Conversely, weeks with more time spent at home were associated with higher SHAPS (Est. = 0.15, p = 0.001) and CES-D scores (Est. = 4.47, p < 0.001). No between-person effects reached statistical significance (see Table 5 and Figure S5). A visual summary of the passive sensing results is presented in Table S7.

Table 5 Multilevel Model Results of Passive Sensing Variables Predicting Weekly Symptoms.

Discussion

This initial proof-of-concept study demonstrates the potential of scalable, smartphone-derived technologies to track therapeutic processes in adolescents’ daily lives. Specifically, we validated two novel technology-based measures of activation, mobility indicators from passive smartphone sensing and LLM-derived ratings of daily text entries, against a traditional questionnaire measure of activation. To our knowledge, this is the first study to integrate both LLM-based language analysis and passive sensing to monitor therapeutic mechanisms in adolescents’ daily lives.

LLM-derived activation ratings and select passive sensing indicators were positively associated with each other and with an established measure of behavioral activation (BADS), supporting their convergent validity. Notably, only some passive sensing features showed significant associations, suggesting that specific aspects of mobility may be more reliable markers of activation than others. Specifically, when adolescents visited more locations and spent less time at home, they also showed greater activation on both the self-report measure and the LLM-derived ratings of their EMA text entries. In contrast, other mobility indicators, such as distance from home, were not associated with self-reported or LLM-derived activation, suggesting that activation is better reflected by time spent away from home and the variety of locations visited than by the distance traveled. These findings are consistent with prior studies linking fewer unique locations visited and more time spent at home to increased depression risk [15, 28]. However, previous research has typically assumed that mobility patterns are related to depression because they reflect activation, without directly testing whether this is the case. Our results instead demonstrate a direct association of mobility with a therapeutic target (activation) rather than with depression symptoms alone.

The association between LLM-rated activation and BADS scores contributes to recent evidence suggesting that LLMs, such as GPT, can extract psychologically meaningful information from unstructured text [18]. Relevant to the treatment of adolescent depression, we build on emerging work showing that LLMs can be used to analyze language from psychotherapy sessions to inform clinical decision-making [29]. Importantly, our study shows that LLM-based assessments can also provide clinically relevant insights based on language generated outside the therapy room, offering scalable and unobtrusive ways to monitor therapeutic processes in patients’ daily lives.

These two technology-based approaches demonstrated distinct patterns in their associations with emotional (daily) and clinical (weekly) outcomes. LLM-rated activation was associated with same-day increases in positive affect and decreases in negative affect. In contrast, passive sensing features were more strongly related to weekly changes in symptoms. These distinct timescales suggest that linguistic measures may be better suited to capturing short-term emotional changes, whereas passive sensing may tap into behavioral processes that unfold over longer periods. This finding is clinically significant: it implies that daily text assessments could help clinicians monitor affective responses to activation efforts in real time, while mobility patterns could indicate whether treatment is gradually translating into symptom improvement.

Interestingly, associations of both the LLM-derived and passive sensing indicators of activation with the BADS and symptom measures were significant at the within-person level but not the between-person level. That is, when individuals showed more activation relative to their own mean via passive sensing or text, they also reported higher activation and lower symptoms; however, individuals who, on average, moved more or reported more engagement in their EMA text were not necessarily less symptomatic. This pattern is consistent with recent research suggesting that daily mobility features, such as time spent at home or number of locations visited, are more predictive of within-person changes in depression than of between-person differences [28]. Clinically, this highlights the need for personalized interventions that focus on deviations from an individual’s own baseline rather than on comparisons to normative averages that may not reflect any one individual. If replicated, our findings suggest that digital measures of activation could be used to tailor treatment in real time, identifying when a patient’s activation is dropping and intervening accordingly.

Limitations and future directions

Several limitations should be noted. First, the small sample size, specifically for the passive sensor data collected from a subset of participants, limits generalizability, and replication in larger, more diverse samples is necessary. Second, the LLM-based ratings were generated using a specific prompt and model version. Although recent work has shown that LLM-based ratings using GPT-4o are surprisingly stable and human-like [18], slight variations in prompts or model updates could influence results. Another limitation is that passive sensing features capture only certain dimensions of activation, mainly physical mobility, and may miss other important nuances. Future work should integrate additional passive data streams (e.g., phone use patterns, conversation detection, sleep onset/offset) and physiological markers to capture a broader spectrum of activation-relevant processes. Similarly, while language-based ratings provide valuable context, they depend on participant compliance and openness within a specific EMA sampling scheme. Integrating insights from NLP as applied to continuously collected text data, such as from social media and text messages, would improve the richness and ecological validity of this data stream [30].

In conclusion, this proof-of-concept study provides initial evidence that passive smartphone sensing and LLM-based language analysis can be used to measure activation in adolescents during treatment for depression, offering insights into therapeutic processes as they unfold in daily life. By validating these digital tools against self-report measures and demonstrating their links to emotional and clinical outcomes, we illustrate their potential to enhance treatment monitoring, personalize interventions, and ultimately improve outcomes for depressed youth.