Abstract
Digital phenotyping promises to transform psychiatry by using multimodal, densely sampled data. However, its potential is hindered by the lack of focus on identifying and validating digital biomarkers that accurately reflect mental states before evaluating their impact on outcomes. This longitudinal study used explainable machine learning to analyze multivariate, densely sampled data from 133 bipolar disorder (BD) participants over a median of 251 days, identifying robust digital biomarkers defining depressive episodes. The analysis included features from email-based daily self-reported mood, energy, and anxiety, as well as passively collected activity and sleep data using an Oura ring. The most robust descriptors of depressive episodes were lower daily mood variability, lower daily activity variability, and higher daily sleep onset latency variability. Self-reported daily mood features achieved the highest performance (AU-ROC: 0.82 ± 0.03). Our results establish the value of multimodal data and represent a critical first step toward automated detection and prediction of illness episodes in BD.
Introduction
Digital biomarkers have been introduced into several fields of medicine to enhance diagnosis1,2,3,4,5,6,7,8,9, monitor disease progression2,3,6,7,10,11,12,13,14,15, and personalize treatment plans16,17,18, ultimately improving patient outcomes. In psychiatry, wearables and smartphones are increasingly used to collect data (both passively and actively) outside traditional clinical settings19,20,21,22, a process referred to as “digital phenotyping”23. However, the promise of this approach to transform psychiatry has been impeded by the lack of attention to crucial initial stages required for the identification and validation of digital biomarkers. The successful identification and validation of other types of biomarkers have required a stepwise process. For instance, the selection of reliable pharmacogenetic biomarkers in psychiatry has taken more than two decades, starting with the feasibility of characterizing specific genes and alleles associated with drug metabolism24. The next step was to assess whether these pharmacogenetic biomarkers were associated with clinical outcomes in large datasets25. Then, the potential clinical impact of using these biomarkers was assessed in small prospective studies26. Finally, their clinical efficacy in improving treatment outcomes was established in large randomized clinical trials (RCTs) comparing outcomes with and without the use of pharmacogenetic biomarkers27. We propose that a similar approach is needed to establish the usefulness of digital biomarkers. The feasibility of collecting multimodal densely sampled data (“digital phenotyping”) in patients with psychiatric disorders has now been established28,29,30,31,32,33,34,35. Before we can assess whether the potentially unlimited number of measures of objective behaviors and subjective mental states (“digital biomarkers”) can be used to improve care and outcomes, we need to identify a small number of them that are accurate indicators of relevant mental states.
Explainable machine learning (ML) models offer a plausible approach for the identification of digital biomarkers that would accurately indicate the presence of a specific mental state36. For instance, in a 30-day longitudinal study involving 54 participants, with self-reported depression severity ranging from none to severe, explainable ML identified the best descriptors of depressive states as lower distance traveled, lower location entropy, fewer location visits, more total sleep time, and less physical activity37. In a longer 1-year study using continuous wearable data and monthly self-reported Patient Health Questionnaire (PHQ-9) ratings in 10,036 participants38, an explainable ML model identified sleep changes and lower recent step count as the most important digital biomarkers in the model’s classification of depressive episodes.
Despite the effort invested in developing explainable ML models for mental disorders in general, and for mood disorders in particular, building a single one-size-fits-all classification model remains a challenge39. The different mental states that exist in mood disorders add to the complexity of developing such models. For example, a marker associated with bipolar disorder while a patient is in a depressive episode may no longer be associated with the disorder once the same patient is euthymic. This is partly because prediction is the last step of a three-step process: association, description (or detection), and prediction (or forecasting) of episodes of illness40,41. Association involves determining the relationship between one or more variables and a mental state or a severity indicator of that mental state42,43. Description takes this one step further and identifies digital measures of objective behaviors and subjective mental states (“digital biomarkers”) that can be integrated with sociodemographic data and clinical characteristics to detect a mental state44,45,46. Prediction aims to anticipate the future occurrence of a mental state, before its symptoms unfold, based on the identification of early warning signals in historical objective and subjective data47,48,49.
In this context, we undertook a longitudinal study to systematically assess which digital biomarkers would be the most accurate descriptors of depressive episodes in BD across different time scales, using longitudinal, densely sampled multivariate objective and subjective data. Unlike many prior studies that focus primarily on mood episode classification with limited methodological innovation, our work presents a robust systematic approach to identify a set of reliable digital descriptors of depressive episodes in BD. By integrating rigorous validation techniques (bootstrapped and permutation-tested feature importance, selection frequency, and rank stability across multimodal longitudinal data), we move beyond classification accuracy to unveil reproducible biomarkers with strong potential for clinical actionability. Based on our previous work50, we hypothesized that the day-to-day variability in mood, activity, and sleep would be the most accurate descriptors.
Methods
Participants and data collection
The overall methods of this ongoing longitudinal study have been described previously51. In brief, between December 12, 2020, and January 15, 2024, we recruited and followed 164 adults diagnosed with BD type I or II in the outpatient mood clinics of two academic psychiatric hospitals in Canada: the Centre for Addiction and Mental Health (CAMH), Toronto, Ontario; and the Queen Elizabeth II Health Sciences Centre, Halifax, Nova Scotia. This analysis focused on identifying specific descriptors that could be used to detect depressive episodes. Therefore, we excluded 31 participants who entered the study in a manic, hypomanic, or mixed episode.
The study was approved by the Research Ethics Board (REB) at the Centre for Addiction and Mental Health (CAMH), Toronto, Ontario, Canada, and at the QEII Health Care Centre, Halifax, Nova Scotia, Canada, in accordance with the Declaration of Helsinki. REB #: 059-2019. All participants signed an informed consent form approved by the local research ethics board before any research procedure was initiated. Primary diagnoses of BD I or II were established with the Structured Clinical Interview for the Diagnostic and Statistical Manual of Mental Disorders 5 (DSM-552; SCID-553) based on the diagnostic criteria of the DSM-5. Participants completed a baseline assessment, including the gathering of their sociodemographic characteristics, clinical history, cardiovascular health, chronotype, and medication regimen. Polarity (euthymia, depression, or (hypo)mania) at baseline was defined based on the Young Mania Rating Scale (YMRS)54 and the Montgomery-Asberg Depression Rating Scale (MADRS)55: euthymia was defined as both YMRS and MADRS scores < 10; depression as MADRS scores ≥ 10; and (hypo)mania as YMRS scores ≥ 10. All participants were treated by a psychiatrist affiliated with the study or a community-based psychiatrist.
Participants were instructed to wear an Oura ring (Oura Health Oy, Generation 2, Oulu, Finland) continuously throughout the study duration, including during sleep, and to charge the device during periods of inactivity as needed. No fixed charging schedule or specific charging intervals were mandated, allowing participants to manage device battery life flexibly. The ring produces a variety of data related to activity and sleep56 (Fig. 1). As an incentive, participants received $180 upon completing their first year. For the second year, they could choose between an additional $180 payment or ownership of the Oura ring. Additionally, participants received a daily email with a link to an electronic visual analog scale (e-VAS) to self-rate their mood, energy, and anxiety levels over the previous 24 h. Each e-VAS component was scored in one-point increments ranging from 10 to 90, with participants instructed to consider “50” as their baseline (e.g., their “usual” mood, energy, or anxiety): ratings higher than 50 indicated levels “higher than their usual,” and ratings lower than 50 indicated levels “lower than their usual.” Reminders were sent to participants when e-VAS were not completed for three consecutive days. Finally, participants also received a weekly email with a link to an electronic version of the 9-item Patient Health Questionnaire (PHQ-9)57 and the Altman Self-Rating Mania scale (ASRM)58 to rate their depressive or manic symptoms, respectively, during the past week. Daily e-VAS ratings and weekly symptom ratings were stored on a REDCap (Research Electronic Data Capture)59 secure database.
a Clinical course of two exemplary participants: one who experienced the onset of a major depressive episode and one who remained in euthymia. This panel illustrates the longitudinal variability in clinical polarity and highlights the differences between participants with and without the onset of a depressive episode. b Subjective and objective variables used to generate features: a-priori selection of three out of three variables provided by participants, 1 out of 16 variables related to activity from the wearable, and 11 out of 23 variables related to sleep from the wearable. c Collection and aggregation of data over various time scales to create variables: subjective data provided by participants consist of three self-reported measures; objective data provided by wearables consist of densely sampled data on activity and sleep.
To assess participant adherence to device-wear protocols, we computed non-wear time and missing-data ratios across modalities. Non-wear time was provided directly in the Oura ring activity data, and we computed the missing data ratio (MDR) for each modality (i.e., activity, sleep, scales) as the ratio of missing observations to the total number of expected observations over a participant’s study course. We additionally performed a sensitivity analysis comparing the extracted digital biomarkers across two subsamples created by dividing the full sample at the median participation length.
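For illustration, a minimal sketch of the MDR computation is shown below; the inputs (daily e-VAS completions indexed by date) are hypothetical, and in the study the expected counts are derived from each participant’s actual study course.

```python
import pandas as pd

def missing_data_ratio(observed: pd.Series, expected_index: pd.DatetimeIndex) -> float:
    """Missing observations divided by expected observations for one modality
    over a participant's study course (sketch; inputs are hypothetical)."""
    aligned = observed.reindex(expected_index)  # NaN wherever an expected day is absent
    return float(aligned.isna().mean())

# Example: 10 expected daily e-VAS ratings, 2 missed -> MDR = 20%
expected = pd.date_range("2021-01-01", periods=10, freq="D")
observed = pd.Series(1.0, index=expected.delete([3, 7]))
print(f"MDR = {missing_data_ratio(observed, expected):.0%}")
```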
Data generation: creation of variables (Fig. 1)
Figure 1 presents an overview of the clinical course and time scales for data generation. Based on the weekly PHQ-9, we used a 2-week rolling window to identify depressive episodes and the time of their onset: the onset of a depressive episode was defined as the start of at least two consecutive weeks with a PHQ-9 score ≥ 10. Previous studies have shown this approach to be both highly sensitive and specific60,61 for defining major depressive episodes50 (Fig. 1a).
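As a minimal sketch, this onset rule can be implemented by scanning the weekly PHQ-9 series for the first of at least two consecutive weeks at or above the threshold (hypothetical function and data; the study’s handling of missing weeks is not shown):

```python
import numpy as np

def depressive_onsets(phq9_weekly: np.ndarray, threshold: int = 10) -> list[int]:
    """Return indices of weeks that start >= 2 consecutive weeks with
    PHQ-9 >= threshold (2-week rolling-window rule; sketch)."""
    above = phq9_weekly >= threshold
    onsets, in_episode = [], False
    for week in range(len(above) - 1):
        if not in_episode and above[week] and above[week + 1]:
            onsets.append(week)          # onset of a depressive episode
            in_episode = True
        elif in_episode and not above[week]:
            in_episode = False           # episode over; a later onset may follow
    return onsets

print(depressive_onsets(np.array([4, 6, 12, 15, 11, 7, 5, 13, 14, 9])))  # -> [2, 7]
```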
Among the 39 variables generated by the Oura ring, we selected a smaller set of 12 objective variables (1 out of 16 related to activity and 11 out of 23 related to sleep) (see Fig. 1b). For activity, all variables included in the analysis were derived from the 5-minute sampled Metabolic Equivalent of a Task (MET)62 produced by the ring and aggregated over one-hour periods, day times, evening times, night times, one-day periods, one-week periods, and one-month periods. For sleep, all variables included in the analysis were derived from: (i) the 5-minute hypnograms56 produced by the ring and aggregated over one-hour periods; and (ii) a set of 10 daily variables (awake duration, sleep-onset latency, restless sleep duration, light sleep duration, deep sleep duration, REM duration, total sleep duration, waking-up count, getting-up count, and sleep efficiency) aggregated over weeks and months. Similarly, daily e-VAS data were aggregated over weeks and months (Fig. 1c).
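As an illustrative sketch of this aggregation, pandas resampling can roll the 5-minute MET samples up to the coarser time scales. The data below are simulated, and the day-time and evening-time window boundaries are our assumptions, not the study’s definitions:

```python
import numpy as np
import pandas as pd

# Simulated 5-minute MET samples for one week (hypothetical values)
rng = np.random.default_rng(0)
idx = pd.date_range("2021-01-01", periods=7 * 288, freq="5min")
met = pd.Series(rng.gamma(shape=2.0, scale=0.8, size=len(idx)), index=idx)

hourly = met.resample("1h").mean()            # one-hour periods
daily = met.resample("1D").mean()             # one-day periods
weekly = met.resample("7D").mean()            # one-week periods
daytime = met.between_time("09:00", "17:59")  # assumed day-time window
evening = met.between_time("18:00", "22:59")  # assumed evening window
```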
Overall, considering the objective data produced by the ring, the subjective (e-VAS) data provided by the participants, and the various time periods, we created a set of 49 variables (see Supplementary Table 1).
Data processing: imputation of missing data and calculation of features (Fig. 2a, b)
Given our previous work showing that data in this sample were missing not at random (MNAR)63, we generated ten imputed datasets using the K-Nearest Neighbors (KNN) method64. KNN imputation identifies the k nearest neighbors to each missing data point based on a distance metric and imputes the missing value using the median of these neighbors. This method uses information from similar observations, making it suitable for handling data that are MNAR. Each of these ten imputed datasets was generated based on different yet plausible imputed values (Fig. 2a). We conducted the subsequent analyses on each of these ten imputed datasets and mean-aggregated the results to yield parameter estimates and standard errors. The following analyses were performed within each imputed dataset.
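A minimal sketch of a median-based KNN imputation consistent with this description follows. Note that scikit-learn’s built-in KNNImputer averages neighbors rather than taking their median, so a small custom routine is used here; generating ten distinct datasets (e.g., by varying k or resampling rows) is left implicit:

```python
import numpy as np

def knn_median_impute(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Fill each missing cell with the median of that column over the k rows
    nearest to the incomplete row (nan-aware Euclidean distance; sketch)."""
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        donors = np.where(~np.isnan(X[:, j]))[0]        # rows observing column j
        donors = donors[donors != i]
        dists = []
        for d in donors:
            shared = ~np.isnan(X[i]) & ~np.isnan(X[d])  # jointly observed columns
            diff = X[i, shared] - X[d, shared]
            dists.append(np.sqrt(np.mean(diff ** 2)) if shared.any() else np.inf)
        nearest = donors[np.argsort(dists)[:k]]
        out[i, j] = np.median(X[nearest, j])            # median of the neighbors
    return out
```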
a Variable preprocessing, missing data imputation, and feature extraction: this consists of using multiple imputation to generate 10 imputed datasets, followed by the extraction of the main statistical characteristics (i.e., features) from each variable, spanning time scales from five minutes to one month. b Variables across various time scales used to generate matrices of features: each variable with a temporal resolution from five minutes to a month is used to generate seven features. All the features of all participants are used to create two matrices, one corresponding to euthymic states and one corresponding to depressive states. These matrices serve as input for the binary classifier. c Classification and generation of performance metrics: input features and confounding factors used in the ensemble-based binary classifier for distinguishing clinical polarity (euthymia vs. depression). This panel also illustrates the classifier’s performance metrics, including Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves for both the Decision Tree model and the XGBoost model. Performance is quantified by the Area Under the ROC Curve (AU-ROC) and the Area Under the Precision-Recall Curve (AU-PRC). d Feature ranking: Shapley (SHAP) plots are generated to show each feature’s average impact on the model’s predictions and class-specific impacts for the features extracted during euthymia and depression. The plots provide insights into the relative importance of each feature in driving the classifier’s outputs, enhancing model interpretability.
For each of the 133 participants, for each of the 49 variables, in each of the ten imputed datasets, we calculated seven features characterizing variability over the entire duration of participation while euthymic (or depressed, if applicable): the mean, coefficient of variation (CV), kurtosis, skewness, mean absolute differences (MAD), absolute sum of consecutive changes (ASCC), and autocorrelation (AC) (Fig. 2b; Supplementary Table 2). We selected these features to assess the statistical distribution, variability, and self-similarity of the variables. We computed the CV to capture relative variability (i.e., normalizing variance by the mean), rather than absolute variability (i.e., changes in the measure). We explicitly distinguished between absolute and relative variability to avoid misinterpretation in cases where mean levels differ substantially across mood states. The calculations were performed across all available time periods (from 5 min to 1 month) to analyze both short- and long-term fluctuations. To assess the impact of various temporal resolutions, we extracted within-day activity features, within-night sleep features, and weekly and monthly features for activity, sleep, and all e-VAS domains using identical preprocessing steps, standardized variability descriptors, and the same evaluation pipeline. For example, for the daily e-VAS ratings of a participant who was euthymic for five months (153 days), became depressed for 2 months (61 days), and became euthymic again for six months (182 days), we would calculate the seven features listed above over the 335 days of euthymia (i.e., concatenating the two periods of euthymia) and the 61 days of depression. For the same participant, for the variable corresponding to monthly mood ratings, we would first aggregate daily e-VAS ratings over the 13 months of participation and then calculate the seven features over the 11 non-overlapping 30-day periods of euthymia (i.e., concatenating the two periods of euthymia) and the two non-overlapping 30-day periods of depression. In total, 343 features were calculated as described above for each individual participant. See Supplementary Table 1 for the full list of data, variables, and features, and Supplementary Table 2 for the mathematical description and interpretation of the features.
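A sketch of the seven per-variable features is shown below. The exact formulas follow Supplementary Table 2; defining MAD and ASCC over consecutive differences and using a lag-1 autocorrelation are assumptions consistent with the Results:

```python
import numpy as np
from scipy import stats

def variability_features(x: np.ndarray) -> dict[str, float]:
    """Seven features per variable and clinical state (sketch)."""
    diffs = np.diff(x)
    return {
        "mean": float(np.mean(x)),
        "cv": float(np.std(x) / np.mean(x)),            # relative variability
        "kurtosis": float(stats.kurtosis(x)),           # extreme deviations
        "skewness": float(stats.skew(x)),               # distribution asymmetry
        "mad": float(np.mean(np.abs(diffs))),           # mean absolute differences
        "ascc": float(np.sum(np.abs(diffs))),           # absolute sum of consecutive changes
        "ac1": float(np.corrcoef(x[:-1], x[1:])[0, 1]), # lag-1 autocorrelation
    }
```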
Classification model development and evaluation of model performance
The present analyses were designed to evaluate the association between specific features and concurrently assessed clinical polarities: the ML models were used to distinguish between depressive and euthymic data rather than to forecast future episodes. Accordingly, all reported classification performance metrics (e.g., AU-ROC, AU-PRC) reflect the ability of the features to discriminate between clinical polarities in the same data segment, consistent with an associative rather than temporal predictive framework. We created two matrices based on the features of all participants, the first corresponding to the euthymic states and the second to the depressed states; these matrices were the input for supervised ML classifiers using Random Forest65, XGBoost66 (Extreme Gradient Boosting), Logistic Regression, or Support Vector Machines (SVM). Input data were split into training and test sets via a random shuffle 80/20 split stratified by binary state (euthymic = 0, depressed = 1). We evaluated model performance using a nested cross-validation (CV) framework to ensure unbiased estimation and robust feature selection. The outer CV consisted of 5 stratified folds, where in each fold, the data were split into training and held-out test sets. Within each outer fold, model optimization and feature selection were performed using an inner CV with 5 folds and randomized hyperparameter search (50 iterations). Each of the four models was combined with one of three feature selection methods (SelectKBest, Recursive Feature Elimination (RFE), or SelectFromModel), and feature selection parameters (e.g., the number of features to select) were tuned within the inner CV.
Preprocessing included robust scaling and KNN imputation for missing values. We then handled feature outliers by applying interquartile range-based clipping. The best model and feature subset from the inner CV were retrained on the entire outer training fold and evaluated on the held-out test fold. We assessed model performance by computing the area under the receiver operating characteristic curve (AU-ROC).
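The nested cross-validation loop can be sketched with scikit-learn as follows. The search space below is a placeholder; the full study additionally searched XGBoost, SVM, and SelectFromModel, and applied IQR clipping:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

def nested_cv_auroc(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """5x5 nested CV: inner randomized search over model + selector,
    outer held-out AU-ROC (sketch)."""
    pipe = Pipeline([("impute", KNNImputer()),
                     ("scale", RobustScaler()),
                     ("select", SelectKBest()),
                     ("clf", RandomForestClassifier(random_state=seed))])
    space = [  # placeholder hyperparameter space
        {"select": [SelectKBest()], "select__k": [5, 10, 20],
         "clf": [RandomForestClassifier(random_state=seed)],
         "clf__n_estimators": [100, 300]},
        {"select": [RFE(LogisticRegression(max_iter=1000))],
         "select__n_features_to_select": [5, 10],
         "clf": [LogisticRegression(max_iter=1000)]},
    ]
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train, test in outer.split(X, y):
        search = RandomizedSearchCV(pipe, space, n_iter=50, scoring="roc_auc",
                                    cv=StratifiedKFold(5), random_state=seed)
        search.fit(X[train], y[train])                # inner CV tunes model + features
        proba = search.predict_proba(X[test])[:, 1]   # best pipeline, held-out fold
        scores.append(roc_auc_score(y[test], proba))
    return float(np.mean(scores)), float(np.std(scores))
```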
To evaluate the stability and significance of our classification models, we performed label permutation testing. This involved randomly shuffling class labels and recalculating model performance (AU-ROC) across 5 cross-validation folds to generate a null distribution. The true-label model performance was then compared against this null distribution to assess whether observed results exceeded chance levels.
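A minimal sketch of the label permutation test is shown below; `cv_auroc` is a hypothetical stand-in for the full cross-validated fit described above:

```python
import numpy as np

def permutation_pvalue(true_auc: float, cv_auroc, y: np.ndarray,
                       n_perm: int = 100, seed: int = 0) -> float:
    """P-value for the true-label AU-ROC against a shuffled-label null (sketch).
    `cv_auroc(labels)` is assumed to rerun the cross-validated pipeline and
    return its mean AU-ROC."""
    rng = np.random.default_rng(seed)
    null = np.array([cv_auroc(rng.permutation(y)) for _ in range(n_perm)])
    return (1 + int(np.sum(null >= true_auc))) / (1 + n_perm)  # add-one estimator
```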
Model explainability: Permutation feature importance and robustness
To interpret the contribution of specific features to the classification model performance, we computed SHapley Additive exPlanations (SHAP)67 values on the held-out test sets to provide model-agnostic, local explanations of feature contributions. SHAP values were aggregated across all outer folds to derive an overall feature importance ranking. This method provided insights into how each feature influenced the model’s classification performance. It helped identify clinically actionable digital biomarkers and uncover potential underlying mechanisms driving the model’s performance. We determined the significance of individual features by calculating their impact score (IS), derived as the average of the absolute SHAP values; the IS is a positive metric that quantifies the overall predictive power of each feature on the classification model’s output. Ideally, a highly impactful feature would exhibit a homogeneous impact distribution, with lower or higher values tilting towards the positive class (i.e., depressive episodes).
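As a sketch, the impact score can be computed from SHAP values on a held-out fold as follows (assuming an XGBoost binary classifier, for which `shap.TreeExplainer` returns one SHAP value per sample and feature):

```python
import numpy as np
import shap  # pip install shap

def impact_scores(model, X_test: np.ndarray, feature_names: list[str]) -> dict[str, float]:
    """Impact score (IS) = mean |SHAP value| per feature on the held-out set (sketch)."""
    shap_values = shap.TreeExplainer(model).shap_values(X_test)  # (n_samples, n_features)
    is_per_feature = np.abs(shap_values).mean(axis=0)
    return dict(sorted(zip(feature_names, is_per_feature), key=lambda kv: -kv[1]))
```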
Moreover, to assess feature robustness and importance stability, we performed permutation importance analysis on the held-out test sets of each outer fold. For each fold, the best model was refit on 100 bootstrapped samples of the training data. For each bootstrap, permutation importance was computed on the held-out test set using 100 repeats to ensure stability of importance estimates. The stability metrics we calculated per feature included (i) feature permutation importance (ranked within the fixed feature scope selected by the best model), (ii) feature selection frequency (i.e., the proportion of bootstraps in which a feature ranked within the top k features; 5 for activity data and 10 for sleep and e-VAS data), and (iii) feature rank distribution. We aggregated feature ranks and importance statistics across bootstraps and outer folds to quantify feature robustness. This approach allowed us to identify features consistently important across resamples, enhancing confidence in their relevance.
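A condensed sketch of this bootstrap procedure is given below; 100 bootstraps with 100 permutation repeats each is computationally heavy, so reduced counts are advisable for experimentation:

```python
import numpy as np
from sklearn.base import clone
from sklearn.inspection import permutation_importance

def topk_selection_frequency(model, X_tr, y_tr, X_te, y_te,
                             n_boot: int = 100, top_k: int = 10, seed: int = 0):
    """Refit on bootstrap resamples of the training fold, compute permutation
    importance on the held-out fold, and count how often each feature ranks
    in the top k (sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X_tr.shape
    top_counts = np.zeros(p)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                 # bootstrap resample
        fitted = clone(model).fit(X_tr[idx], y_tr[idx])
        pi = permutation_importance(fitted, X_te, y_te, n_repeats=100,
                                    scoring="roc_auc", random_state=seed + b)
        top = np.argsort(pi.importances_mean)[::-1][:top_k]
        top_counts[top] += 1
    return top_counts / n_boot                           # selection frequency
```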
Statistical analysis
We implemented permutation testing (N = 100) to generate a null distribution of differences between euthymia and depression groups and bootstrapping (N = 1000) to estimate variability in test statistics, addressing the relatively small sample size. This resulted in a total of 100 × 1000 resampled datasets derived from the original imputed dataset. The Mann-Whitney U test was applied to evaluate differences between groups across resampled distributions, providing a robust assessment of statistical significance. To ensure the reliability of our statistical analyses, we required each clinical state to have a minimum of five data points. This threshold was chosen to improve the power for each bootstrap run of the Mann-Whitney U-test; we corrected for multiple testing of all 7 (Table 2) and top 10 (Tables 3 and 4) feature comparisons with a False Discovery Rate (FDR) correction using the Benjamini–Hochberg method68; statistical significance was assumed for FDR-corrected p-value < 0.05. We also computed Cohen’s d effect sizes to quantify differences between the euthymic and depressive states.
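A compact sketch of the per-feature group comparison is shown below; the euthymic and depressive feature matrices are hypothetical inputs, the pooled-SD convention for Cohen’s d is an assumption, and the bootstrapping of the test statistic is omitted for brevity:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def compare_states(eu: np.ndarray, dep: np.ndarray):
    """Mann-Whitney U per feature with Benjamini-Hochberg FDR correction and
    Cohen's d effect sizes; eu/dep are observations x features (sketch)."""
    pvals = [mannwhitneyu(eu[:, j], dep[:, j]).pvalue for j in range(eu.shape[1])]
    reject, p_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    pooled_sd = np.sqrt((eu.var(axis=0, ddof=1) + dep.var(axis=0, ddof=1)) / 2)
    cohens_d = (eu.mean(axis=0) - dep.mean(axis=0)) / pooled_sd
    return p_fdr, reject, cohens_d
```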
Classifier performance metrics were evaluated on the held-out test set, indicating the mean (SD) from 10 imputed datasets (where relevant). We controlled for the effect of sociodemographic (e.g., age, sex) and clinical (e.g., pharmacotherapy) characteristics (see Table 1).
To help understand underlying differences driving the variation of features between euthymia and depressive episodes, we interpreted the ranking of the features by permutation importance, additionally contextualizing these metrics with the Mann-Whitney U-test statistic and the Cohen’s d effect sizes.
We also performed a sensitivity analysis by repeating the full analysis with and without imputation of missing data. We focused this sensitivity analysis on the daily e-VAS features that exhibited the highest classification performance and had the highest missing-data ratio (mean (SD), 16.71% (20.89%)). This was the most stringent test of whether the imputation process impacted our results. Classifier performance, extracted features, and permutation-based importance rankings were compared in both datasets to assess whether imputation meaningfully changed the results.
All analyses were conducted using a within-participant framework: for each participant, features were extracted separately from their depressive and euthymic segments, and statistical comparisons were performed within the same participant to minimize confounding by between-participant heterogeneity, which can be wide in mood disorders69,70. The reported group-level findings represent aggregated within-participant contrasts: each participant contributed paired depressive and euthymic segments (when applicable), and no analyses involved direct comparison of raw variability values across different participants, consistent with our within-participant analytic design.
All statistical analyses were conducted using Python version 3.11.
Results
Table 1 summarizes the sociodemographic and clinical characteristics of the 133 participants included in this analysis. Participants were followed for a mean (SD; range) of 251 (181; 120–706) days; 50 (37.6%) entered the study in a depressive episode, and 32 remitted after a mean (SD) of 83 (61) days. Conversely, 83 (62.4%) participants entered the study euthymic, and 16 experienced a depressive episode after a mean (SD) of 94 (89) days. Overall, participants experienced a mean (SD; range) of 1.1 (1.6; 0–15) depressive episodes, lasting a mean (SD) of 3.0 (5.0) weeks.
We used passively collected activity and sleep data, and daily self-reported electronic visual analog scale (e-VAS) data (i.e., mood, energy, and anxiety) collected from 133 participants enrolled in this study (see Fig. 1) to extract a total of 343 features characterizing variability in sleep, activity, and subjective mood during euthymic periods vs. depressive episodes (see Methods and Supplementary Table 1).
Participants demonstrated variable adherence to wearing the ring, with a mean ± SD of 31.5 ± 44.6 non-wear days per participant. MDRs were lowest for activity data (mean ± SD = 10.23 ± 14.48%), higher for self-reported scales (mean ± SD = 16.71 ± 20.89%), and highest for sleep data (mean ± SD = 22.47 ± 23.51%). These compliance metrics indicate generally good data availability, though missingness was non-negligible.
Identification of top descriptors of depressive episodes
We employed a supervised ML approach within a 5-fold nested cross-validation framework to classify the clinical state (euthymia vs. depressive episode) using activity, sleep, and e-VAS features. Model performance was evaluated on held-out test sets, and feature contributions were assessed using SHAP impact scores, alongside a feature robustness analysis (see Fig. 2 and Methods) based on permutation importance (top-k selection frequency; k = 5 for activity and k = 10 for sleep and e-VAS) across 100 bootstraps and 100 repeats.
As illustrated in Fig. 1c and Fig. 2b, features across different time scales were directly compared within the same analytical framework, and the corresponding ROC, PRC, and SHAP analyses for all time scales are reported in the Supplementary results. In the analyses of the descriptors of depressive episodes at different time scales (from 5-minute to monthly), the daily features outperformed features at other time scales. In the main text, we present only results relevant to daily features (see the Supplementary for other results).
Figure 3 presents the absolute impact scores (right panels) and the relative impact scores (left panels) for the top five daily activity features (Fig. 3a), the top ten daily sleep features (Fig. 3b), and the top ten daily e-VAS features (Fig. 3c), in decreasing order of importance. Figure 4 presents the average permutation importance (PI) (left panel), selection frequency in the top k (activity: k = 5; sleep and e-VAS: k = 10) (middle panel), and rank position distribution (right panel) for the top five daily activity features (Fig. 4a–c), the top ten daily sleep features (Fig. 4d–f), and the top ten daily e-VAS features (Fig. 4g–i), ranked in decreasing order of robustness metrics. Tables 2–4 present the statistical comparisons of the same top daily features.
SHAP feature importance plots showing average absolute impact (left panel) and clinical polarity-specific (right panel) impact of the top features extracted from the a activity variables, b daily sleep, and c daily e-VAS as input features to the Random Forest classifier. Higher average absolute impact indicates higher importance of the feature in differentiating clinical states. Features showing a symmetrical impact distribution on both sides of the x-axis are equally important in describing both clinical states, while an asymmetrical distribution shows stronger descriptive potential for one clinical state or the other.
a, d, g Feature importance plots showing average permutation importance values with error bars representing 95% CI of bootstrapped samples. b, e, h Feature selection frequency bar plots showing the percentage of times each feature was selected among the top predictors across models. c, f, i Feature rank distribution heatmaps showing feature ranking probabilities, with darker colors indicating higher likelihood of being selected in the top rank positions.
The daily activity feature (Fig. 3a, Fig. 4a–c) that best described a depressive episode was daily activity kurtosis (extreme deviations from average activity) (PI = 0.05; p < 0.05; Cohen’s d = 0.24 [0.0, 0.42]), which was significantly and consistently lower during depressive episodes than during euthymia (Table 2). The daily sleep features (Fig. 3b, Fig. 4d–f) that best described depressive episodes were sleep onset latency autocorrelation at lag 1 (AC1) (similarity) (PI = 0.03; p < 0.001; Cohen’s d = 0.02 [-0.23, 0.27]) and kurtosis (PI = 0.01; p < 0.001; Cohen’s d = 0.50 [0.29, 0.70]), which were both significantly and consistently lower during depressive episodes, in addition to skewness (distribution symmetry) of deep sleep duration (PI = 0.02; p < 0.05; Cohen’s d = 0.25 [-0.01, 0.52]), also consistently lower during depressive episodes (Table 3). The daily e-VAS features (Fig. 3c, Fig. 4g–i) that accurately and consistently described depressive episodes were the CV (mean-normalized relative variance) of mood scores (PI = 0.11; p < 0.001; Cohen’s d = -0.50 [-0.73, -0.23]), which was higher during depressive episodes, coupled with significantly lower mood scores (PI = 0.06; p < 0.001; Cohen’s d = 0.67 [0.41, 0.94]), jointly indicating lower absolute (but higher relative) daily mood variability during depressive episodes. This apparent contradiction reflects a relative-absolute variability paradox: during depressive episodes, mood ratings are low and fluctuate within a narrower absolute range, but these fluctuations are larger in proportion to the low baseline.
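A toy numerical illustration of this paradox, using hypothetical mood ratings:

```python
import numpy as np

euthymic  = np.array([48, 55, 42, 58, 50, 47])  # mood near the "usual" 50
depressed = np.array([22, 28, 18, 25, 20, 24])  # low mood, narrower absolute swings

for label, x in [("euthymia", euthymic), ("depression", depressed)]:
    print(f"{label}: SD = {np.std(x):.1f}, CV = {np.std(x) / np.mean(x):.2f}")
# euthymia:   SD = 5.3, CV = 0.11 -> larger absolute swings, lower CV
# depression: SD = 3.3, CV = 0.14 -> smaller absolute swings, higher CV
```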
Moreover, mean absolute differences between consecutive energy scores consistently described depressive episodes (PI = 0.09; p < 0.01; Cohen’s d = -0.27 [-0.55, -0.01]), coupled with significantly lower energy scores (PI = 0.02; p < 0.001; Cohen’s d = 0.60 [0.33, 0.86]) jointly indicating a steep decrease and stability at low levels in daily energy scores during depressive episodes (Table 4). Detailed robustness metrics for all features and modalities are provided in the Supplementary results.
Classification performance
Features based on daily e-VAS yielded the highest ML model performance on the held-out dataset across folds and permutations (AU-ROC: Mean: 0.82 ± 0.03; range: [0.78, 0.87]), followed by features based on daily activity (AU-ROC: Mean: 0.65 ± 0.07; range: [0.55, 0.72]) and features based on daily sleep (median (IQR) AU-ROC: 0.64 (0.08); AU-PRC: 0.55 (0.15)). The ML model that generated the most accurate depressive episode classification was XGBoost with RFE for feature selection. Upon label permutation, the XGBoost model using daily e-VAS data significantly outperformed the null distribution generated by label permutation (p < 0.01), with a Cohen’s d effect size of 3.78, indicating a large difference between true and random label performance. The model outperformed the random-label model in 10 out of 10 cross-validation folds. The daily activity model also outperformed the null distribution generated by label permutation (p < 0.01), with a Cohen’s d effect size of 1.49.
The area under the receiver operating characteristic curve (AU-ROC) scores of the top-performing ML model on the held-out dataset across outer folds are presented in Supplementary Table 3. The AU-ROC and area under the precision-recall curve (AU-PRC) scores for the top-performing ML models across time scales are presented in Supplementary Figs. 1–4, and the graphical representation of the label permutation performance for daily e-VAS, activity, and sleep data is presented in Supplementary Fig. 5.
The imputation sensitivity analysis showed no significant differences between the imputed and non-imputed datasets in AU-ROC or AU-PRC performance, the set of selected descriptors, or their permutation-based importance rankings. Classification performances were nearly identical, and the most descriptive features and their relative importance values were identical in both datasets (see Supplementary Figs. 6 to 8). Similarly, the second sensitivity analysis, based on participation length, yielded consistent classification performance and feature rankings in participants with longer or shorter participation. This supports that our results are not biased by dropout or by differences in engagement (see Supplementary Fig. 9).
Discussion
The primary aim of this study was to identify the most accurate digital biomarkers for detecting depressive episodes in BD. Multivariate densely sampled data across different time periods combined continuously collected objective data from a wearable (i.e., activity and sleep variables), daily self-rated subjective data (i.e., mood, energy, and anxiety variables), and weekly self-reported mood questionnaires (i.e., the Patient Health Questionnaire (PHQ-9) and the Altman Self-Rating Mania scale (ASRM)). Overall, we systematically analyzed more than 200,000 data points from 133 participants with BD, followed for a median of 251 days.
This analysis supports our previous finding that day-to-day variability in subjective mood, energy, and anxiety (e-VAS) best describes depressive episodes. Continuously collected measures of variability in daily activity and sleep were also robust descriptors of depressive episodes. However, while daily features performed best in our analysis, the generalizability of this finding to other cohorts or other sensing modalities requires further investigation. During depressive episodes, participants consistently self-reported low mood and energy with less extreme day-to-day fluctuations (i.e., lower skewness and kurtosis): participants described being sadder and less energetic than usual, with fewer extreme swings than during euthymic periods. This pattern illustrates a common relative-absolute variability paradox: because mood ratings were lower during depressive episodes, they showed both higher relative variability and lower absolute variability, reflecting changes within a narrow low range (i.e., being “stuck” in a low mood). This highlights the importance of interpreting CV-based metrics in the context of the absolute ratings.
Furthermore, continuous measurement of objective activity and sleep revealed robust descriptors of depressive episodes, consistently characterized by lower variability in daily activity levels (i.e., smaller daily MAD). Our results are consistent with recent studies identifying accelerometry-derived activity features as the most informative markers for distinguishing mood states in BD71,72. During depressive episodes, sleep onset latency was more variable (i.e., lower daily autocorrelation), indicating that some nights it took depressed participants much longer to fall asleep compared to the adjacent nights. Depressive episodes were also characterized by less extreme changes in the duration of deep sleep (i.e., lower kurtosis). These findings are consistent with the results of descriptive phenomenology studies reporting low mood variability73,74, low activity levels75, and sleep disturbances76 during depressive episodes.
Our findings also support that features at the timescale of a day are more accurate to describe and detect depressive episodes than features at a shorter or longer timescale. We believe this is because data for brief time periods (i.e., minutes, hours, mornings, evenings, or nights) are too variable (“noisy”) and data for longer time periods (i.e., weeks or months) are not variable enough (i.e., variations cancel each other, and the pattern is lost).
Overall, our analysis of e-VAS data shows that one can accurately detect and characterize a depressive episode simply by asking a patient daily how they feel. However, while our participants are a self-selected group that was adherent to daily e-VAS, many patients would not tolerate this amount of monitoring. Outside of the confines of a research study, even when physicians ask their patients to contact them if they start experiencing depressive symptoms, many patients do not, receiving clinical attention only once they become clearly symptomatic and impaired. In this context, our results establish the value of passively (“automatically”) gathered data: they suggest that a small set of features based on daily variation in activity and sleep could be used to detect depressive episodes, which could be reported to both a patient and their clinician, leading to earlier intervention and better outcomes. Beyond detection, this study contributes to the development of methodology for identifying digital biomarkers and advancing our understanding of the pathophysiology of depressive episodes, particularly the clinical significance of decreased variability. The observed reduction in variability may reflect a loss of physiological adaptability and could provide insight into underlying neurobiological mechanisms of depression, such as altered homeostatic control or impaired stress-response systems. To our knowledge, no other longitudinal study has adopted this methodology and shown similar results.
Our study has both strengths and limitations. To our knowledge, our sample is larger, and our duration of follow-up is longer than in most published studies of digital phenotyping in patients with a severe mental illness. Another strength is our use of explainable ML to systematically analyze 343 potential features describing a depressive episode; SHAP values enabled us to select the three most accurate features. The within-participant analysis is a strength of this study that is made possible by the long period of observation and is also relevant for individualized/personalized clinical management. Our focus on within-participant comparisons allowed each participant to serve as their own control, thereby reducing the influence of the substantial between-participant heterogeneity in behavioral, sleep, and symptom trajectories that is expected in patients with mood disorders69,70 and evident in our sample, with a wide overlap of feature values observed in our SHAP analyses. This study also distinguishes itself from previous mood episode classification efforts by emphasizing robust biomarker identification rather than solely highlighting predictive performance. Our systematic approach, incorporating bootstrapped feature stability and permutation analyses, identifies a robust set of digital biomarkers that reliably characterize depressive episodes in BD. Our imputation sensitivity analysis supported that the digital biomarkers and their relative importance were robust and are not artifacts of our imputation procedures. This focus on reproducibility and interpretability advances the field toward clinically relevant and actionable digital biomarkers rather than incremental improvements in classification.
The additional analysis of all the studied features using traditional statistics identified the features that statistically differentiated depressive episodes from periods of euthymia. However, some features did not contribute as much to the explainable ML overall predictions because they were redundant in their contribution to the models’ outputs (i.e., even after managing multicollinearity, their SHAP importance was distributed among correlated features). A possible limitation is that participants who dropped out of the study early contributed less data to model training than those who did not. To address whether this could have influenced our models (since we have shown that data are missing not at random in our sample)63, we implemented data normalization techniques and performed stratified cross-validation to avoid data-length-dependent features. We also performed participation length-based sensitivity analyses, making our methods more robust and ensuring fair representation of all participants regardless of their enrollment time and illness trajectory. Furthermore, the current study did not include participants with Major Depressive Disorder; therefore, we cannot determine whether the identified digital biomarkers are associated specifically with bipolar depression or more generally with depressive states; future work should address this gap. Another limitation is the relatively high homogeneity of our sample: our participants had relatively high levels of education, and few were from underrepresented groups. This limits the generalizability of our findings, which need to be replicated in a more diverse sample. Moreover, participants had access to the Oura Ring app dashboard during the study, which may have introduced reactivity effects, such as increased physical activity due to feedback awareness. Although this was not an intervention study, we acknowledge this as a potential limitation affecting behavior and data interpretation.
In conclusion, a systematic approach and analysis identified three digital biomarkers that are relatively simple to compute and interpret, and that could be integrated into clinical practice to detect depressive episodes and improve clinical outcomes. This work contributes to the field of digital phenotyping in psychiatry, demonstrating the feasibility of the first step in a stepped approach toward actionable clinical predictions. The small set of features we identified to characterize depressive episodes can now be evaluated in future studies addressing the harder problem of detecting, or predicting, the onset of depressive episodes (i.e., the transition from euthymia to a depressive episode)40. Future studies should also extend this framework to true temporal prediction models capable of forecasting clinical state transitions (i.e., relapse, remission), rather than detecting them.
Data Availability
The datasets generated and/or analyzed during the current study are not publicly available due to privacy considerations but may be made available to qualified researchers on reasonable request from the corresponding author.
Code availability
The underlying code for this study is not publicly available but may be made available to qualified researchers on reasonable request from the corresponding author.
References
Seelye, A. et al. Weekly observations of online survey metadata obtained through home computer use allow for detection of changes in everyday cognition before transition to mild cognitive impairment. Alzheimers Dement 14, 187–194 (2018).
Zwack, C. C. et al. The evolution of digital health technologies in cardiovascular disease research. npj Digit. Med. 6, 1 (2023).
Schmidt, A. et al. ‘Digital biomarkers’ in preclinical heart failure models - a further step towards improved translational research. Heart Fail Rev. 28, 249–260 (2023).
Park, M. J. et al. Performance of ECG-derived digital biomarker for screening coronary occlusion in resuscitated out-of-hospital cardiac arrest patients: a comparative study between artificial intelligence and a group of experts. J. Clin. Med. 13, 1354 (2024).
Wesselius, F. J., van Schie, M. S., De Groot, N. M. S. & Hendriks, R. C. Digital biomarkers and algorithms for detection of atrial fibrillation using surface electrocardiograms: a systematic review. Comput Biol. Med 133, 104404 (2021).
Youn, B.-Y. et al. Digital biomarkers for neuromuscular disorders: a systematic scoping review. Diagnostics 11, 1275 (2021).
Wireless Mobile Communication and Healthcare: 10th EAI International Conference, MobiHealth 2021, Virtual Event, November 13–14, 2021, Proceedings. vol. 440 (Springer International Publishing, 2022).
Bera, K., Schalper, K. A., Rimm, D. L., Velcheti, V. & Madabhushi, A. Artificial intelligence in digital pathology - new tools for diagnosis and precision oncology. Nat. Rev. Clin. Oncol. 16, 703–715 (2019).
Song, Y., Kang, K., Kim, I. & Kim, T.-J. Pathological digital biomarkers: validation and application. Appl. Sci. 12, 9823 (2022).
Kourtis, L. C., Regele, O. B., Wright, J. M. & Jones, G. B. Digital biomarkers for Alzheimer’s disease: the mobile/ wearable devices opportunity. npj Digit. Med. 2, 9 (2019).
Godinho, C. et al. Erratum to: a systematic review of the characteristics and validity of monitoring technologies to assess Parkinson’s disease. J. Neuroeng. Rehabilit. 13, 71 (2016).
Huss, R., Raffler, J. & Märkl, B. Artificial intelligence and digital biomarker in precision pathology guiding immune therapy selection and precision oncology. Cancer Rep. 6, e1796 (2023).
Tandon, A. et al. Wearable biosensors in congenital heart disease: needs to advance the field. JACC Adv. 2, 100267 (2023).
Cobb, B. et al. Clinical applications of digital biomarkers in multiple sclerosis: a systematic literature review (P5-6.013). Neurology 102, 3675 (2024).
Sahandi Far, M., Stolz, M., Fischer, J. M., Eickhoff, S. B. & Dukart, J. JTrack: a digital biomarker platform for remote monitoring of daily-life behaviour in health and disease. Front Public Health 9, 763621 (2021).
Cay, G. et al. Harnessing physical activity monitoring and digital biomarkers of frailty from pendant based wearables to predict chemotherapy resilience in veterans with cancer. Sci. Rep. 14, 2612 (2024).
Dorsey, E. R., Papapetropoulos, S., Xiong, M. & Kieburtz, K. The first frontier: digital biomarkers for neurodegenerative disorders. Digit Biomark. 1, 6–13 (2017).
Straube, C. et al. A Second course of radiotherapy in patients with recurrent malignant gliomas: clinical data on re-irradiation, prognostic factors, and usefulness of digital biomarkers. Curr. Treat. Options Oncol. 20, 71 (2019).
Ortiz, A. et al. Increased sympathetic tone is associated with illness burden in bipolar disorder. J. Affect Disord. 297, 471–476 (2022).
Ortiz, A. et al. Reduced heart rate variability is associated with higher illness burden in bipolar disorder. J. Psychosom. Res 145, 110478 (2021).
Halabi, R. et al. A novel unsupervised machine learning approach to assess postural dynamics in euthymic bipolar disorder. IEEE J. Biomed. Health Inf. 28, 4903–4911 (2024).
Ortiz, A. et al. Predictors of adherence to electronic self-monitoring in patients with bipolar disorder: a contactless study using growth mixture models. Int J. Bipolar Disord. 11, 18 (2023).
Insel, T. R. Digital phenotyping: a global tool for psychiatry. World Psychiatry 17, 276–277 (2018).
Pinto, N. & Dolan, M. E. Clinically relevant genetic variations in drug metabolizing enzymes. Curr. Drug Metab. 12, 487–497 (2011).
Zubiaur, P. & Abad-Santos, F. Association studies in clinical pharmacogenetics. Pharmaceutics 15, 113 (2022).
Lauschke, V. M., Milani, L. & Ingelman-Sundberg, M. Pharmacogenomic biomarkers for improved drug therapy-recent progress and future developments. AAPS J. 20, 4 (2017).
Oslin, D. W. et al. Effect of pharmacogenomic testing for drug-gene interactions on medication selection and remission of symptoms in major depressive disorder: the PRIME care randomized clinical trial. JAMA 328, 151–161 (2022).
Jacobson, N. C., Weingarden, H. & Wilhelm, S. Digital biomarkers of mood disorders and symptom change. npj Digit. Med. 2, 3 (2019).
Breitinger, S. et al. Digital phenotyping for mood disorders: methodology-oriented pilot feasibility study. J. Med Internet Res 25, e47006 (2023).
Vignapiano, A. et al. A narrative review of digital biomarkers in the management of major depressive disorder and treatment-resistant forms. Front Psychiatry 14, 1321345 (2023).
Orsolini, L., Fiorani, M. & Volpe, U. Digital phenotyping in bipolar disorder: which integration with clinical endophenotypes and biomarkers? Int. J. Mol. Sci. 21, 7684 (2020).
Fraccaro, P. et al. Digital biomarkers from geolocation data in bipolar disorder and schizophrenia: a systematic review. J. Am. Med. Inf. Assoc. 26, 1412–1420 (2019).
Gillett, G. et al. Digital communication biomarkers of mood and diagnosis in borderline personality disorder, bipolar disorder, and healthy control populations. Front. Psychiatry 12, 610457 (2021).
Faurholt-Jepsen, M. et al. Voice analysis as an objective state marker in bipolar disorder. Transl. Psychiatry 6, e856 (2016).
Faurholt-Jepsen, M. et al. Smartphone data as an electronic biomarker of illness activity in bipolar disorder. Bipolar Disord. 17, 715–728 (2015).
Belle, V. & Papantonis, I. Principles and practice of explainable machine learning. Front Big Data 4, 688969 (2021).
Opoku Asare, K. et al. Mood ratings and digital biomarkers from smartphone and wearable data differentiates and predicts depression status: a longitudinal data analysis. Pervasive Mob. Comput. 83, 101621 (2022).
Price, G. D., Heinz, M. V., Song, S. H., Nemesure, M. D. & Jacobson, N. C. Using digital phenotyping to capture depression symptom variability: detecting naturalistic variability in depression symptoms across one year using passively collected wearable movement and sleep data. Transl. Psychiatry 13, 381 (2023).
Winter, N. R. et al. A systematic evaluation of machine learning-based biomarkers for major depressive disorder. JAMA Psychiatry 81, 386–395 (2024).
Ortiz, A. & Mulsant, B. H. Beyond step count: are we ready to use digital phenotyping to make actionable individual predictions in psychiatry? J. Med. Internet Res. 26, e59826 (2024).
Garcia-Ceja, E. et al. Mental health monitoring with multimodal sensing and machine learning: a survey. Pervasive Mob. Comput. 51, 1–26 (2018).
Faurholt-Jepsen, M. et al. Behavioral activities collected through smartphones and the association with illness activity in bipolar disorder. Int J. Methods Psychiatr. Res. 25, 309–323 (2016).
Rohani, D. A., Faurholt-Jepsen, M., Kessing, L. V. & Bardram, J. E. Correlations between objective behavioral features collected from mobile and wearable devices and depressive mood symptoms in patients with affective disorders: systematic review. JMIR Mhealth Uhealth 6, e165 (2018).
Grünerbl, A. et al. Smartphone-based recognition of states and state changes in bipolar disorder patients. IEEE J. Biomed. Health Inf. 19, 140–148 (2015).
Maxhuni, A. et al. Classification of bipolar disorder episodes based on analysis of voice and motor activity of patients. Pervasive Mob. Comput. 31, 50–66 (2016).
Gruenerbl, A. et al. Using smart phone mobility traces for the diagnosis of depressive and manic episodes in bipolar patients. in Proceedings of the 5th Augmented Human International Conference 1–8 (Association for Computing Machinery, New York, NY, USA, 2014). https://doi.org/10.1145/2582051.2582089.
Jakobsen, P. et al. Early warning signals observed in motor activity preceding mood state change in bipolar disorder. Bipolar Disord. 26, 468–478 (2024).
Moore, P. J., Little, M. A., McSharry, P. E., Geddes, J. R. & Goodwin, G. M. Forecasting depression in bipolar disorder. IEEE Trans. Biomed. Eng. 59, 2801–2807 (2012).
Ortiz, A. et al. Day-to-day variability in sleep and activity predict the onset of a hypomanic episode in patients with bipolar disorder. J. Affect Disord. 374, 75–83 (2025).
Ortiz, A. et al. Day-to-day variability in activity levels detects transitions to depressive symptoms in bipolar disorder earlier than changes in sleep and mood. Int J. Bipolar Disord. 13, 13 (2025).
Ortiz, A. et al. Identifying patient-specific behaviors to understand illness trajectories and predict relapses in bipolar disorder using passive sensing and deep anomaly detection: protocol for a contactless cohort study. BMC Psychiatry 22, 288 (2022).
Diagnostic and Statistical Manual of Mental Disorders, 5th edn. DSM Library. https://psychiatryonline.org/doi/book/10.1176/appi.books.9780890425596.
First, M. B. Structured Clinical Interview for the DSM (SCID). in The Encyclopedia of Clinical Psychology 1–6 (Wiley, 2015). https://doi.org/10.1002/9781118625392.wbecp351.
Young, R. C., Biggs, J. T., Ziegler, V. E. & Meyer, D. A. A rating scale for mania: reliability, validity and sensitivity. Br. J. Psychiatry 133, 429–435 (1978).
Montgomery, S. A. & Asberg, M. A new depression scale designed to be sensitive to change. Br. J. Psychiatry 134, 382–389 (1979).
de Zambotti, M., Rosas, L., Colrain, I. M. & Baker, F. C. The sleep of the ring: comparison of the ŌURA sleep tracker against polysomnography. Behav. Sleep. Med 17, 124–136 (2019).
Kroenke, K., Spitzer, R. L. & Williams, J. B. The PHQ-9: validity of a brief depression severity measure. J. Gen. Intern Med 16, 606–613 (2001).
Altman, E. G., Hedeker, D., Peterson, J. L. & Davis, J. M. The altman self-rating mania scale. Biol. Psychiatry 42, 948–955 (1997).
Harris, P. A. et al. Research electronic data capture (REDCap)–a metadata-driven methodology and workflow process for providing translational research informatics support. J. Biomed. Inf. 42, 377–381 (2009).
Levis, B., Benedetti, A., Thombs, B. D. & DEPRESsion Screening Data (DEPRESSD) Collaboration. Accuracy of Patient Health Questionnaire-9 (PHQ-9) for screening to detect major depression: individual participant data meta-analysis. BMJ 365, l1476 (2019).
Stochl, J. et al. On dimensionality, measurement invariance, and suitability of sum scores for the PHQ-9 and the GAD-7. Assessment 29, 355–366 (2022).
Mendes, M. de A. et al. Metabolic equivalent of task (METs) thresholds as an indicator of physical activity intensity. PLoS One 13, e0200701 (2018).
Halabi, R. et al. Not missing at random: missing data are associated with clinical status and trajectories in an electronic monitoring longitudinal study of bipolar disorder. J. Psychiatr. Res. 174, 326–331 (2024).
Murti, D. M. P., Pujianto, U., Wibawa, A. P. & Akbar, M. I. K-Nearest Neighbor (K-NN) based Missing Data Imputation. in Proc. 5th International Conference on Science in Information Technology (ICSITech) 83–88 (2019). https://doi.org/10.1109/ICSITech46713.2019.8987530.
Breiman, L. Random Forests. Mach. Learn. 45, 5–32 (2001).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, New York, NY, USA, 2016). https://doi.org/10.1145/2939672.2939785.
Lundberg, S. M. & Lee, S. I. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30 (2017).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).
Alda, M. The phenotypic spectra of bipolar disorder. Eur. Neuropsychopharmacol. 14, S94–S99 (2004).
Ostergaard, S. D., Jensen, S. O. W. & Bech, P. The heterogeneity of the depressive syndrome: when numbers get serious. Acta Psychiatr. Scand. 124, 495–496 (2011).
Corponi, F. et al. Automated mood disorder symptoms monitoring from multivariate time-series sensory data: getting the full picture beyond a single number. Transl. Psychiatry 14, 161 (2024).
Anmella, G. et al. Exploring digital biomarkers of illness activity in mood episodes: hypotheses generating and model development study. JMIR mHealth uHealth 11, e45405 (2023).
Judd, L. L. et al. The long-term natural history of the weekly symptomatic status of bipolar I disorder. Arch. Gen. Psychiatry 59, 530–537 (2002).
Judd, L. L. et al. A prospective investigation of the natural history of the long-term weekly symptomatic status of bipolar II disorder. Arch. Gen. Psychiatry 60, 261–269 (2003).
Clinical psychiatry: a text-book for students and physicians / abstracted and adapted from the 7th German edition of Kraepelin’s ‘Lehrbuch der Psychiatrie’ by A. Ross Diefendorf. Wellcome Collection https://wellcomecollection.org/works/md763v8u.
Bauer, M. et al. Temporal relation between sleep and mood in patients with bipolar disorder. Bipolar Disord. 8, 160–167 (2006).
Acknowledgements
This study was funded by the National Institute of Mental Health (NIMH) grant 1R21MH123849-01A1 (AO) and by the Canadian Institutes of Health Research (CIHR) grant 02010PJT-450770-BSB-CEAH-188794 (AO). The work of MA on the project has been supported by the Ministry of Health of the Czech Republic, grant no. NU23-04-00534. The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.
Author information
Authors and Affiliations
Contributions
R.H. contributed to data management, development and implementation of signal processing and ML algorithms, original draft writing, formal analysis, and data curation. B.M. contributed to manuscript writing, reviewing, editing, refinement of visualizations, methodology validation, and clinical interpretation and validation of results. M.T. contributed to statistical analysis, manuscript reviewing, and editing. D.B., C.G.T., M.I.H., H.K., and C.O. contributed to manuscript reviewing and editing. A.D. contributed to data management. A.H. contributed to project supervision and manuscript reviewing and editing. M.A. contributed to clinical interpretation and validation of results and manuscript reviewing and editing. A.O. contributed to study conceptualization, manuscript writing, reviewing, and editing, methodology and results conceptualization and validation, project supervision and management, resource provision, and funding acquisition.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Halabi, R., Mulsant, B.H., Tolend, M. et al. A systematic exploration of digital biomarkers for the detection of depressive episodes in bipolar disorder. npj Mental Health Res 5, 13 (2026). https://doi.org/10.1038/s44184-026-00195-5