Introduction

Forecasts1 indicate that the number of people aged 60 years or older worldwide was expected to reach 1 billion by 2020 and to double to 2.1 billion by 2050. The number of people aged 80 years or older is also projected2 to double, reaching 426 million by 2050. Despite the global increase in life expectancy, older adults face a heightened risk of mental health problems and other medical conditions. These challenges are significant barriers to successful aging, particularly the onset of depressive symptoms (DS). Survey data indicate that the number of people with depressive symptoms worldwide has risen to approximately 311 million, an increase of over 18% from 2005 to 2015, and depression is the second leading cause of disability worldwide3. Depression not only lowers the quality of life of elderly patients but also contributes to the development of long-term illnesses, further increasing the economic burden on healthcare systems. Therefore, there is an urgent need to prioritize depression as a public health issue in order to reduce disease burden and disability and to improve the overall health of elderly patients.

Relevant studies have shown that DS and subjective cognitive decline (SCD) — characterized by cognitive deterioration that has not yet reached clinically significant levels — are recognized risk factors for dementia4,5. The significant increase in dementia risk is closely associated with persistent SCD and DS. Consequently, individuals with DS in the context of SCD may represent a high-risk group for cognitive decline. However, in-depth research on the likelihood of DS in SCD patients remains scarce. Moreover, DS in SCD patients is often subtle and exhibits considerable individual variability. Accurately identifying high-risk populations remains a significant challenge. Identifying the primary factors contributing to depressive symptoms and effectively managing and mitigating their risks is of critical importance. Therefore, establishing an accurate DS risk assessment model is essential for timely intervention and for reducing the incidence of DS.

Currently, most traditional risk prediction methods rely on regression models. However, these techniques are prone to underestimation when applied to large datasets with complex feature variables, and they are limited by restricted feature selection and the difficulty of modeling complex relationships. In recent years, the use of machine learning tools in medicine and psychology has expanded considerably. Machine learning techniques can automatically identify hidden trends and relationships within large historical datasets, helping to uncover the complex risk factors that contribute to the onset of DS6. These techniques are particularly well suited to analyzing large datasets that include both categorical and continuous variables. Compared with traditional regression methods, machine learning offers substantial advantages in handling complex, multidimensional datasets, enabling more precise identification of key features and more accurate predictions.

Several studies assessing disease risk in the elderly have employed machine learning techniques and highlighted their strong predictive capabilities7,8. These techniques improved the precision of predictions and pinpointed the primary risk factors closely linked to disease onset. Yet, although studies on forecasting the likelihood of DS in the elderly population are ongoing9, the application of machine learning to predicting DS risk in elderly individuals with SCD remains a relatively new area of research.

This study presents a machine learning-based predictive model that utilizes data from the China Health and Retirement Longitudinal Study (CHARLS) to predict depression in elderly individuals with SCD. The goal is to identify the key risk factors associated with depression in elderly individuals with SCD and to evaluate prediction accuracy using three different machine learning methods: Random Forest, Gradient Boosting, and XGBoost. The aim of this study is to propose a pioneering, data-driven approach for the timely diagnosis and treatment of depression in elderly individuals with SCD, while providing a scientific foundation for broader research on the mental health of the elderly.

Methods

This study was conducted and reported in accordance with the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement10. The development and validation of the machine learning model followed TRIPOD recommendations, including clear definitions of predictors, outcome variables, and model performance metrics.

Data source

This study used data from the China Health and Retirement Longitudinal Study (CHARLS)11, which were collected by Peking University between 2015 and 2018 and are available on the website [http://charls.pku.edu.cn]. The study was approved by the Biomedical Ethics Committee of Peking University (Beijing, China). After the data were integrated and cleaned, and respondents with more than one-third of key variables missing were excluded, a total of 31,718 participants were included. Following the inclusion and exclusion criteria of the study, individuals with memory-related disorders, including dementia, mild cognitive impairment (MCI), or physician-diagnosed memory problems, were excluded. As a result, 1,921 older adults with SCD were selected to participate in the study. The screening process is illustrated in Supplementary Material 1. A total of 34 variables were used in this study, comprising 33 explanatory variables as inputs and one outcome variable for prediction.

Research tools

Self-assessed memory

SCD was evaluated through self-assessment of memory. Participants were asked to rate the following question: “How would you describe your memory currently?” The available responses were: “bad,” “fair,” “good,” “very good,” and “excellent.” Those who selected “fair” or “bad” were classified as having SCD12.
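As a minimal sketch of this classification rule, the Python snippet below maps a self-rated memory response to an SCD flag. The function name `has_scd` and the lowercase response strings are illustrative assumptions; they are not the CHARLS variable codes.

```python
# Illustrative only: the labels follow the wording in the text,
# not the actual CHARLS variable coding.
MEMORY_RESPONSES = ("bad", "fair", "good", "very good", "excellent")
SCD_RESPONSES = {"bad", "fair"}


def has_scd(response: str) -> bool:
    """Return True when a self-rated memory response indicates SCD."""
    answer = response.strip().lower()
    if answer not in MEMORY_RESPONSES:
        raise ValueError(f"unexpected memory response: {response!r}")
    return answer in SCD_RESPONSES
```

Rejecting unknown responses, rather than silently treating them as "no SCD", keeps missing or miscoded answers from being counted as negatives.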

Dependent variables

The 10-item Center for Epidemiological Studies Depression Scale (CES-D) was used in the CHARLS 2015 and 2018 questionnaires to assess the risk of depression in participants. The scale consists of 10 items, each scored from 0 to 3, with a total possible score ranging from 0 to 30. Higher scores indicate a greater risk of depression13. The CES-D has demonstrated good reliability and validity.

Independent variables

Based on a review of the literature6,14,15,16 and expert consultations, this study extracted sociodemographic data, health status, chronic disease history, lifestyle habits, and physical function from the CHARLS database as independent variables.

1) Sociodemographic data: Age, gender, education level, place of residence, marital status, retirement status, pension and medical insurance coverage, number of surviving children, and broadband availability.

2) Health status: Self-rated health (1 to 5 scale, from very good to very poor); life satisfaction (1 to 5 scale, from not satisfied at all to completely satisfied); hearing status (1 to 5 scale, from poor to excellent); and physical dysfunction (assessed using 9 basic activities, such as running 1 km and walking 100 m, each scored from 0 for no difficulty to 3 for inability to perform, giving a total score of 0 to 27, where higher scores indicate worse physical function).

3) Chronic disease history: The presence of chronic conditions such as hypertension, dyslipidemia, diabetes, cancer, lung disease, liver disease, heart disease, stroke, kidney disease, stomach disease, arthritis, and asthma.

4) Lifestyle: Nighttime sleep duration, smoking, and alcohol consumption.

5) Physical function: Assessed through Activities of Daily Living (ADL) and Instrumental Activities of Daily Living (IADL), both based on scales developed in 1969. Each scale comprises six items. The ADL items are toileting, eating, dressing, controlling urination and defecation, getting in and out of bed, and bathing. The IADL items are shopping, making phone calls, cooking, housekeeping, taking medication, and managing finances. Each activity is rated from 0 to 3, with higher values indicating a reduced capacity to perform daily activities17.
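The summed scales described in this section (CES-D-10, physical dysfunction, ADL, and IADL) share the same scoring arithmetic: a fixed number of items, each rated 0 to 3, added into a total. The sketch below illustrates that arithmetic. The helper `sum_scale` and the example responses are hypothetical, and the sketch assumes that any positively worded items have already been reverse-coded, a step the text does not describe.

```python
from typing import Sequence


def sum_scale(items: Sequence[int], n_items: int, max_item: int = 3) -> int:
    """Sum item scores after checking the item count and the 0..max_item range."""
    if len(items) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(items)}")
    if any(not 0 <= score <= max_item for score in items):
        raise ValueError("item score outside the allowed range")
    return sum(items)


# CES-D-10: 10 items scored 0-3, total 0-30 (made-up responses; assumes
# reverse-coding, if needed, was done beforehand).
cesd_total = sum_scale([1, 0, 2, 3, 1, 0, 0, 2, 1, 1], n_items=10)

# Physical dysfunction: 9 activities scored 0-3, total 0-27 (made-up responses).
dysfunction_total = sum_scale([0, 0, 1, 3, 2, 0, 0, 1, 0], n_items=9)
```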

Research design

The data for this study were obtained from the 2015 and 2018 waves of the CHARLS database, encompassing 3,842 records. Matching respondents across the two years yielded repeated observations for 1,921 participants, with data collected in both 2015 (baseline) and 2018 (follow-up). The study therefore concentrated on changes within individuals between the two waves, from baseline to follow-up. The dataset was divided into a training set (70%), a validation set (20%), and a test set (10%), with each subset preserving the class distribution of the full dataset.
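A stratified 70/20/10 split of this kind can be sketched with scikit-learn as follows. The synthetic data, the two-step splitting strategy, and the `random_state` values are assumptions made for illustration; the text does not describe the authors' exact procedure.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # synthetic predictors
y = np.array([1] * 300 + [0] * 700)   # synthetic outcome, 30% positive

# First hold out 30%, then split that portion 2:1 into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=1 / 3, stratify=y_tmp, random_state=42
)

for name, part in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, positive rate={part.mean():.2f}")
```

Passing `stratify` at both steps is what keeps the proportion of positive cases approximately equal across the three subsets.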

Statistical analysis

To minimize potential bias in the data imputation process, variables with more than 20% missing data were excluded to ensure reliable performance18. Missing values were imputed using the k-nearest neighbors (KNN), median, and mean methods in PyCharm 2.3. Normally distributed continuous variables are presented as mean ± standard deviation, while non-normally distributed continuous variables are expressed as median (P25, P75). Categorical data are presented as frequencies and proportions.
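The exclusion rule and KNN imputation described above might look like the following sketch, which uses pandas and scikit-learn on synthetic data. The column names, the choice of `n_neighbors=5`, and the use of KNN alone (rather than the combination of KNN, median, and mean imputation mentioned in the text) are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
df.loc[:4, "a"] = np.nan     # 5% missing  -> kept and imputed
df.loc[:39, "d"] = np.nan    # 40% missing -> dropped

# Keep only variables with at most 20% missing values.
missing_share = df.isna().mean()
kept = df.loc[:, missing_share <= 0.20]

# Fill the remaining gaps from the 5 nearest neighbours (k is an assumption).
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=5).fit_transform(kept), columns=kept.columns
)
```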

Data preprocessing and feature selection were performed in PyCharm 2.3. All variables were standardized using the StandardScaler. Feature selection was conducted using Lasso regression, with the optimal regularization parameter (alpha) determined through cross-validation. To address data imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied.
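The standardization and Lasso-based selection steps can be sketched as below, assuming scikit-learn's `StandardScaler` and `LassoCV` and a synthetic outcome in which only two features carry signal. The fold count and seeds are illustrative. SMOTE is not shown because it requires the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
# Synthetic outcome driven by features 0 and 3 only.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=300)

X_scaled = StandardScaler().fit_transform(X)

# LassoCV chooses alpha by cross-validation (10 folds here); the L1 penalty
# can shrink coefficients of uninformative features to exactly zero.
lasso = LassoCV(cv=10, random_state=0).fit(X_scaled, y)
selected = np.flatnonzero(lasso.coef_)
print("alpha:", lasso.alpha_, "selected features:", selected)
```

In a real pipeline the scaler and the Lasso would be fitted on the training split only, so that information from the validation and test sets does not leak into feature selection.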

Finally, the SHAP (SHapley Additive exPlanations) method (version 0.46.0) was used to explain the behavior of the Boosted XGBoost model and to visualize the contribution of each input feature to the prediction of the dependent variable. The SHAP method, based on Shapley values from game theory, quantifies the importance of each feature in individual predictions and provides both global and local interpretability of the model output. This approach enabled us to attribute the predicted outcome (DS) to the independent variables and to assess their influence on the final prediction results.
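To illustrate the game-theoretic definition that SHAP builds on, the sketch below computes exact Shapley values for a small toy value function. It does not use the SHAP library, which relies on faster model-specific algorithms, and the feature names and risk increments are hypothetical rather than taken from the study.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value(frozenset) -> float`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical toy "model": each feature adds a fixed amount to a risk score,
# and two features interact when both are present.
def toy_value(present):
    score = 0.0
    score += 0.20 if "low_education" in present else 0.0
    score += 0.10 if "arthritis" in present else 0.0
    score += 0.05 if "short_sleep" in present else 0.0
    if {"arthritis", "short_sleep"} <= present:
        score += 0.04  # interaction term, shared equally by the two features
    return score


features = ["low_education", "arthritis", "short_sleep"]
phi = shapley_values(features, toy_value)
```

The attributions sum to the difference between the score with all features present and the score with none, which is the additivity property that makes SHAP values straightforward to read for a single prediction.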

Model selection and assessment

We selected three commonly used machine learning models—Random Forest, Gradient Boosting, and Boosted XGBoost—for comparison. Hyperparameter tuning was performed for the Boosted XGBoost model. Model calibration was conducted using CalibratedClassifierCV to adjust predicted probabilities. Cross-validation was then applied to ensure the stability and robustness of the evaluation results. The performance of the models was assessed using the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) as evaluation metrics. Additionally, Decision Curve Analysis (DCA) was employed to assess the clinical applicability of the models. The significance level was set at α = 0.05.
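A minimal sketch of probability calibration followed by cross-validated AUC is shown below. It assumes scikit-learn and uses `GradientBoostingClassifier` on synthetic data as a stand-in for the study's models. The sigmoid calibration method and the 5-fold settings are also assumptions, since the text does not report them.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Wrap the base learner so that its predicted probabilities are recalibrated
# on held-out folds (sigmoid scaling is an assumption here).
base = GradientBoostingClassifier(random_state=0)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)

# 5-fold cross-validated AUC of the calibrated model.
auc_scores = cross_val_score(calibrated, X, y, cv=5, scoring="roc_auc")
print(f"mean AUC = {auc_scores.mean():.3f} (sd {auc_scores.std():.3f})")
```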

Results

A total of 1,921 elderly individuals with SCD were included in this study, comprising 1,199 males (62.42%) and 722 females (37.58%). The participants’ ages ranged from 62 to 95 years, with a mean age of 70.95 ± 6.31 years. Regarding education level, 843 participants (43.88%) had never attended school, 554 (28.84%) had completed primary school, 337 (17.54%) had completed middle school, and 187 (9.73%) had completed high school or higher. The majority of participants were registered as living in rural areas (1,167, 60.75%) and were married (1,387, 72.20%). Additionally, 1,824 participants (94.95%) had pension insurance. Other details are provided in Supplementary Material 2.

Our study employed LASSO regression to identify key factors influencing the risk of depression. By applying L1 regularization, LASSO constrains the size of the regression coefficients and shrinks those of irrelevant features to zero, thereby performing feature selection and eliminating variables that are only weakly associated with depression risk. As shown in Fig. 1, feature selection was optimal when the regularization coefficient alpha was set to 0.0110. This value was determined through 10-fold cross-validation, and the key features associated with depression risk were then identified from their regression coefficients.

Table 1 presents the top ten features identified by the LASSO model along with their corresponding regression coefficients. These variables were selected from data collected at both time points (baseline and follow-up). The results indicate that life satisfaction, self-rated health, sleep duration, education level, and retirement status had negative regression coefficients and were associated with a lower risk of depression. Our findings therefore suggest that improving life satisfaction, conducting more comprehensive physical and mental health assessments, and raising educational levels may collectively help reduce the likelihood of depression.

In contrast, variables with positive regression coefficients, namely physical dysfunction, digestive system diseases, rural residency, arthritis, and reduced ability to perform daily activities, were associated with an increased risk of depression. This implies that individuals who have declining physical function, digestive disorders, or arthritis, who live in rural areas, or who have a reduced ability to carry out daily activities are more likely to develop depression.

Fig. 2 shows that the losses on both the training and validation datasets decrease as the number of iterations increases, indicating that the model continued to optimize during training. Figure 3 shows the calibration results of the models on the validation set. The calibration curves of the Random Forest, Gradient Boosting, and Boosted XGBoost models provide an intuitive demonstration of how well their predicted probabilities reflect the true probability distribution.

Figure 4 presents the results of the decision curve analysis, which compares the net benefit of the different models across threshold probabilities. The analysis reveals that the Boosted XGBoost model yields a higher net benefit than the other two models within the threshold range of 0 to 0.7, indicating that clinical decisions based on this model are more beneficial within this range.
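For readers unfamiliar with decision curve analysis, the quantity plotted on the y-axis is the net benefit at a threshold probability pt, conventionally defined as TP/n - (FP/n) × pt/(1 - pt). The sketch below computes it for a model and for the "treat all" reference strategy. The labels and predicted risks are made up for illustration and are not study data.

```python
import numpy as np


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    y_true = np.asarray(y_true)
    treat = np.asarray(y_prob) >= threshold
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat every patient regardless of predicted risk."""
    prevalence = np.mean(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Made-up labels and predicted risks.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05]
nb_model = net_benefit(y_true, y_prob, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
```

A decision curve repeats this calculation over a grid of thresholds; a model is useful at a given threshold when its net benefit exceeds both the "treat all" value and zero (the "treat none" strategy).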

Table 2 and Fig. 5A show the ROC curves and AUC values of the three machine learning models in the validation dataset (RF 0.859, 95% CI: 0.827–0.888; GBDT 0.873, 95% CI: 0.845–0.900; Boosted XGBoost 0.883, 95% CI: 0.856–0.910). Table 2 and Fig. 5B show the ROC curves and AUC values of the three models in the test set (RF 0.861, 95% CI: 0.819–0.900; GBDT 0.887, 95% CI: 0.851–0.920; Boosted XGBoost 0.893, 95% CI: 0.858–0.926). All three models achieved AUC values above 0.8 in the test set, indicating good discriminative ability.
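One common way to attach a 95% confidence interval to an AUC estimate is the percentile bootstrap, sketched below with NumPy on synthetic data. This is an assumption made for illustration: the text does not state how the reported intervals were obtained, and the function names and the number of resamples are hypothetical.

```python
import numpy as np


def auc(y_true, y_score):
    """AUC as the probability that a positive case outranks a negative one."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    # Ties between a positive and a negative score count as half a win.
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, y_score, n_boot=500, seed=0):
    """Percentile bootstrap 95% CI for the AUC (cases resampled with replacement)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample contains a single class; AUC is undefined
        stats.append(auc(y_true[idx], y_score[idx]))
    low, high = np.percentile(stats, [2.5, 97.5])
    return float(low), float(high)


# Synthetic labels and scores, for illustration only.
rng = np.random.default_rng(1)
y_demo = rng.integers(0, 2, size=200)
score_demo = y_demo + rng.normal(scale=1.0, size=200)
point = auc(y_demo, score_demo)
ci_low, ci_high = bootstrap_auc_ci(y_demo, score_demo)
```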

The Boosted XGBoost model demonstrated the best overall performance in the test set, achieving the highest AUC (0.893) and F1-score (0.814), which may reflect the iterative residual fitting of boosting. Its high specificity (0.840 in the test set) further confirmed robust discrimination of negative-class samples. However, its slightly lower sensitivity (0.797 in the test set) compared with RF may reflect Boosted XGBoost’s susceptibility to noise, potentially affecting minority-class detection. The random forest exhibited higher sensitivity, which may reflect the variance reduction achieved through majority-vote aggregation, particularly for identifying positive-class samples. To comprehensively evaluate predictive determinants, we conducted comparative SHAP interpretability analyses for both models (Figs. 6 and 7).
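The sensitivity, specificity, and F1-score reported above all derive from the four cells of a confusion matrix. The short sketch below shows the arithmetic. The counts are hypothetical and are not the study's confusion matrix.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity, precision and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of true positives that were detected
    specificity = tn / (tn + fp)   # share of true negatives correctly rejected
    precision = tp / (tp + fp)     # share of positive calls that were correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = classification_metrics(tp=80, fp=16, tn=84, fn=20)
```

Because F1 combines precision and sensitivity, a model can have the higher F1 while another has the higher sensitivity, which is the pattern described for Boosted XGBoost and RF.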

Notably, SHAP interpretability analysis identified five consensus core predictors in both Boosted XGBoost and random forest models. The key predictive factors included education level, digestive health status, arthritis diagnosis, residence location, and sleep duration. These variables emerged as the most critical factors associated with DS in elderly individuals with SCD.

Fig. 1 Relationship between regularization coefficient alpha and regression variable coefficients.

Table 1 Regression coefficients of influencing factors after feature selection.

Table 2 Evaluation of the performance of the three algorithms.

Fig. 2 Training and validation curve of the model from 2015 to 2018.

Fig. 3 Risk calibration curve comparison for different models.

Fig. 4 Decision curve analysis. The x-axis indicates the threshold probability of DS. The y-axis indicates the net benefit.

Fig. 5 (A/B) ROC curve comparison of different models.

Fig. 6 (A/B) Predictor importance from SHAP analysis. Features are ranked by their SHAP (SHapley Additive exPlanations) contribution to the predictions made by the Boosted XGBoost model.

Fig. 7 (A/B) Predictor importance from SHAP analysis. Features are ranked by their SHAP (SHapley Additive exPlanations) contribution to the predictions made by the Random Forest model.

Discussion

Using the 2015 and 2018 wave data from the CHARLS survey, this study analyzed the demographic variables, health status, and chronic medical history of 1,921 older adults with SCD and constructed a predictive model for depression risk in this population. Figure 8 illustrates the overall conceptual framework of this study, summarizing the accuracy of the predictive models, presenting the research findings, and highlighting the potential clinical applications of the models.

Fig. 8 Conceptual framework of discussion in this study.

Among the three machine learning models evaluated, both Boosted XGBoost and RF demonstrated distinct yet complementary advantages in predicting depression risk among older adults with SCD. Notably, education level emerged as the top-ranked predictor in SHAP analysis for both models.

These findings suggest that education level may play a crucial role in depression among individuals with SCD, which is consistent with results from a study conducted in Japan19. Patients with lower educational levels may lack sufficient cognitive reserve to cope with the challenges of cognitive decline, thus increasing the risk of depression. Moreover, individuals with lower education levels may experience exacerbated depressive symptoms due to limited social networks and a lack of health-related knowledge.

Furthermore, other studies have shown that educational level significantly predicts cognitive ability20. Theoretically, improving the educational attainment of less educated patients could help prevent cognitive deterioration. Therefore, interventions that promote education for the elderly could contribute to enhancing cognitive function in these patients. In conclusion, relevant institutions should emphasize late-life education for the elderly population. Providing specialized psychological care and guidance through social channels for individuals with low education and income levels is crucial. Additionally, social support through media campaigns, financial assistance, and other means can help reduce the risk of depression for those in need.

SHAP analysis revealed that arthritis ranked as the second most important predictor in the Boosted XGBoost model and third in the RF model, suggesting that arthritis is also a risk factor for DS in individuals with SCD. Chronic pain may contribute to depression risk through inflammatory mechanisms.

As a chronic condition, arthritis causes persistent pain and activity limitations, which over time restrict daily social interactions and engagement. This, in turn, can lead to negative emotions and feelings of loneliness, eventually manifesting as depressive symptoms. A study by Su et al., which used machine learning to predict the risk of depression in elderly patients over a two-year period, found that arthritis, along with other factors, plays a significant role in the onset of depression. Chronic disease patients are more prone to depression21,22, which implies that the presence of arthritis and other such conditions in patients with pre-existing depressive symptoms may exacerbate their mental health issues. Therefore, greater attention should be given to depression screening and treatment among individuals at risk of chronic diseases. Collaboration between government agencies, healthcare organizations, and local communities is crucial for patient rehabilitation, including providing effective disease management strategies, financial support, and psychological assistance. In conclusion, the close association between arthritis, depression, and other chronic diseases necessitates comprehensive disease management strategies. Our goal should be to reduce the depression risk in arthritis patients and improve their overall quality of life through a broad approach that integrates physiological, psychological, and social strategies.

The other three significant predictors were digestive health status, residence location, and sleep duration. There is a significant correlation between gastrointestinal abnormalities and the risk of DS in patients with SCD. One potential explanation is that individuals with digestive issues often experience prolonged discomfort and require long-term medication, leading to a decline in quality of life and negative impacts on mental health. A study conducted in the United States found that individuals reporting gastrointestinal, respiratory, and cardiovascular problems were more likely to exhibit depressive symptoms. This suggests that mental health issues resulting from these three chronic conditions may outweigh those associated with other chronic diseases23. Furthermore, changes in gastrointestinal metabolites under pathological conditions may influence brain activity through the gut-brain axis24. Maintaining a healthy digestive system helps prevent neuroinflammation, thereby protecting against cognitive decline, and may also be a key factor in slowing the onset and progression of depression.

Therefore, timely treatment and management of gastrointestinal diseases is essential to reduce DS in SCD patients. Additionally, healthcare professionals should provide comprehensive care plans, including psychological support, lifestyle guidance, and other measures to help patients manage the mental stress associated with chronic diseases. Emphasizing the link between gut and brain health is crucial in reducing the likelihood of depression and neuroinflammation in SCD patients.

The results of this study show that the incidence of DS is higher among rural patients, which is consistent with findings from similar studies25. This may be due to the limited healthcare resources, inadequate social support, and lower living standards in rural areas. Furthermore, the social and cultural environment in rural regions can influence patients’ understanding and perceptions of their illness, potentially worsening mental health issues. Therefore, while evaluating individual patient factors, relevant government agencies should focus on strengthening public services and healthcare systems in rural areas to reduce the likelihood of depression in these populations.

Sleep duration plays a critical role in the development of DS in patients with SCD. Studies have shown that both reduced sleep time and prolonged rest periods can trigger the onset of depressive symptoms26,27. A longitudinal study lasting more than two years on elderly individuals in community settings found that sleep disturbances are a significant factor contributing to persistent depression28. This may be due to the close relationship between sleep and mood regulation, where poor sleep quality directly affects brain activity, impairs stress adaptation, and increases the risk of depression. Indeed, related studies suggest that prolonged instability in sleep duration can impair cognitive function in older adults, potentially leading to cognitive decline29.

This highlights the strong association between poor sleep habits and cognitive deficits in the elderly30, where sleep disturbances can gradually contribute to cognitive decline, often resulting in DS. Neurological and clinical studies have shown that insufficient sleep disrupts the continuous flow of cerebrospinal fluid and interstitial fluid, which may lead to deterioration of brain function and permanent cognitive impairment31,32. Sleep deprivation may also increase tau protein synthesis and reduce levels of brain-derived neurotrophic factor, which stimulates the formation of new neurons and blood vessels33. Chronic sleep deprivation may impair synaptic plasticity in the hippocampus, potentially leading to cognitive decline34. Furthermore, disrupted sleep patterns may be indicative of daytime brain fatigue, adversely affecting multiple cognitive domains. Therefore, sleep problems in SCD patients may exacerbate their cognitive impairment and emotional distress, leading to the onset of depressive symptoms.

Interest in using virtual reality (VR) to manage the conditions of older patients has increased notably. With VR support, people with physical disabilities can engage in beneficial physical activities, which improves quality of life and helps reduce negative emotions. Investigating how VR technology can be effectively incorporated into the care of patients with physical impairments may therefore be an important research direction, aimed at alleviating loneliness, improving mental health, and promoting both physical and mental recovery.

In summary, the Random Forest (RF) model may be preferable for clinical screening applications requiring high sensitivity, whereas the Boosted XGBoost model offers superior stability when considering comprehensive performance metrics. Our findings underscore the importance of implementing early screening for high-risk populations in clinical and community health settings, particularly targeting individuals with lower educational attainment, digestive disorders, or arthritis comorbidities, through targeted health education programs to mitigate depression progression. Future research should focus on translating our predictive model into practical applications. Developing a web-based risk assessment tool by deploying our optimal model within a theoretical framework could facilitate widespread implementation. Such digital solutions would enable healthcare providers to conduct efficient depression risk evaluations and design personalized intervention strategies based on individualized risk profiles, ultimately advancing precision medicine approaches for depression prevention and management.

Limitations

This study has several limitations. First, the CHARLS data used in this research is specific to the elderly population in China, meaning that the findings may not be directly applicable to older adults in other countries or regions. Additionally, the effectiveness of machine learning models is intrinsically linked to data quality and its associated factors. Although machine learning techniques were employed for feature selection in this study, some potentially important variables may not have been fully explored. These factors could affect the reliability of the model’s predictions. Future research could improve the model’s predictability by incorporating more potential influencing factors, utilizing advanced machine learning techniques, and combining data from different countries.

Conclusion

In summary, this study utilized three machine learning algorithms to construct a predictive model for DS among older adults with SCD, based on CHARLS data. The development of this model highlights several key factors influencing depression risk and holds significant potential for the rapid identification and personalized management of depression. Future work should focus on further optimizing the model, with the aim of providing valuable insights for the management and intervention of mental health in older adults.