Introduction

In 2018, the International Perioperative Cognitive Expert Group formally introduced the term “perioperative neurocognitive disorders (PND)” to replace cognitive change associated with surgery and anesthesia. This inclusive nomenclature encompasses a spectrum of cognitive alterations, including preoperative cognitive decline, early postoperative delirium, delayed neurocognitive recovery, postoperative cognitive dysfunction, and various degrees of cognitive deficits ranging from mild to severe1. Perioperative neurocognitive disorders have lasting effects on patients and families beyond surgery. Many with mild cognitive issues develop Alzheimer’s within three years, raising dementia risk 12 times and increasing postoperative mortality significantly2,3. An editorial revealed that 65% of Australian patients fear lasting cognitive issues after surgery4. This highlights the need for more research into perioperative neurocognitive disorders due to heightened patient concern.

Perioperative neurocognitive disorders vary in prevalence among different surgeries. A major study found a notable 19.6% rate of cognitive impairment after total hip arthroplasty, ranking it highest outside cardiac procedures5. Another study found a significant 54% detection of neurocognitive issues in 99 patients after total joint replacement6. Evidently, cognitive impairment is prevalent throughout the entire joint replacement process.

Existing research have primarily focused on risk factors, with recent findings highlighting hronic illnesses, alcohol intake, auditory deficits, infections, high Apache II scores at admission, sedative and analgesic use, catheter insertion, and sleep issues as independent risks7,8. Furthermore, certain studies have delved into some laboratory values, revealing connections between perioperative cognitive dysfunction and serum lipid, uric acid, perioperative lactate, and inflammation-related factors. These links suggest potential predictive factors9,10,11. Systematic reviews also analyze clinical parameters associated with cognitive impairment after hip arthroplasty. Findings suggest elderliness, lower BMI, smoking, visual impairment, preoperative delirium, dementia, hypertension, and COPD as significant risk factors12. In addition, previous studies mainly focus on predictive factors in the general population, lacking tailored predictive analytics. There’s a crucial need for precise research to deliver personalized predictions for tailored interventions. Anticipating cognitive issues during surgery not only benefits patients but also helps healthcare professionals improve monitoring and care processes13.

Built upon the MIMIC-IV database, this study involves data from over 3000 hip arthroplasty patients, including 30+ clinical variables like demographics, lab tests, vital signs, complications, medications, and treatments. Its main goal is to create a predictive model for neurocognitive issues during hip arthroplasty using machine learning. After thorough validation and comparison, we’ll determine the most effective model for precise predictions. Also, we’ll develop an online tool to identify patients at risk of neurocognitive disorder during hip arthroplasty.

Methods

Data source

This study utilized patient data from the extensive MIMIC-IV database to develop and validate a predictive model for perioperative neurocognitive disorders following hip arthroplasty. Covering clinical data from over 190,000 patients and 450,000 hospital admissions at the Beth Israel Deaconess Medical Center between 2008 and 2019, the MIMIC-IV database is a valuable and publicly accessible clinical resource. Researchers can share data after completing a training program. Importantly, patient identifiers in MIMIC-IV have been meticulously removed to the fullest extent possible, ensuring the utmost protection of patient privacy.

Throughout this research, a team member, duly authorized to access the MIMIC database, employed the MIMIC-IV repository to compile a dataset that comprehensively encompasses inpatient data concerning perioperative neurocognitive disorders during individuals undergoing hip joint surgeries. This dataset incorporates an array of detailed information, encompassing demographics, laboratory findings, perioperative medication history, vital sign recordings, surgical interventions, treatment, disease diagnoses, medication regimens, and subsequent survival outcomes.

Participants and outcomes

We gathered information from all patients admitted for hip joint replacement or revision in the MIMIC-IV database, which covered surgeries conducted through various approaches, including both partial and total hip joint replacement procedures. The identification of these surgeries was based on codes from the 9th edition revision (ICD-9) and the 10th edition revision of the MIMIC-IV database (see Supplementary Appendix S1).

The main outcome of our study was on perioperative cognitive disorders occurring during hospitalization. In accordance with the guidelines outlined in the 5th edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5), we classified delirium, mild cognitive impairment (MCI) and dementia as neurocognitive disorders in our study14. Data related to cognitive disorders were extracted using both ICD-9 and ICD-10 codes (see Supplementary Appendix S2).

Predictive variables

Following a comprehensive review of previously identified risk factors associated with cognitive impairment, we extracted the relevant risk factor variables from the MIMIC database, along with other pertinent variables. The included variables encompassed various domains: (1) Demographic data, including age, gender, admission type, marital status, ICU stay time, and in-hospital mortality rate; (2) Vital signs, including heart rate, systolic blood pressure, diastolic blood pressure, respiratory rate, temperature, and pulse oximetry (SpO2); (3) Laboratory variables, including creatinine, urea nitrogen, hemoglobin, platelet count, international normalized ratio (INR), prothrombin time (PT), white blood cell count, lymphocytes, neutrophils, uric acid, total cholesterol, triglycerides, low-density lipoprotein (LDL), high-density lipoprotein (HDL), sodium, potassium, pH, glucose, and lactate; (4) Identification of complications based on ICD-9 and ICD-10 codes, including hypertension, diabetes, cardiovascular diseases, chronic bronchopulmonary disease, liver disease, kidney disease, craniocerebral disease, and sleep disorders(see Supplementary Appendix S3-S10); (5) Medications: including analgesics and sedatives (morphine, fentanyl, midazolam, propofol, etomidate, dexmedetomidine) and cardiovascular drugs (norepinephrine, epinephrine, nitroglycerin, nifedipine, captopril, hydrocortisone, dobutamine, and dopamine); (6) Scoring systems, including Simplified Acute Physiology Score II (SAPS II), Sequential Organ Failure Assessment (SOFA) score, and Glasgow Coma Scale (GCS); (7) Treatment strategies, including mechanical ventilation, renal replacement therapy, blood transfusion (see Supplementary Appendix S11). For laboratory data, vital signs, and related information, we opted for the earliest available results or the first postoperative results.

Data acquisition and missing management

In this study, we extracted various patient data from the MIMIC-IV database. Demographic information was obtained from the admissions and patients tables, while laboratory results came from the lab events table. We collected complications data from the “diagnoses_icd” table and medication records from the prescriptions table. Scores for the Sequential Organ Failure Assessment (SOFA), Simplified Acute Physiology Score II (SAPS II), and Glasgow Coma Scale (GCS) were retrieved from the corresponding derived tables: sofa, SAPS II, and GCS. To identify patients with perioperative cognitive issues, we queried the “d_icd_diagnoses” and “diagnoses_icd” tables. Each record in the MIMIC-IV database is uniquely identified by the “hadm_id” (hospital admission ID), which we used to link the different data sources for further analysis.

During data mining, we obtained 62 variables from the MIMIC database. Two variables lacked data and were excluded. Variables with over 30% missing data were also omitted. Missing values in continuous variables were filled with median values, and categorical variables were encoded using ordinal techniques. Consequently, 27 variables were removed during model building, including parameters like heart rate, blood pressure, respiratory rate, oxygen saturation, temperature, and various medications and treatments such as captopril, dobutamine, dopamine, dexmedetomidine, epinephrine, etomidate, fentanyl, hydrocortisone, midazolam, morphine, nifedipine, nitroglycerin, propofol, mechanical ventilation, renal replacement therapy, and transfusion.

Sample size

In the MIMIC-IV database, 3,292 patients underwent hip arthroplasty, with 331 developing perioperative neurocognitive disorders. This subset formed our study cohort, which was then randomly split into a development group and a validation group in a 7:3 ratio. To improve the stability and robustness of model performance, we incorporated fivefold cross-validation in the model development and evaluation process, addressing potential variability due to random seed selection. Adhering to the guideline advocating for a minimum of 10 occurrences per predictive variable in machine learning algorithms, the inclusion of 3,292 patients in our study cohort met the recommended sample size criteria for rigorous statistical analysis15.

Statistical analyses

The normality of the data was examined through the Shapiro–Wilk test. Continuous variables are presented as mean ± standard deviation for normally distributed data, or as median [P25, P75] for non-normally distributed data, while categorical variables are expressed as frequencies (percentages). To assess disparities between the PND and non-PND groups within both the development and validation sets, the Pearson chi-square test was employed to analyze distinctions among categorical variable groups. For continuous variables exhibiting non-normal distribution or heterogeneity, the Mann–Whitney U test or Kruskal–Wallis test was utilized accordingly.

Model development and validation

In this study, we developed five models employing various methodologies, encompassing multiple logistic regression (MLR), artificial neural network (ANN), Naive Bayes (NB), support vector machine (SVM), and decision tree (XgBoost) algorithms. For the MLR and SVM models, we applied the Least Absolute Shrinkage and Selection Operator (LASSO) regression for variable selection, with λ set at 1. This procedure led to the reduction of variables from 35 to 22 (see Supplementary Appendix S12). Subsequently, these selected predictive variables were utilized as inputs for the MLR and SVM to facilitate the machine learning process aimed at identifying pivotal features. Conversely, in the case of the remaining models developed using different algorithms, feature selection involved inputting all predictive variables, allowing the algorithms to autonomously discern significant features.

We conducted feature analysis plots and generated Receiver Operating Characteristic (ROC) curves for the five predictive models. The Area Under the Curve (AUC) from the ROC curves was evaluated to compare their performance. Additionally, sensitivity, specificity, and F1 score were assessed for each model on both the development and validation sets to ascertain their classification efficacy. Calibration curves were constructed to examine the alignment between predicted probabilities and actual outcomes in both datasets. Furthermore, the Brier score and the index of prediction accuracy (IPA) were computed to compare the calibration and prediction capabilities of the models. Finally, Decision Curve Analysis (DCA) plots were generated to estimate and compare the potential clinical utility of the models.

Results

Details of included cases and correlation among variables

The MIMIC-IV database encompasses records of 3,272 patients who underwent either hip arthroplasty or revision surgery. Among these, 330 cases experienced perioperative neurocognitive disorders (PND), while 2,942 cases did not. All cases were incorporated into the final analysis, comprising 264 PND cases and 2,353 non-PND cases in the training set, and 66 PND cases and 589 non-PND cases in the validation set. Table 1 provides a comprehensive overview of the characteristics exhibited by PND and non-PND patients across distinct datasets.

Table 1 The characteristics of PND and non-PND patients from train set and test set.

Correlation analysis was undertaken on all variables included in the study. The interrelations among these variables are depicted graphically as a heatmap in Fig. 1. It is evident from the heatmap that there exists a correlation between high density lipoprotein (HDL), low density lipoprotein (LDL), and total cholesterol. Furthermore, a statistically significant correlation is observed between the percentages of lymphocytes and neutrophils.

Fig. 1
figure 1

The relationships among the included variables.

The key features of models established by different algorithms

Figure 2 illustrates the essential features and their weight rankings in models established by each algorithm. Notably, models developed through different algorithms prioritize distinct key features. In the multivariate logistic regression model (Fig. 2a), the top 10 crucial variables include age, craniocerebral disease, serum creatinine, triglycerides, hemoglobin, international normalized ratio (INR), prothrombin time (PT), SOFA score, lymphocytes, and neutrophils. For the ANN predictive model (Fig. 2b), the algorithm assigns the highest weights to the following 10 variables: age, gender, serum creatinine, GCS score, and whether the patient has diabetes, hypertension, chronic disease, kidney disease, craniocerebral disease, or sleep disorder. Support vector machines (SVM), acclaimed for their classification prowess, stand out in delineating data into distinct categories by identifying the optimal hyperplane. In our SVM-based predictive model (Fig. 2c), the algorithm identifies the top 10 most crucial features, such as age, hemoglobin, SOFA score, sodium, potassium, PT, and indicators of craniocerebral disease, hypertension, liver disease, or kidney disease.

Fig. 2
figure 2

The key features of models established by different algorithms.

In the Naive Bayes model, the conditional probabilities of each feature under different categories are derived from the training data. By analyzing these logarithmic probabilities, the algorithm can assess the contribution of each feature to the classification task. Figure 2d illustrates the top 10 most crucial features for the occurrence of PND and non-PND cases in the Naive Bayes model. For PND cases, the most important features include hemoglobin, sodium, craniocerebral disease, age, GCS score, pH, neutrophils, total cholesterol, HDL, and LDL. Conversely, for non-PND cases, the top 10 features comprise potassium, age, sodium, pH, neutrophils, GCS score, total cholesterol, HDL, LDL, and the patient’s admission type. In practice, we obtain logarithmic probabilities for each feature under each category. A higher logarithmic probability for a specific category indicates a higher likelihood of the feature under that given category. In the decision tree (Xgboost) model, each node represents a feature, with branches signifying different values of that feature, and leaf nodes indicating the model’s prediction for input samples (Fig. 2e). Various splitting criteria, such as the Gini coefficient, are employed to determine the optimal feature for reducing data uncertainty through splitting. This process occurs at each node of the tree. Key nodes in the decision tree include age, the presence of craniocerebral disease, neutrophils (%), lactate, GCS score, admission type, the presence of chronic bronchopulmonary disease, hemoglobin, white blood cell count, and serum sodium. These nodes play pivotal roles in guiding the model’s predictions by effectively partitioning the data based on their respective features.

Model performance and comparation

As illustrated in Fig. 3 and summarized in Table 2, the artificial neural network (ANN)-based model exhibits superior performance compared to others on the training dataset, demonstrating a larger area under the Receiver Operating Characteristic (ROC) curve (AUC: 0.954; 95% CI 0.947–0.961). With an accuracy of 93.8%, precision of 75.8%, and an F1-score of 65.7%, the ANN model showcases notable predictive capabilities, surpassing models founded on multiple logistic regression (MLR), decision tree, support vector machine (SVM), and Naive Bayes (NB). However, it is noteworthy that the SVM-based model also presents commendable performance, notably excelling in accuracy and precision metrics (AUC: 0.879; 95% CI: 0.874–0.884; accuracy: 91.6%, precision: 65.0%, F1-score: 44.1%). These findings indicate that both artificial neural network (ANN) and support vector machine (SVM) models exhibit robust potential for predictive accuracy, with the ANN model demonstrating slightly superior overall performance. Moreover, upon assessing model performance on the validation dataset, the SVM-derived model emerges as noteworthy (AUC: 0.865; 95% CI 0.834–0.897), surpassing alternative algorithms with an accuracy of 91.1%, precision of 61.2%, and an F1-score of 41.7%.

Fig. 3
figure 3

ROC curves based on different models in the training and validation sets.

Table 2 Models performance by different algorithms.

In terms of calibration, we generated calibration plots and computed Brier Scores along with the index of prediction accuracy (IPA). Table 2 illustrates that the Brier Scores for all five predictive models consistently remain below 0.1, indicating outstanding accuracy. Notably, on the training dataset, models based on the decision tree, ANN and SVM algorithms show lower Brier Scores and higher IPA values compared to others (decision tree: 0.060; SVM: 0.064; ANN: 0.048; IPA, decision tree: 77.4%; SVM: 75.7%; ANN: 90.8%). Similarly, on the validation dataset, models derived from the decision tree and SVM algorithms demonstrate relatively superior accuracy (decision tree: 0.070; SVM: 0.067; ANN: 0.071; IPA, decision tree: 68.9%; SVM: 73.0%; ANN: 71.4%). These results underscore the remarkable accuracy and calibration performance of the ANN and SVM models across both training and validation datasets (Fig. 4a, c).

Fig. 4
figure 4

Calibration plots and Decision Curve Analysis based on different models in the training and validation sets.

Based on the training dataset, we constructed Decision Curve Analysis (DCA) plots for the five models and performed a comparative assessment. The DCA plots reveal that within the predictive risk threshold range of 0% to 80%, models based on artificial neural networks (ANN) outperform others. This indicates that, in this risk range, patients would benefit more from the use of ANN-based predictive models than from alternative approaches.

The decreasing trend observed in the DCA lines suggests that, as the threshold probability increases beyond a certain point, the net benefit of using the predictive models diminishes. This implies that higher threshold probabilities may lead to more false positive predictions, thereby reducing the model’s clinical utility. The points where the DCA curves cross the x- and y-axes are critical: when a model’s curve crosses the y-axis, it shows the net benefit of treating all patients (i.e., the model has no predictive value). When the curve crosses the x-axis, it suggests the threshold probability beyond which the model’s predictions are no longer useful, as the net benefit becomes negative. Similar findings are observed in the validation dataset, where for predictive risk thresholds between 0 and 80%, ANN-based models again show superior performance compared to other models (Fig. 4b, d).

Constructure online tool

In summary, considering the model’s performance, DCA, and predictive accuracy, the ANN-based predictive model stands out as superior, showing greater potential for clinical use. Using this ANN model, we’ve created an online tool (https://xyyy.pythonanywhere.com/) to identify high-risk patients for PND during hip arthroplasty.

Discussion

In this investigation, we developed a rapid and convenient online tool based on a dataset of 3272 patients undergoing hip arthroplasty. The primary objective of this tool is to effectively identify high-risk patients susceptible to perioperative cognitive disorders (PND), thereby aiding in their perioperative management. Our findings underscore the significance of various clinical factors, including craniocerebral diseases, age, gender, hemoglobin, liver disease, kidney disease, admission type, GCS, sleep disorder and hypertension as pivotal features for predicting the onset of PND in patients undergoing hip joint replacement surgery.

Age and gender are widely acknowledged as a crucial risk factor for the occurrence of PND in patients during the perioperative period16,17,18. Our study’s advanced model confirms that both factors significantly influence PND incidence, with older patients facing a higher risk compared to younger ones, and female patients exhibiting a greater likelihood of developing PND than their male counterparts. Additionally, our analysis identifies craniocerebral diseases as the primary predictors of PND, supporting clinical findings that vascular mechanisms are integral to PND-associated cognitive issues19. The Glasgow Coma Scale (GCS), essential for assessing cognitive functions and mental states, is also highlighted as a pivotal predictor of PND according to our data analysis20. Furthermore, our findings indicate that renal and hepatic diseases, as well as sleep disorders, substantially impact PND risk through mechanisms like disrupted drug metabolism and toxin accumulation. Meta-analyses of existing cohort studies establish a strong correlation between chronic kidney disease and functional brain disorders, with individuals having elevated urine albumin-to-creatinine ratios facing increased risks of Alzheimer’s disease and vascular dementia21. Additionally, previous research underscores a significant rise in sleep disorders among individuals experiencing cognitive decline, reinforcing the profound link between sleep disturbances and cognitive deterioration22.

Observational studies have shown that individuals with hypertension are more susceptible to cognitive dysfunctions23. Extensive research, including retrospective analyses and meta-analyses, has identified hemoglobin levels as a predictive factor for postoperative delirium in orthopedic patients, with lower hemoglobin levels associated with an increased likelihood of this condition24,25,26. Consistent with prior studies, our results affirm that hypertension and low hemoglobin are key predictors of PND.

In the past, several studies have proposed similar models. However, these studies were based on a limited amount of data for model development. These studies had small sample sizes and incorporated numerous variables, which could potentially introduce statistical biases27,28,29. Our study is built upon an extensive dataset comprising 3272 cases, from which we meticulously curated a comprehensive set of variables, totaling over 30. These variables were carefully selected to facilitate both screening and deep learning processes, aiming to develop a robust model targeting perioperative cognitive disorders in patients undergoing hip replacement or revision surgery. Furthermore, we employed a diverse array of five different methods for variable selection, including filter, wrapper, and embedded techniques, in contrast to the prevalent reliance on LASSO regression seen in most studies27,30. Lastly, our study employed a variety of algorithms for model prediction and construction. Through thorough comparisons, we identified and developed the optimal model (ANN-based predictive model). Notably, the resulting model is not only highly effective but also straightforward, convenient, and user-friendly.

Nonetheless, our research does have some limitations. Firstly, being a retrospective study, it is contingent upon the completeness of the collected data. Initially, we selected over 60 variables; however, due to a high rate of missing data for certain variables, deletions were necessary. Furthermore, additional validation using external datasets is warranted for this study.

Conclusion

In summary, we have developed a simple and convenient big data-based visualization online tool for perioperative screening of high-risk patients for postoperative neurocognitive disorders (PND) in hip replacement or revision surgeries. This tool aims to assist healthcare professionals in making decisions regarding perioperative management strategies.