Abstract
Cranioplasty is associated with a substantial burden of postoperative complications. In this multicenter study, we developed a machine learning–based clinical decision-support tool to predict the risk of postoperative complications following cranioplasty. A set of nine features was selected for model development. Among the 15 algorithms evaluated, the random forest model demonstrated the best overall performance and was validated on data from both spatial and temporal external cohorts (AUROC = 0.949, internal cross-validation; 0.930, geographical validation; and 0.932, temporal validation). Subgroup analyses by age and sex demonstrated consistently high discriminative performance (lowest AUROC = 0.927) and good calibration (O/E ratio = 1.16, 95% CI: 0.97–1.40). Analysis of causal effects of modifiable intraoperative variables on postoperative complications, with diverse counterfactual explanations and causal inference methods, including double machine learning and the T-learner framework, revealed a protective effect of subcutaneous negative-pressure drainage (ATE = −0.241) and titanium mesh (ATE = −0.191). Finally, we present the model as an accessible web-based tool for individualized, real-time clinical decision-making (http://www.cranioplastycomplicationprediction.top). These findings provide a practical framework for postoperative risk stratification and support the optimization of intraoperative decision-making in cranioplasty.
Similar content being viewed by others
Introduction
Decompressive craniectomy is a well-established neurosurgical procedure used to alleviate elevated intracranial pressure caused by conditions such as traumatic brain injury, cerebral infarction, and intracranial hemorrhage (ICH)1,2,3. Although decompressive craniectomy is lifesaving, it results in a cranial defect that leaves the brain vulnerable to mechanical injuries and physiological disturbances4,5. To address these risks, cranioplasty is routinely performed to restore cranial integrity. Especially for patients who have undergone hemicraniectomy, cranioplasty is ultimately required, as failure to restore cranial integrity can lead to sinking skin flap syndrome or the syndrome of the trephined. In addition to restoring structural protection, cranioplasty has been shown to improve neurological function and enhance esthetic outcomes, thereby facilitating recovery and promoting psychological well-being6.
Although cranioplasty is considered a technically straightforward procedure, it is associated with a relatively high incidence of postoperative complications7. Complications include infection, ICH, hydrocephalus, seizures, fluid collection, and pneumocephalus. These adverse events not only prolong hospital stays and increase healthcare costs but also impair recovery and quality of life8.
Given these risks, early identification of high-risk patients is crucial for facilitating more efficient use of clinical resources, guiding perioperative management, and ultimately improving outcomes. Currently, there are numerous studies that have investigated the factors associated with postoperative complications following cranioplasty9,10,11. However, these studies have primarily focused on identifying risk factors rather than developing predictive models to forecast complications. As a result, reliable tools for predicting postoperative complications are still lacking, leaving clinicians to rely on empirical judgment and reactive strategies. This limitation substantially reduces the clinical applicability of prior findings. It also underscores the need for robust and interpretable predictive models to support individualized patient care.
In this study, we aimed to develop and validate an explainable machine learning (ML)–based clinical tool for predicting postoperative complications following cranioplasty. By integrating causal and interpretable machine learning into clinical workflows, this work seeks to bridge the gap between predictive modeling and actionable decision support in cranioplasty, ultimately advancing personalized neurosurgical care and improving postoperative outcomes.
Results
Population characteristics
The study design is illustrated in Fig. 1, and baseline characteristics of the derivation cohort (n = 789), geographical external validation cohort (n = 394), and temporal external validation cohort (n = 185) are summarized in Table 1. Postoperative complications occurred in 205 (26.0%) patients in the derivation cohort, 115 (29.2%) in the geographical external validation cohort, and 51 (27.6%) in the temporal external validation cohort. Reoperations due to severe complications were required in 13 patients (1.6%) in the derivation cohort, 8 patients (2.0%) in the geographical external validation cohort, and none in the temporal external validation cohort.
ML machine learning, ROC curve receiver operating characteristic curve, SHAP SHapley Additive exPlanations, PDP partial dependence plot, DiCE diverse counterfactual explanations, PR curve precision-recall curve, DML double machine learning.
Feature selection and model performance comparison
The feature selection process using the Boruta algorithm is illustrated in Fig. 2a. After 500 iterations, 13 of the 26 input features were confirmed as important, with 2 additional features marked as tentative. The selection procedures for LASSO, RF-RFE, and GA are presented in Fig. S6. A total of 9 overlapping predictors identified across all four methods were retained for model construction (Fig. 2b), including surgery time, skull defect area, Glasgow Coma Scale (GCS), preoperative fluid collections, preoperative infection, use of subcutaneous negative-pressure drainage, titanium mesh, preoperative ventriculoperitoneal (V-P) shunt, and the time interval between decompressive craniectomy and cranioplasty. The distributions of the nine selected predictors are shown in Fig. S7.
a Feature importance ridge plot for variable selection based on Boruta. b Variable Venn diagram screened by four methods. (c) Bubble plots for comparing machine learning model performance based on the AB_score. DC-CP interval the time interval (in months) between decompressive craniectomy (DC) and cranioplasty (CP), GOS Glasgow Outcome Scale, GCS Glasgow Coma Scale, BI Barthel Index, HBV hepatitis B virus infection, CHD coronary heart disease, N-P drainage postoperative placement of subcutaneous negative-pressure drainage tubes, Pre-op V-P preoperative ventriculoperitoneal shunt status, Pre-op preoperative, GAM generalized additive model, LR logistic regression, GBDT gradient-boosted decision tree, KNN k-nearest neighbor, LightGBM light gradient boosting machine, RotF rotation forest, XGBoost extreme gradient boosting, NB naive Bayes, AdaBoost adaptive boosting, MLP multilayer perceptron, SVM support vector machine, DT decision tree, ExtraTrees extremely randomized trees, GPC Gaussian process classifier, RF random forest.
Based on the selected features, 15 ML models were developed to predict postoperative complications after cranioplasty. Figure 2c presents a comparison of the \({\rm{A}}{{\rm{B}}}_{{\rm{score}}}\) across different ML models. The Random Forest (RF) model achieved the highest \({\rm{A}}{{\rm{B}}}_{{\rm{score}}}\)(0.929), outperforming all other algorithms and was selected as the final model for further evaluation. The optimal cutoff of the final RF model for predicting postoperative complications was 0.366 (Table S4). At this cutoff, the corresponding sensitivity, specificity, true/false positives (TP/FP), true/false negatives (TN/FN), positive predictive value (PPV), negative predictive value (NPV), and F1 score are reported in Table S5.
Model evaluation and subgroup performance analysis
We comprehensively evaluated the final RF model in terms of both discrimination and calibration. The AUROC was 0.949 (95% CI: 0.949–0.950) in the internal cross-validation, and 0.930 (95% CI: 0.929–0.931) in the graphical external validation cohort (Fig. 3a). Consistent with AUROC, the area under the precision-recall curve (AUPRC) remained high, with values of 0.880 (95% CI: 0.878–0.880) and 0.870 (95% CI: 0.869–0.872), respectively (Fig. 3b).
a Receiver operating characteristic curves. b Precision-recall curves. c Decision curve analysis. d Calibration curves. e Discrimination performance across subgroups in the geographical external validation cohort. The dotted line represents the mean AUROC of the overall cohort. f Calibration performance across subgroups in the geographical external validation cohort. AUROC area under the receiver operating characteristic curve, AUPRC area under the precision–recall curve. O observed. E expected.
Clinical utility was evaluated using DCA (Fig. 3c). Across a wide range of threshold probabilities, the model consistently provided a greater net benefit than both the treat-all and treat-none strategies. Calibration performance is shown in Fig. 3d. The predicted probabilities demonstrated overall good agreement with the observed outcomes. Slight overestimation was noted at higher predicted risk levels in the geographical external validation cohort.
To further understand the predictive performance of the final model in specific patient populations, subgroup analyses were conducted based on age (<40 and ≥40 years) and sex (male, female) in the geographical external validation cohort. Although the AUROC was slightly reduced in ≥40 years patients and male patients compared with the overall cohort, their AUROCs were still higher than 0.92 (Fig. 3e). Calibration in subgroups was evaluated using a forest plot of observed-to-expected (O/E) ratios (Fig. 3f). The overall value was 1.16 (95% CI:0.97–1.40) and there was no significant prediction bias compared with the true outcome in all subgroups.
Complication-specific model development
Given that individual complications may involve distinct mechanisms and require personalized management, we further constructed dedicated ML models for major complications, including infection, pneumocephalus, fluid collection, hydrocephalus, seizures, intracranial hemorrhage, and reoperations. Final model selection was guided by a comprehensive assessment of \(A{B}_{\mathrm{score}}\), calibration, and clinical utility. The optimal models were rotation forest (RotF) for infection and pneumocephalus; RF for fluid collection and reoperations; generalized additive model (GAM) for hydrocephalus; and logistic regression for seizures and intracranial hemorrhage. Performance metrics are summarized in Table S6, with calibration curves, decision curve analyses, and SHapley Additive exPlanations (SHAP) summary plots presented in Figs. S8–S10.
Temporal external validation of the models
We further evaluated the temporal generalizability of both the overall complication model and the complication-specific models using an independent temporal validation cohort. The final overall complication model achieved an AUROC of 0.932 and an overall accuracy of 0.838 (Fig. 4a). Calibration analysis (Fig. 4b) showed good agreement between predicted and observed risks, with the calibration curve closely following the ideal diagonal line. Decision curve analysis (Fig. 4c) demonstrated net clinical benefit across a wide range of threshold probabilities. Subgroup analysis (Fig. 4d) further demonstrated the model’s stable performance across age and sex subgroups, with AUROC values exceeding 0.9 and accuracy exceeding 0.8 in all subgroups.
a Radar plot of five key evaluation metrics. b Calibration curves. c Decision curve analysis. d Bubble chart of subgroup performance. The X-axis indicates the area under the receiver operating characteristic curve (AUROC), while the Y-axis represents different subgroups (age and sex). The size of each bubble reflects the sample size, and the color gradient represents the prediction accuracy, with lighter colors indicating higher accuracy. Numerical labels within the bubbles denote the sample size of each subgroup. AUROC area under the receiver operating characteristic curve, AUPRC,area under the precision–recall curve.
Performance metrics for complication-specific models are provided in Table S7. AUROC values were consistently high across all outcomes (0.851–0.986), indicating strong overall discrimination. However, for intracranial hemorrhage, hydrocephalus, and seizures, the models showed comparatively lower AUPRC values (0.536, 0.574, and 0.600, respectively) and a higher proportion of false-positive predictions.
Model interpretability and feature interaction analysis of the overall complication model
We employed SHAP to assess feature contributions at both the global and individual levels. Global feature importance is shown in Fig. 5a, where contributions were quantified using mean SHAP values and ranked in descending order. Surgery time, skull defect area, and GCS were identified as the top three predictors. Local explanations were further used to illustrate how individual predictions were generated based on patient-specific feature values. Corresponding visualizations, including force plots, decision plots, and waterfall plots, are presented in Figs. S11 and S12.
a SHAP summary plot (bee swarm). This plot evaluates the contribution of each feature to the model using mean SHAP values, displayed in descending order, with color indicating feature value (red = high, blue = low). b SHAP interaction heatmap. This plot quantifies pairwise feature interactions, where darker colors represent stronger interaction effects. c Three-dimensional partial dependence plot (3D PDP). This plot visualizes the joint effect of surgery time, GCS and skull defect area on the predicted probability. Color intensity ranging from dark to light represents the predicted probability of postoperative complications, with lighter colors indicating higher risk. d Radar plot of counterfactual explanations. This plot presents two counterfactual scenarios generated by DiCE. e Bar plot of ATE estimates. The plot shows the estimated ATEs for two modifiable surgical variables: N–P drainage and use of titanium mesh as the cranioplasty material. f Violin plot of CATE estimates. The plot presents subgroup-level CATE estimates for N–P drainage and use of titanium mesh as the cranioplasty material, stratified by age and sex. DC-CP interval the time interval (in months) between decompressive craniectomy (DC) and cranioplasty (CP), GCS Glasgow Coma Scale, BI Barthel Index, N-P drainage postoperative placement of subcutaneous negative-pressure drainage tubes, Pre-op V-P preoperative ventriculoperitoneal shunt status, Pre-op preoperative, DiCE Diverse Counterfactual Explanations, ATE average treatment effect, CATE conditional average treatment effect.
To further explore potential interactions among features, we generated a SHAP interaction heatmap (Fig. 5b). Skull defect area, GCS, and surgery time exhibited high self-interaction SHAP values. Their marginal effects on the predicted risk of postoperative complications were examined using one-dimensional (1D) partial dependence plots (PDPs) (Fig. S13). And their joint effects were visualized using a three-dimensional (3D) PDP (Fig. 5c). The plot showed that low GCS and a larger skull defect area were associated with a higher predicted risk of postoperative complications. This effect was more pronounced in patients with longer surgery time than in those with shorter surgery time.
Counterfactual analysis and causal inference of the modifiable surgical variables
Identifying modifiable surgical variables is of particular importance in neurosurgical practice. In our model, the use of subcutaneous negative-pressure (N-P) drainage and titanium mesh in cranioplasty was associated with a lower predicted risk of postoperative complications (Fig. 5a).
To examine whether changes to these factors could influence model predictions, we performed counterfactual analysis using the Diverse Counterfactual Explanations (DiCE) method. The results showed that modifying either the drainage method or the cranioplasty material alone was sufficient to convert a high-risk prediction into a low-risk outcome in selected patients (Fig. 5d).
To further validate these findings, causal effects of these modifiable surgical factors on postoperative complication risk were estimated using Double Machine Learning (DML). Both N-P drainage and titanium mesh were associated with reduced predicted complication risk, with average treatment effects (ATEs) of −0.241 (95% CI: −0.35 to −0.132) and −0.191 (95% CI: −0.341 to −0.041), respectively (Fig. 5e). Subgroup-specific conditional average treatment effects (CATEs) were subsequently estimated to assess heterogeneity in treatment response (Fig. 5f). The protective effects of titanium mesh and N-P drainage were observed in most age and sex subgroups. However, among males over 40 years, the estimated CATE for N-P drainage exceeded zero (CATE = 0.009), indicating no protective benefit in this subgroup. Detailed CATE results are provided in Table S9.
Finally, sensitivity analysis was conducted to assess the robustness of the estimated ATEs and CATEs. Random noise variables were introduced into the estimation process, and treatment effects were re-estimated. No statistically significant differences were observed between the original and perturbed estimates (all p > 0.05; Table S8, S9).
Accessible web application for clinical utility
The overall and complication-specific models were integrated into a web-based application with eight prediction modules (Fig. S14). Users can input the required feature values under the relevant module, and the application will automatically calculate and display the predicted risk for the selected complication (Fig. S15). The web application can be accessed online at the following link: http://www.cranioplastycomplicationprediction.top/.
In addition, we translated the core methodologies of this study into a generalizable methodological framework platform (Fig. S16; https://surgical-complication-risk-prediction.streamlit.app/). This platform provides a reproducible pipeline for predicting postoperative complications across diverse surgical procedures.
Discussion
ML techniques have been widely used in modern medical research due to their ability to process high-dimensional data and capture complex, nonlinear interactions among variables12. They have demonstrated strong predictive performance in multiple clinically challenging domains, including acute critical illness risk stratification, postoperative functional outcome prediction, and in-hospital mortality prediction13,14,15. However, their application in cranioplasty remains limited. To our knowledge, this is the first study to systematically compare 15 machine learning algorithms for predicting postoperative complications following cranioplasty, based on large-scale multicenter data.
Individualized assessment of postoperative complication risk represents an important component of perioperative care in patients undergoing cranioplasty. By identifying patients at elevated risk, the proposed model may support more targeted postoperative management strategies. For example, patients predicted to be at increased risk of postoperative seizures may benefit from closer neurophysiological monitoring or consideration of prophylactic antiepileptic therapy. Similarly, for patients at increased risk of postoperative fluid collection, drainage strategies such as prolonging drainage duration may be adjusted to reduce the likelihood of this complication.
As highlighted by Thomas H. Shin et al.16, the goal of ML in surgical outcomes research is not simply to improve risk prediction, but to identify underrecognized modifiable risk factors. For neurosurgeons, a critical question is whether optimizing surgical strategies can effectively reduce the risk of complications following cranioplasty. In this context, causal machine learning offers a potential solution.
Unlike traditional ML, which identifies high-risk patients without informing specific actions, Causal ML aims to answer “what if” questions and quantify the effects of potential interventions using data from randomized controlled trials (RCTs) and real-world sources such as clinical registries and EMRs17. By using DML and the T-learner framework, we found that both N-P drainage and the use of titanium mesh exhibited protective effects against postoperative complications following cranioplasty. Notably, N-P drainage is a novel and modifiable surgical factor that has not been previously reported in literature. The continuous evacuation of postoperative blood and exudate may reduce the risk of hematoma or fluid collection, thereby contributing to its protective effect.
Furthermore, the choice of cranioplasty material, particularly the use of titanium mesh, warrants further discussion in the context of existing evidence. Rosenthal et al.18 reported that complication rates with polyetheretherketone (PEEK) implants are comparable to those of other materials. In contrast, Rosinski et al.19 found higher infection rates in patients with PEEK custom implants than in those with titanium meshes. However, the limited sample sizes of these studies weaken their conclusions. Using a large multicenter dataset and causal inference methods, we found that the use of titanium mesh as a cranioplasty material was associated with a lower risk of overall postoperative complications. The lower complication rate associated with titanium mesh may be attributed to its superior biocompatibility and inherent antibacterial properties20, which help reduce the risk of infection and inflammation. In addition, it offers good intraoperative malleability, allowing for manual trimming and contouring to achieve better conformity to the defect site and enhanced implant stability. This adaptability may help reduce dead space and consequently lower the risk of postoperative fluid collection.
In practical terms, this protective association is particularly relevant for patients with impaired neurological status before surgery, as minimizing postoperative complications is essential in preventing further deterioration. In such scenarios, when synthetic materials such as PEEK, titanium mesh, and polymethyl methacrylate–hydroxyapatite composite cement are all viable options, neurosurgeons may consider titanium mesh as a more favorable choice.
Several previous studies have also attempted to develop prediction models in the context of cranioplasty. Kimchi, G et al.21 used survival analysis to predict the probability of post-cranioplasty infection and identified preoperative neurological disability as the strongest predictor. Klieverik, V.M et al.22 developed a Cox-based model to predict cranioplasty implant survival and reported several clinical determinants. Lu, Y et al.23 proposed a logistic-regression model incorporating modified brain-collapse ratio with comorbidity burden to predict postoperative complications. These earlier studies highlight growing interest in predictive modeling for cranioplasty. However, they were constrained by small single-center samples, modest predictive performance, reliance on traditional regression methods, and the absence of external validation, limiting their clinical applicability. In contrast, our study addressed these limitations by leveraging a large multicenter dataset, systematically comparing multiple machine-learning algorithms, and validating the final model across both geographical and temporal cohorts. Moreover, by applying causal ML methods, we were able to identify potentially modifiable surgical variables, thereby providing guidance for surgical decision-making.
Composite outcomes are commonly used in surgical outcomes research to capture the overall clinical impact of heterogeneous postoperative complications. In the context of cranioplasty, overall complication outcomes can inform perioperative management, guide postoperative monitoring strategies, and reflect overall patient outcomes. Accordingly, prior studies have frequently adopted an overall complication endpoint to investigate risk factors associated with postoperative complications9,10. This analytical paradigm is not unique to cranioplasty and has been widely applied across postoperative complication and adverse event research. For instance, Chen et al.24 combined multiple postoperative pulmonary complications into a composite endpoint to develop a risk prediction model. Similarly, Mahajan et al.25 defined major adverse cardiac and cerebrovascular events as a composite outcome comprising postoperative type I or II myocardial infarction, cardiogenic shock or acute heart failure, unstable angina, and stroke, and subsequently constructed predictive models based on this composite endpoint.
To further enhance clinical relevance, we also developed separate models targeting individual complications. In clinical practice, the overall model serves as an initial screening tool to identify patients at higher postoperative risk who may require intensified monitoring, while the specific models further identify the most likely complication types to guide targeted preventive and management strategies. Together, these complementary models support comprehensive postoperative risk assessment and individualized perioperative care.
Despite the promising results of this study, several limitations warrant consideration. First, our models were developed using data from Chinese patients. Future studies are needed to evaluate their generalizability beyond China. Second, although the models effectively predicted the occurrence of postoperative complications following cranioplasty, they could not determine the timing of these events. Future research should focus on integrating temporal analyses to predict the timing of complications, which could further enhance clinical utility. Third, our study only focused on in-hospital complications following cranioplasty and excluded post-discharge events due to inconsistent and incomplete post-discharge data across centers. Fourth, although the models demonstrated robust sensitivity and strong discrimination overall, the complication-specific models for intracranial hemorrhage, seizures, and hydrocephalus showed a higher proportion of false-positive predictions in the temporal external validation cohort. Fifth, our analysis did not include autologous bone as its use was rare across the participating centers. The infrequent use of autologous bone at our participating centers makes it difficult to incorporate this surgical strategy into meaningful statistical analyses. Future studies with larger multicenter datasets may enable more robust evaluation of autologous bone and its comparative impact on postoperative outcomes.
In conclusion, our study successfully developed the first interpretable ML-based clinical tool for predicting postoperative complications after cranioplasty using preoperative and intraoperative data extracted from EMRs. With its high predictive accuracy and practical accessibility, this non-invasive tool has the potential to enhance perioperative risk stratification by shifting complication prediction from experience-based judgment to a data-driven approach, ultimately improving patient outcomes in neurosurgical practice. Further research is warranted to validate the real-world applicability of our clinical tool across diverse healthcare settings.
Methods
Ethics statement
This study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Boards of Qilu Hospital of Shandong University (KYLL-202407-041), Daping Hospital of Army Medical University [(2024) 293], and Tang-Du Hospital (K202411-26). As a retrospective study, the requirement for informed consent was waived. The study is registered with ClinicalTrials.gov (NCT06740773, registered on December 18, 2024) and conducted in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis with Artificial Intelligence (TRIPOD + AI) guidelines26. Patients or members of the public were not involved in the design, conduct, reporting, interpretation, or dissemination of this study.
Study population
This multicenter retrospective cohort study included patients of all ages who underwent cranioplasty in the neurosurgery departments of three independent hospitals. The derivation cohort consisted of patients who underwent cranioplasty at two independent tertiary hospitals (Qilu Hospital of Shandong University and Daping Hospital of Army Medical University) between January 1, 2015, and July 31, 2023. This cohort was used for model training and internal validation through 5-fold cross-validation. The geographical external validation cohort included patients who underwent cranioplasty at another independent tertiary hospital (Tangdu Hospital of Air Force Medical University) during the same period. The temporal external validation cohort included patients who underwent cranioplasty at Qilu Hospital of Shandong University between August 1, 2023, and January 1, 2025. Individuals with a history of prior cranioplasty, severe comorbidities (including significant cardiac, liver, kidney, or immune system dysfunction), congenital cranial defects, or substantial missing data were excluded from the study. The primary endpoint of this study was the occurrence of in-hospital complications following cranioplasty.
Sample size estimation
To determine the minimum sample size required for developing the clinical prediction models, we followed the four-step approach, using pmsampsize package in R and the web tool BeyondEPV (https://mvansmeden.shinyapps.io/BeyondEPV)27. The following parameters were prespecified: number of predictor parameters = 9, shrinkage factor = 0.9, outcome prevalence = 0.2, minimum acceptable C-statistic = 0.8, and mean absolute percentage error (MAPE) = 0.05. Based on these criteria, the minimum required sample size was estimated to be 400.
Data collection and quality control
Both the factors influencing postoperative complications following cranioplasty and the specific outcome variables for complications were identified through a comprehensive literature review, including systematic reviews, meta-analyses, primary studies, and expert clinical opinions28,29,30. Postoperative complications were classified into six major categories: infection, intracranial hemorrhage, fluid collections, hydrocephalus, seizures, and pneumocephalus. Only complications requiring active clinical intervention were counted as clinically significant complications in this study. Minor or self-limiting abnormalities were not considered as complications. In addition, reoperation, defined as a return to the operating room for removal of the implanted material due to severe complications, was also recorded as a separate outcome. Data was extracted from patients’ EMRs. Detailed information about factors and complications is provided in Table S1.
To ensure data consistency and reliability, a standardized multicenter database was established. Data collectors were uniformly trained, and clinical information was recorded using standardized forms. Following data collection, a cross-center quality control process was implemented. A random 30% sample of records from each center was independently reviewed by neurosurgeons from other participating centers. A minimum data accuracy rate of ≥90% was required; centers not meeting this threshold were mandated to re-evaluate and correct their data.
Data preprocessing
To handle missing data, the extent and pattern of missingness across all variables were first assessed (Fig. S1). The missingness mechanism for the skull defect area variable was first assessed by clinical experts and then verified statistically, supporting a Missing Completely at Random assumption (Table S2)31,32. Six commonly used imputation methods were subsequently evaluated using cross-validation, with the normalized root mean square error (NRMSE) as the primary evaluation metric. Among these, missForest33 was ultimately chosen for subsequent analyses (Fig. S2) due to its superior performance in minimizing NRMSE. A sensitivity analysis was conducted to verify that the imputation process preserved data distribution consistency and introduced no bias, as expected under the assumption (Fig. S3). Outliers were identified using a four-layer autoencoder trained with the Adam optimizer and mean squared error loss. Samples with reconstruction errors exceeding the 2σ threshold were considered outliers and excluded from further analysis34 (Fig. S4). The impact of this filtering step on model performance was evaluated through a sensitivity analysis, as detailed in Table S3. Continuous variables were standardized using Z-score normalization, and categorical variables were standardized through one-hot encoding35. To avoid data leakage, all preprocessing steps were independently applied to the derivation and geographical external validation cohorts. The temporal external validation dataset was left unprocessed to reflect real-world deployment conditions.
Feature selection
To address multicollinearity arising from highly correlated features, correlation filtering was applied to remove redundancy, ensuring that all pairwise correlation coefficients fell below 0.6. The variance inflation factor was subsequently calculated to confirm the absence of multicollinearity among the remaining variables (Fig. S5). Feature selection was performed using four distinct algorithms: Boruta36, Lasso37, Random Forest–based Recursive Feature Elimination (RF-RFE)38, and Genetic Algorithm (GA)39,40 (Fig. S6). The final feature set was determined by identifying the intersection of variables selected by all four methods and was visualized using a Venn diagram41. Experienced neurosurgeons from three centers then reviewed the selected features and finalized the predictors, ensuring high face validity and ease of implementation.
Model development, comparisons, and evaluation
Fifteen ML models were developed to predict the risk of complications after cranioplasty: GAM, logistic regression, gradient-boosted decision tree, K-nearest neighbor, light gradient boosting machine, RotF, extreme gradient boosting, naive Bayes, adaptive boosting (AdaBoost), multilayer perceptron, support vector machine, decision tree, extremely randomized trees (ExtraTrees), Gaussian process classifier and random forest (RF). To minimize overfitting and enhance model robustness before external validation, 5-fold cross-validation was performed on the derivation cohort. The optimal cutoff for model was determined by maximizing the Youden index (sensitivity + specificity − 1). The 95% confidence interval was estimated using the bias-corrected and accelerated (BCA) bootstrap method42.
To select the optimal prediction models for each complication, we defined a novel metric, \({\mathrm{AB}}_{\mathrm{score}}\), which combines the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Brier score to assess both the discriminative and calibration abilities of each model. To avoid overly optimistic estimates derived from the training data, performance metrics from the training cohort were not used for model evaluation. Only internal cross-validation and external validation metrics were considered to reliably assess the generalizability of each model.
The mathematical formula for \(A{B}_{{score}}\) is presented in Eq. (1)
where \(\overline{AUROC}\) and \(\overline{Briers\,core}\) represents the arithmetic mean of the AUROC values and Brier score from the internal cross-validation and geographical external validation sets, respectively. \(\alpha\) is a weight factor, and by setting \(\alpha =0.5\), equal emphasis is placed on discrimination and calibration in the \(A{B}_{\mathrm{score}}\). \(1-\overline{Briers\,core}\) is used to align with the positive direction of AUROC.
The discrimination ability of the final model was assessed using the receiver operating characteristic (ROC) curve and Precision–Recall (PR) Curve, while calibration was evaluated with the Brier score and calibration curve43. In addition, decision curve analysis (DCA) was used to evaluate the net benefit of the model at different thresholds44. Furthermore, to evaluate model fairness, we assessed the model’s performance across demographic subgroups, including age (<40 and ≥40 years) and sex (male, female), within the external validation cohort45.
Complication-specific modeling
Separate machine learning models were developed for each major type of postoperative complication. To address severe class imbalance in complication-specific modeling, ADASYN46 was applied to the data to augment minority-class samples and improve model sensitivity for rare outcomes.
Model explanation and causal inference analysis
To address the inherent opacity of machine learning models, SHAP, a game-theoretic approach, was employed to quantify the contribution of individual features to deviations from the mean prediction using SHAP values47. PDPs were additionally applied to visualize interactions among key SHAP-identified features and their combined effects on model predictions48,49.
To investigate the potential causal effects of modifiable surgical factors on postoperative complications, a two-step approach was employed. First, DiCE50 was used to simulate hypothetical adjustments in clinically actionable variables and to explore whether such changes could influence model-predicted complication risks. Subsequently, causal inference methods were applied to quantify the effects of these variables. DML was used to estimate ATEs51, while a T-learner framework was employed to evaluate CATEs across patient subgroups52. Sensitivity analyses were conducted to assess the robustness of the causal estimates.
Statistical analysis
Patient data were categorized into continuous and categorical variables. The normality of continuous variables was assessed using the Kolmogorov–Smirnov test. Variables with a normal distribution were described as mean ± standard deviation, while those with a skewed distribution were presented as median with interquartile range. Categorical variables were reported as counts and percentages. Data analyses were performed using IBM SPSS Statistics for Windows (IBM Corp., released 2019, version 26.0), Python (version 3.8.2), and R (version 4.2.2).
Data availability
The datasets analyzed implemented during this study are available from the corresponding author upon reasonable request. The codes are uploaded on Github. (GitHub - BigEarAsk/A-Causal-and-Interpretable-Machine-Learning-Framework-for-Post-Cranioplasty-Complications: Code for training and validation).
References
Fung, C. et al. Decompressive hemicraniectomy in patients with supratentorial intracerebral hemorrhage. Stroke 43, 3207–3211 (2012).
Hutchinson, P. J. et al. Trial of Decompressive Craniectomy for Traumatic Intracranial Hypertension. N. Engl. J. Med. 375, 1119–1130 (2016).
Vahedi, K. et al. Early decompressive surgery in malignant infarction of the middle cerebral artery: a pooled analysis of three randomised controlled trials. Lancet Neurol. 6, 215–222 (2007).
Honeybul, S. & Ho, K. M. Long-term complications of decompressive craniectomy for head injury. J. Neurotrauma 28, 929–935 (2011).
Kurland, D. B. et al. Complications associated with decompressive craniectomy: a systematic review. Neurocritical Care 23, 292–304 (2015).
Feroze, A. H. et al. Evolution of cranioplasty techniques in neurosurgery: historical review, pediatric considerations, and current trends. J. Neurosurg. 123, 1098–1107 (2015).
Malcolm, J. G. et al. Complications following cranioplasty and relationship to timing: a systematic review and meta-analysis. J. Clin. Neurosci. 33, 39–51 (2016).
Alkhaibary, A. et al. Cranioplasty: a comprehensive review of the history, materials, surgical aspects, and complications. World Neurosurg. 139, 445–452 (2020).
Zanaty, M. et al. Complications following cranioplasty: incidence and predictors in 348 cases. J. Neurosurg. 123, 182–188 (2015).
Chen, R. et al. Optimal timing of cranioplasty and predictors of overall complications after cranioplasty: the impact of brain collapse. Neurosurgery 93, 84–94 (2023).
Abode-Iyamah, K. O. et al. Risk factors for surgical site infections and assessment of vancomycin powder as a preventive measure in patients undergoing first-time cranioplasty. J. Neurosurg. 128, 1241–1249 (2018).
Cho, S. M. et al. Machine learning compared with conventional statistical models for predicting myocardial infarction readmission and mortality: a systematic review. Can. J. Cardiol. 37, 1207–1214 (2021).
Drysch, M. et al. Streamlined machine learning model for early sepsis risk prediction in burn patients. NPJ Digit. Med. 8, 621 (2025).
Yin, P. et al. Prediction of functional outcomes in aneurysmal subarachnoid hemorrhage using pre-/postoperative noncontrast CT within 3 days of admission. NPJ Digit. Med. 8, 542 (2025).
Bai, Z. et al. Machine learning based CAGIB score predicts in-hospital mortality of cirrhotic patients with acute gastrointestinal bleeding. NPJ Digit. Med. 8, 489 (2025).
Shin, T. H., Ashley, S. W. & Tsai, T. C. Defining the role of machine learning in optimizing surgical outcomes. JAMA Surg. 159, 1432 (2024).
Feuerriegel, S. et al. Causal machine learning for predicting treatment outcomes. Nat. Med. 30, 958–968 (2024).
Rosenthal, G. et al. Polyetheretherketone implants for the repair of large cranial defects: a 3-center experience. Neurosurgery 75, 528–529 (2014).
Rosinski, C. L. et al. A retrospective comparative analysis of titanium mesh and custom implants for cranioplasty. Neurosurgery 86, E15–e22 (2020).
Williams, D. F. Titanium for medical applications. in: Titanium in medicine: material science, surface science, engineering, biological responses and medical applications, 13–24 (Springer, 2001).
Kimchi, G. et al. Predicting and reducing cranioplasty infections by clinical, radiographic and operative parameters - A historical cohort study. J. Clin. Neurosci. 34, 182–186 (2016).
Klieverik, V. M., Robe, P. A., Muradin, M. S. M. & Woerdeman, P. A. Development of a prediction model for cranioplasty implant survival following craniectomy. World Neurosurg. 175, e693–e703 (2023).
Lu, Y., Huo, H. & Jiang, J. A clinical prediction model for complications after cranioplasty based on modified-brain collapse ratio and comorbidity burden. World Neurosurg. 201, 124235 (2025).
Chen, S. et al. Development and validation of an explainable machine learning model for predicting postoperative pulmonary complications after lung cancer surgery: a machine learning study. EClinicalMedicine 86, 103386 (2025).
Mahajan, A. et al. Development and validation of a machine learning model to identify patients before surgery at high risk for postoperative adverse events. JAMA Netw. Open 6, e2322285 (2023).
Collins, G. S. et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385, e078378 (2024).
Riley, R. D. et al. Calculating the sample size required for developing a clinical prediction model. BMJ 368, m441 (2020).
Zheng, F. et al. Early or late cranioplasty following decompressive craniotomy for traumatic brain injury: A systematic review and meta-analysis. J. Int. Med. Res. 46, 2503–2512 (2018).
Bader, E. R., Kobets, A. J., Ammar, A. & Goodrich, J. T. Factors predicting complications following cranioplasty. J. Cranio Maxillo Fac. Surg. 50, 134–139 (2022).
Shepetovsky, D., Mezzini, G. & Magrassi, L. Complications of cranioplasty in relationship to traumatic brain injury: a systematic review and meta-analysis. Neurosurg. Rev. 44, 3125–3142 (2021).
Carpenter, J. R. & Smuk, M. Missing data: a statistical framework for practice. Biom. J. 63, 915–947 (2021).
Little, R. J. & Rubin, D. B. Statistical analysis with missing data (John Wiley & Sons, 2019).
Stekhoven, D. J. & Bühlmann, P. MissForest--non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2012).
Chaurasia, S., Goyal, S. & Rajput, M. Outlier detection using autoencoder ensembles: a robust unsupervised approach. In Proc. 2020 International Conference on Contemporary Computing and Applications (IC3A) 76–80 (IEEE, 2020).
Sun, Y. et al. Modifying the one-hot encoding technique can enhance the adversarial robustness of the visual model for symbol recognition. Expert Syst. Appl. 250, 123751 (2024).
Kursa, M. B. & Rudnicki, W. R. Feature selection with the Boruta package. J. Stat. Softw. 36, 1–13 (2010).
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288 (1996).
Ravishankar, H. et al. Recursive feature elimination for biomarker discovery in resting-state functional connectivity. In Proc. 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 4071–4074 (IEEE, 2016).
Lambora, A., Gupta, K. & Chopra, K. Genetic algorithm-A literature review. In Proc. 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon) 380–384 (IEEE, 2019).
Lei, S. A feature selection method based on information gain and genetic algorithm. In Proc. 2012 International Conference on Computer Science and Electronics Engineering, Vol. 2, 355–358 (IEEE, 2012).
Heberle, H., Meirelles, G. V., da Silva, F. R., Telles, G. P. & Minghim, R. J. B. b. InteractiVenn: a web-based tool for the analysis of sets through Venn diagrams. BMC Bioinform. 16, 169 (2015).
Grün, B. & Miljkovic, T. J. N. A. A. J. The automated bias-corrected and accelerated bootstrap confidence intervals for risk measures. North Am. Actuar. J. 27, 731–750 (2023).
Stevens, R. J. & Poppe, K. K. Validation of clinical prediction models: what does the “calibration slope” really measure? J. Clin. Epidemiol. 118, 93–99 (2020).
Vickers, A. J. & Elkin, E. B. Decision curve analysis: a novel method for evaluating prediction models. Med. Decis. Mak. 26, 565–574 (2006).
Riley, R. D. et al. Evaluation of clinical prediction models (part 2): how to undertake an external validation study. BMJ 384, e074820 (2024).
He, H., Bai, Y., Garcia, E. A. & Li, S. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In Proc. 2008 IEEE International Joint Conference on Neural Networks (IEEE world congress on computational intelligence) 1322–1328 (IEEE, 2008).
Lundberg, S. M.et al. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems (NIPS'17), 4768–4777 (Curran Associates Inc., 2017).
Friedman, J. et al. Greedy function approximation: a gradient boosting machine. Ann. Stati. 29, 1189–1232 (2001).
Li, W. et al. Effects of heavy metal exposure on hypertension: a machine learning modeling approach. Chemosphere 337, 139435 (2023).
Mothilal, R. K., Sharma, A. & Tan, C. Explaining machine learning classifiers through diverse counterfactual explanations. In Proc. 2020 Conference on Fairness, Accountability, and Transparency, 607–617 (ACM, 2020).
Chernozhukov, V. et al. Double/debiased machine learning for treatment and structural parameters. Econ. J. 21, C1–C68 (2018).
Künzel, S. R. et al. Metalearners for estimating heterogeneous treatment effects using machine learning. Proc. Natl. Acad. Sci. USA 116, 4156–4165 (2019).
Acknowledgements
We sincerely thank Ms. Wenyun Xia for her contributions to the design of the visualization website. This study was supported by Qilu hygiene and health outstanding youth project, Shandong Provincial Natural Science Foundation (ZR2025MS1184), Shandong University project (6010124061, Study on postoperative complications of cranioplasty using polyetheretherketone material), Shaanxi Province Youth Science and Technology Rising Star (2023KJXX-028), and Tangdu Youth Independent Innovation Science Fund (2023ATDQN004). The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.
Author information
Authors and Affiliations
Contributions
N.Y. (Ning Yang) and C.Q. (Chen Qiu) were responsible for project administration, conceptualization, and investigation of the study, and also reviewed and edited the manuscript. W.B.L. (Wenbo Li), B.W. (Bao Wang), and T.L. (Tianzun Li) contributed equally to methodology development, formal analysis, data validation, web-based application development, and drafting of the original manuscript. Y.W.M. (Yiwen Ma) contributed to formal analysis and data validation. H.J. (Haoyong Jin), J.Z. (Jiangli Zhao), Z.W.X. (Zhiwei Xue), N.S. (Nan Su), and Y.H. (Yanya He) were involved in data curation and formal analysis. J.S. (Jiaqi Shi), X.C.L. (Xuchen Liu), X.Y.L. (Xiaoyang Liu), T.W. (Tianzi Wang), J.W. (Jiwei Wang), C.L. (Chao Li), C.Y. (Can Yan), Y.M. (Yang Ma), and Q.Q. (Qichao Qi) contributed to data curation. X.Y.W. (Xinyu Wang), W.G.L. (Weiguo Li), B.H. (Bin Huang), D.W. (Donghai Wang), X.L.W. (Xuelian Wang), Y.Q. (Yan Qu), and X.G.L. (Xingang Li) provided supervision for the study. N.Y. and C.Q. had full access to all data in the study and took responsibility for the integrity of the data and the accuracy of the data analysis. The corresponding authors assumed final responsibility for the decision to submit the manuscript. All authors reviewed and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, W., Wang, B., Li, T. et al. A Causal and interpretable machine learning framework for postcranioplasty risk prediction and surgical decision support. npj Digit. Med. 9, 184 (2026). https://doi.org/10.1038/s41746-026-02370-6
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41746-026-02370-6







