A Causal and interpretable machine learning framework for postcranioplasty risk prediction and surgical decision support

Li, Wenbo; Wang, Bao; Li, Tianzun; Ma, Yiwen; Jin, Haoyong; Zhao, Jiangli; Xue, Zhiwei; Su, Nan; He, Yanya; Shi, Jiaqi; Liu, Xuchen; Liu, Xiaoyang; Wang, Tianzi; Wang, Jiwei; Li, Chao; Yan, Can; Ma, Yang; Qi, Qichao; Wang, Xinyu; Li, Weiguo; Huang, Bin; Wang, Donghai; Wang, Xuelian; Qu, Yan; Li, Xingang; Qiu, Chen; Yang, Ning

doi:10.1038/s41746-026-02370-6

Article
Open access
Published: 21 January 2026

Wenbo Li^1,2,3^na1,
Bao Wang^4,5^na1,
Tianzun Li⁶^na1,
Yiwen Ma⁷,
Haoyong Jin^1,2,3,
Jiangli Zhao^1,2,3,
Zhiwei Xue^1,2,3,
Nan Su^1,2,3,
Yanya He^1,2,3,
Jiaqi Shi^1,2,3,
Xuchen Liu^1,2,3,
Xiaoyang Liu^1,2,3,
Tianzi Wang^1,2,3,
Jiwei Wang^1,2,3,
Chao Li^1,2,3,
Can Yan^1,2,3,
Yang Ma⁸,
Qichao Qi^1,2,3,
Xinyu Wang^1,2,3,
Weiguo Li^1,2,3,
Bin Huang^1,2,3,
Donghai Wang^1,2,3,
Xuelian Wang⁴,
Yan Qu⁴,
Xingang Li^1,2,3,
Chen Qiu⁹ &
…
Ning Yang^1,2,3

npj Digital Medicine volume 9, Article number: 184 (2026)

Abstract

Cranioplasty is associated with a substantial burden of postoperative complications. In this multicenter study, we developed a machine learning–based clinical decision-support tool to predict the risk of postoperative complications following cranioplasty. A set of nine features was selected for model development. Among the 15 algorithms evaluated, the random forest model demonstrated the best overall performance and was validated on data from both spatial and temporal external cohorts (AUROC = 0.949, internal cross-validation; 0.930, geographical validation; and 0.932, temporal validation). Subgroup analyses by age and sex demonstrated consistently high discriminative performance (lowest AUROC = 0.927) and good calibration (O/E ratio = 1.16, 95% CI: 0.97–1.40). Analysis of causal effects of modifiable intraoperative variables on postoperative complications, with diverse counterfactual explanations and causal inference methods, including double machine learning and the T-learner framework, revealed a protective effect of subcutaneous negative-pressure drainage (ATE = −0.241) and titanium mesh (ATE = −0.191). Finally, we present the model as an accessible web-based tool for individualized, real-time clinical decision-making (http://www.cranioplastycomplicationprediction.top). These findings provide a practical framework for postoperative risk stratification and support the optimization of intraoperative decision-making in cranioplasty.

Introduction

Decompressive craniectomy is a well-established neurosurgical procedure used to alleviate elevated intracranial pressure caused by conditions such as traumatic brain injury, cerebral infarction, and intracranial hemorrhage (ICH)^1,2,3. Although decompressive craniectomy is lifesaving, it results in a cranial defect that leaves the brain vulnerable to mechanical injuries and physiological disturbances^4,5. To address these risks, cranioplasty is routinely performed to restore cranial integrity. Especially for patients who have undergone hemicraniectomy, cranioplasty is ultimately required, as failure to restore cranial integrity can lead to sinking skin flap syndrome or the syndrome of the trephined. In addition to restoring structural protection, cranioplasty has been shown to improve neurological function and enhance esthetic outcomes, thereby facilitating recovery and promoting psychological well-being⁶.

Although cranioplasty is considered a technically straightforward procedure, it is associated with a relatively high incidence of postoperative complications⁷. Complications include infection, ICH, hydrocephalus, seizures, fluid collection, and pneumocephalus. These adverse events not only prolong hospital stays and increase healthcare costs but also impair recovery and quality of life⁸.

Given these risks, early identification of high-risk patients is crucial for facilitating more efficient use of clinical resources, guiding perioperative management, and ultimately improving outcomes. Numerous studies have investigated the factors associated with postoperative complications following cranioplasty^9,10,11. However, these studies have primarily focused on identifying risk factors rather than developing predictive models to forecast complications. As a result, reliable tools for predicting postoperative complications are still lacking, leaving clinicians to rely on empirical judgment and reactive strategies. This limitation substantially reduces the clinical applicability of prior findings and underscores the need for robust and interpretable predictive models to support individualized patient care.

In this study, we aimed to develop and validate an explainable machine learning (ML)–based clinical tool for predicting postoperative complications following cranioplasty. By integrating causal and interpretable machine learning into clinical workflows, this work seeks to bridge the gap between predictive modeling and actionable decision support in cranioplasty, ultimately advancing personalized neurosurgical care and improving postoperative outcomes.

Results

Population characteristics

The study design is illustrated in Fig. 1, and baseline characteristics of the derivation cohort (n = 789), geographical external validation cohort (n = 394), and temporal external validation cohort (n = 185) are summarized in Table 1. Postoperative complications occurred in 205 (26.0%) patients in the derivation cohort, 115 (29.2%) in the geographical external validation cohort, and 51 (27.6%) in the temporal external validation cohort. Reoperations due to severe complications were required in 13 patients (1.6%) in the derivation cohort, 8 patients (2.0%) in the geographical external validation cohort, and none in the temporal external validation cohort.

**Fig. 1: Flow chart of the study design.**

Table 1 Clinical characteristics of patients in the derivation, geographical, and temporal external validation cohorts

Feature selection and model performance comparison

The feature selection process using the Boruta algorithm is illustrated in Fig. 2a. After 500 iterations, 13 of the 26 input features were confirmed as important, with 2 additional features marked as tentative. The selection procedures for LASSO, RF-RFE, and GA are presented in Fig. S6. A total of 9 overlapping predictors identified across all four methods were retained for model construction (Fig. 2b), including surgery time, skull defect area, Glasgow Coma Scale (GCS), preoperative fluid collections, preoperative infection, use of subcutaneous negative-pressure drainage, titanium mesh, preoperative ventriculoperitoneal (V-P) shunt, and the time interval between decompressive craniectomy and cranioplasty. The distributions of the nine selected predictors are shown in Fig. S7.

Based on the selected features, 15 ML models were developed to predict postoperative complications after cranioplasty. Figure 2c presents a comparison of the $\mathrm{AB}_{\mathrm{score}}$ across different ML models. The Random Forest (RF) model achieved the highest $\mathrm{AB}_{\mathrm{score}}$ (0.929), outperformed all other algorithms, and was selected as the final model for further evaluation. The optimal cutoff of the final RF model for predicting postoperative complications was 0.366 (Table S4). At this cutoff, the corresponding sensitivity, specificity, true/false positives (TP/FP), true/false negatives (TN/FN), positive predictive value (PPV), negative predictive value (NPV), and F1 score are reported in Table S5.

Model evaluation and subgroup performance analysis

We comprehensively evaluated the final RF model in terms of both discrimination and calibration. The AUROC was 0.949 (95% CI: 0.949–0.950) in the internal cross-validation and 0.930 (95% CI: 0.929–0.931) in the geographical external validation cohort (Fig. 3a). Consistent with the AUROC, the area under the precision-recall curve (AUPRC) remained high, with values of 0.880 (95% CI: 0.878–0.880) and 0.870 (95% CI: 0.869–0.872), respectively (Fig. 3b).

**Fig. 3: Model performance evaluation in the derivation cohort and geographical external validation cohort.**

Clinical utility was evaluated using DCA (Fig. 3c). Across a wide range of threshold probabilities, the model consistently provided a greater net benefit than both the treat-all and treat-none strategies. Calibration performance is shown in Fig. 3d. The predicted probabilities demonstrated overall good agreement with the observed outcomes. Slight overestimation was noted at higher predicted risk levels in the geographical external validation cohort.
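The net-benefit quantity that underlies decision curve analysis can be sketched as follows. This is a minimal illustration of the standard definition, net benefit = TP/n − (FP/n) × p_t/(1 − p_t), and not the authors' code; the function names and the toy labels and probabilities are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on patients with predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    # False positives are weighted by the odds of the threshold probability.
    weight = threshold / (1.0 - threshold)
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the treat-all strategy at the same threshold."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy data: 3 complications among 10 patients.
y_true = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
y_prob = [0.9, 0.7, 0.6, 0.2, 0.1, 0.1, 0.4, 0.3, 0.2, 0.1]

nb_model = net_benefit(y_true, y_prob, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
nb_none = 0.0  # treat-none has zero net benefit by definition
print(nb_model, nb_all, nb_none)
```

A decision curve repeats this calculation over a grid of thresholds; the model is clinically useful at a given threshold when its net benefit exceeds both reference strategies.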

To further understand the predictive performance of the final model in specific patient populations, subgroup analyses were conducted based on age (<40 and ≥40 years) and sex (male, female) in the geographical external validation cohort. Although the AUROC was slightly lower in patients aged ≥40 years and in male patients than in the overall cohort, it remained above 0.92 in both subgroups (Fig. 3e). Calibration in subgroups was evaluated using a forest plot of observed-to-expected (O/E) ratios (Fig. 3f). The overall O/E ratio was 1.16 (95% CI: 0.97–1.40), and no significant prediction bias relative to the observed outcomes was detected in any subgroup.
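As a point of reference, the O/E ratio is the number of observed events divided by the sum of predicted probabilities. The sketch below illustrates this definition only; the function name and the toy inputs are hypothetical, and the confidence interval reported above is not reproduced.

```python
def observed_expected_ratio(y_true, y_prob):
    """Observed events divided by expected events (sum of predicted risks)."""
    observed = sum(y_true)
    expected = sum(y_prob)
    if expected <= 0:
        raise ValueError("expected number of events must be positive")
    return observed / expected


# Hypothetical toy data: predictions sum to the observed event count.
well_calibrated = observed_expected_ratio([1, 0, 1, 0], [0.5, 0.25, 0.75, 0.5])

# Hypothetical toy data: predicted risks are too low, so O/E exceeds 1.
underestimated = observed_expected_ratio([1, 1, 0, 0], [0.4, 0.4, 0.1, 0.1])
print(well_calibrated, underestimated)
```

An O/E ratio above 1 means more events were observed than predicted, and a confidence interval that includes 1 is compatible with no systematic over- or underestimation.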

Complication-specific model development

Given that individual complications may involve distinct mechanisms and require personalized management, we further constructed dedicated ML models for major complications, including infection, pneumocephalus, fluid collection, hydrocephalus, seizures, intracranial hemorrhage, and reoperations. Final model selection was guided by a comprehensive assessment of $A{B}_{\mathrm{score}}$, calibration, and clinical utility. The optimal models were rotation forest (RotF) for infection and pneumocephalus; RF for fluid collection and reoperations; generalized additive model (GAM) for hydrocephalus; and logistic regression for seizures and intracranial hemorrhage. Performance metrics are summarized in Table S6, with calibration curves, decision curve analyses, and SHapley Additive exPlanations (SHAP) summary plots presented in Figs. S8–S10.

Temporal external validation of the models

We further evaluated the temporal generalizability of both the overall complication model and the complication-specific models using an independent temporal validation cohort. The final overall complication model achieved an AUROC of 0.932 and an overall accuracy of 0.838 (Fig. 4a). Calibration analysis (Fig. 4b) showed good agreement between predicted and observed risks, with the calibration curve closely following the ideal diagonal line. Decision curve analysis (Fig. 4c) demonstrated net clinical benefit across a wide range of threshold probabilities. Subgroup analysis (Fig. 4d) further demonstrated the model’s stable performance across age and sex subgroups, with AUROC values exceeding 0.9 and accuracy exceeding 0.8 in all subgroups.

**Fig. 4: Temporal external validation of the final model.**

Performance metrics for complication-specific models are provided in Table S7. AUROC values were consistently high across all outcomes (0.851–0.986), indicating strong overall discrimination. However, for intracranial hemorrhage, hydrocephalus, and seizures, the models showed comparatively lower AUPRC values (0.536, 0.574, and 0.600, respectively) and a higher proportion of false-positive predictions.

Model interpretability and feature interaction analysis of the overall complication model

We employed SHAP to assess feature contributions at both the global and individual levels. Global feature importance is shown in Fig. 5a, where contributions were quantified using mean SHAP values and ranked in descending order. Surgery time, skull defect area, and GCS were identified as the top three predictors. Local explanations were further used to illustrate how individual predictions were generated based on patient-specific feature values. Corresponding visualizations, including force plots, decision plots, and waterfall plots, are presented in Figs. S11 and S12.
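Global importance from SHAP values is commonly computed as the mean absolute SHAP value per feature; whether the study used absolute values is an assumption here. The sketch below starts from an already computed matrix of SHAP values and does not call the SHAP library; the numbers are hypothetical and the feature names are illustrative.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value, highest first."""
    n_rows = len(shap_values)
    if n_rows == 0:
        raise ValueError("shap_values must contain at least one row")
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values: one row per patient, one column per feature.
feature_names = ["surgery_time", "skull_defect_area", "gcs"]
shap_values = [
    [0.30, -0.10, 0.05],
    [-0.20, 0.20, -0.05],
    [0.40, -0.15, 0.10],
]

ranking = rank_by_mean_abs_shap(shap_values, feature_names)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```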

**Fig. 5: Model interpretability and causal inference analysis.**

To further explore potential interactions among features, we generated a SHAP interaction heatmap (Fig. 5b). Skull defect area, GCS, and surgery time exhibited high self-interaction SHAP values. Their marginal effects on the predicted risk of postoperative complications were examined using one-dimensional (1D) partial dependence plots (PDPs) (Fig. S13), and their joint effects were visualized using a three-dimensional (3D) PDP (Fig. 5c). The plot showed that low GCS and a larger skull defect area were associated with a higher predicted risk of postoperative complications. This effect was more pronounced in patients with longer surgery time than in those with shorter surgery time.

Counterfactual analysis and causal inference of the modifiable surgical variables

Identifying modifiable surgical variables is of particular importance in neurosurgical practice. In our model, the use of subcutaneous negative-pressure (N-P) drainage and titanium mesh in cranioplasty was associated with a lower predicted risk of postoperative complications (Fig. 5a).

To examine whether changes to these factors could influence model predictions, we performed counterfactual analysis using the Diverse Counterfactual Explanations (DiCE) method. The results showed that modifying either the drainage method or the cranioplasty material alone was sufficient to convert a high-risk prediction into a low-risk outcome in selected patients (Fig. 5d).
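The idea of a counterfactual query over modifiable variables can be illustrated with a brute-force toy search. This sketch does not use the DiCE library and does not reproduce the study's model: `toy_risk` is a hypothetical stand-in with made-up coefficients, the feature names are illustrative, and only the cutoff of 0.366 is taken from the Results.

```python
from itertools import product


def flip_counterfactuals(patient, modifiable, predict_risk, cutoff):
    """List changes to binary modifiable features that bring risk below cutoff."""
    results = []
    for values in product([0, 1], repeat=len(modifiable)):
        candidate = dict(patient)
        candidate.update(zip(modifiable, values))
        changed = [key for key in modifiable if candidate[key] != patient[key]]
        risk = predict_risk(candidate)
        if changed and risk < cutoff:
            results.append((changed, risk))
    # Prefer counterfactuals that change the fewest variables, then lowest risk.
    results.sort(key=lambda item: (len(item[0]), item[1]))
    return results


def toy_risk(features):
    """Hypothetical risk function; coefficients are NOT from the study."""
    return 0.6 - 0.25 * features["np_drainage"] - 0.2 * features["titanium_mesh"]


patient = {"np_drainage": 0, "titanium_mesh": 0}
counterfactuals = flip_counterfactuals(
    patient, ["np_drainage", "titanium_mesh"], toy_risk, cutoff=0.366
)
for changed, risk in counterfactuals:
    print(changed, round(risk, 3))
```

Such a search only shows how the prediction responds to changed inputs; it does not by itself establish that the change would alter the real outcome.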

To further validate these findings, causal effects of these modifiable surgical factors on postoperative complication risk were estimated using Double Machine Learning (DML). Both N-P drainage and titanium mesh were associated with reduced predicted complication risk, with average treatment effects (ATEs) of −0.241 (95% CI: −0.35 to −0.132) and −0.191 (95% CI: −0.341 to −0.041), respectively (Fig. 5e). Subgroup-specific conditional average treatment effects (CATEs) were subsequently estimated to assess heterogeneity in treatment response (Fig. 5f). The protective effects of titanium mesh and N-P drainage were observed in most age and sex subgroups. However, among males over 40 years, the estimated CATE for N-P drainage exceeded zero (CATE = 0.009), indicating no protective benefit in this subgroup. Detailed CATE results are provided in Table S9.
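In standard potential-outcomes notation, the estimands reported above can be written as follows. This is a generic sketch rather than the authors' exact specification; the symbols $Y(1)$ and $Y(0)$ (complication outcome with and without the intervention), $T$ (treatment indicator), $X$ (covariates), and the fitted outcome models $\hat{\mu}_{1}$ and $\hat{\mu}_{0}$ are notation assumed here for illustration.

$$\mathrm{ATE}=\mathbb{E}\left[Y(1)-Y(0)\right],\qquad \mathrm{CATE}(x)=\mathbb{E}\left[Y(1)-Y(0)\mid X=x\right]$$

Under a T-learner, one outcome model is fitted per treatment arm and the conditional effect is estimated as the difference between the two:

$$\hat{\tau}(x)=\hat{\mu}_{1}(x)-\hat{\mu}_{0}(x),\qquad \hat{\mu}_{t}(x)\approx \mathbb{E}\left[Y\mid X=x,\,T=t\right]$$

If the outcome is coded as a binary indicator of any complication, a negative ATE corresponds to a lower absolute complication risk, so an ATE of −0.241 would read as a reduction of about 24 percentage points.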

Finally, a sensitivity analysis was conducted to assess the robustness of the estimated ATEs and CATEs. Random noise variables were introduced into the estimation process, and treatment effects were re-estimated. No statistically significant differences were observed between the original and perturbed estimates (all p > 0.05; Tables S8 and S9).

Accessible web application for clinical utility

The overall and complication-specific models were integrated into a web-based application with eight prediction modules (Fig. S14). Users can input the required feature values under the relevant module, and the application will automatically calculate and display the predicted risk for the selected complication (Fig. S15). The web application can be accessed online at the following link: http://www.cranioplastycomplicationprediction.top/.

In addition, we translated the core methodologies of this study into a generalizable methodological framework platform (Fig. S16; https://surgical-complication-risk-prediction.streamlit.app/). This platform provides a reproducible pipeline for predicting postoperative complications across diverse surgical procedures.

Discussion

ML techniques have been widely used in modern medical research due to their ability to process high-dimensional data and capture complex, nonlinear interactions among variables¹². They have demonstrated strong predictive performance in multiple clinically challenging domains, including acute critical illness risk stratification, postoperative functional outcome prediction, and in-hospital mortality prediction^13,14,15. However, their application in cranioplasty remains limited. To our knowledge, this is the first study to systematically compare 15 machine learning algorithms for predicting postoperative complications following cranioplasty, based on large-scale multicenter data.

Individualized assessment of postoperative complication risk represents an important component of perioperative care in patients undergoing cranioplasty. By identifying patients at elevated risk, the proposed model may support more targeted postoperative management strategies. For example, patients predicted to be at increased risk of postoperative seizures may benefit from closer neurophysiological monitoring or consideration of prophylactic antiepileptic therapy. Similarly, for patients at increased risk of postoperative fluid collection, drainage strategies such as prolonging drainage duration may be adjusted to reduce the likelihood of this complication.

As highlighted by Shin et al.¹⁶, the goal of ML in surgical outcomes research is not simply to improve risk prediction, but to identify underrecognized modifiable risk factors. For neurosurgeons, a critical question is whether optimizing surgical strategies can effectively reduce the risk of complications following cranioplasty. In this context, causal machine learning offers a potential solution.

Unlike traditional ML, which identifies high-risk patients without informing specific actions, causal ML aims to answer “what if” questions and quantify the effects of potential interventions using data from randomized controlled trials (RCTs) and real-world sources such as clinical registries and EMRs¹⁷. Using DML and the T-learner framework, we found that both N-P drainage and the use of titanium mesh exhibited protective effects against postoperative complications following cranioplasty. Notably, N-P drainage is a novel and modifiable surgical factor that has not been previously reported in the literature. The continuous evacuation of postoperative blood and exudate may reduce the risk of hematoma or fluid collection, thereby contributing to its protective effect.

Furthermore, the choice of cranioplasty material, particularly the use of titanium mesh, warrants further discussion in the context of existing evidence. Rosenthal et al.¹⁸ reported that complication rates with polyetheretherketone (PEEK) implants are comparable to those of other materials. In contrast, Rosinski et al.¹⁹ found higher infection rates in patients with PEEK custom implants than in those with titanium meshes. However, the limited sample sizes of these studies weaken their conclusions. Using a large multicenter dataset and causal inference methods, we found that the use of titanium mesh as a cranioplasty material was associated with a lower risk of overall postoperative complications. The lower complication rate associated with titanium mesh may be attributed to its superior biocompatibility and inherent antibacterial properties²⁰, which help reduce the risk of infection and inflammation. In addition, it offers good intraoperative malleability, allowing for manual trimming and contouring to achieve better conformity to the defect site and enhanced implant stability. This adaptability may help reduce dead space and consequently lower the risk of postoperative fluid collection.

In practical terms, this protective association is particularly relevant for patients with impaired neurological status before surgery, as minimizing postoperative complications is essential in preventing further deterioration. In such scenarios, when synthetic materials such as PEEK, titanium mesh, and polymethyl methacrylate–hydroxyapatite composite cement are all viable options, neurosurgeons may consider titanium mesh as a more favorable choice.

Several previous studies have also attempted to develop prediction models in the context of cranioplasty. Kimchi et al.²¹ used survival analysis to predict the probability of post-cranioplasty infection and identified preoperative neurological disability as the strongest predictor. Klieverik et al.²² developed a Cox-based model to predict cranioplasty implant survival and reported several clinical determinants. Lu et al.²³ proposed a logistic-regression model incorporating modified brain-collapse ratio with comorbidity burden to predict postoperative complications. These earlier studies highlight growing interest in predictive modeling for cranioplasty. However, they were constrained by small single-center samples, modest predictive performance, reliance on traditional regression methods, and the absence of external validation, limiting their clinical applicability. In contrast, our study addressed these limitations by leveraging a large multicenter dataset, systematically comparing multiple machine-learning algorithms, and validating the final model across both geographical and temporal cohorts. Moreover, by applying causal ML methods, we were able to identify potentially modifiable surgical variables, thereby providing guidance for surgical decision-making.

Composite outcomes are commonly used in surgical outcomes research to capture the overall clinical impact of heterogeneous postoperative complications. In the context of cranioplasty, overall complication outcomes can inform perioperative management, guide postoperative monitoring strategies, and reflect overall patient outcomes. Accordingly, prior studies have frequently adopted an overall complication endpoint to investigate risk factors associated with postoperative complications^9,10. This analytical paradigm is not unique to cranioplasty and has been widely applied across postoperative complication and adverse event research. For instance, Chen et al.²⁴ combined multiple postoperative pulmonary complications into a composite endpoint to develop a risk prediction model. Similarly, Mahajan et al.²⁵ defined major adverse cardiac and cerebrovascular events as a composite outcome comprising postoperative type I or II myocardial infarction, cardiogenic shock or acute heart failure, unstable angina, and stroke, and subsequently constructed predictive models based on this composite endpoint.

To further enhance clinical relevance, we also developed separate models targeting individual complications. In clinical practice, the overall model serves as an initial screening tool to identify patients at higher postoperative risk who may require intensified monitoring, while the specific models further identify the most likely complication types to guide targeted preventive and management strategies. Together, these complementary models support comprehensive postoperative risk assessment and individualized perioperative care.

Despite the promising results of this study, several limitations warrant consideration. First, our models were developed using data from Chinese patients. Future studies are needed to evaluate their generalizability beyond China. Second, although the models effectively predicted the occurrence of postoperative complications following cranioplasty, they could not determine the timing of these events. Future research should focus on integrating temporal analyses to predict the timing of complications, which could further enhance clinical utility. Third, our study only focused on in-hospital complications following cranioplasty and excluded post-discharge events due to inconsistent and incomplete post-discharge data across centers. Fourth, although the models demonstrated robust sensitivity and strong discrimination overall, the complication-specific models for intracranial hemorrhage, seizures, and hydrocephalus showed a higher proportion of false-positive predictions in the temporal external validation cohort. Fifth, our analysis did not include autologous bone as its use was rare across the participating centers. The infrequent use of autologous bone at our participating centers makes it difficult to incorporate this surgical strategy into meaningful statistical analyses. Future studies with larger multicenter datasets may enable more robust evaluation of autologous bone and its comparative impact on postoperative outcomes.

In conclusion, our study successfully developed the first interpretable ML-based clinical tool for predicting postoperative complications after cranioplasty using preoperative and intraoperative data extracted from EMRs. With its high predictive accuracy and practical accessibility, this non-invasive tool has the potential to enhance perioperative risk stratification by shifting complication prediction from experience-based judgment to a data-driven approach, ultimately improving patient outcomes in neurosurgical practice. Further research is warranted to validate the real-world applicability of our clinical tool across diverse healthcare settings.

Methods

Ethics statement

This study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Boards of Qilu Hospital of Shandong University (KYLL-202407-041), Daping Hospital of Army Medical University [(2024) 293], and Tang-Du Hospital (K202411-26). As a retrospective study, the requirement for informed consent was waived. The study is registered with ClinicalTrials.gov (NCT06740773, registered on December 18, 2024) and conducted in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis with Artificial Intelligence (TRIPOD + AI) guidelines²⁶. Patients or members of the public were not involved in the design, conduct, reporting, interpretation, or dissemination of this study.

Study population

This multicenter retrospective cohort study included patients of all ages who underwent cranioplasty in the neurosurgery departments of three independent hospitals. The derivation cohort consisted of patients who underwent cranioplasty at two independent tertiary hospitals (Qilu Hospital of Shandong University and Daping Hospital of Army Medical University) between January 1, 2015, and July 31, 2023. This cohort was used for model training and internal validation through 5-fold cross-validation. The geographical external validation cohort included patients who underwent cranioplasty at another independent tertiary hospital (Tangdu Hospital of Air Force Medical University) during the same period. The temporal external validation cohort included patients who underwent cranioplasty at Qilu Hospital of Shandong University between August 1, 2023, and January 1, 2025. Individuals with a history of prior cranioplasty, severe comorbidities (including significant cardiac, liver, kidney, or immune system dysfunction), congenital cranial defects, or substantial missing data were excluded from the study. The primary endpoint of this study was the occurrence of in-hospital complications following cranioplasty.

Sample size estimation

To determine the minimum sample size required for developing the clinical prediction models, we followed the four-step approach using the pmsampsize package in R and the web tool BeyondEPV (https://mvansmeden.shinyapps.io/BeyondEPV)²⁷. The following parameters were prespecified: number of predictor parameters = 9, shrinkage factor = 0.9, outcome prevalence = 0.2, minimum acceptable C-statistic = 0.8, and mean absolute percentage error (MAPE) = 0.05. Based on these criteria, the minimum required sample size was estimated to be 400.

Data collection and quality control

Both the factors influencing postoperative complications following cranioplasty and the specific outcome variables for complications were identified through a comprehensive literature review, including systematic reviews, meta-analyses, primary studies, and expert clinical opinions^28,29,30. Postoperative complications were classified into six major categories: infection, intracranial hemorrhage, fluid collections, hydrocephalus, seizures, and pneumocephalus. Only complications requiring active clinical intervention were counted as clinically significant complications in this study. Minor or self-limiting abnormalities were not considered complications. In addition, reoperation, defined as a return to the operating room for removal of the implanted material due to severe complications, was recorded as a separate outcome. Data were extracted from patients’ EMRs. Detailed information about factors and complications is provided in Table S1.

To ensure data consistency and reliability, a standardized multicenter database was established. Data collectors were uniformly trained, and clinical information was recorded using standardized forms. Following data collection, a cross-center quality control process was implemented. A random 30% sample of records from each center was independently reviewed by neurosurgeons from other participating centers. A minimum data accuracy rate of ≥90% was required; centers not meeting this threshold were mandated to re-evaluate and correct their data.

Data preprocessing

To handle missing data, the extent and pattern of missingness across all variables were first assessed (Fig. S1). The missingness mechanism for the skull defect area variable was assessed by clinical experts and then verified statistically, supporting a Missing Completely at Random assumption (Table S2)^31,32. Six commonly used imputation methods were subsequently evaluated using cross-validation, with the normalized root mean square error (NRMSE) as the primary evaluation metric. Among these, missForest³³ was ultimately chosen for subsequent analyses (Fig. S2) because it achieved the lowest NRMSE. A sensitivity analysis was conducted to verify that the imputation process preserved data distribution consistency and introduced no bias, as expected under the Missing Completely at Random assumption (Fig. S3). Outliers were identified using a four-layer autoencoder trained with the Adam optimizer and mean squared error loss. Samples with reconstruction errors exceeding the 2σ threshold were considered outliers and excluded from further analysis³⁴ (Fig. S4). The impact of this filtering step on model performance was evaluated through a sensitivity analysis, as detailed in Table S3. Continuous variables were standardized using Z-score normalization, and categorical variables were encoded using one-hot encoding³⁵. To avoid data leakage, all preprocessing steps were independently applied to the derivation and geographical external validation cohorts. The temporal external validation dataset was left unprocessed to reflect real-world deployment conditions.
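The text does not state exactly how the 2σ threshold was computed. A minimal sketch, assuming the threshold is the mean reconstruction error plus two population standard deviations, is shown below; the autoencoder itself is not reproduced, and the error values and function name are hypothetical.

```python
from statistics import mean, pstdev


def two_sigma_outliers(errors):
    """Return indices of errors above mean + 2 * SD, plus the threshold used."""
    if len(errors) < 2:
        raise ValueError("need at least two reconstruction errors")
    # Assumption: population SD; a sample SD would give a slightly higher cutoff.
    threshold = mean(errors) + 2.0 * pstdev(errors)
    flagged = [index for index, error in enumerate(errors) if error > threshold]
    return flagged, threshold


# Hypothetical per-sample reconstruction errors from an autoencoder.
reconstruction_errors = [0.10, 0.12, 0.11, 0.09, 0.10, 0.13, 0.11, 0.10, 0.12, 0.95]

outlier_indices, cutoff_value = two_sigma_outliers(reconstruction_errors)
print(outlier_indices, round(cutoff_value, 3))
```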

Feature selection

To address multicollinearity arising from highly correlated features, correlation filtering was applied to remove redundancy, ensuring that all pairwise correlation coefficients fell below 0.6. The variance inflation factor was subsequently calculated to confirm the absence of multicollinearity among the remaining variables (Fig. S5). Feature selection was performed using four distinct algorithms: Boruta³⁶, Lasso³⁷, Random Forest–based Recursive Feature Elimination (RF-RFE)³⁸, and Genetic Algorithm (GA)^39,40 (Fig. S6). The final feature set was determined by identifying the intersection of variables selected by all four methods and was visualized using a Venn diagram⁴¹. Experienced neurosurgeons from three centers then reviewed the selected features and finalized the predictors, ensuring high face validity and ease of implementation.
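The consensus step, which keeps only the variables selected by every method, amounts to a set intersection. The sketch below illustrates that step alone; the per-method selections are hypothetical placeholders rather than the study's actual outputs, and the function name is illustrative.

```python
def consensus_features(selections):
    """Return the features chosen by every method, sorted for stable output."""
    feature_sets = [set(features) for features in selections.values()]
    if not feature_sets:
        return []
    return sorted(set.intersection(*feature_sets))


# Hypothetical selections from the four methods named above.
selections = {
    "boruta": ["surgery_time", "skull_defect_area", "gcs", "age", "titanium_mesh"],
    "lasso": ["surgery_time", "skull_defect_area", "gcs", "titanium_mesh"],
    "rf_rfe": ["surgery_time", "gcs", "titanium_mesh", "skull_defect_area", "sex"],
    "ga": ["gcs", "surgery_time", "titanium_mesh", "skull_defect_area", "age"],
}

shared = consensus_features(selections)
print(shared)
```

Requiring agreement across all methods is a conservative rule: a feature favored by only some of the algorithms is dropped, which trades some potential signal for a smaller and more stable predictor set.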

Model development, comparisons, and evaluation

Fifteen ML models were developed to predict the risk of complications after cranioplasty: GAM, logistic regression, gradient-boosted decision tree, K-nearest neighbor, light gradient boosting machine, RotF, extreme gradient boosting, naive Bayes, adaptive boosting (AdaBoost), multilayer perceptron, support vector machine, decision tree, extremely randomized trees (ExtraTrees), Gaussian process classifier and random forest (RF). To minimize overfitting and enhance model robustness before external validation, 5-fold cross-validation was performed on the derivation cohort. The optimal cutoff for each model was determined by maximizing the Youden index (sensitivity + specificity − 1). The 95% confidence interval was estimated using the bias-corrected and accelerated (BCa) bootstrap method⁴².
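The cutoff rule can be sketched as follows. This is a minimal illustration on synthetic labels and probabilities, assuming that a sample is classified positive when its predicted probability is at or above the threshold; the function name `youden_cutoff` is hypothetical.

```python
def youden_cutoff(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    Candidate thresholds are the observed predicted probabilities;
    a sample is called positive when its probability is >= threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_t, best_j = None, -1.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Synthetic example (not study data).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90]
best_t, best_j = youden_cutoff(y_true, y_prob)
```

On this toy input the cutoff 0.35 captures all three positives while misclassifying one of five negatives, giving J = 0.8.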

To select the optimal prediction models for each complication, we defined a novel metric, ${\mathrm{AB}}_{\mathrm{score}}$, which combines the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Brier score to assess both the discriminative and calibration abilities of each model. To avoid overly optimistic estimates derived from the training data, performance metrics from the training cohort were not used for model evaluation. Only internal cross-validation and external validation metrics were considered to reliably assess the generalizability of each model.

The mathematical formula for ${\mathrm{AB}}_{\mathrm{score}}$ is presented in Eq. (1):

$${\mathrm{AB}}_{\mathrm{score}}=\alpha \cdot \overline{\mathrm{AUROC}}+\alpha \cdot \left(1-\overline{\mathrm{Brier\ score}}\right) \tag{1}$$

$$\alpha =0.5 \tag{2}$$

where $\overline{\mathrm{AUROC}}$ and $\overline{\mathrm{Brier\ score}}$ represent the arithmetic means of the AUROC values and Brier scores, respectively, from the internal cross-validation and geographical external validation sets. $\alpha$ is a weight factor, and by setting $\alpha =0.5$, equal emphasis is placed on discrimination and calibration in the ${\mathrm{AB}}_{\mathrm{score}}$. $1-\overline{\mathrm{Brier\ score}}$ is used so that, like AUROC, higher values indicate better performance.
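A short worked example of Eq. (1) may help. The AUROC values below are the internal cross-validation and geographical validation figures reported in the abstract for the random forest model; the two Brier scores are hypothetical placeholders, since the corresponding values are not given in this section.

```python
def ab_score(aurocs, briers, alpha=0.5):
    """AB_score = alpha * mean(AUROC) + alpha * (1 - mean(Brier)), as in Eq. (1)."""
    mean_auroc = sum(aurocs) / len(aurocs)
    mean_brier = sum(briers) / len(briers)
    return alpha * mean_auroc + alpha * (1 - mean_brier)


aurocs = [0.949, 0.930]   # internal CV, geographical validation (from the abstract)
briers = [0.08, 0.10]     # hypothetical values, for illustration only
score = ab_score(aurocs, briers)
```

With these inputs the mean AUROC is 0.9395 and the mean Brier score is 0.09, so the score is 0.5 × 0.9395 + 0.5 × 0.91 ≈ 0.925.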

The discrimination ability of the final model was assessed using the receiver operating characteristic (ROC) curve and Precision–Recall (PR) Curve, while calibration was evaluated with the Brier score and calibration curve⁴³. In addition, decision curve analysis (DCA) was used to evaluate the net benefit of the model at different thresholds⁴⁴. Furthermore, to evaluate model fairness, we assessed the model’s performance across demographic subgroups, including age (<40 and ≥40 years) and sex (male, female), within the external validation cohort⁴⁵.
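For readers less familiar with these metrics, the sketch below computes the Brier score and the decision-curve net benefit at a single threshold probability, using the standard definition net benefit = TP/n − (FP/n) × pt/(1 − pt). The data are synthetic and the function names are hypothetical; this is not the evaluation code used in the study.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


# Synthetic example (not study data).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.6, 0.9]
bs = brier_score(y_true, y_prob)
nb = net_benefit(y_true, y_prob, threshold=0.3)
```

A decision curve is obtained by repeating `net_benefit` over a grid of thresholds and comparing the result with the treat-all and treat-none strategies.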

Complication-specific modeling

Separate machine learning models were developed for each major type of postoperative complication. To address severe class imbalance in complication-specific modeling, ADASYN⁴⁶ was applied to the data to augment minority-class samples and improve model sensitivity for rare outcomes.
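The idea behind ADASYN can be sketched in a few lines. The version below is a simplified, standard-library illustration rather than the implementation used in the study: it assumes binary labels with 1 as the minority class, Euclidean distance, and interpolation toward one of the k nearest minority neighbours. Because per-sample counts are rounded, the total number of synthetic points may differ slightly from the target.

```python
import random
from math import dist


def adasyn(X, y, k=5, beta=1.0, seed=0):
    """Generate synthetic minority samples, more of them near hard regions."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == 1]
    n_major = len(y) - len(min_idx)
    n_to_generate = int((n_major - len(min_idx)) * beta)

    # r_i: share of majority points among the k nearest neighbours of x_i.
    ratios = []
    for i in min_idx:
        nearest = sorted((j for j in range(len(X)) if j != i),
                         key=lambda j: dist(X[i], X[j]))[:k]
        ratios.append(sum(1 for j in nearest if y[j] == 0) / k)

    total = sum(ratios)
    if total == 0 or n_to_generate <= 0:
        return []

    synthetic = []
    for i, r in zip(min_idx, ratios):
        g = round(r / total * n_to_generate)  # samples to create around x_i
        neighbours = sorted((j for j in min_idx if j != i),
                            key=lambda j: dist(X[i], X[j]))[:k]
        for _ in range(g):
            j = rng.choice(neighbours)
            lam = rng.random()
            synthetic.append(tuple(a + lam * (b - a)
                                   for a, b in zip(X[i], X[j])))
    return synthetic


# Synthetic 2-D data: 8 majority points, 3 minority points.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
     (2.0, 0.0), (2.0, 1.0), (0.0, 2.0), (1.0, 2.0),
     (2.0, 2.0), (2.5, 2.5), (3.0, 3.0)]
y = [0] * 8 + [1] * 3
new_points = adasyn(X, y, k=3)
```

Oversampling of this kind is normally applied only to training folds, so that validation data contain no synthetic samples.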

Model explanation and causal inference analysis

To address the inherent opacity of machine learning models, SHAP, a game-theoretic approach, was employed to quantify the contribution of individual features to deviations from the mean prediction using SHAP values⁴⁷. PDPs were additionally applied to visualize interactions among key SHAP-identified features and their combined effects on model predictions^48,49.
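The game-theoretic quantity that SHAP approximates can be computed exactly for a small model. The sketch below enumerates all feature coalitions for a toy three-feature function and replaces absent features with a single baseline value. Both the toy model and the single-baseline simplification are assumptions for illustration; SHAP libraries typically average over a background dataset and use tree-specific algorithms for random forests.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in the coalition take their real value, others the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical risk function with one additive term and one interaction."""
    return 0.1 + 0.3 * z[0] + 0.2 * z[1] * z[2]


x = [1, 1, 1]
baseline = [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is split equally between the two features involved.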

To investigate the potential causal effects of modifiable surgical factors on postoperative complications, a two-step approach was employed. First, DiCE⁵⁰ was used to simulate hypothetical adjustments in clinically actionable variables and to explore whether such changes could influence model-predicted complication risks. Subsequently, causal inference methods were applied to quantify the effects of these variables. DML was used to estimate ATEs⁵¹, while a T-learner framework was employed to evaluate CATEs across patient subgroups⁵². Sensitivity analyses were conducted to assess the robustness of the causal estimates.
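The T-learner idea can be illustrated with a deliberately simple base learner. In the sketch below, separate outcome models are fitted for treated and untreated patients, each model being just the mean outcome within a subgroup, and the CATE is their difference. The subgroup labels, the binary treatment, and the outcomes are synthetic; the study's actual base learners and covariates are not specified in this section, and DML is not shown.

```python
from collections import defaultdict


def fit_group_means(rows):
    """Trivial outcome model: mean outcome per subgroup."""
    sums = defaultdict(lambda: [0.0, 0])
    for group, outcome in rows:
        sums[group][0] += outcome
        sums[group][1] += 1
    return {g: s / c for g, (s, c) in sums.items()}


def t_learner(data):
    """data: iterable of (subgroup, treated, outcome). Returns (CATE dict, ATE)."""
    mu1 = fit_group_means([(g, y) for g, t, y in data if t == 1])  # treated model
    mu0 = fit_group_means([(g, y) for g, t, y in data if t == 0])  # control model
    cate = {g: mu1[g] - mu0[g] for g in mu1 if g in mu0}
    # ATE: average of the individual-level CATE predictions.
    # Assumes every subgroup contains both treated and untreated patients.
    ate = sum(cate[g] for g, _, _ in data) / len(data)
    return cate, ate


# Synthetic data: outcome 1 = complication, treated 1 = intervention applied.
data = (
    [("<40", 1, y) for y in (0, 0, 1, 0)]
    + [("<40", 0, y) for y in (1, 0, 1, 0)]
    + [(">=40", 1, y) for y in (0, 1)]
    + [(">=40", 0, y) for y in (1, 1)]
)
cate, ate = t_learner(data)
```

A negative CATE in this toy example means a lower complication rate among treated patients in that subgroup; a causal reading additionally requires that confounding has been adequately controlled.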

Statistical analysis

Patient data were categorized into continuous and categorical variables. The normality of continuous variables was assessed using the Kolmogorov–Smirnov test. Variables with a normal distribution were described as mean ± standard deviation, while those with a skewed distribution were presented as median with interquartile range. Categorical variables were reported as counts and percentages. Data analyses were performed using IBM SPSS Statistics for Windows (IBM Corp., released 2019, version 26.0), Python (version 3.8.2), and R (version 4.2.2).

Data availability

The datasets analyzed during this study are available from the corresponding author upon reasonable request. The code is available on GitHub (BigEarAsk/A-Causal-and-Interpretable-Machine-Learning-Framework-for-Post-Cranioplasty-Complications: code for training and validation).

References

Fung, C. et al. Decompressive hemicraniectomy in patients with supratentorial intracerebral hemorrhage. Stroke 43, 3207–3211 (2012).
Hutchinson, P. J. et al. Trial of Decompressive Craniectomy for Traumatic Intracranial Hypertension. N. Engl. J. Med. 375, 1119–1130 (2016).
Vahedi, K. et al. Early decompressive surgery in malignant infarction of the middle cerebral artery: a pooled analysis of three randomised controlled trials. Lancet Neurol. 6, 215–222 (2007).
Honeybul, S. & Ho, K. M. Long-term complications of decompressive craniectomy for head injury. J. Neurotrauma 28, 929–935 (2011).
Kurland, D. B. et al. Complications associated with decompressive craniectomy: a systematic review. Neurocritical Care 23, 292–304 (2015).
Feroze, A. H. et al. Evolution of cranioplasty techniques in neurosurgery: historical review, pediatric considerations, and current trends. J. Neurosurg. 123, 1098–1107 (2015).
Malcolm, J. G. et al. Complications following cranioplasty and relationship to timing: a systematic review and meta-analysis. J. Clin. Neurosci. 33, 39–51 (2016).
Alkhaibary, A. et al. Cranioplasty: a comprehensive review of the history, materials, surgical aspects, and complications. World Neurosurg. 139, 445–452 (2020).
Zanaty, M. et al. Complications following cranioplasty: incidence and predictors in 348 cases. J. Neurosurg. 123, 182–188 (2015).
Chen, R. et al. Optimal timing of cranioplasty and predictors of overall complications after cranioplasty: the impact of brain collapse. Neurosurgery 93, 84–94 (2023).
Abode-Iyamah, K. O. et al. Risk factors for surgical site infections and assessment of vancomycin powder as a preventive measure in patients undergoing first-time cranioplasty. J. Neurosurg. 128, 1241–1249 (2018).
Cho, S. M. et al. Machine learning compared with conventional statistical models for predicting myocardial infarction readmission and mortality: a systematic review. Can. J. Cardiol. 37, 1207–1214 (2021).
Drysch, M. et al. Streamlined machine learning model for early sepsis risk prediction in burn patients. NPJ Digit. Med. 8, 621 (2025).
Yin, P. et al. Prediction of functional outcomes in aneurysmal subarachnoid hemorrhage using pre-/postoperative noncontrast CT within 3 days of admission. NPJ Digit. Med. 8, 542 (2025).
Bai, Z. et al. Machine learning based CAGIB score predicts in-hospital mortality of cirrhotic patients with acute gastrointestinal bleeding. NPJ Digit. Med. 8, 489 (2025).
Shin, T. H., Ashley, S. W. & Tsai, T. C. Defining the role of machine learning in optimizing surgical outcomes. JAMA Surg. 159, 1432 (2024).
Feuerriegel, S. et al. Causal machine learning for predicting treatment outcomes. Nat. Med. 30, 958–968 (2024).
Rosenthal, G. et al. Polyetheretherketone implants for the repair of large cranial defects: a 3-center experience. Neurosurgery 75, 528–529 (2014).
Rosinski, C. L. et al. A retrospective comparative analysis of titanium mesh and custom implants for cranioplasty. Neurosurgery 86, E15–E22 (2020).
Williams, D. F. Titanium for medical applications. in: Titanium in medicine: material science, surface science, engineering, biological responses and medical applications, 13–24 (Springer, 2001).
Kimchi, G. et al. Predicting and reducing cranioplasty infections by clinical, radiographic and operative parameters - A historical cohort study. J. Clin. Neurosci. 34, 182–186 (2016).
Klieverik, V. M., Robe, P. A., Muradin, M. S. M. & Woerdeman, P. A. Development of a prediction model for cranioplasty implant survival following craniectomy. World Neurosurg. 175, e693–e703 (2023).
Lu, Y., Huo, H. & Jiang, J. A clinical prediction model for complications after cranioplasty based on modified-brain collapse ratio and comorbidity burden. World Neurosurg. 201, 124235 (2025).
Chen, S. et al. Development and validation of an explainable machine learning model for predicting postoperative pulmonary complications after lung cancer surgery: a machine learning study. EClinicalMedicine 86, 103386 (2025).
Mahajan, A. et al. Development and validation of a machine learning model to identify patients before surgery at high risk for postoperative adverse events. JAMA Netw. Open 6, e2322285 (2023).
Collins, G. S. et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385, e078378 (2024).
Riley, R. D. et al. Calculating the sample size required for developing a clinical prediction model. BMJ 368, m441 (2020).
Zheng, F. et al. Early or late cranioplasty following decompressive craniotomy for traumatic brain injury: A systematic review and meta-analysis. J. Int. Med. Res. 46, 2503–2512 (2018).
Bader, E. R., Kobets, A. J., Ammar, A. & Goodrich, J. T. Factors predicting complications following cranioplasty. J. Cranio Maxillo Fac. Surg. 50, 134–139 (2022).
Shepetovsky, D., Mezzini, G. & Magrassi, L. Complications of cranioplasty in relationship to traumatic brain injury: a systematic review and meta-analysis. Neurosurg. Rev. 44, 3125–3142 (2021).
Carpenter, J. R. & Smuk, M. Missing data: a statistical framework for practice. Biom. J. 63, 915–947 (2021).
Little, R. J. & Rubin, D. B. Statistical analysis with missing data (John Wiley & Sons, 2019).
Stekhoven, D. J. & Bühlmann, P. MissForest--non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2012).
Chaurasia, S., Goyal, S. & Rajput, M. Outlier detection using autoencoder ensembles: a robust unsupervised approach. In Proc. 2020 International Conference on Contemporary Computing and Applications (IC3A) 76–80 (IEEE, 2020).
Sun, Y. et al. Modifying the one-hot encoding technique can enhance the adversarial robustness of the visual model for symbol recognition. Expert Syst. Appl. 250, 123751 (2024).
Kursa, M. B. & Rudnicki, W. R. Feature selection with the Boruta package. J. Stat. Softw. 36, 1–13 (2010).
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58, 267–288 (1996).
Ravishankar, H. et al. Recursive feature elimination for biomarker discovery in resting-state functional connectivity. In Proc. 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 4071–4074 (IEEE, 2016).
Lambora, A., Gupta, K. & Chopra, K. Genetic algorithm-A literature review. In Proc. 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon) 380–384 (IEEE, 2019).
Lei, S. A feature selection method based on information gain and genetic algorithm. In Proc. 2012 International Conference on Computer Science and Electronics Engineering, Vol. 2, 355–358 (IEEE, 2012).
Heberle, H., Meirelles, G. V., da Silva, F. R., Telles, G. P. & Minghim, R. InteractiVenn: a web-based tool for the analysis of sets through Venn diagrams. BMC Bioinform. 16, 169 (2015).
Grün, B. & Miljkovic, T. The automated bias-corrected and accelerated bootstrap confidence intervals for risk measures. North Am. Actuar. J. 27, 731–750 (2023).
Stevens, R. J. & Poppe, K. K. Validation of clinical prediction models: what does the “calibration slope” really measure? J. Clin. Epidemiol. 118, 93–99 (2020).
Vickers, A. J. & Elkin, E. B. Decision curve analysis: a novel method for evaluating prediction models. Med. Decis. Mak. 26, 565–574 (2006).
Riley, R. D. et al. Evaluation of clinical prediction models (part 2): how to undertake an external validation study. BMJ 384, e074820 (2024).
He, H., Bai, Y., Garcia, E. A. & Li, S. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In Proc. 2008 IEEE International Joint Conference on Neural Networks (IEEE world congress on computational intelligence) 1322–1328 (IEEE, 2008).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems (NIPS'17), 4768–4777 (Curran Associates Inc., 2017).
Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
Li, W. et al. Effects of heavy metal exposure on hypertension: a machine learning modeling approach. Chemosphere 337, 139435 (2023).
Mothilal, R. K., Sharma, A. & Tan, C. Explaining machine learning classifiers through diverse counterfactual explanations. In Proc. 2020 Conference on Fairness, Accountability, and Transparency, 607–617 (ACM, 2020).
Chernozhukov, V. et al. Double/debiased machine learning for treatment and structural parameters. Econ. J. 21, C1–C68 (2018).
Künzel, S. R. et al. Metalearners for estimating heterogeneous treatment effects using machine learning. Proc. Natl. Acad. Sci. USA 116, 4156–4165 (2019).

Acknowledgements

We sincerely thank Ms. Wenyun Xia for her contributions to the design of the visualization website. This study was supported by Qilu hygiene and health outstanding youth project, Shandong Provincial Natural Science Foundation (ZR2025MS1184), Shandong University project (6010124061, Study on postoperative complications of cranioplasty using polyetheretherketone material), Shaanxi Province Youth Science and Technology Rising Star (2023KJXX-028), and Tangdu Youth Independent Innovation Science Fund (2023ATDQN004). The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.

Author information

These authors contributed equally: Wenbo Li, Bao Wang, Tianzun Li.

Authors and Affiliations

Department of Neurosurgery, Qilu Hospital, Cheeloo College of Medicine and Institute of Brain and Brain-Inspired Science, Shandong University, Jinan, China
Wenbo Li, Haoyong Jin, Jiangli Zhao, Zhiwei Xue, Nan Su, Yanya He, Jiaqi Shi, Xuchen Liu, Xiaoyang Liu, Tianzi Wang, Jiwei Wang, Chao Li, Can Yan, Qichao Qi, Xinyu Wang, Weiguo Li, Bin Huang, Donghai Wang, Xingang Li & Ning Yang
School of Medicine, Cheeloo College of Medicine, Shandong University, Jinan, China
Wenbo Li, Haoyong Jin, Jiangli Zhao, Zhiwei Xue, Nan Su, Yanya He, Jiaqi Shi, Xuchen Liu, Xiaoyang Liu, Tianzi Wang, Jiwei Wang, Chao Li, Can Yan, Qichao Qi, Xinyu Wang, Weiguo Li, Bin Huang, Donghai Wang, Xingang Li & Ning Yang
Shandong Key Laboratory of Brain Health and Function Remodeling, Jinan, China
Wenbo Li, Haoyong Jin, Jiangli Zhao, Zhiwei Xue, Nan Su, Yanya He, Jiaqi Shi, Xuchen Liu, Xiaoyang Liu, Tianzi Wang, Jiwei Wang, Chao Li, Can Yan, Qichao Qi, Xinyu Wang, Weiguo Li, Bin Huang, Donghai Wang, Xingang Li & Ning Yang
Department of Neurosurgery, Tangdu Hospital, Fourth Military Medical University, Xi’an, China
Bao Wang, Xuelian Wang & Yan Qu
Center for Frontier Medicine Innovation, Tangdu Hospital, Fourth Military Medical University, Xi’an, China
Bao Wang
Department of Neurosurgery, Daping Hospital, Army Military Medical University, Chongqing, China
Tianzun Li
School of Computer Science, Hangzhou Dianzi University, Hangzhou, China
Yiwen Ma
Department of Critical Care Medicine, Qilu Hospital of Shandong University, Jinan, China
Yang Ma
Department of Radiation Oncology, Qilu Hospital of Shandong University, Jinan, China
Chen Qiu

Contributions

N.Y. (Ning Yang) and C.Q. (Chen Qiu) were responsible for project administration, conceptualization, and investigation of the study, and also reviewed and edited the manuscript. W.B.L. (Wenbo Li), B.W. (Bao Wang), and T.L. (Tianzun Li) contributed equally to methodology development, formal analysis, data validation, web-based application development, and drafting of the original manuscript. Y.W.M. (Yiwen Ma) contributed to formal analysis and data validation. H.J. (Haoyong Jin), J.Z. (Jiangli Zhao), Z.W.X. (Zhiwei Xue), N.S. (Nan Su), and Y.H. (Yanya He) were involved in data curation and formal analysis. J.S. (Jiaqi Shi), X.C.L. (Xuchen Liu), X.Y.L. (Xiaoyang Liu), T.W. (Tianzi Wang), J.W. (Jiwei Wang), C.L. (Chao Li), C.Y. (Can Yan), Y.M. (Yang Ma), and Q.Q. (Qichao Qi) contributed to data curation. X.Y.W. (Xinyu Wang), W.G.L. (Weiguo Li), B.H. (Bin Huang), D.W. (Donghai Wang), X.L.W. (Xuelian Wang), Y.Q. (Yan Qu), and X.G.L. (Xingang Li) provided supervision for the study. N.Y. and C.Q. had full access to all data in the study and took responsibility for the integrity of the data and the accuracy of the data analysis. The corresponding authors assumed final responsibility for the decision to submit the manuscript. All authors reviewed and approved the final manuscript.

Corresponding authors

Correspondence to Chen Qiu or Ning Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Li, W., Wang, B., Li, T. et al. A Causal and interpretable machine learning framework for postcranioplasty risk prediction and surgical decision support. npj Digit. Med. 9, 184 (2026). https://doi.org/10.1038/s41746-026-02370-6

Received: 01 September 2025
Accepted: 13 January 2026
Published: 21 January 2026
Version of record: 20 February 2026
DOI: https://doi.org/10.1038/s41746-026-02370-6