Abstract
Postoperative delirium (POD) is associated with increased morbidity and mortality. This study aims to develop a deep learning-based model (DELPHI-EEG) to predict postoperative delirium using intraoperative electroencephalogram (EEG) waveform. A total of 34,550 surgical cases (267 event cases), with 6-lead intraoperative EEG monitoring between 2022 and 2024, were included for model development. During 5-fold cross-validation, the DELPHI-EEG model showed an area under the receiver operating characteristic (AUROC) curve of 0.870 (95% confidence interval [CI]: 0.789–0.935) and the area under the precision-recall curve (AUPRC) of 0.038 (95% CI: 0.017–0.084), significantly outperforming the logistic regression model using burst suppression ratio with AUROC of 0.729 (95% CI: 0.624–0.825, p = 0.004) and AUPRC of 0.013 (95% CI: 0.007–0.026, p = 0.002). The DELPHI-EEG model might serve as a risk predictor for postoperative delirium, potentially enabling targeted preventive interventions for surgical patients; nonetheless, external validation in diverse clinical settings is required.
Introduction
Postoperative delirium (POD) is an acute neuropsychiatric syndrome characterized by an acute onset of impaired awareness, attention, and cognition after surgery1. With an incidence ranging from 10% to 60% depending on the surgical population and an average time to onset of delirium of 2.1 ± 0.9 days, POD can lead to increased 5-year mortality (odds ratio = 7.35; 95% CI = 1.49–36.18), unplanned intensive care unit admission, prolonged hospital stay, discharge to non-home settings, decline in activities of daily living, and higher healthcare costs2,3,4,5. Additionally, sociodemographic disparities—including lower educational attainment and neighborhood socioeconomic disadvantage—are associated with increased POD risk6. Therefore, developing accurate and accessible predictive models is essential for mitigating these disparities by supporting timely interventions in vulnerable groups.
Despite its significant clinical impact, early prediction of POD remains challenging. Current delirium risk prediction tools primarily rely on preoperative factors, such as age, cognitive function, physical health, and medical history2,7,8. Although such tools provide valuable risk stratification before surgery, most of them cannot account for intraoperative physiological changes that may contribute to POD development. Some intraoperative monitoring techniques, such as transcranial Doppler and near-infrared spectroscopy, have been used for POD risk assessment8. However, these methods require additional equipment, increase costs, and only evaluate the cerebral blood flow in limited brain regions. Given these limitations, there is an urgent clinical need for accurate, intraoperatively available tools that can identify patients at high risk for POD early, allowing for timely preventive strategies and targeted postoperative care.
Electroencephalogram (EEG) monitoring is routinely performed during surgery under general anesthesia and provides continuous assessment of brain function. Prior studies have shown that intraoperative EEG parameters, such as burst suppression or changes in delta, theta, or fast frequency activity during procedures, are correlated with POD incidence9,10. In particular, the presence of EEG burst suppression has been reported as a significant predictor of POD, with patients who developed POD showing a mean burst suppression duration increase of 15.86 min compared to that in those without POD11. However, these models exclude other EEG features that showed significant differences between the patients with and without POD, such as alpha, delta, and theta power, highlighting the need for further analysis to explore the potential of these additional EEG features9,12,13.
With the advancement of deep learning techniques, a growing number of studies have applied EEG signals to various clinical tasks, such as diagnosis and prediction, including intraoperative hypotension prediction using biosignal waveforms14,15,16. In a previous study using 10-lead EEG signals, a vision transformer-based model predicted delirium in mechanically ventilated critically ill patients with 97% test accuracy17. Machine learning (ML) models combining handcrafted features of intraoperative frontal EEG and clinical parameters have been developed for POD prediction and have achieved area under the receiver operating characteristic curve (AUROC) values of 0.77 and 0.887, respectively18,19. However, no study has predicted postoperative delirium using deep learning techniques applied to multi-channel intraoperative EEG waveforms.
Recently, a deep learning-based spatiotemporal encoding framework based on resting-state EEG has been proposed to predict postoperative cognitive function20. This architecture combines a graph convolutional network to capture connectivity patterns between brain regions, a convolutional neural network to extract hierarchical spatial and temporal features, and a Transformer module to model dependencies across time points. Such a framework is particularly promising for predicting POD because POD is associated with multi-level EEG alterations, including alpha power, frontoparietal alpha coherence, and connectivity between the occipitoparietal and frontal cortex21,22,23. Therefore, we adapted this spatiotemporal encoding framework to utilize intraoperative EEG for POD prediction. Feeding multi-channel EEG signals directly into the model allows it to use EEG features other than burst suppression.
The primary objective of our study was to develop a DEep Learning-based model for POD Hazard assessment using Intraoperative EEG (DELPHI-EEG). We compared the predictive performance of DELPHI-EEG with that of a conventional logistic regression model and ML models incorporating the burst suppression ratio (SR), Patient State Index (PSI), age, and sex. We hypothesized that DELPHI-EEG would capture various EEG features beyond burst suppression, thereby demonstrating its potential as an artificial intelligence-assisted tool for POD prediction in surgical patients undergoing general anesthesia. The primary endpoint of our study was the occurrence of POD, defined according to psychiatric consultations and/or administration of delirium-related antipsychotics.
Results
Study population
A total of 35,115 surgical cases were identified, comprising 34,671 negative and 444 positive cases (Fig. 1). The dataset was randomly split by unique patients into a development set (30,802 negative and 241 positive cases) and a test set (3481 negative and 26 positive cases). The development set was further undersampled randomly to 1205 negative and 241 positive cases. Sample exclusion due to missing EEG channels or procedures not involving general anesthesia resulted in 21,290 negative and 3640 positive samples (804 negative and 142 positive cases) in the development set, and 62,009 negative and 422 positive samples (2325 negative and 15 positive cases) in the test set. The test set was evaluated across three sampling strategies: an original distribution set (2325 negative and 15 positive cases), a 1:1 undersampled set (12 negative and 15 positive cases), and a 2:1 undersampled set (34 negative and 15 positive cases). The mean age of the analyzed cases in the development and test set was 58.72 ± 15.44 years (58.66 ± 15.46 years in the development set; 59.22 ± 15.31 years in the test set), and the proportion of males was 43.3% (43.5% in the development set; 43.3% in the test set). Among the POD-positive cases, the median time from surgery to POD onset was 4 days (Q1: 2 days, Q3: 12 days).
Fig. 1 abbreviations: EEG, electroencephalography; SNUH, Seoul National University Hospital; NP, neuropsychiatry; (+), positive label for postoperative delirium; (−), negative label for postoperative delirium.
Comparison of model performance
Figures 2 and 3 show that the AUROC of DELPHI-EEG was 0.870 (95% confidence interval [CI]: 0.789–0.935) (Fig. 2a), outperforming the baseline logistic regression model based on the duration of time where SR > 1%, area under the SR versus time curve where SR > 1%, duration of time where PSI < 25, age, and sex (AUROC: 0.729 [95% CI: 0.624–0.825]) in the testing set with the original ratio (p = 0.004) (Fig. 3a). The ML models based on the same input features, namely XGBoost (XGB), LightGBM (LGB), Random Forest (RF), and Gradient Boosting Classifier (GB), achieved AUROCs of 0.801 (95% CI: 0.673–0.891), 0.798 (95% CI: 0.675–0.890), 0.799 (95% CI: 0.660–0.901), and 0.764 (95% CI: 0.620–0.875), respectively. The other evaluation metrics, including the area under the precision-recall curve (AUPRC), F1 score, accuracy, Brier score, integrated calibration index (ICI), precision, and recall, are presented in Table 1 and Supplementary Table 1. The AUPRC and the recall of DELPHI-EEG were 0.038 (95% CI: 0.013–0.040) and 0.933 (95% CI: 0.778–1.000), respectively.
a 155:1 (original ratio), b 1:1 undersampled, and c 2:1 undersampled test sets.
a The receiver operating characteristic curve, b confusion matrix, and c predicted probability histogram.
Interpretability analysis
SHAP (SHapley Additive exPlanations) analysis for the XGB, LGB, RF, and GB models indicated that older age, male sex, longer durations of PSI < 25 and of SR > 1%, and a larger area under the SR versus time curve where SR > 1% were associated with delirium incidence (Supplementary Fig. 1).
Linear regression analysis showed significant associations between DELPHI-EEG predicted probabilities and both the duration of time where SR > 1% and the area under the SR versus time curve where SR > 1%. The regression slopes were 2.112 × 10³ (95% CI: 1.890–2.335 × 10³) for the duration of time where SR > 1% and 3.278 × 10² (95% CI: 2.805–3.750 × 10²) for the area under the SR versus time curve where SR > 1% (all p < 0.001) (Supplementary Fig. 2). There were significant associations between DELPHI-EEG predicted probabilities and relative band power in the delta, theta, and alpha frequency ranges. The regression slopes were 0.343 (95% CI: 0.285–0.400), 0.104 (95% CI: 0.068–0.141), and −0.562 (95% CI: −0.601 to −0.523) for the delta, theta, and alpha band powers, respectively (all p < 0.001) (Fig. 4a). However, there was no significant association between DELPHI-EEG predicted probabilities and the time to POD onset. The regression slope was −0.012 (95% CI: −0.027 to 0.004) (p = 0.222) (Supplementary Fig. 3b).
a DELPHI-EEG predicted probability and relative band power of delta (1–4 Hz), theta (4–8 Hz), and alpha (8–12 Hz) frequency bands. b Percent reduction in F1-score for postoperative delirium (POD) prediction following targeted ablation applied to each 2 min sample. β denotes the slope from Spearman’s rank correlation test, and p denotes the corresponding p-value.
Ablation of individual frequency bands resulted in a reduction in DELPHI-EEG performance across all spectral ranges tested (Fig. 4b). The largest decrease in the F1-score was observed following removal of the alpha band (8–12 Hz), with a reduction of 53.04% relative to baseline, followed by reductions of 39.46% for the beta band (12–30 Hz), 34.12% for the slow band (0.5–1 Hz), 33.38% for the delta band (1–4 Hz), and 29.79% for the theta band (4–8 Hz).
Ablation of individual EEG channels resulted in a reduction in DELPHI-EEG performance. The largest decrease in the F1-score was observed following removal of the L2 (left frontoparietal lead), with a reduction of 33.17% relative to baseline, followed by reductions of 31.54% for the R2 (right frontoparietal lead), 22.44% for the R (difference between R1 and R2), 20.65% for the L (difference between L1 and L2), 18.99% for the L1 (left frontal lead), and 12.41% for the R1 (right frontal lead)24 (Supplementary Fig. 4a).
The temporal importance plot showed that the model placed greater emphasis on the 40–80 s interval within each 2 min sample (Supplementary Fig. 4b).
Subgroup analysis
Subgroup analyses were conducted separately based on the type of anesthesia and the type of surgery. For anesthesia type, the AUROC of the DELPHI-EEG was 0.864 (95% CI: 0.628–0.961) in the total intravenous anesthesia (TIVA) group and 0.872 (95% CI: 0.770–0.943) in the inhalational anesthesia group. For surgical type, the AUROC was 0.899 (95% CI: 0.833–0.981) in the abdominopelvic surgery group, 0.848 (95% CI: 0.689–0.963) in the thoracic surgery group, and 0.867 (95% CI: 0.661–0.960) in the other surgery group (Table 2). The survival analysis showed a significantly different distribution between the predicted positive and negative groups from DELPHI-EEG (p < 0.001) (Supplementary Fig. 3a).
Sensitivity analysis
Of the 34,550 analyzed cases, 4268 had both the original and CAM-ICU labels. The accuracy and F1-score of the original label to predict the CAM-ICU label were 0.895 and 0.242, respectively. In the testing dataset, the AUROC of the DELPHI-EEG model for predicting the composite outcome was maintained at 0.839 (95% CI: 0.795–0.874, p = 0.3707), while the AUPRC increased to 0.123 (95% CI: 0.076–0.198, p = 0.007) compared with the performance using the original labels.
Discussion
This study aimed to develop and validate a deep learning–based intraoperative EEG model for predicting postoperative delirium (POD) and to compare its performance with conventional and machine learning–based approaches. Overall, the DELPHI-EEG showed potentially improved discriminability compared with that of the logistic regression model.
Our model demonstrated a higher AUROC (0.870, 95% CI: 0.789–0.935) than that of a previous ML model using intraoperative frontal EEG signatures and clinical parameters as inputs (AUROC: 0.73–0.80)18. Also, our model showed a higher AUROC compared with that of a transformer-based model using time series of intraoperative features, including EEG as inputs (AUROC: 0.772–0.787)26. In a previous study, Han et al. developed an ML model incorporating intraoperative EEG features and clinical parameters, which achieved an AUROC of 0.88719. However, the model was developed using a manually curated subset of EEG features, which may not be optimal compared to learned features from raw waveform in DELPHI-EEG, and required more than 100 perioperative clinical features as inputs, which may not be feasible in a real-world setting. Moreover, the model by Han et al. was developed specifically for patients undergoing cardiac surgery and included cardiac surgery-specific features, such as intraoperative inotrope use before, during, and after cardiopulmonary bypass (CPB), as well as CPB duration. These features may limit the model’s generalizability to non-cardiac surgical populations. In contrast, DELPHI-EEG was the only model to show a statistically significant improvement in the AUROC relative to the baseline logistic regression model (p = 0.004). Other ML models that incorporated age, sex, and EEG features, akin to the Han et al. model, did not demonstrate significant AUROC improvements (XGB, p = 0.203; LGB, p = 0.181; RF, p = 0.246; GB, p = 0.598). DELPHI-EEG achieved this superior performance by leveraging a deep learning-based framework to extract spatiotemporal features from raw signals, rather than by relying on manually curated features. By directly processing raw EEG waveforms, DELPHI-EEG extracts multi-scale spatiotemporal representations that reflect both transient and sustained electrophysiologic patterns. 
This framework enables the identification of POD risk in patients who do not exhibit overt burst suppression or abnormal PSI values, thereby extending predictive coverage beyond traditional indices.
The interpretability analysis revealed that a high risk for POD was correlated with a reduced relative alpha power. This observation is consistent with previous studies where patients with POD had a significantly lower alpha power than the control group in global (0.09 ± 0.06 vs. 0.21 ± 0.08, p < 0.0001) and frontal (0.09 ± 0.07 vs. 0.24 ± 0.1, p < 0.0001) EEG27. In addition, POD was also associated with increased delta and theta power, aligning with a previous report (odds ratio = 1.97; 95% CI = 1.30–2.99) where EEG changes included a greater than 50% increase in delta or theta activity9. The generation of intraoperative alpha oscillations has been considered to reflect thalamic hyperpolarization and thalamo-cortical synchronization28. Therefore, attenuated alpha power and relatively increased delta and theta power may signal impaired thalamo-cortical connectivity, potentially reflecting age-related neural vulnerability. This link to aging is supported by findings that alpha band power naturally decreases with age (p < 0.001)29. Given that age was an input variable in the model, attenuated alpha power may represent individual susceptibility to POD30.
Frequency-domain perturbation analysis underscored the relative importance of alpha activity in POD prediction, reinforcing its discriminative value compared to other frequency bands. Notably, a prior study reported a > 2-fold increase in alpha power following anesthesia induction in control patients, while it was maintained in those who developed POD27. Both groups exhibited significant increases in delta absolute power post-induction27. Our ablation analysis showed that removing any single frequency band degraded model performance, suggesting that while alpha activity holds high predictive value, DELPHI-EEG leverages broadband spectral information for robust predictive performance.
Among the analyzed cases, 12.4% had available CAM-ICU data. Although CAM-ICU is widely regarded as the gold standard for delirium assessment in both clinical and research settings, its use is limited to ICU patients. Because our study focused on general surgical patients, many of whom were not admitted to the ICU, we established the primary delirium labels based on neuropsychiatric consultations and antipsychotic medication use. However, the sensitivity analysis using the composite outcome revealed that the DELPHI-EEG model’s AUROC was not significantly different from the results with the original labels. The concurrent increase in AUPRC may be due to a higher event rate in the composite standard. This result validates our labeling approach and highlights the model’s generalizability to different annotations.
Our intraoperative EEG model could assist physicians in identifying patients at high risk for POD, allowing for the implementation of targeted prevention strategies. Currently, postoperative dexmedetomidine sedation, multicomponent interventions, and perioperative antipsychotic administration are recognized as effective preventive measures31. Given that POD typically occurs several days after surgery under general anesthesia—as reflected by the median onset of 4 days in our cohort—there is a critical window during which close monitoring and rapid interventions could be helpful. Incorporating our model into perioperative care pathways may enhance risk stratification and improve postoperative outcomes, although further prospective validation is warranted to confirm its clinical utility. From a practical perspective, DELPHI-EEG enables real-time application in the operating room by processing continuous EEG waveforms without additional perioperative inputs. Predictions can be updated within milliseconds, introducing no delay beyond the standard EEG acquisition time. Moreover, DELPHI-EEG can be used in conjunction with conventional indices, such as the burst suppression ratio or PSI, to enhance both predictive robustness and interpretability.
Our study has several limitations that should be acknowledged. First, POD was labeled based on psychiatric consultations and administration of delirium-related antipsychotics. This was because assessment and diagnosis data are often missing in electronic health records. However, these medications are commonly considered to manage psychotic symptoms, which occur in approximately 42.7–44.5% of delirium cases32,33,34. In addition, although haloperidol is primarily considered in the management of delirium, other antipsychotics such as quetiapine can be administered for other indications, such as insomnia35. Second, our analysis did not account for patients who may have developed delirium after hospital discharge. Excluding these cases may have led to an underestimation of the true incidence of postoperative delirium. Third, we developed and validated the DELPHI-EEG model using a single-center cohort with a 6-channel EEG monitoring system. External validation of our model is necessary but may prove challenging due to variations in intraoperative EEG practices across centers, where typically only 2 to 4 frontal channels are used36. Moreover, differences in institutional EEG display settings, such as amplitude resolution, can affect signal amplitude and quality, further complicating external validation37. Fourth, the burst SR was automatically calculated by the intraoperative monitor, similar to a previous model that predicted POD using intraoperative EEG suppression, relying on SR analysis rather than manual raw EEG interpretation38. Accordingly, expert review may be required, as this approach could influence the model’s performance. Fifth, we did not assess the relationship between DELPHI-EEG predictions and patient frailty, as frailty assessments were not available in the electronic health records. 
Frailty, a known risk factor for POD, is associated with 1.5- to 2-fold lower intraoperative EEG power across alpha, beta, delta, and theta bands compared to robust patients (all p < 0.001)25. Therefore, frailty may act as a confounding variable in the observed correlations between DELPHI-EEG predictions and EEG spectral power. Future studies should adjust for frailty when interpreting these correlations, and further analysis is warranted to incorporate frailty as an additional input to the DELPHI-EEG model to improve its predictive accuracy.
In conclusion, the deep learning-based DELPHI-EEG model successfully predicted postoperative delirium using intraoperative EEG waveforms, outperforming conventional suppression ratio-based logistic regression models, although the AUROC confidence intervals of the two models overlap. By identifying attenuated alpha power as a key predictor, the model aligns with established neurophysiological correlates of delirium, while offering a clinically feasible and real-time risk stratification tool. Although external validation across diverse clinical settings is required, DELPHI-EEG shows promise as a clinical tool for POD risk stratification, potentially enabling targeted preventive interventions before delirium onset.
Methods
Study design
This retrospective study was approved by the institutional review board (IRB) of Seoul National University Hospital (IRB No. 2506-023-1646; approval date: June 10, 2025). Given the retrospective nature of the study, the IRB waived the requirement for patient consent.
Adults ( ≥18 years) who received surgery under general anesthesia with intraoperative EEG monitoring on Root Platform (Masimo, Irvine, CA, USA) at Seoul National University Hospital (SNUH) between January 2022 and July 2024 were included in this retrospective study. This study followed the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis + Artificial Intelligence (TRIPOD + AI) and the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline for observational studies39,40.
Study population
Postoperative delirium was defined as cases with (1) newly administered antipsychotics (haloperidol, quetiapine, olanzapine, or risperidone)41 or (2) a diagnosis of delirium during the neuropsychiatry consultation after surgery during the same hospital admission. In cases with a neuropsychiatry consultation, a delirium diagnosis was identified by searching for the keyword “delirium” in the text between the “Assessment” and “Plan” sections of the psychiatry consultation notes in the electronic health records. Patients with no demographic information, preoperative administration of antipsychotics, or a preoperative delirium diagnosis were excluded.
The 34,550 analyzed cases were randomly divided at the patient level into a development set and a test set at a ratio of 9:1. The development set was further undersampled to a negative-to-positive ratio of 5:1 (refs. 42, 43). Because of the class imbalance, where the number of POD-positive samples is much smaller than that of POD-negative samples, the test set was evaluated using three different random sampling strategies based on unique patients: the original test set, a 1:1 undersampled set, and a 2:1 undersampled set.
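The patient-level split and negative undersampling described above can be illustrated with a minimal sketch. The record keys `patient_id` and `pod`, the function names, the seed, and the toy data are all hypothetical; this is not the study's code, only one way such a split could be written.

```python
import random

def split_by_patient(cases, test_fraction=0.1, seed=0):
    """Split cases so that every patient lands in exactly one partition.

    `cases` is a list of dicts with hypothetical keys "patient_id" and "pod".
    """
    patient_ids = sorted({c["patient_id"] for c in cases})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    dev = [c for c in cases if c["patient_id"] not in test_ids]
    test = [c for c in cases if c["patient_id"] in test_ids]
    return dev, test

def undersample_negatives(cases, ratio=5, seed=0):
    """Keep all positive cases and at most `ratio` negatives per positive."""
    positives = [c for c in cases if c["pod"] == 1]
    negatives = [c for c in cases if c["pod"] == 0]
    rng = random.Random(seed)
    n_keep = min(len(negatives), ratio * len(positives))
    return positives + rng.sample(negatives, n_keep)

# Toy data: 200 patients, one case each, every 20th patient POD-positive
cases = [{"patient_id": i, "pod": int(i % 20 == 0)} for i in range(200)]
dev, test = split_by_patient(cases)
dev_balanced = undersample_negatives(dev)
```

Splitting on patient identifiers rather than on cases keeps repeat surgeries of the same patient out of both partitions at once, which is the leakage the patient-level split is meant to prevent.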
Sample size calculation
To ensure the adequacy of the analyzed sample size, the calculation from a previous study was applied44. Using the observed overall prevalence, a target AUROC of 0.80, and prespecified confidence interval widths of 0.10 for AUROC, 0.20 for calibration slope, and 0.20 for observed/expected ratio, the calculation demonstrated that the available sample size was larger than the required minimum size of 30,023 to achieve the desired precision and stability of model performance estimates.
Data collection
Intraoperative time series of EEG, SR, and PSI were extracted from VitalDB (Seoul National University College of Medicine, Seoul, South Korea). The age, sex, admission and discharge times, operation start and end times, neuropsychiatric consultation timing and content, the CAM-ICU label, and the timing and type of antipsychotics administration of each participant were retrieved from the electronic health records of Seoul National University Hospital.
EEG preprocessing
For each surgical case, 30 samples of a 2 min EEG waveform were selected by applying a consistent statistical criterion (Supplementary Fig. 5). The mean value was computed from the voltage values at each time point across all 6 EEG channels throughout the whole recording, and a sliding window with a 40 s stride was applied to extract candidate samples. From the candidate samples of each surgical case, the 30 samples with mean values closest to the case-wise mean were retained. This sampling strategy was designed to capture EEG features across various intraoperative phases that are associated with POD, including changes in band power immediately following anesthesia induction, EEG suppression during periods of reduced anesthetic concentration, and alpha spindle activity during emergence from anesthesia38,45,46. We excluded samples with any missing values in the 2 min EEG waveform. Samples without SR or PSI values were also excluded. Consequently, some surgical cases included fewer than 30 samples in the final analysis. Samples from surgeries not under general anesthesia were also excluded. Each sample—originally a time series of length 21,378 (corresponding to 2 min at the original sampling rate, 178.15 Hz)—was resampled to a uniform length of 9600 at 80 Hz. A Butterworth bandpass filter between 0.5 and 40 Hz was then applied to remove low-frequency drift and high-frequency noise.
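As a rough illustration of the window selection step, the sketch below extracts candidate 2 min windows with a 40 s stride and keeps the 30 whose mean voltage is closest to the case-wise mean. The function name, the rounding of the stride to whole points, and the toy recording are assumptions; missing-value exclusion, resampling, and filtering are omitted.

```python
import numpy as np

FS = 178.15                    # original sampling rate in Hz (from the text)
WIN = 21378                    # points in a 2 min sample (from the text)
STRIDE = int(round(40 * FS))   # 40 s stride in points; the rounding is an assumption

def select_windows(eeg, n_keep=30):
    """Return start indices of the n_keep windows whose mean is closest to the case mean.

    eeg: array of shape (channels, time) for one surgical case.
    """
    case_mean = eeg.mean()                                  # mean over channels and time
    starts = np.arange(0, eeg.shape[1] - WIN + 1, STRIDE)   # candidate window starts
    window_means = np.array([eeg[:, s:s + WIN].mean() for s in starts])
    order = np.argsort(np.abs(window_means - case_mean), kind="stable")
    return np.sort(starts[order[:n_keep]])                  # back to chronological order

# Toy recording: 6 channels of noise, roughly 37 min long
rng = np.random.default_rng(0)
eeg = rng.normal(size=(6, 400_000))
kept = select_windows(eeg)
```

The sample lengths quoted in the text follow from the durations: 120 s × 178.15 Hz = 21,378 points before resampling and 120 s × 80 Hz = 9600 points after.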
Deep learning model architecture
An encoding framework comprising a convolutional neural network and a graph convolutional network, followed by a transformer layer, was used in the DELPHI-EEG to predict probabilities for POD20. The model input included a 6-channel, 2 min waveform, age, and sex. The undersampled development set was randomly split by unique patients into five folds for 5-fold cross-validation. The model was trained on each fold with early stopping. The mean number of training steps across the folds was then used as the number of training steps for the final model, which was trained on the whole undersampled development set.
Logistic regression and machine learning models
The baseline logistic regression model and ML models were developed using the duration of time where SR > 1%, area under the SR versus time curve where SR > 1%, duration of time where PSI < 25, age, and sex. These variables were selected because they represented the only intraoperative EEG features included in the machine learning model developed by Han et al., thereby allowing for a direct comparison of model performance19. The SR values were automatically calculated by the surgical monitoring equipment and subsequently stored using VitalRecorder (Seoul National University College of Medicine, Seoul, South Korea), a software that captures and archives high-resolution biosignal waveforms and vital signs47. XGB, LGB, RF, and GB were employed as ML models. The undersampled development set was randomly partitioned by unique patients into five folds, and the hyperparameters of each model were tuned to maximize the mean AUROC across the cross-validation sets. To evaluate the relationship between DELPHI-EEG predictions and EEG burst suppression, we assessed the correlation between model-predicted probabilities and both the duration of time with SR > 1% and the area under the SR-versus-time curve where SR > 1%.
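The three EEG-derived inputs of the baseline models can be sketched as follows. This assumes equally spaced SR and PSI series with a sampling interval `dt` in seconds, and it takes the area as the rectangle-rule area under the SR curve over the points where SR > 1%. Both choices, as well as the function and key names, are assumptions, since the text does not specify them.

```python
def suppression_features(sr, psi, dt=1.0):
    """Summarize SR (%) and PSI series sampled every `dt` seconds (assumed)."""
    sr_duration = sum(dt for v in sr if v > 1.0)      # seconds with SR > 1%
    sr_area = sum(v * dt for v in sr if v > 1.0)      # area under SR where SR > 1%
    psi_duration = sum(dt for v in psi if v < 25.0)   # seconds with PSI < 25
    return {
        "sr_duration_s": sr_duration,
        "sr_auc": sr_area,
        "psi_low_duration_s": psi_duration,
    }

# Toy series, one value per second
sr = [0, 0, 2, 5, 10, 0.5, 0]
psi = [40, 30, 24, 20, 22, 35, 45]
features = suppression_features(sr, psi)
```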
Statistical analysis
AUROC, AUPRC, F1 score, and accuracy were used to evaluate the discriminative performance of DELPHI-EEG and the logistic regression model. The survival analysis, where survival was defined as remaining free of POD, between the predicted positive and negative labels from DELPHI-EEG for the original test set, was performed using the log-rank test. To evaluate the feasibility of DELPHI-EEG for long-term predictions, we assessed the correlation between model-predicted probabilities and the time to POD onset for POD-positive cases in the test set, using the original ratio. The positive label cutoff for model-predicted probabilities was the threshold maximizing the Youden index in the undersampled development set. The calibration performance was evaluated with the Brier score and the integrated calibration index after spline calibration48. DeLong’s test was applied to compare the AUROCs of DELPHI-EEG with those of other machine learning models. Correlations between model-predicted probabilities and relative band powers or time to POD onset were assessed using Spearman’s rank correlation coefficient. Statistical significance was defined as a two-sided p-value < 0.05. All model development and statistical analyses were performed using Python 3.9 (Python Software Foundation, Wilmington, DE, USA).
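For readers unfamiliar with the Youden index, the cutoff selection can be illustrated with a small sketch that scans every observed probability as a candidate threshold and keeps the one maximizing sensitivity + specificity − 1. The function name and toy data are hypothetical, and the convention that a prediction is positive when the probability is at or above the threshold is an assumption.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_j, best_t = -1.0, None
    for t in sorted(set(y_prob)):
        # A case is predicted positive when its probability is >= t (assumed convention)
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

# Toy development-set labels and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
threshold, j = youden_threshold(y_true, y_prob)
```

In this toy example the cutoff of 0.35 captures all three positives while misclassifying one of five negatives, giving J = 1.0 + 0.8 − 1 = 0.8.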
Interpretability analysis
SHAP analysis was performed post hoc for all ML models (XGB, LGB, RF, GB) to interpret feature contributions and assess the relative importance of SR, PSI, age, and sex.
To assess the interpretability of the DELPHI-EEG model, we analyzed the correlation between its prediction outputs and the spectral characteristics of the raw EEG signals. Specifically, we evaluated the relationship between model-predicted probabilities and the relative band power within the delta (1–4 Hz), theta (4–8 Hz), and alpha (8–12 Hz) frequency bands27. For each surgical case, predicted probabilities were obtained for individual samples, and the corresponding power spectral densities were estimated with Welch’s method49. The resulting power spectral density was then integrated over frequency via the trapezoidal rule to obtain band-limited power. For each band, relative band power was computed by integrating the power spectral density over the band, summing across channels, and dividing by the power across the 1–40 Hz range.
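A minimal NumPy sketch of the relative band power computation is given below. It uses a hand-written Welch estimate (Hann window, 50% overlap, per-segment mean removal) so that the example stays self-contained. The segment length of 256 points, the helper names, and the toy signal are assumptions rather than the study's settings.

```python
import numpy as np

BANDS = {"delta": (1.0, 4.0), "theta": (4.0, 8.0), "alpha": (8.0, 12.0)}

def welch_psd(x, fs, nperseg=256):
    """Welch estimate: average Hann-windowed periodograms with 50% overlap.

    x: array of shape (channels, time). Returns (freqs, psd per channel).
    """
    window = np.hanning(nperseg)
    step = nperseg // 2
    starts = range(0, x.shape[-1] - nperseg + 1, step)
    segs = np.stack([x[..., s:s + nperseg] for s in starts], axis=-2)
    segs = segs - segs.mean(axis=-1, keepdims=True)     # remove each segment's mean
    spec = np.abs(np.fft.rfft(segs * window, axis=-1)) ** 2
    psd = spec.mean(axis=-2) / (fs * np.sum(window ** 2))
    psd[..., 1:-1] *= 2.0                               # one-sided density
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, psd

def trapezoid(y, x):
    """Trapezoidal-rule integral of y over x along the last axis."""
    return np.sum((y[..., 1:] + y[..., :-1]) * np.diff(x) / 2.0, axis=-1)

def relative_band_power(x, fs=80.0):
    freqs, psd = welch_psd(x, fs)

    def power(lo, hi):
        mask = (freqs >= lo) & (freqs <= hi)
        return trapezoid(psd[..., mask], freqs[mask]).sum()   # summed across channels

    total = power(1.0, 40.0)
    return {name: power(lo, hi) / total for name, (lo, hi) in BANDS.items()}

# Toy sample: 6 channels, 2 min at 80 Hz, dominated by a 10 Hz (alpha) rhythm
rng = np.random.default_rng(0)
t = np.arange(9600) / 80.0
x = np.sin(2 * np.pi * 10.0 * t) + 0.1 * rng.normal(size=(6, 9600))
rel = relative_band_power(x)
```

Because relative power is a ratio of two integrals of the same spectrum, the absolute scaling of the density cancels, so the result is insensitive to the normalization convention.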
To assess the spectral contribution of distinct frequency bands to model performance, we implemented a targeted ablation strategy in the frequency domain50. Perturbations were applied to each 2 min sample in the original ratio test set. For each frequency band, we performed a discrete Fourier transform using a zero-padded Fast Fourier Transform (FFT). Frequency coefficients corresponding to the targeted band were then replaced with zeros. The modified spectrum was inverse transformed into the time domain using an inverse FFT, producing a perturbed version of the test data with the designated frequency band suppressed. These band-ablated time series were passed through DELPHI-EEG, and predictions were recalibrated using a pre-trained spline calibration model. Binary classification outcomes were generated using a previously determined Youden threshold. The model’s F1-score was then evaluated on the perturbed test data and compared to the baseline F1-score on unperturbed inputs. The percent change in the F1-score following each band-specific perturbation was used to quantify the relative importance of that band in contributing to the model’s decision-making process.
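The frequency-domain ablation can be sketched with NumPy's real FFT. Zero-padding to the next power of two, the half-open band edges, the function name, and the toy signal are assumptions; the recalibration and thresholding steps are not shown.

```python
import numpy as np

def ablate_band(x, fs, lo, hi):
    """Zero the Fourier coefficients in [lo, hi) Hz and return the time-domain signal.

    x: array whose last axis is time, e.g. (channels, time).
    """
    n = x.shape[-1]
    n_fft = 1 << (n - 1).bit_length()        # zero-pad to the next power of two (assumption)
    spectrum = np.fft.rfft(x, n=n_fft, axis=-1)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    spectrum[..., (freqs >= lo) & (freqs < hi)] = 0.0
    return np.fft.irfft(spectrum, n=n_fft, axis=-1)[..., :n]   # drop the padding

# Toy sample: a 2 Hz (delta) plus a 10 Hz (alpha) rhythm, 2 min at 80 Hz
fs = 80.0
t = np.arange(9600) / fs
low = np.sin(2 * np.pi * 2.0 * t)
x = np.tile(low + np.sin(2 * np.pi * 10.0 * t), (6, 1))
x_no_alpha = ablate_band(x, fs, 8.0, 12.0)
residual = float(np.mean((x_no_alpha - low) ** 2))   # what remains of the 10 Hz rhythm
```

After ablating 8–12 Hz, the toy signal is close to the 2 Hz component alone; the small residual comes from spectral leakage caused by the finite sample length.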
Spatial contributions were assessed by per-channel ablation: for each channel, we set the channel to zero across time, recomputed the calibrated predictions, and evaluated the change in F1-score relative to the baseline. Temporal contributions were assessed with gradient-based saliency. We computed the gradient of the model output with respect to the input, averaging the absolute value of gradients across time to obtain temporal importance profiles.
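The per-channel ablation can be illustrated with the following sketch. It assumes a hypothetical `predict_fn` that maps a batch of shape (n_samples, n_channels, n_times) to calibrated probabilities, and a precomputed decision `threshold`; neither name comes from the study's code. The gradient-based saliency step is not covered here because it depends on the deep learning framework used.

```python
import numpy as np


def f1_score_binary(y_true, y_pred):
    """F1-score for binary labels in {0, 1}; returns 0.0 when undefined."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def channel_ablation(predict_fn, x, y_true, threshold):
    """Percent change in F1-score after zeroing each channel in turn.

    predict_fn is a hypothetical callable returning one probability
    per sample; x has shape (n_samples, n_channels, n_times).
    """
    baseline = f1_score_binary(y_true, predict_fn(x) >= threshold)
    changes = []
    for ch in range(x.shape[1]):
        x_abl = x.copy()
        x_abl[:, ch, :] = 0.0  # silence this channel across time
        f1_abl = f1_score_binary(y_true, predict_fn(x_abl) >= threshold)
        changes.append(100.0 * (f1_abl - baseline) / baseline
                       if baseline else float("nan"))
    return baseline, np.array(changes)
```

A strongly negative entry in the returned array indicates a channel whose removal degrades classification the most.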
Subgroup analysis
Subgroup analyses of DELPHI-EEG were conducted by anesthesia type (TIVA and inhalational anesthesia) and by surgical category: abdominopelvic surgery (including urology, obstetrics, gynecology, and general surgery), thoracic surgery, and other surgical procedures.
Sensitivity analysis
Among the 34,550 analyzed cases, those with both the original label and a CAM-ICU assessment were used to compare the two labels by accuracy and F1-score. On the testing set, the performance of DELPHI-EEG for the composite outcome, defined as positive if either the original or the CAM-ICU label was positive, was compared with its performance for the original label using the paired t-test.
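The label comparison and composite outcome can be sketched as below. This is an illustration under assumptions: labels are binary arrays, the original label is taken as the reference when computing F1-score (the result is the same if the roles are swapped), and the paired t-test is applied to paired metric values such as per-fold or per-resample scores. The text does not specify the pairing unit, so that choice is hypothetical, as are the function names.

```python
import numpy as np
from scipy.stats import ttest_rel


def composite_label(original, cam_icu):
    """Positive if either the original or the CAM-ICU label is positive."""
    o = np.asarray(original).astype(bool)
    c = np.asarray(cam_icu).astype(bool)
    return (o | c).astype(int)


def label_agreement(original, cam_icu):
    """Accuracy and F1-score of the CAM-ICU label against the original label."""
    o = np.asarray(original).astype(bool)
    c = np.asarray(cam_icu).astype(bool)
    accuracy = float(np.mean(o == c))
    tp = np.sum(o & c)
    fp = np.sum(~o & c)
    fn = np.sum(o & ~c)
    denom = 2 * tp + fp + fn
    f1 = float(2 * tp / denom) if denom else 0.0
    return accuracy, f1


def compare_paired(metric_original, metric_composite):
    """Paired t-test on matched metric values (pairing unit is an assumption)."""
    result = ttest_rel(metric_original, metric_composite)
    return float(result.statistic), float(result.pvalue)
```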
Data availability
The data supporting this study’s findings are available from the corresponding author upon reasonable request.
Code availability
The code for preprocessing, training, and inference is available on GitHub (https://github.com/oskumd2/EEG_Delirium).
References
Mossie, A. et al. Evidence-based guideline on management of postoperative delirium in older people for low resource setting: systematic review article. Int. J. Gen. Med. 15, 4053–4065 (2022).
Yan, E. et al. Association between postoperative delirium and adverse outcomes in older surgical patients: a systematic review and meta-analysis. J. Clin. Anesth. 90, 111221 (2023).
Robinson, T. N. et al. Postoperative delirium in the elderly: risk factors and outcomes. Ann. Surg. 249, 173–178 (2009).
Moskowitz, E. E. et al. Post-operative delirium is associated with increased 5-year mortality. Am. J. Surg. 214, 1036–1038 (2017).
Shi, Z. et al. Postoperative delirium is associated with long-term decline in activities of daily living. Anesthesiology 131, 492–500 (2019).
Arias, F. et al. Neighborhood-level social disadvantage and risk of delirium following major surgery. J. Am. Geriatr. Soc. 68, 2863–2871 (2020).
Swarbrick, C. J. & Partridge, J. S. L. Evidence-based strategies to reduce the incidence of postoperative delirium: a narrative review. Anaesthesia 77, 92–101 (2022).
Zhang, H., Meng, L. Z., Lyon, R. & Wang, D. X. Monitoring cerebral ischemia during cerebrovascular surgery. J. Biomed. Res. 31, 279–282 (2017).
Al-Qudah, A. M. et al. Role of intraoperative electroencephalography in predicting postoperative delirium in patients undergoing cardiovascular surgeries. Clin. Neurophysiol. 164, 40–46 (2024).
Fritz, B. A. et al. Intraoperative electroencephalogram suppression predicts postoperative delirium. Anesth. Analg. 122, 234–242 (2016).
Likhvantsev, V. V. et al. Intraoperative electroencephalogram patterns as predictors of postoperative delirium in older patients: a systematic review and meta-analysis. Front. Aging Neurosci. 16, 1386669 (2024).
Hight, D. et al. Lower alpha frequency of intraoperative frontal EEG is associated with postoperative delirium: A secondary propensity-matched analysis. J. Clin. Anesth. 93, 111343 (2024).
Hata, M. et al. Predicting postoperative delirium after cardiovascular surgeries from preoperative portable electroencephalography oscillations. Front. Psychiatry 14, 1287607 (2023).
Jo, Y. Y. et al. Predicting intraoperative hypotension using deep learning with waveforms of arterial blood pressure, electroencephalogram, and electrocardiogram: retrospective study. PLoS One 17, e0272055 (2022).
Soria Bretones, C., Roncero Parra, C., Cascon, J., Borja, A. L. & Mateo Sotos, J. Automatic identification of schizophrenia employing EEG records analyzed with deep learning algorithms. Schizophr. Res. 261, 36–46 (2023).
Gramacki, A. & Gramacki, J. A deep learning framework for epileptic seizure detection based on neonatal EEG signals. Sci. Rep. 12, 13010 (2022).
Mulkey, M. A., Huang, H., Albanese, T., Kim, S. & Yang, B. Supervised deep learning with vision transformer predicts delirium using limited lead EEG. Sci. Rep. 13, 7890 (2023).
Rohr, V., Blankertz, B., Radtke, F. M., Spies, C. & Koch, S. Machine-learning model predicting postoperative delirium in older patients using intraoperative frontal electroencephalographic signatures. Front. Aging Neurosci. 14, 911088 (2022).
Han, C. et al. Machine learning with clinical and intraoperative biosignal data for predicting postoperative delirium after cardiac surgery. iScience 27, 109932 (2024).
Sun, J. et al. Adaptive spatiotemporal encoding network for cognitive assessment using resting state EEG. NPJ Digit. Med. 7, 375 (2024).
Tanabe, S. et al. Cohort study into the neural correlates of postoperative delirium: the role of connectivity and slow-wave activity. Br. J. Anaesth. 125, 55–66 (2020).
Guo, Z. et al. Quantitative electroencephalography predicts postoperative delirium in adult cardiac surgical patients from a prospective observational study. Sci. Rep. 14, 31101 (2024).
Reese, M. et al. Associations between anaesthetic dose-adjusted intraoperative EEG alpha power, processing speed, and postoperative delirium: analysis of data from three prospective studies. Br. J. Anaesth. 135, 109–120 (2025).
Mirra, A. et al. Usability of the SedLine® electroencephalographic monitor of depth of anaesthesia in pigs: a pilot study. J. Clin. Monit. Comput. 36, 1635–1646 (2022).
Fang, P. P. et al. Intraoperative electroencephalogram features related to frailty in older patients: an exploratory prospective observational study. J. Clin. Monit. Comput. 38, 613–621 (2024).
Giesa, N. et al. Applying a transformer architecture to intraoperative temporal dynamics improves the prediction of postoperative delirium. Commun. Med. 4, 251 (2024).
Gutierrez, R. et al. Intraoperative low alpha power in the electroencephalogram is associated with postoperative subsyndromal delirium. Front. Syst. Neurosci. 13, 56 (2019).
Ching, S., Purdon, P. L., Vijayan, S., Kopell, N. J. & Brown, E. N. A neurophysiological-metabolic model for burst suppression. Proc. Natl. Acad. Sci. USA 109, 3095–3100 (2012).
Kratzer, S. et al. Age-related EEG features of bursting activity during anesthetic-induced burst suppression. Front. Syst. Neurosci. 14, 599962 (2020).
Jin, Z., Hu, J. & Ma, D. Postoperative delirium: perioperative assessment, risk reduction, and management. Br. J. Anaesth. 125, 492–504 (2020).
Zhang, H. et al. Strategies for prevention of postoperative delirium: a systematic review and meta-analysis of randomized trials. Crit. Care 17, R47 (2013).
Aldecoa, C. et al. Update of the European Society of Anaesthesiology and Intensive Care Medicine evidence-based and consensus-based guideline on postoperative delirium in adult patients. Eur. J. Anaesthesiol. 41, 81–108 (2024).
Webster, R. & Holroyd, S. Prevalence of psychotic symptoms in delirium. Psychosomatics 41, 519–522 (2000).
Trzepacz, P. T. et al. Delusions and hallucinations are associated with greater severity of delirium. J. Acad. Consult Liaison Psychiatry 64, 236–247 (2023).
Lin, C. Y., Chiang, C. H., Tseng, M. M., Tam, K. W. & Loh, E. W. Effects of quetiapine on sleep: A systematic review and meta-analysis of clinical trials. Eur. Neuropsychopharmacol. 67, 22–36 (2023).
Berger, M. et al. A real-time neurophysiologic stress test for the aging brain: novel perioperative and ICU applications of EEG in older surgical patients. Neurotherapeutics 20, 975–1000 (2023).
von Dincklage, F. et al. Technical considerations when using the EEG export of the SEDLine Root device. J. Clin. Monit. Comput. 35, 1047–1054 (2021).
Fritz, B. A., Maybrier, H. R. & Avidan, M. S. Intraoperative electroencephalogram suppression at lower volatile anaesthetic concentrations predicts postoperative delirium occurring in the intensive care unit. Br. J. Anaesth. 121, 241–248 (2018).
Collins, G. S. et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385, e078378 (2024).
von Elm, E. et al. The strengthening the reporting of observational studies in epidemiology (STROBE) Statement: guidelines for reporting observational studies. Int. J. Surg. 12, 1495–1499 (2014).
Grover, S. & Avasthi, A. Clinical practice guidelines for management of delirium in elderly. Indian J. Psychiatry 60, S329–S340 (2018).
Fernández, A. et al. Learning from Imbalanced Data Sets (Springer International Publishing, 2018).
Saito, T. & Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 10, e0118432 (2015).
Whittle, R. et al. Extended sample size calculations for evaluation of prediction models using a threshold for classification. BMC Med. Res. Methodol. 25, 170 (2025).
Koch, S. et al. Perioperative electroencephalogram spectral dynamics related to postoperative delirium in older patients. Anesth. Analg. 133, 1598–1607 (2021).
Tang, X., Zhang, X., Dong, H. & Zhao, G. Electroencephalogram features of perioperative neurocognitive disorders in elderly patients: a narrative review of the clinical literature. Brain Sci. 12, 1073 (2022).
Lee, H. C. & Jung, C. W. Vital Recorder-a free research tool for automatic recording of high-resolution time-synchronised physiological data from multiple anaesthesia devices. Sci. Rep. 8, 1527 (2018).
Lucena, B. Spline-Based Probability Calibration. arXiv (2018).
Welch, P. The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms. IEEE Trans. Audio Electroacoustics 15, 70–73 (1967).
Ellis, C. A., Miller, R. L. & Calhoun, V. D. A systematic approach for explaining time and frequency features extracted by convolutional neural networks from raw electroencephalography data. Front. Neuroinform. 16, 872035 (2022).
Acknowledgements
This work was supported by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: RS-2024-00439677), and the Institute of Information & Communications Technology Planning & Evaluation (IITP)-Global Data-X Leader HRD program grant funded by the Korea government (MSIT) (RS-2024-00441407). This study was also supported by a research fund provided by Seoul National University Hospital (Grant No. 04-2023-0620).
Author information
Contributions
H.L. and J.H.A. contributed equally to this work as co-first authors. H.L. and H.C.L. contributed substantially to the study conception and design, data acquisition, and data analysis. J.H.A. and H.C.L. collected and curated data. J.H.A. conducted data analysis and made tables and figures. H.L. and J.H.A. participated in drafting the article, and H.C.L. revised it critically for important intellectual content. All authors gave final approval of the version to be published.
Ethics declarations
Competing interests
H.L. is an associate editor of npj Digital Medicine. The other authors declare that they have no known competing interests that could have appeared to influence the work reported in this paper.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ahn, J.H., Lee, H., Gambus, P. et al. Development of a deep learning-based prediction model for postoperative delirium using intraoperative electroencephalogram in adults. npj Digit. Med. 8, 661 (2025). https://doi.org/10.1038/s41746-025-02033-y
DOI: https://doi.org/10.1038/s41746-025-02033-y






