Fusion of clinical magnet resonance images and electronic health records promotes multimodal predictions of postoperative delirium

Giesa, Niklas; Dell’Orco, Andrea; Scheel, Michael; Finke, Carsten; Balzer, Felix; Spies, Claudia Doris; Sekutowicz, Maria

doi:10.1038/s41598-025-31693-9

Article
Open access
Published: 26 December 2025

Scientific Reports volume 15, Article number: 44654 (2025)

Abstract

Brain morphometry derived from clinical imaging has an underexplored potential for the multimodal prediction of postoperative delirium (POD), an acute encephalopathy that can lead to long-term adverse outcomes or death. This study conducted a comprehensive analysis of patient trajectories, integrating magnetic resonance imaging (MRI) data and electronic health records (EHRs) across two general surgical cohorts. We applied univariate test methods and linear mixed-effects models correcting for confounding. Non-linear multi-layer perceptrons (MLPs), boosted decision trees, and logistic regressions were trained on EHR data, brain morphometry measures, and their multimodal fusion to predict POD. Age-adjusted correlations identified cortical thickness of temporal gyri, as well as thalamic and brainstem volumes to be POD-relevant neuroanatomical features. MLP models demonstrated robust predictive capability, achieving notably high performances up to 86% AUROC (area under the receiver operating characteristic). Multimodal fusion yielded pronounced benefits in less critically ill patients. MLP model weights showed high predictive potential for cerebral atrophy in higher-order cortical regions, including the temporal pole, superior frontal gyrus, and the insula. These findings reveal the previously unrecognized potential of clinically derived brain morphometry in enhancing early multimodal predictions of POD. A better understanding of brain vulnerability in POD may translate into improved clinical decision making based on multimodal health care data.

Introduction

Delirium is a distressing neuropsychiatric syndrome characterized by acute disturbances in consciousness, cognition, and attention¹. Postoperative delirium (POD), occurring after major surgical procedures, is associated with adverse outcomes, such as prolonged hospitalization or death. Prevalence rates span from 5 to 52%². The etiology of POD is multifactorial, with both predisposing and precipitating factors contributing to its acute onset^2,3. Predisposing factors, such as preexisting cognitive impairment or advanced age, confer baseline vulnerability, while precipitating factors relate to perioperative conditions, including the surgical procedure².

POD manifests through heterogeneous levels of vigilance as well as neuropsychological and psychotic symptoms, which fluctuate in presence and severity and demand close monitoring and early assessment^2,4. Previous studies indicate that structural brain changes may increase vulnerability to POD⁵. Patients with cerebral atrophy are predisposed to long-term cognitive decline⁶. Vulnerability to delirium may be facilitated by preexisting neuroanatomical changes resulting in neuronal dysfunction and network disintegration⁷. Such pre-morbidity in POD patients has been identified as decreased white matter integrity and increased gray matter atrophy^8,9,10. Previous studies have been restricted to specific patient cohorts at risk, such as elderly patients undergoing major surgical procedures⁸. Thus, the associated structural brain changes may be age-specific or restricted to patients with preexisting cognitive impairment⁹.

While previous studies have developed non-linear machine learning (ML) prediction models^4,11,12 that outperform standard statistical methods, these models rarely translate into clinical practice. Advanced ML approaches that integrate routinely collected clinical data from multiple modalities may overcome this limitation, and Mohsen et al.¹³ illustrate various fusion strategies for doing so. Such data fusions are applied either early in the feature space or later when outputting prediction probabilities. To the best of our knowledge, we are the first to utilize neuroanatomical features extracted from preoperative clinical MRIs in order to exploit premorbid structural brain changes for POD prediction. We systematically explore the predictive value of combining these MRI features with EHR data in two distinct general surgical cohorts. Since POD may often be undiagnosed¹⁴, we defined an additional endpoint for intensive care patients based on agitation and pharmacological treatment for delirium. In this way, we complement the standard delirium assessment tools that are routinely used postoperatively for POD labeling. Interpretation of multimodal ML techniques is augmented by linear mixed-effect models (MEMs) that correct for covariates, allowing further insights into the pathomechanisms of POD.

Methods

Study population, endpoint definitions, and data extractions

We included all patients (aged ≥ 18 years) who underwent surgery between 2017 and 2022 if the estimated surgery duration was ≥ 1 h, initially resulting in EHR data from 63,222 patients (see Fig. 1a). All data for this single-center study were provided by three different sites at Charité, a large German university hospital. This study is the first in our medical institution to leverage routinely acquired MRIs. As no standard data pipeline was available, we extracted a random sample of preoperative MRIs from the picture archiving and communication system (PACS) without access to additional information (e.g., no information on POD assessments or type of scans) (see Fig. 1b). This procedure resulted in 3,344 heterogeneous de-identified MRI scans.

Preoperative MRI was obtained for broad clinical indications and could be unrelated to subsequent surgical procedures (e.g., screening for intracranial metastases, suspected stroke, oncologic staging, surveillance/follow-up, headache/seizure workup, or neurosurgical planning). Subsequently, MRI headers were filtered for cranial scans, acquisition times, and the presence of T1-weighted MPRAGE sequences, yielding a cohort of 991 MRI scans (see Fig. 1a).

The clinical information system (CIS) stored pre- and intraoperative EHRs that were de-identified and archived in a Data Warehouse (DWH)¹⁵, allowing data harmonization (see Fig. 1b). We used these EHRs to define predictors and endpoints and to link clinical predictors to MRIs from the PACS. POD was defined as a binary endpoint variable (1 = delirious, 0 = non-delirious) according to two definitions.

All surgical patients are routinely screened for delirium with the Nursing Delirium Screening Scale (Nu-DESC) preoperatively before anesthesia and postoperatively during post-anesthesia care in the recovery room. Patients transferred directly to the ICU are assessed with the Confusion Assessment Method for the Intensive Care Unit (CAM-ICU) at admission and at least three times per day, according to institutional standards. For the score-based POD definition (scoPOD), CAM-ICU and Nu-DESC¹⁶ assessments were used to classify patients as delirious (at least one Nu-DESC > 0 or positive CAM-ICU) or non-delirious (all Nu-DESC = 0 or all CAM-ICU negative). The scoPOD definition covered 645 scans from 557 patients.

The second, medication-based endpoint definition (medPOD) was based on agitation and pharmacological treatment for delirium. ICU patients are routinely assessed for their level of agitation using the Richmond Agitation Sedation Scale (RASS)¹⁷, a standard scale for critically ill patients. Those who postoperatively scored a RASS > 1 and subsequently received any of the medications haloperidol, clonidine, dexmedetomidine, pipamperone, or risperidone were labeled as delirious. RASS assessment and medication administration were required to be temporally aligned to the same day (within a 24 h interval). Controls maintained a RASS of 0 and did not receive any of the aforementioned medications until discharge. All other cases were excluded from the medPOD definition, resulting in 224 scans for 201 patients.
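The two labeling rules can be written down compactly. The following is a minimal sketch of that logic, not the study's actual pipeline: the function names, the input layout (lists of scores and `(hour, value)` tuples), and the reading of "within a 24 h interval" as "medication given 0 to 24 h after the RASS assessment" are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Medications that count as delirium treatment in the medPOD definition.
DELIRIUM_MEDS = {"haloperidol", "clonidine", "dexmedetomidine",
                 "pipamperone", "risperidone"}


def label_scopod(nudesc_scores: Sequence[int],
                 cam_icu_positive: Sequence[bool]) -> Optional[int]:
    """Score-based label: 1 = delirious, 0 = non-delirious, None = not assessed."""
    if not nudesc_scores and not cam_icu_positive:
        return None
    if any(score > 0 for score in nudesc_scores) or any(cam_icu_positive):
        return 1
    return 0


def label_medpod(rass: Sequence[Tuple[float, int]],
                 meds: Sequence[Tuple[float, str]]) -> Optional[int]:
    """Medication-based label from (hour, RASS) and (hour, drug) records.

    Returns 1 for cases, 0 for controls, and None for excluded patients.
    """
    med_hours = [hour for hour, drug in meds if drug.lower() in DELIRIUM_MEDS]
    # Case: agitation (RASS > 1) followed by delirium medication within 24 h.
    for hour, score in rass:
        if score > 1 and any(0 <= m - hour <= 24 for m in med_hours):
            return 1
    # Control: RASS stayed at 0 and no delirium medication was given.
    if rass and all(score == 0 for _, score in rass) and not med_hours:
        return 0
    # Everything else is excluded from the medPOD cohort.
    return None
```

Returning `None` for excluded patients keeps the exclusion step explicit, which mirrors how the medPOD cohort shrinks to 224 scans.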

Data preprocessing

As routinely collected EHRs potentially follow skewed distributions¹⁸, we aggregated parameters with summary statistics such as the mean, median, and 10th and 90th percentiles, separately for the pre- and intraoperative phase. Our initial feature space covered parameters that were available for at least 10% of patients. Laboratory values (lymphocytes, CRP, etc.), medications (propofol, norepinephrine, etc.), vital signs (heart rate, SpO2, etc.), and device settings (FiO2, PEEP) resulted in 133 features (see Extended Table A1/A2 in Supplement B). To avoid estimation bias, the feature space excluded any data contributing to our two endpoint definitions medPOD or scoPOD, including medications commonly used to control delirious symptoms. All data were normalized via z-transformation with statistics from the training set. ICD-based delirium labels would lack temporal information and might reflect documentation gaps rather than negative cases. Consequently, the presence of diagnoses in the form of an ICD code¹⁹ was used to characterize our cohorts, but omitted for training ML models or endpoint definitions due to uncertain documentation times²⁰.
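The aggregation and normalization steps can be illustrated with a short standard-library sketch. The helper names, the linear-interpolation percentile, and the use of the population standard deviation are assumptions for illustration; the text does not specify these details.

```python
import statistics
from typing import Dict, List, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile for q in [0, 100] using linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Aggregate one parameter within one phase (pre- or intraoperative)."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "p10": percentile(values, 10),
        "p90": percentile(values, 90),
    }


def fit_zscore(train_values: Sequence[float]) -> Tuple[float, float]:
    """Estimate mean and standard deviation on the training set only."""
    mu = statistics.fmean(train_values)
    sd = statistics.pstdev(train_values)
    return mu, (sd if sd > 0 else 1.0)  # guard against constant features


def apply_zscore(values: Sequence[float], mu: float, sd: float) -> List[float]:
    """Normalize any split (train, validation, test) with training statistics."""
    return [(v - mu) / sd for v in values]
```

Fitting `mu` and `sd` on the training split and reusing them for held-out data avoids leaking information from validation samples into the normalization.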

Statistical analysis

We analyzed features with respect to the binary endpoints using Mann-Whitney U (MWU) test statistics^21,22. We report the AUC-0.5 as effect size, with 0 as chance level, values near 1 as a strong positive effect, and values near −1 as a strong negative effect. The Spearman correlation coefficient (ρ) was used to describe the association of clinical parameters with age; we defined the effect strength similarly, with levels of 0, near 1, or near −1²³.
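Both effect sizes can be computed from first principles. The sketch below derives the AUC from pairwise comparisons (equivalent to U divided by the product of group sizes) and Spearman's ρ as the Pearson correlation of tie-averaged ranks; the function names are illustrative. Note that the raw difference computed here is bounded by −0.5 and 0.5, and the text does not say whether a rescaling was applied.

```python
import math
from typing import List, Sequence


def auc_minus_half(pos: Sequence[float], neg: Sequence[float]) -> float:
    """MWU-based effect size: P(pos > neg) + 0.5 * P(pos == neg) - 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg)) - 0.5


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

A negative `auc_minus_half` for a thickness feature means that POD cases tend to have lower values than controls, which is the direction reported for temporal cortical thickness.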

To account for age as a confounder in MRI-based brain morphometry, we configured multivariate linear mixed-effect models (MEMs)²⁴. MEMs had varying intercepts for single feature effects as \(C(\mathrm{POD}) \sim \mathrm{age} + \mathrm{feature} + (1 \mid \mathrm{patient})\). Here, the POD endpoint functions as the dependent variable, while age and the feature of interest enter as fixed effects. To correct for multiple surgeries and multiple MRI scans per patient, patient identifiers were integrated as random effects. We report age-adjusted p-values and a corresponding coefficient β(f) indicating effect directions.
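Written out for observation j (a scan or surgery) of patient i, the model formula corresponds to a random-intercept model. The expression below is a sketch under the usual assumption of Gaussian random intercepts and residuals, which the text does not state explicitly; the symbols \(\beta_0\), \(\beta_{\mathrm{age}}\), \(u_i\), and \(\varepsilon_{ij}\) are introduced here for illustration, while \(\beta(f)\) is the reported feature coefficient.

\[
\mathrm{POD}_{ij} = \beta_0 + \beta_{\mathrm{age}}\,\mathrm{age}_{ij} + \beta(f)\,\mathrm{feature}_{ij} + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),\quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\]

The patient-specific intercept \(u_i\) absorbs the dependence between repeated observations of the same patient, so \(\beta(f)\) reflects the feature effect after adjusting for age.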

Statistical significance was assessed using false discovery rate (FDR) corrections (alpha = 0.05)²⁵. For reporting the effects of binary variables with POD, we used the odds ratio (OR) on a logarithmic scale as ln(OR), with large deviations from 0 (i.e., OR = 1) indicating strong associations.
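A minimal sketch of both quantities follows. It assumes the Benjamini-Hochberg procedure for the FDR correction, which the text does not name explicitly, and a plain 2 × 2 count table for the odds ratio; the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def benjamini_hochberg(p_values: Sequence[float],
                       alpha: float = 0.05) -> Tuple[List[float], List[bool]]:
    """Return BH-adjusted p-values and rejection decisions at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted, [q <= alpha for q in adjusted]


def log_odds_ratio(pos_exposed: int, neg_exposed: int,
                   pos_unexposed: int, neg_unexposed: int) -> float:
    """ln(OR) from a 2 x 2 table; ln(OR) = 0 corresponds to OR = 1."""
    odds_exposed = pos_exposed / neg_exposed
    odds_unexposed = pos_unexposed / neg_unexposed
    return math.log(odds_exposed / odds_unexposed)
```

For example, `benjamini_hochberg([0.01, 0.04, 0.03, 0.20])` adjusts the smallest p-value to 0.04 and rejects only that hypothesis at alpha = 0.05.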

MRI analysis

All DICOMs were converted to NIfTIs using dcm2niix and segmented by the FreeSurfer v7.4.1 recon-all pipeline²⁶. Morphometry measures were computed using Desikan-Killiany atlas-based parcellation²⁷. We visually assessed all results to identify missegmentation, administration of contrast agents, or anatomical aberrations, such as general atrophy or tissue lesions. In the case of brain abnormality, only healthy contralateral hemispheres were selected for analyses. Otherwise, one hemisphere was randomly selected, resulting in 358 right and 287 left hemispheres for scoPOD as well as 91 right and 133 left hemispheres for medPOD. To account for the effect of intracranial volume, volume estimates were divided by the estimated total intracranial volume. The final MRI-related feature set was composed of 184 volumes, 70 thickness features, and 72 area features.

Machine learning and fusion strategies

We trained three ML techniques comprising logistic regression (LR), gradient boosted trees (BT), and multi-layer perceptron (MLP) architectures. While LR assumes a linear relationship, BT and MLP can capture non-linear relationships^28,29, with MLPs functioning as universal approximators³⁰ that stack perceptrons (nodes) in interconnected layers.

To handle different data modalities, we deployed two fusion strategies¹³. In “early fusion”, we extended the input feature space with selected measures from both modalities. For “late fusion”, we trained separate models for each modality (MRI or EHR) and combined their prediction outputs (see Extended Figure B1 in Supplement A). For BT and LR, model outputs (probabilities between 0 and 1) were simply mean-averaged. In the late fusion MLP, a linear layer integrates predictions from both models into a single output while learning the weights for both networks via backpropagation, also known as “joint fusion”. We additionally trained completely separate models for MRI and EHR features.
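The difference between the fusion variants is easiest to see on toy inputs. The sketch below shows only the combination step under simplifying assumptions: feature vectors are plain lists, the modality models are treated as already trained, and the joint fusion layer is reduced to a single forward pass with hypothetical weights (training via backpropagation is not shown).

```python
import math
from typing import List, Sequence, Tuple


def early_fusion(ehr_features: Sequence[float],
                 mri_features: Sequence[float]) -> List[float]:
    """Early fusion: concatenate both modalities into one input vector."""
    return list(ehr_features) + list(mri_features)


def late_fusion_mean(p_ehr: Sequence[float],
                     p_mri: Sequence[float]) -> List[float]:
    """Late fusion for BT and LR: mean-average the predicted probabilities."""
    return [(a + b) / 2 for a, b in zip(p_ehr, p_mri)]


def joint_fusion_output(z_ehr: float, z_mri: float,
                        weights: Tuple[float, float], bias: float) -> float:
    """Forward pass of a linear fusion layer followed by a sigmoid.

    z_ehr and z_mri are the outputs of the two modality networks; in joint
    fusion, weights and bias are learned together with both networks.
    """
    logit = weights[0] * z_ehr + weights[1] * z_mri + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

Mean-averaging treats both modalities as equally reliable, whereas the learned linear layer can shift weight toward the more informative modality for a given endpoint.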

Model configuration, training, and validation

To find optimal model configurations (hyperparameters), a 3 × 3 nested cross-validation (CV)³¹ approach was implemented. Different sets of parameters were exhaustively validated via grid search³² in the inner CV loop, and the selected configuration was then applied in the outer loop. Bootstrapping with 1,000 resamples enabled estimation of 95% confidence intervals (CIs)³³ for validation results.
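The validation scheme can be sketched with the standard library alone. The helpers below generate 3 × 3 nested splits, enumerate a hyperparameter grid, and compute a percentile bootstrap CI; all names are illustrative, and the sketch splits by sample index, since the text does not state whether splits were grouped by patient.

```python
import itertools
import random
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def nested_cv_splits(n: int, outer_k: int = 3, inner_k: int = 3,
                     seed: int = 0) -> Iterator[Tuple[List[int], List[int], list]]:
    """Yield (outer_train, outer_test, inner_splits) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    outer_folds = [idx[i::outer_k] for i in range(outer_k)]
    for o, test in enumerate(outer_folds):
        train = [i for f, fold in enumerate(outer_folds) if f != o for i in fold]
        inner_folds = [train[j::inner_k] for j in range(inner_k)]
        inner = []
        for v, val in enumerate(inner_folds):
            fit = [i for f, fold in enumerate(inner_folds) if f != v for i in fold]
            inner.append((fit, val))  # hyperparameters are compared on val
        yield train, test, inner


def parameter_grid(space: Dict[str, Sequence]) -> List[dict]:
    """Enumerate every combination of the given hyperparameter values."""
    keys = sorted(space)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(space[k] for k in keys))]


def bootstrap_ci(values: Sequence, statistic: Callable[[list], float],
                 n_boot: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI of a statistic over resampled values."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper
```

In practice `values` would hold `(label, prediction)` pairs and `statistic` would compute AUROC or AUPRC on each resample; any callable that maps a list to a number works here.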

The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) were used to evaluate performance²². As a cost function, we configured a weighted binary cross-entropy (BCE) loss to address class imbalance³⁴. The final parameter space included regularization techniques, like BT pruning or an L₁-norm penalty for MLP and LR, in addition to general configurations (see Extended Table B2 in Supplement A). We trained models with subsets of features for different adjusted p-value thresholds (see Extended Table A3 in Supplement B).
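A weighted BCE loss scales the contribution of the minority (positive) class. The sketch below shows one common formulation; the inverse-frequency weighting rule is an assumption, since the text does not specify how the class weights were chosen.

```python
import math
from typing import Sequence


def weighted_bce(y_true: Sequence[int], y_prob: Sequence[float],
                 pos_weight: float, eps: float = 1e-12) -> float:
    """Mean binary cross-entropy with the positive class scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def inverse_frequency_pos_weight(y_true: Sequence[int]) -> float:
    """Assumed weighting rule: number of negatives per positive sample."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return n_neg / n_pos
```

Under this assumed rule, the reported scoPOD prevalence of 18.44% would correspond to a positive-class weight of about 4.4, so a missed POD case costs roughly as much as four misclassified controls.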

Results

Cohort characteristics

Table 1 Descriptive cohort characteristics for two POD endpoints. Descriptive statistics are displayed as mean ± sd for numerical variables. For binary variables, the fraction of positive samples among all samples (n) is given, followed by the odds (pos/neg samples). Adjusted p-values are derived from linear mixed-effect models (MEMs) incorporating age and the variable of interest as fixed effects, patient groups as random effects, and POD as the dependent variable. We highlight significant results with asterisks according to an FDR-corrected alpha level. RASS: Richmond Agitation Sedation Scale, SOFA: Sequential Organ Failure Assessment, SIRS: Systemic Inflammatory Response Syndrome, urgency class N: ranges from N = 1 (immediate surgery required) to N = 5 (elective, planned procedure), anesthesia type of surgery: minor surgical procedures requiring sedation or anesthesia standby.

Cohort characteristics for the score-based endpoint scoPOD and the medication-based endpoint medPOD revealed POD prevalence rates of 18.44% and 21.43%, respectively (Table 1). Patients who met inclusion criteria for scoPOD underwent a mean of 1.23 delirium assessments with Nu-DESC or CAM-ICU scales. P-values refer to POD cases vs. controls per endpoint. POD patients were older (65.01 ± 14.04 years as mean ± sd for scoPOD, 62.08 ± 13.49 years for medPOD) than controls (58.02 ± 16.43 years for scoPOD, 56.85 ± 16.02 years for medPOD; p < 0.05 for both endpoints). Highly significant differences between POD cases and controls were observed in recovery room stay durations for scoPOD (5.18 ± 2.42 h for POD, 2.81 ± 3.28 h for controls, p < 0.001). Physical status and degree of agitation differed significantly between delirious medPOD patients and controls (ASA 2.12 ± 1.21 POD, 1.82 ± 1.69 controls, p < 0.001; RASS 1.08 ± 1.16 POD, −0.81 ± 1.12 controls, p < 0.001).

When comparing cohorts, medPOD patients had 0.31 h longer stays in the recovery room, a decreased physical status (Δ of mean ASA = 0.13), were more prone to sequential organ failure (Δ of mean SOFA = 0.70), and were less agitated (Δ of mean RASS = 0.24). In both cohorts, the most frequent surgical procedure was neurosurgery, reaching a significant difference for medPOD labels (ln(OR) = 1.71, p < 0.001). Visual MRI screening properties, such as general atrophy, did not show significant differences for POD (see Table 1). Confirming POD labeling, we observed a highly significant association with ICD-encoded delirium (ln(OR) = 2.10, p < 0.001 for scoPOD; ln(OR) = 2.92, p < 0.001 for medPOD). Additional characteristics are included in Extended Table B3 and Extended Results B1 in Supplement A.

MRI and EHR single feature importance

Table 2 Single feature importance for MRI and EHR features per endpoint. Univariate results from Mann-Whitney U (MWU) test statistic define discriminability of [POD] endpoints with unadjusted p-value and AUC-0.5 as effect size. Spearman ranks provide correlation coefficient (ρ) calculated between feature values and [Age] including p-value under the null-hypothesis of zero-coefficients. Linear mixed-effect models (MEM) were fitted with [POD] as dependent variable, feature values as fixed effects, patient-MRI hierarchy as random effects. MEMs provide age-adjusted p-values and a corresponding coefficient β(f). Data availability shows the fraction of available feature values from all, adjacent to the odds of (available/missing) values. EHRs must be aggregated with either the sum, median (md), mean (me) for preoperative (pre) or intraoperative (intra) time phases. FDR corrected significant results are highlighted in bold and italics.

MRI-derived morphometry features, such as middle temporal and superior temporal thickness, were significantly associated with scoPOD (MWU p = 7.26E-06, AUC-0.5 = −0.142; p = 2.60E-05, AUC-0.5 = −0.133) as well as with age (Spearman p = 2.14E-19, ρ = −0.359; p = 5.57E-29, ρ = −0.438) (see Table 2; Fig. 2a). Decreased cortical thickness was associated with increased probabilities of POD and occurred more often in elderly than in younger patients.

Multivariate MEMs showed that cortical thickness of middle as well as the superior temporal cortex remained significantly associated with scoPOD when adjusting for age (adj. p = 3.25E-05, β(f)= −0.517; adj. p = 7.12E-04, β(f)= −0.332). Decreased cortical thickness in POD was preserved, when dividing patients into equally-sized age groups (negative MEM coefficient β(f), see Fig. 2a). Measures of white matter hypointensities expressed significant univariate effects on scoPOD (MWU p = 8.22E-05, AUC-0.5= −0.125) and age (Spearman p = 5.47E-28, ρ = 0.430).

MWU analysis of EHR features highlighted preoperative measures of anemia (hemoglobin: p = 1.19E-05; erythrocytes: p = 1.00E-05) and infection parameters (CRP: p = 3.19E-05). These remained significant after age correction (MEM erythrocytes: adj. p = 1.21E-06; hemoglobin: adj. p = 7.00E-09; hematocrit: adj. p = 3.22E-06; CRP: adj. p = 1.24E-06).

For medPOD, subcortical MRI features, such as thalamus and brainstem volume, were significantly associated with POD (thalamus volume, p = 2.79E-04, AUC-0.5 = −0.177; MWU; brainstem volume p = 6.80E-04, AUC-0.5 = −0.165; MWU). After age-correction, the thalamus volume remained significantly associated with POD (adj. p = 2.39E-04) (see Table 2; Fig. 2b). Additionally, multivariate MEM analysis indicated significant associations of EHR features and blood parameters like low levels of erythrocytes (adjusted p = 2.49E-09, β(f) = −0.185), hemoglobin (adjusted p = 7.00E-09, β(f) = −0.060), hematocrit (adjusted p = 4.67E-08, β(f) = −2.057), and increased CRP (adjusted p = 1.08E-07, β(f) = 0.003) (see Table 2).

Machine learning results

We evaluated LR, BT, and MLP models with AUROC and AUPRC metrics for both endpoints. Highest performance for the score-based scoPOD cohort was achieved by a late fusion MLP (AUROC 0.735 [0.726, 0.744] as mean, [95% CI]; AUPRC 0.456 [0.411, 0.472]), outperforming LR (AUROC 0.705 [0.695, 0.715]; AUPRC 0.404 [0.389, 0.420]) and BT (AUROC 0.722 [0.712, 0.32]; AUPRC 0.450 [0.436, 0.468]) within the same fusion type (see Fig. 3a). MLPs were superior to LR and BT for all fusion types (see Extended Table B4 in Supplement A). We observed overlapping CIs of AUROC between late- and early fusion of MLPs for scoPOD ([0.726, 0.744] vs. [0.722, 0.740]), but distinct differences to MLPs trained with one modality only (EHR only [0.703, 0.721], MRI only [0.666, 0.685]).

The best model for predicting the medication-based endpoint medPOD used EHR features only (AUROC 0.861 [0.851, 0.871]; AUPRC 0.665 [0.644, 0.687]) (see Fig. 3a). However, confidence in this ranking is limited, because the fusion MLPs yielded similar metric ranges for both early fusion (AUROC [0.847, 0.860]; AUPRC [0.636, 0.679]) and late fusion (AUROC [0.847, 0.867]; AUPRC [0.619, 0.668]). MLPs showed overall elevated validation metrics in contrast to other ML methods.

In Fig. 3b, the corresponding ROC and precision-recall curves describe model behavior under varying prediction thresholds. The curves confirm that late fusion was favorable for scoPOD, with a sensitivity of 0.81 and a specificity of 0.63 at the threshold where their sum is maximized. MLPs trained solely on EHRs exceeded these metrics with 0.81 sensitivity and 0.82 specificity when predicting medPOD.
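Choosing the threshold that maximizes the sum of sensitivity and specificity is equivalent to maximizing Youden's J statistic. The sketch below illustrates that selection on raw labels and probabilities; the function name is illustrative, and the code assumes that both classes are present in `y_true`.

```python
from typing import Sequence, Tuple


def best_threshold(y_true: Sequence[int],
                   y_prob: Sequence[float]) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) maximizing their sum.

    A sample is predicted positive when its probability is >= threshold.
    Ties are resolved in favor of the lowest threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = None
    for threshold in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        score = sensitivity + specificity
        if best is None or score > best[0]:
            best = (score, threshold, sensitivity, specificity)
    return best[1], best[2], best[3]
```

This criterion weights false negatives and false positives equally; a clinical deployment might instead favor sensitivity, which would shift the operating point to a lower threshold.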

Model interpretation

Model weights (MW) from our best MLPs per endpoint revealed ante-hoc feature importance. The best late fusion MLP for scoPOD focused on intraoperative tidal volume (abs MW = 0.288), preoperative albumin blood levels (abs MW = 0.283), and erythrocyte counts (abs MW = 0.2).

Highest model weights were found in the late fusion MLP using MRI features for scoPOD and were assigned to temporal pole thickness (MW = 0.565), superior frontal gyrus (MW = 0.523), and insula thickness (MW = 0.504). These features also showed univariate MWU feature importance with effect strengths of 0.084 (p = 8.32E-03), −0.111 (p = 4.78E-04), and −0.096 (p = 2.37E-03) AUC-0.5 (see Extended Table B5 in Supplement A, Fig. 3c).

The best MLP to predict medPOD relied on unimodal EHR features such as mean blood pressure (abs MW = 0.186) or administered fluid volume (abs MW = 0.183). The MLP focused on intraoperative norepinephrine infusion (abs MW = 0.172), tidal volume (abs MW = 0.166), and heart rate (abs MW = 0.142). Additionally, preoperative CRP levels (abs MW = 0.158), physical status (ASA abs MW = 0.120), hematocrit (abs MW = 0.135), and erythrocyte count (abs MW = 0.133) had predictive value (see Extended Table A4 in Supplement B). The highest MWs regarding MRIs for medPOD, provided by the unimodal MLP, were found for the thalamus volume (abs MW = 0.27) (see Extended Table B5 in Supplement A). Additional analyses of the relationship between key covariates and raw model output probabilities for each surgery did not suggest model biases towards gender or age (see Extended Figure B6, Extended Results B3 in Supplement A).

Discussion

We are the first to demonstrate that neuroanatomical atrophy measures contribute to successful multimodal POD predictions in a general surgical patient population. In contrast to previous studies, our study leveraged data from routinely collected clinical MRIs, which are heterogeneous and noisy but proved to hold potential for clinical decision making^10,35,36. The best unimodal prediction using MRI morphometry measures achieved 72% AUROC, with the highest predictive value for subcortical volumetric measures such as thalamic volume. Through the iterative application of diverse data fusion strategies and multiple ML models, we achieved high performances up to 86% AUROC for multimodal models, where frontal and temporal cortical atrophy were highly predictive.

Our findings provide clinical utility by enabling preoperative risk stratification that leverages routinely acquired MRI alongside EHR data. In less critically ill patients, cortical atrophy (frontal/temporal) flags intrinsic brain vulnerability to delirium, supporting early initiation of clinical delirium-prevention bundles (e.g., reorientation, sleep protection, mobilization). In higher-acuity settings, reflected by medPOD, the multimodal model emphasizes systemic factors (e.g., anemia, hydration, infection proxies), guiding optimization before surgery (e.g., hemoglobin targets, volume status, infection control) and postoperative monitoring. Multimodal approaches, such as future MRI-informed perioperative decision-making tools, have the potential to improve prevention, triage, and targeted management of POD while complementing clinician judgment and existing care pathways.

We assessed the predictive properties of neuroanatomical and clinical markers in two cohorts based on different POD endpoints. In addition to validated clinical POD-assessment methods with Nu-DESC and CAM-ICU scores (scoPOD), we defined POD in a subgroup of ICU patients according to agitation and pharmacotherapy (medPOD). Importantly, both endpoints were highly correlated with the documented ICD diagnosis of delirium, but reflected distinct subgroups with key differences in clinical and surgical characteristics. While the score-based cohort scoPOD covered a wider range of surgical interventions, the smaller medPOD cohort had higher degrees of critical illness and systemic inflammation. Findings of scoPOD-associated cortical thickness parameters in the temporal and frontal lobe, involved in memory, attention, and higher-order executive functions, align with literature linking temporal cortex atrophy to delirium³⁷. These cortical features outperformed EHR predictors, suggesting that brain-specific vulnerability, reflected by these morphometric measures, is primarily driving POD in less critically ill populations. When illness is more critical and patients in intensive care receive antipsychotic pharmacotherapy, MRI measures emphasize the importance of fronto-striato-thalamic circuits in disorders of consciousness and the emergence of psychotic symptoms³⁸. However, such subcortical brain vulnerability measures are outweighed by EHR features when POD is predicted in more complex critical illness. Here, preoperative anemia, hydration, and infection proxies, significantly associated with POD in univariate statistics, also showed high predictive value.

Because we kept the clinical scope wide, the presented models are more generalizable than previous prediction studies based on clinical imaging of the hippocampus in cardiovascular surgeries³⁹. While previous work trained ML on EHRs only^40,41, we could show that the combination of data modalities improved prediction performances, especially for a less critically ill population.

For explainable artificial intelligence (XAI) purposes, we preferred directly reading model weights over methods like Shapley values or LIME, because the latter are susceptible to unfavorable effects such as suppressor variables⁴². Since model weights are technical properties, we provide comprehensive univariate analyses to enhance clinical insights. To correct for strong confounders, such as age in neuroanatomical measures, we applied linear mixed-effects models focusing on such covariates⁵. However, due to the highly inter-correlated nature of our data, latent noise may not be fully excluded.

Although there are numerous indications for preoperative cranial MRI-scans, most patients who received cranial MRI comprised neurosurgical cases in our sample. In line with previous findings^43,44, neurosurgical procedures were not predictive for POD in the larger scoPOD sample, nor did we find notably higher POD prevalence rates in our cohorts. However, in the smaller cohort of intensive care patients, the presence of a neurosurgical procedure was significantly associated with POD. While this needs to be replicated, we speculate that through disease severity, patients who undergo major neurosurgical procedures and who require intensive care might be especially vulnerable to POD. Critically, we did not find model biases regarding age, urgency class, sex, or performed neurosurgery suggesting a broad applicability of our results.

Since brain morphometry measures were extracted from heterogeneous clinical MRI assessments, we performed quality control to identify anatomical aberrations and segmentation inaccuracies. Automated morphometry tools, such as FreeSurfer, are optimized for non-contrast-enhanced (non-CE) images. However, excellent reliability and agreement are reported for segmentation of contrast-enhanced T1-weighted (T1wCE) images²⁶.

The presented work has several limitations. Only healthy hemispheres without visually detectable structural lesions interfering with segmentation accuracy were included in our analyses, potentially excluding valuable data. Future work may incorporate more automated and data-driven segmentation approaches to optimize the utilization of clinical scans with potential lesions, enhancing prediction robustness. The sample size, particularly for medPOD, restricts generalizability, and findings should be validated in larger external cohorts. Hypoactive delirium, which is often undiagnosed, was not separately addressed in our prediction targets, but we aim to formulate a multi-class prediction problem in the future. As with most real-world clinical data, the EHRs used in this study were not primarily collected for secondary research purposes. Consequently, institution-specific documentation practices and local clinical guidelines may have introduced biases and data quality limitations. We handled class imbalance with robust MWU test statistics and a weighted BCE loss. Oversampling techniques may have resulted in different findings. Reporting AUPRC metrics, which are sensitive to class imbalance, enabled a more detailed assessment of model performance compared to exclusively citing AUROC scores⁴⁵. In contrast to randomized controlled trials, causality cannot be assumed for the identified relationships; future work aims to include causal inference methods like propensity scores to increase reliability.

In conclusion, this study highlights the advantages of multimodal fusion models that integrate routine MRI and EHR data, harnessing the potential of modern machine learning for outcome prediction. Additionally, this study demonstrates the added value of MRI data in supporting clinical decision-making and improving the management of postoperative delirium in the future.

Data availability

Python code can be accessed via https://github.com/ngiesa/fusion_pod. We provide comprehensive summary statistics of patient data in Supplement A. The concrete datasets analyzed during the current study are not publicly available due to German data privacy regulations, but are available from the corresponding author on reasonable request. We report results according to the “Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis” (TRIPOD) guidelines (see Extended Table A5 in Supplement B).

Abbreviations

MRI: Magnetic resonance imaging
AUROC: Area under receiver operating characteristics
AUPRC: Area under precision recall curve
BCE: Binary cross-entropy
CIS: Clinical information system
CAM: Confusion assessment method
CE: Contrast-enhanced
CV: Cross validation
DWH: Data warehouse
EHR: Electronic health record
XAI: Explainable artificial intelligence
FDR: False discovery rate
GBT: Gradient boosted trees
ICU: Intensive care unit
ICD: International classification of diseases
LR: Logistic regression
ML: Machine learning
MWU: Mann Whitney U
MEM: Mixed-effects model
MLP: Multi-layer perceptron
OR: Odds ratio
PACS: Picture archiving and communication system
POD: Postoperative delirium
SOFA: Sequential organ failure assessment

References

European Delirium Association, American Delirium Society. The DSM-5 criteria, level of arousal and delirium diagnosis: inclusiveness is safer. BMC Med. 12 (1), 141 (2014).
Aldecoa, C. et al. European society of anaesthesiology evidence-based and consensus-based guideline on postoperative delirium. Eur. J. Anaesthesiol. 34 (4), 192–214 (2017).
Iamaroon, A. et al. Incidence of and risk factors for postoperative delirium in older adult patients undergoing noncardiac surgery: a prospective study. BMC Geriatr. 20 (1), 40 (2020).
Giesa, N. et al. Applying a transformer architecture to intraoperative temporal dynamics improves the prediction of postoperative delirium. Commun. Med. 4 (1), 251 (2024).
Gunther, M. L. et al. The association between brain volumes, delirium duration, and cognitive outcomes in intensive care unit survivors: the VISIONS cohort magnetic resonance imaging study*. Crit. Care Med. 40 (7), 2022–2032 (2012).
Goldberg, T. E. et al. Association of delirium with Long-term cognitive decline: A Meta-analysis. JAMA Neurol. 77 (11), 1373 (2020).
Wilson, J. E. et al. Delirium. Nat. Rev. Dis. Primers 6 (1), 1–26 (2020).
Shioiri, A. et al. A decrease in the volume of gray matter as a risk factor for postoperative delirium revealed by an atlas-based method. Am. J. Geriatr. Psychiatry Off. J. Am. Assoc. Geriatr. Psychiatry 24 (7), 528–536 (2016).
Cavallari, M. et al. Brain atrophy and white-matter hyperintensities are not significantly associated with incidence and severity of postoperative delirium in older persons without dementia. Neurobiol. Aging 36 (6), 2122–2129 (2015).
Kant, I. M. J. et al. Preoperative brain MRI features and occurrence of postoperative delirium. J. Psychosom. Res. 140, 110301 (2021).
Kyeong, S. et al. Neural predisposing factors of postoperative delirium in elderly patients with femoral neck fracture. Sci. Rep. 8 (1), 7602 (2018).
Zhao, H., You, J., Peng, Y. & Feng, Y. Machine learning algorithm using electronic Chart-Derived data to predict delirium after elderly hip fracture surgeries: A retrospective Case-Control study. Front. Surg. 8, 634629 (2021).
Mohsen, F., Ali, H., El Hajj, N. & Shah, Z. Artificial intelligence-based methods for fusion of electronic health records and imaging data. Sci. Rep. 12 (1), 17981 (2022).
Kirfel, A. et al. Postoperative delirium after cardiac surgery of elderly patients as an independent risk factor for prolonged length of stay in intensive care unit and in hospital. Aging Clin. Exp. Res. 33 (11), 3047–3056 (2021).
Daniel Boie, S. et al. A scalable approach for critical care data extraction and analysis in an academic medical center. Int. J. Med. Inf. 192, 105611 (2024).
Grover, S. & Kate, N. Assessment scales for delirium: A review. World J. Psychiatry. 2 (4), 58–70 (2012).
Sessler, C. N. et al. The Richmond Agitation–Sedation scale: validity and reliability in adult intensive care unit patients. Am. J. Respir Crit. Care Med. 166 (10), 1338–1344 (2002).
Weiskopf, N. G. & Weng, C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J. Am. Med. Inf. Assoc. 20 (1), 144–151 (2013).
Graubner, B. ICD und OPS: Historische Entwicklung und aktueller Stand. Bundesgesundheitsblatt - Gesundheitsforschung - Gesundheitsschutz 50 (7), 932–943 (2007).
Hripcsak, G. & Albers, D. J. Next-generation phenotyping of electronic health records. J. Am. Med. Inf. Assoc. 20 (1), 117–121 (2013).
Mann, H. B. & Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18 (1), 50–60 (1947).
Hanley, J. A. & McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143 (1), 29–36 (1982).
Genest, C., Nešlehová, J. G. & Rémillard, B. On the estimation of Spearman’s rho and related tests of independence for possibly discontinuous multivariate data. J. Multivar. Anal. 117, 214–228 (2013).
McLean, R. A., Sanders, W. L. & Stroup, W. W. A unified approach to mixed linear models. Am. Stat. 45 (1), 54 (1991).
Dunn, O. J. Multiple comparisons among means. J. Am. Stat. Assoc. 56 (293), 52–64 (1961).
Li, X., Morgan, P. S., Ashburner, J., Smith, J. & Rorden, C. The first step for neuroimaging data analysis: DICOM to NIfTI conversion. J. Neurosci. Methods. 264, 47–56 (2016).
Lie, I. A. et al. The effect of gadolinium-based contrast-agents on automated brain atrophy measurements by freesurfer in patients with multiple sclerosis. Eur. Radiol. 32 (5), 3576–3587 (2022).
Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. 2016 [cited 2023 Oct 16]; Available from: https://arxiv.org/abs/1603.02754
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning [Internet]. New York, NY: Springer New York; [cited 2024 Nov 14]. (Springer Series in Statistics). Available from: http://link.springer.com/ (2009). https://doi.org/10.1007/978-0-387-84858-7
Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 2 (5), 359–366 (1989).
Varma, S. & Simon, R. Bias in error estimation when using cross-validation for model selection. BMC Bioinform. 7 (1), 91 (2006).
Pontes, F. J., Amorim, G. F., Balestrassi, P. P., Paiva, A. P. & Ferreira, J. R. Design of experiments and focused grid search for neural network parameter optimization. Neurocomputing 186, 22–34 (2016).
Kuhn, M. & Johnson, K. Applied Predictive Modeling [Internet]. New York, NY: Springer New York; [cited 2024 Nov 14]. Available from: http://link.springer.com/ (2013). https://doi.org/10.1007/978-1-4614-6849-3
Ruby, D. A. U. Binary cross entropy with deep learning technique for image classification. Int. J. Adv. Trends Comput. Sci. Eng. 9 (4), 5393–5397 (2020).
Murrieta-Álvarez, I. et al. Preoperative brain volume loss is associated with postoperative delirium in advanced heart failure patients supported by left ventricular assist device. Sci. Rep. 15 (1), 8884 (2025).
Omiya, H. et al. Preoperative brain magnetic resonance imaging and postoperative delirium after off-pump coronary artery bypass grafting: a prospective cohort study. Can. J. Anesth. Can. Anesth. 62 (6), 595–602 (2015).
Guenther, U. et al. Predisposing and precipitating factors of delirium after cardiac surgery: A prospective observational cohort study. Ann. Surg. 257 (6), 1160–1167 (2013).
Sabaroedin, K. et al. Frontostriatothalamic effective connectivity and dopaminergic function in the psychosis continuum. Brain 146 (1), 372–386 (2023).
Xue, X., Chen, W. & Chen, X. A novel radiomics-based machine learning framework for prediction of acute kidney injury-related delirium in patients who underwent cardiovascular surgery (ed. Wong, K.). Comput. Math. Methods Med. 2022, 1–16 (2022).
Bishara, A. et al. Postoperative delirium prediction using machine learning models and preoperative electronic health record data. BMC Anesthesiol. 22 (1), 8 (2022).
Giesa, N. et al. Predicting postoperative delirium assessed by the nursing screening delirium scale in the recovery room for non-cardiac surgeries without craniotomy: A retrospective study using a machine learning approach. PLOS Digit. Health 3 (8), e0000414 (2024).
Wilming, R., Kieslich, L., Clark, B. & Haufe, S. Theoretical Behavior of XAI Methods in the Presence of Suppressor Variables [Internet]. arXiv; [cited 2024 Nov 14]. (2023). Available from: https://arxiv.org/abs/2306.01464
Budėnas, A. et al. Incidence and clinical significance of postoperative delirium after brain tumor surgery. Acta Neurochir. (Wien). 160 (12), 2327–2337 (2018).
Kappen, P. R. et al. Delirium in neurosurgery: a systematic review and meta-analysis. Neurosurg. Rev. 45 (1), 329–341 (2022).
Powers, D. M. W. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. 2020 [cited 2024 Nov 14]; Available from: https://arxiv.org/abs/2010.16061

Acknowledgements

The authors acknowledge the Scientific Computing of the IT Division at the Charité – Universitätsmedizin Berlin and of the Berlin Institute of Health Center of Digital Health for providing computational resources that have contributed to the research results reported in this paper. Dr. Maria Sekutowicz is a participant in the BIH Charité Junior Digital Clinician Scientist Program funded by the Charité – Universitätsmedizin Berlin and the Berlin Institute of Health at Charité (BIH). Niklas Giesa receives funding from the German Academic Scholarship Foundation.

Funding

Open Access funding enabled and organized by Projekt DEAL. This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, 10117, Berlin, Germany
Niklas Giesa, Felix Balzer & Maria Sekutowicz
Institute of Neuroradiology, Charité - Universitätsmedizin Berlin, 10117, Berlin, Germany
Andrea Dell’Orco & Michael Scheel
Experimental Neurology, Charité - Universitätsmedizin Berlin, 10117, Berlin, Germany
Carsten Finke
Department of Anesthesiology and Intensive Care Medicine (CCM, CVK), Charité - Universitätsmedizin Berlin, 13353, Berlin, Germany
Claudia Doris Spies & Maria Sekutowicz
Berlin Institute of Health at Charité – Universitätsmedizin Berlin, BIH Biomedical Innovation Academy, Charitéplatz 1, 10117, Berlin, Germany
Maria Sekutowicz

Contributions

NG (first author) and MS (senior author) conceptualized the study design. The senior author manually checked all MRIs; the first author, NG, extracted and preprocessed the EHRs and trained the ML models. Both authors also drafted the manuscript. Co-authors AD and MS processed and checked all MRIs. CF and co-author MS contributed neurological expertise. FB and CS provided clinical input. All authors proofread the manuscript.

Corresponding author

Correspondence to Niklas Giesa.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics approval and patient consent

This study was approved by the Ethics Committee of the Charité – Universitätsmedizin Berlin (EA2/024/18) and followed the Declaration of Helsinki. Patient consent for general research purposes was covered by the patient treatment contract. The specific analysis of patient data was approved by the IRB (EA2/024/18).

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Giesa, N., Dell’Orco, A., Scheel, M. et al. Fusion of clinical magnet resonance images and electronic health records promotes multimodal predictions of postoperative delirium. Sci Rep 15, 44654 (2025). https://doi.org/10.1038/s41598-025-31693-9

Received: 14 May 2025
Accepted: 04 December 2025
Published: 26 December 2025
Version of record: 29 December 2025
DOI: https://doi.org/10.1038/s41598-025-31693-9