Introduction

Delirium is a distressing neuropsychiatric syndrome characterized by acute disturbances in consciousness, cognition, and attention1. Postoperative delirium (POD), occurring after major surgical procedures, is associated with adverse outcomes, such as prolonged hospitalization or death. Prevalence rates span from 5 to 52%2. The etiology of POD is multifactorial, with both predisposing and precipitating factors contributing to its acute onset2,3. Predisposing factors, such as preexisting cognitive impairment or advanced age, confer baseline vulnerability, while precipitating factors relate to perioperative conditions, including the surgical procedure2.

POD manifests through heterogenous levels of vigilance, neuropsychological, and psychotic symptoms which fluctuate in presence and severity demanding close monitoring and early assessment2,4. Previous studies indicate that structural brain changes may increase vulnerability to POD5. Patients with cerebral atrophy are predisposed to suffer from longterm cognitive decline6. Vulnerability to delirium may be facilitated by preexisting neuroanatomical changes resulting in neuronal dysfunction and network disintegration7. Such pre-morbidity in POD patients has been identified as decreased white matter integrity and increased gray matter atrophy8,9,10. Previous studies have been restricted to specific patient cohorts at risk, such as elderly patients undergoing major surgical procedures8. Thus, the associated structural brain changes may be age-specific or restricted to patients with preexisting cognitive impairment9.

While previous studies have developed non-linear machine learning (ML) prediction models4,11,12 that outperform standard statistical methods, these models rarely translate into clinical practice. Advanced ML approaches that integrate routinely collected clinical data from multiple modalities may overcome this limitation, as Mohsen et al.13 illustrate various fusion strategies. Such data fusions are applied either early in the feature space or later when outputting prediction probabilities. To the best of our knowledge, we are the first to utilize neuroanatomical features extracted from preoperative clinical MRIs to use premorbid structural brain changes for POD prediction. We systematically explore the predictive value of combining these MRI features with EHR data in two distinct general surgical cohorts. Since POD may often be undiagnosed14, we defined an endpoint based on agitation and pharmacological treatment for delirium for intensive care patients. Hereby, we complement standard delirium assessment tools for POD labeling which are routinely used postoperatively. Interpretation of multimodal ML techniques is augmented by linear mixed-effect models (MEM) correcting for covariates allowing further insights into the pathomechanisms of POD.

Methods

Study population, endpoint definitions, and data extractions

We included all patients (aged > = 18) who underwent surgery between 2017 and 2022 if the estimated surgery duration was > = 1 h, initially resulting in EHR-data from 63,222 patients (see Fig. 1a). All data for this single-center study were provided by three different sides at Charité, a large German university hospital. This study is the first in our medical institution to leverage routinely acquired MRIs. As no standard data pipeline was available, we extracted a random sample of preoperative MRIs from the picture archiving and communication system (PACS) without additional capabilities (e.g., no information on POD assessments or type of scans) (see Fig. 1b). This procedure resulted in 3,344 heterogenous de-identified MRI scans.

Preoperative MRI was obtained for broad clinical indications and could be unrelated to subsequent surgical procedures (e.g., screening for intracranial metastases, suspected stroke, oncologic staging, surveillance/follow-up, or headache/seizure workup, as well as performed for neurosurgical planning). Subsequently, MRI headers were filtered for cranial scans, acquisition times, and existences of T1-weighted MPRAGE sequences yielding a cohort of 991 MRI scans (see Fig. 1a).

The clinical information system (CIS) stored pre- and intraoperative EHRs that were de-identified and archived in a Data Warehouse (DWH)15 allowing data harmonization (see Fig. 1b). We used these EHRs to define predictors, endpoints, and to link clinical predictors to MRIs from the PACS. POD was defined as a binary endpoint variable (1 = delirious, 0 = non-delirious) according to two definitions.

All surgical patients are routinely screened for delirium with the Nursing Delirium Screening Scale (Nu-DESC) preoperatively before the anesthesia and postoperatively during post-anesthesia care inside the recovery room. Patients transferred directly to the ICU are assessed with the Confusion Assessment Method for the Intensive Care Unit (CAM-ICU) at admission and at least three times per day, according to institutional standards. For the score-based POD definition (scoPOD), CAM-ICU and Nu-DESC16 assessments were used to classify patients as delirious (at least one Nu-DESC > 0 or positive CAM-ICU) or non-delirious (all Nu-DESC = 0 or all CAM-ICU negative). 645 scans for 557 patients are covered by scoPOD.

The second medication-based endpoint definition (medPOD) was based on agitation and pharmacological treatment for delirium. ICU patients are routinely assessed for the level of agitation along the standard Richmond Agitation Sedation Scale (RASS)17 for critically ill patients. Those who postoperatively scored a RASS of > 1 and subsequently received any of the medications Haloperidol, Clonidine, Dexmedetomidine, Pipamperone, or Risperidone were labeled as delirious. RASS assessment and given medication were required to be temporarily aligned to the same day (within a 24 h interval). Controls maintained a RASS of 0 and did not receive any aforementioned medications until discharge. Any other cases were excluded from the medPOD definition, resulting into 224 scans for 201 patients.

Fig. 1
figure 1

Inclusion criteria for endpoint definitions (a), methodology and data flow (b). We included data of all adult patients between 2017 and 2022 with estimated surgery length of at least one hour. MRI headers from DICOM metadata were filtered for head-scans, MPRAGE sequences, and scanning time. Preoperative MPRAGE sequences that did not pass automated segmentation (e.g. extreme movement) were excluded (shown in a). We divided the hospital stay into pre-, intra-, and postoperative time phases. MRI scans were extracted from the picture archiving and communication system (PACS), EHR data were extracted from the clinical information system (CIS) for integrated data analysis and the application of machine learning (shown in b).

Data preprocessing

As routinely collected EHRs potentially follow skewed distributions18, we aggregated parameters with robust summary statistics as mean, median, 10th – and 90th percentile separately for the pre- and intraoperative phase. Our initial feature space covered parameters that were available for at least 10% of patients. Laboratory values (lymphocytes, CRP, etc.), medications (propofol, norepinephrine, etc.), vital signs (heart rate, spo2, etc.), and device settings (FiO2, PEEP) resulted in 133 features (see Extended Table A1/A2 in Supplement B). To avoid estimation bias, the feature space excluded any data contributing to our two endpoint definitions medPOD or scoPOD, including medications commonly used to control for delirious symptoms. All data were normalized via z-transformation with statistics from the training set. ICD-based delirium labels would lack temporal information and might reflect documentation gaps rather than negative cases. Consequently, presence of diagnoses in the form of an ICD code19 were used to characterize our cohorts, but omitted for training ML models or endpoint definitions due to uncertain documentation times20.

Statistical analysis

We analyzed features towards binary endpoints with Mann-Whitney U (MWU) test statistics21,22. We report the AUC-0.5 with 0 as a chance-level, values near 1 as a strong positive, and values near − 1 as a strong negative effect. The Spearman correlation coefficient (ρ) was used to describe the association of clinical parameters with age, we defined the effect strength similarly with levels of 0, near 1 or −123.

To account for age as a confounder in MRI-based brain morphometry, we configured multivariate linear mixed-effect models (MEMs)24. MEMs had varying intercepts for single feature effects as \(\:C\left(POD\right)\:\sim\:age+feature+1|patient\). Here, the POD endpoint functions as the dependent variable, age as fixed effects in conjunction with the feature of interest. To correct for multiple surgeries and multiple MRI scans, patient identifiers were integrated as random effects. We report age-adjusted p-values and a corresponding coefficient β(f) indicating effect directions.

Statistical significance was assessed using false discover rate (FDR) corrections (alpha = 0.05)25. For reporting the effects of binary variables with POD, we used the odds ratio (OR) on a logarithmic scale as ln(OR) with large deviations from 1 indicating strong associations.

MRI analysis

All DICOMs were converted to NIfTIs using dcm2niix and segmented by the FreeSurfer v7.4.1 recon-all pipeline26 Morphometry measures were computed using Desikan-Killiany atlas-based parcellation27. We visually assessed all results to identify missegmentation, administration of contrast agents or anatomical aberrations, such as general atrophy or tissue lesions. In the case of brain abnormality, only healthy contralateral hemispheres were selected for analyses. Otherwise, one hemisphere was randomly selected resulting in 358 right – and 287 left (scoPOD) as well as 91 right – and 133 left (medPOD) hemispheres. To account for the effect of intracranial volume, volume estimates were normalized by division with the estimated total intracranial volume. The final MRI-related feature set was composed of 184 volumes, 70 thickness features, and 72 area features.

Machine learning and fusion strategies

We trained three ML techniques comprising logistic regression (LR), gradient boosted trees (BT), and multi-layer perceptron (MLP) architectures. While LR assumes linear relationship, BT and MLP represent non-linear problems28,29 with MLPs functioning as universal approximators30 stacking perceptrons (nodes) on interconnected layers.

To handle different data modalities, we deployed two fusion strategies13. In “early fusion”, we enhanced our input feature space by ingesting selected measures from both types. For “late fusion”, we trained separate models for each modality (MRI or EHR), combining the prediction outputs (see Extended Figure B1 in Supplement A). For BT and LR, model outputs (probabilities between 0 and 1) were simply mean-averaged. In the late fusion MLP, a linear layer integrates predictions from both models into a single output while learning the weights for both networks via backpropagation, also known as “joint fusion”. We additionally trained completely separate models for MRI and EHR features.

Model configuration, training, and validation

For optimal model configurations (hyperparameters), a 3 × 3 nested cross-validation (CV)31 approach was implemented. Different sets of parameters are exhaustively validated via a Grid-Search32 on the inner-nested CV process and then applied to the outer-nested one. 1000x bootstrapping enabled estimations of 95% confidence interval (CI)33 for validation results.

The area under (AU-) the receiving operating characteristics (-ROC), and the precision recall curve (-PRC) evaluated performances22. As a cost function, we configured a weighted binary cross-entropy (BCE) loss to address class imbalance34. The final parameter space included regularization techniques, like BT pruning or L1-norm penalty for MLP and LR, in addition to general configurations (see Extended Table B2 in Supplement A). We trained models with subsets of features for different adjusted p-values thresholds (see Extended Table A3 in Supplement B).

Results

Cohort characteristics

Table 1 Descriptive cohort characteristics for two POD endpoints. Descriptive statistics are displayed as mean ± sd for numerical variables. For binary variables, the fraction of positive samples from all (n) are cited followed by the odds as (pos/neg samples). Adjusted p-values are derived from linear mixed-effect models (MEMs) incorporating age and the variable of interest as fixed effects, patient groups as random effects and POD as the independent variable. We highlight significant results with asterisks according to a FDR corrected alpha level. RASS: Richmond agitation sedation Scale, SOFA: sequential organ failure Assessment, SIRS: systemic inflammatory response Syndrome, urgency class N: ranges from N = 1 (immediate surgery required) to N = 5 (elective, planned procedure), anesthesia type of surgery: minor surgical procedures requiring sedation or anesthesia stand by.

Cohort characteristics for the score-based endpoint scoPOD and the medication-based endpoint medPOD revealed POD prevalence rates of 18.44% and 21.43%, respectively (Table 1). Patients who met inclusion criteria for scoPOD underwent a mean of 1.23 delirium assessments with Nu-DESC or CAM-ICU scales. P-values refer to POD cases vs. controls per endpoint. POD patients were older (65.01 ± 14.04 years as mean ± sd for scoPOD, 62.08 ± 13.49 years for medPOD, p < 0.05) than controls (58.02 ± 16.43 years for scoPOD, 56.85 ± 16.02 years for medPOD, p < 0.05). Highly significant POD differences were observed in recovery room stay durations for scoPOD (5.18 ± 2.42 h for POD, 2.81 ± 3.28 h for controls, p < 0.001). Patients’ physical status and degree of agitation was significantly reduced for delirious medPOD patients (ASA 2.12 ± 1.21 POD, 1.82 ± 1.69 controls, p < 0.001; RASS 1.08 ± 1.16 POD, −0.81 ± 1.12 controls, p < 0.001).

When comparing cohorts, medPOD patients had 0.31 h longer stays in the recovery room, a decreased physical status (Δ of mean ASA = 0.13), were more prone to sequential organ failure (Δ of mean SOFA = 0.70), and less agitated (Δ mean RASS = 0.24). In both cohorts, the most prominent surgical procedure was neurosurgery, reaching a significant difference for medPOD labels (ln(OR) = 1.71, p < 0.001). Visual MRI screening properties, such as general atrophy, did not show significant differences for POD (see Table 1). Confirming POD labeling, we observed highly significant correlation with ICD encoded delirium (ln(OR) = 2.10, p < 0.001 scoPOD, ln(OR) = 2.92, p < 0.001 medPOD). Additional characteristics are included in Extended Table B3 and Extended Results B1 in Supplement A).

MRI and EHR single feature importance

Fig. 2
figure 2

Top 4 single MRI feature importance for endpoint scoPOD (a) and medPOD (b). We display most significant results according to an age-adjusted p-value derived from mixed-linear-effects models (MLEMs). For each panel (a) or (b), the first column displays the distribution of feature values for POD cases (label = 1, orange) and controls (label = 0, blue) including AUC-0.5 as an effect size derived from Mann-Whitney U (MWU) statistics. The second column shows the correlation between the feature and age providing the Spearman Rank correlation coefficient ρ. The last column depicts age-group specific differences in distributions of POD cases and controls for each feature citing the corresponding coefficient β(f) retrieved from MEMs. The black boxes inside violin plots show a boxplot from the entire distribution per age group where the white dot indicates the median.

Table 2 Single feature importance for MRI and EHR features per endpoint. Univariate results from Mann-Whitney U (MWU) test statistic define discriminability of [POD] endpoints with unadjusted p-value and AUC-0.5 as effect size. Spearman ranks provide correlation coefficient (ρ) calculated between feature values and [Age] including p-value under the null-hypothesis of zero-coefficients. Linear mixed-effect models (MEM) were fitted with [POD] as dependent variable, feature values as fixed effects, patient-MRI hierarchy as random effects. MEMs provide age-adjusted p-values and a corresponding coefficient β(f). Data availability shows the fraction of available feature values from all, adjacent to the odds of (available/missing) values. EHRs must be aggregated with either the sum, median (md), mean (me) for preoperative (pre) or intraoperative (intra) time phases. FDR corrected significant results are highlighted in bold and italics.

MRI-derived morphometry features, such as the middle temporal- and superior temporal thickness were significantly correlated with scoPOD (MWU p = 7.26E-06, AUC-0.5= −0.142; p = 2.60E-05, AUC-0.5= −0.133) as well as with age (Spearman p = 2.14E-19, ρ= −0.359; p = 5.57E-29, ρ= −0.438) (see Table 2; Fig. 2a). Decreased cortical thickness resulted in increased probabilities of POD and occurred rather in elderly, than in younger patients.

Multivariate MEMs showed that cortical thickness of middle as well as the superior temporal cortex remained significantly associated with scoPOD when adjusting for age (adj. p = 3.25E-05, β(f)= −0.517; adj. p = 7.12E-04, β(f)= −0.332). Decreased cortical thickness in POD was preserved, when dividing patients into equally-sized age groups (negative MEM coefficient β(f), see Fig. 2a). Measures of white matter hypointensities expressed significant univariate effects on scoPOD (MWU p = 8.22E-05, AUC-0.5= −0.125) and age (Spearman p = 5.47E-28, ρ = 0.430).

MWU analysis of EHR features highlighted preoperative measures of anemia (hemoglobin: p-value = 1.19E-05; erythrocytes: p = 1.00E-05) and infection parameters (CRP: p = 3.19E-05). These were significant after age-corrections (MEM erythrocytes: adj. p = 1.21E-06; hemoglobin: adj. p = 7.00E-09, hematocrit adj. p = 3.22E-06; CRP: adjusted p = 1.24E-06).

For medPOD, subcortical MRI features, such as thalamus and brainstem volume, were significantly associated with POD (thalamus volume, p = 2.79E-04, AUC-0.5 = −0.177; MWU; brainstem volume p = 6.80E-04, AUC-0.5 = −0.165; MWU). After age-correction, the thalamus volume remained significantly associated with POD (adj. p = 2.39E-04) (see Table 2; Fig. 2b). Additionally, multivariate MEM analysis indicated significant associations of EHR features and blood parameters like low levels of erythrocytes (adjusted p = 2.49E-09, β(f) = −0.185), hemoglobin (adjusted p = 7.00E-09, β(f) = −0.060), hematocrit (adjusted p = 4.67E-08, β(f) = −2.057), and increased CRP (adjusted p = 1.08E-07, β(f) = 0.003) (see Table 2).

Machine learning results

Fig. 3
figure 3

Performance metrics across machine learning models and fusion types (a) or for best performing model (b) as well as importance of MRI features for scoPOD (c). Area under (AU-) the receiver operating characteristics (-ROC) or the precision recall curve (-PRC) for trained and 3x cross-validated machine learning methods applied to predict endpoint scoPOD (n = 645) or medPOD (n = 224). Gradient boosted trees (BT), logistic regression (LR), and multi-layer perceptrons (MLPs) are included. Panel a shows metrics across models and fusion types either including magnetic resonance imaging features (MRIs), or electronic health records (EHRs) only, or combining these two modalities (early, late). 95% confidence intervals are drawsn as white boxes in (a) on top of bars calculated on 1000x bootstrapped validation folds. In panel b, trained MLPs are selected to draw AU curves showing mean performances across validation sets. In panel c, the absolute value of AUC-0.5 derived from MWU (grey scale) and absolute model weights (MW) from best MLP classifier (blue-green scale) trained with MRI only are shown. Anatomical segmentation across right hemisphere according to DKT-Atlas is displayed from temporal and medial. Overlay colored green-blue indicates model weights. Underlaying absolute AUC metrics for blended areas are: superior frontal thickness 0.111, insula thickness 0.096, middle temporal thickness 0.142, superior temporal thickness 0.133, temporal pole thickness 0.084.

We evaluated LR, BT, and MLP models with AUROC and AUPRC metrics for both endpoints. Highest performance for the score-based scoPOD cohort was achieved by a late fusion MLP (AUROC 0.735 [0.726, 0.744] as mean, [95% CI]; AUPRC 0.456 [0.411, 0.472]), outperforming LR (AUROC 0.705 [0.695, 0.715]; AUPRC 0.404 [0.389, 0.420]) and BT (AUROC 0.722 [0.712, 0.32]; AUPRC 0.450 [0.436, 0.468]) within the same fusion type (see Fig. 3a). MLPs were superior to LR and BT for all fusion types (see Extended Table B4 in Supplement A). We observed overlapping CIs of AUROC between late- and early fusion of MLPs for scoPOD ([0.726, 0.744] vs. [0.722, 0.740]), but distinct differences to MLPs trained with one modality only (EHR only [0.703, 0.721], MRI only [0.666, 0.685]).

The best model that predicted the medical-based endpoint medPOD used EHR features only (AUROC 0.861 [0.851, 0.871]; AUPRC 0.665 [0.644, 0.687]) (see Fig. 3a). Here, the confidence was decreased due to similar metrics ranges yielded by combined fusion MLPs like early - (AUROC [0.847, 0.860]; AUPRC [0.636, 0.679]), or late fusion (AUROC [0.847, 0.867]; AUPRC [0.619, 0.668]). MLPs showed overall elevated validation metrics in contrast to other ML methods.

In Fig. 3b, corresponding AUROC and AUPRC curves describe model behaviors under varying prediction thresholds. Curves confirm that late fusion was favorable for scoPOD with a sensitivity of 0.81 and a specificity of 0.63 at the threshold where their sums maximize. MLPs trained solely on EHRs exceeded these metrics with 0.81 sensitivity and 0.82 specificity predicting medPOD.

Model interpretation

Model weights (MW) from our best MLPs per endpoint revealed ante-hoc feature importance. The best late fusion scoPOD based MLP focused on intraoperative tidal volume (abs MW = 0.288), preoperative albumin blood levels (abs MW = 0.283), and erythrocytes counts (abs MW = 0.2).

Highest model weights were found in the late fusion MLP using MRI features for scoPOD and were assigned to temporal pole thickness (MW = 0.565), superior frontal gyrus (MW = 0.523), and insula thickness (MW = 0.504). These features also showed univariate MWU feature importance with effect strengths of 0.084 (p = 8.32E-03), −0.111 (p = 4.78E-04), and − 0.096 (p = 2.37E-03) AUC-0.5 (see Extended Table B5 Supplement A, Fig. 3c).

The best MLP to predict medPOD relied on unimodal EHRs such as mean blood pressure (abs MW = 0.186) or administered fluid volume (abs MW = 0.183). The MLP focused on intraoperative norepinephrine infusion (abs MW = 0.172), the tidal volume (abs MW = 0.166), and heart frequency (abs MW = 0.142). Additionally, preoperative CRP levels (abs MW = 0.158), physical status (ASA abs MW = 0.120), hematocrit (abs MW = 0.135), or erythrocyte count (abs MW = 0.133) had predictive values (see Extended Table A4 in Supplement B). Highest MWs regarding MRIs for medPOD, provided by the unimodal MLP, were found for the thalamus volume (abs 0.27) (see Extended Table B5 in Supplement A). Additional analyses of the relationship between key covariates and model raw output probabilities for each surgery did not suggest model biases towards gender or age (see Extended Figure B6, Extended Results B3 in Supplement A).

Discussion

We are the first to demonstrate that neuroanatomical atrophy measures contribute to successful multimodal POD predictions in a general surgical patient population. In contrast to previous studies, our study leveraged data from clinical routinely-collected MRIs which are heterogenous and noisy, but proved to hold potential for clinical decision making10,35,36. Best unimodal prediction using MRI morphometry measures achieved 72% AUROC expressing the highest predictive value for subcortical volumetric measures such as the thalamus. Through the iterative application of diverse data fusion strategies and multiple ML models, we achieved high performances up to 86% AUROC for multimodal models, where frontal and temporal cortical atrophy were highly predictive.

Our findings provide clinical utility by enabling preoperative risk stratification that leverages routinely acquired MRI alongside EHR data. In less critically ill patients, cortical atrophy (frontal/temporal) flags intrinsic brain vulnerability to delirium, supporting early initiation of clinical delirium-prevention bundles (e.g., reorientation, sleep protection, mobilization). In higher-acuity settings, reflected by medPOD, the multimodal model emphasizes systemic factors (e.g., anemia, hydration, infection proxies), guiding optimization before surgery (e.g., hemoglobin targets, volume status, infection control) and postoperative monitoring. Multimodal approaches, such as future MRI-informed perioperative decision-making tools, have the potential to improve prevention, triage, and targeted management of POD while complementing clinician judgment and existing care pathways.

We assessed the predictive properties of neuroanatomical and clinical markers in two cohorts based on different POD endpoints. In addition to validated clinical POD-assessment methods with Nu-DESC and CAM-ICU scores (scoPOD), we defined POD in a subgroup of ICU patients according to agitation and pharmacotherapy (medPOD). Importantly, both endpoints highly correlated with the documented ICD diagnosis of delirium, but reflected distinct subgroups with key differences in clinical and surgical characteristics. While the score-based cohort scoPOD covered a wider range of surgical interventions, the smaller medPOD cohort had higher degrees of critical illness and systemic inflammation. Findings of scoPOD associated cortical thickness parameters in the temporal and frontal lobe, involved in memory, attention, and higher-order executive functions, align with literature linking temporal cortex atrophy to delirium37. These cortical features outperformed EHR predictors, suggesting that brain-specific vulnerability, reflected by these morphometric measures, is primary driving POD in less critically ill populations. When illness is more critical and patients in intensive care receive antipsychotic pharmacotherapy, MRI measures emphasize the importance of fronto-striato-thalamic circuits in disorders of consciousness and the emergence of psychotic symptoms38. However, such subcortical brain vulnerability measures are outweighed by EHR features when POD is predicted in more complex critical illness. Here, preoperative anemia, hydration and infection proxies, significantly associated with POD in univariate statistics, also showed high predictive value.

To keep the clinical scope wide, the presented models are more generalizable compared to previous prediction studies based on clinical imaging of the hippocampus in cardiovascular surgeries39. While previous work trained ML on EHRs only40,41, we could show that the combination of data modalities improved prediction performances, especially for a less critically ill population.

For explainable artificial intelligence (XAI) purposes, we preferred directly reading model weights over methods like Shapely or LIME due to their susceptibility to unfavorable effects such as suppressor variables42. Since model weights are technical properties, we provide comprehensive univariate analyses to enhance clinical insights. To correct for strong confounders, such as age in neuroanatomical measures, we applied linear mixed-effects models focusing on such covariates5. However, due to the highly inter-correlated nature of our data, latent noise may not be fully excluded.

Although there are numerous indications for preoperative cranial MRI-scans, most patients who received cranial MRI comprised neurosurgical cases in our sample. In line with previous findings43,44, neurosurgical procedures were not predictive for POD in the larger scoPOD sample, nor did we find notably higher POD prevalence rates in our cohorts. However, in the smaller cohort of intensive care patients, the presence of a neurosurgical procedure was significantly associated with POD. While this needs to be replicated, we speculate that through disease severity, patients who undergo major neurosurgical procedures and who require intensive care might be especially vulnerable to POD. Critically, we did not find model biases regarding age, urgency class, sex, or performed neurosurgery suggesting a broad applicability of our results.

Since brain morphometry measures were extracted from heterogenous clinical MRI assessments, we performed quality control to identify anatomical aberrations and segmentation inaccuracies. Automated morphometry tools, such as Freesurfer, are optimized for non-contrast enhanced (CE) images. However, excellent reliability and agreement is reported for T1wCE segmentation26.

The presented work has several limitations. Only healthy hemispheres without visually detectable structural lesions interfering with segmentation accuracy were included in our analyses, potentially excluding valuable data. Future work may incorporate more automated and data-driven segmentation approaches to optimize the utilization of clinical scans with potential lesions enhancing prediction robustness. The sample size, particularly for medPOD, restricts generalizability and findings should be validated in larger external cohorts. Hypoactive delirium which is often undiagnosed was not separately addressed in our prediction targets but we aim to formulate a multi-class prediction problem in the future. As with most real-world clinical data, the EHRs used in this study were not primarily collected for secondary research purposes. Consequently, institution-specific documentation practices and local clinical guidelines may have introduced biases and data quality limitations. We inherently handled class-imbalance by robust MWU test statistics and a weighted BCE loss. Oversampling techniques may have resulted in different findings. Reporting AUPRC metrics, which are sensitive to class-imbalances, enabled a more elaborate assessment of model performance compared to exclusively citing AUROC scores45. In contrast to randomized controlled trials, causality cannot be assumed for identified relationships while future work aims to include causal inference methods like propensity scores to increase reliability.

In conclusion, this study highlights the advantages of multimodal fusion models that integrate routine MRI and EHR data, harnessing the potential of modern machine learning for outcome prediction. Additionally, this study demonstrates the added value of MRI data in supporting clinical decision-making and improving the management of postoperative delirium in the future.