Introduction

Organ donation plays a critical role in saving lives and improving the quality of life for individuals suffering from end-organ failure. Historically, organs from donation after brain death (DBD) donors have constituted the predominant source of transplantable organs, with donation after circulatory death (DCD) contributing to a comparatively smaller, albeit recently increasing, fraction1. A major reason for this disparity is the lower organ-yield from DCD donors, due to the reduced quality and longevity of allografts2. However, in the last 5-years, technological explosion of normothermic machine perfusion (NMP) and normothermic regional perfusion (NRP) have improved organ quality from DCD donors, highlighting the unrecognized potential of DCD donors in the transplant community3,4. Given these recent advances, there is now a growing consensus that augmenting DCD practice represents the largest and underutilized opportunity for expanding the organ donor pool5.

Although NMP and NRP work to improve the quality of organs procured from a DCD, the critical challenge limiting the volume of DCD practice is the unpredictability regarding whether, or when, a patient after terminal extubation (TE) will progress to meet the Uniform Declaration of Death Act (UDDA)6 criteria for organ donation. This uncertainty limits the ability of Organ Procurement Organizations (OPOs) to evaluate a potential DCD donor for organ donation and thus negatively impacts the organ yield from DCD donors. Indeed, while conventional guidelines stipulate that circulatory death must occur within a narrow time-frame following the cessation of life-sustaining treatment, only 59-72% of potential DCD donors die within the first hour1,7. The goal of this study is to investigate the potential of advanced machine learning models to accurately predict time-to-death (TTD) after extubation.

Recognizing the complexities inherent in DCD, we trained and externally validated an ODE-RNN18 model, which combines recurrent neural network with neural ordinary equations and excels in processing irregularly-sampled time series data. The model is designed to predict the time-to-death (TTD) of a patient following terminal extubation in the intensive care unit (ICU), leveraging the history of clinical observations. Our model shows remarkable accuracy and calibration, underscoring its ability to accurately and reliably identify viable DCD organ donors, and enables a nuanced balance between the risks and benefits of a specific organ donation procedure.

A key challenge of modeling ICU data for TTD prediction is that the data consist of both static variables and a multidimensional time series of longitudinal variables, are measured at irregularly-spaced time points, and contain missing measurements in many variables. Previous efforts in predicting circulatory death within specified time frames include clinical risk scores such as the United Organ Sharing (UNOS) criteria8, and machine learning models such as XGBoost9, RNN10, LSTM11, GRU12, GRU-D13. However, these studies are predominantly based on conventional statistical models or basic machine learning architectures designed for regularly-sampled fixed-dimensional data. As a result, they cannot take full advantage of the data available, leading to insufficient performance and lower clinical reliability7,14,15,16,17. The UNOS criteria and XGBoost, a tree-based machine learning model, only consider static variables and cannot use the rich history of longitudinal variables. Recurrent neural network models and their extensions (RNN, LSTM, GRU and GRU-D) are able to model the time series of longitudinal variables, but they are primarily designed for data regularly-measured in time and perform badly on irregularly-sampled data.

In contrast, we recommend using ODE-RNN, an architecture that builds upon recent advances in longitudinal modeling through the use of neural ordinary differential equations18,19,20, which specifically address the challenges posed by the irregularly-sampled data. The proposed model also allows one to form patient phenoscape visualizations for a better understanding of the cohort’s structure and heterogeneity. By leveraging state-of-the-art deep learning and representation learning methodologies, our approach surpasses the limitations of previous models and sets the stage for a more accurate and clinically relevant prediction of time-to-death following extubation, thereby promising an increase in the DCD donor pool.

Methods

Patient cohort and data preparation

We used two separate cohorts of patients to develop and validate the model. The first cohort contained 3,238 patients at Yale New Haven hospital (YNHH) older than 18 years old, with a recorded TE in the ICU between 2014 and 2023. For unbiased validation, using same inclusion criteria, we formed an external cohort from six different hospitals with 1,908 patients. The hospitals from the external cohort are the Bridgeport hospital, Greenwich hospital, Lawrence and Memorial hospital, Saint Raphael hospital, Westerly hospital and the Yale New Haven Children’s hospital. The median time from extubation to death was 7.15 minutes in the first cohort and 8.28 minutes in the external cohort. The two cohorts are summarized in Table 1.

Table 1 Statistics of the two cohorts. BMI stands for body mass index, and TTD for time-to-death.

For each patient, we extracted 5 static variables and 25 longitudinal variables, and collected historical measurements before extubation. Longitudinal variables were observed at a range of frequencies – from a single observation in more than 24 hours to one/minute. Missing values in the longitudinal records were imputed by performing a combination of forward-fill, backward-fill and mean-fill imputation21. In addition to imputation, presence of missing values was fed to the model by creating binary missingness indicators. TTD was defined as the time from terminal extubation to circulatory death. TTD was converted into an ordinal variable with 4 categories (0: 0\(\sim\)30 minutes, 1: 30\(\sim\)60 minutes, 2: 60\(\sim\)120 minutes, and 3: longer than 120 minutes), that were used as target labels in the machine learning model. We chose these time frames as most transplant centers use 0-60 minutes as “standard criteria” for kidney DCD donation and 60\(\sim\)120 minutes as an “extended spectrum”22; similarly, 0\(\sim\)30 minutes as “standard criteria” for liver DCD donation and 30\(\sim\)60 minutes as an “extended spectrum”23.

List of clinical variables used in the models

The following longitudinal and clinical variables were extracted for each patient.

Longitudinal variables B-type natriuretic peptide (BNP), carboxyhemoglobin, corneal reflex, fraction of inspired oxygen (FiO2), gag reflex, Glasgow Coma Scale (GCS), hemoglobin, lactate, mean arterial blood pressure (MAP), methemoglobin, O2-Hemoglobin, partial pressure of carbon dioxide (pCO2), positive end-expiratory pressure (PEEP), blood potential of hydrogen (pH), partial pressure of oxygen (pO2), pulse, respirations, oxygen saturation (SpO2), Troponin-I, Troponin-T, Dopamine, Epinephrine, Levothyroxine, Lidocaine, Norepinephrine.

Static variables Age, Body Mass Index (BMI), Dialysis, Sex, Weight.

Model description

To capture the impact of longitudinal variables on the target while accommodating for the irregular sampling of clinical trajectories, we used an Ordinary Differential Equation Recurrent Neural Network (ODE-RNN)18. ODE-RNNs combine two powerful architectures, recurrent neural networks (RNN) and neural ordinary differential equations (Neural ODE)27, making them exceptionally apt at processing clinical time series19,24. RNNs are neural networks specialized for processing sequences. At each time step, they maintain a hidden state which represents the whole previous information in the time series. Upon reading a clinical record, the RNN updates its hidden state by combining the previous hidden state with the new observation, using an update unit (here a gated recurrent unit (GRU)). However, RNNs assume regular time intervals between observations, an assumption typically not met in clinical time series. To address this limitation, ODE-RNN uses a Neural ODE, that describes dynamics in continuous time, to model the dynamics of the hidden state between observations. This uniquely allows the model to capture the full span of the clinical records of each patient, correctly accounting for the time interval between observations, and improving upon previous methods such as logistic regression, XGBoost, or the UNOS criteria, among others.

The model accumulates the history before extubation of the patient’s vitals, medications used (e.g. vasopressors), neurological assessments, and lab results, together with their demographic records, and produces a representation of the patient, which we refer to as the patient’s latent phenotype. This latent phenotype can be understood as a learnt compact clinical summary of a particular patient. This representation is then used as an input to a multi-layer perceptron classifier that predicts the probability of each label category (0, 1, 2, or 3, corresponding to the 4 time ranges).

A graphical depiction of the architecture is presented in Fig. 1. Panel A describes the high-level design and panel B illustrates the specific modules involved.

Fig. 1
figure 1

Description of the problem setup and model architecture. (A) Problem setup. Based on the static variables of a specific patient (e.g. age, sex) and the history of clinical follow-up prior to extubation (e.g. SpO2, MAP), our model predicts 4 probabilities: the probability that the time-to-death (TTD) is shorter than 30 minutes, between 30 and 60 minutes, between 60 and 120 minutes, and longer than 120 minutes. The sum of these probabilities equals to 1 by design. BMI stands for body mass index, SpO2 for oxygen saturation, MAP for mean arterial blood pressure, Hgb for hemoglobin, and NE for norepinephrine. Note that we consider 5 static variables and 25 longitudinal variables, and only some are shown for illustration purposes. (B) Architecture of our ODE-RNN. The set of variables fed to the model consists of a concatenation of the longitudinal variables available at that observation time and a mask specifying which longitudinal variables are observed. Each clinical observation is sequentially processed by a gated recurrent unit (GRU) that incorporates the observation into the hidden state representation from the previous samples in the time series. Between observations, an ordinary differential equation (ODE) models the evolution of the patient’s hidden state continuously over time, which enables processing of variable temporal intervals between subsequent observations. The hidden state obtained after the whole time series is then complemented with the static variables to form the latent phenotype, a vector representation that summarizes the whole available information about the patient. The end classification is performed by using a multi-layer perceptron classifier (MLP) that predicts the TTD probabilities from the latent phenotype.

Our model contains the following neural networks:

  • \(f_{\textrm{static}}\): a multi-layer perceptron (MLP) that processes the static variables.

  • \(f_{\textrm{ODE}}\): an MLP that predicts the derivative of the hidden state dynamics between observations.

  • \(f_{\textrm{GRU}}\): a gated recurrent unit (GRU) that updates the hidden states at each observation point.

  • \(f_{\textrm{longitudinal}}\): an MLP that processes the final hidden state of the longitudinal variables.

  • \(f_{\textrm{fusion}}\): an MLP that fuses the latent states derived from static and longitudinal variables, producing a latent variable called the latent phenotype.

  • \(f_{\textrm{classifier}}\): an MLP that performs classification using the latent phenotype.

Algorithm 1 describes how the model predicts outcomes for a single patient. The model takes as input static variables \(s\in \mathbb {R}^{\ell }\), longitudinal variables \(x_1,\dots ,x_n\in \mathbb {R}^{k}\), observation times \(t_1,\dots ,t_n\in \mathbb {R}\), and boolean observation masks \(m_1,\dots ,m_n\in \mathbb {R}^{k}\), where \(m_i=(m_{i1},\dots ,m_{ik})\). The value \(m_{ij}=\textrm{true}\) if \(x_{ij}\), the \(j^{\textrm{th}}\) variable at time i, is observed, and \(m_{ij}=\textrm{false}\) otherwise. Including these observation masks enables the model to utilize “informative missingness,” which is correlated with the patient’s condition. For instance, a patient is unlikely to be suspected of heart failure if B-type Natriuretic Peptide (BNP) is not frequently measured. The model iterates through the observation time points, updating the hidden state h of the longitudinal variables using \(f_{\textrm{ODE}}\) and \(f_{\textrm{GRU}}\). Between observation points, \(f_{\textrm{ODE}}\) is integrated to continuously update h, while at observation times, \(f_{\textrm{GRU}}\) updates h using \(x_i, m_i, t_i\). The final h contains accumulated information from the entire history of the longitudinal variables, which is then processed by \(f_{\textrm{longitudinal}}\) and fused with the processed static variables s (via \(f_{\textrm{static}}\)) using \(f_{\textrm{fusion}}\). This results in a latent variable representing the patient’s condition, referred to as the latent phenotype. \(f_{\textrm{classifier}}\) uses the latent phenotype to make the final classification prediction.

Algorithm 1
figure a

ODE-RNN using GRU cell update (using one patient for illustration).

Evaluation of model performance

We compared the performance of our approach with machine learning methods and clinical scores such as the UNOS criteria8. Machine learning methods directly learn associations from the available data while clinical scores consist of clinical criteria designed by experts. Within the machine learning methods, we distinguish between static methods and longitudinal methods. Unlike longitudinal methods, static methods cannot process a time series of clinical information. They are therefore trained on the last available clinical observation at the time of extubation only.

Static machine learning methods

  1. 1.

    XGBoost XGBoost9 is an effective and widely used tree-based machine learning algorithm for predictive modeling. Utilizing an ensemble of decision trees, XGBoost improves model accuracy by combining weak learners into a strong one. It is designed with sparsity awareness, which renders it helpful in clinical health data. However, being a static method, XGBoost cannot process longitudinal data.

Longitudinal machine learning methods

  1. 1.

    RNN Recurrent Neural Networks (RNNs)10 are a class of neural networks specialized for processing sequences, designed for handling time-series data or sequential information. A vanilla RNN processes sequences by iterating through elements, using its internal state to retain information from previous inputs.

  2. 2.

    LSTM Long Short-Term Memory (LSTM)11, an extension of vanilla RNNs, is designed to overcome the vanishing gradient problem by incorporating memory cells. These cells enable LSTMs to retain information over extended sequences, making them adept at tasks requiring longer-term dependencies.

  3. 3.

    GRU Gated Recurrent Units (GRUs)12 are a streamlined variant of LSTMs. Compared to the LSTM architecture, a GRU replaces the input, forget and output gates with the reset and update gates for higher efficiency.

  4. 4.

    GRU-D GRU-D, an extension of GRUs13, integrates decay mechanisms to handle missing data in time-series. It modifies the GRU architecture to accommodate irregularly-sampled data, enhancing prediction accuracy in such scenarios.

Clinical scores Besides the machine learning models above, we also compared our approach to the most widely used clinical score for DCD candidates identification: the UNOS criteria8.

  1. 1.

    UNOS UNOS criteria consisted of fourteen clinical variables developed by the UNOS DCD consensus committee, based on expert opinion8. Criteria include physiological measurements (e.g. heart rate <30) and respiratory characteristics (e.g. FiO2 >0.5). A final score was computed by adding up the number of UNOS criteria present in the patient at the time of extubation.

For comparison, we trained these various models on the YNHH cohort using a temporal data split. Data from patients before 2021 was used for training the models. Patients after 2021 were used for evaluation only. This ensured that our results were robust to distribution shifts over time25. Standard errors were computed by training five different models with different initializations.

The models were trained to predict TTD as a categorical variable within a given time frame (0\(\sim\)30 min, 30\(\sim\)60 min, 60\(\sim\)120 min, or >120 min). We evaluated the different models according to the overall categorical accuracy as well as pairwise binary classification for different grouped time-frames (e.g., <30 min vs. >30 min). For these binary groupings, we also computed the positive predictive values (PPV), negative predicted values (NPV), area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC). The ROC curve measures the trade-off between sensitivity and specificity, while the PR curve measures the trade-off between precision and recall. To assess the calibration of the models, we computed the expected calibration error (ECE).

Visualizing structures in high-dimensional patient phenoscape with PHATE

To predict TTD, our ODE-RNN model produces a latent phenotype for each patient, which can be intuitively understood as a learnt summary of the clinical history of the patient. We explored the space of latent phenotypes of all patients in the cohort, the patient phenoscape, showing its potential to provide new clinical insights.

The latent phenotype of each patient is high-dimensional, and thus cannot be directly visualized. Therefore, we first produced a two-dimensional representation of the phenotype using PHATE26, a non-linear dimensionality reduction and visualization method that stays faithful to the geometry of the data and retains the inherent similarity between patients. In this visualization, each patient is represented as a point in a two-dimensional phenotypic space. The patient phenoscape is the set of the representations of all patients in the cohort. We colored each point according to their TTD and the value of certain clinical variables, enabling a fine-grained exploration of the impact of different clinical factors on the TTD.

Statements regarding human participants

All methods were carried out in accordance with relevant guidelines and regulations. Identifiable information, including human participants’ names and other HIPAA identifiers, has been removed from all sections of the manuscript, including supplementary materials.

The Yale University Institutional Review Board (IRB) waived the need for approval and informed consent for this study after reviewing the proposal, as this research is not considered human subject research since all the participants were deceased. Please refer to the definition for Human Subjects in the HHS Policy for Protection of Human Research Subjects 45 CFR 46.102: “Human subject means a living individual about whom an investigator (whether professional or student) conducting research obtains”.

Results

Modeling longitudinal clinical variables acquired at irregular time intervals

Since we aim to predict time-to-death using a patient’s history of clinical observations prior to terminal extubation,  the input data contains both static and longitudinal variables. The latter pose challenges for statistical analysis and machine learning methods due to their irregular measurements over time and the presence of missing values. By using an Ordinary Differential Equation Recurrent Neural Network (ODE-RNN)18, we address both issues, as the model integrates a recurrent neural network (RNN)10 component tailored for sequential modeling, with a Neural ODE27 component that interpolates between irregularly-sampled time points.

The model operates by accumulating longitudinal variables with static variables to create a summary of the clinical history of each patient, that we call the latent phenotype28. The latent phenotype is then used by a classifier to predict patient outcomes (i.e. TTD). This procedure makes the ODE-RNN particularly effective at processing EHRs containing both static and longitudinal variables24.

Predictive performance evaluation

We compared our method with the UNOS criteria, the most widely used clinical score for identifying DCD candidates8, and existing machine learning models that have been used for clinical outcome prediction (RNN10, LSTM11, GRU12, GRU-D12, and XGBoost9).

All machine learning models were trained on the Yale New Haven Hospital (YNHH) cohort using a temporal data split. Data from patients before 2021 was used for training the models. Patients after 2021 were used for evaluation only to ensure robustness to distribution shifts over time.

The models were trained to predict time-to-death as a categorical variable, i.e., whether TTD fell within a given time frame (0-30 min, 30-60 min, 60-120 min, or >120 min). We evaluated the different models according to the overall categorical accuracy as well as pairwise binary classification for different grouped time frames (e.g., <30 min vs. >30 min). For these binary groupings, we also computed the area under the positive and negative predicted values (PPV, NPV), the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC). To assess the calibration of the models, that is, how well the predicted values represent the true likelihood, we computed the expected calibration error (ECE)29.

Table 2 displays the comparative performance of various models on the YNHH patient cohort after 2021 and the external validation cohort. As illustrated in Fig. 2 panel A, the ODE-RNN model consistently outperforms all other approaches in predictive performance for TTD forecasts at 30, 60, and 120 minutes. We note the high performance of the model despite the stringent experimental setup (temporal split and external validation), highlighting the robustness of the method. ODE-RNN also shows the best calibration, suggesting the probability outputs of the model are very reliable.

Table 2 Comparison of the performance results of various machine learning models and statistical models on the Yale New Haven Hospital (YNHH) test cohort and external validation cohort.
Fig. 2
figure 2

Model performances and analyses. (A) Graphical representation of performance assessment for the UNOS score, XGBoost, LSTM, and ODE-RNN across different binary tasks. Left: TTD<30 min vs. TTD>30 min; Middle: TTD<60 min vs. TTD>60 min. Right: TTD<120 min vs TTD>120 min. We report the binary accuracy, the area under the receiver operating characteristic curve (ROC AUC), and the precision recall curve (PR AUC). ODE-RNN outperforms all other models for all tasks and all evaluation metrics. (B) Calibration plot for the binary classification task TTD <30 min vs. TTD >30 min on the external validation cohort, computed with the R package ‘val.prob.ci.2’. The predicted probabilities come from the output of our model and plotted against the fraction of positives observed in the data. The histogram shows the prevalence of patients for different ranges of predicted probabilities. (C) Variable importance for the predictions of the ODE-RNN model in the binary classification (left: <30 vs >30 minutes, middle: <60 vs >60 minutes, right: <120 vs >120 minutes). Variable importance was computed using permutation importance testing. The input variables were split into static variables (that do not change over time) and longitudinal variables. ROC-AUC stands for area under the receiver operating characteristic curves.

The poor performance of XGBoost and UNOS can be explained by (1) the inability of UNOS criteria and XGBoost to capture the temporality of the patient’s data – they only use the last observation at the time of extubation, (2) the limited number of clinical variables used in UNOS (14 variables) compared to our ODE-RNN model (5 static and 25 longitudinal variables).

To accommodate certain DCD protocols with longer time thresholds, we also trained and evaluated the models over extended time frames of 30, 60, 120, 180 and 240 minutes, and our ODE-RNN remains competitive (see Supplementary Table S1). Additionally, in clinical deployment, a dedicated model that predicts TTD several hours before extubation is beneficial, as early prediction facilitates better preparation and planning for DCD. To demonstrate the capability of early prediction, we trained and evaluated the same models using data up to 12 hours before extubation, and ODE-RNN is again the most competitive method in this realistic setting (see Supplementary Table S2).

In Fig. 2 panel B, we report the calibration plot of the ODE-RNN model for the binary prediction (< 30 min vs. > 30 min) on the external validation cohort. Calibration plots for other binary tasks are available in the Supplementary Materials. The model tended to give a reliable but conservative estimate of the probability of death within 30 minutes. Importantly, the model appeared well calibrated for low and high predicted probabilities, highlighting its reliability.

Variable importance assessment

We assessed the importance and impact of the different clinical variables on the prediction of the models with permutation importance testing (Fig. 2 panel C). For longitudinal variables, we found that heart rate was the most important variable in the prediction of the ODE-RNN, followed by respiratory rate, mean arterial blood pressure (MAP), oxygen saturation (SpO2), and the Glasgow Coma Scale (GCS) score. Corneal reflex and gag reflex which are frequently used by transplant surgeons and OPOs, appear to have the least impact on the prediction. Notably, static variables were found significantly less important than longitudinal variables, by an order of magnitude. We also found a strong consistency in the variable importance across different binary tasks.

Patient phenoscape analysis

The patient phenotypes learnt by our model enable accurate TTD predictions because they faithfully represent the patients’ condition and past clinical history. As such, these phenotypes provide richer information about the patients, compared to the single numerical value of TTD prediction. These phenotypes form a continuous landscape of the patient cohort, which we refer to as the phenoscape. Within the phenoscape, we can observe clusters of patients with similar conditions and continuous transitions from one condition to another, thereby uncovering the underlying dynamics of circulatory death. We used PHATE26, a dimensionality reduction method that preserves the underlyding data geometry30,31,32,33, to visualize the phenoscape and provide examples of new insights drawn from such analysis.

Our patient phenoscape visualizations in Fig. 3 revealed that patients reside on a continuous spectrum of phenotype that goes beyond the TTD categorization. Patients were organized along an axis that corresponded with TTD but also with heart rate, SpO2, or GCS, among others. Finer investigation allowed us to identify the dynamical patterns most correlated with TTD, complementing the variable importance analysis above. For instance, in Fig. 3 panel A, patients were colored according to their average heart rate and to their range of heart rate measurements (defined as the difference between highest and lowest values). While the range correlated with TTD, the average value did not, suggesting the variation in the heart rate is more important than the average.

Fig. 3
figure 3

Visualization of the patient phenoscape. The latent phenotype is visualized in two dimensions using PHATE. In this plot each point represents a patient and the coloring is based on the value of different clinical variables. (A) All patients in the Yale cohort are plotted and colored according to TTD label (top-left), TTD (log-transformed, top-right), heart rate (middle), SpO2 (bottom-left), and GCS (bottom-right). Each point represents a patient. These plots uncover the continuous structure of the patients’ latent phenotype and highlight the correlation between different clinical variables and the time-to-death. The longitudinal variables were transformed into scalar variables using different transformations. (range) computes the average of the five highest observations minus the average of the five lowest observations in the clinical history of the patient. (mean) computes the average of the observations in the clinical history of the patient. (min) computes the average of the five lowest observations. Different transformations extract different patterns from the time series, enabling a finer interpretation of the dynamical patterns for a given clinical variable. For instance, we observed that heart rate (range) correlates with the TTD label, unlike heart rate (mean), suggesting the variation in heart rate is more important than the average value. The visualization of the whole cohort suggests two distincts groups of patients, characterized by high or low TTD. (B) Focus visualization of the identified cluster of patients with TTD<120 min. This uncovers a finer grained structure in this specific cohort of patients. We colored the patients by TTD (log-transformed, top-left), range of heart rate (middle-left), minimum SpO2 (middle-right), average GCS (bottom-left), and BMI (bottom-right). (C) We clustered the patients in the zoomed-in group of patients according to the similarity of their latent phenotype. We obtained three clusters: A, B, and C. Guided by the visualization of panel B, we examined the specific phenotype of patients from cluster A, which appear to have higher TTD than the rest of the patients. We show boxplots and corresponding independent t-tests p-values for difference of means between clusters, for various clinical variables. This analysis characterizes cluster A as a subgroup of patients with high TTD, high range of heart rate, low minimum SpO2, high GCS, as well as low BMI.

Our phenoscape visualization also showed a clear distinction between patients with TTD<120 minutes and another group with TTD>120 minutes (categories 0,1,2 vs. 3 in top-left of Fig. 3 panel A). This separation suggests an obvious clinical difference between patients with TTD<120 min and TTD>120 min. We performed a fine-grained analysis of the former cluster, to identify clinical drivers that make the difference between short-range (\(\approx\)30min) and medium-range (\(\approx\)60min) TTD. Panels B and C in Fig. 3 show a visualization of the identified group of patients with TTD<120min. From this zoomed-in analysis, we identified three clusters (A, B, and C). Guided by our visualization, we examined the specific phenotype of patients from cluster A, which appear to have higher TTD than rest of the patients. We found that patients from cluster A were characterized by higher TTD, higher range of heart rate, lower minimum SpO2, higher GCS, higher MAP and lower body mass index (BMI). This reveals that, over the period before TE, range of heart rate, minimum SpO2, average GCS, and the maximum MAP observed are all predictive of TTD.

Discussion

Augmenting the number of donations after circulatory death has been recognized as a crucial factor in mitigating the ongoing organ shortage, with the potential to increase the organ donor pool by as much as 30% in the United States34. However, a major factor hindering the rapid increase of DCD is the unpredictability regarding the time of circulatory death after extubation, leading to an unmanageable risk of prolonged warm ischemic injury. In the United Kingdom, it is estimated that 40% of donation teams mobilized for potential DCD donations are unsuccessful due to unpredictably long ischemic injury35. In the United States, only 59-72% of potential DCD donors die within the first hour after terminal extubation1,7. Similarly, in our cohort, only 73.8% of the patients died within the first hour.

This low success rate worsened by total unpredictability results in the waste of essential and valuable health-care resources and increased distress for families. Therefore, the average cost per DCD organ is estimated to be 63% higher than a DBD organ, mostly attributable to the unpredictability of DCDs from these “dry runs”36,37. The recent introduction of NMP and NRP, which have significantly improved the quality of organs from DCD by resuscitation of DCD organs prior to transplant; but that added expense has also contributed in making “dry runs” even more costly as the resources spent in mobilizing the NMP and NRP teams are still wasted on an unpredictable and thus failed DCD attempt38,39. Unsuccessful DCD donations also result in wasted human effort, and an avoidable environmental cost linked to the inherent logistics (air/ground transport) of a failed DCD attempt; and of an immeasurable psychological burden on grieving families hoping to make sense of their tragedy with the hope of a successful organ donation40.

These considerations highlight the importance and value of an accurate TTD prediction and have motivated the introduction of clinical scores, such as the UNOS criteria8 or the University of Wisconsin Donation after Circulatory Death evaluation tool (UW-DCD)41. However, these statistical methods show poor discrimination performance. For the UNOS criteria, PPV and NPV are reported to be 75.8% and 73%42. The UW-DCD, shows even worse performance (57.6% PPV and 61.8% NPV)42, and requires disconnecting the patient from ventilator for 10 minutes41. In contrast, our model achieved a PPV of 95.8%, a NPV of 93.9% and an accuracy of 95.4% for predicting whether the donor would die within the first hour on the external validation cohort, thereby only misclassifying 4.6% of the patients. Our model also does not require disconnecting the patient from ventilator, as seen in UW-DCD.

Predicting time to circulatory death has been previously attempted in the literature8,14,15,16,17,41. Nevertheless, previous studies predominantly have relied on conventional statistics and machine learning architectures such as logistic regression7 or long short term memory (LSTM)11,14. These models showed promising performance, but their simple architecture failed to fully capture the signal in data. Winter et al.16 proposed a model including only pediatric patients, with an ROC-AUC of 0.85, which is significantly lower than our model (ROC-AUC of \(98.7 \pm 0.3\) for ODE-RNN). It is noteworthy that pediatric patients (<18 years) only conforms to 5% of total deceased organ donations in United States, whereas our model is applicable for the 95% organ donors in U.S.43. Furthermore, the performance evaluation in these studies was often limited and without calibration, preventing a thorough assessment of the maturity of models for a potential clinical use.

Our study aims at addressing these shortcomings by leveraging the most recent advances in machine learning and providing the most robust clinical evaluation possible. The ODE-RNN architecture is specifically designed to handle specific challenges of clinical time series, such as irregular sampling or missing data, which enables capturing all the relevant information in patient’s data. To evaluate the models as closely as possible to a realistic clinical practice scenario, we used temporal splitting, removing bias induced by a change of clinical practices over time. Notably, such an evaluation strategy was absent from previous works on TTD prediction. We also used an external validation cohort to remove the bias linked to the clinical practices at different hospitals.

In clinical setting, it is important that predictive models give a notion of certainty regarding their predictions. Indeed, having access to a probability of death within a timeframe is crucial in balancing the expected benefits and costs of a planned organ donation. Remarkably, we found that our model showed excellent calibration, suggesting that the predicted probabilities could be directly interpreted at face-value. Furthermore, we note that our model jointly predicts probabilities for all four time-frames (<30 min, 30\(\sim\)60 min, 60\(\sim\)120 min, >120 min), which enables a fine-grained evaluation. This facilitates work of OPOs and transplant centers, who plan procurement of particular organs based on their warm ischemia time acceptance criteria for that particular donor as standard or extended criteria.

Our variable importance analysis was generally consistent with the UNOS criteria variables with respiratory rate, heart rate, SpO2, PEEP, and norepinephrine ranking high. However, only dopamine, a variable in the UNOS model, was found to be one of the least important variables in our model. We also found that MAP and GCS, although absent from the UNOS criteria, were very important variables for the predictive accuracy in our model. It is noteworthy that, although UNOS model excludes both MAP and GCS, they have been previously identified as important predictors of death after TE8,16.

Deep learning architecture like, ODE-RNN, bring added value with ability to handle irregular time series of arbitrary length and to provide a hidden state representation of the patient: a latent phenotype. Our experimental results showed, the ability to process the whole available time-series, and handle irregular sampling, resulted in better predictive performance. We further showed that our model enables a fine-grained analysis of the patient cohort by producing a latent phenotype for each patient, put together visualized as a phenoscape. The phenoscape identifies and separates the specific subgroups of patients with higher TTD and could potentially support clinical discovery essential to the prediction and identification for a DCD donor.

While the model developed in this study represents an important proof-of-concept and demonstrates compelling predictive performance, our study still suffers from some limitations. Notably, our patient cohorts were derived from various hospital records rather than directly from OPOs, meaning that some patients in our cohorts might have been ineligible for organ donation. Nevertheless, we are optimistic that the exceptional performance of our model will pave the way for future research involving a dedicated cohort of organ donors, ultimately enabling a definitive demonstration of the utility of machine learning models in improving the success rate of DCD.