Introduction

Clinical uncertainty hinders the practice of evidence-based medicine1. For any condition, transitions from high-evidence regions to low-information, high-uncertainty areas are abrupt2. Clinicians must synthesize diverse sources of information to limit this uncertainty and make effective decisions3. Determining which information to use and how to combine it to estimate probabilities for decisions is complex4 and relies heavily on individual judgment and experience, which runs counter to the principles of evidence-based practice5.

Additional factors complicate this challenge. First, clinical tests and heuristics generally have low positive predictive values and higher negative predictive values, necessitating multiple tests and extended clinical observation to reduce uncertainty6. In acute settings, this time delay is often weighed against the potential harms of empirical treatment and the risks of not providing it7. In more defined settings, such as preoperative risk assessment, the window to gather and synthesize information is limited, establishing a baseline uncertainty that current heuristics and evidence struggle to resolve8,9.

Electronic health records (EHRs) have evolved over the past decade, becoming increasingly detailed10 due to technological advancements and changing regulatory requirements11. However, information contained within EHRs cannot readily be applied to individual clinical decisions12. While information rich, distilling EHR information into useful evidence requires identifying exactly which variables to look at and estimating the conditional relationships between them over evidence that varies across individuals and is subject to measurement and other errors13. Adding to this complexity is the difficulty of choosing the right temporal window in which to observe these variables for making a phenotypic inference14.

Realizing the potential of the EHR to resolve clinical uncertainty requires novel computational approaches15. Graph structures, which consist of nodes connected by edges, are well-suited to representing phenotypic relational structures16. Temporal structures are captured in the form of edges between nodes (representing clinical events), and the flexibility of this structure seamlessly captures individual variation (e.g., a node representing a certain medication may appear only selectively according to who received it)17. The graph structure compactly renders the clinical trajectory into a form suitable for AI applications18. Deep geometric learning is a highly active and emerging branch of AI research which, though nascent, has already produced impactful results in the biomedical domain19,20,21,22,23. Broadly, deep geometric learning extends advancements in neural networks to the case of data that consists in relational structures between entities (geometric data). In our specific application, InfEHR, we use deep geometric learning to render EHR information into a clinical likelihood.

We minimize human involvement throughout the steps involved. This allows InfEHR to learn revealing temporal dynamics without human pre-specification or bias. Our method first automatically extracts temporal graphs from individual EHRs (EHR graphs). Each graph captures the complete trajectory of a patient’s clinical events and the relational structures among them. We summarize the entire clinical history contained in the EHR graph into a compact vector representation using self-supervised learning. Doing so creates a holistic view of the patient record that expresses the semantic similarity between clinical trajectories as a function of spatial distance between vector representations, which can be used to transmit knowledge to unlabeled cases from scant expert-provided labels as presumptive labels. We use these initial labels to refer to the EHR graphs and find individual graph components (e.g., a medication connected to a lab result) that are preferentially associated with an outcome, each providing a weak and uncertain likelihood estimate. We find hundreds of such informative components automatically and aggregate their weak predictions into refined machine-derived likelihoods. These likelihoods provide statistical information that InfEHR combines with deep geometric learning to get a final result. The initial EHR graph is transformed by coalescing and connecting individual clinical events into learned higher order concepts (a learned graph), and successive processing layers render the learned temporal graph into a likelihood according to a specialized training objective (see Fig. 1 for details).

Fig. 1: InfEHR is a framework for resolving clinical uncertainty using probabilistic and deep geometric learning on electronic health records (EHRs).

a Schematic of generative processes underlying EHRs. b C-reactive protein level dynamics in neonatal culture-negative sepsis. c Schematic of the InfEHR process showing the progression starting with unprocessed EHRs and ending with computed likelihoods.

Our framework performs according to the criteria important for clinical applications. InfEHR computes likelihoods specific to individual cases for any patient. It does not rely on the presence of specific information such as a particular lab result or vital measurement. It also removes the requirement for a clinician to know when a model can be validly applied, such as when, for example, a model assumes only certain pharmacological exposures and becomes unreliable otherwise, or is only valid in certain clinical settings. Implicit assumptions in the distribution of the training data may also invalidate a model. Even under valid conditions, discriminative models can still report high confidence but unreliable probabilities for uncertain cases. In contrast, the probabilities returned by InfEHR naturally reflect its certainty in the individual case, indicating to the clinician exactly when the EHR does not contain enough information to make an inference. We observe that accuracy scales with confidence for both positive and negative predictions in all experiment settings. However, there are still cases over which InfEHR makes high entropy predictions (relative equiprobability between classes). As discussed later, such predictions may also reveal important phenotypic information concerning the instant case and are therefore still valuable in clinical decision-making.

Results

In the following, we demonstrate how InfEHR, a deep geometric learning framework, can resolve clinical uncertainty in two diagnostically challenging scenarios: diagnosing neonatal culture-negative sepsis (CN-S) and postoperative acute kidney injury (PO-AKI) risk assessment. We first provide a quantitative view of the ambiguity in these conditions following the application of traditional clinical heuristics. We then demonstrate that InfEHR can automatically derive more powerful heuristics by using its graph-based representation of EHRs to create probabilistic labeling heuristics. We show that the InfEHR GNN component, using a generative approach, can subsequently revise initial probabilities to achieve demonstrable clinical value in terms of rule-in and rule-out potentials. Finally, we present external validation, including against established models for irregularly sampled clinical time series, confirming InfEHR’s capabilities in providing substantial and clinically meaningful probability revisions, particularly effective for the challenging task of ruling in low-prevalence conditions.

Uncertainty in diagnosing neonatal culture-negative sepsis

The onset of inflammatory symptoms consistent with sepsis prompts empirical treatment with antibiotics24. Following a positive blood culture result, the criteria for antibiotic cessation are established by individual clinical response and other prior knowledge concerning the infecting pathogen. However, timing antibiotic cessation in cases without positive results and ongoing symptoms poses a challenge. Blood cultures, the most specific tests, involve processing times with incubation periods of up to 72 h, and occasionally repeated samples to obtain reliable results25. Factors including maternal exposure to antibiotics and low sample volumes also contribute to clinical skepticism of blood culture results, adding to a period of diagnostic ambiguity during the wait for confirmatory results. In other cases, such as culture-negative sepsis (CN-S), confirmatory results will never come.

In these situations, a clinician must decide how to interpret the negative result; specifically, whether the patient has CN-S, a condition in which underlying septicemia requires continued antibiotics, or whether the patient is experiencing inflammatory symptoms without an infectious cause26. Failure to treat sepsis is an unacceptable risk; however, overuse of antibiotics is also associated with significant short- and long-term adverse outcomes27,28. Paradoxically, use of empiric antibiotics is also reported to increase risk of sepsis, necrotizing enterocolitis, or death (OR, 1.24; 95% CI, 1.17-1.31)21. The lack of a consensus definition of CN-S or of established clinical guidelines for its treatment complicates this situation and leads to highly idiosyncratic treatment patterns that are poorly supported by evidence29,30.

No known single biomarker or group of biomarkers specifically identifies CN-S, nor are there established criteria for pathogen identification based on clinical signs or measurements outside of a blood culture31. In neonates, rapid growth and development of organ systems also add variation to lab and vitals measurements, making it difficult to independently discern a disease process32, and many noninfectious conditions also lead to elevated inflammatory markers in neonates despite a lack of underlying infection33. Serial observations and measurements over time are needed to form more cohesive clinical pictures24. Meanwhile, the harms from inappropriate antibiotic exposure accrue. Exactly what evidence supports a CN-S diagnosis, and therefore continued antibiotics in the setting of a negative blood culture, is subject to ongoing debate26. In our sample of 8015 antibiotic courses (from 6596 individuals), an average of 72 (std. 32) unique labs excluding blood cultures are measured with a mean 225 (std 369) total lab measurements per course (mean course duration 4.35 days); however, no consensus exists on how to use this information to resolve diagnostic ambiguity.

Temporal considerations add to the lack of consensus and diagnostic uncertainty. Although several publications suggest the use of C-reactive protein levels to identify CN-S cases, they recommend measuring it at differing times from the start of antibiotics (from 3 days to 7 days at the earliest)34,35 as well as threshold level. As symptoms persist, diagnostic uncertainty increases as the interpretation of a negative culture result becomes increasingly ambiguous. To model this and show the feasibility of InfEHR to reduce clinical uncertainty in realistic settings, we extract EHR windows including 24 h of information prior to antibiotic administration and up to and excluding the penultimate dose. The resulting windows reflect the patient’s clinical condition prior to the decision point to cease antibiotics (see Fig. 2).
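The window definition above can be illustrated with a minimal Python sketch. The function name `extract_window`, the `(timestamp, event_name)` tuple layout, and the event names are assumptions made for illustration; they are not taken from the InfEHR implementation.

```python
from datetime import datetime, timedelta


def extract_window(events, dose_times, lookback_hours=24):
    """Keep events from `lookback_hours` before the first antibiotic dose
    up to, but excluding, the penultimate dose.

    events:     list of (timestamp, event_name) tuples (hypothetical layout)
    dose_times: list of antibiotic administration timestamps
    """
    doses = sorted(dose_times)
    if len(doses) < 2:
        raise ValueError("need at least two doses to locate the penultimate dose")
    start = doses[0] - timedelta(hours=lookback_hours)
    end = doses[-2]  # penultimate dose; the window end is exclusive
    return [event for event in events if start <= event[0] < end]


# Toy usage with made-up timestamps.
t0 = datetime(2020, 1, 1, 8, 0)
doses = [t0 + timedelta(hours=12 * i) for i in range(4)]
events = [
    (t0 - timedelta(hours=30), "lab_before_lookback"),
    (t0 - timedelta(hours=10), "lab_in_lookback"),
    (t0 + timedelta(hours=20), "vital_during_course"),
    (t0 + timedelta(hours=24), "penultimate_dose_event"),
]
window = extract_window(events, doses)
```

Making the end bound exclusive keeps the penultimate dose itself, and anything after it, out of the window, so the extracted record stops before the decision point to cease antibiotics.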

Fig. 2: Clinical uncertainty in the setting of neonatal culture-negative sepsis.

a Performance of clinical heuristic for identifying CN-S cases provided by a physician expert and corroborated by literature. The heuristic correctly identifies 97% of negative cases but fewer than 5% of positive cases. b Summary of clinical guidelines for using C-reactive protein for the diagnosis of culture-negative sepsis and maintenance of antibiotics. c Distribution of antibiotic course durations pending confirmatory culture results or symptom resolution. d Measurement rates of C-reactive protein in patients with and without neonatal culture-negative sepsis.

We compute the conditional probability of a diagnosis (over all labeled conditions, see Supplementary Materials) given only elapsed time to find that by day 7 the probability of a CN-S diagnosis rises to 0.19 from less than 0.03 at day 3, approaching parity with a Rule-Out Sepsis (ROS) diagnosis (Pr = 0.27) and reflecting high clinical uncertainty in the setting of ongoing symptoms with no positive culture result. The probability of an ROS diagnosis declines rapidly with time, whereas the probability of other diagnoses increases. We identify elapsed day 11 as the maximal window where all labeled conditions, including ROS, have nonzero probabilities. We therefore truncate all EHR windows to a maximum of 11 days to show that InfEHR can compute accurate probabilities when multiple possible sequences are present. The accuracy of InfEHR, as detailed later, is mostly independent of the length of the underlying EHR window (Spearman coefficient = −0.17, R-squared = 0.031, p < 0.05).
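One plausible reading of this computation is sketched below in Python: among antibiotic courses still ongoing at a given elapsed day, take the fraction that carry each diagnosis label. Conditioning on courses that reach the given day is an assumption, as are the function name and the toy records; the text does not spell out the estimator.

```python
from collections import Counter


def diagnosis_given_day(courses, day):
    """Estimate Pr(diagnosis | course has lasted at least `day` days).

    courses: list of (duration_days, diagnosis_label) tuples (hypothetical layout)
    Returns a dict mapping each label to its conditional probability.
    """
    active = [label for duration, label in courses if duration >= day]
    if not active:
        return {}
    counts = Counter(active)
    return {label: n / len(active) for label, n in counts.items()}


# Toy records, not study data.
toy_courses = [
    (2, "ROS"), (2, "ROS"), (3, "ROS"), (4, "ROS"),
    (7, "CN-S"), (8, "ROS"), (10, "other"), (11, "CN-S"),
]
by_day_3 = diagnosis_given_day(toy_courses, 3)
by_day_7 = diagnosis_given_day(toy_courses, 7)
```

In the toy data the share of short ROS courses drops out of the conditioning set as the day increases, which is the mechanism by which the probabilities of competing diagnoses can approach parity over time.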

Although no significant correlation was found between truncation and accuracy, truncated CN-S sequences were more likely to result in high entropy likelihoods (> 31% of truncated CN-S sequences had predicted probabilities between 0.4 and 0.6 compared to 9% full length), reflecting that InfEHR effectively reduces uncertainty on incomplete sequences, whereas more information from longer observations improves prediction confidence.

An important point must be made about outcome label fidelity. There were 3657 EHRs corresponding to unique antibiotic courses manually labeled by a physician subject-matter expert. Following the lack of consensus definition for CN-S, the SME labeled a case as positive if the treating physician maintained the antibiotic course based on an intent to treat for CN-S. Unfortunately, due to the lack of objective requirements for identifying CN-S, the possibility exists that some patients, despite being treated for CN-S, did not actually have CN-S. Although these cases cannot be identified directly, if they exist, we hypothesize that they may comprise a subset of the cases with high entropy probabilities.

Uncertainty in assessing postoperative acute kidney injury risk preoperatively

Whereas static variables such as age, sex, and surgery type have been independently linked to postoperative acute kidney injury (PO-AKI)36, these associations and their interactions seem to be mediated through individual factors that vary by clinical setting37, which limits the use of generalized scoring indices or models in determining risk38. Here we express the concept of risk as the patient’s likelihood of developing PO-AKI and accordingly compute the prognostic likelihood of developing PO-AKI. We also depart from other methods by considering only time-varying attributes of EHRs39. Our prognostic likelihoods are therefore dynamic in nature and can be modified by evolving EHR information. The likelihood corresponds to a patient at a given time, subject to change with modifications to risk that occur throughout the clinical trajectory, e.g., increasing drug administration (in contrast to static risk indices).

Surgery type contributes to PO-AKI uncertainty along two lines: the clinical setting, which determines the detail of monitoring and the perceived inherent risk of PO-AKI, and the medical background, including age, of the surgical patient36. Cardiac patients are routinely assessed for such risk owing to the high incidence of PO-AKI for these patients (40% of all cases)40. Age has been identified as an independent risk factor in this setting; however, some cardiac surgeries seemingly carry greater risk than others, and it is not entirely clear how risks stemming from age relate to this41. Nonetheless, numerous models exist for computing risk scores in the cardiac setting (with variable success rates)42. Noncardiac surgery by comparison includes a more medically diverse group of individuals undergoing various types of surgeries with attendant differences in surgery lengths and other surgery-specific risk factors37. The SPARK index, acknowledging this intrinsic uncertainty, attempts to create a generalized clinical risk index for this population39. The index considers pharmacological factors such as RAAS blockade and physiological factors such as diabetes mellitus status and anemia39,43. The performance of SPARK has been shown to degrade in populations whose characteristics differ from those of the training cohort despite good performance in training and validated discovery cohorts38. Although one study documents this performance degradation, other unidentified cases may exist, and it is not obvious to a clinician whether an instant patient is well covered by SPARK without a detailed analysis of how well the patient aligns with the training distributions, which is clinically infeasible44,45. This underscores a general concern with clinical models, including indices, in that it is not always clear how well they fit an individual patient despite ostensibly satisfactory population-level performance44.

The general attempt to more precisely determine individual risk has been to produce increasingly specific models46. But an examination of the full range of all possible lab measurements for patients within the dataset reveals the limitations of this approach. Some lab measurements cover no positive cases, whereas others are measured in clinical settings where PO-AKI risk is already high (i.e., most measured cases were PO-AKI positive), and so it is unclear what additional information a lab measurement may provide. In settings where the pre-measurement probabilities were less biased, the number of individuals with the measurement can be low47,48, an example of informed missingness leading to bias by indication49. Of the labs without frequency preference for PO-AKI positive or negative patients, no lab has an overall measurement rate above 32%, and some labs, such as prothrombin mutation, have measurement rates below 1%. Our dataset also consists of a medically diverse population spanning diverse age ranges (min = 18, max = 92) undergoing surgeries over every physiological system (excluding reproductive systems)36. This comprises a group of individuals with collectively high clinical uncertainty that is challenging to resolve using a single model despite established risk assessments for individual subsets44.

In our dataset, 38.7% had presurgical eGFRs below 60; of those, only 37.1% developed PO-AKI, showing that substantial uncertainty remains even after establishing baseline estimations of kidney function. InfEHR can include information even from labs with low representation, along with other factors facing similar measurement-related constraints, such as vital information or medications provided, by finding commonalities across settings and holistically integrating individual information to resolve uncertainty. We assigned positive labels to all patients with AKIN scores > 1 at 72 h post-surgery and negative labels otherwise to design a trainable task. However, we report the likelihood of developing PO-AKI using only preoperative data. For a subset of positively labeled patients with low likelihoods returned by InfEHR, it is possible but unknown whether the factors involved in developing PO-AKI occurred primarily during surgery or immediately following the operation. In this case, the model output would be correct, whereas we currently deem these cases failures in recall. Similarly, where InfEHR was uncertain, it is possible that there was some risk enhancement but that unpredictable events in surgery ultimately led to AKI. Unfortunately, these patients cannot be discerned using preoperative data, a limitation of predicting outcomes preoperatively, where in-surgery events may modify the outcome.

Performance of clinical heuristics under conditions of uncertainty

Diagnostic complexity and the constraints of clinical practice limit the range of information a practitioner can use in medical decision-making50. Heuristics, or decisional shortcuts, are an adaptive strategy to efficiently manage uncertainty under time constraints51. Heuristics integrate pattern recognition based on experience with pathophysiological reasoning to identify the most salient aspects of the instant case52. These cognitive operations resemble clinical tests in that they also have implicit pre- and post-test probabilities (typically unknown to the practitioner)4.

Here, we obtain such heuristics from practicing physicians and apply them over the empirical distribution of cases to quantify the implicit uncertainty involved in using them. We corroborated the physician-provided heuristics with literature to ensure that they are representative of general practice and not limited to institutionally specific patterns of care. We apply the heuristics to the EHR windows as described above to compute estimates at clinically relevant points of care and examine their probabilistic outputs over the empirical distributions of cases in each dataset.

To quantify the performance of the heuristic, we compute the rule-in and rule-out potentials53. Rule-in and rule-out potentials are prevalence-independent metrics based on the characteristics of probability distributions produced by a clinical test. They quantify a diagnostic test’s innate capacity to revise disease probability before testing occurs. Unlike likelihood ratios, which characterize performance only after a specific test result is known, these potentials predict how effectively a test or heuristic will revise disease probability for an average subject prior to performing the test. A rule-in potential of 2.0 indicates patients with disease are, on average, twice as likely to be correctly identified after testing, while a rule-out potential of 2.0 means patients without disease are twice as likely to be correctly excluded. Together the rule-in and rule-out potentials express the information gain provided by a test relative to the two poles of clinical decision-making54. Rule-in and rule-out potentials capture a test’s inherent discriminative power rather than its observed performance on a particular dataset alone (including prevalence differences). We additionally report sensitivity and specificity, which are also prevalence independent55 (see Tables 1–3). Although these metrics do not capture probability revision capacity or information gain, they do measure performance within already-known groups (with disease/without disease), providing an additional view on a test’s innate capacity to confirm disease presence. Although we emphasize these dataset-independent measures, we also plot the cumulative distribution functions by disease status (see Figs. 2 and 3) to illustrate the performance distribution across our specific datasets.

Fig. 3: Clinical uncertainty in the setting of predicting postoperative acute kidney injury (PO-AKI) preoperatively.

a Performance of a physician-provided clinical heuristic using serum creatinine, which is widely measured (including for all patients in this dataset). The heuristic identifies >97% of negative cases but <19% of positives. b Distributions of patient age according to surgery type in the Mount Sinai Health System dataset. c Ratio of laboratory measurements according to patient disease status.

Table 1 InfEHR Performance: CN-S (MSHS)
Table 2 InfEHR Performance: PO-AKI (MSHS)
Table 3 PO-AKI Prediction Performance (UCIMC)

CN-S heuristic

Single-point assessments of clinical signs have little value in diagnosing CN-S given that multiple sources of variation in the perinatal environment underlie the observed signs56. No uniform guidelines exist for serial measurements or their interpretation24. Biomarkers, including inflammatory biomarkers, generally have low sensitivity and specificity for CN-S, but some literature supports some positive predictive value for C-reactive protein (CRP)34. Accordingly, we use a clinician-provided heuristic using the minimum, maximum, and average CRP concentration within the prediction window. CRP levels have been shown to vary according to gestational age; however, including this information did not improve heuristic performance35. CRP is not uniformly measured in all CN-S cases (18% of CN-S cases had no CRP measure)57. The rule-in and rule-out potentials indicate that the CRP-based heuristic did not reduce uncertainty (rule-in potential = 1.13, rule-out potential = 1.01). It is possible that specific measurement timings or other parameters of CRP measurement may enhance the clinical value of CRP in diagnosing CN-S; however, there are no consensus guidelines for these parameters. This ambiguity contributes to the low diagnostic potential and an NPV of 0.85. The empirical cumulative distribution functions (ECDFs) indicate some separation between the classes, suggesting that CRP potentially holds more diagnostic information than our heuristic and that which we estimate a clinician, on average, could extract.

PO-AKI heuristic

Serum creatinine is commonly measured across various preoperative settings (100% measurement rate for patients in the dataset) and is also used in validated predictive models or indices for PO-AKI in cardiac and noncardiac surgery37,42. We are unaware of any clinical signs validated for use in PO-AKI prediction and therefore did not consider any for heuristic use. Of the remaining 100 labs common to all patients in the dataset (less than 21% of all labs measured), none were recognized by a physician as having added prognostic value. We include length of stay measured in fractional days along with serum creatinine, since this enhanced heuristic performance by serving as a nonspecific index of case severity58. Mean serum creatinine, last preoperative serum creatinine, and length of preoperative stay were used to predict PO-AKI (AKIN > 1)36. The returned probability can be considered as a risk index59. This heuristic has a rule-in potential of 1.32 and a rule-out potential of 1.16. The heuristic reduces some uncertainty, especially with some positive cases, but a subset of positive cases remains that the heuristic was unable to identify (NPV = 0.82, and as seen in the ECDFs, Fig. 3).

The heuristic performances illustrate recurrent challenges in identifying positive cases (PPV = 0.69, FNR = 0.79). Although additional information is apparently required to resolve uncertain likelihoods, exactly which additional information and how it can be integrated is unclear. It is also unclear to a clinician when a heuristic may not apply to a given patient even if overall performance is otherwise known. In our examples, false negatives can be explained not only by a general lack of discriminatory capacity of the heuristic (CN-S) but also by poor fidelity of an otherwise performant heuristic to the individual case (PO-AKI). As explained in more detail next, InfEHR both automatically creates heuristics built on the particularities of individual cases and integrates the array of generated heuristics into a single predicted probability distribution that can later be further resolved to enhance rule-in and rule-out potentials.

Automatic generation of heuristics from EHR graphs

An automatic process was used to transform individual EHRs into temporal graphs where clinical events are identified and represented as nodes and connected to each other by edges according to temporal relationships in the record (detailed below). We use the graph structure to generate automatic heuristics in a two-step process. First, a graph neural network (GNN), trained on an unsupervised objective, encodes the EHR graph into a compact numerical representation. The representation is designed to capture distinguishing features of an EHR graph (such as clinical event composition or the existence of certain temporal dynamics) and assign them to a coordinate in a high-dimensional semantic space (the representation). The resulting coordinate space semantically aligns graphs (generated from patients) according to similarities in distinguishing features. By taking the sample of expert-provided labels (110 total) and fitting a distance-based label propagation algorithm, we can presumptively label cases by a learned spatial distance metric60. This provides an initial set of labels for all the records according to a structural and spatial view of similarity among individuals as expressed through their encoded graph representations. We further evaluate the self-supervised representations quantitatively by analyzing the nearest neighbor of each representation in terms of the ground truth label, expressed as mean average precision. The score, ranging between 0 and 1, reflects the fraction of times the nearest neighbor for a given representation shares the same ground truth label (MAP@1 CN-S: 0.79, MAP@1 PO-AKI: 0.71).
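The two operations on the representation space, propagating a few seed labels by distance and scoring nearest-neighbor label agreement, can be sketched in Python as follows. This sketch substitutes plain Euclidean 1-nearest-neighbor assignment for the learned distance metric and the label propagation algorithm cited above, so it illustrates the idea rather than the method used. The function names and the two-dimensional toy embeddings are assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def propagate_labels(embeddings, seed_labels):
    """Give every embedding the label of its closest labeled seed.

    embeddings:  list of coordinate tuples, one per EHR graph
    seed_labels: dict mapping embedding index -> expert-provided label
    """
    labels = []
    for i, emb in enumerate(embeddings):
        if i in seed_labels:
            labels.append(seed_labels[i])
            continue
        nearest = min(seed_labels, key=lambda j: euclidean(emb, embeddings[j]))
        labels.append(seed_labels[nearest])
    return labels


def nearest_neighbor_agreement(embeddings, labels):
    """Fraction of embeddings whose nearest other embedding shares their label."""
    hits = 0
    for i, emb in enumerate(embeddings):
        others = (k for k in range(len(embeddings)) if k != i)
        j = min(others, key=lambda k: euclidean(emb, embeddings[k]))
        hits += int(labels[i] == labels[j])
    return hits / len(embeddings)


# Toy embeddings forming two well-separated clusters, with one seed label each.
toy_embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (4.9, 5.2)]
toy_seeds = {0: 0, 3: 1}
presumptive = propagate_labels(toy_embeddings, toy_seeds)
agreement = nearest_neighbor_agreement(toy_embeddings, presumptive)
```

The agreement score plays the same role as the nearest-neighbor precision reported above: it is high only when graphs with the same label sit close together in the representation space.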

We use the presumptive labels from the spatial representations to derive new labeling heuristics. We randomly select positive and negative cases from the 110-member sample. Next, we identify highly connected nodes within the EHR graphs corresponding to these selected cases and extract the 1-hop neighborhood of such nodes. Given that EHR graphs consist of observed relationships between clinical entities (expressed as edges between nodes), the resulting extracted subgraph then contains potentially identifying clinical relationships. If the relational structure preferentially distributes in EHR graphs from a given label, we use the relationship to heuristically assign labels. Therefore, elements of the graph structure itself are used as weak labeling heuristics.

To identify these structures, we use the labels obtained by spatial propagation over the high-dimensional representations of the EHR graphs (as described above) to assign labels to individual nodes. We assign the node label according to its class association. We can identify relational structures from the subgraph using these labels according to a simple procedure: when two or more connected nodes share the same label, we treat the substructure as a labeling heuristic that assigns the common node label to any graph where it is present. To improve the accuracy of these weak heuristics, we check if any of the connected nodes are also connected to nodes of opposite label. In this case, we remove any individuals identified by both the same label and opposite label substructures from labeling by the heuristic, limiting the application to cases without known contradictory information.
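A minimal Python sketch of this procedure is given below, under several simplifying assumptions: graphs are undirected adjacency dictionaries, a substructure is a single edge between two same-label nodes in the 1-hop neighborhood of a hub, and a heuristic abstains on any patient graph that also contains an edge from either endpoint to an opposite-label node. The function names and node names (`lab_X_high`, `med_A`, and so on) are hypothetical.

```python
def make_graph(edges):
    """Build an undirected adjacency dict from (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def hub_nodes(graph, top_k=1):
    """Return the top_k most connected nodes (ties broken by name)."""
    return sorted(graph, key=lambda n: (-len(graph[n]), n))[:top_k]


def edges_of(graph):
    return {frozenset((u, v)) for u in graph for v in graph[u]}


def derive_heuristics(graph, node_label, top_k=1):
    """Scan the 1-hop neighborhood of each hub for same-label edges.

    Returns a list of (edge, label, conflicting_edges) triples.
    """
    heuristics = []
    for hub in hub_nodes(graph, top_k):
        label = node_label.get(hub)
        if label is None:
            continue
        for nbr in sorted(graph[hub]):
            if node_label.get(nbr) != label:
                continue
            # Edges from either endpoint to a node of the opposite label.
            conflicts = {
                frozenset((u, v))
                for u in (hub, nbr)
                for v in graph[u]
                if node_label.get(v) not in (None, label)
            }
            heuristics.append((frozenset((hub, nbr)), label, conflicts))
    return heuristics


def apply_heuristic(heuristic, patient_graph):
    """Return the heuristic's label, or None to abstain."""
    edge, label, conflicts = heuristic
    present = edges_of(patient_graph)
    if edge not in present or conflicts & present:
        return None
    return label


reference = make_graph([
    ("lab_X_high", "med_A"),
    ("lab_X_high", "lab_Y_normal"),
    ("lab_X_high", "event_Z"),
])
node_label = {"lab_X_high": 1, "med_A": 1, "lab_Y_normal": 0}
heuristics = derive_heuristics(reference, node_label)
```

Each resulting heuristic is weak by construction: it fires only on graphs that contain its substructure and abstains everywhere else, so many such heuristics must be combined to cover a cohort.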

We generate 2400 negative and 396 positive labeling heuristics for CN-S and 4307 negative and 982 positive labeling heuristics for PO-AKI. We observed that many such automatically generated heuristics are corroborated by findings in the literature (see Supplementary Materials for specific examples). Despite filtering for contradictions within heuristics (e.g., if an individual is predicted to be both positive and negative), there remain substantial contradictions between heuristics. However, the heuristics can still be effectively combined following Ratner et al.61,62 to produce performant initial probabilistic estimates. These probabilistic labels represent coalesced information from whole EHR-to-EHR comparisons (i.e., labels derived from the self-supervised embeddings of EHR graphs) and specific conditional relational structures (the weak labeling heuristics). To improve the information in the resulting probability distribution, we obtain a final distribution by randomly subsampling and combining 100 positive and 100 negative labeling heuristics. We repeat this process until there is only a minimal change in the average entropy of the resulting averaged likelihood distributions.
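The subsample-and-average loop can be sketched in Python as follows. The combination step here is a plain vote fraction, which stands in for the label model of Ratner et al. and is not equivalent to it; the stopping tolerance, function names, and data layout (each heuristic as the set of case indices it fires on) are also assumptions.

```python
import math
import random


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) distribution."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def vote_probability(pos_votes, neg_votes):
    """Fraction of positive votes; 0.5 when no heuristic fires."""
    total = pos_votes + neg_votes
    return 0.5 if total == 0 else pos_votes / total


def aggregate(pos_heuristics, neg_heuristics, n_cases, k=100,
              tol=1e-3, max_rounds=200, seed=0):
    """Average vote-based probabilities over repeated random subsamples.

    Stops when the mean entropy of the running average changes by < tol.
    Returns (probabilities, rounds_used).
    """
    rng = random.Random(seed)
    running = [0.0] * n_cases
    prev_entropy = None
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        pos = rng.sample(pos_heuristics, min(k, len(pos_heuristics)))
        neg = rng.sample(neg_heuristics, min(k, len(neg_heuristics)))
        for i in range(n_cases):
            p = vote_probability(sum(i in h for h in pos),
                                 sum(i in h for h in neg))
            running[i] += (p - running[i]) / rounds  # incremental mean
        entropy = sum(binary_entropy(p) for p in running) / n_cases
        if prev_entropy is not None and abs(entropy - prev_entropy) < tol:
            break
        prev_entropy = entropy
    return running, rounds


# Toy heuristics over five cases; each set lists the cases a heuristic labels.
toy_pos = [{0, 1}, {0}, {0, 2}]
toy_neg = [{3, 4}, {4}, {2, 4}]
toy_probs, toy_rounds = aggregate(toy_pos, toy_neg, n_cases=5, k=2)
```

Averaging over many random subsamples keeps cases that receive conflicting or sparse votes near the middle of the probability range instead of forcing them to one class, which preserves the uncertainty that the later GNN stage is meant to resolve.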

The resulting distributions are more performant in identifying positive cases than clinician-provided heuristics in both disease settings (PO-AKI: FNR 0.48 vs. 0.79 clinician heuristic, CN-S: FNR 0.67 vs. 0.95 clinician heuristic). The machine-generated heuristics also retain uncertainty in their distributions (ECDF Pr(0.2 − 0.8) CN-S: 0.51, PO-AKI: 0.43). This allows the GNN component of InfEHR to learn uncertainty-reducing temporal dynamics, in part by identifying cases with ambiguous label dispositions (high-entropy label probabilities) and using features derived from low-entropy cases to resolve them. This process would be significantly impaired by overly confident input distributions with low positive case recall (particularly in the setting of rare or infrequent disease). The distribution is used as prior information in the loss function of the GNN, which ultimately reduces individual entropy, as explained below and in Fig. 4.

Fig. 4: Initial uncertainty resolution through automatic probabilistic labeling.

a Schematic of deriving probabilistic labels from electronic health records (EHRs) using InfEHR. b Results of applying the steps outlined in a (in gray). We repeat this process and average the results to obtain initial individual probabilities (in blue and red).

Deep Geometric Learning Resolves Prior Uncertainty

The machine-generated likelihoods derive from clinical relationships encoded into the structure of EHR graphs, expressed here as a connection between two or more nodes each representing a clinical event. We expand the representational capacity of an EHR graph by including (1) semantic encodings for each clinical event as attributes (assigned to each node in the EHR graph) and (2) time stamp encodings which can be combined with the semantic encoding in (1) to create attributes reflecting individual temporal contexts. These attributes build on the existing temporal structure, encoded by the collected nodes and edges of an EHR graph, into a format suitable for learning phenotypic temporal dynamics at scale. The semantic encodings in (1) automatically reproduce clinical knowledge in terms of spatial distance, e.g., the nearest neighbors of an encoding for a certain respiratory rate include representations for pulse oximetry. We also observe less established but nonetheless corroborated clinical relationships such as the appearance of encodings for eosinophil measurements within the near neighborhood of ejection fractions (additional information provided in Methods). The detailed and individualized consideration of the semantic and temporal information added to the EHR graphs through these encodings is executed by training a GNN to compute probabilities from individual EHR graphs using the previous machine-generated estimates as priors to be resolved.

We train the InfEHR GNN (see Fig. 5 for architectural details) under a specialized loss function similar to63, which implicitly models the likelihood

$$P\left({EHR} | {Condition}\right)$$
(1)

following a generative modeling framework. This approach ultimately yields predicted probabilities over conditions given EHR data in the form of a posterior distribution, mirroring in mathematical terms the colloquial use of the phrase “likelihood of disease.” While the InfEHR GNN outputs what is formally a discriminative posterior

$$P\left({Condition} | {EHR}\right)$$
(2)

our training procedure minimizes the Kullback-Leibler divergence between the model output and a theoretically motivated generative posterior:

$${P}_{{generative}}\left({Condition} | {EHR}\right)\propto P\left({EHR} | {Condition}\right)\cdot P\left({Condition}\right)$$
(3)
Fig. 5: Resolution of clinical uncertainty with InfEHR.

a Schematic showing key architectural details of the InfEHR Graph Neural Network (GNN) and information flow through the architecture. b InfEHR’s performance on false negatives predicted by the clinical heuristic (top) and on training and internal validation in neonatal culture-negative sepsis. c Density distribution of predicted probabilities from the InfEHR GNN on patients with and without disease.

This infuses our predicted probabilities with the uncertainty-resolving benefits of the generative approach without requiring explicit generative modeling.
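The relationship between Eqs. (1)–(3) and the training objective can be made concrete with a small numerical sketch. It is illustrative only: InfEHR models the likelihood implicitly, whereas this sketch takes explicit log-likelihoods as input, and the direction of the divergence (target relative to prediction) is an assumption because the text does not specify it. The function names are hypothetical.

```python
import numpy as np


def generative_posterior(log_lik, prior):
    """Eq. (3): P(c | EHR) proportional to P(EHR | c) * P(c), normalized over c.

    log_lik: (batch, n_conditions) log P(EHR | c); prior: (n_conditions,) P(c).
    """
    log_joint = log_lik + np.log(prior)
    log_joint = log_joint - log_joint.max(axis=-1, keepdims=True)  # stability
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=-1, keepdims=True)


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), averaged over the batch (assumed direction)."""
    t = np.clip(target, eps, 1.0)
    q = np.clip(predicted, eps, 1.0)
    return float(np.mean(np.sum(t * (np.log(t) - np.log(q)), axis=-1)))
```

Minimizing this divergence pulls the network's output in Eq. (2) toward the prevalence-weighted posterior in Eq. (3), so a low prior on a condition tempers the output unless the likelihood term strongly favors it.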

During training the GNN simultaneously learns graph features and the likelihood of the graph features under assumption of a given disease or physiological state. Through the learning process, the GNN transforms the naive input EHR graph into a smaller subgraph. The nodes from the original graph structure are reconfigured and coalesced into new nodes through a learned assignment matrix. The result is that nodes in the subgraph represent compound clinical events derived from the original set of individual events. The new nodes are connected by weighted edges that summarize the connectivity of the original naive EHR graph in the learned subgraph. The resulting subgraph undergoes additional processing to determine the likelihood of its graph features under the clinical premise. A key aspect of this process is that the graph distillation and feature assessment are learned simultaneously. This allows the GNN to learn, without human interference or bias, what subsets and subsequences of clinical events are most relevant to the problem setting (the graph distillation) and how relational structures between them (processing the learned subgraph) support or contradict a clinical premise.
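The coalescing of nodes through an assignment matrix can be sketched in a few lines. The text does not name the pooling operator, so the soft-assignment form used here (pooled attributes as S^T X, pooled adjacency as S^T A S, in the style of differentiable pooling) is an assumption, as are the function names. In a trained model the assignment logits would be produced by a learned layer rather than supplied directly.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def pool_graph(adj, feats, assign_logits):
    """Coarsen a graph with a soft assignment matrix S (n_nodes x n_clusters).

    adj: (n, n) adjacency; feats: (n, d) node attributes;
    assign_logits: (n, k) unnormalized cluster scores per node.
    """
    s = softmax(assign_logits, axis=1)  # each row: a node's membership over clusters
    pooled_feats = s.T @ feats          # attributes of the compound events
    pooled_adj = s.T @ adj @ s          # weighted edges between compound events
    return pooled_adj, pooled_feats, s
```

Because each row of S sums to one, the total edge weight of the original graph is preserved in the pooled adjacency; it is only redistributed among the compound nodes.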

Although the parameters required for the graph distillation and likelihood computation are learned batchwise, the learned GNN parameters are applied to individual EHR graphs. They use the specifics of each case, about which we make no assumptions or requirements regarding shared clinical trajectories or backgrounds among patients, to compute highly accurate predicted probabilities. The resulting predictions markedly outperform the clinician-provided heuristics with superior rule-in and rule-out potentials in both the PO-AKI and CN-S task settings. The resulting rule-in potentials for CN-S of 16.018 (Train) and 12.49 (Val) compared to 1.013 (clinician-provided) show that the GNN is respectively about 16 and 12 times more likely than the clinician-provided heuristic to identify the average positive case. This finding is also reflected in the PPV and NPV, which are respectively about 2.6 (Train), 2.9 (Val.), and 1.16 (Train), 1.13 (Val.) times greater than the clinical heuristic (see Tables 1 and 2). This partly represents the extent of resolvable uncertainty within the clinical setting and the benefit of additional EHR information, which is otherwise not readily accessible to clinicians. The GNN produces comparable results in the setting of PO-AKI risk assessment, with 2.5 times greater rule-in potentials and 2.13 times greater rule-out potentials. The GNN also reduces the FNR, as reflected by a 1.12-fold increase in NPV. This does come at the small expense of slightly higher false positivity; the GNN PPV is about 2% lower than the clinical heuristic while making more true positive predictions overall (identifying 71% of all positive cases compared to only 19% by the clinical heuristic). Figure 5 provides additional information.

Validation of InfEHR on an external dataset

We compare InfEHR’s performance against two established models for irregularly sampled clinical time series: SeFT64 and GRU-D65. SeFT employs differentiable set function learning, treating EHR variables as an unordered collection of entities. This approach parallels InfEHR’s graph structure, where nodes similarly form a set of clinical entities.

GRU-D consists of a gated recurrent network that operates on temporally aligned sequences, computing hidden states sequentially at fixed interval lengths from regularly sampled time points. Whereas SeFT does not directly use sequence information, InfEHR explicitly encodes temporal relationships within both node embeddings and edge structures. However, in contrast to GRU-D’s fixed-interval processing, InfEHR captures temporal dynamics without an explicit and predetermined time discretization.

These models were then evaluated on their ability to predict PO-AKI using EHRs contained in the Medical Informatics Operating Room Vitals and Events Repository (MOVER, public credentialed access)66. This dataset includes a medically diverse population of patients undergoing surgical procedures at the University of California, Irvine Medical Center (UCIMC). We obtained n = 2427 patients for whom we could both assign PO-AKI status (AKIN scoring system) and apply the clinical heuristic; among these patients, n = 261 positive cases were identified (~10% prevalence in UCIMC dataset versus n = 879 positive cases, or ~21%, in the Mount Sinai Health System (MSHS) dataset).

To evaluate the InfEHR framework on this dataset: (1) We constructed EHR graphs as previously described for all UCIMC patients. We transferred knowledge from MSHS by aligning UCIMC EHRs with those from MSHS and extracting clinical events as learned from MSHS data forming the nodes and setting as their attributes the embeddings obtained from the previously learned manifold. (2) The UCIMC dataset did not contain clinical notes. To generate prior probabilities, we processed all MSHS graphs and ablated any nodes originating from note sources and then retrained the GNN and obtained prior probabilities for each EHR graph in the UCIMC dataset (the input distribution). Although GRU-D and SeFT do not extract events from the EHR, we supplied all parent variables (e.g., systolic blood pressure) so that the datasets for SeFT and GRU-D contained the same variables. We used the same input probability distribution for training all models.

Deep neural networks are known to produce idiosyncratic and miscalibrated probability distributions25. Even though these probabilities are internally coherent (e.g., a given model may systematically overestimate probabilities but maintain correct internal ranking of instances, resulting in a performant AUC), the raw probabilities are not directly comparable across models. Following recommendations and guidelines in refs. 27 and 50 to facilitate model comparison, we applied calibration-in-the-large to map uncalibrated model probabilities to a common semantic standard without changing the model’s inherent discriminative properties prior to computing metrics.
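As an illustration of why calibration-in-the-large leaves discrimination untouched, the sketch below shifts every logit by a single constant until the mean predicted probability matches a target. This is a generic rendering of the technique and not the exact procedure used in the study; the function name, the bisection search, and the choice of `target_mean` (for example, cohort prevalence) are assumptions.

```python
import numpy as np


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def calibrate_in_the_large(probs, target_mean, eps=1e-6, iters=100):
    """Add one constant to all logits so the mean probability equals target_mean."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    logits = np.log(p / (1 - p))
    lo, hi = -30.0, 30.0
    for _ in range(iters):  # bisection: the mean is monotone in the shift
        mid = (lo + hi) / 2
        if _sigmoid(logits + mid).mean() < target_mean:
            lo = mid
        else:
            hi = mid
    return _sigmoid(logits + (lo + hi) / 2)
```

Because the same shift is applied to every case and the sigmoid is strictly increasing, the ranking of cases, and therefore the AUC, is unchanged; only the overall level of the probabilities moves.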

We computed rule-in and rule-out potentials at n = 100 thresholds and report results: (1) at a common probability threshold of 0.5 to enable fair comparison of the models’ discriminatory capacity, and (2) at model-specific optimal thresholds to demonstrate the maximum achievable performance when tuned for specific diagnostic goals (such as ruling in (RI) or ruling out (RO) a condition).

InfEHR consistently outperformed GRU-D and SeFT for rule-in applications, achieving 7.105 (vs. SeFT: 3.415, GRU-D: 4.745) at the 0.5 threshold and an optimal RI potential of 8.322 (vs. SeFT: 7.67, GRU-D: 6.706). For rule-out applications, InfEHR also outperformed both models at the 0.5 threshold (InfEHR: 2.700 vs. SeFT: 1.778, GRU-D: 2.269). However, when optimized for rule-out potential, GRU-D and SeFT achieved marginally higher performance (GRU-D: 3.964, SeFT: 3.935) than InfEHR (3.566). These results indicate that InfEHR is capable of significant probability revision, particularly for ruling in a condition, a task that is both central and challenging under low prevalence (see Fig. 6 and Table 3 for additional evaluation).

Fig. 6: InfEHR revises priors obtained from different domains to local distributions without human intervention.

a Performance of InfEHR in positive case identification relative to the clinical heuristic (in gray). The previously trained GNN is applied to the new and unlabeled cohort to obtain PO-AKI likelihoods (in black). InfEHR GNN further revises these initial probabilities (in red). b Information theoretic measures of predicted probability distributions from the clinical heuristic (gray), InfEHR prior GNN (black), and InfEHR revision GNN (red).

Discussion

Low disease incidence intrinsically limits the diagnostic power of common individual tests67. Similarly, underlying risk accumulation may proceed through highly conditional structures that can make individual risk difficult to assess since individual variation within those structures may lead to correspondingly divergent risks68. These factors, along with limitations in time and existing knowledge, promote uncertainty in clinical decisions2. Discriminative models, even under optimal configurations of model architecture and type, are poorly situated to resolve these uncertainties. Discriminative modeling consists of trying to predict a label given a data instance. Models trained under this framework learn to predict a label or class membership through incurring a penalty for incorrect predictions. As explained below, this approach is conceptually mismatched to the realities of EHR data and clinical medicine, especially in the setting of low-prevalence diseases, which are typically also the most clinically uncertain and where input from models is wanted most.

Discriminative models in medicine can fail when handling rare conditions due to dual challenges: sparsity in both the number of positive cases and their distribution in feature space69. With few positive examples, the model struggles to learn the true variety of ways a condition can manifest. It may see only a narrow subset of possible presentations and optimizes for learning a decision boundary primarily on abundant negative cases. This limited exposure makes it difficult for the model to develop robust feature representations that capture the full spectrum of disease manifestations, and the decision boundary itself becomes problematic because it is primarily shaped by the dense regions of negative cases, with only rare positive cases to counterbalance this influence. As a result, the model may make highly confident but incorrect predictions.

Cases that fall far from the learned feature space of the training data but definitively on one side of a decision boundary represent a fundamental flaw in discriminative models70. The model will assign high confidence simply based on the case’s position relative to the boundary, even with little evidence to support such certainty in sparse regions of the feature space. (Here, the model has seen few or no other cases with similar learned features.) Sparsity in the feature space may result from limited training examples (common in rare conditions such as CN-S) or in combination with the model’s poor inductive capacity to learn useful features (e.g., highly conditional networks underlying PO-AKI risk). This scenario creates a dangerous situation: a clinician reviewing the model’s output would have no way to know that the high confidence prediction comes from a region where the model has minimal experience.

Post-hoc calibration cannot solve this problem, because it only adjusts probability outputs without addressing the underlying issue: the model’s restricted understanding of how diseases manifest due to the nature of discriminative training71,72. Consider a rare disease that presents in several distinct ways. If the training data captured only one type of presentation, the discriminative model builds its decision boundary around that presentation. When encountering a different but equally valid presentation of the same disease, the model might confidently misidentify it simply because the case falls on the “wrong” side of the boundary based on the learned features. Calibration techniques cannot fix this fundamental gap in the model’s knowledge because they work only with the feature space (the way a case is ultimately represented within the model) that the model has already learned. The calibration process cannot teach the model about alternative disease presentations it has never seen or lead to output probabilities that reflect when a model is operating with epistemic uncertainty, both of which severely limit the applicability of probabilities obtained from discriminative models in clinical settings, even when calibrated.

In contrast, the generative approach explicitly models how different disease processes could generate various presentations in the EHR and simultaneously weighs the consistency of a given case with the presumption of a given disease. Generative models incorporate disease prevalence as a core component of this reasoning: rare conditions require stronger evidence to overcome their low prior probability, just as clinicians maintain a higher threshold for diagnosing uncommon conditions. The likelihoods returned by generative models critically preserve uncertainty in cases where the learned EHR features are less consistent with disease presumptions. Rather than making decisions solely on feature patterns, the model considers both how well the patterns match each disease process and how likely each disease is to occur. Deep geometric learning provides an opportunity to learn highly informative and patient-specific EHR representations leading to well-resolved feature spaces. When combined with a generative approach, the InfEHR framework achieved a more nuanced and clinically valuable form of uncertainty quantification. For example, a case might have features somewhat consistent with a rare disease, but the model appropriately and automatically tempers its confidence based on the condition’s rarity, or the model may be inexperienced with an instant case, leading to similarly moderated probabilities.
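The role of prevalence in this reasoning can be shown with the odds form of Bayes' rule. The sketch below is a worked illustration, not part of InfEHR; the function names and the example likelihood ratio of 5 are hypothetical, while the priors of 0.04 and 0.21 echo the approximate CN-S and PO-AKI prevalences reported for the MSHS data.

```python
def posterior_probability(prior, likelihood_ratio):
    """Odds form of Bayes' rule: posterior odds = likelihood ratio x prior odds."""
    prior_odds = prior / (1.0 - prior)
    post_odds = likelihood_ratio * prior_odds
    return post_odds / (1.0 + post_odds)


def required_likelihood_ratio(prior, target_posterior):
    """Strength of evidence needed to move `prior` to `target_posterior`."""
    target_odds = target_posterior / (1.0 - target_posterior)
    return target_odds / (prior / (1.0 - prior))


# Identical evidence, different prevalence (likelihood ratio of 5 is illustrative).
rare = posterior_probability(0.04, 5.0)    # about 0.17
common = posterior_probability(0.21, 5.0)  # about 0.57
```

Under these assumed numbers, reaching even odds requires a likelihood ratio of 24 at a prior of 0.04 but only about 3.8 at a prior of 0.21, which is the sense in which rare conditions demand stronger evidence.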

InfEHR is a generative modeling framework, which we have shown to dramatically reduce clinical uncertainty of individual cases using EHR data available at the time of clinical decision-making. Our framework automatically learns phenotypic temporal dynamics from EHRs through a graph neural net-based (GNN) approach. We learn high-dimensional representations of such graphs to compute the likelihood of an underlying latent disease given the representation (and the EHR it was derived from). The resulting likelihoods have properties important to any clinical test: the probability clearly identifies when the model is uncertain, confidence scales with accuracy, and the pretest probabilities are substantially revised73,74. We express these characteristics quantitatively through high rule-in and high rule-out potentials benchmarked against real-world clinical heuristics. To our knowledge, we depart from existing deep geometric learning approaches to EHRs75 by considering temporal graphs to (1) automatically derive clinically significant nodes and their embeddings from discrete and continuous variables within raw EHRs, (2) use a graph structure in a rules-generation engine to produce informative priors (enabling the use of minimally labeled examples), (3) learn whole temporal graphs and representations from EHRs using semi-supervised and unsupervised deep geometric learning, and (4) compute likelihoods from these representations for resolving clinical uncertainty in realistic conditions (e.g., enable the consideration of EHRs as they exist at the time of decision and not relying on diagnosis codes or other structures that may be unavailable at decision time).

This framework can also be used to automatically revise prior probabilities generated from previously trained models applied to new data with differing underlying prevalence of disease. This trait sets InfEHR apart from discriminative approaches for several reasons. First, underlying differences in prevalence typically manifest as degraded model performance in discriminative models. This can be mitigated by sufficient training that covers all potential disease manifestations, but low disease frequency intrinsically limits the availability of training examples. In contrast, InfEHR does not require large volumes of training data to prevent performance degradation, since its training method explicitly models prevalence in its likelihoods. Second, fine-tuning previously trained models to better reflect local prevalence conditions requires relabeling an entire dataset, which is often impossible. InfEHR instead revises prior probabilities automatically according to local prevalence conditions without human labeling. And third, InfEHR learns local feature distributions that can be combined with automatic inference of prevalence, which leads to high rule-in and rule-out potentials, which supports the generalizability of the framework to new settings whose clinical practices and patient populations differ.

InfEHR has a wide range of clinical applicability. It computes probabilities in disease settings with low (PO-AKI) and very low (CN-S) incidence of positive cases. Although the true global incidence of PO-AKI is presently unknown, a recent publication estimates that 18.4% of surgical patients will develop it (95% CI 17.7%–19.2%)36; this study observed 21% in the training dataset (n = 4276, PO-AKI = 879) and 10% in the validation dataset (n = 2426, PO-AKI = 261). Estimates of global CN-S cases are more elusive; however, in the records obtained at MSHS, CN-S cases made up less than 4% of total cases (n = 3678 [known label status], CN-S = 137), which is consistent with ratios available in literature24,76.

Although we emphasize performance in positive case identification in low-prevalence settings, InfEHR also effectively finds negative cases, given that it returns probabilities with high rule-in and high rule-out potentials, as explained below77. We show the performance of InfEHR in EHRs obtained from MSHS on CN-S and PO-AKI, and validate the performance in PO-AKI prediction on EHRs from the University of California, Irvine Medical Center (UCIMC). The EHRs in all datasets are from patients with diverse medical and demographic profiles78 (see Supplementary Materials for additional information). As explained in detail below, we also demonstrate the performance of InfEHR under clinically realistic conditions in terms of decision timing and EHR availability. This approach, requiring minimal human intervention, marks a significant advance in leveraging EHR data to support evidence-based medicine and reducing clinical uncertainty20,21,79.

Identifying low-incidence cases is particularly challenging to clinicians and machines alike80. The absence of uniform diagnosis and management, such as in diagnosing neonatal CN-S (where there is no specific confirmatory result or set of results), compounds these challenges. We benchmark InfEHR against a clinical heuristic for CN-S identification that emphasizes CRP, given that it is the only biomarker with specific guidelines for use in diagnosing CN-S34. Absolute CRP levels in neonates are limited in their ability to distinguish physiologic from inflammatory response, as CRP is subject to natural developmental increases outside of inflammatory response35. Further, various noninfectious conditions in the perinatal period can also induce an inflammatory response such as complicated labor and delivery, intraventricular hemorrhage, or tissue injury33. These issues likely reduce the sensitivity of CRP in CN-S detection, since the heuristic, consistent with literature, shows low sensitivity (0.041) and correspondingly neutral rule-in and rule-out potentials (1.03 and 1.097)81. In contrast, InfEHR dramatically improved upon these results with sensitivity of 0.650 and high rule-in and rule-out potentials (Training = 16.018, 2.880, Validation = 12.492, 2.435). We observe recurrent and distinguishing CRP level temporal dynamics in some positive CN-S cases (see Fig. 1). This observation suggests value in considering temporal dynamics for separating physiologic from pathophysiologic responses82. However, identifying such dynamics, which are subject to individual variation within a conserved pattern, at scale exceeds the capacity of the heuristic and is challenging to traditional temporal models83. InfEHR, by comparison, automatically captures informative temporal dynamics without algorithmic or human pre-specification and can also consider interrelated variations between CRP and other biomarkers, such as CBC results, which, although reportedly low in sensitivity by themselves, may provide more information when integrated with other lab results. These findings, while promising, have not been externally validated owing to insufficient external data, including confirmed CN-S status.

The proliferation of models and clinical indices for the assessment of PO-AKI underscores the highly contingent nature of PO-AKI risk46. Patients with normal serum creatinine do not present with clear postoperative risk and are highly represented in the false negative predictions (FNR > 72%, mean creatinine = 0.79 mg/dL) by the clinical heuristic. As part of the workflow of the InfEHR framework, we automatically generate labeling rules from the EHR graph structures that are aggregated to produce prior probabilities61. The machine-generated prior probabilities outperform the clinical heuristic and perform better than the machine priors generated for CN-S. Given that the rules are generated from nodes that are in turn extracted from EHR sources, we analyze the composition of the rules in terms of node source (e.g., labs, vitals, medications, or clinical notes)84. PO-AKI rules selected for aggregation had higher mean presence of nodes from medications and clinical notes (22%, 39% respectively) compared to aggregate rules for CN-S (8%, 17%), which had more representation from vitals and lab results or from unselected rules for PO-AKI (13%, 31%). Taken together with the decrease in false negatives (47.5% vs. 79.2% clinical heuristic), these results suggest that InfEHR can learn conditional relational structures that better identify risk in patients who otherwise had low or no indicated risk per the creatinine-based clinical heuristic75. The GNN component further reduces this uncertainty to produce better rule-in potentials (3.613 GNN vs. 1.322 heuristic)85. Although the PPV is similar, the higher recall of positive cases at greater NPV (0.924 vs. 0.824) shows that InfEHR can successfully estimate the required contingencies needed to make good PO-AKI risk assessments and outperform the clinical heuristic (see Supplementary Materials).

InfEHR produced high and comparatively better rule-in and rule-out potentials in the UCIMC dataset (rule-in: 7.105, rule-out: 2.700) than in the MSHS dataset. Other metrics that broadly support improvement to the input distribution of probabilistic labels (consistent with the observed better rule-in and rule-out potentials) were also examined. The beta distribution with parameters set to match the prevalence of the empirical sample served as a null model. The clinical heuristic’s close alignment with the beta distribution (0.8129), combined with its low negative log loss (0.0360) but high positive log loss (3.2807), reveals that the heuristic makes predictions in a “calibrated conservative” pattern that readily rules out but requires more evidence to rule in.

In contrast, the revised InfEHR GNN probabilities diverged from the beta distribution (1.5930) while achieving superior performance, suggesting that the model has learned additional signal in the data beyond what clinical heuristics capture, enabling more confident predictions in both positive and negative cases while maintaining accuracy. The increasing perplexity values from the heuristic (1.0730) to the InfEHR GNN’s revised probabilities (1.5327) indicate that the model learned to express more varied probabilities across cases. When viewed alongside the improved discriminative metrics, this suggests the higher perplexity reflects better-calibrated probability assignments that more accurately capture case-specific uncertainty while maintaining strong overall performance, which broadly indicates that InfEHR can learn effective features from EHRs automatically without requiring expert-provided labels. The positive log loss also showed marked improvement over the clinical heuristic (3.2807 to 1.0828), reflecting enhanced accuracy specifically in positive case prediction, which is the objective of InfEHR, though with a measured increase in negative log loss (0.0360 to 0.2839).
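For readers unfamiliar with these measures, the sketch below shows one common way to compute class-conditional log loss and per-case perplexity for binary probabilities. The exact definitions used in the study are not given in this section, so the formulas here (natural-log loss averaged within each class, and perplexity as the exponential of per-case binary entropy, averaged over cases) are assumptions, as are the function names.

```python
import numpy as np


def class_log_losses(y_true, p, eps=1e-12):
    """Return (positive log loss, negative log loss) for binary predictions."""
    y = np.asarray(y_true, dtype=bool)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    pos_loss = -np.log(p[y]).mean()        # penalizes low p on true positives
    neg_loss = -np.log(1 - p[~y]).mean()   # penalizes high p on true negatives
    return float(pos_loss), float(neg_loss)


def mean_perplexity(p, eps=1e-12):
    """Mean of exp(binary entropy in nats); ranges from 1 (certain) to 2 (p = 0.5)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return float(np.exp(entropy).mean())
```

Under these definitions, a predictor that assigns near-zero probability to almost every case has low negative log loss and perplexity near 1 but high positive log loss, which matches the conservative pattern described for the clinical heuristic.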

InfEHR yields probability distributions with high rule-in and rule-out potentials, exceeding clinical heuristics in all settings (PO-AKI at both UCIMC and MSHS, and CN-S in both Training and Validation splits), because it approaches learning from a generative perspective and supplies a wealth of representational capacity in the patient EHR graph. The InfEHR GNN can effectively capitalize on this capacity because it treats temporal relationships between clinical entities as features. The initial input EHR graph contains all possible temporal relations between the entities (see Supplementary Materials for detail on entity identification), but the GNN learns to make individual entities coalesce into high-level semantic groupings connected by condensed edges through a pooling operation (e.g., the GNN may automatically learn that receiving an ACE inhibitor and a finding of a certain blood pressure should be combined into an abstract singular clinical concept). The training objective enforces the generative paradigm in which the returned probability of an EHR given a disease latent is determined by the certainty that the disease latent is evidenced by the learned features relative to the empirical estimate of the likelihood of the disease itself as learned during the training process. The predicted probabilities therefore reflect the model’s certainty in how well-evidenced a given disease is in the underlying EHR, which can be interpreted similarly to clinical test results. By returning predicted probabilities that express the consistency of a given EHR with the disease itself (as opposed to a distance from a decision boundary), InfEHR computes probabilities with uncertainties that can be used in clinical decision-making.

The predicted probabilities are ultimately derived from quantifying interrelationships between clinical variables over time, which, as we suggest in the case of CN-S, can provide a basis for clinical inference in settings lacking specific individual biomarkers or, in the case of PO-AKI, where dynamically accumulating risks interact with other conditional risks. Although clinical evolution is generally acknowledged to be an important phenotypic and prognostic indicator, the complexities of characterizing it limit its practical use beyond as a general concept. It is not always known what extent of similarity (or over which components) is needed to distinguish common trajectories from unique ones. This phenomenon is known as sequence explosion where individual variation leads to large numbers of detected phenotypes with shallow support, thereby limiting comparisons among them83. InfEHR overcomes sequence explosion and brings temporal dynamics into a practical reality through overlapping mechanisms that cooperate to produce an aggregated view of individual temporal dynamics through which comparisons and inferences can be made.

Minimizing human bias by learning graph structures from naive temporal graphs has its strengths but comes at a computational cost. The edge count in our graphs explodes with the length of EHR modeled, which may impede the training of the pooling and self-attention mechanisms at long durations. Although InfEHR learned on EHR graphs containing up to 11 days of information from hospitalized patients, adapting the framework for longer-term temporal modeling may be needed, especially in high-measurement-frequency environments like the ICU. Future work should be done to decide the optimal method. One possibility is to treat the maximum EHR length as a hyperparameter and then aggregate the representations from InfEHR over such periods. Other possibilities include removing edges outside of a given temporal range to reduce graph complexity.
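The edge-pruning option mentioned above can be sketched as follows. This is a hypothetical illustration, assuming a naive graph in which every pair of events is connected; the function name `temporal_edges` and the parameter `max_gap` do not come from the InfEHR code.

```python
from itertools import combinations


def temporal_edges(timestamps, max_gap=None):
    """Connect every pair of events, optionally keeping only pairs whose
    time difference is at most `max_gap` (same units as `timestamps`)."""
    edges = []
    for i, j in combinations(range(len(timestamps)), 2):
        if max_gap is None or abs(timestamps[i] - timestamps[j]) <= max_gap:
            edges.append((i, j))
    return edges
```

Without a window, n events yield n(n − 1)/2 edges, so the edge count grows quadratically with record length; with a fixed window the count grows roughly in proportion to the number of events times the typical number of events per window.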

Future work will include workflow and preprocessing optimizations. Identifying temporal and other semantic information from terms extracted from clinical notes can add important context to these terms. Similarly, aligning such terms to graph-based ontologies such as SNOMED-CT presents opportunities to transfer information from within the ontology but outside the EHR directly to patient graphs. Large Language Models could potentially be used in preprocessing steps for clinical notes or as embedders for entire clinical notes, which could streamline the existing workflow. The integration of multi-modal information from real-time monitoring or genetics can also be facilitated through generalized models with outputs assimilated by InfEHR. Collectively, these improvements will support bringing the framework into production.

Although InfEHR is shown to produce low-entropy probability distributions, additional information could be gained from examining high-entropy results that point to uncertain predictions by InfEHR. These may represent limits to algorithmic performance, but they may also reveal information about the instant case where, for example in PO-AKI, a low-entropy prediction for a known positive case may point to an individual who has experienced specific adverse events during surgery that resulted in AKI or, in the setting of CN-S, an uncertain prediction for a known positive may indicate that the individual never had CN-S despite being treated for it (see earlier note on “label fidelity”). This facet suggests an added research-based use for InfEHR alongside the present suggestions for clinical decision support. InfEHR is an effective and scalable method for extracting insights from EHRs that we make publicly available (see Code Availability).

Methods

The Institutional Review Board of the Icahn School of Medicine at Mount Sinai approved the protocol for retrieving and analyzing all EHRs in this study. Data obtained from the MOVER dataset was approved for use by the Institutional Review Board of the University of California, Medical Center, and the main campus.

Overview of InfEHR

The premise of InfEHR is that more information is available than is typically used in individual clinical decision-making. The complexities of obtaining information from EHRs limit their utility. InfEHR is a geometric deep-learning approach for resolving clinical uncertainty using EHRs with minimal human intervention. The framework is designed to perform in realistic clinical settings where large volumes of labeled training data cannot be obtained and where existing knowledge is limited.

Three sequential modules make up the InfEHR framework. Module 1 takes in raw EHRs and produces EHR graphs through three successive steps: first, EHRs are preprocessed to remove invalid data; next, clinical events are automatically abstracted from the EHRs and embedded to form a set of nodes; finally, individual EHRs are aligned to the abstracted events and represented as graphs in which nodes are connected according to the naive temporal ordering in the patient EHR. In Module 2, an attention-based graph neural network (GNN) embeds EHR graphs using self-supervision. These embeddings, representing the complete patient record, are used in an automatic rules-generation engine to obtain initial probabilities for all unlabeled cases. In Module 3, uncertainty in these probabilities is resolved through semi-supervised training of the GNN using a specialized loss function. Module 3 can be used with any source of prior probability information.

Descriptions of each component module, corresponding detailed equations, and the specifics of the datasets used are provided below. A workflow diagram is provided in Supplementary Fig. 1.

Module 1: Processing Electronic Health Records into Graphs

Training Datasets

We obtained structured and unstructured data from 11 million electronic health records (EHRs) from the Mount Sinai Health System stored in the Mount Sinai Data Warehouse (CN-S) and through the Extrico Health platform (PO-AKI) over time-varying measurements, medications, and clinical progress notes.

For potential neonatal CN-S cases, records were obtained by identifying individuals with at least 48 h of antibiotic exposure administered in the NICU and without categorical missingness (e.g., no vitals information). All antibiotic courses for individuals meeting this requirement (n = 8067 individuals, 9256 antibiotic courses) were then extracted. A physician subject-matter expert manually confirmed the CN-S status for n = 3643 antibiotic courses. We applied a stratified split to the physician-confirmed dataset by birthweight to obtain a labeled training dataset of n = 2914 cases (80% of the total). Birthweight was chosen because it is an independent risk factor for CN-S32.

For potential PO-AKI cases, records were obtained for individuals undergoing surgery of any kind with presurgical hospitalization 2 days and who had in-hospital serum creatinine measurements taken at 72 h (n = 22,138) postoperatively to compute AKIN scores. For patients with multiple surgeries, only first surgeries with subsequent operations >72 h (n = 8031) were considered. A positive AKI diagnosis was assigned to any patient with AKIN score >1.

Validation Datasets

We used n = 729 from the stratified split (20% of total) cases as a validation dataset for the CN-S task.

For the PO-AKI task we used the EPIC EHR cohort in the MOVER dataset from the University of California at Irvine Medical Center (UCIMC, n = 39,685) and included only patients with serum creatinine measured preoperatively (2 or more measurements) and at 72 h postoperatively (n = 2631). We applied the AKIN definitions to obtain labels as in MSHS.

EHR Preprocessing

We applied the preprocessing steps described below to the CN-S and PO-AKI training datasets (individually) and applied the results where relevant to the validation datasets. The MSHS consists of several individual hospitals with varying database capture and update protocols. As a result, some types of information were systematically unavailable in the data warehouse at the time of retrieval.

EHRs with such categorical missingness (e.g., no vitals) were excluded from model training given that this pattern of missingness likely resulted from database-specific variation (retained cases: n = 5213 CN-S, n = 4276 PO-AKI MSHS; excluded cases: n = 2854 CN-S, n = 3764 PO-AKI MSHS); however, we used all valid records, including incomplete records, for the density estimations that the node discovery process required (see Node Discovery and Embedding below).

Preprocessing Numerical Values

We included vitals measurements and laboratory results measured on at least 100 unique individuals and with representation from both labels (i.e., the measurement does not by itself identify a case). As a result, given only 137 patients with confirmed CN-S, we considered only the vital signs of respiratory rate, SpO2, temperature, pulse, and systolic/diastolic blood pressures to avoid bias from measurement type. We retained 25 unique vitals in PO-AKI. We considered 387 and 72 unique labs, and 47 and 280 distinct medications, in CN-S and PO-AKI, respectively.

We further processed continuous numerical values by dropping any value greater than three times the maximum or less than three times the minimum clinical reference range (deemed to be likely artifactual). We also applied these preprocessing steps to the UCIMC variables that correspond to MSHS variables and removed from the UCIMC dataset any variable without such a correspondence.

Preprocessing Nonnumerical Values

Nonnumerical observations corresponding to lab results were standardized by applying Levenshtein distance to coalesce all similar variations to the most frequently observed term. We processed categorical features derived from clinical notes from the MSHS datasets as follows: we applied QuickUMLS, a Unified Medical Language System (UMLS) matcher, to identify terms from clinical notes matching a UMLS term with high confidence (> 0.7). The extracted terms were further refined in the node discovery process. No clinical notes were available in the UCIMC dataset.
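The coalescing of nonnumerical lab strings can be illustrated with a minimal sketch. The edit-distance threshold (`max_distance`), the lowercasing and whitespace stripping, and the greedy most-frequent-first assignment are assumptions made for illustration; the text does not specify them.

```python
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def coalesce_terms(observations, max_distance=2):
    """Map each observed string to the most frequent term within max_distance edits."""
    # Assumed normalisation: trim whitespace and ignore case before counting.
    counts = Counter(s.strip().lower() for s in observations)
    canonical = []  # terms kept as representatives, most frequent first
    mapping = {}
    for term, _ in counts.most_common():
        match = next((c for c in canonical if levenshtein(term, c) <= max_distance), None)
        if match is None:
            canonical.append(term)
            match = term
        mapping[term] = match
    return mapping
```

Because terms are visited in order of decreasing frequency, a rare misspelling such as "negatve" is absorbed by the common form "negative", while a genuinely different result such as "positive" remains its own term.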

Node Discovery and Embedding

We discovered the set of nodes comprising the global pool of clinical events using density-based selection procedures (node discovery). We applied this process separately to the CN-S and PO-AKI training datasets, then used the learned results from each training dataset to extract clinical events from its respective validation dataset.

We detail the operations involved, as they apply to continuous and discrete variables, in the following sections:

Continuous Variables

We fit kernel density estimations (KDE) to the set of all observations for all EHRs for each measurement type (e.g., heart rate, respiratory rate, white blood cell count, etc.). The resulting KDE curve indicated local densities by the intervals between local peaks that we then used to discretize the continuous measurement. The number and distance between peaks is set by a single bandwidth parameter that we determined empirically to satisfy the following constraints: the discretization must be shared by at least 100 unique individuals (e.g., a local density for blood glucose must contain measurements observed for at least 100 unique individuals) while maximizing the number of identified intervals.
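The discretization step can be sketched as follows, using only the Python standard library. Two points are assumptions rather than details from the text: bin boundaries are placed at the local minima of the KDE curve (the valleys separating neighbouring peaks), and the bandwidth is chosen from a small list of candidates. The function names and the grid resolution are hypothetical.

```python
import math

def gaussian_kde(values, bandwidth, grid):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values)
        for g in grid
    ]

def local_minima(grid, density):
    """Grid positions where the density curve reaches a local minimum."""
    return [
        grid[i]
        for i in range(1, len(grid) - 1)
        if density[i] <= density[i - 1] and density[i] < density[i + 1]
    ]

def discretize(observations, bandwidth, n_grid=200):
    """observations: list of (patient_id, value) pairs.

    Returns the cut points and the number of unique patients in each bin.
    """
    values = [v for _, v in observations]
    lo, hi = min(values), max(values)
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    cuts = local_minima(grid, gaussian_kde(values, bandwidth, grid))
    patients = [set() for _ in range(len(cuts) + 1)]
    for pid, v in observations:
        patients[sum(v > c for c in cuts)].add(pid)  # bin index = cuts below v
    return cuts, [len(p) for p in patients]

def choose_bandwidth(observations, candidates, min_patients=100):
    """Pick the candidate with the most bins such that every bin has enough patients."""
    best = None
    for bw in candidates:
        cuts, support = discretize(observations, bw)
        if min(support) >= min_patients and (best is None or len(cuts) > len(best[1])):
            best = (bw, cuts)
    return best
```

A small bandwidth produces many narrow bins that too few patients share, and a large bandwidth merges everything into one bin; the constraint selects the finest discretization that still has the required support.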

Discrete Variables

InfEHR uses discrete but time-varying information, including medications and clinical terms in notes. We included medications in the prescribable subset of RxNorm that were administered to at least 100 unique patients.

We extracted UMLS-synonymous terms from clinical notes using QuickUMLS (see preprocessing of nonnumerical data above). We weighted the collection of extracted terms using term frequency-inverse document frequency (TF-IDF), then applied nonnegative matrix factorization (NMF) with automatic determination of the number of latent topics (minimum components in the NMF H matrix such that the cophenetic correlation coefficient is 0.90). We analyzed the resulting low-rank term-weight matrix to identify and retain terms strongly associated with any latent topic (top 10% of distributed topic weights). This procedure simultaneously selects terms based on frequency and informational content to create a data-driven vocabulary of clinically meaningful terms from notes for use as graph nodes.

The node selection process automatically compresses the range of all clinical events to a subset based on the underlying density distributions of the dataset. This allows the discovery of nodes through a data-driven discovery process without human pre-specification or assumptions. Specifically, the number and content of nodes are not known a priori. Additional semantic information and implicit relational structures between nodes are encoded during node embedding, as described below.

Method of Node Embedding

We derived 64-dimensional numerical representations (embeddings) for the identified nodes described above. To compute the node embeddings, we first constructed a bipartite graph with partitions over individual patients and the identified global set of nodes. Next, we computed the overlap weighted projection of the clinical event nodes over the patients and retained only edges weighted at or above the 25th percentile edge weights. We added nodes representing semantic types (e.g., the name of a lab measurement or vitals sign category) to the projected graph and connected them to relevant nodes with maximum edge weight. The neighborhood of any individual node, therefore, included all nodes of the same semantic type as well as nodes across semantic types with high co-occurrence (indicated by high-weight edges). Nodes were encoded to reflect neighborhood information using the Node2Vec algorithm.
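The projection and pruning steps that precede Node2Vec can be sketched in plain Python. This is an illustration under stated assumptions: the overlap weight is taken to be the Jaccard overlap of the patient sets of two events, and the percentile uses linear interpolation. The semantic-type nodes and the Node2Vec training itself are omitted, and all function names are hypothetical.

```python
from itertools import combinations

def overlap_weighted_projection(patient_events):
    """Project a patient-event bipartite graph onto the event nodes.

    patient_events maps a patient id to the set of event ids in that record.
    The weight of an event-event edge is the Jaccard overlap of the sets of
    patients in which the two events occur (an assumed weighting).
    """
    event_patients = {}
    for pid, events in patient_events.items():
        for ev in events:
            event_patients.setdefault(ev, set()).add(pid)
    edges = {}
    for u, v in combinations(sorted(event_patients), 2):
        shared = event_patients[u] & event_patients[v]
        if shared:
            union = event_patients[u] | event_patients[v]
            edges[(u, v)] = len(shared) / len(union)
    return edges

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def prune_edges(edges, q=25):
    """Keep edges whose weight is at or above the q-th percentile of all weights."""
    cutoff = percentile(list(edges.values()), q)
    return {pair: w for pair, w in edges.items() if w >= cutoff}
```

Pruning at the 25th percentile removes the weakest co-occurrence links, so that random walks over the projected graph are dominated by events that are frequently observed in the same patients.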

The resulting collection of embeddings (including clinical events and semantic identifiers) forms a manifold that naturally encodes semantic clinical relationships into spatial distances between embeddings. We assigned to each node in an EHR graph the resulting relevant embedding, subject to some added components as described below. Note that we learned embeddings from the training datasets individually and used these embeddings in the validation datasets without retraining (see Fig. 7).

Fig. 7: Electronic health records (EHRs) are represented automatically as Electronic Health Record graphs through an unsupervised process.

a Schematic of clinical event identification and graph node determination using density-based approaches applied to Electronic Health Records (EHRs). b Two-dimensional visualization of the manifold formed of all extracted nodes and representations (left in gray), detail of nodes within manifold (top right). Distinct nodes from two patient EHR Graphs are highlighted (red, yellow) as well as nodes in common (orange). Details of the semantic neighborhoods within the manifold (right bottom, selected nodes in red). c Angular distances between node embeddings without time embeddings (left); time embeddings change the angular distances between the commonly held base embeddings, reflecting temporal relationships particular to individual cases (left, right).

Tuning the General Node Representation to Individual Temporal Contexts

After extracting the set of clinical events, we extracted the time stamps for their occurrences in individual EHRs. We adjusted all time stamps to reflect elapsed times by subtracting the earliest time stamp corresponding to a clinical event in each EHR. We aggregated all unique time stamps for all EHRs and derived 32-dimensional embeddings for each time stamp using the Time2Vec algorithm in Eq. (4):

$${Time2Vec}(t)=\left\{\begin{array}{ll}{w}_{k}\cdot t+{b}_{k} & \text{if } k=0\\ \sin \left({w}_{k}\cdot t+{b}_{k}\right) & \text{if } k > 0\end{array}\right.$$
(4)

where:

\(t \)         time component (like time stamp, hour of day, etc.)

\({w}_{k}\), \({b}_{k}\)     Learnable parameters of the model

\(k\)        Position in the Time2Vec vector.

The general representation of a clinical event is formed by concatenating the event embedding with the semantic type embedding (e.g., the embedding of a certain KDE density for blood glucose is concatenated with the embedding for blood glucose). These generalized embeddings—consistent across patients—are tuned to the individual by adding the embedding of the time stamp for its occurrence. The resulting embeddings render clinical events in a machine-readable format. Although time stamps are not explicitly used as positional markers, the vectorization of time adds temporal information to representations of clinical events. The numerical distance between locally co-occurring but semantically distinct clinical events is reduced by the similarity of their time stamp components compared to events farther apart in time. Individual variation in temporal dynamics therefore shapes the representation of clinical events to the machine, transforming generalized clinical event representations to reflect individual contexts (see Fig. 7).
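A minimal sketch of the Time2Vec mapping and of how a time embedding individualizes a shared event representation is given below. The time embedding is appended by concatenation, which is an assumption based on the stated input dimension of 160 (64 + 64 + 32). The parameters are random stand-ins for values that would be learned, and the names `glucose_bin` and `glucose_type` are hypothetical.

```python
import math
import random

def time2vec(t, weights, biases):
    """Time2Vec: a linear component for k = 0 and sinusoidal components for k > 0."""
    return [
        w * t + b if k == 0 else math.sin(w * t + b)
        for k, (w, b) in enumerate(zip(weights, biases))
    ]

def node_representation(value_emb, type_emb, t, weights, biases):
    """Concatenate value, semantic-type, and time embeddings for one clinical event."""
    return list(value_emb) + list(type_emb) + time2vec(t, weights, biases)

rng = random.Random(0)
# Randomly initialised stand-ins for parameters that would be learned in practice.
weights = [rng.uniform(-1, 1) for _ in range(32)]
biases = [rng.uniform(-1, 1) for _ in range(32)]
glucose_bin = [rng.gauss(0, 1) for _ in range(64)]   # hypothetical value embedding
glucose_type = [rng.gauss(0, 1) for _ in range(64)]  # hypothetical type embedding

# The same event observed 2 h and 30 h after the first recorded event.
early = node_representation(glucose_bin, glucose_type, 2.0, weights, biases)
late = node_representation(glucose_bin, glucose_type, 30.0, weights, biases)
```

The first 128 components of `early` and `late` are identical because they encode the same general event, and only the final 32 time components differ, which is what shifts the distance between otherwise identical events according to when they occurred.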

Module 2: Deep Geometric Learning Approach

Notation

We represent individual EHRs as directed graphs by determining relevant clinical events \(\varepsilon \) (such as a measurement value within a certain range, or the appearance of a term in a clinical note) where \(\varepsilon \) is discovered through an automatic process (node discovery).

We derive embeddings for these clinical events by learning a manifold \({\mathcal{M}}\) comprising all events \(\varepsilon \) and their respective semantic types \(\tau \). We apply an operator (here, concatenation) to obtain the representations of all possible clinical events as shown in Eq. (5):

$${\rm E}=\phi \left(\varepsilon \in {\mathcal{M}},\tau \in {\mathcal{M}}\right):{{\mathbb{R}}}^{n}\to {{\mathbb{R}}}^{d}$$
(5)

Graphs representing patients are constructed by identifying the time stamp of each event \(\varepsilon \in {\rm E}\) in the patient record, embedding the time stamp using a network trained on the Date2Vec objective, and concatenating the result with the representation of \(\varepsilon \), resulting in the initial node embedding

$${h}_{i}^{\left(0\right)}\in {{\mathbb{R}}}^{m}\left(m > d > n\right)$$

The graph is defined as

$$G=\left(\varepsilon,V\left(t\right)\right)$$
(6)

with directed edges

$${e}_{i,j}=\left({v}_{i}\left(t\right) < {v}_{j}\left(t\right)\right)$$
(7)

Problem definition

Given the graph \(G\), we train networks to learn whole graph representations in \({{\mathbb{R}}}^{d}\) for (1) self-supervised representations of EHRs and (2) computing likelihoods over clinical queries. Exact definitions of the loss functions used for training in (1) and (2) appear below.

Construction and Definitions of EHR Graphs

InfEHR computes likelihoods through sequential processing of EHRs. We obtain EHRs and then represent them as temporal graphs (Eq. (6)) according to these two definitions:

Definition 1: EHR

Given \({{\mathcal{R}}}_{i}\), comprising all medical records for patient \(i\) occurring over the set of times \({T}_{i}\), we extract the electronic health record (EHR) of patient \(i\), denoted as \({\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{i}\):

$${\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{i}=\{(r,t)\in \{\text{vitals},\text{labs},\text{medications},\text{clinical notes}\}\ \text{and}\ t\subseteq {T}_{i}\}$$
(9)

with \(t\) bounded by:

$$\max \left({T}_{0,\text{defined}},{T}_{0,{\text{patient}}_{i}}\right)\le t\le \min \left({T}_{\max,{\text{patient}}_{i}},{T}_{\max,\text{defined}}\right)$$
(10)

where:

\({T}_{{defined}}\)    Timestamp of a clinical event or user-provided temporal duration.

\({T}_{{{patient}}_{i}}\)    An observed timestamp in the records of patient \(i\).

\({\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{{i}_{j}}\)       Unique EHR identified by a specific clinical event in patient \(i\)'s records.

Here, \({T}_{{defined}}\) indicates the time stamp of a clinical event or user-provided temporal duration, and \({T}_{{{patient}}_{i}}\) corresponds to an observed time stamp in the records of patient \(i\). In the case of multiple defined clinical events occurring in the records of patient \(i\), each event results in a unique EHR identified by \({\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{{i}_{j}}\).

Definition 2: EHR Graph

We take EHRs (as defined above) and represent them as temporal graphs. We discover and collate nodes from the collected EHRs using unsupervised methods into a global node pool (Nodes), embed individual time stamps using Time2Vec (Times), and form temporal edges following Eq. (11) and the algorithm in Box 1.

$$\forall \,{\text{node}}_{j}\in G,\;j < i\wedge {\text{time}}_{j} < {\text{time}}_{i}:\ \text{create edge}\left({\text{node}}_{j}\to {\text{node}}_{i}\right).$$
(11)
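The edge rule in Eq. (11) can be sketched in a few lines. The input format, a list of `(node_id, elapsed_time)` pairs, and the function name are assumptions for illustration; events that share a time stamp are left unconnected because the rule requires a strict ordering.

```python
def build_temporal_edges(events):
    """events: list of (node_id, elapsed_time). Returns directed edges (j, i).

    An edge j -> i is created whenever event j occurs strictly before event i,
    so every earlier event points to every later one.
    """
    order = sorted(range(len(events)), key=lambda idx: events[idx][1])
    edges = []
    for pos_i, i in enumerate(order):
        for j in order[:pos_i]:
            if events[j][1] < events[i][1]:
                edges.append((j, i))
    return edges
```

With n events at distinct times this rule yields n(n - 1)/2 edges, which illustrates why the edge count grows rapidly with the length of the modeled EHR.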

Training the GNN on Attributed EHR Graphs

We train a GNN to produce whole graph embeddings (dim = 128 self-supervised, dim = 164 semi-supervised) subject to additional processing layers under a self-supervised and semi-supervised objective (details below). We use a consistent architecture adapted to supervised and self-supervised training regimes.

Given an EHR Graph \(G=(V,E)\), the model initially condenses and rewires the graph through a learned pooling operation resulting in:

$${G}^{{\prime} }={ASAPool}(G,\rho )$$
(12)

We derive the global representation \(X\) and logits for \({G}^{{\prime} }\) as shown:

$${H}_{i}^{(1)}={ReLU}\left(\sum _{j\in {\mathscr{N}}(i)}{\alpha }_{ij}^{(1)}{W}^{(1)}{x}_{j}\right)$$
(13)
$${R}_{i}={W}_{r}{H}_{i}^{(1)}$$
(14)
$${H}_{i}^{(2)}={ReLU}\left(\sum _{j\in {\mathscr{N}}(i)}{\alpha }_{ij}^{(2)}{W}^{(2)}{H}_{j}^{(1)}\right)+{R}_{i}$$
(15)
$$X=\frac{1}{\left|{V}^{{\prime} }\right|}\sum _{i\in {V}^{{\prime} }}{H}_{i}^{(2)}$$
(16)
$${Logits}={W}_{f}X+{b}_{f}$$
(17)

This network definition uses equations in sequential order (12, 13, 14, 15, 16, 17).

For all experiments we use input node feature dimensions d = 160 (from d = 32 for time embedding + d = 64 node semantic type embedding + d = 64 node value embedding). We use a node pooling ratio (rho) of 0.8 and attention heads for GAT layers = 2, and use hidden dimensions H(1) = 256, and H(2) = 128. We train on a self-supervised learning (SSL) objective (explained below) to produce a d = 128-dimensional representation (dimensions chosen from experience) and a d = 2-dimensional output for the semi-supervised objective.
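The forward pass in Eqs. (13) to (17) can be traced with a small standard-library sketch. It is a simplification under stated assumptions: the pooling step of Eq. (12) is skipped (the input is treated as the already pooled graph), a single attention head is used instead of two, attention scores follow the common LeakyReLU-plus-softmax form, and all parameters are random. The helper names and the parameter dictionary layout are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def rand_matrix(rows, cols, rng):
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def attention_layer(H, neighbors, W, a_src, a_dst):
    """Single-head graph attention: ReLU(sum_j alpha_ij * W h_j) for every node i."""
    Z = [matvec(W, h) for h in H]
    out = []
    for i, nbrs in enumerate(neighbors):
        raw = [dot(a_src, Z[i]) + dot(a_dst, Z[j]) for j in nbrs]
        raw = [s if s > 0 else 0.2 * s for s in raw]        # LeakyReLU scores
        top = max(raw)
        exps = [math.exp(s - top) for s in raw]
        alphas = [e / sum(exps) for e in exps]              # softmax over N(i)
        agg = [sum(a * Z[j][k] for a, j in zip(alphas, nbrs)) for k in range(len(W))]
        out.append([max(0.0, v) for v in agg])
    return out

def init_params(d_in, d_h1, d_h2, n_out, seed=0):
    rng = random.Random(seed)
    vec = lambda d: rand_matrix(1, d, rng)[0]
    return {
        "W1": rand_matrix(d_h1, d_in, rng), "a1s": vec(d_h1), "a1d": vec(d_h1),
        "W2": rand_matrix(d_h2, d_h1, rng), "a2s": vec(d_h2), "a2d": vec(d_h2),
        "Wr": rand_matrix(d_h2, d_h1, rng),
        "Wf": rand_matrix(n_out, d_h2, rng), "bf": [0.0] * n_out,
    }

def forward(X, neighbors, p):
    """neighbors[i] lists the nodes that send messages to node i (non-empty)."""
    H1 = attention_layer(X, neighbors, p["W1"], p["a1s"], p["a1d"])   # Eq. (13)
    R = [matvec(p["Wr"], h) for h in H1]                              # Eq. (14)
    H2 = attention_layer(H1, neighbors, p["W2"], p["a2s"], p["a2d"])
    H2 = [[a + b for a, b in zip(h, r)] for h, r in zip(H2, R)]       # Eq. (15)
    pooled = [sum(col) / len(H2) for col in zip(*H2)]                 # Eq. (16)
    return [z + b for z, b in zip(matvec(p["Wf"], pooled), p["bf"])]  # Eq. (17)
```

With the dimensions stated in the text the call would be `init_params(160, 256, 128, 2)`; the residual matrix `Wr` maps the 256-dimensional first-layer output to 128 dimensions so that it can be added to the second-layer output.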

Self-Supervised Learning for Whole Graph Representations

We learn whole graph representations by training the InfEHR GNN using a self-supervised loss according to the algorithm in Box 2:

Self-supervised learning

Our SSL training algorithm uses a VICReg framework86, more commonly used for image encoding, enhanced with mutual information (MI) estimation tailored to graph data. The procedure initializes an encoder with ASAPooling and GAT layers, along with two projectors for input and representation transformation. During training, each graph generates two views: raw features projected through p1 and encoded features through p2. The projectors, implemented as multilayer perceptrons with normalization and nonlinearities, serve as learnable transformations that map the input and encoded representations to a shared embedding space while preventing the collapse of information. This architectural choice enforces an information bottleneck that prevents the encoder from learning trivial solutions, while the projectors’ flexibility allows the contrastive learning objective to be optimized without constraining the encoder’s representation capacity. Post-training, the projectors are discarded, preserving the encoder’s learned manifold structure for downstream tasks.

The loss function combines four components: similarity loss (ensuring view alignment), variance loss (preventing dimensional collapse), covariance loss (decorrelating features), and mutual information (MI) loss (maximizing node-to-graph information while minimizing it for corrupted samples). The MI estimation uses a structured corruption scheme and InfoNCE-style loss computation.
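The three VICReg terms named above can be written out in their standard form, as in the following sketch. The mutual information term and the weights that combine the four components are not reproduced here because the text does not give them; the hinge target `gamma = 1` and the small constant `eps` follow the usual VICReg formulation and are assumptions with respect to this work.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def invariance_loss(za, zb):
    """Mean squared error between paired embeddings of the two views."""
    return mean([mean([(a - b) ** 2 for a, b in zip(ra, rb)]) for ra, rb in zip(za, zb)])

def variance_loss(z, gamma=1.0, eps=1e-4):
    """Hinge on the per-dimension standard deviation across the batch."""
    n = len(z)
    losses = []
    for col in zip(*z):
        mu = mean(col)
        var = sum((v - mu) ** 2 for v in col) / (n - 1)
        losses.append(max(0.0, gamma - math.sqrt(var + eps)))
    return mean(losses)

def covariance_loss(z):
    """Sum of squared off-diagonal covariance entries, divided by the dimension."""
    n, d = len(z), len(z[0])
    mus = [mean(col) for col in zip(*z)]
    centered = [[v - m for v, m in zip(row, mus)] for row in z]
    total = 0.0
    for p in range(d):
        for q in range(d):
            if p != q:
                cov = sum(row[p] * row[q] for row in centered) / (n - 1)
                total += cov ** 2
    return total / d
```

A batch whose embeddings are all identical is penalized by the variance term, and a batch whose dimensions move together is penalized by the covariance term, which is how these terms discourage collapsed or redundant representations.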

The corruption scheme generates negative samples by shuffling node features within windows of the EHR graph (WindowCorrupt). The effect is to introduce random and unrealistic relationships and orderings between clinical events. The MI estimation encourages the model to distinguish valid clinical structures and their representations from these unrealistic examples.
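A windowed corruption of this kind might look like the sketch below. The use of consecutive, non-overlapping windows over time-ordered nodes and the `window_size` parameter are assumptions; the text states only that node features are shuffled within windows of the EHR graph.

```python
import random

def window_corrupt(node_features, window_size, rng):
    """Shuffle node feature vectors within consecutive windows of a time-ordered graph.

    The graph topology is untouched; only the assignment of features to nodes
    changes, which scrambles the ordering of events inside each window.
    """
    corrupted = []
    for start in range(0, len(node_features), window_size):
        window = list(node_features[start:start + window_size])
        rng.shuffle(window)
        corrupted.extend(window)
    return corrupted
```

Restricting the shuffle to a window keeps the corrupted graph locally plausible in content while breaking the order of events, so the negative samples are harder to distinguish than a fully random permutation would be.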

The total loss is weighted as in Equation (32) and optimized using AdamW. After training for \(T\) = 1000 epochs, the encoder is preserved for downstream tasks while the projectors are discarded.

This SSL loss function encourages EHR representations that capture local temporal patterns within patient records and also global patient states, allowing for encodings that capture inter-patient variation (differing global states) simultaneously with encoding shared local temporal structures. This simultaneity promotes meaningful semantics in several ways: high-density regions are likely to represent patients with common clinical patterns or disease trajectories, whereas sparse regions may indicate rare conditions or unique patient presentations. Spatial distances can be used to infer the disease state as follows.

Deriving instance-level priors automatically

Label propagation using self-supervised embeddings

We train the GNN encoder to produce self-supervised embeddings as above and apply label propagation as described in ref. 60 and implemented in scikit-learn. We hypothesize that training on a self-supervised objective, as described above, results in automatic alignment of phenotypically similar people so as to meet the smoothness assumption of label propagation, namely that semantic similarity is a function of spatial distance.

The label spreading algorithm iteratively learns a smooth classification function whereby we take the 110 labeled samples and spread label information to spatially proximate samples and label each sample according to the flow of labels it receives during the propagation process. We apply the label spreading algorithm using an RBF kernel (gamma at 70) to determine probabilistic distances between embeddings and set the clamping parameter alpha, controlling the relative importance of the initially labeled examples in deciding the predicted labels for unlabeled examples, to 0.5 based on previous experience. We achieved 0.18 (CN-S) and 0.34 (PO-AKI) recall, with precision of 0.67 and 0.78 respectively (outperforming the clinical heuristic in both cases), suggesting that the spatial similarity assumptions held appreciably in the self-supervised embeddings.
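The iteration behind label spreading can be sketched without external libraries. The study used the scikit-learn implementation; the version below is a simplified re-implementation for illustration only, with a fixed number of iterations instead of a convergence test, and with unlabeled samples marked by -1. The defaults for `gamma` and `alpha` mirror the values reported above.

```python
import math

def label_spreading(points, labels, n_classes, gamma=70.0, alpha=0.5, n_iter=30):
    """Spread labels over an RBF similarity graph; labels use -1 for unlabeled points."""
    n = len(points)
    # RBF affinities with a zeroed diagonal.
    W = [[0.0 if i == j else
          math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
          for j in range(n)] for i in range(n)]
    deg = [sum(row) or 1.0 for row in W]
    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}.
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)] for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(n_iter):
        # Each sample mixes its neighbours' scores (alpha) with its initial label (1 - alpha).
        F = [[alpha * sum(S[i][j] * F[j][c] for j in range(n)) + (1 - alpha) * Y[i][c]
              for c in range(n_classes)] for i in range(n)]
    # Row-normalise the label scores into per-sample class probabilities.
    probs = []
    for row in F:
        total = sum(row)
        probs.append([v / total for v in row] if total > 0 else [1.0 / n_classes] * n_classes)
    return probs
```

With a large gamma the affinity between distant embeddings is effectively zero, so an unlabeled sample inherits its label almost entirely from labeled samples in its own neighbourhood.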

Integration of spatial information and structural information

We derive weak labeling heuristics from structural features of EHR graphs using label information provided from label propagation over the self-supervised embeddings (spatial information as described). Ordinarily such weak heuristics are generated by a human expert, which introduces bias and is challenging in precisely those clinical settings where existing clinical knowledge is limited. We present a method to automatically generate them at scale in uncertain conditions and follow established guidelines for combining them61,62 to generate initial probabilities. We find that existing literature21,22,86 has corroborated a random sample of automatically generated heuristics. Another potential application of InfEHR could be generating hypotheses from weakly predictive heuristics.

Weakly supervised learning over uncertain priors

The ASAPooling operation87 in the GNN uses an attention-based mechanism to derive cluster medoids and assign cluster memberships to nodes over a fixed receptive field to produce a new, pooled graph. Clusters are scored for inclusion in the pooled graph and reconnected with edge weights that indicate the topology of the original graph. We apply a message-passing algorithm over the pooled graph using graph self-attention to compute new node representations successively. InfEHR derives the node representations by learning an attentional coefficient that weighs the relative importance of a node to its neighbor in the aggregation phase of the message-passing algorithm. To avoid over-smoothing of node representations, we apply a residual connection between successive message-passing steps. Finally, to obtain the whole graph representation, we take the mean over node features for all nodes in the pooled graph, resulting in a single high-dimensional vector. We further process the whole graph representation using linear layers to return likelihoods according to the loss criterion described below.

Module 3: resolving prior probabilities

GNN Training with Feature-Based Weighting of Kullback-Leibler Loss

We train the GNN (previously described) under our own loss function similar to the RQ loss proposed in ref. 63 and include a small network to learn example specific loss weighting functions. RQ loss consists of a generative formulation allowing the optimization of the log-likelihood of learned graph features relative to an assumed underlying generative process (here a disease latent). The loss function is consistent with the overall data representation strategy in which we capture disease latents at multiple levels (from initial node encodings to the naive temporal EHR graphs). We extend this function further by learning a dynamic weighting mechanism that continuously adapts during training, learning to adjust sample importance based on evolving patterns in the penultimate layer representations. Modulating the RQ loss through learned weights focuses attention on the most informative samples as the feature space becomes progressively more structured throughout training.

Weighted RQ Loss

The weighted RQ loss minimizes the following function, jointly optimizing parameters for the primary model and for the weighting network:

$$\mathop{\min }\limits_{\theta ,\phi }\sum _{i\in B}\frac{\exp ({w}_{i}(\phi ))}{{\sum }_{j\in B}\exp \left({w}_{j}(\phi )\right)}\cdot {KL}\left[q\left({\mathcal{l}}\,|\,{x}_{i};\theta \right)\,\Big\Vert \,{c}_{i}\cdot {p}_{i}({\mathcal{l}})\frac{q({\mathcal{l}}\,|\,{x}_{i};\theta )}{{\sum }_{j}q\left({\mathcal{l}}\,|\,{x}_{j};\theta \right)}\right]$$
(18)

Definitions:

\(B\)    Virtual batch

\({f}_{i}\)    Penultimate layer features

\({c}_{i}\)    Normalization constants ensuring \(\sum {r}_{i}({\mathcal{l}})=1\), where \({r}_{i}\) denotes the second argument of the KL term in Eq. (18)

\({p}_{i}({\mathcal{l}})\)     Prior beliefs about latent labels, derived from InfEHR heuristics

\({w}_{i}(\phi )\)    Weight computed as:

$$\sigma ({W}_{2}{ReLU}({W}_{1}\,{f}_{i}+{b}_{1})+{b}_{2})$$
(19)

where \(\phi=\{{W}_{1},{W}_{2},{b}_{1},{b}_{2}\}\) are trainable weights and biases of the neural network.

Parameters:

\(\theta \)    Parameters of the primary model estimating \(q({\mathcal{l}}\,|\,{x}_{i};\theta )\);

\(\phi \)    Parameters of the neural network calculating \({w}_{i}\);

Optimization is by simultaneous updates to \(\theta \) and \(\phi \), aligning the model’s outputs with instance importance in batch \(B\).

In sequential order, this loss definition uses Eqs. (18, 19).
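The value of the weighted loss in Eq. (18) for one batch can be computed as in the sketch below. It evaluates the loss only and does not perform the joint optimization; the weights \(w_i(\phi)\) are passed in directly rather than produced by the weighting network of Eq. (19), and the small constant `eps` is added for numerical safety. Both choices are assumptions for illustration.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_rq_loss(q, priors, weights, eps=1e-12):
    """q, priors: per-example label distributions in a batch; weights: w_i(phi)."""
    n_labels = len(q[0])
    # Batch-level sum over examples of q(l | x_j) for each label l.
    marginal = [sum(row[l] for row in q) for l in range(n_labels)]
    importance = softmax(weights)
    loss = 0.0
    for q_i, p_i, a_i in zip(q, priors, importance):
        unnorm = [p_i[l] * q_i[l] / (marginal[l] + eps) for l in range(n_labels)]
        c_i = 1.0 / (sum(unnorm) + eps)          # makes the target r_i sum to 1
        r_i = [c_i * u for u in unnorm]
        kl = sum(q_i[l] * math.log((q_i[l] + eps) / (r_i[l] + eps)) for l in range(n_labels))
        loss += a_i * kl
    return loss
```

The loss is small when the model's distribution agrees with the prior-reweighted target and grows when the prior contradicts a confident prediction; raising the weight of such an example increases its share of the batch loss.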

Performing Validation

We construct EHR graphs from the UCIMC data contained in the MOVER dataset by first aligning the records in UCIMC to the same namespace as the MSHS data (e.g., creating a mapping between the same medication or laboratory measurement with varying names) and then using the learned clinical manifold from MSHS data to extract clinical events from the UCIMC records into naive temporal graphs, as described previously. Notably, UCIMC data does not include clinical progress notes, which limits the full translation of UCIMC events to the MSHS manifold. All vitals measurement types in the UCIMC dataset were duplicated in MSHS; however, some laboratory measurements and medications in UCIMC had no correspondence in MSHS. We omit any such record from the temporal graphs.

To apply the GNN for semi-supervision from MSHS data to graphs from UCIMC, we ablated all nodes from clinical text in the MSHS graphs and retrained the GNN on the MSHS graphs. We applied this GNN to the UCIMC graphs to obtain initial probabilistic labels (see Fig. 6, InfEHR priors). We therefore transferred knowledge from previous training in the form of the learned clinical manifold and in the prior probabilities.

InfEHR is designed to learn dynamical temporal features that can be used for clinical uncertainty reduction. To show that InfEHR does this, we trained the InfEHR GNN on UCIMC graphs (n = 2427) constructed using the clinical manifold and embeddings from MSHS. Probabilistic prior labels were obtained from previous training on MSHS graphs and without human-provided labels. This parallels discriminative model training while maintaining consistency with the InfEHR loss function and framework. Using these priors (without human-provided labels) we trained for 20 epochs (by early stopping criterion). We used this trained GNN to compute final likelihoods (see Table 3 and Fig. 6).

We then performed benchmarking experiments with GRU-D and SeFT models implemented and trained as in their reference implementations64,65. Although both of these models ingested tabular nongeometric data, we retained the same variables used to construct the graphs for InfEHR GNN to facilitate comparison.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.