InfEHR: Clinical phenotype resolution through deep geometric learning on electronic health records

Kauffman, Justin; Holmes, Emma; Vaid, Akhil; Charney, Alexander W.; Kovatch, Patricia; Lampert, Joshua; Sakhuja, Ankit; Zitnik, Marinka; Glicksberg, Benjamin S.; Hofer, Ira; Nadkarni, Girish N.

doi:10.1038/s41467-025-63366-6

Article
Open access
Published: 26 September 2025

Nature Communications volume 16, Article number: 8475 (2025)

Abstract

Electronic health records contain multimodal data that can inform clinical decisions but are often unsuited for advanced machine learning analyses due to lack of labeled data. Here, we present InfEHR, a framework to automatically compute clinical likelihoods from whole electronic health records without requiring large volumes of labeled training data. InfEHR applies deep geometric learning through a procedure that converts whole electronic health records to temporal graphs that naturally capture phenotypic dynamics, leading to unbiased representations. Using only few labeled examples, InfEHR computes and automatically revises probabilities achieving highly performant inferences, especially in low-prevalence diseases. We test InfEHR using electronic health records from Mount Sinai Health System and UC Irvine Medical Center against physician-provided heuristics on neonatal culture-negative sepsis (3% prevalence) and postoperative acute kidney injury (21% prevalence). InfEHR demonstrated superior performance: for culture-negative sepsis (sensitivity: 0.60 vs. 0.04, specificity: 0.98 vs. 0.99) and post-operative acute kidney injury (sensitivity: 0.71 vs. 0.20, specificity: 0.93 vs. 0.98). Our study demonstrates the application of geometric deep learning in electronic health records for probabilistic inference in real-world clinical settings at scale.

Introduction

Clinical uncertainty hinders the practice of evidence-based medicine¹. For any condition, transitions from high-evidence regions to low-information, high-uncertainty areas are abrupt². Clinicians must synthesize diverse sources of information to limit this uncertainty and make effective decisions³. Determining which information to use and how to combine it to estimate probabilities for decisions is complex⁴ and relies heavily on individual judgment and experience, which runs counter to the principles of evidence-based practice⁵.

Additional factors complicate this challenge. First, clinical tests and heuristics generally have low positive predictive values and higher negative predictive values, necessitating multiple tests and extended clinical observation to reduce uncertainty⁶. In acute settings, this time delay is often weighed against the potential harms of empirical treatment and the risks of not providing it⁷. In more defined settings, such as preoperative risk assessment, the window to gather and synthesize information is limited, establishing a baseline uncertainty that current heuristics and evidence struggle to resolve⁸,⁹.

Electronic health records (EHRs) have evolved over the past decade, becoming increasingly detailed¹⁰ due to technological advancements and changing regulatory requirements¹¹. However, information contained within EHRs cannot readily be applied to individual clinical decisions¹². While information rich, distilling EHR information into useful evidence requires identifying exactly which variables to look at and estimating the conditional relationships between them over evidence that varies across individuals and is subject to measurement and other errors¹³. Adding to this complexity is the difficulty of choosing the right temporal window in which to observe these variables for making a phenotypic inference¹⁴.

Realizing the potential of the EHR to resolve clinical uncertainty requires novel computational approaches¹⁵. Graph structures, which consist of nodes connected by edges, are well-suited to representing phenotypic relational structures¹⁶. Temporal structures are captured in the form of edges between nodes (representing clinical events), and the flexibility of this structure seamlessly captures individual variation (e.g., a node representing a certain medication may appear only selectively according to who received it)¹⁷. The graph structure compactly renders the clinical trajectory into a form suitable for AI applications¹⁸. Deep geometric learning is a highly active and emerging branch of AI research which, though nascent, has already produced impactful results in the biomedical domain¹⁹,²⁰,²¹,²²,²³. Broadly, deep geometric learning extends advancements in neural networks to data that consist of relational structures between entities (geometric data). In our specific application, InfEHR, we use deep geometric learning to render EHR information into a clinical likelihood.
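To make the graph idea concrete, the following is a minimal sketch of turning a time-stamped event list into a temporal graph. The event names, the `build_temporal_graph` helper, and the rule of linking each event to every event at the next time step are illustrative assumptions, not the edge construction used by InfEHR.

```python
from collections import defaultdict

# Hypothetical toy record: (hours_since_admission, event_name). Not real patient data.
events = [
    (0.0, "lab:CRP"),
    (0.0, "vital:heart_rate"),
    (6.0, "med:ampicillin"),
    (12.0, "lab:CRP"),
]

def build_temporal_graph(events):
    """Return (nodes, edges); each node is a (time, event) pair.

    Assumed rule: every event at one time step is linked to every
    event at the next observed time step.
    """
    by_time = defaultdict(list)
    for t, name in events:
        by_time[t].append((t, name))
    times = sorted(by_time)
    edges = []
    for earlier, later in zip(times, times[1:]):
        for src in by_time[earlier]:
            for dst in by_time[later]:
                edges.append((src, dst))
    nodes = [node for t in times for node in by_time[t]]
    return nodes, edges

nodes, edges = build_temporal_graph(events)
```

Because nodes exist only for events that actually occurred, a patient who never received a given medication simply has no node for it, which is the selective appearance described above.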

We minimize human involvement throughout the steps involved. This allows InfEHR to learn revealing temporal dynamics without human pre-specification or bias. Our method first automatically extracts temporal graphs from individual EHRs (EHR graphs). Each graph captures the complete trajectory of a patient’s clinical events and the relational structures among them. We summarize the entire clinical history contained in the EHR graph into a compact vector representation using self-supervised learning. Doing so creates a holistic view of the patient record that expresses the semantic similarity between clinical trajectories as a function of spatial distance between vector representations, which can be used to transmit knowledge to unlabeled cases from scant expert-provided labels as presumptive labels. We use these initial labels to refer to the EHR graphs and find individual graph components (e.g., a medication connected to a lab result) that are preferentially associated with an outcome, each providing a weak and uncertain likelihood estimate. We find hundreds of such informative components automatically and aggregate their weak predictions into refined machine-derived likelihoods. These likelihoods provide statistical information that InfEHR combines with deep geometric learning to get a final result. The initial EHR graph is transformed by coalescing and connecting individual clinical events into learned higher order concepts (a learned graph), and successive processing layers render the learned temporal graph into a likelihood according to a specialized training objective (see Fig. 1 for details).

**Fig. 1: InfEHR is a framework for resolving clinical uncertainty using probabilistic and deep geometric learning on electronic health records (EHRs).**

Our framework performs according to the criteria important for clinical applications. InfEHR computes likelihoods specific to individual cases for any patient. It does not rely on the presence of specific information such as a particular lab result or vital measurement. It also removes the requirement for a clinician to know when a model can be validly applied, such as when, for example, a model assumes only certain pharmacological exposures and becomes unreliable otherwise, or is only valid in certain clinical settings. Implicit assumptions in the distribution of the training data may also invalidate a model. Even under valid conditions, discriminative models can still report high confidence but unreliable probabilities for uncertain cases. In contrast, the probabilities returned by InfEHR naturally reflect its certainty in the individual case, indicating to the clinician exactly when the EHR does not contain enough information to make an inference. We observe that accuracy scales with confidence for both positive and negative predictions in all experiment settings. However, there are still cases over which InfEHR makes high-entropy predictions (relative equiprobability between classes). As discussed later, such predictions may also reveal important phenotypic information concerning the instant case and are therefore still valuable in clinical decision-making.

Results

In the following, we demonstrate how InfEHR, a deep geometric learning framework, can resolve clinical uncertainty in two diagnostically challenging scenarios: diagnosing neonatal culture-negative sepsis (CN-S) and postoperative acute kidney injury (PO-AKI) risk assessment. We first provide a quantitative view of the ambiguity in these conditions following the application of traditional clinical heuristics. We then demonstrate that InfEHR can automatically derive more powerful heuristics by using its graph-based representation of EHRs to create probabilistic labeling heuristics. We show that the InfEHR GNN component, using a generative approach, can subsequently revise initial probabilities to achieve demonstrable clinical value in terms of rule-in and rule-out potentials. Finally, we present external validation, including against established models for irregularly sampled clinical time series, confirming InfEHR’s capabilities in providing substantial and clinically meaningful probability revisions, particularly effective for the challenging task of ruling in low-prevalence conditions.

Uncertainty in diagnosing neonatal culture-negative sepsis

The onset of inflammatory symptoms consistent with sepsis prompts empirical treatment with antibiotics²⁴. Following a positive blood culture result, the criteria for antibiotic cessation are established by individual clinical response and other prior knowledge concerning the infecting pathogen. However, timing antibiotic cessation in cases without positive results and ongoing symptoms poses a challenge. Blood cultures, the most specific tests, involve processing times with incubation periods of up to 72 h, and occasionally repeated samples to obtain reliable results²⁵. Factors including maternal exposure to antibiotics and low sample volumes also contribute to clinical skepticism of blood culture results, adding to a period of diagnostic ambiguity during the wait for confirmatory results. In other cases, such as culture-negative sepsis (CN-S), confirmatory results will never come.

In these situations, a clinician must decide how to interpret the negative result; specifically, whether the patient has CN-S, a condition in which underlying septicemia requires continued antibiotics, or the patient experiences inflammatory symptoms without an infectious cause²⁶. Failure to treat sepsis is an unacceptable risk; however, overuse of antibiotics is also associated with significant short- and long-term adverse outcomes²⁷,²⁸. Paradoxically, use of empiric antibiotics is also reported to increase risk of sepsis, necrotizing enterocolitis, or death (OR, 1.24; 95% CI, 1.17-1.31)²¹. The lack of consensus definition of CN-S or established clinical guidelines for its treatment complicates this situation and leads to highly idiosyncratic treatment patterns that are poorly supported by evidence²⁹,³⁰.

No known single biomarker or group of biomarkers specifically identifies CN-S, nor are there established criteria for pathogen identification based on clinical signs or measurements outside of a blood culture³¹. In neonates, rapid growth and development of organ systems also add variation to lab and vitals measurements, making it difficult to independently discern a disease process³², and many noninfectious conditions also lead to elevated inflammatory markers in neonates despite a lack of underlying infection³³. Serial observations and measurements over time are needed to form more cohesive clinical pictures²⁴. Meanwhile, the harms from inappropriate antibiotic exposure accrue. Exactly what evidence supports a CN-S diagnosis, and therefore continued antibiotics in the setting of a negative blood culture, is subject to ongoing debate²⁶. In our sample of 8015 antibiotic courses (from 6596 individuals), an average of 72 (std. 32) unique labs excluding blood cultures are measured with a mean 225 (std 369) total lab measurements per course (mean course duration 4.35 days); however, no consensus exists on how to use this information to resolve diagnostic ambiguity.

Temporal considerations add to the lack of consensus and diagnostic uncertainty. Although several publications suggest the use of C-reactive protein levels to identify CN-S cases, they recommend measuring it at differing times from the start of antibiotics (from 3 days to 7 days at the earliest) and against differing threshold levels³⁴,³⁵. As symptoms persist, diagnostic uncertainty increases as the interpretation of a negative culture result becomes increasingly ambiguous. To model this and show the feasibility of InfEHR to reduce clinical uncertainty in realistic settings, we extract EHR windows including 24 h of information prior to antibiotic administration and up to but excluding the penultimate dose. The resulting windows reflect the patient’s clinical condition prior to the decision point to cease antibiotics (see Fig. 2).

**Fig. 2: Clinical uncertainty in the setting of neonatal culture-negative sepsis.**

We compute the conditional probability of a diagnosis (over all labeled conditions, see Supplementary Materials) given only elapsed time to find that by day 7 the probability of a CN-S diagnosis rises to 0.19 from less than 0.03 at day 3, approaching parity with a Rule-Out Sepsis (ROS) diagnosis (Pr = 0.27) and reflecting high clinical uncertainty in the setting of ongoing symptoms with no positive culture result. The probability of an ROS diagnosis declines rapidly with time, whereas the probability of other diagnoses increases. We identify elapsed day 11 as the maximal window where all labeled conditions, including ROS, have nonzero probabilities. We therefore truncate all EHR windows to a maximum of 11 days to show that InfEHR can compute accurate probabilities when multiple possible sequences are present. The accuracy of InfEHR, as detailed later, is mostly independent of the length of the underlying EHR window (Spearman coefficient = −0.17, R-squared = 0.031, p < 0.05).

Although no significant correlation was found between truncation and accuracy, truncated CN-S sequences were more likely to result in high-entropy likelihoods (more than 31% of truncated CN-S sequences had predicted probabilities between 0.4 and 0.6, compared with 9% of full-length sequences). This suggests that InfEHR still reduces uncertainty on incomplete sequences, while the additional information in longer observations improves prediction confidence.
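The notion of a high-entropy likelihood can be illustrated with the Shannon entropy of a two-class prediction. The sketch below is a generic illustration; the helper names and the example probabilities are hypothetical, and only the 0.4 to 0.6 band is taken from the text.

```python
import math

def binary_entropy(p):
    """Shannon entropy, in bits, of a two-class prediction with P(positive) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def fraction_in_band(probs, low=0.4, high=0.6):
    """Share of predictions falling inside the near-equiprobable band."""
    return sum(low <= p <= high for p in probs) / len(probs)

# Hypothetical predicted probabilities, for illustration only.
example_probs = [0.10, 0.45, 0.55, 0.90]
band_share = fraction_in_band(example_probs)
```

Entropy peaks at one bit when both classes are equally likely and falls toward zero as a prediction approaches 0 or 1, so the share of predictions in the band is a simple proxy for how often the model remains undecided.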

An important point must be made about outcome label fidelity. There were 3657 EHRs corresponding to unique antibiotic courses manually labeled by a physician subject-matter expert. Following the lack of consensus definition for CN-S, the SME labeled a case as positive if the treating physician maintained the antibiotic course based on an intent to treat for CN-S. Unfortunately, due to the lack of objective requirements for identifying CN-S, the possibility exists that some patients, despite being treated for CN-S, did not actually have CN-S. Although these cases cannot be identified directly, if they exist, we hypothesize that they may comprise a subset of the cases with high entropy probabilities.

Uncertainty in assessing postoperative acute kidney injury risk preoperatively

Whereas static variables such as age, sex, and surgery type have been independently linked to postoperative acute kidney injury (PO-AKI)³⁶, these associations and their interactions seem to be mediated through individual factors that vary by clinical setting³⁷, which limits the use of generalized scoring indices or models in determining risk³⁸. Here we express the concept of risk as the patient’s likelihood of developing PO-AKI and accordingly compute the prognostic likelihood of developing PO-AKI. We also depart from other methods by considering only time-varying attributes of EHRs³⁹. Our prognostic likelihoods are therefore dynamic in nature and can be modified by evolving EHR information. The likelihood corresponds to a patient at a given time, subject to change with modifications to risk that occur throughout the clinical trajectory, e.g., increasing drug administration (in contrast to static risk indices).

The link between surgery type and uncertainty of PO-AKI runs along two lines: the clinical setting, which varies in the detail of monitoring and in the inherent perceived risk of PO-AKI, and the medical background, including age, of the surgical patient³⁶. Cardiac patients are routinely assessed for such risk following a high incidence of PO-AKI for these patients (40% of all cases)⁴⁰. Age has been identified as an independent risk factor in this setting; however, some cardiac surgeries seemingly carry greater risk than others, and it is not entirely clear how risks stemming from age relate to this⁴¹. Nonetheless, numerous models exist for computing risk scores in the cardiac setting (with variable success rates)⁴². Noncardiac surgery by comparison includes a more medically diverse group of individuals undergoing various types of surgeries with attendant differences in surgery lengths and other surgery-specific risk factors³⁷. The SPARK index, acknowledging this intrinsic uncertainty, attempts to create a generalized clinical risk index for this population³⁹. The index considers pharmacological factors such as RAAS blockade and physiological factors such as diabetes mellitus status and anemia³⁹,⁴³. The performance of SPARK has been shown to degrade in populations whose characteristics differ from those of the training cohort despite good performance in training and validated discovery cohorts³⁸. Although one study documents this performance degradation, other unidentified cases may exist, and it is not obvious to a clinician whether an instant patient is well covered by SPARK without detailed analysis of the alignment of a patient to the training distributions, which is clinically infeasible⁴⁴,⁴⁵. This underscores a general concern with clinical models, including indices, in that it is not always clear how well they fit an individual patient despite ostensibly satisfactory population-level performance⁴⁴.

The general approach to determining individual risk more precisely has been to produce increasingly specific models⁴⁶. But an examination of the full range of all possible lab measurements for patients within the dataset reveals the limitations of this approach. Some lab measurements cover no positive cases, whereas others are measured in clinical settings where PO-AKI risk is already high (i.e., most measured cases were PO-AKI positive) and so it is unclear what additional information a lab measurement may provide. In settings where the pre-measurement probabilities were less biased, the number of individuals with the measurement can be low⁴⁷,⁴⁸, an example of informed missingness leading to bias by indication⁴⁹. Of the labs without frequency preference for PO-AKI positive or negative patients, no lab has an overall measurement rate above 32%, and some labs, such as prothrombin mutation, have measurement rates below 1%. Our dataset also consists of a medically diverse population spanning diverse age ranges (min = 18, max = 92) undergoing surgeries over every physiological system (excluding reproductive systems)³⁶. This comprises a group of individuals with collectively high clinical uncertainty that is challenging to resolve using a single model despite established risk assessments for individual subsets⁴⁴.

In our dataset, 38.7% had presurgical eGFRs below 60; of those, only 37.1% developed PO-AKI, showing that substantial uncertainty remains even after establishing baseline estimations of kidney function. InfEHR can include information even from labs with low representation, along with other factors facing similar measurement-related constraints, such as vital information or medications provided, by finding commonalities across settings and holistically integrating individual information to resolve uncertainty. We assigned positive labels to all patients with AKIN scores > 1 at 72 h post-surgery and negative labels otherwise to design a trainable task. However, we report the likelihood of developing PO-AKI using only preoperative data. For a subset of positively labeled patients with low likelihoods returned by InfEHR, it is possible but unknown whether the factors involved in developing PO-AKI occurred primarily during surgery or immediately following the operation. In this case, the model output would be correct, whereas we currently count these cases as failures in recall. Similarly, where InfEHR was uncertain, it is possible that there was some risk enhancement but that unpredictable events in surgery ultimately led to AKI. Unfortunately, these patients cannot be discerned using preoperative data, a limitation of predicting outcomes preoperatively, where in-surgery events may modify the outcome.

Performance of Clinical Heuristics Under Conditions of Uncertainty

Diagnostic complexity and the constraints of clinical practice limit the range of information a practitioner can use in medical decision-making⁵⁰. Heuristics, or decisional shortcuts, are an adaptive strategy to efficiently manage uncertainty under time constraints⁵¹. Heuristics integrate pattern recognition based on experience with pathophysiological reasoning to identify the most salient aspects of the instant case⁵². These cognitive operations resemble clinical tests in that they also have implicit pre- and post-test probabilities (typically unknown to the practitioner)⁴.

Here, we obtain such heuristics from practicing physicians and apply them over the empirical distribution of cases to quantify the implicit uncertainty involved in using them. We corroborated the physician-provided heuristics with literature to ensure that they are representative of general practice and not limited to institutionally specific patterns of care. We apply the heuristics to the EHR windows as described above to compute estimates at clinically relevant points of care and examine their probabilistic outputs over the empirical distributions of cases in each dataset.

To quantify the performance of the heuristic, we compute the rule-in and rule-out potentials⁵³. Rule-in and rule-out potentials are prevalence-independent metrics based on the characteristics of probability distributions produced by a clinical test. They quantify a diagnostic test’s innate capacity to revise disease probability before testing occurs. Unlike likelihood ratios, which characterize performance only after a specific test result is known, these potentials predict how effectively a test or heuristic will revise disease probability for an average subject prior to performing the test. A rule-in potential of 2.0 indicates patients with disease are, on average, twice as likely to be correctly identified after testing, while a rule-out potential of 2.0 means patients without disease are twice as likely to be correctly excluded. Together the rule-in and rule-out potentials express the information gain provided by a test relative to the two poles of clinical decision-making⁵⁴. Rule-in and rule-out potentials capture a test’s inherent discriminative power rather than its observed performance on a particular dataset alone (including prevalence differences). We additionally report sensitivity and specificity, which are also prevalence-independent⁵⁵ (see Tables 1–3). Although these metrics do not capture probability revision capacity or information gain, they do measure performance within already-known groups (with disease/without disease), providing an additional view on a test’s innate capacity to confirm disease presence. Although we emphasize these dataset-independent measures, we also plot the empirical cumulative distribution functions by disease status (see Figs. 2 and 3) to illustrate the performance distribution across our specific datasets.
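The distinction between prevalence-independent and prevalence-dependent metrics can be sketched with the standard confusion-matrix definitions and Bayes' rule. The counts below are hypothetical, and the rule-in and rule-out potentials themselves are not reproduced here because their formula is not given in this section.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard definitions from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "fnr": fn / (tp + fn),
    }

def ppv_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value implied by Bayes' rule at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical counts, for illustration only.
metrics = confusion_metrics(tp=60, fn=40, fp=18, tn=882)

# Same sensitivity and specificity, two different prevalences.
ppv_rare = ppv_from_rates(0.60, 0.98, 0.03)
ppv_common = ppv_from_rates(0.60, 0.98, 0.21)
```

Holding sensitivity and specificity fixed while lowering prevalence lowers the implied positive predictive value, which is why ruling in a low-prevalence condition is the harder task.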

**Fig. 3: Clinical uncertainty in the setting of predicting postoperative acute kidney injury (PO-AKI) preoperatively.**

Table 1 InfEHR Performance: CN-S (MSHS)

Table 2 InfEHR Performance: PO-AKI (MSHS)

Table 3 PO-AKI Prediction Performance (UCIMC)

CN-S heuristic

Single-point assessments of clinical signs have little value in diagnosing CN-S given that multiple sources of variation in the perinatal environment underlie the observed signs⁵⁶. No uniform guidelines exist for serial measurements or their interpretation²⁴. Biomarkers, including inflammatory biomarkers, generally have low sensitivity and specificity for CN-S, but some literature supports some positive predictive value for C-reactive protein (CRP)³⁴. Accordingly, we use a clinician-provided heuristic using the minimum, maximum, and average CRP concentration within the prediction window. CRP levels have been shown to vary according to gestational age; however, including this information did not improve heuristic performance³⁵. CRP is not uniformly measured in all CN-S cases (18% of CN-S cases had no CRP measure)⁵⁷. The rule-in and rule-out potentials indicate that the CRP-based heuristic did not reduce uncertainty (rule-in potential = 1.13, rule-out potential = 1.01). It is possible that specific measurement timings or other parameters of CRP measurement may enhance the clinical value of CRP in diagnosing CN-S; however, there are no consensus guidelines for these parameters. This ambiguity contributes to the low diagnostic potential and an NPV of 0.85. The empirical cumulative distribution functions (ECDFs) indicate some separation between the classes, suggesting that CRP may hold more diagnostic information than our heuristic captures and than we estimate a clinician could, on average, extract.
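The three summary inputs to the CRP heuristic can be sketched as below. The `crp_features` helper, the mg/L unit, and the hour-based window bounds are assumptions for illustration; how the heuristic maps these inputs to a probability is not specified in this section and is not reproduced.

```python
def crp_features(measurements, window_start, window_end):
    """Summarize CRP inside a prediction window.

    measurements: list of (hours_since_antibiotic_start, crp_value) pairs,
    with crp_value assumed to be in mg/L. Returns None when no CRP was
    measured in the window, which does occur for some courses.
    """
    values = [v for t, v in measurements if window_start <= t < window_end]
    if not values:
        return None
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }

# Hypothetical course: window from 24 h before antibiotics to day 11 (264 h).
features = crp_features(
    [(-10, 4.0), (20, 12.0), (70, 8.0), (300, 50.0)], -24, 264
)
missing = crp_features([], -24, 264)
```

Returning `None` rather than a default value keeps unmeasured courses distinguishable from courses with low CRP.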

PO-AKI Heuristic

Serum creatinine is commonly measured across various preoperative settings (100% measurement rate for patients in the dataset) and is also used in validated predictive models or indices for PO-AKI in cardiac and noncardiac surgery³⁷,⁴². We are unaware of any clinical signs validated for use in PO-AKI prediction and therefore did not consider any for heuristic use. Of the remaining 100 labs common to all patients in the dataset (less than 21% of all labs measured), none were recognized by a physician as having added prognostic value. We include length of stay measured in fractional days along with serum creatinine, since this enhanced heuristic performance by serving as a nonspecific index of case severity⁵⁸. Mean serum creatinine, last preoperative serum creatinine, and length of preoperative stay were used to predict PO-AKI (AKIN > 1)³⁶. The returned probability can be considered as a risk index⁵⁹. This heuristic has a rule-in potential of 1.32 and a rule-out potential of 1.16. The heuristic reduces some uncertainty, especially with some positive cases, but a subset of positive cases remains that the heuristic was unable to identify (NPV = 0.82, and as seen in the ECDFs, Fig. 3).
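The three inputs to the PO-AKI heuristic can be computed as in the following sketch. The `poaki_features` helper, the mg/dL unit, and the example timestamps are hypothetical; the model that turns these inputs into a probability is not described here and is left out.

```python
from datetime import datetime

def poaki_features(creatinine, admitted_at, surgery_at):
    """Compute heuristic inputs from preoperative data only.

    creatinine: list of (datetime, value) pairs, value assumed in mg/dL.
    """
    preop = sorted((t, v) for t, v in creatinine if t < surgery_at)
    if not preop:
        raise ValueError("no preoperative creatinine available")
    values = [v for _, v in preop]
    stay_days = (surgery_at - admitted_at).total_seconds() / 86400.0
    return {
        "mean_creatinine": sum(values) / len(values),
        "last_creatinine": values[-1],
        "preop_stay_days": stay_days,
    }

# Hypothetical patient; the third measurement is postoperative and is ignored.
example = poaki_features(
    creatinine=[
        (datetime(2024, 1, 1, 9), 1.0),
        (datetime(2024, 1, 2, 9), 1.5),
        (datetime(2024, 1, 3, 9), 3.0),
    ],
    admitted_at=datetime(2024, 1, 1, 8),
    surgery_at=datetime(2024, 1, 2, 20),
)
```

Filtering on the surgery time before computing any statistic keeps postoperative values out of a score that is meant to be preoperative.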

The heuristic performances illustrate recurrent challenges in identifying positive cases (PPV = 0.69, FNR = 0.79). Although additional information is apparently required to resolve uncertain likelihoods, exactly which additional information and how it can be integrated is unclear. It is also unclear to a clinician when a heuristic may not apply to a given patient even if overall performance is otherwise known. In our examples, false negatives can be explained not only by a general lack of discriminatory capacity of the heuristic (CN-S) but also by poor fidelity of an otherwise performant heuristic to the individual case (PO-AKI). As explained in more detail next, InfEHR both automatically creates heuristics built on the particularities of individual cases and integrates the array of generated heuristics into a single predicted probability distribution that can later be further resolved to enhance rule-in and rule-out potentials.

Automatic generation of heuristics from EHR graphs

An automatic process was used to transform individual EHRs into temporal graphs where clinical events are identified and represented as nodes and connected to each other by edges according to temporal relationships in the record (detailed below). We use the graph structure to generate automatic heuristics in a two-step process. First, a graph neural network (GNN), trained on an unsupervised objective, encodes the EHR graph into a compact numerical representation. The representation is designed to capture distinguishing features of an EHR graph (such as clinical event composition or the existence of certain temporal dynamics) and assign them to a coordinate in a high-dimensional semantic space (the representation). The resulting coordinate space semantically aligns graphs (generated from patients) according to similarities in distinguishing features. By taking the sample of expert-provided labels (110 total) and fitting a distance-based label propagation algorithm, we can presumptively label cases by a learned spatial distance metric⁶⁰. This provides an initial set of labels for all the records according to a structural and spatial view of similarity among individuals as expressed through their encoded graph representations. We further evaluate the self-supervised representations quantitatively by checking whether the nearest neighbor of each representation shares its ground truth label, summarized as mean average precision at 1 (MAP@1). The score, ranging between 0 and 1, reflects the fraction of times the nearest neighbor for a given representation shares the same ground truth label (MAP@1 CN-S: 0.79, MAP@1 PO-AKI: 0.71).
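The nearest-neighbor evaluation and the idea of spreading a few expert labels through the embedding space can be sketched as follows. Both functions are simplified stand-ins: plain Euclidean distance replaces the learned metric, and copying the label of the nearest labeled seed replaces the distance-based label propagation algorithm. The embeddings and labels are toy values.

```python
import math

def nearest_neighbor_agreement(embeddings, labels):
    """Fraction of points whose nearest other point has the same label."""
    hits = 0
    for i, vec in enumerate(embeddings):
        others = (j for j in range(len(embeddings)) if j != i)
        nearest = min(others, key=lambda j: math.dist(vec, embeddings[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(embeddings)

def propagate_from_seeds(embeddings, seeds):
    """Assign each point the label of its nearest labeled seed.

    seeds: dict mapping point index to an expert-provided label.
    """
    result = []
    for i, vec in enumerate(embeddings):
        if i in seeds:
            result.append(seeds[i])
        else:
            nearest = min(seeds, key=lambda s: math.dist(vec, embeddings[s]))
            result.append(seeds[nearest])
    return result

# Toy 2-D embeddings; real representations are high-dimensional.
embeddings = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (0.2, 2.5)]
truth = [0, 0, 1, 1, 1]

agreement = nearest_neighbor_agreement(embeddings, truth)
presumptive = propagate_from_seeds(embeddings, {0: 0, 2: 1})
```

In this toy example the last point sits closer to the negative cluster than to its own class, so it lowers the agreement score and receives a wrong presumptive label, which is the kind of error the later steps are meant to absorb.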

We use the presumptive labels from the spatial representations to derive new labeling heuristics. We randomly select positive and negative cases from the 110-member sample. Next, we identify highly connected nodes within the EHR graphs corresponding to these selected cases and extract the 1-hop neighborhood of such nodes. Given that EHR graphs consist of observed relationships between clinical entities (expressed as edges between nodes), the resulting extracted subgraph then contains potentially identifying clinical relationships. If the relational structure preferentially distributes in EHR graphs from a given label, we use the relationship to heuristically assign labels. Therefore, elements of the graph structure itself are used as weak labeling heuristics.
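Selecting highly connected nodes and extracting their immediate surroundings can be sketched as below. The event names are hypothetical, and this version keeps only edges that touch the hub (a star around it); whether edges among the hub's neighbors are also kept is not stated in the text.

```python
from collections import Counter

def top_hubs(edges, k=1):
    """Return the k nodes with the highest degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [node for node, _ in degree.most_common(k)]

def one_hop_subgraph(edges, hub):
    """Keep only the edges that touch the hub node."""
    return [(a, b) for a, b in edges if hub in (a, b)]

# Hypothetical EHR graph over event types.
graph_edges = [
    ("lab:CRP", "med:ampicillin"),
    ("lab:CRP", "vital:heart_rate"),
    ("lab:CRP", "lab:WBC"),
    ("lab:WBC", "med:gentamicin"),
]

hubs = top_hubs(graph_edges, k=1)
neighborhood = one_hop_subgraph(graph_edges, hubs[0])
```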

To identify these structures, we use the labels obtained by spatial propagation over the high-dimensional representations of the EHR graphs (as described above) to assign labels to individual nodes. We assign the node label according to its class association. We can identify relational structures from the subgraph using these labels according to a simple procedure: when two or more connected nodes share the same label, we treat the substructure as a labeling heuristic that assigns the common node label to any graph where it is present. To improve the accuracy of these weak heuristics, we check if any of the connected nodes are also connected to nodes of opposite label. In this case, we remove any individuals identified by both the same label and opposite label substructures from labeling by the heuristic, limiting the application to cases without known contradictory information.
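One possible reading of this procedure is sketched below: an edge whose two endpoints share a node label becomes a labeling heuristic, and the heuristic abstains on any patient graph where one of its endpoints is also connected to a node of the opposite label. The node names, labels, and helper functions are hypothetical, and the abstention rule is an interpretation of the text rather than a confirmed implementation.

```python
def edge(a, b):
    """Undirected edge between two distinct clinical event types."""
    return frozenset((a, b))

def derive_heuristics(subgraph_edges, node_label):
    """Turn same-label edges into (edge, label) labeling heuristics."""
    heuristics = []
    for e in subgraph_edges:
        a, b = tuple(e)
        if a in node_label and node_label[a] == node_label.get(b):
            heuristics.append((e, node_label[a]))
    return heuristics

def apply_heuristic(heuristic, patient_edges, node_label):
    """Return the heuristic's label, or None when it abstains."""
    e, label = heuristic
    if e not in patient_edges:
        return None
    for other in patient_edges:
        if other != e and other & e:
            for neighbor in other - e:
                if node_label.get(neighbor) == 1 - label:
                    return None  # contradictory evidence in this graph
    return label

# Hypothetical node labels: 1 = associated with positive cases, 0 = negative.
node_label = {
    "med:vancomycin": 1,
    "lab:CRP_high": 1,
    "lab:CRP_normal": 0,
    "vital:temp_normal": 0,
}
subgraph = [
    edge("med:vancomycin", "lab:CRP_high"),
    edge("lab:CRP_high", "lab:CRP_normal"),
    edge("lab:CRP_normal", "vital:temp_normal"),
]
heuristics = derive_heuristics(subgraph, node_label)

patient_a = {edge("med:vancomycin", "lab:CRP_high")}
patient_b = patient_a | {edge("lab:CRP_high", "lab:CRP_normal")}
```

Here the positive heuristic labels `patient_a` but abstains on `patient_b`, whose graph also links a positive node to a negative one.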

We generate 2400 negative and 396 positive labeling heuristics for CN-S and 4307 negative and 982 positive labeling heuristics for PO-AKI. We observed that many such automatically generated heuristics are corroborated by findings in the literature (see Supplementary Materials for specific examples). Despite filtering for contradictions within heuristics (e.g., if an individual is predicted to be both positive and negative), there remain substantial contradictions between heuristics. However, the heuristics can still be effectively combined following Ratner et al.⁶¹,⁶² to produce performant initial probabilistic estimates. These probabilistic labels represent coalesced information from whole EHR-to-EHR comparisons (i.e., labels derived from the self-supervised embeddings of EHR graphs) and specific conditional relational structures (the weak labeling heuristics). To improve the information in the resulting probability distribution, we obtain a final distribution by randomly subsampling and combining 100 positive and 100 negative labeling heuristics. We repeat this process until there is only a minimal change in the average entropy of the resulting averaged likelihood distributions.
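The subsample-and-average loop with an entropy-based stopping rule can be sketched as follows. This is a simplified stand-in: each round combines the sampled heuristics with a smoothed vote count rather than the label model of Ratner et al., and the vote matrices, tolerance, and round cap are hypothetical. The text samples 100 heuristics per class; the toy example samples 2.

```python
import math
import random

def binary_entropy(p):
    """Shannon entropy, in bits, of a two-class probability."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def round_probabilities(pos_votes, neg_votes, k, rng):
    """One round: sample k heuristics per class and return smoothed vote shares."""
    pos = rng.sample(pos_votes, k)
    neg = rng.sample(neg_votes, k)
    probs = []
    for i in range(len(pos_votes[0])):
        n_pos = sum(h[i] for h in pos)
        n_neg = sum(h[i] for h in neg)
        probs.append((n_pos + 1) / (n_pos + n_neg + 2))
    return probs

def aggregate(pos_votes, neg_votes, k, tol=1e-3, max_rounds=200, seed=0):
    """Average rounds until the mean entropy of the average stops changing."""
    rng = random.Random(seed)
    n = len(pos_votes[0])
    totals = [0.0] * n
    previous = None
    for rounds in range(1, max_rounds + 1):
        for i, p in enumerate(round_probabilities(pos_votes, neg_votes, k, rng)):
            totals[i] += p
        averaged = [t / rounds for t in totals]
        mean_entropy = sum(binary_entropy(p) for p in averaged) / n
        if previous is not None and abs(mean_entropy - previous) < tol:
            break
        previous = mean_entropy
    return averaged

# Each row is one heuristic; entry i is 1 if it fires on patient i, else 0.
pos_votes = [[1, 0, 0], [1, 0, 1], [1, 0, 0]]
neg_votes = [[0, 1, 0], [0, 1, 1], [0, 1, 0]]
likelihoods = aggregate(pos_votes, neg_votes, k=2)
```

Patients hit by heuristics of only one class settle at a confident value, while a patient hit by conflicting heuristics stays nearer the middle, so disagreement between heuristics is carried forward as uncertainty rather than discarded.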

The resulting distributions are more performant in identifying positive cases than clinician-provided heuristics in both disease settings (PO-AKI: FNR 0.48 vs. 0.79 clinician heuristic, CN-S: FNR 0.67 vs. 0.95 clinician heuristic). The machine-generated heuristics also retain uncertainty in their distributions (ECDF Pr(0.2 − 0.8) CN-S: 0.51, PO-AKI: 0.43). This allows the GNN component of InfEHR to learn uncertainty-reducing temporal dynamics, in part by identifying cases with ambiguous label dispositions (high entropy label probabilities) and using features derived from low entropy cases to resolve them. This process would be significantly impaired by overly confident input distributions with low positive case recall (particularly in the setting of rare or infrequent disease). The distribution is used as prior information in the loss function of the GNN, which ultimately reduces individual entropy, as explained below and in Fig. 4.
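The two summary quantities quoted above can be computed as in the following sketch. It assumes that "ECDF Pr(0.2 − 0.8)" means the share of cases whose probability lies between 0.2 and 0.8, that is F(0.8) − F(0.2) for the empirical CDF F, and that the FNR is taken at a 0.5 threshold; both are our reading, not a definition given in the text.

```python
import numpy as np


def mass_between(probs, lo=0.2, hi=0.8):
    """Empirical-CDF mass F(hi) - F(lo): share of cases with lo < p <= hi."""
    probs = np.asarray(probs, dtype=float)
    return float(np.mean((probs > lo) & (probs <= hi)))


def false_negative_rate(probs, labels, threshold=0.5):
    """Share of true positives predicted below the threshold (threshold is an assumption)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    positives = labels == 1
    missed = positives & (probs < threshold)
    return float(missed.sum() / positives.sum())
```

A larger `mass_between` value means more cases are left in the uncertain middle of the distribution for the GNN to resolve.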

**Fig. 4: Initial uncertainty resolution through automatic probabilistic labeling.**

Deep Geometric Learning Resolves Prior Uncertainty

The machine-generated likelihoods derive from clinical relationships encoded into the structure of EHR graphs, expressed here as a connection between two or more nodes each representing a clinical event. We expand the representational capacity of an EHR graph by including (1) semantic encodings for each clinical event as attributes (assigned to each node in the EHR graph) and (2) time stamp encodings which can be combined with the semantic encoding in (1) to create attributes reflecting individual temporal contexts. These attributes build on the existing temporal structure, encoded by the collected nodes and edges of an EHR graph, into a format suitable for learning phenotypic temporal dynamics at scale. The semantic encodings in (1) automatically reproduce clinical knowledge in terms of spatial distance, e.g., the nearest neighbors of an encoding for a certain respiratory rate include representations for pulse oximetry. We also observe less established but nonetheless corroborated clinical relationships such as the appearance of encodings for eosinophil measurements within the near neighborhood of ejection fractions (additional information provided in Methods). The detailed and individualized consideration of the semantic and temporal information added to the EHR graphs through these encodings is executed by training a GNN to compute probabilities from individual EHR graphs using the previous machine-generated estimates as priors to be resolved.
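As an illustration of how a semantic encoding and a time stamp encoding might be combined into a single node attribute, consider the sketch below. The sinusoidal form of the time encoding, the concatenation, and all dimensions are assumptions made for the example; the text above does not specify how InfEHR combines the two encodings.

```python
import numpy as np


def time_encoding(hours, dim=8, max_period=24.0 * 14):
    """Sinusoidal encoding of elapsed time in hours (illustrative choice)."""
    hours = np.asarray(hours, dtype=float)[:, None]
    half = dim // 2
    freqs = 1.0 / (max_period ** (np.arange(half) / half))
    angles = hours * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def node_attributes(semantic, hours, dim=8):
    """Concatenate each node's semantic embedding with its time encoding."""
    semantic = np.asarray(semantic, dtype=float)
    return np.concatenate([semantic, time_encoding(hours, dim)], axis=1)
```

With this construction, two occurrences of the same clinical event share the semantic part of their attributes but differ in the temporal part, so the same event can be represented in its individual temporal context.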

We train the InfEHR GNN (see Fig. 5 for architectural details) under a specialized loss function similar to⁶³, which implicitly models the likelihood

$$P\left({EHR} | {Condition}\right)$$

(1)

following a generative modeling framework. This approach ultimately yields predicted probabilities over conditions given EHR data in the form of a posterior distribution, mirroring in mathematical terms the colloquial use of the phrase “likelihood of disease.” While the InfEHR GNN outputs what is formally a discriminative posterior

$$P\left({Condition} | {EHR}\right)$$

(2)

our training procedure minimizes the Kullback-Leibler divergence between the model output and a theoretically motivated generative posterior:

$${P}_{{generative}}\left({Condition} | {EHR}\right)\propto P\left({EHR} | {Condition}\right)\cdot P\left({Condition}\right)$$

(3)

**Fig. 5: Resolution of clinical uncertainty with InfEHR.**

This infuses our predicted probabilities with the uncertainty-resolving benefits of the generative approach without requiring explicit generative modeling.
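A numerical sketch of Eq. (3) and the divergence it is paired with may help. The normalization over conditions follows from the proportionality in Eq. (3); the direction of the Kullback-Leibler divergence and the example numbers are assumptions for illustration, not details from the paper.

```python
import numpy as np


def generative_posterior(likelihood, prior):
    """Normalize P(EHR | Condition) * P(Condition) over conditions (Eq. 3)."""
    joint = np.asarray(likelihood, dtype=float) * np.asarray(prior, dtype=float)
    return joint / joint.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions along the last axis."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)


# Hypothetical case: the EHR is three times more likely under the condition,
# but the condition has a 3% prior, ordered as (negative, positive).
target = generative_posterior(likelihood=[0.2, 0.6], prior=[0.97, 0.03])
model_output = np.array([0.5, 0.5])  # an uninformed model prediction
loss = kl_divergence(target, model_output)
```

Here the target probability of the condition is about 0.085 despite the favorable likelihood, because the low prior enters the product directly; a model output of 0.5 is penalized accordingly.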

During training the GNN simultaneously learns graph features and the likelihood of the graph features under assumption of a given disease or physiological state. Through the learning process, the GNN transforms the naive input EHR graph into a smaller subgraph. The nodes from the original graph structure are reconfigured and coalesced into new nodes through a learned assignment matrix. The result is that nodes in the subgraph represent compound clinical events derived from the original set of individual events. The new nodes are connected by weighted edges that summarize the connectivity of the original naive EHR graph in the learned subgraph. The resulting subgraph undergoes additional processing to determine the likelihood of its graph features under the clinical premise. A key aspect of this process is that the graph distillation and feature assessment are learned simultaneously. This allows the GNN to learn, without human interference or bias, what subsets and subsequences of clinical events are most relevant to the problem setting (the graph distillation) and how relational structures between them (processing the learned subgraph) support or contradict a clinical premise.
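The coarsening step described above resembles differentiable pooling with a soft assignment matrix. The sketch below shows that general operation in NumPy; whether InfEHR uses exactly this form is an assumption, and in practice the assignment logits would be produced by a trained GNN layer rather than supplied directly.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def pool_graph(X, A, assign_logits):
    """Coarsen a graph with a soft assignment matrix.

    X: (n_nodes, n_features) node attributes.
    A: (n_nodes, n_nodes) adjacency matrix.
    assign_logits: (n_nodes, n_clusters) scores (learned in a real model).
    """
    S = softmax(assign_logits, axis=1)  # each row sums to 1
    X_pooled = S.T @ X                  # features of compound clinical events
    A_pooled = S.T @ A @ S              # weighted edges between compound events
    return X_pooled, A_pooled, S
```

Because each row of the assignment matrix sums to one, the total edge weight of the original graph is preserved in the pooled adjacency, which is one sense in which the new edges summarize the original connectivity.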

Although the parameters required for the graph distillation and likelihood computation are learned batchwise, the learned GNN parameters are applied to individual EHR graphs and use the specifics of each case to compute highly accurate predicted probabilities; we make no assumptions or requirements about shared clinical trajectories or backgrounds among patients. The resulting predictions markedly outperform the clinician-provided heuristics with superior rule-in and rule-out potentials in both the PO-AKI and CN-S task settings. The resulting rule-in potentials for CN-S of 16.018 (Train) and 12.49 (Val) compared to 1.013 (clinician-provided) show that the GNN is respectively about 16 and 12 times more likely than the clinician-provided heuristic to identify the average positive case. This finding is also reflected in the PPV and NPV, which are respectively about 2.6 (Train), 2.9 (Val.), and 1.16 (Train), 1.13 (Val.) times greater than the clinical heuristic (see Tables 1 and 2). This partly represents the extent of resolvable uncertainty within the clinical setting and the benefit of additional EHR information, which is otherwise not readily accessible to clinicians. The GNN produces comparable results in the setting of PO-AKI risk assessment, with 2.5 times greater rule-in potentials and 2.13 times greater rule-out potentials. The GNN also reduces the FNR, as reflected by a 1.12-fold increase in NPV. This does come at a small expense of slightly higher false positivity; the GNN PPV is about 2% lower than the clinical heuristic while making higher overall true positive predictions (identifying 71% of all positive cases compared to only 19% by the clinical heuristic). Figure 5 provides additional information.

Validation of InfEHR on an external dataset

We compare InfEHR’s performance against two established models for irregularly sampled clinical time series: SeFT⁶⁴ and GRU-D⁶⁵. SeFT employs differentiable set function learning, treating EHR variables as an unordered collection of entities. This approach parallels InfEHR’s graph structure, where nodes similarly form a set of clinical entities.

GRU-D consists of a gated recurrent network that operates on temporally aligned sequences, computing hidden states sequentially at fixed interval lengths from regularly sampled time points. Whereas SeFT does not directly use sequence information, InfEHR explicitly encodes temporal relationships within both node embeddings and edge structures. However, in contrast to GRU-D’s fixed-interval processing, InfEHR captures temporal dynamics without an explicit and predetermined time discretization.

These models were then evaluated as to their ability to predict PO-AKI using EHRs contained in the Medical Informatics Operating Room Vitals and Events Repository (MOVER, public credentialed access)⁶⁶. This dataset includes a medically diverse population of patients undergoing surgical procedures at the University of California, Irvine Medical Center (UCIMC). We obtained n = 2427 patients for whom we could both assign PO-AKI status (AKIN scoring system) and apply the clinical heuristic; among these patients, n = 261 positive cases were identified (~10% prevalence in UCIMC dataset versus n = 879 positive cases, or ~21%, in the Mount Sinai Health System (MSHS) dataset).

To evaluate the InfEHR framework on this dataset: (1) We constructed EHR graphs as previously described for all UCIMC patients. We transferred knowledge from MSHS by aligning UCIMC EHRs with those from MSHS and extracting clinical events as learned from MSHS data forming the nodes and setting as their attributes the embeddings obtained from the previously learned manifold. (2) The UCIMC dataset did not contain clinical notes. To generate prior probabilities, we processed all MSHS graphs and ablated any nodes originating from note sources and then retrained the GNN and obtained prior probabilities for each EHR graph in the UCIMC dataset (the input distribution). Although GRU-D and SeFT do not extract events from the EHR, we supplied all parent variables (e.g., systolic blood pressure) so that the datasets for SeFT and GRU-D contained the same variables. We used the same input probability distribution for training all models.

Deep neural networks are known to produce idiosyncratic and miscalibrated probability distributions²⁵. Even though these probabilities are internally coherent (e.g., a given model may systematically overestimate probabilities but maintain correct internal ranking of instances resulting in a performant AUC), the raw probabilities are not directly comparable across models. Following recommendations and guidelines in refs. ²⁷ and ⁵⁰ to facilitate model comparison, we applied calibration-in-the-large to map uncalibrated model probabilities to a common semantic standard without changing the model’s inherent discriminative properties prior to computing metrics.
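Calibration-in-the-large is commonly implemented as a single intercept shift on the logit scale, chosen so that the mean predicted probability matches a target event rate. The sketch below shows that general technique; the bisection solver and the choice of target rate are assumptions, since the exact procedure used for these models is not described in this section.

```python
import numpy as np


def logit(p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p) - np.log1p(-p)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def calibrate_in_the_large(probs, target_rate, iters=100):
    """Shift all logits by one constant so the mean probability equals target_rate."""
    z = logit(np.asarray(probs, dtype=float))
    lo, hi = -50.0, 50.0
    for _ in range(iters):  # bisection on the intercept shift
        mid = 0.5 * (lo + hi)
        if sigmoid(z + mid).mean() < target_rate:
            lo = mid
        else:
            hi = mid
    return sigmoid(z + 0.5 * (lo + hi))
```

Because the same constant is added to every logit, the ordering of cases is unchanged, so rank-based measures such as AUC are unaffected while the probabilities move onto a shared scale.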

We computed rule-in and rule-out potentials at n = 100 thresholds and report results (1) at a common probability threshold of 0.5 to enable fair comparison of the models’ discriminatory capacity, and (2) at model-specific optimal thresholds to demonstrate the maximum achievable performance when tuned for specific diagnostic goals (such as ruling-in (RI) or ruling-out (RO) a condition).

InfEHR consistently outperformed GRU-D and SeFT for rule-in applications, achieving 7.105 (vs. SeFT: 3.415, GRU-D: 4.745) at the 0.5 threshold and an optimal RI potential of 8.322 (vs. SeFT: 7.67, GRU-D: 6.706). For rule-out applications, InfEHR also outperformed both models at the 0.5 threshold (InfEHR: 2.700 vs. SeFT: 1.778, GRU-D: 2.269). However, when optimized for rule-out potential, GRU-D and SeFT achieved marginally higher performance (GRU-D: 3.964, SeFT: 3.935) than InfEHR (3.566). These results indicate that InfEHR is capable of significant probability revision, particularly for the clinical task of ruling in a condition, which is both central and challenging under low prevalence (see Fig. 6 for additional evaluation and Table 3).

**Fig. 6: InfEHR revises priors obtained from different domains to local distributions without human intervention.**

Discussion

Low disease incidence intrinsically limits the diagnostic power of common individual tests⁶⁷. Similarly, underlying risk accumulation may proceed through highly conditional structures that can make individual risk difficult to assess since individual variation within those structures may lead to correspondingly divergent risks⁶⁸. These factors, along with limitations in time and existing knowledge, promote uncertainty in clinical decisions². Discriminative models, even under optimal configurations of model architecture and type, are poorly situated to resolve these uncertainties. Discriminative modeling consists of trying to predict a label given a data instance. Models trained under this framework learn to predict a label or class membership through incurring a penalty for incorrect predictions. As explained below, this approach is conceptually mismatched to the realities of EHR data and clinical medicine, especially in the setting of low-prevalence diseases, which are typically also the most clinically uncertain and where input from models is wanted most.

Discriminative models in medicine can fail when handling rare conditions due to dual challenges: sparsity in both the number of positive cases and their distribution in feature space⁶⁹. With few positive examples, the model struggles to learn the true variety of ways a condition can manifest. It may see only a narrow subset of possible presentations and optimizes for learning a decision boundary primarily on abundant negative cases. This limited exposure makes it difficult for the model to develop robust feature representations that capture the full spectrum of disease manifestations, and the decision boundary itself becomes problematic because it is primarily shaped by the dense regions of negative cases, with only rare positive cases to counterbalance this influence. As a result, the model may make highly confident but incorrect predictions.

Cases that fall far from the learned feature space of the training data but definitively on one side of a decision boundary represent a fundamental flaw in discriminative models⁷⁰. The model will assign high confidence simply based on the case’s position relative to the boundary, even with little evidence to support such certainty in sparse regions of the feature space. (Here, the model has seen few or no other cases with similar learned features.) Sparsity in the feature space may result from limited training examples (common in rare conditions such as CN-S) or in combination with the model’s poor inductive capacity to learn useful features (e.g., highly conditional networks underlying PO-AKI risk). This scenario creates a dangerous situation: a clinician reviewing the model’s output would have no way to know that the high confidence prediction comes from a region where the model has minimal experience.

Post-hoc calibration cannot solve this problem, because it only adjusts probability outputs without addressing the underlying issue: the model’s restricted understanding of how diseases manifest due to the nature of discriminative training^71,72. Consider a rare disease that presents in several distinct ways. If the training data captured only one type of presentation, the discriminative model builds its decision boundary around that presentation. When encountering a different but equally valid presentation of the same disease, the model might confidently misidentify it simply because the case falls on the “wrong” side of the boundary based on the learned features. Calibration techniques cannot fix this fundamental gap in the model’s knowledge because they work only with the feature space (the way a case is ultimately represented within the model) that the model has already learned. The calibration process cannot teach the model about alternative disease presentations it has never seen or lead to output probabilities that reflect when a model is operating with epistemic uncertainty, both of which severely limit the applicability of probabilities obtained from discriminative models in clinical settings, even when calibrated.

In contrast, the generative approach explicitly models how different disease processes could generate various presentations in the EHR and simultaneously weighs the consistency of a given case with the presumption of a given disease. Generative models incorporate disease prevalence as a core component of this reasoning: rare conditions require stronger evidence to overcome their low prior probability, just as clinicians maintain a higher threshold for diagnosing uncommon conditions. The likelihoods returned by generative models critically preserve uncertainty in cases where the learned EHR features are less consistent with disease presumptions. Rather than making decisions solely on feature patterns, the model considers both how well the patterns match each disease process and how likely each disease is to occur. Deep geometric learning provides an opportunity to learn highly informative and patient-specific EHR representations leading to well-resolved feature spaces. When combined with a generative approach, the InfEHR framework achieved a more nuanced and clinically valuable form of uncertainty quantification. For example, a case might have features somewhat consistent with a rare disease, but the model appropriately and automatically tempers its confidence based on the condition’s rarity, or the model may be inexperienced with an instant case, leading to similarly moderated probabilities.
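The effect of prevalence on the strength of evidence required can be made concrete with Bayes’ rule in odds form. In the sketch below the priors are the prevalences reported in this study (3% for CN-S and 21% for PO-AKI), while the likelihood ratio of 10 is a hypothetical value chosen only for illustration.

```python
def posterior_probability(prior, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = likelihood ratio * prior odds."""
    prior_odds = prior / (1.0 - prior)
    post_odds = likelihood_ratio * prior_odds
    return post_odds / (1.0 + post_odds)


# Same hypothetical evidence (likelihood ratio of 10) under two prevalences.
rare = posterior_probability(prior=0.03, likelihood_ratio=10.0)
common = posterior_probability(prior=0.21, likelihood_ratio=10.0)
```

Under these assumptions the same evidence yields a probability of roughly 0.24 at 3% prevalence and roughly 0.73 at 21% prevalence, which is the tempering of confidence by rarity described above.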

InfEHR is a generative modeling framework, which we have shown to dramatically reduce clinical uncertainty of individual cases using EHR data available at the time of clinical decision-making. Our framework automatically learns phenotypic temporal dynamics from EHRs through a graph neural net-based (GNN) approach. We learn high-dimensional representations of such graphs to compute the likelihood of an underlying latent disease given the representation (and the EHR it was derived from). The resulting likelihoods have properties important to any clinical test: the probability clearly identifies when the model is uncertain, confidence scales with accuracy, and the pretest probabilities are substantially revised^73,74. We express these characteristics quantitatively through high rule-in and high rule-out potentials benchmarked against real-world clinical heuristics. To our knowledge, we depart from existing deep geometric learning approaches to EHRs⁷⁵ by considering temporal graphs to (1) automatically derive clinically significant nodes and their embeddings from discrete and continuous variables within raw EHRs, (2) use a graph structure in a rules-generation engine to produce informative priors (enabling the use of minimally labeled examples), (3) learn whole temporal graphs and representations from EHRs using semi-supervised and unsupervised deep geometric learning, and (4) compute likelihoods from these representations for resolving clinical uncertainty in realistic conditions (e.g., enable the consideration of EHRs as they exist at the time of decision and not relying on diagnosis codes or other structures that may be unavailable at decision time).

This framework can also be used to automatically revise prior probabilities generated from previously trained models applied to new data with differing underlying prevalence of disease. This trait sets InfEHR apart from discriminative approaches for several reasons. First, underlying differences in prevalence typically manifest as degraded model performance in discriminative models. This can be mitigated by sufficient training that covers all potential disease manifestations, but low disease frequency intrinsically limits the availability of training examples. In contrast, InfEHR does not require large volumes of training data to prevent performance degradation, since its training method explicitly models prevalence in its likelihoods. Second, fine-tuning previously trained models to better reflect local prevalence conditions requires relabeling an entire dataset, which is often impossible. InfEHR instead revises prior probabilities automatically according to local prevalence conditions without human labeling. And third, InfEHR learns local feature distributions that can be combined with automatic inference of prevalence, which leads to high rule-in and rule-out potentials, which supports the generalizability of the framework to new settings whose clinical practices and patient populations differ.

InfEHR has a wide range of clinical applicability. It computes probabilities in disease settings with low (PO-AKI) and very low (CN-S) incidence of positive cases. Although the true global incidence of PO-AKI is presently unknown, a recent publication estimates that 18.4% of surgical patients will develop it (95% CI 17.7%–19.2%)³⁶; this study observed 21% in the training dataset (n = 4276, PO-AKI = 879) and 10% in the validation dataset (n = 2426, PO-AKI = 261). Estimates of global CN-S cases are more elusive; however, in the records obtained at MSHS, CN-S cases made up less than 4% of total cases (n = 3678 [known label status], CN-S = 137), which is consistent with ratios available in literature^24,76.

Although we emphasize performance in positive case identification in low-prevalence settings, InfEHR also effectively finds negative cases, given that it returns probabilities with high rule-in and high rule-out potentials, as explained below⁷⁷. We show the performance of InfEHR in EHRs obtained from MSHS on CN-S and PO-AKI, and validate the performance in PO-AKI prediction on EHRs from the University of California, Irvine Medical Center (UCIMC). The EHRs in all datasets are from patients with diverse medical and demographic profiles⁷⁸ (see Supplementary Materials for additional information). As explained in detail below, we also demonstrate the performance of InfEHR under clinically realistic conditions in terms of decision timing and EHR availability. This approach, requiring minimal human intervention, marks a significant advance in leveraging EHR data to support evidence-based medicine and reducing clinical uncertainty^20,21,79.

Identifying low-incidence cases is particularly challenging to clinicians and machines alike⁸⁰. The absence of uniform diagnosis and management, such as in diagnosing neonatal CN-S (where there is no specific confirmatory result or set of results), compounds these challenges. We benchmark InfEHR against a clinical heuristic for CN-S identification that emphasizes CRP, given that it is the only biomarker with specific guidelines for use in diagnosing CN-S³⁴. Absolute CRP levels in neonates are limited in their ability to distinguish physiologic from inflammatory response, as CRP is subject to natural developmental increases outside of inflammatory response³⁵. Further, various noninfectious conditions in the perinatal period can also induce an inflammatory response such as complicated labor and delivery, intraventricular hemorrhage, or tissue injury³³. These issues likely reduce the sensitivity of CRP in CN-S detection, since the heuristic, consistent with literature, shows low sensitivity (0.041) and correspondingly neutral rule-in and rule-out potentials (1.03 and 1.097)⁸¹. In contrast, InfEHR dramatically improved upon these results with sensitivity of 0.650 and high rule-in and rule-out potentials (Training = 16.018, 2.880, Validation = 12.492, 2.435). We observe recurrent and distinguishing CRP level temporal dynamics in some positive CN-S cases (see Fig. 1). This observation suggests value in considering temporal dynamics for separating physiologic from pathophysiologic responses⁸². However, identifying such dynamics, which are subject to individual variation within a conserved pattern, at scale exceeds the capacity of the heuristic and is challenging to traditional temporal models⁸³. 
But InfEHR automatically captures informative temporal dynamics without algorithmic or human pre-specification and can also consider interrelating variations between CRP and other biomarkers such as in CBC results which, by themselves, while reportedly low in sensitivity, may provide more information when integrated with other lab results. These findings, while promising, have not been externally validated owing to insufficient external data, including confirmed CN-S status.

The proliferation of models and clinical indices for the assessment of PO-AKI underscores the highly contingent nature of PO-AKI risk⁴⁶. Patients with normal serum creatinine do not present with clear postoperative risk and are highly represented in the false negative predictions of the clinical heuristic (FNR > 72%, mean creatinine = 0.79 mg/dL). As part of the workflow of the InfEHR framework, we automatically generate labeling rules from the EHR graph structures that are aggregated to produce prior probabilities⁶¹. The machine-generated prior probabilities outperform the clinical heuristic and perform better than the machine priors generated for CN-S. Given that the rules are generated from nodes that are in turn extracted from EHR sources, we analyze the composition of the rules in terms of node source (e.g., labs, vitals, medications, or clinical notes)⁸⁴. PO-AKI rules selected for aggregation had higher mean presence of nodes from medications and clinical notes (22% and 39%, respectively) compared to aggregate rules for CN-S (8%, 17%), which had more representation from vitals and lab results or from unselected rules for PO-AKI (13%, 31%). Taken together with the decrease in false negatives (47.5% vs. 79.2% clinical heuristic), these results suggest that InfEHR can learn conditional relational structures that better identify risk in patients who otherwise had low or no indicated risk per the creatinine-based clinical heuristic⁷⁵. The GNN component further reduces this uncertainty to produce better rule-in potentials (3.613 GNN vs. 1.322 heuristic)⁸⁵. Although the PPV is similar, the higher recall of positive cases at greater NPV (0.924 vs. 0.824) shows that InfEHR can successfully estimate the required contingencies needed to make good PO-AKI risk assessments and outperform the clinical heuristic (see Supplementary Materials).

InfEHR produced high and comparatively better rule-in and rule-out potentials in the UCIMC dataset (rule-in: 7.105, rule-out: 2.700) than in the MSHS dataset. Other metrics that broadly support improvement to the input distribution of probabilistic labels (consistent with the observed better rule-in and rule-out potentials) were also examined. The beta distribution with parameters set to match the prevalence of the empirical sample served as a null model. The clinical heuristic’s close alignment with the beta distribution (0.8129), combined with its low negative log loss (0.0360) but high positive log loss (3.2807), reveals that the heuristic makes predictions in a “calibrated conservative” pattern that readily rules out but requires more evidence to rule in.

In contrast, the revised InfEHR GNN probabilities diverged from the beta distribution (1.5930) while achieving superior performance, suggesting that the model has learned additional signal in the data beyond what clinical heuristics capture, enabling more confident predictions in both positive and negative cases while maintaining accuracy. The increase in perplexity from the heuristic (1.0730) to the InfEHR GNN’s revised probabilities (1.5327) indicates that the model learned to express more varied probabilities across cases. When viewed alongside the improved discriminative metrics, this suggests the higher perplexity reflects better-calibrated probability assignments that more accurately capture case-specific uncertainty while maintaining strong overall performance, which broadly indicates that InfEHR can learn effective features from EHRs automatically without requiring expert-provided labels. The positive log loss showed marked improvement over the clinical heuristic (3.2807 to 1.0828), reflecting enhanced accuracy specifically in positive case prediction, which is the objective of InfEHR, though with a measured increase in negative log loss (0.0360 to 0.2839).
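For readers who wish to reproduce metrics of this kind, the sketch below computes class-conditional log loss and a per-case perplexity. The definitions are assumptions: positive and negative log loss are taken as the mean negative log probability assigned to the true class within each class, and perplexity as the mean exponentiated binary entropy (in nats), which ranges from 1 for fully confident predictions to 2 for maximally uncertain ones. The exact formulas used in the study are not given in this section.

```python
import numpy as np


def class_log_losses(probs, labels, eps=1e-12):
    """Mean negative log-likelihood, computed separately for positives and negatives."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    y = np.asarray(labels)
    pos_loss = float(-np.log(p[y == 1]).mean())
    neg_loss = float(-np.log(1 - p[y == 0]).mean())
    return pos_loss, neg_loss


def mean_perplexity(probs, eps=1e-12):
    """Mean of exp(binary entropy in nats) over cases (assumed definition)."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    h = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return float(np.exp(h).mean())
```

Under these definitions, a model that assigns uniformly low probabilities achieves a small negative log loss and a perplexity near 1 but a large positive log loss, which is the pattern described for the clinical heuristic.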

InfEHR yields probability distributions with high rule-in and rule-out potentials, exceeding clinical heuristics in all settings (PO-AKI at both UCIMC and MSHS) and CN-S (in Training and Validation splits), because it approaches learning from a generative perspective and supplies a wealth of representational capacity in the patient EHR graph. The InfEHR GNN can effectively capitalize on this as it finds as features temporal relationships between clinical entities. The initial input EHR graph contains all possible temporal relations between the entities (see Supplementary Materials for detail on entity identification), but the GNN learns to make individual entities coalesce into high-level semantic groupings connected by condensed edges through a pooling operation (e.g., the GNN may automatically learn that receiving an ACE inhibitor and a finding of a certain blood pressure should be combined into an abstract singular clinical concept). The training objective enforces the generative paradigm in which the returned probability of an EHR given a disease latent is determined by the certainty that the disease latent is evidenced by the learned features relative to the empirical estimate of the likelihood of the disease itself as learned during the training process. The predicted probabilities therefore reflect the model’s certainty in how well-evidenced a given disease is in the underlying EHR, which can be interpreted similarly to clinical test results. By returning predicted probabilities that express the consistency of a given EHR with the disease itself (as opposed to a distance from a decision boundary) InfEHR computes probabilities with uncertainties that can be used in clinical decision-making.

The predicted probabilities are ultimately derived from quantifying interrelationships between clinical variables over time, which, as we suggest in the case of CN-S, can provide a basis for clinical inference in settings lacking specific individual biomarkers or, in the case of PO-AKI, where dynamically accumulating risks interact with other conditional risks. Although clinical evolution is generally acknowledged to be an important phenotypic and prognostic indicator, the complexities of characterizing it limit its practical use beyond as a general concept. It is not always known what extent of similarity (or over which components) is needed to distinguish common trajectories from unique ones. This phenomenon is known as sequence explosion where individual variation leads to large numbers of detected phenotypes with shallow support, thereby limiting comparisons among them⁸³. InfEHR overcomes sequence explosion and brings temporal dynamics into a practical reality through overlapping mechanisms that cooperate to produce an aggregated view of individual temporal dynamics through which comparisons and inferences can be made.

Minimizing human bias by learning graph structures from naive temporal graphs has its strengths but comes at a computational cost. The edge count in our graphs explodes with the length of EHR modeled, which may impede the training of the pooling and self-attention mechanisms at long durations. Although InfEHR learned on EHR graphs containing up to 11 days of information from hospitalized patients, adapting the framework for longer-term temporal modeling may be needed, especially in high-measurement-frequency environments like the ICU. Future work should be done to decide the optimal method. One possibility is to treat the maximum EHR length as a hyperparameter and then aggregate the representations from InfEHR over such periods. Other possibilities include removing edges outside of a given temporal range to reduce graph complexity.
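The growth in edge count and the effect of a temporal cutoff can be seen in a small sketch. It assumes that the naive graph links every event to every later event, consistent with the description of the input graph as containing all possible temporal relations; the function name and the use of hours as the time unit are hypothetical.

```python
def temporal_edges(timestamps, max_gap=None):
    """Directed edges (i, j) from each event to every later event.

    timestamps: event times (e.g., hours since admission), indexed by node.
    max_gap: if given, drop edges spanning more than this much time.
    """
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    edges = []
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            gap = timestamps[j] - timestamps[i]
            if max_gap is not None and gap > max_gap:
                break  # events are time-sorted, so later ones are further away
            edges.append((i, j))
    return edges


hourly = list(range(100))                        # 100 hourly events
full = temporal_edges(hourly)                    # grows as n * (n - 1) / 2
windowed = temporal_edges(hourly, max_gap=6)     # roughly linear in n
```

Without a cutoff the number of edges grows quadratically with the number of events, whereas a fixed temporal window keeps growth approximately linear at the cost of discarding long-range relations.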

Future work will include workflow and preprocessing optimizations. Identifying temporal and other semantic information from terms extracted from clinical notes can add important context to these terms. Similarly, aligning such terms to graph-based ontologies such as SNOMED-CT presents opportunities to transfer information from within the ontology but outside the EHR directly to patient graphs. Large Language Models could potentially be used in preprocessing steps for clinical notes or as embedders for entire clinical notes, which could streamline the existing workflow. The integration of multi-modal information from real-time monitoring or genetics can also be facilitated through generalized models with outputs assimilated by InfEHR. Collectively, these improvements will support bringing the framework into production.

Although InfEHR is shown to produce low-entropy probability distributions, additional information could be gained from examining high-entropy results that point to uncertain predictions by InfEHR. These may represent limits to algorithmic performance, but they may also reveal information about the instant case where, for example in PO-AKI, a low-entropy prediction for a known positive case may point to an individual who has experienced specific adverse events during surgery that resulted in AKI or, in the setting of CN-S, an uncertain prediction for a known positive may indicate that the individual never had CN-S despite being treated for it (see earlier note on “label fidelity”). This facet suggests an added research-based use for InfEHR alongside the present suggestions for clinical decision support. InfEHR is an effective and scalable method for extracting insights from EHRs that we make publicly available (see Code Availability).

Methods

The Institutional Review Board of the Icahn School of Medicine at Mount Sinai approved the protocol for retrieving and analyzing all EHRs in this study. Use of data obtained from the MOVER dataset was approved by the Institutional Review Boards of the University of California, Irvine Medical Center and the main campus.

Overview of InfEHR

The premise of InfEHR is that more information is available than is typically used in individual clinical decision-making. The complexities of obtaining information from EHRs limit their utility. InfEHR is a geometric deep-learning approach for resolving clinical uncertainty using EHRs with minimal human intervention. The framework is designed to perform in realistic clinical settings where large volumes of labeled training data cannot be obtained and where existing knowledge is limited.

Three sequential modules make up the InfEHR framework. Module 1 takes in raw EHRs and produces EHR graphs through three successive steps. First, EHRs are pre-processed to remove invalid data. Next, clinical events are automatically abstracted from the EHRs and embedded to form a set of nodes. Finally, individual EHRs are aligned to the abstracted events and represented as graphs in which nodes are connected according to the naive temporal ordering in the patient EHR. In Module 2, an attention-based graph neural network (GNN) embeds EHR graphs using self-supervision. These embeddings, representing the complete patient record, are used in an automatic rules-generation engine to obtain initial probabilities for all unlabeled cases. In Module 3, uncertainty in these probabilities is resolved through semi-supervised training of the GNN using a specialized loss function. Module 3 can be used with any source of prior probability information.

Descriptions of each component module, corresponding detailed equations, and the specifics of the datasets used are provided below. A workflow diagram is provided in Supplementary Fig. 1.

Module 1: Processing Electronic Health Records into Graphs

Training Datasets

We obtained structured and unstructured data from 11 million electronic health records (EHRs) from the Mount Sinai Health System stored in the Mount Sinai Data Warehouse (CN-S) and through the Extrico Health platform (PO-AKI) over time-varying measurements, medications, and clinical progress notes.

For potential neonatal CN-S cases, records were obtained by identifying individuals with at least 48 h of antibiotic exposure administered in the NICU and without categorical missingness (e.g., no vitals information). All antibiotic courses for such individuals meeting this requirement (n = 8067 individuals, 9256 antibiotic courses) were then extracted. A physician subject-matter expert manually confirmed the CN-S status for n = 3653 antibiotic courses. We applied a stratified split to the physician-confirmed dataset by birthweight to obtain a labeled training dataset of n = 2914 cases (80% of the total). Birthweight was chosen because it is an independent risk factor for CN-S³².

For potential PO-AKI cases, records were obtained for individuals undergoing surgery of any kind with presurgical hospitalization ≧ 2 days and who had in-hospital serum creatinine measurements taken at 72 h postoperatively (n = 22,138) to compute AKIN scores. For patients with multiple surgeries, only first surgeries whose subsequent operations occurred more than 72 h later (n = 8031) were considered. A positive AKI diagnosis was assigned to any patient with AKIN score >1.

Validation Datasets

We used n = 729 cases from the stratified split (20% of the total) as a validation dataset for the CN-S task.

For the PO-AKI task we used the EPIC EHR cohort in the MOVER dataset from the University of California at Irvine Medical Center (UCIMC, n = 39,685) and included only patients with serum creatinine measured preoperatively (2 or more measurements) and at 72 h postoperatively (n = 2631). We applied the AKIN definitions to obtain labels as in MSHS.

EHR Preprocessing

We applied the preprocessing steps described below to the CN-S and PO-AKI training datasets (individually) and applied the results where relevant to the validation datasets. The MSHS consists of several individual hospitals with varying database capture and update protocols. As a result, some types of information were systematically unavailable in the data warehouse at the time of retrieval.

EHRs with such categorical missingness (e.g., no vitals) were excluded from model training given that this pattern of missingness likely resulted from database-specific variation (retained cases: n = 5213 CN-S, n = 4276 PO-AKI MSHS, excluded cases: n = 2854 CN-S, n = 3764 PO-AKI MSHS); however, we used all valid records, including incomplete records, for the density estimations that the node discovery process required (see Node Discovery and Embedding below).

Preprocessing Numerical Values

We include vitals measurements and laboratory results measured on at least 100 unique individuals and with representation from both labels (i.e., the measurement does not by itself identify a case). As a result, given only 137 patients with confirmed CN-S, we considered the vital signs of respiratory rate, spO2, temperature, pulse, and systolic/diastolic blood pressures to avoid bias from measurement type. We retained 25 unique vitals in PO-AKI. We considered 387 and 72 unique labs, and 47 and 280 distinct medications, in CN-S and PO-AKI, respectively.

We further processed continuous numerical values by dropping any value greater than three times the maximum or less than three times the minimum clinical reference range (deemed to be likely artifactual). We also applied these preprocessing steps to the UCIMC variables that correspond to MSHS variables and removed any UCIMC variables without such correspondence.

Preprocessing Nonnumerical Values

Nonnumerical observations corresponding to lab results were standardized by applying Levenshtein distance to coalesce all similar variations to the most frequently observed term. We processed categorical features derived from clinical notes from the MSHS datasets as follows: we applied QuickUMLS, a Unified Medical Language System (UMLS) matcher, to identify terms from clinical notes matching a UMLS term with high confidence (> 0.7). The extracted terms were further refined in the node discovery process. No clinical notes were available in the UCIMC dataset.
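The coalescing of nonnumerical lab values can be sketched as follows. This is a minimal illustration, not the released implementation: the distance threshold `max_dist`, the lower-casing step, and the function names are assumptions, since the text does not state how "similar" was defined.

```python
from collections import Counter

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def coalesce(values, max_dist=2):
    """Map each observed spelling to the most frequent spelling within max_dist edits."""
    counts = Counter(v.strip().lower() for v in values)
    canonical, mapping = [], {}
    for term, _ in counts.most_common():  # most frequent terms are visited first
        match = next((c for c in canonical if levenshtein(term, c) <= max_dist), None)
        if match is None:
            canonical.append(term)
            match = term
        mapping[term] = match
    return mapping

observed = ["negative"] * 5 + ["negatve", "Negative ", "positive", "positive", "postive"]
mapping = coalesce(observed)
print(mapping)
```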

Node Discovery and Embedding

We discovered the nodes comprising the global pool of clinical events using density-based selection procedures (node discovery). We applied this process separately to the CN-S and PO-AKI training datasets, then used the learned results from each training dataset to extract clinical events from its respective validation dataset.

We detail the operations involved, as they apply to continuous and discrete variables, in the following sections:

Continuous Variables

We fit kernel density estimations (KDE) to the set of all observations for all EHRs for each measurement type (e.g., heart rate, respiratory rate, white blood cell count, etc.). The resulting KDE curve indicated local densities by the intervals between local peaks, which we then used to discretize the continuous measurement. The number of peaks and the distance between them are set by a single bandwidth parameter that we determined empirically to satisfy the following constraint: each interval must be shared by at least 100 unique individuals (e.g., a local density for blood glucose must contain measurements observed for at least 100 unique individuals) while the number of identified intervals is maximized.
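A minimal sketch of this discretization is given below, assuming a Gaussian kernel evaluated on a fixed grid. The function names, the grid size, and the search over a candidate list of bandwidths (smallest first, as a heuristic for maximizing the number of intervals) are illustrative assumptions rather than the authors' procedure.

```python
import numpy as np

def kde_peaks(values, bandwidth, grid_size=512):
    """Locations of local peaks of a Gaussian KDE evaluated on a regular grid."""
    values = np.asarray(values, dtype=float)
    grid = np.linspace(values.min(), values.max(), grid_size)
    z = (grid[:, None] - values[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1)  # unnormalized density
    mid = density[1:-1]
    is_peak = (mid > density[:-2]) & (mid >= density[2:])
    return grid[1:-1][is_peak]

def intervals_ok(values, patient_ids, peaks, min_patients=100):
    """True when every interval between peaks contains enough unique patients."""
    bins = np.digitize(values, peaks)
    ids = np.asarray(patient_ids)
    return all(len(np.unique(ids[bins == b])) >= min_patients
               for b in range(len(peaks) + 1))

def choose_bandwidth(values, patient_ids, candidates, min_patients=100):
    """Smallest candidate bandwidth whose intervals all satisfy the constraint."""
    for bw in sorted(candidates):
        peaks = kde_peaks(values, bw)
        if len(peaks) > 0 and intervals_ok(values, patient_ids, peaks, min_patients):
            return bw, peaks
    return None, np.array([])

# Synthetic stand-in data: two well-separated clusters, one patient per value.
values = np.concatenate([np.linspace(-1, 1, 50), np.linspace(9, 11, 50)])
patient_ids = np.arange(values.size)
bandwidth, peaks = choose_bandwidth(values, patient_ids, [1.0, 3.0], min_patients=20)
codes = np.digitize(values, peaks)  # discrete interval index per measurement
```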

Discrete Variables

InfEHR uses discrete but time-varying information, including medications and clinical terms in notes. We include medications in the prescribable subset of RxNorm that were administered to at least 100 unique patients.

We extracted UMLS-synonymous terms from clinical notes using QuickUMLS (see preprocessing of nonnumerical data above). We weighted the collection of the extracted terms using term frequency inverse document frequency (TF-IDF), then applied nonnegative matrix factorization (NMF) with automatic determination of the latent topic number (the minimum number of components in the NMF H matrix such that the cophenetic correlation coefficient is ≧ 0.90). We analyzed the resulting low rank term-weight matrix to identify and retain terms strongly associated with any latent topic (≧ top 10% of distributed topic weights). This procedure simultaneously selects terms based on frequency and informational content to create a data-driven vocabulary of clinically meaningful terms from notes for use as graph nodes.

The node selection process automatically compresses the range of all clinical events to a subset based on the underlying density distributions of the dataset. This allows the discovery of nodes through a data-driven discovery process without human pre-specification or assumptions. Specifically, the number and content of nodes are not known a priori. Additional semantic information and implicit relational structures between nodes are encoded during node embedding, as described below.

Method of Node Embedding

We derived 64-dimensional numerical representations (embeddings) for the identified nodes described above. To compute the node embeddings, we first constructed a bipartite graph with partitions over individual patients and the identified global set of nodes. Next, we computed the overlap weighted projection of the clinical event nodes over the patients and retained only edges weighted at or above the 25th percentile edge weights. We added nodes representing semantic types (e.g., the name of a lab measurement or vitals sign category) to the projected graph and connected them to relevant nodes with maximum edge weight. The neighborhood of any individual node, therefore, included all nodes of the same semantic type as well as nodes across semantic types with high co-occurrence (indicated by high-weight edges). Nodes were encoded to reflect neighborhood information using the Node2Vec algorithm.
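The projection and pruning steps can be sketched in NumPy as follows. The sketch assumes that "overlap weighted" means the Jaccard overlap of the patient sets of two events, and that the 25th percentile is taken over non-zero edge weights; both are assumptions, and the semantic-type nodes and the Node2Vec step are not shown.

```python
import numpy as np

def overlap_projection(membership):
    """Project a patient-by-event binary matrix onto events with Jaccard weights."""
    M = np.asarray(membership, dtype=float)
    shared = M.T @ M                                   # patients shared by event pairs
    counts = np.diag(shared)                           # patients per event
    union = counts[:, None] + counts[None, :] - shared
    W = np.divide(shared, union, out=np.zeros_like(shared), where=union > 0)
    np.fill_diagonal(W, 0.0)                           # no self edges
    return W

def prune_edges(W, percentile=25):
    """Keep only edges weighted at or above the given percentile of non-zero weights."""
    upper = W[np.triu_indices_from(W, k=1)]
    cutoff = np.percentile(upper[upper > 0], percentile)
    return np.where(W >= cutoff, W, 0.0)

# Four patients (rows) and three clinical events (columns), purely illustrative.
membership = [[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [0, 0, 1]]
W = overlap_projection(membership)
W_pruned = prune_edges(W)
```

The pruned weight matrix would then be passed, together with the semantic-type nodes, to a Node2Vec implementation to obtain the 64-dimensional embeddings.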

The resulting collection of embeddings (including clinical events and semantic identifiers) forms a manifold that naturally encodes semantic clinical relationships into spatial distances between embeddings. We assigned to each node in an EHR graph the resulting relevant embedding, subject to some added components as described below. Note that we learned embeddings from the training datasets individually and used these embeddings in the validation datasets without retraining (see Fig. 7).

**Fig. 7: Electronic health records (EHRs) are represented automatically as Electronic Health Record graphs through an unsupervised process.**

Tuning the General Node Representation to Individual Temporal Contexts

After extracting the set of clinical events, we extracted the time stamps for their occurrences in individual EHRs. We adjusted all time stamps to reflect elapsed times by subtracting the earliest time stamp corresponding to a clinical event in each EHR. We aggregated all unique time stamps for all EHRs and derived 32-dimensional embeddings for each time stamp using the Time2Vec algorithm in Eq. (4):

$${\rm{Time2Vec}}(t)=\left\{\begin{array}{cc}{w}_{k}\cdot t+{b}_{k} & \text{if }k=0\\ \sin \left({w}_{k}\cdot t+{b}_{k}\right) & \text{if }k > 0\end{array}\right.$$

(4)

where:

$t $ time component (like time stamp, hour of day, etc.)

${w}_{k}$, ${b}_{k}$ Learnable parameters of the model

$k$ Position in the Time2Vec vector.

The general representation of a clinical event is formed by concatenating the event embedding with the semantic type embedding (e.g., the embedding of a certain KDE density for blood glucose is concatenated with the embedding for blood glucose). These generalized embeddings—consistent across patients—are tuned to the individual by adding the embedding of the time stamp for its occurrence. The resulting embeddings render clinical events in a machine-readable format. Although time stamps are not explicitly used as positional markers, the vectorization of time adds temporal information to representations of clinical events. The numerical distance between locally co-occurring but semantically distinct clinical events is reduced by the similarity of their time stamp components compared to events farther apart in time. Individual variation in temporal dynamics therefore shapes the representation of clinical events to the machine, transforming generalized clinical event representations to reflect individual contexts (see Fig. 7).
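A minimal sketch of the Time2Vec mapping and of assembling a node feature vector is shown below. The parameters `w` and `b` are learned in practice; here they are random placeholders, as are the event and semantic-type embeddings. The sketch reads "adding the embedding of the time stamp" as concatenation, which is an assumption that matches the 32 + 64 + 64 = 160 input dimensions reported for the GNN.

```python
import numpy as np

def time2vec(t, w, b):
    """Time2Vec: linear term at position 0, sine terms at positions k > 0."""
    t = np.asarray(t, dtype=float)[..., None]
    v = w * t + b
    return np.concatenate([v[..., :1], np.sin(v[..., 1:])], axis=-1)

rng = np.random.default_rng(0)
w, b = rng.normal(size=32), rng.normal(size=32)   # placeholders for learned parameters
event_emb = rng.normal(size=64)                   # e.g., one KDE interval of blood glucose
type_emb = rng.normal(size=64)                    # e.g., the "blood glucose" semantic type

elapsed_hours = 3.5                               # time since the first event in the EHR
node_features = np.concatenate([event_emb, type_emb, time2vec(elapsed_hours, w, b)])
print(node_features.shape)
```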

Module 2: Deep Geometric Learning Approach

Notation

We represent individual EHRs as directed graphs by determining relevant clinical events $\varepsilon $ (such as a measurement value within a certain range, or the appearance of a term in a clinical note) where $\varepsilon $ is discovered through an automatic process (node discovery).

We derive embeddings for these clinical events by learning a manifold ${\mathcal{M}}$ comprising all events $\varepsilon $ and their respective semantic types $\tau $. We apply an operator (here, concatenation) to obtain the representations of all possible clinical events as shown in Eq. (5):

$${\rm E}=\phi \left(\varepsilon \in {\mathcal{M}},\tau \in {\mathcal{M}}\right):{{\mathbb{R}}}^{n}\to {{\mathbb{R}}}^{d}$$

(5)

Graphs representing patients are constructed by identifying the time stamp of each event $\varepsilon \in {\rm E}$ in the patient record, embedding the time stamp using a network trained on the Date2Vec objective, and concatenating the result with $\varepsilon $, resulting in the initial node embedding

$${h}_{i}^{\left(0\right)}\in {{\mathbb{R}}}^{m}\left(m > d > n\right)$$

The graph is defined as

$$G=\left(\varepsilon,V\left(t\right)\right)$$

(6)

with directed edges

$${e}_{i,j}=\left({v}_{i}\left(t\right) < {v}_{j}\left(t\right)\right)$$

(7)

Problem definition

Given the graph $G$, we train networks to learn whole graph representations in ${{\mathbb{R}}}^{d}$ for (1) self-supervised representations of EHRs and (2) computing likelihoods over clinical queries. Exact definitions of the loss functions used for training in (1) and (2) appear below.

Construction and Definitions of EHR Graphs

InfEHR computes likelihoods through sequential processing of EHRs. We obtain EHRs and then represent them as temporal graphs (Eq. (6)), according to these two definitions:

Definition 1: EHR

Given ${{\mathcal{R}}}_{i},$ comprising all medical records for patient $i$ occurring over the set of times ${T}_{i}$, we extract the electronic health record (EHR) of patient $i$, denoted as ${\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{i}$:

$${\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{i}=\{(r,t)\in \{\text{vitals},\text{labs},\text{medications},\text{clinical notes}\}\ \text{and}\ t\subseteq {T}_{i}\}$$

(9)

with $t$ bounded by:

$$\max \left({T}_{0,\text{defined}},{T}_{0,{\text{patient}}_{i}}\right)\le t\le \min \left({T}_{\max,{\text{patient}}_{i}},{T}_{\max,\text{defined}}\right)$$

(10)

where:

${T}_{{defined}}$ Timestamp of a clinical event or user-provided temporal duration.

${T}_{{{patient}}_{i}}$ An observed timestamp in the records of patient $i$.

${\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{{i}_{j}}$ Unique EHR identified by a specific clinical event in patient $i$'s records.

In the case of multiple defined clinical events occurring in the records of patient $i$, each event results in a unique EHR identified by ${\mathcal{E}}{\mathcal{H}}{{\mathcal{R}}}_{{i}_{j}}$.

Definition 2: EHR Graph

We take EHRs (as defined above) and represent them as temporal graphs. We discover and collate nodes from the collected EHRs using unsupervised methods into a global node pool (Nodes), embed individual time stamps using Time2Vec (Times), and form temporal edges following (11) and the algorithm in Box 1.

$$\forall\, {\text{node}}_{j}\in G,\ j < i\wedge {\text{time}}_{j} < {\text{time}}_{i}:\ \text{create edge}\left({\text{node}}_{j}\to {\text{node}}_{i}\right).$$

(11)
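The edge rule in Eq. (11) can be sketched directly. The function below is an illustrative quadratic-time version, assuming nodes are indexed in insertion order and `times` holds their elapsed times; it is not taken from the released code.

```python
def build_temporal_edges(times):
    """Return directed edges (j, i) for every j < i with times[j] < times[i]."""
    edges = []
    for i, t_i in enumerate(times):
        for j in range(i):
            if times[j] < t_i:  # strictly earlier events only
                edges.append((j, i))
    return edges

# Two events recorded at the same time, followed by two later events.
edges = build_temporal_edges([0.0, 0.0, 1.5, 4.0])
print(edges)
```

Events sharing a time stamp are not connected to each other, because the rule requires a strictly earlier time.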

Training the GNN on Attributed EHR Graphs

We train a GNN to produce whole graph embeddings (dim = 128 self-supervised, dim = 164 semi-supervised) subject to additional processing layers under a self-supervised and semi-supervised objective (details below). We use a consistent architecture adapted to the semi-supervised and self-supervised training regimes.

Given an EHR Graph $G=(V,E)$, the model initially condenses and rewires the graph through a learned pooling operation resulting in:

$${G}^{{\prime} }={ASAPool}(G,\rho )$$

(12)

We derive the global representation $X$ and logits for ${G}^{{\prime} }$ as shown:

$${H}^{(1)}={ReLU}\left(\;\sum _{j\in {\mathscr{N}}\left(i\right)}{\alpha }_{{ij}}^{\left(1\right)}{W}^{\left(1\right)}{x}_{j}\right)$$

(13)

$$R={W}_{r}{H}^{(1)}$$

(14)

$${H}^{(2)}={ReLU}\left(\;\sum _{j\in {\mathscr{N}}(i)}{\alpha }_{{ij}}^{(2)}{W}^{(2)}{H}_{j}^{(1)}\right)+R$$

(15)

$$X=\frac{1}{\left|{V}^{{\prime} }\right|}\sum _{i\in {V}^{{\prime} }}{H}_{i}^{(2)}$$

(16)

$${Logits}={W}_{f}X+{b}_{f}$$

(17)

This network definition uses equations in sequential order (12, 13, 14, 15, 16, 17).

For all experiments we use input node feature dimensions d = 160 (from d = 32 for time embedding + d = 64 node semantic type embedding + d = 64 node value embedding). We use a node pooling ratio (rho) of 0.8 and attention heads for GAT layers = 2, and use hidden dimensions H(1) = 256, and H(2) = 128. We train on a self-supervised learning (SSL) objective (explained below) to produce a d = 128-dimensional representation (dimensions chosen from experience) and a d = 2-dimensional output for the semi-supervised objective.
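The forward pass in Eqs. (13) to (17) can be sketched in plain NumPy with the stated dimensions (160 → 256 → 128 → 2). This is a simplified illustration under several assumptions: the ASAPooling step of Eq. (12) is omitted, a single attention head is used instead of two, attention scores follow the usual GAT form, self-loops are added so that every node has a neighbor, and all weights are random placeholders.

```python
import numpy as np

def attention_layer(X, adj, W, a):
    """Single-head graph attention followed by ReLU; adj[i, j] marks j as a neighbor of i."""
    Z = X @ W
    d = Z.shape[1]
    scores = (Z @ a[:d])[:, None] + (Z @ a[d:])[None, :]   # raw score for each (i, j)
    scores = np.where(scores > 0, scores, 0.2 * scores)    # LeakyReLU
    scores = np.where(adj, scores, -np.inf)                # mask non-neighbors
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)       # attention coefficients
    return np.maximum(alpha @ Z, 0.0)

def forward(X, adj, p):
    adj = adj | np.eye(len(X), dtype=bool)                 # assumed self-loops
    H1 = attention_layer(X, adj, p["W1"], p["a1"])         # Eq. (13)
    R = H1 @ p["Wr"]                                       # Eq. (14)
    H2 = attention_layer(H1, adj, p["W2"], p["a2"]) + R    # Eq. (15)
    x = H2.mean(axis=0)                                    # Eq. (16)
    return x, x @ p["Wf"] + p["bf"]                        # Eq. (17)

rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(scale=0.1, size=(160, 256)), "a1": rng.normal(scale=0.1, size=512),
    "W2": rng.normal(scale=0.1, size=(256, 128)), "a2": rng.normal(scale=0.1, size=256),
    "Wr": rng.normal(scale=0.1, size=(256, 128)),
    "Wf": rng.normal(scale=0.1, size=(128, 2)), "bf": np.zeros(2),
}
X = rng.normal(size=(5, 160))                  # five nodes with 160-dimensional features
adj = np.tril(np.ones((5, 5), dtype=bool), -1) # every earlier node feeds each later node
graph_embedding, logits = forward(X, adj, params)
```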

Self-Supervised Learning for Whole Graph Representations

We learn whole graph representations by training the InfEHR GNN using a self-supervised loss according to the algorithm in Box 2:

Box 1 Algorithm to construct EHR Graph

1. Input: $EHRs=\{{\rm{E}}{\rm{H}}{{\rm{R}}}_{1}\ldots,{\rm{E}}{\rm{H}}{{\rm{R}}}_{n}\}$ for a dataset of length $n$

2. Nodes $\leftarrow $ Embed (NodeDiscovery(EHRs))

3. Times $\leftarrow $ Embed ({time(Node) | Node in Nodes})

4. for each $({r}_{i},{t}_{i})$ in ${\mathscr{E}}{\mathscr{H}}{R}_{i}$ do

5. if ${r}_{i}$ has a representation in Nodes then

6. ${{node}}_{i}\leftarrow {Nodes}[{r}_{i}]$

7. ${tim}{e}_{i}\leftarrow {Times}[{t}_{i}]$

8. Concatenate $({nod}{e}_{i},{tim}{e}_{i})$ and insert into graph ${G}_{i}$

9. end if

10. for each ${nod}{e}_{j}$ in ${G}_{i}$ where $j < i$ and ${tim}{e}_{j} < {tim}{e}_{i}$ do

11. Create edge $({nod}{e}_{j}\to {nod}{e}_{i})$ in ${G}_{i}$

12. end for

13. end for

14. EHR Graphs ←$\{{G}_{1},\ldots,{G}_{n}\}$

Box 2 Self-supervised training algorithm: Detailed Graph Self-Supervised Learning with VICReg and MI

1. Input: Graph dataset ${\rm{G}}={{\{{G}_{i}=({V}_{i},{E}_{i},{X}_{i})\}}^{N}}_{i=1}$

2. where: ${V}_{i}$ is the set of nodes;

3. ${E}_{i}\subseteq {V}_{i}\times {V}_{i}$ is the set of edges;

4. ${X}_{i}\in {{\mathbb{R}}}^{\left|{V}_{i}\right|\times {d}_{{in}}}$ are node features;

5. $T$ Number of epochs;

6. Feature dimensions: ${d}_{{in}}$: (input), ${d}_{h}$: (hidden), ${d}_{e}$: (embedding).

7. Input: Loss weights: ${\lambda }_{{sim}}$, ${\lambda }_{\mathrm{var}}$, ${\lambda }_{\mathrm{cov}}$, ${\lambda }_{{mi}}$

8. ASAPooling ratio $r$, number of attention heads $K.$

9. Output:

10. Trained encoder ${f}_{\theta }$ for downstream tasks

11. // Initialization Phase

12. Initialize encoder ${f}_{\theta }$ with components:

13. ASAPooling layer: ASA: ${{\mathbb{R}}}^{{d}_{{in}}}\to {{\mathbb{R}}}^{r\cdot {|V|}}$

14. GAT layers: ${{GAT}}_{k}:{{\mathbb{R}}}^{{d}_{k}}\to {{\mathbb{R}}}^{{d}_{k+1}}$ with $K$ heads;

15. Residual connections and dimension reduction;

16. Initialize projectors:

17. ${p}_{1}:{{\mathbb{R}}}^{{d}_{{in}}}\to {{\mathbb{R}}}^{{d}_{e}}$(input projector);

18. ${p}_{2}:{{\mathbb{R}}}^{{d}_{e}}\to {{\mathbb{R}}}^{{d}_{e}}$ (representation projector);

19. Both with architecture: (Linear → LayerNorm → ReLU) × 2 → Linear;

20. Initialize MI scoring matrix ${W}_{{mi}}\in {{\mathbb{R}}}^{{d}_{e}\times {d}_{e}}$;

21. // Training Loop

22. for $t=1$ to $T$ do

23. for each graph $G=(V,E,X)$ in ${\mathscr{G}}$ do

24. // Generate representations

25. $h\leftarrow {f}_{\theta }(X,E);{v}_{1}\leftarrow {p}_{1}(X);{v}_{2}\leftarrow {p}_{2}(h);$

26. // Compute VICReg losses

27. ${{\mathscr{L}}}_{{sim}}\leftarrow \frac{1}{{|V|}}\mathop{\sum }\limits_{i=1}^{{|V|}}{\left\Vert {v}_{1}^{i}-{v}_{2}^{i}\right\Vert }_{2}^{2};\ {S}_{j}(v)\leftarrow \sqrt{{Var}\left({v}^{j}\right)+\epsilon }$

28. for $j\in [{d}_{e}];$

29. ${{\mathscr{L}}}_{\mathrm{var}}\leftarrow \frac{1}{{d}_{e}}\mathop{\sum }\limits_{j=1}^{{d}_{e}}\max (0,\gamma -{S}_{j}({v}_{1}))$

30. $C\left(v\right)\leftarrow \frac{1}{\left|V\right|-1}\left({v}^{T}v-{diag}\left({v}^{T}v\right)\right);\ {{\mathscr{L}}}_{\mathrm{cov}}\leftarrow \frac{1}{{d}_{e}}\sum _{i\ne j}\left({\left[C({v}_{1})\right]}_{{ij}}^{2}+{\left[C({v}_{2})\right]}_{{ij}}^{2}\right);$

31. // Mutual Information Estimation

32. ${poo}{l}_{1}\leftarrow {AvgPoolNeighbor}\left({v}_{1},E\right);\ {glo}{b}_{1}\leftarrow \frac{1}{{|V|}}\mathop{\sum }\limits_{i=1}^{{|V|}}{poo}{l}_{1}^{i};$

33. ${cor}{r}_{1}\leftarrow {WindowCorrupt}({v}_{1},w);$

34. ${poo}{l}_{c1}\leftarrow {AvgPoolNeighbor}({cor}{r}_{1},E);$

35. ${glo}{b}_{c1}\leftarrow \frac{1}{{|V|}}\mathop{\sum }\limits_{i=1}^{{|V|}}{poo}{l}_{c1}^{i};\ {s}_{{pos}}\leftarrow {softmax}({poo}{l}_{1}{W}_{{mi}}{glo}{b}_{1}^{T})$

36. ${s}_{{neg}}\leftarrow {softmax}\left({poo}{l}_{1}{W}_{{mi}}{glo}{b}_{c1}^{T}\right);\ {{\mathscr{L}}}_{{mi}}\leftarrow -\log \frac{\exp ({s}_{{pos}})}{\exp ({s}_{{pos}})+\exp \left({s}_{{neg}}\right)};$

37. // Update Model

38. ${\mathscr{L}}\leftarrow {\lambda }_{{sim}}{{\mathscr{L}}}_{{sim}}+{\lambda }_{\mathrm{var}}{{\mathscr{L}}}_{\mathrm{var}}+{\lambda }_{\mathrm{cov}}{{\mathscr{L}}}_{\mathrm{cov}}+{\lambda }_{{mi}}{{\mathscr{L}}}_{{mi}}$

39. Update parameters using AdamW optimizer;

40. end

41. end

42. return trained encoder ${f}_{\theta }$, discard projectors ${p}_{1}$ and ${p}_{2}$

Self-supervised learning

Our SSL training algorithm uses a VICReg framework⁸⁶, more commonly used for image encoding, enhanced with mutual information (MI) estimation tailored to graph data. The procedure initializes an encoder with ASAPooling and GAT layers, along with two projectors for input and representation transformation. During training, each graph generates two views: raw features projected through p₁ and encoded features through p₂. The projectors, implemented as multilayer perceptrons with normalization and nonlinearities, serve as learnable transformations that map the input and encoded representations to a shared embedding space while preventing the collapse of information. This architectural choice enforces an information bottleneck that prevents the encoder from learning trivial solutions, while the projectors’ flexibility allows the contrastive learning objective to be optimized without constraining the encoder’s representation capacity. Post-training, the projectors are discarded, preserving the encoder’s learned manifold structure for downstream tasks.

The loss function combines four components: similarity loss (ensuring view alignment), variance loss (preventing dimensional collapse), covariance loss (decorrelating features), and mutual information (MI) loss (maximizing node-to-graph information while minimizing it for corrupted samples). The MI estimation uses a structured corruption scheme and InfoNCE-style loss computation.
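The three VICReg-style terms can be sketched in NumPy as below. The sketch follows the structure of Box 2 but centers the features before computing the covariance, as is standard for VICReg (this centering is an assumption, since Box 2 does not show it). The MI term is omitted, and the loss weights, `gamma`, and `eps` are placeholder values, not those used in the study.

```python
import numpy as np

def similarity_loss(v1, v2):
    """Mean squared Euclidean distance between the two views of each node."""
    return np.mean(np.sum((v1 - v2) ** 2, axis=1))

def variance_loss(v, gamma=1.0, eps=1e-4):
    """Hinge on the per-dimension standard deviation to discourage collapse."""
    std = np.sqrt(v.var(axis=0, ddof=1) + eps)
    return np.mean(np.maximum(0.0, gamma - std))

def covariance_loss(v):
    """Sum of squared off-diagonal covariances, scaled by the embedding size."""
    vc = v - v.mean(axis=0)
    cov = vc.T @ vc / (len(v) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return np.sum(off_diag ** 2) / v.shape[1]

rng = np.random.default_rng(0)
v1 = rng.normal(size=(50, 16))                 # stand-in for p1(X)
v2 = v1 + 0.1 * rng.normal(size=(50, 16))      # stand-in for p2(h)
lam_sim = lam_var = lam_cov = 1.0              # placeholder loss weights
total = (lam_sim * similarity_loss(v1, v2)
         + lam_var * variance_loss(v1)
         + lam_cov * (covariance_loss(v1) + covariance_loss(v2)))
```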

The corruption scheme generates negative samples by shuffling node features within windows of the EHR graph (WindowCorrupt). The effect is to introduce random and unrealistic relationships and orderings between clinical events. The MI estimation encourages the model to distinguish valid clinical structures and their representations from these unrealistic examples.
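A minimal sketch of a window-based corruption is shown below. It assumes that nodes are stored in temporal order and that windows are contiguous, fixed-size blocks of nodes; the text does not specify these details, so the function is illustrative only.

```python
import numpy as np

def window_corrupt(X, window, rng):
    """Shuffle node feature rows within consecutive windows of `window` nodes."""
    X = np.asarray(X)
    out = X.copy()
    for start in range(0, len(X), window):
        stop = min(start + window, len(X))
        out[start:stop] = X[start + rng.permutation(stop - start)]
    return out

X = np.arange(12).reshape(6, 2)   # six nodes with two features each
corrupted = window_corrupt(X, window=3, rng=np.random.default_rng(0))
```

Each window keeps the same set of feature vectors but assigns them to different positions, so local orderings become unrealistic while the overall content of the record is preserved.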

The total loss is the weighted sum of the four loss components (Box 2, step 38) and is optimized using AdamW. After training for $T$ = 1000 epochs, the encoder is preserved for downstream tasks while the projectors are discarded.

This SSL loss function encourages EHR representations that capture local temporal patterns within patient records and also global patient states, allowing for encodings that capture inter-patient variation (differing global states) simultaneously with encoding shared local temporal structures. This simultaneity promotes meaningful semantics in several ways: high-density regions are likely to represent patients with common clinical patterns or disease trajectories, whereas sparse regions may indicate rare conditions or unique patient presentations. Spatial distances can be used to infer the disease state as follows.

Deriving instance-level priors automatically

Label propagation using self-supervised embeddings

We train the GNN encoder to produce self-supervised embeddings as above and apply label propagation as described in ref. ⁶⁰ and implemented in scikit-learn. We hypothesize that training on a self-supervised objective, as described above, results in automatic alignment of phenotypically similar people so as to meet the assumption of label spreading that semantic similarity is a function of spatial distance.

The label spreading algorithm iteratively learns a smooth classification function: we take the 110 labeled samples, spread label information to spatially proximate samples, and label each sample according to the flow of labels it receives during the propagation process. We apply the label spreading algorithm using an RBF kernel (gamma at 70) to determine probabilistic distances between embeddings and set the clamping parameter alpha, which controls the relative importance of the initially labeled examples in deciding the predicted labels for unlabeled examples, to 0.5 based on previous experience. We achieved 0.18 (CN-S) and 0.34 (PO-AKI) recall, with precision of 0.67 and 0.78 respectively (outperforming the clinical heuristic in both cases), suggesting that the spatial similarity assumption held appreciably in the self-supervised embeddings.
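The propagation step can be sketched with scikit-learn's `LabelSpreading`, using the kernel, gamma, and alpha values stated above. The embeddings here are synthetic two-dimensional stand-ins for the 128-dimensional self-supervised embeddings, and the number of labeled points is arbitrary; only the estimator settings come from the text.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
# Synthetic stand-in embeddings: two tight clusters (negative and positive cases).
negatives = rng.normal(loc=0.0, scale=0.05, size=(40, 2))
positives = rng.normal(loc=1.0, scale=0.05, size=(40, 2))
embeddings = np.vstack([negatives, positives])

labels = np.full(len(embeddings), -1)   # -1 marks unlabeled samples
labels[:3] = 0                          # a few labeled negatives
labels[40:43] = 1                       # a few labeled positives

model = LabelSpreading(kernel="rbf", gamma=70, alpha=0.5)
model.fit(embeddings, labels)

propagated = model.transduction_               # hard label for every sample
priors = model.label_distributions_[:, 1]      # probability of the positive class
```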

Integration of spatial information and structural information

We derive weak labeling heuristics from structural features of EHR graphs using label information provided from label propagation over the self-supervised embeddings (spatial information as described). Ordinarily, such weak heuristics are generated by a human expert, which introduces bias and is challenging in precisely those clinical settings where existing clinical knowledge is limited. We present a method to generate them automatically at scale under uncertain conditions and follow established guidelines for combining them⁶¹,⁶² to generate initial probabilities. We find that existing literature²¹,²²,⁸⁶ corroborates a random sample of the automatically generated heuristics. Another potential application of InfEHR could be generating hypotheses from weakly predictive heuristics.

Weakly supervised learning over uncertain priors

The ASAPooling operation⁸⁷ in the GNN uses an attention-based mechanism to derive cluster medoids and assign cluster memberships to nodes over a fixed receptive field to produce a new, pooled graph. Clusters are scored for inclusion in the pooled graph and reconnected with edge weights that indicate the topology of the original graph. We apply a message-passing algorithm over the pooled graph using graph self-attention to compute new node representations successively. InfEHR derives the node representations by learning an attentional coefficient that weighs the relative importance of a node to its neighbor in the aggregation phase of the message-passing algorithm. To avoid over-smoothing of node representations, we apply a residual between successive message passing steps. Finally, to obtain the whole graph representation we take the mean over node features for all nodes in the pooled graph, resulting in a single high dimensional vector. We further process the whole graph representation using linear layers to return likelihoods according to the loss criterion described below.

Module 3: resolving prior probabilities

GNN Training with Feature-Based Weighting of Kullback-Leibler Loss

We train the GNN (previously described) under our own loss function similar to the RQ loss proposed in ref. ⁶³ and include a small network to learn example specific loss weighting functions. RQ loss consists of a generative formulation allowing the optimization of the log-likelihood of learned graph features relative to an assumed underlying generative process (here a disease latent). The loss function is consistent with the overall data representation strategy in which we capture disease latents at multiple levels (from initial node encodings to the naive temporal EHR graphs). We extend this function further by learning a dynamic weighting mechanism that continuously adapts during training, learning to adjust sample importance based on evolving patterns in the penultimate layer representations. Modulating the RQ loss through learned weights focuses attention on the most informative samples as the feature space becomes progressively more structured throughout training.

Weighted RQ Loss

The weighted RQ loss minimizes the following function, jointly optimizing parameters for the primary model and for the weighting network:

$$\mathop{\min }\limits_{\theta,\phi }\sum _{i\in B}\frac{\exp ({w}_{i}(\phi ))}{{\sum }_{j\in B}\exp \left({w}_{j}(\phi )\right)}\cdot {KL}\left[q\left({\mathcal{l}}|{x}_{i};\theta \right)\,\Big\Vert \,{c}_{i}\cdot {p}_{i}({\mathcal{l}})\frac{q({\mathcal{l}}|{x}_{i};\theta )}{{\sum }_{j}q\left({\mathcal{l}}|{x}_{j};\theta \right)}\right]$$

(18)

Definitions:

$B$ Virtual batch

${f}_{i}$ Penultimate layer features

${c}_{i}$ Normalization constants ensuring $\Sigma {r}_{i}({\mathcal{l}})=1$

${p}_{i}({\mathcal{l}})$ Prior beliefs about latent labels, derived from InfEHR heuristics

${w}_{i}(\phi )$ Weight computed as:

$$\sigma ({W}_{2}{ReLU}({W}_{1}\,{f}_{i}+{b}_{1})+{b}_{2})$$

(19)

where $\phi=\{{W}_{1},{W}_{2},{b}_{1},{b}_{2}\}$ are trainable weights and biases of the neural network.

Parameters:

θ Parameters of the primary model estimating $q({\mathcal{l}}\,|\,{x}_{i};\,\theta )$;

ϕ Parameters of the neural network calculating ${w}_{i}$;

Optimization is by simultaneous updates to $\theta $ and $\phi $, aligning the model’s outputs with instance importance in batch $B$.

In sequential order, this loss definition uses Eqs. (18, 19).
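A NumPy sketch of this loss for a single batch is given below. It reads Eq. (18) as a softmax-weighted sum of KL divergences between each model posterior and a prior-reweighted target normalized by ${c}_{i}$; that reading, the small `eps` for numerical stability, and all array shapes and names are assumptions. Gradients and the joint optimizer step are not shown.

```python
import numpy as np

def example_weights(f, W1, b1, W2, b2):
    """Eq. (19): sigmoid(W2 ReLU(W1 f + b1) + b2), one scalar weight per example."""
    hidden = np.maximum(f @ W1 + b1, 0.0)
    logits = hidden @ W2 + b2                       # shape (B, 1)
    return (1.0 / (1.0 + np.exp(-logits))).ravel()

def weighted_rq_loss(q, p, w, eps=1e-12):
    """q: (B, L) model posteriors, p: (B, L) priors, w: (B,) example weights."""
    target = p * q / q.sum(axis=0, keepdims=True)   # prior times batch-normalized posterior
    target = target / target.sum(axis=1, keepdims=True)  # c_i makes each row sum to 1
    kl = np.sum(q * (np.log(q + eps) - np.log(target + eps)), axis=1)
    weights = np.exp(w - w.max())
    weights = weights / weights.sum()               # softmax over the batch
    return float(np.sum(weights * kl))

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))                  # stand-in penultimate features
w = example_weights(features,
                    rng.normal(size=(8, 4)), np.zeros(4),
                    rng.normal(size=(4, 1)), np.zeros(1))
q = np.array([[0.6, 0.4], [0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])
p = np.array([[0.7, 0.3], [0.3, 0.7], [0.5, 0.5], [0.8, 0.2]])
loss = weighted_rq_loss(q, p, w)
```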

Performing Validation

We construct EHR graphs from the UCIMC data contained in the MOVER dataset by first aligning the records in UCIMC to the same namespace as the MSHS data (e.g., creating a mapping between the same medication or laboratory measurement with varying names) and then using the learned clinical manifold from MSHS data to extract clinical events from the UCIMC records into naive temporal graphs, as described previously. Notably, UCIMC data does not include clinical progress notes, which limits the full translation of UCIMC events to the MSHS manifold. All vitals measurement types in the UCIMC dataset were duplicated in MSHS; however, some laboratory measurements and medications in UCIMC had no correspondence in MSHS. We omit any such record from the temporal graphs.

To apply the GNN for semi-supervision from MSHS data to graphs from UCIMC, we ablated all nodes from clinical text in the MSHS graphs and retrained the GNN on the MSHS graphs. We applied this GNN to the UCIMC graphs to obtain initial probabilistic labels (see Fig. 6, InfEHR priors). We therefore transferred knowledge from previous training in the form of the learned clinical manifold and in the prior probabilities.

InfEHR is designed to learn dynamical temporal features that can be used for clinical uncertainty reduction. To show that InfEHR does this, we trained the InfEHR GNN on UCIMC graphs (n = 2427) constructed using the clinical manifold and embeddings from MSHS. Probabilistic prior labels were obtained from previous training on MSHS graphs and without human-provided labels. This parallels discriminative model training while maintaining consistency with the InfEHR loss function and framework. Using these priors (without human-provided labels) we trained for 20 epochs (by early stopping criterion). We used this trained GNN to compute final likelihoods (see Table 3 and Fig. 6).

We then performed benchmarking experiments with GRU-D and SeFT models, implemented and trained as in their reference implementations^64,65. Although both models ingest tabular, nongeometric data, we retained the same variables used to construct the graphs for the InfEHR GNN to facilitate comparison.
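To illustrate how the same variables can be presented in tabular form, the sketch below converts a list of timestamped events into the value, mask, and time-since-last-observation arrays that GRU-D takes as input. The event tuple layout, the function name, and the zero placeholder for missing values are assumptions; the benchmarks themselves used the reference implementations.

```python
def tabularize(events, variables):
    """Build GRU-D style inputs from irregular events.

    events:    list of (time, variable, value) tuples (hypothetical layout)
    variables: ordered list of variables to keep (same set as the graphs)
    Returns (times, values, mask, delta), each indexed [time_step][variable].
    """
    times = sorted({t for t, _, _ in events})
    row = {t: i for i, t in enumerate(times)}
    col = {v: j for j, v in enumerate(variables)}
    T, D = len(times), len(variables)
    values = [[0.0] * D for _ in range(T)]   # 0.0 is a placeholder where mask == 0
    mask = [[0] * D for _ in range(T)]
    for t, var, val in events:
        if var in col:                       # variables outside the shared set are skipped
            values[row[t]][col[var]] = val
            mask[row[t]][col[var]] = 1
    # delta: time elapsed since each variable was last observed.
    delta = [[0.0] * D for _ in range(T)]
    for i in range(1, T):
        gap = times[i] - times[i - 1]
        for j in range(D):
            delta[i][j] = gap if mask[i - 1][j] else gap + delta[i - 1][j]
    return times, values, mask, delta
```

Restricting `variables` to those used for graph construction keeps the input information comparable across the graph-based and tabular models.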

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The Mount Sinai Health System patient data used retrospectively for this project, with institutional permission granted by the Institutional Review Board of the Icahn School of Medicine at Mount Sinai, comprise confidential patient records and are not publicly available due to privacy restrictions. Requests from institution-affiliated researchers for access to processed data will be considered by the Charles Bronfman Institute for Personalized Medicine steering committee, should be directed to G.N.N. (girish.nadkarni@mountsinai.org), and will be handled within 1 month. Patient data from University of California, Irvine Medical Center are available through credentialed access as part of the MOVER dataset (https://mover.ics.uci.edu/index.html) with Institutional Review Board approval from University of California, Irvine School of Medicine and from University of California, Irvine main campus. Processed data, including model weights and embeddings, are publicly available at https://github.com/Nadkarni-Lab/InfEHR.

Code availability

PyTorch implementation and documentation are available at https://github.com/Nadkarni-Lab/InfEHR.

References

Sackett, D. L., Rosenberg, W. M. C., Gray, J. A. M., Haynes, R. B. & Richardson, W. S. Evidence based medicine: What it is and what it isn’t. BMJ 312, 71–72 (1996).
Djulbegovic, B. & Guyatt, G. H. Progress in evidence-based medicine: A quarter century on. Lancet 390, 415–423 (2017).
Bate, L., Hutchinson, A., Underhill, J. & Maskrey, N. How clinical decisions are made. Br. J. Clin. Pharm. 74, 614–620 (2012).
Hunink, M. G. M., et al. Decision Making in Health and Medicine: Integrating Evidence and Values (Cambridge University Press, 2014).
Timmermans, S. & Angell, A. Evidence-based medicine, clinical uncertainty, and learning to doctor. J. Health Soc. Behav. 42, 342–359 (2001).
Maxim, L. D., Niebo, R. & Utell, M. J. Screening tests: A review with examples. Inhal. Toxicol. 26, 811–828 (2014).
Kumar, A. et al. Duration of hypotension before initiation of effective antimicrobial therapy is the critical determinant of survival in human septic shock. Crit. Care Med. 34, 1589–1596 (2006).
Fleisher, L. A. et al. 2014 ACC/AHA guideline on perioperative cardiovascular evaluation and management of patients undergoing noncardiac surgery: A report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation 130, 2215–2245 (2014).
Bose, S. & Talmor, D. Who is a high-risk surgical patient? Curr. Opin. Crit. Care 24, 547–553 (2018).
Adler-Milstein, J. & Jha, A. K. HITECH Act drove large gains in hospital electronic health record adoption. Health Aff. (Millwood) 36, 1416–1422 (2017).
Blumenthal, D. & Tavenner, M. The “meaningful use” regulation for electronic health records. N. Engl. J. Med 363, 501–504 (2010).
Longhurst, C. A., Harrington, R. A. & Shah, N. H. A “green button” for using aggregate patient data at the point of care. Health Aff. (Millwood) 33, 1229–1235 (2014).
Jensen, P. B., Jensen, L. J. & Brunak, S. Mining electronic health records: Towards better research applications and clinical care. Nat. Rev. Genet 13, 395–405 (2012).
Hripcsak, G. & Albers, D. J. Next-generation phenotyping of electronic health records. J. Am. Med Inf. Assoc. 20, 117–121 (2013).
Miotto, R., Li, L., Kidd, B. A. & Dudley, J. T. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records. Sci. Rep. 6, 1–10 (2016).
Rotmensch, M., Halpern, Y., Tlimat, A., Horng, S. & Sontag, D. Learning a health knowledge graph from electronic medical records. Sci. Rep. 7, 1–11 (2017).
Zhao, J., Papapetrou, P., Asker, L. & Boström, H. Learning from heterogeneous temporal data in electronic health records. J. Biomed. Inf. 65, 105–119 (2017).
Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F., & Sun, J. Doctor AI: Predicting clinical events via recurrent neural networks. JMLR Workshop Conf Proc, 301–318 (2016).
Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A. & Vandergheynst, P. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Process Mag. 34, 18–42 (2017).
Gainza, P. et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods 17, 184–192 (2020).
Nelson, C. A., Butte, A. J. & Baranzini, S. E. Integrating biomedical research and electronic health records to create knowledge-based biologically meaningful machine-readable embeddings. Nat. Commun. 10, 3045 (2019).
Wanyan, T., Honarvar, H., Azad, A., Ding, Y. & Glicksberg, B. S. Deep learning with heterogeneous graph embeddings for mortality prediction from electronic health records. Data Intell. 3, 329–339 (2021).
Huang, K. et al. A foundation model for clinician-centered drug repurposing. Nat. Med 30, 3601–3613 (2024).
Shane, A. L., Sánchez, P. J. & Stoll, B. J. Neonatal sepsis. Lancet 390, 1770–1780 (2017).
Benitz, W. E. Adjunct laboratory tests in the diagnosis of early-onset neonatal sepsis. Clin. Perinatol. 37, 421–438 (2010).
Zea-Vera, A. & Ochoa, T. J. Challenges in the diagnosis and management of neonatal sepsis. J. Trop. Pediatr. 61, 1–13 (2015).
Kuppala, V. S., Meinzen-Derr, J., Morrow, A. L. & Schibler, K. R. Prolonged initial empirical antibiotic treatment is associated with adverse outcomes in premature infants. J. Pediatr. 159, 720–725 (2011).
Ting, J. Y. et al. Association between antibiotic use and neonatal mortality and morbidities in very low-birth-weight infants without culture-proven sepsis or necrotizing enterocolitis. JAMA Pediatr. 170, 1181–1187 (2016).
Cantey, J. B. & Patel, S. J. Antimicrobial stewardship in the NICU. Infect. Dis. Clin. North Am. 28, 247–261 (2014).
Kuzniewicz, M. W. et al. A quantitative, risk-based approach to the management of neonatal early-onset sepsis. JAMA Pediatr. 171, 365–371 (2017).
Gilfillan, M. & Bhandari, V. Biomarkers for the diagnosis of neonatal sepsis and necrotizing enterocolitis: Clinical practice guidelines. Early Hum. Dev. 105, 25–33 (2017).
Wynn, J. L. & Polin, R. A. Progress in the management of neonatal sepsis: The importance of a consensus definition. Pediatr. Res 83, 13–15 (2018).
Sharma, D., Farahbakhsh, N., Shastri, S. & Sharma, P. Biomarkers for diagnosis of neonatal sepsis: A literature review. J. Matern Fetal Neonatal Med 31, 1646–1659 (2018).
Hofer, N., Zacharias, E., Müller, W. & Resch, B. An update on the use of C-reactive protein in early-onset neonatal sepsis: Current insights and new tasks. Neonatology 102, 25–36 (2012).
Chiesa, C. et al. C reactive protein and procalcitonin: Reference intervals for preterm and term newborns during the early neonatal period. Clin. Chim. Acta 412, 1053–1059 (2011).
Grams, M. E. et al. Acute kidney injury after major surgery: A retrospective analysis of Veterans Health Administration data. Am. J. Kidney Dis. 67, 872–880 (2016).
Kheterpal, S. et al. Development and validation of an acute kidney injury risk index for patients undergoing general surgery: Results from a national data set. Anesthesiology 110, 505–515 (2009).
Hodgson, L. E., Dimitrov, B. D., Roderick, P. J., Venn, R. & Forni, L. G. Predicting AKI in emergency admissions: An external validation study of the acute kidney injury prediction score (APS). BMJ Open 7, e013511 (2017).
Park, S. et al. Simple postoperative AKI Risk (SPARK) classification before noncardiac surgery: A prediction index development study with external validation. J. Am. Soc. Nephrol. 30, 170–181 (2019).
Thiele, R. H. et al. Standardization of care: Impact of an enhanced recovery protocol on length of stay, complications, and direct costs after colorectal surgery. J. Am. Coll. Surg. 220, 430–443 (2015).
O’Neal, J. B., Shaw, A. D. & Billings, F. T. 4th. Acute kidney injury following cardiac surgery: Current understanding and future directions. Crit. Care 20, 187 (2016).
Thakar, C. V., Arrigain, S., Worley, S., Yared, J.-P. & Paganini, E. P. A clinical score to predict acute renal failure after cardiac surgery. J. Am. Soc. Nephrol. 16, 162–168 (2005).
Park, S. et al. Impact of electronic acute kidney injury (AKI) alerts with automated nephrologist consultation on detection and severity of AKI: A quality improvement study. Am. J. Kidney Dis. 71, 9–19 (2018).
Matheny, M. E. et al. Development of inpatient risk stratification models of acute kidney injury for use in electronic health records. Med Decis. Mak. 30, 639–650 (2010).
Nishimoto, M. et al. External validation of a prediction model for acute kidney injury following noncardiac surgery. JAMA Netw. Open 4, e2127362 (2021).
Hodgson, L. E. et al. Systematic review of prognostic prediction models for acute kidney injury (AKI) in general hospital populations. BMJ Open 7, e016591 (2017).
James, M. T. et al. Derivation and external validation of prediction models for advanced chronic kidney disease following acute kidney injury. JAMA 318, 1787–1797 (2017).
Moledina, D. G. & Parikh, C. R. Phenotyping of acute kidney injury: Beyond serum creatinine. Semin Nephrol. 38, 3–11 (2018).
Salas, M., Hotman, A. & Stricker, B. H. Confounding by indication: An example of variation in the use of epidemiologic terminology. Am. J. Epidemiol. 149, 981–983 (1999).
Bornstein, B. H. & Emler, A. C. Rationality in medical decision making: A review of the literature on doctors’ decision-making biases. J. Eval. Clin. Pr. 7, 97–107 (2001).
Marewski, J. N. & Gigerenzer, G. Heuristic decision making in medicine. Dialogues Clin. Neurosci. 14, 77–89 (2012).
Norman, G. R. & Eva, K. W. Diagnostic error and clinical reasoning. Med Educ. 44, 94–100 (2010).
Lee, W. C. Selecting diagnostic tests for ruling out or ruling in disease: The use of the Kullback-Leibler distance. Int J. Epidemiol. 28, 521–525 (1999).
Grimes, D. A. & Schulz, K. F. Refining clinical diagnosis with likelihood ratios. Lancet 365, 1500–1505 (2005).
Pepe, M. S. The Statistical Evaluation of Medical Tests for Classification and Prediction (Oxford University Press, 2003).
Klingenberg, C., Kornelisse, R. F., Buonocore, G., Maier, R. F. & Stocker, M. Culture-negative early-onset neonatal sepsis—at the crossroad between efficient sepsis care and antimicrobial stewardship. Front Pediatr. 6, 285 (2018).
Cantey, J. B. & Baird, S. D. Ending the culture of culture-negative sepsis in the neonatal ICU. Pediatrics 140, e20170044 (2017).
Bihorac, A. et al. National surgical quality improvement program underestimates the risk associated with mild and moderate postoperative acute kidney injury. Crit. Care Med 41, 2570–2583 (2013).
Pencina, M. J., D’Agostino, R. B. Sr., D’Agostino, R. B. Jr. & Vasan, R. S. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat. Med 27, 157–172 (2008).
Zhou, D., Bousquet, O., Lal, T. N., Weston, J., & Schölkopf, B. Learning with local and global consistency. In Advances in Neural Information Processing Systems (eds. S. Thrun, L. Saul, & B. Schölkopf), 321–328 (MIT Press, 2008).
Ratner, A. et al. Snorkel: Rapid training data creation with weak supervision. VLDB J. 29, 709–730 (2020).
Ratner, A., De Sa, C., Wu, S., Selsam, D., & Ré, C. Data programming: Creating large training sets, quickly. Adv Neural Inf Process Syst. 29, 3567–3575 (
Rolf, E. et al. Resolving label uncertainty with implicit posterior models. In: Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence. PMLR 180, 1707–1717 (2022).
Horn, M., Moor, M., Bock, C., Rieck, B. & Borgwardt, K. Set functions for time series. Proceedings of the 37th International Conference on Machine Learning. PMLR 119, 4353–4363 (2020).
Che, Z., Purushotham, S., Cho, K., Sontag, D. & Liu, Y. Recurrent neural networks for multivariate time series with missing values. Sci. Rep. 8, 6085 (2018).
Samad, M. et al. Medical Informatics Operating Room Vitals and Events Repository (MOVER): A public-access operating room database. JAMIA Open 6, ooad084 (2023).
Ioannidis, J. P. A. & Bossuyt, P. M. M. Waste, leaks, and failures in the biomarker pipeline. Clin. Chem. 63, 963–972 (2017).
Hippisley-Cox, J. & Coupland, C. Predicting risk of emergency admission to hospital using primary care data: Derivation and validation of QAdmissions score. BMJ Open 3, e003482 (2013).
Johnson, J. M. & Khoshgoftaar, T. M. Survey on deep learning with class imbalance. J. Big Data 6, 27 (2019).
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. Proc. 34th Int. Conf. Mach. Learn. PMLR 70, 1321–1330 (2017).
Ovadia, Y., et al. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Proceedings of the 33rd International Conference on Neural Information Processing Systems. Article 1254, 14003–14014 (2019).
Hüllermeier, E. & Waegeman, W. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Mach. Learn. 110, 457–506 (2021).
Druzdzel, M. J. & Díez, F. J. Combining knowledge from different sources in causal probabilistic models. J. Mach. Learn Res 4, 295–316 (2003).
Bossuyt, P. M. et al. STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. BMJ 351, h5527 (2015).
Choi, E. et al. Graph convolutional transformer: Learning the graphical structure of electronic health records. Proc. AAAI Conf. Artif. Intell. 34, 606–613 (2020).
Wynn, J. L. et al. Time for a neonatal-specific consensus definition for sepsis. Pediatr. Crit. Care Med 15, 523–528 (2014).
Akobeng, A. K. Understanding diagnostic tests 2: Likelihood ratios, pre- and post-test probabilities and their use in clinical practice. Acta Paediatr. 96, 487–491 (2007).
Tomašev, N. et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 572, 116–119 (2019).
Beam, A. L. & Kohane, I. S. Big data and machine learning in health care. JAMA 319, 1317–1318 (2018).
Japkowicz, N. & Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 6, 429–449 (2002).
Biggerstaff, B. J. Comparing diagnostic tests: A simple graphic using likelihood ratios. Stat. Med 19, 649–663 (2000).
Jensen, A. B. et al. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nat. Commun. 5, 4022 (2014).
Liu, C., Zhang, K., Xiong, H., Jiang, G., & Yang, Q. Temporal skeletonization on sequential data: Patterns, categorization, and visualization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1336–1345 (2014).
Goldstein, B. A., Navar, A. M., Pencina, M. J. & Ioannidis, J. P. A. Opportunities and challenges in developing risk prediction models with electronic health records data: A systematic review. J. Am. Med Inf. Assoc. 24, 198–208 (2017).
Kipf, T. N., & Welling, M. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907v4 [cs.LG] (2017).
Bardes, A., Ponce, J., & LeCun, Y. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. International Conference on Learning Representations (2022).
Ranjan, E., Sanyal, S. & Talukdar, P. ASAP: Adaptive Structure Aware Pooling for learning hierarchical graph representations. Proc. AAAI Conf. Artif. Intell. 34, 5470–5477 (2020).

Acknowledgements

GN gratefully acknowledges the support of NIH UL1TR004419. This work was also supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. JK gratefully acknowledges Samuel Jackson M.D., who provided valuable insights into diagnostic challenges in clinical medicine.

Author information

Authors and Affiliations

The Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
Justin Kauffman, Akhil Vaid, Patricia Kovatch, Ankit Sakhuja, Benjamin S. Glicksberg, Ira Hofer & Girish N. Nadkarni
The Hasso Plattner Institute of Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
Justin Kauffman, Akhil Vaid, Patricia Kovatch, Ankit Sakhuja, Benjamin S. Glicksberg, Ira Hofer & Girish N. Nadkarni
The Charles Bronfman Institute for Personalized Medicine at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
Justin Kauffman, Emma Holmes, Akhil Vaid, Alexander W. Charney, Patricia Kovatch, Joshua Lampert, Ankit Sakhuja, Benjamin S. Glicksberg, Ira Hofer & Girish N. Nadkarni
Department of Data-Driven and Digital Medicine, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
Justin Kauffman, Emma Holmes, Akhil Vaid, Patricia Kovatch, Joshua Lampert, Ankit Sakhuja, Benjamin S. Glicksberg, Ira Hofer & Girish N. Nadkarni
Institute for Critical Care Medicine, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
Emma Holmes & Ankit Sakhuja
Division of Newborn Medicine, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
Emma Holmes
Department of Medicine, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
Joshua Lampert & Girish N. Nadkarni
Valentin Fuster Heart Hospital, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
Joshua Lampert
Broad Institute of MIT and Harvard, Cambridge, MA, USA
Marinka Zitnik
Harvard Data Science Initiative, Harvard University, Cambridge, MA, USA
Marinka Zitnik
Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA, USA
Marinka Zitnik
Department of Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
Ira Hofer

Contributions

J.K. developed and implemented the deep geometric learning methods and probabilistic approach, benchmarked models, analyzed model behavior, retrieved and processed EHR data, developed the temporal graph conversion methodology, and contributed to the study design. E.H. provided clinical labels for culture-negative sepsis and guidance on clinical decision making. A.V., J.L., A.S., and A.W.C. provided guidance on clinical decision making and reviewed the manuscript. P.K. provided technical guidance and support. M.Z. provided technical advice and guidance throughout the study. I.H. provided technical advice, guidance on clinical decision making, and contributed to study design. B.S.G. supplied technical advice, contributed to study design, and assisted with manuscript editing and revision. G.N.N. designed the study, provided technical and clinical guidance, and served as the corresponding author. All authors contributed to writing and editing the manuscript and discussed the results.

Corresponding author

Correspondence to Girish N. Nadkarni.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Ole Winther and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Kauffman, J., Holmes, E., Vaid, A. et al. InfEHR: Clinical phenotype resolution through deep geometric learning on electronic health records. Nat Commun 16, 8475 (2025). https://doi.org/10.1038/s41467-025-63366-6

Received: 06 February 2025
Accepted: 15 August 2025
Published: 26 September 2025
Version of record: 26 September 2025
DOI: https://doi.org/10.1038/s41467-025-63366-6
