Introduction

There has been a notable shift in the healthcare sector toward digitizing patient information, with electronic health records (EHRs) emerging as the new norm. As of 2018, the adoption rate of EHR systems had surpassed 84% and 94% in the US and UK, respectively1,2. EHR systems offer a comprehensive and easily accessible source of patient data, typically gathering data from millions of individuals over many years, encompassing various sources (such as primary and secondary care) and modalities (such as diagnoses, medications, and lab tests). The extensive nature of EHRs makes them a valuable resource for healthcare research, enabling more accurate modeling of patients and their disease risk, onset, and progression. However, the sheer size and complexity of EHR data inevitably pose challenges to modeling efforts, necessitating the development of sophisticated algorithms and data processing methods. Rapid advancements in the field of deep learning (DL) have had a profound impact on a wide range of industries and provide many opportunities for improving healthcare as well3,4. Specifically, DL has shown great promise in learning meaningful patient representations from EHRs, by virtue of its ability to uncover hidden patterns and trends in such complex datasets. These representations can then be used for a wide variety of downstream tasks such as patient stratification, disease risk prediction, and disease progression modeling. For example, CLOUT used long short-term memory (LSTM) networks for learning patient representations from EHRs, which were then used for downstream tasks such as mortality prediction5. Leveraging recent advances in large language modeling technology6,7, Med-BERT and BEHRT were two early examples of using transformer neural networks for learning representations from EHR sequences3,4, with applications including heart failure prediction and pancreatic cancer prediction. As a more recent example, Placido et al. presented a transformer model for detecting patients with a high risk of pancreatic cancer from EHR data8. As a final example, Chung et al. applied the large language model GPT-4 Turbo9 to procedure descriptions and clinical notes retrieved from EHRs for tasks including the prediction of hospital mortality, hospitalization duration, and ICU duration10.

Here, we explore the problem of learning clustered patient representations from longitudinal EHR data, in the context of disease-associated risk events. Patient clustering is an important concept in the field of precision medicine. Precision medicine aims to provide the right treatment to the right patient at the right time, by utilizing individual patient characteristics to guide clinical decision-making, instead of population-wide averages of patient characteristics11. Patient clustering supports the development of precision medicine approaches by detecting patterns and trends within a certain patient population of interest, which can serve as the basis for the identification of novel disease subtypes. By studying the causal molecular mechanisms of these disease subtypes, more targeted and personalized therapeutic approaches can be developed. Disease subtyping is often relevant in the context of certain risk events12. For example, why do some patients with Crohn’s disease, a subtype of inflammatory bowel disease (IBD), progress to intestinal stricture (a narrowing of the intestines due to the formation of scar tissue and muscular hypertrophy), and others do not? As Crohn’s disease is a multifactorial disease, it is likely that there are multiple mechanisms associated with progression toward intestinal obstruction13. Typically, one of the following two approaches is applied for elucidating how patient subgroups correlate with event risk: (1) Start with identifying patient clusters, and then analyze the risk event within each of the clusters14. This approach has inherent limitations because the resulting clusters are not guaranteed to correlate with the event risk. (2) Start with stratifying patients by the risk event, and then identify subgroups within the risk strata15. The main limitation here is that it can be difficult to identify patient subgroups with differentiating generative mechanisms. In other words, a high-risk patient subgroup could exhibit significant heterogeneity in causal molecular mechanisms, and patient subgroups with comparable survival outcomes could have varying responses to identical treatments16. Hence, to more accurately identify novel patient subgroups characterized by both divergent diagnosis trajectories and time-to-event profiles, patient clustering should be integrated with risk modeling (time-to-event analysis) to enable studying their interactions.

While several studies have previously explored patient clustering and risk modeling17,18,19,20, none of these approaches directly integrated risk modeling with clustering. In all cases, any resulting clusters were purely driven by the risk event and would suffer from the limitations we outlined above. These limitations were recently addressed by the introduction of VaDeSC (Variational deep survival clustering)21, which integrates risk modeling with clustering using a variational autoencoder (VAE) framework. The VAE is a probabilistic generative model where observations are assumed to originate from latent representations sampled from a prior distribution of choice. This prior distribution can function as a regularizer on the learned representations by enforcing a certain latent structure22. In the original formulation of the VAE, the researchers explored multivariate Gaussian and multivariate Bernoulli priors22. More recently, with models such as Variational Deep Embedding (VaDE)23, Gaussian mixture priors have been studied for the purpose of enforcing a latent structure that promotes clustering. VaDeSC combines a Gaussian mixture prior as used in VaDE with a Weibull mixture distribution for modeling cluster-specific survival, to learn cluster-specific associations between covariates and survival times.

Given the importance of risk modeling and clustering in healthcare research, methods such as VaDeSC provide many opportunities for applications within the healthcare domain. Specifically, insights into patients’ disease history and progression are of great importance for developing a more comprehensive disease understanding and eventually developing more effective and personalized therapies24,25. Hence, in this work, we developed a novel application of VaDeSC by building upon recent advances in patient representation learning from EHR sequences4. We integrated VaDeSC with a custom-designed autoencoding transformer-based architecture and implemented VaDeSC-EHR, a first attempt at disentangling the complex relationships between cluster-specific longitudinal disease histories and event risk profiles as retrieved from EHRs. In this study, we demonstrate VaDeSC-EHR’s ability to capture statistical interactions between cluster-specific diagnosis trajectories and survival times, enabling it to discover novel and clinically relevant patient subgroups.

Results

VaDeSC-EHR architecture

VaDeSC-EHR takes a patient’s disease history (diagnosis sequence) as an input and maps it into a latent representation \(z\) using a transformer-based VAE with a Gaussian mixture prior (Fig. 1). Adopting the VaDeSC approach21, the survival outcome is modeled using a mixture of Weibull distributions with cluster-specific parameters \(\beta\). The parameters of the Gaussian mixture and Weibull distributions are jointly optimized using the diagnosis sequences and survival outcomes. Note that in this work, we use the terms survival modeling, risk modeling, and time-to-event modeling interchangeably, as we do with patient cluster, patient subgroup, and patient subpopulation. More details about the model architecture and loss function can be found in the “Methods” section.

Fig. 1: Architecture of VaDeSC-EHR.

First, an embedding is computed for a patient’s diagnosis sequence. This embedding serves as input for multiple transformer blocks, where the final pooling layer generates the latent representation Z for the patient. Z is regularized toward a Gaussian mixture distribution by including a variational term in the loss function. Z is then used to predict the time-to-event21, as well as being passed to the transformer decoder for reconstructing the patient’s diagnosis sequence. Weights are shared between the encoder and decoder. For more details, please refer to the “Methods” section.

UK Biobank datasets

In this study, we applied VaDeSC-EHR to EHR data from UK Biobank (Table 1, Supplementary Table 1, and Supplementary Fig. 1). We mapped the Read v2/3 diagnosis codes and ICD-9 codes (International Classification of Diseases, 9th revision) provided by UK Biobank to ICD-10 codes (International Classification of Diseases, 10th revision; details in the “Methods” section) and designed a multi-level ICD-10 diagnosis embedding layer to capture the hierarchical nature of the ICD-10 ontology. In this embedding layer, each diagnosis is represented by a combination of six distinct embeddings (Fig. 2): three for the ICD-10 code (subcategory, category, and block), one for age, one for type, and one for position. The age embedding represents the patient’s age at the time of diagnosis and can additionally assist the model in understanding the temporal gaps between diagnoses. The type embedding differentiates between diagnoses derived from primary care data and those from hospital data. The position embedding, representing visits, establishes the relative placement of diagnoses within the diagnosis sequence, allowing the network to recognize positional relationships between diagnoses. Diagnoses originating from the same visit will have identical position embeddings.

Table 1 Summary of the datasets used in this study
Fig. 2: The embedding structure.

Simulated example of a diagnosis sequence for a single individual with 8 diagnoses spread across 5 primary or hospital care visits (0–4), embedded at three levels of the ICD-10 ontology. Diag.: Diagnosis.

VaDeSC-EHR outperforms baseline methods on synthetic data

In order to establish confidence in the methodology, we first technically validated VaDeSC-EHR on a synthetic benchmark dataset generated using a VaDeSC-EHR decoder (see “Methods” section) and compared its generalization performance to a range of baseline methods (Supplementary Figs. 2 and 3). The baseline methods were: variational deep survival clustering with a multilayer perceptron (VaDeSC-MLP), semi-supervised clustering (SSC)17, survival cluster analysis (SCA)18, deep survival machines (DSM)19, and recurrent neural network-based DSM (RDSM)20, as well as k-means and regularized Cox PH as naïve baselines.

Using synthetic data provides optimal control over the data-generating process and allows the generation of ground-truth cluster labels, which are not used for training the model, but can be used post hoc to unambiguously assess generalization performance. It is important to note that no baseline methods explicitly designed for clustering longitudinal survival data are currently available from the literature, and certain adaptations in applying these baseline methods needed to be made (see “Methods” section). In addition to evaluating the cluster predictions using the balanced accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI), we evaluated the time-to-event predictions using the concordance index (CI). Note that due to the noise in the data-generating process, achieving perfect performance was not possible.

As can be seen in Table 2, VaDeSC-EHR significantly outperformed all other methods in retrieving the ground-truth clusters from the benchmark data. Even when replacing the transformer encoder/decoder architecture with a basic multilayer perceptron (MLP) encoder/decoder with simple Term Frequency-Inverse Document Frequency (TF-IDF) features (VaDeSC-MLP), performance was still significantly better, highlighting the expressiveness provided by jointly modeling clustering and event risk using the VaDeSC approach. As expected, the transformer architecture helped VaDeSC-EHR to achieve much better performance on the longitudinal clustering task than VaDeSC-MLP (VaDeSC-EHR: ACC = 0.64 ± 0.04, NMI = 0.72 ± 0.04, and ARI = 0.66 ± 0.07; VaDeSC-MLP: ACC = 0.57 ± 0.05, NMI = 0.37 ± 0.04, and ARI = 0.44 ± 0.03, Table 2). Importantly, VaDeSC-EHR achieved its superior clustering performance while hardly sacrificing performance on the risk prediction task. VaDeSC-MLP only showed marginally better performance on survival prediction than VaDeSC-EHR (CI = 0.79 ± 0.02 for VaDeSC-MLP vs. CI = 0.77 ± 0.01 for VaDeSC-EHR, p-value = 0.007). Given the known regularizing effects of variational loss terms22, it is likely that VaDeSC-EHR’s risk prediction performance could be improved (possibly at the expense of some clustering performance) by weighting the different loss terms (e.g. beta-VAE26). We leave this to future study.

Table 2 Performance on synthetic benchmark data

The remaining models (k-means+Cox PH, SSC, SCA, DSM, and RDSM) performed no better than random at retrieving the ground-truth clusters. Whereas RDSM clearly stood out as the third-best performing method on the risk prediction task, it still performed worse than VaDeSC-MLP. This is interesting, because RDSM explicitly models EHR data as sequences, whereas VaDeSC-MLP does not. However, as explained in the introduction, the RDSM clusters are solely driven by survival times. This can explain the observed difference between VaDeSC-MLP and RDSM on the risk prediction task, and again highlights the importance of modeling the interactions between the diagnosis sequences and the survival times.

VaDeSC-EHR outperforms baseline methods on a T1D/T2D benchmark

While VaDeSC-EHR outperformed all baseline methods on synthetic data, we also wanted to assess how VaDeSC-EHR would compare to the baseline methods on real-world rather than simulated data. For this reason, we designed a second experiment for technically validating VaDeSC-EHR, using real-world data from UK Biobank instead of synthetic data: distinguishing 494 type 1 diabetes (T1D) patients from 1830 type 2 diabetes (T2D) patients in their progression toward retinal disorders. We chose diabetes for three main reasons. First, as with the synthetic data, it allowed us to unambiguously define ground-truth cluster labels (T1D and T2D), which would not be used for training the model but could be used post hoc to assess generalization performance. Second, with a censoring rate of 42% (Table 1), the dataset is relatively balanced, allowing for a sufficiently powered comparison between the methods. Third, besides some similarities27,28, there are clear and known differences in clinical course and pathophysiological mechanisms between T1D and T2D29, which can be exploited by machine learning algorithms in their attempts to recover the ground-truth cluster labels.

We compared VaDeSC-EHR’s generalization performance to a range of baseline methods using nested cross-validation (NCV)30. We additionally performed 10 runs using the same hyperparameters but with differently initialized embeddings and neural network parameters, confirming the stability of the results (Supplementary Fig. 4). Note that the T1D and T2D labels were not used for clustering, nor was the age at first diabetes diagnosis available to the model. To ensure that no information leakage occurred due to the inclusion of age-related complications, we additionally performed an analysis taking age information relative to the age at the start of the diagnosis sequence (VaDeSC-EHR_relage), by subtracting the age at first diagnosis from all elements in the age sequence. The performance of VaDeSC-EHR_relage was almost identical to that of VaDeSC-EHR, indicating that VaDeSC-EHR does indeed primarily utilize the time intervals between diagnoses, rather than the age itself.

The main observations from our method comparison strongly mirror those made from the synthetic benchmark. Again, VaDeSC-EHR outperformed the other methods at retrieving the ground-truth clustering (Fig. 3 and Table 3; AUC = 0.81 ± 0.01, ACC = 0.81 ± 0.02), while not giving up any risk prediction performance. A UMAP projection of the latent representations of the diagnosis sequences indeed showed that patients with the same ground-truth disease label (either T1D or T2D) tended to be close in the latent space (Supplementary Fig. 5). VaDeSC-MLP (ACC: 0.71 ± 0.09) was the next best-performing method, performing better than VaDeSC-EHR_nosurv (ACC: 0.64 ± 0.06), in which the survival loss was turned off. VaDeSC-EHR’s performance was degraded when not considering the survival times (VaDeSC-EHR_nosurv), illustrating that in achieving its superior performance, the full VaDeSC-EHR model did indeed exploit the interactions between EHR trajectories and survival times. In contrast to the synthetic benchmark, the performance of VaDeSC-EHR and VaDeSC-MLP on the risk prediction task was statistically indistinguishable (CI: 0.72 ± 0.07, p-value: 0.69). Mirroring the results on the synthetic benchmark, all other models performed poorly at retrieving the ground-truth clusters. Finally, RDSM again clearly stood out as the third-best method on the risk prediction task (CI = 0.63 ± 0.13), highlighting the importance of accounting for interactions between diagnosis sequences and survival times.

Fig. 3: Generalization performance on T1D/T2D benchmark data.

Comparison between VaDeSC-EHR and the other methods used for clustering longitudinal survival data. VaDeSC-EHR_nosurv represents VaDeSC-EHR trained without the survival loss. VaDeSC-EHR_relage represents VaDeSC-EHR trained with age information taken relative to the age at the start of the diagnosis sequence. The analyses are based on nested 5-fold cross-validations (n = 5). a Performance on retrieving the ground-truth clustering, in terms of the area under the receiver-operating characteristic curve (AUC), with p-values for the significance of the difference between VaDeSC-EHR and the other methods. Significance is assessed using a permutation test combined with 10,000 bootstrap iterations. b Performance on retrieving the ground-truth clustering in terms of balanced accuracy (ACC), with 0.5 for random performance. c Performance on time-to-event prediction, in terms of concordance index (CI), with 0.5 for random performance. Data are presented as mean values ± standard deviation.

Table 3 Generalization performance on T1D/T2D benchmark data

In conclusion, VaDeSC-EHR outperformed the baseline methods not only on synthetic data but also in an experiment designed from real-world data. This reinforced our confidence in the approach and again highlighted the potential of integrating VaDeSC into a transformer-based patient representation learning approach.

Application: VaDeSC-EHR identifies clinically and genetically relevant Crohn’s disease patient subgroups

Having gained confidence in the method through two technical validation experiments with ground-truth cluster labels, we applied VaDeSC-EHR to 1908 Crohn’s disease (CD) patients from UK Biobank. Here, the purpose was to identify potentially novel patient subgroups related to progression toward intestinal obstruction, an important CD complication31 that is enriched in CD patients relative to the general population (Supplementary Fig. 6). VaDeSC-EHR identified four patient subgroups (CI: 0.91 ± 0.02) that demonstrated both divergent longitudinal disease histories and divergent risk profiles (Fig. 4). The results were stable across 10 randomly initialized runs (Supplementary Fig. 7). As expected, we observed that patients clustered together tended to be close in the latent space, i.e. have diagnosis trajectories that are similar (Fig. 4a). Additionally, the four patient subgroups each demonstrated distinct time-to-event profiles (Fig. 4b), with the fastest progressing cluster 2 being most strongly enriched for intestinal obstruction (Supplementary Fig. 8a). We found the four clusters to be significantly associated with age of onset (p-value: 6.82 × 10⁻⁵), sex (p-value: 2.35 × 10⁻¹⁰), and genetic principal component number 1 (PC1) (p-value: 0.01), but not with UKB recruitment location or overall diagnosis sequence length (total number of diagnosis codes) (Supplementary Figs. 8–10).

Fig. 4: Clustering CD patients from UK Biobank in their progression toward intestinal obstruction.

a UMAP (Uniform Manifold Approximation and Projection) projection of the latent representations of the CD patients, coloring patients by cluster (silhouette coefficient: 0.783). b Cluster-specific Kaplan–Meier curves with 95% confidence intervals. Lines denote mean values and shaded regions are 95% confidence intervals.

CD patient subgroups demonstrate distinct disease histories

To gain insight into the generative mechanisms giving rise to the observed differences in intestinal obstruction risk, we first analyzed differential enrichment of individual diagnoses across the clusters, while adjusting for potential confounding as outlined in the “Methods” section. For example, corroborating previous studies32,33,34, we found that many important CD-related ICD-10 codes were reported significantly less frequently in the most slowly progressing cluster 1, such as R10 (abdominal and pelvic pain, adjusted p-value: 0.01), R11 (nausea and vomiting, adjusted p-value: 0.0007), and D64 (anemia, adjusted p-value: 0.0001) (Supplementary Fig. 11). To take full advantage of the longitudinal nature of VaDeSC-EHR, we then analyzed how diagnosis trajectories differed longitudinally among patient clusters. Although there was no significant difference in overall diagnosis sequence length between the clusters (Supplementary Fig. 8e), the diagnosis history leading up to the first CD diagnosis was twice as long for the slowly progressing clusters (clusters 1 and 4) as it was for the fast progressing clusters (clusters 2 and 3) (Supplementary Fig. 8f), with many diagnosis subsequences significantly enriched in the slow progressors before their first CD diagnosis relative to the fast progressors (Supplementary Fig. 12a). These enriched diagnosis subsequences contained many known comorbidities of CD such as anemia34, hypertension35, and hernia32. While clusters 1 and 4 were strongly enriched for comorbidities before the first CD diagnosis, there was no significant enrichment of comorbidities after the first CD diagnosis, except for hypertension. On the other hand, clusters 2 and 3 were hardly significantly enriched for comorbidities leading up to their first CD diagnosis (Supplementary Fig. 12b), but these patients did appear to progress more rapidly in their disease, as evident from comorbidities developing after CD onset, including abdominal pain, nausea, vomiting and iron deficiency anemia32,36,37 (Supplementary Fig. 12b). Interestingly, although patients in clusters 1 and 4 progressed more slowly, their diagnosis trajectories were nonetheless quite distinct (Supplementary Fig. 13). For example, cluster 4 was more enriched for hypertension before the first CD diagnosis, whereas cluster 1 was more enriched for pain38,39 (back pain, abdominal pain and pain in joint), and respiratory abnormalities40,41 (cough and asthma) (Supplementary Figs. 13 and 14).

Does smoking protect against progression toward intestinal obstruction in some CD patients?

Because of the previously reported interesting and poorly understood differential effects of smoking behavior between the two main types of inflammatory bowel disease (protective in ulcerative colitis and harmful in Crohn’s disease)42,43, we were then interested to see whether smoking was uniformly harmful across our identified CD patient clusters. Corroborating previous work35, we observed that smoking behavior was associated with faster progression toward intestinal obstruction in the overall CD patient population (ever smoked, log hazard ratio: 0.08, Fig. 5a; ICD-10 F17.2: nicotine dependence, log hazard ratio: 0.20, Fig. 5b). Interestingly however, we observed that the slowest progressing cluster 1 was enriched for smoking behavior (ever smoked, p-value: 1.86 × 10⁻⁵; ICD-10 F17.2: nicotine dependence, p-value: 0.001) (Fig. 5c, d). Even more surprisingly, within cluster 1, smoking was in fact significantly associated with slower progression toward obstruction, i.e. CD patients who had ever smoked generally progressed more slowly toward intestinal obstruction (ever smoked, log hazard ratio: −0.76, p-value: 1.06 × 10⁻⁷, Fig. 5a; ICD-10 F17.2: nicotine dependence, log hazard ratio: −0.96, p-value: 0.001, Fig. 5b). Similarly, “Pack years of smoking” and “Pack years adult smoking as a proportion of life span exposed to smoking” were associated with slower progression toward intestinal obstruction within cluster 1, as were previous and current smoker status (Supplementary Fig. 15). These results could suggest that in CD, as in ulcerative colitis42,43, patient subgroups exist for which smoking protects against progression toward intestinal obstruction.

Fig. 5: Association of CD clusters with smoking behavior.

The analyses are based on 1908 CD patients (n = 1908). a Association of ever having smoked (UK Biobank data field: 20160) with risk of intestinal obstruction (p-values are Overall: 0.45, Cluster 1: 1.06 × 10⁻⁷, Cluster 2: 1.23 × 10⁻²⁹, Cluster 3: 0.109, Cluster 4: 0.200) and b Association of nicotine dependence (ICD-10 code: F17.2) with risk of intestinal obstruction (p-values are Overall: 0.26, Cluster 1: 0.001, Cluster 2: 4.23 × 10⁻⁹, Cluster 3: 0.09, Cluster 4: 0.99). Data are presented as mean values ± 95% confidence intervals, estimated using two-sided Cox proportional hazards regression models. Asterisks indicate statistical significance (p-value < 0.05, multivariate Cox regression). c Percentage of patients who ever smoked (p-value: 1.86 × 10⁻⁵) and d Percentage of patients with nicotine dependence (ICD-10 code: F17.2) (p-value: 0.001). Significance is assessed using multinomial logistic regression with a log-likelihood ratio test, which is a two-sided test.

VaDeSC-EHR can identify subgroup-specific molecular mechanisms relevant to drug discovery and precision medicine

In addition to the EHR data that was used for identifying the four CD patient subgroups presented in this section, UK Biobank provides a wealth of other types of data on the same CD patients that we did not use for clustering. Specifically, the genetics data available for approximately 500,000 patients in the UK Biobank provides interesting opportunities for further characterizing our patient clusters and gaining insight into the generative mechanisms underlying the observed differences between the clusters.

Using the available genetics data, we first computed pathway PRSs (pathway-based polygenic risk scores) to assess differences in the underlying genetic background between fast and slow progression clusters (Supplementary Data 1). Among others, we found that the fast progressors (clusters 2 and 3) displayed a higher genetic burden in a pathway related to the adaptive immune response, relative to the slow progressors (clusters 1 and 4) (Fig. 6a). Specifically, we found that the patients in the fastest progressing cluster 3 displayed a higher genetic burden in this pathway than the patients in all other, more slowly progressing, clusters. The involvement of innate immunity versus adaptive immunity in the pathogenesis of CD is an important topic of research44. Recently, several studies demonstrated the role of an abnormal adaptive immune response in the pathogenesis of CD44,45,46, and the over-reactive adaptive immune response is in fact the target of current CD treatments46. Our results confirm the important role that the adaptive immune response may play in the pathogenesis of CD in at least a subpopulation of CD patients.

Fig. 6: Association of CD clusters with genetics.

The analyses are based on 1908 CD patients (n = 1908). a Pathway polygenic risk scores (pathway PRS) of the adaptive immune response pathway, comparing clusters 2 and 3 with clusters 1 and 4 (left), and cluster 3 with the other three clusters (right). The bounds of the box are defined by the lower quartile (25th percentile) and the upper quartile (75th percentile). The whiskers extend from the box and represent the data points that fall within 1.5 times the interquartile range (IQR) from the lower and upper quartiles. Any data point outside this range is considered an outlier and plotted separately. Significance is assessed using logistic regression with a log-likelihood ratio test, which is a two-sided test, and multiple testing is corrected using the Benjamini–Hochberg procedure. b Enrichment of individual genetic variants used in the pathway PRS in clusters 1 and 4 relative to clusters 2 and 3, with the horizontal line indicating the significance level (FDR-adjusted p-value: 0.05). The variant highlighted in red is rs2523608. c Enrichment of individual genetic variants used in the pathway PRS in clusters 1, 2, and 4 compared to cluster 3, with the horizontal line indicating the significance level (FDR-adjusted p-value: 0.05). The variant highlighted in red is rs2523608. Significance is assessed using a two-sided Chi-square test and multiple testing is corrected using the Benjamini–Hochberg procedure.

We then wanted to see if we could identify any specific genetic variants in the adaptive immune response pathway that could potentially explain the significant association of its pathway PRS with our patient subgroups. Among the 150 SNPs, we found SNP rs2523608 to be significantly associated with fast progression toward intestinal obstruction (clusters 2 and especially 3) (Fig. 6b, c). SNP rs2523608 has previously been associated with gastrointestinal disorders, such as celiac disease47 (known to be bidirectionally causally related to CD48), intestinal malabsorption49 (a common complication of CD50), and CD itself51.

Interestingly, rs2523608 has been reported to be negatively associated with the binding antibody response to interferon beta-1a (IFN beta-1a) therapy in multiple sclerosis (MS), which shares common pathogenic processes with CD52. IFN beta-1a is an approved treatment for MS53 and has been investigated as a potential treatment for CD as well, due to its ability to down-regulate the expression of interleukin-12, a cytokine that is thought to be involved in mucosal degeneration in CD54. Unfortunately, there was no significant difference in efficacy between patients receiving IFN beta-1a and those receiving placebo54. However, the results of the MS study suggest that one reason for this observed lack of efficacy could lie in the genetic variability across patients. More specifically, depending on the presence of the rs2523608 SNP (risk allele frequency in UKB: 0.42 in CD patients and 0.40 in the general population) in a CD patient, response to IFN beta-1a treatment could be impaired. Using VaDeSC-EHR, we identified two, relatively fast progressing, CD subgroups enriched for this SNP. This demonstrates that VaDeSC-EHR is able to identify subgroup-specific molecular mechanisms relevant to drug discovery and precision medicine, using only longitudinal diagnosis data as input.

Discussion

Precision medicine aims to develop therapies that are targeted to specific patient subgroups based on their predicted disease risk or progression, or treatment response. Due to their scale, comprehensive nature, and multimodality, EHRs have emerged as a valuable source of data for healthcare research in general and the development of precision medicine approaches specifically55. An important concept in precision medicine is the identification of novel patient subgroups, i.e. patient clustering. Retrospective analysis and clustering of EHR trajectories can provide an important first step in the development of precision medicine approaches by supporting the identification of novel subgroup-specific and targetable disease mechanisms. Following the identification of a potentially targetable patient subgroup, predictive modeling could be used for prognostic enrichment of clinical trials56, which can eventually influence treatment decisions in clinical practice, through a drug label reflecting the enrichment strategies used to select patients in the clinical trials57.

In this paper, we implemented a novel application of VaDeSC21 leveraging recent advances in the domain of EHR modeling. Specifically, we integrated VaDeSC into a custom-designed transformer-based autoencoding architecture for patient representation learning and developed VaDeSC-EHR, for clustering longitudinal time-to-event data extracted from EHRs. VaDeSC-EHR can exploit statistical interactions between a patient’s longitudinal disease history and survival time, enabling it to identify non-trivial patient subgroups characterized by both divergent disease histories and survival times. We validated VaDeSC-EHR on two benchmark experiments with known ground-truth cluster labels, showing that VaDeSC-EHR outperformed all baseline methods simultaneously on the risk prediction task and on the task of retrieving the ground-truth clustering. The baseline methods failed to identify clinically relevant subgroups due to either their focus on purely risk-driven clustering or their inability to model EHRs as sequences. As such, our validation highlights the need for novel methods such as VaDeSC-EHR that can more accurately identify novel patient subgroups by modeling the interactions between longitudinal diagnosis trajectories and event risk. After the technical validation, we applied VaDeSC-EHR to the problem of clustering Crohn’s disease patients in their progression toward intestinal obstruction. We demonstrated that the resulting subgroups were both clinically and biologically relevant. Specifically, using only diagnosis trajectories as input, VaDeSC-EHR was able to identify molecular mechanisms relevant to only specific subgroups of CD patients. Knowledge of such mechanisms is critical for drug discovery and the development of precision medicine approaches.

VaDeSC-EHR is a first attempt at integrating recent advances in EHR trajectory clustering and risk modeling and we see many opportunities for future research, a few of which we list here. First, the attention mechanism in the transformer architecture could be used to improve the interpretability of resulting clusters, e.g. by integrating an attention-based feature importance score58. Second, in our implementation of VaDeSC-EHR, we took some limited steps in representing relations between medical concepts by embedding multiple layers of the ICD-10 ontology, using the alphanumeric ICD-10 codes. However, more sophisticated, and potentially more powerful, approaches have been published that could be integrated with VaDeSC-EHR. Examples include modeling the entire ICD-10 ontology by representing ICD-10s as combinations of their ancestors via an attention mechanism59 as well as generating embeddings based on ICD-10 text descriptions instead of the ICD-10 codes60. Third, depending on the data source used, additional data modalities could be included to provide a more comprehensive view of a patient’s disease presentation, as well as to investigate additional factors potentially confounding the interpretation of the clustering, such as additional longitudinal EHR data modalities (e.g. medications, lab tests, and surgical procedures) and demographics (e.g. age, sex, education) or molecular data modalities commonly available in biobanks (e.g. genomics, proteomics)5.

In our study, we used data from the UK Biobank. The UK Biobank study provides extensive molecular and health data from approximately 500,000 volunteers from the UK. These data are moreover linked to EHR data collected from primary and hospital inpatient care interactions. Like the work that we built upon3,4, we focused on diagnosis trajectories. Our reason for this is that typical EHR data modalities such as prescribed medications and lab tests are not available from UK Biobank. Additionally, while longitudinal data on procedures are available, these data are only available in the highly UK-specific OPCS3/4 coding systems that are not easily mappable to internationally used systems and their inclusion would hence lead to highly UK-specific models61,62. In addition to the limited availability of EHR data modalities, another drawback of using UK Biobank data for EHR modeling is that the number of individuals covered by UK Biobank is small compared to many EHR databases such as IBM MarketScan63. However, the rich multimodality of biobanks such as the UK Biobank, with data ranging from lifestyle, medical history, and biometric data to molecular data such as whole genome sequencing data, provides wide-ranging opportunities for the downstream interpretation of analyses such as presented in our study.

In conclusion, we demonstrated how VaDeSC-EHR, integrating recent advances in risk modeling and EHR modeling, is highly effective at disentangling complex relationships between cluster-specific diagnosis trajectories and survival times. Hence, VaDeSC-EHR can be a powerful tool for supporting the development of precision medicine approaches through its ability to discover novel risk-associated patient subgroups.

Methods

EHR dataset from UK Biobank

This study was conducted using the UK Biobank resource, which has ethical approval and its own ethics committee (https://www.ukbiobank.ac.uk/ethics/). This research has been conducted using UK Biobank resources under Application Number 57952. For our analyses, we used both the primary and hospital inpatient care diagnosis records made available via the UK Biobank study64. We started from 451,265 patients with available hospital inpatient care data. For each patient, a diagnosis sequence was constructed by interleaving the hospital inpatient care data with any available primary care data based on their timestamps. We then mapped all resulting diagnosis codes from Read v2/3 and ICD-9 to ICD-1064. For those codes mapping to multiple ICD-10 codes, we included all possible mappings. Keeping only patients with at least five diagnoses, at most 200 diagnoses, and at least one month of diagnosis history, the resulting dataset consisted of 352,891 patients.
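
As an illustration of this filtering step, the sketch below applies the same inclusion criteria to a hypothetical long-format diagnosis table; the column names (eid, icd10, date) are placeholders, not actual UK Biobank field names.

```python
import pandas as pd

def filter_patients(diag: pd.DataFrame) -> pd.DataFrame:
    """Apply the inclusion criteria above: 5-200 diagnoses per patient and
    at least one month of diagnosis history. `diag` has one row per
    diagnosis event, with illustrative columns eid (patient id), icd10,
    and date (as datetime)."""
    stats = diag.groupby("eid").agg(
        n_diag=("icd10", "size"),
        first=("date", "min"),
        last=("date", "max"),
    )
    keep = stats[
        (stats["n_diag"] >= 5)
        & (stats["n_diag"] <= 200)
        & (stats["last"] - stats["first"] >= pd.Timedelta(days=30))
    ].index
    return diag[diag["eid"].isin(keep)].sort_values(["eid", "date"])
```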

The data processing pipeline is visualized in Supplementary Fig. 16, and the resulting dataset is summarized in Table 1, Supplementary Table 1, and Supplementary Fig. 1.

VaDeSC-EHR architecture

For implementing VaDeSC-EHR, we integrated VaDeSC into a transformer-based encoder/decoder architecture for learning patient representations (Fig. 1). Here, we describe the architecture and generative process in more detail.

The generative process

Adopting the approach of VaDeSC21, first, a cluster assignment \(c\in \{1,\ldots,K\}\) is sampled from a categorical distribution: \(p(c)={Cat}(\pi)\). Then a latent embedding \(z\) is sampled from a Gaussian distribution: \(p(z|c)={{{\mathcal{N}}}}({\mu }_{c},{\sigma }_{c}^{2})\). The diagnosis sequence \(x\) is generated from \(p(x|z)\), which for VaDeSC-EHR is modeled by a transformer-based decoder as described below. Finally, the survival time \(t\) is generated by \(p(t|z,c)\).

Transformer-based encoder and decoder

Embedding

First, for each ICD-10 code, its subcategory (e.g. K50.1), category (e.g. K50), and block (e.g. K50-K52) were extracted using the algorithm shown in Supplementary Note 1.

Then, the ICD-10 code (subcategory, category, and block), age, and type (primary or hospital) were individually embedded, while adding a sinusoidal position embedding to each of the individual embeddings. The position embedding was based on the visit number of the diagnosis within the diagnosis sequence6. Hence, diagnoses originating from the same doctor’s visit received the same position embedding. Finally, the individual embeddings were summed to arrive at the final embedding for each diagnosis (Fig. 2):

$$E_{ICD}=E_{ICD\_block}+E_{ICD\_category}+E_{ICD\_subcategory}$$
(1)
$$E_{encoder}=E_{ICD}+E_{age}+E_{type}+E_{position}$$
(2)
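
A minimal PyTorch sketch of Eqs. (1) and (2) is shown below. Vocabulary sizes, the age binning, and the exact sinusoidal position table are assumptions; the text specifies the six embedding components but not their implementation.

```python
import math
import torch
import torch.nn as nn

class DiagnosisEmbedding(nn.Module):
    """Multi-level diagnosis embedding (Eqs. 1-2): sum of subcategory,
    category, block, age, type, and visit-position embeddings."""
    def __init__(self, n_sub, n_cat, n_block, n_ages, d_model, max_visits=512):
        super().__init__()
        self.sub = nn.Embedding(n_sub, d_model)    # ICD-10 subcategory
        self.cat = nn.Embedding(n_cat, d_model)    # ICD-10 category
        self.blk = nn.Embedding(n_block, d_model)  # ICD-10 block
        self.age = nn.Embedding(n_ages, d_model)   # age at diagnosis
        self.typ = nn.Embedding(2, d_model)        # primary vs. hospital care
        # Fixed sinusoidal table indexed by visit number (Vaswani et al.).
        pos = torch.arange(max_visits).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_visits, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, sub, cat, blk, age, typ, visit):
        e_icd = self.sub(sub) + self.cat(cat) + self.blk(blk)          # Eq. (1)
        return e_icd + self.age(age) + self.typ(typ) + self.pe[visit]  # Eq. (2)
```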

Encoder

The embeddings were fed into a classical transformer encoder6,7 augmented with a SeqPool layer65 to consolidate the entire sequence of a patient into a single comprehensive representation for that individual. More specifically, given:

$${X}_{L}=f({X}_{0})\in {{\mathbb{R}}}^{b\times n\times d}$$
(3)

where \({X}_{L}\) is the output of an \(L\)-layer transformer encoder \(f\), \(b\) is the batch size, \(n\) is the sequence length, and \(d\) is the total embedding dimension. \({X}_{L}\) was fed into a linear layer \(g({X}_{L})\in {{\mathbb{R}}}^{d\times 1}\), and a softmax activation was applied to the output:

$${X}_{L}^{{\prime} }={softmax}({g\left({X}_{L}\right)}^{T})\in {{\mathbb{R}}}^{b\times 1\times n}$$
(4)

This generated an importance weighting for each input token, which was used as follows65:

$$z={X}_{L}^{{\prime} }{X}_{L}={softmax}({g\left({X}_{L}\right)}^{T})\times {X}_{L}\in {{\mathbb{R}}}^{b\times 1\times d}$$
(5)

By flattening, the output \(z\in {{\mathbb{R}}}^{b\times d}\) was produced, a summarized embedding for the full patient sequence.
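
The sequence pooling step of Eqs. (3)–(5) translates into a small module; the sketch below is a hypothetical PyTorch rendering (the actual SeqPool implementation follows ref. 65).

```python
import torch
import torch.nn as nn

class SeqPool(nn.Module):
    """Sequence pooling (Eqs. 3-5): an attention-weighted average that
    collapses the encoder output X_L into one vector per patient."""
    def __init__(self, d_model):
        super().__init__()
        self.g = nn.Linear(d_model, 1)  # the linear layer g in Eq. (4)

    def forward(self, x_l):                                     # x_l: (b, n, d)
        w = torch.softmax(self.g(x_l).transpose(1, 2), dim=-1)  # (b, 1, n), Eq. (4)
        z = torch.bmm(w, x_l)                                   # (b, 1, d), Eq. (5)
        return z.squeeze(1)                                     # flattened: (b, d)
```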

Decoder

First, the latent representation was transformed using fully connected layers:

$$X=f\left({W}^{\left(i\right)}\ldots f\left({W}^{\left(1\right)}{Z}^{\left(1\right)}\right)\right),Z\in {{\mathbb{R}}}^{b\times J},X\in {{\mathbb{R}}}^{b\times {LH}}$$
(6)

Here, \(i\) is the number of layers, \(b\) the batch size, \(L\) the sequence length, \(J\) the number of latent variables, \(H\) the dimensionality of embedding, and \(Z\) the latent representation (Fig. 1).

\(X\) was then reshaped to \({X}_{{re}}\) (\({{\mathbb{R}}}^{b\times {LH}}\to {{\mathbb{R}}}^{b\times L\times H}\)) and fed into the transformer decoder6,7 to generate the reconstructed ICD embedding \({\hat{E}}_{{ICD}}={{{\rm{Decoder}}}}({X}_{{re}})\). The reconstructed input embedding for each diagnosis for a given patient was then computed as:

$${E}_{{decoder}}={\hat{E}}_{{ICD}}+{E}_{{age}}+{E}_{{type}}+{E}_{{position}}$$
(7)

Here, \({E}_{{age}}\), \({E}_{{type}}\) and \({E}_{{position}}\) were copied over from the input to the transformer encoder.

Finally, the original EHR input sequence (up to the level of the ICD-10 subcategory) was reconstructed from \({E}_{{decoder}}\) using a softmax function.
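
A hypothetical sketch of this decoder path (Eqs. (6) and (7)) is given below. Layer sizes, the number of attention heads, and the use of self-attention-only blocks are assumptions, and the weight sharing with the encoder is omitted for brevity.

```python
import torch
import torch.nn as nn

class EHRDecoder(nn.Module):
    """Decoder path: latent z -> fully connected layers -> reshape ->
    transformer blocks -> softmax over the ICD-10 subcategory vocabulary."""
    def __init__(self, n_latent, seq_len, d_model, vocab_size, n_layers=2):
        super().__init__()
        self.seq_len, self.d_model = seq_len, d_model
        self.fc = nn.Sequential(                  # Eq. (6): Z -> R^{b x LH}
            nn.Linear(n_latent, seq_len * d_model), nn.ReLU(),
            nn.Linear(seq_len * d_model, seq_len * d_model),
        )
        block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.out = nn.Linear(d_model, vocab_size)  # subcategory logits

    def forward(self, z, e_age, e_type, e_position):
        x = self.fc(z).view(-1, self.seq_len, self.d_model)  # reshape to (b, L, H)
        e_icd_hat = self.blocks(x)                           # reconstructed E_ICD
        e_dec = e_icd_hat + e_age + e_type + e_position      # Eq. (7)
        return torch.log_softmax(self.out(e_dec), dim=-1)    # softmax in log space
```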

Evidence lower bound (ELBO)

The loss on the architecture as described above was computed using the ELBO as previously defined for VaDeSC by Manduchi et al.21. We briefly present the formula and outline its interpretation, but for more details, we refer the reader to Manduchi et al.21.

The ELBO of the classic VAE22 looks as follows:

$$L\left(x\right)={{\mathbb{E}}}_{q(z|x)}\log p(x|z)-{D}_{{KL}}\left(q\left(z|x\right){{||}}p(z)\right)$$
(8)

Here, \(x\) is the input patient diagnosis sequence, \(z\) is the latent space for the patient. The first term \(p\left(x|z\right)\) can be interpreted as the reconstruction loss of the autoencoder. In the second term, \(q\left(z|x\right)\) is the variational approximation to the intractable posterior \(p\left(z|x\right)\) and can be seen as regularizing z to lie on a multivariate Gaussian manifold22.

Adding a cluster indicator as in VaDE23, the ELBO looks as follows:

$$L\left(x\right)={{\mathbb{E}}}_{q(z|x)}\log p(x|z)-{D}_{{KL}}\left(q\left(z,c|x\right){{||}}p(z,c)\right)$$
(9)

The first term is the same as above. Analogous to the above, in the second term, \(q\left(z,c|x\right)\) is the variational approximation to the intractable posterior \(p\left(z,c|x\right)\) and can be seen as regularizing \(z\) to lie on a multivariate Gaussian mixture manifold.

Finally, including a survival time variable \(t\) in the model as in VaDeSC21, the ELBO looks as follows:

$$L\left(x,t\right)= {{\mathbb{E}}}_{q(z|x)p(c|z,t)}\log p(x|z)+{{\mathbb{E}}}_{q\left(z|x\right)p\left(c|z,t\right)}\log p\left(t|z,c\right)\\ -{D}_{{KL}}\left(q\left(z,c|x,t\right){{||}}p(z,c)\right)$$
(10)

The reconstruction loss \(p\left(x|z\right)\) is calculated as the sequence length × the mean cross-entropy.

The survival time \(p\left(t|z,c\right)\) is modeled by a Weibull distribution and adjusts for right-censoring:

$$p\left(t|z,c\right) ={f(t)}^{\delta }{S(t|z,c)}^{1-\delta }\\ ={\left[\frac{k}{{\lambda }_{c}^{z}}{\left(\frac{t}{{\lambda }_{c}^{z}}\right)}^{k-1}\exp \left(-{\left(\frac{t}{{\lambda }_{c}^{z}}\right)}^{k}\right)\right]}^{\delta }{\left[\exp \left(-{\left(\frac{t}{{\lambda }_{c}^{z}}\right)}^{k}\right)\right]}^{1-\delta }$$
(11)

Here, the variable \(\delta\) represents the censoring indicator, which is assigned 0 when the survival time of the patient is censored, and 1 in all other cases. For each patient, given by the latent space \(z\) and cluster assignment \(c\), the uncensored survival time is assumed to follow a Weibull distribution: \(f\left(t\right)={Weibull}\left({\lambda }_{c}^{z},k\right)=\frac{k}{{\lambda }_{c}^{z}}{(\frac{t}{{\lambda }_{c}^{z}})}^{k-1}\exp (-{(\frac{t}{{\lambda }_{c}^{z}})}^{k})\), where \({\lambda }_{c}^{z}={softplus}({z}^{T}{\beta }_{c})\), \({\beta }_{c}\in \{{\beta }_{1},{\beta }_{2},\ldots,{\beta }_{K}\}\). The censored survival time is then described by the survival function \(S\left(t|z,c\right)=\exp (-{(\frac{t}{{\lambda }_{c}^{z}})}^{k})\).
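
For concreteness, the censored Weibull log-likelihood of Eq. (11) can be written directly in code; the sketch below assumes a per-patient scale \({\lambda }_{c}^{z}\) and a scalar shape \(k\).

```python
import torch

def weibull_log_likelihood(t, delta, lam, k):
    """Censored Weibull log-likelihood of Eq. (11), per patient.

    t:     observed survival times, shape (b,)
    delta: censoring indicators (1 = event observed, 0 = censored)
    lam:   patient- and cluster-specific scale, softplus(z^T beta_c)
    k:     shape parameter of the Weibull distribution (scalar)
    """
    log_f = torch.log(k / lam) + (k - 1) * torch.log(t / lam) - (t / lam) ** k
    log_s = -((t / lam) ** k)           # log of the survival function S(t|z,c)
    return delta * log_f + (1 - delta) * log_s
```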

For more details and the complete derivation of the VaDeSC ELBO, we refer the reader to Manduchi et al.21.

Pre-training and fine-tuning strategy

For the real-world data applications (diabetes and CD), we first pre-trained the transformer encoder on the entire UK Biobank EHR dataset (Table 1, column 3). Following the original BERT study, we used a masked diagnosis learning strategy for pre-training. Specifically, for each patient’s diagnosis sequence, we set an 80% probability of replacing a code by [MASK], a 10% probability of replacing a code by a random other code, and the remaining 10% probability of keeping the code unchanged. The ICD-10 embeddings were randomly initialized, and the encoder was trained using an Adam optimizer with default beta1 and beta2.
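
A minimal sketch of this masking scheme as stated above (the probabilities are applied per code here, and the helper names are illustrative):

```python
import random

MASK = "[MASK]"

def corrupt_sequence(codes, vocab, rng=random):
    """Corrupt one diagnosis sequence for masked diagnosis learning: each
    code is masked with 80% probability, replaced by a random other code
    with 10% probability, and kept unchanged with 10% probability.
    Returns the corrupted sequence and the original codes as labels."""
    corrupted = []
    for code in codes:
        r = rng.random()
        if r < 0.8:
            corrupted.append(MASK)
        elif r < 0.9:
            corrupted.append(rng.choice([c for c in vocab if c != code]))
        else:
            corrupted.append(code)
    return corrupted, list(codes)
```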

For selecting our final transformer architecture, we followed the Bayesian hyperparameter optimization strategy as described in the BEHRT study4. The best-performing architecture consisted of 6 layers, 16 attention heads, a 768-dimensional latent space, and 1280-dimensional intermediate layers (more details in Supplementary Table 2).

After transformer encoder pre-training, we fine-tuned VaDeSC-EHR end-to-end on the T1D/T2D dataset (Table 1, column 4) and the CD dataset (Table 1, column 5). Details around fine-tuning are described in the following section.

VaDeSC-EHR in various applications

Synthetic benchmark

Data generation

We used a transformer decoder with random weights to simulate diagnosis sequences (Supplementary Fig. 2). More specifically, let \(K\) be the number of clusters, \(N\) the number of data points, \(L\) the capped sequence length, \(H\) the dimensionality of embedding, \(D\) the size of vocabulary, \(J\) the number of latent variables, \(k\) the shape parameter of the Weibull distribution and \({p}_{{cens}}\) the probability of censoring. Then, the data-generating process can be summarized as follows:

  1. Let \({\pi }_{c}=\frac{1}{K}\), for \(1\le c\le K\)

  2. Sample \({c}_{i} \sim {Cat}\left(\pi \right)\), for \(1\le i\le N\)

  3. Sample \({\mu }_{c,j} \sim {unif}\left(-10,10\right)\), for \(1\le c\le K\) and \(1\le j\le J\)

  4. Sample \({z}_{i}\sim {{{\mathcal{N}}}}({\mu }_{{c}_{i}},{\varSigma }_{{c}_{i}})\), for \(1\le i\le N\)

  5. Sample \({{seq}}_{i} \sim {unif}\left(0,L\right)\), for \(1\le i\le N\)

  6. Let \({g}_{{res}}\left(z\right)={reshape}\left({ReLU}\left({wz}+b\right),L\times H\right)\), where \(w\in {{\mathbb{R}}}^{{LH}\times J}\) and \(b\in {{\mathbb{R}}}^{{LH}}\) are random matrices and vectors.

  7. Let \({x}_{i}={g}_{{res}}({z}_{i})\), for \(1\le i\le N\)

  8. Let \({g}_{{att}}\left(x\right)={softmax}\left(\frac{\left({w}_{Q}x+{b}_{Q}\right){\left({w}_{K}x+{b}_{K}\right)}^{T}}{\sqrt{H}}+{mask}\right)\left({w}_{V}x+{b}_{V}\right)\), where \({w}_{Q}\), \({w}_{K}\), \({w}_{V}\) and \({b}_{Q}\), \({b}_{K}\), \({b}_{V}\) are random matrices and vectors. The mask is based on \({{seq}}_{i}\).

  9. Let \({x}_{i}={g}_{{att}}({g}_{{att}}({g}_{{att}}({x}_{i})))\), for \(1\le i\le N\)

  10. Let \({g}_{{dec}}(x)={softmax}({ReLU}({wx}+{b}))\), where \(w\in {{\mathbb{R}}}^{D\times H}\) and \(b\in {{\mathbb{R}}}^{D}\) are random matrices and vectors.

  11. Let \({x}_{i}={argmax}({g}_{{dec}}({x}_{i}))[1:{{seq}}_{i}]\), for \(1\le i\le N\)

  12. Sample \({\beta }_{c,j} \sim {unif}\left(-2.5,2.5\right)\), for \(1\le c\le K\) and \(1\le j\le J\)

  13. Sample \({u}_{i} \sim {Weibull}({softplus}({z}_{i}^{T}{\beta }_{{c}_{i}}),k)\), for \(1\le i\le N\)

  14. Sample \({\delta }_{i} \sim {Bernoulli}(1-{p}_{{cens}})\), for \(1\le i\le N\)

  15. Let \({t}_{i}={u}_{i}\) if \({\delta }_{i}=1\), and sample \({t}_{i} \sim {unif}\left(0,{u}_{i}\right)\) otherwise, for \(1\le i\le N\)

In our experiments, we fixed \(K=3\), \(N=30000\), \(J=5\), \(D=1998\) (ICD-10 category-level vocabulary), \(k=1\), \({p}_{{cens}}=0.3\), and \(L=100\). For the attention operation, we used 3 attention layers with 10 heads in each layer.
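
A simplified numpy sketch of the cluster, latent, and survival sampling steps (1–4 and 12–15) is given below; the transformer decoding steps (6–11) are omitted, and an identity covariance is assumed for the Gaussian components.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, J, k, p_cens = 3, 30_000, 5, 1.0, 0.3

c = rng.integers(0, K, size=N)                    # steps 1-2: uniform clusters
mu = rng.uniform(-10, 10, size=(K, J))            # step 3: cluster means
z = mu[c] + rng.standard_normal((N, J))           # step 4 (identity covariance)

beta = rng.uniform(-2.5, 2.5, size=(K, J))        # step 12: survival weights
lam = np.logaddexp(0, (z * beta[c]).sum(axis=1))  # softplus(z^T beta_c)
u = lam * rng.weibull(k, size=N)                  # step 13: Weibull(lam, k)
delta = rng.binomial(1, 1 - p_cens, size=N)       # step 14: censoring indicator
t = np.where(delta == 1, u, rng.uniform(0, u))    # step 15: observed times
```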

Model training and hyperparameter optimization

Hyperparameter optimization was done by a grid search on learning rate ({0.1, 0.05, 0.01, 0.005, 0.001}) and weight decay ({0.1, 0.05, 0.01, 0.005, 0.001}), using an Adam optimizer with default beta1 and beta2. For performance estimation, we split the generated data into three parts, one-third for training, one-third for validating, and one-third for testing the model. We repeated the above 5 times (i.e. for 5 randomly generated datasets) to arrive at a robust average performance estimate.

Real-world diabetes benchmark

Data extraction

To extract the data, we selected patients by the occurrence of ICD-10 codes E10 (T1D) or E11.3 (T2D) in their diagnosis trajectories and labeled patients by the presence of H36 (“Retinal disorders in diseases classified elsewhere”). Because of sample size limitations, no requirement was placed on the minimum number of occurrences of each of E10 and E11.3. In order to avoid ambiguity in the performance estimation, patients with both E10 and E11 in their disease history were excluded, which resulted in the dataset as summarized in Table 4, column 2. In the training data, all occurrences of E10, E11, and H36 (and their children) were deleted to avoid data leakage. The selected patients largely matched the validated phenotype definition66.

Model training and hyperparameter optimization

We fine-tuned VaDeSC-EHR end-to-end on the dataset described above, taking our pre-trained encoder as a starting point. Note that in the fine-tuning stage, the decoder could benefit from the pre-trained encoder, because weights were shared between the two. We used nested cross-validation (NCV)30 with a 4-fold inner loop for hyperparameter optimization and a 5-fold outer loop for performance estimation. Hyperparameters were optimized using Bayesian optimization on the following hyperparameter grid:

  • learning rate of the Adam optimizer: {1e-5, 5e-5, 1e-4, 5e-4, 1e-3},

  • weight decay parameters of the Adam optimizer: {1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1},

  • dimension of latent variables: {5, 10, 15, 20},

  • shape parameter of the Weibull distribution: {1, 2, 3, 4, 5},

  • dropout rate: {0.1, 0.2, 0.3, 0.4, 0.5},

  • number of reshape layers: {1, 2, 3, 4, 5}.

Application: progression of Crohn’s disease toward intestinal obstruction

Data extraction

To extract the Crohn’s disease (CD) patient population, we selected patients by requiring at least one occurrence of the ICD-10 code K50 (Crohn’s disease) while excluding patients with a K51 diagnosis (ulcerative colitis) and then labeled patients according to the presence of K56 (“Paralytic ileus and intestinal obstruction without hernia”). This resulted in the dataset summarized in Table 4, column 5. In the training data, all occurrences of K50 and K56 (and their children) were deleted to avoid data leakage. The selected patients largely matched the validated phenotype definition66.

Model training and hyperparameter optimization

We fine-tuned VaDeSC-EHR end-to-end on the dataset described above, taking our pre-trained encoder as a starting point. Note that in the fine-tuning stage, the decoder could benefit from the pre-trained encoder, because weights were shared between the two. We applied 5-fold cross-validation for hyperparameter optimization. Hyperparameters were optimized using Bayesian optimization, with a hyperparameter search space defined as for the diabetes model, except that we now also needed to optimize the number of clusters \(\{2,\ldots,K\}\) jointly with the other hyperparameters, because a ground-truth clustering was not available (\(K=4\) in this use case). The best combination of hyperparameters (including the number of clusters) was determined by encouraging a low Bayesian information criterion (BIC) and a high concordance index (CI) through maximizing \(\sqrt{{{CI}}^{2}+{\left(1-{{BIC}}_{{norm}}\right)}^{2}}\), where the BIC was normalized to the interval [0, 1] (Supplementary Fig. 17, Supplementary Table 3).
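
As a sketch, the selection score could be computed as follows, assuming the BIC is min–max normalized over the candidate models (the exact normalization used is not specified above; the candidate values are made up).

```python
import numpy as np

def selection_score(ci, bic):
    """Selection criterion described above: high concordance index and
    low BIC, with the BIC min-max normalized to [0, 1] over candidates."""
    bic_norm = (bic - bic.min()) / (bic.max() - bic.min())
    return np.sqrt(ci**2 + (1 - bic_norm)**2)

# Example: pick the best of three candidate models.
ci = np.array([0.88, 0.91, 0.90])
bic = np.array([1.2e4, 1.1e4, 1.3e4])
best = selection_score(ci, bic).argmax()
```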

Methods comparison and metrics

We compared VaDeSC-EHR to a range of baseline methods: variational deep survival clustering with a multilayer perceptron (VaDeSC-MLP), semi-supervised clustering (SSC)17, survival cluster analysis (SCA)18, deep survival machines (DSM)19, and recurrent neural network-based DSM (RDSM)20, as well as k-means and regularized Cox PH as naïve baselines. In addition, to assess the influence of the survival loss on the eventual clustering, we included VaDeSC-EHR_nosurv, in which the survival loss of VaDeSC-EHR was turned off. Finally, to assess the influence of absolute age on distinguishing between ground-truth clusters, we included VaDeSC-EHR_relage (with age at first diagnosis subtracted from all elements in the age sequence). We used ICD-10-based TF-IDF features as the input for all methods but RDSM and VaDeSC-EHR, which allow for directly modeling sequences of events.

We evaluated the clustering performance of the models, where possible, in terms of balanced accuracy (ACC), normalized mutual information (NMI), adjusted Rand index (ARI), and area under the receiver-operating characteristic curve (AUC). Clustering accuracy was computed by using the Hungarian algorithm to map between cluster predictions and ground-truth labels67. The statistical significance of performance differences was determined using the Mann–Whitney U test.
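
The Hungarian mapping step can be sketched with scipy’s linear_sum_assignment; the version below scores plain accuracy for brevity, whereas the balanced variant averages per-class recalls after the same mapping.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Map predicted cluster ids to ground-truth labels (both 0-indexed
    integers) with the Hungarian algorithm, then score the best mapping."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    contingency = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        contingency[p, t] += 1
    rows, cols = linear_sum_assignment(-contingency)  # maximize matched counts
    mapping = dict(zip(rows, cols))
    return np.mean([mapping[p] == t for t, p in zip(y_true, y_pred)])
```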

For the time-to-event predictions, we used the concordance index (CI) to evaluate the ability of the methods to rank patients by their event risk. Given observed survival times \({t}_{i}\), predicted risk scores \({\eta }_{i}\), and censoring indicators \({\delta }_{i}\), the concordance index was defined as

$${CI}=\frac{{\sum }_{i=1}^{N}{\sum }_{j=1}^{N}{{{{\bf{1}}}}}_{{t}_{j}{ < t}_{i}}{{{{\bf{1}}}}}_{{\eta }_{j}{ > \eta }_{i}}{\delta }_{j}}{{\sum }_{i=1}^{N}{\sum }_{j=1}^{N}{{{{\bf{1}}}}}_{{t}_{j}{ < t}_{i}}{\delta }_{j}}$$
(12)
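
Eq. (12) translates directly into a quadratic-time reference implementation (a sketch; ties in survival times or risk scores are ignored):

```python
import numpy as np

def concordance_index(t, eta, delta):
    """O(N^2) implementation of Eq. (12): among comparable pairs
    (t_j < t_i with patient j uncensored), the fraction where the earlier
    event also received the higher predicted risk score."""
    t, eta, delta = map(np.asarray, (t, eta, delta))
    num = den = 0
    n = len(t)
    for i in range(n):
        for j in range(n):
            if t[j] < t[i] and delta[j] == 1:
                den += 1
                num += int(eta[j] > eta[i])
    return num / den
```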

Visualization and enrichment analysis of clusters

The clusters were visualized using UMAP (uniform manifold approximation and projection)68 with the Jensen–Shannon divergence69 as a distance measure: \({JSD}(P(c|{x}_{i}),P(c|{x}_{j}))\), where \(P\left(c|x\right)\) is the distribution across clusters c given a patient x. Cluster cohesion was measured using the silhouette coefficient70.
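
A sketch of this projection, assuming the umap-learn package and a matrix P of posterior cluster probabilities as input; note that scipy’s jensenshannon returns the Jensen–Shannon distance (the square root of the divergence), hence the squaring below.

```python
import numpy as np
import umap  # umap-learn package
from scipy.spatial.distance import jensenshannon

def jsd_matrix(P):
    """Pairwise Jensen-Shannon divergences between rows of P, where
    P[i] = P(c | x_i) is a patient's posterior distribution over clusters."""
    n = len(P)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = jensenshannon(P[i], P[j]) ** 2
    return D

# coords = umap.UMAP(metric="precomputed").fit_transform(jsd_matrix(P))
```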

We assessed the association of the clustering with sex, age, education level, location of UKB recruitment, 4 genetic principal components, fraction of hospital care data, and overall diagnosis sequence length using a Chi-squared test or a one-way ANOVA. We calculated the differential enrichment of diagnoses between clusters in two ways: (1) for individual diagnoses, and (2) for sequences of diagnoses. For the individual diagnoses, we used the ICD-10 codes as provided by the UK Biobank. For the diagnosis sequences, we first mapped the ICD-10 codes to Phecode71 and CALIBER codes72, which provided a higher level of abstraction in defining diseases. For each patient, we then identified all, potentially gapped, subsequences of three diagnoses from the EHR data, with the following constraints: (1) for duplicate diagnoses in the diagnosis sequence, we only considered the first one, (2) the subsequence contained a diagnosis of the disease under study (CD) but did not contain the selected risk event (intestinal obstruction).
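
The subsequence extraction could be sketched as follows; the string labels are illustrative placeholders, and the real analysis operates on Phecode/CALIBER codes.

```python
from itertools import combinations

def diagnosis_subsequences(codes, disease="CD", event="OBSTRUCTION"):
    """All (potentially gapped) subsequences of three diagnoses from one
    patient's deduplicated code sequence, keeping subsequences that
    contain the disease under study but not the risk event."""
    seen, dedup = set(), []
    for code in codes:                 # constraint (1): first occurrences only
        if code not in seen:
            seen.add(code)
            dedup.append(code)
    return [
        sub for sub in combinations(dedup, 3)    # order-preserving, gaps allowed
        if disease in sub and event not in sub   # constraint (2)
    ]
```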

We assessed the statistical significance of the differential enrichment using logistic regression models predicting patient clusters from subsequence occurrence while adjusting for the effects of age, sex, PC1, recruitment location, and fraction of hospital care data by including these variables as covariates into the model. We corrected the resulting p-values for multiple testing using the Benjamini–Hochberg procedure and set a threshold at 0.05 for statistical significance.
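
A minimal sketch of one such enrichment test with Benjamini–Hochberg correction, assuming statsmodels, numeric covariate columns, and illustrative column names:

```python
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

COVARIATES = ["age", "sex", "PC1", "location", "frac_hospital"]  # illustrative

def enrichment_pvalue(df, feature):
    """p-value for one subsequence indicator predicting cluster
    membership (0/1 column "in_cluster"), adjusted for covariates."""
    X = sm.add_constant(df[[feature] + COVARIATES])
    fit = sm.Logit(df["in_cluster"], X).fit(disp=0)
    return fit.pvalues[feature]

# pvals = [enrichment_pvalue(df, f) for f in subsequence_features]
# reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```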

Analysis of smoking behavior

We analyzed the association of smoking behavior with progression toward intestinal obstruction using multivariate Cox regression models, individually testing the hazard ratio of several variables related to smoking behavior:

  1. data field 20160 (“Ever smoked”),

  2. diagnosis ICD-10 code F17.2 (“Mental and behavioral disorders due to use of tobacco: dependence syndrome”), which is commonly interpreted as nicotine dependence73,

  3. data field 20161 (“Pack years of smoking”),

  4. data field 20162 (“Pack years adult smoking as a proportion of life span exposed to smoking”),

  5. current smoking status, which we defined as the union of patients identified as current smokers from data fields 1239 (“Current tobacco smoking”) and 20116 (“Smoking status”),

  6. previous smoking status, which we defined as patients who had ever smoked but were not currently smoking. Additionally, to make sure the patients were not smoking at the time of their first CD diagnosis, we excluded patients whose assessment date was after the date of their first CD diagnosis.

The UK Biobank data fields we used contained data collected between 2006 and 2010.

We adjusted the above models for the effects of age, sex, PC1, recruitment location, fraction of hospital care data, and time difference between the CD onset and the nearest smoking diagnosis (F17.2) or assessment date, by including these variables as covariates in the model.

Genetic analysis of patient clusters

We computed pathway-based polygenic risk scores (‘pathway PRSs’ henceforth) using PRSet74 to assess genetic differences between the patient clusters, restricting ourselves to UK Biobank participants of European ancestry74. Quality control steps were performed before calculating pathway PRSs, including filtering of SNPs with genotype missingness > 0.05, minor allele frequency (MAF) < 0.01, and Hardy–Weinberg Equilibrium (HWE) p-value < 5 × 10⁻⁸. We focused on 164 biological pathways related to Crohn’s disease as retrieved from the Gene Ontology – Biological Process (GO-BP) database, selected based on a literature and keyword search (“IMMUNE”) (Table S1)75,76. We calculated pathway PRSs for each (patient, pathway) pair using variants located in exon regions. For each pathway PRS, we then fitted a logistic regression model predicting cluster from pathway PRS, while adjusting for age, sex, PC1, recruitment location, and the fraction of hospital care data. The p-value of the resulting coefficient was calculated using a log-likelihood ratio test77 and corrected for multiple testing using the Benjamini–Hochberg procedure. For determining statistical significance, we used a threshold of 0.05.

After the pathway-level analysis, we extracted the individual genetic variants contributing to the pathway PRS. We applied Chi-squared tests to identify the significant SNPs78 and corrected the resulting p-values for multiple testing using the Benjamini–Hochberg procedure. Finally, we fitted logistic regression models predicting cluster from mutation status, while adjusting for confounding by including age, sex, PC1, recruitment location, and the fraction of hospital care data as covariates in the models. We defined mutation status by dominant coding, thereby comparing no copy of the risk allele to at least one copy.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.