Introduction

The advent of electronic health records (EHRs) has revolutionized the healthcare industry by providing comprehensive, digitized patient information1. This shift has enabled healthcare providers to maintain accurate and accessible records, facilitating better patient care2. The widespread adoption of EHRs has fueled the development of deep learning models for automated clinical decision-making, offering sophisticated tools for predicting patient trajectories, identifying disease patterns, and personalizing treatments3,4.

Recently, an increasing number of deep neural networks (DNNs) based on self-supervised representation learning (SSRL) have been deployed in real-world applications5. Examples include DINOv26 and OpenCLIP7 for vision, and GPT-48 for free text. In the medical field, similar models such as MedCLIP9 and MedSAM10 have been developed, trained specifically on medical imaging and textual data. These models are trained on extensive datasets and are open-source, making them easily deployable. The representations they learn from unlabeled data are designed for versatile use, enabling application across various downstream tasks; such models are often referred to as foundation models11,12. By providing efficient learned representations, these models offer new opportunities to enhance the performance of existing models and reduce the need for large, manually annotated datasets.

Analyzing EHR data poses several challenges, including its sparsity, high dimensionality, and complex interrelationships13. EHRs consist of irregularly spaced visits over time, with each visit containing a subset of thousands of possible medical codes, along with laboratory test results, unstructured text, and images14. In this review, we focus specifically on EHR categorical data, also referred to as structured data. EHR categorical data includes medical codes such as diagnoses, procedures, medications, and laboratory test codes. Categorical data is easier to de-identify following HIPAA guidelines15, enabling faster construction of large datasets, as it is generally considered safer in terms of patient privacy than clinical free text16.

SSRL in DNNs automatically discovers and extracts features from unlabeled data11. Unlike supervised learning, which relies on labeled datasets, SSRL algorithms are trained to predict part of the data from other parts, which may be incomplete, transformed, distorted, or corrupted. Essentially, the model learns to 'recover' the whole, parts, or merely some features of its original input17. This enables SSRL to identify patterns and structures within unlabeled data, producing efficient representation vectors. These vectors, along with the trained SSRL models, can be used for clustering similar data points, enhancing data visualization, or serving as inputs for subsequent predictive models. Figure 1 illustrates the application of SSRL in clinical settings. This framework offers several advantages: it reduces the need for extensive manual labeling, generalizes across different tasks without requiring full model retraining, and often outperforms models trained from scratch on a comparable amount of labeled data. As a result, using these representations and models optimizes manpower, computational resources, and model performance11,17.

Fig. 1: Description of medical application of SSRL in the clinical settings and the flow of data.
figure 1

EHR data follow a cyclical process, beginning at health centers, where they are either used to train internal SSRL models (blue box) or directly supplied to external SSRL models (purple box). These models convert the data into efficient representations, which are then adapted to the specific downstream tasks. The results from these downstream tasks are sent back to the health centers, facilitating the delivery of effective medicine and medical knowledge discovery. Blue and orange arrows represent unsupervised and supervised learning tasks, respectively. For efficient representations, the snowflake and cluster icons indicate frozen (inference-only) and trainable (requiring high computational resources such as high-performance computing) SSRL models, respectively. The gear icon signifies the training of downstream models using moderate resources, such as multiple GPUs. The potential use of externally developed SSRL models is highlighted in purple.

Despite the progress in large models for images and text, there is still a notable absence of large models based on EHRs in real-world applications. Previous reviews, including those by Si et al.18, Amirahmadi et al.19, Oss Boll et al.20, and Hama et al.21, have covered both supervised and unsupervised methods across various data types. However, none have systematically analyzed representation learning using unlabeled EHR categorical data, covering both clustering and prediction tasks. As a result, the reader is left without a clear understanding of the current State-of-the-Art (SOTA) trends, limitations, and opportunities in this area. This review, covering studies from 2019 to 2024, addresses this critical gap by offering detailed insights into the latest SSRL methodologies for unlabeled EHR categorical data. We assess their potential applications, identify appropriate scenarios for their deployment, and evaluate the feasibility of implementation in current clinical settings. This review offers valuable guidance for future research, practical healthcare data analysis, and implementation in hospital settings. Our scoping review answers three main research questions: i) What techniques and models are used for analyzing categorical data? (see sections “Type of data”, “Data preprocessing” and “Self-supervised learning models”); ii) How can SSRL models enhance clinical decision-making? (see sections “Fields of application” and “Evaluation tasks”); and iii) What are the current trends in the research field, and how do they impact medical settings? (see section “Discussion”). The detailed research questions and the full methodology of the scoping review are described in the Methods section.

We differentiate our work from the narrative review by Wornow et al.5 by specifically addressing SSRL using unlabeled categorical data from EHRs, regardless of the changing definitions and usages of the term “foundation models”, to provide the reader with a systematic analysis and comprehensive view of the current SOTA in the field. Notably, while we overlap with only 12 of the same papers over a similar time period, we include an additional 34 studies, underscoring the broader scope of our review. We highlight several agreements with their claims and systematically clarify the areas of similarity.

This scoping review is intended for an audience comprising medical professionals, data scientists, and healthcare stakeholders such as decision-makers and hospital IT teams. By synthesizing studies from databases across these fields, we aim to bridge the gap between clinical expertise and advanced data science techniques. Considering the societal and economic impact of leveraging recent research advances in SSRL, our goal is to provide valuable insights that enhance clinical decision-making processes, encourage interdisciplinary collaboration in healthcare informatics, and assist decision-makers in effectively adapting their IT infrastructure and data management strategies.

Results

This section provides a comprehensive overview of the findings from our scoping review, organized around subsections that emerged during our analysis. We begin by outlining the characteristics of the included studies and the types of data utilized. Next, we examine the studies from a technical perspective, including data preprocessing techniques, SSRL model types, SSRL model comparison, models for downstream tasks, evaluation metrics, and interpretability techniques. Finally, we analyze the studies from a clinical perspective, focusing on the fields of clinical application, clinical downstream tasks, and the involvement of medical experts. Table 1 summarizes the key features of the technical aspect, and Table 2 provides essential information on the studies from the medical perspective.

Study characteristics

As illustrated in Fig. 2, most of the research (n = 33, 72%) was conducted by interdisciplinary teams of medical experts and data scientists. The United States led in the number of published studies (n = 21, 46%), followed by China (n = 9, 20%) and the United Kingdom (n = 4, 9%). Despite this geographic diversity, only a few studies (n = 11, 24%) involved international collaborations. For details on the authors and research teams, refer to Supplementary Data 2.

Fig. 2: Meta-data from reviewed studies.
figure 2

a Composition of authors, categorized into two groups: those specializing in data science only, and those with expertise in both data science and medical fields. b Annual distribution of published studies from 2019 to 2024, categorized by continent.

Type of model and trend

Five main model types have been identified for representing EHR categorical data: Transformer-based models (n = 20, 43%), Autoencoder (AE) based models (n = 13, 28%), Graph Neural Network (GNN) based models (n = 8, 17%), Word-embedding models (n = 3, 7%), and Recurrent Neural Network (RNN) based models (n = 3, 7%). Studies that combine two or more model types are counted once for each corresponding model type. To assess their impact on research, we analyzed the number of citations for each model type.

Figure 3 shows the papers published from January 2019 to December 2023, their citation counts by July 2024, and their corresponding model types. Based on the number of citations, Transformers, RNN, and GNN models are the most impactful, with Transformer models showing particularly high citation counts for papers published from 2020 to 2023.

Fig. 3: Number of citations for each study published from 2019 to 2023.
figure 3

Each data point represents a paper, labeled with its corresponding reference and color-coded by the model type used: Transformer, AE, GNN, Word-embedding, RNN, and others. Papers published in 2024 are not shown.

Type of data

Studies utilize various data types to represent patients and medical knowledge. Typically, patient representation is derived from EHRs, incorporating both categorical and non-categorical data. Additionally, external medical knowledge can be integrated into models through data collected beyond EHRs. For detailed information on the modalities used across studies, see Supplementary Data 3.

Among the categorical data types in EHRs, diagnosis codes are the most frequently used (n = 45, 98%), including ICD-9, ICD-10-CM, and SNOMED-CT, followed by medication codes (n = 32, 70%), such as ATC and SNOMED-CT, and procedure codes (n = 20, 43%), such as CPT and ICD-10-PCS. To enhance patient representation, non-categorical data may also be included. The most common non-categorical data types are patient age (n = 19, 41%), clinical measurement values (n = 15, 33%) such as BMI, heart rate, and systolic blood pressure, and clinical narratives from physicians and practitioners (n = 7, 15%).

The integration of external data sources can further enrich patient profiles. Medical knowledge graphs and ontologies provide rich hierarchical information, while medical text corpora contain expert medical knowledge. These external sources offer a comprehensive understanding of clinical concept interactions. Among external data sources, ontologies are the most used (n = 7, 15%); they are employed to obtain medical concept embeddings22,23,24,25,26,27,28 and for SSRL training tasks23. Other significant external data sources include medical knowledge graphs25,29 and medical text corpora30.

Data preprocessing

Most models treat each data element as a distinct unit or token (n = 44, 95%). The identified data preprocessing techniques address various aspects such as numerical data, categorical data, data cleaning, and data shuffling. Some studies (n = 7, 15%) performed categorization by converting exact ages into intervals and clinical measurements into categories like high, normal, and low, based on clinical evaluation standards31,32,33,34,35,36,37. When maintaining the numerical nature of data, missing value imputation30,38,39 and value normalization31,39,40,41 have also been employed.
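For illustration, the categorization step above can be sketched in a few lines of pandas; the bin edges and token names below are purely illustrative stand-ins for the clinically derived cut-offs used in the cited studies.

```python
import pandas as pd

def categorize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Convert exact ages into interval tokens.
    out["age_token"] = pd.cut(
        out["age"], bins=[0, 18, 40, 60, 80, 120],
        labels=["AGE_0-18", "AGE_18-40", "AGE_40-60", "AGE_60-80", "AGE_80+"],
    )
    # Convert a clinical measurement (here systolic blood pressure, mmHg)
    # into low/normal/high tokens; thresholds are illustrative only.
    out["sbp_token"] = pd.cut(
        out["sbp"], bins=[0, 90, 140, 300],
        labels=["SBP_LOW", "SBP_NORMAL", "SBP_HIGH"],
    )
    return out

print(categorize(pd.DataFrame({"age": [34, 71], "sbp": [128, 165]})))
```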

Some studies standardize data elements by mapping them to known ontologies23,36,42,43. A common approach to reduce dimensionality and data sparsity is using only the first digits of codes, effectively replacing them with parent nodes in the hierarchical ontology (n = 15, 33%).
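As a minimal sketch of this truncation, the hypothetical helper below keeps only the leading characters of an ICD-9 code, effectively replacing it with its parent node in the hierarchy.

```python
def truncate_icd9(code: str, level: int = 3) -> str:
    """Map an ICD-9 code to its parent category, e.g. '250.13' -> '250'."""
    return code.split(".")[0][:level]

codes = ["250.13", "250.40", "401.9", "V58.69"]
print(sorted({truncate_icd9(c) for c in codes}))  # ['250', '401', 'V58']
```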

In terms of data cleaning, typical practices include the removal of rare medical terms14,32,37,42,44,45 and the elimination of duplicated terms within a specific time range22,42,46,47. Additionally, shuffling the order of medical concepts within a time window33,47 was shown to help the model to generalize better, by mitigating the impact of arbitrary sequencing and emphasizing the importance of co-occurrence over specific order. This method can also be considered as a form of data augmentation. Detailed information on data preprocessing across studies can be found in Supplementary Data 4.
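The cleaning and shuffling steps above can be sketched as follows, assuming each visit window is a list of code strings; the function is an illustrative stand-in rather than code from any reviewed study.

```python
import random

def dedup_and_shuffle(visits, seed=0):
    """Drop duplicated codes within each visit window, then shuffle the
    remaining codes -- a light augmentation that emphasizes co-occurrence
    over arbitrary within-window ordering."""
    rng = random.Random(seed)
    cleaned = []
    for visit in visits:
        unique = list(dict.fromkeys(visit))  # order-preserving deduplication
        rng.shuffle(unique)
        cleaned.append(unique)
    return cleaned

print(dedup_and_shuffle([["I10", "E11", "I10"], ["J45"]]))
```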

Self-supervised learning models

There are two primary self-supervised learning training strategies: generative and contrastive. Generative tasks involve models predicting parts of the data from other parts, which may be incomplete, transformed, masked, or corrupted. These tasks, such as autoregressive prediction and masked modeling, help the model learn to recover whole or partial features of its original input17,48. Contrastive tasks, on the other hand, focus on distinguishing between similar and dissimilar data points, helping the model capture discriminative features that are essential for understanding different types of data48. Both task types are crucial for training models to generate rich, generalized representations from unlabeled data48,49, and they are applied across various model architectures. The objective of these models is to capture essential patterns and features in the data and output a learned representation, typically a fixed-length, high-dimensional vector that condenses large amounts of information. Five major architecture types have been identified in the studies, each trained on unlabeled data with different training tasks. Details of the SSRL models used and the temporality monitored in each study are provided in Supplementary Data 5.
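To make the two strategies concrete, the sketch below shows the typical shape of a generative (masked-modeling) loss and a contrastive (InfoNCE-style) loss in PyTorch; it is schematic and does not reproduce the objective of any specific reviewed model.

```python
import torch
import torch.nn.functional as F

def masked_modeling_loss(logits, targets, mask):
    """Generative objective: predict the identity of masked codes.
    logits: (batch, seq, vocab); targets: (batch, seq); mask: (batch, seq) bool."""
    return F.cross_entropy(logits[mask], targets[mask])

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive objective: two views of the same patient (rows of z1 and z2)
    are pulled together; other patients in the batch are pushed apart."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0))  # positive pairs lie on the diagonal
    return F.cross_entropy(sim, labels)
```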

Transformer-based models are among the most impactful model types in the studies. In the medical domain, most transformer-based models treat patients as documents, visits as sentences, and medical concepts as tokens, capturing detailed patient histories. BERT50 is an encoder-only transformer model that effectively learns data representations by processing and contextualizing complex sequences of information. BERT models can be trained using various techniques, such as training with only the Masked Language Model (MLM) objective, predicting randomly masked medical concepts in each EHR sequence34,43,44,51,52,53, which enhances contextual understanding. Training with both MLM and auxiliary tasks13,22,39,54,55,56 further refines the model’s representations by guiding it with specific medical insights. Additionally, self-contrastive learning techniques help improve BERT’s robustness and accuracy in capturing meaningful patterns in medical data30,35. Other transformer-based training tasks include next visit code prediction23,36,45,57, medical code category prediction23, medication-diagnosis cross prediction26, and token replacement detection (ELECTRA)58.
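As an illustration of MLM on EHR sequences, the sketch below applies BERT-style masking to a batch of medical-code token IDs (about 15% of positions selected; of these, 80% replaced by a mask token, 10% by a random code, 10% left unchanged); the exact ratios vary across the cited studies.

```python
import torch

def mask_codes(token_ids, mask_id, vocab_size, p=0.15):
    """Return (masked input, targets); targets are -100 at unselected
    positions so a cross-entropy loss ignores them."""
    token_ids = token_ids.clone()
    selected = torch.rand(token_ids.shape) < p
    targets = torch.where(selected, token_ids, torch.full_like(token_ids, -100))
    decide = torch.rand(token_ids.shape)
    token_ids[selected & (decide < 0.8)] = mask_id          # 80%: [MASK] token
    replace = selected & (decide >= 0.8) & (decide < 0.9)   # 10%: random code
    token_ids[replace] = torch.randint(vocab_size, token_ids.shape)[replace]
    return token_ids, targets

ids = torch.randint(5, 1000, (2, 16))   # toy batch of code sequences
masked, targets = mask_codes(ids, mask_id=1, vocab_size=1000)
```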

AE-based models are encoder-decoder models that aim to reconstruct the input, enabling the learning of data representations in a compressed, lower-dimensional space. AEs are designed to learn the most salient features of the data, which can be particularly useful for capturing the underlying structure of categorical EHR data. Various AE variants were applied in the studies: the Stacked Autoencoder32,59, the Denoising Autoencoder60, and Autoencoders with RNN units such as GRU31 and LSTM38,41,61,62,63. Additionally, AEs can be combined with other models such as collective matrix factorization29, CNNs42, and clustering algorithms27,64.
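A minimal AE sketch on multi-hot visit vectors, assuming PyTorch; the vocabulary size, layer widths, and embedding dimension are illustrative.

```python
import torch
import torch.nn as nn

class VisitAutoencoder(nn.Module):
    """Compress a multi-hot vector of medical codes into a low-dimensional
    representation, then reconstruct the input from it."""
    def __init__(self, n_codes: int, dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_codes, 256), nn.ReLU(),
                                     nn.Linear(256, dim))
        self.decoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_codes))

    def forward(self, x):
        z = self.encoder(x)          # the learned visit/patient representation
        return self.decoder(z), z

model = VisitAutoencoder(n_codes=5000)
x = torch.bernoulli(torch.full((8, 5000), 0.01))   # toy multi-hot visits
recon_logits, z = model(x)
loss = nn.functional.binary_cross_entropy_with_logits(recon_logits, x)
```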

GNN-based models use graph learning to represent medical ontologies, hospital visits, and disease co-occurrence. Nodes represent medical concepts and personal entities, linked by edges indicating their relationships. Graph attention models were used to learn medical concept embeddings within medical ontologies22,26, with these embeddings frequently serving as initializations for further model training. A random-walk technique is used to embed doctors according to their specialty65. Graph contrastive learning25,28 generates multiple views of augmented hospital visit graphs by modifying the original graph with node or edge perturbations, allowing the model to learn robust representations by contrasting positive pairs against negative pairs. These approaches ensure that the learned embeddings accurately reflect the complex relationships inherent in medical data49.

Word-embedding-based models convert words into numerical vectors, allowing computers to capture their meanings and relationships from their context in a sequence of words. The model learns to map each word or concept to a dense vector representation, capturing semantic similarities based on co-occurrence patterns. Patient EHR data, composed of a sequence of medical concepts ordered by time, are used to train the representation model to predict medical concepts from their surrounding context, helping the model understand relationships between concepts. Various algorithms were identified, such as GloVe46, Word2vec33,46,47, and FastText46.
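This setup maps directly onto off-the-shelf implementations; the sketch below uses gensim's Word2Vec, treating each patient's time-ordered code sequence as a "sentence" (the toy sequences and hyperparameters are illustrative).

```python
from gensim.models import Word2Vec

# Each "sentence" is one patient's time-ordered sequence of medical codes.
patients = [
    ["E11", "I10", "C34", "I10"],
    ["J45", "J44", "E11"],
]
model = Word2Vec(sentences=patients, vector_size=100, window=5,
                 min_count=1, sg=1)       # sg=1: skip-gram
vector = model.wv["E11"]                  # dense embedding of a diagnosis code
neighbors = model.wv.most_similar("E11")  # codes seen in similar contexts
```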

RNN-based models are designed to capture temporal dependencies in sequential data, making them well-suited for tasks involving time-series EHR data. These models are trained with the objective of predicting future medical events based on a patient’s historical data. Studies14,36,37 use a specific type of RNN, the GRU. The models were trained to predict the set of medical codes on day t based on the medical codes of previous days. To better capture temporality, these studies also included time-gap information in the input.
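A minimal PyTorch sketch of this objective, assuming multi-hot daily code vectors with a time-gap feature appended; the dimensions and gap encoding are illustrative.

```python
import torch
import torch.nn as nn

N_CODES = 2000

class NextDayCodes(nn.Module):
    """The GRU state after day t is used to predict the multi-hot set of
    codes recorded on day t+1."""
    def __init__(self, n_codes: int, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(n_codes + 1, hidden, batch_first=True)  # +1: time gap
        self.head = nn.Linear(hidden, n_codes)

    def forward(self, x):            # x: (batch, days, n_codes + 1)
        h, _ = self.gru(x)
        return self.head(h)          # logits: (batch, days, n_codes)

model = NextDayCodes(N_CODES)
codes = torch.bernoulli(torch.full((4, 10, N_CODES), 0.01))  # toy daily codes
gaps = torch.rand(4, 10, 1)                                  # days since last record
logits = model(torch.cat([codes, gaps], dim=-1))
# Shift targets by one day: predict day t+1 from the state after day t.
loss = nn.functional.binary_cross_entropy_with_logits(logits[:, :-1], codes[:, 1:])
```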

SSRL models comparison

Different self-supervised representation learning models offer unique advantages and face specific limitations. The choice of models depends on several factors, including the size of the available dataset, the importance of temporal modeling for the downstream tasks, and the computational resources available at the institution.

AEs excel at dimensionality reduction66 and are well-suited to moderately sized datasets (average size: 166k in the included studies). However, they struggle with high-sparsity data67 and cannot inherently model temporal dependencies without incorporating sequential components, such as RNNs, CNNs, and Transformers.

Word embedding models are designed to map medical concepts or tokens into dense vector spaces that capture contextual information and syntactic relationships in the data68. They perform well with a moderate dataset (average size: 139k in the included studies). However, traditional word embeddings are static and fail to account for the temporality or the sequential order of the input data, necessitating their integration with sequential components.

GNNs perform well with small to moderate datasets (average size: 55k in the included studies) and are particularly effective at representing relational data, such as knowledge graphs, patient networks, and ontologies69. They offer strong interpretability by visualizing relational data, aligning with clinical knowledge. However, GNNs alone cannot fully address temporal dependencies, necessitating their integration with sequential components.

RNNs70 are well-suited for larger datasets (average size: 1.8 M in the included studies) and excel at capturing temporal patterns in sequential data. However, their training process is not parallelizable, leading to time inefficiencies71.

Transformers dominate SSRL research due to their ability to simultaneously capture long-range dependencies and temporal patterns72, offering scalability for large datasets and robust performance across diverse tasks. However, training these models from scratch necessitates substantial amounts of data (average size: 3 M in the included studies), and their high computational cost and complexity can pose significant challenges for deployment in resource-limited settings73.

Downstream task models

Predictive models for classification are used with the trained SSRL model as their backbone, to which a specific classification head is added. These predictive models require labeled data for training on specific tasks. Among the articles that mention the predictive models used for classification tasks, different model types have been identified. These models are predominantly characterized by simple architectures that are easy to train. Some studies employ simple models such as a linear layer23,39,44,57, logistic regression (LR) (n = 8, 17%), and support vector machines (SVM)31,74. Models that can capture more complex data patterns, such as feedforward neural networks (n = 12, 26%) and RNNs13,40,54,55,62,65 (n = 6, 13%), are also applied.
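As a concrete instance of this pattern, the sketch below fits a logistic regression head on embedding vectors produced by a frozen SSRL model, using random stand-ins for the embeddings and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))    # patient representation vectors (stand-ins)
y = rng.integers(0, 2, size=500)   # binary labels, e.g. 30-day readmission

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```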

Clustering and visualization models take as input the embedding vectors generated by trained representation learning models. We identified several techniques employed across the literature: t-distributed Stochastic Neighbor Embedding (t-SNE) emerged as the most frequently used method for data representation visualization and cluster interpretation (n = 12, 26%), while K-means33,38,47,62 was the most common clustering method.
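A minimal scikit-learn sketch of this pipeline, using random vectors as stand-ins for the embeddings produced by a trained SSRL model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 64))     # stand-in embedding vectors

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Z)
Z_2d = TSNE(n_components=2, random_state=0).fit_transform(Z)  # 2-D map for plotting
```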

Evaluation metrics

The evaluation of these tasks is primarily categorized into classification and clustering assessments, each employing different metrics to measure performance.

Most classification tasks were binary. The most frequently used classification metric was AUROC (n = 21, 46%), followed by AUPRC (n = 14, 30%), accuracy (n = 10, 22%), and F1 (n = 9, 20%); other metrics such as precision (n = 6, 13%) and sensitivity (n = 5, 11%) were used less frequently. A few studies evaluated multi-class classification tasks, reporting metrics such as average precision51,74, precision at k44,45, macro-F129,65, and weighted F124,29.

For clustering tasks, despite the prevalence of clustering studies, only a few employed specific clustering analysis metrics. Silhouette analysis (n = 4, 9%) was the most frequently used metric, followed by the Davies-Bouldin index33,41 (n = 2, 4%) and the purity score42,64 (n = 2, 4%).
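Both of these metrics are available in scikit-learn; the sketch below computes them on toy embeddings (a higher silhouette and a lower Davies-Bouldin index indicate better-separated clusters).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 64))     # stand-in embedding vectors
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Z)

print("Silhouette (higher is better):", silhouette_score(Z, labels))
print("Davies-Bouldin (lower is better):", davies_bouldin_score(Z, labels))
```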

Interpretability

Interpretability in machine learning is defined as the extraction of relevant knowledge from a machine-learning model concerning relationships either contained in data or learned by the model75. Attention weight analysis was used in several studies (n = 6, 13%). Statistical analysis of the clusters was employed in some papers (n = 3, 6%). For post-hoc interpretability, methods such as Integrated Gradients13 and gradient-based saliency45 were utilized. Most of the papers interpreted their results using visualizations computed by t-SNE (n = 12, 26%) and Uniform Manifold Approximation and Projection (UMAP) (n = 3, 6%). Ten papers involved medical expert interpretation. Overall, only two papers attempted post-hoc interpretability methods on trained models. Refer to Supplementary Data 7 for detailed information on the interpretability methods used in the studies.
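For reference, post-hoc attribution such as Integrated Gradients can be applied with the Captum library; the sketch below uses a hypothetical two-layer risk model over embedding inputs, not a model from the reviewed studies.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Hypothetical risk model on top of 64-dimensional representation vectors.
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
model.eval()

forward = lambda x: model(x).squeeze(-1)      # one scalar score per example
ig = IntegratedGradients(forward)

inputs = torch.randn(1, 64)
attributions = ig.attribute(inputs, baselines=torch.zeros_like(inputs))
# attributions[0, i] estimates how much input dimension i moved the score.
```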

Fields of application

Our scoping review identified various tasks across the articles. These tasks were distributed across various clinical domains, with Cardiology24,31,32,34,35,40,41,43,53,54,55,56,60,61,74 (n = 15, 33%), General & multiple diseases (n = 11, 24%), and Neurology & Psychiatry and Primary Care (n = 9, 20%) being the most frequently studied areas. Oncology (n = 6, 13%) followed, while Infectious Diseases39,40,47,61, Endocrinology35,38,42,60, and Respiratory13,23,32,64 each had 4 downstream tasks (n = 4, 9%). Gastroenterology40,42 and Nephrology27,35 had the lowest number of downstream tasks (n = 2, 4%). A detailed overview of the clinical events and their corresponding clinical domain mapping can be found in Supplementary Data 1.

Evaluation tasks

Upon training, deep learning models develop an intrinsic representation of the data, which can be general, supporting multiple tasks, or task-specific, focusing on a single task or a few similar tasks. Representation quality is evaluated on various clinical tasks, including predictive tasks and patient phenotyping. For detailed information on the evaluation tasks in the studies, see Supplementary Data 7.

Among the 73 predictive tasks, the primary focus was on disease prediction (n = 27, 59%), followed by mortality prediction (n = 11, 24%), readmission prediction14,26,28,32,36,53,55,65,76 (n = 9, 20%), hospitalization (n = 5, 11%), and length of stay prediction (n = 4, 9%). In addition to these, other tasks included medication recommendations22,26,40 (n = 3, 7%), ICD coding56, doctor recommendations65, ICU transfers14, emergency department visits63, and high medical resource utilization63.

Beyond predictive modeling, patient phenotyping plays a crucial role in understanding patient populations. Of the 33 patient phenotyping tasks, clustering was primarily used for visualization (n = 15, 33%), patient similarity assessment (n = 8, 24%), characterization of clusters (n = 3, 9%), patient subtyping (n = 2, 6%), and patient stratification (n = 1, 3%).

Medical expert involvement

Medical experts were involved across different stages of the studies, with varying degrees of participation. Among the reviewed publications, expert participation was most prominent in study design (n = 14, 30%) and result interpretation (n = 14, 30%). Feature selection also saw substantial expert input (n = 10, 22%), while dataset extraction had more limited expert participation (n = 4, 9%).

Discussion

The most employed SSRL model types include Transformer-based, AE-based, and GNN-based architectures. These models, often referred to as foundation models, are trained to reconstruct or predict corrupted portions of input data11. The core strength of SSRL is the ability to construct a vectorized database, where clinical data, such as patient or encounter information, is embedded directly into low-dimensional representation embeddings. These embeddings can be easily retrieved and used for various medical ML research and applications, such as predictive modeling, personalized medicine, and disease prognosis, as shown in Table 2. To train such SSRL models, it is advised to use a broad patient cohort and then transfer the learned information from the entire patient population to specific models relevant to a subset of the population13,14. The average unlabeled dataset size used for training SSRL models is 1.3 million data elements, compared to 96k data elements for labeled datasets used in downstream tasks; see Table 1 and Supplementary Data 6 for detailed information on SSRL training cohort selection, types of cohorts, and cohort size. This comprehensive data exposure enhances the models’ ability to learn underlying medical knowledge, thus improving predictive performance on specific patient subsets and even generalizing to external datasets23.

Table 1 Technical overview of studies

Labeling EHR data is manpower-intensive and time-consuming. SSRL models, especially those designed as general-purpose foundation models, streamline the development process by eliminating the need for labeled data or task-specific training5. As shown in Table 2, over half of the studies (n = 24, 52%) focused on general-purpose SSRL models, which, once trained, can be reused across various end tasks, in contrast to the task-specific nature of supervised learning models. For clinical downstream tasks, integrating SSRL models improves predictive performance. Studies have shown that SSRL models, when trained on large unlabeled datasets, can improve predictive performance in fine-tuned settings, often requiring less labeled data than traditional supervised models35,55.

Table 2 Clinical overview of studies

Despite these advantages, several challenges persist in the current research landscape, encompassing data, modeling, and real-world application, as illustrated in Fig. 4. One of the primary concerns is data. Owing to differences in clinical practice and economic factors, datasets collected from different regions may differ substantially, a phenomenon known as data shift. Most studies (n = 26, 57%) rely solely on private data collected from their medical sites. Only a few studies demonstrate model generalizability and transparency using public datasets (n = 11, 24%) or a combination of public and private datasets (n = 9, 20%). Additionally, there is a notable lack of public EHR dataset resources. The most frequently used datasets are MIMIC-III, MIMIC-IV, and eICU, which focus on intensive care data, whereas public datasets for general wards are lacking. See Table 1 and Supplementary Data 2 for the datasets used and their availability information.

Fig. 4: Summarization of limitations in current research.
figure 4

Key challenges across three stages: Data, Modeling, and Application. Data-related issues include transparency, strict cohort selection, preprocessing-related information loss, and lack of interoperability. In modeling, challenges stem from uncertainty in model superiority and poor interpretability. Application limitations include costly deployment, inadequate evaluation metrics, lack of generalizability, and interoperability barriers.

Cohort selection introduces further challenges, often leading to selection bias. Rigorous cohort selection criteria, while ensuring data relevance and quality, can result in unrepresentative patient samples, thus affecting the generalizability of the findings14. Studies frequently exclude patients based on the number of visits, medical codes, age range, and specific medical conditions, leading to cohorts that may not reflect the realistic patient population and often include more severe cases22,23.

Furthermore, expert knowledge is rarely integrated into dataset construction. Only 9% of studies reported using domain experts in defining patient cohorts. This lack of clinical input raises concerns about whether datasets reflect real-world patient diversity and clinical complexity.

Data oversimplification is a common practice, where numerical data is categorized and medical codes are truncated, which, while reducing input data dimensionality, introduces significant information loss and potential biases77. For example, reducing ICD-9 codes to their first three digits decreases the number of concepts from 9285 to 113152, resulting in a loss of granularity and potentially important clinical details. Detailed information on the impact of preprocessing on the number of features across studies can be found in Supplementary Data 4.

Finally, the choice of coding systems, particularly the use of ICD for EHR analysis, raises concerns. ICD coding is often influenced by billing requirements rather than clinical accuracy, leading to potential biases. Additionally, since there is no unique mapping of a physician’s diagnosis to a coding scheme such as ICD, there is a tendency to select the code that delivers the greatest economic benefit from among several possible codes13. ICD-9, despite being the most frequently used ontology in these studies, has limited clinical relevance, as it does not cover all health conditions64. Moreover, variations in ICD coding across countries complicate transfer learning and hinder the development of universally applicable models61.

From a modeling perspective, most studies were evaluated using predictive tasks, typically comparing their model performance to classic end-to-end machine learning algorithms such as RNN, LR, SVM, and MLP, as shown in Fig. 5. On the one hand, this demonstrates the superiority of SSRL frameworks over classic supervised learning baseline models. On the other hand, it reveals a lack of direct comparison between different SSRL models. Additionally, the variety of clinical predictive tasks and datasets used makes it challenging to determine which model is optimal for a given task. Nonetheless, we observed that recent studies increasingly benchmark their models against other SSRL frameworks.

Fig. 5: Comparison of performance between different models.
figure 5

SSRL models included in the review (yellow) and supervised learning models (green). The size of the yellow dots scales with the number of citations. Arrows indicate the comparison direction, pointing from the comparator to the comparand.

Another limitation of the modeling is the lack of interpretability. Deep learning models are often considered black boxes and can suffer from hallucinations. In real-world medical applications, clinical reasoning and model interpretation are crucial for providing justifiable guidance in decision-making11. While most studies attempt to interpret the model outcomes based on attention weights, visual evaluation of clusters with t-SNE, and manual inspection, these interpretation methods can be subjective78,79. Only a few articles perform formal post-hoc interpretation. The lack of unbiased interpretation may reduce the credibility of the findings.

Beyond data and modeling concerns, the real-world adoption of SSRL models faces significant hurdles. First, the evaluation metrics commonly used in research are often technically focused and may not align with specific clinical needs, as noted by Wornow et al.5. Medical datasets frequently exhibit marked class imbalance, with a much higher prevalence of healthy cases than disease cases. In such scenarios, achieving high sensitivity is often more critical than high specificity, as missing a true positive case can lead to severe consequences. Relying on conventional data science metrics could therefore produce unforeseen outcomes in clinical practice. Second, despite the potential advantages of transfer learning, SSRL models are typically data-intensive, making them difficult to train in environments with limited data, such as small hospitals. To benefit from state-of-the-art models, these institutions would need access to pre-trained models built on large external datasets. However, shareable pre-trained models are often unavailable due to data security concerns, and even when they are available, evidence of generalizability is scarce: as shown in Table 2, only nine studies have demonstrated the effectiveness of pre-trained models on external clinical datasets, leaving the generalizability of transfer learning across diverse populations and clinical environments largely unproven and in need of further research. Furthermore, such models may not be easily usable due to incompatibilities with different EHR coding systems, leading to interoperability challenges23,43,80. Thieme et al.81 explored challenges and provided recommendations based on real-world implementation experiences, but their work does not specifically address SSRL deployment. To date, no comprehensive study has examined these factors holistically, and none of the papers included in our review explicitly offers actionable recommendations for SSRL implementation.

The adoption of SSRL models in clinical settings depends on the specific characteristics of the dataset, the availability of annotated data, the nature of the downstream tasks, and the computational resources. In clinical settings with low-cardinality datasets and abundant annotated data for specific tasks, traditional supervised learning approaches may remain more effective than SSRL. However, in resource-limited clinical settings, where data and computational resources are scarce, SSRL models can present an efficient alternative, depending on specific task requirements and data characteristics. Autoencoders, GNNs, and word-embedding models efficiently learn compact representations when temporality is not a concern. In cases where the sequential order of medical history is critical and no pretrained model is available, RNNs are a viable option. If a pretrained transformer model is accessible, it is generally the preferred choice due to its ability to leverage rich, contextualized representations from pretraining. Techniques such as inference with a frozen architecture31,32,37,82, fine-tuning83, domain adaptation43,80, prompt engineering84, and continual pretraining80 can significantly reduce the need for extensive computational resources and annotated data, even when the data distribution differs from the pretraining dataset.
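The cheapest of these options, inference with a frozen architecture, amounts to disabling gradients on the pretrained backbone and training only a small task head, as in the PyTorch sketch below (the encoder is a stand-in, not an actual pretrained SSRL model).

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Disable gradients so the backbone is used for inference only."""
    for p in module.parameters():
        p.requires_grad = False

pretrained_encoder = nn.Linear(5000, 128)   # stand-in for a pretrained backbone
freeze(pretrained_encoder)

head = nn.Linear(128, 2)                    # small task-specific head
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)  # train the head only
```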

As SSRL models continue to evolve, researchers are exploring ways to improve their generalizability. One emerging trend is the development of publicly shared foundation models (FMs), inspired by advancements in Natural Language Processing (NLP). These models, trained on diverse datasets, have the potential to improve knowledge transfer across different clinical settings23,85. Such FMs could be particularly beneficial for smaller healthcare institutions with limited private data and computational resources39. For example, collaborative studies between United States and Austrian hospitals13, the application of models trained on adult data to pediatric cases37, and the transfer of models from EHR to insurance data44 demonstrate the versatility of FMs. A recent multi-center study highlighted the adaptability of a shared foundation model80, showing that continual training on local data required fewer than 1% of training examples to match the performance of fully trained gradient boosting machines (GBMs). This approach was 60% to 90% more sample-efficient than training a local FM from scratch, underscoring its feasibility for resource-limited settings. These advancements highlight the potential of FMs to bridge the gap between research and real-world clinical applications, enabling resource-limited institutions to leverage state-of-the-art models without extensive local data or computational infrastructure.

For institutions with access to extensive data and computational resources, pretraining a model from scratch offers significant advantages. For example, the Med-BERT54 model, pretrained on data from 28 million patients, exemplifies the potential of SSRL in such scenarios. A pretraining phase lasting approximately one week on a high-performance GPU, costing approximately $11,000 (see Supplementary Data 8), can produce a robust representation model capable of understanding and predicting complex health outcomes. This approach has demonstrated improved performance on predictive tasks and transferability across various clinical datasets24. Pretraining from scratch is particularly beneficial when existing pretrained models do not align well with the institution’s data or task requirements. By leveraging their vast datasets, institutions can create highly customized models that outperform generic, off-the-shelf solutions. This makes in-house development a cost-effective and scalable strategy for large healthcare organizations aiming to harness the full potential of their data.

Given these challenges, future research should focus on three key areas: (1) improving data availability and standardization, (2) developing better benchmarking practices, and (3) fostering multi-institutional collaboration to enhance model generalizability.

Expanding and sharing public datasets is essential to improve the generalizability and robustness of SSRL models. Increasing the availability of public EHR datasets that cover a broader spectrum of medical care beyond intensive care units is crucial. Collaborative efforts among medical institutions, government agencies, and research organizations can facilitate this expansion. Additionally, establishing data-sharing agreements and frameworks that address privacy and security concerns will enhance model generalizability and transparency. Incorporating data from diverse populations and clinical settings will make models robust to data shifts43,56,86, enabling their application in small hospitals with domain adaptations. Medical data standardization is another key factor, as it improves interoperability across institutions. Recently, Guo et al.80 proposed using the widely adopted Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) to standardize data integration, and Ruth et al.87 demonstrated that unifying different medical vocabularies into a cohesive knowledge graph significantly enhances the integration and generalizability of clinical AI models. Revisiting coding systems to ensure clinical relevance and consistency across regions, for example by adopting or developing comprehensive ontologies or knowledge graphs like SNOMED-CT88 or PheKG87, is also recommended. Researchers should also avoid excessive data simplification and categorization by exploring advanced techniques for handling high-dimensional data without significant information loss. This will ensure that critical clinical details are preserved, improving the accuracy and applicability of SSRL models in real-world settings.

Benchmarking different SSRL models using standardized clinical predictive tasks and datasets is another critical area for future research85. Enhancing interpretability is also critical. Developing transparent models or robust post-hoc interpretation methods, such as model-agnostic interpretability techniques and explainable AI (XAI) frameworks, will make models more clinically useful89. Recently, Self-Explainable Models (SEMs) have shown strong explainability in medical applications by proposing meaningful concepts to the user while maintaining strong performance90. Collaborating with clinicians to interpret outcomes and validate findings will enhance credibility and relevance. Furthermore, demonstrating that models can be adapted with minimal effort using advanced techniques like Low-Rank Adaptation (LoRA)91 and Retrieval-Augmented Generation (RAG)92 will make these models more adaptable and practical for real-world applications13.

Extensive validation across diverse populations and clinical settings is necessary to ensure the real-world applicability of SSRL models. Models must be tested on external datasets to prove their generalizability80. Additionally, adopting evaluation metrics that reflect clinical outcomes and practical utility, rather than relying solely on technical performance metrics, is also essential. Engaging clinicians in the evaluation process will ensure the metrics used are aligned with clinical needs and that the models provide actionable insights. Finally, evaluating the information loss during preprocessing should be a priority, with preprocessing methods adapted to the precise clinical use to preserve critical information.

Multicenter collaboration is another priority for advancing SSRL models in healthcare. Most studies in the EHR domain rely predominantly on private datasets, which, despite their size, represent only a small fraction of global patient data85. This lack of diversity limits the generalizability of trained models. To address this limitation, multicenter collaboration should be encouraged. Such collaborations are particularly beneficial for tasks that are less dependent on local institutions’ clinical practices93, such as discovering relationships among disease patterns87, identifying genetic factors94, and analyzing rare diseases86.

Collaboration can take various forms, including data-sharing initiatives95 and federated learning (FL). Data-sharing initiatives, which leverage extensive and diverse datasets across institutions, can improve model robustness and applicability. However, achieving data-sharing requires addressing data privacy concerns and building trust in AI systems through transparent and ethical guidelines. In contrast, FL allows institutions to train models without directly exchanging data, ensuring privacy and security86,96. This approach is particularly valuable in healthcare, where data sensitivity is a major concern.

Despite these advantages, multicenter collaboration faces several challenges. First, multicenter data has high heterogeneity, which requires harmonizing medical vocabularies and standardizing data models87,97 to ensure interoperability across institutions. Second, data heterogeneity may result in a globally optimal solution that is not optimal for an individual local participant. To address this issue, some authors propose that an agreed definition of model training optimality should be established among all participants before the collaboration86. However, this potential limitation was not addressed in the analyzed studies, and further evidence is needed to assess the use cases for which multicentric federated models provide a clear advantage.

Methods

Study design and search strategy

We conducted a scoping review following the PRISMA extension for Scoping Reviews (PRISMA-ScR) guidelines98. To encompass both healthcare and engineering perspectives, we systematically searched five electronic databases: PubMed, MEDLINE, Embase, ACM, and Web of Science. The search was limited to papers published between January 2019 and April 2024.

Our search strategy was designed to identify studies meeting three criteria: (1) utilization of deep learning or neural networks, (2) application of un/self-supervised deep representation learning, and (3) use of electronic health records (EHRs) categorical data as the primary data source for SSRL model training. The search query combined the following keywords: (“deep learning” OR “neural network” OR “machine learning”) AND (“unsupervised” OR “self-supervised” OR “pretrain*” OR “pre-train*” OR “BERT”) AND (“electronic health record?” OR “ehr” OR “electronic medical record?” OR “emr” OR “Electronic Health Records” OR “health care data” OR “patient longitudinal” OR “patient trajectory”).

Study selection

The screening process was conducted in multiple stages, see Fig. 6. First, a pilot screening of 100 papers was performed to refine the inclusion and exclusion criteria. Once consensus was reached, two independent reviewers screened all papers by title and abstract. The inter-rater reliability for the title and abstract screening process was 87%. Disagreements were resolved through discussion to achieve consensus. This was followed by a full-text review. We excluded studies that had duplicate titles, were review articles, did not use unsupervised deep learning on EHR categorical data for patient or encounter representation learning, or had outcomes not directly related to clinical decision-making. Studies focusing solely on physiological signals, clinical free texts, medical images, or clustering were also excluded. Three additional papers were identified through reference screening of included studies, resulting in a final sample of 46 papers for analysis.

Fig. 6: Overview of our PRISMA process and research questions of the review.
figure 6

a Flow diagram illustrating the PRISMA approach for the identification, screening, and selection of studies. b Research questions posed.

Data extraction and analysis

We extracted data on article information, authorship details, clinical data characteristics, unsupervised components of deep learning models, evaluation metrics, end tasks, as well as interpretability and transferability properties. A detailed description of these data items can be found in Supplementary Table 1. This information was compiled into a standardized spreadsheet, available in Supplementary Data 1–8, which was pre-tested by the team to ensure consistency. Two reviewers independently extracted the data, and discrepancies were resolved through discussion. Data analysis was performed using Python, primarily employing the pandas library for descriptive statistical techniques.