Abstract
The widespread adoption of Electronic Health Records (EHRs) and deep learning, particularly through Self-Supervised Representation Learning (SSRL) for categorical data, has transformed clinical decision-making. This scoping review, following PRISMA-ScR guidelines, examines 46 studies published from January 2019 to April 2024, sourced from PubMed, MEDLINE, Embase, ACM, and Web of Science, focusing on SSRL for unlabeled categorical EHR data. The review systematically assesses research trends in building computationally and data-efficient representations for medical tasks, identifying three dominant model families: Transformer-based (43%), Autoencoder-based (28%), and Graph Neural Network-based (17%) models. The analysis highlights scenarios where healthcare institutions can leverage or develop SSRL technologies. It also addresses current limitations in assessing the impact of these technologies and identifies research opportunities to enhance their influence on clinical practice.
Introduction
The advent of EHRs has revolutionized the healthcare industry by providing comprehensive, digitized patient information1. This shift has enabled healthcare providers to maintain accurate and accessible records, facilitating better patient care2. The widespread adoption of EHRs has fueled the development of deep learning models for a variety of automated clinical decision-making tasks, offering sophisticated tools for predicting patient trajectories, identifying disease patterns, and personalizing treatments3,4.
Recently, an increasing number of deep neural networks (DNNs) based on SSRL have been deployed in real-world applications5. Examples include DINOv26, OpenCLIP7 for vision, and GPT-48 for free text. In the medical field, similar models like MedCLIP9 and MedSAM10 have also been developed, trained specifically with medical imaging and textual data. These models are trained on extensive datasets and are open-source, making them easily deployable. The representations they learn from the unlabeled data are designed for versatile use, enabling application across various downstream tasks, often referred to as foundation models11,12. By providing efficient learned representations, these models offer new opportunities to enhance the performance of existing models and reduce the need for large, manually annotated datasets.
Analyzing EHR data poses several challenges, including its sparsity, high dimensionality, and complex interrelationships13. EHRs consist of irregularly spaced visits over time, with each visit containing a subset of thousands of possible medical codes, along with laboratory test results, unstructured text, and images14. In this review, we focus specifically on EHR categorical data, also referred to as structured data. EHR categorical data includes medical codes such as diagnoses, procedures, medications, and laboratory test codes. Categorical data is easier to de-identify following HIPAA guidelines15, enabling faster construction of large datasets, as it is generally considered safer in terms of patient privacy compared to clinical free text16.
SSRL in DNNs automatically discovers and extracts features from unlabeled data11. Unlike supervised learning, which relies on labeled datasets, SSRL algorithms are trained to predict part of the data from other parts, which may be incomplete, transformed, distorted, or corrupted. Essentially, the model learns to 'recover' the whole, parts, or merely some features of its original input17. This enables SSRL to identify patterns and structures within unlabeled data, producing efficient representation vectors. These vectors, along with trained SSRL models, can be used for clustering similar data points, enhancing data visualization, or serving as inputs for subsequent predictive models. Figure 1 illustrates the application of SSRL in clinical settings. This framework offers several advantages: it reduces the need for extensive manual labeling, generalizes across different tasks without requiring full model retraining, and often outperforms models trained on comparable amounts of labeled data. As a result, using these representations and models optimizes manpower, computational resources, and model performance11,17.
EHR data follow a cyclical process, beginning at health centers, where they are either used to train internal SSRL models (blue box) or directly supplied to external SSRL models (purple box). These models convert the data into efficient representations, which are then adapted to the specific downstream tasks. The results from these downstream tasks are sent back to the health centers, facilitating the delivery of effective medicine and medical knowledge discovery. Blue and orange arrows represent unsupervised and supervised learning tasks, respectively. For efficient representations, the snowflake and cluster icons denote frozen (inference-only) and trainable (requiring high computational resources, such as high-performance computing) SSRL models, respectively. The gear icon signifies the training of downstream models using moderate resources, such as multiple GPUs. The potential use of externally developed SSRL models is highlighted in purple.
Despite the progress in large models for images and text, there is still a notable absence of large models based on EHRs in real-world applications. Previous reviews, including those by Si et al.18, Amirahmadi et al.19, Oss Boll et al.20, and Hama et al.21, have covered both supervised and unsupervised methods across various data types. However, none have systematically analyzed representation learning using unlabeled EHR categorical data, covering both clustering and prediction tasks. As a result, the reader is left without a clear understanding of the current State-of-the-Art (SOTA) trends, limitations, and opportunities in this area. This review, covering studies from 2019 to 2024, addresses this critical gap by offering detailed insights into the latest SSRL methodologies for unlabeled EHR categorical data. We assess their potential applications, identify appropriate scenarios for their deployment, and evaluate the feasibility of implementation in current clinical settings. This review offers valuable guidance for future research, practical healthcare data analysis, and implementation in hospital settings. Our scoping review answers three main research questions: i) What techniques and models are used for analyzing categorical data? (see sections “Type of data”, “Data preprocessing” and “Self-supervised learning models”) ii) How can SSRL models enhance clinical decision-making? (see sections “Fields of application” and “Evaluation tasks”) and iii) What are the current trends in the research field, and how do they impact medical settings? (see section “Discussion”). The detailed research questions and the full methodology of the scoping review are described in the Methods section.
We differentiate our work from the narrative review by Wornow et al.5 by specifically addressing SSRL using unlabeled categorical data from EHRs, regardless of the changing definitions and usages of the term “foundation models”, to provide the reader with a systematic analysis and comprehensive view of the current SOTA in the field. Notably, while we overlap with only 12 of the same papers over a similar time period, we include an additional 34 studies, underscoring the broader scope of our review. We highlight several agreements with their claims and systematically clarify the areas of similarity.
This scoping review is intended for an audience comprising medical professionals, data scientists, and healthcare stakeholders such as decision-makers and hospital IT teams. By synthesizing studies from databases across these fields, we aim to bridge the gap between clinical expertise and advanced data science techniques. Considering the societal and economic impact of leveraging recent research advances in SSRL, our goal is to provide valuable insights that enhance clinical decision-making processes, encourage interdisciplinary collaboration in healthcare informatics, and assist decision-makers in effectively adapting their IT infrastructure and data management strategies.
Results
This section provides a comprehensive overview of the findings from our scoping review, organized around subsections that emerged during our analysis. We begin by outlining the characteristics of the included studies and the types of data utilized. Next, we examine the studies from the technical aspects, including the data preprocessing techniques, SSRL model types, SSRL model comparison, models for downstream tasks, the evaluation metrics used, and the interpretability techniques. Finally, we analyze the studies from a clinical perspective, focusing on the fields of clinical application, clinical downstream tasks, and the involvement of medical experts. Table 1 summarizes the key features of the technical aspects, and Table 2 provides essential information on the studies from the medical perspective.
Studies characteristics
As illustrated in Fig. 2, most of the research (n = 33, 72%) was conducted by interdisciplinary teams of medical experts and data scientists. The United States led in the number of published studies (n = 21, 46%), followed by China (n = 9, 20%) and the United Kingdom (n = 4, 9%). Despite this geographic diversity, only a few studies (n = 11, 24%) involved international collaborations. For details on the authors and research teams, refer to Supplementary Data 2.
Type of model and trend
Five main model types have been identified for representing EHR categorical data: Transformer-based models (n = 20, 43%), Autoencoder (AE) based models (n = 13, 28%), Graph Neural Network (GNN) based models (n = 8, 17%), Word-embedding models (n = 3, 7%), and Recurrent Neural Network (RNN) based models (n = 3, 7%). Studies that combine two or more model types are counted once for each corresponding model type. To assess their impact on research, we analyzed the number of citations for each model type.
Figure 3 shows the papers published from January 2019 to December 2023, their citation counts by July 2024, and their corresponding model types. Based on the number of citations, Transformers, RNN, and GNN models are the most impactful, with Transformer models showing particularly high citation counts for papers published from 2020 to 2023.
Type of data
Studies utilize various data types to represent patients and medical knowledge. Typically, patient representation is derived from EHRs, incorporating both categorical and non-categorical data. Additionally, external medical knowledge can be integrated into models through data collected beyond EHRs. For detailed information on the modalities used across studies, see Supplementary Data 3.
Among the categorical data types in EHRs, diagnosis codes are the most frequently used (n = 45, 98%), including ICD-9, ICD-10-CM, and SNOMED-CT, followed by medication codes (n = 32, 70%), such as ATC and SNOMED-CT, and procedure codes (n = 20, 43%), like CPT and ICD-10-PCS. To enhance patient representation, non-categorical data may also be included. The most common non-categorical data types are patient age (n = 19, 41%), clinical measurement values (n = 15, 33%) such as BMI, heart rate, and systolic blood pressure, and clinical narratives from physicians and practitioners (n = 7, 15%).
The integration of external data sources can further enrich patient profiles. Medical knowledge graphs and ontologies provide rich hierarchical information, while medical text corpora contain expert medical knowledge. These external sources offer a comprehensive understanding of clinical concept interactions. Among external data sources, ontologies are the most used (n = 7, 15%); they are employed to obtain medical concept embeddings22,23,24,25,26,27,28 and as an SSRL training task23. Other significant external data sources include medical knowledge graphs25,29 and medical text corpora30.
Data preprocessing
Most models treat each data element as a distinct unit or token (n = 44, 95%). The identified data preprocessing techniques address various aspects such as numerical data, categorical data, data cleaning, and data shuffling. Some studies (n = 7, 15%) performed categorization by converting exact ages into intervals and clinical measurements into categories like high, normal, and low, based on clinical evaluation standards31,32,33,34,35,36,37. When maintaining the numerical nature of data, missing value imputation30,38,39 and value normalization31,39,40,41 have also been employed.
Some studies standardize data elements by mapping them to known ontologies23,36,42,43. A common approach to reduce dimensionality and data sparsity is using only the first digits of codes, effectively replacing them with parent nodes in the hierarchical ontology (n = 15, 33%).
In terms of data cleaning, typical practices include the removal of rare medical terms14,32,37,42,44,45 and the elimination of duplicated terms within a specific time range22,42,46,47. Additionally, shuffling the order of medical concepts within a time window33,47 was shown to help models generalize better by mitigating the impact of arbitrary sequencing and emphasizing the importance of co-occurrence over specific order. This method can also be considered a form of data augmentation. Detailed information on data preprocessing across studies can be found in Supplementary Data 4.
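To make these practices concrete, the following Python sketch illustrates three of the preprocessing steps described above: truncating ICD-9 codes to their three-digit parent categories, removing rare codes, and de-duplicating and shuffling codes within a visit. This is our illustration rather than code from any reviewed study; the helper names, example codes, and frequency threshold are assumptions.

```python
import random
from collections import Counter

def truncate_icd9(code: str) -> str:
    """Replace an ICD-9 code with its 3-digit parent category (e.g., '250.13' -> '250')."""
    return code.split(".")[0][:3]

def drop_rare_codes(visits, min_count=5):
    """Remove medical codes appearing fewer than `min_count` times in the corpus."""
    counts = Counter(code for visit in visits for code in visit)
    return [[c for c in visit if counts[c] >= min_count] for visit in visits]

def dedup_and_shuffle(visit):
    """Drop duplicates within a visit, then shuffle to emphasize co-occurrence over order."""
    codes = list(dict.fromkeys(visit))   # order-preserving de-duplication
    random.shuffle(codes)                # order within a time window is arbitrary
    return codes

visits = [["250.13", "401.9", "401.9"], ["428.0", "250.13"]]
visits = drop_rare_codes([[truncate_icd9(c) for c in v] for v in visits], min_count=1)
visits = [dedup_and_shuffle(v) for v in visits]
```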
Self-supervised learning models
There are two primary self-supervised learning training strategies: generative and contrastive. Generative tasks involve models predicting parts of the data from other parts, which may be incomplete, transformed, masked, or corrupted. These tasks, such as autoregressive prediction and masked modeling, help the model learn to recover whole or partial features of its original input17,48. Contrastive tasks, on the other hand, focus on distinguishing between similar and dissimilar data points, helping the model capture discriminative features that are essential for understanding different types of data48. Both task types are crucial for training models to generate rich, generalized representations from unlabeled data48,49, and they are applied across various model architectures. The objective of these models is to capture essential patterns and features in the data and output the learned representation, which is typically a fixed-length, high-dimensional vector that condenses large amounts of information. Five major architecture types have been identified in the studies, each trained on unlabeled data with different training tasks. Details of the SSRL models used and the temporality monitored in each study are provided in Supplementary Data 5.
Transformer-based models are among the most impactful model types in the studies. In the medical domain, most transformer-based models treat patients as documents, visits as sentences, and medical concepts as tokens, capturing detailed patient histories. BERT50 is an encoder-only transformer model that effectively learns data representations by processing and contextualizing complex sequences of information. BERT models can be trained using various techniques, such as training with only a Masked Language Modeling (MLM) objective, predicting randomly masked medical concepts in each EHR sequence34,43,44,51,52,53, which enhances contextual understanding. Training with both MLM and auxiliary tasks13,22,39,54,55,56 further refines the model’s representations by guiding it with specific medical insights. Additionally, self-contrastive learning techniques help improve BERT’s robustness and accuracy in capturing meaningful patterns in medical data30,35. Other transformer-based training tasks include next-visit code prediction23,36,45,57, medical code category prediction23, medication-diagnosis cross prediction26, and token replacement detection (ELECTRA)58.
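A minimal sketch of this masked-modeling objective on medical code sequences is shown below (our illustration in PyTorch; the vocabulary size, model dimensions, and 15% masking rate are assumptions, not values reported in the studies).

```python
import torch
import torch.nn as nn

VOCAB, MASK_ID, PAD_ID = 5000, 1, 0   # toy code vocabulary; ids 0/1 reserved

embed = nn.Embedding(VOCAB, 128, padding_idx=PAD_ID)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
    num_layers=2)
mlm_head = nn.Linear(128, VOCAB)

codes = torch.randint(2, VOCAB, (8, 64))        # 8 patients, 64 codes each
is_masked = torch.rand(codes.shape) < 0.15      # corrupt 15% of the positions
inputs = codes.masked_fill(is_masked, MASK_ID)

logits = mlm_head(encoder(embed(inputs)))       # (8, 64, VOCAB)
loss = nn.functional.cross_entropy(
    logits[is_masked], codes[is_masked])        # recover only the masked codes
loss.backward()
```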
AE-based models are encoder-decoder models that aim to reconstruct their input, enabling the learning of data representations in a compressed, lower-dimensional space. AEs are designed to learn the most salient features of the data, which can be particularly useful for capturing the underlying structure of categorical EHR data. Various AE variants were applied in the studies: the Stacked Autoencoder32,59, the Denoising Autoencoder60, and Autoencoders with RNN units such as the GRU31 and LSTM38,41,61,62,63. Additionally, AEs can be combined with other models such as collective matrix factorization29, CNNs42, and clustering algorithms27,64.
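The denoising variant can be sketched as follows on multi-hot visit vectors (one dimension per medical code), where randomly dropped codes must be reconstructed from a compressed embedding (our illustration; the layer sizes and corruption rate are assumptions).

```python
import torch
import torch.nn as nn

n_codes, latent_dim = 5000, 64
encoder = nn.Sequential(nn.Linear(n_codes, 512), nn.ReLU(), nn.Linear(512, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, n_codes))

x = (torch.rand(32, n_codes) < 0.01).float()    # sparse multi-hot visit vectors
noisy = x * (torch.rand_like(x) > 0.2)          # randomly drop ~20% of the codes
z = encoder(noisy)                              # compact visit/patient embedding
loss = nn.functional.binary_cross_entropy_with_logits(decoder(z), x)
```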
GNN-based models use graph learning to represent medical ontologies, hospital visits, and disease co-occurrence. Nodes represent medical concepts and personal entities, linked by edges indicating their relationships. Graph attention models were used to learn medical concept embeddings within medical ontologies22,26, with these embeddings frequently serving as initializations for further model training. A random walk technique was used to embed doctors according to their specialty65. Graph contrastive learning25,28 generates multiple views of augmented hospital visit graphs by modifying the original graph with node or edge perturbations, allowing the model to learn robust representations by contrasting positive pairs against negative pairs. These approaches ensure that the learned embeddings accurately reflect the complex relationships inherent in medical data49.
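A minimal sketch of this graph-contrastive recipe, with edge perturbation to create augmented views and an InfoNCE-style loss contrasting positive against negative pairs, is given below (our illustration; `encode` stands in for any GNN encoder, such as a graph attention network, and the dropout rate and temperature are assumptions).

```python
import torch
import torch.nn.functional as F

def drop_edges(edge_index, p=0.2):
    """Randomly remove a fraction p of edges to create an augmented graph view."""
    keep = torch.rand(edge_index.size(1)) > p
    return edge_index[:, keep]

def nt_xent(z1, z2, tau=0.5):
    """InfoNCE/NT-Xent loss: matching nodes across views are the positive pairs."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau              # pairwise cosine similarities
    targets = torch.arange(z1.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# view1 = encode(x, drop_edges(edge_index)); view2 = encode(x, drop_edges(edge_index))
# loss = nt_xent(view1, view2)
```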
Word-embedding-based models convert words into numerical vectors, allowing computers to capture their meanings and relationships from their context in a sequence of words. The model learns to map each word or concept to a dense vector representation, capturing semantic similarities based on co-occurrence patterns. Patient EHR data, composed of a sequence of medical concepts ordered by time, are used to train the representation model to predict medical concepts based on their surrounding context, helping the model understand relationships between concepts. Various algorithms were identified, such as GloVe46, Word2vec33,46,47, and FastText46.
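Applied to EHRs, this amounts to treating each patient's time-ordered code sequence as a "sentence", as in the sketch below using gensim's Word2vec (the example codes and hyperparameters are illustrative assumptions, not values from the reviewed studies).

```python
from gensim.models import Word2Vec

# Each "sentence" is one patient's chronological sequence of medical codes.
patients = [["I250", "E119", "N183"], ["J449", "I500", "I250"]]

model = Word2Vec(sentences=patients, vector_size=128, window=5,
                 min_count=1, sg=1, epochs=10)   # skip-gram over code contexts

vec = model.wv["I250"]                           # dense embedding for a code
similar = model.wv.most_similar("I250", topn=3)  # codes with similar contexts
```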
RNN-based models are designed to capture temporal dependencies in sequential data, making them well-suited for tasks involving time-series EHR data. These models are trained with the objective of predicting future medical events based on a patient’s historical data. Studies14,36,37 use a specific type of RNN, the GRU. The models were trained to predict the set of medical codes on day t based on the medical codes of previous days. To better capture temporality, these studies also included time-gap information in the input.
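A sketch of this next-day, multi-label pretraining objective, with a time-gap feature appended to each day's multi-hot code vector, is shown below (our illustration; the dimensions and gap encoding are assumptions).

```python
import torch
import torch.nn as nn

n_codes, hidden = 5000, 256
gru = nn.GRU(input_size=n_codes + 1, hidden_size=hidden, batch_first=True)
head = nn.Linear(hidden, n_codes)

days = (torch.rand(16, 30, n_codes) < 0.005).float()  # 16 patients, 30 days of codes
gaps = torch.rand(16, 30, 1)                          # normalized time gap between days
x = torch.cat([days, gaps], dim=-1)

out, _ = gru(x[:, :-1])                               # encode days 1..t-1
loss = nn.functional.binary_cross_entropy_with_logits(
    head(out), days[:, 1:])                           # multi-label next-day target
```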
SSRL models comparison
Different self-supervised representation learning models offer unique advantages and face specific limitations. The choice of models depends on several factors, including the size of the available dataset, the importance of temporal modeling for the downstream tasks, and the computational resources available at the institution.
AEs excel at dimensionality reduction66 and are well-suited to moderately sized datasets (average size: 166k in the included studies). However, they struggle with high-sparsity data67 and cannot inherently model temporal dependencies without incorporating sequential components, such as RNNs, CNNs, or Transformers.
Word embedding models are designed to map medical concepts or tokens into dense vector spaces that capture contextual information and syntactic relationships in the data68. They perform well with moderate datasets (average size: 139k in the included studies). However, traditional word embeddings are static and fail to account for the temporality or sequential order of the input data, necessitating their integration with sequential components.
GNNs perform well with small to moderate datasets (average size: 55k in the included studies) and are particularly effective at representing relational data, such as knowledge graphs, patient networks, and ontologies69. They offer strong interpretability by visualizing relational data, aligning with clinical knowledge. However, GNNs alone cannot fully address temporal dependencies, necessitating their integration with sequential components.
RNNs70 are well-suited for larger datasets (average size: 1.8 M in the included studies) and excel at capturing temporal patterns in sequential data. However, their training process is not parallelizable, leading to time inefficiencies71.
Transformers dominate SSRL research due to their ability to simultaneously capture long-range dependencies and temporal patterns72, offering scalability for large datasets and robust performance across diverse tasks. However, training these models from scratch necessitates substantial amounts of data (average size: 3 M in the included studies), and their high computational cost and complexity can pose significant challenges for deployment in resource-limited settings73.
Downstream task models
Predictive models for classification are used with the trained SSRL model as their backbone, to which a specific classification head is added. These predictive models require labeled data for training on specific tasks. Among the articles that mentioned the predictive models used for classification tasks, different model types were identified. These models are predominantly characterized by simple architectures that are easy to train. Some studies employ shallow models such as a single linear layer23,39,44,57, logistic regression (LR) (n = 8, 17%), and support vector machines (SVM)31,74. Models that can capture more complex data patterns, such as feedforward neural networks (n = 12, 26%) and RNNs13,40,54,55,62,65 (n = 6, 13%), are also applied.
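The dominant pattern, a frozen SSRL backbone whose embeddings feed a simple classifier trained on a much smaller labeled set, can be sketched as follows with scikit-learn (the embeddings are simulated here; in practice they come from the trained SSRL model).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2000, 128))   # frozen SSRL representations (simulated)
labels = rng.integers(0, 2, size=2000)      # task-specific labels

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, test_size=0.2)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # lightweight "head"
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```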
Clustering and visualization models are used with the data representation vector as input. We identified several techniques employed across the literature. T-distributed Stochastic Neighbor Embedding (t-SNE) emerged as the most frequently used model for data representation visualization and cluster interpretation (n = 12, 26%). In terms of clustering techniques, K-means33,38,47,62 was found to be the most common method. These clustering models take the embedding vectors generated by trained representation learning models as input.
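A sketch of this phenotyping pipeline, K-means on SSRL embeddings followed by a t-SNE projection for visual cluster inspection, is given below (the embeddings are simulated for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128))   # output of a trained SSRL model (simulated)

cluster_ids = KMeans(n_clusters=5, n_init=10).fit_predict(embeddings)
coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
# `coords` can be scatter-plotted, colored by `cluster_ids`, for cluster interpretation.
```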
Evaluation metrics
The evaluation of these tasks is primarily categorized into classification and clustering assessments, each employing different metrics to measure performance.
For classification tasks, the majority of which were binary, the most frequently used metric was AUROC (n = 21, 46%), followed by AUPRC (n = 14, 30%), accuracy (n = 10, 22%), and F1 (n = 9, 20%); other metrics, such as precision (n = 6, 13%) and sensitivity (n = 5, 11%), were used less frequently. A few studies evaluated multi-class classification tasks, reporting metrics such as average precision51,74, precision at k44,45, macro-F129,65, and weighted F124,29.
For clustering tasks, despite the prevalence of clustering studies, only a few employed specific clustering analysis metrics. Silhouette analysis (n = 4, 9%) was the most frequently used metric, followed by the Davies-Bouldin index33,41 (n = 2, 4%) and the purity score42,64 (n = 2, 4%).
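For reference, the metrics above are all available in scikit-learn; the sketch below computes the most frequently reported ones on simulated classifier scores and cluster assignments (illustrative data only).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             silhouette_score, davies_bouldin_score)

rng = np.random.default_rng(0)
y_true, scores = rng.integers(0, 2, 500), rng.random(500)
embeddings = rng.normal(size=(500, 64))
cluster_ids = KMeans(n_clusters=4, n_init=10).fit_predict(embeddings)

print(roc_auc_score(y_true, scores))                 # AUROC, the most common metric
print(average_precision_score(y_true, scores))       # AUPRC, robust to class imbalance
print(silhouette_score(embeddings, cluster_ids))     # cluster cohesion vs. separation
print(davies_bouldin_score(embeddings, cluster_ids)) # lower values indicate better clusters
```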
Interpretability
Interpretability in machine learning is defined as the extraction of relevant knowledge from a machine-learning model concerning relationships either contained in the data or learned by the model75. Attention weight analysis was used in several studies (n = 6, 13%). Statistical analysis of the clusters was employed in some papers (n = 3, 6%). For post-hoc interpretability, methods such as Integrated Gradients13 and gradient-based saliency45 were utilized. Most of the papers interpreted their results using visualizations computed by t-SNE (n = 12, 26%) and Uniform Manifold Approximation and Projection (UMAP) (n = 3, 6%). Ten papers involved medical expert interpretation. Overall, only two papers attempted post-hoc interpretability methods on trained models. Refer to Supplementary Data 7 for detailed information on the interpretability methods used in the studies.
Fields of application
Our scoping review identified various tasks across the articles. These tasks were distributed across various clinical domains, with Cardiology24,31,32,34,35,40,41,43,53,54,55,56,60,61,74 (n = 15, 33%), General & multiple diseases (n = 11, 24%), and Neurology & Psychiatry and Primary Care (n = 9, 20%) being the most frequently studied areas. Oncology (n = 6, 13%) followed, while Infectious Diseases39,40,47,61, Endocrinology35,38,42,60, and Respiratory13,23,32,64 each had four downstream tasks (n = 4, 9%). Gastroenterology40,42 and Nephrology27,35 had the lowest number of downstream tasks (n = 2, 4%). A detailed overview of the clinical events and their corresponding clinical domain mapping can be found in Supplementary Data 1.
Evaluation tasks
Upon training, deep learning models have developed an intrinsic representation of the data, which can be general, supporting multiple tasks, or task-specific, focusing on a single task or a few similar tasks. Representation quality is evaluated on various clinical tasks, including predictive tasks and patient phenotyping. For detailed information on the evaluation tasks in the studies, see Supplementary Data 7.
Among the 73 predictive tasks, the primary focus was on disease prediction (n = 27, 59%), followed by mortality prediction (n = 11, 24%), readmission prediction14,26,28,32,36,53,55,65,76 (n = 9, 20%), hospitalization (n = 5, 11%), and length of stay prediction (n = 4, 9%). In addition to these, other tasks included medication recommendations22,26,40 (n = 3, 7%), ICD coding56, doctor recommendations65, ICU transfers14, emergency department visits63, and high medical resource utilization63.
Beyond predictive modeling, patient phenotyping plays a crucial role in understanding patient populations. Of the 33 patient phenotyping tasks, clustering was primarily used for visualization (n = 15, 33%), patient similarity assessment (n = 8, 24%), characterization of clusters (n = 3, 9%), patient subtyping (n = 2, 6%), and patient stratification (n = 1, 3%).
Medical expert involvement
Medical experts were involved across different stages of the studies, with varying degrees of participation. Among the reviewed publications, expert participation was most prominent in study design (n = 14, 30%) and result interpretation (n = 14, 30%). Feature selection also saw substantial expert input (n = 10, 22%), while dataset extraction had more limited expert participation (n = 4, 9%).
Discussion
The most employed SSRL model types include Transformer-based, AE-based, and GNN-based architectures. These models, often referred to as foundation models, are trained to reconstruct or predict corrupted portions of input data11. The core strength of SSRL is the ability to construct a vectorized database, where clinical data, such as patient or encounter information, is embedded directly into low-dimensional representation embeddings. These embeddings can be easily retrieved and used for various medical ML research and applications, such as predictive modeling, personalized medicine, and disease prognosis, as shown in Table 2. To train such SSRL models, it is advised to use a broad patient cohort and then transfer learned information from the entire patient population to specific models relevant to a subset of the population13,14. The average unlabeled dataset size used for training SSRL models is 1.3 million data elements, compared to 96k data elements for labeled datasets used in downstream tasks; see Table 1 and Supplementary Data 6 for detailed information on SSRL training cohort selection, types of cohorts, and cohort sizes. This comprehensive data exposure enhances the models’ ability to learn underlying medical knowledge, thereby improving predictive performance on specific patient subsets and even generalizing to external datasets23.
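As a simple illustration of this vectorized-database idea, precomputed patient embeddings can be indexed once and then queried for similar patients without retraining (a sketch on simulated embeddings; the index and metric choices are our assumptions).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
patient_vectors = rng.normal(size=(10000, 128))   # precomputed SSRL embeddings (simulated)

# Build the index once; reuse it for retrieval across downstream applications.
index = NearestNeighbors(n_neighbors=10, metric="cosine").fit(patient_vectors)
distances, neighbor_ids = index.kneighbors(patient_vectors[:1])  # query one patient
```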
Labeling EHR data is manpower-intensive and time-consuming. SSRL models, especially those designed as general-purpose foundation models, streamline the development process by eliminating the need for labeled data or task-specific training5. As shown in Table 2, over half of the studies (n = 24, 52%) focused on general-purpose SSRL models, which, once trained, can be reused across various end tasks, in contrast to the task-specific nature of supervised learning models. For clinical downstream tasks, integrating SSRL models improves predictive performance. Studies have shown that SSRL models, when trained on large unlabeled datasets, can improve predictive performance in fine-tuned settings, often requiring less labeled data than traditional supervised models35,55.
Despite these advantages, several challenges persist in the current research landscape, encompassing data, modeling, and real-world application, as illustrated in Fig. 4. One of the primary concerns is data. Owing to differing clinical practices and economic incentives, datasets collected from different regions may differ substantially, a phenomenon known as data shift. Most studies (n = 26, 57%) rely solely on private data collected from their medical sites. Only a few studies demonstrate model generalizability and transparency using public datasets (n = 11, 24%) or a combination of public and private datasets (n = 9, 20%). Additionally, there is a notable lack of public EHR dataset resources. The most frequently used datasets are MIMIC-III, MIMIC-IV, and eICU, which focus on intensive care data, whereas public datasets for general wards are lacking. See Table 1 and Supplementary Data 2 for the datasets used and their availability information.
Key challenges across three stages: Data, Modeling, and Application. Data-related issues include transparency, strict cohort selection, preprocessing-related information loss, and lack of interoperability. In modeling, challenges stem from uncertainty in model superiority and poor interpretability. Application limitations include costly deployment, inadequate evaluation metrics, lack of generalizability, and interoperability barriers.
Cohort selection introduces further challenges, often leading to selection bias. Rigorous cohort selection criteria, while ensuring data relevance and quality, can result in unrepresentative patient samples, thus affecting the generalizability of the findings14. Studies frequently exclude patients based on the number of visits, medical codes, age range, and specific medical conditions, leading to cohorts that may not reflect the realistic patient population and often include more severe cases22,23.
Furthermore, expert knowledge is rarely integrated into dataset construction. Only 9% of studies reported using domain experts in defining patient cohorts. This lack of clinical input raises concerns about whether datasets reflect real-world patient diversity and clinical complexity.
Data oversimplification is a common practice, whereby numerical data is categorized and medical codes are truncated, which, while reducing input data dimensionality, introduces significant information loss and potential biases77. For example, reducing ICD-9 codes to their first three digits decreases the number of concepts from 9285 to 1131 (ref. 52), resulting in a loss of granularity and potentially important clinical details. Detailed information on the impact of preprocessing on the number of features across studies can be found in Supplementary Data 4.
Finally, the choice of coding systems, particularly the use of ICD for EHR analysis, raises concerns. ICD coding is often influenced by billing requirements rather than clinical accuracy, leading to potential biases. Additionally, since there is no unique mapping of a physician’s diagnosis to a coding scheme such as ICD, there is a tendency to select the code that delivers the greatest economic benefit from among several possible codes13. ICD-9, despite being the most frequently used ontology in these studies, has limited clinical relevance, as it does not cover all health conditions64. Moreover, variations in ICD coding across countries complicate transfer learning and hinder the development of universally applicable models61.
From a modeling perspective, most studies were evaluated using predictive tasks, typically comparing their model performance to classic end-to-end machine learning algorithms such as RNN, LR, SVM, and MLP, as shown in Fig. 5. On one hand, this demonstrates the superiority of SSRL frameworks over classic supervised learning baseline models; on the other hand, it reveals a lack of direct comparison between different SSRL models. Additionally, the variety of clinical predictive tasks and datasets used makes it challenging to determine which model is optimal for a given task. Nonetheless, we observed that recent studies increasingly benchmark their models against other SSRL frameworks.
Another limitation of the modeling is the lack of interpretability. Deep learning models are often considered black boxes and can suffer from hallucinations. In real-world medical applications, clinical reasoning and model interpretation are crucial for providing justifiable guidance in decision-making11. While most studies attempt to interpret the model outcomes based on attention weights, visual evaluation of clusters with t-SNE, and manual inspection, these interpretation methods can be subjective78,79. Only a few articles perform formal post-hoc interpretation. The lack of unbiased interpretation may reduce the credibility of the findings.
Beyond data and modeling concerns, the real-world adoption of SSRL models faces significant hurdles. First, the evaluation metrics commonly used in research are often technically focused and may not align with specific clinical needs, as noted by Wornow et al.5. Medical datasets frequently exhibit marked class imbalance, with a much higher prevalence of healthy cases than disease cases. In such scenarios, achieving high sensitivity is often more critical than high specificity, as missing a true positive case can lead to severe consequences. Reliance on conventional data science metrics could therefore produce unforeseen outcomes when applied to clinical practice. Furthermore, despite the potential advantages of transfer learning, SSRL models are typically data-intensive, making it challenging to train such frameworks in environments with limited data, such as small hospitals. To benefit from state-of-the-art models, these institutions would need access to pre-trained models built on large external datasets. However, shareable pre-trained models are often unavailable due to data security concerns. Even when such models are available, their generalizability is rarely demonstrated: as shown in Table 2, only nine studies have evaluated pre-trained models on external clinical datasets, highlighting the lack of proven generalizability of transfer learning across diverse populations and clinical environments, with further research needed to establish broader applicability. Moreover, such models may not be easily usable due to incompatibilities between EHR coding systems, leading to interoperability challenges23,43,80. Thieme et al.81 explored challenges and provided recommendations based on real-world implementation experiences; however, their work does not specifically address SSRL deployment. To date, no comprehensive study has examined these factors holistically, and none of the papers included in our review explicitly offers actionable recommendations for SSRL implementation.
The adoption of SSRL models in clinical settings depends on the specific characteristics of the dataset, the availability of annotated data, the nature of the downstream tasks, and the computational resources available. In clinical settings with low-cardinality datasets and abundant annotated data for specific tasks, traditional supervised learning approaches may remain more effective than SSRL. However, in resource-limited clinical settings, where data and computational resources are scarce, SSRL models can present an efficient alternative, depending on specific task requirements and data characteristics. Autoencoders, GNNs, and word-embedding models efficiently learn compact representations when temporality is not a concern. In cases where the sequential order of medical history is critical and no pretrained model is available, RNNs are a viable option. If a pretrained transformer model is accessible, it is generally the preferred choice due to its ability to leverage rich, contextualized representations from pretraining. Techniques such as inference with a frozen architecture31,32,37,82, fine-tuning83, domain adaptation43,80, prompt engineering84, and continual pretraining80 can significantly reduce the need for extensive computational resources and annotated data, even when the data distribution differs from the pretraining dataset.
As SSRL models continue to evolve, researchers are exploring ways to improve their generalizability. One emerging trend is the development of publicly shared Foundation Models (FMs), inspired by advancements in Natural Language Processing (NLP). These models, trained on diverse datasets, have the potential to improve knowledge transfer across different clinical settings23,85. Such FMs could be particularly beneficial for smaller healthcare institutions with limited private data and computational resources39. For example, collaborative studies between United States and Austrian hospitals13, the application of models trained on adult data to pediatric cases37, and the transfer of models from EHR to insurance data44 demonstrate the versatility of FMs. A recent multi-center study highlighted the adaptability of a shared foundation model80, showing that continual training on local data required fewer than 1% of training examples to match the performance of fully trained gradient boosting machines (GBMs). This approach was 60% to 90% more sample-efficient than training a local FM from scratch, underscoring its feasibility for resource-limited settings. These advancements highlight the potential of FMs to bridge the gap between research and real-world clinical applications, enabling resource-limited institutions to leverage state-of-the-art models without extensive local data or computational infrastructure.
For institutions with access to extensive data and computational resources, pretraining a model from scratch offers significant advantages. For example, the Med-BERT54 model, pretrained on data from 28 million patients, exemplifies the potential of SSRL in such scenarios. A pretraining phase lasting approximately one week on a high-performance GPU, costing approximately $11,000 (see Supplementary Data 8), can produce a robust representation model capable of understanding and predicting complex health outcomes. This approach has demonstrated improved performance on predictive tasks and transferability across various clinical datasets24. Pretraining from scratch is particularly beneficial when existing pretrained models do not align well with the institution’s data or task requirements. By leveraging their vast datasets, institutions can create highly customized models that outperform generic, off-the-shelf solutions. This makes in-house development a cost-effective and scalable strategy for large healthcare organizations aiming to harness the full potential of their data.
Given these challenges, future research should focus on three key areas: (1) improving data availability and standardization, (2) developing better benchmarking practices, and (3) fostering multi-institutional collaboration to enhance model generalizability.
Expanding and sharing public datasets is essential to improve the generalizability and robustness of SSRL models. Increasing the availability of public EHR datasets that cover a broader spectrum of medical care beyond intensive care units is crucial. Collaborative efforts among medical institutions, government agencies, and research organizations can facilitate this expansion. Additionally, establishing data-sharing agreements and frameworks that address privacy and security concerns will enhance model generalizability and transparency. Incorporating data from diverse populations and clinical settings will make models robust to data shifts43,56,86, enabling their application in small hospitals with domain adaptations. Medical data standardization is another key factor, as it improves interoperability across institutions. Recently, Guo et al.80 proposed using the widely adopted Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) to standardize data integration, and Ruth et al.87 demonstrated that unifying different medical vocabularies into a cohesive knowledge graph significantly enhances the integration and generalizability of clinical AI models. Revisiting coding systems to ensure clinical relevance and consistency across regions, such as adopting or developing comprehensive ontologies or knowledge graphs like SNOMED-CT88 or PheKG87, is also recommended. Researchers should also avoid excessive data simplification and categorization by exploring advanced techniques for handling high-dimensional data without significant information loss. This will ensure that critical clinical details are preserved, improving the accuracy and applicability of SSRL models in real-world settings.
Benchmarking different SSRL models using standardized clinical predictive tasks and datasets is another critical area for future research85. Enhancing interpretability is also critical. Developing transparent models or robust post-hoc interpretation methods, such as model-agnostic interpretability techniques and explainable AI (XAI) frameworks, will make models more clinically useful89. Recently, Self-Explainable Models (SEMs) have shown strong explainability in medical applications by proposing meaningful concepts to the user while maintaining strong performance90. Collaborating with clinicians to interpret outcomes and validate findings will enhance credibility and relevance. Furthermore, demonstrating that models can be adapted with minimal effort using advanced techniques like Low-Rank Adaptation (LoRA)91 and Retrieval-Augmented Generation (RAG)92 will make these models more adaptable and practical for real-world applications13.
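As one example of such low-effort adaptation, the sketch below applies LoRA to a generic transformer backbone using the Hugging Face peft library (the base checkpoint is a stand-in, not a model from the reviewed studies, and the rank, alpha, and target modules are illustrative assumptions).

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("bert-base-uncased")  # stand-in for an EHR transformer
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                    target_modules=["query", "value"])  # attention projection layers
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```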
Extensive validation across diverse populations and clinical settings is necessary to ensure the real-world applicability of SSRL models. Models must be tested on external datasets to prove their generalizability80. Additionally, adopting evaluation metrics that reflect clinical outcomes and practical utility, rather than relying solely on technical performance metrics, is also essential. Engaging clinicians in the evaluation process will ensure the metrics used are aligned with clinical needs and that the models provide actionable insights. Finally, evaluating the information loss during preprocessing should be a priority, with preprocessing methods adapted to the precise clinical use to preserve critical information.
Multicenter collaboration is another priority for advancing SSRL models in healthcare. Most studies in the EHR domain rely predominantly on private datasets, which, despite their size, represent only a small fraction of global patient data85. This lack of diversity limits the generalizability of trained models. To address this limitation, multicenter collaboration should be encouraged. Such collaborations are particularly beneficial for tasks that are less dependent on local institutions’ clinical practices93, such as the discovery of disease pattern relationships87, identification of genetic factors94, and rare disease analysis86.
Collaboration can take various forms, including data-sharing initiatives95 and federated learning (FL). Data-sharing initiatives, which leverage extensive and diverse datasets across institutions, can improve model robustness and applicability. However, achieving data-sharing requires addressing data privacy concerns and building trust in AI systems through transparent and ethical guidelines. In contrast, FL allows institutions to train models without directly exchanging data, ensuring privacy and security86,96. This approach is particularly valuable in healthcare, where data sensitivity is a major concern.
Despite these advantages, multicenter collaboration faces several challenges. First, multicenter data is highly heterogeneous, which requires harmonizing medical vocabularies and standardizing data models87,97 to ensure interoperability across institutions. Second, data heterogeneity may result in a globally optimal solution that is not optimal for an individual local participant. To address this issue, some authors propose that an agreed definition of model training optimality should be established among all participants before the collaboration86. However, this potential limitation was not addressed in the analyzed studies, and further evidence is needed to assess the use cases for which multicentric federated models provide a clear advantage.
Methods
Study design and search strategy
We conducted a scoping review following the PRISMA extension for Scoping Reviews (PRISMA-ScR) guidelines98. To encompass both healthcare and engineering perspectives, we systematically searched five electronic databases: PubMed, MEDLINE, Embase, ACM, and Web of Science. The search was limited to papers published between January 2019 and April 2024.
Our search strategy was designed to identify studies meeting three criteria: (1) utilization of deep learning or neural networks, (2) application of un/self-supervised deep representation learning, and (3) use of electronic health records (EHRs) categorical data as the primary data source for SSRL model training. The search query combined the following keywords: (“deep learning” OR “neural network” OR “machine learning”) AND (“unsupervised” OR “self-supervised” OR “pretrain*” OR “pre-train*” OR “BERT”) AND (“electronic health record?” OR “ehr” OR “electronic medical record?” OR “emr” OR “Electronic Health Records” OR “health care data” OR “patient longitudinal” OR “patient trajectory”).
Study selection
The screening process was conducted in multiple stages, see Fig. 6. First, a pilot screening of 100 papers was performed to refine the inclusion and exclusion criteria. Once consensus was reached, two independent reviewers screened all papers by title and abstract. The inter-rater reliability for the title and abstract screening process was 87%. Disagreements were resolved through discussion to achieve consensus. This was followed by a full-text review. We excluded studies that had duplicate titles, were review articles, did not use unsupervised deep learning on EHR categorical data for patient or encounter representation learning, or had outcomes not directly related to clinical decision-making. Studies focusing solely on physiological signals, clinical free texts, medical images, or clustering were also excluded. Three additional papers were identified through reference screening of included studies, resulting in a final sample of 46 papers for analysis.
Data extraction and analysis
We extracted data on article information, authorship details, clinical data characteristics, unsupervised components of deep learning models, evaluation metrics, end tasks, and interpretability and transferability properties. A detailed description of these data items can be found in Supplementary Table 1. This information was compiled into a standardized spreadsheet, which was pre-tested by the team to ensure consistency, and is available in Supplementary Data 1–8. Two reviewers independently extracted the data, and discrepancies were resolved through discussion. Data analysis was performed using Python, primarily employing the pandas library for descriptive statistical techniques.
Data availability
All data generated or analyzed during this study are provided in the main article and Supplementary Data.
Code availability
This study did not involve the use of any custom code or mathematical algorithms.
Change history
18 August 2025
A Correction to this paper has been published: https://doi.org/10.1038/s41746-025-01919-1
Abbreviations
- EHR: Electronic Health Record
- SSRL: Self-Supervised Representation Learning
- DNNs: Deep Neural Networks
- SOTA: State-of-the-Art
- RNN: Recurrent Neural Network
- LSTM: Long Short-Term Memory
- GRU: Gated Recurrent Unit
- BERT: Bidirectional Encoder Representations from Transformers
- GNNs: Graph Neural Networks
- AE: Autoencoder
- MLM: Masked Language Model
- CNN: Convolutional Neural Network
- t-SNE: T-distributed Stochastic Neighbor Embedding
- AUROC: Area Under the Receiver Operating Characteristic Curve
- AUPRC: Area Under the Precision-Recall Curve
- NLP: Natural Language Processing
- FL: Federated Learning
- XAI: Explainable AI
- SEMs: Self-Explainable Models
- LoRA: Low-Rank Adaptation
- RAG: Retrieval-Augmented Generation
- OMOP CDM: Observational Medical Outcomes Partnership Common Data Model
- LR: Logistic Regression
- SVM: Support Vector Machine
- GBMs: Gradient Boosting Machines
References
Gunter, T. D. & Terry, N. P. The emergence of national electronic health record architectures in the United States and Australia: models, costs, and questions. J. Med. Internet Res. 7, e3 (2005).
Tsai, C. H. et al. Effects of electronic health record implementation and barriers to adoption and use: a scoping review and qualitative analysis of the content. Life 10, 327 (2020).
Shickel, B., Tighe, P. J., Bihorac, A. & Rashidi, P. Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE J. Biomed. Health Inf. 22, 1589–1604 (2018).
FDA. Artificial intelligence and machine learning (AI/ML)-enabled medical devices (FDA, 2024).
Wornow, M. et al. The shaky foundations of large language models and foundation models for electronic health records. npj Digit. Med. 6, 1–10 (2023).
Oquab, M. et al. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research (2024).
Cherti, M. et al. Reproducible scaling laws for contrastive language-image learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2818–2829 (2023). https://doi.org/10.1109/CVPR52729.2023.00276
OpenAI et al. GPT-4 Technical Report. Preprint at http://arxiv.org/abs/2303.08774 (2024).
Wang, Z., Wu, Z., Agarwal, D. & Sun, J. MedCLIP: Contrastive Learning from Unpaired Medical Images and Text. in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (eds. Goldberg, Y., Kozareva, Z. & Zhang, Y.) 3876–3887 https://doi.org/10.18653/v1/2022.emnlp-main.256 (Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022).
Ma, J. et al. Segment anything in medical images. Nat. Commun. 15, 654 (2024).
Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at https://doi.org/10.48550/arXiv.2108.07258 (2022).
He, Y. et al. Foundation Model for Advancing Healthcare: Challenges, Opportunities and Future Directions. IEEE Rev. Biomed. Eng. 18, 172–191 (2025).
Lentzen, M. et al. A Transformer-Based Model Trained on Large Scale Claims Data for Prediction of Severe COVID-19 Disease Progression. IEEE J. Biomed. Health Inform. 27, 4548–4558 (2023).
Steinberg, E. et al. Language models are an effective representation learning technique for electronic health record data. J. Biomed. Inform 113, 103637 (2021).
The HIPAA privacy rule. https://www.hhs.gov/hipaa/for-professionals/privacy/index.html (2008).
Ford, E. et al. Should free-text data in electronic medical records be shared for research? A citizens’ jury study in the UK. J. Med. Ethics 46, 367–377 (2020).
Liu, X. et al. Self-supervised learning: generative or contrastive. IEEE Trans. Knowl. Data Eng. 35, 857–876 (2023).
Si, Y. et al. Deep representation learning of patient data from Electronic Health Records (EHR): a systematic review. J. Biomed. Inform. 115, 103671 (2021).
Amirahmadi, A., Ohlsson, M. & Etminani, K. Deep learning prediction models based on EHR trajectories: a systematic review. J. Biomed. Inform. 144, 104430 (2023).
Oss Boll, H. et al. Graph neural networks for clinical risk prediction based on electronic health records: A survey. J. Biomed. Inform. 151, 104616 (2024).
Hama, T. et al. Enhancing Patient Outcome Prediction Through Deep Learning With Sequential Diagnosis Codes From Structured Electronic Health Record Data: Systematic Review. J. Med. Internet Res 27, e57358 (2025).
Shang, J., Ma, T., Xiao, C. & Sun, J. Pre-training of graph augmented transformers for medication recommendation. In Proc. 28th International Joint Conference on Artificial Intelligence (IJCAI) 5953–5959 (2019).
Zeng, X., Linwood, S. L. & Liu, C. Pretrained transformer framework on pediatric claims data for population specific tasks. Sci. Rep. 12, 3651 (2022).
Lu, C., Reddy, C. K. & Ning, Y. Self-supervised graph learning with hyperbolic embedding for temporal health event prediction. IEEE Trans. Cybern. 53, 2124–2136 (2023).
Xu, Y. et al. SeqCare: sequential training with external medical knowledge graph for diagnosis prediction in healthcare data. In: Proceedings of the ACM Web Conference 2023 2819–2830 (ACM, 2023). https://doi.org/10.1145/3543507.3583543.
Liu, S. et al. Multimodal data matters: language model pre-training over structured and unstructured electronic health records. IEEE J. Biomed. Health Inform. 27, 504–514 (2023).
Liu, Z. et al. Patient clustering for vital organ failure using ICD code with graph attention. IEEE Trans. Biomed. Eng. 70, 2329–2337 (2023).
Cao, Y., Wang, Q., Wang, X., Peng, D. & Li, P. Multi-gate mixture of multi-view graph contrastive learning on electronic health record. IEEE J. Biomed. Health Inf. 1–13, https://doi.org/10.1109/JBHI.2023.3325221 (2023)
Kumar, S., Nanelia, A., Mariappan, R., Rajagopal, A. & Rajan, V. Patient representation learning from heterogeneous data sources and knowledge graphs using deep collective matrix factorization: evaluation study. JMIR Med. Inf. 10, e28842 (2022).
Chen, Y.-P., Lo, Y.-H., Lai, F. & Huang, C.-H. Disease concept-embedding based on the self-supervised method for medical information extraction from electronic health records and disease retrieval: algorithm development and validation study. J. Med. Internet Res. 23, e25113 (2021).
Ruan, T. et al. Representation learning for clinical time series prediction tasks in electronic health records. BMC Med. Inf. Decis. Mak. 19, 259 (2019).
Wang, L., Tong, L., Davis, D., Arnold, T. & Esposito, T. The application of unsupervised deep learning in predictive models using electronic health records. BMC Med. Res. Methodol. 20, 37 (2020).
Huang, Y. et al. Patient representation from structured electronic medical records based on embedding technique: development and validation study. JMIR Med. Inf. 9, e19905 (2021).
Poulain, R., Gupta, M., Foraker, R. & Beheshti, R. Transformer-based multi-target regression on electronic health records for primordial prevention of cardiovascular disease. In: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 726–731 (IEEE, 2021). https://doi.org/10.1109/BIBM52615.2021.9669441.
Li, Y. et al. Hi-BEHRT: hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. IEEE J. Biomed. Health Inform. 27, 1106–1117 (2023).
Guo, L. L. et al. EHR foundation models improve robustness in the presence of temporal distribution shift. Sci. Rep. 13, 3767 (2023).
Lemmon, J. et al. Self-supervised machine learning using adult inpatient data produces effective models for pediatric clinical prediction tasks. J. Am. Med. Inform. Assoc. 30, 2004–2011 (2023).
Manzini, E. et al. Longitudinal deep learning clustering of Type 2 Diabetes Mellitus trajectories using routinely collected health records. J. Biomed. Inform. 135, 104218 (2022).
Pellegrini, C., Navab, N. & Kazi, A. Unsupervised pre-training of graph transformers on patient population graphs. Med. Image Anal. 89, 102895 (2023).
Song, J. et al. Local–global memory neural network for medication prediction. IEEE Trans. Neural Netw. Learn. Syst. 32, 1723–1736 (2021).
Huang, Y. et al. Deep significance clustering: a novel approach for identifying risk-stratified and predictive patient subgroups. J. Am. Med. Inform. Assoc. 28, 2641–2653 (2021).
Landi, I. et al. Deep representation learning of electronic health records to unlock patient stratification at scale. npj Digit. Med. 3, 1–11 (2020).
Zhang, T., Chen, M. & Bui, A. A. T. AdaDiag: adversarial domain adaptation of diagnostic prediction with clinical event sequences. J. Biomed. Inform. 134, 104168 (2022).
Blinov, P. & Kokh, V. Medical profile model: scientific and practical applications in healthcare. IEEE J. Biomed. Health Inform. 28, 450–458 (2024).
Kraljevic, Z. et al. Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. Lancet Digit. Health 6, e281–e290 (2024).
De Freitas, J. K. et al. Phe2vec: automated disease phenotyping based on unsupervised embeddings from electronic health records. Patterns 2, 100337 (2021).
Ta, C. N. et al. Clinical and temporal characterization of COVID-19 subgroups using patient vector embeddings of electronic health records. J. Am. Med. Inform. Assoc. 30, 256–272 (2023).
Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).
Albelwi, S. Survey on self-supervised learning: auxiliary pretext tasks and contrastive learning methods in imaging. Entropy 24, 551 (2022).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at http://arxiv.org/abs/1810.04805 (2018).
Li, Y. et al. BEHRT: transformer for electronic health records. Sci. Rep. 10, 7155 (2020).
Meng, Y., Speier, W., Ong, M. K. & Arnold, C. W. Bidirectional representation learning from transformers using multimodal electronic health record data to predict depression. IEEE J. Biomed. Health Inform. 25, 3121–3129 (2021).
Ru, B. et al. Comparison of machine learning algorithms for predicting hospital readmissions and worsening heart failure events in patients with heart failure with reduced ejection fraction: modeling study. JMIR Form. Res. 7, e41775 (2023).
Rasmy, L., Xiang, Y., Xie, Z., Tao, C. & Zhi, D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digit. Med. 4, 1–13 (2021).
Pang, C. et al. CEHR-BERT: incorporating temporal information from structured EHR data to improve prediction tasks. In: Proceedings of Machine Learning for Health 239–260 (PMLR, 2021).
Poulain, R., Gupta, M. & Beheshti, R. Few-shot learning with semi-supervised transformers for electronic health records. Proc. Mach. Learn Res. 182, 853–873 (2022).
Yang, Z., Mitra, A., Liu, W., Berlowitz, D. & Yu, H. TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records. Nat. Commun. 14, 7857 (2023).
Dong, B. et al. Toward a stable and low-resource PLM-based medical diagnostic system via prompt tuning and MoE structure. Sci. Rep. 13, 12595 (2023).
de Lusignan, S. et al. Analysis of primary care computerised medical records with deep learning. Stud. Health Technol. Inform. 258, 249–250 (2019).
Chushig-Muzo, D., Soguero-Ruiz, C., de Miguel-Bohoyo, P. & Mora-Jiménez, I. Interpreting clinical latent representations using autoencoders and probabilistic models. Artif. Intell. Med. 122, 102211 (2021).
Navaz, A. N., El-Kassabi H, T., Serhani, M. A., Oulhaj, A. & Khalil, K. A novel patient similarity network (PSN) framework based on multi-model deep learning for precision medicine. J. Pers. Med. 12, 768 (2022).
Herp, J. et al. Modeling of electronic health records for time-variant event learning beyond bio-markers—a case study in prostate cancer. IEEE Access 11, 50295–50309 (2023).
Jones, B. W., Taylor, W. D. & Walsh, C. G. Sequential autoencoders for feature engineering and pretraining in major depressive disorder risk prediction. JAMIA Open 6, ooad086 (2023).
Shao, W. et al. Application of unsupervised deep learning algorithms for identification of specific clusters of chronic cough patients from EMR data. BMC Bioinforma. 23, 140 (2022).
Wu, T., Wang, Y., Wang, Y., Zhao, E. & Yuan, Y. Leveraging graph-based hierarchical medical entity embedding for healthcare applications. Sci. Rep. 11, 5858 (2021).
Nurmaini, S. et al. Deep learning-based stacked denoising and autoencoder for ECG heartbeat classification. Electronics 9, 135 (2020).
Steck, H. Embarrassingly shallow autoencoders for sparse data. In: The World Wide Web Conference 3251–3257 (ACM, 2019). https://doi.org/10.1145/3308558.3313710.
Almeida, F. & Xexéo, G. Word embeddings: a survey. Preprint at https://doi.org/10.48550/arXiv.1901.09069 (2023).
Johnson, R., Li, M. M., Noori, A., Queen, O. & Zitnik, M. Graph artificial intelligence in medicine. Annu. Rev. Biomed. Data Sci. 7, 345–368 (2024).
Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Phys. D Nonlinear Phenom. 404, 132306 (2020).
RWKV-TS: beyond traditional recurrent neural network for time series tasks. Preprint at https://arxiv.org/abs/2401.09093 (2024).
Vaswani, A. et al. Attention is all you need. Preprint at http://arxiv.org/abs/1706.03762 (2017).
Izsak, P., Berchansky, M. & Levy, O. How to train BERT with an academic budget. In: Proceedings of the 2021 conference on empirical methods in natural language processing (eds. Moens, M.-F., Huang, X., Specia, L. & Yih, S. W.) 10644–10652 (Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021). https://doi.org/10.18653/v1/2021.emnlp-main.831.
Liang, Z. et al. Deep generative learning for automated EHR diagnosis of traditional Chinese medicine. Comput Methods Prog. Biomed. 174, 17–23 (2019).
Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R. & Yu, B. Definitions, methods, and applications in interpretable machine learning. Proc. Natl Acad. Sci. USA 116, 22071–22080 (2019).
Seki, T., Kawazoe, Y. & Ohe, K. Graph representation learning-based fixed-length clinical feature vector generation from heterogeneous medical records. In: Studies in health technology and informatics (eds. Bichel-Findlay, J., Otero, P., Scott, P. & Huesing, E.) (IOS Press, 2024). https://doi.org/10.3233/SHTI231058.
Understanding the ICD-10 code structure. https://www.healthnetworksolutions.net/index.php/understanding-the-icd-10-code-structure.
Wattenberg, M., Viégas, F. & Johnson, I. How to use t-SNE effectively. Distill 1, e2 (2016).
Serrano, S. & Smith, N. A. Is attention interpretable? In: Proceedings of the 57th annual meeting of the association for computational linguistics (eds. Korhonen, A., Traum, D. & Màrquez, L.) 2931–2951 (Association for Computational Linguistics, Florence, 2019). https://doi.org/10.18653/v1/P19-1282.
Guo, L. L. et al. A multi-center study on the adaptability of a shared foundation model for electronic health records. npj Digit. Med. 7, 1–9 (2024).
Thieme, A. et al. Challenges for responsible AI design and workflow integration in healthcare: a case study of automatic feeding tube qualification in radiology. ACM Trans. Comput.-Hum. Interact. https://doi.org/10.1145/3716500 (2025).
Zhang, Z., Yan, C., Zhang, X., Nyemba, S. L. & Malin, B. A. Forecasting the future clinical events of a patient through contrastive learning. J. Am. Med. Inform. Assoc. 29, 1584–1592 (2022).
Raj, J. A., Qian, L. & Ibrahim, Z. Fine-tuning–a Transfer Learning approach. Preprint at https://doi.org/10.48550/arXiv.2411.03941 (2024).
Zaghir, J. et al. Prompt engineering paradigms for medical applications: scoping review. J. Med. Internet Res. 26, e60501 (2024).
Wornow, M., Thapa, R., Steinberg, E., Fries, J. & Shah, N. EHRSHOT: an EHR benchmark for few-shot evaluation of foundation models. Adv. Neural Inf. Process. Syst. 36, 67125–67137 (2023).
Rieke, N. et al. The future of digital health with federated learning. npj Digit. Med. 3, 1–7 (2020).
Johnson, R. et al. Unified clinical vocabulary embeddings for advancing precision medicine. Preprint at https://doi.org/10.1101/2024.12.03.24318322 (2024).
SNOMED International. Systematized nomenclature of medicine - clinical terms (SNOMED-CT). https://www.snomed.org/ (1999).
Speith, T., Crook, B., Mann, S., Schomäcker, A. & Langer, M. Conceptualizing understanding in explainable artificial intelligence (XAI): an abilities-based approach. Ethics Inf. Technol. 26, 40 (2024).
Nauta, M. et al. Interpreting and correcting medical image classification with PIP-Net. In: Artificial Intelligence. ECAI 2023 International Workshops (eds. Nowaczyk, S. et al.) 198–215 (Springer Nature Switzerland, Cham, 2024). https://doi.org/10.1007/978-3-031-50396-2_11.
Hu, E. J. et al. LoRA: low-rank adaptation of large language models. Preprint at https://arxiv.org/abs/2106.09685 (2021).
Gao, Y. et al. Retrieval-augmented generation for large language models: a survey. Preprint at https://arxiv.org/abs/2312.10997 (2023).
Placido, D. et al. A deep learning algorithm to predict risk of pancreatic cancer from disease trajectories. Nat. Med. 1–10, https://doi.org/10.1038/s41591-023-02332-5 (2023).
Manolio, T. A. et al. New models of collaboration in genome-wide association studies: the Genetic Association Information Network. Nat. Genet. 39, 1045–1051 (2007).
Quinn, M. et al. Electronic health records, communication, and data sharing: challenges and opportunities for improving the diagnostic process. Diagnosis 6, 241–248 (2019).
Kim, J., Kim, J., Hur, K. & Choi, E. EHRFL: federated learning framework for heterogeneous EHRs and precision-guided selection of participating clients. Preprint at https://doi.org/10.48550/arXiv.2404.13318 (2024).
Benson, T. Principles of Health Interoperability HL7 and SNOMED (Springer Science & Business Media, 2012).
Tricco, A. C. et al. PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Ann. Intern. Med. 169, 467–473 (2018).
Acknowledgements
This study is part of the HERO project (CCER 2023-01571) which is financed by the philanthropic donation of Mr. Nicolas Pictet.
Contributions
Y.Z. and A.B. led the review, performed the data extraction and study analysis, wrote the manuscript, and generated the figures and tables. M.B. supervised the review process, provided feedback, and contributed to figure generation, writing, and content curation. J.Z. provided NLP specialist insight, H.T. provided SSRL specialist insight, L.B. and J.E. provided medical specialist insight, and C.G. provided medical semantics specialist insight. C.L. and S.M. oversaw the review process and provided feedback. Y.Z. and A.B. contributed equally to the study. All authors have approved the manuscript and agree with its submission to this journal.
Competing interests
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Zheng, Y., Bensahla, A., Bjelogrlic, M. et al. A scoping review of self-supervised representation learning for clinical decision making using EHR categorical data. npj Digit. Med. 8, 362 (2025). https://doi.org/10.1038/s41746-025-01692-1