Introduction

The application of digital technologies, e.g., artificial intelligence, to improve medical management, patient outcomes, and healthcare delivery is known as digital medicine1,2. The field has evolved significantly towards intelligent decision-making since the development of deep learning (DL) models. A common feature of most DL models is the need for a large dataset for training and validation. Preparing a large dataset that incorporates sufficient samples of the various classes for training a DL model is sometimes problematic. This is particularly evident for medical data, where the privacy of patient data demands serious attention.

Social and medical communities rigorously impose escalating regulations at different societal levels to prevent the misuse of patient data. In this light, the European Union has recently released the first regulation3 on artificial intelligence, aimed at lowering the risk of data abuse and strengthening governance of developers. Such restrictions, as well as the difficulties in collecting medical data, act as impeding factors in preparing a sufficiently large and fair dataset for training and validating DL models. Medical data is nowadays stored digitally as an Electronic Health Record (EHR), composed of the health-related data collected from the individuals of a population during their visits to any care unit defined by the population's healthcare system4. An EHR typically contains demographic data, clinical findings, lab values, procedures, medications, symptoms, diagnoses, medical images, physiological signals, and descriptive texts obtained from the individuals during these visits. During a visit to a care unit, and depending on the visit, an individual may undergo a sequence of investigations and/or examinations, called events, which are stored in the individual's EHR using globally recognized codes, e.g., the International Classification of Diseases (ICD)5 for diagnoses. A sequence of the tabular health records of an individual, resulting from several visits, constitutes longitudinal health data6,7.
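
For illustration, an individual's longitudinal health data can be pictured as an ordered sequence of coded visits. The following minimal sketch is purely illustrative; the field names, codes, and values are hypothetical and do not correspond to any specific EHR standard or dataset.

```python
# Illustrative sketch of a longitudinal health record: an ordered list of visits,
# each holding coded events. Field names and code values are hypothetical.
from datetime import date

patient_record = {
    "patient_id": "P-0001",                       # pseudonymized identifier
    "demographics": {"sex": "F", "birth_year": 1964},
    "visits": [
        {
            "date": date(2021, 3, 14),
            "diagnoses": ["I50.9"],               # ICD-10 code: heart failure, unspecified
            "labs": {"NT-proBNP": 2100.0},        # lab value in pg/mL
            "medications": ["furosemide"],
        },
        {
            "date": date(2021, 9, 2),
            "diagnoses": ["I50.9", "E11.9"],      # follow-up visit with an added diagnosis
            "labs": {"NT-proBNP": 950.0},
            "medications": ["furosemide", "metformin"],
        },
    ],
}
```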

Accessibility of the EHR can be of vital importance for patient management, and hence the highest level of security considerations is administered to preserve the privacy of EHRs. Healthcare systems set intensive restrictions and limitations on accessing EHR contents. As a result, preparing a rich clinical dataset is considered an important research question for any study in the digital medicine context. A practical solution to this research question is to create a synthetic copy of the clinical dataset from the real one and use it for research purposes. The created dataset, called a synthetic health record (SHR), can be utilized and shared with researchers, instead of the real records, for the training and validation of DL models. An SHR is created to resemble the general characteristics of EHR data while ensuring the subjects of the EHR remain unidentifiable8,9,10. Such datasets can be employed by any machine learning method for training and optimization purposes.

Health records can incorporate tabular data of the medical records, longitudinal tables of different visits, time series of physiological signals, textual clinical data, and medical images. Recent progress in developing DL models for generating various modalities of synthetic data has enabled researchers to address the diverse research gaps in the preparation of high-quality health data. Privacy leakage, limited data availability, uneven class distribution, and scattered data are among the main research topics of this domain.

Despite substantial progress in developing generative models for synthetic medical data, there is still considerable room for studying the reliability of the generated data. As we will see in the discussion section, the reliability of the models is explored from two different perspectives: the quality of the generated data in terms of conformity with the real data, and the capability of the learning models to preserve the privacy of the real data in terms of re-identification. Discrepancies in the study objectives across data modalities, and inconsistencies in the evaluation metrics, make it a complicated task to select an appropriate model for creating SHRs and to evaluate the model realistically.

This paper presents the results of a scoping review of the existing DL models for generating SHRs. The review examines the DL models deployed, the modalities utilized, the datasets invoked, and the metrics employed by researchers to explore the scope and potential of using SHRs for different medical objectives. The study taxonomy is defined in both technical and applicative terms to represent the relevance of the models in conjunction with their capabilities in producing time series, text, and longitudinal SHRs. The main study objectives are the identification of the practical capabilities and the knowledge gaps of generative models for creating SHRs, detailed as follows:

  • Identifying the state of the art of generative models for creating synthetic medical texts, time series, and longitudinal data, along with their methodological limitations.

  • Summarizing the existing performance measures in conjunction with the related metrics for evaluating the quality of SHR.

  • Listing the datasets most commonly employed by researchers for generating SHRs.

  • Finding the key research gaps of the field.

The unique features of this study are mainly the comprehensiveness of the review and the novel research taxonomy. As we will see in the discussion section, this paper introduces innovative features compared to other review papers. These features lead to the following contributions to the field:

  • Introducing taxonomic novelty by defining various data modalities and applications. Review papers exist on SHRs for tabular and image data; however, less attention has been paid to important topics such as medical texts and longitudinal data.

  • Evaluation of different machine learning (ML) methods for generating various forms of SHR.

  • Presenting the performance measures and evaluation metrics used for validating the quality of SHRs, in association with their methodological capabilities for evaluating different data modalities.

  • Introducing the available datasets for generating SHRs, in conjunction with their applicability under the study taxonomy.

It is worth noting that the number of publications in the SHR domain has increased drastically in the last two years. The novel aspects of this study highlight the scope and possibilities of state-of-the-art methods in this new domain of digital medicine.

Results

Overview of the findings

Figure 1a illustrates an overview of the identified publications. In total, 3740 citations resulted from the bibliographic search in PubMed (n = 352), Scopus (n = 2798), and Web of Science (n = 590) in the Identification phase, from which 935 citations were excluded due to duplication. In the Screening phase, after carefully reading the Title and Abstract of the publications, 2692 publications were filtered out because of topical irrelevance. In the Eligibility phase, 52 publications fulfilled the inclusion and exclusion criteria and ultimately participated in the study (PubMed = 27, Scopus = 19, and Web of Science = 6). Notably, half of the eligible publications (n = 26) were published after 2022. Additionally, we included the related review or survey papers published between 2022 and 2023 (Table 4). All of the included review papers were articles in peer-reviewed journals.

Fig. 1: Overview of the study selection process and research queries.

a PRISMA-ScR representation of the research methodology. b Research queries used in the study.

Figure 2 represents an overview of the findings. Methods based on generative adversarial networks (GANs) were dominantly employed for generating medical time series, and to a much lesser extent for the longitudinal and text modalities, respectively. Diffusion models were used only for the time series modality. Although large language models (LLMs) have been well-received mainly for generating synthetic texts, their application in generating longitudinal data showed promising results. The variational auto-encoder (VAE) method was used in a minority of the studies, equally for generating longitudinal and text data, but not for time series. Probabilistic models, e.g., Bayesian networks, were mostly used for longitudinal data and only marginally for time series.

Fig. 2: The mutual links between the data modalities, generative models, and research objectives identified in the reviewed publications.

This figure shows an overview of the findings of the paper. In summary, generative adversarial networks (GANs) were dominantly employed for generating medical time series. Large language models (LLMs) have been widely used for generating synthetic texts. The variational auto-encoder (VAE) method was employed in a minority of the studies, equally for generating longitudinal and text data, but it was not used for time series. Probabilistic models were dominantly utilized for longitudinal data.

Medical time series data

Generation of synthetic physiological time series was observed in 22 (42%) of the published peer-reviewed papers, of which synthetic electrocardiogram generation was the most common case study (10 studies). Electroencephalogram generation was the second most common case study, for which a diffusion model resulted in the optimal utility11. Data scarcity and privacy were found to be the main objectives of 12 and 6 studies, respectively, while class imbalance and imputation constituted the objectives of the remaining 4 studies. The great majority of the studies relied on various variants of GAN-based methods, whereas statistical models, such as diffusion models and hidden Markov models, were seen in a minority of about 10% of the studies. Table 1 lists the findings of the survey on the generative models for synthetic time series.

Table 1 The reviewed publications on generating synthetic medical records of time series

Medical longitudinal data

Table 2 lists the survey's findings on the generative models for synthetic longitudinal data. As can be seen, privacy was the main objective of 16 out of the 17 studies, with case studies comprising kidney diseases, patients with hearing loss, Parkinson's and Alzheimer's diseases, chronic heart failure, diabetes, hypertension, and hospital admissions. GAN-based methods were reported in the great majority of the studies as the optimal models, outperforming the baselines.

Table 2 The reviewed publications on generating synthetic medical records of longitudinal data

Medical text data

Medical texts are an important part of an EHR, reflecting medical assessments and decisions. Reading these texts can put the underlying EHR at risk of identification, and therefore privacy is regarded as an important objective when it comes to SHRs. This was confirmed by our study, as privacy was the objective of 9 out of the 12 studies included in this survey, which covered various case studies. Table 3 lists the findings of the survey on the generative models for synthetic text data.

Table 3 The reviewed publications on generating synthetic medical records of text data

In contrast with synthetic health time series generation, where the proposed models were dominantly GAN-based, GPT-style models were employed in approximately 40% of the text studies, surpassing the individual usage rates of the other generative methods. This is justifiable considering the versatility of GPT models. Such capabilities enabled the generation of synthetic medical texts in different languages, spanning from East Asian languages, e.g., Chinese, to European languages, e.g., Dutch and English (see the case studies in Table 3). Nevertheless, GAN-based models were still seen in some studies.

Discussion

The need for a rich dataset for training ML methods on the one hand, and the difficulties in collecting patient data, e.g., privacy issues, on the other hand, make the generation of SHRs a practical strategy. This is subject to a high level of security against re-identification along with acceptable fidelity. Countries adopt different regulations that intensively restrict the sharing of patient data, which is sometimes administered in a federated way. The use of SHRs allows sharing of data that can be employed by researchers to develop advanced ML methods for different medical applications where access to the real data is problematic. Another application of SHRs is in cases where a heavy class imbalance negatively affects the learning process. In such cases, generative models are employed to create synthetic medical data for the minority classes. This is different from data augmentation, where the minority data is merely reproduced, because the statistical distribution of the data is taken into account when generating SHRs.

Several surveys and review studies have previously been conducted on different models for generating SHRs. However, a comprehensive study of this topic with sufficient breadth to explore the different aspects of the studies from a practical perspective cannot be found in the existing literature. In addition, unlike other review papers (Table 4), ours covers more data modalities and DL models, thereby providing readers with novel perspectives on the topic. The outcomes of such a broad study can unveil the practical limitations and bottlenecks of the existing methodologies and guide the choice of an appropriate model for such a demanding application.

Table 4 The reviewed publications on generating synthetic health records

This study provided a scoping review of the most popular generative models for producing SHRs. Our analysis showed that researchers employed GAN-based models more than the other alternatives for generating synthetic time series (see Supplementary Note 1). In addition to GANs, probabilistic models were widely used for generating synthetic longitudinal data. However, several studies reported that GAN-based models generally suffered from (i) the mode collapse issue12,13,14,15,16, (ii) requiring preliminary experiments to identify optimal hyperparameters, and (iii) biases towards high-density classes11. Diffusion models demonstrated promising results in synthesizing time series compared to GANs. Nevertheless, resolving the expensive computational costs and interoperability difficulties of diffusion models remains an ongoing research endeavor. Finally, current works could gain significant advantages by integrating domain-specific expertise from physicians into the learning process17,18.

Generating synthetic clinical notes is a less explored area in the literature. Recent advancements of LLMs19,20,21 have demonstrated significant improvements in generating synthetic clinical notes. Nevertheless, LLMs require massive processing power (ref. 21 leveraged 560 A100 GPUs for 20 days to train the LLM). Furthermore, refs. 22,23 reported that LLMs struggle with complex reasoning problems. Non-reasoned outputs for generating synthetic clinical notes lacked coherence, consistency, and certainty22. Chain-of-Thought prompting24 stood out as a leading method aimed at improving complex reasoning capabilities through intermediate reasoning steps. While Chain-of-Thought has shown promising results, its effectiveness in improving the reasoning ability of LLMs for complex multi-modal inputs and tasks necessitating compositional generalization remains an unresolved problem22. Despite the success of generative models, additional tuning effort was sometimes required to achieve optimal performance25. It is worth mentioning that despite the success of the studied papers in generating SHRs, the reproducibility of the results of several studies is questionable because (i) the implementation code is not available, and (ii) the details of the training hyperparameters are not reported.

Generating SHRs necessitates real datasets to train the generative model, and the quality of the accessible training data defines the caliber of the synthetic data4,26,27,28,29,30. The EHRs collected at healthcare sites are usually multi-dimensional, longitudinal datasets recording patient history over multiple visits. However, secondary use of this data is restricted by privacy laws31,32,33; nevertheless, several de-identified datasets are available for generating synthetic data. The popular databases for generating SHRs used by the eligible publications of this review can be found in Supplementary Table 1. Our findings show that, despite the existence of public datasets for generating SHRs, the majority of public longitudinal medical datasets primarily focus on ICU records, prioritizing acute patient cases and overlooking non-acute medical conditions. Furthermore, these public datasets often lack a comprehensive representation of all demographic groups and geographic regions, which limits their relevance and generalizability to broader populations. Finally, it is worth noting that most longitudinal records are reported in English, underscoring the current shortage of public resources across diverse languages.

One of the objectives of this survey was to help researchers find an optimal generative model for SHRs among the great variety of existing ones, which in turn demanded a set of objective performance measures for comparison. Various statistical and intuitive metrics have been employed for comparing the performance of generative models, which makes choosing an optimal model for a case study complicated. In addition, some of the metrics are based on the discrimination power of physicians11,17,34,35, whereas others rely on the performance of a benchmark binary classifier in distinguishing between SHR and EHR8,36. Supplementary Note 2 elaborates further on the techniques used for evaluating SHRs, while Supplementary Table 2 compiles the metrics that the included papers employed to assess the generated synthetic data. An important challenge observed in this study is the lack of generic methods and metrics for comparing the performance of different generative models. Figure 3 represents the distribution of the three performance measurement objectives (fidelity, utility, and re-identification) over the data modalities. As can be seen, generating high-fidelity SHRs appears to be the main objective for all data modalities. Figure 3 also implies that introducing appropriate performance measures to address the privacy of SHRs is a contemporary research gap, given the shortage of pertinent studies; the same holds for utility in the case of longitudinal SHRs.

Fig. 3: Normalized distribution of performance measurement objectives over the data modalities. Larger circles display more publications in each category.

The figure indicates an inadequate evaluation of the re-identification of SHRs in the studied papers. In addition, evaluating the utility of longitudinal data has been less researched compared to medical time series and text data.

Statistical diversity of the data has recently been introduced as a measure for comparing the utility and fidelity of SHRs28,37. Furthermore, fairness of SHRs was defined as another comparative objective for the performance measures38. According to the citation records, these two measures have not been widely adopted by researchers. All in all, there are no well-established systematic criteria or practices on how to evaluate SHRs.

Generative models for time series prediction showed promising results in various medical applications for classification and identification problems39,40. This topic indeed served as a starting point for creating SHRs, as previously reported by the reviewed studies. Recent scientific endeavors revealed that the application of SHRs is not limited to the privacy of patient data, but can be extended towards statistical planning for clinical trials41 and, furthermore, towards addressing ethical issues42 by mitigating bias (e.g., Black patients were less likely to be admitted to cardiology for heart failure care43) in the original dataset. In terms of methodology, the recent trend of practical models for the creation of SHRs shows a shift from GAN-based methods to models such as graph neural networks and diffusion models9,11, and data fairness was addressed by one of the recent studies38.

For further reading, we recommend the following key papers that complement our work and provide a deeper understanding of the subject of generating SHRs. Ref. 27 demonstrated that the evaluation metrics currently available for generic LLMs lack an understanding of medical and health-related concepts, which aligns closely with the findings of our study. In addition, the authors introduced a comprehensive collection of LLM-based metrics tailored for the evaluation of healthcare chatbots from an end-user perspective. Social determinants of health (SDoH) encompass the conditions of individuals' lives, influenced by the distribution of resources and power at various levels, and are estimated to contribute to 80–90% of the modifiable factors affecting health outcomes44. However, documentation of SDoH is often incomplete in the structured data of EHRs. In ref. 44, the authors extracted six categories of SDoH from the EHR using LLMs to support research and clinical care. To address class imbalance and fine-tune the extraction model, the authors leveraged an LLM to generate synthetic SDoH data.

Methods

Definitions

The methodological contents of the reviewed papers addressed different study objectives, identified by their applicative terms. The main objectives of the introduced methods are summarized as follows:

  • Privacy: Reliable SHRs can be generated from patient data and utilized, instead of the real records, for training and validation of ML methods, thereby preserving patient privacy.

  • Class Imbalance: In many health studies, access to the different classes of data is not feasible in a consistent form, and a single class dominates the study population. This can bias ML methods towards better learning of the dominant class. A reliable generative model can be invoked to create synthetic data for the minority classes.

  • Data Scarcity: Access to data of a specific class can sometimes be problematic. In this case, the scarce class is identified and modeled using the few available samples along with meta-learning methods. The SHR is generated based on the identified model to explore the characteristics of the scarce class.

  • Data Imputation: EHRs are heavily sparse, with missing values that are not uniformly distributed over the visits. Data imputation refers to methods for estimating values that are missing systematically or at random in the data collection.

Generative models

The existing methods for generating synthetic data typically fall into two categories: probability distribution techniques and neural network-based methods4. Probability distribution techniques involve estimating a probability distribution of the real data and then drawing random samples that fit this distribution as synthetic data. Generative Markov-Bayesian probabilistic modeling is one such technique used for synthesizing longitudinal EHRs35. Recent developments in synthetic data generation, on the other hand, adopt advanced neural networks. Below are the most commonly used neural network-driven generative models:

  • Generative adversarial networks (GANs)4, comprising a generator and a discriminator, produce synthetic data resembling real samples drawn from a specific distribution. The discriminator distinguishes between real and synthetic samples, refining the generator's ability to create realistic data through adversarial training, which enables accurate approximation of the data distribution and generation of high-fidelity novel samples. GANs can generate sequences of data points that closely resemble the patterns observed in the original image or time series data (a minimal training sketch is given after this list).

  • Diffusion models18 gradually introduce noise into original data until it matches a predefined distribution. The core idea behind diffusion models is to learn the process of reversing this diffusion process, allowing for the generation of synthetic samples that closely resemble the original data while capturing its essential characteristics and variability.

  • Variational auto-encoders (VAEs)4 are a category of generative models that learn to encode and decode data points while approximating a probability distribution, typically Gaussian, in the latent space. VAEs are trained by optimizing a variational lower bound on the log-likelihood of the data, enabling them to learn meaningful representations and create new data samples.

  • Large language models (LLMs)21 predict the probability of the next word in a text sequence based on the preceding words, typically leveraging transformer architectures adept at capturing long-range dependencies. LLMs are effective for generating contextually appropriate text by learning the probability distribution of natural language from vast corpora of text.
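
To make the adversarial training loop concrete, the following is a minimal, illustrative PyTorch sketch of one GAN update for fixed-length time series. The network sizes, optimizer settings, and the `real_batch` placeholder are hypothetical choices for illustration only and are not taken from any of the reviewed studies.

```python
# Minimal GAN sketch for fixed-length synthetic time series (illustrative only).
# Architecture sizes and the `real_batch` tensor are hypothetical placeholders.
import torch
import torch.nn as nn

SEQ_LEN, LATENT_DIM, BATCH = 128, 32, 64

generator = nn.Sequential(            # maps noise -> synthetic sequence
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, SEQ_LEN), nn.Tanh(),
)
discriminator = nn.Sequential(        # maps sequence -> real/synthetic logit
    nn.Linear(SEQ_LEN, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_batch = torch.randn(BATCH, SEQ_LEN)   # stand-in for real, normalized signals

# --- discriminator step: learn to separate real from synthetic samples ---
noise = torch.randn(BATCH, LATENT_DIM)
fake_batch = generator(noise).detach()     # detach so only the discriminator updates
d_loss = bce(discriminator(real_batch), torch.ones(BATCH, 1)) + \
         bce(discriminator(fake_batch), torch.zeros(BATCH, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# --- generator step: produce samples that the discriminator labels as real ---
noise = torch.randn(BATCH, LATENT_DIM)
g_loss = bce(discriminator(generator(noise)), torch.ones(BATCH, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

In practice, these two steps are repeated over many epochs, and the generator is then sampled to produce the synthetic records.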

Performance measurement

Evaluating the strengths and weaknesses of generative models has become increasingly critical as these models continue to advance in complexity and capacity. The evaluation of generative models can be seen from different perspectives. In this study, we categorized the evaluation metrics based on three objectives: (i) Fidelity: the degree of faithfulness with which the synthetic data preserves the essential characteristics, structures, and statistical properties of the actual data. Fidelity can be assessed at either the population level (e.g., examining marginal and joint feature distributions) or the individual level (e.g., synthetic data must adhere to specific criteria, such as not including prostate cancer in a female patient); (ii) Re-identification: the protection of sensitive information and the confidentiality of individuals' identities; and (iii) Utility: the suitability of synthetic data as a substitute for actual data in training/testing medical devices and algorithms.
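
As a concrete illustration of the three objectives, the sketch below computes simple proxy scores on hypothetical tabular arrays `real` and `synthetic`: a marginal-distribution fidelity score based on the Kolmogorov-Smirnov statistic, a "train on synthetic, test on real" utility score, and a nearest-neighbor distance-to-closest-record proxy for re-identification risk. These proxies are one common way to operationalize the objectives, not a standard prescribed by the reviewed studies; the data and the downstream task are placeholders.

```python
# Illustrative proxies for fidelity, utility, and re-identification risk.
# The arrays, labels, and downstream task are hypothetical placeholders.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 10))
synthetic = rng.normal(size=(500, 10))
real_labels = (real[:, 0] > 0).astype(int)           # stand-in downstream task
synthetic_labels = (synthetic[:, 0] > 0).astype(int)

# (i) Fidelity: mean per-feature Kolmogorov-Smirnov distance between marginals.
fidelity = np.mean([ks_2samp(real[:, j], synthetic[:, j]).statistic
                    for j in range(real.shape[1])])

# (iii) Utility: train a classifier on synthetic data, test it on real data.
clf = LogisticRegression(max_iter=1000).fit(synthetic, synthetic_labels)
utility = roc_auc_score(real_labels, clf.predict_proba(real)[:, 1])

# (ii) Re-identification proxy: distance from each synthetic record to its
# closest real record; very small distances suggest memorized (copied) records.
nn_real = NearestNeighbors(n_neighbors=1).fit(real)
dcr = nn_real.kneighbors(synthetic)[0].ravel()

print(f"mean KS distance (lower = higher fidelity): {fidelity:.3f}")
print(f"train-on-synthetic AUROC (higher = better utility): {utility:.3f}")
print(f"median distance to closest real record: {np.median(dcr):.3f}")
```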

Study taxonomy

The study of SHRs covers a wide scope, spanning from tabular medical data of a single record to longitudinal data from several visits and events. Despite the variety of learning methods adopted to create synthetic health data, the methodological suitability of the proposed methods depends primarily on the data modality. This review study is hence performed according to the following taxonomy: (i) longitudinal medical/health data, (ii) medical/health time series, and (iii) medical/health texts.

Research method

To perform this scoping review, we followed the recommendations outlined in the PRISMA-ScR guidelines45. Supplementary Note 3 provides the PRISMA-ScR checklist. The research method comprised five steps, described as follows:

  1. Search: A systematic search was performed on the three widely accepted platforms of scientific publications in this domain (PubMed, Web of Science, and Scopus) using combinations of the {Synthetic}, {Time Series, Text, Longitudinal}, and {Medical, Medicine, Health} keywords in the title and/or abstract of the publications. Our search queries are shown in Fig. 1b. We adapted the search string for each database, using various forms of the terms.

  2. Identification: The outcomes of the search were checked for duplication, and repeated publications were excluded from the study.

  3. Screening: In this phase, the Title and Abstract of the identified papers were studied and the topical relevance of the publications was investigated. Some publications from different scientific topics had been retrieved because of similarities in keywords; these were detected and excluded from the study.

  4. Eligibility (inclusion criteria): After the search phase, only those publications fulfilling all of the following criteria were allowed to participate in the study: (i) published within 2018–2023, (ii) the full paper is available, and (iii) addressing an ML topic for electronic health record (EHR) generation. Papers with only the Abstract available cannot be analyzed and were hence excluded from the study, as were papers addressing synthetic organs without addressing ML objectives.

  5. Included Studies: This study focuses on reproducible ML methods for generating synthetic time series, longitudinal, or text contents of medical records. Therefore, the validity of the proposed methods in terms of implementation feasibility is an important criterion for consideration. We consolidated the scientific quality of the study by using the following exclusion criteria: (i) the publication did not undergo a peer-review process, (ii) EHR generation is not the major objective of the publication, and (iii) the study is limited to tabular and image data only.