Table 2 The reviewed publications on generating synthetic medical records of longitudinal data
From: A review on generative AI models for synthetic medical text, time series, and longitudinal data
Paper & Year | Study Objective | Case Study | Method | Key Takeaways |
|---|---|---|---|---|
60, 2023 | Privacy, data scarcity | Hospital visits from MIMIC III database | Hierarchical auto-regressive language model | (+) Fidelity of SHR is improved by utilizing a probabilistic and an autoregressive model for estimating longitudinal data at the visit and code level |
61, 2023 | Data scarcity, privacy | Multi-dimensional cancer and type-2 diabetes data | GAN-boosted semi-supervised learning | (+) Utilizes the underlying graphical structure of EHRs
6, 2023 | Privacy, data scarcity | EHR time series for ICU patients | Mixed-type longitudinal GAN | (+) Generating mixed-type time series by effectively capturing the temporal characteristics of the original data |
9, 2023 | Privacy | Data on critical care patients admitted to the ICU (e.g., number of visits, diagnosis) from the MIMIC IV dataset | Variational graph auto-encoder | (+) Generating synthetic patient trajectories from EHRs with graph learning
7, 2023 | Privacy | Longitudinal health records (e.g., age, vital statistics) | RNN | (-) Limited ability to generate lengthy sequences
35, 2022 | Data scarcity | Type-2 diabetes data | Generative Markov-Bayesian-based model | (-) Limited to a single chronic disease and to ICD-10 codes only
62, 2022 | Privacy | Health records of patients with hypertension | GAN | (-) The criteria for data inclusion and exclusion could potentially result in selection bias |
63, 2022 | Privacy, data imputation | Parkinson’s disease and Alzheimer’s disease | Multi-modal Neural Ordinary Differential Equations | (+) Handling multi-modal data along with learning continuous-time real data trajectories (-) Limited to the static categorical variables |
64, 2022 | Privacy | Hospital visits from MIMIC III database | GPT-2 | (+) Formulating the generation of the heterogeneous EHRs as a text-to-text translation task using LLMs |
65, 2022 | Privacy, data imputation | Hospital visits from MIMIC III database | DataSifter-II (rule-based method) | (+) Improved privacy of the time-varying correlated data by using a generalized linear mixed model and random effects-expectation maximization tree
8, 2021 | Privacy | Hospital visits from MIMIC III database | Bayesian network | (-) Struggling to preserve multivariate relationships in the datasets |
66, 2021 | Privacy | Acute kidney injury | GAN | (-) Insufficient evaluation of the fidelity and the utility |
67, 2021 | Privacy | EHRs for type-2 diabetes, heart failure, and hypertension | GAN | (+) Mitigation of the GAN issues by using a two-step learning method: dependency learning and conditional simulation
36, 2020 | Privacy | Hospital visits from MIMIC III database | Adversarial auto-encoder | (+) Adversarially learning both the continuous latent distribution and the discrete data distribution |
68, 2020 | Privacy | Chronic heart failure, organ transplantation | cGAN | (+) Improved privacy; the identifiability of the SHR is quantified and employed for the optimization of a cGAN |
69, 2019 | Privacy | Hearing loss patients | Bayesian network | (-) Insufficient evaluation of the fidelity and the utility |
70, 2019 | Privacy | Hospital visits from MIMIC III database | GAN | (-) Limited to generating discrete synthetic EHRs |