Table 4 Key takeaways from various challenges in synthetic data generation
From: A scoping review of privacy and utility metrics in medical synthetic data
| Challenges | Key takeaways |
|---|---|
| Lack of consensus | Evaluations of synthetic data generators and synthetic data releases should cover several dimensions: broad utility/statistical fidelity, narrow utility (if the synthetic data is released for a specific task), fairness, and privacy. |
| Conflicting metrics | Privacy and utility metrics that rely on the similarity of synthetic records to real data, such as distance to closest record, should be used cautiously. Because different studies use equivalent similarity-based metrics to measure both utility and privacy, such metrics complicate the interpretation of the privacy-utility trade-off. |
| Principled privacy evaluation | If the purpose of synthetic data is to preserve the privacy of the original data, practitioners and researchers should rigorously evaluate the associated privacy risks using modern techniques14,15, avoiding similarity-based metrics such as distance to closest record or nearest neighbor distance ratio. |
| Ensuring provable privacy guarantees | Differential privacy (DP) is a well-established formal framework for provably guaranteeing a given level of data privacy, including for synthetic data. Although DP can hurt utility and fairness, recent methods such as those based on k-way marginals36 have significantly improved its privacy-utility trade-off, making DP methods a compelling candidate for synthetic data generation, especially when synthetic data is released publicly. Even when DP is used, however, implementations should be audited, as they often contain mistakes. |
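To make the "conflicting metrics" concern concrete, the sketch below computes the two similarity-based metrics named in the table, distance to closest record (DCR) and nearest neighbor distance ratio (NNDR). It assumes numeric records and Euclidean distance; the function name `dcr_nndr` and the toy data are ours, not from the review.

```python
import numpy as np

def dcr_nndr(synthetic, real):
    """Distance to closest record (DCR) and nearest-neighbor distance
    ratio (NNDR) of each synthetic record with respect to the real data."""
    # Pairwise Euclidean distances: shape (n_synthetic, n_real)
    d = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2)
    d.sort(axis=1)                 # row-wise: nearest real record first
    dcr = d[:, 0]                  # distance to the closest real record
    nndr = d[:, 0] / d[:, 1]       # ratio of 1st- to 2nd-nearest distance
    return dcr, nndr

rng = np.random.default_rng(0)
real = rng.normal(size=(100, 5))
# "Synthetic" records that are near-copies of real ones: the resulting
# small DCR signals memorization (a privacy risk), yet the same closeness
# is sometimes read as high fidelity (utility) -- the ambiguity the
# table warns about.
synthetic = real + rng.normal(scale=0.1, size=real.shape)
dcr, nndr = dcr_nndr(synthetic, real)
```

The same number thus supports two opposite readings depending on whether it is framed as a utility or a privacy metric, which is why the review recommends caution with this family of metrics.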
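As a minimal illustration of marginal-based DP synthesis, the sketch below releases a single 1-way marginal (a histogram of one numeric column) under epsilon-DP via the Laplace mechanism and samples synthetic values from it. This is a deliberate simplification of the k-way marginal methods the table cites; the function name `dp_marginal_synth` and all parameters are our assumptions.

```python
import numpy as np

def dp_marginal_synth(data, n_bins, epsilon, n_synth, rng):
    """Release a 1-way marginal (histogram) of a numeric column under
    epsilon-DP via the Laplace mechanism, then sample synthetic values."""
    counts, edges = np.histogram(data, bins=n_bins)
    # Adding or removing one record changes one bin count by 1, so the
    # histogram has L1 sensitivity 1; Laplace noise with scale 1/epsilon
    # yields epsilon-DP for this single query.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=n_bins)
    weights = np.clip(noisy, 0.0, None)      # clip negative noisy counts
    probs = weights / weights.sum()          # assumes not all bins clipped
    bins = rng.choice(n_bins, size=n_synth, p=probs)
    # Sample uniformly within each selected bin
    return rng.uniform(edges[bins], edges[bins + 1])

rng = np.random.default_rng(1)
data = rng.normal(size=1000)
synth = dp_marginal_synth(data, n_bins=20, epsilon=1.0, n_synth=500, rng=rng)
```

Even a toy mechanism like this shows where audits matter: subtle mistakes (e.g. noising after clipping, or miscounting the sensitivity) silently void the stated guarantee, which is the table's final point.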