Table 4 Key takeaways from various challenges in synthetic data generation

From: A scoping review of privacy and utility metrics in medical synthetic data

Challenge: Lack of consensus

Key takeaway: Evaluations of synthetic data generators and synthetic data releases should cover multiple dimensions: broad utility (statistical fidelity), narrow utility (when synthetic data is released for a specific task), fairness, and privacy.
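As a hedged illustration of what a broad utility/statistical fidelity check can look like (the function name and toy data below are illustrative, not from the review), one common building block is the total variation distance between a column's marginal distribution in the real and synthetic data:

```python
from collections import Counter

def marginal_tvd(real_col, synth_col):
    """Total variation distance between the empirical marginal
    distributions of one column in real vs. synthetic data.
    0.0 means identical marginals; 1.0 means disjoint support."""
    p, q = Counter(real_col), Counter(synth_col)
    n, m = len(real_col), len(synth_col)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[v] / n - q[v] / m) for v in support)

# Toy example: blood-type marginals in real vs. synthetic records.
real = ["A"] * 50 + ["B"] * 30 + ["O"] * 20
synth = ["A"] * 45 + ["B"] * 35 + ["O"] * 20
print(round(marginal_tvd(real, synth), 3))  # 0.05
```

A full fidelity evaluation would aggregate such per-column (and per-pair) distances across the whole dataset; this sketch shows only the core computation.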

Challenge: Conflicting metrics

Key takeaway: Privacy and utility metrics that rely on the similarity of synthetic records to real records, such as distance to the closest record, should be used cautiously. Because different studies use equivalent similarity-based metrics to measure both utility and privacy, such metrics complicate the interpretation of the privacy-utility trade-off.

Challenge: Principled privacy evaluation

Key takeaway: If the purpose of synthetic data is to preserve the privacy of the original data, practitioners and researchers should rigorously evaluate the associated privacy risks using modern techniques14,15, avoiding similarity-based metrics such as distance to the closest record or the nearest neighbor distance ratio.
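To make the ambiguity of similarity-based metrics concrete, here is a minimal sketch (function names and toy data are illustrative, not from the review) of distance to closest record (DCR) and nearest neighbor distance ratio (NNDR). The same near-zero distances that signal memorization, a privacy red flag, are read as high fidelity by similarity-based utility metrics:

```python
import math

def dcr_nndr(synthetic, real):
    """Distance to closest record (DCR) and nearest neighbor distance
    ratio (NNDR) for each synthetic row against a real dataset.
    Illustrative sketch assuming numeric, comparably scaled features."""
    results = []
    for s in synthetic:
        dists = sorted(math.dist(s, r) for r in real)
        dcr = dists[0]                                # closest real record
        nndr = dists[0] / dists[1] if dists[1] > 0 else 0.0
        results.append((dcr, nndr))
    return results

real = [(float(i), float(i % 7)) for i in range(50)]
# A "leaky" generator that all but copies real records:
leaky = [(x + 0.001, y) for x, y in real[:10]]
scores = dcr_nndr(leaky, real)
# Near-zero DCR here signals memorization, yet the same closeness
# would be reported as high similarity (utility) elsewhere.
print(min(d for d, _ in scores))
```

This is why the review recommends principled attack-based privacy evaluations over distance thresholds: a low DCR is evidence of risk, but a high DCR is not evidence of safety.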

Challenge: Ensuring provable privacy guarantees

Key takeaway: Differential privacy (DP) is a well-established formal framework for provably guaranteeing a given level of data privacy, including for synthetic data. Although DP can hurt utility and fairness, recent methods such as those based on k-way marginals36 have significantly improved the privacy-utility trade-off, making DP-based methods a compelling candidate for synthetic data generation, especially when synthetic data is released publicly. Even when DP is used, however, implementations should be audited, as they often contain mistakes.
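As a hedged sketch of the marginal-based flavor of DP synthesis alluded to above (all names, parameters, and toy data are illustrative; practical k-way marginal methods select and combine many marginals with far more care), the core idea is to add Laplace noise to marginal counts and then sample synthetic records from the noisy distribution:

```python
import random
from collections import Counter

def dp_marginal_synthesizer(records, epsilon, n_synth, seed=0):
    """Toy epsilon-DP synthesizer releasing one noisy marginal.
    Adding or removing a single record changes one count by 1, so the
    L1 sensitivity is 1 and Laplace noise of scale 1/epsilon suffices.
    Sketch only: noising just the observed categories leaks the data's
    support; a full implementation noises every cell of the domain."""
    rng = random.Random(seed)
    counts = Counter(records)
    noisy = {}
    for value, count in counts.items():
        # Difference of two Exp(epsilon) draws is Laplace(scale=1/epsilon).
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        noisy[value] = max(count + noise, 0.0)
    values, weights = zip(*noisy.items())
    # Sample synthetic records from the normalized noisy marginal.
    return rng.choices(values, weights=weights, k=n_synth)

data = [("F", "pos")] * 30 + [("F", "neg")] * 20 + \
       [("M", "pos")] * 25 + [("M", "neg")] * 25
synth = dp_marginal_synthesizer(data, epsilon=1.0, n_synth=100)
print(Counter(synth).most_common())
```

Even a sketch this small has subtle failure modes (e.g., clamping negative counts, leaking the category support), which illustrates the table's final point: DP implementations should be audited rather than trusted on the strength of the theory alone.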