Table 3 Hamming distances for privacy conservation

From: Mimicking clinical trials with synthetic acute myeloid leukemia patients using generative artificial intelligence

 

| | CTAB-GAN+ | NFlow | Original cohort |
| --- | --- | --- | --- |
| Absolute Hamming distances | | | |
| Average min. distance train | 8.7034 | 9.3474 | 8.2524 |
| Average min. distance test | 8.8587 | 9.4117 | 8.2224 |
| Median distance train | 9 | 9 | 8 |
| Median distance test | 9 | 9 | 8 |
| Relative Hamming distances | | | |
| Privacy leakage coefficient | 0.0178 | 0.0069 | |

 
  1. Hamming distances were used to measure the distance between two points within and between equally sized subsets of the training data (four sets of 20%) and the test data (20%). The median distance is the number of variables that would have to be altered (and matched exactly) to fit a real patient. For relative distances, the threshold for the privacy leakage coefficient was set at 0.05; values above 0.05 signal potential privacy breaches. Both synthetic data sets fell well below this threshold, indicating larger distances between synthetic and training data, which makes re-identification of training set patients unlikely.
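
To make the absolute distances in the table concrete, the following is a minimal sketch of how a minimum Hamming distance per record could be computed and then summarized as a mean and a median. It is not the authors' implementation: the function names `hamming` and `min_hamming_distances` are hypothetical, the records are made-up toy values rather than study data, and the privacy leakage coefficient is deliberately not computed because its formula is not given here.

```python
from statistics import mean, median


def hamming(a, b):
    """Absolute Hamming distance: number of positions where two records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    return sum(x != y for x, y in zip(a, b))


def min_hamming_distances(records, reference):
    """For each record, the distance to its closest record in the reference set."""
    return [min(hamming(r, ref) for ref in reference) for r in records]


# Toy categorical records with four variables each (illustrative only).
train = [
    (0, 1, 2, 0),
    (1, 1, 0, 0),
    (2, 0, 2, 1),
]
synthetic = [
    (0, 1, 2, 1),
    (1, 0, 1, 1),
]

min_dist = min_hamming_distances(synthetic, train)
avg_min_distance = mean(min_dist)
median_distance = median(min_dist)
print(min_dist, avg_min_distance, median_distance)
```

In this sketch, a minimum distance of 1 means that changing a single variable would make the synthetic record identical to a training record, so larger values correspond to records that are harder to match to a real patient.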