Table 1 A summary of the metrics in the framework

From: A multifaceted benchmarking of synthetic electronic health record generation models

 

| Metric | Summary | Direction |
| --- | --- | --- |
| **Utility** | | |
| Dimension-wise distribution | Goal: the ability to capture the marginal feature distributions in the real data. Measurement: the average of the absolute prevalence differences (APD) for binary features and the average of the Wasserstein distances (AWD) for continuous features between the real and synthetic datasets [25]. | ↓ |
| Column-wise correlation | Goal: the ability to capture the relationship between two features in the real data. Measurement: the average of the cell-wise absolute differences between the Pearson correlation coefficient matrices derived from the real and synthetic datasets [26]. | ↓ |
| Latent cluster analysis | Goal: the ability to capture the joint distribution of all features in the real data. Measurement: the deviation of a synthetic dataset from the corresponding real dataset in the underlying latent space, in terms of unsupervised clustering [27]. | ↓ |
| Clinical knowledge violation | Goal: the ability to learn clinical knowledge at the patient level. Measurement: the proportion of generated records that violate clinical knowledge derived from the real dataset (e.g., synthetic records for male patients that are associated with pregnancy diagnosis codes). | ↓ |
| Medical concept abundance | Goal: the ability to retain record-level information from the real data. Measurement: the normalized Manhattan distance between the distributions of the number of distinct medical concepts assigned per record in the real and synthetic datasets. | ↓ |
| TSTR model performance | Goal: the ability to approximate performance in the downstream task of machine learning model development. Measurement: given an outcome prediction task, the model performance, typically the area under the receiver operating characteristic curve (AUROC), when training on the synthetic dataset and testing on the real dataset (TSTR) [28]. | ↑ |
| TRTS model performance | Goal: the ability to generate convincing and realistic data records for different labels. Measurement: given an outcome prediction task, the model performance, typically the AUROC, when training on the real dataset and testing on the synthetic dataset (TRTS) [28]. | ↑ |
| Feature selection | Goal: the ability to support model interpretability in downstream tasks. Measurement: the proportion of shared important features between models trained on a synthetic dataset and on the corresponding real dataset. | ↑ |
| **Privacy** | | |
| Attribute inference risk | Goal: the adversary's ability to infer sensitive attributes of a targeted record. Adversarial knowledge: demographics and some sensitive attributes of a targeted record. Measurement: the weighted sum of the F1 scores of the inferences of the other sensitive attributes [20, 25]. | ↓ |
| Membership inference risk | Goal: the adversary's ability to infer the membership of a targeted record in the training data. Adversarial knowledge: a set of attributes of a targeted record. Measurement: the F1 score of the inference based on the Euclidean distances between the targeted record and all synthetic records [20, 25]. | ↓ |
| Meaningful identity disclosure risk | Goal: the adversary's ability to identify synthetic records with meaningful attributes. Adversarial knowledge: a population dataset with identities. Measurement: the adjusted re-identification risk, accounting for the linkage between the synthetic and real datasets, the linkage between the synthetic and population datasets, and the rareness of each sensitive attribute in the real dataset [29]. | ↓ |
| Nearest neighbor adversarial accuracy risk | Goal: the extent to which a generative model overfits the real training dataset. Measurement: the difference between (1) the aggregated distance between records in the synthetic dataset and records in the evaluation dataset and (2) the aggregated distance between records in the synthetic dataset and records in the real training dataset [30]. | ↓ |

1. The direction of the values indicates whether a higher (↑) or lower (↓) value is better.
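
The sketches below illustrate, under stated assumptions, how several of the metrics in the table might be computed in Python; they are simplified approximations rather than the authors' implementations. First, dimension-wise distribution: a minimal sketch assuming `real` and `synth` are pandas DataFrames with identical columns and that the lists of binary and continuous feature names are supplied by the user.

```python
# Illustrative sketch (not the authors' code): dimension-wise distribution.
import numpy as np
from scipy.stats import wasserstein_distance

def dimension_wise_distribution(real, synth, binary_cols, continuous_cols):
    # Average absolute prevalence difference (APD) over binary features.
    apd = np.mean([abs(real[c].mean() - synth[c].mean()) for c in binary_cols])
    # Average Wasserstein distance (AWD) over continuous features.
    awd = np.mean([wasserstein_distance(real[c], synth[c]) for c in continuous_cols])
    return apd, awd  # lower is better for both
```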
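Column-wise correlation can be sketched as the mean absolute cell-wise difference between the two Pearson correlation matrices, again assuming DataFrame inputs with the same numeric columns.

```python
# Illustrative sketch: average absolute difference between Pearson
# correlation matrices of the real and synthetic data.
import numpy as np

def column_wise_correlation(real, synth):
    # DataFrame.corr() computes Pearson correlations by default.
    diff = (real.corr() - synth.corr()).abs().to_numpy()
    # Average over all cells (lower is better); NaNs from constant
    # columns are ignored.
    return float(np.nanmean(diff))
```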
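For latent cluster analysis, one common formulation is the log-cluster metric: embed real and synthetic records in a shared latent space, cluster them, and measure how far each cluster's fraction of real records deviates from the overall fraction. The choice of PCA, k-means, and the number of components and clusters below are assumptions; the paper's exact latent space may differ.

```python
# Illustrative sketch of a log-cluster-style latent cluster metric.
# Assumes `real` and `synth` are numeric numpy arrays with matching columns.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def latent_cluster_metric(real, synth, n_components=2, n_clusters=10, seed=0):
    X = np.vstack([real, synth])
    is_real = np.array([1] * len(real) + [0] * len(synth))
    Z = PCA(n_components=n_components).fit_transform(X)  # shared latent space
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)
    c = len(real) / len(X)  # overall fraction of real records
    per_cluster = [is_real[labels == k].mean() for k in range(n_clusters)]
    # A larger (less negative) value means real and synthetic records separate
    # in latent space, i.e. the joint distribution is captured poorly.
    return float(np.log(np.mean([(p - c) ** 2 for p in per_cluster])))
```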
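Clinical knowledge violation reduces, for a single rule, to the share of synthetic records that break it. The sketch below checks the example rule from the table (male sex co-occurring with a pregnancy diagnosis); the `sex` column name and the list of pregnancy indicator columns are hypothetical placeholders.

```python
# Illustrative sketch: proportion of synthetic records violating one
# clinical rule. `synth` is a DataFrame with a "sex" column and binary
# diagnosis-code indicator columns.
def clinical_knowledge_violation(synth, pregnancy_codes):
    violations = (synth["sex"] == "M") & synth[pregnancy_codes].any(axis=1)
    return violations.mean()  # lower is better
```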
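Medical concept abundance compares the distributions of the number of distinct concepts per record. The sketch assumes binary record-by-concept matrices, a fixed binning, and division of the L1 (Manhattan) distance by 2 so the result lies in [0, 1]; the paper's exact normalization may differ.

```python
# Illustrative sketch: normalized Manhattan distance between the
# per-record distinct-concept-count distributions.
import numpy as np

def medical_concept_abundance(real, synth, n_bins=50):
    real_counts = real.sum(axis=1)    # distinct concepts per real record
    synth_counts = synth.sum(axis=1)  # distinct concepts per synthetic record
    hi = max(real_counts.max(), synth_counts.max())
    bins = np.linspace(0, hi, n_bins + 1)
    p, _ = np.histogram(real_counts, bins=bins)
    q, _ = np.histogram(synth_counts, bins=bins)
    p = p / p.sum()
    q = q / q.sum()
    return np.abs(p - q).sum() / 2  # lower is better
```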
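TSTR and TRTS differ only in which dataset is used for training and which for testing. A minimal sketch with logistic regression and AUROC follows; the paper may use other downstream models and evaluation splits.

```python
# Illustrative sketch of TSTR (train synthetic, test real) and
# TRTS (train real, test synthetic) with AUROC.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def tstr_trts(X_real, y_real, X_synth, y_synth):
    m = LogisticRegression(max_iter=1000).fit(X_synth, y_synth)
    tstr = roc_auc_score(y_real, m.predict_proba(X_real)[:, 1])
    m = LogisticRegression(max_iter=1000).fit(X_real, y_real)
    trts = roc_auc_score(y_synth, m.predict_proba(X_synth)[:, 1])
    return tstr, trts  # higher is better for both
```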
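Feature selection agreement can be sketched as the overlap of the top-k most important features between a model trained on real data and one trained on synthetic data; the random forest model and k = 10 are assumptions.

```python
# Illustrative sketch: proportion of shared top-k important features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def feature_selection_agreement(X_real, y_real, X_synth, y_synth, k=10, seed=0):
    rf_real = RandomForestClassifier(random_state=seed).fit(X_real, y_real)
    rf_synth = RandomForestClassifier(random_state=seed).fit(X_synth, y_synth)
    top_real = set(np.argsort(rf_real.feature_importances_)[-k:])
    top_synth = set(np.argsort(rf_synth.feature_importances_)[-k:])
    return len(top_real & top_synth) / k  # higher is better
```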
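One simple way to instantiate the attribute inference attack is a nearest-neighbour guess: for each targeted real record, the closest synthetic record on the attacker-known attributes supplies the guesses for the remaining sensitive attributes, which are then scored with F1. The 1-nearest-neighbour attack and the equal weighting across attributes are assumptions; the paper uses a weighted sum.

```python
# Illustrative sketch of a nearest-neighbour attribute inference attack.
# Assumes numeric/encoded DataFrames with shared columns.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import f1_score

def attribute_inference_risk(real, synth, known_cols, sensitive_cols):
    nn = NearestNeighbors(n_neighbors=1).fit(synth[known_cols])
    _, idx = nn.kneighbors(real[known_cols])
    guesses = synth.iloc[idx[:, 0]]
    scores = [f1_score(real[c].to_numpy(), guesses[c].to_numpy(), average="macro")
              for c in sensitive_cols]
    return float(np.mean(scores))  # lower is better for privacy
```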
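Membership inference can be sketched as a distance threshold attack: a targeted record is declared a training member if its Euclidean distance to the closest synthetic record is small, and the attack is scored with F1 over a mix of true members and non-members. The threshold choice is an assumption.

```python
# Illustrative sketch: distance-based membership inference.
# `targets` and `synth` are numeric arrays; `is_member` are 0/1 labels.
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import euclidean_distances

def membership_inference_risk(targets, is_member, synth, threshold):
    d = euclidean_distances(targets, synth).min(axis=1)  # distance to closest synthetic record
    predicted_member = (d <= threshold).astype(int)
    return f1_score(is_member, predicted_member)  # lower is better for privacy
```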
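Finally, a simplified proxy for the nearest neighbor adversarial accuracy risk, following the table's wording: compare how close the synthetic data sits to the training data versus a held-out evaluation set, using the mean nearest-neighbour Euclidean distance as the aggregation. The published NNAA metric is defined somewhat differently, so this should be read as a rough sketch only.

```python
# Illustrative sketch: overfitting proxy based on nearest-neighbour distances.
from sklearn.metrics.pairwise import euclidean_distances

def nn_adversarial_accuracy_proxy(synth, train, holdout):
    d_train = euclidean_distances(synth, train).min(axis=1).mean()
    d_holdout = euclidean_distances(synth, holdout).min(axis=1).mean()
    # Synthetic records sitting much closer to the training data than to
    # unseen data suggests the generator memorised its training set.
    return d_holdout - d_train  # lower is better
```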