Table 3. The characteristics of the benchmarking datasets

From: A Multifaceted benchmarking of synthetic electronic health record generation models

 

| | UW Dataset | VUMC Dataset |
| --- | --- | --- |
| Age | – | 26.0, 40.3, 55.8; 41.0 ± 18.7 |
| Race | | |
| White | 69.9% 131,830 | 65.2% 13,366 |
| Black | 7.9% 14,956 | 8.8% 1794 |
| Asian | 9.4% 17,646 | 1.9% 384 |
| American Indian or Alaska Native | 1.5% 2836 | 0.0% 42 |
| Pacific Islander | 0.8% 1563 | 0.0% 0 |
| Unknown | 10.5% 19,912 | 24.0% 4913 |
| Gender | | |
| Male | 45.3% 85,490 | 43.9% 8990 |
| Female | 54.7% 103,253 | 56.1% 11,509 |
| Medical features for generation | | |
| Binary features | | |
| # of unique codes | 2662 | 2581 |
| Diagnosis (Phecode) | 1736 | 1269 |
| Procedure (Category) | 66 | 67 |
| Medication (RxNorm Ingredient) | 860 | 1245 |
| # of unique codes per patient | 13.0, 30.0, 51.0; 36.8 ± 31.3 | 6.0, 21.0, 59.0; 45.3 ± 63.6 |
| Continuous features | | |
| Diastolic pressure | – | 68.0, 75.0, 82.0; 75.0 ± 10.7 |
| Systolic pressure | – | 114.0, 124.0, 136.0; 125.3 ± 15.9 |
| Pulse | – | 77.3, 90.0, 104.3; 91.4 ± 18.6 |
| Temperature | – | 36.8, 37.1, 37.7; 37.3 ± 0.6 |
| Pulse Oximetry | – | 95.1, 97.1, 99.0; 97.1 ± 2.1 |
| Respirations | – | 16.0, 18.0, 23.9; 19.6 ± 4.4 |
| Body Mass Index | – | 24.4, 30.3, 38.1; 31.3 ± 8.7 |
| Data split for prediction | | |
| Training data | | |
| Positive label | 3.8% 4966 | 3.8% 541 |
| Negative label | 96.2% 127,158 | 96.2% 13,808 |
| Evaluation data | | |
| Positive label | 3.8% 2129 | 4.2% 260 |
| Negative label | 96.2% 54,490 | 95.8% 5609 |

  1. x, y, z represent the first quartile, median, and third quartile. x ± y represents the mean and one standard deviation. x% y indicates that y patients account for x% of all patients.
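
The three notations described in the footnote can be reproduced with a short Python sketch. This is an illustration only, not the authors' code: the function names `summarize_continuous` and `summarize_category` are hypothetical, and the quartile method (inclusive interpolation) and the use of the sample standard deviation are assumptions, since the table does not state which conventions were used. The total of 188,743 in the usage example is not reported in the table; it is derived here by adding the two UW gender counts.

```python
import statistics


def summarize_continuous(values):
    """Return ('Q1, median, Q3', 'mean ± SD') strings for a list of numbers.

    Assumptions: quartiles use inclusive interpolation and SD is the
    sample standard deviation; the source table specifies neither.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return f"{q1:.1f}, {median:.1f}, {q3:.1f}", f"{mean:.1f} ± {sd:.1f}"


def summarize_category(count, total):
    """Return an 'x% y' string: y patients make up x% of all patients."""
    return f"{100 * count / total:.1f}% {count:,}"


# Usage: UW gender counts taken from the table; the total is their sum.
uw_male, uw_female = 85_490, 103_253
uw_total = uw_male + uw_female  # 188,743 (derived, not reported in the table)
print(summarize_category(uw_male, uw_total))    # 45.3% 85,490
print(summarize_category(uw_female, uw_total))  # 54.7% 103,253

# Toy data (not from the table) to show the continuous-feature format.
print(summarize_continuous([1, 2, 3, 4, 5]))    # ('2.0, 3.0, 4.0', '3.0 ± 1.6')
```

Note that the sketch always inserts thousands separators, whereas the table omits them for four-digit counts (for example, 1794).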