Table 2 Results of the best-performing models on the out-of-domain test datasets.

From: Large language models to identify social determinants of health in electronic health records

Any social determinant of health (SDoH)

| Dataset | Model | Macro-F1: Mean (95% CI)^a | Macro-F1: Delta F1^b | Macro-F1: P value | No SDoH (F1) | Employment (F1) | Housing (F1) | Parent (F1) | Relationship (F1) | Social support (F1) | Transportation (F1) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Immunotherapy | FlanXXL: Gold data only | 0.70 (0.63–0.76) | +0.01 | <0.01 | 0.99 | 0.83 | 0.55 | 0.69 | 0.93 | 0.46 | 0.46 |
| | FlanXXL: Gold + synthetic data | 0.71 (0.64–0.76) | | | 0.99 | 0.79 | 0.55 | 0.68 | 0.91 | 0.63 | 0.40 |
| MIMIC-III | FlanXXL: Gold data only | 0.57 (0.49–0.63) | −0.02 | <0.01 | 0.98 | 0.65 | 0.00 | 0.63 | 0.91 | 0.32 | 0.50 |
| | FlanXXL: Gold + synthetic data | 0.55 (0.49–0.61) | | | 0.98 | 0.69 | 0.24 | 0.44 | 0.91 | 0.33 | 0.24 |

Adverse social determinants of health (SDoH)

| Dataset | Model | Macro-F1: Mean (95% CI)^a | Macro-F1: Delta F1^b | Macro-F1: P value | No SDoH (F1) | Employment (F1) | Housing (F1) | Parent (F1) | Relationship (F1) | Social support (F1) | Transportation (F1) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Immunotherapy | FlanXL: Gold data only | 0.63 (0.54–0.72) | +0.03 | <0.01 | 1.00 | 0.56 | 0.46 | 0.68 | 0.81 | 0.50 | 0.46 |
| | FlanXL: Gold + synthetic data | 0.66 (0.58–0.72) | | | 1.00 | 0.60 | 0.63 | 0.60 | 0.81 | 0.59 | 0.40 |
| MIMIC-III | FlanXL: Gold data only | 0.53 (0.47–0.60) | −0.02 | <0.01 | 0.99 | 0.51 | 0.50 | 0.53 | 0.65 | 0.22 | 0.20 |
| | FlanXL: Gold + synthetic data | 0.51 (0.43–0.59) | | | 0.99 | 0.55 | 0.35 | 0.54 | 0.68 | 0.43 | 0.20 |

a. The 95% CI for Macro-F1 is calculated by bootstrapping with replacement 3400 times (to achieve a bootstrap SE < 0.01). The SE of the 95% confidence interval limits is 0.0074, ascertained by performing bootstrapping 3400 times on three distinct samples.

b. Delta F1 score is the change in Macro-F1 when synthetic data are added to the fine-tuning data.

Bolded text indicates the best performance with and without synthetic data augmentation. P values are computed with the Mann–Whitney U test. CI, confidence interval; SE, standard error.
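
The bootstrap procedure described in the table note can be illustrated with a minimal sketch. It is not the authors' code. It assumes a single-label setting (one class per sentence, whereas the original task may be multi-label), a percentile interval, and hypothetical names (`macro_f1`, `bootstrap_macro_f1`). Only the number of resamples (3400) and sampling with replacement come from the note.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1. A class with no true or
    predicted instances in the sample scores 0 (an assumption)."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_macro_f1(y_true, y_pred, labels, n_boot=3400, alpha=0.05, seed=0):
    """Resample (true, predicted) pairs with replacement n_boot times and
    return (mean, lower, upper) using a percentile interval (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], labels)
        )
    stats.sort()
    # Percentile positions, clamped so small n_boot values stay in range.
    lo_i = max(0, min(n_boot - 1, round(alpha / 2 * n_boot)))
    hi_i = max(0, min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1))
    return sum(stats) / n_boot, stats[lo_i], stats[hi_i]


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    y_true = ["none", "housing", "none", "parent", "housing", "none"]
    y_pred = ["none", "housing", "parent", "parent", "housing", "none"]
    labels = ["none", "housing", "parent"]
    mean, lower, upper = bootstrap_macro_f1(y_true, y_pred, labels)
    print(f"Macro-F1 {mean:.2f} ({lower:.2f}-{upper:.2f})")
```

Under the same assumptions, Delta F1 would be the Macro-F1 of the model fine-tuned on gold plus synthetic data minus that of the model fine-tuned on gold data only, each computed on the same test set.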