Table 2 Results of the best-performing models on the out-of-domain test datasets.

Any social determinant of health (SDoH)
Dataset and model	Macro-F1, mean (95% CI)	Delta F1	P value	No SDoH (F1)	Employment (F1)	Housing (F1)	Parent (F1)	Relationship (F1)	Social support (F1)	Transportation (F1)
Immunotherapy
FlanXXL: Gold data only	0.70 (0.63–0.76)	+0.01	<0.01	0.99	0.83	0.55	0.69	0.93	0.46	0.46
FlanXXL: Gold + synthetic data	0.71 (0.64–0.76)	+0.01	<0.01	0.99	0.79	0.55	0.68	0.91	0.63	0.40
MIMIC-III
FlanXXL: Gold data only	0.57 (0.49–0.63)	−0.02	<0.01	0.98	0.65	0.00	0.63	0.91	0.32	0.50
FlanXXL: Gold + synthetic data	0.55 (0.49–0.61)	−0.02	<0.01	0.98	0.69	0.24	0.44	0.91	0.33	0.24

Adverse social determinants of health (SDoH)
Dataset and model	Macro-F1, mean (95% CI)^a	Delta F1^b	P value	No SDoH (F1)	Employment (F1)	Housing (F1)	Parent (F1)	Relationship (F1)	Social support (F1)	Transportation (F1)
Immunotherapy
FlanXL: Gold data only	0.63 (0.54–0.72)	+0.03	<0.01	1.00	0.56	0.46	0.68	0.81	0.50	0.46
FlanXL: Gold + synthetic data	0.66 (0.58–0.72)	+0.03	<0.01	1.00	0.60	0.63	0.60	0.81	0.59	0.40
MIMIC-III
FlanXL: Gold data only	0.53 (0.47–0.60)	−0.02	<0.01	0.99	0.51	0.50	0.53	0.65	0.22	0.20
FlanXL: Gold + synthetic data	0.51 (0.43–0.59)	−0.02	<0.01	0.99	0.55	0.35	0.54	0.68	0.43	0.20

^a The 95% CI for Macro-F1 is calculated by bootstrapping with replacement 3400 times (chosen to achieve a bootstrap SE < 0.01). The SE of the 95% CI limits is 0.0074, ascertained by performing the 3400-resample bootstrap on three distinct samples. ^b Delta F1 is the change in Macro-F1 when synthetic data are added to the fine-tuning data. Bolded text indicates the best performance with and without synthetic data augmentation. P values are computed with the Mann–Whitney U test. CI, confidence interval; SE, standard error.
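The bootstrap procedure described in the footnote can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it treats the task as single-label classification, uses the percentile method for the interval (the footnote does not say which bootstrap interval was used), and scores a label as F1 = 0 when it is absent from a resample. The function names `macro_f1` and `bootstrap_macro_f1_ci` are hypothetical.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumption: a label with no true or predicted instances scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_macro_f1_ci(y_true, y_pred, labels,
                          n_resamples=3400, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Macro-F1 (assumed interval type)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx],
                              [y_pred[i] for i in idx],
                              labels))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Calling `bootstrap_macro_f1_ci(y_true, y_pred, labels)` with the default `n_resamples=3400` mirrors the resample count stated in the footnote. Fixing `seed` makes the interval reproducible across runs.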
