Table 2 Statistical Analysis of Model Performance Improvements with Bootstrap Paired Testing

From: Enhancing privacy-preserving deployable large language models for perioperative complication detection: a targeted strategy with LoRA fine-tuning

Center	Model	Comparison	Metric	Mean Diff	95% CI	p-value	p-corrected
Center 1 (n = 46)	4B	Comprehensive vs Targeted	F1	0.273	[0.151, 0.385]	<0.001	0.001
			Precision	0.229	[0.119, 0.330]	<0.001	0.001
			Recall	0.321	[0.147, 0.480]	<0.001	0.001
		Targeted vs Targeted + SFT	F1	0.116	[−0.022, 0.255]	0.0535	0.107
			Precision	0.277	[0.137, 0.414]	<0.001	0.001
			Recall	−0.104	[−0.263, 0.060]	0.9055	1.000
	8B	Comprehensive vs Targeted	F1	0.229	[0.072, 0.420]	0.0015	0.003
			Precision	0.194	[0.040, 0.395]	0.002	0.004
			Recall	0.283	[0.103, 0.468]	0.0015	0.003
		Targeted vs Targeted + SFT	F1	0.117	[−0.066, 0.278]	0.110	0.220
			Precision	0.221	[0.022, 0.398]	0.0155	0.031
			Recall	−0.017	[−0.186, 0.150]	0.602	1.000
	14B	Comprehensive vs Targeted	F1	0.188	[0.054, 0.321]	0.0025	0.005
			Precision	0.150	[0.006, 0.294]	0.0215	0.043
			Recall	0.231	[0.077, 0.386]	0.0015	0.003
		Targeted vs Targeted + SFT	F1	0.066	[−0.073, 0.197]	0.1685	0.337
			Precision	0.145	[0.007, 0.276]	0.0225	0.045
			Recall	−0.017	[−0.174, 0.139]	0.618	1.000
	32B	Comprehensive vs Targeted	F1	0.153	[0.017, 0.293]	0.0165	0.033
			Precision	0.110	[−0.011, 0.235]	0.043	0.086
			Recall	0.228	[0.027, 0.410]	0.0185	0.037
		Targeted vs Targeted + SFT	F1	0.149	[0.021, 0.286]	0.0105	0.021
			Precision	0.280	[0.156, 0.410]	<0.001	0.001
			Recall	−0.027	[−0.179, 0.153]	0.6765	1.000
Center 2 (n = 102)	4B	Comprehensive vs Targeted	F1	0.256	[0.181, 0.336]	<0.001	0.001
			Precision	0.216	[0.142, 0.290]	<0.001	0.001
			Recall	0.311	[0.201, 0.422]	<0.001	0.001
		Targeted vs Targeted + SFT	F1	0.103	[0.023, 0.186]	0.0055	0.011
			Precision	0.224	[0.133, 0.313]	<0.001	0.001
			Recall	−0.046	[−0.142, 0.054]	0.8375	1.000
	8B	Comprehensive vs Targeted	F1	0.192	[0.094, 0.296]	<0.001	0.001
			Precision	0.173	[0.053, 0.292]	0.001	0.002
			Recall	0.212	[0.114, 0.315]	<0.001	0.001
		Targeted vs Targeted + SFT	F1	0.073	[0.002, 0.144]	0.023	0.046
			Precision	0.136	[0.061, 0.210]	<0.001	0.001
			Recall	−0.008	[−0.097, 0.085]	0.605	1.000
	14B	Comprehensive vs Targeted	F1	0.267	[0.163, 0.366]	<0.001	0.001
			Precision	0.208	[0.088, 0.317]	<0.001	0.001
			Recall	0.338	[0.229, 0.451]	<0.001	0.001
		Targeted vs Targeted + SFT	F1	0.001	[−0.071, 0.086]	0.5155	1.000
			Precision	0.067	[−0.008, 0.145]	0.040	0.080
			Recall	−0.078	[−0.169, 0.033]	0.920	1.000
	32B	Comprehensive vs Targeted	F1	0.207	[0.107, 0.313]	<0.001	0.001
			Precision	0.137	[0.032, 0.242]	0.002	0.004
			Recall	0.316	[0.195, 0.439]	<0.001	0.001
		Targeted vs Targeted + SFT	F1	0.079	[−0.007, 0.172]	0.042	0.084
			Precision	0.188	[0.104, 0.277]	<0.001	0.001
			Recall	−0.076	[−0.176, 0.037]	0.914	1.000

Statistical Methods: Patient-level paired bootstrap testing with 2000 iterations. All tests are one-sided (alternative: greater than), with Bonferroni correction for multiple comparisons.
Note: Positive values indicate improvement in the second strategy compared to the first.
Abbreviations: SFT, Supervised Fine-Tuning (LoRA); CI, Confidence Interval; Mean Diff, Mean Difference.
Bold values indicate statistical significance at p < 0.05 after Bonferroni correction.
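
The procedure described in the Statistical Methods note can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes each patient contributes one score per strategy and that the reported metric is the mean of those per-patient scores, and it uses the percentile method for the 95% CI. The function name paired_bootstrap and its parameters are hypothetical. The default n_comparisons=2 is an inference from the table, where the reported p-corrected values are twice the raw p-values and capped at 1.000.

```python
import random


def paired_bootstrap(first, second, n_iter=2000, n_comparisons=2, seed=0):
    """Patient-level paired bootstrap of the mean difference (second - first).

    first, second: per-patient scores under the two strategies, in the same
    patient order (assumption: the metric is a mean over patients).
    """
    if len(first) != len(second):
        raise ValueError("paired inputs must have the same length")
    n = len(first)
    rng = random.Random(seed)

    # Pairing is preserved by resampling the per-patient differences.
    diffs = [b - a for a, b in zip(first, second)]

    boot = []
    for _ in range(n_iter):
        # Draw n patients with replacement and average their differences.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot.append(sum(sample) / n)
    boot.sort()

    # Percentile 95% confidence interval from the sorted replicates.
    lower = boot[int(0.025 * n_iter)]
    upper = boot[min(n_iter - 1, int(0.975 * n_iter))]

    # One-sided test (alternative: second > first): share of replicates
    # in which the difference is not positive.
    p_value = sum(1 for d in boot if d <= 0) / n_iter

    # Bonferroni: multiply by the number of comparisons, cap at 1.0.
    p_corrected = min(1.0, p_value * n_comparisons)

    return {
        "mean_diff": sum(diffs) / n,
        "ci": (lower, upper),
        "p": p_value,
        "p_corrected": p_corrected,
    }


if __name__ == "__main__":
    # Toy, made-up per-patient scores; not data from the study.
    targeted = [0.4, 0.5, 0.6, 0.3, 0.7, 0.5]
    targeted_sft = [0.6, 0.5, 0.8, 0.4, 0.7, 0.6]
    print(paired_bootstrap(targeted, targeted_sft))
```

Under this reading, a negative mean difference with most replicates at or below zero produces a raw p-value near 1, which matches the pattern of the Recall rows for Targeted vs Targeted + SFT. If F1, precision, or recall are instead pooled across patients, the metric would need to be recomputed on each resampled patient set rather than averaged.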
