Table 2 Statistical Analysis of Model Performance Improvements with Bootstrap Paired Testing

From: Enhancing privacy-preserving deployable large language models for perioperative complication detection: a targeted strategy with LoRA fine-tuning

| Center | Model | Comparison | Metric | Mean Diff | 95% CI | p-value | p-corrected |
|---|---|---|---|---|---|---|---|
| Center 1 (n = 46) | 4B | Comprehensive vs Targeted | F1 | 0.273 | [0.151, 0.385] | <0.001 | **0.001** |
| | | | Precision | 0.229 | [0.119, 0.330] | <0.001 | **0.001** |
| | | | Recall | 0.321 | [0.147, 0.480] | <0.001 | **0.001** |
| | | Targeted vs Targeted + SFT | F1 | 0.116 | [−0.022, 0.255] | 0.0535 | 0.107 |
| | | | Precision | 0.277 | [0.137, 0.414] | <0.001 | **0.001** |
| | | | Recall | −0.104 | [−0.263, 0.060] | 0.9055 | 1.000 |
| | 8B | Comprehensive vs Targeted | F1 | 0.229 | [0.072, 0.420] | 0.0015 | **0.003** |
| | | | Precision | 0.194 | [0.040, 0.395] | 0.002 | **0.004** |
| | | | Recall | 0.283 | [0.103, 0.468] | 0.0015 | **0.003** |
| | | Targeted vs Targeted + SFT | F1 | 0.117 | [−0.066, 0.278] | 0.110 | 0.220 |
| | | | Precision | 0.221 | [0.022, 0.398] | 0.0155 | **0.031** |
| | | | Recall | −0.017 | [−0.186, 0.150] | 0.602 | 1.000 |
| | 14B | Comprehensive vs Targeted | F1 | 0.188 | [0.054, 0.321] | 0.0025 | **0.005** |
| | | | Precision | 0.150 | [0.006, 0.294] | 0.0215 | **0.043** |
| | | | Recall | 0.231 | [0.077, 0.386] | 0.0015 | **0.003** |
| | | Targeted vs Targeted + SFT | F1 | 0.066 | [−0.073, 0.197] | 0.1685 | 0.337 |
| | | | Precision | 0.145 | [0.007, 0.276] | 0.0225 | **0.045** |
| | | | Recall | −0.017 | [−0.174, 0.139] | 0.618 | 1.000 |
| | 32B | Comprehensive vs Targeted | F1 | 0.153 | [0.017, 0.293] | 0.0165 | **0.033** |
| | | | Precision | 0.110 | [−0.011, 0.235] | 0.043 | 0.086 |
| | | | Recall | 0.228 | [0.027, 0.410] | 0.0185 | **0.037** |
| | | Targeted vs Targeted + SFT | F1 | 0.149 | [0.021, 0.286] | 0.0105 | **0.021** |
| | | | Precision | 0.280 | [0.156, 0.410] | <0.001 | **0.001** |
| | | | Recall | −0.027 | [−0.179, 0.153] | 0.6765 | 1.000 |
| Center 2 (n = 102) | 4B | Comprehensive vs Targeted | F1 | 0.256 | [0.181, 0.336] | <0.001 | **0.001** |
| | | | Precision | 0.216 | [0.142, 0.290] | <0.001 | **0.001** |
| | | | Recall | 0.311 | [0.201, 0.422] | <0.001 | **0.001** |
| | | Targeted vs Targeted + SFT | F1 | 0.103 | [0.023, 0.186] | 0.0055 | **0.011** |
| | | | Precision | 0.224 | [0.133, 0.313] | <0.001 | **0.001** |
| | | | Recall | −0.046 | [−0.142, 0.054] | 0.8375 | 1.000 |
| | 8B | Comprehensive vs Targeted | F1 | 0.192 | [0.094, 0.296] | <0.001 | **0.001** |
| | | | Precision | 0.173 | [0.053, 0.292] | 0.001 | **0.002** |
| | | | Recall | 0.212 | [0.114, 0.315] | <0.001 | **0.001** |
| | | Targeted vs Targeted + SFT | F1 | 0.073 | [0.002, 0.144] | 0.023 | **0.046** |
| | | | Precision | 0.136 | [0.061, 0.210] | <0.001 | **0.001** |
| | | | Recall | −0.008 | [−0.097, 0.085] | 0.605 | 1.000 |
| | 14B | Comprehensive vs Targeted | F1 | 0.267 | [0.163, 0.366] | <0.001 | **0.001** |
| | | | Precision | 0.208 | [0.088, 0.317] | <0.001 | **0.001** |
| | | | Recall | 0.338 | [0.229, 0.451] | <0.001 | **0.001** |
| | | Targeted vs Targeted + SFT | F1 | 0.001 | [−0.071, 0.086] | 0.5155 | 1.000 |
| | | | Precision | 0.067 | [−0.008, 0.145] | 0.040 | 0.080 |
| | | | Recall | −0.078 | [−0.169, 0.033] | 0.920 | 1.000 |
| | 32B | Comprehensive vs Targeted | F1 | 0.207 | [0.107, 0.313] | <0.001 | **0.001** |
| | | | Precision | 0.137 | [0.032, 0.242] | 0.002 | **0.004** |
| | | | Recall | 0.316 | [0.195, 0.439] | <0.001 | **0.001** |
| | | Targeted vs Targeted + SFT | F1 | 0.079 | [−0.007, 0.172] | 0.042 | 0.084 |
| | | | Precision | 0.188 | [0.104, 0.277] | <0.001 | **0.001** |
| | | | Recall | −0.076 | [−0.176, 0.037] | 0.914 | 1.000 |

  1. Statistical Methods: Patient-level bootstrap paired testing with 2000 iterations. All tests are one-sided (greater than) with Bonferroni correction for multiple comparisons; an illustrative code sketch of this procedure follows the notes.
  2. Note: Positive values indicate improvement of the second strategy relative to the first.
  3. SFT: Supervised Fine-Tuning (LoRA); CI: Confidence Interval; Mean Diff: Mean Difference.
  4. Bold values indicate statistical significance at p < 0.05 after Bonferroni correction.
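
The sketch below illustrates the kind of patient-level bootstrap paired test described in note 1: per-patient metric differences are resampled with replacement over 2000 iterations, the one-sided (greater-than) p-value is taken as the fraction of bootstrap means at or below zero, the 95% CI is the percentile interval of the bootstrap means, and Bonferroni correction multiplies each p-value by the number of tests in its family. The function names (`bootstrap_paired_test`, `bonferroni`), the percentile CI, and the exact p-value convention are illustrative assumptions, not the authors' published code.

```python
import numpy as np

def bootstrap_paired_test(scores_a, scores_b, n_boot=2000, seed=0):
    """Patient-level bootstrap paired test, one-sided (greater than).

    scores_a, scores_b: per-patient metric values (e.g., F1) for the two
    strategies being compared, aligned by patient. Returns the observed
    mean difference (b - a), a 95% percentile CI of the bootstrap means,
    and a one-sided p-value. CI and p-value conventions are assumptions
    made for illustration.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_b, float) - np.asarray(scores_a, float)
    n = diffs.size

    # Resample patients with replacement and record the mean difference.
    boot_means = np.array([
        diffs[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)
    ])

    mean_diff = diffs.mean()
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
    # One-sided (greater-than) p-value: share of bootstrap means <= 0.
    p_value = float(np.mean(boot_means <= 0.0))
    return mean_diff, (ci_low, ci_high), p_value

def bonferroni(p_values):
    """Bonferroni correction: multiply each p by the family size, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

if __name__ == "__main__":
    # Toy example with synthetic per-patient F1 scores (not study data).
    rng = np.random.default_rng(42)
    f1_comprehensive = rng.uniform(0.3, 0.7, size=46)
    f1_targeted = np.clip(f1_comprehensive + rng.normal(0.2, 0.15, size=46), 0, 1)

    diff, ci, p = bootstrap_paired_test(f1_comprehensive, f1_targeted)
    print(f"Mean diff = {diff:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}], p = {p:.4f}")
    print("Bonferroni-corrected:", bonferroni([p, 0.0535]))
```

Capping corrected p-values at 1 in the `bonferroni` helper mirrors the 1.000 entries in the p-corrected column of the table.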