Table 2 Statistical Analysis of Model Performance Improvements with Bootstrap Paired Testing
| Center | Model | Comparison | Metric | Mean Diff | 95% CI | p-value | p-corrected |
|---|---|---|---|---|---|---|---|
| Center 1 (n = 46) | 4B | Comprehensive vs Targeted | F1 | 0.273 | [0.151, 0.385] | <0.001 | 0.001 |
| | | | Precision | 0.229 | [0.119, 0.330] | <0.001 | 0.001 |
| | | | Recall | 0.321 | [0.147, 0.480] | <0.001 | 0.001 |
| | | Targeted vs Targeted + SFT | F1 | 0.116 | [−0.022, 0.255] | 0.0535 | 0.107 |
| | | | Precision | 0.277 | [0.137, 0.414] | <0.001 | 0.001 |
| | | | Recall | −0.104 | [−0.263, 0.060] | 0.9055 | 1.000 |
| | 8B | Comprehensive vs Targeted | F1 | 0.229 | [0.072, 0.420] | 0.0015 | 0.003 |
| | | | Precision | 0.194 | [0.040, 0.395] | 0.002 | 0.004 |
| | | | Recall | 0.283 | [0.103, 0.468] | 0.0015 | 0.003 |
| | | Targeted vs Targeted + SFT | F1 | 0.117 | [−0.066, 0.278] | 0.110 | 0.220 |
| | | | Precision | 0.221 | [0.022, 0.398] | 0.0155 | 0.031 |
| | | | Recall | −0.017 | [−0.186, 0.150] | 0.602 | 1.000 |
| | 14B | Comprehensive vs Targeted | F1 | 0.188 | [0.054, 0.321] | 0.0025 | 0.005 |
| | | | Precision | 0.150 | [0.006, 0.294] | 0.0215 | 0.043 |
| | | | Recall | 0.231 | [0.077, 0.386] | 0.0015 | 0.003 |
| | | Targeted vs Targeted + SFT | F1 | 0.066 | [−0.073, 0.197] | 0.1685 | 0.337 |
| | | | Precision | 0.145 | [0.007, 0.276] | 0.0225 | 0.045 |
| | | | Recall | −0.017 | [−0.174, 0.139] | 0.618 | 1.000 |
| | 32B | Comprehensive vs Targeted | F1 | 0.153 | [0.017, 0.293] | 0.0165 | 0.033 |
| | | | Precision | 0.110 | [−0.011, 0.235] | 0.043 | 0.086 |
| | | | Recall | 0.228 | [0.027, 0.410] | 0.0185 | 0.037 |
| | | Targeted vs Targeted + SFT | F1 | 0.149 | [0.021, 0.286] | 0.0105 | 0.021 |
| | | | Precision | 0.280 | [0.156, 0.410] | <0.001 | 0.001 |
| | | | Recall | −0.027 | [−0.179, 0.153] | 0.6765 | 1.000 |
| Center 2 (n = 102) | 4B | Comprehensive vs Targeted | F1 | 0.256 | [0.181, 0.336] | <0.001 | 0.001 |
| | | | Precision | 0.216 | [0.142, 0.290] | <0.001 | 0.001 |
| | | | Recall | 0.311 | [0.201, 0.422] | <0.001 | 0.001 |
| | | Targeted vs Targeted + SFT | F1 | 0.103 | [0.023, 0.186] | 0.0055 | 0.011 |
| | | | Precision | 0.224 | [0.133, 0.313] | <0.001 | 0.001 |
| | | | Recall | −0.046 | [−0.142, 0.054] | 0.8375 | 1.000 |
| | 8B | Comprehensive vs Targeted | F1 | 0.192 | [0.094, 0.296] | <0.001 | 0.001 |
| | | | Precision | 0.173 | [0.053, 0.292] | 0.001 | 0.002 |
| | | | Recall | 0.212 | [0.114, 0.315] | <0.001 | 0.001 |
| | | Targeted vs Targeted + SFT | F1 | 0.073 | [0.002, 0.144] | 0.023 | 0.046 |
| | | | Precision | 0.136 | [0.061, 0.210] | <0.001 | 0.001 |
| | | | Recall | −0.008 | [−0.097, 0.085] | 0.605 | 1.000 |
| | 14B | Comprehensive vs Targeted | F1 | 0.267 | [0.163, 0.366] | <0.001 | 0.001 |
| | | | Precision | 0.208 | [0.088, 0.317] | <0.001 | 0.001 |
| | | | Recall | 0.338 | [0.229, 0.451] | <0.001 | 0.001 |
| | | Targeted vs Targeted + SFT | F1 | 0.001 | [−0.071, 0.086] | 0.5155 | 1.000 |
| | | | Precision | 0.067 | [−0.008, 0.145] | 0.040 | 0.080 |
| | | | Recall | −0.078 | [−0.169, 0.033] | 0.920 | 1.000 |
| | 32B | Comprehensive vs Targeted | F1 | 0.207 | [0.107, 0.313] | <0.001 | 0.001 |
| | | | Precision | 0.137 | [0.032, 0.242] | 0.002 | 0.004 |
| | | | Recall | 0.316 | [0.195, 0.439] | <0.001 | 0.001 |
| | | Targeted vs Targeted + SFT | F1 | 0.079 | [−0.007, 0.172] | 0.042 | 0.084 |
| | | | Precision | 0.188 | [0.104, 0.277] | <0.001 | 0.001 |
| | | | Recall | −0.076 | [−0.176, 0.037] | 0.914 | 1.000 |
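
For readers who want to reproduce this kind of analysis, the following is a minimal sketch of a paired bootstrap test on per-case scores. It is not the authors' code, and every choice in it is an assumption: the function name `paired_bootstrap`, the percentile interval, the one-sided p-value (the fraction of resampled mean differences at or below zero), the floor of `1 / n_boot` on that p-value, and the Bonferroni factor `n_comparisons=2`. The defaults were picked only because they appear consistent with the table, where raw p-values are multiples of 0.0005 and each corrected p-value equals twice the raw value, capped at 1.000.

```python
import numpy as np


def paired_bootstrap(scores_a, scores_b, n_boot=2000, n_comparisons=2, seed=0):
    """Paired bootstrap for the mean difference (b - a) on the same cases.

    All defaults are assumptions chosen for illustration, not values
    confirmed by the source.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape or a.ndim != 1:
        raise ValueError("paired samples must be 1-D arrays of equal length")

    diffs = b - a
    rng = np.random.default_rng(seed)

    # Resample case indices with replacement; pairing is preserved because
    # each resample draws whole per-case differences.
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)

    # Percentile 95% confidence interval for the mean difference.
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

    # One-sided p-value: share of resamples showing no improvement,
    # floored at 1 / n_boot so it is never reported as exactly zero.
    p_value = max(float(np.mean(boot_means <= 0.0)), 1.0 / n_boot)

    # Bonferroni correction, capped at 1.
    p_corrected = min(1.0, p_value * n_comparisons)

    return {
        "mean_diff": float(diffs.mean()),
        "ci": (float(ci_low), float(ci_high)),
        "p_value": p_value,
        "p_corrected": p_corrected,
    }


# Usage with synthetic per-case F1 scores (made-up numbers, not study data).
baseline = [0.40, 0.55, 0.30, 0.62, 0.48, 0.51]
improved = [0.58, 0.60, 0.45, 0.70, 0.47, 0.66]
print(paired_bootstrap(baseline, improved))
```

Under these assumptions, a negative mean difference yields a p-value close to 1, which matches the pattern in the Recall rows for Targeted vs Targeted + SFT.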