Table 19 Performance analysis under different class distribution strategies.
| Training strategy | Rare pathology sensitivity (95% CI) | Common pathology sensitivity (95% CI) | Overall accuracy (95% CI) | Precision (95% CI) | Recall (95% CI) | F1-score (95% CI) | AUC (95% CI) | Clinical utility index (mean ± SD) | Cohen’s κ | p-value |
|---|---|---|---|---|---|---|---|---|---|---|
| Artificially balanced (original) | 94.3% (92.8–95.8) | 97.8% (97.2–98.4) | 97.3% (96.8–97.8) | 97.1% (96.5–97.7) | 96.5% (95.9–97.1) | 0.968 (0.963–0.973) | 0.993 (0.989–0.997) | 9.1 ± 0.3/10 | 0.946 | - |
| Natural prevalence weighted | 79.2% (76.8–81.6) | 98.7% (98.3–99.1) | 91.4% (90.7–92.1) | 95.8% (95.2–96.4) | 91.4% (90.7–92.1) | 0.935 (0.928–0.942) | 0.971 (0.966–0.976) | 8.3 ± 0.4/10 | 0.828 | < 0.001 |
| Hybrid balanced-weighted | 86.7% (84.9–88.5) | 98.3% (97.8–98.8) | 94.9% (94.3–95.5) | 96.4% (95.9–96.9) | 94.9% (94.3–95.5) | 0.956 (0.951–0.961) | 0.984 (0.980–0.988) | 8.8 ± 0.2/10 | 0.898 | < 0.001 |
| Cost-sensitive learning | 83.4% (81.4–85.4) | 98.1% (97.6–98.6) | 93.2% (92.5–93.9) | 95.9% (95.3–96.5) | 93.2% (92.5–93.9) | 0.945 (0.939–0.951) | 0.978 (0.973–0.983) | 8.6 ± 0.3/10 | 0.864 | < 0.001 |
| Focal loss optimization | 85.9% (84.0–87.8) | 97.9% (97.4–98.4) | 94.1% (93.4–94.8) | 96.2% (95.6–96.8) | 94.1% (93.4–94.8) | 0.951 (0.945–0.957) | 0.981 (0.976–0.986) | 8.7 ± 0.3/10 | 0.882 | < 0.001 |
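
The table does not state how the F1-score column was derived. As a quick consistency check, the sketch below assumes the standard definition (the harmonic mean of precision and recall) and recomputes F1 from the precision and recall point estimates in Table 19. The names `f1_score`, `TABLE_19`, and `check_f1` are hypothetical and used only for illustration; the confidence intervals are not reproduced, since they would require the underlying per-sample predictions.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) point estimates copied from Table 19.
TABLE_19 = {
    "Artificially balanced (original)": (0.971, 0.965, 0.968),
    "Natural prevalence weighted": (0.958, 0.914, 0.935),
    "Hybrid balanced-weighted": (0.964, 0.949, 0.956),
    "Cost-sensitive learning": (0.959, 0.932, 0.945),
    "Focal loss optimization": (0.962, 0.941, 0.951),
}


def check_f1(rows=TABLE_19, digits=3):
    """Return {strategy: (recomputed F1, reported F1, whether they agree)}."""
    results = {}
    for strategy, (precision, recall, reported) in rows.items():
        recomputed = round(f1_score(precision, recall), digits)
        results[strategy] = (recomputed, reported, recomputed == reported)
    return results


if __name__ == "__main__":
    for strategy, (recomputed, reported, ok) in check_f1().items():
        status = "matches" if ok else "differs"
        print(f"{strategy}: recomputed {recomputed:.3f}, reported {reported:.3f} ({status})")
```

Under this assumption, the recomputed values agree with the reported F1 point estimates to three decimal places for all five strategies.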