Table 1 Phenotype classification performance for pulmonary hypertension

Metric	Count	KOMAP	XGBoost	Transformer (silver = gold)	Transformer (gold only)	WEST (w/o neg)	WEST (w/ neg)
AUC	0.85 (0.77–0.91)	0.86 (0.79–0.92)	0.82 (0.72–0.91)	0.84 (0.71–0.89)	0.88 (0.79–0.94)	0.91 (0.85–0.95)	0.93 (0.87–0.97)
F1 Score	0.79 (0.72–0.85)	0.84 (0.77–0.90)	0.85 (0.79–0.91)	0.82 (0.72–0.87)	0.85 (0.79–0.92)	0.86 (0.77–0.90)	0.88 (0.80–0.92)
PPV	0.65 (0.57–0.74)	0.88 (0.79–0.95)	0.87 (0.78–0.94)	0.84 (0.71–0.89)	0.89 (0.81–0.96)	0.91 (0.75–0.93)	0.95 (0.83–0.97)
Specificity	0 (0–0)	0.78 (0.63–0.90)	0.76 (0.59–0.89)	0.70 (0.49–0.79)	0.81 (0.68–0.93)	0.84 (0.58–0.86)	0.92 (0.71–0.95)

WEST trained with both positive and negative gold-standard labels, denoted WEST (w/ neg), achieved the highest AUC, F1 score, PPV, and specificity across all methods. The Transformer (silver = gold) baseline was trained by treating all silver-standard labels as gold-standard (i.e., no iterative updates or augmentation), while Transformer (gold only) used only expert-validated labels. Transformer metrics were averaged across two cross-validation folds, and all metrics are reported with 95% confidence intervals estimated by bootstrapping on patient-level predictions. Bold values denote the best performance per metric.
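The confidence intervals described above can be illustrated with a minimal sketch of a patient-level bootstrap for AUC. The source does not specify the bootstrap variant or the number of resamples, so the percentile method, `n_boot=1000`, the fixed seed, and the function names `auc` and `bootstrap_ci` are assumptions made for illustration only. The toy labels and scores are hypothetical and are not study data.

```python
import random

def auc(labels, scores):
    """AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
    pairs in which the positive scores higher; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, metric=auc, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (an assumed variant), resampling patients
    with replacement. Returns (point_estimate, lower, upper)."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to compute the metric")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain a single class
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return metric(labels, scores), lower, upper

# Hypothetical toy predictions, one entry per patient (not study data).
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.5, 0.2, 0.1, 0.6, 0.35]
point, lower, upper = bootstrap_ci(toy_labels, toy_scores)
print(f"AUC {point:.2f} ({lower:.2f}-{upper:.2f})")
```

Resampling whole patients, rather than individual records, keeps each patient's prediction and label together. The same `bootstrap_ci` call could wrap F1, PPV, or specificity by passing a different `metric` function that operates on thresholded predictions.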
