Fig. 1: Effect of gold-standard label count on model performance.

Curves show (a) AUC and (b) F1 score for PH, with 95% confidence intervals, as the number of gold-standard training labels increases. Metrics are averaged across two cross-validation folds. The horizontal black dashed line indicates the performance of the best-performing baseline model, Transformer (gold only).