Fig. 1: Performance evaluation on validation set.

The validation set is sampled from the original PHI-containing DFCI dataset and is used during training to tune hyperparameters and monitor model performance. The six panels correspond to the experimental setups: a Teacher with temporal context, b Teacher without temporal context, c Teacher-Public Student (MIMIC-IV), d Teacher-Public Student (Wiki-text), e Teacher-GPT-4 Student, and f GPT-4 Student only. AUROC is shown in purple, AUPRC in teal, and Best F1 in yellow. Red dashed lines mark the 36% prevalence of overall response, and blue dashed lines mark the 21% prevalence of progressive disease. AUROC Area Under the Receiver Operating Characteristic Curve, AUPRC Area Under the Precision-Recall Curve, MIMIC-IV Medical Information Mart for Intensive Care IV, GPT-4 Generative Pre-trained Transformer 4.