Fig. 1: Study design. | npj Digital Medicine

From: Fair positive unlabeled learning for predicting undiagnosed Alzheimer’s disease in diverse electronic health records

Phase 1) Patient- and record-level data preprocessing. Phase 2) Train and evaluate. After preprocessing, data for non-ATLAS patients were randomly split, stratified by labeled-positive status. The training set was used to train the SSPUL framework, while the validation set was used to determine the optimal GBE cutoff for each race and ethnicity. The trained model was then applied to the test set for race-stratified performance and fairness evaluation using proxy ICDs and medications. This process was repeated 1000 times. Performance and fairness metrics were averaged over the 1000 splits, and the trained model from each split was saved to predict unlabeled ATLAS patients for validation. Phase 3) Validate. Polygenic risk scores were obtained for ATLAS patients using LDPred2. Each trained model from Phase 2 was applied to the unlabeled patients, and the mean PRS and mean ε4 allele count were obtained for each final classification [i.e., predicted positive (PP) or predicted negative (PN)] (1000 mean PRS values for predicted positives and predicted negatives in total). The mean of the PRS means and the mean of the ε4 allele count means were then obtained by aggregating the PRS means and ε4 allele count means, respectively, followed by race-stratified validation. ATLAS = UCLA ATLAS Community Health Initiative, DDR = Data Discovery Repository, EA = East Asian, HL = Hispanic Latino, ICD = International Classification of Diseases, NH-AfAm = non-Hispanic African American, NH-white = non-Hispanic white, PN = predicted negatives, PP = predicted positives, PRS = polygenic risk scores, SSPUL = semi-supervised positive unlabeled learning, VAL = validation set.
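The two bookkeeping steps in the caption, the labeled-positive-stratified split and the mean-of-means aggregation, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names and the 60/20/20 split fractions are assumptions made for illustration only (the caption does not state the split proportions), and the SSPUL model itself is not shown.

```python
import random
from statistics import mean


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split indices into (train, validation, test), stratified by the
    labeled-positive flag. The fractions are an assumed example."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for flag in (True, False):
        # Shuffle each stratum separately so its proportion is preserved.
        idx = [i for i, y in enumerate(labels) if y == flag]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return train, val, test


def mean_of_means(values, predictions_per_model):
    """For each saved model, average `values` (e.g. PRS or e4 allele count)
    within predicted positives (PP) and predicted negatives (PN), then
    average those per-model means across all models."""
    pp_means, pn_means = [], []
    for preds in predictions_per_model:
        pp = [v for v, p in zip(values, preds) if p]
        pn = [v for v, p in zip(values, preds) if not p]
        # Skip a group if a model assigns no patients to it.
        if pp:
            pp_means.append(mean(pp))
        if pn:
            pn_means.append(mean(pn))
    return mean(pp_means), mean(pn_means)
```

In the study design described above, `predictions_per_model` would hold one list of PP/PN calls per saved model (1000 in total), and the aggregation would be repeated within each race and ethnicity group for the stratified validation.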