Fig. 2: 4-step SSPUL framework. | npj Digital Medicine

From: Fair positive unlabeled learning for predicting undiagnosed Alzheimer’s disease in diverse electronic health records

a SSPUL framework overview. Step 1) Identify reliable negatives: following feature selection, a Generalized Linear Model (GLM) was trained on labeled positive (LP) and unlabeled data. Reliable negatives (RNs) were the unlabeled patients whose probabilistic gap was smaller than the smallest observed probabilistic gap of the LPs. Step 2) Pre-processing racial bias mitigation: additional positive (AP) and additional negative (AN) labels were assigned using race-specific probabilistic gaps. Step 3) Train final classifier: an XGBoost classifier was trained on all labeled and pseudo-labeled patients. Step 4) Post-processing bias mitigation: classification cutoffs were determined by optimizing the group benefit equality (GBE) for each race and ethnicity.

b Pre- and post-processing bias mitigation details. Pre-processing bias mitigation: after training a distributed random forest (DRF) classifier using LPs and RNs, APs and ANs were assigned for a subset of the remaining unlabeled data such that the following race- and ethnicity-specific probabilistic criteria were met: 1) for each race and ethnicity, APs have race-specific probabilistic gaps greater than the smallest observed probabilistic gap of LPs, and ANs have race-specific probabilistic gaps smaller than the largest observed probabilistic gap of RNs; 2) the prevalence of positive labels for each race and ethnicity closely matches the corresponding population Alzheimer's disease (AD) prevalence. Post-processing bias mitigation: predicted probabilities for unlabeled patients in the validation set were obtained from the trained final XGBoost classifier. The classification cutoff for each race and ethnicity was determined by optimizing the GBE for that group, so that the prevalence of LPs and predicted positives matched that of labeled and proxy-validated positives. The cutoffs were then applied to the test set for classification.
AN = additional negative, AP = additional positive, DRF = distributed random forest, g = race or ethnicity variable, GBE = group benefit equality, LP = labeled positive, n = number of patients, RN = reliable negative, U = unlabeled, ΔP_LP = observed probabilistic gap of LPs, ΔP_RN = observed probabilistic gap of RNs, ΔP_U = observed probabilistic gap of unlabeled patients, π_g = population prevalence for race or ethnicity g.
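
The thresholding logic in Step 1 and Step 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on two assumptions that the caption does not state: the observed probabilistic gap is taken to be P(labeled | x) − P(unlabeled | x) = 2p − 1 for a classifier probability p, and GBE is approximated as the ratio of a group's predicted-positive rate to a target positive rate, with the cutoff chosen so that the ratio is as close to 1 as possible. The function names and the `target_rate` argument are hypothetical.

```python
import numpy as np


def observed_gap(p_labeled):
    """Assumed gap definition: P(labeled | x) - P(unlabeled | x) = 2p - 1."""
    return 2.0 * np.asarray(p_labeled, dtype=float) - 1.0


def reliable_negatives(p_lp, p_u):
    """Step 1 sketch: flag unlabeled patients whose gap is below the smallest LP gap.

    p_lp: classifier probabilities for labeled positives.
    p_u:  classifier probabilities for unlabeled patients.
    Returns a boolean mask over the unlabeled patients.
    """
    return observed_gap(p_u) < observed_gap(p_lp).min()


def group_cutoffs(scores, groups, target_rate):
    """Step 4 sketch: one cutoff per group.

    For each group g, every observed score is tried as a cutoff, and the one
    whose predicted-positive rate divided by target_rate[g] is closest to 1
    is kept (an assumed stand-in for optimizing GBE).
    """
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    cutoffs = {}
    for g in np.unique(groups).tolist():
        s = scores[groups == g]
        candidates = np.unique(s)  # sorted unique scores within the group
        rates = np.array([(s >= c).mean() for c in candidates])
        ratio = rates / target_rate[g]
        cutoffs[g] = float(candidates[np.argmin(np.abs(ratio - 1.0))])
    return cutoffs
```

In this sketch a patient is predicted positive when their score is at or above their group's cutoff; the cutoffs would be fitted on validation scores and then reused unchanged on the test set, mirroring the last sentence of panel b.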
