Fig. 9: Overview of dataset characteristics and preprocessing pipeline. | npj Digital Medicine

From: Clinically informed semi-supervised learning improves disease annotation and equity from electronic health records: a glaucoma case study

The left panel shows patient demographics, note types, and clinical note distributions. The right panel illustrates how processed notes are de-identified and then augmented by token shuffling, synonym substitution, abbreviation expansion, and full spelling. Augmentation rates were stratified by race, gender, and age group to balance representation across all demographic subgroups in the final training data.
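
The idea of subgroup-stratified augmentation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the caption does not give the actual rates or algorithms, so the sketch assumes that each subgroup is augmented until it matches the size of the largest one, and it implements only a windowed token shuffle. The function names (`shuffle_tokens`, `augmentation_counts`, `balance_by_augmentation`) and the `window` parameter are hypothetical.

```python
import random
from collections import Counter


def shuffle_tokens(text, rng, window=3):
    """Shuffle whitespace-separated tokens within small windows.

    Assumption: a local shuffle stands in for the caption's "token shuffling".
    """
    tokens = text.split()
    out = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        rng.shuffle(chunk)  # in-place shuffle of this window only
        out.extend(chunk)
    return " ".join(out)


def augmentation_counts(groups):
    """Return how many synthetic notes each subgroup needs to reach the
    size of the largest subgroup (assumed balancing rule)."""
    sizes = Counter(groups)
    target = max(sizes.values())
    return {g: target - n for g, n in sizes.items()}


def balance_by_augmentation(notes, groups, seed=0):
    """Append augmented copies so that every subgroup has equal size."""
    rng = random.Random(seed)
    by_group = {}
    for note, g in zip(notes, groups):
        by_group.setdefault(g, []).append(note)

    out_notes, out_groups = list(notes), list(groups)
    for g, extra in augmentation_counts(groups).items():
        for _ in range(extra):
            source = rng.choice(by_group[g])  # pick a real note to perturb
            out_notes.append(shuffle_tokens(source, rng))
            out_groups.append(g)
    return out_notes, out_groups
```

In practice a subgroup key would combine race, gender, and age band, for example a tuple such as `("Black", "F", "60-69")`, and the other augmentations named in the caption (synonym substitution, abbreviation expansion) would need clinical lexicons that are not described here.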
