Fig. 4: Overview of the most important biomarker features learnt by MILTON per ICD10 code for time-agnostic models.

a, Number of top seven biomarkers shared between each pair of ancestries for all 149 ICD10 codes with AUC > 0.6. Two-sided Mann-Whitney U (MWU) test P values are shown; no multiple testing correction was performed. Box plots show the median as the center line, the 25th percentile as the lower box limit and the 75th percentile as the upper box limit; whiskers extend to the 25th percentile − 1.5 × interquartile range at the bottom and the 75th percentile + 1.5 × interquartile range at the top; points denote outliers. b, Features with the highest FISs for E10 (type 1 diabetes mellitus), N18 (chronic renal failure) and I50.0 (congestive heart failure) for each ancestry. §, biomarkers that were also listed by an expert for the given disease area (ref. 22). LDL, low-density lipoprotein; FEV1, forced expiratory volume in 1 s. c, Top predictive features for C61 and G12 when using UKB proteomics data to train MILTON (time-agnostic model). Dashed orange bars indicate the average FIS of the corresponding feature across all ICD10 codes for the time-agnostic model. Bar plots on the right compare the AUC of models trained on proteomics data together with the 67 traits against models trained on the 67 traits only. d, Number of ICD10 codes that do not share their top N features with any other code, as a function of N, indicating a quasi-unique biomarker signature per disease comprising N ≥ 7 features when models are trained on the 67 features only and N ≥ 5 features when models are trained on proteomics data only. e, The t-distributed stochastic neighbor embedding (t-SNE) projection of diseases across the phenome based on their MILTON-derived FISs. Each point corresponds to an ICD10 code, colored by its Louvain cluster.
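
The quantity plotted in panel d can be illustrated with a minimal sketch. It assumes that a code's "signature" is the unordered set of its N highest-FIS features and that a code counts as unique when no other code has the same set; the function name `count_unique_signatures`, the feature names and the toy scores are hypothetical and are not taken from MILTON.

```python
from collections import Counter


def count_unique_signatures(fis_by_code, n):
    """Count codes whose top-n feature set is shared with no other code.

    fis_by_code maps a code to a {feature: score} dict (hypothetical layout).
    Assumption: the signature is the unordered set of the n top-scoring features.
    """
    signatures = {
        code: frozenset(sorted(scores, key=scores.get, reverse=True)[:n])
        for code, scores in fis_by_code.items()
    }
    # How many codes carry each distinct signature.
    counts = Counter(signatures.values())
    return sum(1 for sig in signatures.values() if counts[sig] == 1)


# Toy scores for three made-up codes and four made-up features.
toy = {
    "A": {"f1": 0.9, "f2": 0.5, "f3": 0.1, "f4": 0.0},
    "B": {"f1": 0.8, "f2": 0.6, "f4": 0.3, "f3": 0.0},
    "C": {"f1": 0.7, "f3": 0.6, "f2": 0.1, "f4": 0.0},
}

for n in (1, 2, 3):
    print(n, count_unique_signatures(toy, n))
```

Evaluating this count for increasing N gives the kind of curve that panel d describes; whether the published analysis treats the top N features as an ordered list or as a set is not stated in the caption.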