Fig. 2: Deep learning reveals non-linear genetic and covariate effects. | Nature Communications

Fig. 2: Deep learning reveals non-linear genetic and covariate effects.

From: Complex genetic effects linked to plasma protein abundance in the UK Biobank

a DL (EIR) and linear (bigstatsr) mean bootstrapped (n = 1000) model performance (R²) for all 2922 proteins (Supplementary Data 2). The error bars indicate the 95% confidence intervals (CI) from 1000 bootstraps, and proteins with non-overlapping CIs between DL and linear models are called significant and labeled in red.
b SNP-based heritability of 2414 plasma proteins from Sun et al. correlated with DL performance from our study (two-sided Pearson correlation test, R = 0.85, P < 2.2e-16).
c Performance gap (R²) for the 171 significant proteins between DL and linear models (DL-linear) on NPX or INT normalized protein values. Proteins labeled in red indicate that no significant performance gap (overlapping CIs) was found when modeling on INT normalized protein values.
d DL and linear model performance for the top 20 significant proteins with the largest absolute performance gap in R² between the DL and linear models are shown. Additionally, performance of linear and non-linear (XGBoost) models trained only on covariates is shown (Supplementary Data 3). The covariates include demographic information (age and sex), the genetic array, genetic principal components (GPC1-GPC20), whether individuals were consortium selected, and the research center location for participant measurement. On the right, the fraction of the performance gap that remained when modeling on INT values instead of NPX protein levels is shown. The error bars indicate the 95% CI from 1000 bootstraps.
e Aggregated DL attribution of 487 SNVs across the genome was used as input to model PAEP protein levels. Variants located within the PAEP gene are labeled in red.
f Performance gap (R²) between DL and linear models on genotype and covariates against the performance gap between non-linear (XGBoost) and linear models on covariates only. Orange and green areas indicate if protein levels underlie non-linear covariate effects or other non-linear effects in the input data.
g Linear model performance improvement (ΔR²) when iteratively adding more complex terms as input features to the linear model.
h Performance decomposition for 11 proteins showing baseline model performance (linear SNP + non-linear covariate effects) and gains from progressively adding GxG and GxE interactions.
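
The significance criterion used in panels a and c (bootstrapped R² with 95% CIs, with non-overlapping CIs called significant) can be illustrated with a minimal sketch. This is not the authors' code. The function names (`r2_score`, `bootstrap_r2`, `non_overlapping`) are hypothetical. The sketch assumes that bootstrapping resamples held-out predictions with replacement and that the CI is a percentile interval, neither of which the caption specifies.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def bootstrap_r2(y_true, y_pred, n_boot=1000, seed=0):
    """Return (mean R2, lower, upper) from n_boot bootstrap resamples.

    Assumption: samples are resampled with replacement and the 95% CI
    is the 2.5th to 97.5th percentile of the bootstrap distribution.
    """
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # indices drawn with replacement
        scores[b] = r2_score(y_true[idx], y_pred[idx])
    lower, upper = np.percentile(scores, [2.5, 97.5])
    return scores.mean(), lower, upper


def non_overlapping(ci_a, ci_b):
    """True if two (lower, upper) intervals do not overlap."""
    return ci_a[0] > ci_b[1] or ci_b[0] > ci_a[1]
```

Under these assumptions, a protein would be flagged when `non_overlapping((dl_lower, dl_upper), (lin_lower, lin_upper))` is true, where each pair comes from `bootstrap_r2` applied to that model's predictions on the same samples.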
