Fig. 1: Overview of the study and protein pre-GWAS results. | Nature Communications

From: Complex genetic effects linked to plasma protein abundance in the UK Biobank

a Overview of study design and workflow. UKB genotypes underwent quality control (QC), resulting in 424,097 QC-passed SNVs. The data were split into training and validation sets of self-reported UK white ethnicity from OLINK batches 0–6 (n = 34,947 for training; n = 2000 for validation), and test sets stratified by ethnicity and batch: self-reported UK white ethnicity (n = 1771) and mixed ethnicities (n = 9876), drawn from OLINK batches 0–6 and 0–7, respectively. The training and validation data were used to develop DL and linear models, with a per-target GWAS on the training set used to pre-filter input variants for the DL model. Finally, predictions and analyses were performed on the test data, and proteins with discordant performance between the DL and linear models were investigated for non-linear covariate, non-additive (e.g., dominance), and interaction (e.g., epistasis) effects.

b Correlation of GWAS P values between the current study and Sun et al. (ref. 6). The scatter plot shows the −log10(P value) correlation for the 1780 overlapping genetic variants with significant associations (P < 1.7e-11) in both our analysis and Sun et al. (ref. 6). Variants with P values equal to 0, likely because they fell below the numerical precision threshold (underflow), were omitted from the plot. The strong correlation (two-sided Pearson correlation test, R = 0.96, P < 2.2e-16) demonstrates consistency in identifying significant associations.

c Histogram of the number of input SNVs used for DL model training following per-target GWAS pre-filtering, in which only SNVs with P values < 0.001 (computed on the training set) were retained. For the majority of proteins, fewer than 1000 SNVs passed the threshold.
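The two numerical steps named in panels b and c can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `neg_log10_correlation` and `prefilter_snvs`, the SNV identifiers, and all example values are hypothetical, and the sketch assumes only that P values are available as plain arrays. It shows why P values of exactly 0 have to be dropped before taking −log10 (they would map to infinity) and how a strict P < 0.001 cutoff selects input SNVs.

```python
import numpy as np


def neg_log10_correlation(p_a, p_b):
    """Pearson correlation of -log10(P) between two studies.

    Pairs in which either P value is exactly 0 (e.g. from underflow) are
    dropped, because -log10(0) is infinite. Returns (r, number of pairs kept).
    """
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    keep = (p_a > 0) & (p_b > 0)
    x = -np.log10(p_a[keep])
    y = -np.log10(p_b[keep])
    r = float(np.corrcoef(x, y)[0, 1])
    return r, int(keep.sum())


def prefilter_snvs(snv_ids, p_values, threshold=1e-3):
    """Keep SNV identifiers whose P value is strictly below the threshold."""
    p = np.asarray(p_values, dtype=float)
    return [snv for snv, pv in zip(snv_ids, p) if pv < threshold]


# Hypothetical P values for four variants shared by two studies;
# the third variant underflowed to 0 in the first study.
p_study_a = [1e-12, 1e-20, 0.0, 1e-30]
p_study_b = [1e-13, 1e-19, 1e-50, 1e-31]
r, n_pairs = neg_log10_correlation(p_study_a, p_study_b)

# Hypothetical per-target GWAS output used for pre-filtering.
selected = prefilter_snvs(["snv_1", "snv_2", "snv_3"], [0.0005, 0.001, 0.2])
```

In this sketch the cutoff is strict, so a variant with a P value of exactly 0.001 is excluded, matching the "< 0.001" wording of the caption. The significance test reported alongside R in panel b is not reproduced here.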
