Fig. 1: The concept of genetic architecture and predictive models for polygenic diseases. | Translational Psychiatry

From: Machine learning for effectively avoiding overfitting is a crucial strategy for the genetic prediction of polygenic psychiatric phenotypes

a The distribution of P values in GWAS for polygenic disease models in training and test datasets. To illustrate the concept of genetic architecture and predictive models for polygenic disease, the panels show a simulated distribution of the variants analyzed in a GWAS for a given target phenotype. The Y axis indicates the negative logarithm (−log) of the P values, and the X axis indicates the logarithm (log) of the number of variants. Variants with true susceptibility to the disease of interest (depicted in orange and yellow) tend to have small P values, but some have large P values because of insufficient power. Likewise, most null variants (variants with no effect on susceptibility to the disease, depicted in blue) tend to have large P values, but some have small P values by chance because of the large number of statistical tests. The true susceptibility variants can be divided into a set of variants that are independent of each other (depicted in orange) and a set of remaining variants that depend on the former through linkage disequilibrium (depicted in yellow). When included in a prediction model, true susceptibility variants increase prediction accuracy, whereas null variants decrease it, because associations between null variants and the target phenotype do not replicate in the validation cohort; this is referred to as overfitting. Distinguishing true susceptibility variants from null variants in a single GWAS is difficult with currently available sample sizes.

b Concepts of PRS. PRS aims to select variants with true susceptibility and to avoid the influence of null variants by setting a cutoff on the GWAS P values; however, prediction accuracy is reduced because the model (i) still includes a large number of null variants and overestimates their effects, and (ii) incorporates clumping, which excludes correlated true susceptibility variants that could contribute to prediction accuracy.

c Concepts of GBLUP. GBLUP uses true susceptibility variants that are correlated with each other to improve prediction accuracy; however, the model also includes a large number of null variants, which decreases prediction accuracy through overfitting.

d Concepts of STMGP. STMGP reduces overfitting by weighting the selected variants to limit overestimation of null variants, uses correlated true susceptibility variants effectively by building a generalized ridge regression, and sets an optimal cutoff for the P value at low computational cost by avoiding CV.

GWAS genome-wide association study, PRS polygenic risk score, CV cross-validation, GBLUP genomic best linear unbiased prediction, STMGP Smooth-Threshold Multivariate Genetic Prediction.
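
The trade-off behind a P value cutoff described in panels a and b can be made concrete with a small simulation. The sketch below is illustrative only and is not the simulation used to draw the figure: the number of variants, the mean test statistic of the true susceptibility variants (`true_mean_z`), and the cutoff of 0.05 are all assumed values chosen for the example. It draws independent test statistics for null and true variants, converts them to two-sided P values, and counts how many of each kind pass the cutoff.

```python
import random
from statistics import NormalDist


def simulate_p_values(n_null, n_true, true_mean_z, rng):
    """Return two lists of two-sided P values: (null variants, true variants).

    Assumption for illustration: each variant's test statistic is an
    independent normal draw with unit variance; null variants have mean 0
    and true susceptibility variants share one hypothetical mean.
    """
    std_normal = NormalDist()

    def two_sided_p(z):
        return 2.0 * (1.0 - std_normal.cdf(abs(z)))

    null_p = [two_sided_p(rng.gauss(0.0, 1.0)) for _ in range(n_null)]
    true_p = [two_sided_p(rng.gauss(true_mean_z, 1.0)) for _ in range(n_true)]
    return null_p, true_p


def count_selected(p_values, cutoff):
    """Count variants whose P value falls below the cutoff."""
    return sum(p < cutoff for p in p_values)


rng = random.Random(0)  # fixed seed so the run is reproducible
null_p, true_p = simulate_p_values(
    n_null=100_000,   # assumed number of null variants
    n_true=1_000,     # assumed number of true susceptibility variants
    true_mean_z=2.0,  # assumed (modest) effect, i.e. limited power
    rng=rng,
)

cutoff = 0.05  # assumed P value cutoff
null_selected = count_selected(null_p, cutoff)
true_selected = count_selected(true_p, cutoff)
true_missed = len(true_p) - true_selected

print(f"null variants passing the cutoff: {null_selected}")
print(f"true variants passing the cutoff: {true_selected}")
print(f"true variants missed (low power): {true_missed}")
```

Under these assumptions, roughly 5% of the null variants pass the cutoff by chance, so they outnumber the selected true variants, while a substantial share of the true variants is missed because of limited power. Tightening the cutoff removes null variants but also discards more true ones, which is the tension the caption describes.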
