Fig. 1: Flowchart of genetic-centric analysis. A Data partitioning. | Nature Communications

From: AI-enhanced integration of genetic and medical imaging data for risk assessment of Type 2 diabetes

After data quality control (QC), the dataset of 60,747 individuals was divided into several subsets: genome-wide association study (GWAS) samples (Dataset 1, N = 35,688), training samples (Dataset 2, N = 12,236; Dataset 4, N = 40,787), and validation samples (Dataset 3, N = 3060; Dataset 5, N = 10,197). Testing samples comprised Dataset 6 (N = 8827) and Dataset 7 (N = 936) for classification analysis, and are denoted Dataset 6’ (N = 8827) and Dataset 7’ (N = 936) for prediction analysis. B Sample size. The total sample size and the numbers of cases and controls are shown for each of the four phenotype definitions in Datasets 1–7. C Phenotype definition criteria. The definition and sample size are shown for each of the four Type 2 Diabetes (T2D) phenotype definitions. D Analysis flowchart. The analysis flow comprises three steps: selecting T2D-associated single nucleotide polymorphisms (SNPs) and polygenic risk score (PRS), selecting demographic and environmental covariates, and establishing the best XGBoost model using the selected features. In the first step, SNPs can be chosen from (A) our own GWAS, adjusted for age, sex, and the top ten principal components (PCs); (B) published studies based on single ethnic populations; and (C) published studies based on multiple ethnic populations. Source data are provided as a Source Data file.
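The sample sizes quoted for panel A can be cross-checked with a little arithmetic. The sketch below is only a consistency check on the numbers in the caption. The function name `check_partition` and the dictionary layout are illustrative choices, not part of the study. The caption does not say how the subsets nest, so the relationships tested here are observations about the figures, not a description of the authors' procedure.

```python
# Sample sizes as quoted in the caption of Fig. 1A.
TOTAL_AFTER_QC = 60_747
N = {1: 35_688, 2: 12_236, 3: 3_060, 4: 40_787, 5: 10_197, 6: 8_827, 7: 936}


def check_partition(n=N, total=TOTAL_AFTER_QC):
    """Return simple consistency checks on the quoted sample sizes.

    Hypothetical helper for illustration; it only does arithmetic on the
    caption's numbers and assumes nothing about how the split was made.
    """
    return {
        # Do Datasets 1, 2, 3, 6 and 7 together account for the QC-passing total?
        "1+2+3+6+7 == total": n[1] + n[2] + n[3] + n[6] + n[7] == total,
        # Are Datasets 4 and 5 together the same size as Datasets 1-3 combined?
        "4+5 == 1+2+3": n[4] + n[5] == n[1] + n[2] + n[3],
        # Training share within each training/validation pair.
        "train share (2 vs 3)": round(n[2] / (n[2] + n[3]), 3),
        "train share (4 vs 5)": round(n[4] / (n[4] + n[5]), 3),
    }


if __name__ == "__main__":
    for name, value in check_partition().items():
        print(f"{name}: {value}")
```

With the quoted figures, both equalities hold and each training/validation pair comes out at roughly 80% training. That ratio is inferred from the numbers alone; the caption itself does not state it.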