Using genomic context informed genotype data and within-model ancestry adjustment to classify type 2 diabetes

Barnett, Eric J.; Zhang-James, Yanli; Hess, Jonathan L.; Glatt, Stephen J.; Faraone, Stephen V.

doi:10.1038/s43856-025-01189-8

Download PDF

Article
Open access
Published: 18 November 2025

Using genomic context informed genotype data and within-model ancestry adjustment to classify type 2 diabetes

Communications Medicine volume 5, Article number: 479 (2025) Cite this article

1043 Accesses
Metrics details

Subjects

Abstract

Background:

Despite high heritability estimates, complex genetic disorders have proven difficult to predict with genetic data. Genomic research has documented polygenic inheritance, cross-disorder genetic correlations, and enrichment of risk by functional genomic annotation, but the vast potential of that combined knowledge has not yet been leveraged to build optimal risk models. Additional methods are likely required to progress genetic risk models of complex genetic disorders towards clinical utility.

Methods:

We developed a modeling framework that uses annotations providing genomic context alongside genotype data as input to convolutional neural networks. To test this framework, we used a matched-pairs dataset of individuals with and without type 2 diabetes. We compared the performance of models trained on genotype data alone to those trained on context-informed genotype data. We also introduced adversarial tasks to remove the ability of models to predict genetic ancestry while assessing whether risk prediction performance was preserved.

Results:

Here, we show that a neural network using genotype data (AUC: 0.66) and a convolutional neural network using context-informed genotype data (AUC: 0.65) both significantly outperform polygenic risk score approaches in classifying type-2 diabetes. Adversarial ancestry tasks eliminate the predictability of ancestry without changing model performance.

Conclusions:

We present neural network approaches that improve classification performance by integrating genotype data with genomic context while accounting for ancestry. Although current performance is not yet sufficient for clinical use in type 2 diabetes, incremental advances such as these may help move genetic risk prediction toward future clinical utility.

Plain Language Summary

Type 2 diabetes is a common disease that affects millions of people worldwide. It occurs when the body cannot properly use insulin, causing glucose to stay in the blood instead of being used by the body for energy. Our genes, which are generally inherited, play a role in the risk of developing diabetes, but it has been difficult to use genetic data to accurately predict who will get the disease. In this study, we use information other researchers have learned about the genome, such as the function of genes, to provide context to disease risk models that normally only use information about each person’s genetic changes. These approaches perform better than traditional risk modeling approaches. Although not yet accurate enough for medical use, this work points to a promising direction for improving genetic detection of disease and improving our understanding of type 2 diabetes.

Polygenic risk score for type 2 diabetes shows context-dependent effects across populations

Article Open access 01 October 2025

The necessity of incorporating non-genetic risk factors into polygenic risk score models

Article Open access 20 February 2023

The effect of type 2 diabetes genetic predisposition on non-cardiovascular comorbidities

Article Open access 10 October 2025

Introduction

Over the past decade, many studies have developed regression-based and machine learning models for predicting the risk of complex genetic disorders¹. Despite progress, their performance still falls short of the estimated limits based on heritability². Meanwhile, efforts to identify functional genome components and environmental factors influencing heritable disorders have advanced. In polygenic disorders, where heritability is spread across many genes, leveraging current genomic knowledge could improve risk estimation. One study showed that using trait-specific functional priors increased polygenic prediction accuracy by 5%³. However, machine learning models have yet to fully utilize genomic knowledge, requiring new methods to handle the changes in data dimensionality and characteristics.

As an example, we focused on type-2 diabetes mellitus (T2D), which afflicts 6.3% of the world’s population^4,5. Genome-wide association studies (GWASs) of T2D have been successful in identifying hundreds of genetic variants^6,7. Studies using machine learning models to classify people with and without T2D have been less successful; model performances are lower than expected relative to the heritability of T2D and the classification performance of diseases with similar heritability^8,9,10. This poor performance may stem from several factors. One possibility is that the genetic risk for T2D is more complex, involving numerous small-effect loci across the genome, unlike type-1 diabetes, where risk is concentrated in a single region¹¹. If so, more sophisticated models may be needed to capture this complexity.

The complexity of machine learning models has its drawbacks¹². A complex model can capture patterns specific to the training data. If these patterns are more effective at classification than the real effects, the model may prioritize them, leading to “overfitting.” This means the model performs well on training data but struggles with unseen data that lack the same patterns. One example of a model learning patterns specific to training data and failing to generalize is when it is trained with population stratification¹³. In population stratification, cases and controls in the training data have systematic ancestry differences. A model may use these variant frequencies between ancestry groups to classify cases and controls, but it ends up classifying ancestry rather than disease. When tested on data without the same ancestry imbalance, the model’s disease classification performance will drop.

Although ancestry can confound analyses, modeling ancestry alongside genotype data can improve prediction^14,15 due to differing sampled case/control ratios within different ancestries or the interaction between genetic risk and a factor in which ancestry acts as a proxy. While these factors are undoubtedly important for clinical implementations of genetic risk models, the goal of many studies, including ours, is to better understand the genetic basis of the disorder, which motivates efforts to minimize the confounding effects of ancestry on genetic risk modeling.

In GWAS and polygenic risk score models, population stratification is typically controlled by including principal components (PCs) as covariates to assess associations and derive risk scores¹⁶. Most ancestry information is thought to be captured in the top PCs, so removing their effects allows the model to detect true disease risk. However, linear adjustments may not fully address ancestry-specific differences in allele frequencies. Since many machine learning models can learn non-linear relationships, it’s crucial to account for potential non-linear ancestry effects. In this work, we tackle this by using a within-model adjustment that reverses the gradient during gradient descent¹⁷. This “adversarial task” helps the model unlearn ancestry, reducing its reliance on ancestry confounds.

One cause of overfitting complex models is the high number of parameters in a model¹⁸. Many machine learning models can achieve near-perfect accuracy on randomly labeled data if the number of parameters far exceeds the number of samples¹⁹. This is particularly relevant in GWAS studies, where genetic variants can number in the millions, while the number of study participants is far fewer. A common way to reduce overfitting is to limit the number of variants used¹⁸, but this may hinder model performance if true risk variants are excluded.

In this study, we develop a classification model using convolutional neural networks (CNNs) to limit the number of model parameters without removing genetic variants. CNNs are widely used in image recognition due to their ability to learn local patterns and efficiently reduce high-dimensional features. Several studies have shown the effectiveness of CNNs on genetic data, focusing on disease classification or annotation prediction based on genotype values^20,21. Here, we create a CNN guided by genetic annotations, aiming to produce a low-dimensional representation of disease-relevant genetic variants. We hypothesize that training a CNN informed by genomic context will improve classification performance relative to current approaches. We also test whether using gradient reversal layers to create adversarial multi-task networks could control for ancestry confounding while preserving disorder classification accuracy. We find that CNN models using genomic context and neural network models constrained to context-enriched genetic variants significantly outperform polygenic risk score methods in classifying T2D. We show that adversarial ancestry adjustment effectively removes ancestry information from these models without reducing classification performance. Finally, we find that context-informed models capture unique and generalizable risk patterns, indicating that combining genomic context with within-model ancestry adjustment is a promising direction for advancing genetic risk prediction.

Methods

Data acquisition and preprocessing

We applied for and obtained genotype and phenotype data from the UK Biobank²², a prospective cohort study of over 500,000 individuals in the United Kingdom under application number 60050. Access to UK Biobank data is available to bona fide researchers through application to the UK Biobank (www.ukbiobank.ac.uk/register-apply). All UK Biobank participants provided informed consent at recruitment and can withdraw consent at any time. The UK Biobank received ethical approval from the Northwest Multi-Center Research Ethics Committee (reference no. 16/NW/0274). We applied for and accessed genotype and phenotype data from the Wellcome Trust Case-Control Consortium (WTCCC) under the application titled “Using Genomic Context Informed Genotype Data and Within-model Ancestry Adjustment to Classify Complex Genetic Diseases”. Access to WTCCC data is available to bona fide researchers through application to the Sanger Data Access Committee (https://edam.sanger.ac.uk). The WTCCC project received ethical approval from the South East Multicentre Research Ethics Committee (ref no. 05/Q0106/74). All WTCCC participants provided written informed consent. The UK Biobank genotype data were generated by a combination of the UK Biobank Axiom array and UK BiLEVE Axiom array. 1037 sample outliers, multi-allelic single nucleotide polymorphisms (SNPs), and SNPs with a minor allele frequency (MAF) < 1% were removed, resulting in 641,018 SNPs²³. These were used to impute ungenotyped SNPs, leading to a dataset of 73,355,667 variants. For our analyses, based on our training subset, we removed all SNPs with MAF < 1%, all SNPs with > 5% missingness, all individuals with > 5% missingness, and one member of any estimated kinship equal to or closer than second-degree relatives using Plink. We randomly split the data into training (70%), validation (15%), and testing subsets (15%). In each subset, we performed 1:1 case-control matched pairing based on age and sex using MatchIt^24,25. The training, validation and test subsets had 55,168, 11,712, and 11,648 records, respectively. The SUNY Upstate IRB determined that this project did not meet the definition of human subject research (Project ID: 2241318-1).

Context informed data matrix (CID)

Model overview and input

Simplified diagrams of model architectures, inputs, and outputs are shown in Fig. 1. An overview of the models used in our analysis is shown in Table 1. The genotype input used in our NN models is the number of minor alleles at each SNP for each person, ranging from 0, when neither chromosome contained the alternate allele, to 2, when both chromosomes contained the alternate allele. The additional input used in our CNN models is a CID, which contains each SNP along with information about that SNP. The CID’s columns represent each SNP; rows contain the genomic annotations and disorder risk values for each SNP. Figure 2 shows a simplified CID.

**Fig. 1: Simplified model architecture diagrams.**

**Fig. 2: An illustrative example of a portion of a context-informed data matrix (CID).**

Table 1 Overview of Machine Learning Models

Full size table

For each individual, the CID, which largely contains information about each SNP that does not change between individuals, is multiplied by the genotype values of the individual prior to model training. For all models, we used a binary variable representing the presence of the International Statistics Classification of Disease 10th revision (ICD-10) code for T2D, E11, for each individual as the label to be predicted. Our models output values from 0 to 1 with the goal of minimizing the error in predicting the label for each individual.

Genomic annotations

The genomic annotations provided information on whether the SNP occurs at a location known to be a miRNA binding site, DNase hypersensitivity site, CPG island, gene, intron, 5’ untranslated region (UTR), 3’ UTR, splice site, promoter, or transcription factor binding site using the AnnotationHub²⁶ and VariantAnnotation²⁷ packages in R. We did this by loading the SNP locations and annotation ranges and then finding the overlaps between the two. Any SNP that had an overlapping location with an annotation was coded as a 1 for that annotation or was otherwise coded as a zero. We included whether the SNP was a coding variant using the same overlap method.

We also added annotations indicating whether the SNP was within the range of each of the 20 gene sets most associated with T2D among gene sets in the lowest 10% standard error in MAGMA gene set analysis²⁸. We used the lowest 10% standard error threshold to ensure that the resulting gene set annotations were dense due to using larger gene sets and gene set association was consistent in the training subset as determined by MAGMA gene set analysis. The results of MAGMA gene set analysis on the training subset can be found in Supplementary Data.

Risk values

To give the model information about documented disease risk associations, we added the log odds ratios for each SNP from an external GWAS on T2D⁶ to the CID. Only SNPs present in both the UK Biobank data and the external GWAS were included in the model. To reduce the computational cost of this analysis, we further reduced the number of SNPs for ML models by only including SNPs that were associated with T2D in the external GWAS at p ≤ 0.01. This reduction left us with 11,730 SNPs for machine learning models. We also included odds ratios from correlated disorders based on genetic correlations calculated by GWAS Atlas²⁹. From the list of the 100 most correlated traits with T2D, we selected traits that had ICD-10 codes available in the UK Biobank data. The included traits were overweight and obesity, disorders of lipoprotein metabolism and other lipidemias, essential hypertension, chronic ischemic heart disease, cholelithiasis, angina pectoris, other disorders of the urinary system, and pain in the throat and chest. We performed GWASs on these traits using genotyped UK Biobank individuals that were not included in our machine learning models. The top ten PCs were used as covariates in the model to adjust for population stratification. We included the log odds ratios calculated from these GWASs in the CID.

Correlation to Annotation

It is likely that, for most SNPs, the risk associated with the SNP is due to another genetic change that is associated with the SNP through linkage disequilibrium³⁰. To account for this, we added an annotation for each previously described binary annotation that indicates the maximum squared correlation each SNP has to a SNP with the binary annotation. Correlations were calculated using Plink³¹. In cases where the SNP has the annotation, the correlation is 1, and the resulting annotation value is identical to the original annotation value.

Model hyperparameter optimization

Within TensorFlow³², we used the KerasTuner³³ framework to optimize the hyperparameters of our models. Using the Hyperband search algorithm, we searched for the optimal number of genomic convolutional blocks, number of filters within each block, filter and pool width within each block, number of dense layers, number of nodes and dropout rate within each layer, gradient reversal weights, L2 regularization presence and factor, number of epochs, and learning rates. The objective of the search algorithm was to maximize the area under the receiver operating characteristic curve (AUC) in the validation subset for classification tasks and maximize the validation subset R² for regression tasks.

Detecting and adjusting ancestry confounding with PCs and adversarial learning

ML models use all information available to them to produce the best performance possible. This can result in models that perform well mainly due to a confounding variable. To address confounding by ancestry, we ran a PCs analysis using the SNPs that were not included in our ML models nor correlated with the SNPs (r² < 0.2) that were used in the ML models. We extracted the top ten PCs, as is frequently used for ancestry inference¹⁶ and used them as labels in several models.

To establish whether population stratification was an issue, we built the “PC NN” model (Fig. 1a) using the ancestral PCs to predict T2D status. To determine if the subset of SNPs used as predictors in our model could recreate the PCs within the model, we used the genotype data as input in a neural network model that predicted the PCs (“Geno-PC” model, Fig. 1b). We determined the effectiveness of all model architectures in estimating the PCs with mean squared error (MSE) and R². In models that had a positive R² value, we also tracked T2D classification AUC from the PC estimates.

To test if a neural network could classify T2D and estimate PCs using the same layers, we designed a multi-task model with both tasks sharing all layers except output layers (“Geno-PC T2D” model, Fig. 1c). An additional task classifying T2D from the PC estimates was used to compare classification performance between the PC estimates and the true PCs.

To test whether, in classifying T2D from genotype data, our model inadvertently uses PC-like ancestral information, we built a multi-task model where one task classified T2D from the genotype input and another task estimated PCs (“Geno-T2D Track-PC” model, Fig. 1d) from the output of the dense layers. In this model, we used a stop gradient layer between the PC estimation task and the dense layers, which stops backpropagation. By using this layer, we stopped training such that the task estimating PCs did not influence the dense layers upstream from the layer unique to their tasks. This strategy allows us to track the performance of a task, in this case, PC estimation, without interfering with shared layer training. If the PC estimation task was predictive, this would suggest the model classifying T2D from genotype data used ancestral information that could be directly transformed into PC estimations.

To test whether we could teach the network not to use ancestry information when predicting T2D, we built a multi-task model (“Geno-T2D Adv-PC” model, Fig. 1e) with the same two tasks as the previous model but replaced the stop gradient layer with a gradient reversal layer, a technique initially developed in the domain adaptation field of machine learning¹⁷. This layer reverses the direction of gradients in gradient descent, thereby directing the weights in the layers upstream of the gradient reversal layer to adjust in a way that maximizes the error in the task, instead of the typical minimization. Unlike the fast gradient sign method, which creates “adversarial examples” by using gradients to create a new example that maximizes loss, the gradient reversal layer creates a continuous “adversarial task” by reversing the gradient from back propagation from the task-specific layers of the selected task throughout model training. The “adversarial task” directs the model to find ancestry-invariant patterns by penalizing classification strategies that lead towards more accurate estimation of ancestral PCs. Layers that are downstream of the gradient reversal layer still adjust weights in a way that minimizes error in the task. This means the adversarial task still attempts to minimize error and accurately estimate PCs in the layers that are unique to the task, thereby leaving adjustment of shared layers as the only option in maximizing error. We used this architecture to remove any PC-like ancestry information from the shared layers. If the model is unable to estimate PCs, it suggests the ancestry information present within the PCs is not being used in the shared layers.

CNN model architecture for genomic data

In all our CNN models, we used the matrices as input into genomic convolutional blocks, which are repeating units within the model that contain a combination of 1D convolutional layers with ReLU activation functions, pooling layers, and batch normalization layers. Within convolutional layers, the convolutional filters require spatial invariance, meaning that a signal at one part of the two-dimensional data structure has the same meaning as the same signal anywhere else in the data structure. This is true for images, for which convolutional layers were originally developed. In contrast, the height dimension of the context-informed data matrix used as input in our model, which represents the annotation information at each SNP location, lacks spatial meaning. Consequently, filters at different heights could convey vastly different meanings, thereby violating spatial invariance. To address this issue, we set the height of each convolutional filter equal to the total number of rows present in the matrix to assure that output signals from one part of the matrix were equivalent to those from another part. Given that spatial invariance was violated along the genome, with the columns representing SNPs in the CID, we set the hyperparameter selection range of convolutional width to 1–20. This range was used to allow the model to set the convolutional width to 1 if having convolutional filters containing more than a single SNP was disruptive to model performance due to the violation of spatial invariance, while also allowing for wider convolutional filters if that was beneficial to model performance. After each 1D convolutional layer, the output was fed into a batch normalization layer. After the genomic convolutional blocks, the output was used as input into one or more dense layers with ReLU activation functions, depending on hyperparameter optimization. Finally, we used a dense layer with a sigmoid activation function to produce an output prediction of the T2D status for each person. We used the Adam optimizer and binary cross-entropy loss function to train our models.

To test whether genotype-only models and context-informed genotype models can share the same features, we trained a CNN model that had two input sources that shared intermediate layers but have separate input-specific T2D classification tasks (“CID-T2D Track-PC” model, Fig. 1f). This means that the two input sources undergo the same transformations in the shared layers, but the output of those transformations is different because the input sources are different. The inputs into this model differ in that the CID input contains annotation information for each SNP, whereas the genotype input has copies of the genotype data, such that it is the same shape as the CID input to allow layer sharing. The input specific output layers then optimize T2D classification based on the result of the transformations in each input source. One input was the CID as previously described. The second input was the genotype data portion of the CID replicated such that it had the same dimensions as the CID. Both inputs shared all layers except the output layer specific to each input/task. If both tasks can classify T2D, it would suggest that there are similarities in the patterns used in the two input types. A PC estimation task with a stop gradient layer was used to track whether ancestry information was used within the model.

To test whether overlaps in the patterns found and used in both tasks are present when only training the CNN with CID input, we trained another CNN model (“CID-T2D Track-PC/Geno” model, Fig. 1g). In this model, we used the same base structure as the “CID-T2D Track-PC” model but used a stop gradient layer prior to the output layer of the genotype-only model, thereby preventing the task from training layers besides the output layer unique to the task. If the task is still able to classify T2D diagnosis with these constraints, it suggests that at least some of the similarities in the patterns used for both tasks are used even when not directed to find shared patterns.

To test whether the CNN with CID input can find patterns that the genotype-only model cannot use to classify T2D, we built a third CNN model (“CID Specific” model, Fig. 1h). In this model, the stop gradient layers in the “CID-T2D Track-PC/Geno” model were replaced with gradient reversal layers. This effectively forces the shared layers away from any patterns that could be used by the genotype-only input to classify T2D. Likewise, the shared layers are forced away from using ancestry information that could be used to estimate PCs. If the CID input can classify T2D using the same layers as the genotype-only input, it would suggest that the CID input can model patterns that the genotype-only input is not able to model. To test whether the CNN with genotype input can find patterns the model with CID input cannot, we built a similar model (“Geno Specific” model, Fig. 1i) that switched the main and adversarial tasks of the “CID Specific” model.

To test whether the “Specific” models contained non-overlapping T2D risk information, we built a model that used a combination of the risk features generated by the “Specific” models to predict T2D. To do this, we concatenated the output from the final intermediate layer in both models. This combined output was used as input into a neural network model (“Combined Specific” model) with the task of classifying T2D diagnosis. We compared the performance of this combined model to the performance of the “Specific” models individually.

Logistic regression models and model comparison

We fit logistic regressions using multiple PRS methods on all available SNPs comparisons. For standard PRS, we used Plink to prune correlated SNPs and calculate polygenic risk scores (PRSs)³¹. We also calculated PRS scores using LDpred2³⁴ and PRS-CS³⁵. We adjusted all PRSs for ancestry by regressing the PCs on the PRSs and keeping the residual. We used this residual in the generalized linear model (glm) function within the base stats package in R to fit a logistic regression using only the training subset. We predicted the test subset T2D status and the performance within the test subset to compare with other models. We used AUC to measure performance in the test subset. AUC confidence intervals were computed with 2000 stratified bootstrap replicates using the pROC R package³⁶. We also used the pROC package to compare the performance between models with DeLong’s test for two correlated ROC curves.

Feature importance and generalization analyses

We used integrated gradients³⁷ on our trained NN and CNN models to identify the features that were most influential in separating cases and controls. The output from Integrated Gradients is a value for each input feature for each person representing how much that feature contributed to the prediction of that person’s T2D status. To estimate overall feature importance, we compared the mean attributions in cases and controls for each feature using t-tests. We adjusted for multiple comparisons in our feature importance analyses using Bonferroni correction. There is significant and reasonable concern that the features used in ML algorithms are somewhat specific to the data used to train the model¹⁸. To test generalization, we used data from the WTCCC, which included 2922 T2D cases and 2922 controls after applying the same QC steps we used for the UK Biobank data³⁸. Genotypes for these data were imputed using the 1000 Genomes Project Phase 3 reference panel. Since the SNPs used in the original model were not all present in the WTCCC data, we re-ran the analysis for the best-performing models using only SNPs present in both WTCCC and UK Biobank data. We performed all other steps as described in previous sections, with the WTCCC data as an additional withheld subset. We assessed generalization by comparing AUC across models and data sets. We also looked for overlap in significant feature importance results across data sets.

Statistics and reproducibility

Statistical analyses were conducted in R (version 4.3.1). Machine learning models were implemented in Python (version 3.8.5) using TensorFlow (version 2.9.1) and associated libraries. Model performance was evaluated using area under the ROC curve (AUC) with 95% confidence intervals estimated from 2000 stratified bootstrap replicates. Differences between models were tested with DeLong’s test for two correlated ROC curves. Feature importance values were estimated using Integrated Gradients and compared between cases and controls using two-sample t-tests with Bonferroni correction for multiple comparisons. All statistical tests were two-sided. Principal component estimation accuracy was evaluated using MSE and R². Models were trained using a training subset (n = 55,168), hyperparameters were optimized using a validation subset (n = 11,712), and performance was tested using a test subset (n = 11,648), all from the UK Biobank dataset. We tested model generalizability using external WTCCC data (n = 5844).

Results

In the following, “genotype data” refers to the set of genotypes used as input to the models, “GWAS PCs” refers to ancestrally informative PCs estimated from the full set of GWAS SNPs and “ML PC estimates” refers to PCs estimated in machine learning models using genotype data as input. Table 2 shows the performance of all models on the test subset. Descriptions of each model’s results can be found in the supplementary results. The hyperparameter optimization ranges and values are in Table 3.

Table 2 Machine Learning Model Results

Full size table

Table 3 Hyperparameter Optimization Ranges

Full size table

Adversarial ancestry adjustment

Two of our models (shown in Fig. 1a, b) showed that both GWAS PCs and ML PC estimates were significantly predictive of T2D. The prediction of T2D using GWAS PCs and the prediction of T2D using ML PCs estimates were not significantly different (p = 0.65, Delong’s test for two correlated ROC curves). This illustrates the issue that within model correction may be necessary to avoid using potentially confounding ancestry information in classification modeling. Models that used adversarial tasks for ancestry (shown in Fig. 1e, h, i) all increased the ML PC estimates MSE as intended, ensuring that the model could not accurately recreate GWAS PCs and therefore could not use the ancestry information within those PCs to improve the classification of T2D. All models with an adversarial ancestry task had an R² < 0 for the PC estimation task, indicating that the model’s predictions are worse predictions than using the mean.

Classification model performance and comparison

The best performing NN model was the “Geno-T2D Adv-PC” model (Fig. 1e), with an AUC of 0.66 (95% CI: 0.65–0.67). The best performing CNN model was the “CID-T2D Track-PC/Geno” model (Fig. 1g) with an AUC of 0.65 (95% CI: 0.64–0.66). In comparison, the best performing PRS method used LDpred2 to predict T2D with an AUC of 0.63 (95% CI: 0.62–0.64). The best performing CNN and NN models were both significantly more predictive than the best PRS method (p = 0.009 and p = 0.001, Delong’s test for two correlated ROC curves) despite using only 11,730 SNPs compared to 979,891 SNPs used in the PRS methods.

Feature importance and generalization analyses

SNPs with significantly different mean attributions between cases and controls in the test subset for the “Geno-T2D Adv-PC”, “CID-T2D Track-PC/Geno”, and “CID Specific” models can be found in Supplementary Data. In all three models, the significant SNPs based on feature importance did not have significantly different GWAS association p-values compared to the SNPs that were not significant based on feature importance. In a generalization analysis, which was restricted to overlapping SNPs in the WTCCC and UK Biobank data sets, model performance decreased but all models were still significantly predictive of T2D. In the LDPred2 PRS model, which used 1,088,838 SNPs, the AUC was 0.56 (95% CI: 0.55–0.58) in the test subset and 0.54 (95% CI: 0.53–0.56) in the WTCCC data. In the “Geno T2D Adv PC” model, which used 9626 SNPs, the AUC was 0.62 (95% CI: 0.61–0.64) in the test subset and 0.62 (95% CI: 0.60–0.64) in the WTCCC data. In the “CID-T2D Track-PC/Geno” model, which used the same 9626 SNPs, the AUC was 0.58 (95% CI: 0.56–0.60) in the test subset and 0.60 (95% CI: 0.58–0.62) in the WTCCC data. In feature importance analysis, comparing the results in the test subset and WTCCC data, there was one overlapping significant SNP in the “Geno T2D Adv PC” model, rs12244851, and two overlapping significant SNPs in the “CID-T2D Track-PC/Geno” model, rs11996576 and rs3818717.

Discussion

Our findings show that incorporating genetic annotations for common variations provides additional insights relative to polygenic risk score approaches. To our knowledge, we are the first to use convolutional layers with genomic context, and our results suggest these methods can find unique, generalizable risk patterns. As far as we are aware, this study is also the first to apply gradient reversal layers in genomic machine learning, proving useful for adjusting ancestry and testing hypotheses. We found that relying solely on the top ten PCs is insufficient for removing ancestry-related confounding. Therefore, within-model control methods are essential to prevent ancestry confounding in predictive models.

In a large dataset of over 70,000 subjects, the “PC NN” model (Fig. 1a) significantly predicted T2D using the top 10 PCs from a PCA based on SNPs not included or in linkage disequilibrium with SNPs used to estimate our models. The “Geno-PC” model (Fig. 1b) showed that the genotype information we use as predictors can accurately estimate PCs. Taken together, these two models show that machine learning models can recreate ancestry information and use that information to classify T2D diagnosis and, likely, other disorders. The “Geno-PC-T2D” model (Fig. 1c) tested this ancestry inference to disorder diagnosis pathway directly and resulted in T2D classification performance not significantly different from classifying directly from GWAS PCs. The “Geno-T2D Adv-PC” (Fig. 1e) did not perform any worse than the “Geno-PC-T2D” model (Fig. 1c), despite eliminating the latter model’s ability to estimate PCs. Since the layers within the Geno-T2D Adv-PC model were forced away from using ancestry information, it is likely that nodes of the model that had been occupied with ancestry inference were used to represent additional real risk features of T2D.

In traditional machine learning approaches, it’s difficult to know if ancestry influences a model. Our solution uses a subtask that estimates ancestry-adjusted PCs from the output of layers used in the main classification task. This subtask cannot affect the shared layers’ weights due to a stop gradient layer. Without it, backpropagation would alter the main classification layers to improve PC estimation, encouraging the use of ancestry information. Instead, the stop gradient layer monitors whether ancestry data is used without altering the main network. If significant PC estimation occurs with the stop gradient, it indicates ancestry information is present in the shared layers. In this case, the stop gradient can be replaced by a gradient reversal layer, which flips the direction of weight changes from gradient descent in the upstream layers. This causes the model to maximize errors in PC estimation, avoiding ancestry use. This effect is shown by comparing the MSE in Fig. 1d, e, where replacing the stop gradient with a gradient reversal layer increases the PC estimation MSE from 1.70 to 2.32. The strength of this adversarial task can be adjusted to eliminate ancestry information. If more detailed ancestry data is available, it can also be used similarly. Studies have shown that genetic risk models, primarily developed on people of European ancestry, do not generalize well to other ancestral groups^39,40. Adversarial ancestry tasks would reduce this discrepancy by finding ancestry-invariant patterns.

In the “CID-T2D Track-PC” model (Fig. 1f), we trained a model to predict T2D with two types of inputs. The inputs were a “CID Input” that uses genomic context information and a “genotype input” that did not use genomic context information. The input types trained and used the same shared layers, apart from output layers that were not shared. The classification performance resulting from the two input types were similar. This suggests that the input types can find and use the same patterns within the data to classify T2D. This result was expected since the two input types share the same genotype information.

However, the CID-T2D Track-PC/Geno model (Fig. 1g), which only allowed the CID input to train the shared layers, found that the CID input was significantly better at classifying T2D compared to genotype input that did not train the shared layers (p = 2e-16, Delong’s test for two correlated ROC curves). This shows that while the two input types share similarities in their ability to represent the risk of T2D, CID input can create unique risk representations that improve prediction beyond what can be represented by genotype data alone.

Our use of gradient reversal layers to create adversarial tasks in the “CID Specific” model and “Geno Specific” model further separated out input-dependent features. While both models’ performance declined relative to other CNN models, the inputs for the non-adversarial tasks still had AUCs that were statistically significant. In comparison, the adversarial task input (genotype input for the “CID Specific” model and CID input for the “Geno Specific” model) was unable to significantly classify T2D using the same shared layers used by the main input. This suggests that the risk representations within the models leading to T2D classification were specific to the non-adversarial input. Our feature importance analysis also suggests that the CNN models are finding different risk patterns compared to our NN models. There was only a 2% overlap in significantly predictive SNPs between the “Geno-T2D Adv-PC” and the “CID Specific” models.

One explanation of the unique risk representations is that they represent the same underlying risk features but are constructed in a way that can only be used with one input type. To test this theory, we trained the “Combined Specific” model, which combines the features of the final intermediate layers of the “CID Specific” and “Geno Specific” models. If the features from the “Specific” models were overlapping, one would expect the model using the features from both models to have the same classification performance. We found that the “Combined Specific” model was significantly better at classifying T2D compared to both individual models. This suggests that the risk representations found in the “Specific” models are at least in part specific to those input types. This suggests that some of the genetic risk for T2D can be transformed into higher-order risk features using genomic context (features from the “CID Specific” model) while other risk cannot be transformed with the set of genomic context information used in our analyses. Further investigation of the genetic variants and higher-order risk features used and developed in these models could reveal insights for future genetic risk modeling efforts.

The performances of our NN and CNN models improves on prior results. One study reported an AUC of 0.60 (95% CI: 0.59–0.61) when using a gradient boosted and LD-adjusted heuristic polygenic score and an AUC of 0.61 when using LDpred polygenic scoring^41,42. The same study reported an AUC of 0.58 in an LD-unadjusted polygenic risk score and an AUC of 0.58 when using traditional pruning and thresholding polygenic scoring methods, which is like our polygenic risk score methods and results. Another study that investigated many different models and feature encodings reported a maximum AUC in T2D across all models/encodings of 0.59⁴³. We used the LDpred2³⁴ polygenic scoring method and saw an improvement compared to these prior results, with an AUC of 0.63 (95% CI: 0.62–0.64). Our best genotype (“Geno-T2D Adv-PC” model, AUC: 0.66, 95% CI: 0.65–0.67) and CID (“CID-T2D Track-PC/Geno” model, AUC: 0.65, 95% CI: 0.64–0.66) models are both improvements over these prior studies and our own polygenic scoring results. While these are significant improvements, they are small; further advances will be necessary to reach the goal of clinical utility.

Classification models may be useful for identifying genetic variants that predict T2D. Our feature importance analyses of the “Geno-T2D Adv-PC”, “CID-T2D Track-PC/Geno”, and “CID Specific” models had 352, 57, and 62 SNPs with statistically significant differences in mean attributions between cases and controls in the test subset (see Supplementary Data). These SNPs did not have significant differences in association with T2D compared to the other SNPs used in the models. This suggests that information beyond the differing allele rates between cases and controls are driving the importance of the significant SNPs, which may be worthy of further investigation. They may also be specific to the models and data used in our study due to overfitting. We addressed generalizability of our models’ prediction in this study by presenting classification model performance only in a withheld testing subset of data that were not used to train or optimize the model and by testing performance in an external WTCCC dataset.

Several limitations may have reduced our models’ ability to classify T2D. Due to computational issues, we only used a small subset of the genetic variants. Using more variants might increase the CNNs’ ability to detect local patterns since the genetic variants would be spaced closer together and the distance between variants would be more uniform. Insufficient optimization of hyperparameters may have also limited the performance of our models. In more complex models, the hyperparameter space becomes too large to efficiently explore all possible solutions. Therefore, it is likely that our model architectures are not the optimal solution. Further exploration of the hyperparameter space could improve results. The annotations used to create the CID may not be the best combination of information. Adjustments to the genomic context information might improve our results. There has been some evidence that population structure may not be fully captured by just the top PCs, so residual ancestry information could still influence the model⁴⁴. Our model’s ability to account for ancestry depends on the PCs used in the adversarial task. It is possible that other ancestry data are still included in the model and used in classification. Including more PCs as labels in adversarial tasks may further reduce the models’ reliance on ancestry information, minimizing the potential confounding effect of population stratification. On the other hand, our focus on minimizing ancestry information in the models may come at the cost of some predictive power. It is possible that some of the genetic risk features learned by the models in our study may be strengthened or weakened by the presence of gene-gene or gene-environment interactions present specifically in certain ancestry groups. Further study of ancestry-variant genetic risk will further our understanding and modeling of disorders like T2D. In addition, while the comparable classification performance between withheld UK Biobank and WTCCC data sets suggests model generalizability, the limited overlap between significant feature importance results suggest that different features led to the similar performance and since most of the individuals in these data sets are of European ancestry, generalization to other ancestries is unknown. The most generalizable modeling strategy would likely be to train a single model using multiple data sets. This could lead to more generalizable classification models and feature importance results.

In summary, we have described CNN/NN architectures that combine genomic context-informed genotype data, and within model ancestry detection/adjustment. Our results indicate that this may be a useful direction for improving our ability to classify complex genetic disorders and detect SNPs that are significantly predictive of disease. While classification performance remains too low for clinical utility and earlier detection of T2D risk, incremental improvements such as those reported here may get to the point of clinical utility in the future.

Data availability

This research was conducted using genotype and phenotype data from the UK Biobank resource under application number 60050. This study makes use of data generated by the Wellcome Trust Case-Control Consortium (WTCCC data sets EGAD00010000232, EGAD00010000248, and EGAD00010000250). A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113, 085475 and 090355. The genotype and phenotype data are available to researchers upon application to UK Biobank and Wellcome Trust Case-Control Consortium, subject to approval. All other data are available from the corresponding author upon request. Model performance results for individual models are available in the Supplementary Results. Results from feature importance analyses and MAGMA are available in the Supplementary Data.

Code availability

All code used in this project are available in GitHub: https://github.com/barnettej/Genomic-CNN⁴⁵.

References

Barnett, E. J., Onete, D. G., Salekin, A. & Faraone, S. V. Genomic machine learning meta-regression: insights on associations of study features with reported model performance. IEEE/ACM Trans. Comput. Biol. Bioinform 21, 169–177 (2024).
Article PubMed Google Scholar
Wray, N. R., Yang, J., Goddard, M. E. & Visscher, P. M. The genetic interpretation of area under the ROC curve in genomic profiling. PLoS Genet. 6, e1000864 (2010).
Article PubMed PubMed Central Google Scholar
Márquez-Luna, C. et al. Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Nat. Commun. 12, 6052 (2021).
Article PubMed PubMed Central Google Scholar
CDC Division of Diabetes Translation. Long-term Trends in Diabetes. Diabetes Surveillance System (2017).
Centers for Disease Control and Prevention. Prevalence of both diagnosed and undiagnosed diabetes. National Diabetes Statistics Report (2024).
Mahajan, A. et al. Multi-ancestry genetic study of type 2 diabetes highlights the power of diverse populations for discovery and translation. Nat. Genet. 54, 560–572 (2022).
Article PubMed PubMed Central CAS Google Scholar
Xue, A. et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat. Commun. 9, 2941 (2018).
Article PubMed PubMed Central Google Scholar
Mittag, F. et al. Use of support vector machines for disease risk prediction in genome-wide association studies: concerns and opportunities. Hum. Mutat. 33, 1708–1718 (2012).
Article PubMed CAS Google Scholar
Abraham, G., Kowalczyk, A., Zobel, J. & Inouye, M. Performance and robustness of penalized and unpenalized methods for genetic prediction of complex human disease. Genet. Epidemiol. 37, 184–195 (2013).
Article PubMed Google Scholar
Evans, D. M., Visscher, P. M. & Wray, N. R. Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk. Hum. Mol. Genet. 18, 3525–3531 (2009).
Article PubMed CAS Google Scholar
Noble, J. A. & Valdes, A. M. Genetics of the HLA region in the prediction of type 1 diabetes. Curr. Diab Rep. 11, 533–542 (2011).
Article PubMed PubMed Central CAS Google Scholar
Whalen, S., Schreiber, J., Noble, W. S. & Pollard, K. S. Navigating the pitfalls of applying machine learning in genomics. Nat. Rev. Genet. 23, 169–181 (2022).
Article PubMed CAS Google Scholar
Freedman, M. L. et al. Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36, 388–393 (2004).
Article PubMed CAS Google Scholar
Chen, C. Y., Han, J., Hunter, D. J., Kraft, P. & Price, A. L. Explicit modeling of ancestry improves polygenic risk scores and BLUP prediction. Genet. Epidemiol. 39, 427–438 (2015).
Article PubMed PubMed Central Google Scholar
Naret, O. et al. Improving polygenic prediction with genetically inferred ancestry. HGG Adv. 3, 100109 (2022).
PubMed PubMed Central CAS Google Scholar
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet 38, 904–909 (2006).
Article PubMed CAS Google Scholar
Ganin, Y. et al. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, 2096–2030 (2016).
Google Scholar
Ying, X. An overview of overfitting and its solutions. J. Phys.: Conf. Ser. 1168, 022022 (IOP Publishing, 2019).
Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning requires rethinking generalization (2016). arXiv preprint arXiv:1611.03530 (2017).
Waldmann, P., Pfeiffer, C. & Mészáros, G. Sparse convolutional neural networks for genome-wide prediction. Front. Genet. 11, 25 (2020).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Article PubMed PubMed Central CAS Google Scholar
Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med 12, e1001779 (2015).
Article PubMed PubMed Central Google Scholar
Marchini, J. UK Biobank Phasing and Imputation Documentation. Department of Statistics, University of Oxford, On behalf of UK Biobank Version 1.2 (2015).
R. Core Team. R: A language and environment for statistical computing. (R Foundation for Statistical Computing, 2014).
Ho, D., Imai, K., King, G. & Stuart, E. A. MatchIt: nonparametric preprocessing for parametric causal inference. J. Stat. Softw. 42, 1–28 (2011).
Article Google Scholar
Morgan, C. J. & Cauce, A. M. Predicting DSM-III-R disorders from the youth self-report: analysis of data from a field study. J. Am. Acad. Child Adolesc. Psychiatry 38, 1237–1245 (1999).
Article PubMed CAS Google Scholar
Obenchain, V. et al. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics 30, 2076–2078 (2014).
Article PubMed PubMed Central CAS Google Scholar
de Leeuw, C. A., Mooij, J. M., Heskes, T. & Posthuma, D. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput. Biol. 11, e1004219 (2015).
Article PubMed PubMed Central Google Scholar
Watanabe, K. et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat. Genet. 51, 1339–1348 (2019).
Article PubMed CAS Google Scholar
Schaid, D. J., Chen, W. & Larson, N. B. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat. Rev. Genet. 19, 491–504 (2018).
Article PubMed PubMed Central CAS Google Scholar
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Article PubMed PubMed Central CAS Google Scholar
Abadi, M. et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
O’Malley, T. et al. KerasTuner https://github.com/keras-team/keras-tuner (2019).
Privé, F., Arbel, J. & Vilhjálmsson, B. J. LDpred2: better, faster, stronger. Bioinformatics 36, 5424–5431 (2021).
Article PubMed Google Scholar
Ge, T., Chen, C. Y., Ni, Y., Feng, Y. A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10, 1776 (2019).
Article PubMed PubMed Central Google Scholar
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinforma. 12, 77 (2011).
Article Google Scholar
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning 3319–3328 (PMLR, 2017).
Wellcome Trust Case Control, C. Genome-wide association study of 14,000 cases of seven common diseases and 3000 shared controls. Nature 447, 661–678 (2007).
Article Google Scholar
Duncan, L. et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nat. Commun. 10, 3328 (2019).
Article PubMed PubMed Central CAS Google Scholar
Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).
Article PubMed PubMed Central CAS Google Scholar
Paré, G., Mao, S. & Deng, W. Q. A machine-learning heuristic to improve gene score prediction of polygenic traits. Sci. Rep. 7, 12665 (2017).
Article PubMed PubMed Central Google Scholar
Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 97, 576–592 (2015).
Article PubMed PubMed Central Google Scholar
Mittag, F., Römer, M. & Zell, A. Influence of feature encoding and choice of classifier on disease risk prediction in genome-wide association studies. PLoS One 10, e0135832 (2015).
Article PubMed PubMed Central Google Scholar
Privé, F., Luu, K., Blum, M. G. B., McGrath, J. J. & Vilhjálmsson, B. J. Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics 36, 4449–4457 (2020).
Article PubMed PubMed Central Google Scholar
Barnett, E. J. Genomic Context-Informed CNN/NN Models for Type 2 Diabetes Classification. https://github.com/barnettej/Genomic-CNN. (ed. Zendo) (Github, 2025).

Download references

Acknowledgements

This research was conducted using genotype and phenotype data from the UK Biobank resource under application number 60050. This study makes use of data generated by the Wellcome Trust Case-Control Consortium (WTCCC data sets EGAD00010000232, EGAD00010000248, and EGAD00010000250). A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Yanli Zhang-James is supported by the European Union’s Seventh Framework Program for research, technological development and demonstration under grant agreement no 602805 and the European Union’s Horizon 2020 research and innovation program under grant agreements No 667302. Jonathan Hess is supported by the U.S. National Institutes of Health (R21MH126494, R01NS128535) and NARSAD: The Brain & Behavior Research Foundation. Stephen J. Glatt is supported by grants from the U.S. National Institutes of Health (R01MH101519, R01AG054002, R01AG064955), the Sidney R. Baer, Jr. Foundation, and NARSAD: The Brain & Behavior Research Foundation. Dr. Faraone’s research is supported by The European Union’s Horizon 2020 research and innovation program under grant agreement 965381; NIH/NIMH grants U01AR076092, R01MH116037, 1R01NS128535, R01MH131685, 1R01MH130899, U01MH135970, and Supernus Pharmaceuticals. His continuing medical education programs are supported by Corium Pharmaceuticals, Tris Pharmaceuticals and Supernus Pharmaceuticals. This study did not receive any direct funding.

Author information

Authors and Affiliations

Department of Psychiatry and Behavioral Sciences and Department of Neuroscience and Physiology, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, USA
Eric J. Barnett, Stephen J. Glatt & Stephen V. Faraone
Department of Psychiatry and Behavioral Sciences, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, USA
Yanli Zhang-James & Jonathan L. Hess
Department of Public Health and Preventive Medicine, Norton College of Medicine at SUNY Upstate Medical University, Syracuse, NY, USA
Stephen J. Glatt

Authors

Eric J. Barnett
View author publications
Search author on:PubMed Google Scholar
Yanli Zhang-James
View author publications
Search author on:PubMed Google Scholar
Jonathan L. Hess
View author publications
Search author on:PubMed Google Scholar
Stephen J. Glatt
View author publications
Search author on:PubMed Google Scholar
Stephen V. Faraone
View author publications
Search author on:PubMed Google Scholar

Contributions

E.J.B. and S.V.F. conceived and designed the models and analyses. E.J.B. implemented the models and performed the analyses with contributions from S.V.F., Y.Z.J., J.H. and S.J.G. E.J.B. wrote the paper. All authors discussed model design, results, and implications. All authors commented and edited the manuscript.

Corresponding author

Correspondence to Stephen V. Faraone.

Ethics declarations

Competing interests

Stephen V. Faraone received income, potential income, travel from Aardvark, Aardwolf, AIMH, Akili, Atentiv, Axsome, Genomind, Ironshore, Johnson & Johnson/Kenvue, Kanjo, KemPharm/Corium, Noven, Otsuka, Sky Therapeutics, Sandoz, Supernus, Tris, and Vallon. The remaining authors declare no competing interests.

Peer review

Peer review information

Communications Medicine thanks Huixin Zhan and the other, anonymous, reviewer (s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Transparent Peer Review file

Supplementary Material

Description of Additional Supplementary Files

Supplementary Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Barnett, E.J., Zhang-James, Y., Hess, J.L. et al. Using genomic context informed genotype data and within-model ancestry adjustment to classify type 2 diabetes. Commun Med 5, 479 (2025). https://doi.org/10.1038/s43856-025-01189-8

Download citation

Received: 16 January 2024
Accepted: 03 October 2025
Published: 18 November 2025
Version of record: 18 November 2025
DOI: https://doi.org/10.1038/s43856-025-01189-8

Subjects

Abstract

Background:

Methods:

Results:

Conclusions:

Plain Language Summary

Similar content being viewed by others

Polygenic risk score for type 2 diabetes shows context-dependent effects across populations

The necessity of incorporating non-genetic risk factors into polygenic risk score models

The effect of type 2 diabetes genetic predisposition on non-cardiovascular comorbidities

Introduction

Methods

Data acquisition and preprocessing

Context informed data matrix (CID)

Model overview and input

Genomic annotations

Risk values

Correlation to Annotation

Model hyperparameter optimization

Detecting and adjusting ancestry confounding with PCs and adversarial learning

CNN model architecture for genomic data

Logistic regression models and model comparison

Feature importance and generalization analyses

Statistics and reproducibility

Results

Adversarial ancestry adjustment

Classification model performance and comparison

Feature importance and generalization analyses

Discussion

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Transparent Peer Review file

Supplementary Material

Description of Additional Supplementary Files

Supplementary Data

Rights and permissions

About this article

Cite this article

Share this article

Search

Quick links