Introduction

Genome-wide association studies (GWAS) have enabled huge progress in identifying variants associated with the risk of developing Alzheimer’s disease (AD)1. Polygenic risk scores (PRS) based on these variants have greatly improved prediction of disease status2. However, inherent to GWAS and PRS are the assumptions that variants are independent predictors, linearly associated with the outcome, and therefore combine additively within and between loci3, with no interactions occurring between variants, or between genes and other risk factors. While such simplifying genetic assumptions have proved fruitful across a range of diseases and disorders4,5, they are at odds with biological evidence in AD that disease heterogeneity and responses from cells such as microglia are dependent on APOE status6,7,8,9. Further, there is genetic evidence suggesting that different variants are associated with the disease depending on APOE status10,11,12,13 and age at diagnosis or assessment14,15,16. As GWAS sample size increases and PRS approach limits on predictive performance, alternative modelling approaches are essential to maximise discoveries from existing data and enable a deeper understanding of AD genetics.

The confluence of increasingly large genetic data17, readily available computational resources, and mature methodologies presents a key opportunity for addressing this at scale by applying flexible data-driven machine learning (ML) models. Several studies have applied ML to the genetics of brain disorders and have been recently summarised18,19. Previous ML attempts have been impacted by high risk of bias20 and population stratification21, while AD studies in particular have been hampered by low sample size18, leaving a gap for comprehensive, large-scale studies that rigorously apply ML to genome-wide data. We conducted the largest genome-wide ML study in AD to date, marking a pivotal moment in the field. Our study presents a reproducible, bias-aware approach for ML model development, validation, and confounder adjustment. We trained three of the most prominent approaches in the field to compare predictive accuracy and uncover novel AD-associated risk loci which were not identified by traditional GWAS.

Results

Prediction

Gradient boosting, neural networks, MB-MDRC 1d, and PRS models were compared for discovery and prediction (Fig. 1). Predictions of AD status were highly correlated between models in the test set, with pairwise correlations between r = 0.80 and r = 0.87 (Fig. 2). The highest correlations were observed for GBM-NN (r = 0.87) and GBM-MB-MDRC 1d (r = 0.86). The weakest correlation was between NNs and PRS (r = 0.80). PRS was most strongly correlated with GBMs (r = 0.84). For discrimination between cases and controls, the highest AUC of 0.692 (95% CI: 0.683-0.701) was obtained with gradient boosting, and was not significantly different from the AUC of 0.689 (0.679-0.698) for PRS (Fig. 2, Supplementary Data 4). The AUCs remained within the 95% CI when we excluded any imputed variants from the data, indicating that their inclusion did not inflate performance estimates (GBMs: 0.683, NN: 0.674, and MB-MDRC 1d: 0.667 without imputed variants). Predictions remained stable across repeats with different random train-test splits (Supplementary Data 4) and across the cohorts in the data (Supplementary Fig. 6). All models have a greater proportion of females in those predicted to be a case, reflecting the underlying sex differences in the data (59% female overall, 62% in cases, 57% in controls), except for GBMs, which have a similar proportion in both cases and controls (Fig. 2i).

Fig. 1: Methods overview.

Data were separated into an initial balanced random split before model selection (cross-validation and hyperparameter tuning) in the training split (a). All models were subsequently evaluated for association (annotation, enrichment analysis, interaction testing and replication; b) and prediction (AUC and correlations; c). Interaction tests report p-values from the Wald test in logistic regression (two-sided) as standard, after correction for multiple testing. The full pipeline was run four times per model to assess robustness. For prediction, AUC values, statistical tests, and correlation analyses are based on the initial train-test split. For association, variants were prioritised if they appeared in the top SNP selection in at least two repeats.

Fig. 2: Prediction from ML models in the test split for the most predictive models trained with the APOE region included.

The two most predictive approaches (GBM, PRS) were not significantly different by AUC, as measured by DeLong’s test, though prediction from MB-MDRC 1d was significantly below the other methods (a). Bars show the AUC from a single test split for each model, where whiskers are 95% CIs from the pROC package. Unadjusted p-values from tests (ns: not significant, ****: p < 0.0005) are annotated on panel a for DeLong’s two-sided test for correlated ROC curves; see Supplementary Data 3 for exact values. Model predictions showed strong correlation, though correlation of the ranks is lower (b). Distributions of covariate-adjusted predictions for the most predictive approaches are similar but show a more distinct multimodal distribution for GBMs (c), NNs (d) and MB-MDRC 1d (f) compared to PRS (e), illustrating a stronger influence of APOE risk alleles on predictions. Panel (g) shows the consistency across methods for individuals’ prediction scores, where the participants in the 5% extreme tails of GBM predictions are followed across predictions from NN, PRS and MB-MDRC 1d. Classification metrics are given in (h). Model predictions broken down by covariates are shown in (i) and (j), where dark blue indicates predicted cases, and light blue predicted controls. Box plots in (j) show the median (center line), the 25th and 75th percentiles (box limits), and the whiskers, which extend to 1.5 times the interquartile range.

Identification of AD-associated loci

SNPs prioritised by ML approaches were required to appear in at least two random train-test splits, ensuring more robust associations; they include both known (Table 1) and novel variants (Table 2). Gradient boosting machines correctly distinguished the APOE haplotypes (Fig. 3a, showing distinct clusters). Figure 3b further highlights modelling of the APOE region across methods, wherein GBMs and NNs correctly identify causal SNPs. Known loci including CR1, BIN1, IDUA, OTULIN, RASA1, RASGEF1C, CLU, ABCA1, MS4A*, PICALM, ABCA7, APOE, and CASS4 (Table 1) are highlighted by ML models (Fig. 3c). In addition, several novel loci were identified with putative biological evidence for association with AD (ARHGAP25, COG7, LINC00924/LOC105369212, LY6H, SOD1 and ZNF597) which were replicated in Jansen et al.22 (Table 2). Association of an exonic missense variant was also highlighted in AP4E1, within 500 kb of the known SPPL2A locus (Table 2). Neural networks (NN) detected known loci in APOE, BIN1, CYP27C1, ABCA1 and ABCA7 (Fig. 3c) and an additional novel locus (SOD1), which also replicated in Jansen et al.22 (Table 2). MB-MDR 1d identified SNPs in 24 genes, the majority of which map to the APOE region, in at least two train-test splits. Of these, 20 were identified in every split, indicating highly stable results. MB-MDR 1d also identified SNP-SNP pairs in the APOE region consistently across every train-test split. Single train-test splits also found genes outside of chromosome 19 (Supplementary Fig. 7). The majority of candidate novel loci were identified by GBMs, which found multiple loci with evidence for association with AD-related traits such as cognition, pTau, AD age-at-onset and neurofibrillary tangles from previous GWAS (see Table 2 and Supplementary Data 5).
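The stability rule described above (a SNP must appear in the top selection of at least two train-test splits) can be sketched as a simple counting step. This is a minimal illustration, not the study's pipeline: the function name `prioritise_snps` and the per-split lists are hypothetical, and although the rsIDs are ones named in the text, their membership in each list is invented for the example.

```python
from collections import Counter

def prioritise_snps(top_snps_per_split, min_splits=2):
    """Return SNPs that appear in the top selection of at least `min_splits` splits."""
    counts = Counter()
    for top_snps in top_snps_per_split:
        counts.update(set(top_snps))  # count each SNP at most once per split
    return sorted(snp for snp, n in counts.items() if n >= min_splits)

# Hypothetical top-SNP lists from four train-test splits (illustrative only).
splits = [
    ["rs429358", "rs7412", "rs6733839"],
    ["rs429358", "rs7412", "rs3851179"],
    ["rs429358", "rs6733839"],
    ["rs429358", "rs7412"],
]

robust = prioritise_snps(splits)
print(robust)  # ['rs429358', 'rs6733839', 'rs7412']
```

Raising `min_splits` to the number of repeats keeps only SNPs found in every split, which mirrors the stricter "every split" criterion reported for MB-MDR.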

Fig. 3: Association in ML models.

Uniform Manifold Approximation and Projection (UMAP) of raw (unscaled) SHAP values for GBM hits highlights that APOE alleles are identified and drive prediction (a). Neural networks and gradient boosting both rank the SNPs required to derive the e2 and e4 allele status for APOE as highest, unlike traditional GWAS (b). Values for neural networks and GBMs are not based on p-values, as described below, while GWAS p-values in (b, c) are from a two-sided Wald test in a logistic regression fitted in the training split. Manhattan plots are given for top hits only from gradient boosting (mean absolute SHAP values), neural networks (normalised network layer weights) and MB-MDRC 1d (−log10 p-values), where hits from different random splits of the models are shown in different colours, and all variants from a single GWAS on the train split are shown in greyscale for comparison (right-hand y-axis) (c). p-values for MB-MDRC 1d in (b, c) are derived from a two-sided permutation-based test as implemented in MB-MDR61,62. Hits from machine learning models (see Table 2) are enriched for known Alzheimer’s disease processes (d).

Table 1: 18 known loci prioritised by machine learning models
Table 2: 34 putative novel loci prioritised from machine learning models

Since ML may identify non-linear SNP-SNP interactions, pairwise interaction tests were performed for all top SNPs identified using machine learning models (Tables 1 and 2). Of 17,205 SNP-SNP pairwise interactions, 13 pairs were significant under a standard regression framework (encoded as a multiplicative interaction) after accounting for multiple testing and excluding pairs where both are within the APOE region. The two SNP-SNP pairs with the strongest evidence for association were between SNPs rs405509 and rs600550 (beta = 0.058, pFDR = 6.8 × 10−5), and SNPs rs405509 and rs12421663 (beta = −0.056, pFDR = 1.8 × 10−4), both of which involve SNPs in the APOE and MS4A* regions (Supplementary Data 6, Supplementary section 2.5).
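A multiplicative interaction in logistic regression takes the form logit P(AD) = b0 + b1·x1 + b2·x2 + b3·x1·x2, with the Wald test applied to b3. Two pieces of arithmetic around that test can be sketched. First, 17,205 is exactly the number of unordered pairs among 186 items, so the count is consistent with 186 top SNPs (an inference from the number, not a figure stated in the text). Second, the FDR adjustment is assumed here to be Benjamini-Hochberg, because the text does not name the procedure. The p-values below are illustrative.

```python
import math

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Number of unordered SNP-SNP pairs among 186 SNPs.
n_pairs = math.comb(186, 2)
print(n_pairs)  # 17205

# Illustrative p-values only; not taken from the study.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(p, 4) for p in adjusted])  # [0.02, 0.04, 0.04, 0.02]
```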

Overlap of genes associated with disease risk across methodologies

Known lead variants in APOE and BIN1 were the most important predictors in GBMs and NNs, with the SNPs used to derive APOE status (rs7412 and rs429358) identified and ranked as the most important SNPs by both GBMs and NNs but not by a GWAS in the training set (Fig. 3b). To compare the ML findings with GWAS results, we included SNPs not only identified in the training set in our study, but also all SNPs reported as genome-wide significant by larger meta-analyses (Supplementary Data 1) and applied the same gene annotation strategy as for the top variants from ML models (see “functional annotation and enrichment analysis”). In total, 130 genes were annotated, where more than one gene may be annotated to the same locus, and the same gene may be annotated to several independent loci. These genes correspond to 86 distinct loci reported in previous publications23,24,25. Of these, 19 loci were implicated by at least one ML methodology (Supplementary Fig. 8). APOE was found by all methods, seven loci (PICALM, IDUA, RASGEF1C, CLU, CR1, RTF2, RASA1) were found by GBMs and MB-MDRC 1 d, and two (ABCA7, BIN1) by MB-MDRC 1 d, GBMs and NNs. ABCA1 was only detected by NNs, and two (SPPL2A, ADAMTS1) only by GBMs (Table 1).

Motivated by published evidence suggesting that different variants are associated with the disease depending on APOE status, ML models were compared when trained with and without the APOE region in the same train-test split. Remarkably, GBMs without the APOE region found more known AD risk genes than when trained with APOE (see Supplementary section 2.6), but with lower AUCs (Supplementary Fig. 9).

When the list of SNPs is limited to those which could be identified in the current data (using a GWAS in the corresponding training splits), all GWAS-significant SNPs were prioritised by at least one ML approach (Fig. 4). Furthermore, ~84% of SNPs at the suggestive significance level (p ≤ 10−5) were retrieved by at least one ML approach.

Fig. 4: UpSet plot showing the overlap between ML and GWAS significant findings from the train part of the train-test split.

a Genes mapped by the SNPs highlighted by each ML approach separately. Both ML approaches and GWAS significant SNPs were identified in the same training split. b Genes that are shared among at least two train-test splits. All GWAS genome-wide significant (p ≤ 5×10−8) SNPs in the train split were also identified by ML approaches. For simplicity, genes within 500 kb of a known locus or with at least one overlapping gene with the region were annotated only by the locus, including MS4A6A, CSTF1, EPHX2, CNN2, TOMM40/NECTIN2/CLPTM1/BCL3/BCAM/APOC1/APOC2/APOC4 and DGKQ/FAM53A, which were mapped to MS4A*, CASS4, CLU, ABCA7, APOE and IDUA regions, respectively. Subplots were created using ComplexUpset version 1.3.3.

Enrichment of ML findings in biologically relevant gene sets

The SNPs prioritised by ML (Tables 1 and 2) showed significant enrichment in microglial (p = 0.0024) and astrocytic regions (p = 0.0083), but not synaptic regions (p = 0.117). All 68 genes from 52 loci reported in Tables 1 and 2 were further analysed for protein-protein interaction scores using the online tool STRING (Supplementary Fig. 10). Pathway analyses demonstrated enrichment in various gene-ontology (GO) pathways (Fig. 3d, Supplementary Data 7). As expected, GO pathways yielded a number of biological processes of interest to Alzheimer’s disease, including regulation of amyloid beta formation (pFDR = 0.0014) and regulation of amyloid precursor protein catabolic process (pFDR = 0.00023, Fig. 3d).

Discussion

In leveraging the largest genotyped case-control AD dataset, this study demonstrates that the current scale of data has reached a threshold where ML can achieve similar predictive accuracy to classical methods and uncover novel genetic insights into AD. Gradient boosting, neural networks, and multifactor dimensionality reduction were selected due to their prominence in the field, demonstrated performance, and complementary strengths. These methods offer distinct advantages: highly performant tree ensembles (GBMs), flexible networks incorporating prior knowledge (NNs), and detection of SNP-SNP interactions (MB-MDR)26. Here they were applied to identify novel genes not detectable via GWAS, and to compare ML prediction accuracies with PRS.

ML methods correctly identified the lead SNPs used to calculate the APOE haplotype, rs7412 and rs429358, as having the greatest impact on the model, in contrast to classical univariable linear models (traditional GWAS), which do not distinguish between top variants by p-value and rank rs7412 lower (Fig. 3c). PRS for AD risk prediction perform best when modelled with two predictors: PRS calculated without the APOE region and a separately coded weighting of the APOE e2 and e4 alleles27. Here, we demonstrate that ML approaches can accurately identify APOE clusters and achieve comparable prediction accuracy without including the derived APOE variable as a second predictor. ML approaches further detected the lead SNPs identified in large GWAS meta-analyses for several key genes, including BIN1 (rs6733839), PICALM (rs3851179), ABCA7 (rs3752246, a coding variant), CASS4 (rs6069736) and CR1 (rs4844610). More broadly, ML highlighted SNPs in around 22% of genes identified by larger GWAS meta-analyses23,24,25, while comparison to a GWAS undertaken on the same (EADB-core) split that was used to train the machine learning models showed that 100% of genome-wide significant SNPs (p ≤ 5 × 10−8) were retrieved by ML approaches. This demonstrates that the majority of findings expected under a linear additive GWAS paradigm can be prioritised using flexible machine learning models. Though strides have been made in ML-based prediction of AD from genetic data28, to our knowledge we present the first well-powered, genome-wide ML-based gene discovery study. It detects nearly a quarter of the known genes found in larger GWAS meta-analyses from the literature, which contain around twenty times the sample size, while also identifying multiple putative regions with credible associations to AD biology, marking an important benchmark for these methods.

Putative novel genes which replicate in an independent GWAS consist of ARHGAP25, COG7, LINC00924/LOC105369212, LY6H, SOD1 and ZNF597, which have potential relevance to AD. ARHGAP25 encodes the Arhgap25 protein which is expressed in macrophages where it affects phagocytosis29 through modulation of the actin cytoskeleton30. Wu et al.31 demonstrated that Ly6h is among the proteins competing to bind to α7 subunits of nicotinic acetylcholine receptors (nAChRs), which are expressed throughout the brain and enable fast cholinergic transmission at synapses. These proteins function collectively to maintain the optimal α7 assembly required for neuronal function and viability. This delicate balance is disrupted during Alzheimer’s disease due to Aβ-driven reduction in Ly6h. Notably, an increase in Ly6h in human cerebrospinal fluid correlates with elevated Alzheimer’s disease severity31.

COG7 (Component of Oligomeric Golgi Complex 7) encodes a protein integral to Golgi apparatus function, which is essential for protein glycosylation and trafficking. Disruptions in Golgi function have been implicated in various neurodegenerative disorders, including Alzheimer’s disease. Mutations in COG7 are associated with Congenital Disorders of Glycosylation (CDG)32, which frequently manifest with neurological impairments. Abnormal glycosylation processes have been implicated in Alzheimer’s disease pathology, particularly affecting tau protein processing and amyloid precursor protein (APP) metabolism33.

ML hits which replicate also include the missense variant rs2306331 (AP4E1). This is within the known locus of SPPL2A (maximum r2 0.32, minimum distance 177 kb between rs2306331 and GWAS catalogue hits for SPPL2A), but may be an independent signal within the region. The AP4E1 gene encodes the ε subunit of adaptor protein complex-4 (AP-4), which facilitates the transport of amyloid precursor protein (APP) from the trans-Golgi network to endosomes. Disruption of the APP-AP-4 interaction enhances γ-secretase-catalysed cleavage of APP to amyloid-β peptide, suggesting that AP-4 deficiency may constitute a potential risk factor for Alzheimer’s disease34.

We also note that ML-derived hits in the known loci IDUA and DGKQ (rs4690324, linked via sQTLs in GTEX to splicing of IDUA in the brain35) are related to heparan sulfate metabolism36, in addition to multiple brain diseases37, traits38, and lipid metabolism39. Neural networks also highlight a new potential AD-relevant locus. All GO-derived pathways, independent of association to AD, were used in the neural network analysis as hidden layers. This approach highlighted SNPs mapping to SOD1 (cytogenetic band 21q22.11), found in the region of APP (21q21.3), with lead SNPs from NNs around 5 Mb away from the closest NN-based hits in APP. SOD1 has been widely investigated for its role in the antioxidant defense system, showing impaired expression in AD patients40,41, but is currently primarily implicated in ALS.

In our study, we adopted an MB-MDR approach to detect both SNP main effects and pair-wise SNP-SNP interactions. The method is an improvement over classical pair-wise statistical interaction approaches as it can detect more complex interactions. However, it implements a somewhat conservative strategy for variant discovery, as interaction studies are hampered by several factors that can increase the number of false positives including, but not limited to, LD and interference of major loci, which may contribute to phantom epistasis. To address this limitation, we opted for a model-based (MB) form of MDR, though the results did not highlight novel loci.

Comparing individual scores, ML predictions achieved correlations of 0.8-0.84 with PRS, indicating that ML models give predictions which are broadly consistent with well-established approaches. It also shows that identification of novel signals through flexible modelling of complex effects will introduce deviations from the predictions of a simple linear additive model. Disease risk prediction from ML, as assessed by AUC, was higher (but not significantly) than PRS. Similar results have been reported in psychiatry21 and coronary artery disease42. This is likely to be due to several reasons. First, SNPs in general are (at best) only correlated with the causal variants, making it particularly difficult to detect nonlinear effects and interactions, which are the main potential advantages of ML over PRS. Second, genetic predictors are weak as compared to some other predictors (e.g. biomarkers43), and the upper bound for AUC in complex trait genetics in practice falls substantially below 144. Weak predictor-response relationships are an inherent challenge to finding patterns with flexible models, and complex models may at present still be under-powered to achieve a clear improvement in AUC. Third, large GWAS discover and replicate SNPs, which in the resulting summary statistics show small association effect sizes. The effect sizes may however be higher in more homogeneous samples. For example, the OR for APOE is around 3.4 in samples of mean age 72−73 years45 but is reduced in samples over 90 years46. In pathologically-confirmed samples, which are also generally older than clinical samples, some of the GWAS-derived SNP effect sizes are higher than reported in clinically-assessed AD GWAS47. In samples that are more homogeneous in terms of age, population, and cognitive scores (such as The Alzheimer’s Disease Neuroimaging Initiative (ADNI)48), AD PRS AUCs are higher than those in clinical samples49,50. 
Thus, summary statistics from large GWAS meta-analyses enable PRS which predict with moderate AUC in any data set, but which do not achieve high accuracy as effect sizes are averaged across many studies with slightly different features such as recruitment criteria and outcome definitions, and therefore genetic architectures, rather than being specific to a particular one.

A similar situation affects variant discovery from flexible models which can detect interactions: nuanced relationships between predictors are notoriously difficult to replicate51, a situation further complicated by varying LD between tagged and causal variants. Such SNPs may nonetheless represent important loci which impact disease risk in specific contexts without being consistently associated across enough studies and circumstances to reach genome-wide significance under linear models in meta-analyses. While machine learning models here do not show signs of overfitting, and top SNP rankings by SHAP values are consistent in both the train and test splits, many SNPs identified do not replicate in an external dataset, which is also the case for standard GWAS. In particular, in PRS approaches using different priors (for Bayesian models) or LD pruning parameters (in clumping and thresholding approaches), the resulting set of SNPs and their estimated effects will differ27,52. Similarly, when effects of SNPs are jointly estimated in sparse, high-dimensional ML models, either simultaneously or in a stage-wise manner (as in gradient boosting), different top associations and predictions are expected. Despite these caveats, we show that the novel findings suggest SNPs which are biologically relevant to AD.

This study has a number of limitations. First, our study attempted to run and compare reasonably diverse methods for predicting AD risk, with the advantage of implementing them in a unified dataset. As a result, we found that PRS and ML showed similar prediction performance when ranking individuals according to their prediction scores. The predictions from ML and PRS, however, were highly correlated, explaining the similarity of the prediction accuracies and reducing the likelihood that an ensemble of the different methods would improve prediction. Second, the methods used for selection of top SNPs in interpreting model results reflect inherent differences in how the ML approaches work, and no standard thresholds exist at the moment for identifying important features across these distinct frameworks. This methodological variability, which is a broader challenge in the field, likely contributes to the incomplete overlap between the top SNPs identified by different ML approaches, making replication in external summary statistics a critical step to ensure robust findings. Third, the influence of APOE status on model outcomes is a key component of all models in the study. While APOE SNPs were included as predictors, we did not conduct analyses stratified by APOE carrier status, especially ε3/ε3, ε3/ε4, or ε4/ε4 carriers. Future work should explore whether predictive accuracy and associations differ meaningfully across these strata. Finally, although our post-hoc analyses indicate that cohort or genotyping centre effects do not drive associations or predictions, alternative machine learning approaches such as federated learning (FL) may improve modelling where underlying predictor distributions diverge53, while also allowing for models which include data from more cohorts without diminishing privacy. 
Indeed, within a central learning paradigm, current best practices for data harmonisation involve merging datasets based on shared high-quality SNPs that have undergone rigorous quality control, ensuring high imputation accuracy, reasonably large minor allele frequencies (MAF), and other metrics in each separate study. However, these QC steps alone do not guarantee that differences arising from ancestry, clinical assessment criteria, and genotyping chips or batches used to generate the data are fully accounted for. Applying flexible ML within an FL framework leverages the advantages of client-specific data, which is often more homogeneous as it is typically generated locally, within the same population, using similar clinical assessments and the same genotyping chip. Results from each “client” can then inform analyses for other clients by tuning parameters (e.g., increasing or decreasing neural network weights for variants in specific genomic regions), creating a more flexible and adaptive analysis framework.

In conclusion, this study demonstrates that machine learning can uncover both known and novel genetic loci for AD, providing a powerful complement to traditional GWAS while still achieving competitive predictive performance. Though replication challenges remain, ML successfully prioritised established risk loci and identified biologically relevant novel associations, including variants in ARHGAP25, COG7, LY6H, and SOD1. These findings highlight ML’s potential to refine our understanding of AD beyond additive genetic effects and expand the toolkit available for maximising discovery from available data. With expanding datasets and computational advances, ML could further enhance risk prediction and gene discovery, particularly through federated learning and multi-omic integration. This study marks a key step toward leveraging ML for deeper insights into AD genetics.

Methods

Data

Data were obtained from the European Alzheimer & Dementia Biobank (EADB) consortium, which combines genetic and clinical data from 16 countries and has been described previously23. All study protocols were reviewed and approved by the respective institutional review boards overseeing the cohorts (see supplementary information for details). Individuals were genotyped at three centres. Data were accessed after quality control procedures and data harmonisation were applied to give the EADB-core sample. All participants are unrelated individuals of European ancestry, encompassing 20,013 clinically defined AD cases and 21,673 controls, after excluding participants present in Kunkle et al.45. 59% of the sample is female, split as 62% in cases and 57% in controls, with a median age at baseline of 73 (IQR = 14). Data and splitting procedures are described further in Supplementary section 1.1. Informed consent was obtained in writing from all study participants. For individuals with significant cognitive impairment, consent was secured from a caregiver, legal guardian, or other authorized proxy.

Quality control (QC)

Analyses used directly genotyped (non-imputed) variants to ensure high data quality. To avoid excluding key known AD loci not covered in the genotyped data, we also included 67 imputed variants in previously reported genome-wide significant loci23,24,25 (Supplementary Fig. 1) which were not already present in the genotyped data, converting the imputed dosages to the most probable (probability ≥ 0.9) genotypes, and applying further quality control as described previously23. Data were further filtered for a minor allele frequency of 5%, to ensure variants were common enough to reliably observe interactions. SNPs were clumped (R2 = 0.75 following Joiret et al.54, window = 1 Mb) using stage 1 summary statistics from Kunkle et al.45, after removing individuals common between the summary statistics and the genotyped data. Out of 81 SNPs previously reported as genome-wide significant, 52 survived minor allele frequency (MAF) and clumping procedures (Supplementary Fig. 1; Supplementary Data 1), giving a combined 215,193 SNPs (Supplementary section 1.1). Analyses were also run without inclusion of the previously reported 67 imputed variants to confirm that their inclusion did not artificially inflate performance estimates.
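Two of the QC steps above can be sketched in a few lines: converting imputed genotype probabilities to the most probable genotype when its probability is at least 0.9, and filtering on a 5% minor allele frequency. This is a simplified illustration under stated assumptions: the function names are hypothetical, genotypes are coded as 0, 1 or 2 copies of the alternate allele, calls below the probability threshold are treated as missing, and the MAF filter is taken to be inclusive (MAF ≥ 0.05). The study's actual tooling is not reproduced here.

```python
def hard_call(genotype_probs, threshold=0.9):
    """Return the most probable genotype (0, 1 or 2), or None if it is not confident enough."""
    best = max(range(3), key=lambda g: genotype_probs[g])
    return best if genotype_probs[best] >= threshold else None

def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 genotype codes, ignoring missing (None) calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)

def passes_maf(genotypes, min_maf=0.05):
    return minor_allele_frequency(genotypes) >= min_maf

# Illustrative genotype probabilities (P(0), P(1), P(2)) for four individuals.
probs = [(0.95, 0.04, 0.01), (0.05, 0.92, 0.03), (0.50, 0.45, 0.05), (0.00, 0.02, 0.98)]
calls = [hard_call(p) for p in probs]
print(calls)  # [0, 1, None, 2]
```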

Statistics and reproducibility

No statistical method was used to predetermine sample size, as there are no standardised calculations for machine learning. For consistency, all implemented approaches were evaluated using the same QC and random train-test splits of the data (Fig. 1) which were well balanced for case-control status, age-at-baseline or assessment, sex, genotyping centre, and the distribution of all principal components (Supplementary Fig. 2). Participants were randomly separated into 70–30% train-test splits, with the same split applied to all algorithms, resulting in 29,180 individuals in the training set (14,006 cases; 15,174 controls) and 12,506 for testing (6007 cases; 6499 controls), each with 215,193 predictors after quality control procedures. ML models were built with and without SNPs from the APOE region (Chr19:44.4–46.5 Mb). In training ML and PRS models, analyses were adjusted for covariates comprising genetic sex, 20 principal components (PCs), and genotyping centre. The adjustment method was tailored to each modelling approach: covariates were included in the final layer for NNs, with predictions and importance scores taken from non-covariate nodes (see Supplementary Fig. 3); for GBMs and MB-MDR, covariates were z-transformed and then regressed out of the data before modelling. All reported area under the receiver operating characteristic curve (AUC) values are calculated on the predicted probabilities from models (without thresholding) and adjusted again for confounders in the test split. We used penalisation and cross-validated random (GBMs) or grid-based (NNs and MB-MDR) hyperparameter search to reduce the likelihood of overfitting. Details for training and covariate adjustment are given in Supplementary sections 1.2-1.4. To ensure robust results, stability was assessed by re-running all models on three additional random 70–30% train-test splits of the data.
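The split sizes and the threshold-free AUC described above can be illustrated with a short sketch. It is a simplification: the function names and seed are hypothetical, the split is a plain seeded shuffle without the balance checks on covariates, and the AUC is computed on raw scores without the confounder adjustment applied in the study. The AUC is written as the probability that a randomly chosen case scores above a randomly chosen control, which is quadratic in sample size and intended for illustration only.

```python
import random

def split_ids(ids, train_percent=70, seed=0):
    """Shuffle participant IDs with a fixed seed and cut them into train and test sets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100  # integer arithmetic avoids float rounding
    return shuffled[:n_train], shuffled[n_train:]

def auc(labels, scores):
    """AUC as P(case score > control score), counting ties as 0.5."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))

# 20,013 cases + 21,673 controls = 41,686 participants.
train_ids, test_ids = split_ids(range(41686))
print(len(train_ids), len(test_ids))  # 29180 12506

# Toy predicted probabilities (illustrative only).
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Reusing the same seed for every model reproduces the "same split applied to all algorithms" design, and changing the seed gives the additional repeats used to assess stability.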

Gradient boosting machines (GBMs)

GBMs were trained using version 1.7.6 of the XGBoost package55, an efficient implementation of regularised gradient boosting, and the dask package56, version 2023.1.1, which allows for distributed training of ML models across multiple nodes in a high-performance computing (HPC) cluster. Hyperparameters for learning rate, tree depth, and column sampling fraction were tuned on a random subsample of the training set using random search (Supplementary section 1.2). Importance of all predictors in GBMs was assessed using SHapley Additive exPlanations (SHAP) values57,58.
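Once per-individual SHAP values are available, predictors can be ranked by their mean absolute SHAP value, the quantity plotted for GBMs in Fig. 3c. The sketch below shows only that aggregation step on a small hypothetical matrix (rows are individuals, columns are SNPs); the SNP names and values are invented. How the matrix is obtained is outside the sketch: in XGBoost, one option is `Booster.predict(..., pred_contribs=True)`, though whether the study used that route is an assumption.

```python
def rank_by_mean_abs_shap(shap_values, snp_ids):
    """Rank SNPs by mean absolute SHAP value across individuals (largest first)."""
    n = len(shap_values)
    mean_abs = {
        snp: sum(abs(row[j]) for row in shap_values) / n
        for j, snp in enumerate(snp_ids)
    }
    return sorted(mean_abs.items(), key=lambda item: item[1], reverse=True)

# Hypothetical SHAP matrix: 3 individuals x 3 SNPs (illustrative values only).
shap_values = [
    [0.50, -0.10, 0.00],
    [-0.30, 0.20, 0.05],
    [0.40, -0.05, -0.05],
]
snp_ids = ["snp_a", "snp_b", "snp_c"]

ranking = rank_by_mean_abs_shap(shap_values, snp_ids)
print([snp for snp, _ in ranking])  # ['snp_a', 'snp_b', 'snp_c']
```

Taking absolute values before averaging matters: a SNP whose contributions are large but of opposite sign in different individuals would otherwise average towards zero and be ranked as unimportant.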

Neural networks (NNs)

NN models were built with GenNet59, which uses a biologically driven configuration in which the connections between the input layer, representing SNPs, and the first hidden layer, representing genes, are defined using knowledge available in annotation databases. NN architectures were built with nine hidden layers, including an initial SNP-to-gene layer annotating all SNPs to the nearest gene using annotations from ANNOVAR60, followed by layers defined from the hierarchy of terms from the Gene Ontology consortium (GO terms), where connections between layers progress from local pathways to more general ones deeper in the network, using all available pathway annotations (Supplementary Fig. 3, Supplementary Data 2). Model hyperparameters for batch size, learning rate (LR), and L1 penalisation (the default penalty) were tuned during training (Supplementary section 1.3).
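
The idea of an annotation-defined, sparsely connected SNP-to-gene layer can be illustrated with a binary connectivity mask. This sketch does not use the GenNet API; the six-SNP annotation, the random weights, and the tanh activation are assumptions chosen only to show the mechanism.

```python
import numpy as np

# Hypothetical annotation: SNP index -> index of its nearest gene.
snp_to_gene = np.array([0, 0, 1, 2, 2, 2])
n_snps = snp_to_gene.size
n_genes = int(snp_to_gene.max()) + 1

# Binary mask: mask[i, j] = 1 only where SNP i is annotated to gene j.
mask = np.zeros((n_snps, n_genes))
mask[np.arange(n_snps), snp_to_gene] = 1.0

rng = np.random.default_rng(0)
weights = rng.normal(size=(n_snps, n_genes))

def gene_layer(genotypes, weights, mask):
    """Forward pass in which only annotated SNP-gene connections contribute."""
    return np.tanh(genotypes @ (weights * mask))

# Four toy individuals with genotypes coded 0/1/2.
genotypes = rng.integers(0, 3, size=(4, n_snps)).astype(float)
gene_activations = gene_layer(genotypes, weights, mask)
```

Because unannotated connections are multiplied by zero, each gene node depends only on its own SNPs, which is what allows weights to be read at the level of genes and pathways.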

Model-based multifactor dimensionality reduction (MB-MDR)

A multifactor dimensionality reduction strategy was implemented using the MB-MDR methodology61,62 with the MBMDR 4.4.1 software. When searching for disease-susceptibility multi-locus genotypes, adjusted for confounders, an approximation routine was used to accelerate permutation-based significance assessment and multiple-testing correction63. Single and interacting variants under various pair-wise and higher-order epistasis models were combined to create multi-locus risk scores (MB-MDRC) and estimate an individual’s susceptibility to a trait with the MBMDRClassifieR package available in R64. SNPs and SNP-SNP interactions were included in the MB-MDR risk score if they passed the permutation-based, multiple-testing-corrected threshold that gave the best performance in the train split (Supplementary section 1.4).

Polygenic risk scores (PRS)

PRS were calculated with LDAK-Bolt-Predict from the LDAK package, reported to be the most predictive heritability- and linkage disequilibrium-informed PRS method when individual-level genotypes are available65. LDAK-Bolt-Predict reweights variants using Gaussian priors informed by heritability models, shrinking effect sizes without the need for p-value thresholding65. To give the same information to the PRS and all machine learning models, PRS were derived using summary statistics generated by running a genome-wide association study (GWAS) in the training set with PLINK v2.00a3.3LM66, adjusting for the same confounders (Supplementary section 1.5).
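
Once effect sizes have been estimated, a polygenic score is a weighted sum of allele counts. The sketch below shows only this generic scoring step, not the LDAK-Bolt-Predict estimation procedure; the effect sizes and genotypes are hypothetical.

```python
import numpy as np

# Hypothetical per-allele effect sizes (log-odds scale) for three SNPs.
effect_sizes = np.array([0.12, -0.05, 0.30])

# Allele counts (0/1/2) for three toy individuals (rows) and three SNPs (columns).
genotypes = np.array([
    [0, 1, 2],
    [2, 0, 0],
    [1, 1, 1],
], dtype=float)

# Score for individual i: sum over SNPs j of effect_sizes[j] * genotypes[i, j].
prs = genotypes @ effect_sizes
```

The additive form makes explicit the assumption, noted in the Introduction, that variants combine linearly with no interactions.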

Selection of top predictors

The top SNPs identified by machine learning were selected separately for each method: for MB-MDR, by the permutation-based p-value threshold defined above (padj < 1); for gradient boosting, empirically by taking the extreme tail of the distribution of mean absolute SHAP values (mean |SHAP| > 0.0005; Supplementary Fig. 4) and by applying the Boruta algorithm67; and for neural networks, by summed weights (padj < 0.05; Supplementary Fig. 5). Only SNPs present among the top SNPs for at least two random train-test splits of a given method were prioritised. Details on predictor selection are given in Supplementary sections 1.6 and 2.2–2.4.
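
The SHAP-based selection and the two-split prioritisation rule can be sketched as below. The threshold of 0.0005 and the minimum of two splits come from the text; the SNP identifiers, the toy SHAP matrices, and the function names are hypothetical.

```python
from collections import Counter
import numpy as np

SHAP_THRESHOLD = 0.0005  # mean |SHAP| cut-off stated in the text

def top_snps(shap_values, snp_ids, threshold=SHAP_THRESHOLD):
    """Return SNPs whose mean absolute SHAP value exceeds the threshold."""
    mean_abs = np.abs(shap_values).mean(axis=0)  # average over individuals
    return {snp for snp, value in zip(snp_ids, mean_abs) if value > threshold}

def prioritise(top_sets, min_splits=2):
    """Keep SNPs that appear in the top set of at least `min_splits` splits."""
    counts = Counter(snp for top in top_sets for snp in top)
    return sorted(snp for snp, n in counts.items() if n >= min_splits)

snp_ids = ["rsA", "rsB", "rsC"]  # placeholder identifiers
# One toy SHAP matrix per split: rows are individuals, columns are SNPs.
shap_by_split = [
    np.array([[0.002, 0.0001, -0.001], [-0.002, 0.0002, 0.001]]),
    np.array([[0.001, 0.0, 0.0002], [0.001, 0.0, -0.0002]]),
    np.array([[0.0, 0.003, 0.0009], [0.0, -0.003, 0.0009]]),
]

top_sets = [top_snps(values, snp_ids) for values in shap_by_split]
prioritised = prioritise(top_sets)
```

Here `rsA` and `rsC` each pass the threshold in two splits and are retained, whereas `rsB` passes in only one and is dropped.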

Replication of top predictors

The prioritised loci were reported as novel if they were identified in more than one split of a machine learning model, replicated at the p ≤ 0.05 significance level in an external independent dataset (Jansen et al.22), and had no genome-wide significant hits in AD summary statistics from the GWAS Catalog. The direction of effect is not available for ML approaches, because a given SNP may have different associations with the outcome in different subgroups; consequently, it cannot be reported directly as in standard GWAS.

Functional annotation and enrichment analysis

Publicly available tools were used for further annotation. SNPs were annotated with dbSNP build 15668. SNPs were initially positionally mapped to genes using ANNOVAR60 version 2020-06-07 and Gencode v40, and then reannotated using functional evidence from the Open Targets Genetics portal where available69. Genome coordinates use build GRCh38.p14. Pathway enrichment and protein-protein interactions among the consensus annotated genes were determined using STRING v12.070 (Supplementary section 1.7).

The list of top SNPs within each ML analysis was tested for enrichment in the lists of genes expressed in microglia (n = 761)71, astrocytes (n = 757)72, and synapses (n = 1535)73, with a window of 35 kilobases (kb) upstream and 10 kb downstream of each gene region to include regulatory elements. To account for non-independent SNPs in the genomic regions, enrichment p-values were derived using a bootstrap approach (Supplementary section 1.8) and are presented without multiple-testing correction for the number of cell types tested.
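
Mapping SNPs to gene windows of 35 kb upstream and 10 kb downstream can be sketched as below. Treating "upstream" in a strand-aware way is an assumption, as the text does not state how strand was handled, and the coordinates are hypothetical.

```python
UPSTREAM_BP = 35_000    # window upstream of the gene, from the text
DOWNSTREAM_BP = 10_000  # window downstream of the gene, from the text

def gene_window(start, end, strand):
    """Return (lower, upper) bounds of the extended region around a gene.

    Assumes strand-aware extension: upstream lies before `start` on the
    '+' strand and after `end` on the '-' strand.
    """
    if strand == "+":
        return max(0, start - UPSTREAM_BP), end + DOWNSTREAM_BP
    return max(0, start - DOWNSTREAM_BP), end + UPSTREAM_BP

def snp_in_window(position, start, end, strand):
    """True if a SNP position falls inside the extended gene region."""
    lower, upper = gene_window(start, end, strand)
    return lower <= position <= upper

# Hypothetical gene spanning 100,000-120,000 bp.
plus_window = gene_window(100_000, 120_000, "+")
minus_window = gene_window(100_000, 120_000, "-")
```

A SNP list can then be reduced to the set of genes whose windows contain at least one SNP before testing overlap with each cell-type gene list.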

Statistical interactions

The ML methods used may capture interactions but do not explicitly test for them. The top SNPs from all ML approaches (Tables 1 and 2) were therefore formally tested for pair-wise interactions in the whole dataset under a regression framework, assuming a multiplicative relationship between SNPs, i.e. logit(P(y = 1)) = β0 + β1·SNP1 + β2·SNP2 + β3·(SNP1 × SNP2), where SNPs are coded additively (0, 1, 2) with zero-count compensation, and covariates are included in the model. Pairs were assessed by the p-value for the interaction term (β3) after adjusting for multiple comparisons, using a Benjamini-Hochberg false discovery rate (FDR) threshold of 0.05 for significance (Supplementary section 1.9). Putative SNP interactions were visualised in Python using shap 0.4158, statsmodels 0.13.574, matplotlib 3.7.175, and seaborn 0.12.276. The overall workflow is presented in Fig. 1.
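
Two pieces of the interaction analysis lend themselves to a short sketch: building the design matrix with a multiplicative SNP1 × SNP2 term, and applying the Benjamini-Hochberg adjustment to the resulting interaction p-values. The function names and toy values are hypothetical, the zero-count compensation is not implemented, and the logistic model fit itself (for example with statsmodels) is omitted.

```python
import numpy as np

def interaction_design(snp1, snp2, covariates):
    """Columns: intercept, SNP1, SNP2, SNP1*SNP2, then the covariates."""
    snp1 = np.asarray(snp1, dtype=float)
    snp2 = np.asarray(snp2, dtype=float)
    return np.column_stack([np.ones(len(snp1)), snp1, snp2, snp1 * snp2, covariates])

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(k) * m / k
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Toy interaction-term p-values for four SNP pairs (hypothetical).
interaction_p = [0.01, 0.04, 0.03, 0.20]
p_adjusted = bh_adjust(interaction_p)
significant = p_adjusted < 0.05  # FDR threshold from the text

# Toy design matrix for three individuals and two covariates.
design = interaction_design([0, 1, 2], [2, 1, 0], np.zeros((3, 2)))
```

In this toy example only the first pair remains significant after adjustment, because the second and third pairs share an adjusted value just above 0.05.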

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.