Cost-effective non-additive GWAS across 2329 diseases in 500,349 individuals

Molotkov, Ivan; Kurki, Mitja; Palotie, Aarno; Daly, Mark J.; Artomov, Mykyta

doi:10.1038/s41467-025-67277-4

Article
Open access
Published: 13 December 2025

Nature Communications volume 17, Article number: 580 (2026)

Abstract

Drug candidates supported by genetic evidence are more likely to succeed in clinical trials, with genome-wide association studies (GWAS) providing a key source of such evidence. Standard GWAS approaches assume additive effects of alleles on the phenotype, but non-additive models have also successfully identified novel associations across various traits. Despite their potential, the large-scale application of non-additive GWAS across thousands of phenotypes in biobanks has been limited by high computational costs. To address this challenge, we present a method that leverages the correlation between additive and non-additive p-values to prioritize variants likely to reach genome-wide significance in non-additive analyses. Applied to the FinnGen dataset comprising 500,349 individuals and 2329 phenotypes, this method reduces computational costs by three orders of magnitude while retaining nearly all true non-additive associations, identifying 781 novel loci missed by additive GWAS. We report fine-mapping and colocalization with 571 datasets for novel loci, uncovering likely causal variants and potential insights into biological mechanisms.

Introduction

Translating basic scientific research into clinically tested drugs is challenging due to the high clinical trial failure rate¹. Genetic evidence significantly improves the likelihood of success in clinical trials, with candidate drugs supported by such evidence being 2.6 times more likely to gain approval². This evidence is often derived from genome-wide association studies (GWAS), which identify associations between genetic and phenotypic variation³.

The most common statistical model used in GWAS assumes that the impact of biallelic genetic variants is additive, meaning that the effect on a phenotype is proportional to the number of effect alleles³. Beyond additive, genetic variants can also exhibit dominant and recessive effects on traits. A recessive effect requires two copies of the effect allele to influence the trait, meaning heterozygotes typically do not show the effect. In contrast, in a dominant effect, just one copy of the risk allele is sufficient to produce the full effect, with no substantial additional impact from having two copies. Accordingly, recessive and dominant statistical models are used in GWAS to detect recessive and dominant effects⁴.
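The three models differ only in how a genotype is turned into a predictor. The following minimal sketch illustrates the standard encodings; the function name `encode_genotype` is hypothetical and is not taken from the study's code.

```python
def encode_genotype(effect_allele_count: int, model: str) -> int:
    """Map the number of effect alleles (0, 1 or 2) to the model's predictor."""
    if effect_allele_count not in (0, 1, 2):
        raise ValueError("effect_allele_count must be 0, 1 or 2")
    if model == "additive":
        return effect_allele_count             # effect scales with allele count
    if model == "dominant":
        return int(effect_allele_count >= 1)   # one copy gives the full effect
    if model == "recessive":
        return int(effect_allele_count == 2)   # two copies are required
    raise ValueError(f"unknown model: {model}")


for model in ("additive", "dominant", "recessive"):
    print(model, [encode_genotype(count, model) for count in (0, 1, 2)])
```

Heterozygotes are the only genotype whose coding changes between models, which is why the models differ most in power when heterozygotes are common.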

GWAS based on non-additive models can identify genetic associations that might be missed by the additive model alone, leading to a more complete understanding of genetic variation influencing the risk of complex traits and diseases⁵. For example, non-additive models have successfully identified novel associations in type 2 diabetes⁶, aging-related diseases⁵, as well as other traits^7,8,9. Thus, a potential strategy for studying the genetic architecture of a disease could be to first conduct an additive GWAS, followed by a complementary non-additive analysis. However, many genome-wide significant associations detected through non-additive models can also be captured by additive models, although with decreased statistical power¹⁰. This overlap means that while both analyses would incur similar computational costs, a substantial portion of the resources allocated to non-additive analysis would be spent rediscovering associations already detected in prior additive GWAS. This computational inefficiency has limited the widespread use of non-additive analyses in large-scale biobanks such as FinnGen¹¹ or UK Biobank¹², where their application has been largely restricted to rare and coding variants^13,14. Consequently, many non-additive associations have remained undetected⁵.

In this study, we present an approach to leverage the overlap in statistical power between additive and non-additive models to make non-additive GWAS substantially more computationally efficient at identifying new associations. Specifically, we utilize the fact that variants lacking any signal in additive GWAS are highly unlikely to reach genome-wide significance in non-additive analysis. Therefore, such variants can be safely excluded from non-additive GWAS based on the readily available additive summary statistics. We derived precise filtering criteria that preserve nearly all novel non-additive associations while reducing the number of analyzed variants by several orders of magnitude. Using this approach, we conducted a non-additive analysis of 2329 phenotypes in 500,349 individuals from FinnGen using the dominant model. The computational cost of the analysis was less than $40, compared to an estimated $27,000 without pre-filtering. We identified 781 new associated loci missed by the additive model. Potential mechanisms underlying these effects were further investigated via fine-mapping and colocalization across 571 resources. We make all these data publicly available and demonstrate that our approach is straightforward to implement at a biobank scale, offering detailed instructions for its application.

Results

Overview of the cost-efficient approach to non-additive GWAS

Additive and non-additive GWAS models assume different patterns for how genotypes affect a trait. However, the p-values they generate for genotype-phenotype associations are not independent. If a variant is genome-wide significant in a non-additive model, it typically also shows a low p-value in the additive model (Fig. 1A, B). Conversely, variants with high additive p-values (close to 1) are rarely genome-wide significant in non-additive analyses (Fig. 1C, D). Because of this relationship, we hypothesized that additive p-values can be used to filter out variants that are unlikely to be significant in non-additive GWAS. Specifically, we propose setting an allele-frequency-dependent threshold and then eliminating all variants with additive p-values above this threshold prior to conducting subsequent non-additive analysis (Fig. 1C, D).

**Fig. 1: Using correlation between additive and non-additive p-values for efficient non-additive GWAS.**

The process of selecting an additive p-value threshold aims to achieve two main goals: reducing the number of variants to lower computational costs and retaining as many novel non-additive associations as possible. However, limited empirical data from large-scale non-additive analyses makes it difficult to estimate the fraction of novel associations retained for a given threshold. To address this, we simulated variants with true non-additive effects to study the relationship between their additive and non-additive p-values (Fig. 1C, D; Methods, Statistical framework for simulating additive and non-additive p-values). We then focused on simulations representing novel non-additive associations – those that are genome-wide significant in non-additive models but not in additive models – and analyzed the distribution of their additive p-values.

From this analysis, we defined the “discovery rate” as the fraction of novel non-additive associations with additive p-values below a given threshold (i.e., those that would not be filtered out). To achieve a target discovery rate of P%, we set the additive threshold for each allele frequency at the P-th percentile of the additive p-value distribution among these novel associations. This ensures that exactly P% of the novel associations are retained (Fig. 1C, D; Methods: Derivation of additive p-value thresholds for variant pre-filtering).
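The percentile rule can be sketched as follows, assuming simulated variants are available as `(p_add, p_nonadd)` pairs. The function names and the nearest-rank percentile convention are assumptions made for illustration; the study may use a different percentile definition.

```python
GWS = 5e-8  # genome-wide significance level


def novel_additive_pvalues(pairs, gws=GWS):
    """Additive p-values of variants significant only in the non-additive model."""
    return sorted(p_add for p_add, p_nonadd in pairs if p_nonadd < gws < p_add)


def additive_threshold(pairs, percent, gws=GWS):
    """Smallest cutoff retaining at least `percent`% of novel associations."""
    novel = novel_additive_pvalues(pairs, gws)
    if not novel:
        raise ValueError("no novel non-additive associations in the input")
    rank = max(1, -(-percent * len(novel) // 100))  # integer ceiling
    return novel[rank - 1]


def discovery_rate(pairs, threshold, gws=GWS):
    """Fraction of novel associations with an additive p-value at or below the cutoff."""
    novel = novel_additive_pvalues(pairs, gws)
    return sum(p <= threshold for p in novel) / len(novel)


# Hypothetical simulated variants: (additive p-value, non-additive p-value).
pairs = [(1e-6, 1e-9), (1e-5, 1e-9), (1e-4, 1e-9), (1e-3, 1e-9), (1e-2, 1e-9),
         (1e-9, 1e-10),   # already significant in the additive model
         (0.5, 0.3)]      # not significant in either model
cutoff = additive_threshold(pairs, percent=80)
print(cutoff, discovery_rate(pairs, cutoff))
```

With these toy inputs, five variants count as novel, and the 80% cutoff keeps the four with the smallest additive p-values.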

Each simulation for a variant with a particular allele frequency requires specifying parameters such as the GWAS sample size, disease prevalence, and the true effect size distribution. Because some of these parameters (like the effect size distribution) may not be known in advance, we generated thresholds under multiple assumptions and then selected the most permissive threshold that still maintained the desired discovery rate across all scenarios (Methods: Derivation of additive p-value thresholds for variant pre-filtering).
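One reading of this selection step is that, for each allele frequency, the largest (most permissive) cutoff among the per-scenario thresholds is kept, since the retained fraction can only grow as the cutoff increases. The sketch below follows that reading; the scenario labels and cutoff values are hypothetical.

```python
def most_permissive_thresholds(per_scenario):
    """Combine per-scenario cutoffs into one cutoff per allele frequency.

    `per_scenario` maps a scenario label to {allele_frequency: additive cutoff}.
    Taking the maximum keeps at least the target discovery rate in every scenario.
    """
    allele_freqs = set()
    for thresholds in per_scenario.values():
        allele_freqs.update(thresholds)
    return {
        af: max(t[af] for t in per_scenario.values() if af in t)
        for af in sorted(allele_freqs)
    }


# Hypothetical cutoffs for two scenarios and two allele frequencies.
per_scenario = {
    "prevalence=0.01, sigma=0.02": {0.1: 2e-4, 0.3: 5e-4},
    "prevalence=0.20, sigma=0.10": {0.1: 8e-4, 0.3: 1e-4},
}
print(most_permissive_thresholds(per_scenario))
```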

Simulations can also be easily extended to continuous traits (Supplementary Materials: Simulating GWAS p-values for continuous traits). The resulting additive p-value thresholds are almost identical for binary and continuous traits: the median absolute difference on the log10 scale for recessive and dominant analyses is 0.10 and 0.22, respectively (Supplementary Fig. 1). Similarly, linear and logistic regression produce almost identical p-values for additive GWAS of binary traits¹⁰, so the regression model chosen in the additive step does not affect the thresholds used in subsequent non‑additive analyses. Finally, we showed that, while biobank size is generally known, it does not affect the determination of the additive p-value threshold for a given discovery rate (Supplementary Fig. 2).

Evaluation of dominant and recessive models with variant pre-filtering in application to FinnGen

We derived allele-frequency-dependent additive p-value thresholds from simulations for both dominant and recessive models, covering discovery rates from 1% to 99% (Supplementary Data 1, Methods. Derivation of additive p-value thresholds for variant pre-filtering). A threshold corresponding to a P% discovery rate was expected to retain at least P% of novel non-additive associations, regardless of trait prevalence in the population.

We conducted dominant, recessive, and additive GWAS on 19 common traits in FinnGen, most of which were part of a previous non-additive study on age-related phenotypes⁵. The results of these GWAS served as the ground truth for evaluating the performance of our pre-filtering strategy (Supplementary Materials. Analysis of full summary statistics for 19 phenotypes). For both models, we focused on common variants (MAF > 0.05), but for different reasons. In dominant GWAS, rare variants were excluded to optimize computational efficiency, as their additive and dominant p-values are nearly identical (Supplementary Fig. 3A, B). In recessive GWAS, rare variants were removed because the low correlation between their additive and recessive p-values limits the effectiveness of the filtering approach (Supplementary Fig. 3C, D). Then, we evaluated the performance of allele-frequency-dependent additive p-value thresholds on these data (Fig. 2A, B).

**Fig. 2: Validation of thresholds derived through simulations on real GWAS data for 19 phenotypes.**

First, we estimated how many genome-wide significant associations unique to non-additive models might be missed due to variant pre-filtering. By comparing the theoretical discovery rates from simulations with the empirical discovery rates – the proportion of novel non-additive associations retained after filtering in 19 GWAS – we found that the simulation estimates were conservative. In practice, the empirical discovery rates were substantially higher than expected from simulations (Fig. 2C, D), indicating that our filtering procedure retains more associations than predicted.

Next, we assessed how pre-filtering reduces the number of variants tested. For the recessive model, permissive thresholds (theoretical discovery rates above 80%) reduced the number of tests by less than 20-fold, while stricter thresholds (theoretical discovery rates below 20%) led to more than a 200-fold reduction (Fig. 2E). This difference in cost reduction suggests that choosing a strict threshold with a discovery rate below 20% can be the most resource-efficient strategy to conduct recessive GWAS, even at the cost of missing some associations (Fig. 2G). In contrast, in dominant GWAS, even a highly permissive threshold (99% discovery rate) reduced the number of variants by 799-fold, demonstrating that dominant GWAS can dramatically cut computational costs while retaining nearly all novel associations (Fig. 2F, H).

Aside from being more computationally efficient, the dominant model also identified a higher number of new associations than the recessive one in full non-additive GWAS of 19 traits. Dominant GWAS identified 173 new associations (Fig. 2B), while the recessive GWAS uncovered only 83 new associations, 12 of which had allele frequency less than 0.05 (Fig. 2A). One explanation for this difference is that the dominant model has greater power to detect common variants with moderate effect sizes than the recessive model. By using power calculations and population genetics, we demonstrated that variants with moderate effect sizes (${sd}\left[\beta \right]\le 0.04$) under no selection ($4{N}_{e}s=0$) or moderate negative selection ($4{N}_{e}s=-1$) are more likely to be detected by the dominant model. In contrast, the recessive model performs better for variants under strong negative selection ($4{N}_{e}s=-10$) with large effect sizes (${sd}\left[\beta \right]\ge 0.04$), which are typically rare (Fig. 3, Methods. Theoretical characterization of genome-wide significant associations detected by recessive and dominant models). Since GWAS typically focus on common variants, the dominant model has a clear advantage in identifying novel associations.

**Fig. 3: Modeling the distribution of allele frequencies (AF) and true effect sizes ($\beta$) of novel non-additive associations using power calculations and population genetics.**

The proposed procedure for deriving thresholds can potentially account for trait prevalence, variant allele frequency, and distribution of effect sizes when known. However, from a practical standpoint, developing simplified filtering rules could streamline integration into non-additive analyses of large biobanks. For the dominant model, a fixed threshold of $1\times {10}^{-3}$ would achieve a discovery rate above 99% across the entire allele frequency spectrum. In practice, using fixed additive p-value cutoffs of $1\times {10}^{-3}$, $1\times {10}^{-4}$, or $1\times {10}^{-5}$ for variant pre-filtering maintained empirical discovery rates of 99.4%, 98.8%, and 96.0%, respectively, while drastically reducing the number of tests (Supplementary Fig. 4, Supplementary Materials. Fixed additive p-value thresholds for variant pre-filtering in dominant GWAS).

Overall, these findings indicate that the dominant model is expected to identify more novel associations than the recessive model while requiring lower computational costs. Furthermore, the use of simple, fixed additive p-value thresholds makes variant pre-filtering easier to integrate into GWAS analysis pipelines.

Dominant GWAS of 2329 phenotypes in FinnGen

We applied the dominant GWAS model to FinnGen due to its potential to uncover a greater number of novel associations, better resource efficiency, and lack of prior application to this dataset compared to the recessive model (Methods. Theoretical characterization of genome-wide significant associations detected by recessive and dominant models, Fig. 3). Mechanistically, dominant effects can arise from (i) haploinsufficiency, meaning that one wild-type copy of a protein is not enough to perform its intended physiological function¹⁵, (ii) dominant-negative variants that result in a mutated protein copy that interferes with the function of the wild-type one, thus making both copies relatively dysfunctional¹⁶, or (iii) gain-of-function variants that can lead to one copy of a protein acquiring toxic properties and causing a disease¹⁷.

Analysis setup and variant filtration

We analyzed 2329 phenotypes (Supplementary Data 2), pre-filtering variants by allele frequency (MAF > 0.05) and additive p-value (${p}_{{add}} < 1\times {10}^{-5}$) (Fig. 4A, Methods. Dominant GWAS of 2329 phenotypes). This filtering reduced the number of variants per phenotype from 21,329,905 to roughly 1070, leading to a total of 2,493,251 association tests across all phenotypes. Despite this dramatic reduction, both simulations and empirical data indicate that over 90% of novel associations were retained (Supplementary Fig. 4; Supplementary Materials. Fixed Additive p-value Thresholds for Variant Pre-Filtering in Dominant GWAS).
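The pre-filter itself amounts to two comparisons per variant in the additive summary statistics. A minimal sketch, assuming each variant is a dictionary with hypothetical keys `af` (effect allele frequency) and `p_add` (additive p-value):

```python
def passes_prefilter(variant, maf_min=0.05, p_add_max=1e-5):
    """Keep common variants that show at least a weak additive signal."""
    maf = min(variant["af"], 1 - variant["af"])  # minor allele frequency
    return maf > maf_min and variant["p_add"] < p_add_max


# Hypothetical rows from an additive summary statistics file.
variants = [
    {"id": "v1", "af": 0.20, "p_add": 3e-7},  # kept
    {"id": "v2", "af": 0.02, "p_add": 1e-9},  # dropped: MAF <= 0.05
    {"id": "v3", "af": 0.40, "p_add": 0.2},   # dropped: no additive signal
    {"id": "v4", "af": 0.93, "p_add": 8e-6},  # kept: MAF is about 0.07
]
kept = [v["id"] for v in variants if passes_prefilter(v)]
print(kept)
```

Only the retained variants are then passed to the dominant association test.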

**Fig. 4: Non-additive analysis of 2329 phenotypes within FinnGen.**

Computational efficiency was further improved by reusing precomputed null models from a prior additive REGENIE¹⁸ analysis of all FinnGen phenotypes (Methods. Dominant GWAS of 2329 phenotypes). These measures reduced the processing time per phenotype from 710 hours to under 1 hour on a cloud virtual machine (1 CPU, 4 GB RAM). In terms of monetary costs, analyzing all phenotypes without filtration would have cost approximately $26,916, whereas the filtration approach brought the total expense down to less than $40 (Methods. Computational Cost Estimation).

Identification of novel dominant associations

By performing clumping, we identified 5632 independent (r² < 0.2) dominant genome-wide significant associations (${p}_{{dom}} < 5\times {10}^{-8}$) that did not reach genome-wide significance in the additive model. We estimated that fewer than 1% of these are false positives, assuming 2329 analyses across 1 million independent variants (Methods. Calculating false discovery rate).

However, to avoid misclassifying loci as “missed” by the additive model when another variant within the same locus is significant, we further restricted our analysis. Specifically, we focused only on dominant lead variants that had no significant additive associations within their linkage disequilibrium (LD) region (r² > 0.1) (Methods. Dominant GWAS of 2329 phenotypes). This refinement resulted in 781 novel dominant associations without any genome-wide significant additive signals in their respective LD loci (Supplementary Data 3).

Evaluating dominance in novel associations

Although these 781 variants show a better fit to the dominant model, this observation alone does not prove true dominance. Sampling variability can sometimes make additive effects appear dominant (Supplementary Materials. The likelihood of a better dominant model fit for true additive effects). To address this, we simulated the likelihood of the observed differences in log p-values between the dominant and additive models under both true dominant and true additive effects. Our analysis revealed that for 743 of the 781 variants (95.1%), the observed p-value differences were at least 10 times more likely under a true dominant effect. However, for only 406 variants (52.0%), this likelihood difference exceeded 100-fold (Supplementary Fig. 5).

Thus, while the low false discovery rate confirms that nearly all new dominant associations are not driven by null variants, it remains challenging to definitively determine whether their effects are truly non-additive.

Interpretation of novel findings through fine-mapping and colocalization

The dominant non-additive analysis of 2329 phenotypes identified 781 associations missed by the additive model across 359 diseases. We grouped these diseases into families using expert-assigned FinnGen tags (Supplementary Materials: Clustering traits into families using FinnGen tags). The largest numbers of associations were observed for musculoskeletal/connective tissue disorders and circulatory system diseases (Fig. 4B). Notably, several new associations came from phenotypes that previously had no genome-wide significant additive associations in FinnGen, such as viral meningitis, disorders of refraction and accommodation, discitis, thrombotic microangiopathy, recurrent dislocation of the patella, and ovarian hyperstimulation (Supplementary Data 3).

Fine-mapping and colocalization analysis

For the 781 novel associations, we performed fine-mapping using SuSiE¹⁹ (Supplementary Data 4; Methods. Dominant GWAS of 2329 phenotypes). We then conducted a colocalization analysis across 571 datasets, including FinnGen and UK Biobank GWAS, as well as multi-omics studies from GTEx²⁰, INTERVAL²¹, and UK Biobank (Supplementary Data 5; Methods. Dominant GWAS of 2329 phenotypes). This analysis identified 3493 colocalizations (PP.H4.abf > 0.8) with 231 of our novel associations (Fig. 4C), supporting the robustness of our signals and suggesting potential biological mechanisms by linking variant effect to changes in gene expression or protein levels in the blood.

Replication study

To further validate our findings, we replicated the analysis using summary statistics from the VA’s Million Veteran Program (MVP)²². MVP did not have dominant GWAS results, so the additive summary statistics were used since the additive model can partially capture dominant effects. With 78 traits mapped between FinnGen and MVP, 171 variants were included in the replication. Despite the lower power of additive GWAS to replicate non-additive associations, 35% of the variants replicated when controlling for the false discovery rate (FDR < 0.05) while 18% replicated after Bonferroni correction (P < 0.05/171) (Supplementary Data 6; Methods. Dominant GWAS of 2329 phenotypes).
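The two replication criteria can be illustrated on a small set of p-values. The text does not name the FDR procedure, so the Benjamini-Hochberg step-up rule used here is an assumption, and the p-values are hypothetical rather than taken from MVP.

```python
def bonferroni_hits(pvalues, alpha=0.05):
    """Flag p-values below alpha divided by the number of tests."""
    cutoff = alpha / len(pvalues)
    return [p < cutoff for p in pvalues]


def benjamini_hochberg_hits(pvalues, alpha=0.05):
    """Flag p-values rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= alpha * rank / m:
            largest_rank = rank  # largest rank meeting the BH condition
    hits = [False] * m
    for rank, index in enumerate(order, start=1):
        hits[index] = rank <= largest_rank
    return hits


# Hypothetical replication p-values for ten variants.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(benjamini_hochberg_hits(pvalues)), sum(bonferroni_hits(pvalues)))
```

FDR control is the less stringent of the two criteria, which is consistent with more variants replicating under it than under the Bonferroni correction.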

Showcasing some of the biologically relevant associations

To further substantiate the biological relevance of some of the novel associations, we integrated our signals with evidence from other genetic studies, including previous GWAS^23,24, rare variant association studies (RVAS)^25,26, and transcriptome-wide association studies (TWAS)^27,28.

rs1788100

This likely causal variant (Posterior Inclusion Probability (PIP) = 1.0) was associated with systemic connective tissue disorders (mainly autoimmune, ICD-codes M30-M35, Supplementary Data 2). Prior FinnGen additive GWAS linked it to autoimmune hypothyroidism and Crohn’s disease, and other genome-wide association studies have associated it with blood cell counts and immune conditions (e.g., type 1 diabetes, hypothyroidism²³). Its dominant signal colocalized with three autoimmunity-related phenotypes in FinnGen and with CD226 gene expression in T-cells and monocytes. Additionally, in previous studies, loss-of-function variants in CD226 were linked to platelet volume²⁵, and rare variants in CD226 showed the most significant association with gout²⁶. Moreover, TWAS showed the highest association of genetically predicted expression levels of CD226 with hypothyroidism, lupus, and platelet/eosinophil counts²⁷. Recently, CD226 has been gaining recognition as an important part of viral and immune-related diseases²⁹, and our analysis of different types of genetic evidence corroborates this hypothesis.

rs4912715

Another variant with PIP = 1.0 was associated with neurodegenerative disorders, a combined endpoint mainly comprising cases with Alzheimer’s and dementia. In prior additive GWAS within FinnGen, this variant was associated with disorders of psychological development (Supplementary Data 2). Its dominant signal colocalized with WDR55 expression in iPSC. TWAS linked WDR55 to intelligence, schizophrenia, and depression²⁷, and rare variant analysis showed a strong association between WDR55 and bipolar disorder²⁶. WDR55 is part of the WD-repeat domain family, genes from which are explored as a novel drug target class³⁰. Another gene from this family, WDR45, is also known to play a role in neurodegenerative disorders³¹.

rs3034037

This new causal variant (PIP = 1.0) was associated with obesity. In prior additive GWAS within FinnGen, this variant was associated with weight. The dominant signal in our study colocalized with CERS2 expression across 19 eQTL datasets. TWAS findings associated CERS2 with body composition traits such as water mass, lean mass, and waist circumference²⁸. RVAS results also support an association between CERS2 and fat-free mass-adjusted BMI²⁵. As CERS2 encodes a ceramide synthase, and related enzymes have been linked to diabetes and obesity in non-genetic studies³², this provides further evidence of its potential role in obesity.

Figure 4D presents short interpretations of select causal variants (PIP > 0.99) that colocalized with genes relevant to disease based on a literature review^20,25,26,27,29,30,31,32,33,34,35,36,37,38,39,40. For additional details and references, see Supplementary Materials: Interpretation of selected new associations.

Discussion

Previous studies indicate that the contribution of non-additivity in common variants to disease heritability is relatively small^10,41. On the other hand, non-additive effects have been shown to explain much more variability in other contexts. For example, recent studies⁴² estimate non-additivity to explain more than a third of variance in gene expression, and up to 25% of phenotypic variance in model organisms such as mice, rats, and pigs⁴³. Moreover, despite the majority of signals in GWAS of human disease being captured by the additive models, recessive and dominant models can uncover new genetic signals from existing data that have been missed in additive analyses^5,13,14. In the past, this potential was not sufficient to justify the high computational costs required to perform non-additive GWAS on thousands of phenotypes in large biobanks.

In this study, we developed an approach that significantly reduces the computational burden of conducting non-additive GWAS. By applying this method to FinnGen data comprising 2329 phenotypes in 500,349 individuals, we discovered 781 new loci that were not detected by the additive model. Without the proposed filtering procedure, the estimated computational cost for these analyses would have been approximately $27,000 (roughly $35 per new association). In contrast, the optimized workflow reduced the total cost by around 700-fold to less than $40 (approximately $0.051 per new association), demonstrating the feasibility of large-scale non-additive GWAS and highlighting the importance of examining non-additivity in genetic studies.
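The cost figures above can be checked with a few lines of arithmetic. The sketch uses the unrounded estimate of $26,916 from the Results for the fold reduction and treats $40 as an upper bound, so the per-locus cost after filtering is also an upper bound.

```python
full_cost_usd = 26_916          # estimated cost without pre-filtering
rounded_full_cost_usd = 27_000  # rounded figure used for the per-locus cost
filtered_cost_usd = 40          # upper bound: reported as "less than $40"
novel_loci = 781

fold_reduction = full_cost_usd / filtered_cost_usd
cost_per_locus_full = rounded_full_cost_usd / novel_loci
cost_per_locus_filtered = filtered_cost_usd / novel_loci

print(f"fold reduction:           {fold_reduction:.0f}")
print(f"cost per locus, full:     ${cost_per_locus_full:.2f}")
print(f"cost per locus, filtered: ${cost_per_locus_filtered:.4f}")
```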

Beyond methodological efficiency, our findings hold potential clinical value. Genetic evidence is an important predictor of a candidate drug’s success in clinical trials². This can be explained by the absence of certain biases present in other approaches. Unlike gene expression and other omics studies, genetic associations cannot result from reverse causation⁴⁴, as disease generally cannot alter the germline DNA sequence. Additionally, when population structure is appropriately accounted for, bias due to confounding can be minimized⁴⁴. Furthermore, genetic associations have immediate relevance to human disease, unlike cellular and animal models⁴⁵.

A persistent challenge in GWAS-based target discovery, however, is the difficulty of confidently linking associated variants to causal genes. To address this challenge, we conducted colocalization analysis to link the effects of the newly identified variants to gene expression and protein levels, revealing potential biological mechanisms underlying these associations. By further reviewing the literature, we highlighted potential causal links between genes and diseases, such as CD226, SYNGR1, MUC5B, FADS1/FADS2, TNFRSF14, IL1R1, and immune-related disorders; ANO5, MAP3K3, PLCE1, HSD17B13, SENP2, and circulatory system diseases; LPL, CERS2, GIPR and metabolic diseases; WDR55, DEF8, SPIRE2, and neurodegenerative disease. However, a thorough manual investigation of 781 associations and their thousands of corresponding colocalizations is not feasible. Therefore, we have made all our data on GWAS summary statistics, fine-mapping, and colocalization results publicly available, allowing researchers interested in specific diseases to easily access the results.

While our study predominantly focused on the dominant model due to its superior efficiency in identifying novel variants, the presented approach is also applicable to the recessive model. However, in real data, the reduction in computational burden for recessive analyses is typically one to two orders of magnitude, less dramatic than the savings observed for the dominant model. Moreover, our filtering strategy is not well-suited for rare recessive variants, as these variants can exhibit very large recessive effects while showing no signal in the additive model, limiting our ability to exclude a large portion of variants based on additive p-values. Since the proposed approach requires filtering out a large portion of variants, some of the analyses, such as heritability estimation, cannot be conducted. Additionally, although the simulation-based framework has proven effective for conducting non-additive GWAS in a single biobank, its application to meta-GWAS would require further adjustment to derive rules for filtering out variants that are unlikely to be genome-wide significant in meta-analyses. We also show that some of the dominant associations could arise from true additive effects by chance and thus might not reflect true underlying non-additivity. Finally, the primary aim of this study was not to dissect non-additivity as a phenomenon but rather to leverage non-additive models to highlight potential disease-associated variants and genes.

Importantly, our approach to non-additive analysis can be extended to other biobanks. It also works particularly well with widely used two-step mixed linear models, such as REGENIE¹⁸ or BOLT-LMM⁴⁶, because the results from the first step of the additive analysis can also be repurposed for the non-additive GWAS, further reducing computational costs. We argue that such minimization of computational costs enables non-additive analysis to be added to virtually all existing additive GWAS.

Methods

Statistical framework for simulating additive and non-additive p-values

For a statistical test, we need to specify the disease prevalence, the relative risks for the Aa and aa genotypes, the allele frequency, and the numbers of cases and controls. Additionally, we assume that Hardy-Weinberg Equilibrium (HWE) holds.

Then, we calculate the probability of having a specific genotype G (AA/Aa/aa) given the individual’s disease status Y (Y = 1 for cases, Y = 0 for controls):

$$P({G|Y})=\frac{P({Y|G})P(G)}{P(Y)}=\frac{P({Y|G})\cdot P(G)}{\mathop{\sum }_{G'}P({Y|G'})P(G')} \tag{1}$$

Here, $P(Y|G)$ is calculated from the given relative risks and disease prevalence, and $P(G)$ is calculated from the allele frequency under the HWE assumption.

Next, we simulate genotypes for cases and controls. The probability of having a genotype g for a case or a control is equal to $P(G=g|Y=1)$ or $P(G=g|Y=0)$, respectively. As such, genotypes for individuals are simulated from two multinomial distributions defined by $P(G|Y=1)$ and $P(G|Y=0)$.
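A minimal sketch of these two steps follows. It assumes that relative risks are expressed relative to the AA genotype and that the baseline risk is chosen to match the population prevalence; this parameterization and all function names are illustrative assumptions rather than the study's implementation.

```python
import random

GENOTYPES = ("AA", "Aa", "aa")  # a is the effect allele


def genotype_probs_by_status(prevalence, rr_het, rr_hom, af):
    """Return (P(G | Y=1), P(G | Y=0)) as dictionaries keyed by genotype."""
    # P(G) under Hardy-Weinberg equilibrium, with af the frequency of allele a.
    p_g = {"AA": (1 - af) ** 2, "Aa": 2 * af * (1 - af), "aa": af ** 2}
    # Assumption: risks are relative to the AA genotype.
    rel_risk = {"AA": 1.0, "Aa": rr_het, "aa": rr_hom}
    # Baseline risk chosen so that the weighted average risk equals the prevalence.
    baseline = prevalence / sum(rel_risk[g] * p_g[g] for g in GENOTYPES)
    p_case_given_g = {g: baseline * rel_risk[g] for g in GENOTYPES}
    if max(p_case_given_g.values()) > 1:
        raise ValueError("relative risks imply a disease probability above 1")
    # Bayes' rule; the denominators equal P(Y=1) and P(Y=0) by construction.
    cases = {g: p_case_given_g[g] * p_g[g] / prevalence for g in GENOTYPES}
    controls = {g: (1 - p_case_given_g[g]) * p_g[g] / (1 - prevalence)
                for g in GENOTYPES}
    return cases, controls


def simulate_genotype_counts(probs, n, rng):
    """Draw n genotypes from the multinomial defined by `probs` and count them."""
    draws = rng.choices(GENOTYPES, weights=[probs[g] for g in GENOTYPES], k=n)
    return {g: draws.count(g) for g in GENOTYPES}


rng = random.Random(0)
cases, controls = genotype_probs_by_status(prevalence=0.1, rr_het=1.2,
                                           rr_hom=1.2, af=0.3)
case_counts = simulate_genotype_counts(cases, 1000, rng)
control_counts = simulate_genotype_counts(controls, 9000, rng)
print(case_counts, control_counts)
```

Setting `rr_het` equal to `rr_hom` gives a dominant effect, while setting `rr_het` to 1 gives a recessive one.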

Having the simulated genotypes with the corresponding disease status, we conduct an association test using the test statistic defined as:

$$Z=\frac{{\hat{p}}_{1}-{\hat{p}}_{2}}{\sqrt{\left(\frac{1}{{n}_{1}}+\frac{1}{{n}_{2}}\right)\cdot \frac{{n}_{1}\cdot {\hat{p}}_{1}+{n}_{2}\cdot {\hat{p}}_{2}}{{n}_{1}+{n}_{2}}\cdot \left(1-\frac{{n}_{1}\cdot {\hat{p}}_{1}+{n}_{2}\cdot {\hat{p}}_{2}}{{n}_{1}+{n}_{2}}\right)}} \tag{2}$$

Under the null hypothesis of no association, Z asymptotically follows a standard normal distribution⁴⁷. The Z-score is then converted into a two-tailed p-value.

For a bi-allelic variant with three possible genotypes (AA, Aa, aa), with a being the effect allele, the frequencies ${\hat{p}}_{1}$ and ${\hat{p}}_{2}$ differ depending on the genetic model. In the additive model, ${\hat{p}}_{1}$ and ${\hat{p}}_{2}$ are calculated as freq(Aa)/2 + freq(aa), representing the frequency of the effect allele in cases and controls, respectively. This test, also known as an allelic test, is equivalent to the Cochran–Armitage trend test for binary phenotypes under HWE⁴⁸. For the dominant model, ${\hat{p}}_{1}$ and ${\hat{p}}_{2}$ are calculated as freq(Aa) + freq(aa), combining heterozygotes and homozygotes for the effect allele in cases and controls. In the recessive model, ${\hat{p}}_{1}$ and ${\hat{p}}_{2}$ represent freq(aa), considering only homozygotes for the effect allele in cases and controls. For the dominant and recessive models, ${n}_{1}$ and ${n}_{2}$ correspond to the number of cases and controls, respectively, while for the additive model, ${n}_{1}$ and ${n}_{2}$ are doubled to account for the total number of alleles in cases and controls.
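The test statistic and the three model-specific frequencies can be sketched as below, taking a as the effect allele so that the additive frequency counts copies of a. Function names and the example genotype counts are hypothetical.

```python
import math


def two_proportion_z(p1, p2, n1, n2):
    """Pooled two-proportion Z statistic, as in Eq. 2."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    variance = (1 / n1 + 1 / n2) * pooled * (1 - pooled)
    if variance == 0:
        return 0.0  # no variation in the pooled sample, so no evidence either way
    return (p1 - p2) / math.sqrt(variance)


def two_tailed_p(z):
    """Two-tailed p-value under a standard normal null distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


def model_frequency(counts, model):
    """Frequency compared between cases and controls under each model."""
    n = sum(counts.values())
    if model == "additive":
        return (counts["Aa"] / 2 + counts["aa"]) / n  # effect allele frequency
    if model == "dominant":
        return (counts["Aa"] + counts["aa"]) / n      # carriers of the effect allele
    if model == "recessive":
        return counts["aa"] / n                       # effect allele homozygotes
    raise ValueError(f"unknown model: {model}")


def association_test(case_counts, control_counts, model):
    """Return (Z, two-tailed p-value) for one genetic model."""
    n1, n2 = sum(case_counts.values()), sum(control_counts.values())
    p1 = model_frequency(case_counts, model)
    p2 = model_frequency(control_counts, model)
    if model == "additive":
        n1, n2 = 2 * n1, 2 * n2  # the allelic test counts alleles, not individuals
    z = two_proportion_z(p1, p2, n1, n2)
    return z, two_tailed_p(z)


# Hypothetical genotype counts with an excess of heterozygotes among cases.
case_counts = {"AA": 400, "Aa": 500, "aa": 100}
control_counts = {"AA": 500, "Aa": 400, "aa": 100}
for model in ("additive", "dominant", "recessive"):
    print(model, association_test(case_counts, control_counts, model))
```

In this toy example the dominant test gives the smallest p-value and the recessive test shows no signal, because cases and controls differ only in the split between AA and Aa.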

We can also obtain the p-value distributions for variants at a specific allele frequency with a causal non-additive effect on a disease of specified prevalence. First, we specify the distribution of effect sizes, in our case a normal distribution with mean 0. Then, we sample effect sizes from this distribution and simulate dominant, recessive, and additive p-values for each effect size.

Derivation of additive p-value thresholds for variant pre-filtering

We describe the steps to derive a filtering rule for selecting variants for further non-additive analysis based on information available in additive summary statistics, specifically the variant’s allele frequency and additive p-value. We calculate an allele-frequency-dependent additive p-value threshold as follows:

1) Parameters for simulations. Since a threshold is calculated for a specific allele frequency, AF is assumed to be known. Prevalence ranges from 0.001 to 0.5 and is discretized, taking the values 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, and 0.5. Effect sizes of causal variants are assumed to follow a normal distribution with standard deviation $\sigma$, ranging from 0.01 to 0.13 (in steps of 0.01) in our simulations. The sample size of the biobank is taken as 500,000. The number of cases is assumed to be the product of disease prevalence and sample size (cohort study design).
2) Simulation of additive and non-additive p-values. For each combination of assumptions about $\sigma$ and disease prevalence, we simulate 1000 effect sizes from the corresponding normal distribution. For each effect size, we simulate 1000 pairs of additive and non-additive p-values (Methods: Statistical framework for simulating additive and non-additive p-values). In total, we have 1,000,000 simulated causal non-additive variants with corresponding additive and non-additive p-values.
3) Deriving the additive p-value threshold with a P% discovery rate. We only consider simulated variants with ${p}_{{nonadd}} < 5\times {10}^{-8} < {p}_{{add}}$, which represent new associations identified by the non-additive model and missed by the additive model. Next, we set the threshold equal to the P-th percentile of the additive p-values of these variants. By definition, P% of simulated new associations fall below this threshold and are therefore not filtered out.
4) Ensuring the threshold works for any set of assumptions. All simulations prior to this step were performed under specific assumptions about disease prevalence, allele frequency, and effect size distribution. To derive a threshold that consistently has a discovery rate of P% or higher across all sets of assumptions, we repeated the simulations for all combinations of parameters and then picked the most permissive threshold. Some settings result in almost no new associations at all; to discard those, we filter out any set of parameters for which fewer than 1% of simulations produce a significant association.
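Steps 3 and 4 can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the percentile uses a nearest-rank definition, the 1% rule is applied to the fraction of simulations that yield a new association, and all function names are hypothetical.

```python
import math

GWS = 5e-8  # genome-wide significance threshold

def new_hit_additive_pvalues(p_add, p_nonadd):
    """Additive p-values of variants with p_nonadd < 5e-8 < p_add (step 3)."""
    return [pa for pa, pn in zip(p_add, p_nonadd) if pn < GWS < pa]

def percentile_threshold(values, discovery_rate):
    """Nearest-rank percentile: at least discovery_rate of values are <= result."""
    ordered = sorted(values)
    rank = max(1, math.ceil(discovery_rate * len(ordered)))
    return ordered[rank - 1]

def robust_threshold(settings, discovery_rate=0.99, min_hit_fraction=0.01):
    """settings: iterable of (p_add, p_nonadd) lists, one per parameter combination.

    Settings with too few new associations are skipped; the most permissive
    (largest) per-setting threshold is returned (step 4).
    """
    thresholds = []
    for p_add, p_nonadd in settings:
        hits = new_hit_additive_pvalues(p_add, p_nonadd)
        if len(hits) < min_hit_fraction * len(p_add):
            continue  # almost no new associations under these assumptions
        thresholds.append(percentile_threshold(hits, discovery_rate))
    return max(thresholds) if thresholds else None

# Toy inputs standing in for simulated p-values under three parameter settings.
setting_a = ([1e-3, 1e-6, 1e-7, 1e-9], [1e-9] * 4)
setting_b = ([1e-5, 1e-6], [1e-9, 1e-9])
setting_c = ([0.5] + [1e-9] * 999, [1e-9] * 1000)   # only 0.1% new hits: skipped
print(robust_threshold([setting_a, setting_b, setting_c], discovery_rate=1.0))
```

Taking the maximum across settings is what makes the rule conservative: a variant is kept whenever at least one plausible set of assumptions would keep it.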

Theoretical characterization of genome-wide significant associations detected by recessive and dominant models

Power calculations for dominant and recessive models were performed under the assumption of a biobank of size 500,000 for a disease with a prevalence of 0.1. We simulated additive, dominant, and recessive p-values for variants with allele frequencies (AF) ranging from 0.01 to 0.5, and effect sizes $\beta$ ranging from −0.4 to 0.4 (Methods: Statistical framework for simulating additive and non-additive p-values). For each combination of effect size and allele frequency, we estimated the proportion of the $n$ simulations ($n=500{,}000$) that had a non-additive p-value (${p}_{{nonadd}}$) below the genome-wide significance threshold of $5\times {10}^{-8}$ and an additive p-value (${p}_{{add}}$) that did not reach this threshold:

$$\begin{aligned}\hat{\mathrm{P}}(\mathrm{nonadditive\ hit}\mid AF,\beta ) &= \hat{\mathrm{P}}({p}_{nonadd} < 5\times {10}^{-8} < {p}_{add}\mid AF,\beta ) \\ &= \frac{1}{n}\sum_{i=1}^{n}{I}_{\left[{p}_{nonadd}^{(i)} < 5\times {10}^{-8} < {p}_{add}^{(i)}\right]}\end{aligned}$$

(3)
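For a single (AF, β) cell, the estimator in Eq. (3) is the fraction of simulated p-value pairs that satisfy the indicator. A minimal Python sketch follows, with a hypothetical function name and toy p-values in place of simulation output:

```python
GWS = 5e-8  # genome-wide significance threshold

def nonadditive_hit_rate(p_nonadd, p_add):
    """Fraction of simulations with p_nonadd < 5e-8 < p_add (Eq. 3)."""
    n = len(p_nonadd)
    hits = sum(1 for pn, pa in zip(p_nonadd, p_add) if pn < GWS < pa)
    return hits / n

# Toy p-values: simulations 1 and 4 are significant only in the non-additive test.
rate = nonadditive_hit_rate([1e-9, 1e-9, 1e-3, 1e-9], [1e-6, 1e-9, 1e-6, 1e-4])
print(rate)
```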

Power calculations can show which subsets of effect sizes and allele frequencies can be effectively detected by a model. For example, the recessive model excels at discovering rare variants with large effect sizes, while the dominant model can mainly identify common variants with moderate effect sizes (Fig. 3A, B). However, to compare different models in terms of their ability to identify large numbers of new associations, it is important to consider statistical power in the context of the distribution of effect sizes and allele frequencies of variants in the population. Generally, a model that is well-powered to detect variants that are also prevalent in the population would be preferable to one that has the power to detect variants with $\beta$ and AF combinations that do not exist in the real world. If we assume that a variant has some dominance $h$ ($h=0$ for a recessive variant, $h=1$ for a dominant variant) and some selection coefficient $s$, we can calculate the probability of detecting a causal variant as

$$\begin{aligned}\mathrm{P}\left(\mathrm{nonadditive\ hit}\mid s,h\right) &= \int \mathrm{P}\left(\mathrm{nonadditive\ hit},AF,\beta \mid s,h\right)\,dAF\,d\beta \\ &= \int \mathrm{P}\left(\mathrm{nonadditive\ hit}\mid AF,\beta ,h\right)\cdot \mathrm{P}\left(AF,\beta \mid s,h\right)\,dAF\,d\beta \end{aligned}$$

(4)

Here, $P\left({{\rm{nonadditive}} \; {\rm{hit}}}|{AF},\beta,h\right)$ corresponds to the simulation-based power calculation, so only $P\left({AF},\beta |s,h\right)$ is required to estimate the desired probability. We can reasonably assume that ${AF}$ and $\beta$ are dependent only through the selection coefficient, meaning that they are conditionally independent given $s$:

$${{\rm{P}}}\left({AF},\beta {|s},h\right)={{\rm{P}}}\left({AF|s},h\right)\cdot {{\rm{P}}}\left(\beta {|s},h\right)$$

(5)

The first term can be calculated analytically. For example, Walsh & Lynch (2018, p. 192) provide a solution for the recessive model⁴⁹:

$${{\rm{P}}}\left({AF|s},h=0\right)={e}^{2{N}_{e}s\cdot {{AF}}^{2}}\cdot \frac{1}{{AF}\cdot (1-{AF})}$$

(6)

This is a good approximation when the contribution of mutation rate $\mu$ is low ($4{N}_{e}\mu \ll 1$)⁴⁹, where ${N}_{e}$ is the effective population size taken as 100,000 for the Finnish population⁵⁰.

Modeling $P\left(\beta |s,h\right)$ is less straightforward, and there are several approaches to do so⁵¹,⁵²,⁵³,⁵⁴. We will only assume that the effect sizes are normally distributed with mean 0 and variance ${\sigma }^{2}$. Then, the magnitude of effect, captured by ${\sigma }^{2}$, could depend on $s$. In our case, we will not assume an explicit relationship between ${\sigma }^{2}$ and $s$, and will instead try different combinations of them to see how they affect the results.

Combining this, we can estimate ${{\rm{P}}}\left({{\rm{nonadditive}} \; {\rm{hit}}},{AF},\beta |s,h\right)$, which corresponds to the distribution of allele frequencies and effect sizes for variants with selection $s$ and dominance $h$ that are genome-wide significant only in the non-additive model (Fig. 3C, D). Notably, new recessive associations with moderate effect sizes are very unlikely to have relatively low allele frequencies (${AF}\in [0.05, 0.2]$) (Fig. 3D). This theoretical finding is also observed in real data from Heyne et al. (2023)¹³ and from our recessive analysis of 19 phenotypes, which show that the distribution of allele frequencies is bimodal: the recessive model detects either very rare variants (${AF} < 0.05$) with large effect sizes or very common variants (${AF} > 0.3$) with moderate effect sizes (Fig. 3E). In contrast, the dominant model is expected to identify new variants from the whole spectrum of allele frequencies (Fig. 3C). Overall, the expected distribution of allele frequencies and effect sizes for new dominant hits closely matches the one observed in our dominant GWAS analysis of FinnGen (Fig. 3F).

Finally, we estimate ${{\rm{P}}}\left({{\rm{nonadditive}} \; {\rm{hit}}}|s,h\right)$ as $\int {{\rm{P}}}\left({{\rm{nonadditive}} \; {\rm{hit}}},{AF},\beta |s,h\right){dAF\,d}\beta$ to compare the dominant and the recessive models in their effectiveness at identifying the highest number of new associations. For a bi-allelic variant, recessive and dominant effects are equivalent up to the choice of the effect allele. If a recessive allele has an allele frequency (AF) greater than 0.5, its corresponding minor allele will exhibit a dominant effect, which can be detected in a standard dominant GWAS when the minor allele is considered the effect allele. Using this idea, we can directly compare the ability of dominant and recessive models to detect true causal associations without knowing the prevalence of recessive and dominant effects in the population. Specifically, we assess the probability that a variant with a recessive effect reaches genome-wide significance and falls into one of two categories: AF < 0.5, representing recessive GWAS hits, or AF > 0.5, representing dominant GWAS associations:

$$\begin{aligned}&\frac{\text{Expected portion of variants that are dominant and detected}}{\text{Expected portion of variants that are recessive and detected}} \\ &\quad= \frac{\mathrm{P}\left({p}_{dom} < 5\times {10}^{-8} < {p}_{add}\mid h=1,s\right)\cdot \mathrm{P}\left(h=1\mid s\right)}{\mathrm{P}\left({p}_{rec} < 5\times {10}^{-8} < {p}_{add}\mid h=0,s\right)\cdot \mathrm{P}\left(h=0\mid s\right)} \\ &\quad= \frac{\mathrm{P}\left({p}_{rec} < 5\times {10}^{-8} < {p}_{add},AF > 0.5\mid h=0,s\right)\cdot \mathrm{P}\left(h=0\mid s\right)}{\mathrm{P}\left({p}_{rec} < 5\times {10}^{-8} < {p}_{add},AF < 0.5\mid h=0,s\right)\cdot \mathrm{P}\left(h=0\mid s\right)} \\ &\quad= \frac{\int {I}_{[AF > 0.5]}\,\mathrm{P}\left({p}_{rec} < 5\times {10}^{-8} < {p}_{add},AF,\beta \mid s,h=0\right)\,dAF\,d\beta }{\int {I}_{[AF < 0.5]}\,\mathrm{P}\left({p}_{rec} < 5\times {10}^{-8} < {p}_{add},AF,\beta \mid s,h=0\right)\,dAF\,d\beta }\end{aligned}$$

(7)
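The ratio in Eq. (7) can be approximated numerically on a grid, combining a simulation-based power estimate with the densities from Eqs. (5) and (6). The Python sketch below is illustrative only. It assumes uniform grids (so grid spacing and normalizing constants cancel in the ratio), treats Eq. (6) as an unnormalized weight, and takes the power function as a user-supplied callable; all names are hypothetical.

```python
import math

def recessive_af_weight(af, s, ne=100_000):
    """Unnormalized allele-frequency density for h = 0 (Eq. 6)."""
    return math.exp(2.0 * ne * s * af * af) / (af * (1.0 - af))

def normal_pdf(beta, sigma):
    """Density of the effect size, assumed N(0, sigma^2)."""
    return math.exp(-0.5 * (beta / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def dominant_to_recessive_ratio(power, af_grid, beta_grid, s, sigma, ne=100_000):
    """Riemann-sum approximation of Eq. 7.

    power(af, beta) should return an estimate of
    P(p_rec < 5e-8 < p_add | AF, beta), for example from simulations.
    """
    upper = lower = 0.0
    for af in af_grid:
        w_af = recessive_af_weight(af, s, ne)
        for beta in beta_grid:
            mass = power(af, beta) * w_af * normal_pdf(beta, sigma)
            if af > 0.5:
                upper += mass      # recessive allele is major: dominant GWAS hit
            elif af < 0.5:
                lower += mass      # recessive allele is minor: recessive GWAS hit
    return upper / lower

af_grid = [0.01 + 0.02 * i for i in range(50)]      # 0.01, 0.03, ..., 0.99
beta_grid = [-0.4 + 0.05 * i for i in range(17)]    # -0.4, ..., 0.4

def flat_power(af, beta):
    return 1.0   # placeholder; a real analysis would plug in estimated power

neutral = dominant_to_recessive_ratio(flat_power, af_grid, beta_grid, s=0.0, sigma=0.1)
selected = dominant_to_recessive_ratio(flat_power, af_grid, beta_grid, s=-1e-4, sigma=0.1)
print(neutral, selected)
```

With a flat placeholder power and s = 0 the weight is symmetric around AF = 0.5, so the ratio is close to 1; negative selection shifts mass toward low AF and pushes the ratio below 1. Note that large positive values of s would overflow the exponential in this naive form.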

The dominant model is expected to be better powered to detect variants with moderate effect sizes that are evolutionary neutral or under moderate negative selection (Fig. 3G). On the other hand, the recessive model is more suited to detect variants under strong selection, with allele frequency skewed towards 0, that have larger effect sizes (Fig. 3G).

Dominant GWAS of 2329 phenotypes

1) Variant pre-filtering. For the dominant GWAS, variants were included in the analysis if they had an allele frequency above 0.05 and an additive p-value below $1\times {10}^{-5}$. To do that, we obtained the additive summary statistics files available in the FinnGen sandbox (also publicly available at https://www.finngen.fi/en/access_results) and used awk to extract the variants that pass the pre-filtering criteria.
2) Subsetting genotype files. To avoid downloading full .bgen files to virtual machines (VMs), which would be by far the most time-consuming step, we subset the initial full .bgen files using bgenix to produce new .bgen files that contain information only for the variants of interest. To reuse one file for several GWAS, we grouped phenotypes into batches and created .bgen files covering all variants that would be analyzed for the phenotypes in a batch. Specifically, we divided all phenotypes into 20 batches such that the number of variants analyzed per batch was roughly similar (using the split() function in R). These smaller .bgen files were then passed to VMs to conduct the GWAS of the phenotypes in the corresponding batch.
3) Dominant GWAS. We used REGENIE v2.2.4 with the “--test dominant” option to conduct the dominant GWAS. Since the results of the first step had already been computed during the additive analysis, we simply referenced the null model files using the --pred and --use-null-firth options. Sex, age at the end of follow-up, ten principal components, and genotyping batches were used as covariates.
4) Clumping. We used PLINK 1.9⁵⁵ to perform clumping, setting the p-value threshold to $5\times {10}^{-8}$, the r² threshold to 0.2, and the window size to 250 kb.
5) Identification of dominant associations missed by the additive model. For each clumped association, we calculated the r² LD measure with each of the genome-wide significant additive associations. If a dominant association was not in LD (r² < 0.1) with any of the additive hits, it was considered to be a new association.
6) Fine-mapping. Each of the new associations was then fine-mapped. Successful fine-mapping requires detailed summary statistics and the inclusion of as many variants within a locus as possible. To achieve this, we re-ran the GWAS (steps 1 to 3), this time analyzing all variants within a 300 kb region surrounding each novel association. Then, we used the standard FinnGen fine-mapping pipeline, which utilizes SuSiE. The pipeline’s description can be found here: https://finngen.gitbook.io/finngen-handbook/working-in-the-sandbox/which-tools-are-available/untitled/finemapping-of-custom-gwas-analyses.
7) Colocalization. The results of the fine-mapping pipeline were then used for colocalization analysis. Again, we used the standard FinnGen colocalization pipeline, which utilizes SuSiE. The description of the pipeline can be found here: https://finngen.gitbook.io/finngen-handbook/working-in-the-sandbox/running-analyses-in-sandbox/how-to-run-colocalization-pipeline.
8) Replication. The FinnGen team has already run comprehensive meta-analyses, so the summary statistics from the FinnGen and MVP datasets for matched endpoints were already available within the FinnGen cloud environment (https://docs.finngen.fi/finngen-data-specifics/green-library-data-aggregate-data/other-analyses-available/meta-analyses#finngen-mvp-ukbb-meta-analysis). For all new associations with matched endpoints, we looked at the additive p-value in the MVP GWAS. Overall, we had 171 variants with available p-values from the MVP dataset. We then assessed replication using either Bonferroni correction or FDR control at 0.05 with the Benjamini-Hochberg procedure.
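Steps 1 and 8 lend themselves to a short illustration. The Python sketch below is not the pipeline used in the study (which relied on awk for pre-filtering): the column names `af_alt` and `pval` are assumptions about the summary statistics layout, the example rows are invented, and the function names are hypothetical. The two multiple-testing procedures are the standard Bonferroni and Benjamini-Hochberg rules.

```python
def prefilter(rows, af_min=0.05, p_max=1e-5, af_col="af_alt", p_col="pval"):
    """Keep variants with allele frequency > af_min and additive p-value < p_max."""
    return [r for r in rows
            if float(r[af_col]) > af_min and float(r[p_col]) < p_max]

def bonferroni(pvalues, alpha=0.05):
    """Flag p-values that pass the Bonferroni-corrected threshold alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg(pvalues, alpha=0.05):
    """Flag p-values rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            k_max = rank              # largest rank meeting the BH condition
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Invented summary-statistics rows (values are strings, as read from a text file).
rows = [
    {"rsid": "rs_example1", "af_alt": "0.20", "pval": "1e-6"},
    {"rsid": "rs_example2", "af_alt": "0.01", "pval": "1e-9"},   # too rare
    {"rsid": "rs_example3", "af_alt": "0.30", "pval": "0.01"},   # p too large
]
kept = prefilter(rows)

# Invented replication p-values.
replication_p = [0.001, 0.02, 0.03, 0.5]
print([r["rsid"] for r in kept])
print(bonferroni(replication_p), benjamini_hochberg(replication_p))
```

Bonferroni controls the family-wise error rate and is the stricter of the two, so it typically replicates fewer associations than Benjamini-Hochberg on the same p-values.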

Computational cost estimation

The analysis of large datasets is typically divided into hundreds of small tasks that run in parallel on relatively small virtual machines. To define the workload of an analysis, we use machine hours, defined as the number of hours it takes one Google Cloud virtual machine with 1 CPU and 4 GB RAM to compute one task. The standard GWAS using REGENIE has two steps: calculating null models, and then running the GWAS in parallel for chunks of variants (660 chunks in our case). The first step takes around 50 machine hours, while processing each of the 660 chunks takes one machine hour. Overall, it takes approximately 710 machine hours to compute one standard GWAS in FinnGen. In contrast, after the filtration procedure, it typically takes less than one machine hour to perform one dominant GWAS, and most of this time is spent localizing genetic files to the VM. In this analysis, we use an upper bound and assume that it takes exactly one machine hour. As such, the workload per analysis decreases more than 710-fold after applying additive p-value filtering.

We can also calculate the difference in US dollars using the Google Cloud Calculator (https://cloud.google.com/products/calculator). For the calculations, we consider preemptible virtual machines with 1 CPU, 4 GB RAM, and 10 GB disk size, leaving the rest of the options as default. Machine hours convert directly into costs because the cost does not depend substantially on how the workload is distributed: for example, using 7100 machines to run 10 GWAS simultaneously for 233 hours costs the same as using 710 machines to run 1 GWAS at a time for 2330 hours. The estimated cost of analyzing 2329 phenotypes, which is equivalent to running 710 VMs for 2329 hours, is $26,916. In contrast, our analysis, which used the proposed filtration procedure and could thus be conducted by running 2329 VMs for 1 hour, cost only $39.
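The workload arithmetic above can be reproduced in a few lines of Python. The dollar figures are the reported estimates and are entered as constants; no cloud pricing is assumed or recomputed here.

```python
NULL_MODEL_HOURS = 50          # REGENIE step 1, per standard GWAS
N_CHUNKS = 660                 # variant chunks in step 2
HOURS_PER_CHUNK = 1
N_PHENOTYPES = 2329
FILTERED_HOURS_PER_GWAS = 1    # upper bound used in the text

standard_hours_per_gwas = NULL_MODEL_HOURS + N_CHUNKS * HOURS_PER_CHUNK
total_standard_hours = standard_hours_per_gwas * N_PHENOTYPES
total_filtered_hours = FILTERED_HOURS_PER_GWAS * N_PHENOTYPES
workload_reduction = total_standard_hours / total_filtered_hours

REPORTED_STANDARD_COST_USD = 26_916   # reported estimate for the standard approach
REPORTED_FILTERED_COST_USD = 39       # reported cost with pre-filtering
cost_reduction = REPORTED_STANDARD_COST_USD / REPORTED_FILTERED_COST_USD

print(standard_hours_per_gwas)   # 710 machine hours per standard GWAS
print(total_standard_hours)      # 1653590 machine hours for 2329 phenotypes
print(workload_reduction)        # 710.0
print(round(cost_reduction))     # about 690-fold in reported dollars
```

The dollar ratio is slightly below 710 because the reported costs also include items such as disk, which do not scale exactly with machine hours.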

Calculating false discovery rate

We can simulate dominant and additive p-values (Methods: Statistical framework for simulating additive and non-additive p-values) under the assumption of no effect, and then calculate the probability that a variant is taken into analysis and is significant only in the dominant model. This corresponds to a false discovery. In our case, this resulted in a probability of roughly $2\times {10}^{-8}$ that a null variant has a dominant p-value below $5\times {10}^{-8}$ and an additive p-value between $5\times {10}^{-8}$ and $1\times {10}^{-5}$, regardless of the variant’s allele frequency.

Assume that there are 1,000,000 independent common variants, each tested for association with 2329 phenotypes. Given a false-positive probability of $2\times {10}^{-8}$ per test, the expected number of false-positive associations across all GWAS is 47. By comparison, we had 5632 independent non-additive genome-wide significant associations in total. As such, we can estimate the false discovery rate as $\frac{47}{5632}=0.83\% < 1\%$.
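The false discovery rate estimate follows from simple arithmetic, reproduced here in Python using only the figures stated above:

```python
N_INDEPENDENT_VARIANTS = 1_000_000
N_PHENOTYPES = 2329
P_FALSE_POSITIVE = 2e-8        # per-test probability of a false new dominant hit
N_NONADDITIVE_HITS = 5632      # independent genome-wide significant associations

expected_false_positives = N_INDEPENDENT_VARIANTS * N_PHENOTYPES * P_FALSE_POSITIVE
fdr = round(expected_false_positives) / N_NONADDITIVE_HITS

print(round(expected_false_positives))   # 47
print(f"{100 * fdr:.2f}%")               # 0.83%
```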

Ethics statement and materials & methods

Study subjects in FinnGen provided informed consent for biobank research, based on the Finnish Biobank Act. Alternatively, separate research cohorts, collected prior to the Finnish Biobank Act coming into effect (in September 2013) and the start of FinnGen (August 2017), were collected based on study-specific consents and later transferred to the Finnish biobanks after approval by Fimea (Finnish Medicines Agency), the National Supervisory Authority for Welfare and Health. Recruitment protocols followed the biobank protocols approved by Fimea. The Coordinating Ethics Committee of the Hospital District of Helsinki and Uusimaa (HUS) statement number for the FinnGen study is Nr HUS/990/2017.

The FinnGen study is approved by Finnish Institute for Health and Welfare (permit numbers: THL/2031/6.02.00/2017, THL/1101/5.05.00/2017, THL/341/6.02.00/2018, THL/2222/6.02.00/2018, THL/283/6.02.00/2019, THL/1721/5.05.00/2019 and THL/1524/5.05.00/2020), Digital and population data service agency (permit numbers: VRK43431/2017-3, VRK/6909/2018-3, VRK/4415/2019-3), the Social Insurance Institution (permit numbers: KELA 58/522/2017, KELA 131/522/2018, KELA 70/522/2019, KELA 98/522/2019, KELA 134/522/2019, KELA 138/522/2019, KELA 2/522/2020, KELA 16/522/2020), Findata permit numbers THL/2364/14.02/2020, THL/4055/14.06.00/2020, THL/3433/14.06.00/2020, THL/4432/14.06/2020, THL/5189/14.06/2020, THL/5894/14.06.00/2020, THL/6619/14.06.00/2020, THL/209/14.06.00/2021, THL/688/14.06.00/2021, THL/1284/14.06.00/2021, THL/1965/14.06.00/2021, THL/5546/14.02.00/2020, THL/2658/14.06.00/2021, THL/4235/14.06.00/2021, Statistics Finland (permit numbers: TK-53−1041-17 and TK/143/07.03.00/2020 (earlier TK-53-90-20) TK/1735/07.03.00/2021, TK/3112/07.03.00/2021) and Finnish Registry for Kidney Diseases permission/extract from the meeting minutes on 4th July 2019.

The Biobank Access Decisions for FinnGen samples and data utilized in FinnGen Data Freeze 11 include: THL Biobank BB2017_55, BB2017_111, BB2018_19, BB_2018_34, BB_2018_67, BB2018_71, BB2019_7, BB2019_8, BB2019_26, BB2020_1, BB2021_65, Finnish Red Cross Blood Service Biobank 7.12.2017, Helsinki Biobank HUS/359/2017, HUS/248/2020, HUS/430/2021 §28, §29, HUS/150/2022 §12, §13, §14, §15, §16, §17, §18, §23, §58 and §59, Auria Biobank AB17-5154 and amendment #1 (August 17 2020) and amendments BB_2021-0140, BB_2021-0156 (August 26 2021, Feb 2 2022), BB_2021-0169, BB_2021-0179, BB_2021-0161, AB20-5926 and amendment #1 (April 23 2020) and its modification (Sep 22 2021), BB_2022-0262, BB_2022-0256, Biobank Borealis of Northern Finland_2017_1013, 2021_5010, 2021_5018, 2021_5015, 2021_5015 Amendment, 2021_5023, 2021_5023 Amendment, 2021_5017, 2022_6001, 2022_6006 Amendment, BB22-0067, 2022_0262, Biobank of Eastern Finland 1186/2018 and amendment 22§/2020, 53§/2021, 13§/2022, 14§/2022, 15§/2022, 27§/2022, 28§/2022, 29§/2022, 33§/2022, 35§/2022, 36§/2022, 37§/2022, 39§/2022, 7§/2023, Finnish Clinical Biobank Tampere MH0004 and amendments (21.02.2020 & 06.10.2020), 8§/2021, 9§/2021, §9/2022, §10/2022, §12/2022, 13§/2022, §20/2022, §21/2022, §22/2022, §23/2022, 28§/2022, 29§/2022, 30§/2022, 31§/2022, 32§/2022, 38§/2022, 40§/2022, 42§/2022, 1§/2023, Central Finland Biobank 1-2017, BB_2021-0161, BB_2021-0169, BB_2021-0179, BB_2021-0170, BB_2022-0256, and Terveystalo Biobank STB 2018001 and amendment 25th Aug 2020, Finnish Hematological Registry and Clinical Biobank decision 18th June 2021, Arctic biobank P0844: ARC_2021_1001.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Additive p-value thresholds for dominant and recessive analyses of binary and continuous traits generated by this study are available in Supplementary Data 1. Descriptions of traits are available in Supplementary Data 2. Summary statistics for 781 new associations generated by this study are available in Supplementary Data 3. The results of fine-mapping, colocalization, and replication generated by this study are available in Supplementary Data 4, 5, and 6, respectively. Full summary statistics for additive, dominant, and recessive GWAS of 19 traits; all summary statistics generated by the dominant GWAS of 2329 traits; and raw colocalization and fine-mapping outputs generated by this study have been deposited on Zenodo (zenodo.org/records/17449480). Under national and European regulations (GDPR), access to individual-level sensitive health data requires approval by national authorities for specific projects and named researchers. The health data used here were generated by the National Health Register Authorities (Finnish Institute for Health and Welfare, Statistics Finland, KELA, and the Digital and Population Data Services Agency) and authorized by the respective authorities or by the Finnish Health and Social Data Permit Authority (Findata) for use within FinnGen. Accordingly, we are not able to grant access to individual-level data. Details on the data access restrictions and policies are listed within the FinnGen flagship publication (Kurki et al., Nature, 2023). The additive GWAS summary statistics for 2329 traits in FinnGen are available at https://r12.finngen.fi/. The GWAS summary statistics from the European ancestry subset of the MVP dataset, which were used for replication, are available at https://mvp-ukbb.finngen.fi/.

Code availability

The code used for running the simulations and creating all figures and tables has been deposited on GitHub⁵⁶ (github.com/ArtomovLab/Cost-efficient-non-additive-GWAS).

References

Dowden, H. & Munro, J. Trends in clinical success rates and therapeutic focus. Nat. Rev. Drug Discov. 18, 495–496 (2019).
Minikel, E. V., Painter, J. L., Dong, C. C. & Nelson, M. R. Refining the impact of genetic evidence on clinical success. Nature 629, 624–629 (2024).
Uffelmann, E. et al. Genome-wide association studies. Nat. Rev. Methods Prim. 1, 1–21 (2021).
Liu, H.-M. et al. Recessive/dominant model: Alternative choice in case-control-based genome-wide association studies. PLOS ONE 16, e0254947 (2021).
Guindo-Martínez, M. et al. The impact of non-additive genetic associations on age-related complex diseases. Nat. Commun. 12, 2436 (2021).
O’Connor, M. J. et al. Recessive genome-wide meta-analysis illuminates genetic architecture of Type 2 diabetes. Diabetes 71, 554–565 (2021).
Tanikawa, C. et al. A genome-wide association study identifies two susceptibility loci for duodenal ulcer in the Japanese population. Nat. Genet. 44, 430–434 (2012).
Pirastu, N. et al. Non-additive genome-wide association scan reveals a new gene associated with habitual coffee consumption. Sci. Rep. 6, 31590 (2016).
Warner, S. C. et al. Genome-wide association scan of neuropathic pain symptoms post total joint replacement highlights a variant in the protein-kinase C gene. Eur. J. Hum. Genet. 25, 446–451 (2017).
Palmer, D. S. et al. Analysis of genetic dominance in the UK Biobank. Science 379, 1341–1348 (2023).
Kurki, M. I. et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 613, 508–518 (2023).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Heyne, H. O. et al. Mono- and biallelic variant effects on disease at the biobank scale. Nature 613, 519–525 (2023).
Wang, Q. et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature 597, 527–532 (2021).
Collins, R. L. et al. A cross-disorder dosage sensitivity map of the human genome. Cell 185, 3041–3055.e25 (2022).
Merkle, F. T. et al. Human pluripotent stem cells recurrently acquire and expand dominant negative P53 mutations. Nature 545, 229–233 (2017).
Grice, S. J. et al. Dominant, toxic gain-of-function mutations in gars lead to non-cell autonomous neuropathology. Hum. Mol. Genet. 24, 4397–4406 (2015).
Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53, 1097–1103 (2021).
Zou, Y., Carbonetto, P., Wang, G. & Stephens, M. Fine-mapping from summary data with the “Sum of Single Effects” model. PLOS Genet. 18, e1010299 (2022).
The GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Di Angelantonio, E. et al. Efficiency and safety of varying the frequency of whole blood donation (INTERVAL): a randomised trial of 45,000 donors. Lancet Lond. Engl. 390, 2360–2371 (2017).
Gaziano, J. M. et al. Million Veteran Program: A mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70, 214–223 (2016).
Koscielny, G. et al. Open Targets: a platform for therapeutic target identification and validation. Nucleic Acids Res. 45, D985–D994 (2017).
MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45, D896–D901 (2017).
Karczewski, K. J. et al. Systematic single-variant and gene-based association testing of thousands of phenotypes in 394,841 UK Biobank exomes. Cell Genomics 2, (2022).
Jurgens, S. J. et al. Rare coding variant analysis for human diseases across biobanks and ancestries. Nat. Genet. 56, 1811–1820 (2024).
Mancuso, N. et al. Integrating Gene Expression with Summary Association Statistics to Identify Genes Associated with 30 Complex Traits. Am. J. Hum. Genet. 100, 473–487 (2017).
Lu, M. et al. TWAS Atlas: a curated knowledgebase of transcriptome-wide association studies. Nucleic Acids Res. 51, D1179–D1187 (2023).
Huang, Z., Qi, G., Miller, J. S. & Zheng, S. G. CD226: An emerging role in immunologic diseases. Front. Cell Dev. Biol. 8, (2020).
Schapira, M., Tyers, M., Torrent, M. & Arrowsmith, C. H. WD40 repeat domain proteins: a novel target class? Nat. Rev. Drug Discov. 16, 773–786 (2017).
Cong, Y. et al. WDR45, one gene associated with multiple neurodevelopmental disorders. Autophagy 17, 3908–3923 (2021).
Raichur, S. et al. The role of C16:0 ceramide in the development of obesity and type 2 diabetes: CerS6 inhibition as a novel therapeutic approach. Mol. Metab. 21, 36–50 (2019).
Chen, Y. et al. Epigenetically upregulated oncoprotein PLCE1 drives esophageal carcinoma angiogenesis and proliferation via activating the PI-PLCε-NF-κB signaling pathway and VEGF-C/ Bcl-2 expression. Mol. Cancer 18, 1 (2019).
Christiansen, J. et al. ANO5-related muscle diseases: From clinics and genetics to pathology and research strategies. Genes Dis. 9, 1506–1520 (2022).
Yan, H., He, L., Lv, D., Yang, J. & Yuan, Z. The role of the dysregulated JNK signaling pathway in the pathogenesis of human diseases and its potential therapeutic strategies: a comprehensive review. Biomolecules 14, 243 (2024).
Astore, C., Nagpal, S. & Gibson, G. Mendelian randomization indicates a causal role for Omega−3 fatty acids in inflammatory Bowel disease. Int. J. Mol. Sci. 23, 14380 (2022).
Liegeois, M. A. & Fahy, J. V. The Mucin Gene MUC5B is required for normal lung function. Am. J. Respir. Crit. Care Med. 205, 737–739.
Leyton, E. et al. DEF8 and autophagy-associated genes are altered in mild cognitive impairment, probable Alzheimer’s disease patients, and a transgenic model of the disease. J. Alzheimers Dis. JAD. 82, S163–S178 (2021).
Hao, L. et al. Decreased Spire2 expression is involved in epilepsy. Neuroscience 504, 1–9 (2022).
Plotnikov, D. et al. A commonly occurring genetic variant within the NPLOC4-TSPAN10-PDE6G gene cluster is associated with the risk of strabismus. Hum. Genet. 138, 723–737 (2019).
Pazokitoroudi, A., Chiu, A. M., Burch, K. S., Pasaniuc, B. & Sankararaman, S. Quantifying the contribution of dominance deviation effects to complex trait variation in biobank-scale data. Am. J. Hum. Genet. 108, 799–808 (2021).
Tsouris, A., Brach, G., Schacherer, J. & Hou, J. Non-additive genetic components contribute significantly to population-wide gene expression variation. Cell Genomics 4, 100459 (2024).
Cui, L. et al. Dominance is common in mammals and is associated with trans-acting gene expression and alternative splicing. Genome Biol. 24, 215 (2023).
Haycock, P. C. et al. Best (but oft-forgotten) practices: the design, analysis, and interpretation of Mendelian randomization studies. Am. J. Clin. Nutr. 103, 965–978 (2016).
Plenge, R. M., Scolnick, E. M. & Altshuler, D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 12, 581–594 (2013).
Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
Sham, P. C. & Purcell, S. M. Statistical power and significance testing in large-scale genetic studies. Nat. Rev. Genet. 15, 335–346 (2014).
Guedj, M., Nuel, G. & Prum, B. A note on allelic tests in case-control association studies. Ann. Hum. Genet. 72, 407–409 (2008).
Walsh, B. & Lynch, M. Evolution and Selection of Quantitative Traits. (Oxford University Press, 2018).
Browning, S. R. & Browning, B. L. Accurate non-parametric estimation of recent effective population size from segments of identity by descent. Am. J. Hum. Genet. 97, 404–418 (2015).
Schoech, A. P. et al. Quantification of frequency-dependent genetic architectures in 25 UK Biobank traits reveals action of negative selection. Nat. Commun. 10, 790 (2019).
Eyre-Walker, A. Genetic architecture of a complex trait and its implications for fitness and genome-wide association studies. Proc. Natl. Acad. Sci. 107, 1752–1756 (2010).
Simons, Y. B., Bullaughey, K., Hudson, R. R. & Sella, G. A population genetic interpretation of GWAS findings for human quantitative traits. PLOS Biol. 16, e2002985 (2018).
Caballero, A., Tenesa, A. & Keightley, P. D. The nature of genetic variation for complex traits revealed by GWAS and regional heritability mapping analyses. Genetics 201, 1601–1613 (2015).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Molotkov, I. et al. ArtomovLab/Cost-efficient-non-additive-GWAS: Zenodo https://doi.org/10.5281/zenodo.17652059 (2025).

Acknowledgements

This study was supported by funding from the Aging Biology Foundation to the Artomov Lab. We thank the participants and investigators of the FinnGen study. A full list of FinnGen Consortium members is given in Supplementary Data 7. The FinnGen project is funded by two grants from Business Finland (HUS 4685/31/2016 and UH 4386/31/2016) and the following industry partners: AbbVie Inc., AstraZeneca UK Ltd, Biogen MA Inc., Bristol Myers Squibb (and Celgene Corporation & Celgene International II Sàrl), Genentech Inc., Merck Sharp & Dohme LLC, Pfizer Inc., GlaxoSmithKline Intellectual Property Development Ltd., Sanofi US Services Inc., Maze Therapeutics Inc., Janssen Biotech Inc, Novartis AG, and Boehringer Ingelheim International GmbH. The following biobanks are acknowledged for delivering biobank samples to FinnGen: Auria Biobank (www.auria.fi/biopankki), THL Biobank (www.thl.fi/biobank), Helsinki Biobank (www.helsinginbiopankki.fi), Biobank Borealis of Northern Finland (https://www.ppshp.fi/Tutkimus-ja-opetus/Biopankki/Pages/Biobank-Borealis-briefly-in-English.aspx), Finnish Clinical Biobank Tampere (www.tays.fi/en-US/Research_and_development/Finnish_Clinical_Biobank_Tampere), Biobank of Eastern Finland (www.ita-suomenbiopankki.fi/en), Central Finland Biobank (www.ksshp.fi/fi-FI/Potilaalle/Biopankki), Finnish Red Cross Blood Service Biobank (www.veripalvelu.fi/verenluovutus/biopankkitoiminta), Terveystalo Biobank (www.terveystalo.com/fi/Yritystietoa/Terveystalo-Biopankki/Biopankki/) and Arctic Biobank (https://www.oulu.fi/en/university/faculties-and-units/faculty-medicine/northern-finland-birth-cohorts-and-arctic-biobank). All Finnish biobanks are members of the BBMRI.fi infrastructure (https://www.bbmri-eric.eu/national-nodes/finland/). The Finnish Biobank Cooperative (FINBB; https://finbb.fi/) is the coordinator of BBMRI-ERIC operations in Finland.
The Finnish biobank data can be accessed through the Fingenious® services (https://site.fingenious.fi/en/) managed by FINBB; specific requests and questions about the policies regulating access to these data should be addressed to FINBB.

Author information

A full list of members and their affiliations appears in the Supplementary Information.

Authors and Affiliations

The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, USA
Ivan Molotkov & Mykyta Artomov
Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
Ivan Molotkov & Mykyta Artomov
Institute for Molecular Medicine Finland, Helsinki, Finland
Ivan Molotkov, Mitja Kurki, Aarno Palotie, Mark J. Daly & Mykyta Artomov
Broad Institute, Cambridge, MA, USA
Mitja Kurki, Aarno Palotie & Mark J. Daly
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
Mark J. Daly

Consortia

FinnGen

Ivan Molotkov, Mitja Kurki, Aarno Palotie, Mark J. Daly & Mykyta Artomov

Contributions

I.M. and M.A. conceived and designed the study. M.K. and FinnGen contributed data and analysis pipelines. I.M. conducted the analysis and generated the figures. I.M. and M.A. wrote the manuscript. M.A., M.J.D., and A.P. acquired funding. All authors reviewed and edited the manuscript.

Corresponding author

Correspondence to Mykyta Artomov.

Ethics declarations

Competing interests

A.P. is a member of the Pfizer Genetics Scientific Advisory Panel. M.J.D. is a founder of Maze Therapeutics. The other authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Ghislain Rocheleau and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Description of Additional Supplementary Files (DOCX)

Supplementary Information (PDF)

Supplementary Data 1-7 (XLSX)

Reporting Summary (PDF)

Transparent Peer Review file (PDF)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Molotkov, I., Kurki, M., FinnGen. et al. Cost-effective non-additive GWAS across 2329 diseases in 500,349 individuals. Nat Commun 17, 580 (2026). https://doi.org/10.1038/s41467-025-67277-4

Received: 04 April 2025
Accepted: 26 November 2025
Published: 13 December 2025
Version of record: 15 January 2026
DOI: https://doi.org/10.1038/s41467-025-67277-4