Main

ALS is a fatal neurodegenerative disease with a substantial genetic component1,2,3. Despite the discovery of several ALS genes, the genetic etiology remains elusive for most patients, with previous work indicating that a substantial portion of genetic risk for ALS is mediated by rare variants4. Identifying new ALS genes is key to furthering our understanding of the disease and may provide direct potential therapeutic targets, as evidenced by the recent approval of Tofersen for SOD1 mutation-positive ALS. The advent of next generation sequencing has opened the way for population-scale, genome-wide studies of both familial and sporadic ALS cases, which has already led to the identification of several ALS genes, including, among others, TBK1, NEK1 and KIF5A5,6,7,8,9. In continuing the search for ALS genes, sequencing data from tens of thousands of people are required to enable detection of ultrarare and low-to-modest impact variants. To this end, we assembled and harmonized data from 22 cohorts to generate the largest ALS exome sequencing dataset to date. This provided a discovery cohort encompassing 13,138 cases and 69,775 controls and an independent replication set of 4,781 cases and 130,928 controls. We processed and harmonized all samples uniformly, including realignment to the GRCh38 reference genome and joint variant calling, which we show to be crucial to eliminate structural biases. Through comprehensive single-variant and ultrarare variant (URV) burden analyses, we identified several new candidate variants, genes and genesets, thereby substantially expanding our understanding of the contribution of rare variants to ALS.

Results

Building a harmonized ALS exome dataset for rare variant analysis

To identify rare coding variants involved in ALS, we harmonized 18 whole-exome (WXS) and whole-genome (WGS) sequencing datasets into a discovery cohort totaling 94,545 people. All data were realigned uniformly to GRCh38 and called jointly using the functional equivalence pipeline10,11, substantially reducing technical variation (Supplementary Fig. 1a). Moreover, the distributions of exome-wide URV counts were aligned between ancestry-matched WGS (Project MinE) and WXS (UK Biobank) samples, indicating that sequencing technologies were comparable after joint processing and quality control (Supplementary Fig. 1b). Following strict filtering, the final dataset comprised 13,138 unrelated cases and 69,775 controls of predominantly European ancestry, with 5,207,138 variants (2,367,861 predicted moderate or high impact; Supplementary Figs. 2–4).

Rare single-variant analyses identify five new risk variants and largely recapitulate known rare variant architecture of ALS

We conducted single-variant analyses of 272,925 rare variants that fell within our testable minor allele frequency (MAF) range (5 × 10−5 < MAF < 0.05) while also satisfying variant effect prediction criteria of either moderate- (missense mutations, in-frame deletions and untranslated region (UTR) truncations) or high-impact (nonsense, splice acceptor/donor and frameshift mutations) annotations. For each variant, we used Firth’s logistic regression to test for an association between ALS status and minor allele count (MAC), adjusting for sex, ten principal components (PCs) and the total number of rare synonymous variants in each person12,13. The resulting test statistics showed no systematic inflation, indicating no residual confounding (λ1,000 = 1.01), and significant variants passed subsequent validation and sensitivity analyses (Extended Data Fig. 1a,b).
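The λ1,000 statistic quoted above is a genomic inflation factor rescaled to a hypothetical study of 1,000 cases and 1,000 controls, which makes values comparable across sample sizes. The text does not spell out the computation, so the following is a minimal sketch of the conventional rescaling, assuming the discovery sample sizes reported here. The function name and the raw λ passed to it are illustrative, not taken from the study.

```python
def lambda_1000(lambda_gc: float, n_cases: int, n_controls: int) -> float:
    """Rescale a raw genomic inflation factor to 1,000 cases and 1,000 controls.

    Conventional formula (assumed here, not stated in the text):
    lambda_1000 = 1 + (lambda - 1) * (1/n_cases + 1/n_controls) / (1/1000 + 1/1000)
    """
    observed = 1.0 / n_cases + 1.0 / n_controls
    reference = 1.0 / 1000 + 1.0 / 1000
    return 1.0 + (lambda_gc - 1.0) * observed / reference


# Hypothetical raw inflation factor, rescaled with the discovery cohort sizes.
example = lambda_1000(1.10, n_cases=13_138, n_controls=69_775)
print(f"lambda_1000 = {example:.4f}")
```

Because the discovery cohort is far larger than 1,000 cases and 1,000 controls, the rescaled value sits much closer to 1 than the raw factor does.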

We identified 15 exome-wide significant variants across 11 distinct genes (P < 1.83 × 10−7; Fig. 1a, Table 1, Extended Data Table 1, Extended Data Figs. 1 and 2 and Supplementary Data 1), for all of which the minor allele was associated with increased ALS risk (Fig. 1c). Among the 15 associated variants, 10 were located in genes previously shown to be related to ALS: SOD1, CFAP410, NEK1, KIF5A, FUS and TBK1 (Fig. 1a and Extended Data Table 1). The remaining five have not been reported previously in ALS (Table 1). These encompass intermediate frequency variants with modest effect size, including HTR3C p.T186A (odds ratio (OR) = 3.41, P = 1.87 × 10−8) and YKT6 p.Y64C (OR = 2.84, P = 9.08 × 10−8) as well as rare variants with high effect size, including GBGT1 p.R152L (OR = 26.9, P = 1.68 × 10−10), CAPN2 p.I530V (OR = 25.3, P = 3.66 × 10−9), and KNTC1 p.W287R (OR = 27.7, P = 1.07 × 10−7).

Fig. 1: Rare single-variant analyses.

a, y axis: exome-wide single variant associations estimated using Firth’s logistic regression with profile penalized likelihood CIs (−log10(P)); x axis: genomic coordinates (GRCh38). Dashed line: exome-wide significance threshold (P < 1.83 × 10−7). New variants are highlighted in orange. b, Rare single-variant analyses among ALS-linked genes curated by the ALS GCEP. y axis: single-variant associations estimated using Firth’s logistic regression with profile penalized likelihood CIs (−log10(P)); x axis: genomic coordinates (GRCh38). Variants are colored by the clinical validity classification as curated by the ALS GCEP. Lower dashed line: significance threshold across variants in ALS-linked genes (P < 3.20 × 10−5); upper dashed line: exome-wide significance threshold as presented in a. c, ORs (y axis) and 95% CIs (gray shaded area) plotted against the risk allele frequency in controls (x axis) for significant variants identified in either the exome-wide or GCEP analysis. For variants where the control risk allele frequency was 0, it was set to half the lowest nonzero risk allele frequency observed in the control group. P values are two-tailed and are presented uncorrected for multiple testing.

Table 1 New rare single variants achieving significance

We also performed a targeted analysis of variants within 51 ALS-linked genes curated by the ALS Gene Curation Expert Panel (GCEP)14. To ensure the inclusion of the full set of GCEP-curated genes, we did not apply the per-supercohort call-rate filter for this analysis, allowing for the assessment of genes exhibiting subpar call-rates in certain subcohorts. This identified eight additional variants across six genes (P < 3.20 × 10−5; Fig. 1b,c, Extended Data Table 1 and Supplementary Data 2), including variants in genes that were not detected in the exome-wide analysis (ARPP21, ANXA11, UBQLN2 and TARDBP). For all identified variants, the minor allele was associated with increased ALS risk (Fig. 1c and Extended Data Table 1). We provide independent evidence for two rare variants in ARPP21 (p.P563L and p.P747L)—a gene that is currently considered as having limited evidence according to GCEP (p.P563L: OR = 44.8, P = 2.55 × 10−10; p.P747L: OR = 75.8, P = 1.45 × 10−6) (Fig. 1b,c). Of note, the ARPP21 p.P563L variant had subpar call-rates in some exome cohorts. However, even when restricting the analysis to cohorts meeting stringent call-rate thresholds, the association remained exome-wide significant with a similar odds ratio (P = 1.09 × 10−8, OR = 38.1; Extended Data Fig. 1c,d).

Principal component analysis (PCA) suggested a mixed pattern of geographical distribution for carriers of the identified variants (Supplementary Fig. 5). For some variants, we observed that carriers exhibited relatively tight clustering in PCA space. This was observed both for well-established ALS variants such as UBQLN2 p.P509S (Sweden) and SOD1 p.A5V (USA) and for the new CAPN2 p.I530V variant (the Netherlands). Conversely, other variants were distributed more uniformly across patient populations (for example, YKT6 p.Y64C and ARPP21 p.P563L). In silico pathogenicity prediction tools also yielded varying annotations for both previously established and new ALS-associated variants (Supplementary Table 1). Nonetheless, we observed that YKT6 p.Y64C was consistently predicted as damaging by all predictors, and KNTC1 p.W287R was predicted as damaging by all but SIFT.

Ultrarare burden analyses identify new ALS-associated genes

To detect associations among URVs (five or fewer carriers), we performed burden tests using Firth’s logistic regression to evaluate their cumulative effects. URVs were aggregated across several functional units, including genes and protein domains. To enrich for potentially pathogenic variants, we used four filtering strategies based on two criteria: (1) variant frequency—either all URVs or singleton-only variants; (2) variant impact—either only high-impact variants or both high- and moderate-impact variants. Tests across these filtering strategies were combined using the ACAT omnibus test15. We observed no evidence of genomic inflation in any of the analyses performed (gene λ1,000 = 1.011, domain λ1,000 = 1.006; Extended Data Fig. 3a), and all presented genes passed subsequent sensitivity analyses (Extended Data Fig. 3e–g).
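Combining the four filtering strategies into one omnibus P value can be sketched with the Cauchy combination statistic that underlies ACAT. The sketch below assumes equal weights for the four strategies, and the per-strategy P values are hypothetical placeholders rather than study results. The function name is illustrative.

```python
import math


def acat_combine(p_values, weights=None):
    """Combine P values with the Cauchy combination (ACAT) statistic."""
    if not p_values:
        raise ValueError("at least one P value is required")
    if weights is None:
        weights = [1.0] * len(p_values)  # assumption: equal weights
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")

    total_weight = sum(weights)
    statistic = 0.0
    for p, w in zip(p_values, weights):
        if not 0.0 < p < 1.0:
            raise ValueError("P values must lie strictly between 0 and 1")
        if p < 1e-15:
            # tan((0.5 - p) * pi) is approximately 1 / (p * pi) for tiny p
            cauchy = 1.0 / (p * math.pi)
        else:
            cauchy = math.tan((0.5 - p) * math.pi)
        statistic += (w / total_weight) * cauchy

    # Equivalent to 0.5 - atan(statistic) / pi, but stable for large statistics.
    return math.atan2(1.0, statistic) / math.pi


# Hypothetical per-strategy burden P values for one gene.
strategy_p = {
    "all URVs, high + moderate impact": 3e-6,
    "all URVs, high impact only": 0.04,
    "singletons, high + moderate impact": 2e-4,
    "singletons, high impact only": 0.21,
}
print(acat_combine(list(strategy_p.values())))
```

The combined value is dominated by the smallest input P values, which suits a setting where only one of the filtering strategies may capture the causal variants.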

URV gene burden analyses across 17,324 protein-coding genes identified eight genes that reached exome-wide significance (P < 2.89 × 10−6) (Fig. 2a, Table 2, Extended Data Table 2, Extended Data Figs. 2 and 3, Supplementary Data 3 and 4 and Supplementary Fig. 6). Among these were four established ALS genes: SOD1 (P < 1 × 10−16), TBK1 (P < 1 × 10−16), NEK1 (P = 6.49 × 10−13) and TARDBP (P = 5.02 × 10−8) (Extended Data Table 2). Furthermore, we identified DNAJC7 (P = 8.77 × 10−8), which is currently classified as having limited evidence (ClinGen gene curation panel14), and here reaches exome-wide significance for the first time in an exome-wide discovery analysis. New candidate genes included TTC3 (P = 4.16 × 10−7), UNC13C (P = 2.80 × 10−7) and KIF4A (P = 1.62 × 10−6), in all of which higher URV burden increased risk of ALS (Table 2 and Extended Data Fig. 3b). A targeted analysis among the 51 ALS-linked genes classified by GCEP also revealed a significant association for OPTN (P = 1.56 × 10−5), which is classified by GCEP as a definitive ALS gene (Fig. 2b and Extended Data Table 2).

Fig. 2: URV burden analyses.

a, y axis: exome-wide gene-based URV associations (−log10(P)); x axis: genomic coordinates (GRCh38). Dashed line: exome-wide significance threshold (P < 2.9 × 10−6). New risk genes are highlighted in orange. b, URV burden analyses among ALS-linked genes curated by the ALS GCEP. y axis: gene-based URV associations (−log10(P)); x axis: genomic coordinates (GRCh38). Lower dashed line: significance threshold across ALS-linked genes (P < 1 × 10−3); upper dashed line: exome-wide significance threshold as presented in a. c, Domain-based URV analyses. y axis: domain associations (−log10(P)); x axis: genomic coordinates (GRCh38). Dashed line: exome-wide significance threshold (P < 7.68 × 10−7). d, Association P values for URV geneset burden analyses excluding exome-wide significant genes (y axis) versus including exome-wide significant genes (x axis). The dashed lines indicate the multiple testing threshold (P < 4.25 × 10−6). Association statistics were estimated using Firth’s logistic regression with profile penalized likelihood CIs. P values are from the ACAT omnibus test combining the four variant filtering strategies (Methods) and are two-tailed and uncorrected for multiple testing.

Table 2 New genes achieving exome-wide significance in URV burden analyses

The URV domain analyses across 65,071 domains identified three partially overlapping domains in TBK1 (protein kinase, kinase-like and CCD1 domains), one domain in SOD1 (SOD_Cu/Zn_BS domain) and one domain in VCP (CDC48 domain 2-like domain) at exome-wide significance (P < 7.68 × 10−7; Fig. 2c, Extended Data Fig. 3c and Supplementary Data 5 and 6). Unlike SOD1 and TBK1, VCP did not reach significance in the whole-gene analysis (Pgene = 8.09 × 10−3), suggesting that the CDC48 domain 2-like region harbors the primary association signal with a markedly stronger effect (Pdomain = 2.16 × 10−7). This domain constitutes the second subdomain of the N-terminal domain, in which most known pathogenic mutations are concentrated16.
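The exome-wide thresholds quoted for the single-variant, gene and domain analyses are each consistent with a Bonferroni correction at α = 0.05 over the number of units tested. This is an inference from the reported numbers rather than a statement in the text, and the short check below is a sketch under that assumption.

```python
ALPHA = 0.05  # assumed family-wise error rate


def bonferroni_threshold(n_tests: int, alpha: float = ALPHA) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Numbers of tested units as reported in the text.
N_TESTS = {
    "single variants": 272_925,
    "protein-coding genes": 17_324,
    "protein domains": 65_071,
}

for unit, n_tests in N_TESTS.items():
    print(f"{unit}: P < {bonferroni_threshold(n_tests):.2e}")
# single variants: P < 1.83e-07
# protein-coding genes: P < 2.89e-06
# protein domains: P < 7.68e-07
```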

Across burden analyses, ORs were generally similar when including all URVs compared to including singletons only, with the notable exceptions of NEK1 and KIF4A, which showed markedly higher ORs in the singleton-only analyses (Extended Data Fig. 3b–d). The observed associations were driven primarily by moderate-impact variants: NEK1 and TBK1 were the only genes showing a significant signal when analyses were restricted to high-impact variants (Extended Data Fig. 3k–m), although some signal among high-impact variants was observed for DNAJC7 and OPTN. Single nucleotide variants (SNVs) were the primary drivers of the associations, with insertions/deletions (INDELs) contributing substantially to the association P values only for NEK1 and DNAJC7 (Extended Data Fig. 3h–j). For UNC13C, TTC3 and OPTN, we identified a small subset of people carrying two URVs, whereas for TBK1 and NEK1, we found people with both a URV and a more common (0.01 < MAF < 0.05) risk variant (p.V464A and p.R261H, respectively). No increased risk was observed in these cases, although this may be due to the low number of co-occurrences (Extended Data Fig. 4a).

Assessing geneset burden and variant co-occurrence

We performed URV geneset burden analyses across 11,777 Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Reactome genesets from the Molecular Signatures Database (MSigDB v.7.5)17, using the same procedure as the single gene analyses (λ1,000 = 1.006, Supplementary Data 7 and 8). After excluding genesets driven solely by one highly significant gene, two genesets remained significant: ‘GOBP: regulation of mRNA splicing via spliceosome’ (GO:0048024, 96 genes, P = 2.97 × 10−7) and its parent term ‘GOBP: regulation of RNA splicing’ (GO:0043484, 142 genes, P = 3.50 × 10−6) (Fig. 2d and Extended Data Fig. 3d). As ‘regulation of mRNA splicing via spliceosome’ is a subset of ‘regulation of RNA splicing,’ we performed a conditional analysis to assess its independent contribution. This revealed that residual signal remains in ‘regulation of mRNA splicing via spliceosome’ (P = 0.0084), suggesting it captures a more specific association within this pathway. Among the 153 unique genes across these two genesets, 30 reached nominal significance (P < 0.05), with top genes including HSPA8, HABP4, NOVA2, HNRNPL and SNW1 (Supplementary Fig. 7a). We also performed a geneset analysis among the 51 ALS-linked genes curated by GCEP14. As expected, this showed that the ‘Definitive’ category was highly significant (P < 1 × 10−16) across allele frequency thresholds, whereas the ‘Limited’ category showed only modest enrichment (P = 0.0015), and no enrichment was seen among the other categories (Supplementary Fig. 7b,c).

We next examined whether carrying several variants among ‘Definitive’ ALS genes as classified by GCEP confers cumulative risk. We observed a clear dose–response relationship across low-frequency variants (MAF < 0.05): the OR increased progressively as people carried one (OR = 1.19, P = 2.11 × 10−15), two (OR = 1.35, P = 8.43 × 10−13), three (OR = 1.84, P = 2.78 × 10−8) or four (OR = 4.26, P = 5.35 × 10−5) qualifying variants (Extended Data Fig. 4c and Supplementary Data 9). This relationship persisted when burden was assessed at the gene level, where several variants within the same gene were counted as a single event (Extended Data Fig. 4d and Supplementary Data 9). Analyses restricted to rarer variants were underpowered due to the low number of people carrying several variants (Supplementary Data 9). We did not observe a similar dose–response relationship when we tested for an association with age at onset and survival (Supplementary Fig. 8).

We next focused on co-occurrence among the specific risk variants identified in this study. Focusing on single variants in ‘Definitive’ GCEP genes, we found that 11.1% of cases carried one variant and 0.54% carried two, whereas the co-occurrence of three or more variants was not observed (Extended Data Table 3). When including variants in genes with ‘Limited’ evidence and new single variants identified in this study, the proportions increased to 14.5% for one, 1.1% for two and 0.0076% for three variants. The proportions increased further to 18.2%, 1.7% and 0.099%, respectively, when also including qualifying variants from the URV burden analyses. Finally, when C9orf72 repeat expansion status was also considered (available for 66% of cases), these totals rose to 23.5%, 3.12% and 0.22%, respectively, totaling 26.9% of cases. The observed co-occurrence rates did not deviate from those expected under an additive model using permutation analyses (P = 0.39). When examining specific variant pairs, we observed numerous instances of cases carrying several variants (Extended Data Fig. 5 and Supplementary Fig. 9). For example, 20% of C9orf72 repeat expansion carriers harbored additional risk variants. Furthermore, some pairs, including CFAP410 p.V58L × NEK1 p.R261H, showed trends suggestive of a synergistic effect (Extended Data Fig. 4e). To formally test whether any of these pairs showed nonadditive effects, we performed pairwise co-occurrence and interaction analyses. No pairs reached significance after correction for multiple testing (Extended Data Fig. 5 and Supplementary Fig. 9). This was consistent with our power calculations (Supplementary Fig. 10), which showed that the study was underpowered to detect all but the largest deviations from additivity for specific variant pairs, and then only for pairs including at least one low-frequency variant (0.01 < MAF < 0.05).

ARPP21 p.P563L is associated with earlier disease onset and shorter disease duration

To assess the impact of genetic variants on disease progression, we analyzed survival and age at onset across candidate genes and variants (Extended Data Fig. 6 and Supplementary Data 10). Consistent with previous reports, SOD1 p.A5V and p.D91A were associated significantly with a lower age at onset (p.A5V: B = −9.44, P = 5.61 × 10−4; p.D91A: B = −4.82, P = 1.11 × 10−5), with p.A5V linked to shorter survival and p.D91A to longer survival (p.A5V: hazard ratio (HR) = 13.0, P = 1.19 × 10−8; p.D91A: HR = 0.453, P = 1.48 × 10−7). Similarly, FUS p.R521C and p.P525L were associated with earlier onset (p.R521C: B = −16.2, P = 1.90 × 10−4; p.P525L: B = −39.1, P = 1.53 × 10−10), with p.P525L specifically associated with shorter survival (HR = 41.75, P = 1.41 × 10−10). Notably, ARPP21 p.P563L was associated with a significantly lower age at onset (B = −12.7, P = 5.44 × 10−4) and shorter survival (HR = 5.96, ∆survival time = −19.5 months, P = 2.54 × 10−6), showing effect sizes comparable to SOD1 p.A5V (Extended Data Fig. 6a). Among URVs, SOD1 was associated with longer survival (HR = 0.45, P = 0.0022), whereas no significant associations were observed for other genes (Extended Data Fig. 6b).

Replication confirms YKT6 and supports HTR3C, GBGT1 and KNTC1 as ALS risk genes

For replication, we generated a cohort comprising 4,781 individuals with ALS and 130,928 controls after applying stringent quality control criteria identical to those used in the discovery set (Supplementary Figs. 11 and 12). Power analyses based on the (winner’s curse adjusted) effect sizes observed in the discovery dataset indicated that this provides between 32% and 91% statistical power for replication across candidate variants and genes (Supplementary Fig. 12).

Of the five new single variants identified in the discovery phase, all showed a consistent direction of effect in the replication cohort (Table 1; λ1,000 = 0.965). Moreover, all five reached exome-wide significance in a meta-analysis of the combined discovery and replication data, with all but CAPN2 p.I530V showing greater significance compared to the discovery phase alone (Table 1). Furthermore, YKT6 p.Y64C achieved replication-wide significance (P < 0.0063), correcting for the eight new associations from the discovery phase (five single variants and three URV genes). Among the three candidate URV genes, a consistent direction of effect was seen only for KIF4A (OR = 2.46, P = 0.26), and none reached replication-wide significance (Table 2; λ1,000 = 1.046).

Establishing independent evidence for ARPP21, DNAJC7 and CFAP410

Next, for the genes that were significant in our discovery analysis that are currently classified by GCEP with ‘Limited’ evidence (ARPP21, CFAP410 and DNAJC7), we aimed to confirm the independence of our findings.

For ARPP21, we identified two rare variants: p.P747L, which has not previously been reported in the scientific literature, and p.P563L, previously reported in UK and Spanish families as candidate variants18,19. To confirm independence for p.P563L, we excluded four potentially overlapping UK carriers (no Spanish carriers were identified). The association remained (OR = 28.3, P = 3.47 × 10−7; Extended Data Fig. 7a) and was further supported by our replication dataset, which had no potential overlap with previous studies (OR = 16.5, P = 3.29 × 10−3). A meta-analysis of these two independent datasets yielded a highly significant association (P = 4.31 × 10−9), confirming a strong, independent signal. We also validated the reported effects on age of onset and progression19 in our nonoverlapping cohort (Extended Data Fig. 6). Finally, ARPP21 carriers were observed across several cohorts beyond those from the UK and Spain, significantly expanding its known population distribution (Supplementary Fig. 5).

CFAP410 p.V58L was previously identified in two common variant genome-wide association studies (GWAS) (MAF = 0.013)4,20. To confirm independence, we excluded 8,372 cases and 4,159 controls that were duplicated or had second-degree or closer genetic relatedness to the original GWAS cohorts. The association remained highly significant after this exclusion (Pmeta = 1.34 × 10−14), with consistent ORs in both discovery (OR = 1.81, P = 1.32 × 10−10) and replication (OR = 1.61, P = 1.09 × 10−5) cohorts (Extended Data Fig. 7b).

For DNAJC7, implicated previously in a case–control study of ALS21, we re-evaluated the association after excluding overlapping cohorts (excluding 5,722 cases and 9,849 controls). In this reduced discovery dataset, there remained a robust association with a consistent odds ratio (ncases = 7,606, ncontrols = 59,926; OR = 2.56, P = 1.36 × 10−4; Extended Data Fig. 7c). This was further supported by our replication cohort, which had minimal overlap (190 cases) with the previous study (OR = 2.41, P = 2.82 × 10−3; Extended Data Fig. 7c). Meta-analysis across these two datasets confirmed a strong, independent signal (P = 2.96 × 10−6).
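How discovery and replication estimates combine in a meta-analysis can be illustrated with a fixed-effect, inverse-variance model. This section does not state which meta-analysis method was used, so the sketch below is an assumption. It also back-calculates Wald standard errors from the reported ORs and two-sided P values, which only approximates estimates obtained from Firth's penalized likelihood. It is therefore not expected to reproduce the reported meta-analysis P values exactly. All function names are illustrative.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def se_from_or_and_p(odds_ratio: float, p_value: float) -> float:
    """Approximate the standard error of log(OR) from a two-sided P value."""
    z = -_NORMAL.inv_cdf(p_value / 2.0)  # positive z-score for this P value
    return abs(math.log(odds_ratio)) / z


def fixed_effect_meta(estimates):
    """Inverse-variance pooling of (odds_ratio, two_sided_p) pairs.

    Returns the pooled odds ratio and its two-sided P value.
    """
    weight_sum = 0.0
    weighted_beta = 0.0
    for odds_ratio, p_value in estimates:
        se = se_from_or_and_p(odds_ratio, p_value)
        weight = 1.0 / se ** 2
        weight_sum += weight
        weighted_beta += weight * math.log(odds_ratio)
    beta = weighted_beta / weight_sum
    z = beta / math.sqrt(1.0 / weight_sum)
    p_pooled = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P value
    return math.exp(beta), p_pooled


# DNAJC7 estimates quoted above: reduced discovery set, then replication.
pooled_or, pooled_p = fixed_effect_meta([(2.56, 1.36e-4), (2.41, 2.82e-3)])
print(f"pooled OR = {pooled_or:.2f}, pooled P = {pooled_p:.2e}")
```

Two concordant estimates of moderate individual significance pool into a markedly smaller P value, which is the pattern described for the independent DNAJC7 datasets.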

Discussion

This study represents one of the largest rare variant analyses of ALS to date, capturing nearly all of its known rare variant architecture. We expand upon known genetic factors contributing to ALS, demonstrating a substantially higher yield than common variant GWAS of comparable size4,20.

Our findings also provide a broader view of the genetic architecture of ALS. The variants we identified were mostly missense and spanned a spectrum of effect sizes, ranging from low-frequency variants with moderate effect sizes to URVs conferring large effects. At the far end of this spectrum, the latter category includes variants such as SOD1 p.A5V, which was absent from ~200,000 controls and conferred large effects (OR > 200), alongside other high impact variants such as those in ARPP21 (ORs > 40). Moreover, our data support an additive, oligogenic model22,23 wherein several rare variants cumulatively increase ALS risk without strong evidence of interaction. Although statistical power was limited, pairwise analyses of the best-powered variant combinations revealed no significant interactions, consistent with an additive model. Fully characterizing this oligogenic architecture will require studies extending beyond established ALS genes and, ultimately, beyond the exome.

Among the identified rare variants, YKT6 p.Y64C stood out due to its highly significant and consistent associations in both the discovery and replication cohorts. It was associated with a moderate increase in risk, with an OR comparable to that of established ALS variants SOD1 p.D91A and NEK1 p.R261H (Fig. 1c). YKT6 encodes a highly conserved SNARE protein that plays a key role in vesicular transport pathways, also implicated in GWAS20, and is critical for autophagosome-lysosome fusion24,25. The p.Y64C variant has been linked to a neurodevelopmental disorder in the homozygous state26, whereas we found it to be associated with ALS in the heterozygous state. The variant was shown to cause partial loss-of-function and impaired autophagy in Drosophila26, in line with it being consistently predicted as damaging by all in silico predictors we tested. YKT6 has also been implicated in the secretion of the MSP domain of the ALS-linked protein VAPB, suggesting that it may also play a role in extracellular signaling27,28. The identification of YKT6 thus highlights the central role of disrupted vesicle fusion and trafficking in ALS and specifically implicates downstream consequences including impaired autophagy and exocytosis. Beyond YKT6, three high-effect (OR > 20) missense variants in KNTC1, HTR3C and GBGT1 also represent strong candidates displaying consistent directions of effect across discovery and replication analyses, with meta-analysis achieving greater statistical significance than the discovery analysis alone. These genes implicate GTPase signaling, serotonergic function and glycosphingolipid metabolism respectively, all processes previously linked to ALS pathology29,30,31,32.

In addition to these new findings, a key contribution of our study is providing robust, independent evidence for several genes with limited previous evidence. In ARPP21, we identified two high-effect variants (p.P563L, p.P747L), with ORs comparable to those of highly penetrant variants such as FUS p.R521C and TARDBP p.N352S (Fig. 1c). Of these, p.P563L has been reported previously in two family studies18,19. Our study now firmly establishes not only the association of this variant with ALS, but also its previously reported effects on age of onset and survival. Moreover, this variant had been reported only in UK and Spanish families, whereas our study establishes its relevance in a broader population, identifying carriers across Dutch, US, Italian and Israeli cohorts. The second variant, p.P747L, has not previously been reported in the scientific literature. ARPP21, like TDP-43 and FUS, is an RNA-binding protein that localizes to stress granules under stress33. Our observed enrichment of URVs in splicing-related genes adds support to the relevance of this commonality, and suggests further insights remain to be discovered concerning the full depth of RNA processing dysfunction in ALS pathogenesis. For DNAJC7, which encodes a heat-shock protein implicated previously in an exome-wide burden analysis21, our study offers independent validation. Although the previous signal was driven by protein-truncating variants, our URV analysis identified a robust association that also included INDELs and missense variants. Crucially, this association was replicated across cohorts and remained after removal of cohorts overlapping with the previous study. Finally, in CFAP410 (also known as C21orf2), the low-frequency (MAF = 0.013) missense variant p.V58L was identified previously in two common variant GWASs4,20 and has been linked to primary cilia dysfunction34. We show that this is a robust and independent finding, as the association remained highly significant even after we excluded all participants who were duplicated or genetically related to the original GWAS cohorts.

Finally, despite inconclusive evidence from the replication analysis, UNC13C, KIF4A and CAPN2 remain candidates of interest for further study. UNC13C and KIF4A are paralogs of ALS genes UNC13A and KIF5A, respectively35,36,37,38, and are similarly involved in synaptic vesicle release and axonal transport (Supplementary Table 2 and Supplementary Data 11). CAPN2 is of interest due to previous literature supporting its role in ALS pathology and is being evaluated as a therapeutic target for antisense oligonucleotide (ASO) therapies39,40,41. Further investigation of these genes in independent datasets is warranted to fully elucidate their potential role in ALS.

Our findings have clear translational potential. ASO-based therapies such as Tofersen (SOD1) and Jacifusen (FUS)42 demonstrate the feasibility of gene-targeted treatment, and ongoing individualized approaches (for example, Silence ALS, n-Lorem) extend this to URVs. Our study increases the proportion of cases with an identifiable genetic risk factor from 11.6% (‘Definitive’ genes) to 15.6% with our validated and new single-variants, a figure that rises to 22.9% when C9orf72 repeat expansions are included. Although not all identified genes will be viable ASO targets, as evidenced by setbacks in trials targeting ATXN2 and C9orf72 (ref. 43), genes harboring variants with high ORs such as ARPP21 represent prime candidates to prioritize for future ASO-based studies.

Our study has limitations. First, we did not perform functional validation for the identified variants; therefore, the precise molecular mechanisms (for example, loss-of-function, gain-of-function or a combination thereof) remain to be determined. Second, by design, the exome analyses in this study precluded the investigation of noncoding variation. However, it is important to note that about three-quarters of the cases included in the discovery cohort and all replication cohorts underwent whole-genome sequencing from which exomes were derived in silico. In addition, the increasing availability of large-scale WGS data, particularly in the UK Biobank and All of Us, provides ample controls, meaning that much of what is needed to build large-scale WGS datasets for future analyses is already available44,45. Third, our analysis is restricted to germline variants and therefore does not address the potential role of somatic mutations, which accumulate in the central nervous system with aging and could therefore explain the late onset of the disease46. Fourth, we acknowledge the importance of expanding beyond the predominantly European ancestry of participants included in this study. Finally, although our study captures most of the known ALS genetic architecture, a few genes remain undetected. These are either genes associated primarily with repeat expansions (ATXN2 and C9orf72) or genes harboring exceedingly rare variants (VAPB and PFN1).

To conclude, the assembly of the largest exome sequencing dataset for ALS to date, coupled with robust harmonization and replication, enabled the discovery of rare variant contributions to ALS. We show that rare variant analyses yield particularly high returns in ALS compared to common variant GWAS. The identification of several new genes, alongside the confirmation of genes with previously limited evidence, collectively provides a compelling set of potential new targets for translational ALS research.

Methods

Cohorts

This study was approved by the institutional review boards of all participating centers, as well as by the Medical Ethical Testing Committee NedMec and the Biobanks Testing Committee of UMC Utrecht. Written informed consent for research was obtained from each participant. Cases were included in this study irrespective of their carrier status for variants in known ALS genes.

Discovery cohort

The discovery cohort included 15,862 participants with ALS and 78,683 controls, totaling 94,545 individuals, of which 21,102 were subjected to WGS and 73,443 to WXS. Case cohorts included the Project MinE ALS sequencing consortium (7,614 cases; 2,605 controls)47, the NYGC ALS Consortium (2,650 cases; 342 controls), the ALS Sequencing Consortium (2,851 cases)5, two cohorts from the FALS consortium (1,277 cases; phs001585), the National Institutes of Health (NIH) Exome Sequencing of FALS Project (194 cases; phs000101), two Australian cohorts described in ref. 48 (125 cases, 18 controls) and ref. 49 (568 cases), and a Chinese motor neurone disease (MND) cohort50 (583 cases, 182 controls). All cases were diagnosed with definite, probable or probable laboratory-supported ALS according to the revised El Escorial Criteria51. Control cohorts included 7,323 samples from the National Heart, Lung, and Blood Institute (NHLBI) TOPMed research program52, 49,981 samples from the UK Biobank44 and 18,232 samples across seven cohorts from dbGAP53.

Replication cohort

The replication cohort included 5,404 people with ALS and 133,823 controls, totaling 139,227 participants, all of whom were subjected to WGS. Cohorts included the Project MinE ALS sequencing consortium (1,510 cases; 169 controls), the NYGC ALS consortium (1,257 cases; 69 controls), ALS compute (1,870 cases; 1,820 controls; phs003184) and the UK Biobank (767 cases; 131,765 controls). During sample quality control, people who were duplicates or related up to the second degree to any participant in the discovery cohort were excluded.

Processing of sequencing data

Processing and annotation of sequencing data was performed as described previously54. All raw sequencing data were aligned to the GRCh38 reference genome using BWA-mem55 according to the functional equivalence pipeline described by Regier et al.11 (implementation can be found at https://github.com/maarten-k/realignment). Joint genotyping was performed using a uniform pipeline according to the GATK best practices (v.4.2.6.1)10. Genotype calls with a quality score < 20 were set to missing, variant calls supported by uninformative reads were excluded and multiallelic variants were split into biallelic variants. Male genotypes in nonpseudoautosomal regions on chromosome X were coded as 0 or 1 (according to 0 or 1 allele copies).
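The genotype-level filters described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory representation (a genotype as a pair of allele indices, with `None` for missing) and is not the GATK-based implementation used in the study; the function names are illustrative, and treating any nonzero male count on nonpseudoautosomal chromosome X as one copy is an assumption.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: allele indices (0 = reference), None = missing.
Genotype = Optional[Tuple[int, int]]


def mask_low_quality(gt: Genotype, gq: int, min_gq: int = 20) -> Genotype:
    """Set a genotype call to missing when its quality score is below min_gq."""
    return gt if (gt is not None and gq >= min_gq) else None


def split_multiallelic(gt: Genotype, n_alt: int) -> List[Optional[int]]:
    """Return one minor-allele count per alternate allele (biallelic view)."""
    if gt is None:
        return [None] * n_alt
    return [sum(1 for allele in gt if allele == k) for k in range(1, n_alt + 1)]


def recode_male_nonpar_x(count: Optional[int]) -> Optional[int]:
    """Code a male genotype outside the pseudoautosomal regions as 0 or 1.

    Assumption: any nonzero count is treated as one allele copy.
    """
    return None if count is None else min(count, 1)
```

For example, a `1/2` call at a site with two alternate alleles yields one copy of each allele after splitting, and a male call recorded as homozygous alternate on nonpseudoautosomal chromosome X is recoded to a single copy.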

Variant annotation

Variants were annotated using snpEff56, dbscSNV57 and Ensembl Release v.105 gene models58. Variants were classified as high-impact when predicted by snpEff to have a high impact (including nonsense mutations, splice acceptor/donors and frameshift mutations) or predicted as potentially splice-altering by dbscSNV (‘ada’ or ‘rf’ score > 0.7). Variants were classified as having moderate impact when predicted as such by snpEff (including missense mutations, in-frame deletions and UTR truncations). For each gene, the impact of a variant was determined by its most severe consequence across protein-coding transcripts.
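The classification rule above can be sketched as follows. The sketch assumes that each variant comes with the snpEff putative impact label (`HIGH`, `MODERATE`, `LOW` or `MODIFIER`) for every protein-coding transcript and with optional dbscSNV `ada` and `rf` scores; the function name and return labels are illustrative, not part of the study's code.

```python
from typing import Iterable, Optional

# Severity ordering of snpEff putative impact labels.
_RANK = {"MODIFIER": 0, "LOW": 1, "MODERATE": 2, "HIGH": 3}


def classify_variant(
    transcript_impacts: Iterable[str],
    ada_score: Optional[float] = None,
    rf_score: Optional[float] = None,
    splice_threshold: float = 0.7,
) -> str:
    """Return 'high', 'moderate' or 'other' for one variant in one gene."""
    # Most severe consequence across protein-coding transcripts of the gene.
    most_severe = max(transcript_impacts, key=_RANK.__getitem__, default="MODIFIER")
    # Potentially splice-altering if either dbscSNV score exceeds the threshold.
    splice_altering = any(
        score is not None and score > splice_threshold
        for score in (ada_score, rf_score)
    )
    if most_severe == "HIGH" or splice_altering:
        return "high"
    if most_severe == "MODERATE":
        return "moderate"
    return "other"
```

Under this sketch, a variant that is synonymous in every transcript but has an `rf` score of 0.9 would still be classified as high impact, mirroring the dbscSNV rule in the text.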

Sample quality control

Ancestry was estimated by projecting all samples on a reference ancestry space comprising samples from the 1000 Genomes project using the LASER software (v.2.04)59. We retained participants of predominantly European ancestry. We then excluded samples with low genotype call-rate (<0.9), discordant sex or deviating heterozygosity (inbreeding F < -0.1 or F > 0.1). These metrics were calculated in a set of autosomal variants with call-rate > 0.9 in each supercohort (discovery: WGS, WXSUKB, WXSother; replication: Project MinE, ALS compute, NYGC, UK Biobank) and MAF > 0.01. For sex inference, heterozygosity, relatedness and PCA, variants were additionally excluded if they deviated from Hardy–Weinberg equilibrium (HWE) (P < 0.0001; for nonpseudoautosomal regions on chromosome X, this was calculated among female participants only) and were pruned for linkage disequilibrium (LD) (r2 < 0.5, window size = 50, step = 5; furthermore, high LD regions were excluded before PCA60). We then excluded samples based on a high exome-wide number of SNVs, INDELs, singletons, high INDEL/SNV ratio or deviating Ti/Tv ratio (thresholds listed in Supplementary Fig. 2). Sample duplicates and relatives up to and including the second degree were identified using KING software61. An unrelated sample set was generated by first excluding samples with five or more relations, followed by iteratively excluding participants with the highest number of relations, resolving ties by prioritizing (in order) ALS over controls and WGS over WXS samples. Furthermore, in the replication cohort, samples that were duplicated or related up to the second degree to any sample in the discovery cohort were excluded. PCA was performed on the unrelated sample set using fastPCA as implemented in plink2 (ref. 62). In the discovery cohort, a distinct cluster was identified on the fourth and fifth PC consisting of an Amish population, which was excluded as the cluster contained only controls (Supplementary Fig. 2f).
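The construction of the unrelated sample set can be sketched as a greedy pruning of a relatedness graph. The sketch below assumes that related pairs (for example, as reported by KING) are available as a list of sample ID pairs; the function name, the input dictionaries and the final tie-break on sample ID are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def prune_related(
    pairs: Iterable[Tuple[str, str]],
    is_case: Dict[str, bool],
    is_wgs: Dict[str, bool],
    max_relations: int = 5,
) -> Set[str]:
    """Return the set of samples to exclude so that no related pair remains."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    neighbors = dict(graph)
    removed: Set[str] = set()

    def drop(sample: str) -> None:
        for other in neighbors.pop(sample):
            neighbors[other].discard(sample)
        removed.add(sample)

    # Step 1: exclude samples with five or more relations.
    for sample in [s for s, rel in neighbors.items() if len(rel) >= max_relations]:
        drop(sample)

    # Step 2: repeatedly exclude the sample with the most remaining relations.
    # On ties, controls are excluded before cases and WXS before WGS samples;
    # the sample ID is an arbitrary final tie-break (assumption).
    while True:
        related = [s for s, rel in neighbors.items() if rel]
        if not related:
            break
        worst = max(
            related,
            key=lambda s: (len(neighbors[s]), not is_case[s], not is_wgs[s], s),
        )
        drop(worst)
    return removed
```

In a related case–control pair, this sketch excludes the control even when the control was sequenced by WGS and the case by WXS, because case status takes priority over sequencing technology.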

Variant quality control

First, GATK variant quality score recalibration was applied to all variants using the training data and annotations as recommended by the GATK best practices10. Variants were excluded if they did not pass variant quality score recalibration, their genotyping rate was <0.9 in any of the supercohorts (discovery: WGS, WXSUKB, WXSother; replication: Project MinE, ALS compute, NYGC, UK Biobank) or if they did not pass the HWE test in controls (P < 0.0001). We then also excluded variants with subpar quality scores and variants located in regions showing signs of batch effects. Potential batch effects were identified by testing whether variant allele counts were associated with cohort membership within control subjects. Firth’s logistic regression with profile penalized likelihood CIs was used to perform these control–control analyses, adjusting for sex and four PCs12. This procedure was repeated for each cohort (that is, 1 = subject in respective cohort, 0 otherwise). In total, 16 cohorts were tested (including all WGS controls versus all WXS controls; cohorts with <100 controls were merged into one cohort) in the discovery cohort and four cohorts were tested in the replication cohort. The minimum P value across these analyses was used as a metric to identify variants associating with probable batch effects. The stringency of various standard variant quality control filters was then increased to eliminate variants exhibiting batch associated calling bias while maintaining maximal sensitivity for unbiased variant calls (Supplementary Fig. 4). Identical thresholds were used for SNVs and INDELs and we also excluded long insertions and deletions (>50 base pairs) and variants coding the reference allele in spanning deletions.

Single-variant analyses

Single-variant analyses were performed for all high and moderate impact variants with 5 × 10⁻⁵ < MAF < 0.05 (272,925 variants). For each variant, we tested for an association between ALS status and MAC using Firth’s logistic regression with profile penalized likelihood confidence intervals, which properly controls for type I error when testing rare variants in an unbalanced case–control setting12,63,64,65. We adjusted for sex, ten PCs and total number of rare synonymous variants in each participant. All tests were two-sided, and the Bonferroni correction was used to correct for multiple testing.
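To make the penalized estimation concrete, the following is a deliberately reduced sketch of Firth's logistic regression with an intercept and a single predictor. It omits the covariates (sex, PCs, synonymous variant count) and the profile penalized likelihood confidence intervals used in the study, and the function name is illustrative. The update is the standard Fisher scoring step applied to Firth's modified score, in which each residual is shifted by h(0.5 − p), with h the diagonal of the hat matrix.

```python
import math
from typing import Sequence, Tuple


def firth_logistic_1d(
    x: Sequence[float], y: Sequence[int], max_iter: int = 100, tol: float = 1e-10
) -> Tuple[float, float]:
    """Firth-penalized estimates (intercept, slope) for logit P(y=1) = b0 + b1*x."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information I = X^T W X for the 2-parameter model.
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det  # inverse of I
        # Hat-matrix diagonal: h_i = w_i * (1, x_i) I^{-1} (1, x_i)^T.
        h = [wi * (v00 + 2.0 * v01 * xi + v11 * xi * xi) for wi, xi in zip(w, x)]
        # Firth's modified score residuals.
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1
```

For a single binary carrier indicator, the penalized estimate coincides with the log odds ratio obtained after adding 0.5 to each cell of the 2 × 2 table, so the estimate remains finite even when no controls carry the variant. This is the property that makes the approach attractive for rare variants.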

Candidate single-variant associations were screened for additional technical biases and excluded if (1) variant concordance was <0.9 among 678 between-cohort duplicates included in the unfiltered dataset; (2) they showed batch effects among case cohorts (Pcase–case < Pcase–control) based on the same procedure as used in the control–control analyses, where we tested for an association between cohort membership and MAC of the respective variant (that is, 1 = subject in respective cohort, 0 otherwise); (3) the minor allele was supported by one read in >25% of carriers; or (4) there was significant heterogeneity (Phet < 0.001) between this study and a recent ALS common variant GWAS by van Rheenen et al.20, for variants that overlap between both studies.

A targeted analysis was conducted on variants within 51 ALS-linked genes curated by the ALS GCEP (accessed December 2024)14. Certain exons of known ALS genes had lower call rate within subcohorts of the dataset; to provide a more complete investigation of known ALS genes, we therefore did not apply the per-supercohort call-rate filter for this analysis.

URV burden analyses

URV burden analyses were performed using four filtering strategies based on two criteria: (1) variant frequency—either all URVs or singleton-only variants; (2) variant impact—either only high-impact variants (nonsense, splice acceptor/donor and frameshift mutations) or both high- and moderate-impact variants (missense mutations, in-frame deletions and UTR truncations). Burden analyses were performed by testing for an association between ALS status and the aggregate effect of minor alleles observed per sample per functional unit using Firth’s logistic regression with profile penalized likelihood confidence intervals12. Sex, ten PCs and the total number of qualifying synonymous variants in each participant were included as covariates. Tests were retained if there were at least ten carriers across the functional unit tested. Test-statistics across the four filtering strategies were combined using the Cauchy method (ACAT), which is designed to combine results from several statistical tests15. Candidate associations were screened for potential technical biases by assessing biases among case cohorts using the same procedure as used in the control–control analyses. Genes where Pcase–case < Pcase–control were flagged as potentially driven by technical variation.
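The Cauchy combination step can be written in a few lines. The sketch below uses equal weights by default, which is an assumption since the text does not state the weights used, and it applies the tangent transform directly without the small-P approximation that dedicated ACAT implementations typically use for extremely small P values.

```python
import math
from typing import Optional, Sequence


def acat(pvalues: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    """Combine P values with the Cauchy combination (ACAT) method."""
    if weights is None:
        weights = [1.0] * len(pvalues)  # equal weights (assumption)
    total = sum(weights)
    # Each P value is mapped to a standard Cauchy variate; the weighted mean
    # of these variates is again approximately standard Cauchy under the null.
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues)
    ) / total
    return 0.5 - math.atan(statistic) / math.pi
```

Combining a single P value returns that P value, and the combined result is dominated by the smallest inputs, which suits a setting in which only one of the four filtering strategies may capture the signal.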

Genes

Genes were defined using Ensembl gene models (release v.105), including only protein-coding genes without annotation errors.

Domains

Protein coordinates for Interpro domains, coiled coils, transmembrane helices, low complexity regions and cleavage sites were retrieved from Ensembl v.105 (http://dec2021.archive.ensembl.org/biomart/martview/)58. For each transcript, variants were annotated to domains by remapping both the domain coordinates and variant positions to coding sequence (CDS) relative coordinates using the mapToCDS method in RVAT13. Variants up to 12 base pairs from the CDS border (introns and UTRs) were mapped to the respective border. Domains that spanned more than 90% of the width of the transcript were excluded.

Genesets

To identify genesets or pathways associated with ALS, we performed geneset burden analysis on 13,347 GO, KEGG and Reactome genesets from the Molecular Signatures Database (MSigDB v.7.5)17. Genesets including fewer than 5 or more than 1,000 genes were excluded, resulting in a total of 11,777 tested genesets.

Variant co-occurrence analyses

We tested for a cumulative effect of carrying several risk variants among moderate- and high-impact variants in genes classified as ‘Definitive’ according to GCEP14. Participants were grouped into categories based on the number of variants carried: 0 (reference group), 1, 2, …, n variants. Both heterozygous and homozygous variants were treated as single events. We assessed the association between each variant count category and ALS status using Firth’s logistic regression with profile penalized likelihood CIs, with the 0-variant group as the reference category and adjusting for the same covariates used in the geneset burden analyses.

To identify nonrandom co-occurrence of variant pairs, we performed a permutation-based test within the case cohort. The set of variants tested included those defined above as well as C9orf72 (C9) repeat expansion status (available for n = 8,610 cases; 66%). For each pair, we generated an empirical null distribution by performing 100,000 permutations, shuffling the carrier status of one variant relative to the other. An empirical P value was then calculated by comparing the observed co-occurrence count to this null distribution. For each variant pair, only participants with nonmissing genotypes for both variants were included.
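A permutation test of this kind can be sketched as follows. The sketch assumes carrier status is coded as 0/1 with `None` for missing genotypes, reports separate one-sided empirical P values for excess and depletion, and adds one to the numerator and denominator so that the empirical P value is never zero; these choices and the function name are assumptions, as the text does not specify them.

```python
import random
from typing import Optional, Sequence, Tuple


def cooccurrence_permutation_test(
    carrier_a: Sequence[Optional[int]],
    carrier_b: Sequence[Optional[int]],
    n_perm: int = 100_000,
    seed: int = 0,
) -> Tuple[int, float, float]:
    """Return (observed co-carriers, P for excess, P for depletion)."""
    # Keep only participants with nonmissing genotypes for both variants.
    complete = [
        (a, b) for a, b in zip(carrier_a, carrier_b)
        if a is not None and b is not None
    ]
    a_status = [a for a, _ in complete]
    b_status = [b for _, b in complete]
    observed = sum(1 for a, b in zip(a_status, b_status) if a and b)

    rng = random.Random(seed)
    n_greater_equal = n_less_equal = 0
    for _ in range(n_perm):
        # Shuffle the carrier status of one variant relative to the other.
        rng.shuffle(b_status)
        count = sum(1 for a, b in zip(a_status, b_status) if a and b)
        n_greater_equal += count >= observed
        n_less_equal += count <= observed

    p_excess = (n_greater_equal + 1) / (n_perm + 1)
    p_depletion = (n_less_equal + 1) / (n_perm + 1)
    return observed, p_excess, p_depletion
```

With 100,000 permutations, the smallest attainable empirical P value under this sketch is approximately 1 × 10⁻⁵.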

Power analyses for the co-occurrence analyses were performed through 10,000 simulations. For each variant pair, joint genotype counts were drawn from a multinomial distribution, with probabilities based on their allele frequencies and a given co-occurrence OR. We then tested for a depletion or excess of co-occurrence using Fisher’s exact test comparing observed to expected counts under independence.

To test for statistical interactions among variant pairs, we used Firth’s logistic regression in the full case–control cohort. For each pair of variants, we fitted a model including their main effects and their interaction term, adjusting for the same covariates as used in the single-variant analyses.

$$\begin{array}{l}\mathrm{MND}\sim \beta_{0}+\beta_{1}\times \mathrm{var}1+\beta_{2}\times \mathrm{var}2+\beta_{3}\times (\mathrm{var}1\times \mathrm{var}2)+\beta_{4}\times \mathrm{sex}\\ \qquad +\beta_{5}\times \mathrm{total\ synonymous\ count}+\beta_{6}\times \mathrm{PC}_{1}+\cdots +\beta_{15}\times \mathrm{PC}_{10}\end{array}$$

Because C9 status was available for only a few controls, it was not included in these case–control interaction models.

Survival and age of onset analyses

Age at onset analyses (n = 10,557) were performed using linear regression, testing for an association between age at onset and either MAC (single-variant analyses) or the aggregate effect of minor alleles observed per sample per gene (URV burden analyses). Survival analyses (n = 7,194) were performed using a Cox proportional hazards model, testing for an association between right-censored survival time and either MAC (single-variant analyses) or the aggregate effect of minor alleles observed per sample per gene (URV burden analyses). Both age at onset and survival analyses were adjusted for sex, cohort, ten PCs and the total number of rare synonymous variants in each participant.

Replication analyses

Power analyses were performed through 10,000 simulations in which alleles were drawn from the binomial distribution with the probability set to the MAF of the respective variant. Simulated genotypes were tested for an association with the binary phenotype status using Firth’s logistic regression12. Power was calculated as the fraction of simulations with P values below the specified significance level. Effect sizes estimated in the discovery phase were corrected for winner’s curse bias using the parametric bootstrap approach implemented in the winnerscurse R package66. Power analyses for URV burden tests were performed in a similar manner, substituting MAF for the frequency of carrying at least one minor allele across the gene. These power estimates reflect an idealized scenario without covariate adjustment; in practice, necessary covariate inclusion may reduce power.

The processing of sequencing data and sample quality control were performed identically to the discovery stage; participants who were duplicates or related up to the second degree to any participant in the discovery cohort were excluded. Single-variant analyses were performed identically to those in the discovery analyses and included variants that were significant in the discovery analysis (all variants achieved per-supercohort call-rate >0.9, HWE P value > 0.0001). URV burden analyses were performed identically to those in the discovery analyses and included variants that passed strict quality control filters as applied in the discovery analysis. Meta-analyses were performed using Stouffer’s Z score method weighted by effective sample size, as implemented in METAL software67.
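The sample-size-weighted Stouffer combination can be sketched as below. The sketch assumes each study contributes a two-sided P value, the sign of the effect estimate, and its case and control counts, and it uses the effective sample size 4/(1/N_cases + 1/N_controls) with weights equal to its square root, following the scheme commonly recommended for unbalanced case–control studies in METAL; the function names and input layout are illustrative.

```python
import math
from statistics import NormalDist
from typing import Iterable, Tuple


def effective_n(n_cases: int, n_controls: int) -> float:
    """Effective sample size of an unbalanced case-control study."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def stouffer_meta(
    studies: Iterable[Tuple[float, int, int, int]]
) -> Tuple[float, float]:
    """Combine studies given as (two-sided P, effect sign, n_cases, n_controls).

    Returns the combined Z score and its two-sided P value.
    """
    norm = NormalDist()
    numerator = 0.0
    sum_sq_weights = 0.0
    for p, sign, n_cases, n_controls in studies:
        # Convert the two-sided P value to a signed Z score.
        z = -norm.inv_cdf(p / 2.0) * (1 if sign >= 0 else -1)
        weight = math.sqrt(effective_n(n_cases, n_controls))
        numerator += weight * z
        sum_sq_weights += weight * weight
    z_meta = numerator / math.sqrt(sum_sq_weights)
    p_meta = 2.0 * norm.cdf(-abs(z_meta))
    return z_meta, p_meta
```

Because the weights depend on the effective rather than the total sample size, a cohort with many controls but few cases, such as the UK Biobank replication set, contributes less than its total size alone would suggest.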

Gene annotation

Candidate genes were annotated with GO terms. The resulting list was summarized using the rrvgo R package68, where a matrix of pairwise semantic similarity scores was first calculated. The terms were subsequently clustered using the default similarity threshold of 0.7 and a representative term for each cluster was selected based on its uniqueness score.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.