Hotspots of human mutation point to clonal expansions in spermatogonia

Seplyarskiy, Vladimir; Moldovan, Mikhail A.; Koch, Evan; Kar, Prathitha; Neville, Matthew D. C.; Rahbari, Raheleh; Sunyaev, Shamil

doi:10.1038/s41586-025-09579-7

Download PDF

Article
Open access
Published: 08 October 2025

Hotspots of human mutation point to clonal expansions in spermatogonia

Nature volume 647, pages 429–435 (2025) Cite this article

12k Accesses
10 Citations
237 Altmetric
Metrics details

Subjects

This article has been updated

Abstract

In renewing tissues, mutations conferring selective advantage may result in clonal expansions^1,2,3,4. In contrast to somatic tissues, mutations driving clonal expansions in spermatogonia (CES) are also transmitted to the next generation. This results in an effective increase of de novo mutation rate for CES drivers^5,6,7,8. CES was originally discovered through extreme recurrence of de novo mutations causing Apert syndrome⁵. Here, we develop a systematic approach to discover CES drivers as hotspots of human de novo mutation. Our analysis of 54,715 trios ascertained for rare conditions^{9,10,11,12,13}, 6,065 control trios^{12,14,15,16,17,18,19} and population variation from 807,162 mostly healthy individuals²⁰ identifies genes manifesting rates of de novo mutations inconsistent with plausible models of disease ascertainment. We propose 23 genes hypermutable at loss-of-function (LoF) sites as candidate CES drivers. An extra 17 genes feature hypermutable missense mutations at individual positions, suggesting CES acting through gain of function. CES increases the average mutation rate roughly 17-fold for LoF genes in both control trios and sperm and roughly 500-fold for pooled gain-of-function sites in sperm²¹. Positive selection in the male germline elevates the prevalence of genetic disorders and increases polymorphism levels, masking the effect of negative selection in human populations. Despite the excess of mutations in disease cohorts for 19 LoF CES driver candidates, only 9 show clear evidence of disease causality²², suggesting that CES may lead to false-positive disease associations.

Sperm sequencing reveals extensive positive selection in the male germline

Article Open access 08 October 2025

Direct measurement of the male germline mutation rate in individuals using sequential sperm samples

Article Open access 15 March 2025

Genetic and chemotherapeutic influences on germline hypermutation

Article Open access 11 May 2022

Main

The genome of the average newborn harbours roughly 70 de novo point mutations, of which roughly 80% arise in paternal germline^14,23. De novo mutation is a common cause of sporadic monogenic disease^{9,10,11,12,13,15}, which has led to a lot of international effort collecting large cohorts of parent–child trios^{9,10,11,12,13,15}. These trios typically include unaffected parents and their children who, depending on the cohort, may be affected by conditions ranging from neurodevelopmental disorders (NDD) to congenital heart diseases^{9,10,11,12,13,15}. To discover the genes associated with disease, these studies compare the observed counts of de novo mutations in a gene with the ones expected from a baseline mutation rate model^9,12. Owing to the initial ascertainment on disease phenotype, this signal provides evidence for the gene’s causal role.

Studies of de novo variation, most prominently studies of sporadic NDD⁹ and autism spectrum disorders (ASD)^12,15, have identified hundreds of disease genes. As expected, most were found to be under strong negative selection consistent with a major impact of de novo mutations on severe early-onset conditions⁹. However, we observe that a small subset of the identified genes defy a simple dependence of disease effect and selection and show substantial levels of population polymorphism. Another observed discrepancy is that counts of specific high-impact de novo mutations in disease trio cohorts exceed the maximal levels explainable by disease ascertainment. A potential explanation is that forces beyond random mutagenesis increase the observed incidence of mutations.

Many genes within this subset are involved in carcinogenesis or clonal expansions in somatic tissues suggesting that a plausible explanation may be offered by the positive selection leading to clonal expansions in spermatogonia (CES). Clonal expansions have been recently found to be a feature of all renewing tissues^1,2,3,4, including spermatogonia. Indeed, gain-of-function (GoF) mutations in 13 genes^7,8 have already been demonstrated to drive CES. In contrast to somatic tissues, CES increases the likelihood of transmitting driver variants to the next generation, elevating the observed mutation rate in the offspring (Fig. 1a). For this reason, the first CES drivers were discovered through their recurrence in patients with monogenic disease^5,6. By increasing mutation rate, CES can lead to an increased prevalence of disease. Conversely, genes that drive CES without directly causing disease could appear as disease-causing due to the elevated number of de novo mutations observed in affected children.

**Fig. 1: CES and disease ascertainment.**

To systematically investigate the possible effect of CES on human germline mutation and identify potential CES driver candidates, we developed a statistical approach leveraging human de novo mutation data^{9,10,11,12,13,14,15,23} along with population genetic variation resources²⁰. Most of the identified genes follow biological expectation for drivers of clonal expansions such as involvement in MAPK⁷ and other major signalling pathways. We validated our findings using the already discovered CES genes^7,8, cohorts of unaffected trios^{12,14,15,16,17,18,19} and comparison with the results of the direct sperm sequencing²¹.

Rates of DNA damage, together with imperfections of DNA repair and replication, determine the baseline de novo mutation rate in the male germline²⁴. Positive selection in spermatogonia elevates observed mutation rates of the drivers relative to the baseline (Fig. 1a). The approaches to identify CES genes taken here will generally compare observed mutation counts with those expected under a model of baseline mutation rate.

Historically, mutation rates at functional sites have been estimated from the prevalence of phenotypes caused by mutations at those sites. We generalized this logic to identify CES genes in a large trio sequencing cohort ascertained by any phenotype. Specifically, we counted de novo mutations in the largest assembled NDD trio cohort⁹ (31,058 affected probands). The effect of ascertainment has a strict upper bound given by the inverse prevalence of ascertained phenotype. Denoting V the presence of a de novo variant and D the presence of the disease, we have:

$$\frac{{\rm{No}}.{\rm{of}}\,{\rm{obs}}}{{\rm{No}}.{\rm{of}}\,\exp }\approx \frac{P(V| D)}{P(V)}=\frac{P(D| V)}{P(D)}\le \frac{1}{P(D)},$$

(1)

where P(V) is the probability of de novo variants defined by a mutation rate model (Supplementary Text 1) and P(D) is the overall disease prevalence (Fig. 1b). Unlike disease-causing genes with no CES effect, CES drivers could violate this upper bound, offering an approach for finding CES drivers. We develop a statistical approach based on equation (1) and demonstrate that it is robust to non-monogenic inheritance (Supplementary Text 4).

From the biological standpoint, because many known CES driver mutations cause NDD^7,8, we may expect the uncharacterized CES drivers to be similarly prone to causing NDD. The count of CES-driving mutations in the NDD cohort should be greater than in an unascertained set of trios, maximizing the chance of identifying CES drivers involved in NDD. Genes that drive CES but do not cause NDD can be found by this approach only if the effect of CES significantly exceeds the maximal effect of disease ascertainment given by the inverse prevalence (Supplementary Text 4).

This approach is sensitive to misspecifications of the mutation rate model. For the mutation rate model Roulette²⁵, we demonstrated the accuracy of predictions for the synonymous variants in the NDD cohort (Fig. 1c, Methods section ‘Control for biases in Roulette’ and Supplementary Tables 1–4).

Precision of the prevalence estimates, P(D), might also affect the applicability of equation (1). To address this issue, we adopted a conservative lower bound on prevalence²⁶ of NDD of 1% (Methods section ‘Phenotypic homogeneity in the NDD cohort’). We also show that the phenotypic sampling is uniform across the NDD subcohorts ruling out false-positive CES findings due to heterogeneous ascertainment by phenotype (Methods section ‘Prevalence of NDD’).

GoF mutations causing CES

All CES genes discovered to date act through GoF mechanisms, and all known CES mutations are individual missense variants^7,8. We therefore first apply equation (1) to all possible (5 × 10⁷) individual missense de novo mutations in the NDD cohort and identify 21 variants in 18 unique genes that pass the 20% false discovery rate (FDR) threshold (Extended Data Fig. 1a and Supplementary Table 11). Among the 18 identified genes, 5 have been previously experimentally established to cause CES^7,8. From the functional perspective, the identified genes are enriched in RAF activation (adjusted P = 1.7 × 10⁻⁷) and MAPK1/MAPK3 (Fig. 1d, adjusted P = 2.2 × 10⁻⁷) pathways²⁷ in agreement with previous studies⁷ (Supplementary Table 12 and Methods section ‘Expression and gene set enrichment analyses’). All but one of these genes are expressed in spermatogonia (P = 6.5 × 10⁻⁷). The exception, GRIN2B, was thus excluded from the GoF set (Extended Data Fig. 2) as a potential false positive.

As explained above, the test based on equation (1) has greater power to identify CES drivers that also cause NDD. Indeed, 16 out of 17 genes have independent evidence of association with disease phenotypes through GoF variants (Fig. 1e).

LoF mutations causing CES

Although existing experimental work has focused solely on GoF mutations driving CES, loss-of-function (LoF) mutations are known to also drive clonal expansions in cancers and healthy renewing tissues¹. This motivated us to investigate LoF mutations as plausible CES drivers (Fig. 2a). Because every gene harbours many possible LoF mutations, we may extend equation (1) to test sets of variants within genes rather than individual variants. We validated that the aggregation of LoF sites by gene does not bias our mutation rate expectation (Fig. 2b, Methods section ‘Control for biases in Roulette’ and Supplementary Table 5).

**Fig. 2: LoF-1 set of putative CES drivers.**

Five genes show numbers of de novo LoF mutations in the NDD cohort significantly exceeding any plausible ascertainment by disease (by equation (1)): PURA, ARID1B, CTNNB1, DYRK1A and FOXG1 (Fig. 2c,d and Extended Data Fig. 1b,c). We call this list the LoF-1 set. Three of them, ARID1B, CTNNB1 and DYRK1A, are involved in carcinogenesis or other clonal expansions^28,29 (Fig. 2c). Although PURA and FOXG1 have not been previously identified as drivers of somatic clonal expansions, their functions fit the profile of clonal expansion drivers. PURA is involved in replication and transcription control and FOXG1 is a transcription factor regulating early development (Fig. 2c).

The independent evidence of association with disease phenotypes supports the effect of LoF mutations on NDD for all five of these genes. In addition, these genes are highly selectively constrained in the human population (upper bound estimate of the observed-to-expected ratio of LoF single-nucleotide variant (SNV) counts (LOEUF) less than 0.3), consistent with their role in severe paediatric conditions.

CES and LoF polymorphism

We now bring our attention to CES drivers with weak or no effect on NDD. The approach outlined above is not suited well to identify such genes, because it requires the counts of de novo mutations in the NDD cohort to exceed the expectation by at least 100-fold (inverse prevalence of NDD). Therefore, we rely on extra considerations. In particular, we note that CES should increase the rate of functionally consequential mutations and thus polymorphism levels in the general population. The rate of de novo mutations in a disease cohort should be elevated for both CES genes and disease genes. However, unlike CES driver with no effect on NDD, disease-causing genes should be deprived of functional polymorphism in the predominantly healthy population. This discrepancy between amounts of genetic variation in the population among genes significant for the excess of de novo variants in the NDD cohort over the mutation rate expectation (not necessarily exceeding the boundary of equation (1)) yields a procedure for identifying CES genes. Specifically, a gene with both large amounts of polymorphism in the general population and the excess of de novo variants in a disease cohort may be deemed a CES driver candidate. Note that this procedure may be applied to cohorts with mixed disease phenotypes, because the expectation is defined just by the mutation rate model and is independent of ascertainment, motivating us to increase power by merging de novo variation in cohorts ascertained by NDD (31,058 affected probands)⁹ and ASD (16,877 affected probands)¹². We quantify LoF polymorphism in the general population using LOEUF²⁰.

In total, 122 genes show a significant excess of LoF de novo mutations in the ASD–NDD cohort at FDR < 0.1 (Methods section ‘Construction of the LoF-2 set’ and Supplementary Tables 7 and 14). From these, we selected 19 genes (Fig. 3a and Extended Data Figs. 3 and 4) with LOEUF > 0.5 (concordant with cut-offs usually chosen for relaxed negative selection^12,30,31,32). High values of LOEUF may be indicative of both relaxed selection and misspecifications of the mutation rate model, as should be the case with CES-driven mutation rate elevation. Genes with misannotated LoF variants^33,34 or involved in clonal expansions in blood³ may also have spuriously high LOEUF values without CES. For these reasons, four genes were flagged as potential false findings (Supplementary Text 2 and 3). The remaining 15 genes we call the LoF-2 set.

**Fig. 3: Effect of CES on de novo mutations in disease and on the levels of LoF polymorphism in population.**

Four extra considerations support this set of genes as CES driver candidates. First, we show that LoF-2 genes have a significant excess of de novo LoF mutations not just in the joint ASD–NDD dataset, but also in a healthy trio cohort^{12,14,15,16,17,18,19} (Fig. 3b and Supplementary Tables 9 and 10). The excess of mutations in the healthy trio cohort could not be affected by disease ascertainment, allowing us to estimate the average CES effect for LoF-2 genes. The magnitude of the effect in LoF-2 genes is comparable between ASD–NDD and control cohorts (Fig. 3b, 18.7 versus 26.9, P = 0.2), indicating that the effect of disease ascertainment is most probably small and most LoF-2 genes do not cause ASD or NDD. Indeed, only four genes from this set have independent evidence of the causal role in ASD–NDD²² (Fig. 3c). Just one of these genes, PTEN, demonstrates an excess of de novo LoF mutations in the NDD cohort comparable to the excesses in the LoF-1 set (Fig. 3a,c). In line with small effects of NDD ascertainment in LoF-2 genes, we observe similar magnitudes of the enrichment of de novo LoF mutations observed in other trio cohorts ascertained by different phenotypes: encephalopathic epilepsy and congenital heart disease. Finally, using the cohort of healthy controls, we also estimated that just 15 LoF-2 genes explain 3.8% of the entire de novo LoF variation genome-wide.

To investigate the robustness of the LOEUF threshold used to select genes in the LoF-2 set, we compared the excesses of LoF mutations in the control cohort for different thresholds. The magnitude of the effect does not change substantially in the range of LOEUF threshold values from 0.3 to 2.0 (Extended Data Fig. 4). The effect in the healthy cohort dissipates for ASD–NDD genes with LOEUF < 0.3, consistent with causal disease effect of these genes.

Second, all LoF-2 genes barring KCNA1 are expressed in spermatogonia (expression enrichment P = 0.01, Mann–Whitney U-test). KCNA1 was flagged as a potential false-positive finding (Extended Data Fig. 2). Four LoF-2 genes are also known to cause cancers²⁸.

Third, with the help of population variation data, we can find further evidence that the high level of LoF polymorphism in LoF-2 genes probably reflects elevated de novo mutation rate rather than relaxed selection. The large sample size of gnomAD-v4 (1.6 million haploid genomes) enables analysis of frequencies of LoF variants generated by recurrent mutations^25,35,36,37. LoF mutations within the same gene are expected to have identical functional effects, including on CES. All individual LoF sites within the same gene therefore share a common linear inflation factor relative to baseline mutation rate expectations. In the presence of CES, allele frequencies remain linearly proportional to the baseline mutation rate attesting to CES being a linear effect, as illustrated by the example of MIB1 in Fig. 3d (Supplementary Table 19).

We expand this logic to statistically detect elevation of mutation rate for LoF-2 set in the human population data. Variants with the same mutation rate and selection coefficient have different allele frequencies due to the effect of genetic drift. Strong selection overpowers the effect of drift and makes allele frequencies approach the expectation under the balance between mutation and selection. CES is expected to shift this balance and increase the average allele frequency, but is not expected to affect deviations from this balance. Slightly more formally, the distribution of sampling allele frequencies in the presence of recurrent mutation and relatively strong selection, is given by the Nei approximation³⁸ (Methods section ‘Segregation of LoF-2 polymorphism in the human population’). Relying on the Nei approximation, we isolate variance inflation due to genetic drift. It is inversely proportional to selection coefficient. The mean allele frequency is similarly inversely proportional to selection coefficient, but is also directly proportional to the effect of CES. Thus, CES effects may be identified from the comparison of the mean and variance terms for a single gene. Even with the current sample sizes, this procedure has low power, as most of the information is contained in frequencies of highly mutable yet rare CpG sites, and so we restricted our analysis to 952 genes with more than 10 CpGs. Figure 3e shows the evidence of CES effects for LoF-2 genes included in the analysis.

Our final analysis relies on GeneBayes³⁹, which evaluates per-gene selective constraint relying on a protein function-informed prior and a likelihood based on LoF variation within the gene. The prior, unlike the likelihood, is not influenced by mutation rate and the effects of CES. We observe LoF-2 genes have the most inconsistent prior and likelihood estimates, concordant with the effects of CES on the likelihood (Extended Data Fig. 5 and Supplementary Table 8).

Comparison with sperm sequencing

Direct sequencing of mature sperm provides important data for the evaluation of CES effects without any ascertainment by phenotype. In parallel with this study, NanoSeq, an accurate single-molecule resolution DNA sequencing technology, was applied to sequencing sperm cells of 63 donors²¹. A panel of 263 genes implicated in cancer was sequenced with a mean coverage of 8 × 10⁴ and whole exomes were sequenced with a mean coverage of 2.1 × 10⁴. This study identified 40 genes under positive selection in spermatogonia based on enrichment of non-synonymous mutations in sperm²¹. Direct sequencing of sperm is effectively equivalent to assaying paternal de novo mutations across all genes in a large cohort of randomly selected trios barring those mutations not compatible with live birth, a case that is discussed below. We used these data to verify some of our predictions.

To evaluate the level of concordance, we tested whether CES driver candidates identified in human genetics datasets achieved significance in the test from ref. ²¹ aggregating all types of non-synonymous mutation²¹. Of all 40 genes identified here, 17 are nominally significant (20 at the P value threshold of 0.1 used in ref. ²¹) and 11 study-wide significant in ref. ²¹. We compiled the joint list of all putative CES genes between studies (Supplementary Table 20).

Next, we compared genes of LoF-1 and LoF-2 sets with 16 genes significant in the NanoSeq dataset exclusively due to LoFs (FDR 20%). As shown in Fig. 4a, out of 16 NanoSeq genes, 7 overlap with our LoF sets.

**Fig. 4: Comparison with direct sequencing data.**

Partial overlap between the lists could be attributed to several factors. First of all, datasets obtained by NanoSeq have variable coverage across different genes, and indeed our CES candidates that are not significant in ref. ²¹ tend to have lower coverage (Extended Data Fig. 6). Second, for GoF-1 and LoF-1 the two approaches have differential statistical power. For the human genetics approach, the power is enhanced by the effect of CES drivers on NDD, whereas the power of the NanoSeq experiment is not boosted by the ascertainment. Finally, the power of both approaches is affected by sampling variance in the low mutation count data.

To demonstrate that the overlap is limited by power, we show that a residual CES effect exists in genes outside the overlap. We calculated the observed-to-expected ratio for LoF mutations in the NanoSeq dataset for genes that did not attain genome-wide significance, but are included in our LoF sets (Fig. 4b). We also measured the observed-to-expected ratio for genes outside the overlap using de novo variants in control trios (Fig. 4b). In both cases, we observe significant enrichment of LoF mutations (P < 0.01, Poisson test).

In the NanoSeq dataset, the combined LoF-1 and LoF-2 gene sets demonstrate a 16.6-fold (Poisson 95% CI 13.5–20.2) excess of LoF mutations. For the GoF set, we observe a notable 524-fold excess of mutation counts in sperm sequencing (Poisson 95% CI 311–828).

Discussion

The unique position of gametogenic tissues as a bridge between somatic and germline evolution allows CES to be viewed from two different perspectives and in terms of two different fields. From the biological perspective, CES can be related to cancer with similar genes being involved in clonal expansions in normal tissues and oncogenic transformation. From the human genetics perspective, CES acts as a factor inflating mutation rate at certain positions. Inflation of mutation rate leads to increase in population frequency of CES drivers. However, because the putative CES drivers reported here are kept at low frequency in the population, we expect that they are subject to strong negative selection even in cases with no obvious disease association.

Biological function of CES drivers

Biological functions of many of our CES driver candidates are consistent with the role in clonal expansions (Supplementary Table 6). We note that the method used here biases the resulting gene sets towards NDD and ASD causal genes, but the functional roles of CES driver candidates are distinct from the bulk of known ASD–NDD genes.

Specifically, out of 40 identified genes, 20 have a role in major signalling pathways (Supplementary Tables 6 and 15–18). The pathways include MAPK, WNT and TGFβ (Supplementary Table 13). As expected, the MAPK pathway has the highest enrichment (eight genes, Fig. 4c). We noticed that the putative CES drivers are concentrated on the pathway structure and the LoF and GoF annotations are mirrored in the activator–repressor roles (Fig. 4c).

Serine-threonine kinase CSNK2A1 and genes activated by it (PACS1 and PACS2)⁴⁰ form an intriguing group of functionally related CES driver candidates within the GoF set. Another related kinase, CSNK2B (ref. ⁴⁰), has been reported in ref. ²¹. The R203T mutation in PACS1 is the most recurrent variant in the NDD cohort⁴¹ and explains 0.1% of all cases, and the E209K mutation in PACS2 explains another 0.04%. The high recurrence of these variants in the NDD cohort may be explained by their role in both NDD^41,42 and CES.

As expected for genes involved in clonal expansions, 12 out of 40 genes identified here are COSMIC census cancer drivers²⁸. Four of them (FGFR3, PTEN, MTOR and BRAF) have been specifically associated with testicular tumours^43,44,45,46.

There is a discrepancy in the average elevations of mutation rate due to CES between GoF and LoF mutations (524-fold versus 17-fold). We propose two possible explanations. First, spermatogonia cells are diploid, and the variants discussed here are probably heterozygous. LoF mutations are often partially recessive with moderate effects in heterozygotes⁴⁷. By contrast, GoF mutations are usually dominant. The second explanation comes from population genetics. We observe very limited numbers of GoF CES drivers in a single gene (Fig. 1e). On the other hand, because all LoF mutations within a gene have identical functional consequences, the number of potential CES driver mutations is large. As CES drivers are frequently involved in severe diseases, the GoF-like 500-fold elevation in mutation rate might generate a substantial rate of de novo pathogenic mutations, for example, if the same inflation was present for LoF mutations in ARID1B, about 0.5% of all individuals would be born with NDD if the CES effect was so high. Such pathogenic burden created by LoF mutations in a single gene will trigger efficient negative selection on the CES effect.

Impact of CES drivers on NDD

Here, we develop a series of tests to identify CES drivers using the counts of de novo mutations in cohorts ascertained by disease, primarily NDD. CES effect and disease ascertainment independently increase the counts of de novo mutations in these cohorts. This gives us an upper hand to find CES genes with involvement in NDD compared with non-ascertained cohorts such as trio controls or sperm sequencing. Indeed, out of 40 recovered genes, 26 have orthogonal functional or genetic evidence of involvement in NDD. As expected based on the properties of the tests discussed above, GoF and LoF-1 gene sets are more enriched in genes causal for NDD. This falls in line with the notion of CES determining the prevalence of many types of NDD. A well-known example is given by Noonan syndrome in which clonal expansions underlie both aetiology and prevalence of the disease. For Noonan syndrome mutations, the effect observed in NanoSeq is 440-fold (Poisson 95% confidence interval 91–1285), which resolves the controversy between the net baseline mutation rate of roughly 10⁻⁷ and the prevalence of the syndrome of roughly 10⁻⁴. Prevalence of CES-related diseases such as Apert syndrome and achondroplasia scales exponentially with paternal age^8,48,49. Therefore, it is expected that prevalence of any CES-related condition would similarly show the exponential dependency on the age of the father.

On the other hand, many of the recovered CES driver candidates, especially the ones from the LoF-2 set, for example, MIB1 or TCF7L2, have no strong supporting evidence for NDD, meaning that they could be false positives in studies of gene-disease association that rely solely on de novo enrichment. However, these studies may find evidence of the causal role in NDD not confounded by CES such as: (1) transmission distortion and other types of familial segregation (Extended Data Fig. 7), (2) case–control analyses, (3) phenotypic similarity of mutation carriers and (4) functional assays in vitro and in vivo.

Other modulators of mutation counts

Aside from CES and disease ascertainment, one modulator of the observed de novo counts is embryonic deleteriousness (partial lethality) (Table 1 and Supplementary Fig. 1). It is expected to decrease the observed mutation rates and therefore renders our approaches for finding CES genes conservative. The comparison of de novo counts between sperm sequencing and trio sequencing has a potential to highlight lethal mutations. Following this logic, we find that LoF mutations in RASA2 are consistent with embryonic deleteriousness.

Table 1 Different data modalities offer possibilities for distinguishing types of modulator of observed de novo mutation rates

Full size table

Although here we interpret the elevation of mutation rate relative to the baseline as the effect of CES, the mutation rate increase may be due to clonal expansions in early development or in oocyte progenitors. Formally, our approach does not pinpoint the source of the clonal expansions. Still, substantial overlap (given statistical limitations) with sperm sequencing data supports the CES hypothesis. No mutations reported in ref. ²¹ have appreciable variant allele frequency in sperm²¹, ruling out early developmental origin. Historically, CES is the only reported clonal mechanism responsible for the elevation of germline mutation rate.

In this study we have shown that the phenomenon of CES might be of unexpectedly great importance for future studies in the fields of genetics of rare disease, population genetics, cancer biology and in the emerging field of clonal evolution in somatic tissues.

Methods

Data

We used the dataset of de novo variants in probands affected by NDD published in ref. ⁹. For ASD, we used the de novo portion of the SFARI dataset published in ref. ¹². For congenital heart diseases, encephalopathic epilepsy and trios with healthy offspring used as controls, we used aggregated datasets of de novo variants in cohorts with probands affected by the respective conditions^{9,10,11,12,13,14,15,16,17,18,19,50,51,52,53,54}, For any non-NDD and non-ASD cohorts, we filtered out probands with duplicate IDs and samples in which de novo variants were not obtained by whole genome or whole exome sequencing. Next, for the cohorts with de novo variants reported in hg19 coordinates, we mapped the coordinates to hg38 using LiftOver⁵⁵.

A recently published NanoSeq dataset of sperm cell sequencing was downloaded from the supplementary materials from ref. ²¹ and candidates due to LoF were taken as genes with the LoF enrichment at the FDR < 0.2 level given in the original table (qtrunc_cv column).

COSMIC cancer tiers of the putative CES genes and observed recurrence of variants in these genes with respect to cancer association, were obtained from COSMIC database on 25 October 2024 (ref. ²⁸). The list of genes causing clonal expansions in haematopoietic tissues was obtained from ref. ³.

Data on single-cell gene expression in sperm progenitors and oocytes were downloaded from The Human Protein Atlas⁵⁶ and data on single-cell RNA sequencing of early human embryos was obtained from the supplementary information in ref. ⁵⁷. The structure of the MAPK pathway shown on Fig. 4d was obtained from WikiPathways/WP382 (ref. ⁵⁸). For the per-site mutation rate estimates and quality control tracks, we used Roulette-MR and Roulette-QUAL features of Roulette, respectively²⁵.

Processing of genes

Because Roulette mutation rates are available only for autosomes, we restricted all analyses to autosomal genes. To avoid ambiguity in assigning variants, we further limited our dataset to genes with non-overlapping coding sequences. For any pair of genes with overlapping coding sequence regions, we retained the longer gene and excluded the shorter one. In cases in which more than two genes overlapped, we selected the largest subset of non-overlapping genes that maximized the total coding sequence length. This filtering yielded a final set of 17,791 autosomal protein-coding genes, listed in Supplementary Table 14.

Processing of variants

We included only coding variants with a ‘high’ or ‘TFBS’ quality score in the Roulette-QUAL track. For each gene, variants were assigned to one of three functional categories. First, LoF variants were defined as those with both LOFTEE²⁰ high-confidence annotations and a variant effect predictor (VEP)⁵⁹ consequence of stop_gained, splice_donor_variant or splice_acceptor_variant. Second, synonymous variants were defined as those with a VEP annotation⁵⁹ containing only the term ‘synonymous’, excluding any variants (possibly) involved in splicing. Third, we included missense variants only if they were annotated as ‘missense’ in VEP and had AlphaMissense⁶⁰ scores greater than 0.1.

Mutational expectation

To scale mutational expectations in an unbiased way, we normalized by the observed number of synonymous de novo mutations. This approach accounts for cohort-specific factors such as average parental age and sequencing coverage. For a given variant v, we define the expected number of occurrences in a de novo dataset as:

$${\lambda }_{v}:= {N}_{{\rm{S}}}\frac{{\mu }_{v}}{{\sum }_{v\in S}{\mu }_{v}},$$

where N_S is the total number of synonymous de novo mutations observed in the cohort, S is the set of all admissible synonymous variants and μ_v is the unscaled Roulette mutation rate assigned to v. This formulation reflects the assumption that de novo mutation events occur independently, allowing expected counts to be proportional to the summed rates.

For any annotation A = {v}, the expected number of de novo mutations is:

$${\lambda }_{A}:= \sum _{v\in A}{\lambda }_{v}={N}_{{\rm{S}}}\frac{{\sum }_{v\in A}{\mu }_{v}}{{\sum }_{v\in S}{\mu }_{v}}.$$

(2)

This enables comparison of observed and expected mutation counts across functional categories while controlling for mutation rate variation and technical biases.

Control for biases in Roulette

To test for overdispersion in Roulette estimates at the single-variant level, we compared two models of counts of de novo synonymous mutations in the NDD cohort: a Poisson model assuming no overdispersion, and a negative binomial model assuming overdispersion by a factor of γ + 1. Using expected mutation counts defined as in equation (2), the likelihood under the Poisson model is:

$${{\mathcal{L}}}_{0}=\prod _{v\in S}{{\rm{e}}}^{-{\lambda }_{v}}\frac{{\lambda }_{v}^{{n}_{v}}}{{n}_{v}!},$$

(3)

where n_v is the observed number of recurrent de novo mutations at site v. The likelihood under the negative binomial model, with variance inflated by a factor of γ + 1, is:

$${{\mathcal{L}}}_{\gamma }=\prod _{v\in S}\left(\genfrac{}{}{0ex}{}{{n}_{v}+\frac{{\lambda }_{v}}{\gamma }-1}{{n}_{v}}\right)\,{\left(\frac{\gamma }{\gamma +1}\right)}^{{n}_{v}}{\left(\frac{1}{\gamma +1}\right)}^{\frac{{\lambda }_{v}}{\gamma }},$$

(4)

using the parametrization of the negative binomial distribution by mean λ_v and overdispersion factor γ, such that the variance is λ_v(1 + γ). As equation (4) can be rewritten as a Poisson–Gamma mixture, λ_v/γ need not be an integer.

After maximizing the likelihood in equation (4) over γ, we obtain $\hat{\gamma }=0.004$. Comparing the log-likelihoods of equations (3) and (4), we find a difference of $\text{log}{{\mathcal{L}}}_{\gamma }-\text{log}{{\mathcal{L}}}_{0}$ = 31.1. Applying the Akaike information criterion (AIC), which penalizes extra parameters by 2 log-likelihood units, this strongly supports the presence of modest but statistically significant overdispersion. As a result, we use the negative binomial distribution with γ = 0.004 rather than the Poisson model when modelling individual mutations in the GoF gene set.

Beyond single-site effects, Roulette estimates might have regionally correlated errors resulting in some regions having generally overpredicted or underpredicted mutation rates. In particular, correlated errors in residuals could systematically distort the expected counts for specific regions of interest, such as protein-coding genes and may be undetectable at the per-site level. To assess this, we tested for overdispersion in the number of synonymous de novo mutations aggregated by gene. Let n_g be the observed and λ_g the expected count of synonymous mutations in gene g, with λ_g obtained by summing site-specific rates as in equation (2). We then substitute n_v → n_g and λ_v → λ_g in equations (3) and (4), and re-estimate the log-likelihoods.

In this gene-level analysis, the log-likelihood difference is $\text{log}{{\mathcal{L}}}_{\gamma }-\text{log}{{\mathcal{L}}}_{0}$ = 0.69, which falls well below the AIC threshold for including an extra parameter. Thus, we find no evidence of significant variance inflation at the gene level.

To further test for overdispersion in the gene-level expectations λ_g, we used a regression-based approach. Applying the law of total variance:

$${\rm{V}}{\rm{a}}{\rm{r}}({n}_{g})={\mathbb{E}}[{\rm{V}}{\rm{a}}{\rm{r}}({n}_{g}|{\lambda }_{g})]+{\rm{V}}{\rm{a}}{\rm{r}}[{\mathbb{E}}({n}_{g}|{\lambda }_{g})],$$

where the first term captures residual (Poisson) noise and the second term reflects underlying variation in λ_g. As ${\mathbb{E}}$(n_g|λ_g) = λ_g and Var(n_g|λ_g) = λ_g under the Poisson model, we expect:

$${\rm{V}}{\rm{a}}{\rm{r}}({n}_{g})={\mathbb{E}}({\lambda }_{g})+{\rm{V}}{\rm{a}}{\rm{r}}({\lambda }_{g}).$$

To test this, we computed sample estimates of each term and tested the Poisson assumption with the chi-squared goodness-of-fit test on statistic

$${\chi }^{2}=\sum _{g}\frac{{({n}_{g}-{\lambda }_{g})}^{2}}{{\lambda }_{g}},$$

with degrees of freedom equal to the number of genes minus 1 (17,790). The resulting P value of 0.6 confirms consistency with the Poisson model, with no evidence of excess dispersion. We thus conclude that gene-level Roulette residuals are effectively uncorrelated, and that variance inflation in gene-level mutational expectations is negligible.

Consequently, for gene-level aggregation of variant counts in LoF-1 and LoF-2 sets, we used Poisson tests to assess enrichments.

Statistics

Phenotypic homogeneity in the NDD cohort

Our procedures yielding LoF-1 and GoF sets of putative CES drivers assume uniform phenotypic sampling across the NDD cohort (Main text and below). Deviations from uniformity—for example, overrepresentation of certain syndromes—could result in differential ascertainment of associated mutations, potentially inflating effect sizes for certain genes and introducing false positives to LoF-1 and LoF-2 sets.

To assess homogeneity of recruitment in the composite NDD cohort used here, we compared contributions of the three subcohorts (DDD, GeneDx and RUMC) comprising it to de novo LoF variation in individual genes. If the cohorts were sampled in a uniform way, the fractions of de novo variants coming from each cohort across all genes should correspond to the per-cohort mutation rate, which is proportional to the number of synonymous mutations per individual cohort (equation (2)). The numbers of variants in each gene should therefore come from the multinomial distribution with probabilities given by per-cohort synonymous variant counts and the number of events given by the number of observed de novo variants.

We tested this hypothesis by comparing four models: (1) a multinomial model with probabilities fixed by observed synonymous variant counts (zero free parameters); (2) a multinomial model with unconstrained probabilities to test for shifts in representation (two free parameters due to three probabilities that need to sum up to 1); (3) a Dirichlet-multinomial model centred on the observed synonymous proportions, allowing for overdispersion (one free parameter: amplitude of Dirichlet) and (4) a fully flexible Dirichlet-multinomial model accounting for both overdispersion and shifted means (three free parameters: two for unconstrained multinomial and one more parameter for the amplitude of Dirichlet).

For each model, we computed the total likelihood across all genes and compared models using the AIC. The unbiased multinomial model (model (1)) was the one preferred by AIC, indicating that the observed gene-level LoF variant distributions are consistent with uniform cohort sampling.

To further validate that our LoF-1 and LoF-2 sets are not confounded by non-uniform sampling, we repeated the analysis on specifically LoF-1 and LoF-2 putative CES drivers. In both cases, the AIC again favoured the null multinomial model. Together, these results suggest that recruitment across the DDD, GeneDx and RUMC cohorts was phenotypically homogeneous with respect to LoF variant ascertainment. Full likelihood and AIC values for all model comparisons are reported in Supplementary Table 2.

Prevalence of NDD

The identification of putative CES drivers in the LoF-1 and GoF sets relies on the assumption that the disease prevalence estimate P(D) used in equation (1) is not inflated. Overestimating P(D) would lead to underestimating the maximum plausible effect size attributable to ascertainment, thereby increasing the risk of false-positive CES calls in highly penetrant genes.

Epidemiological estimates of NDD prevalence—based on diagnoses of intellectual disability or global developmental delay—typically fall within the range of 2–3% (ref. ²⁶). To ensure a conservative interpretation, we imposed a lower bound of 1% on P(D), which corresponds to an upper bound of 100 on the effect of ascertainment.

One further consideration suggests that 1% is an appropriate lower boundary for NDD prevalence. Almost all significantly mutated genes have the effects below 100 with very few genes greatly exceeding this threshold (Fig. 3d). The possibility that the prevalence is even lower than 1% and the observed distribution of gene effects is fully explained by disease ascertainment is inconsistent with the existence of genes with substantial penetrance for NDD.

Construction of the GoF set

To define the GoF set of putative CES driver genes, we applied a negative binomial test. The expected count of de novo variants at position v was set to μ_v/P*(D), where μ_v is the Roulette mutation rate at v and P*(D) is the lower bound on disease prevalence, fixed at 0.01 (Main text). To account for overdispersion at the single-site level, we used a variance inflation parameter γ = 0.004, as defined in equation (4). For multiple testing correction, we applied a Bonferroni adjustment using a correction factor equal to the total number of autosomal missense variants annotated by VEP and filtered to have an AlphaMissense score above 0.1 (49,686,008 tests).

Construction of the LoF-1 set

To define the LoF-1 set of putative CES driver genes, we applied a Poisson test to LoF de novo variants observed in the NDD cohort. The expected count of variants in gene g was set to λ_g/P*(D), where λ_g is the LoF mutation rate for gene g as defined in equation (2) and P*(D) is the lower bound on disease prevalence, fixed at 0.01 (Main text). As in the GoF analysis, this expectation corresponds to the number of events expected under maximal disease ascertainment (that is, full penetrance). The choice of the Poisson model is justified in the ‘Control for biases in Roulette’ section above. For multiple testing correction, we used a Bonferroni factor equal to the number of autosomal genes with at least one filtered LoF site (17,791 tests). This yielded 5 Bonferroni-significant genes at the 0.05 level, and 1 more gene (FOXG1) significant at FDR < 0.1, using the Benjamini–Hochberg procedure (Fig. 2a,d).

Construction of the LoF-2 set

The LoF-2 set of putative CES driver genes was constructed using the combined ASD + NDD cohort in two steps. First, we identified genes with a significant excess of de novo LoF variants relative to expectation (n_g > λ_g), where n_g is the observed LoF count and λ_g is the Roulette-based expectation for gene g. As in the LoF-1 analysis, we used a Poisson test with correction for 17,791 tested autosomal genes (Supplementary Table 14). On the second step, we defined the LoF-2 set as those genes passing a FDR threshold of less than 0.1 and having gnomAD-v4 LOEUF scores greater than 0.5 (Fig. 2a).

Excess of LoF variants in NDD compared with NanoSeq

To test whether the excess of de novo LoF variants observed in the NDD cohort exceeded that in the sperm NanoSeq dataset, we used a Binomial test. For each gene g, we set the probability of success to the expected fraction of LoF mutations arising from NanoSeq, defined as $\frac{{\lambda }_{m,g}}{{\lambda }_{m,g}+{\lambda }_{n}}$, where λ_m,_g is the expected LoF count in NanoSeq and λ_n is the expected LoF count in NDD. The number of trials was set to n_g + m_g, where n_g and m_g are the observed LoF counts in gene g in the NDD and NanoSeq datasets, respectively. The test compared the observed fraction $\frac{{m}_{g}}{{m}_{g}+{n}_{g}}$ with the expected null fraction. Only genes with at least one LoF variant in both datasets were tested. A gene was deemed significant if its P value, Bonferroni-corrected for the number of tests (N = 5), was below 0.05. This test identified only CTNNB1 and PTEN as significantly enriched for LoF variants in the NDD cohort relative to NanoSeq. Because the NanoSeq donors were older on average than the parents of individuals in the NDD cohort, this comparison is conservative.

The test for the excess of variants in NanoSeq compared with the excess in the NDD cohort highlights CUL3 and ARID1A as genes with a larger excess in NanoSeq. Although this effect may be confounded by the age differences between the parents in the NDD cohort and the donors of samples used in the NanoSeq experiment, an alternative explanation may be embryonic deleteriousness of LoF variants in these genes.

Expression and gene set enrichment analyses

Expression analysis of candidate genes was performed using single-cell data from The Human Protein Atlas. For each gene, the Atlas reports transcripts per million (TPM) as a measure of expression across a wide range of tissues. To account for gene-specific differences in baseline expression, we normalized the TPM values in spermatogonia and oocytes by the maximum TPM observed for each gene across all tissues. To statistically assess differences in expression patterns between CES driver candidates and other genes, we applied the Mann–Whitney U-test to the normalized TPM values. Gene set enrichment analysis was performed using the g:Profiler²⁷ web service with default parameters. We tested five gene sets: the GoF, LoF-1 and LoF-2 sets individually; the union of LoF findings (LoF-1 and LoF-2); and the union of all three sets. Full g:Profiler results are provided in Supplementary Tables 15–18.

GeneBayes update and CES

To obtain estimates of LoF constraint in the LoF-2 set of genes unimpacted by mutation rate misspecification we used prior estimates of the selection coefficient against heterozygotes, produced by the GeneBayes method. These estimates result from a prior distribution fit to all LoF variation using a set of global gene features and should be minimally impacted by variation in any one gene³⁹. To assess the robustness of the resulting list of genes to the constraint estimation, we also performed an orthogonal analysis of differences between prior and posterior GeneBayes s_het estimates, which were obtained on gnomAD-v3. As expected under moderate selection against LoF mutations and CES, prior estimates of s_het were substantially higher than those estimated using polymorphism alone. The rationale here is that, given downwardly biased mutation rate estimates of CES genes, the data should shift the prior down by factors larger than those expected for other genes. This analysis recapitulates our results obtained with LOEUF.

Owing to the demographic complexity of the gnomAD-v4 sample, we used a data-driven approach to estimate s_het for each gene. We first fit how both the proportion of polymorphic sites as well as the shape of the site frequency spectrum depend on s_het, conditional on μ. For this, we fit a multinomial model to the observed allele counts using prior estimates of as a covariate. To ensure that the site frequency spectrum (SFS) was not too sparse at higher allele counts, we binned sites by allele count with boundaries [0,1,2,3,4,5,6,7,8,9,10,11,16,24,36,100]. The observed distribution of alleles counts in each Roulette mutation rate bin was then transformed to multinomial coefficients such that

$$\log \frac{P(K=i\,| \,\mu )}{P(K=0\,| \,\mu )}={\beta }_{i}^{\mu }$$

(5)

This model for the neutral SFS in each bin was first estimated using synonymous SNV counts. Next, we fit a model to all LoF polymorphism such that

$$\log \frac{P(K=i| \mu )}{P(K=0| \mu )}={\beta }_{i}^{\mu }-{\beta }_{i}^{s}\sqrt{{s}_{{\rm{het}}}}.$$

(6)

The ${\beta }_{i}^{\mu }$ coefficients were determined by the synonymous SFS and ${\beta }_{i}^{s}$ coefficients were fit with maximum likelihood using the L-BFGS-B algorithm as implemented in scipy. The square root dependency on was determined by manual inspection of the SFS, binned by s_het and μ. The values used to fit this global multinomial model were prior mean values output by GeneBayes. This procedure yielded a model for the SFS in each mutation rate bin, given a value of s_het. We then re-estimated for LoF mutations in each gene using this model using maximum likelihood and bounding to [0,1].

Segregation of LoF-2 polymorphism in the human population

LoF-2 genes, despite showing a significant excess of LoF mutations in the NDD cohort, harbour many LoF variants segregating in the general population. This is reflected, for instance, in their relatively high LOEUF values. The number of segregating sites in a gene is governed by both mutation rate and selection. If these genes are genuine CES drivers, we expect an elevated mutation rate to contribute to the number and frequency of LoF SNVs. In such cases, both the number of segregating sites and their allele frequencies would appear higher than expected under a fixed selection coefficient.

In a cohort with more than 1 million chromosomes, this hypothesis can be evaluated by comparing the empirical expectation and variance of LoF allele frequency distributions across genes. Whereas both mutation rate and selection influence the mean and variance of allele frequencies, recurrent mutation produces alleles that tend to be more common but show lower variance in frequency, due to reduced genetic drift. By contrast, relaxed selection increases both mean and variance via enhanced drift. Thus, we expect genes with elevated mutation rate to show lower variance in allele frequency relative to their mean than genes under relaxed selection.

To formalize this, we applied the gamma-Poisson approximation for allele frequency spectra under strong selection and recurrent mutation³⁸.

Gene filtering

For this analysis, we retained only genes with at least 10 LoF sites passing Roulette-QUAL quality control and having unscaled Roulette mutation rate μ > 1 (for example, CpG>TpG transitions). This filtering ensures reliable estimates of the mean and variance of allele frequencies for each gene.

Model

We assume a three-step generative model for LoF allele counts in each gene:

(1)
Mutations occur randomly at any LoF site in the gene.
(2)
Each derived variant segregates in a diploid population of effective size N_e, experiencing negative selection of intensity s in heterozygotes. We assume strong selection and no lineage interference.
(3)
A sample of n alleles is drawn from the population.

Let k_v denote the number of observed derived LoF variants v in the population sample. In what follows, we will drop the index v. We assume the variance of k to be the sum of contributions of the three assumed steps without any extra terms:

$${\rm{Var}}(k)={{\rm{Var}}}_{{\rm{sampling}}}(k)+{{\rm{Var}}}_{{\rm{seg}}}(k)+{{\rm{Var}}}_{\mu }(k),$$

(7)

where the components correspond to mutation heterogeneity across sites, drift-induced segregation, and Poisson sampling noise, respectively. Under the Gamma approximation introduced in ref. ³⁸, the distribution of allele frequencies at LoF sites is modelled with 4N_eμ as the shape parameter and 1/(4N_es) as the scale parameter. The expectation of k is then:

$${\mathbb{E}}(k)=\frac{n{\mathbb{E}}(\mu )}{s}.$$

And the first two terms of equation (7) are given by the variance of the Poisson and Gamma components, respectively:

$${{\rm{V}}{\rm{a}}{\rm{r}}}_{{\rm{s}}{\rm{a}}{\rm{m}}{\rm{p}}{\rm{l}}{\rm{i}}{\rm{n}}{\rm{g}}}(k)\equiv {\mathbb{E}}(k)=\frac{n{\mathbb{E}}(\mu )}{s},$$

$${{\rm{V}}{\rm{a}}{\rm{r}}}_{{\rm{s}}{\rm{e}}{\rm{g}}}(k)=\frac{n{\mathbb{E}}(\mu )}{s}\frac{n}{4{N}_{{\rm{e}}}s}.$$

The final term of equation (7) may be obtained by noticing that ${{\rm{V}}{\rm{a}}{\rm{r}}}_{{\rm{s}}{\rm{a}}{\rm{m}}{\rm{p}}{\rm{l}}{\rm{i}}{\rm{n}}{\rm{g}}}(k)+{{\rm{V}}{\rm{a}}{\rm{r}}}_{{\rm{s}}{\rm{e}}{\rm{g}}}(k)={{\mathbb{E}}}_{\mu }[{\rm{V}}{\rm{a}}{\rm{r}}(k|\mu )]$. Applying the law of total variance,

$$\begin{array}{c}\,{\rm{V}}{\rm{a}}{\rm{r}}(k)=[{{\rm{V}}{\rm{a}}{\rm{r}}}_{{\rm{s}}{\rm{a}}{\rm{m}}{\rm{p}}{\rm{l}}{\rm{i}}{\rm{n}}{\rm{g}}}(k)+{{\rm{V}}{\rm{a}}{\rm{r}}}_{{\rm{s}}{\rm{e}}{\rm{g}}}(k)]+{{\rm{V}}{\rm{a}}{\rm{r}}}_{\mu }(k)\\ \,=\,{{\mathbb{E}}}_{\mu }[{\rm{V}}{\rm{a}}{\rm{r}}(k|\mu )]+{{\rm{V}}{\rm{a}}{\rm{r}}}_{\mu }[{\mathbb{E}}(k|\mu )]\\ \,=\,\frac{n{\mathbb{E}}(\mu )}{s}\left(1+\frac{n}{4{N}_{{\rm{e}}}s}\right)+{\left(\frac{n}{s}\right)}^{2}{\rm{V}}{\rm{a}}{\rm{r}}(\mu )\\ \,\,=\,\frac{n{\mathbb{E}}(\mu )}{s}+\frac{n{\mathbb{E}}(\mu )}{s}\frac{n}{4{N}_{{\rm{e}}}s}+{\left(\frac{n}{s}\right)}^{2}{\rm{V}}{\rm{a}}{\rm{r}}(\mu ),\end{array}$$

$${{\rm{Var}}}_{\mu }(k)={\left(\frac{n}{s}\right)}^{2}{\rm{Var}}(\mu ).$$

We now model the effect of CES as a multiplicative factor κ ≥ 1 on the mutation rate so that:

$${\mathbb{E}}(k)=\kappa \times \frac{n{\mathbb{E}}(\mu )}{s},$$

(8)

$${\rm{V}}{\rm{a}}{\rm{r}}(k)=\kappa \times \frac{n{\mathbb{E}}(\mu )}{s}\left(1+\frac{n}{4{N}_{{\rm{e}}}s}\right)+{\kappa }^{2}\times {\left(\frac{n}{s}\right)}^{2}{\rm{V}}{\rm{a}}{\rm{r}}(\mu ).$$

(9)

The final term in equation (9) containing dependency on κ² can be estimated from the regression of k on μ with Poisson error. Subtracting this term from Var(k) leaves only the sample estimate of ${\mathbb{E}}$_μ(Var(k|μ)).

Estimating selection

This framework yields two estimators of 1/s:

(1)
From the mean allele count (equation (8)):
$$\frac{\kappa }{\hat{s}}=\frac{\bar{k}}{n\bar{\mu }},$$
(10)
where $\overline{\mu }$ and $\overline{k}$ are the mean mutation rate and mean LoF allele count per gene.
(2)
From the conditional variance (equation (9)), subtracting sampling noise:

$$\frac{{{\rm{V}}{\rm{a}}{\rm{r}}}_{{\rm{s}}{\rm{e}}{\rm{g}}}(k)}{{{\rm{V}}{\rm{a}}{\rm{r}}}_{{\rm{s}}{\rm{a}}{\rm{m}}{\rm{p}}{\rm{l}}{\rm{i}}{\rm{n}}{\rm{g}}}(k)}=\frac{{{\mathbb{E}}}_{\mu }[{\rm{V}}{\rm{a}}{\rm{r}}(k|\mu )]-{{\rm{V}}{\rm{a}}{\rm{r}}}_{{\rm{s}}{\rm{a}}{\rm{m}}{\rm{p}}{\rm{l}}{\rm{i}}{\rm{n}}{\rm{g}}}(k)}{{{\rm{V}}{\rm{a}}{\rm{r}}}_{{\rm{s}}{\rm{a}}{\rm{m}}{\rm{p}}{\rm{l}}{\rm{i}}{\rm{n}}{\rm{g}}}(k)}=\frac{n}{4{N}_{{\rm{e}}}s}.$$

(11)

Note that only the first estimator (equation (10)) depends on κ and is therefore sensitive to CES effects.

Figure 3e plots the log estimate of $\frac{n}{4{N}_{{\rm{e}}}s}$ (excess variance due to drift) against the log estimate of $\frac{\kappa }{s}$ (elevated mean allele count due to CES). Because log(n/4N_e) − log(s) is a constant offset from log(κ) − log(s), genes with elevated mutation due to CES appear as positive residuals from the regression line.

This approach is conservative: the gamma-Poisson approximation is invalid for genes under weak or no selection, where allele frequency variance is even higher. Thus, such genes would not mimic the CES signal.

True NDD genes deviate negatively from the regression line, having lower mean allele frequencies than predicted under a fixed s. This is consistent with either sampling bias in gnomAD-v4 resulting from difficulties in recruitment of individuals affected with NDD or from selection acting against LoF variants in these genes even prenatally, leading to inherently reduced representation in population datasets.

MIB1 example

For MIB1, we used Poisson regression to quantify the dependence of LoF allele frequency on mutation rate. To visualize this relationship (Fig. 3d), we binned sites by scaled mutation rate, merging bins with fewer than ten sites with the next-highest bin.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All data used in this study, including intermediate processed datasets, are provided in Supplementary Tables 1–20. Associated summary files are also available on Zenodo (https://zenodo.org/records/15660433)⁶¹.

Code availability

The full analysis pipeline, including code to reproduce figures and statistical analyses, is available on GitHub at https://github.com/mikemoldovan/CES_Discovery.

Change history

18 December 2025
This article was originally published under standard Springer Nature license (© The Author(s), under exclusive licence to Springer Nature Limited). It is now available as an open-access paper under a Creative Commons Attribution 4.0 International license, © The Author(s). The copyright has been corrected in the HTML and PDF versions of the article.

References

Lawson, A. R. J. et al. Somatic mutation and selection at population scale. Nature https://doi.org/10.1038/s41586-025-09584-w (2025).
Maeda, H. & Kakiuchi, N. Clonal expansion in normal tissues. Cancer Sci. 115, 2117–2124 (2024).
Article CAS PubMed PubMed Central Google Scholar
Bernstein, N. et al. Analysis of somatic mutations in whole blood from 200,618 individuals identifies pervasive positive selection and novel drivers of clonal hematopoiesis. Nat. Genet. 56, 1147–1155 (2024).
Article CAS PubMed PubMed Central Google Scholar
Fowler, J. C. & Jones, P. H. Somatic mutation: what shapes the mutational landscape of normal epithelia? Cancer Discov. 12, 1642–1655 (2022).
Article CAS PubMed PubMed Central Google Scholar
Goriely, A., McVean, G. A. T., Röjmyr, M., Ingemarsson, B. & Wilkie, A. O. M. Evidence for selective advantage of pathogenic FGFR2 mutations in the male germ line. Science 301, 643–646 (2003).
Article ADS CAS PubMed Google Scholar
Tiemann-Boege, I. et al. The observed human sperm mutation frequency cannot explain the achondroplasia paternal age effect. Proc. Natl Acad. Sci. USA 99, 14952–14957 (2002).
Article ADS CAS PubMed PubMed Central Google Scholar
Maher, G. J. et al. Selfish mutations dysregulating RAS-MAPK signaling are pervasive in aged human testes. Genome Res. 28, 1779–1790 (2018).
Article CAS PubMed PubMed Central Google Scholar
Wood, K. A. et al. SMAD4 mutations causing Myhre syndrome are under positive selection in the male germline. Am. J. Hum. Genet. 111, 1953–1969 (2024).
Article CAS PubMed PubMed Central Google Scholar
Kaplanis, J. et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 586, 757–762 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Hamdan, F. F. et al. High rate of recurrent de novo mutations in developmental and epileptic encephalopathies. Am. J. Hum. Genet. 101, 664–685 (2017).
Article CAS PubMed PubMed Central Google Scholar
Jin, S. C. et al. Contribution of rare inherited and de novo variants in 2,871 congenital heart disease probands. Nat. Genet. 49, 1593–1601 (2017).
Article CAS PubMed PubMed Central Google Scholar
Zhou, X. et al. Integrating de novo and inherited variants in 42,607 autism cases identifies mutations in new moderate-risk genes. Nat. Genet. 54, 1305–1319 (2022).
Article CAS PubMed PubMed Central Google Scholar
Richter, F. et al. Genomic analyses implicate noncoding de novo variants in congenital heart disease. Nat. Genet. 52, 769–777 (2020).
Article CAS PubMed PubMed Central Google Scholar
Francioli, L. C. et al. Genome-wide patterns and properties of de novo mutations in humans. Nat. Genet. 47, 822–826 (2015).
Article CAS PubMed PubMed Central Google Scholar
Iossifov, I. et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 515, 216–221 (2014).
Article ADS CAS PubMed PubMed Central Google Scholar
Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat. Genet. 46, 818–825 (2014).
Article Google Scholar
An, J.-Y. et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, eaat6576 (2018).
Article ADS PubMed PubMed Central Google Scholar
Gulsuner, S. et al. Spatial and temporal mapping of de novo mutations in schizophrenia to a fetal prefrontal cortical network. Cell 154, 518–529 (2013).
Article CAS PubMed PubMed Central Google Scholar
Zhao, G. et al. Gene4Denovo: an integrated database and analytic platform for de novo mutations in humans. Nucleic Acids Res. 48, D913–D926 (2020).
CAS PubMed Google Scholar
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Neville, M. D. C. et al. Sperm sequencing reveals extensive positive selection in the male germline. Nature https://doi.org/10.1038/s41586-025-09448-3 (2025).
Amberger, J. S., Bocchini, C. A., Scott, A. F. & Hamosh, A. OMIM.org: leveraging knowledge across phenotype–gene relationships. Nucleic Acids Res. 47, D1038–D1043 (2019).
Article CAS PubMed Google Scholar
Halldorsson, B. V. et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, eaau1043 (2019).
Article CAS PubMed Google Scholar
Seplyarskiy, V. B. & Sunyaev, S. The origin of human mutation in light of genomic data. Nat. Rev. Genet. 22, 672–686 (2021).
Article CAS PubMed Google Scholar
Seplyarskiy, V. et al. A mutation rate model at the basepair resolution identifies the mutagenic effect of polymerase III transcription. Nat. Genet. 55, 2235–2242 (2023).
Article CAS PubMed PubMed Central Google Scholar
Totsika, V., Liew, A., Absoud, M., Adnams, C. & Emerson, E. Mental health problems in children with intellectual disability. Lancet Child Adolesc. Health 6, 432–444 (2022).
Article PubMed Google Scholar
Raudvere, U. et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res. 47, W191–W198 (2019).
Article CAS PubMed PubMed Central Google Scholar
Tate, J. G. et al. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res. 47, D941–D947 (2019).
Article CAS PubMed Google Scholar
Boßelmann, C. M. et al. Analysis of 1386 epileptogenic brain lesions reveals association with DYRK1A and EGFR. Nat. Commun. 15, 10429 (2024).
Article ADS PubMed PubMed Central Google Scholar
Gogate, A. et al. The genetic landscape of autism spectrum disorder in an ancestrally diverse cohort. NPJ Genomic Med. 9, 62 (2024).
Article CAS Google Scholar
Gallego-Martinez, A. et al. Using coding and non-coding rare variants to target candidate genes in patients with severe tinnitus. NPJ Genomic Med. 7, 70 (2022).
Article CAS Google Scholar
Wainberg, M. et al. Deletion of loss-of-function-intolerant genes and risk of 5 psychiatric disorders. JAMA Psychiatry 79, 78–81 (2022).
Article PubMed Google Scholar
Rodan, L. H. et al. Gain-of-function variants in the ODC1 gene cause a syndromic neurodevelopmental disorder associated with macrocephaly, alopecia, dysmorphic features, and neuroimaging abnormalities. Am. J. Med. Genet. A. 176, 2554–2560 (2018).
Article CAS PubMed Google Scholar
Jansen, S. et al. De novo truncating mutations in the last and penultimate exons of PPM1D cause an intellectual disability syndrome. Am. J. Hum. Genet. 100, 650–658 (2017).
Article CAS PubMed PubMed Central Google Scholar
Harpak, A., Bhaskar, A. & Pritchard, J. K. Mutation rate variation is a primary determinant of the distribution of allele frequencies in humans. PLoS Genet. 12, e1006489 (2016).
Article PubMed PubMed Central Google Scholar
Schraiber, J. G., Spence, J. P. & Edge, M. D. Estimation of demography and mutation rates from one million haploid genomes. Am. J. Hum. Genet. 112, 2152–2166 (2025).
Wakeley, J., Fan, W.-T. L., Koch, E. & Sunyaev, S. Recurrent mutation in the ancestry of a rare variant. Genetics 224, iyad049 (2023).
Article CAS PubMed PubMed Central Google Scholar
Nei, M. The frequency distribution of lethal chromosomes in finite populations. Proc. Natl Acad. Sci. USA 60, 517–524 (1968).
Article ADS CAS PubMed PubMed Central Google Scholar
Zeng, T., Spence, J. P., Mostafavi, H. & Pritchard, J. K. Bayesian estimation of gene constraint from an evolutionary model with gene features. Nat. Genet. 56, 1632–1643 (2024).
Article CAS PubMed Google Scholar
Szklarczyk, D. et al. The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 49, D605–D612 (2021).
Article CAS PubMed Google Scholar
Schuurs-Hoeijmakers, J. H. M. et al. Recurrent de novo mutations in PACS1 cause defective cranial-neural-crest migration and define a recognizable intellectual-disability syndrome. Am. J. Hum. Genet. 91, 1122–1127 (2012).
Article CAS PubMed PubMed Central Google Scholar
Olson, H. E. et al. A recurrent de novo PACS2 heterozygous missense variant causes neonatal-onset developmental epileptic encephalopathy, facial dysmorphism, and cerebellar dysgenesis. Am. J. Hum. Genet. 102, 995–1007 (2018).
Article CAS PubMed PubMed Central Google Scholar
Goriely, A. et al. Activating mutations in FGFR3 and HRAS reveal a shared genetic origin for congenital disorders and testicular tumors. Nat. Genet. 41, 1247–1252 (2009).
Article CAS PubMed PubMed Central Google Scholar
Ichimura, K. et al. Recurrent neomorphic mutations of MTOR in central nervous system and testicular germ cell tumors may be targeted for therapy. Acta Neuropathol. 131, 889–901 (2016).
Article CAS PubMed Google Scholar
Kimura, T. et al. Conditional loss of PTEN leads to testicular teratoma and enhances embryonic germ cell production. Development 130, 1691–1700 (2003).
Article CAS PubMed Google Scholar
Sommerer, F. et al. Mutations of BRAF and RAS are rare events in germ cell tumours. Int. J. Cancer 113, 329–335 (2005).
Article CAS PubMed Google Scholar
Knudson Hypothesis—an overview. ScienceDirect Topics https://www.sciencedirect.com/topics/medicine-and-dentistry/knudson-hypothesis (2002).
Goriely, A. & Wilkie, A. O. M. Paternal age effect mutations and selfish spermatogonial selection: causes and consequences for human disease. Am. J. Hum. Genet. 90, 175–200 (2012).
Article CAS PubMed PubMed Central Google Scholar
Penrose, L. S. Parental age and mutation. Lancet 269, 312–313 (1955).
Article CAS PubMed Google Scholar
EuroEPINOMICS-RES Consortium, Epilepsy Phenome/Genome Project, & Epi4K Consortium. De novo mutations in synaptic transmission genes including DNM1 cause epileptic encephalopathies. Am. J. Hum. Genet. 95, 360–370 (2014).
Article Google Scholar
Helbig, K. L. et al. Diagnostic exome sequencing provides a molecular diagnosis for a significant proportion of patients with epilepsy. Genet. Med. 18, 898–905 (2016).
Article CAS PubMed Google Scholar
Heyne, H. O. et al. De novo variants in neurodevelopmental disorders with epilepsy. Nat. Genet. 50, 1048–1053 (2018).
Article CAS PubMed Google Scholar
Klöckner, C. et al. De novo variants in SNAP25 cause an early-onset developmental and epileptic encephalopathy. Genet. Med. 23, 653–660 (2021).
Article PubMed Google Scholar
Allen, A. S. et al. De novo mutations in epileptic encephalopathies. Nature 501, 217–221 (2013).
Article ADS CAS PubMed Google Scholar
Perez, G. et al. The UCSC Genome Browser database: 2025 update. Nucleic Acids Res. 53, D1243–D1249 (2025).
Karlsson, M. et al. A single–cell type transcriptomics map of human tissues. Sci. Adv. 7, eabh2169 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Xu, Y. et al. A single-cell transcriptome atlas profiles early organogenesis in human embryos. Nat. Cell Biol. 25, 604–615 (2023).
Agrawal, A. et al. WikiPathways 2024: next generation pathway database. Nucleic Acids Res. 52, D679–D689 (2023).
Article PubMed Central Google Scholar
McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17, 122 (2016).
Article PubMed PubMed Central Google Scholar
Cheng, J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492 (2023).
Article CAS PubMed Google Scholar
Moldovan, M. Reproducibility data for the paper ‘Cohort-level analysis of human de novo mutations points to drivers of clonal expansion in spermatogonia’. Zenodo https://doi.org/10.5281/zenodo.15660433 (2025).

Download references

Acknowledgements

We thank C. Boix, C. Chiang and R. Stana as well as A. Quinlan and J. Kunisaki for valuable discussions and helpful suggestions that improved the analyses presented in this study. This work was supported by the National Institutes of Health through grant nos. R35GM12713, R01MH101244 and U01HG012009.

Author information

These authors contributed equally: Vladimir Seplyarskiy, Mikhail A. Moldovan

Authors and Affiliations

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Vladimir Seplyarskiy, Mikhail A. Moldovan, Evan Koch, Prathitha Kar & Shamil Sunyaev
Brigham and Women’s Hospital, Division of Genetics, Harvard Medical School, Boston, MA, USA
Vladimir Seplyarskiy & Shamil Sunyaev
Cancer, Ageing and Somatic Mutation, Wellcome Sanger Institute, Hinxton, UK
Matthew D. C. Neville & Raheleh Rahbari

Authors

Vladimir Seplyarskiy
View author publications
Search author on:PubMed Google Scholar
Mikhail A. Moldovan
View author publications
Search author on:PubMed Google Scholar
Evan Koch
View author publications
Search author on:PubMed Google Scholar
Prathitha Kar
View author publications
Search author on:PubMed Google Scholar
Matthew D. C. Neville
View author publications
Search author on:PubMed Google Scholar
Raheleh Rahbari
View author publications
Search author on:PubMed Google Scholar
Shamil Sunyaev
View author publications
Search author on:PubMed Google Scholar

Contributions

V.S., M.A.M. and S.S. jointly conceived the study and developed the methodological framework. V.S., M.A.M., E.K., P.K. and M.D.C.N. performed data analysis and interpreted results. V.S., M.A.M., E.K., P.K., M.D.C.N., R.R. and S.S. collaboratively drafted and revised the paper. S.S., R.R., V.S. and M.A.M. jointly supervised the project. All authors reviewed and approved the final version of the paper.

Corresponding authors

Correspondence to Vladimir Seplyarskiy, Mikhail A. Moldovan or Shamil Sunyaev.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks Ziyue Gao, Mikkel Schierup and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Counts of de novo variants in the NDD cohort.

(a) Numbers of de novo missense variants stratified by recurrence. Genes harboring variants occurring >10 times in the cohort are shown in blue. (b) Numbers of de novo loss-of-function variants aggregated by gene stratified by recurrence. (c) Scatter plot of observed vs. expected de novo loss-of-function variant counts in the NDD cohort. LoF-1 set genes are shown in red; LoF-2 set genes are shown in purple. Upper bound of disease ascertainment in Eq. (1) given by the lower bound of prevalence of 1% is shown as a dashed line.

Extended Data Fig. 2 Expression of the identified CES drivers in germline tissues.

(a) nTPM values for spermatogonia reported in The Human Protein Atlas single-cell dataset. (b) nTPM values normalized by the maximal expression across all tissues for each gene. (c) nTPM values in oocytes. (d) Normalized nTPM values in oocytes.

Extended Data Fig. 3 Stability of LoF-2 set with respect to the metric of loss-of-function constraint.

(a) Observed-to-expected variant count ratio (o/e) for de novo LoFs in genes with FDR < 0.1 in the neurodevelopmental disorder cohort (NDD) merged with the autism spectrum disorder (ASD) cohort plotted against the Loss-of-function Observed/Expected Lower-bound Fraction (LOELF) scores. The dashed violet line indicates the minimal LOELF value across LoF-2 genes of 0.23. LoF-2 genes are shown in violet, LoF-1 genes are shown in red, genes above the chosen LOELF threshold but not included in the LoF-2 set (SHANK3 and ZMYM2) are shown in black. (b) Same as in (a), but for the Loss-of-function Observed/Expected (LOE) metric. The upper bound for LOE (shown as violet dashed line) is 0.355.

Extended Data Fig. 4 Ratio of observed-to-expected counts of LoF de novo mutations in a cohort of control trios for LoF-2 set genes.

The ratio is shown as a function of the LOEUF threshold: we aggregate all genes with LOEUF values lower than the value indicated on the x-axis and calculate the cumulative observed-to-expected ratio. The shaded grey area represents the 95% confidence interval obtained by permuting the LOEUF labels.

Extended Data Fig. 5 Validation of LoF-2 genes with a non-LOE metric.

Prior of the GeneBayes s_het calculated using biological features of genes (x-axis) and the s_het values updated with LoF polymorphism data from gnomAD-v4 (y-axis). See section ‘GeneBayes update and CES’ for details.

Extended Data Fig. 6 Expected LoF rate in NanoSeq data for LoF-1 and LoF-2 genes.

Rates are shown separately for genes overlapping with those significant in NanoSeq and for private LoF-1/2 genes. An asterisk (*) indicates p < 0.05 from the Mann–Whitney U test.

Extended Data Fig. 7 Power analysis of paternal transmission.

Statistical power (i.e., the probability of correctly detecting a signal when it exists calculated as the complement of type-2 error rate ß) of the Binomial test for paternal overtransmission relative to the baseline of 0.75 is shown across the range of CES-related mutation rate inflations κ and counts of observed variants. Results are presented for three significance levels: 0.05, 0.01, and 0.001.

Supplementary information

Supplementary Text (download PDF )

This file contains four Supplementary Notes 1–4. Note 1, Equivalence of de novo mutation rate and mutation probability under low mutation rates: justification for using the Bernoulli approximation to the Poisson distribution when the expected count is low. Note 2, CHIP leads to misinterpretation of LOEUF: rationale for, and description of, further filtering of results based on the involvement of genes in clonal haematopoiesis. Note 3, Misannotations of protein-truncating variants (PTVs) lead to misinterpretations of LOEUF: rationale for, and description of, further filtering of results to account for misannotation of predicted LoF variants. Note 4, Ascertainment of disease-causing variants under arbitrary trait architectures: derivation of equation (1) in the main text and discussion of its applicability to traits with non-strictly monogenic architecture.

Reporting Summary (download PDF )

Supplementary Fig. 1 (download PDF )

Heatmap of expression of identified CES drivers in fetal single-cell expression clusters identified by Xu et al. (ref. ⁵⁷).

Supplementary Tables (download XLSX )

This file contains Supplementary Tables 1–20.

Peer Review File (download PDF )

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Seplyarskiy, V., Moldovan, M.A., Koch, E. et al. Hotspots of human mutation point to clonal expansions in spermatogonia. Nature 647, 429–435 (2025). https://doi.org/10.1038/s41586-025-09579-7

Download citation

Received: 04 January 2025
Accepted: 27 August 2025
Published: 08 October 2025
Version of record: 08 October 2025
Issue date: 13 November 2025
DOI: https://doi.org/10.1038/s41586-025-09579-7

This article is cited by

Conserved shifts in sperm small non-coding RNA profiles during mouse and human aging
- Junchao Shi
- Xudong Zhang
- Qi Chen
The EMBO Journal (2026)
The search for mutations that sperm acquire as men age
- Caroline Watson
Nature (2025)