Main

Short tandem repeats (STRs) of 1–6 bp of DNA are mutable genomic elements with diverse influences on cellular and organismal phenotypes2. Common STR polymorphisms, which have been characterized in human populations using short-read3 and long-read4 sequencing, influence gene expression5 and complex traits6,7. Rare STR expansions cause more than 60 genetic disorders1. The allelic diversity that underlies these effects is generated by frequent mutation: around 1 million polymorphic STRs in the human genome generate around 50–60 de novo repeat-length mutations per offspring8,9,10. Germline mutation rates of specific STRs vary widely11 and are influenced by repeat motif sequence, interruptions of pure repeats and number of repeat units8,9,10,11,12 as well as genetic variation in DNA-repair genes9.

STRs are also prone to somatic mutation2, and lifelong somatic expansion in at least one STR locus can lead to disease. Recently, genome-wide association studies (GWASs) have provided insights into the molecular mechanisms underlying somatic repeat instability13 by finding common genetic modifiers of the timing or progression of Huntington’s disease (HD)14,15,16,17,18,19, which is caused by inherited alleles in which a CAG repeat in the HTT gene is longer than 35 CAGs; these genetic modifiers were found in many DNA-repair genes that affect the stability of DNA repeats14,15,16,17,18,19. Neurodegeneration in HD was subsequently found to be caused by somatic expansion of this repeat beyond a high threshold of about 150 CAG repeats20. The genetic-modifier studies, so far of up to 16,640 persons with HD, have provided early clues toward a few potential therapeutic targets for slowing or halting somatic expansion of DNA repeats; however, the number of such potential targets is so far modest.

Whole-genome sequencing (WGS) of biobank cohorts offers opportunities to study repeat instability in much larger sample sizes than previously possible. Here we analysed repeat instability at 356,131 polymorphic repeat loci using short-read WGS data from the blood-derived DNA of 490,416 participants in UK Biobank (UKB)21 and 414,830 participants in All of Us (AoU)22. To do so, we developed several computational techniques, overcoming challenges in estimating the length and instability of DNA repeats from large numbers of short WGS reads23. These methods enabled us to characterize allele-specific expansion and contraction rates of common repeats, identify genetic influences on somatic repeat expansion and identify associations of expanded repeats with diseases.

CAG-repeat expansions in the UKB

We began by analysing CAG trinucleotide repeats, which we could efficiently ascertain from biobank sequencing data and which cause many progressive, neurodegenerative repeat-expansion disorders1,2. We identified UKB participants with long CAG-repeat alleles (≥45 repeat units) by analysing WGS data for 151 bp sequencing reads comprised entirely or almost entirely of CAG-repeat units (in-repeat reads (IRRs); Extended Data Fig. 1a). Such reads were easily extractable, as nearly all of them had been aligned to the TCF4 CAG-repeat sequence by bwa24 (Supplementary Fig. 1). For each participant with one or more IRRs, we determined the locus or loci from which the IRRs originated by identifying mate sequences that mapped near one of 1,159 commonly polymorphic CAG-repeat loci3.

The vast majority of CAG-repeat expansions in the UKB occurred at only a few loci: 18 autosomal CAG-repeat sequences in the human genome were expanded to at least 45 repeat units in at least five UKB participants (Extended Data Fig. 1b and Supplementary Table 1). Three repeat loci were expanded in thousands of UKB participants—CA10 (137,673 participants), TCF4 (42,004) and ATXN8OS (7,736)—together accounting for 97% of all observed expansions beyond 45 repeat units. Most of these repeats (15 out of 18) were in transcribed genomic regions, consistent with the idea that transcription contributes to repeat instability25 (Supplementary Table 2). For 9 out of the 18 repeats, expanded alleles are known to be pathogenic1.

To study the mutability of these repeats, we measured the lengths of common, short alleles of each repeat (≤30 repeat units) by analysing sequencing reads that spanned repeat alleles, focusing on 15 repeat loci that passed additional filters (Extended Data Fig. 1 and Supplementary Table 1). These analyses recovered repeat-length distributions consistent with previous analyses26 (Extended Data Fig. 1b).

Germline instability of common CAG repeats

We first analysed germline instability of these repeats, using the large UKB cohort to obtain high-resolution estimates of germline mutability (providing context for analyses of somatic mutability). To estimate allele-specific intergenerational expansion and contraction rates of each repeat, we analysed length discordances among alleles belonging to genomic tracts inherited identical-by-descent (IBD) from shared ancestors, building on IBD-based analyses of single-nucleotide mutations27,28,29 (Fig. 1a). We validated this approach using two complementary methods (Supplementary Fig. 2).

Fig. 1: Germline and somatic instability of common CAG-repeat alleles.
figure 1

a, Germline mutation rates were estimated by analysing discordance rates among alleles inherited within IBD tracts shared by pairs of UKB participants. Ancestral alleles were imputed from more-distantly shared haplotypes. b, Per-generation rates of germline expansion (+1 repeat unit) and contraction (−1 repeat unit) of GLS and TCF4 repeat alleles, estimated in the UKB. c, The analytical strategy for estimating somatic mutation rates by detecting and filtering out reads that are likely to reflect PCR artifacts introduced during sequencing. During PCR-based bridge amplification on a flow cell, a DNA fragment is clonally amplified into a cluster of colocalized DNA molecules. A PCR stutter error results in a polyclonal cluster containing a mixture of DNA molecules with and without the error. If the molecules containing the error constitute the majority of the cluster, the sequencing read generated from the cluster (reflecting the majority base at each position within the read) will contain the error, but the heterogeneity of the cluster will reduce base qualities at positions within the read that mismatch between molecules with and without the error. d, The rates of somatic expansion of GLS and TCF4 repeat alleles (that is, the fractions of blood cells in which an allele has expanded by +1 repeat unit), stratified by age in AoU. e, Somatic mutation rates in the UKB plotted against germline mutation rates for GLS and TCF4 repeat alleles. The error bars show the 95% confidence intervals (CIs). Sample sizes are provided in Supplementary Table 3.

Across all 15 CAG-repeat loci, intergenerational mutation rates increased with allele length, rising to 0.5–0.9% per generation for single-repeat-unit expansions of the longest common alleles of repeats in GLS, DMPK and ATXN8OS (Extended Data Figs. 1b and 2). The average mutation rate per locus ranged from 8.2 × 10−5 to 9.5 × 10−4 (Supplementary Table 3). These rates are relatively high for trinucleotide repeats8 and exceed the genome-wide average for STRs (around 5 × 10−5 per haplotype per generation)8,9,10. Repeat loci tended to either expand more often than contract (particularly so for ATXN8OS and GLS) or to have similar expansion and contraction rates (Extended Data Figs. 1b and 2). Interruptions of repeat sequences (that is, intrarepeat sequence variants) greatly stabilized alleles: a common 18-repeat TCF4 allele containing an interruption in its ninth repeat unit exhibited a 135-fold (54–336, 95% CI) lower expansion rate compared with the uninterrupted 18-repeat allele, and an interruption in the second-to-last repeat unit of a 19-repeat GLS allele decreased the expansion rate 3.7-fold (1.9–7.2, 95% CI) (Fig. 1b). These results corroborate previous observations that repeat interruptions stabilize the expansion of pathogenic alleles30,31,32 and quantify the strength of such effects in the germline.

Somatic expansion of common CAG repeats

These high rates of germline instability led us to wonder whether common alleles of some repeats might be sufficiently unstable in blood cells for somatic length-change mutations to be ascertainable in short-read WGS data. Identifying such mutations is challenging because polymerase slippage during PCR amplification can spuriously alter repeat lengths33,34,35. Such ‘PCR stutter’ errors are unavoidable during Illumina sequencing by synthesis, which uses PCR for bridge amplification of DNA fragments36. However, we realized that this PCR error mode tends to produce predictable patterns of reduced base quality scores within sequencing reads, enabling us to detect and exclude most reads with artefactual CAG length mutations (Fig. 1c and Supplementary Fig. 3). We applied this filtering strategy in the UKB to estimate repeat-specific, allele-specific somatic expansion rates, which we quantified as the average fraction of blood cells in which a given repeat allele has expanded by one repeat unit.

For 4 out of the 15 CAG repeats (in TCF4, GLS, DMPK and ATN1), we detected significant increases in somatic single-repeat-unit expansion rates with age (Extended Data Fig. 3). These findings were replicated in AoU, in which the wider age range of participants (aged 18 to 90+ years) revealed clear increases in fractions of blood cells containing somatic expansions with increasing age and with increasing allele length (Fig. 1d and Extended Data Fig. 4). TCF4 repeats were the most somatically unstable: individuals carrying alleles with 25 or more repeat units typically exhibited somatic expansion in more than 1% of blood cells by the age of 55 years (Fig. 1d). We did not observe age-associated contraction of any of the 15 repeat loci.

Comparing these estimates of somatic one-repeat-unit expansion rates with our estimates of intergenerational mutation rates showed that the relative (blood/germline) rates of CAG-repeat expansion varied severalfold across repeat loci (Fig. 1e). The TCF4 repeat exhibited the greatest somatic instability in blood but was relatively stable in the germline, whereas the GLS repeat displayed the opposite behaviour (Fig. 1e), as did the DMPK repeat (Extended Data Figs. 1b, 2 and 4). These results align with observations that somatic instability of pathogenic repeat expansions is highly tissue-specific, perhaps due to differences in transcription or trans-acting factors25,37,38,39,40,41. Consistent with the former hypothesis, the four repeats for which we detected instability in blood are in genes with significantly higher expression in blood (Wilcoxon rank-sum test, P = 0.034; note that all P values reported in this Article were calculated using two-sided statistical tests; Supplementary Fig. 4).

Somatic expansion of long TCF4 CAG repeats

The high somatic expansion rates of TCF4 repeat alleles—even those of shorter lengths—suggested the possibility that long TCF4 alleles (≥45 repeat units) might be sufficiently unstable in blood to allow individual-level phenotyping of somatic expansion using short-read WGS data. This would provide an opportunity to learn about instability of long repeats from somatic expansions in very many people—potentially enabling the identification of genetic modifiers of repeat instability13,14,15,16,17,18,19,42,43—as long TCF4 alleles are common (42,004 carriers in the UKB; Extended Data Fig. 1b).

A barrier to analysing repeat expansions from short-read WGS data is that alleles exceeding the length of a sequencing read (151 bp) cannot be directly sized. However, short-read WGS data does permit rough estimation of the length of a long allele by counting in-repeat reads44 (Extended Data Fig. 1a). In an individual who is mosaic for somatic expansions that vary across cells, this approach estimates the average length of expanded alleles.

We analysed somatic expansion of long TCF4 alleles in the UKB and AoU by applying this approach with two methodological improvements. First, to control for variation in lengths of inherited TCF4 alleles, we used imputation to calibrate each individual’s allele length against measurements from other individuals sharing the same inherited allele (in lieu of longitudinal measurements). Stratifying individuals by imputed allele length showed that somatic expansion accelerates rapidly with TCF4 allele size, reaching around 1 repeat unit per year for 100-repeat alleles (Extended Data Fig. 5a). Second, to reduce noise in estimates of long TCF4 allele lengths, we devised a better-powered metric based on the number of sequenced DNA fragments derived from a highly expanded repeat (Extended Data Figs. 5b and 6a,b). Long-read sequencing of blood-derived DNA from AoU participants (n = 1,027, of whom 28 had long TCF4 alleles) corroborated TCF4 allele-length estimates from short-read WGS and demonstrated extensive mosaicism of expanded alleles41,45 (Extended Data Fig. 5c).

Genetic modifiers of TCF4 repeat expansion

Genome-wide association analysis of an optimized TCF4 somatic-expansion phenotype (Extended Data Fig. 6c) in 48,448 UKB and AoU participants identified seven loci at which common variants modulate TCF4 repeat expansion in blood (P < 5 × 10−8; Fig. 2a and Supplementary Table 4). Four loci—at MSH3 (P = 2.0 × 10−52), FAN1 (P = 8.5 × 10−29), ATAD5 (P = 4.9 × 10−12) and PMS2 (P = 3.0 × 10−8)—overlapped DNA-repair and DNA-damage-response genes that were recently implicated in somatic expansion of the HTT CAG repeat in blood18 (Fig. 2a). The three other modifier loci included GADD45A (P = 2.9 × 10−8), which encodes a growth arrest and DNA damage protein that binds to R-loops46.

Fig. 2: Genetic influences on somatic expansion of TCF4 repeat alleles in blood.
figure 2

a, Genome-wide associations with somatic instability of long TCF4 repeat alleles in the blood (top, meta-analysed across the UKB (n = 40,231) and AoU (n = 8,217)) compared with genetic associations with somatic instability of pathogenic HTT repeat alleles in the blood (bottom; from ref. 18). The TCF4 locus is shown in grey because these associations could reflect imperfect control for inherited TCF4 allele length. b, Comparison of the effect sizes of variants at MSH3, PMS2, FAN1 and ATAD5 for somatic expansion of HTT repeats in the blood (quantified by the somatic expansion ratio; SER18) versus TCF4 repeats in the blood. c, Analogous comparison for variant effect sizes for hastening of an HD clinical landmark of cognitive decline (symbol digit modalities test, SDMT)18. In each plot, variants within 1 Mb of the lead variant for TCF4 somatic expansion are plotted in black if they reached P < 10−5 for association with at least one of the two phenotypes; for FAN1, a subset of these variants is plotted in red or blue according to linkage disequilibrium with the two low-frequency FAN1 missense variants (r2 > 0.05). Variants with P > 10−5 for both phenotypes are plotted in light grey.

Comparing genetic modifiers of somatic expansion of the TCF4 and HTT CAG repeats in blood revealed both consistency and heterogeneity of effects (Fig. 2a,b). Common haplotypes at PMS2, FAN1 and ATAD5 were associated with broadly concordant effects on TCF4 and HTT repeat expansion in blood, whereas at MSH3, common haplotypes that decreased expansion of the TCF4 repeat tended to increase expansion of the HTT repeat in blood (Fig. 2b). Moreover, the strongest modifier of HTT expansion in blood—a haplotype containing a missense variant in MSH2 also implicated in germline STR mutation9—appeared not to affect TCF4 expansion (P = 0.96; Fig. 2a and Supplementary Table 5).

Similarly, comparing genetic modifiers of TCF4 repeat expansion in the blood to genetic modifiers of HD age-at-landmark phenotypes18 (which are probably regulating HTT repeat expansion in neurons) showed that, at both MSH3 and FAN1, common haplotypes that decreased expansion of the TCF4 repeat in blood appeared to increase the expansion of the HTT repeat in the brain (Fig. 2c and Supplementary Table 6). By contrast, two missense variants that reduce FAN1 activity47 appeared to increase the expansion of both the TCF4 repeat (in blood) and HTT repeat (in brain) (Fig. 2c). These results suggest that the tissue-specific instability of many trinucleotide repeats37,38,39,40,41 may arise from complex regulation of mismatch repair processes that differs across cell types18 and even across repeat loci, perhaps interacting with locus-specific differences in chromatin structure or other epigenomic properties.

We also compared modifiers of TCF4 repeat expansion in blood to loci that influence risk of Fuchs endothelial corneal dystrophy (FECD), a common age-associated eye disorder that is thought to be caused (in most cases) by expansion of the TCF4 repeat in corneal endothelial cells48,49. Notably, no modifiers of TCF4 repeat expansion in the blood overlapped with FECD risk loci50, and none of our lead variants for TCF4 blood instability (Supplementary Table 4) were associated with FECD (P > 0.15) in a recent well-powered GWAS50,51. Moreover, FECD risk conferred by long TCF4 repeats appeared to plateau for allele lengths beyond around 75 repeat units (Extended Data Fig. 7). Further work will be required to determine whether the instability-modifying genetic effects that we identified are specific to blood (which is conceivable given the very different (more extreme) dynamics of TCF4 somatic expansion in corneal endothelium41) and whether any modifiers of somatic expansion influence age at FECD onset.

Varied genetic effects on instability of 17 STRs

A much-larger set of DNA repeats involves other (non-CAG) sequence motifs, and the above results motivated us to investigate their expansion. To this end, we developed a computationally efficient tool for extracting IRRs with any 2–6 bp motif from WGS read alignments and applied it to the UKB WGS data. Mapping these IRRs to 356,131 polymorphic STRs identified 154 STRs for which long repeat alleles (>150 bp) were common (>0.5% carrier frequency; Supplementary Data 1). We constructed somatic-expansion phenotypes from IRR counts for these repeats, controlling for inherited allele lengths (inferred from imputation) as before. To identify STRs with evidence of somatic expansion, we tested these phenotypes for association with age or with the MSH3 and MSH2 haplotypes that were most strongly associated with blood instability of TCF4 and HTT repeats. This approach was motivated by initial GWAS analyses on STRs with somatic-expansion phenotypes that associated with age (Supplementary Methods): in these analyses, MSH2 and MSH3 haplotypes were consistently the lead variants and sometimes associated more strongly than age, suggesting that, for some repeats, the effects of genetic modifiers—which act across an individual’s years of life (mean 56.5 years in the UKB)—might be easier to detect than the effects of age differences (s.d., 8 years) on somatic expansion.

These analyses identified 17 STRs for which one or more of the three tests suggested evidence of somatic instability (P < 0.0001; Fig. 3a and Supplementary Data 2). These 17 unstable STRs represented 7 distinct 2–5 bp repeat motifs. Half of these STRs were located in genes that are highly expressed in blood, while six appeared to be in untranscribed regions (Fig. 3a). At some unstable STRs, expanded alleles were very common. Long alleles of an intronic AAAG tetranucleotide repeat in ADGRE2 (carried by 49% of European-ancestry UKB participants) expanded at an average rate of 0.4 repeat units per decade, demonstrating that human genomes commonly contain repeat elements that expand as we age (Fig. 3b).

Fig. 3: Variation in genetic influences on 17 unstable STRs.
figure 3

a, Genomic context, population frequencies (freq) among 420,522 unrelated European-ancestry UKB participants, associations (assoc.) with MSH2 and MSH3 variants and age, and the relative contributions of genetic modifiers of instability of 17 STRs. Prom., promoter. The relative contributions of five genetic modifier loci were estimated using local heritability analyses for STRs with sufficient signal (specifically, local heritability z-score > 2.5 for at least one of the five modifiers). b, The mean lengths of long alleles among UKB participants heterozygous for a long allele (based on imputation; from left to right, n = 155,291, 81,387, 257,934 and 240,294), stratified by age quintile and by genotype of an instability-modifying haplotype. The two STRs with the strongest age association and the two with the strongest genetic associations are shown. Error bars show the 95% CIs. c, Associations of modifier haplotypes of DNA-repair and DNA-damage-response genes with blood instability of 10 STRs (including HTT18) and with hastening of four HD clinical phenotypes18. The table cells contain z statistics from association analyses; associations with P < 0.05, 0.01 and 0.001 are shaded in green or purple depending on whether the effect size agrees or disagrees with the consensus effect direction (cons. effect dir.) for blood instability. The effect sign in each table cell corresponds to the direction of effect on repeat instability: a positive effect indicates that the alternate allele associated with increased somatic expansion or hastening of an HD clinical landmark. TFC6, a score of 6 on the 13-point total functional capacity scale; TMS, total motor score.

GWAS of somatic-expansion phenotypes for the 17 unstable STRs identified 7 loci at which common variants appear to modulate instability of these repeats in blood cells (P = 1.2 × 10−9 to 1.4 × 10−878; Extended Data Fig. 8 and Supplementary Data 3). Variants in four mismatch repair genes (MLH3, MSH3, MSH2 and PMS2) were each associated with somatic expansion of three or more STRs. The relative contributions of these genes to repeat instability varied across STRs, with MSH2 variation having greater influences on dinucleotide repeats and MSH3 variation having greater influences on STRs with longer motifs (Fig. 3a and Supplementary Data 4), consistent with MutSβ (a heterodimer of MSH2 and MSH3) having higher affinity for longer insertion–deletion loops in DNA52,53. Across a broader set of modifier haplotypes identified by fine-mapping genetic associations with optimized somatic-expansion phenotypes (see below), different modifiers appeared to influence different subsets of STRs (Fig. 3c) but generally with a consistent effect direction, with the exception of several opposite-direction effects on somatic expansion of the TCF4 repeat (Fig. 3c). Multiple modifiers were associated with opposite-direction effects on STR expansion in the blood compared with HTT repeat expansion in the brain (as inferred from the timing of HD phenotypes), consistent with recent findings18 (Fig. 3c).

Genetic determinants of AAAG-repeat expansions

Somatic expansion of AAAG repeats at two loci (at chromosome 2: 232.4 Mb and chromosome 19: 14.8 Mb) was particularly strongly shaped by inherited variation, prompting deeper analyses. Mid-length alleles of these repeats (19–26 repeat units) were sufficiently common and unstable for somatic expansions to often be directly observable from spanning reads, enabling us to construct mid-length-allele somatic-expansion phenotypes for these two STRs in the UKB and AoU. We also optimized the common chromosome 19: 14.8 Mb (ADGRE2) long-allele somatic-expansion phenotype in the UKB to increase GWAS power.

Common and low-frequency variants (minor allele frequency (MAF) > 0.1%) at 26 loci were associated with these AAAG somatic-expansion phenotypes (P = 5 × 10−8 to 2.5 × 10−1,438; Fig. 4a and Supplementary Data 5). Beyond the top associations at DNA-repair genes previously found to be modifiers in HD (MSH3, MSH2, MLH3, PMS2 and PMS1), additional modifier loci included genes involved in recognizing DNA damage (XPC and PARP1; P = 1.1 × 10−21 and 2.4 × 10−45) and in responses to DNA damage (NEIL2, which encodes a DNA glycosylase involved in base excision repair54, and SMARCAD1, which encodes a chromatin remodeller that can play critical roles in DNA repair55; P = 1.4 × 10−14 and 1.2 × 10−20). Other loci with subtler effects on somatic expansion contained genes for transcription factors such as RUNX1 and the HOXA and HOXB clusters, suggesting additional, harder-to-interpret complexity in the ways in which genetics shapes repeat instability. The effect sizes of instability-modifying haplotypes were broadly consistent across the two AAAG-repeat loci (chromosome 2: 232.4 Mb versus chromosome 19: 14.8 Mb) and across allele length ranges (mid-length versus long alleles) with a few exceptions (for example, variation at PARP1 appeared to have a larger influence on mid-length alleles; Fig. 4b and Extended Data Fig. 9). Fine-mapping analyses suggested the presence of multiple causal variants at several modifier loci including an expression quantitative trait locus (eQTL) signal at MSH6 (Supplementary Data 6).

Fig. 4: Genetic determinants of somatic expansion of common AAAG repeats in blood.
figure 4

a, Genome-wide associations with blood-instability of long chromosome 19: 14.8 Mb AAAG repeat alleles (top; UKB), mid-length chromosome 19: 14.8 Mb AAAG repeats (middle; UKB + AoU meta-analysis) and mid-length chromosome 2: 232.4 Mb AAAG repeats (bottom; UKB + AoU). Associations of variants proximal to the repeat loci are shown in grey as they could reflect imperfect control for lengths of inherited alleles. b, Comparison of effect sizes for blood instability of mid-length chromosome 2: 232.4 Mb AAAG repeats versus long chromosome 19: 14.8 Mb AAAG repeats for index variants from the common-variant GWAS (that is, lead associations with MAF > 1%). Black dots correspond to labelled loci. c, Analogous comparison for rare pLoF variants in the UKB. The blue shaded rectangle and grey dots within it correspond to effect size range and common-variant associations shown in b. d, The mean level of somatic expansion of mid-length chromosome 2: 232.4 Mb AAAG repeat alleles (that is, the average increase in repeat length across spanning reads) observed in AoU participants who inherited one allele of the indicated length (19, 20, 21, 22, 23, 24, 25 or 26 repeat units). Plots were stratified by age and by polygenic score for somatic expansion (using a model that was fit on UKB data). e, The relative expansion rates of mid-length chromosome 2: 232.4 Mb AAAG repeats in subsets of AoU participants with different pairwise combinations of three modifier genotypes: the lead MSH3 eQTL (chromosome 5: 80632699T>C), the MSH2 G322D missense SNP (chromosome 2: 47416318G>A) and a top MLH3 splicing QTL (chromosome 14: 75002247G>C). The error bars show the 95% CIs. Sample sizes are provided in Supplementary Data 5.

Burden tests identified strong instability-modifying effects of rare coding variants in nine genes (Fig. 4c and Supplementary Table 7). These genes included seven mismatch repair genes, corroborating recent results from an HD mouse model56, as well as XPC (P = 6.3 × 10−14) and NEIL2 (P = 1.7 × 10−6), at which we also detected common modifier haplotypes. Variants predicted to cause loss-of-function (pLoF) of XPC (P = 4.8 × 10−10) and NEIL2 (P = 0.0015) were associated with considerably increased somatic mutation of chromosome 19: 14.8 Mb AAAG repeats (β = +0.32 (s.e., 0.05) and +0.22 (s.e., 0.07) standard deviations; Fig. 4c), suggesting that deficiencies in nucleotide-excision repair and base-excision repair increase instability of repeated DNA. NEIL2 was also recently implicated in de novo STR mutation9. A low-frequency missense variant in PMS1 (rs61756360, PMS1 T75I; allele frequency = 0.2%) was associated with decreased expansion of AAAG repeats (Supplementary Data 5) and appeared to have a previously unrecognized onset-delaying effect on HD (β = 5.2 years (s.e., 1.9)); P = 0.005 in ref. 17).

The combined effects of these genetic modifiers across an individual’s genome strongly influenced mutation rates of these AAAG repeats: somatic expansion rates varied by 4.0-fold (3.7–4.3) between individuals with the highest and lowest 5% of polygenic scores (trained in the UKB and applied in AoU) for expansion of the chromosome 2: 232.4 Mb AAAG repeat (Fig. 4d). These strong genetic effects also provided power to explore gene-by-gene and gene-by-environment interactions. We observed approximately multiplicative effects of genetic modifiers with one another (Fig. 4e) and with smoking, which was associated with substantially increased somatic mutability of chromosome 2: 232.4 Mb AAAG repeats (1.33-fold (1.30–1.36) for current versus never-smokers) but not chromosome 19: 14.8 Mb AAAG repeats (Extended Data Fig. 10).

GLS repeat expansion is associated with disease

The deep phenotyping of the UKB cohort offered an opportunity to search for effects of repeat expansions on a wide variety of diseases and other clinical phenotypes. Phenome-wide analyses of 67,405 repeat loci expanded in at least 5 UKB participants (for association with 6,483 disease phenotypes and 57 heritable quantitative traits) identified 7 repeat loci involved in 23 likely causal associations with quantitative phenotypes and 46 associations with disease phenotypes (Supplementary Data 7 and 8). Associations of expansions in C9orf72, TCF4 and DMPK with disease phenotypes reflected the known roles of these repeats in amyotrophic lateral sclerosis and frontotemporal dementia, FECD and myotonic dystrophy1, and associations involving repeats in AFF3 and DIP2B were recently reported57. However, the association of repeat expansions in GLS with a biomarker of liver function was surprising, as such expansions in GLS (which encodes kidney-type glutaminase) have been observed to be pathogenic in only extremely rare cases of severe, childhood-onset recessive glutaminase deficiency58,59. The large UKB population sample made it possible to see that heterozygous carriers of long GLS alleles exhibited anomalous phenotypes.

In the UKB, 139 individuals (0.03%) carried a long CAG repeat (at least 45 repeat units) within the 5′ untranslated region (5′-UTR) of GLS (Fig. 5a). Most of these individuals (98 out of 139) exhibited evidence of a highly expanded allele (around 100+ repeats based on observing IRR pairs; Extended Data Fig. 1b), consistent with long GLS repeats being highly unstable somatically58 and in the germline (Fig. 1b,d,e). Germline instability of the repeat increased rapidly with allele length: mid-length alleles (25–40 repeats) were already sufficiently unstable for mutations to be observable between close relatives, including a quartet of genetically inferred first cousins among whom multiple intergenerational mutations had occurred (Fig. 5b).

Fig. 5: Instability and pathogenicity of CAG-repeat expansions in the 5′ UTR of GLS.
figure 5

a, The location of a polymorphic CAG repeat within the 5′ UTR of GLS. b, The intergenerational expansion rates of short- and mid-length GLS alleles estimated from IBD (for common alleles containing ≤24 repeat units) and from pairs of related UKB participants carrying rarer, mid-length alleles (25–34 repeat units; n = 390). Inset: multiple mutations observed among four related UKB participants (likely to be mutual first cousins) carrying mid-length GLS repeat alleles. c, The mean GGT, cystatin C and glutamine levels (adjusted for age, age squared and sex) among individuals with long GLS alleles (stratified into quintiles; n = 131) and individuals without a long allele (plotted at the modal length of 8 repeats). d, Effect sizes of GLS alleles for liver and renal biomarkers and other quantitative traits measured in the UKB, adjusted for age, age squared and sex. GLS repeat expansions and SNP and indel variants predicted to cause loss of function (n = 109) were all associated with increased glutamine levels (inverse-normal transformed) and decreased height; by contrast, only highly expanded GLS repeat alleles (≥1 IRR pair, that is, around 100+ repeat units; n = 94) were associated with altered serum biomarker levels. The shaded rectangles indicate biomarker effect directions typically associated with disease. e, Increased odds of liver and renal diseases among UKB and AoU participants with highly expanded GLS repeat alleles compared with participants without a long allele; points plotted are odds ratios. The stage 5 chronic kidney disease phenotype included ICD-10 codes for end-stage renal disease. Analyses in the UKB (n = 421,377) were adjusted for age and sex; analyses in AoU (n = 229,043) were adjusted for age, age squared, sex and genetic ancestry. f, Contrasting pathogenic effects and hypothesized mechanism for heterozygous GLS repeat expansions compared with biallelic GLS expansion or LoF. The error bars show the 95% CIs.

Repeat expansions in GLS were associated strongly with elevated biomarkers of liver and kidney disease (P = 3.4 × 10−15 for gamma-glutamyl transferase (GGT); P = 7.3 × 10−7 for cystatin C). These associations appeared to be driven by highly expanded alleles (beyond a threshold of around 100–200 repeat units; Fig. 5c); single-nucleotide polymorphism (SNP) and insertion–deletion (indel) pLoF variants in GLS did not associate with these biomarkers (Fig. 5d and Supplementary Table 8). Carriers of highly expanded alleles exhibited severalfold increased risk of liver diseases and chronic kidney disease, results that were replicated in AoU, with particularly elevated risk of stage 5 chronic kidney disease (OR = 14.0 (5.7–34.3), P = 7.2 × 10−9; in meta-analysis across UKB and AoU; Fig. 5e and Supplementary Table 9). These results suggest that highly expanded GLS repeats cause a dominant (but low penetrance) DNA-repeat disorder that is distinct from recessive glutaminase deficiency (Fig. 5f). Glutaminase deficiency is caused by biallelic impairment of GLS function—either by SNP/indel pLoF variants60 or long GLS repeats that suppress GLS expression58, both of which were associated with elevated serum glutamine in the UKB (Fig. 5c,d and Supplementary Table 8). By contrast, the effects on kidney and liver biomarkers were specific to highly expanded repeats, indicating a different pathological mechanism unrelated to GLS function, such as RNA toxicity61 (Fig. 5f).

Discussion

These results show how biobank WGS datasets contain abundant information about unstable DNA repeats. We observed that somatic expansion (in blood) of some repeats is under strong genetic control. Different repeats appear to be affected by a largely shared set of common alleles at DNA-repair genes, but the relative influence of these alleles varied across repeat loci, and their effect directions even varied: common modifier haplotypes at several loci appeared to have opposite effects on blood instability of the TCF4 CAG repeat compared with other repeats, including the HTT CAG repeat. These results, along with the repeat locus specificity that we observed in the relative mutation rates of repeats in blood versus the germline, reinforce recent evidence of tissue specificity of genetic modifiers of HTT expansion in blood versus brain18, pointing to highly complex regulation of somatic repeat expansion that varies across repeats and cell types. The modulation of genetic effects by locus-specific effects may suggest roles for locus-specific chromatinization or transcriptional dynamics and will be an interesting area for mechanistic studies.

The clear and strong differences in genetic effects on repeat expansion of different repeats and in different tissues suggest a need for care and caution in efforts to use DNA repeats in clinically accessible tissues (such as blood) to inform on the status of somatic expansion in disease-relevant tissues (such as brain). However, our results also suggest the potential for repeats that are unstable in blood to be used as biomarkers of target engagement for future expansion-slowing therapies. We identified several repeat loci at which common alleles expand in blood as humans age, at rates that are strongly influenced by genetic modifiers (for example, at MSH3). Future analyses using long read sequencing to identify hypermutable loci with longer alleles10 may detect even better candidates.

The deep phenotype data available in biobank datasets also enabled us to observe evidence suggestive of a dominant DNA-repeat disorder involving highly expanded 5′-UTR repeat alleles in GLS, which was associated with a severalfold higher risk of kidney and liver diseases. Large WGS cohorts provide an opportunity to identify pathogenic rare alleles that, despite their strong effects on disease risk, have not been identified to date owing to their low penetrance in families. Analyses of the phenotypic effects of common repeat variation, which we did not undertake here, may reveal subclinical phenotypes and may also resolve the question of whether intermediate-length alleles of pathogenic repeats have any beneficial effects (that could in principle cause them to persist in human populations); association analyses conducted to date6,7 have not detected evidence of such effects.

Analysis of repeat instability in population biobanks does have several limitations. Although we could study germline mutation rates by analysing IBD among unrelated individuals, we could not assess effects of genetic variation, parental age or parent-of-origin on germline mutability, as this requires ascertaining de novo mutations8,9,10. Moreover, the short-read WGS data that we analysed provided only glimpses of somatic mutation, through observations of one or a few reads spanning shorter mutated alleles and through read-count-based evidence of expanded alleles of unknown lengths. Nonetheless, the analytical tools that we have developed here for biobank-scale WGS analysis provide a useful complement to studying repeat instability in families8,9,10 and in patient cohorts using targeted sequencing techniques18,62, and combining these approaches should provide opportunities for further discovery.

Methods

Methods are provided in the Supplementary Information.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.