Introduction

Alzheimer’s disease (AD) is the most common neurodegenerative disorder in the United States and has a growing prevalence in our aging population, yet there is a lack of effective treatments1. Delineating the genetic basis of AD is crucial to uncovering the underlying genes and molecular mechanisms and can spur development of more targeted therapies. Despite studies in hundreds of thousands of individuals, most of the genetic risk for AD has yet to be identified2. One source of genetic variation that has not been explored in AD is the approximately one million short tandem repeat (STR) tracts in the human genome, which are classically defined as DNA sequences composed of repeated units of 2–6 bp motifs. STR tract length (i.e., number of repeat units) is highly polymorphic in the population and linked to widespread gene expression changes, thus representing a potential source of functional genetic variation3,4,5,6,7,8. Moreover, pathogenic expansions in the tract lengths of STRs cause over 30 monogenic neurological disorders such as Huntington’s disease9,10. Thus, it is important to assess whether STRs may also contribute to the risk of AD.

Pathogenic expansions of specific STR tracts cause >30 monogenic disorders such as fragile X syndrome, myotonic dystrophy, and Huntington’s disease9,10. A single pathogenic STR expansion confers the majority of the genetic risk for these disorders. However, the role of STRs in promoting genetic risk for polygenic disorders such as AD is not well understood. Our ability to understand the relationship between STRs and polygenic disorders has been limited by the need for genome-wide measures of STR lengths in large sample sizes. Recent studies in large cohorts have revealed that patients with autism spectrum disorder and schizophrenia carry a higher burden of germline STR expansions11,12,13. These studies suggest that, in contrast to our traditional view of a single STR conferring disease risk for monogenic disorders, many STRs distributed throughout the genome in aggregate can contribute to neuropsychiatric disease risk in a polygenic fashion.

The majority of STRs in the genome are not within genes, and the role of these intergenic STRs in disease is less well understood. While known disease-associated STR tracts are within gene bodies (exons, introns, or untranslated regions)9, pathogenic-length expansions of disease-associated STRs have been linked to alterations beyond protein-coding functions, including severe disruption to histone modifications, DNA methylation, and genome folding14. Polymorphic changes in intergenic STRs are also known to correlate with expression of nearby genes, often in a tissue-specific manner5,15,16. Moreover, STRs correlated with gene expression are enriched at transcriptional start sites and colocalize with putative enhancers5. Of note, prior studies have also found widespread alterations in the epigenome profiled in post-mortem brain tissue from patients with AD17,18,19. Whether the lengths of intergenic STRs are altered in AD and interact with epigenetic changes has yet to be explored.

Measuring the lengths of STRs is challenging, especially when interrogating across the entire genome in many individuals. The repetitive nature of STRs makes sequencing and downstream processing prone to errors, and their tract lengths often exceed the lengths of traditional short-read next-generation sequencing reads. However, computational tools have been developed to overcome these challenges to infer the lengths of STRs from short-read sequencing data20,21,22,23, which have enabled genome-wide genotyping of STRs in cohorts of individuals11,12,13. Consequently, they provide a valuable opportunity to test the association of numerous STRs with disease risk across large cohorts of patients and uncover novel links between STRs and disease risk.

In this work, we sought to understand whether STRs may promote genetic risk of AD. We applied the computational tools ExpansionHunter22 and gangSTR20 to genotype STRs genome-wide using PCR-free WGS data from 2981 individuals with and without AD from the Alzheimer’s Disease Sequencing Project (ADSP). We implemented a rigorous approach to identify STR expansions based on having extreme STR tract lengths. Across the genome, we identified 9641 unique STR tracts that had STR expansions in at least one individual. Strikingly, individuals carrying a high burden of expanded STRs had 3.69-fold higher odds of having AD and had worse AD neuropathology, thus representing one of the strongest genetic effects on AD risk. STR expansions seen in individuals with AD are enriched in promoter regions active in post-mortem hippocampal tissue. These results suggest a model whereby the cumulative effect of multiple STR expansions across the genome promotes genetic risk of AD, and they provide important insights into the genetic architecture of AD.

Results

Genome-wide profiling of STR tracts in an AD cohort

In this study, we utilized a discovery cohort of 2393 samples (1213 AD cases and 1180 controls) from the National Institute on Aging Alzheimer’s Disease Centers (ADC) cohort of the ADSP WGS dataset (see Supplementary Table 1 for participant characteristics). For replication of our single STR associations, we utilized a replication cohort of 588 samples from the Religious Orders Study/Memory and Aging Project (ROSMAP; n = 309 AD cases, 279 controls). These samples were sequenced from peripheral blood-derived genomic DNA using a PCR-free Illumina whole genome sequencing (WGS) strategy to >30× coverage with 150 bp paired-end reads. We restricted our analyses to White individuals of European ancestry (based on both self-reported non-Hispanic White race/ethnicity and genetic ancestry as determined by principal component analysis [PCA] coordinates) (Supplementary Fig. 1a,b). There were no apparent differences between AD cases and controls based on PCA coordinates (Supplementary Fig. 1c), sequencing centre (Supplementary Fig. 1d), or sequencing cohort (Supplementary Fig. 1e). To limit potential technical artifacts that could lead to spurious results, we restricted our analysis to samples sequenced to between 30× and 50× coverage genome-wide. There was no statistically significant difference in sequencing coverage between case and control samples (p = 0.055, two-sided Student’s t-test) (Supplementary Fig. 1f). Together, these samples represent a deeply phenotyped case/control cohort that has been carefully filtered to mitigate the potential for technical artifacts and confounders.

To understand the role of STRs in AD, we sought to perform an unbiased genome-wide assessment of the relationship between STRs and AD risk (Fig. 1a). There are approximately 1 million STR tracts in the genome, and genotyping all of them across thousands of samples is computationally infeasible using existing methods. To limit the search space for our analyses and reduce computational costs, we focused on STRs that are polymorphic (i.e., vary in tract length in the population). To identify polymorphic STRs, we first applied the gangSTR algorithm to a catalogue of 895,826 STRs across the genome in a subset of 495 individuals of European ancestry from the ADSP (n = 246 AD cases and 249 controls). We identified 237,197 STRs that were polymorphic in these 495 samples. We merged this set of 237,197 STRs with 174,262 STRs previously identified to be polymorphic22, resulting in a union set of 312,731 unique polymorphic STRs (Supplementary Data 1). We used this panel of 312,731 STRs throughout this study to provide a genome-wide assessment of polymorphic STRs.
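The merge described above is a set union, so the reported counts imply how many loci the two catalogues share. The following is a minimal sketch only: the locus identifiers are hypothetical placeholders, and the function names are ours rather than part of any STR toolkit.

```python
# Toy illustration of merging two polymorphic-STR catalogues by set union.
# Locus identifiers below are hypothetical placeholders, not study data.

def merge_catalogues(a, b):
    """Return the union of two collections of STR locus identifiers."""
    return set(a) | set(b)

def implied_overlap(n_a, n_b, n_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return n_a + n_b - n_union

toy_a = {"chr1:100-120:AC", "chr2:500-530:CAG", "chr3:40-60:TTTA"}
toy_b = {"chr2:500-530:CAG", "chr4:900-915:GT"}
merged = merge_catalogues(toy_a, toy_b)

# Counts reported in the text: 237,197 and 174,262 loci merged into 312,731.
shared = implied_overlap(237_197, 174_262, 312_731)
print(len(merged), shared)
```

Under this reading, the two input sets would share 98,728 loci, which is simply the arithmetic consequence of the three reported counts.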

Fig. 1: Generation of a panel of polymorphic STRs.

a Schematic of study design. Middle section shows six example STRs (STR1-6), with rows representing STRs, columns representing individuals, and numbers representing non-reference STR tract lengths. Long expansions are bolded. STR1 is an example of a rare STR expansion seen only in individuals with AD. At the bottom are three models for testing for associations with AD case/control status. In model 1 (left), STR tract lengths are treated as a continuous variable. In model 2 (middle), STR expansions are defined as those longer than a given STR tract length threshold, and we compare the number of individuals with a given STR expansion between AD cases and controls. In model 3 (right), we identify rare, long STR expansions, and we compare the number of such STR expansions per individual across the genome between cases and controls. b Genomic distribution of STR panel. c Histogram of repeat unit lengths (number of base pairs [bp] in STR motif) for STR panel. Inset shows the subset of STRs with repeat units 7 bp or longer. d Histogram of STR tract lengths (number of repeat units) in the GRCh38 reference genome for each STR in the panel. e Histogram of median STR tract lengths relative to the GRCh38 reference genome as genotyped by ExpansionHunter. Negative values indicate a shorter median STR tract length relative to reference genome and positive values indicate longer median tract lengths relative to reference genome.

As genotyping STR tract lengths is challenging using short-read sequencing data, we performed comparisons with long-read sequencing using PacBio technology. We utilized four samples from the 1000 Genomes Project that were not used in this study24. We called STR tract lengths from PacBio sequencing data using the TRGT software25. We compared STR genotypes from PacBio data to STR genotypes identified using ExpansionHunter and gangSTR from short-read Illumina data on these same four samples. Overall, we observed strong correlations in STR tract lengths as estimated by ExpansionHunter and gangSTR when compared with PacBio (Spearman r2 ~ 0.96–0.97) (Supplementary Fig. 2a-h). Focusing on STR tracts ≥10 repeat units longer than the reference genome as identified by PacBio long-read sequencing, we found that ExpansionHunter achieved 70.0–73.3% sensitivity, while gangSTR achieved 58.0–64.7% sensitivity (Supplementary Fig. 2i). In contrast, ExpansionHunter demonstrated 63.0–67.7% specificity compared with 81.7–85.0% specificity for gangSTR for STR tracts ≥10 repeat units longer than the reference genome as identified in the Illumina short-read data (Supplementary Fig. 2j). Together, these results suggest that ExpansionHunter is more sensitive for STR expansions, while gangSTR has higher specificity, consistent with prior observations26.
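To make the two comparison metrics concrete, the sketch below computes them on toy genotypes. It assumes, as our reading of the text rather than a stated definition, that sensitivity is the share of long-read expansions (≥10 repeat units above reference) also called from short reads, and that the second metric is the share of short-read expansion calls confirmed by long reads. All sample values and names are hypothetical.

```python
# Compare short-read expansion calls against long-read calls for the same loci.
# Values are tract lengths in repeat units relative to the reference (toy data).

def expansion_agreement(long_read, short_read, threshold=10):
    """Return (sensitivity, confirmed_fraction) for calls >= threshold.

    sensitivity: share of long-read expansions also called from short reads.
    confirmed_fraction: share of short-read expansions confirmed by long reads.
    Only loci genotyped by both technologies are compared.
    """
    shared = long_read.keys() & short_read.keys()
    truth = {s for s in shared if long_read[s] >= threshold}
    called = {s for s in shared if short_read[s] >= threshold}
    both = truth & called
    sensitivity = len(both) / len(truth) if truth else float("nan")
    confirmed = len(both) / len(called) if called else float("nan")
    return sensitivity, confirmed

long_read = {"str1": 12, "str2": 15, "str3": 2, "str4": 11, "str5": 0}
short_read = {"str1": 13, "str2": 4, "str3": 10, "str4": 12, "str5": 1}
sens, conf = expansion_agreement(long_read, short_read)
print(round(sens, 3), round(conf, 3))
```

In this toy example one long-read expansion is missed (str2) and one short-read call is unconfirmed (str3), which is the kind of trade-off between the two genotypers described above.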

The STRs in the polymorphic STR set were largely not within protein-coding regions of the genome: 47.4% of STRs were within introns, 36.7% in distal noncoding regions (defined as being outside of gene bodies and at least 3 kb from the nearest transcriptional start site [TSS]), and 14.1% in promoter regions (defined as ≤3 kb upstream from a TSS) (Fig. 1b). Only 0.76% of STRs were in coding exons. The high number of polymorphic STRs in gene promoters is consistent with prior studies showing that STRs are enriched in gene promoters and have important gene regulatory roles27,28. The majority (97.7%) of STRs had a repeat unit size between 2–6 bp, which is the classic definition of an STR (Fig. 1c). Only 0.39% of STRs had a repeat unit size larger than 15 bp. In the GRCh38 reference genome, STRs in the panel generally had short tract lengths, with a median of 6 repeat units per tract and 99.8% of STR tract lengths ≤30 repeat units (Fig. 1d).

We next used the ExpansionHunter and gangSTR algorithms to genotype our panel of 312,731 polymorphic STRs. We found overall high concordance between ExpansionHunter and gangSTR genotypes. For 61.2% of STRs, gangSTR and ExpansionHunter genotypes were at least 90% concordant across individuals (Supplementary Fig. 3a). The tract lengths were strongly correlated across STRs between gangSTR and ExpansionHunter (Pearson r2 = 0.83) (Supplementary Fig. 3b). We note that tract lengths from ExpansionHunter tended to be longer than those from gangSTR, which is consistent with our validation using long-read sequencing and the work of others26. Nonetheless, we found that the median STR tract lengths measured by ExpansionHunter were largely concordant with the tract lengths in the GRCh38 reference genome (Fig. 1e). For the remainder of the paper, we performed our primary association analyses using ExpansionHunter, as this algorithm has been shown to be more sensitive for detecting longer STR tract lengths26.

Testing for single STR associations with AD risk

We first tested the lengths of each STR for association with AD risk, treating STR lengths as a continuous variable (Fig. 1a, model 1). We applied a logistic regression model to test the association of the tract length of each STR with AD risk, accounting for sample covariates (age, sex, and the first 3 PCs) and technical covariates (sample sequencing coverage and local sequencing coverage). Throughout, we applied a dominant model, such that only the longer of the two alleles for a given STR in an individual was considered. As the tract lengths of many STRs are not normally distributed, we performed a rank-based inverse normal transformation of STR tract lengths prior to association testing. We filtered out STRs within segmentally duplicated regions of the genome from our results, as these regions resulted in a high number of artifactual genotype calls (see Methods). This resulted in a final set of 293,752 genome-wide STRs that we report. Our statistical association test provided well-calibrated results (λ genomic control = 0.997), suggesting minimal evidence of systematic technical artifacts or population stratification (Fig. 2a; Supplementary Data 2). We identified one STR associated with a clinical diagnosis of AD at a Bonferroni-corrected p-value threshold of 1.70 × 10−7 (Supplementary Fig. 4a). This STR was a TTTA repeat at chr19:44921097–44921125 (p-value 4.29 × 10−9, logistic regression test) and is located approximately 11.7 kb downstream from the APOE gene, which has the strongest known genetic association with AD risk29.
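Two of the steps above lend themselves to a short illustration: the rank-based inverse normal transformation and the Bonferroni threshold. The sketch below uses only the Python standard library and toy tract lengths. The rank offset of 3/8 (the Blom choice) and the averaging of tied ranks are assumptions on our part, as the text does not specify them, and the logistic regression itself is not shown.

```python
from statistics import NormalDist

def rank_inverse_normal(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform with average ranks for ties.

    c is the rank offset (3/8 is the Blom choice; an assumption here).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values starting at sorted position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    norm = NormalDist()
    return [norm.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold controlling family-wise error at alpha."""
    return alpha / n_tests

longer_allele = [12, 12, 14, 30, 11, 12, 13]  # toy: longer allele per individual
transformed = rank_inverse_normal(longer_allele)
threshold = bonferroni_threshold(0.05, 293_752)
print([round(z, 2) for z in transformed], f"{threshold:.2e}")
```

With 293,752 tests and a family-wise α of 0.05, the threshold evaluates to 1.70 × 10−7, matching the value quoted above. The transform maps the extreme toy value of 30 to the largest z-score without letting its magnitude dominate, which is the motivation for applying it before regression.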

Fig. 2: Statistical testing for single STR associations with AD risk.

a Quantile–quantile plot of single STR association statistics. X-axis shows the expected distribution of −log10(p-values) under a uniform p-value distribution. Y-axis shows observed −log10(p-values). Each point represents a separate STR. b Volcano plot for STR tract lengths under a single STR association test. Each point represents a single STR, with colours reflecting statistical significance of association. X-axis reflects standard mean difference, which is the difference in mean STR lengths between cases and controls divided by the standard deviation of STR lengths across the whole cohort. Positive values reflect longer mean STR tract lengths in cases as compared with controls. λ genomic control (λGC) values are listed in figure legend. c Ratio of number of STRs with tract lengths that are longer in cases relative to the number of STRs with tract lengths longer in controls at different statistical significance thresholds. d Quantile–quantile plot of p-values from hypergeometric tests comparing number of STR expansions in cases versus controls for each STR. X-axis shows the expected distribution of −log10(p-values) under a uniform p-value distribution. Y-axis shows observed −log10(p-values) from a two-sided Fisher’s exact test. Each point represents a separate STR, and points are coloured by the STR tract length threshold. e Relative burden of STR expansions in case versus control individuals at different STR tract length and frequency thresholds. Values above 1.0 reflect higher burden of expansions observed in cases than controls. For a–c, observed p-values are from the single STR test performed using a logistic regression model of the tract length of each STR with AD risk, accounting for sample and technical covariates.

We performed several validation steps for our association analysis. We first repeated the association analysis using two alternative statistical models. To ensure that associations do not result from the inverse normal transformation of STR lengths, we repeated the logistic regression test on untransformed genotypes using the same model and covariates. We also applied a non-parametric test (Mann–Whitney test) without covariates. Application of either the untransformed STR lengths in a logistic regression model or a non-parametric Mann–Whitney test led to highly similar results (correlation Pearson r2 = 0.91 and 0.94 respectively) (Supplementary Fig. 4b,c). We further performed the association analyses using genotypes generated by a second STR genotyping software programme called gangSTR20. The STR association near APOE remained statistically significant using either of these alternative statistical models or using genotypes from gangSTR.

We replicated our association test in the ROSMAP replication cohort (n = 309 cases, 279 controls). The chr19:44921097–44921125:TTTAN repeat near the APOE gene robustly replicated in this additional cohort (meta-analysis p-value 1.43 × 10−12; Supplementary Data 2). Since this STR is near the APOE gene, which is known to be associated with AD risk, we next assessed whether this STR represents an independent association. When re-performing the STR association test with APOE genotype as a covariate, the statistical association of this STR was fully attenuated (association p-value = 0.079 after adjusting for APOE genotype), suggesting that this STR is in linkage disequilibrium with the APOE genotype and does not represent an independent genetic association. Together, these results suggest that there is not a predominant STR that independently drives genetic risk for AD. However, based on power calculations, we note that given our current sample sizes, we have 80% power to identify STR associations with an allelic odds ratio >1.20 (Supplementary Fig. 4d). Thus, we currently lack statistical power to detect STR associations with more modest effect sizes (see Discussion).

Additionally, there were 11,511 STRs nominally associated with AD risk at a p-value threshold of p < 0.05. Among these nominally associated STRs, we noticed a clear bias toward STRs having longer mean tract lengths in AD cases as compared to STRs having longer mean tract lengths in controls (Fig. 2b). This contrasts with the null hypothesis, under which we would expect as many STRs with longer mean tract lengths in AD cases as STRs with longer mean tract lengths in controls. This skew toward more STRs with longer tract lengths in cases was more marked at increasingly stringent p-value cutoffs. For example, at an association p-value threshold of <0.001, there were 1.15-fold as many STRs where the mean tract length was longer in cases than in controls (Fig. 2c). Together, these results show that while no single STR drove genetic risk of AD, longer STR tract lengths appeared to be systematically associated with higher AD risk.

In the above analysis, we treated STR tract lengths as a continuous variable for association testing. However, in known STR-associated diseases, there is usually a threshold at which a given STR tract length becomes pathogenic9, suggesting that large expansions of STR tract lengths rather than population variation confer disease status. Thus, we performed association analyses to test whether the number of individuals with a given STR expansion differs between AD cases and controls (Fig. 1a, model 2). The challenge with testing this model is that the tract length at which an STR becomes pathogenic is not known a priori. We thus performed our analyses at different pre-specified STR tract length thresholds of ≥1, ≥5, ≥10, or ≥20 repeat units longer than the GRCh38 reference genome.

After excluding 20,653 STRs with low genotyping rates, we tested 273,099 STRs for associations between expansions and AD case/control status in 2393 samples (1213 AD cases and 1180 controls) from the ADC discovery cohort. For each STR, we counted the number of case and control individuals with and without an STR expansion as defined by a given STR tract length threshold and performed a hypergeometric test to assess for differences in number of expansions between case and control individuals. We found that association results were well-calibrated across all STR tract length thresholds tested (Fig. 2d; Supplementary Data 3). However, besides the chr19:44921097–44921125:TTTAN STR near APOE at an STR tract length threshold of ≥1 repeat unit (p-value = 7.74 × 10−11, two-sided Fisher’s exact test), there were no other STRs that were statistically significant after correcting for multiple hypothesis testing.
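The per-STR test described above can be sketched as a 2×2 carrier table evaluated with a two-sided Fisher’s exact test at each tract length threshold. The implementation below is a self-contained illustration built from hypergeometric probabilities. The tract lengths are toy values, the function names are ours, and the two-sided rule used (summing all tables no more probable than the observed one) is a common convention that we assume rather than take from the text.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that x of the col1 carriers fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def expansion_table(case_lengths, control_lengths, threshold):
    """Carriers and non-carriers of tracts >= threshold units above reference."""
    a = sum(x >= threshold for x in case_lengths)
    c = sum(x >= threshold for x in control_lengths)
    return a, len(case_lengths) - a, c, len(control_lengths) - c

cases = [0, 1, 6, 12, 25, 7, 0, 5]      # toy tract lengths relative to reference
controls = [0, 0, 1, 2, 0, 5, 0, 1]
for t in (1, 5, 10, 20):
    table = expansion_table(cases, controls, t)
    print(t, table, round(fisher_exact_two_sided(*table), 4))
```

Running the same STR through several thresholds, as in the loop, mirrors the design choice of testing ≥1, ≥5, ≥10, and ≥20 repeat units because the pathogenic length is not known in advance.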

While no single STR showed a statistically significant difference in the number of expansions between case and control individuals, we next tested whether there may be a difference in the cumulative burden of expansions across the genome in cases versus controls. We identified STR expansions at different STR length thresholds (≥1, ≥5, ≥10, or ≥20 repeat units longer than reference) and at different STR expansion frequency cutoffs (seen once, ≤5, ≤10, ≤100, or no cutoff in n = 2393 individuals) and counted the total number of STR expansions per individual. Strikingly, we found that individuals with AD carried a higher burden of long, rare STR expansions. For example, individuals with AD carried a 3.28-fold higher number of STRs ≥20 repeat units longer than the reference and seen only once in the cohort (Fig. 2e). This increased burden in individuals with AD was attenuated when examining either shorter STR alleles or more common alleles. For example, there was only a 1.08-fold increased burden in AD individuals for STRs ≥1 repeat unit longer than reference and seen once in the cohort. Similarly, there was only a 1.03-fold increased burden in AD individuals for STRs ≥20 repeat units longer than reference when no frequency cutoff was applied. These results suggest that a high burden of rare expanded STR alleles associates with risk of AD.
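The burden calculation above combines a length threshold with a frequency cutoff. The sketch below shows one way to count qualifying expansions per individual and form a case/control ratio. The genotypes are toy values, and defining the relative burden as a ratio of per-individual means is our assumption, not a definition given in the text.

```python
from collections import Counter

def burden_per_individual(genotypes, length_threshold, max_carriers=None):
    """Count qualifying expansions per individual.

    genotypes: {individual: {str_id: tract length relative to reference}}.
    An expansion qualifies if it is >= length_threshold and the STR is expanded
    in at most max_carriers individuals (None disables the frequency cutoff).
    """
    carriers = Counter(
        str_id
        for calls in genotypes.values()
        for str_id, length in calls.items()
        if length >= length_threshold
    )
    return {
        person: sum(
            1
            for str_id, length in calls.items()
            if length >= length_threshold
            and (max_carriers is None or carriers[str_id] <= max_carriers)
        )
        for person, calls in genotypes.items()
    }

def relative_burden(burden, cases, controls):
    """Mean expansions per case divided by mean expansions per control."""
    mean_case = sum(burden[p] for p in cases) / len(cases)
    mean_control = sum(burden[p] for p in controls) / len(controls)
    return mean_case / mean_control if mean_control else float("inf")

genotypes = {  # hypothetical individuals and loci
    "case1": {"s1": 25, "s2": 22, "s3": 1, "s4": 0},
    "case2": {"s1": 0, "s2": 21, "s3": 30, "s4": 2},
    "ctrl1": {"s1": 0, "s2": 23, "s3": 0, "s4": 1},
    "ctrl2": {"s1": 3, "s2": 26, "s3": 0, "s4": 20},
}
cases, controls = ["case1", "case2"], ["ctrl1", "ctrl2"]
singleton = burden_per_individual(genotypes, 20, max_carriers=1)
unfiltered = burden_per_individual(genotypes, 20)
rare_ratio = relative_burden(singleton, cases, controls)
all_ratio = relative_burden(unfiltered, cases, controls)
print(rare_ratio, round(all_ratio, 3))
```

In the toy data the common expansion at s2 dilutes the case/control contrast once the frequency cutoff is removed, which illustrates why the signal can weaken for more common alleles.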

Identification of an increased burden of rare STR expansions in AD

Given our finding of an increased burden of STR expansions in individuals with AD, we next sought to systematically identify rare, long STR expansions without pre-specifying an STR length threshold, since the tract length threshold is not known a priori for any given STR. To increase statistical power to identify rare STR expansions, we combined samples from across the ADC and ROSMAP cohorts (total n = 2981; n = 1492 controls and 1489 cases). Since we are focused on rare expansions, we lack statistical power to test for association between any individual rare expansion and AD case/control status. Thus, for the remaining analyses, we aggregated the number of expansions per individual and tested whether there is a difference in the burden of rare STR expansions between case and control individuals (Fig. 1a, model 3).

To detect STR expansions, we implemented an approach using density-based spatial clustering of applications with noise (DBSCAN), which we extended from the work of Trost et al.13 This approach obviates the need for selecting an arbitrary expansion length threshold and frequency cutoff, and instead identifies individuals carrying outlier STR tract lengths for each STR. In our implementation of DBSCAN, we additionally accounted for the effects of sample and technical covariates (sex, age, first three principal components of ancestry, genomic sequencing coverage for each sample, local sequencing coverage for each sample at the STR). We excluded 15 individuals who had >148 STR expansions in their genome (see Methods), resulting in 2966 individuals (n = 1481 control individuals and n = 1485 AD cases). Applying DBSCAN, we identified expansions in 9641 unique STRs across the genome in these 2966 individuals. We classified expansions into those seen only in AD cases (n = 4412 STRs), those seen only in controls (n = 3365 STRs), and those seen in both cases and controls (n = 1864 STRs) (Fig. 3a; Supplementary Data 4,5). There were 6276 unique STR expansions seen in individuals with AD (hereafter referred to as “AD STR expansions”) and 5229 unique STR expansions seen in control individuals (hereafter referred to as “control STR expansions”). AD STR expansions were slightly but statistically significantly longer than control STR expansions (mean tract length of 39.3 and 37.3 repeat units longer than the GRCh38 reference genome tract length, respectively; p < 1.1 × 10−3, two-sided Mann–Whitney test) (Fig. 3b). 50.1% of AD STR expansions were dinucleotide repeats and just 2.1% of AD STR expansions had a repeat unit of 6 bp or longer (Fig. 3c). Many of the STR expansions were observed in more than one individual (Supplementary Fig. 5a). 1497 STR expansions were present in more than one AD case individual, including 213 STR expansions present in five or more AD case individuals.
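As a rough illustration of outlier calling with DBSCAN at a single STR, the sketch below clusters one-dimensional tract lengths and flags individuals labelled as noise whose tract exceeds every clustered value. It is a simplified stand-in and not the implementation used in this study: the covariate adjustment is omitted, the `eps` and `min_samples` values are hypothetical, and the rule "noise and above all clustered values" is our assumption about how an expansion would be defined.

```python
def dbscan_1d(values, eps, min_samples):
    """Minimal DBSCAN for one-dimensional data.

    Returns one label per value: 0, 1, ... for clusters and -1 for noise.
    A point is a core point if at least min_samples points (itself included)
    lie within eps of it. Quadratic in the number of values; toy use only.
    """
    n = len(values)
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is core: keep expanding
    return labels

def call_expansions(tract_lengths, eps, min_samples):
    """Indices of individuals that are noise and longer than all clustered tracts."""
    labels = dbscan_1d(tract_lengths, eps, min_samples)
    clustered = [v for v, lab in zip(tract_lengths, labels) if lab != -1]
    if not clustered:
        return []
    upper = max(clustered)
    return [i for i, (v, lab) in enumerate(zip(tract_lengths, labels))
            if lab == -1 and v > upper]

# Toy tract lengths (repeat units) for one STR across 12 individuals.
lengths = [10, 11, 10, 12, 11, 10, 11, 12, 45, 10, 11, 3]
expanded = call_expansions(lengths, eps=2, min_samples=3)
print(expanded)
```

Here the individual with 45 repeat units is flagged, whereas the unusually short tract of 3 units is also noise but is not counted as an expansion. No fixed length threshold or frequency cutoff is needed, which is the property highlighted above.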

Fig. 3: DBSCAN identifies increased burden of rare STR expansions in AD.

a Three separate STRs are shown as examples of AD-only STR expansions, control-only STR expansions, or shared STR expansions as identified by DBSCAN. For each STR, y-axis shows tract length for the STR, where the STR tract length for each individual is represented by one point and ranked by their tract length. Orange points represent AD cases and green points represent controls. b STR tract lengths in number of repeat units longer than GRCh38 reference genome for control STR expansions (green; n = 7823 expansions) and AD STR expansions (orange; n = 9276 expansions). c Repeat unit lengths for STR expansions seen in controls (green) and AD cases (orange). Histogram shows values as proportion of STR expansions seen in controls and cases respectively. d Number of STR expansions identified by DBSCAN per individual in controls (green; n = 1485 individuals) versus cases (orange; n = 1481 individuals). Each point represents one individual. Right panel is zoomed in for individuals with fewer than 50 STRs. e Odds ratio for AD case/control status for individuals carrying varying numbers of STR expansions. Odds ratios > 1.0 represent higher odds of having AD. p-values are calculated using a two-sided Fisher’s exact test. *p-value < 0.05. ***p-value < 0.001. f Stacked bar chart of Braak stages for individuals with ≤ 10 STR expansions (top) or > 30 expansions (bottom). Braak stages are from 0–6, with higher values reflecting more severe neuropathology. For b and d, box plot elements: centre line, median; box limits, upper and lower quartiles; and whiskers, 1.5× interquartile range.

We also compared STR tract lengths of these expansions in the ADSP cohort to 476 individuals of European ancestry from the 1000 Genomes Project30. We found that for STRs in which an STR expansion was observed, the median STR tract length in the ADSP cohort was similar to samples from the 1000 Genomes Project (Spearman r2 = 0.99) (Supplementary Fig. 5b). However, for 96.5% of STRs in which an STR expansion was observed, the maximum STR tract length in the ADSP cohort was much longer when compared to samples from the 1000 Genomes Project, suggesting that most STR expansions are rare polymorphisms in the population and/or unique to our dataset (Supplementary Fig. 5c).

We next tested for differences in the burden of STR expansions in AD cases versus controls. There was a mean of 6.27 and 5.27 STR expansions in AD cases and controls, respectively, representing a 1.19-fold higher burden of STR expansions in cases (p = 8.27 × 10−3, two-sided Mann–Whitney test) (Fig. 3d). The median number of STR expansions was 4.0 for both AD cases and controls. Strikingly, we found that individuals who carried > 30 STR expansions had an odds ratio of 3.69 for having AD (Fig. 3e, p-value 4.30 × 10−4, two-sided Fisher’s exact test; 95% confidence interval for odds ratio 1.83–8.06). In contrast, individuals with ≤10 expansions had an odds ratio of 0.77 for having AD. Importantly, the odds ratio for AD and its statistical significance both increased at higher STR expansion burden thresholds (Fig. 3e).
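To illustrate how an odds ratio at a burden threshold is obtained, the sketch below dichotomises per-individual expansion counts and computes the sample odds ratio with a Woolf (log-scale) interval. The counts are toy values, and the Woolf interval is our choice for simplicity; the interval reported above may have been derived by a different method, so the sketch is not expected to reproduce it.

```python
from math import exp, log, sqrt

def burden_table(case_counts, control_counts, threshold):
    """2x2 counts of individuals above / not above an expansion-burden threshold."""
    a = sum(n > threshold for n in case_counts)
    c = sum(n > threshold for n in control_counts)
    return a, len(case_counts) - a, c, len(control_counts) - c

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Sample odds ratio (a*d)/(b*c) with a Woolf (log-scale) 95% interval.

    a, b: cases above / not above the burden threshold.
    c, d: controls above / not above the burden threshold.
    """
    if min(a, b, c, d) == 0:
        raise ValueError("Woolf interval is undefined when a cell is zero")
    estimate = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return estimate, exp(log(estimate) - z * se), exp(log(estimate) + z * se)

# Hypothetical numbers of STR expansions per individual.
case_counts = [2, 35, 4, 40, 6, 1, 33, 5]
control_counts = [3, 4, 31, 2, 5, 1, 6, 4]
table = burden_table(case_counts, control_counts, threshold=30)
estimate, lower, upper = odds_ratio_with_ci(*table)
print(table, round(estimate, 2), round(lower, 2), round(upper, 2))

# Fold difference in mean burden, using the means reported in the text.
fold = 6.27 / 5.27
print(round(fold, 2))
```

The final lines confirm that the reported means of 6.27 and 5.27 expansions correspond to the stated 1.19-fold difference.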

We also rigorously examined whether our results may be due to potential confounders. In our above analyses, we had included local and genome-wide sequencing coverage as covariates in the DBSCAN analysis, as we had observed a modest but statistically significant correlation between the sequencing coverage and the number of STR expansions identified (Spearman r2 = 0.17; p = 1.79 × 10−12; linear regression F-test) (Supplementary Fig. 6a). However, even when sequencing coverage was not included as a covariate in the DBSCAN analysis, we continued to find an increased burden of STR expansions in individuals with AD (p = 4.45 × 10−3; two-sided Mann–Whitney test), and individuals carrying >30 STR expansions still had increased odds of having AD (p-value 4.30 × 10−4, two-sided Fisher’s exact test; odds ratio 3.69 [95% CI 1.83–8.06]) (Supplementary Fig. 6b,c). We also did not observe any clear association between the STR expansion burden and the first two principal components of ancestry (Supplementary Fig. 7a). Moreover, even when correcting for up to 10 principal components in the DBSCAN analysis, we continued to observe an increased number of STR expansions in individuals with AD (p = 4.43 × 10−3; two-sided Mann–Whitney test), and individuals carrying >30 STR expansions still had increased odds of having AD (p-value 4.30 × 10−4, two-sided Fisher’s exact test; odds ratio 3.69 [95% CI 1.83–8.06]) (Supplementary Fig. 7b,c).

We next ascertained if our observations were reproducible using gangSTR20, a separate STR genotyping algorithm. We note that there were fewer STR expansions identified by gangSTR (mean 1.028, median 1.00 STR expansions per individual) as compared to ExpansionHunter (mean 5.77, median 4.00 STR expansions per individual). This is consistent with our analyses and prior work demonstrating higher sensitivity of ExpansionHunter for identifying STR expansions as compared to gangSTR (Supplementary Fig. 2i)26. Using gangSTR, we again found a higher burden of STR expansions in cases (p = 2.93 × 10−4, two-sided Mann–Whitney test) (Supplementary Fig. 8a). We also found that individuals who carry >10 STR expansions as identified by gangSTR have a 3.14 odds ratio for having AD, though this was not statistically significant (95% CI odds ratio 0.73–21.0; p-value 0.19, two-sided Fisher’s exact test) (Supplementary Fig. 8b).

We next assessed whether individuals with a high burden of STR expansions also had differences in neuropathology as measured by Braak staging which reflects the degree of spread of tau pathology31. Braak staging data was available for 1188 individuals of the 2981 individuals in the cohort (n = 365 controls and 823 AD cases). Braak stages are scored from 0 to 6, with 0 representing absence of AD neuropathology and 6 being the most severe spread of AD neuropathology31. Using the same expansion burden thresholds as we used for AD risk, we found that individuals with >30 expansions had worse AD neuropathology compared to individuals with ≤10 expansions (p = 0.01, Kruskal–Wallis rank sum test) (Fig. 3f).

Since APOE is the strongest genetic effect on AD risk in the genome29, we next assessed whether there was any relationship between APOE genotype and STR expansion burden. We did not observe any relationship between the number of ε4 alleles and burden of STR expansions in the genome (p = 0.096, Kruskal–Wallis rank sum test) (Supplementary Fig. 9a). Moreover, the increased burden of STR expansions observed in individuals with AD remained when adjusting for APOE genotype in the DBSCAN analysis (p = 4.57 × 10−3, two-sided Mann–Whitney test), and individuals carrying >30 STR expansions still had increased odds of having AD (odds ratio 3.69 [95% CI 1.83–8.06]; p-value 4.30 × 10−4, two-sided Fisher’s exact test) (Supplementary Fig. 9b,c). Despite correcting for participant age in our DBSCAN analysis, we did identify a modest correlation between participant age and number of STR expansions (Spearman r2 = 0.0017; p = 0.025, linear regression F-test) (Supplementary Fig. 9d).

STR expansions are enriched in active regulatory regions of the genome

The majority of STRs tested in our study are not within coding portions of genes (Fig. 1b), precluding any simple interpretations of the mechanisms by which they may promote disease pathogenesis. We first annotated the genomic distributions of the 6276 AD STR expansions (those seen in individuals with AD) and found that they had similar distributions to the background of 293,752 STRs we tested. Specifically, the majority of AD STR expansions were in distal intergenic regions (38.1%; defined as being >3 kb from the nearest TSS) or in promoter regions (14.7%; ≤3 kb upstream from the nearest TSS) (Fig. 4a). However, AD STR expansions (median 24.7 kb) were further from the nearest TSS than background STRs (median 22.2 kb) (p-value = 6.10 × 10−10, two-sided Student’s t-test), which appeared to be driven by a slightly larger subset of STR expansions that were >100 kb upstream of the nearest TSS (Fig. 4b).

Fig. 4: STR expansions in AD are enriched in active promoters and SVA transposable elements.

a Genomic distribution of all (background) STRs tested (top) compared to AD STR expansions (bottom). b Distribution of distances to nearest TSS for all STRs tested (top) or AD STR expansions (bottom). c Enrichment of AD STR expansions in different chromatin states from the 18-state chromHMM model for adult hippocampus. d Enrichment of AD STR expansions in “flanking TSS” (chromatin state 2 from chromHMM) in adult hippocampus using a permutation-based test. e Enrichment of AD expanded STRs in peaks from different histone marks based on ChIP-seq in bulk adult hippocampal tissue. f Enrichment of AD STR expansions in different TE classes. g Enrichment of AD expanded STRs in SVA elements using a permutation-based test. h Proportion of STRs at given genomic distances from the nearest SVA. For c,e,f, height of points along the x-axis represents the statistical significance of enrichment, as measured by two-sided Fisher’s exact test. Size of points represents the magnitude of enrichment as represented by the log2(fold enrichment), and points are ordered by statistical significance for enrichment. For d,g, dotted vertical line reflects the observed number of AD STR expansions overlapping the genomic annotation, and histogram represents the number of AD STR expansions overlapping each of 1000 randomly permuted genomic regions. For d,g, p-values represent empirical p-values calculated as the number of permutations exceeding observed STR overlap divided by total number of permutations.

Given that the majority of AD STR expansions were not within protein-coding regions, we next explored whether these AD STRs may be enriched in any chromatin features. We utilized all AD STR expansions (including those also seen in control individuals) since the STR expansions we identified in our study are likely not penetrant for AD. We tested whether AD STR expansions were enriched in chromatin states based on bulk post-mortem hippocampus ChIP-seq data from ENCODE32, using the 18 chromatin state partitions in chromHMM33. We found that “flanking TSS” (chromatin state 2) and “active TSS” (chromatin state 1) reflecting active promoter regions were the most enriched chromatin states for AD STR expansions (p-value = 2.82 × 10−18 and 1.66 × 10−14 respectively, two-sided Fisher’s exact test) (Fig. 4c). We validated the enrichment using a permutation-based approach by randomly drawing genomic regions and testing for their overlap with AD STR expansions. This permutation-based approach recapitulated the enrichment of AD STR expansions in “flanking TSS” (empirical enrichment p-value < 0.001) (Fig. 4d). The enrichment of AD STR expansions in “flanking TSS” was stronger in brain tissues compared to tissues from other parts of the body (Supplementary Fig. 10). We also examined STRs present in individuals with AD only and found enrichment p-values of 8.80 × 10−12 and 3.45 × 10−8 for chromatin states 1 and 2 respectively. We also validated the chromHMM enrichment results using ChIP-seq data from post-mortem hippocampal tissue in ENCODE32. Across ChIP-seq peaks for seven different histone marks in hippocampal tissue, we found that AD STR expansions were most highly enriched in H3K4me3 (p = 3.54 × 10−11, two-sided Fisher’s exact test) and H3K27ac peaks (p = 1.59 × 10−7). H3K4me3 marks active promoters, and H3K27ac marks active promoters and/or enhancers. Chromatin marked by both H3K4me3 and H3K27ac had even stronger enrichment (p = 2.84 × 10−11) (Fig. 4e). 
These findings demonstrate that AD STR expansions are enriched in active promoter regions in the brain.

Given that many transposable elements (TEs) contain or are in proximity to STRs34, we next assessed whether AD STR expansions are enriched for any specific TE classes. We tested for enrichment across five broad categories of TEs: Alu, Human Endogenous Retroviruses (HERVs), long interspersed nuclear elements (LINE)-1 and -2, and SINE-VNTR-Alus (SVA). AD STR expansions were highly enriched for SVA elements (p-value 1.90 × 10−8, two-sided Fisher’s exact test) (Fig. 4f), which we also validated using a permutation-based approach (Fig. 4g). We also examined STRs present only in individuals with AD and found an enrichment p-value within SVA elements of 7.58 × 10−8. Among AD STRs, 0.49% were inside the boundaries of an SVA element, compared with 0.15% of all STRs tested (p-value 5.7 × 10−11; two-sided two-proportion z-test), and 0.55% of AD STR expansions were within 500 base pairs of an SVA element, compared with 0.38% of all STRs tested (p-value 0.035; two-sided two-proportion z-test) (Fig. 4h). Our observation of a strong enrichment of STR expansions within active promoter regions in the brain is consistent with prior work suggesting that SVA elements have been co-opted during evolution as enhancers and promoters in neurons35,36,37.

AD STR expansions are enriched in disease pathways with relevance to AD

While the majority of the 6276 AD STR expansions were not within protein-coding regions (Fig. 4a), we sought to understand whether they may be enriched near genes with certain biological or molecular functions. For each STR expansion, we identified the gene with the closest TSS within ± 250 kb. We then performed a gene ontology enrichment analysis for genes near AD STR expansions (n = 6276 STRs associated with 2373 unique genes) compared to all STRs tested (n = 293,752 STRs associated with 14,126 unique genes). We found that many of the most strongly associated gene sets were related to neuron biology, such as “neuron projection morphogenesis”, and “axon development” (Fig. 5a). Thus, while most AD STR expansions were not protein-coding, they are highly enriched near genes implicated in biological processes with known relevance to AD pathophysiology.

Fig. 5: Insights into genes and pathways implicated by AD STR expansions.

a Gene set enrichment for AD STR expansions relative to all STRs tested. The top 10 most highly associated gene ontology terms are shown. Height of points along the x-axis represents the statistical significance of enrichment, as measured by a two-sided Fisher’s exact test. b Same as a except only for STR expansions within gene bodies. c Number of STR expansions within gene bodies identified by DBSCAN per individual in controls (green; n = 1485 individuals) versus cases (orange; n = 1481 individuals). d Odds ratio for AD case/control status for individuals carrying varying numbers of STR expansions within gene bodies. Odds ratios > 1.0 represent higher odds of having AD. p-values are calculated using a two-sided Fisher’s exact test. *p-value < 0.05. Box plot elements: centre line, median; box limits, upper and lower quartiles; and whiskers, 1.5x interquartile range. e GRID2 gene is shown with thick vertical lines representing exons and horizontal lines representing introns. Locations of AD STR expansions (orange) and control individual STR expansions (green) are shown. Each circle represents an STR expansion seen in one individual.

Since assigning intergenic STR expansions to a cognate gene is challenging, we next examined AD STR expansions in gene bodies (i.e., those occurring in exons, untranslated regions, or introns of genes). We note that both protein-coding and non-protein-coding genes were used in this analysis. We compared AD STR expansions found within gene bodies (n = 1327) to all STRs tested within gene bodies (n = 13,594) and identified strong enrichments in many gene sets related to synaptic function (Fig. 5b). These enrichments are notable given the pathologic role of synaptic dysfunction in AD38. We also found a higher burden of STR expansions in gene bodies in AD cases compared with control individuals (p-value 0.021, two-sided Mann–Whitney test) (Fig. 5c). Moreover, individuals with more than 10 STR expansions in gene bodies had a 2.02 odds ratio for having AD (95% CI odds ratio 1.18–3.58; p-value 0.02, two-sided Fisher’s exact test) (Fig. 5d). There was no difference between cases and controls in the number of STR expansions restricted to protein-coding sequences (p-value 0.89, two-sided Mann–Whitney test), though the number of such expansions was low (median 0.0 and mean 0.20 protein-coding STR expansions per individual).

As an example, we highlight the GRID2 gene, in which we identified seven STR expansions across nine individuals with AD compared to just one STR expansion seen in a single control individual (Table 1; Fig. 5e; two-sided Fisher’s exact p-value 0.039). GRID2 encodes a subunit of the glutamate receptor and has recognized roles in synaptic transmission39. We also provide per-gene counts of STR expansions including hypergeometric statistical testing results across all genes (Supplementary Data 6). Together, these results demonstrate that AD STR expansions are highly enriched in and near genes implicated in biological processes with known relevance to AD pathophysiology and thus the AD STR expansions may represent molecular drivers of disease. Moreover, these results show that while these STR expansions were detected in blood-derived DNA, they appear to function in the brain, where much of the pathophysiology of AD is presumed to occur.

Table 1 STR expansions observed in GRID2

Discussion

In this work, we perform genome-wide profiling of STRs in 2981 individuals with and without AD to understand whether STRs associate with risk of AD. In contrast to the known STR-associated disorders, we do not find that a single STR individually drives genetic risk for AD. Instead, we identify thousands of STR expansions distributed throughout the genome with a higher burden of STR expansions in cases as compared to controls. Moreover, expanded STRs in AD cases are enriched in active promoters in the brain and in SVA TEs. These results suggest a model in which a polygenic burden of STR expansions distributed throughout the genome promotes risk of AD.

Remarkably, we found that individuals with AD carried an excess burden of expanded STRs. Each individual with AD had a mean of 6.27 STR expansions, though we note this estimate is likely an underestimate given the lower sensitivity of STR genotyping software on long expansions and our stringent quality filters. Nonetheless, this suggests that a polygenic burden of expanded STRs rather than a single STR promotes risk of AD. This polygenic effect is similar to what has been observed for STRs in autism spectrum disorder11,13 and in schizophrenia12. The distributed nature of these AD-associated STR expansions throughout the genome rather than within one or a few genes suggests that general genomic instability is a pathologic hallmark of the genomes in AD. This concept of an increased burden of STRs in individuals with AD is consistent with prior findings that individuals with AD have a higher burden of rare coding single nucleotide variants40 and structural variants41.

This observation of an increased burden of STR expansions in AD suggests one of two mechanistic models, which are not mutually exclusive. First, STR expansions may be an epiphenomenon of a disease process that promotes genomic instability. For example, AD pathology or biological aging may promote STR instability to result in an increased number of STR expansions in AD. The second possibility is that inherited and/or somatically unstable STR expansions promote disease pathogenesis. In support of this second model, we find that STR expansions in cases are preferentially localized in active promoters of genes with important neuronal functions. We note that in our gene set enrichment analyses, we assigned noncoding STRs to the nearest gene, though noncoding elements are known to often act at long genomic distances. Expression quantitative trait loci (eQTL)-based approaches may offer a better approach for assigning noncoding STRs to a cognate gene, but given our focus on rare STR expansions further limited by the relative paucity of available gene expression data in these samples, we did not have the statistical power to apply eQTL approaches. Nonetheless, these results suggest that STR expansions may have a functional role in AD onset and progression, though mechanistic studies will be needed to dissect the exact causal relationships between STRs and AD.

Strikingly, we found that AD-associated STR expansions were colocalized within active promoters in brain tissues. The AD STR expansions were particularly enriched within SVA elements, which have been proposed to have been co-opted for enhancers and promoters during human evolution35,36,42. SVA elements have also been shown to be important sources of tandem repeat variation and evolution in the human genome and particularly near neuronal genes37. Thus, our findings suggest that STRs at these SVA elements in active gene regulatory elements are prone to instability, particularly near genes with roles in AD pathophysiology. These findings in the context of existing literature underscore an important relationship between TEs, genome instability, and disease.

There are several important limitations to our study. First, while we expect molecular drivers of AD to act in the brain, we only have access to peripheral blood-derived DNA. We did however find that AD STR expansions were enriched in active promoters in the brain and near genes involved in neuronal function, suggesting that even though STR lengths were measured in blood-derived cells, they may manifest their effect on AD risk in the brain. Similarly, without paired blood and brain-derived DNA, we were unable to test for somatic instability of STRs in brain tissues. Second, we used short-read WGS where the accuracy of genotyping STRs is limited, though our comparisons with long-read sequencing identified a sensitivity of 66.4–68.3% for detecting STR expansions using ExpansionHunter (Supplementary Fig. 2i)26. We mitigated these concerns by using two different software programmes to replicate our results. We also compared our STR lengths to control individuals well-matched on ancestry and sequencing coverage (Supplementary Fig. 1c, e; Supplementary Table 1) and as such, we expect that any STR genotyping artifacts would be randomly allocated between AD and control individuals. Future work may also apply long-read sequencing approaches such as PacBio or Nanopore sequencing that can improve accuracy of STR genotyping, but are currently limited by cost. Analyses of long-read data will become an important future direction as these technologies become more feasible to apply across thousands of samples. Third, our study focuses on a cohort of 3000 individuals of European ancestry, limiting our statistical power and applicability of our results across diverse populations. 
With more computationally efficient approaches in the future, we will be able to explore larger sample sizes to increase our statistical power to identify STR associations, especially as, for single STR associations, our sample sizes only provide statistical power to detect STRs with larger effect sizes (~1.2 allelic odds ratio; see Supplementary Fig. 4d). Studying diverse populations will allow us to test whether our results are applicable across other groups, especially as STRs are known to be more highly polymorphic in other populations43. Finally, our study design uses observational data in humans and so we cannot directly test causality. Future studies performing in vitro mechanistic dissection of these STRs will be required to establish a causal role of STRs in AD.

Together, our work identifies novel links between STR expansions and regulatory elements in AD. These results underscore the importance of uncovering the role of STRs in a broad range of diseases and understanding the mechanisms by which these STRs may promote disease risk.

Methods

Our research complies with all ethical regulations and has received Institutional Review Board exemption from the University of Pennsylvania.

Cohort Description

The Alzheimer’s Disease Sequencing Project (ADSP) is a collaborative project aimed at identifying new variants, genes, and therapeutic targets in AD44. In the R3 release of ADSP, genetic data from 16,905 individuals were aggregated across 24 cohorts and whole genome sequencing was performed by Illumina HiSeqX, HiSeq2000, HiSeq2500, and NovaSeq platforms.

In this work, we used individuals from the ADC and ROSMAP cohorts within the ADSP. For single STR analyses, we used ADC as the discovery cohort (total n = 2393; 1213 AD cases and 1180 controls) and ROSMAP as the replication cohort (total n = 588; n = 309 AD cases and 279 controls). We utilized AD case/control status as adjudicated and provided by the ADSP. Age of participants refers to the age at disease onset for cases and age at latest assessment for control participants. We restricted our analyses to individuals with self-reported non-Hispanic White race/ethnicity. We further restricted to individuals with European genetic ancestry based on PCA coordinates, which were provided by the ADSP. Based on manual inspection of PCA plots, we used samples with PC1 < −0.0037 and PC2 < 0.02 to represent individuals of European ancestry. To minimize the impact of sequencing coverage on STR calls, we removed samples with <30× or >50× sequencing coverage across the genome. We also restricted our analyses to samples sequenced on Illumina HiSeqX machines. All samples meeting these criteria were included in our analyses. We note that we have focused on this subset of individuals rather than the full ADSP release given the computational costs of our analyses. All individuals have been previously consented for AD genetic research by their local Institutional Review Boards.

Generation of polymorphic STR panel

To reduce computational burden, rather than testing all STRs genome-wide, we generated a custom panel of polymorphic STRs. To generate this panel, we first ran gangSTR20 using the hg38_ver13 catalogue provided by the gangSTR authors (https://s3.amazonaws.com/gangstr/hg38/genomewide/hg38_ver13.bed.gz) on 246 AD cases and 249 AD controls from the ADC cohort in ADSP. We used gangSTR v2.4.0 on this hg38_ver13 reference panel using default parameters, except --max-proc-read 100000 and --rescue-count 10. STRs were determined to be polymorphic if at least one individual in the cohort had a non-reference STR length and the genotyping rate across the 495 samples was ≥ 90%. In total, there were 237,197 STRs that met these criteria. We next merged these 237,197 STRs with a set of 174,262 STRs previously identified to be polymorphic22. This resulted in a union set of 312,731 unique polymorphic STRs.

Genotyping STR tract lengths using Illumina short-read data

We genotyped STR tract lengths using ExpansionHunter v522 on our panel of 312,731 polymorphic STRs using default parameters. Prior to running ExpansionHunter, the polymorphic STR panel was converted to the JSON format required by ExpansionHunter.
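The conversion step can be illustrated with a minimal Python sketch. It assumes a tab-separated, BED-like input with columns chromosome, start, end, motif length, and motif, and it writes entries using the LocusId, LocusStructure, ReferenceRegion, and VariantType fields of an ExpansionHunter variant catalogue. The function name, the locus naming scheme, and the example record are hypothetical, and coordinates are passed through unchanged, so any adjustment between 0-based and 1-based conventions is not handled here.

```python
import json


def bed_to_catalog(bed_lines):
    """Convert BED-like STR records into ExpansionHunter-style catalogue entries.

    Assumed column order (an assumption of this sketch):
    chrom, start, end, motif length, motif.
    """
    catalog = []
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end, _period, motif = line.split("\t")[:5]
        catalog.append({
            "LocusId": f"{chrom}_{start}_{end}",  # hypothetical naming scheme
            "LocusStructure": f"({motif})*",      # one repeat of variable length
            "ReferenceRegion": f"{chrom}:{start}-{end}",
            "VariantType": "Repeat",
        })
    return catalog


# Hypothetical single-locus example
example = ["chr1\t1000\t1030\t3\tCAG"]
print(json.dumps(bed_to_catalog(example), indent=2))
```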

We also genotyped STR tract lengths using gangSTR20 on our panel of 312,731 polymorphic STRs. To increase sensitivity of gangSTR, we added in offtarget alignment locations for any STR that had a p-value <0.05 in the ExpansionHunter single STR association analysis. To determine offtarget alignment locations, we used wgsim v1.11 to simulate 10000 sequencing reads for each STR (https://github.com/lh3/wgsim). We used the following parameters in wgsim: -e 0.005 -d 500 -s 100 -N 10000 -1 150 -2 150 -r 0 -R 0 -X 0. Simulated reads were then realigned back to the GRCh38 reference genome using bwa mem version 0.7.17 with default parameters45. The offtarget alignment locations of simulated reads in the GRCh38 reference genome were then extracted using scripts provided by the gangSTR authors, and we used the top 5 offtarget locations for each STR. gangSTR v2.4.2 was run on this custom reference panel with default parameters except --max-proc-read 100000.

Assessment of genotyping accuracy using PacBio long-read data

To assess the accuracy of gangSTR and ExpansionHunter applied to Illumina short-read sequencing data, we also performed STR genotyping on long-read data. We used PacBio Revio long-read data generated on n = 4 individuals (HG00096, HG00268, HG00358, and NA20509) from the 1000 Genomes Project24. We downloaded PacBio data from https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC3/working/20220831_JAX_HiFi/. We aligned PacBio data using Minimap2 v2.2446 with option -ax map-hifi and all default parameters. We then called STR tract lengths from PacBio data using TRGT v0.8.025 with default parameters on the set of 312,731 polymorphic STRs.

For comparison, we also analysed Illumina short-read sequencing data for the same four samples from the 1000 Genomes Project. We downloaded aligned Illumina short-read sequencing data from https://www.internationalgenome.org/data/. We genotyped the n = 312,731 polymorphic STRs using gangSTR v2.4.2 with no offtarget locations and all default parameters except --max-proc-read 100000. We genotyped the same set of 312,731 polymorphic STRs using ExpansionHunter v5 with all default parameters. To measure concordance, we only examined STRs where the longer allele had a tract length ≥10 repeat units longer than the reference STR tract length in GRCh38. For concordance calculations, we required the genotyped tract length to be within ± 20% between PacBio and the corresponding genotypes from Illumina data.
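The concordance rule above can be written as a short Python sketch. Two details are assumptions of the sketch rather than statements from the text: eligibility (≥10 repeat units above the reference) is evaluated on the PacBio call, and the ± 20% tolerance is taken relative to the PacBio tract length. The function name and the example tuples are hypothetical.

```python
def concordance(calls, min_expansion=10, tolerance=0.20):
    """Fraction of eligible loci where short-read and long-read calls agree.

    calls: iterable of (ref_len, long_read_len, short_read_len), all in repeat
    units and all referring to the longer allele at a locus.
    """
    # Keep loci whose long-read allele exceeds the reference by the threshold
    eligible = [(lr, sr) for ref, lr, sr in calls if lr - ref >= min_expansion]
    if not eligible:
        return None
    # A call is concordant if it lies within the relative tolerance
    hits = sum(1 for lr, sr in eligible if abs(sr - lr) <= tolerance * lr)
    return hits / len(eligible)


# Hypothetical calls: (reference, PacBio, Illumina)
print(concordance([(10, 25, 24), (10, 30, 20), (10, 12, 12)]))
```

In this toy input the third locus is not eligible, and only the first of the two eligible loci falls within the tolerance.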

Single STR association analysis

We performed association analyses of STR genotype with AD risk under a dominant model by testing the longer of two alleles for each STR. Given that many of the STR genotypes were not normally distributed, we first performed rank-based inverse normal transformation of STR genotypes. To test for the association of each STR with AD case/control status, we applied a logistic regression model, controlling for sample covariates (sex, age, and the first three PCs) and technical covariates (genomic sequencing coverage for each sample and local sequencing coverage at each STR for each sample). We used ADC as the discovery cohort (total n = 2393; 1213 AD cases and 1180 controls) and ROSMAP as the replication cohort (total n = 588; n = 309 AD cases and 279 controls).

We report the logistic regression p-value of the rank-based inverse normal transformed STR genotype on AD risk. We separately report the regression p-value in a logistic regression model without inverse normal transformation to derive absolute effect sizes on AD case/control status. Finally, to ensure results were not due to the non-normal distribution of the STR genotypes, association analyses were also performed using a non-parametric association test (two-sided Mann-Whitney test) without covariates.
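For readers unfamiliar with the transformation, a rank-based inverse normal transformation can be sketched as follows. This is an illustration only: the Blom offset (c = 3/8) and the averaging of tied ranks are assumptions of the sketch, as the specific variant used is not stated above, and the function name is hypothetical.

```python
from statistics import NormalDist


def rank_inverse_normal(values, c=3.0 / 8.0):
    """Map values to normal quantiles by rank (ties receive their average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


# Hypothetical tract lengths for five individuals at one STR
print(rank_inverse_normal([5, 7, 7, 30, 6]))
```

The transformed values keep the rank order of the tract lengths while removing the influence of extreme lengths on the regression.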

For reporting of STR association analyses, we removed all STRs that were in segmentally duplicated regions (http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/genomicSuperDups.txt.gz) (n = 18,979), resulting in a final list of 293,752 STRs that were reported in our association analyses. Multiple hypothesis testing correction was performed using Bonferroni correction, with a p-value threshold of 1.70 × 10−7 (0.05/293752). P-values between the discovery and replication cohorts were meta-analysed using Fisher’s method.
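The threshold arithmetic and the meta-analysis step can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function names are hypothetical. Fisher's method combines k p-values into the statistic −2 Σ ln(p), which follows a chi-square distribution with 2k degrees of freedom under the null, and for even degrees of freedom the survival function has the closed form used below.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    All p-values must be greater than zero.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half of the chi-square statistic
    # Chi-square survival function with 2k df:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


print(bonferroni_threshold(0.05, 293752))  # single STR association threshold
print(fisher_combined_p([0.05, 0.05]))     # hypothetical discovery and replication p-values
```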

Power analyses for single STR associations

To estimate power for single STR associations, we simulated STRs with five alleles (i.e., five different STR tract lengths). We simulated a sample size of 3000 individuals, where 600 individuals had each of the five alleles. We tested power at different allelic odds ratios (i.e., the odds ratio per unit increase in STR tract length) ranging from 1.01 to 1.25 in step sizes of 0.01. For each allelic odds ratio, we performed 10,000 permutations of case/control status by randomly permuting each individual to a case or a control with equal probability. For each permutation, we then performed a logistic regression of case/control status against the simulated STR tract length. For each allelic odds ratio, we then counted the number of permutations where the logistic regression p-value was less than the Bonferroni-corrected p-value threshold of 1.70 × 10−7 (0.05/293752). Power was calculated as the number of simulations reaching statistical significance divided by the number of permutations (n = 10,000).

Hypergeometric test for STR expansions

We performed a burden test for the number of STR expansions in case versus control individuals for each STR. We performed this analysis using ExpansionHunter genotyping data on the ADC discovery cohort (total n = 2393; 1213 AD cases and 1180 controls).

In addition to excluding STRs in segmentally duplicated regions as described above, we also excluded 20,653 STRs that had genotyping rates <90% in either the cases or the controls, resulting in a set of 273,099 STRs. We first identified STR expansions for each of the 273,099 STRs in our custom STR panel as STR tract lengths that were ≥ 1, 5, 10, or 20 repeat units longer than the GRCh38 reference STR tract length. For each STR, we then constructed a 2 × 2 contingency table of the number of individuals with and without an STR expansion as defined by these thresholds. We applied a two-sided Fisher’s exact test to each STR to test whether there is a difference in burden of STR expansions in cases versus controls. Multiple hypothesis testing correction was performed using Bonferroni correction, with a p-value threshold of 1.83 × 10−7 (0.05/273099).
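The per-STR test on a 2 × 2 table can be sketched in standard-library Python. The sketch assumes the common two-sided definition in which the p-value sums the probabilities of all tables that are no more likely than the observed one. The function name and the table layout (rows are cases and controls, columns are individuals with and without an expansion) are choices made for this illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    a: cases with expansion      b: cases without expansion
    c: controls with expansion   d: controls without expansion
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x cases among the col1 carriers
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum tables at least as extreme as the observed one (small float tolerance)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical counts for one STR in the discovery cohort
print(fisher_exact_two_sided(12, 1201, 3, 1177))
```

Each resulting p-value would then be compared with the Bonferroni threshold of 0.05/273099 given above.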

We also performed analyses restricting to STR expansions observed in only one individual, ≤5, ≤10, ≤100, or with no frequency cutoff in the combined n = 2981 individuals. We then compared the total number of STR expansions in AD case versus control individuals using a two-sided Mann-Whitney test.

Identifying STR expansions using DBSCAN

We applied the DBSCAN outlier detection method to identify STR expansions, which we extended from the work of Trost et al.13. Briefly, DBSCAN is an unsupervised density-based clustering method that can be used to identify outlier groups47, here representing extreme STR tract lengths for each STR. DBSCAN defines a cluster based on the minimum number of data points (μ) reachable to each other by a maximum distance (ε). Data points not reachable by the clusters are classified as outliers if they have an STR tract length that is higher than those of cluster members. Specifically, ε was set as the maximum of twice the mode of STR lengths, and μ was set as the log2 of the number of samples.

To increase statistical power to identify rare STR expansions, we combined samples from across the ADC and ROSMAP cohorts (total n = 2981; n = 1492 controls and 1489 cases). We ran DBSCAN on each of the 293,752 STRs in our panel (after removing STRs in segmentally duplicated regions as described above). For each STR, the longer of the two alleles for each individual was used as input for DBSCAN. To account for potential confounders, we first performed linear regression for the tract lengths of each STR to regress out the effects of sample covariates (sex, age, and the first three PCs) and technical covariates (genomic sequencing coverage for each sample and local sequencing coverage at each STR for each sample). We input the residuals from the linear regression into DBSCAN, with ε and μ as determined above. We ran DBSCAN on these residuals to identify outlier STR lengths. We applied DBSCAN separately for STR genotyping results from ExpansionHunter and gangSTR. We excluded individuals from downstream analyses if the number of STR expansions identified was more than 2 standard deviations away from the mean STR expansion count for the cohort; 15 individuals were excluded by this filter. DBSCAN was implemented using the dbscan package v1.1-11 (https://github.com/cran/dbscan) in R v3.6.348,49.
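The outlier rule can be illustrated with a simplified one-dimensional DBSCAN written in plain Python. This is a sketch of the general technique and not the dbscan R package used in the study. The function names, the toy values, and the choice of eps are hypothetical, and the toy input stands in for the regression residuals of a single STR.

```python
import math


def dbscan_1d(values, eps, min_pts):
    """Label points by cluster index; -1 marks noise (not reachable from any cluster)."""
    n = len(values)
    # Neighbourhoods include the point itself
    neighbors = [[j for j in range(n) if abs(values[i] - values[j]) <= eps]
                 for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()  # p is always a core point
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster
                    if is_core[q]:
                        stack.append(q)  # only core points extend the cluster
        cluster += 1
    return [-1 if lab is None else lab for lab in labels]


def expansion_outliers(values, eps, min_pts):
    """Indices of noise points lying above every clustered value."""
    labels = dbscan_1d(values, eps, min_pts)
    clustered = [v for v, lab in zip(values, labels) if lab != -1]
    if not clustered:
        return []
    ceiling = max(clustered)
    return [i for i, (v, lab) in enumerate(zip(values, labels))
            if lab == -1 and v > ceiling]


# Toy input: 32 samples, so log2 of the sample count is 5
values = [10.0] * 20 + [11.0] * 10 + [3.0, 40.0]
min_pts = int(math.log2(len(values)))
print(expansion_outliers(values, eps=2.0, min_pts=min_pts))  # eps is a hypothetical value
```

In the toy input both 3.0 and 40.0 are noise points, but only 40.0 is reported because it exceeds all cluster members, which mirrors the restriction to unusually long tract lengths.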

Testing for enrichment of STRs in chromHMM and ChIP-seq annotations

We downloaded chromHMM partitions under the 18 state model for the adult human hippocampus (ENCODE tissue ID E071) from the Roadmap Epigenomics Consortium (https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html)32,33. We then tested whether AD STR expansions were enriched in each chromatin state, relative to a background of all STRs in our custom panel of polymorphic STRs. We performed statistical testing using a two-sided Fisher’s exact test. Prior to statistical testing, we removed all regions located within segmental duplications from the analysis. We considered an STR to be located within a chromHMM partition if at least one bp of the STR was within the partition.
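The overlap criterion (at least one shared bp) can be sketched as follows. This assumes half-open intervals as in BED files; the function names and the example coordinates are hypothetical. The resulting counts for AD STR expansions and for background STRs would populate the 2 × 2 table passed to Fisher’s exact test.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def count_overlapping(strs, regions):
    """Count STRs that overlap any annotation region on the same chromosome.

    strs, regions: lists of (chrom, start, end) tuples.
    """
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))
    return sum(
        1
        for chrom, start, end in strs
        if any(overlaps(start, end, r_start, r_end)
               for r_start, r_end in by_chrom.get(chrom, []))
    )


# Hypothetical STRs and chromatin-state partitions
strs = [("chr1", 100, 110), ("chr1", 200, 210), ("chr2", 100, 110)]
regions = [("chr1", 109, 150), ("chr1", 210, 300)]
print(count_overlapping(strs, regions))
```

A linear scan per STR is used here for clarity; at genome scale an interval tree or a tool such as BEDTools would be the more practical choice.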

We also tested whether AD STR expansions were enriched in certain histone marks as assayed by ChIP-seq. We downloaded ChIP-seq peak data from ENCODE for the adult hippocampus for H3K4me1, H3K4me3, H3K9ac, H3K27ac, H3K27me3, and H3K36me3 (https://www.encodeproject.org/matrix/). For each biological sample, we used IDR-replicated peaks from ENCODE. If there were multiple biological samples for each histone mark, we merged peaks across samples using BEDTools v2.2650. We then performed the enrichment analysis as above.

For both the chromHMM data and ENCODE ChIP-seq data, we also performed enrichment analyses using a permutation-based approach. We calculated how many of the n = 6276 AD STR expansions directly overlapped the peak set. We then generated 1000 random sets of 6276 peaks using regioneR v1.30.0 using default parameters except per.chromosome=F51. Random peak sets were generated against the GRCh38 reference genome, with masking of segmental duplications. We counted how many AD STR expansions directly overlapped each random peak set. We derived an empirical p-value by counting the number of random permutations with equal or more overlaps than the observed number of overlaps and dividing by the total number of permutations (n = 1000).
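The empirical p-value defined above reduces to a simple count, sketched here in Python. The function name and the example counts are hypothetical.

```python
def empirical_p_value(observed, permuted_counts):
    """Fraction of permutations with at least as many overlaps as observed."""
    as_extreme = sum(1 for count in permuted_counts if count >= observed)
    return as_extreme / len(permuted_counts)


# Hypothetical overlap counts from four random peak sets
print(empirical_p_value(10, [1, 2, 10, 11]))
```

When no permutation reaches the observed overlap, this ratio is zero, which with 1000 permutations corresponds to reporting an empirical p-value < 0.001.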

Testing for STR enrichments in transposable elements

To identify locations of TEs in the genome, we obtained the RepeatMasker file from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables)52. We tested for enrichment of AD STR expansions across 5 broad classes of TEs: Alu, HERVs, LINE1, LINE2, and SVA elements. For each TE class, we tested whether AD STR expansions were enriched in the TE, relative to a background of all STRs in our custom STR panel. We performed statistical testing using a two-sided Fisher’s exact test. Prior to statistical testing, we removed all genomic regions within segmental duplications. We considered an STR to be located within a TE if at least one bp of the STR was within the TE coordinates from the RepeatMasker file.

Gene Ontology Enrichment Analysis

We performed gene set enrichment analysis using the clusterProfiler v4.2.2 package in R v3.6.353. For all STRs, we compared AD STR expansions (n = 6276) to all STRs in the reference STR panel (n = 293,752). For this analysis, we assigned each STR to the gene with the closest TSS within ± 250 kb. We assigned genes using transcriptional start sites annotated in Ensembl release 112 in human reference genome GRCh38.p14 (https://useast.ensembl.org/index.html), which was loaded into R using the biomaRt package v2.58.054,55. All genes in the Ensembl gene annotation were used, including non-protein-coding genes. We also performed separate analyses for STRs within the gene body (either introns, exons, or untranslated regions), where we compared genes with AD STR expansions within the gene body (n = 1327) to all genes containing an STR from the reference STR panel (n = 13,594). We then tested the enrichment of genes assigned to AD STR expansions compared to genes assigned to background STRs using the enrichGO function in clusterProfiler, with the following parameters: keyType = “ENTREZID”, ont = “ALL”, pvalueCutoff = 0.05, qvalueCutoff = 0.05.
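The gene assignment rule (closest TSS within ± 250 kb) can be sketched in Python as follows. The sketch assumes that each STR is represented by a single coordinate and that TSS positions for one chromosome are supplied in sorted order; both are assumptions of the illustration, and the function name and example genes are hypothetical.

```python
from bisect import bisect_left


def nearest_gene(str_pos, tss_sorted, max_dist=250_000):
    """Return the gene with the closest TSS within max_dist, or None.

    tss_sorted: list of (position, gene) tuples for one chromosome,
    sorted by position.
    """
    positions = [pos for pos, _ in tss_sorted]
    i = bisect_left(positions, str_pos)
    # Only the TSSs immediately left and right of the STR can be the closest
    candidates = tss_sorted[max(0, i - 1): i + 1]
    best = min(candidates, key=lambda t: abs(t[0] - str_pos), default=None)
    if best is None or abs(best[0] - str_pos) > max_dist:
        return None
    return best[1]


# Hypothetical TSS annotation for one chromosome
tss = [(1_000, "GENE_A"), (500_000, "GENE_B")]
print(nearest_gene(200_000, tss))
print(nearest_gene(1_000_000, tss))
```

STRs with no TSS inside the window return None and would be left unassigned in the enrichment analysis.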

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.