Introduction

Reading is a key academic skill and an important component of education. Difficulties with reading are associated with poorer life outcomes, lower socioeconomic status, and can greatly impact quality of life [1]. Dyslexia, characterized by difficulties with accurate and/or fluent word reading, and poor spelling, occurs in 5–10% of school age children [2]. Some diagnostic definitions of dyslexia extend to reduced performance on measures of verbal memory and processing speed, and/or emphasize a discrepancy between reading and other cognitive abilities [3]. Dyslexia tends to cluster within families [4] and shows high heritability in twin-studies (0.4–0.6) [5]. Similarly, quantitative measures of reading ability are highly heritable, with a recent twin-based meta-analysis reporting a heritability of 0.66 [6, 7], Unravelling the biological basis of dyslexia is essential in understanding difference in reading skill, and why some people struggle with reading throughout their lives.

Allelic variation in several genes have been associated with dyslexia and reading-related traits [5, 8,9,10,11], although with mixed support from replication studies [11]. Doust et al. (2022) [11] performed the largest genome-wide association study (GWAS) of this trait to-date, using 23andMe self-reported dyslexia diagnosis in 51,800 cases and 1,087,070 controls. Forty-two significantly associated regions were identified, including 27 not previously reported in studies of educational attainment or cognitive traits. The largest GWAS of quantitative reading skill [9] (meta-analysis of 33,959 individuals from 19 cohorts) by the GenLang Consortium identified a single locus associated with word reading (rs11208009, P = 1.10 × 10−8) containing three candidate genes (DOCK7, ANGPTL3 and USP1). Strong genetic correlations were observed between quantitative measures of word reading (−0.71, 95% CI −0.62–−0.8), nonword reading (−0.7, 95% CI −0.61–−0.8), spelling (−0.75, 95% CI −0.64–−0.86) and dyslexia [11]. This is consistent with the view that dyslexia is representative of the low extreme of normal varying reading ability in the population [12, 13] rather than being a qualitatively distinct phenotype.

It is clear from studies of the genetics of dyslexia and reading [8, 9], echoed in other neurodevelopmental traits such as autism spectrum disorder (ASD) [14] and attention deficit hyperactivity disorder (ADHD) [15], that large sample sizes are key to improving resolution of associated variants. For developmental measures of literacy collected in childhood, it has historically been challenging to gather sufficient sample size. Similarly, availability of large cohorts with clinical diagnoses of dyslexia and suitable genetic data, are limited. One of the alternative methods to collecting and phenotyping new cohorts for a trait of interest, multi-trait analysis of GWAS approach (MTAG) [16], uses the shared genetic architecture of related phenotypes to increase gene discovery power. Grove and colleagues [14] used MTAG to increase their ASD GWAS power by adding GWAS for schizophrenia, educational attainment, and major depression. This showed stronger evidence for previously reported regions, and seven novel regions shared with educational attainment or depression. More recently, multivariate analyses were used across five psychiatric traits (ASD, ADHD, bipolar disorder, schizophrenia, and depression) [17], again increasing the number of associated loci identified for each individual trait, particularly bipolar disorder, which increased from 8 loci to 54. Given the strong genetic correlation between dyslexia and word-reading skills [11], we applied the multivariate method to boost sample size of the dyslexia and word reading GWAS, and identify novel associated loci.

Methods and materials

Multivariate GWAS

Multivariate GWAS (MTAG) [16] was performed using the dyslexia (23andMe, Inc, Ncases = 51,800, Ncontrols = 1,087,070) [11] and word reading (GenLang, N = 27,180) [9] GWAS summary statistics [16] without genomic control correction applied. Power estimation was calculated using the Genetic Power Calculator (https://zzz.bwh.harvard.edu/gpc/) [18] for quantitative traits and the Genetic Association Study (GAS) power calculator for binary traits (https://csg.sph.umich.edu/abecasis/cats/gas_power_calculator/). Associations were visualized using ggplot2 [19] and LocusZoom (http://locuszoom.org), and annotated using FUMA v1.5.0 [20]. SNP-based heritability \(({h}_{{\rm{snp}}}^{2})\) was estimated using LDSC v1.0.1 [21] and SumHer (LDAK) [22]. Sample prevalence was estimated at 5% [11] and sample size of N = 1,228,832. Genetic correlations were performed using LD-Score v1.0.1 within the Complex-Traits Virtual Genetics Lab (CTG-VL) platform (https://vl.genoma.io) (Supplemental Method). Code generated for this study is available online (https://github.com/hayley-mountford/multivariate_GWAS_dyslexia).

Biological annotation

Gene-based associations and gene-set biological pathways analyses were calculated using MAGMA v1.08 [23] within the FUMA interface (https://fuma.ctglab.nl/). Fine mapping and annotations were performed using the Variant Effect Predictor (VEP) online tool (http://grch37.ensembl.org/). Expression QTL analysis was performed using FUMA. MAGMA, within FUMA, was used to test for enrichment of tissue-specific annotations. To interrogate cell- and region-specific resolution, we accessed single-cell RNA-seq (scRNA) data via FUMA. Partitioned SNP heritability was examined using stratified LDSC, as described by Finucane et al. [24] (Supplemental Methods).

Polygenic index prediction and selection

The dyslexia polygenic index (PGI) was calculated for the National Child Development Study (NCDS) (N = 6410) [25] on six longitudinal reading measures described in Bridges et al. [26] using both PRSice2 v2.3.5 [27] and SBayesRC v0.2.6 [28]. Evidence of polygenic selection for dyslexia was examined using a large panel of 1015 imputed ancient genomes from the last 15 k years, sampled from across West Eurasia [29, 30]. Allele frequency trajectories and selection coefficients were modelled using CLUES [31], then polygenic selection gradients with PALM [32] using imputed ancestral data described by Barrie et al. [33] (Supplemental Methods).

Results

Multivariate GWAS of dyslexia

The genetic correlation between the univariate summary statistics of dyslexia and word reading (Europeans only, without GC correction) was −0.71 (SE = 0.05, Z = −15.06), and indicative of a high degree of shared genetic etiology enabling multivariate GWAS analysis with MTAG [16]. Meta-analysis produced an equivalent sample size of 1,228,832 for dyslexia, using 5,449,985 SNPs shared between the univariate summary statistics. This provided more than 83% power to detect additive risk of up to 1.046 (5% prevalence) and ≥ 84% power to detect additive risk of 1.038 at 10% prevalence (N = 1,228,832, α = 5 × 10−8).

We identified 80 genome-wide significant (P ≤ 5 × 10−8) independent loci (r2 < 0.6, and < 250 kb maximum distance between LD blocks to merge into one genomic locus) (Fig. 1a, Table 1), containing 211 independent significant SNPs, independent from each other at an R2 of 0.1. Forty-four of these regions were previously reported as significantly associated with dyslexia; 41 in the univariate dyslexia GWAS [11], and three by Ciulkinyte et al. [34] who used genomic structural equation modelling (GenomicSEM) to identify pleiotropic loci for dyslexia and attention deficit disorder (ADHD). Of our 80 loci, 36 were not present in the literature. However, of these 80 identified loci, 66 were present in the uncorrected univariate dyslexia summary statistics available from 23andMe, Inc. for which genomic control correction was not applied (P ≤ 5 × 10−08). We therefore consider novel regions more conservatively; not reported as significant in Doust et al. [11], Ciulkinyte et al. [34] or in the uncorrected univariate summary statistics. We detected 36 previously unreported loci, and more conservatively, thirteen novel loci not previously associated at significance threshold with dyslexia (Tables 1 and 2).

Fig. 1: Regions associated with dyslexia and reading ability.
figure 1

Manhattan plot of the multivariate GWAS of A dyslexia and B reading ability. The y axis indicates the -log10 P value for association. The threshold for genome-wide significance (P < 5 × 10−8) is represented by a dashed grey line. Significant loci that were previously reported in the GenLang word reading GWAS [9] are represented in red, and those reported in the dyslexia GWAS [11] are shown in purple.

Table 1 Loci associated with dyslexia detected by multivariate GWAS (MTAG), with the lead SNP indicated.
Table 2 Regions not previously associated with dyslexia from multivariate GWAS using MTAG (i.e. regions that were not significant in Doust et al. [11] or Ciulkinyte et al. [34]).

Quantile-Quantile (Q-Q) plots (Figure S1) indicated appropriate control for population stratification, as markers showing low association with dyslexia did not deviate from the expected quantile. Moderate genomic inflation was observed (λ = 1.573, χ2 = 1.773, LD intercept = 1.017 (0.012)) and consistent with a highly polygenic trait (see Supplementary Note for further discussion). Summary statistics for SNPs reaching suggestive significance (P ≤ 1 × 10−5) are presented in Table S1.

The most significantly associated loci were consistent with regions reported in the univariate dyslexia GWAS [11], with each showing higher significance in the multivariate analysis. The top locus, chr3q22.2 (rs13082684, P = 1.8 × 10−21) containing PPP2R3A (Table 1) is consistent with the most highly associated SNP of the dyslexia GWAS. The second top SNP in the present study, rs2426117 (P = 1 × 10−18) mapped in region chr20q13.13 mapped to within the gene CSE1L, where previously it was rs6019624 (P = 2.2 × 10−16) in neighboring gene ARFGEF2. The third top SNP, rs9696811 (P = 5.38 × 10−18) was located in region chr9q34.11, showing consistently higher significance than in the prior study (Table 1).

The lead SNP identified by the GenLang word-reading meta-GWAS [9], rs11208009, did not reach genome-wide significance in the present multivariate study (P = 7.85 × 10−5). However, it fell within a region that reached suggestive significance (chr1:62900811–63199936) at P = 1.9 × 10−6 in which rs1168114 (LD = 0.636) was now the lead SNP. This locus overlaps completely with the original study, therefore including candidate genes DOCK7, ANGPTL3, and USP1 (Figure S2).

Novel regions associated with dyslexia

Our multivariate analysis detected 36 regions not previously reported as significantly associated with dyslexia [11, 34] (Table 2). Individual LocusZoom plots for these 36 regions are presented in Figures S3S38. More conservatively, we detected 13 regions that did not previously reach genome-wide significance in the univariate or the uncorrected dyslexia summary statistics [11] or the more recent GenomicSEM study [34] (shown in bold in Table 2).

Prior associations were reported for three novel lead-SNPs: the most significantly associated novel SNP, rs79445414 (P = 9.22 × 10−10) with schizophrenia [35], rs583452 (P = 2.7 × 10−08) in gene GRIA4 with cognitive performance, and rs362307 (P = 1.98 × 10−09) within the gene HTT, previously linked to a range of phenotypes including educational attainment [36], cognitive ability [37], and a “worry” phenotype key to neuroticism [38, 39]. No associations were reported in the GWAS Catalog for three of the novel loci, with lead SNPs rs2055873 mapped to LMNB1 and MARCH3, rs7776042 mapped to RNF144B and ID4, and rs7184217 in gene CACNG3. Associations of SNPs in LD with significant SNPs in the regions showed associations with several cognitive, psychiatric and neurodevelopmental phenotypes, particularly ADHD [40, 41].

Multivariate GWAS of reading ability

We also examined the reading ability output of MTAG, resulting in an effective sample size of N = 102,082, providing 87% power to detect additive trait variance of up to 0.04% (α = 5 × 10−8). Thirty-five independent loci met the genome-wide significance threshold (Fig. 1b). Of these 35 associated loci, 28 were genome-wide significant in the original univariate dyslexia GWAS [11], and 34 were significant in the uncorrected dyslexia summary statistics. The novel locus present in the reading ability multivariate GWAS (rs362307 (P = 1.91 × 10−8) in HTT) was also novel in the dyslexia multivariate analysis. Summary statistics for SNPs reaching suggestive significance (P ≤ 1 × 10−5) are presented in Table S2 and regions significantly associated with the reading ability multivariate analysis are presented in Table S3.

The GenLang word-reading meta-GWAS lead-SNP [9], rs11208009, did not reach genome-wide significance in the present multivariate study (P = 2.71 × 10−6). However, it fell within a region of suggestive significance (chr1:62900811–63199936) in which rs1168114 (LD = 0.636) was now the lead SNP (P = 1.96 × 10−7), fully overlapping with the original study containing genes DOCK7, ANGPTL3, and USP1.

Q-Q plots (Figure S39) indicated appropriate control for population stratification. Genomic inflation suggested that moderate population stratification was present (λ = 1.28, χ2 = 1.431), however the low LD intercept (0.843 (0.01) indicated that the discrepancy in sample size affects the reliability of results for reading ability (see Supplementary Note for further discussion). Therefore, subsequent genetic and biological analyses are focused only on the dyslexia multivariate summary statistics.

Heritability and genetic correlations of dyslexia

LDSC analysis of the dyslexia multivariate GWAS revealed a liability-scale SNP-based heritability estimate \(({h}_{{\rm{snp}}}^{2})\) of 0.129 (SE = 0.005, 95% CI 0.12–0.139) at 5% population prevalence of dyslexia, \({h}_{{\rm{snp}}}^{2}=0.143\) (SE = 0.006, 95% CI 0.131–0.155) at 7%, and \({h}_{{\rm{snp}}}^{2}=0.160\) (SE = 0.006, 95% CI 0.148–0.173) at 10%. LDSC is known to overestimate SNP-based heritability because it assumes that each SNP contributes equal heritability, so we also applied SumHer which expects heritability to vary with both linkage disequilibrium and minor allele frequency [22]. SumHer liability-scale \({h}_{{\rm{snp}}}^{2}\) estimates for the dyslexia multivariate GWAS were \({h}_{{\rm{snp}}}^{2}=0.164\) (SE = 0.005, 95% CI 0.154–0.174) at 5% population prevalence of dyslexia, \({h}_{{\rm{snp}}}^{2}=0.192\) (SE = 0.006, 95% CI 0.18–0.204) at 7%, and \({h}_{{\rm{snp}}}^{2}=0.204\) (SE = 0.006, 95% CI 0.191–0.216) at 10%.

Genetic correlations were estimated between multivariate dyslexia and 2824 traits (Bonferroni corrected threshold of P ≤ 1.77 × 10−5) including recently published summary statistics for ASD [14] and ADHD [15]. Statistically significant genetic correlations (P ≤ 1.77 × 10−5) were found for 489 traits. A subset of relevant significant correlations is presented in Fig. 2, with full results in Table S4.

Fig. 2: Genetic correlations of dyslexia with selected relevant phenotypes.
figure 2

Significant (P ≤ 1.77 × 10−5) genetic correlations (rg) between multivariate analysis of dyslexia and other selected phenotypes. UKBB UK Biobank, MVP Million Veterans Program, MVP EUR Million Veterans Program Europeans, GLIDE Gene-Lifestyle Interactions in Dental Endpoints, SSGAC Social Science Genetic Association Consortium, PGC Psychiatric Genetics Consortium.

Dyslexia showed strongly negative correlations (rg) with measures of intelligence, specifically verbal-numerical reasoning (UK Biobank) (−0.56, SE = 0.03) in which the F17 synonym question was most strongly correlated (−0.62, SE = 0.06), and cognitive performance (−0.53, SE = 0.02) (SSGAC). Consistent with previous findings, dyslexia was strongly genetically correlated with academic achievement and education level, including highest qualification of GCSEs or equivalent (−0.52, SE = 0.04), years in full time education (−0.37, SE = 0.04), completing college or a university degree (−0.27, SE = 0.02) (UK Biobank) and educational attainment (−0.24, SE = 0.02) (SSGAC). Dyslexia showed positive genetic correlations with workplace hazards (exposure to paints, thinners or glues (0.61, SE = 0.09); chemicals or fumes (0.56, SE = 0.09); asbestos (0.53, SE = 0.09); workplace dust (0.53, SE = 0.1)) (UK Biobank), and achieving either vocational qualifications (NVQ or HND or HNC: 0.51, SE = 0.05; CSEs or equivalent: 0.43, SE = 0.04) (UK Biobank).

In terms of neurodevelopmental traits, ADHD [15] showed a positive correlation with dyslexia (0.4, SE = 0.03). This was weaker than that shown by the univariate dyslexia study alone (0.53, SE = 0.12) [11], and while not reported by Eising et al. [9], we calculated rg for ADHD and the univariate word-reading as −0.40 (SE = 0.05). ASD was not significantly correlated (0.09, SE = 0.04, P = 0.02) with dyslexia.

Major depressive disorder showed a positive correlation (0.16, SE = 0.03) (Psychiatric Genetics Consortium), as did measures linked to poorer wellbeing including manic episodes (0.4, SE = 0.07), risk taking (0.3, SE = 0.03) and mood swings (0.18, SE = 0.03). Measures of pain, injury, and use of pain medication were also highly correlated, including taking medication for severe hearing loss or deafness (0.52, SE = 0.12) (Million Veterans Program), pain all over the body (0.5, SE = 0.06), neck and should pain (0.47, SE = 0.1) and leg pain on walking (0.44, SE = 0.04) (UK Biobank).

Gene and gene-set associations

Gene-based analysis of the multivariate dyslexia summary statistics identified 203 associated genes, meeting the Bonferroni-derived α level for 18,842 tests (P < 2.65 × 10−6). Most of these genes associated with dyslexia (N = 150/203) were also present in associated regions detected by GWAS, while 53 fell outside of associated regions from the SNP-based screen (Table S5). The overall number of associated genes (N = 203) was higher than the 172 detected in the gene-based testing of the original dyslexia study [11] (Table S5). One-hundred and twenty-six of the 172 genes associated in Doust et al. fall within regions that met genome-wide significance in the present dyslexia multivariate GWAS.

Eising et al. (2022) did not report a gene-based association analysis, however, our own analysis using their data showed one gene meeting the significance threshold (AC079354.1, P = 7.4 × 10−7), with candidate genes DOCK7 and USP1 showing the next strongest associations [9].

MAGMA gene-set analysis detected enrichment of four biological pathways from 9113 curated gene sets and gene ontology (GO) terms (Table S6) (Bonferroni threshold of P ≤ 5.49 × 10−6). The Reactome term for oncogene induced senescence (Beta = 0.82, SE = 0.18, P = 1.78 × 10−6), Verhaak glioblastoma proneural (genes correlated with proneural type of glioblastoma multiforme tumors) (Beta = 0.36, SE = 0.03, P = 2.88 × 10−6), GO-biological process of cell part morphogenesis (Beta = 0.19, SE = 0.03, P = 3.34 × 10−6) and GO-cellular compartment synapse (Beta = 0.13, SE = 0.03, P = 4.54 × 10−6) showed significant associations. Other sub-threshold pathways were GO-molecular function proton channel activity (P = 3.12 × 10−5), adherens junction interactions (Reactome) (P = 3.66 × 10−5) and post-synapse (cellular compartment) (P = 4.87 × 10−5).

Variant mapping and functional annotation

VEP annotation of candidate SNPs (N = 14,663) found in LD (R2 ≥ 0.6) with one of the independent significant SNPs, including tagged SNPs from the 1000 Genomes reference panel, resulted in 138,467 individual candidate SNP annotations, due to multiple transcript variants and multiallelic sites. Intronic variants were most common (58%) with coding variants making up 0.29% (N = 397) (Figure S40a). Of the 397 coding variants, 53% were missense and 6% were stop-gains (Figure S40b).

Nine variants predicted as damaging by SIFT and PolyPhen and with CADD scores ≥ 20, an indication for possible deleterious effects of the variants, were found in six genes: rs11142 (chr1:109897103, tag SNP) allelic variants C and T in SORT1; rs1983864 (chr10:100017453, tag SNP) in LOXL4; rs10891314 (chr11:111916647) in DLAT with allelic variants A and T; rs8539 (chr 2:198362018, tag SNP) in HSPD1; rs1064213 (chr2:198950240, tag SNP) in PLCL1; rs1130146 (chr 20:47859217, tag SNP) in DDX27; and rs11539148 (chr3:49138810, tag SNP) in QARS (Table S7). Five of the variants with CADD scores ≥ 20 (but not annotated by SIFT or PolyPhen) were predicted to result in stop-gain changes: rs2424922 alleles T and A (chr20:31386449, tag SNP) (both alleles annotated at stop-gained and splice region variants), and rs6058891 (chr 20:31386347, tag SNP) in DNMT3B; rs6169 (chr11:30255185, tag SNP) in FSHB; and rs3764090 (chr13:50008301, tag SNP) in AL136218.1.

At the gene level, 1115 genes were contained within genome-wide significant regions (Table S8). Sensitivity to loss-of-function was annotated with probability of loss-intolerance scores (pLI) and sensitivity to non-coding variation in regulatory sequences was annotated with non-coding residual intolerance scores (ncRVIS). Two-hundred and twenty-four genes (20.1%) were predicted as loss-of-function intolerant by pLI ≥ 0.9, and seventeen (1.5%) were predicted as less tolerant to non-coding variation by ncRVIS ≥ 2.0. Four genes (SIK2, PTPN14, ARNT2 and XYLT1) were predicted as intolerant by both metrics (pLI ≥ 0.9 and ncRVIS ≥ 2.0).

Two-hundred and nineteen genes (N = 219/1115, 19.6%) located within associated regions showed evidence of association with expression QTLs (eQTL) in brain tissue (minimum eQTL false discovery rate of mapped SNPs (P ≤ 0.05)) (Table S8). The strongest eQTL associations were for DHRS11 (P = 6.36 × 10−77), CYB561 (P = 1.41 × 10−69), INA (P = 3.68 × 10−60), ZNF660 (P = 1.59 × 10−57) and CCDC171 (P = 6.91 × 10−56).

Functional enrichment using partitioned heritability and gene property analysis

To examine the tissue-specific expression profiles of genes implicated in dyslexia, we used MAGMA gene property analysis within FUMA. Using RNA-seq data from the Genotype-Tissue Expression (GTEx) project, we found significant enrichments of genes associated with dyslexia in brain tissue: 11 brain regions tested showed significantly higher expression levels of dyslexia associated genes (P ≤ 4.03 × 10−4), particularly the cerebellum, cerebellar hemisphere and frontal cortex (Figure S41, Table S9). As MAGMA analysis corrects for the average expression level in the dataset, a significant association indicates that genes associated with dyslexia have a higher expression in that tissue relative to the average expression within the dataset.

Next, we tested for enrichment within the BrainSpan data set, consisting of RNA-seq from 11 developmental stages (Figure S42, Table S10) and 29 ages (Figure S43, Table S11) of human brains. No associations met Bonferroni correction for 124 tests (P ≤ 4.03 × 10−4). To further investigate enrichment of gene expression within the developing human brain, we tested for associations with specific cell types in single cell RNA-seq (scRNA) data using MAGMA within FUMA. Embryonic ventral midbrain (6–11 post conception weeks (pcw)) revealed three enriched cell types meeting Bonferroni correction (P < 6.58 × 10−4, 76 tests) (Fig. 3a, Table S12): GABAergic neurons (Gaba, P = 3.19 × 10−7), neuroblast GABAergic neurons (NbGaba, P = 1.49 × 10−4), and red nucleus neurons (RN, P = 5.99 × 10−4). Embryonic prefrontal cortex scRNA-seq data spanning 8–26 pcw, showed significant enrichment in neurons at 16 pcw (P = 1.46 × 10−4), and then GABAergic neurons (P = 6.76 × 10−7), astrocytes (P = 1.65 × 10−4), neurons (P = 3.18 × 10−4) and oligodendrocyte precursor cells (OPC) (P = 2.85 × 10−4) at 26 pcw (Fig. 3b, Table S13). Finally, scRNA-seq data from fetal and adult human cortex showed significant expression in adult cortical neurons (P = 2.94 × 10−4) (Fig. 3c, Table S14).

Fig. 3: Analysis of expression patterns of dyslexia-associated genes in the developing human brain.
figure 3

MAGMA gene property analyses of dyslexia associated genes with single cell gene expression data from A embryonic ventral midbrain from 6–11 post conception weeks (pcw), B embryonic prefrontal cortex from 8–26 post conception/ gestational weeks (GW), C human fetal and adult cortex. DA0-1 dopaminergic neurons, Endo endothelial cells, Gaba GABAergic neurons, Mgl microglia, NbGaba neuroblast GABAergic, NbM medial neuroblast, NbML1-5 mediolateral neuroblasts, NProg neuronal progenitor, OMTN oculomotor and trochlear nucleus, OPC oligodendrocyte precursor cells, Peric pericytes, ProgBP progenitor basal plate, ProgFPL progenitor medial floorplate, ProgM progenitor midline, Rgl1-3 radial glia-like cells, RN red nucleus, Sert serotonergic.

Heritability partitioning by LDSC identified statistically significant enrichment of variants associated with dyslexia in chromatin signatures annotated in fetal and adult brain tissues obtained from the Roadmap Epigenomics project and Enhancing GTEx project (ENTEx) (Fig. 4, Table S15). Out of 489 chromatin signatures tested, thirty-one annotations were enriched (Bonferroni corrected threshold P < 1.02 × 10−4), from fetal brain (N = 6), adult brain (N = 23) and primary neuronal cultured cell lines (N = 2), across a range of chromatin signatures of (active) enhancers and promoters and actively transcribed regions.

Fig. 4: Partitioned heritability enrichment analysis of chromatin signatures.
figure 4

SNP-based heritability of the multivariate dyslexia GWAS is significantly enriched in brain enhancers, promoters and transcribed regions. 489 annotation of tissue-specific chromatin signatures were used to analyze the GWAS results with LDSC heritability partitioning. Only brain-related annotations are shown. P values are plotted on the y axis as -log10.

Polygenic index prediction in NCDS

The multivariate dyslexia polygenic index (PGI) was computed in an independent cohort across five developmental stages and a composite reading score combining all ages. When calculated using PRSice2, the dyslexia PGI explained between 1.57 and 3.61% of variance in reading ability in the NCDS cohort. Explained variance for measures of reading proficiency was 2.22% at age 7 years (N = 5712), 2.24% at age 11 years (N = 5528), 1.57% at age 16 years (N = 4809) and 2.40% for overall composite measure of reading proficiency across all ages (N = 3089). Binary measures of self-reported difficulties with reading explained the highest proportion of variance, at 3.2% (OR = 1.73, 95% CI = 1.48–2.02) at age 23 years (Ncases = 167, Ncontrols = 5 288) and 2% (OR = 1.54, 95% CIs = 1.33–1.78) at age 33 years (Ncases = 203, Ncontrols = 5497) (Figures S44S49; Table S16). SBayesRC yielded improved predictions: estimates of variance explained by PGI ranged between 2.3 and 4.7%. The PGI for dyslexia predicted 3.6% at age 7 years, 3.5% at age 11 years, 2.3% at age 16 years and 4.1% in the reading composite across all ages. Similarly, there was increased predictive power for self-reported reading difficulties at ages 23 years (4.7%, OR = 1.98, CI = 1.69–2.33) and 33 years (3.3%, OR = 1.74, CI = 1.51–2.01) in the NCDS cohort (Table S16).

Polygenic selection analysis

The polygenic selection analysis examined if there was evidence of selection on alleles associated with dyslexia seen through the past 15,000 years of human history. Essentially, this looks for differences in allele frequencies in variants associated with a trait, between the ancient ancestral population(s) and the present day. Such shifts in frequency indicate that selection acted to change the allele frequency of that variant in response to environmental pressures. From the 211 significant independent variants within 80 loci associated with dyslexia in the multivariate GWAS, 104 were retained after clumping and present at high quality in the previously generated imputed ancestral data set [29, 30, 33]. Overall analyses of these SNPs identified no evidence of directional selection acting on dyslexia over the past 15,000 years (ω = −0.115, SE = 0.088, Z = 01.31, P = 0.19) (Fig. 5). Thirteen SNPs showed individually statistically significant evidence of directional selection, after accounting for the number of tests (P ≤ 4.8 × 10−4) (Table S17). It is plausible that these variants individually showed modest directional selection because of their individual contributions to other traits, given that our overall analysis showed no evidence of selection acting on dyslexia through the past 15,000 years (Supplementary Note).

Fig. 5: No evidence for directional selection of dyslexia associated SNPs.
figure 5

Stacked line plot of the ancient ancestry PALM analysis, showing the contribution of SNPs to dyslexia over time. SNPs are shown as stacked lines, the width of each line being proportional to the population frequency of the positive effect allele, weighted by its effect size. When a line widens over time the positive effect allele has increased in frequency, and vice versa. SNPs are sorted by the magnitude and direction of selection, with positively selected SNPs at the top, negatively selected SNPs at the bottom, and neutral SNPs in the middle. SNPs are colored by their corresponding P-value in a single locus selection test. The asterisk on the scale bar marks the Bonferroni corrected significance threshold, and nominally significant SNPs are shown in yellow and labelled by their rsIDs. The Y-axis shows the scaled average polygenic index (PGI) in the population, ranging from 0 to 1, with 1 corresponding to the maximum possible average PGI (i.e. when all individuals in the population are homozygous for all positive effect alleles) and the X-axis shows time in units of thousands of years before present (kyr BP).

Discussion

We performed a multivariate GWAS using MTAG on dyslexia using summary statistics from the two largest reading-related studies to increase the effective sample size to 1,228,832 and identified 80 independent loci associated with dyslexia. The most significantly associated independent SNPs; rs13082684 (PPP2R3A), rs2426117 (CSE1L) and rs9696811 (located between PTPA and IER5L), were consistent with loci reported in the univariate dyslexia GWAS [11]. We identified thirty-six regions not previously associated with dyslexia at genome-wide significance threshold [11, 34]. Taking a conservative approach of novel regions not meeting significance threshold in either study or the univariate dyslexia uncorrected summary statistics, we identified thirteen novel loci associated with dyslexia.

Three novel lead SNPs showed prior associations with brain-related phenotypes. Novel lead SNP, rs362307 in HTT, has been associated with a range of cognitive traits with potential relevance to reading ability in prior studies, including cognitive function [37] and educational attainment [36]. GenomicSEM and genetic correlation analyses have shown that reading-related traits have substantial (albeit partial) genetic overlaps with educational attainment and IQ [9, 11], and therefore SNPs involved in general cognition are likely to be important across traits. Interestingly, rs362307 is used as a marker for identification of the A1 haplotype of the HTT gene in Europeans. The A1 haplotype contains a gain-of-function mutation known to cause Huntington disease [42]. Two novel regions, containing lead SNPs rs12653108 (IPO11, HTR1A) and rs79445414 (DUSP26, UNC5D), showed previous significant associations with ADHD [40, 41]. Preliminary evidence for six of these novel regions came from a recent study using GenomicSEM, where they were short of the significance threshold for pleiotropic loci for dyslexia and ADHD [34]. The presence of these pleiotropic regions and high cooccurrence between dyslexia and ADHD (25–40%) [43, 44] reflects the increasing importance of shared genetics between the traits.

The most significantly associated SNP identified by the GenLang word-reading meta-GWAS [9], rs11208009, did not reach genome-wide significance in the present multivariate study (P = 7.85 × 10−5). It fell within a region that reached suggestive significance (chr1:62900811–63199936, P = 1.9 × 10−6) and fully overlapped with the previously reported region. Interestingly, rs11208009 has been associated with phonological awareness, but not with other processes related to dyslexia such as rapid automized naming [45]. Phonological awareness could be better captured by quantitative measures of word reading than by self-reported dyslexia diagnosis which may include broader processing difficulties.

SNP-based heritability estimates using SumHer [22], which utilizes both linkage disequilibrium and allele frequencies, increased the liability-scale \({h}_{{\rm{snp}}}^{2}\) estimates for the multivariate dyslexia GWAS to 16.4% at 5% dyslexia population prevalence, 19.2% at 7% prevalence, and 20.4% at 10% prevalence. These improved on our LDSC estimates for multivariate dyslexia of 12.9% (5% dyslexia population prevalence), 14.3% (7% prevalence) and 16% (10% prevalence).

LDSC \({h}_{{\rm{snp}}}^{2}\) estimates for the univariate dyslexia study were 15.2% (\({h}_{{\rm{snp}}}^{2}=0.152\), SE = 0.006, 95% CI 0.14–0.164) using the 23andMe sample prevalence of 5, and 18.9% (\({h}_{{\rm{snp}}}^{2}=0.189\), SE = 0.008, 95% CI 0.173–0.205) using a 10% population prevalence of dyslexia, thought to be representative of the general population [11]. The lower \({h}_{{\rm{snp}}}^{2}\) LDSC estimates in the present study may reflect the phenotype differences between univariate studies: self-reported dyslexia in adults unselected for cooccurring conditions such as ADHD versus quantitative measures of reading in children and young people. It may also be influenced by challenges in estimating both the sample and population prevalence of dyslexia: 5% for the 23andMe cohort [11] and difficult to estimate in the GenLang cohort as this includes cohorts of children who are yet to attain maximal reading [9], while the population prevalence of dyslexia varies widely (5–17.5%) [46,47,48].

We predicted a maximum of 4.7% (OR = 1.97, CI = 1.69–2.33) of trait variance in reading measures in the NCDS cohort using the dyslexia polygenic index derived from our multivariate GWAS. This is an improvement on the previous predictions using the univariate dyslexia GWAS on measures of reading in similar population-based cohorts, where 2.9% of variance was explained in adolescents and ~2% in adults [11]. Predictions using SBayesRC across longitudinal measures at five ages within the NCDS ranged from 2.3–4.7% suggesting that this part of the genetic contribution of reading is stable through schooling and into adulthood where the measures are more focused on reading difficulties and thus, more aligned with dyslexia. Other studies have used phenotypically related PGIs to predict reading outcomes. For example, Selzam et al. (2017) [49] used a years-in-education PGI [50] to predict 5% of reading variance at 14 years of age. Future studies might test whether our dyslexia PGI explains incremental variance in reading variation above an educational attainment PGI, which is a broader phenotype encompassing cognitive and non-cognitive factors.

Genetic correlations between the multivariate dyslexia GWAS and other phenotypes showed a high degree of similarity to the profiles of genetic correlations seen for both the univariate GWAS studies of dyslexia and word reading. Key findings were consistent with the univariate GWAS, that is, positive genetic correlations with measures of lower socio-economic status and less desirable workplace conditions, and negative correlations with higher socioeconomic and health measures. Notably, the strongest correlation for the multivariate dyslexia GWAS was verbal-numerical reasoning sub-question F17- synonyms in the UK Biobank (−0.62, SE = 0.06) followed by the full verbal-numerical reasoning measure (−0.56, SE = 0.03). Verbal-numerical reasoning was the among the strongest correlations for the univariate dyslexia at −0.5 (SE = 0.03) [11].

Correlations with other neurodevelopmental traits were consistent between the univariate and multivariate GWASs. The strongest correlation of the univariate dyslexia GWAS was with ADHD at 0.53 (SE = 0.01) [11], and although this correlation was lower in absolute magnitude for the multivariate dyslexia analysis (0.4, SE = 0.03), it remains consistent with the literature linking ADHD with reading and language outcomes [5, 51]. The observation that genetic correlation with ADHD is stronger with the univariate dyslexia (0.53) than the multivariate dyslexia (0.4), but the same as the univariate word-reading (−0.4) may be due to the presence of participants self-reporting ADHD diagnosis in the 23andMe cohort and thus inflating the correlation. Consistent with previous findings [9, 11], we found no significant genetic correlation between autism spectrum disorder (ASD) and dyslexia.

We found strong correlations between dyslexia and several measures of chronic pain including all over body pain (0.5, SE = 0.06), neck and should pain (0.47, SE = 0.1) and leg pain on walking (0.44, SE = 0.04) (UK Biobank). The prevalence of chronic pain is higher in both neurodevelopmental [52] and psychiatric conditions [53]. The underlying mechanism remains unelucidated, however, the genetic overlap between pain-related phenotypes and neurodevelopmental traits may hint at a shared biological basis [11, 54].

Gene-set analysis revealed four enriched biological pathways implicated in dyslexia: the Reactome term for oncogene induced senescence, genes correlated with proneural type of glioblastoma multiforme tumors (Verhaak glioblastoma proneural), the gene-ontology biological process of cell part morphogenesis, and the term for cellular compartment neural synapses, hinting at essential neuronal mechanisms.

Our analysis of expression patterns of dyslexia-associated genes in the developing human brain offered further evidence for a role in early developmental processes, implicating GABAergic and red neurons in embryonic midbrain, GABAergic neurons, astrocytes and oligodendrocyte precursor cells in embryonic prefrontal cortex, as well as cortical neurons in adults. These findings echo the cell type analysis reported in the GenLang GWAS study, where an enrichment in red nucleus neurons and a trend towards enrichment in fetal GABAergic neurons was observed [9]. Price and colleagues reported evidence supporting neuronal migration/axon guidance as potential pathways using a candidate gene-set approach for known neurodevelopmental genes in a hypothesis-driven association analysis of word reading which included the GenLang meta-GWAS [55]. More recently, the same research group implicated glutamatergic (excitatory) and GABAergic (inhibitory) neurons in the adult cortex in word reading, using a subset of the GenLang cohorts (N = 5054) [56]. The sample used in the present study overlaps with that of the prior work [55, 56], which may contribute to the consistency between the two sets of results. This work offers support for the GABAergic inhibitory system as a future focus for connecting genetics to neuronal mechanisms.

Polygenic selection analysis found no significant selection observed from ancestral populations suggesting that the genetic influences on dyslexia were not specifically selected for or against in the transition between hunter gatherer and farmers in Europeans. This finding may be considered unsurprising since reading is many thousands of years old but has only recently become widespread and has no obvious selection pressure on reproductive fitness. Because reading processes are highly dependent on brain circuits that evolved in support of spoken language, it was still possible that we might have detected signals with relevance to aspects of language evolution. The consistency of the PGI through the past 15 k years of history in northern Europe suggests it has not been affected by any major social or societal changes that have taken place in history such as the transition to farming, although it is important to note that our PGI accounts for only a modest proportion of heritable dyslexia. We identified thirteen SNPs that showed individually significant changes through recent history, although the directions of effect on dyslexia were mixed. We speculate that the patterns observed for these thirteen SNPs are most likely due to selection pressures acting on pleiotropic traits. Analyses that examined deeper timescales in our evolutionary history (30 million years ago to 50,000 years ago) were performed in both the Eising et al. [9] and Doust et al. [11] papers. Adopting approaches developed for studying human brain structure evolution [57], five annotations reflecting complementary aspects of human evolution were examined. Doust et al. found no evidence of enrichment for the tested annotations [11]. Eising et al. found evidence of an enrichment in archaic deserts; long regions in the human genome where there is an absence of Neanderthal admixture, suggesting these regions may be intolerant to gene flow and therefore harboring variants essential to Homo sapiens [9]. The findings suggested that these archaic desert regions could contain genetic variations that contribute more to reading and language traits in modern humans than expected by chance.

Through implementation of multivariate GWAS analysis combining work on quantitative measures with self-reported diagnosis data, we have produced the largest genetic study of dyslexia (effective N = 1,228,832) to date. Our findings account for up to 16.4% \(({h}_{{snp}}^{2})\) at 5% prevalence, and 20.4% at 10% population prevalence of dyslexia. We identified thirteen novel loci associated with dyslexia and implicated early brain developmental processes in the biological underpinnings of reading.