Abstract
Genetic disorders impacting vision affect millions of individuals worldwide, including age-related macular degeneration (common) and inherited retinal disorders (rare). There is an incomplete understanding of the impact of genetic variation on gene expression in the human retina and its role in genetic disorders. Through the generation of whole genome sequencing and bulk RNA-sequencing of neurosensory retina and retinal pigment epithelium from 201 post-mortem eyes, we uncover common and rare genomic variants shaping retinal expression profiles. This includes 1,483,595 significant cis-expression quantitative trait loci impacting 9,959 and 3,699 genes in neurosensory retina and retinal pigment epithelium, respectively, with associated genomic variants enriched in candidate cis-regulatory elements and notable shared eGenes between both tissues. We also detect 1,051 expression outliers and prioritise 299 rare non-coding single-nucleotide, structural or copy number variants as plausible drivers for 28% of outlier events. This study increases understanding of gene expression regulation in the human retina.
Introduction
Genomic variation has been well established to play a role in the onset and susceptibility of visual impairment by disrupting normal functioning of the retina, a highly specialised light-sensitive tissue at the back of the eye1. The retina depends on the interaction between neuronal and non-neuronal cell types, including those in the neurosensory retina (NSR), e.g. photoreceptors and ganglion cells2, and the retinal pigment epithelium (RPE), a monolayer which lines the photoreceptor outer segments3. Inherited retinal disorders (IRDs) are a diverse set of largely monogenic conditions driven primarily by highly impactful genetic variants that are rare in the population and disrupt the function of the NSR and/or RPE4. Monogenic IRDs may present in isolation, for example, Stargardt disease, retinitis pigmentosa and cone-rod dystrophy, or as part of a multi-system disorder, for example, Usher syndrome, Joubert syndrome and Senior-Loken syndrome. Age-related macular degeneration (AMD) is a common disorder that impacts the retina and is a leading cause of visual impairment in adults5, predicted to impact 288 million individuals by 20406. Whilst non-genetic risk factors exist for AMD, including age, diet and lifestyle, its heritability is estimated to be as high as 71%7. Genome-wide association studies (GWAS) initially identified more than 50 genomic loci impacting 34 genes that confer high risk of AMD in a European ancestry cohort8. Recent expansion of AMD GWAS to Hispanic and African ancestries has uncovered 30 additional genomic loci and distinct AMD genomic architecture in these populations9.
The Genotype-Tissue Expression (GTEx) project has transformed our ability to pinpoint genetic variants that impact gene expression10, including tissue-specific and tissue-shared expression quantitative trait loci (eQTLs) and rare genetic variants associated with expression outlier (eOutlier) events. Findings from these investigations, along with other studies, have been leveraged across various biological and medical fields to gain a deeper understanding of disease mechanisms11,12, to provide more informed diagnosis and prognosis13 and to pursue pathways for novel treatments14. Notably, ophthalmic tissue was not included in the GTEx resource, but recently, the EyeGEx study identified over 2 million eQTLs in the retina, from a cohort comprising healthy eyes and eyes displaying signs of AMD15, and Strunz et al.16 identified 580,171 eQTLs in the neural retina from a cohort of healthy eyes. These resources enable the investigation of the role of common single-nucleotide variants (SNVs) influencing retinal gene expression. However, to our knowledge, there are no suitable datasets to also interrogate the impact of rare variants, structural variants (SVs) and copy number variants (CNVs) on retinal gene expression. Expanding our understanding of gene expression regulation in the retina will provide insights into the molecular mechanisms underlying both common and rare eye diseases and help identify new potential strategies for treatment and prevention.
Here, we describe the creation of a paired genomic and transcriptomic resource for the human retina from 201 donors and develop a new understanding of both common and rare variants that drive expression in this highly specialised tissue.
Results
The METR genome-transcriptome resource integrates genomic and retinal transcriptomic data from 201 post-mortem eye samples
The Manchester Eye Tissue Repository (METR) genome-transcriptome cohort comprises 201 unrelated individuals who donated eye tissue post-mortem. The median age of the cohort was 71 years (IQR = 64–77) at the time of post-mortem, with a slight male predominance (63.7%). The median ischemic time was 40 h (IQR = 32–44) (Supplementary fig. 1). While 47 individuals (23% of the cohort) were found to carry genetic variants that confer high risk for age-related macular degeneration (AMD), none of the 201 individuals included in the cohort had a phenotypic presentation, assessed post-mortem, consistent with late-stage AMD or monogenic ophthalmic disorders.
Short-read whole genome sequencing was performed on an Illumina NovaSeq6000, with alignment and variant detection performed using DRAGEN software (v4.0.3). The median genome-wide average coverage per sample was 35.9x (IQR = 30.3–40.5) (Supplementary fig. 2), with an average of 88.0% and 92.8% of the genome covered by at least 15 or at least 10 sequencing reads, respectively. Joint SNV calling with DRAGEN PopGen 4.2.4 obtained aggregate calls at 15,617,784 high-confidence variant sites after quality control (Supplementary Data 1). On average, 173 CNVs and 8,814 SVs were identified per sample (Supplementary Data 2). Genomic variation profiles among the 201 donors confirmed that the cohort was exclusively of European genetically inferred ancestry (Supplementary fig. 3).
Transcriptomic data were generated by short-read bulk RNA sequencing of polyadenylated enriched RNA, using an Illumina NovaSeq6000. The median RNA integrity number (RIN) for samples selected for transcriptomic analysis was 7.9 (IQR = 7.5–8.1) for NSR samples (n = 183) and 6.9 (IQR = 6.5–7.5) for RPE samples (n = 176) (Supplementary figs. 1 and 4A and Supplementary Data 3). We obtained an average of 139 million uniquely mapped reads for NSR (IQR = 138–161 million) and 62.3 million for RPE (IQR = 59–94 million), representing 89.4% (IQR = 88.2–90.7%) and 82.9% (IQR = 78–86%) of all generated reads, respectively (Supplementary fig. 4). The median 3’/5’ bias, defined as the ratio of sequencing depth between the 150 bp region at the 3’ end and the 5’ end of the gene for genes with a length greater than 600 bp and at least 5 unambiguous reads, was 0.5 for NSR samples (IQR = 0.48–0.51) and 0.51 for RPE samples (IQR = 0.49–0.55) (Supplementary fig. 1B). Some level of expression (mean TPM > 0.1) was indicated for 28,512 genes across both tissues, with 18,891 and 13,214 genes expressed at moderate (TPM > 1) and high (TPM > 5) levels, respectively (Fig. 1a). 59% of expressed genes (mean TPM > 0.1) (n = 16,785) and 90% of highly expressed genes (mean TPM > 5) (n = 11,984) were protein coding, representing 84% and 60% of all GENCODE protein coding genes, respectively. Significantly higher expression variability was observed in the RPE compared to the NSR for genes expressed at low, moderate and high levels in samples with both NSR and RPE data, measured by the coefficient of variation (p-value < 2.2 × 10−16; Fig. 1b).
a Number of genes expressed in the NSR and the RPE at different expression thresholds, classified into different biotypes. b The coefficient of variation (SD/mean) for all genes expressed at different thresholds in both tissues indicates higher expression variability across samples in the RPE (n = 176) compared to the NSR (n = 183). Box plots show the median (centre line), interquartile range (box) and minimum and maximum values (whiskers). c Top 10 Gene Ontology Biological Process clusters enriched in differentially expressed genes (adj. p-value < 0.01) in the NSR and RPE, respectively. Gene Ontology terms (FDR < 0.01) were sorted by enrichment ratio and clustered based on semantic similarity.
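As a minimal sketch of the variability measure used above, the coefficient of variation can be computed per gene as the standard deviation of expression across samples divided by the mean. The TPM values below are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the estimator is not specified in the text.

```python
import statistics


def coefficient_of_variation(values):
    """Return SD / mean for one gene's expression values across samples."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Assumption: sample standard deviation (n - 1 denominator).
    return statistics.stdev(values) / mean


# Hypothetical TPM values for a single gene in five samples per tissue.
nsr_tpm = [10.0, 11.0, 9.0, 10.5, 9.5]
rpe_tpm = [10.0, 15.0, 5.0, 12.0, 8.0]

cv_nsr = coefficient_of_variation(nsr_tpm)
cv_rpe = coefficient_of_variation(rpe_tpm)
print(f"NSR CV = {cv_nsr:.3f}, RPE CV = {cv_rpe:.3f}")
```

In this toy example both tissues have the same mean expression, so the larger value for the RPE reflects only the wider spread across samples.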
To ensure the validity of the transcriptomic datasets generated in this study, we assessed the biological relevance of expressed genes in NSR and RPE. Gene expression profiles were enriched for gene ontology (GO) terms indicative of the tissues of origin (Fig. 1c). Overall, 14,957 differentially expressed genes (mean expression > 1 TPM and adj. p-value < 0.05) were identified between the RPE and the NSR. Unsurprisingly, cell type deconvolution analyses, with reference to single-cell retinal datasets17, demonstrated a significantly higher representation of RPE cells in data generated from RPE samples compared to data generated from NSR samples (Supplementary fig. 5A). Moreover, genes with increased expression in the RPE (n = 7353) were enriched for 987 GO terms, which were grouped into 55 non-redundant clusters, including epithelial cell proliferation (GO:0050678), regulation of cell adhesion (GO:0030155) and positive regulation of immune response (GO:0050778) (Supplementary Data 4). Deconvolution of the NSR datasets supported the presence of at least 7 neuronal cell types at high levels (estimated proportion > 1%), with an average relative composition, per sample, of 29% rod photoreceptors (95% CI = 26.4–30.7%), 28% retinal astrocytes (95% CI = 26.1–29.6%), 16% amacrine cells (95% CI = 15.7–17.1%), 10% horizontal cells (95% CI = 9.5–10.2%), 7% retinal ganglion cells (95% CI = 5.8–9%), 4% bipolar cells (95% CI = 3.8–4.4%), 2% Müller glia (95% CI = 1.7–3.1%) and ~4% other cell types (Supplementary fig. 5C). Genes with increased expression in the NSR, compared to the RPE (n = 7604), were enriched for 238 GO terms, grouped into 27 non-redundant clusters including synapse organisation (GO:0050808), neurotransmitter transport (GO:0006836) and cell morphogenesis involved in neuron differentiation (GO:0048667) (Supplementary Data 4).
METR eQTLs provide novel insights into non-coding variants that impact known eye disease-related genes
We performed cis-eQTL mapping to identify common genetic variants that are associated with gene expression in the NSR and the RPE. We found 1,424,946 significant (FDR < 0.05) cis-eQTL associations between 806,789 variants (eVariants) and 9959 genes (eGenes) in NSR (Supplementary fig. 6). Additionally, 465,045 eQTLs were identified between 303,773 eVariants and 3699 eGenes in the RPE (Supplementary fig. 6). The lower range of alternate internal allele frequencies for eVariants identified as part of eQTLs was 2.5% and included novel variants in comparison to gnomAD v4.1 (Supplementary fig. 7). 406,396 eQTLs were common to both the NSR and the RPE, while 1,018,550 associations were NSR-specific and 58,649 were RPE-specific (Fig. 2a). Henceforth, we will refer to eQTLs identified in the NSR and/or the RPE as METR-eQTLs (n = 1,483,595), which included 10,471 unique eGenes (6772 NSR-specific, 512 RPE-specific and 3187 eGenes in both NSR and RPE).
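The tissue-specific and shared counts reported above follow from simple set arithmetic on the per-tissue totals. The short sketch below reproduces them from the figures stated in the text; the helper name `overlap_summary` is hypothetical and not part of any pipeline used in this study.

```python
def overlap_summary(total_nsr, total_rpe, shared):
    """Split two per-tissue totals into tissue-specific, shared and union counts."""
    nsr_only = total_nsr - shared
    rpe_only = total_rpe - shared
    return {
        "nsr_only": nsr_only,
        "rpe_only": rpe_only,
        "shared": shared,
        "union": nsr_only + rpe_only + shared,
    }


# Counts taken from the text: per-tissue totals and the number found in both.
eqtls = overlap_summary(total_nsr=1_424_946, total_rpe=465_045, shared=406_396)
egenes = overlap_summary(total_nsr=9_959, total_rpe=3_699, shared=3_187)

print(eqtls)   # union is the number of METR-eQTLs
print(egenes)  # union is the number of unique METR-eGenes
```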
a Overlap between METR-eQTLs (left) and METR-eGenes (right) identified in the neurosensory retina (NSR) and the retinal pigment epithelium (RPE). b Intersection between eGenes identified in the neurosensory retina (METR-NSR), EyeGEx15 and the Strunz et al.16 retina eQTL study. For eGenes which were present in METR-NSR and at least one other study, we compared the top eQTL for each eGene to identify (1) eQTLs which were replicated in an additional study/studies; (2) eGenes where the top eQTL from each study was in high LD with each other (r2 > 0.8) and (3) eGenes where the top NSR hit was novel. The LD score was unavailable for a subset of eQTLs. c Comparison between METR-eQTLs and GTEx eQTLs indicates the number of NSR-specific, RPE-specific, NSR & RPE-specific and non-eye-specific eQTLs (left) and eGenes (right) from both tissues in our study.
We compared the top eQTLs identified for NSR (METR-NSR eQTLs) for each eGene identified in this study with two published retina-specific eQTL datasets, the EyeGEx project15 and Strunz et al.16, to identify: (1) eQTLs identically replicated in the NSR tissues; (2) METR-NSR eQTLs that impacted eGenes previously described but had alternative eVariants in high linkage disequilibrium (LD) with findings from EyeGEx or Strunz et al. and (3) previously unreported eQTLs for NSR, including newly identified eGenes (Fig. 2b and Supplementary fig. 8). Of note, our cohort excludes individuals with late-stage AMD, whereas EyeGEx includes late-stage AMD eyes. We report 6181 NSR eGenes which were previously described by at least one other study (62% of all NSR eGenes), of which 547 eGenes (9%) share identical top eVariants with at least one other study and 1882 (30%) have top eVariants in high LD with previously identified top eVariants (r2 > 0.8) (Fig. 2b). We identified 343,527 novel eQTLs (24%) in eGenes that were previously described by at least one previous study and 386,741 novel eQTLs (27%) in 3,778 newly identified eGenes. Importantly, we replicate 13 eQTLs that have previously been reported to impact genes that are implicated in increased risk of AMD (Table 1).
Over 800 eGenes are newly identified in the NSR and RPE
To evaluate the tissue-specificity of our dataset, we compared the METR-eQTLs with non-eye-specific eQTLs from the GTEx project (Fig. 2c). We identified 337,424 METR-eQTLs (22.7%) and 916 eGenes (8.7%) that had not been previously identified by GTEx (Fig. 2c); 251,685 (74.6%) of these eQTLs have not previously been described as eQTLs in NSR or RPE. Of the novel eGenes, 479 (57.9%) encoded lncRNAs, and 5 had previously been associated with a known rare monogenic eye disease (HPS4, ACO2, CRX, CRYAA, PEX26). We evaluated the degree of similarity between METR-eQTLs and eQTLs from each GTEx tissue using the Intersection over Union (IoU) statistic, which accounts for the wide variation in the number of eQTLs from different tissues (Supplementary figs. 9 and 10). The brain cortex had the highest level of eQTL similarity to our dataset (IoU = 0.28) and 5 of the top 10 most similar tissues were from the brain.
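For illustration, the IoU statistic is the size of the intersection of two eQTL sets divided by the size of their union, which normalises for differences in the total number of eQTLs per tissue. The sketch below assumes each eQTL is represented as a (variant, gene) pair; the identifiers are hypothetical and the exact matching key used in the study is not specified here.

```python
def intersection_over_union(eqtls_a, eqtls_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of eQTL identifiers."""
    a, b = set(eqtls_a), set(eqtls_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical (variant, gene) pairs standing in for two eQTL catalogues.
metr = {("var1", "GENE_A"), ("var2", "GENE_A"), ("var3", "GENE_B")}
other_tissue = {("var2", "GENE_A"), ("var3", "GENE_B"), ("var4", "GENE_C")}

iou = intersection_over_union(metr, other_tissue)
print(f"IoU = {iou:.2f}")
```

Because the denominator grows with the size of either catalogue, a tissue with many eQTLs does not score highly simply by containing more associations.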
Genetic variants driving expression profile differences are enriched in candidate cis-regulatory elements (cCREs), with the highest enrichment in retina-specific cCREs
To understand whether eQTLs were enriched for putative regulatory regions, we compared locations of METR-eVariants to cell-type agnostic candidate cis-regulatory elements (cCREs) available through ENCODE (V3). METR-eVariants were enriched in cell-type agnostic promoters (p = 8.05 × 10−19) and proximal enhancers (p = 8.48 × 10−26), compared to control variants matched for allele frequency and gene density. There was no enrichment of eVariants in distal enhancers, CTCF binding sites or DNase-H3K4me3 sites (Fig. 3a).
a Bootstrapped relative enrichment of eye eVariants (n = 1,483,595) which intersect with cCREs from the ENCODE cCRE registry (V3)83 compared to control variants (n = 6,041,074). Relative enrichment is defined as the ratio of eVariants to control variants that intersect with each element. In each bootstrapping iteration, random subsets of eVariants and control variants (subsets of non-eVariants matched for gnomAD AF and gene density) were intersected with each cCRE group. Centre points indicate the mean and error bars indicate the 2.5–97.5% confidence intervals. b Relative enrichment of eye eVariants (n = 1,483,595) which intersect with cCREs from retina and RPE from Cherry et al.18 and cCREs from adult tissues in EpiMap23 compared to control variants matched for gnomAD AF and gene density (n = 6,041,074). Points indicate the mean and error bars indicate the 2.5–97.5% confidence intervals. c Relative enrichment of eVariants (n = 1,483,595) compared to control variants (n = 6,041,074), which intersect accessible chromatin regions from single cell ATACseq peaks from Wang et al.19 in different retina cell types. Points indicate the mean and error bars indicate the 2.5–97.5% confidence intervals. d Relative proportion of eVariants (n = 1,483,595) and non-eVariants (n = 6,041,074) in the NSR and/or RPE, which intersect with annotated retina cCREs from Cherry et al.18, non-retina cCREs (EpiMap23) and neither. Control variants refer to those that were included in eQTL mapping (MAF > 2.5% and AC > 10) but were not associated with any eQTLs in NSR or RPE.
When stratified by cell-specific regulatory regions, bootstrapping analysis indicated a significant enrichment of METR-eVariants in NSR-specific (p-value = 4.52 × 10–28) and RPE-specific cCREs (p-value = 8.74 × 10–10)18 compared to control variants matched for allele frequency and gene density (number of gene TSSs within 1 Mb of variant) (Fig. 3b). Furthermore, we observe a significant enrichment of METR-eVariants in cell-type specific accessible chromatin regions across 8 different retina cell types19, with the greatest enrichment in rod cells (p-value = 6.69 × 10–58) and cone cells (p-value = 6.58 × 10–57) (Fig. 3c). Non-eye cCREs from adult tissues in EpiMap were also enriched for METR-eVariants, although the enrichment was lower than in the NSR and RPE. Despite the relative enrichment in annotated regulatory loci, most METR-eVariants (88.2%) do not overlap with any previously characterised cCREs (Fig. 3d).
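The bootstrapped relative enrichment can be sketched as follows. This is a simplified illustration rather than the analysis code used in the study: it assumes that overlap with a cCRE set has already been computed as one boolean per variant, that control variants are already matched for allele frequency and gene density, and that equal-sized random subsets are drawn without replacement in each iteration. The subset size, iteration count and toy data are hypothetical.

```python
import random
import statistics


def bootstrap_relative_enrichment(evariant_hits, control_hits,
                                  subset_size, n_iter=1000, seed=0):
    """Return (mean, 2.5th percentile, 97.5th percentile) of the enrichment ratio.

    Each *_hits list holds one boolean per variant: True if it overlaps a cCRE.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_iter):
        e_frac = sum(rng.sample(evariant_hits, subset_size)) / subset_size
        c_frac = sum(rng.sample(control_hits, subset_size)) / subset_size
        if c_frac == 0:
            continue  # ratio undefined when no control variant overlaps
        ratios.append(e_frac / c_frac)
    if not ratios:
        raise ValueError("no iteration had overlapping control variants")
    ratios.sort()
    lower = ratios[int(0.025 * (len(ratios) - 1))]
    upper = ratios[int(0.975 * (len(ratios) - 1))]
    return statistics.fmean(ratios), lower, upper


# Toy data: 20% of eVariants and 10% of control variants overlap a cCRE.
evariant_hits = [True] * 200 + [False] * 800
control_hits = [True] * 100 + [False] * 900

mean, lower, upper = bootstrap_relative_enrichment(
    evariant_hits, control_hits, subset_size=500)
print(f"relative enrichment = {mean:.2f} ({lower:.2f} to {upper:.2f})")
```

A ratio whose lower bound stays above 1 indicates that eVariants overlap the element class more often than matched controls.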
Further, we assessed whether eVariants previously implicated in AMD risk intersected with cell-type agnostic or cell-specific regulatory regions, observing overlap for 48% (10/21) of unique eVariants with characterised cis-regulatory elements, although only 1 eVariant overlapped with regions previously shown to be active in the retina (Table 1).
Properties of METR-eQTLs differ between known monogenic disease genes and non-disease-related genes
To understand if there were trends that were specific to eQTLs associated with known monogenic eye disease genes, we compared findings from this study against the EyeG2P resource20. We identified 230 METR-eGenes that were described as causes of rare monogenic disorders in EyeG2P (eye disease genes) and compared trends identified in these genes against all other METR-eGenes (n = 10,241) (eye non-disease genes) (Fig. 4). We observed significantly lower expression variability across samples for eye disease eGenes compared to eye non-disease eGenes (p < 2.2 × 10–16) (Fig. 4a). Eye disease eGenes were associated with significantly fewer eQTLs per gene (p = 7.8 × 10–3) (Fig. 4b). Additionally, eQTLs associated with eye disease eGenes have a significantly lower impact on gene expression (p < 2.2 × 10–16) (Fig. 4c) and significantly higher allele frequency (AF, gnomAD v4) compared to eye non-disease eVariants (p < 2.2 × 10–16) (Fig. 4d). Genes that have been associated with rare monogenic eye disease have higher expression (mean TPM = 37.9) than non-disease genes (mean TPM = 20.3). To control for this potential confounding factor, we adopted a bootstrapping approach (n = 1000 iterations) to randomly resample 100 eQTLs associated with eye disease genes and 100 eQTLs associated with eye non-disease genes matched for gene expression level (± 5% TPM) (Supplementary fig. 11). The direction of trends remained similar after bootstrapping, with lower effect sizes and higher allele frequencies observed for eQTLs associated with eye disease genes than non-disease genes (Supplementary fig. 11). For both eye disease eGenes and eye non-disease eGenes, there is a negative correlation (p < 2.2 × 10–16) between eVariant allele frequency and the impact of each eQTL on gene expression (Fig. 4e).
These findings are consistent with the hypothesis that eVariants which are more common in the population have lower effect sizes on gene expression compared to rarer eVariants (min eVariant allele frequency = 2.5%) and are suggestive of a selective bias against rarer eVariants impacting known eye disease genes.
a NSR eGenes associated with eye diseases on EyeG2P (n = 215) have lower coefficients of variation than non-eye disease eGenes (n = 9744), indicating lower expression variability across samples (p < 2.2 × 10–16). b On average, known eye disease eGenes (n = 215) have fewer eVariants per gene than non-eye disease eGenes (n = 9744) (p = 7.8 × 10–3). c The impact on gene expression (measured in absolute log2 allelic fold change) of each eQTL associated with a known eye disease gene (n = 25,205) is lower than the impact of eQTLs associated with non-eye disease genes (n = 1,425,748) (p < 2.2 × 10–16). d The allele frequency on gnomAD for the eVariants associated with eye disease genes (n = 24,674) is higher than non-eye disease genes (n = 794,167) (p < 2.2 × 10–16). e There is a negative linear relationship between the allele frequency of a variant and the impact on gene expression of the associated eGene (measured in log2(allelic fold change)) for eQTLs associated with eye disease genes (top) and non-eye disease genes (bottom). All reported p-values (**: p < 0.01; ***: p < 0.001) were calculated with a two-sided independent t-test. All box plots (a–d) show the median (centre line), interquartile range (box) and minimum and maximum values (whiskers).
Rare variants are plausible drivers of transcriptomic outliers in NSR and RPE
We utilised the DROP workflow21 to identify statistical outlier events within the METR transcriptome datasets, including expression, splicing and allelic imbalance outliers (Table 2). We identified 1,051 unique instances of a gene being aberrantly expressed in a METR sample (METR expression outlier events, METR-eOutlier events) (adjusted p < 0.05); 728 of these events were in the NSR, 443 in the RPE and 120 eOutlier events were found in both tissues. The median number of genes per sample identified as significant outliers was 3 in the NSR (IQR = 1–4) and 1 in the RPE (IQR = 0–2). In total, we tested 3,209,821 gene-sample events in the NSR and 3,050,081 in the RPE, indicating a significant outlier rate of 0.023% and 0.015%, respectively. These observations are consistent with a recent study of the GTEx cohort, describing significant outlier rates of 0.026%22.
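The outlier rates quoted above are the number of significant eOutlier events divided by the number of gene-sample events tested in each tissue. The brief calculation below reproduces them from the counts given in the text and checks that the per-tissue counts, less the events found in both tissues, give the number of unique events; the function name is illustrative only.

```python
def outlier_rate_percent(n_outliers, n_tests):
    """Return the percentage of tested gene-sample events called as outliers."""
    return 100.0 * n_outliers / n_tests


# Counts taken from the text.
nsr_rate = outlier_rate_percent(728, 3_209_821)
rpe_rate = outlier_rate_percent(443, 3_050_081)

# Events seen in both tissues are counted once in the unique total.
unique_events = 728 + 443 - 120

print(f"NSR: {nsr_rate:.3f}%  RPE: {rpe_rate:.3f}%  unique events: {unique_events}")
```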
For each eOutlier event, we were able to harness paired genomic data to identify candidate rare variants potentially driving aberrant expression profiles. We leveraged a hierarchical framework and a probabilistic model to prioritise candidate rare genetic variants driving changes in expression. This identified 230 (23%) eOutlier events likely driven by protein-coding variants and 314 (31%) events with non-coding candidate variants (Supplementary Data 6).
Rare variants predicted to have a functional impact are identified for 50% of eOutlier events in NSR and RPE
First, we applied a hierarchical framework to identify rare SVs, CNVs and SNVs which were predicted to result in loss-of-function (pLoF, including frameshift, nonsense and start/stop site loss variants) or were expected to disrupt a nearby non-coding regulatory region (Supplementary fig. 12). Following this approach, we identified candidate functional variants driving 528 eOutlier events (50.2% of all eOutlier events identified in this study) (Fig. 5a and Supplementary Data 6). Of these, 131 eOutlier events were co-occurring with an SV or CNV impacting the coding sequence of the outlier gene (77 NSR-only, 12 RPE-only and 42 in both tissues) and 98 with a pLoF SNV (77 NSR-only, 6 RPE-only and 15 in both tissues) impacting the same gene (Fig. 5a). For those eOutlier events not explained by an SV, CNV or pLoF SNV disrupting the coding sequence, we identified genomic variants in 71 eOutlier events that were within 10 Kb of the gene body and impacted an eye-specific cCRE18, including SV/CNVs (n = 2) and SNVs that are rare (< 0.01 AF) or absent in gnomAD (n = 69). We also identified non-eye-specific cCREs from EpiMap23 which were disrupted by SV/CNVs (n = 23) or rare SNVs (n = 121) within 10 Kb of the eOutlier gene. Examples of rare variants identified through this analysis strategy are included in Fig. 6 and Supplementary fig. 13.
a A hierarchical workflow to identify candidate variants driving outlier expression identified putative functional variants driving 563 eOutlier events in NSR, RPE, or both (METR). b The Watershed probabilistic model had high concordance with the hierarchical framework for the prioritisation of SNVs and small indels driving eOutliers in NSR and RPE. c Bootstrapping analysis indicates that variants which were predicted to be driving eOutliers by Watershed (n = 135) were enriched for pLoF variants, splicing variants, variants within regulatory elements (including retina cCREs) and those with high conservation scores compared to those which were not predicted to have a functional impact (n = 105,114). Points indicate median values and bars indicate the 95% CI.
In each panel, the tissue and relative outlier expression profiles calculated through OUTRIDER for individuals are shown, with: a copy number states 0 (n = 0), 1 (n = 1), 2 (n = 181) and 3 (n = 1); and b, c homozygous reference (0/0) (n = 182) and heterozygous alternate (0/1) (n = 1) genotypes. Genome tracks displaying transcript isoforms, evolutionary conservation and candidate cis-regulatory elements (cCREs) identified in Cherry et al.18 are included alongside: a segmental duplication regions; and b, c epigenomic histone marks and ATACSeq in retina (Cherry et al.18). Box and whisker plots show median values and interquartile ranges, with grey dots indicating normalised count values for single samples and statistical outliers for each genotype indicated with white dots. In a, CNVs encompassing a known eye disease gene (NPHP1) are shown to cause drastic changes in expression in NSR. In b, c, deletions are shown to impact cCREs proximal to the transcription start sites of genes expressed in NSR: b 3 Kb deletion ~3.5 Kb upstream of FOXJ2; c 187 bp deletion impacting the minimal promoter region of RHBDD2. Box plots show the median (centre line), interquartile range (box; Q1–Q3) and whiskers extending to the most extreme data points within 1.5× IQR. Points beyond the whiskers are plotted as outliers.
A probabilistic model demonstrates high concordance for candidate SNVs driving expression outlier profiles in NSR and RPE
Next, we applied Watershed24, a probabilistic model that was retrained on 6 tissue-specific outlier p-values from DROP. This was used in the METR transcriptome datasets to obtain posterior probabilities for SNVs and small indels that may be driving outlier expression profiles. Eye-specific cCREs were added as annotation features for Watershed (Supplementary Data 7). The model identified 135 (13%) eOutlier events that were likely to be caused by nearby rare variants (posterior probability > 0.8), of which 110 (81%) were also predicted to be driven by the same variants by the hierarchical model and 11 were predicted to be driven by SV/CNVs that are not considered by Watershed (Fig. 5b). We used bootstrapping analysis to compare the annotations associated with these variants to other rare variants which overlapped with eOutlier genes but were not predicted by Watershed to have a functional impact, observing an enrichment of canonical splice variants, frameshift variants, stop gain variants and variants predicted to disrupt splicing (Fig. 5c). In support of other analyses described in this study, there was an enrichment of rare variants which overlap with retina cCREs and a slight enrichment of rare variants which overlap with epigenomic marks associated with non-eye-specific regulatory elements from ENCODE. In total, there were 34 eOutlier events where the functional variants prioritised by Watershed intersected with a known candidate cis-regulatory element (cCRE); 28 of these cCREs were active in the eye.
Functional assays confirm the impact of rare variants prioritised as drivers of eOutlier expression
A dual reporter luciferase assay was performed in human K562 cells to investigate the impact of a CAND2 heterozygous variant that was prioritised as a driver of drastically reduced expression in NSR (fold change = 0.6, Z-score = –5.6, p-adj = 0.004) and RPE (fold change = 0.5, Z-score = –4.7, p-adj = 0.049) due to its overlap with features indicative of the CAND2 promoter region (NM_001162499.2:c.-41A > G; Fig. 7). CAND2 has recently been implicated in AMD risk through GWAS meta-analysis in European ancestry populations9 and has an emerging role in the targeted degradation of proteins that is distinct from its CAND1 homolog25. Our dual reporter luciferase assay reveals a significant reduction in CAND2 promoter activity in the presence of c.-41A > G (adj p = 0.005; Fig. 7), supporting the disruption of CAND2 expression observed in NSR and RPE and providing a proof-of-principle for our applied prioritisation methods for genetic drivers of expression outliers.
a Genome tracks displaying the rare promoter variant identified in CAND2 (NM_001162499.2:c.-41A > G), mapped alongside retina-specific epigenomic peaks and annotated retina-specific cCREs (both generated by Cherry et al.18), evolutionary conservation and GENCODE v48 transcript isoforms. b The outlier expression profiles in NSR and RPE calculated through OUTRIDER for individuals with missing (./.) (n = 2 in NSR; n = 2 in RPE), homozygous reference (0/0) (n = 179 in NSR; n = 173 in RPE) and heterozygous alternate (0/1) genotypes (n = 2 in NSR; n = 1 in RPE) are shown. c Results from a dual reporter luciferase assay confirm that there is a significant reduction in CAND2 promoter activity in the presence of c.-41A > G (n = 3) (adj p = 0.005) compared to the WT promoter (n = 3) and a promoter carrying a common variant in gnomAD (c.-36C > T) (n = 3). The reported p-value was calculated with a one-way ANOVA with multiple comparison test (Dunnett’s). Box plots show the median (centre line), interquartile range (box; Q1–Q3) and whiskers extending to the most extreme data points within 1.5× IQR. Points beyond the whiskers are plotted as outliers.
Discussion
We present a resource to interrogate the impact of both common and rare genomic variation on gene regulation in the human NSR and RPE. We characterised novel eQTL associations that are tissue-specific (Fig. 2) and are enriched in known promoters and proximal enhancers (Fig. 3). We show that eQTLs impacting genes known to cause rare genetic eye disease have different properties when compared to those impacting genes which are not known to cause eye disease (Fig. 4). We also identify candidate non-coding rare variants, SVs and CNVs which impact cCREs and represent plausible drivers of outlier expression profiles in human NSR and RPE (Fig. 5), including functional validation of a prioritised non-coding genetic variant impacting CAND2 (Fig. 7). The METR resource can be used alongside other multi-omic datasets to facilitate discovery of novel eye-specific regulatory elements, including those implicated in common (e.g. AMD) and rare (e.g. IRDs) genetic disorders impacting the retina.
The cohort of 201 human donors described in this study represents the first dataset, to our knowledge, to pair whole genome sequencing with high-depth RNA sequencing data from the NSR and RPE. Previous studies have developed RNA sequencing from the NSR alongside genotyping arrays15,16,26 and this has enabled the characterisation of eQTLs in the retina including preliminary data supporting the role of a limited number of eQTLs in AMD (Table 1). We performed extensive QC for both DNA and RNA sequencing data to confirm the validity of the datasets generated in this study, in particular due to the prolonged median ischemic time compared to GTEx samples, which may impact the quality of the data obtained27. These analyses confirmed suitable RNA integrity values across the cohort (Supplementary fig. 1), high unique mapping rates with appropriate read lengths and appropriate 3’/5’ biases from RNAseq data (Supplementary figs. 1 and 4), along with representative gene expression profiles (Fig. 1c and Supplementary Data 4). The high-depth and high-quality RNA and whole genome sequencing datasets developed for this study are from a cohort of individuals without clear signs of late-stage AMD and have enabled biological insights beyond those described previously.
First, we were able to assess whether previously characterised eQTLs are replicated utilising alternative methods and technologies in an independent cohort of individuals of European genetic ancestry without late-stage AMD (Supplementary fig. 3). As gene expression profiles have been shown to be significantly disrupted during AMD pathogenesis28,29, it is important to identify eQTL signals that are amplified or disrupted by broader changes in transcriptome profiles associated with AMD, as well as those that remain consistent within a cohort of individuals without clear signs of late-stage AMD. Overall, we show high levels of replication of eQTL findings from Ratnapriya et al., with 5993 identical eGenes and replication of 13 eQTLs previously implicated in AMD (Table 1), including PILRB, which has recently been shown to lead to photoreceptor dysfunction in mice when function is impaired30. Notably, we identified 5 novel eQTLs for genes previously implicated in AMD (ACAD10, HTRA1, B3GLCT, PLA2G12A, BAIAP2L2) and 4 genes implicated in AMD without replication of a previously characterised eQTL (CFI, COL4A3, RDH5, TNFRSF10A). This suggests that differences in the approaches undertaken and/or cohort composition, e.g. AMD status, cohort size and/or genetic ancestry, impact the influence of genomic drivers on the expression of these genes.
Second, variants which are rare in the population or unique to individuals have been demonstrated to drive drastic changes in expression profiles, so-called ‘expression outliers’, across different tissues24,31. The use of complete genomic sequencing in this cohort, achieving a median coverage of 36x, has enabled the characterisation of a greater diversity of genomic variation than has previously been studied in the context of expression drivers in the NSR and RPE and identified thousands of new regions which can be interrogated for rare variation within disease cohorts32. Using two distinct variant prioritisation approaches, we describe rare variants in the general population, including SVs, CNVs and small variants that are the most likely drivers of expression outliers in these tissues (Fig. 5). Through functional validation of a prioritised non-coding variant in the CAND2 promoter region, we establish a proof-of-principle for the applied variant prioritisation approaches (Fig. 7) and provide mechanistic insight into non-coding regions regulating the expression level of AMD-risk associated genes9. These data encourage further functional follow-up for the 578 prioritised variants that may be causative of pronounced changes in expression profiles in the human retina, including 272 rare variants predicted to cause loss-of-function and 299 that intersect with non-coding regions (including examples presented in Fig. 6 and Supplementary fig. 13). Other recent studies have identified outlier-associated non-coding rare variants that contribute to common disease predisposition33 and underpin rare genetic disorders34,35. Moreover, non-coding variation has been identified as a cause of genetic ophthalmic disorders in untranslated regions36, retina-specific exons37, promoters38, distal enhancers39 and non-coding genes40 expressed in the NSR and RPE. 
With the increasing availability of genomic sequencing datasets for the diagnosis and discovery of genetic disorders41,42, including ophthalmic conditions43, the data developed in this study is timely and provides an opportunity, alongside other complementary datasets44, to identify new pathogenic mechanisms underpinning genetic disorders.
Third, we have generated high-coverage RNAseq datasets achieving, on average, 139 million uniquely mapping reads for NSR and 62 million uniquely mapping reads for RPE. Previous studies have developed lower coverage RNAseq datasets for NSR; for example, EyeGEx15, Orozco et al.26 and Pinelli et al.45 generated 33, 30 and 72 million sequencing reads per sample, respectively. Previous studies have remarked on the level of transcript diversity in NSR46 and highlighted the advantage of high-depth RNAseq in this context. In comparison to EyeGEx, our high coverage approach elevates the number of observable protein-coding genes by 23% (from 13,662 to 16,765) and newly identifies 3,663 eGenes. For 4,481 eGenes that are not replicated from EyeGEx, further study, for example, harmonisation of genomic and RNA sequencing dataset processing and meta-analyses, would assist in understanding whether their detection is influenced by cohort composition, methodologies undertaken and/or sample size. The increased number of eGenes from this study enables observation of patterns in gene expression at increased resolution and has granted insight into the trends associated with genes previously implicated in genetic disorders impacting vision. Overall, we observed that eGenes that have been characterised as a cause of rare genetic eye disease20 have lower expression variability across individuals than non-disease genes (Fig. 4), suggesting that regulation of these genes is more tightly controlled in NSR and RPE. The role of eQTLs in genetic disorders remains incompletely understood. For example, whilst some studies have shown eQTLs contribute to onset, penetrance and expressivity47,48, including genetic disorders impacting the eye13, others have found limited evidence for their role in neuronal genetic disorders49,50.
Here, we observe that eQTL variants which were associated with changes in expression of eye-disease genes had significantly lower effect sizes and their allele frequency was higher than eQTLs impacting genes that have not previously been implicated in eye disease (Fig. 4 and Supplementary fig. 11). Intuitively, the absence of rarer and higher impact eVariants amongst a population of individuals without signs of genetic eye disease suggests constraint on genomic variation with these properties, although population-scale modelling and statistical analysis is required to formally test this hypothesis.
Finally, as our cohort includes 158 individuals with RNA extracted and sequenced from both NSR and RPE datasets, this enables further insights into the expression patterns and regulatory architecture of these tissues, unbiased by sample preparation methods and/or differences between individuals, e.g. genomic background. It should be noted that our cohort is biased towards male individuals (64%) and this may have a hidden bias on eQTLs and transcriptome differences identified. However, our data newly identifies 916 eGenes in NSR and RPE compared to those characterised in other tissues10 and we observe a high level of overlap in eGenes between NSR and RPE, including 86% of RPE eGenes and 32% of NSR eGenes. These data further support the high level of overlap previously observed for active enhancers and promoters between RPE/choroid and NSR18.
Whilst the findings of this study have enhanced our understanding of genomic regulation in human NSR and RPE, other approaches that utilise single-cell26,51,52,53,54, single-nuclei17,55 and spatial56,57 transcriptomics enable increased precision to understand gene expression in specialised retinal layers and cell types. These approaches are particularly advantageous for the NSR, which is a highly heterogeneous tissue comprised of several specialised layers and neuronal cell types, including photoreceptors, bipolar cells, amacrine cells and horizontal cells58, and where transcriptome profiles may differ substantially between the central and peripheral retina59. To overcome potential shortcomings of the bulk RNAseq approach adopted in this study, we performed deconvolution analyses to estimate the relative sample composition against single-nuclei RNA-sequencing17. Given the complexity associated with retinal tissue dissection and storage60,61, the deconvolution approach also enabled confirmation of tissue sample integrity alongside differential expression profiles (Fig. 1c). Bulk RNAseq from NSR had representation, as expected, from diverse cell types with significant enrichment towards rod photoreceptors and retinal astrocytes, representing >50% of the estimated cellular make-up of most samples. In keeping with current understanding of retinal ageing62, we observed a significant loss of rod photoreceptors with age (Supplementary Fig. 5B). However, deconvolution is naturally limited by the relative differential transcriptional activity between cell types and is complicated by cell types with similar transcription profiles, for example, between Müller glia and retinal astrocytes53. We expect that the high number of astrocytes predicted in RNA samples is influenced by similar transcription profiles to other cell types and, whilst we confirm that we have generated high-quality RNA sequencing datasets (Supplementary figs. 1 and 4 and Fig. 1), these estimates may also be influenced by altered transcriptome profiles in samples with longer ischemic times27 and/or the high sequencing depth coverage generated. Moreover, the retina is known to have cyclic patterns of gene expression, related to both circadian rhythm and natural function, i.e. response to light63 and as such, there is an incomplete molecular understanding of all cell types present in the human retina17. Overall, these data support the integrity of the RNAseq dataset developed in this study and whilst confident quantification of the cell types present is not possible, our analyses confirm that the datasets are representative of major cell types in the retina.
Taken together, the data presented in this study provide new insights into the genomic control of gene regulation in the human retina. We build upon previous understanding through replication of eQTLs in a cohort of individuals without clear signs of late-stage AMD, characterise hundreds of new genes under genomic regulation and provide insights into the role of rare variants, SVs and CNVs in the disruption of gene expression in these specialised tissues that enable vision. Future studies utilising this resource, including meta-analysis with other published datasets, co-localisation and transcriptome-wide association studies incorporating findings from genome-wide association studies, will continue to develop understanding of the expression profiles and the role of non-coding genetic variation in the onset and presentation of genetic disorders impacting vision.
Methods
Ethics approval and material transfer
All research and approaches undertaken in this manuscript were approved by the North West—Greater Manchester Central Research Ethics Committee and NHS Health Research Authority (15/NW/0932). Methodological approaches were approved and undertaken at The University of Manchester. The Manchester Eye Tissue Repository is a non-profit tissue bank, and no compensation was provided for the receipt or delivery of tissue samples. A material transfer agreement was agreed upon between the research team and the Manchester Eye Tissue Repository. Any surplus tissue, RNA and/or DNA after sample preparation and sequencing remained with the research team and will be destroyed or returned to the tissue bank within 3 years of the conclusion of the study. Other researchers who wish to access surplus material can submit independent requests to the Manchester Eye Tissue Repository.
Gene expression quantification in neurosensory retina and retinal pigment epithelium from RNA-Seq data
Paired-end short-read sequencing of polyA-enriched mRNA (RNAseq) was performed on an Illumina NovaSeq 6000 instrument for two layers of the retina: (1) the entire neurosensory retina (NSR), including macula and peripheral regions and (2) pelleted cells from the retinal pigment epithelium (RPE), which were scraped from Bruch’s membrane. Donor eye tissues were obtained from the Manchester Eye Tissue Repository, an ethically approved Research Tissue Bank (UK NHS Health Research Authority, 15/NW/0932). Eye tissue was acquired after the corneas had been removed for transplantation and explicit written informed consent had been obtained from donors or their next of kin to use the remaining tissue for research. Samples were selected for RNAseq with reference to RNA concentration (ng/μl) and integrity (RNA Integrity Number, RIN) values, calculated with the Agilent TapeStation system. The QC process was performed without knowledge of sample sex/gender (see Supplementary Methods 1.1 and 1.2 for additional details on tissue extraction and the RNA sequencing protocol).
The Genotype-Tissue Expression (GTEx) analysis pipeline64 was applied to RNAseq datasets to assess quality and to perform alignment and expression quantification. Alignment was performed against the GRCh38 human reference genome using STAR v2.7.4a65. Duplicate reads were marked with Picard v2.27.166. Gene-level expression quantification, using the GENCODE v38 annotation67 was carried out using RNA-SeQC 1.1.968, for gene-level read counts and RSEM v1.3.069, for gene-level quantifications in transcripts per million (TPM). Quality assessments of processed RNAseq datasets included reference to the total number of reads, number of uniquely mapped reads, number of splice junctions, number of chimeric reads, read length and 3’/5’ bias for all NSR and RPE samples. To ensure concordance between paired WGS and RNAseq samples, we excluded WGS-RNAseq pairs where the predicted relatedness, calculated using Somalier70, was <0.8.
Whole genome sequencing data
Short-read paired-end whole genome sequencing (WGS) was generated for each donor on an Illumina NovaSeq6000 Instrument using DNA extracted from iris biopsies (see Supplementary Methods Section 2 for additional details). Genome alignment and variant calling were carried out using Illumina DRAGEN 4.0.3 software with Machine Learning and Graph Map Enabled. Aggregate variant detection and harmonisation were carried out using Illumina DRAGEN 4.0.3 software Population Mode. We applied quality control filters to the aggregate VCF to remove low-quality variant calls using a combination of bcftools (v.1.16) and PLINK (v.2.0) (see Supplementary Methods 2.3). For eQTL mapping, aggregate genotypes were binarized using PLINK 2.0.
Cell type deconvolution of bulk RNA-seq data
We used BayesPrism (Bayesian cell Proportion Reconstruction Inferred using Statistical Marginalization)71 to run a deconvolution model to estimate the proportion of retinal cell types in the generated bulk RNA-seq data in NSR and RPE. The reference dataset to train the model was a single-cell RNAseq dataset from the ocular posterior segment17 (See Supplementary Methods 1.5 for additional details).
Differential expression analysis between NSR and RPE
To ensure the validity of the transcriptomic datasets generated in this study, we assessed the biological relevance of expressed genes in NSR and RPE. We used the R package DESeq272 to identify genes that were differentially expressed between NSR and RPE. We included age and sex as covariates in the DESeq2 model with the false discovery rate threshold set at 0.05. To confirm DESeq2 results, we replicated the differential expression analysis using edgeR73.
To identify which gene ontology biological pathways were enriched in the upregulated genes in NSR/RPE, we carried out gene set enrichment analysis (GSEA) of the genes that were differentially expressed between both tissues (FDR < 0.05), using WebGestalt74. We processed the GSEA output with a clustering algorithm, rrvgo75, to group similar GO terms together and selected representative terms. (See Supplementary Methods 1.6 for additional details).
Input Data for cis-eQTL analysis
For eQTL analysis, we generated a normalised expression matrix for each tissue. Genes that did not meet expression thresholds of >0.1 TPM in at least 20% of samples and ≥6 reads in at least 20% of samples were removed from eQTL analysis. Expression values were normalised using the trimmed mean of M-values normalisation (TMM) method76 and using an inverse normal transform.
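The expression filter and inverse normal transform described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the pipeline code used in the study: the function names are hypothetical, the rank-to-quantile convention rank/(n + 1) is an assumption, and the TMM normalisation step is omitted.

```python
from statistics import NormalDist


def passes_expression_filter(tpm, counts, min_tpm=0.1, min_reads=6,
                             min_fraction=0.2):
    """Return True if a gene has >min_tpm TPM and >=min_reads reads,
    each in at least min_fraction of samples."""
    n = len(tpm)
    tpm_ok = sum(1 for value in tpm if value > min_tpm)
    reads_ok = sum(1 for count in counts if count >= min_reads)
    return tpm_ok >= min_fraction * n and reads_ok >= min_fraction * n


def inverse_normal_transform(values):
    """Map one gene's values across samples to standard-normal quantiles
    by rank. Ties receive the average rank; rank/(n + 1) is an assumed
    convention for converting ranks to probabilities."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf(rank / (n + 1)) for rank in ranks]
```

In this sketch, a gene is retained only when both thresholds are met, and the transform is applied per gene across samples so that each gene's values follow an approximately standard normal distribution.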
To account for known and unknown biological and experimental confounding factors, a set of 30 covariates was generated for each RNA-Seq sample using the Probabilistic Estimation of Expression Residuals (PEER) method77 applied to normalised gene expression levels.
Principal component analysis with EIGENSOFT 6.0.178 was carried out to capture ancestral variation within the cohort. The top five principal components for each participant were used as covariates in the eQTL analysis.
Cis-eQTL mapping with tensorQTL
TensorQTL79 was used to identify genetic variants that were significantly associated with the expression of nearby genes (up to 1 Mb away) in NSR and RPE (FDR < 0.05). The required input files were the normalised gene expression matrix, the binary and filtered genotype data and a covariates table which included the following information for each participant: sex, WGS batch, five top principal components and 30 PEER factors. To quantify the eQTL effect size, we estimated the log2 allelic fold change (aFC), following the method established by Mohammadi et al.80 (see Supplementary Methods section 4 for additional details).
Comparison with other eQTL studies
We compared all METR eQTLs with retina eQTLs mapped by EyeGEx15 and Strunz et al.16. We identified genes that had been associated with eQTLs in our study and in EyeGEx and/or Strunz et al. (eGenes shared between studies). For these shared eGenes, we extracted the top eQTLs identified by EyeGEx and/or Strunz et al. and checked if they were replicated in our cohort or if they were in high LD (r2 > 0.8) with a METR-NSR eQTL. Pairwise LD scores were calculated using LDlinkR81.
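The replication check described above can be illustrated with a short sketch. The data structures here are hypothetical: `metr_eqtls` maps each eGene to its set of METR-NSR eVariants, and `ld_r2` is a lookup of precomputed pairwise r2 values (for example, exported from an LD tool), keyed by unordered variant pairs. The variant identifiers are placeholders.

```python
def classify_replication(gene, top_variant, metr_eqtls, ld_r2,
                         r2_threshold=0.8):
    """Classify a published top eQTL for a shared eGene as 'identical',
    'in_LD' (r2 above threshold with any METR eVariant for that gene)
    or 'not_replicated'."""
    metr_variants = metr_eqtls.get(gene, set())
    if top_variant in metr_variants:
        return "identical"
    for variant in metr_variants:
        r2 = ld_r2.get(frozenset((top_variant, variant)))
        if r2 is not None and r2 > r2_threshold:
            return "in_LD"
    return "not_replicated"


# Toy example with placeholder identifiers.
metr_eqtls = {"GENE1": {"var_a", "var_b"}, "GENE2": {"var_c"}}
ld_r2 = {frozenset(("var_x", "var_b")): 0.93,
         frozenset(("var_y", "var_c")): 0.41}
```

With the toy data, `classify_replication("GENE1", "var_x", metr_eqtls, ld_r2)` returns `"in_LD"` because var_x and var_b have r2 above 0.8.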
To compare to non-retina tissues, all significant eQTL associations were downloaded from the GTEx Open Access portal (v8) for each available tissue (https://www.gtexportal.org/home/downloads/adult-gtex/qtl). We calculated the intersection between the number of METR-eQTLs and eGenes which were also shared by each GTEx tissue using the Intersection over Union (IoU) statistic. The IoU calculates the ratio of the number of eQTLs/eGenes present in both sets over the total number of eQTLs/eGenes in one set and/or the other.
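As a minimal sketch of the IoU statistic defined above, the following function operates on two collections of identifiers (eGene names or eQTL variant-gene pairs). The function name and the gene labels in the example are illustrative only.

```python
def intersection_over_union(set_a, set_b):
    """Ratio of items present in both sets to items present in either set."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Two of the four distinct genes are shared, so the IoU is 2 / 4.
retina_egenes = {"GENE_A", "GENE_B", "GENE_C"}
other_tissue_egenes = {"GENE_B", "GENE_C", "GENE_D"}
iou = intersection_over_union(retina_egenes, other_tissue_egenes)
```

An IoU of 1 indicates identical sets and 0 indicates no shared items, which makes the statistic comparable across GTEx tissues that differ in their total number of eGenes.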
Annotation of eVariants and bootstrapping analysis to calculate enrichment of eQTLs in characterised regulatory loci
All NSR and RPE eQTL variants were annotated with the Ensembl Variant Effect Predictor82. We assessed overlap and annotated all eVariants with a set of tissue-specific and cell-type-specific annotations of candidate cis-regulatory elements (cCREs) from a variety of sources (Supplementary Methods Table 1). These included characterised regulatory loci from retina, RPE and macula18, cell-type specific regions of open chromatin detected by scATACseq from retina samples19, non-eye specific cCREs from adult tissues in EpiMap23 and cell-type agnostic candidate cis-regulatory element (cCRE) annotations from ENCODE83.
To calculate the relative enrichment of eVariants that overlapped with each type of regulatory element, we used bootstrapping analysis. We carried out 1000 iterations by subsampling 100,000 random eVariants with replacement and 100,000 control variants from our cohort that were included as input for the eQTL mapping and did not meet the eQTL significance threshold (FDR < 0.05), matched for gene density and allele frequency. We compared the ratio of eVariants to control variants that intersected with each type of regulatory element (see Supplementary Methods section 5 for additional details).
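The bootstrap comparison can be sketched as below. This is an illustrative simplification rather than the analysis code. Each variant is represented only by a 0/1 flag for overlap with one regulatory element type. Matching of control variants on gene density and allele frequency is omitted, and drawing control variants with replacement and summarising by a percentile interval are assumptions of this sketch.

```python
import random
import statistics


def bootstrap_enrichment(evariant_hits, control_hits, n_iter=1000,
                         sample_size=100_000, seed=0):
    """Return (median, 2.5th percentile, 97.5th percentile) of the ratio of
    overlapping eVariants to overlapping control variants across iterations.

    evariant_hits / control_hits: lists of 0/1 overlap flags, one per variant.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_iter):
        e_overlap = sum(rng.choices(evariant_hits, k=sample_size))
        c_overlap = sum(rng.choices(control_hits, k=sample_size))
        if c_overlap > 0:  # the ratio is undefined without control overlaps
            ratios.append(e_overlap / c_overlap)
    if not ratios:
        raise ValueError("no iteration had an overlapping control variant")
    ratios.sort()
    lower = ratios[int(0.025 * (len(ratios) - 1))]
    upper = ratios[int(0.975 * (len(ratios) - 1))]
    return statistics.median(ratios), lower, upper
```

A ratio consistently above 1 across iterations would indicate that eVariants overlap the element type more often than control variants. The defaults mirror the iteration and sample counts in the text, but smaller values are advisable when trying the sketch, as pure Python sampling at that scale is slow.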
Analysis of the properties of eQTLs that impact known eye disease-related genes
To understand if there were trends that were specific to eQTLs associated with known monogenic eye disease genes, we utilised the EyeG2P resource20. All other METR-eGenes were considered non-eye disease genes. We compared the eQTL/eGene properties between eye-disease and non-eye-disease gene eQTLs, including eVariant allele frequency and effect size, measured using log2 allele fold change. (See Supplementary Methods section 6 for additional details).
Identification of transcriptome outliers using the DROP pipeline
We utilised the DROP v.1.4.0 pipeline21 to identify transcriptome outliers from NSR and RPE, using standard parameters.
Hierarchical workflow to identify candidate variants driving outlier expression
We developed a hierarchical workflow to identify candidate variants driving outlier expression (eOutliers) using snakemake version 7.32 (Supplementary Fig. 14). Briefly, the workflow would first search for a pLoF variant from the eOutlier sample in the corresponding eOutlier gene, which could be an exonic structural variant, or a SNV with a high impact consequence based on Ensembl’s Variant Effect Predictor (v.112.0). If no pLoF variant could be identified, the workflow would then search for regulatory variants that were within 10Kb of the eOutlier gene body. Regulatory variants were defined as structural variants and rare SNVs (gnomAD AF<0.01) that overlapped with nearby retina cCREs or non-retina-specific cCREs from different adult tissues in EpiMap. If no regulatory variant was identified, the model would identify any other non-coding structural variant within 10Kb of the eOutlier gene body, before returning a negative search result (see Supplementary Methods section 8 for additional details).
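The tiered search order described above can be sketched as a small decision function. This is a simplified illustration and not the snakemake workflow itself: the `Variant` fields, tier labels and function name are hypothetical, and annotation steps (VEP consequence, gnomAD frequency, cCRE overlap) are assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    kind: str                     # "SV" or "SNV"
    distance_bp: int              # distance to the eOutlier gene body (0 = inside)
    exonic: bool = False
    vep_impact: str = "MODIFIER"  # VEP impact class, e.g. "HIGH"
    gnomad_af: float = 0.0
    overlaps_ccre: bool = False   # retina or EpiMap cCRE overlap


def prioritise(variants, window_bp=10_000, rare_af=0.01):
    """Return (tier, candidate variants) for one eOutlier sample-gene pair,
    stopping at the first tier with at least one candidate."""
    def is_plof(v):
        in_gene = v.distance_bp == 0
        return in_gene and ((v.kind == "SV" and v.exonic)
                            or (v.kind == "SNV" and v.vep_impact == "HIGH"))

    def is_regulatory(v):
        rare_snv = v.kind == "SNV" and v.gnomad_af < rare_af
        return (v.distance_bp <= window_bp and v.overlaps_ccre
                and (v.kind == "SV" or rare_snv))

    def is_other_noncoding_sv(v):
        return v.kind == "SV" and not v.exonic and v.distance_bp <= window_bp

    tiers = (("pLoF", is_plof),
             ("regulatory", is_regulatory),
             ("non-coding SV", is_other_noncoding_sv))
    for label, rule in tiers:
        hits = [v for v in variants if rule(v)]
        if hits:
            return label, hits
    return "none", []
```

Because the tiers are evaluated in order, a sample carrying both a pLoF variant and a regulatory variant is assigned to the pLoF tier, which reflects the hierarchical design described in the text.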
Implementation of the watershed
For all genes with an eOutlier in the NSR, we extracted all rare variants (gnomAD allele frequency <1%) that intersected with the gene body ± 10Kb. Variants were extracted for all samples with NSR RNAseq data from the post-QC aggregate VCF. We annotated all rare variants with selected annotations from VEP82 and CADD84 (Supplementary Data 6) and intersected them with known retina-specific cCREs from Cherry et al.18 and non-retina-specific cCREs from EpiMap. Missing annotations were replaced with default imputation values obtained from CADD (Supplementary Data 6). The Watershed model24 was run using the predict_watershed.R script with an adjusted p-val threshold of 0.05 and the number of dimensions set to 6. (See Supplementary Methods section 9 for additional details).
Dual reporter luciferase assay
A 294 bp fragment of the wild-type promoter region from CAND2 was PCR-amplified from control genomic DNA using Phusion High-Fidelity DNA Polymerase (Promega). To introduce variants, two overlapping fragments were amplified using a combination of mutagenic primers. Variants constructed were the variant of interest, NM_001162499.2:c.-41A > G and a variant that is common in the general population and not expected to impact CAND2 expression, NM_001162499.2:c.-36C > T.
The wild-type and variant fragments were assembled into NheI-NcoI digested pGL4.10[luc2] firefly luciferase plasmid using the Gibson method. The assemblies were transformed into competent E. coli grown overnight on LB agar containing carbenicillin. Candidate colonies were picked for culture and plasmid isolation. The plasmid constructs were verified by Sanger sequencing. Human K562 cells were transiently transfected with 500 ng of plasmid using Lipofectamine LTX (Invitrogen) following the manufacturer’s standard protocol. An empty pGL4.10[luc2] plasmid was transfected as a control for background activity. The Renilla luciferase pGL4.74[hRluc/TK] vector (Promega) was co-transfected as an internal luminescence control. Following 20–24 hr incubation at 37 °C with 5% CO2, a dual luciferase assay was conducted using the Dual-Glo® Luciferase Assay (Promega).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All raw RNA sequencing and genomic sequencing datasets generated in this study are available under controlled access through the European Genome-phenome Archive (EGA; Study ID: EGAS50000001443; Dataset: EGAD50000002082). Processed datasets, including eQTL results, eOutlier statistics and aggregated genomic variant files, are available under controlled access through the EGA (Study ID: EGAS50000001443; Dataset: EGAD50000002082). Controlled access to these datasets is a condition of access to tissue samples from the Manchester Eye Tissue Repository to ensure traceability of data access and usage, as per conditions of ethical approval for the biobank (15/NW/0932). Applications for access to the raw and processed datasets can be made through the EGA and will receive a response by the EGAC50000000807 EGA Data Access Committee within 4 weeks of the data access request. The full data access policy and terms of data usage are available through the EGA. The Genotype-Tissue Expression (GTEx) Project10 data used in this study are available from the GTEx public portal at https://www.gtexportal.org/home/. The ENCODE Project Candidate cis-Regulatory Element Registry V383 data used in this study are available from the SCREEN portal at https://screen.encodeproject.org/. The Epigenome Integration across Multiple Annotation Projects23 (EpiMap) data used in this study are available from the EpiMap Repository at https://compbio.mit.edu/epimap/. The retina-specific epigenomic tracks generated by Cherry et al.18 used in this study are available as custom tracks on the UCSC browser and were accessed from https://tinyurl.com/CherryLab-EyeBrowser. The single-cell RNAseq data from the ocular posterior segment generated by Monavarfeshani et al.17 used in this study are available at the Broad Institute Single Cell Portal, under study number SCP2310, accessible from https://singlecell.broadinstitute.org/single_cell/.
The retina scATACseq peaks generated by Wang et al.19 used in this study are available in the GEO database under accession code GSE196235. The EyeGEx15 dataset used in this study is available under controlled access; access was obtained by contacting the corresponding author of the study. The retina and RPE-specific eQTLs generated by Orozco et al.26 used in this study were accessed from https://eye-eqtl.com/ in April 2024. The retina-specific eQTLs generated by Strunz et al.16 used in this study are publicly available and were accessed from http://www-huge.uni-regensburg.de/ in October 2025. The 1000 Genomes Project V3 data used in this study are publicly available from the 1000 Genomes Project Public Portal at https://www.internationalgenome.org/data/. The Genome Aggregation Database (gnomAD v4) data used in this study are publicly available from the gnomAD Public Portal at https://gnomad.broadinstitute.org/data. All other data supporting the findings of this study are available in the article and its Supplementary Information files.
References
Wright, A. F., Chakarova, C. F., Abd El-Aziz, M. M. & Bhattacharya, S. S. Photoreceptor degeneration: genetic and mechanistic dissection of a complex trait. Nat. Rev. Genet. 11, 273–284 (2010).
Hoon, M., Okawa, H., Della Santina, L. & Wong, R. O. L. Functional architecture of the retina: development and disease. Prog. Retin. Eye Res. 42, 44–84 (2014).
Strauss, O. The retinal pigment epithelium in visual function. Physiol. Rev. 85, 845–881 (2005).
Hanany, M., Rivolta, C. & Sharon, D. Worldwide carrier frequency and genetic prevalence of autosomal recessive inherited retinal diseases. Proc. Natl. Acad. Sci. USA 117, 2710–2716 (2020).
Fleckenstein, M. et al. Age-related macular degeneration. Nat. Rev. Dis. Primers 7, 31 (2021).
Wong, W. L. et al. Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: a systematic review and meta-analysis. Lancet Glob. Health 2, e106–e116 (2014).
Seddon, J. M., Cote, J., Page, W. F., Aggen, S. H. & Neale, M. C. The US twin study of age-related macular degeneration: relative roles of genetic and environmental influences. Arch. Ophthalmol. Chic. 123, 321–327 (2005).
Fritsche, L. G. et al. A large genome-wide association study of age-related macular degeneration highlights contributions of rare and common variants. Nat. Genet. 48, 134–143 (2016).
Gorman, B. R. et al. Genome-wide association analyses identify distinct genetic architectures for age-related macular degeneration across ancestries. Nat. Genet. 56, 2659–2671 (2024).
The GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Barbeira, A. N. et al. Exploiting the GTEx resources to decipher the mechanisms at GWAS loci. Genome Biol. 22, 49 (2021).
Hamel, A. R. et al. Integrating genetic regulation and single-cell expression with GWAS prioritizes causal genes and cell types for glaucoma. Nat. Commun. 15, 396 (2024).
Michaud, V. et al. The contribution of common regulatory and protein-coding TYR variants to the genetic architecture of albinism. Nat. Commun. 13, 3939 (2022).
Davenport, E. E. et al. Discovering in vivo cytokine-eQTL interactions from a lupus clinical trial. Genome Biol. 19, 168 (2018).
Ratnapriya, R. et al. Retinal transcriptome and eQTL analyses identify genes associated with age-related macular degeneration. Nat. Genet. 51, 606–610 (2019).
Strunz, T. et al. A mega-analysis of expression quantitative trait loci in retinal tissue. PLoS Genet 16, e1008934 (2020).
Monavarfeshani, A. et al. Transcriptomic analysis of the ocular posterior segment completes a cell atlas of the human eye. Proc. Natl. Acad. Sci. USA 120, e2306153120 (2023).
Cherry, T. J. et al. Mapping the cis-regulatory architecture of the human retina reveals noncoding genetic variation in disease. Proc. Natl. Acad. Sci. USA 117, 9001–9012 (2020).
Wang, S. K. et al. Single-cell multiome of the human retina and deep learning nominate causal variants in complex eye diseases. Cell Genom. 2, 100164 (2022).
Lenassi, E. et al. EyeG2P: an automated variant filtering approach improves efficiency of diagnostic genomic testing for inherited ophthalmic disorders. J. Med. Genet. 60, 810–818 (2023).
Yépez, V. A. et al. Detection of aberrant gene expression events in RNA sequencing data. Nat. Protoc. 16, 1276–1296 (2021).
Hölzlwimmer, F. R. et al. Aberrant gene expression prediction across human tissues. Nat. Commun. 16, 3061 (2025).
Boix, C. A., James, B. T., Park, Y. P., Meuleman, W. & Kellis, M. Regulatory genomic circuitry of human disease loci by integrative epigenomics. Nature 590, 300–307 (2021).
Ferraro, N. M. et al. Transcriptomic signatures across human tissues identify functional rare genetic variation. Science 369, eaaz5900 (2020).
Wang, K. et al. Molecular mechanisms of CAND2 in regulating SCF ubiquitin ligases. Nat. Commun. 16, 1998 (2025).
Orozco, L. D. et al. Integration of eQTL and a single-cell atlas in the human eye identifies causal genes for age-related macular degeneration. Cell Rep. 30, 1246–1259.e6 (2020).
Ferreira, P. G. et al. The effects of death and post-mortem cold ischemia on human tissue transcriptomes. Nat. Commun. 9, 490 (2018).
Voigt, A. P. et al. Choroidal endothelial and macrophage gene expression in atrophic and neovascular macular degeneration. Hum. Mol. Genet. 31, 2406–2423 (2022).
Orozco, L. D. et al. A systems biology approach uncovers novel disease mechanisms in age-related macular degeneration. Cell Genom. 3, 100302 (2023).
Dey, P. N. et al. Loss of paired immunoglobulin-like type 2 receptor B gene associated with age-related macular degeneration impairs photoreceptor function in mouse retina. Hum. Mol. Genet. 34, 64–76 (2025).
Li, X. et al. The impact of rare variation on gene expression across tissues. Nature 550, 239–243 (2017).
Ellingford, J. M. et al. Recommendations for clinical interpretation of variants found in non-coding regions of the genome. Genome Med. 14, 73 (2022).
Smail, C. et al. Integration of rare expression outlier-associated variants improves polygenic risk prediction. Am. J. Hum. Genet. 109, 1055–1064 (2022).
Wakeling, M. N. et al. Non-coding variants disrupting a tissue-specific regulatory element in HK1 cause congenital hyperinsulinism. Nat. Genet. 54, 1615–1620 (2022).
Tenney, A. P. et al. Noncoding variants alter GATA2 expression in rhombomere 4 motor neurons and cause dominant hereditary congenital facial paresis. Nat. Genet. 55, 1149–1163 (2023).
Dueñas Rey, A. et al. Combining a prioritization strategy and functional studies nominates 5’UTR variants underlying inherited retinal disease. Genome Med. 16, 7 (2024).
Vig, A. et al. DYNC2H1 hypomorphic or retina-predominant variants cause nonsyndromic retinal degeneration. Genet. Med. 22, 2041–2051 (2020).
Daich Varela, M. et al. Multidisciplinary team-directed analysis of whole genome sequencing reveals pathogenic non-coding variants in molecularly undiagnosed inherited retinal dystrophies. Hum. Mol. Genet. 32, 595–607 (2023).
Small, K. W. et al. North Carolina macular dystrophy is caused by dysregulation of the retinal transcription factor PRDM13. Ophthalmology 123, 9–18 (2016).
Quinodoz, M. et al. De novo and inherited dominant variants in U4 and U6 snRNAs cause retinitis pigmentosa. Preprint at medRxiv https://doi.org/10.1101/2025.01.06.24317169 (2025).
Turnbull, C. et al. The 100 000 Genomes Project: bringing whole genome sequencing to the NHS. BMJ 361, k1687 (2018).
The 100,000 Genomes Project Pilot Investigators. 100,000 Genomes pilot on rare-disease diagnosis in health care — preliminary report. N. Engl. J. Med. 385, 1868–1880 (2021).
Ellingford, J. M. et al. Whole genome sequencing increases molecular diagnostic yield compared with current diagnostic testing for inherited retinal disease. Ophthalmology 123, 1143–1150 (2016).
D’haene, E. et al. Comparative 3D genome analysis between neural retina and retinal pigment epithelium reveals differential cis-regulatory interactions at retinal disease loci. Genome Biol. 25, 123 (2024).
Pinelli, M. et al. An atlas of gene expression and gene co-regulation in the human retina. Nucleic Acids Res. 44, 5773–5784 (2016).
Farkas, M. H. et al. Transcriptome analyses of the human retina identify unprecedented transcript diversity and 3.5 Mb of novel transcribed sequence via significant alternative splicing and novel genes. BMC Genomics 14, 486 (2013).
Castel, S. E. et al. Modified penetrance of coding variants by cis-regulatory variation contributes to disease risk. Nat. Genet. 50, 1327–1334 (2018).
Einson, J. et al. Genetic control of mRNA splicing as a potential mechanism for incomplete penetrance of rare coding variants. Genetics 224, iyad115 (2023).
Rio Frio, T., Civic, N., Ransijn, A., Beckmann, J. S. & Rivolta, C. Two trans-acting eQTLs modulate the penetrance of PRPF31 mutations. Hum. Mol. Genet. 17, 3154–3165 (2008).
Wigdor, E. M. et al. Investigating the role of common cis-regulatory variants in modifying penetrance of putatively damaging, inherited variants in severe neurodevelopmental disorders. Sci. Rep. 14, 8708 (2024).
Lukowski, S. W. et al. A single-cell transcriptome atlas of the adult human retina. EMBO J. 38, e100811 (2019).
Menon, M. et al. Single-cell transcriptomic atlas of the human retina identifies cell types associated with age-related macular degeneration. Nat. Commun. 10, 4902 (2019).
Yan, W. et al. Cell atlas of the human fovea and peripheral retina. Sci. Rep. 10, 9802 (2020).
van Zyl, T. et al. Cell atlas of the human ocular anterior segment: Tissue-specific and shared cell types. Proc. Natl. Acad. Sci. USA 119, e2200914119 (2022).
Liang, Q. et al. Single-nuclei RNA-seq on human retinal tissue provides improved transcriptome profiling. Nat. Commun. 10, 5743 (2019).
Choi, J. et al. Spatial organization of the mouse retina at single-cell resolution by MERFISH. Nat. Commun. 14, 4929 (2023).
Dorgau, B. et al. Deciphering the spatiotemporal transcriptional and chromatin accessibility of human retinal organoid development at the single-cell level. iScience 27, 109397 (2024).
Masland, R. H. The neuronal organization of the retina. Neuron 76, 266–280 (2012).
Sharon, D., Blackshaw, S., Cepko, C. L. & Dryja, T. P. Profile of the genes expressed in the human peripheral retina, macula, and retinal pigment epithelium determined through serial analysis of gene expression (SAGE). Proc. Natl. Acad. Sci. USA 99, 315–320 (2002).
McHarg, S., Brace, N., Bishop, P. N. & Clark, S. J. Enrichment of Bruch’s membrane from human donor eyes. J. Vis. Exp. JoVE 53382 https://doi.org/10.3791/53382 (2015).
Cabral, T. et al. Dissection of human retina and RPE-choroid for proteomic analysis. J. Vis. Exp. JoVE 56203 https://doi.org/10.3791/56203 (2017).
Gao, H. & Hollyfield, J. G. Aging of the human retina. Differential loss of neurons and retinal pigment epithelial cells. Invest. Ophthalmol. Vis. Sci. 33, 1–17 (1992).
Bhoi, J. D., Goel, M., Ribelayga, C. P. & Mangel, S. C. Circadian clock organization in the retina: from clock components to rod and cone pathways and visual function. Prog. Retin. Eye Res. 94, 101119 (2023).
GTEx Consortium. Laboratory and Analysis Methods. GTEx Portal. https://gtexportal.org/home/methods (2019).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Picard toolkit. Broad Institute. GitHub repository. https://broadinstitute.github.io/picard/ (2019).
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
DeLuca, D. S. et al. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics 28, 1530–1532 (2012).
Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinforma. 12, 323 (2011).
Pedersen, B. S. et al. Somalier: rapid relatedness estimation for cancer and germline studies using efficient genome sketches. Genome Med. 12, 62 (2020).
Chu, T., Wang, Z., Pe’er, D. & Danko, C. G. Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology. Nat. Cancer 3, 505–517 (2022).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Liao, Y., Wang, J., Jaehnig, E. J., Shi, Z. & Zhang, B. WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 47, W199–W205 (2019).
Sayols, S. rrvgo: a Bioconductor package for interpreting lists of Gene Ontology terms. MicroPublication Biol. https://doi.org/10.17912/micropub.biology.000811 (2023).
Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
Stegle, O., Parts, L., Durbin, R. & Winn, J. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comput. Biol. 6, e1000770 (2010).
Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
Taylor-Weiner, A. et al. Scaling computational genomics to millions of individuals with GPUs. Genome Biol. 20, 228 (2019).
Mohammadi, P., Castel, S. E., Brown, A. A. & Lappalainen, T. Quantifying the regulatory effect size of cis-acting genetic variation using allelic fold change. Genome Res. 27, 1872–1884 (2017).
Myers, T. A., Chanock, S. J. & Machiela, M. J. LDlinkR: an R package for rapidly calculating linkage disequilibrium statistics in diverse populations. Front. Genet. 11, 157 (2020).
McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
Snyder, M. P. et al. Perspectives on ENCODE. Nature 583, 693–698 (2020).
Schubach, M., Maass, T., Nazaretyan, L., Röner, S. & Kircher, M. CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res. 52, D1143–D1154 (2024).
Acknowledgements
We express our sincere thanks to the donors and their families for enabling this research. We thank Selina McHarg, Nadhim Bayatti and Jay Brown for the development of the Manchester Eye Tissue Resource. We also thank staff at the University of Manchester Genomic Technologies Core Facility and at the Ocular Genomics Institute, Harvard Medical School, for their help in the generation of DNA and RNA sequencing datasets for this study. The views expressed are those of the authors and not necessarily those of the funders, including the NIHR and the Department of Health and Social Care. J.E. discloses support for the research of this work from the Macular Society (United Kingdom), Fight For Sight, the UK Medical Research Council and the NIHR Manchester Biomedical Research Centre (NIHR203308). J.S. discloses support for the research of this work from the UK Medical Research Council (MR/W007428/1). P.I.S. discloses support for the publication of this work from the Wellcome Trust (224643/Z/21/Z, Clinical Research Career Development Fellowship) and the UK National Institute for Health Research (NIHR) Clinical Lecturer Programme (CL-201-06-001). D.B. discloses support for the publication of this work from the NIHR Research Professorship grant (RP-2016-07-011). A.V.S. discloses support for the publication of this work from the NIH/NEI (R01 EY031424). K.M.B. discloses support for the publication of this work from the NIH/NEI (R01EY035717).
Author information
Contributions
J.M.E., A.V.S., K.M.B., P.I.S., D.B. and G.C.B. conceived the study and obtained funding. J.S. led and performed the analysis described in the study under the supervision of J.M.E. and S.B. J.A.-D. performed cellular deconvolution analyses under the supervision of J.M.E. S.J.C. and P.N.B. collected retinal tissue and performed quality and age-related macular degeneration assessments. SH1, BA, SH2 and AH extracted molecular material and generated the paired DNA and RNA sequencing datasets. R.A.K., H.B.T. and R.T.O’K. designed and performed dual reporter luciferase assays. J.S. and J.M.E. wrote the manuscript with review and critical input from all authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Anand Swaroop and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sampson, J., Segrè, A.V., Bujakowska, K.M. et al. Paired DNA and RNA sequencing uncovers common and rare variation regulating human retinal gene expression. Nat Commun 17, 4595 (2026). https://doi.org/10.1038/s41467-026-72979-4
DOI: https://doi.org/10.1038/s41467-026-72979-4