Main

EBV (human herpesvirus 4) is a DNA virus that infects approximately 90–95% of the global population1,2. Primary EBV infection usually occurs in childhood and remains asymptomatic or mild. From adolescence onwards, it can cause infectious mononucleosis5. EBV enters the host via the oropharyngeal epithelium and infects naive B cells. These differentiate into long-lived memory B cells that become part of the circulation, thereby establishing persistent infection3,6. Occasionally, EBV-infected memory B cells reactivate to produce new infectious virions7.

EBV infection is a risk factor for various neoplasms (for example, Hodgkin and non-Hodgkin lymphoma and multiple sclerosis)4,8,9. Although EBV seropositivity is a prerequisite for multiple sclerosis10, only some individuals infected with EBV develop the disease, following a prodromal phase11. Furthermore, although multiple sclerosis risk is significantly elevated post-infectious mononucleosis, many patients with multiple sclerosis did not have a severe primary EBV infection12. Thus, multiple sclerosis may arise secondary to inefficient EBV immune control during the prodromal phase, as indicated by high EBV viral load11. Similar mechanisms might be implicated in other EBV-associated autoimmune disorders, as suggested by elevated EBV viral loads in systemic lupus erythematosus13 and rheumatoid arthritis14. In EBV-associated cancers, the importance of proper EBV immune control has been demonstrated by studies of inborn errors of immunity (IEIs): patients with IEIs involving impaired T and natural killer (NK) cell cytotoxicity have elevated EBV viral loads in blood15, and an increased risk for B cell-derived EBV-positive lymphomas16. Individuals with human immunodeficiency virus (HIV) or immunosuppression also show impaired EBV control17 and an increased incidence of EBV-positive lymphomas18,19. Still, despite its presumed clinical relevance, data on immune control during persistent EBV infection are limited.

Research into the biological basis of immune control of persistent EBV infection is hampered by a lack of direct measurements of EBV viral load in large immunocompetent cohorts, and limited knowledge regarding the role of serological factors in the control of EBV20.

To address this, we exploited the fact that EBV DNA in memory B cells is sequenced as a by-product of genome sequencing (GS) of human peripheral blood21. Using blood-based GS data from the UK Biobank (UKB)22 and All of Us (AoU)23 together with orthogonal data, we demonstrated that short-read pairs mapping to the EBV genome (EBV reads) in GS data are a surrogate measure for increased EBV viral load. EBV read prevalence was increased in immunosuppressed individuals; in current smokers; and in samples obtained in winter. Strong genetic associations were found for the MHC locus and 27 loci outside MHC, which were broadly consistent across the two biobanks. Downstream analyses suggested candidate genes, and highlighted pathways and cell types relevant for EBV immunity. Investigations of EBV-associated diseases generated novel hypotheses regarding mechanisms in multiple sclerosis and rheumatoid arthritis, and phenome-wide analyses identified novel diseases for which host control of EBV viral load might be pathophysiologically relevant.

EBV reads are present in GS data from biobanks

We retrieved EBV reads from the GS data of 490,293 UKB participants24 (Methods; Fig. 1a and Supplementary Notes 1 and 2). During quality control (QC), 51 library-preparation plates showed evidence of contamination and were excluded (Methods; Extended Data Fig. 1 and Supplementary Fig. 1). Aggregated EBV reads of the remaining 486,315 individuals (UKB QC cohort) were evenly distributed across the EBV genome (Fig. 1b). EBV read distribution was zero inflated, that is, no EBV reads were observed in n = 407,544 individuals (83.8%, denoted as ‘EBVread’; Fig. 1c). Of the 78,771 individuals with detected EBV reads (‘EBVread+’, 16.2%), 61.9% had EBV read count = 1. Further analysis of coverage and sequence data (Methods) confirmed that EBV reads from this group reflect true signals (Extended Data Fig. 2 and Supplementary Table 1).

Fig. 1: Analysis of EBV reads in blood-based GS data.
Fig. 1: Analysis of EBV reads in blood-based GS data.
Full size image

a, Flowchart of UKB cohort definitions and respective sizes, created by consecutive steps. Technical validation of EBV reads was performed by qPCR in two independent cohorts. non-rel., non-related. b, Cumulative read coverage across the EBV genome in the UKB QC cohort (line smoothed, 500-bp rolling window), for all individuals (dark blue) and split by EBV read count group (light blue). c, Number of individuals within EBV read count groups (maximum of 27,639 reads) in the UKB QC cohort. d, In a subcohort with available EBV serology data (UKB serology cohort), detection of EBV reads (EBVread+) was highly specific for being EBV seropositive (sero+). e,f, qPCR validated (validation 1 (e) and validation 2 (f)) EBV reads as a measure of EBV viral load, based on increasing fraction of positive replicates (bars, left y axis) and decreasing average crossing point (Cp) value for positive replicates (points, right y axis, inversely scaled; data are presented as mean, with the error bars denoting standard deviation). The number of replicates is given above the bars. g, Paired GS–RNA-seq data of 1,010 samples showed positive association between the presence of EBV transcripts and EBV read counts (validation 2; left), largely driven by expression of genes from the BART gene cluster (right; colours according to EBV stage in which the gene is primarily expressed). The dotted line indicates a Spearman’s ρ of 0. h, In the UKB no outlier cohort (non-related), immune-modulating factors significantly increased the prevalence of EBVread+ individuals. i,j, Following the exclusion of immunocompromised individuals (UKB no immune supp. cohort, non-related), male sex and current smoking status were both associated with EBVread+ (i), as were older age, lymphocyte percentage, sequencing yield and winter sampling (j). Estimates and corresponding distributions were obtained using marginalization and bootstrapping (n = 1,000), except for sampling day, where raw EBVread+ prevalences were taken for analysis. Estimated distributions are shown as boxplots (h,i) or individual data points (j, grey). The boxplots show the median (thick line), 25th and 75th percentiles (box), largest–smallest values no further from the box than 1.5 times the interquartile range (whiskers) and outliers (points; h,i). ***Consistent across 1,000 bootstrap replicates.

EBV reads were also extracted from the blood-based GS data of 336,123 ethnically diverse individuals from AoU23 (AoU QC cohort; Methods). EBV read distribution was similar to that in UKB, but a lower fraction of individuals had EBV read count = 1 (n = 37,901 out of 73,137 EBVread+ individuals, 51.8%; Extended Data Fig. 3). Overall, 21.8% of the AoU QC cohort were EBVread+, although this varied across ancestries (Supplementary Table 2). For European (EUR) cohorts, the fraction of EBVread+ individuals was comparable in AoU (17.6%) and UKB (15.8%; UKB EUR cohort; Fig. 1a; Methods). Whether the residual difference is due to ancestry-specific mechanisms of EBV control or characteristics such as a higher average GS coverage in AoU (Supplementary Table 2) awaits elucidation. In our data, the EBVread+ fraction is higher than in smaller GS (14.0%)21 or diagnostic quantitative PCR (qPCR; 11.03%)18 studies of immunocompetent individuals. This might be attributable to differences in cohort composition and/or strict cut-offs used in clinical settings.

EBVread+ status reflects increased EBV viral load in blood cells

We then assessed the relevance of GS-based EBVread+ to EBV biology. First, we investigated how well EBVread+ matches EBV seropositivity. In a UKB subcohort with available serology data (UKB serology cohort; n = 9,281), 491 individuals were EBVsero and 8,790 EBVsero+, based on previous definitions1. EBV reads were observed in 0.61% of EBVsero and 16.38% of EBVsero+ individuals (sensitivity of 16.4% and specificity of 99.4%; Fig. 1d). Second, we investigated whether EBV read detection reflects high viral load in blood cells. Therefore, we (1) simulated GS and compared modelled versus observed outcomes; (2) measured viral load via qPCR in samples from two small, independent cohorts with GS data25,26; and (3) correlated EBV read counts with EBV gene expression from blood-based RNA sequencing (RNA-seq; Japan COVID-19 Task Force26 (JCTF); Methods).

The simulation reproduced the EBV read distribution observed in UKB or AoU, including the zero inflation (Extended Data Fig. 4), and was compatible with an underlying log-normal distribution, as reported for HIV-1 viral load27. In the qPCR analysis, EBV read counts showed a positive correlation with EBV DNA detection, and a negative correlation with Cp (crossing point) values (Fig. 1e,f and Extended Data Fig. 2). In 1,010 individuals from the JCTF, the fraction of individuals with detected EBV transcripts was higher among EBVread+ than among EBVread samples (Fig. 1g). Together, this provides evidence that EBVread+ represents an approximation of elevated EBV viral load within human blood cells.

EBVread+ is associated with decreased EBV control during persistence

To determine which phase of the EBV life cycle is reflected by EBVread+, we investigated correlations between EBV read counts and (1) individual EBV transcript counts from the JCTF cohort; and (2) four individual EBV antibody levels (EA-D, EBNA-1, ZEBRA and VCA-p18, all IgG; median fluorescence intensity (MFI) values)1. In step one, the strongest EBVread+ correlations were with transcripts of A73 (ρ = 0.47, Spearman’s rank correlation), BARF0 (ρ = 0.42) and RPMS1 (ρ = 0.43; Fig. 1g). All three belong to the BART gene cluster that is associated with latency4. We also observed correlations with transcripts of some lytic genes, particularly from the same genomic region.

Step two was performed in 7,338 EUR EBVsero+ individuals from the UKB serology cohort, with presumed persistent (not primary) EBV infection given their age at recruitment (Methods). The strongest correlation was observed with IgG levels to VCA-p18 (ρ = 0.12, P < 2.2 × 10−16), followed by IgG levels to ZEBRA and EBNA-1 (Extended Data Fig. 5). Although VCA-p18 is a lytic-phase antigen, IgG to VCA-p18 is detectable during persistent EBV infection28 and increased titres are found in individuals with high EBV viral load in blood29,30. Thus, higher viral load in blood cells, as measured by EBVread+, might correlate with ongoing lytic activity. This aligns with the ‘germinal centre model of EBV persistence’7, in which the latently infected memory B cell pool in blood is maintained in equilibrium by lytic reactivation events in lymphoid tissues (Extended Data Fig. 2). However, our data suggest an extension to this model, as some reactivation might occur within blood, as recently also demonstrated in individuals with systemic lupus erythematosus31.

Non-genetic factors and sex contribute to EBVread+

Next, we investigated the influence of non-genetic factors and sex on EBVread+, with the aim to (1) identify those factors; (2) enable exclusion from further analysis of all individuals whose EBV read count was probably determined exogenously; and (3) control for these factors in subsequent analyses. Whenever possible, we minimized overfitting by using one biobank for discovery and the other for replication (Supplementary Note 1).

First, we assessed 11,111 SNOMED concept IDs and their association with EBVread+ in the AoU QC cohort (Methods). Initial test statistics were highly inflated, with HIV positivity and smoking showing the strongest associations (Supplementary Table 3 and Supplementary Fig. 2). When the analysis was conditioned on these two traits, inflation was largely resolved, although some residual associations with several immune-related SNOMED concepts remained (Supplementary Note 3).

To quantify the effect of HIV or immunosuppression on EBVread+ and identify additional contributors, we investigated non-related individuals of EUR ancestry. Individuals with outlier blood count measurements and those in the top EBV read count percentile were excluded, given the high prevalence of pathophysiological processes in this group, which probably drive EBV abundance (Supplementary Fig. 3 and Supplementary Note 4). In this UKB no outlier cohort, 48,771 of 313,387 individuals were EBVread+ (that is, 15.6%; with an expected standard deviation (s.d.) of 0.1% based on bootstrapping). HIV infection and immune-modulatory drugs significantly increased the likelihood of EBVread+. The highest probability was for reported HIV infection (39.7%, s.d. = 3.5%;), followed by intake of glucocorticoids (19.4%, s.d. = 0.7%) or other immunosuppressive drugs (18.3%, s.d. = 0.5%).

We then excluded from the UKB no outlier cohort individuals with reported HIV infection, or current use of glucocorticoids or other immunosuppressive drugs (‘UKB no immune supp. cohort’; Fig. 1a), and performed variable selection on a set of predefined covariates to identify further contributing factors in immunocompetent individuals (non-related individuals; 47,234 EBVread+ and 257,899 EBVread; Methods; Supplementary Table 4). EBV reads were more frequent in male individuals than in female individuals (17.1%, s.d. = 0.1% versus 14.1%, s.d. = 0.1%) and in current smokers than in current non-smokers (22.1%, s.d. = 0.3% versus 14.7%, s.d. = 0.1%; Fig. 1i). Former smoking status alone was not identified as a relevant predictor of EBVread+. Other selected variables were increasing age, GS yield and lymphocyte percentage, all of which were positively correlated with EBVread+ (Fig. 1j). EBV read detection was also more probable in samples collected in winter (Fig. 1j). This seasonality effect was confirmed in AoU (Extended Data Fig. 3) and requires further investigation. A plausible hypothesis is that seasonal infections during winter, such as co-infections with respiratory viruses, drive EBVread+. This would be consistent with observations of a higher prevalence of EBVread+ in the JCTF, whose participants were infected with SARS-CoV-2 around the time of sampling (39.2% EBVread+; Supplementary Table 5, Supplementary Fig. 4 and Supplementary Note 5). Together, the identified factors might also contribute to cross-biobank and cross-ancestry differences in EBVread+ prevalence.

Common variants in and outside of MHC contribute to EBVread+

To identify associations between common genetic variants and EBVread+, we performed a genome-wide association study (GWAS) using related individuals from the UKB no immune supp. cohort (Fig. 1a) and imputed data (Methods). Variants at 28 loci showed genome-wide significance (Fig. 2a), including a long-range association at the MHC locus and additional associations at 27 non-MHC loci (Methods; Table 1 and Supplementary Tables 6 and 7). The heritability estimate for EBVread+ for all common variants outside the MHC region was 2.04% (standard error of the mean (s.e.m.) = 0.44%; linkage disequilibrium score regression32).

Fig. 2: Genetic analyses of EBVread+.
Fig. 2: Genetic analyses of EBVread+.
Full size image

a, Manhattan plot for GWAS on EBVread+ (statistical test: regenie single-variant association testing, adjusted for covariates; see Methods) from the UKB no immune supp. cohort (\({n}_{({\text{EBVread}}^{+})}\) = 56,180 versus \({n}_{({\mathrm{EBVread}}^{-})}=304,103\)). Variants at genome-wide significant loci (Puncorrected < 5 × 10−8, red line) are highlighted in blue. Each locus is annotated with a chromosomal band and the closest gene. b, Forest plot for the top 10 conditionally independent HLA alleles (UKB no immune supp. cohort n = 360,764 and AoU no outlier (EUR) n = 184,948). The points reflect effect sizes calculated with regenie as described in panel a, from unconditioned analyses (estimated beta values; error bars represent 95% confidence intervals (unadjusted)). c, The top three most significant epistatic interactions, all between the ERAP2 lead variant and three HLA alleles from MHC-I, from the UKB no immune supp. cohort non-related subset (n = 304,523). Odds ratios and 95% confidence intervals (unadjusted) are based on the fit of the interaction models (Supplementary Note 11). Note that exact sample numbers for the UKB no immune supp. cohort used in ac vary due to respective missing data. df, Effect sizes from 54 conditionally independent HLA alleles (top panels) and lead variants at 27 non-MHC loci (bottom panels) from the EBVread+ GWAS (beta main) were plotted against the same measures of additional phenotypes: HHV7read+ (d), EBV read = 1 versus EBV read ≥ 2 (e), and VCA-p18 IgG levels (f; data taken from ref. 44). Spearman correlation coefficients (ρ) and respective two-sided P values (P) are provided. aExact P value is not available due to computational limits. Data are presented as betas and standard error. Sample sizes used to calculate correlation of effect sizes are given in Supplementary Table 12.

Table 1 Overview of 27 non-MHC loci associated with EBVread+ in UKB

At the non-MHC loci, gene prioritization approaches (Methods) highlighted genes implicated in immune processes (for example, ERAP2 and EOMES), known IEIs (for example, CD70, IKZF3 and CTLA4) and genes of pharmacological relevance (for example, SLAMF7, inhibited by elotuzumab; Supplementary Table 8). Non-MHC lead variants were also associated with a broad range of phenotypes in OpenTargets (Supplementary Table 9), although the extent varied across loci. While some loci showed high pleiotropy (more than 100 associated phenotypes, for example, loci including SH2B3, PTPN22 and IRF1), other lead variants had only few associations at the same significance threshold, suggesting a more specific role in EBV control (for example, ILDR1 and CMC1). Finemapping with SuSie33 identified potentially causative variants at four loci (Table 1 and Supplementary Table 10), including three missense variants with posterior inclusion probability (PIP) scores > 0.1, and one non-coding variant, rs531660643, at PIP > 0.95 (rs531660643). The latter is a splice quantitative trait locus (QTL) for BCL3 (whole blood, GTEx v8), which is involved in B cell fate and NF-κB regulation34.

At the MHC region, the immunologically relevant variants are alleles of HLA genes (‘HLA alleles’), which determine the repertoire of antigens that can be presented to the immune system. On the basis of the imputed HLA alleles22, 116 different classical HLA alleles were associated with EBVread+ (Methods; Supplementary Table 7). The lowest P value was for the MHC class II (MHC-II) allele HLA-DRB1*04:04 (beta = 0.79, s.e.m. = 0.02), which is associated with increased rheumatoid arthritis risk35. The next most significant HLA alleles were HLA-A*02:01 (beta = −0.31, s.e.m. = 0.01), which decreases risk for multiple sclerosis36, EBV+ Hodgkin lymphoma37 and endemic Burkitt lymphoma38, and HLA-B*14:02 (beta = −0.68, s.e.m. = 0.02). After iterative conditional analyses, 54 independent alleles from MHC-I and MHC-II remained with genome-wide significance (Methods; Fig. 2b and Supplementary Table 7).

Given previous evidence for epistatic effects between HLA alleles and genes involved in antigen processing, for example, ERAP2 (ref. 39) and ERAP1 (ref. 40), we conducted an interaction analysis between the 54 conditionally independent HLA alleles and the top three non-MHC loci (Methods). After correction for multiple testing, three significant interactions were identified between the ERAP2 lead variant rs2548225 and HLA alleles of MHC-I (that is, HLA-A*02:01, HLA-B*40:01 and HLA-B*15:01; Fig. 2c and Supplementary Table 11). This is functionally plausible, as ERAP2 encodes an aminopeptidase that trims peptides within the endoplasmic reticulum before loading onto MHC-I41. The rs2548225 risk allele tags ERAP2 haplotypes that are characterized by splice variants, which render ERAP2 mRNA non-functional41.

Finally, we aimed to replicate the UKB-based EBVread+ GWAS results in 184,948 individuals of EUR ancestry from the AoU no outlier cohort. Of the 116 associated HLA alleles, 106 were matched to HLA alleles in AoU (Methods). Of these, 100 showed P < 0.05 and a consistent effect direction in both datasets (Supplementary Table 7). For the 54 conditionally independent HLA alleles, 46 of the 52 that were available in AoU were replicated, as were lead variants at 25 of the 27 non-MHC loci (at P < 0.05; Supplementary Table 6). No meta-analysis was performed due to missing or different covariates, for example, the lack of blood count data in AoU (Supplementary Note 4).

Associated GWAS loci for EBVread+ are specific for increased EBV viral load

To explore whether the identified loci are specific for EBV viral load, we compared effect sizes of lead variants from the EBVread+ GWAS to GWAS data for memory B cell abundance42 and human herpesvirus 7 (HHV7). For memory B cell abundance, no significant Spearman’s correlation was observed for non-MHC loci (Extended Data Fig. 6; no MHC data provided). However, a genome-wide significant association was observed for the EBVread+ lead variant at the 13q33.3 locus comprising TNFSF13B, which is implicated in memory B cell survival43 (Supplementary Table 12). For HHV7, we extracted reads from UKB and calculated effect sizes as for EBV (Methods; Supplementary Fig. 5 and Supplementary Note 6). No significant Spearman’s correlations were found for the EBVread+ non-MHC loci or HLA alleles (Fig. 2d), although six of the non-MHC loci had P < 0.05 and a consistent direction of effect. For two of these (SLC8A1 and PTPN22), colocalization analyses indicated shared causal variants (posterior probability (H4) > 0.5; Supplementary Table 13).

We then created a case–control definition in UKB that captures viral load rather than viral susceptibility, by excluding individuals with EBV read count = 0 (that is, almost all seronegative individuals). Effect sizes from an analysis of EBV read count = 1 versus EBV read count ≥ 2 were highly correlated with those of our main GWAS (non-MHC loci: Spearman’s ρ = 0.93, P = 6.2 × 10−7; HLA alleles: ρ = 0.94, P < 2.2 × 10−16; Fig. 2e). Similar results were obtained in other case–control definitions within EBVread+ individuals and in additional comparisons of (1) EBV read count = 0 versus EBV read count = 1, and (2) female and male participants (Extended Data Fig. 6 and Supplementary Table 12).

Finally, we analysed GWAS summary statistics of four EBV antibody levels44. Consistent with the aforementioned correlation of EBVread counts with IgG antibody levels, effect sizes of lead variants for EBVread+ and VCA-p18 IgG levels were strongly correlated, particularly for the HLA alleles (ρ = 0.64, P = 1.55 × 10−5; Fig. 2f). These findings suggest that the genetic associations with EBVread+ reflect specific EBV viral load-associated factors.

Gene-based analyses suggest an enrichment of IEI genes

We then performed gene-based analyses to capture additional biology and enable systematic downstream analyses, using EBVread+ summary statistics for common variants and exome sequencing data for rare variants.

Common variants were assigned to individual genes, and gene-based P values were calculated (MAGMA45; see Methods; without MHC region). Of 63 genes that remained significant after Bonferroni correction (Supplementary Table 14), ten were located outside of genome-wide significant loci and thus represent additional candidate genes. Nine of the 63 genes were IEI genes, including four (IKZF3, NFKB1, CTLA4 and CD70) that predispose to severe clinical phenotypes post-EBV infection, including persistent EBV viraemia, EBV-associated lymphoproliferation and/or EBV-driven lymphoma (Fig. 3a and Supplementary Note 7). Formal testing using MAGMA gene set enrichment (Methods) showed that IEI genes (n = 456) were strongly enriched for association with EBVread+ (P = 4.66 × 10−6, beta = 0.19, s.e.m. = 0.04). When considering 14 genes that cause monogenic EBV-driven lymphoproliferative diseases15, the effect size increased (beta = 0.35, s.e.m. = 0.22, P = 0.055; Supplementary Table 14).

Fig. 3: Characterization of non-MHC risk loci associated with EBVread+.
Fig. 3: Characterization of non-MHC risk loci associated with EBVread+.
Full size image

a, Nine genes underlying IEIs showed significant enrichment of EBVread+-associated common variants. Clinical information regarding EBV infection and associated outcomes for these IEIs were retrieved from literature (see Supplementary Note 7). b, Bar plot of −log10(P) from MAGMA gene set analysis (one-sided) are shown for those Gene Ontology Biological Processes terms that remained significant after Bonferroni correction (dashed line: P < 6.5 × 10−6; n = 7,743). c, As in panel b, for the top 15 most significantly enriched GTEx tissues based on gene expression levels (MAGMA gene property test, one-sided; dashed line: P = 9.2 × 10−4, Bonferroni; n = 54). Tissues sorted by P values, with the purple colour indicating significant enrichment. d, Uniform manifold approximation and projection (UMAP) representation plot of the PBMC single-cell RNA-seq data50, coloured by cluster labels of cell-type annotation level 1. DC, dendritic cell. e, Distribution of normalized single-cell disease relevance scores (scDRS) across cell types of annotation level 1, presented in descending order based on median scDRS (white bar). Higher scores indicate cells with excess expression of genes implicated by the EBVread+ GWAS. f,g, Results of the Monte Carlo (MC)-based statistical inference cell-type association (f) and within-cell type heterogeneity with scDRS (g), based on EBVread+. The bar colours represent significance, with the purple colour indicating a multiple comparison-adjusted false discovery rate (FDR) < 0.05.

The aggregate effect of rare variants was captured by gene-based collapsing analyses, based on exome sequencing data (minor allele frequency < 0.01; gene-based association analysis of rare variants (RVASgene); Methods). Twenty-eight genes within the MHC locus and one non-MHC gene (TNFRSF13B) were test-wide significant in at least one of four variant pathogenicity definitions (Pgene < 8.86 × 10−7; Methods; Supplementary Table 15). The TNFRSF13B signal was driven by p.Cys104Arg (Pwithout p.Cys104Arg = 0.087), which is associated with common variable immunodeficiency, tonsillectomy and ear surgery46,47.

Intersecting both analyses (MAGMA and RVASgene, each at P < 0.01) showed 24 genes with evidence from common and rare variants (Supplementary Table 16). These included seven genes whose rare variant enrichment was driven by putative loss-of-function variants (PTPN22, GP1BA, CD226, C6orf222, ZNF284, CHD4 and HKR1), all of which are strong novel candidate genes for host control of persistent EBV infection.

Identification of candidate pathways and effector cell types

We then used the gene-based association statistics for common variants, to obtain insights into effector pathways, tissues and cell types48. Using Gene Ontology Biological Processes, we identified 30 test-wide significant pathways (Fig. 3b and Supplementary Table 17). These encompassed various immune processes, for example, T cell activation and differentiation, thus supporting the established role of T cells in EBV control49. In expression data from 54 tissues available in GTEx v8, five (that is, spleen, whole blood, EBV-transformed lymphocytes, lung and terminal ileum) were identified as potential effector tissues (Fig. 3c). For non-blood tissues, we hypothesize that tissue-resident leukocytes are partially responsible for the observed enrichments. For blood, the enrichment was further elucidated using a gene expression dataset from peripheral blood mononuclear cells (PBMCs)50 and the single-cell disease relevance score approach51 (scDRS; Methods). Within eight major cell types (annotation level 1), we observed significant enrichments in CD8+ T cells, consistent with their role in eliminating EBV-infected B cells49 and NK cells (Fig. 3d,e). At a more fine-grained annotation (level 2, 21 cell types; Methods), the highest average scDRS was observed in the small cell cluster annotated as NKbright cells. Furthermore, support was generated for NKdim and memory CD8+ T cells, both of which have similar enrichment P values, albeit for much larger cell numbers (Extended Data Fig. 7).

We also mapped lead variants (or proxies thereof) to cell-type-specific cis-expression QTL (eQTL) data from PBMCs52 (OneK1K project; Methods), and identified 18 variant–gene–cell-type associations. Most were for ERAP2, with consistent direction of effect in multiple cell types, including CD8+ T and NK cells. Additional cell-type-specific eQTL effects were found for CTLA4 and CMC1 in S100B-positive CD8+ T cells and SLC22A5 in NK cells (Supplementary Table 18).

EBVread+ has a polygenic architecture

We then evaluated whether an aggregated genetic risk score (GRS) improves risk prediction for EBVread+ compared with a baseline model (including age and sex), and is transferable across cohorts and ancestries. First, we assigned individuals from the UKB no outlier cohort (EUR) to one of three cohorts: (1) UKB serology target cohort (individuals for whom serology data were available), (2) UKB disease target cohort (individuals with EBV-associated diseases4), or (3) UKB base cohort (remaining individuals; Methods). In the UKB base cohort, we generated six GRSs, using either imputed HLA alleles (three GRSs: HLA all, HLA MHC-I and HLA MHC-II) or genotyped singe-nucleotide polymorphisms (SNPs; all, SNPs in MHC and SNPs outside of MHC; Methods).

We then applied these GRSs to the UKB serology target cohort and found that the GRSs encompassing all HLA alleles (HLA all) best explained EBVread+ according to Nagelkerke R2 (improvement over the base model: ΔR2 = 0.080 ± 0.009 s.d.). HLA MHC-I and HLA MHC-II GRS, which represent uncorrelated predictors (Extended Data Fig. 8), performed similarly well when compared to each other (Fig. 4a). The three GRSs based on HLA alleles outperformed SNP-based GRSs, although the GRSs using SNPs outside of MHC (SNP wo MHC) captured independent genetic risk (Fig. 4a). We therefore proceeded with HLA all, HLA MHC-I, HLA MHC-II and SNP wo MHC, none of which differed between EBVsero+ and EBVsero groups (Fig. 4b) and which were positively correlated with observed EBV read counts in the serology cohort (Fig. 4c and Extended Data Fig. 8).

Fig. 4: GRS analyses in UKB and AoU.
Fig. 4: GRS analyses in UKB and AoU.
Full size image

a, In the UKB serology target cohort (n = 6,063, unrelated, EUR), EBVread+ status was predicted using a baseline model with or without one of six GRSs: imputed HLA alleles (HLA all); HLA alleles of MHC-I (HLA MHC-I); HLA alleles of MHC-II (HLA MHC-II); genotyped SNPs (SNP all); SNPs within MHC (SNP MHC); or all non-MHC SNPs (SNP wo MHC). Nagelkerke R2 values are plotted (error bars denote standard deviations; bootstrapped, n = 1,000). b, GRSs were compared between sero (n = 348) and sero+ (n = 5,715) individuals and were non-significant (P values Bonferroni adjusted for four tests; statistical test: likelihood ratio test applied to logistic regression models, adjusted for covariates; see Methods). c, HLA all was positively correlated with EBV read counts. Sample sizes are indicated above the boxplots. d, Improvements in Nagelkerke R2 for different AoU ancestry groups (HLA all GRS; abbreviations as in Extended Data Fig. 3), compared with the baseline model within UKB (from panel a). Error bars represent standard deviations (bootstrapped, n = 1,000). See the x axis for sample sizes. AFR, African; AMR, admixed American; EAS, East Asian; MID, Middle Eastern; SAS, South Asian. e, GRS distributions in individuals of the UKB serology target cohort and with EBV-associated diseases (UKB disease target cohort; see the x axis for sample sizes). Individuals with multiple diseases were included in each respective group. The statistical test is as in panel b. P values (Bonferroni adjusted) are provided if P < 0.1. aFor multiple sclerosis (MS), the signal was driven by HLA-A*02:01. HL, Hodgkin lymphoma; IM, infectious mononucleosis; NHL, non-Hodgkin lymphoma; RA, rheumatoid arthritis; SLE, systemic lupus erythematosus. f, −log10(P) of 1,751 PheCodes, grouped by organ systems or disease groups, for GRS HLA MHC-I and HLA MHC-II (statistical test as in panel b; dashed line: Bonferroni-corrected significance threshold) from the AoU QC EUR subset (n = 189,658). Phenotype terms are provided for test-wide significant results and for associations identified in panel e (encircled). IBD, inflammatory bowel disease; NOS, not otherwise specified; T1D, type 1 diabetes. The boxplots show the median (thick line), 25th and 75th percentiles (box) and the largest–smallest values no further from the box than 1.5 times the interquartile range (whiskers; b,c,e). Dashed lines in b,c,e correspond to values of 0.

To analyse transferability, we applied similar GRSs within the AoU no outlier cohort, which was stratified by genetic ancestry (Methods). In the EUR subcohort, which had the highest genetic similarity to the UKB base cohort, improvements in Nagelkerke R2 values compared with the baseline model were similar to our results from UKB, with HLA all best explaining EBVread+R2 = 0.072 ± 0.002 s.d.; Fig. 4d and Extended Data Fig. 8). Similarly, HLA all showed the largest improvements in Nagelkerke R2 in each of the five non-EUR ancestry groups, despite differences in absolute values (Fig. 4d). In the African (ΔR2 = 0.055 ± 0.002 s.d.) and admixed American (ΔR2 = 0.065 ± 0.002 s.d.) groups, predictive performance was similar to that of the AoU EUR subcohort (Fig. 4d). This demonstrates some degree of transferability for the GRS comprising all HLA alleles. In all ancestry groups, the SNP-based GRS was least predictive, but was again similar between the EUR subcohorts of UKB and AoU (Extended Data Fig. 8). These results provide evidence for a polygenic component to EBV viral load that is largely driven by the MHC region and can be transferred across ancestries when calculated based on HLA alleles.

GRSs associate with EBV-associated and novel diseases

The four selected GRSs were then applied to the UKB disease target cohorts (infectious mononucleosis, Hodgkin lymphoma, multiple sclerosis, rheumatoid arthritis, non-Hodgkin lymphoma, systemic lupus erythematosus and/or Sjögren disease; see above). Highly significant associations were found for an elevated HLA MHC-I GRS in multiple sclerosis and an elevated HLA MHC-II GRS in rheumatoid arthritis (Fig. 4e). For multiple sclerosis, this effect was attenuated when HLA-A*02:01 was excluded from the GRS (PHLA MHC-I = 3.09 × 10−5, Pwithout HLA-A*02:01 = 0.031). By contrast, exclusion of HLA-DRB1*04:04, which is a risk factor for rheumatoid arthritis35 and was the most significant HLA allele in the EBVread+ GWAS, from the HLA MHC-II GRS did not attenuate the association of this GRS with rheumatoid arthritis. At P < 0.1, we also observed a lower HLA all GRS in individuals with non-Hodgkin lymphoma, and a lower HLA MHC-I GRS in rheumatoid arthritis (Fig. 4e).

We then conducted a phenome-wide association study (PheWAS) in the EUR AoU QC cohort using 1,751 PheCodes. With the exception of Sjögren disease, these PheCodes included all of the aforementioned EBV-associated diseases (Methods; Fig. 4f and Extended Data Fig. 8). At P < 0.001, the PheWas replicated all four significant associations identified in UKB. This approach also identified novel candidate diseases associated with EBV host control: the strongest associations were found for type 1 diabetes (beta = 0.176, s.e. = 0.023 for HLA MHC-II), inflammatory bowel disease (beta = −0.14, s.e. = 0.018 for HLA all, and beta = −0.112, s.e. = 0.018 for HLA MHC-II) and hypothyroidism (beta = −0.043, s.e. = 0.008 for HLA MHC-I, and beta = 0.037, s.e. = 0.007 for SNP wo MHC; Supplementary Table 19).

Suggestive causal effects of EBVread+ are driven by variants in MHC region

To investigate whether EBVread+ as an exposure has a causal effect, we performed two-sample Mendelian randomization (2SMR; Methods) for the five diseases with strong evidence for epidemiological association (that is, multiple sclerosis, rheumatoid arthritis, Hodgkin lymphoma, non-Hodgkin lymphoma and systemic lupus erythematosus)4, and three diseases identified by our PheWAS (that is, inflammatory bowel disease, type 1 diabetes and hypothyroidism). For multiple sclerosis, we tested both case–control status and disease course severity (Supplementary Table 20). We found suggestive evidence for causal effects of EBVread+ on rheumatoid arthritis (betawMed = 0.192, s.e. = 0.053) and type 1 diabetes (betawMed = 0.620, s.e. = 0.062), which were consistent across six estimators including two that are robust to pleiotropy (Methods; Supplementary Table 21, Supplementary Fig. 6 and Supplementary Note 8). However, the effects on both outcomes were driven by variants in the MHC region (Supplementary Table 22). Attributing causality is thus problematic, given the unknown extent of pleiotropic effects of MHC variants, and the limited heritability of EBVread+ attributed to non-MHC variants. No evidence for an EBVread+ causal effect was found for the other seven tested outcomes (Supplementary Table 21) or the negative control trait (Methods).

Discussion

This study is one of the first to demonstrate that GS-based EBVreads are a highly specific proxy for elevated EBV viral load in blood cells. Using this measure, we identified associations between EBVread+ and several non-genetic factors, including current smoking as well as sex. Smoking is also a risk factor for several EBV-associated diseases53,54,55, although the underlying mechanisms remain largely unknown. Current smoking affects both adaptive and innate immunity, with the latter normalizing upon smoking cessation56. This suggests an interaction of the innate immune system with current smoking status in EBV host control. The increased prevalence of EBVread+ in male sex encourages investigations into sex-specific factors, especially in the light of the contrary female predisposition of autoimmune diseases, including multiple sclerosis57.

We found that EBVread+ is polygenic and characterized by a major (and largely equal) contribution of alleles at MHC-I and MHC-II, which supports previous observations that CD8+ cytotoxic T and NK cells49 (MHC-I) as well as CD4+ helper T cells49,58 (MHC-II) are important in EBV control. Some genes implicated by common variants underly monogenic IEIs with increased susceptibility to severe EBV infections, often associated with a pronounced risk of EBV-associated diseases including lymphoma (for example, CD70)59,60. Our results thus probably harbour novel candidate genes for IEIs, such as CD226, which is a member of the immunoglobulin superfamily that contributes to NK and CD8+ T cell regulation61 and impairs CD8+ T cell response in chronic HIV when downregulated62.

Using genetically predicted EBV viral load, we identified genetic overlap with multiple sclerosis and rheumatoid arthritis. Although EBV is a prerequisite for multiple sclerosis, HLA-A*02:01, which reduces multiple sclerosis risk, was among our most significant findings and was associated with better EBV control. By contrast, no consistent effect on EBVread+ was found for the major multiple sclerosis risk allele HLA-DRB1*15:01 (ref. 36), suggesting a pathomechanism distinct from EBV viral load control. This could include a stronger antibody response through preferential EBV peptide presentation63,64, expansion of specific B cell subsets65 or molecular mimicry. In support of this, detailed analysis of HLA-DRB1*15:01 (Extended Data Fig. 9) found that the strongest effect size was with antibody levels of IgG EBNA-1, in line with previous findings that antibodies to EBNA-1 cross-react with the central nervous system protein GlialCAM8. In rheumatoid arthritis, alleles at MHC-I and MHC-II were associated with lower and higher EBV viral load, respectively. This suggests a specific dysregulation of the immune response to EBV, rather than a generic loss of EBV immune control. Although further research is required to determine whether the effect of EBV viral load is causal, the 2SMR results support this hypothesis. Our analyses also revealed a genetic overlap between EBV control and type 1 diabetes, inflammatory bowel disease or ulcerative colitis, and hypothyroidism, suggesting that the pathophysiological relevance of EBV host control may be broader than currently assumed.

Our study had several limitations. First, owing to the standard depth of human GS, most individuals had an EBV read count of zero, and many had an EBV read count of exactly 1. For statistical analyses, we binarized the phenotype into low or high EBV viral load, based on absolute EBV read count numbers, and compared EBV read count 0 versus 1 and higher. Given the limited resolution, some individuals with presumed high viral load might actually have low viral load. However, this potential mis-classification is unlikely to have impacted the overall conclusions, which are supported by our sensitivity analyses and are similar to those of a recent study, which used a different definition for increased viral load66. If specific quantitative measures or deeper GS data become available, statistical power will probably increase. Second, unobserved factors may have confounded associations with EBVread+, although we mitigated this risk by replicating findings across biobanks. Third, despite the partial transferability of HLA-based GRS across ancestries, the discovery analyses mainly involved EUR individuals. This might have influenced the identity of associated HLA alleles, and limit the generalizability of the findings with respect to different EBV strains and EBV-associated diseases, which vary in terms of global distribution and prevalence. Thus, replication of the GWAS findings and downstream analyses in non-EUR ancestries are required. Finally, given the biological complexity of the MHC region and current challenges in HLA allele imputation67, some HLA associations might have been missed or mimicked by extended regions of LD.

This work has established EBV viral sequence traces from blood-based human GS data as the basis for future investigations into functional, mechanistic and epidemiological aspects of persistent EBV infection. Quantification of viral load using host GS data could be extended to other human pathogens, and facilitate investigation of interactions between chronic infections and the host immune system in health and disease.

Methods

Analysis of UKB data

UKB data, accessed based on application ID 135122, were used as the primary discovery cohort, unless stated otherwise (Supplementary Note 1). Individual-level data analyses were conducted within the UKB Research Analysis Platform (RAP).

Extraction of high-quality EBV reads

All individuals with available GS data (n = 490,293)24 were included in the initial stage of analysis (UKB cohort). During the process of the project, 208 individuals (0.04%) withdrew their consent from UKB, explaining slightly lower sample counts in some follow-up analyses (n = 490,085). DNA extraction, library preparation, sequencing and alignment have been described elsewhere68,69 and are summarized in Supplementary Note 2. Reads mapping to the EBV genome (NC_007605.1) were accessed in CRAM files (field 24048), which had been previously generated by aligning fastq data to a GRCh38 graph genome (including the contig chrEBV) and were extracted using samtools (v1.20). Only read pairs where both forwards and reverse reads, respectively, mapped to NC_007605.1, were retained. Within pairs, reads were removed if they had more than 20 soft-clip bases, less than 120 bases matching the reference or were duplicates (see Supplementary Note 2). Finally, if at least one read of a read pair remained, this was counted as one EBV read. We also generated a similar dataset for HHV7 for the purpose of comparison, as described in Supplementary Note 6.

Quality control

We calculated the fraction of individuals with EBV reads per library preparation plate (field 32056). Fifty-one plates had excessively high proportions of EBVread+ individuals, probably due to contamination, and were excluded (Extended Data Fig. 1 and Supplementary Note 1). We also excluded individuals with low GS data quality (field 32064), sex chromosome aneuploidies (array-based genotyping data, field 22019) or discrepancies between reported and genetic sex (fields 31, 22001), resulting in the UKB QC cohort. For analyses limited to EUR ancestry, individuals were selected based on UKB field 22006. Applying a high-quality set of common genotyped variants for principal component analysis and for regenie step 1 (Supplementary Note 9) led to the exclusion of an additional 180 individuals (Supplementary Note 2), leaving n = 403,014 individuals for analyses (UKB EUR cohort).

We also generated a subcohort of the UKB QC cohort, comprising individuals for whom serology measurements were available (UKB serology cohort; n = 9,281, based on data field 23053). In this cohort, EBV seropositivity was defined based on the detection of at least two out of four EBV-related IgG antibodies (EA-D, ZEBRA, EBNA-1 and VCA-p18), as previously suggested44,70.

Processing of covariates

For individuals of the UKB EUR cohort, potentially important confounders of EBV read detection were retrieved based on ref. 71, including information on sequencing, technical aspects, blood composition and demographics. On the basis of the SNOMED associations identified in the AoU cohort, we additionally considered smoking status, pack years of smoking, number of cigarettes smoked per day (or previously smoked in cigar and/or pipe smokers) and number of weekly alcoholic drinks. Extracted values were processed to finally obtain transformed values for each covariate (Supplementary Note 4). Correlated covariates were identified by calculating Pearson correlations (one of each pair removed if correlation > 0.7; n = 4). Together with covariates age × sex and age × age, this resulted in 28 potential covariates, which were further reduced to a final set of 18 covariates by forwards and backwards selection with Bayesian information criterion (Supplementary Table 4 and Supplementary Note 4).

Immunosuppressive and EBV-associated conditions

Immunosuppressed individuals were identified as those reported with (1) taking immunosuppressive drugs (including glucocorticoids) at the time of visiting the UKB assessment centre (verbal interview, field 20003; n = 9,681), or (2) HIV infection (UKB fields 130204, 130206, 130208, 130210 and 130212; n = 230). Individuals affected by EBV-associated diseases were identified based on self-reporting in the assessment centre, International Statistical Classification of Diseases and Related Health Problems, 10th revision codes or codes for operative procedures (OPCS4). Full lists are given in Supplementary Table 23.

Association analyses

For common variants and HLA alleles, the main GWAS on EBVread+ was conducted with two-step regenie (v3.2.4)72, on related individuals of the No immune supp. cohort. Common variants have been previously imputed using the Haplotype Reference Consortium and UK10K haplotype resource22 (UKB field 22828; 29,865,259 variants with info-score > 0.8; 481 individuals lacked imputation data). Individual HLA alleles were obtained from field 22182, based on previous imputation with HLA*IMP:02 (ref. 73). Variants were included if they had a predicted minor allele count of ≥ 25. Non-classical HLA alleles were not included due to the lack of established standards for imputing these alleles. For compatibility with regenie step 2, the provided dosages were converted to plink2 pgen-files. In the statistical analysis, the 18 selected covariates and 20 principal components were used as covariates, and saddle point approximation was applied to account for case–control imbalance (see Supplementary Fig. 7 and Supplementary Notes 9 and 10).

For conditional analysis of HLA alleles, we applied a forwards-stepwise regression approach to identify HLA alleles that independently associate with the trait, based on the following procedure: (1) initial single-variant test for all HLA alleles as described in common variants and HLA alleles. (2) Iterative conditioning: repeat the following process: (i) Identify the allele with the lowest P value from the previous step; (ii) add this allele to the alleles to condition on; and (iii) run the conditioned association analysis (regenie v3.2.4). Step 2 was repeated until the most significant allele in the current iteration had a P value greater than the commonly used genome-wide significance threshold of 5 × 10−8.

For epistatic analyses, the lead variants of the three top non-MHC loci for EBVread+ were tested for interaction with conditionally independent HLA alleles, based on data from non-related individuals of the UKB no immune supp. cohort (n = 304,523 with complete data). Likelihood-ratio tests (LRTs; 1 d.f.) were used, comparing an additive logistic regression model with a model that additionally included an interaction term between the non-MHC SNP and the HLA allele (see Supplementary Note 11). LRT P values were Bonferroni corrected for multiple testing.

For rare variants, RVASgene was performed as described for common variants (identical phenotypes, and covariates, same procedure for regenie step 1), but based on exome variants and annotations as provided by the UKB74 (field 23158; Supplementary Note 9). This resulted in a slight reduction of the overall sample number (based on no immune supp. cohort; n = 54,259 EBVread+ cases and n = 293,834 EBVread controls). For regenie step 2, SKAT-O was used as a test (parameter: ‘--vc-tests skato’) and we restricted the analysis to rare variants with an alternative allele frequency below 1% (parameter ‘--vc-maxAAF 0.01’). The following definitions of variant pathogenicity (masks) were used: (1) M1: predicted loss-of-function variants; (2) strong coding: variants from (1) and likely deleterious missense variants; (3) medium coding: variants from (2) plus possibly deleterious missense variants; and (4) all coding variants from (3) plus likely benign missense variants (Supplementary Note 9). Overall, this analysis comprised rare variants in 18,796 protein-coding genes.

Additional case–control definitions and subcohorts

In addition to the main analysis of EBVread+, in which we compared individuals with EBV reads (1–18) to those without any EBV reads (0), we generated modified case–control definitions. These included GWAS analyses of 0 versus 1 read counts, 0 versus 2–18 read counts, and a ‘within EBVread+’ analysis comparing individuals with 1 read count versus 2–18 read counts. We also performed sex-restricted analyses, that is, on male or female participants only. Sample numbers are provided in Supplementary Table 12.

Analysis in the AoU cohort

We used release 8 (C2024Q3R3) of the AoU Research Program, which included array and GS data from blood-based DNA samples of 365,931 individuals (AoU cohort). The AoU resource, including data generation, processing and quality control of genomic data, is described in ref. 23 and accompanying documents.

Generation of EBV read data and cohort from GS data

First, EBV reads were extracted from CRAM files as described for UKB participants. At the individual level, we restricted our analyses to unrelated individuals with plausible time points of DNA sampling (between 11:00 and 23:59), without mismatch between reported and genetic sex and who were not flagged as population outliers (‘flagged samples’) in accompanying documents (AoU QC cohort, n = 336,123). For population-specific analyses, precomputed genetically predicted population backgrounds were used, which assigned each individual to one of six continental populations (Extended Data Fig. 3; see ‘Genomic research data quality report’).

Phenome-wide association analysis of EBVread+

We retrieved individuals from the AoU QC cohort who had electronic health record data available. For SNOMED concept IDs annotated in 250 or more individuals (n = 11,111), associations with the presence of EBV reads was tested as follows: we first applied logistic regression models with the presence of EBV reads as outcome, the presence of a SNOMED ID as predictor, and included age, sex, age × sex and 16 precomputed principal components as covariates. In a second step, we also included HIV and smoking status as covariates (see Supplementary Note 3). P values were calculated using LRT.

Replication of associated loci and HLA alleles

Detailed information of variant sets, generation of principal components, imputation and quality control of HLA alleles are described in Supplementary Fig. 8 and Supplementary Notes 3 and 12. Association analyses were performed in the EUR subcohort of AoU using regenie (v2.0.2), but without using step 1. We selected similar covariates as in the analysis within UKB, that is, sex, age, age × sex, mean sequencing coverage, hour as well as the week and time of biosample collection, nicotine usage, sequencing site and 20 principal components. However, certain covariates (including blood count traits) are not directly available in AoU and therefore could not be included (see Supplementary Table 4), which prevented a meta-analysis between UKB and AoU (Supplementary Note 4).

Validation cohorts

Two non-UKB/non-AoU cohorts were used for validation (Supplementary Note 13). For each of them, EBV reads were extracted from short-read GS data, in analogy to the analysis in UKB:

(1) Validation 1, qPCR. This cohort was recruited to study ACE inhibitor-induced angiooedema and consisted of 110 participants for whom GS data and DNA samples were available (blood or saliva derived25). To quantify EBV viral load, qPCR was performed on 72 individuals, including all EBVread+ and a random subset of EBVread individuals, using the clinically validated GeneProof EBV PCR Kit (TaqPath Menu, Applied Biosystems; four technical replicates per sample), with the target gene EBNA1.

(2) Validation 2, qPCR and RNA-seq. Partially overlapping subsets of JCTF participants with SARS-CoV-2 infection26,75 were used for qPCR for EBV viral load and reanalysis of RNA-seq data (n = 1,010), respectively. GS was obtained from whole-blood-derived DNA. For qPCR (n = 262 individuals, 3 technical replicates each), an in-house developed qPCRs assay was run, targeting EBNA1 (Supplementary Note 13; sequences available on request). Full-length RNA-seq data were reanalysed for the expression of 94 EBV genes. In short, reads were aligned against the GRCh38 reference genome, which included the EBV sequence NC_007605.1, and EBV transcripts were quantified using RSEM (v1.3.0). Given the high prevalence of EBVread+ in the JCTF subcohort, we investigated whether a more severe COVID-19 disease course drives EBVread+, but did not observe a strong effect (Supplementary Table 5 and Supplementary Note 5).

Genetic risk loci associated with EBVread+

Annotation of non-MHC risk loci

Regional association plots were generated with LocusZoom76 (see Supplementary Fig. 9 and Supplementary Note 14). Genome-wide significance was defined as P < 5 × 10−8, and independent risk loci were defined in FUMA (v1.6.3)48, based on 1000Gv3 (EUR population; r2 threshold of 0.6) lead SNPs (merging distance of 250 kb). For each locus, we reported (1) closest gene (based on distance of lead SNP to the transcription starting site); (2) linkage disequilibrium genes (that is, genes located within associated region, defined through variants with r2 > 0.2 to lead variant); (3) eGenes from GTEx (based on Adult GTEx v10, with genome-wide significant single-tissue eQTL effects (P < 5 × 10−8) in any tissue); and (4) V2G scores from Open Targets (v22.10; based on a cut-off > 0.1). To identify pleiotropic effects of lead variants, we retrieved from OpenTargets (v22.10) all traits at P < 0.005 that were reported in either GWAS Catalog, UKB or FinnGen. To identify potential targets for drug repurposing, approved drugs (clinical phase IV) targeting the identified genes were retrieved from OpenTargets (v25.3).

To investigate for potential regulatory effects on transcription in specific blood cell types, lead variants (or proxies thereof; r2 > 0.7 based on 1000Gv3, EUR subset) were retrieved from the OneK1K dataset52, and reported eQTLs with FDR < 0.05 in the original dataset.

Generation of credible SNP sets

Fine mapping for each non-MHC locus was performed with SuSie (sum of single effects regression)33, using 1-Mb window size (except for 12q24.12_SH2B3 (3 Mb) and 5q31.1_SLC22A5 (5 Mb) due to extended local linkage disequilibrium). Linkage disequilibrium matrices were generated from the imputed genotype data of the unrelated UKB EUR cohort (see above; n = 339,539; without principal component filter) using plink2 (v2.0.0-a.6). Coding variants within the credible SNP sets (cumulative PIP: 0.95) were annotated using Ensembl Variant Effect Predictor77 (VEP; release 113), ClinVar (version June, 2023)78 and AlphaMissense prediction scores79.

Correlation of effect sizes

We retrieved association statistics for lead variants at the 27 genome-wide significant non-MHC risk loci as well as for 54 conditionally independent HLA alleles, from additional GWAS. These included four different case–control definitions based on EBV read counts and female-only or male-only GWAS (see above), as well as from three external datasets: memory B cell absolute counts (GCST90001407 (ref. 42); no MHC data available) and EBV antibody titres44. We additionally calculated effect sizes at these loci for HHV7read+ (Supplementary Note 6) and recalculated effect sizes for main EBVread+ GWAS (0 versus 1–18) using different sets of covariates (Supplementary Note 4). Variants in linkage disequilibrium with the lead variant were used if they increased the overlap between datasets. We then calculated the correlation of effect sizes (betas) using Spearman’s correlation for non-MHC risk variants as well as HLA alleles. For HHV7, we investigated loci with potentially shared causal variants using coloc (v5.2.3)80 in R (v4.4.2).

Gene-level analyses

Gene-based association testing as well as enrichment analyses were conducted using MAGMA (v1.08)45, using default settings unless stated otherwise. Variants were assigned to 19,736 genes using the MAGMA gene boundaries Ensembl v102 file (excluding the extended MHC region as previously suggested71 (25–36 Mb)), and a window of 10 kb upstream and 1.5 kb downstream. Gene sets for IEIs were defined based on literature15 (n = 14 genes) or the IEI classification (available at https://iuis.org/committees/iei/, accessed 6 May 2025; n = 456 genes available in our data). Gene ontology biological processes (n = 7,743 terms) and tissue types (n = 54, GTEx v8) were provided by FUMA (v1.6.3).

Cell-type identification was performed using scDRS51 (v1.0.3) and single-cell RNA-seq data from the 1M-scBloodNL project, published by the sc-eQTLGen consortium50 (samples processed with Genomics (v3); broader level of cell-type annotations with 10 cell types; see Supplementary Note 15). Following data processing using the Seurat package (v5.2.1) in R (v4.3.2), 37,033 cells annotated to 8 cell types remained for the scDRS analysis. The top 1,000 EBVread+ MAGMA genes and their z-scores were used as weights in the scDRS analysis, with otherwise default parameters. Subsequent group analyses (that is, cell-type association and heterogeneity) were conducted with default parameters. Multiple testing correction of P values for the number of cell types was performed using Benjamini–Hochberg procedure.

GRS analyses in UKB

Analysis of polygenic contribution

To study the joint contribution of common variants associated with EBVread+ to EBV-associated diseases within EUR individuals of the UKB no outlier cohort, we generated an independent base cohort plus additional target cohorts, which encompassed (1) individuals with EBV-associated diseases (see results, data fields given in Supplementary Table 23), or (2) individuals for whom serology data were available. We additionally removed all individuals that were related to any other individual of the target or the base cohort. Individuals of the target cohorts were required to pass all filters applied to the UKB no outlier cohort, except that individuals within the top 1% EBV read counts were kept.

For SNP-based GRS, common variant association analysis was performed within the base cohort as described above, except that only autosomal and genotyped variants with a minor allele count > 25 were used for regenie step 2 (n = 739,066). Polygenic risk scoring was performed on non-ambiguous SNPs using PRS-CS (v1.0.0) in combination with a linkage disequilibrium matrix derived from EUR individuals of the 1000 genomes project81. To generate separate GRS for the MHC regions and the non-MHC regions, the summary statistics generated using the base dataset were split to only contain the required regions (that is, MHC region and non-MHC regions). Scores of individuals of the target cohort were obtained using the score function of plink (v1.90b6.21). For HLA allele-based GRS, GRS based on imputed HLA alleles were calculated by fitting a multivariable logistic regression model to the base dataset, where EBVread+ was the outcome. As predictors, we used the 18 covariates and the 20 principal component (see above), plus 178 HLA alleles that had a minor allele frequency > 0.1%. After model fit, the coefficient estimates of the 178 HLA alleles were retrieved. Scores for individuals in target cohorts were generated by multiplying HLA allele dosages with coefficient estimates and summing-up these values. To obtain MHC-I-specific or MHC-II-specific scores, we calculated the GRS using the same coefficient estimates, but considered only the HLA alleles belonging to the respective MHC class. All risk scores were normalized to means of 0 and standard deviations of 1 within the combined target cohorts.

Evaluation of GRS performance

To evaluate GRS performance, we used logistic regression models, where the group membership was the outcome (for example, EBVread+ versus EBVread or serology cohort versus a cohort of individuals with EBV-associated disease). As predictors, we used age, sex, age × sex, 20 principal components (base model) or the predictors of the base model as well as the respective GRS (GRS model). We then calculated Nagelkerke R2 for the base and the GRS models, where variability of Nagelkerke R2 estimates was evaluated using bootstrapping (n = 1,000). To test for statistical significance, we compared base models and GRS models using LRT.

GRS analyses in AoU

Transferability of GRS

For GRS analyses across biobanks and populations, we used our EBVread+ summary statistics from the UKB no immune supp. cohort, as the base dataset. SNP-based GRS were calculated based on 1,509,024 genotyped variants within AoU, which had a Hardy–Weinberg equilibrium P ≥ 1×10−10 and a variant-level missingness < 0.05 in each continental ancestry. Polygenic risk scoring was performed using PRS-CS CS and plink (v1.9.0-b.7.7) as described above. To obtain HLA-based GRS, we used coefficient estimates of the 178 HLA alleles from the UKB (see above) and multiplied them with the estimated dosage for each HLA allele in AoU. Mapping of HLA allele names between HLA*IMP:02 and HLA-TAPAS was performed manually, which resulted in a successful mapping for 166 of 178 alleles (no clear mapping for one MHC-I allele and 11 MHC-II alleles).

Phenome-wide association of EBVread+ GRS with PheCodes

We used the software package PheTK (v0.1.47)82 to assign PheCodes (v1.2) to individuals of the AoU QC cohort (EUR subset, n = 189,658). Individuals were considered as having a certain PheCode if the PheCode was annotated at least twice to the individual, as suggested by PheTK. We used logistic regression models with the presence of a PheCode as outcome and GRS as predictors, respectively. Age, sex, age × sex and 20 population-specific principal components were used as covariates. P values were calculated using LRT comparing models with and without the respective GRS. To comply with AoU publishing guidelines, we have only reported PheCodes annotated to more than 20 individuals, that do not supply count-related data and only give proportions when all underlying groups contain more than 20 individuals. PheCodes (n = 1,751) were compliant with these parameters.

2SMR analysis

Selection of outcome traits

We retrieved publicly available summary statistics for EUR ancestry cohorts for (1) known EBV-associated diseases: multiple sclerosis case–control83, multiple sclerosis severity84, Hodgkin disease85, non-Hodgkin lymphoma85, systemic lupus erythematosus86 and rheumatoid arthritis87; and (2) candidate diseases based on significant PheWAS results: hypothyroidism85, type 1 diabetes88 and inflammatory bowel disease89. Of note, none of the nine outcome GWAS included samples from UKB. We also used ‘red hair colour’90 (including UKB) as a negative control outcome. Summary statistics were retrieved from the GWAS Catalog except for multiple sclerosis severity, where summary statistics were shared by the authors. Further details are provided in Supplementary Table 20. Analyses were performed in R (v4.5.0) using the packages ieugwasr (v1.0.3) and TwoSampleMR (v0.6.15)91,92.

Selection of instrumental variables

First, we applied quality control on exposure and outcome GWAS summary statistics, retaining autosomal and non-duplicate variants with minor allele frequency > 0.01 and info-score > 0.8. Linkage disequilibrium-independent genome-wide significant variants from the GWAS on EBVread+ (the exposure) were identified using linkage disequilibrium clumping (ld_clump function of the ieugwasr package, standard parameters) and harmonized with the outcome GWAS summary statistics. The remaining genome-wide significant variants of the exposure were not suspicious of weak-instrument bias (I2 statistic = 0.99). We further used Steiger filtering92 to exclude potentially invalid instruments (that is, variants showing stronger associations with the outcome versus with the exposure). Together, this resulted in a reduction of the number of variants available for 2SMR.

2SMR

2SMR was performed using four methods93: the inverse variance-weighted estimator, MR-Egger, weighted median and weighted mode. For traits in which the exposure–outcome associations were nominally significant in all four estimators and reached test-wide significance (P < 0.05/9) in two out of four (Supplementary Table 20), we applied the outlier-robust and pleiotropy-robust estimators MR-RAPS94 and MR-PRESSO95. Outcomes that were also significant in these two additional tests were then subjected to further sensitivity analyses, specifically heterogeneity (Cochran’s Q statistic96), pleiotropy tests (for example, MR-Egger intercept97) and leave-one-out analyses. For these outcomes, we also performed 2SMR after excluding variants in the MHC region.

Ethics declaration

This study used de-identified data from the UKB and AoU, which were accessed through the respective computing platforms. UKB has approval from the North West Multi-centre Research Ethics Committee (MREC) as a Research Tissue Bank. This approval means that researchers do not require separate ethical clearance and can operate under the Research Tissue Bank approval. The data collection of the AoU Research Program was conducted under centralized Institutional Review Board (IRB) approval, with informed consent being obtained from the participants. Further ethical approvals were obtained from the Ethics Committee of the Medical Faculty Bonn (no. 101/16; for analysis of validation cohort 1) and by the ethical committees of the affiliated institutes (Keio IRB approval 20200061, Osaka University IRB approval 734-14 and University of Tsukuba IRB approval H29-294) for the JCTF.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.