Introduction

The Epstein-Barr virus (EBV) is one of the most ubiquitous human viruses known, infecting over 90% of the global population1,2. While most infections are asymptomatic, EBV can cause infectious mononucleosis (IM) in individuals infected later in life3. Beyond acute infection, the virus is also implicated in a variety of malignancies, including Burkitt lymphoma, Hodgkin lymphoma4, nasopharyngeal carcinoma, and gastric carcinoma5,6. Further, in immunocompromised individuals, such as organ transplant recipients and HIV/AIDS patients6, EBV-associated lymphoproliferative disorders are prevalent and can lead to oncogenesis7. EBV has also been linked to autoimmune diseases, including multiple sclerosis (MS)8,9,10,11,12,13,14 and systemic lupus erythematosus (SLE)15,16,17, with growing evidence suggesting that EBV infection may trigger or exacerbate autoimmune responses13.

The EBV genome encodes a wide array of viral proteins, long non-coding RNAs, circular RNAs, and microRNAs that manipulate host cellular pathways to support viral persistence and replication and to promote EBV-driven tumourigenesis18,19,20,21,22,23,24,25,26,27,28,29,30. Many of these viral factors play crucial roles in immune evasion, cell proliferation, and the inhibition of apoptosis, creating a cellular environment conducive to disease development31,32. As a member of the herpesvirus family, EBV has a dual infection cycle: a latent phase, in which the virus expresses a limited set of viral genes that facilitate maintenance of the viral episome, promote cell growth, induce cell differentiation, and promote survival, and a lytic replication phase, in which infectious virus particles are produced33. During latency, the virus establishes lifelong persistence in B cells with minimal viral protein expression, which helps it evade immune surveillance. In response to certain environmental stimuli, EBV transitions from latency to the lytic replication cascade, in which genome-wide transcription occurs, leading to the production of virus replication machinery and viral packaging proteins for the generation of infectious virus particles34. Despite significant advances in understanding EBV’s role in disease, the complexity of its transcriptome during reactivation remains a challenge. Recent developments in long-read high-throughput sequencing technologies have provided insights into the extensive complexity of EBV and other viral transcriptomes35,36,37,38,39,40,41,42, highlighting the sophisticated regulatory mechanisms that viruses employ to control their gene expression and their interactions with the host43.

In this study, we aimed to generate an improved resolution of the EBV transcriptome35,44 and to gain insights into regulatory processes controlling progression through the lytic gene expression cascade. To achieve this, we performed high-depth Oxford Nanopore Technologies (ONT) sequencing, and we used two distinct EBV reactivation models along with purification methods to selectively isolate reactivated cells from contaminating latent cells. Further, we developed a computational pipeline that not only validates ONT reads but also classifies each viral promoter/transcript according to its dependence on viral DNA replication, the viral pre-initiation complex (vPIC), and the lytic origin of replication (OriLyt). Through these efforts, we resolved the major transcript isoform for nearly all lytic reading frames, discovered large numbers of significantly expressed alternative promoters, found genome-wide polyA read-through transcription, identified highly expressed lncRNAs, and identified two abundant spliced isoforms that are predicted to generate previously uncharacterized protein structures. Further, through the classification arm of our pipeline, we uncovered regulatory mechanisms guiding lytic transcription, including biphasic promoters, we identified alternative promoters with differing dependencies on viral DNA replication, vPIC, and OriLyt, and we found evidence for a role for transcriptional interference in fine-tuning and/or regulating lytic cycle progression. Altogether, this work advances our understanding of the EBV transcriptome and the mechanisms driving the EBV lytic gene expression cascade.

Results

Models

For this project, we sought to improve upon previous studies aimed at resolving the highly complex EBV transcriptome and to add an extra layer of transcriptome information by classifying each transcript isoform into the early, leaky late, and late categories. To improve the depth of EBV long-read data and thereby enhance discernment of the EBV transcriptome, we utilized a selection model that we previously used to study EBV’s remodeling of cell transcription initiation43. Specifically, we used the EBV-positive Burkitt’s Akata cell model, into which we introduced an episomally replicating lytic cycle reporter harboring the early gene BMRF1 promoter driving the expression of a GFP marker gene (Supplemental Fig. 1). These Akata lytic reporter cells were treated with anti-IgG to activate the B-cell receptor (BCR), and fluorescence-activated cell sorting (FACS) was used to isolate the subpopulation of cells that successfully entered the lytic cycle. RNA was isolated from the GFP-positive cells and subjected to high-depth Oxford Nanopore Technologies (ONT) long-read cDNA sequencing of the polyadenylated (polyA) fraction (Supplemental Fig. 1). In a separate model, EBV-positive Mutu cells were co-transfected with a GFP expression vector and an expression cassette for the EBV-encoded immediate early transcription factor Zta to induce reactivation. GFP-positive cells were then isolated and likewise subjected to high-depth ONT polyA sequencing (Supplemental Fig. 1).

Identification of EBV lytic transcripts

Because long read cDNA sequencing data suffers from 5’ truncations, 3’ internal priming, and false splice junction calls35,44, we developed a pipeline to quickly validate each 5’ and 3’ end as well as splice junctions for each long read using CAGE-seq data (from purified reactivating cells)43, 3’ ONT clusters, and splice junctions identified through short read sequencing43, respectively (Fig. 1A). Because of unique aspects of viral transcription initiation, including our observed high density of start sites35, we developed our own peak calling software that is better suited to assessing viral transcription start sites. Specifically, this software was designed to deal with a high density of start sites, and it includes greater flexibility (options) to facilitate desired resolution based on the user’s input preferences. Using merged CAGE-seq coverage data from both Mutu-Zta and Akata-BCR models43, we identified peaks with a minimum of 1000 reads for at least one genomic position and a maximal peak width of 8 bp in which the most distal base of the peak is defined by having a depth value of at least 0.2X the highest position count. Using these criteria, we identified 573 unique start sites (vs 240 start sites identified in our previous study35) across the EBV genome during reactivation (Fig. 1A, upper left inset). This compares to 1697 start sites identified across the cell genome using the same criteria (Fig. 1A, upper left inset). Considering that the human genome is approximately 3.1 billion base pairs long, whereas the EBV genome is about 171,000 base pairs long, this indicates that EBV initiates transcription through a substantially higher density of start sites (>6000X) than what is observed across cell chromatin during reactivation.
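The peak criteria and the density comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' software: the function names, the greedy summit-first strategy, and the toy depth vector are all assumptions made for the example; only the numeric criteria (1000-read summit, 8 bp maximum width, 0.2X edge threshold) and the start site counts and genome sizes come from the text.

```python
def call_peaks(depth, min_summit=1000, max_width=8, edge_fraction=0.2):
    """Greedy TSS peak caller (hypothetical sketch).

    depth: list of CAGE 5' end counts, one per genomic position, one strand.
    Returns a sorted list of (start, end, summit) tuples, 0-based inclusive.
    """
    used = [False] * len(depth)
    peaks = []
    # Visit candidate summits from the deepest position downward.
    order = sorted(range(len(depth)), key=lambda i: depth[i], reverse=True)
    for summit in order:
        if depth[summit] < min_summit:
            break  # all remaining positions are below the summit threshold
        if used[summit]:
            continue  # already absorbed into a stronger peak
        cutoff = edge_fraction * depth[summit]
        start = end = summit
        # Grow one base at a time, deeper flank first, until the width cap
        # is reached or both flanks fall below the edge cutoff.
        while end - start + 1 < max_width:
            left, right = start - 1, end + 1
            left_ok = left >= 0 and not used[left] and depth[left] >= cutoff
            right_ok = (right < len(depth) and not used[right]
                        and depth[right] >= cutoff)
            if not (left_ok or right_ok):
                break
            if left_ok and (not right_ok or depth[left] >= depth[right]):
                start = left
            else:
                end = right
        for i in range(start, end + 1):
            used[i] = True
        peaks.append((start, end, summit))
    return sorted(peaks)


def tss_density_ratio(viral_tss, viral_bp, host_tss, host_bp):
    """Ratio of start sites per base pair, virus versus host."""
    return (viral_tss / viral_bp) / (host_tss / host_bp)
```

Calling `tss_density_ratio(573, 171_000, 1697, 3_100_000_000)` with the figures quoted above gives a value a little over 6000, in line with the ">6000X" statement. A real implementation would also need to handle strand, replicate merging, and ties between adjacent summits, none of which are specified in this section.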

Fig. 1: Transcript validation and classification.

A Oxford Nanopore Technologies (ONT) long reads from FACS-purified lytic cells in Akata cells reactivated by BCR activation (Akata-BCR) and in Mutu cells activated by Zta transfection (Mutu-Zta). CAGE-seq data from FACS-purified reactivating cells were combined from multiple replicates in both models, and transcriptional start site (TSS) calls were made using custom peak calling software. Cell and viral TSSs identified using this approach are shown in the upper left bar graph. The 5’ ends of ONT reads were validated using the identified TSSs/CAGE peaks. 3’ ends were validated using 3’ peaks derived from ONT 3’ ends from both reactivation models. Splice junctions were validated using short-read RNA-seq splice junction data from each respective cell model. Validated transcripts from each model were then classified using CAGE-seq data from ref. 48 as independent (Ind), partially dependent (Part) or dependent (Dep) on either OriLyt, BALF2 (DNA replication), or vPIC (viral TATT pre-initiation complex). Separate bed files for each of OriLyt, BALF2, and vPIC were generated, encoding different shading to represent different classifications when loaded on a genome browser. B Classification of genes and transcripts identified in the study. The bar on the left represents canonical genes/transcripts, categorized into four groups: “Confirmed”, “Refined”, “Resolved”, and “Unresolved”. The bar on the right represents previously unannotated genes/transcripts categorized as “Unannotated isoforms”, “Unannotated transcripts”, and “Chimeras” for our core annotation. C OriLyt and vPIC contribution to transcriptional activity is not limited to late gene promoters. Scatter plots show promoters that are OriLyt dependent but not BALF2 dependent (early) and promoters that are vPIC dependent but not BALF2 dependent (early). OriLyt ratios are the activity with OriLyt-deleted virus versus wild type virus, and BALF2 and vPIC ratios are the activity with BALF2 and BDLF4 mutants versus mutants reconstituted with either BALF2 or BDLF4. Bar graphs show the fraction of early (no dependency) and late (partial or full dependency) genes with dependency (partial or full) on OriLyt (left) and vPIC (right). Source data are provided in the Source Data file.

Combining all replicates for purified BCR-activated Akata cells and combining all replicates for Zta-transfected Mutu cells, we generated 23.4 million and 25.4 million ONT cDNA reads, respectively, prior to validation. Of these, 2.6 million and 4.2 million mapped to EBV, respectively (Supplemental Fig. 2). Our validation CAGE-seq data43, which were also generated from purified lytic cells, totaled 13.1 and 13.3 million reads for the Akata-BCR and Mutu-Zta models, of which approximately 22% and 36% mapped to the EBV genome (Supplemental Fig. 2). Lastly, our previously published short read data (generated from purified lytic cells)43 consisted of approximately 79 million and 58 million mapped read pairs per replicate, with approximately 50% and 36% of these reads mapping to the EBV genome for the Akata-BCR and Mutu-Zta models (Supplemental Fig. 2).

While we previously generated a pipeline for validating long read data (O’Grady et al.35), for the work here, we generated a pipeline that is easier to use, has embedded ORF annotation, and accommodates automatic annotation of BALF2, OriLyt, and vPIC dependencies (see below). Using the data described above and this long-read validation pipeline, we identified a total of 792 unique viral transcripts in the Akata-BCR model and 1256 unique viral transcripts in the Mutu-Zta model. Of these, 595 were found in both models, giving a total of 1453 unique transcripts across the two models (Fig. 1A). The finding of 1453 unique transcripts in this current study compares to 355 that we identified previously35 and 351 that Fulop et al.44 identified.
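The combined transcript count follows from simple set arithmetic across the two models, which can be illustrated as below. This is only a sketch: the transcript key (strand, start, splice junctions, end) and the function names are assumptions made for the example and are not taken from the authors' pipeline.

```python
def merge_models(akata, mutu):
    """Partition transcripts from two models into shared and model-specific sets.

    Each transcript is a hashable key; here a hypothetical tuple of
    (strand, tss, junctions, end) is assumed to define a unique isoform.
    """
    akata, mutu = set(akata), set(mutu)
    return {
        "akata_only": akata - mutu,
        "mutu_only": mutu - akata,
        "shared": akata & mutu,
        "all": akata | mutu,
    }


def union_size(n_first, n_second, n_shared):
    """Inclusion-exclusion: size of the union of two sets."""
    return n_first + n_second - n_shared
```

With the counts reported above, `union_size(792, 1256, 595)` returns 1453, matching the stated total of unique transcripts.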

Our approach validates the 5’ end of reads through confirmation by CAGE-seq peak data, it validates splice junctions using short-read RNA-seq splice junction data, and it validates 3’ ends that coincide with called 3’ end peaks (with the additional requirement of an upstream polyA signal). While 5’ end and splice junction validation relies on independent experimental data sets, 3’ end validation is achieved through computational inference. In support of the 3’ end validation approach, we found that 92% and 93% of the 3’ ends of all isoforms in the Mutu-Zta and Akata-BCR models, respectively, coincided with previously identified polyA sites. To explicitly validate the 3’ ends of each isoform identified, we counted the number of reads containing stretches of at least 25 As just downstream from each 3’ end. This analysis showed that none of the isoforms identified in either model lacked substantial 3’ end support (Supplementary Data 1). To further support our findings, we also conducted cross-validation using the FLAIR software package45. Results from this analysis indicated that 90% of our annotated transcripts matched the findings from the FLAIR analysis. Notably, Dana Wyman and colleagues from the TALON software development team pointed out that FLAIR has relatively low sensitivity in detecting known genes and transcripts compared to TALON46. As the next step, we continued our cross-validation using the TALON software, specifically the Talon_Label_Reads submodule, in conjunction with the LoRTIA software developed by Zsolt Boldogkői’s lab47. This combined analysis revealed that, with only 18 transcripts being the exception, over 99% of our annotated transcripts matched the results obtained from both LoRTIA and TALON. 
Together, this additional computational validation, along with the 3’ validation of all isoforms, supports our findings of greater numbers of viral transcripts than have been observed in previous studies, which may be due in part to the sequencing of pure populations of reactivating cells.
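The three validation checks described above (5’ end against CAGE peaks, 3’ end against 3’ end peaks, splice junctions against short-read junctions) can be sketched as a simple filter. This is an illustrative sketch only: the `Read` structure, the `slop` tolerance, and the function names are assumptions, strand handling is omitted for brevity, and the 3’ end peaks are assumed to have already been filtered for an upstream polyA signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    five_prime: int          # genomic position of the read's 5' end
    three_prime: int         # genomic position of the read's 3' end
    junctions: tuple = ()    # ((donor, acceptor), ...) splice junctions


def in_any_interval(pos, intervals, slop=0):
    """True if pos lies within any (start, end) interval, allowing slop bases."""
    return any(start - slop <= pos <= end + slop for start, end in intervals)


def validate_read(read, cage_peaks, end_peaks, known_junctions, slop=2):
    """Keep a long read only if all three features have independent support.

    cage_peaks:      (start, end) TSS peaks from CAGE-seq
    end_peaks:       (start, end) 3' end peaks (polyA-signal filtered)
    known_junctions: set of (donor, acceptor) pairs from short-read RNA-seq
    """
    return (
        in_any_interval(read.five_prime, cage_peaks, slop)
        and in_any_interval(read.three_prime, end_peaks, slop)
        and all(junction in known_junctions for junction in read.junctions)
    )
```

An unspliced read passes the junction check trivially, so it is validated on its two ends alone. The tolerance of 2 bases is a placeholder; the value used in the actual pipeline is not stated in this section.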

The most distinctive additional component of our pipeline is the classification of each identified viral transcript as early, leaky late, or late (i.e., by its dependency on BALF2, which is essential for viral DNA replication), together with the determination of whether each transcript is independent, partially dependent, or fully dependent on the EBV lytic origin of replication (OriLyt) or the viral preinitiation complex (vPIC) (dependency on the vPIC component, BDLF4) (Fig. 1A). This classification uses CAGE-seq data from EBV recombinants published recently by the Johannsen group48. Altogether, we identified and classified 1453 unique viral transcripts expressed during EBV reactivation. It is noteworthy, however, that although long-read sequencing has the capacity to sequence long transcripts, there is still a substantial bias against the recovery of longer reads. We therefore believe that the 1453 transcripts identified in our study likely underestimate the complexity of the EBV transcriptome.

Generation of a “core” EBV transcriptome annotation

Based on our findings, there is an extensive repertoire of polyadenylated transcript isoforms that are produced during EBV reactivation. We therefore attempted to generate a more manageable annotation consisting of the most robustly expressed key viral transcripts (Supplemental Table I and Supplementary Data 2 (see Supplementary Data 3 for the associated Akata EBV genome fasta file)). Specifically, we include the major isoform for each reading frame, we include additional alternative start site isoforms with CAGE peak read depth comparable to that of the major isoform, we include alternatively spliced isoforms with more than 1000 splice junction reads in at least one of the Mutu-Zta short read RNA-seq samples, we include polyA read-through isoforms when the identified transcript is not spliced, and we include previously unannotated transcripts with unusually high ONT read depths (mostly lncRNAs).

Within this core set of transcripts, we (1) refined the start site (within a few bases of our previous annotation35) of 28 primary isoforms, (2) formally resolved the start site of the major isoform of 22 ORF transcripts (note that the majority of these were previously identified by Djavadian et al.48, but here we have tied them directly to the primary transcript through long read data), (3) discovered two alternative major spliced isoforms, one of which forms a chimeric protein, (4) identified secondary start or termination sites for 59 ORF transcripts, and (5) identified 6 highly detected and/or notable lncRNAs (Fig. 1B and Supplemental Table I). All combined, we have resolved the structure of the major isoform for all but one (BILF1) EBV lytic reading frame, and we identified major additional isoforms of known ORFs and major previously unannotated transcripts (Supplemental Table I).

Classification of lytic promoters and transcripts

To classify promoters and transcripts according to their dependency on either viral DNA replication (BALF2 dependency), the EBV lytic origin of replication (OriLyt), or the virus-encoded preinitiation complex (vPIC) (BDLF4 dependency), we used published CAGE-seq data from the Johannsen lab48. These data were generated from cells infected with wild type virus or an OriLyt deletion mutant (defective for viral DNA replication and OriLyt enhancer function49), a BALF2 deletion mutant (defective for viral DNA replication) or the same mutant reconstituted with BALF2, and a BDLF4 mutant (defective for vPIC function) or the same mutant reconstituted with BDLF4. For classification, we determined the fold change in all CAGE-seq reads spanning each start site in mutant conditions versus wild type or reconstituted conditions. Fold change cutoffs were used to classify each promoter as “independent”, “partially dependent”, or “fully dependent” and were chosen based on thresholds that best recapitulated the “early”, “leaky late”, or “late” gene classifications made previously by the Johannsen lab48. Transcripts were then classified based on the classification of their respective start site (Fig. 1A). Bed format files containing annotation data for transcripts identified in the Mutu-Zta and the Akata-BCR models, as well as transcripts identified in both models, were generated for color-coded visualization on a genome browser (Supplementary Data 4–12).
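The fold-change classification can be sketched as follows. The cutoff values (0.5 and 0.1), the pseudocount, and the function names are placeholders chosen for illustration; the actual thresholds used in the study are not given in this section. The mapping of BALF2 dependency to kinetic class follows the definitions in the text (independent corresponds to early, partially dependent to leaky late, fully dependent to late).

```python
def classify_dependency(mutant_reads, control_reads,
                        partial_cutoff=0.5, full_cutoff=0.1, pseudocount=1.0):
    """Classify a start site from CAGE-seq read counts.

    mutant_reads:  reads at the TSS in the deletion mutant
    control_reads: reads at the TSS in wild type or reconstituted virus
    Cutoffs are hypothetical; a lower ratio means stronger dependency.
    """
    ratio = (mutant_reads + pseudocount) / (control_reads + pseudocount)
    if ratio <= full_cutoff:
        return "fully dependent"
    if ratio <= partial_cutoff:
        return "partially dependent"
    return "independent"


# Kinetic class is defined by dependency on BALF2 (viral DNA replication).
KINETIC_CLASS = {
    "independent": "early",
    "partially dependent": "leaky late",
    "fully dependent": "late",
}


def classify_transcript(tss_counts):
    """tss_counts maps factor name -> (mutant_reads, control_reads).

    A transcript inherits the classification of its start site.
    """
    calls = {factor: classify_dependency(m, c)
             for factor, (m, c) in tss_counts.items()}
    calls["kinetic_class"] = KINETIC_CLASS[calls["BALF2"]]
    return calls
```

For example, a start site with counts `{"BALF2": (900, 1000), "OriLyt": (300, 1000), "vPIC": (950, 1000)}` would be called early, partially OriLyt dependent, and vPIC independent under these placeholder cutoffs, which is the category of promoter examined in the following sections.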

Contribution of OriLyt to early promoter activity

Early, leaky late, and late genes are defined based on their reliance on viral DNA replication for their expression. Classification of early, leaky late, and late promoters can therefore be defined by their dependence on the key replication-associated protein, BALF2. Viral DNA replication also requires OriLyt, and accordingly, we observe that nearly all promoters with partial or full dependency on BALF2 are also dependent on OriLyt (Fig. 1C). The two promoters that show unique dependency on BALF2 are partially dependent and are just below the threshold cutoff. These are therefore likely due to experimental noise in the data. On the other hand, excluding the adjacent promoters in the BHLF1 and BHRF1 regions, for which the OriLyt region plays a direct role in activity, 55 out of 132 TSSs with partial or full dependency on OriLyt do not require viral DNA replication (based on BALF2 dependency) (Fig. 1C). This indicates that OriLyt has an impact on the activity of a substantial number of promoters outside of its role in facilitating viral DNA replication. Overall, we found that OriLyt plays a role in 42% of viral early TSSs (Fig. 1C), contributing to the expression of the major isoform of 14 viral early genes (Table 1). These OriLyt-responsive TSSs are distributed throughout the viral genome (Fig. 2A (left), red-only lines), suggesting that OriLyt may provide enhancer function to multiple genomic regions. To further investigate a possible enhancer function for OriLyt, Micro-C was performed on purified BCR-activated Akata cells to assess genomic interactions within the EBV genome during reactivation. This analysis showed genome-wide interactions with the OriLyts (Fig. 2B), with a preference for OriLyt-dependent promoters (odds ratio = 3.668, 95% CI 1.993–6.752, p-value = 1.2 × 10⁻⁵). 
Overall, these results indicate a key role for OriLyt outside of its requirement for replication-dependent late gene expression, namely, a widespread role in enhancing early viral gene expression. While our analyses are technically limited to early genes because for late genes, we cannot discern between a role in replication-dependent expression and potential enhancer function, it is likely that OriLyt also contributes to leaky late and late gene expression through a potential enhancer mechanism.

Fig. 2: OriLyt and vPIC promoter dependencies.

A Circos plot of genome-wide EBV promoter distribution and dependence on vPIC (viral pre-initiation complex), OriLyt, and BALF2 (viral DNA replication). Two color intensities are used to represent the levels of dependence for each factor. Dark blue indicates full dependence, and blue indicates partial dependence for BALF2. Dark red indicates full dependence, and red indicates partial dependence for both vPIC and OriLyt. B Intrachromosomal interactions (magenta loops) with the EBV OriLyts. Interaction scoring was generated using the ICE module of the HiC-Pro pipeline84. Interactions originating from either OriLyt (plus and minus 1 kb) that span more than 5 kb and have an ICE-normalized interaction score greater than 0.00079 are displayed. Bottom tracks indicate OriLyt-dependent promoters, with the upper track including both BALF2-dependent and BALF2-independent promoters and the lower track displaying OriLyt-dependent promoters that are BALF2 independent. Source data are provided in the Source Data file.

Table 1 Early genes regulated by OriLyt or vPIC

vPIC contributes to early promoter activity

In addition to the classic viral DNA replication mechanisms driving late gene expression in herpesviruses, gammaherpesviruses also regulate late gene expression in part by encoding their own preinitiation complex, which acts through the TATA-like motif, “TATT”48,50,51,52. Accordingly, we found that the majority of leaky late and late TSSs (BALF2 dependent) displayed a dependency on vPIC (Figs. 1C and 2A (right)). Surprisingly, however, we also found that 23% of early TSSs displayed a vPIC dependency (Fig. 1C), contributing to expression of the major isoforms of 4 early genes, 2 of which have the classic TATT motif, one a TATA motif, and the other a variant motif (Table 1). These findings show that vPIC not only regulates leaky late and late gene expression but also contributes to early EBV gene expression.

Unique biphasic initiation characteristics of leaky late promoters

Inspection of start sites at leaky late promoters revealed an unusual characteristic in response to viral DNA replication (BALF2 dependency). Specifically, we noted that in several cases, there were multiple start site positions within each cluster whose activity differed with respect to their dependency on BALF2, with BALF2 dependency being biased specifically towards the downstream positions (Fig. 3). vPIC dependency was similarly biased towards the downstream positions (Supplemental Fig. 3), and it is notable that each of these promoters contains either a TATA or a TATT motif. Based on these findings, we hypothesize that there are distinct differences in the composition of the pre-initiation complexes that drive early and late expression at these promoters and that this gives rise to the unique downstream positioning of initiation. One scenario could involve a predisposition for non-vPIC pre-initiation factors (potentially a TBP-based preinitiation complex) to drive proximal initiation during the early stage of the lytic cascade, while the vPIC pre-initiation complex drives distal initiation during late transcription. Other potential determinants could involve differences between the chromatinized nature of early genomes and the naked nature of nascently replicated genomes during late transcription and/or the impact of additional unknown factors. Overall, the finding of the biphasic nature of these leaky late promoters provides insights into the mechanisms underlying their leaky late characteristic. It should also be mentioned that other leaky late initiation sites that lack multiple start site positions may similarly be initiated by distinct pre-initiation complexes that do not cause a shift in the positioning of the TSS.
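One way to quantify the downstream bias described above is to compute the mutant-to-reconstituted ratio at each position in a start site cluster and compare the upstream and downstream halves. The sketch below is an assumption-laden illustration, not the analysis used in the study: the half-split summary, the pseudocount, and the function names are hypothetical.

```python
from statistics import mean


def position_ratios(mutant, rescued, pseudocount=1.0):
    """Per-position ratio of CAGE counts, mutant versus reconstituted virus.

    Both lists are in genomic (left-to-right) order over the same cluster.
    A lower ratio means greater dependency on the deleted factor.
    """
    return [(m + pseudocount) / (r + pseudocount)
            for m, r in zip(mutant, rescued)]


def downstream_bias(mutant, rescued, strand="+"):
    """Mean upstream ratio minus mean downstream ratio.

    A positive value indicates that downstream initiation positions are more
    dependent on the factor than upstream positions in the same cluster.
    """
    ratios = position_ratios(mutant, rescued)
    if len(ratios) < 2:
        raise ValueError("need at least two positions in the cluster")
    if strand == "-":
        ratios = ratios[::-1]  # orient the cluster 5' to 3'
    half = len(ratios) // 2
    upstream = ratios[:half]
    downstream = ratios[len(ratios) - half:]
    return mean(upstream) - mean(downstream)
```

A cluster in which the first positions are barely affected by BALF2 deletion while the last positions are strongly reduced yields a positive score, consistent with the biphasic pattern described for leaky late promoters.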

Fig. 3: Biphasic transcription start sites and multiple promoter usage.

A Biphasic start site in leaky late and early promoters. All transcriptional start sites are oriented from left to right. The histograms show CAGE-seq signals at each genomic position of transcription initiation site clusters, with downstream initiation sites demonstrating greater dependency on BALF2. Ratios of dBALF2 vs dBALF2 plus re-addition of BALF2 expression vector for each position are displayed as red lines above the histograms. The TATA or TATT motif sequences found approximately 30–35 bp upstream from initiation sites are shown to the left of each graph. B Three major promoters for BFLF2. Thick parts of transcript features represent the reading frame, and thin parts display 5’ and 3’ non-coding sequences. CAGE-seq tracks for wild type (or reconstituted) and mutant BALF2, BDLF4, or OriLyt conditions are shown below transcript structures, and the number of read starts at each genomic position is displayed as negative values to represent leftward transcription starts. Source data are provided in the Source Data file.

We also noticed a similar shifted initiation site bias for three early gene promoters (Fig. 3A and Supplemental Fig. 3). For BNLF2a, we separate these into two different isoforms in our core annotation: an early and a leaky late isoform. The early isoform is likely expressed to protect lytic cells from adaptive immune responses against viral lytic antigens immediately as they are being produced, while the late isoform may ensure high amounts of BNLF2a to maintain this protective impact and/or for potentially loading into viral particles to protect de novo infected cells from adaptive immune responses. Interestingly, for BDLF4, this shift results in initiation downstream from the BDLF4 ATG initiation codon, thereby encoding a transcript lacking the ability to express BDLF4 protein. Instead, this small shift in transcription initiation gives rise to a variant isoform of the overlapping leaky late BGLF1 gene (BGLF1-v1) (Fig. 3). The biphasic nature of this promoter provides an unusual example of a single promoter that has the potential to give rise to the production of two entirely different viral proteins with differing dependencies on viral replication.

Regulatory diversity of lytic ORFs through multiple promoter usage

For our core annotation, we included lytic gene isoforms that are driven by alternative promoters with comparable activity. In total, 27 lytic ORFs have two or more major promoters (Supplemental Table I). While in some cases the additional promoters displayed the same dependence on BALF2, BDLF4, and OriLyt, in most cases the additional promoter(s) added unique dependencies on viral DNA replication (BALF2), vPIC, and/or OriLyt. For example, while the BFLF2 ORF is driven by three different promoters with no dependence on viral DNA replication, they all displayed a dependence on OriLyt, and the activity of the proximal-most promoter was unique in its partial dependence on vPIC (Fig. 3B). Djavadian et al. have proposed that viral early promoters are shut down once viral DNA replication initiates48. The use of biphasic initiation sites, as mentioned above, and/or the use of alternative promoters with differing dependence on BALF2 may have evolved to overcome these expression restrictions for select sets of viral proteins that are required in both the early and late phases of the lytic replication cascade.

Viral polyA read-through—downstream of gene (DoG) transcription

Numerous studies have demonstrated that alphaherpesvirus replication causes defects in transcription termination within cellular and viral transcripts, producing longer transcripts that extend beyond the typical polyadenylation site, commonly known as “Downstream of Gene” (DoG) transcripts40,53,54,55,56,57,58,59. Recently, this finding has been extended to EBV reactivation60. While the functional impact of defective cell transcription termination in supporting viral replication is unclear, it is speculated that this may influence cell chromatin structure, potentially contributing to the substantial structural changes in cell chromatin that occur during reactivation53,61. Among the validated ONT reads in our work here, we find evidence for DoG transcription scattered throughout the EBV genome (Fig. 4), similar to findings for some of the viruses mentioned above. These include DoG transcripts that arise from transcription proceeding through multiple previously annotated polyA sites, including a DoG transcript initiating from the BFRF3 promoter that passes through 5 annotated polyA sites (Fig. 4).
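The number of annotated polyA sites that a transcript reads through can be counted directly from coordinates, as sketched below. This is a simplified illustration under stated assumptions: the function names are hypothetical, coordinates are toy values, and `polya_sites` is assumed to contain only previously annotated polyA sites on the same strand as the transcript.

```python
def read_through_sites(tss, end, strand, polya_sites):
    """Annotated polyA sites lying strictly between a transcript's TSS and
    its 3' end, listed in the order in which transcription encounters them.

    For '+' strand transcripts tss < end; for '-' strand transcripts tss > end.
    """
    if strand == "+":
        return sorted(site for site in polya_sites if tss < site < end)
    return sorted((site for site in polya_sites if end < site < tss),
                  reverse=True)


def is_dog_transcript(tss, end, strand, polya_sites):
    """A transcript is treated as read-through (DoG) if it passes at least
    one annotated polyA site before terminating."""
    return len(read_through_sites(tss, end, strand, polya_sites)) > 0
```

In practice, spliced transcripts would need extra care, because a polyA site located within an excised intron is not truly read through in the mature RNA. The core annotation described earlier sidesteps this by including polyA read-through isoforms only when the identified transcript is not spliced.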

Fig. 4: Genome-wide EBV read-through (DoG) transcription.

Read-through (DoG) transcripts identified here are listed to the left, along with their respective classification and information on the polyA sites that are read through (note that all indicated polyA sites include only previously identified polyA sites rather than the mere presence of an AAUAAA motif). On the right is a circos plot86 displaying the positioning of each read-through transcript across the EBV genome. Orange arrows indicate DoG transcripts, which initiate from the same transcription start sites (TSS) as canonical genes but extend beyond previously annotated polyadenylation sites. The diagram highlights multiple instances where DoG transcription proceeds through several polyA sites, such as transcription initiation from the BFRF3 promoter, resulting in extended transcripts that pass through five previously annotated polyA sites. The TR (Terminal Repeat) region of the EBV genome is labeled at the top of the circle. Source data are provided in the Source Data file.

There is substantial evidence from the literature demonstrating that upstream transcription proceeding through a downstream promoter causes “transcriptional interference” through displacement of transcription factors and other proteins53,62,63,64,65,66,67,68,69,70. In the context of the highly compacted viral genome, it is likely that DoG transcription plays a role in fine-tuning transcriptional regulation, at least in part through this mechanism. For example, on copies of the genome in which transcription of the late BDLF1 or BDLF2 genes extends beyond their polyA site, expression of the early genes BGRF1/BDRF1, BDLF3.5, BGLF3, BGLF3.5, BGLF4, and BGLF5, as well as the leaky late BBLF1 and the late BGLF2 and BGLF1 genes, would be disrupted. DoG transcription may therefore play a role in the temporal regulation of viral lytic transcription.

High-abundance viral lncRNAs

Among the most highly detected transcripts in our ONT data was the small lncRNA (265 bp), lncBGLF5, which is the 9th most highly detected ONT read (66166 reads) in the Mutu-Zta dataset (lower but still significant detection in the Akata-BCR model may result from lncBGLF5 being near the size selection cutoff) (Table 2). This lncRNA is generated by the overlapping context of the BGLF5 and the BALF4/BGLF3.5/BGLF3 gene structures, which end 265 bp downstream from the BGLF5 start site (Fig. 5A). Because of this configuration, the generation of full-length BGLF5 requires transcriptional read-through of the BALF4/BGLF3.5/BGLF3 polyA site. lncBGLF5, then, is generated from failure of BGLF5 promoter-initiated transcripts to read through the BALF4/BGLF3.5/BGLF3 polyA signal, resulting in a high-abundance lncRNA. In addition, the abundant lncBALF3 and lncBALF5 lncRNAs are similarly generated from BALF3- and BALF5-initiated transcripts that fail to read through overlapping gene polyA sites (Fig. 5B, C and Table 2). Not displayed in Fig. 5 is a lncRNA generated from a similar configuration, which exhibits meaningful but less robust read numbers (128 reads in the Mutu model). Specifically, this lncRNA is generated from a BaRF1-initiated transcript that requires read-through of the BORF1/BORF2 polyA site. For each of these lncRNAs, it will be interesting to assess whether these transcripts are exported to the cytoplasm to serve some unknown function or whether they remain localized at their site of transcription, where they may act as stable enhancer RNA-like RNAs that help establish phase separation at these promoters to promote transcription.

Fig. 5: Highlighted lncRNAs identified in this study.
A–C Schematic representation of the overlapping gene structures of BGLF5 (A), BALF3 (B), and BALF5 (C), highlighting the generation of the abundant lncRNAs, lncBGLF5, lncBALF3, lncBALF5 and lncBALF4. Full-length BGLF5, BALF3, and BALF5 transcripts are produced by read-through of the polyA sites of overlapping genes and may require the observed termination defects caused by EBV reactivation. Nevertheless, failure to read through causes the expression of abundant lncRNAs. Dashed lines indicate the polyA site, and the arrows represent the transcription direction of each gene. Thick segments of the features represent coding sequences, while thinner segments correspond to the 5’ and 3’ untranslated regions. D Illustration of the distributed start sites at the BZLF1 locus that give rise to multiple coding and non-coding BZLF1 isoforms. Below the schematic representations of the BZLF1 isoforms are CAGE-seq tracks for the Akata-BCR and Mutu-Zta models, with the number of read starts at each genomic position displayed as negative values to represent leftward transcription starts.

Table 2 High-abundance viral lncRNAs

Another abundant lncRNA worth noting is the 1.5 kb lncBALF4 RNA. Despite being longer than the above-mentioned small lncRNAs, it is still highly represented in our ONT libraries (Table 2). While it may have some unknown function, it is also noteworthy that this lncRNA is driven from an alternative late promoter with transcription proceeding through the early BALF5 promoter (Fig. 5C). In this context, it is likely that lncBALF4 transcription causes transcriptional interference and may be one mechanism to shut down early BALF5 transcription during the late phase of the reactivation cascade.

Also noteworthy are a series of lncRNAs generated from the BZLF1 locus (Fig. 5D and Table 2). The BZLF1 reading frame contains only two methionine residues that are present in tandem at the very beginning of the reading frame and there are two promoters of approximately equal strength that drive transcription upstream from the reading frame (Fig. 5D). In addition, there are two downstream start sites of similar strength that drive transcription downstream from the initiation codon, giving rise to lncRNAs (Fig. 5D). Lastly, there is an upstream transcription initiation site of similar strength that causes transcription in the opposite orientation (not shown). Together, this cluster of start sites, some of which make lncRNAs, is somewhat reminiscent of transcription initiation at the OriLyt enhancer, raising the possibility that the BZLF1 locus and/or associated lncRNAs may similarly provide enhancer activity.

Highly abundant alternative splice variants of EBV lytic genes

Forty-seven percent and 64% of all validated ONT reads in the Akata-BCR and Mutu-Zta models, respectively, were found to contain at least one splicing event. Despite these high numbers, and although 50% and 36% of short reads mapped to the EBV genome in the Akata-BCR and Mutu-Zta models, only 2% and 1% of splice junction reads mapped to the EBV genome. While the substantial isoform diversity that we observe is of likely importance to reactivation, many of these isoforms may represent rare splicing events of highly expressed long transcripts that are too long to be easily detected by ONT sequencing. To ensure that previously unannotated spliced isoforms are of likely high significance, warranting inclusion in our “core” annotation, we only include spliced isoforms that were previously identified and/or have more than 1000 splice junction reads per short read sequencing replicate of the Mutu-Zta model and are detected in both the Akata-BCR and the Mutu-Zta models. Using these criteria, we identified two additional splice variants for inclusion in our “core” annotation (Supplemental Fig. 4 and Fig. 6). One is a second splice variant of BLLF1 (gp350), which encodes the first 121 amino acids of BLLF1 and five additional carboxy-terminal amino acids arising from a frame shift at the splice junction (Fig. 6A). This results in a BLLF1 protein isoform encoding the first of three anti-parallel beta-barrels that, in the full-length protein, forms an anchor bridging the other two (Fig. 6)71. Although this splice variant encodes this well-defined structural motif, full-length BLLF1 binding to the CR2 receptor on B-cells occurs through the third beta-barrel. The possible function of the protein product of this splice variant is therefore unclear.
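The inclusion criteria for spliced isoforms in the “core” annotation can be illustrated with a minimal sketch. The data structure and field names below are hypothetical, and the way the criteria are combined (previously identified isoforms are always kept; new isoforms need more than 1000 junction reads in every Mutu-Zta short read replicate plus detection in both models) is one reading of the text, not the authors' code.

```python
from dataclasses import dataclass
from typing import Sequence

# Threshold taken from the text: more than 1000 splice junction reads per replicate.
MIN_SJ_READS = 1000


@dataclass
class SplicedIsoform:
    """Hypothetical record describing one spliced isoform."""
    name: str
    previously_identified: bool
    mutu_zta_sj_reads: Sequence[int]  # junction reads in each Mutu-Zta short read replicate
    in_akata_bcr: bool                # detected in the Akata-BCR model
    in_mutu_zta: bool                 # detected in the Mutu-Zta model


def in_core_annotation(iso: SplicedIsoform) -> bool:
    """Return True if the isoform passes the (assumed) inclusion logic."""
    if iso.previously_identified:
        return True
    well_supported = bool(iso.mutu_zta_sj_reads) and all(
        n > MIN_SJ_READS for n in iso.mutu_zta_sj_reads
    )
    return well_supported and iso.in_akata_bcr and iso.in_mutu_zta
```

Under this reading, an unannotated isoform with strong junction support in only one of the two models would be excluded.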

Fig. 6: Identification, structural modeling, and translation of alternative EBV splice variants.
A Diagram of the BLLF1 gene and its previously unannotated splice variant, BLLF1-v2. The top illustrates the full-length BLLF1 protein, including the extracellular (EC), transmembrane (TM), and cytoplasmic tail (CT) domains. Splicing produces BLLF1-v2, encoding only the first 121 amino acids of BLLF1 with a frameshift that adds 5 additional amino acids. AlphaFold376 structure predictions are shown for full-length BLLF1 (pTM = 0.36, PAE: min = 0.8, max = 31.7, mean = 27.16) and BLLF1-v2 (pTM = 0.53, PAE: min = 0.8, max = 31.6, mean = 14.62), revealing retention of the beta-pleated sheet. Thick segments indicate coding regions, thin lines indicate UTRs, and arrows denote transcription direction. B Schematic of a splice event fusing the first 489 amino acids of BXLF1 (thymidine kinase) with the last 459 amino acids of BXLF2 (gH glycoprotein), along with AlphaFold376 structural models for each protein and the fusion product. The BXLF1_BXLF2 fusion model shows pTM = 0.57, PAE: min = 0.76, max = 31.71, mean = 21.65, with high PAE due to the intrinsically disordered BXLF1 N-terminus. Individual models: BXLF1 (pTM = 0.60, mean PAE = 20.86) and BXLF2 (pTM = 0.86, mean PAE = 9.50). Truncation impairs interaction with gL (ipTM = 0.17, pTM = 0.48, mean PAE = 24.63) versus intact BXLF2 (ipTM = 0.78, pTM = 0.82, mean PAE = 12.39). Models display BXLF1 in red and BXLF2 in green. Thick and thin segments denote coding regions and UTRs, respectively; arrows show transcription direction. C BLLF1-v2 and BXLF1/BXLF2 splice variants detected in polysome fractions. The upper panel shows the polysome profile; the middle shows 5S, 18S, and 28S rRNA distribution; the lower panels show RT-PCR detection of BLLF1-v2 and BXLF1-BXLF2 splice variants. A less abundant BXLF1-BXLF2 variant with an internal exon is mainly seen in the soluble fraction. Molecular weight markers are indicated in base pairs (bp). This experiment was independently repeated three times with similar results. 
Source data are provided in the Source Data file.

The other high-abundance splicing isoform results in the generation of an early fusion protein containing the first 489 amino acids of BXLF1 (viral thymidine kinase gene) and the last 459 amino acids of the leaky late BXLF2 gH glycoprotein (Fig. 6B). Because translation initiation occurs from the natural BXLF1 start codon, it is likely that this fusion protein is produced. To test this possibility more directly, we assessed the presence of this spliced isoform, as well as the previously uncharacterized BLLF1 spliced isoform, in polysome fractions. This experiment showed abundant representation of both the BXLF1-BXLF2 transcript and the alternative BLLF1 spliced isoform in the polysome fractions (Fig. 6C), indicating that the resulting proteins are likely produced during reactivation. To further investigate expression of the protein products of these spliced isoforms, we reanalyzed mass-spec data from Ersing et al.72 from reactivated Akata cells to determine whether peptides spanning the splice junction could be identified. Unfortunately, for the BXLF1-BXLF2 fusion, the BXLF1 portion ends in a lysine residue with cleavage expected immediately downstream, making detection of this peptide impossible. For the BLLF1 isoform, however, the LPINVTTGEEQQV | VK junction-spanning peptide was detected, further validating expression of this isoform.

While the function of BXLF1 is unclear, expression of BXLF1 in EBV-negative cells results in localization to a few distinctive large perinuclear dots73,74. Although BXLF2 encodes gH, which is a key virion factor for EBV infection, expression of BXLF2 in EBV-negative cells results in a cytoplasmic localization as well as “discrete patches” at the nuclear rim75. In AlphaFold376 structure predictions, the structure of BXLF2 appears largely intact in the fusion context, although deletion of the amino-terminal region in the fusion context is sufficient to result in a low predicted interaction with its interaction partner, gL (full length BXLF2:gL ipTM = 0.78 vs truncated BXLF2:gL ipTM = 0.17). On the BXLF1 side, a portion of the central BXLF1 alpha-helix bundle is similarly predicted to be intact in the BXLF1-BXLF2 fusion context (Fig. 6B). With many of the core structural components of each molecule remaining in the fusion context, it is possible that the covalent linkage of these regions of BXLF1 and BXLF2 plays some role in linking the functional impacts of both proteins at the perinuclear and nuclear rim to support some as yet to be determined viral function.

Discussion

Through enhancements in methodology and data analysis, we have improved our understanding of the high level of EBV lytic transcriptome complexity, identifying more than 1400 unique transcript isoforms across two different reactivation models. In addition, we have classified nearly all of these transcript isoforms as early, leaky late, or late transcripts. We have formally resolved all but one of the major isoforms of lytic ORFs, and we identified dozens of additional significant isoforms that add diversity with respect to their temporal regulation and involvement of vPIC and OriLyt enhancer activity. We have assembled what we refer to as the “core” EBV transcriptome that includes major isoforms, significant additional isoforms, polyA read-through transcripts, abundant lncRNAs, as well as additional spliced isoforms that can be used to parse gene expression for what are the likely most impactful genes involved in the virus replication cascade.

Our initiative to resolve and classify the EBV lytic transcriptome also revealed additional regulatory and mechanistic concepts. First, despite the dogma that vPIC only regulates late gene expression, our studies demonstrate that it also contributes to the expression of dozens of lytic early promoters. All six viral factors that make up this complex, except the core TATT binding factor, BcRF1, are expressed as early genes, while the BcRF1 gene is leaky late. It is therefore reasonable that vPIC could activate early genes containing a TATT (or similar) motif. This may seem at odds with another study that showed a dependency for vPIC promoter activation on viral DNA replication48. However, this study used a late gene promoter reporter context whose activity may be constrained to nascently replicated DNA not through vPIC function but instead, through other classic herpesviral replication-dependent late gene regulatory mechanisms. Our findings, therefore, demonstrate that vPIC does not require dechromatinized viral genomes and that vPIC function is instead dependent on other promoter regulatory features.

A related finding is that some leaky late (and some early) promoters with TATT or TATA motifs display a biphasic dependency on both vPIC and viral DNA replication, with evidence of dual transcription initiation mechanisms/complexes that lead to differential start site positioning. In these cases, the vPIC function may be linked to viral DNA replication since the precise transcription start site bias coincides with both vPIC and DNA replication dependency. This suggests a hypothesis whereby these promoters have evolved with dual sets of cis elements, one set that may, in part, recruit TBP and lead to early transcription and a second set that requires recently synthesized nucleosome-free viral genomes, along with vPIC, to initiate transcription during the late phase of viral replication. Further supporting this idea are previous studies showing that both cellular and viral PICs can bind and activate some human cytomegalovirus promoters77,78. These studies also showed that the footprints of TBP PICs and viral PICs are very different, supporting mechanistic distinctions that could result in slightly different start sites.

In addition to the above-described biphasic promoters, we also find a simpler configuration whereby transcription of ORFs is driven by multiple separate major promoters differing in regulatory properties. The acquisition of biphasic properties for single promoters and the acquisition of multiple promoters for individual ORFs likely evolved in response to an advantage for expression across phases. This could be due to a benefit to ensuring high-level expression, a need for function across phases, and/or the acquisition of additional functions pertaining to other phases of the lytic replication cycle. While expression of some lytic ORFs may indeed be restricted to one or the other phase of the lytic cascade, as proposed in Casco et al.60, these dually regulated subgroups of viral factors likely evolved this extra level of complexity to accommodate additional requirements for these factors across the different phases.

The disruption of appropriate transcription termination mechanisms during lytic replication seems to be developing as a common theme across herpesviruses53. While originally described in the context of termination defects on cell transcription, we show here that it is also pervasive across the EBV genome. Transcription termination defects on cell chromatin may play some unknown role in remodeling cell chromatin as it is grossly reorganized for displacement to the nuclear periphery, where it eventually becomes inactive61. Pertaining to the EBV transcriptome, there are multiple overlapping gene contexts that absolutely require polyA read-through (e.g., Fig. 4), and it is likely that these overlapping configurations became permissible due to the acquisition of the polyA read-through activity of herpesviruses. A second functional implication of disrupted transcription termination on the virus is the likely role that it plays in regulating and/or fine-tuning viral gene expression through transcriptional interference and downregulation of read-through promoters. What evolutionary force drove the initial acquisition of read-through transcription is unknown, but based on previous studies and our studies here, there are likely at least three functional impacts of this activity: remodeling cell chromatin, facilitating the production of full-length ORF-containing transcripts for overlapping viral genes, and regulating the EBV lytic transcription program through transcriptional interference. Pertaining to this latter point, we acknowledge that some of the very long DoG transcripts shown in Fig. 5 are represented by only a single read. However, it should be noted that there is substantial technical bias against long reads in most long-read sequencing approaches, and the representation of these long reads is typically very low. Nevertheless, given the findings of extensive cellular polyA read-through and the extensive polyA read-through that is required for the production of BGLF5, BALF3, BALF5, and BaRF1 (e.g., Fig. 4), it is likely that DoG transcripts arising from the EBV genome are more abundant than our current detection levels suggest.

Previous studies have shown that OriLyt interacts with the BZLF1 locus to regulate BZLF1 expression through DNA looping49. In contrast, we do not see a dependence of BZLF1 expression on OriLyt activity. However, the Gewurz lab showed49 that OriLyt has both a negative and a positive impact on BZLF1 expression, depending on whether the virus is under latency or reactivation settings, and perhaps this complex relationship obscured our ability to detect a role for OriLyt in BZLF1 expression in our analyses. On the other hand, we identified a large number of early promoters that do show a dependence on OriLyt for full expression. It is likely, therefore, that OriLyt plays a significant enhancer role across the entire viral genome, possibly also regulating the expression of leaky late and late genes through this mechanism.

Methods

Cell culture

EBV+ Burkitt’s Lymphoma lines, Mutu (gift from Samuel H Speck) and Akata (gift from Kenzo Takada) were cultured in RPMI (Fisher Scientific, cat no. SH30027) supplemented with 10% FBS (ThermoFisher Scientific, cat no. 10437) at 37 °C with 5% CO2. Akata cells harboring the pCEP4-BMRF1p-GFP reporter plasmid were cultured in the presence of 250 μg/ml hygromycin to retain the episomal reporter. Both cell lines were authenticated through a cell line authentication model embedded within our RNA-seq analysis pipeline.

Nucleofection

Three million Mutu cells were co-transfected with CMVp-GFP (0.3 μg) plus either a control (SV40p-cntl) or Zta (SV40p-Zta) expression vector (2.7 μg) using an Amaxa Nucleofector II machine (Lonza). Plasmids were added to 100 μl of Amaxa Cell Line Nucleofector Kit R (Lonza, cat no. VCA-1001) and loaded into a cuvette and cells were electroporated using the G016 setting. Cells were then immediately transferred to flasks with growth media and placed in the incubator.

BCR crosslinking

Affinipure goat α-Human IgG (Jackson ImmunoResearch, cat no. 109-005-003) was added to the cell culture media of Akata cells harboring the pCEP4-BMRF1p-GFP reporter plasmid to a final concentration of 10 μg/ml. Cells were incubated for 24 h prior to harvesting for FACS isolation.

Ribosomal profiling

A total of 40 million Mutu cells were transfected with the SV40p-Zta plasmid, and one day after transfection, cells were treated with 100 µg/ml cycloheximide for 10 minutes at 37 °C. Cells were then washed once with 1x PBS containing 100 µg/ml cycloheximide. The cell pellet was resuspended in 500 µl of polysome extraction buffer (100 mM KCl, 5 mM MgCl2, 20 mM HEPES (pH 7.4), 0.5% NP-40, 2 mM DTT, 100 µg/ml cycloheximide, 40 U/ml RNase inhibitor (Thermo Fisher Scientific, catalog No. AM2682), and complete protease inhibitor (Roche, catalog No. 4693159001)) and incubated on ice for 10 min. Cells were spun at 16,000 × g for 10 min to precipitate nuclei and debris. The supernatant was transferred to a fresh tube and quantified for RNA. Extract containing a total of 500 µg RNA was then layered onto a 17–50% sucrose gradient, and the gradient was spun for 2.5 h at 210,000 × g at 10 °C. 0.5 ml fractions were collected, and RNA was extracted from each fraction using the TRIzol Reagent (Invitrogen, cat no. 15596026) and processed according to the manufacturer’s protocol.

Equal volumes of each RNA were used to make cDNA. The iScript Reverse Transcription Supermix for RT-qPCR Kit (catalog No. 1708841) from Bio-Rad, USA, was used for cDNA synthesis. The PCR reaction was performed using GoTaq® Green Master Mix (catalog No. M7123) from Promega, USA, and equal volumes of cDNA were used for PCR reactions. PCR primers for BLLF1-v2 detection were “BLLF1_F1: CTACCAACCTCGAGGTGCAT” and “BLLF1_R1: ACACAGGTGACACCAGCCTCTAT”, and PCR primers used for detecting the BXLF1-BXLF2 fusion transcript were “BXLF1_BXLF2_F1: GGGGGACACCATAGTCTGGA” and “BXLF1_BXLF2_R1: GGAGAACGTAGCGAGCGTAG”. All primers used in this study were ordered from Integrated DNA Technologies, Inc. (Coralville, Iowa). All variants obtained from ribosome profiling of BXLF1-BXLF2 and BLLF1 transcript isoforms were cloned using the TOPO® TA Cloning® Kit for sequencing (Invitrogen, Cat. no. 450030). The verified inserts were subsequently subjected to Sanger sequencing to identify the alternative splicing variant.

Fluorescence-activated cell sorting

Mutu and Akata-BMRF1p-GFP cells were collected and centrifuged at 1000 × g for 5 min. Cells were resuspended in 500 μl of PBS with 1 mM EDTA and passed through a 35 μm mesh filter (Genesee Scientific, cat no. 28-154). Approximately 500,000 GFP+ cells were isolated from each sample using a BD FACSAria III (BD Biosciences) machine. Cells were pelleted at 1000 × g for 5 min and flash frozen for later RNA isolation.

RNA extraction

Cells were resuspended in 1 ml of TRIzol Reagent (Invitrogen, cat no. 15596026) and processed according to the manufacturer’s protocol. RNA pellets were resuspended in 42 μl of ddH2O. 5 μl of 10× DNase buffer and 3 μl of DNase I (New England Biolabs, cat no. M0303L) were added, and tubes were incubated at 37 °C for 15 min. DNase I was removed using the Monarch RNA Cleanup Kit (New England Biolabs, cat no. T2030L). 1 μg of RNA was reverse transcribed using LunaScript (New England Biolabs, cat no. M3010), according to the manufacturer’s protocol. RNA quality was assessed using the Agilent 2100 Bioanalyzer (Agilent, Serial No. DE54107860) with the Agilent RNA 6000 Nano Kit (Agilent, Cat. No. 5067-1511) to ensure RNA integrity numbers of greater than 9.5.

Oxford Nanopore long-read sequencing

ONT (Oxford Nanopore Technologies) sequencing was performed using the PromethION platform (Oxford Nanopore Technologies plc.) at the Biotechnology Center, University of Wisconsin-Madison. The sequencing run was conducted on a FLO-PRO002 flow cell with the SQK-PCB109 barcoding kit according to the manufacturer’s instructions. The experiment was run for a total duration of 72 h. Base calling was managed using both MinKNOW and Guppy. Real-time base calling and read filtering were performed with MinKNOW version 21.11.7 and MinKNOW Core 4.5.4, applying a minimum Q-score threshold of 9 to retain high-quality reads. Mux scans were conducted every 1 h and 30 min with active channel selection enabled throughout the run. For post-run analysis, high-accuracy base calling was performed using Guppy version 5.1.13. This process generated FAST5 files containing 4000 reads per file, utilizing vbz_compression, and compressed FASTQ files with 4000 reads per file for downstream analysis. The sequencing produced an estimated N50 read length of 1 kb. Post-sequencing analysis included read length histogram assessments and quality checks across the generated data.
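For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows the standard definition: the read length at which reads of that length or longer contain at least half of all sequenced bases. The function name is illustrative, and this is not the tool used to produce the reported estimate.

```python
def n50(read_lengths):
    """Return the N50 of a collection of read lengths (standard definition)."""
    lengths = sorted(read_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one read length is required")
    half_of_bases = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        # The first length at which the cumulative sum reaches half is the N50.
        if running_total >= half_of_bases:
            return length
```

An N50 of about 1 kb therefore means that roughly half of the sequenced bases came from reads of about 1 kb or longer.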

Read alignment and processing

All fastq files obtained from barcoded samples corresponding to a single replicate for each condition were concatenated to produce one consolidated fastq.gz file. To process these reads, Pychopper was employed with the following basic command: pychopper -r report.pdf -u unclassified.fq -w rescued.fq input.fq full_length_output.fq. This tool was used to identify, orient, and trim full-length Nanopore cDNA reads, as per the guidelines provided in the Pychopper repository (https://github.com/epi2me-labs/pychopper). Subsequently, all full_length_output.fq files generated by Pychopper were aligned to the human genome plus the EBV genome using minimap279 with the following parameters: minimap2 -ax splice -t 8 hg38_plus_EBV.fa full_length_output.fq > Aligned_fullLength.sam. SAMtools (version 1.19.2) was then utilized to convert the resulting SAM files into BAM files and to compute alignment statistics. After verifying the quality of the alignment results, SAMtools was again used to merge all BAM files from replicates into a single BAM file for each condition80. BEDtools (v2.30.0)81 was employed to convert the BAM files into BED12 format, followed by sorting, compressing, and indexing of the resulting files. To investigate the potential generation of multiple EBV copies during the reactivation experiment, this entire mapping process was repeated using a reference containing five copies of the EBV genome instead of a single copy.
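The processing steps above can be strung together as in the following sketch, which only assembles the command lines and does not run them. The pychopper and minimap2 invocations follow the commands quoted in the text; the per-sample file names, the use of minimap2's -o option in place of shell redirection, and the choice of samtools sort, index and flagstat for BAM conversion and alignment statistics are assumptions made for illustration.

```python
import shlex


def build_commands(sample, reference="hg38_plus_EBV.fa", threads=8):
    """Assemble (but do not execute) the per-replicate command lines."""
    full_length = f"{sample}.full_length_output.fq"  # hypothetical naming scheme
    sam = f"{sample}.Aligned_fullLength.sam"
    bam = f"{sample}.sorted.bam"
    return [
        # Identify, orient and trim full-length cDNA reads.
        ["pychopper", "-r", f"{sample}.report.pdf", "-u", f"{sample}.unclassified.fq",
         "-w", f"{sample}.rescued.fq", f"{sample}.fq", full_length],
        # Spliced alignment to the combined human plus EBV reference.
        ["minimap2", "-ax", "splice", "-t", str(threads), "-o", sam, reference, full_length],
        # SAM to coordinate-sorted BAM, then index and summary statistics (assumed steps).
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        ["samtools", "flagstat", bam],
        # BED12 conversion; the output goes to stdout and must be captured by the caller.
        ["bedtools", "bamtobed", "-bed12", "-i", bam],
    ]


def merge_command(condition, replicate_bams):
    """Command merging replicate BAM files into one BAM file per condition."""
    return ["samtools", "merge", f"{condition}.merged.bam", *replicate_bams]
```

Each list can be passed to subprocess.run, or printed with shlex.join for inspection before execution.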

Validation and classification pipeline

CAGE-seq wig coverage files with positional coverage of only the first base of each read generated from STAR (v2.6.1)82 alignments were generated previously43. Transcription start site clusters were generated using the custom virus_TSS_peaks.pl program (https://github.com/flemingtonlab/LongReadAnnotation), inputting all replicates for Mutu-Zta inductions and Akata-BCR inductions for each strand separately using a minimum single position value of 1000, fraction peak of 0.2 and max width of 8. Each peak was classified as either “independent”, “partially dependent”, or “dependent” on either BALF2 (viral DNA replication), BDLF4 (vPIC), or OriLyt if wild type (or reconstituted) to mutant CAGE read count ratios within each peak were greater than or equal to 0.8 (independent), between 0.8 and 0.0286 (partially dependent), or less than or equal to 0.0286 (dependent), and color coded for display on a genome browser using the codes 0,128,255 (light blue, independent), 0,0,255 (medium blue, partially dependent), 0,0,180 (dark blue, dependent), or 160,180,220 (gray, indeterminate). For identifying 3’ end clusters, the number of 3’ ends of all EBV “full length”, merged, ONT reads mapping to each position were counted and output to wig format for the positive and for the negative strands. These data were then used as input for peak calling using virus_TSS_peaks.pl with a minimum single position value of 400, fraction max peak of 0.2 and max width of 8.
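The dependency classification and color coding can be sketched as follows, using the thresholds and RGB codes given above. The sketch assumes that the ratio is computed as mutant counts divided by wild-type (or reconstituted) counts, so that small values indicate loss of expression in the mutant; if the ratio is defined the other way around, the comparisons would need to be inverted. Treating a peak with no control reads as indeterminate is also an assumption, and the function names are illustrative rather than taken from virus_TSS_peaks.pl.

```python
INDEPENDENT_MIN = 0.8    # ratio at or above this value: independent
DEPENDENT_MAX = 0.0286   # ratio at or below this value: dependent

# Genome browser RGB codes as listed in the text.
COLORS = {
    "independent": "0,128,255",
    "partially dependent": "0,0,255",
    "dependent": "0,0,180",
    "indeterminate": "160,180,220",
}


def classify_peak(mutant_reads, control_reads):
    """Classify one CAGE peak from mutant and control read counts."""
    if control_reads <= 0:
        return "indeterminate"  # assumption: no control signal to compare against
    ratio = mutant_reads / control_reads
    if ratio >= INDEPENDENT_MIN:
        return "independent"
    if ratio <= DEPENDENT_MAX:
        return "dependent"
    return "partially dependent"


def peak_color(mutant_reads, control_reads):
    """Return the RGB string used to display the peak."""
    return COLORS[classify_peak(mutant_reads, control_reads)]
```

The same function would be applied separately for each perturbation (BALF2, BDLF4, or OriLyt).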

TSS clusters and 3’ end clusters determined as outlined above were then used to validate Akata-BCR and Mutu-Zta “full length” ONT reads, together with a single replicate of each STAR output SJ file, using our validate_ONT.pl program (https://github.com/flemingtonlab/LongReadAnnotation) with a min CAGE depth of 1000, max distance from ONT start to CAGE peak of 2, minimum 3 prime peak depth of 400, max distance from ONT end to 3’ peak of 10, and minimum splicing junction reads of 1. The validate_ONT.pl program validates 5’ and 3’ ends and splice junctions, color codes the transcripts according to dependency on BALF2, BDLF4, or OriLyt, and annotates reading frames, giving highest priority to already known translation initiation codons and, if none is present, to open reading frames of greater than 50 amino acids in which the start codon lies within either a CAUG or AUGG sequence.
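The read validation logic can be illustrated with a simplified sketch. It assumes that the peak lists contain single representative positions for one strand that have already passed the depth filters (1000 for CAGE peaks, 400 for 3’ end peaks), and that splice junctions are keyed by (donor, acceptor) coordinates. These representations and all names are hypothetical and do not reproduce the internals of validate_ONT.pl.

```python
from bisect import bisect_left

MAX_START_DIST = 2   # max distance from ONT start to CAGE peak
MAX_END_DIST = 10    # max distance from ONT end to 3' end peak
MIN_SJ_READS = 1     # minimum short read support per splice junction


def near_peak(position, sorted_peaks, max_dist):
    """True if any peak in the sorted list lies within max_dist of position."""
    i = bisect_left(sorted_peaks, position)
    neighbours = sorted_peaks[max(i - 1, 0): i + 1]
    return any(abs(position - peak) <= max_dist for peak in neighbours)


def is_validated(start, end, junctions, tss_peaks, end_peaks, sj_reads):
    """Validate the 5' end, 3' end and every splice junction of one ONT read."""
    return (
        near_peak(start, tss_peaks, MAX_START_DIST)
        and near_peak(end, end_peaks, MAX_END_DIST)
        and all(sj_reads.get(junction, 0) >= MIN_SJ_READS for junction in junctions)
    )
```

An unspliced read has an empty junction list and is validated on its two ends alone.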

Cross-validation

To ensure that the transcripts generated by our pipeline were not influenced by potential internal priming events, we first performed cross-validation using the FLAIR software45. This process began with remapping reads using Minimap279 with specific flags -ax splice -Y -C5 --MD. Following this, we applied the FLAIR correction step using the command flair correct -q query.bed12 [-f annotation.gtf] | [-j introns.tab] -g genome.fa to improve the accuracy of transcript annotations. However, as noted by Dana Wyman et al. from the TALON software development team, FLAIR has relatively low sensitivity in detecting known genes and transcripts compared to TALON46. To enhance validation, we conducted an additional cross-validation step using TALON, specifically employing the talon_label_reads submodule, which labels reads with potential internal priming events. For TALON, we utilized input files generated by LoRTIA47, which was executed with the command LoRTIA -5 TGCCATTAGGCCGGG --five_score 16 --check_in_soft 15 -3 AAAAAAAAAAAAAAA --three_score 16 -s poisson -f True /path/to/alignments.sam /path/to/output_folder /path/to/reference.fasta. Finally, we manually reviewed transcripts that were not detected by the aforementioned methods to verify their authenticity and confirm their full-length representation.
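The idea behind labeling reads for potential internal priming can be sketched as follows: if the genomic sequence immediately downstream of a read's 3’ end is A-rich, the oligo(dT) primer may have annealed to an internal A stretch rather than to a true polyA tail. The window size of 20 bases, the 0.5 cutoff, and the coordinate conventions below are illustrative assumptions and are not taken from the TALON source code.

```python
def fraction_a_downstream(genome_seq, read_end, strand, window=20):
    """Fraction of A bases (in transcript orientation) just past the read's 3' end.

    For "+" reads, read_end is the 0-based index of the first genomic base after
    the alignment. For "-" reads, read_end is the 0-based leftmost aligned
    position, so the downstream window lies to its left and As appear as Ts
    on the reference strand.
    """
    if strand == "+":
        segment = genome_seq[read_end: read_end + window]
        base = "A"
    else:
        segment = genome_seq[max(read_end - window, 0): read_end]
        base = "T"
    if not segment:
        return 0.0
    return segment.upper().count(base) / len(segment)


def possible_internal_priming(genome_seq, read_end, strand, max_fraction=0.5):
    """Flag a read whose downstream window exceeds the (hypothetical) cutoff."""
    return fraction_a_downstream(genome_seq, read_end, strand) > max_fraction
```

Reads flagged in this way are candidates for closer inspection rather than automatic removal, in line with the manual review described above.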

Micro-C

The Micro-C assay was performed using the Dovetail® Micro-C Kit83 with some minor modifications. Two million cells from purified reactivated Akata cells were crosslinked using DSG and formaldehyde. Chromatin was then digested with the MNase Enzyme Mix, and the digestion was stopped with EGTA. The cells were then lysed with SDS, the lysate was quantified, and a portion was tested for quality. The purified DNA was quantified using a Qubit Fluorometer, with a successful digestion profile indicating 40-70% mononucleosomes. In subsequent steps, chromatin was bound to capture beads, DNA ends were polished, and ligation was performed. After crosslink reversal, the DNA was purified and quantified again with a Qubit Fluorometer. This was followed by end repair, adaptor ligation, and a digest, after which purification was done using SPRIselect beads. Streptavidin beads were prepared for ligation capture, and index PCR was performed to amplify the DNA. Size selection was carried out, and the final library was quantified and verified. Finally, the libraries were sequenced using Illumina NovaSeq 6000 sequencers with paired-end reads of 100 bp. The process began with a shallow run for quality control (QC), followed by ultra-deep sequencing if the QC results were satisfactory.

For bioinformatic analysis, the HiC-Pro84 pipeline was executed with default parameters, using a bin size of 1000 bp. All interaction loops generated from HiC-Pro were normalized by using the ICE method. Interactions with a normalized interaction score of greater than 0.00079 and a minimum interaction distance of 5 kb were extracted. For Fig. 2B, interactions with either OriLyt (plus and minus 1 kb) are displayed.
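The interaction filtering described above can be sketched as follows, using the stated thresholds (normalized score greater than 0.00079, minimum distance of 5 kb, OriLyt plus and minus 1 kb). The interaction record, the measurement of distance between bin start coordinates, and the OriLyt coordinates passed in by the caller are assumptions made for illustration; no real EBV coordinates are implied.

```python
from typing import NamedTuple

MIN_SCORE = 0.00079    # ICE-normalized interaction score threshold
MIN_DISTANCE = 5_000   # minimum interaction distance in bp
FLANK = 1_000          # OriLyt plus and minus 1 kb


class Interaction(NamedTuple):
    bin1_start: int
    bin2_start: int
    score: float


def bin_overlaps(bin_start, bin_size, region, flank=FLANK):
    """True if the bin overlaps the region extended by the flank on both sides."""
    low, high = region[0] - flank, region[1] + flank
    return bin_start < high and bin_start + bin_size > low


def orilyt_interactions(interactions, orilyt_regions, bin_size=1000):
    """Keep strong, sufficiently distant interactions with an anchor at an OriLyt."""
    kept = []
    for item in interactions:
        if item.score <= MIN_SCORE:
            continue
        if abs(item.bin2_start - item.bin1_start) < MIN_DISTANCE:
            continue
        if any(
            bin_overlaps(item.bin1_start, bin_size, region)
            or bin_overlaps(item.bin2_start, bin_size, region)
            for region in orilyt_regions
        ):
            kept.append(item)
    return kept
```

Passing both OriLyt regions in orilyt_regions returns interactions that involve either of them.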

3D Modeling, visualization, and manipulation

The prediction and modeling of the 3D structures of EBV proteins were conducted using the AlphaFold 3 software76, utilizing default parameters as specified in this study. The resulting 3D structures predicted by AlphaFold 3 were visualized and manipulated using PyMOL(TM) version 3.0.385.

Statistical testing for the enrichment of OriLyt-dependent promoters

To test whether OriLyt-dependent promoters are enriched for Hi-C contacts whose opposite anchors overlap the OriLyt region, we analyzed Hi-C interaction data from EBV-infected Akata cells. Hi-C interactions were obtained as a BEDPE file and filtered to retain those where one anchor overlapped the OriLyt region. Non-OriLyt interaction anchors were also extracted for downstream analysis. Promoter coordinates and their dependency on OriLyt were extracted for further analysis, and promoters were categorized as OriLyt-dependent or independent. To test for enrichment, a 2 × 2 contingency table was constructed with four groups: interacting OriLyt-dependent promoters, non-interacting OriLyt-dependent promoters, interacting OriLyt-independent promoters, and non-interacting OriLyt-independent promoters. Fisher’s exact test was used to assess the statistical significance of enrichment, with an odds ratio greater than 1 and a p-value below 0.05 indicating a significant association.
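The enrichment test on the 2 × 2 contingency table can be illustrated with a dependency-free sketch of Fisher’s exact test. The one-sided ("greater") alternative used here is an assumption, since the text does not state which alternative was tested, and the function name is illustrative.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment on a 2 x 2 table.

    a: OriLyt-dependent promoters with an OriLyt interaction
    b: OriLyt-dependent promoters without an OriLyt interaction
    c: OriLyt-independent promoters with an OriLyt interaction
    d: OriLyt-independent promoters without an OriLyt interaction
    Returns (sample odds ratio, p-value).
    """
    row1 = a + b           # all OriLyt-dependent promoters
    col1 = a + c           # all interacting promoters
    total = a + b + c + d
    denominator = comb(total, col1)
    # Hypergeometric upper tail: tables with at least a interacting dependent promoters.
    p_value = sum(
        comb(row1, k) * comb(total - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    ) / denominator
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value
```

With these conventions, an odds ratio above 1 together with a p-value below 0.05 would indicate that OriLyt-dependent promoters are enriched among promoters contacting OriLyt.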

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.