Introduction

Sequencing-based studies have greatly advanced our understanding of extrachromosomal circular DNA (eccDNA), particularly its roles in oncogene amplification1,2,3,4, gene expression regulation5, genome rearrangements6,7, and intratumor heterogeneity4. Diverse analysis pipelines and experimental methods have been developed to detect eccDNA (Table 1). Deshpande et al. introduced the AmpliconArchitect (AA) algorithm to predict amplicon structures and eccDNA from short-read (SR) whole-genome sequencing (WGS) (WGS-SR) data8. CReSIL utilizes coverage depths and breakpoint reads to identify eccDNA from long-read (LR) WGS (WGS-LR) data9. Kumar et al. developed Circle_finder to identify eccDNA from short-read ATAC-Seq (ATAC-Seq-SR) data by analyzing split reads for eccDNA coordinates10. However, the performance of these analysis pipelines might be limited by the data generated from the corresponding experimental methods. For example, WGS and ATAC-Seq may have low eccDNA detection efficiency because the vast majority of the sequencing reads are generated from linear DNA, and WGS-SR can only detect copy number-amplified eccDNA (ecDNA)4,6,11.

Table 1 Summary of eccDNA analysis pipelines and supported experimental methods

To enhance eccDNA detection, researchers have developed methods such as Circle-Seq7,12,13 and 3SEP14,15 for eccDNA enrichment from crude DNA. Circle-Seq utilizes rolling circle amplification (RCA) for circular DNA amplification, whereas 3SEP employs Solution A for selective circular DNA recovery. Post-enrichment, eccDNA undergoes library construction for sequencing on platforms like Illumina (Circle-Seq-SR/3SEP-SR) or Oxford Nanopore Technology (ONT) (Circle-Seq-LR/3SEP-LR). Concurrently, various analysis pipelines have been developed to process eccDNA sequencing data. Circle-Map16, ECCsplorer17, Circle_finder10, and ecc_finder (map-sr)18 are tailored for short-read data analysis. For long-read data, pipelines such as CReSIL9, NanoCircle7, eccDNA_RCA_nanopore14 and ecc_finder (map-ont) are used. Additionally, ecc_finder offers de novo assembly options: SPAdes in the asm-sr mode and TideHunter in the asm-ont mode, as distinct algorithms to identify eccDNA from SR and LR sequencing profiles, respectively. These eccDNA enrichment methods and tailored pipelines facilitate eccDNA identification without reliance on copy number information6.

Choosing the most suitable analysis pipeline and experimental method for eccDNA research is a complex task. Existing evaluations of these pipelines often have limited scope, focusing on single aspects like accuracy9 or computational needs18, and rely on oversimplified simulations that fall short of representing the intricacies of actual sequencing data. Additionally, detection efficiency for specific eccDNA types varies significantly between enriched (such as Circle-Seq and 3SEP) and non-enriched experimental methods (such as WGS-SR, WGS-LR, and ATAC-Seq-SR). For example, the rolling circle amplification (RCA) step is known to preferentially amplify circular DNA under 10 kb19, while the bias of Solution A enrichment remains unclear.

In this work, we conducted an in-depth evaluation of 7 analysis pipelines. The comparison assessed accuracy (F1-score), identity (base pair difference between identified eccDNA and simulated eccDNA), duplication rate, and computational resource cost using 7 simulated datasets designed to mirror real eccDNA characteristics. These datasets replicated the length distribution, chimeric eccDNA composition and chromosomal origins as previously identified7,9,13,20,21,22. Additionally, we compared the detection efficiencies of 7 experimental methods for different eccDNA types on 21 real sequencing datasets. Our comparative analysis highlights the most effective pipelines for analyzing short-read and long-read data from eccDNA-enriched methods and underscores the variation in eccDNA detection efficiency across different experimental approaches. Our findings are intended to guide researchers in choosing the most suitable methodologies for their eccDNA studies and to foster the development of novel approaches for efficient eccDNA detection.

Results

Study design

To evaluate the performance of analysis pipelines in eccDNA identification, we developed a Python script to generate simulated eccDNA datasets. This script extrapolated length distribution, chromosomal origins, and chimeric eccDNA proportions from existing data to create a mix of simulated circular DNA (true positives) and linear DNA (true negatives). It also simulated the rolling circle amplification (RCA) process and subsequent sequencing on short-read (Illumina) and long-read (ONT) platforms (Fig. 1a). Seven simulated datasets were produced, mirroring eccDNA identified in human sperm cells7, EJM cell line9, JJN3 cell line9, Kelly cell line20, medulloblastoma21, muscle cells13 and OVCAR8 cell line22 (Supplementary Fig. 1 and Supplementary Fig. 2), each comprising 10,000 circular and 10,000 linear DNA sequences at a depth of 50X.

Fig. 1: Assessment of analysis pipelines in eccDNA identification.

a Schematic overview of the benchmarking workflow used to compare the performance of bioinformatic pipelines. The cell line, healthy tissue and tumor illustrations were created in BioRender. Gao, X. (2024) BioRender.com/h74t202. ‘Std’ represents standard deviation. b Performance comparison of analysis pipelines at a simulated sequencing depth of 50X (bms, bwa-mem-samblaster; mIs, microDNA.InOne.sh). Data are presented as mean values +/- SEM. c Impact of simulated sequencing depth on eccDNA identification accuracy. Data are presented as mean values +/- SEM. d Impact of simulated sequencing depth on eccDNA identification duplication rates. Centre line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range. e Impact of chimeric DNA proportion on eccDNA identification recall. Data are presented as mean values +/- SEM. The ‘n’ in the figure represents the number of datasets successfully analyzed by the corresponding analysis pipeline and is used as the sample size to evaluate the performance of that pipeline. For panels (b, c and d) n = 7, except for ecc_finder (asm-sr) with n = 6 and ECCsplorer with n = 6 when depth ≤ 10X, n = 5 for depths between 15X and 25X, n = 4 for depths between 30X and 40X, and n = 3 for depths higher than 40X. For panel (e) n = 4, because only 4 datasets contain chimeric eccDNA. Source data are provided as a Source Data file.

We evaluated 11 modes of 7 pipelines, including Circle-Map, Circle_finder (bwa-mem-samblaster and microDNA.InOne.sh), ECCsplorer, and ecc_finder (map-sr/asm-sr) for short-read data analysis, and CReSIL, eccDNA_RCA_nanopore, NanoCircle, and ecc_finder (map-ont/asm-ont) for long-read data analysis. A true positive identification was defined as having over 90% sequence identity and less than 250 base pair (bp) difference with the simulated eccDNA. Performance metrics included the F1-score and the base pair difference between the identified eccDNA and the simulated eccDNA (see “Methods”). Additionally, we down-sampled the datasets to test pipeline robustness at low sequencing depths and generated datasets with varying chimeric DNA proportions (0–50%) to assess the impact of chimeric DNA on eccDNA identification. We also introduced a duplication rate metric to address the issue of multiple detections of the same eccDNA sequence (see “Methods”) and analyzed the computational resource consumption of each pipeline.
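The true-positive rule and the F1-score can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the `Match` record, its field names, and the reading of "base pair difference" as an absolute length difference are assumptions made for illustration, and the F1-score is the standard harmonic mean of precision and recall.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Best alignment of one identified eccDNA to a simulated eccDNA (hypothetical record)."""
    identity: float     # fraction of identical bases, between 0 and 1
    bp_difference: int  # assumed: absolute length difference in bp


def is_true_positive(match, min_identity=0.90, max_bp_diff=250):
    """Apply the thresholds stated in the text: >90% identity and <250 bp difference."""
    return match.identity > min_identity and match.bp_difference < max_bp_diff


def f1_score(n_true_positive, n_identified, n_simulated):
    """Harmonic mean of precision (TP / identified) and recall (TP / simulated)."""
    if n_identified == 0 or n_simulated == 0:
        return 0.0
    precision = n_true_positive / n_identified
    recall = n_true_positive / n_simulated
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: four calls evaluated against four simulated eccDNA.
calls = [Match(0.99, 3), Match(0.95, 300), Match(0.85, 10), Match(0.92, 0)]
n_tp = sum(is_true_positive(m) for m in calls)
toy_f1 = f1_score(n_tp, n_identified=len(calls), n_simulated=4)
```

In this toy example two of the four calls pass both thresholds; how duplicated calls enter precision is left open here because the chunk does not specify it.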

For experimental method assessment, we selected Circle-Seq (SR/LR), 3SEP (SR/LR), WGS (SR/LR), and ATAC-Seq (SR) based on their non-targeted nature and sequencing compatibility with Illumina (SR) and ONT (LR) platforms. To minimize batch effects, eccDNA was extracted from a uniform pool of HeLa cells. Controls included a pUC-19 plasmid (2686 bp) and a mouse Egfr gene fragment (2651 bp), spiked into the cell lysate at a 1:1000 ratio to crude circular DNA. We then evaluated eccDNA detection efficiency of each method across various lengths and copy number statuses, quantifying detection efficiency as the number of eccDNA per gigabase (Gb) of sequencing data (see “Methods”).
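The detection efficiency metric is a simple normalization, sketched below in Python under the assumption that the sequencing output is available as a total base count; the function name and the example numbers are hypothetical.

```python
def detection_efficiency(n_eccdna, total_bases):
    """Number of eccDNA detected per gigabase (1e9 bases) of sequencing data."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return n_eccdna / (total_bases / 1e9)


# Hypothetical replicate: 500 eccDNA detected from 2.5 Gb of data.
example_efficiency = detection_efficiency(500, 2_500_000_000)
```

Normalizing by sequencing yield makes methods comparable even when their libraries were sequenced to different depths.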

Assessment of analysis pipelines in eccDNA identification

In our evaluation of each analysis pipeline at a simulated sequencing depth of 50X, Circle_finder (bwa-mem-samblaster) and Circle-Map outperformed the others for short-read data analysis, achieving F1-scores of 0.912 and 0.908, respectively. However, Circle-Map had a lower base pair difference between the identified eccDNA and simulated eccDNA (1.354 bp difference) than Circle_finder (bwa-mem-samblaster) (4.344 bp difference). Circle_finder (microDNA.InOne.sh) performed better than Circle_finder (bwa-mem-samblaster) in terms of the base pair difference (1.383 bp difference), but its F1-score was lower (0.825) (Fig. 1b). In the long-read data category, CReSIL led with an F1-score of 0.918 and a base pair difference of 4.160 bp, outperforming eccDNA_RCA_nanopore (F1-score: 0.859, 3.592 bp difference) and NanoCircle (F1-score: 0.905, 4.214 bp difference) on F1-score (Fig. 1b). Furthermore, ecc_finder (asm-ont) had the lowest F1-score (0.179) and the highest base pair difference (66.158 bp) among all pipelines for long-read data analysis. Meanwhile, ECCsplorer could identify eccDNA from datasets 2, 3 and 7 but failed on the other datasets at a sequencing depth of 50X (Supplementary Data 1).

Impact of sequencing depth on eccDNA identification

Previous research indicates that low eccDNA coverage adversely affects the performance of analysis pipelines in eccDNA identification9. To explore this, we down-sampled our simulated datasets to various sequencing depths and assessed the performance of each pipeline in eccDNA identification. For short-read data analysis, Circle_finder (bwa-mem-samblaster), followed by Circle-Map, consistently achieved the highest F1-scores across all investigated sequencing depths (Fig. 1c). Although ECCsplorer failed to analyze simulated dataset 5, it had the lowest base pair difference (Supplementary Fig. 3a). Circle-Map and Circle_finder (microDNA.InOne.sh) maintained stable base pair differences when sequencing depth decreased from 50X to 5X, while the base pair difference of Circle_finder (bwa-mem-samblaster) decreased from 4.344 bp at 50X to 2.890 bp at 5X (Supplementary Fig. 3a). ecc_finder (asm-sr) showed the lowest F1-score across all the simulated sequencing depths (Fig. 1c). For long-read data, CReSIL led with the highest F1-scores at depths over 10X, while eccDNA_RCA_nanopore showed superior performance below a depth of 10X (Fig. 1c). eccDNA_RCA_nanopore maintained the lowest base pair differences across all the simulated sequencing depths (Supplementary Fig. 3b). The base pair difference of ecc_finder (map-ont) decreased from 9.015 bp at 50X to 5.976 bp at 5X, while ecc_finder (asm-ont) showed the lowest F1-score and highest base pair difference among all the pipelines in analyzing long-read data (Fig. 1c and Supplementary Fig. 3b).

We observed a pattern of redundancy in eccDNA identification by eccDNA_RCA_nanopore at all simulated depths, aligning with findings from another study9. Circle_finder (bwa-mem-samblaster) also demonstrated redundancy in its results. Upon calculating the duplication rates, it was evident that both Circle_finder (bwa-mem-samblaster) and eccDNA_RCA_nanopore could identify multiple similar copies from a single eccDNA sequence (Fig. 1d). These substantial duplication rates present considerable obstacles for the experimental validation of their predictions.
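The exact duplication rate formula is given in the Methods and is not reproduced here. As a purely illustrative sketch, the Python snippet below assumes one plausible definition: the fraction of matched calls that are redundant, i.e. that hit a simulated eccDNA already recovered by another call. The function name and input format are hypothetical.

```python
def duplication_rate(matched_truth_ids):
    """Assumed definition: 1 - (unique simulated eccDNA matched / total matched calls).

    matched_truth_ids holds one entry per matched call: the ID of the simulated
    eccDNA that the call was assigned to.
    """
    if not matched_truth_ids:
        return 0.0
    n_unique = len(set(matched_truth_ids))
    return 1 - n_unique / len(matched_truth_ids)


# Toy example: three calls hit the same simulated circle, one call hits another.
toy_rate = duplication_rate(["ecc_1", "ecc_1", "ecc_1", "ecc_2"])
```

Under this assumed definition a pipeline that reports one call per split read or per CTC read would show a high rate even when its recall is good.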

Impact of chimeric DNA proportion on eccDNA identification

In addition to sequencing depth, we investigated the influence of chimeric DNA on eccDNA identification performance. We created simulated datasets with varying proportions of chimeric DNA, from 0% to 50%, maintaining a fixed sequencing depth of 20X. For short-read data analysis, the change in chimeric DNA proportion did not affect the recall for simple eccDNA identification of Circle_finder (bwa-mem-samblaster), Circle-Map, and ecc_finder (map-sr). However, the recall for simple eccDNA identification of ECCsplorer decreased from 0.414 at 0% to 0.056 at 50% (Fig. 1e). ecc_finder (asm-sr) showed the lowest recall. The base pair differences between identified eccDNA and simulated eccDNA for these pipelines remained relatively stable, except for ecc_finder (asm-sr) (Supplementary Fig. 3c). Among long-read data analysis pipelines, most maintained consistent recall (with changes less than 0.1) for both simple eccDNA and chimeric eccDNA identification (Fig. 1e). The base pair differences of CReSIL, eccDNA_RCA_nanopore and ecc_finder (map-ont) showed only a slight increase compared to NanoCircle, whose base pair difference increased from 2.492 bp at 0% to 16.899 bp at 50% (Supplementary Fig. 3d). Unlike the other pipelines, ecc_finder (asm-ont) showed a decreased base pair difference as the chimeric eccDNA proportion increased (Supplementary Fig. 3d).

Computational resources consumed by different analysis pipelines

In our evaluation of computational resources consumed by each pipeline, we utilized a computer cluster equipped with two Intel Xeon Scale 6248 CPUs (2.5 GHz, 320 CPU cores), 384 GB of DDR4 memory, and 2 TB AEP memory. We observed that both the time and memory consumption of most pipelines increased as mean coverage rose (Supplementary Fig. 3e and f). For short-read data, Circle_finder (bwa-mem-samblaster) was the fastest pipeline and Circle_finder (microDNA.InOne.sh) used the least memory across all the investigated sequencing depths (Supplementary Fig. 3e). Among the long-read data analysis pipelines, ecc_finder used the shortest time (map-ont) and the least memory (asm-ont) (Supplementary Fig. 3f). For dataset 5, ecc_finder (asm-sr) and ECCsplorer experienced memory errors on our platform (Supplementary Data 1). ECCsplorer also encountered memory errors when analyzing dataset 1 (depth over 25X), dataset 4 (depth over 40X), and dataset 6 (depth over 10X) (Supplementary Data 1).

Based on the above analysis, we concluded that Circle_finder (bwa-mem-samblaster) and Circle-Map were the most appropriate pipelines for analyzing eccDNA-enriched short-read data, and that CReSIL outperformed the other pipelines for analyzing eccDNA-enriched long-read data, owing to their high detection accuracy and low base pair difference. In the following experimental methods benchmarking, we selected Circle-Map for analyzing the eccDNA-enriched short-read sequencing data because it produced fewer redundant results than Circle_finder (bwa-mem-samblaster). High redundancy could bias detection efficiency estimates in favor of the eccDNA-enriched experimental methods (Fig. 2a). In addition, we used AmpliconArchitect for analyzing WGS-SR data, CReSIL for analyzing WGS-LR data and Circle_finder (bwa-mem-samblaster) for analyzing ATAC-Seq-SR data.

Fig. 2: Impact of eccDNA enrichment operations on eccDNA identification.

a Schematic overview of the experimental methods comparison. b eccDNA detection efficiency comparison. Data are presented as mean values +/- SEM. c Circular DNA enrichment efficiency. Data are presented as mean values +/- SEM. LDD, Linear DNA Digestion; Solution A, using Solution A for circular DNA purification; RCA, Rolling Circle Amplification. d Detection efficiency for eccDNA with different length ranges. Data are presented as mean values +/- SEM. e Correlation between eccDNA density and coding gene density. Dots represent individual experiments and the shaded area represents the 95% confidence interval. For all experiments, n = 3. Statistical analyses were performed using one-way ANOVA with Tukey correction for panels (b, c and d), and two-sided Pearson correlation was performed for panel (e). The ‘p’ represents p-value. Source data are provided as a Source Data file.

Impact of eccDNA enrichment steps on eccDNA identification

We assessed eccDNA detection efficiency by the number of eccDNA detected per gigabase (Gb) of data. The results indicated that methods incorporating RCA steps achieved significantly higher eccDNA detection efficiencies compared to those without RCA (Fig. 2b). Notably, qPCR analyses revealed that both Solution A purification and the RCA step considerably increased the log2 ratio of circular to linear spike-in DNA (Solution A: from 2.26 to 9.60 and from 18.20 to 26.19; RCA: from 2.26 to 18.20 and from 9.60 to 26.19) (Fig. 2c). To validate these findings, we randomly selected nine simple and seven chimeric eccDNA for testing (see “Methods”), observing validation rates of at least 0.5 in RCA-utilizing methods (3SEP-LR: 8/16, Circle-Seq-SR: 8/9, Circle-Seq-LR: 11/16) (Supplementary Fig. 4 and Supplementary Data 2). Given the notable efficiency of circular DNA enrichment through RCA and the use of Solution A, we hypothesized that eccDNA-enriched experimental methods could effectively detect eccDNA without the need for copy number amplification. To test this, we investigated the association between genome copy numbers and the coverage of overlapped eccDNA, and found a positive correlation. Notably, the correlation coefficients (r values) derived from the WGS-LR (0.80) and ATAC-Seq-SR (0.41) datasets were higher than those obtained from eccDNA-enriched experimental methods (< 0.25) (Supplementary Fig. 5).
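Because the enrichment values are reported as log2 ratios, a difference between two of them corresponds to a multiplicative fold change. The short Python sketch below converts the reported values; it assumes, from the pairing of numbers in the text, that 2.26 is the ratio with neither step, 9.60 with Solution A only, 18.20 with RCA only, and 26.19 with both. The function and variable names are illustrative.

```python
def fold_change_from_log2(log2_before, log2_after):
    """Fold change in the circular:linear ratio implied by two log2 ratios."""
    return 2 ** (log2_after - log2_before)


# log2 ratios of circular to linear spike-in DNA reported in the text (Fig. 2c).
solution_a_without_rca = fold_change_from_log2(2.26, 9.60)
solution_a_with_rca = fold_change_from_log2(18.20, 26.19)
rca_without_solution_a = fold_change_from_log2(2.26, 18.20)
rca_with_solution_a = fold_change_from_log2(9.60, 26.19)
```

For instance, the increase from 2.26 to 9.60 is a log2 difference of 7.34, which corresponds to a fold change on the order of 160.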

Further analysis of the eccDNA length distribution and chromosomal origins revealed that Circle-Seq-LR had the highest detection efficiency for > 10 kb eccDNA, and that enriched methods (except for 3SEP-SR) could detect significantly more ≤ 10 kb eccDNA per Gb of data than non-enriched methods (Fig. 2d). However, at least 97% of the eccDNA identified by eccDNA-enriched methods were shorter than 10 kb (Circle-Seq-LR: 97%, Circle-Seq-SR: 99.8%, 3SEP-LR: 99.9%, 3SEP-SR: 99.5%), and over 90% of eccDNA detected by 3SEP-SR and 3SEP-LR were shorter than 2 kb (Supplementary Fig. 6). In contrast, non-enriched methods showed a higher proportion of eccDNA exceeding 10 kb (Supplementary Fig. 6). Additionally, except for 3SEP-SR and WGS-SR, a significant positive correlation was observed between eccDNA density (number of detected eccDNA per megabase (Mb)) and protein-coding gene density across chromosomes, consistent with prior studies7,13 (Fig. 2e). 3SEP-SR showed a similar trend, though the correlation was not statistically significant (r = 0.39, p = 0.064), and no significant correlation was found in WGS-SR data (r = 0.12, p = 0.6). This could be due to the limited number of eccDNA identified by WGS-SR, suggesting the importance of eccDNA enrichment in experimental setups to obtain a comprehensive eccDNA profile.
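The per-chromosome density and its Pearson correlation with gene density can be sketched in plain Python as follows. This is an illustrative implementation of the standard Pearson coefficient, not the authors' analysis code; the function names are hypothetical, and the p-value reported in the text would come from a separate significance test.

```python
import math


def density_per_mb(count, chrom_length_bp):
    """Number of features (eccDNA or genes) per megabase of chromosome."""
    return count / (chrom_length_bp / 1e6)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

In use, one density value per chromosome would be computed for eccDNA and for protein-coding genes, and the two lists passed to `pearson_r`.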

Detection efficiency of ecDNA by different experimental methods

The eccDNA overlapping with copy number amplified regions was designated as ecDNA, while eccDNA outside these regions was categorized as nonecDNA11. Circle-Seq-SR, Circle-Seq-LR, and 3SEP-LR identified a higher average number of ecDNA per Gb of data (205.2, 165.8, and 203.9, respectively) compared to WGS-SR, WGS-LR, and ATAC-Seq-SR (0.01576, 0.9100, and 6.862, respectively) (Fig. 3a). However, significantly higher proportions of ecDNA were found in the eccDNA detected by WGS-SR (100%), WGS-LR (57.68%), and ATAC-Seq-SR (36.67%) compared to Circle-Seq-SR (20.58%), Circle-Seq-LR (17.09%), and 3SEP-LR (19.26%) (Fig. 3b).
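The ecDNA/nonecDNA split is an interval-overlap test, sketched below in Python. The sketch assumes half-open BED-style coordinates and treats any overlap with an amplified region as sufficient; the actual overlap criterion is not stated in this section, and the function names and tuple layout are hypothetical.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def classify_eccdna(eccdna, amplified_regions):
    """Label an eccDNA (chrom, start, end) as 'ecDNA' or 'nonecDNA'.

    amplified_regions is a list of (chrom, start, end) copy number amplified
    regions. Assumption: any overlap is enough to call ecDNA.
    """
    chrom, start, end = eccdna
    for region_chrom, region_start, region_end in amplified_regions:
        if chrom == region_chrom and intervals_overlap(start, end, region_start, region_end):
            return "ecDNA"
    return "nonecDNA"


# Toy example with one amplified region on chr8.
toy_regions = [("chr8", 100, 200)]
toy_label = classify_eccdna(("chr8", 150, 300), toy_regions)
```

For a chimeric eccDNA made of several fragments, the same test could be applied per fragment, although how such cases are handled is not specified here.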

Fig. 3: Detection efficiency of ecDNA by 7 experimental methods.

a ecDNA detection efficiency of 7 experimental methods. b Comparison of the proportion of ecDNA in the total detected eccDNA. c Comparison of the detection efficiency of ecDNA with different length ranges by 7 experimental methods. d Comparison of the detection efficiency of nonecDNA with different length ranges by 7 experimental methods. Dots represent individual experiments; For all experiments, n = 3. Statistical analyses were performed using one-way ANOVA with Tukey correction; The ‘p’ represents p-value. Data are presented as mean values +/- SEM. Source data are provided as a Source Data file.

Subsequently, we analyzed the detection efficiencies for both ecDNA and nonecDNA across varying lengths (≤ 2 kb, 2–10 kb, > 10 kb; Figs. 3c and d). 3SEP-LR demonstrated the highest efficiency in detecting both ecDNA and nonecDNA up to 2 kb in length. Circle-Seq-SR was the most efficient for detecting ecDNA between 2 kb and 10 kb. For eccDNA over 10 kb, Circle-Seq-LR outperformed all other methods in detecting both ecDNA and nonecDNA. Interestingly, for detecting ecDNA and nonecDNA over 10 kb, WGS-LR, despite not employing a circular DNA enrichment step, showed efficiency comparable to that of 3SEP-SR, 3SEP-LR, and Circle-Seq-SR (Figs. 3c and d).

EccDNA profiles showed heterogeneity across experimental methods

We investigated the correlation of eccDNA profiles across various technical replicates from different experimental methods. We considered eccDNA from different technical replicates to be highly-correlated (HC) when their eccDNA shared over 90% sequence identity. Our analysis indicated that, among the methods examined, eccDNA profiles detected from different experimental replicates within WGS-SR (> 50%) or WGS-LR (> 54.19%) exhibited higher correlations compared to other methods such as ATAC-Seq-SR (< 10%), 3SEP-SR (< 2%), 3SEP-LR (< 1%), Circle-Seq-SR (< 1%), and Circle-Seq-LR (< 1%) (Fig. 4a and Supplementary Data 3). Specifically, compared to WGS-SR (≤ 2), WGS-LR replicates showed more shared highly-correlated eccDNA (> 95) (Fig. 4b). Furthermore, we observed higher correlations between paired Circle-Seq-SR/LR replicates (HC eccDNA proportion > 15%, number of HC eccDNA > 4900) (e.g., Circle-Seq-SR1/Circle-Seq-LR1) compared to unpaired Circle-Seq replicates (e.g., Circle-Seq-SR1/Circle-Seq-SR2 or Circle-Seq-SR1/Circle-Seq-LR2) (Fig. 4a and b). Despite the DNA material for Circle-Seq-SR/LR pairs originating from the same debranched RCA product, the HC eccDNA proportions within these pairs were below 35% (Supplementary Data 3), indicating that the choice of sequencing platform and analysis pipelines can influence the final eccDNA profiles.

Fig. 4: eccDNA profile heterogeneity across experimental methods.

a Proportion of highly-correlated eccDNA between each pair of experimental replicates in vertical replicates. b Number of highly-correlated eccDNA between each pair of experimental replicates. c Number of detected oncogenes shared across different methods. d Number of oncogenes shared by different replicates. e Proportion of reads mapped to repeat elements (n = 3). For panel e, statistical analyses were performed using one-way ANOVA with Tukey correction. The ‘p’ represents p-value. Source data are provided as a Source Data file.

We speculated that, despite the high heterogeneity across different experimental methods and technical replicates, the copy number amplified eccDNA profiles (ecDNA profiles) of different methods might share common oncogenes. To explore this, we compiled a list of oncogenes from the OnGene database23 and compared the oncogenes detected by different methods. Our analysis revealed that the ecDNA sequences obtained from all examined experimental methods mapped to a total of 125 oncogenes (Supplementary Fig. 7 and Supplementary Data 4). No oncogene was detected by all the examined experimental methods. Of the 125 oncogenes, 87 could be detected by at least two different experimental methods (Fig. 4c and Supplementary Data 4). A total of 18 oncogenes were detected by 4 experimental methods (Fig. 4c and Supplementary Data 4). For example, ZNF217, reported to promote HeLa cell viability24, was detected by Circle-Seq-SR, Circle-Seq-LR, 3SEP-LR, and WGS-LR. Further, 16 of the 18 oncogenes were detected by eccDNA-enriched experimental methods. For example, TRIO25 and CUL4A26, reported to promote metastasis and invasion of HeLa cells, were detected by Circle-Seq-SR/LR and 3SEP-SR/LR. PVT1, a long non-coding RNA that can enhance proliferation27 and promote the progression28,29 of cervical cancer cells, was also detected by Circle-Seq-SR/LR and 3SEP-SR/LR. Notably, experimental methods employing the RCA step demonstrated a higher capacity for oncogene detection (> 69, data from Circle-Seq-LR, Fig. 4d) compared to those lacking this step (< 20, data from 3SEP-SR, Fig. 4d). Furthermore, experimental replicates utilizing RCA exhibited a greater overlap in detected oncogenes (at least 42 oncogenes were detected in more than 2 replicates) compared to those without RCA (Fig. 4d and Supplementary Data 4).

Repeat elements are commonly detected in the sequencing data that are used for identifying eccDNA13,30,31. Considering that our sequencing data originated from the same HeLa cell pool, we postulated that the proportion of reads mapping to repeat elements would remain consistent across different experimental methods. However, our findings revealed notable disparities. WGS-LR exhibited the highest proportion of reads mapping to the examined repeat elements (Fig. 4e and Supplementary Data 5), including long terminal repeats (LTRs, 65.84%), short interspersed nuclear elements (SINEs, 73.70%), long interspersed nuclear elements (LINEs, 74.61%), and satellite elements (16.5%). Furthermore, WGS-LR, 3SEP-LR, and Circle-Seq-LR displayed significantly elevated proportions of reads mapping to LTRs, SINEs, and LINEs compared to their short-read counterparts (Fig. 4e and Supplementary Data 5). This suggests that sequencing results from different experimental methods inherently exhibit heterogeneity. Consequently, when comparing results across different studies, it is important to consider the experimental methods used.

Discussion

Benchmarking the available analysis pipelines and experimental protocols for detecting eccDNA is crucial for advancing eccDNA research. In this study, we have identified top performers for eccDNA detection by assessing 7 analysis pipelines using various metrics and comparing 7 experimental methods via detection efficiency. Circle_finder (bwa-mem-samblaster) and Circle-Map stand out for their ability to identify eccDNA from short-read data, and CReSIL outperforms the others in long-read data analysis. Among experimental methods, Circle-Seq-LR demonstrates the highest detection efficiency for longer eccDNA, while 3SEP-LR is more effective for shorter eccDNA. This information is vital for researchers in selecting the most suitable methodologies for their eccDNA studies.

Although our simulated datasets closely mimicked the length distribution of real eccDNA data, they featured a comparatively small proportion of eccDNA longer than 10 kb. This imbalance posed challenges in precisely evaluating the performance of different analysis pipelines across various eccDNA length ranges. Additionally, while using DNA from a cell line sheds light on the eccDNA detection efficiency of diverse methods, the potential copy number bias introduced at different experimental stages remains a concern due to the absence of a known ground truth. Future research could benefit from employing a specially designed circular DNA pool with a defined copy number. Such a controlled approach would not only help in addressing potential biases but also allow for more accurate quantification of metrics like F1-score and base pair difference for each experimental method in eccDNA detection.

Split and discordant reads within short-read data, and breakpoint reads in long-read data, are primary sources for eccDNA identification. CReSIL utilizes the breakpoint read information to construct directed graphs, allowing it to identify eccDNA effectively from both the concatemeric tandem copies (CTC) reads and the non-CTC reads containing breakpoints. Conversely, eccDNA_RCA_nanopore focuses only on CTC reads, which might limit its ability to identify larger eccDNA that rarely generate CTC reads. Both eccDNA_RCA_nanopore and Circle_finder (bwa-mem-samblaster) exhibit a tendency for redundancy due to their approach of reporting results for each CTC read or split read, respectively. Circle_finder (bwa-mem-samblaster) showed the highest F1-score across all the investigated sequencing depths; reducing its redundant results may further enhance its performance. Because few pipelines are available for analyzing non-enriched eccDNA data, we only compared the performance of these analysis pipelines for identifying eccDNA from simulated eccDNA-enriched datasets. Future studies are needed to compare the performance of analysis pipelines for detecting eccDNA from non-enriched data when more pipelines are available.

This benchmark study also helps to explain controversial findings in the field. For instance, the limited detection of ecDNA in normal cells4 may be due to the low sensitivity of WGS-SR in identifying eccDNA. Conversely, the effective identification of eccDNA in human germline cells may be facilitated by the use of the Circle-Seq-LR technique7. However, it is important to note from our analysis that non-enriched methods like WGS-SR hold their own unique advantages, such as providing copy number variation information essential for ecDNA classification32. Therefore, we do not suggest that non-enriched methods be replaced by enriched methods. Moreover, other non-enriched methods like WGS-LR33 and modified ATAC-Seq-SR34 can preserve nucleotide decorations in the sequencing reads, a feature that could potentially be lost in sequences generated from enrichment steps like RCA.

A significant challenge in eccDNA research is the inconsistency in the definitions of different eccDNA types used by various studies. We defined ecDNA as eccDNA colocalizing with genome copy number-amplified regions11, due to the putative gene amplification effect of ecDNA. Other studies may use size thresholds to define ecDNA35,36. Establishing a consensus definition is crucial for harmonizing research findings in this rapidly evolving field.

Lastly, the potential of eccDNA as a diagnostic marker for diseases like advanced chronic kidney disease37, medulloblastoma21, and colorectal cancer38 is promising. Increasing the efficiency of linear DNA digestion will be beneficial for enhancing the enrichment of circular DNA, and further efforts in this direction will be appreciated. Optimizing the RCA step, typically a lengthy process, could also enhance the feasibility of using eccDNA information for clinical diagnosis.

Methods

Generation of simulated datasets

Because the biogenesis of eccDNA is not fully understood, we considered findings or eccDNA simulating methods from previously published papers9,14,16 and created a Python script to generate simulated eccDNA datasets for evaluation. The simulated datasets contained circular and linear DNA, according to the length distribution, chromosome origins and chimeric eccDNA proportion of the eccDNA from the given data. We collected the eccDNA profiles identified by different analysis pipelines (Supplementary Data 1) from human sperm cells7, EJM cell line9, JJN3 cell line9, Kelly cell line20, medulloblastoma21, muscle cells13 and OVCAR8 cell line22, and used these 7 datasets as input. We generated 7 simulated datasets, each containing 10,000 circular DNA (as positive sequences) and 10,000 linear DNA fragments (as negative sequences). Then, we randomly shifted the positive sequence to mimic the RCA starting site and concatenated the 5000 bp of individual simulated eccDNA to mimic the RCA procedure. We used the generated sequences as templates to simulate short-read datasets using ART39 (--sr-platform ‘HS25’ --sr-mean ‘400’ --sr-std ‘125’ --sr-readlen ‘150’) and long-read datasets using PBSIM240 (--ont-model ‘R94’, --ont-mean ‘3000’, --ont-std ‘2500’) at different sequencing depths (5X, 10X, 15X, 20X, 25X, 30X, 35X, 40X, 45X, 50X). We also used eccDNA identified from human sperm cells7, EJM cell line9, JJN3 cell line9 and Kelly cell line20 to simulate short-read datasets and long-read datasets with different chimeric DNA ratios (0%, 10%, 20%, 30%, 40%, 50%) at sequencing depth 20X.
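The random shift and concatenation step can be illustrated with the following Python sketch. It is not the authors' script: it assumes that the circle is rotated at a random offset and then tandemly repeated and truncated to a 5000 bp template, which is only one possible reading of the description above. The function names and the `target_len` parameter are hypothetical.

```python
import random


def rotate_circle(seq, offset):
    """Rotate a circular sequence so that it starts at the given offset."""
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def rca_template(seq, target_len=5000, rng=None):
    """Mimic RCA: start at a random position on the circle, then repeat in tandem.

    Assumption: the concatemer is truncated to exactly target_len bases.
    """
    if not seq:
        raise ValueError("seq must not be empty")
    rng = rng or random.Random()
    rotated = rotate_circle(seq, rng.randrange(len(seq)))
    n_copies = -(-target_len // len(rotated))  # ceiling division
    return (rotated * n_copies)[:target_len]


# Toy example: a 5 bp circle expanded into a 12 bp tandem template.
toy_template = rca_template("ACGTT", target_len=12, rng=random.Random(0))
```

Templates built this way contain the junction between the end and the start of the original sequence, which is the signal that split-read and breakpoint-based pipelines rely on.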

Performance evaluation of each pipeline

EccDNA was identified following the instructions on the website of each pipeline. We used the hg38 genome as the reference. For Circle-Map16, we used Circle Map Realign to identify eccDNA and applied the recommended filters (circle score > 50, split reads > 2, discordant reads > 2, coverage increase in the start coordinate > 0.33 and coverage increase in the end coordinate > 0.33). For Circle_finder10, we used the script circle_finder-pipeline-bwa-mem-samblaster.sh to identify eccDNA. For ECCsplorer17, we used the mapping module to identify eccDNA. For ecc_finder18, all 4 modes were used to identify eccDNA from either short-read or long-read data. Identified eccDNA longer than \(10^{7}\) bp was filtered out. For CReSIL9, we followed the instructions on its website to identify eccDNA and considered cyclic eccDNA as identified results. For NanoCircle7, we followed the instructions on its website and considered high_conf simple eccDNA and complex eccDNA as identified results. For eccDNA_RCA_nanopore14, we followed the instructions on its website to identify eccDNA. For the pipelines that did not supply results in FASTA format, we used pysam41 to convert BED format into FASTA format. The FASTA files were then compared to the simulated eccDNA sequences by MUMmer342.
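As an illustration of the recommended Circle-Map filters listed above, the following sketch applies the five thresholds to one candidate call. The `CircleCall` record and its field names are hypothetical and do not reproduce the actual column layout of the Circle-Map output; only the threshold values are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class CircleCall:
    """One candidate eccDNA call (field names are illustrative only)."""
    chrom: str
    start: int
    end: int
    split_reads: int
    discordant_reads: int
    circle_score: float
    cov_increase_start: float
    cov_increase_end: float


def passes_recommended_filters(call: CircleCall) -> bool:
    """Return True if the call passes every threshold quoted in the text."""
    return (
        call.circle_score > 50
        and call.split_reads > 2
        and call.discordant_reads > 2
        and call.cov_increase_start > 0.33
        and call.cov_increase_end > 0.33
    )
```

All comparisons are strict, matching the ">" signs in the filter description.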

Cell culture

HeLa cells were purchased from BeNa Culture Collection (Cat#BNCC342189; RRID: CVCL-0030). NIH3T3 cells (RRID: CRL-1658) were a gift from the laboratory of Prof. Shu Zhu at the University of Science and Technology of China. HeLa cells or NIH3T3 cells were cultured at 37 °C in DMEM (Thermo Fisher Scientific 11965092) containing 10% FBS (Thermo Fisher Scientific 10091148) and 1% penicillin‒streptomycin (Thermo Fisher Scientific 15140122). Upon reaching approximately 80%–100% confluence, the cells were rinsed with 1× PBS (Sangon Biotech, B540626-0500) and digested with 0.25% trypsin (Beyotime C0203-500 ml). Trypsinization was terminated by adding DMEM + 10% FBS + 1% penicillin‒streptomycin, and the cells were collected by centrifugation at 500 × g for 5 min at RT. Cells were then washed twice with 1× PBS and centrifuged at 500 × g for 5 min at 4 °C to obtain the cell pellet for the following experiments. Detailed company names and catalog numbers of reagents are recorded in Supplementary Data 6.

ATAC-seq library construction

For each replicate, approximately 50,000 cells and a commercial Tn5 kit (Vazyme, TD501) were used to construct the ATAC-Seq library. The reaction mix, consisting of 50,000 cells, 0.005% digitonin (Sigma‒Aldrich D141-100MG), 33 mM Tris-Ac (pH 7.8), 66 mM KAc, 10 mM MgAc, and 16% DMF, was incubated at 500 rpm for 30 min at 37 °C using a thermal rotator. After the reaction, the cells were washed twice using wash buffer (10 mM Tris-HCl pH 7.5, 10 mM NaCl, 3 mM MgCl2, 0.005% digitonin) and resuspended in 14 µL of 10 mM Tris-HCl pH 7.5. Cells were then lysed by mixing with 2 µL lysis buffer (200 mM Tris-HCl pH 8.0, 0.4% SDS) and 0.2 µL proteinase K (20 mg/mL) at 500 rpm for 15 min at 55 °C. The lysis reaction was terminated by adding 4 µL of 10% Tween-20 and 0.4 µL of 100 mM PMSF. The samples were incubated for 5 min at RT, and then PCR was performed to add adapters to the DNA fragments for sequencing. Detailed company names and catalog numbers of reagents are recorded in Supplementary Data 6.

Whole-genome sequencing

To prepare each replicate for WGS-SR, more than 1 million washed cells were frozen in liquid nitrogen. Three replicates were sent to Sequanta Technologies for library construction and WGS-SR sequencing (Illumina NovaSeq 6000 platform). To prepare each replicate for WGS-LR, more than 5 million washed cells were frozen in liquid nitrogen. Three replicates were sent to Novogene for library construction and WGS-LR sequencing (Oxford Nanopore PromethION platform).

Isolation of crude circular DNA

Crude circular DNA was extracted from the same pool of HeLa cells following the published protocol15. In brief, more than 60 million HeLa cells were used to extract the crude circular DNA pool. For each reaction (approximately 30 million HeLa cells), cells were collected in a 50 mL tube by centrifugation at 2000 × g for 10 min at 4 °C. The cells were resuspended in 10 mL of suspension buffer (10 mM EDTA pH 8.0, 150 mM NaCl, 1% glycerol, Lysis blue (1×, from QIAGEN Plasmid Plus Midi Kit), RNase A (0.55 mg/ml), and freshly supplemented with 20 µL of 2-mercaptoethanol). Then, 10 mL of Pyr buffer (0.5 M pyrrolidine, 20 mM EDTA, 1% SDS, adjusted to pH 11.80 with 2 M sodium acetate pH 4.00, and freshly supplemented with 20 µL of 2-mercaptoethanol) was added to the cell suspension. The tube was gently inverted 5–10 times to mix and incubated at room temperature for 5 min. After lysis, 10 mL of Buffer S3 (from QIAGEN Plasmid Plus Midi Kit) was added to the mixture, and the tube was gently inverted until the solution color turned white. Then, the lysate was centrifuged at 4500 × g for 10 min. The clear lysate was transferred to a QIAfilter Cartridge (from QIAGEN Plasmid Plus Midi Kit) and incubated at room temperature for 10 min. Then, the cell lysate was filtered into a 50 mL tube. The volume of the filtered lysate was approximately 27 mL, and 9–10 mL of Buffer BB (1/3 of the lysate volume, from QIAGEN Plasmid Plus Midi Kit) was added. The lysate was mixed by inverting the tube 4–8 times. The lysate mixture was then transferred to the spin column, and vacuum was applied until all liquid passed through. We added 0.7 mL of ETR buffer (from QIAGEN Plasmid Plus Midi Kit) to wash the column, and applied vacuum until all liquid passed through. Then, the wash was repeated using 0.7 mL of PE buffer (from QIAGEN Plasmid Plus Midi Kit). After washing, the tube was centrifuged at 10000 × g for 2 min to remove the liquid, and the column was transferred to a new clean 1.5 mL centrifuge tube.
Crude eccDNA was then eluted using 100 µL of 0.1X EB buffer (from QIAGEN Plasmid Plus Midi Kit). For each microgram of crude eccDNA, we spiked in 1 ng of pUC1943 (a gift from Joachim Messing, Addgene plasmid # 50005; RRID: Addgene_50005) and 1 ng of Egfr fragment (amplified from the NIH3T3 cell genome using forward primer AACTGCTGTCTTGGGTACGG and reverse primer ATTGCAGTCGCCCAAGTGTA, both ordered from Sangon Biotech) to generate the crude circular DNA mixture. Detailed company names and catalog numbers of reagents are recorded in Supplementary Data 6.

Linear DNA digestion

For each DNA digestion reaction, 3 µg of crude circular DNA mixture was digested using 0.5 µL of Pac I and 1 µL of ATP-dependent Plasmid-Safe DNase in 1X ATP-dependent Plasmid-Safe DNase buffer. Then, 0.1 µL of 110 mg ml−1 RNase A and 2 µL of 25 mM ATP were added to the reaction in a total volume of 50 µL. The reaction mix was incubated at 37 °C for 16 hours. After digestion, 1.8X SPRIselect beads were used to purify the DNA. DNA was eluted with 66 µL of 2 mM Tris-HCl pH=7.0 for subsequent Solution A purification, or with 66 µL of 0.1X EB buffer (from QIAGEN Plasmid Plus Midi Kit) when no Solution A purification followed. Detailed company names and catalog numbers of reagents are recorded in Supplementary Data 6.

Solution A purification

The Solution A purification step followed the published study15 and was used in 3SEP-SR and 3SEP-LR only. In brief, we transferred 50 µL of eluted circular DNA (in 2 mM Tris-HCl pH=7.0) to a 1.5 mL tube, added 700 µL of Solution A (room temperature), mixed by pipetting up and down, and incubated at room temperature for 5 min. We transferred 10 µL of Dynabeads™ MyOne™ Silane beads (resuspended by thorough vortexing) to a 200 µL tube and placed it on a magnetic rack. Once the beads had settled, we removed the liquid and added 20 µL of Solution A to resuspend the beads. We then transferred the beads to the DNA (incubated in Solution A) and pipetted up and down 10 times. The mixture was placed on the magnetic rack, and the liquid was removed once the beads had settled. The beads were briefly spun down and returned to the magnetic rack to remove the residual liquid. We then took the tube off the magnetic rack and resuspended the beads in 300 µL of Solution A. The tube was placed on the magnetic rack and the liquid was removed once the beads had settled. The beads were briefly spun down and returned to the magnetic rack, and the residual Solution A was removed once the beads had settled. The 300 µL Solution A wash was repeated once more. After the second Solution A wash, the tube was kept on the magnetic rack, 700 µL of 3.5 M NaCl was added, and the liquid was removed after 1 minute; this wash was repeated once. After the second NaCl wash, the tube was kept on the magnetic rack, 800 µL of freshly prepared 80% ethanol was added, and the liquid was removed after 1 minute; this wash was repeated once. The beads were briefly spun down and returned to the magnetic rack to remove the residual liquid. We then took the tube off the rack, resuspended the beads in 30 µL of 0.1X EB buffer (from QIAGEN Plasmid Plus Midi Kit) and incubated for more than 3 minutes. The tube was returned to the magnetic rack and the eluate (containing purified circular DNA) was transferred once the beads had settled. Detailed company names and catalog numbers of reagents are recorded in Supplementary Data 6.

Rolling Circle Amplification (RCA) and debranching

We measured the DNA product concentration using Qubit 4.0, and aliquoted 1 ng of DNA to prepare the RCA reaction premix (2 µL 10X Phi29 DNA Polymerase Reaction Buffer, 2 µL dNTPs (25 mM each), 1 µL Exo-resistant Random Primer, and H2O added to 17.6 µL). The samples were incubated at 95 °C for 5 min and then ramped to 30 °C at −0.1 °C per sec. Then, we added 1 µL of Phi29 DNA Polymerase, 1 µL of Pyrophosphatase (Inorganic) and 0.4 µL of recombinant Albumin (supplied with the Phi29 DNA polymerase) to give a 20 µL final reaction mix. The samples were incubated at 30 °C for 14 hours and inactivated at 65 °C for 10 min. The product was diluted by adding 80 µL of H2O, and 1.8X SPRIselect beads were used to purify the product. The DNA product was eluted in 0.1X EB buffer (from QIAGEN Plasmid Plus Midi Kit). T7 endonuclease I was employed to cleave the branched RCA product from circular DNA. Briefly, 6 µg of RCA product was aliquoted into the reaction tube along with 30 µL of 10X NEBuffer 2 and 15 µL of T7 Endonuclease I, and H2O was added to 300 µL. The reaction mix was incubated at 37 °C for 15 min. We used 0.4X SPRIselect beads to purify the reaction product. Detailed company names and catalog numbers of reagents are recorded in Supplementary Data 6.

DNA fragmentation

For Circle-Seq-SR, the debranched DNA material was sent to Sequanta Technologies for ultrasonic fragmentation to a fragment size of 300–500 bp, as reported in the published protocol12. For 3SEP-SR, the Solution A purified DNA material was sent to Sequanta Technologies for enzymatic fragmentation. To allow comparison across experimental methods, 1 ng of DNA was used to generate the sequencing library using the Nextera XT DNA Library Preparation Kit (Illumina).

Sequencing

For ATAC-Seq-SR, 3SEP-SR, and Circle-Seq-SR, the DNA libraries were sequenced by Sequanta Technologies on the Illumina NovaSeq 6000 platform. For 3SEP-LR and Circle-Seq-LR, the long-read sequencing libraries were constructed by Novogene and sequenced on the Oxford Nanopore PromethION platform.

Identification of eccDNA from real datasets

We used the script circle_finder-pipeline-bwa-mem-samblaster.sh in Circle_finder10 to identify eccDNA from ATAC-Seq-SR data and applied a length filter (shorter than \(10^{7}\) bp) to select eccDNA. For WGS-SR data, we used AmpliconArchitect8 to identify eccDNA with the options cngain = 4 and cnsize = 10000. For WGS-LR data, we used the CReSIL identify_wgls command9 to identify eccDNA and retained cyclic eccDNA. For Circle-Seq-SR and 3SEP-SR data, we used Circle Map Realign16 to identify eccDNA and applied the recommended filters (circle score > 50, split reads > 2, discordant reads > 2, coverage increase in the start coordinate > 0.33, coverage increase in the end coordinate > 0.33, and length < \(10^{7}\) bp). For Circle-Seq-LR and 3SEP-LR data, we used the CReSIL identify command9 to identify eccDNA and retained cyclic eccDNA.

Identification of ecDNA

We used Control-FREEC44 (breakPointThreshold = 0.6, window = 50000, step= 10000) to examine the copy number variation in 3 replicates of our WGS-LR data. We defined eccDNA as ecDNA if it had overlap with the CNV gain regions identified by Control-FREEC.
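The overlap rule used to label ecDNA can be sketched as follows. This is an illustrative example rather than the code used in the study: the function names are hypothetical, coordinates are assumed to be 0-based half-open intervals, and any overlap of at least 1 bp with a CNV gain region on the same chromosome is assumed to be sufficient.

```python
from collections import defaultdict


def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def label_ecdna(eccdna, gain_regions):
    """Return one boolean per eccDNA: True if it overlaps a CNV gain region.

    Both inputs are lists of (chrom, start, end) tuples.
    """
    gains_by_chrom = defaultdict(list)
    for chrom, start, end in gain_regions:
        gains_by_chrom[chrom].append((start, end))
    labels = []
    for chrom, start, end in eccdna:
        hits = gains_by_chrom.get(chrom, [])
        labels.append(any(intervals_overlap(start, end, gs, ge)
                          for gs, ge in hits))
    return labels
```

For genome-scale inputs, an interval tree or a sorted sweep would scale better than this pairwise scan.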

Oncogene overlapping analysis

Information on human oncogenes was obtained from the ONGene database23. We annotated the identified ecDNA using the BEDTools intersect command45. We merged the overlapping regions and calculated the overlap proportion of each oncogene using the following formula:

$${Overlap\; proportion}=\frac{{length\; of\; overlapped\; sequence\; between\; oncogene\; and\; ecDNA}}{{Full\; length\; of\; oncogene}}$$
(1)
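A minimal Python sketch of Eq. (1) is given here for illustration. It assumes that the oncogene and all ecDNA intervals lie on the same chromosome and use 0-based half-open coordinates; the function names are hypothetical and the sketch does not call BEDTools.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def overlap_proportion(gene_start, gene_end, ecdna_intervals):
    """Fraction of the oncogene length covered by at least one ecDNA."""
    gene_len = gene_end - gene_start
    if gene_len <= 0:
        raise ValueError("gene interval must have positive length")
    # Clip each ecDNA interval to the gene, dropping non-overlapping ones.
    clipped = [(max(s, gene_start), min(e, gene_end))
               for s, e in ecdna_intervals
               if s < gene_end and e > gene_start]
    covered = sum(e - s for s, e in merge_intervals(clipped))
    return covered / gene_len
```

Merging before summing ensures that a base covered by several ecDNA is counted once, so the proportion never exceeds 1.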

We applied ComplexHeatmap package46 to visualize our results.

Repeat elements analysis

The genomic coordinates of repeat elements on the hg38 reference genome were obtained from UCSC genome browser47. We used pysam to calculate the proportion of reads mapped to different repeat elements, including LTR, LINE, SINE and satellite.

Circular DNA enrichment efficiency evaluation

qPCR was used to evaluate the circular DNA enrichment efficiency. qPCR primers for pUC19 (F: GCAGGTCGACTCTAGAGGAT; R: GGGCCTCTTCGCTATTACGC, ordered from Sangon Biotech) and the Egfr fragment (F: AAACGGAAGATCCTGCCCTG; R: GTGTACCCTGAACACGAGGG, ordered from Sangon Biotech) were used to quantify the circular DNA and linear DNA, respectively. \(\Delta {Ct}\left({{original}}\right)\) was used to normalize the qPCR results.

$$\Delta {Ct}\left({{original}}\right)=\frac{\mathop{\sum }_{i=1}^{N}\left({{Ct}\left({{pUC}}19\right)}_{i}-{{Ct}\left({Egfr}\right)}_{i}\right)}{N}$$
(2)

Here, \({{Ct}\left({{{\rm{pUC}}}}19\right)}_{i}\) and \({{Ct}\left({Egfr}\right)}_{i}\) represent the cycle threshold (Ct) values of pUC19 and of the Egfr fragment, respectively, in replicate i of the original DNA pool, and N represents the number of replicates.

The circular DNA enrichment efficiency for each step was calculated by:

$${{Circular}}\; {{enrichment}}\; {{efficiency}}\left({{{{\rm{Log}}}}}_{2}\right)=\frac{\mathop{\sum }_{j=1}^{N}-({\Delta {Ct}\left({{{\rm{step}}}}\right)}_{j}-\Delta {Ct}\left({{{\rm{original}}}}\right))}{N}$$
(3)

\({\Delta {Ct}\left({{{\rm{step}}}}\right)}_{j}\) was calculated by:

$${\Delta {Ct}\left({{{\rm{step}}}}\right)}_{j}={{Ct}\left({{{\rm{pUC}}}}19\right)}_{j}-{{Ct}\left({Egfr}\right)}_{j}$$
(4)

Here, \({{Ct}\left({{{\rm{pUC}}}}19\right)}_{j}\) and \({{Ct}\left({Egfr}\right)}_{j}\) represent the Ct values of pUC19 and of the Egfr fragment, respectively, in replicate j after the specific circular DNA enrichment step, and N represents the number of replicates.
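Equations (2) to (4) can be combined in a short Python sketch. The function names are hypothetical, and each replicate is assumed to be supplied as a (Ct of pUC19, Ct of Egfr) pair.

```python
from statistics import mean


def delta_ct(ct_puc19, ct_egfr):
    """Eq. (4): Ct(pUC19) minus Ct(Egfr) for a single replicate."""
    return ct_puc19 - ct_egfr


def enrichment_efficiency_log2(original, step):
    """Eq. (3): mean of -(dCt(step)_j - dCt(original)) over replicates.

    `original` and `step` are lists of (ct_puc19, ct_egfr) pairs.
    """
    # Eq. (2): average delta-Ct of the original DNA pool.
    d_original = mean(delta_ct(p, e) for p, e in original)
    return mean(-(delta_ct(p, e) - d_original) for p, e in step)
```

Because one PCR cycle corresponds to roughly a twofold change, a returned value of x suggests an approximately 2^x-fold enrichment of circular over linear DNA relative to the original pool, assuming near-100% amplification efficiency for both primer pairs.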

PCR validation

We created a numerical index for each eccDNA from each sample and used the random number generating formula in Excel (=randbetween(start index:end index)) to select the eccDNA. For eccDNA for which we could not design primers (potentially due to repeat sequences or low sequence complexity), we added 1 to the drawn random number and designed primers for the newly indexed eccDNA. DNA sequences spanning the breakpoint were obtained using the Genome Browser (https://genome.ucsc.edu/index.html). Primers targeting the eccDNA breakpoint were designed using Primer-Blast (https://www.ncbi.nlm.nih.gov/tools/primer-blast/) (Supplementary Data 2) and ordered from Sangon Biotech. HeLa cell genomic DNA was extracted using the DNeasy® Blood & Tissue Kit (QIAGEN Cat. No. 69504). KOD FX (TOYOBO No. KFX-101) was used to perform the PCR. In brief, 20 ng DNA template (genomic DNA or sample), 1.5 µL 10 µM forward primer, 1.5 µL 10 µM reverse primer, 4 µL 2 mM dNTPs, 10 µL 2X PCR Buffer for KOD FX, 1 µL KOD FX and nuclease-free water (Invitrogen 10977015) (to a 20 µL final volume) were combined. PCR was carried out using the following thermal cycle: 94 °C for 2 minutes, then 30 cycles of 98 °C for 10 s, 60 °C for 30 s and 68 °C for 1 minute, followed by 68 °C for 5 minutes. The PCR product was cut from the electrophoresis gel and sent for Sanger sequencing validation (by Sangon Biotech). We classified chimeric eccDNA as fully validated when all breakpoints were confirmed through Sanger sequencing (counted as 1 event when calculating the validation rate). When only some of the breakpoints could be validated, we categorized the eccDNA as partially validated chimeric eccDNA (counted as 0.5 event when calculating the validation rate). Detailed company names and catalog numbers of reagents are recorded in Supplementary Data 6.
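The event weighting described above can be written as a small Python sketch. It is a hedged illustration: the function name is hypothetical, and the denominator is assumed to be the total number of tested eccDNA, which the text does not state explicitly.

```python
def validation_rate(results):
    """Weighted validation rate over tested eccDNA.

    `results` is a list of (validated_breakpoints, total_breakpoints).
    Fully validated eccDNA count as 1 event, partially validated
    chimeric eccDNA as 0.5, and unvalidated eccDNA as 0.
    """
    if not results:
        raise ValueError("no tested eccDNA")
    score = 0.0
    for validated, total in results:
        if total <= 0 or not 0 <= validated <= total:
            raise ValueError("invalid breakpoint counts")
        if validated == total:
            score += 1.0
        elif validated > 0:
            score += 0.5
    # Assumption: the rate is normalized by all tested eccDNA.
    return score / len(results)
```

A simple eccDNA has a single breakpoint, so it can only contribute 1 or 0; the 0.5 weight arises only for chimeric eccDNA with several breakpoints.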

Benchmark metrics

F1-score

$$F1-{score}=\frac{2\times {Precision}\times {Recall}}{{Precision}+{Recall}}$$
(5)
$${Precision}=\frac{{TP}}{{TP}+{FP}}$$
(6)
$${Recall}=\frac{{TP}}{{TP}+{FN}}$$
(7)

Here, TP represents the number of true positive events, FP the number of false positive events, and FN the number of false negative events.
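Equations (5) to (7) translate directly into Python. In the sketch below, returning 0.0 when a denominator is zero is an assumed convention that the text does not specify, and the function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1-score) from event counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, 80 true positives with 20 false positives and 20 false negatives give a precision, recall and F1-score of 0.8 each.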

Base pair difference

$${Base}\; {pair}\; {difference}=\frac{\mathop{\sum }_{i=1}^{N}\left({LEN}R{-}{LEN}1+{LEN}Q{-}{LEN}2\right)}{N}$$
(8)

Here, LEN R and LEN Q are the lengths of the reference eccDNA and the query eccDNA, and LEN 1 and LEN 2 are the lengths of the alignment on the reference and query eccDNA, respectively. N is the number of query eccDNA that have more than 90% identity and 90% overlap with a reference eccDNA.
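Equation (8) can be sketched in Python as follows. The function name is hypothetical, and the input is assumed to contain only reference/query pairs that already passed the 90% identity and 90% overlap criteria (for example, after parsing the alignment coordinates reported by MUMmer).

```python
def base_pair_difference(matches):
    """Mean unaligned length per matched query eccDNA, following Eq. (8).

    `matches` is an iterable of tuples
    (len_reference, len_query, aln_len_on_reference, aln_len_on_query).
    """
    matches = list(matches)
    if not matches:
        raise ValueError("no matched eccDNA")
    total = sum((len_r - len_1) + (len_q - len_2)
                for len_r, len_q, len_1, len_2 in matches)
    return total / len(matches)
```

A value of 0 means that every matched query was aligned end to end with its reference eccDNA.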

Duplication Rate

The duplication rate is defined as the number of identified eccDNA (TP2) that have at least 90% overlap with a simulated eccDNA, divided by the number of simulated eccDNA (TP1) that can be identified by each pipeline.

$${{Duplication}}\; {{Rate}}=\frac{{TP}2}{{TP}1}$$
(9)
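As a hedged illustration of Eq. (9), the sketch below counts TP2 and TP1 from a list of matches. The function name and the identifiers are hypothetical, and each pair is assumed to link an identified eccDNA to a simulated eccDNA that it overlaps by at least 90%.

```python
def duplication_rate(matches):
    """TP2 / TP1 from (identified_id, simulated_id) match pairs."""
    matches = list(matches)
    identified = {query for query, _ in matches}   # TP2
    simulated = {ref for _, ref in matches}        # TP1
    if not simulated:
        raise ValueError("no simulated eccDNA was identified")
    return len(identified) / len(simulated)
```

A rate of 1 indicates that each recovered simulated eccDNA was reported once, whereas higher values indicate redundant calls for the same simulated eccDNA.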

Detection efficiency of specific type of eccDNA

Detection efficiency of specific type of eccDNA (per Gb) was calculated by using the following formula:

$${E}_{{ij}}=\frac{{n}_{{ij}}}{{D}_{i}}$$
(10)

Where: Eij is the detection efficiency of experimental method i in detecting eccDNA type j, nij is the number of eccDNA in type j detected by experimental method i, and Di is the size of the data (Gb) generated by experimental method i.

Statistics & reproducibility

For the performance evaluation of bioinformatic pipelines, we used Seaborn48 to visualize statistical data. Each point in the figure shows the mean ± SEM (standard error of the mean). For column charts, one-way ANOVA (in GraphPad Prism 9) was used to evaluate statistical significance (degrees of freedom between methods are 6, and degrees of freedom within methods are 14). For grouped column charts, we also used one-way ANOVA (degrees of freedom between methods are 6, and degrees of freedom within methods are 14), because we focused on the comparison within each length range. Each column shows the mean ± SEM, and data points are shown as black dots on the column. For the correlation dot plot (Fig. 2e), we used the two-sided Pearson correlation in scipy.stats49 to measure the linear relationship between the density of coding genes and the density of eccDNA for each chromosome, and used Seaborn to present the result.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.