Abstract
Active mobile elements in the human genome can create novel mobile element insertions (MEIs) in somatic tissues. Detection of somatic MEIs, particularly those with low mosaicism, remains a significant challenge due to sequencing artifacts and alignment errors. Existing methods lack sensitivity or require biased manual inspection. Here we present RetroNet, a deep learning algorithm that encodes sequencing reads into images to identify somatic MEIs with as few as two reads. Trained on diverse datasets, RetroNet outperforms previous methods and eliminates the need for manual examinations. RetroNet achieves high precision (0.885) and recall (0.579) on a cancer cell line, detecting insertions in just 1.79% of cells. RetroNet is also effective for degraded DNA, like circulating tumor DNA. This tool is applicable to the rapidly generated short-read sequencing data and has the potential to provide further insights into the functional and pathological implications of somatic retrotranspositions.
Introduction
Mobile elements (MEs), or transposons, are a class of short DNA fragments capable of changing genomic locations, including cut-and-paste DNA transposons and copy-and-paste retrotransposons. About 45% of the human genomic sequence consists of MEs1. Retrotransposons are the dominant class, including long terminal repeat (LTR) transposons such as ~98,000 human endogenous retroviruses and non-LTR transposons such as ~560,000 L1 (or LINE-1), ~1,200,000 Alu, and ~5,100 SVA (or SINE-VNTR-Alus) elements. While most human MEs have accumulated many mutations during evolution and are incapable of moving, a small number of non-LTR retrotransposons remain active, including ~80 active L1 (L1Hs subfamily)2, over 800 active Alu (AluY subfamilies)3, and ~25 active SVA (SVA-E and SVA-F subfamilies) elements per person4. These active retrotransposons can be copied to new genomic loci via a retrotransposition process such as target-primed reverse transcription (TPRT)5, creating de novo mobile element insertion (MEI) mutations6. Retrotranspositions can occur in germline and somatic cells, yielding inter-individual and intra-individual genetic diversity. More than one hundred MEIs have been linked to human diseases7,8, including somatic SVA insertions, found in over 75% of the cells (a fraction termed the tissue allele frequency, or tAF), within the NF1 gene in patients with Neurofibromatosis Type I9.
Widespread somatic retrotranspositions with various tAFs have been identified in normal tissues10 as well as cancer11. Somatic L1 insertions can occur in neural stem cells12 and mature neurons13, typically resulting in a tAF below 2% per insertion in the adult human brain14,15,16. Different L1 mutations create genomic mosaicism among various brain cells through insertional mutagenesis and alterations in transcriptional regulation17. In cancer, somatic insertions from Alu and SVA elements, along with L1, have been extensively documented, with the retrotransposition activities dependent on both the type of mobile element and the tumor’s origin11,18,19. While driver mutations in tumors usually have high tAFs, clinically significant mutations can still occur at low frequencies due to low tumor purity or the presence of new subclonal populations that confer resistance to treatment20.
The characterization of low-tAF somatic MEIs remains challenging due to weak signals and abundant noise produced by current high-throughput DNA sequencing technologies21. The standard whole genome sequencing (WGS) approach with short, paired-end sequencing reads can reveal evidence of de novo MEI mutations, including supporting reads with one read-end mapped to the human reference genome and the other end mapped to mobile element consensus sequences. Previous studies on brain or tumor somatic MEIs have utilized various approaches, including whole genome or targeted sequencing of bulk tissue or single cells14,16,22,23,24. To capture somatic mutations with low tAFs, these methods typically employ high sequencing depth (greater than 100×), ME-targeted PCR amplification, or enzymatic whole-genome amplification of single cells. However, these approaches can increase the likelihood of sequencing artifacts, such as PCR chimeras that link unrelated sequences. When the chimeras form around the abundant mobile elements in the human reference genome, they can mimic low-frequency, novel ME junctions, leading to a high number of false somatic MEI discoveries. Standard MEI detection algorithms generally require a minimum of four supporting reads18 or five supporting reads25,26,27,28 that cover both upstream and downstream insertion junctions. While this requirement helps reduce noise, it greatly limits the ability to detect low-tAF MEIs that do not reach the threshold of supporting reads.
Bona fide ME retrotransposition exhibits hallmark features such as specific alleles29 and high sequence identity to the active ME consensus sequences7, which can be utilized to differentiate it from the randomly generated false ME junctions. RetroSom, a random-forest model, was previously designed to classify a single supporting read as true or false based on its sequence16. With further manual inspections for proper read positions using a visualization tool, RetroVis, RetroSom reported somatic MEIs as those with a minimum of two supporting reads. The RetroSom and RetroVis analysis, however, still has several limitations. First, it cannot detect somatic SVA insertions, and its sensitivity for low-tAF L1 and Alu MEIs remains low. Second, the training data of RetroSom came from 11 individuals of one Caucasian family30; this limited scope could lead to lower accuracy when applied to individuals of different ancestries. Lastly, the manual inspection with RetroVis, despite its effectiveness in reducing false positives, requires substantial prior knowledge and is prone to human biases. Nevertheless, the graphic design used to visualize MEIs can act as a strong foundation for deep learning techniques. These techniques have recently advanced significantly in various fields of genomics31, improving the detection of single nucleotide polymorphisms, short insertion-deletion mutations32, and DNA structural variants33,34 such as deletions, duplications, inversions, and complex variants.
Here, we formulated the RetroNet framework that uses image encoding to integrate both the sequence and positional features of candidate MEIs and employs a deep neural network to predict somatic L1, Alu, and SVA insertions in the human genome (Fig. 1). The training data consisted of high-coverage WGS from 549 parent-offspring trios with diverse ancestries, significantly improving the model’s generalizability35. Following the exclusion of low-quality reads, candidate MEIs were labeled as true or false based on the inheritance patterns. Groups of two supporting reads were encoded into images to illustrate their alignments to ME consensus sequences and their relative positions. Next, we trained three neural network classifiers based on state-of-the-art architectures for image processing: ResNet-1836, GoogLeNet37, and Vision Transformer (ViT)38. The performance was benchmarked in independent and diverse test datasets. This included (1) detecting germline MEIs in the Illumina Polaris Project39, (2) detecting simulated somatic MEIs with a wide range of tAFs in a genome mixing dataset16, and (3) detecting simulated MEIs across various signal-to-noise ratios. To demonstrate the accuracy in real datasets, we analyzed somatic MEIs from paired tumor-normal sequencing of a patient with pancreatic ductal adenocarcinoma, HG00840,41,42. Additionally, we detected somatic MEIs in the cell-free DNA (cfDNA) from a metastatic castration-resistant prostate cancer patient, DTB-20543, and benchmarked with the matching tumor tissue. All relevant datasets are summarized in Supplementary Table 1.
The RetroNet pipeline was developed in five major steps. First, we extrapolated candidate germline MEIs from 549 trios of the 1000 Genomes Project, and labeled true (red) and false (gray) MEIs based on the inheritance patterns. Second, we used an image-based encoding to convert groups of two MEI supporting reads, each shown as a pair of blue (flanking sequence in the human reference genome) and red (ME sequence) boxes. Two reads supporting an L1 insertion were transformed into an image formatted into nine tracks. These tracks contain both the sequence-L1 alignment represented by black and red dots, as well as the relative positions of the flanking sequences indicated by blue arrows. Third, we trained three deep-learning models based on ResNet-18, GoogLeNet, and vision transformer (ViT) architectures to classify the labeled L1, Alu, or SVA images. Fourth, we benchmarked the trained models in three independent datasets, including germline MEIs (for model selection), somatic MEIs with low tissue allele frequencies (tAFs), and simulated imbalanced datasets with varying signal-to-noise ratios. Finally, we applied RetroNet to detect somatic MEIs in paired tumor-normal sequencing for a pancreatic ductal adenocarcinoma (PDAC) tumor cell line HG008-T, using matched normal duodenal tissue (HG008-N-D) as a control. Furthermore, we analyzed somatic MEIs in matched cfDNA and metastatic tissue samples from a metastatic castration-resistant prostate cancer (mCRPC) patient DTB-205. Created in BioRender. TAN, M. (https://BioRender.com/tq8ohse).
RetroNet extends the detection of somatic MEIs to SVA elements. Based on the area under the precision-recall curve (AUPR), a metric suitable for binary classification of rare events44, RetroNet outperforms RetroSom in detecting both germline and somatic L1 and Alu insertions. In detecting germline L1 elements from the Polaris dataset, RetroNet achieved an average AUPR score of 0.990 (95% confidence interval (CI): 0.988-0.993), compared to RetroSom’s 0.936 (95% CI: 0.929-0.944). Similarly, in detecting simulated somatic L1 insertions at 1% tAF with 200× WGS, RetroNet achieved a 43.4% increase in AUPR over RetroSom (0.223 vs. 0.156). With a more stringent cutoff tuned for highly imbalanced datasets, RetroNet has significantly improved precision while maintaining good recall for detecting low-tAF somatic L1 insertions in the genomic DNA of a cancer cell line, as well as in the cell-free DNA from a cancer patient. Finally, we interpreted the RetroNet neural network and confirmed that it could utilize hallmark sequencing and positional parameters from bona fide MEIs to guide the detection. The code and environment of RetroNet were packaged in GitHub (https://github.com/Czhuofu/RetroNet) and Zenodo45.
Results
Training RetroNet using an image-based encoding of DNA sequencing reads
Due to the limited number of bona fide somatic MEIs, we developed RetroNet by utilizing transfer learning from the more abundant, evolutionarily recent germline MEIs that share the same retrotransposition mechanisms. The training data were true and false MEIs detected in 549 father-mother-offspring trios from the 1000 Genomes Project high coverage (average >30×) WGS dataset35, after excluding 53 trios that overlap with the benchmarking datasets (Supplementary Table 1). True and false L1, Alu, and SVA insertions were labeled in each offspring based on their inheritance patterns: true MEIs were those present in the offspring and one parent, while false insertions were found only in the offspring. MEIs that were present in all three family members were excluded, as these could represent either evolutionarily old MEIs with high population frequencies or common alignment errors. We further excluded genomic loci with repetitive sequences that are prone to alignment errors and ME junctions formed by DNA structural variations instead of retrotranspositions. Using a threshold of two or more supporting reads, we identified 287,096 true MEIs (22,629 L1, 250,874 Alu, and 13,593 SVA) and 1,023,785 false MEIs (834,656 L1, 97,755 Alu, and 91,374 SVA) (Supplementary Table 2 and Supplementary Data 1–4). The class labels were validated using long-read sequencing from 11 randomly selected individuals (Supplementary Note 1 and Supplementary Data 5). Most of the MEIs in the training set have a population frequency of less than 0.1 across all individuals (Supplementary Fig. 1), and therefore the risk of over-representing subsets of MEIs was low. 
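The inheritance-based labeling rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the RetroNet implementation: the function name `label_candidate` and its boolean inputs are hypothetical, and the sketch assumes that presence or absence of each candidate has already been determined in the offspring and both parents.

```python
from typing import Optional


def label_candidate(in_offspring: bool, in_father: bool, in_mother: bool) -> Optional[str]:
    """Label a candidate MEI from its trio inheritance pattern.

    Returns "true", "false", or None when the candidate is excluded.
    (Hypothetical helper; names are not taken from the RetroNet code.)
    """
    if not in_offspring:
        return None  # only candidates observed in the offspring are labeled
    n_parents = int(in_father) + int(in_mother)
    if n_parents == 2:
        # Present in all three members: possibly an old, common MEI or a
        # recurrent alignment error, so it is excluded from training.
        return None
    if n_parents == 1:
        return "true"  # present in the offspring and exactly one parent
    return "false"  # found only in the offspring


print(label_candidate(True, True, False))   # inherited from the father
print(label_candidate(True, False, False))  # offspring-only
print(label_candidate(True, True, True))    # excluded
```

In the study, the additional filters described in the text (repetitive loci, junctions formed by structural variations, and the two-read threshold) are applied as well; they are outside the scope of this sketch.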
The supporting reads of the labeled MEIs can be classified into three groups: (1) split reads (SRs), with one read-end mapped onto the MEI junction and the other to the flanking human sequence, (2) paired-end reads (PEs), with one read-end fully aligned to the ME consensus sequence and the other end to the flank, and (3) clipped paired-end reads (clipped PEs), in which one read-end maps to the ME consensus sequence and the other to the MEI junction (Fig. 2a). Both SRs and clipped PEs can identify the exact MEI junctions, whereas PE supporting reads only outline an approximate junction window.
a Supporting reads used to detect MEIs include split reads (SRs), paired-end reads (PEs) and clipped paired-end reads (clipped PEs). Blue indicates the segment of the supporting read that mapped to the flanking sequence, while red denotes the segment that mapped to the mobile element (ME) consensus sequence. For SRs, the segment mapped to the ME consensus must be longer than 30 bp, while clipped PEs require that the region mapped to the flanking sequence is longer than half of the read length to ensure mapping accuracy. b Two supporting reads from a candidate L1 insertion, each with two paired ends, are encoded into a three-channel image with nine tracks, integrating positional features (relative genome positions) and sequence-based features (read-ME alignment). The L1 consensus sequence is divided into the 5’-end (top) and the 3’-end (bottom), with a further zoom-in on the 5’-end to illustrate the encoding syntaxes. Track 1 shows flanking sequence mappability (0–1, black = fully mappable). Tracks 2, 4, 6, and 8 depict genome positions of each read end, and tracks 3, 5, 7, and 9 show alignment to L1Hs consensus; blue arrows indicate genome positions, red pixels indicate L1Hs alignment, and mismatches appear as coexisting black and red pixels. Read 1 is a PE read with end1 mapped in the human genome (blue arrow in track 2) and end2 mapped in L1 (red pixels in track 5). In track 5, the L1Hs consensus sequence is denoted by a matrix where columns represent the base positions and rows represent the nucleotides A, C, T, and G, from top to bottom. The read sequence aligning to L1Hs is highlighted in red pixels. A mismatch appears as a column with both black (L1Hs) and red (supporting read) pixels. Read 2 is a clipped PE read; the shorter blue arrow in track 8 indicates that end2 partially maps to the genome, with the unmapped portion shown as a black line and the portion mapped to L1Hs displayed as red pixels in track 9.
We encoded groups of two supporting reads into fixed-size images to integrate both the sequence-based features, such as hallmark alleles of the active ME subfamilies (e.g., L1Hs, AluYa5, and SVA-E)29, and positional features, such as the relative arrangement of the supporting reads. For L1 insertions, two supporting reads, read 1 and read 2, each with two paired sequencing ends, end1 and end2, are encoded into an image of 60 × 6620 pixels that can be divided into nine tracks of information (Fig. 2b). Track 1 illustrates the mappability46 of the flanking sequences, chosen as a window of 500 bp upstream and downstream of the putative insertion junction. The mappability is defined as the degree to which local, 100 bp short sequence fragments can be uniquely and properly aligned, and the values range between 0 and 1. A full black bar represents a genomic region with a mappability of 1, meaning that the DNA alignment of this region is highly reliable. Tracks 2 to 9 depict each of the four sequencing read ends, read1 end1, read1 end2, read2 end1, and read2 end2, that were mapped to either the flanking human genome, the active L1Hs consensus sequence47, or both (for SR and clipped PE reads). Each end is encoded into two tracks, with the top track featuring the possible alignment in the flank, shown as blue arrows to indicate the relative position. The bottom track features a one-hot encoding of the consensus L1Hs sequence, a 4 × 6064 matrix with white-black pixels, where the columns denote the base positions and rows represent A, C, T, or G from top to bottom. The alignment of the supporting read is denoted by overlaying a string of red pixels, representing the nucleotides from the read, onto the L1Hs sequence. A single sequence mismatch is therefore encoded as the coexistence of a black pixel (L1 reference) and a red pixel (supporting read) in one column.
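To make the one-hot track concrete, the following toy sketch encodes a short consensus sequence and overlays one aligned read. It is a simplified illustration rather than the RetroNet encoder: the function name `encode_alignment`, the character codes ('.' for white, 'K' for black, 'R' for red), and the toy sequences are all assumptions, and the sketch handles only ungapped alignments of uppercase A/C/G/T bases.

```python
ROWS = "ACTG"  # row order from top to bottom, as described in the text


def encode_alignment(consensus: str, read: str, offset: int) -> list[list[str]]:
    """Return a 4 x len(consensus) grid for one consensus/read alignment.

    '.' = white background, 'K' = black (consensus base), 'R' = red (read base).
    Hypothetical helper: ungapped alignment starting at `offset` is assumed.
    """
    grid = [["." for _ in consensus] for _ in ROWS]
    # One-hot encode the consensus: one black pixel per column.
    for col, base in enumerate(consensus):
        grid[ROWS.index(base)][col] = "K"
    # Overlay the read in red. A matching base recolors the consensus pixel;
    # a mismatch leaves a black and a red pixel in the same column.
    for i, base in enumerate(read):
        grid[ROWS.index(base)][offset + i] = "R"
    return grid


toy_consensus = "ACGTTA"
toy_read = "GAT"  # aligned at offset 2, where the consensus reads "GTT"
for label, row in zip(ROWS, encode_alignment(toy_consensus, toy_read, 2)):
    print(label, "".join(row))
```

In the toy output, the fourth column contains both a 'K' (consensus T) and an 'R' (read A), which is the mismatch signature described above; matched columns contain a single 'R'.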
Several additional encoding syntaxes are implemented to ensure a consistent and comprehensive sequence-to-image conversion (see “Methods” and Supplementary Fig. 2a-d). Finally, to avoid overrepresenting MEIs with a higher number of supporting reads (e.g., homozygous over heterozygous), we chose to include a maximum of five randomly selected images per insertion. The resulting training dataset for L1 MEIs contains 135,774 true images and 1,087,692 false images (Supplementary Table 2). The encoding of Alu and SVA insertions is similar to that of L1, except that multiple consensus ME sequences are included to represent the different active ME subfamilies (Supplementary Fig. 2e).
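The cap of five images per insertion can be sketched as a simple random subsample. The sketch below assumes that every unordered pair of supporting reads is a candidate image; the text does not state how read pairs are formed, so this pairing rule, the function name `sample_read_pairs`, and the fixed seed are illustrative assumptions only.

```python
import random
from itertools import combinations


def sample_read_pairs(read_ids, max_images=5, seed=0):
    """Pick at most `max_images` read pairs for one insertion.

    Assumption (not from the source): all unordered pairs of supporting
    reads are candidate images, and sampling is uniform without replacement.
    """
    pairs = list(combinations(read_ids, 2))
    if len(pairs) <= max_images:
        return pairs
    return random.Random(seed).sample(pairs, max_images)


# An insertion with two reads yields one image; one with six reads
# (15 possible pairs) is capped at five.
print(len(sample_read_pairs(["r1", "r2"])))
print(len(sample_read_pairs(["r1", "r2", "r3", "r4", "r5", "r6"])))
```

Capping the number of images in this way keeps well-supported insertions, such as homozygous ones, from dominating the training set.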
We adopted state-of-the-art deep learning models to distinguish true MEI-derived images from false ones, a binary classification task. We included two convolutional neural network (CNN) architecture-based models: ResNet-18 and GoogLeNet. As the encoded images are organized into horizontal tracks and may challenge the spatial locality principle of the CNN models48, we incorporated a third model, the Vision Transformer (ViT), which allows the partitioning of the input images into ordered tracks and then utilizes a transformer architecture49 for image analysis. The architectures of the three deep learning models, along with the output sizes of each layer, are presented in Supplementary Data 6. By the 30th epoch, the loss function values in the validation dataset, chosen as 10% of the images, had converged and remained stable (Supplementary Fig. 3a). Ultimately, the models achieved AUPR values of 0.997 for ResNet-18, 0.998 for GoogLeNet, and 0.994 for ViT for L1 classification in the validation set. High performance was also observed in the models for Alu and SVA classifications (Supplementary Fig. 3b).
Comparable accuracies from three neural network models in detecting germline MEIs
To evaluate the trained ResNet-18, GoogLeNet, and ViT models, we benchmarked their performance in detecting germline MEIs in 49 family trios from the Illumina Polaris Project (Supplementary Table 1). Notably, the same individuals were also part of the 602 trios sequenced by the 1000 Genomes Project but were excluded from model training to avoid data leakage. Using the same labeling process and a threshold of at least two supporting reads, we identified a total of 26,960 true MEIs (2,381 L1, 23,287 Alu, and 1,292 SVA) and 31,566 false MEIs (15,844 L1, 11,145 Alu, and 4,577 SVA) from the 49 offspring (Supplementary Table 2 and Supplementary Data 7–9). Following the sequence-to-image encoding, we evaluated the trained ResNet-18, GoogLeNet, and ViT models in classifying the true and false images. For L1 and Alu insertions, specifically, we also performed the RetroSom analysis and compared its performance with the three deep learning models.
All three deep learning models achieved similar AUPR scores for the classification of L1-derived images: ResNet-18: 0.990 (95% CI: 0.988-0.993), GoogLeNet: 0.990 (95% CI: 0.988-0.993), ViT: 0.991 (95% CI: 0.988-0.993), significantly outperforming the RetroSom model (AUPR = 0.936, 95% CI: 0.929-0.944, vs. ResNet-18: adjusted P = 1.4e-14) (Fig. 3). Other performance metrics showed similar results: ResNet-18 and GoogLeNet exhibited nearly equal precision (= 0.970) and recall (= 0.956) for L1 detection at the default classification cutoff (Probability = 0.5). In comparison, ViT exhibited a lower average precision of 0.961 (adjusted P = 3.8e-7) but a higher recall of 0.976 (adjusted P = 9.2e-8) (Fig. 3b). Similar results were also observed for benchmarking the Alu-derived and SVA-derived images (Fig. 3). The results indicate that the three deep learning models exhibit similar overall performance, which surpasses that of RetroSom. Among the three models, ResNet-18 and GoogLeNet are less computationally intensive and have fewer parameters than ViT (Supplementary Data 6). ResNet-18’s residual blocks are simpler, more scalable, and more interpretable than GoogLeNet’s inception modules37. This makes ResNet-18 more efficient in training and inference50, and thus we chose ResNet-18 as the default model implemented in the RetroNet framework.
a Precision-recall curves of the trained ResNet-18, GoogLeNet, ViT models, and the RetroSom model (for L1 and Alu) were assessed on the labeled L1, Alu, and SVA image datasets extracted from 49 offspring in the trio sequencing of the Illumina Polaris Project. We labeled the average and 95% confidence intervals of the area under the precision-recall curve (AUPR) scores for each model. b Boxplots to compare the precision, recall, and AUPR scores of the ResNet-18, GoogLeNet, ViT models, and the RetroSom model, using the default threshold of probability >0.5 (n = 49 individuals). P values were calculated using paired two-sided Wilcoxon test and adjusted using the Benjamini-Hochberg correction. Box plots show the median (horizontal line within box), the first to third quartiles (Q1–Q3, 25th–75th percentiles) as the box, and whiskers extending to values within 1.5 × IQR (the interquartile range).
RetroNet outperforms previous methods in detecting simulated somatic MEIs with low levels of mosaicism
Unlike germline MEIs that are present in all cells, somatic MEIs are mosaic with various tAFs, which poses additional challenges for detection51. Thus, we benchmarked RetroNet at detecting MEIs with various levels of mosaicism using a genome mixing dataset16 that simulates somatic MEIs at a range of tAFs (Supplementary Table 1). The dataset contains 50×, 100×, 200× and 400× coverage WGS of genomic DNA from six unrelated individuals, for which true MEIs have been defined previously16, mixed into the NA12878 genomic DNA at ratios of 0.04%, 0.2%, 1%, 5%, and 25%. A heterozygous germline MEI found in only one of the six genomes could thus simulate a somatic MEI with a tAF from 0.04% to 25%. We then identified candidate MEIs that are present in the DNA mixture but absent in NA12878, chose a maximum of 10 pairs of supporting reads per insertion, and then applied the RetroNet or RetroSom (for L1 and Alu only) analyses. Somatic MEIs were reported if any pair of the supporting reads was predicted to have a probability above the selected stringency cutoff. Additionally, we compared the results with a separate MEI detection method, xTea (short-read module)52.
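The insertion-level reporting rule, in which an insertion is called when any of its read-pair images scores above the cutoff, can be sketched as below. The function name `call_somatic_meis`, the candidate names, and the probability values are hypothetical and serve only to illustrate the rule.

```python
def call_somatic_meis(pair_probs_by_insertion, cutoff):
    """Return insertions with at least one read-pair image scoring above `cutoff`.

    `pair_probs_by_insertion` maps an insertion name to the predicted
    probabilities of its read-pair images (hypothetical input format).
    """
    return sorted(
        name
        for name, probs in pair_probs_by_insertion.items()
        if any(p > cutoff for p in probs)
    )


# Illustrative scores only; these are not values from the study.
toy_scores = {
    "candidate_A": [0.12, 0.97, 0.40],
    "candidate_B": [0.55, 0.61],
    "candidate_C": [0.03],
}
print(call_somatic_meis(toy_scores, cutoff=0.5))
print(call_somatic_meis(toy_scores, cutoff=0.95))
```

Because a single high-scoring pair is enough to report an insertion, raising the cutoff is the main lever for controlling false positives at the insertion level.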
At a sequencing coverage of 200×, both RetroNet and RetroSom could detect the simulated somatic L1 insertions with a tAF as low as 1%, with RetroNet outperforming RetroSom. For L1 of 1% tAF, RetroNet’s AUPR was 0.223, while RetroSom’s AUPR was 0.156. Notably, the theoretical maximum of the AUPR was 0.238, since the 200× coverage is insufficient to sample all 1% L1 insertions with two or more supporting reads. RetroNet reached an optimal F1 score of 0.356, defined as the harmonic mean of the recall and precision, at a probability cutoff of 0.8 (recall = 0.238, precision = 0.714). Comparatively, RetroSom’s optimal F1 was 0.250 (recall = 0.143, precision = 1). Similarly, at 5% tAF, RetroNet’s AUPR was 0.861, and RetroSom’s was 0.852 (maximum = 0.875), while at 25% tAF, both RetroNet and RetroSom achieved an AUPR of 1 (Fig. 4). When considering the other sequencing coverages, from 50× to 400×, we found that RetroNet consistently outperforms RetroSom, and AUPR scores were positively correlated with sequencing coverage, which determines the number of supporting reads (Fig. 4). The only exception was that 200× coverage outperformed the 400× coverage data when detecting L1 insertions with 5% tAF. This was because the control with higher sequencing depth (400× NA12878) contained additional sequencing noise that masked a true L1 insertion at chr3:4322118053, which is present in the genome that was mixed at 5%. RetroNet’s prediction performance showed similar patterns for Alu and SVA insertions, with AUPR positively correlated with sequencing depth and tAF and consistently outperforming RetroSom for Alu detection (Supplementary Figs. 4 and 5). Finally, xTea achieved perfect scores in identifying MEIs with a tAF of 25% in the 400× dataset. However, its performance declined at lower tAFs or reduced sequencing depth. This aligns with xTea’s requirement of a relatively large number of supporting reads, which may explain its diminished efficacy for low-tAF somatic mutations.
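As a worked example of the F1 score (the harmonic mean of precision and recall), the sketch below recomputes F1 from the rounded precision and recall values quoted above. The function name `f1_score` is illustrative; because the inputs are rounded, the results agree with the reported F1 values only to within about 0.001.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the text for L1 insertions at 1% tAF, 200x coverage.
print(round(f1_score(precision=0.714, recall=0.238), 3))  # RetroNet
print(round(f1_score(precision=1.0, recall=0.143), 3))    # RetroSom
```

The comparison shows why a perfect precision does not guarantee a higher F1: the harmonic mean is pulled toward the smaller of the two values, so the lower recall dominates.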
The precision-recall curves were evaluated for detecting somatic L1 insertions at mixing proportions of 0.2%-25% (y-axis) under sequencing depth of 50× to 400× (x-axis), using RetroNet (red), RetroSom (gray) and xTea (green). Based on the AUPR scores, RetroNet could consistently outperform RetroSom at detecting low-mosaicism L1 MEIs. In contrast, xTea only predicts MEIs with relatively high tAF and performs significantly worse than RetroNet. Similar results were also observed for Alu and SVA MEIs (Supplementary Figs. 4 and 5).
Enhancing probability threshold for tackling imbalanced datasets
The AUPR metric shows the tradeoff between precision and recall across different probability cutoffs. The optimal cutoff depends on the signal-to-noise ratio (SNR) in real-world applications. The generally fewer true somatic MEIs found in tissue samples and their low tAFs typically lead to highly imbalanced data, where the noise could overwhelm the signal. For detecting somatic L1 insertions in 200× WGS data of bulk brain tissues, for instance, the previous estimate of the true insertions is between 1 and 10, while the false insertions at the cutoff of two or more supporting reads are ~100016, leading to an SNR of around 1:100 to 1:1000. By comparison, germline L1 insertions in the Polaris dataset have an SNR of about 1:7 (Supplementary Table 2). The SNRs for Alu insertions are likely considerably lower, as there are more reference Alu sequences that can mix into PCR chimeras54. Similarly, SVA insertions may also have reduced SNRs because the activity of the retrotransposition process is likely lower, given that SVA is a non-autonomous mobile element. The significant data imbalance in somatic MEI detection can greatly affect the performance of the classifier model.
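The effect of class imbalance on precision can be illustrated with a short calculation. The first helper expresses an SNR as "1:x" from counts, using the Polaris L1 counts reported earlier (2,381 true and 15,844 false). The second computes the expected precision of a classifier with a fixed true-positive rate and false-positive rate; the rates used here (0.8 and 0.01) are hypothetical values chosen for illustration and are not RetroNet measurements.

```python
def noise_per_signal(n_true: int, n_false: int) -> float:
    """Number of false candidates per true candidate (the x in an SNR of 1:x)."""
    return n_false / n_true


def expected_precision(tpr: float, fpr: float, n_true: float, n_false: float) -> float:
    """Expected precision given per-class rates and class sizes."""
    tp = tpr * n_true
    fp = fpr * n_false
    return tp / (tp + fp) if tp + fp > 0 else 0.0


# Polaris germline L1 counts from the text: roughly 1:7.
print(round(noise_per_signal(2381, 15844), 1))

# Hypothetical classifier (TPR = 0.8, FPR = 0.01) at decreasing SNRs.
for x in (1, 10, 100, 1000):
    print(f"SNR 1:{x}", round(expected_precision(0.8, 0.01, 1, x), 3))
```

Even with an unchanged classifier, precision falls as the SNR decreases, because the number of false positives grows with the number of false candidates while the number of true positives does not.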
We simulated a challenging scenario where all true MEIs had low tAFs and only two supporting reads, and therefore one representative image each. To evaluate RetroNet’s performance in classifying individual images in imbalanced conditions, we sampled true and false images from the Polaris datasets into testing data with varying SNRs of 1:1, 1:10, 1:100, and 1:1000. RetroNet performs well when positive and negative samples are balanced; however, as the SNR decreases, the AUPR steadily declines. At very low SNRs, predicting positive samples can become challenging. For example, at an SNR of 1:1000, RetroNet shows AUPR values of 0.278 (95% CI: 0.252-0.305) for L1, 0.468 for Alu (95% CI: 0.432-0.504), and 0.561 for SVA (95% CI: 0.516-0.606) (Fig. 5a). In real scenarios with very low somatic MEI content, it is crucial to implement stricter thresholds instead of the default cutoff of probability >0.5 to regulate false positive levels. We evaluated the relationship between the precision/recall values and the SNRs (1:1 to 1:1000) at various probability cutoffs (0.5–0.99) in the simulated data (Fig. 5b). To ensure high precision for experimental validations, we increased the default probability cutoffs to 0.95 for L1 and 0.99 for Alu and SVA insertions. For the simulated L1 insertions at an SNR of 1:100, for instance, the prediction of RetroNet at a cutoff of 0.95 strikes a reasonable balance between recall (= 0.836) and precision (= 0.432). When applied to two previously validated somatic L1 insertions with a tAF of ~1%16, RetroNet predicted both with high confidence above 0.95: L1_1 had a probability score of 0.997, and L1_2 was 0.998 (Supplementary Fig. 6). The actual optimal cutoffs depend on the ME types, including the retrotransposition activities in somatic tissues that affect the level of the signals, as well as the likelihood of forming false ME junctions (e.g., PCR chimera) that affect the noise level.
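The construction of an imbalanced test set at a chosen SNR can be sketched as follows. This is an illustration under stated assumptions: the function name `resample_at_snr`, sampling without replacement, and the integer placeholders standing in for images are all hypothetical, and the actual resampling procedure may differ.

```python
import random


def resample_at_snr(true_items, false_items, n_true, noise_per_signal, seed=0):
    """Draw a labeled test set with an SNR of 1:`noise_per_signal`.

    Returns a shuffled list of (item, label) tuples, with label 1 for true
    and 0 for false. Sampling without replacement is an assumption.
    """
    n_false = n_true * noise_per_signal
    if n_true > len(true_items) or n_false > len(false_items):
        raise ValueError("not enough items to reach the requested SNR")
    rng = random.Random(seed)
    sampled = [(item, 1) for item in rng.sample(true_items, n_true)]
    sampled += [(item, 0) for item in rng.sample(false_items, n_false)]
    rng.shuffle(sampled)
    return sampled


# Placeholder pools: integers stand in for encoded images.
true_pool = list(range(50))
false_pool = list(range(1000, 3000))
test_set = resample_at_snr(true_pool, false_pool, n_true=10, noise_per_signal=100)
print(len(test_set), sum(label for _, label in test_set))
```

Repeating the draw with different seeds yields a distribution of precision-recall curves from which an average and a confidence interval can be summarized.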
a Precision-recall (PR) curves of RetroNet to identify resampled L1 insertions with various levels of noise, at signal-to-noise ratios (SNRs) of 1:1 (yellow), 1:10 (light green), 1:100 (dark green), and 1:1000 (purple). The solid line represents the average PR curve from 100 simulations, and the ribbon around the line indicates the 95% confidence interval. b The impact of more stringent probability cutoffs (from 0.5 to 0.99, blue to red) on the precision and recall of RetroNet, when applied to imbalanced datasets with an SNR of 1:1 (cross), 1:10 (circle), 1:100 (triangle), and 1:1000 (square). Despite the additional noise, RetroNet could still achieve high precision in highly imbalanced datasets while managing reasonable levels of recall by choosing a higher stringency cutoff.
Interpretation of the RetroNet neural network reveals known retrotransposition hallmarks
Class activation maps (CAMs)55,56 generated from true MEI images indicated that RetroNet correctly utilized both the positional features (e.g., the blue arrows) and sequence features of the supporting reads (e.g., the red pixels) for the prediction (Supplementary Fig. 7). However, the CAM heatmaps are too coarse to precisely localize the relevant features, and thus, we carried out additional evaluations to understand the prediction behavior of RetroNet at a single-nucleotide resolution. We first investigated the supporting read positions in the L1 consensus sequence because frequent 5’ truncations and an intact 3’-end are both hallmarks of true L1 retrotranspositions57. Among the 55,769 training images of true L1 insertions with supporting reads aligned across the 5’-junction, 23,000 images (~40%) have an intact 5’-end, and the rest support truncations typically between base 4000 and 6000 (Fig. 6a). The distribution of the 5’-junction in true images aligns with a previous study of 20 individuals52 and differs significantly from that of false images. RetroNet predicted values (pre-normalized probabilities, called logits) for the true images positively correlated with the 5’-junction distribution of the true L1 insertions (beta = 0.305, P = 2.96e-2221) but negatively correlated with the distribution in the false L1 insertions (beta = -0.105, P = 4.63e-55). For reads that could report L1 3’-junctions, true L1 insertions almost exclusively (~95%) have the junctions at the L1 3’-poly(A) tail, indicating an intact L1 3’-end (Fig. 6b). RetroNet predicted values were also influenced by the 3’-junction locations, with a positive correlation to the distribution in true L1 insertions (beta = 0.392, P = 2.56e-240) and a weak negative correlation to the false-L1 distribution (beta = -0.027, P = 0.0423). 5’-truncations after L1 retrotranspositions are caused by premature termination of the reverse transcription, which is a shared mechanism between germline and somatic events58.
Similarly, the preference for the intact L1 3’-end results from the shared TPRT retrotransposition mechanism5. Thus, these results confirmed that RetroNet could learn the proper supporting read positions and use them for predicting somatic MEIs.
a RetroNet’s predicted values are represented in pre-normalized probabilities (logits) and modeled by a generalized additive model, showing the average in a blue line and the 95% confidence intervals (CIs) in green. The logit (blue) for L1 insertions is positively correlated with the 5’-end positions in the true images (orange) but not the false images (gray). b RetroNet’s logit (blue) for L1 insertions is also positively correlated with the 3’-end positions in the true images (orange) and not the false images (gray). The scarcity of true L1 3’-end deletions results in wider 95% CIs (green) of 3’-ends in L1:0-5000bp. c Boxplots of logits by supporting read orientation: upstream + downstream, two upstream, and two downstream. d Boxplots of logits by supporting read category: clipped paired-end (clipped PE), paired-end (PE), and split reads (SR). e Per-base perturbation of the L1Hs:3-6062. Each site was tested by sampling 300 training images and permuting the allele to three other alternative bases, resulting in probability changes that are colored as down-regulated (blue, probability change < -0.001 and adjusted P < 0.05), significantly up-regulated (red, probability change > 0.001 and adjusted P < 0.05), and not significant (gray). f Categories of perturbations with significant down-regulation in RetroNet, grouped by type of L1 consensus alleles: A (blue), C (dark brown), T (green), and G (light brown). g Three-base perturbations at L1Hs hallmark alleles (L1:5927-5929 ACA/G to GAG and L1:6010-6012 TAG to TAA) caused significant decreases in probabilities. h Permutation of 328 alleles from 37 active L1s showed that five high-frequency (over 19 of 37 active L1s) alleles (706/C, 3952/C, 5389/T, 5533/T, 5536/G) significantly increased probabilities. All statistics in (a–h) were based on images derived from the 1000 Genomes Project offspring (n = 549 individuals). 
P-values in (c–e, g–h) were calculated using two-sided Wilcoxon tests and adjusted by the Benjamini-Hochberg correction. Box plots show the median (horizontal line within box), the first to third quartiles (Q1–Q3, 25th–75th percentiles) as the box, and whiskers extending to values within 1.5 × IQR (the interquartile range).
Regarding the combination of supporting reads, RetroNet prefers the orientation of one supporting read from the L1 5’-junction (called upstream) and the other from the 3’-junction (downstream) (Fig. 6c). Candidate MEIs with both upstream and downstream supporting reads strongly support two novel junctions in the human genome, one from each end of the mobile element, and thus are distinct from chimeric artifacts that only produce one novel junction51. The second preference is for both supporting reads to come from upstream, and the least preferred is for both to come from downstream. The difference is likely due to poorer sequencing quality near the poly(A) tails at the L1 3’-end, where Illumina sequencing is known to create errors generated by polymerase slippage on low-complexity sequences59. As for the types of supporting reads, RetroNet generally prefers images with at least one clipped PE, followed by the combination of PE and SR reads (Fig. 6d). A clipped PE is preferred as it specifies not only the insertion junction, usually at the 5’-end where the sequencing quality is better, but also two segments of the L1 sequence alignments that should be arranged in proper positions (Fig. 2a).
Active mobile elements that are capable of retrotransposition carry hallmark alleles such as ACA or ACG at L1:5927-5929 for the L1 Ta and pre-Ta subfamilies, as well as allele G at L1:601229,60,61. To investigate the impact of specific sequence alleles on RetroNet, we adopted a perturbation-based test to simulate L1 mutations by permuting nucleotides of the supporting reads (i.e., the red pixels in the encoded images)62. Because local sequence alignment places gaps instead of mismatches near the endpoints, we omitted two nucleotides at either end of the L1Hs consensus sequence and chose bases 3 to 6062 for perturbations. For each base, we randomly selected 300 images with at least one supporting read carrying the L1Hs allele, which was then permuted to the three alternative alleles. In total, this test generated 18,180 probability change results (3 alternative DNA bases × 6060 sites), within which 5861 led to significantly down-regulated probabilities (probability change < −0.001 and adjusted P < 0.05) and 847 led to up-regulated probabilities (probability change > 0.001 and adjusted P < 0.05) (Fig. 6e). Transition mutations, including A > G, T > C, C > T, and G > A, were common among the down-regulating perturbations (Fig. 6f), reflecting the major difference between the active L1 sequence and the noise generated from old, inactive L1 elements that have accumulated many mutations, most of which are transitions because these are more common than transversions63.
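The bookkeeping of this perturbation test can be illustrated with a minimal sketch. It is not the RetroNet implementation: the function names are hypothetical, the consensus is a placeholder rather than the real L1Hs sequence, and the 6064-base length is an assumption inferred from omitting two bases at either end. Only the site range (3 to 6062), the three alternative alleles per site, and the significance thresholds are taken from the text.

```python
BASES = "ACGT"
FIRST_SITE, LAST_SITE = 3, 6062  # perturbed L1Hs positions quoted in the text


def alternative_alleles(ref_base):
    """Return the three bases that differ from the consensus base."""
    return [b for b in BASES if b != ref_base]


def enumerate_perturbations(consensus):
    """Yield (position, ref, alt) for every single-base substitution.

    `consensus` maps a 1-based position to its consensus base.
    """
    for pos in range(FIRST_SITE, LAST_SITE + 1):
        ref = consensus[pos]
        for alt in alternative_alleles(ref):
            yield pos, ref, alt


def classify_change(prob_change, adjusted_p, delta=0.001, alpha=0.05):
    """Apply the thresholds from the text to one perturbation result."""
    if adjusted_p < alpha and prob_change < -delta:
        return "down"
    if adjusted_p < alpha and prob_change > delta:
        return "up"
    return "not significant"


# Placeholder consensus (assumption): the real L1Hs sequence is not reproduced here.
toy_consensus = {pos: BASES[pos % 4] for pos in range(1, 6065)}
perturbations = list(enumerate_perturbations(toy_consensus))
print(len(perturbations))  # 3 alternatives x 6060 sites = 18180
```

In the study itself, each of these substitutions was scored on 300 sampled images and the resulting P-values were adjusted before applying the thresholds; the sketch leaves that scoring step out.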
The perturbation test also demonstrated that RetroNet could recognize canonical alleles of active L1. As expected, probability changes of three-base perturbations from L1:5927-5929 ACA/G to GAG and from L1:6010-6012 TAG to TAA were significantly down-regulated (P = 2.58e-46 and 1.89e-50, respectively) (Fig. 6g). For single-base perturbations, we investigated 328 alternative alleles carried by 37 active L1 elements in the human genome64. Overall, the alleles shared among the majority of the active L1s (>50%) are more likely to cause a positive change in the RetroNet prediction than the less common alleles (P = 9.96e-7) (Fig. 6h). Five alleles that are shared in over 19 of the 37 active L1s led to significantly higher prediction scores. Among these, 706/C, 5389/T, 5533/T, and 5536/G were also recognized canonical sites of active L1s2. The fifth allele, 3952/C, was found in 26 of the 37 active L1 elements, including several that are considered to be highly active in in vitro assays64. Notably, the single nucleotide change of 5929 G > A did not produce significant changes since pre-Ta L1 (5927-5929:ACG) is still capable of retrotransposition, and RetroNet relies on all three nucleotides at L1:5927-5929 for the classification (Fig. 6g). In comparison, the majority of the remaining, less common active L1 alleles led to down-regulation (~47%) or insignificant changes (~49%) in the predicted probabilities (Fig. 6h), suggesting they were not the focus of RetroNet’s learning.
Detecting somatic MEIs in cancer cell line HG008
To demonstrate the efficacy of RetroNet in real bulk sequencing datasets, we characterized somatic MEIs in a pancreatic ductal adenocarcinoma tumor cell line (HG008-T) with matched normal duodenal tissue (HG008-N-D) as control, using public short and long read sequencing data from the Genome in a Bottle Consortium (Supplementary Table 1)40,41,42. For benchmarking, we identified a total of 19 true tumor somatic L1 insertions with an estimated tAF between 1.6% and 100%, and no Alu or SVA insertions, from two PacBio long-read sequencing65 datasets with a combined average sequencing depth of 212×, using PALMER66, xTea_long (long-read module)52, and an in-house tool (Supplementary Data 10, 11, and Supplementary Note 2). We then applied RetroNet, RetroSom16, xTea (short-read module)51, and TraFiC-mem18,19 to two independent Illumina WGS datasets, ILMN-PCR-free-2 (I2) and ILMN-PCR-free-3 (I3), with an average depth of 195× and 161× for HG008-T, respectively. After accounting for ploidy, the average depth per haploid genome (termed number of reads per chromosome copy, or NRPCC67) was 49× for I2 and 40× for I3. A third dataset, ILMN-PCR-free-1, was excluded as it sequenced a different passage of HG008-T cells and used primary normal pancreatic tissue, instead of duodenal tissue, as control.
With the requirement for at least two supporting reads and in non-repetitive genomic regions, 12 and 14 of the 19 true L1 insertions were detectable in datasets I2 and I3, respectively. Within dataset I2, RetroNet identified a total of 13 somatic L1 insertions with Probability >0.95, of which 10 were true L1 retrotranspositions (recall = 0.833, precision = 0.769). The tAF interquartile range of the identified L1 insertions is between 5.9% and 92.3% (minimum = 1.79%) (Fig. 7a). The two false negatives include one with a 3’ transduction and 5’ inversion and therefore lacks an intact 3’-end, and a short L1 insertion near the 3’-poly(A) that caused many sequencing errors59. The three false positives were low-tAF calls with proper sequence and positional features, including the L1Hs hallmark alleles and both upstream and downstream supporting reads, and, therefore, were possible rare somatic events missed in the long-read sequencing. Comparatively, RetroSom’s accuracy was significantly weaker without manual inspections: recall was 0.667, and precision was 0.381—the majority of errors were PCR duplicates or incoherent read positions (Supplementary Fig. 8). Furthermore, xTea identified six true L1 insertions (recall = 0.500, precision = 0.857, tAF interquartile range: 76.9%-98.4%), while TraFiC-mem identified five true L1 insertions (recall = 0.416, precision = 1, tAF interquartile range: 72.8%-93.5%). Notably, xTea and TraFiC-mem generally overlooked low-tAF MEIs but could identify those that had only transduction and no mobile element sequences, known as “orphan transductions”68,69 (Fig. 7a). They represent a class of noncanonical MEIs not benchmarked here, as the accurate detection of low-tAF orphan transductions remains unresolved in long-read sequencing52,66. Finally, while there were no true Alu or SVA insertions, RetroNet misclassified 9 somatic Alu insertions, a significantly lower number than RetroSom’s 96 and 2 somatic SVA insertions. 
In contrast, xTea misclassified 1 Alu insertion, while TraFiC-mem did not misclassify any Alu or SVA insertions (Supplementary Data 12).
a Detection of HG008-T tumor somatic L1 Insertions based on Illumina WGS data. This figure illustrates the detection of tumor somatic L1 insertions in the Illumina WGS datasets of HG008-T, ILMN-PCR-free-2 (I2, upper panel) and ILMN-PCR-free-3 (I3, lower panel) using four different algorithms. The gray ellipse represents true somatic L1 insertions benchmarked by PacBio HiFi sequencing, with at least two supporting reads in the Illumina WGS datasets as the minimum detection requirement. The colored dots indicate true positives detected by each algorithm (RetroNet: red, RetroSom: blue, xTea: green, TraFiC-mem: yellow). The gray open dots indicate false positives. The colored open dots represent orphan insertions detected by the algorithms but not included in the benchmarking list, and thus were excluded when calculating recall. b Detection of tumor-derived somatic L1 insertions in cell-free DNA from patient DTB-205 with treatment-resistant prostate cancer. The Venn diagram compares somatic L1 insertions detected in cfDNA by RetroNet (red circle) or xTea short (green circle) with those present in the tumor (yellow circle). RetroNet identified 44 L1s in cfDNA, six unique to cfDNA, and 38 that were also found in the tumor, of which four were co-detected by xTea short. Fourteen tumor specific L1s were absent from cfDNA. On this cohort, RetroNet achieved a recall = 0.731 and precision = 0.864, whereas xTea short achieved a recall = 0.077 and precision = 1.000. The right panel shows mosaicism in cfDNA for three categories: cfDNA specific insertions (gray, n = 6 insertions), tumor-derived L1s detected in cfDNA by RetroNet alone (red, n = 34 insertions) or co-detected by xTea (green, n = 4 insertions), and tumor specific L1s (yellow, n = 14 insertions). Tumor-derived L1s detected in cfDNA exhibit significantly higher mosaicism than tumor specific insertions (two-sided Wilcoxon test, P = 0.0002). 
Box plots show the median (horizontal line within box), the first to third quartiles (Q1–Q3, 25th–75th percentiles) as the box, and whiskers extending to values within 1.5 × IQR (the interquartile range). Created in BioRender. TAN, M. (https://BioRender.com/ghjdojc).
Dataset I3, compared to I2, had a lower sequencing depth (161× vs. 195×) and a higher percentage of properly paired reads (98.3% vs. 96.8%), implying it contained fewer chimeric artifacts and, therefore, a lower level of noise. For L1 analysis, RetroNet had a similar recall (= 0.857) and a higher precision (= 1) in I3 than in I2; both metrics exceeded those of RetroSom (recall = 0.714, precision = 0.345) (Fig. 7a). The two false-negative L1 insertions were the L1 with a 3’ transduction and 5’ inversion that was also missed in I2 and a low-tAF somatic L1 insertion supported by only two short split reads in I3. Both xTea and TraFiC-mem maintained a high precision score of 1, with recalls of 0.571 and 0.500, respectively (Fig. 7a). The MEIs they detected exhibited higher tAF interquartile ranges than those detected by RetroNet (xTea: 64.7%-100%, TraFiC-mem: 80.9%-100% vs. RetroNet: 9.4%-93.7%). For Alu and SVA insertions, RetroNet and TraFiC-mem reported none, while RetroSom reported 6 false Alu insertions and xTea reported 6 false Alu insertions and 1 false SVA insertion. The performance difference between I2 and I3, as well as among various ME types, can be attributed to the differing SNRs. Specifically, the SNR for L1 insertions was 1:5475 in I2 and 1:532 in I3. For Alu, it was below 1:77784 in I2 and 1:731 in I3, while for SVA, it was under 1:1941 in I2 compared to 1:945 in I3.
By combining RetroNet’s filtering system with additional filters (such as the inability to resolve repetitive regions using short reads, the lack of sufficient supporting reads for low-tAF mutations, and the presence of false supports in control tissues), the average recall rate for detecting somatic L1 elements in HG008-T is 0.579 (11 out of 19). This rate is slightly lower than the previously reported recall rates for germline mobile element insertions (MEIs), which were 0.68 for short-read sequencing at 300× and 0.93 for long-read sequencing52. For low-tAF MEIs, RetroNet accurately detected a total of six somatic L1 insertions with fewer than five supporting reads, including four with only two supporting reads. Out of the nine true somatic L1 insertions with a tAF between 1.6% and 10%, the average recall of RetroNet was 0.5 between I2 (n = 4) and I3 (n = 5), with few false positives (n = 3 in I2; n = 0 in I3). The limited recall is primarily due to insufficient supporting reads (<2) in the short-read sequencing data (n = 4 in I2; n = 3 in I3) (Supplementary Data 10). Unlike other MEI detection algorithms, which often require a substantial number of supporting reads (TraFiC-mem18,19 found no somatic insertions in this tAF range and xTea found only one), RetroNet improved the detection of low-mosaicism somatic MEIs.
Benchmarking somatic MEIs in cell-free DNA and matching tumor tissues
We further assessed the use of RetroNet for short-read sequencing of fragmented cell-free DNA (cfDNA), where long-read sequencing is not applicable. The dataset includes time-matched cfDNA (average depth = 185×), metastatic tumor (98×), and white blood cells (41×), from 13 metastatic castration-resistant prostate cancer (mCRPC) patients43. The cfDNA from patient DTB-205 has an estimated cancer fraction of 28%, with a 100% contribution from the biopsied metastatic tumor. This allows us to use the matching metastatic tissue as an independent validation to confirm the somatic MEIs identified in the cfDNA. The contributions of metastatic tumor biopsies to cfDNA for other patients ranged from 1% to 60%, and thus, were excluded from our benchmarking analysis.
Due to the significant fragmentation of cfDNA (interquartile range: 148-178 bp), we employed a more stringent cutoff of ≥3 supporting reads with ≥2 high-confidence read pairs in RetroNet. We compiled a list of 52 true somatic L1 insertions in the DTB-205 tumor tissue. This list includes 38 insertions identified by xTea in the tumor sequencing and an additional 14 insertions that had fewer supporting reads than the xTea threshold but were identified through manually reviewed matching RetroNet calls in cfDNA (Supplementary Fig. 9). In addition, xTea identified no Alu insertions and one somatic SVA insertion in the DTB-205 tumor, indicating a low level of mosaicism from these elements. Within the cfDNA, RetroNet identified a total of 44 somatic L1 insertions, with an estimated recall of 0.731 and a precision of 0.864 (Fig. 7b). The tAF interquartile range for the correctly identified L1 insertions was between 4.8% and 10.8%, with a minimum value of 2.9% and a maximum value of 25.1%. This range is consistent with the estimated cancer fraction of 28% in the cfDNA. The cfDNA-specific L1 insertions (n = 6) may include false positives or true low-tAF mutations in the tumor that were not detected by sequencing. The tumor-specific insertions (n = 14) exhibited a significantly lower tAF in cfDNA (interquartile range: 1.2%–4.8%) compared to the L1 insertions detected in cfDNA (P = 0.0002), including 7 cases with insufficient supporting reads (Fig. 7b). Attempting the same task by xTea on the cfDNA sequencing data led to the discovery of four somatic L1 in cfDNA; all were found in the tumor (recall = 0.077, precision = 1, tAF interquartile range: 11.7%-22.4%). Overall, the significantly improved recall for detecting somatic L1 insertions in cfDNA demonstrates RetroNet’s effectiveness in analyzing highly degraded DNA and its potential applications in scenarios where long-read sequencing is impractical.
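The precision and recall values reported for the HG008-T dataset I2 and for the DTB-205 cfDNA follow directly from the stated counts. The sketch below reproduces that arithmetic; the function name is hypothetical, and the counts are copied from the text (true positives, total calls, and the size of the benchmark set that was considered detectable).

```python
def precision_recall(true_positives, total_calls, total_truth):
    """Return (precision, recall) from raw counts."""
    if total_calls == 0 or total_truth == 0:
        raise ValueError("total_calls and total_truth must be positive")
    return true_positives / total_calls, true_positives / total_truth


# (true positives, total calls, benchmark insertions), as stated in the text
reported = {
    "RetroNet, HG008-T I2": (10, 13, 12),
    "RetroNet, DTB-205 cfDNA": (38, 44, 52),
    "xTea, DTB-205 cfDNA": (4, 4, 52),
}

for label, counts in reported.items():
    precision, recall = precision_recall(*counts)
    print(f"{label}: precision = {precision:.3f}, recall = {recall:.3f}")
# RetroNet, HG008-T I2: precision = 0.769, recall = 0.833
# RetroNet, DTB-205 cfDNA: precision = 0.864, recall = 0.731
# xTea, DTB-205 cfDNA: precision = 1.000, recall = 0.077
```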
Discussion
Somatic mobile element insertions have increasingly been recognized for their involvement in brain development and tumorigenesis. The challenges of low tissue allele frequency and sequence repetitiveness complicate their detection, necessitating innovative strategies for precise identification. To address this, RetroNet integrates both sequence and positional features of candidate supporting reads using a deep neural network. Compared to the previous model, RetroSom, which only analyzes sequence features, RetroNet demonstrates higher sensitivity, enabling the detection of MEIs at low frequencies with improved precision, thereby reducing the need for manual validation of putative MEIs. This advancement marks a significant step towards uncovering the elusive roles of somatic MEIs in neurological diseases and cancer development.
Direct encoding of DNA sequencing data with the one-hot method has been widely used in deep learning-based models for genomics31. Typically, these methods involve converting the sequencing reads into a 4 × M matrix, where the four rows represent the four nucleotides (i.e., A, C, T, and G), and M is the read length. Applying this encoding method to build a classifier for somatic MEIs, however, would have caused biased representations. All candidate MEI supporting reads contain flanking sequences that are largely different between the germline MEIs used for training and the targeted somatic MEIs, except for the short (~6 bp) endonuclease cleavage sites. The insertion loci of the polymorphic germline MEIs have also been subject to genetic drift or natural selection in human evolution, which were generally absent or selected at a different level, cellular instead of species, for somatic MEIs. Furthermore, learning patterns from the MEI flanking sequences would also lead to overestimating the model’s performance, as a portion of the MEIs (e.g., those with high population frequencies) are shared across the training and the benchmarking datasets with identical flanking sequences. Finally, the conventional one-hot encoding method would also be a poor representation to capture the crucial aspects of the relative positions and orientations of the selected supporting reads, as well as the alignments with the ME consensus sequences. Hence, we leveraged insights from the prior knowledge into an unbiased and comprehensive encoding scheme, including only the relevant sequence and positional features for the sequence-to-image conversion.
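For reference, the conventional one-hot representation discussed above can be sketched in a few lines. This illustrates only the generic 4 × M encoding, not RetroNet's image-based scheme; the function name is hypothetical, the row order follows the order given in the text (A, C, T, G), and encoding ambiguous bases such as N as an all-zero column is an assumption.

```python
NUCLEOTIDES = "ACTG"  # row order as listed in the text


def one_hot_encode(read):
    """Return a 4 x M matrix (list of four rows) for a read of length M.

    Bases outside A/C/T/G (for example N) become all-zero columns.
    """
    read = read.upper()
    return [[1 if base == nuc else 0 for base in read] for nuc in NUCLEOTIDES]


matrix = one_hot_encode("ACGTN")
for nuc, row in zip(NUCLEOTIDES, matrix):
    print(nuc, row)
# A [1, 0, 0, 0, 0]
# C [0, 1, 0, 0, 0]
# T [0, 0, 0, 1, 0]
# G [0, 0, 1, 0, 0]
```

The matrix records base identity per position only, which is why it carries no information about the relative positions or orientations of two supporting reads.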
Various strategies, including targeted or whole-genome sequencing and single-cell or bulk-tissue sequencing, have been applied to detect somatic MEIs. Of these methods, deep coverage WGS on bulk tissues or mixed cells, when analyzed properly, serves as a good balance between the sequencing costs, noise, and the yield of detectable somatic mutations. Targeted sequencing that captures ME junctions before sequencing is cost-effective but detecting somatic MEIs typically requires additional PCR amplifications that could bring significant levels of chimeric artifacts. Whole genome sequencing, in comparison, enables the simultaneous identification of a broad spectrum of genetic variations, including single nucleotide variants, MEIs, and other structural variations. Furthermore, single-cell WGS allows for detecting somatic mutations with low mosaicism but also requires a significant level of DNA amplification. Enzymatic whole genome amplification approaches could lead to biased sequencing depth or even artifactual chimeras, compromising the result’s accuracy. Alternatively, clonal expansion from fetal or reprogrammed cells circumvents enzymatic DNA amplification biases but is constrained by the type of suitable tissues. Given the same sequencing throughput, the sensitivity of bulk WGS for detecting low-mosaicism mutations is significantly higher than single-cell WGS. Assuming each cell is sequenced at an average depth of 33×, the cost of 200× bulk WGS equates to that of 6 single-cell WGS, and 400× bulk WGS corresponds to 12 single-cell WGS. For somatic MEIs with a tAF of 1%, the chances of sampling them in at least one of the 6 or 12 cells are 0.059 or 0.114, respectively. When using bulk WGS and ≥2 supporting reads, in comparison, the theoretical maximum recall is 0.787 for 200× and 0.979 for 400× depth (Supplementary Note 3 and Supplementary Table 3). 
The actual recall is lower after filtering for suitable supporting reads and removing false positives: RetroNet achieved a recall of 0.238 with 200× and 0.523 with 400× sequencing in a simulated benchmarking dataset, both of which are still greater than the equivalent single-cell approaches. Overall, while single-cell approaches can characterize private or very low-mosaicism mutations, bulk tissue-based methods remain a practical and effective strategy for MEIs with a tAF of ~1% or higher. Somatic mutations of even lower frequencies could, theoretically, require an unrealistic level of sequencing depth for detection using bulk tissue sequencing. In applications such as genetic testing for disease-causing mutations, however, somatic MEIs present in 0.1% of cells or fewer likely have limited functional impact.
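The single-cell side of this comparison is simple arithmetic and can be checked with a short sketch. The snippet is illustrative only and not part of the RetroNet pipeline; the function names are hypothetical, and it assumes that each sequenced cell carries the insertion independently with probability equal to the tAF. The bulk-sequencing recall values come from Supplementary Note 3 and are not recomputed here.

```python
def equivalent_cells(bulk_depth, per_cell_depth=33):
    """Number of single cells affordable for the cost of one bulk run."""
    return bulk_depth // per_cell_depth


def prob_sampled_in_cells(taf, n_cells):
    """Probability that at least one of n_cells carries the insertion."""
    return 1.0 - (1.0 - taf) ** n_cells


for depth in (200, 400):
    n_cells = equivalent_cells(depth)
    chance = prob_sampled_in_cells(0.01, n_cells)
    print(f"{depth}x bulk ~ {n_cells} cells, P(sampled) = {chance:.3f}")
# 200x bulk ~ 6 cells, P(sampled) = 0.059
# 400x bulk ~ 12 cells, P(sampled) = 0.114
```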
The accuracy of detecting low-mosaicism MEIs using bulk sequencing depends on several intercorrelated factors, including sequencing depth, tAF of the MEI, the threshold for the number of supporting reads, and the ratio between the numbers of true and false insertions (i.e., the SNR). Among these, high sequencing depth is generally considered to be necessary for detecting MEIs at low tAF, at the expense of additional sequencing cost and specimen requirements. However, the benefit of increasing sequencing depth has limits. Somatic MEIs of very low mosaicism levels likely arise from retrotransposition events at a relatively late developmental stage, leading to narrow anatomical distributions. Sampling additional tissues for higher sequencing depth, therefore, will not guarantee an increase in the number of supporting reads if the additional tissues do not contain the somatic MEIs. Instead, the additional sequencing errors could lead to a worse signal-to-noise ratio. It is, therefore, essential to develop a tool that can reliably detect low-mosaicism MEIs without relying solely on high sequencing depth, and thus, the tool needs to identify MEIs with as few supporting reads as possible.
Lowering the threshold of supporting read number for calling somatic MEIs, however, could lead to much higher levels of false positive insertions from sequencing or mapping artifacts that, while occurring at low probabilities, could accumulate into a large number of events. We adapted RetroNet to fit a new model for classifying MEIs using a single supporting read instead of the default two reads and demonstrated that the two-read strategy showed significantly higher precision (2 reads: 0.970; 1 read: 0.829) in classifying candidate supporting reads in the Polaris dataset (P = 2.46e-3) (Supplementary Note 4 and Supplementary Fig. 10). Notably, in real-world scenarios, the single-read method would likely perform much worse due to the expanded imbalance between the signals and noise. A previous analysis estimated that by lowering the threshold from two supporting reads to one read with no additional filtering, the number of false positive L1 MEIs increased by ~80-fold16. When using 200× sequencing on bulk brain tissues for detecting somatic L1 insertions, for instance, we typically expect 1 to 10 true insertions and 100 to 1000 false positives (SNR ~1:100). Using the single-read method, however, the estimated SNR is 1:10000. As demonstrated in our data imbalance simulations, the weaker SNR suggests the one-supporting-read strategy on bulk, short-read WGS will likely be unsuitable for real-world applications.
The RetroNet pipeline still comes with several limitations. The primary one lies in the inherent bias of transfer learning from germline MEIs that could have accumulated more mutations and tend to have shorter poly(A) tails than somatic MEIs. We attempted to mitigate this bias by choosing the polymorphic, evolutionarily young MEIs for training, and by coding the known difference between bona fide retrotransposition hallmarks and false MEIs in the image-based encoding. We also excluded known features that would differ between germline and somatic MEIs, such as the flanking sequences of the MEI junction. As the number of experimentally validated somatic MEIs grows, the RetroNet framework could easily be fine-tuned using somatic MEIs only. Additional significant challenges arise from the short-read sequencing technology, including (1) complex retrotranspositions such as 5’ inversions and long or orphan transductions, (2) DNA structural variations that involve active mobile elements at the junctions and resemble retrotransposition, and (3) identification within repetitive genomic sequences. For instance, given RetroNet’s requirement for both supporting reads to align to the same ME strand to filter noise, it would miss low-mosaicism 5’ inversion events with <2 supporting reads covering either end (Supplementary Fig. 11a). In addition, when transduction occurs in L1 or SVA insertions, standard short-read sequencing with a span of ~800 bp may be insufficient to capture the ME sequence junction (Supplementary Fig. 11b).
Long-read technologies, such as PacBio HiFi sequencing65, allow for capturing the complete insertion sequences, which facilitates the direct identification of complex retrotranspositions, differentiation from structural variations, and improved detection in repetitive sequences66. However, current long-read sequencing technologies still have drawbacks, including higher sequencing costs and a notable rate of sequencing errors. These challenges are particularly significant when detecting low-tAF somatic mutations, which typically require substantial sequencing depth (e.g., >200×) and precise characterization of active mobile element hallmark alleles. Moreover, long-read sequencing is unsuitable for fragmented DNA such as ancient DNA (40–500 bp)70 or cell-free DNA (40-166 bp)71. As demonstrated in our benchmarking of the cfDNA from a tumor patient, short-read sequencing can effectively complement long-read sequencing technologies and is capable of identifying somatic MEIs as unique cancer biomarkers. Despite the limitations associated with transfer learning and sequencing technology, RetroNet represents a meaningful advancement over previous methods. It can be effectively applied to the abundant Illumina sequencing data that is currently available or being generated for a variety of human traits and diseases. Finally, the RetroNet framework can also serve as pre-trained models for other emerging high-quality short-read sequencing technologies such as Element AVITI, Ultima UG100, and PacBio Onso72.
Methods
This study complies with all relevant ethical regulations. The research was based on publicly available datasets, which were generated with informed consent under protocols approved by the respective institutional ethics committees. As no new data collection from human participants or animals was performed, further ethical approval was not required.
Datasets for training and benchmarking
High-depth 1000 genomes project
The expanded 1000 Genomes Project comprises high sequencing depth WGS data of 3,202 individuals, including 602 parent-offspring trios. Each individual’s sample was sequenced to achieve a target depth of 30× genome coverage, utilizing PCR-free technology on the Illumina NovaSeq 6000 platform under ENA accession PRJEB5507735. Notably, we excluded 53 trios from model training as they overlap with the benchmarking datasets. These include 49 trios sequenced in the Illumina Polaris dataset, and 4 individuals (NA19240, HG00733, HG00514, and NA12877) used in the genome mixing dataset. As a result, 549 family trios from the 1000 Genomes Project were used for model training.
Illumina polaris project
The Illumina Polaris Project encompasses WGS data from 49 parent-child trios, with the children forming the Kids Cohort under ENA accession PRJEB25009 and the parents originating from the Diversity Cohort under ENA accession PRJEB20654. All individuals were selected from the 1000 Genomes Project to represent a wide range of population diversity. Each individual’s sample was sequenced using PCR-free libraries, achieving an average depth of 30× genome coverage on the Illumina HiSeqX platform39.
Simulated genome mixing dataset
DNA from six different human genomes was mixed with the HapMap sample NA12878 (whose lineage MEIs are generally established) in precise proportions ranging from 0.04% to 25%, including (1) A1S heart at 0.04%, (2) NA19240 at 0.2%, (3) HG00733 at 1%, (4) HG00514 at 1%, (5) Brain somatic mosaicism network (BSMN) common brain at 5%, and (6) NA12877 at 25%. The mixed genomic DNA and the backbone NA12878 genomic DNA were sequenced with Illumina sequencing at a coverage of 200× and resampled to 50×, 100× and 400×. This dataset is available via the NIMH Data Archive, Collection 2458, Experiment 107216.
Simulated imbalanced datasets with different SNRs
We simulated datasets of somatic MEIs at various noise levels by sampling true and false images from the Polaris datasets with ratios of 1:1, 1:10, 1:100, and 1:1000, each repeated 100 times.
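One way to draw such imbalanced replicates is sketched below. This is an assumption-laden illustration rather than the authors' procedure: the function name, the number of true images per replicate, the toy image pools, and sampling without replacement are all hypothetical choices. Only the four ratios and the 100 repeats come from the text.

```python
import random


def sample_imbalanced(true_images, false_images, noise_ratio, n_true, rng):
    """Draw n_true true images and n_true * noise_ratio false images.

    Sampling is without replacement; returns shuffled (image, label) pairs.
    """
    n_false = n_true * noise_ratio
    if n_true > len(true_images) or n_false > len(false_images):
        raise ValueError("not enough images for the requested ratio")
    sampled = [(img, 1) for img in rng.sample(true_images, n_true)]
    sampled += [(img, 0) for img in rng.sample(false_images, n_false)]
    rng.shuffle(sampled)
    return sampled


rng = random.Random(0)  # fixed seed so the sketch is reproducible
true_pool = [f"true_{i}" for i in range(50)]       # stand-ins for true images
false_pool = [f"false_{i}" for i in range(20000)]  # stand-ins for false images

replicates = {
    ratio: [sample_imbalanced(true_pool, false_pool, ratio, 10, rng)
            for _ in range(100)]
    for ratio in (1, 10, 100, 1000)
}
print({ratio: len(reps[0]) for ratio, reps in replicates.items()})
# {1: 20, 10: 110, 100: 1010, 1000: 10010}
```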
Cancer cell line HG008-T
We processed four paired tumor-control sequencing datasets, including two Illumina sequencing datasets (ILMN-PCR-free-2 and ILMN-PCR-free-3) and two PacBio HiFi sequencing datasets (PB-HiFi-1 and PB-HiFi-2) from the Genome in a Bottle Consortium (https://42basepairs.com/browse/s3/giab/data_somatic/HG008/Liss_lab)40,41,42. The tumor genomic DNA was extracted from a pancreatic ductal adenocarcinoma tumor cell line (HG008-T) at passage number 23.
Cell-free DNA of mCRPC patient DTB-205
We analyzed sequencing data from time-matched samples, including metastatic tissue, cell-free DNA (cfDNA), and white blood cells, obtained from a metastatic castration-resistant prostate cancer (mCRPC) patient DTB-205. The dataset is publicly available through the European Genome-Phenome Archive under EGA accession EGAS0000100578343.
Labeling candidate MEIs in parent-offspring trios
Following the methods described by Zhu et al.16, candidate supporting reads were extracted using a modified RetroSeq pipeline25. To define a clean set of MEIs, we further removed alignments in non-primary chromosome assemblies, or within repetitive sequences that are prone to alignment errors, including telomere, centromere, and young, fixed genomic mobile elements with <10% sequence divergence73. Specifically for characterizing the short Alu insertions that are ~300 bp in length, since the supporting read alignment positions are always close and cannot help to distinguish signal from noise, we further filtered out reference Alu elements of lower than 20% divergence to avoid mistaking them as somatic insertions. For human reference genome version hg38, the excluded genome regions are 11.89% for L1 and SVA insertions and 20.63% for Alu insertions.
We labeled candidate MEIs in the 549 offspring of the trios from the 1000 Genomes Project and the 49 offspring from the Polaris project as true or false insertions based on the inheritance pattern. The true MEIs are those found in the offspring and their parent, while false MEIs are those absent in the parent. In addition, true MEIs (1) have >20 total supporting reads, including >1 split read, (2) have supporting reads for both the upstream junction and the downstream junction, with the downstream junction reads at a proportion between 10% and 90%, (3) have been previously annotated by MELT in the 1000 Genomes Project phase III MEI list (not DNA structural variations)74,75, and (4) are absent in one parent (not alignment errors or evolutionarily old MEIs with high population frequencies). False MEIs also have >1 total supporting reads and <6 paired-end reads, or when there are ≥6 paired-end reads, the downstream junction reads are below 10% or above 90% (not true de novo MEIs).
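The labeling rules above can be read as a small decision function. The sketch below is one possible interpretation under stated assumptions: the `Candidate` fields are hypothetical names, the downstream proportion is taken as downstream reads over all junction reads, the 10% and 90% bounds are treated as inclusive, and candidates matching neither rule are left unlabeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Per-candidate summary; all field names are hypothetical."""
    total_reads: int
    split_reads: int
    paired_end_reads: int
    upstream_reads: int
    downstream_reads: int
    parents_with_insertion: int  # 0, 1 or 2 parents carrying the insertion
    in_melt_list: bool           # annotated in the 1000 Genomes MELT MEI list


def downstream_fraction(c: Candidate) -> float:
    """Share of junction reads supporting the downstream junction (assumed definition)."""
    junction_reads = c.upstream_reads + c.downstream_reads
    return c.downstream_reads / junction_reads if junction_reads else 0.0


def label_candidate(c: Candidate) -> Optional[bool]:
    """Return True (true MEI), False (false MEI) or None (excluded from training)."""
    balanced = 0.10 <= downstream_fraction(c) <= 0.90
    if (c.parents_with_insertion == 1          # inherited, but absent in one parent
            and c.total_reads > 20 and c.split_reads > 1
            and c.upstream_reads > 0 and c.downstream_reads > 0
            and balanced and c.in_melt_list):
        return True
    if c.parents_with_insertion == 0 and c.total_reads > 1:
        # Weak paired-end support, or strong support with skewed junctions.
        if c.paired_end_reads < 6 or not balanced:
            return False
    return None
```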
The extra syntax used for encoding candidate MEI-supporting reads into images
Several additional encoding syntaxes were implemented to ensure a consistent sequence-to-image conversion. First, a maximum of five images of randomly selected pairs of supporting reads for each training MEI were generated to mitigate the risk of overrepresenting those with more supporting reads (e.g., homozygous over heterozygous MEIs). Second, the one-hot encoding of the ME consensus has its 5’-end on the left and the 3’-end on the right, consistent with the MEIs on the plus strand but opposite to the minus strand insertions. To ensure a consistent representation, we rotated the encoding of the minus-strand insertions, including the mappability track and the flank sequence arrows, by 180 degrees. Third, for any pair of supporting reads, read 1 is selected as the one with the flanking site more to the left, and read 2 as the one with the flanking site more to the right on the image, after adjusting for the insertion’s strandedness. Fourth, we also denoted the possible unmappable segments in the supporting reads (Supplementary Fig. 2). The clipped segment in a split read or clipped PE read is represented by a proportional black line, with the segment mappable to ME consensus denoted as a red line below. Unmappable segments in the read end that map to the ME consensus in PE supporting reads are displayed as a proportional black line above the ME sequence (Supplementary Fig. 2a). Other insertion gaps in the read-ME alignment are marked with blue pixels above the ME sequence at the corresponding insertion loci (Supplementary Fig. 2b). MEIs with inversions, which typically occur in L1 insertions, may have one L1 segment in clipped PE reads mapped to the L1 opposite strand, which is denoted by purple instead of red pixels in the ME consensus track (Supplementary Fig. 2c). Finally, we denote any unmappable segments supporting Alu insertion carrying 5’ or 3’ transduction as green lines. Since bona fide Alu retrotranspositions do not carry additional sequences from the source elements, these green lines can be learned as signatures of false junctions (Supplementary Fig. 2d).
The image encoding of the Alu and SVA insertion is similar to L1, except there are several active Alu and SVA subfamilies in the human genome (Supplementary Fig. 2e). For Alu insertions, we aligned the supporting reads to the consensus sequences of four active Alu subfamilies, AluYa, AluYb, AluYc, and AluYk58, and employed four tracks to portray the sequence alignments to each Alu subfamily. The encoded image for Alu comprises 21 tracks, where track 1 represents mappability, and tracks 2, 7, 12, 17 denote the relative positions and orientations if the corresponding read end is aligned in the flanking sequences. Tracks 3-6, 8-11, 13-16, and 18-21 represent the alignments of the four read ends to the four Alu consensus sequences. Similarly, we used SVA_E and SVA_F consensus sequences for SVA insertions, resulting in images of 13 tracks.
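The track numbering described above follows a regular pattern: one mappability track, then for each of the four read ends one flank-position track followed by one alignment track per subfamily consensus. The short sketch below reproduces that arithmetic; the function name and dictionary keys are illustrative assumptions, not part of RetroNet.

```python
def track_layout(n_subfamilies, n_read_ends=4):
    """Return 1-based track indices for an encoded MEI image.

    Track 1 is mappability; each read end contributes one flank-position
    track followed by n_subfamilies ME-alignment tracks.
    """
    layout = {"mappability": [1], "flank_position": [], "me_alignment": []}
    track = 2
    for _ in range(n_read_ends):
        layout["flank_position"].append(track)
        track += 1
        layout["me_alignment"].append(list(range(track, track + n_subfamilies)))
        track += n_subfamilies
    layout["n_tracks"] = track - 1
    return layout


alu = track_layout(4)  # four Alu subfamily consensus sequences
sva = track_layout(2)  # SVA_E and SVA_F
print(alu["n_tracks"], alu["flank_position"], sva["n_tracks"])
```

With four subfamilies the layout gives 21 tracks with flank-position tracks 2, 7, 12, and 17, and with two subfamilies it gives 13 tracks, matching the counts stated for Alu and SVA.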
Neural network training and validation
The images produced in the process above served as direct inputs for the CNN architectures ResNet-18 and GoogLeNet. To train the neural networks, we randomly divided the labeled images from the 1000 Genomes Project datasets into training and validation datasets at a 9:1 ratio. We then trained binary classifiers based on the ResNet-18 and GoogLeNet architectures (Supplementary Data 6), using the cross entropy between predicted probabilities and true class labels as the loss function. We set the initial learning rate to 0.001 and used the reduce-on-plateau method to adjust it dynamically: if the training loss remains stable for 5 epochs, the learning rate is multiplied by 0.1. The final output layer of the CNN model is a two-class Softmax layer. Each model was trained for 30 epochs, by which point the training loss had converged, and we chose the epoch with the lowest validation loss as the final model.
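The learning-rate rule and the model selection step can be illustrated without any deep learning framework. The sketch below is a plain-Python stand-in for a reduce-on-plateau scheduler and is not PyTorch's implementation, which may count stalled epochs slightly differently; the class name, the `min_delta` improvement threshold, and the helper `best_epoch` are assumptions introduced for illustration.

```python
class PlateauScheduler:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=0.001, factor=0.1, patience=5, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta  # hypothetical threshold for "stable" loss
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, training_loss):
        """Update the state with one epoch's training loss and return the current rate."""
        if training_loss < self.best - self.min_delta:
            self.best = training_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


def best_epoch(validation_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(validation_losses)), key=validation_losses.__getitem__) + 1
```

In a training loop, `step` would be called once per epoch with the training loss, and `best_epoch` applied to the recorded validation losses after the last epoch to pick the checkpoint to keep.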
In the third network architecture, ViT, the input images were initially divided into nine horizontal segments, each representing the mappability, or the position and ME alignment of one of the four read ends. Each partition was resized to an identical shape by padding blank spaces when necessary. The remaining model training process for the ViT network was the same as that used for ResNet-18 and GoogLeNet. All these processes were executed in the PyTorch environment (v1.13.1).
Benchmarking RetroNet in independent datasets
We benchmarked RetroNet in three datasets, including (1) germline MEIs generated from 49 parent-offspring trios in the Illumina Polaris cohort50, labeled in the same way as the training data; (2) simulated somatic MEIs with tAF between 0.2% and 25% in a genome mixing dataset in which the true MEIs were established previously16; and (3) simulated somatic MEIs in datasets spanning signal-to-noise image ratios of 1:1, 1:10, 1:100, and 1:1000. The model performance was evaluated using the following metrics: precision = true positive / (true positive + false positive), recall = true positive / (true positive + false negative), F1 score = 2 × precision × recall / (precision + recall), and area under the precision-recall curve (AUPR). For dataset 1 with 49 individuals and dataset 3 with 100 resampling repeats, we also reported the 95% confidence intervals of the AUPR metric.
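The metric definitions translate directly into code. In the sketch below the function names are illustrative, and `average_precision` is a common step-wise estimate of AUPR chosen here as an assumption, since the text does not state how the area was computed; tied scores are not treated specially.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def average_precision(scores, labels):
    """Step-wise AUPR estimate: mean precision at the rank of each true positive."""
    n_positive = sum(labels)
    if n_positive == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / n_positive
```

As a worked instance, the precision of 0.885 and recall of 0.579 reported for the cancer cell line correspond to an F1 score of about 0.70 under this formula.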
Neural network interpretation
We generated the class activation maps using the Grad-CAM algorithm68. The association between RetroNet’s predicted values (logit) and the alignment positions was evaluated in a generalized linear model: \(\mathrm{Logit}(x) = w_{0} + w_{1}\,\mathrm{L1}_{\mathrm{true}}(x) + w_{2}\,\mathrm{L1}_{\mathrm{false}}(x)\), in which \(x\) was the 5’- or 3’-junction position in the L1 sequence and \(\mathrm{L1}_{\mathrm{true}}\) and \(\mathrm{L1}_{\mathrm{false}}\) were the density values in histograms (bin = 50 bp) of the true or false L1 insertions, respectively. For the perturbation test assessing mutation impacts, for each base of L1Hs:3-6062 we randomly sampled 300 training images with one or both reads aligned across the targeted base, and then permuted the aligned base (i.e., the red pixels) to all three alternative bases. The three-base permutations, including L1Hs:5927-5929 ACA/G to GAG and L1Hs:6010-6012 TAG to TAA, were carried out in the same way as the single-base permutations.
Detecting somatic MEIs in the PacBio HiFi sequencing
We extracted sequencing reads with insertion mutations from the PB-HiFi-1 and PB-HiFi-2 datasets (combined average sequencing depth = 212×, combined NRPCC = 56×) for the pancreatic cancer cell line HG008-T42. Candidate MEI supporting reads were chosen as those with insert sequences that could be aligned to consensus sequences of L1 (L1Hs), Alu (AluYa5, AluYk13, AluYb8, AluYb9, AluYc1 and AluYa5a2), and SVA (SVA-E and SVA-F)58. The ME alignment was performed with minimap2 v2.17-r941 (minimap2 -a ME.fa insert.fa)74,76. Each candidate insertion was then inspected and annotated for the presence of mobile element sequence, inversion, 3′ transduction, and other retrotransposition features. We called somatic MEIs based on the criteria described below, using a merged BAM file of the tumor tissues (HG008-T) from PB-HiFi-1 and PB-HiFi-2, and a merged BAM file of the normal control tissues. Tumor somatic MEIs were identified as those with a minimum of two supporting reads in HG008-T, and no supporting reads in the control tissues. The identified somatic MEIs were further compared with PALMER (v2.0.1)66 and xTea_long (long-read module of xTea, v0.1.0)52 (Supplementary Note 2).
For L1 insertions, the supporting reads satisfy: (1) mapping quality ≥20; (2) mapping identity to L1Hs > 90% with an alignment length ≥50 bp; (3) the presence of at least one hallmark allele: ACA/G at L1Hs:5927-5929 or G at L1Hs:6012; (4) if transduction sequences were detected, their origin must be traceable to the flanking regions of active full-length reference or germline L1 elements; and (5) the presence of target site duplications (≥4 bp) and a polyA tail (≥10 bp), unless the alignment identity exceeded 95% and both hallmark alleles are present. For Alu insertions: (1) mapping quality ≥20; (2) mapping identity >90% with alignment length ≥50 bp; and (3) exclusion of reads containing additional non-Alu sequences, as Alu elements do not undergo transduction. For SVA insertions: (1) mapping quality ≥20; (2) mapping identity >90% with alignment length ≥50 bp; and (3) if transduction sequences were observed, their origin had to be traceable, as with L1.
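The L1 read criteria and the tumor-versus-control rule can be expressed as two predicates. The sketch below is an interpretation under stated assumptions: the `L1Read` fields are hypothetical summaries that would have to be derived from the alignments beforehand, identity is stored as a fraction, and only reads passing the filters count toward the two-read minimum in the tumor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class L1Read:
    """Pre-computed features of one candidate supporting read (hypothetical fields)."""
    mapq: int
    identity: float            # alignment identity to L1Hs, as a fraction
    aln_len: int               # alignment length in bp
    has_acr_5927: bool         # ACA or ACG at L1Hs:5927-5929
    has_g_6012: bool           # G at L1Hs:6012
    transduction_traceable: Optional[bool]  # None when no transduction was detected
    tsd_len: int               # target site duplication length in bp
    polya_len: int             # polyA tail length in bp


def passes_l1_filters(read: L1Read) -> bool:
    """Apply criteria (1) to (5) for L1 supporting reads."""
    if read.mapq < 20:
        return False
    if not (read.identity > 0.90 and read.aln_len >= 50):
        return False
    if not (read.has_acr_5927 or read.has_g_6012):
        return False
    if read.transduction_traceable is False:
        return False
    has_tprt_hallmarks = read.tsd_len >= 4 and read.polya_len >= 10
    exempt = read.identity > 0.95 and read.has_acr_5927 and read.has_g_6012
    return has_tprt_hallmarks or exempt


def is_tumor_somatic(tumor_reads, n_control_supporting_reads):
    """At least two passing tumor reads and no supporting reads in the control."""
    n_tumor = sum(passes_l1_filters(read) for read in tumor_reads)
    return n_tumor >= 2 and n_control_supporting_reads == 0
```

The Alu and SVA rules would follow the same shape with their own third criterion (no extra non-Alu sequence for Alu, traceable transduction for SVA).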
Detecting somatic MEIs from Illumina sequencing data by xTea and TraFiC-mem
We benchmarked RetroNet with xTea (short-read module, v0.1.9)52 in the analysis of the simulated genome mixing dataset, the pancreatic cancer cell line tumor sample HG008-T, and the metastatic castration-resistant prostate cancer patient DTB-205. In each dataset, the xTea analysis was performed in the case-control mode to identify somatic MEIs using the default parameters. For the genome mixing dataset, the control was chosen as the pure NA12878 genomic DNA sequencing with the corresponding depth of 50×, 100×, 200×, and 400×. For HG008-T, the sequencing of the normal tissue was chosen as the control. For DTB-205, we compared both the matching tumor and control white blood cells (WBCs) to identify the true tumor somatic MEIs, and compared the cfDNA to WBCs to evaluate the performance of xTea in cfDNA sequencing. We further included the tumor somatic MEIs, possibly missed by xTea due to insufficient supporting reads, by including those with supporting reads in the tumor that are compatible with those identified by RetroNet in cfDNA. These additional somatic MEIs were assessed using RetroVis16, with details listed at GitHub (https://github.com/Czhuofu/RetroNet) and Zenodo45.
In addition, we benchmarked another short-read-based MEI algorithm, TraFiC-mem (multispecies branch)18,19, which is compatible with sequencing alignments using the hg19 reference but not with the newer hg38 human reference genome. Among the various benchmarking datasets, only the HG008-T sequencing has hg19 alignments and has therefore been benchmarked with TraFiC-mem using the default parameters.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All datasets used for training and benchmarking in this study are publicly available. High-depth WGS data from the 1000 Genomes Project are available through the European Nucleotide Archive under ENA accession PRJEB5507735. WGS data from the Illumina Polaris Project are available under ENA accessions PRJEB25009 and PRJEB2065439. The simulated genome mixing dataset is available from the NIMH Data Archive, Collection ID 2458, Experiment ID 107216. Sequencing data for the HG008-T cancer cell line are available via the Genome in a Bottle Consortium (https://42basepairs.com/browse/s3/giab/data_somatic/HG008/Liss_lab)40,41,42. Sequencing data from mCRPC patient DTB-205 are available through the European Genome-Phenome Archive under EGA accession EGAS0000100578343. Further details can be found in the “Datasets for training and benchmarking” subsection of the Methods. Source data are provided with this paper.
Code availability
The code for the RetroNet is available at the GitHub repository (https://github.com/Czhuofu/RetroNet) and permanently archived with Zenodo (https://doi.org/10.5281/zenodo.16940326)45.
References
Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Brouha, B. et al. Hot L1s account for the bulk of retrotransposition in the human population. Proc. Natl. Acad. Sci. 100, 5280–5285 (2003).
Bennett, E. A. et al. Active Alu retrotransposons in the human genome. Genome Res. 18, 1875–1883 (2008).
Chu, C. et al. The landscape of human SVA retrotransposons. Nucleic Acids Res. 51, 11453–11465 (2023).
Luan, D. D., Korman, M. H., Jakubczak, J. L. & Eickbush, T. H. Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: A mechanism for non-LTR retrotransposition. Cell 72, 595–605 (1993).
Symer, D. E. et al. Human L1 retrotransposition is associated with genetic instability in vivo. Cell 110, 327–338 (2002).
Hancks, D. C. & Kazazian, H. H. Active human retrotransposons: variation and disease. Curr. Opin. Genet. Dev. 22, 191–203 (2012).
Kazazian, H. H. & Moran, J. V. Mobile DNA in health and disease. N. Engl. J. Med. 377, 361–370 (2017).
Vogt, J. et al. SVA retrotransposon insertion-associated deletion represents a novel mutational mechanism underlying large genomic copy number changes with non-recurrent breakpoints. Genome Biol. 15, R80 (2014).
Nam, C. H. et al. Widespread somatic L1 retrotransposition in normal colorectal epithelium. Nature 617, 540–547 (2023).
Lee, E. et al. Landscape of somatic retrotransposition in human cancers. Science 337, 967–971 (2012).
Muotri, A. R. et al. Somatic mosaicism in neuronal precursor cells mediated by L1 retrotransposition. Nature 435, 903–910 (2005).
Macia, A. et al. Engineered LINE-1 retrotransposition in nondividing human neurons. Genome Res. 27, 335–348 (2017).
Evrony, G. D. et al. Cell lineage analysis in human brain using endogenous retroelements. Neuron 85, 49–59 (2015).
Erwin, J. A. et al. L1-associated genomic regions are deleted in somatic cells of the healthy human brain. Nat. Neurosci. 19, 1583–1591 (2016).
Zhu, X. et al. Machine learning reveals bilateral distribution of somatic L1 insertions in human neurons and glia. Nat. Neurosci. 24, 186–196 (2021).
Bodea, G. O. et al. LINE-1 retrotransposons contribute to mouse PV interneuron development. Nat. Neurosci. 27, 1274–1284 (2024).
Tubio, J. M. C. et al. Extensive transduction of nonrepetitive DNA mediated by L1 retrotransposition in cancer genomes. Science 345, 1251343 (2014).
Rodriguez-Martin, B. et al. Pan-cancer analysis of whole genomes identifies driver rearrangements promoted by LINE-1 retrotransposition. Nat. Genet. 52, 306–319 (2020).
Shin, H. T. et al. Prevalence and detection of low-allele-fraction variants in clinical cancer samples. Nat. Commun. 8, 1377 (2017).
Evrony, G. D., Lee, E., Park, P. J. & Walsh, C. A. Resolving rates of mutation in the brain using single-neuron genomics. Elife 5, 1–32 (2016).
Baillie, J. K. et al. Somatic retrotransposition alters the genetic landscape of the human brain. Nature 479, 534–537 (2011).
Evrony, G. D. et al. Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell 151, 483–496 (2012).
Upton, K. R. et al. Ubiquitous L1 mosaicism in hippocampal neurons. Cell 161, 228–239 (2015).
Keane, T. M., Wong, K. & Adams, D. J. RetroSeq: transposable element discovery from next-generation sequencing data. Bioinformatics 29, 389–390 (2013).
Thung, D. T. et al. Mobster: accurate detection of mobile element insertions in next generation sequencing data. Genome Biol. 15, 488 (2014).
Zhuang, J., Wang, J., Theurkauf, W. & Weng, Z. TEMP: a computational method for analyzing transposable element polymorphism in populations. Nucleic Acids Res. 42, 6826–38 (2014).
Santander, C. G. et al. STEAK: a specific tool for transposable elements and retrovirus detection in high-throughput sequencing data. Virus Evol. 3, vex023 (2017).
Skowronski, J., Fanning, T. G. & Singer, M. F. Unit-length line-1 transcripts in human teratocarcinoma cells. Mol. Cell Biol. 8, 1385–1397 (1988).
Eberle, M. A. et al. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 27, 157–164 (2017).
Zou, J. et al. A primer on deep learning in genomics. Nat. Genet. 51, 12–18 (2019).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983 (2018).
Lin, J. et al. SVision: a deep learning approach to resolve complex structural variants. Nat. Methods 19, 1230–1233 (2022).
Popic, V. et al. Cue: a deep-learning framework for structural variant discovery and genotyping. Nat. Methods 20, 559–568 (2023).
Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 genomes project cohort including 602 trios. Cell 185, e19 (2022).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recognition. 770–778 (2016).
Szegedy, C. et al. Going deeper with convolutions. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recognition. 1–9 (2015).
Dosovitskiy, A. et al. An image is worth 16X16 words: transformers for image recognition at scale. In ICLR 2021 - 9th Intl Conf. Learning Representations. arXiv:2010.11929 (2021). Retrieved from https://arxiv.org/abs/2010.11929.
Dolzhenko, E. et al. Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Res. 27, 1895–1903 (2017).
Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
McDaniel, J. H. et al. Development and extensive sequencing of a broadly-consented Genome in a Bottle matched tumor-normal pair. Sci. Data 12, 1195 (2025).
Herberts, C. et al. Deep whole-genome ctDNA chronology of treatment-resistant prostate cancer. Nature 608, 199–208 (2022).
Saito, T. & Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 10, 1–21 (2015).
Tan, M. et al. RetroNet Version 1.0. Zenodo. https://doi.org/10.5281/zenodo.16940326 (2025).
Derrien, T. et al. Fast computation and applications of genome mappability. PLoS One 7, e30377 (2012).
Bao, W., Kojima, K. K. & Kohany, O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 11 (2015).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. In Proc. IEEE. 86, 2278–2324 (1998).
Vaswani, A. et al. Attention is all you need. In Adv. Neural Inform. Process. Syst. 6000–6010 (2017).
Bianco, S., Cadene, R., Celona, L. & Napoletano, P. Benchmark analysis of representative deep neural network architectures. IEEE Access 6, 64270–64277 (2018).
Faulkner, G. J. & Garcia-Perez, J. L. L1 Mosaicism in Mammals: extent, effects, and evolution. Trends Genet. 33, 802–816 (2017).
Chu, C. et al. Comprehensive identification of transposable element insertions using multiple sequencing technologies. Nat. Commun. 12, 3836 (2021).
MacDonald, J. R., Ziman, R., Yuen, R. K. C., Feuk, L. & Scherer, S. W. The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic Acids Res. 42, D986–992 (2014).
Cordaux, R. & Batzer, M. A. The impact of retrotransposons on human genome evolution. Nat. Rev. Genet. 10, 691–703 (2009).
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning Deep Features for Discriminative Localization. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recognition. 2921–2929 (2016).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 128, 336–359 (2020).
Szak, S. T. et al. Molecular archeology of L1 insertions in the human genome. Genome Biol. 3, research0052.1–0052.18 (2002).
Zingler, N. et al. Analysis of 5’ junctions of human LINE-1 and Alu retrotransposons suggests an alternative model for 5’-end attachment requiring microhomology-mediated end-joining. Genome Res. 15, 780–789 (2005).
Shinde, D., Lai, Y., Sun, F. & Arnheim, N. Taq DNA polymerase slippage mutation rates measured by PCR and quasi-likelihood analysis: (CA/GT)n and (A/T)n microsatellites. Nucleic Acids Res. 31, 974–980 (2003).
Boissinot, S. & Furano, A. V. Adaptive evolution in LINE-1 retrotransposons. Mol. Biol. Evol. 18, 2186–2194 (2001).
Salem, A. H. et al. LINE-1 preTa elements in the human genome. J. Mol. Biol. 326, 1127–1146 (2003).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Ebersberger, I., Metzler, D., Schwarz, C. & Pääbo, S. Genomewide comparison of DNA sequences between humans and chimpanzees. Am. J. Hum. Genet. 70, 1490–1497 (2002).
Beck, C. R. et al. LINE-1 retrotransposition activity in human genomes. Cell 141, 1159–1170 (2010).
Chaisson, M. J. P. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608–611 (2015).
Zhou, W. et al. Identification and characterization of occult human-specific LINE-1 insertions using long-read sequencing technology. Nucleic Acids Res. 48, 1146–1163 (2020).
Dentro, S. C. et al. Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes. Cell 184, 2239–2254 (2021).
Rozmahel, R. et al. Amplification of CFTR exon 9 sequences to multiple locations in the human genome. Genomics 45, 554–561 (1997).
Ejima, Y. & Yang, L. Trans mobilization of genomic DNA as a mechanism for retrotransposon-mediated exon shuffling. Hum. Mol. Genet. 12, 1321–1328 (2003).
Dabney, J., Meyer, M. & Pääbo, S. Ancient DNA damage. Cold Spring Harb. Perspect Biol. 5, a012567 (2013).
Underhill, H. R. et al. Fragment Length of Circulating Tumor DNA. PLoS Genet. 12, e1006162 (2016).
Eisenstein, M. Innovative technologies crowd the short-read sequencing market. Nature 614, 798–800 (2023).
Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0. http://www.repeatmasker.org.
Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).
1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37, 4572–4574 (2021).
Acknowledgements
We thank J.X. from Sun Yat-sen University, C.J. from Zhejiang University, and members of the Somatic Mosaicism across Human Tissues (SMaHT) Network for their constructive comments on the development of the methodology. We thank Joel E.K., T.H.H. and D.R. Weinberger from Lieber Institute for Brain Development for providing the BSMN common brain tissue, and Liana Fasching from Yale University for extracting the BSMN common brain DNA. We thank Alexander Wyatt from the University of British Columbia for sharing the Illumina sequencing datasets of the mCRPC patients. This work utilized computing resources provided by the Stanford Genetics Bioinformatics Service Center and the City University of Hong Kong high-performance computing (HPC) resources. This work was supported by Hong Kong RGS Early Career Scheme 21104822 (to X.Z.), Hong Kong RGS General Research Fund 11105123 (to X.Z.), Hong Kong Innovation and Technology Fund ITS/101/22 (to X.Z.), and the City University of Hong Kong New Faculty Fund 9610590 (to X.Z.). M.T. was supported by the Zhejiang Shuren University start-up fund 2023R048. Z.L. was supported by the Hong Kong Innovation & Tech. Fund 000834 and the City University of Hong Kong Institutional Research Tuition Scholarship 000782. Z.G. was supported by the Shanghai Sailing Program 23YF1446900 and the National Science Foundation of China 62202341. Open Access made possible with partial support from the Open Access Publishing Fund of the City University of Hong Kong.
Author information
Contributions
M.T.: Methodology, Benchmarking, Writing—Original Draft, Visualization. Z.L.: Methodology, Software, Neural network interpretation, Writing—Original Draft. Z.C.: Software, Data Curation. H.Z.: Methodology and software development, Analysis of the cell-free DNA datasets. J.P.: Software development and performance benchmarking. Z.H.: Methodology and software development. E.A.L.: Methodology, Benchmarking, and Writing. Z.G.: Methodology, Writing—Review & Editing. X.Z.: Conceptualization, Resources, Writing—Review & Editing, Data Curation, Funding acquisition. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Tan, M., Lin, Z., Chen, Z. et al. Image-based DNA sequencing encoding for detecting low-mosaicism somatic mobile element insertions. Nat Commun 16, 9195 (2025). https://doi.org/10.1038/s41467-025-64237-w
DOI: https://doi.org/10.1038/s41467-025-64237-w