Fig. 3: TEtrimmer succeeds in removing lowly conserved regions from MSAs. | Nature Communications


From: TEtrimmer: a tool to automate the manual curation of transposable elements


The B. hordei LINE element rnd-1-family-34, identified by RepeatModeler2, was chosen as an example.

A The boundaries of the selected LINE sequence, identified after a BLASTN search against the B. hordei genome, sequence extension, and MSA generation, are indicated by TE start (red) and TE end (blue). Nucleotides are represented with colored bars (A, green; C, blue; G, black; T, red); gaps are shown as blank regions. Top panel: original MSA before cleaning, containing many gappy columns and noisy rows. Middle panel: after MSA column cleaning by the TEtrimmer function remove_gap_column, the majority of the gappy columns are removed. Bottom panel: TEtrimmer then cleans the sequences in the MSA row by row using the function crop_end_by_divergence, removing lowly conserved regions.

B Both panels show the magnified MSA region near the left boundary of the selected LINE element, indicated by TE start (red). The left panel shows the MSA before row cleaning, the right panel after row cleaning. Nucleotide background colors in the MSA mark sites where the proportion of the respective nucleotide is below 0.4.

C Ten TEs were randomly selected from the RepeatModeler2 consensus libraries of B. hordei, D. melanogaster, D. rerio, and O. sativa. After BLASTN searches of the selected sequences against the corresponding genomes, sequence extension, and MSA column cleaning, the generated MSAs were used for benchmarking. Manually cleaned MSAs served as the reference for assessing the TEtrimmer MSA cleaning performance. The MSAs were cleaned by the TEtrimmer function crop_end_by_divergence using cleaning thresholds ranging from 0 to 1 and a sliding window size of 40 bp, and the cleaning performance was evaluated by a confusion matrix analysis. The x-axis shows the cleaning threshold used by crop_end_by_divergence; the y-axis displays the confusion matrix score for the metrics sensitivity (green), precision (blue), and F1 score (orange). Data points depict the arithmetic means and error bars the standard error, both calculated from N = 10 TE sequences. The grey-shaded box indicates the threshold range in which all three metric scores are above 0.93. Raw data: Source data file.
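The three metrics plotted in panel C can be illustrated with a minimal Python sketch. This is not TEtrimmer code: the function names `confusion_scores` and `mean_and_sem` are hypothetical, and the sketch assumes that each MSA position is labelled True when it was removed, once by the manual reference and once by the automated cleaning. The caption does not state which class was treated as positive, so that choice is an assumption here.

```python
import math
import statistics


def confusion_scores(reference, predicted):
    """Return (sensitivity, precision, F1) for two equal-length boolean masks.

    Assumption: True means "position removed"; `reference` is the manual
    cleaning and `predicted` is the automated cleaning.
    """
    if len(reference) != len(predicted):
        raise ValueError("masks must have the same length")

    # Count the confusion matrix cells needed for the three metrics.
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)

    # Guard against empty denominators by reporting 0.0.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, precision, f1


def mean_and_sem(values):
    """Arithmetic mean and standard error of the mean (sample SD / sqrt(n))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values for a standard error")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


# Toy example with 8 positions: 2 true positives, 1 false positive,
# 1 false negative.
reference = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
sensitivity, precision, f1 = confusion_scores(reference, predicted)
```

Under these assumptions, calling `confusion_scores` once per TE at a given threshold and passing the ten resulting scores to `mean_and_sem` would give one data point with its error bar.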
