Table 4. Introduction to typical repeat detection methods.

From: Repetitive DNA sequence detection and its role in the human genome

| Method type | Method name | Description/Characteristic | Advantages/Disadvantages | References |
| --- | --- | --- | --- | --- |
| Homology-based | Censor^a | Censor consists of RepBase, Perl, and C++ modules. It detects interspersed and tandem repeats through sequence similarity comparisons and analyzes repetitive sequences using RepBase Update. | Advantages: (1) Censor can automatically classify all known repeats and generate reports. (2) It has a high detection accuracy. (3) It offers online identification services (www.girinst.org/censor/help.html). Disadvantages: (1) It relies heavily on homology databases (RepBase, Dfam, etc.) and cannot discover novel repeats that are absent from them. (2) Using BLAST as the alignment algorithm often results in a long run time. (3) The completeness of the detection results often depends on the completeness of the homology databases. | 163, 225 |
| Homology-based | RepeatMasker^b | RepeatMasker is a well-known program that scans DNA sequences for interspersed repeats and low-complexity DNA sequences. It has introduced a new feature that allows the identification of repetitive elements within protein sequences. | Advantages: (1) Few false positives, with highly accurate and sensitive detection. (2) It does not impose restrictions on the number or length of input sequences. (3) It is versatile and can identify repetitive elements in both nucleotide and protein sequences. (4) The masked sequences it produces can be used for gene prediction. Disadvantages: (1) Analyzing large-scale genomic data requires long run times. (2) It relies heavily on homology databases (RepBase, Dfam, etc.), and the completeness of the detection results often depends on the completeness of those databases. | 226, 227 |
| Structure-based | LTRharvest^c | LTRharvest is a de novo detection algorithm used to detect full-length LTR elements in large sequence sets based on known features of LTR transposons, such as length, distance, and sequence motifs. | Advantages: (1) It allows flexible parameter settings. (2) High efficiency, with low memory and disk-space consumption. (3) It effectively annotates high-quality, nearly full-length LTR retrotransposons de novo. Disadvantages: (1) It cannot detect partial short LTR retrotransposon copies, solo LTRs, and certain nested elements. (2) It is unable to verify the presence of LTR retrotransposon-specific open reading frames (ORFs), primer binding sites, or polypurine tracts. | 168, 228 |
| Structure-based | SINE_scan^d | SINE_scan is a highly efficient structure-based algorithm for predicting SINEs in genomic DNA sequences by combining the hallmarks of SINE transposition, copy number, and structural signals. | Advantages: (1) It is flexible and robust for various purposes of SINE annotation and verification. (2) It provides a more comprehensive detection of SINEs in genomes and identifies a substantial number of new SINEs. Disadvantages: (1) The sensitivity of identification is much lower than that of other similar tools, such as SINE-Finder. (2) High rates of false discovery. | 173, 174 |
| De novo | RepeatScout^e | RepeatScout is a de novo identification algorithm that finds repeat families by extending consensus seeds, allowing for a precise determination of repeat boundaries. | Advantages: (1) The algorithm runs efficiently. (2) Its detection results are clean and accurate. Disadvantages: (1) The completeness of the detection results is usually unsatisfactory. (2) The algorithm cannot process more than 1 Gb of genomic sequence at a time. (3) The choice of l-mer size has a large effect on the detection results. | 187, 229 |
| De novo | RepLong^f | RepLong is a de novo method specifically designed for accurately identifying repeats in genomes by constructing overlap networks based on third-generation sequencing (TGS) long reads. | Advantages: (1) It can obtain repeats directly from TGS long reads alone. (2) Compared with existing de novo detection methods (e.g., RepARK and REPdenovo), it tends to recover repeats more completely. Disadvantages: (1) The algorithm usually consumes vast computing resources (CPU, memory, and disk space) and has a long run time. (2) Its detection accuracy is usually unsatisfactory. | 193, 230 |
| Hybrid framework | EDTA^g | The EDTA package is specifically designed to minimize false discoveries in raw TE candidates, enabling the creation of a high-quality, non-redundant TE library for comprehensive whole-genome TE annotations. These annotations contribute to a deeper understanding of TE diversity and evolution at both intra- and inter-species levels. | Advantages: (1) It demonstrates robustness across plant and animal species based on empirical evidence. (2) It is capable of deconvoluting nested TE insertions, which are commonly observed in highly repetitive genomic regions. Disadvantages: (1) It can be computationally intensive, requiring significant computational resources and time to process large genome datasets. (2) Although it is designed to filter out false discoveries, there is always a risk of false-positive or false-negative TE annotations. (3) Certain species or specific TE families may pose challenges or have limited support due to variations in TE sequence characteristics and complexities. | 205, 231 |
| Hybrid framework | RepeatMod2^h | RepeatModeler2 is a package designed to create reference TE libraries applicable to any eukaryotic species. It can generate libraries that accurately represent the known TE composition of three model species with highly intricate TE landscapes. | Advantages: (1) It can create TE libraries that effectively represent the known TE composition of model species with complex TE landscapes. (2) It offers a user-friendly interface, making it accessible to researchers without extensive bioinformatics expertise. Disadvantages: (1) It demands substantial computational resources, such as memory and processing power, especially when dealing with large genomes. (2) It relies heavily on existing databases of known TEs, which may limit its effectiveness for species with poorly characterized TE landscapes or novel TE families. | 206, 232 |

  1. ‘Hybrid frameworks’ are detection tools that combine multiple detection strategies and therefore usually cannot be assigned cleanly to any one of the three types above. ‘EDTA’ stands for the Extensive de novo TE Annotator. ‘RepeatMod2’ is the abbreviation of RepeatModeler2.
  2. ^a https://www.girinst.org/censor.
  3. ^b https://github.com/mmcco/RepeatScout.
  4. ^c https://github.com/oushujun/LTR_retriever.
  5. ^d https://github.com/oushujun/LTR_retriever.
  6. ^e https://github.com/maohlzj/SINEScan.
  7. ^f https://github.com/ruiguo-bio/replong.
  8. ^g https://github.com/oushujun/EDTA.
  9. ^h https://github.com/Dfam-consortium/RepeatModeler.
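
The RepeatScout entry in the table notes that the l-mer size strongly affects the detection results. The following is a minimal sketch of why that can happen, assuming plain exact l-mer counting as a stand-in for seed selection. It is not RepeatScout's actual implementation, and the function names `count_lmers` and `frequent_lmers` as well as the toy sequence are hypothetical, introduced only for illustration.

```python
from collections import Counter


def count_lmers(sequence: str, l: int) -> Counter:
    """Count every overlapping l-mer (substring of length l) in a DNA sequence."""
    sequence = sequence.upper()
    # range() is empty when l exceeds the sequence length, giving an empty Counter.
    return Counter(sequence[i:i + l] for i in range(len(sequence) - l + 1))


def frequent_lmers(sequence: str, l: int, min_count: int = 2) -> dict:
    """Return the l-mers seen at least min_count times.

    These serve as illustrative candidate seeds only; real tools apply
    further filtering and extension steps that are not modeled here.
    """
    counts = count_lmers(sequence, l)
    return {lmer: n for lmer, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Toy sequence (hypothetical): the 8-base unit ACGTTGCA occurs three
    # times, separated by short unique spacers.
    toy = "ACGTTGCA" + "GGG" + "ACGTTGCA" + "CCTA" + "ACGTTGCA"
    for l in (4, 8, 12):
        seeds = frequent_lmers(toy, l)
        print(f"l={l}: {len(seeds)} repeated l-mers")
```

On this toy input, l = 8 recovers exactly the repeated unit, whereas l = 12 finds no repeated l-mer at all because no 12-base substring occurs twice. Smaller values of l return several overlapping fragments of the unit instead of one clean seed.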