Table 4. Introduction to typical repeat detection methods.
From: Repetitive DNA sequence detection and its role in the human genome
Method type | Method name | Description/Characteristic | Advantages/Disadvantages | References |
|---|---|---|---|---|
Homology-based | Censor<sup>a</sup> | Censor consists of RepBase, Perl and C++ modules. It detects interspersed and tandem repeats through sequence similarity comparisons and analyzes repetitive sequences using RepBase Update. | Advantages: (1) Censor can automatically classify all known repeats and generate reports. (2) It has high detection accuracy. (3) It offers an online identification service (www.girinst.org/censor/help.html). Disadvantages: (1) It relies heavily on homology databases (RepBase, Dfam, etc.) and cannot discover novel repeats that have not been collected in them. (2) Using BLAST as the alignment algorithm often results in long run times. (3) The completeness of the detection results often depends on the completeness of the homology databases. | |
Homology-based | RepeatMasker<sup>b</sup> | RepeatMasker is a well-known program that scans DNA sequences for interspersed repeats and low-complexity DNA sequences. It has introduced a new feature that allows the identification of repetitive elements within protein sequences. | Advantages: (1) Fewer false positives, with highly accurate and sensitive detection. (2) It does not impose restrictions on the number or length of input sequences. (3) It is versatile and can identify repetitive elements in both nucleotide and protein sequences. (4) It can be used to predict genes from masked sequences. Disadvantages: (1) Long run times are required when analyzing large-scale genomic data. (2) It relies heavily on homology databases (RepBase, Dfam, etc.), and the completeness of the detection results often depends on the completeness of those databases. | |
Structure-based | LTRharvest<sup>c</sup> | LTRharvest is a de novo detection algorithm that finds full-length LTR elements in large sequence sets based on known features of LTR retrotransposons, such as length, distance, and sequence motifs. | Advantages: (1) It allows flexible parameter settings. (2) It is highly efficient, with low memory and disk-space consumption. (3) It effectively performs de novo annotation of high-quality, nearly full-length LTR retrotransposons. Disadvantages: (1) It cannot detect partial short LTR retrotransposon copies, solo LTRs, and certain nested elements. (2) It is unable to verify the presence of LTR retrotransposon-specific open reading frames (ORFs), primer binding sites, or polypurine tracts. | |
Structure-based | SINE_scan<sup>d</sup> | SINE_scan is a highly efficient structure-based algorithm for predicting SINEs in genomic DNA sequences by combining the hallmarks of SINE transposition, copy number, and structural signals. | Advantages: (1) It is flexible and robust for various SINE annotation and verification purposes. (2) It provides a more comprehensive detection of SINEs in genomes and identifies a substantial number of new SINEs. Disadvantages: (1) Its identification sensitivity is much lower than that of other similar tools, such as SINE-Finder. (2) It has a high false discovery rate. | |
De novo | RepeatScout<sup>e</sup> | RepeatScout is a de novo identification algorithm that finds repeat families by extending consensus seeds, allowing for a precise determination of repeat boundaries. | Advantages: (1) The algorithm runs efficiently. (2) The detection results are clean and accurate. Disadvantages: (1) The completeness of the detection results is usually unsatisfactory. (2) The algorithm cannot process more than 1 Gb of genomic sequence at a time. (3) Changes in the l-mer size have a large effect on the detection results. | |
De novo | RepLong<sup>f</sup> | RepLong is a de novo method specifically designed for accurately identifying repeats in genomes by constructing overlap networks based on third-generation sequencing (TGS) long reads. | Advantages: (1) It can obtain repeats directly from TGS long reads alone. (2) Compared with existing de novo detection methods (e.g., RepARK and REPdenovo), it tends to recover repeats more completely. Disadvantages: (1) The algorithm usually consumes vast computing resources (CPU, memory, and disk space) and has a long run time. (2) Its detection accuracy is usually unsatisfactory. | |
Hybrid framework | EDTA<sup>g</sup> | The EDTA package is specifically designed to minimize false discoveries in raw TE candidates, enabling the creation of a high-quality, non-redundant TE library for comprehensive whole-genome TE annotations. These annotations contribute to a deeper understanding of TE diversity and evolution at both intra- and inter-species levels. | Advantages: (1) It demonstrates robustness across plant and animal species based on empirical evidence. (2) It is capable of deconvoluting nested TE insertions, which are commonly observed in highly repetitive genomic regions. Disadvantages: (1) It can be computationally intensive, requiring significant computational resources and time to process large genome datasets. (2) Although it is designed to filter out false discoveries, there is always a risk of false-positive or false-negative TE annotations. (3) Certain species or specific TE families may pose challenges or have limited support because of variations in TE sequence characteristics and complexity. | |
Hybrid framework | RepeatModeler2<sup>h</sup> | RepeatModeler2 is a package designed to create reference TE libraries applicable to any eukaryotic species. It can generate libraries that accurately represent the known TE composition of three model species with highly intricate TE landscapes. | Advantages: (1) It can create TE libraries that effectively represent the known TE composition of model species with complex TE landscapes. (2) It offers a user-friendly interface, making it accessible to researchers without extensive bioinformatics expertise. Disadvantages: (1) It demands substantial computational resources, such as memory and processing power, especially when dealing with large genomes. (2) It relies heavily on existing databases of known TEs, which may limit its effectiveness for species with poorly characterized TE landscapes or novel TE families. | |
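
The l-mer sensitivity listed for RepeatScout in the table can be made concrete with a toy sketch. The Python snippet below is not RepeatScout's implementation; it only counts exact l-mers in a short made-up sequence to show how the choice of l changes which candidate seeds look frequent. The function name `frequent_lmers`, the `min_count` threshold, and the toy sequence are all assumptions introduced for illustration.

```python
from collections import Counter


def frequent_lmers(sequence, l, min_count=2):
    """Return the l-mers that occur at least min_count times.

    Hypothetical helper for illustration only: it counts exact,
    overlapping l-mers and does no seed extension or alignment.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + l] for i in range(len(seq) - l + 1))
    return {lmer: n for lmer, n in counts.items() if n >= min_count}


# Made-up toy sequence: an 8-bp unit repeated three times,
# separated by two different 3-bp spacers.
toy = "ACGTTGCA" + "GGG" + "ACGTTGCA" + "CCC" + "ACGTTGCA"

for l in (4, 8, 10):
    print(l, frequent_lmers(toy, l))
```

With this toy input, l = 4 splits the 8-bp repeat unit into five overlapping frequent fragments, l = 8 recovers the unit as a single seed, and l = 10 finds nothing because every 10-bp window includes unique spacer bases.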