Table 5. Overview of typical repeat classification methods.

From: Repetitive DNA sequence detection and its role in the human genome

Method type

Method name

Description/Characteristic

Advantages/Disadvantages

References

Homology-searching based

PASTECᵃ, REPCLASSᵇ, TEclassᶜ

These methods use homology search tools such as BLAST to compare input sequences against established repeat and protein-domain databases (e.g., Dfam, RepBase, Pfam), and classify each repeat according to the most similar known sequences.

Advantages: (1) They can accurately compare and classify repetitive elements according to known families and superfamilies. (2) These methods often include repeat masking, which helps reduce the impact of repetitive regions on downstream processes such as genome assembly or gene expression analysis.

Disadvantages: (1) These methods rely heavily on the availability and quality of reference databases. (2) Balancing sensitivity and specificity can be challenging. (3) The time and computational resources required can limit their practicality for some projects.

208,209,210

Deep Learning-based

DeepTEᵈ, TERLᵉ

These methods learn complex patterns and features directly from the data rather than relying on predefined rules or database lookups. This lets them capture subtle, non-linear relationships and potentially identify novel repeat elements.

Advantages: (1) Deep-learning models excel at detecting and classifying divergent repeat elements with low sequence similarity by capturing high-level abstract representations from input features. They therefore have the potential to uncover previously uncharacterized repeat families or variants. (2) Deep-learning models can generalize features and patterns from varied genomic data, potentially allowing them to transfer across species or genomic contexts. This broadens their applicability to a wider range of organisms.

Disadvantages: (1) Deep-learning models require large amounts of high-quality annotated training data to learn and generalize patterns effectively. (2) Training and deploying deep-learning models can require substantial computational resources.

212,213

  ᵃ http://urgi.versailles.inra.fr/Tools/PASTEClassifier
  ᵇ https://sourceforge.net/projects/repclass
  ᶜ https://www.bioinformatics.uni-muenster.de/tools/teclass
  ᵈ https://github.com/LiLabAtVT/DeepTE
  ᵉ https://github.com/muriloHoracio/TERL
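
The homology-searching row of the table can be illustrated with a minimal sketch. It is not how PASTEC, REPCLASS, or TEclass are implemented: shared k-mer content (Jaccard similarity) stands in for a real BLAST search, and the reference "database", its family labels, the function names, and the `min_score` threshold are all hypothetical values chosen for illustration. The sketch only shows the general idea of assigning a query to the most similar known entry, and why a query with no close database match stays unclassified.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical toy reference: made-up sequences and labels, NOT real
# consensus sequences from Dfam, RepBase, or any other database.
TOY_REFERENCE: Dict[str, str] = {
    "family_A": "ACGTACGTACGTACGTACGT",
    "family_B": "TTGACCTTGACCTTGACCTT",
}


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(
    query: str,
    reference: Dict[str, str],
    k: int = 4,
    min_score: float = 0.3,  # assumed cutoff, illustration only
) -> Tuple[Optional[str], float]:
    """Assign the query to the most similar reference entry.

    Returns (label, score); label is None when no reference entry
    reaches min_score, i.e. the query remains unclassified.
    """
    query_kmers = kmers(query, k)
    best_label: Optional[str] = None
    best_score = 0.0
    for label, ref_seq in reference.items():
        score = jaccard(query_kmers, kmers(ref_seq, k))
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_score:
        return None, best_score
    return best_label, best_score


if __name__ == "__main__":
    # A slightly diverged copy of family_A is still recognized.
    print(classify("ACGTACGTACGTACGA", TOY_REFERENCE))  # ('family_A', 0.8)
    # A sequence with no similar reference entry stays unclassified.
    print(classify("GGGGGGGGGG", TOY_REFERENCE))  # (None, 0.0)
```

The second call reflects the first disadvantage listed in the table: a repeat absent from the reference cannot be classified, whatever the quality of the search itself. Raising or lowering the assumed `min_score` trades specificity against sensitivity, which is the balance the table describes as challenging.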