Table 1 Datasets from previous research are incorporated to evaluate the effectiveness of different encoding methods and their sensitivity to parameters such as the number of sequences and the maximum sequence length. To prevent potential biases from the datasets used in previous studies, a new dataset, DataSet0, is also included in this analysis. DataSet0 has not previously been used with any encoding technique, ensuring a fair comparison between the encoding methods being tested.

From: Comparative study of encoded and alignment-based methods for virus taxonomy classification

| Name | Description | Total seqs | Min. length | Max. length |
|---|---|---|---|---|
| DataSet0 | Viruses in the genera AlphaCoV and BetaCoV of coronaviruses, along with the subgenera of BetaCoV | 59 | 27165 | 31526 |
| DataSet1 | Viruses from the family Coronaviridae to classify SARS-CoV-2 | 56 | 25425 | 31686 |
| DataSet2 | Viruses in the genus BetaCoV to classify SARS-CoV-2 at the genus level | 50 | 29037 | 31491 |
| DataSet3 | Closely related coronaviruses from the seafood market | 69 | 27213 | 30311 |
| DataSet4 | Transmission modes of human coronaviruses originating from animals | 106 | 26883 | 31473 |
| DataSet5 | Virus genomes obtained from human SARS-CoV-2 viruses | 141 | 29674 | 29882 |
| DataSet6 | Genus within the Coronaviridae family, known to induce a range of severe diseases in the respiratory and gastrointestinal systems | 34 | 9646 | 31357 |
| DataSet7 | Influenza A viruses, which are single-stranded, segmented RNA viruses categorized according to their hemagglutinin and neuraminidase viral surface proteins | 38 | 1350 | 1467 |
| DataSet8 | Human rhinoviruses, which are the most common cause of upper respiratory tract infections | 116 | 6944 | 7458 |
| DataSet9 | HPV (Human Papillomavirus), a common sexually transmitted DNA virus responsible for cervical cancer and genital warts | 400 | 7814 | 10424 |
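
The three numeric columns of Table 1 (total sequences, minimum length, maximum length) are simple summary statistics over the sequences in each dataset. As a minimal sketch, assuming each dataset is available as a FASTA file (the source does not state the storage format), they could be computed as follows. The function names `read_fasta` and `summarize` and the toy records are hypothetical and are not taken from the study.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(lines: Iterable[str]) -> Dict[str, int]:
    """Return the sequence count and the min/max sequence length."""
    lengths = [len(seq) for _, seq in read_fasta(lines)]
    if not lengths:
        raise ValueError("no sequences found")
    return {
        "total_seqs": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
    }


# Toy records for illustration only; not drawn from any dataset in Table 1.
toy = [">seq_a", "ACGT", "ACG", ">seq_b", "ACGTACGTAC", ">seq_c", "AC"]
stats = summarize(toy)
print(stats)
```

For a real dataset, an open file handle can be passed in place of the list, for example `with open(path) as fh: summarize(fh)`, where `path` is a hypothetical location of the FASTA file.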