Table 1 Full gene length results

From: Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization

(a) Memorization Test: Full gene length

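Acc. = Accuracy (%); F1 = F1-macro (%).

| Model | Acc. phylum | Acc. class | Acc. order | Acc. family | Acc. gene | F1 phylum | F1 class | F1 order | F1 family | F1 gene |
|---|---|---|---|---|---|---|---|---|---|---|
| 6mer Freq. | 57.6 | 45.8 | 35.9 | 27.5 | 29.3 | 30.4 | 21.1 | 19.6 | 15.6 | 34.5 |
| DeepMicrobes_family | 24.2 | 8.8 | 3.7 | 0.5 | 0.2 | 2.7 | 0.7 | 0.2 | 0.1 | 0.1 |
| DeepMicrobes_gene | 25.9 | 10.5 | 3.2 | 1.0 | 94.1 | 5.1 | 1.8 | 0.7 | 0.3 | 93.8 |
| BERTax^a | 66.4 | N/A | N/A | N/A | N/A | 16.9 | N/A | N/A | N/A | N/A |
| BERTax_Embedding^a | 77.4 | 63.0 | 44.7 | 34.3 | 11.6 | 60.6 | 50.7 | 41.6 | 33.0 | 11.0 |
| MMseqs2 | 93.3 | 89.8 | 79.7 | 61.5 | 97.4 | 79.4 | 54.8 | 47.0 | 31.2 | 98.2 |
| Kraken2 | 64.8 | 58.0 | 30.9 | 1.09 | N/A | 36.4 | 23.5 | 19.3 | 13.7 | N/A |
| BigBird | 71.2 | 58.4 | 42.6 | 32.2 | 28.2 | 48.4 | 37.3 | 31.8 | 25.6 | 27.4 |
| Scorpio-6Freq | 85.8 | 75.3 | 49.8 | 29.1 | 95.1 | 49.6 | 27.0 | 18.5 | 9.9 | 94.9 |
| Scorpio-BigEmbed | 86.2 | 76.9 | 59.3 | 41.5 | 89.6 | 60.1 | 38.2 | 30.5 | 19.2 | 88.9 |
| Scorpio-BigDynamic | 89.0 | 80.4 | 62.8 | 44.2 | 98.8 | 65.3 | 40.8 | 32.2 | 19.7 | 98.5 |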

(b) Generalization Test: Full gene length

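Taxonomy Generalization: phylum, class, order, and family columns. Gene Label Generalization: gene columns. Acc. = Accuracy (%); F1 = F1-macro (%).

| Model | Acc. phylum | Acc. class | Acc. order | Acc. family | F1 phylum | F1 class | F1 order | F1 family | Acc. gene | F1 gene |
|---|---|---|---|---|---|---|---|---|---|---|
| 6mer Freq. | 49.2 | 31.0 | 17.0 | 10.8 | 20.9 | 13.5 | 10.6 | 8.4 | 2.3 | 2.4 |
| DeepMicrobes_family | 25.4 | 9.2 | 4.1 | 0.5 | 2.4 | 0.6 | 0.1 | 0.0 | 0.2 | 0.1 |
| DeepMicrobes_gene | 19.4 | 7.5 | 1.9 | 0.6 | 3.6 | 1.1 | 0.4 | 0.2 | 87.8 | 89.0 |
| MMseqs2 | 4.3 | 2.7 | 1.1 | 0.5 | 2.2 | 1.4 | 0.9 | 0.5 | 87.3 | 90.8 |
| Kraken2 | 1.1 | 0.6 | 0.26 | 0.17 | 0.5 | 0.7 | 0.4 | 0.3 | N/A | N/A |
| BigBird | 64.0 | 47.1 | 29.0 | 20.4 | 36.7 | 27.6 | 22.4 | 17.8 | 7.4 | 7.1 |
| Scorpio-6Freq | 73.8 | 56.3 | 21.9 | 9.5 | 29.3 | 13.7 | 6.7 | 2.9 | 88.4 | 87.4 |
| Scorpio-BigEmbed | 62.5 | 41.8 | 17.2 | 8.2 | 24.2 | 13.1 | 8.0 | 5.0 | 68.9 | 66.1 |
| Scorpio-BigDynamic | 48.5 | 24.8 | 7.6 | 2.7 | 11.3 | 4.7 | 2.2 | 1.0 | 95.5 | 94.7 |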

(c) Memorization Test: Short fragment length

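Acc. = Accuracy (%); F1 = F1-macro (%).

| Model | Acc. phylum | Acc. class | Acc. order | Acc. family | Acc. gene | F1 phylum | F1 class | F1 order | F1 family | F1 gene |
|---|---|---|---|---|---|---|---|---|---|---|
| 6mer Freq. | 90.3 | 86.1 | 76.7 | 65.6 | 92.4 | 78.3 | 69.0 | 63.4 | 55.8 | 91.9 |
| BigBird | 72.4 | 62.7 | 50.1 | 41.5 | 55.6 | 53.6 | 44.0 | 40.1 | 35.5 | 54.9 |
| DeepMicrobes_family | 72.7 | 61.6 | 43.6 | 30.8 | 3.1 | 42.7 | 28.3 | 24.2 | 19.7 | 2.4 |
| DeepMicrobes_gene | 21.5 | 8.9 | 2.4 | 0.7 | 93.0 | 4.3 | 1.4 | 0.5 | 0.2 | 93.2 |
| BERTax^a | 76.4 | N/A | N/A | N/A | N/A | 22.9 | N/A | N/A | N/A | N/A |
| BERTax_Embedding^a | 55.2 | 38.4 | 20.9 | 13.0 | 15.5 | 27.9 | 16.8 | 12.0 | 8.4 | 14.8 |
| MMseqs2 | 94.8 | 92.3 | 84.1 | 70.7 | 97.7 | 85.0 | 75.8 | 69.4 | 57.9 | 97.2 |
| Kraken2 | 77.8 | 74.4 | 66.7 | 59.6 | N/A | 70.0 | 67.2 | 64.4 | 60.9 | N/A |
| Scorpio-BigEmbed | 76.6 | 66.1 | 49.8 | 38.9 | 74.0 | 54.3 | 42.4 | 37.1 | 31.4 | 74.5 |
| Scorpio-6Freq | 81.3 | 70.4 | 47.7 | 32.6 | 92.2 | 49.8 | 34.6 | 29.1 | 22.9 | 92.3 |
| Scorpio-BigDynamic | 91.0 | 83.4 | 63.3 | 45.8 | 98.8 | 73.7 | 53.3 | 42.9 | 32.8 | 98.9 |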

(d) Generalization Test: Short fragment length

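Taxonomic Generalization: phylum, class, order, and family columns. Gene Generalization: gene columns. Acc. = Accuracy (%); F1 = F1-macro (%).

| Model | Acc. phylum | Acc. class | Acc. order | Acc. family | F1 phylum | F1 class | F1 order | F1 family | Acc. gene | F1 gene |
|---|---|---|---|---|---|---|---|---|---|---|
| 6mer Freq. | 47.0 | 29.9 | 14.9 | 8.4 | 11.4 | 8.1 | 6.8 | 5.1 | 54.7 | 52.7 |
| BigBird | 51.4 | 33.9 | 17.8 | 10.6 | 14.3 | 11.7 | 9.8 | 7.5 | 16.5 | 14.4 |
| DeepMicrobes_family | 55.8 | 40.3 | 22.1 | 13.2 | 15.7 | 11.8 | 10.3 | 8.3 | 2.8 | 1.9 |
| DeepMicrobes_gene | 14.2 | 5.9 | 1.8 | 0.5 | 2.1 | 0.9 | 0.4 | 0.1 | 76.0 | 77.0 |
| MMseqs2 | 2.6 | 1.8 | 0.8 | 0.4 | 1.0 | 0.6 | 0.5 | 0.3 | 78.5 | 86.1 |
| Kraken2 | 0.93 | 0.63 | 0.27 | 0.11 | 0.2 | 0.1 | 0.09 | 0.06 | N/A | N/A |
| Scorpio-BigEmbed | 54.8 | 37.2 | 18.0 | 9.6 | 14.5 | 10.0 | 7.4 | 5.2 | 41.7 | 42.2 |
| Scorpio-6Freq | 50.0 | 31.1 | 9.4 | 4.0 | 10.7 | 6.1 | 3.1 | 1.5 | 72.4 | 73.5 |
| Scorpio-BigDynamic | 61.2 | 43.1 | 18.3 | 7.8 | 17.8 | 10.4 | 5.6 | 2.7 | 92.3 | 93.1 |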

  1. (a) Memorization Test: identification of additional examples of taxa and genes seen in training (Test Set). (b) Generalization Test: Taxonomy Generalization (Gene-Out Set) and Gene Label Generalization (Taxa-Out Set). Standard techniques such as MMseqs2 memorize the data well when identifying known classes, whereas Scorpio is competitive at classifying novel taxa, especially at higher taxonomic levels, and is also competitive for genes.
  2. Short fragment length (400 bp) results. (c) Memorization Test: identification of additional examples of taxa and genes seen in training (Test Set). (d) Generalization Test: Taxonomy Generalization (Gene-Out Set) and Gene Label Generalization (Taxa-Out Set). Again, Scorpio achieves the highest phylum-level accuracy and F1-macro on novel organisms and outperforms every other method at the gene level.
  3. ^a All models except BERTax were trained on the same dataset; for BERTax, we used a pre-trained version. Bold marks the best result and underline the second-best.
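
The panels report both accuracy and F1-macro, and the two can diverge sharply (for example, 57.6 versus 30.4 at the phylum level for 6mer Freq. in panel (a)). One common reason is that accuracy is dominated by frequent classes, while F1-macro weights every class equally. The sketch below illustrates that difference on toy data. It is a minimal illustration, not the authors' evaluation code: the function names, the toy labels, and the choice to average over every label seen in either the truth or the predictions are all assumptions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Assumption: classes are the union of labels in y_true and y_pred.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    classes = set(y_true) | set(y_pred)
    # Per-class F1 = 2*TP / (2*TP + FP + FN)
    scores = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes]
    return sum(scores) / len(scores)


# Hypothetical, imbalanced toy labels: the model always predicts class "A".
y_true = ["A"] * 8 + ["B", "C"]
y_pred = ["A"] * 10

acc = accuracy(y_true, y_pred)
f1 = f1_macro(y_true, y_pred)
print(f"accuracy = {acc:.3f}, F1-macro = {f1:.3f}")
```

In this toy case the majority class is always predicted correctly, so accuracy stays high, while the two rare classes each contribute an F1 of zero and pull the macro average down.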