Table 1 Full gene length results
From: Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization

(a) Memorization Test: Full gene length

Model | Accuracy (%): phylum | Accuracy (%): class | Accuracy (%): order | Accuracy (%): family | Accuracy (%): gene | F1-macro (%): phylum | F1-macro (%): class | F1-macro (%): order | F1-macro (%): family | F1-macro (%): gene |
|---|---|---|---|---|---|---|---|---|---|---|
6mer Freq. | 57.6 | 45.8 | 35.9 | 27.5 | 29.3 | 30.4 | 21.1 | 19.6 | 15.6 | 34.5 |
DeepMicrobes_family | 24.2 | 8.8 | 3.7 | 0.5 | 0.2 | 2.7 | 0.7 | 0.2 | 0.1 | 0.1 |
DeepMicrobes_gene | 25.9 | 10.5 | 3.2 | 1.0 | 94.1 | 5.1 | 1.8 | 0.7 | 0.3 | 93.8 |
BERTax^a | 66.4 | N/A | N/A | N/A | N/A | 16.9 | N/A | N/A | N/A | N/A |
BERTax_Embedding^a | 77.4 | 63.0 | 44.7 | 34.3 | 11.6 | 60.6 | 50.7 | 41.6 | 33.0 | 11.0 |
MMseqs2 | 93.3 | 89.8 | 79.7 | 61.5 | 97.4 | 79.4 | 54.8 | 47.0 | 31.2 | 98.2 |
Kraken2 | 64.8 | 58.0 | 30.9 | 1.09 | N/A | 36.4 | 23.5 | 19.3 | 13.7 | N/A |
BigBird | 71.2 | 58.4 | 42.6 | 32.2 | 28.2 | 48.4 | 37.3 | 31.8 | 25.6 | 27.4 |
Scorpio-6Freq | 85.8 | 75.3 | 49.8 | 29.1 | 95.1 | 49.6 | 27.0 | 18.5 | 9.9 | 94.9 |
Scorpio-BigEmbed | 86.2 | 76.9 | 59.3 | 41.5 | 89.6 | 60.1 | 38.2 | 30.5 | 19.2 | 88.9 |
Scorpio-BigDynamic | 89.0 | 80.4 | 62.8 | 44.2 | 98.8 | 65.3 | 40.8 | 32.2 | 19.7 | 98.5 |
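
(b) Generalization Test: Full gene length

Model | Taxonomy Generalization, Accuracy (%): phylum | Taxonomy Generalization, Accuracy (%): class | Taxonomy Generalization, Accuracy (%): order | Taxonomy Generalization, Accuracy (%): family | Taxonomy Generalization, F1-macro (%): phylum | Taxonomy Generalization, F1-macro (%): class | Taxonomy Generalization, F1-macro (%): order | Taxonomy Generalization, F1-macro (%): family | Gene Label Generalization, Accuracy (%): gene | Gene Label Generalization, F1-macro (%): gene |
|---|---|---|---|---|---|---|---|---|---|---|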
6mer Freq. | 49.2 | 31.0 | 17.0 | 10.8 | 20.9 | 13.5 | 10.6 | 8.4 | 2.3 | 2.4 |
DeepMicrobes_family | 25.4 | 9.2 | 4.1 | 0.5 | 2.4 | 0.6 | 0.1 | 0.0 | 0.2 | 0.1 |
DeepMicrobes_gene | 19.4 | 7.5 | 1.9 | 0.6 | 3.6 | 1.1 | 0.4 | 0.2 | 87.8 | 89.0 |
MMseqs2 | 4.3 | 2.7 | 1.1 | 0.5 | 2.2 | 1.4 | 0.9 | 0.5 | 87.3 | 90.8 |
Kraken2 | 1.1 | 0.6 | 0.26 | 0.17 | 0.5 | 0.7 | 0.4 | 0.3 | N/A | N/A |
BigBird | 64.0 | 47.1 | 29.0 | 20.4 | 36.7 | 27.6 | 22.4 | 17.8 | 7.4 | 7.1 |
Scorpio-6Freq | 73.8 | 56.3 | 21.9 | 9.5 | 29.3 | 13.7 | 6.7 | 2.9 | 88.4 | 87.4 |
Scorpio-BigEmbed | 62.5 | 41.8 | 17.2 | 8.2 | 24.2 | 13.1 | 8.0 | 5.0 | 68.9 | 66.1 |
Scorpio-BigDynamic | 48.5 | 24.8 | 7.6 | 2.7 | 11.3 | 4.7 | 2.2 | 1.0 | 95.5 | 94.7 |

(c) Memorization Test: Short fragment length

Model | Accuracy (%): phylum | Accuracy (%): class | Accuracy (%): order | Accuracy (%): family | Accuracy (%): gene | F1-macro (%): phylum | F1-macro (%): class | F1-macro (%): order | F1-macro (%): family | F1-macro (%): gene |
|---|---|---|---|---|---|---|---|---|---|---|
6mer Freq. | 90.3 | 86.1 | 76.7 | 65.6 | 92.4 | 78.3 | 69.0 | 63.4 | 55.8 | 91.9 |
BigBird | 72.4 | 62.7 | 50.1 | 41.5 | 55.6 | 53.6 | 44.0 | 40.1 | 35.5 | 54.9 |
DeepMicrobes_family | 72.7 | 61.6 | 43.6 | 30.8 | 3.1 | 42.7 | 28.3 | 24.2 | 19.7 | 2.4 |
DeepMicrobes_gene | 21.5 | 8.9 | 2.4 | 0.7 | 93.0 | 4.3 | 1.4 | 0.5 | 0.2 | 93.2 |
BERTax^a | 76.4 | N/A | N/A | N/A | N/A | 22.9 | N/A | N/A | N/A | N/A |
BERTax_Embedding^a | 55.2 | 38.4 | 20.9 | 13.0 | 15.5 | 27.9 | 16.8 | 12.0 | 8.4 | 14.8 |
MMseqs2 | 94.8 | 92.3 | 84.1 | 70.7 | 97.7 | 85.0 | 75.8 | 69.4 | 57.9 | 97.2 |
Kraken2 | 77.8 | 74.4 | 66.7 | 59.6 | N/A | 70.0 | 67.2 | 64.4 | 60.9 | N/A |
Scorpio-BigEmbed | 76.6 | 66.1 | 49.8 | 38.9 | 74.0 | 54.3 | 42.4 | 37.1 | 31.4 | 74.5 |
Scorpio-6Freq | 81.3 | 70.4 | 47.7 | 32.6 | 92.2 | 49.8 | 34.6 | 29.1 | 22.9 | 92.3 |
Scorpio-BigDynamic | 91.0 | 83.4 | 63.3 | 45.8 | 98.8 | 73.7 | 53.3 | 42.9 | 32.8 | 98.9 |
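
(d) Generalization Test: Short fragment length

Model | Taxonomic Generalization, Accuracy (%): phylum | Taxonomic Generalization, Accuracy (%): class | Taxonomic Generalization, Accuracy (%): order | Taxonomic Generalization, Accuracy (%): family | Taxonomic Generalization, F1-macro (%): phylum | Taxonomic Generalization, F1-macro (%): class | Taxonomic Generalization, F1-macro (%): order | Taxonomic Generalization, F1-macro (%): family | Gene Generalization, Accuracy (%): gene | Gene Generalization, F1-macro (%): gene |
|---|---|---|---|---|---|---|---|---|---|---|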
6mer Freq. | 47.0 | 29.9 | 14.9 | 8.4 | 11.4 | 8.1 | 6.8 | 5.1 | 54.7 | 52.7 |
BigBird | 51.4 | 33.9 | 17.8 | 10.6 | 14.3 | 11.7 | 9.8 | 7.5 | 16.5 | 14.4 |
DeepMicrobes_family | 55.8 | 40.3 | 22.1 | 13.2 | 15.7 | 11.8 | 10.3 | 8.3 | 2.8 | 1.9 |
DeepMicrobes_gene | 14.2 | 5.9 | 1.8 | 0.5 | 2.1 | 0.9 | 0.4 | 0.1 | 76.0 | 77.0 |
MMseqs2 | 2.6 | 1.8 | 0.8 | 0.4 | 1.0 | 0.6 | 0.5 | 0.3 | 78.5 | 86.1 |
Kraken2 | 0.93 | 0.63 | 0.27 | 0.11 | 0.2 | 0.1 | 0.09 | 0.06 | N/A | N/A |
Scorpio-BigEmbed | 54.8 | 37.2 | 18.0 | 9.6 | 14.5 | 10.0 | 7.4 | 5.2 | 41.7 | 42.2 |
Scorpio-6Freq | 50.0 | 31.1 | 9.4 | 4.0 | 10.7 | 6.1 | 3.1 | 1.5 | 72.4 | 73.5 |
Scorpio-BigDynamic | 61.2 | 43.1 | 18.3 | 7.8 | 17.8 | 10.4 | 5.6 | 2.7 | 92.3 | 93.1 |