Table 1 Full gene length results

(a) Memorization Test: Full gene length
	Accuracy (%)					F1-macro (%)
Model	phylum	class	order	family	gene	phylum	class	order	family	gene
6mer Freq.	57.6	45.8	35.9	27.5	29.3	30.4	21.1	19.6	15.6	34.5
DeepMicrobes_family	24.2	8.8	3.7	0.5	0.2	2.7	0.7	0.2	0.1	0.1
DeepMicrobes_gene	25.9	10.5	3.2	1.0	94.1	5.1	1.8	0.7	0.3	93.8
BERTax^a	66.4	N/A	N/A	N/A	N/A	16.9	N/A	N/A	N/A	N/A
BERTax_Embedding^a	77.4	63.0	44.7	34.3	11.6	60.6	50.7	41.6	33.0	11.0
MMseqs2	93.3	89.8	79.7	61.5	97.4	79.4	54.8	47.0	31.2	98.2
Kraken2	64.8	58.0	30.9	1.09	N/A	36.4	23.5	19.3	13.7	N/A
BigBird	71.2	58.4	42.6	32.2	28.2	48.4	37.3	31.8	25.6	27.4
Scorpio-6Freq	85.8	75.3	49.8	29.1	95.1	49.6	27.0	18.5	9.9	94.9
Scorpio-BigEmbed	86.2	76.9	59.3	41.5	89.6	60.1	38.2	30.5	19.2	88.9
Scorpio-BigDynamic	89.0	80.4	62.8	44.2	98.8	65.3	40.8	32.2	19.7	98.5

(b) Generalization Test: Full gene length
	Taxonomy Generalization								Gene Label Generalization
	Accuracy (%)				F1-macro (%)				Accuracy (%)	F1-macro (%)
Model	phylum	class	order	family	phylum	class	order	family	gene	gene
6mer Freq.	49.2	31.0	17.0	10.8	20.9	13.5	10.6	8.4	2.3	2.4
DeepMicrobes_family	25.4	9.2	4.1	0.5	2.4	0.6	0.1	0.0	0.2	0.1
DeepMicrobes_gene	19.4	7.5	1.9	0.6	3.6	1.1	0.4	0.2	87.8	89.0
MMseqs2	4.3	2.7	1.1	0.5	2.2	1.4	0.9	0.5	87.3	90.8
Kraken2	1.1	0.6	0.26	0.17	0.5	0.7	0.4	0.3	N/A	N/A
BigBird	64.0	47.1	29.0	20.4	36.7	27.6	22.4	17.8	7.4	7.1
Scorpio-6Freq	73.8	56.3	21.9	9.5	29.3	13.7	6.7	2.9	88.4	87.4
Scorpio-BigEmbed	62.5	41.8	17.2	8.2	24.2	13.1	8.0	5.0	68.9	66.1
Scorpio-BigDynamic	48.5	24.8	7.6	2.7	11.3	4.7	2.2	1.0	95.5	94.7

(c) Memorization Test: Short fragment length
	Accuracy (%)					F1-macro (%)
Model	phylum	class	order	family	gene	phylum	class	order	family	gene
6mer Freq.	90.3	86.1	76.7	65.6	92.4	78.3	69.0	63.4	55.8	91.9
BigBird	72.4	62.7	50.1	41.5	55.6	53.6	44.0	40.1	35.5	54.9
DeepMicrobes_family	72.7	61.6	43.6	30.8	3.1	42.7	28.3	24.2	19.7	2.4
DeepMicrobes_gene	21.5	8.9	2.4	0.7	93.0	4.3	1.4	0.5	0.2	93.2
BERTax^a	76.4	N/A	N/A	N/A	N/A	22.9	N/A	N/A	N/A	N/A
BERTax_Embedding^a	55.2	38.4	20.9	13.0	15.5	27.9	16.8	12.0	8.4	14.8
MMseqs2	94.8	92.3	84.1	70.7	97.7	85.0	75.8	69.4	57.9	97.2
Kraken2	77.8	74.4	66.7	59.6	N/A	70.0	67.2	64.4	60.9	N/A
Scorpio-BigEmbed	76.6	66.1	49.8	38.9	74.0	54.3	42.4	37.1	31.4	74.5
Scorpio-6Freq	81.3	70.4	47.7	32.6	92.2	49.8	34.6	29.1	22.9	92.3
Scorpio-BigDynamic	91.0	83.4	63.3	45.8	98.8	73.7	53.3	42.9	32.8	98.9

(d) Generalization Test: Short fragment length
	Taxonomy Generalization								Gene Label Generalization
	Accuracy (%)				F1-macro (%)				Accuracy (%)	F1-macro (%)
Model	phylum	class	order	family	phylum	class	order	family	gene	gene
6mer Freq.	47.0	29.9	14.9	8.4	11.4	8.1	6.8	5.1	54.7	52.7
BigBird	51.4	33.9	17.8	10.6	14.3	11.7	9.8	7.5	16.5	14.4
DeepMicrobes_family	55.8	40.3	22.1	13.2	15.7	11.8	10.3	8.3	2.8	1.9
DeepMicrobes_gene	14.2	5.9	1.8	0.5	2.1	0.9	0.4	0.1	76.0	77.0
MMseqs2	2.6	1.8	0.8	0.4	1.0	0.6	0.5	0.3	78.5	86.1
Kraken2	0.93	0.63	0.27	0.11	0.2	0.1	0.09	0.06	N/A	N/A
Scorpio-BigEmbed	54.8	37.2	18.0	9.6	14.5	10.0	7.4	5.2	41.7	42.2
Scorpio-6Freq	50.0	31.1	9.4	4.0	10.7	6.1	3.1	1.5	72.4	73.5
Scorpio-BigDynamic	61.2	43.1	18.3	7.8	17.8	10.4	5.6	2.7	92.3	93.1

Full gene length results: (a) Memorization Test: identification of additional examples of taxa and genes seen during training (Test Set); (b) Generalization Test: Taxonomy Generalization (Gene-Out Set) and Gene Label Generalization (Taxa-Out Set). Standard techniques such as MMseqs2 memorize the data well and identify known classes accurately, whereas Scorpio is competitive at classifying novel taxa, especially at higher taxonomic levels, and is also competitive at the gene level.
Short fragment length (400 bp) results: (c) Memorization Test: identification of additional examples of taxa and genes seen during training (Test Set); (d) Generalization Test: Taxonomy Generalization (Gene-Out Set) and Gene Label Generalization (Taxa-Out Set). Again, Scorpio performs best at classifying novel organisms at the phylum level and outperforms every other method at the gene level.
^aAll models, except for BERTax, were trained on the same dataset; for BERTax, we employed a pre-trained version. We use bold for the best and underline for the second-best results.
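
The panels report both accuracy and F1-macro, and the two can diverge sharply when classes are imbalanced (for example, MMseqs2 in panel (a) reaches 89.8% accuracy but 54.8% F1-macro at the class level). A minimal sketch of the two metrics may help in reading the table. The function names and toy labels below are illustrative assumptions, not the evaluation code used for these results; taking the classes as the union of true and predicted labels is likewise an assumption.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Assumption: classes are the union of true and predicted labels,
    and a class with no true or predicted members scores 0.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the truth was t
            fn[t] += 1  # true class t was missed
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: a dominant class "A" and a rare class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10  # the classifier never predicts the rare class

acc = accuracy(y_true, y_pred)
f1 = f1_macro(y_true, y_pred)
print(f"accuracy = {acc:.3f}, F1-macro = {f1:.3f}")
```

In this toy case the classifier is right on every common example, so accuracy stays high, while the rare class contributes an F1 of zero and pulls the macro average down. The same effect would explain rows in the table where accuracy is much higher than F1-macro.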
