Fig. 4: Effect of Kraken2 and MMseqs2 thresholds on the number of classified instances.
From: Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization

As shown in (b), the number of classified instances varies across datasets for each method. In the “Gene Out” dataset, Kraken2 classified 2.8% of instances, while MMseqs2 classified 28%. We compared our model’s precision across these thresholds (c–e), with green representing Kraken2-like thresholds, blue representing MMseqs2-like thresholds, and red representing our thresholds. In (a), at a 2.8% classification rate (green), our Scorpio model achieved 94% precision, compared to Kraken2’s 43%. At a 28% classification rate (blue), Scorpio achieved 90% precision, while MMseqs2 achieved 50%. This analysis shows that our model maintains high precision on novel sequences when its classification rate is matched to that of each baseline.
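
The matched-rate comparison described above can be illustrated with a minimal sketch. It assumes each instance has a confidence score, a predicted label, and a true label. The function names and the toy data are hypothetical and do not come from the paper or from the Scorpio, Kraken2, or MMseqs2 code. The cutoff is chosen so that roughly a target fraction of instances is classified, and precision is then computed over the classified instances only.

```python
def threshold_for_rate(scores, rate):
    """Return the score cutoff that classifies roughly `rate` of all instances."""
    # Number of instances to keep; always keep at least one.
    k = max(1, round(rate * len(scores)))
    # The k-th highest score becomes the cutoff.
    return sorted(scores, reverse=True)[k - 1]


def precision_at_rate(scores, predicted, truth, rate):
    """Return (precision, achieved_rate) for predictions scoring at or above the cutoff."""
    cutoff = threshold_for_rate(scores, rate)
    # Tied scores at the cutoff are all kept, so the achieved rate can exceed `rate`.
    kept = [(p, t) for s, p, t in zip(scores, predicted, truth) if s >= cutoff]
    correct = sum(p == t for p, t in kept)
    return correct / len(kept), len(kept) / len(scores)


# Toy data (illustrative only, not results from the paper).
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
predicted = ["A", "B", "A", "C", "B", "A", "C", "B", "A", "C"]
truth = ["A", "B", "A", "B", "B", "C", "C", "A", "B", "C"]

for target in (0.2, 0.5):
    precision, achieved = precision_at_rate(scores, predicted, truth, target)
    print(f"target rate {target:.0%}: classified {achieved:.0%}, precision {precision:.0%}")
```

Fixing the classification rate first and then measuring precision lets two methods be compared at the same coverage, which is the kind of comparison the green and blue thresholds represent.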