Figure 1
From: Interpretable genotype-to-phenotype classifiers with performance guarantees

Summary of the PATRIC data. (a) Number of genomes and antibiotics for which data were extracted, shown by species. (b) Number of k-mers in each dataset (dots), shown by species. Low k-mer counts reflect populations with homogeneous genomes, whereas high counts indicate high genomic diversity. (c) For each dataset (dots), the number of examples (genomes) and features (k-mers) is shown, along with a measure of class imbalance. Some datasets contain many more examples of one class (resistant or susceptible) than of the other, and in every dataset the number of features far exceeds the number of examples. Together, these conditions make for challenging learning tasks.
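The quantities in panels (b) and (c) can be illustrated with a minimal sketch. It is not the authors' pipeline. The toy sequences, the value k = 4, the helper names `distinct_kmers` and `class_imbalance`, and the choice of majority-class fraction as the imbalance measure are all assumptions made for illustration; the caption does not specify how imbalance was computed. Reverse complements are ignored for simplicity.

```python
from collections import Counter


def distinct_kmers(genomes, k):
    """Return the set of distinct k-mers observed across all genomes."""
    kmers = set()
    for seq in genomes:
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def class_imbalance(labels):
    """Fraction of examples in the majority class (hypothetical measure).

    A value of 0.5 means two perfectly balanced classes.
    """
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Toy data (hypothetical): nearly identical genomes vs. unrelated genomes.
homogeneous = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
diverse = ["ACGTACGT", "TTGACCAG", "GGCATCGA"]
labels = ["resistant", "resistant", "susceptible"]

n_homog = len(distinct_kmers(homogeneous, k=4))
n_div = len(distinct_kmers(diverse, k=4))
imbalance = class_imbalance(labels)

print(n_homog, n_div, round(imbalance, 2))
```

In this toy setting the homogeneous population yields fewer distinct k-mers than the diverse one, mirroring the interpretation given for panel (b). Even with three genomes, the number of k-mer features already exceeds the number of examples.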