Figure 2

Random forest classification excluding data with gene mismatches. (A) Random forest analysis (RFA) for serotype classification when excluding genes for which the allelic notation process had <99% positive matches with the reference genome. (A , top) Density function of RFA scores obtained for each gene in the dataset. The 95% boundaries are marked by the dashed lines. Small bars highlight the RFA scores of genes within particular groups (red for serotype, green for SC genes). (A, bottom) Genomic position for each gene in the dataset against their RFA score (normalised to [0,1]). The circular genome is presented in a linear form on the y-axis, with the first gene being dnaA and the last gene parB. Red and green diamonds mark the top 2.5% ranking genes for serotype and Sequence Cluster classification, respectively. (B) RFA for Sequence Cluster classification; figure details the same as in A. Blue shaded areas in both A and B mark the capsular locus (genes within aliA and dexB). Green shaded areas in both A and B mark the genes contiguous and including the groESL operon (Table 1).