Figure 1

Random forest classification. (A) Random forest analysis (RFA) for serotype classification. (A, top) Density function of RFA scores obtained for each gene in the dataset. The 95% boundaries are marked by the dashed lines. Small bars highlight the RFA scores of genes within particular groups (yellow for MLST genes, blue for capsular locus genes). (A, bottom) Genomic position for each gene in the dataset against their RFA score (normalised to [0,1]). The circular genome is presented in a linear form on the y-axis, with the first gene being dnaA and the last gene parB. MLST genes are marked in yellow diamonds (spi, xpt, glkA, aroE, ddlA, tkt) and genes within the capsular locus with blue diamonds (pseudogenes tagged with ‘x’). (B) RFA analysis for sequence cluster classification; figure details the same as in A. Blue shaded areas in both A and B subplots mark the capsular locus (genes within aliA and dexB).