Fig. 4: Optimization of variant detection as predictors for drug resistance.

a Schematic comparing prior knowledge requirements and accuracy of different approaches. b Boxplot of GAM + ML classification accuracy across model runs (N = 10), each using a different random test set and seed. Boxes depict the median (center bar) and the 25th and 75th percentiles (lower and upper box bounds); whiskers depict the minimum and maximum values. P-values were calculated from repeated-measures one-way ANOVAs, followed by Dunnett’s test for multiple comparisons, with each model compared against a Gradient Boosting reference model. c Workflow of the ML model using GAM variants as input. Calculated (d) positive predictive value (PPV), (e) specificity, and (f) sensitivity (error bars indicate two-sided 95% confidence intervals) of predictive approaches applied to DS1 for specific drug resistance using variants identified by GAM (blue); the 2021 (yellow) and 2023 (green) WHO interim criteria; and a gradient boosting model using GAM variants (red). Sample sizes for these comparisons varied according to the number of Mtb isolates with phenotype data for AMI (n = 10027), EMB (n = 8911), ETH (n = 9356), INH (n = 10025), KAN (n = 10085), LEV (n = 10114), MXF (n = 10139), and RIF (n = 10052). Source data are provided as a Source Data file.
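As an illustration of how the quantities in panels d to f can be derived from a confusion matrix, the following is a minimal Python sketch. The caption does not state which interval method was used, so the Wilson score interval here is an assumption, and the function names (`wilson_interval`, `diagnostic_metrics`) and the example counts are hypothetical rather than taken from the study.

```python
import math


def wilson_interval(successes, total, z=1.959963984540054):
    """Two-sided Wilson score interval for a binomial proportion.

    Assumption: the caption does not name the CI method; Wilson is used
    here only as a common choice. z defaults to the 97.5th normal
    percentile, giving a 95% interval.
    """
    if total == 0:
        return (float("nan"), float("nan"))
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def diagnostic_metrics(tp, fp, tn, fn):
    """Return {metric: (estimate, ci_low, ci_high)} from confusion-matrix counts.

    A "positive" is an isolate predicted resistant; the truth is the
    phenotypic resistance call.
    """
    counts = {
        "ppv": (tp, tp + fp),          # true positives / predicted positives
        "specificity": (tn, tn + fp),  # true negatives / phenotypically susceptible
        "sensitivity": (tp, tp + fn),  # true positives / phenotypically resistant
    }
    results = {}
    for name, (k, n) in counts.items():
        estimate = k / n if n else float("nan")
        results[name] = (estimate, *wilson_interval(k, n))
    return results


if __name__ == "__main__":
    # Hypothetical counts for one drug, for illustration only.
    for name, (est, lo, hi) in diagnostic_metrics(tp=90, fp=10, tn=880, fn=20).items():
        print(f"{name}: {est:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

Each metric is a simple proportion, so the same interval routine applies to all three; only the numerator and denominator change.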