Fig. 5: Effect of sample size and DST data incompleteness on GAM and ML-GAM outputs. | Nature Communications

From: Enhanced diagnosis of multi-drug-resistant microbes using group association modeling and machine learning

a Effect of sample size on GAM and LMM true positive (TP) and false positive (FP) gene identifications. Y-axis breaks between 20 and 200. b Heatmap of mean PPV from model runs, each using a different random test set and seed (N = 10), for GAM and LMM for varying sample sizes. c Effect of missing data on GAM performance. a, c Solid and dashed lines represent nonlinear sigmoidal curves and their two-sided 95% confidence intervals, respectively. Data points display mean ± standard error values from model runs, each using a different random test set and seed (N = 10). d ML-GAM workflow for datasets with missing data. e ML training set size effect on GAM accuracy, indicating median (central line) and minimum and maximum range (box boundaries), and p-value from a 1-way ANOVA with Tukey’s multiple comparison test from model runs, each using a different random test set and seed (N = 30). f Effect of missing data on accurate GAM gene identification after adjusting data with ML models trained with different sample sizes, where the remaining samples are analyzed as the GAM test samples. Solid and dashed lines represent nonlinear sigmoidal curves and their two-sided 95% confidence intervals, respectively. Data points display mean ± standard error values from model runs, each using a different random test set and seed (N = 5). Source data are provided as a Source Data file.
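As a rough illustration of the summary statistics named in the caption, the sketch below computes a positive predictive value (PPV) for each model run and then reports the mean ± standard error across runs. It is a minimal example and not the authors' code: the function names and the (TP, FP) counts are hypothetical, and the PPV definition TP / (TP + FP) is the standard one rather than something stated in the caption.

```python
import math
import statistics


def ppv(true_positives: int, false_positives: int) -> float:
    """Positive predictive value, TP / (TP + FP); NaN when nothing was called positive."""
    total = true_positives + false_positives
    if total == 0:
        return float("nan")
    return true_positives / total


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) using the sample standard deviation."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical (TP, FP) gene-identification counts, one pair per model run.
# In the figure each run would use a different random test set and seed.
runs = [(8, 2), (9, 1), (7, 3), (8, 2), (10, 0)]

ppv_per_run = [ppv(tp, fp) for tp, fp in runs]
mean_ppv, sem_ppv = mean_and_standard_error(ppv_per_run)
print(f"PPV = {mean_ppv:.3f} ± {sem_ppv:.3f} (N = {len(runs)})")
```

The same `mean_and_standard_error` helper would apply to any per-run metric plotted as a data point with error bars; the sigmoidal fits and confidence bands in panels a, c and f are a separate step not sketched here.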
