Fig. 4

Impact of data curation on classifier performance. (A) Impact of Feature Scaling on Accuracy: For the NN, we show the average accuracy on the reserved fold of the training set for the transcript (left) and protein (right) datasets, using scaled (top) and unscaled (bottom) data, across a range of values for the parameters controlling model structure. (B) Impact of Feature Set Size on Accuracy: The average accuracy (solid line) and +/- one standard deviation (ribbons) are shown for model sizes ranging from 1–84 features for the transcript data for the five classifiers. These cross-validated accuracies were generated via recursive feature selection for NB, NN, SVM, and RF, whereas the GLM elastic net performs feature selection through its regularization parameters. (C) Biological variation has an outsized impact on model accuracy for certain model configurations. GLM parameter optimization across the transcript dataset shows that structural parameter values cause the GLM to perform very differently against the reserved fold (a) vs. the test set (b) vs. the validation set (c), the data gathered at a later date and by a different experimentalist. Note that against the reserved fold, there are highly accurate models even at high λ and high α values, which correspond to models with very few predictors (2–4). However, as these models are tested against unseen data in the test set and particularly in the validation set, the very minimal models begin to fail because of variation in the primary predictor, CCL5. To perform well across data batches, the classifier needs to retain more predictors (lower α and/or lower λ values).
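The λ/α sweep described in panel (C) can be illustrated with a short sketch. This is not the authors' code: it uses scikit-learn's elastic-net logistic regression (where l1_ratio plays the role of α and 1/C the role of λ), synthetic placeholder data in place of the transcript features, and a separately generated array standing in for the later validation batch. It shows how the same grid can be scored against a reserved cross-validation fold, a held-out test set, and a validation batch, while tracking how many predictors survive regularization.

```python
# Minimal sketch (assumed scikit-learn workflow, synthetic data; not the paper's pipeline).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 84))          # placeholder for 84 transcript features
y = rng.integers(0, 2, size=300)        # placeholder binary class labels
X_val = rng.normal(size=(60, 84))       # placeholder "validation" batch (later date)
y_val = rng.integers(0, 2, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Panel (A) analogue: fit the scaler on training data only, then apply everywhere.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
X_val_s = scaler.transform(X_val)

for l1_ratio in (0.1, 0.5, 0.9):        # analogue of α (L1/L2 mixing)
    for C in (0.01, 0.1, 1.0):          # analogue of 1/λ (regularization strength)
        clf = LogisticRegression(penalty="elasticnet", solver="saga",
                                 l1_ratio=l1_ratio, C=C, max_iter=5000)
        # Reserved-fold accuracy via cross-validation on the training set.
        cv_acc = cross_val_score(clf, X_train_s, y_train, cv=5).mean()
        clf.fit(X_train_s, y_train)
        n_kept = int(np.sum(clf.coef_ != 0))   # predictors surviving regularization
        print(f"alpha~{l1_ratio:.1f} 1/lambda~{C:<4} kept={n_kept:2d} "
              f"cv={cv_acc:.2f} test={clf.score(X_test_s, y_test):.2f} "
              f"val={clf.score(X_val_s, y_val):.2f}")
```

With real data, strongly regularized settings (high α, high λ) keep only a handful of predictors and can look accurate on the reserved fold yet degrade on the later validation batch, which is the behavior panel (C) highlights.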