Fig. 1: Building a machine-learning model to predict essentiality and testing on the original GRACE collection.

a Overview of the input, output, and validation process of our random forest model. b Precision-recall curve of our random forest model evaluated on a held-out 20% of the GRACE gene set. The model was trained and optimized on the other 80% of the GRACE gene set. The default stringent cutoff score for essential gene predictions results in a precision of 0.73 and a recall of 0.63, with an average precision score of 0.77. The error bars reflect the standard deviation across estimates derived from 10,000 different resamplings (with replacement) of the test set. c Permutation feature importance of our random forest model for the whole GRACE gene set. The decrease in model score upon permutation of a feature reflects that feature’s importance, and the box plots show the variation in each feature’s importance across 30 permutations. The whiskers extend to 1.5 times the interquartile range, and the flier points are outliers beyond that range. S. cer represents S. cerevisiae. d Distribution of our random forest prediction scores across 6638 C. albicans genes. e Distribution of prediction scores for the 866 candidates selected for further experimental validation. Source data are provided as a Source Data file.
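
The resampling behind the error bars in panel b can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy labels and scores, and the cutoff of 0.5 are all hypothetical, and only the general procedure (resampling the test set with replacement and taking the standard deviation of the resulting estimates) comes from the caption.

```python
import math
import random
import statistics


def precision_recall(labels, scores, cutoff):
    """Precision and recall when genes scoring >= cutoff are called essential."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < cutoff and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


def bootstrap_sd(labels, scores, cutoff, n_resamples=10_000, seed=0):
    """Standard deviation of precision and recall over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(labels)
    precisions, recalls = [], []
    for _ in range(n_resamples):
        # Draw n test-set indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        p, r = precision_recall(
            [labels[i] for i in idx], [scores[i] for i in idx], cutoff
        )
        # Skip degenerate resamples with no predicted or no true positives.
        if not (math.isnan(p) or math.isnan(r)):
            precisions.append(p)
            recalls.append(r)
    return statistics.stdev(precisions), statistics.stdev(recalls)


# Hypothetical toy test set: 1 = essential, 0 = non-essential.
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
TOY_CUTOFF = 0.5  # assumed value, not the paper's cutoff

point_estimate = precision_recall(toy_labels, toy_scores, TOY_CUTOFF)
sd_precision, sd_recall = bootstrap_sd(
    toy_labels, toy_scores, TOY_CUTOFF, n_resamples=1000
)
```

Repeating the calculation at each cutoff along the curve would give one pair of standard deviations per point, which is one plausible way to obtain error bars like those in panel b.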