Fig. 4: Machine learning to develop a parsimonious biosignature for pediatric TB disease.
From: Plasma proteomics for biomarker discovery in childhood tuberculosis

a Absolute feature importance from a LASSO model for the ten most important features. b ROC curves for the best-scoring feature combinations on the test data (25%). Each curve represents the feature subset achieving the highest AUC among all combinations of 1 (n = 50), 2 (n = 1225), 3 (n = 19,600), 4 (n = 230,300), 5 (n = 2,118,760), and 6 (n = 15,890,700) features. The WHO target product profile (TPP) for a screening test (70% specificity and 90% sensitivity) is denoted by the black circle. c Barplot of the sensitivity achieved at 70% specificity for all 6 models. The dotted red line represents 90% sensitivity. d Venn diagram of the overlap in proteins among the 3-, 4-, 5-, and 6-protein models. e Dotplot showing the mean (dot) and standard deviation (line) of the proposed biosignature proteins (5- and 6-protein models) across individual patients from different TB classes. N-values represent the number of patients within each class. Colors denote the TB classes according to the NIH consensus definition. Each protein's abundance is normalized to its abundance in the Unlikely TB class. Source data are provided as a Source Data file.
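The subset counts quoted for panel b match the number of ways to choose 1 to 6 features from a pool of 50, and panel c reports sensitivity at a fixed 70% specificity. The following is a minimal sketch of both calculations, not the authors' pipeline: the function names, the pool size of 50 (inferred from n = 50 for single features), and the toy scores are assumptions made for illustration only.

```python
from math import comb


def subset_counts(n_features=50, max_size=6):
    """Number of distinct feature subsets of each size 1..max_size.

    n_features=50 is an assumption inferred from the caption (n = 50 for
    single-feature models).
    """
    return {k: comb(n_features, k) for k in range(1, max_size + 1)}


def sensitivity_at_specificity(labels, scores, min_specificity=0.70):
    """Highest sensitivity among thresholds with specificity >= min_specificity.

    labels: 1 for disease, 0 for no disease (hypothetical encoding).
    scores: model outputs; a sample is called positive if score >= threshold.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        specificity = sum(s < threshold for s in neg) / len(neg)
        sensitivity = sum(s >= threshold for s in pos) / len(pos)
        if specificity >= min_specificity:
            best = max(best, sensitivity)
    return best


if __name__ == "__main__":
    for size, count in subset_counts().items():
        print(f"{size} feature(s): {count:,} combinations")

    # Toy data, not taken from the study.
    toy_labels = [0, 0, 0, 0, 1, 1]
    toy_scores = [0.1, 0.2, 0.3, 0.8, 0.25, 0.9]
    print(sensitivity_at_specificity(toy_labels, toy_scores))
```

Under these assumptions, a model would meet the screening TPP shown in panel b when `sensitivity_at_specificity(...)` returns at least 0.90 at `min_specificity=0.70`.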