Fig. 2: A comprehensive evaluation of feature prevalence and classifier efficacy across different feature selection algorithms on an ultra-sparse dataset. | npj Biofilms and Microbiomes

From: PreLect: Prevalence leveraged consistent feature selection decodes microbial signatures across cohorts

a Feature prevalence in the ‘equivalent size model.’ Distributions of feature prevalence when each algorithm is restricted to the same number of features. Embedded values are Cohen’s d for PreLect versus each benchmarking method; a positive value indicates that the features selected by PreLect have a higher prevalence than those chosen by the corresponding method. b Classifier performance in the ‘equivalent size model.’ AUC scores for each algorithm when the feature count is held equal across methods. c Feature prevalence in the ‘full feature size model.’ In contrast to panel a, this panel shows the prevalence of the features with non-zero importance selected by each algorithm. d Feature counts in the ‘full feature size model.’ The number of features chosen by each algorithm, plotted on a logarithmic scale, which indicates each method’s inherent feature selection tendency. e Classifier efficacy in the ‘full feature size model.’ AUC scores for each algorithm when no limit is imposed on feature count. The ‘equivalent size model’ limits each method to the number of features selected by PreLect, while the ‘full feature size model’ uses each method’s typical feature count. The prediction performance of all benchmarking methods was assessed using logistic regression with a 7/3 train-test split, and the reported scores were computed on the testing set. Detailed descriptions of the computational issues encountered are provided in Supplementary Note 3.
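
The two quantities behind panels a and c, feature prevalence and Cohen’s d, can be illustrated with a minimal Python sketch. It assumes that prevalence means the fraction of samples in which a feature is non-zero, and that Cohen’s d uses the pooled standard deviation; the caption does not state which variant the authors used. The function names `prevalence` and `cohens_d` and the toy numbers are hypothetical and are not taken from the PreLect package or the paper’s data.

```python
from math import sqrt
from statistics import mean, variance


def prevalence(counts):
    """Fraction of samples in which each feature is non-zero.

    counts: list of samples, each a list of per-feature abundances.
    (Assumed definition of prevalence; hypothetical helper.)
    """
    n_samples = len(counts)
    n_features = len(counts[0])
    return [
        sum(1 for row in counts if row[j] > 0) / n_samples
        for j in range(n_features)
    ]


def cohens_d(a, b):
    """Cohen's d for two groups using the pooled standard deviation.

    Positive when group a has the higher mean. Each group needs at
    least two values, and the pooled variance must be non-zero.
    """
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Toy abundance table: 4 samples (rows) x 4 features (columns).
toy_counts = [
    [5, 0, 1, 0],
    [3, 0, 0, 0],
    [2, 1, 4, 0],
    [0, 0, 2, 7],
]
toy_prevalence = prevalence(toy_counts)

# Made-up prevalence values for features picked by two selection methods.
method_a = [0.8, 0.6, 0.7]
method_b = [0.2, 0.4, 0.3]
d = cohens_d(method_a, method_b)
print(toy_prevalence, round(d, 3))
```

Under these assumptions, a positive `d` corresponds to the reading given in the caption: the first method’s selected features are, on average, more prevalent than the second’s.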