Table 1 Area under the curve (AUC) for the prediction of cancer/control with different machine learning models using ONT-V1V9 and Emu’s Default database, depending on whether feature selection is automatic (Boruta, with prevalence thresholds of 10% and 30%) or manual.

From: Nanopore full length 16S rRNA gene sequencing increases species resolution in bacterial biomarker discovery

| Selection | Features | Feature count | AUC |
| --- | --- | --- | --- |
| Boruta top 10 (30%) | P. micra, A. butyriciproducens, A. cellulosilytica, O. timonensis, A. bacterium, R. timonensis, Streptococcus sp. A12, B. luti, Clostridium sp. BNL1100, S. variabile | 10 | 0.92 |
| Boruta top 10 (10%) | P. micra, A. cellulosilytica, A. rhamnosivorans, P. stomatis, A. butyriciproducens, P. anaerobius, P. stercorea, Candidatus Saccharibacteria bacterium oral taxon 957, O. timonensis, R. timonensis | 10 | 0.91 |
| Manual top 4 | F. nucleatum, P. micra, B. fragilis, A. butyriciproducens | 4 | 0.82 |
| Manual top 14 | F. nucleatum, P. micra, B. fragilis, A. butyriciproducens, P. stomatis, P. anaerobius, G. morbillorum, D. pneumosintes, S. wadsworthensis, C. perfringens, R. ilealis, P. clara, Longibaculum sp. KGMB06250, R. massiliensis | 14 | 0.87 |