Introduction

Prior to the advent of Next Generation Sequencing (NGS) and Third Generation Sequencing (TGS) platforms, the analysis of microbial compositions relied on traditional bacteriological procedures. However, the high proportion of viable but non-culturable or difficult-to-cultivate microorganisms in the biosphere complicates the analysis of microbiome compositions1. In fact, more than 99% of the potentially \(10^{11}\)–\(10^{12}\) microbial species on Earth remain undiscovered2. High-throughput sequencing methods made it possible, for the first time, to identify \(\sim\)100% of the genetic material of all microorganisms in a given sample without culturing. For this reason, 16S rRNA gene sequencing is now the gold-standard method for microbiome studies and the main driver behind the increasing volume of studies characterizing environmental and animal microbial communities3,4,5,6.

Multiple studies have investigated the involvement of these communities in human diseases, such as colorectal cancer (CRC)7,8,9,10,11,12, which stands out as one of the most commonly diagnosed tumors worldwide13. In Spain in particular, CRC has the highest incidence among tumors, with over 40,000 new cases throughout 202314. Moreover, CRC is the second leading cause of cancer mortality, with more than 15,000 deaths reported, according to the latest annual report of the Spanish Association Against Cancer14. The CRC screening programs implemented worldwide predominantly use a guaiac-based fecal occult blood test (gFOBT) or a fecal immunochemical test (FIT), conducted biennially, given that colon cancer development is a slow and gradual process15,16,17. The implementation of these screenings makes it possible to: (a) increase the number of annual diagnoses, (b) improve CRC survival statistics and (c) achieve lower mortality rates17. However, it is important to highlight that the detection of blood in fecal matter does not necessarily indicate the presence of a carcinoma, and that a colonoscopy (an invasive technique) is required to confirm or rule out the FOBT or FIT result. Therefore, there is an urgent need to develop novel screening methods that are more specific and non-invasive, in order to enhance the detection of intestinal lesions even at precancerous stages.

Interestingly, it is widely recognized that several gut microbes can have harmful effects on colonocyte integrity and homeostasis18,19. Multiple in vitro and in vivo experiments over recent years have demonstrated that specific bacteria, such as pks+ Escherichia coli, enterotoxigenic strains of Bacteroides fragilis (ETBF), Parvimonas micra or Fusobacterium nucleatum, are strongly associated with the colorectal tumorigenesis process20,21,22,23,24,25. More precisely, microorganisms have the potential to directly and indirectly: (a) affect epithelial permeability (e.g., by modulating the expression of tight junction proteins)26,27, (b) promote chronic tissue inflammation (e.g., through the secretion of toxins, enhancing bacterial adherence to epithelial cells)28, (c) deregulate host anti-tumoral immune activities (e.g., Fusobacterium nucleatum through the production of certain adhesins such as Fap2, affecting the function of natural killer T cells)25, (d) trigger chromosomal instability and DNA mutagenesis (e.g., ETBF, which promotes the hyperproduction of reactive oxygen species via the BFT toxin, or pks+ E. coli, a pathogen that induces a characteristic mutational signature via the colibactin molecule)21,29,30, (e) alter eukaryotic DNA methylation patterns (e.g., Parvimonas micra promotes hypermethylation in genes related to cytoskeleton regulation and tumor suppression)24,28,29 and (f) modulate several cell-signaling pathways (e.g., E-cadherin/\(\beta\)-catenin, TLR4/MYD88/NF-\(\kappa\)B or SMO/RAS/p38 MAPK)18,31,32. In turn, these deleterious activities lead to hyperproliferation, senescence, tumor growth and invasiveness, additionally inducing the initial stages of the metastatic process18,20,21,23,26,27,28,29,33,34. Consequently, tumor progression and the effectiveness of anti-cancer therapies could be directly correlated with the gut bacteriome established in each patient26,35,36.

Accordingly, multiple studies have put forth specific non-invasive microbiome-derived biomarkers for CRC during the preceding decade. These investigations have demonstrated the potential to differentiate individuals with malignant dysplasias from those without lesions through a simple analysis of their fecal material, utilizing 16S ribosomal ribonucleic acid (16S rRNA) gene sequencing or quantitative PCR (qPCR) procedures7,8,9,10,11,12.

Multiple 16S rRNA sequencing options exist to study the microbial communities involved in diseases such as CRC and to generate disease-related bacterial biomarkers. For example, Oxford Nanopore Technologies (ONT), with its long-read sequencing method that can generate >10 kb sequences directly from native DNA37,38,39, has positioned itself as a compelling alternative to short-read technologies such as Illumina and to other long-read technologies such as PacBio for 16S rRNA analysis37. In comparison to Illumina, which is restricted to sequencing short regions of the 16S rRNA gene such as V3V4 (\(\sim\)400 nt) and to identification mostly at the genus level, ONT can cover the full gene6,39 (\(\sim\)1500 nt, V1V9) and identify reads at the species level more consistently. Additionally, the barrier to entry is lower with ONT, as its entry-level sequencers cost much less, which is an advantage in less well-resourced environments. However, the main limitation is the relatively higher error rate of ONT reads in comparison to other technologies. Its chemistry and basecallers, such as Dorado, are constantly being improved and will presumably reach a similar quality eventually, having recently achieved Q20, and even Q25+ in a small proportion of reads (in contrast to the consistent Q30+ of Illumina and PacBio). To put this into perspective, Q20 indicates an error rate of 1% (Q15 = \(\sim\)3% error rate), which is the threshold needed to confidently assign an OTU (Operational Taxonomic Unit) to a specific species in full-length 16S rRNA40.
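The relationship between Phred quality scores and error rates quoted above follows from the standard definition \(Q = -10\log_{10}(p)\), where \(p\) is the per-base error probability. The short Python sketch below is included only as an illustration of that arithmetic; the function names and the 1500 nt read length used for the expected-error estimate are our own illustrative choices, not part of any analysis pipeline.

```python
import math

def phred_to_error(q: float) -> float:
    """Convert a Phred quality score Q into a per-base error probability."""
    return 10 ** (-q / 10)

def error_to_phred(p: float) -> float:
    """Convert a per-base error probability back into a Phred quality score."""
    return -10 * math.log10(p)

def expected_errors(q: float, read_length: int) -> float:
    """Expected number of miscalled bases in a read of uniform quality Q."""
    return phred_to_error(q) * read_length

# Quality levels mentioned in the text, for a full-length (~1500 nt) 16S read.
for q in (15, 18, 20, 25, 30):
    print(f"Q{q}: error rate {phred_to_error(q):.2%}, "
          f"about {expected_errors(q, 1500):.1f} expected errors per 1500 nt read")
```

Under this definition, a full-length 16S read at Q20 is expected to carry roughly 15 miscalled bases, which gives a sense of why species-level assignment is sensitive to read quality.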

This limitation has led to different bioinformatic approaches being used for each technology. While the quality of Illumina and PacBio reads allows for the creation of precise ASVs (Amplicon Sequence Variants), typically through DADA241,42, this method is not suited to the current quality profile of ONT reads, which has prompted the development of dedicated tools such as Emu43 or NanoClust44. Nonetheless, future advancements in ONT’s technology might allow for the use of procedures typically reserved for Illumina or PacBio.

Thus, the present study evaluates the potential of ONT to improve upon Illumina’s prevailing 16S rRNA analysis through the use of a considerable cohort of subjects (n = 123), examined previously with Illumina12 and composed of colorectal cancer patients and their healthy counterparts, in order to obtain more precise CRC biomarkers. Two different sequencing systems and approaches, Illumina-V3V4 and ONT-V1V9 (R10.4.1 chemistry), three Dorado basecalling models (fast, hac or High Accuracy, sup or Super-accurate; v4.1.0) and two databases (SILVA and Emu’s Default database) were compared.

Methods

Recruitment of participants and fecal sampling

All volunteers provided signed informed consent prior to the start of the sample collection phase, which was performed at the University Hospital of A Coruña (HUAC; Galicia, Spain). Strict adherence to clinical guidelines and regulations was maintained throughout the recruitment period (Research Ethical Committee of Galicia, Spain: code CEIm-G 2018/609). A total of 93 subjects diagnosed with CRC were enrolled in the project between 2019 and 2022. The inclusion criteria were as previously described12: (a) no antibiotic treatment within the last month, (b) no infectious disease, (c) no chemotherapy and/or radiotherapy treatments prior to sample collection, (d) no genetic predisposition to CRC development, (e) no intestinal inflammatory disorders, (f) no immunological diseases, (g) no medical history of transplantation, (h) not currently undergoing immunosuppressive treatment. Moreover, a total of 30 cancer-free companions or partners of CRC patients were asked to participate in the study, meeting the same inclusion requirements as the CRC group. A preliminary interview with each volunteer was conducted to gather individual data (e.g., age, sex, weight or height) and lifestyle habits (e.g., dietary patterns or physical activity). Samples (n = 123), each containing approximately 20 mL of fecal material, were self-collected by each participant at home and preserved in 10 mL of RNAlater reagent (Thermo Fisher Scientific, Waltham, MA, USA) in cold storage (\(\sim\)4\(^\circ\)C) for 1-2 days, and were then stored at -80\(^\circ\)C until DNA extraction. Samples were normally collected months after the colonoscopy, so that the microbiome profile could recover, and shortly before the pre-surgery diet and the surgery itself. In rare cases where surgery was urgent, they were collected at least 15 days before surgery.

DNA extraction

As previously outlined12,45, fecal samples underwent a brief pre-processing protocol prior to the DNA extraction procedure. The MasterPure™ Complete DNA/RNA Purification Kit (Epicentre, USA) was used in accordance with the manufacturer’s instructions.

16S rRNA metabarcoding sequencing

Illumina (MiSeq™) 16S rRNA V3-V4

Two highly variable regions within the 16S rRNA gene (V3-V4) were selectively amplified through PCR, employing 5’ TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG as forward primer and 5’ GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC as reverse primer12,45. Nuclease-free water was included as a negative control in each PCR reaction to monitor for bacterial contamination. Subsequently, libraries were constructed by following the Illumina 16S Metagenomic Sequencing Library Preparation protocol (Illumina, San Diego, CA, USA). Library concentration was measured using a Qubit dsDNA HS Assay Kit (Invitrogen, USA) and a Qubit 2.0 fluorometer (Invitrogen, USA).

Libraries were pooled and diluted to a final concentration of 10 pM and mixed with 20% of 10 pM PhiX control (Illumina, USA). Samples were finally sequenced using a MiSeq Reagent Kit v3 (600 cycles) (Illumina, USA) and a MiSeq platform (Illumina, USA).

Oxford nanopore (MinION™) 16S rRNA V1-V9

The complete set of bacterial 16S rRNA hypervariable regions (V1–V9) was amplified using the following forward and reverse primers: 5’ AGMGTTYGATYMYGGCTCAG and 5’ TACGGYTACCTTGTTACGACTT, respectively. Negative controls (nuclease-free water) were included to monitor for contamination. For each PCR reaction, 200 fmol of fecal DNA was used. Afterwards, Oxford Nanopore DNA libraries were constructed by following the Native Barcoding Kit 96 (SQK-NBD114.96) protocol (Oxford Nanopore, Oxford, UK). As with the Illumina libraries, DNA concentration was assessed using a Qubit 2.0 fluorometer with the corresponding Qubit dsDNA HS Assay Kit (Invitrogen, USA).
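Both V1–V9 primers contain IUPAC ambiguity codes (M, Y), so each one represents a pool of concrete oligonucleotides. The following Python sketch, added purely for illustration, counts and enumerates the sequences encoded by a degenerate primer. The IUPAC table is the standard one; the function and variable names are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degeneracy(primer: str) -> int:
    """Number of distinct sequences encoded by a degenerate primer."""
    count = 1
    for base in primer:
        count *= len(IUPAC[base])
    return count

def expand(primer: str) -> list[str]:
    """Enumerate every concrete sequence encoded by a degenerate primer."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer))]

# Primer sequences as given in the text (5' to 3').
V1V9_FORWARD = "AGMGTTYGATYMYGGCTCAG"
V1V9_REVERSE = "TACGGYTACCTTGTTACGACTT"

print(degeneracy(V1V9_FORWARD), "forward variants")
print(expand(V1V9_REVERSE))
```

The same functions apply to the V3-V4 primers, whose gene-specific portions contain the codes N, W, H and V.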

Pooled libraries were loaded onto R10.4.1 Flow Cells (FLO-MIN114) and sequenced for 72 h following the manufacturer’s instructions, detailed in the Native Barcoding Kit 96 protocol (Oxford Nanopore, Oxford, UK). Two sequencing runs were performed, with an average of 63 samples in each one.

Bioinformatic analysis

Oxford Nanopore Technologies (ONT) reads were first basecalled using Dorado duplex (v. 0.5.3)46 with three models: fast, hac and sup (dna, r10.4.1, e8.2, 400bps, v4.1.0). The resulting reads were then separated into simplex reads, which were demultiplexed using Dorado (kit SQK-NBD114-96), and duplex reads (“consensus” of two parental simplex reads). Afterwards, duplex reads whose parental reads had different barcodes or whose lengths differed by more than 50% were removed. The resulting reads were quality controlled with chopper (v. 0.5.0)47, trimming 20 nt from the front and back and selecting reads between 1200 and 1900 nt with a minimum average quality score that depended on the basecalling model (fast: Q7; hac and sup: Q12). Additionally, Duplex Tools (v. 0.2.9)48 was used to detect and remove reads with mid-strand adapters. Host contamination was assessed with Kraken2 using its Standard 64Gb database (https://benlangmead.github.io/aws-indexes/k2)49. Lastly, reads were identified using Emu (v. 3.4.5)43 with its Default database (rrnDB v. 5.650 combined with NCBI 16S RefSeq51,52) and SILVA (v. 138.1)53,54.
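The read-level filters described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the chopper or Dorado implementation. The thresholds (20 nt trimming, 1200-1900 nt, Q7/Q12) come from the text. The order of operations, the averaging of quality in error-probability space and the definition of the 50% length difference (relative to the longer parental read) are assumptions, and all function names are hypothetical.

```python
import math

# Minimum mean quality per basecalling model, as stated in the text.
MIN_MEAN_Q = {"fast": 7, "hac": 12, "sup": 12}

def mean_quality(phred_scores):
    """Mean read quality, averaged in error-probability space (assumption)."""
    probs = [10 ** (-q / 10) for q in phred_scores]
    return -10 * math.log10(sum(probs) / len(probs))

def filter_read(seq, quals, model, trim=20, min_len=1200, max_len=1900):
    """Trim both ends, then apply the length and quality filters.

    Returns the trimmed (seq, quals) pair, or None if the read is rejected.
    The order of operations is an assumption made for this sketch.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality values must have equal length")
    if len(seq) <= 2 * trim:
        return None
    seq = seq[trim:len(seq) - trim]
    quals = quals[trim:len(quals) - trim]
    if not min_len <= len(seq) <= max_len:
        return None
    if mean_quality(quals) < MIN_MEAN_Q[model]:
        return None
    return seq, quals

def keep_duplex(parent_a, parent_b, max_rel_diff=0.5):
    """Keep a duplex read only if both parents share a barcode and a similar length.

    Each parent is a (barcode, length) tuple. The length difference is taken
    relative to the longer parent, which is an assumption of this sketch.
    """
    (barcode_a, len_a), (barcode_b, len_b) = parent_a, parent_b
    if barcode_a != barcode_b:
        return False
    return abs(len_a - len_b) / max(len_a, len_b) <= max_rel_diff
```

Averaging in probability space means that a few low-quality bases pull the mean down more strongly than an arithmetic mean of Phred scores would, which is the more conservative choice for a quality filter.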

Illumina reads were analyzed through QIIME 2 (v. 2021.11)55, where DADA241 was used to trim, denoise, correct sequencing errors and remove chimeras on a per sequencing run basis, producing Amplicon Sequence Variants (ASVs), which were then classified using a feature classifier created with RESCRIPt (v. 2021.11.0)56 and SILVA (v. 138.1). Additionally, contamination in these ASVs was also assessed with Kraken2’s Standard 64Gb database.

Results from both ONT and Illumina were merged and analyzed in R (v. 4.2.0)57, mainly through Phyloseq (v. 1.42.0)58 for data management, ANCOM-BC (v. 2.0.1)59 for differential abundance analysis (prevalence cutoff of 10%, adjusting significance by Holm-Bonferroni60) and microbiome (v. 1.20.0)61 for centered log-ratio (CLR) abundance normalization. In order to assess \(\beta\)-diversity differences, a PERMANOVA analysis was performed through adonis262 on the Jensen-Shannon distance (JSD), with multi-dimensional scaling (MDS) used for ordination. Additionally, pairwise comparisons were conducted using Wilcoxon rank-sum tests (WRST), adjusting significance for multiple comparisons using Holm-Bonferroni. Significance values across analyses are represented as * (\(p\le 0.05\)), ** (\(p\le 0.01\)) or *** (\(p\le 0.001\)).
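For readers less familiar with these statistics, the three building blocks named above (CLR normalization, the Jensen-Shannon distance and the Holm-Bonferroni adjustment) are sketched below in plain Python. This is an illustration of the textbook definitions only, not the code of the R packages used. The pseudocount for zero counts and the use of the square root of the base-2 divergence as the distance are assumptions, and the R implementations may handle these details differently.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's taxon counts.

    A pseudocount is added to handle zeros (assumption of this sketch).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]

def jensen_shannon_distance(p_counts, q_counts):
    """Square root of the base-2 Jensen-Shannon divergence between two samples."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    p = [c / p_total for c in p_counts]
    q = [c / q_total for c in q_counts]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(divergence, 0.0))

def holm_bonferroni(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, value)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

print(clr([10, 0, 5, 100]))
print(jensen_shannon_distance([1, 0], [0, 1]))
print(holm_bonferroni([0.01, 0.04, 0.03]))
```

CLR values sum to zero within each sample, and with base-2 logarithms the distance is bounded between 0 (identical composition) and 1 (no shared taxa).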

For biomarker identification, automated feature selection was performed using the Boruta algorithm63, with two prevalence cutoffs of 10% and 30%. Additionally, combinations of manually selected features were tested using a Random Forest machine learning algorithm and evaluated using the leave-one-out cross-validation method. The performance of the model was expressed through the area under the receiver operating characteristic curve (AUC) value.
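The evaluation scheme described above (leave-one-out cross-validation summarized by an AUC) can be illustrated with a small self-contained Python sketch. To avoid external dependencies, the classifier here is a simple nearest-centroid scorer; it is a stand-in and NOT the Random Forest used in the study. Pooling the held-out scores into a single AUC is also an assumption about how the metric is aggregated, and the toy data are invented for illustration only.

```python
import math

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half (equivalent to the Mann-Whitney U statistic).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def centroid_score(train_x, train_y, x):
    """Stand-in scorer (not a Random Forest): higher means closer to the cancer centroid."""
    cancer = centroid([r for r, y in zip(train_x, train_y) if y == 1])
    control = centroid([r for r, y in zip(train_x, train_y) if y == 0])
    return math.dist(x, control) - math.dist(x, cancer)

def loocv_auc(features, labels, scorer=centroid_score):
    """Score each sample with a model trained on all the others, then pool the scores."""
    scores = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        scores.append(scorer(train_x, train_y, features[i]))
    return auc(labels, scores)

# Invented toy data: two CLR-like features per sample, 1 = cancer, 0 = control.
features = [[2.1, 1.8], [1.9, 2.2], [2.4, 1.5], [-0.2, 0.1], [0.3, -0.1], [0.0, 0.2]]
labels = [1, 1, 1, 0, 0, 0]
print(loocv_auc(features, labels))
```

Because each sample is scored by a model that never saw it, the pooled AUC is less optimistic than an AUC computed on the training data, which matters for a cohort of this size.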

Results

Quality control of Illumina reads resulted in a median of 32,104 reads per sample, with 90% of those being Q30 and 95% being Q20, generating ASVs with a median length of 418 nt. Meanwhile, quality control of ONT reads provided a total median of \(\sim\)109k reads per sample and a median length of 1480 nt. Median average quality varied across basecalling models, with the sup model achieving Q18 and some reads above Q30 (Fig. 1A). The duplex rate for sup was on average 8 ± 2.73%, of which 3.76 ± 1.61% were filtered. Rarefaction curves reached a plateau at the species level for both ONT and Illumina (data not shown).

Fig. 1

Comparison of ONT-V1V9 basecalling models. (A) Distribution of the average quality of reads per basecalling model. (B) \(\beta\)-diversity analysis using SILVA database and MDS+JSD (Multidimensional Scaling and Jensen-Shannon Divergence). The same samples with different models are connected by lines. (C) \(\alpha\)-diversity analysis using SILVA and Emu’s Default database. Significance levels are given through Wilcoxon rank sum tests and comparisons between groups are not shown.

Comparison of ONT-V1V9 samples regarding \(\beta\)-diversity (MDS and Jensen-Shannon Divergence) revealed no differences between basecalling models (Fig. 1B, p-value\(\ge\)0.8 in PERMANOVA), but did show significant differences between databases (p-value<0.001 in PERMANOVA). \(\alpha\)-diversity was significantly different for the fast model, which showed higher values of observed features (Fig. 1C, p-value<0.05 in WRST). Additionally, Emu’s Default database also resulted in significantly higher observed features when compared to SILVA (Fig. 1C, p-value<0.05 in WRST). Moreover, significant differences were observed in the percentage of reads identified differently with respect to the sup model at each taxonomic level (p-value<0.001 in WRST). For example, at the species level, using the SILVA database, the identification of fast reads differed from sup by a median of 9.3%, while hac differed by 3.47% (Supplementary Fig. S1). Differences at higher levels such as family were smaller, with a median of 3.67% and 1.20% for fast and hac, respectively. Interestingly, Emu’s Default database had a higher standard deviation than SILVA in this measurement. Consequently, the sup model (Q18) was chosen for the following analyses.

Fig. 2

Comparison of Illumina-V3V4 and ONT-V1V9 approaches using SILVA. (A) \(\alpha\)-diversity indexes at the genus level, comparing the two volunteer groups (cancer vs. control). (B) \(\beta\)-diversity analysis at the genus level. (C) \(\beta\)-diversity analysis at the species level. (D) Mean Centered Log Ratio (CLR) abundance correlation between approaches and for each group (cancer vs. control). Each point represents a different genus, and color indicates whether that taxon appears, on average, in both, none or only one of the approaches. Three relevant genera, which contain multiple species related to colorectal cancer, are highlighted (Fusobacterium, Parvimonas and Peptostreptococcus).

When comparing Illumina-V3V4 and ONT-V1V9 with the sup basecalling model, both using the SILVA database, \(\alpha\)-diversity at the genus level did not differ, as shown in Fig. 2A. Regarding \(\beta\)-diversity, samples overlapped at the genus level but not at the species level (Fig. 2B and Fig. 2C), although PERMANOVA analysis indicated significantly different results in both cases (p-value<0.001). Normalized CLR abundance at the genus level correlated well on average between Illumina-V3V4 and ONT-V1V9 (Pearson correlation: \(\ge\)0.8, Fig. 2D), with few taxa present in only one of the two approaches, which in most cases was due to slight differences in the taxonomy caused by the use of different classifiers. The percentage of feature counts identified as a known species (i.e., assigned at the species level rather than labeled as uncultured, unidentified, unclassified, Taxon NA, etc.) varied significantly (p-value<0.001 in WRST), with a median of 16.75% with Illumina-V3V4, 27.74% with ONT-V1V9 (SILVA) (Supplementary Fig. S2) and, interestingly, 100% with ONT-V1V9 (Default).

Fig. 3

Comparison of Illumina-V3V4 and ONT-V1V9 approaches with both databases for specific genera in each subject group. Three important genera, which contain multiple species associated with colorectal cancer, are highlighted (Fusobacterium, Parvimonas and Peptostreptococcus). (A) Centered Log Ratio (CLR) abundance of each genus per group, using Wilcoxon rank sum tests to assess significance. (B) Percentage of samples where each genus is present per group.

In Fig. 3A the abundance of three important genera (Fusobacterium, Parvimonas and Peptostreptococcus), potential indicators of the presence of colorectal tumors along the large bowel, is shown across technologies (Illumina-V3V4 and ONT-V1V9) and databases (Default Emu database and SILVA). In all cases these genera are significantly more abundant in fecal samples from CRC patients than in the control group (p-value<0.001 in WRST) and their abundances between approaches are similar. The percentage of samples with these taxa can be seen in Fig. 3B, with very similar proportions between methods except in the case of ONT-V1V9 (SILVA) for Parvimonas, where false positives were observed in controls (these occurrences were confirmed manually, identified as Parvimonas sp., not P. micra, at the species level).

When studying differential abundance analysis through ANCOM-BC (Fig. 4A, B) and differences in CLR abundance (Fig. 4C) using ONT-V1V9, both databases concur in multiple CRC biomarkers, such as Parvimonas micra, Fusobacterium nucleatum, Peptostreptococcus stomatis, Peptostreptococcus anaerobius, Gemella morbillorum, Sutterella wadsworthensis, Clostridium perfringens, Bacteroides fragilis and Dialister pneumosintes. In contrast, biomarkers for healthy controls are not shared between databases, with Emu’s Default database indicating more, such as: Agathobaculum butyriciproducens, Romboutsia ilealis, Anaerostipes rhamnosivorans and Anaerocolumna cellulosilytica.

In order to identify useful combinations of CRC microbial biomarkers, multiple machine learning models were created, using both automatic and manual feature selection. Models using automatically selected features are shown in Table 1, with the Default Emu database and two different prevalence levels (10% or 30%). All of them obtained at least 0.9 AUC (Area Under Curve) with just 10 features. For example, using the Default database at 10% prevalence the most important features according to the Boruta algorithm were P. micra, A. cellulosilytica, A. rhamnosivorans, P. stomatis, A. butyriciproducens, P. anaerobius, Prevotella stercorea, Candidatus Saccharibacteria bacterium oral taxon 957 (also named Candidatus Nanosynbacter featherlites (TM7) in NCBI), Olsenella timonensis and Raoultibacter timonensis (Table 1).

However, further evaluation of these taxa at the read level showed low identity to their reference. Specifically, reads assigned to A. cellulosilytica and A. rhamnosivorans, which were big contributors to these models, had a maximum of 90–92% identity, meaning they should be classified as unknown Lachnospiraceae. For this reason, manual combinations of taxa with high read identity and coverage, significant differences between groups (using ANCOM-BC and CLR abundance) or identified as relevant by the Boruta algorithm were tested. These combinations and their AUC are shown in Fig. 4D (the full list is available in Supplementary Table 1, and additional information and comments on each species are given in Supplementary Table 2) and compared to previously described combinations with Illumina-V3V412. The usage of P. micra and F. nucleatum provides an AUC of 0.71, increasing to 0.76 by adding B. fragilis, to 0.82 by adding A. butyriciproducens and, lastly, reaching a maximum AUC of 0.87 with a total of 14 features. All combinations comparable to Illumina-V3V4 obtained a slightly higher AUC (e.g. P. micra+F. nucleatum vs. Parvimonas+Fusobacterium).

Fig. 4

Colorectal cancer biomarkers obtained with ONT-V1V9 and both databases. (A) Differential abundance analysis (DAA) through ANCOM-BC using ONT-V1V9 and SILVA. Control subjects are the reference group, meaning a higher Log Fold Change (LFC) indicates higher abundance of a taxon in the cancer group. (B) DAA through ANCOM-BC using ONT-V1V9 and Emu’s Default database. Control subjects are the reference group, meaning a higher Log Fold Change (LFC) indicates higher abundance of a taxon in the cancer group. (C) Centered Log Ratio (CLR) abundance of species with significant differences between cancer and control groups, indicated through Wilcoxon rank sum tests. (D) AUC of two models (Illumina-V3V4 or ONT-V1V9) using manually selected features based on read identity to reference, ANCOM-BC and CLR abundance. Sample size (n) refers to the number of features included in each model. A complete list of species and genera included in each combination is found in Supplementary Table 1.

Table 1 Area under the curve (AUC) for the prediction of cancer/control with different machine learning models using ONT-V1V9 and Emu’s Default database, depending on whether feature selection is automatic (Boruta, with two prevalence thresholds of 10% and 30%) or manual.

Discussion

Sequencing of the ubiquitous 16S rRNA gene is a useful approach for characterizing microorganisms in complex samples such as the ones used in this study, which are composed of hundreds of species, many of them fastidious and not easily cultured under laboratory conditions. This technique allows for the discovery of disease-related bacterial biomarkers, which could be useful for the early prevention or diagnosis of various afflictions, such as colorectal cancer. In this procedure, small regions, \(\sim\)400-500 nt out of \(\sim\)1500 nt (e.g. V3V4), are typically sequenced with paired-end short-read technologies (e.g. Illumina). Unfortunately, this is mainly effective for genus level identification, while long-read technologies, such as ONT or PacBio, provide the possibility of sequencing the full gene (\(\sim\)1500 nt, V1V9)42,64, allowing for a more accurate identification at the species level. In this study, two approaches (Illumina-V3V4 and ONT-V1V9), three ONT Dorado basecalling models (fast, hac, sup; v4.1.0), and two databases (SILVA and Emu’s Default database) were compared in fecal samples obtained from two groups of volunteers (colon cancer patients and healthy subjects) in order to assess ONT-V1V9’s capabilities and to improve upon a set of microbial biomarkers previously described by our research team for the early diagnosis of CRC12.

Currently, out of these technologies, Illumina and PacBio HiFi reads provide the highest accuracy consistently42,65. However, ONT’s sequencing chemistry and basecalling models are constantly improving and will presumably reach a similar accuracy routinely, as a small proportion of reads can already reach Q30. For now, ONT reads must be analyzed with tools that take into account their relatively high error rate, such as Emu43, while popular algorithms for ASV creation like DADA241 cannot cope with their error profile. In this study, three of the latest models from Dorado (for 4 kHz data, v4.1.0) were compared, obtaining a median quality of Q18 with the sup model. More recent models like sup v5.0.0+, which are incompatible with the current data, appear to consistently achieve Q22+. Therefore, it is probable that future basecalling models will allow ONT reads to be analyzed with tools typically reserved for Illumina or PacBio, but this remains to be seen.

Focusing on our results, the three different ONT-V1V9 basecalling models tested in this work did not influence \(\beta\)-diversity or \(\alpha\)-diversity, except in the case of observed features, where the fast model yielded significantly more features than the rest. This could be due to false positives derived from low-accuracy sequences. Additionally, the percentage of reads identified discordantly with respect to sup was significantly different when comparing the fast and hac models. According to our observations, the taxonomic and diversity results can be affected by the selected basecalling model and its read quality; therefore, the best model (sup v4.1.0 in this case) should be chosen, even if it is very computationally expensive. Additionally, the use of duplex reads in amplicons with high homology and similar length might not be completely appropriate, as we observed (and removed) false duplex sequences derived from two unrelated simplex reads with different barcodes.

Comparing the Illumina-V3V4 and ONT-V1V9 (sup) approaches with SILVA resulted in no significant differences in \(\alpha\)-diversity. In regard to \(\beta\)-diversity, although samples did overlap at the genus level, PERMANOVA analysis indicated that there were significant differences at both the genus and species level66. The differences at the genus level could be attributed to slight differences in taxonomy caused by the use of two different classifiers, while at the species level they are expected and likely due to a more precise identification with ONT. The CLR abundance and presence of three genera that contain species related to colorectal cancer were similar, and the percentage of feature counts identified as a known species was significantly higher in ONT-V1V9 (27.74%) than in Illumina-V3V4 (16.75%). Thus, these results could indicate that both approaches are similar at the genus level, although not identical, and that ONT-V1V9 increases the number of species identified, as expected and previously described64,66.

Interestingly, when using Emu’s Default database, the percentage of feature counts identified as a known species increased to 100%. Closer examination of both databases revealed that SILVA was seven times larger than Emu’s Default database (NCBI 16S RefSeq + rrnDB), and contained sequences identified at multiple taxonomic levels, including unclassified or uncultured taxa, while Emu’s Default database had all its sequences identified at the species level. A smaller database, such as Emu’s Default database, might function better with higher-error reads that really belong to a well-characterized species, but the lack of unknown microorganisms in Emu’s Default database leads to some reads being assigned to a species even when their identity is 90%. Ideally, such reads should be assigned to higher taxonomic levels such as family, reserving the species level for reads with \(\ge\)98% identity40, but this is a difficult task, as described by the Emu authors43. This is especially evident in families with many unknown or uncultured species, such as the gut bacteria within the Lachnospiraceae family. It is complicated further by the use of relatively high-error reads: Q18, the median quality obtained here, corresponds to an error rate of \(\sim\)1.6% (Q15 would correspond to \(\sim\)3%), although this is expected to improve in the near future, as discussed previously. The consequences may be significant, especially in the search for microbial biomarkers applied to the diagnosis of human disorders, so it is important to be cautious when choosing the reference database to use with Emu and to study its results at the read level.
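The identity-based safeguard discussed above can be expressed as a simple rule, sketched here in Python. This is a hypothetical post-processing step written for illustration; it is not part of Emu. Only the 98% species threshold comes from the text. The function name, the input format and the fallback to an "unknown family" label are assumptions, and the identity values in the example are illustrative.

```python
SPECIES_MIN_IDENTITY = 98.0  # species-level threshold cited in the text

def assign_label(species: str, family: str, identity_pct: float) -> str:
    """Keep the species label only when read identity supports it.

    Reads below the threshold are reported at the family level instead
    (assumed fallback rank for this sketch).
    """
    if identity_pct >= SPECIES_MIN_IDENTITY:
        return species
    return f"unknown {family}"

# Illustrative reads: (species, family, % identity to the reference).
reads = [
    ("Fusobacterium nucleatum", "Fusobacteriaceae", 99.1),
    ("Anaerocolumna cellulosilytica", "Lachnospiraceae", 91.0),
]
for species, family, identity in reads:
    print(f"{species} ({identity}%) -> {assign_label(species, family, identity)}")
```

Applied at the read level, a rule of this kind would relabel low-identity assignments before they reach differential abundance analysis or feature selection.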

Selection of colorectal cancer biomarkers through ANCOM-BC and CLR abundance analyses revealed multiple species with high read identity and increased abundance in CRC patients, such as Parvimonas micra, Fusobacterium nucleatum, Peptostreptococcus stomatis, Peptostreptococcus anaerobius, Gemella morbillorum, Sutterella wadsworthensis, Clostridium perfringens, Bacteroides fragilis and Dialister pneumosintes, when compared to control samples from healthy volunteers. These species have been previously associated with CRC through a variety of methods12,18,19,20,23,24,25. Other biomarkers for healthy controls were indicated by Emu’s Default database, such as Agathobaculum butyriciproducens, a butyrate-producing species67; butyrate is a molecule associated with a healthy gut through its anti-inflammatory properties68. Thus, the species level identification of these bacteria improves upon Illumina-V3V4, where the only clearly defined colorectal cancer biomarker at the species level was B. fragilis12, providing higher resolution.

Machine learning models using automatic feature selection achieved good results for the prediction of colorectal cancer, with an AUC of at least 0.9 with just 10 features. However, closer evaluation of some of the selected taxa when using Emu’s Default database revealed low identity to their reference, due to the database issues discussed above. This led to manual feature selection of taxa with high read identity and coverage, significant differences between groups or identified as important by Boruta. The maximum AUC obtained was 0.87 with a total of 14 features, with the highest contributors being P. micra, F. nucleatum, B. fragilis and A. butyriciproducens (AUC 0.82). The combinations comparable to Illumina-V3V4 performed slightly, but not overwhelmingly, better with ONT-V1V9. This could be due to the fact that, even if Illumina-V3V4 was only able to confidently observe genera, the underlying species were few (or just one in the case of Parvimonas), which ends up providing similar abundances between both techniques.

Overall, these results bode well for ONT-V1V9 and for the diagnosis of CRC using the described biomarkers. However, a limitation of the present study is the sample size (n = 123) and especially the uneven number of subjects in each group (\(\sim\)25% healthy subjects), which could be greatly influencing feature selection for colorectal cancer prediction. Thus, these biomarkers must be thoroughly assessed in a larger, more geographically diverse and balanced cohort. Another limitation is not knowing the actual ground truth of these samples. Nonetheless, as previous studies have reported, 16S rRNA V1V9 sequencing with ONT presents clear advantages over short-read technologies in terms of information (sequencing a longer segment provides better identification), and over other long-read technologies in terms of cost (if the price of the sequencer is included)42,66. The technique itself, 16S rRNA metabarcoding, is indeed cost-efficient for bacterial characterization in complex samples, although it is expected that, as sequencing and computation costs keep diminishing, the better alternative will be to use metagenomic sequencing. This approach will provide a more accurate classification and, ideally, a comprehensive perspective of each microbial genome, which in the case of colorectal cancer related species could reveal specific virulence factors that influence the disease.