A curated bacterial and archaeal 16S rRNA Gene Oral Sequences dataset

Vázquez-González, Lara; Regueira-Iglesias, Alba; Balsa-Castro, Carlos; Tomás, Inmaculada; Carreira, María J.

doi:10.1038/s41597-025-05050-4

Data Descriptor
Open access
Published: 02 May 2025

Lara Vázquez-González ORCID: orcid.org/0000-0002-0122-5266^1,2,
Alba Regueira-Iglesias^2,3,
Carlos Balsa-Castro^1,2,3,
Inmaculada Tomás^1,2,3 &
María J. Carreira ORCID: orcid.org/0000-0003-0532-2351^1,2,4

Scientific Data volume 12, Article number: 729 (2025)

Abstract

In a given species, genomes and 16S rRNA gene sequences, along with their intragenomic copy numbers, can vary greatly across environments. The gene copy numbers are crucial for technologies which estimate microbial abundances based on gene counts, such as polymerase chain reaction and high-throughput sequencing. In these, taxa with fewer genes may be underestimated, while those with more genes might be overestimated. Therefore, it is essential to have accurate gene copy number databases specific to the niche under study. The 16S rRNA Gene Oral Sequences dataset (16SGOSeq) contains the number of 16S rRNA genes and their variants in the complete genomes of the bacterial and archaeal species present in the human oral cavity. It includes 3,192 complete genomes of oral bacteria and 191 complete genomes of oral archaea, from which the 16S rRNA gene sequences were extracted, and the sequence variants were identified. This oral-specific dataset of prokaryotic organisms and the pipeline followed for its construction can be applied by clinical microbiologists, bioinformaticians, or microbial ecologists in future microbiome research.

Background & Summary

The oral microbiome is the most diverse and second largest in the human body, with over 700 microbial species detected in the mouth at any time¹. Of these, an individual’s mouth usually harbours between 200 and 300 predominant bacterial species². The dysbiosis of this community, or imbalance, is a key factor in the onset and development of two of the most widespread diseases worldwide: dental caries and periodontitis³. Moreover, alterations in the mouth microbiome have been associated with highly relevant systemic pathologies such as diabetes and cardiovascular diseases⁴.

A range of microbiological techniques have been employed to investigate the oral microbial communities associated with health and disease. Among the most prevalent are polymerase chain reaction (PCR), conventional or quantitative (qPCR), and high-throughput sequencing (HTS). These techniques typically target the 16S ribosomal RNA (rRNA) gene^5,6. This gene comprises approximately 1,500 base pairs (bp) and is regarded as a reliable phylogenetic marker for several reasons. Primarily, it is ubiquitous in bacteria and archaea and relatively stable, combining conserved (C) and hypervariable (V) regions. Additionally, complete and easily accessible databases exist⁷.

Nevertheless, utilising the 16S rRNA gene also presents certain limitations. One of the most significant is intragenomic redundancy, which refers to the presence of more than one identical or distinct gene per genome^8,9,10. Genetic variants can be defined as the gene sequences in a genome that differ from one another in at least one nucleotide; a variant can therefore have multiple copies per genome. Technologies such as qPCR and HTS estimate microbial abundance based on gene counts. Consequently, their results may be biased: taxa with a low number of genes may be underestimated, while those with high numbers may be overestimated¹⁰. Furthermore, heterogeneity between the multiple gene copies within a genome may result in overestimation of diversity and taxonomic misassignments⁹.

There are several methodologies for correcting the variation in intragenomic gene copy numbers, such as CopyRighter¹¹ or PICRUSt¹². However, these methodologies are characterised by significant drawbacks. Primarily, these methods are highly dependent on the database used, and therefore, inaccurate estimates may result from incomplete or erroneous data¹³.

The Ribosomal RNA Operon Copy Number Database (rrnDB)¹⁴ and RiboGrove¹⁵ provide information on the number of 16S rRNA genes in the genomes of bacteria and archaea. However, the genomes and 16S rRNA gene sequences, as well as their intragenomic copy number, in a given species may vary among environments or niches^16,17. This variability may be attributed to adaptation to different environmental conditions, selective pressures, or evolutionary events such as gene acquisition through horizontal transfer. It is observed that the greater the pressure or difference between environments, the greater the variability^17,18,19,20. Furthermore, neither of the above databases allows for the identification of the different intragenomic gene variants^14,15.

In addition, the rrnDB¹⁴ does not allow the selection of sequences from specific taxonomic groups or taxa; the complete database must be downloaded and the selection performed manually afterwards. Moreover, the developers of rrnDB¹⁴ claim to perform quality control of the sequences using data from the Ribosomal Database Project (RDP)²¹. The use of phylogenetically diverse databases can produce classification errors because they contain taxonomically misannotated 16S rRNA gene sequences (e.g., annotation error rates in the RDP are approximately 10%)²². They also represent the included environments unevenly, and the quality of their classifications varies substantially²³.

Preference for the use of environment-specific databases has been demonstrated not only in the oral cavity²⁴, by focusing on the expanded Human Oral Microbiome Database (eHOMD)²⁵, but also in the vaginal²⁶, bovine²⁷, bee²⁸, dairy product²⁹, and wastewater treatment plant³⁰ microbiomes. Overall, these studies have proven that the use of niche-specific databases improves the accuracy of taxonomic classifications by aligning reference sequences more closely with the microbial communities under investigation, and reduces the number of unassigned reads.

As a last remark, it is notable that numerous examples of species from the same genus have been associated with opposite mouth conditions^31,32. Consequently, it is recommended that oral microbiology analysis be performed at the lowest taxonomic level possible.

In light of the aforementioned considerations, the objective of this study was to develop a curated dataset comprising all 16S rRNA gene sequences present in the genomes with complete sequencing status of bacterial and archaeal species inhabiting the human mouth. The oral prokaryote-specific dataset, designated 16SGOSeq (16S rRNA Gene Oral Sequences dataset), contains information regarding the size of complete genomes and the size and number of genes per genome and gene variants per genome for taxonomic categories from domain to strain level. All gene sequences are designated at the lowest possible taxonomic level, and users can filter by the desired taxonomic level/taxon and calculate data averages.

Methods

The 16SGOSeq dataset is a curated sequence dataset based on a collection of genetic sequences that have undergone a systematic and rigorous process of selection, validation, classification and updating, thus guaranteeing their high quality, accuracy and reliability for subsequent scientific uses³³.

This dataset was constructed following a number of criteria for genome inclusion and a set of curation procedures for the sequences, as illustrated in Fig. 1.

Data acquisition. Obtaining complete oral bacterial and archaeal genomes

Firstly, the inclusion criteria for the bacterial and archaeal genomes were set to ensure the sequences’ quality. The following criteria were thus established:

1.
The bacterial and archaeal taxa were limited to those identified as inhabitants of the oral cavity.
2.
Only genomes with complete sequencing status according to the expanded Human Oral Microbiome Database (eHOMD)²⁵ were included.
3.
The genomes included in this study are those with complete taxonomy up to the species level (i.e., they had to have a specific name for the domain, phylum, class, order, family, genus, and species). Those with ambiguous or disputed taxonomies were eligible (e.g. Candidatus taxa). Those with the “unclassified” term at any level were not eligible.
4.
Genomes with no more than 10 consecutive ambiguous International Union of Pure and Applied Chemistry (IUPAC) nucleotides were included.

Consequently, the sequences and other information that met the inclusion criteria were downloaded. All available data on bacterial taxa within the mouth was acquired from the publicly available eHOMD oral-specific database²⁵ (https://www.homd.org/ftp//genomes/NCBI/V10.1). Only genomes with complete sequencing status according to the eHOMD were chosen, resulting in a total of 3,128 complete genomes out of the 8,622 on the eHOMD website. Each genome was assigned one or more GenBank identifiers, which were included in the downloaded genome’s file as different sequences. In the majority of cases, the identifiers corresponded to chromosomal DNA. However, in some instances, they also corresponded to plasmids. Both were considered in the completion of the dataset.

Regarding the oral archaea, the eHOMD²⁵ only encompasses complete genome information for a single species, namely Methanobrevibacter oralis. For this reason, we used an initial list obtained as part of a previous investigation³⁴, which comprised 177 different archaea found in the human mouth and their corresponding GenBank identifiers³⁵ to obtain complete publicly available archaeal genomes from the NCBI database³⁶. This archaeal list is shown in Supplementary Table S1 of the Supplementary Information. The provenance of these complete genomes was: environmental (91%), human niches (3%), and others (6%).

The Entrez Programming Utilities (E-utilities)³⁷ tool was employed to retrieve pertinent data from the diverse NCBI databases, including the Taxonomy³⁸ and GenBank³⁵ databases. The Entrez module from the BioPython³⁹ package was employed in a Python script (version 3.9.0, http://www.python.org/) to facilitate the transmission of requests to the databases and the acquisition of the oral-archaeal genomes and their corresponding taxonomies, as well as the oral-bacterial metadata.
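As a rough illustration of the kind of request involved, the sketch below builds an E-utilities `efetch` URL for one nucleotide record using only the Python standard library. It is not the authors' script, which used the Biopython Entrez module; the function name `build_efetch_url`, the placeholder accession and the e-mail address are assumptions made for the example, and no network call is performed.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_efetch_url(accession, email):
    """Return an efetch URL requesting one nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession,      # GenBank identifier (placeholder in this sketch)
        "rettype": "fasta",
        "retmode": "text",
        "email": email,       # contact address requested by NCBI
    }
    return f"{EFETCH_URL}?{urlencode(params)}"

# Hypothetical values, for illustration only.
url = build_efetch_url("EXAMPLE_ACCESSION", "user@example.org")
print(url)
```

Separating URL construction from the download step keeps the request easy to inspect before any data are retrieved.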

Genomes that lacked a complete taxonomic classification up to the species level were excluded from the analysis. Furthermore, some genomes included non-specific nucleotides, which were identified by the IUPAC codes for ambiguous characters. It should be noted that these ambiguous characters or nucleotides may represent two, three, or four possible nucleic acid states⁴⁰, instead of a unique specification for the four nitrogenous bases of the DNA (A, adenine; G, guanine; C, cytosine; T, thymine). It was, therefore, necessary to exclude from the analysis genomes that had more than ten consecutive ambiguous IUPAC nucleotides.
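The ambiguity criterion can be sketched with a regular expression. This is a minimal example, assuming the genome is available as a plain string and that any IUPAC nucleotide letter other than A, C, G and T counts as ambiguous; the function name `passes_ambiguity_filter` is hypothetical and is not taken from the authors' code.

```python
import re

# IUPAC ambiguity codes: nucleotide letters other than A, C, G and T.
# The pattern matches a run of 11 or more, i.e. "more than ten consecutive".
AMBIGUOUS_RUN = re.compile(r"[RYSWKMBDHVN]{11,}", re.IGNORECASE)

def passes_ambiguity_filter(sequence):
    """Return True if no run of more than 10 ambiguous bases is present."""
    return AMBIGUOUS_RUN.search(sequence) is None

kept = passes_ambiguity_filter("ACGT" + "N" * 10 + "ACGT")      # run of exactly 10
dropped = passes_ambiguity_filter("ACGT" + "N" * 11 + "ACGT")   # run of 11
print(kept, dropped)
```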

Following the application of the aforementioned criteria, a total of 3,079 complete genomes of oral bacteria were identified. Of these, several had one or more sequences corresponding to either genomes, chromosomes, or plasmids. Each was evaluated as a complete genome, resulting in a final number of 5,755 oral-bacterial complete genomes being considered for analysis. These genomes all had the taxonomy complete up to the strain level. A total of 177 complete archaeal genomes were listed, of which 166 had one chromosome, 10 had two chromosomes, and one had five chromosomes. Counting each chromosome separately (166 + 10 × 2 + 1 × 5 = 191), a final number of 191 oral-archaeal complete genomes were considered for creating the 16SGOSeq dataset.

Detection and extraction of 16S rRNA genes

A Python script was developed to extract the 16S rRNA genes. This script utilised a freely available module, search_16S_py⁴¹, which implements Edgar’s algorithm⁴². The algorithm employs a search strategy that identifies sections of genomes exhibiting a high frequency of 13-mers, which are characteristic of known 16S rRNA genes. Subsequently, the search is conducted within each segment for conserved motifs situated in close proximity to the beginning and end of the gene. The presence of a pair of motifs within the expected length range indicates the presence of the gene and provides consistent and homologous endpoints.
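The first stage described above, locating genome segments rich in known 13-mers, can be illustrated with a toy sketch. This is not the search_16S_py implementation: the window size, step and hit threshold are arbitrary assumptions, the sequences are randomly generated, and the second stage (searching for conserved motifs near the gene ends) is not implemented.

```python
import random

def kmers(seq, k=13):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_windows(genome, known_kmers, k=13, window=200, step=50, min_hits=20):
    """Return (start, end, hits) for each window rich in known k-mers."""
    # Mark every genome position whose k-mer occurs in the reference set.
    hit_at = [genome[i:i + k] in known_kmers for i in range(len(genome) - k + 1)]
    found = []
    for start in range(0, max(len(genome) - window, 0) + 1, step):
        # Count only k-mers lying fully inside the window.
        hits = sum(hit_at[start:start + window - k + 1])
        if hits >= min_hits:
            found.append((start, start + window, hits))
    return found

# Toy data: a random 300 bp "gene" embedded in random background sequence.
rng = random.Random(0)

def random_dna(n):
    return "".join(rng.choice("ACGT") for _ in range(n))

reference_gene = random_dna(300)
genome = random_dna(1000) + reference_gene + random_dna(1000)
windows = candidate_windows(genome, kmers(reference_gene))
print(windows)
```

In this toy case every reported window overlaps the embedded gene at positions 1000 to 1300, which is the behaviour the k-mer screen is meant to provide before any motif search.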

The application of this algorithm resulted in the detection and extraction of 16S rRNA gene sequences from the complete downloaded genomes, which were subsequently stored in FASTA-formatted files along with the variants, that is, the sequences differing by at least one nucleotide between each other in a genome. All the 16S rRNA gene variants identified were designated at the strain level or the species level if no designated strain name existed, in accordance with the established nomenclature guidelines.
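The distinction between genes and variants can be made concrete with a short sketch. It assumes that all gene sequences from one genome are already reported in the same orientation and treats two sequences as the same variant only if they are identical strings; the function name `summarise_variants` and the toy sequences are illustrative.

```python
from collections import Counter

def summarise_variants(gene_sequences):
    """Collapse a genome's 16S gene sequences into variants with copy counts."""
    # Identical sequences are copies of one variant; any difference of at
    # least one nucleotide defines a separate variant.
    copies = Counter(seq.upper() for seq in gene_sequences)
    return {
        "n_genes": sum(copies.values()),
        "n_variants": len(copies),
        "copies_per_variant": dict(copies),
    }

# Toy genome with four gene copies, one of which differs by one nucleotide.
genes = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "ACGTACGT"]
summary = summarise_variants(genes)
print(summary)
```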

A number of the genomes were found to lack 16S rRNA genes, reducing the dataset to 3,192 oral-bacterial genomes, corresponding to 3,047 strains and 334 species, and 191 oral-archaeal genomes, corresponding to 135 species. For the bacterial genomes, a total of 14,966 genes and 8,155 variants were identified. For the archaeal genomes, a total of 346 genes and 255 variants were identified.

For each genome under consideration, the following data were calculated: genome size, size of the 16S rRNA genes detected, total number of 16S rRNA genes, number of different variants, and number of 16S rRNA genes in each strand.

Additionally, through the use of a complementary script based on Python’s NumPy⁴³ and pandas⁴⁴ modules, the average, median, mode and standard deviation of the data obtained can be determined for hierarchical levels above the variant level.
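The aggregation step can be sketched as follows. The authors' script relies on NumPy and pandas; this example uses only the standard library `statistics` module so that it runs without extra dependencies. The records, taxon names and the function name `summarise_by` are hypothetical placeholders, and the sample standard deviation is an assumption, since the source does not say which estimator is used.

```python
from collections import defaultdict
from statistics import mean, median, multimode, stdev

# Hypothetical per-genome records: (genus, species, number of 16S rRNA genes).
records = [
    ("Genus_A", "species_a1", 5),
    ("Genus_A", "species_a1", 5),
    ("Genus_A", "species_a2", 4),
    ("Genus_B", "species_b1", 4),
    ("Genus_B", "species_b1", 3),
]

def summarise_by(records, level_index):
    """Group gene counts by one taxonomic column and summarise each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[level_index]].append(record[-1])
    summary = {}
    for taxon, values in groups.items():
        summary[taxon] = {
            "n_genomes": len(values),
            "mean": mean(values),
            "median": median(values),
            "mode": multimode(values),  # list, because ties are possible
            # Sample standard deviation; undefined for a single genome.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary

by_genus = summarise_by(records, level_index=0)
by_species = summarise_by(records, level_index=1)
print(by_genus)
```

Note that pandas computes the sample standard deviation by default while NumPy computes the population value, so results from the two libraries differ unless the `ddof` argument is set consistently.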

Data Records

The dataset is available at Zenodo (https://zenodo.org/records/15209015)⁴⁵.

The 16SGOSeq dataset⁴⁵ is provided in eight files, comprising both tabular and FASTA formats. The dataset comprises four files pertaining to bacteria and four files pertaining to archaea.

Variants table (bacteria_variants.csv/.xlsx, archaea_variants.csv/.xlsx)

A single table is included in both the CSV and XLSX formats, containing all the variants identified in all the genomes. The file comprises as many rows as there are variants per group of GenBank identifiers (genome and plasmids). The pertinent data for each variant is included, such as the sequence, the number of copies, and the position of the variant in the genome, among other details (see Table 1).

Table 1 Description of the parameters associated with the variants CSV/XLSX file.

Variants FASTA (bacteria_variants.fasta, archaea_variants.fasta)

A FASTA file corresponding to the variants table contains one line per variant in each genome. The header includes the genome GenBank identifier, the full taxonomy up until the variant level, and the number of gene copies in the genome.

Genes FASTA (bacteria_genes.fasta, archaea_genes.fasta)

A FASTA file is provided containing all sequences identified in all genomes. This file illustrates the copies of variants observed in the genomes, with each header including the genome GenBank identifier, the full taxonomy up to the variant level, as well as the positions of the genes in each genome and the strand in which it was found.

Intragenomic variant divergence (bacteria_divergence.csv/.xlsx, archaea_divergence.csv/.xlsx)

A table is included in both the CSV and XLSX formats for bacteria and archaea, containing the information about the divergence existent between the variants of each genome of the dataset. It was acquired using BLASTN⁴⁶ to align each genome’s variants against each other, to obtain the query coverage and the identity percentage used to evaluate the divergence.

Each genome has several rows corresponding to the alignments of each variant against the other variants of the genome. Table 2 shows the pertinent data included to assess the divergence, including the query coverage and the identity percentage.

Table 2 Description of the parameters associated with the intragenomic variant divergence CSV/XLSX file.

Technical Validation

To validate the dataset, a total of 2,039 random bacterial sequences were selected from our dataset, representing 25% of the total number of sequences. This random sequence group was aligned using BLASTN⁴⁶ against a 16S rRNA gene sequence database. This smaller database includes 26,954 sequences from bacteria and archaea, and does not necessarily include the same taxa represented in our dataset. With the alignment, we obtained an identity percentage of ≥97% in all cases, confirming that our sequences can be considered 16S rRNA gene sequences.

Additionally, we aligned the same 25% of our bacterial dataset against the Core nucleotide database (core_nt)³⁵, which contains 112,880,307 GenBank+EMBL+DDBJ+PDB+RefSeq sequences. In all cases, either the genus and species or the NCBI identifier matched between the query (our sequences) and the subject (core_nt sequences). Without exception, a query coverage of 100% and an identity percentage of ≥99% were obtained, confirming the validity of our sequences by demonstrating both their existence and their correct taxonomic annotation.

For the archaeal dataset, 64 sequences were selected, representing 25% of the dataset. As for the bacteria, all the sequences were aligned with BLASTN⁴⁶ against the 16S rRNA gene sequence database from NCBI, again obtaining an identity percentage of ≥97% in all cases. These sequences were also aligned against core_nt, with either the genus and species or the NCBI identifier matching between the query and the subject. A query coverage of 100% and an identity percentage of ≥99% were obtained.
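The reported sample sizes are consistent with taking 25% of the variant counts given in the Methods (8,155 bacterial and 255 archaeal variants) and rounding up. The sketch below reproduces that arithmetic; the rounding rule, the seed and the placeholder identifiers are assumptions, since the source does not describe how the random selection was implemented.

```python
import math
import random

def sample_fraction(items, fraction=0.25, seed=0):
    """Draw a reproducible random subset covering the given fraction."""
    # Round up so that the subset covers at least the requested fraction.
    n = math.ceil(len(items) * fraction)
    return random.Random(seed).sample(items, n)

# Placeholder identifiers standing in for the variant sequences.
bacterial = [f"bac_variant_{i}" for i in range(8155)]
archaeal = [f"arc_variant_{i}" for i in range(255)]

bacterial_subset = sample_fraction(bacterial)
archaeal_subset = sample_fraction(archaeal)
print(len(bacterial_subset), len(archaeal_subset))
```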

Additionally, the intragenomic divergence of the variants was analysed. BLASTN⁴⁶ was used to perform a discontinuous megablast and align each genome’s variants against each other. Genomes were considered to present high divergence if at least two of their variants presented a query coverage of ≤97% or an identity percentage of ≤97%.
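One way to apply this criterion to a table of pairwise alignments is sketched below. It interprets the rule as flagging a genome when any alignment between two different variants falls at or below either threshold; that interpretation, the dictionary keys and the function name `high_divergence_genomes` are assumptions and do not reflect the actual column names of the divergence files.

```python
def high_divergence_genomes(alignments, threshold=97.0):
    """Return the genomes with at least one divergent pair of variants."""
    flagged = set()
    for row in alignments:
        # Ignore a variant aligned against itself.
        if row["query"] == row["subject"]:
            continue
        if row["query_coverage"] <= threshold or row["identity"] <= threshold:
            flagged.add(row["genome"])
    return flagged

# Hypothetical alignment rows, for illustration only.
alignments = [
    {"genome": "genome_1", "query": "v1", "subject": "v2",
     "query_coverage": 100.0, "identity": 99.8},
    {"genome": "genome_2", "query": "v1", "subject": "v2",
     "query_coverage": 100.0, "identity": 96.5},
    {"genome": "genome_3", "query": "v1", "subject": "v1",
     "query_coverage": 100.0, "identity": 100.0},
    {"genome": "genome_4", "query": "v1", "subject": "v2",
     "query_coverage": 95.0, "identity": 99.0},
]
flagged = high_divergence_genomes(alignments)
print(sorted(flagged))
```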

Of the total 3,046 bacterial genomes and 177 archaeal genomes, 43 and 9, respectively, were found to present high divergence amongst some of their variants.

These results are presented in files bacteria_divergence.csv/.xlsx and archaea_divergence.csv/.xlsx in the dataset repository.

Usage Notes

The quantity of PCR-amplified product is contingent upon the genome size and the number of 16S rRNA genes per genome⁴⁷. Consequently, it is not possible to accurately quantify the number of species represented in clone libraries of samples from a given ecosystem until these two parameters are known for the taxa present⁴⁷. If these factors are not considered, when performing PCR, qPCR, or marker-gene sequencing, inferences about numerous aspects of microbial communities may be affected⁴⁸. Moreover, this information is also pertinent to whole genome sequencing (WGS) technologies, which employ 16S rRNA gene counts for their analyses of the diversity and structure of prokaryotic populations^49,50. Understanding the number of gene copies per genome can facilitate our comprehension of the ecological and evolutionary relationships between different microorganisms¹⁷.

It can be observed that rrnDB¹⁴ and RiboGrove¹⁵ present certain inherent limitations when employed in studies of the mouth microbiome compared to 16SGOSeq⁴⁵ (Table 3). The gene sequences included in our newly developed dataset originate from complete genomes, manually monitored at NCBI³⁶, of bacteria and archaea that are known to inhabit the human oral cavity³⁴. This enables researchers to ascertain the number of 16S genes and variants per genome at a desired hierarchical level or taxon. This information is necessary to convert gene-based abundance values into estimated abundance values of the organism(s).
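The adjustment mentioned above can be illustrated with a minimal sketch: each taxon's gene count is divided by its number of 16S rRNA genes per genome, and the results are renormalised to relative abundances. The taxon names, read counts and copy numbers are invented for the example, and `correct_for_copy_number` is a hypothetical helper rather than part of the dataset's code.

```python
def correct_for_copy_number(read_counts, copies_per_genome):
    """Convert gene-based counts into relative abundances of organisms."""
    # Dividing by the copy number estimates the number of genomes observed.
    genomes = {taxon: read_counts[taxon] / copies_per_genome[taxon]
               for taxon in read_counts}
    total = sum(genomes.values())
    return {taxon: value / total for taxon, value in genomes.items()}

# Invented values: taxon_A carries three times as many gene copies as taxon_B.
reads = {"taxon_A": 600, "taxon_B": 200}
copies = {"taxon_A": 6, "taxon_B": 2}

raw = {taxon: count / sum(reads.values()) for taxon, count in reads.items()}
corrected = correct_for_copy_number(reads, copies)
print(raw, corrected)
```

In this toy case the raw gene-based shares are 75% and 25%, whereas the corrected shares are equal, showing how a high copy number inflates apparent abundance.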

Table 3 Comparison of features between 16SGOSeq and other 16S rRNA gene copy number databases.

The 16SGOSeq⁴⁵ can be filtered according to the needs of the research study, grouping the genes by different taxonomic levels and obtaining averages, standard deviations, and other statistics. Users can use programming languages such as R or Python, which are the most commonly used in bioinformatics, to filter the dataset taxonomically using the taxonomy columns in the dataset files. In addition, to facilitate the use of 16SGOSeq⁴⁵, we constructed an auxiliary Python⁵¹ script that filters the information by the desired taxonomic level and calculates averages of relevant data. Further details about the script can be found in the Supplementary Information.

The list of the oral bacteria included in our dataset originates from the eHOMD²⁵. However, there is no database with sequences of archaea detected in the oral cavity. As there is no specific list of oral archaea in the eHOMD²⁵, the complete genomes of the oral archaea were obtained directly from the NCBI³⁶, and most of them belong to environmental niches. It is recommended that the oral microbiology community increase the amount of evidence related to oral archaea to overcome this limitation by focusing on isolating and sequencing archaea directly from mouth samples to refine taxonomic and functional annotations.

Most of the complete genomes in the eHOMD²⁵ belong to cultivable taxa. While in the HOMD, 71% of the species are cultivable and 26% are non-cultivable, in the 16SGOSeq⁴⁵ dataset 96% are cultivable and 1.20% are non-cultivable. Our dataset provides a more robust representation of the cultivable fraction of the oral microbiome than the non-cultivable. Nevertheless, 16SGOSeq⁴⁵ contains the most prevalent and abundant bacterial species in health, caries, and periodontitis^1,2,52, and numerous examples of the so-called rare taxa^53,54,55.

So far, the sequences and data averages calculated in 16SGOSeq⁴⁵ have been used in two studies by our research group: one to analyse the impact of intragenomic redundancy of the 16S rRNA gene and the selection of primer pairs in the oral cavity, and the other to identify species with very similar 16S rRNA sequence segments using different primer pairs^56,57.

Among other applications, the 16SGOSeq⁴⁵ can be used in the field of oral microbiology to facilitate the design of universal primers or probes capable of detecting the greatest possible diversity of oral prokaryotic organisms, and of specific primers or probes that consider the sequences of all the genetic variants of a given taxon. Additionally, if sequences from 16SGOSeq⁴⁵ are to be employed for the identification of primers or probes, we propose the utilisation of the PrimerEvalPy tool⁵⁸. This is a Python⁵¹ application developed by our research team that performs a coverage analysis of any primer against any database. These advances enable more accurate detection and characterisation of oral microbiota and improve the understanding of the oral ecosystem and its role in health and disease. The 16SGOSeq⁴⁵, with high-quality sequences and robust taxonomic annotations, can significantly refine our understanding of phylogenetic relationships among taxa.
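The idea of a primer coverage analysis can be sketched in a few lines. This example does not use or reproduce the PrimerEvalPy interface: it simply expands a degenerate primer into a regular expression and reports the fraction of sequences containing an exact forward-strand match. The toy primer, the toy sequences and the function names are assumptions; mismatches and reverse complements are not handled.

```python
import re

# Standard IUPAC nucleotide codes expanded to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))

def primer_coverage(primer, sequences):
    """Return the fraction of sequences with an exact match to the primer."""
    pattern = primer_regex(primer)
    matched = sum(1 for seq in sequences if pattern.search(seq.upper()))
    return matched / len(sequences) if sequences else 0.0

# Toy primer with one degenerate position (Y = C or T) and toy sequences.
sequences = ["TTACGCTAA", "TTACGTTAA", "TTACGATAA", "GGGG"]
coverage = primer_coverage("ACGYT", sequences)
print(coverage)
```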

The techniques and strategies implemented in the curation of the present dataset can be applied by clinical microbiologists, bioinformaticians or microbial ecologists in other microbiome fields. Our pipeline can be followed to generate gene copy number datasets from existing niche-specific datasets, ensuring the production of taxonomically robust, high-resolution, and biologically informative data.

Code availability

The dataset and code can be accessed via Zenodo (https://zenodo.org/records/15209015)⁴⁵ or the Gitlab repository at the following link: https://gitlab.citius.gal/lara.vazquez/16sgoseq. The dataset construction code was developed in Python 3.9, which was employed to generate the files composing the 16SGOSeq⁴⁵ dataset. Furthermore, a complementary script is provided for the purpose of filtering and generating information from the dataset. Instructions for installing and using the script are detailed in the repository.

16SGOSeq⁴⁵ follows an annual update policy, incorporating HOMD updates when available.

References

Deo, P. & Deshmukh, R. Oral microbiome: Unveiling the fundamentals. Journal of Oral and Maxillofacial Pathology 23, 122, https://doi.org/10.4103/jomfp.jomfp_304_18 (2019).
Krishnan, K., Chen, T. & Paster, B. A practical guide to the oral microbiome and its relation to health and disease. Oral Diseases 23, 276–286, https://doi.org/10.1111/odi.12509 (2020).
Zhang, Y. et al. Human oral microbiota and its modulation for oral health. Biomedicine & Pharmacotherapy 99, 883–893, https://doi.org/10.1016/j.biopha.2018.01.146 (2018).
Thomas, C. et al. Oral Microbiota: A Major Player in the Diagnosis of Systemic Diseases. Diagnostics 11, 1376, https://doi.org/10.3390/diagnostics11081376 (2021).
Fukuda, K., Ogawa, M., Taniguchi, H. & Saito, M. Molecular approaches to studying microbial communities: Targeting the 16S ribosomal RNA gene. J. UOEH 38, 223–232, https://doi.org/10.7888/juoeh.38.223 (2016).
Willis, J. R. & Gabaldón, T. The Human Oral Microbiome in Health and Disease: From Sequences to Ecosystems. Microorganisms 8, 308, https://doi.org/10.3390/microorganisms8020308 (2020).
Rodicio, Md. R. & Mendoza, Md. C. Identification of bacteria through 16S rRNA sequencing: principles, methods and applications in clinical microbiology. Enferm. Infecc. Microbiol. Clin. 22, 238–245, https://doi.org/10.1157/13059055 (2004).
Pei, A. Y. et al. Diversity of 16S rRNA Genes within Individual Prokaryotic Genomes. Applied and Environmental Microbiology 76, 3886–3897, https://doi.org/10.1128/aem.02953-09 (2010).
Sun, D.-L., Jiang, X., Wu, Q. L. & Zhou, N.-Y. Intragenomic Heterogeneity of 16S rRNA Genes Causes Overestimation of Prokaryotic Diversity. Applied and Environmental Microbiology 79, 5962–5969, https://doi.org/10.1128/aem.01282-13 (2013).
Větrovský, T. & Baldrian, P. The Variability of the 16S rRNA Gene in Bacterial Genomes and Its Consequences for Bacterial Community Analyses. PLoS One 8, e57923, https://doi.org/10.1371/journal.pone.0057923 (2013).
Angly, F. E. et al. CopyRighter: a rapid tool for improving the accuracy of microbial community profiles through lineage-specific gene copy number correction. Microbiome 2, https://doi.org/10.1186/2049-2618-2-11 (2014).
Langille, M. G. I. et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature Biotechnology 31, 814–821, https://doi.org/10.1038/nbt.2676 (2013).
Nearing, J. T., Comeau, A. M. & Langille, M. G. I. Identifying biases and their potential solutions in human microbiome studies. Microbiome 9, https://doi.org/10.1186/s40168-021-01059-0 (2021).
Stoddard, S. F., Smith, B. J., Hein, R., Roller, B. R. K. & Schmidt, T. M. rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development. Nucleic Acids Research 43, D593–D598, https://doi.org/10.1093/nar/gku1201 (2015).
Sikolenko, M. A. & Valentovich, L. N. RiboGrove: a database of full-length prokaryotic 16S rRNA genes derived from completely assembled genomes. Research in Microbiology 173, 103936, https://doi.org/10.1016/j.resmic.2022.103936 (2022).
Smith, E. E. et al. Genetic adaptation by Pseudomonas aeruginosa to the airways of cystic fibrosis patients. Proceedings of the National Academy of Sciences 103, 8487–8492, https://doi.org/10.1073/pnas.0602138103 (2006).
Stevenson, B. S. & Schmidt, T. M. Life history implications of rRNA gene copy number in Escherichia coli. Applied and Environmental Microbiology 70, 6670–6677, https://doi.org/10.1128/AEM.70.10.6670-6677.2004 (2004).
Hunt, D. E. et al. Resource partitioning and sympatric differentiation among closely related bacterioplankton. Science 320, 1081–1085, https://doi.org/10.1126/science.1157890 (2008).
Cohan, F. M. Ecology of Microbial Communities, chap. Periodic selection and ecological diversity in bacteria, 79–93 (Princeton University Press, 2005).
Tamames, J. & Moya, A. Estimating the extent of horizontal gene transfer in metagenomic sequences. BMC Genomics 9, 136, https://doi.org/10.1186/1471-2164-9-136 (2008).
Cole, J. R. et al. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Research 42, D633–D642, https://doi.org/10.1093/nar/gkt1244 (2014).
Edgar, R. Taxonomy annotation and guide tree errors in 16S rRNA databases. PeerJ 6, https://doi.org/10.7717/peerj.5030 (2018).
Soergel, D. A. W., Dey, N., Knight, R. & Brenner, S. E. Selection of primers for optimal taxonomic classification of environmental 16S rRNA gene sequences. ISME J 6, 1440–1444, https://doi.org/10.1038/ismej.2011.208 (2012).
Escapa, I. F. et al. Construction of habitat-specific training sets to achieve species-level assignment in 16S rRNA gene datasets. Microbiome 8, 65, https://doi.org/10.1186/s40168-020-00841-w (2020).
Escapa, I. F. et al. New Insights into Human Nostril microbiome from the expanded Human Oral Microbiome Database (eHOMD): a resource for the microbiome of the human aerodigestive tract. mSystems 3, https://doi.org/10.1128/msystems.00187-18 (2018).
Zhu, B. et al. The utility of voided urine samples as a proxy for the vaginal microbiome and for the prediction of bacterial vaginosis. Infectious Microbes and Diseases https://doi.org/10.1097/im9.0000000000000103 (2022).
Myer, P. R. et al. Classification of 16s rrna reads is improved using a niche-specific database constructed by near-full length sequencing. PLOS ONE 15, e0235498, https://doi.org/10.1371/journal.pone.0235498 (2020).
Brendan, D. A. & Reid, G. Beexact: a metataxonomic database tool for high-resolution inference of bee-associated microbial communities. mSystems 6, e00082–21, https://doi.org/10.1128/msystems.00082-21 (2021).
Meola, M. et al. Dairydb: a manually curated reference database for improved taxonomy annotation of 16s rrna gene sequences from dairy products. BMC Genomics 20, https://doi.org/10.1186/s12864-019-5914-8 (2019).
Dueholm, M. K. D. et al. Midas 4: A global catalogue of full-length 16s rrna gene sequences and taxonomy for studies of bacterial communities in wastewater treatment plants. Nature Communications 13, https://doi.org/10.1038/s41467-022-29438-7 (2022).
Relvas, M. et al. Relationship between dental and periodontal health status and the salivary microbiome: bacterial diversity, co-occurrence networks and predictive models. Scientific Reports 11, https://doi.org/10.1038/s41598-020-79875-x (2021).
Camelo-Castillo, A. J. et al. Subgingival microbiota in health compared to periodontitis and the influence of smoking. Frontiers in Microbiology 6, https://doi.org/10.3389/fmicb.2015.00119 (2015).
Buneman, P. Curated Databases, 2–2 (Springer Berlin Heidelberg, 2009).
Regueira-Iglesias, A. et al. In silico evaluation and selection of the best 16S rRNA gene primers for use in next-generation sequencing to detect oral bacteria and archaea. Microbiome 11, https://doi.org/10.1186/s40168-023-01481-6 (2023).
Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. & Sayers, E. W. GenBank. Nucleic Acids Res. 44, D67–72, https://doi.org/10.1093/nar/gkv1276 (2016).
NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 44, D7–D19, https://doi.org/10.1093/nar/gkv1290 (2015).
Bethesda (MD): National Center for Biotechnology Information (US). Entrez Programming Utilities Help. https://www.ncbi.nlm.nih.gov/books/NBK25501/ Accessed: 2024-05-22. (2010).
Schoch, C. L. et al. NCBI taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford) https://doi.org/10.1093/database/baaa062 (2020).
Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423, https://doi.org/10.1093/bioinformatics/btp163 (2009).
Johnson, A. D. An extended IUPAC nomenclature code for polymorphic nucleic acids. Bioinformatics 26, 1386–1389, https://doi.org/10.1093/bioinformatics/btq098 (2010).
Lyalina, S. search_16s_py: a simple python implementation of the 16s finding procedure. https://github.com/slyalina/search_16S_py Accessed: 2024-06-06. (2018).
Edgar, R. C. SEARCH_16S: A new algorithm for identifying 16S ribosomal RNA genes in contigs and chromosomes. bioRxiv https://doi.org/10.1101/124131 (2017).
Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362, https://doi.org/10.1038/s41586-020-2649-2 (2020).
McKinney, W. Data Structures for Statistical Computing in Python. In van der Walt, S. & Millman, J. (eds.) Proceedings of the 9th Python in Science Conference, 56–61, https://doi.org/10.25080/Majora-92bf1922-00a (2010).
Vázquez-González, L., Regueira Iglesias, A., Balsa-Castro, C., Tomás, I. & Carreira, M. J. 16SGOSeq: A curated bacterial and archaeal 16S rRNA Gene Oral Sequences dataset, https://doi.org/10.5281/ZENODO.15209015 (2025).
Morgulis, A. et al. Database indexing for production MegaBLAST searches. Bioinformatics 24, 1757–1764, https://doi.org/10.1093/bioinformatics/btn322 (2008).
Farrelly, V., Rainey, F. A. & Stackebrandt, E. Effect of genome size and rrn gene copy number on PCR amplification of 16S rRNA genes from a mixture of bacterial species. Applied and Environmental Microbiology 61, 2798–2801, https://doi.org/10.1128/aem.61.7.2798-2801.1995 (1995).
Kembel, S. W., Wu, M., Eisen, J. A. & Green, J. L. Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Computational Biology 8, e1002743, https://doi.org/10.1371/journal.pcbi.1002743 (2012).
Kim, C., Pongpanich, M. & Porntaveetus, T. Unraveling metagenomics through long-read sequencing: a comprehensive review. Journal of Translational Medicine 22, 111, https://doi.org/10.1186/s12967-024-04917-1 (2024).
Pérez-Cobas, A. E., Gomez-Valero, L. & Buchrieser, C. Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses. Microbial Genomics 6, mgen000409, https://doi.org/10.1099/mgen.0.000409 (2020).
Python Software Foundation. Python. https://www.python.org/ Accessed: 2024-05-22. (2020).
Regueira-Iglesias, A. et al. The salivary microbiome as a diagnostic biomarker of periodontitis: a 16S multi-batch study before and after the removal of batch effects. Frontiers in Cellular and Infection Microbiology 14, 1405699, https://doi.org/10.3389/fcimb.2024.1405699 (2024).
Horz, H.-P. Archaeal lineages within the human microbiome: Absent, rare or elusive? Life (Basel) 5, 1333–1345, https://doi.org/10.3390/life5021333 (2015).
Burcham, Z. M. et al. Patterns of oral microbiota diversity in adults and children: A crowdsourced population study. Scientific Reports 10, 2133, https://doi.org/10.1038/s41598-020-59016-0 (2020).
Willis, J. R. et al. Citizen-science reveals changes in the oral microbiome in Spain through age and lifestyle factors. NPJ Biofilms Microbiomes 8, 38, https://doi.org/10.1038/s41522-022-00279-y (2022).
Regueira-Iglesias, A. et al. In-Silico Detection of Oral Prokaryotic Species With Highly Similar 16S rRNA Sequence Segments Using Different Primer Pairs. Frontiers in Cellular and Infection Microbiology 11, https://doi.org/10.3389/fcimb.2021.770668 (2022).
Regueira-Iglesias, A. et al. Impact of 16S rRNA Gene Redundancy and Primer Pair Selection on the Quantification and Classification of Oral Microbiota in Next-Generation Sequencing. Microbiology Spectrum 11, https://doi.org/10.1128/spectrum.04398-22 (2023).
Vázquez-González, L. et al. PrimerEvalPy: a tool for in-silico evaluation of primers for targeting the microbiome. BMC Bioinformatics 25, https://doi.org/10.1186/s12859-024-05805-7 (2024).

Acknowledgements

This work was supported by the Instituto de Salud Carlos III (Spain) [PI24/00222]; the Xunta de Galicia - Consellería de Cultura, Educación e Universidade [ED431G-2019/04, GRC2021/48, GPC2020/27, ED481A-2021 to L.V.-G., IN606B-2023/005 to A.R.-I.]; and the European Union (European Regional Development Fund-ERDF).

Author information

Authors and Affiliations

Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Rúa de Jenaro de la Fuente Domínguez, E15782, Santiago de Compostela, Spain
Lara Vázquez-González, Carlos Balsa-Castro, Inmaculada Tomás & María J. Carreira
Instituto de Investigación Sanitaria de Santiago de Compostela (IDIS), E15706, Santiago de Compostela, Spain
Lara Vázquez-González, Alba Regueira-Iglesias, Carlos Balsa-Castro, Inmaculada Tomás & María J. Carreira
Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical Surgical Specialities, School of Medicine and Dentistry, Universidade de Santiago de Compostela, E15782, Santiago de Compostela, Spain
Alba Regueira-Iglesias, Carlos Balsa-Castro & Inmaculada Tomás
Departamento de Electrónica e Computación, Escola Técnica Superior de Enxeñaría, Universidade de Santiago de Compostela, E15782, Santiago de Compostela, Spain
María J. Carreira

Authors

Lara Vázquez-González
Alba Regueira-Iglesias
Carlos Balsa-Castro
Inmaculada Tomás
María J. Carreira

Contributions

L.V.-G. and C.B.-C. conceived the experiment(s); L.V.-G. conducted the experiment(s); and C.B.-C., A.R.-I., I.T. and M.J.C. analysed the results. L.V.-G. and A.R.-I. wrote and reviewed the first version of the manuscript; C.B.-C., M.J.C. and I.T. critically reviewed the manuscript.

Corresponding authors

Correspondence to Inmaculada Tomás or María J. Carreira.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

16SGOSeq Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Vázquez-González, L., Regueira-Iglesias, A., Balsa-Castro, C. et al. A curated bacterial and archaeal 16S rRNA Gene Oral Sequences dataset. Sci Data 12, 729 (2025). https://doi.org/10.1038/s41597-025-05050-4

Received: 19 June 2024
Accepted: 23 April 2025
Published: 02 May 2025
Version of record: 02 May 2025
DOI: https://doi.org/10.1038/s41597-025-05050-4