Abstract
With rapid advances in genome sequencing technologies, the volume of microbial genome data has grown exponentially, making it essential to continuously update reference datasets for accurate microbial identification and classification. We present Microbial Identification using rRNA Operon Region (MIrROR) release 02, an expanded dataset built from 1,690,470 genomes (1,674,514 bacterial and 15,956 archaeal) sourced from NCBI. The final curated dataset comprises 476,579 sequences, 249,907 genomes, and 29,051 species, representing increases of 387.39%, 472.49%, and 206.28%, respectively, over the previous release. Key updates include the addition of archaeal genomes and taxonomic reclassification based on GTDB R220. Extensive curation was performed: operons were filtered by length (3,500–7,000 bp), duplicate sequences and sequences containing ambiguous nucleotides were removed, and the remaining sequences were clustered at 99% identity to reduce redundancy. The updated dataset showed improved performance in microbial mock community analyses, supporting its accuracy and reliability. These improvements make MIrROR release 02 a valuable resource for microbial profiling and a range of microbiological research applications.
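The first three curation filters named above (length window, ambiguous nucleotides, exact duplicates) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `parse_fasta` and `curate` are hypothetical, duplicates are assumed to mean identical sequence strings, and "ambiguous" is assumed to mean any character other than A, C, G, or T. The 99% identity clustering step is left to a dedicated tool and is not reproduced here.

```python
from typing import Dict, Iterable, Iterator, Tuple

MIN_LEN, MAX_LEN = 3500, 7000   # operon length window stated in the abstract (bp)
UNAMBIGUOUS = set("ACGT")       # assumption: anything else counts as ambiguous


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def curate(records: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep sequences that pass the length, ambiguity, and duplicate filters."""
    kept: Dict[str, str] = {}
    seen = set()
    for header, seq in records:
        seq = seq.upper()
        if not (MIN_LEN <= len(seq) <= MAX_LEN):
            continue                      # outside the 3,500-7,000 bp window
        if not set(seq) <= UNAMBIGUOUS:
            continue                      # contains N or another IUPAC code
        if seq in seen:
            continue                      # exact duplicate of a kept sequence
        seen.add(seq)
        kept[header] = seq
    return kept
```

With a file on disk, a call such as `curate(parse_fasta(open(path)))` would return the surviving records keyed by header; the first occurrence of each duplicated sequence is the one retained.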
Data availability
All data supporting the findings of this study are publicly available. The MIrROR release 02 dataset can be accessed at https://doi.org/10.5281/zenodo.17639192. Sequencing datasets used for performance evaluation, including the ZymoBIOMICS microbial community standards, are available from the NCBI Sequence Read Archive under accession numbers SRR35709965–SRR35709957.
Code availability
Genome retrieval from NCBI GenBank using ncbi-genome-download (v.0.3.1)
ncbi-genome-download archaea -s genbank -F fasta -l all -o {output_directory} --flat-output
ncbi-genome-download bacteria -s genbank -F fasta -l all -o {output_directory}
Clustering using MeShClust (v.3.0)
meshclust3 -d {input_fasta} -o {output_fasta} -t 0.99 -c 50 -r n -e y
Redundancy filtering using BLAST (v.2.15.0)
blastn -db {blast_database} -query {input_fasta} -out {output_xml} -outfmt 5 -num_threads {num_threads}
Simulation of long-read sequencing using NanoSim (v.3.2.2)
simulator.py metagenome -gl {genome_list.tsv} -a {abundance.tsv} -c {training_directory} -o {output_directory} --fastq
In silico primer binding analysis using primersearch (EMBOSS v.6.6.0.0)
primersearch -seqall {input.fasta} -infile {primers.txt} -outfile {output.txt} -mismatchpercent 30
Sequence mapping and taxonomic profiling (with threshold options modified as described in the main text):
https://github.com/seoldh/MIrROR
Custom Python scripts used for dataset curation are available at:
https://github.com/jisoll/MirrorMaker
Custom Python scripts used for in silico primer binding analysis are available at: https://github.com/jisoll/MirrorPrimerChecker
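The blastn command listed above writes its results as BLAST XML (-outfmt 5). As a rough illustration of how such output could be read back, the sketch below extracts percent identity and query coverage for each query-hit pair using only the Python standard library. How the hits were actually used for redundancy filtering is not stated here, so no threshold is applied; the names `Hit` and `iter_hits` are hypothetical, and only the first reported HSP of each hit is considered, which is a simplifying assumption.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float   # percent identity over the aligned region
    coverage: float   # percent of the query spanned by the alignment


def iter_hits(root: ET.Element) -> Iterator[Hit]:
    """Yield one Hit per query-subject pair from a parsed BLAST XML tree."""
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        query_len = int(iteration.findtext("Iteration_query-len"))
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")   # first reported HSP only
            if hsp is None:
                continue
            identical = int(hsp.findtext("Hsp_identity"))
            align_len = int(hsp.findtext("Hsp_align-len"))
            q_from = int(hsp.findtext("Hsp_query-from"))
            q_to = int(hsp.findtext("Hsp_query-to"))
            yield Hit(
                query=query,
                subject=hit.findtext("Hit_def"),
                identity=100 * identical / align_len,
                coverage=100 * (abs(q_to - q_from) + 1) / query_len,
            )
```

For a result file, `iter_hits(ET.parse(path).getroot())` would iterate over all hits; very large XML outputs may call for a streaming parser instead of loading the whole tree.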
Acknowledgements
We thank eGnome, Inc. for providing financial and technical support for this study.
Author information
Contributions
Jisol Lee conceptualized the study, collected and curated the data, evaluated the dataset, developed the website, and wrote the manuscript. Juyong Hong contributed to data curation and manuscript writing. Donghyeok Seol provided the foundational concept for the dataset, gave feedback on data curation, and assisted with manuscript revision. Wonseok Lee contributed to study design. Junho Lee performed the mock community experiment and contributed to manuscript writing. Gyungbu Kim contributed to website development. Seoae Cho contributed to study design and acquired funding. Heebal Kim supervised the study and contributed to manuscript writing. Seoae Cho and Heebal Kim are the corresponding authors.
Ethics declarations
Competing interests
The authors declare no competing interests. The study was conducted independently, and the funding organization had no role in the study design, data collection, analysis, or interpretation.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lee, J., Hong, J., Seol, D. et al. MIrROR release 02: Expanded and refined 16S-ITS-23S rRNA operon dataset. Sci Data (2026). https://doi.org/10.1038/s41597-026-06729-y
DOI: https://doi.org/10.1038/s41597-026-06729-y


