Scientific Data
MIrROR release 02: Expanded and refined 16S-ITS-23S rRNA operon dataset
  • Data Descriptor
  • Open access
  • Published: 19 March 2026

  • Jisol Lee (affiliation 1; ORCID: orcid.org/0009-0007-5623-1085)
  • Juyong Hong (affiliation 1)
  • Donghyeok Seol (affiliation 2; ORCID: orcid.org/0000-0001-9695-5861)
  • Wonseok Lee (affiliation 3)
  • Junho Lee (affiliation 3)
  • Gyungbu Kim (affiliation 3)
  • Seoae Cho (affiliation 3)
  • Heebal Kim (affiliations 1, 3, 4; ORCID: orcid.org/0000-0003-3064-1303)

Scientific Data (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Bacterial genetics
  • Genetic databases

Abstract

Rapid advances in genome sequencing technologies have caused microbial genome data to grow exponentially, so reference datasets must be updated continuously to support accurate microbial identification and classification. We present Microbial Identification using rRNA Operon Region (MIrROR) release 02, an expanded dataset built from 1,690,470 genomes (1,674,514 bacterial and 15,956 archaeal) sourced from NCBI. The final curated dataset covers 476,579 sequences, 249,907 genomes, and 29,051 species, representing increases of 387.39%, 472.49%, and 206.28%, respectively, over the previous release. Key updates include the addition of archaeal genomes and taxonomic reclassification based on GTDB R220. Curation comprised filtering by operon length (3,500–7,000 bp), removing duplicate sequences, eliminating sequences with ambiguous nucleotides, and clustering sequences at 99% identity to remove redundancy. The updated dataset showed improved performance in microbial mock community analyses, supporting its accuracy and reliability. These improvements make MIrROR release 02 a valuable resource for microbial profiling and other microbiological research applications.
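The first three curation filters named in the abstract (length window, ambiguous nucleotides, duplicate removal) can be illustrated with a minimal Python sketch. This is not the authors' MirrorMaker code: the function name `curate`, the inclusive treatment of the 3,500 and 7,000 bp bounds, and the definition of "duplicate" as an exact sequence match are all assumptions made for illustration. The 99% identity clustering step is handled by MeShClust and is not reproduced here.

```python
from typing import Iterable, Iterator, Set, Tuple

# Length bounds taken from the abstract; treating them as inclusive is an assumption.
MIN_LEN, MAX_LEN = 3_500, 7_000
UNAMBIGUOUS = set("ACGT")


def curate(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) pairs that pass the length, ambiguity and duplicate filters."""
    seen: Set[str] = set()
    for seq_id, seq in records:
        seq = seq.upper()
        if not (MIN_LEN <= len(seq) <= MAX_LEN):
            continue  # outside the 3,500-7,000 bp window
        if set(seq) - UNAMBIGUOUS:
            continue  # contains N or another IUPAC ambiguity code
        if seq in seen:
            continue  # exact duplicate of a sequence already kept (assumed definition)
        seen.add(seq)
        yield seq_id, seq
```

Calling `list(curate(records))` on an iterable of `(id, sequence)` pairs returns only the records that survive all three filters, in input order.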

Data availability

All data supporting the findings of this study are publicly available. The MIrROR release 02 dataset can be accessed at https://doi.org/10.5281/zenodo.17639192. Sequencing datasets used for performance evaluation, including the ZymoBIOMICS microbial community standards, are available from the NCBI Sequence Read Archive under accession numbers SRR35709965–SRR35709957.
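For scripted retrieval, the accession range above can be expanded into individual run identifiers. The sketch below assumes the range is contiguous and inclusive, which the text implies but does not state explicitly; `expand_accessions` is a hypothetical helper name.

```python
import re
from typing import List


def expand_accessions(first: str, last: str) -> List[str]:
    """Expand two accessions with a shared prefix into the inclusive list between them."""
    m1 = re.fullmatch(r"([A-Z]+)(\d+)", first)
    m2 = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not m1 or not m2 or m1.group(1) != m2.group(1):
        raise ValueError("accessions must share the same alphabetic prefix")
    prefix, width = m1.group(1), len(m1.group(2))
    # Sort the endpoints so the range works in either order.
    lo, hi = sorted((int(m1.group(2)), int(m2.group(2))))
    return [f"{prefix}{n:0{width}d}" for n in range(lo, hi + 1)]


if __name__ == "__main__":
    for accession in expand_accessions("SRR35709965", "SRR35709957"):
        print(accession)
```

Under the contiguity assumption this yields nine run accessions, from SRR35709957 to SRR35709965.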

Code availability

Genome retrieval from NCBI GenBank using ncbi-genome-download (v.0.3.1)

ncbi-genome-download archaea -s genbank -F fasta -l all -o {output_directory} --flat-output

ncbi-genome-download bacteria -s genbank -F fasta -l all -o {output_directory}
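To make the two retrieval commands reproducible from a script, they can be assembled as argument lists. This is a sketch only: `genome_download_cmd` is a hypothetical helper, the output directory name is a placeholder, and the flags are copied unchanged from the commands above (including the fact that only the archaeal command uses `--flat-output`).

```python
import shlex
from typing import List


def genome_download_cmd(group: str, output_directory: str, flat_output: bool = False) -> List[str]:
    """Build the ncbi-genome-download argument list shown in the text for one taxonomic group."""
    cmd = [
        "ncbi-genome-download", group,
        "-s", "genbank",   # download from the GenBank section
        "-F", "fasta",     # genomic FASTA files
        "-l", "all",       # all assembly levels
        "-o", output_directory,
    ]
    if flat_output:
        cmd.append("--flat-output")
    return cmd


if __name__ == "__main__":
    # "genomes" is a placeholder directory name, not taken from the article.
    for group, flat in (("archaea", True), ("bacteria", False)):
        print(shlex.join(genome_download_cmd(group, "genomes", flat)))
        # To execute instead of printing: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems when the output path contains spaces.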

Clustering using MeShClust (v.3.0)

meshclust3 -d {input_fasta} -o {output_fasta} -t 0.99 -c 50 -r n -e y

Redundancy filtering using BLAST (v.2.15.0)

blastn -db {blast_database} -query {input_fasta} -out {output_xml} -outfmt 5 -num_threads {num_threads}
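Because `-outfmt 5` writes BLAST XML, the hits can be read with the Python standard library. The sketch below reports the percent identity of the first HSP of each hit. How the authors turned hits into removal decisions is not stated in this section, so the function only extracts values and applies no threshold; `iter_hits` and the embedded sample record are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def iter_hits(xml_text: str) -> Iterator[Tuple[str, str, float]]:
    """Yield (query, subject, percent identity of the first HSP) for every hit."""
    root = ET.fromstring(xml_text)
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")
            if hsp is None:
                continue  # hit without an HSP record
            identity = int(hsp.findtext("Hsp_identity"))
            align_len = int(hsp.findtext("Hsp_align-len"))
            yield query, hit.findtext("Hit_def"), 100.0 * identity / align_len


# Tiny hand-written record in the BLAST XML layout, for illustration only.
SAMPLE = """<BlastOutput><BlastOutput_iterations><Iteration>
<Iteration_query-def>seqA</Iteration_query-def>
<Iteration_hits><Hit><Hit_def>seqB</Hit_def><Hit_hsps><Hsp>
<Hsp_identity>99</Hsp_identity><Hsp_align-len>100</Hsp_align-len>
</Hsp></Hit_hsps></Hit></Iteration_hits>
</Iteration></BlastOutput_iterations></BlastOutput>"""

if __name__ == "__main__":
    for query, subject, pct in iter_hits(SAMPLE):
        print(query, subject, pct)
```

For large result files, streaming with `xml.etree.ElementTree.iterparse` would avoid loading the whole document into memory.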

Simulation of long-read sequencing using NanoSim (v.3.2.2)

simulator.py metagenome -gl {genome_list.tsv} -a {abundance.tsv} -c {training_directory} -o {output_directory} --fastq

In silico primer binding analysis using primersearch (EMBOSS v.6.6.0.0)

primersearch -seqall {input.fasta} -infile {primers.txt} -outfile {output.txt} -mismatchpercent 30
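The effect of `-mismatchpercent 30` can be illustrated with a simplified stand-in for primer matching. The code below is not EMBOSS's algorithm: it scans the forward strand only, ignores IUPAC ambiguity codes, and assumes the mismatch budget is the primer length times the percentage, rounded down. The primer and template are toy sequences, not primers from the study.

```python
from typing import List


def allowed_mismatches(primer: str, mismatch_percent: int = 30) -> int:
    """Mismatch budget for a primer (assumed: length * percent / 100, rounded down)."""
    return len(primer) * mismatch_percent // 100


def count_mismatches(primer: str, site: str) -> int:
    """Number of positions at which primer and site differ (equal lengths expected)."""
    return sum(a != b for a, b in zip(primer, site))


def find_sites(primer: str, template: str, mismatch_percent: int = 30) -> List[int]:
    """Return 0-based start positions where the primer binds within the mismatch budget."""
    budget = allowed_mismatches(primer, mismatch_percent)
    n = len(primer)
    return [
        i
        for i in range(len(template) - n + 1)
        if count_mismatches(primer, template[i:i + n]) <= budget
    ]


if __name__ == "__main__":
    primer = "ACGTACGTAC"          # toy primer, 10 nt
    template = "TTTACGTACGTACTTT"  # toy template with one exact site
    print(allowed_mismatches(primer))
    print(find_sites(primer, template, mismatch_percent=0))
```

A permissive budget such as 30% admits more candidate binding sites than exact matching, so results from such a scan are best read as an upper bound on primer coverage.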

Sequence mapping and taxonomic profiling (with threshold options modified as described in the main text):

https://github.com/seoldh/MIrROR

Custom Python scripts used for dataset curation are available at:

https://github.com/jisoll/MirrorMaker

Custom Python scripts used for in silico primer binding analysis are available at:

https://github.com/jisoll/MirrorPrimerChecker

Acknowledgements

We thank eGnome, Inc. for providing financial and technical support for this study.

Author information

Authors and Affiliations

  1. Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 08826, Republic of Korea

    Jisol Lee, Juyong Hong & Heebal Kim

  2. Department of Surgery, Seoul National University Bundang Hospital, 172 Dolma-ro, Bundang-gu, Seongnam, 13605, Republic of Korea

    Donghyeok Seol

  3. eGnome, Inc., Seoul, Republic of Korea

    Wonseok Lee, Junho Lee, Gyungbu Kim, Seoae Cho & Heebal Kim

  4. Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea

    Heebal Kim

Contributions

Jisol Lee conceptualized the study, collected and curated the data, evaluated the dataset, developed the website, and wrote the manuscript. Juyong Hong contributed to data curation and manuscript writing. Donghyeok Seol provided the foundational concept for the dataset, gave feedback on data curation, and assisted with manuscript revision. Wonseok Lee contributed to study design. Junho Lee performed the mock community experiment and contributed to manuscript writing. Gyungbu Kim contributed to website development. Seoae Cho contributed to study design and acquired funding. Heebal Kim supervised the study and contributed to manuscript writing. Seoae Cho and Heebal Kim are the corresponding authors.

Corresponding authors

Correspondence to Seoae Cho or Heebal Kim.

Ethics declarations

Competing interests

The authors declare no competing interests. The study was conducted independently, and the funding organization had no role in the study design, data collection, analysis, or interpretation.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (DOCX)

Supplementary Tables 1-4 (XLSX)

Supplementary Table 5 (XLSX)

Supplementary Table 6 (XLSX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Lee, J., Hong, J., Seol, D. et al. MIrROR release 02: Expanded and refined 16S-ITS-23S rRNA operon dataset. Sci Data (2026). https://doi.org/10.1038/s41597-026-06729-y

  • Received: 07 October 2024

  • Accepted: 27 January 2026

  • Published: 19 March 2026

  • DOI: https://doi.org/10.1038/s41597-026-06729-y

Scientific Data (Sci Data)

ISSN 2052-4463 (online)

© 2026 Springer Nature Limited
