A roadmap for equitable reuse of public microbiome data

Hug, Laura A.; Hatzenpichler, Roland; Moraru, Cristina; Soares, André R.; Meyer, Folker; Heyder, Anke; Probst, Alexander J.

doi:10.1038/s41564-025-02116-2

Consensus Statement
Published: 26 September 2025

Nature Microbiology volume 10, pages 2384–2395 (2025)

An Author Correction to this article was published on 05 November 2025

This article has been updated

Abstract

Science benefits from rapid open data sharing, but current guidelines for data reuse were established two decades ago, when databases were several million times smaller than they are today. These guidelines are largely unfamiliar to the scientific community, and, owing to the rapid increase in biological data generated in the past decade, they are also outdated. As a result, there is a lack of community standards suited to the current landscape and inconsistent implementation of data sharing policies across institutions. Here we discuss current sequence data sharing policies and their benefits and drawbacks, and present a roadmap to establish guidelines for equitable sequence data reuse, developed in consultation with a data consortium of 167 microbiome scientists. We propose the use of a Data Reuse Information (DRI) tag for public sequence data, which will be associated with at least one Open Researcher and Contributor ID (ORCID) account. The machine-readable DRI tag indicates that the data creators prefer to be contacted before data reuse, and simultaneously provides data consumers with a mechanism to get in touch with the data creators. The DRI aims to facilitate and foster collaborations, and serve as a guideline that can be expanded to other data types.

Main

Sequence data reuse has been an evolving topic over the past two decades. The Fort Lauderdale Agreement (FLA), a public declaration by biomedicine scientists supporting the free and unrestricted use of genome sequencing data, was formulated in 2003, before the advent of metagenomics and during a time when sequencing was still too costly to be performed by individual laboratories¹. The FLA concluded that large genome projects should be released before publication to allow unrestricted and immediate reuse, which would accelerate the advancement of science. The FLA strengthened the Bermuda Principles defined in 1996², which advocated for the release of sequence data 24 h after generation and before publication of research papers. In 2009, after the Human Genome Project highlighted the advantages of sharing data early and widely, the Toronto Statement (TOR)³ advocated for the prepublication release of other biological data types beyond genomics data. Finally, in 2014, 141 United Nations member states and the European Union entered into the Nagoya Protocol⁴ (Regulation (EU) No 511/2014), which calls on data creators and data users to develop, update and use voluntary codes of conduct, guidelines and best practices in relation to access and benefit-sharing of genetic data (see Box 1 for descriptions of the different roles of researchers working with sequence datasets).

Large-scale sequence data analysis has become mainstream with a wide array of tools available, making data mining accessible to many labs. Now, ~20 years after the FLA, GenBank holds an estimated 5.09 terabase pairs (Tbp)⁵ of biological sequence data, and the Sequence Read Archive (SRA) holds 90.89 petabase pairs (Pbp) as of February 2024 (Supplementary Fig. 1 and Box 2). These databases are several million times larger than the available sequence data at the time that the FLA or TOR was formulated. With the rapid, continuous increase in public sequence data (projected to reach ~500 Pbp in 2030, Supplementary Fig. 1), data mining projects (or those requiring large public datasets for artificial intelligence training) have increased in both frequency and scope, necessitating a review and potential overhaul of the 20-year-old guidelines set out in the FLA and TOR^1,3.

In 2016, the FAIR principles for data management were defined, which place an emphasis on submitted data being machine actionable (that is, computational systems should be able to Find, Access, Interoperate and Reuse data with minimal human intervention)⁶. These principles were designed to promote good scientific practice and to serve as a guideline for those wishing to enhance the reusability of their data. The FAIR principles have since been adopted as recommendations or requirements by major funding bodies, including the US National Institutes of Health and the European Commission⁷. The FAIR data principles prioritize data reuse and computer-driven data mining, and include a specific requirement for (meta)data to be released with a clear and accessible data reuse licence (principle R.1). To date, this aspect of the FAIR principles has not been implemented in a straightforward or machine-readable way, and a coordinated implementation between databases and the community is lacking.

Biological sciences, and particularly their subdisciplines associated with generating sequence data, have been at the forefront of data availability compared with the fields of earth sciences, mathematics, physics and chemistry⁸. For instance, astrophysics is a subdiscipline that traditionally relies on data sharing and reuse owing to the exorbitant costs of research data. A study investigating the motivational factors behind data sharing and reuse within this field identified several demotivating factors⁹. Among them were the lack of data standards, the lack of facilitating platforms, inconsistency between datasets, limited documentation, difficulties finding and reusing data, and last but not least, competition and fear of being ‘scooped’ (accidentally or purposefully)⁹. The latter point is considered in the FLA. While the FLA recommends swift prepublication release of data generated by large sequencing consortia, it also states that “[…] the contributions and interests of the large-scale data producers should be recognized and respected by the users of the data, and the ability of the production centers to analyse and publish their own data should be supported by their funding agencies […]”¹. This highlights one of the significant and enduring tensions between data creators and data consumers. Both data creators and data consumers are indispensable to advancing biological sciences, particularly in the realm of sequence data analysis, in which many data creators also act as data consumers. Unrestricted public use of microbiome data, on which data creators have not yet published, does not always align with the interests of data creators.

How to achieve unrestricted data reuse and, at the same time, give due credit to data creators has been discussed by the scientific community¹⁰. Data reuse in this work refers specifically to those cases in which the sequence data will be featured in a publication prepared by a data consumer, whether in figures, tables or text, or as an important aspect of the workflow that leads to new insights or conclusions (see Supplementary Table 2 for some Data Reuse Information (DRI) usage scenarios). In this spirit, a recent study thoroughly analysed the pros and cons of early data release and considered both the needs of data creators and consumers. The authors proposed immediate, unrestricted release of sequence data before publication, in parallel with the adoption of a reward system (for example, separate promotion and tenure tracks) for acknowledgement of data creators by universities and research institutions¹¹. In addition, the authors proposed making the datasets and the protocols used for their generation citable through Digital Object Identifiers (DOIs). If implemented, these measures would create a safer environment for data sharing, benefiting all parties involved and, most importantly, supporting the advancement of science.

Mechanisms for crediting data creators beyond citing associated publications are not yet widespread in the scientific community. Creating separate tenure tracks or other incentives for data creators and data consumers requires sizable changes in evaluation criteria, and would require substantial time to propagate through institutions. DOIs for datasets, on the other hand, seem relatively easy to implement and would provide data creators with a reportable impact metric. However, their use has not yet been widely adopted, possibly owing to the costs associated with purchasing and maintaining DOIs, which can be prohibitive for many publicly funded research institutions. Potential measures to lower DOI costs could include large-scale agreements between research institutions and DOI providers. Other mechanisms of data citation have been discussed in the community but have also not been widely adopted^12,13. Currently, data creators do not have any incentive or reward for releasing sequence data before an associated publication.

It is crucial to implement methodological and ethical guidelines that are based on the principles of good scientific practice and which are driven by the scientific community to facilitate appropriate use of public data. This need has been highlighted by recent conflicts between data creators and data consumers that played out over social media. Implementing and following guidelines for unpublished data usage by all scientists would create ‘safe spaces’ for data creators to publish their first analyses of data—particularly if they are delayed by resource, time or personnel constraints. The research topic also affects the expectations for open data. Research related to public health necessitates swift data release to counteract pandemics or identify zoonotic diseases. For example, in the event of a pandemic, there should be no data restriction on research related to the pandemic¹⁴. The general goal should be to promote open sharing of complete datasets as early and as widely as possible, across all institutions and individuals. This necessitates a technical framework that enhances the communication between data creators and data consumers regarding data reuse.

Here we propose a roadmap to enable equitable reuse of public microbiome data. This roadmap (1) addresses the lack of consensus in the field of microbiome research regarding public microbiome data use and reuse, (2) promotes communication between data consumers and data creators and (3) facilitates the rapid advancement of the microbiome field, including supporting the continued increases in data mining. To achieve the goal of this roadmap, we propose the introduction of a new machine-readable metadata tag, named DRI, containing Open Researcher and Contributor IDs (ORCIDs) of the data creators associated with data in public databases. The DRI will clearly indicate the point of contact for communication and whether communication is desired by data creators. The ability to provide a point of contact for data reuse will lead to more rapid and complete data deposition. Following adoption by databases, authors and scientific journals would ideally integrate statements confirming that the best practices governed by DRI use were followed in manuscripts and submission processes.

The roadmap is directly in line with the FAIR data principles, specifically contributing to FAIR principle R.1 in providing a machine-readable licence for data usage. This roadmap and its adoption by the scientific community (167 scientists as part of the Data Reuse Consortium listed in Supplementary Table 1, totalling 229 supporters including the co-authors of this paper) will provide a citable resource regarding guidelines for public data reuse, will enable appropriate data reuse by data consumers and will reduce tension for data creators when submitting data. Ultimately, this roadmap outlines the expected practices for open data use for sequence data and represents a model for other biological data such as metabolomics or proteomics data.

Box 1 Definitions of roles of scientists and legal entities related to microbiome and sequence data

Data consumer: any legal entity interested in using public data

Data creator: entity or entities, that is, individual or multiple researchers, who designed the study, obtained the samples and intended to publish an analysis; typically assumed to have priority on analysis

Data distributor: public databases that provide access to the digital data (for example, GenBank, ENA, DDBJ; Box 2)

Data generator: entity that renders the sample into a digital object (for example, a sequencing facility that processes biological material and produces sequence-related files to transmit to another entity)

Data owner: legal entity who owns the data by rights; this can be different from ‘data creator’ (for example, an institute at which the data creator is employed or a nation or state)

Box 2 Abbreviations extensively used in microbiome data research

COGs (clusters of orthologous groups): represents a collection of proteins from complete bacterial and archaeal genomes, grouped into clusters of orthologues, and associated with functional annotations

DDBJ (DNA Data Bank of Japan): a database of nucleotide sequence data maintained by the National Institute of Genetics (NIG) in Japan

EMBL (European Molecular Biology Laboratory): a research organization that conducts basic research in molecular biology and offers a range of scientific resources

EMBL-EBI (European Bioinformatics Institute): a bioinformatics research centre belonging to the EMBL, which maintains and provides access to several sequence-related databases (for example, ENA, Interpro, PDBe, UniProt)

ENA (European Nucleotide Archive): a database of raw sequence data and annotated sequence data, from a wide range of organisms, maintained by EMBL-EBI

GenBank (Genetic Sequence Database): a database of DNA and RNA sequences from a wide range of organisms, along with associated annotations and metadata

GSC (Genomic Standards Consortium): an international organization dedicated to the development and implementation of standards and best practices in genomics and related fields

IMG/M (Integrated Microbial Genomes and Metagenomes): a data management and analysis system for microbial genomes and metagenomes maintained by the Department of Energy’s Joint Genome Institute (JGI) in the USA

IMG/VR (Integrated Microbial Genomes with Virus-related Datasets): a specialized database storing virus genomic and metagenomic sequences, annotations and metadata

InterPro (integrated resource of protein domains and functional sites): a database that integrates information on protein domains, motifs and functional sites from a variety of sources

INSDC (International Nucleotide Sequence Database Collaboration): a data-sharing initiative between DDBJ, EMBL-EBI and NCBI

KEGG (Kyoto Encyclopedia of Genes and Genomes): a comprehensive database and knowledge base of biological systems, including genes, proteins and biochemical pathways; maintained by Kanehisa Laboratories in Japan

KOG (Eukaryotic Orthologous Groups): a collection of proteins from eukaryotic genomes, grouped into clusters of orthologues, and associated with functional annotations

NCBI (National Center for Biotechnology Information): part of NIH; a central repository for molecular sequence data, including several databases (for example, GenBank, RefSeq, SRA, COG, KOG and so on)

NIH (National Institutes of Health): a biomedical research agency of the federal government of the USA

PDBe (Protein Data Bank in Europe): a comprehensive collection of 3D structures of proteins and other macromolecules

PFAM (protein family database): a database of protein families, domains and functional sites

RefSeq (Reference Sequence Database): a comprehensive, non-redundant database of reference genomic, transcriptomic and proteomic sequences, for a wide range of organisms

SRA (Sequence Read Archive): a public repository for raw sequence data generated by platforms such as Sanger, Illumina, Ion Torrent and Pacific Biosciences

UniProt (Universal Protein Resource): a comprehensive protein sequence database, including additional field-specific contextual information (for example, protein domain structure and known interactions)

Survey on data reuse

We created an anonymous survey with Google Forms, which was distributed to the international scientific community on 15 January 2024, to accumulate opinion data on a number of key topics related to this manuscript (Supplementary Box 1). Participation was voluntary and anonymous, and participants gave informed consent before participation. To ensure participant confidentiality, it was emphasized that signing the consensus statement could not be linked to survey responses. The University of Duisburg-Essen ethics committee evaluated the study and waived the need for ethics approval. Questions included in the survey were formulated in a neutral fashion with the intent of not biasing responses, which were anonymized to ensure openness and transparency. This survey was online and open for responses over a total of 21 days. Efforts to ensure widespread awareness of this survey included actively advertising it across multiple social media platforms (X.com, LinkedIn, Bluesky.app) over the duration of the survey. To achieve this, 39 authors made use of their accounts across these platforms while also leveraging their working group and other institutional accounts. A blog post advertising this survey was additionally hosted on the Springer Nature Research Communities blog¹⁵. Finally, a total of 78 microbiology institutions across the world were contacted to increase participation from the Global South and underrepresented segments of the global scientific community. Raw data pertaining to the anonymous responses to this survey in TSV format are available at the Open Science Framework¹⁶.

This resulted in responses from 306 scientists representing all continents (except Antarctica) with feedback on community interest in and likelihood of adopting the proposed roadmap (Supplementary Figs. 2–10 and Supplementary Tables 3–7). Survey questions were designed to enable quantitative analysis, namely, to estimate the fraction of the community that agreed or disagreed with proposed aspects of the roadmap delineated in this manuscript. Raw data for this survey were imported into R 4.3.1 running in RStudio 2023.12.0 (Build 369) using the tidyverse ecosystem of data analysis packages to read TSV inputs (readr, version 2.1.5), filter (dplyr, version 1.1.4) and generate plots (ggplot2, version 3.5.1)^17,18. For Fig. 1, survey data were processed using R, and visualizations in Fig. 1a–c were generated using ggplot2. Figure 1d was generated with a copyright-free image, coloured by normalizing height (in pixels) to percentage categories. Colours and formatting were manually edited. Sankey diagrams were generated via the ggsankey package (version 0.0.99999). Tables were generated with the kableExtra package (version 1.3.4)¹⁹. R code as well as system and package versions used for all data analysis in this manuscript are publicly available at a GitHub repository: https://github.com/GeoMicroSoares/DataUsage_Data_Analysis.
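The kind of quantity the survey analysis targets, the fraction of respondents giving each answer to a question, can be illustrated with a short sketch. The published analysis was carried out in R with the tidyverse; the Python version below is only a language-neutral illustration and is not the authors' code. The column name `answer` and the four example rows are invented for demonstration and do not come from the survey data.

```python
import csv
import io
from collections import Counter

# Made-up example rows; NOT taken from the actual survey responses.
EXAMPLE_TSV = "respondent\tanswer\n1\tyes\n2\tno\n3\tyes\n4\tyes\n"


def response_fractions(tsv_text, column):
    """Return the fraction of non-empty responses per answer in one TSV column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Skip rows where the respondent left the question blank.
    counts = Counter(row[column] for row in reader if row[column])
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}


if __name__ == "__main__":
    for answer, fraction in response_fractions(EXAMPLE_TSV, "answer").items():
        print(f"{answer}: {fraction:.0%}")
```

Blank answers are excluded from the denominator here; whether the original analysis treated missing responses this way is an assumption.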

**Fig. 1: Summary results from a survey of 306 scientists on data reuse.**

The initial DRI tag and the strategy for its implementation were defined by the Data Reuse Core Team before the survey. After the survey, the Data Reuse Core Team refined the roadmap and DRI strategy through conversations with members of the Data Reuse Consortium, of the Joint Genome Institute, of the European Nucleotide Archive and of the Genomic Standards Consortium, and the reviewers of this manuscript. Throughout the refinement process, the Data Reuse Core Team identified compromises between differing priorities and ensured that the DRI strategy was actionable within current database structures.

Microbiome data types most frequently reused include amplicon datasets of single genes, such as 16S rRNA or internal transcribed spacer (ITS) genes, as well as reads and assemblies of genomes, metagenomes and metatranscriptomes. These sequence data, while typically structured as machine-readable files, can come in various formats depending on the hosting repository.

Organizations hosting data usually have their own legal framework governing data download. In the case of the EMBL-EBI ENA database, for example, while records are typically available with no restrictions on reuse, the Terms of Use²⁰ recognize that third parties may assert restrictions on reuse for a variety of reasons. The database is working towards more systematic implementation of categorizing reusability, probably under Creative Commons (CC0) licensing²¹. This puts the responsibility on the data consumer to ensure their action is indeed legal and covered by the copyrights that may be associated with the data they are accessing, even if the licensing information is not machine readable.

The current state of data policies across online repositories of biological data, including EMBL-EBI, is largely reflective of a now outdated statement in the TOR. Here data users are advised to contact data creators to ‘discuss publication plans’, a demanding and often unfeasible approach given today’s rate of data production and the availability of new data analysis pipelines. The TOR, from 2009, and the FLA, from 2003, are the agreements guiding the field to date, although their contents are often not well known by microbiome scientists across all career stages (Supplementary Figs. 4–6). The FLA specifies, among other things, that “sequence assemblies of 2 kb or greater by large-scale sequencing efforts” must be rapidly released. The language around the scale of data reflects the outdated guidance, and the emphasis on large-scale data creators is no longer in line with the prevalence of independent labs as data creators today. Interestingly, the conflict between data sharing and publishing a first analysis was acknowledged in the FLA, but it does not provide guidelines for navigating this concern. The TOR elaborated further on the conflict between “data [creators] and data users”, stating that, in the authors’ experience, conflicts have rarely arisen.

The TOR lists a set of conditions (scale, utility, reference data, community acceptance) to consider for prepublication data sharing that are of limited relevance for today’s data landscape³. Currently, individual research groups can contribute substantial datasets, both in terms of size and scientific value, and the idea of individualized, private agreements on data sharing, as suggested by the TOR, is no longer viable. In response to this need, repositories have developed their own policies (for example, the EMBL-EBI licence information). Data distributors recognize that some public datasets, although hosted by their respective services, carry additional restrictions that are currently neither easily visible nor machine readable. Given the ease of accessibility and sheer volume of sequence data, it has become impractical and, in some cases, impossible for a data consumer to verify and to comply with recommendations and restrictions.

Identifying conflicts of interest between data consumers and data creators

The current scale of open sequence data, including massive open datasets such as the Tara Oceans (7.2 Tbp of metagenomic data) and Integrative Human Microbiome (1.3 Tbp of metagenomic data as of 2019) projects, has made meta-analyses drawing on public data a powerful avenue to explore microbial systems^22,23. Use of public data is now routine; close to 80% of respondents to our poll on data usage identified themselves as both data creators and data consumers (Box 1 and Supplementary Table 7). Access to public data has unequivocally improved the depth of the science conducted. However, it is often difficult to assess whether data reuse follows the expectations of communication and collaboration outlined in the TOR. There are many widely used software tools that include use of secondarily accessed data for which no primary publication exists (for example, the Genome Taxonomy Database, GToTree)^24,25. Identification of the publication status associated with specific data in many repositories is not straightforward. As more governments begin to require data deposition on short or immediate timelines, there is a growing tension between data creators and data consumers around public data use. Clarifying and facilitating data reuse is therefore in the best interest of the community.

The first step is to identify roots of the potential conflict of interest between data creators and data consumers, as established following discussions between the authors of this manuscript as well as within their academic networks. These are discussed in detail below.

Disconnect between efforts of data creation and ease of reuse

One source of conflict is a disconnect between the efforts expended by data creators in generating sequence data and associated metadata, and the ease of reuse and limited or inconsistent acknowledgement of data origins by consumers. Creators sometimes feel that their monetary, time and intellectual investments to design and conduct sample collection and experiments; obtain permission for, plan, fund and carry out research expeditions; process samples; and deposit data and metadata are not adequately acknowledged or are potentially ignored by data consumers. Data creators must obtain legal documents (for example, sampling permits, visas) and follow international agreements (for example, the Convention on Biological Diversity, Nagoya Protocol), and manage the risks that come with certain fieldwork (for example, treacherous terrain, wilderness areas, areas with high criminal activity). In addition, they must secure funding for custom-designed vehicles and instrumentation needed for sample collection (for example, drill ships, research vessels, submarines, buoys, remote samplers) and maintain research sites in hard-to-access areas (for example, polar regions or the Amazon). Unbeknownst to data consumers, the original data creators may be bound by restrictive agreements on appropriate or ethical data use if research was conducted in a national park, on private land or land owned by Indigenous nations, and/or for samples obtained from human specimens or biobanks²⁶. There are currently limited rewards for data creators when their data are reused, and data creators have little incentive to make detailed metadata available. Systems for reporting and incentivizing data deposition have been proposed but are not yet the norm^11,27.

Timely deposition contrasts with lengthy multi-omics analyses

Both data creators and consumers generally share ideals of open science and rapid advancement of science. However, conflicts arise from disagreements in the timing of sharing data and prioritization of access. Data creators must balance long trainee timelines with publication of datasets intended for multiple research questions. Publishing a first paper on a large dataset and depositing the full data may make additional research projects associated with that dataset vulnerable to scooping. Results from our poll suggest that 53% of researchers are concerned about negative impacts on their research programme and/or mentees from unauthorized data reuse (Fig. 1 and Supplementary Figs. 7–9). As a result, partial or raw datasets or datasets lacking key metadata are deposited in place of more polished, complete datasets with full metadata to guide interpretation of genome data. In the absence of open data, data consumers are frequently unable to access contextual information (for example, physicochemical parameters, geolocations) of the field site that are essential to accurately interpret the data, to the detriment of downstream analyses. These issues are exacerbated by public sequence databases lacking (links to) the associated metadata and a general lack of familiarity of many data consumers with the literature on, or the environmental context of, a specific system.

Perceived threats to research and career goals

Duplication of effort and the potential for lowered impact or difficulties publishing replicated results are a loss for both creators and consumers. For data creators, raw data underlying published scientific results must be made public to meet expectations for reproducibility. However, unrestricted access to public data can compromise permits, site access agreements and research ethics board approvals, all of which can negatively impact the data creators’ and their mentees’ ongoing research. For data consumers, even unintentional reuse of restricted data can slow research progress while appropriate permissions are sought, delay publications while data are removed and, in extreme cases, lead to paper retractions. The perceived threat is that unauthorized data reuse can also negatively impact planned research directions, funded research goals, acquisition of new funds or career perspectives of early career researchers for both creators and consumers. A lack of formal structure for data reuse causes tension for data creators and consumers alike.

A roadmap to reduce tension between data creators and data consumers

There are multiple potential avenues for mitigating the three conflicts of interest discussed above, yet not every approach is suitable or can be realized. For instance, funding agencies have the power to set rules for data release and data reuse in principle. However, besides the differentiation between private and taxpayer-funded agencies, funders usually have diverging agendas that not only differ across political borders but are also heterogeneous within a single country. To address the current tension(s) between data creators and data consumers and to update the existing agreements from more than 15 years ago, we propose a comprehensive roadmap for data reuse (Fig. 2). We recommend following this roadmap except in cases in which institutions or funding agencies have a different policy for data reuse in place or there is a restricting licence associated with the dataset itself. This roadmap was developed with the aim of minimizing friction between data creators and data consumers, while promoting open science, and involves the introduction of a new machine-readable DRI metadata tag for facilitating communication between data creators, generators and consumers.

**Fig. 2: Recommendations for equitable reuse of public microbiome data.**

Transparent, equitable and ethical use of public data necessitates clear labelling of its usability by data distributors. Although we initially considered a system that places limitations on free data use, following consultations with the Genomic Standards Consortium and polling 306 microbiome scientists, we have converged on an approach that focuses on both simplicity and openness while achieving nearly all the desired effects. The DRI tag would attach ORCIDs of the data creators to deposited data, signalling that data creators wish to be contacted (for example, via email) before data use²⁸. This way, the DRI tag also provides stable contact information to allow data consumers to easily reach data creators. ORCIDs are both free and ubiquitous and, most importantly, are already used internally by the INSDC community. The absence of a DRI tag would signal that the authors of the dataset agree to its reuse without the need to be further involved or contacted.

The DRI will consist of a tag with one or more associated ORCIDs identifying the data creators. In computer science notation, the DRI tag will have the following structure:

DRI = {ORCID1, ORCID2, …}
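As an illustration only, the tag structure above could be represented in software as a set of validated ORCID iDs. The following is a minimal Python sketch under stated assumptions: the names `DRITag`, `is_valid_orcid` and `requires_contact` are hypothetical and do not correspond to any existing repository interface, and the check-character calculation follows the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers.

```python
import re
from dataclasses import dataclass

# Hypothetical names throughout; no repository currently exposes this API.
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def is_valid_orcid(orcid: str) -> bool:
    """Check the format and the ISO 7064 MOD 11-2 check character of an ORCID iD."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for char in digits[:15]:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[15] == expected


@dataclass(frozen=True)
class DRITag:
    """A DRI tag modelled as a set of ORCID iDs: DRI = {ORCID1, ORCID2, ...}."""

    orcids: frozenset

    def __post_init__(self):
        # Reject the whole tag if any identifier fails validation.
        invalid = sorted(o for o in self.orcids if not is_valid_orcid(o))
        if invalid:
            raise ValueError(f"invalid ORCID iD(s): {invalid}")

    @property
    def requires_contact(self) -> bool:
        # At least one ORCID signals that the creators prefer prior contact.
        return len(self.orcids) > 0


# 0000-0002-1825-0097 is the fictitious example iD used in ORCID documentation.
tag = DRITag(frozenset({"0000-0002-1825-0097"}))
```

Modelling the tag as a set reflects the notation above, in which the order of ORCIDs carries no meaning; how a repository would actually serialize the tag is left open here.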

We note that, implicitly, data consumers are expected to acknowledge or cite any data they use for their scientific work. For the new ethical use of data with DRI, we expect data consumers to follow the approach summarized in Fig. 2.

Sequencing datasets published in public databases (for example, in GenBank) have tags, attributes and fields that indicate which publications are connected to the respective datasets. Examples of such tags are the following: (1) for GenBank entries, ‘REFERENCE’, ‘REFERENCE/AUTHORS’, ‘REFERENCE/TITLE’ and ‘REFERENCE/JOURNAL’; (2) for BioSample, ‘reference for biomaterial’; and (3) for BioProject, ‘Publications’. The content of these tags is input initially and/or updated by the data submitter. In theory, they can be updated later either manually by the submitter or automatically by the system. Large scientific literature databases such as PubMed (https://pubmed.ncbi.nlm.nih.gov/) and Europe PMC (https://europepmc.org/) actively monitor published scientific articles and index listings of sequencing accession numbers, generating crucial linkage information that can help keep such information up to date in public sequence databases. However, as of 6 November 2024, only 147,632 and 78,430 publications for PubMed and Europe PMC, respectively, have had 196,632 and 91,440 sequence accession numbers assigned. The linkage information on literature and sequence data stored in these databases compares poorly with the scale of the exponentially growing data hosted in the NCBI SRA, amounting to 9 million accession numbers as of 28 March 2023 (12 petabytes of sequence data). Coordination between publication and sequencing databases could in theory be improved if both database types associated ORCIDs with their entries. In our roadmap, the DRI, which will contain the ORCID of at least one of the data creators (typically the corresponding creator, that is, the project leader), could address these issues. The content of this tag would be input or updated solely by the data submitter, preferably during the initial data deposition.

The presence of a DRI tag indicates that data creators prefer to be contacted if a data consumer reuses their data, especially if the respective data have no associated publication. The intentions behind this preference can be manifold, including a willingness to share additional metadata or datasets, or the preference to collaborate to help protect early career researchers’ (for example, PhD students) ability to finish their studies and graduate. Including one or several ORCID(s) with the DRI will provide a stable point of contact, bolster transparency in science and adherence to the FAIR principles (findable, accessible, interoperable and reusable)⁶ and also facilitate science through the exchange of metadata and increased collaboration. There have been instances in which such collaboration with data creators, for example, provision of additional metadata that are not publicly available, has strengthened the content of research studies²⁹,³⁰. This should be the norm rather than the exception. In other cases, communication with data creators has allowed proper citation of datasets used, and acknowledgement of funding that supported critical datasets, thus providing some benefit to the data creator for their data reuse²⁹,³¹.

In the long run, the DRI tag would help facilitate automatic updates of the publication-associated tags. For example, a GenBank sequence entry could be updated as follows: if the GenBank entry or its corresponding BioProject and BioSample entries are mentioned in a PubMed-indexed publication together with the ORCID(s) found in the DRI, then the new publication will be automatically added to the respective publication tags. The proposed metadata tag (that is, the DRI) will substantially reduce the amount of time needed to clarify the status of any public dataset, enabling automated rules for dataset screening and reducing tension between data creators and data consumers in the future. The DRI thus bridges a gap generated by the FAIR principles, for those who are invested in making their data open access, but where it is equitable for the creator to maintain some control over its reuse.
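The update rule described above can be sketched in a few lines of Python. This is a hedged illustration, not an implemented system: the function name `update_publication_tags`, the dictionary field names and the placeholder accessions are all assumptions made for the example.

```python
def update_publication_tags(entry, publication):
    """Return the entry's publication list after applying the linkage rule.

    entry: dict with 'accessions' (accessions of the sequence entry and its
        BioProject/BioSample), 'dri' (set of ORCID iDs) and 'publications'
        (list of publication identifiers).
    publication: dict with 'id', 'accessions_mentioned' (set) and
        'author_orcids' (set).
    All field names are hypothetical.
    """
    mentions_entry = bool(entry["accessions"] & publication["accessions_mentioned"])
    shares_orcid = bool(entry["dri"] & publication["author_orcids"])
    publications = list(entry["publications"])
    # Link only when both conditions hold, and never add a duplicate.
    if mentions_entry and shares_orcid and publication["id"] not in publications:
        publications.append(publication["id"])
    return publications


# Placeholder accessions and identifiers, for illustration only.
entry = {
    "accessions": {"SEQ_0001", "PROJECT_0001", "SAMPLE_0001"},
    "dri": {"0000-0002-1825-0097"},
    "publications": [],
}
paper = {
    "id": "publication-A",
    "accessions_mentioned": {"PROJECT_0001"},
    "author_orcids": {"0000-0002-1825-0097", "0000-0002-1694-233X"},
}
linked = update_publication_tags(entry, paper)
```

Requiring both an accession match and an ORCID match means that a third-party paper reusing the data would not be recorded as a publication by the data creators.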

Ideally, DRIs would propagate automatically within databases (for example, from datasets of sequencing reads to the assembled genomes) and across other online databases of -omics sequence data for a given data creator. In the absence of automatic connections, data consumers can manually add DRIs to downstream data depositions (for example, metagenome assembled genomes) via a custom metadata field. These can be the original DRIs from the original data creator or, following conversation with the data creator, new DRIs connected to the data consumer.
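One possible reading of this propagation is sketched below in Python. It assumes that a derived dataset without its own DRI inherits the DRIs of the datasets it was derived from, whereas a dataset that carries its own DRI (for example, one agreed with the data creator) keeps it. The function name `effective_dri`, the dataset labels and the inheritance policy are assumptions made for illustration, and the derivation graph is assumed to contain no cycles.

```python
def effective_dri(dataset, own_dri, derived_from, _cache=None):
    """Resolve the DRI that applies to a dataset.

    own_dri: mapping dataset -> set of ORCID iDs deposited with that dataset.
    derived_from: mapping dataset -> list of parent datasets (assumed acyclic).
    """
    cache = {} if _cache is None else _cache
    if dataset in cache:
        return cache[dataset]
    if own_dri.get(dataset):
        # A dataset with its own DRI keeps it.
        result = frozenset(own_dri[dataset])
    else:
        # Otherwise inherit the union of the parents' effective DRIs.
        result = frozenset().union(
            *(effective_dri(parent, own_dri, derived_from, cache)
              for parent in derived_from.get(dataset, ()))
        )
    cache[dataset] = result
    return result


# Hypothetical example: reads -> assembly -> two metagenome-assembled genomes.
own_dri = {
    "reads_1": {"0000-0002-1825-0097"},
    "mag_2": {"0000-0002-1694-233X"},
}
derived_from = {
    "assembly_1": ["reads_1"],
    "mag_1": ["assembly_1"],
    "mag_2": ["assembly_1"],
}
mag_1_dri = effective_dri("mag_1", own_dri, derived_from)
mag_2_dri = effective_dri("mag_2", own_dri, derived_from)
```

Here `mag_1` inherits the DRI of the original reads, while `mag_2` carries the DRI deposited with it.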

Outlook

In this Consensus Statement, we propose a roadmap to facilitate equitable data reuse in the microbiome field. The implementation of machine-readable labels reflecting contact information of data creators will permit efficient reuse of data, accelerate scientific discoveries and hopefully lead data creators to more readily share their valuable datasets with the scientific public at an earlier stage. Using the standards devised by the GSC³², complete metadata are expected to be made available by data creators. Here we propose the addition of a DRI tag to GSC metadata. Enabling equitable use of public microbiome data will rely on close collaborative work between the data creators, data distributors and data consumers. We envision that the GSC will play a pivotal role in helping all parties (that is, data distributors) implement a DRI tag within submission systems for microbiome data to public repositories, as well as in socializing the new approach to data mining (that is, the flowchart and algorithm given above). We note that even tangible incentives (for example, MG-RAST granting priority access to computing for data made public³³) have not alleviated data creators’ hesitancy to make their data available. Although we do not anticipate that the DRI will fully resolve this challenge, we think it is an essential first step in the right direction.

The full adoption of a DRI tag for microbiome data in public repositories will ultimately require broad support from the scientific community, data distributors, and journals and publishing houses. As an encouraging first step, the ENA has independently implemented an ORCID metadata category, allowing data creators to attach identifying information to their submissions. Propagation of this practice to other major databases will set the stage for the DRI to be used to screen for data availability. In the meantime, data creators are encouraged to apply DRI metadata tags to their datasets. This will allow data consumers to connect with data creators to discuss data reuse. We, the Data Reuse Core Team and Data Reuse Consortium, propose that scientific manuscripts partially or exclusively making use of public data should include a written statement by the authors confirming that they have complied with these guidelines for public data use. This statement would include protocols for data download and use of analysis tools, or reproducible workflows describing how the tag was incorporated in the workflow or, in case of a missing DRI, how authors adhered to the roadmap outlined in Fig. 2.

We are aware that the implementation of the DRI could substantially impact the timeline from analysis to publication by increasing the workload (for example, needing to identify email addresses of data creators). This is especially true for projects accessing many datasets (for example, mining single genes for phylogenetic trees). We are confident that, with time, automated and standardized informatic tools will become available that will lower the administrative burden of following the DRI guidelines.
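To indicate what such tooling might look like, the following Python sketch screens a batch of datasets and groups those carrying a DRI by ORCID, so that each data creator could be contacted once about all of their datasets. The function name `plan_contacts`, the input layout and the placeholder accessions are assumptions; datasets without a DRI remain subject to the other steps of the roadmap in Fig. 2 and to any licence attached to them.

```python
from collections import defaultdict


def plan_contacts(datasets):
    """Split datasets into untagged ones and per-ORCID contact lists.

    datasets: mapping accession -> iterable of ORCID iDs from the DRI tag
        (empty if the dataset has no DRI). Hypothetical input layout.
    Returns (accessions without a DRI, mapping ORCID -> accessions).
    """
    without_dri = []
    by_orcid = defaultdict(list)
    for accession, orcids in sorted(datasets.items()):
        orcids = sorted(orcids)
        if not orcids:
            without_dri.append(accession)
            continue
        # List the dataset under every creator named in its DRI.
        for orcid in orcids:
            by_orcid[orcid].append(accession)
    return without_dri, dict(by_orcid)


# Placeholder accessions, for illustration only.
batch = {
    "SEQ_0001": {"0000-0002-1825-0097"},
    "SEQ_0002": {"0000-0002-1825-0097", "0000-0002-1694-233X"},
    "SEQ_0003": set(),
}
without_dri, contacts = plan_contacts(batch)
```

Grouping by ORCID rather than by dataset keeps the number of messages proportional to the number of data creators, which matters most for projects that draw on many datasets.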

At the same time, the high proportion of participants who stated that they would respect the DRI when reusing data (266 participants (96.73%); Fig. 3) suggests that data creators will be able to more freely and more frequently share their data in the future. This would facilitate adoption of the FAIR principles in science and hugely benefit the scientific community in the long run. Moreover, by fostering collaborations, the scientific best practices, and the roadmap for data sharing in microbiome research as introduced here, will enable research by lower-resourced laboratories, reducing financial bias in scientific progress. However, we do not expect that the recommendations in this roadmap will be applied retroactively to datasets already deposited in public databases before the implementation of the DRI.

**Fig. 3: The receptiveness of survey participants towards implementing a DRI tag.**

The scale of nucleic acid sequence data required early pioneers to establish public databases across political borders and to confront data sharing considerations early. Other omics technologies are maturing (for example, proteomics), and scientists are recognizing the need to establish data mining approaches³⁴. We propose that this roadmap for equitable reuse of public sequence data should be expanded to other fields including but not limited to proteomics, lipidomics, metabolomics, phenomics, microscopy and spectroscopy as data mining becomes routine with these types of data.

Change history

05 November 2025
A Correction to this paper has been published: https://doi.org/10.1038/s41564-025-02212-3

References

The Wellcome Trust. Sharing Data from Large-Scale Biological Research Projects: A System of Tripartite Responsibility (National Human Genome Research Institute, 2003).
Report of the International Strategy Meeting on Human Genome Sequencing held at the Princess Hotel, Southampton, Bermuda, on 25th–28th February 1996 (unpublished manuscript, 1996); http://hdl.handle.net/10161/7715
Toronto International Data Release Workshop Authors. Prepublication data sharing. Nature 461, 168–170 (2009).
Parties to the Convention on Biological Diversity. Nagoya Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from their Utilization to the Convention on Biological Diversity 234–249 (Official Journal of the European Union, 2014); https://eur-lex.europa.eu/eli/agree_prot/2014/283/oj
GenBank and WGS Statistics (National Center for Biotechnology Information, accessed February 2024); https://www.ncbi.nlm.nih.gov/genbank/statistics/
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
Data Management. NIH Grants & Funding https://sharing.nih.gov/data-management-and-sharing-policy/data-management (2025).
Womack, R. P. Research data in core journals in biology, chemistry, mathematics, and physics. PLoS ONE 10, e0143460 (2015).
Zuiderwijk, A. & Spiers, H. Sharing and re-using open data: a case study of motivations in astrophysics. Int. J. Inf. Manag. 49, 228–241 (2019).
Uhlir, P. Developing Data Attribution and Citation Practices and Standards: Summary of an International Workshop (National Academies, 2012).
Amann, R. I. et al. Toward unrestricted use of public genomic data. Science 363, 350–352 (2019).
Borgman, C. L. in Theories of Informetrics and Scholarly Communication (ed. Sugimoto, C. R.) 93–116 (De Gruyter, 2016).
Cousijn, H., Feeney, P., Lowenberg, D., Presani, E. & Simons, N. Bringing citations and usage metrics together to make data count. Data Sci. J. 18, 9 (2019).
Rourke, M., Eccleston-Turner, M., Phelan, A. & Gostin, L. Policy opportunities to enhance sharing for pandemic research. Science 368, 716–718 (2020).
Hug, L. Contribution needed for developing a new community standard for reusing sequencing data. Springer Nature Research Communities https://communities.springernature.com/posts/contribution-needed-for-developing-a-new-community-standard-for-reusing-sequencing-data (2024).
Soares, A. R. Data Usage Manuscript - Data Deposition (OSFHOME, 2025); https://osf.io/skw4a/
Wickham, H. et al. Welcome to the Tidyverse. J. Open Source Softw. 4, 1686 (2019).
Wickham, H. et al. dplyr: a grammar of data manipulation. R package version 3.1 (2023).
Zhu, H. et al. kableExtra: construct complex table with ‘kable’ and pipe syntax. R package version 3.1 (2024).
EMBL-EBI Terms of Use. EMBL-EBI https://www.ebi.ac.uk/about/terms-of-use/ (2025).
Licensing of EMBL-EBI data resources. EMBL-EBI https://www.ebi.ac.uk/licencing/ (2025).
Sunagawa, S. et al. Tara Oceans: towards global ocean ecosystems biology. Nat. Rev. Microbiol. 18, 428–445 (2020).
Proctor, L. M. et al. The Integrative Human Microbiome Project. Nature 569, 641–648 (2019).
Lee, M. D. GToTree: a user-friendly workflow for phylogenomics. Bioinformatics 35, 4162–4164 (2019).
Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2020).
Jennings, L. et al. Applying the ‘CARE Principles for Indigenous Data Governance’ to ecology and biodiversity research. Nat. Ecol. Evol. 7, 1547–1551 (2023).
Westoby, M., Falster, D. S. & Schrader, J. Motivating data contributions via a distinct career currency. Proc. R. Soc. B 288, 20202830 (2021).
Credit where credit is due. Nature 462, 825–825 (2009).
Buessecker, S. et al. An essential role for tungsten in the ecology and evolution of a previously uncultivated lineage of anaerobic, thermophilic Archaea. Nat. Commun. 13, 3773 (2022).
McKay, L. J. et al. Co-occurring genomic capacity for anaerobic methane and dissimilatory sulfur metabolisms discovered in the Korarchaeota. Nat. Microbiol. 4, 614–622 (2019).
Viljakainen, V. R. & Hug, L. A. The phylogenetic and global distribution of bacterial polyhydroxyalkanoate bioplastic-degrading genes. Environ. Microbiol. 23, 1717–1731 (2021).
Field, D. et al. The Genomic Standards Consortium. PLoS Biol. 9, e1001088 (2011).
Meyer, F. et al. The metagenomics RAST server—a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinform. 9, 386 (2008).
Vaudel, M. et al. Exploring the potential of public proteomics data. Proteomics 16, 214–225 (2016).


Acknowledgements

The Data Reuse Core Team thanks the numerous participants of our survey who chose not to be listed as authors on this manuscript for their valuable perspectives and the many people who discussed this endeavour with us on multiple occasions. We thank L. Rothe for consultations on data visualizations. A.J.P. acknowledges funding by the German Research Foundation (DFG), CRC 1439/1 and CRC 1439/2, project number 426547801 (project INF). L.A.H. acknowledges support from the Canada Research Chairs. R.H. acknowledges support from the US National Science Foundation (OCE-2049445). C. Moraru acknowledges funding by the German Research Foundation (DFG), Priority Program SPP 2330, project number MO 3498/2-1. F.M. acknowledges support from the German Federal Ministry of Education and Research (BMBF project number 01ZZ2013).

Author information

L. M. Rodriguez-R
Present address: Department of Chemistry and Biosciences, Aalborg University, Aalborg, Denmark
A full list of members and their affiliations appears in the Supplementary Information.
These authors contributed equally: Laura A. Hug, Roland Hatzenpichler, Cristina Moraru, André R. Soares, Folker Meyer, Alexander J. Probst.

Authors and Affiliations

Department of Biology, University of Waterloo, Waterloo, Ontario, Canada
Laura A. Hug & J. D. Neufeld
Department of Microbiology and Cell Biology, Thermal Biology Institute, Montana State University, Bozeman, MT, USA
Roland Hatzenpichler
Department of Chemistry and Biochemistry, Thermal Biology Institute, Montana State University, Bozeman, MT, USA
Roland Hatzenpichler & Z. J. Jay
Center for Biofilm Engineering, Montana State University, Bozeman, MT, USA
Roland Hatzenpichler
Environmental Metagenomics, Research Center One Health Ruhr, University Alliance Ruhr, Faculty of Chemistry, University of Duisburg-Essen, Essen, Germany
Cristina Moraru, André R. Soares, T. L. Bornemann, S. P. Esser, J. Plewka, M. B. Shah, T. L. Stach, J. Starke & Alexander J. Probst
Centre for Water and Environmental Research (ZWU), University of Duisburg-Essen, Essen, Germany
André R. Soares, T. L. Bornemann, F. Leese, S. Rückert, B. Siebers, T. L. Stach, B. Sures & Alexander J. Probst
Institute for AI in Medicine, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
Folker Meyer
Department of Computer Science, University of Duisburg-Essen, Essen, Germany
Folker Meyer
Department of Psychology, Ruhr University Bochum, Bochum, Germany
Anke Heyder
Department of Bioinformatics and Genomics, College of Biotechnology, Misr University for Science and Technology, Giza, Egypt
R. Z. Abdallah
Université de Lorraine, INRAE, IAM, Nancy, France
A. Abdalrahem
Department of Plankton and Microbial Ecology, Leibniz Institute of Freshwater Ecology and Inland Fisheries (IGB), Berlin, Germany
N. Abdulkadir, M. O. Gessner & H.-P. Grossart
Department of Environmental and Occupational Health, School of Public Health, University of Medical Sciences, Ondo, Nigeria
I. M. Adesiyan
Austrian Competence Centre for Feed and Food Quality, Safety and Innovation, FFoQSI GmbH, Tulln an der Donau, Austria
L. Alteio
Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
K. Anantharaman
Carleton College, Northfield, MN, USA
R. Anderson
Department of Plant and Microbial Biology, University of Zurich, Zurich, Switzerland
A-S. Andrei
Department of Biological Sciences, Clemson University, Clemson, SC, USA
J. A. Baeza
Departamento de Biologia Marina, Universidad Catolica del Norte, Coquimbo, Chile
J. A. Baeza
Austrian Institute of Technology, Wien, Austria
F. Bak
The University of Texas at Austin, Austin, TX, USA
B. Baker
GFZ Helmholtz Centre for Geosciences, Potsdam, Germany
A. Bartholomäus
CONICET, Buenos Aires, Argentina
N. Bejerman
University of Delaware, Newark, DE, USA
J. Biddle
CSIRO, Canberra, Australian Capital Territory, Australia
A. Bissett
Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, USA
J. A. Blakeley-Ruiz
Universitätsklinikum Essen—IKIM, Essen, Germany
K. Block
Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany
J. Boldt
Agroscope, Zurich, Switzerland
G. Bonilla-Rosso
Aquatic Microbiology, Environmental Microbiology and Biotechnology, Faculty of Chemistry, University of Duisburg-Essen, Essen, Germany
V. S. Brauer
School of Biological Sciences, University of Utah, Salt Lake City, UT, USA
W. Brazelton
Evonik Operations GmbH, RD&I Biotechnology, Halle, Germany
A. Bremges
CNRS, UMR 5525, VetAgro Sup, Grenoble INP, TIMC, Univ. Grenoble Alpes, Grenoble, France
E. Buelow
University of Tennessee, Knoxville, TN, USA
Z. M. Burcham
Department of Biology, University of York, York, UK
A. Cansdale & S. Meaden
Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ, USA
J. G. Caporaso
School of Biological Sciences, Faculty of Environmental and Life Sciences, University of Southampton, Southampton, UK
T. Cernava
Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
I. Chatzigiannidou
Department of Bioengineering, Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal
R. Costa
Department of Biochemistry and Biomedical Sciences, M.G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
C. R. Currie
Institute of Soil Biology and Biogeochemistry, Biology Centre CAS, České Budějovice, Czechia
A. Daebeler
Department of Microbiology and Cell Sciences, Fort Lauderdale Research and Education Center, University of Florida, Fort Lauderdale, FL, USA
V. De Anda
Institute of Bioinformatics, University of Georgia, Athens, GA, USA
A. De Santiago
Embrapa Cenargen, Brasília, Brazil
L. M. Arake de Tacca & P. V. Pascoal
Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
J. Debelius
UMR8228, Station Biologique de Roscoff, Roscoff, France
S. M. Dittami
Third Institute of Oceanography, Ministry of Natural Resources, Xiamen, China
X. Dong
Institute for Integrative Systems Biology (I2SysBio), University of Valencia and Spanish National Research Council, Valencia, Spain
M. Džunková
Aberystwyth University, Aberystwyth, UK
A. Edwards & C. L. Williams
Flinders University, Adelaide, South Australia, Australia
R. Edwards
Washington University at St. Louis, St. Louis, MO, USA
S. Egbert
NIOZ—Royal Netherlands Institute for Sea Research, Den Burg, The Netherlands
J. C. Engelmann
The Laboratory of Microbiology, Wageningen University and Research, Wageningen, The Netherlands
T. J. G. Ettema & P. Geesink
University of California, Riverside, CA, USA
C. L. Ettinger
Westmead Institute for Medical Research, Sydney, New South Wales, Australia
A. Petrovic Fabijan
Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia
A. Petrovic Fabijan
The University of Essex, Colchester, UK
R. M. W. Ferguson
Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
P. Ferretti
Biology of Intracellular Bacteria Unit, Pasteur Institute, Paris, France
P. Foucault
University of Southern California, Los Angeles, CA, USA
J. A. Fuhrman
Sokoto State University, Sokoto, Nigeria
A. M. Gada
Embrapa Agricultura Digital, Campinas, Brazil
I. R. Gerhardt
Institute of Ecology, Berlin Institute of Technology (TUB), Berlin, Germany
M. O. Gessner
Department of Biology, University of Naples Federico II, Naples, Italy
D. Giovannelli
University of California, Berkeley, Berkeley, CA, USA
D. Gittins
Western University, London, Ontario, Canada
G. B. Gloor
Department of Biology, The Pennsylvania State University, University Park, PA, USA
R. A. González-Pech
Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
C. Gopalakrishnappa & R. Gregor
Monash University, Melbourne, Victoria, Australia
C. Greening
University of Calgary, Calgary, Alberta, Canada
A. C. Gregory
Institute of Biochemistry and Biology, Potsdam University, Potsdam, Germany
H.-P. Grossart
Institute of Clinical Molecular Biology, Kiel University, Kiel, Germany
M. Groussin & P. Rausch
Departamento de Educación, Facultad de Educación, Universidad de Antofagasta, Antofagasta, Chile
B. Valenzuela Guerrero
Department of Microbiology, University of Tennessee, Knoxville, TN, USA
M. Guzel
Department of Biology, Faculty of Science, Kyushu University, Fukuoka, Japan
N. Hamamura
Plant and Microbial Biology, University of Minnesota, Minneapolis, MN, USA
T. L. Hamilton & P. J. Hesketh-Best
Department of Marine Microbiology and Biogeochemistry, Royal Netherlands Institute for Sea Research (NIOZ), Texel, The Netherlands
J. N. Hamm
Life Sciences Institute, University of Michigan, Ann Arbor, MI, USA
L. Hart
Leibniz Institute for Baltic Sea Research Warnemünde (IOW), Rostock, Germany
C. Hassenrück
Royal Veterinary College, London, UK
M. Hay
Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
R. M. Hechler
Otto-von-Guericke-University Magdeburg, Magdeburg, Germany
P. Hellwig
Northern Illinois University, DeKalb, IL, USA
M. Henson
Luxembourg Institute of Science and Technology, Esch-sur-Alzette, Luxembourg
M. Herold
University of California, Davis, CA, USA
M. Hess & L. Hillary
University Hospital of RWTH Aachen, Aachen, Germany
T. C. Hitch
Agharkar Research Institute, Pune, India
S. S. Hivarkar
University of Greifswald, Greifswald, Germany
K. J. Hoff
University of Mississippi, Oxford, MS, USA
E. F. Hom
Southern University of Science and Technology, Shenzhen, China
S. Hou
Uppsala University, Uppsala, Sweden
L. W. Hugerth
Harvard University, Cambridge, MA, USA
Y. Hwang
University of Oxford, Oxford, UK
N. Ilott
Lawrence Berkeley National Laboratory, United States Department of Energy, Berkeley, CA, USA
S. P. Jungbluth
Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, Jouy-en-Josas, France
E. Karimi
Wageningen University & Research, Wageningen, The Netherlands
Y. M. Kaspareit
Durham University, Durham, UK
C. Keating
US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
M. Kellom, N. C. Kyrpides, C. G. Sprehn & T. Woyke
University of Michigan, Ann Arbor, MI, USA
E. A. Kiledal
Systems Ecology, Amsterdam Institute for Life and Environment (A-LIFE), Faculty of Science, Vrije Universiteit, Amsterdam, The Netherlands
I. Klarenberg
Department of Pediatrics, University of California, San Diego, San Diego, CA, USA
R. Knight
Department of Computer Science & Engineering, University of California, San Diego, San Diego, CA, USA
R. Knight
Shu Chien-Gene Lay Department of Bioengineering, University of California, San Diego, San Diego, CA, USA
R. Knight
Halıcıoğlu Data Science Institute, University of California, San Diego, San Diego, CA, USA
R. Knight
Masinde Muliro University of Science and Technology, Kakamega, Kenya
A. K. Koech
Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
E. V. Koonin
Department of Ichthyology and Aquatic Environment, University of Thessaly, Volos, Greece
K. Kormas
Natural Resources Institute Finland, Helsinki, Finland
K. Kujala
Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Aas, Norway
S. L. La Rosa
Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
C. C. Laczny
Virginia Tech, Blacksburg, VA, USA
K. Lahmers
Nankai University, Tianjin, China
X. Lan
University of Ilorin, Ilorin, Nigeria
A. A. Lateef
Department of Forest Sciences, University of Helsinki, Helsinki, Finland
A. A. Lateef
Department of Microbiology, The University of Hong Kong, Hong Kong, China
S. H. Lau
Faculty of Biology, University of Duisburg-Essen, Essen, Germany
F. Leese
IMDEA Water Institute, Alcalá de Henares, Madrid, Spain
M. Á. Lezcano
Microbiome Systems Laboratory, Biomedicine Discovery Institute, Monash University, Melbourne, Victoria, Australia
S. S. Li
EMBRAPA/INCT_BioSyn, Brasília, Brazil
R. N. Lima
Department of Microbiology, RIBES, Radboud University, Nijmegen, The Netherlands
S. Lücker
Medical University of Graz, Graz, Austria
A. Mahnert
Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
S. Majidian
Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
L. Malfertheiner
Department of Biology, Royal Melbourne Institute of Technology, Melbourne, Victoria, Australia
A. Marshall
Nottingham Trent University, Nottingham, UK
C. J. Meehan
University of Bayreuth, Bayreuth Center of Ecology and Environmental Research, Bayreuth, Germany
D. V. Meier
Utrecht University, Theoretical Biology and Bioinformatics, Utrecht, The Netherlands
C. Melkonian & D. Tamarit
Wageningen University & Research, Bioinformatics Group, Wageningen, The Netherlands
C. Melkonian
Amsterdam University Medical Center, Amsterdam, The Netherlands
D. R. Mende
Department of Soil, Water, and Ecosystem Sciences, University of Florida, Gainesville, FL, USA
J. L. Meyer
River Ecosystems Laboratory, Alpine and Polar Environmental Research Center, Ecole Polytechnique Fédérale de Lausanne EPFL, Lausanne, Switzerland
G. Michoud
Institute of Ecology & Earth Sciences, University of Tartu, Tartu, Estonia
V. Mikryukov
Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zurich, Zurich, Switzerland
S. Miravet-Verde & T. Priest
Ocean EcoSystems Biology Unit, Marine Ecology Research Division, GEOMAR Helmholtz Centre for Ocean Research, Kiel, Germany
J. Muschiol
Department of Computer Science and Interdisciplinary Centre of Bioinformatics, University of Leipzig, Leipzig, Germany
M. K. Nata’ala
Department of Data Science in Bioeconomy, Leibniz Institute for Agricultural Engineering and Bioeconomy (ATB), Potsdam, Germany
M. K. Nata’ala
Institute of Microbiology, University of Innsbruck, Innsbruck, Austria
S. Neuhauser
Elizade University, Ilara-Mokin, Nigeria
O. Osuolale
Division of Clinical Microbiology, Department of Laboratory Medicine, Medical University of Vienna, Vienna, Austria
J. Osvatic
Joint Microbiome Facility of the Medical University of Vienna and the University of Vienna, Vienna, Austria
J. Osvatic
Department of Biology, National and Kapodistrian University of Athens, Athens, Greece
K. M. Pappas
The University of Queensland, Brisbane, Queensland, Australia
D. H. Parks
School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, Queensland, Australia
R. H. Parry
European Marine Biological Resource Centre-European Research Infrastructure Consortium (EMBRC-ERIC), Paris, France
C. Pavloudi
Montana State University, Bozeman, MT, USA
B. Peyton & G. Schaible
Institute of Experimental Medicine, Kiel University, Kiel, Germany
M. Poyet & E. K. Quaye
Department of Biochemistry, J.J College of Arts and Science (Autonomous), Pudukkottai, India
S. Ramganesh
Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria
T. Rattei, D. R. Speth & M. Wagner
National Institute of S&T in Synthetic Biology/EMBRAPA, Brasília, Brazil
E. Rech
University of Queensland, Brisbane, Queensland, Australia
C. Rinke
Indiana University, Bloomington, IN, USA
C. Robinson
Department of Ecology, Environment, and Plant Sciences, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
A. Rodríguez-Gijón
Department of Microbiology and Digital Science Center (DiSC), University of Innsbruck, Innsbruck, Austria
L. M. Rodriguez-R
UT Austin, Austin, TX, USA
R. R. Rohwer
Institute for Medical Microbiology, University of Zurich, Zurich, Switzerland
T. Roloff
Department of Microbiology and Plant Pathology, University of California-Riverside, Riverside, CA, USA
J. A. Rothman & J. E. Stajich
Department of Eukaryotic Microbiology, Faculty of Biology, University of Duisburg-Essen, Essen, Germany
S. Rückert
Marine Biological Laboratory, Woods Hole, MA, USA
S. E. Ruff & A. Z. Worden
EPFL, Lausanne, Switzerland
J. S. Saini
Department of Molecular and Cell Biology, The University of Connecticut (UConn), Storrs, CT, USA
M. G. Santiago-Martínez
Department of Biology, Hofstra University, Hempstead, NY, USA
L. Santoferrara
Institute for Biomedicine, Eurac Research, Bolzano, Italy
M. S. Sarhan
The George Washington University, Washington DC, USA
J. H. Saw
Water Research Institute (IRSA), National Research Council of Italy (CNR), Verbania, Italy
T. Sbaffi
RC One Health Ruhr, Research Alliance Ruhr and Faculty of Biology, University of Duisburg-Essen, Essen, Germany
R. B. Schäfer
Helmholtz Munich, Research Unit for Comparative Microbiome Analysis, Munich, Germany
M. Schloter
Kiel University, Kiel, Germany
R. A. Schmitz
ETH Zürich, Institute of Microbiology, Zurich, Switzerland
C. Schubert
Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen, Germany
O. Schwengers
RECETOX, Faculty of Science, Masaryk University, Brno, Czech Republic
L. Sehnal
Institute of Medical Genetics and Applied Genomics, Universitätsklinikum Tübingen, Tübingen, Germany
A. Sekar
M. S. Swaminathan Research Foundation, Chennai, India
J. Sekar
Department of Poultry Science, University of Arkansas, Fayetteville, AR, USA
M. M. Seyoum
Tel Hai Academic College, Qiryat Shemona, Israel
I. Sharon
Molecular Enzyme Technology and Biochemistry (MEB), Environmental Microbiology and Biotechnology (EMB), Department of Chemistry, University of Duisburg-Essen, Essen, Germany
B. Siebers
Department of Agroecology, Aarhus University, Denmark
E. T. Sieradzki
Laboratory of Environmental Biotechnology, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, Athens, Greece
D. Skliros
Department of Environmental Science, University of Arizona, Tucson, AZ, USA
O. L. Snoeyenbos-West
Institute for Stroke and Dementia Research, LMU Klinikum, Munich, Germany
A. Sorbie
School of Engineering, Cardiff University, Cardiff, UK
P. Srivastava
University of Tennessee - Knoxville, Knoxville, TN, USA
A. D. Steen
Institute of Microbiology and Archaea Centre, University of Regensburg, Regensburg, Germany
R. Stöckl
School of Biological Sciences, Institute for Global Food Security, Queen’s University Belfast, Belfast, UK
T. Stoikidou
National Laboratory of Health Environment and Food, Maribor, Slovenia
N. Stopnisek
University of Kerala, Kerala, India
R. Sukumaran
Department of Aquatic Ecology, University of Duisburg-Essen, Essen, Germany and Research Center One Health Ruhr, Research Alliance Ruhr, University of Duisburg-Essen, Essen, Germany
B. Sures
RIKEN, Wakō, Japan
S. Suzuki
University of California Santa Barbara, Santa Barbara, CA, USA
P. Thieringer
Department of Microbiology, Immunology and Transplantation, KU Leuven, Leuven, Belgium
R. Y. Tito
Center for Microbiology, VIB, Leuven, Belgium
R. Y. Tito
LGC Biosearch Technology, Petaluma, CA, USA
C. B. Trivedi
Lawrence Livermore National Laboratory, Livermore, CA, USA
G. Trubl
Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
J. Truu
Soil Science and Agricultural Chemistry Lab, Dept. of Natural Resources and Agricultural Engineering, Agricultural University of Athens, Athens, Greece
M. Tsiknia
Universidad Andres Bello, Santiago, Chile
J. Ugalde
Plant and Microbial Biology, University of California, Berkeley, Berkeley, CA, USA
L. E. Valentin-Alvarado
School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales, Australia
X. Vázquez-Campos
Institute of Water Quality and Resource Management, TU Wien, Vienna, Austria
J. Vierheilig
Interuniversity Cooperation Centre Water & Health, Vienna, Austria
J. Vierheilig
NIOZ Royal Netherlands Institute for Sea Research, Yerseke, The Netherlands
F. A. B. von Meijenfeldt
Department of Microbiology and Immunology, University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia
C. J. Walsh
Department of Microbiology, The Chinese University of Hong Kong, Hong Kong, China
S. Wang
School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
Y. Wang
Heinrich Heine University Düsseldorf, Düsseldorf, Germany
C.-E. Wegner
Colorado State University, Fort Collins, CO, USA
T. Weir
Department of Animal Ecology, Evolution and Biodiversity, Ruhr University Bochum, Bochum, Germany
L. C. Weiss
Department of Biology, The City College of New York, New York, NY, USA
J. L. Weissman
Alfred-Wegener-Institut Helmholtz Zentrum für Polar-und Meeresforschung, Helgoland, Germany
A. Wichels
University of Bath, Bath, UK
T. A. Williams
Australian Centre for Water and Environmental Biotechnology, University of Queensland, Brisbane, Queensland, Australia
M. Wu
State Key Laboratory of Geomicrobiology and Environmental Changes (GMEC), China University of Geosciences, Beijing, China
W. Xiu
Department of Cell and Molecular Biology, College of the Environment and Life Sciences, University of Rhode Island, Kingston, RI, USA
Y. Zhang
Li Ka Shing Institute of Health Sciences, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, China
J. Zhu
Department of Civil Engineering, University of British Columbia, Vancouver, British Columbia, Canada
R. M. Ziels
Institute of Food Science, BOKU University, Vienna, Austria
B. Zwirzitz

Authors

Laura A. Hug
Roland Hatzenpichler
Cristina Moraru
André R. Soares
Folker Meyer
Anke Heyder
Alexander J. Probst

Consortia

The Data Reuse Consortium

R. Z. Abdallah, A. Abdalrahem, N. Abdulkadir, I. M. Adesiyan, L. Alteio, K. Anantharaman, R. Anderson, A-S. Andrei, J. A. Baeza, F. Bak, B. Baker, A. Bartholomäus, N. Bejerman, J. Biddle, A. Bissett, J. A. Blakeley-Ruiz, K. Block, J. Boldt, G. Bonilla-Rosso, T. L. Bornemann, V. S. Brauer, W. Brazelton, A. Bremges, E. Buelow, Z. M. Burcham, A. Cansdale, J. G. Caporaso, T. Cernava, I. Chatzigiannidou, R. Costa, C. R. Currie, A. Daebeler, V. De Anda, A. De Santiago, L. M. Arake de Tacca, J. Debelius, S. M. Dittami, X. Dong, M. Džunková, A. Edwards, R. Edwards, S. Egbert, J. C. Engelmann, S. P. Esser, T. J. G. Ettema, C. L. Ettinger, A. Petrovic Fabijan, R. M. W. Ferguson, P. Ferretti, P. Foucault, J. A. Fuhrman, A. M. Gada, P. Geesink, I. R. Gerhardt, M. O. Gessner, D. Giovannelli, D. Gittins, G. B. Gloor, R. A. González-Pech, C. Gopalakrishnappa, C. Greening, R. Gregor, A. C. Gregory, H.-P. Grossart, M. Groussin, B. Valenzuela Guerrero, M. Guzel, N. Hamamura, T. L. Hamilton, J. N. Hamm, L. Hart, C. Hassenrück, M. Hay, R. M. Hechler, P. Hellwig, M. Henson, M. Herold, P. J. Hesketh-Best, M. Hess, L. Hillary, T. C. Hitch, S. S. Hivarkar, K. J. Hoff, E. F. Hom, S. Hou, L. W. Hugerth, Y. Hwang, N. Ilott, Z. J. Jay, S. P. Jungbluth, E. Karimi, Y. M. Kaspareit, C. Keating, M. Kellom, E. A. Kiledal, I. Klarenberg, R. Knight, A. K. Koech, E. V. Koonin, K. Kormas, K. Kujala, N. C. Kyrpides, S. L. La Rosa, C. C. Laczny, K. Lahmers, X. Lan, A. A. Lateef, S. H. Lau, F. Leese, M. Á. Lezcano, S. S. Li, R. N. Lima, S. Lücker, A. Mahnert, S. Majidian, L. Malfertheiner, A. Marshall, S. Meaden, C. J. Meehan, D. V. Meier, C. Melkonian, D. R. Mende, J. L. Meyer, G. Michoud, V. Mikryukov, S. Miravet-Verde, J. Muschiol, M. K. Nata’ala, J. D. Neufeld, S. Neuhauser, O. Osuolale, J. Osvatic, K. M. Pappas, D. H. Parks, R. H. Parry, P. V. Pascoal, C. Pavloudi, B. Peyton, J. Plewka, M. Poyet, T. Priest, E. K. Quaye, S. Ramganesh, T. Rattei, P. Rausch, E. Rech, C. Rinke, C. Robinson, A. Rodríguez-Gijón, L. M. Rodriguez-R, R. R. Rohwer, T. Roloff, J. A. Rothman, S. Rückert, S. E. Ruff, J. S. Saini, M. G. Santiago-Martínez, L. Santoferrara, M. S. Sarhan, J. H. Saw, T. Sbaffi, R. B. Schäfer, G. Schaible, M. Schloter, R. A. Schmitz, C. Schubert, O. Schwengers, L. Sehnal, A. Sekar, J. Sekar, M. M. Seyoum, M. B. Shah, I. Sharon, B. Siebers, E. T. Sieradzki, D. Skliros, O. L. Snoeyenbos-West, A. Sorbie, D. R. Speth, C. G. Sprehn, P. Srivastava, T. L. Stach, J. E. Stajich, J. Starke, A. D. Steen, R. Stöckl, T. Stoikidou, N. Stopnisek, R. Sukumaran, B. Sures, S. Suzuki, D. Tamarit, P. Thieringer, R. Y. Tito, C. B. Trivedi, G. Trubl, J. Truu, M. Tsiknia, J. Ugalde, L. E. Valentin-Alvarado, X. Vázquez-Campos, J. Vierheilig, F. A. B. von Meijenfeldt, M. Wagner, C. J. Walsh, S. Wang, Y. Wang, C.-E. Wegner, T. Weir, L. C. Weiss, J. L. Weissman, A. Wichels, C. L. Williams, T. A. Williams, A. Z. Worden, T. Woyke, M. Wu, W. Xiu, Y. Zhang, J. Zhu, R. M. Ziels & B. Zwirzitz

Contributions

A.J.P. was invited to write this manuscript and formed the ‘Data Reuse Core Team’ by invitation. All members of this team (A.J.P., L.A.H., A.R.S., C. Moraru, F.M. and R.H.) contributed equally to this manuscript. A.J.P., F.M. and A.R.S. led discussions with JGI, ENA and GSC. L.A.H. and A.R.S. performed the analysis and visualization of survey response data. A.H. supervised the construction of the scientific survey, ensuring the quality and impartiality of questions. The ‘Data Reuse Consortium’ provided support and feedback on this manuscript.

Corresponding author

Correspondence to Alexander J. Probst.

Ethics declarations

Competing interests

R.H. was a member (2021–2024) of the User Executive Committee of the US Department of Energy’s (DOE) Joint Genome Institute (JGI). A.J.P. is an affiliate scientist at JGI and sits on the Prokaryotic Advisory Committee. All opinions expressed in this paper are the authors’ and do not necessarily reflect the policies and views of the DOE. R.K. is a scientific advisory board member of and consultant for BiomeSense, Inc., in which he has equity and from which he receives income. He is a scientific advisory board member of and has equity in GenCirq. He has equity in and acts as a consultant for Cybele. He is a co-founder of and has equity in Biota, Inc. He is a co-founder and scientific advisory board member of Micronoma, in which he has equity. He is a board member of Microbiota Vault, Inc. He is a member of the N=1 IBS advisory board and receives income. He is a Senior Visiting Fellow of the HKUST Jockey Club Institute for Advanced Study. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies.

Peer review

Peer review information

Nature Microbiology thanks Marnix Medema and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Supplementary information

Supplementary Information: Supplementary Tables 1–7, Supplementary Box 1 and Supplementary Figs. 1–10.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hug, L.A., Hatzenpichler, R., Moraru, C. et al. A roadmap for equitable reuse of public microbiome data. Nat Microbiol 10, 2384–2395 (2025). https://doi.org/10.1038/s41564-025-02116-2

Received: 27 March 2023
Accepted: 13 August 2025
Published: 26 September 2025
Version of record: 26 September 2025
Issue date: October 2025
DOI: https://doi.org/10.1038/s41564-025-02116-2
