Introduction

Recent studies have revealed that gut microorganisms significantly influence human health, with links to obesity, inflammation, infectious diseases, and cancer. The advent of next-generation sequencing technology, coupled with decreasing costs, has led to an explosion in the amount of human microbiome data available for research. The public sequence repositories of the International Nucleotide Sequence Database Collaboration (INSDC) members, namely the NCBI Sequence Read Archive (SRA)1, the European Bioinformatics Institute (EBI) European Nucleotide Archive (ENA)2, and the DDBJ Sequence Read Archive (DRA)3, now host vast collections of raw microbiome sequencing data, including 16S rRNA amplicon and shotgun metagenomic sequences.

The composition of the human microbiota is influenced by various factors, including lifestyle, diet, and fitness levels, necessitating a comprehensive approach to data collection and analysis4. To thoroughly investigate the correlations between the human microbiome and health, it is essential to have access to large sample sizes and diverse metadata. However, the metadata available in public repositories are often restricted to specific study objectives, limiting their utility for broader research purposes5. Simply aggregating data from these repositories is insufficient because the metadata accompanying the microbiome samples are inconsistent and sometimes minimal. Additionally, variations in sample preparation and sequencing protocols across studies can introduce significant biases and hinder comparative analyses6. Given these challenges, there is a critical need for a large-scale, standardized human microbiome database that integrates extensive sequencing data with rich metadata. Such a database would facilitate more robust and comprehensive analyses, enabling researchers to unravel the complex interactions between the microbiome and human health.

Collecting human-related data presents other significant challenges. First, the costs can be substantial, particularly for advanced techniques such as metagenomic sequencing, which requires specialized equipment, reagents, and data analysis. Additionally, convincing the participants to contribute their data can take time and effort. Ethical concerns, privacy issues, and the perceived risks of data misuse often lead to hesitation or refusal, further complicating the recruitment process and limiting the scope of data collection.

Aware of these issues, we have been working since 2015 to establish a comprehensive gut microbiome database for healthy individuals at our institute, the National Institutes of Biomedical Innovation, Health, and Nutrition (NIBN, formerly NIBIOHN). This effort has led to the creation of the NIBN Japan Microbiome Database (NIBN JMD), which covers a large amount of gut microbiome data derived from both 16S rRNA gene amplicon sequencing and shotgun metagenome sequencing. It also includes extensive lifestyle-related data, such as dietary habits, nutrient intake, and physical activity levels (phenotypic metadata). Moreover, the database features the MANTA platform7, an open-source integrated analysis tool that facilitates comprehensive data visualization and related analyses. As described in detail below, by standardizing data collection, storage, and analysis protocols, this database should be valuable for advancing our understanding of the microbiome’s role in health and disease.

Results

NIBN JMD contains microbiome data from different cohort studies coordinated by NIBN. The studies were conducted at nine locations across six prefectures in Japan (see Table 1; Fig. 1). Participants with no severe health issues were recruited through public websites or the local community. Their diverse geographic backgrounds enhance the generalizability of the data set. All participants provided written informed consent after the purposes and procedures of the study had been thoroughly explained to them.

Table 1 List of studies in NIBN JMD.
Fig. 1 The statistics of the NIBN JMD data set. (a) The sampling locations shown on a map of Japan. NIBN JMD covers a wide range of areas in Japan. (b) The gender distribution in NIBN JMD. (c) The age distribution in NIBN JMD.

The collected fecal samples were processed using a standardized procedure, followed by data processing pipelines that adopted QIIME28, Kraken 2-Bracken9,10, MetaPhlAn 411, or FMAP12 software. The microbiome profiles and cleaned-up metadata were integrated into the database using the MANTA platform7 and are served as a web application. The Methods section describes the process in more detail.

The current NIBN JMD data set contains 2,273 samples collected from 2,068 subjects. All samples include basic metadata, such as gender and age, and depending on the cohort, some samples have detailed phenotypic metadata, including diet, nutrient intake, lifestyle, and physical activity (described in the Methods section). The subjects range in age from 18 to over 80, with 1,159 females (56.0%) and 909 males (44.0%) (Fig. 1). To ensure the participants remained anonymous, age data were processed as follows: Participants were grouped into age ranges, each containing at least five individuals. These groups were organized separately by gender and cohort, with only the age ranges disclosed. Details of the shotgun metagenome analysis results will be described elsewhere, and the full results (Kraken 2-Bracken, MetaPhlAn 4, and FMAP) will be available from a later version of the database.
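The age-grouping rule described above can be illustrated with a minimal sketch. It assumes a simple greedy binning of sorted ages within each cohort and gender; the function names and the greedy strategy are hypothetical and are not taken from the NIBN JMD implementation. Only the minimum group size of five and the stratification by gender and cohort come from the text.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 5  # from the text: each disclosed age range holds at least five individuals


def age_ranges(ages, min_size=MIN_GROUP_SIZE):
    """Greedily split sorted ages into consecutive ranges of at least min_size people.

    Returns a list of (lowest_age, highest_age, n_people). An empty list means the
    stratum is too small for any range to be disclosed.
    """
    ordered = sorted(ages)
    if len(ordered) < min_size:
        return []
    groups, current = [], []
    for age in ordered:
        # Close a range only between two distinct ages, so that one age value
        # never falls into two different ranges.
        if len(current) >= min_size and age != current[-1]:
            groups.append(current)
            current = []
        current.append(age)
    if len(current) < min_size and groups:
        groups[-1].extend(current)  # fold a short tail into the previous range
    else:
        groups.append(current)
    return [(g[0], g[-1], len(g)) for g in groups]


def anonymize(records, min_size=MIN_GROUP_SIZE):
    """Group (cohort, gender, age) records by cohort and gender, then bin the ages."""
    by_stratum = defaultdict(list)
    for cohort, gender, age in records:
        by_stratum[(cohort, gender)].append(age)
    return {key: age_ranges(values, min_size) for key, values in by_stratum.items()}
```

With this sketch, twelve participants aged 20 to 31 in one stratum would be reported as two ranges, 20 to 24 and 25 to 31, rather than as individual ages.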

NIBN JMD, powered by the MANTA platform, provides data visualization and analysis features via the web browser. Users can choose all samples or those of interest from the entrance page (Fig. 2a). The data are shown in a table and can be visualized in a bar chart (Fig. 2b) or a heatmap (Fig. 2c). Both the bar chart and the heatmap provide hierarchical clustering based on the calculated beta diversity. Several widely used and essential data analysis functions are available in the system. The system precalculates and stores the distances (such as Bray-Curtis) between the sample microbiota profiles. Given those distances, principal coordinates analysis (PCoA) projects the samples onto new axes that capture the main patterns, allowing researchers to explore sample clustering (Fig. 2d). The ‘envfit’ analysis function from the vegan R package13 was added to the PCoA plot in the latest release. The ‘envfit’ analysis fits the chosen variables (metadata) onto the PCoA plot and draws them as vectors (shown as arrows, see Fig. 2d) to visualize how these variables relate to the distribution of the samples. These vectors help researchers interpret the strength and direction of the association between metadata and the data structure.
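As an illustration of the PCoA step, the following sketch applies classical PCoA (Gower double-centering followed by an eigendecomposition) to a precomputed distance matrix. It uses NumPy and a toy matrix with hypothetical values; it is not the MANTA implementation, and the ‘envfit’ step from vegan is not reproduced here.

```python
import numpy as np


def pcoa(dist, n_axes=2):
    """Classical PCoA of a symmetric distance matrix.

    Returns (coords, explained): sample coordinates on the first n_axes axes and
    the fraction of the non-negative eigenvalue sum captured by each axis.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Gower double-centering of the squared distances.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Non-Euclidean measures such as Bray-Curtis can give small negative
    # eigenvalues; they are clipped to zero in this sketch.
    positive = np.clip(eigvals, 0.0, None)
    coords = eigvecs[:, :n_axes] * np.sqrt(positive[:n_axes])
    explained = positive[:n_axes] / positive.sum()
    return coords, explained


# Toy distance matrix between four samples (hypothetical values, not NIBN JMD data).
toy = [
    [0, 3, 5, 4],
    [3, 0, 4, 5],
    [5, 4, 0, 3],
    [4, 5, 3, 0],
]
coords, explained = pcoa(toy)
```

Because the distances are precalculated and stored, an ordination like this only needs the submatrix for the samples the user selected, which is one plausible reason for storing the distances rather than recomputing them per request.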

Fig. 2 Screenshots of the public version of NIBN JMD. (a) The entrance page, where the user can select the samples of interest for analysis. There are currently 2,273 samples available. (b) Microbial composition in a bar chart. (c) Microbial composition in a heatmap with a dendrogram. (d) PCoA chart with envfit analysis. (e) Correlation analysis.

The correlation analysis helps users find similar patterns based on the correlation coefficient (Fig. 2e). For example, the user can specify a metadata item and search for the microbiota whose abundance pattern correlates with it. The same analysis can be performed for microbiota against metadata, within microbiota, or within metadata. The correlation analysis provides a quick view of the correlation between two items, which can serve as a starting point for subsequent validation. In the latest release, we added a summary table containing basic statistics for the abundance of the selected microbiota or the value of the chosen metadata (Fig. 2e, above the scatter plot).
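The search described above, from one metadata item to the taxa with correlated abundance, could look roughly like the sketch below. The text does not state which correlation coefficient the platform uses, so Spearman's rank correlation is an assumption here, and all names and values are hypothetical.

```python
from math import isnan, sqrt


def ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else float("nan")


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rank_taxa(metadata_values, abundances):
    """Order taxa by |rho| against one metadata item, strongest first."""
    scores = {t: spearman(metadata_values, v) for t, v in abundances.items()}
    usable = {t: r for t, r in scores.items() if not isnan(r)}  # skip constant taxa
    return sorted(usable.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical metadata item and taxon abundances for five samples.
intake = [1, 2, 3, 4, 5]
taxa = {
    "taxon_A": [0.1, 0.2, 0.3, 0.4, 0.5],
    "taxon_B": [0.5, 0.4, 0.3, 0.2, 0.1],
    "taxon_C": [0.3, 0.1, 0.4, 0.1, 0.5],
}
ranking = rank_taxa(intake, taxa)
```

Sorting by the absolute value keeps strong negative correlations near the top, which matters when the user is looking for taxa that decrease as a metadata value increases.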

Discussion

Microbiome databases such as the Human Microbiome Project (HMP)14 and UK Biobank15 have set benchmarks for large-scale efforts to map the human microbiome across diverse populations, generating extensive genomic, metagenomic, and phenotypic data. The HMP focuses on characterizing the human microbiome at specific body sites, while the UK Biobank integrates microbiome data with comprehensive health records from a large, predominantly European cohort. These projects are complemented by regional efforts such as Japan’s ToMMo (Tohoku Medical Megabank Organization)16 and Biobank Japan17, which emphasize East Asian populations. NIBN JMD complements those resources in several ways. First, it focuses on microbiome research and offers direct access to the data together with an interactive online analysis interface; with its uniform metadata and its uniform sampling and analysis protocols, it also offers high consistency and comparability across studies, which is often a challenge in global databases where variability in data collection can hinder cross-study comparisons.

Second, by including both 16S rRNA gene amplicon and shotgun metagenomic data (the latter are not yet available, but more than a hundred metagenomic samples are in the process of being released), NIBN JMD provides a more comprehensive view of the microbiome, offering insights into taxonomic composition and functional potential. This dual approach is valuable for understanding how microbiomes interact with host environments. Third, the inclusion of longitudinal studies in NIBN JMD, albeit currently limited in number, sets it apart by enabling researchers to track microbiome dynamics, providing critical insights into how microbiomes change in response to disease, treatment, or lifestyle changes. We are accumulating more longitudinal data with more than five repeated observations, and we plan to release some of those data after IRB approval. We hope that NIBN JMD’s longitudinal component will offer a richer understanding of microbiome stability and variability than the one-time sampling of a cross-sectional study, making it a significant resource for advancing microbiome research and precision medicine.

The NIBN JMD data have been successfully applied in several studies. Beyond exploring the relationships between individual metadata items and intestinal microorganisms, these studies used the metadata as factors in multivariate analyses to clarify those relationships in more detail. For example, Park et al. identified significant associations between gut microbial variance and factors including Bristol Stool Scale scores, gender, age, and defecation frequency18. The study showed that bowel movement frequency and specific dietary habits influence gut microbiota in Japanese populations. In another analysis, Park et al. found that vitamin B1 significantly impacts the survival and competition of gut microorganisms, particularly affecting the abundance of Ruminococcaceae, which require dietary vitamin B119. The research demonstrated vitamin B1’s role in butyrate production and highlighted its potential for optimizing intestinal health.

In another study, Matsuoka et al. found that barley consumption in the Japanese population is positively associated with increased abundance of Bifidobacterium and Butyricicoccus in the gut20. These findings suggest that barley intake influences gut microbiota composition, potentially contributing to its health benefits. A cross-sectional study of Japanese adults by Hosomi et al. identified the Blautia genus, particularly B. wexlerae, as inversely correlated with obesity and type 2 diabetes21. Oral administration of B. wexlerae in mice induced metabolic and anti-inflammatory changes, suggesting potential preventive and therapeutic strategies for metabolic disorders through unique amino acid and carbohydrate metabolism pathways.

The samples used in the studies mentioned above are mostly restricted to one or two cohorts because these studies were conducted at the initial stages of the NIBN JMD development. However, the standardized pipeline, including data collection, integration, and processing, still benefited all those studies and ensured the quality of the collected data.

NIBN JMD also provides control samples that can be compared with samples from patients. In the studies by Tabata et al.22 and Ikubo et al.23, samples from NIBN JMD were used as controls to compare the gut microbiota of healthy individuals with that of patients with atrial fibrillation (AF) or chronic thromboembolic pulmonary hypertension (CTEPH). This application highlights NIBN JMD’s significance in enhancing the reliability of microbiome-based disease research by offering standardized, population-specific baseline data.

Several ongoing studies are based on the NIBN JMD data set, and we are collecting samples from other body sites, such as the oral cavity and skin. We believe that the development of NIBN JMD helps us understand the factors that influence the composition of the human microbiome and discover how this can affect our health.

The greatest difficulty in integrating the data in NIBN JMD was the variety of metadata item names. Different systems and organizations often use varied terminologies, formats, and structures to represent the same underlying data. This disparity makes it difficult to integrate and analyze information effectively. Aware of this issue, we plan to introduce an ontology in future development. By applying an ontology, we can standardize the definitions and relationships of the data elements, promoting a common framework that all users can refer to. This will help integrate the data from our collaborations and enhance data sharing and reuse in the human microbiome research field.

The wealth of data housed in the NIBN JMD is anticipated to support a wide range of studies. Researchers can leverage this resource to gain insights into the distinctive characteristics of the Japanese population’s intestinal microflora. Additionally, it will aid in exploring the intricate relationships between gut microorganisms and various health conditions or diseases, ultimately contributing to advancements in personalized medicine and public health initiatives.

Methods

Data collection

The samples were collected in cohort studies conducted by the National Institutes of Biomedical Innovation, Health, and Nutrition (NIBN) or in collaborations with other research groups in Japan (Table 1). All fecal sampling and DNA extraction followed our standardized protocols24 to ensure data quality. The extracted DNA samples were used for 16S ribosomal DNA amplicon sequencing or shotgun metagenomic sequencing to determine the microbiome composition. All the raw sequencing data have been deposited in the DDBJ Sequence Read Archive (DRA) and are available there (see Table 1 for the corresponding accession numbers).

Metadata were collected from multiple collaborators in the form of both electronic records and paper documents. The data were collected using a self-administered questionnaire, which included items on sleep patterns, bowel movement status, and medication. The paper documents were converted to electronic records with careful checks. Missing or ambiguous answers were clarified by asking the subjects in person. To facilitate integrating the metadata of the samples from diverse cohort studies, we predefined a set of core data items, including basic personal information, physical activities, medical history, and dietary information. The dietary information is based on the brief-type self-administered diet history questionnaire (BDHQ), a shortened version of the self-administered diet history questionnaire (DHQ)25,26. A list of the covered metadata can be found in Supplementary Table S1. We also maintain the list on our website [https://microbiome.nibn.go.jp/metadata_en.html] to provide up-to-date information. Given the ethics regulations, only a limited number of metadata items are available on the web without restrictions. Access to the other metadata items requires a collaborative agreement with us and approval from the Institutional Review Board (IRB) at both NIBN and the receiving institution.

Data processing

We used Snaq27, a Snakemake pipeline for QIIME28, to process the 16S rRNA amplicon sequencing data. The pipeline was run with the following settings: a bbduk [https://sourceforge.net/projects/bbmap/] quality-trimming threshold of 18; fastp28 primer cropping of 18 (forward) and 21 (reverse); the DADA2 algorithm29; the SILVA database (version 138)30 for taxonomy classification; and rarefaction to a sampling depth of 10,000. The pipeline estimates the microbiota abundance and the corresponding sample diversity indices.
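To make the rarefaction setting concrete, the sketch below subsamples one sample's feature counts to a fixed depth without replacement. It is a plain-Python illustration of the general technique, not the code that Snaq or QIIME 2 runs; the feature names and counts are hypothetical, and only the depth of 10,000 comes from the text.

```python
import random


def rarefy(counts, depth=10_000, seed=0):
    """Subsample a {feature: read_count} profile to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, since such a
    sample cannot be rarefied to that depth.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    features = list(counts)
    # With `counts=`, random.sample treats each feature as repeated read_count
    # times and draws without replacement from that pool (Python 3.9 or later).
    drawn = rng.sample(features, k=depth, counts=[counts[f] for f in features])
    rarefied = {}
    for feature in drawn:
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied


# Hypothetical sample with 20,000 reads across three features.
sample = {"ASV_1": 12_000, "ASV_2": 6_000, "ASV_3": 2_000}
even_depth = rarefy(sample)
```

Bringing every sample to the same depth keeps diversity indices comparable across samples that were sequenced to different depths, at the cost of discarding reads from deeper samples.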

We also created in-house Snakemake pipelines to process the shotgun metagenomic sequencing data. For taxonomic assignment, these pipelines run either Kraken 29 followed by Bracken10, or MetaPhlAn 411, and then perform additional calculations for the diversity indices and data formatting. One more pipeline runs FMAP12 for functional profiling. To ensure the consistency of the taxonomy across different reference databases, we annotate all the taxa using the NCBI Taxonomy identifiers (taxon IDs)31.

Despite our attempt to unify the metadata items, some variations remain because the study topics differ. To solve this problem, we defined a large collection of metadata items with internal identifiers to cover the metadata from diverse cohorts. The metadata item names from each cohort were integrated by text matching and assigned internal identifiers, such as 10,010 for age and 10,012 for gender, to maintain consistency across the dataset. Data validation checks were conducted to identify and correct errors, such as text entries in numeric fields or invalid category options. After validation, the metadata were stored in NIBN JMD for further analysis.
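The name matching and validation described above might be sketched as follows. The identifiers for age and gender follow the text, written here without the thousands separator, which is an assumption about the stored form. The synonym lists, the rule format, and the function names are all hypothetical and only illustrate the idea.

```python
import re

# Internal identifiers for age and gender follow the text (written without the
# thousands separator, an assumption); the synonym lists are hypothetical.
ITEM_SYNONYMS = {
    10010: {"age", "age_years", "age at sampling"},
    10012: {"gender", "sex"},
}
# Hypothetical validation rules, one per internal identifier.
ITEM_RULES = {
    10010: {"type": "numeric"},
    10012: {"type": "category", "options": {"female", "male"}},
}


def normalize(name):
    """Lower-case a name and collapse spaces, underscores, and hyphens."""
    return re.sub(r"[\s_\-]+", " ", name.strip().lower())


LOOKUP = {
    normalize(synonym): item_id
    for item_id, synonyms in ITEM_SYNONYMS.items()
    for synonym in synonyms
}


def to_internal_id(name):
    """Map a cohort-specific item name to its internal identifier, or None."""
    return LOOKUP.get(normalize(name))  # None flags the name for manual curation


def validate(item_id, raw):
    """Return (clean_value, error); exactly one of the two is None."""
    rule = ITEM_RULES[item_id]
    text = str(raw).strip()
    if rule["type"] == "numeric":
        try:
            return float(text), None
        except ValueError:
            return None, f"non-numeric value {text!r}"
    if text.lower() in rule["options"]:
        return text.lower(), None
    return None, f"invalid category {text!r}"
```

For example, `to_internal_id("Age_Years")` resolves to the age item, while `validate(10010, "forty")` reports a text entry in a numeric field instead of silently storing it.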

Database development

We store and share the NIBN JMD data on the MANTA platform7. MANTA is an integrated analysis platform that hosts gut microbiome composition and phenotypic metadata, such as diet and exercise. It stores the data and provides data analysis functions, including data visualizations. The platform comprises a relational database (PostgreSQL) and a web application as the user interface. To improve the MANTA platform, we reimplemented the whole system and added the features described in the Results section.

To facilitate continuous updating of NIBN JMD, we use an in-house script to import the processed data described above. When new samples are added, the import script is followed by another script that calculates the all-against-all beta diversity32, including the Bray-Curtis distance and the Jaccard distance, for each rank. These scripts will be included in the newly implemented MANTA platform.
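The distance calculation can be sketched as below, assuming each sample is a mapping from taxon to abundance at one taxonomic rank. The incremental update, which computes only the pairs that involve new samples, is a guess at how such a script could avoid recomputation and is not taken from the in-house code. The sample profiles are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two {taxon: abundance} profiles."""
    taxa = set(a) | set(b)
    diff = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    total = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return diff / total if total else 0.0


def jaccard(a, b):
    """Jaccard distance computed on presence or absence of each taxon."""
    present_a = {t for t, v in a.items() if v > 0}
    present_b = {t for t, v in b.items() if v > 0}
    union = present_a | present_b
    return 1 - len(present_a & present_b) / len(union) if union else 0.0


def update_distances(stored, profiles, new_ids, metric):
    """Add distances for pairs involving new samples; reuse stored pairs.

    `stored` maps a sorted (sample_id, sample_id) tuple to a distance.
    """
    for new in new_ids:
        for other in profiles:
            if other == new:
                continue
            key = tuple(sorted((new, other)))
            if key not in stored:
                stored[key] = metric(profiles[new], profiles[other])
    return stored


# Hypothetical genus-level profiles for three samples.
profiles = {
    "s1": {"Bacteroides": 6, "Blautia": 4},
    "s2": {"Bacteroides": 4, "Blautia": 4, "Bifidobacterium": 2},
    "s3": {"Prevotella": 10},
}
distances = update_distances({}, profiles, ["s1", "s2", "s3"], bray_curtis)
```

Running the same update once per taxonomic rank and per metric would give the per-rank Bray-Curtis and Jaccard tables that the text describes.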