Introduction

Microorganisms play essential roles in sulfur cycle, which determine the compounds of sulfur transformation and their fate in the environment1. Sulfur compounds are abundant in natural environments, and a huge storage of sulfate and sulfides is found in marine ecosystems2. The sulfur cycle, mainly driven by sulfur oxidation and sulfate reduction, is tightly intertwined with other biochemical cycles (i.e., carbon, nitrogen, phosphorus) with far-reaching implications for environmental ecosystem3. Based on previous reports, sulfate-reducing bacteria (SRB) play a crucial role in the precipitation of heavy metals4, pollutants5, and hydrocarbon biodegradation6. Thus, characterizing the sulfur cycling genes and sulfur-metabolizing microorganisms is important to understand sulfur cycling processes in the environments.

The sulfur cycle is a complex biochemical process in the environment, consisting of inorganic and organic sulfur transformations. Inorganic transformations (i.e., canonical dissimilatory sulfate reduction [DSR], and assimilatory sulfate reduction [ASR]) have been well studied as described in the previous study7. For example, the composition of microorganism communities showed that Deltaproteobacteria was the dominant class of SRB and the pathway of ASR was major sulfate reduction in a full-scale biofilm-membrane bioreactor for textile wastewater treatment7. Organic sulfur transformations have a significant role in the sulfur cycle, given the abundance of organic sulfur in the environmental ecosystem8. Previous research has focused on inorganic sulfur transformations, hence the impact of organic sulfur compounds on ecosystems has yet to be explored3. Organosulfur compounds were abundant in the marine environment, such as dimethylsulfoniopropionate (DMSP)9, sulfonates3, sulfate esters3, and methanethiol (MeSH)10. The DMSP enzymatic breakdown product (dimethyl sulfide [DMS]) may result in global warming9. Sulfonates are decisive ecological prevalence for energy interchange between microbial autotrophs and heterotrophs, indicating the importance of organic sulfur metabolism in the environment11. Thus, it is critical to developing capabilities to obtain the complete sulfur cycle via advanced technologies.

Previously, considerable effort has been made to characterize sulfur cycle processes by analyzing key genes, such as dissimilatory sulfite reductase (dsrB)12, adenylyl sulfate reductase (aprA)13, and thiosulfohydrolase (soxB)14. Given the need for suitable DNA primers for many sulfur genes, polymerase chain reaction (PCR) usually produces inaccurate experimental results15,16. However, shotgun metagenome sequencing provides the opportunity to recover underexplored sulfur cycle17. Potential genes involved in sulfur cycling were annotated for metagenomics analysis using orthology database18. However, a comprehensive and reliable orthology database is essential for accurate annotation of functional genes. Thus, the results of the metagenomic analysis are heavily influenced by the selection of orthology databases.

Given the unavailability of a comprehensive sulfur database, multiple databases, such as Clusters of Orthologous Groups (COG)19, Kyoto Encyclopedia of Genes and Genomes (KEGG)20, Evolutionary Genealogy of Genes: Non-supervised Orthologous Groups (eggNOG)21, and M5nr22, are widely used for the efficient metagenomic data set. However, a database with high coverage of sulfur genes is further required23. Given the similar functional domains of the genes, small error abundance values from the application of these databases in analyzing sulfur cycles should be avoided24. Searching sulfur cycling genes in large databases is time consuming. Therefore, a specific database of genes related to the sulfur cycle pathway should be developed to address these problems.

The microbial community assembly processes determine species distribution patterns and abundance25. The neutral theory posits that stochastic processes (dispersal, local extinction, and ecological drift) cause variation in microbial community composition26. According to this theory, sulfur cycle in different areas may present distinct due to geographical distance. Mangroves are widely distributed along tropical and subtropical coasts to withstand waves and storms. Recent studies have investigated sulfur metabolism in mangrove ecosystems27. However, it is still being determined how microbial communities and the sulfur cycle differ from other biomes, such as upland forests, deep-sea sediments, marine waters, and freshwater.

A manually integrated database, namely, the sulfur metabolism gene integrative database (SMDB), gathering most of the sulfur cycling genes from public databases (i.e., KEGG, COG, eggNOG, M5nr, and NR), has been developed to address the limitations of currently available public resources and facilitate the identification and characterization of sulfur genes and sulfur-metabolizing microorganism communities. Our motivations in creating SMDB are as follows: (i) to improve the sulfur pathway about functional genes and associated microorganisms and (ii) to facilitate the annotation of sulfur cycle information in shotgun sequencing. SMDB was applied to functionally and taxonomically identify sulfur cycling genes and sulfur-metabolizing microorganism communities from different habitats (upland forest, deep-sea sediments, marine waters, river sediments, and mangrove sediments). This study could provide a high-quality sulfur cycle database for functional profiling of shotgun metagenomes.

Materials and methods

SMDB sources

The universal protein (UniProt) database (http://www.uniprot.org/; October 2019) was utilized to retrieve core database sequences of sulfur cycling genes by using keywords. COG (ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data), eggNOG (http://eggnogdb.embl.de/download/eggnog_4.5/), KEGG (http://www.genome.jp/kegg/; October 2019), and M5nr (ftp://ftp.metagenomics.anl.gov/data/M5nr/current/M5nr.gz; October 2019) were used for retrieving nontarget homologous sequences. In addition, the homologous sequences from the NCBI nonredundant (NR) database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/; October 2019) were added to the core constructed database.

SMDB development

An integrated database was manually created to profile sulfur cycling genes from shotgun metagenomes as described by a previous study28, with slight modifications. An integrative database of nitrogen cycling genes was manually curated by researchers28. The detailed method for database development is described following (Fig. 1).

Figure 1
figure 1

Flowchart of major steps for SMDB construction. First, a core database was constructed for selected genes by retrieving protein sequences from UniProt databases using keywords. Second, a full database was constructed by integrating target genes from databases, including COG, eggNOG, KEGG, M5nr, and NR. Third, a PERL script was developed to generate functional and taxonomic profiles for shotgun metagenomes using searching tools.

Core database for constructing sulfur cycling genes

The sulfur metabolism in KEGG and MetaCyc databases29 were referenced to retrieve genes involved in the sulfur cycle. Only those genes that had been experimentally confirmed to be involved in sulfur metabolism were collected through an extensive literature search. The next step was followed by searching the UniProt database with their keywords (e.g., dissimilatory sulfite reductase alpha subunit: dsrA) to download the corresponding annotated sulfur metabolism gene sequences. For genes with vague definitions (e.g., cysQ and MET17), the full protein name was added in the keywords to exclude the sequences with vague annotations. Then, to ensure the accuracy of SMDB, these sequences for each sulfur cycling gene were manually checked based on their annotations. Finally, these collected sequences were selected as the core database for sulfur cycling genes.

Full SMDB construction

COG, eggNOG, KEGG, M5nr, and NR databases were used to retrieve sulfur genes' homologous sequences to construct a complete database and reduce false positives. The integrative database included the core database of sulfur cycling genes and their homologous sequences. The sequence files of this integrative database were clustered using the CD-HIT (v.4.5.6, identity set as 100%)30. Representative sequences from this integrative database were subsequently selected to build the SMDB. The sulfur metabolism in KEGG, MetaCyc, and references was referenced to the gene assignments of SMDB sulfur pathways.

For the SMDB taxonomic annotation, the SMDB sequences were aligned to the NR database via BLASTP of DIAMOND (v.0.9.29.130, coverage > 50%, e−value < 1 × 10−10)31. Then, the BLASTP result was performed using the MEGAN software LCA algorithm to get taxonomic classification32. The SMDB (v.1) was deposited in GitHub (https://github.com/taylor19891213/sulfur-metabolism-gene-database) on January 8, 2020. In contrast, the analytic platform of the SMDB website has been online since June 22, 2020 (https://smdb.gxu.edu.cn/).

Implementation of the SMDB in environmental samples

Metagenomic analysis

The SMDB was applied to analyze sulfur cycling genes and sulfur-metabolizing microorganism communities from five distinct habitats: upland forest33, deep-sea sediments34,35, marine waters36, river sediments37, and mangrove sediments38. These data were obtained from the NCBI (https://www.ncbi.nlm.nih.gov/sra) and the Chinese National Genomics Data Center GSA database (https://bigd.big.ac.cn/gsub/) (Supplementary Table S1). The sequence data were merged and assembled by megahit (v1.1.3) with default parameters39. The assembled sequences were used for gene prediction by Prodigal (v3.02)40. The number of reads in the gene alignment in all samples was calculated by SoapAligner (v2.21)41. The gene-normalized abundance was calculated based on the number of reads and gene length. Then, the assembled sequences were annotated using the SMDB, COG, eggNOG, KEGG, M5nr, SCycDB, and NR databases via the DIAMOND with parameters set as an e-value cutoff of 1 × 10−531. Sulfur sequences are extracted from the merged sequences for further taxonomy annotation. Microorganisms at different taxonomic levels were generated after least common ancestor (LCA) algorithm42.

Statistical analysis

Shannon indices reflect the diversity of species in the samples43. The Shannon indices was calculated by using R package. Principal coordinates analysis (PCoA) was used to describe the differences in microbial community and sulfur gene structure among samples from different regions by using the R package “stats.” The R and P of PCoA are calculated based on ANOSIM using the R package “vegan.” The neutral community model (NCM) was used to explain the microbial community assembly in different habitats44. Statistical tests of genes involved in sulfur metabolism between marine and non-marine ecosystems were performed by comparing their abundance by the Tukey–Kramer test. The least significant difference (LSD) test45 was used to analyze the variance (ANOVA) model for multiple comparisons among five habitats for sulfur-metabolizing microorganisms.

Results

Summary of genes and pathways in SMDB

Using keywords (e.g., sulfur, sulfate) to retrieve 284,541 literature reports from 1976 to 2021 in Web of Science and then obtained records of sulfur genes through a web crawler with Python. After manual verification, 875 related literature reports (representative literature was recruited) and 175 genes covering 11 sulfur metabolism pathways (including assimilatory sulfate reduction, thiosulfate disproportionation, sulfide oxidation, dissimilatory sulfate reduction, sulfite oxidation, sulfur oxidation, sulfur reduction, tetrathionate oxidation, tetrathionate reduction, thiosulfate oxidation, and organic degradation/synthesis), were recruited in the SMDB (Table 1, Supplementary Table S2). Each sulfur gene has rich information, including the mechanism of action, structure, and sequence. The SMDB obtained 395,737 representative sequences at 100% identity cutoffs. Summary of the sulfur metabolism pathway genes see Supplementary Materials.

Table 1 Summary of sulfur cycle genes in SMDB and other databases.

Characteristics of the SMDB

The coverage of sulfur metabolism genes in the SMDB was compared with existing public databases to demonstrate the purpose of creating a sulfur metabolism genes database in this study. Among the 175 sulfur genes, we recruited in the SMDB, only 118, 157, 144, 169, and 172 could be found in COG, eggNOG, KEGG, M5nr, and NR databases, respectively (Supplementary Fig. S6). These sequences were counted in public databases and SMDB through Perl programming language for each sulfur gene. The results showed that the coverage of the SMDB containing sulfur gene sequences exceeded that of COG, eggNOG, KEGG, M5nr, and NR databases (Fig. 2). Nearly 342,433 sulfur gene sequences were found in the SMDB and not previously included in NR databases. The 25 metagenomic data obtained from five habitats were aligned against SMDB, COG, eggNOG, KEGG, M5nr, SCycDB, and NR databases to analyze the sulfur metabolism.

Figure 2
figure 2

Percentage of sequences belonging to the selected sulfur cycling genes in public databases. Blue indicates fewer gene sequences from the corresponding public database. Heatmap according to the z-scores of the abundant gene (sub)family. The heatmap was created using the “pheatmap” package (v1.0.12, https://cran.r-project.org/web/packages/pheatmap/index.html).

Taxonomic composition of sulfur cycling genes and pathways in SMDB

Sulfur sequences were aligned again with the NR database using a local BLASTP program to obtain the structure and composition of the taxonomic sulfur cycle in SMDB. The SMDB covered 93 phyla, 87 classes, 194 orders, 432 families, and 2225 genera of bacteria, and 17 phyla, 15 classes, 25 orders, 38 families, and 115 genera of archaea (Table 2). For bacteria, Proteobacteria (66.9%), Actinobacteria (14.2%), and Firmicutes (12.1%) were the dominant phyla, with Pseudomonas (10.4%), Escherichia (5.4%), Burkholderia (4.3%), and Streptomyces (4.1%) were the dominant genera in the SMDB (Supplementary Table S3). Table 2 shows that assimilatory sulfate reduction has the highest coverage of microorganisms, containing 85 phyla and 1840 genera, followed by organic degradation and synthesis with 54 phyla and 1500 genera and dissimilatory sulfate reduction with 54 phyla and 904 genera. Euryarchaeota, Crenarchaeota, Thaumarchaeota, Candidatus Thorarchaeota, and Candidatus Bathyarchaeota were the dominant phyla of archaea in SMDB (Supplementary Table S3). Haloferax, Haloarcula, Archaeoglobus, Methanosarcina, Thermococcus, Nitrosopumilus, Methanobrevibacter, Methanothrix, and Methanobacterium were the dominant genus of archaea in SMDB (Supplementary Table S3). At the genus level, assimilatory sulfate reduction had the highest diversity involving 79 genera, followed by organic degradation/synthesis (67 genera) and sulfur oxidation (45 genera) (Table 2).

Table 2 Summary of sulfur cycle pathways of taxa in SMDB.

Data access and data mining

Keyword searches

A simple keyword search was available on the SMDB website, providing a quick means for searching sulfur genes in our database. For example, users interested in dissimilatory sulfite reductase alpha subunit can search for the keyword “dsrA”. The search results displayed detailed information about each sulfur gene. An advanced search function allows users to search for pathways of the sulfur cycle.

Similarity searches

A DIAMOND interface was also provided to identify and annotate sulfur genes on the SMDB website. 2 G of metagenomic data (i.e., multi-FASTA file and multi-FASTQ file) can be provided to this interface, which consumed 60 s of BLAST time. The web interface tasks required queuing. The output of annotation results was in the standard m8 format; however, additional displays were provided, which were specific to sulfur cycling genes. Our “SMDB annotation format” was inferred from the similarity level to the database sequences associated with a specific sulfur gene.

Validation

As a specific example, Fun Gene database sequences (i.e., dsrA/B, and soxB) were used for validation in this study (Table 3). These sequences were annotated using the SMDB via DIAMOND with parameters set as an e-value cutoff of 1 × 10−5. The results showed that SMDB was able to classify sulfur genes correctly.

Table 3 Example database validation.

Implementation of the SMDB for sulfur cycle in environmental samples

Microbiome diversity and the community assembly patterns in different habitats

Overall, 3403 microorganisms were detected from five habitats. Alpha diversity, as measured by the Shannon index, was significantly higher in mangroves than in the other habitats (Fig. 3a). Principle coordinates analysis (PCoA) revealed that microbial beta diversity in five habitats were significantly different (Fig. 3b; ANOSIM, permutations = 999, p-value = 0.001, R2 = 0.86). The microbial community demonstrated a special composition in mangrove ecosystems, which was predominantly composed of members of Deltaproteobacteria (37.05%), followed by Gammaproteobacteria (28.70%) and Alphaproteobacteria (6.91.8%) (Supplementary Fig. S8). The dominant class in the microbial communities were Alphaproteobacteria (37.59%), followed by Actinomycetia (20.09%) and Acidobacteriia (15.59%) in the upland forest. The dominant class in the microbial communities were Alphaproteobacteria (46.73%), followed by Gammaproteobacteria (21.97%) in marine waters. The dominant class in the microbial communities were Dehalococcoidia (35.15%), followed by Deltaproteobacteria (16.69%) in deep-sea sediments. While, Betaproteobacteria (32.77%) were the dominant class of microbial communities in river sediments (Supplementary Fig. S8).

Figure 3
figure 3

Alpha and beta diversity. (a) The α-diversity of microbial diversity for different habitats. ANOVA analyzed significance. (b) Principal coordinates analysis (PCoA) of the microbial community. The R and P values in the figure are calculated based on ANOSIM. UF, upland forest; DS, deep-sea sediments; MW, marine waters; RS, river sediments; MS, mangrove sediments.

The neutral community model (NCM) fitted well to microbial community assembly in mangrove ecosystems, river sediments, and upland forest habitats (R2 > 0.6). However, this model did not fit well with the microbial community assembly in marine waters and deep-sea sediments habitats (Fig. 4). The Nm-value was highest for microbial community in the mangrove ecosystem (Nm = 371,591; Fig. 4), followed by river sediments (Nm = 161,104), marine waters (Nm = 90,120), deep-sea sediments (Nm = 48,176), and upland forest (Nm = 24,588). These results indicated that microbial dispersal was higher in the mangrove ecosystem than in others habitats.

Figure 4
figure 4

Neutral community model (NCM) of microbial community assembly. (a) Deep-sea sediments; (b) Mangrove sediments; (c) Marine waters; (d) River sediments; (e) Upland forest. The solid blue lines represent the best fit to the NCM as shown in Sloan et al.40, and the dashed blue lines in the figure represent the 95% confidence intervals around the model predictions. Microbial communities with frequencies higher or lower than those predicted by the NCM are shown in different colors. Nm indicates the microbial community size times immigration, and R2 represents the fit to this model.

Distribution of sulfur genes

SMDB was applied to profile sulfur cycle from five habitats: upland forest, deep-sea sediments, marine waters, river sediments, and mangrove sediments (Supplementary Table S4). The results show that the number of sulfur cycling genes were 110–159 in the five habitats. The top 10 sulfur genes were sulfonate transport system ATP-binding protein (ssuB), Heterodisulfide reductase subunit A (hdrA), Arylsulfatase (atsA), 3-(methylthio)propionyl-CoA ligase (dmdB), Adenylylsulfate kinase (cysC), cysteine synthase (cysE), Anaerobic dimethyl sulfoxide reductase subunit A (dmsA), Sulfur-carrier protein (tusA), S-sulfolactate dehydrogenase (slcC), and Heterodisulfide reductase iron-sulfur subunit D (hdrD). The results showed that five of the top 10 genes belonged to organic degradation/synthesis pathway. Therefore, organic degradation/synthesis was the main sulfur metabolism conversion pathway in these five habitats. The sulfur cycling genes were significantly different (ANOSIM, permutations = 999, p-value = 0.001) among the five habitats (Fig. 5b). When grouped by habitats, the sulfur cycling genes were differentially enriched in different environments (Fig. 5). For example, the abundance of dissimilatory sulfate reduction genes (dsrB, aprA/B and sat) in the marine ecosystem (ME) were significantly higher than those of the non-marine ecosystem (NME) area. In comparison, the abundance of sulfur oxidation genes (sqr, SOX, and soxC) in the non-marine ecosystem were significantly higher than those of the marine ecosystem area (p < 0.05). Mangrove sediments exhibited the highest abundance of genes involved in dissimilatory sulfate reduction genes (dsrA, and aprA/B), with deep-sea sediments having a particularly high abundance of dissimilatory sulfate reduction gene (dsrB). River sediments exhibited the highest abundance of genes involved in sulfide oxidation (soxB), and DMSP conversion (dmdB/C).

Figure 5
figure 5

The abundance of sulfur metabolism genes. (a) Abundance comparison of sulfur metabolism genes between the marine ecosystem and non-marine ecosystem. The P values are based on Tukey–Kramer test. (b) Principal co-ordinates analysis (PCoA) of sulfur cycling genes. The R and P values in the figure are calculated based on ANOSIM. (c) Differences in the distribution of key sulfur genes in five habitats. The abundance is the average of each group. SE is expressed as error bars. UF, upland forest; DS, deep-sea sediments; MW, marine waters; RS, river sediments; MS, mangrove sediments.

Sulfur-metabolizing microorganism abundance and diversity

The composition of sulfur-metabolizing microorganisms showed that Proteobacteria was the dominant phylum of sulfur-metabolizing microorganism communities in the mangrove sediments, marine waters, and river sediments (Supplementary Fig. S9). Furthermore, chloroflexi was the dominant phylum of sulfur-metabolizing microorganism communities in deep-sea sediments.

At the phylum level, the abundance of Proteobacteria in mangrove sediments was significantly higher than in the deep-sea sediments and upland forest. The abundance of Nitrospirae in mangrove sediments was significantly higher than in the marine waters and upland forests. The abundance of Bacteroidetes in mangrove sediments was significantly higher than in the deep-sea sediments, river sediments, and upland forests. By contrast, the abundance of Actinobacteria in mangrove sediments was significantly lower than that of the marine waters, deep-sea sediments, river sediments, and upland forest (Fig. 6a). At the class level, the abundance of Gemmatimonadetes, Deltaproteobacteria, and Nitrospira in mangrove sediments were significantly higher than those of in marine waters. However, the abundance of Alphaproteobacteria and Betaproteobacteria in mangrove sediments were significantly lower than those in river sediments and upland forest (Fig. 6b). Random forest is a popular machine learning model that uses bootstrap aggregation and randomization of predictors to achieve a high degree of prediction accuracy46. The features that contribute the most to the sample grouping prediction's accuracy were selected using the random forest method. Notably, for the feature ranking method, the top factor -the Flavilitoribacter (phylum Bacteroidetes) contributed to the random forest models (Fig. 7). Figure 7 also shows that the addition of the signature of the microbial Litoricola and Mariniblastus into the models enabled us to achieve the highest accuracy. The results as mentioned above showed that SMDB was a powerful tool for analyzing sulfur cycle from metagenomic data in various environments.

Figure 6
figure 6

Differences in the distribution of sulfur metabolizing microorganisms in different habitats. The map is according to log10 of the 20 most abundant sulfur metabolizing microorganisms. SE is expressed as error bars. (a) Phylum level. (b) Class level. The color of the abundance bar represents habitats. Different lowercase black letters represent significant differences among habitats (p < 0.05). UF, upland forest; DS, deep-sea sediments; MW, marine waters; RS, river sediments; MS, mangrove sediments.

Figure 7
figure 7

The features of sulfur metabolizing microorganisms are ranked by their frequencies of being selected as the random forest classifiers. The colored boxes on the right indicate the relative abundance ratio of the corresponding factor in each group.

Discussion

Many microbes play an important role in the sulfur cycle47,48. Obtaining the complete list of sulfur cycle genes and sulfur-metabolizing microorganisms are critical to understanding sulfur cycle processes in the environment. In this study, a database was manually created for the fast and accurate analysis of sulfur cycle from metagenomic data. To our knowledge, SMDB online database contributes to the analysis of genomes and metagenomes, thereby screening for sulfur genes in large-scale sequence data sets.

The SMDB presents more comprehensive coverage of genes involved in sulfur metabolism than other public databases (Fig. 2). A new database, SCycDB23, containing 207 sulfur cycling genes, was constructed synchronously with the SMDB. We also found some comparable advantages when comparing with SCycDB (Fig. 2). For example, 32 sulfur genes (e.g., SUOX, APA1_2, aprM, aps, ETHE1, MET10, MET3, MET5, npsr) were not detected in SCycDB, such as the sulfite oxidase (SUOX) gene, which catalyzed the final step in cysteine catabolism, thereby oxidizing sulfite to sulfate49. The number of sulfur genes in the SCycDB database is higher than that in the SMDB database, mainly due to the addition of proteins involved in sulfur transfer, taurine, (R)-DHPS, choline-o-sulfate, PEP, and UDP-glucose metabolism. However, tauD, a gene that converts taurine and sulfite50, was also selected for the SMDB database. The SMDB database focuses on the key genes involved in converting sulfur compounds according to the references and sulfur metabolic pathways in KEGG and MetaCyc databases. We aim to provide the latest knowledge and research progress in sulfur metabolism studies, such as aerobic DMS degradation51. The product of DMS degradation (i.e., sulfuric and methanesulfonic acids) attracts water and promotes cloud formation, thereby affecting the climate. In addition, the sulfur sequences from the NR database have been selected to SMDB to facilitate sulfur-metabolizing microorganism annotation. An appropriate database is critical for the accuracy of metagenomic annotation. The SMDB primarily has the following three characteristics. Firstly, the SMDB has a precise definition of sulfur metabolism genes. Typical examples include phsA and psrA genes that encode the particle thiosulfate reductase and polysulfide reductase, respectively, but these genes share high sequence similarity24. Hence, distinguishing the activities of these two enzymes during genome annotation is difficult. The phylogenetic tree showed a clear separation of the psrA gene from the phsA gene (Supplementary Fig. S7). This result indicated the possibility of faulty ecological explanations. In SMDB, we have precisely defined these genes to avoid mis-annotations. Secondly, the SMDB considers the problem of false positives. This study addresses this problem by adding homologous sequences from multiple public databases. Thirdly, the small size (140 MB) of the SMDB reduces the computational cost required to obtain the sulfur metabolism genes. 2 G of metagenomic data consume 60 s of BLAST time. Therefore, the SMDB presents comparable advantages with regard to data quantity and quality.

Although we have collected as many sulfur genes as possible, some sulfur genes may still need to be noticed. A DATA SHARING interface (https://smdb.gxu.edu.cn/) was provided to submit sulfur genes with experimentally confirmed information in SMDB database. We plan to update the SMDB database continuously with novel sulfur cycle genes from literature and sequences submitted in SMDB web.

As the application of metagenomics in the environment increases, fast obtaining the functional profiles from metagenomics is important for researchers. The database described in this article, SMDB, unifies the most publicly available sulfur genes and provides a reliable annotation service to investigate sulfur cycle in different environments. SMDB gathered the comprehensive inorganic and organic sulfur transformation genes used to analyze sulfur cycle in five types of environments. Our results revealed that 110–159 of sulfur genes were detected in these environments. In the five habitats, the DMSP conversion in a high abundance, two of the top 10 genes belong to this process (Supplementary Table S4). This shows that DMSP is one of the Earth's most abundant organosulfur molecules52. In addition, a significant difference in the distribution of sulfur genes and sulfur-metabolizing microorganisms was found in these environments. For example, dissimilatory sulfate reduction genes were highly abundant in mangrove sediments and deep-sea sediments, probably because of the high concentration of sulfate in the sea35.

Microorganism alpha diversity was significantly higher in mangrove sediments than in the other habitats, PCoA showed that mangrove sediments also differed significantly from that of other habitats (Fig. 3). This finding is in agreement with previous results53. In addition, NCM showed that bacterial community structure in mangrove sediments was mainly driven by stochastic processes (R2 = 0.745, Nm = 371,591). Microbial dispersal was higher in the mangrove ecosystem than in others habitats. The high temperature and nutrient availability in mangrove sediments may explain the higher microorganism diversity. It could also be because mangroves are located in buffer zones that connect land and sea54. The river water discharges nutrients upstream into the mangrove sediments. In addition, it was observed that microbial composition showed distinctive patterns among different habitats. The microbial community in mangrove ecosystems was predominantly composed of members of Deltaproteobacteria.

Previous studies demonstrated that the microorganisms involved in the sulfur cycle primarily belong to Proteobacteria, Firmicutes and Actinobacteria of bacteria55. Sulfur-metabolizing microorganisms (e.g., Gemmatimonadetes, Deltaproteobacteria, Bacteroidetes, and Nitrospira) in mangrove sediments were found to be significantly more abundant than those in other habitats (Fig. 6). The metabolic diversity of Deltaproteobacteria can provide a competitive advantage for survival in fluctuating habitats56. Previous studies demonstrated that Deltaproteobacteria were associated with higher salinity57. Deltaproteobacteria are sulfate-reducing bacteria (SRB) with a potential for sulfate reduction, and organic matter decomposition58. The Bacteroidetes are considered primary degraders of polysaccharides and are found in many ecosystems59. Adding the signature of the microbial Flavilitoribacter into the models enabled the highest accuracy possible that it has a potential for polysaccharides.

Sulfur cycle is widely used in heavy metal contamination16. Given the increase of human activities, the sulfur cycle balance is affected, such as the lack of sulfur elements in the soil ecosystem leading to crop production reduction60,61. The SMDB will facilitate research to understand the sulfur cycle in different environments. Hence, this database will allow microbiologists to obtain the complete sulfur cycling genes comprehensively.

Conclusion

A high sequences coverage of database, namely, SMDB, which focuses on the information of sulfur cycle, has been developed. This integrative database contains 175 genes and covers 11 sulfur metabolism processes. In addition, an online website database of SMDB was provided to analyze sulfur metabolism. The SMDB can analyze the sulfur metabolism quickly and accurately. Applying the SMDB to sulfur cycle in five diverse environments, it has demonstrated its ability to annotate sulfur cycle from metagenomes in different environments. SMDB will be a valuable resource for studying sulfur metabolisms from shotgun metagenomic data.