Abstract
Analyzing microbial samples remains computationally challenging due to their diversity and complexity. The lack of robust de novo protein function prediction methods exacerbates the difficulty in deriving functional insights from these samples. Traditional prediction methods, dependent on homology and sequence similarity, often fail to predict functions for novel proteins and proteins without known homologs. Moreover, most of these methods have been trained on largely eukaryotic data, and have not been evaluated on or applied to microbial datasets. This research introduces DeepGOMeta, a deep learning model designed for protein function prediction as Gene Ontology (GO) terms, trained on a dataset relevant to microbes. The model is applied to diverse microbial datasets to demonstrate its use for gaining biological insights. Data and code are available at https://github.com/bio-ontology-research-group/deepgometa
Similar content being viewed by others
Introduction
Protein function prediction has evolved significantly over the past few years, transitioning from reliance on basic sequence alignment to approaches based on machine learning, natural language processing, or analysis of biological networks1. Despite these advances, few methods have been developed for and evaluated on metagenome or amplicon sequencing data mainly because there is no “ground truth” unless the methods are applied to “mock communities” which are highly simplified versions of actual microbial communities and not representative of the complexities encountered in real-world cases.
Deep learning methods stand out as promising solutions capable of annotating novel proteins by learning complex patterns within large datasets2. Their ability to annotate proteins without prior explicit sequence similarity or homology can potentially overcome the challenges presented by microbial data. However, the applicability of deep learning-based function prediction methods in annotating microbial genomes is hindered by two main limitations: the lack of training on datasets that are representative of microbial communities and the absence of applications on amplicon data and proteins from whole genome sequencing (WGS) reads.
These limitations underscore a significant gap in the field, where even sophisticated approaches like those presented in the Critical Assessment of Function Annotation (CAFA)3challenge, and methods such as ProtInfer4and SPROF-GO5, utilize databases rich in eukaryotic proteins such as SwissProt, overlooking the predominantly prokaryotic nature of metagenomes6. While some deep learning methods, such as NetQuilt7and DeepSS2GO8, have been specifically trained on bacteria and show commendable performance, their training sets focus on well-studied bacterial species and do not incorporate archaea or viruses. Similarly, these deep learning-based methods are not directly applied to datasets that are representative of complex communities and rely on metrics that do not address the biological relevance of the annotations, further increasing the divide between methodological innovation and practical applicability in microbial genomics.
Acknowledging these challenges, we train a deep learning method for the functional annotation of microbial data and explore the biological significance of these annotations. DeepGOMeta incorporates ESM2 (Evolutionary Scale Modeling 2)9, a deep learning framework that extracts meaningful features from protein sequences by learning from evolutionary data. We utilize these learned features through ESM2, train on a more representative dataset, and evaluate against state-of-the-art function annotation methods.
Methods
Materials and data
UniProtKB/Swiss-Prot dataset and gene ontology
We obtained all proteins that were manually curated and reviewed from the UniProtKB/Swiss-Prot Knowledgebase (v2023_03, r28-6–2023)10. We further filtered to select for proteins that belong to prokaryotes, archaea and phages, and only kept proteins with experimental functional annotations using evidence codes EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC, HTP, HDA, HMP, HGI, HEP. The dataset contains 10, 107 reviewed and manually annotated proteins.
Metagenomes contain many uncharacterized, novel proteins and in order to evaluate our models on novel proteins, we generated training, validation and testing splits based on sequence similarity. First, we grouped the proteins by their similarity using Diamond (v2.0.9)11 (e-value 0.001) and split them into training, validation and testing sets, 81/9/10 %, respectively. This is to ensure that the training and validation set proteins do not have any similar sequences in the training set.
We trained and evaluated a separate model for each of the GO sub-ontologies: Molecular Functions (MFO) of the protein, Biological Processes (BPO) in which the protein participates, and Cellular Components (CCO) in which the protein operates (r2023-01-01)12.
To compare our model against other methods, we generated a time-based test set by following the CAFA3 challenge time-based approach. We downloaded UniProtKB/Swiss-Prot (v2023_05 r2023-11-08) and extracted newly annotated proteins in this version. Table 1 summarizes the datasets for each sub-ontology.
Protein-protein interactions data
For the 10,107 proteins in our dataset, we obtained protein–protein interaction (PPI) data from the STRING (v11.0)13 database, which yielded 14,524 interactions. There were 7 different modes of interactions: binding, activation, reaction, catalysis, expression, inhibition, and ptmod.
Paired 16S and WGS dataset
We applied our method to generate functional profiles of microbial communities using four publicly available datasets that contain both 16S amplicon data and WGS from the same samples, shown in Table 2. We used two human stool microbiomes: 60 samples from Indian individuals (PRJNA397112) and 60 samples from Cameroonian individuals (WGS: PRJEB27005, 16S: mgp15238)14,15, an environmental microbiome: 22 blueberry plant soil samples (WGS: PRJNA484230, 16S: PRJNA389786), and 11 mammalian stool samples (WGS: SRP115632, 16S: SRP115643). The datasets represent a variety of host-associated and environmental microbiomes.
Baseline and comparison methods
To evaluate our retrained model, we used baseline methods that do not rely on predictions based on sequence similarity, as we aimed to test the predictors on challenging sequences. We used a “naive” classifier leveraging the annotation frequencies within the Gene Ontology (GO) database, a multi-layer perceptron (MLP) model utilizing protein embeddings from the ESM2 model, and several advanced predictive models including DeepGO-PLUS, DeepGOCNN, and DeepGOZero. We also included DiamondScore, which is primarily based on sequence similarity. For the time-based dataset evaluation, we selected several state-of-the-art methods: TALE16, SPROF-GO5, DeepFRI17, DeepGO-SE18, NETGO 3.019, and TransFun20. We also compared against DiamondScore. We verified that the test protein set in the time-based split was not used for training SPROF-GO, DeepFRI or TALE, but could not verify this for NetGO 3.0 and TransFun due to the unavailability of the training data. Detailed methodologies, including the specific implementations employed in our evaluations, the model versions, and release dates, are documented in the Supplementary Materials section.
Evaluation
Data processing
We analyzed four diverse microbiome datasets, each containing paired 16S rRNA amplicon and WGS data. For the 16S data, we used a Nextflow pipeline employing the RDP classifier (v18) for processing and taxonomic classification available on our GitHub repository (https://github.com/bio-ontology-research-group/16SProcessing). We sourced protein sequences corresponding to the identified bacteria in the RDP database from NCBI and annotated with DeepGOMeta21,22. We then used these predicted functions to create two types of functional profiles. We constructed a binary matrix of all the samples and functions in the dataset, where the presence of a function in a sample is represented by 1 and the absence by 0. We also constructed an abundance-weighted matrix for each sample, where we calculated function abundance to provide a quantitative assessment of the functional potential within a microbial sample. The abundance of a function (A(f) is the sum of the relative abundance of all taxa present in a sample that contain a certain function, given by:
where i is an index representing each taxon, n is the total number of taxa in the sample, \(R(t_i)\) is the relative abundance of the \(i^{th}\) taxon, and \(I(f, t_i)\) is an indicator function that equals 1 if the \(i^{th}\) taxon contains function f, and 0 otherwise.
For WGS data, we used fastp (v0.23.2)23for trimming [-q 30]. For host-associated microbiome samples, we used Bowtie2 (v2.5.1)24to filter out reads mapping to the host’s reference genome. We then assembled the reads with MEGAHIT (v1.2.9)25, predicted protein sequences with prodigal (v2.6.3)26, and annotated the predicted proteins using DeepGOMeta. For each sample, we constructed a functional profile by aggregating the functions derived from DeepGOMeta annotations of all proteins present in the sample. We constructed a binary matrix for these results as described above.
Pathway prediction
PICRUSt227provides the potential functions of microbial communities using 16s rRNA data and a reference genome databases. We used operational taxonomic unit (OTU) tables as the input for PICRUSt2 and focused on MetaCyc28 pathways and their abundance scores. We performed Principal Component Analysis (PCA) and k-means clustering to discern patterns within the dataset based on these MetaCyc pathway features. The value of k was determined based on the number of categories within each phenotype. We measured clustering purity based on the true phenotype labels in the datasets (eq. 2).
HUMAnN (The HMP Unified Metabolic Analysis Network) 3.029 efficiently maps metagenomic sequences to a vast database of reference genomes and metabolic pathways. We applied this tool on our benchmarks datasets using WGS data to obtain pathway annotations. We measured clustering purity as described above.
Evaluation strategy
For each dataset, we applied PCA and k-means clustering to the OTU table containing the relative abundance of bacterial genera, as well as the abundance matrices we constructed using function annotations from each method. The choice of k in k-means clustering was determined by the number of phenotype categories present for each phenotype under investigation. We calculated clustering purity based on the known phenotype category labels provided in the metadata. Purity assesses the homogeneity of clusters formed by a k-means clustering algorithm. We clustered samples based on their bacterial composition and predicted functions, and used purity to evaluate whether samples with same the phenotype are in the same cluster. The Weighted Average Clustering Purity (WACP) formula is given by:
where N is the total number of data points, k is the number of clusters, \(n_{ij}\) is the number of data points from cluster j that are assigned to cluster i, \(n_i\) is the total number of data points assigned to cluster i, and \(w_j\) is the weight associated with cluster j.
Results
DeepGOMeta
Microbial samples are complex and contain many uncharacterized proteins. Previously, we developed DeepGO-SE18to predict protein functions using ESM29 embeddings and approximate semantic entailment. While DeepGO-SE can annotate uncharacterized proteins, it’s trained on all experimentally annotated UniProt-KB/Swissprot proteins, many of the functions it predicts are not relevant to microbiomes and exist only in eukaryotic genomes. Therefore, we trained DeepGOMeta, a specific version of DeepGO-SE, on a dataset of prokaryotic, archaeal and viral proteins with experimental annotations from UniProt-KB/Swissprot. We assessed the model’s performance against other state-of-the-art methods, and applied it to diverse microbial datasets to extract biological insights.
Incorporating PPIs
Proteins do not function in isolation and PPIs play a significant role in biological processes that take place in the environment. PPI networks also offer a means to reveal functional information for unknown proteins within microbial datasets. In order to test if incorporating PPIs improve protein function prediction, we trained a model which combines PPIs from STRING Database13 using Graph Attention Networks. We refer to this model as DeepGOMeta-PPI.
Evaluation on the similarity-based benchmark
We trained, validated and tested our models for the three sub-ontologies of GO using the UniProtKB/Swiss-Prot dataset split based on sequence similarity (See Methods section). We compared with DiamondScore and four baseline methods that do not rely on sequence similarity: MLP (ESM2), DeepGraphGO, InterPro and Naive.
In the MFO evaluation, DeepGOMeta was outperformed by InterPro in terms of \(F_{\max }\) and \(S_{\min }\), but performed better than all the other methods in terms of AUPR and term-centric AUC. Combining PPI network features into the model reduced its performance, but was still better than the DeepGraphGO method, which is also based on PPIs (Table 3).
In the BPO evaluation, our model resulted in best \(F_{\max }\) of 0.476 which was significantly better (Wilcoxon signed-rank test p-value is \(8 \cdot 10^{-37}\)) than the second best MLP (ESM2) baseline. Combining PPI networks in DeepGOMeta lead to a slightly lower \(F_{\max }\) of 0.469.
In the CCO evaluation, DeepGOMeta achieved the best \(F_{\max }\) of 0.739 followed by almost the same performance by MLP(ESM2) baseline. Noticeably, MLP(ESM2) resulted in the best \(S_{\min }\). Similarly to MFO and BPO evaluations, combining PPIs did not improve the predictions. DeepGraphGO method resulted in \(F_{\max }\) of 0.501, which is slightly better than Naive classifier, and InterPro annotation-based prediction performance was considerably lower than most methods (Table 3).
DiamondScore30 was vastly outperformed by all the other methods. Its poor performance may be attributed to its reliance on sequence similarity, which can be limiting for prediction tasks involving proteins with low sequence conservation. This also suggests that methods incorporating sequence or structure embeddings may capture more nuanced relationships beyond sequence similarity.
Using ESM29 embeddings and graph attention mechanisms, we aimed to leverage PPI network contextual information to enrich protein features. However, the sparse interaction data (10,107 proteins, 14,524 interactions, only 1,935 proteins had interactions) introduced noise and did not improve function prediction. Given the sub-optimal performance and sparsity of PPI data, we excluded the DeepGOMeta-PPI model from further evaluation.
Evaluation and comparison on the time-based split
Microbial data often contains many uncharacterized proteins, so we used a time-based split to ensure that our model is robust and effective in prediction the functions of novel proteins. We compared DeepGOMeta predictions on newly annotated proteins with other state-of-the-art methods that predict functions based on protein language model embeddings and transformer-based deep learning models, including TALE16, SPROF-GO5, DeepFRI17, DeepGO-SE18, NetGO 3.019, and TransFun20. We also compared these methods with the Naive classifier and InterPro.
We found that DeepGOMeta outperformed all the compared methods in the BPO and CCO evaluations in terms of \(S_{\min }\), and CCO in terms of \(F_{\max }\). However, it resulted in lower performance than NetGO 3.0 and InterPro in terms of \(F_{\max }\) in MFO and lower performance than NetGO 3.0 in terms of AUC in MFO and BPO. DeepFRI outperformed all methods in terms of \(S_{\min }\) in the MFO evaluation. DeepGOMeta outperformed DeepGO-SE in BPO and CCO evaluations, but DeepGO-SE had better performance in terms of \(S_{\min }\) and AUC in MFO evaluation (Table 4).
Due to the implementation of NetGO 3.0 on the webserver, and the unavailability of the training dataset of TransFun, it is important to note that we cannot exclude the possibility that these methods were trained on the proteins used for testing in the time-based split.
Applications on amplicon and metagenome data
We developed two workflows employing DeepGOMeta for functional characterization of microbial samples from 16S amplicon and WGS reads. Our methodology included generating functional profiles from reference genomes based on OTUs for 16S reads and predicting functions from de novo metagenome assemblies for WGS reads, as illustrated in Figure 1. We applied these workflows to paired 16S and WGS datasets from identical samples, and employed our evaluation strategy to assess our method’s efficacy in capturing functionally relevant information. Due to the absence of ground-truth data for microbial functions, we assume that protein functions found in microbial communities are more similar when the microbial communities are from the same environment or share identical phenotypes. This approach allowed us to explore the primary drivers of community composition, focusing on the application of DeepGOMeta for gaining biological insights.
We used DeepGOMeta to construct functional profiles for each sample using reads from both sequencing strategies and compared against taxonomy-based clustering (Table 5). For each dataset, based on DeepGOMeta results, we constructed a binary representation of functions which indicates presence or absence of a function. For 16S data, we also constructed an abundance-weighted matrix, in which each function is assigned a weight (eq. 1).
In certain contexts, DeepGOMeta demonstrated superior performance over OTU-based clustering. Specifically, in 5 out of the 9 phenotypes we analyzed, employing 16S functions (abundance-weighted) proved to be either on par with or more effective than clustering by OTUs. This suggests that DeepGOMeta’s functional profiles can be effective in capturing specific functional attributes that are unique to each phenotype. In some datasets, such as Mammalian Stool and Cameroon (Region, Ethnicity), the functional attributes were more defining than taxonomic composition, suggesting that these community compositions are driven by functions (in contrast to taxa), in line with previous findings of the relationship between host phylogeny and gut microbiome functional differences31.
Conversely, in 3 out of 9 phenotypes studies, OTU-based clustering proved more effective. Specifically, in two datasets (Blueberry, India), the location phenotype was better explained by OTU composition than by functions, consistent with previous findings32. Interestingly, we found that using 16S functions in a binary format never outperformed the abundance-weighted approach, suggesting its limited efficacy. In the case of WGS functions, this method only took the lead in 1 out of 9 phenotypes, possibly indicating the necessity of weighing functions.
We also compared OTU-based clustering and DeepGOMeta-derived functional profiles with pathway predictions from tools commonly used to annotate microbial data. We generated pathway predictions from WGS data using HUMAnN329and from amplicon data using PICRUSt227. When comparing WGS functions predicted by DeepGOMeta to those of HUMAnN3, we found that the two methods performed equally well in separating the phenotypes in 3 cases, but DeepGOMeta showed superior performance in 4 cases. Notably, HUMAnN3 failed to produce sufficient pathway information for clustering in analyzing the Blueberry dataset.
When comparing 16S functions (abundance-weighted) predicted by DeepGOMeta to those of PICRUSt2, we found that DeepGOMeta better separated the phenotypes in 7 out of 9 cases. While the experiment falls short of comparing the performance of the two function prediction methods, compared to DeepGOMeta, PICRUSt2’s pathway information would be considered limited, as it constitutes only a subset of the predictable functions by DeepGOMeta in the form of BPO predictions. Overall, these results indicated either a lack of strong associations between pathways and phenotypes or limitations of the algorithms/databases used by PICRUSt2 and HUMANn3.
Discussion
In this study we introduced DeepGOMeta, a retrained version of DeepGO-SE, to overcome the limitations of current methods in their lack of representative training sets and the lack of applications on microbial data. We trained, tested, and evaluated three different models on UniProtKB/Swiss-Prot Knowledgebase proteins that belong to microbes (prokaryotes, archaea, viruses) prevalent in microbial datasets. DeepGOMeta provides function predictions in the form of GO terms, as each of the three models was trained on a distinct GO sub-ontology. DeepGOMeta demonstrates an improvement over similarity-based benchmark methods in most evaluation metrics across the BPO and CCO sub-ontologies, but was outperformed by InterPro in terms of \(F_{\max }\) and \(S_{\min }\) in MFO. In the comparison using a time-based split, DeepGOMeta outperformed all the compared methods in the BPO and CCO evaluations in terms of \(S_{\min }\), and CCO in terms of \(F_{\max }\). However, it was outperformed by NetGO 3.0 and InterPro in terms of \(F_{\max }\) in MFO, and NetGO 3.0 in terms of AUC in MFO and BPO. We could not verify that the test protein set in the time-based split was not used for training NetGO 3.0 and TransFun.
We demonstrated an application of DeepGOMeta in annotating both amplicon and metagenomic data in diverse datasets using a workflow that we have developed for this purpose. We constructed functional profiles for each sample based on 16S amplicon and WGS data, and compared the clustering of phenotypes against clustering based on taxonomic classification, allowing us to explore the primary drivers of community composition. For some phenotypes, we found that generating functional profiles using 16S amplicon data with DeepGOMeta (abundance-weighted) yields a higher clustering purity than clustering by OTUs, demonstrating cases where phenotypic differences can be attributed to functional variability rather than taxonomic composition. We also observed variability in performance across the different datasets and phenotypes, highlighting that microbial community composition could be driven by functions and/or taxonomy. We also found that DeepGOMeta predictions better separate phenotypes when compared with pathway predictions from PICRUSt2 and HUMANn3.
Looking forward, we propose several avenues for further enhancing the method’s utility. We plan to expand the training data to incorporate eukaryotic microbial genomes for WGS analysis, which will enable a more comprehensive understanding of metagenomic samples in ecosystems where eukaryotes play significant roles. Furthermore, as our observations reveal the sparsity of PPIs in bacteria, we intend to incorporate methods for interaction predictions. Integrating such methods as features within DeepGOMeta could substantially enhance its predictive accuracy and, consequently, our understanding of microbial interactions.
Additionally, we plan to expand our bioinformatics workflow in two ways. First, we aim to explore several ways through which we can assign weights to functions assigned to proteins from WGS data. Second, we would use the predicted functions and interactions to elucidate pathways both within and between organisms. This shift towards unraveling more complex biological processes will facilitate a deeper understanding of the intricate interactions and dependencies within microbial communities.
Data availibility
Data and code are available at https://github.com/bio-ontology-research-group/deepgometa
References
Mirabello, C. & Wallner, B. rawmsa: End-to-end deep learning using raw multiple sequence alignments. PLoS ONE 14, https://doi.org/10.1371/journal.pone.0220182 (2019).
Mahmud, M. et al. Deep learning in mining biological data. Cognitive Computation 13(1–33), 5. https://doi.org/10.1007/s12559-020-09773-x (2021).
Radivojac, P. et al. A large-scale evaluation of computational protein function prediction. Nat Meth 10, 221–227. https://doi.org/10.1038/nmeth.2340 (2013).
Sanderson, T., Bileschi, M. L. et al. Proteinfer, deep neural networks for protein functional inference. eLife 12, e80942, https://doi.org/10.7554/eLife.80942 (2023).
Yuan, Q., Xie, J. et al. Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion. Briefings in Bioinformatics 24, bbad117, https://doi.org/10.1093/bib/bbad117 (2023). https://academic.oup.com/bib/article-pdf/24/3/bbad117/50410866/bbad117.pdf.
Vecherskii, M. et al. Metagenomics: A new direction in ecology. Biology Bulletin Reviews 48, S107–S117. https://doi.org/10.1134/S1062359022010150 (2021).
Barot, M. et al. NetQuilt: deep multispecies network-based protein function prediction using homology-informed network similarity. Bioinformatics 37, 2414–2422. https://doi.org/10.1093/bioinformatics/btab098 (2021) https://academic.oup.com/bioinformatics/article-pdf/37/16/2414/50339314/btab098.pdf..
Song, F. V., Su, J. et al. DeepSS2GO: protein function prediction from secondary structure. Briefings in Bioinformatics 25, bbae196, https://doi.org/10.1093/bib/bbae196 (2024). https://academic.oup.com/bib/article-pdf/25/3/bbae196/57390436/bbae196.pdf.
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130. https://doi.org/10.1126/science.ade2574 (2023) https://www.science.org/doi/pdf/10.1126/science.ade2574..
Consortium, T. U. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research 51, D523–D531. https://doi.org/10.1093/nar/gkac1052 (2022). https://academic.oup.com/nar/article-pdf/51/D1/D523/48441158/gkac1052.pdf.
Buchfink, B. et al. Diamond: a fast and sensitive alignment tool for shotgun metagenomic data. Genome research 25, 1755–1761 (2015).
Consortium, T. G. O. The gene ontology resource: 20 years and still going strong. Nucleic Acids Research 47, D330–D338 (2019).
Szklarczyk, D. et al. The string database in 2019: quality-controlled protein-protein association networks, made broadly accessible. Nucleic acids research 47, D607–D613 (2019).
Meyer, F. et al. The metagenomics rast server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9, 386 (2008).
Morton, E. et al. Variation in rural african gut microbiota is strongly correlated with colonization by entamoeba and subsistence. PLoS Genet 11, e1005658. https://doi.org/10.1371/journal.pgen.1005658 (2015).
Cao, Y. & Shen, Y. TALE: Transformer-based protein function Annotation with joint sequence-Label Embedding. Bioinformatics 37, 2825–2833. https://doi.org/10.1093/bioinformatics/btab198 (2021) https://academic.oup.com/bioinformatics/article-pdf/37/18/2825/40471543/btab198.pdf..
Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nature Communications 12, 3168. https://doi.org/10.1038/s41467-021-23303-9 (2021).
Kulmanov, M., Guzmán-Vega, F. J. et al. Deepgo-se: Protein function prediction as approximate semantic entailment. bioRxiv[SPACE]https://doi.org/10.1101/2023.09.26.559473 (2023). https://www.biorxiv.org/content/early/2023/09/28/2023.09.26.559473.full.pdf.
Protein language model improves large-scale functional annotations. Wang, S., You, R., Liu, Y., Xiong, Y. & Zhu, S. Netgo 3.0. Genomics, Proteomics & Bioinformatics 21, 349–358. https://doi.org/10.1016/j.gpb.2023.04.001 (2023).
Boadu, F. et al. Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function. Bioinformatics 39, i318–i325. https://doi.org/10.1093/bioinformatics/btad208 (2023) https://academic.oup.com/bioinformatics/article-pdf/39/Supplement_1/i318/50741490/btad208_supplementary_data.pdf..
Clark, K. et al. Genbank. Nucleic Acids Research 44, D67–D72. https://doi.org/10.1093/nar/gkv1276 (2016).
O’Leary, N. et al. Reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research 44, D733–D745 (2016).
Chen, S. et al. fastp: an ultra-fast all-in-one fastq preprocessor. Bioinformatics 34, i884–i890 (2018).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with bowtie 2. Nature Methods 9, 357–359 (2012).
Li, D. et al. Megahit: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics 31, 1674–1676 (2015).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC bioinformatics 11, 119 (2010).
Douglas, G. et al. Picrust2 for prediction of metagenome functions. Nature Biotechnology 38, 685–688. https://doi.org/10.1038/s41587-020-0548-6 (2020).
Caspi, R. et al. The MetaCyc database of metabolic pathways and enzymes - a 2019 update. Nucleic Acids Research 48, D445–D453. https://doi.org/10.1093/nar/gkz862 (2019)https://academic.oup.com/nar/article-pdf/48/D1/D445/31697668/gkz862.pdf.
Beghini, F., McIver, L. J. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with biobakery 3. elife 10, e65088 (2021).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using diamond. Nature Methods 12, 59 EP – (2014). [PubMed:http://www.ncbi.nlm.nih.gov/pubmed/25402007] [doi:10.1038/nmeth.3176].
Milani, C. et al. Multi-omics approaches to decipher the impact of diet and host physiology on the mammalian gut microbiome. Applied and Environmental Microbiologye 86, e01864-20 (2020).
Moeller, A. H. et al. Dispersal limitation promotes the diversification of the mammalian gut microbiota. Proceedings of the National Academy of Sciences 114, 13768–13773. https://doi.org/10.1073/pnas.1700122114 (2017). Edited by James J. Bull, The University of Texas at Austin, Austin, TX, and approved October 16, 2017 (received for review January 3, 2017).
Acknowledgements
This work has been supported by funding from King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. URF/1/4355-01-01, URF/1/4675-01-01, URF/1/4697-01-01, URF/1/5041-01-01, REI/1/5334-01-01, FCC/1/1976-46-01, FCC/1/1976-34-01, REI/1/5235-01-01, and REI/1/4938-01-01. This work was supported by funding from King Abdullah University of Science and Technology (KAUST) - KAUST Center of Excellence for Smart Health (KCSH) under award number 5932; and by funding from King Abdullah University of Science and Technology (KAUST) - Center of Excellence for Generative AI under award number 5940. This work was supported by the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI). We acknowledge support from the KAUST Supercomputing Laboratory.
Author information
Authors and Affiliations
Contributions
R.H., M.K., K.M. and R.T. conceptualized the experiments. R.T., K.N., and M.K. conducted the experiments and analyzed the results. R.H. and M.K. acquired funding. R.H. and M.K. supervised the work. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
No competing interest declared.
Accession codes
Public data utilized in this study was downloaded from the European Nucleotide Archive with the accession codes PRJNA397112, PRJEB27005, PRJNA484230, PRJNA389786, SRP115632, and SRP115643, and MG-RAST with the accession code mgp15238.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Tawfiq, R., Niu, K., Hoehndorf, R. et al. DeepGOMeta for functional insights into microbial communities using deep learning-based protein function prediction. Sci Rep 14, 31813 (2024). https://doi.org/10.1038/s41598-024-82956-w
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-024-82956-w
Keywords
This article is cited by
-
Protein function prediction using GO similarity-based heterogeneous network propagation
Scientific Reports (2025)



