Peptide abundance correlations in metaproteomics enhance taxonomic and functional analysis of the human gut microbiome

Sun, Zhongzhi; Ning, Zhibin; Wu, Qing; Li, Leyuan; Doxey, Andrew C.; Figeys, Daniel

doi:10.1038/s41522-025-00801-y

Article
Open access
Published: 19 August 2025

npj Biofilms and Microbiomes volume 11, Article number: 166 (2025)

Abstract

Mass spectrometry (MS)-based proteomics is widely used for quantitative protein profiling and protein interaction studies. However, most current research focuses on single-species proteomics, while protein interactions within complex microbiomes, composed of hundreds of bacterial species, remain largely unexplored. In this study, we analyzed peptide abundance correlations within a metaproteomics dataset derived from in vitro cultured human gut microbiomes subjected to various drug treatments. Our analysis revealed that peptides from the same protein or taxon exhibited correlated abundance changes. By using t-SNE for visualization, we generated a peptide correlation map in which peptides from the same taxon formed distinct clusters. Furthermore, peptide abundance correlations enabled genome-level taxonomic assignments for a greater number of peptides. For instance, 1880 (48.9%) of the 3845 peptides initially assigned only to the family Bacteroidaceae could now be assigned to a specific genome. In species representative genome subsets, peptide correlation networks based on taxon-normalized peptide abundance (TNPA) linked functionally related peptides and provided insights into uncharacterized proteins. Altogether, our study demonstrates that analyzing peptide abundance correlations enhances both taxonomic and functional analyses in human gut metaproteomics research.

Introduction

Despite many studies linking the human gut microbiome with human health and disease^1,2,3,4,5, elucidating the mechanisms underlying the impact of the microbiome on host biology remains challenging, largely due to the complexity and the dark matter in gut microbial communities^6,7. Different omics technologies, including metagenomics, metatranscriptomics, metaproteomics, and metabolomics, have become important tools for studying the human gut microbiome⁸. One of these techniques, metaproteomics, measures the presence and abundance of proteins that drive biological functions within microbial communities, providing a direct glimpse of gut microbiome pathways⁹.

Proteins involved in the same biological pathways tend to be co-regulated¹⁰. These functionally related proteins typically have similar expression profiles, leading to positive abundance correlations. Conversely, proteins involved in mutual inhibition, feedback regulation, or competition for binding sites tend to display negative abundance correlations^11,12. Given accurate quantification using mass spectrometry, abundance correlations between proteins could reveal functional linkages, even for proteins that do not physically interact or colocalize, thus enabling the prediction of unknown protein functions¹³. Protein abundance correlations have been applied to predict the functions of uncharacterized proteins in various model organisms, such as Escherichia coli¹⁴, Saccharomyces cerevisiae¹⁵, and Homo sapiens¹³. In addition, a study of proteome variation across thousands of bacterial species constructed functional and physical protein interaction networks, revealing the emergence of complex bacterial phenotypes¹⁶. However, these studies have been largely limited to single-species proteomics. In complex microbial communities, like the gut microbiome, protein abundance correlations remain unexplored. A previous study applied gene abundance correlation in metagenomics to predict microbial functional organization¹⁷, but since proteins directly execute biological processes, studying protein abundance correlations is expected to more faithfully reflect functional dynamics in microbial communities. Given that over 40% of proteins from the human gut microbiome remain functionally uncharacterized¹⁸, and that some of these unknown proteins have been implicated in disease development^19,20, understanding their functions is critical for advancing our knowledge of host-microbiome interactions.

Studying protein abundance correlations is more challenging in metaproteomics than in single-species proteomics because of the difficulties in protein inference²¹. In complex microbial communities, peptides can be shared among homologous proteins from different species, making it hard to assign them to their actual proteins of origin. Our previous studies showed that a large number of identified peptides were shared by multiple species, complicating taxonomic and protein source assignments²². Moreover, even after protein inference, it remains debatable whether it is reasonable to aggregate peptide quantities into a single protein quantity, as peptides from the same protein-coding sequence could exhibit different quantitative responses²³. Recently, peptide-centric metaproteomics analysis, which directly links peptides to their taxonomic and functional annotations, has emerged as an alternative approach²⁴. Since mass spectrometry directly measures peptides rather than proteins, peptide-centric analysis is inherently reasonable and has been shown to be more sensitive and to uncover features that are masked at the protein level²⁵.

Given the challenges in protein inference and the inherent advantages of peptide-centric analysis, we focus on peptide abundance correlations in this study. By leveraging peptide-centric analysis, we aim to overcome the biases associated with protein inference and provide a more accurate representation of functional and taxonomic relationships in microbial communities.

In this study, we analyzed peptide abundance correlations in a metaproteomics dataset derived from in vitro cultured human gut microbiomes subjected to over 100 different drug treatments, also referred to as perturbations. These drugs cover 12 Anatomical Therapeutic Chemical (ATC) level-1 classes and were previously shown to potentially impact the human gut microbiome in an initial metaproteomics screening²⁶. We explored the biological foundations of these correlations and visualized them using a peptide correlation map, which was generated by calculating pairwise correlation coefficients of peptide abundance profiles and embedding the resulting matrix into a low-dimensional space using the t-SNE algorithm. Additionally, we applied peptide abundance correlations to improve the taxonomic assignment of peptides. Focusing on single-species representative genome subsets, we calculated taxon-based normalized peptide abundance (TNPA) and constructed peptide abundance correlation networks, where peptides are represented as nodes and edges connect peptide pairs with high correlation in their abundance profiles. These networks revealed functional linkages, providing new insights into the roles of previously uncharacterized microbial proteins.

Results

Peptide abundance correlations in the metaproteomics dataset

Figure 1 summarizes the overall experimental design and research framework. Briefly, six individual microbiomes were each exposed to 107 drugs and five controls using the RapidAIM assay²⁷, a 96-well plate assay that maintains the composition and function of the microbiome. The in vitro model maintained microbial taxon-function stability well, with pre-post culture correlations of taxon-function-coupled profiles reaching an average of r = 0.83 ± 0.03, and it also preserved the ability to show microbiome responses to drugs similar to those of the in vivo microbiome²⁸. The metaproteomes were extracted from each well and analyzed by metaproteomics. The 107 selected drugs were shown to have potential impacts on the human gut microbiome in the previous study²⁶. These drugs span a wide range of ATC level-1 categories, with the largest groups being Nervous System (N, 29 drugs), Alimentary Tract and Metabolism (A, 14 drugs), Anti-infectives for Systemic Use (J, 10 drugs), and Cardiovascular System (C, 10 drugs). Peptide identification and quantification results were acquired as previously described²⁶. After acquiring peptide quantification results, peptides with non-zero values in $\ge 20 \%$ of samples from each individual were selected. Peptide abundance correlations for all peptide pairs from each individual were calculated using Spearman correlation coefficients (SCCs) for log2-transformed abundance fold changes from all 112 samples against the control group. Global peptide abundance correlation maps of all six individuals, created with t-SNE and colored by the family-level taxonomic annotation of peptides, showed a clear clustering of peptides from the same family (Fig. 2A). In addition, these maps revealed substantial inter-individual variation in the taxonomic composition of microbiomes, which was further supported by peptide-based taxonomic profiling for each individual (Supplementary Fig. 1).

Among all six individuals, individual V52 had the largest number of quantified peptides (Fig. 2B) and was selected as an example in the main text. For the 21,363 peptides with non-zero values in $\ge 20 \%$ of samples from individual V52, SCC calculation yielded a total of 228,178,203 peptide pairs with an average SCC of 0.13 ± 0.35 (Fig. 3A). The 20% threshold for peptides with non-zero values across samples was chosen to strike a balance between retaining a sufficient number of peptides and ensuring enough non-zero values to make the calculation of SCC statistically meaningful (Supplementary Fig. 2).
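The filtering and correlation steps described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, the fold change is taken against a single control mean, and the pseudocount that keeps zero abundances finite is an assumption that the text does not specify.

```python
import math

def nonzero_fraction(values):
    """Fraction of samples in which a peptide has a non-zero abundance."""
    return sum(1 for v in values if v > 0) / len(values)

def passes_filter(values, min_fraction=0.2):
    """Keep peptides with non-zero values in at least 20% of samples."""
    return nonzero_fraction(values) >= min_fraction

def log2_fold_changes(values, control_mean, pseudocount=1.0):
    """log2 fold change of each sample against the control mean.
    The pseudocount is an assumption made here to avoid log2(0)."""
    return [math.log2((v + pseudocount) / (control_mean + pseudocount))
            for v in values]

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In practice an optimized library routine would be preferable at the scale reported here (more than 228 million pairs); the pure-Python version only makes the individual steps explicit.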

**Fig. 1: Experimental design and research framework.**

**Fig. 2: Peptide identification and abundance correlation across individuals.**

**Fig. 3: Peptide abundance correlations in the metaproteomics dataset of individual V52.**

Upon assigning both protein sources and taxonomic sources to the peptides, we observed that peptides from the same protein and peptides from the same taxon exhibited higher abundance correlations. First, peptide pairs derived from the same protein (7407 pairs) exhibited higher SCCs (0.63 ± 0.22) compared to peptide pairs from different proteins (8,781,121 pairs; SCC = 0.18 ± 0.32, with p $\le$ 0.0001 by Mann–Whitney U-test and a large effect size with Vargha and Delaney’s A of 0.88, Fig. 3B). To illustrate this more intuitively, we selected peptides from the three proteins with the highest total peptide abundances to visualize peptide abundance profiles across samples (Supplementary Fig. 3). Principal coordinates analysis (PCoA) revealed that peptides derived from the same protein tend to cluster together based on their abundance profiles across samples (Supplementary Fig. 3). Consistent with this observation, PERMANOVA analysis based on Euclidean distances showed that peptide abundance profiles for these three proteins were significantly associated with their protein source (R² = 0.240, F = 10.76, p = 0.001), indicating that peptides derived from the same protein tend to show more similar abundance changes across samples. In addition, peptide pairs derived from the same genome (457,957 pairs) had an average SCC of 0.60 ± 0.22, while those from different genomes (8,330,571 pairs) averaged an SCC of 0.16 ± 0.31 (Fig. 3C, p $\le$ 0.0001 by Mann–Whitney U-test and a large effect size with Vargha and Delaney’s A of 0.88). It is worth noting that even after excluding peptide pairs from the same protein, peptides from different proteins within the same genome (450,550 pairs; SCC = 0.60 ± 0.22) still had higher SCCs than those from different genomes (8,330,571 pairs; SCC = 0.16 ± 0.31, p $\le$ 0.0001 by Mann–Whitney U-test, Vargha and Delaney’s A of 0.88), indicating that originating from the same genome contributes significantly to higher SCCs of peptide abundance changes (Fig. 3D).
We also assigned functional annotations to the peptides. However, the difference in SCCs between peptides from the same functional category and those from different functional categories was relatively minor at both the high-level COG category level (Fig. 3E) and the refined COG family level (Fig. 3F), with negligible effect sizes (Vargha and Delaney’s A of 0.51 and 0.56, respectively).
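For readers unfamiliar with the effect size quoted above, Vargha and Delaney's A is the probability that a value drawn from one group exceeds a value drawn from the other, with ties counted as one half. A value of 0.5 therefore means no separation between the groups. The sketch below uses a brute-force pairwise comparison with a hypothetical function name. It is meant for small illustrative inputs, not for the millions of pairs analyzed in the study.

```python
def vargha_delaney_a(xs, ys):
    """Probability that a random value from xs exceeds a random value
    from ys, counting ties as one half. 0.5 means no difference."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))

# Toy SCC values (made up for illustration, not taken from the dataset).
same_group = [0.6, 0.7, 0.8]
different_group = [0.1, 0.2, 0.7]
effect = vargha_delaney_a(same_group, different_group)
```

With these toy values, seven of the nine comparisons favor the first group and one is a tie, so `effect` equals 7.5/9.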

The higher SCCs of peptides from the same protein and of peptides from the same taxon collectively suggest that studying peptide abundance correlations in metaproteomics datasets is biologically meaningful and can provide valuable insights for metaproteomics analysis.

A global peptide abundance correlation map

The SCC matrix of the 21,363 selected peptides from individual V52 records how strongly or weakly each peptide is correlated with all other peptides. Although this matrix could in principle be represented as a peptide interaction network with edges indicating strong correlations, even selecting only the peptide pairs with the top 1% of SCCs resulted in an extremely dense network (2,281,783 edges among 12,549 peptides), which was not informative (Supplementary Fig. 4). To address this, we visualized all 228,178,203 peptide-peptide correlations among the 21,363 peptides as a global peptide correlation map, using the t-SNE algorithm to embed the peptides in a low-dimensional space (Fig. 4A). In this map, the distance between peptides reflects the similarity in their abundance changes under various drug treatments. Notably, the global map was generated from all pairwise correlations of peptide abundance changes under different perturbations in the metaproteomics dataset of each individual, without any filtering on the SCC.
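The pair and edge counts quoted above follow from simple combinatorics, which the short check below reproduces. Rounding the top 1% up to the next whole pair is an assumption made here; it is consistent with the reported edge count but is not stated in the text. The function names are hypothetical.

```python
def pair_count(n_peptides):
    """Number of unordered peptide pairs: n * (n - 1) / 2."""
    return n_peptides * (n_peptides - 1) // 2

def top_percent_count(n_pairs, percent):
    """Number of pairs in the top `percent`, rounded up (an assumption)."""
    return (n_pairs * percent + 99) // 100

pairs_v52 = pair_count(21363)                   # 228,178,203 pairs
edges_top1 = top_percent_count(pairs_v52, 1)    # 2,281,783 edges
```

The quadratic growth in pairs is the reason a hard threshold still leaves millions of edges, and why an embedding of the full matrix is easier to read than a network drawing.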

**Fig. 4: The global peptide correlation map of the individual V52 generated using t-SNE.**

The peptide correlation map revealed a strong correlation between peptide abundance changes and their taxonomic sources. Peptides from the same taxon tend to cluster together in the correlation map across various taxonomic levels (Fig. 4A and Supplementary Fig. 5). Specifically, at the family level, distinct clusters emerged, particularly for families such as Burkholderiaceae, Eggerthellaceae, and Enterobacteriaceae. Zooming into these families, most peptides with genome-level annotations were either assigned to a specific genome or grouped into sub-clusters corresponding to different genomes (Fig. 4B, C). For other families, such as Bacteroidaceae and Lachnospiraceae, the clustering of peptides was less obvious. However, peptides were still grouped at the genome level in the zoomed-in maps (Fig. 4D, E). This clustering trend was also observed in the map colored using peptide taxonomic annotations obtained from Unipept (Supplementary Fig. 6). However, fewer peptides received family-level annotations (10,532 peptides from Unipept versus 15,720 peptides from our own annotation pipeline), and a few differences were observed in taxonomic annotations, likely due to differences in taxonomic naming systems. It is worth noting that all analyses so far used every calculated SCC without filtering on its p-value. When only SCCs with Bonferroni-adjusted p-values < 0.01 were kept, the global correlation map showed a similar trend of clear taxonomic clustering (Supplementary Fig. 6).

Although there is a clear clustering of peptides from the same taxon, the global peptide abundance correlation map showed less obvious clustering patterns based on protein function (Fig. 4F and Supplementary Fig. 7). However, when zooming into a single species, more distinct clusters emerged, representing peptides from proteins with similar or related functions (Supplementary Fig. 8). This suggests that studying protein functional linkage through peptide abundance correlations may be more effective at the single-species level, as inter-species functional correlations are not easily discernible in the global map.

Overall, this global peptide correlation map captures the general trend of abundance correlations among peptides and indicates that changes in taxa abundance are the primary driver of peptide abundance correlations across different perturbations.

Peptide abundance correlations provide additional information on assigning the peptide taxonomic source

Our results have shown that peptides from the same family clustered together in the global peptide correlation map (Fig. 4A). However, within each family-level cluster, only a small proportion of peptides were annotated to specific genomes, while a larger proportion were only annotated at the family level, leaving their genome-level sources unclear (Fig. 4B–E and Supplementary Fig. 5). Also, as mentioned above, peptides from the same genome exhibit high SCCs in their abundance changes across different drug treatments (Fig. 3C). Taking the family Bacteroidaceae as an example, genome-distinct peptides from the same genome were not grouped into a single cluster in the heatmap of the peptide SCC matrix. However, these peptides could still be grouped into several visually discernible modules, defined as regions in the SCC heatmap where within-module correlations are clearly higher than between-module correlations (Fig. 5A). Similar genome cluster modules were also observed in other families, such as Eggerthellaceae, Lachnospiraceae, and Burkholderiaceae (Supplementary Figs. 9–11). This observation suggests that the SCC matrix can serve as a valuable input for machine learning models to assign peptide taxonomic sources.

**Fig. 5: Applying peptide abundance correlations for peptide taxonomic assignments using the Bacteroidaceae family as an example.**

Using the trained Random Forest model for the family Bacteroidaceae, we found that most genome-distinct peptides from Genome A (MGYG000002281, 71 out of 78, 91.0%) and Genome B (MGYG000000243, 61 out of 66, 92.4%) in the test set were correctly assigned to their respective genome sources with probabilities exceeding 90% (Fig. 5B). In contrast, under the same 90% probability threshold, the model left most peptides from other genomes within the same family (166 out of 230, 72.2%) and most peptides from other families (438 out of 500, 87.6%) unassigned to either Genome A or Genome B (Fig. 5C and Supplementary Fig. 12). The confusion matrices generated from the two test sets demonstrated that the trained model achieved high sensitivity (Fig. 5D and Supplementary Fig. 12). A 5-fold cross-validation demonstrated the strong and consistent ability of our model to assign genome-specific peptides to their respective genomes while effectively rejecting peptides from other sources. Specifically, the model achieved high classification accuracy, precision, recall, and F1-score for peptides from Genome A and Genome B (Supplementary Fig. 13), and exhibited stable False Discovery Rates (FDR) and Rejection Rates (RR) for peptides from both other families and from the same family but different genomes (Supplementary Fig. 13). For peptides without genome-level annotations from the same family, the model classified a large proportion of these peptides (48.9%) into either Genome A or Genome B (Fig. 5E). This classification was further supported by the SCC heatmap, where unannotated peptides clustered closely with genome-distinct peptides from a specific genome (Fig. 5F).
In addition, models trained using the same method also effectively predicted the genome sources of peptides lacking genome-level annotations in other families, such as Eggerthellaceae, Lachnospiraceae, and Burkholderiaceae (Supplementary Figs. 9–11).
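The 90% probability threshold described above acts as a reject option: a peptide is assigned to a genome only when the classifier is confident, and is otherwise left unassigned. The sketch below shows that decision rule on its own. It assumes the class probabilities have already been produced by some classifier; the function names and the dictionary layout are hypothetical and do not reproduce the authors' implementation.

```python
def assign_genome(probabilities, threshold=0.9):
    """Return the most probable genome if its probability exceeds the
    threshold; otherwise return None (peptide left unassigned)."""
    genome, prob = max(probabilities.items(), key=lambda item: item[1])
    return genome if prob > threshold else None

def rejection_rate(assignments):
    """Fraction of peptides that were left unassigned."""
    return sum(1 for a in assignments if a is None) / len(assignments)

# Toy class probabilities (illustrative values only).
toy = [
    {"Genome A": 0.95, "Genome B": 0.05},
    {"Genome A": 0.60, "Genome B": 0.40},
    {"Genome A": 0.03, "Genome B": 0.97},
    {"Genome A": 0.55, "Genome B": 0.45},
]
assignments = [assign_genome(p) for p in toy]
```

Here the first and third peptides are assigned and the other two are rejected, giving a rejection rate of 0.5. Applied to the counts reported above, 166 of 230 rejected peptides corresponds to 72.2%.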

Functionally related peptides were connected in peptide abundance correlation networks for species representative genomes

Our results revealed limited abundance correlations between peptides from proteins with the same or related functions (Fig. 3E, F, and Supplementary Fig. 7). Additionally, we found that taxa-abundance changes are the major driver of peptide abundance changes across different perturbations or drug treatments. Consequently, peptides from proteins with the same or related functions were less likely to be correlated across different taxa compared to peptides originating from the same taxon. Therefore, it is more appropriate to analyze peptide functional linkages at the single-species level by focusing on single-species representative genome subsets of the metaproteomics dataset.

The strong effect of taxonomic abundance changes could cause peptides from the same taxon to show high correlations even if they are involved in different functions (Supplementary Fig. 14). To minimize this effect and better study the correlations between functionally related peptides, we calculated taxon-based normalized peptide abundance (TNPA) from original peptide abundance (OPA) for the top 10 species/genomes with the highest number of identified peptides, as described in the Methods section. In most studied species, TNPA was only moderately or weakly positively correlated with OPA (Supplementary Fig. 15), and some peptide pairs with high SCCs for OPA log2-FC exhibited low SCCs for TNPA log2-FC (Fig. 6A and Supplementary Fig. 16). This suggests that many peptide abundance correlations driven by taxa-abundance changes did not persist at the TNPA level, as shown in Supplementary Fig. 14.
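The exact TNPA formula is given in the Methods section, which is not reproduced here. As a rough illustration of the idea only, the sketch below assumes that each peptide's abundance is divided by the summed abundance of all peptides assigned to the same taxon in the same sample. That definition, the function name, and the toy values are all assumptions and may differ from the published method.

```python
def taxon_normalized(abundance_by_peptide):
    """abundance_by_peptide maps peptide -> list of abundances per sample,
    for peptides of ONE taxon. Returns peptide -> abundance divided by the
    taxon total in each sample (assumed definition, see lead-in)."""
    n_samples = len(next(iter(abundance_by_peptide.values())))
    totals = [sum(values[s] for values in abundance_by_peptide.values())
              for s in range(n_samples)]
    return {
        peptide: [values[s] / totals[s] if totals[s] > 0 else 0.0
                  for s in range(n_samples)]
        for peptide, values in abundance_by_peptide.items()
    }

# Two toy peptides whose abundances simply double along with the taxon.
opa = {"pep1": [10.0, 20.0, 40.0], "pep2": [30.0, 60.0, 120.0]}
tnpa = taxon_normalized(opa)
```

In this toy case both peptides rise together in OPA, yet their normalized values stay flat at 0.25 and 0.75. This is the kind of taxon-driven co-variation that normalization is meant to remove.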

**Fig. 6: Functionally related peptides were connected in peptide abundance correlation networks for species representative genomes constructed with SCCs of TNPA log2-FC of individual V52.**

To reveal functional linkages, peptide abundance correlation networks were constructed for each of the top 10 species representative genomes with the most identified peptides, using the peptide pairs with the top 5% of SCCs. Two sets of networks were constructed using SCCs derived from OPA and TNPA, respectively (Supplementary Figs. 17 and 18). Networks constructed with TNPA had higher modularity (Fig. 6B) and lower clustering coefficients (Fig. 6C), indicating that these networks had dense connections within modules but sparse connections between different modules. The lower clustering coefficients also suggested that TNPA-based networks were less tightly connected, as shown in Fig. 6D. With the top 5% cutoff, the SCC threshold derived from TNPA averaged 0.60 across the species representative genomes, and a moderate number of peptides remained in the network constructed for each genome (Supplementary Fig. 19). Most of the pairs selected to construct the networks (91.6 ± 14.4% across the 10 selected genomes) also had a significant Bonferroni-adjusted p-value (<0.01) based on their original abundances.
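A minimal sketch of the network construction and of one of the two reported network metrics follows. It keeps the top 5% of pairs by SCC as edges and computes the average local clustering coefficient. Modularity is not computed here because it additionally requires a community detection step. The function names, the rounding rule for the cutoff, and the convention that nodes with fewer than two neighbors contribute zero are assumptions.

```python
from itertools import combinations

def top_percent_edges(scc_by_pair, percent=5):
    """Keep the pairs with the highest SCCs (top `percent`, at least one)."""
    ranked = sorted(scc_by_pair.items(), key=lambda item: item[1], reverse=True)
    keep = max(1, len(ranked) * percent // 100)
    return [pair for pair, _ in ranked[:keep]]

def adjacency(edges):
    """Undirected adjacency sets built from a list of (a, b) edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def average_clustering(adj):
    """Mean local clustering coefficient over all nodes; nodes with fewer
    than two neighbors contribute zero."""
    if not adj:
        return 0.0
    total = 0.0
    for neighbors in adj.values():
        k = len(neighbors)
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)
```

A triangle of three mutually connected peptides gives an average clustering coefficient of 1.0, while a star (one hub connected to otherwise unconnected peptides) gives 0.0.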

Peptide abundance correlation networks for species representative genomes constructed with TNPA revealed functional linkages. Peptides from the same protein and those from proteins potentially located near each other in the genomes (with $\le$10 gene ID differences in the UHGG) were considered functionally related. Peptide pairs in the network showed a higher percentage of these types of relationships compared to all peptide pairs (Fig. 6E, F). It is worth noting that gene ID differences do not always perfectly reflect physical genome distance; however, they generally indicate proximity when genes are located on the same contig. Here, we provide examples of connections between peptides from functionally related proteins. In the peptide correlation network of genome MGYG000004769 (Phascolarctobacterium faecium), peptides from MGYG000004769_01826 (Methylmalonyl-CoA mutase large subunit) and MGYG000004769_01827 (Succinyl-CoA:coenzyme A transferase) were densely connected (Fig. 6G). Genes encoding these two proteins are located near each other in the genome, with adjacent gene ID numbers and only a 218 bp intergenic region on the same contig. These proteins also have closely related functions in the TCA cycle. Methylmalonyl-CoA mutase catalyzes the conversion of methylmalonyl-CoA to succinyl-CoA, while Succinyl-CoA:coenzyme A transferase then catalyzes the reversible reaction: succinyl-CoA + L-malate $\rightleftharpoons$ succinate + L-malyl-CoA. In the same network, peptides from MGYG000004769_00624 (Molecular chaperone DnaK) and MGYG000004769_02173 (Chaperonin GroEL), which have similar functional annotations, were also connected. In the peptide correlation network of another genome, MGYG000002528 (Anaerostipes hadrus), peptides from proteins MGYG000002528_01044 (NAD-dependent dihydropyrimidine dehydrogenase subunit PreA) and MGYG000002528_01048 (Allantoate amidohydrolase) were densely connected (Fig. 6H).
These proteins are also encoded by genes with adjacent gene numbers, which are separated by a 2717 bp intergenic region. Similar functional linkages were also observed in networks with Unipept-derived peptide functional annotations (Supplementary Fig. 20). For example, in the network of the genome MGYG000004769, peptides annotated with GO:0006083 (acetate metabolic process) were densely connected to peptides annotated with GO:0004494 (methylmalonyl-CoA mutase activity) (Supplementary Fig. 20).
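The gene-ID proximity criterion used above (at most 10 gene ID differences within the same genome) can be expressed as a small check. The sketch assumes identifiers of the form `<genome>_<number>`, as in the examples quoted in the text. The function names are hypothetical, and, as the text cautions, numeric proximity does not verify that two genes sit on the same contig.

```python
def parse_gene_id(gene_id):
    """Split an ID such as 'MGYG000004769_01826' into (genome, number)."""
    genome, number = gene_id.rsplit("_", 1)
    return genome, int(number)

def gene_ids_nearby(id_a, id_b, max_diff=10):
    """True if both genes belong to the same genome and their gene ID
    numbers differ by at most max_diff. Contig membership is NOT checked."""
    genome_a, number_a = parse_gene_id(id_a)
    genome_b, number_b = parse_gene_id(id_b)
    return genome_a == genome_b and abs(number_a - number_b) <= max_diff

mutase_transferase = gene_ids_nearby("MGYG000004769_01826", "MGYG000004769_01827")
prea_amidohydrolase = gene_ids_nearby("MGYG000002528_01044", "MGYG000002528_01048")
dnak_groel = gene_ids_nearby("MGYG000004769_00624", "MGYG000004769_02173")
```

The first two pairs satisfy the criterion, whereas the DnaK and GroEL pair does not. Their connection in the network therefore reflects similar function rather than genomic proximity.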

In summary, peptide abundance correlation networks for species representative genomes constructed using TNPA revealed connections between functionally related peptides. These connections demonstrated the potential for using peptide abundance correlations to predict protein functions.

Applying peptide correlation networks of species representative genomes for predicting unknown protein functions

An average of 5.2% of peptides in the 10 constructed peptide correlation networks for species representative genomes originated from proteins of unknown functions (PUFs). This percentage was comparable to the overall proportion of peptides from unannotated proteins in each species (Fig. 7A). To predict functions for PUFs, we calculated peptide abundance correlations between peptides from PUFs and peptides from annotated proteins. The proportion of connections between peptides from PUFs and peptides from annotated proteins varied across species (Fig. 7B). For genome MGYG000002478 (Phocaeicola dorei), the proportion reached 24%, while the lowest proportion among the other genomes was 2%. Here, we provide examples from the two genomes with the highest percentage of connections between peptides from PUFs and peptides from annotated proteins to show the feasibility of predicting microbial protein functions using peptide abundance correlations.

**Fig. 7: Predicting functions of proteins of unknown functions (PUFs).**

In the network of genome MGYG000002478 (Phocaeicola dorei; Fig. 7C), peptides from the uncharacterized protein MGYG000002478_00658 had the largest number of connections (31) to annotated peptides in the network. Of these, 10 connections were to peptides annotated to COG1629 (Outer-membrane receptor protein, Fe transport), with 7 linked to peptides from MGYG000002478_00657, an iron complex outer-membrane receptor protein (K02014). This indicates that the uncharacterized protein is likely related to transmembrane transport. This prediction was further supported by an InterProScan analysis of the uncharacterized protein, which showed a SusD-like domain, a typical outer-membrane protein feature. In addition, most of the protein was assigned to a non-cytoplasmic domain, also indicating its extracellular location.

In the network of genome MGYG000002281 (Bacteroides faecis; Fig. 7D), peptides from the uncharacterized protein MGYG000002281_00118 had the largest number of connections (52) to annotated peptides in the network. Setting aside the 9 connections to the functionally ambiguous COG0457 (Tetratricopeptide repeat protein), the second-largest group comprised 7 connections to COG3525 (N-acetyl-beta-hexosaminidase), including 3 connections to MGYG000002281_02589 (hyaluronoglucosaminidase, K01197), 2 to MGYG000002281_04635 (hexosaminidase, K12373), and 2 to MGYG000002281_01369 (hexosaminidase, K12373). These connections suggest that the uncharacterized protein is likely involved in carbohydrate metabolism. An InterProScan analysis of the protein revealed a non-cytoplasmic domain, indicating that it is a membrane-bound protein predicted to be outside the membrane, possibly acting as a signal protein for carbohydrate metabolism. In contrast, searching this protein in the AlphaFold Protein Structure Database^29,30 matched UniProt ID A0A1E9BZD4 (100% identity, HSP score of 1197), an uncharacterized protein without a known biological function. These findings highlight the value of peptide correlation networks in providing complementary insights for predicting the potential biological role of PUFs.
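The reasoning in the two examples above amounts to tallying the annotations of the annotated peptides connected to a PUF's peptides and inspecting the most frequent ones. A minimal sketch of that tally follows. The peptide names, the edge list, and the function name are hypothetical toy inputs, and the counts do not reproduce the values reported for either genome.

```python
from collections import Counter

def neighbor_annotation_counts(puf_peptides, edges, annotation):
    """Count the annotations of annotated peptides that are directly
    connected to any peptide of the protein of unknown function (PUF)."""
    counts = Counter()
    for a, b in edges:
        for puf_side, other_side in ((a, b), (b, a)):
            if puf_side in puf_peptides and other_side in annotation:
                counts[annotation[other_side]] += 1
    return counts

# Toy network: u1 and u2 are peptides from one uncharacterized protein.
puf_peptides = {"u1", "u2"}
annotation = {"x1": "COG1629", "x2": "COG1629", "x3": "COG1629", "y1": "COG0457"}
edges = [("u1", "x1"), ("u1", "x2"), ("x3", "u2"), ("u2", "y1"), ("x1", "y1")]
counts = neighbor_annotation_counts(puf_peptides, edges, annotation)
```

In this toy network the most frequent neighbor annotation is COG1629 with three connections. A real analysis would still need the kind of manual checks described above, such as setting aside functionally ambiguous categories and comparing against domain predictions.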

Discussion

We analyzed peptide abundance correlations in a metaproteomics dataset of in vitro cultured human gut microbiomes with different drug treatments/perturbations, demonstrating the feasibility of applying peptide abundance correlations for peptide taxonomic assignments as well as revealing protein functional linkages in subsets for species representative genomes.

Protein abundance correlations have been applied to study protein functional linkage and to predict functions of functionally unknown proteins in single-species proteomics datasets^13,14,15. In these single-species studies, functionally associated proteins have coordinated changes in abundance across perturbations^13,14,15, and proteins with identical subcellular localization also exhibited coordinated abundance changes¹³. However, in our metaproteomics dataset, the abundances of peptides from proteins of the same taxon, instead of peptides from proteins with associated functions, were more likely to have correlated abundance changes.

The high abundance correlation among peptides from the same taxon is largely due to the significant impact of environmental disturbances on the composition of microbial communities³¹. This is supported by the observed variation in the relative abundance of peptides assigned to different genera across samples from individual V52 (Supplementary Fig. 21). In contrast, the limited correlation of peptides from proteins with related functions can be attributed to the high functional redundancy of human gut microbial communities³². Specifically, a decrease in peptides from proteins with a particular function could lead to an increase in peptides from proteins performing a similar function in a phylogenetically unrelated species. This compensatory mechanism helps maintain the overall functional stability of the microbial community, resulting in limited correlations among peptides from proteins with related functions. The proteome-level functional redundancy (FR_p) of the microbial communities in the studied samples was calculated in our previous study²⁶. That study found that some compounds shifted the microbiome from its original state of high FR_p to an alternative state of low FR_p, showing both the inherently high functional redundancy and the perturbations caused by drug treatments. Notably, a study of thousands of single-bacterium proteomes applied a method called SCALES (Spectral Correlation Analysis of Layered Evolutionary Signals), which progressively revealed phylogeny, pathways, and protein complexes from proteome data variance¹⁶. However, our metaproteomic data focus on specific microbial communities, in which global functional associations were not directly evident. Novel analytical approaches may offer new perspectives for exploring complex functional interactions within metaproteomics data.

Our study is the first to investigate peptide abundance correlations in a gut microbiome metaproteomic dataset. However, similar studies have been conducted using metagenomic data, such as genetic correlation networks from soil metagenomes, which have revealed a hierarchical functional structure¹⁷, similar to findings in single-species cellular genetic correlation networks³³. In contrast, our analysis of peptide abundance correlations primarily showed a clear clustering of peptides from the same taxon, with a less obvious functional hierarchy. This underscores that functional insights derived from protein abundance measurements differ from those inferred from genetic material through metagenomics or metatranscriptomics³⁴.

Our global peptide abundance correlation map also provided clues for studying microbial responses to drugs. In the constructed peptide correlation maps (Fig. 2A), peptides from the family Enterobacteriaceae were usually separated from other taxa, indicating unique response patterns for this taxonomic group. This is supported by previous experimental studies. For example, psychotropic drugs like fluoxetine have shown antimicrobial activity against Escherichia coli (family Enterobacteriaceae), while the abundance of other human gut species increased in the in vitro microbiome after the treatment with the same drug³⁵. In addition, microbial genera have been classified into three distinct taxonomic clusters, each representing a different pattern of drug response²⁶. Patterns resembling these taxonomic clusters can also be observed in our peptide correlation maps. For instance, peptides from Parasutterella (Burkholderiaceae family) and Eggerthella (Eggerthellaceae family), which belong to two clusters exhibiting distinct response patterns, are also well separated in our peptide correlation map (Fig. 2A and Supplementary Fig. 5).

Higher correlations of peptides from the same taxon also provide a solution for peptide taxonomic source assignment, which remains a challenge for peptide-centric metaproteomics analysis. Although most of the in-silico digested peptides were genome-distinct peptides³⁶, most of the peptides identified from real metaproteomics datasets were shared by different genomes²². In this study, even after refining the potential genome sources, a large number of peptides could only be assigned to a family-level LCA, and the actual taxonomic sources of these peptides remain unknown (Fig. 4B–E). By investigating peptide abundance correlations, additional information was provided for peptide taxonomic source assignment. This is expected to provide a more accurate microbial biomass profile, which quantifies species biomass contributions to the microbial community using metaproteomics data³⁷. However, in this study, peptides could only be assigned to a species representative genome, which represents a cluster of genomes potentially from different strains¹⁸. With the current workflow, the contribution of individual strains within a species cannot be resolved. Further analysis using a species-specific database containing all possible strains may help assess the contribution of different strains to specific species.

Although changes in taxon abundance were the major contributors to peptide abundance correlations, we further analyzed the correlations at the level of species representative genomes to reveal peptide functional linkages, extracting the subset of genome-distinct peptides for each genome from the metaproteomics dataset. In addition, to reduce the impact of changes in taxon abundance, TNPA was calculated. The principle of the TNPA calculation is similar to that of a normalization method named LFQRatio³⁸, which divides each protein LFQ intensity by the sum of all protein LFQ intensities for its respective strain in a microbial coculture system. LFQRatio has been shown to transform absolute protein quantification data into accurate and biologically meaningful protein abundance values for samples with multiple species at variable cell ratios. Similarly, in our study, TNPA showed a good ability to reveal functional linkages (Fig. 6E, F): peptides from proteins with related functions were connected in the TNPA-based peptide abundance correlation networks for species representative genomes.

A limitation of this study comes partly from metaproteomics itself. Current metaproteomics has limited proteome coverage^39,40. It is estimated that around 200 bacterial species reside in the human gut microbial community⁴¹. However, bacterial species with an abundance lower than 0.5% are hard to detect with current metaproteomics techniques⁴², and the number of peptides and proteins identified in low-abundance species is limited⁴⁰. In contrast, most metagenomics studies with moderate sequencing depth (10 Gbases) can detect species with relative abundances down to approximately 0.01% in a given sample⁴³. With ultra-deep metagenomic sequencing, it is possible to reconstruct metagenome-assembled genomes (MAGs) of species with extra-low abundance (<0.1%), enabling more accurate and deeper comparative metagenomics analysis⁴⁴. Moreover, metatranscriptomics presents additional challenges for detecting low-abundance species due to the wide dynamic range of transcript expression⁴⁵. As a result, a 0.1% relative abundance threshold is commonly applied in metatranscriptomics to discard ultra-low abundance species that likely arise due to limitations of short-read-based taxonomic classification⁴⁶. Considering the limited proteome coverage of current metaproteomics, only the top ten genomes were investigated to study intra-species functional relations in this work. Even in these genomes, the number of peptides was still limited, which restricted the scale of the peptide correlation networks constructed for species representative genomes. Many peptides from proteins with interesting or unknown functions were not identified and were therefore not included in the networks. Beyond proteome coverage, high-resolution LC–MS/MS is also needed for identification and quantification. Our analysis of another metaproteomics dataset, acquired on a Q Exactive mass spectrometer and processed with the same procedure, showed only a weak abundance correlation among same-taxon peptides (Supplementary Fig. 22). Fortunately, DIA-based metaproteomics has been shown to significantly increase the number of identified peptides and to improve quantification reproducibility^47,48, and thus has the potential to expand and strengthen the application of peptide abundance correlations. Furthermore, protein sizes, proteome sizes across bacterial species, and differences in cell numbers across samples could impact the interpretation of metaproteomics results. These factors may warrant consideration in future analyses.

Another aspect that needs improvement is distinguishing true functional correlations from incidental ones in peptide correlation networks for species representative genomes. First, the number of samples used to calculate Spearman correlation coefficients should be carefully considered, as these coefficients require sufficient sample sizes to achieve high confidence in the correlations detected⁴⁹. Additionally, applying a strict threshold or employing more refined methods to separate functionally related interactions from correlations driven by other factors is crucial, since abundance correlations do not always indicate functional relationships⁵⁰. It is also worth noting that in this study, for each individual, the abundance of peptides under a specific drug treatment was measured only once. Although the large number of drug treatments may mitigate the impact of random fluctuations in peptide abundance, the inclusion of biological replicates would enhance the reliability of measuring individual-specific microbiome responses to drugs and could potentially improve the robustness of peptide abundance correlation analyses. In addition, the in vitro culture system used in this study might have a systematic impact on the microbiome, potentially affecting peptide abundance correlation patterns. This limitation should be taken into consideration in future studies. Ultimately, the field would benefit from larger, more reproducible datasets and refined protein association reference methods to advance the study of peptide/protein abundance correlations.

In summary, fluctuations in peptide abundance contain a wealth of information on the microbiome’s response to perturbations. These fluctuations can be harnessed to assign peptides to their protein and taxon of origin, as well as to predict functions for functionally unknown proteins. In the metaproteomics dataset, peptides from the same taxon clustered together in the peptide correlation map, suggesting that changes in taxonomic abundance are the major contributor to peptide abundance changes. We anticipate that further investigation of peptide/protein abundance correlations will deepen the taxonomic and functional understanding of metaproteomics data.

Methods

Metaproteomics dataset

In this study, we analyzed a metaproteomics dataset comprising 672 raw files from human gut microbiomes of six individuals subjected to various drug treatments (112 samples per individual). Specifically, stool samples were treated with 109 different compounds (107 drugs, DMSO as a negative control in two samples, and kestose as a positive control in three samples; detailed in Supplementary Table S1) and then cultured in vitro using the RapidAIM assay²⁷, a culture- and metaproteomics-based rapid method for studying individual microbiome responses to drugs. Detailed information on sample collection, preparation, and LC–MS/MS analysis has been documented in a separate study²⁶. It is worth noting that an equal volume of each sample was used for protein digestion and loaded for LC–MS/MS metaproteomics analysis^26,27. Metaproteomics raw files obtained from LC–MS/MS were searched against the IGC (Integrated Gene Catalog) database⁵¹ using MetaLab2.3⁵² with the MaxQuant⁵³ workflow and default settings. Peptide quantification results were extracted from the “peptides.txt” file generated by MetaLab for subsequent analysis.

Peptide abundance correlation calculation

Pre-processing and peptide filtering. Peptide abundance correlations across different perturbations were calculated separately for each individual. First, peptide identification and quantification results were extracted individually. Peptides were filtered by retaining only those with non-zero intensity in at least 20% of the samples for each individual (≥23 samples). The quantification results of peptides that passed this filtering step were used for further analysis.
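
The presence filter described above can be sketched as follows. This is a minimal illustration in Python rather than the R used for the published analysis, and the dictionary layout (peptide sequence mapped to a list of per-sample intensities for one individual) and the function name are assumptions made for the example.

```python
def filter_peptides(intensities, min_fraction=0.20):
    """Keep peptides with non-zero intensity in at least min_fraction of samples.

    intensities: dict mapping peptide -> list of raw intensities, one value
    per sample of a single individual (hypothetical layout).
    """
    kept = {}
    for peptide, values in intensities.items():
        n_nonzero = sum(1 for v in values if v > 0)
        # Compare the count against the unrounded fraction, so that 20% of
        # 112 samples (22.4) translates into a minimum of 23 samples.
        if n_nonzero >= min_fraction * len(values):
            kept[peptide] = values
    return kept


# Toy example with 10 samples: at least 2 non-zero values are required.
toy = {
    "PEPTIDEA": [0, 0, 5.0, 0, 0, 7.5, 0, 0, 0, 0],  # 2 non-zero values: kept
    "PEPTIDEB": [0, 0, 0, 0, 0, 3.1, 0, 0, 0, 0],    # 1 non-zero value: dropped
}
filtered = filter_peptides(toy)
```

With 112 samples per individual, 0.20 × 112 = 22.4, so a peptide needs non-zero intensity in at least 23 samples, which matches the threshold stated above.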

Peptide abundance log2-fold change (log2-FC) calculation. To avoid zero values, 1 was added to peptide raw intensity values. The log2-fold change in peptide abundance for each sample was calculated by dividing the peptide intensity under each treatment by the average intensity of the two DMSO-treated control samples, followed by log2 transformation.
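
As a small worked sketch of this step (Python, with a hypothetical function name and toy intensities that are not taken from the dataset), the pseudocount is added to every raw intensity before the ratio to the mean of the DMSO controls is log2-transformed:

```python
import math


def log2_fold_changes(treated, dmso_controls, pseudocount=1.0):
    """Log2-fold change of one peptide in each treated sample.

    treated: raw intensities of the peptide under each treatment.
    dmso_controls: raw intensities of the same peptide in the DMSO controls.
    """
    # The pseudocount is added to all raw intensities to avoid zeros.
    control_mean = sum(v + pseudocount for v in dmso_controls) / len(dmso_controls)
    return [math.log2((v + pseudocount) / control_mean) for v in treated]


# Toy values: the controls 1 and 5 become 2 and 6, so their mean is 4.
example_fc = log2_fold_changes(treated=[3, 0, 7], dmso_controls=[1, 5])
```

In this toy case the three treated intensities become 4, 1, and 8 after the pseudocount, giving log2-fold changes of 0, -2, and 1 relative to the control mean of 4.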

Spearman correlation coefficients (SCC) calculation. Peptide abundance correlations across different treatments were calculated for all peptide pairs using Spearman correlation coefficients (SCCs) of the log2-FC values across the treatments. The SCCs were computed using the cor function in R.
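
To make explicit what the SCC measures, the following sketch computes it in plain Python as the Pearson correlation of average ranks, which is the standard definition when ties are present. It is an illustration only: the published analysis used the R cor function, and the helper names here are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values receiving the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation of two equally long lists of log2-FC values."""
    return pearson(average_ranks(x), average_ranks(y))


# Two toy peptides whose log2-FC profiles mostly rise together.
example_scc = spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

Because only ranks enter the calculation, the SCC is insensitive to the scale of the log2-FC values and to monotonic transformations of them.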

Statistical assessment of calculated SCC. To estimate the statistical significance of each pairwise correlation in a computationally feasible manner, we used a standard approximation based on the t-distribution:

$$t=\rho_{s}\times \sqrt{\frac{n-2}{1-\rho_{s}^{2}}}$$

(1)

where ρ_s is the Spearman correlation coefficient (SCC) and n is the number of samples. This t-value was then used to compute two-tailed p-values.
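
Equation (1) can be checked numerically with a short sketch. The values ρ_s = 0.5 and n = 110 below are illustrative assumptions chosen to give a round result, not numbers reported in this study. Converting t to a two-tailed p-value uses Student's t-distribution with n − 2 degrees of freedom, which is not part of the Python standard library and is therefore left out here.

```python
import math


def spearman_t(rho, n):
    """t-statistic of Eq. (1) for a Spearman coefficient rho and n samples.

    Undefined for rho = 1 or rho = -1, where the denominator becomes zero.
    """
    return rho * math.sqrt((n - 2) / (1 - rho ** 2))


# Illustrative values only: (110 - 2) / (1 - 0.25) = 144, whose square root is 12.
example_t = spearman_t(0.5, 110)
```

For these assumed values, t = 0.5 × 12 = 6, which shows how a moderate correlation becomes highly significant when it is estimated across roughly a hundred treatments.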

Peptide annotations

To enable meaningful biological analysis, peptides were assigned to their protein sources and annotated with taxonomic and functional information using the following procedures.

Protein source refinement. We first generated a genome-level taxonomic profile for each individual to refine peptide protein source assignments and improve peptide annotation resolution. All identified peptides from each individual were mapped to their taxonomic sources by aligning them against MetaPep²² records. Genome-distinct peptides, defined as peptides exclusively present in a single bacterial genome, were utilized to enhance the accuracy of assignments. In the subsequent step, only genomes with at least three genome-distinct peptides identified across all samples were considered for peptide protein source assignments.
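
The genome selection rule in this step can be sketched as below. The mapping from peptides to candidate genomes stands in for the MetaPep lookup, and the genome identifiers are hypothetical placeholders.

```python
from collections import defaultdict


def genomes_with_support(peptide_to_genomes, min_distinct=3):
    """Genomes supported by at least min_distinct genome-distinct peptides.

    peptide_to_genomes: dict mapping peptide -> set of genomes containing it
    (hypothetical stand-in for the MetaPep records).
    """
    counts = defaultdict(int)
    for genomes in peptide_to_genomes.values():
        # A genome-distinct peptide is present in exactly one genome.
        if len(genomes) == 1:
            (genome,) = genomes
            counts[genome] += 1
    return {genome for genome, count in counts.items() if count >= min_distinct}


toy_mapping = {
    "PEP1": {"genome_A"},
    "PEP2": {"genome_A"},
    "PEP3": {"genome_A"},
    "PEP4": {"genome_A", "genome_B"},  # shared peptide: not genome-distinct
    "PEP5": {"genome_B"},              # only one distinct peptide for genome_B
}
supported = genomes_with_support(toy_mapping)
```

Here only genome_A is retained, because genome_B is supported by a single genome-distinct peptide.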

Peptide protein sources annotation. Next, for peptide protein source assignments, protein sequences from the 4744 UHGG representative genomes¹⁸ were in-silico digested using DeepDigest⁵⁴ with the following parameters: miscleavage = 2, minimum peptide length = 7, maximum peptide length = 47 (default settings), and trypsin as the protease. Subsequently, all identified peptides were mapped to their unique protein source or multiple protein sources.
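
To illustrate what the digestion parameters mean, the sketch below applies a simple rule-based trypsin digestion (cleavage after K or R unless the next residue is P). This rule is a deliberate simplification and is not DeepDigest, which predicts cleavage with a deep learning model; only the missed-cleavage and length limits are taken from the text, and the protein sequence is made up.

```python
def tryptic_peptides(protein, max_missed=2, min_len=7, max_len=47):
    """Rule-based in-silico tryptic digestion (simplified stand-in)."""
    # Positions where the sequence is cut: after K or R, but not before P.
    cut_points = [0]
    for i, residue in enumerate(protein[:-1]):
        if residue in "KR" and protein[i + 1] != "P":
            cut_points.append(i + 1)
    cut_points.append(len(protein))

    peptides = set()
    for start in range(len(cut_points) - 1):
        # Join up to max_missed additional fragments (missed cleavages).
        for missed in range(max_missed + 1):
            end = start + 1 + missed
            if end >= len(cut_points):
                break
            peptide = protein[cut_points[start]:cut_points[end]]
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


# Made-up sequence: the R before P is not cleaved, and "MK" is too short.
example_protein = "MK" + "A" * 7 + "K" + "C" * 7 + "RP" + "D" * 6 + "K"
example_peptides = tryptic_peptides(example_protein)
```

Matching identified peptides against such digests yields, for each peptide, either a unique protein source or several candidate sources.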

Peptide taxonomic and functional sources annotation. Each peptide was taxonomically annotated by assigning it to a specific genome source or the lowest common ancestor (LCA) of the genomes corresponding to all its protein sources. For functional annotations, protein functional annotations were extracted from the UHGG¹⁸ database, and each peptide was functionally annotated based on the functional annotation of its source protein(s).
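
The LCA assignment can be sketched as the longest shared prefix of the lineages of all candidate genomes. The lineage tuples below are shortened, illustrative examples with hypothetical genome identifiers; a real lineage would carry every rank from domain down to genome.

```python
def lowest_common_ancestor(lineages):
    """Longest shared prefix of lineages ordered from high to low rank.

    lineages: list of tuples, one per candidate genome of a peptide.
    """
    lca = []
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break  # the candidate genomes diverge at this rank
        lca.append(level[0])
    return tuple(lca)


# A peptide shared by two genera of one family resolves only to the family.
shared_lca = lowest_common_ancestor([
    ("Bacteroidaceae", "Bacteroides", "genome_1"),
    ("Bacteroidaceae", "Phocaeicola", "genome_2"),
])
# A genome-distinct peptide keeps its full lineage, down to the genome.
distinct_lca = lowest_common_ancestor([
    ("Bacteroidaceae", "Bacteroides", "genome_1"),
])
```

The first case corresponds to the family-level assignments discussed above, for which the actual genome of origin cannot be resolved from sequence alone.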

Additional annotation from Unipept²⁴. All studied peptides from individual V52 were uploaded to the Unipept online metaproteomics analysis platform (https://unipept.ugent.be/mpa) to obtain additional peptide-level taxonomic and functional annotations (Gene Ontology [GO] terms and Enzyme Commission [EC] numbers) using the default settings.

Visualize peptide abundance correlations with global peptide correlation maps

t-SNE (t-distributed stochastic neighbor embedding) was applied to visualize the peptide abundance correlations. The SCC matrix of peptide pairs from each individual was used as the input for the Rtsne package⁵⁵ with default parameters (perplexity = 30). In the peptide correlation map, each point represents a peptide, and the peptides were colored based on their taxonomic or functional annotations acquired as described in the previous section.

Applying a machine learning model to predict peptide genome source

Genome-distinct peptides from the two genomes with the highest number of such peptides within the same bacterial family were extracted. Seventy percent of these peptides were randomly selected as the training dataset, while the remaining 30% constituted the test dataset. Additionally, genome-distinct peptides from other genomes within the same family and 500 randomly selected peptides from other families were included for further evaluation of the model’s performance.
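
A reproducible 70/30 split of this kind might look like the following sketch. The seed, the function name, and the peptide labels are assumptions for illustration; the source does not state how the random selection was implemented.

```python
import random


def split_train_test(peptides, train_fraction=0.7, seed=0):
    """Randomly split peptides into training and test sets."""
    shuffled = sorted(peptides)  # fixed starting order for reproducibility
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


toy_peptides = [f"pep{i}" for i in range(10)]
train_set, test_set = split_train_test(toy_peptides)
```

Fixing the seed keeps the split stable between runs, so that performance on the held-out peptides can be compared across model settings.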

For the training dataset, the Spearman correlation coefficients (SCCs) of each peptide’s abundance changes relative to all peptides in the dataset were calculated and used as input features. A Random Forest classifier was then trained using the randomForest package⁵⁶ in R with ntree = 500 and importance = TRUE. The model’s performance was assessed using confusion matrices calculated on two different combined test datasets: (1) genome-distinct peptides from the test dataset combined with genome-distinct peptides from other genomes of the same family, and (2) genome-distinct peptides from the test dataset combined with 500 randomly selected peptides from other families. Following these assessments, the trained model was used to predict genome sources of peptides from the same bacterial family that lacked genome-level taxonomic annotations.
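
The confusion matrices used for evaluation simply count how often each true source is paired with each predicted source. A minimal sketch, in which the labels are hypothetical and "other" is a placeholder for peptides from outside the two training genomes:

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Counts of (true label, predicted label) pairs."""
    return Counter(zip(true_labels, predicted_labels))


# Toy evaluation set with made-up labels and predictions.
true_sources = ["genome_A", "genome_A", "genome_B", "other"]
predicted_sources = ["genome_A", "genome_B", "genome_B", "genome_A"]
matrix = confusion_matrix(true_sources, predicted_sources)
```

Entries with identical true and predicted labels are correct assignments, while entries such as ("other", "genome_A") show where peptides from outside the two training genomes end up.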

Calculation of taxon-based normalized peptide abundance (TNPA)

To mitigate the impact of changes in taxon abundance on peptide abundance and to better study the effect of different functions on peptide abundance, taxon-based normalized peptide abundance (TNPA) was calculated from the original peptide abundance (OPA), the total abundance of all peptides from the same taxon in the sample (TPATS, Total Peptide Abundance per Taxon in a Sample), and a scaling constant representing an average level of peptide abundance ($10^{8}$), using the following formula:

$$\text{TNPA}=\frac{\text{OPA}}{\text{TPATS}}\times 10^{8}$$

(2)

TNPA was calculated at the genome level, using the genome-distinct peptides specific to each species representative genome. It was calculated only for the ten genomes with the largest number of identified genome-distinct peptides in each individual microbiome. The subsequent peptide abundance correlation analysis for species representative genomes also focused on these genomes.
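
Equation (2) can be illustrated for a single genome in a single sample with the sketch below. The peptide names and intensities are toy values, and the function name is hypothetical.

```python
def tnpa(opa_by_peptide, scale=1e8):
    """Taxon-normalized peptide abundance for one genome in one sample.

    opa_by_peptide: dict mapping each genome-distinct peptide of the genome
    to its original peptide abundance (OPA) in the sample.
    """
    tpats = sum(opa_by_peptide.values())  # total peptide abundance of the taxon
    if tpats == 0:
        raise ValueError("taxon has no quantified peptides in this sample")
    return {peptide: opa / tpats * scale for peptide, opa in opa_by_peptide.items()}


# The same taxon at two abundance levels: the second sample has twice the biomass.
sample_1 = tnpa({"PEP1": 2e6, "PEP2": 6e6})
sample_2 = tnpa({"PEP1": 4e6, "PEP2": 12e6})
```

Both samples give the same TNPA values (2.5 × 10^7 and 7.5 × 10^7), which shows how the normalization removes a uniform change in taxon abundance while preserving the relative abundance of peptides within the taxon.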

Peptide abundance correlation networks for species representative genomes

Two types of peptide abundance correlation networks for species representative genomes were constructed using the SCCs of the log2-FC of OPA and the SCCs of the log2-FC of TNPA. Peptide pairs with the top 5% SCCs were retained for network construction, which was carried out using the graph_from_adjacency_matrix function from the igraph package⁵⁷ in R (with parameters: weighted = TRUE, mode = “undirected”, and diag = FALSE). Nodes and edges from each network were extracted with the same package, and all peptides in networks were mapped to their protein source annotations.
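
The edge selection can be sketched as ranking all peptide pairs by SCC and keeping the top 5%. The source does not state whether the cut-off was taken by rank or by quantile, so the rank-based rule below is an assumption, as are the data layout and names.

```python
def top_percent_edges(scc_by_pair, top_percent=5):
    """Peptide pairs with the highest SCCs, as (peptide_a, peptide_b, scc).

    scc_by_pair: dict mapping (peptide_a, peptide_b) -> SCC, with each
    unordered pair listed once (hypothetical layout).
    """
    ranked = sorted(scc_by_pair.items(), key=lambda item: item[1], reverse=True)
    # Integer arithmetic avoids floating-point surprises in the edge count.
    n_keep = max(1, len(ranked) * top_percent // 100)
    return [(a, b, scc) for (a, b), scc in ranked[:n_keep]]


# Toy input: 40 pairs with SCCs 0/40, 1/40, ..., 39/40.
toy_scc = {(f"pep{i}", f"pep{i + 1}"): i / 40 for i in range(40)}
edges = top_percent_edges(toy_scc)
```

The retained triples correspond to the weighted, undirected edges that are passed on to network construction.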

Peptide abundance correlation networks for species representative genomes were visualized using Gephi 0.10.1 with the Yifan Hu layout⁵⁸. Network modularity and clustering coefficients were calculated using the modularity and transitivity functions from the igraph package, respectively. Networks constructed with TNPA were used to study the correlations between peptides with related functions and to predict the functions of previously uncharacterized proteins.
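
For readers unfamiliar with the clustering coefficient, the sketch below computes the global transitivity of an undirected network, that is, the fraction of connected node triples that are closed into triangles. It assumes the global definition; the source does not state which type option of the igraph transitivity function was used, and the node names are made up.

```python
from itertools import combinations


def global_transitivity(edge_list):
    """Fraction of connected triples that form triangles (undirected graph)."""
    neighbors = {}
    for a, b in edge_list:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    closed = 0
    triples = 0
    for nbrs in neighbors.values():
        # Every pair of neighbors of a node forms a triple centered on it.
        for u, v in combinations(sorted(nbrs), 2):
            triples += 1
            if v in neighbors[u]:
                closed += 1
    return closed / triples if triples else float("nan")


# One triangle (A, B, C) with an extra peptide D attached to C.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
toy_transitivity = global_transitivity(toy_edges)
```

In the toy network, three of the five connected triples are closed, so the transitivity is 0.6; tightly co-varying peptide groups push this value toward 1.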

Data availability

The metaproteomics raw files analyzed in this study were deposited at the ProteomeXchange Consortium⁵⁹ via the PRIDE⁶⁰ partner repository, as described in our previous study²⁶ (https://doi.org/10.1101/2025.02.13.637346). These files will be publicly available upon publication. Additionally, another dataset, generated using a Q Exactive mass spectrometer, is available under the dataset identifier PXD012724.

Code availability

All code used to perform the analysis in this study is available on GitHub at https://github.com/northomics/Peptide_Abundance_Correlations.

References

Lynch, S. V. & Pedersen, O. The human intestinal microbiome in health and disease. N. Engl. J. Med. 375, 2369–2379 (2016).
de Vos, W. M., Tilg, H., Van Hul, M. & Cani, P. D. Gut microbiome and health: mechanistic insights. Gut 71, 1020–1032 (2022).
Wu, G. et al. A Core Microbiome signature as an indicator of health. Cell 187, 6550–6565.e11 (2024).
Sanna, S. et al. Causal relationships among the gut microbiome, short-chain fatty acids and metabolic diseases. Nat. Genet. 51, 600–605 (2019).
Lloyd-Price, J. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019).
Heintz-Buschart, A. & Wilmes, P. Human gut microbiome: function matters. Trends Microbiol. 26, 563–574 (2018).
Lobb, B., Kurtz, D. A., Moreno-Hagelsieb, G. & Doxey, A. C. Remote homology and the functions of metagenomic dark matter. Front. Genet. https://doi.org/10.3389/fgene.2015.00234 (2015).
Franzosa, E. A. et al. Sequencing and beyond: integrating molecular “omics” for microbial community profiling. Nat. Rev. Microbiol. 13, 360–372 (2015).
Kleiner, M. Metaproteomics: much more than measuring gene expression in microbial communities. mSystems 4, e00115-19 (2019).
Wang, J. et al. Proteome profiling outperforms transcriptome profiling for coexpression based gene Function Prediction. Mol. Cell. Proteom. 16, 121–134 (2017).
Brandman, O. & Meyer, T. Feedback loops shape cellular signals in space and time. Science 322, 390–395 (2008).
Garrido-Rodriguez, M., Zirngibl, K., Ivanova, O., Lobentanzer, S. & Saez-Rodriguez, J. Integrating knowledge and omics to decipher mechanisms via large-scale models of signaling networks. Mol. Syst. Biol. 18, e11036 (2022).
Kustatscher, G. et al. Co-Regulation map of the human proteome enables identification of protein functions. Nat. Biotechnol. 37, 1361–1371 (2019).
Mateus, A. et al. The functional proteome landscape of Escherichia coli. Nature 588, 473–478 (2020).
Messner, C. B. et al. The proteomic landscape of genome-wide genetic perturbations. Cell 186, 2018–2034.e21 (2023).
Zaydman, M. A. et al. Defining hierarchical protein interaction networks from spectral analysis of bacterial proteomes. eLife 11, e74104 (2022).
Ma, B. et al. Genetic correlation network prediction of forest soil microbial functional organization. ISME J. 12, 2492–2505 (2018).
Almeida, A. et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol. 39, 105–114 (2021).
Haynes, W. A., Tomczak, A. & Khatri, P. Gene annotation bias impedes biomedical research. Sci. Rep. 8, 1362 (2018).
Kustatscher, G. et al. Understudied proteins: opportunities and challenges for functional proteomics. Nat. Methods 19, 774–779 (2022).
Schiebenhoefer, H. et al. Challenges and promise at the interface of metaproteomics and genomics: an overview of recent progress in metaproteogenomic data analysis. Expert Rev. Proteom. 16, 375–390 (2019).
Sun, Z. et al. MetaPep: a core peptide database for faster human gut metaproteomics database searches. Comput. Struct. Biotechnol. J. 21, 4228–4237 (2023).
Plubell, D. L. et al. Putting Humpty Dumpty back together again: what does protein quantification mean in bottom-up proteomics?. J. Proteome Res. 21, 891–898 (2022).
Gurdeep Singh, R. et al. Unipept 4.0: functional analysis of metaproteome data. J. Proteome Res. 18, 606–615 (2019).
Simopoulos, C. M. A. et al. pepFunk: a tool for peptide-centric functional analysis of metaproteomic human gut microbiome studies. Bioinformatics 36, 4171–4179 (2020).
Li, L. et al. Systematic metaproteomics mapping reveals functional and ecological landscapes of human gut microbiota responses to therapeutic drugs. Preprint at bioRxiv https://doi.org/10.1101/2025.02.13.637346 (2025).
Li, L. et al. RapidAIM: a culture- and metaproteomics-based rapid assay of individual microbiome responses to drugs. Microbiome 8, 33 (2020).
Li, L. et al. An in vitro model maintaining taxon-specific functional activities of the gut microbiome. Nat. Commun. 10, 4146 (2019).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Varadi, M. et al. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 52, D368–D375 (2024).
Nguyen, J., Lara-Gutiérrez, J. & Stocker, R. Environmental fluctuations and their effects on microbial communities, populations and individuals. FEMS Microbiol. Rev. 45, fuaa068 (2021).
Li, L. et al. Revealing proteome-level functional redundancy in the human gut microbiome using ultra-deep metaproteomics. Nat. Commun. 14, 3428 (2023).
Costanzo, M. et al. A global genetic interaction network maps a wiring diagram of cellular function. Science 353, aaf1420 (2016).
Armengaud, J. Metaproteomics to understand how microbiota function: the crystal ball predicts a promising future. Environ. Microbiol. 25, 115–125 (2023).
Cussotto, S. et al. Differential effects of psychotropic drugs on microbiome composition and gastrointestinal function. Psychopharmacology 236, 1671–1685 (2019).
Mesuere, B. et al. Unipept: tryptic peptide-based biodiversity analysis of metaproteome samples. J. Proteome Res. 11, 5773–5780 (2012).
Kleiner, M. et al. Assessing species biomass contributions in microbial communities via metaproteomics. Nat. Commun. 8, 1558 (2017).
Shi, M., Evans, C. A., McQuillan, J. L., Noirel, J. & Pandhal, J. LFQRatio: a normalization method to decipher quantitative proteome changes in microbial coculture systems. J. Proteome Res. https://doi.org/10.1021/acs.jproteome.3c00714 (2024).
Lohmann, P. et al. Function is what counts: how microbial community complexity affects species, proteome and pathway coverage in metaproteomics. Expert Rev. Proteom. 17, 163–173 (2020).
Sun, Z., Ning, Z. & Figeys, D. The landscape and perspectives of the human gut metaproteomics. Mol. Cell. Proteom. 23, 100763 (2024).
Lloyd-Price, J., Abu-Ali, G. & Huttenhower, C. The healthy human microbiome. Genome Med. 8, 51 (2016).
Duan, H. et al. Assessing the dark field of metaproteome. Anal. Chem. 94, 15648–15654 (2022).
Blanco-Míguez, A. et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat. Biotechnol. 41, 1633–1644 (2023).
Jin, H. et al. Hybrid, ultra-deep metagenomic sequencing enables genomic and functional characterization of low-abundance species in the human gut microbiome. Gut Microbes https://doi.org/10.1080/19490976.2021.2021790 (2022).
Zhang, Y. et al. Metatranscriptomics for the human microbiome and microbial community functional profiling. Annu. Rev. Biomed. Data Sci. 4, 279–311 (2021).
Spurbeck, R. R., Catlin, L. A., Mukherjee, C., Smith, A. K. & Minard-Smith, A. Analysis of metatranscriptomic methods to enable wastewater-based biosurveillance of all infectious diseases. Front. Public Health https://doi.org/10.3389/fpubh.2023.1145275 (2023).
Aakko, J. et al. Data-independent acquisition mass spectrometry in metaproteomics of gut microbiota—implementation and computational analysis. J. Proteome Res. 19, 432–436 (2020).
Zhao, J. et al. Data-independent acquisition boosts quantitative metaproteomics for deep characterization of gut microbiota. npj Biofilms Microbiomes 9, 4 (2023).
Bonett, D. G. & Wright, T. A. Sample size requirements for estimating Pearson, Kendall and Spearman correlations. Psychometrika 65, 23–28 (2000).
Chase, L. S. S. & Priya, S. S. I’m walking into spiderwebs: making sense of protein−protein interaction data. J. Proteome Res. 23, 2723–2732 (2024).
Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).
Cheng, K. et al. MetaLab 2.0 enables accurate post-translational modifications profiling in metaproteomics. J. Am. Soc. Mass Spectrom. 31, 1473–1482 (2020).
Tyanova, S., Temu, T. & Cox, J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat. Protoc. 11, 2301–2319 (2016).
Yang, J. et al. DeepDigest: prediction of protein proteolytic digestion with deep learning. Anal. Chem. 93, 6094–6103 (2021).
Krijthe, J. & van der Maaten, L. Rtsne: T-distributed stochastic neighbor embedding using a Barnes-Hut implementation. https://doi.org/10.32614/CRAN.package.Rtsne (2023).
Breiman, L., Cutler, A., Liaw, A. & Wiener, M. randomForest: Breiman and Cutlers random forests for classification and regression. https://doi.org/10.32614/CRAN.package.randomForest (2024).
Csárdi, G. et al. igraph: network analysis and visualization. https://doi.org/10.32614/CRAN.package.igraph (2025).
Bastian, M., Heymann, S. & Jacomy, M. Gephi: an open source software for exploring and manipulating networks. ICWSM 3, 361–362 (2009).
Deutsch, E. W. et al. The ProteomeXchange Consortium in 2020: enabling ‘big data’ approaches in proteomics. Nucleic Acids Res. 48, D1145–D1152 (2020).
Perez-Riverol, Y. et al. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 50, D543–D552 (2022).

Acknowledgements

Substantial financial support was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC) through the Discovery Grant (to D.F.). Z.S. and Q.W. were funded by a stipend from the NSERC CREATE in Technologies for Microbiome Science and Engineering (TECHNOMISE) Program.

Author information

Authors and Affiliations

School of Pharmaceutical Sciences, Ottawa Institute of Systems Biology, and Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada
Zhongzhi Sun, Zhibin Ning, Qing Wu & Daniel Figeys
State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, China
Leyuan Li
Department of Biology and Waterloo Centre for Microbial Research, University of Waterloo, Waterloo, ON, Canada
Andrew C. Doxey
Quadram Institute Bioscience, Norwich Research Park, Norwich, Norfolk, UK
Daniel Figeys
University of East Anglia, Norwich, Norfolk, UK
Daniel Figeys

Contributions

Z.S. curated the data, conducted formal analysis, investigation, and visualization, and wrote the original draft. Z.N. contributed to conceptualization, methodology, investigation, and drafting of the manuscript. Q.W. contributed to methodology, investigation, and manuscript review and editing. L.L. provided resources and contributed to manuscript review and editing. A.D. contributed to methodology, investigation, and manuscript review and editing. D.F. supervised the study, acquired funding, contributed to conceptualization, and reviewed and edited the original draft. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Daniel Figeys.

Ethics declarations

Competing interests

D.F. is a co-founder of Biotagenics and MedBiome, both of which are clinical microbiomics companies. The other authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Figures

Supplementary Tables

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Sun, Z., Ning, Z., Wu, Q. et al. Peptide abundance correlations in metaproteomics enhance taxonomic and functional analysis of the human gut microbiome. npj Biofilms Microbiomes 11, 166 (2025). https://doi.org/10.1038/s41522-025-00801-y

Received: 14 March 2025
Accepted: 25 July 2025
Published: 19 August 2025
Version of record: 19 August 2025
DOI: https://doi.org/10.1038/s41522-025-00801-y