Introduction

Metaproteomics measures complex microbial communities in biological samples from natural environments, such as the soil rhizosphere1, the ocean2,3, and the fecal microbiome4,5,6,7,8. Understanding the functional roles of microorganisms in an ecosystem is crucial for gaining insights into its interactions and dynamics9. This can provide a deeper understanding of how microorganisms participate in processes such as nutrient cycling10, disease states, and support of the digestive and immune systems11,12,13,14. In shotgun MS-based metaproteomics, tandem mass spectrometry (MS/MS) data are generated as follows: proteins are first hydrolyzed into peptides through an in-solution digestion method, generating a large number of peptides. These peptides are then ionized, isolated, fragmented, and detected in a mass analyzer as they elute from high-performance liquid chromatography (HPLC). A key step in analyzing MS-based metaproteomics data is database searching, which compares the measured mass spectra of the peptides to theoretical mass spectra of peptides digested in silico from protein databases. Each of these comparisons yields a peptide-spectrum match (PSM) score, which measures the similarity between the measured and theoretical mass spectra. The peptide with the highest PSM score is considered the top candidate for the query MS/MS spectrum. After the database search, a filtering step is applied to eliminate false positive identifications by setting a score threshold that yields a set of confident PSMs at a predefined false discovery rate (FDR).

The PSM scoring function in the database search pipeline serves two key purposes: ranking peptide candidates for a given spectrum to identify the most compatible match and ranking PSMs from a proteomics run to eliminate spurious matches. The challenge of constructing a well-calibrated scoring function has intensified with rapid advances in mass spectrometry and metagenomics sequencing technologies, which have led to a substantial increase in the number of mass spectra and the size of protein databases. Ideally, an MS/MS spectrum should achieve a high score for its match with the correct peptide, while random matches typically follow a probabilistic distribution with a small tail of high scores. As peptide databases grow larger, the likelihood of an incorrect random match scoring higher than the correct match increases. Consequently, developing efficient and sophisticated PSM filtering algorithms to re-score PSMs for improved ranking has become imperative.

In recent years, various PSM filtering algorithms have been proposed. Statistical methods, such as PeptideProphet15, Tailor16, and H-Score17, use approaches like Bayesian statistical assumptions, empirical observations, and confidence-based recalibrations, respectively. Machine learning (ML)-based algorithms, such as Percolator18, CRanker19, QRanker20, and Gradient Boosting21, identify confident PSMs to train models that classify remaining matches. Other methods leverage spectrum comparison features, as seen in MS2Rescore22, or integrate results from multiple search engines for comprehensive analysis, exemplified by iProphet23 and IDPicker24. Our previous work, Sipros-Ensemble25, employed logistic regression to calculate new scores based on three distinct scoring functions.

While these approaches effectively extract PSM features like charge states and mass errors, they may not fully exploit the information within measured and theoretical spectra. To address this, we proposed DeepFilter26, a deep learning architecture that automatically learns matching patterns between measured and theoretical spectra, complemented by human-engineered features. Although DeepFilter achieved promising results, it has limitations: it was trained on data from a single database search engine, restricting its generalizability, and its input format–a large, sparse matrix constructed from ascending-order peak masses–slows inference compared to other widely used filtering tools.

Motivated by the benefits of leveraging MS/MS information and the need to accelerate peptide identification, we developed WinnowNet, a deep-learning-based architecture for re-scoring PSM candidates. WinnowNet utilizes experimentally verified PSM datasets from the ProteomeTools study27,28 and PSM candidates generated by multiple search engines to construct large, diverse training datasets. The training process employs a curriculum learning strategy29 to enhance model performance and accelerate convergence. By leveraging the order-invariant properties of CNN and transformer architectures, WinnowNet reduces the representation matrix size while effectively capturing complex matching patterns between measured MS/MS spectra and theoretical peptide spectra. Experimental results demonstrate that WinnowNet significantly improves identifications at the PSM, peptide, and protein levels, outperforming the other widely used filters benchmarked in this study. In addition, WinnowNet eliminates the need for ad hoc training: it can be applied to different metaproteome samples without fine-tuning and still delivers substantial improvements over existing tools. WinnowNet is freely available under the GNU GPL license at https://github.com/Biocomputing-Research-Group/WinnowNet30.

Results

Benchmark datasets and evaluation metrics

To provide a comprehensive performance assessment, WinnowNet was benchmarked on twelve metaproteome datasets. These include a dataset derived from a synthetic microbial mixture (Synthetic), an artificially assembled mock community (P1, P2, and P3), and samples from three distinct natural microbial communities: marine (Marine 1-3), soil (Soil 1-3), and human gut (Human Gut and Human Gut TimsTOF), with increasing complexity in mass spectra and protein databases (see Supplementary Table 1 and Supplementary Note 1). All metaproteome samples except Human Gut TimsTOF were analyzed using the Multidimensional Protein Identification Technology (MudPIT) approach31 on a Thermo Scientific LTQ Orbitrap Elite mass spectrometer. The Human Gut TimsTOF (HGT) dataset was obtained from fecal samples of human patients and analyzed using a trapped ion mobility spectrometry time-of-flight (timsTOF) mass spectrometry approach.

To ensure an accurate performance comparison and to mitigate overfitting during protein identification, we incorporated entrapment proteins into the database search, following approaches proposed in many previous studies32,33,34,35. Entrapment proteins were generated by randomly shuffling target protein sequences to create false target sequences, which were then used alongside the target-decoy strategy36. The effective ratio of entrapment proteins to original target proteins in the database was set to 1:1. Identifications at the PSM, peptide, and protein levels were evaluated at a 1% false discovery rate (FDR), as shown in Eq. (1), where nt and ne denote the numbers of original target and entrapment identifications, respectively. This estimation follows the “combined” method in ref. 35, which provides a conservative upper bound on the FDR. In addition, we performed entrapment analysis using the paired estimation method proposed in ref. 35 (described in Supplementary Discussion), which has been shown to yield a tighter bound. Only original target matches with the FDR controlled at the predefined level (1% in this study) were reported for all benchmarked methods. It is worth noting that MS/MS spectrum data were extracted using MSConvert from ProteoWizard release 3.0.1184137, in contrast to our previous study26, which utilized RawConverter version 1.1.0.2338.

$$Estimated\,FDR=\frac{2\times {n}_{e}}{{n}_{t}+{n}_{e}}$$
(1)
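For readers who wish to reproduce this estimate, a minimal Python sketch of Eq. (1) and the corresponding score-threshold selection is shown below; the function and variable names are illustrative and are not part of the WinnowNet codebase.

```python
def combined_entrapment_fdr(n_target, n_entrapment):
    """Eq. (1): conservative 'combined' entrapment FDR estimate."""
    total = n_target + n_entrapment
    return 2.0 * n_entrapment / total if total else 0.0


def score_threshold_at_fdr(psms, fdr_limit=0.01):
    """Find the most permissive score cutoff whose accepted set satisfies fdr_limit.

    `psms` is an iterable of (score, is_entrapment) pairs; higher scores are better.
    """
    cutoff = None
    n_t = n_e = 0
    for score, is_entrapment in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_entrapment:
            n_e += 1
        else:
            n_t += 1
        if combined_entrapment_fdr(n_t, n_e) <= fdr_limit:
            cutoff = score  # keep lowering the cutoff while the estimate stays under the limit
    return cutoff
```

Only original target matches scoring at or above the returned cutoff would then be reported, as described above.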

For some experiments, we also used foreign proteins as an entrapment strategy. In this approach, proteins from foreign species were incorporated into the original target protein set to create an extended target database, which was then augmented with decoys generated by randomly shuffling its entries. For this entrapment setup, we estimated the FDR using Eq. (2), where nt denotes the number of identifications matching the extended target set (original target plus foreign proteins) and nd denotes the number of decoy identifications. To further assess the reliability of the identifications, we computed the False Matching Rate (FMR)33,34, defined as the proportion of false target identifications at a 1% FDR, following Eq. (3). In Eq. (3), nf represents the number of matches to proteins from foreign species.

$$FD{R}_{td}=\frac{{n}_{d}}{{n}_{t}}$$
(2)
$$FMR=\frac{{n}_{f}}{{n}_{t}}$$
(3)
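The quantities in Eqs. (2) and (3) can be sketched in the same spirit (again, illustrative names only):

```python
def target_decoy_fdr(n_target, n_decoy):
    """Eq. (2): FDR estimate for the extended (original target + foreign) database."""
    return n_decoy / n_target if n_target else 0.0


def false_matching_rate(n_foreign, n_target):
    """Eq. (3): fraction of accepted target identifications mapping to foreign species."""
    return n_foreign / n_target if n_target else 0.0
```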

Performance comparison to state-of-the-art filtering algorithms

We evaluated WinnowNet against six leading filtering algorithms: Percolator18, Q-ranker20, PeptideProphet39, iProphet23, MS2Rescore22, and DeepFilter26, all of which have been released or updated within the past six years. Unlike traditional filtering algorithms, WinnowNet eliminates ad hoc training and can be applied to different metaproteome samples without fine-tuning while still obtaining substantial improvements over existing tools. The evaluation was conducted using PSM candidates derived from three standalone database search engines: Comet40, Myrimatch41, and MS-GF+42. Note that Percolator, Q-ranker, and PeptideProphet relied directly on the PSM scores from these search engines, whereas iProphet utilized scores generated by PeptideProphet. To ensure a fair comparison, iProphet was run without the addition of peptide- and protein-level features. While Percolator, Q-ranker, PeptideProphet, and iProphet employ traditional machine learning or statistical methods with human-engineered PSM features, MS2Rescore enhances rescoring by incorporating predicted peptide fragmentation patterns and retention time information. In contrast, both DeepFilter and WinnowNet are deep-learning-based approaches that automatically learn discriminative features from PSMs.

For performance assessment, we applied the entrapment method described in section “Benchmark datasets and evaluation metrics.” Protein identifications were reported only when supported by at least one unique peptide. Identification results for the marine, human gut, soil, and mock datasets at 1% FDR are summarized in Fig. 1 and detailed in Supplementary Tables 16 and 17. Both WinnowNet variants–the self-attention-based and the CNN-based architectures–achieved the highest numbers of identifications at the PSM, peptide, and protein levels across all datasets and all three standalone database search engines. Among the baseline methods, either DeepFilter or MS2Rescore consistently provided the highest identification counts. In the following analysis, we focus on the self-attention-based WinnowNet, which demonstrated the best overall performance; the analysis of the CNN-based variant is provided in Supplementary Methods.

Fig. 1: Identification results on marine and human gut datasets at 1% FDR using the entrapment method.
figure 1

a Results for the Marine 2 dataset. b Results for the Marine 3 dataset. c Results for the Human Gut dataset. W/O represents database search results without any filtering; MS refers to MS2Rescore; P to Percolator; Q to Q-ranker; PP to PeptideProphet; I to iProphet; DF to DeepFilter; Win to the self-attention-based WinnowNet.

For the marine datasets (see Fig. 1), WinnowNet outperformed MS2Rescore, identifying on average 12.6% more PSMs, 12.4% more peptides, and 9.3% more proteins. The intersection bar plot in Fig. 2 shows the shared and unique identifications at the 1% PSM-, peptide-, and protein-level FDR thresholds across the benchmark datasets. Specifically, for the two marine datasets, WinnowNet uniquely identified an average of 7727 PSMs, 5088 peptides, and 1561 proteins, whereas MS2Rescore uniquely identified 2562 PSMs, 1793 peptides, and 875 proteins.

Fig. 2: Comparison of PSM, peptide, and protein identification results between WinnowNet and MS2Rescore using MS-GF+ output.
figure 2

a PSM-level results. b Peptide-level results. c Protein-level results. The terms Gain and Loss denote identifications obtained exclusively by one of the two methods (WinnowNet or MS2Rescore), while Shared indicates identifications obtained by both methods. Protein groups were excluded from the comparison. Note that the numbers at the center, top, and bottom of each bar indicate the absolute numbers of identifications at the PSM, peptide, or protein level at a 1% FDR. The y-axis represents the relative percentages of identifications categorized as Shared, Loss, and Gain, with Shared identifications normalized to 100%.

In the human gut dataset, which is the most complex metaproteome tested with extensive MS/MS spectra and comprehensive protein databases, WinnowNet yielded an average increase of 8.0% more PSMs, 6.8% more peptides, and 5.7% more proteins than MS2Rescore. Specifically, WinnowNet uniquely identified 19,463 PSMs, 10,683 peptides, and 4180 proteins, in contrast to 6892 PSMs, 2137 peptides, and 1794 proteins found uniquely by MS2Rescore.

For the soil datasets (see Supplementary Table 16), WinnowNet achieved on average 9.4% more identified PSMs, 11.6% more peptides, and 7.6% more proteins at 1% FDR compared to MS2Rescore. On average, WinnowNet uniquely identified 13,531 PSMs, 4841 peptides, and 1247 proteins, whereas MS2Rescore uniquely identified 5408 PSMs, 1813 peptides, and 836 proteins.

WinnowNet also demonstrated strong performance on the mock datasets, which consist of an artificially assembled microbial community containing 30 species with uniform protein content. On these datasets, WinnowNet achieved an average increase of 9.1% more identified PSMs, 9.3% more peptides, and 7.5% more proteins at a 1% FDR (see Supplementary Table 17). In addition, WinnowNet uniquely identified up to 14,063 PSMs, 3655 peptides, and 1081 proteins, compared to up to 5905 PSMs, 846 peptides, and 779 proteins uniquely identified by MS2Rescore. A detailed analysis of the gained and lost identifications between WinnowNet and MS2Rescore is provided in Supplementary Discussion.

When compared to our previous method, DeepFilter, WinnowNet demonstrated consistent improvements: in the marine datasets, it identified 11.9% more PSMs, 10.0% more peptides, and 6.9% more proteins; in the soil datasets, increases averaged 7.8% for PSMs, 7.7% for peptides, and 4.8% for proteins; and in the mock community, improvements were 4.3% for PSMs, 4.8% for peptides, and 2.9% for proteins. Even in the complex human gut dataset, WinnowNet yielded average gains of 3.4% in PSMs, 3.8% in peptides, and 4.1% in proteins relative to DeepFilter. These results underscore WinnowNet’s enhanced ability to leverage spectral information, resulting in improved identification outcomes.

We also collected the number of identifications reported in the original publications for comparison (see Supplementary Table 18). Note that not all publications reported discoveries at the PSM, peptide, and protein levels, nor at the same FDR thresholds. WinnowNet consistently outperformed the original studies in terms of identification counts. For example, while the original publication reported 30,062 proteins, WinnowNet identified 36,143 proteins, representing a 20.2% increase. These findings underscore the potential of applying WinnowNet for the secondary analysis of existing datasets to uncover new biological insights.

Given WinnowNet’s improved performance relative to MS2Rescore, we further analyzed the score distributions to evaluate their comparative efficacy. Figure 3 and Supplementary Fig. 14 present score distributions for top-ranked PSMs from the marine 2 and mock P2 datasets as generated by WinnowNet (self-attention-based) and MS2Rescore, respectively. In contrast to MS2Rescore, WinnowNet assigns higher scores to a larger proportion of target PSMs at a 1% PSM-level FDR, with scores predominantly concentrated in the lower-right quadrant delineated by the solid cutoff lines. Notably, a bimodal score distribution is apparent in Supplementary Fig. 14 for MS2Rescore, a pattern not observed in Fig. 3. This discrepancy stems from the differences between the datasets: the Mock dataset employs a well-annotated protein database, whereas the Marine 2 dataset is based on an incomplete protein database derived from annotated assembled genomes. In metaproteomics, incomplete metagenome assemblies and technical biases during sample extraction frequently result in protein databases that represent only a subset of the actual proteome. As a consequence, many spectra correspond to peptides absent from these databases, leading to high-scoring false PSMs and causing the target PSM distribution to approximate that of decoy/entrapment PSMs. To control the FDR and exclude such false positives, metaproteomics analyses often require higher score thresholds compared to those used in simple culture-based proteomics, albeit at the expense of some true PSMs. The results presented in Fig. 3 and Supplementary Fig. 14 clearly demonstrate that WinnowNet’s incorporation of spectral information through auto-learned features yields a more robust filtering strategy compared to traditional methods, such as the SVM-based approach employed by MS2Rescore.

Fig. 3: Score distributions for top-ranked PSMs in the Marine 2 dataset, derived from WinnowNet (self-attention-based) and MS2Rescore.
figure 3

Original target PSMs are shown in blue, and entrapment and decoy PSMs in yellow. The top panel displays PSM scores from WinnowNet, the left panel shows scores from MS2Rescore, and the central panel illustrates the correlation between WinnowNet scores (x-axis) and MS2Rescore scores (y-axis). Solid lines indicate the 1% PSM-level FDR cutoffs for WinnowNet (0.92) and MS2Rescore (−0.05).

Performance evaluation of WinnowNet-integrated protein identification pipelines

We integrated the self-attention-based WinnowNet into four popular protein identification pipelines: our Sipros-Ensemble platform25, FragPipe43, Peaks Studio 12.544, and AlphaPept45. The evaluation was conducted on four benchmark datasets: Marine 3, Soil 3, P3, and Human Gut. For each pipeline, we compared the recommended workflow against an alternative in which the filtering step was replaced by WinnowNet. Due to the modular architecture of Peaks, directly substituting its built-in filter was not feasible. As a workaround, the alternative Peaks workflow performed the database search in Peaks, followed by WinnowNet filtering and protein inference using Philosopher within FragPipe. In all cases, WinnowNet was applied without any dataset-specific fine-tuning. The same entrapment method and FDR estimation described in section “Benchmark datasets and evaluation metrics” were applied.

Figure 4 and Supplementary Table 14 present the results. Across all four datasets and pipelines, integrating WinnowNet led to substantial improvements at the PSM, peptide, and protein levels. For instance, at the PSM level, identifications increased from 61,190 to 66,432 (8.6% improvement) for Sipros-Ensemble, from 47,970 to 53,276 (11.1%) for FragPipe, from 46,727 to 52,789 (13.0%) for Peaks, and from 43,791 to 49,841 (13.8%) for AlphaPept. At the peptide level, improvements were observed from 40,519 to 43,071 (6.3%) for Sipros-Ensemble, from 25,658 to 31,769 (23.8%) for FragPipe, from 24,864 to 30,091 (21.0%) for Peaks, and from 23,895 to 29,857 (25.0%) for AlphaPept. Substantial gains were also evident at the protein level; for example, in Marine 3, protein identifications increased from 9500 to 10,416 (9.6%) for Sipros-Ensemble, from 9909 to 10,277 (3.7%) for FragPipe, from 9001 to 9327 (3.6%) for Peaks, and from 8426 to 8796 (4.4%) for AlphaPept. Similar trends were observed in the Soil 3, P3, and Human Gut datasets, with percentage gains ranging approximately from 2.5% to 11.1% at the PSM level, 4.4% to 14.5% at the peptide level, and 1.1% to 12.8% at the protein level. These consistent improvements across diverse datasets and pipelines highlight the robustness and scalability of WinnowNet.

Fig. 4: PSM, peptide, and protein identification results at 1% FDR on the Marine 3, Soil 3, P3, and Human Gut datasets, using four metaproteomics pipelines with either default filters or WinnowNet as an alternative filter.
figure 4

SE denotes Sipros-Ensemble, R refers to the default filters used in the pipelines, and Win represents the self-attention-based WinnowNet method.

To simulate real-world analysis conditions for benchmarking, we constructed a composite protein database by combining proteins from the mock microbial cultures (30 species) with entrapment proteins from 27 foreign species present in the human gut microbiome (a complete list of foreign species is provided in Supplementary Table 2). The MS/MS dataset P1, generated from the mock microbial cultures, was searched against this database, which was further augmented with shuffled target sequences as decoys. PSMs corresponding to the mock microbial proteins were considered true identifications, whereas those mapping to the entrapment proteins were treated as false positives. The FDR was estimated using the target-decoy strategy36 as in Eq. (2) and controlled at 1%. Additionally, we calculated the false matching rate (FMR), defined as the proportion of false target identifications among all accepted targets at 1% FDR (see Eq. (3)). For benchmarking, we employed the same protein identification pipelines as in the previous analyses, namely Sipros-Ensemble, FragPipe, Peaks, and AlphaPept.

The identification results and FMR values are presented in Fig. 5 and Supplementary Table 15. All original and WinnowNet-enhanced pipelines demonstrated robust performance, consistently maintaining FMRs below 1% at the PSM and peptide levels. Notably, the integration of WinnowNet led to consistent improvements in identification accuracy across all pipelines. At the PSM level, WinnowNet increased identifications from 87,275 to 91,271 (+4.6%) for Sipros-Ensemble, from 84,448 to 88,691 (+5.0%) for FragPipe, from 85,033 to 89,013 (+4.7%) for Peaks, and from 82,891 to 87,381 (+5.4%) for AlphaPept. Similarly, at the peptide level, the number of identifications increased from 24,633 to 25,827 (+4.8%) for Sipros-Ensemble, from 20,235 to 21,138 (+4.5%) for FragPipe, from 21,155 to 22,849 (+8.0%) for Peaks, and from 19,243 to 20,195 (+4.9%) for AlphaPept. At the protein level, WinnowNet also led to notable gains: identifications increased from 7126 to 7389 (+3.7%) for Sipros-Ensemble, from 7007 to 7272 (+3.8%) for FragPipe, from 7015 to 7267 (+3.6%) for Peaks, and from 6715 to 6942 (+3.4%) for AlphaPept. These consistent improvements across all levels–PSM, peptide, and protein–highlight WinnowNet’s effectiveness in enhancing true identifications while maintaining a stringent FDR threshold.

Fig. 5: Comparison of identification results at multiple levels.
figure 5

a–c PSM, peptide, and protein identification results, along with the false matching rate (FMR), at 1% FDR on the mock P1 dataset mixed with foreign species, using four metaproteomics pipelines with either default filters or WinnowNet as an alternative filter. The blue line plot shows the FMR using default filters, while the red line plot shows the FMR when using WinnowNet. SE denotes Sipros-Ensemble, R refers to the default filters used in the pipelines, and Win represents the self-attention-based WinnowNet method.

An additional evaluation was performed using a dataset acquired with the timsTOF instrument. The corresponding results are provided in the Supplementary Discussion.

Analysis of the taxonomic profile of human gut metaproteome

To investigate the biological significance of the proteins identified exclusively by WinnowNet (CNN-based), we searched the human gut protein database against the NCBI public database using Protein-Protein BLAST version 2.11.0+46. Pathway annotations were performed using eggnog-mapper against the EggNOG database47,48. After excluding the protein groups that shared the same identified peptides, we found 1015 proteins only identified by WinnowNet. In Fig. 6, we present the species associated with proteins only identified by WinnowNet in the human gut metaproteome sample. The phylogenetic tree includes 50 taxa at the species level, providing a detailed taxonomic profile of proteins. Circular phylogenetic tree visualizations in Fig. 6 depict the number of genes and spectra for each species as blue and red bars, respectively. The percentage of genes for each species is represented as a decimal value between 0 and 1, calculated using the min-max normalization method. The number of spectra was determined by counting the total PSMs belonging to each species, considering only the PSMs associated with the unique peptides for each protein.

Fig. 6: Phylogenetic tree of proteins uniquely identified by WinnowNet.
figure 6

Phylogenetic tree with the corresponding gene and spectral abundances for the proteins identified only by WinnowNet*.

In our taxonomic profiling analysis, we observed that WinnowNet identified numerous gut microorganisms characterized by low gene counts. The mean normalized number of genes for each species stood at 24.27%, as annotated in the protein database. Notably, 33 species exhibited gene abundances lower than this average, with 4 out of these 33 species having gene abundances of less than 2%. Interestingly, these four species are recognized as common constituents of the human gut microbiome. For instance, Dorea longicatena contributes significantly to short-chain fatty acid (SCFA) production and plays a vital role in dietary carbohydrate metabolism49,50. Hungatella hathewayi is associated with various human infections51, while Sutterella sp. AM11-39 is linked to health conditions such as inflammatory bowel disease and autism52. These findings underscore WinnowNet’s proficiency in identifying proteins from species characterized by low-abundance genes.

Shifting the focus to the pathway level, proteins exclusively detected by WinnowNet are associated with three distinct pathways in the KEGG database: map05231, map00622, and map00440. Notably, the KEGG pathways Xylene degradation (map00622) and Phosphonate and phosphinate metabolism (map00440) play pivotal roles in probiotics, influencing infant health and human diet53,54.

Computation time

Table 1 summarizes the computation time for CNN-based and self-attention-based WinnowNet models compared to other filtering algorithms across various datasets. The lightweight architecture of the CNN-based WinnowNet is evident from its significantly reduced number of parameters, containing only 22.2% and 31.5% of those in DeepFilter and self-attention-based WinnowNet, respectively. This results in faster training and inference times, making it an efficient solution for PSM rescoring tasks. The self-attention-based WinnowNet model, with 2.6 million parameters, incorporates advanced spectrum representations at the expense of increased computation time. Benchmarks were performed on a workstation equipped with 8 NVIDIA GeForce RTX 2080 Ti GPUs (11 GB memory each) for neural network models. Baseline algorithms were executed on a desktop computer with a 2.3 GHz Intel Xeon Gold 5118 CPU and 32 GB of memory. Under GPU acceleration, the CNN-based WinnowNet completed most rescoring tasks within 10 minutes, while the self-attention-based WinnowNet required under 30 minutes for the majority of datasets. These results demonstrate the adaptability of WinnowNet, providing users with a trade-off between computational efficiency and advanced feature representation depending on their specific requirements.

Table 1 Computation time for benchmark datasets (reported in seconds)

Discussion

Existing re-scoring algorithms rely primarily on human-engineered features derived from PSM properties and spectrum comparison attributes. Recently, fragment intensity prediction models using deep learning have emerged to improve re-scoring accuracy55,56,57. In this study, we introduced WinnowNet, a re-scoring framework featuring two neural network architectures: one optimized for accuracy to reduce false identifications (self-attention-based WinnowNet) and the other lightweight for enhanced inference speed (CNN-based WinnowNet). While WinnowNet successfully improves PSM identification, it currently operates as a post-processing tool dependent on pre-screened candidates generated by traditional database search engines. A natural question arises: can we improve database search engines to such an extent that machine learning-based rescoring becomes redundant? Database search engines remain essential for candidate generation, but their shortcomings, particularly in handling complex spectra from metaproteomes and highly homologous peptides, limit their effectiveness. WinnowNet bridges these gaps by leveraging deep learning to optimize the rescoring step, effectively mitigating false identifications that arise from limitations in the initial database search. That said, re-scoring tools alone cannot eliminate the need for robust search engines.

To address these limitations, we envision a future extension of WinnowNet into a comprehensive database search engine. Moving beyond its current reliance on pre-processed PSM candidates, we aim to develop a pre-trained scoring model capable of ranking candidate peptides directly for a given spectrum and across multiple spectra. This transition will require innovations in handling the large search spaces inherent to MS-based proteomics. Computational efficiency, particularly inference speed, will be a key challenge. For instance, while the Marine 2 dataset search time using Sipros-Ensemble (a database search engine) on a 128-core node was 27 minutes, WinnowNet required 245 minutes for rescoring. Addressing this disparity will involve optimized CPU/GPU parallel implementations to accelerate inference.

Another critical consideration is the risk of overfitting caused by peptide sequence homology. As MS datasets grow larger, homologous peptide sequences between training and inference datasets may inflate identification performance. To evaluate this risk, we removed homologous peptides from our inference datasets using BLASTP46. Matches with E-values below 1e-10 were classified as homologous peptides. The results of this evaluation are presented in Supplementary Tables 1012. Overall, the homology rate across datasets was low (0.426% to 2.464%), yet removal of homologous peptides led to slight declines in identification performance. For soil metaproteomes, WinnowNet exhibited decreases of 0.04%, 0.04%, and 0.05% at the PSM, peptide, and protein levels, respectively, comparable to or slightly higher than other filtering tools. For human gut datasets, similar patterns were observed, with declines ranging from 0.05% to 0.07%. For mock communities, average decreases increased slightly as the proportion of homologous peptides grew (1.04% to 1.21% for WinnowNet, compared to 0.85% to 1.09% for other tools). Despite these small decreases, WinnowNet consistently performed better than all other filtering tools, indicating that peptide homology-induced overfitting does not compromise its performance. Moving forward, careful management of homologous peptide sequences and larger, diverse datasets will be essential to further enhance WinnowNet’s generalizability.
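As an illustration of this screening step, a hedged sketch is given below; it assumes BLASTP tabular output (-outfmt 6, where the E-value is the eleventh column), and the file path and helper names are hypothetical.

```python
def load_homologous_peptides(blast_tabular_path, evalue_cutoff=1e-10):
    """Collect query peptide IDs whose BLASTP hit has an E-value below the cutoff."""
    homologous = set()
    with open(blast_tabular_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            query_id, evalue = fields[0], float(fields[10])  # -outfmt 6: E-value is field 11
            if evalue < evalue_cutoff:
                homologous.add(query_id)
    return homologous


def remove_homologous(inference_peptides, homologous_ids):
    """Drop inference peptides flagged as homologous to training peptides."""
    return [p for p in inference_peptides if p not in homologous_ids]
```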

To gain insight into the features learned by WinnowNet, we examined the attention map weights generated by the experimental and theoretical spectrum encoders. These weights were aggregated, normalized, and projected onto the spectrum representations to assess the contribution of fragment ions in distinguishing true PSMs from false ones. For this analysis, three PSMs corresponding to the same experimental spectrum were selected: a top-ranked true positive, a second-ranked negative sample, and a low-ranked negative sample. The visualized feature maps are shown in Supplementary Fig. 6. The true positive sample (Supplementary Fig. 6(a)) exhibits more matching ions with higher attention weights than the other two. Notably, the y12(+1) ion received significant attention in both the experimental and theoretical spectra, underscoring its importance in classification. This observation highlights WinnowNet’s ability to capture the patterns required to correlate peaks between experimental and theoretical spectra. In contrast, negative samples with high (Supplementary Fig. 6(b)) and low (Supplementary Fig. 6(c)) predictive scores show distinct patterns. Supplementary Fig. 6(b), with a high predictive score, contains approximately 15 matching ions but with relatively lower weights, while Supplementary Fig. 6(c), with a low predictive score, has fewer matching ions and lower weights, demonstrating a clear correlation between ion matching patterns and model confidence.
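The exact aggregation procedure is not spelled out above; one plausible reading, shown here purely for illustration, is to sum the attention each peak receives across heads and layers and then min-max normalize the result.

```python
import torch


def peak_attention_importance(attention_maps):
    """Aggregate self-attention weights into one importance score per peak.

    `attention_maps` is a list of tensors, one per encoder layer, each of shape
    (n_heads, n_peaks, n_peaks). Returns a (n_peaks,) tensor scaled to [0, 1].
    """
    # Total attention each peak receives, summed over heads, query positions, and layers.
    received = torch.stack([layer.sum(dim=(0, 1)) for layer in attention_maps]).sum(dim=0)
    return (received - received.min()) / (received.max() - received.min() + 1e-12)
```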

Methods

Our work aims to enhance and accelerate peptide identification by leveraging an intelligent learning strategy and efficient neural network architectures with reduced input dimensionality and parameter size. This section describes how these objectives were achieved by detailing the components of WinnowNet, including curriculum learning for peptide identification, the construction of training datasets, the architecture of WinnowNet, and its training procedures. Figure 7b provides an overview of the WinnowNet workflow, highlighting the training dataset construction (Part A) and the curriculum learning process (Part B). Figure 7c depicts the detailed architecture of WinnowNet.

Fig. 7: WinnowNet workflow and architecture.
figure 7

a Overview of peptide and protein identification using database search engines. b Construction of the training datasets and the training process. c Architecture of self-attention-based WinnowNet.

In this study, we designed two neural network architectures: a self-attention-based model, referred to as WinnowNet, and a convolutional neural network (CNN)-based model, referred to as WinnowNet*. Because the self-attention-based WinnowNet generally demonstrated better performance, the CNN-based WinnowNet* is described in the Supplementary Methods for comparison. The trained WinnowNet model takes MS/MS spectra and the PSM identifications reported by database search engines as input and predicts the probability that a given PSM is a true match.

Training dataset construction

Twelve datasets were used in this study, including two training datasets and ten benchmark datasets. A summary of the MS/MS spectra and protein databases for each dataset is provided in Supplementary Table 1. The ProteomeTools dataset originates from a synthetic peptide library of the human proteome provided by the ProteomeTools project, while the remaining datasets are metaproteomes from five distinct microbial communities. These include three marine microbial communities (Marine 1, Marine 2, Marine 3)58, three soil microbial communities (Soil 1, Soil 2, Soil 3)59, a mock microbial community (P1, P2, P3)60, a human gut microbial community (HG)61, and a synthetic dataset consisting of a quad-culture of four microorganisms10. The ProteomeTools and Marine 1 datasets were used to construct the training datasets, while the remaining datasets were used for benchmarking. The raw files for the benchmarking datasets were obtained from public data repositories: the marine and soil metaproteome datasets are available in the PRIDE repository under identifier PXD007587, the mock community datasets under PXD006118, and the human gut dataset can be retrieved from the iProX repository under identifier IPX0001564000.

As illustrated in Fig. 7b, we constructed a high-quality training dataset by searching mass spectra from the respective datasets against their corresponding peptide libraries or protein databases (e.g., derived either from the original study of the ProteomeTools dataset or from the metagenome-assembled protein database of the Marine 1 data). This process employed three widely used database search engines: Comet40, MyriMatch41, and MS-GF+42. For each spectrum and search engine, the top five scoring PSM candidates were retained. Search results were filtered independently using Percolator to generate three lists of Percolator-scored PSMs. These lists were then merged, preserving all PSM candidates from the three search engines, resulting in up to fifteen candidates per MS/MS spectrum. In cases of duplicate PSMs, only the one with the lowest posterior error probability (PEP) assigned by Percolator18 was retained. PSMs with PEP  > 0.9 were excluded, while the top-ranked target PSMs were annotated as positive samples, and the remaining PSMs, regardless of target or decoy status, were annotated as negative samples. The ProteomeTools training dataset contained 240,000 total PSMs, including 150,785 positive samples and 89,215 negative samples. The Marine 1 training dataset contained 1,849,686 PSMs, of which 976,979 were positive samples.
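A schematic of this merge-and-label procedure is sketched below; the record layout and helper names are assumptions made for illustration and do not reflect the released WinnowNet code.

```python
from collections import defaultdict


def build_training_psms(engine_results, pep_cutoff=0.9):
    """Merge Percolator-scored PSMs from several search engines and assign labels.

    `engine_results` maps an engine name to a list of dicts with keys
    'spectrum', 'peptide', 'is_decoy', and Percolator's posterior error probability 'pep'.
    """
    candidates = defaultdict(dict)  # spectrum -> {(spectrum, peptide): best-PEP duplicate}
    for psms in engine_results.values():
        for psm in psms:
            key = (psm["spectrum"], psm["peptide"])
            kept = candidates[psm["spectrum"]].get(key)
            if kept is None or psm["pep"] < kept["pep"]:
                candidates[psm["spectrum"]][key] = psm  # deduplicate, keeping the lowest PEP

    labeled = []
    for spectrum_candidates in candidates.values():
        retained = [p for p in spectrum_candidates.values() if p["pep"] <= pep_cutoff]
        if not retained:
            continue
        best = min(retained, key=lambda p: p["pep"])  # top-ranked candidate for this spectrum
        for psm in retained:
            label = 1 if (psm is best and not psm["is_decoy"]) else 0
            labeled.append((psm, label))
    return labeled
```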

Curriculum learning in WinnowNet

The data-level curriculum learning (CL) approach trains machine learning models by progressively increasing data complexity, in an order similar to human curricula. CL can effectively enhance generalization ability and accelerate convergence in a wide range of applications, including image classification, object detection, scene classification, sentiment analysis, and sequence prediction29,62,63,64. The key components of CL are the difficulty measurer and the training scheduler. The former determines how easy each training sample is, while the latter decides the order in which datasets are presented during training based on the difficulty measurer65,66. For example, learning difficulty can be measured by the noise level of the data source62: images collected from Flickr (an image hosting service) were considered noisier, and thus harder, than those from Google Images. The training scheduler could then fine-tune the model on the Flickr dataset once the model had converged on the Google Images dataset.

The WinnowNet training was based on the CL strategy, with the difficulty level measured by dataset complexity. Specifically, we distinguished between datasets from single-organism proteomes and complex metaproteomes, as well as between synthetic peptide libraries and real-world data. The two training datasets are listed in Supplementary Table 1. The easier dataset originates from a synthetic peptide library designed to cover the complete human proteome, referred to as the ProteomeTools dataset. The harder dataset comes from a real-world metaproteome (Marine 1) and presents noisier and more complex PSM samples. The ProteomeTools dataset is from the PRIDE Proteomics Identifications database and includes Swiss-Prot annotated isoforms with post-translational modifications (PTMs) considered28. All peptides were individually synthesized with a purpose-built peptide synthesizer. PSM quality was controlled by searching the MS/MS spectra against the synthetic peptide library with a stringent cutoff to preserve only high-scoring PSMs; the PSMs in the ProteomeTools dataset can therefore be regarded as approximate ground truth with relatively low noise. The ProteomeTools dataset was used to train WinnowNet on the easier learning cases. Once WinnowNet converged on this dataset, we continued training it on the Marine 1 dataset. The construction of the Marine 1 dataset is described in section “Benchmark datasets and evaluation metrics.” The raw files were obtained from the following data repositories: the ProteomeTools dataset was retrieved from the PRIDE archive, available at PXD010595 and PXD004732, and the Marine 1 dataset was retrieved from PXD007587. Details of all datasets are described in Supplementary Note 1. The trained models and processed training datasets are available at https://figshare.com/articles/dataset/Models/25513531 and https://figshare.com/articles/dataset/Datasets/25511770.
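In code, this schedule reduces to an ordered pass over the two training datasets, for example (an illustrative sketch; `fit_until_converged` is a placeholder for the actual training routine):

```python
def curriculum_schedule(model, staged_loaders, fit_until_converged):
    """Data-level curriculum: train on datasets ordered from easy to hard.

    `staged_loaders` is an ordered list such as [proteometools_loader, marine1_loader];
    `fit_until_converged` trains the model on one loader until convergence.
    """
    for loader in staged_loaders:  # near ground-truth PSMs first, noisy metaproteome PSMs last
        fit_until_converged(model, loader)
    return model
```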

Spectrum embedding

Each PSM candidate consists of a measured spectrum and a theoretical spectrum, with visualization examples provided in Supplementary Fig. 5. The theoretical spectrum was generated from the corresponding peptide sequences. To process the PSM data, we normalized the intensities and isotopic probabilities of both the experimental and theoretical spectra. These normalized tuples were then input into the spectrum embedding layer. The spectrum embedding is based on sinusoidal embedding projection, inspired by research in de novo peptide sequencing67,68, as defined in Eq. (4), where \(M/{Z}_{\max }\) is 7000, \(M/{Z}_{\min }\) is 0.001, and d represents the embedding dimension (set to 256 in this study). The spectrum embedding transforms m/z values into a 256-dimensional vector, while the corresponding fragment ion intensity values are transformed through a linear layer. Finally, the transformed intensity values are concatenated with their respective m/z embeddings to form the complete representation. This spectrum encoding provides a unique representation for each peak in a spectrum, ensuring that the subsequent model captures the positional relationships among peaks. Unlike traditional approaches that rely on structured high-dimensional arrays to encode peak indices, our method leverages convolutional and self-attention mechanisms, which inherently operate independently of the input order. Specifically, the self-attention layers compute relationships between all peaks in two spectra without being constrained by their sequential arrangement, thereby achieving an order-invariant property. Here, order-invariance refers to the model’s ability to identify and compare spectral features regardless of their original position in the input representation, similar to how convolutional models exhibit translation-invariance in image processing. This property is particularly advantageous for handling mass spectrometry data, where peaks do not have a strict ordering constraint, allowing for more flexible and computationally efficient spectrum analysis. This approach eliminates the need for a large matrix to represent peak indices, as required in many studies, such as DeepNovo69 and our previous work26.

$$\boldsymbol{Emb}=\left\{\begin{array}{ll}\sin \left((m/z)\bigg/\left(\frac{{(M/Z)}_{\max }}{{(M/Z)}_{\min }}{\left(\frac{{(M/Z)}_{\max }}{2\pi }\right)}^{2i/d}\right)\right),\quad &\text{for}\ i\le d/2\\ \cos \left((m/z)\bigg/\left(\frac{{(M/Z)}_{\max }}{{(M/Z)}_{\min }}{\left(\frac{{(M/Z)}_{\max }}{2\pi }\right)}^{2i/d}\right)\right),\quad &\text{for}\ i > d/2\end{array}\right.$$
(4)
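A minimal PyTorch sketch of Eq. (4) and the subsequent intensity concatenation is given below. The handling of the midpoint index, the parameter names, and the omission of the isotopic-probability channel are our simplifying assumptions rather than the released implementation.

```python
import math

import torch
from torch import nn


def mz_sinusoidal_embedding(mz, d=256, mz_max=7000.0, mz_min=0.001):
    """Sinusoidal m/z embedding following Eq. (4).

    `mz` is a tensor of shape (n_peaks,); the result has shape (n_peaks, d),
    with sine used for the first half of the dimensions and cosine for the rest.
    """
    i = torch.arange(d, dtype=torch.float32)
    scale = (mz_max / mz_min) * (mz_max / (2 * math.pi)) ** (2 * i / d)
    angles = mz.unsqueeze(1) / scale  # broadcast to (n_peaks, d)
    return torch.where(i < d / 2, torch.sin(angles), torch.cos(angles))


class PeakEmbedding(nn.Module):
    """Concatenate the m/z embedding with a linearly projected intensity value."""

    def __init__(self, d=256):
        super().__init__()
        self.d = d
        self.intensity_proj = nn.Linear(1, d)

    def forward(self, mz, intensity):
        # mz, intensity: tensors of shape (n_peaks,)
        mz_emb = mz_sinusoidal_embedding(mz, self.d)
        intensity_emb = self.intensity_proj(intensity.unsqueeze(1))
        return torch.cat([mz_emb, intensity_emb], dim=-1)  # (n_peaks, 2 * d)
```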

Spectrum encoders and loss function

The experimental and theoretical spectrum encoders utilize the self-attention mechanism to capture relationships between input features effectively. The process begins by embedding the fragment ions from both spectra to create spectrum embeddings. These embeddings are further contextualized using a transformer encoder comprising four self-attention layers, each with four attention heads. The outputs of the two encoders are concatenated, allowing the self-attention mechanism to model the similarity between experimental and theoretical fragment ions for each PSM. This combined representation is passed through a fully connected layer with 1024 hidden dimensions to produce the final feature vector.
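The following PyTorch sketch illustrates this two-encoder design. The layer count, head count, and hidden size follow the description above, while the embedding width (taken as the concatenated peak embedding from the previous sketch), the pooling, and the fusion details are assumptions made for illustration.

```python
import torch
from torch import nn


class SpectrumPairEncoder(nn.Module):
    """Two transformer encoders over peak embeddings, fused for PSM scoring (a sketch)."""

    def __init__(self, d_model=512, n_layers=4, n_heads=4, hidden=1024):
        super().__init__()
        self.experimental = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True),
            num_layers=n_layers,
        )
        self.theoretical = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True),
            num_layers=n_layers,
        )
        self.head = nn.Sequential(nn.Linear(2 * d_model, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, exp_peaks, theo_peaks):
        # exp_peaks, theo_peaks: (batch, n_peaks, d_model) peak embeddings
        exp = self.experimental(exp_peaks).mean(dim=1)   # pooled experimental representation
        theo = self.theoretical(theo_peaks).mean(dim=1)  # pooled theoretical representation
        logit = self.head(torch.cat([exp, theo], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)          # probability that the PSM is a true match
```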

The model is optimized using the loss function defined in Eq. (5), where qi denotes the q-value of a PSM sample and pi represents the predicted probability of a true PSM. To calculate the q-value of a PSM, the following steps are performed: let f represent the score of a PSM reported by a search engine, and let t and d denote the numbers of target and decoy identifications with scores better than f, respectively. The estimated false discovery rate (FDR) at a score threshold f is computed using Eq. (6), and the q-value for the PSM with score f is derived using Eq. (7).

$$Loss=-\sum [{q}_{i}\log ({p}_{i})+(1-{q}_{i})\log (1-{p}_{i})]$$
(5)
$$FDR(\,f)=\frac{d}{t}$$
(6)
$$q(\,f)={\min }_{{f}_{i}\le f}(FDR(\,{f}_{i}))$$
(7)
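A compact sketch of Eqs. (6) and (7), assuming higher scores are better and using illustrative names, is:

```python
def compute_qvalues(scores, is_decoy):
    """Compute q-values per Eqs. (6) and (7) from target/decoy PSM scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdr_at_rank, n_target, n_decoy = [], 0, 0
    for idx in order:
        if is_decoy[idx]:
            n_decoy += 1
        else:
            n_target += 1
        fdr_at_rank.append(n_decoy / n_target if n_target else 1.0)  # Eq. (6) at this threshold

    # Eq. (7): minimum FDR over all thresholds at or below each PSM's score.
    qvals = [0.0] * len(scores)
    running_min = float("inf")
    for rank in reversed(range(len(order))):
        running_min = min(running_min, fdr_at_rank[rank])
        qvals[order[rank]] = running_min
    return qvals
```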

WinnowNet training

WinnowNet was implemented using PyTorch version 1.4.0 and trained on a workstation with 8 GeForce RTX 2080 Ti GPUs. The learning rate and weight decay were both set to 1e-5, and the mini-batch size was set to 32.

For the easy task, we split the ProteomeTools dataset into training, validation, and testing sets at a ratio of 8:1:1. For the hard task of training on the Marine 1 dataset, we adjusted the ratio between the training and validation sets to 9:1. WinnowNet demonstrated convergence on the ProteomeTools dataset with training and validation accuracies of 99.32% and 99.05%, respectively. The training process for WinnowNet utilized an early stopping mechanism to prevent overfitting and optimize model convergence. Specifically, the validation loss was monitored after each training epoch. If no improvement in validation loss was observed for 10 consecutive epochs, the training process was halted. This approach ensures that the model stops training once it reaches optimal performance on the validation set, minimizing unnecessary computation and mitigating overfitting. The maximum number of epochs was set to 200 as a safeguard, although early stopping typically occurred much earlier, as indicated by the convergence trends in Supplementary Fig. 8.
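The early-stopping rule described above can be summarized by the following sketch (illustrative function names; not the released training script):

```python
def train_with_early_stopping(model, run_epoch, validate, patience=10, max_epochs=200):
    """Stop training once the validation loss fails to improve for `patience` epochs."""
    best_loss, stale_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        run_epoch(model, epoch)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, stale_epochs = val_loss, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # no improvement for 10 consecutive epochs
    return model
```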

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.