Unifying the analysis of bottom-up proteomics data with CHIMERYS

Frejno, Martin; Berger, Michelle T.; Tüshaus, Johanna; Hogrebe, Alexander; Seefried, Florian; Graber, Michael; Samaras, Patroklos; Ben Fredj, Samia; Sukumar, Vishal; Eljagh, Layla; Bronshtein, Igor; Mamisashvili, Lizi; Schneider, Markus; Gessulat, Siegfried; Schmidt, Tobias; Kuster, Bernhard; Zolg, Daniel P.; Wilhelm, Mathias

doi:10.1038/s41592-025-02663-w

Article
Open access
Published: 22 April 2025

Nature Methods volume 22, pages 1017–1027 (2025)

Abstract

Proteomic workflows generate vastly complex peptide mixtures that are analyzed by liquid chromatography–tandem mass spectrometry, creating thousands of spectra, most of which are chimeric and contain fragment ions from more than one peptide. Because of differences in data acquisition strategies such as data-dependent, data-independent or parallel reaction monitoring, separate software packages employing different analysis concepts are used for peptide identification and quantification, even though the underlying information is principally the same. Here, we introduce CHIMERYS, a spectrum-centric search algorithm designed for the deconvolution of chimeric spectra that unifies proteomic data analysis. Using accurate predictions of peptide retention time, fragment ion intensities and applying regularized linear regression, it explains as much fragment ion intensity as possible with as few peptides as possible. Together with rigorous false discovery rate control, CHIMERYS accurately identifies and quantifies multiple peptides per tandem mass spectrum in data-dependent, data-independent or parallel reaction monitoring experiments.

Main

Mass spectrometry (MS)-based bottom-up proteomics is the mainstay technology for high-throughput protein identification and quantification today^1,2,3. The former is achieved by matching theoretical, predicted or library fragment ion mass spectra (MS2) to experimental MS2 spectra, which contain sequence and amino acid modification information on peptide precursor ions, measured in precursor mass spectra (MS1). Today, MS2 spectra are typically acquired in data-dependent (DDA), data-independent (DIA) or parallel reaction monitoring (PRM) mode. Peptide quantification either uses the precursor intensity from MS1 (DDA) or fragment ion intensities from MS2 (DIA and PRM) spectra. A central challenge for data analysis is the fact that most MS2 spectra are chimeric (they contain more than one peptide)^4,5,6. This is because liquid chromatography–tandem MS (LC–MS/MS) systems cannot fully separate the vast number of peptides resulting from whole proteome enzymatic digestion, in particular when short gradients or no liquid chromatography at all is employed, as exemplified by direct infusion-shotgun proteome analyses (DI-SPA)⁷.

DIA MS2 spectra are usually more complex than DDA MS2 spectra because they are typically acquired with wider isolation windows to maintain low MS cycle times (important for quantification) and hence contain fragment ions from many different precursors⁸. Although DDA and PRM MS2 spectra are normally acquired to minimize co-isolation, they are also chimeric, albeit to a lesser extent⁶. Because of the way that data acquisition approaches have evolved, the corresponding data types are analyzed differently⁹, making it difficult to compare them in an unbiased fashion¹⁰.

DDA data are analyzed in a spectrum-centric fashion⁹. Database search algorithms for DDA data attempt to maximize identifications from chimeric spectra by submitting them for each precursor detected in the isolation window. Frequently, fragment ions explained by a given peptide are removed from the spectrum before it is searched again in a subtractive approach^6,11. While often able to identify multiple peptides, this approach under-utilizes spectral information when fragments are shared between peptides, resulting in reduced sensitivity. When fragment ions are not removed before an additional search (multiplicative approach), the same information may be used too often, resulting in reduced specificity. In the end, the central output of DDA search engines is one or multiple peptide-spectrum matches (PSMs) per experimental MS2 spectrum.

In contrast, DIA and PRM data analysis usually follows a peptide-centric approach that asks whether peptides from a predefined list are detectable in the experimental data^9,12. This approach requires spectral libraries, which can be generated from previous experimental data or predicted via machine-learning or deep-learning models. Subsequently, the queried peptides are detected and quantified in MS1 and/or MS2 spectra by extracting co-eluting (fragment) ion chromatograms (XICs) based on the spectral library. Recently, library-free approaches such as DIA-Umpire¹³, PECAN¹⁴, directDIA¹⁵ (implemented in Spectronaut), MSFragger-DIA¹⁶ and diaTracer¹⁷ have gained popularity due to their simplicity. In brief, these tools do not require the generation of a spectral library and instead identify peptides in DIA data given a set of query peptides by directly scoring experimental MS2 or ‘pseudo-MS/MS’ spectra against theoretical spectra.

Because of the molecular complexity of proteomic samples and the large quantities of MS2 spectra of varying quality that are generated by LC–MS/MS, accurate false discovery rate (FDR) control is important, particularly in large-scale projects. While FDR control for DDA data is rather mature^18,19,20,21, it is still a substantial challenge for DIA data. Constructing realistic decoy MS2 spectra and retention times is far from obvious, an issue increasingly realized and addressed by machine-learning models for peptide property prediction^22,23,24.

In this work, we introduce a spectrum-centric and data acquisition method-agnostic algorithm for the analysis of MS2 spectra, implemented in CHIMERYS. It deconvolutes any MS2 spectrum, regardless of whether it was acquired by DDA, DIA or PRM, thus unifying the analysis of bottom-up proteomics data. We build upon a concept introduced for the deconvolution of DIA spectra using spectral libraries⁸ and leverage deep-learning-based predictions of fragment ion intensities from INFERYS²⁵ in conjunction with linear algebra for the deconvolution of MS2 spectra. The resulting signal contributions of each peptide identified in each MS2 spectrum can be combined into a quantitative readout. Applying the approach substantially enhances identification rates of PSMs, peptides, and proteins across all sample types in DDA, enables the hands-off processing of PRM data and matches the performance of alternative DIA software while maintaining accurate FDR control throughout.

Results

Deconvolution of chimeric DDA spectra

The core assumption behind CHIMERYS is that chimeric MS2 spectra are linear combinations of pure spectra from co-isolated precursors. The algorithm is entirely spectrum-centric and employs non-negative L1-regularized regression via the LASSO²⁶ to explain as much experimental intensity as possible with as few peptide precursors as possible (Fig. 1a). It uses highly accurate predictions of fragment ion intensities and retention times for target and decoy peptides instead of spectral libraries.
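The linear-combination assumption can be illustrated with a small toy example. The following is a minimal sketch of non-negative L1-regularized regression solved by coordinate descent; it is not the CHIMERYS implementation. The function name `nonneg_lasso`, the penalty value `lam` and the three toy "predicted spectra" are assumptions made purely for illustration.

```python
def nonneg_lasso(predicted_spectra, experimental, lam, n_iter=200):
    """Minimize 0.5*||y - sum_j b_j * x_j||^2 + lam * sum_j b_j with b_j >= 0.

    predicted_spectra: list of candidate spectra x_j (lists of intensities on a
                       shared fragment grid); experimental: the spectrum y.
    Returns one non-negative coefficient per candidate (hypothetical helper).
    """
    coef = [0.0] * len(predicted_spectra)
    residual = list(experimental)  # y - sum_j b_j * x_j, with all b_j = 0 at start
    sq_norm = [sum(v * v for v in x) for x in predicted_spectra]
    for _ in range(n_iter):
        for j, x in enumerate(predicted_spectra):
            if sq_norm[j] == 0.0:
                continue
            # Dot product of x_j with the residual that excludes x_j's own contribution
            rho = sum(c * (r + c * coef[j]) for c, r in zip(x, residual))
            # Soft-threshold by lam, then clip at zero to enforce non-negativity
            new = max(0.0, (rho - lam) / sq_norm[j])
            delta = new - coef[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(x, residual)]
                coef[j] = new
    return coef


# Toy chimeric spectrum: 3 parts peptide 0 plus 1 part peptide 1; peptide 2 is absent.
p0 = [0.6, 0.3, 0.1, 0.0, 0.0]
p1 = [0.0, 0.3, 0.0, 0.5, 0.2]
p2 = [0.1, 0.0, 0.0, 0.0, 0.9]
spectrum = [3.0 * a + 1.0 * b for a, b in zip(p0, p1)]
coefficients = nonneg_lasso([p0, p1, p2], spectrum, lam=0.01)
```

In this toy case the two true contributors receive coefficients close to 3 and 1 (slightly shrunk by the penalty), while the absent candidate is set to exactly zero, which mirrors the goal of explaining as much intensity as possible with as few precursors as possible.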

In brief, predicted MS2 spectra from precursors with predicted retention times that fall within a data-dependent retention time window and precursor isotope envelopes that (partially) overlap with the isolation window are compared to experimental MS2 spectra. Matching is based on multiple fragment ion intensity-free and -dependent scores for each PSM (Methods). Spurious PSMs are removed based on some of these scores. For example, PSMs are required to have at least three matched fragment ions, one of which must be the most abundant peak of the prediction and another one of which must be among the top three most-intense peaks of the prediction. PSMs passing these criteria are used for deconvolution, where they compete for experimental fragment ion intensity in one concerted step; an approach fundamentally different from classic methods (Fig. 1a). PSMs with enough contribution to the experimental spectrum as measured by CHIMERYS coefficients and that pass additional score filters are handed to mokapot²⁰ for PSM-level FDR control, specifically allowing for multiple PSMs per spectrum, similar to DIAmeter²⁷.
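The example filter described above (at least three matched fragment ions, including the predicted base peak and one further peak among the three most intense predicted peaks) could be sketched as follows. This is one reading of the stated criteria, not the actual CHIMERYS code; the function name, the fragment labels and the dictionary layout are hypothetical.

```python
def passes_fragment_filter(predicted, matched, min_matched=3):
    """Return True if a candidate PSM passes the example pre-filter.

    predicted: dict mapping fragment label -> predicted intensity
    matched:   set of fragment labels found in the experimental spectrum
    """
    # Rank predicted fragments from most to least intense
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    n_matched = sum(1 for fragment in ranked if fragment in matched)
    if n_matched < min_matched:
        return False
    # The most abundant predicted peak must be matched
    if ranked[0] not in matched:
        return False
    # One further match must be among the top three predicted peaks (ranks 2 and 3)
    return any(fragment in matched for fragment in ranked[1:3])


prediction = {"y1": 0.1, "y3": 1.0, "y4": 0.6, "b2": 0.5, "y5": 0.2}
keep = passes_fragment_filter(prediction, {"y3", "b2", "y1"})
```

Ties in predicted intensity are broken by insertion order here, which is an arbitrary choice for this sketch.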

We validated this FDR estimation on data with varying chimericity by systematically increasing the isolation window width of 1-h single-shot measurements (pancreatic mouse cell digest) from 1.4 to 20.4 Th using entrapment experiments (Supplementary Methods). Figure 1b shows that CHIMERYS’ peptide group-level q-values correspond to empirical q-values calculated based on entrapment identifications with the classic entrapment FDR (eFDR) approach, independent of isolation window width.
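To make the idea of empirical q-values from entrapment hits concrete, the sketch below ranks identifications by score and estimates the false discovery proportion from the number of entrapment hits. The estimator used, N_E(1 + 1/r)/N_total with r the entrapment-to-target database size ratio, is an assumption taken from the general entrapment literature; the exact definition used in this study is given in its Supplementary Methods and may differ. All names are hypothetical.

```python
def entrapment_qvalues(scores, is_entrapment, r=1.0):
    """Empirical q-values from entrapment hits (illustrative estimator only).

    scores:        higher is better
    is_entrapment: True where the identification maps to the entrapment database
    r:             assumed entrapment-to-target database size ratio
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdp = []
    n_entrapment = 0
    for rank, i in enumerate(order, start=1):
        n_entrapment += 1 if is_entrapment[i] else 0
        # Each entrapment hit is taken to imply 1/r additional false target hits
        fdp.append(min(1.0, n_entrapment * (1.0 + 1.0 / r) / rank))
    # q-value: smallest estimated FDP at this or any more permissive threshold
    qvalues = [0.0] * len(scores)
    running_min = 1.0
    for pos in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdp[pos])
        qvalues[order[pos]] = running_min
    return qvalues


q = entrapment_qvalues(
    [10, 9, 8, 7, 6, 5],
    [False, False, False, True, False, False],
)
```

Plotting such empirical q-values against the self-reported q-values of a search engine gives the kind of calibration comparison described above.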

Figure 1c displays the confident identification of six precursors with relative contributions to the experimental total ion current ranging from 4% to 54% from a 2-h HeLa DDA single-shot measurement in a mirror spectrum. Their predicted retention times differ from the scan’s observed retention time by 1.14 min on average, corresponding to less than 1% deviation relative to the gradient length of 120 min. Notably, the experimental intensities for the y1, y1-NH₃ and y1-H₂O ions that are shared between five of these precursors (C-terminal lysine) align well with the sum of their predicted intensities, scaled by their respective CHIMERYS coefficients, which can be interpreted as the interference-corrected total ion current of a precursor in an MS2 spectrum (Methods). This exemplifies how the algorithm identifies multiple peptides in chimeric spectra, while distributing intensities of shared fragment ions. Peptides identified by CHIMERYS recapitulate the expected quantitative ratios in a multi-organism-mixture experiment (Fig. 1d). This renders CHIMERYS suitable for approaches like wide-window DDA (also termed WWA or wwDDA)^28,29 and the analysis of DIA data.

To assess the performance of the algorithm on DDA data, we analyzed a 2-h HeLa cell digest with 1.3-Th MS2 isolation windows. CHIMERYS identified 238,795 PSMs at 1% run-specific PSM FDR with >85% of MS/MS spectra yielding one or more PSMs (identification rate; Extended Data Fig. 1a). More than two-thirds of the identified MS2 spectra contained more than one precursor (Extended Data Fig. 1b), confirming previous observations⁶. Fragment ions shared between different peptides were detected across the full MS2 m/z range with an expected higher frequency ≤200 m/z (Supplementary Fig. 1), rendering current strategies for handling chimeric spectra such as subtractive and multiplicative approaches error prone. Comparing these results to eight academic and commercial DDA search engines (Fig. 1e) revealed that CHIMERYS identifies many additional peptide groups (Extended Data Fig. 2a–c) in less time than was spent on data acquisition (Extended Data Fig. 2d). Most of these additional identifications were low abundant (Extended Data Fig. 3a). As such, they had fewer matched fragment ions than shared peptide groups (median of 10 versus 17; Extended Data Fig. 3b) but still high normalized spectral contrast angles³⁰ (median of 0.69 versus 0.85; Extended Data Fig. 3c and Supplementary Discussion). Hence, they are readily distinguished from decoys using mokapot’s support vector machine score that aggregates CHIMERYS’ score set (Extended Data Fig. 3d). Reassuringly, CHIMERYS-unique peptide groups markedly increased the number of peptides per protein group in CHIMERYS compared to Sequest HT (Extended Data Fig. 3e). It is worth noting that some of these search engines do not control FDR at the same level, which has a substantial influence on such comparisons (Extended Data Fig. 3f,g and Supplementary Table 1). 
Controlling FDR at a ‘lower’ level and counting identifications at a ‘higher’ level (for example counting peptides at PSM FDR) will usually overestimate the number of identifications. Identifications need to be reported at the same level at which FDR is controlled.
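For readers unfamiliar with the normalized spectral contrast angle used above, a minimal sketch follows. It uses the common definition 1 − 2·arccos(cos θ)/π, where cos θ is the cosine similarity between the predicted and experimental intensity vectors; whether CHIMERYS applies additional preprocessing before computing this score is not stated here, so the function is illustrative only.

```python
import math


def normalized_spectral_contrast_angle(predicted, observed):
    """Similarity in [0, 1] for non-negative, aligned intensity vectors."""
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm = math.sqrt(sum(p * p for p in predicted)) * math.sqrt(
        sum(o * o for o in observed)
    )
    if norm == 0.0:
        return 0.0
    # Clamp to guard against rounding slightly outside acos' domain
    cosine = max(-1.0, min(1.0, dot / norm))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


identical = normalized_spectral_contrast_angle([0.2, 1.0, 0.5], [0.2, 1.0, 0.5])
orthogonal = normalized_spectral_contrast_angle([1.0, 0.0], [0.0, 1.0])
```

Identical spectra score 1 and spectra without any shared signal score 0, so the reported medians of 0.69 and 0.85 both indicate substantial agreement between prediction and experiment.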

The gains observed for HeLa digests relative to Sequest HT (same protein grouping, as well as peptide group- and protein-level FDR estimation as CHIMERYS in Proteome Discoverer; PD) and MSFragger³¹ (second highest number of identified peptide groups after CHIMERYS) were corroborated using CHIMERYS v.2.7.9 with more difficult biological samples at the protein group level (urine³², +21%/+11%; CSF³², +17%/+4%; plasma³², +10%/−10%; formalin-fixed paraffin-embedded (FFPE) material, +35%/+21%; secretomes³³, between +33%/−4% and +71%/+27%, Arabidopsis thaliana³⁴, +13%/+1%; Halobacterium³⁴, +20%/+6% for Sequest HT/MSFragger; Extended Data Fig. 4a–f), as well as using CHIMERYS v.4.0.21 with samples enriched for phosphorylated, acetylated and ubiquitinated peptides at the precursor level (phosphorylation³⁵, +64%/+36%; acetylation³⁶, +98%/+8%; ubiquitination, +88%/+45% for Sequest HT/MSFragger; Extended Data Fig. 4g). Extended Data Fig. 4h visualizes prediction accuracy of INFERYS v.4.0.0 for various post-translational modifications (PTMs). These data highlight that CHIMERYS substantially increases the analysis depth of DDA data.

Revisiting legacy data using CHIMERYS

We conducted a retrospective study of HeLa single-shot analyses spanning many years and Orbitrap instrument generations. Despite many differences that impair a fair comparison, a clear trend was observed, in that the higher the speed and sensitivity of the instrument, the higher the advantage of CHIMERYS over Sequest HT (Fig. 2a and Extended Data Fig. 5a).

Next, we investigated low-resolution ion trap data (ITMS), comparing CHIMERYS to Sequest HT on unprocessed spectra and on spectra filtered for the top 15 most abundant fragments per 100-Th window (Fig. 2b). In contrast to Orbitrap data, we observed a notable improvement by removing low-abundance peaks in ITMS spectra. Specifically, CHIMERYS identified 74% more PSMs, 35% more peptide groups, and 30% more protein groups compared to Sequest HT on unprocessed spectra, while it identified 94% more PSMs, 47% more peptide groups, and 37% more protein groups on spectra preprocessed with a top 15 by 100-Th filter from a HeLa digest. Both examples show that substantially more information can be extracted from legacy data by harnessing the information contained in chimeric spectra.
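A peak filter of the "top 15 per 100 Th" kind can be sketched in a few lines. The version below assumes fixed, non-overlapping windows aligned at multiples of the window width; the actual preprocessing used for the ITMS spectra may define windows differently. The function name and data layout are hypothetical.

```python
def top_n_per_window(peaks, n=15, window=100.0):
    """Keep the n most intense peaks in each fixed m/z window.

    peaks: list of (mz, intensity) tuples; returns the kept peaks sorted by m/z.
    """
    bins = {}
    for mz, intensity in peaks:
        # Assign each peak to the window [k*window, (k+1)*window)
        bins.setdefault(int(mz // window), []).append((mz, intensity))
    kept = []
    for members in bins.values():
        members.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(members[:n])
    return sorted(kept)


toy_peaks = [(110.0, 5.0), (150.0, 1.0), (190.0, 9.0), (210.0, 2.0)]
filtered = top_n_per_window(toy_peaks, n=2)
```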

Optimizing data acquisition with deconvolution in mind

We assessed to what extent CHIMERYS’ capability to deconvolute highly complex spectra can be used to optimize data acquisition. First, we evaluated LC gradients with the goal to increase sample throughput per day (SPD; Methods). Figure 2c shows that CHIMERYS identified a similar number of peptide and protein groups from a 30 min measurement of a pancreatic mouse cell digest (48 SPD) as Sequest HT from a 120 min measurement (12 SPD), increasing throughput by a factor of four.

Next, we explored a possible increase in identification efficiency by widening the isolation window in DDA (between 1.4 Th and 20.4 Th; Fig. 2d,e and Extended Data Fig. 5b–d). The analysis revealed that the number of identified PSMs from a pancreatic mouse cell digest increased with wider isolation windows and began to plateau at >8 m/z. In high-load samples like these, this is likely due to the automatic gain control (AGC) limit, which together with the dynamic range of MS2 spectra, limits the number of precursors in chimeric spectra with a sufficient number of detectable fragment ions. The number of unique peptide group (and protein) identifications reached its maximum already at a window size of 3.4 Th for this specific dataset and decreased for larger isolation windows. This is likely because more and more PSMs were from the same, high-abundant peptides that were now co-isolated more often. In contrast to that, it was previously shown for low-load samples that disabling the AGC limit together with extended injection times enabled CHIMERYS to detect more unique peptides with very wide isolation windows²⁹. There, the reduction in the number of MS2 scans due to the extended injection times resulted in a concomitant decrease in the number of peptide identifications with classic search engines, which CHIMERYS could counteract by identifying many PSMs from these highly chimeric, wide-window scans.

Deconvolution of chimeric DIA spectra

CHIMERYS deconvolutes DIA spectra in the same way as DDA spectra. The only difference is that DIA spectra are usually more chimeric. Exemplified by a high-load LFQbench-type multi-organism mixture dataset³⁷, CHIMERYS identified an average of 529,993 PSMs per raw file at 1% run-specific PSM FDR, mapping to 66,888 unique peptide groups and 7,331 unique protein groups at 1% global peptide group and protein FDR, respectively, with an overall identification rate of >60% (Supplementary Fig. 2a). More than 82% of identified MS2 spectra contained more than one precursor (Supplementary Fig. 2b) and shared fragment ions were more frequent than in DDA, emphasizing the need for spectrum deconvolution that assigns shared fragment ions pro rata to the contributing peptides (Supplementary Fig. 2c–j).
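One way to picture the pro rata assignment of a shared fragment ion is to split its observed intensity in proportion to each precursor's expected contribution, taken here as coefficient times predicted relative intensity. This is a simplified reading for illustration, with hypothetical names and numbers, and not a description of CHIMERYS internals.

```python
def split_shared_fragment(observed, coefficients, predicted):
    """Split one observed fragment intensity among co-isolated precursors.

    coefficients: per-precursor coefficients for this MS2 spectrum
    predicted:    per-precursor predicted relative intensity of the shared fragment
    """
    expected = [c * p for c, p in zip(coefficients, predicted)]
    total = sum(expected)
    if total == 0.0:
        return [0.0] * len(expected)
    return [observed * e / total for e in expected]


# Two precursors share a y1 ion observed at intensity 100
shares = split_shared_fragment(100.0, [300.0, 100.0], [0.2, 0.4])
```

Here the first precursor is expected to contribute 60 and the second 40, so the observed intensity of 100 is divided 60:40 rather than being assigned entirely to one peptide or counted twice.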

Comparison to other DIA search engines

We compared results obtained with CHIMERYS on DIA data to the library-free workflows implemented in DIA-NN³⁸ and Spectronaut³⁹ using entrapment experiments to validate FDR control in the run-specific context⁴⁰ (Supplementary Methods provide context definitions and search parameters). The results show that CHIMERYS’ self-reported q-values correspond to empirical q-values calculated based on entrapment identifications (Extended Data Fig. 6a). DIA-NN and Spectronaut seemed to underestimate FDR based on all three or the peptide and concatenated entrapment approaches, respectively (Extended Data Fig. 6b,c). Recently proposed more stringent settings for Spectronaut⁴¹ had little, if any, effect on this issue (Extended Data Fig. 6d). Similar observations were made when analyzing the TimsTOF Pro data of the LFQbench-type dataset using DIA-NN and Spectronaut (Extended Data Fig. 6e,f). All subsequent analyses used the peptide eFDR approach. Using this approach, CHIMERYS v.4.0.21 finished the analysis of the dataset 4.9 times faster than DIA-NN and 1.7 times slower than Spectronaut (Extended Data Fig. 7a). Filtering at run-specific eFDR in addition to the algorithm-dependent self-reported FDR did not change the overall number of identifications (number of precursors identified in any number of replicates) for CHIMERYS, but it reduced the overall number of identifications for DIA-NN and Spectronaut to a level comparable to CHIMERYS. The number of precursors identified in two out of three replicates relative to the overall number of identifications (a measure for data completeness) did not change for CHIMERYS when filtering at run-specific eFDR in addition to the algorithm-dependent self-reported FDR (41% data completeness); however, doing so reduced data completeness for Spectronaut from 86% to 61% and for DIA-NN from 78% to 30% (Fig. 3a). 
CHIMERYS and Spectronaut substantially outperformed DIA-NN when requiring a precursor to be identified in all replicates at 1% algorithm-dependent self-reported FDR and at 1% run-specific eFDR (Extended Data Fig. 7b).
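The data completeness measure used above (precursors identified in at least two of three replicates, relative to all precursors identified in any replicate) can be computed as in the following sketch. The function name and the toy identifiers are hypothetical.

```python
def data_completeness(ids_per_replicate, min_replicates=2):
    """Fraction of all identified precursors seen in >= min_replicates replicates."""
    counts = {}
    for ids in ids_per_replicate:
        for precursor in set(ids):
            counts[precursor] = counts.get(precursor, 0) + 1
    if not counts:
        return 0.0
    complete = sum(1 for n in counts.values() if n >= min_replicates)
    return complete / len(counts)


replicates = [{"A", "B", "C"}, {"A", "B"}, {"A", "D"}]
completeness = data_completeness(replicates)
```

Because the denominator is the overall number of identifications, removing poorly supported identifications that occur in only one replicate raises completeness, whereas removing identifications from individual replicates of otherwise complete precursors lowers it.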

**Fig. 3: Deconvolution of chimeric DIA spectra.**

As expected, precursors filtered out based on eFDR have lower MS2 intensities (Fig. 3b) and fewer fragment ions (Fig. 3c); however, the extent to which this is observed differs substantially between the three tools. CHIMERYS considers more fragments for quantification than the other two search engines with default settings. Further, CHIMERYS is more rigorous in the inclusion of fragment ions. The latter is illustrated in Fig. 3d. The top panel shows fragment ion chromatograms exported from Spectronaut for a precursor confidently identified by all three search engines. The bottom panel shows fragment ion chromatograms exported from Spectronaut for the highest-scoring, Spectronaut-unique precursor that was entirely based on fragment ions with an F.PeakArea ≤ 1. Inspection of the corresponding raw data revealed that these fragment ions are missing in the relevant retention time range (Extended Data Fig. 7e and Supplementary Discussion). Both precursors were identified by Spectronaut with comparable scores and posterior error probabilities. Further investigations regarding the number and intensity of fragment ions (Extended Data Fig. 7c–e) suggest that precursors with fewer than three quantifiable fragment ions with an intensity exceeding 1 or those with (near-)zero intensity should be removed, either categorically or by applying stringent FDR control, which has a very similar effect (Extended Data Fig. 7f).

Accurate peptide quantification from chimeric PRM and DIA spectra

One of CHIMERYS’ distinguishing concepts is its spectrum-centric processing of chimeric spectra. Apart from peptide identification, it also derives spectrum-centric quantitative information in the form of CHIMERYS coefficients, which can be interpreted as the interference-corrected total ion current for a given precursor in this MS2 spectrum (Methods). If none of the matched fragments for a precursor are shared with another precursor and the predicted MS2 spectrum matches perfectly to the experimental one, the coefficient is the sum of all matched fragment ions in the experimental MS2 spectrum. Hence, tracing the coefficient along retention time generates a pseudo-extracted-ion-chromatogram (XIC) that can be used to perform (relative) quantification of a precursor based on its MS2 signal in PRM and DIA data, but not in DDA data, which usually do not sample precursors multiple times along retention time in MS2. This is different from standard approaches that create XICs for (a subset of) fragment ions of a given precursor, which need to remove interfered fragment ions from quantification to maintain high precision and accuracy (Fig. 4a). To assess the performance of our concept, we carried out a simple PRM assay, focusing on 52 peptides from 18 human proteins spanning five orders of magnitude of cellular abundance (Methods). Both CHIMERYS and Skyline recovered 47 out of 52 peptides from the targeted inclusion list and CHIMERYS’ automatically generated MS2-based quantification was in excellent agreement (Pearson correlation coefficient = 0.99) with the manually curated values obtained from Skyline (Fig. 4b). Without any additional effort, CHIMERYS identified and quantified 1,400 further peptides that were not designed to be in the assay but that happened to be co-isolated with the targeted peptides (Extended Data Fig. 8). CHIMERYS effectively automates the processing of PRM data because it removes the manual curation steps often required in Skyline. 
These include dealing with shared fragment ions and co-isolated peptides (both used in CHIMERYS but removed in Skyline).
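Tracing a coefficient along retention time can be sketched as follows. The example builds a pseudo-XIC for one precursor from per-scan coefficients and derives a peak apex and a trapezoidal area. The data layout, the precursor labels and the choice of trapezoidal integration are assumptions for illustration, not the CHIMERYS implementation.

```python
def pseudo_xic(scans, precursor):
    """Build (retention_time, coefficient) pairs for one precursor.

    scans: list of (retention_time, {precursor: coefficient}) tuples; scans in
           which the precursor was not identified contribute a coefficient of 0.
    """
    return sorted((rt, coefs.get(precursor, 0.0)) for rt, coefs in scans)


def apex(trace):
    """Return the (retention_time, coefficient) point with the highest coefficient."""
    return max(trace, key=lambda point: point[1])


def trapezoid_area(trace):
    """Integrate the trace over retention time with the trapezoidal rule."""
    return sum(
        (t1 - t0) * (y0 + y1) / 2.0 for (t0, y0), (t1, y1) in zip(trace, trace[1:])
    )


toy_scans = [
    (10.0, {"PEPTIDEK/2": 0.0}),
    (10.1, {"PEPTIDEK/2": 50.0, "OTHERR/2": 5.0}),
    (10.2, {"PEPTIDEK/2": 100.0}),
    (10.3, {"PEPTIDEK/2": 40.0}),
    (10.4, {}),
]
trace = pseudo_xic(toy_scans, "PEPTIDEK/2")
apex_rt, apex_height = apex(trace)
area = trapezoid_area(trace)
```

Because each coefficient already accounts for co-isolated precursors, a single trace per precursor is integrated instead of a set of individual fragment ion XICs.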

**Fig. 4: Coefficient-based quantification.**

Next, we compared the MS2-level quantitative precision and accuracy of CHIMERYS to DIA-NN and Spectronaut on the LFQbench-type dataset³⁷. To avoid differences in quantification due to different methods for determining peak integration borders, we compared the three algorithms based on their implementation of peak apex quantification (Supplementary Methods provide DIA-NN- and Spectronaut-specific settings, as well as an explanation of the corresponding implementation in CHIMERYS). When filtering the data using eFDR as discussed above, the median quantitative precision of precursors (based on coefficient of variation; CV) was 26.9%, 29.1% and 29.2% for CHIMERYS, DIA-NN and Spectronaut, respectively (Fig. 4c).
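The precision metric reported above is a median of per-precursor coefficients of variation. A minimal sketch is shown below; it assumes the sample standard deviation (n − 1 denominator) and ignores precursors quantified in fewer than two replicates, both of which are assumptions rather than details stated here. Names and values are hypothetical.

```python
import statistics


def cv_percent(values):
    """Coefficient of variation in percent, using the sample standard deviation."""
    mean = statistics.mean(values)
    if mean == 0:
        return float("nan")
    return 100.0 * statistics.stdev(values) / mean


def median_cv(quantities):
    """Median CV over precursors; quantities maps precursor -> replicate values."""
    cvs = [cv_percent(v) for v in quantities.values() if len(v) >= 2]
    return statistics.median(cvs)


toy_quantities = {
    "precursor_a": [90.0, 100.0, 110.0],
    "precursor_b": [100.0, 100.0, 100.0],
    "precursor_c": [80.0, 100.0, 120.0],
}
precision = median_cv(toy_quantities)
```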

Similarly, precursor-level ratio distributions (Fig. 4d) as a measure of quantitative accuracy for the three different search engines were comparable at eFDR (mean log₂ ratios ± s.d. for Escherichia coli, Homo sapiens and Saccharomyces cerevisiae of −1.90 ± 0.25, −0.03 ± 0.25 and 1.00 ± 0.29 for CHIMERYS, −1.86 ± 0.26, −0.03 ± 0.21 and 0.98 ± 0.23 for DIA-NN and −1.86 ± 0.32, −0.05 ± 0.31 and 1.00 ± 0.35 for Spectronaut, respectively). The above analysis demonstrates that CHIMERYS’ spectrum-centric way of quantifying peptide precursors matches the performance of Skyline on PRM data as a gold standard in the field and extends to full-scale DIA data. It also highlights the potential of CHIMERYS for scaling PRM assays to very large numbers of peptides without the need for manual intervention.

Digging deeper into direct infusion experiments

Recently, DI-SPA was shown to deliver proteomics insights at an unprecedented throughput⁷. As direct infusion experiments forfeit chromatographic separation in favor of sample throughput, such data are not readily accessible to algorithms such as DIA-NN or MSFragger-DIA. CHIMERYS natively supports the processing of DI-SPA data, because its spectrum-centric deconvolution of MS2 spectra does not require the detection and scoring of elution peaks in fragment ion XICs. As such, CHIMERYS is the only library-free algorithm capable of analyzing DI-SPA data. A reanalysis of the data underlying Fig. 2 of the original DI-SPA publication⁷ with CsoDIAq⁴², a tool designed for the analysis of DI-SPA data, and CHIMERYS revealed that the TraML library used by CsoDIAq contained a substantial number of decoys that differed in sequence length from their corresponding targets (Extended Data Fig. 9a,b). Generating hybrid and fully predicted spectral libraries with and without matched decoys (Supplementary Methods) revealed that this mismatch between targets and decoys artificially increased CsoDIAq’s sensitivity (Extended Data Fig. 9c). Using matched decoys for both algorithms, CHIMERYS identified up to threefold more unique peptide groups in comparison to CsoDIAq at 1% run-specific peptide group FDR (Fig. 5a–d), resulting in up to threefold more identified proteins at 1% run-specific protein FDR (Fig. 5e). This is driven by ~twofold higher sensitivity of CHIMERYS compared to CsoDIAq (Fig. 5f,g). In turn, this leads to a substantial increase in significantly enriched KEGG⁴³ pathways (8 versus 17 for CsoDIAq and CHIMERYS; Supplementary Table 2) and their coverage with protein identifications (Fig. 5h). This demonstrates that CHIMERYS can unlock new biology hidden in previously acquired data that is inaccessible to other software solutions.

Head-to-head comparison of DDA and DIA data, facilitated by CHIMERYS

We showed that CHIMERYS can analyze DDA and DIA data using the same concept for the deconvolution of chimeric spectra, which enables directly comparing the two acquisition methods on data acquired from the same sample, without the need to process the data with different software packages. As one would expect, it identified more than twice as many PSMs from DIA (8-Th isolation window) compared to DDA (1.3-Th isolation window) data acquired on an Orbitrap QE HF-X (LFQbench-type dataset; Extended Data Fig. 10a); however, DDA identified 52% more peptide groups and 30.3% more protein groups compared to DIA (Fig. 6a and Extended Data Fig. 10b). Likely, this is due to the interplay between the AGC limit and the dynamic range in MS2 spectra, which we already observed for WWA data (see section above). In contrast, relative quantitative data completeness was higher for DIA than for DDA data when filtering for peptide groups that met 1% FDR in the global, but not necessarily the run-specific context and enabling ‘match between runs’ for DDA using the Minora Feature Detector in PD⁴⁴ (78% versus 55.4% of peptide groups quantified in two out of three replicates per condition in DIA and DDA, respectively, Fig. 6a). This resulted in very similar numbers of peptide groups being quantified in two out of three replicates (56,322 and 52,161) for DDA and DIA, respectively.

Perhaps the more interesting comparison is that of DDA versus DIA using the same isolation window (here 2 Th). This has recently become possible because modern, fast-scanning instruments blur the border between DDA and DIA⁴⁵. Interestingly, both 14 min (~100 SPD) and 30 min (48 SPD) gradients on an Orbitrap Astral⁴⁶ yielded similar numbers of PSMs, peptide and protein groups for DDA and DIA (Fig. 6a and Extended Data Fig. 10a,b). The small differences in favor of DIA are likely due to the higher scan rate of the Orbitrap Astral in DIA mode. Again, relative quantitative data completeness was much better for DIA than for DDA (97.9% and 98.7% of peptide groups quantified in two out of three replicates versus 56.3% and 61.7% for the 14-min and 30-min gradients, respectively; Fig. 6a). These data suggest that DIA and MS2-based quantification should be preferred over DDA and MS1-based quantification when performing label-free, single-shot measurements on fast-scanning instruments. Comparing CV distributions of peptide groups detected by DDA and DIA in the three datasets revealed that DDA was slightly more precise on the LFQbench-type dataset, whereas DIA was slightly more precise on the 30 min Orbitrap Astral dataset (Fig. 6b). Quantitative accuracy seemed to be generally better for DIA on the LFQbench-type dataset (Fig. 6c); however, closer inspection suggests that this is due to a problem with the samples rather than with MS1-based quantification per se, as the accuracy of MS1- and MS2-based quantification of the DIA data is comparable (Extended Data Fig. 10c). In fact, CHIMERYS’ MS2-based quantification was highly correlated (R = 0.88) to the MS1-based quantification implemented in PD on the same raw data (Fig. 6d), suggesting that the two quantification methods could be combined in the future in CHIMERYS.

Discussion

In many ways CHIMERYS returns to the very old concept of analyzing tandem mass spectra, one at a time. At least for the task of peptide identification, this so-called spectrum-centric approach places the core analytical evidence acquired by the mass spectrometer at the center of data analysis. This comes with a number of important advantages. First, any proteomic data type (DDA, DIA and PRM) can be treated the same, and CHIMERYS is the first software implementation that stringently follows this unifying philosophy. While some tools such as MSFragger¹⁶ and MaxQuant⁴⁷ claim to unify DDA and DIA data analysis, these are not unifying algorithms but bundles of acquisition method-specific algorithms with a unified user interface. In contrast to that, a unifying algorithm natively supports the analysis of different data types. Second, in principle, there is no difference between identifying a single or multiple peptides from the same MS2 spectrum and skilled scientists have done so since the early days of proteomics. The added sophistication is that artificial intelligence can predict the fragmentation of any peptide with outstanding accuracy so that it is possible to deconvolute even highly chimeric spectra by maximizing the explained intensity in an MS2 spectrum using a minimal set of peptides. Third, statistical methods for PSM-level FDR control are conceptually well worked out and have reached a very high level of practical refinement, again including the use of artificial intelligence that can predict the tandem mass spectrum of any target or decoy peptide with the same accuracy, ensuring fair competition between targets and decoys. Fourth, the plausibility of an identification can be further assessed (albeit not automatically) beyond statistics by visual inspection in the context of the full MS2 spectrum and for example by looking for fragment ions that were not part of the deep-learning model and have thus not yet been used for identification. 
A current limitation of CHIMERYS in this context is that peptides carrying modifications that are not yet covered by the underlying deep-learning model escape detection. It can be anticipated that this limitation will diminish over time as deep-learning models start to emerge that are capable of generalizing to modifications or fragmentation methods they have not yet been trained for⁴⁸ (Supplementary Discussion provides a more in-depth look on extrapolation). Similarly, CHIMERYS is currently limited to the analysis of data generated by mass spectrometers from Thermo Fisher Scientific; however, support for mzML and other vendor-specific formats will be available in a future version of CHIMERYS.

Akin to other software tools, CHIMERYS also uses the information contained in the MS2 spectrum for peptide quantification; however, unlike all other DIA software, it does not set a fixed number of fragment ions to consider and instead always uses all the fragment ions that have led to an identification in a given MS2 spectrum, but in relative proportion to how much they contributed to the actual signal in the MS2 spectrum (important for the frequent case of fragment ions that are shared between peptide candidates). CHIMERYS uses the sum of these fragment ion intensities rather than the individual fragment intensities to find the apex of a chromatographic peak. This makes the overall quantification more robust against weak signals and spurious detections as encountered for example in single-cell proteomics data. The results indicate that quantitative precision and accuracy closely match that of PRM data, which are often considered to be the gold standard for peptide quantification. In this context it is interesting to note that CHIMERYS also automates the analysis of PRM experiments along the way.

We consciously decided to prioritize data quality over quantity, such that reported peptide identification and quantification results are rather conservative and other software tools may sometimes seemingly outperform CHIMERYS (Supplementary Discussion); however, when applying rigorous and consistent criteria for peptide detection and quantification, these differences diminish. A perhaps unexpected finding in this regard is that DIA data are often not nearly as complete as default processing parameters of DIA search engines report. Again, and not surprisingly, this is particularly true for low-abundant samples or low-abundant peptides within a sample. The reasons for this could be manifold and investigating them comprehensively goes beyond the scope of the present study; however, it is worth mentioning that the most recent generation of mass spectrometers has driven sensitivity to the point of single ion detection. As a result, MS2 spectra have at least some low level of signal at nearly every m/z. Many of these signals may not even stem from peptides but will create a situation in which ‘something’ can be found everywhere and all the time, potentially leading to data completeness that bears little if any actual justification. In addition, the increasing volume and density of MS-based proteomic data keep challenging the scalability of the assumptions underlying data-processing tools. Reassuringly, the community of proteomics software developers and users is increasingly aware of these recurring challenges, as it is in everybody’s best interest to ensure that software tools can be trusted and used at face value. CHIMERYS makes a valuable contribution in this context and a particularly exciting prospect is that the latest LC–MS/MS hardware along with the latest software solutions will soon overcome the historically grown divide in the field between DDA and DIA.

Methods

Brief description of the CHIMERYS algorithm

Setup

The CHIMERYS workflow is a cloud-native web service with an application programming interface, orchestrated by Kubernetes on Amazon Web Services or on-premise. The environment consists of two major components: an INFERYS prediction server²⁵, which delivers predictions via remote procedure calls to a CHIMERYS search algorithm instance, which matches these predictions to experimental spectra. We decided to set up a cluster of servers for predictions, rather than allowing them to be performed locally to reduce the runtime of CHIMERYS (Supplementary Discussion).

Description of the identification workflow

The CHIMERYS workflow follows the setup of classic search engines. After the in silico digest of the protein database and the generation of shuffled decoy sequences using a similar logic as the mimic²¹ entrapment generator (see section below), a coarse first search is performed to identify highly confident peptides for recalibration purposes. Notably, for a group of I/L isomers, CHIMERYS only scores one representative. A fast fragment ion index implementation similar to MSFragger³¹ is used to determine a ranked list of suitable candidate peptides with isotope envelopes that (partially) overlap with the MS2 isolation window (plus tolerances). Fragment ion intensities for highly ranking candidate peptides are predicted for each spectrum, merged against the experimental MS2 spectrum and subsequently, a set of counting-based (for example, number of matching peaks between predicted and experimental spectrum) and intensity-based scores (for example, normalized spectral contrast angle) are calculated. Candidate peptides that fall below certain cutoff criteria are removed and only the best-scoring PSM for a group of isobaric PSMs is retained. For example, PSMs are required to have at least three matched fragment ions, one of which must be the base peak (most abundant peak of the prediction) and another one of which must be among the top three most-intense peaks of the predicted spectrum. After the initial search, a linear discriminant analysis identifies highly confident PSMs for the calibration of optimal prediction parameters of the fragmentation model (for example, normalized collision energy; NCE), refinement learning of the retention time model and recalibration of fragment ion m/z and match tolerances. Peptide classes with few confidently identified peptides are removed entirely from the search space (for example, peptides of length 7 carrying two missed cleavages and two oxidized methionine residues). 
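The intensity-based score and the example cutoff described above can be illustrated with a minimal Python sketch. The function names and the boolean `matched` mask are hypothetical, and reading "another one ... among the top three" as "the second or third most intense predicted peak" is an assumption; this is not CHIMERYS' implementation. The normalized spectral contrast angle is written in its common form, 1 − 2·arccos(cos θ)/π.

```python
import numpy as np

def normalized_spectral_angle(predicted, observed):
    """Normalized spectral contrast angle: 1 for identical shapes, 0 for orthogonal spectra."""
    p = np.asarray(predicted, dtype=float)
    o = np.asarray(observed, dtype=float)
    denom = np.linalg.norm(p) * np.linalg.norm(o)
    if denom == 0.0:
        return 0.0
    cos = np.clip(p @ o / denom, -1.0, 1.0)
    return float(1.0 - 2.0 * np.arccos(cos) / np.pi)

def passes_peak_filter(predicted, matched, min_matched=3):
    """Example cutoff: >= 3 matched fragments, including the predicted base peak
    and (assumed reading) the second or third most intense predicted peak."""
    predicted = np.asarray(predicted, dtype=float)
    matched = np.asarray(matched, dtype=bool)
    if matched.sum() < min_matched:
        return False
    order = np.argsort(predicted)[::-1]        # predicted peaks, most intense first
    base_peak_matched = matched[order[0]]
    another_top3_matched = matched[order[1:3]].any()
    return bool(base_peak_matched and another_top3_matched)

# Toy predicted spectrum with four fragment ions (intensities sum to 1).
pred = [0.5, 0.3, 0.1, 0.1]
ok = passes_peak_filter(pred, [True, True, True, False])
rejected = passes_peak_filter(pred, [False, True, True, True])  # base peak missing
```

In this toy case the first candidate passes and the second is rejected because the predicted base peak has no experimental match, even though three fragments matched.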
In the main search, the above-described scoring functions are executed using the optimized settings and prediction parameters, albeit now also filtering candidate peptides based on their predicted retention time. CHIMERYS uses retention time tolerance windows that would allow the identification of all peptides confidently identified in the initial search. In brief, we calculate the absolute difference of experimental and predicted retention times after refinement learning for all PSMs identified at 1% FDR in the initial search. The 100% quantile of these differences is the basis for the initial retention time window (±100% quantile of absolute differences between experimental and predicted retention time), which is then expanded further by multiplying it with a security factor of 2.5. The scoring is repeated to arrive at a set of high-scoring candidate peptides as input for the deconvolution function, where the candidate peptides simultaneously compete for experimental fragment ion intensity in one concerted step.
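The retention time window calculation described above amounts to a few lines of arithmetic. The following Python sketch shows it under stated assumptions: the function name and the toy retention times are hypothetical, and the inputs are taken to be the PSMs identified at 1% FDR in the initial search after refinement learning.

```python
import numpy as np

def retention_time_window(rt_experimental, rt_predicted, security_factor=2.5):
    """Half-width of the retention time tolerance window (same unit as the inputs)."""
    deltas = np.abs(np.asarray(rt_experimental, dtype=float)
                    - np.asarray(rt_predicted, dtype=float))
    base_window = np.quantile(deltas, 1.0)   # 100% quantile, i.e. the largest deviation
    return float(base_window * security_factor)

# Toy example: three confident PSMs, retention times in minutes.
window = retention_time_window([10.0, 20.0, 30.0], [10.5, 19.0, 30.2])
```

With the largest absolute deviation at 1.0 min, candidates in the main search would be retained if their predicted retention time lies within ±2.5 min of the spectrum's retention time.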

Essentially, CHIMERYS treats chimeric spectra as linear combinations of pure spectra. Here, predicted spectra for high-scoring candidate peptides serve as the source of pure spectra, which is why CHIMERYS is dependent on accurate fragment ion intensity predictions. The intensities of each predicted spectrum are normalized to a total sum of 1. Let $P$ be a collection of predicted spectra for high-scoring candidate peptides, each comprising a set of intensities $I_{p,m}$, where $p=1,2,\ldots,P$ is an index over the predicted spectra and $m=1,2,\ldots,M$ is an index over mass channels. The mass channels are defined by the peaks of the experimental spectrum that were within the recalibrated fragment ion match tolerance of at least one fragment ion from the collection of predicted spectra, as well as all unmatched fragment ions from said collection. As such, if two peaks from two different predicted spectra match to the same experimental peak, they will have the same mass channel. This is how CHIMERYS handles shared fragment ions. Hence, one can represent the predicted spectra of all candidate peptides as a $P\times M$ matrix of intensities $I_{p,m}$. The CHIMERYS coefficients $\beta_{p}$ then define a combined spectrum as the linear combination of the predicted spectra, scaled by the corresponding coefficients. This combined spectrum has a set of intensities:

$${{\bf{I}}}_{{{m}}}=\mathop{\sum }\limits_{{{p}}={{1}}}^{{{P}}}{\beta }_{{{p}}}{{\bf{I}}}_{{{p}},{{m}}}$$

CHIMERYS uses non-negative L1-regularized regression via the LASSO to optimize the CHIMERYS coefficients ${\beta }_{p}$ such that they minimize the following objective function:

$$\sum_{m=1}^{M}\left(I_{m}^{\exp}-\mathbf{I}_{m}^{T}\boldsymbol{\beta}\right)^{2}+\lambda\|\boldsymbol{\beta}\|_{1}$$

subject to the coefficients $\beta_{p}$ having non-negative values. Here, $M$ is the number of mass channels, $I_{m}^{\exp}$ is the experimental intensity for the mass channel $m$, $\mathbf{I}_{m}=(I_{1,m},I_{2,m},I_{3,m},\ldots,I_{P,m})$ is a vector of predicted intensities for each predicted spectrum from the collection at mass channel $m$, $\boldsymbol{\beta}=(\beta_{1},\beta_{2},\beta_{3},\ldots,\beta_{P})$ is a vector of CHIMERYS coefficients and $\lambda\|\boldsymbol{\beta}\|_{1}=\lambda\sum_{p=1}^{P}|\beta_{p}|$ is the LASSO regularization term (L1-regularization). $\lambda$ may be varied to vary the strength of the regularization, and therefore the strength of the constraint on the number of nonzero coefficients. CHIMERYS optimizes $\lambda$ automatically by fitting multiple models with different regularization strengths and selecting the best model with the most regularization by inspecting the corrected Akaike information criterion. As such, CHIMERYS models the experimental spectrum as a function of the matrix of candidate peptides. By using L1-regularization together with the corrected Akaike information criterion, it aims to best explain the experimental spectrum with the fewest candidate peptides possible. Notably, this algorithm accounts for the presence of shared fragment ions. The above-mentioned optimization procedure will effectively ‘distribute’ their intensity to the corresponding candidate peptides according to the optimized CHIMERYS coefficients. The more similar the sequences of two co-eluting peptides are, the more fragments they share and the more similar their predicted spectra will be. This is particularly true for positional isomers of PTM-containing peptides that often share many fragment ions. However, unmodified peptides with the same amino acid composition can also have very similar, or completely dissimilar, sequences and hence predicted spectra.
The calculation of CHIMERYS coefficients is the same, no matter how many fragment ions are shared between two co-eluting peptides. The only exception is isobaric peptides. Currently, if two candidate peptides are isobaric, CHIMERYS will only insert the best one for a given spectrum into the collection of predicted spectra mentioned above. This is similar to MS1-based quantification, where only the best-scoring peptide is matched to an MS1 signal based on precursor m/z and retention time. However, positional isomers usually have some site-determining ions that are isomer-specific and, depending on the PTM and its localization, fragment ion intensities might also differ between them. Hence, in the future, CHIMERYS might report multiple isomers per scan if their predicted spectra are sufficiently different from one another such that the LASSO regression assigns both of them a nonzero CHIMERYS coefficient.
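The deconvolution step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CHIMERYS' implementation: the solver is a plain coordinate descent swapped in for whatever LASSO solver CHIMERYS uses, the corrected Akaike information criterion is computed from the residual sum of squares under a Gaussian error assumption, and the function names, the $\lambda$ grid and the toy spectra are hypothetical. The design matrix `X` holds one predicted spectrum per column and one mass channel per row.

```python
import numpy as np

def nn_lasso(X, y, lam, n_sweeps=200):
    """Minimize sum((y - X @ beta)**2) + lam * sum(beta) subject to beta >= 0."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            if col_sq[j] == 0.0:
                continue
            # Residual with peptide j's own contribution added back.
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = max(0.0, (X[:, j] @ r_j - lam / 2.0) / col_sq[j])
    return beta

def aicc(X, y, beta):
    """Corrected AIC from the residual sum of squares (Gaussian assumption)."""
    n, k = len(y), int(np.count_nonzero(beta))
    if n - k - 1 <= 0:
        return np.inf
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n + 1e-12) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

def deconvolve(X, y, lambdas):
    """Fit one model per lambda; keep the lowest AICc, ties go to stronger regularization."""
    best = None
    for lam in sorted(lambdas, reverse=True):
        beta = nn_lasso(X, y, lam)
        score = aicc(X, y, beta)
        if best is None or score < best[0]:
            best = (score, lam, beta)
    return best[1], best[2]

# Toy example: peptides a and b share mass channel 2; peptide c is absent.
a = [0.5, 0.3, 0.2, 0.0, 0.0, 0.0]
b = [0.0, 0.0, 0.2, 0.5, 0.3, 0.0]
c = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
X = np.array([a, b, c]).T                            # shape (M channels, P peptides)
y = 300.0 * np.array(a) + 100.0 * np.array(b)        # simulated chimeric spectrum
lam_hat, beta_hat = deconvolve(X, y, [1e-3, 1e-1, 10.0])
```

In this toy case the coefficients recovered for peptides a and b are close to 300 and 100, the shared channel is split between them according to those coefficients, and peptide c receives a coefficient of exactly 0.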

The CHIMERYS coefficient for each candidate peptide represents its contribution to the experimental spectrum. A coefficient >0 indicates that this candidate peptide was used to explain the experimental spectrum. Based on the resulting coefficient, a subsequent round of intensity-based scoring is executed. Here, the coefficients of the candidate peptides can be used to predict the proportional intensity of all but one candidate peptide, add them together and subsequently subtract this sum from the actual experimental MS2 spectrum to calculate what we call a ‘shadow spectrum’, which is the experimental spectrum with the contributions of all interfering peptides removed. Next, the above-mentioned figures of merit are calculated based on these shadow spectra without the interference of other peaks in the spectrum. Notably, this also works for fragment ions that are shared between candidate peptides. Candidate peptides that fail to meet certain quality criteria (for example, minimum number of most abundant peaks shared between predicted and experimental spectrum) are filtered out. A list of all remaining target and decoy PSMs per spectrum that received a coefficient >0 and met all quality criteria including all calculated scores is generated as input for the PSM-level error estimation in mokapot²⁰.
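The construction of a shadow spectrum can be illustrated with a short Python sketch. The function name and the toy data are hypothetical, and whether negative residual intensities are clipped is not stated in the text, so this sketch leaves them unclipped.

```python
import numpy as np

def shadow_spectrum(experimental, predicted, beta, target):
    """Experimental spectrum minus the modelled contributions of all peptides except `target`.

    predicted: array of shape (P, M), one normalized predicted spectrum per row.
    beta:      CHIMERYS-style coefficients, one per predicted spectrum.
    """
    experimental = np.asarray(experimental, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    others = np.asarray(beta, dtype=float).copy()
    others[target] = 0.0                      # keep the target's own signal in the spectrum
    return experimental - others @ predicted  # subtract the interfering peptides

# Toy example: two peptides sharing the third mass channel.
predicted = np.array([[0.5, 0.3, 0.2, 0.0, 0.0],
                      [0.0, 0.0, 0.2, 0.5, 0.3]])
beta = np.array([300.0, 100.0])
experimental = beta @ predicted               # [150, 90, 80, 50, 30]
shadow_a = shadow_spectrum(experimental, predicted, beta, target=0)
```

Here the shadow spectrum of the first peptide equals 300 times its predicted spectrum, including its share of the shared third channel, so subsequent intensity-based scores are computed without the interference of the second peptide.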

FDR estimation using mokapot

For error control, the initial implementation of CHIMERYS utilized Percolator²¹ v.3.0.5 to aggregate all calculated scores for all target and decoy PSMs generated in a dataset. As Percolator runtime scales poorly with large input files, we replaced it with the Python-based reimplementation termed mokapot²⁰. To ensure scalability to large input PSM lists while controlling the compute resources, we rewrote large parts of mokapot’s logic to allow streaming of data from disk, introduced RAM limits and implemented more-performant data structures. The changes made to mokapot have since been merged into the main branch (https://github.com/wfondrie/mokapot). Mokapot is executed using the following parameters: a training FDR of 1%, a training subset of 400k and ten iterations for training. We specifically prevent mokapot from only retaining the top-scoring PSM per spectrum. Afterwards, the resulting PSM-level q-values, support vector machine scores and posterior error probabilities are attached to the corresponding PSMs. Peptides containing leucine/isoleucine (I/L) isomers in the search space are added back to the results with identical scores and are flagged as ‘ambiguous’.

MS2 quantification workflow

CHIMERYS determines raw file-specific peak apex retention times as the CHIMERYS coefficient-weighted mean of retention time deltas relative to the gradient length based on PSMs meeting 1% run-specific PSM-level FDR. If an external inclusion file was used, PSMs meeting 1% run-specific PSM-level FDR including their relative retention times and CHIMERYS coefficients from the list are also considered. If no PSMs meet 1% run-specific PSM-level FDR for a given precursor in a given raw file, the apex for said precursor in this raw file is calculated using the same logic as above but based on PSMs meeting 1% run-specific PSM-level FDR in other raw files and the inclusion file. CHIMERYS in its current implementation then estimates maximum integration borders per raw file as the 99% quantile of peak widths at base (not full width at half maximum) from precursors with at least three PSMs surviving a run-specific PSM-level FDR threshold of 1%. These maximum integration borders are then applied to each precursor in this raw file, leading to relatively wide integration borders, particularly for low-abundant precursors. Afterwards, quantification of PRM and DIA data is performed by either trapezoidal integration of the CHIMERYS coefficients from each precursor in a set of consecutive MS2 spectra sharing the same isolation window within the integration borders, or by using the highest CHIMERYS coefficient within the integration borders as the elution peak apex intensity. The latter implementation was used for the comparison to DIA-NN and Spectronaut. One missing CHIMERYS coefficient in a series of consecutive MS2 scans with the same isolation window is allowed (gap scan) and a contribution of 0 is inserted to any further scan with missing data points, which act as boundaries for peak area integration. 
Notably, at this point, CHIMERYS coefficients are taken from PSMs irrespective of their run-specific PSM-level FDR; however, CHIMERYS coefficients will only be used from peptide precursors that meet CHIMERYS’ quality criteria (for example, a minimum of three peaks matched between the predicted and the experimental spectrum) and are located in the vicinity of the determined peak apex. Hence, at least one confidently identified PSM across all raw files is required to generate quantitative values based on PSMs around the determined peak apex in each raw file. As such, CHIMERYS will quantify precursors that fail to meet run-specific precursor-level FDR thresholds. Users are free to filter their list of precursors at 1% global precursor-level FDR (precursor was confidently identified in at least one raw file) or additionally also at 1% run-specific precursor-level FDR. The latter is more conservative and will reduce data completeness; however, we have shown that quantifications of precursors passing only the global filter are often precise and accurate. We therefore recommend working with precursors filtered to 1% global precursor-level FDR during exploratory data analysis and turning to run-specific precursor-level FDR for the validation of interesting hits.
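The two MS2 quantification modes can be sketched as follows. This Python example is a simplification under stated assumptions: the function name and toy values are hypothetical, the apex is taken as the coefficient-weighted mean of the scan retention times rather than of retention time deltas relative to the gradient length, the integration half-width is passed in directly instead of being estimated from the 99% quantile of peak widths, and missing coefficients are simply set to 0 without modelling the gap-scan rule.

```python
import numpy as np

def quantify(rt, coef, half_width):
    """Return (apex_rt, trapezoidal_area, apex_intensity) for one precursor.

    rt:   retention times of consecutive MS2 scans sharing one isolation window.
    coef: CHIMERYS-style coefficients per scan (NaN where no coefficient exists).
    Assumes at least one coefficient is > 0.
    """
    rt = np.asarray(rt, dtype=float)
    coef = np.nan_to_num(np.asarray(coef, dtype=float), nan=0.0)
    apex_rt = float(np.sum(rt * coef) / np.sum(coef))   # coefficient-weighted mean
    inside = (rt >= apex_rt - half_width) & (rt <= apex_rt + half_width)
    r, c = rt[inside], coef[inside]
    # Trapezoidal rule over the scans within the integration borders.
    area = float(np.sum(np.diff(r) * (c[1:] + c[:-1]) / 2.0))
    apex_intensity = float(c.max())                     # alternative: highest coefficient
    return apex_rt, area, apex_intensity

# Toy elution profile: five scans, 0.1 min apart.
apex_rt, area, apex_intensity = quantify(
    rt=[10.0, 10.1, 10.2, 10.3, 10.4],
    coef=[0.0, 2.0, 4.0, 2.0, 0.0],
    half_width=0.25,
)
```

For this symmetric toy peak the apex lies at 10.2 min, the trapezoidal area is 0.8 and the apex intensity is 4; the text states that the apex-intensity variant was the one used for the comparison to DIA-NN and Spectronaut.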

Post-processing of CHIMERYS’ PSM-level outputs

CHIMERYS v.2.7.9 as showcased in this study is integrated into Thermo Fisher Scientific PD software v.3.1.0.622 (PD)⁴⁴. A pre-release of PD v.3.2 was used to demonstrate the processing of PTM datasets with CHIMERYS v.4.0.21 (Extended Data Fig. 4g). Hence, PD starts CHIMERYS searches on Amazon Web Services by uploading an internal format containing only MS2 spectra and some auxiliary information, a fasta file and the search parameters to the CHIMERYS web service, which then processes the data and generates a result file. The result file is then downloaded and post-processed by PD⁴⁴. In this study, we used the default CHIMERYS processing and consensus workflows with minor modifications. In brief, all DDA data processing was carried out using the PSM Grouper node to generate peptide groups, which were then validated using the Peptide Validator node. For DIA data, we used a special PCM Grouper node, which enables the calculation of run-specific and global precursor-level FDR. MS1-based quantification was performed using the Minora Feature Detector with default settings. MS2-based quantification was performed using the MS2 Fragment Ions Quantifier node with default settings.

Data generation

Cell culture and sample preparation

Human HeLa (ATCC, CCL-2) and pancreatic mouse cells were cultured under standard conditions at 37 °C with 5% CO₂ in DMEM supplemented with 10% fetal bovine serum and 100 U ml⁻¹ penicillin (Invitrogen). At around 80% confluence, cells were washed three times with PBS buffer before urea lysis (8 M urea, 80 mM Tris, pH 7.6 and 1× protease inhibitor) was performed for 5 min on ice. Cell lysate was clarified by centrifugation (20,000g for 10 min).

In-solution protein digestion was conducted as follows. First, proteins were reduced with 10 mM dithiothreitol at 37 °C for 1 h, followed by alkylation with 2-chloroacetamide at a final concentration of 55 mM for 45 min at room temperature in the dark while shaking on a thermo shaker. After the addition of five volumes of 50 mM Tris (pH 8), trypsin digestion was performed overnight by adding trypsin twice (1:100 dilution) after a primary incubation time of 4 h. Desalting was performed using Sep-Pak columns according to the user manual. Human brain FFPE samples were digested using an SDS lysis protocol followed by digestion with the SP3 approach as described in detail by Tüshaus et al.³³.

LC–MS/MS

FFPE, gradient comparison and wwDDA data were acquired on a micro-flow LC coupled via a HESI source to a Q Exactive HF-X hybrid quadrupole-Orbitrap mass spectrometer (Thermo Scientific). Optimization of the micro-flow LC setup as well as technical details were previously published by Bian et al.³². In brief, peptide separation was performed on an Acclaim PepMap 100 C18-HPLC-column (15-cm length, 1-mm inner diameter, 2-µm particle size; 164711, Thermo Fisher Scientific) at 55 °C. Linear gradients with buffer A (0.1% v/v formic acid (FA) and 3% v/v dimethylsulfoxide (DMSO) in dH₂O) and buffer B (0.1% v/v FA and 3% v/v DMSO in acetonitrile) from 3% to 28% B were run at 50 µl min⁻¹. Sample loading, column wash and equilibration were performed at 100 µl min⁻¹. Source settings were applied as 320 °C capillary temperature, 3.5 kV spray voltage and 300 °C auxiliary gas temperature. MS data were acquired at a normalized collision energy of 28%, in Top20 mode, at an m/z range of 360–1,300, AGC target of 3E6 and 1E5, maximal injection time of 50 ms and 22 ms, resolution of 60 k and 15 k on MS1 and MS2 level, respectively. The MS2 isolation window width was 1.4 Th in standard DDA runs and increased up to 20.4 Th for wide-window acquisition DDA as indicated in the figure legends.

Ion trap data were acquired with an Orbitrap Eclipse Tribrid mass spectrometer (Thermo Scientific) that was coupled to a Dionex UltiMate 3000 RSLCnano System (Thermo Scientific). Samples were transferred onto a trap column (75 μm × 2 cm, 5 μm C18 resin Reprosil PUR AQ; Dr Maisch). After washing with the trap washing solvent (5 μl min⁻¹, 10 min), samples were separated on an analytical column (75 μm × 48 cm, 3 μm C18 resin Reprosil PUR AQ, Dr Maisch). A 70-min method, including a 50-min gradient, was performed with a flow rate of 300 nl min⁻¹ (4% B up to 32% B within 50 min). Solvent A was 0.1% v/v FA and 5% v/v DMSO in dH₂O; solvent B was 0.1% v/v FA and 5% v/v DMSO in acetonitrile. MS1 scans were acquired with an Orbitrap resolution of 60 k, within a scan range of 360–1,300 m/z, a maximum injection time of 50 ms, a normalized AGC target of 100% and RF lens of 40%, including charge states 2–6, with an exclusion time of 25 s. MS2 scans were performed in the ion trap with a normalized AGC target of 200%, a maximum injection time of 25 ms and either with a higher-energy collisional dissociation (HCD) collision energy of 31% (wwDDA) or with a CID collision energy of 35% (CID). The quadrupole isolation window varied between 0.4 and 5.0 m/z as indicated in the figure.

Data for the instrument comparison were assembled from 1-h HeLa quality control runs, acquired over several years at the Chair of Proteomics and Bioanalytics at the Technical University of Munich. They were run on various LC systems, employed diverse instrument-specific settings, slightly different gradients and used different batches of HeLa digest, prepared in house.

Targeted assay generation

A simple PRM assay was devised by randomly selecting 18 proteins and 2–3 peptides each across the whole measured intensity range from a 1-h HeLa run analyzed on an Orbitrap Fusion Lumos mass spectrometer (Thermo Scientific). A total of 51 precursors were put into an inclusion list in addition to 14 precursors corresponding to a retention time standard. A 1-h HeLa sample was analyzed in PRM mode: MS2 spectra were acquired using a 0.4-Th isolation window, a maximum injection time of 100 ms, HCD collision at a normalized collision energy of 28% and readout in the Orbitrap at 15 k resolution.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The following external data were downloaded from PRIDE or MassIVE and processed with the respective search engines: body fluid data from Bian et al. (PXD015087)³², secretome data from Tüshaus et al. (PXD018171)³³, Arabidopsis thaliana and Halobacterium data from Müller et al. (PXD014877)³⁴, phosphorylation data from Frejno et al. (PXD013615)³⁵, acetylation and ubiquitination data from Zecha et al. (PXD023218)³⁶, triple species mix as well as HeLa data from the LFQBench-type dataset by Van Puyvelde et al. (PXD028735)³⁷, Orbitrap Astral data extracted from Gutzman et al. (PXD046453)⁴⁶ and DI-SPA data from Meyer et al. (MSV000085156)⁷. Notably, peptides containing methionine residues were excluded from all analyses of the LFQBench-type dataset, as raw files might show differential oxidation. The same applies to the Orbitrap Astral data from PXD046453. For the LFQBench-type dataset, technical replicates were analyzed (Supplementary Table 3). All other replicates are biological replicates. An itemized mapping of external data processed as part of this study to their source is available in Supplementary Table 3. The following datasets were generated in house: FFPE (biological replicates), gradient comparison, wwDDA, instrument generations and PRM data. An overview of the files generated is provided in Supplementary Table 4. The generated MS raw and search data of internal datasets, all fasta files used in this study and all Source and Supplementary Data files required to reproduce this study are available via PRIDE⁵⁰ with the dataset identifier PXD053241.

Code availability

The mokapot version used in this study is available on GitHub (https://github.com/wfondrie/mokapot/). The modifications to the mimic entrapment database generator are available on GitHub (https://github.com/percolator/mimic/). A web version of the mimic tool can be found at https://mimic.msaid.io/. A demo version of PD and CHIMERYS can be requested at https://www.msaid.de/chimerys-demo or by contacting the corresponding authors. The custom R scripts used for data analysis are available on GitHub (https://github.com/msaid-de/chimerys-manuscript).

References

Bantscheff, M., Schirle, M., Sweetman, G., Rick, J. & Kuster, B. Quantitative mass spectrometry in proteomics: a critical review. Anal. Bioanal. Chem. 389, 1017–1031 (2007).
Bantscheff, M., Lemeer, S., Savitski, M. M. & Kuster, B. Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present. Anal. Bioanal. Chem. 404, 939–965 (2012).
Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).
Michalski, A., Cox, J. & Mann, M. More than 100,000 detectable peptide species elute in single shotgun proteomics runs but the majority is inaccessible to data-dependent LC−MS/MS. J. Proteome Res. 10, 1785–1793 (2011).
Wang, J., Bourne, P. E. & Bandeira, N. MixGF: spectral probabilities for mixture spectra from more than one peptide. Mol. Cell. Proteom. 13, 3688–3697 (2014).
Dorfer, V., Maltsev, S., Winkler, S. & Mechtler, K. CharmeRT: boosting peptide identifications by chimeric spectra identification and retention time prediction. J. Proteome Res. 17, 2581–2589 (2018).
Meyer, J. G., Niemi, N. M., Pagliarini, D. J. & Coon, J. J. Quantitative shotgun proteome analysis by direct infusion. Nat. Methods 17, 1222–1228 (2020).
Peckner, R. et al. Specter: linear deconvolution for targeted analysis of data-independent acquisition mass spectrometry proteomics. Nat. Methods 15, 371–378 (2018).
Ting, Y. S. et al. Peptide-centric proteome analysis: an alternative strategy for the analysis of tandem mass spectrometry data. Mol. Cell. Proteom. 14, 2301–2307 (2015).
Fernández-Costa, C. et al. Impact of the identification strategy on the reproducibility of the DDA and DIA results. J. Proteome Res. 19, 3153–3161 (2020).
Tyanova, S., Temu, T. & Cox, J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat. Protoc. 11, 2301–2319 (2016).
Pino, L. K. et al. The Skyline ecosystem: informatics for quantitative mass spectrometry proteomics. Mass Spectrom. Rev. 39, 229–244 (2020).
Tsou, C.-C. et al. DIA-Umpire: comprehensive computational framework for data-independent acquisition proteomics. Nat. Methods 12, 258–264 (2015).
Ting, Y. S. et al. PECAN: library-free peptide detection for data-independent acquisition tandem mass spectrometry data. Nat. Methods 14, 903–908 (2017).
Bekker-Jensen, D. B. et al. Rapid and site-specific deep phosphoproteome profiling by data-independent acquisition without the need for spectral libraries. Nat. Commun. 11, 787 (2020).
Yu, F. et al. Analysis of DIA proteomics data using MSFragger-DIA and FragPipe computational platform. Nat. Commun. 14, 4154 (2023).
Li, K., Teo, G. C., Yang, K. L., Yu, F. & Nesvizhskii, A. I. diaTracer enables spectrum-centric analysis of diaPASEF proteomics data. Nat. Commun. 16, 95 (2025).
The, M., Samaras, P., Kuster, B. & Wilhelm, M. Reanalysis of ProteomicsDB using an accurate, sensitive, and scalable false discovery rate estimation approach for protein groups. Mol. Cell. Proteom. 21, 100437 (2022).
Ma, K., Vitek, O. & Nesvizhskii, A. I. A statistical model-building perspective to identification of MS/MS spectra with PeptideProphet. BMC Bioinform. 13, S1 (2012).
Fondrie, W. E. & Noble, W. S. mokapot: fast and flexible semisupervised learning for peptide detection. J. Proteome Res. 20, 1966–1971 (2021).
The, M., MacCoss, M. J., Noble, W. S. & Käll, L. Fast and accurate protein false discovery rates on large-scale proteomics data sets with Percolator 3.0. J. Am. Soc. Mass. Spectrom. 27, 1719–1727 (2016).
Gessulat, S. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods 16, 509–518 (2019).
Degroeve, S. & Martens, L. MS2PIP: a tool for MS/MS peak intensity prediction. Bioinformatics 29, 3199–3203 (2013).
Zhou, X.-X. et al. pDeep: predicting MS/MS spectra of peptides with deep learning. Anal. Chem. 89, 12690–12697 (2017).
Zolg, D. P. et al. INFERYS rescoring: boosting peptide identifications and scoring confidence of database search results. Rapid Commun. Mass Spectrom. https://doi.org/10.1002/rcm.9128 (2021).
Tibshirani, R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Series B Stat. Methodol. 58, 267–288 (1996).
Lu, Y. Y., Bilmes, J., Rodriguez-Mias, R. A., Villén, J. & Noble, W. S. DIAmeter: matching peptides to data-independent acquisition mass spectrometry data. Bioinformatics 37, i434–i442 (2021).
Matzinger, M. et al. Micropillar arrays, wide window acquisition and AI-based data analysis improve comprehensiveness in multiple proteomic applications. Nat. Commun. 15, 1019 (2024).
Truong, T. et al. Data-dependent acquisition with precursor coisolation improves proteome coverage and measurement throughput for label-free single-cell proteomics. Angew. Chem. Int. Ed. 62, e202303415 (2023).
Toprak, U. H. et al. Conserved peptide fragmentation as a benchmarking tool for mass spectrometers and a discriminating feature for targeted proteomics. Mol. Cell. Proteom. 13, 2056–2071 (2014).
Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513–520 (2017).
Bian, Y. et al. Robust, reproducible and quantitative analysis of thousands of proteomes by micro-flow LC–MS/MS. Nat. Commun. 11, 157 (2020).
Tüshaus, J. et al. An optimized quantitative proteomics method establishes the cell type‐resolved mouse brain secretome. EMBO J. 39, e105693 (2020).
Müller, J. B. et al. The proteome landscape of the kingdoms of life. Nature 582, 592–596 (2020).
Frejno, M. et al. Proteome activity landscapes of tumor cell lines determine drug responses. Nat. Commun. 11, 3639 (2020).
Zecha, J. et al. Linking post-translational modifications and protein turnover by site-resolved protein turnover profiling. Nat. Commun. 13, 165 (2022).
Puyvelde, B. V. et al. A comprehensive LFQ benchmark dataset on modern day acquisition strategies in proteomics. Sci. Data 9, 126 (2022).
Demichev, V., Messner, C. B., Vernardis, S. I., Lilley, K. S. & Ralser, M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat. Methods 17, 41–44 (2020).
Muntel, J. et al. Surpassing 10 000 identified and quantified proteins in a single run by optimizing current LC-MS instrumentation and data analysis strategy. Mol. Omics 15, 348–360 (2019).
Rosenberger, G. et al. Statistical control of peptide and protein error rates in large-scale targeted DIA analyses. Nat. Methods 14, 921–927 (2017).
Baker, C. P., Bruderer, R., Abbott, J., Arthur, J. S. C. & Brenes, A. J. Optimizing Spectronaut search parameters to improve data quality with minimal proteome coverage reductions in DIA analyses of heterogeneous samples. J. Proteome Res. 23, 1926–1936 (2024).
Cranney, C. W. & Meyer, J. G. CsoDIAq software for direct infusion shotgun proteome analysis. Anal. Chem. 93, 12312–12319 (2021).
Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y. & Ishiguro-Watanabe, M. KEGG: biological systems database as a model of the real world. Nucleic Acids Res. 53, D672–D677 (2025).
Orsburn, B. C. Proteome discoverer—a community enhanced data processing suite for protein informatics. Proteomes 9, 15 (2021).
Heil, L. R. et al. Evaluating the performance of the astral mass analyzer for quantitative proteomics using data-independent acquisition. J. Proteome Res. 22, 3290–3300 (2023).
Guzman, U. H. et al. Ultra-fast label-free quantification and comprehensive proteome coverage with narrow-window data-independent acquisition. Nat. Biotechnol. 42, 1855–1866 (2024).
Sinitcyn, P. et al. MaxDIA enables library-based and library-free data-independent acquisition proteomics. Nat. Biotechnol. 39, 1563–1573 (2021).
Bouwmeester, R., Gabriels, R., Hulstaert, N., Martens, L. & Degroeve, S. DeepLC can predict retention times for peptides that carry as-yet unseen modifications. Nat. Methods 18, 1363–1369 (2021).
Guevremont, R. High-field asymmetric waveform ion mobility spectrometry: a new tool for mass spectrometry. J. Chromatogr. A 1058, 3–19 (2004).
Perez-Riverol, Y. et al. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 50, D543–D552 (2021).

Acknowledgements

The authors thank numerous scientific colleagues for their input, discussions and support. The authors expressly thank the PD software development team at Thermo Fisher Scientific for their collaboration, support and contributions to the successful integration of CHIMERYS into PD and the scientific discourse on the results. The authors thank E. Zander for consulting on mathematical topics. The authors thank M. The for consulting on entrapment experiments and FDR control. The authors also thank their colleagues D. Bold, J. Santoso and A. Guevende for contributions to the software. This work was partially supported by a European Research Council Starting Grant to M.W. (101077037) and by multiple grants from the German Federal Ministry of Education and Research to B.K. (CLINSPECT‐M, 161L0214A and 16LW0243K; ProteomeTools, 031L0008A) and M.F. (ESTHER, 13GW0603B). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Funding

Open access funding provided by Technische Universität München.

Author information

These authors contributed equally: Martin Frejno, Michelle T. Berger, Johanna Tüshaus, Alexander Hogrebe.

Authors and Affiliations

MSAID GmbH, Garching b. München, Germany
Martin Frejno, Michelle T. Berger, Alexander Hogrebe, Florian Seefried, Michael Graber, Patroklos Samaras, Samia Ben Fredj, Vishal Sukumar, Layla Eljagh, Igor Bronshtein, Lizi Mamisashvili, Markus Schneider, Siegfried Gessulat, Tobias Schmidt & Daniel P. Zolg
School of Life Sciences, Technical University of Munich, Freising, Germany
Johanna Tüshaus, Bernhard Kuster & Mathias Wilhelm
Munich Data Science Institute (MDSI), Technical University of Munich, Garching b. München, Germany
Bernhard Kuster & Mathias Wilhelm

Contributions

M.F. and M.W. conceived the study. M.F., M.W., and D.P.Z. developed and evaluated the initial prototype. F.S., P.S., T.S., M.G., I.B., S.B.F. and S.G. developed, implemented and optimized the algorithms. S.G., V.S., S.B.F., L.M. and M.G. developed the deep-learning models. T.S., M.S. and F.S. orchestrated the implementation of software modules and the deployment of the software. M.F., D.P.Z., F.S., M.G., M.T.B. and A.H. evaluated the algorithm. M.F., M.T.B., J.T., A.H. and D.P.Z. processed the results data. M.F., M.T.B., J.T., A.H. and D.P.Z. performed the data analysis. L.E. helped in the preparation of the figures. M.F., D.P.Z., M.T.B., A.H., F.S., P.S., S.G., T.S., J.T., B.K. and M.W. provided critical feedback, discussed the results and consulted in revisions. M.F., D.P.Z., J.T., B.K. and M.W. wrote the manuscript.

Corresponding authors

Correspondence to Martin Frejno or Mathias Wilhelm.

Ethics declarations

Competing interests

M.F., D.P.Z., S.G. and T.S. are co-founders, shareholders and employees of MSAID, a company that develops software for proteomics, including the algorithm presented in this manuscript. M.W. and B.K. are co-founders and shareholders of MSAID and OmicScouts, which operates in the field of proteomics, but they have no operational role in either company. M.T.B., A.H., F.S., M.G., P.S., S.B.F., V.S., L.E., I.B., L.M. and M.S. are employees of MSAID. MSAID is an applicant on multiple pending patent applications covering functionality implemented in CHIMERYS that list M.W., M.F., T.S. and F.S. as inventors. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Fengchao Yu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Chimeric DDA spectra.

(A) Proportions of MS2 spectra with at least one (blue) or no PSM identification (gray) in a 2-h HeLa DDA single-shot measurement (n = 1). Data taken from the LFQbench-type dataset and acquired on an Orbitrap QE HF-X with 1.3 Th isolation windows. (B) Distribution of the number of PSMs per MS2 spectrum for the same data as in (A).

Extended Data Fig. 2 Overlap in identifications and runtime comparison.

Venn diagram of PSM (A) and peptide group (B-C) identifications in a 2-h HeLa DDA single-shot measurement (n = 1) comparing CHIMERYS to the combination of other search engines at 1% PSM-level FDR (A), 1% peptide group-level FDR (B) or 1% FDR at different levels (C). The different FDR levels in (C) were the peptide group level for CHIMERYS, Sequest HT, Comet, MS Amanda and MaxQuant, the precursor level for MSFragger and the PSM level for Metamorpheus and MS-GF+. Data was acquired on an Orbitrap QE HF-X with 1.3 Th isolation windows and taken from the LFQbench-type dataset. (D) Runtime comparison for the analyses shown in (A). All search engines were run in a virtual Windows 10 environment with 8 cores and 64 GB of RAM to mirror the cloud environment of CHIMERYS.

Extended Data Fig. 3 CHIMERYS-unique identifications and FDR levels.

(A) MS1 apex intensity distribution of peptide groups filtered at 1% global peptide group-level FDR and identified uniquely by CHIMERYS (green) or also by at least one other search engine tested as part of Fig. 1e (orange). The dataset was a 2-h HeLa DDA single-shot measurement (n = 1) from the LFQbench-type dataset, acquired on an Orbitrap QE HF-X with 1.3 Th isolation windows. (B) Distribution of the number of matched peaks between predicted and experimental spectra for the same data as in (A). (C) Distribution of the normalized spectral contrast angle after deconvolution between predicted and experimental spectra for the same data as in (A). (D) Distribution of the support vector machine score from mokapot for the same data as in (A), including targets with a global peptide group-level FDR exceeding 1% (blue). The support vector machine score distribution for decoys (gray) is overlaid. (E) Scatter plot of the number of unique peptides per protein group identified by Sequest HT (x axis) or CHIMERYS (y axis) for the same data as in (A). Protein groups are filtered to 1% global protein FDR and peptides are filtered to 1% global peptide group-level FDR. (F) The number of precursors (left) or peptide groups (right) identified by CHIMERYS at 1% run-specific PSM-, run-specific precursor- or global peptide group-level FDR. The dataset is a single 2-h HeLa DDA single-shot measurement (n = 1), acquired on an Orbitrap QE HF-X with 1.3 Th isolation windows and taken from the LFQbench-type dataset. (G) Same as (F), but for 2-h DDA single-shot measurements from two different conditions, acquired in three replicates on an Orbitrap QE HF-X with 1.3 Th isolation windows from the LFQbench-type dataset (n = 6).
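
Panel (C) reports a normalized spectral contrast angle. As a minimal sketch, assuming the definition commonly used when comparing predicted and experimental fragment intensities (one minus twice the arccosine of the cosine similarity, divided by pi), the score for two aligned intensity vectors could be computed as shown here. The function name and the example vectors are illustrative and are not taken from CHIMERYS; how CHIMERYS deconvolves a spectrum before scoring is not specified in this caption.

```python
import math


def normalized_spectral_angle(predicted, observed):
    """Return 1 - 2*arccos(cosine similarity)/pi for two aligned intensity vectors.

    A value of 1 means identical relative intensities; 0 means orthogonal
    vectors (for non-negative intensities).
    """
    if len(predicted) != len(observed):
        raise ValueError("intensity vectors must have the same length")
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    if norm_p == 0.0 or norm_o == 0.0:
        return 0.0  # assumption: an empty spectrum is treated as no similarity
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_o)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


# Hypothetical fragment intensities on a shared fragment-ion axis (not real data).
predicted = [0.9, 0.4, 0.0, 0.1]
observed = [1.0, 0.5, 0.1, 0.0]
angle = normalized_spectral_angle(predicted, observed)
```

Because the score depends only on the direction of the two vectors, scaling either spectrum by a constant leaves it unchanged.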

Extended Data Fig. 4 Comparison of CHIMERYS to Sequest HT and MSFragger on various datasets.

PSM, peptide and protein group identifications based on Sequest HT (orange), MSFragger (green) and CHIMERYS (blue) from measurements of human urine³² (A), CSF³² (B) and plasma³² (C), FFPE (D) and secretome samples³³ (E), as well as from publicly available 1-h measurements of Arabidopsis thaliana and Halobacterium³⁴ (F). FDR was controlled at the run-specific PSM, peptide group (only available for Sequest HT and CHIMERYS) and protein group level, respectively. (G) Comparison of phosphorylated³⁵, acetylated³⁶ and ubiquitinated³⁶ precursors for Sequest HT (orange), MSFragger (green) and CHIMERYS v.4.0.21 (blue). FDR was controlled at the run-specific precursor level. The shaded proportion of the bar chart displays precursors with a localization probability of >0.7 as calculated by ptmRS for Sequest HT, MSFragger or CHIMERYS (native localization). Note that ptmRS does not support ubiquitination. (H) Violin plot of the spectral angle comparing fragment ion predictions of INFERYS v.4.0.0 for unmodified and modified peptides to a hold-out dataset.

Extended Data Fig. 5 Instrument generations, gradients and wwDDA/WWA.

(A) PSM, peptide and protein group identifications based on Sequest HT (orange), MSFragger (green) and CHIMERYS (blue) from 1-h HeLa single-shot measurements, acquired using various Orbitrap generations (n = 1). FDR was controlled at 1% at the run-specific PSM-, global peptide group- (only available for Sequest HT and CHIMERYS) and global protein level, respectively. (B) PSM, peptide and protein group identifications based on CHIMERYS from pancreatic mouse cell single-shot measurements, acquired using different gradient lengths and isolation window widths (n = 1). FDR was controlled at 1% at the run-specific PSM-, global peptide group- and global protein level, respectively. (C) PSM, peptide and protein group identifications based on Sequest HT (orange), MSFragger (green) and CHIMERYS (blue) from 15 min pancreatic mouse cell single-shot measurements (n = 1). Data was acquired using HCD fragmentation with Orbitrap readout and different isolation window widths. FDR was controlled at 1% at the run-specific PSM-, global peptide group- (only available for Sequest HT and CHIMERYS) and global protein level, respectively. (D) PSM, peptide and protein group identifications based on Sequest HT (orange), MSFragger (green) and CHIMERYS after removal of low-abundance peaks (light blue) from 1-h HeLa single-shot measurements (n = 1). Data was acquired using CID fragmentation with ion trap readout and different isolation window widths. FDR was controlled at 1% at the run-specific PSM-, global peptide group- (only available for Sequest HT and CHIMERYS) and global protein level, respectively.

Extended Data Fig. 6 Entrapment analyses on DIA data.

Scatter plots of run-specific precursor-level self-reported (x axis) and entrapment FDR (y axis) from three different entrapment approaches (Supplementary Methods). Data is shown for CHIMERYS (A), DIA-NN (B), Spectronaut with default settings (C) and Spectronaut with more stringent settings⁴¹ (D) on triplicate 2-h DIA single-shot measurements from two different conditions (n = 6). Data was acquired on an Orbitrap QE HF-X with 8 Th isolation windows and taken from the LFQbench-type dataset³⁷. (E) Same as in (B), but for the corresponding TimsTOF Pro data. (F) Same as in (C), but for the corresponding TimsTOF Pro data.

Extended Data Fig. 7 DIA data analysis with CHIMERYS, DIA-NN and Spectronaut.

(A) Runtime comparison of CHIMERYS v.2.7.9, CHIMERYS v.4.0.21, DIA-NN v.1.8.1 and Spectronaut v.19 for the peptide eFDR approach. Runtimes include spectral library generation. (B) Precursors quantified by CHIMERYS, DIA-NN and Spectronaut in at least one (orange) or three (gray) out of three replicate measurements of two different conditions (n = 6) in a multispecies LFQbench dataset. Identifications are filtered at 1% run-specific precursor-level FDR or additionally at 1% run-specific precursor-level eFDR based on the peptide eFDR approach (Supplementary Methods). (C) Peak areas from DIA-NN and Spectronaut for fragment ions from precursors surviving (gray) or not surviving (red) 1% run-specific precursor-level eFDR based on the peptide eFDR approach for the same data as in (B). (D) Apex intensities for precursors identified by Spectronaut at 1% run-specific precursor-level FDR for the same data as in (B). Precursors are colored by the number of fragment ions with peak areas (F.PeakArea) > 1 that were not excluded from quantification by Spectronaut (curated fragments). (E) Example fragment ion XICs directly extracted from the raw file for the precursor at m/z 466.9506 identified by Spectronaut but not by CHIMERYS in Fig. 3d. All six library fragments are shown. XICs were extracted using the R package rawrr with 20 ppm fragment mass tolerance. (F) Precursors quantified by Spectronaut in at least one (orange) or three (gray) out of three replicate measurements of two different conditions in a multispecies LFQbench dataset (n = 6). Identifications in the 1st bar are filtered at 1% run-specific precursor-level FDR. The 2nd bar is additionally filtered at 1% run-specific precursor-level eFDR based on the peptide eFDR approach. The 3rd bar is additionally filtered by excluding precursors that are quantified based on fewer than three fragment ions with peak areas (F.PeakArea) > 1 that were not excluded from quantification by Spectronaut.
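
The fragment-count filter described for the 3rd bar in panel (F) can be sketched as follows. This is an illustration only: the record layout and the field names `precursor`, `peak_area` and `excluded` are hypothetical stand-ins for a fragment-level export (only F.PeakArea is named in the caption), and the default thresholds simply mirror the ones stated above.

```python
from collections import Counter


def precursors_passing_fragment_filter(fragments, min_fragments=3, min_area=1.0):
    """Return the precursors quantified on at least `min_fragments` usable fragments.

    A fragment counts as usable when its peak area exceeds `min_area` and it
    was not excluded from quantification (hypothetical `excluded` flag).
    """
    usable = Counter(
        row["precursor"]
        for row in fragments
        if row["peak_area"] > min_area and not row["excluded"]
    )
    return {precursor for precursor, count in usable.items() if count >= min_fragments}


# Hypothetical fragment-level records (not real data).
fragments = [
    {"precursor": "PEPTIDEK/2", "peak_area": 1200.0, "excluded": False},
    {"precursor": "PEPTIDEK/2", "peak_area": 850.0, "excluded": False},
    {"precursor": "PEPTIDEK/2", "peak_area": 430.0, "excluded": False},
    {"precursor": "SAMPLER/2", "peak_area": 900.0, "excluded": False},
    {"precursor": "SAMPLER/2", "peak_area": 1.0, "excluded": False},   # area not > 1
    {"precursor": "SAMPLER/2", "peak_area": 700.0, "excluded": True},  # excluded fragment
]
kept = precursors_passing_fragment_filter(fragments)
```

Counting only usable fragments, rather than all library fragments, is what separates this filter from a plain library-size cutoff.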

Extended Data Fig. 8 PRM and DIA quantification using CHIMERYS coefficients.

(A) Venn diagram of peptides identified in a PRM dataset (n = 1) – targeting 52 peptides from 18 human proteins – by CHIMERYS at 1% peptide group-level FDR (blue) or Skyline (gray). (B) Mirror XIC of the top five experimental (above the x axis) and predicted fragment ion intensities, scaled by the corresponding CHIMERYS coefficients (below the x axis) for one of the targeted peptides in A. (C) Coefficient-based reconstruction of elution peaks for four different peptides identified by CHIMERYS in the data in (A), only one of which was targeted in the assay (IGGGIDVPVPR).

Extended Data Fig. 9 CsoDIAq results as a function of the chosen library.

(A-B) Peptide length distributions for (A) the original DI-SPA library and (B) a library generated using the original DI-SPA library targets and decoys generated by CHIMERYS. (C) Boxplot visualizing the number of peptide groups identified by CsoDIAq at 1% run-specific FDR in each sample using different libraries (individual boxplots), including the ones shown in (A) and (B) (first and last, respectively, see also Supplementary Methods). The bottom (lower hinge), center and top of the box (upper hinge) are the first, second (median) and third quartile of the data points. The upper whisker extends from the hinge to the largest value no further than 1.5 × IQR from the hinge (IQR is the inter-quartile range, that is, the distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value no further than 1.5 × IQR from the hinge. All individual data points are shown (n = 88 raw files).
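
The hinge and whisker definitions in (C) can be made concrete with a short sketch. It assumes quartiles are computed by linear interpolation (the `inclusive` method of Python's `statistics.quantiles`); the quantile method used for the published figure is not stated in the caption, and the function name and example counts are illustrative.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Compute hinges, whiskers and outliers following the 1.5 x IQR rule."""
    data = sorted(values)
    # Assumption: quartiles by linear interpolation over the observed data.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Whiskers end at the most extreme data points within 1.5 x IQR of the hinges.
    upper_whisker = max(v for v in data if v <= q3 + 1.5 * iqr)
    lower_whisker = min(v for v in data if v >= q1 - 1.5 * iqr)
    outliers = [v for v in data if v < lower_whisker or v > upper_whisker]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
        "outliers": outliers,
    }


# Hypothetical identification counts per raw file (not real data).
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

In this example the value 100 lies beyond the upper fence, so the upper whisker stops at 8 and 100 is reported as an outlier.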

Extended Data Fig. 10 Comparison of DDA and DIA data.

PSMs (A) and protein groups (B) identified by CHIMERYS in DDA or DIA data. For the LFQbench-type dataset, triplicate 2-h single-shot measurements from two conditions (n = 6) are shown. Data was acquired on an Orbitrap QE HF-X with 1.3 or 8 Th isolation windows for DDA and DIA, respectively. For the Orbitrap Astral datasets, triplicate 14 min or 30 min single-shot measurements from a HeLa sample (n = 3) are shown. FDR was controlled at 1% at the run-specific PSM level or at the global protein level, respectively. Match between runs was used for DDA data; for DIA data, peptide groups were quantified irrespective of their run-specific FDR. (C) Peptide group-level log₂-ratio density plots for the same DDA and DIA data from the LFQbench dataset as in (A), quantified in MS1 using the Minora Feature Detector or in MS2 using CHIMERYS.

Supplementary information

Supplementary Information

Supplementary Figs. 1 and 2, discussion, methods and references.

Reporting Summary

Peer Review File

41592_2025_2663_MOESM4_ESM.xlsx

Supplementary Table 1: Comparison of FDR levels for search engines used in this study. Supplementary Table 2: Significantly enriched KEGG pathways in the set of proteins detected by CHIMERYS or CsoDIAq. Supplementary Table 3: List of external data processed as part of this study. Supplementary Table 4: List of internal data processed as part of this study.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Frejno, M., Berger, M.T., Tüshaus, J. et al. Unifying the analysis of bottom-up proteomics data with CHIMERYS. Nat Methods 22, 1017–1027 (2025). https://doi.org/10.1038/s41592-025-02663-w

Received: 25 June 2024
Accepted: 06 March 2025
Published: 22 April 2025
Issue date: May 2025
DOI: https://doi.org/10.1038/s41592-025-02663-w
