Main

Innovation in mass spectrometry (MS) has enabled tremendous progress in life sciences, and advances in MS instrumentation have led to the widespread adoption of omics disciplines (for example, metabolomics, lipidomics and proteomics). Despite the broad application of MS to characterize proteins, peptides, polymers, small molecules and nucleic acids across research disciplines, the ability for scientists to flexibly search for known chemical classes within and across MS datasets remains a challenge. The interrogation of MS data for the presence of specific chemicals or classes of molecules utilizes patterns in MS peaks representing intact analytes1 isotopic signatures or characteristic mass differences (MS1), associated fragmentation patterns in tandem MS data (MS/MS), chromatographic retention time, collisional cross-section or combinations thereof. This search for specific patterns is usually performed through a slow and error-prone manual inspection of the data. Alternatively, specialized software tools have been developed for this purpose but are often limited to search for a specific compound1 or a limited set of class-specific MS patterns2. Although bespoke one-off scripts provide the necessary flexibility to search for specific MS data patterns3, most noncomputational researchers and laboratories lack the computational skills to develop or customize them4,5. This skill gap limits biologists and chemists from effectively searching across MS datasets, potentially leaving many biologically important molecules hidden and undiscovered in the data. To address this gap, here we introduce the Mass Spectrometry Query Language (MassQL), an open-source language for flexible and mass spectrometer manufacturer-independent searching. MassQL aims to enable noncomputational researchers to easily search their MS (across MS1 and MS/MS) data for patterns of interest without the need for programming skills or a dedicated computational collaborator. In this Article, we describe the MassQL language, showcase its accompanying computational ecosystem to increase accessibility and highlight two application examples that demonstrate how MassQL can be used on entire public repositories (such as Global Natural Products Social Molecular Networking (GNPS)/MassIVE6, Metabolomics Workbench7 and MetaboLights8).

Results

The versatility of MS to capture unique characteristics of chemical structures, such as isotopic patterns (for example, bromination), diagnostic fragmentation (for example, product ion of sulfur trioxide) and neutral losses (for example, loss of sugar moieties), makes it a powerful analytical tool but also presents challenges to effectively interpret and utilize the richness of the data. Specifically, the ability to simultaneously utilize some or all of these different dimensions of MS data in an integrated fashion is currently out of reach. To complement the versatility of MS data, the MassQL language implements a succinct and expressive grammar to search for chemically and biologically relevant molecules in the MS data by leveraging these patterns (Fig. 1a,b). The MassQL language enables searching for patterns in MS1 data (for example, isotopic patterns and adduct mass shift) and MS/MS fragmentation spectra (for example, presence/absence of fragments and neutral losses), as well as applying chromatographic and ion mobility constraints. In addition, MassQL provides language support for user-defined tolerances, such as ion intensity and mass accuracy (Fig. 1b). Moreover, each of these query elements can be combined with Boolean operators (for example, AND, OR and NOT) to form more complex queries. These properties and patterns are common to nearly all MS data types, thus making MassQL agnostic to the instrument vendor, mass detector (for example, Orbitrap and quadrupole time-of-flight), ionization source (for example, electrospray ionization and matrix assisted laser desorption/ionization) and separation method (for example, liquid chromatography, gas chromatography and ion mobility). Together, the MassQL language provides users the flexibility and expressiveness to query simple and complex MS patterns regardless of their computational expertise, thus lowering the barriers of entry to MS data interrogation. Finally, as a language, new MassQL terms can be defined, which enables grammar and syntax evolution to maintain compatibility of queries to advancing MS technologies.

Fig. 1: Schematic representation of the MassQL ecosystem.
figure 1

a, Examples of molecules that produce distinctive data patterns when measured by MS as mass/charge (m/z) and intensity (i) peaks. b, MassQL query representing MS/MS fragmentation patterns that encapsulates a characteristic mass loss. The query can be translated to nine languages for enhanced accessibility. c, MassQL is a universal tool to query MS data. MassQL enables data searching in a single file to entire MS repositories. MassQL has also been incorporated into a wide range of MS software. d, MassQL queries are shared and reused via the Community Compendium, which increases reproducibility and knowledge dissemination.

The MassQL computational ecosystem is made of several key components that ensure its extensibility and usability for the community. These components are a formal definition of the MassQL grammar, a MassQL language parser, a reference implementation of the MassQL query engine, interactive web interfaces to enhance accessibility for noncomputational users and a NextFlow computational workflow for parallelized querying of very large datasets (Methods). Together, this enables MassQL searches on a single MS data file, within and across whole MS datasets, up to entire data repositories, including GNPS/MassIVE6, Metabolomics Workbench7 and MetaboLights8.

The MassQL formal grammar builds upon common MS terminology (Methods), which makes MassQL queries easy to write and alter for scientists with basic familiarity with MS. To further help new users, we have created extensive documentation (https://mwang87.github.io/MassQueryLanguage_Documentation/), instructional videos (https://www.youtube.com/playlist?list=PLkDps_-pcYZ5D3rhas208dsMg66lCGmcs) and an interactive MassQL sandbox (https://massql.gnps2.org/). The sandbox enables users to interactively develop and test MassQL queries on demonstration data to check for queries’ correctness before applying to their own data. Moreover, the sandbox automatically translates each query into English, Portuguese, Spanish, German, French, Mandarin Chinese, Japanese, Korean and Russian, which helps the interpretability of queries in manuscripts and grants. In addition, we have developed and deployed a large-language-model-powered conversational assistant (https://massql-analysis.gnps2.org/MassQL_Chatbot) to support new users with real-time MassQL query writing and troubleshooting. Finally, as a community effort, we created a wiki-like compendium of 35 applications of MassQL (https://massql.gnps2.org/compendium/; Fig. 1d) that serves as a reference and inspiration for new users. Queries in the compendium can be (re)used as they are, or as a starting point to develop queries tailored to the chemicals or compound classes of interest. The compendium is regularly updated with new examples of MassQL queries successfully used by community members in published works and will function as an ‘app store’ for a centralized MassQL query deposition and sharing.

While MassQL was originally implemented within the GNPS environment6,9, this was limiting for the accessibility to a broader audience. To further enhance usability, we have engaged with the wider metabolomics software community and, through these efforts, the MassQL language has been adopted and natively supported in a variety of MS data analysis software and infrastructure (Fig. 1c), both open-source (MZmine10, pyOpenMS11, MS-DIAL12, UniDec13 and Metabolomics Workbench7) and commercial (Bruker’s MetaboScape). It must be noted that these software tools use the same MassQL language grammar but implement their own MassQL query engine backend. This provides the possibility to develop optimized query engines for improved query performance, while maintaining query semantic consistency across tools. Finally, to facilitate the integration of MassQL into other platforms and pipelines, MassQL is available as Python and R (ref. 14) libraries and as a web application programming interface (API).

Discovery of siderophores at repository scale

Iron-binding small molecules play essential roles across biology, including microbial or mammalian siderophores that facilitate iron homeostasis15,16. We recently developed a native metabolomics method that includes infusion of iron to identify iron-binding compounds from complex samples based on the identification of retention time and peak shape correlations along with an Fe3+-characteristic m/z delta of 52.91 (ref. 17). As a complementary strategy to our native metabolomics method, which requires hardware modification in instrument setup and data acquisition, we mined all existing public metabolomics data, encompassing over 230 million analytes, to discover putative iron-binding compounds using MassQL. Although iron is often stripped from iron-binding compounds in liquid chromatography, we hypothesized that some iron-binding compounds will remain bound to iron at detectable levels in the MS data.

To develop the MassQL query, we used an MS dataset collected from Eutypa lata supernatant extracts that were treated with a post-liquid chromatography iron addition, as described by Aron et al.17 (Supplementary Notes 3.3.1 and 3.3.2). The refined MassQL query searches MS1 spectra for the characteristic isotopic pattern of iron along with a distinctive iron-binding mass shift (Fig. 2a). Specifically, the MassQL query searches for MS1 precursor ions with an m/z of x, m/z of x 1.993 at 0.063% intensity of x (the stable isotope ratio of 54Fe to 56Fe), and m/z of x + 1.0034 (the 13C peak), in addition to a proton-bound adduct (apo) peak at m/z of x 52.91. Combined MassQL queries for apo and bound MS2 spectra identified seven out of the eight putative siderophores identified using ion-identity molecular networking (IIMN)18 in the published analysis of the post-liquid chromatography iron addition of E. lata extracts. The unique compound that was found using IIMN but not by the MassQL query was missed because the 54Fe peak intensity fell outside of the expected intensity tolerance of 25%, which is probably due to the low intensity of this peak. We used strict m/z (10 ppm) and expected intensity percentage (25%) tolerances to minimize false-positive retrieval. Using MassQL, an additional four molecules were found that were not found using IIMN (Supplementary Fig. 3.3.2-1); these molecules are probably iron-binding, as manual inspection revealed that they exhibit the expected iron-bound isotopic pattern. The published IIMN analysis may have missed these molecules owing to requirements for peak shape and retention time correlations or owing to low-intensity peaks falling below feature finding thresholds.

Fig. 2: Insights from siderophores at a repository scale.
figure 2

a, The MassQL query for iron-binding searches for the characteristic 54Fe12C peak with 6.3% abundance relative to the 56Fe12C peak. Peaks queried by MassQL are colored in red. b, The molecular network of MassQL spectra hits after clustering by MSCluster (gray, no MS/MS annotation by GNPS MS/MS library; orange, MS/MS annotation by GNPS MS/MS libraries). Singletons (no neighbors in the network) have been excluded from this molecular network. c, A spectral family of the molecular network containing desferrioxamines, including proton-bound and iron-bound desferrioxamines E, G and B, in addition to structurally related analogs. d, Less than <1% of clustered MS/MS are annotated as siderophores by GNPS libraries when masses associated with ethylenediaminetetraacetic acid (an anticoagulant added to MS samples) are removed from the network.

After validation of the siderophore query on the E. lata dataset, we extended the search for iron-binding compounds to all public high-resolution Thermo Fisher Q Exactive data available in the GNPS/MassIVE repository6 (Supplementary Note 3.3.3). In searching over 230 million MS/MS spectra in 97,109 public data files, we retrieved 26,944 MS/MS spectra associated with the iron-characteristic isotope pattern in their MS1 data. We used MS-Cluster19 on the retrieved MS/MS spectra to collapse redundant observations of the candidate iron-binding molecules. This resulted in 7,504 consensus MS/MS spectra. Using these consensus spectra, we created a molecular network in GNPS. We could putatively identify 441 (5%) of the consensus spectra by spectral library search against the public MS/MS libraries in GNPS6. The putatively identified known compounds were further filtered to remove duplicates and adducts of ethylenediaminetetraacetic acid (a common anticoagulant). After filtering, 52% of annotated spectra are known iron binders, while an additional 25% are lipids, bile acids, polyphenols and peptide/amino acids—classes that have been shown to bind iron and other divalent metals20,21,22,23. Notably, because such a large fraction (>95%) of the analytes in the molecular network could not be annotated to known molecules (Fig. 2d), this molecular network is probably a rich resource for the discovery of new siderophores.

OPEs in the environments

Organophosphate esters (OPEs) are widely used flame retardants and plasticizers and are ubiquitously detected in environmental samples24,25,26. Recent studies in environmental chemistry have reported several novel OPEs by searching for a conserved, unique fragmentation pattern (O = P(OR)3) in MS/MS data27,28,29. The characteristic fragment, frequently leveraged in literature, is the phosphate product ion (H4O4P+, m/z 98.9842). The traditional low-throughput methodologies, which includes manual analysis and targeted lists of known OPEs, limits the ability to discover novel potentially harmful chemicals. Here, we demonstrate how MassQL can enable high-throughput screening and discovery of novel OPE without a predefined suspect list.

Utilizing the characteristic phosphate product ion, we formulated a MassQL query to search for an MS/MS peak at m/z 98.9847 with 50 ppm mass error tolerance and a peak intensity >50% of the base peak (Supplementary Notes 3.2.3 and 3.2.4). Similarly to siderophores example, a MassQL query was developed on a test marine water dataset to verify the utility of the MassQL for identifying putative OPEs in complex samples (Supplementary Note 3.2.3). In this test dataset, where three OPEs were previously identified by manual analysis30, MassQL returned 589 MS/MS spectra belonging to ~60 unique molecular features. MS/MS library search against the GNPS spectral library of the MassQL retrieved MS/MS spectra putatively annotated four OPE molecules, including all three previously identified OPEs, and one new putatively identified OPE. We additionally putatively identified two non-OPE molecules. The first molecule contained a phosphate group that resulted in the characteristic phosphate fragment in the MassQL query. The second putative non-OPE molecule did not contain a phosphate group, but further investigation revealed a potentially false-positive library match due to large mass errors in the library MS/MS peaks. These results suggested that the MassQL query was useful in retrieving OPEs but may also capture a broader range of phosphate-containing compounds. For this reason, when scaling up to repository searches (see below), we leverage the ability of molecular networking to group together similar structures and segregate OPEs from phosphate-containing compounds more generally.

To identify OPEs in public data, we scaled the MassQL query to all Q Exactive data in the GNPS/MassIVE6 data repository (which included >230 million MS/MS spectra). The MassQL query found 338,439 MS/MS matching the query criteria. Only 15% (51,310) of the MS/MS found by MassQL could be explained (precursor m/z match with 20 ppm mass error) by known OPEs based on a comprehensive OPEs list (n = 95) compiled by Ye et al.29. We extracted all MS/MS spectra and created consensus MS/MS spectra using Falcon-MS30, resulting in 2,777 consensus spectra. We used these consensus spectra to create a molecular network. Combining the library annotation results, we propose one additional OPE that was not included in the GNPS library and the comprehensive OPE list by Ye et al.29 (Fig. 3). It is important to reemphasize that, in the search for OPEs, the MassQL query was not designed to specifically look for OPEs but phosphate-containing molecules more generally. The molecular networking strategy complemented the MassQL results to organize OPE molecules in their own families. This combination of MassQL and molecular networking was critical because MassQL greatly reduced the data size (from 230 million to ~338,000 MS/MS, making molecular networking computationally tractable), while molecular networking helped to focus attention on families of specifically organophosphate molecules.

Fig. 3: Discovering OPEs at a repository scale.
figure 3

a, The general structure of OPEs and the MassQL query for the characteristic phosphate fragment. b, The molecular network of MassQL MS/MS in a repository scale query after clustering by Falcon30 (green, annotation by GNPS library; blue, annotation by an OPEs list curated by Ye et al.29; gray, unannotated). c, Summary of MS matching results (precursor m/z match with 20 ppm mass error) by the GNPS MS/MS library and the OPEs list by Ye et al.29. These putative identifications were based on precursor only (level 3 annotations). d, A molecular family shown containing alkyl-OPEs. OPEs reported by Ye et al. or in the MS/MS database search are indicated: tributyl phosphate and dimers (light orange and dark orange), and trioctyl phosphate and dimers (light blue and dark blue). Dibutyl phosphate (green) was not reported by Ye et al. or in the MS/MS database search. The structures displayed are illustrative of one possible isomer of the alkyl chain; the specific structure is beyond the scope of this report.

False discovery rate estimation and query validation strategies

Overall, the key challenge when using MassQL is to define queries that are sensitive toward the target compound(s) (that is, effectively retrieve the desired spectra) but do not retrieve too many false-positive hits. Due to the flexibility and broad applicability of MassQL, a universal method for false discovery rate (FDR) estimation is difficult to establish. Rather, tailored strategies for query validation can be designed case by case by the user, depending on the research context.

As demonstrated in a recent publication that used MassQL to mine liquid chromatography–MS/MS raw data in the public domain and discover new, unreported bile acids31, one possible strategy to estimate the FDR of MassQL queries over MS/MS spectra is to putatively identify the MS/MS spectra retrieved by MassQL using reference MS/MS spectral libraries. In their study, Mohanty et al. first used the GNPS spectral libraries (which contained 4,533 reference spectra of bile acids) to design and refine MassQL queries for bile acid spectra and estimate the queries’ selectivity. Such selectivity was measured by counting the number of retrieved bile acids and the number of retrieved non-bile acids (false positives). Thereafter, when using the refined MassQL queries to search repository data, more than 594,000 putative bile acids MS/MS spectra were retrieved. Among these, 270,437 MS/MS spectra were putatively identified by MS/MS library search, with 726 MS/MS matching to non-bile acids (0.27%). It is important to highlight that this FDR estimation approach is limited to the compounds that are deposited in the reference MS/MS libraries.

It should be noted that the same validation approach cannot be universally applied, for example, to the repository-scale discovery of siderophores described in the present manuscript (‘Discovery of siderophores at repository scale’). Siderophore molecules do not belong to a single compound class and can exhibit very diverse chemical structures. Therefore, queries for diagnostic MS/MS fragment ions cannot be used, and the search was instead performed on the MS1 level (‘Discovery of siderophores at repository scale’ section). The adopted strategy was to first develop and refine the query on a reference dataset of E. lata extracts known to contain iron-binding molecules17. After satisfactory results were obtained on the reference dataset (Supplementary Note 3.3.2), the query was performed on a repository scale. In this specific example, we used a strategy based on a ‘decoy’ query to get a sense for potential false discoveries (Supplementary Note 2), estimated at 21.7% for this repository scale query. Although a relatively large number of false-positive hits may be expected when performing searches at a repository scale, these false positives can be mitigated by utilizing additional computational tools in a downstream analysis (for example, molecular networking and spectral library search) to validate the query results. In these cases, MassQL acts more as a prefilter to reduce the data to a more tractable size (from hundreds of millions of spectra to a few thousands) and ‘enrich’ them with putative leads for further investigation and confirmatory experiments (‘Discovery of siderophores at repository scale’ and ‘OPEs in the environments’ sections). Overall, we encourage users to critically inspect query results and develop tailored validation strategies that are fit for their research purposes.

Discussion

Here, we introduced MassQL, a platform- and manufacturer-independent query language to search for MS data patterns within and across MS datasets. The main goal of MassQL is to provide noncomputational scientists with the flexibility to encode complex MS patterns into concise and expressive queries, reducing the need to write bespoke programming scripts. Together with the accompanying software infrastructure created through community efforts, the MassQL ecosystem lowers the barrier of entry to MS data interrogation for chemists and biologists lacking programming skills. This goal is being realized by the adoption of MassQL by the community. Of note, MassQL has already been used by researchers for MS data mining32, microbiome research31, exposomic and biomonitoring33, infectious disease research34 and natural product discovery35,36,37,38,39,40,41,42.

While traditional MS/MS similarity/search tools are a powerful technique for matching related or similar compounds within libraries or repositories (for example, spectral library search43 and MASST44), MassQL provides a complementary set of capabilities with enhanced flexibility and precision. Specifically, traditional MS/MS search tools rely on a full MS/MS similarity measure to retrieve a match, whereas MassQL queries are a set of user-defined constraints to retrieve a spectrum. This provides the flexibility to search for more specific and complex patterns (for example, combine MS1 and MS/MS patterns, retention and drift time constraints; Supplementary Notes 3.3 and 3.4) and empowers scientists to leverage their domain knowledge of the chemical or compound class under investigation. A specific example where MassQL can complement MS/MS similarity is when small structural modifications can result in large changes in the overall fragmentation patterns. This situation can cause relevant analog molecules to evade discovery by MS/MS similarity-based search. MassQL has been shown to complement MS/MS comparison strategies by enabling the searching for conserved key fragments or neutral losses in the MS/MS spectrum without requiring a full MS/MS similarity match (for example, ref. 42).

While MassQL-based querying of small and large datasets can be an effective way to prioritize data, the utility of this querying paradigm can be enhanced when paired with complementary analysis tools (before or after MassQL). First, in this Article, we showcased how discovery can be enhanced by combining upstream MassQL searches with downstream molecular networking and spectral library analysis (see Discovery of siderophores at repository scale’ and ‘OPEs in the environments’ sections in the Results). The use of MassQL as a prefiltering tool was essential in making the analysis possible, both from a computational tractability and data interpretability/prioritization perspective. Second, MassQL has been shown in the literature as a downstream tool to enhance the analysis of molecular networking—specifically, to aid in the prioritization of relevant compounds and, as highlighted above, to overcome shortcomings in MS/MS alignment35,37,38,39,41,42. We do acknowledge limitations with MassQL as available today—specifically, MassQL has limited capabilities to leverage more than a handful of MS spectra, for example, consecutive MS spectra arising from the elution of chromatographic peaks that can be grouped as a chromatographic feature.

We envision that the MassQL computational ecosystem will grow in adoption and capability. We recognize areas for improvement within MassQL, such as enhancing the query performance of MassQL and expanding the expressiveness of the language. We have designed and scaffolded the MassQL ecosystem to be extensible in both of these respects. We have defined a format context-free grammar for MassQL that is separate from any query engine that implements the MassQL semantics (Methods). Even though the reference implementation of the MassQL query engine is not explicitly optimized for speed (Supplementary Note 1), this architecture enables the wider community to develop new query algorithms to improve search speed, while using the same formal grammar. This paradigm has already been demonstrated in the popular software tools that implemented their own query engine: MS-DIAL12, Mzmine10 and Bruker’s MetaboScape. Finally, MassQL derives its strength from a vibrant user open-source community, and we expect this community-guided evolution of the language to continue in the future. As both the language and software ecosystem evolve, MassQL will become more capable and versatile to meet the scientific community’s growing needs in mining MS data.

Methods

MassQL language description

MassQL condition description

The following table gives a short summary of the most common ways that we can set a condition to find the data we want. Each of the following conditions may have more than one way to qualify each condition, to modulate the tolerance, intensity and so on to help improve the specificity a user desires. Further details can be found in the official documentation at https://mwang87.github.io/MassQueryLanguage_Documentation/.

Data type

MassQL syntax

Example

MS1 peak m/z

MS1MZ=<m/z value>

MS1MZ=163.1

MS2 precursor m/z

MS2PREC=<m/z>

MS2PREC=488.1

MS2 precursor charge

CHARGE=<Value>

CHARGE=2

Fragmentation product ion m/z

MS2PROD=<m/z>

MS2PROD=163.1

Ionization polarity

POLARITY=<Value>

POLARITY=Positive

Retention time (minimum)

RTMIN=<Value in Minutes>

RTMIN=5

Retention time (maximum)

RTMAX=<Value in Minutes>

RTMAX=10

Scan number (minimum)

SCANMIN=<Value>

SCANMIN=5

Scan number (maximum

SCANMAX=<Value>

SCANMAX=5

Ion mobility

MOBILITY=range (min=<min>, max=<max>)

MOBILITY=range (min=1, max=2)

MassQL qualifier description

The following gives a short summary of the most common qualifiers for conditions that users can specify in MassQL. Further details can be found in the official documentation at https://mwang87.github.io/MassQueryLanguage_Documentation/.

Qualifier type

Condition

Example

m/z tolerance

MS2PROD, MS1MZ, MS2PREC

MS2PROD=163.1:TOLERANCEMZ=0.1

m/z ppm tolerance

MS2PROD, MS1MZ, MS2PREC

MS2PROD=163.1:TOLERANCEPPM=50

Peak intensity minimum

MS2PROD, MS1MZ

MS2PROD=163.1:INTENSITYVALUE=1000

Peak intensity minimum percent of base peak

MS2PROD, MS1MZ

MS2PROD=163.1:INTENSITYPERCENT=10

Peak intensity minimum percent of TIC

MS2PROD, MS1MZ

MS2PROD=163.1:INTENSITYTICPERCENT=10

Mass defect of peak

MS2PROD, MS1MZ, MS2PREC

MS2PROD=ANY:MASSDEFECT=massdefect (min=0.1, max=0.2)

Exclusion of condition

MS2PROD, MS1MZ, MS2PREC

MS2PROD=163.1:EXCLUDED

MassQL reference implementation

The reference implementation is a fully working version of the MassQL software ecosystem for the community to use. It also serves as a guide for future MassQL implementations that may optimize speed and/or introduce new functions in other systems. Specifically, the reference implementation includes the following pieces:

  • MassQL formal grammar. The grammar is defined using the extended backus-naur form and builds upon common MS terminology for improved expressiveness and interpretability (see ‘MassQL language description’ section). During the development, community input and feedback shaped the vocabulary and capabilities of the language.

  • MassQL parser. The parser transforms a query into an internal data structure that can be used by any programming language. The parsing is done by using the lark Python library (https://github.com/lark-parser/lark) and specific Python code to transform a MassQL query to a parse tree and into the internal data structure that organizes all query conditions and qualifiers.

  • MassQL query engine. The MassQL reference query engine is written in Python and utilizes pyteomics45 to read open MS data files from mzML, mzXML and MGF formats into data frames. Such MS spectra in data frame format can optionally be saved as Apache feather files to cache data for repeated querying. The query engine itself processes the query over these data frames using the Python pandas library to perform data filtering and manipulations. Output results are data frames that can be exported as a tabular format. Optionally, the retrieved MS spectra can be exported in JSON format, MGF and mzML46.

The MassQL reference NextFlow47 workflow is designed as an automated high-throughput tool for querying multiple files simultaneously on a computational cluster. This workflow utilizes the reference query engine and parallelizes the querying of multiple files across a multicore processor or a batch cluster, depending on the compute resources available. All results are then merged together, including extracted MS spectra in JSON format, MGF and mzML format, if desired.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.