Main

Over the past decade, metabolomics data have been deposited into public repositories (MetaboLights, NMDR/Metabolomics Workbench, GNPS/MassIVE and NORMAN/DSFP), but these resources remain underutilized for identifying broad molecular trends1,2,3,4. The ability to search/filter mass spectrometry (MS) raw metabolomics data at repository scales requires computational solutions that can scale. Initially, in 2020, Mass Spectrometry Search Tool (MASST) queries required 20–40 min to search a repository of ~110 million tandem MS (MS/MS) spectra, but the introduction of indexing technologies reduced search time to seconds—even across over a billion mass spectra5,6,7 (Fig. 1a). The development of PanReDU enabled metadata harmonization across metabolomics repositories and indexing, allowing MASST-style searches to extend beyond GNPS6 and now include NORMAN/DSFP. Despite these advances and early demonstrations of its use, a limitation persisted: structure- and substructure-based queries using cheminformatics inputs such as names of molecules, SMILES or SMARTS (SMILES Arbitrary Target Specification) strings have not been possible, which has led to an overreliance on singular mass spectra selected by MS experts to represent a molecule’s behavior across public data searches that are dependent on diverse technologies.

Fig. 1: The FASSTrecords/StructureMASST infrastructure.

Bioinformatics tools must be easy to use for broad impact. a, MASST in metabolomics requires specialized knowledge and manual metadata integration, limiting accessibility5. b, StructureMASST reduces this barrier by allowing structure- or substructure-based searches across all MS/MS spectra and metadata, streamlining biologically contextualized discovery. c, FASSTrecords integrates public metabolomics datasets, linking molecules to structures via GNPS2, MassBank and MoNA references in a unified SQL database. d, The database includes four linked tables (library, masst, mri and redu) connected via integer-based keys for scalable structure-based retrieval. e, The StructureMASST web interface enables retrieval of library spectra, Multi-MASST matches and molecular distributions across sample types, including modification-tolerant searches. For terminology explanations, see Supplementary Table 1.

To address the challenge of structure-based exploration of public metabolomics data, we developed StructureMASST (https://structure-masst.gnps2.org/), a search engine and web-based application that enables pan-repository MS/MS searches using a chemical name, structure or substructure as input. This enables searching multiple MS/MS spectra at once to obtain a global picture of how the MS/MS are distributed among all the public data, which we will refer to as multi-MASSTing (Fig. 1b). This approach addresses several key challenges. First, molecules can have many names, and in principle, one would need to search all of them to capture all corresponding MS/MS spectra. Second, and more critically, public datasets come from diverse instruments and acquisition conditions, including varying collision energies, making cross-repository searches difficult. At the same time, public reference libraries with annotated structures are rapidly expanding, increasingly including multiple ion forms per molecule—such as adducts, in-source fragments (ISFs) and multimers—allowing StructureMASST to leverage this growing diversity for more comprehensive searches. Upon name, structure or substructure input, StructureMASST retrieves all matching MS/MS spectra from 1,565,620 reference spectra from the GNPS community, Massbank of North America (MoNA) and MassbankEU uploaded before September 2025 (Fig. 1c). The current MS/MS libraries from these sources encompass 200,258 unique two-dimensional (2D) chemical structures, noting that mass spectral matching generally does not resolve stereochemistry and that multiple ion forms often exist for a single molecule. These reference spectra cover 63 different ion species (such as [M + H]+, [M + Na]+) and 93 fragmentation energies, acquired from Orbitrap, time-of-flight and Fourier transform ion cyclotron resonance instruments. 
Once relevant spectra are retrieved, StructureMASST performs multi-MASST across all MS/MS for those molecules, with users able to select which spectra to include. To illustrate the impact of a single MASST versus a multi-MASST, searching with a single [M + H]+ reference spectrum for cholylphenylalanine matches to 184 raw files, only 13 of which are from human samples. By contrast, StructureMASST, through Multi-MASST, retrieves 137 spectra for the same compound, yielding 1,275 matching raw files, 450 from human samples (Supplementary Fig. 1).

Multi-MASST analysis can be run in one of two modes: (1) a non-precomputed (exploratory) search or (2) a precomputed search, which differ in speed and search methods.

The first is an exploratory mode that enables mass shift searches (for example, known Δm/z values or atoms), modification-tolerant matching and searches using reference spectral data uploaded to the repositories after August 2025, that is, after the knowledge base was precomputed. This is always the most comprehensive search possible with StructureMASST, but it takes several minutes.

By contrast, the precomputed mode will be faster and includes curated metadata. The precomputed knowledge base, which we call FASSTrecords, is a Structured Query Language (SQL) resource generated by performing spectral matching with FASST of all reference MS/MS libraries against public metabolomics data available as of August 2025. It includes 1,204,350,873 MS/MS matches against those library spectra from compatible public data (4,990 datasets, 920,790 liquid chromatography (LC)–MS/MS files, and 1,752,167,824 MS/MS spectra). Altogether, the search annotated 142,538,419 spectra across LC–MS/MS datasets in the repositories (Fig. 1d and Table 1). This represents an annotation rate of 8.1% at the MS/MS level.
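The annotation rate quoted above follows directly from the two spectrum counts. The short check below reproduces it; the variable names are ours, and the matches-per-spectrum ratio is an extra derived quantity that is not reported in the text.

```python
# Counts quoted in the paragraph above (FASSTrecords, data as of August 2025).
annotated_spectra = 142_538_419    # MS/MS spectra with at least one library match
total_spectra = 1_752_167_824      # MS/MS spectra in the compatible public data
total_matches = 1_204_350_873      # library-to-raw-data matches stored

annotation_rate = annotated_spectra / total_spectra
matches_per_annotated_spectrum = total_matches / annotated_spectra

print(f"annotation rate: {annotation_rate:.1%}")  # prints "annotation rate: 8.1%"
print(f"matches per annotated spectrum: {matches_per_annotated_spectrum:.2f}")
```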

Table 1 The structure of the SQLite database storing public metabolomics data annotations

To enable contextual analysis, PanReDU metadata harmonization has been expanded to cover 861,265 metabolomics raw data files as of August 20256. This structure facilitates access to metadata, including, but not limited to, instrument type, collision energy, organism, health condition and environmental context, when this is available. Provenance of the original data in the data repository is ensured via Universal Spectrum Identifiers (USIs), resolvable through the USI resolver8,9. The final knowledge base includes 420,799,889 metadata connections and defines the annotation-based search space available (Fig. 1d). The zipped SQLite database is 40 GB—enabling efficient distribution and use in local applications (license: ODC-ODbL).

A web-based interface allows users to input chemical names, SMILES or SMARTS for exact or substructure searches (Fig. 1e). Retrieved spectra include metadata on ion form, collision energy and instrument, and can be filtered before multi-MASST analysis. In the Supplementary information, we provide videos explaining how to perform molecule searches, substructure searches and modification searches (Supplementary Videos 110). This interface enables researchers with limited informatics background to explore and derive biological insights. The input for StructureMASST is either a chemical name that is available in PubChem10 or a SMILES/SMARTS11,12 representation of the chemical structure or substructure, which can be queried either as an exact match or as a substructure. Once the name of the molecule is selected, the corresponding SMILES is retrieved through PubChem. Alternatively, the structure can be entered directly as SMILES. SMILES can be used for both search modes, while SMARTS enables finely tuned substructure matching behavior. Users may optionally refine the retrieved spectra by selecting or deselecting them before proceeding to downstream multi-MASST analyses.

Overall, there are three search modes available: (1) exact match search, (2) modification search and (3) open modification search. The first is an exact match search, which retrieves all raw spectral matches corresponding to the selected MS/MS spectra from the precomputed SQLite database, FASSTrecords. This mode is recommended for most applications, as it is faster and computationally less demanding than the other search modes. It can return matches to a single compound or to multiple compounds sharing a given substructure. The second mode is a modification-tolerant search with a defined hypothesis, suited for cases where users suspect a specific chemical modification, such as hydroxylation, methylation or glucuronidation. In this mode, supported through FASST, users can input a Δm/z or atomic composition corresponding to the expected modification. StructureMASST then applies a modified cosine similarity algorithm to identify structurally related compounds that differ by the specified modification. This mode is computationally more intensive, as it requires full spectral alignment against all public data and retrieval of the results. The third mode is a blind (open) modification search, which detects chemically related compounds that differ by undefined modifications. This approach can discover unknown analogs of known chemicals but is computationally demanding. Its results can be filtered further by requiring the unmodified variant of the molecule to be present in the same raw file, taxonomy or dataset as any reported analog. Such filtering can be essential, as false discoveries tend to be more prevalent in this mode. Regardless of the search mode selected, all results are summarized by their metadata and visualized using an interactive Sankey diagram in which biological attributes (for example, taxonomy, sex and health condition) or technical variables (for example, ion form, instrument type and extraction conditions) can be explored.
An important consideration in the design of StructureMASST is provenance, ensuring that results are linked back to the raw data deposited in the repository. The USI/MRI (MR Run Identifier) links to raw data, which are accessible as hyperlinks, allow direct inspection of the MS/MS and MS1 intensity information in the GNPS Dashboard. A key function of StructureMASST is to map from chemical structure to sample information (metadata) readout. To benchmark the StructureMASST metadata readout, we reasoned that drugs should not be dominantly detected in nonhuman Metazoa samples. We therefore evaluated whether sample metadata were consistent with biological expectations using odds ratios (ORs) across increasing MS/MS cosine similarity thresholds (Supplementary Fig. 6). At a cosine threshold of 0.7, 10.5% of drug MS/MS spectra were not exclusively associated with human data, including folic acid, ibuprofen and several steroidal compounds. Increasing the threshold to 0.8 and 0.9 reduced this fraction to 3.5%, with folic acid exhibiting near-equal odds across species, consistent with its endogenous role in many animals. Nonhuman matches for ibuprofen were largely attributable to a single rat sleep-deprivation study.
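For readers unfamiliar with the OR readout, the sketch below shows the standard odds ratio for a 2×2 table of matched versus unmatched files in human and nonhuman samples. The text does not specify how the ORs were computed, so the function name `odds_ratio`, the 0.5 continuity correction for empty cells and the counts are all assumptions made for illustration.

```python
def odds_ratio(human_hit, human_miss, other_hit, other_miss, correction=0.5):
    """Odds ratio for the 2x2 table [[human_hit, human_miss], [other_hit, other_miss]].

    If any cell is zero, `correction` is added to every cell (Haldane-Anscombe
    correction) so that the ratio stays finite. This choice is an assumption.
    """
    cells = [human_hit, human_miss, other_hit, other_miss]
    if 0 in cells:
        cells = [c + correction for c in cells]
    a, b, c, d = cells
    return (a * d) / (b * c)


# Invented counts: files with and without a match to a drug spectrum.
example_or = odds_ratio(human_hit=90, human_miss=910, other_hit=5, other_miss=995)
print(f"OR (human vs nonhuman): {example_or:.2f}")
```

An OR well above 1 indicates that matches are enriched in human files, which is the expectation for a drug; an OR near 1, as described for folic acid, indicates no preference.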

To illustrate its utility, we present examples of molecules and types of hypotheses that StructureMASST can enable. Due to its widespread consumption, caffeine is a fairly ubiquitous molecule. We queried its SMILES and retrieved 316 MS/MS spectra, representing 6 ion forms and 24 collision energies (Supplementary Fig. 2a). Using a cosine threshold of 0.9 and a minimum of 5 matching peaks, StructureMASST via FASSTrecords matched caffeine MS/MS spectra in 6,228 files across 98 datasets. The default Sankey plot display highlights its presence in more than ten human sample types, including blood serum, kidney tissue, human milk and the alveolar system, as well as in Coffea arabica and Camellia sinensis, consistent with these plants being sources of coffee and tea (Supplementary Fig. 2b and ‘Discussion’ in Supplementary Fig. 2). Another example, which demonstrates how analog searches in StructureMASST can be leveraged, is the soil microbial metabolite surfactin, which, along with its analogs, was uniquely found in people living in remote villages (Supplementary Fig. 3 and ‘Discussion’ in Supplementary Fig. 3).

The caffeine and the surfactin examples represent the most straightforward applications of StructureMASST, consisting of a multi-MASST search of all available reference spectra for the input structure, followed by a search against MASST records or analog FASST searches. By contrast, substructure-based searches enable more complex analysis. For example, siderophores and ionophores such as pyochelin13 (from Pseudomonas aeruginosa, a human pathogen) and yersiniabactin14 (originally identified in Yersinia pestis—cause of bubonic plague) share a biosynthetically conserved substructure derived from salicylic acid and cyclized cysteine.

Using StructureMASST, we queried the salicylic-thiazoline substructure (SMILES: OC1=CC=CC=C1C2=NCCS2) to identify all molecules in the reference library containing this core. Substructure-based MS/MS searches retrieved 82 spectra corresponding to nine distinct molecules (Fig. 2a). Among these, dihydroaeruginoic acid represents a biosynthetic shunt product from pyochelin, yersiniabactin and related molecules with similar biosynthetic precursors that can be reduced to form aerugine15,16,17,18,19,20; ulbactin F is produced by a sponge-associated Brevibacillus species21, a genus rarely but occasionally found in immunocompromised individuals; and agrochelin is produced by Agrobacterium species22. By contrast, deferitin, deferitazole and CHEMBL compound accession SCHEMBL1314906 are synthetic and not known to occur naturally. A multi-MASST search (cosine >0.7, ≥5 matching fragment ions) across these nine molecules yielded 1,331 MS/MS matches in public data. Pyochelin matched P. aeruginosa datasets and samples from cystic fibrosis patients, where this pathogen is common23. In line with the elevated risk of P. aeruginosa infection among patients with rheumatoid arthritis receiving immunosuppressants, we also detected pyochelin in this clinical population24. Yersiniabactin MS/MS matched public data annotated as Escherichia coli, Streptomyces sp. and Pseudomonas sp. This is expected as E. coli is a known producer of yersiniabactin25,26, while Streptomyces species are known to synthesize structurally related siderophores such as amychelin, which share the same starting substructure. There is a notable absence of matches to the Yersinia and Klebsiella27 genera, which are known to produce yersiniabactin, in the Sankey plot. While there is a single Yersinia enterocolitica data file in MetaboLights (MTBLS10328), these data do not contain MS/MS, and no other compatible Yersinia data are in PanReDU. For Klebsiella, a Klebsiella sp. MS 92-3 listed in the ‘others’ category had an MS/MS match to the yersiniabactin reference spectrum. Notably, yersiniabactin was also detected in human fecal samples and ulbactin F in rheumatoid arthritis datasets, co-occurring with the biosynthetic shunt products. These findings provide direct evidence that yersiniabactin and related molecules are present in humans where they may affect host biology through their known immunomodulatory capacity25,28.

Fig. 2: Substructure and analog-based mapping of metabolites.

a, Substructure search for salicylic-thiazoline retrieved MS/MS of nine molecules (cosine 0.7, matching peaks 5), with multi-MASST matches across bacterial and human samples. b, Sertraline was detected and limited to human tissues, using multi-MASST analog search (cosine 0.6, matching peaks 5); only spectra where parent (red) and metabolites (blue) co-occurred were considered. c, Mass-defect analysis distinguishes chlorinated metabolites from nonchlorinated ions. d, Retention-time comigration of the –31.04-Da methylamine loss supports assignment as an ISF. e,f, Modifinder maps modifications on sertraline (+43.99 Da carboxylation (e), +148.04 Da pentose conjugation (f)), with red indicating highest-probability sites.

Other examples we highlight include analysis of drug metabolism. Sertraline and amiodarone were detected across multiple human tissues, including the brain. Mass-defect and retention-time analyses distinguished intact metabolites from ISFs, and Modifinder localized chemical modifications consistent with canonical metabolism, including carboxylation and pentose conjugation for sertraline (Fig. 2b–f, Supplementary Fig. 4 and ‘Discussion’ in Supplementary Fig. 4). Sertraline, a dichlorinated antidepressant, was represented by 54 reference spectra in the reference libraries, which we queried using multi-MASST in analog search mode with FASST while requiring co-occurrence of parent and analog ions. Matches were detected across multiple human tissues and biofluids (Fig. 2b). Mass-defect analysis using CHN-backbone compounds with varying degrees of chlorination (non-, mono- and dichlorinated; Supplementary Tables 4 and 5) confirmed that chlorinated features retained both chlorine atoms, whereas ions at –10.93 Da and –10.94 Da were not chlorinated and thus cannot derive from sertraline (Fig. 2c and Supplementary Table 5). The reported –31.04 Da loss (CH₃NH₂) was observed but comigrated with the parent ion, indicating it is best explained as an ISF rather than a true metabolite in these data (Fig. 2d). Additional mass shifts included +15.99 Da (oxygenation), –14.02 Da (demethylation), +43.99 Da (carboxylation) and +148.04 Da (C₅H₈O₅), consistent with conjugation to a pentose sugar. These features showed distinct retention times, supporting their assignment as true metabolites and not adducts or ISFs. Modifinder localized most modifications to the amine side chain and adjacent regions, aligning with canonical sertraline metabolism (N-demethylation, hydroxylation and conjugation29; Fig. 2e,f).
Together, these analyses reveal the existence of multiple chlorinated metabolites, including oxygenated, carboxylated and sugar-conjugated derivatives, across biofluids such as human milk and brain. In Supplementary Information 1 and 2, we highlight statistical considerations, limitations and future prospects of StructureMASST.
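The mass shifts discussed for sertraline can be checked against monoisotopic element masses. The sketch below is only a worked arithmetic example: CH₃NH₂ and C₅H₈O₅ are the compositions given in the text, whereas assigning O, CH₂ and CO₂ to oxygenation, demethylation and carboxylation is our assumption based on the usual interpretation of those shifts. The helper `mono_mass` is hypothetical.

```python
# Monoisotopic masses (Da) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}


def mono_mass(composition):
    """Monoisotopic mass of an elemental composition given as {element: count}."""
    return sum(MONO[element] * count for element, count in composition.items())


shifts = {
    "oxygenation (+O)": mono_mass({"O": 1}),
    "demethylation (-CH2)": -mono_mass({"C": 1, "H": 2}),
    "carboxylation (+CO2)": mono_mass({"C": 1, "O": 2}),
    "methylamine loss (-CH3NH2)": -mono_mass({"C": 1, "H": 5, "N": 1}),
    "C5H8O5 addition": mono_mass({"C": 5, "H": 8, "O": 5}),
}

for label, delta in shifts.items():
    print(f"{label}: {delta:+.2f} Da")
```

Rounded to two decimals, these values agree with the shifts reported above (+15.99, –14.02, +43.99, –31.04 and +148.04 Da).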

In summary, StructureMASST enables pan-repository, structure-based exploration of metabolomics data, supporting multi-MASST, modification-tolerant and blind modification searches. By linking chemical structures to metadata across tissues, organisms and environments, it empowers hypothesis generation, improves discovery and reveals new insights into metabolism, exposure and microbial interactions.

Methods

FASSTrecords database construction

The entire workflow was set up in a nextflow (version 24.10.5 build 5935) pipeline with four distinct processes running Python scripts via Python 3.9 unless specified otherwise.

  1. In the first process, GNPS reference libraries, including spectra from GNPS3, MoNA (https://massbank.us/) and MassBankEU30, were aggregated. Specifically, we used the GNPS cleaned library (gnps_cleaned.mgf), the MULTIPLEX synthesis libraries in both filtered (MULTIPLEX-SYNTHESIS-LIBRARY-FILTERED-PARTITION-1.mgf to -4.mgf) and full variants (MULTIPLEX-SYNTHESIS-LIBRARY-ALL-PARTITION-1.mgf to -6.mgf), additional GNPS libraries (GNPS-BILE-ACID-MODIFICATIONS.mgf, GNPS-DRUG-ANALOG.mgf and GNPS-IIMN-PROPOGATED.mgf) and the REFRAME negative and positive libraries (REFRAME-NEGATIVE-LIBRARY.mgf and REFRAME-POSITIVE-LIBRARY.mgf). The aggregated spectra were then clustered via falcon31 to group highly similar spectra. Falcon was adapted to return spectral library IDs for clustered spectra (https://github.com/YasinEl/falcon/tree/feature/fast-clustering; falcon-ms (version string 0.1.dev264+gdf7adb9fb) running via Python 3.9, numpy 1.26.4 and pylance 0.21.0). Clustering parameters were set to min_peaks = 2, scaling = root, min_mz = 40, max_mz = 2000, min_mz_range = 1, distance_threshold = 1, precursor_tol = 20 ppm and fragment_tol = 0.05.

  2. In the next step, an SQLite (version 3.39.2) database was initiated, and a library table was added containing all molecular metadata available in the csv files that accompany the mgf files at https://external.gnps2.org/processed_gnps_data/gnps_cleaned/. Moreover, integer IDs were assigned to each library entry (spectrum_id_int), and the falcon grouping ID (falcon_cluster_id) was added as a separate column. Next, raw data files accessible for MS/MS spectral matching via FASST MASST were retrieved from https://fasst.gnps2.org/library/files?library=metabolomicspanrepo_index_nightly, assigned an integer ID (mri_id_int) and deposited as mri_table. After that, sample metadata were downloaded from https://redu.gnps2.org/dump. The metadata table was subsetted to mri values present in the mri_table, and mri_id_int identifiers were added from the previously created mri_table. The metadata table was then added as redu_table. Finally, an empty table for masst_results was added.

  3. We then utilized FASST MASST to individually query library spectra against the metabolomicspanrepo_index_nightly database. Spectral matching parameters were set to a cosine of 0.7, minimum of 3 matching peaks, precursor mass tolerance of 0.05 Da and fragment mass tolerance of 0.05 Da. Before depositing the results of each query into the masst_table, all returned text values were replaced with corresponding integer IDs to optimize storage and retrieval efficiency. Namely, we deposited the obtained cosine scores (rounded to 2 digits), the number of matching peaks, an ID for the query spectrum (spectrum_id_int), an ID for the matching file (mri_id_int) and the matching scan ID (scan_id).

  4. In the final process, we ensured that no duplicates were accidentally included in any of the generated tables. In addition, indices were created on several columns to enable fast data retrieval. Specifically, the indices referenced in Table 1 were created.

For data handling in the above steps, pandas (2.3.2), sqlite3 (3.50.4) and sqlalchemy (2.0.43) packages were used.
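The matching thresholds used in step 3 (cosine of 0.7, at least 3 matching peaks, fragment tolerance of 0.05 Da) can be illustrated with a simplified cosine score. The sketch below is not the FASST implementation, which relies on indexing and may differ in peak pairing and intensity scaling. The function names `cosine_score` and `passes_thresholds` and the greedy peak pairing are assumptions made for illustration, and precursor filtering is omitted.

```python
import math


def cosine_score(query, target, fragment_tol=0.05):
    """Greedy cosine between two peak lists of (m/z, intensity) tuples.

    Returns (score, number_of_matched_peaks). Both spectra must be non-empty
    and contain at least one non-zero intensity.
    """
    def unit_scale(peaks):
        norm = math.sqrt(sum(intensity ** 2 for _, intensity in peaks))
        return [(mz, intensity / norm) for mz, intensity in peaks]

    q, t = unit_scale(query), unit_scale(target)
    # All peak pairs within the fragment tolerance, largest product first.
    candidates = sorted(
        (
            (qi * ti, a, b)
            for a, (qmz, qi) in enumerate(q)
            for b, (tmz, ti) in enumerate(t)
            if abs(qmz - tmz) <= fragment_tol
        ),
        reverse=True,
    )
    used_q, used_t = set(), set()
    score, matched = 0.0, 0
    for product, a, b in candidates:
        if a in used_q or b in used_t:
            continue  # each peak may be paired only once
        used_q.add(a)
        used_t.add(b)
        score += product
        matched += 1
    return score, matched


def passes_thresholds(query, target, min_cosine=0.7, min_peaks=3):
    """Apply the cosine and matched-peak thresholds quoted in step 3."""
    score, matched = cosine_score(query, target)
    return score >= min_cosine and matched >= min_peaks
```

With these defaults, two spectra that share only two fragment ions are rejected regardless of their cosine, which mirrors the role of the minimum matching peak criterion.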

The constructed database

In our informatics workflow, we implemented a compact, file-based SQLite (version 3.39.2) database to manage hundreds of millions of MS/MS match events across four public metabolomics repositories. Table 1 summarizes the five core tables in this database, detailing their roles, primary key columns and any indices used to speed up queries.

All spectrum matches are funneled into a single masst_table, which records only integer IDs (for library spectra, raw data files, datasets and scans) alongside similarity metrics (cosine score and matching peak count). Surrounding this central table are four lookup tables:

library_table: contains the full GNPS reference spectra metadata, keyed by spectrum_id_int.

mri_table: maps each raw data file path (MRI) to a small integer (mri_id_int), avoiding repeated storage of long file paths.

dataset_table: associates each public dataset accession (GNPS/MassIVE, MetaboLights or Metabolomics Workbench) with an integer ID (dataset_id_int).

redu_table: stores ReDU-curated metadata for files, joined via mri_id_int to integrate sample descriptors (for example, organism taxonomy, body part and instrument details) where available.

To balance performance with storage efficiency, we selectively built indices on the most critical join columns—specifically on masst_table.spectrum_id_int, the (mri_id_int, scan_id) pair in masst_table, the mri field in mri_table and redu_table.mri_id_int. This targeted indexing ensures that cross-table joins, even over tens of millions of rows, complete in seconds rather than minutes. By relying on integer-only core tables and a minimal set of indices, our system remains lightweight, portable and easily embedded into reproducible analysis pipelines.
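As an illustration of this layout, the sketch below builds a toy version of the integer-keyed tables in an in-memory SQLite database and joins them. Table names and the columns spectrum_id_int, mri_id_int, scan_id, mri and NCBITaxonomy follow the text; the columns compound_name, cosine and matching_peaks, the index names and all inserted rows are assumptions made for this example, and the real schema is given in Table 1.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE library_table (spectrum_id_int INTEGER PRIMARY KEY, compound_name TEXT);
    CREATE TABLE mri_table (mri_id_int INTEGER PRIMARY KEY, mri TEXT);
    CREATE TABLE redu_table (mri_id_int INTEGER, NCBITaxonomy TEXT);
    CREATE TABLE masst_table (
        spectrum_id_int INTEGER, mri_id_int INTEGER, scan_id INTEGER,
        cosine REAL, matching_peaks INTEGER
    );
    -- Indices on the join columns named in the text.
    CREATE INDEX idx_masst_spectrum ON masst_table (spectrum_id_int);
    CREATE INDEX idx_masst_mri_scan ON masst_table (mri_id_int, scan_id);
    CREATE INDEX idx_mri_path ON mri_table (mri);
    CREATE INDEX idx_redu_mri ON redu_table (mri_id_int);

    -- Toy rows, invented for this example.
    INSERT INTO library_table VALUES (1, 'caffeine');
    INSERT INTO mri_table VALUES (7, 'DEMO_DATASET:demo_file.mzML');
    INSERT INTO redu_table VALUES (7, '9606|Homo sapiens');
    INSERT INTO masst_table VALUES (1, 7, 42, 0.93, 6);
    """
)

# Resolve the integer keys of one match back to readable values.
rows = con.execute(
    """
    SELECT l.compound_name, f.mri, m.scan_id, m.cosine, r.NCBITaxonomy
    FROM masst_table AS m
    JOIN library_table AS l ON l.spectrum_id_int = m.spectrum_id_int
    JOIN mri_table AS f ON f.mri_id_int = m.mri_id_int
    LEFT JOIN redu_table AS r ON r.mri_id_int = m.mri_id_int
    WHERE m.spectrum_id_int = ?
    """,
    (1,),
).fetchall()
print(rows)
```

The LEFT JOIN on redu_table keeps matches from files without curated metadata, reflecting that sample descriptors are only present where available.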

StructureMASST

StructureMASST has been written as a streamlit (version 1.45) app32. Inputs are accepted as molecular names, which are interpreted via the PubChem Auto-Complete Search Service. Canonical SMILES for the obtained matches are then retrieved via the PubChem REST API. Whether SMILES are input through PubChem or manually, they are harmonized via functionality adapted from ref. 33. The structure can also be edited or drawn using a streamlit component based on Ketcher (streamlit-ketcher, version 0.0.1), and the drawn structure is then converted to SMILES for searching. If the input is provided as a SMARTS pattern, the SMARTSview REST API is used to generate a visual representation of it, making patterns easier to interpret and debug34. SMILES parsing and all further structure and substructure processing are performed with RDKit (version 2024.09.6), with substructure matching relying on its HasSubstructMatch() function. Tanimoto matching is performed on the basis of Morgan fingerprints (ECFP4; radius 2, 2,048 bits) using the RDKit rdFingerprintGenerator.GetMorganGenerator. Tanimoto similarity coefficients were then calculated with the RDKit DataStructs.TanimotoSimilarity. Sankey diagrams were generated in Python using the Plotly go.Sankey implementation. Further data handling was performed through numpy (1.26.4), pyarrow (15), requests (2.31), requests-cache (1.2), lxml (5.2), pyteomics (4.6) and celery (5.2.2).

Retrieving library spectra based on structures

All structure-based searches in FASSTrecords are performed using RDKit (version 2024.09.6). For substructure and similarity searches, precomputed RDKit fingerprints stored in the database are used: pattern fingerprints for substructure screening and Morgan (ECFP4; radius 2, 2,048 bits) fingerprints for similarity calculations. These fingerprints are stored as binary blobs and decoded during retrieval; they are loaded directly from the database rather than recalculated, ensuring fast and reproducible comparisons across the entire dataset.

In exact search mode, the query SMILES is parsed with RDKit to obtain its monoisotopic mass. Matches are retrieved through the first 14 characters of the InChIKey, which encode the molecule’s connectivity (regiochemistry) layer, and are further constrained by a ±0.02-Da mass tolerance to exclude matches with incorrect numbers of double bonds. Library spectra generated from low-mass-resolution instruments are excluded.
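The two filters of the exact search mode can be sketched in plain Python as follows. This is a simplified illustration and not the StructureMASST code: the function name `exact_structure_matches`, the record keys and the library entries are assumptions, and in the tool the InChIKey and mass of the query are derived from the SMILES with RDKit.

```python
def exact_structure_matches(library, query_inchikey, query_mass, mass_tol=0.02):
    """Keep entries sharing the first InChIKey block and lying within the mass window."""
    first_block = query_inchikey[:14]  # connectivity layer of the InChIKey
    return [
        entry
        for entry in library
        if entry["inchikey"][:14] == first_block
        and abs(entry["mono_mass"] - query_mass) <= mass_tol
    ]


library = [
    # Caffeine with its monoisotopic mass.
    {"spectrum_id_int": 1, "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "mono_mass": 194.0804},
    # Invented entry: same first block but a mass outside the tolerance.
    {"spectrum_id_int": 2, "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "mono_mass": 196.0960},
    # Invented entry: placeholder InChIKey with a different first block.
    {"spectrum_id_int": 3, "inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N", "mono_mass": 194.0804},
]

hits = exact_structure_matches(library, "RYYVLZVUVIJVGH-UHFFFAOYSA-N", 194.0804)
print([entry["spectrum_id_int"] for entry in hits])  # prints [1]
```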

In substructure search mode, one representative molecule per unique InChIKey block is screened using its precomputed Pattern fingerprint, and candidate hits are confirmed with RDKit’s HasSubstructMatch() function. Once a representative block is identified as containing the query substructure, all associated spectra belonging to that block are retrieved.
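The screen-then-confirm logic can be sketched without RDKit by treating fingerprints as bit masks: a candidate can only contain the query substructure if every bit set in the query fingerprint is also set in the candidate fingerprint. In the sketch below, fingerprints are plain Python integers, `confirm` stands in for RDKit's HasSubstructMatch(), and the function name, block labels and bit patterns are invented for illustration.

```python
def screen_and_confirm(query_fp, block_fps, confirm):
    """Return blocks that pass the fingerprint screen and the exact confirmation."""
    hits = []
    for block, fp in block_fps.items():
        # Screen: all query bits must be present in the candidate fingerprint.
        if fp & query_fp != query_fp:
            continue
        # Confirm: the screen can yield false positives, so check exactly.
        if confirm(block):
            hits.append(block)
    return hits


block_fps = {"BLOCK_A": 0b101101, "BLOCK_B": 0b100001, "BLOCK_C": 0b001111}
query_fp = 0b000101

checked = []  # records which blocks reach the (expensive) confirmation step


def confirm(block):
    checked.append(block)
    return block == "BLOCK_A"  # stand-in for an exact substructure match


hits = screen_and_confirm(query_fp, block_fps, confirm)
print(hits, checked)
```

Here BLOCK_B is rejected by the screen alone, while BLOCK_C passes the screen but fails confirmation, which is why the exact check is still required.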

In similarity (Tanimoto) mode, precomputed Morgan (ECFP4) fingerprints are used to compute Tanimoto similarity coefficients (DataStructs.TanimotoSimilarity). InChIKey blocks with similarity scores above the user-specified threshold are selected, and all spectra belonging to those blocks are expanded and returned with full metadata.
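For reference, the Tanimoto coefficient is the number of bits shared by two fingerprints divided by the number of bits set in either. The sketch below computes it on sets of bit indices and applies a threshold. It is a pure-Python illustration of the formula, not a replacement for DataStructs.TanimotoSimilarity; the function names, block labels and bit sets are invented.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def blocks_above_threshold(query_bits, block_bits, threshold):
    """Select InChIKey blocks whose similarity to the query reaches the threshold."""
    return [
        block
        for block, bits in block_bits.items()
        if tanimoto(query_bits, bits) >= threshold
    ]


query_bits = {1, 2, 3}
block_bits = {"BLOCK_A": {2, 3, 4}, "BLOCK_B": {1, 2, 3}, "BLOCK_C": {7, 8}}
selected = blocks_above_threshold(query_bits, block_bits, threshold=0.5)
print(selected)  # prints ['BLOCK_A', 'BLOCK_B']
```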

After performing substructure or similarity searches, multiple 2D structures may match a given query. In these cases, we assume that the user is interested in the distribution of the substructure or of structurally related molecules, rather than each individual compound. Therefore, we report only the molecule with the best MS/MS match for each sample. Users interested in the biodistribution of multiple distinct molecules can submit them individually or use the batch mode.

When performing analog searches, the best-matching analog per unique Δmass and sample is reported. Consequently, this is the only search mode where the same sample can appear multiple times in the results (once per detected analog).

Across all modes, low-resolution analyzers (quadrupole or ion-trap instruments) are excluded, textual fields are harmonized (missing values reported as ‘unknown’), and large queries are processed in batches to ensure scalable and reproducible results.

Raw data search modes

After retrieving structure-level matches from FASSTrecords, raw data searches are performed to locate experimental spectra corresponding to these structures across public MS datasets. Representative spectra for each molecule are defined during FASSTrecords creation through FALCON clustering, which groups highly similar MS/MS spectra within the library. These representative spectra are then used as queries against FASSTrecords or directly via FASST using the selected search parameters.

Search results are subsequently intersected with the PanReDU metadata resource, which provides curated sample-level information (for example, organism, tissue and environment annotations). Only samples for which metadata are available are retained, ensuring that all downstream analyses are contextually interpretable. To avoid overcounting, only the top-ranking MS/MS match is reported for each sample based on spectral similarity.

After performing substructure or Tanimoto similarity searches, multiple related 2D structures can correspond to the same molecular pattern or substructure. In such cases, the search is interpreted as aiming to describe the overall distribution of that substructure (or of molecules similar to the query). Consequently, only the molecule with the best MS/MS match per sample is reported, regardless of how many molecules matched that sample. Users interested in the distributions of individual molecules can instead submit them separately or use the batch search mode.

In analog search mode, which identifies molecules differing by specific mass offsets (Δmass) relative to the query structure, the best-matching analog per unique Δmass and sample is reported. As a result, analog searches are the only mode where the same sample can appear multiple times in the results—each instance corresponding to a distinct analog observation.

Across all search modes, this postprocessing ensures that reported hits represent unique, biologically interpretable findings at the sample level, while maintaining consistency between structure-level matching, raw data retrieval and metadata integration.
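The postprocessing rules described above reduce to keeping the best-scoring match per grouping key: per sample in the standard modes, and per Δmass and sample in analog mode. The sketch below illustrates this with invented match records; the function name `best_per_key` and the dictionary keys are assumptions.

```python
def best_per_key(matches, key):
    """Keep the highest-cosine match for each value of key(match)."""
    best = {}
    for match in matches:
        k = key(match)
        if k not in best or match["cosine"] > best[k]["cosine"]:
            best[k] = match
    return list(best.values())


matches = [
    {"sample": "file_A", "delta_mass": 0.0, "cosine": 0.91},
    {"sample": "file_A", "delta_mass": 0.0, "cosine": 0.95},
    {"sample": "file_A", "delta_mass": 15.99, "cosine": 0.80},
    {"sample": "file_B", "delta_mass": 15.99, "cosine": 0.75},
]

# Standard modes: one reported match per sample.
per_sample = best_per_key(matches, key=lambda m: m["sample"])
# Analog mode: one reported match per (delta mass, sample) pair.
per_analog = best_per_key(matches, key=lambda m: (m["delta_mass"], m["sample"]))
print(len(per_sample), len(per_analog))  # prints 2 3
```

In analog mode, file_A appears twice in the output, once for the unmodified molecule and once for the +15.99 Da analog.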

Downstream and support tooling

Multiple tools have been linked and integrated into StructureMASST to simplify analysis. For library spectra and spectral matches, the GNPS Spectral Resolver (https://metabolomics-usi.gnps2.org) can be used to visualize individual spectra and spectral matches by clicking the respective links in the library and results tables. For raw data results, linkouts to the GNPS Dashboard (https://dashboard.gnps2.org) are provided to inspect extracted ion chromatograms directly in the raw data files. After analog searches, the extracted ion chromatograms of both unmodified and modified species are extracted by default, allowing assessment of whether relative elution orders are as expected and whether co-elution indicates analytical artifacts, such as ISFs or other ion species from the same molecule, which could be mistaken for analogs. After modification/analog searches, Modifinder (https://modifinder.gnps2.org/) can be accessed for all supported adduct types from a linkout provided in the table to assess likely modification sites.

StructureMASST is meant as a tool for the comprehensive and interactive retrieval of raw data matching to query molecules. As such, it provides multiple ways to visualize matches across raw data and allows subsetting to matches of interest. However, different applications, such as environmental or evolutionary studies, require different types of integration for these data. StructureMASST is meant as a starting point from which multiple tools can branch off for more specific visualizations. Some preliminary tools, which are still under development, are provided in the ‘Downstream and support tooling’ section.

Reported FASSTrecords numbers

Numbers reported on FASSTrecords were retrieved on 24 September 2025 from the FASSTrecords sqlite database using the following queries:

Number of files with metadata:

SELECT COUNT(*) FROM redu_table;

Number of spectra in the library:

SELECT COUNT(*) FROM library_table;

Number of unique 2D structures in the library:

SELECT COUNT(DISTINCT InChIKey_smiles_firstBlock) FROM library_table;

Number of annotated scans:

SELECT COUNT(*) AS unique_mri_scan_pairs
FROM (
 SELECT 1
 FROM masst_table
 GROUP BY mri_id_int, scan_id
);

Number of annotated scans in human data:

SELECT COUNT(*) AS unique_mri_scan_pairs
FROM (
 SELECT 1
 FROM masst_table AS m
 WHERE EXISTS (
  SELECT 1
  FROM redu_table AS r
  WHERE r.mri_id_int = m.mri_id_int
   AND r.NCBITaxonomy = '9606|Homo sapiens'
 )
 GROUP BY m.mri_id_int, m.scan_id
);

Number of annotations with sample metadata:

SELECT COUNT(*)
FROM masst_table mt
WHERE EXISTS (
 SELECT 1 FROM redu_table rt
 WHERE rt.mri_id_int = mt.mri_id_int
);

Number of 2D structures with raw data matches:

SELECT COUNT(*)
FROM (
 SELECT DISTINCT l.InChIKey_smiles_firstBlock
 FROM library_table l
 WHERE l.InChIKey_smiles_firstBlock IS NOT NULL
  AND EXISTS (
   SELECT 1
   FROM masst_table m
   WHERE m.spectrum_id_int = l.spectrum_id_int
    AND m.annotation_rank = 1
   LIMIT 1
  )
);

Number of 2D structures with raw data matches in human samples:

WITH human_mri AS (
 SELECT DISTINCT mri_id_int
 FROM redu_table
 WHERE NCBITaxonomy = '9606|Homo sapiens'
)
SELECT COUNT(*)
FROM (
 SELECT DISTINCT l.InChIKey_smiles_firstBlock
 FROM library_table l
 WHERE l.InChIKey_smiles_firstBlock IS NOT NULL
  AND EXISTS (
   SELECT 1
   FROM masst_table m
   WHERE m.spectrum_id_int = l.spectrum_id_int
    AND m.mri_id_int IN (SELECT mri_id_int FROM human_mri)
   LIMIT 1
  )
);
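For readers who wish to rerun counts of this kind, the following minimal sketch executes three of the queries above with Python's built-in sqlite3 module. It runs against a small in-memory toy database rather than the real FASSTrecords file; the toy schema contains only the columns these queries touch, and all inserted rows are invented for illustration.

```python
import sqlite3

# Toy stand-in for FASSTrecords (assumption: only the columns used below).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE redu_table (mri_id_int INTEGER, NCBITaxonomy TEXT);
CREATE TABLE masst_table (mri_id_int INTEGER, scan_id INTEGER, spectrum_id_int INTEGER);
INSERT INTO redu_table VALUES (1, '9606|Homo sapiens'), (2, '10090|Mus musculus');
-- File 3 has annotations but no metadata row in redu_table.
INSERT INTO masst_table VALUES (1, 10, 100), (1, 10, 101), (1, 11, 100), (2, 5, 100), (3, 7, 102);
""")

ANNOTATED_SCANS = """
SELECT COUNT(*) FROM (
 SELECT 1 FROM masst_table GROUP BY mri_id_int, scan_id
);
"""

ANNOTATED_SCANS_HUMAN = """
SELECT COUNT(*) FROM (
 SELECT 1 FROM masst_table AS m
 WHERE EXISTS (
  SELECT 1 FROM redu_table AS r
  WHERE r.mri_id_int = m.mri_id_int
   AND r.NCBITaxonomy = '9606|Homo sapiens'
 )
 GROUP BY m.mri_id_int, m.scan_id
);
"""

ANNOTATIONS_WITH_METADATA = """
SELECT COUNT(*) FROM masst_table mt
WHERE EXISTS (
 SELECT 1 FROM redu_table rt WHERE rt.mri_id_int = mt.mri_id_int
);
"""

n_scans = con.execute(ANNOTATED_SCANS).fetchone()[0]
n_human = con.execute(ANNOTATED_SCANS_HUMAN).fetchone()[0]
n_with_metadata = con.execute(ANNOTATIONS_WITH_METADATA).fetchone()[0]
print(n_scans, n_human, n_with_metadata)  # prints: 4 2 4
```

To run the same statements against a downloaded copy of the database, the in-memory connection would be replaced by a connection to the local file path.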

Biological examples – Matching and filtering criteria

Caffeine example

MS/MS spectra were retrieved via exact structure matching of the SMILES CN1C=NC2=C1C(=O)N(C(=O)N2C)C. We then searched FASSTrecords using a minimum cosine of 0.9 and minimum matching peaks set to 5.

Salicylic acid–thiazoline example

MS/MS spectra were retrieved via substructure search of the SMILES OC1=CC=CC=C1C2=NCCS2. We then searched FASSTrecords using a minimum cosine of 0.7 and minimum matching peaks set to 5. We then limited the returned results to human samples by selecting the NCBITaxonomy column in the Column-dropdown below the results table and then selecting ‘9606|Homo sapiens’ in the Value-dropdown. The filter was applied by clicking the ‘Keep only selected rows’ button.

Surfactin C example

MS/MS spectra were retrieved via exact structure matching of the SMILES CC(C)CCCCCCCCCC1CC(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)O1)CC(C)C)CC(C)C)CC(=O)O)C(C)C)CC(C)C)CC(C)C)CCC(=O)O. We then utilized FASST using a minimum cosine of 0.6, 5 minimum matching peaks, and fragment and precursor tolerances of 0.02 Da. Analog search was turned on, and the filter condition was set to ‘Raw file’. We then limited the returned results to human samples by selecting the NCBITaxonomy column in the Column-dropdown below the results table and then selecting ‘9606|Homo sapiens’ in the Value-dropdown. The filter was applied by clicking the ‘Keep only selected rows’ button.

Amiodarone example

MS/MS spectra were retrieved via exact structure matching of the SMILES CCCCC1=C(C2=CC=CC=C2O1)C(=O)C3=CC(=C(C(=C3)I)OCCN(CC)CC)I. We then utilized FASST using a minimum cosine of 0.6, 5 minimum matching peaks, and fragment and precursor tolerances of 0.02 Da. Analog search was turned on, and the filter condition was set to ‘Raw file’. We then limited the returned results to human samples by selecting the NCBITaxonomy column in the Column-dropdown below the results table and then selecting ‘9606|Homo sapiens’ in the Value-dropdown. The filter was applied by clicking the ‘Keep only selected rows’ button.

Sertraline example

MS/MS spectra were retrieved via exact structure matching of the SMILES CNC1CCC(C2=CC=CC=C12)C3=CC(=C(C=C3)Cl)Cl. We then utilized FASST using a minimum cosine of 0.6, 5 minimum matching peaks, and fragment and precursor tolerances of 0.02 Da. Analog search was turned on, and the filter condition was set to ‘Raw file’. We then limited the returned results to human samples by selecting the NCBITaxonomy column in the Column-dropdown below the results table and then selecting ‘9606|Homo sapiens’ in the Value-dropdown. The filter was applied by clicking the ‘Keep only selected rows’ button.

Desferrioxamine H example

MS/MS spectra were retrieved via exact structure matching of the SMILES CC(=O)N(O)CCCCCNC(=O)CCC(=O)N(O)CCCCCNC(=O)CCC(=O)O. We then searched FASSTrecords using a minimum cosine of 0.7 and minimum matching peaks set to 5. Matches to the library spectrum ‘CCMSLIB00000845585’ were removed by selecting the query_spectrum_id column in the Column-dropdown below the results table and then selecting ‘CCMSLIB00000845585’ in the Value-dropdown. The filter was applied by clicking the ‘Remove selected rows’ button.

Surfactin C Tanimoto similarity example

MS/MS spectra were retrieved via Tanimoto similarity search of the SMILES CC(C)CCCCCCCCCC1CC(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)O1)CC(C)C)CC(C)C)CC(=O)O)C(C)C)CC(C)C)CC(C)C)CCC(=O)O with a threshold of 0.8. We then utilized FASSTrecords using a minimum cosine of 0.7 and 5 minimum matching peaks.

Mass-defect analysis

Mass-defect values were calculated as the difference between the exact m/z and the nearest nominal mass (mass defect = exact mass – nominal mass). Data processing and visualization were performed using R (version 4.5.1) in the RStudio environment. For the mass-defect plot of amiodarone (Supplementary Fig. 4b), the m/z values of amiodarone and its potential metabolites identified through the StructureMASST search (Supplementary Fig. 4a) were used, while CHNO-backbone compounds with varying degrees of iodination (Supplementary Table 2) were referenced to confirm their iodination levels. Similarly, for the mass-defect plot of sertraline (Fig. 2c), the m/z values of sertraline and its potential metabolites identified from the StructureMASST search (Fig. 2b) were used, while CHN-backbone compounds with varying degrees of chlorination (Supplementary Table 4) were referenced to confirm their chlorination levels.
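The mass-defect definition above can be written as a short function. The analysis in this study was performed in R; the Python sketch below is only an illustration of the same arithmetic. The two m/z values are [M+H]+ values computed here from the molecular formulae of caffeine and amiodarone and are not taken from the study's data.

```python
def mass_defect(mz: float) -> float:
    """Return the exact m/z minus the nearest nominal (integer) mass."""
    # Note: Python's round() sends exact .5 ties to the nearest even integer;
    # such ties are not expected for measured m/z values.
    nominal = round(mz)
    return mz - nominal


# Illustrative [M+H]+ values computed from molecular formulae (assumption:
# not values reported in the study).
caffeine_mz = 195.0877    # C8H10N4O2 + H+
amiodarone_mz = 646.0310  # C25H29I2NO3 + H+

print(f"{mass_defect(caffeine_mz):.4f}")    # prints: 0.0877
print(f"{mass_defect(amiodarone_mz):.4f}")  # prints: 0.0310
```

Although amiodarone is much heavier than caffeine, its mass defect is smaller because each iodine atom contributes a negative mass defect, which is the property exploited when comparing compounds with different degrees of halogenation.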

ModiFinder analysis

ModiFinder analysis of several potential metabolites of amiodarone and sertraline identified from the StructureMASST search was performed using the ‘View Modification Site’ function provided in the results table for each USI generated in FASST mode. Clicking ‘View Modification Site’ directs users to the ModiFinder interface on GNPS2 (https://modifinder.gnps2.org/), where the results are shown. The inputs, parameters and results can be accessed through the links provided below.

Amiodarone

Δm/z −26.02:

https://modifinder.gnps2.org/?USI1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00013027336&USI2=mzspec%3AMSV000085760%3Araw%2FmzXML%2F5580.mzXML%3Ascan%3A2872&SMILES1=CCCCc1oc2ccccc2c1C%28%3DO%29c1cc%28I%29c%28OCCN%28CC%29CC%29c%28I%29c1&SMILES2&Helpers=&adduct=%5BM%2BH%5D1%2B&ppm_tolerance=25&filter_peaks_variable=0.01

Δm/z −125.90:

https://modifinder.gnps2.org/?USI1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00012316157&USI2=mzspec%3AMTBLS1866%3AFILES%2FLipidomic_ICU+COVID-19_ESI+Positive%2FDA17_p.mzML%3Ascan%3A686&SMILES1=CCCCc1oc2ccccc2c1C%28%3DO%29c1cc%28I%29c%28OCCN%28CC%29CC%29c%28I%29c1&SMILES2&Helpers=&adduct=%5BM%2BH%5D1%2B&ppm_tolerance=25&filter_peaks_variable=0.01

Sertraline

Δm/z +43.99:

https://modifinder.gnps2.org/?USI1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00000084936&USI2=mzspec%3AMSV000080673%3Accms_peak%2F2017.AmericanGut3K.mzXMLfiles%2FSamples%2F000006382_RB8_01_6463.mzML%3Ascan%3A1896&SMILES1=CNC1CCC%28c2ccc%28Cl%29c%28Cl%29c2%29c2ccccc21&SMILES2&Helpers=&adduct=%5BM%2BH%5D1%2B&ppm_tolerance=25&filter_peaks_variable=0.01

Δm/z +148.04:

https://modifinder.gnps2.org/?USI1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00003140022&USI2=mzspec%3AMSV000086415%3Accms_peak%2FPlate+01+Samples+RAW%2F16265624.mzML%3Ascan%3A1311&SMILES1=CNC1CCC%28c2ccc%28Cl%29c%28Cl%29c2%29c2ccccc21&SMILES2&Helpers=&adduct=%5BM%2BH%5D1%2B&ppm_tolerance=25&filter_peaks_variable=0.01
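The links above share a common query-string layout, which can be reproduced programmatically. The sketch below assembles such a link with Python's standard library, using only the parameter names visible in the links themselves (USI1, USI2, SMILES1, SMILES2, Helpers, adduct, ppm_tolerance and filter_peaks_variable). The function name is hypothetical, and it is an assumption that the site treats an empty `SMILES2=` the same as the bare `SMILES2` key that appears in the original links.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

MODIFINDER_BASE = "https://modifinder.gnps2.org/"


def modifinder_url(usi1: str, usi2: str, smiles1: str, adduct: str = "[M+H]1+",
                   ppm_tolerance: int = 25, filter_peaks_variable: float = 0.01) -> str:
    """Build a ModiFinder link (hypothetical helper mirroring the links above)."""
    params = {
        "USI1": usi1,        # library spectrum of the known compound
        "USI2": usi2,        # raw data scan of the putative analog
        "SMILES1": smiles1,  # structure of the known compound
        "SMILES2": "",       # unknown structure, left empty
        "Helpers": "",
        "adduct": adduct,
        "ppm_tolerance": ppm_tolerance,
        "filter_peaks_variable": filter_peaks_variable,
    }
    # urlencode percent-encodes ':', '/', '(', ')', '=', '[', ']' and '+'.
    return MODIFINDER_BASE + "?" + urlencode(params)


# Inputs decoded from the amiodarone link listed above for the -26.02 offset.
url = modifinder_url(
    usi1="mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00013027336",
    usi2="mzspec:MSV000085760:raw/mzXML/5580.mzXML:scan:2872",
    smiles1="CCCCc1oc2ccccc2c1C(=O)c1cc(I)c(OCCN(CC)CC)c(I)c1",
)
decoded = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(url)
```

Decoding the generated link with `parse_qs` recovers the original USIs and SMILES, which offers a quick check that the encoding is lossless.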

Statistical analysis of human enrichment of drug matches among Metazoa hits

To quantify whether StructureMASST raw data matches were disproportionately associated with human samples, we tested for enrichment of Homo sapiens within the set of positive matches for each queried drug molecule relative to its prevalence in other Metazoa samples. The background population was defined as all entries in the redu_table of FASSTrecords with MS/MS present (MS2spectra_count > 0) and NCBIKingdom == ‘Metazoa’. Human samples were defined as NCBITaxonomy == ‘9606|Homo sapiens’, and all remaining Metazoa entries were treated as nonhuman. Positive matches for each molecule were defined as raw data hits passing the specified spectral matching criteria (default: cosine >0.7 and matching peaks >5; additional cosine thresholds were evaluated as shown).

For each molecule, we constructed a 2 × 2 contingency table comparing the number of human versus nonhuman Metazoa samples among the molecule’s positive matches to the corresponding counts in the background (hits versus nonhits). We then applied Fisher’s exact test (two-sided) to each table to estimate an odds ratio (OR) and associated P value. The OR was interpreted as the relative odds that a positive match originated from Homo sapiens rather than nonhuman Metazoa compared with the same odds in the Metazoa background (OR >1, enrichment; OR <1, depletion). Multiple testing across molecules was controlled using the Benjamini–Hochberg procedure; adjusted P values (q values) are reported, with q < 0.05 considered significant.
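The testing procedure described above can be illustrated with a self-contained sketch that uses only the Python standard library. It is not the code used in this study: the function names and the per-drug counts are hypothetical, the OR shown is the sample (cross-product) odds ratio, whereas some statistical packages report a conditional estimate instead, and the two-sided P value sums the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Here a = human hits, b = human nonhits, c = nonhuman hits, d = nonhuman nonhits.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x human hits with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table at most as probable as the observed one (small tolerance).
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p, 1.0)


def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted P values (q values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value downward
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical counts per molecule: (human hits, human nonhits,
# nonhuman hits, nonhuman nonhits). Not data from this study.
tables = {
    "drug_A": (40, 960, 5, 1995),
    "drug_B": (3, 997, 6, 1994),
}
results = {name: fisher_exact_two_sided(*t) for name, t in tables.items()}
q_values = benjamini_hochberg([p for _, p in results.values()])
for (name, (odds_ratio, p)), q in zip(results.items(), q_values):
    print(name, f"OR={odds_ratio:.2f}", f"P={p:.3g}", f"q={q:.3g}")
```

Exact binomial coefficients keep the sketch dependency-free, but for background populations of repository size an established statistics library would be the more practical choice.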

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.