Structure-centric searching enables global mapping of the public metabolome

El Abiead, Yasin; Seo, Jeong In; Charron-Lamoureux, Vincent; Strobel, Michael; Gonçalves Nunes, Wilhan Donizete; Zhao, Haoqi Nina; Kvitne, Kine Eide; Zuffa, Simone; Mannochio-Russo, Helena; Gouda, Harsha; Bez, Cristina; Patan, Abubaker; Xing, Shipei; Zemlin, Jasmine; Mohany, Ipsita; Agongo, Julius; Caraballo Rodriguez, Andres Mauricio; Burnett, Lindsey A.; Deleray, Victoria; Pakkir Shah, Abzer K.; Kalinski, Jarmo-Charles; Petras, Daniel; Alygizakis, Nikiforos; Carver, Jeremy; Yurekten, Ozgur; Payne, Thomas; Fahy, Eoin; Subramaniam, Shankar; Vizcaíno, Juan Antonio; Wang, Mingxun; Dorrestein, Pieter C.

doi:10.1038/s41587-026-03082-8

Brief Communication
Open access
Published: 15 April 2026

Nature Biotechnology (2026)

Abstract

Searching and learning from aggregated public metabolomics data spanning thousands of studies has remained largely inaccessible. Here we present StructureMASST, a web-based application enabling scalable, structure-centric searches across public metabolomics repositories using molecule names or chemical representations. It queries a precomputed knowledgebase of 2.19 billion spectral matches and 420 million metadata links, supports modification-tolerant and mass-shift searches, and maps chemical structures across taxonomy, biological context and environmental conditions to accelerate discovery.

Main

Over the past decade, metabolomics data have been deposited into public repositories (MetaboLights, NMDR/Metabolomics Workbench, GNPS/MassIVE and NORMAN/DSFP), but these resources remain underutilized for identifying broad molecular trends^1,2,3,4. The ability to search/filter mass spectrometry (MS) raw metabolomics data at repository scales requires computational solutions that can scale. Initially, in 2020, Mass Spectrometry Search Tool (MASST) queries required 20–40 min to search a repository of ~110 million tandem MS (MS/MS) spectra, but the introduction of indexing technologies reduced search time to seconds—even across over a billion mass spectra^5,6,7 (Fig. 1a). The development of PanReDU enabled metadata harmonization across metabolomics repositories and indexing, allowing MASST-style searches to extend beyond GNPS⁶ and now include NORMAN/DSFP. Despite these advances and early demonstrations of its use, a limitation persisted: structure- and substructure-based queries using cheminformatics inputs such as names of molecules, SMILES or SMARTS (SMILES Arbitrary Target Specification) strings have not been possible, which has led to an overreliance on singular mass spectra selected by MS experts to represent a molecule’s behavior across public data searches that are dependent on diverse technologies.

**Fig. 1: The FASSTrecords/StructureMASST infrastructure.**

To address the challenge of structure-based exploration of public metabolomics data, we developed StructureMASST (https://structure-masst.gnps2.org/), a search engine and web-based application that enables pan-repository MS/MS searches using a chemical name, structure or substructure as input. This enables searching multiple MS/MS spectra at once to obtain a global picture of how the MS/MS spectra are distributed among all the public data, which we will refer to as multi-MASSTing (Fig. 1b). This approach addresses several key challenges. First, molecules can have many names, and in principle, one would need to search all of them to capture all corresponding MS/MS spectra. Second, and more critically, public datasets come from diverse instruments and acquisition conditions, including varying collision energies, making cross-repository searches difficult. At the same time, public reference libraries with annotated structures are rapidly expanding, increasingly including multiple ion forms per molecule (such as adducts, in-source fragments (ISFs) and multimers), allowing StructureMASST to leverage this growing diversity for more comprehensive searches. Upon name, structure or substructure input, StructureMASST retrieves all matching MS/MS spectra from 1,565,620 reference spectra from the GNPS community, MassBank of North America (MoNA) and MassBankEU uploaded before September 2025 (Fig. 1c). The current MS/MS libraries from these sources encompass 200,258 unique two-dimensional (2D) chemical structures, noting that mass spectral matching generally does not resolve stereochemistry and that multiple ion forms often exist for a single molecule. These reference spectra cover 63 different ion species (such as [M + H]+, [M + Na]+) and 93 fragmentation energies, acquired from Orbitrap, time-of-flight and Fourier transform ion cyclotron resonance instruments.
Once relevant spectra are retrieved, StructureMASST performs multi-MASST across all MS/MS spectra for those molecules, with users able to select which spectra to include. To illustrate the impact of a single MASST versus a multi-MASST, searching with a single [M + H]+ reference spectrum for cholylphenylalanine matches 184 raw files, only 13 of which are from human samples. By contrast, StructureMASST, through multi-MASST, retrieves 137 spectra for the same compound, yielding 1,275 matching raw files, 450 of which are from human samples (Supplementary Fig. 1).
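The gain from multi-MASST over a single MASST query can be pictured as a set union over the raw files matched by each reference spectrum. The following is a minimal sketch under that assumption; the spectrum labels and file names are hypothetical and the function is not part of StructureMASST.

```python
from typing import Dict, Set


def multi_masst(matches_by_spectrum: Dict[str, Set[str]]) -> Set[str]:
    """Return every raw file matched by at least one reference spectrum."""
    files: Set[str] = set()
    for matched_files in matches_by_spectrum.values():
        files |= matched_files
    return files


# Hypothetical matches for three ion forms of the same molecule.
toy_matches = {
    "ref_M+H": {"file_a", "file_b"},
    "ref_M+Na": {"file_b", "file_c"},
    "ref_in_source_fragment": {"file_d"},
}

single = toy_matches["ref_M+H"]      # what a single MASST query would see
combined = multi_masst(toy_matches)  # what the multi-MASST union would see
print(len(single), len(combined))
```

In this toy case the single query reaches two files while the union reaches four, which mirrors the direction of the cholylphenylalanine comparison in the text.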

Multi-MASST analysis can be run in one of two modes: (1) a non-precomputed (exploratory) search or (2) a precomputed search, which differ in speed and search methods.

The first is an exploratory mode that enables mass shift searches (for example, known Δm/z values or atoms), modification-tolerant matching and searches using reference spectral data uploaded to the repositories after August 2025, following precomputation of the knowledge base. This will always be the most comprehensive search possible with StructureMASST but takes several minutes.

By contrast, the precomputed mode will be faster and includes curated metadata. The precomputed knowledge base, which we call FASSTrecords, is a Structured Query Language (SQL) resource generated by performing spectral matching with FASST of all reference MS/MS libraries against public metabolomics data available as of August 2025. It includes 1,204,350,873 MS/MS matches against those library spectra from compatible public data (4,990 datasets, 920,790 liquid chromatography (LC)–MS/MS files, and 1,752,167,824 MS/MS spectra). Altogether, the search annotated 142,538,419 spectra across LC–MS/MS datasets in the repositories (Fig. 1d and Table 1). This represents an annotation rate of 8.1% at the MS/MS level.
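The annotation rate quoted above follows directly from the two spectrum counts in the same paragraph. A short check, using only the numbers stated in the text:

```python
# Counts taken from the paragraph above.
annotated_spectra = 142_538_419
searched_spectra = 1_752_167_824

annotation_rate = annotated_spectra / searched_spectra
print(f"{annotation_rate:.1%}")  # 8.1%
```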

Table 1 The structure of the SQLite database storing public metabolomics data annotations

To enable contextual analysis, PanReDU metadata harmonization has been expanded to cover 861,265 metabolomics raw data files as of August 2025⁶. This structure facilitates access to metadata, including, but not limited to, instrument type, collision energy, organism, health condition and environmental context, when this is available. Provenance of the original data in the data repository is ensured via Universal Spectrum Identifiers (USIs), resolvable through the USI resolver^8,9. The final knowledge base includes 420,799,889 metadata connections and defines the annotation-based search space available (Fig. 1d). The zipped SQLite database is 40 GB—enabling efficient distribution and use in local applications (license: ODC-ODbL).

A web-based interface allows users to input chemical names, SMILES or SMARTS for exact or substructure searches (Fig. 1e). Retrieved spectra include metadata on ion form, collision energy and instrument, and can be filtered before multi-MASST analysis. In the Supplementary information, we provide videos explaining how to perform molecule searches, substructure searches and modification searches (Supplementary Videos 1–10). This interface enables researchers with limited informatics background to explore and derive biological insights. The input for StructureMASST is either a chemical name that is available in PubChem¹⁰ or a SMILES/SMARTS^11,12 representation of the chemical structure or substructure, which can be queried either as an exact match or as a substructure. Once the name of the molecule is selected, the SMILES itself is retrieved through PubChem. Alternatively, one can also directly add the structure information in the form of SMILES. SMILES can be used for both search modes, while SMARTS enable finely tuned substructure matching behavior. Users may optionally refine the selected spectra through deselection or selection, depending on needs, before proceeding to downstream multi-MASST analyses.

Overall, there are three search modes available: (1) exact match search, (2) modification search and (3) open (blind) modification search. The first is an exact match search, which retrieves all raw spectral matches corresponding to the selected MS/MS spectra from the precomputed SQLite database, FASSTrecords. This mode is recommended for most applications, as it is faster than the other search modes and computationally efficient. It can return matches to a single compound or to multiple compounds sharing a given substructure. The second mode is a modification-tolerant search with a defined hypothesis, suited for cases where users suspect a specific chemical modification, such as hydroxylation, methylation or glucuronidation. In this mode, supported through FASST, users can input a Δm/z or atomic composition corresponding to the expected modification. StructureMASST then applies a modified cosine similarity algorithm to identify structurally related compounds that differ by the specified modification. This mode is computationally more intensive, as it requires full spectral alignment against all public data and retrieval of the results. The third mode is a blind modification search, which detects chemically related compounds that differ by undefined modifications. This approach can discover unknown analogs of known chemicals and is computationally demanding. Its results can be filtered further by requiring the unmodified variant of the molecule to be present in the same raw file, taxonomy or dataset as any reported analog. Such filtering can be essential as false discoveries tend to be more prevalent in this mode. Regardless of the search mode selected, all results are represented using a summary of the metadata and visualized using an interactive Sankey diagram where biological attributes (for example, taxonomy, sex and health condition) or technical variables (for example, ion form, instrument type and extraction conditions) can be explored.
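For the hypothesis-driven modification search, the expected Δm/z can be derived from the atomic composition of the modification. The sketch below assumes singly charged ions, so that Δm/z equals the monoisotopic mass difference; the helper name and the small element table are illustrative and not taken from StructureMASST.

```python
from typing import Dict

# Monoisotopic masses (Da) of the most abundant isotopes.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307400,
    "O": 15.99491462,
    "S": 31.97207117,
}


def delta_mz(composition: Dict[str, int]) -> float:
    """Mass shift for an added atomic composition (negative counts = loss)."""
    return sum(MONOISOTOPIC[element] * count for element, count in composition.items())


hydroxylation = delta_mz({"O": 1})                    # about +15.995
methylation = delta_mz({"C": 1, "H": 2})              # about +14.016
glucuronidation = delta_mz({"C": 6, "H": 8, "O": 6})  # about +176.032
print(round(hydroxylation, 4), round(methylation, 4), round(glucuronidation, 4))
```

Passing negative counts, for example `{"C": -1, "H": -2}`, gives the shift for a loss such as demethylation.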
An important consideration in the design of StructureMASST is provenance, ensuring that results are linked back to the raw data deposited in the repository. The USI/MRI (MS Run Identifier) link to raw data allows direct inspection of the MS/MS and MS1 intensity information in the GNPS Dashboard, which are accessible as hyperlinks. A key function of StructureMASST is to map from chemical structure to sample information (metadata) readout. To benchmark StructureMASST metadata readout, we reasoned that drugs should not be dominantly detected in nonhuman Metazoa samples. We therefore evaluated whether sample metadata were consistent with biological expectations using odds ratios (ORs) across increasing MS/MS cosine similarity thresholds (Supplementary Fig. 6). At a cosine threshold of 0.7, 10.5% of drug MS/MS spectra were not exclusively associated with human data, including folic acid, ibuprofen and several steroidal compounds. Increasing the threshold to 0.8 and 0.9 reduced this fraction to 3.5%, with folic acid exhibiting near-equal odds across species, consistent with its endogenous role in many animals. Nonhuman matches for ibuprofen were largely attributable to a single rat sleep-deprivation study.
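As a reminder of how an odds ratio is read in this kind of benchmark, the sketch below computes it for a 2×2 table of files with and without a match in human versus nonhuman samples. The table layout and all counts are assumptions for illustration; the text does not specify how the contingency tables were built.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio for the 2x2 table [[a, b], [c, d]].

    a: human files with a match      b: human files without a match
    c: nonhuman files with a match   d: nonhuman files without a match
    Assumes b and c are nonzero.
    """
    return (a * d) / (b * c)


# Hypothetical counts for one drug spectrum.
drug_or = odds_ratio(a=90, b=910, c=5, d=995)
print(round(drug_or, 2))  # 19.68
```

An odds ratio well above 1 would indicate enrichment in human samples, whereas a value near 1, as described for folic acid, would indicate no preference.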

To illustrate its utility, we present examples of molecules and types of hypotheses that StructureMASST can enable. Due to its widespread consumption, caffeine is a fairly ubiquitous molecule. We queried its SMILES and retrieved 316 MS/MS spectra, representing 6 ion forms and 24 collision energies (Supplementary Fig. 2a). Using a cosine threshold of 0.9 and a minimum of 5 matching peaks, StructureMASST via FASSTrecords matched caffeine MS/MS spectra in 6,228 files across 98 datasets. The default Sankey plot display highlights its presence in more than ten human sample types, including blood serum, kidney tissue, human milk, the alveolar system, and Coffea arabica and Camellia sinensis, consistent with these plants being sources of coffee and tea (Supplementary Fig. 2b and ‘Discussion’ in Supplementary Fig. 2). Another example we provide to demonstrate how analog StructureMASST can be leveraged is the soil microbial metabolite surfactin, which, along with its analogs, was uniquely found in people living in remote villages (Supplementary Fig. 3 and ‘Discussion’ in Supplementary Fig. 3).

The caffeine and the surfactin examples represent the most straightforward applications of StructureMASST, consisting of a multi-MASST search of all available reference spectra for the input structure, followed by a search against FASSTrecords or analog FASST searches. By contrast, substructure-based searches enable more complex analysis. For example, siderophores and ionophores such as pyochelin¹³ (from Pseudomonas aeruginosa, a human pathogen) and yersiniabactin¹⁴ (originally identified in Yersinia pestis, the cause of bubonic plague) share a biosynthetically conserved substructure derived from salicylic acid and cyclized cysteine.

Using StructureMASST, we queried the salicylic-thiazoline substructure (SMILES: OC1=CC=CC=C1C2=NCCS2) to identify all molecules in the reference library containing this core. Substructure-based MS/MS searches retrieved 82 spectra corresponding to nine distinct molecules (Fig. 2a). Among these, dihydroaeruginoic acid represents a biosynthetic shunt product from pyochelin, yersiniabactin and related molecules with similar biosynthetic precursors that can be reduced to form aerugine^{15,16,17,18,19,20}; ulbactin F is produced by a sponge-associated Brevibacillus species²¹, a genus rarely but occasionally found in immunocompromised individuals; and agrochelin is produced by Agrobacterium species²². By contrast, deferitin, deferitazole and CHEMBL compound accession SCHEMBL1314906 are synthetic and not known to occur naturally. A multi-MASST search (cosine >0.7, ≥5 matching fragment ions) across these 9 molecules yielded 1,331 MS/MS matches in public data. Pyochelin matched P. aeruginosa datasets and samples from cystic fibrosis patients, where this pathogen is common²³. In line with the elevated risk of P. aeruginosa infection among patients with rheumatoid arthritis receiving immunosuppressants, we also detected pyochelin in this clinical population²⁴. Yersiniabactin MS/MS matched public data annotated as Escherichia coli, Streptomyces sp. and Pseudomonas sp. This is expected as E. coli is a known producer of yersiniabactin^25,26, while Streptomyces species are known to synthesize structurally related siderophores such as amychelin, which share the same starting substructure. There is a notable absence of matches to Yersinia and Klebsiella²⁷ genera in the Sankey plot, which are known to produce yersiniabactin. While there is a single Yersinia enterocolitica data file in MetaboLights (MTBLS10328), these data do not contain MS/MS, and no other compatible Yersinia data are in PanReDU. For Klebsiella, a Klebsiella sp. MS 92-3 listed in the ‘others’ category had an MS/MS match to the yersiniabactin reference spectrum. Notably, yersiniabactin was also detected in human fecal samples and ulbactin F in rheumatoid arthritis datasets, co-occurring with the biosynthetic shunt products. These findings provide direct evidence that yersiniabactin and related molecules are present in humans where they may affect host biology through their known immunomodulatory capacity^25,28.

**Fig. 2: Substructure and analog-based mapping of metabolites.**

Other examples we highlight include analysis of drug metabolism. Sertraline and amiodarone were detected across multiple human tissues, including the brain. Mass-defect and retention-time analyses distinguished intact metabolites from ISFs, and Modifinder localized chemical modifications consistent with canonical metabolism, including carboxylation and pentose conjugation for sertraline (Fig. 2b–f, Supplementary Fig. 4 and ‘Discussion’ in Supplementary Fig. 4). Sertraline, a dichlorinated antidepressant, was represented by 54 reference spectra in the reference libraries, which we queried using multi-MASST in analog search mode with FASST while requiring co-occurrence of parent and analog ions. Matches were detected across multiple human tissues and biofluids (Fig. 2b). Mass-defect analysis using CHN-backbone compounds with varying degrees of chlorination (non-, mono- and dichlorinated; Supplementary Tables 4 and 5) confirmed that chlorinated features retained both chlorine atoms, whereas ions at –10.93 Da and –10.94 Da were not chlorinated and thus cannot derive from sertraline (Fig. 2c and Supplementary Table 5). The reported –31.04 Da loss (CH₃NH₂) was observed but comigrated with the parent ion, indicating it is best explained as an ISF rather than a true metabolite in these data (Fig. 2d). Additional mass shifts included +15.99 Da (oxygenation), –14.02 Da (demethylation), +43.99 Da (carboxylation) and +148.04 Da (C₅H₈O₅), consistent with conjugation to a pentose sugar. These features showed distinct retention times, supporting their assignment as true metabolites and not adducts or ISFs. Modifinder localized most modifications to the amine side chain and adjacent regions, aligning with canonical sertraline metabolism (N-demethylation, hydroxylation and conjugation²⁹; Fig. 2e,f). 
Together, these analyses reveal the existence of multiple chlorinated metabolites, including oxygenated, carboxylated and sugar-conjugated derivatives, across biofluids such as human milk and brain. In Supplementary Information 1 and 2, we highlight statistical considerations, limitations and future prospects of StructureMASST.

In summary, StructureMASST enables pan-repository, structure-based exploration of metabolomics data, supporting multi-MASST, modification-tolerant and blind modification searches. By linking chemical structures to metadata across tissues, organisms and environments, it empowers hypothesis generation, improves discovery and reveals new insights into metabolism, exposure and microbial interactions.

Methods

FASSTrecords database construction

The entire workflow was set up in a Nextflow (version 24.10.5 build 5935) pipeline with four distinct processes running Python scripts via Python 3.9 unless specified otherwise.

(1) In the first process, GNPS reference libraries, including spectra from GNPS³, MoNA (https://massbank.us/) and MassBankEU³⁰, were aggregated. Specifically, we used the GNPS cleaned library (gnps_cleaned.mgf), the MULTIPLEX synthesis libraries in both filtered (MULTIPLEX-SYNTHESIS-LIBRARY-FILTERED-PARTITION-1.mgf to -4.mgf) and full variants (MULTIPLEX-SYNTHESIS-LIBRARY-ALL-PARTITION-1.mgf to -6.mgf), additional GNPS libraries (GNPS-BILE-ACID-MODIFICATIONS.mgf, GNPS-DRUG-ANALOG.mgf and GNPS-IIMN-PROPOGATED.mgf) and the REFRAME negative and positive libraries (REFRAME-NEGATIVE-LIBRARY.mgf and REFRAME-POSITIVE-LIBRARY.mgf). The aggregated spectra were then clustered via falcon³¹ to group highly similar spectra. Falcon was adapted to return spectral library IDs for clustered spectra (https://github.com/YasinEl/falcon/tree/feature/fast-clustering; falcon-ms (version string 0.1.dev264+gdf7adb9fb) running via Python 3.9, numpy 1.26.4 and pylance 0.21.0). Clustering parameters were set to min_peaks = 2, scaling = root, min_mz = 40, max_mz = 2000, min_mz_range = 1, distance_threshold = 1, precursor_tol = 20 ppm and fragment_tol = 0.05.
(2) In the next step, an SQLite (version 3.39.2) database was initiated, and a library table was added containing all molecular metadata available within the csv files provided alongside the mgf files at https://external.gnps2.org/processed_gnps_data/gnps_cleaned/. Moreover, integer IDs were assigned to each library entry (spectrum_id_int), and the falcon grouping ID (falcon_cluster_id) was added as a separate column. Next, raw data files accessible for MS/MS spectral matching via FASST MASST were retrieved from https://fasst.gnps2.org/library/files?library=metabolomicspanrepo_index_nightly, assigned an integer ID (mri_id_int) and deposited as mri_table. After that, sample metadata were downloaded from https://redu.gnps2.org/dump. The metadata table was subsetted to mri values present in the mri_table, and mri_id_int identifiers were added from the previously created mri_table. The metadata table was then added as redu_table. Finally, an empty table for the MASST results (masst_table) was added.
(3) We then utilized FASST MASST to individually query library spectra against the metabolomicspanrepo_index_nightly database. Spectral matching parameters were set to a cosine of 0.7, minimum of 3 matching peaks, precursor mass tolerance of 0.05 Da and fragment mass tolerance of 0.05 Da. Before depositing the results of each query into the masst_table, all returned text values were replaced with the corresponding integer IDs to optimize storage and retrieval efficiency. Namely, we deposited the obtained cosine scores (rounded to 2 digits), the number of matching peaks, an ID for the query spectrum (spectrum_id_int), an ID for the matching file (mri_id_int) and the matching scan ID (scan_id).
(4) In the final process, we ensured that no duplicates were accidentally included in any of the generated tables. In addition, indices were created on several columns to enable fast data retrieval. Specifically, the indices referenced in Table 1 were created.
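The integer encoding described in step (3) can be sketched as a lookup from text identifiers to the integer IDs assigned in step (2), combined with rounding of the cosine score. This is a minimal illustration only: the function name, the tuple layout and the example identifiers are hypothetical and do not reproduce the pipeline's scripts.

```python
from typing import Dict, List, Tuple

# (library spectrum ID, raw file MRI, scan ID, cosine, matching peaks)
RawMatch = Tuple[str, str, int, float, int]
# (spectrum_id_int, mri_id_int, scan_id, cosine rounded to 2 digits, matching peaks)
EncodedMatch = Tuple[int, int, int, float, int]


def encode_matches(
    rows: List[RawMatch],
    spectrum_ids: Dict[str, int],
    mri_ids: Dict[str, int],
) -> List[EncodedMatch]:
    """Replace text identifiers with integer IDs and round the cosine score."""
    return [
        (spectrum_ids[spectrum], mri_ids[mri], scan, round(cosine, 2), n_peaks)
        for spectrum, mri, scan, cosine, n_peaks in rows
    ]


# Hypothetical lookup tables and a single hypothetical match.
spectrum_ids = {"LIBRARY_SPECTRUM_A": 1}
mri_ids = {"DATASET_X:path/to/file.mzML": 1}
encoded = encode_matches(
    [("LIBRARY_SPECTRUM_A", "DATASET_X:path/to/file.mzML", 42, 0.8349, 7)],
    spectrum_ids,
    mri_ids,
)
print(encoded)
```

Storing only integers and a two-digit score in the central table keeps each row small, which matters when the table holds on the order of a billion matches.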

For data handling in the above steps, pandas (2.3.2), sqlite3 (3.50.4) and sqlalchemy (2.0.43) packages were used.

The constructed database

In our informatics workflow, we implemented a compact, file-based SQLite (version 3.39.2) database to manage hundreds of millions of MS/MS match events across four public metabolomics repositories. Table 1 summarizes the five core tables in this database, detailing their roles, primary key columns and any indices used to speed up queries.

All spectrum matches are funneled into a single masst_table, which records only integer IDs (for library spectra, raw data files, datasets and scans) alongside similarity metrics (cosine score and matching peak count). Surrounding this central table are four lookup tables:

library_table: contains the full GNPS reference spectra metadata, keyed by spectrum_id_int.

mri_table: maps each raw data file path (MRI) to a small integer (mri_id_int), avoiding repeated storage of long file paths.

dataset_table: associates each public dataset accession (GNPS/MassIVE, MetaboLights or Metabolomics Workbench) with an integer ID (dataset_id_int).

redu_table: stores ReDU-curated metadata for files, joined via mri_id_int to integrate sample descriptors (for example, organism taxonomy, body part and instrument details) where available.

To balance performance with storage efficiency, we selectively built indices on the most critical join columns—specifically on masst_table.spectrum_id_int, the (mri_id_int, scan_id) pair in masst_table, the mri field in mri_table and redu_table.mri_id_int. This targeted indexing ensures that cross-table joins, even over tens of millions of rows, complete in seconds rather than minutes. By relying on integer-only core tables and a minimal set of indices, our system remains lightweight, portable and easily embedded into reproducible analysis pipelines.
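The indexing strategy described above can be reproduced on a toy scale with Python's built-in sqlite3 module. This is a sketch under assumptions: only the table names, the ID columns and the indexed columns come from the text, while the remaining column names (cosine, matching_peaks), the index names and the omission of dataset_table are simplifications.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE library_table (spectrum_id_int INTEGER PRIMARY KEY, falcon_cluster_id INTEGER);
    CREATE TABLE mri_table (mri_id_int INTEGER PRIMARY KEY, mri TEXT);
    CREATE TABLE redu_table (mri_id_int INTEGER, NCBITaxonomy TEXT);
    CREATE TABLE masst_table (
        spectrum_id_int INTEGER,
        mri_id_int INTEGER,
        scan_id INTEGER,
        cosine REAL,            -- assumed column name
        matching_peaks INTEGER  -- assumed column name
    );
    -- Indices on the join columns named in the text (index names are assumed).
    CREATE INDEX idx_masst_spectrum ON masst_table (spectrum_id_int);
    CREATE INDEX idx_masst_mri_scan ON masst_table (mri_id_int, scan_id);
    CREATE INDEX idx_mri_mri ON mri_table (mri);
    CREATE INDEX idx_redu_mri ON redu_table (mri_id_int);
    """
)

index_names = {
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
}
print(sorted(index_names))
```

Each index speeds up lookups on its column at the cost of extra storage, which is why restricting them to the join columns is a reasonable trade-off for a distributable file.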

StructureMASST

StructureMASST has been written as a streamlit (version 1.45) app³². Inputs are accepted as molecular names, which are interpreted via the PubChem Auto-Complete Search Service. Canonical SMILES for the obtained matches are then retrieved via the PubChem REST API. Whether SMILES are input through PubChem or manually, they are harmonized via functionality adapted from ref. 33. The structure can also be edited or drawn using a streamlit component based on Ketcher (streamlit-ketcher, version 0.0.1), and is then converted to SMILES for searching. If the input is provided as a SMARTS pattern, the SMARTSview REST API is used for generating a visual representation of it, making it easier to interpret and debug patterns³⁴. This, and all further, SMILES, structure and substructure processing is performed through RDKit’s (version 2024.09.6) HasSubstructMatch() function. Tanimoto matching is performed on the basis of Morgan fingerprints (ECFP4; radius 2, 2,048 bits) using the RDKit rdFingerprintGenerator.GetMorganGenerator. Tanimoto similarity coefficients were then calculated with the RDKit DataStructs.TanimotoSimilarity. Sankey diagrams were generated in Python using the Plotly go.Sankey implementation. Further data handling was performed through numpy (1.26.4), pyarrow (15), requests (2.31), requests-cache (1.2), lxml (5.2), pyteomics (4.6) and celery (5.2.2).

Retrieving library spectra based on structures

All structure-based searches in FASSTrecords are performed using RDKit (version 2024.09.6). For substructure and similarity searches, precomputed RDKit fingerprints stored in the database are used: pattern fingerprints for substructure screening and Morgan (ECFP4; radius 2, 2,048 bits) fingerprints for similarity calculations. These fingerprints are stored as binary blobs and decoded during retrieval; they are loaded directly from the database rather than recalculated, ensuring fast and reproducible comparisons across the entire dataset.

In exact search mode, the query SMILES is parsed with RDKit to obtain its monoisotopic mass. Matches are retrieved through the first 14 characters of the InChIKey, which encode the molecule’s connectivity (regiochemistry) layer, and are further constrained by a ±0.02-Da mass tolerance to exclude matches with incorrect numbers of double bonds. Library spectra generated from low-mass-resolution instruments are excluded.
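The exact-match filter can be sketched as a comparison of the first InChIKey block followed by a mass window. The example assumes the InChIKey and monoisotopic mass of the query have already been computed (with RDKit in the actual tool); the dictionary keys and the second and third library entries are hypothetical and only exercise the two filters.

```python
from typing import Dict, List


def exact_matches(
    query_inchikey: str,
    query_mass: float,
    library: List[Dict],
    tol_da: float = 0.02,
) -> List[Dict]:
    """Keep entries sharing the first InChIKey block and lying within the mass window."""
    first_block = query_inchikey[:14]
    return [
        entry
        for entry in library
        if entry["inchikey"][:14] == first_block
        and abs(entry["mono_mass"] - query_mass) <= tol_da
    ]


toy_library = [
    # Caffeine (C8H10N4O2).
    {"id": "lib_1", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "mono_mass": 194.0804},
    # Artificial entry: same first block, mass outside the window.
    {"id": "lib_2", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "mono_mass": 196.0960},
    # Placeholder key: right mass, different connectivity block.
    {"id": "lib_3", "inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N", "mono_mass": 194.0804},
]

hits = exact_matches("RYYVLZVUVIJVGH-UHFFFAOYSA-N", 194.0804, toy_library)
print([hit["id"] for hit in hits])
```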

In substructure search mode, one representative molecule per unique InChIKey block is screened using its precomputed Pattern fingerprint, and candidate hits are confirmed with RDKit’s HasSubstructMatch() function. Once a representative block is identified as containing the query substructure, all associated spectra belonging to that block are retrieved.
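The two-stage logic of this mode (a cheap fingerprint screen followed by an exact confirmation) can be illustrated with plain integers standing in for bit vectors. In this sketch the `confirm` callable is a stand-in for RDKit's HasSubstructMatch(), and the fingerprints, block names and spectrum IDs are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def substructure_hits(
    query_fp: int,
    blocks: Dict[str, Tuple[int, List[str]]],
    confirm: Callable[[str], bool],
) -> List[str]:
    """Return spectra of all blocks that pass the screen and the confirmation.

    A block can only contain the query substructure if every bit set in the
    query fingerprint is also set in the block's fingerprint.
    """
    spectra: List[str] = []
    for block_id, (block_fp, block_spectra) in blocks.items():
        passes_screen = (query_fp & block_fp) == query_fp
        if passes_screen and confirm(block_id):
            spectra.extend(block_spectra)
    return spectra


toy_blocks = {
    "BLOCK_A": (0b1101, ["spec_1", "spec_2"]),  # passes screen and confirmation
    "BLOCK_B": (0b0100, ["spec_3"]),            # fails the screen
    "BLOCK_C": (0b0111, ["spec_4"]),            # passes the screen, fails confirmation
}

hits = substructure_hits(0b0101, toy_blocks, confirm=lambda block: block != "BLOCK_C")
print(hits)
```

The screen can produce false positives (BLOCK_C here) but no false negatives, so the more expensive confirmation only needs to run on the surviving candidates.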

In similarity (Tanimoto) mode, precomputed Morgan (ECFP4) fingerprints are used to compute Tanimoto similarity coefficients (DataStructs.TanimotoSimilarity). InChIKey blocks with similarity scores above the user-specified threshold are selected, and all spectra belonging to those blocks are expanded and returned with full metadata.
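The Tanimoto coefficient used here is the number of shared on-bits divided by the number of on-bits present in either fingerprint. The sketch below represents fingerprints as sets of on-bit positions rather than RDKit bit vectors; the bit positions, block names and threshold are hypothetical.

```python
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_blocks(query: Set[int], blocks: Dict[str, Set[int]], threshold: float) -> List[str]:
    """Return block IDs whose fingerprint similarity reaches the threshold."""
    return [block for block, fp in blocks.items() if tanimoto(query, fp) >= threshold]


query_bits = {1, 5, 9, 12}
toy_blocks = {
    "BLOCK_A": {1, 5, 7, 12, 20},  # 3 shared bits, 6 in the union -> 0.5
    "BLOCK_B": {2, 3},             # no shared bits -> 0.0
}

print(tanimoto(query_bits, toy_blocks["BLOCK_A"]))
selected = similar_blocks(query_bits, toy_blocks, threshold=0.5)
print(selected)
```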

After performing substructure or similarity searches, multiple 2D structures may match a given query. In these cases, we assume that the user is interested in the distribution of the substructure or of structurally related molecules, rather than each individual compound. Therefore, we report only the molecule with the best MS/MS match for each sample. Users interested in the biodistribution of multiple distinct molecules can submit them individually or use the batch mode.

When performing analog searches, the best-matching analog per unique Δmass and sample is reported. Consequently, this is the only search mode where the same sample can appear multiple times in the results (once per detected analog).

Across all modes, low-resolution analyzers (quadrupole or ion-trap instruments) are excluded, textual fields are harmonized (missing values reported as ‘unknown’), and large queries are processed in batches to ensure scalable and reproducible results.

Raw data search modes

After retrieving structure-level matches from FASSTrecords, raw data searches are performed to locate experimental spectra corresponding to these structures across public MS datasets. Representative spectra for each molecule are defined during FASSTrecords creation through FALCON clustering, which groups highly similar MS/MS spectra within the library. These representative spectra are then used as queries against FASSTrecords or directly via FASST using the selected search parameters.

Search results are subsequently intersected with the PanReDU metadata resource, which provides curated sample-level information (for example, organism, tissue and environment annotations). Only samples for which metadata are available are retained, ensuring that all downstream analyses are contextually interpretable. To avoid overcounting, only the top-ranking MS/MS match is reported for each sample based on spectral similarity.

After performing substructure or Tanimoto similarity searches, multiple related 2D structures can correspond to the same molecular pattern or substructure. In such cases, the search is interpreted as aiming to describe the overall distribution of that substructure (or of molecules similar to the query). Consequently, only the molecule with the best MS/MS match per sample is reported, regardless of how many molecules matched that sample. Users interested in the distributions of individual molecules can instead submit them separately or use the batch search mode.

In analog search mode, which identifies molecules differing by specific mass offsets (Δmass) relative to the query structure, the best-matching analog per unique Δmass and sample is reported. As a result, analog searches are the only mode where the same sample can appear multiple times in the results—each instance corresponding to a distinct analog observation.
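Both reporting rules above reduce to keeping the highest-scoring match per grouping key: the sample alone for exact, substructure and similarity searches, and the (sample, Δmass) pair for analog searches. The following is a minimal sketch of that reduction; the dictionary keys and values are hypothetical, and Δmass is assumed to be already rounded so that equal shifts compare equal.

```python
from typing import Callable, Dict, Hashable, List


def best_per_key(matches: List[Dict], key: Callable[[Dict], Hashable]) -> List[Dict]:
    """Keep only the match with the highest cosine for each grouping key."""
    best: Dict[Hashable, Dict] = {}
    for match in matches:
        group = key(match)
        if group not in best or match["cosine"] > best[group]["cosine"]:
            best[group] = match
    return list(best.values())


toy_matches = [
    {"sample": "s1", "delta_mass": 15.99, "cosine": 0.81},
    {"sample": "s1", "delta_mass": 15.99, "cosine": 0.92},
    {"sample": "s1", "delta_mass": -14.02, "cosine": 0.75},
    {"sample": "s2", "delta_mass": 15.99, "cosine": 0.88},
]

# One row per sample (exact, substructure and similarity searches).
per_sample = best_per_key(toy_matches, key=lambda m: m["sample"])
# One row per sample and mass shift (analog searches).
per_analog = best_per_key(toy_matches, key=lambda m: (m["sample"], m["delta_mass"]))
print(len(per_sample), len(per_analog))
```

In the analog grouping, sample s1 appears twice (once per Δmass), which is the behavior the text describes as unique to that mode.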

Across all search modes, this postprocessing ensures that reported hits represent unique, biologically interpretable findings at the sample level, while maintaining consistency between structure-level matching, raw data retrieval and metadata integration.

Downstream and support tooling

Multiple tools have been linked and integrated into StructureMASST to simplify analysis. For library spectra and spectral matches, the GNPS Spectral Resolver (https://metabolomics-usi.gnps2.org) can be used to visualize individual spectra and spectral matches by clicking the respective links in the library and results tables. For raw data results, linkouts to the GNPS Dashboard (https://dashboard.gnps2.org) are provided to inspect extracted ion chromatograms directly in the raw data files. After analog searches, the extracted ion chromatograms of both unmodified and modified species are extracted by default, allowing assessment of whether relative elution orders are as expected and whether co-elution indicates analytical artifacts, such as ISFs or other ion species from the same molecule, which could be mistaken for analogs. After modification/analog searches, Modifinder (https://modifinder.gnps2.org/) can be accessed for all supported adduct types from a linkout provided in the table to assess likely modification sites.

StructureMASST is meant as a tool for the comprehensive and interactive retrieval of raw data matching to query molecules. As such, it provides multiple ways to visualize matches across raw data and allows subsetting to matches of interest. However, different applications, such as environmental or evolutionary studies, require different types of integration for these data. StructureMASST is meant as a starting point from which multiple tools can branch off for more specific visualizations. Some preliminary tools, which are still under development, are provided in the ‘Downstream and support tooling’ section.

Reported FASSTrecords numbers

Numbers reported on FASSTrecords were retrieved on 24 September 2025 from the FASSTrecords SQLite database using the following queries:

Number of files with metadata:

SELECT COUNT(*) FROM redu_table;

Number of spectra in the library:

SELECT COUNT(*) FROM library_table;

Number of unique 2D structures in the library:

SELECT COUNT(DISTINCT InChIKey_smiles_firstBlock) FROM library_table;

Number of annotated scans:

SELECT COUNT(*) AS unique_mri_scan_pairs
FROM (
    SELECT 1
    FROM masst_table
    GROUP BY mri_id_int, scan_id
);

Number of annotated scans in human data:

SELECT COUNT(*) AS unique_mri_scan_pairs
FROM (
    SELECT 1
    FROM masst_table AS m
    WHERE EXISTS (
        SELECT 1
        FROM redu_table AS r
        WHERE r.mri_id_int = m.mri_id_int
          AND r.NCBITaxonomy = '9606|Homo sapiens'
    )
    GROUP BY m.mri_id_int, m.scan_id
);

Number of annotations with sample metadata:

SELECT COUNT(*)
FROM masst_table mt
WHERE EXISTS (
    SELECT 1 FROM redu_table rt
    WHERE rt.mri_id_int = mt.mri_id_int
);

Number of 2D structures with raw data matches:

SELECT COUNT(*)
FROM (
    SELECT DISTINCT l.InChIKey_smiles_firstBlock
    FROM library_table l
    WHERE l.InChIKey_smiles_firstBlock IS NOT NULL
      AND EXISTS (
          SELECT 1
          FROM masst_table m
          WHERE m.spectrum_id_int = l.spectrum_id_int
            AND m.annotation_rank = 1
          LIMIT 1
      )
);

Number of 2D structures with raw data matches in human samples:

WITH human_mri AS (
    SELECT DISTINCT mri_id_int
    FROM redu_table
    WHERE NCBITaxonomy = '9606|Homo sapiens'
)
SELECT COUNT(*)
FROM (
    SELECT DISTINCT l.InChIKey_smiles_firstBlock
    FROM library_table l
    WHERE l.InChIKey_smiles_firstBlock IS NOT NULL
      AND EXISTS (
          SELECT 1
          FROM masst_table m
          WHERE m.spectrum_id_int = l.spectrum_id_int
            AND m.mri_id_int IN (SELECT mri_id_int FROM human_mri)
          LIMIT 1
      )
);
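
The queries listed in this section can be run from any SQLite client. As a minimal sketch, the Python snippet below runs two of them against a small in-memory toy database. The table and column names are taken from the queries; the rows, the nonhuman taxonomy label and the commented-out file name are hypothetical and serve only as illustration.

```python
import sqlite3

# Toy stand-in for FASSTrecords: table and column names follow the queries
# in this section; all rows are invented for illustration.
con = sqlite3.connect(":memory:")
# For the real database one would instead open the downloaded file, e.g.
# con = sqlite3.connect("fasst_records.sqlite")  # hypothetical file name
con.executescript("""
CREATE TABLE redu_table (mri_id_int INTEGER, NCBITaxonomy TEXT);
CREATE TABLE masst_table (mri_id_int INTEGER, scan_id INTEGER,
                          spectrum_id_int INTEGER);
INSERT INTO redu_table VALUES
    (1, '9606|Homo sapiens'),
    (2, '10090|Mus musculus');
INSERT INTO masst_table VALUES
    (1, 10, 100), (1, 10, 101), (1, 11, 100), (2, 5, 100), (3, 7, 100);
""")

ANNOTATED_SCANS = """
SELECT COUNT(*) FROM (
    SELECT 1 FROM masst_table GROUP BY mri_id_int, scan_id
);
"""

ANNOTATED_SCANS_HUMAN = """
SELECT COUNT(*) FROM (
    SELECT 1
    FROM masst_table AS m
    WHERE EXISTS (
        SELECT 1 FROM redu_table AS r
        WHERE r.mri_id_int = m.mri_id_int
          AND r.NCBITaxonomy = '9606|Homo sapiens'
    )
    GROUP BY m.mri_id_int, m.scan_id
);
"""


def count(connection, sql):
    """Run a single-value COUNT query and return the integer result."""
    return connection.execute(sql).fetchone()[0]


n_scans = count(con, ANNOTATED_SCANS)
n_scans_human = count(con, ANNOTATED_SCANS_HUMAN)
print(n_scans, n_scans_human)
```

In the toy data, two library matches to the same scan (file 1, scan 10) are counted once, because the GROUP BY collapses rows to unique file and scan pairs.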

Biological examples – Matching and filtering criteria

Caffeine example

MS/MS spectra were retrieved via exact structure matching of the SMILES CN1C=NC2=C1C(=O)N(C(=O)N2C)C. We then searched FASSTrecords using a minimum cosine of 0.9 and minimum matching peaks set to 5.

Salicylic acid–thiazoline example

MS/MS spectra were retrieved via substructure search of the SMILES OC1=CC=CC=C1C2=NCCS2. We then searched FASSTrecords using a minimum cosine of 0.7 and minimum matching peaks set to 5. We then limited the returned results to human samples by selecting the NCBITaxonomy column in the Column-dropdown below the results table and then selecting ‘9606|Homo sapiens’ in the Value-dropdown. The filter was applied by clicking the ‘Keep only selected rows’ button.

Surfactin C example

MS/MS spectra were retrieved via exact structure matching of the SMILES CC(C)CCCCCCCCCC1CC(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)O1)CC(C)C)CC(C)C)CC(=O)O)C(C)C)CC(C)C)CC(C)C)CCC(=O)O. We then utilized FASST using a minimum cosine of 0.6, 5 minimum matching peaks, and fragment and precursor tolerances of 0.02 Da. Analog search was turned on, and the filter condition was set to ‘Raw file’. We then limited the returned results to human samples by selecting the NCBITaxonomy column in the Column-dropdown below the results table and then selecting ‘9606|Homo sapiens’ in the Value-dropdown. The filter was applied by clicking the ‘Keep only selected rows’ button.
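
The matching criteria used in these examples (minimum cosine, minimum matching peaks and a fragment tolerance in Da) can be illustrated with a simple greedy cosine score. This is a sketch under stated assumptions and not the FASST implementation: the function names, the greedy peak pairing, the use of "greater than or equal to" for both thresholds and the toy spectra are all hypothetical, and precursor tolerance and analog (mass-shift) matching are not modeled.

```python
import math


def cosine_score(spec_a, spec_b, fragment_tol=0.02):
    """Greedy cosine between two peak lists [(mz, intensity), ...].

    Returns (cosine, number_of_matched_peaks).
    """
    # All peak pairs whose m/z values agree within the fragment tolerance.
    pairs = [
        (int_a * int_b, i, j)
        for i, (mz_a, int_a) in enumerate(spec_a)
        for j, (mz_b, int_b) in enumerate(spec_b)
        if abs(mz_a - mz_b) <= fragment_tol
    ]
    # Pair the strongest products first; each peak is used at most once.
    pairs.sort(reverse=True)
    used_a, used_b = set(), set()
    dot = 0.0
    for product, i, j in pairs:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        dot += product
    norm_a = math.sqrt(sum(inten ** 2 for _, inten in spec_a))
    norm_b = math.sqrt(sum(inten ** 2 for _, inten in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    return dot / (norm_a * norm_b), len(used_a)


def passes(spec_a, spec_b, min_cosine=0.7, min_peaks=5, fragment_tol=0.02):
    """Apply the two thresholds to a pair of spectra."""
    score, n_matched = cosine_score(spec_a, spec_b, fragment_tol)
    return score >= min_cosine and n_matched >= min_peaks


# Invented spectra: the second is the first shifted by 0.01 Da per peak.
query = [(50.0, 10.0), (75.0, 20.0), (100.0, 100.0), (125.0, 40.0), (150.0, 5.0)]
shifted = [(mz + 0.01, inten) for mz, inten in query]
print(cosine_score(query, shifted))
print(passes(query, shifted, min_cosine=0.9))
```

Requiring a minimum number of matched peaks alongside the cosine guards against high scores that rest on only one or two shared fragments.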

Amiodarone example

MS/MS spectra were retrieved via exact structure matching of the SMILES CCCCC1=C(C2=CC=CC=C2O1)C(=O)C3=CC(=C(C(=C3)I)OCCN(CC)CC)I. We then utilized FASST using a minimum cosine of 0.6, 5 minimum matching peaks, and fragment and precursor tolerances of 0.02 Da. Analog search was turned on, and the filter condition was set to ‘Raw file’. We then limited the returned results to human samples by selecting the NCBITaxonomy column in the Column-dropdown below the results table and then selecting ‘9606|Homo sapiens’ in the Value-dropdown. The filter was applied by clicking the ‘Keep only selected rows’ button.

Sertraline example

MS/MS spectra were retrieved via exact structure matching of the SMILES CNC1CCC(C2=CC=CC=C12)C3=CC(=C(C=C3)Cl)Cl. We then utilized FASST using a minimum cosine of 0.6, 5 minimum matching peaks, and fragment and precursor tolerances of 0.02 Da. Analog search was turned on, and the filter condition was set to ‘Raw file’. We then limited the returned results to human samples by selecting the NCBITaxonomy column in the Column-dropdown below the results table and then selecting ‘9606|Homo sapiens’ in the Value-dropdown. The filter was applied by clicking the ‘Keep only selected rows’ button.

Desferrioxamine H example

MS/MS spectra were retrieved via exact structure matching on the SMILES CC(=O)N(O)CCCCCNC(=O)CCC(=O)N(O)CCCCCNC(=O)CCC(=O)O. We then searched FASSTrecords using a minimum cosine of 0.7 and minimum matching peaks set to 5. Matches to the library spectrum ‘CCMSLIB00000845585’ were removed by selecting the query_spectrum_id column in the Column-dropdown below the results table and then selecting ‘CCMSLIB00000845585’ in the Value-dropdown. The filter was applied by clicking the ‘Remove selected rows’ button.

Surfactin C Tanimoto similarity example

MS/MS spectra were retrieved via Tanimoto similarity search of the SMILES CC(C)CCCCCCCCCC1CC(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)O1)CC(C)C)CC(C)C)CC(=O)O)C(C)C)CC(C)C)CC(C)C)CCC(=O)O with a threshold of 0.8. We then utilized FASSTrecords using a minimum cosine of 0.7 and 5 minimum matching peaks.
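
For readers unfamiliar with the Tanimoto threshold, the coefficient is the number of fingerprint bits shared by two molecules divided by the number of bits set in either. The sketch below applies a 0.8 cutoff to fingerprints represented as sets of bit indices. The fingerprints and library names are invented, and the fingerprint type used by StructureMASST is not stated in this section, so this only illustrates the arithmetic.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_structures(query_fp, library, threshold=0.8):
    """Return library keys whose Tanimoto similarity to the query is at or
    above the threshold."""
    return [name for name, fp in library.items()
            if tanimoto(query_fp, fp) >= threshold]


# Hypothetical fingerprints (bit indices), not derived from real molecules.
query_fp = {1, 2, 3, 4}
library = {
    "library_entry_A": {1, 2, 3, 4, 5},  # 4 shared bits out of 5 in total
    "library_entry_B": {1, 2, 9},        # 2 shared bits out of 5 in total
}
print(similar_structures(query_fp, library, threshold=0.8))
```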

Mass-defect analysis

Mass-defect values were calculated as the difference between the exact m/z and the nearest nominal mass (mass defect = exact mass – nominal mass). Data processing and visualization were performed using R (version 4.5.1) in the RStudio environment. For the mass-defect plot of amiodarone (Supplementary Fig. 4b), the m/z values of amiodarone and its potential metabolites identified through the StructureMASST search (Supplementary Fig. 4a) were used, while CHNO-backbone compounds with varying degrees of iodination (Supplementary Table 2) were referenced to confirm their iodination levels. Similarly, for the mass-defect plot of sertraline (Fig. 2c), the m/z values of sertraline and its potential metabolites identified from the StructureMASST search (Fig. 2b) were used, while CHN-backbone compounds with varying degrees of chlorination (Supplementary Table 4) were referenced to confirm their chlorination levels.
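
The mass-defect definition above can be written in a few lines. The analysis in this work was performed in R; the Python sketch below only illustrates the arithmetic. The two m/z values are approximate theoretical [M+H]+ values calculated here from the molecular formulas of amiodarone (C25H29I2NO3) and sertraline (C17H17Cl2N); they are illustrative assumptions and are not taken from the supplementary tables.

```python
import math


def mass_defect(mz):
    """Mass defect relative to the nearest integer (nominal) mass."""
    nominal = math.floor(mz + 0.5)  # nearest integer, halves rounded up
    return mz - nominal


# Approximate theoretical [M+H]+ values, used for illustration only.
examples = {
    "amiodarone [M+H]+": 646.0310,
    "sertraline [M+H]+": 306.0811,
}
for name, mz in examples.items():
    print(f"{name}: m/z {mz:.4f}, mass defect {mass_defect(mz):+.4f}")
```

Because iodine and chlorine have negative mass defects, each additional halogen lowers the mass defect of an ion relative to a CHNO-only compound of similar nominal mass, which is why halogenation levels separate in a mass-defect plot.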

ModiFinder analysis

ModiFinder analysis of several potential metabolites of amiodarone and sertraline identified from the StructureMASST search was performed using the ‘View Modification Site’ function in the results table for each USI generated in FASST mode. Clicking ‘View Modification Site’ directs users to the GNPS2 dashboard (https://modifinder.gnps2.org/), where the results are shown. The inputs, parameters and results can be accessed through the links provided below.

Amiodarone

Δm/z −26.02:

https://modifinder.gnps2.org/?USI1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00013027336&USI2=mzspec%3AMSV000085760%3Araw%2FmzXML%2F5580.mzXML%3Ascan%3A2872&SMILES1=CCCCc1oc2ccccc2c1C%28%3DO%29c1cc%28I%29c%28OCCN%28CC%29CC%29c%28I%29c1&SMILES2&Helpers=&adduct=%5BM%2BH%5D1%2B&ppm_tolerance=25&filter_peaks_variable=0.01

Δm/z −125.90:

https://modifinder.gnps2.org/?USI1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00012316157&USI2=mzspec%3AMTBLS1866%3AFILES%2FLipidomic_ICU+COVID-19_ESI+Positive%2FDA17_p.mzML%3Ascan%3A686&SMILES1=CCCCc1oc2ccccc2c1C%28%3DO%29c1cc%28I%29c%28OCCN%28CC%29CC%29c%28I%29c1&SMILES2&Helpers=&adduct=%5BM%2BH%5D1%2B&ppm_tolerance=25&filter_peaks_variable=0.01

Sertraline

Δm/z +43.99:

https://modifinder.gnps2.org/?USI1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00000084936&USI2=mzspec%3AMSV000080673%3Accms_peak%2F2017.AmericanGut3K.mzXMLfiles%2FSamples%2F000006382_RB8_01_6463.mzML%3Ascan%3A1896&SMILES1=CNC1CCC%28c2ccc%28Cl%29c%28Cl%29c2%29c2ccccc21&SMILES2&Helpers=&adduct=%5BM%2BH%5D1%2B&ppm_tolerance=25&filter_peaks_variable=0.01

Δm/z +148.04:

https://modifinder.gnps2.org/?USI1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00003140022&USI2=mzspec%3AMSV000086415%3Accms_peak%2FPlate+01+Samples+RAW%2F16265624.mzML%3Ascan%3A1311&SMILES1=CNC1CCC%28c2ccc%28Cl%29c%28Cl%29c2%29c2ccccc21&SMILES2&Helpers=&adduct=%5BM%2BH%5D1%2B&ppm_tolerance=25&filter_peaks_variable=0.01
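
The links above share one query-string layout, so equivalent links can be assembled programmatically. The sketch below rebuilds the first amiodarone link from its decoded parts using only the parameter names visible in the links (USI1, USI2, SMILES1, SMILES2, Helpers, adduct, ppm_tolerance and filter_peaks_variable). The helper name is hypothetical, the meaning of each parameter is inferred from the links rather than from ModiFinder documentation, and the empty SMILES2 is emitted as `SMILES2=` rather than the bare `SMILES2` seen in the original links.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

MODIFINDER_BASE = "https://modifinder.gnps2.org/"


def modifinder_url(usi_library, usi_raw, smiles_library,
                   adduct="[M+H]1+", ppm_tolerance=25,
                   filter_peaks_variable=0.01):
    """Assemble a ModiFinder link from a library USI, a raw-data USI and
    the SMILES of the known (library) compound."""
    params = {
        "USI1": usi_library,
        "USI2": usi_raw,
        "SMILES1": smiles_library,
        "SMILES2": "",   # structure of the modified compound is unknown
        "Helpers": "",
        "adduct": adduct,
        "ppm_tolerance": ppm_tolerance,
        "filter_peaks_variable": filter_peaks_variable,
    }
    return MODIFINDER_BASE + "?" + urlencode(params)


url = modifinder_url(
    "mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00013027336",
    "mzspec:MSV000085760:raw/mzXML/5580.mzXML:scan:2872",
    "CCCCc1oc2ccccc2c1C(=O)c1cc(I)c(OCCN(CC)CC)c(I)c1",
)
print(url)

# Decoding the link again recovers the original inputs.
decoded = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(decoded["USI2"][0])
```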

Statistical analysis of human enrichment of drug matches among Metazoa hits

To quantify whether StructureMASST raw data matches were disproportionately associated with human samples, we tested for enrichment of Homo sapiens within the set of positive matches for each queried drug molecule relative to its prevalence in other Metazoa samples. The background population was defined as all entries in the redu_table of FASSTrecords with MS/MS present (MS2spectra_count > 0) and NCBIKingdom == ‘Metazoa’. Human samples were defined as NCBITaxonomy == ‘9606|Homo sapiens’, and all remaining Metazoa entries were treated as nonhuman. Positive matches for each molecule were defined as raw data hits passing the specified spectral matching criteria (default: cosine >0.7 and matching peaks >5; additional cosine thresholds were evaluated as shown).

For each molecule, we constructed a 2 × 2 contingency table comparing the number of human versus nonhuman Metazoa samples among the molecule’s positive matches to the corresponding counts in the background (hits versus nonhits). We then applied Fisher’s exact test (two-sided) to each table to estimate an odds ratio (OR) and associated P value. The OR was interpreted as the relative odds that a positive match originated from Homo sapiens rather than nonhuman Metazoa compared with the same odds in the Metazoa background (OR >1, enrichment; OR <1, depletion). Multiple testing across molecules was controlled using the Benjamini–Hochberg procedure; adjusted P values (q values) are reported, with q < 0.05 considered significant.
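
The test procedure can be sketched with the Python standard library alone. The software used for the published analysis is not stated in this section, so the code below is an illustration rather than the authors' implementation. The table layout (rows: hits and nonhits; columns: human and nonhuman), the function names and the example counts are assumptions, and the reported odds ratio is the simple sample odds ratio.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Assumed layout (not spelled out in the text):
        rows    = positive matches (hits) / background nonhits
        columns = human / nonhuman Metazoa samples
    Returns (sample odds ratio, P value).
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x human samples among the hits.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    p_obs = prob(a)
    # Sum over every table that is no more likely than the observed one.
    p = sum(px for px in map(prob, range(lo, hi + 1))
            if px <= p_obs * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p, 1.0)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values (q values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical counts per molecule:
# (human hits, nonhuman hits, human nonhits, nonhuman nonhits)
tables = {"drug_A": (30, 5, 970, 995), "drug_B": (3, 1, 1, 3)}
results = {name: fisher_exact_two_sided(*t) for name, t in tables.items()}
q_values = benjamini_hochberg([p for _, p in results.values()])
for (name, (odds, p)), q in zip(results.items(), q_values):
    print(f"{name}: OR={odds:.2f}, P={p:.3g}, q={q:.3g}")
```

Exact binomial coefficients keep the sketch dependency-free, but for background sets with hundreds of thousands of samples a numerical library routine for Fisher's exact test would likely be faster.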

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All spectral raw data used here are accessible through the public metabolomics repositories GNPS/MassIVE, MetaboLights, Metabolomics Workbench and NORMAN/DSFP. Library spectra with harmonized metadata are available at https://external.gnps2.org/gnpslibrary. The periodically updated precomputed FASSTrecords database is available at https://masst-records.gnps2.org/masst_records and https://zenodo.org/records/18199544 (stable version) under an ODC-ODbL – Open Database License.

Code availability

The code behind StructureMASST is available via GitHub at https://github.com/Wang-Bioinformatics-Lab/Structure_MASST_App.

References

Yurekten, O. et al. MetaboLights: open data repository for metabolomics. Nucleic Acids Res. 52, D640–D646 (2024).
Sud, M. et al. Metabolomics Workbench: an international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools. Nucleic Acids Res. 44, D463–D470 (2016).
Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).
Alygizakis, N. A. et al. NORMAN digital sample freezing platform: a European virtual platform to exchange liquid chromatography high resolution-mass spectrometry data and screen suspects in ‘digitally frozen’ environmental samples. Trends Analyt. Chem. 115, 129–137 (2019).
Wang, M. et al. Mass spectrometry searches using MASST. Nat. Biotechnol. 38, 23–26 (2020).
El Abiead, Y. et al. Enabling pan-repository reanalysis for big data science of public metabolomics data. Nat. Commun. 16, 4838 (2025).
Mongia, M. et al. Fast mass spectrometry search and clustering of untargeted metabolomics data. Nat. Biotechnol. 42, 1672–1677 (2024).
Bittremieux, W. et al. Universal MS/MS visualization and retrieval with the Metabolomics Spectrum Resolver Web Service. Preprint at bioRxiv https://doi.org/10.1101/2020.05.09.086066 (2020).
Deutsch, E. W. et al. Universal Spectrum Identifier for mass spectra. Nat. Methods 18, 768–770 (2021).
Kim, S. et al. PubChem 2025 update. Nucleic Acids Res. 53, D1516–D1525 (2025).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Schmidt, R. et al. Comparing molecular patterns using the example of SMARTS: theory and algorithms. J. Chem. Inf. Model. 59, 2560–2571 (2019).
Cox, C. D., Rinehart, K. L. Jr., Moore, M. L. & Cook, J. C. Jr. Pyochelin: novel structure of an iron-chelating growth promoter for Pseudomonas aeruginosa. Proc. Natl Acad. Sci. USA 78, 4256–4260 (1981).
Haag, H. et al. Purification of yersiniabactin: a siderophore and possible virulence factor of Yersinia enterocolitica. J. Gen. Microbiol. 139, 2159–2165 (1993).
Kaplan, A. R., Musaev, D. G. & Wuest, W. M. Pyochelin biosynthetic metabolites bind iron and promote growth in Pseudomonads demonstrating siderophore-like activity. ACS Infect. Dis. 7, 544–551 (2021).
Vinayavekhin, N. & Saghatelian, A. Regulation of alkyl-dihydrothiazole-carboxylates (ATCs) by iron and the pyochelin gene cluster in Pseudomonas aeruginosa. ACS Chem. Biol. 4, 617–623 (2009).
Kersten, R. D. & Dorrestein, P. C. Secondary metabolomics: natural products mass spectrometry goes global. ACS Chem. Biol. 4, 599–601 (2009).
Miller, D. A., Luo, L., Hillson, N., Keating, T. A. & Walsh, C. T. Yersiniabactin synthetase: a four-protein assembly line producing the nonribosomal peptide/polyketide hybrid siderophore of Yersinia pestis. Chem. Biol. 9, 333–344 (2002).
Lee, J. Y., Moon, S. S. & Hwang, B. K. Isolation and antifungal and antioomycete activities of aerugine produced by Pseudomonas fluorescens strain MM-B16. Appl. Environ. Microbiol. 69, 2023–2031 (2003).
Rayi, S., Cai, Y., Greenwich, J. L., Fuqua, C. & Gerdt, J. P. Interbacterial biofilm competition through a suite of secreted metabolites. ACS Chem. Biol. 19, 462–470 (2024).
Igarashi, Y. et al. Ulbactins F and G, polycyclic thiazoline derivatives with tumor cell migration inhibitory activity from Brevibacillus sp. Org. Lett. 18, 1658–1661 (2016).
Acebal, C. et al. Agrochelin, a new cytotoxic antibiotic from a marine Agrobacterium. Taxonomy, fermentation, isolation, physico-chemical properties and biological activity. J. Antibiot. 52, 983–987 (1999).
DA Silva Filho, L. V. F., Levi, J. E., Bento, C. N. O., DA Silva Ramos, S. R. T. & Rozov, T. PCR identification of Pseudomonas aeruginosa and direct detection in clinical samples from cystic fibrosis patients. J. Med. Microbiol. 48, 357–361 (1999).
Grundmann, E., Gohar, G., Meier, S., Feil, B. & Gagesch, M. Community-acquired pneumonia with Pseudomonas aeruginosa in a geriatric patient with rheumatoid arthritis under baricitinib treatment. Ann. Geriatr. Med. Res. https://doi.org/10.4235/agmr.24.0191 (2025).
Ahn, J.-H. et al. Intestinal E. coli-produced yersiniabactin promotes profibrotic macrophages in Crohn’s disease. Cell Host Microbe 33, 71–88 (2025).
Behnsen, J. et al. Siderophore-mediated zinc acquisition enhances enterobacterial colonization of the inflamed gut. Nat. Commun. 12, 7016 (2021).
Kumar, A. et al. Siderophore-mediated iron acquisition by Klebsiella pneumoniae. J. Bacteriol. 206, e0002424 (2024).
Paauw, A., Leverstein-van Hall, M. A., van Kessel, K. P. M., Verhoef, J. & Fluit, A. C. Yersiniabactin reduces the respiratory oxidative stress response of innate immune cells. PLoS ONE 4, e8240 (2009).
Huddart, R. et al. PharmGKB summary: sertraline pathway, pharmacokinetics. Pharmacogenet. Genomics 30, 26–33 (2020).
Elapavalore, A. et al. Adding open spectral data to MassBank and PubChem using open source tools to support non-targeted exposomics of mixtures. Environ. Sci. Process. Impacts 25, 1788–1801 (2023).
Bittremieux, W., Laukens, K., Noble, W. S. & Dorrestein, P. C. Large-scale tandem mass spectrum clustering using fast nearest neighbor searching. Rapid Commun. Mass Spectrom. 39(Suppl 1), e9153 (2025).
Mannochio-Russo, H. et al. Bridging complexity and accessibility in metabolomics with MetaboApps. Preprint at ChemRxiv https://doi.org/10.26434/chemrxiv-2025-3nq29 (2025).
Strobel, M. et al. An evaluation methodology for machine learning-based tandem mass spectra similarity prediction. BMC Bioinformatics 26, 174 (2025).
Ehrt, C., Krause, B., Schmidt, R., Ehmki, E. S. R. & Rarey, M. SMARTS.Plus—a toolbox for chemical pattern design. Mol. Inform. 39, e2000216 (2020).

Acknowledgements

Y.E. acknowledges the Chan Zuckerberg Initiative (CZI) and APART-USA (ÖAW) for funding. J.I.S. acknowledges National Research Foundation of Korea (NRF) for funding (RS-2025-02373133). H.N.Z. was supported by the National Institute of Environmental Health Sciences of the National Institutes of Health under award number K99ES037746. H.G. acknowledges the CCFA for funding. C.B. acknowledges the MSCA-GF for funding. S.X. is supported by BBSRC/NSF award 2152526, National Institute of Health Sciences U24DK133658 and Chan Zuckerberg Initiative. P.C.D. acknowledges support from NIDDK U24DK133658, Chan Zuckerberg Initiative for supporting development of FASSTrecords, NSF-BBSRC 2152526 and EnvedaGives Scientific Research Fund for enabling this work. A.M.C.-R. and P.C.D. were supported by the Gordon and Betty Moore Foundation grant GBMF12120 and https://doi.org/10.37807/GBMF12120. J.A.V. acknowledges support from Chan Zuckerberg Initiative (2024-350548), BBSRC (BB/W000156/1) and EMBL core funding. S.S. acknowledges support from NIDDK U24DK141185, NIDDK U2CDK119886, and Chan Zuckerberg Initiative for MetabolomicsXChange. Reproductive Scientist Development Program Grant supports L.A.B. D.P. was supported by the Simons Foundation International (award ID: SFI-LS-ECIAMEE-00013858). M.W. acknowledges support by NIH 5U24DK133658 and NIH 2R01GM107550. We acknowledge J. Heirman and W. Bittremieux for enabling us to use falcon before the official publication date.

Author information

Yasin El Abiead
Present address: Institute of Analytical Chemistry, Department of Natural Sciences and Sustainable Resources, BOKU University, Vienna, Austria

Authors and Affiliations

Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, San Diego, CA, USA
Yasin El Abiead, Jeong In Seo, Vincent Charron-Lamoureux, Wilhan Donizete Gonçalves Nunes, Haoqi Nina Zhao, Kine Eide Kvitne, Simone Zuffa, Helena Mannochio-Russo, Harsha Gouda, Abubaker Patan, Shipei Xing, Jasmine Zemlin, Julius Agongo, Andres Mauricio Caraballo Rodriguez, Victoria Deleray, Jeremy Carver & Pieter C. Dorrestein
Department of Computer Science and Engineering, University of California Riverside, Riverside, CA, USA
Michael Strobel & Mingxun Wang
International Center for Genetic Engineering and Biotechnology, Trieste, Italy
Cristina Bez
Department of Veterinary and Biomedical Sciences, The Pennsylvania State University, University Park, PA, USA
Ipsita Mohany
Department of Nutritional Sciences, The Pennsylvania State University, University Park, PA, USA
Ipsita Mohany
Collaborative Mass Spectrometry Innovation Center, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, USA
Andres Mauricio Caraballo Rodriguez & Pieter C. Dorrestein
Department of Obstetrics, Gynecology, and Reproductive Sciences, University of California San Diego, La Jolla, CA, USA
Lindsey A. Burnett
Interfaculty Institute of Microbiology and Infection Medicine, University of Tuebingen, Tuebingen, Germany
Abzer K. Pakkir Shah
Biotechnology Innovation Centre, Rhodes University, Makhanda, South Africa
Jarmo-Charles Kalinski
Department of Biochemistry, University of California Riverside, Riverside, CA, USA
Daniel Petras
Environmental Institute, Koš, Slovak Republic
Nikiforos Alygizakis
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
Ozgur Yurekten, Thomas Payne & Juan Antonio Vizcaíno
Department of Bioengineering, San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, USA
Eoin Fahy & Shankar Subramaniam
Department of Pharmacology, University of California San Diego, La Jolla, CA, USA
Pieter C. Dorrestein
Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA
Pieter C. Dorrestein

Contributions

Y.E.A., P.C.D. and M.W. conceptualized the idea. Y.E.A. developed the tool. M.W. enabled web deployment, advised development choices and edited the manuscript. P.C.D. advised on biological examples and options enabled through the interface. P.C.D. and Y.E.A. wrote the manuscript. J.I.S. wrote manuscript sections and performed data analysis and interpretation of drug examples. K.E.K. and L.A.B. added clinical context to drug examples. V.C.L., H.N.Z., K.E.K., S.Z., H.M.R., H.G., I.M., A.M.C.R., V.D., A.K.P.S., J.C.K. and D.P. contributed metadata and edited the manuscript. A.P. and S.X. provided library spectra and edited the manuscript. M.S. and W.D.G.N. contributed code and edited the manuscript. N.A., J.C., O.Y., T.P., E.F., S.S. and J.A.V. supported the integration of raw and metadata from their respective repositories and edited the manuscript.

Corresponding authors

Correspondence to Yasin El Abiead, Mingxun Wang or Pieter C. Dorrestein.

Ethics declarations

Competing interests

P.C.D. is an advisor and holds equity in Cybele, BileOmix, Sirenas and a scientific cofounder, advisor, holds equity and/or received income from Ometa, Enveda and Arome with prior approval by UC San Diego. P.C.D. also consulted for DSM animal health in 2023. L.A.B. consulted for Locus Biosciences with prior approval from UC San Diego. M.W. is a cofounder of Ometa Labs LLC. The other authors declare no competing interest. A provisional patent covering aspects of this work has been filed by UC-San Diego.

Peer review

Peer review information

Nature Biotechnology thanks Norberto Lopes, Thomas Metz and Fidele Tugizimana for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–6, Information 1 and 2, Tables 2–5 and Videos 1–10.

Reporting Summary

Peer Review File

Supplementary Table 1

Terminology and concept table for StructureMASST and FASSTrecords.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

El Abiead, Y., Seo, J.I., Charron-Lamoureux, V. et al. Structure-centric searching enables global mapping of the public metabolome. Nat Biotechnol (2026). https://doi.org/10.1038/s41587-026-03082-8

Received: 21 October 2025
Accepted: 09 March 2026
Published: 15 April 2026
Version of record: 15 April 2026
DOI: https://doi.org/10.1038/s41587-026-03082-8
