Abstract
Mass spectrometry is a widely used method for the identification of molecules in complex samples. Current tools for database search of experimental spectra against libraries of molecules are not scalable. Moreover, these tools are often limited to known molecules and only perform an exact search. Here, to address this, we introduce Variable Interpretation of Spectrum–Molecule Couples, or VInSMoC, a mass spectral database search algorithm for the identification of variants of molecules. VInSMoC removes some false identifications by estimating the statistical significance of matches between spectra and molecular structures. Benchmarking VInSMoC in a search of 483 million spectra from GNPS against 87 million molecules from PubChem and COCONUT revealed 43,000 known molecules and 85,000 variants that were previously unreported. VInSMoC further facilitates identifying putative microbial biosynthesis pathways of promothiocin B and depsidomycin in Streptomyces bellus and Streptomyces sp. F-2747, respectively.
Data availability
Additional Dataset 1 contains GNPS dataset accessions and summary statistics for large-scale GNPS experiments. Additional Dataset 2a contains the top-scoring exact-mode hit for each spectrum against PubChem and COCONUT. Additional Dataset 2b contains the top-scoring variable-mode hit for each spectrum against COCONUT. Additional Dataset 3 contains mass spectra provided by Waters Corporation to analyze impurities of imatinib. Additional Datasets 1–3 are available via Zenodo at https://doi.org/10.5281/zenodo.11403641 (ref. 32). Chemical structures from NPAtlas (refs. 42, 43) used in this study are available from https://www.npatlas.org/static/downloads/NPAtlas_download.json. Chemical structures from PubChem (ref. 14) used in this study are available from http://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras. GNPS library mass spectra used in this study are available from https://ccms-ucsd.github.io/GNPSDocumentation/gnpslibraries/. Source data is available for Fig. 4 and Extended Data Figs. 1, 2 and 4–8. Source data for the Supplementary figures is available as Supplementary Data 1.
Code availability
VInSMoC is available as a web application at run.npanalysis.org. Instructions for web application use, documentation, all executables, and code for analyses are available at https://github.com/mohimanilab/vinsmoc (ref. 54).
References
Hollender, J., Schymanski, E. L., Singer, H. P. & Ferguson, P. L. Nontarget screening with high resolution mass spectrometry in the environment: ready to go? Environ. Sci. Technol. 51, 11505–11512 (2017).
Hernandez, F. et al. Current use of high-resolution mass spectrometry in the environmental sciences. Anal. Bioanal. Chem. 403, 1251–1264 (2012).
Grenga, L., Pible, O. & Armengaud, J. Pathogen proteotyping: a rapidly developing application of mass spectrometry to address clinical concerns. Clin. Mass Spectrom. 14, 9–17 (2019).
Wang, Z. et al. A liquid chromatography–tandem mass spectrometry (LC-MS/MS)-based assay to profile 20 plasma steroids in endocrine disorders. Clin. Chem. Lab. Med. 58, 1477–1487 (2020).
Seger, C. & Salzmann, L. After another decade: LC–MS/MS became routine in clinical diagnostics. Clin. Biochem. 82, 2–11 (2020).
Ott, M., Berbalk, K., Plecko, T., Wieland, E. & Shipkova, M. Detection of drugs of abuse in urine using the Bruker Toxtyper™: experiences in a routine clinical laboratory setting. Clin. Mass Spectrom. 4, 11–18 (2017).
Jarmusch, S. A., van der Hooft, J. J. J., Dorrestein, P. C. & Jarmusch, A. K. Advancements in capturing and mining mass spectrometry data are transforming natural products research. Nat. Prod. Rep. 38, 2066–2082 (2021).
Liu, Y., Romijn, E. P., Verniest, G., Laukens, K. & De Vijlder, T. Mass spectrometry-based structure elucidation of small molecule impurities and degradation products in pharmaceutical development. Trends Anal. Chem. 121, 115686 (2019).
Bandeira, N., Tsur, D., Frank, A. & Pevzner, P. A. Protein identification by spectral networks analysis. Proc. Natl Acad. Sci. USA 104, 6140–6145 (2007).
Mongia, M. et al. Fast mass spectrometry search and clustering of untargeted metabolomics data. Nat. Biotechnol. 42, 1672–1677 (2024).
de Jonge, N. F. et al. MS2Query: reliable and scalable MS2 mass spectra-based analogue search. Nat. Commun. 14, 1752 (2023).
Huber, F., van der Burg, S., van der Hooft, J. J. J. & Ridder, L. MS2DeepScore: a novel deep learning similarity measure to compare tandem mass spectra. J. Cheminform. 13, 84 (2021).
Rasche, F. et al. Identifying the unknowns by aligning fragmentation trees. Anal. Chem. 84, 3417–3426 (2012).
Kim, S. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 47, D1102–D1109 (2019).
Mohimani, H. et al. Dereplication of peptidic natural products through database search of mass spectra. Nat. Chem. Biol. 13, 30–37 (2017).
Mohimani, H. et al. Dereplication of microbial metabolites through database search of mass spectra. Nat. Commun. 9, 4035 (2018).
Cao, L. et al. MolDiscovery: learning mass spectrometry fragmentation of small molecules. Nat. Commun. 12, 3718 (2021).
Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).
Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J. & Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform. 8, 3 (2016).
Tsugawa, H. et al. Hydrogen rearrangement rules: computational MS/MS fragmentation and structure elucidation using MS-FINDER software. Anal. Chem. 88, 7946–7958 (2016).
Verdegem, D., Lambrechts, D., Carmeliet, P. & Ghesquière, B. Improved metabolite identification with MIDAS and MAGMa through MS/MS spectral dataset-driven parameter optimization. Metabolomics 12, 98 (2016).
Allen, F., Pon, A., Wilson, M., Greiner, R. & Wishart, D. CFM-ID: a web server for annotation, spectrum prediction and metabolite identification from tandem mass spectra. Nucleic Acids Res. 42, W94–W99 (2014).
Goldman, S., Li, J. & Coley, C. W. Generating molecular fragmentation graphs with autoregressive neural networks. Anal. Chem. 96, 3419–3428 (2024).
Young, A., Röst, H. & Wang, B. Tandem mass spectrum prediction for small molecules using graph transformers. Nat. Mach. Intell. 6, 404–416 (2024).
Murphy, M. et al. Efficiently predicting high resolution mass spectra with graph neural networks. In Proc. 40th International Conference on Machine Learning, Proc. Machine Learning Research Vol. 202 (eds Krause, A. et al.) 25549–25562 (PMLR, 2023); https://proceedings.mlr.press/v202/murphy23a.html
Goldman, S., Bradshaw, J., Xin, J. & Coley, C. Prefix-Tree decoding for predicting mass spectra from molecules. Adv. Neural Inf. Process. Syst. 36, 48548–48572 (2023).
Young, A. et al. FraGNNet: a deep probabilistic model for mass spectrum prediction. Preprint at https://arxiv.org/abs/2404.02360 (2024).
Park, J., Jo, J. & Yoon, S. Mass spectra prediction with structural motif-based graph neural networks. Sci. Rep. 14, 1400 (2024).
Jeffryes, J., Strutz, J., Henry, C. & Tyo, K. E. Metabolic in silico network expansions to predict and exploit enzyme promiscuity. In Microbial Metabolic Engineering: Methods and Protocols (eds Santos, C. N. S. & Ajikumar, P. K.) 11–21 (Springer, 2019).
Gurevich, A. et al. Increased diversity of peptidic natural products revealed by modification-tolerant database search of mass spectra. Nat. Microbiol. 3, 319–327 (2018).
Lee, Y.-Y. et al. HypoRiPPAtlas as an atlas of hypothetical natural products for mass spectrometry database search. Nat. Commun. 14, 4219 (2023).
Guler, M. Supplemental data for “Identifying novel variants of small molecules through database search of mass spectra”. Zenodo https://doi.org/10.5281/zenodo.11403641 (2024).
Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 8, 61 (2016).
Isshiki, K. et al. Depsidomycin, a new immunomodulating antibiotic. J. Antibiot. 43, 1195–1198 (1990).
Blin, K. et al. antiSMASH 7.0: new and improved predictions for detection, regulation, chemical structures and visualisation. Nucleic Acids Res. 51, W46–W50 (2023).
Du, Y.-L., He, H.-Y., Higgins, M. A. & Ryan, K. S. A heme-dependent enzyme forms the nitrogen–nitrogen bond in piperazate. Nat. Chem. Biol. 13, 836–838 (2017).
Liu, J. et al. Biosynthesis of the anti-infective marformycins featuring pre-NRPS assembly line N-formylation and O-methylation and post-assembly line C-hydroxylation chemistries. Org. Lett. 17, 1509–1512 (2015).
Yun, B.-S., Hidaka, T., Furihata, K. & Seto, H. Promothiocins A and B, novel thiopeptides with a tipA promoter inducing activity produced by Streptomyces sp. SF2741. J. Antibiot. 47, 510–514 (1994).
Madeira, F. et al. Search and sequence analysis tools services from EMBL-EBI in 2022. Nucleic Acids Res. 50, W276–W279 (2022).
Blin, K. et al. antiSMASH 4.0—improvements in chemistry prediction and gene cluster boundary identification. Nucleic Acids Res. 45, W36–W41 (2017).
Wang, M. et al. Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat. Biotechnol. 34, 828–837 (2016).
van Santen, J. A. et al. The Natural Products Atlas: an open access knowledge base for microbial natural products discovery. ACS Cent. Sci. 5, 1824–1833 (2019).
van Santen, J. A. et al. The Natural Products Atlas 2.0: a database of microbially-derived natural products. Nucleic Acids Res. 50, D1317–D1323 (2022).
Sorokina, M., Merseburger, P., Rajan, K., Yirik, M. A. & Steinbeck, C. COCONUT online: collection of open natural products database. J. Cheminform. 13, 2 (2021).
Heller, S. R., McNaught, A., Pletnev, I., Stein, S. & Tchekhovskoi, D. InChI, the IUPAC international chemical identifier. J. Cheminform. 7, 23 (2015).
Hopcroft, J. & Tarjan, R. Algorithm 447: efficient algorithms for graph manipulation. Commun. ACM 16, 372–378 (1973).
Chen, S. X. & Liu, J. S. Statistical applications of the Poisson-binomial and conditional Bernoulli distributions. Stat. Sin. 7, 875–892 (1997).
Howard, S. Discussion on Professor Cox’s paper. J. R. Stat. Soc. B 34, 210–211 (1972).
Abramova, A. & Korobeynikov, A. Assessing the significance of peptide spectrum match scores. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017), Leibniz International Proc. Informatics (LIPIcs) Vol. 88 (eds Schwartz, R. & Reinert, K.) 14:1–14:11 (Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017); https://doi.org/10.4230/LIPIcs.WABI.2017.14
Wang, F. & Landau, D. P. Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett. 86, 2050 (2001).
Horvát, S. Z. & Modes, C. D. Connectedness matters: construction and exact random sampling of connected networks. J. Phys. Complex. 2, 015008 (2021).
Thorup, M. Near-optimal fully-dynamic graph connectivity. In Proc. Thirty-second Annual ACM Symposium on Theory of computing (eds Yao, F. & Luks, E.) 343–350 (ACM, 2000).
Elias, J. E. & Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 4, 207–214 (2007).
VInSMoC software. Zenodo https://doi.org/10.5281/zenodo.17452093 (2025).
Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).
Acknowledgements
M.G., B.K., T.H., M.T., J.A., S.R. and H.M. were supported by National Institutes of Health New Innovator Award DP2GM137413, US Department of Energy award DE-SC0021340 and National Science Foundation award DBI-2117640. The work of B.B. was supported by the National Institute of General Medical Sciences of the National Institutes of Health award R43GM150301.
Author information
Contributions
M.G., B.K., T.H., M.T., J.A. and S.R. implemented the algorithms. M.G. performed the analysis. P.C. and M.L. provided imatinib impurity spectral data and their analyses. B.B. and H.M. designed and directed the work. M.G. and H.M. wrote the paper, and all authors contributed to its revision.
Ethics declarations
Competing interests
H.M. and B.B. are co-founders of and have equity interests in Chemia Biosciences.
Peer review
Peer review information
Nature Computational Science thanks Bart Ghesquiere, Tomáš Pluskal and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Runtimes of mass spectral search engines on the GNPSLIBvNPAtlas dataset as a function of precursor ion error tolerance.
Each runtime is reported as the average of three runs performed after a warm-up run that fills disk caches; the warm-up run is excluded from the average. All times are measured using a single core of an AMD Ryzen Threadripper PRO 5995WX.
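The timing protocol in this caption can be sketched as follows. This is a minimal illustration and not the authors' benchmarking harness: `average_runtime` and `run_search` are hypothetical names, wall-clock time is measured with `time.perf_counter`, and pinning the process to a single core is not handled here.

```python
import time
from statistics import mean


def average_runtime(run_search, repeats=3):
    """Average wall-clock time of `run_search` over `repeats` timed runs.

    One untimed warm-up call is made first so that disk caches are filled;
    it does not contribute to the average. `run_search` is a hypothetical
    zero-argument callable standing in for one search-engine invocation.
    """
    run_search()  # warm-up run, excluded from the average
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_search()
        timings.append(time.perf_counter() - start)
    return mean(timings), timings
```

Returning the individual timings alongside the mean mirrors the source data file, which reports all three runtimes per tool and tolerance.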
Extended Data Fig. 2 Distances between top-scoring and correct modification sites using VInSMoC for various fragmentation settings.
For compounds from the GNPSLIB dataset modified by 107 frequent modifications, we compute the distance between the top-scoring modification site predicted by VInSMoC and the correct modification site. Distance is defined as the average distance between the atoms of the predicted and correct sites. A scoring strategy limited to only depth 1 fragments is compared to scoring both depth 1 and depth 2 fragments. Additionally, scoring OC (oxygen to carbon), NC (nitrogen to carbon) and CC (carbon to carbon) single bond fragments is compared to scoring only OC and NC bond fragments, and only NC bond fragments. n = 4966 modified compounds.
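One way to read the distance definition above is sketched below. The caption does not say how the distance between two atoms is measured, so this sketch assumes topological distance (the number of bonds on the shortest path in the molecular graph) averaged over all pairs of one predicted-site atom and one correct-site atom. The function names and the adjacency-dictionary input are hypothetical.

```python
from collections import deque
from itertools import product


def bond_distances(adjacency, source):
    """Breadth-first search: shortest path length in bonds from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def site_distance(adjacency, predicted_site, correct_site):
    """Mean bond distance over all (predicted atom, correct atom) pairs.

    Assumption: both sites are non-empty collections of atom indices in a
    connected molecular graph given as {atom: [neighboring atoms]}.
    """
    per_source = {a: bond_distances(adjacency, a) for a in predicted_site}
    pairs = list(product(predicted_site, correct_site))
    return sum(per_source[a][b] for a, b in pairs) / len(pairs)


# Linear chain of five atoms: 0-1-2-3-4
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(site_distance(chain, [0, 1], [3, 4]))  # 3.0
```

Under this reading, a prediction that coincides with a single-atom correct site has distance 0.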
Extended Data Fig. 3 Modified molecules and their variable scores for different candidate modification site localizations.
Molecules originally taken from the GNPSLIB spectral library. Example modified molecules are from spectra with GNPS IDs (a) CCMSLIB00005722331, (b) CCMSLIB00000080011, (c) CCMSLIB00000853587, and (d) CCMSLIB00000079626. Each row from left to right shows the original molecule, the modified molecule with the modification site highlighted in red, and the modified molecule with atoms highlighted by their variable score against the spectrum for the original molecule.
Extended Data Fig. 4 Accuracies on (A) GNPSLIBvCOCONUT and (B) GNPSLIBvNPAtlas for exact search methods.
Molecule-spectrum matches for each method are grouped by spectrum and ranked by descending score. Tied ranks are assigned the average ranks of all molecule-spectrum matches contributing to the tie. The top-k accuracy indicates for what percentage of test spectra the correct compound was ranked at most k.
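The ranking rule in this caption can be illustrated with a short sketch. It assumes matches arrive as `(spectrum_id, molecule_id, score)` tuples and that the correct molecule for each test spectrum is known; the function names and the input layout are hypothetical and are not taken from the VInSMoC code base.

```python
from collections import defaultdict


def average_ranks(scores):
    """1-based ranks by descending score; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2.0
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def top_k_accuracy(matches, truth, k):
    """Percentage of test spectra whose correct molecule has rank <= k."""
    by_spectrum = defaultdict(list)
    for spectrum_id, molecule_id, score in matches:
        by_spectrum[spectrum_id].append((molecule_id, score))
    hits = 0
    for spectrum_id, correct in truth.items():
        candidates = by_spectrum.get(spectrum_id, [])
        ranks = average_ranks([score for _, score in candidates])
        for (molecule_id, _), rank in zip(candidates, ranks):
            if molecule_id == correct and rank <= k:
                hits += 1
                break
    return 100.0 * hits / len(truth)


matches = [("s1", "A", 0.9), ("s1", "B", 0.9), ("s1", "C", 0.5),
           ("s2", "D", 0.7), ("s2", "E", 0.8)]
truth = {"s1": "A", "s2": "D"}
print(top_k_accuracy(matches, truth, 1))  # 0.0
print(top_k_accuracy(matches, truth, 2))  # 100.0
```

In the example, molecules A and B tie for ranks 1 and 2 and both receive rank 1.5, so the correct molecule A does not count as a top-1 hit but does count as a top-2 hit.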
Extended Data Fig. 5 Accuracies on GNPSLIBvDNP for spectra not used to train CSI:FingerID for exact search methods.
Molecule-spectrum matches for each method are grouped by spectrum and ranked by descending score. Tied ranks are assigned the average ranks of all molecule-spectrum matches contributing to the tie. The top-k accuracy indicates for what percentage of test spectra the correct compound was ranked at most k. Accuracies for CFM-ID, CSI:FingerID, Dereplicator+, MAGMa+, MetFrag, and molDiscovery are taken from Cao et al. (ref. 17).
Extended Data Fig. 6 False discovery rates at various metric cutoffs when searching GNPS against COCONUT.
Plots in the first row are for exact-mode search and plots in the second row are for variable mode search. Plots in the first column are using score-based cutoffs and plots in the second column are using p-value cutoffs.
Extended Data Fig. 7 High frequency mass shifts of compounds detected by VInSMoC in variable mode.
Mass shifts that occurred fewer than 300 times are excluded. Mass shifts are calculated as the difference between the precursor mass of the spectrum and the monoisotopic mass of the matched molecule. n = 9160 mass shifts.
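The counting step behind this figure can be sketched as below. The sketch assumes that each match is given as a pair of neutral masses in daltons (precursor mass, monoisotopic mass of the matched molecule) and that shifts are binned by rounding to the nearest integer; the rounding rule and the function name are assumptions made for illustration.

```python
from collections import Counter


def frequent_mass_shifts(matches, min_count=300):
    """Count integer mass shifts and keep those seen at least `min_count` times.

    `matches` is an iterable of (precursor_mass, monoisotopic_mass) pairs.
    Converting a precursor m/z to a neutral mass (charge, adduct) is assumed
    to have been done beforehand.
    """
    counts = Counter(round(precursor - mono) for precursor, mono in matches)
    return {shift: n for shift, n in sorted(counts.items()) if n >= min_count}


example = [(516.2, 500.2)] * 3 + [(514.25, 500.2)] * 2 + [(500.2, 500.2)]
print(frequent_mass_shifts(example, min_count=2))  # {14: 2, 16: 3}
```

The threshold of 300 matches the cutoff stated in the caption; the example lowers it to 2 only to keep the input small.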
Extended Data Fig. 8 Mean VInSMoC scores across ClassyFire superclasses using various fragmentation settings at fragmentation depth 2.
Fragmentation settings with various allowed bond breakages are benchmarked on the MoNA dataset.
Extended Data Fig. 9 Identifications of Imatinib impurities by VInSMoC.
For Imatinib (A) and four variants of Imatinib (B–D), the figure shows the precursor m/z of the mass spectrum and the PubChem CID of the compound annotated by chemists. The distribution of variable mode scores using VInSMoC for each spectrum against the original Imatinib structure is shown in the third column. In the fourth column, the structure that chemists identified for the spectrum is shown, with differences from the original Imatinib highlighted in red.
Extended Data Fig. 10 Proposed BGC and synthetic pathway for depsidomycin.
(a) The proposed BGC with NRPS genes colored black and genes installing auxiliary modifications shown in blue and magenta. (b) The NRPS module structure of this BGC with A-domain specificity shown via the elongation of depsidomycin. (c) Auxiliary modifications applied to produce the final depsidomycin structure. The thioesterase domain catalyzes formation of the ester (green), conversion of ornithine to piperazic acid is catalyzed by genes homologous to ktzI and ktzT (blue), and a gene homologous to mfnA is responsible for N-terminal formylation (magenta).
Supplementary information
Supplementary Information
Supplementary Remark, Algorithms 1 and 2, Tables 1–3 and Figs. 1–23.
Supplementary Data 1
Source data for the Supplementary figures. Source data is provided as a separate CSV for each relevant Supplementary figure, named si_figureX_data.csv.
Source data
Source Data Fig. 4
CSV containing runtime breakdowns of Dereplicator+ and VInSMoC–Exact on the GNPSLIB spectral dataset.
Source Data Extended Data Fig. 1
Runtimes in seconds of exact search methods on the GNPSLIBvNPAtlas dataset across multiple precursor ion tolerances. Each tool was measured three times per precursor ion tolerance; all three runtimes (in seconds) are reported in this data file.
Source Data Extended Data Fig. 2
CSV containing raw distances between predicted and correct modification sites across multiple fragmentation settings in VInSMoC. Data were constructed by applying common modifications to the structures in the GNPSLIB dataset. Distance values and fragmentation settings are provided, which were used to compute the summary metrics reported in the figure.
Source Data Extended Data Fig. 4
CSV containing top-k accuracies of fragmentation graph-based mass spectral search tools on searching GNPSLIB against COCONUT and NPAtlas chemical databases.
Source Data Extended Data Fig. 5
CSV containing top-k accuracies of multiple exact search tools on searching the subset of the GNPSLIB not used to train CSI:FingerID against the Dictionary of Natural Products.
Source Data Extended Data Fig. 6
CSV containing false discovery rates at various score cut-offs when using VInSMoC in both exact and variable mode and using raw scores or P values as cut-offs.
Source Data Extended Data Fig. 7
CSV containing raw counts of integral mass shifts of compounds detected by VInSMoC in variable mode. This contains the plotted portion, which includes all integral shifts that occurred at least 300 times.
Source Data Extended Data Fig. 8
CSV containing scores of true-positive MoNA spectra using depth-2 fragmentation with various allowed broken bond types. Plot reports averages from this raw dataset.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Guler, M., Krummenacher, B., Hall, T. et al. Identifying variants of molecules through database search of mass spectra. Nat Comput Sci 5, 1227–1237 (2025). https://doi.org/10.1038/s43588-025-00923-5
DOI: https://doi.org/10.1038/s43588-025-00923-5