  • Article

Identifying variants of molecules through database search of mass spectra

Abstract

Mass spectrometry is a widely used method for the identification of molecules in complex samples. Current tools for database search of experimental spectra against libraries of molecules are not scalable. Moreover, these tools are often limited to known molecules and only perform an exact search. Here, to address this, we introduce Variable Interpretation of Spectrum–Molecule Couples, or VInSMoC, a mass spectral database search algorithm for the identification of variants of molecules. VInSMoC removes some false identifications by estimating the statistical significance of matches between spectra and molecular structures. Benchmarking VInSMoC in a search of 483 million spectra from GNPS against 87 million molecules from PubChem and COCONUT revealed 43,000 known molecules and 85,000 variants that were previously unreported. VInSMoC further facilitates identifying putative microbial biosynthesis pathways of promothiocin B and depsidomycin in Streptomyces bellus and Streptomyces sp. F-2747, respectively.

Fig. 1: A comparison of exact and variable mass spectral database search methods.
Fig. 2: Outline of the accelerated scoring algorithm.
Fig. 3: The variable scoring framework.
Fig. 4: Molecule-spectrum scoring runtimes.

Data availability

Additional Dataset 1 contains GNPS dataset accessions and summary statistics for large-scale GNPS experiments. Additional Dataset 2a contains the top-scoring exact-mode hit for each spectrum against PubChem and COCONUT. Additional Dataset 2b contains the top-scoring variable-mode hit for each spectrum against COCONUT. Additional Dataset 3 contains mass spectra provided by Waters Corporation to analyze impurities of imatinib. Additional Datasets 1–3 are available via Zenodo at https://doi.org/10.5281/zenodo.11403641 (ref. 32). Chemical structures from NPAtlas (refs. 42,43) used in this study are available from https://www.npatlas.org/static/downloads/NPAtlas_download.json. Chemical structures from PubChem (ref. 14) used in this study are available from http://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras. GNPS library mass spectra used in this study are available from https://ccms-ucsd.github.io/GNPSDocumentation/gnpslibraries/. Source data is available for Fig. 4 and Extended Data Figs. 1, 2 and 4–8. Source data for the Supplementary figures is available as Supplementary Data 1.

Code availability

VInSMoC is available as a web application at run.npanalysis.org. Instructions for web application use, documentation, all executables, and code for analyses are available at https://github.com/mohimanilab/vinsmoc (ref. 54).

References

  1. Hollender, J., Schymanski, E. L., Singer, H. P. & Ferguson, P. L. Nontarget screening with high resolution mass spectrometry in the environment: ready to go? Environ. Sci. Technol. 51, 11505–11512 (2017).

  2. Hernandez, F. et al. Current use of high-resolution mass spectrometry in the environmental sciences. Anal. Bioanal. Chem. 403, 1251–1264 (2012).

  3. Grenga, L., Pible, O. & Armengaud, J. Pathogen proteotyping: a rapidly developing application of mass spectrometry to address clinical concerns. Clin. Mass Spectrom. 14, 9–17 (2019).

  4. Wang, Z. et al. A liquid chromatography–tandem mass spectrometry (LC-MS/MS)-based assay to profile 20 plasma steroids in endocrine disorders. Clin. Chem. Lab. Med. 58, 1477–1487 (2020).

  5. Seger, C. & Salzmann, L. After another decade: LC–MS/MS became routine in clinical diagnostics. Clin. Biochem. 82, 2–11 (2020).

  6. Ott, M., Berbalk, K., Plecko, T., Wieland, E. & Shipkova, M. Detection of drugs of abuse in urine using the Bruker Toxtyper™: experiences in a routine clinical laboratory setting. Clin. Mass Spectrom. 4, 11–18 (2017).

  7. Jarmusch, S. A., van der Hooft, J. J. J., Dorrestein, P. C. & Jarmusch, A. K. Advancements in capturing and mining mass spectrometry data are transforming natural products research. Nat. Prod. Rep. 38, 2066–2082 (2021).

  8. Liu, Y., Romijn, E. P., Verniest, G., Laukens, K. & De Vijlder, T. Mass spectrometry-based structure elucidation of small molecule impurities and degradation products in pharmaceutical development. Trends Anal. Chem. 121, 115686 (2019).

  9. Bandeira, N., Tsur, D., Frank, A. & Pevzner, P. A. Protein identification by spectral networks analysis. Proc. Natl Acad. Sci. USA 104, 6140–6145 (2007).

  10. Mongia, M. et al. Fast mass spectrometry search and clustering of untargeted metabolomics data. Nat. Biotechnol. 42, 1672–1677 (2024).

  11. de Jonge, N. F. et al. MS2Query: reliable and scalable MS2 mass spectra-based analogue search. Nat. Commun. 14, 1752 (2023).

  12. Huber, F., van der Burg, S., van der Hooft, J. J. J. & Ridder, L. MS2DeepScore: a novel deep learning similarity measure to compare tandem mass spectra. J. Cheminform. 13, 84 (2021).

  13. Rasche, F. et al. Identifying the unknowns by aligning fragmentation trees. Anal. Chem. 84, 3417–3426 (2012).

  14. Kim, S. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 47, D1102–D1109 (2019).

  15. Mohimani, H. et al. Dereplication of peptidic natural products through database search of mass spectra. Nat. Chem. Biol. 13, 30–37 (2017).

  16. Mohimani, H. et al. Dereplication of microbial metabolites through database search of mass spectra. Nat. Commun. 9, 4035 (2018).

  17. Cao, L. et al. MolDiscovery: learning mass spectrometry fragmentation of small molecules. Nat. Commun. 12, 3718 (2021).

  18. Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).

  19. Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J. & Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform. 8, 3 (2016).

  20. Tsugawa, H. et al. Hydrogen rearrangement rules: computational MS/MS fragmentation and structure elucidation using MS-FINDER software. Anal. Chem. 88, 7946–7958 (2016).

  21. Verdegem, D., Lambrechts, D., Carmeliet, P. & Ghesquière, B. Improved metabolite identification with MIDAS and MAGMa through MS/MS spectral dataset-driven parameter optimization. Metabolomics 12, 98 (2016).

  22. Allen, F., Pon, A., Wilson, M., Greiner, R. & Wishart, D. CFM-ID: a web server for annotation, spectrum prediction and metabolite identification from tandem mass spectra. Nucleic Acids Res. 42, W94–W99 (2014).

  23. Goldman, S., Li, J. & Coley, C. W. Generating molecular fragmentation graphs with autoregressive neural networks. Anal. Chem. 96, 3419–3428 (2024).

  24. Young, A., Röst, H. & Wang, B. Tandem mass spectrum prediction for small molecules using graph transformers. Nat. Mach. Intell. 6, 404–416 (2024).

  25. Murphy, M. et al. Efficiently predicting high resolution mass spectra with graph neural networks. In Proc. 40th International Conference on Machine Learning, Proc. Machine Learning Research Vol. 202 (eds Krause, A. et al.) 25549–25562 (PMLR, 2023); https://proceedings.mlr.press/v202/murphy23a.html

  26. Goldman, S., Bradshaw, J., Xin, J. & Coley, C. Prefix-tree decoding for predicting mass spectra from molecules. Adv. Neural Inf. Process. Syst. 36, 48548–48572 (2023).

  27. Young, A. et al. FraGNNet: a deep probabilistic model for mass spectrum prediction. Preprint at https://arxiv.org/abs/2404.02360 (2024).

  28. Park, J., Jo, J. & Yoon, S. Mass spectra prediction with structural motif-based graph neural networks. Sci. Rep. 14, 1400 (2024).

  29. Jeffryes, J., Strutz, J., Henry, C. & Tyo, K. E. Metabolic in silico network expansions to predict and exploit enzyme promiscuity. In Microbial Metabolic Engineering: Methods and Protocols (eds Santos, C. N. S. & Ajikumar, P. K.) 11–21 (Springer, 2019).

  30. Gurevich, A. et al. Increased diversity of peptidic natural products revealed by modification-tolerant database search of mass spectra. Nat. Microbiol. 3, 319–327 (2018).

  31. Lee, Y.-Y. et al. HypoRiPPAtlas as an atlas of hypothetical natural products for mass spectrometry database search. Nat. Commun. 14, 4219 (2023).

  32. Guler, M. Supplemental data for “Identifying novel variants of small molecules through database search of mass spectra”. Zenodo https://doi.org/10.5281/zenodo.11403641 (2024).

  33. Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 8, 61 (2016).

  34. Isshiki, K. et al. Depsidomycin, a new immunomodulating antibiotic. J. Antibiot. 43, 1195–1198 (1990).

  35. Blin, K. et al. antiSMASH 7.0: new and improved predictions for detection, regulation, chemical structures and visualisation. Nucleic Acids Res. 51, W46–W50 (2023).

  36. Du, Y.-L., He, H.-Y., Higgins, M. A. & Ryan, K. S. A heme-dependent enzyme forms the nitrogen–nitrogen bond in piperazate. Nat. Chem. Biol. 13, 836–838 (2017).

  37. Liu, J. et al. Biosynthesis of the anti-infective marformycins featuring pre-NRPS assembly line N-formylation and O-methylation and post-assembly line C-hydroxylation chemistries. Org. Lett. 17, 1509–1512 (2015).

  38. Yun, B.-S., Hidaka, T., Furihata, K. & Seto, H. Promothiocins A and B, novel thiopeptides with a tipA promoter inducing activity produced by Streptomyces sp. SF2741. J. Antibiot. 47, 510–514 (1994).

  39. Madeira, F. et al. Search and sequence analysis tools services from EMBL-EBI in 2022. Nucleic Acids Res. 50, W276–W279 (2022).

  40. Blin, K. et al. antiSMASH 4.0—improvements in chemistry prediction and gene cluster boundary identification. Nucleic Acids Res. 45, W36–W41 (2017).

  41. Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).

  42. van Santen, J. A. et al. The Natural Products Atlas: an open access knowledge base for microbial natural products discovery. ACS Cent. Sci. 5, 1824–1833 (2019).

  43. van Santen, J. A. et al. The Natural Products Atlas 2.0: a database of microbially-derived natural products. Nucleic Acids Res. 50, D1317–D1323 (2022).

  44. Sorokina, M., Merseburger, P., Rajan, K., Yirik, M. A. & Steinbeck, C. COCONUT online: collection of open natural products database. J. Cheminform. 13, 2 (2021).

  45. Heller, S. R., McNaught, A., Pletnev, I., Stein, S. & Tchekhovskoi, D. InChI, the IUPAC international chemical identifier. J. Cheminform. 7, 23 (2015).

  46. Hopcroft, J. & Tarjan, R. Algorithm 447: efficient algorithms for graph manipulation. Commun. ACM 16, 372–378 (1973).

  47. Chen, S. X. & Liu, J. S. Statistical applications of the Poisson-binomial and conditional Bernoulli distributions. Stat. Sin. 7, 875–892 (1997).

  48. Howard, S. Discussion on Professor Cox’s paper. J. R. Stat. Soc. B 34, 210–211 (1972).

  49. Abramova, A. & Korobeynikov, A. Assessing the significance of peptide spectrum match scores. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017), Leibniz International Proc. Informatics (LIPIcs) Vol. 88 (eds Schwartz, R. & Reinert, K.) 14:1–14:11 (Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017); https://doi.org/10.4230/LIPIcs.WABI.2017.14

  50. Wang, F. & Landau, D. P. Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett. 86, 2050 (2001).

  51. Horvát, S. Z. & Modes, C. D. Connectedness matters: construction and exact random sampling of connected networks. J. Phys. Complex. 2, 015008 (2021).

  52. Thorup, M. Near-optimal fully-dynamic graph connectivity. In Proc. Thirty-second Annual ACM Symposium on Theory of Computing (eds Yao, F. & Luks, E.) 343–350 (ACM, 2000).

  53. Elias, J. E. & Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 4, 207–214 (2007).

  54. VInSMoC software. Zenodo https://doi.org/10.5281/zenodo.17452093 (2025).

  55. Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).

Acknowledgements

M.G., B.K., T.H., M.T., J.A., S.R. and H.M. were supported by National Institutes of Health New Innovator Award DP2GM137413, US Department of Energy award DE-SC0021340 and National Science Foundation award DBI-2117640. The work of B.B. was supported by the National Institute of General Medical Sciences of the National Institutes of Health award R43GM150301.

Author information

Contributions

M.G., B.K., T.H., M.T., J.A. and S.R. implemented the algorithms. M.G. performed the analysis. P.C. and M.L. provided imatinib impurity spectral data and their analyses. B.B. and H.M. designed and directed the work. M.G. and H.M. wrote the paper, and all authors contributed to its revision.

Corresponding authors

Correspondence to Bahar Behsaz or Hosein Mohimani.

Ethics declarations

Competing interests

H.M. and B.B. are co-founders of and have equity interests in Chemia Biosciences.

Peer review

Peer review information

Nature Computational Science thanks Bart Ghesquiere, Tomáš Pluskal and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Runtimes of mass spectral search engines on the GNPSLIBvNPAtlas dataset as a function of precursor ion error tolerance.

Each runtime is reported as the average of three runs performed after a warm-up run that fills disk caches; the warm-up run is excluded from the average. All times are measured using a single core of an AMD Ryzen Threadripper PRO 5995WX.
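The measurement protocol described in this caption can be sketched as follows. This is a minimal illustration, not the benchmarking code used in the study: the function name `timed_average` and the `run` callable are hypothetical, and pinning the process to a single core is left to the caller.

```python
import time
from statistics import mean


def timed_average(run, repeats=3):
    """Average wall-clock time of `run` over `repeats` calls.

    One untimed warm-up call is made first so that disk caches are
    already filled when the timed calls start.
    """
    run()  # warm-up call: executed but not timed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        durations.append(time.perf_counter() - start)
    return mean(durations), durations
```

Here `run` would wrap a single invocation of the search tool being measured, for example through `subprocess.run`.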

Source data

Extended Data Fig. 2 Distances between top-scoring and correct modification sites using VInSMoC for various fragmentation settings.

For compounds from the GNPSLIB dataset modified by 107 frequent modifications, we compute the distance between the top-scoring modification site predicted by VInSMoC and the correct modification site. Distance is defined as the average distance between the atoms of the predicted site and the atoms of the correct site. A scoring strategy limited to depth 1 fragments is compared to scoring both depth 1 and depth 2 fragments. Additionally, scoring OC (oxygen to carbon), NC (nitrogen to carbon) and CC (carbon to carbon) single bond fragments is compared to scoring only OC and NC bond fragments, and only NC bond fragments. n = 4,966 modified compounds.
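One possible reading of this distance is sketched below. It rests on assumptions the caption does not state: the molecule is treated as an unweighted graph of atoms and bonds, the distance between two atoms is the number of bonds on the shortest path, and the site distance is the mean over all pairs of one predicted atom and one correct atom. The names `bond_distances` and `site_distance` are hypothetical.

```python
from collections import deque


def bond_distances(adjacency, source):
    """Breadth-first search: bond count from `source` to each reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def site_distance(adjacency, predicted_site, correct_site):
    """Mean shortest-path distance over all (predicted, correct) atom pairs."""
    total = 0
    for p in predicted_site:
        dist = bond_distances(adjacency, p)
        total += sum(dist[c] for c in correct_site)
    return total / (len(predicted_site) * len(correct_site))


# Toy example: a linear chain of five atoms, 0-1-2-3-4.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(site_distance(chain, [0, 1], [3]))  # (3 + 2) / 2 = 2.5
```

Under this reading, a distance of 0 means the predicted and correct sites are the same single atom.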

Source data

Extended Data Fig. 3 Modified molecules and their variable scores for different candidate modification site localizations.

Molecules were originally taken from the GNPSLIB spectral library. Example modified molecules are from spectra with GNPS IDs (a) CCMSLIB00005722331, (b) CCMSLIB00000080011, (c) CCMSLIB00000853587, and (d) CCMSLIB00000079626. Each row from left to right shows the original molecule, the modified molecule with the modification site highlighted in red, and the modified molecule with atoms highlighted by their variable score against the spectrum for the original molecule.

Extended Data Fig. 4 Accuracies on (A) GNPSLIBvCOCONUT and (B) GNPSLIBvNPAtlas for exact search methods.

Molecule-spectrum matches for each method are grouped by spectrum and ranked by descending score. Tied matches are assigned the average rank of all molecule-spectrum matches in the tie. The top-k accuracy is the percentage of test spectra for which the correct compound was ranked k or better.
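The ranking rule in this caption can be illustrated with a short sketch. The data layout (one dictionary of scores per spectrum, keyed by molecule) and the function names are assumptions made for the example, not the format of the source data files.

```python
def average_rank(scores, correct):
    """Rank of `correct` by descending score; ties get the mean tied rank."""
    target = scores[correct]
    higher = sum(1 for s in scores.values() if s > target)
    tied = sum(1 for s in scores.values() if s == target)
    # Tied matches occupy ranks higher+1 .. higher+tied, whose mean is:
    return higher + (tied + 1) / 2


def top_k_accuracy(matches, k):
    """Percentage of spectra whose correct molecule has average rank <= k.

    `matches` is a list of (scores_by_molecule, correct_molecule) pairs,
    one pair per test spectrum.
    """
    hits = sum(1 for scores, correct in matches
               if average_rank(scores, correct) <= k)
    return 100.0 * hits / len(matches)


matches = [
    ({"A": 10, "B": 8, "C": 8}, "A"),  # correct molecule ranked 1
    ({"A": 5, "B": 5, "C": 1}, "B"),   # tie for ranks 1 and 2, so rank 1.5
]
print(top_k_accuracy(matches, 1))  # 50.0
print(top_k_accuracy(matches, 2))  # 100.0
```

Averaging tied ranks means a correct compound that ties for first place with one other candidate counts towards top-2 but not top-1 accuracy.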

Source data

Extended Data Fig. 5 Accuracies on GNPSLIBvDNP for spectra not used to train CSI:FingerID for exact search methods.

Molecule-spectrum matches for each method are grouped by spectrum and ranked by descending score. Tied matches are assigned the average rank of all molecule-spectrum matches in the tie. The top-k accuracy is the percentage of test spectra for which the correct compound was ranked k or better. Accuracies for CFM-ID, CSI:FingerID, Dereplicator+, MAGMa+, MetFrag, and molDiscovery are taken from Cao et al. (ref. 17).

Source data

Extended Data Fig. 6 False discovery rates at various metric cutoffs when searching GNPS against COCONUT.

Plots in the first row are for exact-mode search and plots in the second row are for variable-mode search. Plots in the first column use score-based cutoffs and plots in the second column use p-value cutoffs.
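This caption does not state how the false discovery rate is estimated. As a rough sketch only, assuming a target-decoy strategy of the kind described in ref. 53, the FDR at a score cutoff can be estimated as the number of decoy matches passing the cutoff divided by the number of target matches passing it. The function name and inputs below are hypothetical.

```python
def fdr_at_cutoffs(target_scores, decoy_scores, cutoffs):
    """Estimated FDR per cutoff: decoys passing / targets passing.

    Assumes higher scores are better. For p-value cutoffs the comparison
    would be reversed (keep matches with p-value <= cutoff).
    """
    result = {}
    for cutoff in cutoffs:
        targets = sum(1 for s in target_scores if s >= cutoff)
        decoys = sum(1 for s in decoy_scores if s >= cutoff)
        # No target passes the cutoff: report 0.0 rather than divide by zero.
        result[cutoff] = decoys / targets if targets else 0.0
    return result


targets = [10, 9, 8, 7, 3]
decoys = [8, 2, 1]
print(fdr_at_cutoffs(targets, decoys, [8, 2]))
```

In this toy example, a cutoff of 8 keeps three target matches and one decoy match, and a cutoff of 2 keeps five target matches and two decoy matches.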

Source data

Extended Data Fig. 7 High frequency mass shifts of compounds detected by VInSMoC in variable mode.

Mass shifts that occurred fewer than 300 times are excluded. Mass shifts are calculated as the difference between the precursor mass of the spectrum and the monoisotopic mass of the matched molecule. n = 9,160 mass shifts.
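The counting behind this figure can be sketched as follows. The input layout (pairs of precursor mass and monoisotopic mass in Da) and the use of rounding to the nearest integer to obtain integral shifts are assumptions made for illustration; the name `frequent_mass_shifts` is hypothetical.

```python
from collections import Counter


def frequent_mass_shifts(matches, min_count=300):
    """Count integral mass shifts and keep those seen at least `min_count` times.

    `matches` is an iterable of (precursor_mass, monoisotopic_mass) pairs.
    The shift is precursor mass minus monoisotopic mass, rounded to the
    nearest integer (an assumption about how shifts are binned).
    """
    counts = Counter(round(precursor - mono) for precursor, mono in matches)
    return {shift: n for shift, n in counts.items() if n >= min_count}


# Toy example with a lowered threshold so the output is non-empty.
toy = [(316.12, 302.10)] * 3 + [(500.0, 484.0)] * 2 + [(100.0, 100.0)]
print(frequent_mass_shifts(toy, min_count=2))  # {14: 3, 16: 2}
```

The default threshold of 300 mirrors the cutoff stated in the caption.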

Source data

Extended Data Fig. 8 Mean VInSMoC scores across ClassyFire superclasses using various fragmentation settings at fragmentation depth 2.

Fragmentation with various allowed bond breakages is benchmarked across the MoNA dataset.

Source data

Extended Data Fig. 9 Identifications of Imatinib impurities by VInSMoC.

For Imatinib (A) and four variants of Imatinib (B–D), this shows the precursor m/z of the mass spectrum and the PubChem CID of the compound annotated by chemists. The distribution of variable mode scores using VInSMoC for each spectrum against the original Imatinib structure is shown in the third column. In the fourth column, the structure that chemists identified for the spectrum is shown, with differences from the original Imatinib highlighted in red.

Extended Data Fig. 10 Proposed BGC and synthetic pathway for Depsidomycin.

(a) The proposed BGC with NRPS genes colored black and genes installing auxiliary modifications shown in blue and magenta. (b) The NRPS module structure of this BGC with A-domain specificity shown via the elongation of Depsidomycin. (c) Auxiliary modifications applied to produce the final Depsidomycin structure. The thioesterase domain catalyzes formation of the ester (green), conversion of Ornithine to Piperazic acid is catalyzed by genes homologous to ktzI and ktzT (blue), and a gene homologous to mfnA is responsible for N-terminal formylation (magenta).

Supplementary information

Supplementary Information

Supplementary Remark, Algorithms 1 and 2, Tables 1–3 and Figs. 1–23.

Reporting Summary

Supplementary Data 1

Source data for the Supplementary figures. Source data is provided as a separate CSV for each relevant Supplementary figure, named si_figureX_data.csv.

Source data

Source Data Fig. 4

CSV containing runtime breakdowns of Dereplicator+ and VInSMoC–Exact on the GNPSLIB spectral dataset.

Source Data Extended Data Fig. 1

Runtimes in seconds of exact search methods on the GNPSLIBvNPAtlas dataset across multiple precursor ion tolerances. Each tool was measured three times per precursor ion tolerance; all three runtimes (in seconds) are reported in this data file.

Source Data Extended Data Fig. 2

CSV containing raw distances between predicted and correct modification sites across multiple fragmentation settings in VInSMoC. Data were constructed by applying common modifications to the structures in the GNPSLIB dataset. Distance values and fragmentation settings are provided, which were used to compute the summary metrics reported in the figure.

Source Data Extended Data Fig. 4

CSV containing top-k accuracies of fragmentation graph-based mass spectral search tools on searching GNPSLIB against COCONUT and NPAtlas chemical databases.

Source Data Extended Data Fig. 5

CSV containing top-k accuracies of multiple exact search tools on searching the subset of GNPSLIB not used to train CSI:FingerID against the Dictionary of Natural Products.

Source Data Extended Data Fig. 6

CSV containing false discovery rates at various score cut-offs when using VInSMoC in both exact and open mode and using raw scores or P values as cut-offs.

Source Data Extended Data Fig. 7

CSV containing raw counts of integral mass shifts of compounds detected by VInSMoC in variable mode. This contains the plotted portion, which includes all integral shifts that occurred at least 300 times.

Source Data Extended Data Fig. 8

CSV containing scores of true-positive MoNA spectra using depth-2 fragmentation with various allowed broken bond types. Plot reports averages from this raw dataset.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Guler, M., Krummenacher, B., Hall, T. et al. Identifying variants of molecules through database search of mass spectra. Nat Comput Sci 5, 1227–1237 (2025). https://doi.org/10.1038/s43588-025-00923-5

  • DOI: https://doi.org/10.1038/s43588-025-00923-5
