Abstract
Molecular identification through tandem mass spectrometry is fundamental in small molecule analysis, with formula identification serving as an initial step in the process. Current computational methods often struggle with accuracy, speed, and scalability for larger molecules, limiting high-throughput workflows. We present FIDDLE (Formula IDentification by Deep LEarning), a deep learning-based method trained on over 38,000 molecules and 1 million MS/MS spectra from various Quadrupole Time-of-Flight (Q-TOF) and Orbitrap instruments. FIDDLE accelerates formula identification by more than 10-fold and achieves top-1 and top-5 accuracies of 88.3% and 93.6%, respectively, outperforming state-of-the-art methods based on top-down (SIRIUS) and bottom-up (BUDDY) approaches by over 10%. On external metabolomics datasets, FIDDLE achieves top-5 accuracies of 75.1% (positive ion mode) and 66.2% (negative ion mode), with further improvements to 80.0% and 73.8% when combined with SIRIUS and BUDDY.
Introduction
Tandem mass spectrometry (MS/MS) is an essential analytical tool for identifying small molecules and elucidating their structural characteristics. The standard approach for identifying unknown analytes from MS/MS data involves searching spectra against reference spectral libraries1,2,3. However, due to limitations in time, labor, and resources, a significant portion of chemical signatures remains uncharacterized, often termed “dark matter”4. These unidentified small molecules may possess unique bioactivities and play crucial roles in understanding biological mechanisms. Unfortunately, such molecules may lack corresponding reference spectra in spectral libraries or may not have been previously reported in the literature (i.e., the “unknown unknowns”5). As a result, the identification of unknown compounds has become a challenging yet vital research area, spanning metabolomics, environmental analysis, natural product and drug discovery, and more6,7,8,9,10,11. Specifically, the prediction of molecular formulas serves as the initial and most fundamental step, providing critical constraints that facilitate structural elucidation and the annotation of fragments for these unknown compounds12,13.
While MZmine initially incorporated isotope pattern matching, MS/MS fragmentation analysis, and heuristic rules as a toolbox14, computational methodologies for chemical formula identification from MS/MS data have evolved into top-down and bottom-up approaches, exemplified by SIRIUS15 and BUDDY16, respectively. SIRIUS begins by generating candidate formulas through the analysis of isotope patterns17, then computes a fragmentation tree for each candidate18 and evaluates them by comparing theoretical fragmentation patterns against experimental MS/MS data. The evaluation metric considers multiple factors, including fragment masses, intensities, and isotope patterns, to estimate the likelihood of each candidate producing the observed spectrum. However, SIRIUS’s reliance on neutral loss fragments limits its performance on multiply-charged spectra, which contain charged loss fragments. Such spectra represent 7.7% and 1.8% of Q-TOF and Orbitrap spectra in the National Institute of Standards and Technology (NIST) Spectral Library (2023 version; NIST23)19, respectively (Supplementary Fig. 1). Additionally, its efficiency is hindered by the computational demand of generating fragmentation trees for all potential formulas inferred from isotope patterns. MIST-CF shows that fragmentation trees can be replaced with a simple peak subformula assignment routine, achieving equally accurate and fast predictions20. However, it still relies on SIRIUS’s algorithmic decomposition of exact masses into formula candidates, limiting efficiency and accuracy. In contrast, BUDDY significantly reduces the number of candidate formulas by focusing on those explainable by MS/MS data, using a reference library of known formulas. It ranks candidates matching the precursor mass and estimates a false discovery rate (FDR) to provide a confidence score. However, BUDDY’s scope is restricted by the coverage of its reference library, potentially missing entirely uncharacterized and previously unreported formulas. Our analysis, illustrated in Supplementary Fig. 2, revealed that 45 unique formulas represented in MS/MS spectra in NIST23, MassBank of North America (MoNA)2, Global Natural Product Social Molecular Networking (GNPS) Spectral Library21, and the Agilent Personal Compound Database and Library (PCDL) fall outside the MS/MS-explainable space of BUDDY, rendering them unanalyzable by the method.
Both computational methods underutilize the full scope of information present in MS/MS spectrum data. SIRIUS 6, for instance, considers a limited number of peaks (up to 60), while BUDDY relies on manually extracted features, such as double-bond equivalent values of annotated fragments. As a result, increasing precursor mass-to-charge ratio (m/z) leads to higher computational complexity and significantly decreased accuracy due to the exponentially growing number of candidate molecular formulas, which expands the search space and increases ambiguity. For instance, at m/z 800, the number of candidates reaches tens for BUDDY and tens of thousands for SIRIUS16. This limitation stems from the fact that higher precursor m/z values often correspond to larger, more complex molecular structures, which then requires the evaluation of a broader range of potential formulas. Furthermore, the peaks excluded from analysis, along with the relationships between considered and unconsidered peaks, may hold crucial structural information that is not exploited.
In this paper, we address these limitations by introducing a deep learning approach to chemical formula identification. We present FIDDLE (Formula IDentification from tandem mass spectra by Deep LEarning), which employs dilated convolutions with large kernels22,23 to extract high-dimensional representations of MS/MS data using extremely large receptive fields. To predict candidate formulas, the model is trained using a composite objective that includes a primary formula regression loss, a contrastive loss, and auxiliary task losses to enhance performance. These initial predictions are refined using a breadth-first search algorithm that adjusts atomic compositions to align the candidate formulas with the precursor mass. Additionally, we train a secondary deep learning model to estimate confidence scores for candidate formulas and rank them based on MS/MS features learned from the formula identification model. Compared to traditional computational methods, our approach dramatically reduces the candidate formula space for a given MS/MS spectrum to a small number (at most five formulas by default). This reduction is facilitated by the deep learning model, which benefits from accelerated GPU-based tensor computations. The narrowed candidate space simplifies confidence score estimation, and the MS/MS features learned by the deep learning model can be reused for this purpose.
Results
Deep learning method for formula identification
Predicting target formulas directly from MS/MS spectra under varying experimental conditions presents significant challenges. To address this, we break down the task into three steps as illustrated in Fig. 1a: (1) predicting formulas from MS/MS spectra using a deep learning model; (2) generating candidate formulas using a breadth-first formula refinement algorithm; and (3) calculating confidence scores for the candidate formulas using an additional deep learning model. The formula refinement step relaxes the requirement for exceptionally high initial prediction accuracy by allowing adjustments to candidate formulas with minimal atom modifications. Moreover, since assessing the correctness of a limited set of predictions is easier than identifying the top correct outcomes from an infinitely large pool24, the refinement step also improves the confidence score estimation.
a The FIDDLE workflow comprises three main steps: predicting formulas from MS/MS spectra using a deep learning model; generating candidate formulas through a breadth-first formula refinement algorithm; and predicting confidence scores for the candidate formulas. b The deep learning model architecture includes an MS/MS spectrum encoder (E) and decoders (Df, Da, Dm, and Dhc) that output the predicted formula along with auxiliary variables, such as atom count, molecular mass, and H/C ratio. A contrastive loss (\(\mathcal{L}_{c}\)) is computed on pairs of condition-independent MS/MS features (zi and zj) to facilitate model convergence.
To input an MS/MS spectrum into a deep learning model, we first bin it into a 1-D vector with a fixed mass-to-charge ratio (m/z) resolution. For example, an MS/MS spectrum with a maximum m/z of 1500 Da is binned into a vector of length 7500, with each bin representing a resolution of 0.2 Da. Molecular formulas are directly converted into formula vectors, where each element type is represented by its atom count as the corresponding value in the vector. For instance, the molecular formula C6H12O6 can be represented as the vector \(\left[6,12,6,0,...\right]\), where the first three integers correspond to the number of carbon, hydrogen, and oxygen atoms, respectively, followed by atom counts for other elements in a predefined order.
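As a concrete illustration of this encoding, the short Python sketch below bins a peak list at 0.2 Da resolution and converts an atom-count dictionary into a fixed-order formula vector; the element ordering, normalization, and function names are illustrative assumptions rather than FIDDLE's exact implementation.

```python
import numpy as np

# Illustrative element order; FIDDLE's actual order and vector length are defined by its own config.
ELEMENTS = ["C", "H", "O", "N", "F", "S", "Cl", "P", "B", "I", "Br", "Na", "K"]

def bin_spectrum(mz, intensity, max_mz=1500.0, resolution=0.2):
    """Bin a peak list into a fixed-length intensity vector (7500 bins at 0.2 Da)."""
    n_bins = int(max_mz / resolution)              # 1500 / 0.2 = 7500
    vec = np.zeros(n_bins, dtype=np.float32)
    for m, i in zip(mz, intensity):
        if 0 < m <= max_mz:
            idx = min(int(m / resolution), n_bins - 1)
            vec[idx] += i                          # accumulate intensities falling in the same bin
    return vec / (vec.max() + 1e-8)                # normalize to [0, 1]

def formula_to_vector(atom_counts):
    """Map an atom-count dict (e.g. {'C': 6, 'H': 12, 'O': 6}) to a fixed-order vector."""
    return np.array([atom_counts.get(e, 0) for e in ELEMENTS], dtype=np.float32)

# Example: glucose, C6H12O6 -> [6, 12, 6, 0, ...]
print(formula_to_vector({"C": 6, "H": 12, "O": 6}))
```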
To encode the MS/MS spectra, we use stacked dilated convolutions with large kernels to capture relationships between peaks across broad mass ranges22,23. This technique expands the model’s receptive field, enabling it to analyze local and global spectral patterns at the same time. It serves as a powerful and computationally efficient alternative to fully connected layers. The learned MS/MS features are then concatenated with experimental conditions, such as collision energies, precursor types, and experimental precursor m/z, and fed into two linear layers to produce condition-independent MS/MS features, denoted as zi and zj in Fig. 1b. During training, a contrastive loss is applied to ensure that these condition-independent MS/MS features are close for spectra from the same molecule and far apart for spectra from different molecules. We use sequential linear layers as decoders for both formula identification and auxiliary tasks, including atom number prediction, molecular mass prediction, and H/C ratio prediction. These auxiliary tasks, incorporated through multitask learning, enhance the generalizability of the deep learning model and serve as a form of regularization25. The detailed architecture and parameters of the model are specified in the section “Representation learning for MS/MS spectra.”
Recognizing that deep learning models cannot guarantee the validity or perfect accuracy of predicted formulas, we developed a breadth-first formula refinement algorithm. This algorithm aims to make minimal adjustments to atom counts to ensure that the formulas comply with SENIOR rules26 and align with the target mass within a specified mass tolerance—specifically, 10 parts per million (ppm) for MS/MS data from Q-TOF instruments and 5 ppm for data from Orbitrap instruments. This refinement process produces a set of k candidate formulas for each MS/MS spectrum, where k is set to 5 by default. Note that the algorithm is flexible enough to integrate results from various formula identification methods, such as SIRIUS and BUDDY, where the predicted formula can be expanded into a longer list of candidate formulas. An auxiliary model is then developed to estimate confidence scores using the MS/MS features learned during formula identification and each candidate formula (see details in the section “Prediction of confidence score”). Finally, the candidate formulas are ranked based on their estimated confidence scores.
Performance of formula identification
MS/MS spectra were collected from NIST23, NIST20, Agilent PCDL, MoNA, and GNPS, as well as from an internal dataset (see the section “MS/MS data filtering” for details) acquired using a Waters Q-TOF mass spectrometer. The MS/MS spectra were preprocessed according to the methods described in the section “MS/MS data pre-processing,” including filtering based on peak count, molecular mass, atom type and number, and mass difference in ppm. Additional pre-processing steps included simulating precursor m/z values for the NIST dataset, simplifying precursor types, and constructing the training set for contrastive learning (CL). In total, 131,224 MS/MS spectra from 15,399 molecules acquired with Q-TOF mass spectrometers and 965,656 MS/MS spectra from 28,383 molecules acquired with Orbitrap mass spectrometers were used for training and evaluation. A summary of the number of spectra and compounds in each dataset is provided in Table 1. We retained MS/MS spectra from compounds (represented as canonical SMILES without stereochemical information) found exclusively in NIST23 and not in any other libraries (including NIST20) as the test set, ensuring these spectra were not used during the training of any of the models being compared. Because these spectra were published after the release of BUDDY and SIRIUS, excluding them from FIDDLE’s training maintains a fair comparison. Consistent with previous studies16, we used the top K accuracy to evaluate the performance of formula identification algorithms. This metric is calculated as the proportion of spectra for which the correct formula is included among the top K (by default K = 5) ranked formulas predicted by a given algorithm. The settings for comparison methods are specified in the section “Running SIRIUS and BUDDY.”
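For reference, the top K accuracy metric can be computed in a few lines of Python; the candidate lists in the example are hypothetical.

```python
def top_k_accuracy(ranked_predictions, true_formulas, k=5):
    """Fraction of spectra whose correct formula appears among the top-k ranked candidates."""
    hits = sum(true in preds[:k] for preds, true in zip(ranked_predictions, true_formulas))
    return hits / len(true_formulas)

# Hypothetical ranked candidate lists for two spectra
ranked = [["C6H12O6", "C5H8O7"], ["C7H8N4O2", "C6H4N6O2"]]
truth = ["C6H12O6", "C6H4N6O2"]
print(top_k_accuracy(ranked, truth, k=1))  # 0.5
print(top_k_accuracy(ranked, truth, k=5))  # 1.0
```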
As shown in Fig. 2a–f, FIDDLE outperformed the other state-of-the-art formula identification algorithms (BUDDY and SIRIUS) and reduced the cumulative runtime by approximately 10-fold compared to BUDDY and 100-fold compared to SIRIUS. In addition to formula identification, performance metrics for auxiliary tasks, including mass, atom number, and H/C ratio predictions, are presented in Supplementary Fig. 3. Notably, the top-1 accuracy of BUDDY and SIRIUS declines significantly for larger compounds (molecular weight > 800 Da; Fig. 2b, e), with BUDDY’s accuracy decreasing to 0.427 for Q-TOF and 0.684 for Orbitrap, and SIRIUS’s accuracy decreasing to 0.187 for Q-TOF and producing no output within the timeout limit for Orbitrap, compared to 0.844, 0.702, 0.583, and 0.669, respectively, on smaller compounds (Fig. 2a, d). In contrast, FIDDLE maintained robust performance on large compounds, achieving top-1 accuracies of 0.642 and 0.813 for Q-TOF and Orbitrap spectra, respectively. For small compounds, FIDDLE also slightly outperformed BUDDY and SIRIUS, achieving top-1 accuracies of 0.906 and 0.881 for Q-TOF and Orbitrap, respectively. In top-5 formula prediction, FIDDLE also consistently outperformed BUDDY and SIRIUS, especially on challenging large compounds. For Q-TOF data, FIDDLE’s accuracy on large compounds (0.754) was substantially higher than that of BUDDY (0.489) and SIRIUS (0.178). This trend was even more pronounced for Orbitrap data, where FIDDLE achieved an accuracy of 0.971 for large compounds, while BUDDY scored 0.684 and SIRIUS failed to return a result in time. Moreover, incorporating BUDDY’s results into FIDDLE’s candidate formula pool further improved prediction accuracy, indicating that BUDDY and FIDDLE perform well on different sets of compounds and can be complementary. While combining these methods may require additional running time, it can yield superior overall results. The best accuracy and running time of FIDDLE for different settings of K on Q-TOF test spectra are shown in Fig. 2h. Higher K values improve accuracy at the cost of increased computational time, enabling users to balance performance and efficiency based on their requirements.
a, b Top K accuracy for formula identification on Quadrupole Time-of-Flight (Q-TOF) tandem mass spectrometry (MS/MS) spectra of molecules ≤800 Da and >800 Da, respectively. N represents the number of spectra. c–f Same analysis for Orbitrap spectra. SIRIUS produced no output for large Orbitrap molecules due to runtime limits. FIDDLE supports all 14 precursor types (charges +1, −1, +2), while BUDDY and SIRIUS are evaluated only on supported types, excluding timeouts. g Top K accuracy on the MS/MS spectra with different levels of added noise. h FIDDLE's performance across different K settings, where the left y-axis (blue lines) represents the best accuracy of top K candidates and the right y-axis (gray bars) represents accumulated running times. Source data are provided as a Source data file.
We assessed FIDDLE’s generalizability by comparing its performance on two distinct training and test set splits: one divided randomly by canonical SMILES and a more stringent split divided by unique chemical formula. The formula-based split is significantly more challenging, as it requires the model to predict formulas it has never encountered during training. Despite this, FIDDLE’s accuracy decreased by only about 10% on this task compared to the standard SMILES-based split (p < 0.001, one-sample t-test), demonstrating robust performance. Detailed results are shown in Supplementary Fig. 4.
To assess robustness to spectral noise, we evaluated FIDDLE, BUDDY, and SIRIUS on 1000 randomly sampled MS/MS spectra from the test set. We systematically added Gaussian noise at five increasing levels, with detailed methods described in the section “Training set construction and data augmentation.” FIDDLE demonstrated superior noise resilience across both Q-TOF and Orbitrap instruments, maintaining over 90% accuracy even under large noise conditions. On Q-TOF spectra, FIDDLE achieved 95.0% top-5 accuracy at the highest noise level, compared to BUDDY’s 93.5% and SIRIUS’s 52.5%. On Orbitrap spectra, FIDDLE maintained 93.3% accuracy, while BUDDY decreased to 75.4% and SIRIUS to 63.1%. Overall, FIDDLE exhibited minimal performance degradation as noise increased, whereas BUDDY showed a moderate decline and SIRIUS performed poorly but consistently, underscoring FIDDLE’s practical advantage in real-world scenarios with varying spectral quality.
Additional evaluation on chimeric spectra—synthetic mixtures of authentic MS/MS data27—demonstrates the expected trade-off between FIDDLE’s predictive performance and robustness to signal mixtures (Supplementary Note 1). These results inform potential improvements through data augmentation strategies.
It is worth noting that interpretability can be a limitation of deep learning models compared to computation-based methods; therefore, we provide a potential interpretation method for FIDDLE in Supplementary Note 4 using SHAP28. However, current interpretation approaches face constraints from binned spectral resolution and limitations in providing specific atom counts for direct fragment annotation. Future extensions to molecular structure prediction could offer more intuitive explanations by directly linking spectral features to structural fragments.
Impact of data characteristics on model performance
We conducted a comprehensive analysis to evaluate how model performance is influenced by various data characteristics. The factors we investigated include experimental metadata (such as collision energy and precursor type), molecular properties (like size and chemical class), and the internal MS/MS representations learned by FIDDLE.
Metadata
We integrated precursor type and collision energy as metadata in the deep learning model and evaluated performance under different conditions, as shown in Fig. 3a, b. Performance correlated positively with training dataset size, as the following examples illustrate. FIDDLE demonstrates optimal performance when analyzing spectra of the dimers [2M + H]+ and [2M + 2H]2+, perhaps due to their more predictable fragmentation patterns and simpler spectral signatures (Supplementary Fig. 5). Furthermore, higher collision energies produce more peaks, enhancing pattern recognition and boosting performance (the circle sizes denote the average peak numbers). For the Q-TOF instrument, only the subset of spectra with collision energies in the range [40, ∞) had sufficient training data for FIDDLE to achieve optimal results.
a, b Top-5 accuracy by precursor type and collision energy for spectra from Quadrupole Time-of-Flight (Q-TOF) and Orbitrap instruments. The circle size in (b) indicates the average peak number for each collision energy group. c, d Top-5 accuracy by molecular polarity (LogP) and mass range, with a shared legend for both Q-TOF and Orbitrap. e Accuracy vs. formula distance for similar MS/MS spectra. Pearson correlation analysis was performed using two-sided tests, with correlation coefficients (r) and exact p value (p) displayed in the legend. Linear regression trend lines (least squares fit) are shown as dashed lines. f Top-5 accuracy across 17 chemical superclasses (indices in Supplementary Note 2). Source data are provided as a Source data file.
Molecular polarity
LogP values were computed from SMILES strings using RDKit's Crippen.MolLogP method. Compounds were classified into three polarity groups: Polar (LogP < 0), Moderate (0 ≤ LogP < 3), and Nonpolar (LogP ≥ 3). Identification accuracy exhibited a strong inverse correlation with LogP values (Fig. 3c). Polar compounds achieved the highest Top-5 accuracy (95.0%), followed by moderate (94.7%) and nonpolar compounds (88.8%), representing a 6.2% performance gap that was consistent across both Q-TOF and Orbitrap instruments. This trend is likely due to the superior ionization efficiency and more consistent fragmentation that polar compounds exhibit under electrospray ionization mass spectrometry.
Molecular mass
Molecular weights were calculated from SMILES using RDKit's MolWt function and classified into five groups: < 200, 200–400, 400–600, 600–800, and ≥800 Da (Fig. 3d). Accuracy exhibited a complex relationship with molecular mass, with the smallest compounds (<200 Da) achieving the highest performance (97.3% Top-5 accuracy for Q-TOF, 95.0% for Orbitrap). While the model performance on Q-TOF data declined substantially for larger molecules (≥800 Da: 75.4%), on Orbitrap data, the model maintained robust accuracy across all mass ranges, performing particularly well for the largest compounds (≥800 Da: 97.1%). This performance divergence suggests that the Orbitrap’s superior mass resolution is key to identifying large molecules, whose richer fragmentation patterns provide discriminative features that deep learning methods can effectively exploit.
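The polarity and mass groupings used in Fig. 3c, d can be reproduced with RDKit as sketched below; the helper function and the glucose example are illustrative and not part of the FIDDLE code base.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

def polarity_and_mass_group(smiles):
    """Classify a compound by Crippen LogP and molecular weight, following the groupings above."""
    mol = Chem.MolFromSmiles(smiles)
    logp = Crippen.MolLogP(mol)
    mw = Descriptors.MolWt(mol)
    if logp < 0:
        polarity = "Polar"
    elif logp < 3:
        polarity = "Moderate"
    else:
        polarity = "Nonpolar"
    bounds = [200, 400, 600, 800]
    labels = ["<200", "200-400", "400-600", "600-800", ">=800"]
    mass_group = labels[sum(mw >= b for b in bounds)]
    return polarity, mass_group

print(polarity_and_mass_group("OCC1OC(O)C(O)C(O)C1O"))  # glucose: ('Polar', '<200')
```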
Molecular types
Test compounds were classified using ClassyFire Batch into 17 superclasses (see Supplementary Note 2). Fig. 3f reveals substantial variation in identification accuracy across chemical superclasses. While most superclasses achieved high performance (>90% Top-5 accuracy), four categories showed notably lower performance: Organosulfur compounds (72.1% on Q-TOF data and 85.6% on Orbitrap data), Organic Polymers (50.0% on Q-TOF data and 68.6% on Orbitrap data), Organohalogen compounds (0% on Q-TOF data and 66.4% on Orbitrap data), and Unknown compounds (33.3% on Q-TOF data and 57.2% on Orbitrap data). This drop in performance is likely due to insufficient training data for these underrepresented superclasses.
MS/MS representation space
To visualize the learned feature space, we applied t-SNE to project FIDDLE’s 512-dimensional latent vectors into a 2D representation, which we then divided into a 20 × 20 grid. For each cell in the grid, we calculated the top-1 prediction accuracy and the average Euclidean distance between molecular formulas. For cells containing more than 1000 formula pairs, we used a random sample of 1000 for this calculation. As shown in Fig. 3e, accuracy is negatively correlated with average formula distance. This correlation was highly significant for Q-TOF data (Pearson’s p ≪ 0.0001), but not statistically significant for Orbitrap data (p = 0.0588). This finding confirms that FIDDLE’s prediction accuracy is challenged in regions where similar spectral representations map to divergent formulas, indicating that the learned features in these cases are not distinctive enough for discrimination. A detailed analysis is provided in Supplementary Note 3.
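A sketch of this grid analysis is given below, assuming the latent features, formula vectors, and per-spectrum correctness flags are available as NumPy arrays and that there are enough points for t-SNE's default settings; it is an approximation of the analysis, not FIDDLE's exact script.

```python
import numpy as np
from sklearn.manifold import TSNE

def grid_accuracy_vs_distance(latent, formula_vecs, is_correct, n_grid=20, max_pairs=1000, seed=0):
    """Project latent MS/MS features to 2D with t-SNE, bin points into an n_grid x n_grid grid,
    and compute per-cell top-1 accuracy and mean pairwise formula distance."""
    rng = np.random.default_rng(seed)
    xy = TSNE(n_components=2, random_state=seed).fit_transform(latent)
    # Assign each point to a grid cell
    cells = np.floor((xy - xy.min(0)) / (np.ptp(xy, 0) + 1e-9) * n_grid).astype(int)
    cells = np.clip(cells, 0, n_grid - 1)
    stats = []
    for cx in range(n_grid):
        for cy in range(n_grid):
            idx = np.where((cells[:, 0] == cx) & (cells[:, 1] == cy))[0]
            if len(idx) < 2:
                continue
            acc = float(np.mean(is_correct[idx]))
            # Sample at most max_pairs formula pairs within the cell
            pairs = [(a, b) for i, a in enumerate(idx) for b in idx[i + 1:]]
            if len(pairs) > max_pairs:
                sel = rng.choice(len(pairs), size=max_pairs, replace=False)
                pairs = [pairs[s] for s in sel]
            dist = float(np.mean([np.linalg.norm(formula_vecs[a] - formula_vecs[b]) for a, b in pairs]))
            stats.append((acc, dist))
    return np.array(stats)  # columns: per-cell accuracy, mean formula distance
```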
Ablation study of FIDDLE’s components
The MoNA (Q-TOF) dataset is used to illustrate the improvements of FIDDLE achieved through data augmentation, contrastive loss, and post-processing steps, including candidate formula generation and confidence score prediction. We evaluated FIDDLE’s performance across different processing steps by measuring the proportion of top-1 formulas with varying numbers of missed atoms (Fig. 4a) and missed heavy atoms (Fig. 4c), respectively. Because hydrogen (approximately 1.008 Da) is light and difficult to accurately determine, the count of heavy atoms (excluding hydrogen) is also considered. The number of missed atoms is calculated by summing the differences in atom counts across all elements (e.g., carbon, oxygen, nitrogen). A comparison of FIDDLE’s performance in “Pred w/o DA” (prediction without data augmentation) and “Pred” (prediction with data augmentation) shows that data augmentation significantly increases the proportion of correctly predicted formulas (from 0% to 9% for all atoms, and from 18% to 26% for heavy atoms). The contrastive loss further enhances performance, as seen in results labeled “Pred w/o CL.” Comparing the performance of FIDDLE under “Pred” with “Post Top-1” (top-1 accuracy after post-processing) to “Post Top-5” (top-5 accuracy after post-processing) shows that FIDDLE’s candidate formulas cover the correct formulas for more than 84% of the MS/MS spectra. These candidates are subsequently ranked based on their confidence scores predicted by FIDDLE, achieving an AUC (area under the ROC curve) of 0.97, as shown in Fig. 4b. From Fig. 4d, it is clear that correct and incorrect formulas can be effectively distinguished based on their confidence scores. Notably, after post-processing, no formulas with three or fewer missed atoms remain, indicating that while post-processing may occasionally worsen certain incorrect predictions, most incorrect formulas with only a small number of missing atoms are effectively refined.
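The missed-atom metric can be computed as sketched below; the simple formula parser is illustrative only and assumes standard element symbols.

```python
from collections import Counter
import re

def parse_formula(f):
    """Parse a formula string such as 'C6H12O6' into per-element atom counts."""
    counts = Counter()
    for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", f):
        counts[el] += int(n or 1)
    return counts

def count_missed_atoms(pred, true, heavy_only=False):
    """Sum of absolute differences in per-element atom counts between two formulas."""
    p, t = parse_formula(pred), parse_formula(true)
    elements = set(p) | set(t)
    if heavy_only:
        elements -= {"H"}
    return sum(abs(p[el] - t[el]) for el in elements)

print(count_missed_atoms("C6H12O6", "C6H14O5"))                   # 2 (H) + 1 (O) = 3 missed atoms
print(count_missed_atoms("C6H12O6", "C6H14O5", heavy_only=True))  # 1 missed heavy atom
```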
The counts and proportions of top-1 formulas with different numbers of missed atoms (a) and missed heavy atoms (c), respectively, predicted by FIDDLE with different processing steps. Counts are shown on the y-axis, and proportions are annotated above each bar. “Pred w/o DA” indicates the model trained without data augmentation; “Pred w/o CL” indicates the model trained without contrastive learning; “Pred” indicates the model trained with contrastive loss; “Post Top-1” to “Post Top-4” indicate the top-1 to top-4 formulas from post-processing. b The Receiver Operating Characteristic (ROC) curve with Area Under the Curve (AUC) for distinguishing correct and incorrect formula predictions based on confidence scores from top-5 predictions. d The confidence score distributions for correct and incorrect predictions. e, f The performance of the models trained on MS/MS spectra with different levels of added noise (formula-level accuracy is shown in parentheses beneath each noise condition on the x-axis). g The corresponding learning curves. h, i The performance of the model trained with different resolutions. j Cross-instrument validation (missed atom number) on identical compound sets (diagonal: train and test on the same instrument type; off-diagonal: train and test on different instrument types). The box plots in (e, f, h, i) are based on test datasets containing 512 spectra for Q-TOF and 2048 spectra for Orbitrap instruments, respectively, with each data point representing an independent formula prediction. Source data are provided as a Source data file.
We experimentally optimized the noise augmentation parameters of FIDDLE by training formula prediction models with different noise intensities applied to the training spectra (1×, 3×, and 5× multipliers). The MoNA (Q-TOF) dataset is reused in these experiments, and 20,000 spectra from NIST23 (Orbitrap) are randomly selected and split using the strategy described in the section “Training set construction and data augmentation.” The results reveal an instrument-dependent response: for Q-TOF data, moderate noise augmentation (3×) yielded the best performance, whereas both lower (1×) and higher (5×) noise levels led to suboptimal model convergence. In contrast, Orbitrap spectra did not benefit from noise augmentation at any tested level, suggesting that adding Gaussian noise may compromise the inherent spectral consistency of high-resolution instruments (Supplementary Fig. 6). The resolution of the input spectra was also experimentally optimized, as shown in Fig. 4h, i, where a 0.2 Da resolution achieves the best performance on both Q-TOF and Orbitrap instruments. The Orbitrap model shows robustness across different resolutions, with the largest number of missed atoms occurring at a resolution of 1 Da.
To investigate the effect of instrument type, we constructed subsets from MoNA (Q-TOF) and NIST23 (Orbitrap) in which the spectra come from the same 2357 compounds but were acquired on different instruments. The Orbitrap spectra were downsampled to 18,830, the same size as the Q-TOF set. The split strategy is described in the section “Training set construction and data augmentation.” We then conducted same-instrument and cross-instrument validation (Fig. 4j). Interestingly, the Orbitrap model performs better on both Q-TOF and Orbitrap data, likely because it is trained on higher-resolution spectra with more detailed spectral features, enabling it to learn more robust representations that generalize across platforms. While this demonstrates the advantages of high-resolution training data, Q-TOF-specific models remain valuable for laboratories with limited access to Orbitrap instruments and for high-throughput applications where faster acquisition is preferred.
Evaluation using benchmarking metabolite datasets
Next, we compared the performance of FIDDLE against SIRIUS, BUDDY, and an additional tool, MIST-CF, on three benchmarking metabolite datasets: the Critical Assessment of Small Molecule Identification (CASMI) 201629, CASMI 2017, and European Molecular Biology Laboratory - Metabolomics Core Facility (EMBL-MCF) 2.030 datasets. For a fair comparison, we removed compounds in these datasets that overlap with our training set; in total, 181 compounds (231 spectra), 2 compounds (2 spectra), and 107 compounds (184 spectra) were removed from the three datasets, respectively (for details see Table 1). We also trained MIST-CF* using the same training and test data as used for FIDDLE*, excluding unsupported precursor types from both datasets (see details in Supplementary Table 1).
As illustrated in Fig. 5, FIDDLE performs comparably well with SIRIUS and BUDDY on the two CASMI datasets, while demonstrating significantly better performance on the EMBL-MCF 2.0 dataset. Because all methods achieve near-optimal performance at top-3 accuracy, and only FIDDLE (w/ BUDDY and SIRIUS), BUDDY, and MIST-CF* show improvements of less than 2% from top-3 to top-5, top-4 and top-5 accuracies are not shown here. Complete top-5 accuracy results are provided in Supplementary Figs. 7 and 8. The slightly worse performance of FIDDLE on CASMI can be attributed to multiple factors, including different data acquisition protocols, lower MS/MS spectral quality (CASMI 2017 used Q-TOF, while CASMI 2016 and EMBL-MCF 2.0 used Orbitrap), lower compound similarity to the training set (see Supplementary Fig. 9), and spectral complexity, among others. The computation-based methods exhibit greater sensitivity to mass deviation, as their performance decreases on the EMBL-MCF 2.0 dataset with larger mass deviations compared to the CASMI datasets, as shown in Supplementary Fig. 10. Notably, incorporating the NIST23 dataset into the training set enhances FIDDLE’s performance on these external test sets, primarily due to the larger training data volume (see analysis in Supplementary Note 4), resulting in higher or comparable top-3 accuracies compared to SIRIUS, BUDDY, and MIST-CF. As shown in Fig. 5g–i, FIDDLE and BUDDY each perform better on different formulas and exhibit distinct error patterns (e.g., C4S and O5 for BUDDY versus C3 and HOF for FIDDLE); therefore, combining candidate formulas from all methods and ranking them by FIDDLE’s predicted confidence scores further improves performance. The component pairs challenging for both BUDDY and FIDDLE, such as N3 and C2H2O (which appear as N6 and C4H4O2 in BUDDY’s error patterns), warrant further investigation. On average across the three test sets, our method achieves top-5 accuracies of 80.0% in positive and 73.8% in negative ion mode, while BUDDY achieves 69.9% (positive) and 61.4% (negative), SIRIUS achieves 69.3% (positive) and 67.6% (negative), and MIST-CF achieves 66.5% (positive).
Methods marked with an asterisk (*) denote deep learning models trained on all available datasets, whereas models without an asterisk were trained on all available datasets except NIST23. Following the original configuration of MIST-CF, we evaluated it only on data acquired in positive ion mode. a–f The accuracy on different datasets in positive (first row) and negative (second row) ion modes. g The number of correct and incorrect top-1 predictions from BUDDY and FIDDLE*. h, i The extra and missing chemical formula components in predictions from BUDDY and FIDDLE*, respectively. Source data are provided as a Source data file.
Discussion
In this work, we introduce a deep learning approach named FIDDLE to identify chemical formulas from tandem mass spectra. It consists of three steps: predicting formulas from tandem mass spectra, generating candidate formulas, and ranking the candidates based on predicted confidence scores. FIDDLE not only accelerates formula identification compared to state-of-the-art algorithms (BUDDY and SIRIUS), but also outperforms them on both the evaluation set (with 88.3% top-1 accuracy and 93.6% top-5 accuracy) and external benchmarking datasets (with average top-5 accuracies of 80.0% and 73.8% for positive and negative ion modes, respectively, across three datasets). Our noise robustness evaluation on 1000 test spectra shows that FIDDLE maintains over 90% accuracy even at large noise levels, significantly outperforming BUDDY and SIRIUS on both Q-TOF and Orbitrap data. A separate evaluation on chimeric spectra confirmed that FIDDLE’s performance decreases as spectral mixing increases; this finding, while expected, will guide the future design of data augmentation strategies to enhance its training process.
According to our ablation study, both data augmentation and the contrastive loss between manually constructed spectra pairs enhance formula identification from tandem mass spectra. The post-processing steps, which include generating candidate formulas and ranking them by predicted confidence scores, refine the predicted formulas using the SENIOR rules. These steps refine most candidate formulas with no more than three missed heavy atoms and significantly alleviate the challenge of incorrect hydrogen counts due to their small mass. Furthermore, because FIDDLE and other formula identification methods perform well on different compounds and spectra, combining their predicted candidate formulas and ranking them using FIDDLE’s confidence scores further improves the accuracy of the predicted formulas, albeit at the cost of longer running times. This suggests that a promising direction for future work is the development of hybrid methods that unify the predictive power of deep learning with the systematic rigor of search-based approaches.
We note that, even though the contrastive loss is employed to account for the effect of experimental conditions, this effect is not completely eliminated, as FIDDLE still shows variations in accuracy under different conditions. Future work may focus on improving accuracy across different conditions, especially those with less training data. The FIDDLE workflow offers several avenues for extension, such as improving MS/MS representations with pre-trained models from self-supervised learning31, using MS/MS prediction methods for data augmentation32, and adding adduct type prediction for more comprehensive application.
Furthermore, instead of simply accepting all top K candidates, confidence score thresholds could be used to implement FDR control. This approach would introduce a critical standard from proteomics to the field of metabolomics, where such controls remain largely unexplored.
Beyond formula identification, FIDDLE serves as a robust foundation for compound structural elucidation from MS/MS spectra, enabling the inference of covalent bonds between atoms based on their atomic composition. In the future, this methodology could be expanded to characterize diverse molecular structures by leveraging machine learning techniques to extract structural information from MS/MS spectra and integrating this information into molecular structure characterization.
Methods
MS/MS data pre-processing
MS/MS data filtering
The training and testing MS/MS datasets are compiled from several sources, including NIST20, NIST23, MoNA, GNPS, Agilent PCDL, and an internal dataset. To construct the internal dataset, we acquired 1424 compounds from APExBIO33 and measured their tandem mass spectra using a Waters Synapt G2 mass spectrometer at 40 eV with various precursor types, including [M + H]+, [M + Na]+, [2M + H]+, [M + 2H]2+, and [M−H]−. To minimize the interference from impurities in the air, e.g., water vapor (H2O) and carbon dioxide (CO2), we excluded peaks below 50 m/z from the scans. Additionally, three benchmarking sets generated for community-wide evaluation of metabolite identification algorithms are used, including CASMI 2016, CASMI 2017, and EMBL-MCF 2.0.
These datasets undergo a series of filtering steps to ensure data quality: (1) Mass spectra with fewer than five peaks are excluded due to potential unreliability. (2) The m/z range is confined to (0, 1500] to account for the rarity of spectra with m/z values above 1500. (3) Only the molecules composed of frequent atoms (C, H, O, N, F, S, Cl, P, B, I, Br, Na, and K) are retained. (4) Only spectra associated with common precursor types are included, e.g., [M + H]+, [M + Na]+, [2M + H]+, [M + H−H2O]+, [M + H−2H2O]+, [M + H−NH3]+, [M + H + NH3]+, [M + H−CH2O2]+, [M + H−CH4O2]+ for positive mode; [M−H]−, [M−H−CO2]−, and [M−H−H2O]− for negative mode; along with the doubly charged precursor types [M + 2H]2+ and [2M + 2H]2+. (5) The total count of atoms in the molecules is capped at 300 to exclude the molecules not typically classified as “small molecules.” (6) The tolerance for precursor mass discrepancy is set at 10 ppm for Q-TOF and 5 ppm for Orbitrap instruments, ensuring precise mass matching. The statistical information of the datasets is detailed in Table 1. The specific instruments contained in each instrument type are shown in Table 2.
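A minimal sketch of these filters is shown below; the record field names and the abridged precursor-type set are assumptions for illustration, not FIDDLE's actual data schema.

```python
# Abridged set of common precursor types (see the full list above and Table 3)
COMMON_PRECURSOR_TYPES = {"[M+H]+", "[M+Na]+", "[2M+H]+", "[M+2H]2+", "[M+H-H2O]+",
                          "[M-H]-", "[M-H-CO2]-", "[M-H-H2O]-"}
ALLOWED_ATOMS = {"C", "H", "O", "N", "F", "S", "Cl", "P", "B", "I", "Br", "Na", "K"}

def passes_filters(spec, instrument="qtof"):
    """Apply the six quality filters described above to one spectrum record (illustrative fields)."""
    ppm_tol = 10.0 if instrument == "qtof" else 5.0
    if len(spec["peaks"]) < 5:                                        # (1) at least five peaks
        return False
    if any(not (0.0 < mz <= 1500.0) for mz, _ in spec["peaks"]):      # (2) m/z range (0, 1500]
        return False
    if not set(spec["atom_counts"]) <= ALLOWED_ATOMS:                 # (3) frequent atoms only
        return False
    if spec["precursor_type"] not in COMMON_PRECURSOR_TYPES:          # (4) common precursor types
        return False
    if sum(spec["atom_counts"].values()) > 300:                       # (5) at most 300 atoms
        return False
    ppm = abs(spec["precursor_mz"] - spec["theoretical_mz"]) / spec["theoretical_mz"] * 1e6
    return ppm <= ppm_tol                                             # (6) precursor mass tolerance
```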
Precursor m/z simulation
The precursor m/z values from NIST20, NIST23, CASMI 2016, and CASMI 2017 are theoretical values, so they are adjusted via random shifts, following the approach used by Xing et al.16 and the observation by Böcker et al.17 that mass deviations fit a Gaussian distribution with a standard deviation of one-third of the mass tolerance. We sampled the deviations from Gaussian distributions within the set tolerance ranges (5 ppm for Orbitrap and 10 ppm for Q-TOF) to simulate the experimental conditions accurately. These simulated precursor m/z values are utilized throughout both the training and testing phases, enhancing the model’s generalizability for application in real-world formula identification tasks.
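A minimal sketch of this simulation is shown below; clipping the sampled deviation to the tolerance range is our assumption about how "within the set tolerance ranges" is enforced.

```python
import numpy as np

def simulate_precursor_mz(theoretical_mz, instrument="qtof", rng=None):
    """Perturb a theoretical precursor m/z with Gaussian mass error (sigma = tolerance / 3),
    truncated to the instrument's ppm tolerance."""
    rng = rng or np.random.default_rng()
    ppm_tol = 10.0 if instrument == "qtof" else 5.0
    sigma = ppm_tol / 3.0
    ppm_shift = float(np.clip(rng.normal(0.0, sigma), -ppm_tol, ppm_tol))
    return theoretical_mz * (1.0 + ppm_shift * 1e-6)

print(simulate_precursor_mz(180.0634, "orbitrap"))  # a value within +/- 5 ppm of the input
```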
Simplification of precursor types
As depicted in Fig. 2g, the dataset exhibits significant imbalance across different precursor types. To improve formula identification for less common precursor types, we simplify the precursor types by folding uncharged species, such as water (H2O), ammonia (NH3), carbon dioxide (CO2), formic acid (CH2O2), and acetic acid (CH4O2), into the molecular formula (for the detailed precursor types see Table 3). For example, consider a molecular formula C6H12N2O2 with the precursor type [M + H + NH3]+. After simplification, the formula is adjusted to C6H15N3O2 and the precursor type becomes [M + H]+, reflecting the integration of NH3 into the molecular formula. This simplification is recorded, allowing the predicted formulas to be converted back to the original formula with the original precursor type. Through this process, many uncommon precursor types are consolidated into more common ones, which expands the training data of common precursor types and thereby enhances the predictions.
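The sketch below illustrates this folding step for the worked example above; the neutral-species table is abridged and the function name is hypothetical.

```python
from collections import Counter

# Neutral species folded into the formula during precursor-type simplification (abridged; see Table 3)
NEUTRALS = {"H2O": {"H": 2, "O": 1}, "NH3": {"N": 1, "H": 3}, "CO2": {"C": 1, "O": 2},
            "CH2O2": {"C": 1, "H": 2, "O": 2}, "CH4O2": {"C": 1, "H": 4, "O": 2}}

def simplify_precursor(atom_counts, neutral, gained=True):
    """Fold a neutral gain (e.g. [M+H+NH3]+) or loss (e.g. [M+H-H2O]+) into the molecular formula."""
    out = Counter(atom_counts)
    sign = 1 if gained else -1
    for el, n in NEUTRALS[neutral].items():
        out[el] += sign * n
    return dict(out)

# C6H12N2O2 with [M+H+NH3]+  ->  C6H15N3O2 with [M+H]+
print(simplify_precursor({"C": 6, "H": 12, "N": 2, "O": 2}, "NH3", gained=True))
# {'C': 6, 'H': 15, 'N': 3, 'O': 2}
```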
Training set construction and data augmentation
Each dataset, acquired using either Q-TOF or Orbitrap MS/MS instruments, is first split into training and test sets according to molecular canonical SMILES, which ensures that there are no common compounds between the training set and the test set. It is worth noting that canonical SMILES cannot distinguish stereoisomers, so this splitting strategy guarantees that stereoisomers with different configurations are not separated between training and test sets. Two strategies were employed for data splitting: (1) random splitting (see the model in the section “Ablation study of FIDDLE’s components” and the model marked with an asterisk in the section “Evaluation using benchmarking metabolite datasets”); (2) leaving spectra of unique compounds from NIST23 for evaluation (see the model in the section “Performance of formula identification” and the model without an asterisk in the section “Evaluation using benchmarking metabolite datasets”). Then, for CL, we constructed MS/MS spectra pairs from the training sets. The spectra are grouped by canonical SMILES. For each spectrum, we randomly picked another spectrum from the same group to construct a positive pair and a spectrum from a different group to construct a negative pair.
To enhance the robustness of the deep learning model, we generated augmented MS/MS spectra by perturbing the spectra in the training set. Specifically, we added random noise sampled from a Gaussian distribution with a mean of 0 and a standard deviation of 0.1 (\(\mathcal{N}(0, 0.1)\)) to the intensities of an experimental Q-TOF spectrum, generating two augmented spectra for each Q-TOF spectrum and thus tripling the size of the training set from Q-TOF mass spectrometry. On average, the cosine similarity between the augmented spectrum and the corresponding experimental spectrum is 0.936, which is close to the similarity between replicated spectra of the same compound (0.977) in the Q-TOF training set. However, noise addition led to significantly lower similarity for Orbitrap spectra (as shown in Supplementary Fig. 6). Since the training set for Orbitrap is sufficiently large, we did not apply data augmentation when training the deep learning model for Orbitrap.
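A minimal sketch of this augmentation is shown below, assuming binned intensity vectors; the non-negative clipping and renormalization after noise addition are our assumptions.

```python
import numpy as np

def augment_spectrum(intensities, n_aug=2, sigma=0.1, rng=None):
    """Create augmented copies of a binned spectrum by adding N(0, 0.1) noise to the intensities."""
    rng = rng or np.random.default_rng()
    augmented = []
    for _ in range(n_aug):
        noisy = intensities + rng.normal(0.0, sigma, size=intensities.shape)
        noisy = np.clip(noisy, 0.0, None)           # intensities cannot be negative
        augmented.append(noisy / (noisy.max() + 1e-8))
    return augmented

def cosine_similarity(a, b):
    """Cosine similarity used to compare augmented and experimental spectra."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```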
Representation learning for MS/MS spectra
Model architecture
The representation learning for MS/MS spectra is structured as a two-stage process: MS/MS embedding and the elimination of the experimental condition effect. This approach decomposes the complex task of MS/MS representation learning under multiple experimental conditions into more tractable steps, facilitating the network’s ability to extract meaningful features. Additionally, the resultant condition-independent MS/MS features reduce the complexity of the decoder.
Acknowledging the significance of the correlation among fragment ions in MS/MS, such as the neutral loss between two ions, dilated convolutions with large kernels are employed, building upon the previous work for de novo peptide sequencing23. Due to the large effective receptive field (ERF), this method allows the deep neural network to learn meaningful patterns across fragment ions with long mass ranges from the MS/MS data. Each convolutional block consists of two sequential dilated convolutions with ReLU activation, random dropout, a residual connection, and max-pooling. The max-pooling uses a kernel size of 2 and a stride size of 2, which halves the spectra length. Upsampling along the channel dimension is applied in the skip connection when the output and input channel numbers differ. Dilated convolutions with exponentially increasing dilation factors (1, 2, 4, 8, etc.) and large kernels rapidly expand the ERF without significantly increasing the number of parameters22,34. We can calculate the ERF for our model as:
\(\mathrm{ERF}=1+\sum_{i=1}^{6}\left(k_{i}-1\right)d_{i},\)

where the kernel sizes \(k_{i}\) are 45, 43, 41, 39, 37, and 35, and the dilation sizes \(d_{i}\) are 1, 2, 4, 8, 8, and 8, respectively. The total ERF of this model is 1153, indicating that each position in the MS/MS feature is directly influenced by 1153 input bins. It is worth noting that while the ERF defines the range of direct input influence, the model can still capture patterns beyond this range. In addition, we utilize weight normalization to speed up the convergence and reduce the dependency of normalization on batch size, enabling training with small batches and limited GPU memory35.
The features from all the blocks are concatenated together along the channel dimension, resulting in a 1024 × 235 tensor, where 1024 results from the stacked convolutional blocks with channel sizes of 32, 32, 64, 128, 256, and 512, and 235 is the length of the MS/MS spectra after pooling layers. Then, through a global pooling layer along the spectra-length dimension, the tensor is pooled into a 1024-dimensional vector as the embedded MS/MS feature. Precursor types, normalized collision energy, and simulated precursor m/z are included as experimental conditions. These conditions are embedded into a 16-dimensional vector through linear layers. The embedded MS/MS feature and experimental conditions are concatenated and fed into sequential linear layers for the condition-independent MS/MS features. These features are constrained by CL so that features from the same molecule are learned to be the same, while features from different molecules are learned to be different.
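The sketch below illustrates one such convolutional block and an illustrative stack using the kernel, dilation, and channel sizes quoted above; it omits the multi-block feature concatenation, global pooling, and condition embedding, so it is a simplified stand-in rather than FIDDLE's full encoder.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class DilatedConvBlock(nn.Module):
    """One encoder block: two dilated 1-D convolutions (weight-normalized) with ReLU and
    dropout, a residual connection, and max-pooling that halves the spectrum length."""
    def __init__(self, in_ch, out_ch, kernel, dilation, p_drop=0.1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2           # keep the length unchanged before pooling
        self.conv1 = weight_norm(nn.Conv1d(in_ch, out_ch, kernel, padding=pad, dilation=dilation))
        self.conv2 = weight_norm(nn.Conv1d(out_ch, out_ch, kernel, padding=pad, dilation=dilation))
        self.act, self.drop = nn.ReLU(), nn.Dropout(p_drop)
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, x):
        h = self.drop(self.act(self.conv1(x)))
        h = self.drop(self.act(self.conv2(h)))
        return self.pool(h + self.skip(x))            # residual connection, then halve the length

# Illustrative stack following the kernel/dilation/channel sizes quoted above
kernels, dilations = [45, 43, 41, 39, 37, 35], [1, 2, 4, 8, 8, 8]
channels = [32, 32, 64, 128, 256, 512]
blocks, in_ch = [], 1
for k, d, c in zip(kernels, dilations, channels):
    blocks.append(DilatedConvBlock(in_ch, c, k, d))
    in_ch = c
encoder = nn.Sequential(*blocks)
print(encoder(torch.randn(2, 1, 7500)).shape)          # torch.Size([2, 512, 117])
```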
Loss function
Given a pair of binned spectra i and j, the encoder of the prediction model embeds them into condition-independent MS/MS feature vectors zi and zj. These feature vectors are constrained by the contrastive loss:
\(\mathcal{L}_{c}=y\,\lVert z_{i}-z_{j}\rVert_{2}^{2}+(1-y)\,\max\left(0,\;m-\lVert z_{i}-z_{j}\rVert_{2}\right)^{2},\)

where y is the label of the paired spectra: if spectra i and j are from the same molecule, the label is 1; otherwise, the label is 0. m is a hyperparameter (by default, m = 1.0) that defines the lower bound on the distance between spectra of different molecules.
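In code, this margin-based contrastive loss can be written as follows (a generic PyTorch sketch, not FIDDLE's exact implementation):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_i, z_j, y, margin=1.0):
    """Pull same-molecule pairs (y = 1) together and push different-molecule pairs (y = 0)
    at least `margin` apart in the condition-independent feature space."""
    d = F.pairwise_distance(z_i, z_j)
    return (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()
```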
Three auxiliary tasks, including the atom count prediction, the molecular mass prediction, and the prediction of the hydrogen-to-carbon (H/C) ratio, are used for model regularization to improve the model generalizability25. Mean squared error (MSE) losses are applied to these tasks. The loss function for sample i from the pair (i, j) is defined as:
\(\mathcal{L}_{i}=\mathcal{L}_{f,i}+\sum_{t}\alpha_{t}\,\mathcal{L}_{t,i},\qquad \mathcal{L}_{t,i}=\left(y_{t,i}-\hat{y}_{t,i}\right)^{2},\)

where \(\mathcal{L}_{f,i}\) denotes the formula regression loss, \(\mathcal{L}_{t,i}\) denotes the loss for auxiliary task t on sample i; \(y_{t,i}\) and \(\hat{y}_{t,i}\) denote the label and prediction, respectively; and \(\alpha_{t}\) denotes the weight of the loss on task t. In the experiments, we use the weights of 0.01, 1, and 10 for these three auxiliary tasks, respectively. Finally, the loss function for training the model on the spectra pair (i, j) is:

\(\mathcal{L}=\mathcal{L}_{i}+\mathcal{L}_{j}+\mathcal{L}_{c}.\)
Formula refinement algorithm
Algorithm 1 — Candidate formula refinement
Inputs: initial formulas Finit; target mass M
Settings: tolerance ΔM; maximum search depth \(D_{\max}\); number of formulas to return K
Ensure: ΔM > 0
Initialization:
    Frefined ← [ ]                         ▷ list of refined formulas
    Ftrack ← [ ]                           ▷ record of explored formulas
    Fcand ← Finit                          ▷ initial formula candidates
    Dcand ← [0] × len(Finit)               ▷ search depth of each candidate
Refinement iteration:
    while len(Frefined) < K and Fcand is not empty do
        f ← Fcand.pop(); d ← Dcand.pop(); Ftrack.append(f)           ▷ process current candidate
        if f passes the SENIOR rules and mass of f ∈ [M − ΔM, M + ΔM] then
            Frefined.append(f)                                        ▷ add valid formula
        end if
        if d < \(D_{\max}\) then                                      ▷ generate next-level candidates
            generate new candidates from f by editing one heavy atom
            Fcand, Dcand ← update with untracked new candidates, depth d + 1
        end if
        if timeout then                                               ▷ terminate on timeout
            break
        end if
    end while
return Frefined                                                       ▷ output refined formulas
The candidate formula refinement algorithm (Algorithm 1) refines an input chemical formula into a list of formulas that comply with the SENIOR rules26 and match a target precursor mass within a specified tolerance. This algorithm can start with the formula predicted by the deep learning model or a formula identified by tools such as SIRIUS or BUDDY, facilitating the integration of results from multiple methods. The algorithm initializes lists to track refined formulas, previously explored formulas, and current candidates with their associated search depths. By iteratively modifying the counts of heavy atoms without exceeding a set search depth, the algorithm systematically refines each candidate formula. During the exploration, one heavy atom is added or removed, and hydrogen counts are adjusted to align with the precursor mass. The refinement process concludes either when a predefined number of formulas have been explored or when all candidate formulas are exhausted, incorporating a timeout check. The algorithm ultimately produces a list of formulas that align closely with the target precursor mass and adhere to the SENIOR rules.
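A simplified, runnable Python sketch of this breadth-first refinement is given below; the element mass table is abridged, the tolerance is expressed in Da rather than ppm, hydrogen counts are re-fitted to the residual mass, and the validity check is a stand-in for the full SENIOR rules, so this is an illustration rather than FIDDLE's exact routine.

```python
from collections import deque
import time

# Monoisotopic masses for a few elements (abridged, illustrative subset)
MASS = {"C": 12.0, "H": 1.007825, "O": 15.994915, "N": 14.003074, "S": 31.972071}
HEAVY = [e for e in MASS if e != "H"]

def formula_mass(f):
    return sum(MASS[e] * n for e, n in f.items())

def refine(initial, target_mass, tol_da, max_depth=3, k=5, timeout_s=5.0, is_valid=lambda f: True):
    """Breadth-first refinement: edit one heavy atom per step, re-fit hydrogens to the target
    mass, and keep formulas passing a validity check (stand-in for the SENIOR rules)."""
    start = time.time()
    refined, seen = [], set()
    queue = deque([(dict(initial), 0)])
    while queue and len(refined) < k:
        if time.time() - start > timeout_s:
            break
        f, depth = queue.popleft()
        key = tuple(sorted(f.items()))
        if key in seen:
            continue
        seen.add(key)
        if is_valid(f) and abs(formula_mass(f) - target_mass) <= tol_da:
            refined.append(dict(f))
        if depth < max_depth:
            for el in HEAVY:                        # add or remove one heavy atom
                for delta in (+1, -1):
                    g = dict(f)
                    g[el] = g.get(el, 0) + delta
                    if g[el] < 0:
                        continue
                    # re-fit the hydrogen count to the residual mass
                    residual = target_mass - formula_mass({e: n for e, n in g.items() if e != "H"})
                    g["H"] = max(round(residual / MASS["H"]), 0)
                    queue.append((g, depth + 1))
    return refined

# The correct glucose formula is recovered among the returned candidates
print(refine({"C": 6, "H": 12, "O": 6}, target_mass=180.063388, tol_da=0.002))
```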
Prediction of confidence score
The inputs for the confidence score prediction model consist of the concatenated condition-independent MS/MS feature vector and the vector representation of the predicted formula. The model architecture comprises a sequence of five linear layers, each followed by batch normalization, a ReLU activation function, and random dropout for regularization. The dimensions of the five layers are 416, 208, 104, 26, and 13, respectively. A Sigmoid activation function is used in the final layer to produce a confidence score between 0 and 1, with higher values indicating more likely correct formula assignments. Training data for this model is obtained from the inference results of initial formula identification, followed by candidate formula refinement. Specifically, each of the top 5 refined candidate formulas is labeled, with 1 indicating a correct candidate and 0 denoting an incorrect one. Since this model has significantly fewer trainable parameters than the formula prediction model, the training set is randomly sampled down to 10,000 instances if it exceeds this limit to prevent overfitting.
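A PyTorch sketch of this scoring head is shown below; the final linear-plus-sigmoid output layer and the dropout rate are assumptions, since only the five hidden-layer dimensions are specified above.

```python
import torch
import torch.nn as nn

class ConfidenceScorer(nn.Module):
    """Confidence-score head: five linear layers (416, 208, 104, 26, 13) with batch
    normalization, ReLU, and dropout, followed by a sigmoid output in [0, 1]."""
    def __init__(self, in_dim, p_drop=0.1):
        super().__init__()
        dims = [in_dim, 416, 208, 104, 26, 13]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU(), nn.Dropout(p_drop)]
        layers += [nn.Linear(13, 1), nn.Sigmoid()]    # assumed scalar output layer
        self.net = nn.Sequential(*layers)

    def forward(self, msms_feature, formula_vector):
        # Input: condition-independent MS/MS feature concatenated with the candidate formula vector
        return self.net(torch.cat([msms_feature, formula_vector], dim=-1)).squeeze(-1)
```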
Implementation and computing environment
Both deep neural networks (formula prediction and candidate scoring) were implemented in PyTorch version 1.13.1, comprising 21,027,933 (about 21M) and 20,055,938 (about 20.1M) parameters, respectively. A list of specific parameter numbers for each layer/block is described in Supplementary Tables 2 and 3.
The training of models in FIDDLE was conducted using a batch size of 512 over 200 epochs, with an initial learning rate of 0.001. The models were optimized using AdamW and a ReduceLROnPlateau learning rate schedule that reduces the learning rate when a metric has stopped improving for 5 epochs.
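The corresponding optimizer and scheduler setup is sketched below with a placeholder model and a dummy validation metric; the monitored metric and training loop details are illustrative.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the FIDDLE formula-prediction model
model = nn.Linear(7500, 13)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# Reduce the learning rate when the monitored metric stops improving for 5 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=5)

for epoch in range(200):
    # ... run one training epoch with batch size 512 here ...
    val_loss = 1.0 / (epoch + 1)          # dummy validation loss for illustration
    scheduler.step(val_loss)
```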
Training was performed on two NVIDIA RTX A6000 GPUs, each equipped with 48 GB of memory, running Ubuntu 20.04.1. For inference only, the minimum GPU memory requirement is 500 MB when using a batch size of 1.
To facilitate broader adoption within the research community, we release our implementation as an open-source Python package (msfiddle) with optimized routines for both GPU and CPU-only environments, enabling reproducible access to pre-trained models and inference pipelines with standardized APIs compatible with common scientific computing workflows.
Running SIRIUS and BUDDY
We installed SIRIUS v6.0.1, last updated on July 21st, 2024, for the Linux (64-bit) platform, accessible at https://github.com/boecker-lab/sirius. Additionally, the BUDDY v0.3.6 PyPI package, updated on January 29th, 2024, was obtained from https://github.com/Philipbear/msbuddy. All experiments were performed on a workstation with a 20.04.1-Ubuntu 64-bit operating system equipped with 16 Intel(R) Core(TM) i7-9800X CPUs. For comprehensive configuration details, please refer to Supplementary Table 4.
Data availability
The NIST20 and NIST23 data used in this study are available under restricted access as paid standard reference databases; access can be obtained by purchasing the NIST Mass Spectral Library. The GNPS data used in this study are available at the GNPS Mass Spectral Libraries. The MoNA, CASMI 2016, CASMI 2017, and EMBL-MCF 2.0 data used in this study are available at MassBank of North America. The Agilent PCDL data are available from Agilent Technologies (https://www.agilent.com) upon request for academic use. The Waters Q-TOF data are available under restricted access to comply with licensing terms that prohibit commercial use. Access is limited to academic and non-commercial research purposes and can be requested from the corresponding author. Requests will be reviewed for eligibility, and approved users will receive access within 2 weeks. The data will remain available for the duration of the approved research project. Source data are provided with this paper.
Code availability
The source code for all experiments is publicly available on GitHub at https://github.com/JosieHong/FIDDLE under the Apache 2.0 License36. Pre-trained models for Q-TOF and Orbitrap MS/MS data can be downloaded from the GitHub releases: https://github.com/JosieHong/FIDDLE/releases. A command-line tool, msfiddle, is available via PyPI (https://pypi.org/project/msfiddle), providing simple commands to download, manage, and use the models.
References
Wang, M. et al. Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat. Biotechnol. 34, 828–837 (2016).
Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).
Blaženović, I. et al. Comprehensive comparison of in silico MS/MS fragmentation tools of the CASMI contest: database boosting is needed to achieve 93% accuracy. J. Cheminform. 9, 1–12 (2017).
da Silva, R. R., Dorrestein, P. C. & Quinn, R. A. Illuminating the dark matter in metabolomics. Proc. Natl. Acad. Sci. USA 112, 12549–12550 (2015).
Stein, S. Mass spectral reference libraries: an ever-expanding resource for chemical identification. Anal. Chem. 84, 7274–7282 (2012).
Moco, S., Vervoort, J., Bino, R. J., De Vos, R. C. & Bino, R. Metabolomics technologies and metabolite identification. TrAC Trends Anal. Chem. 26, 855–866 (2007).
Dettmer, K., Aronov, P. A. & Hammock, B. D. Mass spectrometry-based metabolomics. Mass Spectrom. Rev. 26, 51–78 (2007).
Alseekh, S. et al. Mass spectrometry-based metabolomics: a guide for annotation, quantification and best reporting practices. Nat. Methods 18, 747–756 (2021).
Hernandez, F., Sancho, J. V., Ibáñez, M. & Grimalt, S. Investigation of pesticide metabolites in food and water by LC-TOF-MS. TrAC Trends Anal. Chem. 27, 862–872 (2008).
Dueñas, M. E. et al. Advances in high-throughput mass spectrometry in drug discovery. EMBO Mol. Med. 15, e14850 (2023).
Liu, C. & Zhang, H. High-throughput mass spectrometry in drug discovery. SLAS Technol. 100292 (2025).
Wang, F. et al. CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification. Anal. Chem. 93, 11692–11700 (2021).
Stravs, M. A., Dührkop, K., Böcker, S. & Zamboni, N. MSNovelist: de novo structure generation from mass spectra. Nat. Methods 19, 865–870 (2022).
Pluskal, T., Uehara, T. & Yanagida, M. Highly accurate chemical formula prediction tool utilizing high-resolution mass spectra, MS/MS fragmentation, heuristic rules, and isotope pattern matching. Anal. Chem. 84, 4396–4403 (2012).
Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).
Xing, S., Shen, S., Xu, B., Li, X. & Huan, T. BUDDY: molecular formula discovery via bottom-up MS/MS interrogation. Nat. Methods 20, 881–890 (2023).
Böcker, S., Letzel, M. C., Lipták, Z. & Pervukhin, A. SIRIUS: decomposing isotope patterns for metabolite identification. Bioinformatics 25, 218–224 (2009).
Rasche, F., Svatoš, A., Maddula, R. K., Böttcher, C. & Böcker, S. Computing fragmentation trees from tandem mass spectrometry data. Anal. Chem. 83, 1243–1251 (2011).
NIST Mass Spectrometry Data Center. National Institute of Standards and Technology (NIST) Spectral Library (2023 version; NIST23). Available at: https://www.nist.gov/programs-projects/nist23-updates-nist-tandem-and-electron-ionization-spectral-libraries (2023).
Goldman, S., Xin, J., Provenzano, J. & Coley, C. W. MIST-CF: chemical formula inference from tandem mass spectra. J. Chem. Inf. Model. 64, 2421–2431 (2024).
Aron, A. T. et al. Reproducible molecular networking of untargeted mass spectrometry data using GNPS. Nat. Protoc. 15, 1954–1991 (2020).
Lea, C., Vidal, R., Reiter, A. & Hager, G. D. Temporal convolutional networks: a unified approach to action segmentation. Lect. Notes Comput. Sci. 9915, 47–54 (2016).
Liu, K., Ye, Y., Li, S. & Tang, H. Accurate de novo peptide sequencing using fully convolutional neural networks. Nat. Commun. 14, 7974 (2023).
Kwok, T.-Y. & Yeung, D.-Y. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Trans. Neural Netw. 8, 630–645 (1997).
Zhang, Y. & Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 34, 5586–5609 (2021).
Kind, T. & Fiehn, O. Seven golden rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinform. 8, 1–20 (2007).
Stancliffe, E., Schwaiger-Haber, M., Sindelar, M. & Patti, G. J. DecoID improves identification rates in metabolomics through database-assisted MS/MS deconvolution. Nat. Methods 18, 779–787 (2021).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30, 4765–4774 (2017).
Schymanski, E. L. et al. Critical assessment of small molecule identification 2016: automated methods. J. Cheminform. 9, 1–21 (2017).
Dekina, S., Alexandrov, T. & Drotleff, B. EMBL-MCF 2.0: an LC-MS/MS method and corresponding library for high-confidence targeted and untargeted metabolomics using low-adsorption HILIC chromatography. Metabolomics 20, 114 (2024).
Guo, H., Xue, K., Sun, H., Jiang, W. & Pu, S. Contrastive learning-based embedder for the representation of tandem mass spectra. Anal. Chem. 95, 7888–7896 (2023).
Hong, Y. et al. 3DMolMS: prediction of tandem mass spectra from 3D molecular conformations. Bioinformatics 39, btad354 (2023).
APExBIO. APExBIO: achieve perfection, explore the unknown. Available at: https://www.apexbt.com/ (2023).
Ding, X., Zhang, X., Han, J. & Ding, G. Scaling up your kernels to 31 × 31: revisiting large kernel design in CNNs. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 11963–11975 (IEEE, 2022).
Salimans, T. & Kingma, D. P. Weight normalization: a simple reparameterization to accelerate training of deep neural networks. Adv. Neural Inf. Process. Syst. 29, 901–909 (2016).
Hong, Y. JosieHong/FIDDLE. Available at: https://doi.org/10.5281/zenodo.17172712 (2025).
Acknowledgements
The authors acknowledge the Center for Bioanalytical Metrology (CBM), an NSF Industry-University Cooperative Research Center, for providing funding under grant NSF IIP-1916645 (H.T.). This work was also partially supported by National Science Foundation grant DBI-2011271 (H.T.). This article includes material that first appeared in the PhD thesis of Yuhui Hong, “Deep Learning-Enhanced Approaches in Mass Spectrometric Analysis and Small Molecule Identification” (Indiana University Bloomington, 2025), available via ProQuest.
Author information
Authors and Affiliations
Contributions
H.T. conceived the project and guided the study design. H.T. and Y.H. developed the computational method. Y.H. implemented the software and performed the experiments. S.L., Y.Y., and Y.H. collected and preprocessed the datasets. Y.H. led the manuscript writing, with input from all authors and supervision by H.T. All authors have read and approved the final version of the manuscript for submission.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Dylan Ross, Yan Zhou, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Hong, Y., Li, S., Ye, Y. et al. FIDDLE: a deep learning method for chemical formulas prediction from tandem mass spectra. Nat Commun 16, 11102 (2025). https://doi.org/10.1038/s41467-025-66060-9