Abstract
Molecular identification through tandem mass spectrometry is fundamental in small molecule analysis, with formula identification serving as an initial step in the process. Current computational methods often struggle with accuracy, speed, and scalability for larger molecules, limiting high-throughput workflows. We present FIDDLE (Formula IDentification by Deep LEarning), a deep learning-based method trained on over 38,000 molecules and 1 million MS/MS spectra from various Quadrupole Time-of-Flight (Q-TOF) and Orbitrap instruments. FIDDLE accelerates formula identification by more than 10-fold and achieves top-1 and top-5 accuracies of 88.3% and 93.6%, respectively, outperforming state-of-the-art methods based on top-down (SIRIUS) and bottom-up (BUDDY) approaches by over 10%. On external metabolomics datasets, FIDDLE achieves top-5 accuracies of 75.1% (positive ion mode) and 66.2% (negative ion mode), with further improvements to 80.0% and 73.8% when combined with SIRIUS and BUDDY.
Introduction
Tandem mass spectrometry (MS/MS) is an essential analytical tool for identifying small molecules and elucidating their structural characteristics. The standard approach for identifying unknown analytes from MS/MS data involves searching spectra against reference spectral libraries1,2,3. However, due to limitations in time, labor, and resources, a significant portion of chemical signatures remains uncharacterized, often termed “dark matter”4. These unidentified small molecules may possess unique bioactivities and play crucial roles in understanding biological mechanisms. Unfortunately, such molecules may lack corresponding reference spectra in spectral libraries or may not have been previously reported in the literature (i.e., the “unknown unknowns”5). As a result, the identification of unknown compounds has become a challenging yet vital research area, spanning metabolomics, environmental analysis, natural product and drug discovery, and more6,7,8,9,10,11. Specifically, the prediction of molecular formulas serves as the initial and most fundamental step, providing critical constraints that facilitate structural elucidation and the annotation of fragments for these unknown compounds12,13.
While MZmine initially incorporated isotope pattern matching, MS/MS fragmentation analysis, and heuristic rules as a toolbox14, computational methodologies for chemical formula identification from MS/MS data have evolved into top-down and bottom-up approaches, exemplified by SIRIUS15 and BUDDY16, respectively. SIRIUS begins by generating candidate formulas through the analysis of isotope patterns17, then computes a fragmentation tree for each candidate18 and evaluates them by comparing theoretical fragmentation patterns against experimental MS/MS data. The evaluation metric considers multiple factors, including fragment masses, intensities, and isotope patterns, to estimate the likelihood of each candidate producing the observed spectrum. However, SIRIUS’s reliance on neutral loss fragments limits its performance on multiply-charged spectra, which contain charged loss fragments. Such spectra represent 7.7% and 1.8% of Q-TOF and Orbitrap spectra in the National Institute of Standards and Technology (NIST) Spectral Library (2023 version; NIST23)19, respectively (Supplementary Fig. 1). Additionally, its efficiency is hindered by the computational demand of generating fragmentation trees for all potential formulas inferred from isotope patterns. MIST-CF shows that fragmentation trees can be replaced with a simple peak subformula assignment routine, achieving equally accurate and fast predictions20. However, it still relies on SIRIUS’s algorithmic decomposition of exact masses into formula candidates, limiting efficiency and accuracy. In contrast, BUDDY significantly reduces the number of candidate formulas by focusing on those explainable by MS/MS data, using a reference library of known formulas. It ranks candidates matching the precursor mass and estimates a false discovery rate (FDR) to provide a confidence score. However, BUDDY’s scope is restricted by the coverage of its reference library, potentially missing entirely uncharacterized and previously unreported formulas. Our analysis, illustrated in Supplementary Fig. 2, revealed that 45 unique formulas represented in MS/MS spectra in NIST23, MassBank of North America (MoNA)2, Global Natural Product Social Molecular Networking (GNPS) Spectral Library21, and the Agilent Personal Compound Database and Library (PCDL) fall outside the MS/MS-explainable space of BUDDY, rendering them unanalyzable by the method.
Both computational methods underutilize the full scope of information present in MS/MS spectrum data. SIRIUS 6, for instance, considers a limited number of peaks (up to 60), while BUDDY relies on manually extracted features, such as double-bond equivalent values of annotated fragments. As a result, increasing precursor mass-to-charge ratio (m/z) leads to higher computational complexity and significantly decreased accuracy due to the exponentially growing number of candidate molecular formulas, which expands the search space and increases ambiguity. For instance, at m/z 800, the number of candidates reaches tens for BUDDY and tens of thousands for SIRIUS16. This limitation stems from the fact that higher precursor m/z values often correspond to larger, more complex molecular structures, which then requires the evaluation of a broader range of potential formulas. Furthermore, the peaks excluded from analysis, along with the relationships between considered and unconsidered peaks, may hold crucial structural information that is not exploited.
In this paper, we address these limitations by introducing a deep learning approach to chemical formula identification. We present FIDDLE (Formula IDentification from tandem mass spectra by Deep LEarning), which employs dilated convolutions with large kernels22,23 to extract high-dimensional representations of MS/MS data using extremely large receptive fields. To predict candidate formulas, the model is trained using a composite objective that includes a primary formula regression loss, a contrastive loss, and auxiliary task losses to enhance performance. These initial predictions are refined using a breadth-first search algorithm that adjusts atomic compositions to align the candidate formulas with the precursor mass. Additionally, we train a secondary deep learning model to estimate confidence scores for candidate formulas and rank them based on MS/MS features learned from the formula identification model. Compared to traditional computational methods, our approach dramatically reduces the candidate formula space for a given MS/MS spectrum to a small number (at most five formulas by default). This reduction is facilitated by the deep learning model, which benefits from accelerated GPU-based tensor computations. The narrowed candidate space simplifies confidence score estimation, and the MS/MS features learned by the deep learning model can be reused for this purpose.
Results
Deep learning method for formula identification
Predicting target formulas directly from MS/MS spectra under varying experimental conditions presents significant challenges. To address this, we break down the task into three steps as illustrated in Fig. 1a: (1) predicting formulas from MS/MS spectra using a deep learning model; (2) generating candidate formulas using a breadth-first formula refinement algorithm; and (3) calculating confidence scores for the candidate formulas using an additional deep learning model. The formula refinement step relaxes the requirement for exceptionally high initial prediction accuracy by allowing adjustments to candidate formulas with minimal atom modifications. Moreover, since assessing the correctness of a limited set of predictions is easier than identifying the top correct outcomes from an infinitely large pool24, the refinement step also improves the confidence score estimation.
a The FIDDLE workflow comprises three main steps: predicting formulas from MS/MS spectra using a deep learning model; generating candidate formulas through a breadth-first formula refinement algorithm; and predicting confidence scores for the candidate formulas. b The deep learning model architecture includes an MS/MS spectrum encoder (E) and decoders (Df, Da, Dm, and Dhc) that output the predicted formula along with auxiliary variables, such as atom count, molecular mass, and H/C ratio. A contrastive loss (\(\mathcal{L}_{c}\)) is computed on pairs of condition-independent MS/MS features (zi and zj) to facilitate model convergence.
To input an MS/MS spectrum into a deep learning model, we first bin it into a 1-D vector with a fixed mass-to-charge ratio (m/z) resolution. For example, an MS/MS spectrum with a maximum m/z of 1500 Da is binned into a vector of length 7500, with each bin representing a resolution of 0.2 Da. Molecular formulas are directly converted into formula vectors, where each element type is represented by its atom count as the corresponding value in the vector. For instance, the molecular formula C6H12O6 can be represented as the vector \(\left[6,12,6,0,...\right]\), where the first three integers correspond to the number of carbon, hydrogen, and oxygen atoms, respectively, followed by atom counts for other elements in a predefined order.
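As a concrete illustration of this encoding, the short Python sketch below bins a peak list at 0.2 Da resolution and converts an atom-count dictionary into a fixed-order formula vector; the element ordering, normalization, and function names are illustrative assumptions rather than FIDDLE's exact implementation.

```python
import numpy as np

# Illustrative element order; FIDDLE's actual order and vector length are defined by its own config.
ELEMENTS = ["C", "H", "O", "N", "F", "S", "Cl", "P", "B", "I", "Br", "Na", "K"]

def bin_spectrum(mz, intensity, max_mz=1500.0, resolution=0.2):
    """Bin a peak list into a fixed-length intensity vector (7500 bins at 0.2 Da)."""
    n_bins = int(max_mz / resolution)              # 1500 / 0.2 = 7500
    vec = np.zeros(n_bins, dtype=np.float32)
    for m, i in zip(mz, intensity):
        if 0 < m <= max_mz:
            idx = min(int(m / resolution), n_bins - 1)
            vec[idx] += i                          # accumulate intensities falling in the same bin
    return vec / (vec.max() + 1e-8)                # normalize to [0, 1]

def formula_to_vector(atom_counts):
    """Map an atom-count dict (e.g. {'C': 6, 'H': 12, 'O': 6}) to a fixed-order vector."""
    return np.array([atom_counts.get(e, 0) for e in ELEMENTS], dtype=np.float32)

# Example: glucose, C6H12O6 -> [6, 12, 6, 0, ...]
print(formula_to_vector({"C": 6, "H": 12, "O": 6}))
```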
To encode the MS/MS spectra, we use stacked dilated convolutions with large kernels to capture relationships between peaks across broad mass ranges22,23. This technique expands the model’s receptive field, enabling it to analyze local and global spectral patterns at the same time. It serves as a powerful and computationally efficient alternative to fully connected layers. The learned MS/MS features are then concatenated with experimental conditions, such as collision energies, precursor types, and experimental precursor m/z, and fed into two linear layers to produce condition-independent MS/MS features, denoted as zi and zj in Fig. 1b. During training, a contrastive loss is applied to ensure that these condition-independent MS/MS features are close for spectra from the same molecule and far apart for spectra from different molecules. We use sequential linear layers as decoders for both formula identification and auxiliary tasks, including atom number prediction, molecular mass prediction, and H/C ratio prediction. These auxiliary tasks, incorporated through multitask learning, enhance the generalizability of the deep learning model and serve as a form of regularization25. The detailed architecture and parameters of the model are specified in the section “Representation learning for MS/MS spectra.”
Recognizing that deep learning models cannot guarantee the validity or perfect accuracy of predicted formulas, we developed a breadth-first formula refinement algorithm. This algorithm aims to make minimal adjustments to atom counts to ensure that the formulas comply with SENIOR rules26 and align with the target mass within a specified mass tolerance—specifically, 10 parts per million (ppm) for MS/MS data from Q-TOF instruments and 5 ppm for data from Orbitrap instruments. This refinement process produces a set of k candidate formulas for each MS/MS spectrum, where k is set to 5 by default. Note that the algorithm is flexible enough to integrate results from various formula identification methods, such as SIRIUS and BUDDY, where the predicted formula can be expanded into a longer list of candidate formulas. An auxiliary model is then developed to estimate confidence scores using the MS/MS features learned during formula identification and each candidate formula (see details in the section “Prediction of confidence score”). Finally, the candidate formulas are ranked based on their estimated confidence scores.
Performance of formula identification
MS/MS spectra were collected from NIST23, NIST20, Agilent PCDL, MoNA, and GNPS, as well as from an internal dataset (see the section “MS/MS data filtering” for details) acquired using a Waters Q-TOF mass spectrometer. The MS/MS spectra were preprocessed according to the methods described in the section “MS/MS data pre-processing,” including filtering based on peak count, molecular mass, atom type and number, and mass difference in ppm. Additional pre-processing steps included simulating precursor m/z values for the NIST dataset, simplifying precursor types, and constructing the training set for contrastive learning (CL). In total, 131,224 MS/MS spectra from 15,399 molecules acquired with Q-TOF mass spectrometers and 965,656 MS/MS spectra from 28,383 molecules acquired with Orbitrap mass spectrometers were used for training and evaluation. A summary of the number of spectra and compounds in each dataset is provided in Table 1. We retained MS/MS spectra from compounds (represented as canonical SMILES without stereochemical information) found exclusively in NIST23 and not in any other libraries (including NIST20) as the test set, ensuring these spectra were not used during the training of any of the models being compared. Because these spectra were published after the release of BUDDY and SIRIUS, excluding them from FIDDLE’s training maintains a fair comparison. Consistent with previous studies16, we used the top K accuracy to evaluate the performance of formula identification algorithms. This metric is calculated as the proportion of spectra for which the correct formula is included among the top K (by default K = 5) ranked formulas predicted by a given algorithm. The settings for comparison methods are specified in the section “Running SIRIUS and BUDDY.”
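For reference, the top K accuracy metric can be computed in a few lines of Python; the candidate lists in the example are hypothetical.

```python
def top_k_accuracy(ranked_predictions, true_formulas, k=5):
    """Fraction of spectra whose correct formula appears among the top-k ranked candidates."""
    hits = sum(true in preds[:k] for preds, true in zip(ranked_predictions, true_formulas))
    return hits / len(true_formulas)

# Hypothetical ranked candidate lists for two spectra
ranked = [["C6H12O6", "C5H8O7"], ["C7H8N4O2", "C6H4N6O2"]]
truth = ["C6H12O6", "C6H4N6O2"]
print(top_k_accuracy(ranked, truth, k=1))  # 0.5
print(top_k_accuracy(ranked, truth, k=5))  # 1.0
```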
As shown in Fig. 2a–f, FIDDLE outperformed the other state-of-the-art formula identification algorithms (BUDDY and SIRIUS) and reduced the cumulative runtime by approximately 10-fold compared to BUDDY and 100-fold compared to SIRIUS. In addition to formula identification, performance metrics for auxiliary tasks, including mass, atom number, and H/C ratio predictions, are presented in Supplementary Fig. 3. Notably, the top-1 accuracy of BUDDY and SIRIUS declines significantly for larger compounds (molecular weight > 800 Da; Fig. 2b, e), with BUDDY’s accuracy decreasing to 0.427 for Q-TOF and 0.684 for Orbitrap, and SIRIUS’s accuracy decreasing to 0.187 for Q-TOF and producing no output within the timeout limit for Orbitrap, compared to 0.844, 0.702, 0.583, and 0.669, respectively, on smaller compounds (Fig. 2a, d). In contrast, FIDDLE maintained robust performance on large compounds, achieving top-1 accuracies of 0.642 and 0.813 for Q-TOF and Orbitrap spectra, respectively. For small compounds, FIDDLE also slightly outperformed BUDDY and SIRIUS, achieving top-1 accuracies of 0.906 and 0.881 for Q-TOF and Orbitrap, respectively. In top-5 formula prediction, FIDDLE also consistently outperformed BUDDY and SIRIUS, especially on challenging large compounds. For Q-TOF data, FIDDLE’s accuracy on large compounds (0.754) was substantially higher than that of BUDDY (0.489) and SIRIUS (0.178). This trend was even more pronounced for Orbitrap data, where FIDDLE achieved an accuracy of 0.971 for large compounds, while BUDDY scored 0.684 and SIRIUS failed to return a result in time. Moreover, incorporating BUDDY’s results into FIDDLE’s candidate formula pool further improved prediction accuracy, indicating that BUDDY and FIDDLE perform well on different sets of compounds and can be complementary. While combining these methods may require additional running time, it can yield superior overall results. The best accuracy and running time of FIDDLE for different settings of K on Q-TOF test spectra are shown in Fig. 2h. Higher K values improve accuracy at the cost of increased computational time, enabling users to balance performance and efficiency based on their requirements.
a, b Top K accuracy for formula identification on Quadrupole Time-of-Flight (Q-TOF) tandem mass spectrometry (MS/MS) spectra of molecules ≤800 Da and >800 Da, respectively. N represents the number of spectra. c–f Same analysis for Orbitrap spectra. SIRIUS produced no output for large Orbitrap molecules due to runtime limits. FIDDLE supports all 14 precursor types (charges +1, −1, +2), while BUDDY and SIRIUS are evaluated only on supported types, excluding timeouts. g Top K accuracy on the MS/MS spectra with different levels of added noise. h FIDDLE's performance across different K settings, where the left y-axis (blue lines) represents the best accuracy of top K candidates and the right y-axis (gray bars) represents accumulated running times. Source data are provided as a Source data file.
We assessed FIDDLE’s generalizability by comparing its performance on two distinct training and test set splits: one divided randomly by canonical SMILES and a more stringent split divided by unique chemical formula. The formula-based split is significantly more challenging, as it requires the model to predict formulas it has never encountered during training. Despite this, FIDDLE’s accuracy decreased by only about 10% on this task compared to the standard SMILES-based split (p < 0.001, one-sample t-test), demonstrating robust performance. Detailed results are shown in Supplementary Fig. 4.
To assess robustness to spectral noise, we evaluated FIDDLE, BUDDY, and SIRIUS on 1000 randomly sampled MS/MS spectra from the test set. We systematically added Gaussian noise at five increasing levels, with detailed methods described in the section “Training set construction and data augmentation.” FIDDLE demonstrated superior noise resilience across both Q-TOF and Orbitrap instruments, maintaining over 90% accuracy even under large noise conditions. On Q-TOF spectra, FIDDLE achieved 95.0% top-5 accuracy at the highest noise level, compared to BUDDY’s 93.5% and SIRIUS’s 52.5%. On Orbitrap spectra, FIDDLE maintained 93.3% accuracy, while BUDDY decreased to 75.4% and SIRIUS to 63.1%. Overall, FIDDLE exhibited minimal performance degradation as noise increased, whereas BUDDY showed a moderate decline and SIRIUS performed poorly but consistently, underscoring FIDDLE’s practical advantage in real-world scenarios with varying spectral quality.
Additional evaluation on chimeric spectra—synthetic mixtures of authentic MS/MS data27—demonstrates the expected trade-off between FIDDLE’s predictive performance and robustness to signal mixtures (Supplementary Note 1). These results inform potential improvements through data augmentation strategies.
It is worth noting that interpretability can be a limitation of deep learning models compared to computation-based methods; therefore, we provide a potential interpretation method for FIDDLE in Supplementary Note 4 using SHAP28. However, current interpretation approaches face constraints from binned spectral resolution and limitations in providing specific atom counts for direct fragment annotation. Future extensions to molecular structure prediction could offer more intuitive explanations by directly linking spectral features to structural fragments.
Impact of data characteristics on model performance
We conducted a comprehensive analysis to evaluate how model performance is influenced by various data characteristics. The factors we investigated include experimental metadata (such as collision energy and precursor type), molecular properties (like size and chemical class), and the internal MS/MS representations learned by FIDDLE.
Metadata
We integrated precursor type and collision energy as metadata in the deep learning model and evaluated performance under different conditions, as shown in Fig. 3a, b. Performance correlated positively with training dataset size, as the following examples illustrate. FIDDLE demonstrates optimal performance when analyzing spectra of the dimers [2M + H]+ and [2M + 2H]2+, perhaps due to their more predictable fragmentation patterns and simpler spectral signatures (Supplementary Fig. 5). Furthermore, higher collision energies produce more peaks, enhancing pattern recognition and boosting performance (the circle sizes denote the average peak numbers). For the Q-TOF instrument, only the subset of spectra with collision energies in the range [40, ∞) had sufficient training data for FIDDLE to achieve optimal results.
a, b Top-5 accuracy by precursor type and collision energy for spectra from Quadrupole Time-of-Flight (Q-TOF) and Orbitrap instruments. The circle size in (b) indicates the average peak number for each collision energy group. c, d Top-5 accuracy by molecular polarity (LogP) and mass range, with a shared legend for both Q-TOF and Orbitrap. e Accuracy vs. formula distance for similar MS/MS spectra. Pearson correlation analysis was performed using two-sided tests, with correlation coefficients (r) and exact p value (p) displayed in the legend. Linear regression trend lines (least squares fit) are shown as dashed lines. f Top-5 accuracy across 17 chemical superclasses (indices in Supplementary Note 2). Source data are provided as a Source data file.
Molecular polarity
LogP values were computed from SMILES strings using RDKit's Crippen.MolLogP method. Compounds were classified into three polarity groups: Polar (LogP < 0), Moderate (0 ≤ LogP < 3), and Nonpolar (LogP ≥ 3). Identification accuracy exhibited a strong inverse correlation with LogP values (Fig. 3c). Polar compounds achieved the highest Top-5 accuracy (95.0%), followed by moderate (94.7%) and nonpolar compounds (88.8%), representing a 6.2% performance gap that was consistent across both Q-TOF and Orbitrap instruments. This trend is likely due to the superior ionization efficiency and more consistent fragmentation that polar compounds exhibit under electrospray ionization mass spectrometry.
Molecular mass
Molecular weights were calculated from SMILES using RDKit's MolWt function and classified into five groups: < 200, 200–400, 400–600, 600–800, and ≥800 Da (Fig. 3d). Accuracy exhibited a complex relationship with molecular mass, with the smallest compounds (<200 Da) achieving the highest performance (97.3% Top-5 accuracy for Q-TOF, 95.0% for Orbitrap). While the model performance on Q-TOF data declined substantially for larger molecules (≥800 Da: 75.4%), on Orbitrap data, the model maintained robust accuracy across all mass ranges, performing particularly well for the largest compounds (≥800 Da: 97.1%). This performance divergence suggests that the Orbitrap’s superior mass resolution is key to identifying large molecules, whose richer fragmentation patterns provide discriminative features that deep learning methods can effectively exploit.
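The polarity and mass groupings used in Fig. 3c, d can be reproduced with RDKit as sketched below; the helper function and the glucose example are illustrative and not part of the FIDDLE code base.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

def polarity_and_mass_group(smiles):
    """Classify a compound by Crippen LogP and molecular weight, following the groupings above."""
    mol = Chem.MolFromSmiles(smiles)
    logp = Crippen.MolLogP(mol)
    mw = Descriptors.MolWt(mol)
    if logp < 0:
        polarity = "Polar"
    elif logp < 3:
        polarity = "Moderate"
    else:
        polarity = "Nonpolar"
    bounds = [200, 400, 600, 800]
    labels = ["<200", "200-400", "400-600", "600-800", ">=800"]
    mass_group = labels[sum(mw >= b for b in bounds)]
    return polarity, mass_group

print(polarity_and_mass_group("OCC1OC(O)C(O)C(O)C1O"))  # glucose: ('Polar', '<200')
```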
Molecular types
Test compounds were classified using ClassyFire Batch into 17 superclasses (see Supplementary Note 2). Fig. 3f reveals substantial variation in identification accuracy across chemical superclasses. While most superclasses achieved high performance (>90% Top-5 accuracy), four categories showed notably lower performance: Organosulfur compounds (72.1% on Q-TOF data and 85.6% on Orbitrap data), Organic Polymers (50.0% on Q-TOF data and 68.6% on Orbitrap data), Organohalogen compounds (0% on Q-TOF data and 66.4% on Orbitrap data), and Unknown compounds (33.3% on Q-TOF data and 57.2% on Orbitrap data). This drop in performance is likely due to insufficient training data for these underrepresented superclasses.
MS/MS representation space
To visualize the learned feature space, we applied t-SNE to project FIDDLE’s 512-dimensional latent vectors into a 2D representation, which we then divided into a 20 × 20 grid. For each cell in the grid, we calculated the top-1 prediction accuracy and the average Euclidean distance between molecular formulas. For cells containing more than 1000 formula pairs, we used a random sample of 1000 for this calculation. As shown in Fig. 3e, accuracy is negatively correlated with average formula distance. This correlation was highly significant for Q-TOF data (Pearson’s p ≪ 0.0001), but not statistically significant for Orbitrap data (p = 0.0588). This finding confirms that FIDDLE’s prediction accuracy is challenged in regions where similar spectral representations map to divergent formulas, indicating that the learned features in these cases are not distinctive enough for discrimination. A detailed analysis is provided in Supplementary Note 3.
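A sketch of this grid analysis is given below, assuming the latent features, formula vectors, and per-spectrum correctness flags are available as NumPy arrays and that there are enough points for t-SNE's default settings; it is an approximation of the analysis, not FIDDLE's exact script.

```python
import numpy as np
from sklearn.manifold import TSNE

def grid_accuracy_vs_distance(latent, formula_vecs, is_correct, n_grid=20, max_pairs=1000, seed=0):
    """Project latent MS/MS features to 2D with t-SNE, bin points into an n_grid x n_grid grid,
    and compute per-cell top-1 accuracy and mean pairwise formula distance."""
    rng = np.random.default_rng(seed)
    xy = TSNE(n_components=2, random_state=seed).fit_transform(latent)
    # Assign each point to a grid cell
    cells = np.floor((xy - xy.min(0)) / (np.ptp(xy, 0) + 1e-9) * n_grid).astype(int)
    cells = np.clip(cells, 0, n_grid - 1)
    stats = []
    for cx in range(n_grid):
        for cy in range(n_grid):
            idx = np.where((cells[:, 0] == cx) & (cells[:, 1] == cy))[0]
            if len(idx) < 2:
                continue
            acc = float(np.mean(is_correct[idx]))
            # Sample at most max_pairs formula pairs within the cell
            pairs = [(a, b) for i, a in enumerate(idx) for b in idx[i + 1:]]
            if len(pairs) > max_pairs:
                sel = rng.choice(len(pairs), size=max_pairs, replace=False)
                pairs = [pairs[s] for s in sel]
            dist = float(np.mean([np.linalg.norm(formula_vecs[a] - formula_vecs[b]) for a, b in pairs]))
            stats.append((acc, dist))
    return np.array(stats)  # columns: per-cell accuracy, mean formula distance
```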
Ablation study of FIDDLE’s components
The MoNA (Q-TOF) dataset is used to illustrate the improvements of FIDDLE achieved through data augmentation, contrastive loss, and post-processing steps, including candidate formula generation and confidence score prediction. We evaluated FIDDLE’s performance across different processing steps by measuring the proportion of top-1 formulas with varying numbers of missed atoms (Fig. 4a) and missed heavy atoms (Fig. 4c), respectively. Because hydrogen (approximately 1.008 Da) is light and difficult to accurately determine, the count of heavy atoms (excluding hydrogen) is also considered. The number of missed atoms is calculated by summing the differences in atom counts across all elements (e.g., carbon, oxygen, nitrogen). A comparison of FIDDLE’s performance in “Pred w/o DA” (prediction without data augmentation) and “Pred” (prediction with data augmentation) shows that data augmentation significantly increases the proportion of correctly predicted formulas (from 0% to 9% for all atoms, and from 18% to 26% for heavy atoms). The contrastive loss further enhances performance, as seen in results labeled “Pred w/o CL.” Comparing the performance of FIDDLE under “Pred” with “Post Top-1” (top-1 accuracy after post-processing) to “Post Top-5” (top-5 accuracy after post-processing) shows that FIDDLE’s candidate formulas cover the correct formulas for more than 84% of the MS/MS spectra. These candidates are subsequently ranked based on their confidence scores predicted by FIDDLE, achieving an AUC (area under the ROC curve) of 0.97, as shown in Fig. 4b. From Fig. 4d, it is clear that correct and incorrect formulas can be effectively distinguished based on their confidence scores. Notably, after post-processing, no formulas with three or fewer missed atoms remain, indicating that while post-processing may occasionally worsen certain incorrect predictions, most incorrect formulas with only a small number of missing atoms are effectively refined.
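The missed-atom metric can be computed as sketched below; the simple formula parser is illustrative only and assumes standard element symbols.

```python
from collections import Counter
import re

def parse_formula(f):
    """Parse a formula string such as 'C6H12O6' into per-element atom counts."""
    counts = Counter()
    for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", f):
        counts[el] += int(n or 1)
    return counts

def count_missed_atoms(pred, true, heavy_only=False):
    """Sum of absolute differences in per-element atom counts between two formulas."""
    p, t = parse_formula(pred), parse_formula(true)
    elements = set(p) | set(t)
    if heavy_only:
        elements -= {"H"}
    return sum(abs(p[el] - t[el]) for el in elements)

print(count_missed_atoms("C6H12O6", "C6H14O5"))                   # 2 (H) + 1 (O) = 3 missed atoms
print(count_missed_atoms("C6H12O6", "C6H14O5", heavy_only=True))  # 1 missed heavy atom
```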
The counts and proportions of top-1 formulas with different numbers of missed atoms (a) and missed heavy atoms (c), respectively, predicted by FIDDLE with different processing steps. Counts are shown on the y-axis, and proportions are annotated above each bar. “Pred w/o DA” indicates the model trained without data augmentation; “Pred w/o CL” indicates the model trained without contrastive learning; “Pred” indicates the model trained with contrastive loss; “Post Top-1” to “Post Top-4” indicate the top-1 to top-4 formulas from post-processing. b The Receiver Operating Characteristic (ROC) curve with Area Under the Curve (AUC) for distinguishing correct and incorrect formula predictions based on confidence scores from top-5 predictions. d The confidence score distributions for correct and incorrect predictions. e, f The performance of the models trained on MS/MS spectra with different levels of added noise (formula-level accuracy is shown in parentheses beneath each noise condition on the x-axis). g The corresponding learning curves. h, i The performance of the model trained with different resolutions. j Cross-instrument validation (missed atom number) on identical compound sets (diagonal: train and test on the same instrument type; off-diagonal: train and test on different instrument types). The box plots in (e, f, h, i) are based on test datasets containing 512 spectra for Q-TOF and 2048 spectra for Orbitrap instruments, respectively, with each data point representing an independent formula prediction. Source data are provided as a Source data file.
We experimentally optimized the noise augmentation parameters of FIDDLE by training formula prediction models with different noise intensities applied to the training spectra (1×, 3×, and 5× multipliers). The MoNA (Q-TOF) dataset is reused in these experiments, and 20,000 spectra from NIST23 (Orbitrap) are randomly selected and split using the strategy described in the section “Training set construction and data augmentation.” The results reveal an instrument-dependent response: for Q-TOF data, moderate noise augmentation (3×) yielded the best performance, whereas both lower (1×) and higher (5×) noise levels led to suboptimal model convergence. In contrast, Orbitrap spectra did not benefit from noise augmentation at any tested level, suggesting that adding Gaussian noise may compromise the inherent spectral consistency of high-resolution instruments (Supplementary Fig. 6). The resolution of the input spectra was also experimentally optimized, as shown in Fig. 4h, i, where a 0.2 Da resolution achieves the best performance on both Q-TOF and Orbitrap instruments. The Orbitrap model shows robustness across different resolutions, with the largest number of missed atoms occurring at a resolution of 1 Da.
To investigate the effect of instrument type, we constructed subsets from MoNA (Q-TOF) and NIST23 (Orbitrap) in which the spectra come from the same 2357 compounds but were acquired on different instruments. The Orbitrap spectra were downsampled to 18,830, the same size as the Q-TOF set. The split strategy is described in the section “Training set construction and data augmentation.” We then conducted same-instrument and cross-instrument validation (Fig. 4j). Interestingly, the Orbitrap model performs better on both Q-TOF and Orbitrap data, likely because it is trained on higher-resolution spectra with more detailed spectral features, enabling it to learn more robust representations that generalize across platforms. While this demonstrates the advantages of high-resolution training data, Q-TOF-specific models remain valuable for laboratories with limited access to Orbitrap instruments and for high-throughput applications where faster acquisition is preferred.
Evaluation using benchmarking metabolite datasets
Next, we compared the performance of FIDDLE against SIRIUS, BUDDY, and an additional tool, MIST-CF, on three benchmarking metabolite datasets: the Critical Assessment of Small Molecule Identification (CASMI) 201629, CASMI 2017, and European Molecular Biology Laboratory - Metabolomics Core Facility (EMBL-MCF) 2.030 datasets. For a fair comparison, we removed compounds in these datasets that overlap with our training set; in total, 181 compounds (231 spectra), 2 compounds (2 spectra), and 107 compounds (184 spectra) were removed from the three datasets, respectively (for details see Table 1). We also trained MIST-CF* using the same training and test data as used for FIDDLE*, excluding unsupported precursor types from both datasets (see details in Supplementary Table 1).
As illustrated in Fig. 5, FIDDLE performs comparably well with SIRIUS and BUDDY on the two CASMI datasets, while demonstrating significantly better performance on the EMBL-MCF 2.0 dataset. Because all methods achieve near-optimal performance at top-3 accuracy, and only FIDDLE (w/ BUDDY and SIRIUS), BUDDY, and MIST-CF* show improvements of less than 2% from top-3 to top-5, top-4 and top-5 accuracies are not shown here. Complete top-5 accuracy results are provided in Supplementary Figs. 7 and 8. The slightly worse performance of FIDDLE on CASMI can be attributed to multiple factors, including different data acquisition protocols, lower MS/MS spectral quality (CASMI 2017 used Q-TOF, while CASMI 2016 and EMBL-MCF 2.0 used Orbitrap), lower compound similarity to the training set (see Supplementary Fig. 9), and spectral complexity, among others. The computation-based methods exhibit greater sensitivity to mass deviation, as their performance decreases on the EMBL-MCF 2.0 dataset with larger mass deviations compared to the CASMI datasets, as shown in Supplementary Fig. 10. Notably, incorporating the NIST23 dataset into the training set enhances FIDDLE’s performance on these external test sets, primarily due to the larger training data volume (see analysis in Supplementary Note 4), resulting in higher or comparable top-3 accuracies compared to SIRIUS, BUDDY, and MIST-CF. As shown in Fig. 5g–i, FIDDLE and BUDDY each perform better on different formulas and exhibit distinct error patterns (e.g., C4S and O5 for BUDDY versus C3 and HOF for FIDDLE); therefore, combining candidate formulas from all methods and ranking them by FIDDLE’s predicted confidence scores further improves performance. The component pairs challenging for both BUDDY and FIDDLE, such as N3 and C2H2O (which appear as N6 and C4H4O2 in BUDDY’s error patterns), warrant further investigation. On average across the three test sets, our method achieves top-5 accuracies of 80.0% in positive and 73.8% in negative ion mode, while BUDDY achieves 69.9% (positive) and 61.4% (negative), SIRIUS achieves 69.3% (positive) and 67.6% (negative), and MIST-CF achieves 66.5% (positive).
Methods marked with an asterisk (*) denote deep learning models trained on all available datasets, whereas models without an asterisk were trained on all available datasets except NIST23. Following the original configuration of MIST-CF, we evaluated it only on data acquired in positive ion mode. a–f The accuracy on different datasets in positive (first row) and negative (second row) ion modes. g The number of correct and incorrect top-1 predictions from BUDDY and FIDDLE*. h, i The extra and missing chemical formula components in predictions from BUDDY and FIDDLE*, respectively. Source data are provided as a Source data file.
Discussion
In this work, we introduce a deep learning approach named FIDDLE to identify chemical formulas from tandem mass spectra. It consists of three steps: predicting formulas from tandem mass spectra, generating candidate formulas, and ranking the candidates based on predicted confidence scores. FIDDLE not only accelerates formula identification compared to state-of-the-art algorithms (BUDDY and SIRIUS), but also outperforms them on both the evaluation set (with 88.3% top-1 accuracy and 93.6% top-5 accuracy) and external benchmarking datasets (with average top-5 accuracies of 80.0% and 73.8% for positive and negative ion modes, respectively, across three datasets). Our noise robustness evaluation on 1000 test spectra shows that FIDDLE maintains over 90% accuracy even at large noise levels, significantly outperforming BUDDY and SIRIUS on both Q-TOF and Orbitrap data. A separate evaluation on chimeric spectra confirmed that FIDDLE’s performance decreases as spectral mixing increases; this finding, while expected, will guide the future design of data augmentation strategies to enhance its training process.
According to our ablation study, both data augmentation and the contrastive loss between manually constructed spectra pairs enhance formula identification from tandem mass spectra. The post-processing steps, which include generating candidate formulas and ranking them by predicted confidence scores, refine the predicted formulas using the SENIOR rules. These steps refine most candidate formulas with no more than three missed heavy atoms and significantly alleviate the challenge of incorrect hydrogen counts due to their small mass. Furthermore, because FIDDLE and other formula identification methods perform well on different compounds and spectra, combining their predicted candidate formulas and ranking them using FIDDLE’s confidence scores further improves the accuracy of the predicted formulas, albeit at the cost of longer running times. This suggests that a promising direction for future work is the development of hybrid methods that unify the predictive power of deep learning with the systematic rigor of search-based approaches.
We note that, even though the contrastive loss is employed to account for the effect of experimental conditions, this effect is not completely eliminated, as FIDDLE still shows variations in accuracy under different conditions. Future work may focus on improving accuracy across different conditions, especially those with less training data. The FIDDLE workflow offers several avenues for extension, such as improving MS/MS representations with pre-trained models from self-supervised learning31, using MS/MS prediction methods for data augmentation32, and adding adduct type prediction for more comprehensive application.
Furthermore, instead of simply accepting all top K candidates, confidence score thresholds could be used to implement FDR control. This approach would introduce a critical standard from proteomics to the field of metabolomics, where such controls remain largely unexplored.
Beyond formula identification, FIDDLE serves as a robust foundation for compound structural elucidation from MS/MS spectra, enabling the inference of covalent bonds between atoms based on their atomic composition. In the future, this methodology could be expanded to characterize diverse molecular structures by leveraging machine learning techniques to extract structural information from MS/MS spectra and integrating this information into molecular structure characterization.
Methods
MS/MS data pre-processing
MS/MS data filtering
The training and testing MS/MS datasets are compiled from several sources, including NIST20, NIST23, MoNA, GNPS, Agilent PCDL, and an internal dataset. To construct the internal dataset, we acquired 1424 compounds from APExBIO33 and measured their tandem mass spectra using a Waters Synapt G2 mass spectrometer at 40 eV with various precursor types, including [M + H]+, [M + Na]+, [2M + H]+, [M + 2H]2+, and [M−H]−. To minimize the interference from impurities in the air, e.g., water vapor (H2O) and carbon dioxide (CO2), we excluded peaks below 50 m/z from the scans. Additionally, three benchmarking sets generated for community-wide evaluation of metabolite identification algorithms are used, including CASMI 2016, CASMI 2017, and EMBL-MCF 2.0.
These datasets undergo a series of filtering steps to ensure data quality: (1) Mass spectra with fewer than five peaks are excluded due to potential unreliability. (2) The m/z range is confined to (0, 1500] to account for the rarity of spectra with m/z values above 1500. (3) Only the molecules composed of frequent atoms (C, H, O, N, F, S, Cl, P, B, I, Br, Na, and K) are retained. (4) Only spectra associated with common precursor types are included, e.g., [M + H]+, [M + Na]+, [2M + H]+, [M + H−H2O]+, [M + H−2H2O]+, [M + H−NH3]+, [M + H + NH3]+, [M + H−CH2O2]+, [M + H−CH4O2]+ for positive mode; [M−H]−, [M−H−CO2]−, and [M−H−H2O]− for negative mode; along with the doubly charged precursor types [M + 2H]2+ and [2M + 2H]2+. (5) The total count of atoms in the molecules is capped at 300 to exclude the molecules not typically classified as “small molecules.” (6) The tolerance for precursor mass discrepancy is set at 10 ppm for Q-TOF and 5 ppm for Orbitrap instruments, ensuring precise mass matching. The statistical information of the datasets is detailed in Table 1. The specific instruments contained in each instrument type are shown in Table 2.
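A minimal sketch of these filters is shown below; the record field names and the abridged precursor-type set are assumptions for illustration, not FIDDLE's actual data schema.

```python
# Abridged set of common precursor types (see the full list above and Table 3)
COMMON_PRECURSOR_TYPES = {"[M+H]+", "[M+Na]+", "[2M+H]+", "[M+2H]2+", "[M+H-H2O]+",
                          "[M-H]-", "[M-H-CO2]-", "[M-H-H2O]-"}
ALLOWED_ATOMS = {"C", "H", "O", "N", "F", "S", "Cl", "P", "B", "I", "Br", "Na", "K"}

def passes_filters(spec, instrument="qtof"):
    """Apply the six quality filters described above to one spectrum record (illustrative fields)."""
    ppm_tol = 10.0 if instrument == "qtof" else 5.0
    if len(spec["peaks"]) < 5:                                        # (1) at least five peaks
        return False
    if any(not (0.0 < mz <= 1500.0) for mz, _ in spec["peaks"]):      # (2) m/z range (0, 1500]
        return False
    if not set(spec["atom_counts"]) <= ALLOWED_ATOMS:                 # (3) frequent atoms only
        return False
    if spec["precursor_type"] not in COMMON_PRECURSOR_TYPES:          # (4) common precursor types
        return False
    if sum(spec["atom_counts"].values()) > 300:                       # (5) at most 300 atoms
        return False
    ppm = abs(spec["precursor_mz"] - spec["theoretical_mz"]) / spec["theoretical_mz"] * 1e6
    return ppm <= ppm_tol                                             # (6) precursor mass tolerance
```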
Precursor m/z simulation
The precursor m/z values from NIST20, NIST23, CASMI 2016, and CASMI 2017 are theoretical values, so they are adjusted via random shifts, following the approach used by Xing et al.16 and the observation by Böcker et al.17 that mass deviations fit a Gaussian distribution with a standard deviation of one-third of the mass tolerance. We sampled the deviations from Gaussian distributions within the set tolerance ranges (5 ppm for Orbitrap and 10 ppm for Q-TOF) to simulate the experimental conditions accurately. These simulated precursor m/z values are utilized throughout both the training and testing phases, enhancing the model’s generalizability for application in real-world formula identification tasks.
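A minimal sketch of this simulation is shown below; clipping the sampled deviation to the tolerance range is our assumption about how "within the set tolerance ranges" is enforced.

```python
import numpy as np

def simulate_precursor_mz(theoretical_mz, instrument="qtof", rng=None):
    """Perturb a theoretical precursor m/z with Gaussian mass error (sigma = tolerance / 3),
    truncated to the instrument's ppm tolerance."""
    rng = rng or np.random.default_rng()
    ppm_tol = 10.0 if instrument == "qtof" else 5.0
    sigma = ppm_tol / 3.0
    ppm_shift = float(np.clip(rng.normal(0.0, sigma), -ppm_tol, ppm_tol))
    return theoretical_mz * (1.0 + ppm_shift * 1e-6)

print(simulate_precursor_mz(180.0634, "orbitrap"))  # a value within +/- 5 ppm of the input
```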
Simplification of precursor types
As depicted in Fig. 2g, the dataset exhibits significant imbalance across different precursor types. To improve formula identification for less common precursor types, we simplify the precursor types by folding uncharged species, such as water (H2O), ammonia (NH3), carbon dioxide (CO2), formic acid (CH2O2), and acetic acid (CH4O2), into the molecular formula (for the detailed precursor types see Table 3). For example, consider a molecular formula C6H12N2O2 with the precursor type [M + H + NH3]+. After simplification, the formula is adjusted to C6H15N3O2 and the precursor type becomes [M + H]+, reflecting the integration of NH3 into the molecular formula. This simplification is recorded, allowing the predicted formulas to be converted back to the original formula with the original precursor type. Through this process, many uncommon precursor types are consolidated into more common ones, which expands the training data of common precursor types and thereby enhances the predictions.
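The sketch below illustrates this folding step for the worked example above; the neutral-species table is abridged and the function name is hypothetical.

```python
from collections import Counter

# Neutral species folded into the formula during precursor-type simplification (abridged; see Table 3)
NEUTRALS = {"H2O": {"H": 2, "O": 1}, "NH3": {"N": 1, "H": 3}, "CO2": {"C": 1, "O": 2},
            "CH2O2": {"C": 1, "H": 2, "O": 2}, "CH4O2": {"C": 1, "H": 4, "O": 2}}

def simplify_precursor(atom_counts, neutral, gained=True):
    """Fold a neutral gain (e.g. [M+H+NH3]+) or loss (e.g. [M+H-H2O]+) into the molecular formula."""
    out = Counter(atom_counts)
    sign = 1 if gained else -1
    for el, n in NEUTRALS[neutral].items():
        out[el] += sign * n
    return dict(out)

# C6H12N2O2 with [M+H+NH3]+  ->  C6H15N3O2 with [M+H]+
print(simplify_precursor({"C": 6, "H": 12, "N": 2, "O": 2}, "NH3", gained=True))
# {'C': 6, 'H': 15, 'N': 3, 'O': 2}
```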
Training set construction and data augmentation
Each dataset, acquired using either Q-TOF or Orbitrap MS/MS instruments, is first split into training and test sets according to molecular canonical SMILES, which ensures that there are no common compounds between the training set and the test set. It is worth noting that canonical SMILES cannot distinguish stereoisomers, so this splitting strategy guarantees that stereoisomers with different configurations are not separated between training and test sets. Two strategies were employed for data splitting: (1) random splitting (see the model in the section “Ablation study of FIDDLE’s components” and the model marked with an asterisk in the section “Evaluation using benchmarking metabolite datasets”); (2) leaving spectra of unique compounds from NIST23 for evaluation (see the model in the section “Performance of formula identification” and the model without an asterisk in the section “Evaluation using benchmarking metabolite datasets”). Then, for CL, we constructed MS/MS spectra pairs from the training sets. The spectra are grouped by canonical SMILES. For each spectrum, we randomly picked another spectrum from the same group to construct a positive pair and a spectrum from a different group to construct a negative pair.
To enhance the robustness of the deep learning model, we generated augmented MS/MS spectra by perturbing the spectra in the training set. Specifically, we added random noise sampled from a Gaussian distribution with a mean of 0 and a standard deviation of 0.1 (\(\mathcal{N}(0, 0.1)\)) to the intensities of an experimental Q-TOF spectrum, generating two augmented spectra for each Q-TOF spectrum and thus tripling the size of the training set from Q-TOF mass spectrometry. On average, the cosine similarity between the augmented spectrum and the corresponding experimental spectrum is 0.936, which is close to the similarity between replicated spectra of the same compound (0.977) in the Q-TOF training set. However, noise addition led to significantly lower similarity for Orbitrap spectra (as shown in Supplementary Fig. 6). Since the training set for Orbitrap is sufficiently large, we did not apply data augmentation when training the deep learning model for Orbitrap.
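A minimal sketch of this augmentation is shown below, assuming binned intensity vectors; the non-negative clipping and renormalization after noise addition are our assumptions.

```python
import numpy as np

def augment_spectrum(intensities, n_aug=2, sigma=0.1, rng=None):
    """Create augmented copies of a binned spectrum by adding N(0, 0.1) noise to the intensities."""
    rng = rng or np.random.default_rng()
    augmented = []
    for _ in range(n_aug):
        noisy = intensities + rng.normal(0.0, sigma, size=intensities.shape)
        noisy = np.clip(noisy, 0.0, None)           # intensities cannot be negative
        augmented.append(noisy / (noisy.max() + 1e-8))
    return augmented

def cosine_similarity(a, b):
    """Cosine similarity used to compare augmented and experimental spectra."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```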
Representation learning for MS/MS spectra
Model architecture
The representation learning for MS/MS spectra is structured as a two-stage process: MS/MS embedding and the elimination of the experimental condition effect. This approach decomposes the complex task of MS/MS representation learning under multiple experimental conditions into more tractable steps, facilitating the network’s ability to extract meaningful features. Additionally, the resultant condition-independent MS/MS features reduce the complexity of the decoder.
Acknowledging the significance of the correlation among fragment ions in MS/MS, such as the neutral loss between two ions, dilated convolutions with large kernels are employed, building upon the previous work for de novo peptide sequencing23. Due to the large effective receptive field (ERF), this method allows the deep neural network to learn meaningful patterns across fragment ions with long mass ranges from the MS/MS data. Each convolutional block consists of two sequential dilated convolutions with ReLU activation, random dropout, a residual connection, and max-pooling. The max-pooling uses a kernel size of 2 and a stride size of 2, which halves the spectra length. Upsampling along the channel dimension is applied in the skip connection when the output and input channel numbers differ. Dilated convolutions with exponentially increasing dilation factors (1, 2, 4, 8, etc.) and large kernels rapidly expand the ERF without significantly increasing the number of parameters22,34. We can calculate the ERF for our model as:
\(\mathrm{ERF}=1+\sum_{i=1}^{6}\left(k_{i}-1\right)d_{i},\)

where the kernel sizes \(k_{i}\) are 45, 43, 41, 39, 37, and 35, and the dilation sizes \(d_{i}\) are 1, 2, 4, 8, 8, and 8, respectively. The total ERF of this model is 1153, indicating that each position in the MS/MS feature is directly influenced by 1153 input bins. It is worth noting that while the ERF defines the range of direct input influence, the model can still capture patterns beyond this range. In addition, we utilize weight normalization to speed up the convergence and reduce the dependency of normalization on batch size, enabling training with small batches and limited GPU memory35.
The features from all the blocks are concatenated together along the channel dimension, resulting in a 1024 × 235 tensor, where 1024 results from the stacked convolutional blocks with channel sizes of 32, 32, 64, 128, 256, and 512, and 235 is the length of the MS/MS spectra after pooling layers. Then, through a global pooling layer along the spectra-length dimension, the tensor is pooled into a 1024-dimensional vector as the embedded MS/MS feature. Precursor types, normalized collision energy, and simulated precursor m/z are included as experimental conditions. These conditions are embedded into a 16-dimensional vector through linear layers. The embedded MS/MS feature and experimental conditions are concatenated and fed into sequential linear layers for the condition-independent MS/MS features. These features are constrained by CL so that features from the same molecule are learned to be the same, while features from different molecules are learned to be different.
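The sketch below illustrates one such convolutional block and an illustrative stack using the kernel, dilation, and channel sizes quoted above; it omits the multi-block feature concatenation, global pooling, and condition embedding, so it is a simplified stand-in rather than FIDDLE's full encoder.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class DilatedConvBlock(nn.Module):
    """One encoder block: two dilated 1-D convolutions (weight-normalized) with ReLU and
    dropout, a residual connection, and max-pooling that halves the spectrum length."""
    def __init__(self, in_ch, out_ch, kernel, dilation, p_drop=0.1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2           # keep the length unchanged before pooling
        self.conv1 = weight_norm(nn.Conv1d(in_ch, out_ch, kernel, padding=pad, dilation=dilation))
        self.conv2 = weight_norm(nn.Conv1d(out_ch, out_ch, kernel, padding=pad, dilation=dilation))
        self.act, self.drop = nn.ReLU(), nn.Dropout(p_drop)
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, x):
        h = self.drop(self.act(self.conv1(x)))
        h = self.drop(self.act(self.conv2(h)))
        return self.pool(h + self.skip(x))            # residual connection, then halve the length

# Illustrative stack following the kernel/dilation/channel sizes quoted above
kernels, dilations = [45, 43, 41, 39, 37, 35], [1, 2, 4, 8, 8, 8]
channels = [32, 32, 64, 128, 256, 512]
blocks, in_ch = [], 1
for k, d, c in zip(kernels, dilations, channels):
    blocks.append(DilatedConvBlock(in_ch, c, k, d))
    in_ch = c
encoder = nn.Sequential(*blocks)
print(encoder(torch.randn(2, 1, 7500)).shape)          # torch.Size([2, 512, 117])
```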
Loss function
Given a pair of binned spectra i and j, the encoder of the prediction model embeds them into condition-independent MS/MS feature vectors zi and zj. These feature vectors are constrained by the contrastive loss:
\(\mathcal{L}_{c}=y\,\lVert z_{i}-z_{j}\rVert_{2}^{2}+(1-y)\,\max\left(0,\;m-\lVert z_{i}-z_{j}\rVert_{2}\right)^{2},\)

where y is the label of the paired spectra: if spectra i and j are from the same molecule, the label is 1; otherwise, the label is 0. m is a hyperparameter (by default, m = 1.0) that defines the lower bound on the distance between spectra of different molecules.
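In code, this margin-based contrastive loss can be written as follows (a generic PyTorch sketch, not FIDDLE's exact implementation):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_i, z_j, y, margin=1.0):
    """Pull same-molecule pairs (y = 1) together and push different-molecule pairs (y = 0)
    at least `margin` apart in the condition-independent feature space."""
    d = F.pairwise_distance(z_i, z_j)
    return (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()
```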
Three auxiliary tasks, including the atom count prediction, the molecular mass prediction, and the prediction of the hydrogen-to-carbon (H/C) ratio, are used for model regularization to improve the model generalizability25. Mean squared error (MSE) losses are applied to these tasks. The loss function for sample i from the pair (i, j) is defined as:
\(\mathcal{L}_{i}=\mathcal{L}_{f,i}+\sum_{t}\alpha_{t}\,\mathcal{L}_{t,i},\qquad \mathcal{L}_{t,i}=\left(y_{t,i}-\hat{y}_{t,i}\right)^{2},\)

where \(\mathcal{L}_{f,i}\) denotes the formula regression loss, \(\mathcal{L}_{t,i}\) denotes the loss for auxiliary task t on sample i; \(y_{t,i}\) and \(\hat{y}_{t,i}\) denote the label and prediction, respectively; and \(\alpha_{t}\) denotes the weight of the loss on task t. In the experiments, we use the weights of 0.01, 1, and 10 for these three auxiliary tasks, respectively. Finally, the loss function for training the model on the spectra pair (i, j) is:

\(\mathcal{L}=\mathcal{L}_{i}+\mathcal{L}_{j}+\mathcal{L}_{c}.\)
Formula refinement algorithm
Algorithm 1 — Candidate formula refinement
Inputs: initial formulas Finit; target mass M
Settings: tolerance ΔM; maximum search depth \(D_{\max}\); number of formulas to return K
Ensure: ΔM > 0
Initialization:
    Frefined ← [ ]                         ▷ list of refined formulas
    Ftrack ← [ ]                           ▷ record of explored formulas
    Fcand ← Finit                          ▷ initial formula candidates
    Dcand ← [0] × len(Finit)               ▷ search depth of each candidate
Refinement iteration:
    while len(Frefined) < K and Fcand is not empty do
        f ← Fcand.pop(); d ← Dcand.pop(); Ftrack.append(f)           ▷ process current candidate
        if f passes the SENIOR rules and mass of f ∈ [M − ΔM, M + ΔM] then
            Frefined.append(f)                                        ▷ add valid formula
        end if
        if d < \(D_{\max}\) then                                      ▷ generate next-level candidates
            generate new candidates from f by editing one heavy atom
            Fcand, Dcand ← update with untracked new candidates, depth d + 1
        end if
        if timeout then                                               ▷ terminate on timeout
            break
        end if
    end while
return Frefined                                                       ▷ output refined formulas
The candidate formula refinement algorithm (Algorithm 1) refines an input chemical formula into a list of formulas that comply with the SENIOR rules26 and match a target precursor mass within a specified tolerance. This algorithm can start with the formula predicted by the deep learning model or a formula identified by tools such as SIRIUS or BUDDY, facilitating the integration of results from multiple methods. The algorithm initializes lists to track refined formulas, previously explored formulas, and current candidates with their associated search depths. By iteratively modifying the counts of heavy atoms without exceeding a set search depth, the algorithm systematically refines each candidate formula. During the exploration, one heavy atom is added or removed, and hydrogen counts are adjusted to align with the precursor mass. The refinement process concludes either when a predefined number of formulas have been explored or when all candidate formulas are exhausted, incorporating a timeout check. The algorithm ultimately produces a list of formulas that align closely with the target precursor mass and adhere to the SENIOR rules.
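A simplified, runnable Python sketch of this breadth-first refinement is given below; the element mass table is abridged, the tolerance is expressed in Da rather than ppm, hydrogen counts are re-fitted to the residual mass, and the validity check is a stand-in for the full SENIOR rules, so this is an illustration rather than FIDDLE's exact routine.

```python
from collections import deque
import time

# Monoisotopic masses for a few elements (abridged, illustrative subset)
MASS = {"C": 12.0, "H": 1.007825, "O": 15.994915, "N": 14.003074, "S": 31.972071}
HEAVY = [e for e in MASS if e != "H"]

def formula_mass(f):
    return sum(MASS[e] * n for e, n in f.items())

def refine(initial, target_mass, tol_da, max_depth=3, k=5, timeout_s=5.0, is_valid=lambda f: True):
    """Breadth-first refinement: edit one heavy atom per step, re-fit hydrogens to the target
    mass, and keep formulas passing a validity check (stand-in for the SENIOR rules)."""
    start = time.time()
    refined, seen = [], set()
    queue = deque([(dict(initial), 0)])
    while queue and len(refined) < k:
        if time.time() - start > timeout_s:
            break
        f, depth = queue.popleft()
        key = tuple(sorted(f.items()))
        if key in seen:
            continue
        seen.add(key)
        if is_valid(f) and abs(formula_mass(f) - target_mass) <= tol_da:
            refined.append(dict(f))
        if depth < max_depth:
            for el in HEAVY:                        # add or remove one heavy atom
                for delta in (+1, -1):
                    g = dict(f)
                    g[el] = g.get(el, 0) + delta
                    if g[el] < 0:
                        continue
                    # re-fit the hydrogen count to the residual mass
                    residual = target_mass - formula_mass({e: n for e, n in g.items() if e != "H"})
                    g["H"] = max(round(residual / MASS["H"]), 0)
                    queue.append((g, depth + 1))
    return refined

# The correct glucose formula is recovered among the returned candidates
print(refine({"C": 6, "H": 12, "O": 6}, target_mass=180.063388, tol_da=0.002))
```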
Prediction of confidence score
The inputs for the confidence score prediction model consist of the concatenated condition-independent MS/MS feature vector and the vector representation of the predicted formula. The model architecture comprises a sequence of five linear layers, each followed by batch normalization, a ReLU activation function, and random dropout for regularization. The dimensions of the five layers are 416, 208, 104, 26, and 13, respectively. A Sigmoid activation function is used in the final layer to produce a confidence score between 0 and 1, with higher values indicating more likely correct formula assignments. Training data for this model is obtained from the inference results of initial formula identification, followed by candidate formula refinement. Specifically, each of the top 5 refined candidate formulas is labeled, with 1 indicating a correct candidate and 0 denoting an incorrect one. Since this model has significantly fewer trainable parameters than the formula prediction model, the training set is randomly sampled down to 10,000 instances if it exceeds this limit to prevent overfitting.
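A PyTorch sketch of this scoring head is shown below; the final linear-plus-sigmoid output layer and the dropout rate are assumptions, since only the five hidden-layer dimensions are specified above.

```python
import torch
import torch.nn as nn

class ConfidenceScorer(nn.Module):
    """Confidence-score head: five linear layers (416, 208, 104, 26, 13) with batch
    normalization, ReLU, and dropout, followed by a sigmoid output in [0, 1]."""
    def __init__(self, in_dim, p_drop=0.1):
        super().__init__()
        dims = [in_dim, 416, 208, 104, 26, 13]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU(), nn.Dropout(p_drop)]
        layers += [nn.Linear(13, 1), nn.Sigmoid()]    # assumed scalar output layer
        self.net = nn.Sequential(*layers)

    def forward(self, msms_feature, formula_vector):
        # Input: condition-independent MS/MS feature concatenated with the candidate formula vector
        return self.net(torch.cat([msms_feature, formula_vector], dim=-1)).squeeze(-1)
```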
Implementation and computing environment
Both deep neural networks (formula prediction and candidate scoring) were implemented in PyTorch version 1.13.1, comprising 21,027,933 (about 21M) and 20,055,938 (about 20.1M) parameters, respectively. A list of specific parameter numbers for each layer/block is described in Supplementary Tables 2 and 3.
The training of models in FIDDLE was conducted using a batch size of 512 over 200 epochs, with an initial learning rate of 0.001. The models were optimized using AdamW and a ReduceLROnPlateau learning rate schedule that reduces the learning rate when a metric has stopped improving for 5 epochs.
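The corresponding optimizer and scheduler setup is sketched below with a placeholder model and a dummy validation metric; the monitored metric and training loop details are illustrative.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the FIDDLE formula-prediction model
model = nn.Linear(7500, 13)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# Reduce the learning rate when the monitored metric stops improving for 5 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=5)

for epoch in range(200):
    # ... run one training epoch with batch size 512 here ...
    val_loss = 1.0 / (epoch + 1)          # dummy validation loss for illustration
    scheduler.step(val_loss)
```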
Training was performed on two NVIDIA RTX A6000 GPUs, each equipped with 48 GB of memory, running Ubuntu 20.04.1. For inference only, the minimum GPU memory requirement is 500 MB when using a batch size of 1.
To facilitate broader adoption within the research community, we release our implementation as an open-source Python package (msfiddle) with optimized routines for both GPU and CPU-only environments, enabling reproducible access to pre-trained models and inference pipelines with standardized APIs compatible with common scientific computing workflows.
Running SIRIUS and BUDDY
We installed SIRIUS v6.0.1, last updated on July 21st, 2024, for the Linux (64-bit) platform, accessible at https://github.com/boecker-lab/sirius. Additionally, the BUDDY v0.3.6 PyPI package, updated on January 29th, 2024, was obtained from https://github.com/Philipbear/msbuddy. All experiments were performed on a workstation with a 20.04.1-Ubuntu 64-bit operating system equipped with 16 Intel(R) Core(TM) i7-9800X CPUs. For comprehensive configuration details, please refer to Supplementary Table 4.
Data availability
The NIST20 and NIST23 data used in this study are available under restricted access as paid standard reference databases; access can be obtained by purchasing the NIST Mass Spectral Library. The GNPS data used in this study are available at the GNPS Mass Spectral Libraries. The MoNA, CASMI 2016, CASMI 2017, and EMBL-MCF 2.0 data used in this study are available at MassBank of North America. The Agilent PCDL data are available from Agilent Technologies (https://www.agilent.com) upon request for academic use. The Waters Q-TOF data are available under restricted access to comply with licensing terms that prohibit commercial use. Access is limited to academic and non-commercial research purposes and can be requested from the corresponding author. Requests will be reviewed for eligibility, and approved users will receive access within 2 weeks. The data will remain available for the duration of the approved research project. Source data are provided with this paper.
Code availability
The source code for all experiments is publicly available on GitHub at https://github.com/JosieHong/FIDDLE under the Apache 2.0 License36. Pre-trained models for Q-TOF and Orbitrap MS/MS data can be downloaded from the GitHub releases: https://github.com/JosieHong/FIDDLE/releases. A command-line tool, msfiddle, is available via PyPI (https://pypi.org/project/msfiddle), providing simple commands to download, manage, and use the models.
References
Wang, M. et al. Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat. Biotechnol. 34, 828–837 (2016).
Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).
Blaženović, I. et al. Comprehensive comparison of in silico MS/MS fragmentation tools of the CASMI contest: database boosting is needed to achieve 93% accuracy. J. Cheminform. 9, 1–12 (2017).
da Silva, R. R., Dorrestein, P. C. & Quinn, R. A. Illuminating the dark matter in metabolomics. Proc. Natl. Acad. Sci. USA 112, 12549–12550 (2015).
Stein, S. Mass spectral reference libraries: an ever-expanding resource for chemical identification. Anal. Chem. 84, 7274–7282 (2012).
Moco, S., Vervoort, J., Bino, R. J., De Vos, R. C. & Bino, R. Metabolomics technologies and metabolite identification. TrAC Trends Anal. Chem. 26, 855–866 (2007).
Dettmer, K., Aronov, P. A. & Hammock, B. D. Mass spectrometry-based metabolomics. Mass Spectrom. Rev. 26, 51–78 (2007).
Alseekh, S. et al. Mass spectrometry-based metabolomics: a guide for annotation, quantification and best reporting practices. Nat. Methods 18, 747–756 (2021).
Hernandez, F., Sancho, J. V., Ibáñez, M. & Grimalt, S. Investigation of pesticide metabolites in food and water by LC-TOF-MS. TrAC Trends Anal. Chem. 27, 862–872 (2008).
Dueñas, M. E. et al. Advances in high-throughput mass spectrometry in drug discovery. EMBO Mol. Med. 15, e14850 (2023).
Liu, C. & Zhang, H. High-throughput mass spectrometry in drug discovery. SLAS Technol. 100292 (2025).
Wang, F. et al. CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification. Anal. Chem. 93, 11692–11700 (2021).
Stravs, M. A., Dührkop, K., Böcker, S. & Zamboni, N. MSNovelist: de novo structure generation from mass spectra. Nat. Methods 19, 865–870 (2022).
Pluskal, T., Uehara, T. & Yanagida, M. Highly accurate chemical formula prediction tool utilizing high-resolution mass spectra, MS/MS fragmentation, heuristic rules, and isotope pattern matching. Anal. Chem. 84, 4396–4403 (2012).
Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).
Xing, S., Shen, S., Xu, B., Li, X. & Huan, T. BUDDY: molecular formula discovery via bottom-up MS/MS interrogation. Nat. Methods 20, 881–890 (2023).
Böcker, S., Letzel, M. C., Lipták, Z. & Pervukhin, A. SIRIUS: decomposing isotope patterns for metabolite identification. Bioinformatics 25, 218–224 (2009).
Rasche, F., Svatoš, A., Maddula, R. K., Böttcher, C. & Böcker, S. Computing fragmentation trees from tandem mass spectrometry data. Anal. Chem. 83, 1243–1251 (2011).
NIST Mass Spectrometry Data Center. National Institute of Standards and Technology (NIST) Spectral Library (2023 version; NIST23). Available at: https://www.nist.gov/programs-projects/nist23-updates-nist-tandem-and-electron-ionization-spectral-libraries (2023).
Goldman, S., Xin, J., Provenzano, J. & Coley, C. W. MIST-CF: chemical formula inference from tandem mass spectra. J. Chem. Inf. Model. 64, 2421–2431 (2024).
Aron, A. T. et al. Reproducible molecular networking of untargeted mass spectrometry data using GNPS. Nat. Protoc. 15, 1954–1991 (2020).
Lea, C., Vidal, R., Reiter, A. & Hager, G. D. Temporal convolutional networks: a unified approach to action segmentation. Lect. Notes Comput. Sci. 9915, 47–54 (2016).
Liu, K., Ye, Y., Li, S. & Tang, H. Accurate de novo peptide sequencing using fully convolutional neural networks. Nat. Commun. 14, 7974 (2023).
Kwok, T.-Y. & Yeung, D.-Y. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Trans. Neural Netw. 8, 630–645 (1997).
Zhang, Y. & Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 34, 5586–5609 (2021).
Kind, T. & Fiehn, O. Seven golden rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinform. 8, 1–20 (2007).
Stancliffe, E., Schwaiger-Haber, M., Sindelar, M. & Patti, G. J. DecoID improves identification rates in metabolomics through database-assisted MS/MS deconvolution. Nat. Methods 18, 779–787 (2021).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30, 4765–4774 (2017).
Schymanski, E. L. et al. Critical assessment of small molecule identification 2016: automated methods. J. Cheminform. 9, 1–21 (2017).
Dekina, S., Alexandrov, T. & Drotleff, B. EMBL-MCF 2.0: an LC-MS/MS method and corresponding library for high-confidence targeted and untargeted metabolomics using low-adsorption HILIC chromatography. Metabolomics 20, 114 (2024).
Guo, H., Xue, K., Sun, H., Jiang, W. & Pu, S. Contrastive learning-based embedder for the representation of tandem mass spectra. Anal. Chem. 95, 7888–7896 (2023).
Hong, Y. et al. 3DMolMS: prediction of tandem mass spectra from 3D molecular conformations. Bioinformatics 39, btad354 (2023).
APExBIO. APExBIO: achieve perfection, explore the unknown. Available at: https://www.apexbt.com/ (2023).
Ding, X., Zhang, X., Han, J. & Ding, G. Scaling up your kernels to 31 × 31: revisiting large kernel design in CNNs. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 11963–11975 (IEEE, 2022).
Salimans, T. & Kingma, D. P. Weight normalization: a simple reparameterization to accelerate training of deep neural networks. Adv. Neural Inf. Process. Syst. 29, 901–909 (2016).
Hong, Y. JosieHong/FIDDLE. Available at: https://doi.org/10.5281/zenodo.17172712 (2025).
Acknowledgements
The authors acknowledge the Center for Bioanalytical Metrology (CBM), an NSF Industry-University Cooperative Research Center, for providing funding under grant NSF IIP-1916645 (H.T.). This work was also partially supported by National Science Foundation grant DBI-2011271 (H.T.). This article includes material that first appeared in the PhD thesis of Yuhui Hong, “Deep Learning-Enhanced Approaches in Mass Spectrometric Analysis and Small Molecule Identification” (Indiana University Bloomington, 2025), available via ProQuest.
Author information
Authors and Affiliations
Contributions
H.T. conceived the project and guided the study design. H.T. and Y.H. developed the computational method. Y.H. implemented the software and performed the experiments. S.L., Y.Y., and Y.H. collected and preprocessed the datasets. Y.H. led the manuscript writing, with input from all authors and supervision by H.T. All authors have read and approved the final version of the manuscript for submission.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Dylan Ross, Yan Zhou, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Hong, Y., Li, S., Ye, Y. et al. FIDDLE: a deep learning method for chemical formulas prediction from tandem mass spectra. Nat Commun 16, 11102 (2025). https://doi.org/10.1038/s41467-025-66060-9