Introduction

Nuclear magnetic resonance (NMR) spectroscopy plays a pivotal role in analyzing complex mixtures thanks to its unique capability to handle matrix complexity, providing detailed molecular information and preserving the integrity of the sample. This is particularly true for metabolomics, which aims at identifying, quantifying, and characterizing the entire ensemble of exogenous and endogenous metabolites present in a biological specimen1.

NMR-based metabolomics requires minimal sample preparation, allows the simultaneous detection and quantification of multiple metabolites over a wide dynamic range, and provides highly reproducible data2,3. Furthermore, NMR does not alter the sample during analysis, so different experiments can provide complementary information about the same sample. Blood-derived samples (namely serum and plasma), in which both metabolites and macromolecules are present in solution, offer a practical example of how different types of NMR experiments can provide unique insights into the sample’s molecular composition. The one-dimensional (1D) Nuclear Overhauser Effect SpectroscopY (NOESY) experiment produces a spectrum in which the signals of all the 1H NMR-detectable chemical species present in solution (e.g., metabolites, proteins, lipoproteins) are visible, independent of their molecular weights. The 1D Carr-Purcell-Meiboom-Gill (CPMG) experiment suppresses signals from macromolecules and other broad resonances by exploiting spin-echo decay, enabling the selective detection of low-molecular-weight molecules. Conversely, the Diffusion-edited experiment selectively suppresses small-metabolite signals by distinguishing between molecules based on size and diffusion behavior2. The 1H-1H J-resolved spectroscopy (JRES) experiment helps resolve overlapping signals by separating chemical shifts (in the frequency dimension F2) and J-coupling information (in the frequency dimension F1). In complex mixtures such as biofluids, which contain hundreds of molecules, signal assignment is complicated by extensive peak overlap. As a consequence, JRES has become a popular experiment for resolving spectral complexity and improving metabolite identification4.

Each experiment demands spectrometer time and post-processing effort. Thus, acquiring multiple NMR spectra for biofluids, such as serum or plasma, can be time-consuming and resource-intensive, especially for high-throughput studies. Here, we propose a machine learning approach to derive other types of NMR spectra (i.e., CPMG, Diffusion-edited, and the 1D projection of JRES spectra) from NOESY spectra, using serum samples as a case study. Figure 1 depicts the flowchart of the study design, summarizing the process of dataset preparation, model development, and validation.

Fig. 1: Study overview.

a NMR spectra of serum samples from 18 different recruitment centers were split into training, validation, and test sets. b Fast PLS (partial least squares) regression was applied to the 1D (one-dimensional) NOESY spectra of the training set to build the models for the prediction of different 1D spectra (CPMG, Diffusion-edited, pJRES). Each model was validated on the validation set. c Spectra of the independent test set were used to independently validate each PLS model. NOESY Nuclear Overhauser Effect SpectroscopY, CPMG Carr-Purcell-Meiboom-Gill, pJRES positive 1D projection of JRES in the F2 dimension. Globe and Computer Desktop images by Vectorportal.com. All the other parts of the figure were created by the authors using PowerPoint (Licence Microsoft 365).

Results and discussion

A computational pipeline for prediction of NMR spectra

We designed a complete computational pipeline, from the import of NMR spectra in the R environment to the export in software-compatible format (i.e., Bruker format), to efficiently predict and visualize CPMG, Diffusion-edited, and positive 1D projection of JRES in the F2 dimension (pJRES) spectra from 1D NOESY spectra (Fig. 1). The objective of this tool is to accelerate metabolomic analyses by reducing the need for long acquisition time and post-processing efforts through the integration of machine learning with NMR spectroscopy, while maintaining accuracy and reproducibility.

In standard NMR-based metabolomics, it is common practice, especially for untargeted studies, to acquire multiple spectra per sample—typically four for serum and plasma—since each spectrum emphasizes different molecular components. For example, CPMG spectra are particularly informative for small-molecule metabolites, whereas diffusion-edited spectra are more suitable for detecting changes in lipoprotein profiles. Among these, 1D NOESY spectra capture signals from all molecular species present in concentrations above the NMR detection limit. However, directly quantifying individual metabolites from NOESY spectra is challenging due to significant peak overlap and spectral complexity. Furthermore, depending on the biological question, NOESY, CPMG, or diffusion-edited spectra may yield different classification performances. Specifically, when discriminative features are more influenced by macromolecular background signals (e.g., from lipoproteins), NOESY or diffusion-edited spectra tend to perform better. Conversely, when low-molecular-weight metabolites are more relevant for discrimination, CPMG spectra typically provide higher classification accuracy. Therefore, having access to the full panel of spectra allows for a more comprehensive analytical approach. To address this, we propose an indirect computational approach that leverages the 1D NOESY spectrum to reconstruct other types of spectra (e.g., CPMG, diffusion-edited, and pJRES). This strategy effectively disentangles the complex information in 1D NOESY data and enables the use of established analysis pipelines designed for these specific spectra, ultimately reducing acquisition time and costs, and increasing throughput.

First, NMR spectra are imported into R, aligned to a reference (anomeric glucose signal at δ 5.24 ppm), and normalized using appropriate R functions. Then, 1D NOESY spectra are binned (bin size 0.001 ppm) to reduce the matrix size and compensate for small signal shifts. We chose the fast Partial Least Squares (PLS) regression algorithm to predict the whole frequency domain of the spectra at once. Using PLS regression, we built on the training set (Fig. 1) three models able to predict CPMG, Diffusion-edited, and pJRES spectra from binned NOESY spectra. In addition, pJRES spectra were also predicted from the previously predicted CPMG spectra. Each PLS model (one for each kind of 1D spectrum) was tested on a validation set and on an independent test set (serum samples from an independent collection center included in neither the training nor the validation set) (Fig. 1). As a result, three different predicted spectra were obtained for each NOESY spectrum and exported in Bruker format for compatibility with software currently used in metabolomics (e.g., Topspin, Mnova, Chenomx). For our analyses, we used NMR spectra available in the metabolomic repository Metabolights (https://www.ebi.ac.uk/metabolights/) with accession numbers MTBLS242 (severe obesity), MTBLS395 (acute myocardial infarction), and MTBLS424 (breast cancer).

The computational approach to extracting information about small or macromolecules directly from NOESY spectra has already been explored. Specifically, following a suggestion by an anonymous reviewer, we became aware of a study in which neural networks are used to derive NMR spectra of both small molecules and macromolecules directly from blood NOESY spectra5. While the goal of that study aligns with our own, the proposed methodology differs significantly. In the article written by Xiao et al.5, neural network models were trained on a dataset of simulated NMR spectra; instead, we trained our models using real serum spectra. With their approach, Diffusion-edited spectra are directly obtained from NOESY, whereas narrow signals of small molecules, characteristic of CPMG spectra, were obtained indirectly by subtracting signals of macromolecules from NOESY. Instead, we directly predicted CPMG spectra from NOESY. Notably, we also proposed the prediction of pJRES spectra, which are very useful for assignment. Furthermore, our method is evaluated directly on real-case scenarios, including the quantification of a panel of 20 metabolites and the classification of patients with different prognoses following acute myocardial infarction (see the following Sections). Finally, our approach is computationally straightforward, essentially just a regression, highlighting that, even in the age of complex AI models, simpler methods can be remarkably effective. It echoes the wisdom of Occam’s razor, which is sometimes neglected in contemporary approaches.

Goodness of the results

To ensure the quality of the results obtained with this pipeline, we compared the spectra predicted by the PLS models with the original ones using four metrics (Table 1): the median of the relative errors as percentage (MRE%, i.e., for each predicted spectrum, each point was compared with the respective point in the original spectrum and the percentage error was calculated), the root mean square error (RMSE), the coefficient of determination (R²), and the ratio of performance to deviation (RPD). Considered across all evaluation metrics, the overall performance of the proposed approach appears to be of high quality. Validation on the independent test cohort proved more challenging than cross-validation, further underscoring the robustness of the proposed approach. Predicting pJRES from the predicted CPMG spectra and from NOESY spectra yielded similar results. However, in the independent test set, the model calculated on the predicted CPMG spectra showed slightly better results, likely due to the suppression of macromolecular signals inherent in the CPMG pulse sequence. Indeed, pJRES spectra can be viewed, in a simplified sense, as CPMG spectra with suppressed scalar couplings (i.e., without signal multiplicities). Of course, this approach has its limitations: the model does not explicitly understand or simulate scalar coupling mechanisms; it simply learns this behavior implicitly from training on matched pairs of CPMG and pJRES spectra from the same samples, and therefore may not generalize perfectly to novel coupling patterns outside of the training distribution. All subsequent comparisons refer to the model calculated on predicted CPMG spectra. To better appreciate the quality of the calculated models, the distributions of MRE% across the different spectra are depicted in Fig. 2.

Fig. 2: Quality of PLS predictions.

Original and predicted spectra were compared using the median of the relative errors as percentage (MRE%). Results for the validation (n = 321) and independent test (n = 232) sets are reported for CPMG, Diffusion-edited (DIFF), and pJRES spectra (the higher the bars in the first two MRE% groups, the better the results). CPMG Carr-Purcell-Meiboom-Gill, pJRES positive 1D projection of JRES in the F2 dimension. Ind. Test independent test set.

Table 1 Summary of the evaluation metrics of the results obtained in the validation set, the independent test set and through the 10-fold cross-validation (CV). Data are reported as median ± median absolute deviation

In the validation set, 60.7% of the predicted CPMG spectra showed an MRE% ≤ 5%, 38.9% had an MRE% between 5% and 10%, and one spectrum was predicted with an MRE% > 15%. In the independent test set (Fig. 2), 29.7% of the predicted CPMG spectra showed an MRE% ≤ 5%, 65.9% had an MRE% between 5% and 10%, and only one spectrum had an MRE% > 15%, further confirming the high quality of the predictions (Fig. 3).

Fig. 3: CPMG predictions.

Original (red) and predicted (green) median CPMG (Carr-Purcell-Meiboom-Gill) spectra in the independent test set. The entire spectrum (a) and four zoomed-in views of different spectral regions (b–e).

To ensure the robustness and reproducibility of our results, we compared the concentrations of 20 representative metabolites obtained by integrating their signals in the original and predicted spectra of the independent test set (Fig. 4). All correlations were statistically significant (all p values < 0.001). Fifteen metabolites showed correlation coefficients (r) greater than 0.90, while four metabolites (citrate, histidine, phenylalanine, and tyrosine) showed suboptimal correlations, with r values between 0.75 and 0.90. Only formate showed a weaker correlation (r = 0.51), likely due to the low intensity of its signal. Taken together these results proved the accuracy of our predictions.

Fig. 4: Metabolites analysis.

Concentrations (in arbitrary units) of 20 representative metabolites (a 3-hydroxybutyrate, b Acetate, c Alanine, d Choline, e Citrate, f Creatine, g Creatinine, h Formate, i Glucose, j Glutamate, k Glutamine, l GlycA, m GlycB, n Histidine, o Isoleucine, p Lactate, q Leucine, r Phenylalanine, s Tyrosine, and t Valine) in real and predicted CPMG (Carr-Purcell-Meiboom-Gill) spectra of the independent test set. The correlation coefficient (r) is reported in each panel. Scatter plots of the data with a fitted ordinary least squares regression line are reported. The blue-shaded area represents the 95% confidence interval for the estimated mean response, computed from the regression model’s standard error of the fitted values.

When this pipeline was applied to the prediction of Diffusion-edited spectra, the predictions closely matched the original spectra (Fig. 5). In the validation set, 97.2% of the predicted spectra showed an MRE% ≤ 5%, while in the independent test set, 69.4% of the predicted Diffusion-edited spectra showed an MRE% ≤ 5%, 27.2% had an MRE% between 5% and 10%, and only two spectra had an MRE% > 15% (Fig. 2).

Fig. 5: Diffusion-edited predictions.

Original (red) and predicted (green) median Diffusion-edited spectra in the independent test set. The entire spectrum (a) and four zoomed-in views of different spectral regions (b–e).

Taken together, these results clearly demonstrate that our approach is both feasible and efficient and provide evidence that predicting CPMG spectra is more challenging than predicting Diffusion-edited spectra. The CPMG experiment selectively suppresses signals from macromolecules and broad resonances, focusing on low molecular weight metabolites. However, residual signals from macromolecules are still present in the spectra complicating the prediction. The robustness and reproducibility of results critically depend on the standardization of both pre-analytical and analytical protocols. Indeed, metabolite levels are highly sensitive to sample manipulation procedures (e.g., sample collection, storage, and preparation)6,7 and contamination (e.g., skin disinfectant, drug signals) that often result in sharp signals. Thus, the use of multicenter cohorts may have posed a significant challenge for modeling CPMG spectra, but, at the same time, contributed to demonstrating the robustness of our approach.

The prediction of pJRES spectra, given the significant differences in the NMR acquisition protocols across the three cohorts used (see Methods), represented an extremely complex task. We believe that the broad signals of macromolecules present in NOESY spectra could make them less suitable for the prediction of pure-shift-like spectra (i.e., spectra with removed J-coupling effects) such as pJRES. The use of CPMG spectra could represent an alternative solution. Since the pipeline presented here is intended to be used with the acquisition of 1D NOESY spectra only, we performed the prediction of pJRES spectra from the CPMG spectra previously predicted from 1D NOESY. In the validation set (Fig. 2), 68.8% of the predicted pJRES spectra showed an MRE% ≤ 10%, and 5.9% of the spectra were predicted with an MRE% > 15%. In the independent test set (Fig. 2), 9.5% of the predicted pJRES spectra showed an MRE% ≤ 10%, 73.3% had an MRE% between 10% and 15%, and 17.2% of the spectra had an MRE% > 15%. pJRES spectra are quite different from usual 1D spectra, presenting different peak shapes (neither Lorentzian nor Gaussian) and different peak positions (e.g., a doublet gives a single peak at its center, where there is no peak in the original spectrum). Moreover, in the presence of strong coupling, the JRES experiment is prone to generating spectral artefacts in the projection that are not present in conventional 1D spectra8. The presence of these extra lines makes the prediction of pJRES spectra even more difficult. The search for an efficient and fast method to obtain pure-shift spectra continues to be a hot topic in NMR spectroscopy9,10,11,12,13, with different approaches and strategies described in the literature. The pJRES is still a sensitive and fast experiment to obtain pure-shift spectra, especially in the metabolomics community; thus, our strategy could be a feasible option for this field.

Considering the challenges described above, we obtained remarkable results in the prediction of pJRES spectra (Fig. 6). Although accurate prediction of pJRES signal intensities is inherently challenging, this does not compromise the practical utility of pJRES spectra. pJRES spectra themselves are not quantitative and are not used directly for absolute quantitation but rather for disambiguating between nearby peaks and for peak assignment, where their role remains critical.

Fig. 6: pJRES predictions.

Original (red) and predicted (green) median pJRES (positive 1D projection of JRES in the F2 dimension) spectra in the independent test set. The entire spectrum (a) and four zoomed-in views of different spectral regions (b–e).

To highlight how well each model reproduces the different regions of the NMR spectra, squared correlations between each chemical shift point in the original and predicted spectra of the independent test set were calculated and are shown in Fig. 7. The model calculated on Diffusion-edited spectra presents high correlations across the entire spectrum (except, as expected, in the noise regions), whereas the model calculated on CPMG spectra shows high correlations in the aliphatic region and moderate correlations in the aromatic region. Notably, this analysis confirms that the formate signal is difficult to predict due to its low intensity. The model predicting pJRES spectra shows weaker correlations regardless of the spectral region.

Fig. 7: Spectral correlations.

The plots show the median a CPMG, b Diffusion-edited, and c pJRES spectra of the independent test set, color-coded according to the coefficients of determination (R²) obtained by correlating the original and predicted spectra point by point. CPMG Carr-Purcell-Meiboom-Gill, pJRES positive 1D projection of JRES in the F2 dimension.

The analysis of the root mean square error (RMSE) distribution across the three datasets used: MTBLS242 (severe obesity), MTBLS395 (acute myocardial infarction), and MTBLS424 (breast cancer), is shown in Supplementary Fig. S1. This analysis reveals that samples from the breast cancer study exhibit slightly higher errors. Notably, these spectra were acquired with only 32 scans, compared to 64 scans for the other datasets, suggesting that the increased error may be related to differences in acquisition protocols. Smaller differences in error distribution were observed between the severe obesity and myocardial infarction studies, with the exception of Diffusion-edited spectra. In this case, higher errors in the obesity group may be explained by the particularly intense lipoprotein signals in pre-surgery samples. While relevant clinical differences are undoubtedly present across the three studies, our training set was intentionally designed to incorporate this variability. This was done to improve model robustness and support broader applicability across different clinical and experimental conditions.

Metabolomic fingerprinting using original and predicted spectra

The CPMG and Diffusion-edited spectra predicted by our tenfold cross-validated PLS models were of high quality. To further demonstrate the applicability of our approach, we conducted metabolomic fingerprinting analysis to distinguish between clinical groups of interest, using both the predicted and original spectra separately. Specifically, we used the serum samples collected in the frame of the Florence Acute Myocardial Infarction-2 (AMI-Florence 2) registry. In the original study, the two-year risk of death was evaluated in AMI patients treated with percutaneous coronary intervention14. Here, we reproduced the discrimination between patients who died (116 deceased patients) and those who survived (673 survivor patients) within two years after the AMI event. As in the original publication, the Random Forest (RF) algorithm was used to classify the two groups of patients, and the area under the receiver operating characteristic curve (AUC) was chosen as the performance metric (Fig. 8). RF models calculated on original and predicted CPMG spectra exhibited similar performances, with AUCs of 0.79 and 0.78, respectively. Comparable results were obtained using Diffusion-edited spectra, with AUCs of 0.76 and 0.78 for original and predicted spectra, respectively. In conclusion, the complete computational pipeline described here made it possible to accurately predict three distinct NMR spectra (CPMG, Diffusion-edited, and pJRES) starting from the acquisition of a single 1D NOESY spectrum. This approach holds the promise of significantly reducing sample acquisition times for metabolomic purposes, making analyses increasingly high-throughput; furthermore, the availability of R scripts for all post-processing operations ensures reduced effort and lower risks of operator-dependent variability. The machine learning approach proposed here could be applied to the prediction of other kinds of 1D NMR spectra (e.g., pure shift spectra, DIRE15). In theory, it could be used to predict 2D spectra as well. However, predicting 2D spectra from one-dimensional spectra could be challenging because 1D spectra may not contain enough information to reconstruct the 2D cross-peaks.

Fig. 8: Metabolomic fingerprinting.

Discrimination between patients who survived and those who died within 2 years after Acute Myocardial Infarction (AMI), obtained by the Random Forest (RF) models calculated on original (red) and predicted (green) a CPMG (Carr-Purcell-Meiboom-Gill) and b diffusion-edited spectra. The receiver operating characteristic curves and the area under the receiver operating characteristic curve (AUC) scores are reported for each model. The 95% confidence interval computed as defined by DeLong et al.28 is also shown by the shaded area around the mean AUC values depicted by the solid curves.

Our study is innovative, and our approach proves to be robust even in the complexity of a multicenter population (across multiple studies and geographical regions). Serum samples were obtained from three distinct studies that enrolled patients with widely different metabolic characteristics, which are reflected in the spectra. This high heterogeneity has certainly made modeling more complex and may have led to an underestimation of results; nonetheless, our system has shown a notable tolerance to minor protocol variations, reinforcing the importance of our findings. Important limitations also should be acknowledged: firstly, the current model validation is limited to specific studies; therefore, broader benchmarking is needed to assess how deviations and error accumulation in predicted spectra could impact downstream tasks that require high fidelity, such as absolute metabolite quantification or the detection of weak signals, or fingerprinting. Secondly, the NMR spectra from the three studies were obtained using slightly different NMR methods (for instance, while the majority of samples were acquired using 64 scans, samples from one study were acquired using 32 scans), and none of them complies with the latest metabolomics guidelines2. To ensure maximum sample homogeneity and even better results, we strongly recommend adhering to the current metabolomics standard operating procedures2,16. Unfortunately, no cohort of samples acquired according to current standard procedures and large enough to build robust statistical models was freely available in public repositories. Nevertheless, our pipeline is perfectly applicable to spectra obtained with different parameters, with the only requirement being a sufficiently large sample size to build a robust new training set.

As a final remark, the approach proposed here is not intended as a general replacement for multi-spectral acquisition in all metabolomics investigations, but rather as an additional tool that may offer practical advantages in many scenarios.

Methods

Study populations

We used serum NMR data from a total of 1842 individuals enrolled from 18 recruitment centers in the framework of three published studies14,17,18. NMR data were used in a fully anonymized manner, and no clinical or demographic data were collected, as they fell outside the scope of this methodological study. However, this information could be retrieved from original publications14,17,18. All spectra processed and used in this study are available on Figshare at https://figshare.com/s/2523b08fe8c2a23a341d. For each study, metadata and a selection of the acquired NMR spectra are available in the metabolomic repository Metabolights (https://www.ebi.ac.uk/metabolights/) with accession numbers MTBLS242 (severe obesity), MTBLS395 (acute myocardial infarction), and MTBLS424 (breast cancer).

NMR analysis

Blood serum samples were collected from fasted individuals and stored at −80 °C pending NMR analysis. Frozen serum samples were thawed at room temperature and gently shaken before use. A sodium phosphate buffer (75 mM Na2HPO4·7H2O; 20% (v/v) 2H2O; 4.6 mM 3-(Trimethylsilyl)propionate-2,2,3,3-d4; 6.1 mM NaN3; pH 7.4) was then added to each serum sample in a 1:1 (v/v) ratio.

The 1H-NMR spectra for all samples were recorded using a Bruker 600 MHz spectrometer (Bruker BioSpin) operating at 600.13 MHz. The system was equipped with a 5 mm CPTCI 1H-13C-31P and 2H-decoupling cryo-probe, featuring a z-axis gradient coil, automatic tuning and matching, and an automatic sample changer (at room temperature). Temperature stabilization during measurements was maintained within approximately 0.1 K using a BTO 2000 thermocouple. Prior to data acquisition, samples were equilibrated inside the NMR probe head for at least 3 minutes to ensure thermal stability at 310 K.

The one-dimensional NOESY experiments were acquired using an acquisition time of 2.7 s, a relaxation delay of 4 s, a mixing time of 0.01 s, a FID (free induction decays) size of 98304 data points, and 64 (studies MTBLS242, MTBLS395) or 32 scans (study MTBLS424). The one-dimensional CPMG experiments were acquired using an acquisition time of 3.07 s, a relaxation delay of 4 s, a mixing time of 0.01 s, a FID size of 73728 data points, and 64 (studies MTBLS242, MTBLS395) or 32 scans (study MTBLS424). The Diffusion-edited experiments were acquired using an acquisition time of 2.7 s, a FID size of 98304 data points, a relaxation delay of 4 s, a smoothed square shape for diffusion with dipolar gradients of 1.5 ms, a diffusion time of 0.12 s and 64 (studies MTBLS242, MTBLS395) or 32 scans (study MTBLS424). The 2D JRES experiments in the studies MTBLS242, MTBLS395 were recorded using an increment for delay of 12802 µs, 12288 (F2) and 40 (F1) data points, an acquisition time of 0.61 s and 0.51 s for the direct dimension (F2) and indirect dimension (F1) respectively, a relaxation delay of 2 s, 1 scan, and 8 dummy scans. The 2D JRES experiments in the study MTBLS424 were recorded using an increment for delay of 12820 µs, 8192 (F2) and 40 (F1) data points, an acquisition time of 0.41 s and 0.51 s for the direct dimension and indirect dimension respectively, a relaxation delay of 2 s, 2 scans, and 16 dummy scans.

NMR data processing

Before Fourier transformation, free induction decays were multiplied by an exponential function equivalent to a 0.3 Hz line-broadening factor. The number of data points after Fourier transformation was 131072 for NOESY, 131072 for CPMG, 65536 for Diffusion-edited, and 16384 for pJRES. Transformed spectra were automatically corrected for phase and baseline distortions. The positive partial 1D projection of each 2D JRES spectrum in the F2 direction (pJRES) was calculated using the Topspin 4.3.0 function “f2projp”. All one-dimensional spectra were calibrated to the anomeric glucose doublet at δ 5.24 ppm using the in-house developed R function “align doublet”. All spectra were manually inspected, and poor-quality spectra (e.g., large water signal, presence of contaminants) were removed; thus, the final dataset comprised 1753 serum spectra.
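
As an illustration of the exponential multiplication step (not the processing code used in this study, which relied on Topspin), the window function can be sketched in Python. The sketch assumes the common convention in which a line broadening of LB Hz corresponds to multiplying the FID by exp(−π·LB·t); the function name and the `dwell_time_s` parameter are hypothetical.

```python
import math

def apply_line_broadening(fid, dwell_time_s, lb_hz=0.3):
    """Multiply each FID point by exp(-pi * lb_hz * t).

    fid          : sequence of real or complex time-domain points
    dwell_time_s : time between consecutive points, in seconds (assumed uniform)
    lb_hz        : line-broadening factor in Hz (0.3 Hz, as stated in the text)
    """
    return [
        point * math.exp(-math.pi * lb_hz * index * dwell_time_s)
        for index, point in enumerate(fid)
    ]

# Toy FID of constant amplitude sampled every 0.5 s: the window only
# attenuates later points and leaves the first point unchanged.
weighted = apply_line_broadening([1.0, 1.0, 1.0], dwell_time_s=0.5)
```

The first point is untouched because the window equals 1 at t = 0; later points are damped progressively, which trades a small loss in resolution for reduced noise.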

The NMR spectra were loaded in the R environment using the in-house developed R function “loadspectra”. Each 1D NOESY spectrum in the range between −1.0 and 12.0 ppm was binned into chemical shift bins of 0.001 ppm using the R function “Bucketing” of the package “PepsNMR”19, thus the system was reduced to 13000 bins. Conversely, the other NMR spectra, namely CPMG, Diffusion-edited, and pJRES, in the range between −1.0 and 12.0 ppm, were used at their full-length resolution: 85065, 28355, and 12780 data points for CPMG, Diffusion-edited, and pJRES, respectively. All spectra were normalized on the area of the spectral region between δ 0.30 and 4.37 ppm.
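
The binning and normalization steps can be illustrated with a short Python sketch. It is not the “Bucketing” function of “PepsNMR” nor the in-house R code: here each bin is simply the sum of the points falling inside it, and the "area" used for normalization is approximated by the sum of the intensities in the reference region (an assumption that is reasonable on a uniform ppm grid). Function names are hypothetical.

```python
def bin_spectrum(ppm, intensity, lo=-1.0, hi=12.0, width=0.001):
    """Sum intensities into fixed-width chemical shift bins over [lo, hi)."""
    n_bins = round((hi - lo) / width)  # 13000 bins for the defaults
    bins = [0.0] * n_bins
    for shift, value in zip(ppm, intensity):
        if lo <= shift < hi:
            # Guard against floating-point rounding at the upper edge.
            index = min(int((shift - lo) / width), n_bins - 1)
            bins[index] += value
    return bins

def normalize_to_region(ppm, intensity, lo=0.30, hi=4.37):
    """Divide a spectrum by the summed intensity between lo and hi ppm."""
    area = sum(v for s, v in zip(ppm, intensity) if lo <= s <= hi)
    if area == 0:
        raise ValueError("reference region has zero area")
    return [v / area for v in intensity]

# Toy example with coarse 1 ppm bins between 0 and 2 ppm.
toy_bins = bin_spectrum([0.1, 0.2, 1.5], [1.0, 2.0, 3.0], lo=0.0, hi=2.0, width=1.0)
```

With the default arguments, the range from −1.0 to 12.0 ppm at 0.001 ppm per bin gives the 13000 variables reported for the binned NOESY spectra.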

Statistical modeling

The dataset, comprising serum spectra of samples collected in 17 different recruitment centers, was divided into training and validation sets. The Kennard-Stone algorithm was applied to select training samples from this large multivariate dataset. The function “kenStone” (from the R package “prospectr”) was used with the Mahalanobis distance as the chosen metric, retaining the number of principal components that explain at least 95% of the total variance. As a result, approximately 80% of the spectra (1200 samples) were included in the training set, while the remaining 20% (321 samples) were allocated to the validation set. Furthermore, the spectra of 232 samples from one independent recruitment center were used as the independent test set (Fig. 1).
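
For readers unfamiliar with Kennard-Stone selection, the following Python sketch shows the core idea: the two most distant samples seed the training set, and each subsequent pick is the sample farthest from its nearest already-selected neighbor. This is a simplified illustration, not the “kenStone” implementation: it uses plain Euclidean distance on whatever coordinates are passed in. Under the assumption that the inputs are principal component scores scaled to unit variance, Euclidean distance coincides with the Mahalanobis distance in that score space. The function name is hypothetical.

```python
import math

def kennard_stone(points, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Seed with the two mutually most distant samples.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    # Distance from each candidate to its closest selected sample.
    nearest = {k: min(dist[k][first], dist[k][second]) for k in remaining}

    while len(selected) < n_select:
        # Pick the candidate that is farthest from the selected set.
        chosen = max(sorted(remaining), key=lambda k: nearest[k])
        selected.append(chosen)
        remaining.remove(chosen)
        for k in remaining:
            nearest[k] = min(nearest[k], dist[k][chosen])
    return selected

# Toy example on a line: the extremes are picked first, then the midpoint.
toy_selection = kennard_stone([(0, 0), (10, 0), (5, 0), (1, 0), (9, 0)], 3)
```

The samples left unselected would form the validation set. This naive version stores the full distance matrix, which is acceptable for a sketch but not tuned for large datasets.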

The analyses were conducted on a 64-bit Windows 11 Pro machine equipped with a 13th Gen Intel(R) Core(TM) i9-13900K processor (3.00 GHz, 24 cores), 64 GB of 2400 MT/s RAM, and a 1 TB Lexar NM610PRO M.2 SSD.

Predictive models were created by applying Partial Least Squares (PLS) regression, with 1D NOESY spectra used as independent variables and CPMG, Diffusion-edited, and pJRES spectra used as dependent variables. pJRES spectra were also predicted from the previously predicted CPMG spectra, used as independent variables. The X matrix was composed of 1200 rows (i.e., spectra) and 13000 columns for 1D NOESY spectra, and of 1200 rows and 85065 columns for predicted CPMG spectra. The Y matrix was composed of 1200 rows (i.e., CPMG, Diffusion-edited, or pJRES spectra, depending on the predictive model) and 85065, 28355, or 12780 columns (i.e., spectral points for CPMG, Diffusion-edited, and pJRES spectra, respectively). Among the several PLS variants proposed in the literature20, we focused on the formulation of PLS computed through the one-step singular value decomposition (SVD)21 of the cross-product matrix XᵀY. This approach has the advantage of avoiding the iterative deflation steps, making the computation faster, especially when the X and Y matrices are large (and thus XᵀY is particularly heavy)22. Because only a small fraction of the PLS components (i.e., singular vectors) needs to be computed, the SVD was obtained through the augmented implicitly restarted Lanczos bidiagonalization (IRLBA) algorithm23 to further boost the computational speed. An Rcpp/RcppArmadillo implementation of this fast PLS approach is available in the CRAN package “fastPLS”. With this strategy, we were able to obtain the final PLS models with 250, 165, and 180 components in 200, 65, and 23 s for predicting CPMG, Diffusion-edited, and pJRES spectra, respectively. The pJRES spectra were also predicted with the same modeling strategy but using the predicted CPMG spectra as the X matrix; the final PLS model was built with 180 components in 22 s.
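
To make the one-step SVD formulation more concrete, a minimal NumPy sketch is given below. It is one plausible reading of the approach and is not the “fastPLS” implementation: it assumes mean-centered X and Y, takes the leading left singular vectors of XᵀY as X weights, and regresses Y on the resulting scores. It also uses a dense SVD rather than IRLBA, so it is only suitable for small matrices. Function names are hypothetical.

```python
import numpy as np

def fit_pls_svd(X, Y, n_components):
    """Fit a multi-response PLS-type model from a single SVD of X^T Y."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean

    # One SVD of the cross-product matrix, with no iterative deflation.
    U, _, _ = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    if n_components > U.shape[1]:
        raise ValueError("n_components exceeds the rank available from X^T Y")

    weights = U[:, :n_components]   # X weights (left singular vectors)
    scores = Xc @ weights           # latent scores, one column per component
    loadings = np.linalg.lstsq(scores, Yc, rcond=None)[0]  # regress Y on scores
    coefficients = weights @ loadings
    return coefficients, x_mean, y_mean

def predict_pls_svd(model, X_new):
    """Predict responses for new samples from a fitted model."""
    coefficients, x_mean, y_mean = model
    return (X_new - x_mean) @ coefficients + y_mean

# Toy data: 6 "spectra" with 3 predictor points and 4 response points.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 3))
Y_toy = X_toy @ rng.normal(size=(3, 4)) + 2.0
toy_model = fit_pls_svd(X_toy, Y_toy, n_components=3)
Y_hat = predict_pls_svd(toy_model, X_toy)
```

In this sketch, chaining two fitted models (NOESY to CPMG, then predicted CPMG to pJRES) would amount to calling `predict_pls_svd` twice, feeding the first prediction to the second model.
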
The number of PLS components in each model was chosen as a trade-off between keeping the model small and minimizing the prediction error in the validation set. The original and predicted spectra were compared, restricting the comparison to spectral regions with intensities at least three times the noise level, using the median of the relative errors expressed as a percentage (MRE%), the root mean square error (RMSE), the coefficient of determination (R²), and the ratio of performance to deviation (RPD). RPD is the ratio between the standard deviation of the reference data and the RMSE, and indicates the ability of a model to predict unknown samples24; RPD > 4 is considered excellent performance25.
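
The four quality metrics can be sketched as follows. This is an illustrative Python sketch, not the code used in the study: the function name is hypothetical, the mask is assumed to be applied to the original spectrum, R² is assumed to be 1 minus the ratio of residual to total sum of squares, and the sample standard deviation (n - 1) is assumed for RPD.

```python
import math
import statistics

def prediction_metrics(original, predicted, noise_level):
    """Point-by-point comparison restricted to points at least 3x the noise level."""
    # Keep only points whose original intensity is >= 3 * noise (assumed masking rule).
    pairs = [(o, p) for o, p in zip(original, predicted) if o >= 3 * noise_level]
    obs = [o for o, _ in pairs]
    errors = [p - o for o, p in pairs]
    # Median of the absolute relative errors, as a percentage.
    mre_pct = 100 * statistics.median(abs(e) / abs(o) for e, o in zip(errors, obs))
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / len(errors))
    mean_obs = statistics.fmean(obs)
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    r2 = 1 - ss_res / ss_tot
    # RPD: standard deviation of the reference data divided by the RMSE.
    rpd = statistics.stdev(obs) / rmse
    return {"MRE%": mre_pct, "RMSE": rmse, "R2": r2, "RPD": rpd}

# Toy intensities (synthetic); the first point falls below 3x noise and is ignored.
metrics_demo = prediction_metrics(
    original=[0.5, 10.0, 20.0, 30.0, 40.0],
    predicted=[0.7, 11.0, 19.0, 31.0, 39.0],
    noise_level=1.0,
)
```
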

To further verify the stability of the proposed method, the results obtained on the entire datasets were validated using a tenfold cross-validation scheme. All the above-mentioned quality metrics were also calculated for the cross-validated results.
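
A tenfold scheme of this kind can be sketched in Python as below. The function name, the random shuffling, and the seed are assumptions made for illustration; the text does not specify how samples were assigned to folds.

```python
import random

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random fold assignment (assumption)
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# 1200 spectra, as in the X and Y matrices described above.
splits = list(kfold_indices(1200, k=10))
```

For each split, a model would be fitted on the training spectra and used to predict the held-out spectra. Concatenating the held-out predictions gives one cross-validated prediction per sample, on which the quality metrics can then be computed.
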

The predicted NMR spectra were saved in a Bruker-compatible format using the in-house developed R function “writespectra”.

Metabolite quantification

The NMR signals of 20 metabolites (3-hydroxybutyrate, acetate, alanine, choline, citrate, creatine, creatinine, formate, glucose, glutamate, glutamine, GlycA, GlycB, histidine, isoleucine, lactate, leucine, phenylalanine, tyrosine, valine) were manually assigned using Chenomx NMR Suite 10.0. By way of example, the identified metabolites were quantified (in arbitrary units) in both original and predicted spectra by integrating the NMR region of interest for each signal (Supplementary Table S1) with an R script developed in-house. Pearson correlations (r) between the concentrations obtained from original and predicted spectra were calculated using the function “stat_cor” of the R package “ggpubr”. Correlation plots, showing the confidence interval and r, were obtained using the function “ggscatter”.
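
Quantification by integration and the subsequent correlation can be sketched as follows. This Python sketch is illustrative only and does not reproduce the in-house R script: the function names are hypothetical, the trapezoidal rule is an assumed integration method, and the toy region limits are not taken from Supplementary Table S1.

```python
import math

def integrate_region(ppm, intensity, lo, hi):
    """Trapezoidal area of the points whose chemical shift lies in [lo, hi]."""
    # Sorting makes the result independent of the direction of the ppm axis.
    pts = sorted((x, y) for x, y in zip(ppm, intensity) if lo <= x <= hi)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Toy signal (synthetic): a small trapezoid-shaped peak between 1.0 and 1.3 ppm.
area_demo = integrate_region([1.3, 1.2, 1.1, 1.0], [0.0, 2.0, 2.0, 0.0], 1.0, 1.3)
```

In practice, one area per metabolite and per sample would be computed from the original spectra and from the predicted spectra, and `pearson_r` would be applied to the two resulting vectors of areas.
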

Multivariate analysis based on clinical data

To demonstrate the quality of the predicted spectra compared to the original ones, we performed metabolomic fingerprinting analysis on the original spectra and on the predicted spectra (cross-validated predictions) separately, to discriminate between groups of interest. Specifically, we reproduced the discrimination between patients who died within two years of an acute myocardial infarction (AMI) event (116 deceased patients) and those who survived for at least two years (673 surviving patients). Mortality information was censored at 24 months after AMI or at the date of death, whichever occurred earlier.

As detailed in the original publication14, the R package “randomForest”26 (RF) was used to grow a forest of 2000 trees, sampling 90 spectra per group for each tree (using the “sampsize” argument of the R function) and leaving all other settings at their defaults. The percentage of trees assigning a sample to a specific class was taken as the probability of class membership (deceased or survivor). RF models were constructed separately for the original and predicted spectra, for both CPMG and Diffusion-edited experiments, and their performances were compared using receiver operating characteristic (ROC) curve analysis (via the “roc” function of the R package “pROC”27).
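
The step from tree votes to class-membership probabilities and then to a ROC-based comparison can be sketched as follows. This Python sketch is illustrative and does not call the R packages named above: the function names and toy scores are hypothetical, and the area under the curve is computed with the pairwise (Mann-Whitney) formula, which equals the area under the empirical ROC curve.

```python
def vote_fraction(tree_votes, positive="deceased"):
    """Fraction of trees voting for the positive class, used as its probability."""
    return sum(vote == positive for vote in tree_votes) / len(tree_votes)

def roc_auc(scores, labels, positive="deceased"):
    """Area under the empirical ROC curve via pairwise comparison of scores."""
    pos = [s for s, lab in zip(scores, labels) if lab == positive]
    neg = [s for s, lab in zip(scores, labels) if lab != positive]
    # Each positive/negative pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (synthetic): probabilities of the "deceased" class for five samples.
scores_demo = [0.9, 0.8, 0.4, 0.35, 0.1]
labels_demo = ["deceased", "survivor", "deceased", "survivor", "survivor"]
auc_demo = roc_auc(scores_demo, labels_demo)
```

Computing this area once for the model built on original spectra and once for the model built on predicted spectra gives two numbers that can be compared directly.
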

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.