Abstract
Analysing metabolites in bioliquids through various spectroscopic methods provides valuable insights into the metabolic phenotypes. Deciphering spectral data has greatly benefited from deep-learning methods; however, data-driven solutions often struggle with data dependence on different devices, samples and spectral modalities. Most current task-specific methods have limited generalizability to different spectral analysis problems, including preprocessing, quantification and interpretation. Here, we developed a pretrained foundation model, termed deep-spectral component filtering (DSCF) through a self-supervised approach termed spectral component resolvable learning. By acquiring general spectral knowledge, DSCF achieved state-of-the-art performance for five distinct spectral analysis tasks on 11 datasets. Notably, the general pretraining led to zero-shot spectral denoising and trace-level quantification in complex mixtures. DSCF achieved molecule-level interpretation of surface-enhanced Raman spectra and mapped serum metabolic profiles from nearly 600 individuals for various diseases, including stroke, Alzheimer’s disease and prostate cancer. Overall, the proposed foundation model illustrates promising generalizability for spectral analysis and offers a clear and feasible pathway for general spectral analysis.
Data availability
All the data used to validate the DSCF model are openly available. The IR spectra of liver tissues are available at figshare https://doi.org/10.6084/m9.figshare.28107236 (ref. 60). The serum SERS spectra for PCa research are available at figshare https://doi.org/10.6084/m9.figshare.28107395 (ref. 61). The serum SERS spectra for stroke research are available at figshare https://doi.org/10.6084/m9.figshare.28107431 (ref. 62). The serum SERS spectra for AD research are available at figshare https://doi.org/10.6084/m9.figshare.28107578 (ref. 63). The SERS spectra of synthetic solutions for quantification are available at figshare https://doi.org/10.6084/m9.figshare.28107281 (ref. 64). The SERS spectra for background removal are available at figshare https://doi.org/10.6084/m9.figshare.28107326 (ref. 65) and https://doi.org/10.6084/m9.figshare.28107305. The simulated spectra of QM9S are available at figshare https://figshare.com/articles/dataset/QM9S_dataset/24235333 (ref. 66). Source data are provided with this paper.
Code availability
All the code for the DSCF model has been made public at Zenodo https://doi.org/10.5281/zenodo.15013288 (ref. 19).
References
Holmes, E., Wilson, I. D. & Nicholson, J. K. Metabolic phenotyping in health and disease. Cell 134, 714–717 (2008).
Nicholson, J. K. Global systems biology, personalized medicine and molecular epidemiology. Mol. Syst. Biol. 2, 52 (2006).
Alseekh, S. et al. Mass spectrometry-based metabolomics: a guide for annotation, quantification and best reporting practices. Nat. Methods 18, 747–756 (2021).
Julkunen, H. et al. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nat. Commun. 14, 604 (2023).
Hu, J. et al. RSPSSL: a novel high-fidelity Raman spectral preprocessing scheme to enhance biomedical applications and chemical resolution visualization. Light Sci. Appl. 13, 52 (2024).
He, H. et al. Noise learning of instruments for high-contrast, high-resolution and fast hyperspectral microscopy and nanoscopy. Nat. Commun. 15, 754 (2024).
Guo, S., Popp, J. & Bocklitz, T. Chemometric analysis in Raman spectroscopy from experimental design to machine learning–based modeling. Nat. Protoc. 16, 5426–5459 (2021).
Felten, J. et al. Vibrational spectroscopic image analysis of biological material using multivariate curve resolution–alternating least squares (MCR-ALS). Nat. Protoc. 10, 217–240 (2015).
Bi, X. et al. Molecule-resolvable SERSome for metabolic profiling. Chem https://doi.org/10.1016/j.chempr.2025.102528 (2025).
Su, H. et al. Surface-enhanced Raman spectroscopy study on the structure changes of 4-mercaptophenylboronic acid under different pH conditions. Spectrochim. Acta A Mol. Biomol. Spectrosc. 185, 336–342 (2017).
Giese, B. & McNaughton, D. Surface-enhanced Raman spectroscopic study of uracil. The influence of the surface substrate, surface potential, and pH. J. Phys. Chem. B 106, 1461–1470 (2002).
Zarei, M. et al. Machine learning analysis of Raman spectra to quantify the organic constituents in complex organic-mineral mixtures. Anal. Chem. 95, 15908–15916 (2023).
Koyun, O. C. et al. RamanFormer: a transformer-based quantification approach for Raman mixture components. ACS Omega 9, 23241–23251 (2024).
Brown, T. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems Vol. 33, 1877–1901 (NeurIPS, 2020).
Bi, X., Czajkowsky, D. M., Shao, Z. & Ye, J. Digital colloid-enhanced Raman spectroscopy by single-molecule counting. Nature 628, 771–775 (2024).
Luo, Y. et al. Component identification for the SERS spectra of microplastics mixture with convolutional neural network. Sci. Total Environ. 895, 165138 (2023).
Bi, X. et al. SERSomes for metabolic phenotyping and prostate cancer diagnosis. Cell Rep. Med. https://doi.org/10.1016/J.XCRM.2024.101579 (2024).
Ye, J. et al. Hypoxanthine is a metabolic biomarker for inducing GSDME-dependent pyroptosis of endothelial cells during ischemic stroke. Theranostics 14, 6071–6087 (2024).
Xue, B. Source code for deep spectral component filtering: DSCF_V1 (v1.2). Zenodo https://doi.org/10.5281/zenodo.15013288 (2025).
Isensee, F., Jaeger, P. F., Kohl, S. A. A., Petersen, J. & Maier-Hein, K. H. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18, 203–211 (2020).
Liu, Z. et al. Swin transformer: hierarchical vision transformer using shifted windows. In Proc. IEEE/CVF International Conference on Computer Vision 9992–10002 (IEEE, 2021).
Hughes, C., Gaunt, L., Brown, M., Clarke, N. W. & Gardner, P. Assessment of paraffin removal from prostate FFPE sections using transmission mode FTIR-FPA imaging. Anal. Methods 6, 1028–1035 (2014).
Meuse, C. W. & Barker, P. E. Quantitative infrared spectroscopy of formalin-fixed, paraffin-embedded tissue specimens: paraffin wax removal with organic solvents. Appl. Immunohistochem. Mol. Morphol. 17, 547–552 (2009).
Nallala, J., Lloyd, G. R. & Stone, N. Evaluation of different tissue de-paraffinization procedures for infrared spectral imaging. Analyst 140, 2369–2375 (2015).
De Lima, F. A. et al. Digital de-waxing on FTIR images. Analyst 142, 1358–1370 (2017).
Bai, B. et al. Deep learning-enabled virtual histological staining of biological samples. Light Sci. Appl. 12, 57 (2023).
Lotfollahi, M., Daeinejad, D., Berisha, S. & Mayerich, D. Digital staining of high-resolution FTIR spectroscopic images. In Proc. 2018 IEEE Global Conference on Signal and Information Processing, GlobalSIP 973–977 (IEEE, 2018).
Gobinet, C. et al. Automatic identification of paraffin pixels on FTIR images acquired on FFPE human samples. Anal. Chem. 93, 3750–3761 (2021).
Bi, X., Fang, Z., Deng, B., Zhou, L. & Ye, J. Ultrahigh Raman-fluorescence dual-enhancement in nanogaps of silver-coated gold nanopetals. Adv. Opt. Mater. 11, 2300188 (2023).
Wang, H. P. et al. Recent advances of chemometric calibration methods in modern spectroscopy: algorithms, strategy, and related issues. Trends Anal. Chem. 153, 116648 (2022).
Zou, Z. et al. A deep learning model for predicting selected organic molecular spectra. Nat. Comput. Sci. 3, 957–964 (2023).
Savitzky, A. & Golay, M. J. E. Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36, 1627–1639 (1964).
You, L. et al. An exposome atlas of serum reveals the risk of chronic diseases in the Chinese population. Nat. Commun. 15, 2268 (2024).
Oh, H. S. H. et al. Organ aging signatures in the plasma proteome track health and disease. Nature 624, 164–172 (2023).
Enroth, C. et al. Crystal structures of bovine milk xanthine dehydrogenase and xanthine oxidase: structure-based mechanism of conversion. Proc. Natl Acad. Sci. USA 97, 10723–10728 (2000).
Schwedhelm, E. et al. Trimethyllysine, vascular risk factors and outcome in acute ischemic stroke (MARK–STROKE). Amino Acids 53, 555–561 (2021).
Farthing, D. E., Farthing, C. A. & Xi, L. Inosine and hypoxanthine as novel biomarkers for cardiac ischemia: from bench to point-of-care. Exp. Biol. Med. 240, 821–831 (2015).
Dudka, I. et al. Comprehensive metabolomics analysis of prostate cancer tissue in relation to tumor aggressiveness and TMPRSS2-ERG fusion status. BMC Cancer 20, 437 (2020).
Chen, M. M. & Meng, L. H. The double faced role of xanthine oxidoreductase in cancer. Acta Pharmacol. Sin. 43, 1623–1632 (2021).
Sreekumar, A. et al. Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression. Nature 457, 910–914 (2009).
Clark, D. E. et al. The serine/threonine protein kinase, p90 ribosomal S6 kinase, is an important regulator of prostate cancer cell proliferation. Cancer Res. 65, 3108–3116 (2005).
Maitre, M., Klein, C., Patte-Mensah, C. & Mensah-Nyagan, A. G. Tryptophan metabolites modify brain Aβ peptide degradation: a role in Alzheimer’s disease? Prog. Neurobiol. 190, 101800 (2020).
Horgan, C. C. et al. High-throughput molecular imaging via deep-learning-enabled Raman spectroscopy. Anal. Chem. 93, 15850–15860 (2021).
Ribeiro, M. T., Singh, S. & Guestrin, C. ‘Why should I trust you?’: explaining the predictions of any classifier. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ed. Balaji, K.) 1135–1144 (ACM, 2016).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI 1, 9 (2019).
Liu, B., Yuan, Y., Pan, X., Shen, H.-B. & Jin, C. AttSiOff: a self-attention-based approach on siRNA design with inhibition and off-target effect prediction. Med-X https://doi.org/10.1007/s44258-024-00019-1 (2024).
Leopold, N. & Lendl, B. A new method for fast preparation of highly surface-enhanced Raman scattering (SERS) active silver colloids at room temperature by reduction of silver nitrate with hydroxylamine hydrochloride. J. Phys. Chem. B 107, 5723–5727 (2003).
Lee, P. C. & Meisel, D. Adsorption and surface-enhanced Raman of dyes on silver and gold sols. J. Phys. Chem. 86, 3391–3395 (1982).
Yuan, H. et al. Gold nanostars: surfactant-free synthesis, 3D modelling, and two-photon photoluminescence imaging. Nanotechnology 23, 075102 (2012).
Kaur, V., Tanwar, S., Kaur, G. & Sen, T. DNA-origami-based assembly of Au@Ag nanostar dimer nanoantennas for label-free sensing of pyocyanin. ChemPhysChem 22, 160–167 (2021).
He, K. et al. Masked autoencoders are scalable vision learners. In Proc. IEEE/CVF International Conference on Computer Vision 15979–15988 (IEEE, 2022).
Chau, S. L., Hu, R., London, A., Gonzalez, J. & Sejdinovic, D. RKHS-SHAP: Shapley values for kernel methods. Adv. Neural Inf. Process Syst. 35, 13050–13063 (2022).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proc. 34th International Conference on Machine Learning Vol. 70 (eds Precup, D. & Teh, Y. W.) 3145–3153 (PMLR, 2017).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proc. IEEE/CVF International Conference on Computer Vision 618–626 (IEEE, 2017).
Zeiler, M. D. & Fergus, R. Visualizing and understanding convolutional networks. In Proc. European Conference on Computer Vision 818–833 (Springer, 2014).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Proc. Advances in Neural Information Processing Systems Vol. 30 (eds Guyon, I. et al.) 4765–4774 (Curran Associates, 2017).
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In Proc. 34th International Conference on Machine Learning (JMLR, 2017).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In Proc. IEEE/CVF International Conference on Computer Vision 2921–2929 (IEEE, 2016).
Xue, B. FTIR of liver tissue. figshare https://doi.org/10.6084/m9.figshare.28107236.v1 (2024).
Xue, B. ComFilE for PCa. figshare https://doi.org/10.6084/m9.figshare.28107395.v1 (2024).
Xue, B. ComFilE for stroke. figshare https://doi.org/10.6084/m9.figshare.28107431.v1 (2024).
Xue, B. ComFilE for AD. figshare https://doi.org/10.6084/m9.figshare.28107578.v1 (2024).
Xue, B. SERS quantification. figshare https://doi.org/10.6084/m9.figshare.28107281.v1 (2024).
Xue, B. SERS nanoparticle background removal (PCa). figshare https://doi.org/10.6084/m9.figshare.28107326.v1 (2024).
Zou, Z. QM9S dataset. figshare https://doi.org/10.6084/m9.figshare.24235333.v3 (2023).
Acknowledgements
We gratefully acknowledge the financial support from the National Key Research and Development Program of China (grant numbers 2024YFF1502600 and 2022YFB4702702), the Major R&D Plan of Shanghai Municipal Science and Technology Commission (grant number 1DZ1100301), the National Natural Science Foundation of China (grant numbers 82272054, 81627801, 31971151, 623B2070 and 82373358), the Science and Technology Commission of Shanghai Municipality (grant numbers 24DIPA00300, 24490710800, 24490790900, 21511102100 and BI0820067/002), the Fundamental Research Funds for the Central Universities (grant numbers YG2024LC09 and YG2025ZD25), the Clinical Research Plan of Shanghai Hospital Development Center (grant number SHDC2020CR3014A), the ‘Clinic Plus’ Outstanding Project (grant number 2023ZYA007) from the Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine and the Shanghai Key Laboratory of Gynecologic Oncology, and the Qiqihar Science and Technology Plan Joint Guidance Project (grant number LHYD-202016). We thank Z. Zhao and B. Han at Qingdao University Affiliated Hospital for providing the hepatocellular carcinoma FFPE tissues. Our gratitude also extends to Z. Zou and W. Hu from the Department of Chemistry at Qilu University of Technology for their dataset contributions. We thank C. Wu at the School of Biomedical Engineering, Shanghai Jiao Tong University, for providing animal materials. We thank J. Zhao, Department of Neurology, Minhang Hospital, Fudan University, for providing serum samples from the stroke cohort. We are grateful to J. Pan and W. Xue, Department of Urology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, for providing the serum samples from the PCa cohort. We thank H. Tang, Department of Geriatrics, Rui Jin Hospital, School of Medicine, Shanghai Jiao Tong University, for providing serum samples from the AD cohort.
Author information
Contributions
C.J. and J.Y. conceived the research. B.X. designed and implemented the DSCF models and ComFilE, collected the IR spectra of human liver and rat kidney FFPE sections and organized the reader studies. X.B. synthesized the silver NPs, configured the synthetic mixtures and performed the SERS measurements. M.L., M.X., X.F. and Yizhe Yuan collected the H&E staining images of the human liver cancer FFPE sections. J.L., Y.Z., Y.C., S.L., R.W. and R.J. contributed to the model implementation. C.J. administered the project and provided guidance on the methodology. C.J. and J.Y. guided the SERS measurements of the serum samples. All the authors wrote and revised the paper. B.X. and X.B. contributed equally to this work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Erik Vartiainen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Examples of human liver FFPE sections.
Panels (left to right): infrared hyperspectral images, H&E staining, paraffin abundance maps, and cell annotations. Scale bar = 2 mm.
Extended Data Fig. 2 The scheme for the simulation of the datasets and explanation evaluation.
Each profile records the concentrations of the “biomarker” molecules and the other molecules, together with the interaction logic among them. The spectra were simulated from the profile and the spectral dictionary. The classification labels were assigned according to the concentration thresholds and the interaction logic among the biomarkers. Multiple interaction logics (such as ‘AND’, ‘OR’ and ‘SOFT’) were used to simulate the different interactions and correlations among molecules in various diseases. Recall and accuracy were evaluated by comparing the predicted biomarkers with the ground-truth biomarkers.
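The simulation scheme in this caption can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the uniform concentration sampling, the noise-free linear mixing of dictionary spectra, and the names `simulate_dataset` and `biomarker_recall` are all hypothetical. Only the ‘AND’ and ‘OR’ logics are covered, because the caption does not define ‘SOFT’, and only recall is shown, because the caption does not give the exact definition of explanation accuracy.

```python
import numpy as np


def simulate_dataset(n_samples, dictionary, biomarker_idx, threshold, logic, rng):
    """Simulate (profiles, spectra, labels) from a spectral dictionary.

    dictionary: array of shape (n_molecules, n_points), one spectrum per row.
    biomarker_idx: indices of the hypothetical "biomarker" molecules.
    """
    n_molecules = dictionary.shape[0]
    # Assumption: concentrations are drawn uniformly from [0, 1).
    profiles = rng.random((n_samples, n_molecules))
    # Assumption: a mixture spectrum is the concentration-weighted sum of
    # the dictionary spectra (linear mixing, no noise).
    spectra = profiles @ dictionary
    # Which biomarkers exceed the concentration threshold in each sample.
    above = profiles[:, biomarker_idx] > threshold
    if logic == "AND":
        labels = above.all(axis=1)   # every biomarker must be elevated
    elif logic == "OR":
        labels = above.any(axis=1)   # at least one biomarker is elevated
    else:
        raise ValueError(f"unsupported logic: {logic!r}")
    return profiles, spectra, labels.astype(int)


def biomarker_recall(predicted, truth):
    """Fraction of ground-truth biomarkers found among the predicted ones."""
    truth = set(truth)
    return len(truth & set(predicted)) / len(truth)


# Toy usage: 5 molecules, 50 spectral points, molecules 0 and 1 as biomarkers.
dictionary = np.random.default_rng(0).random((5, 50))
profiles, spectra, labels = simulate_dataset(
    200, dictionary, [0, 1], 0.5, "AND", np.random.default_rng(1)
)
print(spectra.shape, labels.mean())
```

With the same concentrations, every sample labelled positive under ‘AND’ is also positive under ‘OR’, which gives a quick sanity check on any implementation of the label logic.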
Extended Data Fig. 3 The workflow of comparison between multiple classical explanations + Raman Experts and ComFilE.
a Eleven classical explanation methods were applied to the trained classifiers to calculate heat maps. The Raman experts then ranked the molecules in decreasing order of their likelihood of being markers, referring to the heat maps and a spectral database of molecules. b ComFilE was applied to the trained classifiers to calculate the delta accuracy for each molecule, and the molecules were ranked in decreasing order of delta accuracy. c Evaluation of explanation efficacy.
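The ranking step in panel b can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: delta accuracy is assumed to be the drop in classification accuracy after one component is filtered out, and the filter is passed in as a generic callable (`filter_component`, a hypothetical name) that stands in for the DSCF model. The toy filter below simply zeroes one channel.

```python
import numpy as np


def delta_accuracy_ranking(classify, spectra, labels, filter_component, n_components):
    """Rank components by the accuracy lost when each one is filtered out.

    classify: callable mapping spectra -> predicted labels.
    filter_component: callable (spectra, k) -> spectra with component k removed
        (a stand-in for the learned filter; an assumption of this sketch).
    """
    baseline = np.mean(classify(spectra) == labels)
    deltas = np.empty(n_components)
    for k in range(n_components):
        filtered = filter_component(spectra, k)
        # Assumed definition: delta accuracy = baseline accuracy - filtered accuracy.
        deltas[k] = baseline - np.mean(classify(filtered) == labels)
    # Largest accuracy drop first; a stable sort keeps index order for ties.
    order = np.argsort(-deltas, kind="stable")
    return order, deltas


# Toy usage: each column is the pure signal of one molecule.
toy_spectra = np.array([
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.3],
    [0.1, 0.9, 0.5],
    [0.2, 0.1, 0.8],
])
toy_labels = (toy_spectra[:, 0] > 0.5).astype(int)


def toy_classifier(s):
    # Hypothetical classifier that only looks at molecule 0.
    return (s[:, 0] > 0.5).astype(int)


def zero_out(s, k):
    # Hypothetical filter: remove component k by zeroing its channel.
    out = s.copy()
    out[:, k] = 0.0
    return out


order, deltas = delta_accuracy_ranking(toy_classifier, toy_spectra, toy_labels, zero_out, 3)
print(order, deltas)
```

In this toy case only molecule 0 drives the classifier, so removing it is the only filtering that changes the accuracy and it is ranked first.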
Extended Data Fig. 4 Detailed results of ComFilE applied to synthetic data and clinical data.
(a-b) The relationship between the explanation efficacy and the scale of the training datasets: (a) Explanation recall versus the training dataset scale and (b) Explanation accuracy versus the training dataset scale. (c-d) The relationship between the explanation efficacy and the classification performance (n = 8): (c) Explanation accuracy versus classification accuracy and (d) Explanation recall versus classification accuracy. In summary, the numerical test indicates that the explanation results derived from models with a classification accuracy of over 90% hold a very high level of credibility (Top-3 accuracy > 98%). (e) The receiver operating characteristic (ROC) curves of prostate cancer diagnosis with different metabolites filtered. (f) The ROC curves of stroke diagnosis with different metabolites filtered. Data in this figure are all presented as means ± SD.
Extended Data Fig. 5 2nd Component Filtering Explanation.
a Workflow of the 2nd Component Filtering Explanation. b Demonstration of 2nd ComFilE results. c-g Coupling effects of distinct interaction logic by different classifiers. h Summary of coupling effects by different classifiers. i-m Additive effects of distinct interaction logic by different classifiers. n Summary of additive effects by different classifiers. For c-n, n = 72 for AND, SOFT and OR, and n = 2448 for none. Data in this figure are all presented as means ± SD.
Extended Data Fig. 6 Biomarker discovery of prostate cancer and Alzheimer’s disease using SERS spectra of serum samples.
a Diagnostic accuracy changes (δacc) of 22 metabolic molecules for prostate cancer, assessed by ComFilE. b Abundance differences of 22 metabolic molecules between prostate cancer patients (n = 116) and controls (n = 104), analyzed by a one-sided t-test (Supplementary Table 6). No multiple comparison adjustments were applied, as only one hypothesis was tested. c Diagnostic accuracy changes (δacc) of 22 metabolic molecules for Alzheimer’s disease (AD), assessed by ComFilE. d Abundance differences of 22 metabolic molecules between AD patients (n = 60) and controls (n = 43), analyzed by a one-sided t-test (Supplementary Table 7). No multiple comparison adjustments were applied, as only one hypothesis was tested. Box plots depict medians (center lines), 25th–75th percentiles (box bounds), and whiskers extending to the furthest points within 1.5 × IQR from the box edges.
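The box-plot convention stated in this caption can be made concrete with a short Python sketch. The function name `box_plot_stats` and the quartile interpolation method are assumptions for illustration; the caption does not say which quartile estimator was used.

```python
import statistics


def box_plot_stats(values):
    """Median, quartiles and 1.5 x IQR whiskers for a sample (needs >= 2 values)."""
    data = sorted(values)
    # Assumption: quartiles by linear interpolation over the closed data range.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the furthest data points that still lie inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Toy usage with one obvious outlier.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(stats)
```

For the toy sample the quartiles are 3 and 7, so the fences sit at -3 and 13; the whiskers therefore stop at 1 and 8, and 100 is drawn as an outlier rather than stretching the whisker.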
Supplementary information
Supplementary Information
Supplementary Notes 1–13 and Tables 1–7.
Source data
Source Data Fig. 1
Performances summary.
Source Data Fig. 2
Results of DSCF dewaxing via IR histological spectroscopy.
Source Data Fig. 3
Results of DSCF-based preprocessing in SERS. (a–e) NP background subtraction and (f–h) noise elimination in SERS, Raman and IR spectra.
Source Data Fig. 4
Results of spectral quantification.
Source Data Fig. 5
Results of the explanation efficacy of the ComFilE approach.
Source Data Fig. 6
Biomarker discovery of stroke via SERS spectra of serum samples.
Source Data Extended Data Fig. 1
IR hyperspectral images of FFPE sections.
Source Data Extended Data Fig. 4
Detailed results of ComFilE applied to synthetic data and clinical data.
Source Data Extended Data Fig. 5
Second ComFilE.
Source Data Extended Data Fig. 6
Biomarker discovery of PCa and Alzheimer’s disease using SERS spectra of serum samples.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xue, B., Bi, X., Dong, Z. et al. Deep spectral component filtering as a foundation model for spectral analysis demonstrated in metabolic profiling. Nat Mach Intell 7, 743–757 (2025). https://doi.org/10.1038/s42256-025-01027-5
DOI: https://doi.org/10.1038/s42256-025-01027-5