LipidIN: a comprehensive repository for flash platform-independent annotation and reverse lipidomics

Xu, Hao; Jiang, Tianhang; Lin, Yuxiang; Zhang, Lei; Yang, Huan; Huang, Xiaoyun; Mao, Ridong; Yang, Zhu; Zeng, Changchun; Zhao, Shuang; Di, Lijun; Zhang, Wenbin; Zeng, Jun; Cai, Zongwei; Lin, Shu-Hai

doi:10.1038/s41467-025-59683-5

Article
Open access
Published: 16 May 2025

Nature Communications volume 16, Article number: 4566 (2025)

Abstract

Improving annotation accuracy, coverage, speed and depth of lipid profiles remains a significant challenge in traditional lipid annotation. We introduce LipidIN, an advanced framework designed for flash platform-independent annotation. LipidIN features a 168.5-million lipid fragmentation hierarchical library that encompasses all potential chain compositions and carbon-carbon double bond locations. The expeditious querying module achieves speeds exceeding one hundred billion queries per second across all mass spectral libraries. The lipid categories intelligence model is developed using three relative retention time rules, reducing false positive annotations and predicting unannotated lipids with a 5.7% estimated false discovery rate, covering 8923 lipids across various species. More importantly, LipidIN integrates a Wide-spectrum Modeling Yield network for regenerating lipid fragment fingerprints to further improve accuracy and coverage, with an estimated 20% boost in recall. We further demonstrate the utility of LipidIN in multiple tasks for lipid annotation and biomarker discovery in clinical cohorts.

Introduction

Lipid structural information encompasses the subclass (head group), fatty acyl and alkyl/alkenyl composition (chain length and degree of unsaturation), C=C double-bond location, fatty acyl positional specificity, various substitutions and modifications, and geometric configuration^1,2,3. Although multiple annotation methods have been used in liquid chromatography-mass spectrometry (LC-MS)-based untargeted metabolomics and lipidomics to improve coverage, ~90% of the metabolic features cannot be annotated⁴. The annotation strategies primarily rely on three key aspects: (1) forward prediction that involves constructing a molecular library via standard analysis or in silico fragmentation to generate synthetic data^5,6, (2) reverse prediction that entails inferring molecular properties or structures directly from the spectra^7,8, and (3) networking prediction that clusters similar spectra to identify neighborhoods of structurally analogous compounds^9,10. Based on the structural similarity of molecules within each lipid subclass, theoretical predictions of MS/MS spectral patterns can be derived by interpreting the fragmentation relationships. Consequently, automated lipid annotation currently depends on MS/MS similarity calculation and decision-tree matching of mass spectra to spectral libraries, as implemented in tools such as MS-DIAL¹¹, LipidSearch¹², Spectral Entropy^13,14, and LipidMatch⁸ (Supplementary Table 1).

However, several problems exist in current lipidomics annotation. First, there are limitations in the matching algorithms. Both classic dot-product (cosine) and spectral entropy similarity algorithms overlook the actual significance of feature peaks, causing some degree of feature redundancy¹⁵. Second, it is challenging to obtain definitive information from low-abundance signals, particularly when it is necessary to confirm the presence of characteristic fragments, such as the neutral loss of specific fragments. Third, beyond subclasses (head groups) and chain compositions, more in-depth structural information, such as double-bond locations, cannot be discerned by most current annotation tools. Fourth, given the potential discrepancies between theoretical and actual spectra generated from different sample matrices, instruments, and analytical methods, researchers have to use personalized local databases to accumulate empirical knowledge effectively. These issues continue to hinder improvements in annotation accuracy and coverage in lipidomics studies.

Retention time (RT) or retention order (RO) in LC-MS analysis is associated with substructures within a molecule, the chromatographic column, the composition of eluents, the gradient, and the column temperature¹⁶. We hypothesized that deep learning models could extract the mapping relationship between lipid structure features (e.g., head group, chain length, and degree of unsaturation of each fatty acyl or alkyl/alkenyl composition) and RT (or RO), and that combining MS/MS- and RT-based scores could significantly improve annotation performance in LC-MS/MS-based lipidomics analyses. Advances in artificial intelligence (AI) have promoted metabolite annotation from complex mass spectrometric data, particularly MS/MS fragmentation search and MS/MS-explainable formula candidates^17,18,19,20. For instance, a multi-layer perceptron (MLP) is a type of artificial neural network consisting of multiple layers of neurons²¹. Ziming Liu et al. recently reported a neural network called the Kolmogorov-Arnold network (KAN) as an alternative to MLPs^22,23, inspired by the Kolmogorov-Arnold representation theorem²⁴.

In this work, we introduce LipidIN, namely lipidomics integration, an advanced tool for rapid lipid annotation and high-accuracy reverse lipid fingerprint spectrogram regeneration. Compared to existing tools, LipidIN demonstrates superior performance in lipid coverage (121 subclasses), mass spectral library querying speed, prediction accuracy, false positive removal rate, and comprehensive annotation of mass spectrometric data, including lipid molecules and isomers with different C=C locations. Furthermore, LipidIN incorporates a Wide-spectrum Modeling Yield network (WMYn) as “reverse lipidomics”, which regenerates highly accurate fingerprint spectrograms and easily transferable pre-trained models, independent of sample matrices, instruments, and analytical methods. In short, LipidIN provides an efficient and reliable method for lipidomics research.

Results

Overview of LipidIN framework

LipidIN contains a five-level spectral fragmentation tree encompassing 168.5 million lipids, including both the Paternò-Büchi (P-B) reaction^25,26 and electron-activated dissociation (EAD)²⁷ for lipid isomers with different C=C locations (Fig. 1a, Supplementary Figs. 1 and 2). The preprocessing stage involves the use of MSconvert²⁸ to convert raw data to .mzML files, and the creation of mass spectrometric information lists using RaMS²⁹, which is a reverse lipidomics approach allowing annotation without peak picking first. After this step, an expeditious querying (EQ) module, based on a non-informative prior greedy algorithm that searches theoretical spectra at ultra-fast speed, was used to calculate $\mathrm{Score}_{\mathrm{matched}}$ and $\mathrm{Score}_{\mathrm{ratio}}$ for decision trees (Fig. 1b, c). This enables us to match and highlight the importance of mass spectrometric features within the 168.5-million theoretical lipid library at a querying speed of over one hundred billion times per second. Utilizing $\mathrm{Score}_{\mathrm{matched}}$ and $\mathrm{Score}_{\mathrm{ratio}}$ as a priori information, we further leveraged a lipid categories intelligence (LCI) module to extrapolate relationships between carbon number, double bond equivalents (DBEs), and intraclass relative RT. After training, a multi-simulation model is used to reduce false positives and predict candidate annotations without MS/MS fragments (Fig. 1d, f). Building on this output, we further designed the WMYn to regenerate high-accuracy fingerprint spectra, which refer to unique ion features of lipid molecules and the common characteristic spectra shared across various mass spectrometry systems, and an easily transferable pre-trained model. This network maps the mass-to-charge ratios and intensities of multiple batches into a shared latent space by integrating experimental results from a large number of batches, using network layers and a self-attention encoder for feature learning (Fig. 1e, g).

Fragmentation trees for a five-level hierarchical library

First of all, we need to establish fragmentation trees for a 5-level hierarchical library, which is a mass spectrometry database categorized by each MS² characteristic peak based on lipid structures. By screening fragment ions and structural characteristics of each lipid subclass in published libraries (more references listed in Supplementary Data 1)³⁰, we classified the self-computed feature peaks into five levels, including (1) precursor ion and head group, (2) fatty acyl and alkyl/alkenyl chain (hereafter side chain), (3) head group neutral loss (NL), (4) side chain NL, and (5) regeneration of fragment fingerprints (Fig. 1a).

The 1st-level library comprises the precursor ion and the lipid head group fragment ion. From these two characteristic peaks, the lipid subclass and the lipid adduct ion, as well as the total carbon number and total DBEs, can be determined. The 2nd level comprises the side chains and other peaks representing specific chain compositions. The 3rd level, head group NL, is complementary to the 1st level. The 4th level, side chain NL, is complementary to the 2nd level. In our analysis of C=C locations, we performed the P-B reaction using LC-MS/MS analysis²⁶ (Supplementary Fig. 1) and employed EAD on a SCIEX ZenoTOF 7600 system for lipidomics^31,32 (Supplementary Fig. 2). The fragmentation trees of lipids generated from both methods are integrated into the 1st- to 4th-level hierarchical library.

The 5th-level library represents molecular fingerprint peaks, summarizing the MS¹ and MS² features in different ionization modes and chromatographic conditions. To this end, we designed the WMYn to regenerate highly accurate fingerprint spectra and execute cross-platform migration, resulting in enhanced subsequent annotations. We conceptualized it as “reverse lipidomics” and applied this strategy for regenerating the 5th-level library during data processing by small-sample learning, which consists of three training stages.

Overall, the generalized fragmentation rules described above have been used to construct libraries for 121 lipid subclasses, containing 168.5 million theoretical lipid fragments (Supplementary Data 1).

Relative retention time regularities of lipid subclasses

In principle, the intraclass relative RT plays an important role in lipid annotation. Accordingly, three rules were statistically analyzed and defined using over 100 published datasets covering various biological samples^11,33,34,35,36 (Fig. 2 and Supplementary Data 2): (1) The intra-subclass relative RT of lipids with the same DBEs follows a second-order polynomial trend line with carbon number, which is generally recognized as the equivalent carbon number (ECN), in line with previous reports^37,38; (2) The fitted functions of lipids within a subclass but with different DBEs are parallel to one another, and the degree of unsaturation is positively correlated with the intercept of the fitted equation, defined as intra-subclass unsaturation parallelism (IUP); (3) Isomers with different separated chain compositions can be fitted by different functions, defined as equivalent separated carbon number (ESCN).
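The ECN rule can be illustrated with a short least-squares sketch. This is not the LCI implementation: the function names, the carbon numbers, and the RT values below are hypothetical, and only the second-order polynomial form and the 0.5 min tolerance (used later for the FDR criteria) come from the text.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c using the 3x3 normal equations."""
    # Power sums of x and mixed moments of (x, y).
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix for the unknowns in the order (c, b, a).
    m = [
        [s[0], s[1], s[2], t[0]],
        [s[1], s[2], s[3], t[1]],
        [s[2], s[3], s[4], t[2]],
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, 3):
            factor = m[row][col] / m[col][col]
            for k in range(col, 4):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    sol = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):
        acc = m[row][3] - sum(m[row][k] * sol[k] for k in range(row + 1, 3))
        sol[row] = acc / m[row][row]
    c, b, a = sol
    return a, b, c


def predict_rt(coeffs, carbons):
    """Predicted RT (min) for a given total carbon number."""
    a, b, c = coeffs
    return a * carbons ** 2 + b * carbons + c


def complies_with_ecn(coeffs, carbons, observed_rt, tolerance_min=0.5):
    """True if the observed RT lies within the tolerance of the fitted trend."""
    return abs(predict_rt(coeffs, carbons) - observed_rt) <= tolerance_min


# Hypothetical RTs (min) for one subclass at a fixed DBE count.
carbon_numbers = [14, 16, 18, 20, 22]
retention_times = [5.76, 6.76, 7.84, 9.00, 10.24]
ecn_curve = fit_quadratic(carbon_numbers, retention_times)
```

Under IUP, one such curve would be fitted per DBE level within a subclass, and a candidate annotation whose RT falls far from the curve for its own DBE level would be flagged as a likely false positive.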

Next, we took a published dataset as an example to illustrate the above rules³⁴ (Fig. 2a). After polynomial fitting, FAs with different levels of DBEs show a clear trend of increasing functions (Fig. 2b). By setting appropriate confidence intervals, we observed that these intervals do not intersect, which is characteristic of ECN and IUP. For instance, PCs with four DBEs were highly consistent with ESCN (Fig. 2c). In this dataset, two DBE combinations, 0:4 and 1:3, were separately annotated and well fitted. We also examined the consistency of the three rules with changes in carbon chain length and unsaturation level. Furthermore, we analyzed the >100 published datasets and found that most datasets closely satisfy these three rules, with a median conformity of about 90% in both positive and negative ionization modes (Fig. 2d–f and Supplementary Fig. 3). The results show that, regardless of lipid subclass, the numbers of side chains, carbons and DBEs exhibit high agreement with the three rules. We further fitted a linear relationship between the number of annotations and the rules’ compliance (Supplementary Fig. 3g), showing that only a small portion of the lipid annotation data fails to meet these three rules. Based on the absolute RT errors under the rules, the average absolute RT deviation rates for the ECN and ESCN rules are 1.44% and 1.28%, respectively (Supplementary Fig. 4a–c). This finding further supports the universality of these rules, although the lipid subclasses BA (bile acids) and ST (sterols) have larger deviation values due to insufficiently detailed classification in the published datasets.

Performance of EQ and LCI modules

Flash entropy search appears to be the fastest tool to query all mass spectral libraries thus far, with a squared time complexity of $O(nm\Delta)$. In contrast, the EQ module has a linear time complexity of $O(m\Delta)$, giving a faster querying speed. To assess the effect of library size on spectral matching speed, EQ and flash entropy were run on a published dataset³⁹ (MetabolomicsWorkbench ST001794) against theoretical libraries of different sizes from MassBank, and all methods were tested on the same low-memory personal computer, using a single thread and CPU. As the number of spectra increases, the EQ matching time remains virtually unchanged even in a ten-million library querying task, taking only about 2.3 µs to complete 10,000,000 spectral library queries. Flash entropy also performed well with small libraries, but once the library size reached millions, the time consumption spiked, slowing to 0.14 s for each MS² spectrum queried against a 1,000,000-spectrum library (Fig. 3a). Owing to the use of hash tables and bisection methods in EQ, users can compute one billion MS²-spectral comparisons in 0.23 ms (over four thousand billion queries per second), which is around 60,000 times faster than flash entropy on a classic low-memory personal computer when searching against one million spectra from MassBank.
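To make the idea of combining hash tables with bisection concrete, the following is a minimal sketch of one way such a lookup could be organized. It is an assumption for illustration only, not the EQ module itself: the bucket scheme (integer part of m/z as the hash key), the function names, and the tiny example library are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(library):
    """Group (fragment_mz, label) pairs into hash buckets keyed by integer m/z.

    Each bucket stores its m/z values sorted, so a tolerance window can be
    located by bisection instead of scanning the whole library.
    """
    buckets = defaultdict(list)
    for mz, label in library:
        buckets[int(mz)].append((mz, label))
    index = {}
    for key, entries in buckets.items():
        entries.sort()
        index[key] = ([mz for mz, _ in entries], [label for _, label in entries])
    return index


def query(index, mz, tol=0.01):
    """Return the labels of all library fragments within +/- tol of mz."""
    hits = []
    # The tolerance window may straddle two integer buckets.
    for key in range(int(mz - tol), int(mz + tol) + 1):
        if key not in index:
            continue
        mzs, labels = index[key]
        lo = bisect_left(mzs, mz - tol)
        hi = bisect_right(mzs, mz + tol)
        hits.extend(labels[lo:hi])
    return hits


# Illustrative library entries; the labels are placeholders.
example_library = [
    (184.0733, "PC head group"),
    (184.5000, "unrelated fragment"),
    (255.2330, "FA 16:0"),
    (281.2486, "FA 18:1"),
]
example_index = build_index(example_library)
```

With this layout, the cost of one lookup depends on the bucket size rather than on the total library size, which is the general property that keeps query time nearly flat as a library grows.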

To verify the performance of the algorithm, we used published datasets^34,39 and compared various commonly used tools for small-molecule annotation, including MS-DIAL¹¹, LipidSearch¹², Spectral Entropy^13,14, and LipidMatch⁸ (Fig. 3b for lipidomics and Supplementary Fig. 5a, b for metabolomics); all parameters are shown in Supplementary Table 2. MS entropy neither identifies lipids with single characteristic peaks without daughter ions, such as FA (Supplementary Data 3), nor discriminates some lipid isomers by their similarity scores. For example, flash entropy annotated SM d-38:2 (SM 18:2;2O/20:0) as SM molecules with multiple fatty acyl compositions ranked with the same similarity score (Supplementary Data 4), whereas EQ assigns the highest score to SM d-38:2 (SM 18:2;2O/20:0) (Supplementary Data 7). In our LipidIN framework, by contrast, the EQ module with the MS-DIAL public database (https://systemsomicslab.github.io/compms/msdial/main.html#MSP) achieved ~70% recall@Top-20. Here, recall is defined as the number of annotations shared by the tool and the article divided by the total number of article annotations. Furthermore, based on the 1st- to 4th-level hierarchical library, EQ combined with the LCI module achieved over 90% recall@Top-20, outperforming the above tools by taking advantage of the three relative RT rules (ECN, IUP and ESCN) and the 1st- to 4th-level lipid fragmentation library (Fig. 3b). Similarly, we tested LipidIN in small-molecule annotation on another published dataset, where LipidIN achieved 88.26%, 1.4688-fold higher than entropy search (Supplementary Fig. 5a). In addition, we statistically analyzed recall by lipid subclass and found that LipidIN had an advantage in the annotation of cardiolipin (CL), N-acyl ethanolamine (NAE), oxidized fatty acid (OxFA), oxidized phosphatidylethanolamine (OxPE), and triacylglycerol (TG) subclasses, although LipidSearch showed superior performance in the annotation of lysophosphatidylserine (LPS) and ceramide phosphate (CerP) subclasses. Interestingly, EQ with or without the LCI module using the MS-DIAL public database outperformed the MS-DIAL program (https://zenodo.org/records/12589462), in particular for TG annotation, suggesting that the EQ module encompassing MS² fragment annotations surpassed the similarity algorithm in the MS-DIAL program. Moreover, LipidIN showed the best performance when using our 168.5-million lipid fragmentation hierarchical library, which includes 121 lipid subclasses, indicating higher coverage in the home-made hierarchical library than in the MS-DIAL public library (Fig. 3c).
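The recall definition used in this comparison (annotations shared by tool and article, divided by all article annotations) can be written down as a short sketch. The data layout, the function name, and the example annotations are assumptions made for illustration.

```python
def recall_at_k(ranked_candidates, reference, k=20):
    """Fraction of reference annotations recovered within the top-k candidates.

    ranked_candidates: feature id -> list of candidate names, best first.
    reference: feature id -> annotation reported in the original article.
    """
    if not reference:
        return 0.0
    hits = sum(
        1
        for feature_id, truth in reference.items()
        if truth in ranked_candidates.get(feature_id, [])[:k]
    )
    return hits / len(reference)


# Hypothetical example: four published annotations, three features searched.
reference = {
    "f1": "PC 16:0_18:1",
    "f2": "TG 16:0_18:1_18:2",
    "f3": "SM 18:1;2O/16:0",
    "f4": "FA 18:1",
}
candidates = {
    "f1": ["PC 16:0_18:1", "PC 16:1_18:0"],
    "f2": ["TG 16:0_18:0_18:3", "TG 16:0_18:1_18:2"],
    "f3": ["SM 18:0;2O/16:1"],
}
```

In this toy case only `f1` is recovered at rank 1, while `f1` and `f2` are both recovered within the top two, so recall rises as `k` grows; features that a tool never annotates (`f4`) still count in the denominator.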

Lipid Data Analyzer³⁷ (LDA; http://genome.tugraz.at/lda2), a platform-independent lipid annotation method, uses an RT model to filter incorrect species identifications. To validate the effectiveness of the LCI module in removing false positives based on the aforementioned relative RT rules, we compared it with the RT prediction algorithm proposed by LDA (Supplementary Data 5). As a result, the LCI module in the LipidIN framework exhibited significantly superior accuracy to LDA across almost all lipid subclasses (Fig. 3d).

For lipid candidates without MS/MS fragmentation, we manually built the test set by removing the MS² information of 10% of the spectra in the annotations. Running the LCI module with 10-fold random sampling, the accuracies of spectral prediction were 75.0% and 82.6% recall@1 in positive and negative ionization modes, respectively, and nearly 85% recall@10 in both ionization modes (Supplementary Fig. 5c).

In addition to ultra-fast performance of EQ and false-positive removal capability of LCI, the combination of EQ and LCI modules can also improve the coverage of lipid annotation compared to MS-DIAL, LipidSearch, and LipidMatch methods. Our approach using EQ and LCI modules annotated 471 unique lipids from the mixture dataset compared with other tools (Supplementary Fig. 5d and Supplementary Data 6). We further validated these 471 lipids by manually checking their MS/MS fragmentation and ensuring the agreement of their relative RT with aforementioned rules (Supplementary Fig. 5e). Therefore, the combination of EQ and LCI modules can not only cover most lipids highly recognized by multiple methods, but also annotate more high-confidence lipids, which were unannotated by all other methods we tested.

Of great interest, based on the rules of relative RT (ECN, IUP and ESCN), LCI achieves a higher recall when applied after the EQ module, using datasets from various chromatographic and mass spectrometric conditions previously reported^35,40,41,42 (Supplementary Fig. 5e). Taking the false discovery rate (FDR) into account, EQ plus LCI achieved an average FDR of 5.69% at a cutoff threshold of 2.4 under strict criteria, annotating 8923 lipids in four datasets including RBL-2H3 cells, the mixture dataset, human sera, and zebrafish tissues (Fig. 3e–h and Supplementary Data 7). We regarded annotations as correct only if they both complied with the ECN rule within a tolerance of 0.5 min and contained all feature peaks at high intensity. In short, LipidIN is an ultra-fast, highly accurate, high-coverage, and platform-independent framework for lipid annotation.
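The FDR estimate described above can be sketched as a simple ratio over the annotations that pass the score cutoff. Only the 2.4 cutoff and the 0.5 min ECN tolerance come from the text; the record fields, the reduction of "all feature peaks at high intensity" to a single boolean, the use of `>=` at the cutoff, and the toy annotations are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    name: str
    score: float
    rt_deviation_min: float        # observed RT minus ECN-predicted RT
    all_feature_peaks_present: bool


def is_correct(annotation, rt_tol=0.5):
    """Strict criterion: within the ECN tolerance and all feature peaks found."""
    return (
        abs(annotation.rt_deviation_min) <= rt_tol
        and annotation.all_feature_peaks_present
    )


def estimated_fdr(annotations, cutoff=2.4):
    """Share of accepted annotations (score >= cutoff) that fail the criterion."""
    accepted = [a for a in annotations if a.score >= cutoff]
    if not accepted:
        return 0.0
    n_false = sum(1 for a in accepted if not is_correct(a))
    return n_false / len(accepted)


# Hypothetical annotations for illustration.
toy_annotations = [
    Annotation("PC 16:0_18:1", 3.1, 0.10, True),
    Annotation("PE 18:0_20:4", 2.9, -0.20, True),
    Annotation("TG 16:0_18:1_18:2", 2.6, 0.80, True),   # RT too far off
    Annotation("SM 18:1;2O/16:0", 2.5, 0.05, True),
    Annotation("LPC 18:0", 1.9, 1.50, False),           # below the cutoff
]
```

Raising the cutoff trades coverage for a lower estimated FDR, so in practice the threshold would be chosen by scanning several cutoffs and inspecting both quantities.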

Reliability and flexibility of reverse lipidomics

To strengthen high-confidence and high-coverage annotation, we still need to regenerate the fragmentation tree hierarchical library from authentic lipid spectra. Therefore, we established the WMYn, which consists of three stages for reverse lipidomics. The first stage is designed for intra-feature learning, mapping the MS² sparse matrix of mass-to-charge ratios and intensities into the same dimensional space with 512 rows for each lipid. The second stage is designed to regenerate higher mass spectrometric resolution by incorporating an encoder with one separate layer, in which the encoder consists of multi-head self-attention and one network layer that performs inter-feature learning. The third stage aims to enhance the prediction of fragment ions with narrow tolerance, performing downsampling and upsampling sequentially to regenerate the lipid fingerprints that are included in the 5th-level hierarchical library (Fig. 1g). We also compared two activation functions in the WMYn of LipidIN: ReLU and Sigmoid Linear Unit (SiLU). We found that SiLU is more flexible and achieves higher accuracy than ReLU in small-sample learning (Supplementary Fig. 6).
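For readers unfamiliar with the two activation functions compared here, the sketch below gives their standard definitions, ReLU(x) = max(0, x) and SiLU(x) = x · sigmoid(x). It is a scalar illustration only and says nothing about how WMYn itself is implemented.

```python
import math


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def silu(x):
    """Sigmoid linear unit, x * sigmoid(x), written to avoid overflow in exp()."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


# SiLU is smooth and lets small negative values through, whereas ReLU
# clips every negative input to zero.
sample_inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
relu_outputs = [relu(x) for x in sample_inputs]
silu_outputs = [silu(x) for x in sample_inputs]
```

The non-zero gradient of SiLU for negative inputs is one commonly cited reason it can behave differently from ReLU when training data are scarce, although the comparison reported in Supplementary Fig. 6 is empirical.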

Next, we tested the regenerative capacity of WMYn. We utilized lipid reference standards (9 in positive ion mode and 6 in negative ion mode, Supplementary Data 8), collected three consecutive times under the same experimental conditions, and used “spectral entropy similarity”¹⁴ to evaluate the prediction results from WMYn. First, fingerprints were regenerated from WMYn with one injection of the reference standards, which were regarded as the ground truth. Second, we performed liquid chromatography tandem mass spectrometry (LC-MS/MS) analysis of 333 clinical sera samples from breast cancer patients and healthy subjects, collected on an LC-Orbitrap Exploris 240 MS system (the biomarker discovery of this cohort is discussed below), and then annotated lipids using the EQ and LCI modules with the 1st- to 4th-level hierarchical library. Third, to regenerate the 5th-level fingerprints of the library and promote the transfer of lipid annotation across platforms, WMYn was trained on the data from the 333 samples. The entropy similarity was then used to evaluate the difference between the predicted fingerprints and each of the three injections of reference standards, resulting in an average similarity of 0.9826 (Fig. 4a and Supplementary Figs. 7 and 8). To further verify the lipid annotation with reference standards, the RT deviations were calculated to be around 0.03 min, confirming the accuracy of the LipidIN framework (Fig. 4b). More importantly, to test platform transfer, we used the lipidomics data from 105 clinical sera samples acquired on an Agilent LC-quadrupole time-of-flight (qTOF) MS system, and identified the 15 lipid reference standards in both positive and negative ionization modes at a similarity score of around 0.9, demonstrating LipidIN’s flexibility (Fig. 4c).
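As background for the similarity metric used in this evaluation, the following is a minimal sketch of unweighted spectral entropy similarity, computed as 1 − (2·S_AB − S_A − S_B)/ln 4, where S is the Shannon entropy of a sum-normalized spectrum and AB is the 1:1 merged spectrum. It assumes non-empty spectra with positive total intensity, uses a simple greedy peak match within a hypothetical 0.01 Da tolerance, and omits the intensity weighting applied to low-entropy spectra in the published method.

```python
import math


def normalize(peaks):
    """Sort peaks by m/z and scale intensities so they sum to 1."""
    total = sum(intensity for _, intensity in peaks)
    return sorted((mz, intensity / total) for mz, intensity in peaks if intensity > 0)


def shannon_entropy(probabilities):
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def entropy_similarity(spec_a, spec_b, tol=0.01):
    """Unweighted entropy similarity between two peak lists of (mz, intensity)."""
    a, b = normalize(spec_a), normalize(spec_b)
    merged = []
    i = j = 0
    # Walk both sorted peak lists, pairing peaks that agree within tol.
    while i < len(a) and j < len(b):
        if abs(a[i][0] - b[j][0]) <= tol:
            merged.append((a[i][1] + b[j][1]) / 2)
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            merged.append(a[i][1] / 2)
            i += 1
        else:
            merged.append(b[j][1] / 2)
            j += 1
    merged.extend(p / 2 for _, p in a[i:])
    merged.extend(p / 2 for _, p in b[j:])
    s_ab = shannon_entropy(merged)
    s_a = shannon_entropy(p for _, p in a)
    s_b = shannon_entropy(p for _, p in b)
    return 1 - (2 * s_ab - s_a - s_b) / math.log(4)


# Hypothetical spectra: identical peak lists score 1, disjoint ones score 0.
predicted = [(184.0733, 100.0), (496.3398, 35.0), (478.3292, 10.0)]
measured = [(184.0735, 95.0), (496.3401, 40.0), (478.3290, 12.0)]
similarity = entropy_similarity(predicted, measured)
```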

As mentioned above, the second stage of WMYn is designed for regenerating higher mass spectrometric resolution. Here, we compared the predictions of WMYn with other fitting methods, including the mean method, linear fitting, polynomial fitting, and exponential fitting, and used entropy similarity to measure the difference between the predicted spectra and the standardized spectra. On the one hand, low-resolution MS values with two decimals were obtained by downsampling the corresponding high-resolution data regenerated from 333 samples on the Orbitrap Exploris 240 MS system. The fitting results showed that all methods had good similarity and WMYn still obtained the highest similarity scores (Fig. 4d). On the other hand, when we kept high-resolution MS data with four decimals, WMYn was much better than the other fitting methods (Fig. 4e). To extend reverse lipidomics applications, we utilized the MS-DIAL public database with or without the 5th-level library for lipid annotation in the Entropy Search environment (https://github.com/YuanyueLi/FlashEntropySearch), and obtained higher recall at different cutoff thresholds with the 5th-level library (Fig. 4f). In particular, higher recalls for most lipid subclasses were achieved with the theoretical spectral library from MS-DIAL at a modest cutoff threshold of 0.75 similarity score when the 5th-level library was included (Fig. 4g). These results suggest that the 5th-level library considerably strengthens Entropy Search in lipid annotation, although LipidIN can perform lipid annotation independently.

**Fig. 5: Application of LipidIN to breast cancer clinical data.**

Analysis of aging-associated lipidome atlas in mice and NIST SRM 1950

To verify the robustness of LipidIN, we applied the framework to a series of datasets from a recent report mapping an aging-associated lipidome atlas in mice³³ and to NIST SRM 1950⁴³. We investigated the recall of the reported 2704 lipids in the MS-DIAL public database, as well as in the 1st- to 4th-level hierarchical library and the 1st- to 5th-level hierarchical library, respectively (Supplementary Fig. 9a). The results showed clearly that the hierarchical library is advantageous in lipid annotation over the MS-DIAL public database. Specifically, LipidIN achieved a 93.64% recall using the 5-level hierarchical library. Furthermore, we calculated the recall by lipid subclass, highlighting a strong contribution of reverse lipidomics to the annotation of almost all lipid species (Supplementary Fig. 9b). Of note, some rare lipid subclasses such as N-acyl glycine serine (NAGlySer), N-acyl ornithine (NAOrn), and N-glycolyl GM3 (NGcGM3) were also identified by LipidIN.

As a platform-independent framework, LipidIN achieved much higher accuracy than LDA in dozens of lipid subclasses on the same datasets of the aging-associated lipidome atlas in mice (Supplementary Fig. 9c), similar to the observation above (Fig. 3d). Specifically, we show the mass spectra of 12 lipids identified by LipidIN but not reported in the article on the aging-associated lipidome atlas, derived from the reported raw data for confirmation (Supplementary Figs. 10 and 11). Together, we demonstrated that LipidIN, including the EQ, LCI, and reverse lipidomics modules, is also suitable for annotating such published lipidomic datasets from the raw mass spectrometric data, benefiting from a more powerful hierarchical library and computation.

In the benchmarking experiment with NIST SRM 1950, we compared annotations from LipidIN, MS-DIAL (version 5.1), Entropy Search, and the article, along with the additional use of Lipid Hunter⁴⁴ and Lipid Annotator⁴⁵. In this benchmark experiment, we focused on the Top@1 results. LipidIN produced the highest total number of annotations, 434. This achievement reflects deeper identification of fatty acid (FA) compositions. However, we noted that while all other methods identified a significant number of phosphatidylcholines (PCs) and sphingomyelins (SMs), these annotations often lacked precise FA information and were therefore excluded from this analysis. In terms of annotated intersections, LipidIN not only achieved the highest number of intersections but also annotated more lipids that were not annotated by other tools within the Top@1 category, thereby demonstrating its robust functionality. Notably, only 30 lipid molecules remained unannotated by LipidIN but were identified by at least two other tools. We further investigated why these 30 lipids were overlooked by LipidIN. Among them, PC 18:0_22:5, which displayed low intensities of FA chains, was omitted by LipidIN. The other 29 lipids were not identified by LipidIN because of large RT deviation, failed feature extraction by RaMS, a lack of critical information on characteristic ions, or misannotation by other software.

As mentioned above, 15 lipid standards were used for validation of the WMYn model (Fig. 4). We further utilized the known lipids from NIST SRM 1950 as lipid references, and were able to select 87 lipids from both positive and negative ion modes, encompassing the most common lipid subclasses. In the previous experiments, these lipids were annotated by at least two tools and achieved a top rank of 1 (Supplementary Fig. 9d–g). WMYn was trained on all 105 samples from cohort 3, with 3000 epochs and an MS² tolerance of 0.01 Da for calculating cosine similarity. The final cosine similarity between the predicted and experimental spectra shows a mean spectral similarity above 0.96 in both positive and negative ionization modes (Supplementary Fig. 12), demonstrating the universality of the WMYn model.
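The cosine similarity with a 0.01 Da MS² tolerance mentioned above can be sketched as follows. This is an illustrative greedy-matching variant, not necessarily the exact scoring used for Supplementary Fig. 12; the function name and the example peak lists are hypothetical.

```python
import math


def cosine_similarity(spec_a, spec_b, tol=0.01):
    """Cosine similarity between two peak lists of (mz, intensity) pairs.

    Peaks are paired greedily in m/z order when they agree within tol (Da);
    unmatched peaks contribute to the norms but not to the dot product.
    """
    a, b = sorted(spec_a), sorted(spec_b)
    dot = 0.0
    i = j = 0
    while i < len(a) and j < len(b):
        diff = a[i][0] - b[j][0]
        if abs(diff) <= tol:
            dot += a[i][1] * b[j][1]
            i += 1
            j += 1
        elif diff < 0:
            i += 1
        else:
            j += 1
    norm_a = math.sqrt(sum(intensity ** 2 for _, intensity in a))
    norm_b = math.sqrt(sum(intensity ** 2 for _, intensity in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical predicted and experimental spectra.
predicted_spectrum = [(100.0, 3.0), (200.0, 4.0)]
experimental_spectrum = [(100.004, 3.0), (200.003, 4.0)]
score = cosine_similarity(predicted_spectrum, experimental_spectrum)
```

Because matching is tolerance-based, a shift of a few mDa leaves the score unchanged, while a shift larger than the tolerance removes the matched pair from the dot product entirely.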

Establishment of breast cancer lipid marker panels in clinical cohorts

Regarding clinical applications, we further applied various liquid chromatography (LC) coupled high-resolution mass spectrometry methods to sera lipidomics of breast cancer patients and healthy subjects. We collected samples from two independent cohorts of 1393 samples and 333 samples, respectively. We applied LipidIN and annotated 4747 lipids, covering 53 lipid subclasses, at an average of 133.47 billion queries per second (Fig. 5a and Supplementary Data 9). Through importance selection with a random forest model (Fig. 5b), we screened 10 featured lipids and constructed a Light Gradient Boosting Machine (LightGBM) model⁴⁶. Using this lipid marker panel to distinguish breast cancer patients from healthy subjects, the model achieved an accuracy of 96.93% in the first cohort of 1393 samples. The same lipid marker panel achieved an accuracy of 79.61% in the 333 cases of the second cohort, indicating that the selected biomarker panel has a certain degree of credibility (Fig. 5c and Supplementary Fig. 13). We utilized weighted correlation network analysis (WGCNA)⁴⁷ to correlate the changes in clinical manifestations and lipid levels (Fig. 5d, e, f, and Supplementary Data 10). In the correlation analysis of the first cohort (Fig. 5d), we found that the levels of hexosylceramide (HexCer) and ceramide (Cer) were associated with diabetes mellitus in breast cancer patients (Fig. 5e). We also found that several Cer and lysophosphatidylcholine (LPC) species are positively correlated with tumor grading and tumor size (Fig. 5f). Therefore, LipidIN can help relate lipidomic data to clinical manifestations. It should be noted that EAD on the SCIEX ZenoTOF 7600 system was also applied for lipidomic analysis of the 333 cases in the second cohort. We therefore compared the annotation of triglycerides (TGs) by LipidIN and MS-DIAL (version 5.1). For the C:DB annotations, LipidIN identified more TG isomers with C=C locations than MS-DIAL (Supplementary Fig. 14). Interestingly, MS-DIAL predicted one TG molecule for which the mass spectrum contained no C=C location information, whereas LipidIN successfully flagged this wrong annotation (Supplementary Fig. 14d).

Identification of lipid markers associated with breast cancer lung metastasis

In the third clinical cohort of 105 human sera samples, including 31 cases of breast nodules, 32 of breast cancer without lung metastasis (hereafter breast cancer), 22 of breast cancer lung metastasis, and 20 of female lung cancer, we identified a total of 4854 lipids covering 52 subclasses. We generated volcano plots for two-group comparisons, but these could not highlight potential biomarkers for discriminating the four groups at a 1.4-fold change (Fig. 6a and Supplementary Fig. 15). We therefore performed an in-depth double-bond positional resolution of phospholipids. By conducting the Paternò-Büchi reaction in lipidomics analysis, we found that the double bond positions of PC 18:1_20:1 could be effectively differentiated by fine double-bond positional resolution (Fig. 6b). Patients with breast cancer lung metastasis had a higher level of side chain C18:1 (delta15) than the other three groups, patients with breast nodules had a higher level of side chain C20:1 (delta11) than the other three groups, and patients with female lung cancer had a higher level of side chain C20:1 (delta14) than the other three groups. These results suggest that LipidIN is suitable for resolving phospholipid isomers with in-depth C=C locations for biomarker discovery.

**Fig. 6: Analysis of Breast cancer and Lung cancer Clinical data.**

Discussion

Similarity algorithms such as those used in MS-DIAL, LipidSearch, and MS entropy might cause feature redundancy. In contrast, EQ combined with a 5-level hierarchical library of MS² fragment annotations could avoid feature redundancy. In addition, the data structure optimization in the EQ module, which reaches linear time complexity by applying hash tables and bisection methods, outperforms the squared time complexity of flash entropy search, thereby profoundly enhancing recall and spectral querying speed (Fig. 3a). Furthermore, the combination of EQ and the MS-DIAL public database also showed superior performance to the MS-DIAL program, suggesting that the EQ module with MS² fragment annotations surpassed the similarity algorithm in the MS-DIAL program.

By statistically analyzing over 100 published datasets, we summarized three relative RT rules: ECN, IUP, and ESCN. Based on these three rules, we developed the LCI module using heuristic search methods. A heuristic search method uses heuristic information to choose a route that seems more plausible than the rest and is designed to solve problems more quickly, suggesting that LCI can profoundly decrease FDR in highly complex datasets. Notably, heuristic search techniques can be classified into two broad categories: depth-first search (DFS) and best-first search (BFS). We used BFS in this work to order node expansions according to a heuristic function, which yielded much higher accuracy than RT prediction in LDA for lipidomic data (Fig. 3d and Supplementary Fig. 9c).

We also created WMYn as a regenerative model for reverse lipidomics (Fig. 4), and demonstrated that WMYn offers four advantages: (1) regeneration of lipid fingerprints as the 5th-level library for high-confidence and high-coverage annotation, (2) platform-independent lipidomic analysis for enhanced platform transferability, (3) enhancement of high-resolution MS data for higher-accuracy annotation, and (4) an interactive interface that supports broad exploration of reverse lipidomics with other spectral querying environments such as entropy search. Importantly, WMYn is capable of learning the differences between MS platforms and effectively extracting spectral features, thereby facilitating inter-platform migration and improving annotation accuracy and coverage. Taken together, LipidIN, including the EQ, LCI, and reverse lipidomics modules, generally surpasses the performance of existing methods in mass spectrometry-based lipidomics, exhibiting three capacities: querying all mass spectral libraries in real time, improving the accuracy and coverage of lipid annotation, and regenerating lipid fragment fingerprints for higher accuracy and coverage.

Future research directions include pre-training on a larger and more diverse dataset to extend reverse lipidomics. Furthermore, investigating the deeper relationship between metabolites and relative RT will provide a robust foundation for metabolite identification, prediction, and causal network analysis.

Methods

Ethical statement

Written informed consent was obtained from each participant or from the participant’s parents or legal guardians in the discovery and validation cohorts, and the study was conducted according to the Declaration of Helsinki. Ethical permission was granted by Xiamen University Ethics Committee, Fujian Province, China (Approval number: XDYX202302K08), and Fujian Medical University Union Hospital, Fuzhou, Fujian Province, China (Approval number: 2022KY111).

Lipid extraction

The serum samples from breast cancer patients and healthy subjects were collected and stored at −80 °C until analysis. We employed three distinct liquid-liquid extraction methods to extract lipids from these serum samples. The lipid mixture datasets from brown adipose tissue, brain, colon, heart, kidney, liver, lung, pancreas, soleus muscle, spleen, testis, and white adipose tissue of mice were obtained with the methyl tert-butyl ether (MTBE) extraction method. An LC coupled with an Orbitrap ID-X Tribrid mass spectrometric system was employed for lipidome analysis. Mobile phase A consisted of acetonitrile and water (60:40, v/v) containing 10 mM NH4Ac, while mobile phase B consisted of isopropyl alcohol and acetonitrile (90:10, v/v).

MTBE (methyl-tert-butyl ether) method

In the extraction process, the serum samples were harvested from two independent cohorts consisting of 1393 clinical samples (698 breast cancer patients and 695 healthy subjects) and 333 clinical samples (142 breast cancer patients and 191 healthy subjects), respectively. The extraction protocol started with the dispensing of 50 µL of serum into centrifuge tubes, followed by the addition of 400 µL of pre-cooled pure methanol containing internal standards and 1 mL of MTBE to each sample. Subsequently, the samples were thoroughly vortexed (1000 rpm, 10 °C, 30 min). Phase separation was induced by adding 400 μL of Milli-Q water and centrifugation at 15,000 × g for 15 min at 10 °C. From this lipid-rich upper layer, we aspirated 300 µL of supernatant from each sample. To guarantee long-term preservation, the supernatants were dried using a nitrogen blower maintained at room temperature. The dried lipids were then stored at −80 °C until further analysis was required. Prior to LC-MS analysis, the lipids were resuspended in a solvent mixture. This solvent consisted of 20 µL CH₂Cl₂/MeOH mixture (2:1, v/v) and 130 µL acetonitrile/isopropyl alcohol/H₂O containing 5 mM ammonium acetate mixture (65:30:5, v/v/v). Each resuspended sample underwent vortexing for 30 s, followed by centrifugation at 15,000 × g for 10 min at 6 °C. Additionally, a pooled aliquot of supernatant from every sample was combined as a quality control (QC) reference sample.

Modified Folch method

In the extraction process of 105 clinical cohort samples (31 breast nodule patients, 32 breast cancer without lung metastasis patients, 22 breast cancer lung metastasis patients, 20 female lung cancer patients), serum sample preparation was conducted rigorously according to the Standard Operating Procedure (SOP) based on the modified Folch liquid-liquid extraction method. A lipid internal standard mixture consisting of 15 deuterated lipids and methanol was added to each sample. Following this, the samples were vortexed, and dichloromethane and water were introduced for extraction. The samples were then allowed to equilibrate at room temperature for 10 min before being centrifuged at 4 °C and 15,000 × g for 10 min. After centrifugation, the organic layer was carefully transferred to a designated centrifuge tube and dried using a nitrogen blower. The residue was subsequently redissolved with mobile phase B (2-propanol/water in a ratio of 95:5), vortexed for 1 min, and diluted with Mobile Phase A (methanol/acetonitrile/water in a ratio of 50:40:10). To ensure QC, a pool of all samples was prepared. This mixture was divided into equal aliquots and stored at −80 °C. This pool sample served as the QC for that specific batch of samples, ensuring the consistency and reliability of the extraction process.

Lipid extraction for Paternò-Büchi reaction method

In the clinical cohort study involving 105 samples (31 breast nodule patients, 32 breast cancer without lung metastasis patients, 22 breast cancer lung metastasis patients, 20 female lung cancer patients) for the Paternò-Büchi reaction, each 50 μL of serum sample was diluted with 1 mL of water, and subsequently, 1 mL of methanol and 2 mL of chloroform were added. The resulting mixture was vortexed vigorously for 10 min to ensure thorough mixing, followed by centrifugation at 12,000 rpm for 12 min. Following centrifugation, the lower organic layer was carefully collected, and the extraction process was repeated on the upper aqueous layer to maximize lipid recovery. The lower organic layers from both extractions were then combined, and the solvent was gently evaporated using a nitrogen blower. The residue obtained after solvent evaporation was reconstituted in 500 μL of methanol. To ensure sample purity and remove any particulate matter, the reconstituted solution was filtered through a 0.22 µm filter membrane. The resulting filtrate represented the lipid extraction original solution, ready for further analysis in the Paternò-Büchi reaction.

Lipidomics data acquisition

Lipidomics data were obtained through the utilization of five unique mass spectrometer platforms. The details are shown as follows.

Thermo Scientific Orbitrap Exploris 240 MS system

Lipidomics data (1393 clinical samples and 333 clinical samples) were acquired using a Vanquish Flex UPLC system coupled with a Thermo Scientific Orbitrap Exploris 240 MS system equipped with a heated electrospray ionization (H-ESI) source. Samples were separated on a BEH C8 column (2.1 × 100 mm with 1.7 μm particle size, Waters, Milford, MA, USA) with column temperature maintained at 55 °C and mobile phases consisting of 2 mM ammonium formate in mobile phase A (40% water and 60% acetonitrile) and mobile phase B (90% isopropanol and 10% acetonitrile). The gradient started with 1.5 min of isocratic elution at 32% B (and 68% A). B was increased to 85% over the next 15.5 min and then from 85% B to 97% B in 0.1 min, where it was maintained for 2.4 min. The mobile phase composition was then rapidly returned to 32% B within 0.1 min and held for 5 min for column post-equilibration. The flow rate was set at 0.26 mL/min. The injection volume was 2 µL for positive ion mode and 5 µL for negative ion mode. The mass spectrometer was operated in positive or negative mode using a full scan/data-dependent secondary scan (Full-ddMS2) over the scan range m/z 100–1500. The capillary voltage was 3400 V for positive mode and 3000 V for negative mode. The source settings were: ion transfer tube temperature, 320 °C; vaporizer temperature, 350 °C; sheath gas, 40 Arb; auxiliary gas, 10 Arb; sweep gas, 10 Arb. The Orbitrap resolution was set to 120,000 for MS1 and 15,000 for MS2. The normalized CE type was selected.
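As an illustration of the gradient programme described above, the following minimal Python sketch tabulates the stated breakpoints and linearly interpolates %B at any time. The cumulative times (17.0, 17.1, 19.5, 19.6 and 24.6 min) are derived by summing the stated segment durations; the names `GRADIENT` and `percent_b`, and the assumption of linear ramps between breakpoints, are ours and not part of the original method.

```python
from bisect import bisect_right

# (time in min, %B) breakpoints derived from the stated segment durations
GRADIENT = [
    (0.0, 32.0),   # start of isocratic hold
    (1.5, 32.0),   # end of 1.5 min isocratic hold
    (17.0, 85.0),  # ramp to 85% B over 15.5 min
    (17.1, 97.0),  # ramp to 97% B in 0.1 min
    (19.5, 97.0),  # hold at 97% B for 2.4 min
    (19.6, 32.0),  # return to 32% B within 0.1 min
    (24.6, 32.0),  # 5 min post-equilibration
]

def percent_b(t_min):
    """Return %B at time t_min, assuming linear ramps between breakpoints."""
    times = [t for t, _ in GRADIENT]
    if t_min <= times[0]:
        return GRADIENT[0][1]
    if t_min >= times[-1]:
        return GRADIENT[-1][1]
    k = bisect_right(times, t_min) - 1          # segment containing t_min
    (t0, b0), (t1, b1) = GRADIENT[k], GRADIENT[k + 1]
    return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
```

Under these assumptions, the total run time per injection is 24.6 min.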

Agilent 6546 Q-TOF MS system

Lipidomics data from 105 clinical cohort samples (31 breast nodule patients, 32 breast cancer without lung metastasis patients, 22 breast cancer lung metastasis patients, 20 female lung cancer patients) were acquired using an Agilent 1290 LC coupled with an Agilent 6546 Q-TOF mass spectrometer equipped with an ESI source. Samples were separated on a CSH C18 column (2.1 × 100 mm with 1.7 μm particle size, Waters, Milford, MA, USA) with column temperature maintained at 40 °C and mobile phases consisting of 10 mM ammonium formate in mobile phase A (methanol/acetonitrile/water, v/v/v = 50:40:10) and mobile phase B (2-propanol/water, v/v = 95:5). The gradient started with 1.5 min of isocratic elution at 32% B. B was increased to 85% over the next 15.5 min and then from 85% B to 97% B in 0.1 min, where it was maintained for 2.4 min. The mobile phase composition was then rapidly returned to 32% B within 0.1 min and held for 5 min for column post-equilibration. The flow rate was set at 0.26 mL/min. The injection volume was 4 µL for positive ion mode and 12 µL for negative ion mode. The mass spectrometer was operated in positive or negative mode over the scan range m/z 150–1500. The capillary voltage was 3400 V for positive mode and 3000 V for negative mode. The source settings were: ion transfer tube temperature, 320 °C; vaporizer temperature, 350 °C; sheath gas, 40 Arb; auxiliary gas, 10 Arb; sweep gas, 10 Arb. The CE was 10–60 eV.

Xevo G2-XS Q-TOF MS system

Lipidomics data from 105 clinical cohort samples (31 breast nodule patients, 32 breast cancer without lung metastasis patients, 22 breast cancer lung metastasis patients, 20 female lung cancer patients) were acquired using an ACQUITY UPLC I-Class PLUS coupled with a Xevo G2-XS Q-TOF mass spectrometer. Samples were separated on a BEH HILIC column (2.1 × 100 mm with 1.7 μm particle size, Waters, Milford, MA, USA) with column temperature maintained at 40 °C and mobile phases consisting of mobile phase A (10 mM ammonium formate and 0.2% acetic acid in water) and mobile phase B (acetonitrile/acetone/isopropanol, v/v/v = 50:48:2). The gradient elution program was as follows: 0 to 2.4 min, 90% to 85% B; 2.4 to 3.2 min, 85% to 80% B; 3.2 to 5 min, 80% B; 5.0 to 5.1 min, 80% to 70% B; 5.1 to 6 min, 70% B; 6 to 6.1 min, 70% to 90% B; 6.1 to 10.0 min, 90% B. The flow rate was set at 0.35 mL/min. The mass spectrometer was operated in positive or negative mode over the scan range m/z 150–1500. The ESI source conditions were as follows: source capillary, 2.5 kV; sampling cone, 40 V; source offset, 80 V; source temperature, 120 °C; desolvation temperature, 500 °C; cone gas flow, 50 L/h; desolvation gas flow, 800 L/h; MS1 scan range, m/z 400–1000.

SCIEX ZenoTOF 7600 MS system

For EAD data acquisition, an Exion LC coupled with a quadrupole time-of-flight MS system (ZenoTOF 7600, SCIEX, Framingham, MA, USA) was used. Mobile phase A consisted of methanol, acetonitrile, and water (1:1:1, v/v/v) containing 5 mM ammonium acetate. Mobile phase B consisted of isopropanol containing 5 mM ammonium acetate. The flow rate was 0.3 mL/min with a gradient program of 17 min. A Kinetex C18 column (2.1 × 100 mm, 2.6 µm) was used for separation. A targeted MS/MS scanning mode, referred to as “MRM HR” by SCIEX, was employed. For fragmentation, collision-induced dissociation (CID) mode was run with collision energy (CE) set at 10, 20, and 40 volts (V) with no CE spread, and EAD mode was run with CE set at 10 V and electron kinetic energy (KE) at 10, 15, and 20 electron volts (eV).

Normalization of lipid intensities

In each tested sample group, we carefully adjusted the lipid intensity using internal standards as references. This was particularly crucial for the extensive data sets from 1393 cases and 333 cases in the first and second clinical cohorts, respectively. To manage this data efficiently, we processed the lipid testing in batches and removed batch effects with statTarget (https://stattarget.github.io/). Additionally, we applied the natural logarithm to all biomarker screening values to normalize the data distribution.

Raw data extraction and processing

Raw data, regardless of original format, were converted to mzML format with MSConvert and subsequently quantified with XCMS. Lipid annotation was performed with LipidIN. Data from the Orbitrap Exploris 240 used an MS1 tolerance of 10 ppm and an MS2 tolerance of 20 ppm, while data from the Agilent 6546 Q-TOF used an MS1 tolerance of 20 ppm and an MS2 tolerance of 40 ppm. To account for the relatively low precision associated with the Paternò-Büchi reaction, data from the Xevo G2-XS Q-TOF were processed with an MS1 tolerance of 1000 ppm and an MS2 tolerance of 40 ppm.

Lipidomics data filtering

To ensure the accuracy and reliability of our analysis, we only included lipids in our lipidomics analysis that had a ScoreMatched of over 0.75 and a final score of 2.1 or higher in LipidIN annotations. When multiple plausible annotations existed under the same peak, we selected the one with the highest score. Furthermore, we excluded biological samples from our analysis if they had missing data exceeding 20%. Additionally, we removed lipids with coefficient of variation (CV) higher than 30% in QC samples to maintain data consistency and reliability.
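The filtering rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the dictionary keys `score_matched` and `final_score` are hypothetical, "the highest score" is assumed to mean the highest final score, missing values are assumed to be encoded as `None`, and the CV is computed with the sample standard deviation.

```python
from statistics import mean, stdev

def keep_annotation(score_matched, final_score):
    """Annotation filter: ScoreMatched over 0.75 and final score of 2.1 or higher."""
    return score_matched > 0.75 and final_score >= 2.1

def best_annotation(candidates):
    """Among passing candidates under one peak, keep the highest final score."""
    passing = [c for c in candidates
               if keep_annotation(c["score_matched"], c["final_score"])]
    return max(passing, key=lambda c: c["final_score"]) if passing else None

def keep_sample(intensities, max_missing=0.20):
    """Keep a sample unless its missing fraction exceeds 20%."""
    missing = sum(v is None for v in intensities)
    return missing / len(intensities) <= max_missing

def keep_lipid(qc_intensities, max_cv=0.30):
    """Keep a lipid unless its CV across QC samples is higher than 30%."""
    cv = stdev(qc_intensities) / mean(qc_intensities)
    return cv <= max_cv
```

In practice these checks would be applied in sequence: annotation selection per peak, then sample exclusion, then QC-based lipid removal.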

Expeditious querying (EQ) module

The EQ module consists of two main components: (1) MS² fragment annotations, and (2) data structure optimization.

MS² fragment annotations. This algorithm emphasizes the degree of matching between the theoretical spectrum (with $n_1$ peaks) and the measured spectrum (with $n_2$ peaks) and highlights the importance of mass spectrometric features. Based on this concept, two indicators, $\mathrm{Score}_{\mathrm{matched}}$ and $\mathrm{Score}_{\mathrm{ratio}}$, were defined. $\mathrm{Score}_{\mathrm{matched}}$ is defined as:

$${{{{\rm{Score}}}}}_{{{{\rm{matched}}}}}=\frac{{\sum }_{i=1}^{{n}_{1}}{\sum }_{j=1}^{{n}_{2}}{{{\rm{I}}}}\left(\frac{{{||}{s}_{i}-{s}_{j}^{{\prime} }{||}}_{1}}{{s}_{i}}\le {{threshold}}_{{ppm}}\right)}{{n}_{1}},$$

(1)

$s_i$ and $s_j'$ denote two peaks: $s_i$ is the $i$th peak of the theoretical library and $s_j'$ is the $j$th peak of the measured spectrum. $\|s_i - s_j'\|_1$ is the L1 norm of the difference between the two peaks. $\mathrm{I}(x)$ is an indicator function that takes the value 1 when the given condition is satisfied and 0 otherwise. Here, the indicator function determines whether the relative difference between the two peaks is within ${threshold}_{ppm}$.

The ${{{{\rm{Score}}}}}_{{{{\rm{ratio}}}}}$ is defined as:

$${{{{\rm{Score}}}}}_{{{{\rm{ratio}}}}}=\frac{{\sum }_{i=1}^{{n}_{1}}{\sum }_{j=1}^{{n}_{2}}{{{\rm{I}}}}\left(\frac{{{||}{{{{\rm{s}}}}}_{i}-{{{{\rm{s}}}}}_{j}^{{\prime} }{||}}_{1}}{{s}_{i}}\le {{threshold}}_{{ppm}}\right)\times {{intensity}}_{j}}{{\sum }_{j=1}^{{n}_{2}}{{intensity}}_{j}},$$

(2)

${intensity}_{j}$ is the intensity of the $j$th peak of the measured spectrum. $\mathrm{Score}_{\mathrm{ratio}}$ quantifies the importance of the measured characteristic peaks. Note that when multiple measured peaks match a single peak in the theoretical library, only one successful match is counted, and the one with the highest response is selected.
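The two indicators can be illustrated with a short Python sketch. It is not the LipidIN implementation: the function names are hypothetical, the tolerance is assumed to be given in ppm and converted with a factor of 1e-6, and the one-match-per-theoretical-peak rule is applied by keeping the most intense measured peak inside the tolerance window.

```python
def within_ppm(theoretical_mz, measured_mz, ppm):
    """Indicator of Eqs. (1)-(2): relative m/z difference within the tolerance."""
    return abs(theoretical_mz - measured_mz) / theoretical_mz <= ppm * 1e-6

def eq_scores(theoretical_mz, measured_peaks, ppm=20.0):
    """theoretical_mz: list of m/z values; measured_peaks: list of (m/z, intensity)."""
    n_matched = 0
    matched_measured = set()
    for s in theoretical_mz:
        hits = [j for j, (mz, _) in enumerate(measured_peaks)
                if within_ppm(s, mz, ppm)]
        if hits:
            n_matched += 1  # count one successful match per theoretical peak
            # keep only the measured peak with the highest response
            matched_measured.add(max(hits, key=lambda j: measured_peaks[j][1]))
    total_intensity = sum(intensity for _, intensity in measured_peaks)
    score_matched = n_matched / len(theoretical_mz)
    score_ratio = sum(measured_peaks[j][1] for j in matched_measured) / total_intensity
    return score_matched, score_ratio
```

For example, with three theoretical fragments of which one is matched by two nearby measured peaks, only the more intense of the two contributes to the intensity ratio.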

Data structure optimization. During EQ in a task with $n$ peaks to be annotated, we used a combination of bisection and hash tables to shrink time complexity $O\left(n\right)$ into $O\left(1\right)$, achieving a final time complexity of $O\left(m\Delta \right)$, where $m$ is the total number of peaks in the library and $\Delta$ is the querying tolerance.
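One common way to combine hash tables with bisection for tolerance-window lookups is sketched below. The source does not describe the actual EQ data layout, so everything here is an assumption for illustration: the class name `FragmentIndex`, the fixed m/z bin width of 0.01, and the ppm tolerance. Fragments are hashed by integer m/z bin, and each bin is kept sorted so that the tolerance window can be located by bisection.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

class FragmentIndex:
    """Hypothetical library index: hash by m/z bin, bisect within each bin."""

    def __init__(self, library, bin_width=0.01):
        # library: iterable of (name, fragment_mz)
        self.bin_width = bin_width
        grouped = defaultdict(list)
        for name, mz in library:
            grouped[int(mz / bin_width)].append((mz, name))
        self.bins = {}
        for key, entries in grouped.items():
            entries.sort()  # sorted m/z values make bisection valid
            self.bins[key] = ([mz for mz, _ in entries],
                              [name for _, name in entries])

    def query(self, mz, ppm=20.0):
        """Return names of all library fragments within ppm of mz."""
        tol = mz * ppm * 1e-6
        low, high = mz - tol, mz + tol
        hits = []
        # only the few bins overlapping the tolerance window are visited
        for key in range(int(low / self.bin_width), int(high / self.bin_width) + 1):
            if key not in self.bins:
                continue
            mzs, names = self.bins[key]
            hits.extend(names[bisect_left(mzs, low):bisect_right(mzs, high)])
        return hits
```

With this layout, the work per query depends on the width of the tolerance window rather than on the number of peaks being annotated.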

Lipid categories intelligence (LCI) model

Using $\mathrm{Score}_{\mathrm{matched}}$ and $\mathrm{Score}_{\mathrm{ratio}}$ as prior information, we further leveraged the LCI model to reduce false positives and predict candidate annotations without MS/MS fragments. To delineate the ECN and ESCN rules, we used loss functions to obtain the sets of starting points. Modern mass spectrometers are capable of rapid switching, with the delay from MS1 scans to MS2 scans typically ranging from milliseconds to a few seconds. To streamline the extraction of primary peaks, we opted not to perform peak extraction during the lipid annotation step. Instead, we used the MS/MS scan times in the LCI model.

Get starting points’ sets. We transformed the abstract problem of relative RT into a path optimization problem. Within the prior heuristic search (PHS) framework, an appropriate number of points were randomly selected as the optimal starting points by using $\mathrm{Score}_{\mathrm{matched}}$ and $\mathrm{Score}_{\mathrm{ratio}}$. We then used the loss to judge whether these points can form a starting point set. $\mathrm{loss}_{1}^{\prime}$ is defined as:

$${{{{\rm{loss}}}}}_{1}^{{\prime} }={{{\rm{I}}}}\left({R}_{{adj}}^{2}\ge 0.9\right),$$

(3)

$R_{adj}^{2}$ is the adjusted coefficient of determination of a regression model; Eq. (3) requires it to be at least 0.9. $\mathrm{loss}_{2}^{\prime}$ measures the monotonicity of the curve and is defined as:

$${{{{\rm{loss}}}}}_{2}^{{\prime} }={{{\rm{I}}}}\left(\forall \,{{rt}}_{i}\, > \,{{rt}}_{j},{{rt}}_{i},{{rt}}_{j}\in \left[{{rt}}_{1},{{rt}}_{2}\right],\frac{{{mz}}_{i}-{{mz}}_{j}}{{{rt}}_{i}-{{rt}}_{j}}\ge 0\right),$$

(4)

${rt}_{i}$, ${rt}_{j}$ and ${mz}_{i}$, ${mz}_{j}$ represent the RT and m/z of the $i$th and $j$th spectra within the closed interval $\left[{rt}_{1},{rt}_{2}\right]$. Based on ECN and ESCN, these points exhibit a clear increasing trend. Equation (4) determines whether the selected points have a monotonically non-decreasing trend. $\mathrm{loss}_{3}^{\prime}$ ensures that the fitted curve is relatively smooth and continuous, preventing sudden jumps or oscillations in the function, and is defined as:

$${{{{\rm{loss}}}}}_{3}^{{\prime} }={{{\rm{I}}}}(\forall {{rt}}_{i},{{rt}}_{j}\in [{{rt}}_{1},{{rt}}_{2}],\forall \varepsilon\, > \,0\,s.t.{{\mathrm{lim}}}_{{{rt}}_{j}\to {{rt}}_{i}}\frac{f\left({{rt}}_{j}\right)-f\left({{rt}}_{i}\right)}{{{rt}}_{j}-{{rt}}_{i}} < \, \varepsilon ),$$

(5)

$f\left({rt}_{i}\right)$ is the fitted analytical expression; we require the fitted curve to be continuous in the closed interval. Equation (5) guarantees that, for two arbitrarily close points in the closed interval, the deviation between their fitted values is arbitrarily small. The three loss functions are combined by multiplication, so that the ECN and ESCN rules are satisfied simultaneously. Finally, we calculated the combined loss function using the following equations:

$${{{{\rm{loss}}}}}^{{\prime} }={\prod }_{i=1}^{3}{{{{\rm{loss}}}}}_{{{{\rm{i}}}}}^{{\prime} },$$

(6)

$$\mathrm{loss}_{i}=\frac{n\times {\left\Vert \widetilde{mz}_{i}-{mz}_{i}\right\Vert }_{1}}{{\sum }_{i=1}^{n}{mz}_{i}}\times \mathrm{I}\left(\mathrm{loss}^{\prime}=1\right),$$

(7)

where $\widetilde{mz}_{i}$ is the predicted m/z.
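To make Eqs. (3)–(7) concrete, the following Python sketch evaluates a candidate starting point set. It is a simplified illustration, not the LCI implementation: a straight-line fit of m/z against RT stands in for the regression model, which the source does not specify; the smoothness condition of Eq. (5) is taken as satisfied because a fitted line is continuous; and all function names are hypothetical.

```python
def linear_fit(rt, mz):
    """Ordinary least squares fit mz = slope * rt + intercept."""
    n = len(rt)
    mean_rt, mean_mz = sum(rt) / n, sum(mz) / n
    sxx = sum((x - mean_rt) ** 2 for x in rt)
    sxy = sum((x - mean_rt) * (y - mean_mz) for x, y in zip(rt, mz))
    slope = sxy / sxx
    return slope, mean_mz - slope * mean_rt

def adjusted_r2(rt, mz, slope, intercept, n_predictors=1):
    n = len(rt)
    mean_mz = sum(mz) / n
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(rt, mz))
    ss_tot = sum((y - mean_mz) ** 2 for y in mz)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

def is_non_decreasing(rt, mz):
    """Eq. (4): m/z never decreases as RT increases."""
    pts = sorted(zip(rt, mz))
    return all(b[1] >= a[1] for a, b in zip(pts, pts[1:]))

def starting_set_loss(rt, mz):
    slope, intercept = linear_fit(rt, mz)
    loss1 = adjusted_r2(rt, mz, slope, intercept) >= 0.9   # Eq. (3)
    loss2 = is_non_decreasing(rt, mz)                      # Eq. (4)
    loss3 = True   # Eq. (5): assumed satisfied, a fitted line is continuous
    accepted = int(loss1 and loss2 and loss3)              # Eq. (6)
    n, total = len(mz), sum(mz)
    losses = [n * abs(slope * x + intercept - y) / total * accepted
              for x, y in zip(rt, mz)]                     # Eq. (7)
    return accepted, losses
```

A set of points is accepted only when all three indicator losses equal 1, which mirrors the multiplication in Eq. (6).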

Global annotation judgment. For Global annotation judgment, we construct feasible regions to reduce computational complexity. For points within feasible regions, an exhaustive method was adopted, and Eq. (7) was used to calculate the ${{loss}}_{i}$. Finally, we used ${{{{\rm{PHS}}}}}_{i}$ function to normalize the score:

$${{{{\rm{PHS}}}}}_{i}=\left\{\begin{array}{cc}1-\frac{{{{{\rm{loss}}}}}_{i}}{4\times {threshold}},\hfill& {if}\,{{{{\rm{loss}}}}}^{{\prime} }=1,{{{{\rm{loss}}}}}_{i}\le {threshold}\\ 0.5,\hfill& {if}\,{{loss}}^{{\prime} }=0 \hfill \\ \frac{\max \left({{{{\rm{loss}}}}}_{i}\right)-{{{{\rm{loss}}}}}_{i}}{4\times (\max \left({{{{\rm{loss}}}}}_{i}\right)-\min \left({{{{\rm{loss}}}}}_{i}\right))},& {if}\,{{loss}}^{{\prime} }=1,{{{{\rm{loss}}}}}_{i}\, > \,{threshold}\end{array}\right.,$$

(8)

where 4 is a constant for data scaling, and the parameter ${threshold}$ quantifies the percentage deviation between the measured m/z and the expected m/z. The branch of the piecewise function is selected by comparing $\mathrm{loss}_{i}$ with the specified threshold.
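The piecewise normalization of Eq. (8) can be written as a small Python function. This is a sketch with a hypothetical function name; the handling of the degenerate case in which all losses are equal (where the third branch would divide by zero) is our assumption and is not specified in the source.

```python
def phs_scores(losses, accepted, threshold):
    """Normalize loss_i values into PHS_i following Eq. (8).

    losses: loss_i for the candidate points; accepted: loss' (0 or 1).
    """
    if accepted == 0:
        return [0.5] * len(losses)            # second branch: loss' = 0
    highest, lowest = max(losses), min(losses)
    scores = []
    for loss in losses:
        if loss <= threshold:                 # first branch
            scores.append(1.0 - loss / (4.0 * threshold))
        elif highest > lowest:                # third branch
            scores.append((highest - loss) / (4.0 * (highest - lowest)))
        else:
            scores.append(0.0)                # assumed value when max equals min
    return scores
```

With this scaling, points within the threshold score between 0.75 and 1, points beyond it score at most 0.25, and points without an accepted fit receive the neutral value 0.5.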

Intra-subclass curve translation. To delineate IUP rule, we constructed translation equations to annotate underfitting DBEs of lipid species. Taking n-degree polynomial as an example, the translation equation is defined as follows:

$$\forall \,x\in \left[{x}_{1},{x}_{2}\right]{{{\rm{s}}}}.{{{\rm{t}}}}.\left\{\begin{array}{c}\left|f\left({{rt}}_{i}\right)-{{mz}}_{i}\right|\le \varepsilon,\forall \varepsilon\, > \,0 \hfill\\ g\left(x\right)=q\left(x\right)f\left(x\right)+C,\partial \left(q\left(x\right)\right)\ge 0\end{array}\right.,$$

(9)

$f\left(x\right)$ is the fitting equation after translation, and ${rt}_{i}$ and ${mz}_{i}$ are the RT and m/z of the points to be fitted. $f\left(x\right)$ is required to satisfy the first condition at the fitted points with only a slight deviation $\varepsilon$, which can be any value greater than zero; that is, $f\left(x\right)$ is a curve passing through, or infinitely close to, each point $({rt}_{i},{mz}_{i})$. $g\left(x\right)$ is the fitting equation of another DBE value, and $f\left(x\right)$ is required to be a component of $g\left(x\right)$: $g\left(x\right)$ equals $f\left(x\right)$ multiplied by a non-zero polynomial $q\left(x\right)$, plus a constant remainder $C$. ${x}_{1}$ and ${x}_{2}$ are the endpoints of a given interval. This requires that $f\left(x\right)$ has no intersection with $g\left(x\right)$ after being translated in the y and x directions, which is the analytical expression of IUP.

Reverse lipidomics

To delineate WMYn termed “reverse lipidomics”, we utilized feature learning, including intra- and inter-feature learning.

Intra-feature learning. The feature extractor consists of two SiLU-activated layers followed by a Rectified Linear Unit (ReLU) activation. The first layer is defined as:

$${{{{\rm{H}}}}}_{1}={\alpha }_{1}\left({{{\rm{SiLU}}}}\left({{{{\bf{W}}}}}_{1}{{{\bf{X}}}}+{{{{\bf{B}}}}}_{1}\right)\right)+{\beta }_{1}=\left(\begin{array}{ccc}{h}_{1,1}^{(1)} & \cdots & {h}_{1,n}^{(1)}\\ \vdots & \ddots & \vdots \\ {h}_{512,1}^{(1)} & \cdots & {h}_{512,n}^{(1)}\end{array}\right),$$

(10)

where ${{{\bf{X}}}}$ represents the input matrix, consisting of ${{{{\boldsymbol{x}}}}}_{{{{\bf{1}}}}},\cdot \cdot \cdot,{{{{\boldsymbol{x}}}}}_{{{{\boldsymbol{n}}}}}$, where each ${{{{\boldsymbol{x}}}}}_{{{{\boldsymbol{i}}}}}$ is a spectrum, and mass spectrometry data are discretized in the m/z dimension at intervals of 0.0001 Da, based on the mass spectral resolution. Specifically, the raw continuous m/z values are rounded to the nearest multiple of 0.0001 Da. ${{{{\rm{H}}}}}_{1}$ is the latent matrix, ${{{{\bf{W}}}}}_{1}$ and ${{{{\bf{B}}}}}_{1}$ are the weight and bias matrices, ${\alpha }_{1}$ is a learnable scaling parameter, and ${\beta }_{1}$ is a learnable bias term applied to the output of the activation function.

The result in the $i$th row and $j$th column after the first layer of processing is represented as:

$${h}_{i,j}^{(1)}={\alpha }_{1}{{{\rm{SiLU}}}}\left({\sum }_{p=1}^{m}{{{{\bf{W}}}}}_{i,p}^{(1)}{{{{\bf{X}}}}}_{p,j}+{{{{\bf{B}}}}}_{i}^{(1)}\right)+{\beta }_{1}$$

(11)

where $\mathbf{X}_{p,j}$ represents the element in the $p$th row and $j$th column of $\mathbf{X}$, $\mathbf{W}_{i,p}^{(1)}$ represents the element in the $i$th row and $p$th column of the weight matrix $\mathbf{W}_{1}$, and $\mathbf{B}_{i}^{(1)}$ represents the $i$th row of the bias term $\mathbf{B}_{1}$.

The second layer is defined as:

$${{{{\rm{H}}}}}_{2}={\alpha }_{2}\left({{{\rm{SiLU}}}}\left({{{{\bf{W}}}}}_{2}{{{{\rm{H}}}}}_{1}+{{{{\bf{B}}}}}_{2}\right)\right)+{\beta }_{2}=\left(\begin{array}{ccc}{h}_{1,1}^{(2)} & \cdots & {h}_{1,n}^{(2)}\\ \vdots & \ddots & \vdots \\ {h}_{512,1}^{(2)} & \cdots & {h}_{512,n}^{(2)}\end{array}\right),$$

(12)

where ${{{{\rm{H}}}}}_{1}$ is the output matrix obtained from the first layer, ${{{{\rm{H}}}}}_{2}$ is the latent matrix, ${{{{\bf{W}}}}}_{2}$ and ${{{{\bf{B}}}}}_{2}$ are the weight and bias matrices, respectively. ${\alpha }_{2}$ is a learnable scaling parameter, and ${\beta }_{2}$ is a learnable bias term applied to the output of the activation function.

The result in the $i$th row and $j$th column after the second layer of processing is represented as:

$${h}_{i,j}^{(2)}={\alpha }_{2}\,\mathrm{SiLU}\left({\sum }_{q=1}^{512}\mathbf{W}_{i,q}^{\left(2\right)}{h}_{q,j}^{(1)}+\mathbf{B}_{i}^{\left(2\right)}\right)+{\beta }_{2}$$

(13)

where ${h}_{i,j}^{(2)}$ is the element in the $i$th row and $j$th column of the output matrix obtained from the second layer, and $\mathbf{W}_{i,q}^{\left(2\right)}$ represents the element in the $i$th row and $q$th column of the weight matrix $\mathbf{W}_{2}$. The bias term $\mathbf{B}_{i}^{\left(2\right)}$ allows the model to independently shift the activation output of the $i$th row, enhancing the network’s flexibility. The output matrix from the first layer is passed through the second layer to produce the latent matrix in the feature space.

Activated by the ReLU function as follows, $\mathrm{H}_{2}$ was transformed into the matrix $\mathrm{H}_{3}$ with 512 rows:

$${{{{\rm{H}}}}}_{3}={{{\rm{ReLU}}}}\left({{{{\rm{H}}}}}_{2}\right)=\left(\begin{array}{ccc}{h}_{1,1}^{(3)} & \cdots & {h}_{1,n}^{(3)}\\ \vdots & \ddots & \vdots \\ {h}_{512,1}^{(3)} & \cdots & {h}_{512,n}^{(3)}\end{array}\right),$$

(14)

the result in the $i$ row and $j$ column of ${H}_{3}$ is represented as:

$${h}_{i,j}^{(3)}={{{\rm{ReLU}}}}\left({h}_{i,j}^{(2)}\right),$$

(15)

where ${h}_{i,j}^{(2)}$ is the output obtained from the second layer, and ${H}_{3}$ is the final feature matrix. The intra-feature learning in stage 1 of WMYn and the transformation in a network layer can finally be described as:

$$f({X}_{1,j},\ldots,{X}_{m,j})=\mathrm{ReLU}\left({\alpha }_{2}\,\mathrm{SiLU}\left({\sum }_{q=1}^{512}{\varPhi }_{q,i,j}\left({\sum }_{p=1}^{m}{\psi }_{p,q}\left(\mathbf{X}_{p,j}\right)\right)+\mathbf{B}_{i}^{\left(2\right)}\right)+{\beta }_{2}\right),$$

(16)

$${\psi }_{p,q}(\mathbf{X}_{p,j})=\mathrm{I}(q\in \{1,\ldots,512\})\,\mathbf{W}_{q,p}^{(1)}\mathbf{X}_{p,j},$$

(17)

$${\varPhi }_{q,i,j}\left({\sum }_{p=1}^{m}{\psi }_{p,q}\left(\mathbf{X}_{p,j}\right)\right)=\mathbf{W}_{i,q}^{\left(2\right)}\left({\alpha }_{1}\,\mathrm{SiLU}\left({\sum }_{p=1}^{m}{\psi }_{p,q}(\mathbf{X}_{p,j})+\mathbf{B}_{q}^{(1)}\right)+{\beta }_{1}\right)$$

(18)

where $\mathrm{I}(q\in \{1,\ldots,512\})$ is an indicator function, which equals 1 when $q\in \{1,\ldots,512\}$, and 0 otherwise. The indices follow the layer definitions above: $q$ indexes the rows of the first layer and $i$ indexes the rows of the second layer.
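The intra-feature learning stage described above (two scaled SiLU layers followed by ReLU) can be traced numerically with a small pure-Python sketch. It uses toy dimensions rather than 512 hidden rows, treats each bias as one value per output row, and fixes the learnable parameters $\alpha$ and $\beta$ as plain numbers; the function names are hypothetical and the sketch makes no claim about how WMYn is implemented in PyTorch.

```python
import math

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def dense_silu(W, B, X, alpha, beta):
    """Compute alpha * SiLU(W X + B) + beta element-wise.

    W: rows x m weights; B: one bias per output row; X: m x n input matrix.
    """
    n_cols = len(X[0])
    return [[alpha * silu(sum(W[i][p] * X[p][j] for p in range(len(X))) + B[i]) + beta
             for j in range(n_cols)]
            for i in range(len(W))]

def feature_extractor(X, W1, B1, W2, B2,
                      alpha1=1.0, beta1=0.0, alpha2=1.0, beta2=0.0):
    H1 = dense_silu(W1, B1, X, alpha1, beta1)           # first layer
    H2 = dense_silu(W2, B2, H1, alpha2, beta2)          # second layer
    return [[max(0.0, h) for h in row] for row in H2]   # ReLU
```

Each column of the input is one spectrum, so the output keeps one column per spectrum and one row per hidden feature.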

Inter-feature learning and resolution improvement. Our model uses 6 encoder layers with 8-head self-attention in each encoder for data processing. The attention function used in each head is:

$${{{\rm{Attention}}}}\left({{{\bf{Q}}}},{{{\bf{K}}}},{{{\bf{V}}}}\right)={{{\rm{softmax}}}}(\frac{{{{\bf{Q}}}}{{{{\bf{K}}}}}^{{{{\rm{T}}}}}}{\sqrt{{D}_{k}}}){{{\bf{V}}}},$$

(19)

where ${{{\bf{Q}}}}={{{{\rm{H}}}}}_{3}{{{{\bf{W}}}}}_{Q}$, ${{{\bf{K}}}}={{{{\rm{H}}}}}_{3}{{{{\bf{W}}}}}_{k}$ and ${{{\bf{V}}}}={{{{\rm{H}}}}}_{3}{{{{\bf{W}}}}}_{V}$. ${{{{\bf{W}}}}}_{Q},{{{{\bf{W}}}}}_{k},{{{{\bf{W}}}}}_{V}\in {{\mathbb{R}}}^{512\times {D}_{k}}$ are projection matrices, and ${D}_{k}$ is the dimension of the subspaces for keys and queries in each attention head.

$${{{\rm{MultiHead}}}}\left({{{\bf{Q}}}},{{{\bf{K}}}},{{{\bf{V}}}}\right)={{{\rm{Concat}}}}\left({{{\rm{hea}}}}{{{{\rm{d}}}}}_{1},\ldots,{{{\rm{hea}}}}{{{{\rm{d}}}}}_{h}\right){{{{\bf{W}}}}}^{O},$$

(20)

where $\mathrm{head}_{i}=\mathrm{Attention}(\mathbf{Q}\mathbf{W}_{i}^{\mathbf{Q}},\mathbf{K}\mathbf{W}_{i}^{\mathbf{K}},\mathbf{V}\mathbf{W}_{i}^{\mathbf{V}})$, with $\mathbf{Q}\mathbf{W}_{i}^{\mathbf{Q}},\mathbf{K}\mathbf{W}_{i}^{\mathbf{K}},\mathbf{V}\mathbf{W}_{i}^{\mathbf{V}}\in {\mathbb{R}}^{512\times {D}_{k}}$ and $\mathbf{W}^{O}\in {\mathbb{R}}^{h{D}_{k}\times 512}$. In this work, we employ $h=8$ parallel attention heads.
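For readers less familiar with Eq. (19), a minimal single-head scaled dot-product attention can be written in pure Python as below. This is a generic illustration of the standard formula with rows treated as tokens; the helper names are ours, and the multi-head projection, concatenation, dropout and layer normalization of Eqs. (20)–(22) are deliberately omitted.

```python
import math

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)                               # subtract max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(D_k)) V for a single head."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]       # transpose of K
    scores = matmul(Q, K_t)
    weights = [softmax([v / math.sqrt(d_k) for v in row]) for row in scores]
    return matmul(weights, V)
```

Each output row is a weighted average of the rows of V, with weights given by the softmax of the scaled query-key similarities.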

The output is processed through a dropout, residual connection, and layer normalization:

$${{{{\rm{H}}}}}_{4}={\mbox{LayerNorm}}({{{{\rm{H}}}}}_{3}+{\mbox{Dropout}}({\mbox{MultiHead}}({{{\bf{Q}}}},{{{\bf{K}}}},{{{\bf{V}}}}))),$$

(21)

Instead of a traditional feed-forward network, the output of the self-attention mechanism is passed through a custom layer, followed by ReLU activation, dropout, and layer normalization:

$$\mathrm{H}_{5}=\mathrm{LayerNorm}(\mathrm{H}_{4}+\mathrm{Dropout}(\mathrm{ReLU}(\mathrm{Layer}(\mathrm{H}_{4})))).$$

(22)

Then $\mathrm{H}_{5}$ is passed through a network layer to generate $\mathrm{H}_{6}$ with higher mass spectrometric resolution.

Lipid fingerprint regenerating. The third stage integrates higher mass spectrometric features, effectively regenerating lipid fingerprints through feature integration:

$$\widehat{y}=\mathrm{ReLU}\left(\mathrm{Layer}_{\mathrm{upsampling}}\left(\mathrm{Layer}_{\mathrm{downsampling}}\left(\mathrm{H}_{6}\right)\right)\right).$$

(23)

Fine-tuning objectives. We used the Mean Squared Error (MSE) loss function for parameter adjusting:

$${{\mbox{MSE}}}=\frac{1}{n}{\sum }_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}$$

(24)

where $n$ is the number of elements in the target vector, ${\hat{y}}_{i}$ is the predicted value, and ${y}_{i}$ is the ground truth value.

Additionally, a custom learning rate scheduler adjusts the learning rate dynamically based on the training loss to optimize the model’s performance.

WMYn Model Training. The model was trained using the Adam optimizer⁴⁸ with a learning rate of 0.01. To ensure reproducibility, a random seed of 42 was set, and model parameters were initialized accordingly. The training process included the following steps: (1) During each epoch, the model predictions were computed through a forward pass, followed by calculation of the MSE loss between the predictions and the true values. (2) Gradients were then computed via backpropagation, and the model parameters were updated using the Adam optimizer. (3) A custom learning rate scheduler was employed to adjust the learning rate dynamically based on the validation loss, with learning rates set at [0.01, 0.01, 0.001, 0.001, 0.0001] and a patience of 500 epochs. Training was conducted for a maximum of 3000 epochs. Early stopping was implemented by monitoring the loss during training and halting the process if the loss fell below a predefined threshold of 40, thus preventing overfitting and unnecessary computation. The model with the lowest validation loss after epoch 60 was selected as the optimal model, and its parameters were saved. Loss values were recorded and saved for further analysis. The model was trained individually for each dataset, and the best model for each dataset was saved and used for prediction. The predictions were then evaluated against the ground truth values to assess the robustness and accuracy of the model. All computations were carried out using PyTorch (https://pytorch.org/) on a high-performance computing server.
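The scheduling and stopping logic described above can be sketched in plain Python. This is one possible reading, not the authors' code: the class `StepOnPlateau`, the function `run_schedule`, and the rule "move to the next listed rate after `patience` epochs without improvement" are assumptions, and a single loss stream stands in for both the monitored and the validation loss.

```python
from typing import Iterable, List, Optional, Tuple

LEARNING_RATES = [0.01, 0.01, 0.001, 0.001, 0.0001]  # from the text
PATIENCE = 500            # epochs without improvement before stepping
MAX_EPOCHS = 3000
LOSS_THRESHOLD = 40.0     # early-stopping threshold
MIN_EPOCH_FOR_BEST = 60   # best model is chosen after this epoch


class StepOnPlateau:
    """Hypothetical scheduler: advance through a fixed list of rates."""

    def __init__(self, rates: List[float], patience: int) -> None:
        self.rates = list(rates)
        self.patience = patience
        self.index = 0
        self.best = float("inf")
        self.stale = 0

    @property
    def lr(self) -> float:
        return self.rates[self.index]

    def step(self, loss: float) -> float:
        """Record one epoch's loss and return the rate for the next epoch."""
        if loss < self.best:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience and self.index < len(self.rates) - 1:
                self.index += 1
                self.stale = 0
        return self.lr


def run_schedule(
    losses: Iterable[float],
    rates: List[float] = LEARNING_RATES,
    patience: int = PATIENCE,
    max_epochs: int = MAX_EPOCHS,
    threshold: float = LOSS_THRESHOLD,
    min_epoch: int = MIN_EPOCH_FOR_BEST,
) -> Tuple[int, Optional[int], float]:
    """Replay per-epoch losses; return (epochs run, best epoch, final rate)."""
    scheduler = StepOnPlateau(rates, patience)
    best_loss = float("inf")
    best_epoch: Optional[int] = None
    epochs_run = 0
    for epoch, loss in enumerate(losses, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        scheduler.step(loss)
        # Track the best model only after the warm-up period.
        if epoch > min_epoch and loss < best_loss:
            best_loss, best_epoch = loss, epoch
        # Early stopping once the loss falls below the threshold.
        if loss < threshold:
            break
    return epochs_run, best_epoch, scheduler.lr


# Toy replay with small settings so the behaviour is easy to follow.
result = run_schedule(
    [100.0, 90.0, 90.0, 90.0, 80.0, 30.0, 20.0],
    rates=[0.01, 0.001], patience=2, max_epochs=10,
    threshold=40.0, min_epoch=2,
)
```

In the toy replay the rate drops once after two stagnant epochs, and training halts at epoch 6 when the loss first falls below the threshold. In a real training loop the forward pass, backpropagation and optimizer update would run where the replayed loss values are consumed.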

Benchmark

The MS-DIAL published library was downloaded from https://systemsomicslab.github.io/compms/msdial/main.html#MSP. The additional lipid library is a hierarchical library calculated using an iterative algorithm, with a total size of 168.5 million. All benchmark tests were performed on a personal computer with a 13th Gen Intel® Core™ i7-13700F × 16-Core Processor and 64 GB of memory, running the Windows 11 operating system, R-4.2.3, and Python v.3.9. MS entropy and Flash entropy were downloaded from GitHub at https://github.com/YuanyueLi/SpectralEntropy and https://github.com/YuanyueLi/FlashEntropySearch, respectively. LipidMatch was downloaded from https://github.com/GarrettLab-UF/LipidMatch. In all tests, we used MS-DIAL v4.9.221218 and LipidSearch V4.2.

In recall testing, all methods used a precursor ion matching tolerance of <0.01 Da (or 5 ppm) and an MS/MS ion querying tolerance of <0.025 Da (or 10 ppm). In the computation time test, the comparison was conducted on the MS-DIAL public library, and spectra were randomly selected from the mixture dataset. Notably, only the identity search module of Flash entropy was used. We used the RT prediction algorithm proposed by LDA to perform RT-based false positive removal, filtering false annotations using the predicted RT ± 4-fold mean RT deviation as recommended. The “statTarget” package was used to correct for batch effects in the clinical cohort of 1393 samples and in another clinical cohort of 333 samples. Lipid biomarker selection was done using the “randomForest” package, setting a maximum depth of 5. Models differentiating breast cancer patients from healthy volunteers using ten biomarkers were implemented using the “lightgbm” package, setting a learning rate of 0.01 and a maximum depth of 5. The analysis of clinical metrics and lipids was done using the “WGCNA” package. In the analysis of expression levels at different C=C positions, the abundance ratios of the counted diagnostic ion pairs were calculated to represent the relative content ratios of the isomers with the following equation:

$${{RPA}}_{j}={PA}\times \frac{{{IA}}_{j}+{{IB}}_{j}}{{\sum }_{i=1}^{n}({{IA}}_{i}+{{IB}}_{i})}$$

(25)

where ${{RPA}}_{j}$ denotes the relative peak area of the $j$th C=C position isomer, $n$ is the number of C=C position isomers, and ${PA}$ denotes the peak area of all isomers, a value generally obtained from LC-MS/MS quantitative results. ${{IA}}_{j}$ and ${{IB}}_{j}$ denote the intensities of the paired peaks produced when the $j$th isomer breaks at its C=C position^49,50. Finally, ${{RPA}}_{j}$ was normalized by lipid annotation in the heat maps. A summary of terminology and definitions is available in Supplementary Data 11.
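Eq. (25) can be checked with a few lines of Python. The sketch below is illustrative only: the function name `relative_peak_areas` and the example intensities are assumptions, not values from this study.

```python
from typing import List, Sequence, Tuple


def relative_peak_areas(
    peak_area: float, ion_pairs: Sequence[Tuple[float, float]]
) -> List[float]:
    """Split a total peak area PA among C=C position isomers.

    Each entry of `ion_pairs` is (IA_j, IB_j), the intensities of the paired
    diagnostic ions of isomer j. Returns RPA_j for every isomer, as in Eq. (25).
    """
    totals = [ia + ib for ia, ib in ion_pairs]
    denominator = sum(totals)
    if denominator <= 0:
        raise ValueError("diagnostic ion intensities must sum to a positive value")
    return [peak_area * t / denominator for t in totals]


# Two hypothetical isomers sharing one chromatographic peak of area 1000.
rpa = relative_peak_areas(1000.0, [(30.0, 20.0), (100.0, 50.0)])
```

The paired intensities sum to 50 and 150, so the two isomers receive 250 and 750 of the total area. The relative peak areas always add up to the original peak area.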

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The lipidomics mass spectrometry data have been deposited in MetaboLights under accession code MTBLS10170⁵¹ and in the National Genomics Data Center under identifier PRJCA028507. The datasets used in the three rule validations are available as follows: ST002384, ST001794 and ST003514 in Metabolomics Workbench [https://www.metabolomicsworkbench.org]^34,39,43; DM0031 and DM0044 [https://prime.psc.riken.jp/menta.cgi/prime/drop_index]^11,33; and MetaboLights identifiers MTBLS4684, MTBLS6965, MTBLS1369, MTBLS4654, and MTBLS6511 [https://www.ebi.ac.uk/metabolights/]^35,36,40,41,42. The MS-DIAL published library was downloaded from the MS-DIAL website [https://systemsomicslab.github.io/compms/msdial/main.html#MSP]¹¹. The additional lipid hierarchical library, calculated using an iterative algorithm and with a total size of 168.5 million, has been uploaded to Zenodo [https://doi.org/10.5281/zenodo.14824498]⁵². All data supporting the results of this study are available in the article, supplementary materials, and source data files. Source data are provided with this paper.

Code availability

The code for LipidIN is available at GitHub [https://github.com/LinShuhaiLAB/LipidIN], Zenodo [https://doi.org/10.5281/zenodo.14824498], and CodeOcean [https://doi.org/10.24433/CO.3229548.v3].

References

Hu, C., Luo, W., Xu, J. & Han, X. Recognition and avoidance of ion source-generated artifacts in lipidomics analysis. Mass Spectrom. Rev. 41, 15–31 (2022).
Han, X. & Gross, R. W. The foundations and development of lipidomics. J. Lipid Res. 63, 100164 (2022).
Ni, Z. et al. Guiding the choice of informatics software and tools for lipidomics research applications. Nat. Methods 20, 193–204 (2023).
Zhang, X. et al. Leveraging unidentified metabolic features for key pathway discovery: chemical classification-driven network analysis in untargeted metabolomics. Anal. Chem. 96, 3409–3418 (2024).
Xu, L., Wang, X., Jiao, Y. & Liu, X. Assessment of potential false positives via orbitrap-based untargeted lipidomics from rat tissues. Talanta 178, 287–293 (2018).
Young, A., Röst, H. & Wang, B. Tandem mass spectrum prediction for small molecules using graph transformers. Nat. Mach. Intell. 6, 404–416 (2024).
Richardson, L. T., Brantley, M. R. & Solouki, T. Using isotopic envelopes and neural decision tree-based in silico fractionation for biomolecule classification. Anal. Chim. Acta 1112, 34–45 (2020).
Koelmel, J. P. et al. LipidMatch: an automated workflow for rule-based lipid identification using untargeted high-resolution tandem mass spectrometry data. BMC Bioinform. 18, 1–11 (2017).
Chitpin, J. G. et al. BATL: Bayesian annotations for targeted lipidomics. Bioinformatics 38, 1593–1599 (2022).
Goldman, S. et al. Annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nat. Mach. Intell. 5, 965–979 (2023).
Tsugawa, H. et al. A lipidome atlas in MS-DIAL 4. Nat. Biotechnol. 38, 1159–1163 (2020).
Taguchi, R. & Ishikawa, M. Precise and global identification of phospholipid molecular species by an Orbitrap mass spectrometer and automated search engine Lipid Search. J. Chromatogr. A 1217, 4229–4239 (2010).
Li, Y. et al. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat. Methods 18, 1524–1531 (2021).
Li, Y. & Fiehn, O. Flash entropy search to query all mass spectral libraries in real time. Nat. Methods 20, 1475–1478 (2023).
Schmid, R. et al. Ion identity molecular networking for mass spectrometry-based metabolomics in the GNPS environment. Nat. Commun. 12, 3832 (2021).
Kretschmer, F., Harrieder, E.-M., Hoffmann, M. A., Böcker, S. & Witting, M. RepoRT: a comprehensive repository for small molecule retention times. Nat. Methods 21, 153–155 (2024).
Sorokin, A. A. et al. Modern machine-learning applications in ambient ionization mass spectrometry. Mass Spectrom. Rev. 44, 74–88 (2025).
Krettler, C. A. & Thallinger, G. G. A map of mass spectrometry-based in silico fragmentation prediction and compound identification in metabolomics. Brief. Bioinform. 22, bbab073 (2021).
Kind, T. et al. Identification of small molecules using accurate mass MS/MS search. Mass Spectrom. Rev. 37, 513–532 (2018).
Xing, S. et al. BUDDY: molecular formula discovery via bottom-up MS/MS interrogation. Nat. Methods 20, 881–890 (2023).
Hettiarachchi, I. T. et al. A fresh look at functional link neural network for motor imagery-based brain–computer interface. J. Neurosci. Methods 305, 28–35 (2018).
Liu, Z. et al. KAN: Kolmogorov–Arnold networks. arXiv:2404.19756 https://doi.org/10.48550/arXiv.2404.19756 (2024).
Vaswani, A. et al. Attention is all you need. In NIPS https://doi.org/10.48550/arXiv.1706.03762 (2017).
Kolmogorov, A. N. Doklady Akademii Nauk, Vol. 114 953–956 (Russian Academy of Sciences, 1957).
Fréneau, M. & Hoffmann, N. The Paternò-Büchi reaction—Mechanisms and application to organic synthesis. J. Photochem. Photobio. C Photochem. Rev. 33, 83–108 (2017).
Ma, X. & Xia, Y. Pinpointing double bonds in lipids by Paternò-Büchi reactions and mass spectrometry. Angew. Chem. Int Ed. Engl. 53, 2592–2596 (2014).
Zhang, C. et al. Detection and analysis of triacylglycerol regioisomers via electron-activated dissociation (EAD) tandem mass spectrometry. Talanta 270, 125552 (2024).
Adusumilli, R. & Mallick, P. Data conversion with proteowizard msconvert. Methods Mol. Biol. 1550, 339–368 (2017).
Kumler, W. A. I. & Anitra, E. Tidy data neatly resolves mass-spectrometry’s ragged arrays. R. J. 14, 193–202 (2022).
Zhang, D. et al. LipidOA: a machine-learning and prior-knowledge-based tool for structural annotation of glycerophospholipids. Anal. Chem. 94, 16759–16767 (2022).
Baba, T. et al. Dissociation of biomolecules by an intense low-energy electron beam in a high sensitivity time-of-flight mass spectrometer. J. Am. Soc. Mass Spectrom. 32, 1964–1975 (2021).
Brunet, T. A. et al. Concomitant investigation of crustacean amphipods lipidome and metabolome during the molting cycle by Zeno SWATH data-independent acquisition coupled with electron activated dissociation and machine learning. Anal. Chim. Acta 1304, 342533 (2024).
Tsugawa, H. et al. A lipidome landscape of aging in mice. Nat. Aging 4, 709–726 (2024).
Zeng, J. et al. Anti-allergic effect of dietary polyphenols curcumin and epigallocatechin gallate via anti-degranulation in IgE/antigen-stimulated mast cell model: a lipidomics perspective. Metabolites 13, 628 (2023).
Zhou, M. et al. Lipidomic analysis reveals altered lipid profiles of gingival tissues with periodontitis. J. Clin. Periodontol. 49, 1192–1202 (2022).
Inague, A. et al. Oxygen-induced pathological angiogenesis promotes intense lipid synthesis and remodeling in the retina. iScience 26, 106777 (2023).
Hartler, J. et al. Deciphering lipid structures based on platform-independent decision rules. Nat. Methods 14, 1171–1174 (2017).
White, J. B. et al. Equivalent carbon number and interclass retention time conversion enhance lipid identification in untargeted clinical lipidomics. Anal. Chem. 94, 3476–3484 (2022).
Folz, J. S., Shalon, D. & Fiehn, O. Metabolomics analysis of time-series human small intestine lumen samples collected in vivo. Food Funct. 12, 9405–9415 (2021).
Peng, B. et al. LipidCreator workbench to probe the lipidomic landscape. Nat. Commun. 11, 2057 (2020).
Savini, M. et al. Lysosome lipid signalling from the periphery to neurons regulates longevity. Nat. Cell Biol. 24, 906–916 (2022).
Fasimoye, R. et al. Golgi-IP, a tool for multimodal analysis of Golgi molecular content. Proc. Natl. Acad. Sci. USA 120, e2219953120 (2023).
Martínez, S., Fernández-García, M., Londoño-Osorio, S., Barbas, C. & Gradillas, A. Highly reliable LC-MS lipidomics database for efficient human plasma profiling based on NIST SRM 1950. J. Lipid Res. 65, 100671 (2024).
Ni, Z., Angelidou, G., Lange, M., Hoffmann, R. & Fedorova, M. Lipidhunter identifies phospholipids by high-throughput processing of LC-MS and shotgun lipidomics datasets. Anal. Chem. 89, 8800–8807 (2017).
Koelmel, J. P. et al. Lipid annotator: towards accurate annotation in non-targeted liquid chromatography high-resolution tandem mass spectrometry (LC-HRMS/MS) lipidomics using a rapid and user-friendly software. Metabolites 10, 101 (2020).
Ke, G. et al. LightGBM: a highly efficient gradient boosting decision tree. Neural Inform. Process. Syst. 32, 2417–2438 (2017).
Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinform. 9, 559 (2008).
Zhang, Z. Improved Adam optimizer for deep neural networks. In 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS) Banff, AB, Canada, 1–2, https://ieeexplore.ieee.org/document/8624183 (IEEE, 2018).
Ma, X. et al. Identification and quantitation of lipid C=C location isomers: a shotgun lipidomics approach enabled by photochemical reaction. Proc. Natl. Acad. Sci. USA 113, 2573–2578 (2016).
Zhang, W. et al. Online photochemical derivatization enables comprehensive mass spectrometric analysis of unsaturated phospholipid isomers. Nat. Commun. 10, 79 (2019).
Yurekten, O. et al. MetaboLights: open data repository for metabolomics. Nucleic Acids Res. 52, D1 (2024).
Hao, X. et al. LipidIN: lipid hierarchical library and code https://doi.org/10.5281/zenodo.14824498 (2025).

Acknowledgements

The authors would like to thank Dr. Chenchun Zhong from SCIEX (China) Co., Ltd, Dr. Jia Li from Xiamen Meliomics Co., Ltd, and Dr. Junhan Wu from PURSPEC Technology (China) Co., Ltd for technical assistance. This work was supported by grants from the National Key Research and Development Program of China (2022YFE0205800, 2022YFA1105300), the National Natural Science Foundation of China (91957120, 21974114), the Major Science and Technology Special Project of Fujian Province (2022YZ036012), the Fundamental Research Funds for the Central Universities (20720220003), and Project “111” sponsored by the State Bureau of Foreign Experts and Ministry of Education of China (BP0618017), as well as grant support from Guangzhou Hybribio Medicine Technology Ltd., to S.-H.L., and by the Natural Science Foundation of Fujian Province of China (2022J01330), the Natural Science Foundation of Xiamen City of China (3502Z20227208), and the China Scholarship Council (202308350047) to J.Z.

Author information

These authors contributed equally: Hao Xu, Tianhang Jiang, Yuxiang Lin, Lei Zhang.

Authors and Affiliations

The First Affiliated Hospital of Xiamen University, State Key Laboratory of Cellular Stress Biology, School of Life Sciences, XMU-HBN Skin Biomedical Research Center, Xiamen University, Xiamen, Fujian, China
Hao Xu, Tianhang Jiang, Lei Zhang, Huan Yang, Ridong Mao & Shu-Hai Lin
School of Medicine, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian, China
Hao Xu & Shu-Hai Lin
Department of Breast Surgery, Fujian Medical University Union Hospital, Fuzhou, Fujian Province, China
Yuxiang Lin
School of Pharmaceutical Sciences, Xiamen University, Xiamen, Fujian, China
Huan Yang
College of Ocean Food and Biological Engineering, Jimei University, Xiamen, China
Xiaoyun Huang & Jun Zeng
State Key Laboratory of Environmental and Biological Analysis, Department of Chemistry, Hong Kong Baptist University, Kowloon, Hong Kong, China
Zhu Yang & Zongwei Cai
Department of General Medicine, Shenzhen Longhua District Central Hospital, Shenzhen, China
Changchun Zeng
Xiamen Meliomics Co., Ltd., Xiamen, Fujian, China
Shuang Zhao
Department of Biological Sciences, Faculty of Health Sciences, University of Macau, Macau, China
Lijun Di
Department of Occupational and Environmental Health and the Ministry of Education Key Lab of Hazard Assessment and Control in Special Operational Environment, School of Public Health, Fourth Military Medical University, Xi’an, China
Wenbin Zhang
Eastern Institute of Technology, Ningbo, China
Zongwei Cai

Contributions

H.X. and S.H.L. conceived the project. H.X. developed and implemented the E.Q. and L.C.I. modules of the LipidIN framework. T.J. developed and implemented the WMYn of the LipidIN framework. Y.L., L.Z., H.Y., C.Z., and S.Z. collected clinical data and obtained mass spectrometry data. H.Y., X.H., and R.M. manually checked the annotation results. H.X. built the LipidIN UI platform. H.X., T.J., L.Z., and S.H.L. wrote the manuscript. Z.Y., J.Z., Z.C., and S.H.L. reviewed the manuscript. L.D., W.Z., J.Z., Z.C., and S.H.L. supervised the project and secured funding.

Corresponding authors

Correspondence to Jun Zeng, Zongwei Cai or Shu-Hai Lin.

Ethics declarations

Competing interests

S.Z. is the chief technology officer of Xiamen Meliomics Co., Ltd, China. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Jeremy Koelmel, Masahiko Okumura and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Xu, H., Jiang, T., Lin, Y. et al. LipidIN: a comprehensive repository for flash platform-independent annotation and reverse lipidomics. Nat Commun 16, 4566 (2025). https://doi.org/10.1038/s41467-025-59683-5

Received: 08 August 2024
Accepted: 29 April 2025
Published: 16 May 2025
Version of record: 16 May 2025
DOI: https://doi.org/10.1038/s41467-025-59683-5
