Metabolomics integrated with machine learning to discriminate the geographic origin of Rougui Wuyi rock tea

Peng, Yifei; Zheng, Chao; Guo, Shuang; Gao, Fuquan; Wang, Xiaxia; Du, Zhenghua; Gao, Feng; Su, Feng; Zhang, Wenjing; Yu, Xueling; Liu, Guoying; Liu, Baoshun; Wu, Chengjian; Sun, Yun; Yang, Zhenbiao; Hao, Zhilong; Yu, Xiaomin

doi:10.1038/s41538-023-00187-1

Article
Open access
Published: 16 March 2023

Yifei Peng^1,2,na1,
Chao Zheng^2,na1 (ORCID: orcid.org/0000-0002-0290-2177),
Shuang Guo^1,2,
Fuquan Gao^1,2,
Xiaxia Wang^2,
Zhenghua Du^2,
Feng Gao^3,
Feng Su^3,
Wenjing Zhang^3,
Xueling Yu^3,
Guoying Liu^4,
Baoshun Liu^5,
Chengjian Wu^6,
Yun Sun^1,
Zhenbiao Yang^2,
Zhilong Hao^1 &
Xiaomin Yu^2 (ORCID: orcid.org/0000-0002-8650-6795)

npj Science of Food volume 7, Article number: 7 (2023)

Subjects

Metabolomics

Abstract

The geographic origin of agri-food products contributes greatly to their quality and market value. Here, we developed a robust method combining metabolomics and machine learning (ML) to authenticate the geographic origin of Wuyi rock tea, a premium oolong tea. The volatiles of 333 tea samples (174 from the core region and 159 from the non-core region) were profiled using gas chromatography time-of-flight mass spectrometry and a series of ML algorithms were tested. Wuyi rock tea from the two regions featured distinct aroma profiles. Multilayer Perceptron achieved the best performance with an average accuracy of 92.7% on the training data using 176 volatile features. The model was benchmarked with two independent test sets, showing over 90% accuracy. Gradient Boosting algorithm yielded the best accuracy (89.6%) when using only 30 volatile features. The proposed methodology holds great promise for its broader applications in identifying the geographic origins of other valuable agri-food products.

Introduction

With the rapid growth in international trade and rising consumer awareness of the importance of food safety, issues around food authenticity have received increasing public attention in recent years¹. Consumers have become increasingly concerned about what foods to buy, where they are from and how they are produced. Agricultural and food products derived from certain core production regions are perceived to have a higher quality and hence a higher market value. Despite strict legislative control, adulteration and false labeling, driven by economic incentives, are still prevalent in the food industry. This aggravates unfair market competition, reduces consumer trust, and poses potential risks to public health^2,3. Therefore, the effective identification of the geographic origin of foodstuffs is essential for ensuring food quality and protecting consumer interests.

Tea (Camellia sinensis) is one of the most widely consumed beverages in the world, owing to an array of nutritional and health benefits. Regional provenance is among the most important attributes that tea consumers associate with high-quality tea products^4,5,6,7,8. Many teas are named after the geographic regions where they are grown. The most prominent examples are Darjeeling from India, Ceylon from Sri Lanka, and Westlake Longjing tea and Wuyi rock tea from China. These tea products are highly sought after by tea enthusiasts, a fact that also makes them prime targets for fraud.

For tea authentication, a variety of analytical tools have been developed, such as stable isotope analysis, multi-element profiling and metabolite fingerprinting⁹. The first two methods, used alone or more often in combination, are acknowledged as being effective for the authentication of a wide range of food products including tea^4,10,11. However, both measurements require laborious sample preparation and are hence technically demanding. Metabolomics in conjunction with chemometrics has emerged as one of the most promising methods to distinguish tea origins because of its wide metabolite coverage, high sensitivity and high throughput^{5,6,7,8,12,13}. In particular, gas chromatography-mass spectrometry (GC-MS), a technique well recognized for its capability to simultaneously monitor a large number of volatile organic compounds (VOCs) with proven reproducibility, has become a central platform for tea metabolomics research^{13,14,15,16,17,18}. Solid phase microextraction (SPME), which integrates sampling, volatile extraction and concentration in a single step, greatly simplifies sample preparation and is thus widely adopted in food analysis¹⁹. In recent years, SPME coupled with GC-MS has been successfully applied to assess the geographic origins of various food products, such as rice^20,21, wine²² and oranges²³. As such, VOC fingerprinting by SPME-GC-MS represents a promising approach to identify the geographic origin of tea.

In the past decade, machine learning (ML), which is capable of generating high accuracy outputs from massive volumes of multidimensional data, has found wide applications in many domains of science for pattern recognition, classification and prediction^24,25,26. In particular, its surging popularity for food authentication, origin identification and quality control is amply demonstrated by the increasing number of publications on this topic over the past few years, highlighting the great potential for food inspection and classification^{27,28,29,30,31,32}. Although metabolomics has been widely applied in tea research, the application of ML techniques in their analyses has just emerged^5,13. Nonetheless, the sample sizes in existing studies are typically small. As the performance of a specific classifier could be greatly influenced by the sample size, a small sample set is prone to errors in pattern recognition. Moreover, not enough efforts have been dedicated to distinguishing tea from a narrow geographic scope.

Wuyi rock tea (WRT), a preeminent subcategory of oolong tea, is internationally renowned for its intriguing flavors and lingering taste with a heavy roast note^33,34. WRT has been granted protected geographical indication (PGI) status in China, which in the broad sense includes all oolong tea produced in Wuyishan City of Fujian Province using suitable tea varieties and manufactured by the unique traditional processing technology. The distinctive flavor of WRT is determined by the tea cultivar, the growth environment and the manufacturing procedures³⁵. More than 200 tea cultivars are used to manufacture WRT, of which “Rougui” and “Shuixian” occupy the largest planting area. In practice, WRT is further classified based on the growing location. The core production region features a small area of only about 60 square kilometers inside the national preserve of Mount Wuyi, which is a UNESCO World Heritage Site located in northwestern Fujian⁴. WRT produced in this region (also known as “Zhengyan”) is perceived to have higher quality, referred to locally as “rock charm and floral fragrance” or “rock rhythm”^36,37. What makes “Zhengyan” WRT so special is the unique terroir where it grows (volcanic rock, steep cliffs, mineral-rich soil, abundant rainfall and high humidity), which creates a natural environment for tea cultivation that is found nowhere else. As expected, WRT produced in the core region usually fetches a high price. Because demand is high and production in the core area is limited, there are many imitations on the market, most of which are grown in areas surrounding the authentic core region and are considered to be inferior in terms of tea quality⁴. The conventional method for quality evaluation of WRT, like other tea, relies heavily on the experience of tea tasters, which not only suffers from inconsistency and inaccuracy, but also lacks quantitative information³⁸. For this reason, rapid and reliable analytical methods to trace the geographic origin of WRT are in high demand.

In this study, we explored the feasibility of ML-based analyses of VOC metabolomes to differentiate the origins of Rougui WRT. Our results show that our ML-based methods accurately authenticate the geographic origin of WRT based on volatile metabolites. The proposed methodology will be useful for managing adulteration or fraudulent labeling of WRT in the market and may be broadly applicable to tracing the origins of other valuable food and agricultural products.

Results and discussion

Overview of VOCs of WRT from different origins

SPME was combined with gas chromatography-time-of-flight mass spectrometry (GC-TOFMS) to identify and quantify VOCs of 333 Rougui WRT samples collected from the main growing regions in the northern and northwestern parts of Fujian Province, China (Fig. 1; Supplementary Table S1). Representative total ion chromatograms (TICs) of WRT samples from the core production region (hereafter referred to as “CRT”) and the non-core production region (hereafter referred to as “NCRT”) are depicted in Supplementary Fig. S1. We detected a total of 2128 features by GC-MS analysis. Quality control (QC) samples, prepared by pooling equal aliquots of all samples, were used to monitor the system performance. Principal component analysis (PCA) performed on all samples showed that QC samples formed a tight cluster in the center of the plot, indicating that the established GC-MS method was stable enough to allow for the subsequent analysis (Supplementary Fig. S2). After applying hierarchical cluster analysis to remove outliers, 276 samples were retained. A final data matrix composed of 447 metabolic features was obtained after eliminating those with low variance (mean absolute deviation ≤0) and was subsequently used for multivariate analysis. Among these, 44 volatiles were identified by comparison with authentic standards and 236 volatiles were tentatively identified by comparing their mass spectra and retention indices with those recorded in the database (NIST 20) or the literature (Supplementary Table S2).
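The low-variance filter mentioned above can be illustrated with a minimal sketch. It assumes that "mean absolute deviation" means the average absolute distance of a feature's intensities from their mean; the function names and the toy data are hypothetical and are not taken from the study.

```python
from statistics import fmean


def mean_absolute_deviation(values):
    """Average absolute distance of the values from their arithmetic mean."""
    center = fmean(values)
    return fmean(abs(v - center) for v in values)


def drop_low_variance(matrix, threshold=0.0):
    """Keep only features whose mean absolute deviation exceeds the threshold.

    `matrix` maps a feature name to its intensities across samples.
    """
    return {
        name: values
        for name, values in matrix.items()
        if mean_absolute_deviation(values) > threshold
    }


# Hypothetical toy data: three features measured in four samples.
toy = {
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [5.0, 5.0, 5.0, 5.0],  # constant across samples, so it is dropped
    "feature_c": [0.2, 0.1, 0.4, 0.3],
}
kept = drop_low_variance(toy)
print(sorted(kept))
```

With a threshold of 0, only features that do not vary at all across samples are removed.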

**Fig. 1: Geographical distributions of sampling points in the current study.**

No clear clustering of tea samples according to their geographic regions was evident in the PCA score plot, with PC1 and PC2 together explaining only 23.1% of the total variance (Fig. 2a). Orthogonal partial least squares-discriminant analysis (OPLS-DA) has been shown to generally outperform the unsupervised PCA method in classifying various agri-products^{39,40,41,42,43}. Therefore, it was applied in the current study. As expected, a better separation between groups was achieved by this method (Fig. 2b). The permutation test generated intercepts of R² = 0.23 and Q² = −0.26, thus confirming the validity of the OPLS-DA model (Fig. 2c).

**Fig. 2: Multivariate statistical analysis of volatile metabolites of CRT and NCRT Wuyi rock tea samples.**

Differential VOCs between CRT and NCRT samples

The volcano plot analysis highlighted 111 differential VOCs (57 up and 54 down) for the CRT collection in comparison with NCRT (Fig. 2d). Ultimately, 20 VOCs with significant differences, including esters (6), hydrocarbons (5), ketones (3), alcohols (3), heterocycles (2) and one unknown odorant, were identified based on the criteria of variable importance in projection (VIP) > 1, p < 0.05 and |fold change| > 1.5 (Table 1; Fig. 3).
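The three selection criteria can be combined as in the sketch below. It rests on two assumptions that the text does not spell out: the fold change is taken as the ratio of the CRT mean to the NCRT mean, and the cutoff of 1.5 is read symmetrically (ratio above 1.5 or below 1/1.5). All names and values are hypothetical.

```python
def fold_change(mean_crt, mean_ncrt):
    """Ratio of mean abundance in CRT to mean abundance in NCRT (assumed definition)."""
    return mean_crt / mean_ncrt


def is_differential(vip, p_value, fc, vip_cut=1.0, p_cut=0.05, fc_cut=1.5):
    """Apply the VIP, p-value and fold-change thresholds together."""
    # Assumption: the fold-change cutoff applies in both directions.
    changed = fc > fc_cut or fc < 1.0 / fc_cut
    return vip > vip_cut and p_value < p_cut and changed


# Hypothetical records: (VIP, p-value, mean in CRT, mean in NCRT).
records = {
    "voc_up": (1.8, 0.001, 3.0, 1.5),
    "voc_down": (1.2, 0.010, 1.0, 2.0),
    "voc_low_vip": (0.7, 0.001, 3.0, 1.0),
    "voc_small_fc": (1.5, 0.020, 1.2, 1.0),
    "voc_not_significant": (1.4, 0.200, 4.0, 1.0),
}

selected = [
    name
    for name, (vip, p, crt, ncrt) in records.items()
    if is_differential(vip, p, fold_change(crt, ncrt))
]
print(selected)
```

A feature has to pass all three filters, so a large fold change alone is not enough if the VIP score or the p-value fails its cutoff.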

Table 1 Differential volatile compounds tentatively identified in CRT and NCRT Wuyi rock tea samples.

**Fig. 3: Boxplots showing the levels of differential volatile metabolites between CRT and NCRT Wuyi rock tea samples based on Wilcoxon test.**

The amounts of volatiles with floral, fruity and woody scents (i.e., hortrienol, α-terpineol and 4-methyl-3-penten-2-one) differed the most between CRT and NCRT (Fig. 3). Generated by the Maillard reaction⁴⁴, 2-acetylpyrrole with a nutty and bread note was also found at a higher level in CRT. Such nitrogen-containing heterocyclic compounds with pyrrole and pyrazine structures largely account for the characteristic aromas in WRT resulting from roasting^37,45. Additionally, several odorless branched alkanes (i.e., 2,3,6,7-tetramethyloctane, 4,6-dimethyldodecane, 2,3,5,8-tetramethyldecane and 4-ethyl-tetradecane) accumulated to higher levels in CRT than in NCRT, presumably originating from the thermal degradation of terpenes during tea processing⁴⁶.

In contrast, esters that exhibited significant differences were exclusively present in higher proportions in NCRT (Fig. 3). Of note, most of them (i.e., hexyl 2-methylbutyrate, hexyl hexanoate, trans-2-hexenyl caproate and β-phenylethyl butyrate) impart fruity and green flavors. Likewise, a higher accumulation of 5,6-epoxy-β-ionone (fruity and floral) and ethyl isopropyl ketone (minty) was noted in NCRT. In a recent study, sixteen VOCs (mostly aldehydes and terpenes) were shown to differ among Rougui WRT collected from different cultivation regions⁴⁷. These volatiles, however, have little overlap with those discovered in the current study, most likely because only three tea samples were analyzed in that study and thus were not representative of WRT in general.

Taken together, we can conclude that CRT and NCRT present clearly distinct aroma profiles, with stronger floral, woody and roasted notes in the former group and stronger fruity and green odors in the latter group. It should be noted that roasting, which is a critical process unique to the processing of WRT, is essential for transforming tea aroma from floral to roasted and woody³⁷. In this regard, it is possible that the differences in the volatile concentrations of CRT and NCRT are not entirely due to differences in geographical location, but may also be influenced by the tea processing procedures. Furthermore, other factors such as climatic conditions, years of harvest and agricultural practices also have an influence, as demonstrated in similar studies conducted for other agri-products^29,32,48. Further studies are needed to determine which factor exerts the largest impact on the individual components of WRT aroma.

Authentication of the origin of WRT with machine learning

To develop a model for the origin discrimination of WRT, we selected 176 volatile features that could be stably detected across different batches of tea samples for model training (Supplementary Table S2). The dataset was randomly split into two parts in a stratified fashion: 80% (220) for model training and 20% (56) for model validation. Although ML algorithms such as random forest (RF) and support vector machines (SVM) have been widely employed to discriminate the geographic origins of various agricultural products, no single ML method performs best on all datasets, given that the meta-features (i.e., statistics about the number of data points, features and classes, as well as data skewness and the entropy of the targets) vary across datasets. In addition, the complexity of different ML algorithms varies. To generate the optimal model with the highest prediction accuracy and generalization ability, we tested 15 classification algorithms of Scikit-learn on the VOC metabolomics data to match model and data complexities in our study (Supplementary Dataset 1). Based on five-fold cross-validation, multilayer perceptron (MLP) achieved the best performance, yielding a prediction accuracy of 92.7% (Table 2). MLP, as the name suggests, consists of interconnected neurons that process data through three or more layers. Belonging to a class of feedforward artificial neural networks (ANN), MLP is well recognized for its predictive power and is widely used to extract information from complex and nonlinear relationships⁴⁹. From the ROC (receiver operating characteristic) curve, the AUC (area under the curve) value was computed as 0.96, which supported the generalization ability of the MLP model (Fig. 4a). High classification accuracies (>85%) could also be achieved by other classifiers, especially quadratic discriminant analysis (QDA), passive aggressive (PA) and SVM (Table 2).

Table 2 Comparison of the prediction results with different models.

**Fig. 4: Performance of the multilayer perceptron (MLP) model for provenance discrimination of WRT using 176 volatile features as the input dataset.**

Given the promising performance of MLP on internal cross validation, a test set was built with the remaining 20% of samples (56 in total) for external validation. About 91.1% of tea samples from the test set were correctly classified (Fig. 4b). To further benchmark the model performance, we then performed an independent validation on a separate set of 17 Rougui WRT samples purchased from the market, leading to 94.1% prediction accuracy (Fig. 4c). Overall, these results unequivocally demonstrate the effectiveness of VOC metabolomics in tandem with ML-based algorithms for provenance discrimination of WRT. By applying SVM modeling to the combined data from isotope ratio mass spectrometry (IRMS) and inductively coupled plasma mass spectrometry (ICP-MS), Lou and coworkers achieved 97.7% accuracy when discriminating between WRT and non-WRT⁴. Since IRMS and ICP-MS are both time-consuming and costly, the proposed metabolomics approach in the current study offers advantages in terms of simplicity and speed while still maintaining comparable accuracy. It serves as a good alternative for the authentication of WRT.
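As a quick arithmetic check on the reported accuracies, the sketch below searches for the number of correctly classified samples that reproduces each percentage for a given set size. The counts it finds are inferred from the percentages and set sizes only; the text itself does not state them, and the function name is hypothetical.

```python
def counts_matching(reported_percent, total, decimals=1):
    """Return every count of correct predictions whose accuracy rounds to the reported value."""
    return [
        correct
        for correct in range(total + 1)
        if round(100 * correct / total, decimals) == reported_percent
    ]


# Set sizes and accuracies as reported for the MLP model.
external_test = counts_matching(91.1, 56)   # 20% hold-out set
market_samples = counts_matching(94.1, 17)  # independently purchased samples

print(external_test, market_samples)
```

Under this reading, 91.1% on 56 samples corresponds to 51 correct predictions and 94.1% on 17 samples corresponds to 16, which means a single misclassified market sample.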

Machine learning-based prediction model with a simplified input dataset

In an effort to reduce computing power while improving prediction efficiency, a further attempt was made to establish an ML prediction model using a simplified input dataset. Towards this goal, the top 30 metabolic features with the highest VIP scores in the OPLS-DA analysis, instead of the original 176 metabolic features, were included to construct a new model. As the meta-features of the simplified dataset were changed, we tested the aforementioned 15 ML algorithms again to match the model and data complexities (Supplementary Dataset 2). For the simplified dataset, the GB model demonstrated the highest accuracy (89.6%) and AUC value (0.93) among all tested algorithms, slightly higher than the 86.8% accuracy obtained by the same model on the original dataset (Table 2; Fig. 5a). Furthermore, 87.5% and 94.1% accuracies were attained on the test and validation sets, respectively, indicating that a satisfactory prediction performance could still be realized with a much-reduced dataset (Fig. 5b, c). Although the MLP model produced an acceptable accuracy (83.2%) for the training set, accuracies for the test sets dropped to below 80%, implying that overfitting may have occurred and degraded the prediction performance. This is not unexpected, given that MLP is prone to overfitting when applied to unknown samples and hence usually requires a large number of parameters and extended training in order to generalize well to the test set⁵⁰. In contrast, the GB algorithm is typically well suited to small datasets and has been shown to outperform deep learning when the training data are insufficient⁵¹. Therefore, GB is more advantageous than MLP regarding the overfitting risk and robustness when a simplified dataset is considered.
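Selecting the reduced feature set amounts to ranking features by VIP score and keeping the first 30. The sketch below illustrates this step with made-up scores; the VIP values themselves would come from the OPLS-DA model, which is not reproduced here, and all names are hypothetical.

```python
def top_features_by_vip(vip_scores, n=30):
    """Return the names of the n features with the highest VIP scores."""
    ranked = sorted(vip_scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


# Hypothetical VIP scores for 176 features; real values come from OPLS-DA.
vip_scores = {f"feature_{i:03d}": i / 100 for i in range(1, 177)}

selected = top_features_by_vip(vip_scores)
print(len(selected), selected[0], selected[-1])
```

Only the columns named in `selected` would then be passed to the classifiers, which shrinks the input from 176 to 30 features.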

**Fig. 5: Performance of the gradient boosting (GB) model for provenance discrimination of WRT using the simplified input dataset.**

In summary, the present study demonstrates the geographic origin discrimination of 333 WRT samples, the largest panel tested as far as we know, using a strategy combining VOC fingerprinting with ML. The predictive modeling of VOC metabolomics data has proven to be a highly promising and powerful tool to confirm the authenticity, quality, and origin of WRT. Furthermore, we demonstrate a terroir impact on the flavor of WRT and identify 20 VOCs that distinguish CRT from NCRT. It should be noted that VOC profiling could be affected by several factors, such as tea storage conditions and processing methods. Moreover, fraud cases in real-life scenarios are even more complicated. For example, CRT is likely to be intentionally mixed with or partly replaced by NCRT for increased commercial value. Although ML is routinely used to analyze individual chemical datasets, integration of additional spectral data, such as those acquired by liquid chromatography-mass spectrometry (LC-MS), nuclear magnetic resonance (NMR) or near-infrared (NIR) spectroscopy, as well as prior knowledge on tea samples (i.e., storage conditions and processing methods) will further increase both the accuracy and the generalization ability of the predictive models.

Methods

Sample collection

During 2019 and 2020, a total of 333 authentic WRT (Camellia sinensis (L.) O. Kuntze cv. “Rougui”) samples of different origins were collected from various producers in Nanping, the northern production area, and Sanming, the northwestern production area of Fujian Province (Supplementary Table S1). These are labeled as CRT and NCRT, respectively. Sample numbers were as follows: CRT (n = 174) and NCRT (n = 159). Samples were stored in airtight and lightproof aluminum foil packages at 4 °C prior to analysis.

Standards

2-acetylpyrrole and n-alkane standard solution C₈-C₂₅ were purchased from Sigma-Aldrich (St. Louis, MO, USA). 2-acetyl-5-methylfuran, β-phenylethyl butyrate, 2-phenylethyl hexanoate, ethyl isopropyl ketone, α-terpineol and trans-2-hexenyl caproate were from Shanghai Yuanye Bio-Technology Co., Ltd. (Shanghai, China). Hexyl hexanoate was purchased from Shanghai Zhenzhun Bio-Technology Co., Ltd. (Shanghai, China). 4-methyl-3-penten-2-one was from Shanghai Yi’en Chemical Technology Co., Ltd. (Shanghai, China). Ethyl decanoate was from Shanghai ANPEL Scientific Instrument Co., Ltd. (Shanghai, China).

Headspace solid-phase microextraction (HS-SPME)

HS-SPME was performed according to the method in our previous paper⁵² with minor modifications. Finely ground tea powder (2 g) was placed into a 20-mL sealed glass vial with 0.5 nmol ethyl decanoate as an internal standard. A polydimethylsiloxane-divinylbenzene (PDMS-DVB) SPME fiber was selected for volatile extraction following procedures reported in our previous work⁵².

Gas chromatography-mass spectrometry (GC-MS)

The GC-MS system and the analytical conditions were the same as those in our previous paper⁵³. GC separation was achieved on a Restek Rxi®-5Sil MS capillary column (30 m, inner diameter 0.25 mm and film thickness 0.25 µm). The oven temperature was held at 50 °C for 5 min, increased at 3 °C/min to 210 °C, then increased at 15 °C/min to 330 °C and held for a final 5 min. The MS was operated in electron impact (EI) mode with an ionization energy of 70 eV and a mass scan range of 30–500 m/z. Samples were analyzed in triplicate. QC samples prepared by pooling aliquots of all samples were injected every ten runs throughout the analysis to monitor instrument fluctuations.
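The oven program implies a fixed run time per injection, which can be worked out from the hold times and ramp rates given above. The sketch below does this arithmetic; it assumes no extra equilibration or cool-down time between runs, and the function name is hypothetical.

```python
def ramp_minutes(start_c, end_c, rate_c_per_min):
    """Time needed to heat from start_c to end_c at a constant rate."""
    return (end_c - start_c) / rate_c_per_min


initial_hold = 5.0                       # 50 °C for 5 min
first_ramp = ramp_minutes(50, 210, 3)    # 3 °C/min up to 210 °C
second_ramp = ramp_minutes(210, 330, 15) # 15 °C/min up to 330 °C
final_hold = 5.0                         # 5 min at 330 °C

total_minutes = initial_hold + first_ramp + second_ramp + final_hold
print(f"{total_minutes:.1f} min per run")
```

The first ramp dominates the program at roughly 53 min, giving a total of about 71 min per injection before any cool-down.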

Qualitative and quantitative analyses of volatile compounds

Raw data were processed with ChromaTOF software (v4.51.6, LECO) for spectral deconvolution and alignment. The following conditions were used: (1) signal-to-noise (S/N) ratio = 20; (2) maximum retention time difference = 2 s; (3) peak width = 5 s and (4) the mass spectral match score ≥700. The criteria used for the assignment of volatile metabolites were the same as in the previous work⁵². Relative quantification was made based on the peak area ratio of analytes to the internal standard.
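Relative quantification by peak area ratio can be sketched as follows. Averaging the ratios across the triplicate injections is an assumption made for illustration; the text states only that samples were analyzed in triplicate and that ratios to the internal standard were used. The peak areas and function names are hypothetical.

```python
from statistics import fmean


def relative_abundance(analyte_area, internal_standard_area):
    """Peak area of the analyte divided by the peak area of the internal standard."""
    if internal_standard_area <= 0:
        raise ValueError("internal standard peak area must be positive")
    return analyte_area / internal_standard_area


def mean_relative_abundance(injections):
    """Average the area ratio over replicate injections (assumed aggregation)."""
    return fmean(relative_abundance(analyte, standard) for analyte, standard in injections)


# Hypothetical peak areas (analyte, ethyl decanoate) for three replicate injections.
triplicate = [(200.0, 100.0), (210.0, 100.0), (190.0, 100.0)]
result = mean_relative_abundance(triplicate)
print(round(result, 3))
```

Dividing by the internal standard within each injection corrects for run-to-run differences in extraction and injection efficiency before replicates are combined.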

Multivariate and statistical analysis

Volatile data were auto-scaled, quantile-normalized and log10-transformed prior to hierarchical cluster analysis using the MetaboAnalyst 5.0 program (https://www.metaboanalyst.ca/). PCA and OPLS-DA analyses were conducted in Simca-P (v14.1, Umetrics, Umeå, Sweden). All data were presented as mean ± standard deviation (SD). The Wilcoxon rank sum test was applied to compare the differences in the volatile abundance between CRT and NCRT.
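Two of the preprocessing steps named above, log10 transformation and auto-scaling, are illustrated in the sketch below for a single feature. The order used here (log transformation first, then auto-scaling to zero mean and unit standard deviation) is an assumption, quantile normalization is omitted, and this is not the MetaboAnalyst implementation.

```python
import math
from statistics import fmean, stdev


def log10_autoscale(values):
    """Log10-transform one feature, then center it and scale to unit standard deviation."""
    logged = [math.log10(v) for v in values]
    mean = fmean(logged)
    spread = stdev(logged)  # sample standard deviation
    return [(x - mean) / spread for x in logged]


# Hypothetical intensities of one feature spanning two orders of magnitude.
scaled = log10_autoscale([1.0, 10.0, 100.0])
print([round(x, 3) for x in scaled])
```

After this step every feature has the same scale, so abundant volatiles do not dominate clustering simply because their raw peak areas are larger.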

Machine learning modeling

To generate a model for geographic origin discrimination of WRT, different ML algorithms, including multilayer perceptron (MLP), quadratic discriminant analysis (QDA), passive aggressive (PA), support vector machines (SVM), linear discriminant analysis (LDA), random forest (RF), stochastic gradient descent (SGD), gradient boosting (GB), AdaBoost (AB), k-nearest neighbors (KNN), linear support vector machines (LinearSVM), Bernoulli naive Bayes (BernoulliNB), extra tree (ET), Gaussian naive Bayes (GaussianNB) and decision tree (DT), were tested using the Scikit-learn (v0.24.2) package in Python (v3.8.12) (Fig. 6)⁵⁴. Metabolomics data were split into two parts: 80% for model training and 20% for validation. The training set was used to train the models with five-fold cross-validation. Full-pipeline Bayesian hyperparameter optimization was performed to select the optimal ML method. We ran 40 jobs in parallel on a machine with 80 cores for 24 h. For the resource limits, we set a time limit of 30 min for a single run and a memory limit of 100 GB^54,55. Various classification metrics, including accuracy, precision, recall, and AUC, were calculated to evaluate the model performance.
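The split and cross-validation scheme can be sketched with Scikit-learn as below. This is an illustration on synthetic data, not the authors' pipeline: the class balance, the shifted features, the scaling step and the MLP settings are all assumptions, and the Bayesian hyperparameter search is not reproduced.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real matrix: 276 samples x 176 features, two classes.
rng = np.random.default_rng(0)
y = np.array([0] * 138 + [1] * 138)  # assumed balance, for illustration only
X = rng.normal(size=(276, 176))
X[y == 1, :10] += 1.0                # make ten features weakly informative

# Stratified hold-out of 56 samples (about 20%), leaving 220 for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=56, stratify=y, random_state=0
)

# Hypothetical model settings; the study tuned its models automatically.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)

# Five-fold cross-validation on the training set only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

# Final fit on the full training set, then a single evaluation on the hold-out set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```

Keeping the hold-out set out of cross-validation means the test accuracy is measured on samples that played no part in model selection.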

**Fig. 6: Diagram of the machine learning workflow.**

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The authors declare that all pertinent data that support this study have been included within the paper. Raw data will be made available by corresponding authors upon request.

References

Walaszczyk, A. & Galińska, B. Food origin traceability from a consumer’s perspective. Sustainability 12, 1872 (2020).
Amaral, J. S. Target and non-target approaches for food authenticity and traceability. Foods 10, 172 (2021).
Brooks, C. et al. A review of food fraud and food authenticity across the food supply chain, with an examination of the impact of the COVID-19 pandemic and Brexit on food industry. Food Control 130, 108171 (2021).
Lou, Y. X. et al. Stable isotope ratio and elemental profile combined with support vector machine for provenance discrimination of oolong tea (Wuyi-Rock tea). J. Anal. Methods Chem. 2017, 5454231 (2017).
Peng, C. et al. A comparative UHPLC-Q/TOF-MS-based metabolomics approach coupled with machine learning algorithms to differentiate Keemun black teas from narrow-geographic origins. Food Res. Int. 158, 111512 (2022).
Wang, Z. et al. Region identification of Xinyang Maojian tea using UHPLC-Q-TOF/MS-based metabolomics coupled with multivariate statistical analyses. J. Food Sci. 86, 1681–1691 (2021).
Zhang, Q. et al. Differentiating Westlake Longjing tea from the first- and second-grade producing regions using ultra high performance liquid chromatography with quadrupole time-of-flight mass spectrometry-based untargeted metabolomics in combination with chemometrics. J. Sep. Sci. 43, 2794–2803 (2020).
Zhao, J. et al. Identification of markers for tea authenticity assessment: Non-targeted metabolomics of highly similar oolong tea cultivars (Camellia sinensis var. sinensis). Food Control 142, 109223 (2022).
Shuai, M. Y. et al. Recent techniques for the authentication of the geographical origin of tea leaves from Camellia sinensis: A review. Food Chem. 374, 131713 (2022).
Liu, Z. et al. Geographical traceability of Chinese green tea using stable isotope and multi-element chemometrics. Rapid Commun. Mass Spectrom. 33, 778–788 (2019).
Katerinopoulou, K., Kontogeorgos, A., Salmas, C. E., Patakas, A. & Ladavos, A. Geographical origin authentication of agri-food products: a review. Foods 9, 489 (2020).
Wang, T. et al. Mass spectrometry-based metabolomics and chemometric analysis of Pu-erh teas of various origins. Food Chem. 268, 271–278 (2018).
Yun, J. et al. Use of headspace GC/MS combined with chemometric analysis to identify the geographic origins of black tea. Food Chem. 360, 130033 (2021).
Beale, D. J. et al. Review of recent developments in GC-MS approaches to metabolomics-based research. Metabolomics 14, 018–1449 (2018).
Wu, H. et al. GC-MS-based metabolomic study reveals dynamic changes of chemical compositions during black tea processing. Food Res. Int. 120, 330–338 (2019).
Guo, X., Schwab, W., Ho, C.-T., Song, C. & Wan, X. Characterization of the aroma profiles of oolong tea made from three tea cultivars by both GC–MS and GC-IMS. Food Chem. 376, 131933 (2022).
Qi, D. et al. Study on the effects of rapid aging technology on the aroma quality of white tea using GC–MS combined with chemometrics: In comparison with natural aged and fresh white tea. Food Chem. 265, 189–199 (2018).
Wang, Z. et al. Effect of different drying methods after fermentation on the aroma of Pu-erh tea (ripe tea). LWT 171, 114129 (2022).
Kataoka, H., Lord, H. L. & Pawliszyn, J. Applications of solid-phase microextraction in food analysis. J. Chromatogr. A 880, 35–62 (2000).
Lim, D. K. et al. Non-destructive profiling of volatile organic compounds using HS-SPME/GC–MS and its application for the geographical discrimination of white rice. J. Food Drug Anal. 26, 260–267 (2018).
Ch, R. et al. Metabolomic fingerprinting of volatile organic compounds for the geographical discrimination of rice samples from China, Vietnam and India. Food Chem. 334, 127553 (2021).
Šuklje, K. et al. Regional discrimination of Australian Shiraz wine volatome by two-dimensional gas chromatography coupled to time-of-flight mass spectrometry. J. Agric. Food Chem. 67, 10273–10284 (2019).
Centonze, V. et al. Discrimination of geographical origin of oranges (Citrus sinensis L. Osbeck) by mass spectrometry-based electronic nose and characterization of volatile compounds. Food Chem. 277, 25–30 (2019).
Liakos, K. G., Busato, P., Moshou, D., Pearson, S. & Bochtis, D. Machine learning in agriculture: a review. Sensors 18, 2674 (2018).
Carleo, G. et al. Machine learning and the physical sciences. Rev. Mod. Phys. 91, 045002 (2019).
Sarker, I. H. Machine Learning: algorithms, real-world applications and research directions. SN Comput. Sci. 2, 160 (2021).
Gumus, O., Yasar, E., Gumus, Z. P. & Ertas, H. Comparison of different classification algorithms to identify geographic origins of olive oils. J. Food Sci. Technol. 57, 1535–1543 (2020).
Noviyanto, A. & Abdulla, W. H. Honey botanical origin classification using hyperspectral imaging and machine learning. J. Food Eng. 265, 109684 (2020).
Xu, F. et al. Combing machine learning and elemental profiling for geographical authentication of Chinese Geographical Indication (GI) rice. NPJ Sci. Food 5, 18 (2021).
Article PubMed PubMed Central Google Scholar
Qi, J. et al. Geographic origin discrimination of pork from different Chinese regions using mineral elements analysis assisted by machine learning techniques. Food Chem. 337, 127779 (2021).
Article CAS PubMed Google Scholar
Zhao, H. et al. The application of machine-learning and Raman spectroscopy for the rapid detection of edible oils type and adulteration. Food Chem. 373, 131471 (2022).
Article CAS PubMed Google Scholar
Klare, J. et al. Determination of the geographical origin of Asparagus officinalis L. by ¹H NMR spectroscopy. J. Agric. Food Chem. 68, 14353–14363 (2020).
Article CAS PubMed Google Scholar
Chen, S. et al. Metabolite profiling of 14 Wuyi Rock tea cultivars using UPLC-QTOF MS and UPLC-QqQ MS combined with chemometrics. Molecules 23, 104 (2018).
Article PubMed PubMed Central Google Scholar
Yang, P. et al. Differences of characteristic aroma compounds in Rougui tea leaves with different roasting temperatures analyzed by switchable GC-O-MS and GC x GC-O-MS and sensory evaluation. Food Funct. 12, 4797–4807 (2021).
Article CAS PubMed Google Scholar
Liu, X. et al. Chemical characterization of Wuyi rock tea with different roasting degrees and their discrimination based on volatile profiles. RSC Adv. 11, 12074–12085 (2021).
Article CAS PubMed PubMed Central Google Scholar
Xiao, K. The taste of tea: Material, embodied knowledge and environmental history in northern Fujian, China. J. Mat. Cult. 22, 3–18 (2017).
Article Google Scholar
Yang, P. et al. Characterization of key aroma-active compounds in rough and moderate fire Rougui Wuyi rock tea (Camellia sinensis) by sensory-directed flavor analysis and elucidation of the influences of roasting on aroma. J. Agric. Food Chem. 70, 267–278 (2022).
Article CAS PubMed Google Scholar
Dutta, R., Kashwan, K. R., Bhuyan, M., Hines, E. L. & Gardner, J. W. Electronic nose based tea quality standardization. Neural Netw. 16, 847–853 (2003).
Article PubMed Google Scholar
Song, H. H. et al. Discrimination of white ginseng origins using multivariate statistical analysis of data sets. J. Ginseng Res. 38, 187–193 (2014).
Article PubMed PubMed Central Google Scholar
Zhou, Y. et al. Discrimination of the geographical origin of soybeans using NMR-based metabolomics. Foods 10, 435 (2021).
Article CAS PubMed PubMed Central Google Scholar
Rivera-Pérez, A., Romero-González, R. & Garrido Frenich, A. Application of an innovative metabolomics approach to discriminate geographical origin and processing of black pepper by untargeted UHPLC-Q-Orbitrap-HRMS analysis and mid-level data fusion. Food Res. Int. 150, 27 (2021).
Article Google Scholar
Ghisoni, S. et al. Untargeted metabolomics with multivariate analysis to discriminate hazelnut (Corylus avellana L.) cultivars and their geographical origin. J. Sci. Food Agric. 100, 500–508 (2020).
Article CAS PubMed Google Scholar
Zhao, Q. et al. A comparative HS-SPME/GC-MS-based metabolomics approach for discriminating selected japonica rice varieties from different regions of China in raw and cooked form. Food Chem. 385, 14 (2022).
Article Google Scholar
Ho, C. T., Zheng, X. & Li, S. Tea aroma formation. Food Sci. Hum. Wellness 4, 9–27 (2015).
Article Google Scholar
Liu, Z., Chen, F., Sun, J. & Ni, L. Dynamic changes of volatile and phenolic components during the whole manufacturing process of Wuyi Rock tea (Rougui). Food Chem. 367, 130624 (2021).
Article PubMed Google Scholar
Turek, C. & Stintzing, F. Stability of essential oils: a review. Compr. Rev. Food Sci. Food Saf. 12, 40–53 (2013).
Article CAS Google Scholar
Xu, K. et al. Non-targeted metabolomics analysis revealed the characteristic non-volatile and volatile metabolites in the Rougui Wuyi rock tea (Camellia sinensis) from different culturing regions. Foods 11, 1694 (2022).
Article CAS PubMed PubMed Central Google Scholar
Godelmann, R. et al. Targeted and nontargeted wine analysis by ¹H NMR spectroscopy combined with multivariate statistical analysis. Differentiation of important parameters: grape variety, geographical origin, year of vintage. J. Agric. Food Chem. 61, 5610–5619 (2013).
Article CAS PubMed Google Scholar
Park, Y. S., Lek, S. Chapter 7 - Artificial neural networks: Multilayer perceptron for ecological modeling. in Developments in Environmental Modelling Vol. 28 (ed Jørgensen, S. E.) (Elsevier, 2016), pp 123–140.
Geman, S., Bienenstock, E. & Doursat, R. Neural networks and the bias/variance dilemma. Neural Comput. 4, 1–58 (1992).
Article Google Scholar
Jiang, J. et al. Boosting tree-assisted multitask deep learning for small scientific datasets. J. Chem. Inf. Model. 60, 1235–1244 (2020).
Article CAS PubMed PubMed Central Google Scholar
Chen, S. et al. New insights into stress-induced β-ocimene biosynthesis in tea (Camellia sinensis) leaves during oolong tea processing. J. Agric. Food Chem. 69, 11656–11664 (2021).
Article CAS PubMed Google Scholar
Chen, S. et al. Non-targeted metabolomics analysis reveals dynamic changes of volatile and non-volatile metabolites during oolong tea manufacture. Food Res. Int. 128, 108778 (2020).
Article CAS PubMed Google Scholar
Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Google Scholar
Feurer, M., et al. Efficient and robust automated machine learning. in Proc. 28th International Conference on Neural Information Processing Systems - Volume 2 (MIT Press: Montreal, Canada, 2015), pp 2755–2763.

Download references

Acknowledgements

This work was supported by the Natural Science Foundation of China (32002093), the Fujian Agriculture and Forestry University (FAFU) Construction Project for Technological Innovation and Service System of Tea Industry Chain (K1520005A02) and the Agriculture (Tea) Industry Technology System of Fujian Province to XY.

Author information

These authors contributed equally: Yifei Peng, Chao Zheng.

Authors and Affiliations

College of Horticulture, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
Yifei Peng, Shuang Guo, Fuquan Gao, Yun Sun & Zhilong Hao
FAFU-UCR Joint Center for Horticultural Biology and Metabolomics, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
Yifei Peng, Chao Zheng, Shuang Guo, Fuquan Gao, Xiaxia Wang, Zhenghua Du, Zhenbiao Yang & Xiaomin Yu
Fujian Farming Technology Extension Center, Fuzhou, 350003, China
Feng Gao, Feng Su, Wenjing Zhang & Xueling Yu
Wuyishan Institute of Agricultural Sciences, Wuyishan, 354300, China
Guoying Liu
Wuyishan Tea Bureau, Wuyishan, 354300, China
Baoshun Liu
Fujian Vocational College of Agriculture, Fuzhou, 350119, China
Chengjian Wu

Contributions

Z.Y., Z.H., and X.Y. conceived and designed the study. Y.P., S.G., F.G., Z.D., X.W., W.H., and P.X. performed the experiments. Y.P., C.Z., and X.Y. analyzed the data. F.G., F.S., G.L., W.Z., X.Y., Y.S., C.W., and Z.H. provided the resources. Y.P., C.Z., and X.Y. prepared the manuscript. Z.Y. and X.Y. reviewed and edited the manuscript. Y.P. and C.Z. contributed equally to the current study as co-first authors. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Zhenbiao Yang, Zhilong Hao or Xiaomin Yu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Supplementary Dataset 1

Supplementary Dataset 2

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Peng, Y., Zheng, C., Guo, S. et al. Metabolomics integrated with machine learning to discriminate the geographic origin of Rougui Wuyi rock tea. npj Sci Food 7, 7 (2023). https://doi.org/10.1038/s41538-023-00187-1

Received: 31 October 2022
Accepted: 03 March 2023
Published: 16 March 2023
Version of record: 16 March 2023
DOI: https://doi.org/10.1038/s41538-023-00187-1
