Abstract
The geographic origin of agri-food products contributes greatly to their quality and market value. Here, we developed a robust method combining metabolomics and machine learning (ML) to authenticate the geographic origin of Wuyi rock tea, a premium oolong tea. The volatiles of 333 tea samples (174 from the core region and 159 from the non-core region) were profiled using gas chromatography time-of-flight mass spectrometry and a series of ML algorithms were tested. Wuyi rock tea from the two regions featured distinct aroma profiles. Multilayer Perceptron achieved the best performance with an average accuracy of 92.7% on the training data using 176 volatile features. The model was benchmarked with two independent test sets, showing over 90% accuracy. Gradient Boosting algorithm yielded the best accuracy (89.6%) when using only 30 volatile features. The proposed methodology holds great promise for its broader applications in identifying the geographic origins of other valuable agri-food products.
Similar content being viewed by others
Introduction
With the rapid growth in international trade and raising awareness among consumers of the importance of food safety, issues around food authenticity have been receiving increased public attention in recent years1. Consumers have become increasingly concerned about what foods to buy, where they are from and how they are produced. Agricultural and food products derived from certain core production regions are perceived to have a higher quality and hence a higher market value. Despite strict legislative control, adulteration and false labeling are still prevalent in the food industry driven by economic incentives. This further aggravates unfair market competition, reduces consumer trust, and leads to potential risks to the public health2,3. Therefore, the effective identification of the geographic origin of foodstuffs is essential for ensuring food quality and protecting consumer interest.
Tea (Camellia sinensis) is one of the most widely consumed beverages in the world, due to an array of nutritional and health benefits. Regional provenance is among the most important attributes that tea consumers associate with high-quality tea products4,5,6,7,8. Many teas are named based on the geographic regions where they are grown. The most prominent examples are Darjeeling from India, Ceylon from Sri Lanka as well as Westlake Longjing tea and Wuyi rock tea from China. These tea products are highly sought after by tea enthusiasts, the fact that also makes them prime targets for fraud.
For tea authentication, a variety of analytical tools have been developed, such as stable isotope analysis, multi-element profiling and metabolite fingerprinting9. The first two methods, used alone or more often in combination, are acknowledged as being effective for the authentication of a wide range of food products including tea4,10,11. However, both measurements require laborious sample preparation and are hence technically demanding. Metabolomics in conjunction with chemometrics has emerged as one of the most promising methods to distinguish tea origins because of a wide metabolite coverage, high sensitivity and high throughput5,6,7,8,12,13. In particular, gas chromatography-mass spectrometry (GC-MS), a technique well-recognized for its capability to simultaneously monitor a large number of volatile organic compounds (VOCs) with proven reproducibility, has become a central platform for tea metabolomics research13,14,15,16,17,18. Solid phase microextraction (SPME), which integrates sampling, volatile extraction and concentration in one single step, greatly simplifies the sample preparation and thus is widely adopted in food analysis19. In recent years, SPME in coupled with GC-MS has been successfully applied to assess the geographic origins of various food products, such as rice20,21, wine22 and oranges23. As such, VOC fingerprinting by SPME-GC-MS represents a promising approach to identify the geographic origin of tea.
In the past decade, machine learning (ML), which is capable of generating high accuracy outputs from massive volumes of multidimensional data, has found wide applications in many domains of science for pattern recognition, classification and prediction24,25,26. In particular, its surging popularity for food authentication, origin identification and quality control is amply demonstrated by the increasing number of publications on this topic over the past few years, highlighting the great potential for food inspection and classification27,28,29,30,31,32. Although metabolomics has been widely applied in tea research, the application of ML techniques in their analyses has just emerged5,13. Nonetheless, the sample sizes in existing studies are typically small. As the performance of a specific classifier could be greatly influenced by the sample size, a small sample set is prone to errors in pattern recognition. Moreover, not enough efforts have been dedicated to distinguishing tea from a narrow geographic scope.
Wuyi rock tea (WRT), belonging to a preeminent subcategory of oolong tea, is internationally renowned for intriguing flavors and lingering taste with a heavy roast note33,34. WRT has been granted protected geographical indication (PGI) status in China, which in the broad term includes all oolong tea produced in Wuyishan City of Fujian Province using suitable tea varieties and manufactured by the unique traditional processing technology. The distinctive flavor of WRT is decided by tea cultivars, the growth environment and the manufacturing procedures35. More than 200 tea cultivars are used to manufacture WRT, in which “Rougui” and “Shuixian” occupy the largest planting area. In practice, WRT is further classified based on the growing location. The core production region features a small area of only about 60 kilometers inside of the national preserve of Mount Wuyi, which is a UNESCO World Heritage Site located in the Northwest Fujian4. WRT produced in this region (also known as “Zhengyan”) is perceived to have higher quality, referred locally as “rock charm and floral fragrance” or “rock rhythm”36,37. What makes “Zhengyan” WRT so special is the unique terroir where it grows-volcano rock, steep cliff, mineral rich soil, abundant rainfall and high humidity-creating a perfect natural environment for tea cultivation that no other place could find. As expected, WRT produced in the core region usually fetches a high price tag. Due to the conflict between high demand and limited production in the core area, there are many imitations on the market, most of which are grown in areas surrounding the authentic core region and are considered to be inferior in terms of tea quality4. The conventional method for quality evaluation of WRT, like other tea, relies heavily on the experience of tea tasters, which not only suffers from inconsistency and inaccuracy, but also lacks quantitative information38. For this reason, the development of rapid and reliable analytical methods to trace the geographic origin of WRT is highly demanded.
In this study, we explored the feasibility of ML-based analyses of VOC metabolomes to differentiate the origins of Rougui WRT. Our results show that our ML-based methods accurately authenticate the geographic origin of WRT based on volatile metabolites. The proposed methodology will be useful for managing adulteration or fraudulent labeling of WRT in the market and may be broadly applicable to tracing the origins of other valuable food and agricultural products.
Results and discussion
Overview of VOCs of WRT from different origins
SPME was combined with gas chromatography-time-of-flight mass spectrometry (GC-TOFMS) to identify and quantify VOCs of 333 Rougui WRT samples collected from the main growing regions in the north and northwestern parts of Fujian Province, China (Fig. 1; Supplementary Table S1). Representative total ion chromatograms (TICs) of WRT samples from the core production region (hitherto referred as “CRT”) and the non-core production region (hitherto referred as “NCRT”) are depicted in Supplementary Fig. S1. We detected a total of 2128 features by GC-MS analysis. QC samples, prepared by pooling equal aliquots of all samples, were used to monitor the system performance. Principle component analysis (PCA) performed on all samples showed that QC samples were in a tight cluster in the center of the plot, indicating that the established GC-MS method was stable enough to allow for the subsequent analysis (Supplementary Fig. S2). After applying hierarchical cluster analysis to remove outliers, 276 samples were retained. A final data matrix composed of 447 metabolic features was obtained after eliminating those with low variance (mean absolute deviation ≤0) and was subsequently used for multivariate analysis. Among these, 44 volatiles were identified in comparison with authentic standards and 236 volatiles were tentatively identified by comparing their mass spectra and retention indices with those recorded in the database (NIST 20) or literatures (Supplementary Table S2).
The enlarged part indicates representative sampling points located in Mount Wuyi Scenic Resort.
No clear clustering of tea samples according to their geographic regions was evident in the PCA score plot, with PC1 and PC2 combined affording only 23.1% of the total variance (Fig. 2a). Orthogonal partial least squares-discriminant analysis (OPLS-DA) has shown to generally outperform the unsupervised PCA method in classifying various agri-products39,40,41,42,43. Therefore, it was applied in the current study. As expected, a better separation between groups was achieved by this method (Fig. 2b). The permutation test generated intercepts of R2 = 0.23 and Q2 = −0.26, thus confirming the validity of the OPLS-DA model (Fig. 2c).
a PCA score plot. b OPLS-DA score plot. R2Ycum = 0.602, Q2cum = 0.532. c Cross-validation plot of the OPLS-DA model with 200 permutation tests. d Volcano plot showing differential metabolites. Blue dots represent fold-change >1.5 and p < 0.05. Green dots represent fold-change <0.67 and p < 0.05. Black dots represent no statistically significant difference.
Differential VOCs between CRT and NCRT samples
The volcano plot analysis highlighted 111 differential VOCs (57 up and 54 down) for the CRT collection in comparison with NCRT (Fig. 2d). Ultimately, 20 VOCs with significant differences, including esters (6), hydrocarbons (5), ketones (3), alcohols (3), heterocycles (2) and one unknown odorant, were identified based on the criteria of variable importance in projection (VIP) > 1, p < 0.05 and |fold change | >1.5 (Table 1; Fig. 3).
The relative abundance of volatiles with significant differences was presented using log10-transformed Z-score values. **p < = 0.01, ***p < =0.001, ****p < = 0.0001. The box represents the interquartile range, with the median shown as the horizontal line inside the box. The bottom and top boundaries of the box represent the 25th and 75th percentiles, while the lower and upper whiskers correspond to the 5th and 95th percentiles.
The amounts of volatiles with floral, fruity and woody scents (i.e., hortrienol, α-terpineol and 4-methyl-3-penten-2-one) differed mostly between CRT and NCRT (Fig. 3). Generated by the Maillard reaction44, 2-acetylpyrrole with a nutty and bread note was also found at a higher level in CRT. Such nitrogen-containing heterocyclic compounds with pyrrole and pyrazine structures largely account for the characteristic aromas in WRT resulting from roasting37,45. Additionally, several odorless branched alkanes (i.e., 2,3,6,7-tetramethyloctane, 4,6-dimethyldodecane, 2,3,5,8-tetramethyldecane and 4-ethyl-tetradecane) accumulated to higher levels in CRT than in NCRT, presumably originating from the thermal degradation of terpenes during tea processing46.
In contrast, esters that exhibited significant differences were exclusively present in higher proportion in NCRT (Fig. 3). Of note, most of them (i.e., hexyl 2-methylbutyrate, hexyl hexanoate, trans-2-hexenyl caproate and β-phenylethyl butyrate) impart fruity and green flavor. Likewise, a higher accumulation of 5,6-epoxy-β-ionone (fruity and floral) and ethyl isopropyl ketone (minty) was noted in NCRT. In a most recent study, sixteen VOCs (mostly aldehydes and terpenes) were shown to differ among Rougui WRT collected from different cultivation regions47. These volatiles, however, have little overlap with what we discovered in the current study, most likely due to the fact that only three tea samples were analyzed in their study and thus were not representative of WRT in general.
Taken together, we can conclude that CRT and NCRT present clearly distinct aroma profiles, with a stronger floral, woody and roasted notes in the former group while stronger fruity and green odors in the latter group. It should be noted that roasting, which is a critical process unique to the processing of WRT, is essential for transforming tea aroma from being floral to roasted and woody37. In this regard, it is possible that the differences in the volatile concentrations of CRT and NCRT are not totally due to differences in geographical locations, but may also be influenced by the tea processing procedures. Furthermore, other factors such as climatic conditions, years of harvest and agricultural practices obviously have a certain influence, as demonstrated in similar studies conducted for other agri-products29,32,48. Further studies are needed to determine which factor exerts the largest impact on the individual component of WRT aromas.
Authentication of the origin of WRT with machine learning
To develop a model for the origin discrimination of WRT, we selected 176 volatile features that could be stably detected across different batches of tea samples for model training (Supplementary Table S2). The dataset was randomly split into two parts in a stratified fashion: 80% (220) for model training and 20% (56) for model validation. Although ML algorithms such as random forest (RF) and support vector machines (SVM) have been widely employed to discriminate the geographic origins of various agricultural products, no single ML method could perform best on all the datasets, given that the meta-features (i.e., statistics about the number of data points, features, classes as well as data skewness and the entropy of the targets) vary across datasets. In addition, the complexity of different ML algorithms varies. To generate the optimal model with the highest prediction accuracy and generalization ability, we tested 15 classification algorithms of Scikit-learn on the VOC metabolomics data to match model and data complexities in our study (Supplementary Dataset 1). Based on five-fold cross-validation, multilayer perceptron (MLP) achieved the best performance, yielding a prediction accuracy of 92.7% (Table 2). MLP, as the name suggests, consists of interconnected neurons that process data through three or more layers. Belonging to a class of feedforward artificial neural networks (ANN), MLP is well recognized for the prediction power and is widely used to extract information from complex and nonlinear relationships49. From the ROC (receiver operating characteristic) curve, the AUC (area under the curve) value was computed as 0.96, which validated the generalization ability of the MLP model (Fig. 4a). High classification accuracies (>85%) could also be achieved by other classifiers, especially quadratic discriminant analysis (QDA), passive aggressive (PA) and SVM (Table 2).
a Receiver operating characteristic (ROC) curve of MLP where the model was trained using 5-fold cross-validation. b Confusion matrix of the MLP model on the test set. c Confusion matrix of the MLP model on the validation set.
Given the promising performance of MLP on internal cross validation, a test set was built with the remaining 20% of samples (56 in total) for external validation. About 91.1% of tea samples from the test set were correctly classified (Fig. 4b). To further benchmark the model performance, we then performed an independent validation on a separate set of 17 Rougui WRT samples purchased from the market, leading to 94.1% prediction accuracy (Fig. 4c). Overall, these results unequivocally demonstrate the effectiveness of VOC metabolomics in tandem with ML-based algorithms for provenance discrimination of WRT. By applying SVM modeling to the combined data from isotope ratio mass spectrometry (IRMS) and inductively coupled plasma mass spectrometry (ICP-MS), Lou and coworkers achieved 97.7% accuracy when discriminating between WRT and non-WRT4. Since IRMS and ICP-MS are both time-consuming and costly, the proposed metabolomics approach in the current study offers advantages in terms of simplicity and speed while still maintaining comparable accuracy. It serves as a good alternative for the authentication of WRT.
Machine learning-based prediction model with a simplified input dataset
In an effort to reduce computing power while improving prediction efficiency, a further attempt was made to establish a ML prediction model using a simplified input dataset. Towards this goal, the top 30 metabolic features with the highest VIP scores in the OPLS-DA analysis, instead of the original 176 metabolic features, were included to construct a new model. As the meta-features of the simplified dataset were changed, we tested the aforementioned 15 ML algorithms again to match the model and data complexities (Supplementary Dataset 2). For the simplified dataset, the GB model demonstrated the highest accuracy (89.6%) and AUC values (0.93) among all tested algorithms, slightly higher than 86.8% accuracy obtained by the same model on the original dataset (Table 2; Fig. 5a). Furthermore, 87.5% and 94.1% accuracies were attained on the test and validation sets, respectively, indicating that a satisfactory prediction performance could still be realized with a much-reduced dataset (Fig. 5b, c). Although the MLP model produced an acceptable accuracy (83.2%) for the training set, accuracies for the test sets dropped to below 80%, implying that overfitting may occur and thus degrade the prediction performance. This is not unexpected, given that MLP is prone to overfitting when adopted for the unknown samples and hence usually requires a large number of parameters and extended data training in order to achieve a good generalization on the test set50. In contrast, the GB algorithm is typically optimal for small datasets and has proven to outperform deep learning when the training data is not sufficient51. Therefore, GB is more advantageous than MLP regarding the overfitting risk and robustness when a simplified dataset is considered.
a ROC curve of GB where the model was trained using 5-fold cross-validation. b Confusion matrix of the GB model on the test set. c Confusion matrix of the GB model on the validation set.
In summary, the present study demonstrates the geographic origin discrimination of 333 WRT samples, the largest panel tested as far as we know, using a strategy combining VOC fingerprinting with ML. The predictive modeling of VOC metabolomics data has proven to be a highly promising and powerful tool to confirm the authenticity, quality, and origin of WRT. Furthermore, we clearly demonstrate a terroir impact on the flavor of WRT and 20 VOCs are identified to distinguish CRT from NCRT. It should be noted that VOC profiling could be affected by several factors, such as tea storage conditions and processing methods. Moreover, fraud cases in real-life scenarios are even more complicated. For example, intentionally mixing or replacing part of CRT with NCRT for increased commercial value likely happens. Although ML is routinely used to analyze individual chemical datasets, integration of additional spectra data, such as those acquired by liquid-chromatography mass-spectrometry (LC-MS), nuclear magnetic resonance (NMR) or near-infrared (NIR) spectroscopy, as well as prior knowledge on tea samples (i.e., storage conditions and processing methods) will further increase both the accuracy and the generalization ability of the predictive models.
Methods
Sample collection
During the periods of 2019 and 2020, a total of 333 authentic WRT (Camellia sinensis (L.) O. Kuntze cv. “Rougui”) samples of different origins were collected from various producers in Nanping, the north production area and Sanming, the north-western production area of Fujian Province (Supplementary Table S1). These are labeled as CRT and NCRT, respectively. Sample numbers were as follows: CRT (n = 174) and NCRT (n = 159). Samples were stored in airtight and lightproof aluminum foil packages at 4 °C prior to analysis.
Standards
2-acetylpyrrole and n-alkane standard solution C8-C25 were purchased from Sigma-Aldrich (St. Louis, MO, USA). 2-acetyl-5-methylfuran, β-phenylethyl butyrate, 2-phenylethyl hexanoate, ethyl isopropyl ketone, α-terpineol and trans-2-hexenyl caproate were from Shanghai Yuanye Bio-Technology Co., Ltd. (Shanghai, China). Hexyl hexanoate was purchased from Shanghai Zhenzhun Bio-Technology Co., Ltd. (Shanghai, China). 4-methyl-3-penten-2-one was from Shanghai Yi’en Chemical Technology Co., Ltd. (Shanghai, China). Ethyl decanoate was from Shanghai ANPEL Scientific Instrument Co., Ltd. (Shanghai, China).
Headspace solid-phase microextraction (HS-SPME)
A minor modification for HS-SPME was made according to the method in our previous paper52. Finely ground tea powders (2 g) were placed into a 20-mL sealed glass vial with 0.5 nmol ethyl decanoate as an internal standard. Polydimethylsiloxane-divinylbenzene (PDMS-DVB) SPME fiber was selected for volatile extraction following procedures reported in our previous work52.
Gas chromatography-mass spectrometry (GC-MS)
The GC-MS system and the analytical conditions were the same as those in our previous paper53. GC separation was achieved on a Restek Rxi®-5Sil MS capillary column (30 m, inner diameter 0.25 mm and film thickness 0.25 µm). The oven temperature was programmed at 50 °C for 5 min, increased at 3 °C/min to 210 °C, then increased at 15 °C/min to 330 °C and kept for a 5 min final hold. The MS was operated in an electron impact (EI) mode with ionization energy of 70 eV and a mass scan range of 30–500 m/z. Samples were analyzed in triplicates. QC samples prepared by pooling aliquots of all samples were injected every ten runs throughout the analysis to monitor instrument fluctuations.
Qualitative and quantitative analyses of volatile compounds
Raw data were processed with ChromaTOF software (v4.51.6, LECO) for spectral deconvolution and alignment. The following conditions were used: (1) signal-to-noise (S/N) ratio = 20; (2) maximum retention time difference = 2 s; (3) peak width = 5 s and (4) the mass spectral match score ≥700. The criteria used for the assignment of volatile metabolites were the same as in the previous work52. Relative quantification was made based on the peak area ratio of analytes to the internal standard.
Multivariate and statistical analysis
Volatile data were auto-scaled, quantile-normalized and log10-transformed prior to hierarchical cluster analysis using the MetaboAnalyst 5.0 program (https://www.metaboanalyst.ca/). PCA and OPLS-DA analyses were conducted in Simca-P (v14.1, Umetrics, Umeå, Sweden). All data were presented as mean ± standard deviation (SD). The Wilcoxon rank sum test was applied to compare the differences in the volatile abundance between CRT and NCRT.
Machine learning modeling
To generate a model for geographic origin discrimination of WRT, different ML algorithms, including multilayer perceptron (MLP), quadratic discriminant analysis (QDA), passive aggressive (PA), support vector machines (SVM), linear discriminant analysis (LDA), random forest (RF), stochastic gradient descent (SGD), gradient boosting (GB), adaboost (AB), k-nearest neighbors (KNN), linear support vector machines (LinearSVM), Bernoulli naive Bayes (BernoulliNB), extra tree (ET), Gaussian naive Bayes (GaussianNB) and decision tree (DT) were tested using Scikit-learn (v0.24.2) package in Python (v3.8.12) (Fig. 6)54. Metabolomics data were split into two parts: 80% for model training and 20% for validation. The training set was used to train the models with five-fold cross-validation. The full pipeline Bayesian hyperparameter optimization was performed to select the optimal ML method. We run 40 jobs in parallel on a machine with 80 cores for 24 h. For the resource limits, we set a time limit of 30 min for a single run and a memory limit for 100 GB54,55. Various classification metrics like accuracy, precision, recall, and AUC were calculated to evaluate the model performance.
A flowchart for building tea geographic origin discrimination model is described, including data preprocessing, feature selection, hyperparameter optimization and model validation.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The authors declare that all pertinent data that support this study have been included within the paper. Raw data will be made available by corresponding authors upon request.
References
Walaszczyk, A. & Galińska, B. Food origin traceability from a consumer’s perspective. Sustainability 12, 1872 (2020).
Amaral, J. S. Target and non-target approaches for food authenticity and traceability. Foods 10, 172 (2021).
Brooks, C. et al. A review of food fraud and food authenticity across the food supply chain, with an examination of the impact of the COVID-19 pandemic and Brexit on food industry. Food Control 130, 108171 (2021).
Lou, Y. X. et al. Stable isotope ratio and elemental profile combined with support vector machine for provenance discrimination of oolong tea (Wuyi-Rock tea). J. Anal. Methods Chem. 2017, 5454231 (2017).
Peng, C. et al. A comparative UHPLC-Q/TOF-MS-based metabolomics approach coupled with machine learning algorithms to differentiate Keemun black teas from narrow-geographic origins. Food Res. Int. 158, 111512 (2022).
Wang, Z. et al. Region identification of Xinyang Maojian tea using UHPLC-Q-TOF/MS-based metabolomics coupled with multivariate statistical analyses. J. Food Sci. 86, 1681–1691 (2021).
Zhang, Q. et al. Differentiating Westlake Longjing tea from the first- and second-grade producing regions using ultra high performance liquid chromatography with quadrupole time-of-flight mass spectrometry-based untargeted metabolomics in combination with chemometrics. J. Sep. Sci. 43, 2794–2803 (2020).
Zhao, J. et al. Identification of markers for tea authenticity assessment: Non-targeted metabolomics of highly similar oolong tea cultivars (Camellia sinensis var. sinensis). Food Control 142, 109223 (2022).
Shuai, M. Y. et al. Recent techniques for the authentication of the geographical origin of tea leaves from Camellia sinensis: A review. Food Chem. 374, 131713 (2022).
Liu, Z. et al. Geographical traceability of Chinese green tea using stable isotope and multi-element chemometrics. Rapid Commun. Mass Spectrom. 33, 778–788 (2019).
Katerinopoulou, K., Kontogeorgos, A., Salmas, C. E., Patakas, A. & Ladavos, A. Geographical origin authentication of agri-food products: a review. Foods 9, 489 (2020).
Wang, T. et al. Mass spectrometry-based metabolomics and chemometric analysis of Pu-erh teas of various origins. Food Chem. 268, 271–278 (2018).
Yun, J. et al. Use of headspace GC/MS combined with chemometric analysis to identify the geographic origins of black tea. Food Chem. 360, 130033 (2021).
Beale, D. J. et al. Review of recent developments in GC-MS approaches to metabolomics-based research. Metabolomics 14, 018–1449 (2018).
Wu, H. et al. GC-MS-based metabolomic study reveals dynamic changes of chemical compositions during black tea processing. Food Res. Int. 120, 330–338 (2019).
Guo, X., Schwab, W., Ho, C.-T., Song, C. & Wan, X. Characterization of the aroma profiles of oolong tea made from three tea cultivars by both GC–MS and GC-IMS. Food Chem. 376, 131933 (2022).
Qi, D. et al. Study on the effects of rapid aging technology on the aroma quality of white tea using GC–MS combined with chemometrics: In comparison with natural aged and fresh white tea. Food Chem. 265, 189–199 (2018).
Wang, Z. et al. Effect of different drying methods after fermentation on the aroma of Pu-erh tea (ripe tea). LWT 171, 114129 (2022).
Kataoka, H., Lord, H. L. & Pawliszyn, J. Applications of solid-phase microextraction in food analysis. J. Chromatogr. A 880, 35–62 (2000).
Lim, D. K. et al. Non-destructive profiling of volatile organic compounds using HS-SPME/GC–MS and its application for the geographical discrimination of white rice. J. Food Drug Anal. 26, 260–267 (2018).
Ch, R. et al. Metabolomic fingerprinting of volatile organic compounds for the geographical discrimination of rice samples from China, Vietnam and India. Food Chem. 334, 127553 (2021).
Šuklje, K. et al. Regional discrimination of Australian Shiraz wine volatome by two-dimensional gas chromatography coupled to time-of-flight mass spectrometry. J. Agric. Food Chem. 67, 10273–10284 (2019).
Centonze, V. et al. Discrimination of geographical origin of oranges (Citrus sinensis L. Osbeck) by mass spectrometry-based electronic nose and characterization of volatile compounds. Food Chem. 277, 25–30 (2019).
Liakos, K. G., Busato, P., Moshou, D., Pearson, S. & Bochtis, D. Machine learning in agriculture: a review. Sensors 18, 2674 (2018).
Carleo, G. et al. Machine learning and the physical sciences. Rev. Mod. Phys. 91, 045002 (2019).
Sarker, I. H. Machine Learning: algorithms, real-world applications and research directions. SN Comput. Sci. 2, 160 (2021).
Gumus, O., Yasar, E., Gumus, Z. P. & Ertas, H. Comparison of different classification algorithms to identify geographic origins of olive oils. J. Food Sci. Technol. 57, 1535–1543 (2020).
Noviyanto, A. & Abdulla, W. H. Honey botanical origin classification using hyperspectral imaging and machine learning. J. Food Eng. 265, 109684 (2020).
Xu, F. et al. Combing machine learning and elemental profiling for geographical authentication of Chinese Geographical Indication (GI) rice. NPJ Sci. Food 5, 18 (2021).
Qi, J. et al. Geographic origin discrimination of pork from different Chinese regions using mineral elements analysis assisted by machine learning techniques. Food Chem. 337, 127779 (2021).
Zhao, H. et al. The application of machine-learning and Raman spectroscopy for the rapid detection of edible oils type and adulteration. Food Chem. 373, 131471 (2022).
Klare, J. et al. Determination of the geographical origin of Asparagus officinalis L. by 1H NMR spectroscopy. J. Agric. Food Chem. 68, 14353–14363 (2020).
Chen, S. et al. Metabolite profiling of 14 Wuyi Rock tea cultivars using UPLC-QTOF MS and UPLC-QqQ MS combined with chemometrics. Molecules 23, 104 (2018).
Yang, P. et al. Differences of characteristic aroma compounds in Rougui tea leaves with different roasting temperatures analyzed by switchable GC-O-MS and GC x GC-O-MS and sensory evaluation. Food Funct. 12, 4797–4807 (2021).
Liu, X. et al. Chemical characterization of Wuyi rock tea with different roasting degrees and their discrimination based on volatile profiles. RSC Adv. 11, 12074–12085 (2021).
Xiao, K. The taste of tea: Material, embodied knowledge and environmental history in northern Fujian, China. J. Mat. Cult. 22, 3–18 (2017).
Yang, P. et al. Characterization of key aroma-active compounds in rough and moderate fire Rougui Wuyi rock tea (Camellia sinensis) by sensory-directed flavor analysis and elucidation of the influences of roasting on aroma. J. Agric. Food Chem. 70, 267–278 (2022).
Dutta, R., Kashwan, K. R., Bhuyan, M., Hines, E. L. & Gardner, J. W. Electronic nose based tea quality standardization. Neural Netw. 16, 847–853 (2003).
Song, H. H. et al. Discrimination of white ginseng origins using multivariate statistical analysis of data sets. J. Ginseng Res. 38, 187–193 (2014).
Zhou, Y. et al. Discrimination of the geographical origin of soybeans using NMR-based metabolomics. Foods 10, 435 (2021).
Rivera-Pérez, A., Romero-González, R. & Garrido Frenich, A. Application of an innovative metabolomics approach to discriminate geographical origin and processing of black pepper by untargeted UHPLC-Q-Orbitrap-HRMS analysis and mid-level data fusion. Food Res. Int. 150, 27 (2021).
Ghisoni, S. et al. Untargeted metabolomics with multivariate analysis to discriminate hazelnut (Corylus avellana L.) cultivars and their geographical origin. J. Sci. Food Agric. 100, 500–508 (2020).
Zhao, Q. et al. A comparative HS-SPME/GC-MS-based metabolomics approach for discriminating selected japonica rice varieties from different regions of China in raw and cooked form. Food Chem. 385, 14 (2022).
Ho, C. T., Zheng, X. & Li, S. Tea aroma formation. Food Sci. Hum. Wellness 4, 9–27 (2015).
Liu, Z., Chen, F., Sun, J. & Ni, L. Dynamic changes of volatile and phenolic components during the whole manufacturing process of Wuyi Rock tea (Rougui). Food Chem. 367, 130624 (2021).
Turek, C. & Stintzing, F. Stability of essential oils: a review. Compr. Rev. Food Sci. Food Saf. 12, 40–53 (2013).
Xu, K. et al. Non-targeted metabolomics analysis revealed the characteristic non-volatile and volatile metabolites in the Rougui Wuyi rock tea (Camellia sinensis) from different culturing regions. Foods 11, 1694 (2022).
Godelmann, R. et al. Targeted and nontargeted wine analysis by 1H NMR spectroscopy combined with multivariate statistical analysis. Differentiation of important parameters: grape variety, geographical origin, year of vintage. J. Agric. Food Chem. 61, 5610–5619 (2013).
Park, Y. S., Lek, S. Chapter 7 - Artificial neural networks: Multilayer perceptron for ecological modeling. in Developments in Environmental Modelling Vol. 28 (ed Jørgensen, S. E.) (Elsevier, 2016), pp 123–140.
Geman, S., Bienenstock, E. & Doursat, R. Neural networks and the bias/variance dilemma. Neural Comput. 4, 1–58 (1992).
Jiang, J. et al. Boosting tree-assisted multitask deep learning for small scientific datasets. J. Chem. Inf. Model. 60, 1235–1244 (2020).
Chen, S. et al. New insights into stress-induced β-ocimene biosynthesis in tea (Camellia sinensis) leaves during oolong tea processing. J. Agric. Food Chem. 69, 11656–11664 (2021).
Chen, S. et al. Non-targeted metabolomics analysis reveals dynamic changes of volatile and non-volatile metabolites during oolong tea manufacture. Food Res. Int. 128, 108778 (2020).
Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Feurer, M., et al. Efficient and robust automated machine learning. in Proc. 28th International Conference on Neural Information Processing Systems - Volume 2 (MIT Press: Montreal, Canada, 2015), pp 2755–2763.
Acknowledgements
This work was supported by the Natural Science Foundation of China (32002093), the Fujian Agriculture and Forestry University (FAFU) Construction Project for Technological Innovation and Service System of Tea Industry Chain (K1520005A02) and the Agriculture (Tea) Industry Technology System of Fujian Province to XY.
Author information
Authors and Affiliations
Contributions
Z.Y., Z.H., and X.Y. conceived and designed the study. Y.P., S.G., F.G., Z.D., X.W., W.H., and P.X. performed the experiments. Y.P., C.Z., and X.Y. analyzed the data. F.G., F.S., G.L., W.Z., X.Y., Y.S., C.W., and Z.H. provided the resources. Y.P., C.Z., and X.Y. prepared the manuscript. Z.Y. and X.Y. reviewed and edited the manuscript. Y.P. and C.Z. contributed equally to the current study as co-first authors. All authors have read and agreed to the published version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Peng, Y., Zheng, C., Guo, S. et al. Metabolomics integrated with machine learning to discriminate the geographic origin of Rougui Wuyi rock tea. npj Sci Food 7, 7 (2023). https://doi.org/10.1038/s41538-023-00187-1
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41538-023-00187-1
This article is cited by
-
LC-QTOF/MS-Based Non-targeted Metabolomics Combined with Machine Learning Algorithms for Geographical Origin Discrimination of Yongfu Luohan Guo (Siraitia grosvenorii)
Food Analytical Methods (2026)
-
Characterization of key astringent compounds and optimization of the fixation process of early tea in Chuanyu region
npj Science of Food (2025)








