Introduction

The emission of volatile organic compounds (VOCs) from plants as a defense mechanism against biotic and abiotic stress, and as a tool of communication with their environment, was first documented by Baldwin and Schulz1 as well as Rhoades2 in 1983. One of the most prominent examples is the attraction of insects for pollination, which is crucial for plant distribution and reproduction3,4. Additionally, VOCs play a defensive role; for instance, they are produced in response to pest and/or pathogen infestation5,6 or are emitted to attract natural predators of pests4,6,7. VOCs also serve as signals to neighboring plants, warning them of potential threats and enabling the early production of defensive VOCs in response4,5,6.

Because of these interactions the analysis of VOCs in plants can provide indications of possible pest and/or pathogen infestation. This has the advantage of early detection and therefore contributes to a rapid control and containment of the infestation5,8. Since VOCs are typically emitted as complex mixtures rather than as individual compounds, it is crucial to employ appropriate sampling and analytical methods to accurately assess the plants’ VOC profiles9.

In recent years, some progress has been made in the development of analysis methods using gas chromatography mass spectrometry (GC-MS)10,11, gas chromatography ion mobility spectrometry (GC-IMS)11, electronic nose10,11 and proton transfer reaction mass spectrometry (PTR-MS)12. The analysis of statically or dynamically enriched VOC samples using GC-MS has been established as one of the standard methods in all possible fields of expertise, e.g. disease detection13,14,15, environmental pollution16,17, food science11,17,18, and pharmacy17,19. However, the evaluation of VOC profiles obtained by GC-MS is still a major challenge20,21,22. In order to be able to detect possible pest and/or pathogen infestations in plants using VOC analysis, the VOC profiles of the plants must be examined for possible patterns and markers that specifically represent the pest or pathogen, while simultaneously discrimination against all background emissions can be achieved23. Due to the complexity and the many influencing factors that affect the VOC emission of plants, identifying such patterns manually is both very laborious and time-consuming24. A time-saving and facilitating evaluation process for such complex data is therefore of great advantage. Recent scientific studies are increasingly applying chemometric methods for the evaluation of chromatographic data, such as principal component analysis (PCA)25,26, partial least squares (PLS)25,26, or linear discriminant analysis (LDA)25. Software such as AMDIS27, PARADISe28, and OpenChrom29 are already available to facilitate the comparison and evaluation of data. However, there is still need and demand for improvements in the evaluation of such complex datasets in order to ensure an efficient and reliable detection method for pest infestations.

Therefore, this work features the development of an evaluation method to analyze GC-MS data of VOC measurements using chemometric methods in order to classify the data based on different VOC patterns and markers. For this purpose, a Python script was written that combines data merging, data preprocessing and chemometric evaluation for classification, which can be flexibly adapted to any dataset. In order to test the functionality of this evaluation procedure, a dataset that has already been evaluated manually is analyzed. The dataset is taken from the publications by Makarow et al., who have identified specific patterns and markers for an infestation of the invasive Asian longhorned beetle Anoplophora glabripennis (ALB) on maple trees (Acer L.)30,31. They had a statistically valid dataset of infested trees that represent biological variance. At the same time, the identified VOCs of interest occurred in 95–99% of all measurements30. In addition to the data of ALB those of the European native beetle species goat moth Cossus cossus (CC) and poplar long-horned beetle Saperda carcharias (SC) are also included in the analysis in order to exclude falsification of the pattern by native species. The data of CC and SC was obtained similarly to the data of ALB.

Methods

The dataset from Makarow et al. includes samples from healthy trees (HT), ALB infested trees, SC infested trees/ larvae, and CC larvae30. In their work adsorption tubes filled with Tenax TA from Alltech Associates Inc. were used for sampling. All samples were taken with a flow rate of 30 mL/min for a duration of 90 min. Table 1 shows an overview of the analyzed samples and the used sampling procedure. The chromatographic analysis was performed using a 7890A/5975C inert XN MSD GC/MS device from Agilent Technologies coupled to a thermal desorption unit from Gerstel. GC was equipped with a DB5-MS capillary column from J&W (30m\(\times\)0.250 mm; 0.25 \(\upmu\)m).The mass spectra were recorded in the electron-impact mode (70 eV) from 30 to 400 DA. All other parameters regarding sampling procedure and the analytical methods can be found in the respective publications30,31.

Each chromatographic peak was annotated based on mass spectral comparison using the National Institute of Standards and Technology (NIST) 20 database. Diagnostic fragment ions (m/z) and literature Kovats retention indices (RI) were used to support compound identity, where available. Due to the retrospective nature of the data and changes in instrumentation, RI calibration using alkane standards was not feasible. Therefore, all compound names should be regarded as tentative. Key spectral data and match scores are provided in Supplementary Table S1.

Table 1 Overview of the analyzed samples, number of replicates, sampling method by Makarow et al.30,31 and the sampling description.

In order to extract all possible markers for infestations the total ion chromatograms (TICs) of the data are used for evaluation. All TICs obtained from the respective data are converted into CSV-files using the OpenChrom Lablicate Edition 1.5.0 software and then analyzed using the custom-written script using Python 3.10, which consists of data selection, data preprocessing, and chemometric evaluation and will be explained in the following paragraph. The following packages and functions are used: pandas for handling data, os for interaction with the operating system, datetime for timestamps, NumPy for mathematical operations, SciPy for signal processing, Matplotlib pyplot and plotly for visualization, seaborn for heatmaps, and Sci-Kit learn for chemometric analysis.

In the data selection process, folders are created for test and train data containing all CSV files of the classes to be analyzed (HT, ALB, CC and SC). Due to the availability of data and the time-consuming data acquisition, it is not possible to work with a large amount of data, despite the fact that it is not an ideal condition for a chemometric evaluation. For each class, 10 data files are chosen randomly as train data, all other data files are declared as test data. The script extracts both retention time and TIC for each measurement. Because the first 3 min and the last 4 min of the chromatograms only show impurities, these data areas are cut off. Subsequently, the extracted data undergo preprocessing, where the datasets are smoothed, the baseline is corrected, and the chromatograms are normalized to their total area. A Savitzky-Golay filter with polynomial order 3 and width 11 is used for smoothing. The baseline correction is performed using the Asymmetric Least Squares Smoothing method with a smoothing factor of \(10^6\), an asymmetry parameter of 0.001 and a number of 10 iterations. The resulting baseline is then subtracted from the respective chromatograms. Afterwards, total area normalization is performed on the chromatograms in order to obtain comparable intensities without changing the height ratio of the signals and thus enable a qualitative determination of compounds. Lastly, a signal threshold of 3 % of the normalized intensity of each chromatogram is chosen to focus on peaks and to minimize background noise. An exemplary chromatogram before and after preprocessing can be found in the Supplementary figure S1. The preprocessed data are then merged into a single train or test dataset, respectively.

The preprocessed train and test datasets are then analyzed chemometrically. For this purpose, data reduction is performed on the train dataset using principal component analysis (PCA). The optimum number of PCs used for further analysis is chosen manually and is determined based on the accuracy of the train and test data. PCs are added until the accuracy of the test data deteriorates, indicating that overfitting has occurred32,33. Hence, the the optimum number of PCs is the number before overfitting arises. The dimensionally reduced train dataset is subsequently classified using linear discriminant analysis (LDA). To check the suitability of the LDA, a “k-fold” crossvalidation (k-fold-CV) with 10 iterations as well as a “leave one out” crossvalidation (loo-CV) is performed on the train data. The k-fold-CV is normally used for large amounts of data, while the loo-CV is favored for small amounts of data. Next, the test dataset is dimensionally reduced with the calculated PCA function of the train dataset and classified with the calculated discriminant function of the LDA. Finally, the accuracy of the classified test data is checked using the SciKit-learn package. The script’s workflow is shown in Fig. 1.

Fig. 1
figure 1

Flowchart of the Python script. The relevant steps of data selction, data processing, and chemometrical evaluation are shown. GC/MS: gas chromatography mass spectrometry, RT: retention time, TIC: total ion chromatogramm, PCA: principal component analysis, LDA: linear discriminant analysis.

Results and discussion

In order to detect specific VOC-markers for ALB infestation using GC-MS, data from healthy trees (HT), ALB infested Acer trees, SC larvae and infested Acer trees, and CC larvae were used. Figure 2 shows an exemplary and representative chromatogram for each class. Six class comparisons were created from the four classes: ALB - HT, SC - HT, CC - HT, ALB - SC, ALB - CC and SC - CC. The most important parameters for the chemometric evaluation are listed in Table 2. All six class comparisons could be adequately classified with the dimensionally reduced datasets (Fig. 3).

The results of the crossvalidations and the accuracy of the test datasets as well as the number and variance of the principal components (PCs) are listed in Table 2. Based on the loadings of the dimensionally reduced data, it is possible to understand which retention times and thus which compounds have the greatest influence on the classification. Figure 4a exemplarily shows the loading plots of the PCA, where only the data of ALB vs. HT were investigated. The loading plots of all other comparisons, a 2D and 3D LD plot as well as all score plots can be found in the Supplementary Figures S2–S14. Table 3 shows all important retention times found in the loadings and their corresponding compounds determined through comparison with the NIST 20 database. A total of 24 compounds (10 hydrocarbons (HC), 2 aromatic hydrocarbon (AHC), 3 monoterpenes (MT), 5 sesquiterpenes (ST) and 4 unknown compounds) can be considered as potential markers for pests on trees and are displayed in the following subsections. Although Kovats’ retention indices (RI) for compound 6 and 24 are slightly lower than those for compound 5 and 23, their spectral signatures and retention behavior suggest they are presumably styrene and presumably \(\alpha\)-zingiberene, respectively. Both annotations are based on strong spectral similarity (NIST match >90%) and consistent elution behavior relative to known RI values on DB-5 columns. Three additional compounds from the class of cyclosiloxanes can be excluded as potential markers, as they are typical contaminants in gas chromatography when PDMS-based stationary phases are used (e.g. due to column bleeding)34 and also are emitted by a large number of personal care products and thus occur due to contamination35.

Fig. 2
figure 2

Stacking of GC-MS example chromatograms of the four investigated classes. Red: healthy Acer tree (HT), blue: Anoplophora glabripennis (ALB) infested Acer tree, green: Cossus cossus (CC) larvae, purple: Saperda carcharias (SC) infested Acer tree; RT: retention time. Compounds specific to one pest are marked with diamond symbol. Compounds which occur during pest infestation are labeled with cycle symbol. Increasing concentrations of compounds during infestation are characterized with an upward triangle symbol, decreasing concentrations with a downward triangle symbol, respectively. The peak No. can be assigned to the compounds in Table 3.

Table 2 Overview of the chemometric parameters for the analyzed class comparisons.
Fig. 3
figure 3

LD-Plots of the six different class comparisons. (a) Anoplophora glabripennis (ALB) infested Acer tree and healthy tree (HT), (b) Saperda carcharias (SC) larvae infested Acer tree and HT, (c) Cossus cossus (CC) larvae and HT, (d) ALB infested Acer tree and SC larvae infested Acer tree, (e) ALB infested Acer tree and CC larvae, (f) SC larvae infested Acer tree and CC larvae.

Fig. 4
figure 4

(a) Loading plots of the class comparison of Anoplophora glabripennis (ALB) infested Acer tree and healthy tree (HT). PC: principal component, RT: retention time, the Peak No. can be assigned to the compounds in Table 3; (b) overlay of example chromatograms of HT and ALB between retention time 15.2 min and 16.3 min with identified compounds cyclosativene, \(\alpha\)-longipinene, copaene, and caryophyllene. Red: ALB, black: HT.

Table 3 List of possible marker compounds for the different class comparisons determined via chemometric analysis, including a categorization of the loading values. All compound names are tentative assignments based on NIST 20 spectral comparison and literature RI values. For diagnostic fragment ions and match scores, see Supplementary Table S1. RT: Retention time; AHC: aromatic hydrocarbons, HC: hydrocarbons, MT: monoterpenes, ST: sesquiterpenes; RI: Kovats retention index for non polar columns according to the pherobase database, ALB: Anoplophora glabripennis, HT: Healthy tree, SC: Saperda carcharias, CC: Cossus cossus. Loading values (LV): LV \(<{0.005}:~\vartriangle\), LV \(<{0.010}:~\vartriangle \vartriangle\), LV \(\ge {0.010}:~\vartriangle \vartriangle \vartriangle\), LV \(>{-0.005}:~\triangledown\), LV \(>{-0.010}:~\triangledown \triangledown\), LV \(\le {-0.010}:~\triangledown \triangledown \triangledown\), –: no differences in comparison.

Potential markers of the six class comparisons

21 compounds (7 HCs, 2 AHC, 3 MTs, 4 STs and 5 unknown compounds) were identified as differences between ALB and healthy tree (HT). Regarding the chromatograms of ALB and HT (Fig. 2), 8 compounds (4-methylheptane, 2,4-dimethyl-1-heptene, presumably styrene, 3-carene, \(\alpha\)-longipinene, copaene, caryophyllene and presumably \(\alpha\)-zingiberene) are only emitted during ALB infestation whereas 2 compounds (\(\alpha\)-pinene and decanal) occur exclusively in HT. n-octane, hexanal, n-nonane, limonene and cyclosativene increase in concentration by ALB infestation, while the emissions of nonanal, tridecane and pentadecane decrease.

For the distinction between SC and HT, 15 compounds (7 HCs, 2 AHC, 1 MTs, 4 STs and 1 unknown compounds were characterized. Regarding Fig. 2, copaene, caryophyllene and one unknown compound are exclusively emitted by SC infestation, whereas n-octane, presumably styrene, nonanal, decanal, tridecane, cyclosativene and pentadecane only occur in HT. The concentrations of toluene and hexanal increase by SC infestation, while the emission of \(\alpha\)-pinene decreases.

For the comparison of CC and HT, 15 compounds (7 HCs, 2 AHC, 2 MTs, 2 STs and 2 unknown compounds) could be distinguished. One unkown compound is exclusively emitted from CC, while n-octane, presumably styrene, \(\alpha\)-pinene, nonanal, cyclosativene and pentadecane are emitted only by HT. Undecane and toluene are higher concentrated in HT, whereas hexanal and limonene are more emitted by CC.

When comparing the native beetle species with the ALB, 16 compounds (8 HCs, 1 AHC, 3 MTs, 3 STs and 1 unknown compounds) are distinguished between ALB and SC. Regarding Fig. 2, 9 compounds (n-octane, 2,4-dimethyl-1-heptene, 3-carene, limonene, undecane, nonanal, cyclosativene, pentadecane and presumably \(\alpha\)-zingiberene) are emitted exclusively by ALB infested trees, whereas only one unknown compound is specific for SC infestation. n-nonane is more emitted by ALB infested trees while toluene, hexanal, copaene and caryophyllene are more emitted by SC infested tree respectively.

17 compounds (8 HCs, 1 AHC, 3 MTs, 2 STs and 3 unknown compounds) between ALB and CC are determined. None of the compounds are CC specific, but 9 compounds (n-octane, 2,4-dimethyl-1-heptene, 3-carene, nonanal, cyclosativene, copaene, caryophyllene, pentadecane and presumably \(\alpha\)-zingiberene) are emitted only upon ALB infestation. The compounds hexanal, limonene, and 1 unkown compound are higher concentrated in CC, while toluene, \(\alpha\)-pinene, and undecane are more emitted by ALB infested trees.

9 compounds (5 HCs, 1 AHC, 1 MT, and 2 STs) could be determined as differences for the comparison between SC and CC. limonene, nonanal, and 1 unknown compound are CC specific, while n-nonane, cyclosativene, and caryophyllene are emitted only by SC infestation. In addition, the compounds toluene and hexanal are higher concentrated in SC.

ALB specific markers

Makarow et al. found that ALB larvae, ovipositions, and imagos emit a combination of the compounds copaene, cyclosativene, and 2,4-dimethyl-1-heptene31. When examining ALB-infected trees, they also found the compounds \(\alpha\)-longipinene and 3-carene. Furthermore, caryophyllene could only be detected in trees infected with ALB and not in trees exposed to mechanically induced stress30. These results could be partially confirmed and extended in the chemometric data analysis applied in this study.

Regarding the 24 compounds identified as potential markers via chemometric analysis and the results obtained from comparison of the different classes a total of 14 potential ALB-specific compounds were identified in the analysis of the ALB-HT, ALB-SC and ALB-CC class comparisons. However, only 2,4-dimethyl-1-heptene, 3-carene, and presumably \(\alpha\)-zingiberene could be identified as actually ALB-specific in the analysis of class comparisons, while 4-methylheptane, presumably styrene, \(\alpha\)-longipinene, copaene, and caryophyllene are also emitted by SC-infested trees as well as the CC. Therefore, the results of Makarow et al.30 regarding 2,4-dimethyl-1-heptene and 3-carene as specific to ALB infestation can be confirmed, but for the compounds copaene, cyclosativene, caryophyllene, and \(\alpha\)-longipinene (Fig. 4b) it is reasonable to assume that these compounds are not ALB-specific compounds, but rather compounds that are produced in greater quantities upon induction of biotic stress. This assumption can be verified by consulting the literature and will be discussed in detail in the following paragraph36,37.

Rodriguez-Saona et al. found that gypsy moth infestation on the shrub Vaccinium corymbosum induces a high emission of caryophyllene38. Slavik et al. observed that infection with the fungi Botrytis cinerea and Oidium neolycopesici leads to emission of copaene in tomato plants39. Tomato plants also emit caryophyllene when infected with Bemisia tabaci40. It was established that an increased concentration of caryophyllene increases the resistance of plants to pests40,41. Furthermore, Li et al. discovered that increased concentrations of cyclosativene, \(\alpha\)-longipinene and caryophyllene are observed in borings of the Japanese pine sawyer Monochamus alernatus on Pinus massoniana42. The increased production of copaene, caryophyllene, \(\alpha\)-longipinene and cyclosativene during herbivore infestation is due to a defense mechanism of the plants by activation of the jasmonic acid pathway43,44. Erb et al. induced roots and leaves of maize plants with jasmonic acid and observed a strong increase in the concentration of copaene, caryophyllene and cyclosativene36. During Spodoptera litoralis infestation on Medicago truncatula, Leitner et al. determined both a local increase in jasmonic acid concentration and a strong increase in the emission of caryophyllene, copaene, and cyclosativene45.

In addition to the results of Makarow et al.30,31, presumably \(\alpha\)-zingiberene was identified as a specific marker for ALB infestation using the evaluation method described in this study. A literature search revealed that \(\alpha\)-zingiberene could also be detected in maize infested with Spodoptera frugiperda females46. No publication was found that identifies \(\alpha\)-zingiberene in association with an ALB infestation.

However, the evaluation method applied in this study not only identified specifically produced VOCs, but also differences in concentrations that arise specifically during ALB infestation. Thus, the compound n-octane shows a higher concentration during ALB infestation, while the emission of nonanal and pentadecane is reduced. n-octane was found by Laothawornkitkul et al. in tomato leaves47. Pentadecane serves, besides other roles, as a messenger compound for pollination in plants48, while the emission of nonanal can attract predator insects upon herbivore infestation49,50. The fact that both compounds are emitted in lower concentrations during ALB infestation suggests that either the tree is no longer able to produce these compounds sufficiently due to the infestation or that the ALB actively inhibits the emission in order to prevent the attraction of possible predators.

SC and CC specific markers

In addition to ALB, the two non-invasive herbivores SC and CC were examined for specific markers in this publication. For SC, three potentially specific compounds could be identified: copaene, caryophyllene and an unknown compound. However, copaene and caryophyllene are compounds that are generally emitted during herbivore infestation and are therefore not specific to a particular pest38,39,40,41, leaving the unknown compound to be the only one specific to SC-infested trees. Nevertheless, changes in the concentration of certain compounds are also an indication of SC infestation. Thus, an increased concentration of toluene and hexanal, as well as a reduced emission of \(\alpha\)-pinene was found in SC infestations. Sriprapat et al. were able to prove that plants absorb the pollutant toluene from the ambient air51. The differences in concentration can therefore not be assigned to the plant or the infestation with SC, but are rather a result of polluted ambient air. Hexanal is an inhibitor of phospholipase D (PLD), which is among other functions responsible for the maintenance of cell physiology and cell membrane hydrolysis52. Inhibitors cause, among other effects, proliferation, meaning rapid cell growth53. An increased concentration of hexanal therefore indicates a defense mechanism of the plant, which tries to counteract the damage caused by the SC through increased cell growth. Ortiz-Carreon et al. also found a reduced concentration of \(\alpha\)-pinene in maize plants infested with Spodoptera frugiperda51.

No specific markers could be identified for CC. One unkown compound could be found in CC, but not in HT. However, small amounts of this unknown compound could also be detected in ALB infestation. Furthermore, higher concentrations of hexanal and limonene were detected in CC compared to HT and could be an indication of CC infestation.

For completeness, 4-methylheptane and presumably styrene, which were emitted by all three herbivore infestations, but not by HT, will be discussed. Literature research on these two compounds revealed that styrene is a very common VOCs in the plant environment54,55 and was reported to have inhibitory properties on bacterial56 and fungal infections57. For 4-methyl heptane no literature on herbivore infestation could be found. Literature research only revealed that it was detected in raw amaranth seeds58.

Conclusion

This study introduces a robust multivariate evaluation method for analyzing VOCs to detect pest infestations in plants. By integrating data merging, preprocessing, and advanced chemometric techniques, the approach offers a more efficient and objective alternative to traditional manual evaluation of GC-MS data. The method confirms and refines previous findings by Makarow et al.30,31, with 2,4-dimethyl-1-heptene and 3-carene reaffirmed as reliable markers for ALB infestation. Additionally, another sequiterpene, which is assumed to be \(\alpha\)-zingiberene was identified as a novel, ALB-specific marker, and concentration changes in nonanal and pentadecane were observed in infested samples. The method also enabled the identification of VOC markers associated with other pest species such as SC and CC, demonstrating its ability to distinguish between different types of infestations.

A key strength of this evaluation method lies in its adaptability and speed. It can be easily applied to various plant health questions and GC-MS setups with minimal adjustment. This facilitates the identification of disease- or pest-specific VOC patterns more quickly, which is particularly beneficial for both laboratory-based studies and potential field applications. In conclusion, the presented approach offers a solid and scalable framework for VOC-based pest and pathogen detection, contributing to more effective plant monitoring strategies and advancing the use of chemometric tools in agricultural and ecological research.