Introduction

The Pine Wood Nematode (PWN; Bursaphelenchus xylophilus) is internationally recognized as an exceedingly destructive forestry pest and is classified as a major quarantine pest in China. This pathogen disrupts the healthy architecture of pine trees and produces toxins, thereby resulting in a devastating disease termed Pine Wilt Disease (PWD)1. Pine trees infected with PWD can perish within approximately 40 days. Currently, there are no effective control measures, and without timely intervention, entire pine forests may vanish within a span of 3 to 5 years2,3. According to the 4th Bulletin of the National Forestry and Grassland Administration in 20244, PWD has affected 18 provinces in China, causing a substantial number of pine trees to die, thereby leading to significant economic and ecological losses. Yunnan Pine (Pinus yunnanensis) is a principal timber species for afforestation in the barren hills of southwestern China and holds substantial research value. Yunnan Province is one of the nine major forest regions in China and is a key area for forest resources. Yunnan Pine forests cover 29.2% of the forested land in Yunnan Province and account for 15.8% of the province’s total forest stock, playing a pivotal role in ecological environment construction5.

In the initial phase of infection, it is extremely difficult to differentiate healthy pine trees from newly infected ones through visual inspection alone. The duration of this early stage depends on factors such as tree age, species, environmental temperature, and the virulence of the pine wood nematode6. Hyperspectral remote sensing has emerged as a promising tool for monitoring vegetation health without disrupting the existing ecosystem. Ground-based hyperspectral technology acquires continuous spectral information from ground objects using extremely narrow electromagnetic wavebands and has been extensively applied in forest pest and disease monitoring7. For instance, hyperspectral data have been used to monitor and identify pest and disease stress in Pinus tabulaeformis8,9, Fraxinus pennsylvanica Marsh10 and Pinus thunbergii11. Although numerous studies have employed ground spectrometers to explore the spectral changes of trees under pest and disease stress, relatively few have addressed the early stage of Yunnan Pine infection, and the relationship between changes in needle physiological indices and spectra has been less investigated.

Moreover, hyperspectral sensors generate data with hundreds of spectral bands, and this large volume of high-dimensional data introduces redundancy into data processing. Selecting sensitive bands for pest and disease monitoring through appropriate dimensionality reduction, without losing crucial information, is therefore another area of concern12. Several studies have successfully combined machine learning models with hyperspectral data to monitor diseases8,9,12,13,14. Li et al. used Linear Discriminant Analysis (LDA) to quantify the separability of spectral reflectance and first-derivative reflectance between healthy and early-infected samples. More generally, the combination of hyperspectral reflectance and multivariate statistical analysis has been applied successfully and precisely to forest pest and disease research15. Furlanetto et al. correlated ten machine learning methods with leaf K+ based on hyperspectral data. The results indicated that different algorithms behaved differently in wavelength selection, and the authors pointed out that the Competitive Adaptive Reweighted Sampling (CARS) algorithm had the highest number of selections16. Chen et al. showed that SNV-FD is the optimal preprocessing method for spectral data of apple tree canopies on the Loess Plateau. CARS, the Successive Projections Algorithm (SPA), Random Frog (Rfrog), and the Partial Least Squares method (PLS) were employed to extract characteristic variables. All four methods significantly reduced the number of variables used for modeling, extracting 54, 20, 61, and 5 variables, respectively17. Compared with traditional machine learning methods, the Extreme Gradient Boosting algorithm (XGBoost), a more recent machine learning approach, can reduce model overfitting18,19. Sandino et al. employed the XGBoost algorithm to identify healthy and pathogen-infected trees in forests using remote sensing technology, achieving a classification accuracy of 97%20. Huang et al. compared three classification methods and found that the CA-RF-XGBoost model discriminated every category of samples well, with per-class accuracy exceeding 82% in every classification instance21. Taken together, these studies suggest that CARS can quickly screen key characteristic wavelengths in pest and disease monitoring, for example by focusing on sensitive spectral regions when monitoring pine pests and diseases, thereby reducing data dimensionality and computational complexity and improving prediction performance. SPA can refine the selection further, retaining fewer but more informative variables. XGBoost, used as a classifier, can then evaluate the performance of the band selection methods.

Studies have shown that trees infected with PWD exhibit changes in multiple physiological and biochemical parameters, and these changes can be detected through their spectral characteristics. Under biotic or abiotic stress, changes in the external morphology and color of trees are closely associated with changes in their internal physiological structures, and the degree of external change reflects the degree of internal physiological variation22,23,24. Michael et al. studied Dothistroma needle blight in radiata pine, using UAV hyperspectral data and a 3D radiative transfer model to invert plant functional traits. Their analyses showed significant changes in variable importance during disease progression: in the early stages, photosynthesis, chlorophyll degradation, and content-related variables helped distinguish diseased from asymptomatic trees, while severe stages were characterized by extreme transpiration reduction and foliage loss traits. Bai, working with ASD hyperspectral data, used correlation analysis to extract the spectral indices NDSI, DSI, and RSI, which are sensitive to moisture content and chlorophyll content, and constructed inversion models for each physiological parameter and for the defoliation rate of Chinese pine using multiple linear regression and artificial neural network methods. The results demonstrated that spectral indices extracted on the basis of physiological parameters could better reflect the damage caused by Chinese pine caterpillars25. Although existing studies focus on monitoring moisture content and other early indicators, changes in other biochemical parameters, such as carotenoids, nitrogen, total sugar, and reducing sugar, are often neglected27,28,29,30.

This study aims to: (1) Analyze early-stage PWD induced changes in spectral signatures and biochemical parameters of Yunnan Pine needles; (2) Identify diagnostic spectral bands and their correlated biochemical indicators by integrating hyperspectral data with biochemical measurements through feature selection algorithms; (3) Classify infection stages using these diagnostic bands and evaluate discriminatory performance using standard classification metrics.

Materials and methods

Study area

This research was conducted at the Haikou Forest Farm in the southwestern part of Yunnan Province, China, between longitudes 102°21′36″E and 102°42′05″E and latitudes 24°59′15″N and 25°09′21″N. The altitude ranges from 1896 to 2510 m above sea level. The forest farm covers 7161.6 hectares in total, with a forested area of 5404.5 hectares, giving a forest coverage rate of 75.47%. The total forest volume is 430,752 cubic meters. The pine forest area in Xishan District exceeds 1333.3 hectares. Numerous studies have emphasized that temperature is the primary climatic factor influencing the survival and spread of pine wood nematodes. The average annual temperature in affected areas is a crucial determinant of the severity of pine wilt disease31,32,33. In recent years, Kunming has experienced higher-than-average annual temperatures, increased precipitation, and extended sunshine duration. The spring drought has been milder, the rainy season has started later than usual, and the main flood season has brought higher precipitation. These conditions have provided abundant rainfall, light, and heat resources, with an annual average temperature exceeding 14 °C. Such climatic conditions increase the risk of pine wood nematode outbreaks.

Data acquisition

Hyperspectral data acquisition and pre-processing

Sample data from Yunnan Pine infected with pine wood nematodes were collected using the ASD FieldSpec 4 Hi-Res NG spectrometer in November 2023. The device measured the spectral reflectance of the sampled trees within the wavelength range of 350–2500 nm, with a spectral resolution of 3 nm for the 350–1000 nm range and 6 nm for the 1001–2500 nm range, and the field of view angle was set at 10°. Spectral data were collected from tree branches, and the average reflectance values were calculated to represent the spectral data of individual trees. During data collection, white reference calibration was carried out every 10 min and whenever ambient light conditions changed. Each of the 70 samples, comprising 35 healthy and 35 early-stage infected samples, was measured 14 times.

The infected trees were classified into different disease severity levels according to the morphological characteristics of disease symptoms25,34 (Fig. 1). Due to the actual conditions in the epidemic area, only samples from the healthy and early stages were collected for analysis.

(1) Healthy stage: The needles are bright green, and the upright branches exhibit vigorous growth.

(2) Early stage: During this period, linear resin flow is observed on the trunk, the needles begin to turn slightly yellow, and the entire plant shows signs of dehydration, with needles hanging downward on the branches.

(3) Middle stage: Resin secretion decreases significantly, and the majority of the needles turn yellow.

(4) Death stage: Resin secretion ceases entirely, and the needles turn brick red or reddish-brown, remaining attached to the branches. The tree enters a completely dead state, with grooves, feather holes, and sawdust visible beneath the bark. When cut, the wood exhibits a bluish discoloration.

Fig. 1

Images of Yunnan Pine needles at different stages of pine wilt disease. (a) Healthy stage. (b) Early stage. (c) Middle stage. (d) Death stage.

In the Pathological Analysis Laboratory of Southwest Forestry University, we isolated and morphologically identified nematodes from five representative samples of Yunnan pine. Nematodes were extracted with the Baermann funnel method: the pine wood samples were finely shredded (Fig. 2a) and placed in funnels containing a separation liquid. Over a 24-hour period, the nematodes migrated from the wood fragments into the separation liquid (Fig. 2b). This method is highly effective for nematode extraction35. The samples were subjected to centrifugation at 1200×g for 10 min, followed by sedimentation or centrifugation, to ensure ample samples for subsequent analysis. For morphological identification, we examined key characteristics such as body length and width, the presence and structure of arthropods and bamboo nodes, the shape of the stylet, and the morphology of the reproductive system. By comparing these key features with the existing taxonomic descriptions of PWN36,37, we identified the nematode (Fig. 2c).

Fig. 2

Morphological identification. (a) Sample sawdust; (b) Nematode isolation; (c) Nematode morphology observed under a microscope.

Outlier processing

In hyperspectral datasets, anomalous signals resulting from instrumental noise, environmental interference, or measurement artifacts can significantly compromise data accuracy and reliability. To address this, the spectral processing software ViewSpecPro 6.0 was used to identify and eliminate spectral outliers. The analysis revealed that water vapor absorption and system errors commonly introduce anomalies in the short-wave infrared region. Consequently, spectral bands within the ranges 1330–1400 nm, 1785–1974 nm, and 2340–2500 nm were excluded after validation. This preprocessing step resulted in a dataset comprising 1738 spectral bands. Following outlier removal, the spectral data were averaged to mitigate random measurement fluctuations. For each sample, 14 spectral measurements were acquired, and the arithmetic mean of these measurements was adopted as the representative spectral reflectance value.

Data smoothing

Hyperspectral data inherently contain high-dimensional spectral information, but this complexity often introduces redundancy during analysis. Not all spectral bands are equally informative for monitoring pine wilt disease (PWD) infection, necessitating the application of dimensionality reduction techniques to identify sensitive spectral regions while preserving critical information. The Savitzky-Golay (S-G) filter is a widely adopted data smoothing method that reduces noise and enhances data smoothness by fitting a polynomial to a sliding window of data points. During the acquisition of hyperspectral data from Yunnan Pine needles, environmental interferences and the wavelength-dependent response of spectral instruments frequently generate subtle noise features in the spectral curves. These “burr” noises are primarily attributed to high-frequency random interferences, baseline drifts, sample inhomogeneity, and light scattering38,39. To effectively suppress instrument errors and noise while retaining essential spectral features, a 7-point Savitzky-Golay filter was implemented in this study.
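The smoothing step can be illustrated with a minimal sketch. The text specifies only the 7-point window, so the polynomial order used here (quadratic/cubic, which gives the classic convolution weights −2, 3, 6, 7, 6, 3, −2 divided by 21) and the choice to leave the three points at each end of the spectrum unsmoothed are assumptions made for illustration. The function name `savgol7` is hypothetical.

```python
# 7-point Savitzky-Golay smoothing of one reflectance spectrum.
# Assumption: quadratic/cubic fit, whose convolution weights are
# (-2, 3, 6, 7, 6, 3, -2) / 21. The source states only the window length.
SG7_WEIGHTS = (-2, 3, 6, 7, 6, 3, -2)
SG7_NORM = 21


def savgol7(reflectance):
    """Return a smoothed copy of `reflectance` (a list of floats)."""
    half = len(SG7_WEIGHTS) // 2
    # Assumption: the first and last three points are left unsmoothed.
    smoothed = list(reflectance)
    for i in range(half, len(reflectance) - half):
        window = reflectance[i - half:i + half + 1]
        smoothed[i] = sum(w * v for w, v in zip(SG7_WEIGHTS, window)) / SG7_NORM
    return smoothed


if __name__ == "__main__":
    noisy = [0.30, 0.31, 0.30, 0.38, 0.31, 0.30, 0.31, 0.32]
    print([round(v, 4) for v in savgol7(noisy)])
```

Because the weights sum to 21 and are symmetric, a locally linear or quadratic stretch of the spectrum passes through the filter unchanged, while an isolated spike is damped.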

Vegetation index extraction

Based on the spectral characteristics of Yunnan Pine during the early stage of pine wilt disease and in conjunction with previous studies, this study extracted 11 vegetation indices, as detailed in Table 1.

Table 1 Vegetation index.

Biochemical parameter content data acquisition

In this study, biochemical parameter data were measured for 60 samples, including 30 healthy and 30 early-stage infected samples, corresponding to the samples collected for hyperspectral ground object data. The measured biochemical parameters comprised carotenoid content, total sugar, reducing sugar, total nitrogen and moisture content, totaling five indicators. The moisture content of the leaves was determined using the drying method. The fresh weight of each sample group was measured and recorded as \(m_{1}\). The leaves were then placed in an oven set at 63.5 °C for over 48 h until a constant weight was achieved. The dried leaves were weighed to determine the dry weight, recorded as \(m_{2}\). The moisture content was calculated using the following formula:

$$FMC=\frac{m_{1}\,(\mathrm{mg})-m_{2}\,(\mathrm{mg})}{m_{2}\,(\mathrm{mg})}\times 100\%$$
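As a small worked example of the formula above, the sketch below computes FMC from a fresh and a dry weight. The function name and the sample weights are hypothetical and do not come from the measured data.

```python
def moisture_content_percent(fresh_mg, dry_mg):
    """FMC = (m1 - m2) / m2 * 100, with m1 the fresh and m2 the dry weight in mg."""
    if dry_mg <= 0:
        raise ValueError("dry weight must be positive")
    return (fresh_mg - dry_mg) / dry_mg * 100.0


if __name__ == "__main__":
    # Hypothetical needle sample: 250 mg fresh, 100 mg after drying.
    print(moisture_content_percent(250.0, 100.0))
```

Because the denominator is the dry weight, values above 100% are possible whenever the water mass exceeds the dry mass.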

For the remaining four biochemical parameters, the following methodologies were employed:

(1) Carotenoid content: Quantified using spectrophotometry at λ = 470 nm, as described in a standard method45. The method involves acetone extraction of pigments followed by determination of their concentrations using extinction coefficients. This parameter serves as a critical indicator of the plant’s antioxidant capacity and its physiological response to biotic stressors, such as pathogen infection.

(2) Total Sugar: Measured via the 3,5-dinitrosalicylic acid (DNS) colorimetric method46, which quantifies reducing sugars through their ability to reduce DNS to a colored complex detectable at 540 nm. This approach is widely recognized for its high sensitivity and reliability in analyzing both polysaccharides and monosaccharides, making it suitable for assessing metabolic changes in infected tissues.

(3) Reducing Sugar: Analyzed using the DNS colorimetric method47, which specifically targets aldose and ketose sugars based on their reducing properties. This parameter provides valuable insights into the metabolic activity of plant tissues, particularly under stress conditions, and is essential for understanding the biochemical dynamics of pathogen-induced physiological disruptions.

(4) Total Nitrogen: Determined according to the national standard NY/T 2419-2013 (ref. 48), using an automatic nitrogen analyzer. The method employs high-temperature combustion to convert organic nitrogen into nitrogen oxides, which are subsequently quantified via thermal conductivity detection. This technique is widely adopted for its precision in evaluating nutrient availability and metabolic status in plant tissues, offering a robust foundation for nutritional and pathological analyses.

Methods

To identify spectral differences between healthy and early disease stages, the Euclidean distance was calculated between the average spectral reflectance curves of the two groups. The intersection points of these curves were used as thresholds to segment the spectral bands, which were then rounded and merged with peak regions to determine the optimal monitoring wavelengths. To explore diagnostic parameters, each biochemical indicator was evaluated within its respective threshold range using descriptive statistical analysis to highlight significant variations between healthy and diseased states. Subsequently, correlations between the biochemical parameters and the extracted vegetation indices were analyzed to identify the most effective diagnostic indicators based on their co-significance. Spectral bands sensitive to the diagnostic biochemical parameters were identified using competitive adaptive reweighted sampling (CARS) and the successive projections algorithm (SPA). The selected bands were then classified and evaluated with a machine learning model, Extreme Gradient Boosting (XGBoost), to assess their effectiveness in disease diagnosis based on biochemical indicators.

Euclidean distance

The Euclidean distance method, rooted in the mathematical principle of measuring the straight-line distance between two points in n-dimensional space, serves as a foundational tool for quantifying dissimilarity in high-dimensional datasets, particularly in spectral analysis where it is used to evaluate the separation between classes by comparing feature distributions49. In this study, it was employed to assess the significance of spectral bands in disease detection by quantifying their discriminative power through three sequential steps. First, the average intra-category Euclidean distances were calculated within both the healthy and early-stage sample groups, reflecting the internal variability of spectral features in each class. Second, the inter-category Euclidean distance between spectral vectors from the healthy and early-stage groups was determined, representing the separation between the two classes. Finally, a spectral band was deemed statistically significant if its inter-category distance exceeded both intra-category distances. The band’s importance increases proportionally with the ratio of inter-category distance to intra-category distances.
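The three steps can be sketched as follows for a single monitoring window. This is an illustrative sketch, not the authors' implementation: the text does not give an explicit formula for the importance ratio, so dividing the inter-class distance by the larger of the two intra-class distances is an assumption, and all function names and reflectance values are hypothetical.

```python
import math
from itertools import combinations, product


def euclidean(u, v):
    """Straight-line distance between two spectra restricted to one window."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_intra_distance(group):
    """Average distance over all pairs of spectra within one class."""
    pairs = list(combinations(group, 2))
    return sum(euclidean(u, v) for u, v in pairs) / len(pairs)


def mean_inter_distance(group_a, group_b):
    """Average distance over all cross-class pairs of spectra."""
    pairs = list(product(group_a, group_b))
    return sum(euclidean(u, v) for u, v in pairs) / len(pairs)


def window_score(healthy, early):
    """Return (separable, ratio) for one spectral window."""
    intra_h = mean_intra_distance(healthy)
    intra_e = mean_intra_distance(early)
    inter = mean_inter_distance(healthy, early)
    separable = inter > intra_h and inter > intra_e
    # Assumption: importance = inter-class / largest intra-class distance.
    ratio = inter / max(intra_h, intra_e)
    return separable, ratio


if __name__ == "__main__":
    healthy = [[0.50, 0.52], [0.51, 0.53], [0.49, 0.51]]  # hypothetical
    early = [[0.40, 0.42], [0.41, 0.43], [0.39, 0.41]]    # hypothetical
    print(window_score(healthy, early))
```

Windows can then be ranked by the returned ratio, with larger values indicating better separation of the healthy and early-stage groups.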

Correlation analysis

The Pearson correlation coefficient is a statistical measure used to quantify the linear relationship between two variables. It evaluates whether a linear association exists and determines its strength by analyzing the covariance and standard deviations of the variables. The statistical significance of the correlation is assessed using the p-value: a p-value < 0.01 indicates a highly significant relationship, while a p-value < 0.05 suggests a significant relationship between the independent and dependent variables50. We examined how early-stage disease affects the correlations between biochemical parameters, and the relationships between biochemical parameters and vegetation indices. By comparing healthy and early-disease samples, we identified significant differences in biochemical parameter correlations and selected the most diagnostically relevant indices based on their co-significance.
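For illustration, the coefficient can be computed directly from its definition, as in the sketch below. The significance test is represented only by the standard t statistic with n − 2 degrees of freedom, from which the p-value is obtained. The function names and sample values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


if __name__ == "__main__":
    carotenoid = [0.21, 0.25, 0.30, 0.28, 0.35]  # hypothetical values
    index = [1.10, 1.25, 1.40, 1.38, 1.60]       # hypothetical values
    r = pearson_r(carotenoid, index)
    print(round(r, 3), round(t_statistic(r, len(index)), 3))
```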

Competitive adaptive reweighted sampling (CARS)

The CARS algorithm is a feature selection technique designed to identify optimal feature subsets from high-dimensional data by adaptively weighting variables and systematically eliminating less significant features through competitive mechanisms51. This approach enhances model generalization, reduces overfitting, and improves selection efficiency. However, the interpretability of the selected features remains challenging because it depends on model performance and initial variable choices. In this study, we developed an enhanced CARS framework that integrates Random Forest permutation importance metrics, preserving Darwinian evolutionary principles (“survival of the fittest”) while addressing critical limitations of traditional PLSR-based CARS52. The algorithm was implemented in MATLAB R2023a, and its main steps are as follows53:

(1) Initialization: the model initializes the number of samples, the number of features (bands), the best performance, the performance history, and the weights for all features.

(2) Random Forest training: a Random Forest model is trained on the currently selected feature subset, and the out-of-bag (OOB) error is computed.

(3) Weight update and ARS competitive selection: the weights are updated from the out-of-bag error according to the importance of the features. On the subset screened by the exponentially decreasing function (EDF), adaptive reweighted sampling (ARS) then competitively eliminates wavelengths using the weights of the regression coefficients.

(4) Subset evaluation: during each iteration, a feature subset is selected from the updated weights by choosing the top N features with the highest weights. The performance of this subset is then evaluated using K-fold cross-validation, with the mean squared error as the evaluation metric. The performance result P of each iteration is recorded in the performance history array. If the current error is lower than the previously recorded best value, P (best) and the corresponding feature subset are updated. The iteration stops if the performance improvement falls below a predefined threshold over consecutive iterations.

Ultimately, the algorithm outputs the best-performing feature subset observed throughout the iterations, along with its corresponding performance metric. This feature subset represents the optimal spectral bands required for the analysis.
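Two building blocks named in step (3), the EDF screening and the ARS draw, can be sketched as follows. This follows the schedule of the classic CARS formulation (the retained fraction equals 1 in the first run and 2/p in the last), not necessarily the Random Forest variant used in this study. The function names, the number of runs, and the example weights are assumptions for illustration.

```python
import math
import random


def edf_retention_schedule(n_bands, n_runs):
    """Number of bands kept in runs 1..n_runs under the exponentially
    decreasing function r_i = a * exp(-k * i), with r_1 = 1 and r_N = 2 / p."""
    a = (n_bands / 2) ** (1 / (n_runs - 1))
    k = math.log(n_bands / 2) / (n_runs - 1)
    return [round(n_bands * a * math.exp(-k * i)) for i in range(1, n_runs + 1)]


def ars_select(candidates, weights, rng):
    """Adaptive reweighted sampling: draw with replacement in proportion to
    the weights and keep the distinct bands that were drawn at least once."""
    draws = rng.choices(candidates, weights=weights, k=len(candidates))
    return sorted(set(draws))


if __name__ == "__main__":
    # 1738 bands remain after outlier removal; 50 runs is an assumed setting.
    schedule = edf_retention_schedule(1738, 50)
    print(schedule[:3], schedule[-3:])
    rng = random.Random(0)
    print(ars_select([534, 559, 716, 739], [0.4, 0.3, 0.0, 0.3], rng))
```

Bands with small weights are rarely drawn and therefore drop out quickly, which is the competitive elimination that the step describes.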

Successive projections algorithm (SPA)

The SPA is a deterministic variable selection method based on projection vector analysis. This algorithm enables the selection of a minimal subset of characteristic wavelengths from the full spectral range to represent the majority of sample spectral data using limited spectral bands54. Through systematic mitigation of spectral band collinearity and information redundancy, SPA achieves concurrent improvements in model computational efficiency and predictive accuracy. The algorithm has demonstrated extensive applications in feature wavelength extraction16,53,57. Its core mechanism employs a forward selection strategy to iteratively construct variable combinations: starting with an initial variable, the algorithm subsequently incorporates wavelengths exhibiting the weakest linear relationship with the current combination while maintaining the largest projection vector. This approach simultaneously preserves maximal spectral information while minimizing redundancy. The validation process implements leave-one-out cross-validation combined with multiple linear regression (MLR), dynamically optimizing wavelength combinations to minimize root mean square error of cross-validation. The deterministic nature of SPA ensures unique input-output correspondence and provides physically interpretable wavelength selections, thereby establishing clear correlations between spectral features and target variables55,56,57.

The algorithm was implemented in MATLAB R2023a. Within the primary function SPA, input parameters include: Hyperspectral data matrix, Dependent variable vector, Maximum number of bands, Cross-validation folds. Initialization constructs a projection identity matrix. Iterative operations subsequently select wavelengths exhibiting maximal projection values relative to the dependent variable.
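The projection step at the core of SPA can be sketched in a few lines. The sketch covers only the forward selection by successive orthogonal projections; the choice of starting band and the final subset size, which the text determines by cross-validated MLR, are omitted. Mean-centring the band columns and the function name `spa_select` are assumptions.

```python
def spa_select(X, n_select, start=0):
    """Select `n_select` band indices from X (a list of samples, each a list
    of band values) by successive orthogonal projections."""
    n_bands = len(X[0])
    # Assumption: work on mean-centred band columns.
    cols = []
    for j in range(n_bands):
        col = [row[j] for row in X]
        mean = sum(col) / len(col)
        cols.append([v - mean for v in col])

    selected = [start]
    for _ in range(n_select - 1):
        ref = cols[selected[-1]]
        ref_norm2 = sum(v * v for v in ref)
        best, best_norm2 = None, -1.0
        for j in range(n_bands):
            if j in selected:
                continue
            if ref_norm2 > 0:
                # Remove the component of band j that lies along the last
                # selected band, leaving only the non-collinear information.
                coef = sum(a * b for a, b in zip(cols[j], ref)) / ref_norm2
                cols[j] = [a - coef * b for a, b in zip(cols[j], ref)]
            norm2 = sum(v * v for v in cols[j])
            if norm2 > best_norm2:
                best, best_norm2 = j, norm2
        selected.append(best)
    return selected


if __name__ == "__main__":
    # Band 1 is an exact multiple of band 0; band 2 carries new information.
    X = [[1, 2, 5], [2, 4, 1], [3, 6, 4], [4, 8, 2]]
    print(spa_select(X, 2, start=0))
```

In the toy data, band 1 is fully collinear with band 0, so its projection vanishes and band 2 is chosen instead. This is the redundancy-minimizing behavior described above.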

Classification model construction

Extreme Gradient Boosting (XGBoost) is an ensemble learning algorithm based on decision trees designed by Chen and Guestrin, primarily used for classification and regression problems58. The model overcomes many challenges faced by traditional machine learning algorithms, such as high sensitivity to sample data, high computational complexity, and model overfitting. It can extract optimal features from multiple variables, and update and adjust the gradient of base learners for classification and identification. The XGBoost model demonstrates significant advantages in early disease identification due to its excellent feature learning capability and classification accuracy19,21,59. This study employs two evaluation metrics: Overall Accuracy (OA) measures the proportion of correctly predicted samples (range: 0–1), while Area Under the Receiver Operating Characteristic Curve (AUC) quantifies classification boundary robustness through ROC curve integration. The metrics form a complementary evaluation system from sample distribution balance and class separability perspectives.
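Both metrics can be written out directly, as in the sketch below. The AUC is computed through its rank interpretation: the probability that a randomly chosen positive (early-stage) sample receives a higher score than a randomly chosen negative (healthy) sample, with ties counted as one half. The function names and example values are hypothetical.

```python
def overall_accuracy(y_true, y_pred):
    """Proportion of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores (equivalent to integrating the ROC curve)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]            # 0 = healthy, 1 = early stage
    y_pred = [0, 1, 1, 1]            # hypothetical predicted labels
    scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
    print(overall_accuracy(y_true, y_pred), roc_auc(y_true, scores))
```

An AUC of 0.5 corresponds to chance-level ranking, which is what a classifier producing identical scores for every sample would obtain.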

To ensure robust testing and evaluation, the balanced dataset was divided into a training set (70%) and a test set (30%). The machine learning models were implemented in Python using Scikit-learn. The hyperparameter search space included learning rate (0.01–0.3), maximum tree depth (5–10), number of estimators (100–1000), subsample ratio (0.5–1.0) and column sampling by tree (0.5–1.0).
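The split and the search space can be sketched as below. The text does not state how the space was searched or whether the split was stratified, so random sampling of configurations and a per-class (stratified) split are assumptions. The dictionary keys follow the parameter names of XGBoost's scikit-learn interface, and the helper names are hypothetical.

```python
import random

# Search ranges as stated in the text; keys use XGBoost's scikit-learn names.
SEARCH_SPACE = {
    "learning_rate": (0.01, 0.3),
    "max_depth": (5, 10),
    "n_estimators": (100, 1000),
    "subsample": (0.5, 1.0),
    "colsample_bytree": (0.5, 1.0),
}
INTEGER_PARAMS = {"max_depth", "n_estimators"}


def sample_params(rng):
    """Draw one configuration uniformly from the search space (assumed strategy)."""
    params = {}
    for name, (low, high) in SEARCH_SPACE.items():
        if name in INTEGER_PARAMS:
            params[name] = rng.randint(low, high)
        else:
            params[name] = rng.uniform(low, high)
    return params


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), holding out the same fraction per class."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


if __name__ == "__main__":
    rng = random.Random(42)
    labels = [0] * 30 + [1] * 30  # 30 healthy and 30 early-stage samples
    train_idx, test_idx = stratified_split(labels, 0.3, rng)
    print(len(train_idx), len(test_idx))
    print(sample_params(rng))
```

Each sampled dictionary could be passed as keyword arguments to a gradient boosting classifier and scored on the held-out indices.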

Results

Selection of the monitoring window

The hyperspectral reflectance trends of the early infection stage and the healthy stage are broadly the same. Nevertheless, the early-stage and healthy spectral curves differ clearly at 11 positions, corresponding to wavelengths of 552 nm, 766 nm, 911 nm, 976 nm, 1088 nm, 1203 nm, 1253 nm, 1454 nm, 1652 nm, 1780 nm, and 2219 nm (Fig. 3a). The maximum peak difference occurs in the range of 766–911 nm. In the wavelength ranges of 455–677 nm and 770–1159 nm, the reflectance of healthy pine needles is higher than that of early-diseased needles. Conversely, the reflectance of healthy pine needles is lower than that of early-diseased needles in the wavelength range of 1300–2500 nm. This reflectance pattern may be influenced by chlorophyll. As pine needles change from a healthy to a diseased state, the chlorophyll content decreases, resulting in a decrease in absorbance and a corresponding increase in reflectance. This observation is similar to the findings of Li et al.30 for Korean pine. In this study, the optimal bands for early disease monitoring were identified (Fig. 3b). Ranges A, B, C, D, E, and F were determined using spectral difference analysis.

Fig. 3

The average reflectance spectrum at different stages of infection. (a) Peak band position analysis. (b) The average spectral difference.

Analysis of monitoring window based on Euclidean distance

The spectral data for both healthy and early-stage samples were categorized into three groups: A-H represents the average Euclidean distance between healthy samples, A-E represents the average Euclidean distance between early-stage samples, and HvsE represents the average Euclidean distance between healthy and early-stage samples. As shown in Table 2, all inter-class distances (between healthy and early-stage samples) within the spectral bands of 455–677 nm, 680–763 nm, 770–1159 nm, 1160–1330 nm, 1400–1785 nm and 1974–2340 nm are significantly larger than the intra-class distances within each class. Therefore, these monitoring bands are deemed suitable for the early detection of pine wilt disease in Yunnan Pine.

Table 2 Calculation of Euclidean distance in monitoring bands.

The ranking of the selected monitoring bands is determined based on the Euclidean distances among the three category groups, with the HvsE group’s greater distance from A-H and A-E indicating higher importance. The resulting priority order is as follows: 1974–2340 nm > 455–677 nm > 1400–1785 nm > 680–763 nm > 770–1159 nm > 1160–1330 nm (Fig. 4). Analysis of the average spectral reflectance for healthy and early-stage samples reveals significant differences in the 770–1159 nm band. However, Euclidean distance analysis does not indicate the largest separation between healthy and early-stage samples within this band, and early-stage recognition in this range is suboptimal. Therefore, the significance of a monitoring band cannot be determined solely based on spectral average differences.

Fig. 4

The result of Euclidean distance analysis.

Biochemical parameters analysis

The biochemical analysis of Yunnan pine revealed distinct patterns in key physiological indicators. During the early stage, the maximum, minimum, and mean values of total sugar, reducing sugar, and moisture content were significantly lower than in the healthy stage. Conversely, carotenoid content in the early stage showed a statistically significant increase relative to the healthy stage. Total nitrogen content showed no significant difference between the two stages (p > 0.05), whereas four biochemical parameters (carotenoid content, total sugar, reducing sugar, and moisture content) showed highly significant variations (p < 0.01). These four indicators were therefore selected as primary diagnostic markers due to their consistent statistical significance across multiple measurements (Fig. 5).

Fig. 5

The result of biochemical parameter content statistics.

As shown in Fig. 6, the correlations between carotenoid content, reducing sugar, total sugar, and vegetation indices were analyzed. Carotenoids were associated with seven vegetation indices, while reducing sugars showed associations with five. Only one index (CRI1) was correlated with total sugar (r = -0.341). This weak relationship between total sugar and the vegetation indices rendered it unsuitable as a diagnostic indicator for spectral data analysis. In contrast, carotenoid content, reducing sugar, and moisture content exhibited strong correlations and were therefore selected as diagnostic biochemical indicators for early identification.

Fig. 6

Correlation analysis between biochemical parameters and vegetation index.

Extraction of sensitive bands

For the diagnostic biochemical indicators of Yunnan pine, the CARS algorithm identified 19 sensitive spectral bands associated with carotenoid content (Fig. 7). These bands, located at 534–539 nm, 559–566 nm, 716–717 nm, and 739–741 nm, are distributed in the vicinity of the green peak and red edge absorption regions. For reducing sugar, 21 sensitive bands were selected, including those at 724–725 nm and 781–799 nm, which are primarily located in the red edge and near-infrared absorption regions. Additionally, 20 sensitive bands were identified for moisture content, at 748 nm, 755 nm, 756 nm, 758 nm, 1989 nm, 1995 nm, 1999 nm, 2000 nm, 2083 nm, 2089 nm, 2094 nm, 2120–2122 nm, 2138 nm, 2152 nm, 2276 nm, 2282 nm, 2289 nm, and 2292 nm. These bands are predominantly concentrated in the red edge and shortwave infrared regions.

Fig. 7

Selection results of the CARS algorithm. (a) Carotenoid content-CARS. (b) Reducing sugar-CARS. (c) Moisture content-CARS.
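To give a sense of how CARS narrows a full spectrum down to a few dozen bands, the sketch below computes the fraction of bands retained at each sampling run under the exponentially decreasing function used in the standard formulation of CARS. It covers only this forced-elimination schedule, not the PLS-weighted sampling or cross-validation steps, and the band and run counts in the usage note are hypothetical rather than the settings used in this study.

```python
import math


def cars_retention_ratios(n_bands, n_runs):
    """Fraction of bands kept at each CARS sampling run.

    Follows r_i = a * exp(-k * i) with a and k chosen so that all bands
    are kept at the first run and two bands remain at the last run.
    """
    if n_bands < 2 or n_runs < 2:
        raise ValueError("need at least 2 bands and 2 sampling runs")
    a = (n_bands / 2) ** (1 / (n_runs - 1))
    k = math.log(n_bands / 2) / (n_runs - 1)
    return [a * math.exp(-k * i) for i in range(1, n_runs + 1)]
```

For example, `cars_retention_ratios(2000, 50)` yields a schedule that removes bands quickly in the first runs and slowly in the last ones; the retained subset is normally taken from the run with the lowest cross-validation error.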

In the SPA selection for the diagnostic biochemical indicators of Yunnan Pine (Fig. 8), six sensitive bands were identified for carotenoid content (713 nm, 758 nm, 997 nm, 1074 nm, 1124 nm, and 1663 nm), five for reducing sugar (713 nm, 758 nm, 1074 nm, 1124 nm, and 1663 nm), and four for moisture content (758 nm, 1074 nm, 1124 nm, and 1663 nm). These bands lie in the red edge, near-infrared, and short-wave infrared regions, and four of them (758 nm, 1074 nm, 1124 nm, and 1663 nm) are common to all three biochemical parameters.

Fig. 8

Selection results of the SPA algorithm. (a) Carotenoid content-SPA. (b) Reducing sugar-SPA. (c) Moisture content-SPA.
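The projection step at the core of SPA, which explains why it returns a handful of weakly collinear bands, can be sketched as follows. This is a simplified illustration, not the configuration used in this study: it takes a fixed starting band and subset size, whereas a full SPA run also evaluates candidate starting bands and subset sizes against a validation model. The function and argument names are hypothetical.

```python
def _dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def spa_select(columns, start, n_select):
    """Pick n_select band indices with minimal collinearity.

    columns: one list of sample values per band (ideally mean-centered).
    start:   index of the band used to seed the selection.
    """
    if not 0 <= start < len(columns):
        raise ValueError("start must be a valid band index")
    if not 1 <= n_select <= len(columns):
        raise ValueError("n_select must be between 1 and the number of bands")
    work = [[float(v) for v in col] for col in columns]
    selected = [start]
    while len(selected) < n_select:
        ref = work[selected[-1]]
        ref_sq = _dot(ref, ref)
        best, best_norm = None, -1.0
        for j, col in enumerate(work):
            if j in selected:
                continue
            if ref_sq > 0:
                # Remove the component of this band that lies along the
                # most recently selected band.
                coef = _dot(col, ref) / ref_sq
                col = [c - coef * r for c, r in zip(col, ref)]
                work[j] = col
            norm = _dot(col, col)
            if norm > best_norm:
                best, best_norm = j, norm
        # The band with the largest residual adds the most new information.
        selected.append(best)
    return selected
```

The returned indices map back to wavelengths through the spectrometer's band list. Because each step discards whatever a band shares with the bands already chosen, strongly correlated neighbors are rarely selected together, which is consistent with SPA returning far fewer bands than CARS.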

Classification model construction

To assess the validity of the CARS and SPA wavelength selection methods, the feature bands from each algorithm and the full set of bands were fed into XGBoost classification models (Fig. 9). The results showed that the CARS and SPA bands achieved higher accuracy than full-band classification, demonstrating the effectiveness of the selected bands. Although CARS performed moderately for moisture and carotenoid content, its AUC of 0.5 for reducing sugar, equivalent to random guessing, indicated poor separation of early-stage disease. Conversely, SPA-selected bands consistently outperformed the other methods, demonstrating superior diagnostic capability for the spectral signatures critical to early detection. Classification based on reducing sugar with SPA-selected bands achieved the best performance, with an overall accuracy (OA) of 0.91 and an AUC of 0.9.

Fig. 9

Classification accuracy of Yunnan pine using different selected bands and all bands.
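For readers less familiar with the two metrics reported above, the sketch below shows how OA and AUC can be computed from a classifier's outputs. It is a minimal illustration, assuming binary labels (0 = healthy, 1 = early-infected) and using the rank-based definition of AUC; it is not the evaluation code used in this study, and the function names are hypothetical.

```python
def overall_accuracy(y_true, y_pred):
    """Share of samples whose predicted class matches the true class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random infected sample scores above a random
    healthy sample, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Under this definition, a model that assigns the same score to every sample has an AUC of 0.5, which is why that value is read as no separation between healthy and early-infected trees.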

Discussion

Selection of monitoring window

Hyperspectral data provide rich spectral information across numerous narrow bands, enabling the detection of subtle variations in vegetation. In their healthy state, the hyperspectral profiles of Yunnan Pine needles exhibit the distinct spectral characteristics of vigorous plants, reflecting physiological factors such as chlorophyll content and cellular structure. Early infection alters these factors; the changes are imperceptible to the naked eye but manifest in the spectral curves of the needles. Such spectral variations can serve as a rapid and effective means of identifying the health status of Yunnan Pine, so it is crucial to identify the most sensitive monitoring bands for disease detection15,16,17,18,19,20,21,22,23,24,25,26,27. We analyzed the spectral peak differences between healthy and early PWD-infected pine needles, as well as the monitoring bands. Euclidean distance analysis identified the SWIR range from 1974 to 2340 nm as highly sensitive to early PWD infection in Yunnan Pine, followed by the visible range from 455 to 677 nm. These results partially align with previous ground-level studies, in which SWIR regions were among the most responsive for early detection of PWD34,60. One study used a GER-3700 spectrometer to record the spectra of trees inoculated with PWD, reporting an increase in reflectance in the red and SWIR bands 67 days post-inoculation11. Another study employed a ground-based hyperspectral camera to examine the spectral characteristics of PWD at various infection stages, identifying the 688 nm band as optimal and the mid-infrared band as most sensitive in the early stages61,62. Additional research has shown that black pine and Masson pine affected by pests and diseases exhibit a reduction in the green peak, a blue shift in the red edge, and an intensification of the red valley. Liu et al. report that PWD infection induces alterations to photosynthesis-related mechanisms in Masson pine needles, with the associated characteristic bands primarily located in the near-infrared region, which does not entirely correspond with the detection bands for early infection of Yunnan Pine63. This suggests that the spectral response to PWD infection may vary across pine species. Because this study aims to delineate the spectral differences between early-infected and healthy Yunnan Pine and to find the detection window that best highlights spectral changes induced by infection rather than by other factors, the influence of mid- and late-stage infections on the early identification bands was not considered.
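The window comparison described above can be sketched as follows. This is a minimal illustration, assuming the distance is computed between the mean healthy and mean early-infected reflectance spectra over the bands that fall inside each window; the function name and argument layout are hypothetical, and the study's exact procedure may differ.

```python
import math


def window_distance(wavelengths, healthy, infected, lo, hi):
    """Euclidean distance between two mean spectra within [lo, hi] nm.

    wavelengths, healthy, and infected are equal-length sequences giving
    the band centers (nm) and the mean reflectance of each group.
    """
    idx = [i for i, w in enumerate(wavelengths) if lo <= w <= hi]
    if not idx:
        raise ValueError("no bands fall inside the requested window")
    return math.sqrt(sum((healthy[i] - infected[i]) ** 2 for i in idx))
```

Because the raw distance grows with the number of bands in a window, one option when ranking windows of different widths is to divide the result by the square root of the band count.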

Biochemical parameter screening

The primary cause of changes in the external morphology and color of trees under pest and disease stress is the alteration of their internal physiological structure, and the extent of the external changes correlates with the degree of internal physiological variation. For instance, the moisture content of pine needles decreases as the severity of PWD increases, and the needle color changes from green to brick red64,65,66. Studies monitoring chlorophyll, carotenoid, and moisture content have revealed statistically significant differences between the early and middle stages of the disease34. Fukuda et al. found that during PWD infection, water transport in affected pine trees is obstructed, reducing crown moisture content and gradually lowering photosynthesis, which in turn decreases chlorophyll content67,68. Our study observed similar results when comparing five biochemical parameters in early-infected Yunnan Pine: the average values of total sugar, reducing sugar, and moisture content in the early stage were lower than those in the healthy stage. This finding aligns with Xu’s pathophysiological study, which noted a decrease in reducing sugar content in two research species62. The primary reason is that PWD infection reduces the photosynthesis rate, hindering effective sugar synthesis. In addition, nematodes consume reducing sugars from the tree for their metabolism. One study demonstrated that reduced moisture content triggers nematodes to synthesize and secrete substantial polysaccharide-based metabolites69, which coalesce into a protective, moisture-retaining film around the nematodes as an adaptive mechanism against desiccation stress. Because this metabolic activity relies heavily on sugars obtained from the host pine, reducing sugars in host tissue are measurably depleted as they are consumed.
Our study observed higher carotenoid levels during early infection stages than in healthy trees. This finding contrasts with certain previous reports, which suggest a progressive decline in carotenoid content as the disease advances, but is consistent with Xu et al.62, who noted an initial carotenoid increase during early stress exposure followed by a subsequent decrease.

Selection of sensitive bands and classification

In this study, we classified healthy and early-stage infected trees using the selected spectral bands. The CARS algorithm identified the subset of spectral bands most relevant to each response variable: the 19 bands associated with carotenoids lay around the green peak and red edge absorption regions, the bands related to reducing sugars lay in the red edge and near-infrared absorption regions, and those associated with moisture content lay in the red edge and SWIR regions. Four sensitive bands common to all three indicators were selected by SPA, corresponding to the red-edge, NIR, and SWIR absorption regions, and they exhibited high classification accuracy. The NIR band was more sensitive to cellular structure and therefore showed significant variation within the population, while being less affected by the stage of infection. This finding is consistent with Li et al.34. Another investigation identified the “infrared edge” band within the 1240–1340 nm range as the primary diagnostic band for longleaf pine disease and found that a narrow range centered at 1780 nm also contributed to disease diagnosis8,62. This result partially corresponds to the bands selected by SPA in our study. The XGBoost model can effectively improve the early detection accuracy of PWD and provide methodological and technical references for early disease prevention and control. The classification performed well in distinguishing between healthy and early-stage infected trees. Because this study focuses on revealing early features, we used only XGBoost, which avoids variation introduced by model selection.

Conclusions

This study analyzed the spectral and biochemical properties of needles from Yunnan pine in the early stage of pine wilt disease. The results showed that total sugar, reducing sugar, and moisture content decreased significantly during early infection, with these three parameters exhibiting the most pronounced differences between healthy and early-infected trees. Using the Euclidean distance algorithm, the study identified the spectral ranges 1974–2340 nm and 455–677 nm as optimal diagnostic windows. The CARS spectral dimensionality reduction algorithm identified 20 diagnostic spectral bands related to moisture content, located in the red edge and shortwave infrared regions. The SPA spectral dimensionality reduction algorithm identified four common diagnostic spectral bands (758, 1074, 1124, and 1663 nm), located in the red edge, near-infrared, and shortwave infrared regions. By integrating these selected bands with an XGBoost classifier, the best classification accuracies for distinguishing healthy from early-stage infected trees reached 0.91 and 0.83 for the two feature sets, respectively. Moreover, the AUC consistently exceeded 0.8 for both methods, indicating robust discriminatory capability in distinguishing infection stages. This research targeted early spectral feature identification for PWD, enabling leaf-scale monitoring and establishing a technical foundation for hyperspectral detection of incipient infections via airborne and satellite platforms.