Introduction

Sorghum, a member of the Poaceae family, ranks as the fifth largest cereal crop globally and is predominantly cultivated in arid and semi-arid regions worldwide1. Its grains are rich in nutrients, including starch, protein, tannin, fat, and amino acids2,3. The nutritional content of sorghum grains differs significantly among varieties, and these differences lead to different applications, such as food manufacturing, feed processing, and brewing4,5,6. Therefore, accurate determination of the nutritional components of sorghum grains is important for their application.

Despite the accuracy of the quantification results obtained through traditional chemical detection methods, they suffer from numerous shortcomings. These include complex sample preparation, high costs, time-consuming procedures, destructiveness to the samples, and the necessity for special equipment and chemical reagents. Furthermore, traditional chemical methods can only quantify a single nutrient component and are unable to quantify multiple nutrients simultaneously. All these factors limit their widespread application7,8. Consequently, there is a need for a low-cost, nondestructive, and efficient detection method.

A hyperspectral imaging (HSI) system acquires spectral images of objects by continuously measuring spectra across many wavelength bands. By selecting regions of interest within the image, spectral information from different wavelength bands can be extracted. The spectral data are then preprocessed and characteristic wavelengths are selected to improve data quality, after which quantitative models are established using machine learning algorithms and relevant inversion indicators. Fatemi et al. employed NIR hyperspectral imaging to identify and predict the spectral range of major chemical components in maize, thereby assessing its quality11. Zareef et al. employed mid-infrared and NIR hyperspectral techniques to analyze crops, including cereals, legumes, and grains12. Their study demonstrated the applicability of mid-infrared spectroscopy in characterizing active compounds present in minor cereals and their processing, while near-infrared spectroscopy was found suitable for quantitative analysis and identification of quality indicators in minor cereals. Chen et al. used NIR diffuse reflectance spectroscopy combined with partial least squares discriminant analysis to establish a recognition method for multi-grain rice seed varieties, and the proposed model can provide a reference for special spectrometers13. Zhu et al. used the KS algorithm to extract characteristic spectra and developed an NIR spectroscopy prediction model that meets the requirements for rapid determination of Tartary buckwheat starch14. Fan et al. proposed two methods for detecting amylose content (AC) and fat content (FC) in single rice grains by NIRS. The proposed methods are suitable for rapid and nondestructive detection and sorting of rice seeds with different AC and FC, and can meet the requirements of coarse screening of rice seed varieties15.

Huang et al. employed hyperspectral technology for the rapid determination of amylose and amylopectin contents in sorghum. They investigated the impact of various preprocessing methods on VIS and NIR spectroscopy data and compared the predictive accuracy of these spectroscopic measurements. By combining principal component analysis and the successive projections algorithm to extract characteristic wavelengths, and using both full-spectrum and characteristic-band spectra, partial least squares regression (PLSR) and Random Forest (RF) models were established to predict the contents of amylose and amylopectin in different sorghum varieties. The study confirmed that spectral technology can be applied to the rapid and nondestructive prediction of amylose and amylopectin contents in various sorghum varieties10.

The spectral characteristics of the visible light band (400 to 700 nm) are closely related to a variety of chemical components in the grain. For example, chlorophyll, lutein, carotene, and other pigments have specific absorption peaks in the visible light band, and the content of these pigments is directly related to the nutritional quality of cereals. Other nutrients in grains, such as protein, starch, and cellulose, may also exhibit specific spectral characteristics in the visible light band. Therefore, the nutritional quality of cereals can be assessed indirectly through spectral analysis in the visible band. Caporaso et al. pointed out that HSI is a multifunctional tool for studying grains, allowing rapid analysis of individual grains as well as multi-component analysis. HSI is one of the most promising non-contact technologies that can quickly provide chemical information and give insight into the spatial distribution of chemical properties16. Almoujahed et al. combined VIS-NIR and MIR spectroscopy with machine learning models to provide a simple, suitable, and nondestructive method for distinguishing Fusarium head blight (FHB) infection in wheat and flour samples at the post-harvest stage17. Shi et al. proposed an SLR model based on VIS-NIR hyperspectral data, which can accurately predict hundreds of nutrients in wheat grains18. Shuai et al. used visible hyperspectral images to capture spatial and spectral features of object surfaces, and the combination of deep learning and hyperspectral imaging has shown good results in multi-scale agricultural management19. However, few rapid detection methods are available for sorghum components such as crude protein, tannin, and crude fat. Therefore, this work proposes a rapid and nondestructive detection method for sorghum based on VIS-NIR hyperspectral technology to achieve accurate determination of nutrient contents in different sorghum varieties.

This work covered 93 sorghum varieties; three samples were collected from each variety, giving 279 samples in total. First, the spectral information of the samples was obtained using VIS-NIR hyperspectral detection technology, and chemical analysis methods were employed to measure the nutrient contents, including crude protein, tannin, and crude fat. Next, the data were preprocessed using the Standard Normal Variate (SNV), Detrending, and Multiplicative Scatter Correction (MSC) algorithms to enhance the signal-to-noise ratio and data reliability between samples. Then, the Competitive Adaptive Reweighted Sampling (CARS) and Bootstrapping Soft Shrinkage (BOSS) algorithms were used for coarse feature extraction to identify spectral variable combinations that are highly correlated with the corresponding nutritional components. The Iteratively Retains Informative Variables (IRIV) algorithm was then used to refine the initially extracted feature variables, extracting interpretable variables that are strongly correlated with the nutritional components. These processes effectively reduce the dimensionality of the dataset and highlight important characteristic bands. Finally, detection models for crude protein, tannin, and crude fat content in sorghum were constructed using PLS, BP, and ELM methods, achieving rapid and nondestructive determination of nutrient contents.

Material and method

Sorghum samples and growth conditions

Sorghum varieties used in this study were all from the Sorghum Research Institute of Shanxi Agricultural University. The plantation is located in Yuci District, Jinzhong City, Shanxi Province (112°34’ ~ 113°8’ E, 37°23’ ~ 37°54’ N). The region has a temperate continental arid climate, with an average altitude of 800 m, an average annual temperature of 9.8 °C, an average annual precipitation of 440.7 mm, an annual sunshine duration of 2662 h, and a frost-free period lasting for 158 days.

In the experiment, a total of 93 varieties of sorghum with plump and uniform grains were collected. Three samples were collected for each variety, each weighing 300 g, resulting in a total of 279 samples. All samples were stored in wide-mouth bottles under hermetically sealed conditions. Hyperspectral images were captured first, and the component contents were determined afterwards. This sequence allowed the nutritional composition of different sorghum varieties to be investigated by both spectroscopy and chemical analysis.

Hyperspectral image acquisition and spectral data extraction

Hyperspectral image acquisition

This experiment employed a visible-near infrared hyperspectral scanning platform (Starter Kit, Headwall Photonics, USA) to capture hyperspectral images of sorghum samples. The platform, located in a darkroom, comprised a visible-near infrared hyperspectral camera (Headwall Photonics, USA), a 1000 W external light source (6995P, Philips, Netherlands), a transmission control platform, and a computer. The acquisition parameters were set as follows: the distance between the camera and the sample was 240 mm, the push-scan travel was 100 mm, and the platform movement speed was 2.94 mm/s. The lens has a spectral range of 380 ~ 1000 nm and a spectral resolution of 0.727 nm, and a total of 856 bands were collected. Because oscillations in spectral reflectance at the extremes of the spectrometer’s range introduce significant errors, the spectral range selected for modeling was 430 ~ 900 nm, totaling 646 bands.

Sorghum grain samples were loaded into a sample tray with a diameter of 50 mm and a depth of 15 mm, with the surface smoothed and compacted, and were moved by the transmission control platform to allow line scanning by the hyperspectral camera. The computer was used to control the transmission platform and save the raw data collected from the samples.

The process of hyperspectral acquisition was as follows:

  1. Power on the computer, camera and light source, and preheat for 15 min.

  2. Collect the current black-and-white board correction data to eliminate the influence of ambient light, noise, and camera wavelength deviation. The black-and-white board correction is given by formula 1.

$${R_0}=\frac{{{R_{raw}} - {R_{dark}}}}{{{R_{white}} - {R_{dark}}}}$$
(1)

where R0 represents the value of the corrected hyperspectral data, Rraw represents the original data, Rdark represents the dark current data, and Rwhite represents the whiteboard data.

  3. Acquire the hyperspectral images and store the data.
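The black-and-white board correction of formula 1 can be sketched in a few lines. The following is a minimal illustration rather than the acquisition software used in this work; the function name `calibrate_reflectance`, the `eps` guard, and the toy intensity values are assumptions introduced for the example.

```python
import numpy as np

def calibrate_reflectance(raw, white, dark, eps=1e-12):
    """Apply formula 1: R0 = (Rraw - Rdark) / (Rwhite - Rdark), element-wise."""
    raw = np.asarray(raw, dtype=float)
    white = np.asarray(white, dtype=float)
    dark = np.asarray(dark, dtype=float)
    denom = white - dark
    # Mark pixels/bands where the white and dark references coincide as NaN
    # instead of dividing by zero (an assumed safeguard, not from the paper).
    denom = np.where(np.abs(denom) < eps, np.nan, denom)
    return (raw - dark) / denom

# Toy cube of 2 x 2 pixels and 3 bands; the intensity values are made up.
dark = np.full((2, 2, 3), 100.0)
white = np.full((2, 2, 3), 4100.0)
raw = np.full((2, 2, 3), 2100.0)
corrected = calibrate_reflectance(raw, white, dark)
```

In this toy case every element of `corrected` equals (2100 - 100) / (4100 - 100) = 0.5.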

Spectral data extraction

To simplify the extraction process of hyperspectral image data and achieve batch processing of hyperspectral data, this work developed a preprocessing platform for hyperspectral data using Visual Basic (Version 6.0, Microsoft, USA). By loading the generated sampling point files into SpectralView software (Headwall Photonics, USA), spectral data can be collected efficiently from each pixel within the region of interest (ROI) of the samples under examination, and the acquired data can be preprocessed.

The ellipse model is used to determine the coordinates of the ROI center, the length of the X/Y half-axis (a, b), and the distance between the X/Y axes (Δx and Δy are set to 1). In accordance with the principle of “left to right, top to bottom” in the target image, pixel points in ROI are sequentially collected according to formula 2 to generate ROI coordinate matrix. The ROI region selected for this experiment contains more than 15,000 pixels.

$$\frac{{x_{i}^{2}}}{{{a^2}}}+\frac{{y_{i}^{2}}}{{{b^2}}} \leqslant 1$$
(2)

SpectralView software was used to import images and actively extract the reflectivity information according to the coordinate matrix. The average reflectance of all pixels is calculated by formula 3. The calculated average reflectance will serve as the foundational dataset for subsequent data processing.

$$\bar{A}_{k} = \frac{1}{n}(A_{1,k} + A_{2,k} + A_{3,k} + \cdots + A_{n,k} ) = \frac{1}{n}\sum\limits_{{i = 1}}^{n} {A_{{i,k}} }$$
(3)

where Āk represents the average reflectance at band k, Ai,k is the reflectance of the i-th ROI pixel at band k, and n denotes the number of pixels in the ROI.
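The ROI selection of formula 2 and the averaging of formula 3 can be illustrated as follows. This is a sketch under stated assumptions: the ellipse is axis-aligned, the coordinates in formula 2 are taken relative to the ROI center, the step is one pixel, and the cube is synthetic. The function names and the center and half-axis values are hypothetical and do not come from the platform described above.

```python
import numpy as np

def ellipse_roi_coords(cx, cy, a, b):
    """Collect (row, col) pixels inside an axis-aligned ellipse (formula 2),
    scanning left to right within a row and rows from top to bottom."""
    coords = []
    for y in range(cy - b, cy + b + 1):        # top to bottom
        for x in range(cx - a, cx + a + 1):    # left to right
            if (x - cx) ** 2 / a ** 2 + (y - cy) ** 2 / b ** 2 <= 1:
                coords.append((y, x))
    return np.array(coords)

def mean_roi_spectrum(cube, coords):
    """Average the reflectance of all ROI pixels band by band (formula 3)."""
    return cube[coords[:, 0], coords[:, 1], :].mean(axis=0)

# Synthetic cube: 40 rows x 50 columns x 5 bands.
rng = np.random.default_rng(0)
cube = rng.random((40, 50, 5))
coords = ellipse_roi_coords(cx=25, cy=20, a=10, b=8)
spectrum = mean_roi_spectrum(cube, coords)
```

The result `spectrum` holds one mean reflectance value per band, which corresponds to one row of the dataset used in later processing.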

Chemical determination of nutrient contents

After spectral data collection was completed, the 279 samples were pulverized using a high-speed pulverizer (750T, Bo ou Hardware Products Co., Ltd., Yongkang City, China), passed through an 80-mesh sieve, and then sealed and stored as raw material samples for component content analysis. The contents of crude protein, tannin, and crude fat were quantitatively determined in this experiment. To minimize potential measurement errors associated with chemical analysis methods, all samples were processed by a single experimenter to ensure consistency. Each sample was measured three times and the mean value was taken. In addition, laboratory temperature and instrument calibration were strictly controlled throughout the testing procedure.

Crude protein contents determination

The crude protein content was determined using the Kjeldahl method20,21, employing an automated Kjeldahl nitrogen analyzer (K1100, Shandong Hanon Scientific Instrument Co., Ltd., China) for analysis. Organic nitrogen compounds are digested with concentrated sulfuric acid and converted to ammonium sulfate. The ammonium sulfate then reacts with a strong base, liberating ammonia, which is distilled and absorbed into an excess of standard inorganic acid solution. The ammonia is quantified by titration with standard alkali, enabling the calculation of the crude protein content of the sample.

Tannin contents determination

Tannin determination involves the extraction of sorghum tannins with dimethyl sulfoxide, followed by measurement using the ferric ammonium citrate colorimetric method and analysis performed using a UV-visible spectrophotometer (752 N, Shanghai Yidi Analytical Instrument Co., Ltd., China)22,23.

Crude fat contents determination

The crude fat content was determined using the JTONE series fat analyzer (Soxhlet extractor), which also includes a defatting step for the sample24. The working process of the Soxhlet extractor mainly consists of three parts: heating extraction, solvent recovery, and cooling. To improve defatting efficiency, the sample must be repeatedly soaked and extracted during the extraction process. During operation, the heating temperature can be adjusted in real time according to changes in the reagent’s boiling point and the ambient temperature.

Spectral data preprocessing and data set division

Spectral data preprocessing

To eliminate errors introduced during spectral measurement, enhance the signal-to-noise ratio of the data further, and improve the correlation between reflectance data and physicochemical indicators, preprocessing of the spectral data is conducted. When preprocessing spectral data, the procedures include consecutive application of Standard Normal Variate (SNV), Detrending, and Multiplicative Scatter Correction (MSC). SNV is used to eliminate the effects of grain glossiness, surface scattering, and background interference on reflectance spectra. Detrending is employed to remove baseline drift caused by diffuse reflection. MSC is utilized to eliminate spectral variations due to environmental light scattering during the measurement process, enhancing spectral information related to chemical composition. This correction adjusts baseline shifts and offsets in spectral data, thereby improving the correlation between spectra and data25,26,27,28.

By combining these three preprocessing methods, the differences in spectral intensity between different samples are eliminated, intrinsic sample characteristics are highlighted, differentiation among various features is enhanced, data stability and reliability are improved, data resolution is increased, and spectral data clarity is enhanced. This lays the foundation for establishing detection models in subsequent steps.
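The consecutive application of SNV, Detrending, and MSC can be sketched as below. This is an illustration only, not the Matlab routines used in this work: the second-order polynomial baseline for detrending and the use of the mean spectrum as the MSC reference are common choices assumed here, and the spectra are synthetic.

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: centre and scale each spectrum (row) on its own."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def detrend(X, wavelengths, degree=2):
    """Subtract a polynomial baseline fitted to each spectrum (degree is assumed)."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(wavelengths, dtype=float)
    w = (w - w.mean()) / w.std()                 # rescale axis for a stable fit
    V = np.vander(w, degree + 1)                 # columns: w^degree ... w^0
    coef, *_ = np.linalg.lstsq(V, X.T, rcond=None)
    return X - (V @ coef).T

def msc(X, reference=None):
    """Multiplicative Scatter Correction against a reference spectrum."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        slope, intercept = np.polyfit(ref, row, 1)   # row ~ slope * ref + intercept
        out[i] = (row - intercept) / slope
    return out

# Synthetic spectra: one shared shape with sample-specific scale, offset and noise.
rng = np.random.default_rng(0)
wavelengths = np.linspace(430, 900, 120)
base = 0.5 + 0.2 * np.sin(wavelengths / 60.0) + 0.1 * np.exp(-((wavelengths - 670) / 15.0) ** 2)
scale = rng.uniform(0.8, 1.2, size=(10, 1))
offset = rng.uniform(-0.05, 0.05, size=(10, 1))
X_clean = scale * base + offset
X_raw = X_clean + rng.normal(0, 0.002, size=(10, 120))

X_snv = snv(X_raw)
X_dt = detrend(X_snv, wavelengths)
X_pre = msc(X_dt)
```

On the noise-free spectra `X_clean`, MSC alone maps every row onto the mean spectrum, which shows how purely multiplicative and additive scatter effects are removed.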

Data set division

This work adopts the SPXY algorithm to partition the sample set based on joint x-y distances, dividing it into calibration and prediction sets in a 3:1 ratio. The calibration set is used for building the model and cross-validation in spectral data modeling with component content, while the prediction set is used to evaluate the predictive performance of the model.
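The SPXY partition can be sketched as a Kennard-Stone style selection on a joint distance, where the spectral and reference-value distances are each normalized by their maximum before being summed. The code below is a minimal sketch of that general idea on synthetic data, not the implementation used in this work; the function names and the toy dimensions are hypothetical.

```python
import numpy as np

def pairwise_dist(A):
    """Euclidean distances between all rows of A."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.sqrt(np.maximum(d2, 0.0))

def spxy_split(X, y, n_cal):
    """Return (calibration indices, prediction indices) chosen on joint x-y distance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    dx, dy = pairwise_dist(X), pairwise_dist(y)
    d = dx / dx.max() + dy / dy.max()            # normalized joint distance
    # Start from the two most distant samples.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_cal:
        # For each candidate, distance to its nearest already-selected sample.
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected), np.array(remaining)

# Synthetic example: 40 samples, 6 bands, split 3:1.
rng = np.random.default_rng(0)
X = rng.random((40, 6))
y = X @ rng.random(6) + rng.normal(0, 0.01, size=40)
cal_idx, pred_idx = spxy_split(X, y, n_cal=30)
```

With 279 samples and a 3:1 ratio, `n_cal` would be set to 209, leaving 70 samples for prediction.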

Characteristic variables extraction

Algorithm of characteristic variables extraction

An extraction algorithm can be employed to transform the original data into a feature set that possesses enhanced representativeness and separability, thereby reducing data complexity and improving model performance. This experiment utilized three extraction algorithms, namely Competitive Adaptive Reweighted Sampling (CARS), Bootstrapping Soft Shrinkage (BOSS), and Iteratively Retains Informative Variables (IRIV).

The CARS method simulates Darwin’s theory of “survival of the fittest” by employing adaptive re-weighted sampling techniques29,30. It strategically selects wavelengths with relatively large absolute regression coefficients, which are calculated using partial least squares analysis. Additionally, it utilizes cross-validation to identify the combination set that yields the lowest root mean square error. Due to the inherent stochastic nature of algorithmic computations, it is imperative to conduct multiple iterations to retain bands that exhibit higher selection frequencies, thereby augmenting the stability of subsequent regression models. So, the Monte Carlo sampling procedure was repeated 100 times.
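One ingredient of CARS that can be shown compactly is the exponentially decreasing function that fixes how many wavelengths survive each sampling run. The sketch below assumes the form commonly given in the original CARS description, with all variables kept in the first run and two in the last; this work does not state the schedule, so the formula and the function name are assumptions. The adaptive reweighted sampling and PLS steps are omitted.

```python
import math

def cars_retained_counts(n_vars, n_runs):
    """Number of wavelengths kept after each run: ratio r_i = a * exp(-k * i),
    with r_1 = 1 (all variables) and r_N = 2 / n_vars (two variables)."""
    k = math.log(n_vars / 2) / (n_runs - 1)
    a = (n_vars / 2) ** (1 / (n_runs - 1))
    return [max(2, round(n_vars * a * math.exp(-k * i))) for i in range(1, n_runs + 1)]

# 646 modeling bands and 100 Monte Carlo sampling runs, as in this work.
counts = cars_retained_counts(n_vars=646, n_runs=100)
```

The list `counts` starts at 646 and shrinks monotonically to 2, so most of the elimination happens in the early runs and the later runs refine a small candidate set.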

BOSS combines the bootstrap method and flexible shrinkage to address model instability or inaccurate prediction results caused by collinear variables. By statistically analyzing the regression coefficients of multiple sub-models, variable importance is determined based on the absolute value of the coefficients. Variables with larger proportions have a higher probability of selection, thereby evaluating the importance of each variable. Using Weighted Boot Sampling (WBS) for gradual correction and optimization of variable proportions simplifies and reduces the variable space, effectively selecting the variables that have the greatest impact on the predictive performance of the model. The optimal sub-model, which has the smallest root mean square error in the cross-validation set, is selected from the cluster of sub-models to ensure that the model possesses good generalization ability and prediction accuracy, thus obtaining an optimal variable set31,32. The optimal model ratio is set at 0.1, with 1000 iterations of WBS.

IRIV follows the concept of model clustering analysis and fully considers the joint effects among variables. It randomly combines variables to generate a binary matrix, in which each row represents a variable combination and each column represents a variable. A separate partial least squares model is established for each row, and the effectiveness of these models is evaluated using the cross-validation root mean square error26,33. The variables are then categorized into four groups: strongly informative, weakly informative, uninformative, and interfering variables. The selection of retained variables is determined by the Difference of Mean Values (DMEAN) and the P-value. The former is the difference between the mean root-mean-square errors obtained when the subsets contain a given wavelength variable and when they do not, while the latter is the P-value obtained by the Mann-Whitney U test. After executing the algorithm through multiple iterations, uninformative and interfering variables are excluded. Then, reverse elimination is performed to extract the best feature bands from the strongly and weakly informative variables. The Mann-Whitney U test is conducted at a significance level of 0.05.

Combined use of characteristic variables extraction algorithms

By employing a combination of extraction algorithms, the dimensionality of the dataset can be further reduced, thereby facilitating the accentuation of feature bands’ significance, alleviating overfitting concerns, and providing a more comprehensive and accurate depiction of data characteristics. Consequently, this augmentation enhances model accuracy, computational efficiency, and generalization capability, rendering it better suited for accommodating novel data.

In this work, a two-step strategy was adopted for characteristic wavelength extraction. First, the CARS and BOSS algorithms were used to extract variables correlated with crude protein, tannin, and crude fat from the entire spectrum, forming a coarse subset. Then, the IRIV algorithm was applied to the coarse subset to form a refined subset. CARS and BOSS are based on the concept of model ensembles and select the optimal subset of characteristic wavelengths based on RMSECV values. As wavelengths irrelevant to the chemical composition of sorghum are excluded, the RMSECV value decreases, and the subset with the minimum RMSECV value is chosen as the result of the variable selection. By iteratively eliminating variables from this preliminary screening, a finely selected set of feature wavelengths exhibiting strong correlations was identified, thereby enhancing the model’s robustness and stability.

Nutrient contents detection models and evaluation indexes

Nutrient contents detection models

Partial Least Squares (PLS) is a chemometric method that combines multiple linear regression and principal component analysis. Its essence lies in performing regression after eliminating collinearity in spectra, making it suitable for handling linear relationships between multiple variables, including cases with multiple independent and dependent variables34,35. Its main objective is to identify the component directions that maximize the covariance between the predictor variables and the response variables, yielding new composite variables. The model was initialized with 10 latent variables, and 5-fold cross-validation was used to determine the optimal number of latent variables.
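The choice of the number of latent variables by 5-fold cross-validation can be sketched as follows. This is a hedged illustration using a plain NIPALS PLS1 on synthetic data, not the Matlab code used in this work; the function names, the random fold assignment, and the toy data are assumptions.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-response PLS by NIPALS; returns (x_mean, y_mean, coef)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)          # weight vector: direction of max covariance
        t = Xr @ w                      # scores
        tt = t @ t
        p = Xr.T @ t / tt               # X loadings
        c = yr @ t / tt                 # y loading
        Xr = Xr - np.outer(t, p)        # deflate X and y
        yr = yr - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean

def select_n_components(X, y, max_components=10, n_folds=5, seed=0):
    """Return the number of latent variables with the lowest RMSECV, and all RMSECV values."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    rmsecv = []
    for a in range(1, max_components + 1):
        sq_err = 0.0
        for fold in folds:
            train = np.setdiff1d(np.arange(len(y)), fold)
            model = pls1_fit(X[train], y[train], a)
            sq_err += np.sum((pls1_predict(model, X[fold]) - y[fold]) ** 2)
        rmsecv.append(float(np.sqrt(sq_err / len(y))))
    return int(np.argmin(rmsecv)) + 1, rmsecv

# Synthetic data driven by three latent factors.
rng = np.random.default_rng(1)
T = rng.normal(size=(60, 3))
X = T @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(60, 20))
y = T @ np.array([1.0, -0.5, 0.8]) + 0.05 * rng.normal(size=60)
best_a, rmsecv = select_n_components(X, y)
```

Here `best_a` is the number of latent variables with the smallest RMSECV among the 10 candidates, mirroring the selection rule described above.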

Back Propagation (BP) neural network is a multi-layer feedforward network that utilizes the error backpropagation algorithm for training. It comprises an input layer, hidden layers, and an output layer. Adjacent layers are interconnected through weighted connections, with the input layer receiving external input data, the hidden layers processing data to extract features, and the output layer generating final prediction results. Each node possesses corresponding weight and bias parameters, while nodes within each layer operate independently36. The network iteratively performs forward propagation and backpropagation on training samples, gradually adjusting the weights and biases to minimize the discrepancy between the network’s output results and actual values. This approach enables efficient data modeling and prediction, while also showcasing strong capabilities in non-linear mapping, self-learning, generalization, and fault tolerance.

Extreme Learning Machine (ELM) is a feedforward neural network-based machine learning algorithm, comprising an input layer, hidden layers, and an output layer. It is applicable to both supervised and unsupervised learning problems. Compared with traditional neural network models, the ELM model has the advantages of faster training speed and better generalization performance37. The parameters nhid, actfun, and init_weights were optimized by minimizing the Root Mean Square Error (RMSE).
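The training speed of ELM comes from fixing random input weights and solving only the output weights by least squares. The sketch below assumes a single hidden layer with a sigmoid activation and uniform random initialization; these choices, the function names, and the toy data are assumptions for illustration, with `n_hidden` playing the role of nhid.

```python
import numpy as np

def _hidden(X, W, b):
    """Sigmoid activations of the hidden layer."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def elm_fit(X, y, n_hidden=20, seed=0):
    """Train an ELM: random fixed hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, size=(X.shape[1], n_hidden))   # never updated
    b = rng.uniform(-1, 1, size=n_hidden)
    H = _hidden(X, W, b)
    beta = np.linalg.pinv(H) @ y        # Moore-Penrose solution for output weights
    return W, b, beta

def elm_predict(model, X):
    W, b, beta = model
    return _hidden(X, W, b) @ beta

# Synthetic regression problem with a mild non-linearity.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
model = elm_fit(X, y, n_hidden=20)
rmse_train = float(np.sqrt(np.mean((elm_predict(model, X) - y) ** 2)))
```

Because no iterative weight updates are needed, tuning reduces to trying different hidden-layer sizes, activations, and initializations and keeping the setting with the lowest RMSE.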

Evaluation indexes

The performance of the models was evaluated using the coefficient of determination (R2), the root mean square error (RMSE), and the relative analysis error (RPD), calculated by formulas 7–9.

$${R^2}=1 - \frac{{\sum\limits_{{i=1}}^{n} {{{\left( {{Y_{Pi}} - {Y_i}} \right)}^2}} }}{{\sum\limits_{{i=1}}^{n} {{{\left( {{Y_i} - \bar {Y}} \right)}^2}} }}$$
(7)
$$RMSE=\sqrt {\frac{{\sum\limits_{{i=1}}^{n} {{{\left( {{Y_{Pi}} - {Y_i}} \right)}^2}} }}{n}}$$
(8)
$$RPD=\frac{1}{{\sqrt {1 - {R^2}} }}$$
(9)

where YPi is the predicted value of the i-th sample, Yi is the measured value, Ȳ is the mean of the measured values, and n is the number of samples.

Generally, prediction ability is graded as follows: an R2 value < 0.6 indicates poor model performance, an R2 value of 0.6–0.8 indicates good performance, and an R2 value > 0.8 indicates very good performance. A lower RMSE value indicates better prediction accuracy. An RPD value < 1.5 indicates that the model performs poorly, an RPD of 1.5–2 indicates that the model is helpful for estimation, and an RPD value > 2 indicates excellent prediction38,39,40.
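Formulas 7–9 translate directly into a few lines of code. The following sketch uses made-up measured and predicted values purely to show the arithmetic; the function name is hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (R2, RMSE, RPD) following formulas 7-9."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))
    rpd = 1.0 / np.sqrt(1.0 - r2)       # formula 9; undefined when R2 >= 1
    return float(r2), float(rmse), float(rpd)

# Made-up values (in %), for illustration only.
y_true = [8.0, 9.0, 10.0, 11.0, 12.0]
y_pred = [8.5, 8.5, 10.0, 11.5, 11.5]
r2, rmse, rpd = regression_metrics(y_true, y_pred)
```

For these values the residual sum of squares is 1.0 and the total sum of squares is 10.0, so R2 = 0.9, RMSE ≈ 0.447, and RPD ≈ 3.16, which falls in the "excellent prediction" range defined above.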

When extracting characteristic variables and constructing detection models, each algorithm is executed 50 times to ensure robustness, and the optimal set of feature variables and model results are selected based on rigorous evaluation metrics. Matlab software (Ver. 2018a, MathWorks, Natick, MA, USA) was used to process and analyze the data.

Results

Nutrient contents analysis

A total of 279 samples were collected from 93 sorghum varieties, and the contents of crude protein, tannin, and crude fat of each variety were obtained by averaging the values of its samples. The results of the chemical analysis are presented in Fig. 1, which shows that the average contents of crude protein, tannin, and crude fat were 9.45%, 1.15%, and 3.88%, respectively. These findings are consistent with previous reports on the composition of sorghum. Specifically, the crude protein content ranged from 7.21 to 14.39%, the tannin content ranged from 0.04 to 2.56%, and the crude fat content ranged from 2.67 to 5.93%. These ranges indicate significant variations in the nutritional content among different varieties of sorghum. Furthermore, the standard deviation (SD) values of crude protein, tannin, and crude fat content were 1.53, 0.04, and 0.56, respectively1,34. This reflects the stability and representativeness of the sample data selected for this work. In summary, the 279 samples selected in this experiment are representative, thereby establishing a robust data foundation for further investigation into the characteristics and nutritional components of sorghum varieties.

Fig. 1

The chemical analysis results of crude protein, tannin, and crude fat content in 93 sorghum varieties. (a) The box plots and normal distribution of crude protein, tannin, and crude fat in 279 sorghum samples. (b) The differences in crude protein, tannin, and crude fat content among 93 varieties of sorghum grains.

Spectral characteristic analysis

Figure 2a describes the spectral curves of 279 sorghum samples in the VIS-NIR spectral range. Different sorghum varieties exhibit significant differences in spectral reflectance in the VIS range, mainly due to variations in seed color, which significantly alter the absorption and reflection characteristics of light. For instance, certain sorghum samples exhibit distinctive spectral troughs within the range of 650–680 nm, which can be attributed to spectral variations arising from the presence of dark chromophores in these grains. In the NIR range, the spectral shapes of various sorghum varieties exhibit a high degree of similarity, while their reflectance values differ. These differences in spectral characteristics reflect variations in nutrient composition among different sorghum varieties. Therefore, the spectral differences provide an important foundation for analyzing and detecting sorghum components, offering strong support and guidance for further exploration of its chemical composition and potential applications. The spectral analysis technology can effectively distinguish and evaluate the nutritional characteristics of different sorghum varieties, which is of great significance for further research on and utilization of the characteristics of sorghum varieties. Through the overall observation of the spectrum of all samples, the spectral curves showed a consistent trend of initially increasing and then decreasing, with no apparent abnormal samples. Then, the spectral curves of the original data were processed using SNV, Detrending, and MSC preprocessing algorithms, as shown in Fig. 2b–d.

Fig. 2

The pretreatment of spectral data for sorghum samples; (a) Raw spectra. (b) Spectral pre-processing by SNV; (c) Spectral pre-processing by SNV and Detrending; (d) Spectral pre-processing by SNV, Detrending and MSC.

Taking the tannin as an example, the PLS model was established by comparing three pretreatment methods, as shown in Table 1. The results show that the combination of the three methods can effectively improve the prediction accuracy of the model. By combining these three preprocessing methods, the spectral differences caused by grain glossiness, surface scattering, background interference, baseline drift, and environmental light scattering were effectively eliminated, while enhancing the spectral information related to chemical composition content. This preprocessing enhances the quality and interpretability of spectral data, thereby contributing to improved accuracy in subsequent modeling.

Table 1 Results of the PLS model established for tannin content by spectral pretreatment method.

Additionally, employing the SPXY algorithm, a total of 279 samples were partitioned into two subsets at a ratio of 3:1. One subset was designated as the calibration set, comprising 209 samples (75% of the total), while the remaining subset served as the prediction set, consisting of 70 samples (25% of the total). The data distribution of both the calibration set and prediction set exhibits a relatively uniform pattern, characterized by closely aligned mean values. This indicates that the partitioning of this dataset is reasonable and effective, thereby mitigating potential biases in data distribution, as shown in Table 2. By conducting data preprocessing and implementing a reasonable division of the dataset, it establishes a solid foundation for analyzing and predicting the chemical composition of sorghum, ensuring the reliability and accuracy of the work.

Table 2 The set partitioning results after using the optimal preprocessing method.

Characteristic variables analysis

Figure 3 illustrates the variation in RMSECV as the number of sampling runs increases during the extraction of feature wavelengths for crude protein, tannin, and crude fat using the CARS algorithm. As the number of Monte Carlo random sampling runs increases, the RMSECV values first decrease and then increase. During the cross-validation process, the RMSECV for crude protein reached its minimum value in the validation set after 41 iterations. A total of 63 wavelengths were selected, accounting for 9.75% of all bands. At this point, the RCV2 and RMSECV were 0.73 and 0.80%, respectively. For tannin, the RMSECV reached its minimum value in the validation set after 45 iterations. A total of 50 wavelengths were selected, accounting for 7.74% of all bands. At this point, the RCV2 and RMSECV were 0.85 and 0.29%, respectively. For crude fat, the RMSECV reached its minimum value in the validation set after 51 iterations. A total of 35 wavelengths were selected, accounting for 5.42% of all bands. At this point, the RCV2 and RMSECV were 0.43 and 0.40%, respectively.

Fig. 3

The change of RMSECV value with the increase of iterations during the key wavelength extraction process of crude protein, tannin, and crude fat using the CARS algorithm. (a) Crude protein; (b) Tannin; (c) Crude fat.

Figure 4 illustrates the variation in RMSECV as the number of iterations increases during the extraction of feature wavelengths for crude protein, tannin, and crude fat using the BOSS algorithm. As the number of iterations increases, the RMSECV values first decrease and then increase. During the cross-validation process, the RMSECV for crude protein reached its minimum value in the validation set after 6 iterations. A total of 73 wavelengths were selected, accounting for 11.30% of all bands. At this point, the RCV2 and RMSECV were 0.70 and 0.85%, respectively. For tannin, the RMSECV reached its minimum value in the validation set after 6 iterations. A total of 61 wavelengths were selected, accounting for 9.44% of all bands. At this point, the RCV2 and RMSECV were 0.85 and 0.29%, respectively. For crude fat, the RMSECV reached its minimum value in the validation set after 9 iterations. A total of 30 wavelengths were selected, accounting for 4.64% of all bands. At this point, the RCV2 and RMSECV were 0.41 and 0.41%, respectively.

Fig. 4

Change in RMSECV with increasing iterations during key wavelength extraction for crude protein, tannin, and crude fat using the BOSS algorithm. (a) Crude protein; (b) Tannin; (c) Crude fat.
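For both CARS and BOSS, the coarse subset is the one found at the iteration where the RMSECV curve reaches its minimum. The following minimal Python sketch shows that selection rule and the band-share arithmetic. The curve values are hypothetical, the function names are illustrative, and the total of 646 bands is an assumption inferred from the reported figures (63 wavelengths corresponding to 9.75% of all bands), not a value stated in this section.

```python
# Minimal sketch: pick the iteration with the lowest RMSECV and express the
# size of the selected subset as a share of all spectral bands.

N_TOTAL_BANDS = 646  # assumption: inferred from 63 wavelengths = 9.75% of bands


def best_iteration(rmsecv_curve):
    """Return (1-based iteration number, RMSECV) at the minimum of the curve."""
    best = min(range(len(rmsecv_curve)), key=lambda i: rmsecv_curve[i])
    return best + 1, rmsecv_curve[best]


def band_share(n_selected, n_total=N_TOTAL_BANDS):
    """Percentage of all bands retained, rounded to two decimals."""
    return round(100.0 * n_selected / n_total, 2)


# Hypothetical RMSECV curve that first decreases and then increases.
curve = [1.10, 0.95, 0.86, 0.81, 0.80, 0.83, 0.90]
iteration, rmsecv = best_iteration(curve)
print(f"minimum RMSECV {rmsecv} at iteration {iteration}")

# Subset sizes reported for CARS and BOSS (crude protein, tannin, crude fat).
for label, sizes in {"CARS": (63, 50, 35), "BOSS": (73, 61, 30)}.items():
    print(label, [band_share(n) for n in sizes])
```

Under the 646-band assumption, the computed shares reproduce the percentages reported for both algorithms.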

The steps above established the coarse subset, to which the IRIV algorithm was then applied to form a refined subset. First, the mean root-mean-square error was calculated with and without each wavelength, and a Mann-Whitney U test (P = 0.05) was performed. Second, non-informative variables were eliminated, while both strongly and weakly informative variables were retained. Finally, the optimal subset of variables was determined through cross-validation using the reverse elimination method. The variable classification results for the subsets are shown in Fig. 5. The optimized variable subsets are as follows.

The crude protein variable subsets derived from CARS and BOSS were further refined to 41 and 36 variables, respectively, accounting for 6.35% and 5.57% of the total wavelength bands. Regarding the sensitive wavelengths screened for crude protein, the range of 434–522 nm may be attributed to the interaction between light and the π-electron system of aromatic amino acids (e.g., phenylalanine, tryptophan) within protein molecules. The aromatic ring structures of these amino acids are sensitive to light within this wavelength range. The range of 540–546 nm may be associated with resonance absorption by certain π-electron conjugated structures in proteins, potentially linked to aromatic amino acids. The range of 573–601 nm is sensitive to the secondary structures of proteins, such as α-helices and β-sheets, where light at these wavelengths interacts with vibrations or electronic transitions occurring within these structural elements. The range of 610–730 nm can be ascribed to the vibrational modes of amino acid residues within protein molecules, specifically the vibrations of amine (NH) and carboxyl (COOH) groups. The range of 761–899 nm can be ascribed to the interaction between light and the electronic transitions of larger molecules, along with molecular vibrational modes within protein molecules at longer wavelengths. These modes may involve long-chain amino acid residues or overall conformational changes of the protein. Qiao et al. found that corn is sensitive to light at 400–700 nm and 910 nm, which is related to the C-H vibrations of proteins41. Fatemi et al. indicated that crude protein is sensitive to light at 760 nm, which is related to the vibration of nitrogen elements11. Wang et al. found that the characteristic wavelengths of rice lie near 500 nm42. These results support the validity of the characteristic wavelengths selected for sorghum crude protein in this work.

The tannin variable subsets derived from CARS and BOSS were further refined to 44 and 38 variables, respectively, accounting for 6.81% and 5.88% of the total wavelength bands. Regarding the sensitive wavelengths of tannin, the range of 475–600 nm may match the π-π* or n-π* transitions of tannin molecules, resulting in related absorption. The range of 612–690 nm may interact with the C-H or C-O vibration modes in tannin molecules, thereby triggering associated absorption. The range of 745–810 nm may be attributed to interactions such as hydrogen bonding and π-π stacking within tannin molecules, which regulate the electron distribution and charge state of the molecules and thereby influence their absorption characteristics. The range of 829–897 nm may arise because tannin is a polyphenolic compound whose chemical structure includes functional groups such as hydroxyl, carboxyl, and acyl groups. These structural features result in varying absorption characteristics across different wavelengths of light. In their investigation of plant tannin, Zhang et al. identified characteristic spectral ranges at approximately 500–510 nm and 680–690 nm43, which closely resemble the tannin characteristic wavelengths selected in this work.

The crude fat variable subsets derived from CARS and BOSS were further refined to 27 and 22 variables, respectively, accounting for 4.18% and 3.41% of the total wavelength bands. Regarding the sensitive wavelengths of crude fat, the wavelength of 447 nm may be associated with the shorter π-π* transitions in fatty acids, particularly near the non-conjugated structures adjacent to double bonds. The absorption at 545 nm and 563 nm may be attributed to the π-π* transitions in fatty acid molecules and to absorption associated with conjugated double bonds. The wavelength of 588 nm may be associated with absorption related to the long-chain structure and conjugated system in fatty acids. The wavelengths of 631 nm and 632 nm may be associated with the C-H stretching vibrations in the fatty acid chains. The wavelengths of 649 nm and 650 nm may be associated with the vibrational absorption of C=O stretching in fatty molecules or other functional groups. The wavelengths of 668 nm and 671 nm may be related to spectral features of aromatic rings or other conjugated structures in fatty molecules. The wavelengths of 729 nm and 750 nm may be associated with the resonant absorption of long-chain structures in fatty acid molecules, and 785 nm and 795 nm may be associated with features such as C-H bending vibrations in fatty acid molecules. Wavelengths in the range of 800–900 nm may reflect complex absorption patterns in fatty acid chains, including conjugation and long-chain effects. Xue et al. extracted characteristic spectra of crude fat primarily at wavelengths around 450, 500, 680, 800, and 1000 nm44, which aligns with the findings of our work.

The adopted coarse-refined characteristic variable extraction strategy effectively reduces the number of variables while preserving the crucial informative wavelengths associated with crude protein, tannin, and crude fat content. This approach provides an optimized foundation for subsequent analysis and modeling. In summary, the coarse subset obtained by the CARS and BOSS algorithms is devoid of interference variables; however, the IRIV selection process shows that some of its variables may carry no useful information. This demonstrates the advantages of CARS and BOSS in the characteristic extraction process. The final variable set, named the refined subset, is obtained through reverse elimination from the coarse subset. The refined subset includes only strongly and weakly informative wavelengths, which are distinguished according to the DMEAN and P-value in the IRIV algorithm. A weak feature wavelength is a wavelength point in spectral data, or in any other form of high-dimensional data, that has weak signal strength and limited direct predictive power for the target variable compared with other wavelengths. Although the predictive power of these wavelength points alone may not be strong, they often contain important information that is indirectly related to the target variable or that interacts with strongly informative wavelengths. Ignoring these weak feature wavelengths can therefore lead to information loss, which in turn affects the comprehensiveness and accuracy of the model. On the one hand, this demonstrates that the refined extraction strategy can further reduce the number of variables and enhance the accuracy of the detection model. On the other hand, it underscores the crucial role of weak characteristic wavelengths in constructing a robust detection model.

Fig. 5

Distribution of strongly informative, weakly informative, uninformative, and interfering wavelengths among the key wavelengths of crude protein, tannin, and crude fat in sorghum grain spectra. (a) Crude protein; (b) Tannin; (c) Crude fat.
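The four wavelength categories rest on two quantities: the difference in mean error with and without a wavelength (DMEAN) and the P-value of a Mann-Whitney U test. The Python sketch below illustrates one way such a classification could be written. It is not the authors' implementation. The sign convention (DMEAN is the mean error with the wavelength minus the mean error without it), the normal approximation for the P-value, the function names, and the example error values are all assumptions made for illustration.

```python
import math


def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney U p-value using the normal approximation.

    Simplification (assumption): no tie or continuity correction in the variance.
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n = len(pooled)
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to runs of tied values
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n1, n2 = len(a), len(b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_a - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


def classify_wavelength(err_with, err_without, alpha=0.05):
    """Label a wavelength from cross-validation errors with and without it."""
    dmean = sum(err_with) / len(err_with) - sum(err_without) / len(err_without)
    significant = mann_whitney_p(err_with, err_without) < alpha
    if dmean < 0:  # including the wavelength lowers the error
        return "strong" if significant else "weak"
    return "interfering" if significant else "uninformative"


# Hypothetical error values, for illustration only.
clear_gain = ([0.80, 0.81, 0.79, 0.82, 0.80, 0.78, 0.81, 0.79],
              [0.90, 0.91, 0.89, 0.92, 0.90, 0.88, 0.91, 0.89])
small_gain = ([0.80, 0.82, 0.84, 0.86], [0.81, 0.83, 0.85, 0.87])
print(classify_wavelength(*clear_gain))
print(classify_wavelength(*small_gain))
```

In this sketch, a wavelength whose inclusion lowers the error without reaching significance is labeled weak rather than discarded, which mirrors the retention of weakly informative wavelengths in the refined subset.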

Comparison of detection models

Performances of detection models based on whole spectral data and characteristic spectral data

The PLS, BP, and ELM modeling methods were first employed to establish detection models for crude protein, tannin, and crude fat based on the whole spectral data. Separate models were then established based on the coarse subset and the refined subset. The results of these detection models are presented in Table 3.

Table 3 Results of the detection models for quantifying crude protein, tannin, and crude fat content in sorghum samples.

The results of the CARS-PLS detection model are taken as an illustrative example. In terms of the number of modeling variables, the CARS subset contained 90.25%, 92.26%, and 94.58% fewer variables than the full-spectrum set (Raw) for crude protein, tannin, and crude fat, respectively. In terms of model performance, the Rp2 of crude protein and crude fat increased by 0.02 and 0.11, the RPDp increased by 0.06 and 0.09, and the RMSEp decreased by 0.03% and 0.04%, respectively. This indicates that the characteristic-spectrum model outperformed the full-spectrum model for these two components. For the tannin model, however, the Rp2 and RPDp decreased by 0.02 and 0.10 and the RMSEp increased by 0.01%, so the characteristic-spectrum model showed a slight decline in performance. A possible reason is that variables strongly correlated with tannin were excluded when CARS extracted the characteristic variables. After the IRIV algorithm was applied, the number of modeling variables was reduced further and model performance improved further. Compared with the CARS subset, the CARS-IRIV subset contained 34.92%, 12.00%, and 22.86% fewer variables for crude protein, tannin, and crude fat, respectively. The Rp2 increased by 0.002, 0.01, and 0.03, the RPDp increased by 0.01, 0.14, and 0.03, and the RMSEp decreased by 0.002%, 0.01%, and 0.01%, respectively. This indicates that the detection model obtained after refined extraction is more lightweight and more accurate. The detection models established using BP and ELM lead to the same conclusion. In summary, a combined strategy for extracting characteristic wavelengths can reduce variable redundancy, lighten the model, and enhance model performance.
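The variable reductions quoted above follow directly from the subset sizes reported earlier. The short Python sketch below reproduces that arithmetic. The total of 646 bands is an assumption inferred from the reported percentages (for example, 63 wavelengths corresponding to 9.75% of all bands), and the function name is illustrative.

```python
# Minimal sketch of the variable-reduction arithmetic for the CARS-PLS example.

N_TOTAL_BANDS = 646  # assumption: inferred from the reported band percentages


def reduction_pct(n_before, n_after):
    """Percentage decrease in the number of variables, rounded to two decimals."""
    return round(100.0 * (n_before - n_after) / n_before, 2)


# Subset sizes reported in the text for each component.
cars = {"crude protein": 63, "tannin": 50, "crude fat": 35}
cars_iriv = {"crude protein": 41, "tannin": 44, "crude fat": 27}

for name in cars:
    vs_raw = reduction_pct(N_TOTAL_BANDS, cars[name])    # CARS vs. full spectrum
    vs_cars = reduction_pct(cars[name], cars_iriv[name])  # CARS-IRIV vs. CARS
    print(f"{name}: {vs_raw}% fewer than Raw, {vs_cars}% fewer than CARS")
```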

Optimal models of crude protein, tannin, and crude fat

For crude protein content, the optimal results were achieved by combining the CARS-IRIV characteristic extraction algorithm with PLS to establish the detection model. The prediction set exhibited Rp2, RMSEp, and RPDp values of 0.69, 0.80%, and 1.80, respectively. For tannin, the optimal results were achieved by combining the BOSS-IRIV characteristic extraction algorithm with the PLS algorithm. The prediction set exhibited Rp2, RMSEp, and RPDp values of 0.88, 0.22%, and 2.84, respectively. For crude fat content, the optimal results were achieved by combining the BOSS-IRIV characteristic extraction algorithm with the ELM algorithm. The prediction set exhibited Rp2, RMSEp, and RPDp values of 0.61, 0.32%, and 1.61, respectively. The results show that the tannin detection model performs very well and that the crude protein and crude fat detection models perform well. Figure 6(a)-(c) shows the fitting results of the models, which indicate that these models had excellent performance.

Fig. 6

Fitting results of the calibration set and prediction set for each index in the optimal simultaneous detection model. (a) Crude protein; (b) Tannin; (c) Crude fat.
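The three prediction-set indicators reported above can be computed from reference and predicted values as in the following Python sketch. It assumes the commonly used definitions: Rp2 as the coefficient of determination, RMSEp as the root-mean-square error, and RPDp as the standard deviation of the reference values divided by RMSEp. Using the population standard deviation is an assumption here, and the sample values are hypothetical.

```python
import math


def prediction_metrics(y_true, y_pred):
    """Return (R2, RMSE, RPD) for reference values and model predictions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    sst = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    r2 = 1.0 - sse / sst
    rmse = math.sqrt(sse / n)
    sd = math.sqrt(sst / n)  # assumption: population standard deviation
    return r2, rmse, sd / rmse


# Hypothetical reference and predicted contents (%), for illustration only.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2, rmse, rpd = prediction_metrics(reference, predicted)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}, RPD = {rpd:.2f}")
```

With the population standard deviation, RPD equals 1/sqrt(1 − R2). For the reported Rp2 values of 0.69, 0.88, and 0.61 this gives about 1.80, 2.89, and 1.60, close to the reported RPDp values of 1.80, 2.84, and 1.61.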

Conclusion

In this study, we employed VIS-NIR spectroscopy, in conjunction with chemical determination methods, to examine three critical nutrient components (crude protein, tannin, and crude fat) in 93 sorghum varieties. The objective was to pinpoint interpretable wavelengths linked to these constituents and to develop robust detection models. The proposed joint strategy incorporates the CARS-IRIV and BOSS-IRIV key wavelength extraction algorithms, which eliminate a significant amount of redundant data and extract both strongly and weakly informative wavelengths. The results indicate that 41, 38, and 22 characteristic wavelengths were extracted for crude protein, tannin, and crude fat, respectively. Furthermore, the CARS-IRIV-PLSR model proved suitable for crude protein detection, the BOSS-IRIV-PLSR model gave favorable results for tannin detection, and the BOSS-IRIV-ELM model yielded the optimal results for crude fat detection.

In summary, these detection models effectively achieved real-time and nondestructive detection of crude protein, tannin, and crude fat contents in sorghum grains. They offer theoretical support and guidance for the use of spectral techniques in detecting other grain components. Looking ahead, we plan to integrate the multiple independent models into a single comprehensive model and to combine big data analysis closely with artificial intelligence technology. On this basis, a real-time detection system for grain nutritional components will be built. This system will enable simultaneous detection of various nutritional components in grains, enhance prediction efficiency, and help ensure quality and safety, demonstrating great potential for the synchronous detection of multiple nutritional components in grains during complex industrial processes.