Using visible and NIR hyperspectral imaging and machine learning for nondestructive detection of nutrient contents in sorghum

Wu, Kai; Zhang, Zilin; He, Xiuhan; Li, Gangao; Zheng, Decong; Li, Zhiwei

doi:10.1038/s41598-025-90892-6


Article
Open access
Published: 19 February 2025


Kai Wu^1,3,
Zilin Zhang^1,3,
Xiuhan He^1,3,
Gangao Li^1,3,
Decong Zheng^1,3 &
…
Zhiwei Li ORCID: orcid.org/0000-0001-6803-2349^2,3

Scientific Reports volume 15, Article number: 6067 (2025)


Abstract

Nondestructive, rapid, and accurate detection of nutritional components in sorghum is crucial for the agricultural and food industries. In this study, the crude protein, tannin, and crude fat contents of samples from different sorghum varieties were taken as the research objects. Visible near-infrared (VIS-NIR) hyperspectral images of sorghum were acquired using an indoor mobile scanning platform. The nutritional components were determined using chemical methods to analyze the differences in nutritional composition among varieties. After preprocessing the original spectra, the competitive adaptive reweighted sampling (CARS) and bootstrapping soft shrinkage (BOSS) algorithms were used to coarsely extract the key variables. Subsequently, the iteratively retains informative variables (IRIV) algorithm was employed to assess the importance of these key variables, resulting in explanatory wavelength sets for crude protein, tannin, and crude fat. Finally, partial least squares (PLS), back propagation (BP) and extreme learning machine (ELM) methods were used to establish detection models. The results indicated that the optimal wavelength variable sets for crude protein, tannin, and crude fat contained 41, 38, and 22 wavelength variables, respectively. The CARS-IRIV-PLS, BOSS-IRIV-PLS and BOSS-IRIV-ELM models were suitable for detecting crude protein, tannin and crude fat, respectively. The corresponding R_p², RMSE_p and RPD_p values were 0.69, 0.80% and 1.80 for crude protein; 0.88, 0.22% and 2.84 for tannin; and 0.61, 0.32% and 1.61 for crude fat. These detection models can be used for effective estimation of the nutritional composition of sorghum from VIS-NIR spectral data, and can provide an important basis for food nutrition assessment.


Introduction

Sorghum, a member of the Poaceae family, ranks as the fifth largest cereal crop globally and is predominantly cultivated in arid and semi-arid regions worldwide¹. Its grains are rich in nutrients, including starch, protein, tannin, fat, and amino acids^2,3. Different varieties of sorghum grains differ significantly in nutritional content, and these differences lead to different fields of application, such as food manufacturing, feed processing and brewing^4,5,6. Therefore, accurate measurement of the nutritional components in sorghum grains plays an important role in their application.

Despite the accuracy of the quantification results obtained through traditional chemical detection methods, they suffer from numerous shortcomings. These include complex sample preparation, high costs, time-consuming procedures, destructiveness to the samples, and the necessity for special equipment and chemical reagents. Furthermore, traditional chemical methods can only quantify a single nutrient component and are unable to quantify multiple nutrients simultaneously. All these factors limit their widespread application^7,8. Consequently, there is a need for a low-cost, nondestructive, and efficient detection method.

A hyperspectral imaging system acquires spectral images of objects by continuously measuring spectra across multiple wavelength bands. By selecting regions of interest within the image, spectral information from different wavelength bands can be extracted. Preprocessing of the spectral data and selection of characteristic wavelengths are conducted to improve data quality. Quantitative models are then established using machine learning algorithms and relevant inversion indicators. Fatemi et al. employed NIR hyperspectral imaging to identify and predict the spectral range of major chemical components in maize, thereby assessing its quality¹¹. Zareef et al. employed mid-infrared and NIR hyperspectral techniques to analyze crops, including cereals, legumes, and grains¹². Their study demonstrated the applicability of mid-infrared spectroscopy in characterizing active compounds present in minor cereals and their processing, while near-infrared spectroscopy was found suitable for quantitative analysis and identification of quality indicators in minor cereals. Chen et al. used NIR diffuse reflectance spectroscopy combined with partial least squares discriminant analysis to establish a multi-grain rice seed variety recognition method, and the proposed model can provide a reference for special spectrometers¹³. Zhu et al. used the KS algorithm to extract characteristic spectra and develop an NIR spectroscopy model; the proposed prediction model can meet the requirements of rapid determination of Tartary buckwheat starch¹⁴. Fan et al. proposed two NIRS-based methods for the detection of amylose content (AC) and fat content (FC) in single rice grains. The proposed methods are suitable for rapid and nondestructive detection and sorting of rice seeds with different AC and FC, and can meet the requirements of coarse screening of rice seed varieties¹⁵. Huang et al. employed hyperspectral technology for the rapid determination of amylose and amylopectin contents in sorghum. They investigated the impact of various preprocessing methods on VIS and NIR spectroscopy data, while also comparing the predictive accuracy of these spectroscopic measurements. By combining principal component analysis and the successive projections algorithm to extract characteristic wavelengths, and using both full-spectrum and characteristic-band spectra, PLSR and Random Forest (RF) models were established to predict the contents of amylose and amylopectin in different sorghum varieties. The study confirmed the applicability of spectral technology to rapid and nondestructive prediction of amylose and amylopectin contents in various sorghum varieties¹⁰.

The spectral characteristics of the visible light band (400 to 700 nm) are closely related to a variety of chemical components in the grain. For example, chlorophyll, lutein, carotene and other pigments have specific absorption peaks in the visible light band, and the content of these pigments is directly related to the nutritional quality of cereals. At the same time, other nutrients in grains, such as protein, starch, and cellulose, may also exhibit specific spectral characteristics in the visible light band. Therefore, the nutritional quality of cereals can be assessed indirectly through spectral analysis in the visible band. Caporaso et al. pointed out that hyperspectral imaging (HSI) has proved to be a multifunctional tool for studying grains, allowing rapid analysis of individual grains as well as multi-component analysis. HSI is one of the most promising non-contact technologies that can quickly provide chemical information and give insight into the spatial distribution of chemical properties¹⁶. Almoujahed et al. combined VIS-NIR and MIR spectroscopy with machine learning models to provide a simple, suitable and nondestructive method for distinguishing Fusarium head blight (FHB) infection in wheat and flour samples at the post-harvest stage¹⁷. Shi et al. proposed an SLR model based on VIS-NIR hyperspectral data, which can accurately predict hundreds of nutrients in wheat grains¹⁸. Shuai et al. used visible hyperspectral images to capture spatial and spectral features of object surfaces; the combination of deep learning and hyperspectral imaging has shown good results in multi-scale agricultural management¹⁹. However, few rapid detection methods are available for sorghum components such as crude protein, tannin, and crude fat. Therefore, this work proposes a rapid and nondestructive detection method for sorghum based on VIS-NIR hyperspectral technology to achieve accurate determination of nutrient contents in different sorghum varieties.

This work focused on a total of 93 sorghum varieties. Three samples were collected from each variety, giving 279 samples. First, the spectral information of the samples was obtained using VIS-NIR hyperspectral detection technology, and chemical analysis methods were employed to measure the nutrient contents, including crude protein, tannin, and crude fat. Next, the data were preprocessed using the Standard Normal Variate (SNV), Detrending, and Multiplicative Scatter Correction (MSC) algorithms to enhance the signal-to-noise ratio and the reliability of the data across samples. Then, the Competitive Adaptive Reweighted Sampling (CARS) and Bootstrapping Soft Shrinkage (BOSS) algorithms were used for coarse feature extraction to identify spectral variable combinations that are highly correlated with the corresponding nutritional components. The Iteratively Retains Informative Variables (IRIV) algorithm was then used to further refine the initially extracted feature variables, extracting interpretable variables that are strongly correlated with the nutritional components. These processes effectively reduce the dimensionality of the dataset and highlight important characteristic variable bands. Finally, detection models for determining crude protein, tannin, and crude fat content in sorghum were constructed using PLS, BP, and ELM methods, achieving rapid and nondestructive determination of nutrient contents.

Material and method

Sorghum samples and growth conditions

Sorghum varieties used in this study were all from the Sorghum Research Institute of Shanxi Agricultural University. The plantation is located in Yuci District, Jinzhong City, Shanxi Province (112°34’ ~ 113°8’ E, 37°23’ ~ 37°54’ N). The region has a temperate continental arid climate, with an average altitude of 800 m, an average annual temperature of 9.8 °C, an average annual precipitation of 440.7 mm, an annual sunshine duration of 2662 h, and a frost-free period lasting for 158 days.

In the experiment, a total of 93 varieties of sorghum with plump and uniform grains were collected. Three samples weighing 300 g each were collected for each variety, resulting in a total of 279 samples. All samples were stored in hermetically sealed wide-mouth bottles. The experiment began with the capture of hyperspectral images, followed by the determination of component contents. This sequence aimed to investigate the nutritional composition of different sorghum varieties using both spectroscopy and chemical analysis.

Hyperspectral image acquisition and spectral data extraction

Hyperspectral image acquisition

This experiment employed a visible-near infrared hyperspectral (Starter Kit, Headwall Photonics, USA) scanning platform to capture hyperspectral images of sorghum samples. The components of the platform included a visible near-infrared light hyperspectral camera (Headwall Photonics, USA), a 1000 W external light source produced by Philips (6995P, Philips, Netherlands), a transmission control platform, and a computer, located in a darkroom. The acquisition parameters are set as follows: the distance between the camera and the sample is 240 mm, the push-scan travel is 100 mm, and the platform movement speed is 2.94 mm/s. The lens has a spectral range of 380 ~ 1000 nm, a spectral resolution of 0.727 nm, and a total of 856 bands were collected. Due to oscillations in spectral reflectance at the extremes of the spectrometer’s range, which introduce significant errors, the selected spectral range for modeling purposes is 430 ~ 900 nm, totaling 646 bands.

Sorghum grain samples were loaded into a sample tray with a diameter of 50 mm and a depth of 15 mm, and the sample surface was kept smooth and compacted. The tray was moved by the transmission control platform to allow line scanning by the hyperspectral camera. The computer was used to control the transmission platform and save the raw data collected from the samples.

The process of hyperspectral acquisition was as follows:

(1) Power on the computer, camera and light source, and preheat for 15 min.
(2) Collect the current black-and-white board correction data to eliminate the influence of ambient light, noise, and camera wavelength deviation. The black-and-white board correction is given by formula 1.

$${R_0}=\frac{{{R_{raw}} - {R_{dark}}}}{{{R_{white}} - {R_{dark}}}}$$

(1)

Where, R₀ represents the value of the corrected hyperspectral data, R_raw represents the original data, R_dark represents the dark current data, and R_white represents the whiteboard data.

(3) Acquire the hyperspectral images and store the data.
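The black-and-white board correction in formula 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the acquisition software used in this work: the function name `calibrate_reflectance` and the `eps` safeguard against a zero denominator are assumptions added for the example.

```python
import numpy as np

def calibrate_reflectance(raw, white, dark, eps=1e-12):
    """Apply formula 1: R0 = (R_raw - R_dark) / (R_white - R_dark)."""
    raw = np.asarray(raw, dtype=float)
    white = np.asarray(white, dtype=float)
    dark = np.asarray(dark, dtype=float)
    denom = white - dark
    # Illustrative safeguard (not part of formula 1): avoid dividing by zero
    # where the white and dark reference readings coincide.
    denom = np.where(np.abs(denom) < eps, eps, denom)
    return (raw - dark) / denom
```

The white and dark references broadcast against the raw data, so per-band reference vectors can be applied to a whole image cube in one call.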

Spectral data extraction

To simplify the extraction of hyperspectral image data and enable batch processing, this work developed a preprocessing platform for hyperspectral data using Visual Basic (Version 6.0, Microsoft, USA). By loading the generated sampling point files into SpectralView software (Headwall Photonics, USA), spectral data can be collected efficiently from each pixel within the region of interest (ROI) of the samples under examination, and the acquired data can be preprocessed.

The ellipse model is used to determine the coordinates of the ROI center, the lengths of the X/Y semi-axes (a, b), and the sampling intervals along the X/Y axes (Δx and Δy, both set to 1). Following the principle of “left to right, top to bottom” in the target image, pixel points in the ROI, with coordinates (x_i, y_i) measured relative to the ROI center, are sequentially collected according to formula 2 to generate the ROI coordinate matrix. The ROI selected for this experiment contains more than 15,000 pixels.

$$\frac{{x_{i}^{2}}}{{{a^2}}}+\frac{{y_{i}^{2}}}{{{b^2}}} \leqslant 1$$

(2)

SpectralView software was used to import images and actively extract the reflectivity information according to the coordinate matrix. The average reflectance of all pixels is calculated by formula 3. The calculated average reflectance will serve as the foundational dataset for subsequent data processing.

$$\bar{A}_{k} = \frac{1}{n}\left( A_{1,k} + A_{2,k} + A_{3,k} + \cdots + A_{n,k} \right) = \frac{1}{n}\sum\limits_{{i = 1}}^{n} {A_{{i,k}} }$$

(3)

Where, Ā_k represents the average reflectance at band k, A_i,k represents the reflectance of pixel i at band k, and n denotes the number of pixels.
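The ROI selection of formula 2 and the averaging of formula 3 can be illustrated together. The sketch below assumes the image is held as a NumPy array of shape (height, width, bands); the function names and the array layout are hypothetical and do not describe the Visual Basic platform or SpectralView.

```python
import numpy as np

def ellipse_roi_mask(height, width, center_x, center_y, a, b):
    """Boolean mask of pixels satisfying x^2/a^2 + y^2/b^2 <= 1 (formula 2)."""
    ys, xs = np.mgrid[0:height, 0:width]
    dx = xs - center_x  # coordinates relative to the ROI center
    dy = ys - center_y
    return (dx ** 2) / a ** 2 + (dy ** 2) / b ** 2 <= 1.0

def mean_roi_spectrum(cube, mask):
    """Average reflectance per band over the ROI pixels (formula 3)."""
    # cube[mask] has shape (n_pixels, bands); the mean runs over pixels.
    return cube[mask].mean(axis=0)
```

With Δx = Δy = 1, every pixel inside the ellipse is used, which matches the row-by-row collection order described above up to ordering (the mean does not depend on order).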

Chemical determination of nutrient contents

After spectral data collection was completed, the 279 samples were pulverized using a high-speed pulverizer (750T, Bo ou Hardware Products Co., Ltd., Yongkang City, China), passed through an 80-mesh sieve, and then sealed and stored in sequence as raw material samples for component content analysis. The contents of crude protein, tannin and crude fat were quantitatively determined in this experiment. To minimize potential measurement errors associated with chemical analysis methods, all samples were processed by a single experimenter to ensure consistency. In addition, each sample was measured three times and the mean value was taken. Laboratory temperature stability and instrument calibration were strictly controlled throughout the testing procedure.

Crude protein contents determination

The crude protein content was determined using the Kjeldahl method^20,21, employing an automated Kjeldahl nitrogen analyzer (K1100, Shandong Hanon Scientific Instrument Co., Ltd., China) for analysis. Organic nitrogen compounds are digested with concentrated sulfuric acid to form ammonium sulfate. Subsequently, the ammonium sulfate reacts with a strong base, liberating ammonia, which is then distilled and absorbed into an excess of standard inorganic acid solution. The ammonia is quantified by titration with standard alkali, enabling the calculation of the crude protein content of the sample.

Tannin contents determination

Tannin determination involves the extraction of sorghum tannins with dimethyl sulfoxide, followed by measurement using the ferric ammonium citrate colorimetric method and analysis performed using a UV-visible spectrophotometer (752 N, Shanghai Yidi Analytical Instrument Co., Ltd., China)^22,23.

Crude fat contents determination

The crude fat content is determined using the JTONE series fat analyzer (Soxhlet extractor), which also includes a defatting step for the sample²⁴. The working process of the Soxhlet extractor mainly consists of three parts: heating extraction, solvent recovery, and cooling. To improve defatting efficiency, the sample is repeatedly soaked and extracted during the extraction process. During operation, the heating temperature can be adjusted in real time according to changes in the reagent’s boiling point and the ambient temperature.

Spectral data preprocessing and data set division

Spectral data preprocessing

To eliminate errors introduced during spectral measurement, further enhance the signal-to-noise ratio of the data, and improve the correlation between reflectance data and physicochemical indicators, the spectral data are preprocessed. The preprocessing consists of the consecutive application of Standard Normal Variate (SNV), Detrending, and Multiplicative Scatter Correction (MSC). SNV is used to eliminate the effects of grain glossiness, surface scattering, and background interference on reflectance spectra. Detrending is employed to remove baseline drift caused by diffuse reflection. MSC is used to eliminate spectral variations due to environmental light scattering during the measurement process, enhancing spectral information related to chemical composition. This correction adjusts baseline shifts and offsets in the spectral data, thereby improving the correlation between the spectra and the measured component contents^25,26,27,28.

By combining these three preprocessing methods, the differences in spectral intensity between different samples are eliminated, intrinsic sample characteristics are highlighted, differentiation among various features is enhanced, data stability and reliability are improved, data resolution is increased, and spectral data clarity is enhanced. This lays the foundation for establishing detection models in subsequent steps.
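The three preprocessing steps can be sketched as follows, assuming the spectra are stored as a NumPy array with one sample per row. The second-order polynomial used for detrending and the mean spectrum used as the MSC reference are common choices adopted here as assumptions; the settings actually used in this work are not stated.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def detrend(spectra, degree=2):
    """Subtract a fitted polynomial baseline from each spectrum.
    degree=2 is an assumed, commonly used setting."""
    x = np.arange(spectra.shape[1], dtype=float)
    out = np.empty_like(spectra, dtype=float)
    for i, row in enumerate(spectra):
        coeffs = np.polyfit(x, row, degree)
        out[i] = row - np.polyval(coeffs, x)
    return out

def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference spectrum
    (the mean spectrum by default, an assumed choice)."""
    ref = spectra.mean(axis=0) if reference is None else reference
    out = np.empty_like(spectra, dtype=float)
    for i, row in enumerate(spectra):
        slope, intercept = np.polyfit(ref, row, 1)  # row ~ slope * ref + intercept
        out[i] = (row - intercept) / slope
    return out

def preprocess(spectra):
    """Consecutive application of SNV, Detrending and MSC, as described above."""
    return msc(detrend(snv(spectra)))
```

In practice the MSC reference should be computed on the calibration set and reused for the prediction set, so that new samples are corrected against the same reference.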

Data set division

This work adopts the SPXY algorithm to partition the sample set based on joint x-y distances, dividing it into calibration and prediction sets in a 3:1 ratio. The calibration set is used for building the model and cross-validation in spectral data modeling with component content, while the prediction set is used to evaluate the predictive performance of the model.
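A compact sketch of SPXY-style partitioning is given below. It follows the usual formulation (Kennard-Stone selection on the sum of the x-distance and y-distance, each normalized by its maximum); the function names are hypothetical and the details may differ from the implementation used in this work.

```python
import numpy as np

def _pairwise_distances(A):
    """Euclidean distance matrix between the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.sqrt(np.maximum(d2, 0.0))

def spxy_split(X, y, n_calibration):
    """Return (calibration indices, prediction indices) using joint x-y distances."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    dx = _pairwise_distances(X)
    dy = _pairwise_distances(y)
    d = dx / dx.max() + dy / dy.max()  # normalized joint distance
    # Start from the two samples that are farthest apart.
    first, second = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(first), int(second)]
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < n_calibration:
        # Add the sample whose nearest selected neighbor is farthest away.
        min_dist = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

For a 3:1 split of 279 samples, `n_calibration` would be 209, leaving 70 samples for the prediction set.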

Characteristic variables extraction

Algorithm of characteristic variables extraction

An extraction algorithm can be employed to transform the original data into a feature set that possesses enhanced representativeness and separability, thereby reducing data complexity and improving model performance. This experiment utilized three extraction algorithms, namely Competitive Adaptive Reweighted Sampling (CARS), Bootstrapping Soft Shrinkage (BOSS), and Iteratively Retains Informative Variables (IRIV).

The CARS method simulates Darwin’s theory of “survival of the fittest” by employing adaptive reweighted sampling techniques^29,30. It selects wavelengths with relatively large absolute regression coefficients, which are calculated using partial least squares analysis. Additionally, it uses cross-validation to identify the wavelength subset that yields the lowest root mean square error. Because the algorithm is stochastic, multiple runs are needed to retain the bands with higher selection frequencies, thereby improving the stability of subsequent regression models. Therefore, the Monte Carlo sampling procedure was repeated 100 times.
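Two ingredients of CARS can be sketched without a full PLS implementation: the exponentially decreasing function that fixes how many wavelengths survive each sampling run, and the adaptive reweighted sampling step itself. The schedule below follows the usual CARS formulation (all variables kept in the first run, two in the last); the regression coefficients are taken as given, and the function names are hypothetical.

```python
import numpy as np

def cars_retention_schedule(n_variables, n_runs):
    """Number of variables kept in each run: ratio_i = a * exp(-k * i)."""
    a = (n_variables / 2.0) ** (1.0 / (n_runs - 1))
    k = np.log(n_variables / 2.0) / (n_runs - 1)
    runs = np.arange(1, n_runs + 1)
    ratios = a * np.exp(-k * runs)
    return np.round(ratios * n_variables).astype(int)

def cars_reweighted_step(coefficients, n_keep, rng):
    """One selection step given the PLS regression coefficients of a sub-model."""
    weights = np.abs(coefficients) / np.abs(coefficients).sum()
    # Enforced removal: keep the n_keep variables with the largest weights.
    keep = np.argsort(weights)[::-1][:n_keep]
    # Adaptive reweighted sampling: draw with replacement, proportional to weight.
    p = weights[keep] / weights[keep].sum()
    drawn = rng.choice(keep, size=n_keep, replace=True, p=p)
    return np.unique(drawn)
```

In a complete run, each step would refit a PLS model on a random subset of the calibration samples, and the subset with the lowest cross-validation RMSE across all runs would be retained.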

BOSS combines the bootstrap method and soft shrinkage to address model instability or inaccurate prediction results caused by collinear variables. By statistically analyzing the regression coefficients of multiple sub-models, the importance of each variable is determined from the absolute values of its coefficients, and variables with larger weights have a higher probability of selection. Weighted Bootstrap Sampling (WBS) is used to gradually correct and optimize the variable weights, which shrinks the variable space and effectively selects the variables that have the greatest impact on the predictive performance of the model. The optimal sub-model, which has the smallest root mean square error of cross-validation, is selected from the cluster of sub-models to ensure that the model possesses good generalization ability and prediction accuracy, thus obtaining an optimal variable set^31,32. The optimal model ratio is set at 0.1, with 1000 iterations of WBS.

IRIV follows the concept of model population analysis and fully considers the joint effects among variables. It randomly combines variables to generate a binary matrix, where each row represents a variable combination and each column represents a variable. A separate partial least squares model is established for each row, and the effectiveness of these models is evaluated using the cross-validation root mean square error^26,33. The variables are then categorized into four groups: strongly informative variables, weakly informative variables, uninformative variables, and interfering variables. The retained variables are determined by the Difference of Mean Values (DMEAN) and the P-value. The former is the difference in mean cross-validation root mean square error between sub-models that exclude a given wavelength variable and those that include it, while the latter is the P-value obtained by the Mann-Whitney U test. After multiple iterations of the algorithm, uninformative and interfering variables are excluded. Then, backward elimination is performed to extract the best feature bands from the strongly and weakly informative variables. The Mann-Whitney U test is conducted at a significance level of 0.05.
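The four-way classification can be made concrete with a short sketch. The decision rule below follows the usual IRIV formulation and is an assumption here, since the exact rule is not spelled out in this section. The DMEAN helper is a simplification that compares rows of a single binary matrix, and the P-value is taken as an input rather than computed.

```python
import numpy as np

def dmean_for_variable(inclusion, rmsecv, j):
    """Mean RMSECV of sub-models excluding variable j minus that of
    sub-models including it (positive means the variable helps)."""
    included = rmsecv[inclusion[:, j] == 1]
    excluded = rmsecv[inclusion[:, j] == 0]
    return excluded.mean() - included.mean()

def classify_variable(dmean, p_value, alpha=0.05):
    """Assign one of the four IRIV categories from DMEAN and the P-value."""
    if dmean > 0:
        return "strong" if p_value < alpha else "weak"
    return "interfering" if p_value < alpha else "uninformative"
```

Variables labeled "uninformative" or "interfering" are dropped after each round, and the procedure repeats on the survivors until only strongly and weakly informative variables remain.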

Combined use of characteristic variables extraction algorithms

By employing a combination of extraction algorithms, the dimensionality of the dataset can be further reduced, thereby facilitating the accentuation of feature bands’ significance, alleviating overfitting concerns, and providing a more comprehensive and accurate depiction of data characteristics. Consequently, this augmentation enhances model accuracy, computational efficiency, and generalization capability, rendering it better suited for accommodating novel data.

In this work, a two-step strategy was adopted for characteristic wavelength extraction. First, the CARS and BOSS algorithms were used to extract the variables correlated with crude protein, tannin, and crude fat from the entire spectrum, forming a coarse subset. Then, the IRIV algorithm was applied to the coarse subset to form a refined subset. CARS and BOSS are based on the concept of model ensembles and select the optimal subset of characteristic wavelengths according to RMSE_CV values. As wavelengths unrelated to the chemical composition of sorghum are excluded, the RMSE_CV value decreases, and the subset with the minimum RMSE_CV value is chosen as the result of the coarse selection. By iteratively eliminating variables from this preliminary screening, a finely selected set of feature wavelengths exhibiting strong correlations was identified, thereby enhancing the model’s robustness and stability.

Nutrient contents detection models and evaluation indexes

Nutrient contents detection models

Partial Least Squares (PLS) is a chemometric method that combines multiple linear regression and principal component analysis. Its essence lies in performing regression after eliminating collinearity in the spectra, making it suitable for modeling linear relationships among many variables, including cases with multiple independent and dependent variables^34,35. Its main objective is to identify the latent component directions that maximize the covariance between the predictor variables and the response variables, yielding new composite variables. The model was initialized with 10 latent variables, and 5-fold cross-validation was used to determine the optimal number of latent variables.
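The selection of the number of latent variables can be sketched with a small single-response PLS (NIPALS) routine and 5-fold cross-validation. This is an illustrative NumPy sketch, not the Matlab code used in this work; treating 10 as the upper bound of the search, the random fold assignment and the function names are assumptions.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-response PLS by NIPALS; returns (coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                      # direction of maximum covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                   # nothing left to explain
            break
        w = w / norm
        t = Xk @ w                         # scores
        tt = t @ t
        p = Xk.T @ t / tt                  # X loadings
        c = (yk @ t) / tt                  # y loading
        Xk = Xk - np.outer(t, p)           # deflate X and y
        yk = yk - c * t
        W.append(w); P.append(p); q.append(c)
    if not W:
        return np.zeros(X.shape[1]), y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta

def select_latent_variables(X, y, max_components=10, n_folds=5, seed=0):
    """Return the number of latent variables with the lowest RMSECV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    rmsecv = []
    for a in range(1, max_components + 1):
        sq_err = 0.0
        for fold in folds:
            train = np.setdiff1d(np.arange(len(y)), fold)
            beta, b0 = pls1_fit(X[train], y[train], a)
            sq_err += np.sum((X[fold] @ beta + b0 - y[fold]) ** 2)
        rmsecv.append(float(np.sqrt(sq_err / len(y))))
    return int(np.argmin(rmsecv)) + 1, rmsecv
```

Here `X` would hold the selected wavelength variables of the calibration set and `y` one nutrient content, so a separate model is tuned for crude protein, tannin and crude fat.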

Back Propagation (BP) neural network is a multi-layer feedforward network that utilizes the error backpropagation algorithm for training. It comprises an input layer, hidden layers, and an output layer. Adjacent layers are interconnected through weighted connections, with the input layer receiving external input data, the hidden layers processing data to extract features, and the output layer generating final prediction results. Each node possesses corresponding weight and bias parameters, while nodes within each layer operate independently³⁶. The network iteratively performs forward propagation and backpropagation on training samples, gradually adjusting the weights and biases to minimize the discrepancy between the network’s output results and actual values. This approach enables efficient data modeling and prediction, while also showcasing strong capabilities in non-linear mapping, self-learning, generalization, and fault tolerance.

Extreme Learning Machine (ELM) is a feedforward neural network-based machine learning algorithm, comprising an input layer, hidden layers, and an output layer. It is applicable to both supervised and unsupervised learning problems. The ELM model has the advantages of faster training speed and better generalization performance compared to traditional neural network models³⁷. The parameters nhid, actfun, and init_weights were optimized by minimizing the Root Mean Square Error (RMSE).
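The speed advantage of ELM comes from fixing the hidden-layer weights at random and solving only for the output weights by least squares. A minimal single-hidden-layer sketch is shown below; the sigmoid activation, the uniform weight initialization and the argument names are assumptions standing in for the tuned actfun, init_weights and nhid settings.

```python
import numpy as np

def _hidden(X, W, b):
    """Sigmoid hidden-layer outputs."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def elm_fit(X, y, n_hidden=20, seed=0):
    """Random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # fixed, never trained
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = _hidden(X, W, b)
    beta = np.linalg.pinv(H) @ y          # Moore-Penrose solution of H @ beta = y
    return W, b, beta

def elm_predict(X, model):
    W, b, beta = model
    return _hidden(X, W, b) @ beta
```

Tuning then amounts to refitting with different hidden-layer sizes, activations and initializations and keeping the configuration with the lowest RMSE.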

Evaluation indexes

The performance of these models was evaluated using the coefficient of determination (R²), the root mean squared error (RMSE) and the relative analysis error (RPD). They are calculated according to formulas 7–9, in which Y_Pi is the predicted value of sample i, Y_i is the measured value, Ȳ is the mean of the measured values, and n is the number of samples.

$${R^2}=1 - \frac{{\sum\limits_{{i=1}}^{n} {{{\left( {{Y_{Pi}} - {Y_i}} \right)}^2}} }}{{\sum\limits_{{i=1}}^{n} {{{\left( {{Y_i} - \bar {Y}} \right)}^2}} }}$$

(7)

$$RMSE=\sqrt {\frac{{\sum\limits_{{i=1}}^{n} {{{\left( {{Y_{Pi}} - {Y_i}} \right)}^2}} }}{n}}$$

(8)

$$RPD=\frac{1}{{\sqrt {1 - {R^2}} }}$$

(9)
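Formulas 7–9 translate directly into code. The sketch below is a straightforward NumPy transcription; the function name is hypothetical, and formula 9 is undefined when R² equals 1.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (R2, RMSE, RPD) as defined in formulas 7-9."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot              # formula 7
    rmse = np.sqrt(ss_res / len(y_true))    # formula 8
    rpd = 1.0 / np.sqrt(1.0 - r2)           # formula 9 (requires R2 < 1)
    return r2, rmse, rpd
```

As a check on the reported values, an R² of 0.69 gives an RPD of 1/√0.31 ≈ 1.80 under formula 9, which matches the crude protein result in the abstract.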

Generally, prediction ability is interpreted as follows. An R² value < 0.6 indicates poor model performance, an R² value of 0.6–0.8 indicates good performance, and an R² value > 0.8 indicates very good performance. A lower RMSE value indicates better prediction accuracy. An RPD value < 1.5 indicates that the model performs poorly; an RPD value of 1.5–2 indicates that the model is helpful for estimation; and an RPD value > 2 indicates excellent prediction^38,39,40.

When extracting characteristic variables and constructing detection models, each algorithm is executed 50 times to ensure robustness, and the optimal set of feature variables and model results are selected based on rigorous evaluation metrics. Matlab software (Ver. 2018a, MathWorks, Natick, MA, USA) was used to process and analyze the data.

Results

Nutrient contents analysis

A total of 279 samples were collected from 93 sorghum varieties, and the contents of crude protein, tannin, and crude fat for each variety were obtained by averaging the measurements of its samples. The results of the chemical analysis are presented in Fig. 1, which shows that the average contents of crude protein, tannin, and crude fat were 9.45%, 1.15%, and 3.88%, respectively. These findings are consistent with previous reports on the composition of sorghum. Specifically, the crude protein content ranged from 7.21 to 14.39%, the tannin content ranged from 0.04 to 2.56%, and the crude fat content ranged from 2.67 to 5.93%. These ranges indicate significant variation in nutritional content among different varieties of sorghum. Furthermore, the standard deviation (SD) values of crude protein, tannin, and crude fat content were 1.53, 0.04, and 0.56, respectively^1,34. This reflects the stability and representativeness of the sample data selected for this work. In summary, the 279 samples selected in this experiment are highly representative, thereby establishing a robust data foundation for further investigation into the characteristics and nutritional components of sorghum varieties.

Spectral characteristic analysis

Figure 2a describes the spectral curves of 279 sorghum samples in the VIS-NIR spectral range. Different sorghum varieties exhibit significant differences in spectral reflectance in the VIS range, mainly due to variations in seed color, which significantly alter the absorption and reflection characteristics of light. For instance, certain sorghum samples exhibit distinctive spectral troughs within the range of 650–680 nm, which can be attributed to spectral variations arising from the presence of dark chromophores in these grains. In the NIR range, the spectral shapes of various sorghum varieties exhibit a high degree of similarity, while their reflectance values differ. These differences in spectral characteristics reflect variations in nutrient composition among different sorghum varieties. Therefore, the spectral differences provide an important foundation for analyzing and detecting sorghum components, offering strong support and guidance for further exploration of its chemical composition and potential applications. The spectral analysis technology can effectively distinguish and evaluate the nutritional characteristics of different sorghum varieties, which is of great significance for further research on and utilization of the characteristics of sorghum varieties. Through the overall observation of the spectrum of all samples, the spectral curves showed a consistent trend of initially increasing and then decreasing, with no apparent abnormal samples. Then, the spectral curves of the original data were processed using SNV, Detrending, and MSC preprocessing algorithms, as shown in Fig. 2b–d.

Taking the tannin as an example, the PLS model was established by comparing three pretreatment methods, as shown in Table 1. The results show that the combination of the three methods can effectively improve the prediction accuracy of the model. By combining these three preprocessing methods, the spectral differences caused by grain glossiness, surface scattering, background interference, baseline drift, and environmental light scattering were effectively eliminated, while enhancing the spectral information related to chemical composition content. This preprocessing enhances the quality and interpretability of spectral data, thereby contributing to improved accuracy in subsequent modeling.

Table 1 Results of the PLS model established for tannin content by spectral pretreatment method.


Additionally, the SPXY algorithm was used to partition the 279 samples into two subsets at a ratio of 3:1. The calibration set comprised 209 samples (75% of the total), and the prediction set comprised the remaining 70 samples (25% of the total). As shown in Table 2, the calibration and prediction sets have relatively uniform data distributions with closely aligned mean values. This indicates that the partitioning is reasonable and mitigates potential biases in data distribution. Together, the preprocessing and the dataset division establish a solid foundation for analyzing and predicting the chemical composition of sorghum.
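The partition described above can be sketched as follows. This is an illustrative version of the commonly described SPXY procedure (a Kennard-Stone selection on a distance that adds normalized spectral and reference-value distances), not the authors' code; `spxy_split`, `pairwise_dist`, and the random demonstration data are hypothetical.

```python
import numpy as np

def pairwise_dist(A):
    """Euclidean distance matrix between the rows of A."""
    sq = (A ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.sqrt(np.maximum(d2, 0.0))

def spxy_split(X, y, n_cal):
    """Return (calibration indices, prediction indices) chosen by SPXY."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    dx, dy = pairwise_dist(X), pairwise_dist(y)
    d = dx / dx.max() + dy / dy.max()          # joint x-y distance
    first, second = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(first), int(second)]       # start with the two farthest samples
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < n_cal:
        # pick the sample whose nearest selected neighbour is farthest away
        min_dist = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected), np.array(remaining)

# Hypothetical stand-in data with the sample counts quoted in the text.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(279, 20))
tannin = rng.normal(size=279)
cal_idx, pred_idx = spxy_split(spectra, tannin, n_cal=209)
```

With 279 samples and `n_cal=209`, the remaining 70 indices form the prediction set, matching the 3:1 ratio in the text.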

Table 2 The set partitioning results after using the optimal preprocessing method.


Characteristic variables analysis

Figure 3 illustrates the variation in RMSE_CV as the number of sampling runs increases when the CARS algorithm is used to extract characteristic wavelengths for crude protein, tannin, and crude fat. As the number of Monte Carlo sampling runs increases, the RMSE_CV values first decrease and then increase. For crude protein, the RMSE_CV reached its minimum after 41 iterations. A total of 63 wavelengths were selected, accounting for 9.75% of all bands. At this point, the R_CV² and RMSE_CV were 0.73 and 0.80%, respectively. For tannin, the RMSE_CV reached its minimum after 45 iterations. A total of 50 wavelengths were selected, accounting for 7.74% of all bands. At this point, the R_CV² and RMSE_CV were 0.85 and 0.29%, respectively. For crude fat, the RMSE_CV reached its minimum after 51 iterations. A total of 35 wavelengths were selected, accounting for 5.42% of all bands. At this point, the R_CV² and RMSE_CV were 0.43 and 0.40%, respectively.
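To make the shrinking variable count concrete, the sketch below computes the exponentially decreasing retention ratio used in the usual formulation of CARS, in which all variables are kept in the first run and two remain in the last. It covers only this forced-shrinkage step, not the adaptive reweighted sampling or the PLS fitting. The total of 646 bands is inferred from the statement that 63 wavelengths make up 9.75% of all bands, and `n_runs=100` is a hypothetical setting, since the number of runs is not given in this section.

```python
import numpy as np

def cars_retention_ratio(n_vars, n_runs):
    """Fraction of variables kept at runs 1..n_runs (exponentially decreasing)."""
    i = np.arange(1, n_runs + 1)
    a = (n_vars / 2.0) ** (1.0 / (n_runs - 1))
    k = np.log(n_vars / 2.0) / (n_runs - 1)
    return a * np.exp(-k * i)   # equals 1 at run 1 and 2 / n_vars at the last run

N_BANDS = 646   # inferred: 63 wavelengths = 9.75% of all bands
N_RUNS = 100    # hypothetical number of Monte Carlo sampling runs

ratios = cars_retention_ratio(N_BANDS, N_RUNS)
kept = np.round(ratios * N_BANDS).astype(int)   # variables retained per run
```

Because the retained subset shrinks at every run, RMSE_CV typically falls while uninformative bands are discarded and rises once informative bands start to be removed, which is the decrease-then-increase pattern described above.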

Figure 4 illustrates the variation in RMSE_CV as the number of iterations increases when the BOSS algorithm is used to extract characteristic wavelengths for crude protein, tannin, and crude fat. As the number of iterations increases, the RMSE_CV values first decrease and then increase. For crude protein, the RMSE_CV reached its minimum after 6 iterations. A total of 73 wavelengths were selected, accounting for 11.30% of all bands. At this point, the R_CV² and RMSE_CV were 0.70 and 0.85%, respectively. For tannin, the RMSE_CV reached its minimum after 6 iterations. A total of 61 wavelengths were selected, accounting for 9.44% of all bands. At this point, the R_CV² and RMSE_CV were 0.85 and 0.29%, respectively. For crude fat, the RMSE_CV reached its minimum after 9 iterations. A total of 30 wavelengths were selected, accounting for 4.64% of all bands. At this point, the R_CV² and RMSE_CV were 0.41 and 0.41%, respectively.

The steps above established the coarse subsets, to which the IRIV algorithm was then applied to form refined subsets. First, the mean RMSE_CV was calculated with and without each wavelength, followed by a Mann–Whitney U test (P = 0.05). Second, non-informative variables were eliminated, while both strongly and weakly informative variables were retained. Finally, the optimal subset of variables was determined through cross-validation using backward elimination. The variable classification for each subset is shown in Fig. 5. The optimized variable subsets are as follows.
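The classification rule behind this step can be written compactly. The sketch below assumes the usual IRIV convention: DMEAN is the mean RMSE_CV of sub-models that include a wavelength minus the mean RMSE_CV of those that exclude it, so a negative DMEAN means the wavelength helps. The p-value is taken as an input; in practice it would come from a Mann–Whitney U test on the two sets of RMSE_CV values (for example via `scipy.stats.mannwhitneyu`). The function names and example numbers are hypothetical.

```python
def dmean(rmsecv_included, rmsecv_excluded):
    """Mean RMSE_CV with the wavelength minus mean RMSE_CV without it."""
    inc = sum(rmsecv_included) / len(rmsecv_included)
    exc = sum(rmsecv_excluded) / len(rmsecv_excluded)
    return inc - exc

def classify_variable(dmean_value, p_value, alpha=0.05):
    """Assign one of the four IRIV categories to a wavelength."""
    if dmean_value < 0:   # including the wavelength lowers the error
        return "strongly informative" if p_value < alpha else "weakly informative"
    # including the wavelength does not lower the error
    return "interfering" if p_value < alpha else "uninformative"

# Hypothetical example: error is lower with the wavelength, but not significantly.
d = dmean([0.30, 0.29, 0.31], [0.31, 0.30, 0.32])
label = classify_variable(d, p_value=0.20)
```

Under this rule only the strongly and weakly informative wavelengths pass to the backward-elimination step; uninformative and interfering ones are dropped.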

The crude protein variable subsets derived from CARS and BOSS were further refined to 41 and 36 variables, respectively, accounting for 6.35% and 5.57% of all wavelength bands. Regarding the sensitive wavelengths selected for crude protein, the range of 434–522 nm may be attributed to the interaction between light and the π-electron system of aromatic amino acids (e.g., phenylalanine, tryptophan) in protein molecules, whose aromatic ring structures are sensitive to light in this range. The range of 540–546 nm may be associated with resonance absorption by π-electron conjugated structures in proteins, potentially linked to aromatic amino acids. The range of 573–601 nm exhibits sensitivity to the secondary structures of proteins, such as α-helices and β-sheets, in which light interacts with vibrations or electronic transitions of these structural elements. The range of 610–730 nm can be ascribed to vibrational modes of amino acid residues in protein molecules, specifically the vibrations of amine (NH) and carboxyl (COOH) groups. The range of 761–899 nm can be ascribed to the interaction between light and electronic transitions of larger molecules, together with molecular vibrational modes of proteins at longer wavelengths. These modes may involve long-chain amino acid residues or overall conformational changes of the protein. Qiao et al. found that corn is sensitive to light at 400–700 nm and 910 nm, related to the C-H vibrations of proteins⁴¹. Fatemi et al. reported that crude protein is sensitive to light at 760 nm, which is related to the vibration of nitrogen elements¹¹. Wang et al. found that the characteristic wavelengths of rice are near 500 nm⁴². These results support the characteristic wavelengths selected for sorghum crude protein in this work.

The tannin variable subsets derived from CARS and BOSS were further refined to 44 and 38 variables, respectively, accounting for 6.81% and 5.88% of all wavelength bands. Regarding the sensitive wavelengths of tannin, the range of 475–600 nm may match the π-π* or n-π* transitions of tannin molecules, resulting in related absorption. The range of 612–690 nm may interact with the C-H or C-O vibration modes of tannin molecules, triggering the associated absorption. The range of 745–810 nm may be attributed to interactions such as hydrogen bonding and π-π stacking within tannin molecules, which regulate the electron distribution and charge state of the molecules and thereby influence their absorption characteristics. The range of 829–897 nm may arise because tannin is a polyphenolic compound whose chemical structure includes functional groups such as hydroxyl, carboxyl, and acyl groups. These structural features result in varying absorption characteristics across different wavelengths. In their investigation of plant tannin, Zhang et al. identified characteristic spectral ranges at approximately 500–510 nm and 680–690 nm⁴³, which closely resemble the tannin characteristic wavelengths selected in this work.

The crude fat variable subsets derived from CARS and BOSS were further refined to 27 and 22 variables, respectively, accounting for 4.18% and 3.41% of all wavelength bands. Regarding the sensitive wavelengths of crude fat, 447 nm may be associated with the shorter π-π* transitions in fatty acids, particularly near the non-conjugated structures adjacent to double bonds. The absorption at 545 nm and 563 nm may be attributed to π-π* transitions in fatty acid molecules and to absorption by conjugated double bonds. The wavelength of 588 nm may be associated with absorption by the long-chain structure and conjugated system of fatty acids. The wavelengths of 631 nm and 632 nm may be associated with C-H stretching vibrations in the fatty acid chains; 649 nm and 650 nm may be associated with the vibrational absorption of C=O stretching or other functional groups in fat molecules; and 668 nm and 671 nm may be related to spectral features of aromatic rings or other conjugated structures in fat molecules. The wavelengths of 729 nm and 750 nm may be associated with the resonant absorption of long-chain structures in fatty acid molecules, and 785 nm and 795 nm with features such as C-H bending vibrations. Wavelengths in the range of 800–900 nm may exhibit complex absorption patterns in fatty acid chains, including conjugation and long-chain effects. Fei et al. extracted characteristic spectra of crude fat primarily at wavelengths around 450, 500, 680, 800, and 1000 nm⁴⁴, which aligns with the findings of our work.

The coarse-to-refined extraction strategy effectively reduces the number of variables while preserving the informative wavelengths associated with crude protein, tannin, and crude fat content, and thus provides an optimized foundation for subsequent analysis and modeling. According to the IRIV selection process, the coarse subsets obtained by the CARS and BOSS algorithms contain no interfering variables, although some of their variables may carry no relevant information. This demonstrates the advantage of CARS and BOSS in the extraction process. The final variable set, named the refined subset, is obtained through backward elimination on the coarse subset and includes only strongly and weakly informative wavelengths. Whether a variable is strong or weak is judged from the DMEAN and P-value in the IRIV algorithm. A weak characteristic wavelength is a wavelength in the spectral data, or a variable in any other high-dimensional data, that has a weaker signal and less direct predictive power for the target variable than other wavelengths. Although such wavelengths may not be strongly predictive on their own, they often contain information that is indirectly related to the target variable or that interacts with strong characteristic wavelengths. Ignoring them can therefore lead to information loss, which in turn affects the comprehensiveness and accuracy of the model. On the one hand, this shows that the refined extraction strategy can further reduce the number of variables and improve the accuracy of the detection model. On the other hand, it underscores the role of weak characteristic wavelengths in constructing a robust detection model.

Comparison of detection models

Performances of detection models based on whole spectral data and characteristic spectral data

The PLS, BP, and ELM modeling methods were first used to establish detection models for crude protein, tannin, and crude fat based on the whole spectral data. Separate models were then established based on the coarse subsets and the refined subsets. The results of these detection models are presented in Table 3.
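Of the three modeling methods, the extreme learning machine is compact enough to sketch directly: the hidden-layer weights are drawn at random and left untrained, and only the output weights are solved by least squares. The class below is a minimal illustration under assumed settings (sigmoid activation, 20 hidden nodes, uniform random weights, synthetic data); it is not the configuration used in the study.

```python
import numpy as np

class ELMRegressor:
    """Single-hidden-layer ELM: random hidden weights, least-squares output weights."""

    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activations of the fixed random hidden layer
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        self.W = self.rng.uniform(-1.0, 1.0, size=(X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        H = self._hidden(X)
        # Output weights: minimum-norm least-squares solution via the pseudoinverse
        self.beta = np.linalg.pinv(H) @ np.asarray(y, dtype=float)
        return self

    def predict(self, X):
        return self._hidden(np.asarray(X, dtype=float)) @ self.beta

# Hypothetical stand-in for a refined wavelength subset (22 variables).
rng = np.random.default_rng(1)
X_cal = rng.normal(size=(60, 22))
y_cal = X_cal[:, 0] - 0.5 * X_cal[:, 1] + 0.1 * rng.normal(size=60)
model = ELMRegressor(n_hidden=20, seed=0).fit(X_cal, y_cal)
y_hat = model.predict(X_cal)
```

Because training reduces to one linear solve, an ELM fits quickly, but its results depend on the random draw of hidden weights, so fixing the seed (as done here) matters for reproducibility.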

Table 3 The results of detection model for quantifying crude protein, tannin, and crude fat content in sorghum samples.


The CARS-PLS detection model is taken as an illustrative example. In terms of the number of modeling variables, the CARS subsets contained 90.25%, 92.26%, and 94.58% fewer variables than the full spectrum (Raw) for crude protein, tannin, and crude fat, respectively. In terms of model performance, the R_p² of crude protein and crude fat increased by 0.02 and 0.11, the RPD_p increased by 0.06 and 0.09, and the RMSE_p decreased by 0.03% and 0.04%, respectively. This indicates that the characteristic-spectrum models outperformed the full-spectrum models for these two components. For the tannin model, however, the R_p² and RPD_p decreased by 0.02 and 0.10 and the RMSE_p increased by 0.01%, a slight decline in performance. A possible reason is that variables strongly correlated with tannin were excluded when CARS extracted the characteristic variables. After the IRIV algorithm was applied, the number of modeling variables was reduced further and model performance improved further. The CARS-IRIV subsets for crude protein, tannin, and crude fat contained 34.92%, 12.00%, and 22.86% fewer variables than the corresponding CARS subsets. The R_p² increased by 0.002, 0.01, and 0.03, the RPD_p increased by 0.01, 0.14, and 0.03, and the RMSE_p decreased by 0.002%, 0.01%, and 0.01%, respectively. This indicates that the detection model built after refined extraction is more lightweight and more accurate. The detection models established using BP and ELM lead to the same conclusion. In summary, a combined strategy for extracting characteristic wavelengths can reduce variable redundancy, lighten the model, and enhance model performance.
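The variable-reduction percentages quoted above follow directly from the subset sizes reported earlier, as the short check below shows. The total of 646 bands is not stated in this section; it is inferred from the statement that 63 wavelengths correspond to 9.75% of all bands, and the dictionary and function names are illustrative.

```python
TOTAL_BANDS = 646  # inferred: 63 wavelengths = 9.75% of all bands

# Subset sizes reported for the CARS and CARS-IRIV steps
cars = {"crude protein": 63, "tannin": 50, "crude fat": 35}
cars_iriv = {"crude protein": 41, "tannin": 44, "crude fat": 27}

def percent_reduction(before, after):
    """Percentage of variables removed when going from `before` to `after`."""
    return round(100.0 * (before - after) / before, 2)

# Reduction of the CARS subset relative to the full spectrum
vs_full = {k: percent_reduction(TOTAL_BANDS, n) for k, n in cars.items()}
# Reduction of the CARS-IRIV subset relative to the CARS subset
vs_cars = {k: percent_reduction(cars[k], cars_iriv[k]) for k in cars}
```

Here `vs_full` reproduces 90.25%, 92.26%, and 94.58%, and `vs_cars` reproduces 34.92%, 12.00%, and 22.86%.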

Optimal models of crude protein, tannin, and crude fat

For crude protein content, the optimal results were achieved by combining the CARS-IRIV characteristic extraction algorithm with PLS. The prediction set yields R_p², RMSE_p, and RPD_p values of 0.69, 0.80%, and 1.80, respectively. For tannin, the optimal results were achieved by combining the BOSS-IRIV characteristic extraction algorithm with PLS. The prediction set yields R_p², RMSE_p, and RPD_p values of 0.88, 0.22%, and 2.84, respectively. For crude fat content, the optimal results were achieved by combining the BOSS-IRIV characteristic extraction algorithm with ELM. The prediction set yields R_p², RMSE_p, and RPD_p values of 0.61, 0.32%, and 1.61, respectively. The results show that the tannin detection model achieves high accuracy and that the crude protein and crude fat detection models perform well. Figure 6a–c shows the fitting of these models.
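For readers who want to reproduce the three indexes, the sketch below uses their common definitions: R² as one minus the ratio of residual to total sum of squares, RMSE as the root mean squared residual, and RPD as the standard deviation of the reference values divided by the RMSE. These definitions are assumed here rather than taken from this section, and the sample values are hypothetical.

```python
import numpy as np

def prediction_metrics(y_true, y_pred):
    """Return (R2, RMSE, RPD) for a prediction set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    rmse = np.sqrt(np.mean(residual ** 2))
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    rpd = np.std(y_true, ddof=1) / rmse   # sample SD of reference values over RMSE
    return r2, rmse, rpd

# Hypothetical measured and predicted contents (%)
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2, rmse, rpd = prediction_metrics(measured, predicted)
```

Under these definitions, RPD² equals n/(n−1) · 1/(1−R²), so for a reasonably large prediction set RPD is close to 1/√(1−R²). The reported pairs are broadly consistent with this relation: R_p² values of 0.69, 0.88, and 0.61 give roughly 1.80, 2.89, and 1.60, against reported RPD_p values of 1.80, 2.84, and 1.61.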

Conclusion

In this study, we employed VIS-NIR spectroscopy, in conjunction with chemical determination methods, to examine three critical nutrient components (crude protein, tannin, and crude fat) of 93 sorghum varieties. The objective was to pinpoint interpretable wavelengths linked to these constituents and to develop robust detection models. The proposed joint strategy, which incorporates the CARS-IRIV and BOSS-IRIV key wavelength extraction algorithms, eliminates a significant amount of redundant data and extracts both strongly and weakly informative wavelengths. The results indicate that 41, 38, and 22 characteristic wavelengths were extracted for crude protein, tannin, and crude fat, respectively. Furthermore, the CARS-IRIV-PLS model was suitable for crude protein detection, the BOSS-IRIV-PLS model gave favorable results for tannin detection, and the BOSS-IRIV-ELM model yielded the best results for crude fat detection.

In summary, these detection models achieved real-time and nondestructive detection of crude protein, tannin, and crude fat contents in sorghum grains. They offer theoretical support and guidance for the use of spectral techniques in detecting other grain components. Looking ahead, we plan to integrate the multiple independent models into a single comprehensive model and to combine big data analysis with artificial intelligence technology. On this basis, a real-time detection system for grain nutritional components will be built. Such a system would enable simultaneous detection of various nutritional components in grains, enhance prediction efficiency, and help ensure quality and safety, showing great potential for the synchronous detection of multiple nutritional components in grains during complex industrial processes.

Data availability

All data generated or analysed during this study are included in this published article.

References

Khoddami, A. et al. Sorghum in foods: Functionality and potential in innovative products. Crit. Rev. Food Sci. Nutr. 63 (9), 1170–1186. https://doi.org/10.1080/10408398.2021.1960793 (2023).
Carcedo, A. J. et al. Environment characterization in Sorghum (Sorghum bicolor L.) by modeling water-deficit and heat patterns in the Great Plains Region, United States. Front. Plant Sci. 13, 768610. https://doi.org/10.3389/fpls.2022.768610 (2022).
Bakari, H. et al. Sorghum (Sorghum bicolor L. Moench) and its main parts (by-products) as promising sustainable sources of value-added ingredients. Waste Biomass Valoriz. 14 (4), 1023–1044. https://doi.org/10.1007/s12649-022-01992-7 (2023).
Wang, H. et al. Regulation of density and fertilization on crude protein synthesis in forage maize in a semiarid rain-Fed area. Agriculture 13 (3), 715. https://doi.org/10.3390/agriculture13030715 (2023).
Zeng, X., Jiang, W., Du, Z. & Kokini, J. L. Encapsulation of tannins and tannin-rich plant extracts by complex coacervation to improve their physicochemical properties and biological activities: A review. Crit. Rev. Food Sci. Nutr. 63, 3005–3018. https://doi.org/10.1080/10408398.2022.2075313 (2023).
Kordan, B., Nietupski, M., Ludwiczak, E., Gabryś, B. & Cabaj, R. Selected cultivar-specific parameters of wheat grain as factors influencing intensity of development of grain weevil Sitophilus granarius (L). Agriculture 13 (8), 1492. https://doi.org/10.3390/agriculture13081492 (2023).
Rizvi, N. B., Aleem, S., Khan, M. R., Ashraf, S. & Busquets, R. Quantitative estimation of protein in sprouts of vigna radiate (mung beans), lens culinaris (Lentils), and cicer arietinum (Chickpeas) by kjeldahl and lowry methods. Molecules 27 (3), 814. https://doi.org/10.3390/molecules27030814 (2022).
Zhao, H. et al. The application of machine-learning and Raman spectroscopy for the rapid detection of edible oils type and adulteration. Food Chem. 373, 131471. https://doi.org/10.1016/j.foodchem.2021.131471 (2022).
Huang, L., Luo, R., Liu, X. & Hao, X. Spectral imaging with deep learning. Light: Sci. Appl. 11 (1), 61. https://doi.org/10.1038/s41377-022-00743-6 (2022).
Huang, H. et al. Rapid and nondestructive prediction of amylose and amylopectin contents in sorghum based on hyperspectral imaging. Food Chem. 359, 129954. https://doi.org/10.1016/j.foodchem.2021.129954 (2021).
Fatemi, A., Singh, V. & Kamruzzaman, M. Identification of informative spectral ranges for predicting major chemical constituents in corn using NIR spectroscopy. Food Chem. 383, 132442. https://doi.org/10.1016/j.foodchem.2022.132442 (2022).
Zareef, M. et al. Recent advances in assessing qualitative and quantitative aspects of cereals using nondestructive techniques: A review. Trends Food Sci. Technol. 116, 815–828. https://doi.org/10.1016/j.tifs.2021.08.012 (2021).
Chen, J. et al. Rapid and non-destructive analysis for the identification of multi-grain rice seeds with near-infrared spectroscopy. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 219, 179–185. https://doi.org/10.1016/j.saa.2019.03.105 (2019).
Zhu, L. et al. Variation analysis of starch properties in tartary buckwheat and construction of near-infrared models for rapid non-destructive detection. Plants-Basel 13 (15), 2155. https://doi.org/10.3390/plants13152155 (2024).
Fan, S. et al. Establishment of non-destructive methods for the detection of amylose and fat content in single rice kernels using near-infrared spectroscopy. Agricult.-Basel 12 (8), 1258. https://doi.org/10.3390/agriculture12081258 (2022).
Caporaso, N., Whitworth, M. B. & Fisk, I. D. Near-Infrared spectroscopy and hyperspectral imaging for non-destructive quality assessment of cereal grains. Appl. Spectrosc. Rev. 53 (8), 667–687. https://doi.org/10.1080/05704928.2018.1425214 (2018).
Almoujahed, M. B. et al. Non-destructive detection of fusarium head blight in wheat kernels and flour using visible near-infrared and mid-infrared spectroscopy. Chemometr. Intell. Lab. Syst. 245, 105050. https://doi.org/10.1016/j.chemolab.2023.105050 (2024).
Shi, T. et al. Using VIS-NIR hyperspectral imaging and deep learning for non-destructive high-throughput quantification and visualization of nutrients in wheat grains. Food Chem. 461, 140651. https://doi.org/10.1016/j.foodchem.2024.140651 (2024).
Shuai, L., Li, Z., Chen, Z., Luo, D. & Mu, J. A research review on deep learning combined with hyperspectral imaging in multiscale agricultural sensing. Comput. Electron. Agric. 217, 108577. https://doi.org/10.1016/j.compag.2023.108577 (2024).
Sáez-Plaza, P., Michałowski, T., Navas, M. J., Asuero, A. G. & Wybraniec, S. An overview of the Kjeldahl method of nitrogen determination. Part I. Early history, chemistry of the procedure, and titrimetric finish. Crit. Rev. Anal. Chem. 43 (4), 178–223. https://doi.org/10.1080/10408347.2012.751786 (2013).
Sáez-Plaza, P., Navas, M. J., Wybraniec, S., Michałowski, T. & Asuero, A. G. An overview of the Kjeldahl method of nitrogen determination. Part II. Sample preparation, working scale, instrumental finish, and quality control. Crit. Rev. Anal. Chem. 43 (4), 224–272. https://doi.org/10.1080/10408347.2012.751787 (2013).
Carmona, A., Seidl, D. S. & Jaffe, W. G. Comparison of extraction methods and assay procedures for the determination of the apparent tannin content of common beans. J. Sci. Food. Agric. 56 (3), 291–301. https://doi.org/10.1002/jsfa.2740560305 (1991).
Palacios, C. E., Nagai, A., Torres, P., Rodrigues, J. A. & Salatino, A. Contents of tannins of cultivars of sorghum cultivated in Brazil, as determined by four quantification methods. Food Chem. 337, 127970. https://doi.org/10.1016/j.foodchem.2020.127970 (2021).
Liu, C., Chen, F. S. & Xia, Y. M. Composition and structural characterization of peanut crude oil bodies extracted by aqueous enzymatic method. J. Food Compos. Anal. 105, 104238. https://doi.org/10.1016/j.jfca.2021.104238 (2022).
Munawar, A. A., Meilina, H. & Pawelzik, E. Near infrared spectroscopy as a fast and non-destructive technique for total acidity prediction of intact mango: Comparison among regression approaches. Comput. Electron. Agric. 193, 106657. https://doi.org/10.1016/j.compag.2021.106657 (2022).
Yao, K. et al. Non-destructive detection of egg qualities based on hyperspectral imaging. J. Food Eng. 325, 111024. https://doi.org/10.1016/j.jfoodeng.2022.111024 (2022).
Lang, X. et al. Detrending and denoising of industrial oscillation data. IEEE Trans. Industr. Inf. 19 (4), 5809–5820. https://doi.org/10.1109/TII.2022.3188844 (2022).
Dhanoa, M. S. et al. Methodology adjusting for least squares regression slope in the application of multiplicative scatter correction to near-infrared spectra of forage feed samples. J. Chemom. 37 (11), e3511. https://doi.org/10.1002/cem.3511 (2023).
Tang, G. et al. A new spectral variable selection pattern using competitive adaptive reweighted sampling combined with successive projections algorithm. Analyst 139 (19), 4894–4902. https://doi.org/10.1039/C4AN00837E (2014).
He, X., Huanyu, E. & Ding, G. Development of a CH 2-dependent analytical method using near-infrared spectroscopy via the integration of two algorithms: Non-dominated sorting genetic-II and competitive adaptive reweighted sampling (NSGAII-CARS). Anal. Methods. 15 (10), 1286–1296. https://doi.org/10.1039/D2AY02072F (2023).
Yan, H. et al. A modification of the bootstrapping soft shrinkage approach for spectral variable selection in the issue of over-fitting, model accuracy and variable selection credibility. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 210, 362–371. https://doi.org/10.1016/j.saa.2018.10.034 (2019).
Zhang, P. et al. Novel comprehensive variable selection algorithm based on multi-weight vector optimal selection and bootstrapping soft shrinkage. Infrared Phys. Techn. 133, 104800. https://doi.org/10.1016/j.infrared.2023.104800 (2023).
Wang, F., Wang, C. & Song, S. Rapid and low-cost detection of millet quality by miniature near-infrared spectroscopy and iteratively retaining informative variables. Foods 11 (13), 1841. https://doi.org/10.3390/foods11131841 (2022).
Mateos-Aparicio, G. Partial least squares (PLS) methods: Origins, evolution, and application to social sciences. Commun. Stat.-Theory Methods. 40 (13), 2305–2317. https://doi.org/10.1080/03610921003778225 (2011).
Liu, C. et al. Partial least squares regression and principal component analysis: Similarity and differences between two popular variable reduction approaches. Gen. Psychiatry. 35 (1), e100662. https://doi.org/10.1136/gpsych-2021-100662 (2022).
Deng, Y. et al. New methods based on back propagation (BP) and radial basis function (RBF) artificial neural networks (ANNs) for predicting the occurrence of haloketones in tap water. Sci. Total Environ. 772, 145534. https://doi.org/10.1016/j.scitotenv.2021.145534 (2021).
Zhou, H. et al. A novel hybrid model combined with ensemble embedded feature selection method for estimating reference evapotranspiration in the North China Plain. Agric. Water Manag. 296, 108807. https://doi.org/10.1016/j.agwat.2024.108807 (2024).
Piepho, H. P. A coefficient of determination (R²) for generalized linear mixed models. Biom. J. 61 (4), 860–872. https://doi.org/10.1002/bimj.201800270 (2019).
Chen, L., Wu, X., Lopes, A. M., Yin, L. & Li, P. Adaptive state-of-charge estimation of lithium-ion batteries based on square-root unscented Kalman filter. Energy 252, 123972. https://doi.org/10.1016/j.energy.2022.123972 (2022).
Wang, G. et al. The application of discrete wavelet transform with improved partial least-squares method for the estimation of soil properties with visible and near-infrared Spectral Data. Remote Sens. 10 (6), 867. https://doi.org/10.3390/rs10060867 (2018).
Qiao, M. et al. Integration of spectral and image features of hyperspectral imaging for quantitative determination of protein and starch contents in maize kernels. Comput. Electron. Agric. 218, 108718. https://doi.org/10.1016/j.compag.2024.108718 (2024).
Wang, Z. et al. Rapid detection of protein content in rice based on Raman and near-infrared spectroscopy fusion strategy combined with characteristic wavelength selection. Infrared Phys. Technol. 129, 104563. https://doi.org/10.1016/j.infrared.2023.104563 (2023).
Zhang, P. et al. Rapid detection of tannin content in wine grapes using hyperspectral technology. Life 14 (3), 416. https://doi.org/10.3390/life14030416 (2024).
Fei, X. et al. The rapid non-destructive detection of the protein and fat contents of sorghum based on hyperspectral imaging. Food. Anal. Methods. 16 (11), 1690–1701. https://doi.org/10.1007/s12161-023-02529-x (2023).


Funding

This work was supported by Key Research and Development Program Project of Shanxi Province, grant number 202102140601013; Central Government Guides Local Funds for Scientific and Technological Development, grant number YDZJSX20231C009; School academic recovery project, grant number 2023XSHF2; Construction Project of Shanxi Modern Agricultural Industry Technology System, grant number 24142C0207001; Youth Scientific Research Project of Shanxi Province, grant number 202203021212450.

Author information

Authors and Affiliations

College of Agricultural Engineering, Shanxi Agricultural University, Jinzhong, 030801, China
Kai Wu, Zilin Zhang, Xiuhan He, Gangao Li & Decong Zheng
College of Information Science and Engineering, Shanxi Agricultural University, Jinzhong, 030801, China
Zhiwei Li
Dryland Farm Machinery Key Technology and Equipment Key Laboratory of Shanxi Province, Jinzhong, 030801, China
Kai Wu, Zilin Zhang, Xiuhan He, Gangao Li, Decong Zheng & Zhiwei Li


Contributions

Conceptualization, K.W. and Z.L.; methodology, K.W.; analysis, K.W., Z.Z. and X.H.; investigation, Z.Z. and G.L.; resources, D.Z. and Z.L.; data curation, K.W. and D.Z.; writing – original draft preparation, K.W.; writing – review and editing, D.Z. and Z.L.; supervision, Z.Z. All authors reviewed the manuscript and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Decong Zheng or Zhiwei Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Wu, K., Zhang, Z., He, X. et al. Using visible and NIR hyperspectral imaging and machine learning for nondestructive detection of nutrient contents in sorghum. Sci Rep 15, 6067 (2025). https://doi.org/10.1038/s41598-025-90892-6

Received: 23 September 2024
Accepted: 17 February 2025
Published: 19 February 2025
Version of record: 19 February 2025
DOI: https://doi.org/10.1038/s41598-025-90892-6
