Abstract
Data analytics is increasingly important in agriculture, particularly in smart farming, where it enhances decision-making and sustainability. Research on the factors affecting moisture content removal in black pepper drying using solar dryers is crucial for reducing costs and improving product quality and quantity. This drying process involves numerous parameters, resulting in big data complexity. Heterogeneity among these parameters can introduce bias, leading to incorrect inferences, while multicollinearity and outliers affect model validation and interpretation. This study proposes hybrid models of sparse and robust regression to solve the heterogeneity problem using black pepper big data. Sparse regression techniques, namely elastic net, ridge and LASSO, are used to identify the 25, 35, 45, 55 and 100 highest-ranking variables for black pepper moisture content removal. These models are hybridized with robust regression estimators (M Bi-Square, M Hampel, M Huber, S and MM) to handle outliers. Results indicate that before heterogeneity removal, the hybrid Ridge model with the M Bi-Square estimator performs best under both 2-sigma and 3-sigma limits. After heterogeneity removal, the hybrid LASSO model with the S estimator proves to be the most effective across both limits.
Introduction
Data analytics in statistics is an effective method for analysing complex data to gain meaningful insights, supported by visualisation tools to summarise data, detect trends, and aid decision-making across various fields1. In agriculture, it plays a vital role in smart farming by enhancing efficiency, sustainability, and crop yields. Managing complex and varied agricultural data requires efficient systems for fast and reliable processing2. By integrating advanced systems and innovations such as machine learning, the Internet of Things (IoT), Artificial Intelligence (AI), sensors and automation, smart farming empowers farmers with real-time insights and optimised decision-making3. This innovative approach addresses global issues like food security, climate change, and increasing costs while improving resource efficiency and promoting sustainable agricultural practices4. Notably, these technologies significantly enhance various stages of herb and spice cultivation and processing, preserving and improving their quality. A notable example is black pepper.
Black pepper, renowned as the “King of Spices”, enjoys widespread recognition and is extensively consumed as a spice worldwide. Peppercorns that are almost ripe or have a greenish appearance are sun-dried until they turn brownish black to produce black pepper. The main processing steps in black pepper production include harvesting, blanching, drying, cleaning, grading, packaging and storage5. Piperine, a significant bioactive compound in black pepper, exhibits diverse physiological and drug-like effects. Black pepper is the most traded spice globally, particularly in Asia, with Vietnam, Brazil, Indonesia, India and Malaysia as the leading producers in 20226.
The primary method for preserving black pepper is through the drying process, crucial for moisture removal and prevention of microbial decay. Water activity is a crucial determinant of the stability, safety, and quality of agricultural products. It represents the amount of free water available for microbial and biochemical reactions, which directly affects product shelf life and nutritional properties7. Controlling water activity is therefore essential in the drying of black pepper, as inadequate or uneven moisture removal can increase the available water for microbial growth and oxidative reactions8. Such reactions accelerate the degradation of essential oils and bioactive compounds like piperine, which are responsible for black pepper’s distinctive aroma, flavour and therapeutic properties9. Therefore, maintaining optimal water activity through effective drying techniques is essential to ensure microbial safety, chemical stability and overall product quality.
Drying also minimizes pest infestations and reduces the volume and weight of the pepper for better storage and transportation suitability10. Harvested peppercorns typically contain around 65 to 70% d.b. moisture and require drying to a moisture content of 11% to 12% for storage. Sun-drying black pepper, the conventional method, presents challenges such as a significantly prolonged drying period and contamination from dust, dirt, insects and other pollutants, which negatively affect product quality. Meanwhile, mechanical drying systems like tunnel dryers and fluidized bed dryers also exhibit drawbacks, such as inefficiency, high fossil fuel consumption and a requirement for substantial labour11.
Solar dryers present a promising alternative to conventional drying methods, as they can effectively control moisture loss during drying to preserve product quality and nutritional value. They offer numerous advantages, including requiring less space, producing clean and high-quality commodities, avoiding insect and animal threats and providing a controlled drying process12,13,14. However, assessing solar dryer efficiency requires consideration of various factors impacting dried product quality. Several studies have explored solar dryers for black pepper drying, with15 achieving a final moisture content of 9.4% over a 12-h drying period using an indirect-type solar-biomass hybrid dryer. These findings demonstrate the enhanced product quality achievable with solar dryers compared to conventional methods.
However, drying black pepper using a solar dryer with smart IoT monitoring systems involves numerous parameters, resulting in complex big data processing in the cloud database. These complexities come with many challenges, particularly in agricultural data analysis, where issues such as heterogeneity, multicollinearity and outliers are common16,17. Addressing these issues through advanced data analytics is essential for improving agricultural processes and ensuring the sustainable cultivation of crops like black pepper. In Malaysia, these challenges have resulted in losses of over 25% of the nation’s pepper crop18. Therefore, this highlights the need for developing a hybrid solar dryer with optimized parameters to preserve the crop’s nutritional value.
From an agricultural perspective, identifying the key parameters that influence moisture content removal provides valuable insights for improving drying performance and preserving the nutritional quality in black pepper. These parameters determine the drying rate and uniformity, which in turn affect the biochemical stability of bioactive compounds such as piperine and volatile oils. By identifying the significant factors that most strongly influence moisture diffusion and evaporation behavior, variable selection helps optimize drying conditions that maintain desirable product quality. However, this process is challenging since black pepper drying involves many interdependent parameters, such as temperature, humidity and air velocity, which can vary widely and introduce heterogeneity into the data.
One major issue in agricultural big data is heterogeneity, which denotes the degree of variability within the parameters, stemming from differences among the parameters themselves and from the varying measurement units of factors such as solar radiation, relative humidity and temperature17,18,19. This variability introduces noise into the data, making it difficult to obtain reliable measurements from different sensors. When such inconsistent data are used in predictive models, especially those depending on factors like temperature and humidity, it can significantly lower their accuracy. Addressing this heterogeneity is crucial for optimizing drying efficiency, improving product quality and ensuring consistent outcomes. Otherwise, it can result in incorrect predictions, poor decisions, financial losses and lower product quality20. Heterogeneity can also restrict the applicability of results due to a lack of agreement at the study level21. Studies such as22 highlighted that the assumption of homogeneity poses a significant challenge, as it contributes to heterogeneity issues, which can lead to biased and inconsistent standard errors. Separately,16 demonstrated that examining heterogeneity in seaweed big data enhanced the understanding of drying parameter dynamics and enabled more effective predictive modelling.
Another challenge is multicollinearity, which arises when two or more independent variables in a multiple linear regression model exhibit a strong linear association, potentially impacting the model’s stability23. This issue may lead to inaccuracies in parameter evaluation within regression models by inflating the standard errors, causing some previously significant variables to be statistically insignificant24. As a result, the estimated parameters in regression models become inconsistent and lack reliability, leading to a decrease in their precision25. Consequently, the model becomes less effective at making accurate forecasts, and there is a higher risk of overfitting the data. To address multicollinearity, variable selection can be combined with machine learning techniques to improve parameter estimates26.
Outliers tend to occur in agricultural data due to uncontrollable factors and natural variation27. These outliers, which deviate from the typical pattern or structure of the distribution, can arise from factors like human error, instrument errors, setup errors, mechanical problems and environmental changes28. The existence of outliers in data causes incorrect estimates of parameters, thus decreasing model precision and leading to unreliable results. Furthermore, they significantly affect the sample mean and standard deviation in statistical analysis, leading to either overestimation or underestimation of values. This serves as a simple illustration of how undesired outliers might impact data analysis outcomes29. Given these challenges, particularly the simultaneous presence of multicollinearity and outliers, recent research has explored advanced techniques such as quantile-based robust ridge regression estimators, modified robust ridge M-estimators and penalized M-estimators, which are designed to provide more reliable parameter estimates and enhance model precision30,31,32,33. In addition, because numerous factors affect the moisture content removal of black pepper, resulting in big data complexity, variable selection via machine learning algorithms is performed to identify significant factors. Insignificant variables are then removed to reduce overfitting and improve prediction accuracy.
Existing literature on black pepper drying, especially using solar dryers, is limited to a few studies15,34,35. While some research has addressed multicollinearity for black pepper (for example,35), heterogeneity and outliers in the context of moisture content removal for black pepper remain unexplored. Understanding of heterogeneity in the application of big data for black pepper in agriculture remains limited despite its obvious presence in actual agricultural data. For instance,16 investigated heterogeneity in seaweed drying, employing hybrid models that pair seven machine learning algorithms (Elastic Net, Ridge, Lasso, Bagging, Support Vector Machine, Random Forest and Boosting) with robust regression methods (M Bi-Square, M Hampel, M Huber, MM and S) to identify the significant drying parameters influencing moisture content removal for the 45 highest-ranking variables, before and after addressing heterogeneity. Similarly, a study by36 focused on multicollinearity and heterogeneity in seaweed drying, applying Ridge, LASSO and Elastic Net machine learning algorithms, but used only box plots to detect outliers. These studies, however, are specific to seaweed data, highlighting a notable gap in agricultural research on black pepper.
Additionally, a literature gap exists regarding the consideration of interaction terms in agriculture, particularly for black pepper. Studying the effects of interaction variables is important, as the relationship between two or more variables can be studied thus providing a meaningful result and preventing bias27. Furthermore, no prior work has explored the use of hybrid sparse and robust regression models to identify the most influential parameters affecting moisture content removal in black pepper. Understanding these parameters is an essential step toward optimizing drying efficiency and ensuring consistent product quality. This highlights the broader potential of statistical modelling approaches to enhance process optimization and data-driven agricultural applications.
To address gaps in the existing literature, this study aims to determine the significant drying parameters of black pepper that directly impact heterogeneity and to assess their effects on moisture content removal, both before and after eliminating the heterogeneity parameters. Additionally, hybrid models that combine sparse and robust regression techniques are proposed to address heterogeneity in black pepper big data and enhance the accuracy of predicting moisture content removal during the black pepper drying process. The selection of significant parameters is conducted using the 25, 35, 45, 55 and 100 highest-ranking variables, which include interaction factors, by applying three sparse regression models, namely elastic net, ridge and LASSO. Furthermore, hybrid models are developed by integrating these sparse models with robust regression estimators (M Bi-Square, M Huber, M Hampel, S and MM estimators) to address multicollinearity and outliers effectively. The performance and accuracy of the models are first assessed using evaluation metrics, while the best model is selected using the eight selection criteria (8SC).
Methodology
Flowchart of study
The study is divided into three phases, outlined in Fig. 1, mapping the study’s objectives.
Flowchart of the study.
Phase I: This project begins by collecting data from the black pepper drying process, applying the Modified Hybrid Solar Dryer (MHSD) in Setiu, Terengganu, Malaysia. The computations cover all possible models, considering interactions up to the second order. R software is used to verify the assumptions related to linearity, errors, independent variables and multicollinearity. Next, this study uses the Variance Inflation Factor (VIF) and boxplot analysis to identify the parameters that show heterogeneity, as employed by16 and17. The VIF is computed using the vif() function from the ‘car’ package in R on the original dataset, considering only the main effects of the independent variables. R-squared values for the main drying parameters can then be determined through Eq. (1) from the VIF values obtained.
Once the minimum and maximum R-squared values are identified, the average R-squared value for the main drying parameters is computed. This value then serves as a benchmark for detecting heterogeneity. When the R-squared value for the primary drying parameters falls below this benchmark, it suggests potential heterogeneity.
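The benchmark procedure above can be sketched as follows. The study computes VIF in R with the car package; this Python sketch uses hypothetical VIF values, but the conversion \(R^{2} = 1 - 1/VIF\) and the below-average flagging rule follow the text.

```python
# Heterogeneity screening sketch: recover R-squared from each parameter's VIF,
# take the average R-squared as the benchmark, and flag parameters whose
# R-squared falls below it. The VIF values here are hypothetical stand-ins.
vifs = {"T1": 120.5, "T7": 3.2, "T11": 2.8, "H1": 85.0}  # hypothetical values

r2 = {name: 1 - 1 / v for name, v in vifs.items()}
benchmark = sum(r2.values()) / len(r2)
flagged = [name for name, val in r2.items() if val < benchmark]
print(flagged)  # parameters showing potential heterogeneity
```

With these illustrative values, T7 and T11 fall below the benchmark, mirroring the kind of result reported later in the paper.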
Phase II: Ridge, LASSO and elastic net are the three proposed sparse regression techniques that will be implemented in R software to select variables and identify key parameters affecting moisture ratio removal of black pepper. Consequently, these mentioned sparse regression techniques will independently select the highest-ranking 25, 35, 45, 55 and 100 variables. The machine learning algorithms proposed can provide information about the ranking of the important variables but do not specify the exact number of significant variables to include in a model37. The model’s performance and accuracy will be assessed using the evaluation metrics Sum of Squared Error (SSE), Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), and R-squared.
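The ranking step can be illustrated with a small sketch (in Python rather than R, on synthetic data): a penalized model is fitted on standardized variables, and the k largest absolute coefficients are taken as the highest-ranking variables. Closed-form ridge stands in here for the three sparse models used in the study.

```python
# Variable-ranking sketch: fit ridge on standardized synthetic data and keep
# the k variables with the largest absolute coefficients.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = 3 * X[:, 2] - 2 * X[:, 7] + rng.normal(scale=0.5, size=300)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize predictors
y_std = (y - y.mean()) / y.std()

lam = 1.0
p = X_std.shape[1]
beta = np.linalg.solve(X_std.T @ X_std + lam * np.eye(p), X_std.T @ y_std)

k = 2
top_k = np.argsort(np.abs(beta))[::-1][:k]     # indices of the k highest-ranking variables
print(sorted(top_k.tolist()))
```

As expected, the two truly influential variables (indices 2 and 7 in this synthetic setup) receive the top ranks.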
Phase III: Hybrid models are then developed by combining the mentioned sparse regression techniques with robust regression estimators, namely the M Bi-Square, M Huber, M Hampel, S and MM estimators. Outliers are then identified using robust estimation and the two- and three-sigma limits. The best models, before and after removing heterogeneity parameters, are selected using the 8SC, which comprise the Akaike Information Criterion (AIC), Final Prediction Error (FPE), Generalized Cross-Validation (GCV), Hannan-Quinn information criterion (HQ), Risk Inflation Criterion Estimate (RICE), SCHWARZ, SGMASQ and SHIBATA.
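The two- and three-sigma screening can be illustrated as below. The residuals are synthetic; in the study the rule is applied to residuals from the robust fits.

```python
# Sigma-limit outlier sketch: flag observations whose residual deviates from
# the mean by more than two or three standard deviations.
import statistics

residuals = [0.1, -0.2, 0.05, 0.3, -0.1, 4.0, 0.15, -0.25, 0.0, 0.2]
mu = statistics.mean(residuals)
sigma = statistics.stdev(residuals)

outliers_2s = [i for i, r in enumerate(residuals) if abs(r - mu) > 2 * sigma]
outliers_3s = [i for i, r in enumerate(residuals) if abs(r - mu) > 3 * sigma]
print(outliers_2s, outliers_3s)
```

Note that the two-sigma limit is the stricter screen: here it flags the aberrant observation while the three-sigma limit does not, which is why the study reports results under both limits.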
Data description
The solar dryer used in this study is the Modified Hybrid Solar Dryer (MHSD), installed in Setiu, Terengganu, Malaysia, due to the predominant economic activities being the cultivation and drying of black pepper. Categorized as a forced convection indirect type, MHSD was adopted as the smart farming technology to dry the black pepper, as illustrated in Fig. 2.
MHSD simulation diagram.
In this study, the moisture content of black pepper serves as the dependent variable, while the independent variables include solar radiation, ambient temperature, ambient relative humidity, collector temperature, chamber temperature and chamber relative humidity. The data consist of 1924 observations with 29 independent variables and one dependent variable. Interaction variables up to the second order are also considered. For instance, T1T5 indicates the interaction between T1 and T5. As a result, the data include 29 main variables and 406 interaction variables, giving an overall count of 435 independent variables that affect the moisture content removal of black pepper.
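The variable count can be verified with a short sketch: 29 main effects yield \(\binom{29}{2} = 406\) second-order pairwise interactions, 435 variables in total. The generic names below are placeholders, not the study’s actual parameter labels.

```python
# Verify the count of second-order interaction terms among 29 main effects.
from itertools import combinations

main = [f"T{i}" for i in range(1, 30)]          # 29 main parameters (generic names)
interactions = [a + b for a, b in combinations(main, 2)]

print(len(main), len(interactions), len(main) + len(interactions))
```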
Multiple linear regression (MLR)
The MLR model serves as a statistical method for analysing how a dependent variable \(y_{i}\) relates to multiple explanatory variables \(x_{i} = \left( {x_{i1} ,x_{i2} , \cdots ,x_{ip} } \right)\), where \(p\) represents the number of explanatory variables. Consider an MLR with \(n\) observations:
where \({\varvec{y}}\) is a \(n \times 1\) vector representing the dependent variable, while \({\varvec{X}}\) is the \(n \times p\) design matrix. The unknown parameters are represented by \({\varvec{\beta}}\), a \(p \times 1\) vector, and \({\varvec{\varepsilon}}\) is the \(n \times 1\) error term, which is normally distributed with a mean of zero and has uncorrelated, homoscedastic errors38.
Ordinary Least Squares (OLS) is a regression analysis technique for estimating \({\varvec{\beta}}\) by minimizing the sum of squared differences between the observed and predicted values of the dependent variable y17. According to39,40, the OLS estimator of \({\varvec{\beta}}\) is obtained by minimizing \(\varepsilon^{\prime} \varepsilon\) as follows:
From Eq. 2, if \(y_{i}\) represents the outcome, then \(x_{i} = \left( {x_{i1} ,x_{i2} , \cdots ,x_{ip} } \right)^{T}\) denotes the predictor vector for the ith case. In the MLR model, parameter estimation is performed using the OLS method, where the coefficients \(\beta = \left( {\beta_{0} ,\beta_{1} ,\beta_{2} , \cdots ,\beta_{p} } \right)^{T}\) are determined to achieve the best fit by minimizing the sum of squared residuals (SSR)41,42,43. The SSR is expressed as \(SSR = \mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \beta_{0} - \mathop \sum \limits_{j = 1}^{p} \beta_{j} x_{ij} } \right)^{2}\).
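A minimal numerical sketch of OLS on synthetic, noise-free data, solving the normal equations \(X^{\prime}X\beta = X^{\prime}y\):

```python
# OLS sketch: beta_hat = (X'X)^{-1} X'y minimizes the SSR. The synthetic data
# follow y = 1 + 2x exactly, so the fit recovers (1, 2) with zero SSR.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
y = 1.0 + 2.0 * x

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
ssr = float(np.sum((y - X @ beta_hat) ** 2))
print(beta_hat, ssr)
```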
Heterogeneity identification
As mentioned by16,17,22, the following multiple linear regression model is considered:
Here, \(Y_{i}\) for i = 1,2, …, n corresponds to the observed moisture content for the \(i^{th}\) case, the β′s denote the regression coefficients for the predictor variables (the drying parameters, T′s), \(\alpha_{j}\) represents the parameters exhibiting heterogeneity for j = 1,2, …, f and ε is the random error. Omitting an important variable from the regression equation can result in inaccurate and biased estimates of β. Additionally, there is a risk that variables may correlate with the error term, leading to a violation of regression assumptions. VIF is the most commonly used and simplest method to detect the presence of multicollinearity26. Stronger linear relationships between variables result in higher \(R^{2}\) values and subsequently lead to increased \(VIF_{i}\)44. A higher \(VIF\) indicates more serious multicollinearity among variables, with values exceeding 10 indicating its presence. VIF is defined as \(VIF_{i} = \frac{1}{{1 - R_{i}^{2} }}\); hence, \(R_{i}^{2} = 1 - \frac{1}{{VIF_{i} }}\).
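The definition can be illustrated directly: each \(VIF_{i}\) is obtained by regressing predictor \(i\) on the remaining predictors and applying \(1/(1 - R_{i}^{2})\). This NumPy sketch (the study uses car::vif() in R) builds synthetic data with two nearly collinear predictors.

```python
# VIF sketch via auxiliary regressions: VIF_i = 1 / (1 - R_i^2), where R_i^2
# comes from regressing predictor i on the other predictors (plus intercept).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

def vif(X, i):
    y = X[:, i]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

print([round(vif(X, i), 1) for i in range(3)])
```

The two collinear predictors show VIFs well above the usual threshold of 10, while the independent one stays near 1.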
Hybrid sparse and robust regression techniques
This study uses sparse regression models as the variable selection approach to select significant factors affecting the moisture ratio removal of black pepper. Robust regression models are employed to detect outliers effectively. Robust regression offers a more effective approach than traditional regression techniques, particularly for datasets containing outliers and heteroscedasticity, enabling more precise and efficient parameter estimation20. However, for sparse regression, standardizing both the independent and dependent variables before the estimation process is essential, ensuring they have a zero mean and unit variance. This way, the results do not depend on the measurement scale, ensuring that independent variables have equal consideration and are on a comparable scale45. The focus on standardization is on sparse regression because of its effectiveness in addressing multicollinearity due to the interaction of variables. Therefore, combining sparse and robust regression techniques can further improve forecasting and enhance black pepper moisture removal prediction by addressing multicollinearity and outliers.
Sparse regression models
Ridge regression
Ridge, referred to as L2 regularization, is commonly used in statistics and machine learning, as it works by regularizing the estimated coefficients and is particularly effective in handling multicollinearity issues in the data46. By shrinking the coefficients towards zero, Ridge regression helps reduce overfitting, though the coefficients will never reach exactly zero. This improves prediction accuracy, but at the cost of a slight increase in bias47.
According to20,22, the ridge regression coefficient estimates \(\hat{\beta }^{RR}\) minimize \(\mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \beta_{0} - \mathop \sum \limits_{j = 1}^{p} x_{ij} \beta_{j} } \right)^{2} + \lambda \mathop \sum \limits_{j = 1}^{p} \beta_{j}^{2} ,\)
where \(\lambda \ge\) 0 serves as the regularization parameter that controls the shrinkage effect and L2 penalty is the second term. Ridge regression determines the coefficient estimates that minimize the SSR while ensuring a good fit to the data. The term \(\lambda \mathop \sum \limits_{j = 1}^{p} \beta_{j}^{2}\) serves as the shrinkage penalty. One advantage of ridge regression is its ability to minimize bias in large datasets. By constraining the coefficient estimates, ridge regression reduces the estimator’s variance, introducing some bias as a trade-off. However, ridge regression applies continuous shrinkage to coefficients without setting any to zero, leading to a less interpretable model while retaining all predictors.
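The shrinkage behaviour described above can be seen numerically: as \(\lambda\) grows, the closed-form ridge solution \((X^{\prime}X + \lambda I)^{-1} X^{\prime}y\) moves towards zero without ever reaching it (synthetic data):

```python
# Ridge shrinkage sketch: larger lambda shrinks the coefficients towards zero,
# but none of them become exactly zero.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

for lam in [0.0, 10.0, 1000.0]:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(lam, np.round(beta, 3))
```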
LASSO regression
LASSO regression or Least Absolute Shrinkage and Selection Operator regression, also known as L1 regularization, is a regression analysis method that combines parameter shrinkage with variable selection48,49. Unlike Ridge regression, LASSO regression has the ability to shrink some coefficients to zero when the regularization parameter, λ is large50. In other words, it shrinks certain regression coefficients while completely eliminating others if they are insignificant, effectively performing variable selection51. As a result, LASSO produces a simpler, more efficient model by removing irrelevant data and reducing the number of parameters. This makes it particularly useful for handling multicollinearity and preventing overfitting47.
The LASSO regression coefficient estimate \(\hat{\beta }^{LASSO}\) minimizes52 \(\mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \beta_{0} - \mathop \sum \limits_{j = 1}^{p} x_{ij} \beta_{j} } \right)^{2} + \lambda \mathop \sum \limits_{j = 1}^{p} \left| {\beta_{j} } \right|.\)
Here, \({\uplambda }\) represents a positive regularization parameter and the second term corresponds to the L1 penalty53. LASSO regression stands out as a versatile and effective method for handling complex datasets and improving predictive modelling outcomes51. Nevertheless, the L1 penalty term in LASSO applies the same penalty to all coefficients, which can introduce bias, especially for large coefficients. This approach can sometimes exclude important variables if they have relatively smaller coefficients.
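The mechanism behind the exact-zero behaviour is soft-thresholding: in the orthonormal-design case, the LASSO estimate is \({\text{sign}}(b)\max (|b| - \lambda ,0)\). A minimal sketch of this mechanism (not the full coordinate-descent solver):

```python
# Soft-thresholding sketch: coefficients smaller in magnitude than lambda are
# set exactly to zero, while larger ones are shrunk towards zero by lambda.
def soft_threshold(b, lam):
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

ols_coefs = [2.5, -0.3, 0.1, -1.8]
lasso_coefs = [soft_threshold(b, 0.5) for b in ols_coefs]
print(lasso_coefs)   # the small coefficients are eliminated
```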
In this study, the optimal tuning parameters (λ₁ for LASSO and λ₂ for Ridge regression) were selected using five-fold cross-validation implemented through the cv.glmnet() function in R. According to54, the λ₂ penalty is first evaluated over a predefined grid of values, and for each λ₂ value, the Elastic Net solution path is obtained. The selected λ₂ is the value that produces the lowest cross-validation error. The second tuning parameter, λ₁, is then selected through five-fold cross-validation. Cross-validation is one of the most effective and widely used approaches for selecting tuning parameters, as it directly estimates prediction error. Five-fold cross-validation is adopted for its computational efficiency while maintaining relatively low bias and variance. This process ensures a balanced trade-off between bias and variance, thereby improving predictive accuracy and model generalization55.
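The grid-search-over-folds idea can be sketched as follows; the study uses cv.glmnet() in R, so this NumPy version with a ridge penalty and a small hypothetical \(\lambda\) grid only illustrates the procedure:

```python
# Five-fold cross-validation sketch: for each lambda on the grid, average the
# held-out MSE over five folds and pick the lambda with the lowest error.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=200)

def cv_error(lam, k=5):
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    errs = []
    for fold in folds:
        mask = np.ones(n, dtype=bool)
        mask[fold] = False                        # hold out this fold
        beta = np.linalg.solve(X[mask].T @ X[mask] + lam * np.eye(5),
                               X[mask].T @ y[mask])
        errs.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return float(np.mean(errs))

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=cv_error)
print(best_lam)
```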
Elastic net
Elastic Net regression was introduced to address the instability of LASSO when predictors are highly correlated, making it a robust solution for analysing high-dimensional data56. Elastic Net regression integrates the properties of Ridge (L2) and Lasso (L1) norms as a regularization technique. By integrating both penalties, it balances feature selection with coefficient stability, making it effective for handling datasets with highly correlated features. Moreover, it effectively addresses multicollinearity among predictor variables57.
According to54, the Elastic Net regression coefficient estimate \(\hat{\beta }^{ENR}\) minimizes \(\mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \beta_{0} - \mathop \sum \limits_{j = 1}^{p} x_{ij} \beta_{j} } \right)^{2} + \lambda_{1} \mathop \sum \limits_{j = 1}^{p} \left| {\beta_{j} } \right| + \lambda_{2} \mathop \sum \limits_{j = 1}^{p} \beta_{j}^{2} ,\)
where the parameter \(\lambda_{1}\) represents L1 regularization parameter, which determines the strength of the Lasso penalty, while \(\lambda_{2}\) corresponds to the L2 regularization parameter, managing the effect of the Ridge penalty. The Elastic Net offers the advantage of enforcing sparsity while allowing flexibility in selecting variables58. Additionally, it also encourages grouping among highly correlated variables. However, one concern with this approach is the risk of double shrinkage in naive elastic nets, requiring careful consideration when applying it17.
Robust regression estimations
M estimation
According to24,33,59, M-estimation builds on the maximum likelihood estimation method while also providing a more robust approach. The M-estimator minimizes the function ρ(∙), which operates on the residuals. It is given by:
The function \(\rho\) represents a \(\rho\)-type M-estimator. Assuming \(\sigma\) is known, the residuals for \(\beta\) are estimated as \(e_{i} = y_{i} - \beta^{T} x_{i}\). In M-estimation, \(\beta\) minimizes the objective function:
The \(\sigma\) is estimated in a robust way, and the scale of \(\tilde{\sigma }_{ME}\) in M-estimator has a defined solution:
where \(\beta\) is the \(p \times 1\) parameter vector, and the function \(\psi\) yields:
The function \(\psi \left( e \right) = \frac{\partial \rho \left( e \right)}{\partial e}\), the derivative of \(\rho\), is known as the influence function. The weight function is expressed as:
where the function \(\psi \left( e \right)\) is defined as:
\(\mathop \sum \limits_{i} w\left( {e_{i} } \right)e_{i} \frac{{\partial e_{i} }}{{\partial \beta_{i} }} = 0\), for \(i = 1,2, \ldots ,p\).
The objective is to solve the iterated re-weighted least square equation as follow:
where \(k\) represents the iteration number.
The M robust regression is divided into M-Bi-Square (Tukey), M-Huber and M-Hampel. Table 1 provides a summary of the three types of M robust regression methods.
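The iteratively re-weighted least squares scheme can be sketched with Huber weights, \(w(e) = 1\) for \(|e| \le c\) and \(c/|e|\) otherwise, with the usual tuning constant \(c = 1.345\). On synthetic data with one gross outlier, the robust fit recovers the underlying line where OLS would be pulled away:

```python
# IRLS sketch for Huber M-estimation: start from OLS, then alternate between
# computing robust weights from scaled residuals (MAD scale) and solving the
# weighted least-squares problem.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 5.0, 20)
y = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=20)
y[-1] += 15.0                                   # inject a gross outlier
X = np.column_stack([np.ones_like(x), x])

beta = np.linalg.lstsq(X, y, rcond=None)[0]     # OLS starting values
c = 1.345                                       # Huber tuning constant
for _ in range(50):
    e = y - X @ beta
    s = np.median(np.abs(e - np.median(e))) / 0.6745 + 1e-12   # MAD scale
    u = e / s
    w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))           # Huber weights
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

print(np.round(beta, 2))   # close to the true (1, 2) despite the outlier
```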
S-estimation
Based on the discussion by59,60, S-estimators, proposed by Rousseeuw and Yohai, derive from the residual scale used in M-estimation. The main limitation of M-estimation is that it does not account for the overall data distribution, since it relies only on the median as the weighted value, making it less representative of the entire dataset. To overcome this, the method incorporates the residual standard deviation. The S-estimator is defined by:
with identifying the minimum robust scale estimator \(\hat{\sigma }_{S}\) and ensuring
where
\(K = 0.199\), \(w_{i} = w_{\sigma } \left( {u_{i} } \right) = \frac{{\rho \left( {u_{i} } \right)}}{{u_{i}^{2} }}\) and the initial estimate is:
The solution is determined by taking the derivative with respect to β:
\(\psi\) is the function as derivatives of \(\rho\):
S-estimators are known to be more robust than M-estimators because they have lower asymptotic bias and variance, especially when dealing with contaminated data.
MM-estimation
MM-estimation integrates S-estimation, which has a high breakdown point, with M-estimation61. A study by60 compared the S, M and MM methods and found that MM was the most effective, as it had the smallest bias and mean square error (MSE). According to59, the MM estimation process consists of two steps. First, S-estimation estimates the regression coefficients by minimizing the robust residual scale; M-estimation is then applied. The MM estimator can be described as:
Here, \(SD_{MM}\) refers to the standard deviation determined based on the S-estimation residuals, and \(\rho\) represents Tukey’s biweight function, given by:
Model evaluation
Assessing a model’s precision is essential in regression analysis. This study evaluates model performance using SSE, MSE, MAPE and R-squared as evaluation metrics. These indicators facilitate the comparison of how well each regression model predicts moisture content removal. In general, lower MSE, SSE, and MAPE values indicate higher prediction accuracy, while a greater R-squared signifies stronger model fit to the data.
Each of these metrics serves a distinct purpose in evaluating model performance. MSE serves as a widely used performance indicator that evaluates the average squared deviation between actual and predicted values in the dataset. It is particularly useful for models that predict a continuous variable due to its connection to the principle of cross-entropy from information theory62. SSE quantifies the difference between the observed data and a predictive model, with lower SSE values indicating that the model can more accurately explain the data63. MAPE is frequently used as a performance metric for regression models due to its straightforward interpretation in relation to relative error. It represents the average absolute error expressed as a percentage over a sample64. A MAPE value below 10% indicates a highly accurate forecast, while values exceeding 50% suggest an inaccurate forecast65. Meanwhile, R-squared shows the proportion of variance in the dependent variable that is explained by the variation in the independent variables66.
According to17,67, this study will apply the R-squared value ranges outlined in Table 2, to assess regression model quality:
Table 3 presents the formulas for the evaluation metrics, where \({y}_{i}\) denote the actual observations, \({\widehat{y}}_{i}\) denote the predicted values, \(\overline{y }\) signifies the average of all observations, and n refers to the total count of observations.
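The four metrics can be transcribed directly from their standard formulas (as summarized in Table 3), here on toy values:

```python
# Evaluation-metric sketch: SSE, MSE, MAPE (in %) and R-squared on toy data.
actual = [10.0, 12.0, 15.0, 20.0]
pred = [11.0, 11.0, 16.0, 19.0]
n = len(actual)
mean_y = sum(actual) / n

sse = sum((a - p) ** 2 for a, p in zip(actual, pred))
mse = sse / n
mape = 100 / n * sum(abs((a - p) / a) for a, p in zip(actual, pred))
sst = sum((a - mean_y) ** 2 for a in actual)
r2 = 1 - sse / sst

print(sse, mse, round(mape, 2), round(r2, 4))
```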
Eight selection criteria
In the next step, the best model from each group is identified using the Eight Selection Criteria (8SC). Based on27,68 and69, the 8SC consists of Akaike Information Criterion (AIC), Final Prediction Error (FPE), Generalized Cross-Validation (GCV), Hannan-Quinn information criterion (HQ), Risk Inflation Criterion Estimate (RICE), SCHWARZ, SGMASQ, and SHIBATA. The optimal model is determined by selecting the one that yields the highest number of minimum values across these criteria.
The formulas for 8SC are presented in Table 4. Here, SSE represents the sum of squared errors, \(k + 1\) corresponds to the number of estimated parameters, and \(n\) denotes the sample size. As noted by70, these criteria can be applied only when the condition \(2\left( {k + 1} \right) < n\) is satisfied.
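As a hedged sketch, the commonly cited Ramanathan-style forms of these criteria are implemented below; the exact expressions should be cross-checked against Table 4, and the input values are illustrative only.

```python
# Eight selection criteria sketch (commonly cited Ramanathan-style forms;
# verify against Table 4 before use). p = k + 1 estimated parameters, and the
# criteria are valid only when 2(k + 1) < n.
import math

def eight_sc(sse, n, k):
    p = k + 1
    base = sse / n
    return {
        "SGMASQ": base * (1 - p / n) ** -1,
        "AIC": base * math.e ** (2 * p / n),
        "FPE": base * (n + p) / (n - p),
        "GCV": base * (1 - p / n) ** -2,
        "HQ": base * math.log(n) ** (2 * p / n),
        "RICE": base * (1 - 2 * p / n) ** -1,
        "SCHWARZ": base * n ** (p / n),
        "SHIBATA": base * (n + 2 * p) / n,
    }

print(eight_sc(sse=50.0, n=200, k=5))
```

Across candidate models, the one attaining the minimum value under the most criteria would be selected, as described above.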
Results and discussion
Table 5 highlights that parameters T7 and T11 exhibit heterogeneity, as their R-squared values fall below the benchmark, which is the average R-squared of 0.8372. Moreover, the VIF values are notably high, with the highest reaching 76,050.9483, indicating a high level of multicollinearity.
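VIF values like those reported in Table 5 can be reproduced by regressing each drying parameter on the remaining ones, since VIF_j = 1/(1 - R_j^2). The helper below is a minimal numpy sketch (illustrative only, not the study's code); values far above 10 signal severe multicollinearity:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the design matrix X
    (observations in rows): regress column j on the other columns and
    return 1 / (1 - R_j^2). A minimal illustrative sketch."""
    X = np.asarray(X, float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        Z = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(len(y)), Z])      # add intercept
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # OLS fit
        resid = y - Z @ beta
        r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)
```

Near-duplicate columns drive R_j^2 toward 1 and inflate VIF into the thousands, which is the pattern behind the extreme value of 76,050.9483 noted above.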
Additionally, the box plot serves as supporting evidence for detecting heterogeneity within the drying parameters. It is particularly useful for analyzing symmetry and variability and for identifying potential outliers. The box plot also helps visualize the data distribution by showing the median and quartiles for location and using the interquartile range to capture variability79. Figure 3 illustrates the variability in the 29 primary drying parameters of black pepper. Each box plot corresponds to a specific drying parameter, providing insight into the variability among the key parameters19. The box plots reveal that the variables T7, T11, H1, H5 and PY show noticeably greater variation. Notably, the patterns observed for T7 and T11 align with the earlier R-squared results, confirming their heterogeneity. However, while H1, H5 and PY appear heterogeneous in the box plot, their R-squared values exceed the benchmark of 0.4843, suggesting that they do not show heterogeneity based on the previous findings.
Box plot for black pepper drying parameters.
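The box-plot outlier rule mentioned above rests on the interquartile range. A minimal sketch of Tukey's fences (k = 1.5 is the usual convention; the exact fences used for Figure 3 are those of the plotting software) is:

```python
import numpy as np

def boxplot_fences(x, k=1.5):
    """Tukey box-plot fences: points outside [Q1 - k*IQR, Q3 + k*IQR]
    are flagged as potential outliers. A sketch of the standard rule."""
    x = np.asarray(x, float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1                      # interquartile range
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return lo, hi, x[(x < lo) | (x > hi)]
```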
While data visualization enhances decision-making by improving speed and quality80, it simplifies complex data into visual representations, which may lead to some loss of detail. On the other hand,81 highlighted that quantitative results from numerical operations and statistical analysis tend to ensure greater accuracy and reliability in predictions. Hence, the average R-squared value is employed in this context to identify heterogeneity among the drying parameters, indicating that T7 and T11 exhibit heterogeneity. Once these two parameters contributing to heterogeneity and their second-order interactions are removed from the model, 378 parameters remain for determining the moisture content removal of black pepper. However, note that the ridge model can only select significant parameters up to the 89 highest-ranking variables after removing the heterogeneity parameters, as the remaining ones are insignificant.
Table 6 provides a summary of the analysis comparison based on evaluation metrics, showing the results before and after removing heterogeneity parameters. Tables 7 and 8 present the outcomes of the 8SC for the models before and after the removal of heterogeneity parameters, respectively. Before accounting for heterogeneity, the Elastic Net model shows a downward trend in SSE, MSE and MAPE values, while R-squared values increase as more high-ranking variables are included. For instance, with the 25 highest-ranking variables, the MSE is 62.7179 and the MAPE is 14.61765. When the number of high-ranking variables increases to 100, the MSE declines to 48.75784 and the MAPE drops to 12.90658. Meanwhile, the R-squared value increases from 0.8201 for the 25 highest-ranking variables to 0.8602 for the 100 highest-ranking variables. A similar trend is observed in the LASSO and Ridge models, where SSE, MSE and MAPE consistently decrease, while R-squared values increase with the inclusion of more high-ranking variables before heterogeneity. In addition, the 8SC values also decrease across all criteria as the number of high-ranking variables increases. The 8SC further strengthen the evaluation by confirming which model captures the factors that influence moisture removal in black pepper. Since drying efficiency depends on interactions among temperature, airflow and humidity, the model with the lowest 8SC values is more likely to represent these relationships accurately. This ensures that the selected model not only minimizes error metrics but also provides a reliable representation of the moisture reduction process. Overall, adding more variables consistently improves model performance, as indicated by lower error metrics, higher R-squared values and decreased 8SC values. After addressing heterogeneity, this pattern remains consistent across all three models.
Generally, selecting a larger number of high-ranking variables improves predictive accuracy, which confirms the findings of19,33.
The Elastic Net model consistently performs better than the Ridge and LASSO models for the 25, 35, 45 and 55 highest-ranking variables, and achieves its best results with the 100 highest-ranking variables, before the removal of heterogeneity parameters. This advantage is evident from the Elastic Net model's lower SSE, MSE and MAPE, along with its higher R-squared values. The higher R-squared values indicate that the model explains a greater percentage of variation in moisture content removal for black pepper compared to Ridge and LASSO. For example, with an R-squared value of 0.8602 for the 100 highest-ranking variables, the Elastic Net model accounts for 86.02% of the variation, demonstrating strong predictive ability. Since its R-squared values consistently exceed 80%, the model quality ranges from good to very good. Although all three models have MAPE values between 10 and 20, which suggests a good level of forecasting accuracy, the Elastic Net model maintains lower MAPE values across all sets of high-ranking variables, outperforming Ridge and LASSO. In line with these findings, all 8SC also identify Elastic Net as the best model, as it consistently records the lowest values across all criteria and variable sets. This further confirms that Elastic Net provides the most reliable model fit before addressing heterogeneity. Generally, before the removal of heterogeneity parameters, the Elastic Net model, which combines L1 and L2 regularization, proves to be the superior choice, as it maintains a balance between variable selection and model stability. This combination enables it to manage highly correlated predictors and handle multicollinearity effectively56.
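The L1/L2 trade-off described here can be illustrated with a small coordinate-descent sketch of the elastic net objective; setting alpha = 1 recovers LASSO (coefficients shrunk exactly to zero) and alpha = 0 recovers ridge. This is an illustrative implementation under standard assumptions (centred data, no intercept), not the software used in the study:

```python
import numpy as np

def elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=300):
    """Coordinate descent for the elastic net objective
        (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2).
    alpha=1 is LASSO, alpha=0 is ridge. A minimal sketch assuming
    centred data; not the authors' implementation."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n          # (1/n) ||x_j||^2
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding j
            rho = X[:, j] @ r_j / n
            # soft-threshold (L1 part), extra shrinkage in the denominator (L2 part)
            b[j] = np.sign(rho) * max(abs(rho) - lam * alpha, 0.0) / (
                col_ss[j] + lam * (1 - alpha))
    return b
```

With a large penalty and alpha = 1, irrelevant coefficients are thresholded to exactly zero, which is the variable-selection behaviour the LASSO results above rely on.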
After removing the heterogeneity parameters, the LASSO model shows better predictive performance, with higher accuracy and lower error, for the 25, 35, 45 and 55 highest-ranking variables compared to the other two models. Although the Elastic Net model with 100 variables records a slightly higher R-squared (0.8408) than LASSO with 89 variables (R2 = 0.8403), the difference is minimal. Consistent with these results, the 8SC show that LASSO achieves the lowest values for the 25, 35, 45 and 55 variable sets, confirming it as the best model in these cases. The 8SC also indicate that Elastic Net has the lowest values for the 100-variable set, although the improvement over LASSO with 89 variables is minimal. This suggests that while Elastic Net offers a slightly better fit at higher variable counts, LASSO, which reaches almost the same level of accuracy with fewer significant predictors, remains the more efficient and stable option after heterogeneity is removed. Similar to the analysis before excluding parameters exhibiting heterogeneity, all three models have MAPE values ranging between 10 and 20, suggesting a good level of forecasting accuracy. However, LASSO achieves the lowest MAPE values for the 25, 35, 45 and 55 highest-ranking variables, confirming its strong predictive ability. This also supports the view that LASSO works well with high-dimensional data by reducing redundancy, preventing overfitting, and improving interpretability47. Elastic Net, on the other hand, continues to perform well as the number of high-ranking variables increases due to its strength in managing multicollinearity.
Interestingly, the results indicate that the accuracy of the regression model is unexpectedly reduced by the removal of heterogeneity parameters. When parameters are removed from the model, it can lead to a loss of meaningful variability, causing the model to overlook important relationships among predictors. Such elimination can introduce specification bias into the model, as it may not adequately consider crucial factors that influence moisture content removal, ultimately affecting its predictive performance17. Overall, the superior performance of the Elastic Net before eliminating heterogeneity parameters implies that retaining this natural variability helps capture meaningful interactions among drying factors such as temperature, humidity, and airflow. Its minimal sensitivity to heterogeneity further explains why Elastic Net remains stable and reliable under varying conditions. Preserving these interdependencies allows the model to more accurately represent real drying conditions, leading to improved variable selection and more reliable parameter estimation. In practical terms, these findings highlight the importance of considering parameter relationships when optimizing drying systems to enhance efficiency and ensure consistent product quality. However, once heterogeneity parameters are removed, the data becomes cleaner and less correlated. Consequently, the LASSO model emerges as the better predictive model due to its strong regularization effect, which enables it to focus on the most important predictors while shrinking less relevant ones to zero. This may explain why LASSO slightly outperforms Elastic Net after accounting for heterogeneity.
Figures 4 and 5 present the standardized residual plots of the optimal Elastic Net, Ridge and LASSO models with the robust estimators, before and after removing the heterogeneity parameters, respectively. Table 9 compares the outlier counts and their corresponding percentages exceeding the 2-sigma and 3-sigma limits for Elastic Net, Ridge and LASSO, applying robust methods before and after addressing heterogeneity, for the 25, 35, 45, 55 and 100 highest-ranking variables. Figures 6, 7 and 8 depict the performance of Elastic Net, Ridge and LASSO before and after heterogeneity removal, respectively. Before heterogeneity removal, the hybrid models show significant improvements in reducing outliers across all three sparse regression models (Elastic Net, Ridge and LASSO). For Elastic Net, the S estimator performs best for the 100 highest-ranking variables at the 2-sigma limits, reducing outliers from 113 in the original model to 19, an 83% reduction. At the 3-sigma limits, the M Bi-Square estimator proves most effective, lowering outliers from 4 to 3 (a 25% reduction). In the Ridge model, the M Bi-Square estimator achieves the highest reduction of 90% for the 25 highest-ranking variables at the 2-sigma limits, identifying only 14 outliers compared to 144 in the original model. Although the MM estimator with 100 variables also performs well with an 84% reduction, the improvement is slightly lower, suggesting that the Ridge M Bi-Square hybrid model works most efficiently with 25 variables at the 2-sigma limits. Meanwhile, at the 3-sigma limits, the hybrid model with the M Bi-Square estimator completely eliminates outliers, lowering the count from 5 to 0 for the 35 highest-ranking variables. For LASSO, the S estimator performs best for the 89 highest-ranking variables at the 2-sigma limits, detecting 15 outliers compared to 115 in the original (an 87% reduction).
At the 3-sigma limit, M Bi-Square reduces outliers from 7 to 4 for the 25 highest-ranking variables, indicating a 42% decrease.
Standardized residual plots for the optimal model with robust method before heterogeneity.
Standardized residual plots for the optimal model with robust method after heterogeneity.
Plots showing the performance of elastic net before and after heterogeneity removal.
Plots showing the performance of ridge before and after heterogeneity removal.
Plots showing the performance of LASSO before and after heterogeneity removal.
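The hybrid idea of pairing a sparse fit with a robust estimator can be illustrated with an iteratively reweighted least squares (IRLS) implementation of Tukey's bi-square weighting (the M Bi-Square estimator) and a simple sigma-limit outlier count on standardized residuals. This is a hedged sketch of the general technique, not the paper's exact pipeline:

```python
import numpy as np

def tukey_bisquare_irls(X, y, c=4.685, n_iter=50):
    """IRLS with Tukey's bi-square weights w(u) = (1 - (u/c)^2)^2 for
    |u| <= c, else 0, where u is a MAD-scaled residual. c = 4.685 is
    the usual tuning constant for 95% Gaussian efficiency. A sketch of
    one M Bi-Square estimator, not the study's exact hybrid pipeline."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])           # add intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # OLS starting values
    for _ in range(n_iter):
        r = y - Xd @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745  # MAD scale
        u = r / max(scale, 1e-12)
        w = np.where(np.abs(u) <= c, (1 - (u / c) ** 2) ** 2, 0.0)
        sw = np.sqrt(w)                              # weighted least squares step
        beta, *_ = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)
    return beta

def sigma_outliers(residuals, k=2):
    """Count standardized residuals beyond the k-sigma limits."""
    r = np.asarray(residuals, float)
    z = (r - r.mean()) / r.std()
    return int(np.sum(np.abs(z) > k))
```

Gross outliers receive zero weight once their scaled residuals exceed c, which is how the hybrid models above shrink the outlier counts between the original and robust fits.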
Following the removal of heterogeneity parameters, the hybrid models continue to show strong reductions in outliers. For Elastic Net, the MM and S estimators perform best at the 2-sigma limits, identifying 23 outliers among the 55 highest-ranking variables, approximately 85% reduction from the original 158 outliers. At the 3-sigma limits, the S estimator eliminates all outliers for the 45 and 55 highest-ranking variables, while the original model reports 12 and 11 outliers, respectively. This suggests that heterogeneity is a major source of outliers, and its removal makes the data appear more normally distributed, reducing the number of detected outliers82. In the Ridge model, the S estimator proves to be the most effective for the 45 highest-ranking variables at the 2-sigma limits, reducing outliers from 164 to 21 (87% reduction). Similarly, at the 3-sigma limits, it remains the best estimator for the 55 high-ranking variables, decreasing outliers from 12 to 1, representing a 91% reduction. For LASSO, the S estimator again performs best at both 2-sigma and 3-sigma limits for the 45 highest-ranking variables, reducing outliers from 163 to 16 (90% reduction) and 14 to 1, with roughly 92% decrease.
Overall, although outlier counts increase after removing heterogeneity parameters, especially at the 2-sigma limits, the number of outliers declines significantly from the 2-sigma to the 3-sigma limits across all models, with hybrid models demonstrating stronger performance than the original models. This is likely because the 3-sigma rule sets a high threshold for detecting outliers, making it less sensitive after applying the robust method80. Before removing heterogeneity parameters, the hybrid Ridge model with the M Bi-Square estimator performs best at both the 2-sigma and 3-sigma limits, completely eliminating outliers at the 3-sigma limit. After heterogeneity removal, the LASSO model combined with the S estimator is the best, achieving the highest reduction at both limits, particularly at the 3-sigma level. These findings confirm that hybrid models combining robust estimators with sparse regression can handle heterogeneity effectively both before and after its removal, maintaining strong performance across both conditions.
Conclusion
This study explores heterogeneity in the drying parameters of black pepper and proposes hybrid models that combine sparse and robust regression estimators to improve the accuracy of estimating moisture content removal. Elastic Net, Ridge and LASSO are applied for variable selection, followed by the development of hybrid models integrating these sparse regression techniques with robust regression estimators to detect and minimize the influence of outliers. Before heterogeneity removal, the hybrid Ridge model with the M Bi-Square estimator demonstrates the best performance, whereas after removing heterogeneity, the hybrid LASSO model with the S estimator achieves the highest accuracy and stability, making it the most effective model. The consistent improvements observed at both the 2-sigma and 3-sigma limits highlight the robustness of these methods, with the 3-sigma limit proving especially effective in reducing outliers and improving prediction reliability. The findings confirm that hybrid sparse-robust models are crucial for maintaining stable performance in the presence of heterogeneity. By improving predictive accuracy and reliability, these models offer valuable insights for optimizing sensor placement, drying conditions, and energy use in IoT-based solar drying systems. Ultimately, the proposed hybrid models enhance black pepper drying by incorporating significant parameters and their interactions, enabling farmers to produce high-quality dried pepper with uniform moisture content, improved yield, and shorter drying periods. This will boost efficiency and income across the black pepper industry.
This study has several limitations; for example, the sensors determine the variables to be captured, and some variables were not captured due to measurement errors. Additionally, the findings are based on specific environmental conditions, which may not represent all possible use cases for the dryer. Differences in sensor accuracy and placement may also affect the results. This study is also limited to the main effects of the drying parameters and their second-order interactions due to constraints related to time, feasibility, and the complexity of the models. Hence, the moisture content removal of black pepper is modelled with a total of 435 independent variables. In the context of big data, determining the precise number of significant variables to include in a model is challenging. While the proposed algorithms can indicate the relative importance of variables, they do not explicitly determine how many of these should be selected, as feature selection methods only provide a ranking rather than a definitive count of significant variables83. Therefore, the 25, 35, 45, 55 and 100 highest-ranking variables are selected to determine the moisture content removal of black pepper. Future studies could explore other sparse regression models, such as adaptive LASSO, adaptive group LASSO, Minimax Concave Penalty and Smoothly Clipped Absolute Deviation (SCAD), for variable selection, and the number of drying parameters chosen could also be expanded. Alternative robust estimators, like least median of squares (LMS) and least absolute deviations (LAD), could be applied to build a hybrid model. Hybrid models could also be explored to handle imbalanced data or missing values.
Data availability
All data are included in this article.
References
Kumari, A., Golyan, A., Shah, R., & Raval, N. Introduction to Data Analytics. In Advances in systems analysis, software engineering, and high performance computing book series (pp. 1–14). https://doi.org/10.4018/979-8-3693-3609-0.ch001 (2024).
Gandhi, P. Towards data science in agriculture with big data management. Res. Sq. (Res. Sq.) https://doi.org/10.21203/rs.3.rs-4766405/v1 (2024).
Das, P., Banerjee, R., Bharti, R. A. & Varshney, N. An Overview of Data Analytics in Smart Agriculture (Bhumi Publishing, India, 2023).
Weraikat, D., Šorič, K., Žagar, M. & Sokač, M. Data analytics in agriculture: enhancing decision-making for crop yield optimization and sustainable practices. Sustainability. 16(17), 7331. https://doi.org/10.3390/su16177331 (2024).
Pathakakula, S. Chapter -16 Ripening Mechanism In Black Pepper. 2nd edn. In Advances in horticulture and allied sciences (Vol. 2). (Royal Book Publishing, 2023).
Priya & Garg, A.P. Biomedical Applications of Black Pepper, The King of Spices: A review. Biomed. J. Sci. Tech. Res. 53(1). https://doi.org/10.26717/bjstr.2023.53.008353 (2023).
Madhumathy, S. Water activity and its impacts on food stability. Int. J. Food Nutr. Sci. 10(12), 832–850 (2021).
Tapia, M. S., Alzamora, S. M. & Chirife, J. Effects of water activity (a_w) on microbial stability: As a hurdle in food preservation. Water Act. Foods Fundam. Appl. https://doi.org/10.1002/9781118765982.ch14 (2020).
Rodrigues, S. S. Q. et al. Use of black pepper essential oil to produce a healthier chicken pâté. Appl. Sci. 15(4), 1733. https://doi.org/10.3390/app15041733 (2025).
Vieira, L. V. et al. The effects of drying methods and harvest season on piperine, essential oil composition, and multi-elemental composition of black pepper. Food Chem. 390, 133148. https://doi.org/10.1016/j.foodchem.2022.133148 (2022).
Roslan, N. F. M. & Yudin, A. S. M. Drying process of black pepper in a swirling fluidized bed dryer using experimental method. IOP Conf. Series Mater. Sci. Eng. 863(1), 012047. https://doi.org/10.1088/1757-899x/863/1/012047 (2020).
Ali, M. K. M., Sulaiman, J., Md Yasir, S. & Ruslan, M. Cubic spline as a powerful tools for processing experimental drying rate data of seaweed using solar drier. Article Malays. J. Math. Sci. 11, 159–172 (2017).
Ali, M. K. M., Fudholi, A., Muthuvalu, M., Sulaiman, J., & Yasir, S. M. Implications of drying temperature and humidity on the drying kinetics of seaweed. Proceedings of the 13th IMT-GT International Conference on Mathematics, Statistics and their Applications (ICMSA2017). 1905(1), 050004–1- 050004–7. https://doi.org/10.1063/1.5012223 (2017b).
Ali, M. K. M., Fudholi, A., Sulaiman, J., Muthuvalu, M. S., Ruslan, M. H., Yasir, S. Md., & Hurtado, A. Q. Post-harvest handling of eucheumatoid seaweeds. In Tropical seaweed farming trends, problems and opportunities. Springer International Publishing. 131–145. https://doi.org/10.1007/978-3-319-63498-2_8 (2017c).
Shreelavaniya, R., Pangayarselvi, R. & Kamaraj, S. Mathematical modeling of drying characteristics of black pepper (Piper nigrum) in indirect type solar-biomass hybrid dryer. Int. J. Curr. Microbiol. Appl. Sci. 6(11), 2634–2644. https://doi.org/10.20546/ijcmas.2017.611.309 (2017).
Ibidoja, O. J., Shan, F. P., Sulaiman, J. & Ali, M. K. M. Detecting heterogeneity parameters and hybrid models for precision farming. J. Big Data. 10, 130. https://doi.org/10.1186/s40537-023-00810-8 (2024).
Kumar, P. R., Ali, M. K. M. & Ibidoja, O. J. Identifying heterogeneity for increasing the prediction accuracy of machine learning models. J. Niger. Soc. Phys. Sci. https://doi.org/10.46481/jnsps.2024.2058 (2024).
Department of Statistics Malaysia, Malaysian Open Data Portal-Information on Black Pepper Industry https://www.data.gov.my/ (2019).
Nunes, A., Trappenberg, T. & Alda, M. The definition and measurement of heterogeneity. Transl. Psychiatry. https://doi.org/10.31234/osf.io/3hykf (2020).
Afouna, N. H. & Ali, M. K. Optimizing precision farming: Enhancing machine learning efficiency with robust regression techniques in high-dimensional data. J. Niger. Soc. Phys. Sci. https://doi.org/10.46481/jnsps.2025.2314 (2024).
Mikolajewicz, N. & Komarova, S. V. Meta-analytic methodology for basic research: A practical guide. Front. Physiol. https://doi.org/10.3389/fphys.2019.00203 (2019).
Ibidoja, O. J., Shan, F. P. & Ali, M. K. M. Modified sparse regression to solve heterogeneity and hybrid models for increasing the prediction accuracy of seaweed big data with outliers. Sci. Rep. https://doi.org/10.1038/s41598-024-60612-7 (2024).
Sundus, K., Hammo, B., Al-Zoubi, M. I. & Al-Omari, A. Solving the multicollinearity problem to improve the stability of machine learning algorithms applied to a fully annotated breast cancer dataset. Inf. Med. Unlocked. 33, 101088. https://doi.org/10.1016/j.imu.2022.101088 (2022).
Ali, M., Bin, M. K., Ismail, M. & Fudholi, A. Accurate and hybrid regularization—Robust regression model in handling multicollinearity and outlier using 8SC for big data. Math. Model. Eng. Probl. 8(4), 547–556. https://doi.org/10.18280/mmep.080407 (2021).
Daoud, J. I. Multicollinearity and regression analysis. J. Phys. Conf. Ser. 949, 012009. https://doi.org/10.1088/1742-6596/949/1/012009 (2017).
Chan, J. Y. et al. Mitigating the multicollinearity problem and its machine learning approach: A review. Mathematics. 10(8), 1283. https://doi.org/10.3390/math10081283 (2022).
Lim, H. Y., Fam, P. S., Javaid, A. & Ali, M. Ridge regression as efficient model selection and forecasting of fish drying using V-groove hybrid solar drier. Pertanika J. Sci. Technol. https://doi.org/10.47836/pjst.28.4.04 (2020).
Ayadi, A., Ghorbel, O., Obeid, A. M. & Abid, M. Outlier detection approaches for wireless networks: A survey. Comput. Netw. 129, 319–333. https://doi.org/10.1016/j.comnet.2017.10.007 (2017).
Perez, H. & Tah, J. H. M. Improving the accuracy of convolutional neural networks by identifying and removing outlier images in datasets using t-SNE. Mathematics. https://doi.org/10.3390/MATH8050662 (2020).
Wasim, D. et al. Quantile-based robust Kibria-Lukman estimator for linear regression model to combat multicollinearity and outliers: Real life applications using T20 cricket sports and anthropometric data. Kuwait J. Sci. 52, 100336. https://doi.org/10.1016/j.kjs.2024.100336 (2024).
Wasim, D., Khan, S. A. & Suhail, M. Modified robust ridge M-estimators for linear regression models: An application to tobacco data. J. Stat. Comput. Simul. 93(15), 2703–2724. https://doi.org/10.1080/00949655.2023.2202913 (2023).
Wasim, D., Suhail, M., Albalawi, O. & Shabbir, M. Weighted penalized m-estimators in robust ridge regression: An application to gasoline consumption data. J. Stat. Comput. Simul. 94(15), 3427–3456. https://doi.org/10.1080/00949655.2024.2386391 (2024).
Wasim, D., Khan, S. A., Suhail, M. & Shabbir, M. New penalized M-estimators in robust ridge regression: Real life applications using sports and tobacco data. Commun. Stat. Simul. Comput. https://doi.org/10.1080/03610918.2023.2293648 (2023).
Joy, C. M., Pittappillil, G. P. & Jose, K. P. Drying of black pepper (Piper nigrum L.) using solar tunnel dryer. Pertanika J. Trop. Agric. Sci. 25(1), 39–45 (2002).
Padmanaban, K., Mishra, P., Dubey, A. & Tiwari, P. Study of factors of production on productivity of black pepper and its sustainability. Acta Sci. Agric. 2(12), 138–143 (2018).
Afouna, N. A. & Ali, M. K. M. The impact of heterogeneity in high-ranking variables using precision farming. Malays. J. Fundam. Appl. Sci. 20(6), 1344–1362. https://doi.org/10.11113/mjfas.v20n6.3564 (2024).
Drobnič, F., Kos, A. & Pustišek, M. On the interpretability of machine learning models and experimental feature selection in case of multicollinear data. Electronics (Switzerland) 9(5), 761. https://doi.org/10.3390/electronics9050761 (2020).
Ibidoja, O. J., Shan, F. P., Mukhtar, N., Sulaiman, J. & Ali, M. Robust M estimators and machine learning algorithms for improving the predictive accuracy of seaweed contaminated big data. J. Niger. Soc. Phys. Sci. https://doi.org/10.46481/jnsps.2023.1137 (2023).
Gujarati, D. N. & Porter, D. N. Basic Econometrics 4th edn. (The McGraw-Hill Companies, 2004).
Obadina, O. G., Adedotun, A. F. & Odusanya, O. A. Ridge estimation's effectiveness for multiple linear regression with multicollinearity: An investigation using Monte-Carlo simulations. J. Niger. Soc. Phys. Sci. 3(4), 278–281. https://doi.org/10.46481/jnsps.2021.304 (2021).
Yildirim, H. & Revan Özkale, M. The performance of ELM based ridge regression via the regularization parameters. Expert Syst. Appl. 134, 225–233. https://doi.org/10.1016/j.eswa.2019.05.039 (2019).
Moreno-Salinas, D., Moreno, R., Pereira, A., Aranda, J. & de la Cruz, J. M. Modelling of a surface marine vehicle with kernel ridge regression confidence machine. Appl. Soft Comput. J. 76, 237–250. https://doi.org/10.1016/j.asoc.2018.12.002 (2019).
Melkumova, L. E. & Shatskikh, S. Y. Comparing Ridge and LASSO estimators for data analysis. In Procedia Eng. https://doi.org/10.1016/j.proeng.2017.09.615 (2017).
Jiehong, C., Sun, J., Yao, K., Min, X. & Yan, C. A variable selection method based on mutual information and variance inflation factor. Spectrochimica Acta Part A: Mol. Biomol. Spectrosc. 268, 120652. https://doi.org/10.1016/j.saa.2021.120652 (2022).
Frost, J. When Do You Need to Standardize the Variables in a Regression Model? Statistics by Jim. https://statisticsbyjim.com/regression/standardize-variables-regression/ (2017).
Duzan, H. & Shariff, N. S. B. M. Ridge regression for solving the multicollinearity problem: Review of methods and models. J. Appl. Sci. 15(3), 392–404. https://doi.org/10.3923/jas.2015.392.404 (2015).
Safi, S. K. et al. Optimizing linear regression models with lasso and ridge regression: A study on UAE financial behavior during COVID-19. Migr. Lett. 20(6), 139–153. https://doi.org/10.59670/ml.v20i6.3468 (2023).
Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal Stat. Soc. Series B (Methodological) 58(1), 267–288 (1996).
Tibshirani, R. Regression shrinkage and selection via the lasso: A retrospective. J. Royal Stat. Soc. Series B (Stat. Methodol.) 73(3), 273–282 (2011).
Enwere, K., Nduka, E. & Ogoke, U. Comparative analysis of ridge, bridge and lasso regression models in the presence of multicollinearity. IPS Intelligentsia Multidiscip. J. 3(2), 1–8. https://doi.org/10.54117/iimj.v3i1.5 (2023).
Usman, M., Doguwa, S. & Alhaji, B. Comparing the prediction accuracy of ridge, lasso and elastic net regression models with linear regression using breast cancer data. Bayero J. Pure Appl. Sci. 14(2), 134–149. https://doi.org/10.4314/bajopas.v14i2.16 (2022).
García-Nieto, P. J., García-Gonzalo, E. & Paredes-Sánchez, J. P. Prediction of the critical temperature of a superconductor by using the WOA/MARS, Ridge, Lasso and Elastic-net machine learning techniques. Neural Comput. Appl. 33(24), 17131–17145. https://doi.org/10.1007/s00521-021-06304-z (2021).
Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735 (2006).
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. Royal Stat. Soc. Series B (Stat. Methodol.) 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x (2005).
Le, C. V. How to choose tuning parameters in lasso and ridge regression?. Asian J. Econ. Banking 4(1), 61–76 (2020).
Ogutu, J. O., Schulz-Streeck, T. & Piepho, H. P. Genomic selection using regularized linear regression models: Ridge regression, lasso, elastic net and their extensions. BMC Proc. https://doi.org/10.1186/1753-6561-6-S2-S10 (2012).
Al-Jawarneh, A. S., Ismail, M. T. & Awajan, A. M. Elastic net regression and empirical mode decomposition for enhancing the accuracy of the model selection. Int. J. Math. Eng. Manag. Sci. 6(2), 564–583. https://doi.org/10.33889/ijmems.2021.6.2.034 (2021).
Schreiber-Gregory, D. N. Regulation Techniques for Multicollinearity: Lasso, Ridge, and Elastic Nets. In Proceedings of the SAS Conference Proceedings: Western Users of SAS Software. 1–23. https://api.semanticscholar.org/CorpusID:189925961 (2018).
Susanti, Y., Pratiwi, H., Handajani, S. S. & Liana, T. M estimation, S estimation, and MM estimation in robust regression. Int. J. Pure Appl. Math. https://doi.org/10.12732/ijpam.v91i3.7 (2014).
Almetwally, E. M. & Almongy, H. M. Comparison between M-estimation, S-estimation, and MM estimation methods of robust estimation with application and simulation. Int. J. Math. Arch. 9(11) (2018).
Singgih, M. N. A. & Fauzan, A. Comparison of M estimation, S estimation, with MM estimation to get the best estimation of robust regression in criminal cases in Indonesia. Jurnal Matematika Statistika Dan Komputasi. 18(2), 251–260. https://doi.org/10.20956/j.v18i2.18630 (2022).
Hodson, T. O., Over, T. M. & Foks, S. S. Mean squared error, deconstructed. J. Adv. Model. Earth Syst. https://doi.org/10.1029/2021ms002681 (2021).
Kim, S. & Kim, H. A new metric of absolute percentage error for intermittent demand forecasts. Int. J. Forecast. 32(3), 669–679. https://doi.org/10.1016/j.ijforecast.2015.12.003 (2016).
De Myttenaere, A., Golden, B., Grand, B. L. & Rossi, F. Mean absolute percentage error for regression models. Neurocomputing 192, 38–48. https://doi.org/10.1016/j.neucom.2015.12.114 (2016).
Moreno, J., Pol, A. L. P., García-Labiano, F. & Blasco, B. C. Using the R MAPE index as a resistant measure of forecast accuracy. PubMed. 25(4), 500–506. https://doi.org/10.7334/psicothema2013.23 (2013).
Ijomah, M. A. On the misconception of R2 for (r)2 in a regression model. Int. J. Res. Sci. Innovation. 6(12), 2321–2705 (2019).
Arsad, Z. Multiple Linear Regression 10–31 (Regression Analysis, 2023).
Abdullah, N., Kiu, A. L. L., & Lintangah, W. Log production prediction model: A comparison between Malaysia and Indonesia using multiple regression technique. UMS Institutional Repository (Universiti Malaysia Sabah). http://eprints.ums.edu.my/14146/7/LOG%20PRODUCTION%20PREDICTION%20MODEL%20A%20COMPARISON%20BETWEEN%20MALAYSIA%20AND.pdf (2016).
Abdullah, N., Lee, C. L. & Jubok, Z. H. Factors on palm oil fruit bunches production volume for biomass fuel and biofuel during cogeneration processes. J. Jpn Inst. Energy 94(12), 1428–1439. https://doi.org/10.3775/jie.94.1428 (2015).
Hajijubok, Z. & Gopal, P. K. Procedure in getting best model using multiple regression. J. Borneo Sci. 23, 47–63 (2008).
Akaike, H. Fitting autoregressive models for prediction. Ann. Inst. Stat. Math. 21(1), 243–247. https://doi.org/10.1007/bf02532251 (1969).
Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 19(6), 716–723. https://doi.org/10.1109/tac.1974.1100705 (1974).
Golub, G. H., Heath, M. & Wahba, G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21(2), 215–223. https://doi.org/10.1080/00401706.1979.10489751 (1979).
Hannan, E. J. & Quinn, B. G. The determination of the order of an autoregression. J. Royal Stat. Soc. Series B (Stat. Methodol.) 41(2), 190–195. https://doi.org/10.1111/j.2517-6161.1979.tb01072.x (1979).
Rice, J. Bandwidth choice for nonparametric regression. Ann. Stat. https://doi.org/10.1214/aos/1176346788 (1984).
Schwarz, G. Estimating the dimension of a model. Ann. Stat. 6(2), 461–464. https://doi.org/10.1214/aos/1176344136 (1978).
Ramanathan, R. Introductory Econometrics with Applications 5th edn. (Harcourt College Publishers, 2002).
Shibata, R. An optimal selection of regression variables. Biometrika 68(1), 45–54. https://doi.org/10.1093/biomet/68.1.45 (1981).
Morales, C. S., Giraldo, R. & Torres, M. E. Boxplot fences in proficiency testing. Accred. Qual. Assur. 26, 193–200. https://doi.org/10.1007/s00769-021-01474-8 (2021).
Eberhard, K. The effects of visualization on judgment and decision-making: A systematic literature review. Manag. Rev. Q. 73(1), 167–214. https://doi.org/10.1007/s11301-021-00235-8 (2021).
Almeida, F., Faria, D. & Queirós, A. Strengths and limitations of qualitative and quantitative research methods. Eur. J. Educ. Stud. 3(9), 369–387. https://doi.org/10.5281/zenodo.887089 (2017).
Lin, L., Chu, H. & Hodges, J. S. Alternative measures of between-study heterogeneity in meta-analysis: Reducing the impact of outlying studies. Biometrics 73(1), 156–166 (2017).
Drobnič, F., Kos, A. & Pustišek, M. On the interpretability of machine learning models and experimental feature selection in case of multicollinear data. Electronics 9(5), 761. https://doi.org/10.3390/electronics9050761 (2020).
Funding
The authors are grateful for the financial assistance from the “Ministry of Higher Education Malaysia under the Fundamental Research Grant Scheme (FRGS), with Project Code: FRGS/1/2023/STG06/USM/02/6”.
Author information
Contributions
Paavithashnee Ravi Kumar: Conceptualization; Data curation; Formal analysis; Methodology; Project administration; Writing—original draft. Olayemi Joshua Ibidoja: Supervision; Writing—review & editing. Majid Khan Majahar Ali: Data curation; Writing—review & editing; Supervision. Wan Rosli Wan Ishak: Supervision; Writing—review & editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics approval
Ethics approval was not required, as the study did not involve humans or animals.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kumar, P.R., Ibidoja, O.J., Ali, M.K.M. et al. Hybrid models of sparse and robust regression to solve heterogeneity problem in black pepper big data. Sci Rep 16, 11292 (2026). https://doi.org/10.1038/s41598-026-39290-0