Introduction

Groundnut is one of the most important oilseed crops globally1, playing a pivotal role in ensuring food security and contributing to the agricultural economy2,3. Based on the quadrennial average area under groundnut for the 2019–2022 period, Gujarat (35%), Rajasthan (15%), Andhra Pradesh (14%), Karnataka (11%), and Tamil Nadu (7%) are the main contributing states for groundnut production in India4. However, the cultivation and pricing of groundnut are heavily influenced by weather parameters5. Fluctuations in these parameters often lead to unpredictable price dynamics, thereby affecting farmer profitability and market stability6,7,8,9,10,11. Accurate forecasts of groundnut prices based on weather parameters are therefore crucial for empowering farmers, improving market efficiency, and formulating effective agricultural policies.

The impact of weather variability on agricultural prices has been extensively researched. Earlier studies12,13 established the role of climatic factors such as temperature and precipitation in influencing crop yields and prices. Traditional forecasting models like ARIMA and SARIMA have been applied to agricultural price prediction14, but they often fail to capture the nonlinear complexities of high-dimensional data. Modern machine learning techniques, including boosting algorithms like XGBoost and LightGBM, have demonstrated superior capabilities in this respect15,16.

However, the profitability of groundnut farming is highly sensitive to price volatility17 and weather-related uncertainties18,19. Market prices for groundnuts are influenced by a complex interplay of factors, including seasonal supply fluctuations20, climatic variations, and broader market dynamics. In recent years, these uncertainties have been exacerbated by climate change, which has intensified weather variability21 and disrupted traditional cropping patterns.

Forecasting groundnut prices with high accuracy is crucial for enabling farmers to make informed decisions about crop planning, market timing, and resource allocation6. Traditional statistical models have offered limited success in capturing the nonlinear and dynamic relationships22,23 between weather conditions24 and price trends25. As a result, there is growing interest in the application of advanced machine learning techniques, particularly boosting algorithms, that can handle complex datasets and improve predictive performance.

Boosting algorithms are often chosen over other machine learning models because they combine the strengths of multiple weak learners to create a strong predictive model26. They handle complex datasets with non-linear relationships and can significantly reduce bias and variance. Boosting methods like XGBoost, LightGBM, and CatBoost are known for their high accuracy and robustness, especially in structured data and competitive machine learning tasks27. In this regard, boosting algorithms, which iteratively minimize prediction errors28, have shown promise in agricultural forecasting. For instance, one study29 utilized deep neural networks to predict crop yields with higher accuracy than traditional models. Similarly, another study30 applied deep Gaussian processes to forecast regional crop production under varying climatic conditions, demonstrating their adaptability to diverse environmental scenarios31. Despite these advancements, there is limited research specifically targeting groundnut price forecasting using these methods, particularly in Tamil Nadu. Addressing this gap is vital, given the region’s dependence on groundnut farming.

Profitability analysis adds a crucial dimension to forecasting by linking price predictions to economic outcomes. Earlier work32 emphasized integrating profitability metrics to provide actionable insights for farmers. It is important both to forecast prices accurately and to assess the economic viability of groundnut farming in Tamil Nadu33. Incorporating profitability analysis6,34 with advanced forecasting methods offers a holistic framework for supporting farmer decision-making and policy design.

This study aims to bridge these gaps by leveraging state-of-the-art boosting algorithms, including CatBoost35, alongside advanced feature engineering techniques36,37 such as seasonal decomposition and discretization.

Materials and methods

Data collection

The data collection process for this study involved both primary and secondary sources to comprehensively address the research objectives. The study was conducted in three districts of Tamil Nadu: Erode, Namakkal, and Salem, selected based on their geographical proximity and significant contribution to groundnut cultivation. The data collection process consisted of a primary survey to gather farmer-specific data and a secondary survey to acquire long-term market price and weather data.

Primary data collection

The primary survey aimed to estimate the cost of cultivation and yield associated with identified groundnut-based cropping patterns. A multistage sampling technique was employed to ensure representation and relevance of the data collected. In the first stage, three districts were purposively chosen based on proximity and their importance in groundnut cultivation. In the second stage, two blocks were selected from each district based on groundnut cultivation intensity and accessibility. Subsequently, two villages were chosen from each block, using purposive sampling to include locations actively engaged in groundnut farming. Finally, 15 farmers were randomly selected from each village, resulting in a total of 180 farmers (3 districts × 2 blocks × 2 villages × 15 farmers) across the three districts. The details of the selected blocks and villages in each district are described in Table 1.

Table 1 Distribution of surveyed farmers across districts, blocks, and villages in Tamil Nadu.

Data collection from the farmers involved personal interviews using a structured questionnaire. The questionnaire covered various aspects such as the cost of inputs, cultivation practices, yield, and crop diversification patterns. The survey was conducted with the approval of, and under guidelines provided by, Tamil Nadu Agricultural University, Coimbatore, Tamil Nadu, India, and farmers’ details were collected following the protocols provided by the Assistant Director of Agriculture in each block of the district. The survey also gathered qualitative data regarding farmer perceptions of weather variations and their effects on crop performance, though these qualitative data are not discussed in this paper. This information was crucial for analysing the profitability of various cropping patterns under different scenarios.

Secondary data collection

Secondary data were collected to support the forecasting and profitability analysis. The secondary data primarily consisted of weekly market prices of groundnut and weekly weather parameters for the period spanning January 1, 2010, to December 31, 2023. The market price data were sourced from the AGMARKNET portal, a reliable and comprehensive database for agricultural market prices in India. Weather data38,39, including Minimum Temperature (Tmin), Maximum Temperature (Tmax), Relative Humidity (RH) and Rainfall (RF), were obtained from the NASA POWER project portal.

Statistical analysis

The statistical analysis for this study focused on understanding the behavior of price and weather data over time and evaluating their interrelationships. Various tests and measures were applied to ensure robust analysis:

Test for seasonality

Following40,41, seasonality in the price and weather parameters was examined using the Kruskal-Wallis test, a non-parametric method42 for comparing medians across groups. Following43, the test was performed for different periodicities, including monthly, yearly, and weekly intervals, to identify significant seasonal patterns. If there is seasonality in the data, the values in different seasons can be expected to follow distinct distributions. The hypotheses for this test can be stated as follows:

\(H_{0}\): All groups belong to an identical population

\(H_{1}\): At least one of the groups belongs to a different population than the others

The hypothesis can be tested using the following test statistic:

$$H=\frac{12}{N(N+1)}\sum_{i=1}^{k}\frac{R_{i}^{2}}{n_{i}}-3(N+1)$$
(1)

where \(N\) is the total number of observations across all groups, \(k\) is the number of groups, \(n_{i}\) is the number of observations in group \(i\) and \(R_{i}\) is the sum of the ranks of the observations in group \(i\). Under the null hypothesis, the test statistic follows a chi-square distribution with \(k-1\) degrees of freedom, \(H\sim\chi^{2}_{k-1}\), and thus the larger the value of \(H\), the more evidence there is against the null hypothesis.
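As a minimal illustration of Eq. (1), the following Python sketch computes the statistic by hand and compares it with `scipy.stats.kruskal`. The monthly price series is synthetic, and all variable names are assumptions made for this example; they are not the study's data or code.

```python
import numpy as np
from scipy.stats import chi2, kruskal, rankdata

# Synthetic prices with a monthly pattern (illustrative only, not study data).
rng = np.random.default_rng(42)
months = np.tile(np.arange(1, 13), 20)  # 20 observations per month
price = 5000 + 200 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 100, months.size)

# Library implementation: one sample per group (here, one per month).
groups = [price[months == m] for m in range(1, 13)]
h_scipy, p_scipy = kruskal(*groups)

# Manual computation following Eq. (1).
ranks = rankdata(price)  # ranks over the pooled sample
N, k = price.size, len(groups)
rank_term = sum(ranks[months == m].sum() ** 2 / np.sum(months == m) for m in range(1, 13))
h_manual = 12 / (N * (N + 1)) * rank_term - 3 * (N + 1)
p_manual = chi2.sf(h_manual, df=k - 1)  # upper tail of chi-square with k - 1 df

print(f"H (scipy) = {h_scipy:.3f}, H (manual) = {h_manual:.3f}, p = {p_manual:.3g}")
```

The two values of H agree here because the synthetic series contains no tied observations; when ties are present, `scipy.stats.kruskal` applies a tie correction that Eq. (1) does not include.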

Cross-correlation analysis

Cross-correlation analysis was conducted on the training datasets to explore the lagged relationships between price and weather parameters44. Lags ranging from 0 to 52 weeks were examined to identify significant correlations. The Pearson correlation coefficient45 was employed for this purpose, enabling the identification of meaningful lagged features for inclusion in the boosting machine models. By conducting these statistical analyses, the study established a solid foundation for the forecasting and profitability models46, ensuring that the underlying patterns and relationships in the data were effectively captured.
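The lagged Pearson correlations described above can be sketched as follows. The series are synthetic, with a rainfall effect deliberately placed at a six-week lag, and the helper name `lagged_correlations` is hypothetical; the sketch only illustrates the idea and is not the study's implementation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_weeks = 400
rainfall = pd.Series(rng.normal(size=n_weeks))
# Hypothetical price that responds to rainfall observed 6 weeks earlier.
price = 2.0 * rainfall.shift(6) + pd.Series(rng.normal(scale=0.5, size=n_weeks))

def lagged_correlations(target, driver, max_lag=52):
    """Pearson correlation between target[t] and driver[t - lag] for each lag."""
    return pd.Series({lag: target.corr(driver.shift(lag)) for lag in range(max_lag + 1)})

ccf = lagged_correlations(price, rainfall)
best_lag = int(ccf.abs().idxmax())  # lag with the strongest absolute correlation
print(best_lag, round(ccf[best_lag], 3))
```

Lags with large absolute correlations are natural candidates for lagged weather features, since they indicate how many weeks earlier the weather value is most informative about the price.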

Feature engineering

To improve model performance and forecast prices accurately, feature engineering techniques47,48 were applied to both the price and weather data in the training datasets. These methods ensured that the models could effectively capture the temporal49 and environmental50 factors influencing price dynamics.

STL decomposition and component-wise forecasting

Following51,52, the price data were decomposed using Seasonal-Trend Decomposition based on LOESS (STL) to extract their trend, seasonal, and residual components. Following43, each component was analysed individually, and separate forecasts were generated for the trend, seasonal, and residual components. These individual forecasts were then combined to produce the final price forecasts. This decomposition approach improved model interpretability and accuracy by isolating distinct temporal patterns in the price data.

Feature transformations

Weather parameters, including Minimum Temperature, Maximum Temperature, Rainfall, and Relative Humidity, were transformed for effective integration into the forecasting models. The transformation was carried out through discretization53, where weather parameters were categorized into distinct bins using k-means clustering. This approach enabled the model to capture non-linear relationships between weather and price, leveraging the strengths of boosting algorithms in handling categorical data54. The decision to use discretized weather data, rather than raw inputs, was driven by the improved forecasting performance observed with this transformation55. Additionally, using transformed weather data aligned with the study’s objective of utilizing weather forecasts for price prediction.
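The k-means discretization step can be sketched with scikit-learn's `KBinsDiscretizer`. The maximum-temperature series below is synthetic, and the choice of 20 bins follows the value reported later in the paper; the remaining settings are assumptions of this example rather than the study's configuration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic weekly maximum temperature in degrees Celsius (illustrative only).
rng = np.random.default_rng(1)
weeks = np.arange(730)
tmax = 32 + 4 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 1.0, weeks.size)

# One-dimensional k-means binning; each value is replaced by its ordinal bin index.
discretizer = KBinsDiscretizer(n_bins=20, encode="ordinal", strategy="kmeans")
tmax_bin = discretizer.fit_transform(tmax.reshape(-1, 1)).ravel().astype(int)

print(np.unique(tmax_bin).size, "bins used")
print(np.round(discretizer.bin_edges_[0][:5], 2), "... first bin edges")
```

The resulting integer codes are ordered by temperature, so they can be passed to a boosting model either as ordinal values or as categorical features.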

Time-based features and label encoding

To capture temporal patterns, time-based features were generated from the date information, following45. Year, Month and Week were encoded as categorical variables and integrated into the forecasting models, enabling the models to leverage temporal patterns effectively. By combining these engineered features with transformed weather data56, robust models for price forecasting were developed. A label encoder is a transformer that converts categorical labels into integer codes so that they can be used as features by machine learning models, an encoding that is considered efficient especially for tree-based models.
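The time-based features and their label encoding can be illustrated as follows. The date range and column names are assumptions chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Two years of weekly dates (Sundays); the range is illustrative only.
dates = pd.date_range("2010-01-03", periods=104, freq="W")
features = pd.DataFrame({"date": dates})

# Calendar features derived from the date.
features["year"] = features["date"].dt.year
features["month"] = features["date"].dt.month
features["week"] = features["date"].dt.isocalendar().week.astype(int)

# Label encoding maps each distinct value to an integer code starting at 0.
encoded = features[["year", "month", "week"]].apply(
    lambda col: LabelEncoder().fit_transform(col)
)
print(encoded.head())
```

Label encoding keeps one column per feature, unlike one-hot encoding, which is one reason it is often preferred for tree-based models.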

Boosted learning machines

For the purpose of this study, LightGBM (LGBM), XGBoost (XGBM), HistGradientBoosting (HGBM), and CatBoost (CBM) were chosen. They are advanced machine learning algorithms designed for structured data, and they have been increasingly utilized for time series forecasting57 due to their robust performance and flexibility. These boosting algorithms are ensemble methods that combine the predictive power of multiple decision trees58, sequentially trained, to minimize errors and improve generalization59. They handle large datasets with high-dimensional features and are particularly adept at capturing complex, nonlinear relationships within the data, which is essential for accurate time series forecasting.

LightGBM (LGBM)60,61 is an algorithm designed to be fast and efficient, particularly for large datasets. It employs a histogram-based approach to decision tree learning, which significantly reduces memory usage and computational time compared to traditional tree-based methods. LGBM also supports features such as categorical variable handling, early stopping, and custom loss functions, making it well-suited for time series forecasting27. The algorithm is particularly effective at identifying long-term trends and seasonality in time series data62, especially when combined with appropriate feature engineering, such as lagged features or rolling statistics. Its ability to handle missing values and categorical data natively further enhances its suitability for real-world time series problems.

XGBoost Machine (XGBM) is another widely used boosting algorithm, known for its scalability and robustness. XGBoost uses gradient boosting techniques and incorporates regularization to prevent overfitting, making it highly effective for time series forecasting where data noise can be a challenge63. Its ability to handle missing data and outliers allows it to adapt to real-world conditions effectively. XGBoost is versatile, capable of capturing both short-term fluctuations64 and long-term trends in time series data65 through the incorporation of lagged variables, moving averages, and other engineered features. However, its computational demands can be higher compared to LightGBM, particularly for very large datasets, which is a factor to consider in time-sensitive forecasting tasks.

HistGradientBoosting Machine (HGBM)66, a more recent implementation available in libraries like scikit-learn, is a histogram-based gradient boosting model. Like LightGBM, HGBM uses histograms to accelerate the training process, making it efficient for large datasets. It also handles missing values natively and uses regularization to enhance generalization. HGBM is particularly useful for time series forecasting tasks that involve a mix of continuous and categorical variables, as it can handle these seamlessly67. Additionally, its integration within the scikit-learn ecosystem allows for easy application of time-series-specific cross-validation techniques, which is critical for reliable forecasting model evaluation.

CatBoost Machine (CBM)68 is another powerful boosting algorithm, specifically designed to handle categorical data more effectively than other boosting methods. It employs an innovative encoding mechanism for categorical variables, reducing the risk of overfitting and improving model accuracy. CatBoost’s capability to handle unevenly distributed data and time-lagged relationships makes it a strong candidate for time series forecasting69. It is also known for its ease of use, as it requires minimal parameter tuning compared to other boosting methods. However, CatBoost can be computationally intensive, especially for large datasets, which may pose challenges for time-critical applications.

The multistep direct forecasting strategy was chosen for this study due to its ability to produce more accurate and horizon-specific predictions compared to recursive and hybrid methods. In the context of groundnut price forecasting, where short- and long-term dynamics are influenced by differing sets of variables—such as immediate weather fluctuations versus broader market trends—direct forecasting allows each model to specialize in capturing the unique patterns of its respective horizon70,71. Unlike recursive methods, which suffer from cumulative error propagation as predictions are fed back into the model, the direct approach mitigates this issue by independently modeling each forecast step. This is particularly advantageous for longer horizons where small inaccuracies can compound significantly. Moreover, while hybrid methods attempt to blend recursive and direct strategies, they often inherit the complexity of both and may still suffer from intermediate error buildup. Although computationally intensive, the direct method’s capacity to tailor learning to specific time points leads to more reliable forecasts, especially in applications involving volatile inputs such as weather data. Given the high-stakes nature of agricultural pricing decisions, this trade-off in favor of greater accuracy and stability is both justified and necessary.

Training and validation

The chosen boosting machines for this study included LightGBM, XGBoost, HistGradientBoosting, and CatBoost, each known for its efficiency in handling structured data and complex relationships. To optimize the forecasting models, hyperparameter tuning was performed using a Bayesian search technique. Bayesian optimization72 is an efficient method for exploring the hyperparameter73 space by balancing exploration and exploitation. Instead of evaluating all possible combinations, the algorithm uses prior knowledge and probabilistic models to identify promising regions of the hyperparameter space.

The Bayesian search74 process involved defining the objective function as the forecasting model’s performance metric, Mean Squared Error (MSE). Key hyperparameters, such as learning rate, number of estimators, maximum depth, and minimum samples per leaf, were iteratively optimized over a specified number of iterations. Each iteration refined the search space based on the results of previous evaluations, enabling the algorithm to focus computational resources on identifying the most effective hyperparameter configurations within a reasonable timeframe.

To ensure robust evaluation, the optimized models were validated using time-series cross-validation75,76. This technique divided the data into sequential training and testing sets, accounting for the temporal nature of the data and ensuring that the validation results reflected real-world forecasting scenarios. By combining Bayesian optimization with rigorous cross-validation, the study developed highly tuned models on the training datasets, capable of accurately forecasting groundnut prices under varying weather conditions. The sizes of the datasets used for training and testing are given in Table 2.

Table 2 Details on dataset used for training and testing process.

Model testing

The tuned models were evaluated in two distinct phases to examine the impact of weather features on forecasting performance77. Initially, the models were trained and tested using only the price data. Performance metrics, including Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE)78156, were calculated to establish baseline performance. This phase focused on analyzing the temporal patterns inherent in the price data to evaluate the models’ capability to predict prices without any external influences.

In the second phase79, weather features were incorporated into the models as additional predictors. These features included transformed weather parameters and time-based categorical variables. The models were subsequently retrained and tested, with the MAE and MAPE metrics recalculated. By integrating weather data, this phase provided a more comprehensive assessment80 of price dynamics, incorporating the effects of external environmental factors on market prices.

To quantify the impact of weather features81, the percentage difference between the metrics obtained in the two phases was computed. This comparison highlighted the extent to which weather parameters enhanced the models’ forecasting accuracy. Detailed analysis of these results shed light on the relationship between weather conditions and price fluctuations, underscoring the significance of leveraging weather data in predictive modelling82,83.

Finally, the best-performing model was selected based on the performance metrics84. This model was then utilized to forecast groundnut prices for the year 2024. The model demonstrating the greatest reduction in MAE and MAPE85 with the inclusion of weather features was identified as the most effective for forecasting groundnut prices86. This systematic evaluation ensured that the chosen model not only delivered high accuracy but also maximized the utility of weather data in improving predictive performance87. The metrics used in the study are given in Eqs. (2–4):

$$\text{MSE}=\frac{1}{N}\sum_{t=1}^{N}\left(Y_{t}-\widehat{Y}_{t}\right)^{2}$$
(2)
$$\text{MAE}=\frac{1}{N}\sum_{t=1}^{N}\left|Y_{t}-\widehat{Y}_{t}\right|$$
(3)
$$\text{MAPE}=\frac{1}{N}\sum_{t=1}^{N}\frac{\left|Y_{t}-\widehat{Y}_{t}\right|}{Y_{t}}\times 100$$
(4)

where \(N\) is the number of observations in the series, \(Y_{t}\) is the observed value at time \(t\) and \(\widehat{Y}_{t}\) is the predicted value at time \(t\).
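The three metrics translate directly into a few lines of NumPy. The observed and predicted values below are made-up numbers used only to show the calculation.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    """Mean absolute error."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs(y - y_hat)))

def mape(y, y_hat):
    """Mean absolute percentage error; assumes every observed value is non-zero."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs(y - y_hat) / y) * 100)

observed = [100.0, 200.0, 400.0]   # illustrative values
predicted = [110.0, 190.0, 380.0]
print(mse(observed, predicted), mae(observed, predicted), mape(observed, predicted))
```

MAE is expressed in the same units as the price, whereas MAPE is scale-free, which makes it easier to compare errors across crops with different price levels.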

Profitability analysis

Following the price forecast for all crops in the identified groundnut-based cropping patterns using the best model, profitability analysis88 was conducted to determine the most viable cropping pattern. For each crop, the gross income and net income were calculated using the estimated cost of cultivation and yield89. Gross income90,91 was derived from the product of forecasted price and yield, while net income (Eq. 5) was calculated by subtracting the cost of cultivation from the gross income.

To assess the economic feasibility of each cropping pattern, the Benefit-Cost Ratio (BCR)92 was computed. The BCR (Eq. 6), defined as the ratio of gross income to the cost of cultivation, provided a comprehensive measure of profitability93. In Eqs. (5) and (6), total benefits denote the gross income and total cost denotes the cost of cultivation. The cropping pattern with the highest BCR was identified as the most economically viable option. This systematic and analytical approach ensured that both revenue generation and cost implications were considered, leading to robust recommendations for selecting optimal cropping patterns under diverse conditions94.

$$\text{Net Income}=\text{Total Benefits}-\text{Total Cost}$$
(5)
$$\text{Benefit-Cost Ratio}=\frac{\text{Total Benefits}}{\text{Total Cost}}$$
(6)
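The profitability measures defined in the text (gross income as forecasted price times yield, net income as gross income minus the cost of cultivation, and BCR as gross income divided by the cost of cultivation) can be illustrated as follows. The prices, yields, costs, and pattern names are hypothetical and are not taken from the survey.

```python
def profitability(price_per_quintal, yield_quintal_per_ha, cost_per_ha):
    """Gross income, net income and benefit-cost ratio for one crop (per hectare)."""
    gross_income = price_per_quintal * yield_quintal_per_ha
    net_income = gross_income - cost_per_ha
    bcr = gross_income / cost_per_ha
    return {"gross_income": gross_income, "net_income": net_income, "bcr": bcr}

# Hypothetical cropping patterns with made-up inputs.
patterns = {
    "pattern_A": profitability(6000.0, 20.0, 80000.0),
    "pattern_B": profitability(5500.0, 18.0, 90000.0),
}

# The pattern with the highest BCR is treated as the most viable.
best_pattern = max(patterns, key=lambda name: patterns[name]["bcr"])
print(best_pattern, patterns[best_pattern])
```

A BCR above 1 means that gross income exceeds the cost of cultivation, so ranking by BCR favours patterns that return the most income per unit of cost.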

Results and discussion

The weekly market prices of various crops included in groundnut-based cropping patterns across the selected districts are represented in Fig. 1. The price of groundnut shows an upward trend with a similar distribution across all districts95. Black gram in Namakkal and green gram in both Namakkal and Salem exhibit a step-like upward trend in their price movements. The price of maize in Namakkal also follows an upward trend but with stronger fluctuations96. On the other hand, the price of tobacco in Erode and onion in Namakkal appears relatively stationary. Similarly, black gram in Salem shows a mostly stationary pattern97, except for a sudden spike in prices followed by a period of stability.

Fig. 1 Weekly market price trends of crops in Erode, Namakkal, and Salem districts (2010–2023).

Partial autocorrelation

The partial autocorrelation function (PACF) plots for the weekly market prices of various crops across Erode, Namakkal, and Salem districts are presented in Fig. 2 and provide valuable insight into lagged dependencies in the data. Groundnut prices in all districts show strong PACF spikes at lag 1, indicating significant short-term autocorrelation and suggesting that current prices are heavily influenced by recent past values98. For tobacco in Erode and onion in Namakkal, sharp first-lag spikes with minimal higher-lag contributions reflect a stationary price pattern with limited long-term dependence99. Similarly, PACF plots for black gram and green gram in Namakkal and Salem highlight a dominant first-lag spike, aligning with step-like upward trends and diminishing higher-order lag influence100,101. Maize in Namakkal shows a strong first-lag correlation along with notable contributions from subsequent lags, consistent with the crop’s more volatile price movement102,103. These findings support the persistence prediction analysis, indicating that recent historical prices are reliable predictors of near-future values, which is crucial for effective time-series modeling and forecasting strategies.

Fig. 2 Partial autocorrelation of market prices of different crops in selected districts.
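A dominant first-lag PACF spike of the kind described above can be reproduced on simulated data. In the sketch below, the PACF at lag k is estimated as the last coefficient of an ordinary least-squares regression of the series on its first k lags; the AR(1) series with coefficient 0.8 and the function name `pacf_ols` are assumptions of the example, and this is only one of several common PACF estimators.

```python
import numpy as np

def pacf_ols(x, max_lag):
    """PACF via OLS: at lag k, the coefficient on x[t-k] when regressing x[t] on lags 1..k."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    values = [1.0]  # lag 0 by convention
    for k in range(1, max_lag + 1):
        target = x[k:]
        lags = np.column_stack([x[k - j:len(x) - j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(lags, target, rcond=None)
        values.append(coef[-1])
    return np.array(values)

# Simulated AR(1) series: each value depends only on the previous one.
rng = np.random.default_rng(0)
n, phi = 2000, 0.8
series = np.zeros(n)
for t in range(1, n):
    series[t] = phi * series[t - 1] + rng.normal()

pacf_values = pacf_ols(series, max_lag=5)
print(np.round(pacf_values, 3))
```

For such a series the PACF is large at lag 1 and close to zero afterwards, which is the signature interpreted in the text as prices depending mainly on their most recent value.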

Cross correlation

The cross-correlation analysis shown in Fig. 3 reveals distinct relationships between weather parameters (Tmax, Tmin, RH, and RF) and crop prices over varying lag periods. Maximum and minimum temperatures generally exhibit a sinusoidal pattern104,105 with positive and negative correlations depending on the lag, indicating a delayed effect on crop prices. Relative humidity (RH)106 shows relatively stable patterns with moderate positive correlations across most lags, while rainfall (RF) often demonstrates peaks and troughs, reflecting its influence on crop prices at specific time intervals. The observed trends suggest that weather parameters exert both immediate and lagged impacts on crop prices, with the magnitude and direction of influence varying by parameter and lag duration103,107.

Fig. 3 Cross correlation of chosen weather parameters with the market prices of different crops in selected districts.

Seasonality

The Kruskal-Wallis test108 results shown in Table 3 reveal significant seasonality109 in crop prices across districts and time intervals, highlighting the importance of temporal patterns in forecasting. In Erode, groundnut and tobacco exhibit strong yearly seasonality110, with moderate to significant effects at the quarterly and monthly levels, emphasizing the role of broader cycles in price variations. Namakkal shows pronounced yearly seasonality111 for black gram and green gram, along with strong seasonal influences for groundnut and maize at yearly and quarterly intervals, while onion displays the most striking monthly and quarterly seasonality. In Salem112,113, black gram, green gram, and groundnut demonstrate highly significant yearly seasonality, with groundnut also showing strong effects at the quarterly and monthly levels. These findings underline the critical role of seasonality in crop price fluctuations114,115 and suggest that incorporating these patterns as features in boosting machine models can significantly enhance the accuracy of price forecasts, particularly for crops and districts with strong seasonal influences.

Table 3 Kruskal-Wallis test statistic for seasonality of different crops across the selected districts.

Feature transformations

The discretization of continuous weather variables into categorical bins is a crucial preprocessing step for boosting algorithms, particularly when handling high-frequency time-series data with nonlinear interactions. The selection of 20 bins for k-means clustering was guided by both theoretical and empirical considerations, as supported by recent literature55,116.

Firstly, k-means clustering effectively captures the natural groupings and variance within continuous data without relying on arbitrary thresholds. By transforming continuous weather parameters such as temperature, humidity, and rainfall into 20 discrete clusters, the method balances granularity and generalization. Too few bins (e.g., fewer than 10) may result in excessive information loss, masking subtle but important variations in weather patterns. Conversely, too many bins (e.g., over 30) can introduce noise and reduce the robustness of the model by overfitting to minor fluctuations.

Secondly, the choice of k = 20 aligns with earlier findings117, which demonstrated that this level of discretization provides an optimal trade-off between model interpretability and predictive performance in weather-dependent forecasting tasks. Prior research indicates that 20 clusters adequately preserve the shape and distribution of the original data while ensuring that boosting models such as CatBoost and HGBM maintain high accuracy.

Empirical tests conducted as part of this study further validate this choice. Cross-validation results indicate that models trained with 20-bin discretized weather features achieved higher R² scores and lower RMSE values compared to models using raw continuous data or alternative binning levels. This suggests that the 20-bin approach enhances the model’s ability to learn meaningful patterns from weather inputs without compromising computational efficiency. Therefore, the use of 20 bins in k-means discretization is justified by both its theoretical soundness and empirical effectiveness in improving model performance for groundnut price forecasting under varying weather conditions. The visual representation of discretized parameters is shown in Fig. 4.

Fig. 4 Discretized weather parameters.

Model training and testing

Hyperparameter tuning

Table 4 presents the results of hyperparameter tuning for four boosting models (LGBM, XGBM, HGBM and CBM), with their respective parameter ranges optimized for groundnut crops.

Table 4 Hyperparameter ranges tuned for the boosting models.

Training and testing

The performance error metrics for price forecasts by different models under univariate and multivariate situations are given in Fig. 5. The comparison between univariate and multivariate forecasting118,119 reveals a substantial improvement in accuracy when discretized weather features are added to the forecasting models. Across all districts and crops, multivariate forecasting67 consistently achieves lower MAE120 values compared to univariate forecasting. When the performance of all models is aggregated, the percentage improvement in MAE is striking and emphasizes the value of incorporating weather features121 into the forecasting process122.

Fig. 5 Error metrics for price forecast by different models under univariate and multivariate situations.

In Erode, across all crops and forecasting models, the average Mean Absolute Error (MAE) for univariate forecasting is significantly higher compared to multivariate forecasting123,124,125. For instance, in the case of groundnut, the average univariate MAE across all models is 779.54, whereas the average multivariate MAE drops sharply to 206.04, reflecting an improvement of approximately 73.6%. Similarly, for tobacco, the average MAE declines from 697.81 (univariate) to 198.88 (multivariate), amounting to a 71.5% improvement. These significant reductions in error clearly demonstrate the advantage of incorporating multivariate features—particularly weather parameters—into forecasting models126.

In Salem, a similar trend is observed. For groundnut, the univariate MAE averages 786.85, while the multivariate MAE is reduced to 209.64, reflecting a 73.4% improvement. In the case of black gram, the average univariate MAE is 780.54, which decreases to 208.29 with multivariate forecasting—an improvement of 73.3%. These findings further confirm the consistency and robustness of multivariate models in delivering more accurate price predictions across different crops and geographical areas126,127.

The results from Namakkal also reinforce this pattern. For groundnut, the average univariate MAE is 774.92, while the multivariate MAE drops to 203.29, reflecting a 73.8% improvement. Similarly, for maize, the MAE improves from 773.73 (univariate) to 202.85 (multivariate), again a 73.8% reduction in forecast error. These consistent and substantial reductions in MAE across all crops and districts underscore the impact of weather-informed multivariate modeling on improving the reliability of agricultural price forecasts128,129.
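The percentage improvements quoted for the three districts follow from the relative reduction in MAE, (univariate MAE − multivariate MAE) / univariate MAE × 100. The short sketch below recomputes them from the averages reported above; the function name `pct_reduction` and the dictionary layout are chosen here for illustration.

```python
def pct_reduction(univariate_mae, multivariate_mae):
    """Relative reduction in MAE, in percent, when moving to the multivariate model."""
    return 100.0 * (univariate_mae - multivariate_mae) / univariate_mae


# (univariate MAE, multivariate MAE) averages as reported in the text.
reported = {
    ("Erode", "groundnut"): (779.54, 206.04),
    ("Erode", "tobacco"): (697.81, 198.88),
    ("Salem", "groundnut"): (786.85, 209.64),
    ("Salem", "black gram"): (780.54, 208.29),
    ("Namakkal", "groundnut"): (774.92, 203.29),
    ("Namakkal", "maize"): (773.73, 202.85),
}

reductions = {key: round(pct_reduction(u, m), 1) for key, (u, m) in reported.items()}

for (district, crop), value in reductions.items():
    print(f"{district:9s} {crop:11s} {value:.1f}%")
```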

When all forecasting models and regions are considered collectively, the average MAE for univariate models is consistently higher than for multivariate models. This trend affirms the overall superiority of multivariate forecasting, especially when discretized weather features are incorporated130,131 (Thangavelu, G., et al., 2025)132.

Once the improvement in forecast accuracy through multivariate modeling is established, the focus shifts to identifying the best-performing algorithm. Based on multivariate MAE values, the HistGradientBoosting Machine (HGBM) consistently emerges as the top-performing model84,133,134.

  • In Erode, HGBM achieves the lowest multivariate MAE for groundnut (179.29) and tobacco (101.51).

  • In Salem, HGBM records the lowest multivariate MAE for groundnut (182.28) and black gram (180.57)135.

  • In Namakkal, HGBM delivers superior results for groundnut (171.00) and maize (174.86)84.

Table 5 Best-performing models for groundnut in Erode, Namakkal and Salem districts.

The best-performing models for groundnut in Erode, Namakkal and Salem districts are presented in Table 5. Mean Absolute Error (MAE) and Mean Squared Error (MSE) were used to identify the best-performing model. Comparing LightGBM (LGBM), XGBoost (XGBM), Histogram-based Gradient Boosting (HGBM) and CatBoost (CBM) across the three districts shows that HGBM has the lowest errors in every district, although the size of the errors differs markedly between districts.

Erode: The HGBM model significantly outperforms the other models, achieving the lowest MAE (265.15) and MSE (99413.9). LGBM and XGBM show relatively higher errors, with CBM having the highest errors. This suggests that HGBM is better suited for the Erode district’s data patterns, possibly due to its robustness in handling non-linear relationships.

Namakkal: Across the models, errors are substantially higher than in Erode and Salem. HGBM again has the lowest MAE (1397.48) and MSE (2352526), indicating better performance. CBM and LGBM models have significantly higher errors, suggesting they may not effectively capture the variability in Namakkal’s data.

Salem: HGBM consistently shows superior performance, with the lowest MAE (420.56) and MSE (249828.3). LGBM and CBM perform worse, with CBM being the least accurate model. The trend of HGBM outperforming other models holds true in this district.
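The selection rule described above, choosing the model with the lowest MAE and MSE, can be sketched as follows. The price series and the three sets of predictions are invented for illustration and the model names are placeholders; they are not results from this study. Breaking ties in MAE by MSE is likewise an assumption of the sketch.

```python
def mae(actual, predicted):
    """Mean Absolute Error: average size of the errors, in price units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average squared error, which penalises large misses."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Invented data: four observed prices and three hypothetical models.
actual = [5200, 5300, 5250, 5400]
predictions = {
    "model_a": [5150, 5350, 5300, 5350],  # always off by 50
    "model_b": [5200, 5300, 5250, 5200],  # exact three times, off by 200 once
    "model_c": [5100, 5400, 5350, 5300],  # always off by 100
}

scores = {name: (mae(actual, p), mse(actual, p)) for name, p in predictions.items()}

# Tuples compare element by element, so MAE is compared first and MSE breaks ties.
best = min(scores, key=lambda name: scores[name])
```

Here `model_a` and `model_b` have the same MAE, but the single large miss of `model_b` gives it a much higher MSE, which is why reporting both metrics is informative.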

Table 6 Comparative values of MAE and MAPE.

Table 6 reveals distinct differences in forecasting errors (MAE and MAPE) across districts and crops. In Erode, groundnut and tobacco have moderate MAE values (245.14 and 165.84, respectively) and similar MAPE values of around 5.4–5.6%, suggesting comparable prediction accuracy. In Namakkal, groundnut and onion stand out with the highest MAE values (356.21 and 533.64, respectively), while black gram and green gram have the lowest (70.26 and 63.85, respectively), indicating better forecasting accuracy for these crops. MAPE values show a similar trend, with onion reaching a markedly higher error rate (17.48%), likely due to market or environmental volatility. For Salem, groundnut has the highest MAE (327.75) and MAPE (6.21%), while black gram performs best with an MAE of 93.20 and a MAPE of 1.40%. This suggests more reliable forecasting for black gram than for other crops in this district.
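MAE is expressed in price units, whereas MAPE expresses each error relative to the observed price, so the two metrics need not rank crops in the same way when price levels differ. The sketch below illustrates this with two invented series that share the same MAE but have different MAPE; the numbers are assumptions for illustration only.

```python
def mae(actual, predicted):
    """Mean Absolute Error, in the same units as the prices."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent of the observed price."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Invented example: identical absolute errors at two different price levels.
high_actual, high_pred = [8000, 8200], [7900, 8300]
low_actual, low_pred = [2000, 2100], [1900, 2200]

mae_high, mape_high = mae(high_actual, high_pred), mape(high_actual, high_pred)
mae_low, mape_low = mae(low_actual, low_pred), mape(low_actual, low_pred)

print(f"high-priced crop: MAE={mae_high:.1f}, MAPE={mape_high:.2f}%")
print(f"low-priced crop:  MAE={mae_low:.1f}, MAPE={mape_low:.2f}%")
```

An error of 100 is a smaller share of a high price than of a low one, which is why both metrics are reported side by side in Table 6.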

While other boosting models such as LightGBM and XGBoost also perform competitively, their multivariate MAE values are consistently higher than those of HGBM across most crops and districts135. CatBoost, in contrast, tends to underperform, often yielding higher MAE values than the other algorithms. CatBoost is a robust model for domains rich in categorical complexity (e.g., user behavior, text, recommender systems). In this agricultural forecasting setting, however, which is characterized by discretized numerical weather inputs and the need for fast, high-accuracy regression across many models, HGBM performs better owing to its alignment with the data structure, its optimization for binned inputs, and its computational efficiency.

Hence, these findings establish HGBM as the most reliable and effective model for price forecasting in the context of profitable groundnut-based cropping patterns, particularly when discretized weather variables are utilized66,136. The model’s ability to handle high-dimensional, noisy, and nonlinear data enables it to consistently outperform competing methods in both accuracy and stability.

The high accuracy of the forecasting models, particularly the Histogram Gradient Boosting Model (HGBM), has significant implications for groundnut farmers in the Erode, Namakkal, and Salem districts. Reliable forecasts of groundnut prices enable farmers to make more informed decisions regarding the timing of sales, storage, and choice of cropping pattern. For instance, accurate price predictions can help farmers delay sales during periods of anticipated low prices or switch to more profitable cropping combinations in subsequent seasons.

Moreover, integrating these models into mobile-based decision support systems or local agricultural extension services could democratize access to predictive insights, especially for smallholder farmers, who often lack sophisticated market information. This can improve their bargaining power and reduce income volatility. At the policy level, accurate price forecasts can support better planning of minimum support prices and procurement strategies by government agencies.

Price forecasting

Following the training and testing process, the weekly price forecasts of the different crops produced by the Histogram Gradient Boosting model for the period 2024–2025 are plotted in Fig. 6.

Fig. 6

Weekly Price forecast of different crops by Histogram Gradient Boosting model for the period 2024–2025.

Profitability analysis

From the primary survey, the estimated cost of cultivation and estimated yield for each crop are given in Table 7. Using the forecasted prices and the survey estimates, the gross income and Benefit-Cost Ratio (BCR) for each of the identified cropping patterns were calculated (Table 8).

Table 7 Estimated cost of cultivation and estimated yield of crops involved in groundnut-based cropping patterns obtained from the primary survey.
Table 8 Forecasted average gross income and benefit cost ratio for different cropping patterns.

Table 8 provides insights into the economic viability of groundnut-based cropping patterns in the districts of Erode, Namakkal, and Salem, focusing on their average cost of cultivation, average gross income, and Benefit-Cost Ratio (BCR). These metrics reflect the profitability and efficiency of different groundnut-based crop combinations, helping to identify the most economically profitable patterns for farmers.

In Erode, two groundnut-based cropping patterns have been analyzed: “Groundnut – Tobacco” and “Groundnut – Groundnut – Tobacco.” The “Groundnut – Tobacco” cropping pattern involves an average cost of cultivation of ₹ 71,702.77 and generates an average gross income of ₹ 1,10,242.22. This results in a Benefit-Cost Ratio (BCR) of 1.54, indicating that for every rupee invested, farmers earn ₹1.54 in return. This reflects a modest yet stable level of profitability, demonstrating that this pattern is economically viable and offers reasonable returns on investment137.

In contrast, the “Groundnut – Groundnut – Tobacco” cropping pattern, which incorporates an additional cycle of groundnut, incurs a higher average cost of cultivation at ₹ 1,05,005.54. However, it also leads to an increased average gross income of ₹ 1,44,128.57. Despite the higher income, the BCR for this pattern is lower at 1.09, suggesting that while total earnings rise, the efficiency of return on each rupee invested decreases138,139. This analysis suggests that although the inclusion of a second groundnut cycle in the cropping pattern increases both costs and gross returns, the marginal improvement in profitability is relatively small. The lower BCR in the three-crop system implies that the additional input costs are not matched proportionately by the output gains. Therefore, while the “Groundnut – Groundnut – Tobacco” pattern may be attractive in terms of total income, the “Groundnut – Tobacco” system remains more efficient in terms of cost-effectiveness and return per unit of investment.

In Namakkal, three groundnut-based cropping patterns have been evaluated, each demonstrating greater profitability compared to those observed in Erode. These patterns highlight varying levels of economic efficiency, depending on the crop combinations involved.

The “Groundnut – Maize – Black gram” cropping pattern involves an average cost of cultivation of ₹ 70,199.55 and yields a gross income of ₹ 1,22,136.00. This results in a Benefit-Cost Ratio (BCR) of 1.74, indicating a highly favourable balance between investment and return. The BCR reflects strong profitability and suggests that this combination effectively utilizes available resources to maximize returns140.

The “Groundnut – Maize – Green gram” pattern incurs a slightly lower cultivation cost of ₹ 69,556.92 and produces a gross income of ₹ 1,17,109.85, yielding a BCR of 1.68. While the cost is marginally reduced compared to the black gram variant, the income is also slightly lower. This suggests that substituting green gram for black gram results in a modest decrease in overall profitability. However, due to its lower input requirements and relatively high return, this pattern is still considered economically efficient and viable141.

The “Groundnut – Onion” cropping pattern stands out as the most economically advantageous among the three. With an average cost of cultivation of ₹ 66,250.63 and a gross income of ₹ 1,44,817.91, it yields a remarkable BCR of 2.18. This indicates that for every rupee invested, farmers receive ₹2.18 in return. The inclusion of onion significantly enhances profitability, making this cropping pattern the most lucrative option in Namakkal. It reflects both high income potential and excellent cost efficiency142. Overall, all three groundnut-based cropping sequences in Namakkal outperform those in Erode in terms of profitability. Among them, the “Groundnut – Onion” pattern offers the highest return on investment, followed by “Groundnut – Maize – Black gram” and “Groundnut – Maize – Green gram.” These findings underscore the importance of crop selection and sequence in maximizing agricultural income and resource efficiency.

In Salem, two groundnut-based cropping patterns have been analyzed: “Groundnut – Green gram” and “Groundnut – Black gram.” These patterns illustrate the profitability and economic viability of integrating pulses into groundnut cultivation systems. The “Groundnut – Green gram” pattern incurs an average cost of cultivation of ₹ 46,573.03 and generates a gross income of ₹ 79,631.71. This results in a Benefit-Cost Ratio (BCR) of 1.71, indicating a favourable return on investment. The relatively low input cost combined with moderate income reflects strong profitability and suggests that this cropping sequence is both economically and agronomically viable143.

In comparison, the “Groundnut – Black gram” pattern has a slightly higher cost of cultivation at ₹ 46,775.46. However, it yields a significantly higher gross income of ₹ 93,012.28, leading to a BCR of 1.99. This substantial improvement in income relative to the marginal increase in cost demonstrates superior economic efficiency. As a result, this cropping sequence is considered more profitable and is the preferred groundnut-based system in Salem144.
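The Benefit-Cost Ratios discussed above can be cross-checked as the ratio of average gross income to average cost of cultivation. The sketch below applies this to the figures reported in the text; treating the BCR as a simple ratio of the two averages is an assumption of this example, and the variable names are illustrative. For six of the seven patterns the ratio agrees with the reported BCR to within 0.01. For "Groundnut - Groundnut - Tobacco" the same ratio is about 1.37 against a reported 1.09, so that figure may have been derived differently (for instance as a mean of per-farm ratios); the sketch only reports the simple ratio.

```python
# (average cost of cultivation, average gross income, reported BCR), in rupees.
patterns = {
    ("Erode", "Groundnut - Tobacco"): (71702.77, 110242.22, 1.54),
    ("Erode", "Groundnut - Groundnut - Tobacco"): (105005.54, 144128.57, 1.09),
    ("Namakkal", "Groundnut - Maize - Black gram"): (70199.55, 122136.00, 1.74),
    ("Namakkal", "Groundnut - Maize - Green gram"): (69556.92, 117109.85, 1.68),
    ("Namakkal", "Groundnut - Onion"): (66250.63, 144817.91, 2.18),
    ("Salem", "Groundnut - Green gram"): (46573.03, 79631.71, 1.71),
    ("Salem", "Groundnut - Black gram"): (46775.46, 93012.28, 1.99),
}


def benefit_cost_ratio(cost, gross_income):
    """Gross income earned per rupee of cultivation cost."""
    return gross_income / cost


computed = {key: benefit_cost_ratio(cost, income) for key, (cost, income, _) in patterns.items()}

# Rank the patterns from most to least profitable per rupee invested.
ranking = sorted(computed, key=computed.get, reverse=True)
best = ranking[0]

for district, pattern in ranking:
    reported_bcr = patterns[(district, pattern)][2]
    print(f"{district:9s} {pattern:32s} computed={computed[(district, pattern)]:.2f} reported={reported_bcr:.2f}")
```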

Fig. 7

Benefit-Cost Ratios of groundnut-based cropping patterns in Erode, Namakkal and Salem districts.

The groundnut-based cropping patterns in Erode, Namakkal and Salem are compared on the basis of their Benefit-Cost Ratios in Fig. 7. Overall, the analysis across Erode, Namakkal, and Salem highlights the substantial profitability of groundnut-based cropping patterns145. Among all the systems studied, those incorporating high-value crops such as onion, particularly in Namakkal, achieve the highest BCR, demonstrating exceptional economic returns. Additionally, cropping patterns that include black gram consistently outperform those with green gram in terms of return on investment, as evidenced by patterns in both Namakkal and Salem146. These findings underscore the importance of optimizing groundnut-based cropping sequences tailored to regional agro-climatic conditions, input costs, and market opportunities. By aligning crop selection with profitability metrics and local resource dynamics, farmers across Erode, Namakkal, and Salem can maximize returns and enhance overall farm income21,147. This strategic approach to cropping pattern design is essential for achieving sustainable agricultural development and economic resilience in these regions.

Conclusion

The research highlights the critical role of weather-based forecasting and profitability analysis in optimizing groundnut-based cropping patterns in Tamil Nadu10. The integration of advanced boosting algorithms (LightGBM, XGBoost, HistGradientBoosting (HGBM), and CatBoost) with feature engineering techniques such as Seasonal-Trend decomposition (STL) and data discretization significantly improves the accuracy of groundnut price forecasts148,154,155. Among the tested models, HGBM consistently outperforms the others by achieving the lowest multivariate Mean Absolute Error (MAE) across most crops and districts. This reflects its robust capacity to manage complex, high-dimensional, and nonlinear data effectively. The findings also confirm that multivariate forecasting models149,150, particularly those that incorporate weather parameters, consistently outperform univariate models. This reinforces the value of environmental variables in predictive modeling for agricultural markets151.

Moreover, the study finds that groundnut-based cropping patterns involving high-value crops, particularly onion in Namakkal, yield the highest Benefit-Cost Ratio (BCR), indicating substantial profitability152. This demonstrates the economic advantage of aligning groundnut with high-yield, high-market-value crops in region-specific sequences. Black gram-based patterns also consistently outperform green gram-based patterns in economic efficiency, in both Namakkal (BCR 1.74 versus 1.68) and Salem (1.99 versus 1.71). These findings underscore the importance of location-specific, weather-informed cropping strategies153 to improve farm profitability, mitigate risk, and enhance market outcomes.

From a policy standpoint, the study underscores the need for decentralized, data-informed crop planning tools that can guide farmers and extension officers in making adaptive decisions based on both market signals and climatic forecasts. The demonstrated superiority of multivariate, weather-integrated forecasting models provides a strong case for investment in digital agriculture infrastructure, particularly platforms that can disseminate localized forecasts and cropping recommendations in real time. This research offers actionable insights for policymakers, agricultural planners, and farmers alike. It encourages a paradigm shift toward precision agriculture, where region-specific cropping strategies, informed by predictive analytics and profitability metrics, can drive higher farm incomes, greater resilience to weather variability, and improved market alignment across groundnut-growing regions of Tamil Nadu.