Introduction

Early and accurate crop production forecasting is essential for policymakers to make timely decisions on export–import commerce, which underpins a country's food security1. It also helps agricultural producers avoid poor crop selection, which can cause substantial profit losses through over-production or under-production2,3,4. Moreover, the cropland loss observed in various nations in recent years, combined with high food demand owing to population growth, calls for accurate and up-to-date crop yield forecasting to maintain food security5. Manual forecasting does not scale to the growing volume of agricultural data, and machine learning has emerged as a promising alternative6.

Machine learning emerged from data mining as a methodology for teaching computers to learn concepts from data7,8,9,10,11. A model is trained and tested on large datasets and then used to predict new data. The present study focuses on rice, which, along with wheat and maize, is one of the world's three major crops in terms of cultivation and consumption12,13,14. Nearly 88% of the world's rice is grown in Asian nations, where 2.4 billion people eat rice daily15.

Given the importance of rice to national food security, several studies implemented various machine-learning techniques for forecasting rice yield. Jabjone and Jiamrum16 developed an artificial neural network (ANN) model to predict rice production in the Phimai district, Thailand. The developed ANN model achieved highly accurate estimation with low errors (low RMSE) in rice yield forecasting using meteorological factors, including rainfall, water distribution, evapotranspiration, temperature, humidity, and wind speed16. Marndi, Ramesh17 applied long short-term memory (LSTM) for predicting rice yield using different input scenarios. The best LSTM model was achieved using rainfall as an input variable for rice yield forecasting. Sultana and Khanam18 compared the performance of Auto-regressive Integrated Moving Average (ARIMA) and Artificial Neural Network (ANN) on univariate time series data of yearly rice production from 1972 to 2013. According to this study, the ARIMA model outperforms the ANN model since the estimated error of ANN was significantly higher than ARIMA errors. In addition, Balakrishnan and Muthukumarasamy1 suggested an ensemble model to predict crop production over time based on the Ada support vector machine (SVM) and Ada and Naive Bayes (Naive), where Ada SVM and Ada Naive performed better than SVM and Naive Bayes.

Multiple input variables have been used in rice yield estimation, including climatic data, remote sensing data, and statistical data (e.g. sowing area). Climatic variables showed a significant relationship with rice yield in several studies16,17,19,20. For example, a temperature increase of 1–2 °C during the paddy earing stage decreases paddy rice production by 10–20%21. Compared to technology, input, and social and economic factors, climate factors alone explain 84% of the variation in paddy rice production22. Moreover, remote sensing vegetation indices such as the normalized difference vegetation index (NDVI) and radar vegetation index (RVI) were found to be highly efficient in evaluating rice production since they quantify the crop photosynthetic activity responsible for biomass formation23. NDVI derived from Moderate Resolution Imaging Spectroradiometer (MODIS) (AQUA/TERRA) imagery achieved a high correlation with rice production, with R2 = 0.85 as estimated by Faisal, Rahman24 and R2 of 0.76 to 0.86 as estimated by Mosleh and Hassan15. SAR data captured by RADARSAT have also shown high accuracy (97.4% and 96.6%) in estimating rice production based on back-scatter25.

Although several studies have discussed the use of machine learning in rice yield prediction, hybrid models that integrate two models are still poorly documented. In addition, integrating multiple data sources such as climate data, remote sensing, and agricultural statistics in rice yield estimation is poorly tested. Therefore, the present study aims to (1) develop multiple single and hybrid machine learning models for predicting rice production across China, the world's biggest rice producer (211 million tons26), and test multi-input scenarios (climatic variables, remote sensing, agricultural statistics and soil properties) to define the combination of input variables that yields the most accurate rice production model; (2) identify the dominant factors (climate, soil, remote sensing and sown area) that influence rice production at each zonal scale; and (3) propose solutions for improving rice production across China. This research helps determine the best approach (optimal model and input variables) for simple, rapid, and inexpensive prediction of rice production that is timely and reliable at regional scales across China. The main contributions of the paper are as follows.

  1. This study attempts to model and predict rice production using multi-source data and hybrid machine-learning algorithms.

  2. This study provides an in-depth comparative analysis of the proposed hybrid RF-XGB and CNN-LSTM algorithms against single machine learning models, namely random forest (RF), extreme gradient boosting (XGB), convolutional neural network (CNN) and long short-term memory (LSTM), with eleven combinations (scenarios) of input variables across China.

  3. This study identifies the dominant factors for rice production across China's main rice counties based on multi-input scenarios (climatic variables, remote sensing, agricultural statistics and soil properties).

Materials

Study area

In this study, we focused on the main cultivation areas of rice in mainland China, dominated by single-rice system (i.e. one rice harvest per year in a given field) and double-rice system (i.e. two rice harvests per year in a given field) (Fig. 1). The study area covers approximately 29 million hectares in nine provinces. This region, between 20° 10′ N ~ 53° 33′ N and 105° 54′ E ~ 135° 05′ E, is the most important food basket in China, accounting for ~ 96% of the total rice cultivation area and ~ 94% of the total rice production in China27,28,29. China, the world’s largest rice producer (about 206 million metric tons of annual production), accounts for 28% of the world’s rice production30. Rice occupies 41% of total grain production with only 35% of the cropland areas in China, which feeds roughly 65% of Chinese people31. The nine provinces are Heilongjiang, Shaanxi, Liaoning, Hainan, Anhui, Hebei, Henan, Guangdong and Shandong. The large difference in latitude leads to a pronounced variation in illumination conditions during the year: in South China, the minimum and maximum daily sunshine duration are 11 and 13 h while in North China they are 7 and 17 h, respectively. Due to its location at the eastern margin of the Eurasian continent, the climate of the eastern part of China is monsoonal with warm and humid summers and temperate, dry winters.

Figure 1

(a) China’s rice districts and distribution of meteorological stations, (b) the flowchart of methodology. The map in Fig. 1a was generated with the ArcGIS 10.8 software and (b) was created in Microsoft PowerPoint.

Datasets

The monthly meteorological datasets over the rice districts in nine provinces across China were retrieved from the China National Meteorological Data Sharing Platform32,33,34,35. The data on rice production and sown area of 64 rice districts from 2000 to 2017 were extracted from the National Bureau of Statistics of China (Table 1). For the remote sensing datasets, three vegetation indices (VIs) and two biophysical parameters (BPs) were used to estimate rice production. These five parameters are available on Google Earth Engine (GEE, https://developers.google.com/earth-engine/datasets/) with a spatial resolution of 500 m. VIs have been widely used in earlier studies as production estimators due to their relevance to vegetation health18,36,37, and BPs have also been used in wheat yield prediction3. Compared to VIs, BPs are usually more reliable in estimating crop production since they reflect the state of the crops more adequately. The present study used GEE to estimate the average annual value of all five parameters over the 64 rice districts in China. In addition to weather data, soil properties including soil depth, soil organic matter, pH, cation exchange capacity, porosity, bulk density, NPK and soil texture for the topsoil layer (0–30 cm) and the subsoil layer (30–100 cm) at a resolution of 0.00833° (~ 1 km) were also collected; they are available at http://globalechange.bnu.edu.cn and detailed in Ref.38.

Table 1 Summary of the collected datasets.

Methodology

The general methodology of the present study is shown in Fig. 1b. The study used multiple data sources, including remote sensing, climate data, agricultural statistics and soil properties, as input variables to single and hybrid algorithms to predict rice production. The single and hybrid models developed in this work are described below.

Single models

Extreme gradient boosting (XGB)

The XGB algorithm suggested by Ref.39 is an improvement of the Gradient Boosting Machine based on regression trees. It builds on the idea of “boosting”, which combines the predictions of a set of “weak” learners into a “strong” learner through additive training. More detailed information and the computation procedures of the XGB algorithm can be found in Ref.39. We tuned XGB with a grid search over the number of trees (n estimators) and the maximum tree depth (max depth).

Random forest (RF)

The RF model, developed by Breiman40, is based on an ensemble of decision trees with controlled variance. It has been widely used for regression and classification problems such as land use/cover mapping41 and water quality studies42,43. The detailed computation procedure of the RF model can be found in Refs.40,44.

Long short-term memory (LSTM)

LSTM is a special type of recurrent neural network (RNN)45 designed to handle sequential data, with advantages over traditional RNNs. An LSTM network contains memory blocks that are linked through layers. Each layer includes a set of recurrently connected memory cells and three multiplicative units, namely the input, forget, and output gates46,47. The Adam training algorithm was used; the learning rate was set to 0.0001 and the batch size was set to 548.
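
As a point of reference, the gating mechanism described above is commonly written as follows. This is the standard LSTM formulation rather than a specification taken from this study; the symbols are notation assumed here for illustration (\(x_{t}\) is the input at step t, \(h_{t}\) the hidden state, \(c_{t}\) the cell state, \(W\), \(U\) and \(b\) the learned weights and biases, \(\sigma\) the logistic function and \(\odot\) the element-wise product):

$$\begin{aligned} f_{t} &= \sigma \left( W_{f} x_{t} + U_{f} h_{t-1} + b_{f} \right), \\ i_{t} &= \sigma \left( W_{i} x_{t} + U_{i} h_{t-1} + b_{i} \right), \\ o_{t} &= \sigma \left( W_{o} x_{t} + U_{o} h_{t-1} + b_{o} \right), \\ \tilde{c}_{t} &= \tanh \left( W_{c} x_{t} + U_{c} h_{t-1} + b_{c} \right), \\ c_{t} &= f_{t} \odot c_{t-1} + i_{t} \odot \tilde{c}_{t}, \\ h_{t} &= o_{t} \odot \tanh \left( c_{t} \right). \end{aligned}$$

Here \(f_{t}\), \(i_{t}\) and \(o_{t}\) are the forget, input and output gates; the forget gate controls how much of the previous cell state is retained, which is what lets the network carry information across long sequences.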

Convolutional neural network (CNN)

The convolution layers are the main difference between a CNN and a conventional ANN. These layers perform automatic feature extraction, capturing features of the input data that are key to learning the relationship between the input and output parameters. In this study, a CNN with one-dimensional (1D) convolutional filters (1D CNN) was used44,49. Detailed information about the CNN architecture and specification can be found in Refs.33,50,51.
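
To make the idea of a 1D convolutional filter concrete, the following minimal sketch slides a small kernel over a one-dimensional input and returns the resulting feature map. It is an illustration only: the function name `conv1d_valid`, the input values and the fixed kernel are hypothetical, and in a trained CNN the kernel weights would be learned rather than set by hand.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` (no padding, stride 1) and return the dot products.

    Deep learning libraries implement this sliding dot product
    (strictly a cross-correlation) as their "1D convolution".
    """
    width = len(kernel)
    return [
        sum(signal[start + k] * kernel[k] for k in range(width))
        for start in range(len(signal) - width + 1)
    ]


# Hypothetical input sequence for one district (illustrative values only).
sequence = [1.0, 2.0, 3.0, 4.0, 2.0, 0.0]
# A hand-picked difference filter; a trained CNN would learn these weights.
kernel = [1.0, 0.0, -1.0]

feature_map = conv1d_valid(sequence, kernel)
print(feature_map)
```

Each output value summarizes a local window of the input, which is the sense in which convolution layers extract features automatically.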

Hybrid models

Hybrid RF and XGB

The RF and XGB models were hybridized to improve on the performance of the single models, each of which is described in the previous sections. RF-XGB has been reported to achieve high accuracy compared to other ML models (e.g. ANN and SVM) in agricultural applications, such as determining irrigation timing52 and detecting plant diseases53. Hence, the present study tests the performance of the RF-XGB hybrid model in predicting rice yield compared to single models.

Hybrid LSTM and CNN

LSTM and CNN were trained with the same inputs and combined into a hybrid model to produce forecasts. The proposed hybrid CNN-LSTM model uses CNN layers for feature extraction from the input data and LSTM layers for sequence learning. CNN and LSTM are among the most commonly used deep learning models. The present study tests the efficiency of the hybrid LSTM–CNN model in rice yield forecasting. Its hyper-parameters, including the training algorithm, learning rate, batch size, and number of training epochs, were set to match those of the single CNN and LSTM models, as explained earlier.

Input scenarios and performance evaluation

This study investigated eleven input scenarios comprising various combinations of climatic, soil, agricultural and remote sensing variables (Table 2). Dividing the multi-source data into scenarios makes it possible to evaluate each variable group's contribution and to identify workable ways of predicting rice production from whatever data are available. There are two main approaches to selecting input combinations. The first relies on previous studies that trained and tested multiple scenarios to arrive at a combination with high accuracy and low error. The second, which we followed in this study, trains and tests various variable combinations directly to select the best scenarios for predicting rice production. Each scenario was designed to reveal the weight and significance of particular inputs. For example, scenario 1 used only the sown area, one of the main variables affecting rice production according to previous studies. Scenarios 3, 4 and 6 isolate the impact of soil, climate and remote sensing parameters, respectively, on rice production, in order to identify management practices that help ensure food security in China. The other scenarios combine the important parameters from climate, soil, and remote sensing. The input datasets were split into 70% for training and 30% for testing. The root mean square error (RMSE), Nash–Sutcliffe model efficiency coefficient (NSE), mean absolute error (MAE), and coefficient of determination (R2) were used to assess the performance of the applied models. These statistics are defined as:

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({P}_{i}-{O}_{i}\right)}^{2}},$$
(1)
$$NSE=1-\frac{\sum_{i=1}^{n}{\left({P}_{i}-{O}_{i}\right)}^{2}}{\sum_{i=1}^{n}{\left({O}_{i}-\overline{O}\right)}^{2}},$$
(2)
$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|{O}_{i}-{P}_{i}\right|,$$
(3)
$${R}^{2}={\left[\frac{\sum_{i=1}^{n}({O}_{i}-\overline{O})({P}_{i}-\overline{P})}{\sqrt{\left(\sum_{i=1}^{n}{({O}_{i}-\overline{O})}^{2}\right)\left(\sum_{i=1}^{n}{({P}_{i}-\overline{P})}^{2}\right)}}\right]}^{2},$$
(4)

where Oi and Pi are the actual and the predicted production, respectively, \(\overline{O}\) and \(\overline{P}\) are the averages of the actual and predicted production, i indexes the observations, and n is the number of observations.
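
The four statistics can be computed directly from paired observed and predicted values. The sketch below uses only the Python standard library; the function names and the sample values are illustrative assumptions, not data from this study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, Eq. (1)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency, Eq. (2); 1 is a perfect fit."""
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    variance_term = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - squared_error / variance_term


def mae(observed, predicted):
    """Mean absolute error, Eq. (3)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted, Eq. (4)."""
    mean_obs = sum(observed) / len(observed)
    mean_pred = sum(predicted) / len(predicted)
    covariance = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return covariance ** 2 / (var_obs * var_pred)


# Hypothetical production values (illustrative only).
observed = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]

print(rmse(observed, predicted), mae(observed, predicted))
print(nse(observed, predicted), r_squared(observed, predicted))
```

Note that NSE and R2 can differ: NSE penalizes systematic bias, whereas R2 as defined in Eq. (4) measures only linear association.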

Table 2 Input combinations (scenarios) for the applied models.

The standardized yield residuals series (SYRS)

Crop yield is affected by many variables besides climate and shows a positive trend over time54,55, partly because mechanization and innovation in agriculture have increased over the last century55. To remove the bias introduced by non-climate factors, the original yield time series were transformed into standardized yield residuals series (SYRS)56,57. The indicator of agricultural drought risk is given by the residuals of the detrended yield \(y_{i}^{T}\), following Ref.55:

$$y_{i}^{T} = y_{i}^{0} - y_{i}^{(\tau )} ,$$
(5)

where \(y_{i}^{0}\) is the observed crop yield and \(y_{i}^{(\tau )}\) is the value of the fitted quadratic polynomial regression model. The SYRS is computed as:

$$SYRS = \frac{{y_{i}^{T} - \mu }}{\sigma },$$
(6)

where μ is the mean of the yield residuals and σ is the standard deviation of the yield residuals55.
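
Equations (5) and (6) can be sketched as a two-step procedure: fit a quadratic trend, then standardize the residuals. The code below is a minimal standard-library illustration under stated assumptions: the time axis is taken as the index 0, 1, 2, … rather than calendar years (to keep the fit well conditioned), σ is taken as the population standard deviation (the text does not say which form was used), and the yield values are hypothetical.

```python
from statistics import mean, pstdev


def fit_quadratic(t, y):
    """Least-squares fit of y = a + b*t + c*t**2 via the 3x3 normal equations."""
    power_sums = [sum(ti ** k for ti in t) for k in range(5)]
    rhs = [sum(yi * ti ** k for ti, yi in zip(t, y)) for k in range(3)]
    # Augmented matrix of the normal equations.
    m = [[float(power_sums[i + j]) for j in range(3)] + [float(rhs[i])] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [a - factor * b for a, b in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def syrs(yields):
    """Standardized yield residuals: detrend (Eq. 5), then z-score (Eq. 6)."""
    t = list(range(len(yields)))
    a, b, c = fit_quadratic(t, yields)
    trend = [a + b * ti + c * ti ** 2 for ti in t]
    residuals = [y - tr for y, tr in zip(yields, trend)]
    mu = mean(residuals)
    sigma = pstdev(residuals)  # assumption: population standard deviation
    return [(r - mu) / sigma for r in residuals]


# Hypothetical annual yields (t/ha), illustrative only.
yields = [6.0, 6.2, 6.1, 6.5, 6.4, 6.9, 6.6, 7.0, 7.2, 7.1]
z = syrs(yields)
print([round(v, 2) for v in z])
```

By construction the resulting series has zero mean and unit standard deviation, so years can be compared across districts with different yield levels.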

SPEI-3 and SPEI-6 were analyzed to assess the effect of drought severity and to evaluate the vegetation response to drought58. To assess the impacts of drought on crop yields, the percentage of annual yield loss (YL%) was estimated using Eq. (7):

$$Y_{L} = \frac{{Y_{i}^{0} - Y_{i}^{(\tau )} }}{{Y_{i}^{(\tau )} }} \times 100,$$
(7)
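
As a worked illustration of Eq. (7), the snippet below computes the percentage deviation of an observed yield from its trend value. It assumes that \(Y_{i}^{0}\) and \(Y_{i}^{(\tau )}\) carry the same meaning as the observed and fitted yields in Eq. (5); the function name and the numbers are hypothetical.

```python
def yield_deviation_percent(observed_yield, trend_yield):
    """Percentage deviation of observed yield from the fitted trend, Eq. (7).

    With the equation as written, a negative value means the observed
    yield fell below the trend, i.e. a loss.
    """
    return (observed_yield - trend_yield) / trend_yield * 100.0


# Hypothetical example: trend predicts 6.0 t/ha, observed yield is 5.4 t/ha.
deviation = yield_deviation_percent(5.4, 6.0)
print(round(deviation, 1))
```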

Results

Model performance

Performance of the single and hybrid models

To compare the accuracy of the single and the hybrid models, this study tested the performance of the four single models (RF, XGB, CNN, and LSTM) against the two hybrid models (RF-XGB and CNN-LSTM). Averaged over all input scenarios, the hybrid models estimated rice production better than the single models (Table 3). Notably, the sowing area (SA) alone achieved relatively high performance, with an average R2 of 0.825, NSE of 0.823, and RMSE of 35.592 × 10^4 ton across all ML methods. Without SA, integrating climatic and remote sensing variables achieved only moderate performance (Sc10, R2 = 0.533) (Table 3). The highest R2 (0.8593) and NSE (0.8556) and the lowest RMSE (26.6903 × 10^4 ton) were achieved by the hybrid RF-XGB model, followed by LSTM-CNN. In contrast, the LSTM model performed worst, with R2, NSE and RMSE of 0.6786, 0.6693 and 43.9143 × 10^4 ton, respectively.

Table 3 The performance evaluation of applied models in rice production.

Optimum input scenario for rice production

The tested models performed differently across the input scenarios. On average, the best results were obtained with scenario 8 (soil variables and sown area) and scenario 11 (all variables) as inputs to the prediction models (Table 3). In both scenarios, R2 and NSE were 0.95, and the RMSE was 19.69 × 10^4 ton and 19.3 × 10^4 ton, respectively. On the other hand, remote sensing indices alone (Sc6) gave the lowest performance for rice production estimation (R2 = 0.362, NSE = 0.340, RMSE = 68.659 × 10^4 ton), whereas combining sown area with remote sensing (scenario 7) improved performance substantially (R2 = 0.899, NSE = 0.898, RMSE = 27.32 × 10^4 ton).

To investigate the performance of each single and hybrid model under the eleven scenarios, the R2, NSE and MAE indices were calculated for every model and scenario (Table 4). In terms of MAE, the weakest single model was LSTM in scenarios 10 and 4 (51.38 × 10^4 and 50.35 × 10^4 ton, respectively), while the best model was RF-XGB in scenarios 8 (soil variables and SA) and 5 (climate variables and SA) (5.85 × 10^4 and 7.70 × 10^4 ton, respectively). The highest R2 values (0.97) were recorded in scenarios 8 and 11 for RF-XGB and LSTM-CNN, and the lowest R2 values (0.11 and 0.13) were recorded for the LSTM model in scenario 4 (climate variables), followed by scenario 10. Likewise, the highest NSE values (0.97) were obtained by RF-XGB and LSTM-CNN in scenarios 8 and 11, whereas the lowest NSE values (0.27 and 0.32) occurred in scenario 6 (remote sensing) with the XGB and RF models, respectively. In scenario 3 (soil variables), the NSE was higher than 0.82 for all models, and it rose above 0.92 for all models in scenario 11. The radar chart shows the RMSE of the applied models in the different scenarios (Fig. 2a). The weakest single model was LSTM in scenario 4 (climate variables) with an RMSE of 81.85 × 10^4 ton, followed by XGB in scenario 6 (remote sensing) with an RMSE of 73.13 × 10^4 ton (Fig. 2a). Performance improved when the hybrid model was applied: for example, RF-XGB achieved RMSEs of 38.45 × 10^4 ton and 6.45 × 10^4 ton in scenarios 4 and 5, respectively. Other low values were an RMSE of 13.65 × 10^4 ton and, for scenario 11 (all variables) with the LSTM-CNN and RF-XGB models, 14.90 × 10^4 ton. Based on these results, the hybrid models clearly performed better in rice production estimation than the single models. Across scenarios, the hybrid models performed worst in scenarios 6 (remote sensing) and 4 (climate) and best in scenarios 8 and 11.

Table 4 The performance evaluation of applied models in rice production.
Figure 2

(a) Radar chart for the RMSE of the applied models (Sc: scenario), (b) boxplot of the error distribution of the developed RF-XGB and LSTM-CNN models in scenarios 8 and 11. The figures were generated with the Origin 2023b software.

Therefore, to select the best hybrid model and scenario, box plots of the residuals (estimation errors) were developed for scenarios 8 and 11 with RF-XGB and LSTM-CNN. Positive and negative estimation errors indicate under- and over-estimation, respectively. The RF-XGB model in scenarios 8 and 11 appears to be the best model, with errors 53% and 23% lower than those of the LSTM and XGB models, respectively. The lowest error overall was obtained in scenario 8 (soil + SA) with RF-XGB. For scenario 8, the lower quartile (Q1) was − 3.32 for RF-XGB and − 9.47 for LSTM-CNN; for scenario 11, Q1 was − 3.59 for RF-XGB and − 4.45 for LSTM-CNN. Moreover, the interquartile range (IQR = Q3 − Q1) of RF-XGB was 1.41 and 1.45 for scenarios 8 and 11, respectively, compared with 10.37 and 8.46 for LSTM-CNN, which shows that the RF-XGB errors are much more tightly distributed (Fig. 2b). Therefore, the RF-XGB model shows a clear superiority in scenarios 8 and 11.
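
The quartile statistics behind such a box plot can be reproduced with a few lines of code. The sketch below is illustrative: the residual is taken as observed minus predicted (so that positive values indicate under-estimation, as stated above), the quartile interpolation method is an assumption because plotting packages differ, and the sample values are hypothetical.

```python
from statistics import quantiles


def box_stats(observed, predicted):
    """Return Q1, median, Q3 and IQR of the residuals (observed - predicted)."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    # Assumption: "inclusive" interpolation; other tools may use a different rule.
    q1, median, q3 = quantiles(residuals, n=4, method="inclusive")
    return {"Q1": q1, "median": median, "Q3": q3, "IQR": q3 - q1}


# Hypothetical observed and predicted production values (illustrative only).
observed = [10, 12, 14, 16, 18, 20, 22]
predicted = [14, 14, 15, 16, 17, 18, 18]

stats = box_stats(observed, predicted)
print(stats)
```

A smaller IQR means that the middle half of the errors lies in a narrower band, which is the criterion used above to compare the two hybrid models.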

Importance of predictor variables in rice production estimation

The single-model results showed that XGB outperformed RF; thus, the XGB model was applied to analyse the joint contributions of subsets of features while maintaining fast convergence during iterations. The predictor variables in the XGB model were ranked by importance, and the ranking differed between the regional and zonal scales (Fig. 3). At the regional scale, the most important feature in rice production estimation was sown area (53%), followed by soil properties (32%) and climate (7%) (Fig. 3a). The importance of sown area decreased to 8% and 27% in east and southeast China, respectively, whereas it reached 90% in northeast China. To analyze the climate, soil and remote sensing factors separately, Fig. 3c–e were developed. For example, soil texture accounted for 18% of the total contribution of soil properties (32%) to rice production estimation across China. In East China, texture accounted for 82% of the total contribution of soil properties (87%), whereas its contribution was 24% in South China. In contrast, the contribution of climate was low in all zones: relative humidity contributed 3.5% of the total contribution of climate at the regional scale, while in southeast China temperature contributed almost half of the total contribution of climate (2.95%) (Fig. 3e). Evapotranspiration was at the bottom of the importance ranking owing to the low importance of the climate factors.
Meanwhile, at the zonal scale, sown area became the dominant factor for rice production estimation in northeast China, reaching 90%, followed by soil properties at 4% (Fig. 3b). On the other hand, soil properties were the dominant factor for rice production in east and southeast China, at 87% and 57%, respectively.
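
The two kinds of percentages reported above (a group's share of the total importance, and a feature's share within its group) can be derived from per-feature importance scores as in the following sketch. The feature names, group labels and importance values are hypothetical and are not the scores obtained in this study.

```python
def group_shares(importances, groups):
    """Percentage of total importance attributed to each variable group."""
    total = sum(importances.values())
    shares = {}
    for feature, value in importances.items():
        group = groups[feature]
        shares[group] = shares.get(group, 0.0) + 100.0 * value / total
    return shares


def within_group_share(importances, groups, feature):
    """Percentage of a group's importance contributed by one feature."""
    group = groups[feature]
    group_total = sum(v for f, v in importances.items() if groups[f] == group)
    return 100.0 * importances[feature] / group_total


# Hypothetical per-feature importance scores (illustrative only).
importances = {"sown_area": 0.50, "clay": 0.20, "som": 0.10, "temperature": 0.15, "ndvi": 0.05}
groups = {
    "sown_area": "sown area",
    "clay": "soil",
    "som": "soil",
    "temperature": "climate",
    "ndvi": "remote sensing",
}

shares = group_shares(importances, groups)
clay_in_soil = within_group_share(importances, groups, "clay")
print(shares, clay_in_soil)
```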

Figure 3

Relative importance ranking of the features in rice production estimation for the regional and zonal scale. The figures were generated with the Origin 2023b software.

Solution for improving rice production

To explore how rice production could be improved in each zone, we exchanged the soil properties between zones, from northeast China to southeast China and from east to southeast China. Figure 4a shows the effect of changing the soil properties in scenario 8: the RMSE decreased by 38% in northeast China when its soil properties were replaced with those of southeast China. In contrast, when the soil properties of southeast China were replaced with those of northeast China, the RMSE did not decrease significantly (0.6%). Similarly, the MAE decreased significantly, by 20%, when the soil properties of northeast China were changed to those of southeast China. Scenario 11 was consistent with scenario 8: the RMSE decreased significantly, by 26%, when the soil properties of northeast China were changed to those of southeast China (Fig. 4b). On the other hand, when the soil properties of east China were replaced with those of southeast China, model performance declined; for example, the RMSE and MAE increased by 6% and 31%, respectively. Another major suggested solution is to increase the soil organic matter (SOM) to enhance rice production. Therefore, we simulated the effect of a 15% increase in SOM on rice production (scenario 8). Figure 4c shows that the performance of the hybrid RF-XGB model improved significantly when SOM was increased by 15% in northeast and southeast China: the RMSE declined by 16% and 10%, respectively, compared with the current SOM. However, increasing the SOM in East China had a negative effect on the rice production estimation, with the RMSE increasing by 21%.
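
The logic of such a what-if experiment can be sketched as follows: scale one input, re-run the model, and express the change in RMSE as a percentage of the baseline. Everything in this example is an assumption for illustration; in particular, `toy_model` is a hypothetical linear stand-in and not the RF-XGB model of this study, and the input rows and observed values are invented.

```python
import math


def rmse(observed, predicted):
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def scale_feature(rows, name, factor):
    """Return copies of the input rows with one feature multiplied by `factor`."""
    return [{**row, name: row[name] * factor} for row in rows]


def percent_change(baseline, scenario):
    """Relative change of a metric with respect to its baseline value."""
    return (scenario - baseline) / baseline * 100.0


def toy_model(row):
    # Hypothetical stand-in for a trained model (illustrative only).
    return 2.0 * row["sown_area"] + 5.0 * row["som"]


rows = [{"sown_area": 10.0, "som": 2.0}, {"sown_area": 20.0, "som": 3.0}]
observed = [33.0, 58.0]

baseline_rmse = rmse(observed, [toy_model(r) for r in rows])
raised_som = scale_feature(rows, "som", 1.15)  # SOM increased by 15%
scenario_rmse = rmse(observed, [toy_model(r) for r in raised_som])

change = percent_change(baseline_rmse, scenario_rmse)
print(round(baseline_rmse, 3), round(scenario_rmse, 3), round(change, 1))
```

A negative percentage change corresponds to a decline in RMSE, as reported above for northeast and southeast China, and a positive value to an increase, as reported for East China.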

Figure 4

Changing the soil properties (a,b) and increasing SOM by 15% (c) in each zone. The figures were generated with the Origin 2023b software.

As shown in Fig. 5, the decreasing trend of precipitation and the increasing temperature in southeast China negatively affected rice production. The maximum and minimum temperatures increased by 0.16 and 0.19 °C/year, while precipitation decreased by 20 mm/year, which reduced rice production by 2.23% on average in southeast China. In contrast, the production increase in northeast China may be attributed to the non-significant trends in precipitation and temperature (maximum and minimum) and to improved irrigation, which positively affects rice production even during dry years59,60. Therefore, the SPEI drought index was analyzed to investigate drought conditions during the study period and their relation to the production anomaly.

Figure 5

Time series of precipitation, maximum and minimum temperature, sunshine, sown area and production across zones. The figures were generated with the Origin 2023b software.

The SPEI series at 3- and 6-month timescales fluctuated during the study period (Fig. 6a and b). In Northeast China, the drought (SPEI-3) from 2009 to 2012 was classified as extreme, especially in 2009, when it occurred during the rice-season months of May, June and July. In East China, the drought from 2009 to 2013 can be classified as severe. In southeast China, the drought from 2011 to 2015 can be classified as severe, with extreme drought found only in June and September 2011.

Figure 6

The temporal evolution of SPEI-3 and SPEI-6 (a and b), the Pearson correlation coefficient (r) of the linear regression between the SPEIs at 3- and 6-month timescales and the SYRS of rice yield in the three zones (c), and yield losses across the three regions (d). The figures were generated with the SigmaPlot software.

From 2002 to 2008, no drought event occurred. Figure 6c shows the correlation analysis between SPEI-3/6 and the SYRS of rice yield across the three zones. The correlation coefficient between the SYRS of rice in southeast China and SPEI in April and May (initial stage) is the highest among all months, revealing that rice yield is more prone to drought in the initial stage. Meanwhile, in northeast and east China, rice yield is less correlated with drought than in southeast China, which may be because improved irrigation positively affects rice yield even during dry years. The degree of yield loss varied during the study period across the three regions owing to the impact of drought and wet conditions on the various crop stages. In East China, 2003 was the year with the highest rice failure, with yield losses reaching 60%. In southeast China, the highest losses occurred in 2001, 2002 and 2003 (20%, 27% and 18%, respectively), with average losses over the whole study period of 2.23% (Fig. 6d). Besides the climate variables, soil properties play a vital role in improving rice production. The results of this study indicated that clay (30–100 cm) was positively correlated with rice production in the three zones, especially in northeast China (Fig. 7). The same held for sand (0–30 cm) in southeast China, whereas the correlation was negative in northeast and east China.
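
The month-by-month correlation analysis between SPEI and SYRS can be sketched as below. The Pearson coefficient is implemented directly from its definition; the monthly SPEI series, the SYRS values and the variable names are hypothetical and serve only to show how the most drought-sensitive month would be identified.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical SYRS for six years and SPEI-3 for two months (illustrative only).
syrs = [-1.2, 0.4, 0.9, -0.6, 1.1, -0.6]
spei_by_month = {
    "April": [-1.5, 0.3, 1.0, -0.4, 0.8, -0.9],
    "August": [0.2, -0.7, 0.5, 0.9, -0.3, 0.1],
}

correlations = {month: pearson_r(values, syrs) for month, values in spei_by_month.items()}
most_sensitive = max(correlations, key=correlations.get)
print(correlations, most_sensitive)
```

A high positive coefficient for a given month indicates that yield residuals fall when that month is dry, which is the pattern reported above for April and May in southeast China.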

Figure 7

Variations of sand (0–30), clay (30–100), soil organic matter and porosity in each district. The figures were generated with the Origin 2023b software.

Discussion

Hybrid method importance in rice yield estimation

The results of this study show that the hybrid RF-XGB and LSTM-CNN models are more flexible and more robust to noisy data than single models, which significantly enhances their prediction accuracy. Previous research has documented that both climate variables and remote sensing data can exert non-linear and complicated effects on production variations61, which single methods capture less well. For example, the RMSE was reduced by more than 30% when applying the hybrid RF-XGB and LSTM-CNN models compared to the single models, which agrees with the findings of Chiu, Wen61. The underlying reason may be that a single machine learning regressor can over-fit and generalize poorly, because it may become too complex and fit the noise in the training data rather than the underlying patterns62,63,64. Further, Huang et al.65 developed, trained, and tested a back-propagation neural network (BP-ANN) model for fiber-reinforced polymer (FRP) reinforced concrete at high temperatures using 151 sets of FRP-reinforced concrete pullout test data at different temperatures reported in the literature. The results showed that the BP-ANN model exhibited greater generality than existing mathematical models. Furthermore, Wang et al.63 combined ANN with a genetic algorithm (GA) or particle swarm optimization (PSO) for model training and testing. The findings indicated that the accuracy of the developed hybrid machine learning model in predicting bond strength in CES structures exceeded that of conventional ANN models and existing empirical equations. In addition, both DL and ML models are black boxes: their complex structure makes it difficult to produce testable hypotheses that could provide biological insights. Nevertheless, compared with traditional production estimation methods (i.e. crop model simulation and statistical regression), ML and DL methods provide new opportunities for yield prediction27. Combining crop models with DL/ML models for yield estimation, forecasting, and disaster monitoring in large regions is therefore recommended. This might encourage running rice production estimation models at the local scale to account for the variation among rice districts in their agro-environmental conditions and in the relative correlation of various factors with rice production.

Analysis of driving mechanisms on rice production

Global warming has undoubtedly brought unprecedented challenges to rice production, which is vital for food security in Southeast Asian countries and China. Excessively high temperature increases the risk of heat stress, which not only makes anthers difficult to dehisce but also reduces pollen, thus affecting the normal process of pollination and fertilization. Meanwhile, excessive heat inhibits rice from synthesizing organic matter and accumulating dry matter, leading to a reduced seed setting rate, grain mass, and seed weight66,67,68. A reduction in rainfall decreases stomatal conductance and inter-cellular CO2 flux, which slows the transpiration rate and restricts photosynthesis69. As a result, the uptake of nutrients is reduced, while respiratory consumption increases. Therefore, an increase in precipitation within a moderate range can promote rice yield. Our findings agree with those of Liu et al.70, who reported that the individual contributions of climate change and soil improvement to rice yield differed by factor. Compared with the 1980s, the yield in the 2000s decreased by 19.5% owing to climate change, while the yield increased by 12.7% owing to soil improvement. In contrast, the increase in rice production in northeast China may be attributed to the non-significant decreasing and increasing trends in precipitation and temperature (maximum and minimum), together with adequate irrigation and adjusted sowing dates, which positively affect rice production even during dry years59,71,72,73, as well as the appropriate application of chemical fertilizers, which provides ample nutrients for rice growth74. As shown in Fig. 6a and b, drought in southeast China during the period from 2011 to 2015 can be classified as severe; however, extreme drought was found only in June and September of 2011. 
Furthermore, the correlation coefficient between the SYRS of rice in southeast China and SPEI in April and May (initial stage) is the highest among all months, revealing that rice yield is more prone to drought in the initial stage. Meanwhile, in northeast and east China, rice yield is less correlated with drought than in southeast China, which may be because improved irrigation positively affects rice yield even during dry years72,73. Furthermore, the role of climatic variables in rice yield variation was not significant in some regions of China; these results are supported by previous studies75,76. The underlying reason may be that sown area and soil properties represent comprehensive features or information of a county or a field over a long time, while climate factors represent only part of the information related to crop production for a specific period. High production, in turn, is characterized by healthy soils, good water conditions, farmers' experience, agricultural practices such as applying mulches, well-equipped irrigation facilities, fertilizers, and suitable climate conditions75. All these features can be comprehensively represented by spatial location. Furthermore, climatic variables derived from meteorological data performed better in rice production estimation than vegetation parameters derived from remote sensing data. This agrees with earlier studies showing that fluctuations in precipitation and temperature are strongly correlated with rice production21,22. Although remote sensing vegetation indices (VIs) performed worse than climatic variables in rice production estimation at the regional scale, VIs were more important than climate in some rice districts. The explanation may be that the satellite indices can reflect the effects not only of abiotic factors but also of biotic factors (e.g., plant disease, irrigation, and fertilization)77,78, which agrees with the conclusion of Cao, Zhang27. 
Moreover, we speculate that monthly EVI and weather data cannot accurately reflect crop growth and development. EVI at 8-day or 16-day intervals might better incorporate crop growth and weather information75. In addition, a subset of variables in scenario 2 (Sunshine, Tmin, Tmax, and sown area) achieved rice production estimation results comparable to those obtained using the full set of climatic variables in scenario 5. The reason may be the highly significant correlation between the sown area and rice production in the three zones, as shown in Fig. 7. In addition, soil health is one of the major factors affecting rice production79. Increasing the clay content could improve soil fertility80. A higher biomass was recorded in rice grown in high clay soil than in rice grown in low clay soil80,81. Southern China accounts for 88% of national rice production82. Continuous flooding irrigation is practiced by Chinese farmers in lowland rice, threatening rice production83. Moreover, in regions of southern China, clay-textured soils offer the highest potassium-supplying potential84. The results from this study indicated that clay content (30–100 cm) was positively correlated with rice production in the three zones, especially in northeast China; however, it was negative in northeast and east China. The main reason is that soil texture affects plant growth and nutrient uptake because it alters the availability of water in the soil. When the soil has a high clay content, often with a large proportion of 2:1 clay, it is classified as a Vertisol79. In flooded rice soil, swelling is dominant because clay absorbs water; the soil is then allowed to dry out before irrigation is applied again85. As such, cracks are dominant in paddy soils86 owing to the removal of water from within and between clay microstructures.
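
The month-by-month comparison between yield anomalies and drought discussed in this section can be sketched as a Pearson correlation computed separately for each month. The Python example below is a minimal illustration that assumes SYRS is a standardized yield residual series with one value per year; the series, the month selection, and the helper names (pearson_r, most_sensitive_month) are hypothetical and do not reproduce the data or procedure of this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def most_sensitive_month(syrs, spei_by_month):
    """Return (month, r) for the month whose SPEI correlates most strongly with SYRS."""
    correlations = {m: pearson_r(syrs, spei) for m, spei in spei_by_month.items()}
    best = max(correlations, key=lambda m: abs(correlations[m]))
    return best, correlations[best]


# Hypothetical yearly series (illustrative values only, one value per year).
syrs = [-1.0, 0.0, 1.0, 0.5, -0.5]
spei_by_month = {
    "April": [-1.2, 0.1, 0.9, 0.6, -0.4],
    "August": [0.3, -0.2, 0.1, -0.4, 0.2],
}

month, r = most_sensitive_month(syrs, spei_by_month)
print(f"strongest correlation: {month} (r = {r:.2f})")
```

Ranking months by the absolute value of r is a design choice made here so that strong negative relationships are not overlooked; a positive r between SYRS and SPEI means that drier conditions (lower SPEI) coincide with lower yields.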

Conclusion

In this study, the key issue was finding the best approach to predict rice production across China’s main rice counties by testing multiple single and hybrid models and input scenarios at various study scales. Based on the results, the main findings of the present study can be summarized as follows:

  • Hybrid models performed better than single models in rice production estimation, significantly improving prediction accuracy.

  • At the zonal scale, soil properties were the most dominant factors in rice production, accounting for 87% and 53% in east and southeast China, respectively.

  • The increase in temperature and decrease in precipitation restrained rice production, decreasing it by 2.2% on average in southeast China.

  • At the regional scale, climatic variables showed a stronger relationship with rice production than vegetation parameters. However, remote sensing outperformed climatic factors in some local districts.

The paper's innovation lies in its holistic approach to predicting rice production using multi-source data and hybrid machine learning algorithms, offering high-resolution insights into a critical aspect of China's agriculture. Furthermore, one of the main innovative points of this study was to investigate the dominant factor for rice production across China’s main rice counties. Future research will focus on predicting rice production using agronomic datasets (crop phenology, growing degree days, full grain, panicle number, and plant height) as well as management datasets in addition to the existing datasets.