Introduction

Facility horticulture is a modern agricultural production method that uses new production equipment and management techniques to regulate environmental parameters such as temperature, light, water, and fertilizer in greenhouses1,2. In greenhouse cultivation, a scientifically grounded water requirement prediction model gives a deeper understanding of the growth patterns of greenhouse crops and provides a basis for scientific irrigation3.

Currently, the Penman-Monteith model, as advocated by FAO-56, serves as the standard for calculating crop water requirements and has been extensively applied to greenhouse crops including tomatoes, eggplants, and lettuce4,5,6,7. The model estimates crop water demand by multiplying the reference crop evapotranspiration (ET0) by the crop coefficient (Kc). Consequently, the ease of acquisition and the reliability of these two parameters, ET0 and Kc, are pivotal to the efficiency and accuracy of water requirement predictions. Jo et al.8 employed weighing sensors to monitor the actual transpiration rate, recording tomato crop weight changes at 10-minute intervals, and subsequently developed a water demand prediction model grounded in the established Penman-Monteith (P-M) formula and crop coefficient (Kc). Dong et al.9 analyzed the spatio-temporal patterns of reference evapotranspiration, temperature, relative humidity, and sunlight duration across China and introduced an enhanced GWA algorithm (MDSL-GWA) designed to refine the empirical estimation of ET0. Despite its utility, the broad application of the Penman-Monteith model is constrained by its need for input parameters that are difficult to determine, such as aerodynamic resistance. Furthermore, the model’s critical calculation parameter (Kc) is often determined empirically and varies with climatic conditions and soil properties, leading to significant discrepancies in practical scenarios. Research indicates that the Mean Square Error (MSE) of Kc throughout the tomato’s growth cycle can range from 11.9 to 71.4%10, underscoring the need for more precise predictive tools in agricultural water management.

Therefore, with the advancement of computer technology, researchers have begun to propose methods that use machine learning to predict water requirements directly, without calculating ET0 and Kc separately. Dong et al.11 proposed a novel model (GWA-CNN-BiLSTM) for predicting crop evapotranspiration in the wheat-corn rotation system of the Loess Plateau in China. This model is based on the Grey Wolf Algorithm and uses five parameters, including net solar radiation (R) and saturation vapor pressure deficit (VPD), for prediction. The model achieved a relative root mean square error (RRMSE) ranging from 8.4 to 41.5%. Fuentes et al.12 used micrometeorological data and artificial neural networks (ANN) to model actual evapotranspiration and estimate energy balance in vineyards; the resulting model demonstrated high accuracy, with a coefficient of determination (R2) of 0.97. Tunalı et al.13 employed an ANN to estimate the crop water requirements (ETc) of tomatoes and compared it with the traditional Penman-Monteith model, finding that the ANN model improved the prediction accuracy for ETc by 30%. However, crop water requirements are affected by many factors, such as the growth condition of the crop itself, the environment, and the soil, and therefore change in a nonlinear and complex way. This study therefore acquires crop growth conditions through imagery and combines them with environmental data, adhering to the principle of decoupling and minimizing characteristic parameters, to propose a multi-source data fusion model that predicts the water requirements of greenhouse tomato crops with a small number of parameters.

The main objectives are as follows: (1) using the Excess Green (ExG) algorithm and the maximum inter-class variance method, to propose a tomato canopy coverage extraction algorithm based on image segmentation, overcoming the difficulty that traditional methods have with large-area measurement; (2) with full consideration of crop, soil, and environment, to propose the optimal combination of feature variables based on the principle of reducing the correlation among feature parameters and minimizing their number, using Spearman correlation analysis combined with the random forest feature importance ranking method; (3) to propose fusion algorithms built on single machine learning algorithms to construct a water requirement prediction model for greenhouse tomato crops, and to verify its reliability and generalization.

Method

Data acquisition

Data acquisition was conducted in the solar greenhouse of the National Precision Agriculture Research Demonstration Base in Xiaotangshan Town, Changping District, Beijing, China (East Longitude 116.46°, North Latitude 40.18°, altitude 50 m), which is a scientific research and test base of Beijing Academy of Agriculture and Forestry Sciences. Changping District of Beijing belongs to the temperate continental monsoon climate zone, which is the main area of solar greenhouse production in Beijing.

The cultivation experiment was carried out on tomato crops, using rectangular foam boxes as substrate troughs, with dimensions of 100 cm × 60 cm × 40 cm, filled with coconut coir as the substrate. To keep the tomato plants growing vertically, ropes were used to hang the plants from hanging scales, so that environmental factors did not affect growth direction and leaf angles. Additionally, to keep evaporation from the substrate out of the measurement of tomato water requirements, a transparent ground film was laid over the substrate surface. The data collected during the experiment included environmental data, image data, and crop water requirement data. The trial was divided into two seasons: the spring planting (from May 20, 2022, to July 22, 2022) and the autumn planting (from September 28, 2022, to January 6, 2023).

Environmental data were collected using greenhouse environmental sensors to measure air temperature (T, ℃), relative humidity (RH, %), soil temperature (Ts, ℃), light intensity (E, lux), and CO2 concentration (ppm), with the sensors positioned approximately 20 cm above the crop. A photovoltaic total radiation sensor was used to collect cumulative light radiation data (Rn, kJ·m−2·h−1) inside the greenhouse; it was placed 2 m above the ground and 5 m from the rear wall of the greenhouse. The technical specifications of the sensors are shown in Table 1. Environmental sensor data were acquired at 10-minute intervals and transmitted to the monitoring software via a wireless gateway. Image data were captured using an infrared mobile timed camera, the Forsafe H805, to obtain visible light images. The camera was placed in a fixed position directly above the tomato plants, giving a top-down view matched to the planting layout. The camera was set to take a photo every hour, with an image resolution of 5200 × 3900 pixels.

The crop water requirement (ETc) was determined by measuring the substrate weight of the tomato plants using a self-developed online substrate weighing system. The weighing system uses LoRa wireless communication, has a measurement error of ±0.03%, and collects data every 10 minutes. In this study, an automatic controller managed nutrient solution irrigation once a day to supply the nutrients required for plant growth. Irrigation started 2 hours after local sunrise and ended when the substrate water content reached its upper limit (field capacity). Because the irrigation lasted less than 10 min, the crop water demand ETc during irrigation was ignored. The crop water demand ETc is therefore calculated as shown in Eq. (1):

$${\text{E}}{{\text{T}}_c}={\text{B}}{{\text{W}}_{T1}} - {\text{B}}{{\text{W}}_{T2}}$$
(1)

where \({\text{B}}{{\text{W}}_{T1}}\) is the substrate weight of the tomato plants at the previous time point and \({\text{B}}{{\text{W}}_{T2}}\) is the substrate weight at the subsequent time point.

Table 1 Technical specifications of greenhouse sensors.
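As an illustration of Eq. (1), the following minimal sketch computes the weight loss between consecutive weighing records and skips the intervals in which irrigation takes place. The function name, the gram units, and the boolean irrigation flag are assumptions made for this example; the text does not describe how irrigation intervals are flagged in the recorded data.

```python
from typing import List


def etc_series(weights_g: List[float], irrigating: List[bool]) -> List[float]:
    """Per-interval water use, BW_T1 - BW_T2, following Eq. (1).

    weights_g  : substrate weights at consecutive time points (assumed grams)
    irrigating : hypothetical flag, True if irrigation occurred in the
                 interval ending at that time point; ETc is set to 0 there,
                 since water demand during irrigation is ignored
    """
    etc = []
    for k in range(1, len(weights_g)):
        if irrigating[k]:
            etc.append(0.0)
        else:
            etc.append(weights_g[k - 1] - weights_g[k])
    return etc


# Toy readings: the jump from 996.0 to 1020.0 is an irrigation event.
weights = [1000.0, 998.5, 996.0, 1020.0, 1018.0]
flags = [False, False, False, True, False]
per_interval = etc_series(weights, flags)
daily_total = sum(per_interval)
```

Summing the per-interval values over an hour or a day gives the hourly or daily ETc used as the prediction target.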

Data processing

Utilizing air temperature and humidity data gathered at 10-minute intervals, we derive six key parameters: the hourly/daily average air temperature (Tm), the peak air temperature (Tmax), the lowest air temperature (Tmin), the mean air humidity (RHm), the highest air humidity (RHmax), and the lowest air humidity (RHmin), employing both mean and extremum calculations. Concurrently, soil temperature and CO2 concentration, also measured every 10 min, inform the determination of the hourly/daily soil temperature (Ts) and CO2 concentration, achieved through averaging. Based on the light intensity and accumulated light radiation data collected every 10 min, the two parameters of hourly/daily light intensity (E) and accumulated light radiation (Rn) are calculated by cumulative calculation. The collection of actual visible light images of crops is complemented by a rigorous screening process to exclude images characterized by anomalous positioning, blurriness, or inadequate lighting conditions. For those individual time periods where sufficient and effective images could not be obtained after screening, we employed data augmentation techniques to supplement the dataset. This involved geometric transformations (such as rotation, flipping, and scaling) of high-quality images from adjacent times, as well as subtle adjustments to brightness and contrast to simulate different lighting conditions. Although these enhanced images were synthetic, they retained the core shape and texture features of the crops, effectively filling the data gap without introducing significant deviations.
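The aggregation of 10-minute sensor readings into hourly features can be sketched as follows. This is only an illustration under stated assumptions: the function name and the dictionary layout are hypothetical, and each input list is assumed to hold the six readings of one hour.

```python
from statistics import mean


def hourly_features(temps, rhs, soil_temps, co2, light, radiation):
    """Aggregate the 10-minute readings of one hour into feature values.

    Mean and extremum for air temperature and humidity, mean for soil
    temperature and CO2, cumulative sum for light intensity and radiation.
    """
    return {
        "Tm": mean(temps), "Tmax": max(temps), "Tmin": min(temps),
        "RHm": mean(rhs), "RHmax": max(rhs), "RHmin": min(rhs),
        "Ts": mean(soil_temps),
        "CO2": mean(co2),
        "E": sum(light),
        "Rn": sum(radiation),
    }


features = hourly_features(
    temps=[20.0, 21.0, 22.0, 23.0, 24.0, 25.0],
    rhs=[60.0, 62.0, 64.0, 66.0, 68.0, 70.0],
    soil_temps=[18.0] * 6,
    co2=[400.0, 410.0, 420.0, 430.0, 440.0, 450.0],
    light=[1000.0] * 6,
    radiation=[1.0] * 6,
)
```

The same function applies to daily aggregation if the lists hold all readings of one day.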

Construction of a water requirement prediction model for greenhouse tomato crops

This paper proposes a multi-source data fusion model for predicting the water requirements of greenhouse tomato crops, based on images and environmental data. The model aims to calculate the water needs of greenhouse tomato crops with a minimal number of parameters and simple computations, providing a foundation for implementing appropriate irrigation measures. The model first establishes an algorithm for extracting the canopy coverage of greenhouse tomatoes based on image segmentation. It then combines canopy coverage with environmental data to select feature variable combinations with high correlation using Spearman’s correlation analysis method and chooses the optimal feature variables using the random forest feature importance ranking method. Finally, three types of fusion models (Average fusion, Weighted fusion, and Stacking) are constructed based on the RandomForest (RF), LightGBM, and CatBoost models. The greenhouse tomato crop water requirement prediction model is built through comparative experimental results. The model framework is illustrated in Fig. 1.

Fig. 1 Framework Diagram of the Water Requirement Prediction Model for Greenhouse Tomato Crops.

Canopy coverage extraction

The ExG (Excess Green) algorithm extracts green plant images effectively, suppressing shadows, withered grass, and soil images, making the plant images more prominent. However, the segmentation effect may be affected under strong light conditions. The ExG algorithm is used to perform grayscale processing on the images, as shown in Eq. (2).

$${\text{ExG}}=2G - R - B$$
(2)

In this context, G represents the pixel value of the green channel, R represents the pixel value of the red channel, and B represents the pixel value of the blue channel.
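Eq. (2) can be sketched in a few lines. The image is represented here as nested lists of (R, G, B) tuples so that the example needs no external library, and the result is clipped to the 0–255 range so it can be used as a grayscale image; the clipping step and the function name are assumptions of this sketch, not details given in the text.

```python
def exg_gray(rgb_image):
    """Excess Green grayscale image, ExG = 2G - R - B, per Eq. (2).

    rgb_image: rows of (R, G, B) tuples with values in 0..255.
    Values are clipped to 0..255 (an assumption of this sketch).
    """
    return [
        [min(255, max(0, 2 * g - r - b)) for (r, g, b) in row]
        for row in rgb_image
    ]


# One green leaf pixel, one grey background pixel, one pale green pixel.
row = [(50, 200, 40), (120, 110, 100), (100, 150, 100)]
gray = exg_gray([row])
```

Strongly green pixels map to high gray values while neutral background pixels map to values near zero, which is what makes the subsequent thresholding effective.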

The Maximum Inter-Class Variance Method (Otsu Method)14 is an automatic threshold selection technique that does not require the manual setting of additional parameters. It segments the image into two parts, the target and the background, based on the selected threshold. The method calculates the inter-class variance for each candidate grayscale threshold, and the threshold at which the inter-class variance is maximized is taken as the optimal threshold T. The grayscale value of each pixel is then compared with T, and based on the comparison the pixel is classified as either plant or background.

The total average grayscale value of an image is:

$$u={w_0}{{\text{u}}_0}+{w_1}{{\text{u}}_1}$$
(3)

The inter-class variance is:

$${\text{g}}={w_0}{w_1}{{\text{(}}{{\text{u}}_0}{\text{-}}{{\text{u}}_1}{\text{)}}^2}$$
(4)

In this context, w0 is the proportion of plant pixels in the image and u0 is the average grayscale value of the plant pixels; w1 is the proportion of background pixels in the image and u1 is the average grayscale value of the background pixels, where w0 + w1 = 1.
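The threshold search defined by Eqs. (3) and (4) can be sketched as an exhaustive scan over all gray levels. This is a minimal pure-Python illustration; the function name is hypothetical, and pixels above the threshold are assumed to be plant because ExG is high for green vegetation.

```python
def otsu_threshold(gray_values, levels=256):
    """Return the threshold t that maximizes g = w0 * w1 * (u0 - u1)^2.

    gray_values: flat list of integer gray values in 0..levels-1.
    Pixels with value > t are treated as plant (class 0), the rest as
    background (class 1); this orientation is an assumption for ExG images.
    """
    hist = [0] * levels
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    total_sum = sum(i * h for i, h in enumerate(hist))

    best_t, best_g = 0, -1.0
    count_bg, sum_bg = 0, 0
    for t in range(levels):
        count_bg += hist[t]
        sum_bg += t * hist[t]
        count_fg = total - count_bg
        if count_bg == 0 or count_fg == 0:
            continue  # one class is empty, variance undefined
        w1 = count_bg / total
        w0 = count_fg / total
        u1 = sum_bg / count_bg
        u0 = (total_sum - sum_bg) / count_fg
        g = w0 * w1 * (u0 - u1) ** 2
        if g > best_g:
            best_g, best_t = g, t
    return best_t


pixels = [10, 12, 11, 200, 210, 205]
t = otsu_threshold(pixels)
plant_mask = [v > t for v in pixels]
```

For the toy pixel list, any threshold between the dark and the bright group separates the two classes equally well; the scan returns the first such value.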

For the collected visible light images, the ExG algorithm is used for grayscale processing in combination with the Otsu method to segment tomato plants from the background. The segmentation effect is shown in Fig. 2. Observations from Fig. 2(a)-(d) indicate that the canopy coverage expands in tandem with the growth of the tomatoes.

Fig. 2 Threshold Segmentation Effect Images.

Based on the segmented tomato plants, the proportion of green tomato plants out of all pixel points in the image is calculated, which represents the canopy coverage at this moment. Therefore, the calculation of daily canopy coverage is as shown in Eq. (5).

$${\text{CC}}=\frac{1}{n}\sum\limits_{{i=1}}^{n} {C{C_i}}$$
(5)

In this context, CC represents the daily canopy coverage, CCi represents the canopy coverage at the i-th moment, and n is the number of moments (images) in the day.
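Given the binary plant masks produced by segmentation, the per-image coverage and the daily average of Eq. (5) can be sketched as follows. The mask representation (nested lists of booleans) and the function names are assumptions of this example.

```python
def canopy_coverage(mask):
    """Fraction of pixels classified as plant in one binary mask."""
    total = sum(len(row) for row in mask)
    plant = sum(1 for row in mask for px in row if px)
    return plant / total


def daily_canopy_coverage(masks):
    """Eq. (5): mean of the per-image coverage values of one day."""
    values = [canopy_coverage(m) for m in masks]
    return sum(values) / len(values)


# Two toy 2x2 masks with coverage 0.25 and 0.5.
masks = [
    [[True, False], [False, False]],
    [[True, True], [False, False]],
]
cc_day = daily_canopy_coverage(masks)
```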

Optimal feature variable selection

When performing feature selection, a common approach is to calculate the importance of each feature and retain the most relevant ones. However, when feature importance is ranked directly, high correlations among feature variables can cause a feature to be mistakenly deemed less important and discarded, even though one variable of a highly correlated group could be dropped without affecting the outcome. To avoid this, Spearman correlation analysis15 was used in this paper to calculate the correlation coefficients among the variables, identify combinations of feature variables with high correlation, and apply a threshold to the correlation coefficient. For each set of features whose correlation is above the threshold, only one feature variable is retained. When the same variable appears in more than one combination, that variable is retained and the other variables highly correlated with it are removed. After this screening, the random forest feature importance ranking method is used to calculate the importance of each remaining feature variable for predicting the water requirements of greenhouse tomato crops, and the optimal feature variables are selected based on a predefined importance threshold.
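The screening rule described above can be sketched as a small filter over the list of highly correlated pairs. The text specifies that a variable appearing in several pairs is kept; it does not say which variable to keep when both appear only once, so this sketch uses a caller-supplied `prefer` set as a hypothetical tie-break. The example pairs are those reported in the Results section for the spring-season daily data.

```python
from collections import Counter


def filter_correlated(features, high_pairs, prefer):
    """Keep one variable from each highly correlated pair.

    features   : ordered list of feature names
    high_pairs : pairs whose |Spearman rho| exceeds the threshold
    prefer     : names kept on ties (hypothetical tie-break rule)
    """
    counts = Counter(name for pair in high_pairs for name in pair)
    dropped = set()
    for a, b in high_pairs:
        if a in dropped or b in dropped:
            continue  # pair already resolved by an earlier removal
        if counts[a] != counts[b]:
            keep = a if counts[a] > counts[b] else b
        else:
            keep = a if a in prefer else b
        dropped.add(b if keep == a else a)
    return [f for f in features if f not in dropped]


all_features = ["Tmax", "Tm", "Tmin", "RHmax", "RHm", "RHmin",
                "Ts", "Rn", "E", "CO2", "CC"]
pairs = [("Tm", "Ts"), ("RHmax", "RHm"), ("RHm", "RHmin"), ("Rn", "E")]
selected = filter_correlated(all_features, pairs, prefer={"Ts", "Rn"})
```

With these inputs the filter returns the seven variables Tmax, Tmin, RHm, Ts, Rn, CO2, and CC.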

Data collected from environmental and image sensors, after preprocessing, yield 11 feature parameters as shown in Table 2.

Table 2 Feature Parameters.

Spearman’s correlation analysis is a method for calculating the correlation between two variables. The method ranks the values of each variable and calculates the rank correlation (Spearman’s correlation coefficient) between them. Spearman’s correlation coefficient ranges from −1 to 1. A value of −1 indicates a completely negative correlation, a value of 0 indicates no correlation between the two variables, and a value of 1 indicates a completely positive correlation. The calculation method for the correlation coefficient is shown in Eq. (6).

$$\rho =1 - \frac{{6\sum {{d_i}^{2}} }}{{n({n^2} - 1)}}$$
(6)

Here, n is the number of samples, and di is the difference between the ranks of the two values in the i-th data pair.
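A minimal sketch of Eq. (6) is given below. It assumes that neither variable contains tied values, which is the condition under which this closed form is exact; the function names are chosen for illustration.

```python
def ranks(values):
    """Rank positions (1 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def spearman_rho(x, y):
    """Eq. (6): rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


rho_up = spearman_rho([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])     # monotone increasing
rho_down = spearman_rho([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])   # monotone decreasing
rho_mixed = spearman_rho([10, 20, 30, 40, 50], [1, 3, 2, 5, 4])
```

The first two calls give 1 and −1; in the third, two adjacent pairs are swapped, giving a squared rank difference sum of 4 and a coefficient of 0.8.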

The calculation steps for the Random Forest feature importance ranking method are as follows: (1) Train a random forest model on the training set; (2) Randomly shuffle the values of a certain feature variable, and then make predictions on the new dataset; (3) Calculate the loss function using the predicted values and the true values; the degradation in model performance after random shuffling represents the importance of the randomly shuffled column; (4) Restore the values of the feature variable that was randomly shuffled, repeat step (2) on the data of the next feature variable, and continue this process until the importance of each feature variable has been calculated.
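Steps (2) to (4) can be sketched independently of the model type. In the following illustration the trained random forest is replaced by an arbitrary `predict` callable, MSE is used as the loss, and each shuffle is repeated several times and averaged; the repetition count, the seed, and the function names are assumptions of this sketch.

```python
import random


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE after shuffling each feature column in turn.

    predict : callable mapping one feature row (list) to a prediction;
              stands in for the trained random forest
    X, y    : feature rows and true targets
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Build a shuffled copy; X itself is left untouched, which
            # corresponds to restoring the column in step (4).
            X_perm = [row[:j] + [column[i]] + row[j + 1:]
                      for i, row in enumerate(X)]
            loss = mse(y, [predict(row) for row in X_perm])
            increases.append(loss - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy model that only uses the first feature.
X = [[float(i), float((i * 7) % 5)] for i in range(10)]
y = [2.0 * row[0] for row in X]
scores = permutation_importance(lambda row: 2.0 * row[0], X, y)
```

Because the toy model ignores the second feature, shuffling it leaves the loss unchanged and its importance is zero, while shuffling the first feature degrades the predictions.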

Machine learning algorithm

RandomForest (RF)16 is an ensemble learning method that constructs multiple decision trees for classification or regression. During training, the method does not build one large decision tree on the entire training data set; instead, it builds several small decision trees, each on a subset of randomly selected samples and feature attributes, and then merges them into a more powerful model, as shown in Eq. (7). Introducing randomness during training improves the model’s generalization capability and reduces the risk of overfitting. The hyperparameter search grid was: n_estimators = [10, 50, 100, 200, 400], max_depth = [None, 10, 20, 30, 50], min_samples_split = [2, 5, 10], min_samples_leaf = [1, 2, 4], max_features = ['log2', 'sqrt']. After conducting experiments, the optimal combination of hyperparameters was: n_estimators = 400, max_depth = 20, min_samples_split = 2, min_samples_leaf = 1, max_features = 'log2'.

$$\mathop y\limits^{ \wedge } =\frac{1}{T}\sum\limits_{{i=1}}^{T} {{h_i}(x)}$$
(7)

In this context, \(\mathop y\limits^{ \wedge }\) represents the final prediction result of the RandomForest, T denotes the number of decision trees, and \({h_i}(x)\) is the prediction result of the i-th decision tree for the data point x.

LightGBM17 is a gradient boosting decision tree algorithm that iteratively trains a series of weak learners (decision trees) and combines them into a strong learner. The method can be divided into the following steps: (1) Initialize the model and related parameters. (2) Calculate the first and second order gradient information of the samples. (3) Train multiple decision trees sequentially, with each tree’s training objective being to minimize the loss function (usually the mean squared error or log loss function). (4) Update the model parameters using gradient descent to reduce the value of the loss function. (5) Repeat steps (3) and (4) until the specified number of iterations is reached or the model performance meets the stopping threshold. The hyperparameters were set to: objective = regression, metric = mse, num_leaves = 20, learning_rate = 0.1, feature_fraction = 0.9.

CatBoost18 is a machine learning algorithm based on gradient boosting decision trees that excels at handling datasets with a large number of categorical features. Unlike traditional gradient boosting algorithms, CatBoost does not require one-hot encoding for categorical features; instead, it uses these features directly for training, thus avoiding information loss and increased computational complexity. The formula is shown in Eq. (8). The hyperparameters were set to: iterations = 4, learning_rate = 1, depth = 4.

$$\mathop y\limits^{ \wedge } =\sum\limits_{{i=1}}^{T} {{\gamma _i}h(x;{\delta _i})}$$
(8)

In this context, \(\mathop y\limits^{ \wedge }\) represents the final prediction result of CatBoost, T denotes the number of decision trees, \({\gamma _i}\) is the weight of the i-th decision tree, and \(h(x;{\delta _i})\) is the prediction result of the i-th decision tree for the data point x.

The averaging method combines the prediction results of multiple models for classification or regression. The core idea is to consider all model predictions as equally important and calculate their arithmetic mean as the final prediction. In this paper, an averaging fusion model is constructed based on the integration of three models: RandomForest, LightGBM, and CatBoost, as shown in Eq. (9), where F represents the final prediction result; n represents the number of models, which is 3 in this case; \({y_i}\) denotes the prediction result of the i-th model.

$${\text{F}}=\frac{1}{n}\sum\limits_{{i=1}}^{n} {{y_i}}$$
(9)

Weighted averaging19 is a method for classification or regression that assigns different weights to the prediction results of each model. The core idea is to allocate weights based on the performance or confidence of each model, with better-performing models receiving higher weights. In this paper, the weights are determined using the models’ MSE, where the weight is inversely proportional to the MSE. The smaller the model’s MSE, the greater the weight. The formula is shown in Eq. (10), where \({w_i}\) represents the weight assigned to the i-th model, with the sum of weights equaling 1.

$${\text{F}}=\sum\limits_{{i=1}}^{n} {{w_i}{y_i}}$$
(10)
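Eqs. (9) and (10) can be sketched together, since average fusion is the special case of weighted fusion with equal weights. The weights below are taken as the normalized reciprocals of each model's MSE; the text states only that the weight is inversely proportional to the MSE, so the data on which that MSE is measured and the function names are assumptions of this example.

```python
def inverse_mse_weights(mses):
    """Weights proportional to 1 / MSE, normalized to sum to 1."""
    inv = [1.0 / m for m in mses]
    total = sum(inv)
    return [v / total for v in inv]


def weighted_fusion(predictions, weights):
    """Eq. (10): F = sum_i w_i * y_i, applied sample by sample.

    predictions: one list of predictions per base model.
    """
    n_samples = len(predictions[0])
    return [sum(w * p[k] for w, p in zip(weights, predictions))
            for k in range(n_samples)]


def average_fusion(predictions):
    """Eq. (9): arithmetic mean of the base model predictions."""
    n = len(predictions)
    return weighted_fusion(predictions, [1.0 / n] * n)


# Toy predictions of three base models for two samples.
preds = [[1.0, 2.0], [2.0, 4.0], [4.0, 2.0]]
w = inverse_mse_weights([0.1, 0.2, 0.2])   # best model gets weight 0.5
fused_weighted = weighted_fusion(preds, w)
fused_average = average_fusion(preds)
```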

Stacking20 is an advanced ensemble learning technique that constructs a new model, known as a meta-model, by integrating the predictions of multiple base models. In this study, we employed a stacking approach where the predictions from three base models (Random Forest, LightGBM, and CatBoost) were used as input features to train a meta-model. Specifically, we used Linear Regression as the meta-learner to combine these predictions. The core idea is to leverage the complementary strengths of diverse base models by learning their prediction patterns through the meta-model, which can enhance overall predictive performance. This approach significantly improves the model’s generalization capability by capturing complex relationships among base model predictions and reducing systematic errors.
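The meta-learning step can be sketched as an ordinary least-squares fit of the target on the base model predictions. This is a simplified illustration in pure Python, not the implementation used in the study: the base model predictions are supplied as plain lists, the linear regression is solved through the normal equations, and all function names are hypothetical. In practice the meta-learner is usually trained on out-of-fold predictions of the base models to avoid leakage; the text does not state which scheme was used.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_meta_learner(base_preds, y):
    """Least-squares fit of y ~ b0 + sum_j b_j * base_preds[j]."""
    rows = [[1.0] + [p[k] for p in base_preds] for k in range(len(y))]
    m = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(m)]
           for i in range(m)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(m)]
    return solve_linear(xtx, xty)


def stack_predict(coefs, base_preds):
    """Combine base predictions with the fitted intercept and weights."""
    return [coefs[0] + sum(c * p[k] for c, p in zip(coefs[1:], base_preds))
            for k in range(len(base_preds[0]))]


# Toy base predictions (stand-ins for RF, LightGBM, CatBoost outputs).
p1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
p2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
p3 = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0]
y = [0.1 + 0.5 * a + 0.3 * b + 0.2 * c for a, b, c in zip(p1, p2, p3)]
coefs = fit_meta_learner([p1, p2, p3], y)
stacked = stack_predict(coefs, [p1, p2, p3])
```

Unlike weighted fusion, the learned coefficients are not constrained to be positive or to sum to 1, and the intercept lets the meta-model correct a shared bias of the base models.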

Experimental environment

The training environment for this study was as follows: CPU, i7-12700F 2.10 GHz; GPU, RTX 3060Ti; operating system, 64-bit; RAM, 16 GB. The models were implemented in Python. The Python version and the package versions used in this article are: Python 3.9, numpy 2.0.2, pandas 2.3.0, scikit-learn 1.6.1, matplotlib 3.9.4.

Evaluation index

This study employed four common statistical metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Coefficient of Determination (R2).

$$MAE=\frac{1}{n}\sum\nolimits_{{i=1}}^{n} {\left| {{y_i} - {{\hat {y}}_i}} \right|}$$
(11)
$$MSE=\frac{1}{n}{\sum\nolimits_{{i=1}}^{n} {\left( {{y_i} - {{\hat {y}}_i}} \right)} ^2}$$
(12)
$$RMSE=\sqrt {\frac{{\sum\nolimits_{{i=1}}^{n} {{{\left( {{y_i} - {{\hat {y}}_i}} \right)}^2}} }}{n}}$$
(13)
$${R^2}=1 - \frac{{\sum\nolimits_{{i=1}}^{n} {{{\left( {{y_i} - {{\hat {y}}_i}} \right)}^2}} }}{{\sum\nolimits_{{i=1}}^{n} {{{\left( {{y_i} - \bar {y}} \right)}^2}} }}$$
(14)

In this context, n is the total number of predictions, \({\hat {y}_i}\) represents the predicted value, \({y_i}\) is the actual value, and \(\bar {y}\) represents the average of the actual values.
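The four metrics of Eqs. (11) to (14) can be computed directly from their definitions, as in the following minimal sketch (the function name and the returned dictionary layout are chosen for illustration).

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and R2 as defined in Eqs. (11)-(14)."""
    n = len(y_true)
    errors = [a - b for a, b in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    y_bar = sum(y_true) / n
    ss_res = sum(e * e for e in errors)             # residual sum of squares
    ss_tot = sum((a - y_bar) ** 2 for a in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# One of four predictions is off by 2.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

For this toy input the single error of 2 gives MAE = 0.5, MSE = 1, RMSE = 1, and R2 = 1 − 4/5 = 0.2.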

Results and analysis

Different model results of multiple parameter combinations

For the daily data of spring-season tomatoes, the feature correlation heatmap after Spearman correlation analysis is shown in Fig. 3(a). Feature combinations with an absolute correlation value above 0.8 include: (Tm, Ts), (RHmax, RHm), (RHm, RHmin), (Rn, E). Based on these strongly correlated combinations, and following the principle of using fewer input features, seven feature parameters were retained: Tmax, Tmin, RHm, Ts, Rn, CO2, and CC. A random forest feature importance ranking was then performed on these seven parameters. The results are shown in Fig. 3(b). Tmax has the greatest impact on tomato ETc, followed by Ts, while RHm has the smallest impact. If only five parameters are chosen, Tmax, Ts, CC, Tmin, and CO2 can be selected. If only three parameters are chosen, Tmax, Ts, and CC can be selected.

Fig. 3 Heatmap of Tomato Feature Correlations and Feature Importance Ranking.

Table 3 presents the four parameter combinations used for model construction. The parameter combination without feature selection includes all 11 feature parameters, namely Tmax, Tm, Tmin, RHmax, RHm, RHmin, Ts, Rn, E, CO2, and CC. The parameter combination after Spearman feature selection includes 7 feature parameters: Tmax, Tmin, RHm, Ts, Rn, CO2, and CC. The parameter combinations after Spearman + RandomForest feature selection include 5 feature parameters: Tmax, Ts, CC, Tmin, CO2, and 3 feature parameters: Tmax, Ts, and CC.

Table 3 Parameter Combinations.

The data set was split into a training set and a test set at a ratio of 8:2, and the results below are those obtained on the test set. For each parameter combination used as model input, RandomForest, LightGBM, and CatBoost models were constructed separately to predict the water requirements of greenhouse tomato crops. The model results are shown in Fig. 4. Figure 4(a) shows the prediction results for the 11 feature parameters without feature selection. The RandomForest model has the smallest error and the highest R2: its MSE is 0.025 to 0.042 lower than that of the other two models, its MAE is 0.076 to 0.093 lower, and its RMSE is 0.057 to 0.088 lower. Figure 4(b) shows the prediction results for the 7 feature parameters after Spearman feature selection. The RandomForest model has the smallest error and the highest R2, with the MSE being 0.059 to 0.068 lower than the other two models, the MAE being 0.12 to 0.145 lower, and the RMSE being 0.126 to 0.14 lower. Figure 4(c) shows the prediction results for the 5 feature parameters after Spearman + RandomForest feature selection. The RandomForest model has the smallest error and the highest R2, with the MSE being 0.012 to 0.042 lower than the other two models, the MAE being 0.006 to 0.088 lower, and the RMSE being 0.028 to 0.086 lower. Figure 4(d) shows the prediction results for the 3 feature parameters after Spearman + RandomForest feature selection. The RandomForest model has the smallest error and the highest R2, with the MSE being 0.031 to 0.05 lower than the other two models, the MAE being 0.069 to 0.116 lower, and the RMSE being 0.074 to 0.109 lower. In summary, for the four different parameter combinations, the RandomForest model has the lowest MSE, MAE, and RMSE among the three models, and the highest R2. Among the RandomForest models, the one using the 5-parameter combination after Spearman + RandomForest feature selection has the largest error, and the one using the parameter combination after Spearman feature selection has the smallest error, with the MSE, MAE, and RMSE reduced by 21%, 23%, and 12% respectively, and the R2 increased by 0.6–2.7%.

Fig. 4 Comparison of Single Machine Learning Model Results for Daily Data of Spring-Season Tomatoes.

Based on the three models (i.e., RandomForest, LightGBM, CatBoost), fusion models (i.e., Average fusion, Weighted fusion, Stacking) were constructed separately for predicting the water requirements of greenhouse tomato crops, and the model results are shown in Fig. 5. Figure 5(a) shows the model prediction results for all 11 feature parameters without feature selection. The errors of the three fusion models are all lower than the best-performing RandomForest model among the single machine learning models, and the R2 is higher. Among them, the Stacking model has the smallest error and the highest R2. The MSE of the Stacking model is 0.012 to 0.013 lower than the other two models, the MAE is 0.011 to 0.015 lower, and the RMSE is 0.034 to 0.038 lower. Figure 5(b) shows the model prediction results for the 7 feature parameters after Spearman feature selection. The MSE of the Stacking model is 0.012 to 0.014 lower than the other two models, and the RMSE is 0.035 to 0.04 lower, with the R2 being 0.037 to 0.043 higher than the other two models. Although the MAE of the Stacking model is higher than the Weighted fusion model, the difference is only 0.001. Figure 5(c) shows the model prediction results for the 5 feature parameters after Spearman + RandomForest feature selection. The MSE of the Stacking model is 0.014 to 0.015 lower than the other two models, the MAE is 0.007 to 0.011 lower, the RMSE is 0.04 to 0.042 lower, and the R2 is 0.042 to 0.044 higher than the other two models. Figure 5(d) shows the model prediction results for the 3 feature parameters after Spearman + RandomForest feature selection. The MSE of the Stacking model is 0.019 to 0.02 lower than the other two models, the MAE is 0.031 to 0.032 lower, the RMSE is 0.054 to 0.056 lower, and the R2 is 0.057 to 0.06 higher than the other two models. In summary, for the four different parameter combinations, the Stacking model has the lowest MSE, MAE, and RMSE among the three fusion models, and the highest R2.

Comparing the best fusion model with the single machine learning model results, as shown in Fig. 6, the Stacking model, which performs the best among the fusion models, has lower error and higher R2 than the RandomForest model, which performs the best among the single machine learning models. Specifically, the MSE is reduced by 0.01, the MAE is reduced by 0.003, the RMSE is reduced by 0.03, and the R2 is increased by 0.03. When using the Stacking model, the model with the 3 feature parameter combination selected by Spearman + RandomForest feature selection has the smallest error, while the model with the parameter combination selected by Spearman feature selection has the largest error. The MSE, MAE, and RMSE errors are reduced by 19%, 16%, and 10% respectively, and the R2 is increased by 1.5%. Therefore, only the three parameters Tmax, Ts, and CC, combined with the proposed Stacking fusion model, can accurately predict the water requirements of greenhouse tomatoes, significantly reducing the computational complexity of traditional formulas.

Fig. 5 Comparison of Different Fusion Model Results for Daily Data of Spring-Season Tomatoes.

Fig. 6 Comparison of Results Between the Optimal Single Machine Learning Model and the Optimal Fusion Model.

Model reliability and generalization verification

The performance of the proposed optimal parameter combinations and models is verified in the hourly data of spring-season tomatoes to ensure the reliability of the method at different resolutions. The results of the RandomForest, LightGBM, and CatBoost models are shown in Fig. 7, with Fig. 7(a) displaying the model prediction results for all 11 feature parameters, and Fig. 7(b) showing the model prediction results for the optimal feature parameter combinations. The results of the Average fusion, Weighted fusion, and Stacking models are shown in Fig. 8, with Fig. 8(a) displaying the model prediction results for all 11 feature parameters, and Fig. 8(b) showing the model prediction results for the optimal feature parameter combinations. It can be seen that, whether for the RandomForest, LightGBM, CatBoost models, or for the Average fusion, Weighted fusion, Stacking ensemble models, the prediction error using the optimal feature parameter combinations is lower than that using all feature parameters. Compared to the results using the original full set of parameters within the same model, the MSE is reduced by 0.001 to 0.007, the MAE is reduced by 0.001 to 0.025, and the RMSE is reduced by 0.003 to 0.022. The proposed greenhouse tomato water requirement prediction model, that is, the Stacking ensemble model with feature parameters Tmax, Ts, and CC, performs the best. Compared to the three machine learning models with the original full set of parameters, the MSE is reduced by 26–33%, the MAE is reduced by 5–21%, and the RMSE is reduced by 13–16%.

Fig. 7

Comparison of Single Machine Learning Model Results for Hourly Data of Spring-Season Tomatoes.

Fig. 8

Comparison of Different Fusion Model Results for Hourly Data of Spring-Season Tomatoes.

Figure 9 shows scatter plots of the predictions obtained with all 11 feature parameters in the different models for the hourly data of spring-season tomatoes, and Fig. 10 shows the corresponding scatter plots for the optimal feature parameter combinations on the same data. For the RandomForest, LightGBM, and CatBoost models as well as the Average fusion, Weighted fusion, and Stacking ensemble models, the R2 values obtained with the optimal feature parameter combinations are all higher than those obtained with all feature parameters, by 1 to 10 percentage points. The greenhouse tomato water requirement prediction model proposed in this paper, namely the Stacking ensemble model with the feature parameters Tmax, Ts, and CC, has the highest R2, and its predicted values are closest to the actual values. Compared with the three single machine learning models using all original parameters, its R2 is 9 to 13 percentage points higher.
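Because R2 gains can be quoted either in percentage points (an absolute difference) or in percent (a relative difference), the small sketch below shows how the two differ. The helper names and the example R2 values are hypothetical.

```python
def percentage_point_gain(r2_before, r2_after):
    """Absolute change in R2, expressed in percentage points."""
    return 100.0 * (r2_after - r2_before)


def relative_gain(r2_before, r2_after):
    """Change in R2 relative to its starting value, in percent."""
    return 100.0 * (r2_after - r2_before) / r2_before


# Hypothetical example: R2 rises from 0.80 to 0.90.
pp = percentage_point_gain(0.80, 0.90)  # about 10 percentage points
rel = relative_gain(0.80, 0.90)         # about 12.5 percent
print(f"{pp:.1f} percentage points, {rel:.1f}% relative")
```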

Fig. 9

Scatter Diagram of Prediction Outcomes from Different Models with All 11 Feature Parameters.

Fig. 10

Scatter Diagram of Prediction Outcomes from Different Models with the Optimal Feature Parameter Combinations.

The proposed optimal parameter combinations and models were also verified on the daily data of autumn-season tomatoes to assess the generalization of the method. Figure 11 compares the predictions of the different models for the autumn-season daily data with the true values: Fig. 11(a) presents the prediction outcomes with all 11 feature parameters, and Fig. 11(b) presents those with the optimal feature parameter combinations. The horizontal axis represents 12 randomly selected test samples, and the vertical axis gives the mean of the prediction outcomes across the models in each group. “Single machine learning model” denotes the mean of the predictions of RandomForest, LightGBM, and CatBoost; “Fusion model” denotes the mean of the predictions of Average fusion, Weighted fusion, and Stacking; and “True value” is the actual crop water requirement. The predictions of the Fusion model are generally closer to the actual values, and the models using the optimal feature parameter combinations are closer to the true values than those using all 11 feature parameters. The proposed fusion models therefore outperform the single machine learning models, and the optimal feature parameter combinations yield better predictions.
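The group means plotted for the 12 test samples can be sketched as follows. This is only an illustration: the predictions are synthetic (true value plus random noise), the size of the test set is hypothetical, and `group_mean` is a helper name introduced here.

```python
import random

SINGLE = ["RandomForest", "LightGBM", "CatBoost"]
FUSION = ["Average fusion", "Weighted fusion", "Stacking"]


def group_mean(predictions_by_model, model_names, index):
    """Mean prediction of the named models for one test sample."""
    values = [predictions_by_model[name][index] for name in model_names]
    return sum(values) / len(values)


random.seed(0)
n_test = 40  # hypothetical size of the test set

# Synthetic true values and per-model predictions (NOT the study's data).
true_values = [random.uniform(1.0, 5.0) for _ in range(n_test)]
predictions = {
    name: [t + random.gauss(0.0, 0.3) for t in true_values]
    for name in SINGLE + FUSION
}

# Draw 12 test samples at random, without replacement.
sample_idx = sorted(random.sample(range(n_test), 12))

# One row per sampled point: index, single-model mean, fusion mean, true value.
rows = [
    (
        i,
        group_mean(predictions, SINGLE, i),
        group_mean(predictions, FUSION, i),
        true_values[i],
    )
    for i in sample_idx
]
for i, single_mean, fusion_mean, truth in rows:
    print(f"{i:3d}  single={single_mean:.3f}  fusion={fusion_mean:.3f}  true={truth:.3f}")
```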

Fig. 11

Error Bar Chart of Prediction Results from Different Models for Daily Data of Autumn-Season Tomatoes.

The results of the RandomForest, LightGBM, and CatBoost models are shown in Fig. 12, where Fig. 12(a) gives the prediction results with all 11 feature parameters and Fig. 12(b) gives those with the optimal feature parameter combinations. For these three models, the optimal feature parameter combinations give lower prediction errors than all feature parameters, with the MSE reduced by 0.12 to 0.153, the MAE by 0.103 to 0.156, and the RMSE by 0.116 to 0.155, and a higher R2, increased by 0.085 to 0.103. The results of the Average fusion, Weighted fusion, and Stacking models with all 11 feature parameters and with the optimal feature parameter combinations are shown in Fig. 13. For these three models as well, the optimal feature parameter combinations give lower prediction errors, with the MSE reduced by 0.122 to 0.278, the MAE by 0.006 to 0.178, and the RMSE by 0.162 to 0.181, and a higher R2, increased by 0.096 to 0.103. In summary, the greenhouse tomato water requirement prediction model proposed in this paper, namely the Stacking ensemble model with the feature parameters Tmax, Ts, and CC, performs best. Compared with the three single machine learning models using all original parameters, the MSE is reduced by more than 71%, the MAE by more than 6%, and the RMSE by more than 41%, and the R2 is increased by more than 13%.

Fig. 12

Comparison of Single Machine Learning Model Results for Daily Data of Autumn-Season Tomatoes.

Fig. 13

Comparison of Different Fusion Model Results for Daily Data of Autumn-Season Tomatoes.

Discussion

The integrated (ensemble) model in this study outperforms the single models, which is consistent with the conclusions of related research21,22. An ensemble model combines multiple models with different characteristics and can therefore mine information from the data from different angles. Models differ in their ability to capture data features; in image recognition tasks, for example, some models are good at recognizing object contours, while others are sensitive to color features. Integrating such models captures the data features more comprehensively and effectively improves generalization ability and prediction accuracy.
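To make the idea of combining several models concrete, the sketch below implements an average fusion and a weighted fusion of base-model predictions in plain Python. The inverse-error weighting rule, the function names, and all numbers are assumptions chosen for illustration; the weighting scheme actually used in this study may differ.

```python
def average_fusion(preds):
    """Element-wise mean of the base-model predictions."""
    return [sum(col) / len(col) for col in zip(*preds)]


def inverse_error_weights(errors):
    """Weights proportional to 1/error, normalized to sum to 1 (assumed rule)."""
    inv = [1.0 / e for e in errors]
    total = sum(inv)
    return [w / total for w in inv]


def weighted_fusion(preds, weights):
    """Element-wise weighted sum of the base-model predictions."""
    return [sum(w * p for w, p in zip(weights, col)) for col in zip(*preds)]


# Hypothetical predictions of three base models on three samples.
base_preds = [
    [2.0, 3.0, 4.0],  # model A
    [2.4, 3.4, 4.4],  # model B
    [1.6, 2.9, 4.2],  # model C
]
val_mse = [0.10, 0.20, 0.20]  # hypothetical validation errors

weights = inverse_error_weights(val_mse)   # the more accurate model weighs more
avg = average_fusion(base_preds)
wtd = weighted_fusion(base_preds, weights)
print("weights:", weights)
print("average fusion:", avg)
print("weighted fusion:", wtd)
```

Average fusion treats every base model equally, whereas the weighted variant lets a model with a lower validation error contribute more to the final prediction.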

This study also found that models with fewer parameters performed better under certain conditions23. Models with many parameters tend to overfit: they learn too much of the noise and detail in the training data and generalize poorly to new data. A model with fewer parameters has a simpler structure, can focus on the core features, and is less prone to overfitting. In this study, although the optimized model uses fewer parameters, each feature parameter retained in the model has an important influence on crop water demand, so the prediction is more stable and accurate.

From the perspective of crop physiology and meteorology, the temperature parameters (Tmax, Tmin, Ts), the relative humidity parameter (RHm), and the radiation parameter (Rn) have significant effects on tomato water demand. Alshami et al.24 proposed that solar radiation has a significant impact on the photosynthesis and transpiration of tomatoes. High temperature accelerates tomato transpiration and increases water demand. Low temperature reduces transpiration but also affects root water absorption, so the water supply must be managed appropriately. Tuzel et al.25 proposed that, in addition to temperature, relative humidity and radiation also have a significant impact on the transpiration and water requirements of tomatoes. Low relative humidity causes rapid evaporation and increases water demand, whereas high relative humidity inhibits transpiration and reduces water demand, although disease control must then be taken into account. Radiation promotes tomato photosynthesis, which raises water demand accordingly, and it also heats the leaves, which accelerates evaporation. Excessive radiation, however, damages the leaves and alters water demand, so irrigation strategies should be adjusted as radiation changes.

However, this study still has some limitations. First, the image segmentation algorithm is strongly affected by natural conditions such as illumination changes and occlusion. Second, the data used to build the model cover a limited time span, only spring-season and autumn-season tomatoes, so they can hardly reflect the influence of year-to-year climate fluctuations and changes in planting mode on tomato water demand. Third, for practical application, the effects of factors such as ventilation equipment operation and differences between irrigation systems on the tomato canopy microenvironment were not fully considered. This limits the adaptability of the prediction model under different facility conditions and makes it difficult to apply the model directly to the cultivation environments of other facilities to predict tomato water requirements accurately. In future work, the tomato water demand prediction method can be improved by optimizing the algorithm and model, expanding the range of the data, and considering environmental factors more comprehensively.

Conclusion

Accurate prediction of crop water requirements can serve as a basis for irrigation decision-making and contribute to the stable growth of crops. This paper proposes an image segmentation-based algorithm for extracting the canopy coverage of greenhouse tomatoes and applies it to both spring-season and autumn-season crops. By combining Spearman’s correlation analysis with the random forest feature importance ranking method, an optimal combination of feature variables was proposed. A water requirement prediction model for greenhouse tomato crops was then constructed using single machine learning algorithms and ensemble algorithms. When daily data from the spring season were used for parameter selection and model building, the parameters Tmax, Ts, and CC were found to have the greatest impact on the water requirements of greenhouse tomatoes, and the Stacking ensemble model showed the best prediction performance. Compared with the single machine learning models, the MSE, MAE, and RMSE were reduced by more than 31%, 12%, and 17%, respectively, and the R2 was increased by more than 3%. Compared with the Stacking model without feature selection, the MSE, MAE, and RMSE were reduced by 19%, 16%, and 10%, respectively, and the R2 was improved by 1.5%. Good results were also achieved on both the hourly data from the spring season and the daily data from the autumn season. Compared with the original RandomForest, LightGBM, and CatBoost models using all parameters, the MSE was reduced by more than 26%, the MAE by more than 5%, and the RMSE by more than 13%, and the R2 was increased by more than 9%. Therefore, the multi-source data fusion model proposed in this paper for predicting the water requirements of greenhouse tomato crops has excellent reliability and generalization.
Compared with the traditional Penman-Monteith (P-M) model, the proposed model uses image algorithms to extract key crop growth parameters, which saves manpower. It also requires fewer parameters, has lower computational complexity, and can effectively predict crop water requirements.