Introduction

Background research

Usually, cost estimation is an experience-based task that includes the evaluation of unknown conditions and complex relationships of factors affecting the cost. An Artificial Neural Network (ANN) is an analogy-based process that is best suited for cost estimation. The main advantages of ANNs include their ability to learn by example (past projects) and to generalize solutions to applications (future projects)1,2. In general, the mechanism that drives advanced technologies and gives rise to innovative tools, especially in the agricultural sector, is ML3,4,5.

In this regard, Elhag & Boussabaine6 investigated an artificial neural system for estimating the cost of construction projects. Two ANN models were developed to predict the lowest tender price, using data from 30 projects. Model I used 13 cost-determining features, whereas model II used only 4 input variables. The findings showed that both ANN models learned well in the training phase and generalized well in the test phase. Models I and II achieved average accuracies of 79.3% and 82.2%, respectively. Ahiaga-Dagbui & Smith7 used ANN to model the final cost of water projects, drawing on data from 98 water-related construction projects carried out in Scotland between 2007 and 2011. As a prototype for more extensive research, the final model performed very satisfactorily, and the results indicated the high ability of ANN to capture the interaction between estimator features and final cost. Elfaki et al.8 conducted a 10-year review of intelligent techniques in construction project cost estimation and emphasized the high capability of ANN in recognizing, checking, and weighting important criteria and, finally, in estimating the initial and final cost of projects. Juszczyk et al.9 investigated an ANN approach to estimating the construction costs of sports fields. Apart from the general conclusion about the applicability of ANNs, the Multi Layer Perceptron (MLP) model was selected from a wide set of different networks. The analysis of the results showed that the selected network performed well in terms of the correlation between actual and estimated cost. The level of errors was acceptable and the accuracy of the model was judged suitable.

Roxas & Ongpeng10 used an ANN approach to estimate the cost of construction projects in the Philippines. The objective was to develop an ANN model that can predict the total structural cost of construction projects. Data from 30 construction projects were collected and randomly divided into three sets: 60% for training, 20% for performance validation, and 20% as a fully independent test of network generalization. Six input parameters were used. The resulting ANN model predicted the total structural cost of construction projects reasonably well, with favorable results in both the training and testing phases.

Yadav et al.11 also developed a Cost Estimation Model (CEM) using ANN that predicts the total structural cost of residential complexes from several parameters. Data covering the preceding 23 years were collected. The resulting ANN model predicted the total structural cost of construction projects reasonably well, with a correlation coefficient of R = 0.9960 and R-squared = 0.9905, giving favorable results in the training and testing phases. Leszczyński & Jasiński12 used an ANN approach to estimate product cost in a case study. The aim was to present ANNs as a method for estimating product cost both theoretically and practically; the main problem was to model the cost estimation process for products made with advanced production technology. The theoretical and experimental analysis showed that ANN models are among the most innovative tools for product cost estimation in an industrial environment with advanced technology and digitized production. In another study, Sharma et al.13 evaluated machine learning and deep learning (transfer learning) techniques for detecting rice diseases such as bacterial blight, rice blast, and brown spot. Their comparative analysis demonstrated that transfer learning models, particularly InceptionResNetV2 and XceptionNet, outperformed conventional machine learning techniques. This research underscores the potential of these methods for early disease diagnosis to assist farmers. The authors recommend that future studies focus on larger datasets to improve generalizability.

Estimating the cost of construction projects accurately in the early stages plays a vital role in the success of any project. For this purpose, Chandanshive & Kambekar14 investigated the estimation of building construction cost using ANN. Based on a data set of 78 construction projects from the city of Mumbai and its surrounding area, the most influential design parameters for building construction cost were identified as inputs, and the total cost of the structural skeleton was the output of the neural network models. The trained neural network model was able to predict the cost of construction projects in the early stages of construction. In another study, Omotayo et al.15 presented an ANN approach to predict the most applicable Post-Contract Cost Controlling Techniques (PCCTs) in construction projects. The study aimed to propose a structured decision support method for predicting the most applicable PCCTs using ANN, and data from 135 samples were used for this purpose. The use of ANNs enabled the development of a structured decision support methodology for analyzing the most suitable PCCTs for deployment at different stages of the construction process; an RMSE of 0.073 and an R-squared of 0.726 were obtained in the validation stage. Singh & Singh16 used Featurewiz as an effective method for data normalization. Experiments were conducted on 18 benchmark datasets to demonstrate the effectiveness of the proposed approach compared with conventional normalization, and the results were evaluated with four learning algorithms. Featurewiz performed better than conventional data normalization with four well-known machine learning algorithms (KNN, MLP, GNB, and L-SVM), and the statistical analysis confirmed this.
Sharma & Kumar17 explored transfer learning models for diagnosing rice plant diseases using publicly available datasets from Mendeley and Kaggle. Their study tested individual models like InceptionV3, ResNet152V2, and DenseNet201, and introduced model ensembling to enhance diagnostic accuracy. Results showed that most ensemble models outperformed individual transfer learning models and even advanced approaches like Convolution-XGBoost, highlighting the potential of ensemble techniques in improving automated rice disease detection systems.

Definition of the problem

In Iran, less than 30% of irrigated lands are covered by modern irrigation systems. This issue shows the importance of examining the weak and strong points, determining the factors affecting the costs of different irrigation systems, and then economic modeling of pressurized irrigation systems, especially estimating the cost of an irrigation system before construction. On the other hand, one of the most important needs of the Ministry of Agricultural Jihad, the Ministry of Energy, and other trustees of the water industry (in Iran and all over the world) as well as employers, consultants, and contractors is to estimate the costs of projects in the initial stage and even before design.

Since the development of pressurized irrigation systems is part of the strategic policies of the government and the Ministry of Agricultural Jihad of Iran, knowing in advance which factors affect the cost of an irrigation system in different regions, before implementation, is a great help in cost management. Also, considering the significant extent of lands covered by pressurized irrigation systems and their development potential in Iran and the world, identifying the features that affect cost and then modeling it will play a significant role in the country’s annual budgeting and deserves attention. On the other hand, the cost per hectare of implementing an irrigation system varies widely with the type of crop, area, water, climate, and other geometric and geographical factors, so a single prescription cannot be applied to all conditions. As a result, it is necessary to identify the features affecting the cost of irrigation systems and to model the costs before designing and implementing them, so that the cost of all parts can be estimated.

The importance of pressurized irrigation systems

Regarding the development of pressurized irrigation systems in Iran during the country’s first to sixth development plans, economic issues and the financial facilities paid by the government for implementing these systems have been the most important and most influential factor in changing the trend in the area of irrigation systems implemented in the country. According to statistics from the Ministry of Agricultural Jihad, a total of 1,987,548 hectares of modern pressurized irrigation systems were implemented in Iran from 1990 to the end of 2021.

Although the quantitative development process of modern pressurized irrigation systems has changed due to economic fluctuations and commercial equations, overall, the process of equipping the lands covered by these systems in Iran was upward. According to surveys, if the statistics up to 2022 are considered, the implemented levels of pressurized irrigation systems in Iran have increased to more than 2.6 million hectares18.

Cost estimation

Techniques and formulas that can estimate the cost of a project from a small set of basic information are of great help to consulting engineering companies active in water engineering. They can serve as a comprehensive guideline for estimating the cost of building a pressurized system before its implementation. The methods that can be used for cost estimation at different stages of a project include traditional detailed cost estimation, simple cost estimation, cost estimation based on cost functions, activity-based cost estimation, the cost index method, and expert systems19,20,21,22,23,24,61.

Therefore, early cost estimation plays an important role in the early decisions of the construction project, even when the project is not yet finalized and there is still very limited information about the detailed design available at these stages14,25,26. In addition, cost estimation plays a key role in the successful completion of construction projects. Due to the lack of information, details, maps, and many important factors affecting cost estimation in the initial stage and planning, the project will be at risk. Therefore, cost estimation plays an important role in construction project decisions, and for success in construction projects, cost estimation with high accuracy and less error will be urgently and seriously needed27,28,29,30,31.

In this regard, various methods are available for cost estimation. With the increase in computing power, there is now a greater tendency to use methods based on Machine Learning (ML), such as Artificial Neural Networks (ANN), Fuzzy Logic (FL), Deep Learning (DL), and Genetic Algorithm (GA), including Gene Expression Programming (GEP) and Genetic Programming (GP), for more accurate estimation of project duration and costs. These methods can remain reliable despite insufficient detail at the initial stage, and even with little data, and can identify non-linear relationships between cost factors and project costs23,30,32. While the use of ANN for cost estimation from the perspective of contractors has been extensively investigated, there are limited studies on the development and application of ML-based methods for consulting engineering firms. Because the products and services provided by consulting engineering companies are inherently different from those of contractors, and because the type and level of detail of the information available at the bidding stage differ, it is important to investigate the application of ML-based methods for cost estimation in consulting companies20,26,27,33,34,35,36 (Drenthe et al., 2019).

Feature selection

Feature Selection (FS) and data processing are fundamental components of many classification, modeling, and regression problems. Some features carry the same information as others, some are misleading, and some have no effect on the classification or regression problem, so choosing a minimal, optimal set of features can be useful37,38,39,40. As one of the most important data processing problems, feature selection is an active research topic in Pattern Recognition (PR), Machine Learning (ML), and Data Mining (DM). The approach is very widely used, and projects gain value when feature selection and Sensitivity Analysis (SA) are applied in them. With the development of information storage, input data now have a large number of attributes, many of which may be irrelevant or unimportant. Unnecessary features often make learning less efficient and more difficult. Selecting the relevant and necessary features for a given learning task is therefore a very important step in the data science model development workflow41,42,43. Data scientists use various feature selection techniques and methods to remove redundant features.

Summary of previous studies and innovation of this research

Studies within Iran have discussed the economic analysis of irrigation systems when switching from one system to another and its effect on crop performance and economic productivity. International studies have addressed the recognition of influential components and the estimation of final costs in road and construction engineering projects, while studies related to the water industry and irrigation systems have been the missing link in these discussions. In addition, past work using linear and regression relationships, artificial neural networks, and machine learning and artificial intelligence algorithms in general used all available features and variables. This had several major flaws: (1) execution took a long time and was costly; (2) the results were not user-friendly, and not everyone could take sufficient advantage of the extracted relations; and (3) most importantly, using all the features led to complicated and impractical results. With the advancement of information technology, the feature selection approach now seeks to ease this process. The current research therefore looks for a relationship that identifies the important factors affecting the final cost of an irrigation system and formulates it for use in areas with different characteristics. It aims to estimate the cost of pressurized irrigation projects in the early stages of design with machine learning methods, using data from many pressurized irrigation projects carried out in different parts of Iran in different years. Further goals are to find a single, generalizable algorithm for estimating the final cost of an irrigation system and to identify the most important components influencing the cost of implementing it. To the best of our knowledge, such a study has not been done until now.
Also, this article’s distinguishing feature from previous research is the use of new and numerous models, software, and approaches to modeling the cost of pressurized irrigation systems.

Materials and methods

Early-stage cost modeling of pressurized irrigation projects was carried out in several steps: collecting the required data, updating project costs, selecting the best features, and training and validating the cost estimation models. The models covered four cost components: the cost of the pumping station and central control system (TCP), the cost of on-farm equipment (TCF), the cost of installation and operation of the on-farm and pumping station parts (TCI), and the total cost (TCT).

Collecting the required data

In this research, a comprehensive data bank was prepared from the statistics and information of 515 drip irrigation systems implemented between 2006 and 2020 in different parts of Iran, obtained from reputable consulting engineering companies. For each irrigation system, the statistics and information were extracted from irrigation plan reports, AutoCAD maps, and Excel files of design calculations, and the cost of each system was divided into two general parts: pumping station cost and farm cost. The candidate input variables for the drip irrigation cost estimation models include geometric variables of the land and variables describing soil, water source, plant, climate, irrigation management, and hydraulics, which were extracted as follows:

  • General project information: province, city, owner, type of cultivated crop, water source, energy, and year of implementation.

  • Water source information: the amount and status of water rights, electrical conductivity, acidity, sodium, potassium, calcium, magnesium, carbonate, and bicarbonate.

  • Crop information: the distance between rows of trees, the distance between trees on the row, the maximum daily evapotranspiration of the crop, the shaded area, and the depth of root development.

  • Soil information: water holding capacity of the soil, percentage of allowed moisture depletion, percentage of wetted surface, final permeability, and apparent specific gravity.

  • Irrigation system information: average operating flow rate, average operating pressure, type of arrangement of laterals, number of emitters for each plant, and emitter distance.

  • Farm irrigation management information: irrigation interval obtained in the design, net irrigation requirement, gross irrigation requirement, irrigation duration, maximum irrigation hours per 24 hours, maximum number of irrigation turns in each interval, number of irrigation turns, average area of each irrigation unit, and average discharge per irrigation unit.

  • Farm characteristics information: geometric shape of the farm, average slope, height difference from the water source to the highest point of the farm, length of laterals, and diameter and length of main pipes, semi-main pipes, manifolds, and connections.

  • Pumping station information: pump type, engine power, pumping height, pumping flow rate (discharge), central control system connections, and equipment and accessories.

Project cost updating and data preprocessing

In the current research, the prices of all 515 drip irrigation projects (2006–2020) were updated stepwise to the base year 2022, using annual inflation and the following relationship44:

$${X}_{t}={X}_{0}{\left(1+r\right)}^{n}$$
(1)

where \({X}_{t}\) is the current value of capital, \({X}_{0}\) is the base value of capital (investment value in the year of implementation of the system), r is the average annual bank interest rate and n is the number of years from the year of implementation of the system until now.
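As a worked illustration of Eq. (1), the following minimal Python sketch compounds a cost forward in time. The function names, the 20% rate, and the cost of 1000 units are hypothetical and chosen only for illustration; the second function is one possible reading of "stepwise" updating, in which a different rate is applied for each year, and it is an assumption rather than the procedure actually used in this study.

```python
def update_cost(base_cost: float, rate: float, years: int) -> float:
    """Compound a cost forward with one rate: X_t = X_0 * (1 + r) ** n."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return base_cost * (1.0 + rate) ** years


def update_cost_stepwise(base_cost: float, yearly_rates: list[float]) -> float:
    """Compound a cost forward with a separate rate for each year."""
    cost = base_cost
    for r in yearly_rates:
        cost *= 1.0 + r
    return cost


# Hypothetical project implemented in 2018, updated to the base year 2022.
single_rate = update_cost(1000.0, 0.20, 2022 - 2018)
stepwise = update_cost_stepwise(1000.0, [0.20, 0.20, 0.20, 0.20])
print(single_rate, stepwise)
```

When every yearly rate equals r, the stepwise product reduces to Eq. (1).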

For pre-processing, the data were standardized using one of several available standardization methods and, after their randomness was confirmed, split into one set for model training and another for testing and validation. In this research, 75–80% of the data were used for training and 20–25% for testing39,45. After extracting the variables affecting the cost of drip irrigation systems, the next step was to select the features that have the greatest impact on the model output (i.e. cost). Table 1 shows the candidate variables used to determine the relationship between independent and dependent variables.

Table 1 Candidate variables for cost modeling of drip irrigation systems.
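The standardization and random train/test split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; z-score standardization, the function names, and the seed are assumptions, since the text does not state which standardization method or tool was used.

```python
import random
import statistics


def standardize(column):
    """Z-score a list of numbers: (x - mean) / standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0 for _ in column]  # constant column carries no information
    return [(x - mean) / std for x in column]


def shuffle_split(rows, train_fraction=0.8, seed=0):
    """Shuffle rows reproducibly, then cut them into train and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(train_fraction * len(rows))
    return rows[:cut], rows[cut:]


# 515 project indices, split 80/20 as in the upper end of the stated range.
train, test = shuffle_split(range(515), train_fraction=0.8)
z = standardize([1.0, 2.0, 3.0])
print(len(train), len(test), z)
```

A common refinement, not described in the text, is to compute the mean and standard deviation on the training part only and reuse them for the test part, so that no information from the test data enters training.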

Feature selection technique

After extracting the variables affecting the cost of pressurized irrigation systems, the next step was to select the most important ones, a step referred to as feature selection (FS). It is very important to choose the features that have the greatest impact on the cost of the different parts46,47. Selecting a subset of features is an active area of research in machine learning and is used for both regression and classification problems. Feature extraction and feature selection are two main steps in machine learning programs and modeling. In feature extraction, informative features are derived from the existing data. However, not all derived features are useful in the learning process, and the most important ones should be identified using different models, methods, and algorithms45,48.

Feature selection techniques using supervised models are mainly classified into three main categories or in some previous literature into five categories37,39,40,42,49. These include Filter Methods (FM), Wrapper Methods (WM), Embedded Methods (EM), Online Methods (OM) and Hybrid Methods (HM) (Fig. 1).

Fig. 1 Schematic of different feature selection methods (FS).

In the figure above, the filter method weights and selects the features. The wrapper method obtains a subset of features based on the learner’s performance. The embedded method selects features based on the learner’s selection order. The online method is based on online tools and the hybrid method combines different methods to achieve better results45. Applying these methods requires coding in different environments or spending a lot of time, but new methods were developed for the important topic of feature selection, discussed below.
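To make the wrapper idea concrete, the sketch below runs a greedy forward selection in which a learner's error decides which feature to add next. It is an illustration only: the learner (a leave-one-out 1-nearest-neighbour regressor), the synthetic data, and all names are assumptions, and it does not reproduce any of the tools used in this study.

```python
import random


def loo_1nn_error(X, y, features):
    """Leave-one-out mean squared error of a 1-nearest-neighbour regressor
    that only looks at the given feature indices."""
    total = 0.0
    for i, xi in enumerate(X):
        best_d, best_j = float("inf"), None
        for j, xj in enumerate(X):
            if j == i:
                continue
            d = sum((xi[f] - xj[f]) ** 2 for f in features)
            if d < best_d:
                best_d, best_j = d, j
        total += (y[i] - y[best_j]) ** 2
    return total / len(X)


def forward_selection(X, y):
    """Greedy wrapper: add the feature that lowers the learner's error most
    and stop when no remaining feature improves it."""
    remaining = list(range(len(X[0])))
    selected, best_err = [], float("inf")
    while remaining:
        err, f = min((loo_1nn_error(X, y, selected + [f]), f) for f in remaining)
        if err >= best_err:
            break
        selected.append(f)
        remaining.remove(f)
        best_err = err
    return selected, best_err


# Synthetic data (an assumption for illustration): only feature 0 drives y.
rng = random.Random(1)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(60)]
y = [3.0 * row[0] for row in X]
selected, err = forward_selection(X, y)
print(selected, err)
```

A filter method would instead score each feature once, for example by its correlation with the target, without training a learner; this is cheaper but ignores how features behave together.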

Feature selection methods

Eureqa Formulize

In this research, evolutionary algorithms were used to discover and model the relationships between parameters36. The best known of these is the Genetic Algorithm (GA), with its branches Genetic Programming (GP) and Gene Expression Programming (GEP). For this purpose, Eureqa Formulize software was used to identify the features affecting the cost of drip irrigation systems. The software automatically applies data pre-processing such as normalization, removal of outliers, and data randomization, thereby reducing the calculation error caused by noise in the data.

The program was designed and developed by Nutonian (http://nutonian.wikidot.com). It returns the final equation in the form of a symbolic regression, presenting a set of mathematical relations, and simplifies analysis of the model by providing different outputs. Here, after data pre-processing, 70% of the data were used for training and 30% for testing50. The software can identify the most important features and can model the data with high accuracy.

winGamma

In this study, the Gamma Test (GT) and M-test (MT), together with three search techniques, Genetic Algorithm (GA), Hill Climbing (HC), and Full Embedding (FE), were used to identify the most important parameters affecting the cost of different parts of drip irrigation systems and the optimal percentage of training and testing data for cost modeling of each part51. The Gamma Test was specifically developed for modeling and predicting nonlinear systems. With this software and the gamma test, the input variables can be ranked by importance and the best combination among all possible combinations can be found. The gamma test was first reported by Koncar52 and Stefánsson et al.53 and was later used by other researchers such as Durrant54 and Tsui et al.55. It was made available by Durrant54 as a software package (https://users.cs.cf.ac.uk/O.F.Rana/Antonia.J.Jones/GammaArchive/Gamma%20Software/winGamma/winGamma.htm).

M-test (MT): a basic and much-discussed question in data series modeling is how much of the series to use for model preparation and how much for model testing. Theoretical discussions often suggest using 70% of the data series for training (model preparation) and 30% for model testing. The M-test provides a scientific basis for defining and separating these two intervals. The goals of using the Gamma and M tests in winGamma software can be summarized as follows51,52,53,54,56 (Nekue et al., 2021): finding the minimum amount of data required to produce a near-optimal model; determining, on a scientific basis, the number of data points for the training and testing stages of modeling; determining the best embedding dimension and lag time for time series; quickly and automatically building a neural network structure with minimum weights that models the data well; and determining the best set of inputs from the list of possible inputs to a neural controller.
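The idea behind the gamma test can be sketched in a few lines. For each point, the k nearest neighbours in input space are found; the mean squared input distance δ(k) and half the mean squared output difference γ(k) are computed for k = 1, …, k_max; and the intercept of the regression line of γ on δ estimates the variance of the noise in the output. The code below is a brute-force sketch of this procedure as it is generally described in the literature, not the winGamma implementation; the function name, k_max = 10, and the synthetic sine data are assumptions.

```python
import math
import random


def gamma_test(X, y, k_max=10):
    """Return (Gamma, slope): intercept and slope of the regression of
    gamma(k) on delta(k) for k = 1..k_max nearest neighbours."""
    n = len(X)
    delta = [0.0] * k_max
    gamma = [0.0] * k_max
    for i in range(n):
        # Squared input-space distances from point i to every other point.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
            for j in range(n) if j != i
        )
        for k in range(k_max):
            d2, j = dists[k]
            delta[k] += d2 / n
            gamma[k] += 0.5 * (y[j] - y[i]) ** 2 / n
    # Ordinary least-squares line: gamma = Gamma + slope * delta.
    md, mg = sum(delta) / k_max, sum(gamma) / k_max
    sxx = sum((d - md) ** 2 for d in delta)
    sxy = sum((d - md) * (g - mg) for d, g in zip(delta, gamma))
    slope = sxy / sxx
    return mg - slope * md, slope


# Synthetic example: y = sin(x), without noise and with noise of variance 0.04.
rng = random.Random(0)
n = 300
X = [[2 * math.pi * i / (n - 1)] for i in range(n)]
clean = [math.sin(x[0]) for x in X]
noisy = [v + rng.gauss(0.0, 0.2) for v in clean]
gamma_clean = gamma_test(X, clean)[0]
gamma_noisy = gamma_test(X, noisy)[0]
print(gamma_clean, gamma_noisy)
```

For the noise-free series the intercept should be close to zero, and for the noisy series close to the noise variance. An M-test can be built on the same function by evaluating it on the first M points for increasing M and noting where the estimate stabilizes.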

Featurewiz

Feature selection in Python is the process of automatically or manually selecting the features in a dataset that contribute most to the estimated variable or desired output. Not all features in a dataset are needed for the best model performance. The four main reasons for applying feature selection in Python are: (1) it improves the accuracy of the model if an appropriate subset is selected; (2) it reduces overfitting; (3) it enables the machine learning algorithm to train faster; and (4) it reduces the complexity of the model and makes it easier to interpret57,58. A real dataset has many features, some of which are useful for training a robust data science model, while others are redundant and can degrade model performance. Feature selection is an important element of the data science model development workflow.

There are various feature selection techniques and methods that data scientists use to remove redundant features. A new, improved, and fast way to select the best features in a dataset is Featurewiz, which offers feature engineering capabilities (https://github.com/AutoViML/featurewiz). The Featurewiz API has a “Feature_Engg” parameter that can be set to “Interactions”, “Grouping” and “Target”, creating hundreds of features in one go. Also, it can reduce the number of features and select the best set of features to train a robust model. Featurewiz uses two algorithms to select the best features from the dataset59: SULOV and Recursive XGBoost.

  • SULOV Algorithm: SULOV stands for Searching for Uncorrelated List of Variables, and the method is very similar to the MRMR algorithm. It searches the list of variables for pairs whose correlation exceeds a threshold and which are therefore considered highly correlated.

  • Recursive XGBoost Algorithm: After the SULOV algorithm has selected the set of features with low mutual correlation and high mutual information score, the recursive XGBoost algorithm is used to find the best features among the remaining variables.

Featurewiz thus uses the two algorithms discussed above to find the best set of features for training a robust machine learning model. It can handle datasets with a single target variable as well as datasets with several target variables. The output of this step is important because it shows which features are more important and which are less important for predicting the target variable57,58,59.
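The redundancy-pruning step described for SULOV can be sketched as below. This is not the Featurewiz implementation: relevance is measured here by the absolute Pearson correlation with the target as a simple stand-in for the mutual information score, and the column names, the threshold of 0.7, and the synthetic data are hypothetical.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def prune_correlated(columns, target, corr_limit=0.7):
    """Drop the weaker member of every feature pair whose absolute
    correlation exceeds corr_limit. `columns` maps name -> list of values.
    Relevance is |corr(feature, target)|, a stand-in for mutual information."""
    relevance = {name: abs(pearson(col, target)) for name, col in columns.items()}
    names = list(columns)
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(pearson(columns[a], columns[b])) > corr_limit:
                dropped.add(a if relevance[a] < relevance[b] else b)
    return [name for name in names if name not in dropped]


# Hypothetical data: pipe length is nearly proportional to area (redundant).
rng = random.Random(0)
area = [rng.uniform(1, 50) for _ in range(100)]
pipe_length = [120 * a + rng.gauss(0, 300) for a in area]
slope = [rng.uniform(0, 10) for _ in range(100)]
cost = [5 * a + 0.5 * s for a, s in zip(area, slope)]
kept = prune_correlated(
    {"area": area, "pipe_length": pipe_length, "slope": slope}, cost
)
print(kept)
```

In this toy example the redundant pipe length column is removed because it is highly correlated with area and slightly less related to cost. The surviving features would then be passed to a model-based ranking step such as the recursive XGBoost stage described above.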

FeatureSelect

This study used three types of learners to select the feature. The first one is SVM (Support Vector Machine). The second is ANN, which includes only one parameter (training repetition). After examining the types of artificial neural networks, the results showed that the optimization algorithms can lead to better results in the training phase of the artificial neural network. Selecting features by SVM or DT (Decision Tree) and then using ANN to obtain an efficient model is also possible. The third learner is the Decision Tree (DT). Also, three types of feature selection methods were evaluated: (1) Wrapper method (optimization algorithm), (2) Filter method: this type of feature selection consists of five common methods. Experimental results show that each learner and method has its perspective on the dataset, but wrapper methods can generally lead to better results than filter methods. (3) Hybrid-Ensemble method: two-stage feature selection can be used using a combination of filter and wrapper methods. In the following, 11 algorithms were used to select the best feature(s) from the set of features in the wrapper method section. Algorithms developed for this task that can be used in FeatureSelect software include World Competitive Contest (WCC), League Championship Algorithm (LCA), GA, Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), Imperialist Competitive Algorithm (ICA), Learning Automata (LA), Heat Transfer Optimization Algorithm (HTS), Forest Optimization Algorithm (FOA), Discrete Symbiotic Organisms Search (DSOS), and Cuckoo Optimization (CUK)39,45.

FeatureSelect is a new program for feature selection based on machine learning methods, developed by Masoudi-Sobhanzadeh et al.45 in the Laboratory of System Biology and Bioinformatics of the University of Tehran. It can be applied to problems that require selecting an important and effective subset of features from the entire feature set. Some previous studies have introduced tools and software such as WEKA. While these tools are based on filter methods, which are less efficient than wrapper methods, FeatureSelect includes optimization and learning algorithms in addition to filter methods. Data normalization and data fuzzification are also performed. Using FeatureSelect therefore has three main goals: (1) easy use of LIBSVM, ANN, and DT; (2) feature selection for regression problems; and (3) feature selection for classification problems. FeatureSelect is a feature (or gene) selection application available on GitHub (https://github.com/LBBSoft/FeatureSelect).

Different data mining algorithms

The AI models used here are widely recognized and commonly applied in water and environmental sciences. Previous research has shown that these models are among the most frequently used and have provided excellent results60,61,62,63. More details are given in the following sections.

MLR

Linear regression is divided into two types: simple linear regression and Multivariate Linear Regression (MLR). Simple linear regression predicts the value of a dependent variable from the value of one independent variable, whereas multiple regression accounts for the joint and individual contributions of two or more independent variables to the changes in a dependent variable. Multivariate regression therefore has a much wider application. By definition, a regression coefficient is the rate of change in the dependent variable caused by a unit change in the corresponding independent variable64,65,66. The degree of correlation between predictor variables is shown by coefficients. To determine the regression in the present study, the following relationship is used:

$$Y={\beta }_{0}+{\beta }_{1}{X}_{1}+{\beta }_{2}{X}_{2}+\dots +{\beta }_{n}{X}_{n}+\varepsilon$$
(2)

where Y is the dependent variable, \({\beta }_{0}\) is the constant coefficient, \({\beta }_{1}\), …, \({\beta }_{n}\) are the regression coefficients, \(\varepsilon\) is the error term, and \({X}_{1}\), \({X}_{2}\), …, \({X}_{n}\) are the independent variables used in the models of this study67. Here, the cost of the whole system is the dependent variable, and the costs of the farm and pumping components are the independent variables analyzed with multivariate linear regression. There are five methods for entering variables into the regression model, and several of them were used depending on the purpose: Enter, Backward, Forward, Remove, and Stepwise66.
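As an illustration of Eq. (2), the coefficients can be estimated by ordinary least squares. The following is a minimal sketch that solves the normal equations in plain Python; it corresponds to entering all variables at once and does not reproduce the Backward, Forward, Remove, or Stepwise procedures. The function names and the toy data are hypothetical and are not taken from the study.

```python
def fit_mlr(X, y):
    """Least-squares estimate of [b0, b1, ..., bn] via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]   # leading 1 carries the intercept b0
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]  # X^T X
    c = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]              # X^T y
    for col in range(k):                                 # Gauss-Jordan with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
                c[r] -= f * c[col]
    return [c[i] / A[i][i] for i in range(k)]


def predict_mlr(beta, x):
    """Evaluate Y = b0 + b1*X1 + ... + bn*Xn for one sample x."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


# Hypothetical toy data generated exactly from Y = 2 + 3*X1 + 0.5*X2
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [2 + 3 * a + 0.5 * b for a, b in X]
beta = fit_mlr(X, y)
```

Because the toy data contain no noise, the fitted coefficients recover the generating values; with real cost data the error term \(\varepsilon\) would absorb the residuals.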

SVR

The Support Vector Regression (SVR) method was introduced in 1995 by Cortes and Vapnik. They stated that although this method is less used than SVM, SVR has been proven to be an effective tool for estimating real-valued functions68. The main difference between SVM and SVR is their output type: SVM predicts class labels, whereas SVR predicts continuous values. With SVM, both linear and non-linear models can be built and their parameters calculated. Non-linear models are obtained by using a non-linear kernel function (such as a polynomial). The choice of kernel for SVR depends on the amount of training data and the dimension of the feature vector. In practice, four kernel types are used: linear, polynomial, hyperbolic tangent, and Gaussian69,70. The SVR problem is most easily formulated from a geometric point of view using the one-dimensional example in Fig. 2. The continuous-valued function being approximated can be written as Eq. (3). For multidimensional data, x is augmented by one and b is included in the w vector so that the multivariate regression in Eq. (4) is easily obtained71,72:

Fig. 2 Schematic of SVR.

$$y=f\left(x\right)=\langle w,x\rangle +b=\sum_{i=1}^{M}{w}_{i}{x}_{i}+b,\quad y,b\in {\mathbb{R}},\; x,w\in {\mathbb{R}}^{M}$$
(3)
$$f\left(x\right)={\left[\begin{array}{c}w\\ b\end{array}\right]}^{T}\left[\begin{array}{c}x\\ 1\end{array}\right]={\mathbf{w}}^{T}\mathbf{x}+b,\quad x,w\in {\mathbb{R}}^{M+1}$$
(4)
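The equivalence between Eqs. (3) and (4) can be checked numerically. The sketch below evaluates the linear function both ways and also lists the four kernel types named in the text. The kernel parameters (`degree`, `coef0`, `scale`, `gamma`) and the sample vectors are illustrative assumptions, not settings from this study.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def f_eq3(w, x, b):
    """Eq. (3): inner product plus a separate bias term."""
    return dot(w, x) + b


def f_eq4(w, x, b):
    """Eq. (4): bias folded into w, input extended with a constant 1."""
    return dot(list(w) + [b], list(x) + [1.0])


# The four kernel families mentioned in the text (parameter values are assumptions)
def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    return (dot(u, v) + coef0) ** degree


def tanh_kernel(u, v, scale=0.01, coef0=0.0):
    return math.tanh(scale * dot(u, v) + coef0)


def gaussian_kernel(u, v, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


w, x, b = [0.5, -1.0, 2.0], [4.0, 3.0, 1.5], 0.25
same = abs(f_eq3(w, x, b) - f_eq4(w, x, b)) < 1e-12   # both forms give 2.25
```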

ANN

Artificial Neural Networks (ANN) were first introduced in 1943 by McCulloch and Pitts73 and were advanced in a serious and influential way by Rosenblatt in 196274. Later, with the development of computers and the appearance of the backpropagation training algorithm for feedforward neural networks by Rumelhart et al.75, their use entered a new stage. ANN is an information-processing approach that is inspired by the biological nervous system and processes information in a way similar to the human brain. The system consists of many elements called neurons that work together to solve a problem. A neuron is the smallest information-processing unit and forms the basis of ANN operation. Each network consists of an input layer, an output layer, and one or more intermediate (hidden) layers. Figure 3 shows a schematic of ANN. In this figure, i is the input vector of the system, consisting of the causal variables that influence the behavior of the system, and o is the output vector, consisting of the result variables that express the behavior of the system.

Fig. 3 Schematic of ANN (an MLP network).

MLP

The Multilayer Perceptron (MLP) neural network is a type of ANN in which weights and biases are trained to produce a specific target. MLP is noteworthy because of its good performance76. The network is a set of neurons arranged in successive layers, which makes it a complex and non-linear system. The MLP is trained by supervised learning, in which inputs and outputs are provided to the network and the estimation error is minimized77. Figure 4 shows the schematic of an MLP. In this research, the error backpropagation (BP) algorithm was used to train the MLP, with the sigmoid transfer function in the hidden layer and the linear transfer function in the output layer. The output in the MLP method is determined by the following relationship, where n is the net input, w is the weight, p is the input, b is the bias, f is the transfer function, and a is the output.

Fig. 4 Schematic of MLP.

$$n=wp+b\Rightarrow a=f\left(n\right)=f\left(wp+b\right)$$
(5)
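Eq. (5) applied layer by layer gives the forward pass of the network. The following minimal sketch uses a sigmoid hidden layer and a linear output layer, as described above; the weights, biases, and layer sizes are hypothetical, and training by backpropagation is not shown.

```python
import math


def sigmoid(n):
    return 1.0 / (1.0 + math.exp(-n))


def layer(W, p, b, f):
    """a = f(W p + b) for one layer; each row of W holds the weights of one neuron."""
    return [f(sum(w * x for w, x in zip(row, p)) + bias) for row, bias in zip(W, b)]


def mlp_forward(p, W1, b1, W2, b2):
    hidden = layer(W1, p, b1, sigmoid)           # sigmoid transfer function
    return layer(W2, hidden, b2, lambda n: n)    # linear transfer function


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output
W1 = [[0.0, 0.0], [1.0, -1.0]]
b1 = [0.0, 0.0]
W2 = [[2.0, 4.0]]
b2 = [1.0]
out = mlp_forward([3.0, 3.0], W1, b1, W2, b2)    # both hidden neurons output 0.5
```

With these weights both hidden neurons receive a net input of zero, so the output is 2(0.5) + 4(0.5) + 1 = 4.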

RBF

The neural network based on the Radial Basis Function (RBF) has a very strong mathematical foundation in regularization theory. In general, this network consists of three parts: an input layer, a hidden layer, and an output layer78. The Gaussian transfer function is used in the hidden layer and the linear transfer function is used in the output layer. In Fig. 5, the RBF neuron is a Gaussian function. The input of this function is the Euclidean distance between the input presented to the neuron and a specified vector of the same size as the input vector. This Gaussian function uses the following relationship77:

Fig. 5 Schematic of RBF.

$$f \left({X}_{r} , b\right)={e}^{{-(\Vert {X}_{r}-{X}_{b}\Vert \times \frac{0.8326}{h})}^{2}}$$
(6)

In this relation, \({X}_{r}\) is the input of the network with unknown output, \({X}_{b}\) is the input of observations in time or place, and b and h are the parameters that control the width of the Gaussian function. The output of this function is a value between zero and one77. The output \({Y}_{r}\) for the independent variable \({X}_{r}\) is calculated as follows:

$${Y}_{r}=LW\times f \left({X}_{r} , b\right)+Bias$$
(7)

In this relation, LW is the weight matrix of the connections between the hidden layer and the output layer, and Bias is the bias matrix of the output layer77.
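Eqs. (6) and (7) can be sketched as follows. The constant 0.8326 is approximately \(\sqrt{\mathrm{ln}2}\), so the Gaussian function returns about 0.5 when the distance equals h. The observation points, weights, and bias below are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math


def rbf(x_r, x_b, h):
    """Eq. (6): Gaussian of the Euclidean distance between x_r and x_b, width h."""
    dist = math.sqrt(sum((a - c) ** 2 for a, c in zip(x_r, x_b)))
    return math.exp(-(dist * 0.8326 / h) ** 2)


def rbf_output(x_r, observations, h, LW, bias):
    """Eq. (7): linear output layer applied to the hidden-layer responses."""
    return sum(w * rbf(x_r, x_b, h) for w, x_b in zip(LW, observations)) + bias


observations = [[0.0, 0.0], [3.0, 4.0]]     # hypothetical stored inputs
half = rbf([0.0, 0.0], [3.0, 4.0], h=5.0)   # distance 5 equals h, so about 0.5
y_r = rbf_output([0.0, 0.0], observations, h=5.0, LW=[2.0, 4.0], bias=1.0)
```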

GRNN

The Generalized Regression Neural Network (GRNN) is a statistics-based network for solving regression problems and is another type of RBF network. GRNN was introduced in 1991 by Specht79. GRNN is a three-layer network whose number of neurons is much easier to choose than in MLP, because it is set equal to the number of observations. Figure 6 shows a GRNN network. Like RBF, this network uses the Gaussian function in the middle layer, but in the output layer an additional part beyond that of RBF is included in the calculations. The following relationship was used to calculate the output of this network:

$${Y}_{r}=\frac{1}{\sum_{b=1}^{n}f\left({X}_{r},b\right)}\times \sum_{b=1}^{n}\left[f\left({X}_{r},b\right)\times {T}_{b}\right]$$
(8)

where \({T}_{b}\) is the target corresponding to the bth observation and n is the number of observations77.
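A minimal sketch of the GRNN output follows, assuming the usual normalized form in which the Gaussian-weighted sum of the targets is divided by the sum of the Gaussian values. The observations, targets, and width h are hypothetical.

```python
import math


def gaussian(x_r, x_b, h):
    """Same Gaussian function as Eq. (6)."""
    dist = math.sqrt(sum((a - c) ** 2 for a, c in zip(x_r, x_b)))
    return math.exp(-(dist * 0.8326 / h) ** 2)


def grnn_predict(x_r, X_obs, T, h):
    """Weighted average of the targets T; one hidden neuron per observation."""
    weights = [gaussian(x_r, x_b, h) for x_b in X_obs]
    return sum(w * t for w, t in zip(weights, T)) / sum(weights)


X_obs = [[0.0], [10.0]]      # hypothetical observed inputs
T = [100.0, 200.0]           # hypothetical targets
midpoint = grnn_predict([5.0], X_obs, T, h=5.0)   # equidistant, so the mean of T
near_first = grnn_predict([0.0], X_obs, T, h=1.0) # dominated by the first target
```

A small h makes the prediction follow the nearest observation, while a large h smooths it toward the mean of all targets.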

Fig. 6 Schematic of GRNN.

ANFIS

The Adaptive Neuro-Fuzzy Inference System (ANFIS) was introduced by Jang in 199380. ANFIS is similar to a multilayer neural network, with the difference that in addition to ANN learning algorithms, it also uses Fuzzy Logic (FL). An ANFIS model consists of five layers, which are, in order: the information input layer, the fuzzy rule weight calculation layer, the rule weight normalization layer, the rule calculation layer, and the summation layer that gives the network output77. In this research, the trapezoidal membership function was used and the network was trained with the hybrid method. Figure 7 shows a schematic of ANFIS.

Fig. 7 Schematic of ANFIS.
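The layer sequence described above can be illustrated with a strongly simplified sketch: one input, two rules, trapezoidal membership functions, and first-order (Sugeno-type) consequents. The rule parameters are hypothetical, the consequent form is an assumption, and hybrid training of the parameters is not shown.

```python
def trapmf(x, a, b, c, d):
    """Trapezoidal membership (assumes a < b <= c < d): 0 outside (a, d), 1 on [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def anfis_like_output(x, rules):
    """rules: list of ((a, b, c, d), (p, r)) with consequent p*x + r for each rule."""
    w = [trapmf(x, *mf) for mf, _ in rules]             # membership / rule weight
    total = sum(w)                                      # x must fire at least one rule
    w_norm = [wi / total for wi in w]                   # normalized rule weights
    parts = [wn * (p * x + r) for wn, (_, (p, r)) in zip(w_norm, rules)]
    return sum(parts)                                   # summation gives the output


# Hypothetical rule base: "low" and "high" input ranges overlapping between 4 and 6
rules = [((0.0, 1.0, 4.0, 6.0), (1.0, 0.0)),
         ((4.0, 6.0, 9.0, 10.0), (0.0, 10.0))]
y = anfis_like_output(5.0, rules)   # both rules fire with weight 0.5
```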

DL

Deep learning (DL) is a type of ML that works based on the structure and function of the human brain and uses artificial neural networks to perform complex calculations on large amounts of data. These algorithms have self-learning representations81,82,83,84. Therefore, deep learning is a sub-branch of machine learning and is based on a set of algorithms that are trying to model high-level abstract concepts in data. The more (deeper) the hierarchy of layers, the more nonlinear features are obtained. For this reason, more layers are used in deep learning82,84,85,86,87,88.

The word deep in deep learning refers to the number of layers through which data is transformed into output. Deep learning models can extract better features than shallow models and hence additional layers help in learning features81,83. The difference between deep learning and neural networks is that deep learning has a wider scope than neural networks and includes Reinforcement Learning algorithms. Figure 8 shows the structure of the deep neural network and its schematic.

Fig. 8 Schematic of DL.

GEP and GP

Gene Expression Programming (GEP), which emerged from the development of intelligent models, is one of the evolutionary algorithm methods, all of which are based on Darwin’s theory of evolution. GEP, which is the developed form of Genetic Programming (GP), was presented by Ferreira in 199989. This algorithm can automatically select the input variables that have the most influence on the modeling90. One of the main advantages of GEP and GP algorithms is that they can be used under the following conditions: (1) the relationship between the variables of the problem is not well known, or the validity of the current knowledge of that relationship is doubtful; (2) finding the final solution to the problem under consideration is difficult; (3) a conventional mathematical (analytical) solution does not exist; (4) an approximate solution is acceptable; and (5) the amount of data that must be tested, categorized, and summarized by computer is large (such as satellite data)89,91,92,93. GEP, like GA and GP, is a genetic algorithm: it uses a population of individuals, selects them according to fitness, and applies genetic changes using one or more genetic operators (such as mutation and recombination). In general, the basic difference between these three algorithms lies in the nature of their individuals.

The modeling process for estimating the early-stage cost of constructing drip irrigation systems based on farm and pumping costs was carried out as follows. (1) The first step is to choose an appropriate fitness function. In this study, the Root Mean Square Error (RMSE) was chosen as the fitness function. (2) The second step is to select a set of terminals (input variables) and a set of functions to produce chromosomes. In the present problem, the terminals consist of the cost amounts in different years and for different irrigation systems. The function set comprised the four main operators {\(\div , \times , -, +\)} and the mathematical functions {\(\log \left(x\right),\sqrt{x}, \sqrt[3]{x}, {x}^{2}, {x}^{3},\tan (x)\)}. (3) The third step is to choose the structure and architecture of the chromosomes. (4) The fourth step is to select the link function; in this study, addition was used to link the sub-branches. (5) Finally, in the fifth step, the genetic operators and the rate of each of them were selected. Figure 9 shows an example of a gene expression program.

Fig. 9 Schematic of GEP.
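Steps (1), (2), and (4) above can be sketched as follows: sub-branches are written as small expression trees over the function set, linked by addition, and scored with RMSE. The tree encoding, the protected versions of division, logarithm, and square root, and the terminal names `tcf` and `tcp` are assumptions made for illustration. The evolutionary search itself (chromosome structure and genetic operators) is not shown.

```python
import math

# Function set from the text; the guards against invalid arguments are an assumption
FUNCS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else 1.0,
    "log": lambda a: math.log(a) if a > 0 else 0.0,
    "sqrt": lambda a: math.sqrt(abs(a)),
    "cbrt": lambda a: math.copysign(abs(a) ** (1.0 / 3.0), a),
    "sq": lambda a: a ** 2,
    "cube": lambda a: a ** 3,
    "tan": math.tan,
}


def evaluate(tree, x):
    """Evaluate a tree: a terminal name, a numeric constant, or (function, child, ...)."""
    if isinstance(tree, str):
        return x[tree]
    if isinstance(tree, (int, float)):
        return float(tree)
    func, *children = tree
    return FUNCS[func](*(evaluate(c, x) for c in children))


def chromosome_output(genes, x):
    return sum(evaluate(g, x) for g in genes)    # sub-branches linked by addition


def rmse_fitness(genes, samples, targets):
    sq = [(chromosome_output(genes, s) - t) ** 2 for s, t in zip(samples, targets)]
    return math.sqrt(sum(sq) / len(sq))


# Hypothetical chromosome with two genes and two hypothetical samples
genes = [("*", "tcf", 1.2), ("sqrt", "tcp")]
samples = [{"tcf": 100.0, "tcp": 16.0}, {"tcf": 50.0, "tcp": 9.0}]
fitness = rmse_fitness(genes, samples, targets=[125.0, 63.0])
```

A lower RMSE means a fitter chromosome, so selection would favor candidates whose linked sub-branches reproduce the observed costs more closely.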

DT

A Decision Tree (DT) is a powerful and common data mining tool for classification and prediction or estimation which, unlike ANN, produces rules. That is, DT explains its prediction in the form of a series of rules, whereas in ANN only the prediction is expressed and how it was obtained remains hidden in the network itself. In addition, in DT, unlike ANN, non-numerical data can be used94. The DT approach is used in many fields, including pattern identification, classification, decision support systems, and expert systems. Another advantage is that it can handle numerical, non-numerical, analytical, qualitative, and ranking data.

A DT usually consists of several nodes, known as input and output nodes. Rules created in a DT are expressed in “if” and “then” form. Different algorithms can predict or estimate the target (dependent) variable from the independent variables. One of the most important and widely used is the CART algorithm95,96. It is also reported that among the algorithms used to construct DTs, the most important is the C5 algorithm (an algorithm for implementing decision trees), which is the developed form of ID3 (Iterative Dichotomiser 3). Other DT algorithms include C&RT, QUEST (Quick, Unbiased, Efficient Statistical Tree), and CHAID. The construction of trees is usually based on three principles (Fig. 10):

  • A set of questions of the form “x ≤ d?”, where x is an independent variable and d is a fixed value; the answer to each question is yes or no.

  • Determining the best criteria for branching to choose the best independent variable to create a branch.

  • Generate summary statistics for the terminal node97.

Fig. 10 Schematic of a simple DT.
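The three principles can be illustrated for a single numerical variable. The sketch below tries every question “x ≤ d?”, uses the total squared error of the two branches as the branching criterion (a CART-style choice for regression, assumed here), and reports the mean of each branch as the terminal-node summary. The data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (d, score, mean_yes, mean_no) for the best question 'x <= d?'."""
    best = None
    for d in sorted(set(xs))[:-1]:                  # the largest x would leave 'no' empty
        yes = [y for x, y in zip(xs, ys) if x <= d]
        no = [y for x, y in zip(xs, ys) if x > d]
        score = sse(yes) + sse(no)                  # branching criterion
        if best is None or score < best[1]:
            best = (d, score, sum(yes) / len(yes), sum(no) / len(no))
    return best


# Hypothetical data: x is an independent variable, y the cost to be estimated
xs = [1, 2, 3, 10, 11, 12]
ys = [100, 110, 105, 400, 410, 390]
d, score, mean_yes, mean_no = best_split(xs, ys)
rule = f"if x <= {d} then {mean_yes:.0f} else {mean_no:.0f}"
```

Applying the same search recursively to each branch, and over all independent variables, would grow a full tree.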

A DT is a combination of several logical implications (if–then rules). Decision trees are not only a representation of the decision-making process but can also be used to solve classification problems. Usually, the set of rules extracted from a DT is the most important information obtained from it98,99. The purpose of this research is to investigate the effectiveness of the DT model in predicting the early cost of implementing drip irrigation systems based on different variables and system components (independent variables) that affect the final cost (dependent variable)100.

The schematic and classification of the types of methods used to model the cost of drip irrigation systems are shown in Fig. 11.

Fig. 11 Classification and relationship between the types of methods used in modeling.

Hyperparameter setting

Due to the wide range of models used, their training was done in several stages. We adopted a range of machine learning algorithms to capture different perspectives on feature importance and regression performance. For this reason, we have provided a table detailing the hyperparameter settings for each algorithm to increase clarity (Table 2).

Table 2 Hyperparameter setting for each of the used algorithms and models in this research.

Evaluation criteria

To evaluate the models/algorithms and compare the results of different approaches and methods with observational data, three evaluation criteria were used: Coefficient of Determination (R2), Root Mean Square Error (RMSE), and Volume Error (VE). These criteria are defined as follows:

$${R}^{2}=\frac{{[\sum_{i=1}^{n}({o}_{i}-\overline{o} )({p}_{i}-\overline{p} )]}^{2}}{\sum_{i=1}^{n}{({o}_{i}-\overline{o})}^{2}\times \sum_{i=1}^{n}{({p}_{i}-\overline{p})}^{2}}$$
(9)
$$RMSE= \sqrt{\frac{\sum_{i=1}^{n}{({p}_{i}-{o}_{i})}^{2}}{n}}$$
(10)
$$VE=\frac{\sum_{i=1}^{n}\left|\frac{{o}_{i}-{p}_{i}}{{o}_{i}}\right|}{n}\times 100$$
(11)

In these relationships, \({o}_{i}\) is the observed value, \({p}_{i}\) is the predicted value, \(\overline{o}\) is the average of the observed values, \(\overline{p}\) is the average of the predicted values, and n is the number of data101,102. A model with a higher R2 and lower RMSE and VE is more desirable. MATLAB and Python were used to code the models and algorithms, and Minitab and SAS were used for statistical analysis. To examine the linear and nonlinear correlation between the independent and dependent variables, correlation and statistical significance tests were used. Since the relationship between the variables is expected to be nonlinear, the use of artificial intelligence algorithms and machine learning models is appropriate. Note that a large P-value (usually greater than 0.05) indicates that the observed results could have occurred by chance, so the null hypothesis is not rejected, whereas a small P-value (usually less than 0.05) indicates that the observed results are unlikely to have occurred by chance, so the null hypothesis is rejected. Figure 12 describes the general process of this research, from data collection to training and testing the models.
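The three criteria in Eqs. (9) to (11) translate directly into code. The following sketch is written in plain Python; the observed and predicted values are hypothetical and serve only to show the calculation.

```python
import math


def r2(o, p):
    """Eq. (9): squared correlation between observed and predicted values."""
    mo, mp = sum(o) / len(o), sum(p) / len(p)
    num = sum((a - mo) * (b - mp) for a, b in zip(o, p)) ** 2
    den = sum((a - mo) ** 2 for a in o) * sum((b - mp) ** 2 for b in p)
    return num / den


def rmse(o, p):
    """Eq. (10): root mean square error."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(o, p)) / len(o))


def ve(o, p):
    """Eq. (11): mean absolute relative error in percent (observed values must be non-zero)."""
    return sum(abs((a - b) / a) for a, b in zip(o, p)) / len(o) * 100


observed = [100.0, 200.0, 400.0]      # hypothetical values
predicted = [110.0, 190.0, 400.0]
scores = (r2(observed, predicted), rmse(observed, predicted), ve(observed, predicted))
```

For these values the relative errors are 10%, 5%, and 0%, so VE is 5%, and RMSE is the square root of 200/3.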

Fig. 12 Step-by-step flowchart of the current research (from data collection to the application of feature selection methods and training, testing and recognition of superior models for cost modeling).

Results and discussion

Examining the correlation between variables

Table 3 shows the correlation between the 39 independent variables and the cost of the pumping station and central control system (TCP), the cost of on-farm equipment (TCF), the cost of installation and operation of the on-farm and pumping station (TCI), and the total cost (TCT). The results can be summarized as follows. For TCP, 12 variables were significant at the one percent probability level, four were significant at the five percent level, and 23 were not significant. For TCF, 11 variables were significant at the one percent level and 28 were not significant. For TCI, seven variables were significant at the one percent level, four at the five percent level, and 28 were not significant. Finally, for TCT, nine variables were significant at the one percent level, six at the five percent level, and 24 were not significant (Table 3).

Table 3 Correlation and significance results (P-Value) between independent and dependent variables.
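The classification of a variable as significant at the one or five percent level can be sketched as below. The sketch computes the Pearson correlation and its t statistic and compares the statistic with the normal critical values 1.960 and 2.576, which is a large-sample approximation assumed here (reasonable for several hundred projects); it is not the procedure of the statistical packages used in the study. The example correlations are hypothetical.

```python
import math


def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)); requires |r| < 1."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


def significance(r, n):
    """Two-sided test with large-sample (normal) critical values."""
    t = abs(t_statistic(r, n))
    if t > 2.576:
        return "1%"
    if t > 1.960:
        return "5%"
    return "ns"


n_projects = 515
levels = [significance(r, n_projects) for r in (0.12, 0.10, 0.05)]
```

With 515 projects even weak correlations can be statistically significant, which is why significance is reported alongside the correlation itself.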

Analysis of cost to area ratio of projects

Analysis and pre-processing of the data from 515 drip irrigation system projects showed that the total cost (TCT) per hectare of drip irrigation system is 510 million Rials on average. Of this, 12.4% is related to the cost of the pumping station and central control system (TCP), 62.1% to the cost of on-farm equipment (TCF), and 25.5% to the cost of installation and operation of the on-farm and pumping station (TCI). Meanwhile, the ratio of equipment purchase to total cost, based on the updated data of the 515 drip irrigation projects, was 71% (Fig. 13). Although the government currently allocates a fixed budget of 500 to 550 million Rials per hectare, actual costs vary much more than this and differ with location, crop, topography, climatic conditions, and other conditions. Based on the information received from reputable consulting engineering companies whose designs had undergone quality control, as well as the review of the data bank prepared for this research, the minimum and maximum total costs of a drip irrigation system are 127.3 and 1707.1 million Rials, respectively. This means that the cost per hectare fluctuates widely and therefore requires a special investigation.

Fig. 13 The share of the cost of different sections in the total cost of a drip irrigation system.
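The reported shares can be converted into average amounts per hectare with simple arithmetic. The short sketch below uses only the figures quoted in the text; the variable names are illustrative.

```python
mean_cost = 510.0                                  # million Rials per hectare (average TCT)
shares = {"TCP": 12.4, "TCF": 62.1, "TCI": 25.5}   # percent of total cost

# Average amount per hectare attributable to each section
per_section = {name: mean_cost * pct / 100 for name, pct in shares.items()}

# Ratio of the maximum to the minimum reported total cost
spread_ratio = 1707.1 / 127.3
```

This gives about 63.2, 316.7, and 130.1 million Rials for TCP, TCF, and TCI, respectively, and the maximum reported cost is roughly 13.4 times the minimum.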

Examination of the final cost of the projects also showed that its standard deviation was 200 million Rials. This means that the cost per hectare of a drip irrigation system is not fixed at about 510 million Rials but varies from land to land. The standard deviation of the cost per hectare also indicates that the relevant calculations should be made for each specific geometric and geographical condition before an estimated cost is announced. Therefore, modeling the cost of drip irrigation systems is necessary: it prevents excessive costs and provides an accurate price estimate for each project. In addition, to obtain a simple view of the relationship between area and final cost, the area of the 515 drip irrigation system projects and the updated total cost (TCT) were plotted in Fig. 14. The fitted linear (green) and polynomial (red) lines have determination coefficients of 0.9 and 0.91, respectively.

Fig. 14 Comparative diagram of the area and total cost of the drip irrigation system.

Selection of features

Evaluation of feature selection algorithms in the train phase

After reviewing different methods, models, and algorithms for feature selection, this section briefly presents the results of the best method chosen, FeatureSelect. The results are divided into two separate parts. (1) The first part covers all the features that affect the costs of a drip irrigation system, which, as mentioned earlier, number 39. (2) The second part covers the features that are available before an irrigation system is designed and implemented. There are 18 such features, and feature selection was carried out among them.

Model training was done with 80% of the data after initial preprocessing, and the evaluation criteria were first obtained for the total features section. The learning model was SVM and the feature selection method was optimization algorithms (wrapper). These were chosen on the recommendation of Masoudi-Sobhanzadeh et al.45, who emphasized their importance in the results of their work. The training results for the total features section, which included 39 features, showed that the RMSE, the SVM elapsed time in seconds, and R2 were 0.007, 1.30, and 0.92, respectively. Previous studies report that such values are reasonable and indicate accurate input data and correct training of the algorithms45,103. Model training and the initial evaluation were also carried out for the features available before the design stage (18 features). In this section, the RMSE, the SVM elapsed time in seconds, and R2 were 0.003, 2.01, and 0.89, respectively. The slight change in the evaluation criteria compared with the previous section arises because 18 features were separated from the full set of features affecting the costs of the drip irrigation system (TCP, TCF, TCI, and TCT) and selection was then performed among them. The training accuracy of the models therefore decreased slightly, but the criteria remained within the acceptable range. Other researchers have reported similar values and considered them a sign of correct training104.

Results of selected algorithms

The results of the selected algorithms for the regression problems are shown in Fig. 15 for all features affecting the costs of the drip irrigation system and in Fig. 16 for the features before the design phase (BD). Among 11 algorithms, four (WCC, LCA, LA, and FOA) were selected to identify the most important features, for two reasons. (1) First, Masoudi-Sobhanzadeh et al.45, the developers of the FeatureSelect software, stated after many reviews and tests that these algorithms achieve the best results. (2) Second, running these algorithms took hours and sometimes several days, so spending more time on other algorithms was not justified. The graphs produced using the SVM learning model and the optimization-based feature selection method (Figs. 15 and 16) compare the performance of the algorithms based on error, RMSE, and correlation scores. Convergence, average convergence, and stability graphs are shown for each score. The purpose of presenting these figures is to check whether the algorithms were implemented correctly. The convergence criterion is that the answers should improve as the number of iterations or the execution time allocated to the algorithms increases. In addition to convergence, there is the concept of average convergence. The difference between the two is that convergence is obtained from the best answer at the end of each iteration, while average convergence is calculated from the average score of the potential solutions at the end of each iteration. Stability describes how consistent the results of the algorithms remain over repeated executions.
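The distinction between convergence, average convergence, and stability can be made concrete with a short sketch. Here `history[t]` is assumed to hold the error scores of all candidate solutions at the end of iteration t, and stability is summarized by the mean and standard deviation of the final errors over repeated runs; the data structure and the numbers are hypothetical and do not describe the internals of FeatureSelect.

```python
import statistics


def convergence(history):
    """Best (lowest) error among the candidate solutions at the end of each iteration."""
    return [min(scores) for scores in history]


def average_convergence(history):
    """Mean error over all candidate solutions at the end of each iteration."""
    return [sum(scores) / len(scores) for scores in history]


def stability(final_errors):
    """Mean and standard deviation of the final error over repeated executions."""
    return statistics.mean(final_errors), statistics.stdev(final_errors)


# Hypothetical run: 3 iterations with 3 candidate solutions each
history = [[0.9, 0.5, 0.7], [0.4, 0.6, 0.5], [0.2, 0.3, 0.4]]
best_curve = convergence(history)
mean_curve = average_convergence(history)
run_mean, run_std = stability([0.0020, 0.0018, 0.0022])
```

A well-behaved algorithm shows both curves decreasing toward a fixed limit, and a small standard deviation across runs.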

Fig. 15 Performance results of selected algorithms for regression problems in the section of all features.

Fig. 16 Performance results of selected algorithms for regression problems in the section of features before the design phase.

As can be seen, the potential solutions generated by all algorithms except LA (that is, WCC, LCA, and FOA) improve as the number of iterations increases. According to Fig. 15, for the convergence of error, as the number of iterations increases (up to 30), the error of all algorithms decreases and reaches a fixed limit. However, the LA algorithm followed a different trend and stopped improving after a certain number of iterations. In the convergence section, the best models registered an error below 0.002 and a correlation greater than 0.93. The same trend was observed for the convergence of correlation. In the average convergence of error and correlation, the LA algorithm did not reach a reasonable result even after many iterations, and the WCC algorithm reached a fixed limit but with a higher RMSE than the LCA and FOA algorithms. In the average convergence section, the best model (FOA) showed an RMSE of less than 0.003 and a high correlation of 0.92. In the stability graphs, an algorithm that is better than the others appears ahead of them and has better average results. The stability section shows that the FOA and LCA algorithms have the smallest changes, and with more executions the variation in their error and correlation decreased and finally settled. The error and correlation of the best algorithms are about 0.002–0.025 and 0.93–0.94, respectively. Although the WCC algorithm has better results than LA, it is weaker than the two selected algorithms. The findings of other researchers are also in line with the results of this research104,105.

Figure 16 shows the results of the selected algorithms for the features before the design phase (BD) that affect the cost of drip irrigation systems. As in Fig. 15, the convergence, average convergence, and stability criteria were analyzed and interpreted. In the convergence of error and correlation, all algorithms except LA converged and reached a fixed limit. The error of the best algorithms (WCC, LCA, and FOA) is around 0.0006 and the correlation of the best algorithm is above 0.95, which shows good training of the algorithm and its prediction accuracy over many iterations. For the features before the design phase, the FOA algorithm is better than the others by a margin, and LCA and WCC are not competitive with it. The results of the average convergence of error and correlation also indicate that the best-trained algorithm is FOA, with the lowest error (0.0007) and the highest correlation (0.96) among the algorithms. The stability criterion after 30 executions showed that the LCA and FOA algorithms were better than the LA and WCC algorithms by a margin, with minimal error and a correlation higher than 0.95 (Fig. 16). These results show that the algorithms were properly trained, as reflected in the convergence, average convergence, and stability criteria for both error and correlation (Figs. 15 and 16). The findings of this section agree with the results of other researchers106,107.

Evaluation of algorithms in the test phase

The results shown in Tables 4 and 5 are calculated from the stability results of Figs. 15 and 16, respectively. After examining the training of the models and the efficiency of the different algorithms in the previous two sections, the features selected by these algorithms were tested (validated) and the results of the evaluation criteria are presented here. Table 4 presents the evaluation criteria for selecting the most important features among all the features affecting the total cost of drip irrigation systems (39 features) at the verification stage. Because this table is derived from the stability results of the previous figures, it contains main criteria and several sub-criteria. The main criteria are Error Rate (ER), Correlation (CR), Number of selected Features (NOF), and Elapsed Time (ET). The sub-criteria of Standard Deviation (STD), Confidence Interval (CI), Probability Value (P-Value), and Test Statistic (TS) are reported for both the error and correlation criteria.

Table 4 The results of evaluation criteria for selecting the most important features among all the features affecting the total cost of drip irrigation systems.
Table 5 The results of evaluation criteria for selecting the most important features among the features before the design phase affecting the total cost of drip irrigation systems.

Table 4 shows the NOF identified by each algorithm, a critical parameter for this research. The numbers of features selected by the WCC, LCA, LA, and FOA algorithms are 17, 11, 15, and 8, respectively. After executing each algorithm 30 times, the best result was selected for each. ET is the time in seconds spent in execution to obtain the best result for an algorithm. The algorithms have different ETs because their execution steps differ. The ER confirms the algorithms identified in the previous step: it was 0.0020 and 0.0018 for the superior LCA and FOA algorithms, respectively, lower than for the others. The ER-STD sub-criterion indicates how far the results deviate from the mean result, so a minimal value is desirable; it was 0.004 for the WCC and LA algorithms and 0.002 for the LCA and FOA algorithms. The ER-CI sub-criterion represents a range of values within which the results are expected to fall with a specified probability. To increase accuracy, this criterion was computed twice. Almost the same result was obtained both times, and again the two top algorithms, LCA and FOA, had the lowest values.
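The STD and CI sub-criteria can be illustrated for a set of repeated runs. The sketch below computes the sample standard deviation and a normal-approximation 95% confidence interval for the mean; how FeatureSelect computes its interval is not stated in the text, so this formula is an assumption, and the 30 error values are hypothetical.

```python
import math
import statistics


def summarize_runs(errors, z=1.96):
    """Mean, sample standard deviation, and approximate 95% CI of the mean."""
    mean = statistics.mean(errors)
    std = statistics.stdev(errors)
    half_width = z * std / math.sqrt(len(errors))
    return mean, std, (mean - half_width, mean + half_width)


# Hypothetical error rates from 30 executions of one algorithm
errors = [0.0018, 0.0020, 0.0022] * 10
mean_er, std_er, (ci_low, ci_high) = summarize_runs(errors)
```

A smaller standard deviation gives a narrower interval, which is the sense in which the more stable algorithms are preferred.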

The ER-P sub-criterion is one of the most important parameters in evaluating the results of models and algorithms. The P-Value expresses the similarity of the obtained results to random values, so an algorithm with a lower P-Value is more reliable. The results show that all four algorithms were reliable, because their P-Values tend to zero. The ER-TS sub-criterion is usually used to reject or accept a null hypothesis; when TS is maximum, the P-Value is minimum. The CR criterion is excellent for all algorithms: it was 0.9345, 0.9365, 0.9165, and 0.9378 for the WCC, LCA, LA, and FOA algorithms, respectively. This shows the ability and accuracy of these algorithms, especially LCA and FOA. The sub-criteria examined for the error (ER) were also evaluated for the correlation (CR), and the trend of the results was the same. For brevity, further explanation is omitted (Table 4). The results of this section are consistent with the findings of Schubert et al.108, Panday et al.109, and Masoudi-Sobhanzadeh et al.45.

Table 5 shows the results of the evaluation criteria for selecting the most important features from among the features before the design phase (18 features) that affect the total cost of drip irrigation systems. Except for the NOF criterion, which was 6 out of 18 features for all algorithms (WCC, LCA, LA, and FOA), the LCA and FOA algorithms had higher accuracy and correlation and lower error in the remaining criteria and sub-criteria. For the LCA and FOA algorithms, respectively, the ET criterion was 344.5740 and 153.7386 s, the ER criterion was 0.0006 for both, and the ER-STD sub-criterion was 0.0003 for both. The ER-CI sub-criterion was 0.0008 and 0.0009 in the first iteration and 0.0010 and 0.0011 in the second, the ER-P sub-criterion was zero for both algorithms, and the ER-TS sub-criterion was 7706/18 and 2511/19. Finally, the CR criterion was 0.9541 for both algorithms. The pattern observed for the error sub-criteria (ER) also applies to the correlation (CR) (Table 5). The results of this section are consistent with the findings of other researchers45,106.

Selecting the best features

After the results of the algorithms were verified (tested) and the evaluation criteria were checked, the desired features were extracted (Table 6). The main goal of this research was to choose the set of the most useful, effective, and accurate features for modeling the early-stage cost of pressurized irrigation systems, especially drip systems. In the selection from the total features (39 variables), the WCC, LCA, LA, and FOA algorithms selected 17, 11, 15, and 8 features, respectively. In the selection from among the features before the design phase (18 variables), all four algorithms agreed on six features. On this basis, the best features in each section were determined in the last step by combining the features common to the algorithms in that section. From an expert's view, the selected features in both sections are vital in the design and implementation of irrigation systems and play an essential role in costs. Even a non-professional or an expert with little experience in consulting engineering companies can confirm this claim.
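Combining the common features can be sketched as a set intersection across the algorithms. This is a minimal sketch that assumes a strict intersection, which is only one possible reading of the combination rule; the feature names are placeholders rather than the features of this study, and the function name common_features is hypothetical.

```python
def common_features(selections):
    """Return the features chosen by every algorithm, sorted by name."""
    feature_sets = [set(features) for features in selections.values()]
    return sorted(set.intersection(*feature_sets))


# Placeholder selections for the four algorithms (not the paper's features).
selections = {
    "WCC": ["area", "crop_type", "pipe_length", "pump_power", "soil_type"],
    "LCA": ["area", "pipe_length", "pump_power"],
    "LA": ["area", "crop_type", "pipe_length", "pump_power"],
    "FOA": ["area", "pump_power", "pipe_length"],
}

shared = common_features(selections)
print(shared)
```

A looser rule, such as keeping features selected by at least three of the four algorithms, would retain more features at the cost of weaker agreement.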

Table 6 Summary of the results of different feature selection methods to identify features that affect the cost of drip irrigation systems.

An important aspect of this research was the application of new feature selection methods, which were coded in MATLAB and Python. To this end, different feature selection algorithms, models, and methods were trained and tested, and the best features were extracted and presented in Table 6, which also summarizes the results of the different feature selection methods for comparison. Changing the feature selection methods and using more and better algorithms and models produced progressively better results, as the RMSE and R2 evaluation criteria show. Finally, the feature selection result of the FeatureSelect method in two separate sections (selection from all features and from the features before the design phase) was selected as the best for cost modeling of drip irrigation systems in Iran.

Cost modeling with artificial intelligence algorithms

After the features with the greatest effect on the cost of the different parts of drip irrigation systems were selected (identified in the last two rows of Table 7 as two groups, all features and features before the design phase), cost modeling was carried out. In general, six main models covering statistical methods, neural networks (NN), artificial intelligence (AI), and machine learning (ML) were developed for this work: MLR, DT, DL, GP, ANN, and SVM. The ANN model has four subtypes, MLP, RBF, GRNN, and ANFIS, which are discussed further in the economic modeling.

Table 7 The results of the relationship between selected features and the cost of different parts of drip irrigation systems.

The results of this research began with an examination of the correlation between the input variables and the cost of the different parts. Here, after the different feature selection methods were applied and the features with the greatest effect on the cost of drip irrigation systems were identified, the relationship between the independent variables (features) and the dependent variables (costs) is presented again in Table 7. The selected features can be summarized by the type of feature, the R2 evaluation criterion, and the cost part.

Cost modeling based on all features

First, cost modeling was done based on all features (39 variables affecting the costs of drip irrigation systems). At this stage, 70% of the data was used for training and the remaining 30% for testing the artificial intelligence and machine learning models. The statistical evaluation of the extracted models showed that artificial neural networks (ANN) and support vector machines (SVM) were among the best models and provided the best statistical indicators. The evaluation criteria in Table 8 show that the algorithms used in the cost modeling of the different parts have acceptable accuracy, but the highest correlation and accuracy and the lowest error belong to the ANN and SVM models in both the training and testing stages. For the cost of the pumping station and central control system (TCP) part, the best model, ANN, obtained R2 = 0.877, RMSE = 0.009, and VE = 0.093 in the training phase and R2 = 0.847, RMSE = 0.010, and VE = 0.113 in the testing phase. The R2 criterion, which indicates the validity of the models, together with the small, near-zero RMSE and VE errors, indicates that the ANN model was trained better than the others and estimated the cost of the TCP part more accurately in the test phase. The SVM model also achieved excellent results, close to those of the ANN model, and should not be neglected. In estimating the cost of the TCP part, the least accurate model was DT, followed by MLR, in both the training and testing stages.
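The 70/30 split and two of the evaluation criteria can be sketched in plain Python as follows. This is an illustrative sketch only: a one-variable least-squares line stands in for the actual models, the data are synthetic, R2 is assumed to be the coefficient of determination, and VE is left out because its formula is not given in this section. All function names are hypothetical.

```python
import math
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into training and testing parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination (assumed definition of R2)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least-squares line, used here as a stand-in model."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic (feature, normalized cost) pairs; not data from the study.
rng = random.Random(1)
rows = [(i / 100, 0.5 * (i / 100) + 0.1 + rng.gauss(0, 0.02)) for i in range(100)]

train, test = train_test_split(rows)
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

for name, part in (("train", train), ("test", test)):
    observed = [y for _, y in part]
    predicted = [slope * x + intercept for x, _ in part]
    print(name, round(r_squared(observed, predicted), 3), round(rmse(observed, predicted), 3))
```

Reporting the same criteria for both parts, as in Table 8, makes it easy to see whether a model that fits the training data well also generalizes to unseen projects.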

Table 8 Results of cost modeling evaluation criteria for drip irrigation systems using all features in different parts.

Except for the TCP cost part, where the ANN model had the lowest error, the SVM model achieved the best result in the cost parts of on-farm equipment (TCF), in-farm installation and operation of on-farm and pumping station (TCI), and total cost (TCT). This is due to the type and structure of the model, which uses a kernel function to map the data to a higher-dimensional space and separate them with a hyperplane. The results of the SVM and ANN models are very close to each other, and their evaluation criteria are difficult to tell apart. In contrast, the MLR and DT models always lagged far behind the other models, and the DL and GP models were moderate in most cases. For the superior model (SVM) in estimating the cost of the TCF part, the evaluation criteria in the training and testing stages were R2 = 0.923 and 0.893, RMSE = 0.008 and 0.009, and VE = 0.082 and 0.102, respectively. The same pattern was repeated in the other two cost parts, TCI and TCT. In general, in modeling the cost of drip irrigation systems using all the features (39 variables), the SVM model had the best results. The best criteria were obtained in the TCF, TCT, TCI, and TCP parts, in that order (Table 8).
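To make the role of the kernel function concrete, the sketch below implements two commonly used SVM kernels, the radial basis function (RBF) and the sigmoid kernel, in their standard textbook forms. The parameter names gamma and coef0 follow common library conventions and are assumptions here; the kernel settings used in this study are not reproduced.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * <x, y> + coef0)."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(gamma * dot + coef0)


# Two hypothetical normalized feature vectors.
a = [0.2, 0.7]
b = [0.3, 0.5]

print(rbf_kernel(a, a))      # identical points give the maximum similarity
print(rbf_kernel(a, b))      # similarity decays with squared distance
print(sigmoid_kernel(a, b))
```

The kernel value acts as a similarity score in the implicit higher-dimensional space, so the model never has to compute the mapped coordinates explicitly.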

To summarize the findings of this section, the evaluation criteria were generally better in the training phase than in the testing phase. Although the criteria of each model did not improve in the testing stage, most of the models showed good statistical performance in both stages. Because the values of the models fall within the permissible range of error and accuracy, they are qualified to be used to estimate the costs of drip irrigation systems. To distinguish the performance of the models, the ANN and, especially, the SVM models achieved the best results in Part I (cost modeling based on all features), in both the training and testing stages. Therefore, this study recommends that future researchers apply and develop the SVM model for estimating and predicting the cost of drip irrigation systems. ANN was the second-best model according to its performance in the training and testing phases. Apart from the MLR and DT models, whose results were the weakest, the DL and GP models (GA and GEP) could give promising results if combined with other models into a hybrid model, and such models can be developed and used. The results of this section are consistent with the findings of Aghelpour et al.110 and Elbeltagi et al.111.

To complement the conclusions drawn from the statistical analysis of Table 8, Fig. 17 is presented. This figure plots the observed (actual) and estimated (predicted) values of the different cost parts of drip irrigation systems in Iran, based on the results of the best models in the training and testing stages, with particular attention to the performance of the top models in the testing stage. These results were obtained using all features (39 variables) and modeling them with artificial intelligence and machine learning algorithms. The scatter diagrams in Fig. 17 show that, in all cost parts, the fitted regression line between the observed and estimated values deviates very little from the X = Y line (the angle bisector of the first and third quadrants). The points are also tightly clustered around their regression line, most of all in the graph for the TCF part (R2 = 0.923). The scatter plots further show that the best performance belongs to the predictions of the SVM model, with the highest R2 coefficient, in the cost of on-farm equipment (TCF) part. Accordingly, the ANN model with MLP architecture in the TCP part, the RBF-type SVM model in the TCF part, the RBF-type SVM model in the TCI part, and the Sigmoid-type SVM model in the TCT part had the best results and were chosen as the best models.
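The comparison of the fitted regression line with the X = Y line can be sketched as follows. The observed and estimated values are hypothetical, not data from Fig. 17, and the function name fit_line is an illustrative choice; a slope close to 1 and an intercept close to 0 indicate agreement with the X = Y line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical normalized costs: observed values and model estimates.
observed = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
estimated = [0.12, 0.19, 0.31, 0.38, 0.52, 0.59]

slope, intercept = fit_line(observed, estimated)

# Deviation of the fitted line from the X = Y line (slope 1, intercept 0).
print("slope deviation:", abs(slope - 1.0))
print("intercept deviation:", abs(intercept))
```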

Fig. 17 Comparison of observed and estimated cost of drip irrigation systems based on the best models in the all features section.

Cost modeling based on features before the design phase

Since the final goal of this research is to build comprehensive software that estimates the cost of the different parts of a drip irrigation system from a minimum of features, the results of this part are the more important. Naturally, using all features achieves better results and the best evaluation criteria. The key point, however, is whether good cost modeling is possible with only the most effective features that are readily available and accessible to everyone. The results of such modeling are well suited to software development because the software requires the least input and still provides appropriate results.

Accordingly, the evaluation criteria of cost modeling of drip irrigation systems using the features before the design phase (18 variables) are presented for the different cost parts in Table 9. With these features, the ANN and SVM models showed the best results (maximum correlation and accuracy, and minimum error) in the TCP and TCF parts. In the training phase, the ANN model in the TCP part obtained R2 = 0.867, RMSE = 0.010, and VE = 0.103, and the SVM model in the TCF part obtained R2 = 0.912, RMSE = 0.008, and VE = 0.083. In the testing phase (2017–2022), the corresponding results for the TCP (ANN) and TCF (SVM) parts were R2 = 0.837 and 0.882, RMSE = 0.011 and 0.009, and VE = 0.123 and 0.103, respectively. The DL and GP models provided intermediate results, and the MLR and DT models had the weakest evaluation criteria in all cost parts. Finally, in the TCI and TCT parts, the ANN model was the best, followed by the SVM. Therefore, the ANN model, or its combination with SVM, can be used to develop the results and build software.

Table 9 Results of cost modeling evaluation criteria for drip irrigation systems using features before the design phase in different parts.

Table 9 also allows the range of variation of the evaluation criteria to be extracted. In the training phase, the difference in the R2 criterion between the worst and the best model ranged from a minimum of 17% to a maximum of 30%, and in the testing phase it fluctuated between 26.3 and 38.7%. The same procedure was repeated for the other evaluation criteria. Based on the above, the model selected should have the minimum error and the maximum accuracy and correlation; otherwise, the results cannot be developed and generalized, nor can a model be built from them, because the superior model must reach all of its optimal parameters through an evolutionary process and its evaluation criteria should fluctuate within a certain range. These findings are consistent with the results of other researchers20,26. Also, Sharma et al.112 conducted a systematic literature review to explore the application of Machine Learning (ML) and Deep Learning (DL) in monitoring and diagnosing rice crop health and disorders. Their study, spanning 91 articles (2013–2023), highlights the strength of these advanced techniques in addressing critical challenges like disease detection and nutrient deficiency diagnosis. ML/DL models enable accurate classification, efficient segmentation, and effective feature selection, providing a robust framework for enhancing rice crop productivity.

Figure 18 shows the scatter diagrams of the observed and estimated cost of drip irrigation systems based on the best models in the features before the design phase section. It complements the statistical analysis and is based on the results of the best models in the training and testing phases, with a focus on their performance in the testing phase. The results show that when the features before the design phase are used (selection of 7 out of 18 variables), the accuracy of the top models is still high and their error minimal compared with the selection of 10 out of 39 variables. When fewer features and less data are used for modeling, some error is usually accepted in exchange for simpler models and input features that are easy to obtain. However, Fig. 18 and Table 9 show that the superior models not only have no weaknesses but also have evaluation criteria comparable to those obtained with all features. Therefore, cost modeling of drip irrigation systems was done with the best artificial intelligence and machine learning models using the fewest features. The scatter diagrams in Fig. 18 show that, in all cost parts, the estimated values lie around the angle bisector of the first and third quadrants (the X = Y line) and the fitted line deviates minimally from it. The best model in this section is ANN, which was the top model in three of the four cost parts, while the highest accuracy was obtained in the TCF part (R2 = 0.912). To summarize the scatter diagrams, the GRNN-type ANN model in the TCP part, the RBF-type SVM model in the TCF part, the GRNN-type ANN model in the TCI part, and the MLP-type ANN model in the TCT part obtained the best results and were selected as the top models. Finally, the ANN model can be recommended for modeling and software development because it provides the most accurate results and recurs across the different cost parts.

Fig. 18 Comparison of observed and estimated cost of drip irrigation systems based on the best models in the features before the design phase section.

Choosing the optimal combination of parameters of selected models

The optimal parameters of the top models in the four cost parts (TCP, TCF, TCI, and TCT) are summarized in Table 10. When all the features (39 variables) were used, the SVM model was the best and the most frequent, and the ANN model achieved the best result in the TCP cost part. With the features before the design phase (18 variables), the ANN and SVM models were superior, and their optimal parameters can also be seen in Table 10. Other researchers have obtained similar optimal parameters when using machine learning algorithms and artificial intelligence models45,62,104,105,106.
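One simple way to choose an optimal combination of parameters is an exhaustive grid search, sketched below. This is an assumption made for illustration, since the search procedure behind Table 10 is not described in this section. The parameter names C and gamma are hypothetical SVM hyperparameters, and toy_validation_error is a stand-in for training a model and returning its validation error.

```python
import itertools
import math


def grid_search(evaluate, grid):
    """Return (best_params, best_score) that minimize evaluate over the grid."""
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_error(params):
    """Stand-in objective with its minimum at C = 10 and gamma = 0.01."""
    return (math.log10(params["C"]) - 1.0) ** 2 + (math.log10(params["gamma"]) + 2.0) ** 2


# Hypothetical search grid for two SVM hyperparameters.
grid = {"C": [0.1, 1.0, 10.0, 100.0], "gamma": [0.001, 0.01, 0.1]}

best_params, best_score = grid_search(toy_validation_error, grid)
print(best_params, best_score)
```

In practice the stand-in objective would be replaced by a function that trains the chosen model on the training data and returns an error criterion such as RMSE on held-out data.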

Table 10 Optimal parameters of selected models used to estimate the costs of drip irrigation systems.

Conclusion

Construction cost estimation is generally based on qualitative criteria derived from expert experience, which are of limited practical use because they cannot be applied directly in mathematical models. Traditionally, a modeler builds mathematical models such as artificial neural networks by trial and error over different input combinations, which is very time-consuming because models must be trained and tested with all possible input combinations. Machine learning and artificial intelligence algorithms make it possible to identify the best data with the highest evaluation criteria and to achieve accurate, low-error modeling with the least time and cost. Given accurate data and knowledge of the relationships between the features, the modeling technique also makes it possible to estimate any output from the input variables and obtain an acceptable result. Therefore, knowledge-based and numerical methods, whose required parameters are difficult to obtain or costly and time-consuming to measure, have received less attention from researchers. Instead, data-based computational intelligence models are used, which have high accuracy and reliability and require fewer and more accessible input parameters. In this regard, this research used machine learning algorithms and feature selection to model the early-stage cost of drip irrigation systems in Iran. The results of this extensive study showed that:

  • The results of different feature selection methods showed that the FeatureSelect tool is one of the best approaches for feature selection from a large number of feature sets, and the results of this tool make artificial intelligence models deliver better output.

  • Generally, the accuracy of the evaluation criteria of the models in the all features section (10 out of 39) was better than in the features before the design phase section (7 out of 18). The reason is clear: more of the important variables were involved in estimating and modeling.

  • In both sections, all models better estimated the total cost and the on-farm cost, because the features have a higher correlation with the total cost, as was seen in the results. The accuracy of modeling the other cost parts was also appropriate, with close correlations and small errors.

  • It is natural that the evaluation criteria are higher in the training phase, but their insignificant difference from the test phase showed that the models learned accurately and that their estimation and modeling are acceptable.

  • As an important point, it should be said that with the features before the design phase, accurate modeling of the cost of pressurized irrigation systems can be done.

  • The excellent results of the features section before the design stage indicated that suitable modeling can be achieved using the least features. Also, the cost estimation software for pressurized irrigation systems can be developed with the least input data.

  • Among the different data mining models, the ANN and SVM models had the best estimates; notably, similar results have been obtained in most water and environmental science and engineering studies.

Currently, decisions by engineers, consultants, and policymakers to allocate budgets for pressurized irrigation systems are based on experience or traditional calculations. However, the results of the feature selection methods in this study showed that knowing a few important parameters that affect the costs of drip irrigation systems is enough to estimate the costs accurately before design and implementation. Cost modeling also helps managers make the best decisions in different situations. The general results of this research further showed that, to ease the modeling of the cost of the different parts of drip irrigation systems, the results of this research can be relied on to obtain optimal models. By modeling the cost of an irrigation system before it is implemented, one can then understand the costs correctly and properly manage the budgeting for credit allocation.