Introduction

While the rapid expansion of China’s construction and real estate sectors since the start of the 21st century has driven urbanization and improved living conditions, the growth of the construction industry’s total output has decelerated since 2013. In response to this market slowdown and fierce competition, firms are universally seeking to boost competitiveness through stricter cost management. Cost control in construction projects is implemented through cost prediction in the initial planning and feasibility study stages, with existing prediction methods primarily falling into two categories: machine learning-based and deep learning-based1,2,3,4.

Machine learning-based construction cost prediction employs a variety of algorithms, including Support Vector Machine (SVM), BP neural networks, XGBoost, and Random Forest5,6,7. Classically, on the basis of SVM, Miao et al.8 proposed a Least Squares Support Vector Machine (LSSVM) for construction cost prediction, which achieved a relative error of less than 7%, thereby demonstrating high accuracy and stability. Similarly, on the basis of SVM, Wang et al.9 proposed a model called PCA-LSSVM for the cost prediction of residential construction. Through robust validation with 290 cases from 25 companies, PCA-LSSVM proved highly effective in estimating costs. However, although SVM has achieved certain results, it has notable deficiencies in handling heterogeneous features, so a more comprehensive model is needed. In 2022, combining the extreme gradient boosting method with random forest (XGBoost-RF), artificial neural network (ANN), and SVM, Wu et al.10 predicted construction costs for 90 construction projects in Iraq. The results demonstrated that the inflation rate is the most important indicator, and that XGBoost-RF was the top-performing model with a mean absolute error of just 0.25%. By integrating PCA for dimensionality reduction with a BP network for nonlinear fitting, Liu et al.11 introduced an improved model, PCA-BP, to distinguish between controllable and uncontrollable factors in construction projects. The model first applies PCA to preprocess large-scale construction data for improved training, then uses the BP network for deep analysis and prediction. Their results demonstrate that the approach boosts prediction accuracy, enables dynamic management, and improves project success rates and efficiency. To address BP’s tendency to converge to local minima and its slow convergence speed, Feng et al.12 developed a GA-BP model, in which a Genetic Algorithm (GA) optimizes the BP network.
The model’s generalization ability was assessed using 18 training cases and 2 test cases, validating its effectiveness for project cost estimation. To tackle the low accuracy and efficiency caused by high project complexity and uncertainty, Zheng et al.13 developed a prediction system based on 14 secondary indicators and proposed a Bird Swarm Algorithm-based Random Forest model (BSA-RF). Results on construction data from Xinyu, China showed that the model outperformed both traditional and recent methods in accuracy and efficiency, offering a reliable reference for cost management.

In deep learning-based methods, construction costs are predicted using algorithms such as Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Generative Adversarial Network (GAN), and Convolutional Neural Network (CNN). Classically, Shi et al.14 evaluated the performance of LSTM, GRU, and Transformer in construction cost prediction. The results showed that the Transformer was the most accurate model; LSTM performed reliably but less accurately; and GRU trained faster but was less precise. Li et al.15 developed a deep neural network (DNN) model that uses project and item features to predict engineering costs. The model achieved high accuracy, with errors as low as 4.203% for total cost and between 2.98% and 4.52% for unit prices, demonstrating its effectiveness for cost management. Feng16 posited that traditional cost prediction methods are inadequate for managing complex data structures and multimodal features. Therefore, an intelligent model incorporating subtractive clustering, a self-learning mechanism, and a convolutional neural network (CNN) was constructed. This integrated framework applies clustering for data optimization, a self-learning mechanism for parameter tuning, and a CNN for deep feature extraction from diverse data types (images, text, and numerical values), enabling highly accurate predictions. To address the challenge of early-stage cost prediction, Liu et al.17 introduced a hypergraph deep learning-based framework. This framework first defines a hypergraph that represents cost factors and their relationships. It then employs a deep learning model for end-to-end prediction, and finally quantitatively reveals the importance of each cost factor. To address the challenge of limited and unreliable data in early-stage cost prediction, Hong et al.18 developed a Conditional Tabular GAN (CTGAN)-based method for data augmentation. By training an ANN on this synthetic data, their model effectively addressed data scarcity and imbalance.
This approach notably reduced RMSE by about 66% and increased predictive effectiveness from 0% to 15.09% compared to a baseline model using only original data.

Based on the analysis of the aforementioned methods, this paper suggests that the core challenge in construction cost prediction is the correct identification of variable functions and attributes, and the effective learning of the complex interactions among them: (1) Construction cost prediction involves both continuous and categorical variables. If a single type of model is used without distinction, data information may be lost, because these variables require preprocessing to eliminate differences and noise interference; (2) Construction cost variables fall into distinct functional categories (e.g., area, materials, equipment), each with its own logic. A unified modeling approach struggles with this diversity: it must learn excessively complex relationships, often at the cost of interpretability, and fails to respect the specificities of each variable type. Although neural networks can achieve this goal, they lack interpretability. Consequently, we advocate for a multi-model framework that can integrate multiple models and classify variables by attribute, thereby improving both accuracy and robustness.

In this study, to enhance model interpretability and address the aforementioned prediction challenges, we propose a dual-stacking model for construction cost prediction (DSCostPred), which incorporates both model stacking and variable stacking. Specifically, to avoid mutual interference among variables, the variables are first pre-classified by functional clustering. Then, different types of models are employed to learn the distinct roles and complex interactions of these variables. In terms of algorithm design, a novel variable stacking mechanism is integrated into the ensemble learning framework to achieve an organic synthesis of multiple attributes and models. The main contributions are as follows:

  1. (1) To reduce the burden that variable heterogeneity places on model predictions, a vertical variable segmentation is conducted, pre-classifying variables according to their attributes and functions;

  2. (2) The stacking ensemble learning framework is improved to stack not only models but also variables (data), providing a new methodology for construction cost prediction;

  3. (3) Experiments and SHAP analyses on a dataset containing 332 samples from China have demonstrated the effectiveness and interpretability of our method.

Materials and methods

Data collection and statistics

Data collection and preprocessing

The dataset used in this study was collected from the Construction Cost Platform (https://chaoshi.zjtcn.com/searchclassify?pg=1&l=26) and includes 412 records of residential projects from 21 provinces in China, covering 2019 to 2022. The data contain 14 continuous variables and 12 categorical variables. The names, units, and types of the variables are shown in Table 1 (where “Type = C” represents continuous variables, “Type = IC” represents integer continuous variables, and “Type = D” represents categorical variables; the categorical variables have no units). These data were then preprocessed into a form directly usable by the algorithm (Supplementary file 1). For detailed processing rules, please refer to Supplementary file 2.

Table 1 Variable description.

Although the data provided in Supplementary file 1 can already be used by the algorithm, it still fails to meet the experimental requirements because it contains outliers (Fig. 1). To reduce the impact of outliers on model performance, all outliers were removed, ultimately yielding an experimental dataset of 332 samples.

Fig. 1

Outlier plot of the continuous variables. Only the 14 continuous variables are shown, as this issue is not applicable to categorical variables.

Fig. 2

Frequency distribution plots of continuous variables.

Fig. 3

Frequency distribution plots of categorical variables. Please refer to Supplementary file 2 for the names corresponding to the numbers on the horizontal axis.

Statistical analysis

To gain an intuitive understanding of the data, descriptive statistics were computed. Figure 2 shows the distributions of all continuous variables, Fig. 3 shows the distributions of all categorical variables, and Supplementary file 3 lists the distribution statistics of all variables, including the mean, variance, minimum, Q1, median, Q3, and maximum.

Double-stacking model-based construction cost prediction

Figure 4 shows the process of DSCostPred. The method takes classic stacking ensemble learning as its main structure. Stacking first constructs multiple first-level learners (i.e., base learners) of different types and uses them to obtain first-level prediction results. Then, based on these first-level predictions, a second-level learner (i.e., meta-learner) is constructed to produce the final prediction. The motivation of stacking can be described as follows: if a first-level learner mistakenly learns a certain area of the feature space, the second-level learner can correct this error by combining the learning behaviors of the other first-level learners. Specifically, in the training of base learners, K-fold cross-validation divides the training data into K parts, and K base learners learn different folds respectively, yielding K results. These results are then input as metadata into a single meta-learner for prediction. This framework is powerful and robust. However, we believe it may not be optimal when dealing with multiple types of variables: although it uses multiple models at the base-learner stage, it does not distinguish among them, and these models are generally homogeneous. As a result, when heterogeneous variables are fed into the model, it may achieve good results on some variables but not on others. To solve this problem, we introduce variable stacking into this framework. As shown in Fig. 4, variable stacking mainly modifies the base-learning stage of stacking ensemble learning. It is composed of the vertical variable segmentation in Step1 and the vertical feature concatenation in Step4. Through the variable segmentation in Step1, the data is split into K parts, each containing different variables. Then, the base-learning stage of stacking ensemble learning can be replicated K times.
Meanwhile, during the different base-learning processes, the data is allowed to be replicated and input into different models for training, enabling the data to be fully utilized; this corresponds to the “repeat” in Step2. The cross-validation in Step2 is an operation of stacking ensemble learning itself; unlike our vertical variable segmentation, it performs horizontal (sample-wise) segmentation on each piece of data. Afterwards, all the replicated and partitioned data can be input into various types of models to train and obtain features with sufficient information. These features are finally concatenated in Step4 to form comprehensive features for use by the meta-learner. In simple terms, our improvement is analogous to the transformation from the attention mechanism to the multi-head attention mechanism in the classic Transformer model in natural language processing. The specific details are as follows.
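The flow of Steps 1–4 can be sketched with scikit-learn. This is an illustrative sketch only: the synthetic data, the two variable clusters, and the three base models are placeholder assumptions, not the paper’s exact configuration.

```python
# Sketch of dual stacking: vertical variable segmentation (Step 1),
# out-of-fold base learning (Steps 2-3), feature concatenation and a
# meta-learner (Step 4). All names and shapes here are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                       # 6 variables, 200 projects
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=200)

# Step 1: vertical variable segmentation (two hypothetical clusters).
clusters = [[0, 1, 2], [3, 4, 5]]

# Steps 2-3: one base-learning pass per variable cluster; out-of-fold
# predictions from each base model become meta-features.
meta_features = []
for cols in clusters:
    for base in (LinearRegression(), Ridge(alpha=1.0),
                 RandomForestRegressor(n_estimators=50, random_state=0)):
        oof = cross_val_predict(base, X[:, cols], y, cv=5)
        meta_features.append(oof)

# Step 4: vertical feature concatenation, then the meta-learner.
Z = np.column_stack(meta_features)
meta = RandomForestRegressor(n_estimators=100, random_state=0).fit(Z, y)
print(Z.shape)  # → (200, 6): 2 clusters x 3 base models
```

In a real run, the test set would be passed through the same fitted base models before being concatenated for the meta-learner.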

Fig. 4

The flowchart of DSCostPred.

Variable attributes and functions-based vertical variable segmentation

The prediction of unit construction cost involves many variables; in this paper, 25 variables are included (see Table 1). In fact, these variables can be classified into several categories based on their attributes and functions. Meanwhile, among these variables there may be hierarchy, interaction effects, and a coexistence of qualitative and quantitative factors. In such a complex situation, we cannot expect to obtain high-precision results with a simple model. Therefore, a large amount of data processing and standardization work is often carried out before prediction. However, this work is complex and time-consuming. To simplify the process, we propose to vertically segment the input data according to variable attributes and functions, thus preventing the complex relationships and categories of variables from interfering with model training. This segmentation is obtained through K-means clustering19 by taking each feature variable as a sample.

In K-means clustering, the number of clusters K must be specified manually. To determine K, we set \(K=\left\{ {3,4,5,6,7} \right\}\) and evaluated the results using the silhouette score. Meanwhile, to ensure the stability of the selected K, the bootstrap method was used to repeat this process 50 times, owing to the inherent randomness of K-means clustering. The results are shown in Fig. 5, where the red line represents the average silhouette score, and the blue and green lines represent the upper and lower bounds of the 95%CI, respectively.

Fig. 5

The silhouette score under different K.

As can be seen from the red line, the clustering result is best when \(K=4\), because its silhouette score is the largest. Meanwhile, the blue and green lines show that the clustering result is most stable when \(K=4\), because the distance between them is smallest at this point. The K with the largest silhouette score was therefore used as the final value. The results of the variable division are shown in Table 2.
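The K-selection procedure above can be sketched as follows. Synthetic data stands in for the real dataset; the shapes, candidate K values, and the 50-repetition bootstrap mirror the description, but this is an assumption-laden illustration, not the authors’ code.

```python
# Choose K for clustering VARIABLES (columns) by bootstrapped silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
data = rng.normal(size=(332, 25))       # rows: projects, cols: variables
var_profiles = data.T                   # cluster the variables, not the samples

scores = {}
for k in (3, 4, 5, 6, 7):
    runs = []
    for b in range(50):                 # bootstrap repetitions
        idx = rng.integers(0, var_profiles.shape[0], var_profiles.shape[0])
        sample = var_profiles[idx]
        labels = KMeans(n_clusters=k, n_init=10, random_state=b).fit_predict(sample)
        runs.append(silhouette_score(sample, labels))
    runs = np.array(runs)
    lo, hi = np.percentile(runs, [2.5, 97.5])   # 95% CI bounds
    scores[k] = (runs.mean(), lo, hi)

best_k = max(scores, key=lambda k: scores[k][0])
```

On the real data this procedure selects \(K=4\); on random synthetic data `best_k` is arbitrary.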

Table 2 The result of vertical variable segmentation.

These four types of variables respectively reflect the cost of construction projects from four aspects. The variables belonging to cluster1 are the general outline and foundation of cost prediction, determining the magnitude of the cost and usually positively correlated with it. Almost any cost prediction model must first take these variables into account to determine the basic range of costs. The variables belonging to cluster2 describe the connection mode between the building and the foundation, and they are the “foundation” of the project. Their costs belong to the concealed but crucial part and are an important source of cost fluctuations. In areas with complex geological conditions, the cost of foundation engineering may account for a large proportion of the total cost. Accurate description of the basic structure is crucial for prediction accuracy. The variables belonging to cluster3 determine the main construction and installation costs and construction technical plans, and are the key to distinguishing cost levels. The variables belonging to cluster4 directly affect the unit price of individual projects and are the focus of refined prediction and cost optimization. When constructing a cost prediction model, these four types of variables complement each other and none can be missing. An excellent model needs to integrate this information to accurately capture all cost drivers from macro to micro levels.

After vertical data segmentation, the four variable groups are respectively input into 4 stacking ensemble learning sessions for training. However, the type of model used in each stacking ensemble learning session is different, including models based on linear regression, trees, neural networks, and so on.

Stacking ensemble learning

The dual-stacking model proposed in this paper is based on traditional stacking ensemble learning20,21, into which variable stacking is added to fully extract the features required for construction cost prediction from variables with different functions. Stacking ensemble learning is an effective ensemble method in which the predictions generated by various machine learning algorithms are used as input features for a second-layer learning algorithm. After training, the second-layer algorithm can refine the predictions of the combined model into new predictions. Stacking generally consists of two layers: (a) a series of base models used to analyze the data from multiple aspects; (b) the model trained on the outputs of the base models, that is, the meta-model. The following introduces these two parts respectively.

  1. (1) Base model.

The main function of the base models is to learn the complex relationships in the data from multiple perspectives, extract and provide diverse and complementary “meta-features”, reduce the risk of overfitting in model training, and improve the generalization ability of the model. The selection of base models is usually not simply about pursuing accuracy but follows the principle of diversity, that is, choosing models with different operating principles as much as possible, such as linear models, tree models, kernel models, and neural networks. This is known as model stacking. Furthermore, since the data we used was split based on variable attributes and functions, we further optimized model stacking and thus propose variable stacking. In conventional training of base models, the input for each set of base models is horizontally sliced data, and each slice carries all variables indiscriminately. A single set of base models is then used for training, which demands a strong learning ability from the base models; if a base model fails to learn effective features, accurate results still cannot be achieved through model stacking alone. With the assistance of variable stacking, the data is cut vertically (Table 2) into several parts, and each part is processed by a different model. This not only enables diversified learning of the various parts of the data but also allows tailor-made solutions. This is known as variable stacking. For model selection, this paper adopts linear regression, ridge regression, LASSO regression, kernel ridge regression, random forest, and multi-layer perceptron, introduced respectively below:

  1. (i) Linear regression.

For a linear equation \(Y=XW\), the loss function of linear regression22 is expressed as:

$$J\left( W \right)={\left\| {Y - \hat {Y}} \right\|^2}={\left( {Y - XW} \right)^T}\left( {Y - XW} \right)$$
(1)

By setting the derivative to 0, the solution can be obtained as follows:

$$W={\left( {{X^T}X} \right)^{ - 1}}{X^T}Y$$
(2)

Here, W describes the influence weights of all variables in X on the construction cost. However, when there is multicollinearity among the variables, \({X^T}X\) is singular, making Eq. (2) impossible to solve correctly.
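Eq. (2) can be checked numerically. The sketch below, on synthetic data (an assumption for illustration), recovers the true weights via the normal equation and shows how a duplicated, perfectly collinear column makes \({X^T}X\) ill-conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

# Normal-equation solution of Eq. (2): W = (X^T X)^{-1} X^T Y.
W = np.linalg.inv(X.T @ X) @ X.T @ y
print(np.round(W, 2))                    # close to [2, -1, 0.5]

# With a duplicated (perfectly collinear) column, X^T X becomes singular:
X_bad = np.column_stack([X, X[:, 0]])
print(np.linalg.cond(X_bad.T @ X_bad))   # enormous condition number
```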

  1. (ii) Ridge regression.

Ridge regression23 is an improved algorithm specifically designed to handle multicollinearity problems (where features are highly correlated), in which a regularization term \(\alpha \left\| W \right\|_{2}^{2}\) is added to the loss function:

$$J\left( W \right)={\left\| {Y - \hat {Y}} \right\|^2}+\alpha \left\| W \right\|_{2}^{2}$$
(3)

The solution is as follows:

$$W={\left( {{X^T}X+\alpha I} \right)^{ - 1}}{X^T}Y$$
(4)

where \(\alpha\) is the penalty coefficient used to adjust the penalty intensity of W.
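A minimal numerical sketch of Eq. (4), on synthetic data with a deliberately duplicated column (an illustrative assumption): the \(\alpha I\) term makes \({X^T}X+\alpha I\) invertible even when \({X^T}X\) itself is singular.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.column_stack([X, X[:, 0]])        # perfectly collinear 4th column
y = X[:, 0] - X[:, 1] + 0.01 * rng.normal(size=100)

alpha = 1.0
# Eq. (4): W = (X^T X + alpha*I)^{-1} X^T Y; the ridge term restores invertibility.
W = np.linalg.inv(X.T @ X + alpha * np.eye(X.shape[1])) @ X.T @ y
pred = X @ W
print(round(float(np.mean((pred - y) ** 2)), 4))   # small training MSE
```

Note how the ridge solution splits the weight between the two identical columns instead of failing outright.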

  1. (iii) LASSO regression.

LASSO regression24 is another improved version with 1-norm \(\alpha {\left\| W \right\|_1}\) as regularization term:

$$J\left( W \right)={\left\| {Y - \hat {Y}} \right\|^2}+\alpha {\left\| W \right\|_1}$$
(5)

Since this loss function is not differentiable everywhere, it can be solved using the coordinate descent method or the least angle regression method.
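The sparsity-inducing effect of the 1-norm in Eq. (5) can be illustrated with scikit-learn’s `Lasso`, which uses coordinate descent. The data here is synthetic and the value of \(\alpha\) is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two variables matter; the other eight are irrelevant.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# Coordinate descent on Eq. (5); the 1-norm penalty drives the
# irrelevant coefficients toward exactly zero.
model = Lasso(alpha=0.1).fit(X, y)
print(np.round(model.coef_, 2))   # only the first two stay clearly nonzero
```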

  1. (iv) Kernel ridge regression.

Kernel Ridge Regression (KRR)25 introduces kernel techniques in ridge regression to solve the nonlinear problems. The objective function is as follows:

$$J\left( W \right)={\left\| {Y - f\left( X \right)} \right\|^2}+\alpha \left\| f \right\|_{2}^{2}$$
(6)

where \(f\left( x \right)=\mathop \sum \limits_{{i=1}}^{n} {w_i}k\left( {{x_i},x} \right)\), \(k\left( { \cdot , \cdot } \right)\) is the kernel function, \({w_i}\) is the coefficient, and \({x_i}\) are the data samples. The solution is as follows:

$$W={\left( {K+\alpha I} \right)^{ - 1}}Y$$
(7)

where K is a matrix with dimension of \(n \times n\), \({K_{i,j}}=k\left( {{x_i},{x_j}} \right)\).
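The dual solution of Eq. (7) is a single linear solve against the kernel matrix. The sketch below fits a nonlinear function with an RBF kernel on synthetic data; the kernel choice, \(\gamma\), and \(\alpha\) are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(80, 1))
y = np.sin(2 * X[:, 0]) + 0.05 * rng.normal(size=80)

alpha = 0.1
K = rbf_kernel(X, X)                      # K_{i,j} = k(x_i, x_j)
# Eq. (7): w = (K + alpha*I)^{-1} y  (note the matrix inverse).
w = np.linalg.solve(K + alpha * np.eye(len(X)), y)
pred = K @ w                              # f(x) = sum_i w_i k(x_i, x)
print(round(float(np.mean((pred - y) ** 2)), 4))
```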

  1. (v) Random forest.

Random forest (RF)13 is an ensemble learning method that enhances the predictive ability by combining multiple decision trees. The prediction result of RF is as follows:

$$f\left( x \right)=\frac{1}{k}\mathop \sum \limits_{{i=1}}^{k} {\varphi _i}\left( x \right)$$
(8)

where \({\varphi _i}\left( x \right)\) is the predicted result of the i-th decision tree. In regression tasks, the typically used decision tree is the Classification and Regression Tree (CART), with the following loss function:

$$\mathop {\hbox{min} }\limits_{{j,s}} \left[ {\mathop {min}\limits_{{{c_1}}} \mathop \sum \limits_{{x \in {R_1}\left( {j,s} \right)}} {{\left( {{y_i} - {c_1}} \right)}^2}+\mathop {min}\limits_{{{c_2}}} \mathop \sum \limits_{{x \in {R_2}\left( {j,s} \right)}} {{\left( {{y_i} - {c_2}} \right)}^2}} \right]$$
(9)
$$\left\{ {\begin{array}{*{20}{c}} {\begin{array}{*{20}{c}} {{R_1}\left( {j,s} \right)=\{ x|{x^j} \leqslant s\} } \\ {{R_2}\left( {j,s} \right)=\{ x|{x^j}>s\} } \end{array}} \\ {\begin{array}{*{20}{c}} {{c_1}=ave[{y_i}|x \in {R_1}\left( {j,s} \right)]} \\ {{c_2}=ave[{y_i}|x \in {R_2}\left( {j,s} \right)]} \end{array}} \end{array}} \right.$$
(10)

where \({x^j}\) is the j-th variable; s is the \({x^j}\) value that minimizes the sum of the squared errors of the two divided regions; \({R_1}\) and \({R_2}\) are the resulting division regions; \({c_1}\) and \({c_2}\) are the average target values of the two regions, respectively; and \({y_i}\) is the true value.
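The split search of Eqs. (9)–(10) for a single variable can be sketched directly. The step-function data below is a synthetic assumption with a true breakpoint at 4.0.

```python
import numpy as np

def best_split(x, y):
    """Find the threshold s on one variable minimizing the SSE of Eq. (9)."""
    best_s, best_sse = None, np.inf
    for s in np.unique(x)[:-1]:           # candidate thresholds
        left, right = y[x <= s], y[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_s, best_sse = s, sse
    return best_s, best_sse

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 4.0, 1.0, 5.0) + 0.1 * rng.normal(size=200)

s, sse = best_split(x, y)
print(round(s, 1))   # close to the true breakpoint 4.0
```

CART repeats this search over every variable j and threshold s, then recurses on the two resulting regions.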

  1. (vi) Multi-layer perceptron.

A multi-layer perceptron (MLP)26 refers to a neural network with at least three layers (an input layer, a hidden layer, and an output layer), as shown in Fig. 6.

Fig. 6

Schematic diagram of MLP.

Theoretically, an MLP can approximate any complex function given enough layers and neurons. The following is the calculation at a single neuron:

$$u=\mathop \sum \limits_{{i=1}}^{n} {w_i}{x_i}+b$$
(11)
$$y=activation\left( u \right)$$
(12)

where \({x_i}\) is the output of the i-th neuron in the previous layer, \({w_i}\) is the connection weight, b is the bias term, and \(activation\left( \cdot \right)\) is the activation function endowing the model with the ability of nonlinear mapping.
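Eqs. (11)–(12) for a single neuron amount to a weighted sum followed by a nonlinearity. The weights, bias, and tanh activation below are illustrative assumptions.

```python
import numpy as np

def neuron(x, w, b, activation=np.tanh):
    # Eq. (11): u = sum_i w_i * x_i + b; Eq. (12): y = activation(u).
    u = np.dot(w, x) + b
    return activation(u)

x = np.array([0.5, -1.0, 2.0])   # outputs of the previous layer
w = np.array([0.2, 0.4, -0.1])   # connection weights
b = 0.05                         # bias term
print(round(float(neuron(x, w, b)), 4))  # u = -0.45, tanh(-0.45) → -0.4219
```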

  1. (2) Meta model.

The purpose of the meta-model is to construct the relationship between the predicted results of the multiple base models and the true results. It combines the advantages of the base models in an optimal way, makes up for the deficiencies of individual models, and thereby achieves performance surpassing any single model. In practice, simple and robust models are usually chosen as meta-models. In this article, we use the random forest as the meta-model because it naturally fits the role, that is, determining the final result through “swarm intelligence”:

  1. (i) RF has low requirements for data distribution and does not need standardized or normalized features. Because the tree model is based on threshold splitting, the scale of a feature does not affect the splitting result, so RF can adapt to the differentiated outputs of different models;

  2. (ii) RF uses bootstrap sampling for training, further increasing the diversity of perspectives and forcing the model to learn more robust results;

  3. (iii) The features input to the meta-model are independent of each other because they are obtained from different base models. To learn the nonlinear relationships and interaction effects among such features, many models would require manually constructed interaction terms. RF naturally captures these complex nonlinear relationships and interaction effects between features and target, because its decision trees essentially divide the feature space through a series of if-then rules.

Results

In this section, we first compare DSCostPred with other classic models to comprehensively evaluate its performance. Then, we conduct an ablation experiment to verify the effectiveness of variable stacking and to explore the predictive effects under different base-model and meta-model selections.

Experimental setup

Training settings

In this study, the number of samples obtained after preprocessing is 332. We divided this dataset into a training set and a test set in an 8:2 ratio. Meanwhile, to avoid misunderstanding, we note that the cross-validation in Step2 of the flowchart is independent of the data partitioning here: the cross-validation in Step2 further divides the data and feeds it into multiple base models. That is to say, in DSCostPred, the training set is re-divided during the training phase, and the test set is likewise re-divided during the testing phase.

To highlight the performance of the method, we compared it with several classic methods, including LSSVM, PCA-LSSVM, BSA-RF, PSO-BP, and CTGAN. To ensure the fairness of the comparison, we used their original parameters as much as possible, and for those using optimization algorithms, we explored their best performance on this data using the optimization algorithms (Supplementary file 4). All methods are trained and tested on the same training set and test set.

Evaluation metrics

The exact values predicted by regression models rarely match the true values, so the key lies in measuring the closeness between predicted and true values. This paper evaluates the model using five indicators: \({R^2}\), mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and symmetric mean absolute percentage error (sMAPE). Their formulas are as follows:

$${R^2}=1 - \frac{{\mathop \sum \nolimits_{{i=1}}^{n} {{\left( {{y_i} - {{\hat {y}}_i}} \right)}^2}}}{{\mathop \sum \nolimits_{{i=1}}^{n} {{\left( {{y_i} - \bar {y}} \right)}^2}}},~where~\bar {y}=\frac{1}{n}\mathop \sum \limits_{{i=1}}^{n} {y_i}$$
(13)
$$MAE=\frac{1}{n}\mathop \sum \limits_{{i=1}}^{n} \left| {{y_i} - {{\hat {y}}_i}} \right|$$
(14)
$$RMSE=\sqrt {\frac{1}{n}\mathop \sum \limits_{{i=1}}^{n} {{\left( {{y_i} - {{\hat {y}}_i}} \right)}^2}}$$
(15)
$$MAPE=\frac{{100}}{n}\mathop \sum \limits_{{i=1}}^{n} \left| {\frac{{{y_i} - {{\hat {y}}_i}}}{{{y_i}}}} \right|$$
(16)
$$sMAPE=\frac{{100}}{n}\mathop \sum \limits_{{i=1}}^{n} \frac{{\left| {{y_i} - {{\hat {y}}_i}} \right|}}{{\left( {\left| {{y_i}} \right|+\left| {{{\hat {y}}_i}} \right|} \right)/2}}$$
(17)

where \({R^2}\) measures how well a model fits the data; its value usually ranges between \(- \infty\) and 1, and the closer \({R^2}\) is to 1, the better the fit. MAE measures fitting errors, is not sensitive to outliers, and is dimension-dependent. It should be noted that, since the prediction target of this study is the unit construction cost, MAE is computed directly on the unit-area cost (CNY/m²) and is not affected by project scale, so it can be used directly for project interpretation. RMSE emphasizes large errors and is sensitive to outliers. MAPE represents errors as relative proportions, which is convenient for intuitive understanding. sMAPE resolves the issue that MAPE imposes a heavier penalty on negative errors (predicted values > true values) than on positive errors.
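Eqs. (13)–(17) translate directly into code. The four unit costs below are made-up illustrative values, not data from the study.

```python
import numpy as np

def metrics(y, yhat):
    """The five evaluation metrics of Eqs. (13)-(17)."""
    err = y - yhat
    r2 = 1 - (err ** 2).sum() / ((y - y.mean()) ** 2).sum()
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mape = 100 * np.abs(err / y).mean()
    smape = 100 * (np.abs(err) / ((np.abs(y) + np.abs(yhat)) / 2)).mean()
    return r2, mae, rmse, mape, smape

y = np.array([2000.0, 2500.0, 1800.0, 2200.0])      # unit costs, CNY/m^2
yhat = np.array([2100.0, 2400.0, 1750.0, 2300.0])   # hypothetical predictions
r2, mae, rmse, mape, smape = metrics(y, yhat)
print(round(mae, 1))  # → 87.5
```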

Comparative experiment

To highlight the advantages of DSCostPred, we compared it with several classic models (LSSVM, PCA-LSSVM, BSA-RF, PSO-BP, CTGAN) on \({R^2}\), MAE, RMSE, MAPE, and sMAPE. The results are shown in Table 3.

Table 3 The results of comparative experiment.
Fig. 7

Fitting result.

It can be seen from the table that our method (\({{\text{R}}^2}=0.9197,{\text{~~RMSE}}=108.7037,{\text{~~sMAPE}}=4.2492\)) achieved the best results on \({R^2}\), RMSE, and sMAPE, and was second only to PSO-BP on MAE and MAPE (DSCostPred: \({\text{MAE}}=78.2025,{\text{~~MAPE}}=4.2848\); PSO-BP: \({\text{MAE}}=65.3063,{\text{~~MAPE}}=3.4279\)). This indicates that, compared with PSO-BP, our method makes inaccurate predictions at more data points, but with a smaller error range. By contrast, although PSO-BP errs at only a few data points, when it does err the error is large, which may be intolerable for construction projects. Meanwhile, we note that the relative rankings of the two methods on MAPE and sMAPE are inconsistent. MAPE tends to favor methods whose predicted values are generally lower than the true values, whereas sMAPE avoids this bias. Although the MAPE of PSO-BP is relatively low, an underestimated project cost can force the project to be interrupted for lack of funds, which is also unacceptable. All of this indicates that our method may be more practical. Figure 7 shows the prediction results (since the data is not a time series, for Fig. 7 we only discuss the differences between predicted and true values at each data point, not the trend of the curve). We can observe that PSO-BP has fewer data points with predicted values below the true values than DSCostPred, and more data points with huge errors, which is consistent with our analysis.

Furthermore, to reflect the performance of each method more comprehensively, we additionally conducted a paired bootstrap, which computes the evaluation metrics by randomly sampling the data with replacement and finally yields the mean and 95% confidence interval (95%CI) of each metric. This process of repeated random sampling is equivalent to disrupting the order and composition of the data, and the resulting estimates are more convincing than point estimates. Specifically, by resampling the test set with replacement (1000 times), we constructed the sampling distributions of R², MAE, RMSE, MAPE, and sMAPE and calculated their means and 95%CIs. This makes our evaluation of model performance more statistically robust. The results are shown in Table 4.
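The paired-bootstrap procedure can be sketched as follows. The predictions and test-set size here are synthetic assumptions; only the resampling logic mirrors the description above.

```python
import numpy as np

def bootstrap_ci(y, yhat, metric, n_boot=1000, seed=0):
    """Paired bootstrap: resample (y_i, yhat_i) PAIRS with replacement,
    then report the metric's mean and 95% CI across resamples."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))   # same indices for y and yhat
        vals.append(metric(y[idx], yhat[idx]))
    vals = np.array(vals)
    lo, hi = np.percentile(vals, [2.5, 97.5])
    return vals.mean(), lo, hi

rng = np.random.default_rng(1)
y = rng.uniform(1500, 3000, size=67)        # a hypothetical test set
yhat = y + rng.normal(0, 100, size=67)      # hypothetical predictions
mae = lambda a, b: np.abs(a - b).mean()
mean, lo, hi = bootstrap_ci(y, yhat, mae)
print(lo <= mean <= hi)  # → True
```

Resampling the pairs jointly, rather than each array independently, is what makes the bootstrap “paired”: every resample evaluates all methods on the same subset of projects.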

Table 4 The results of comparative experiment under paired bootstrap.

It can be seen from the table that DSCostPred still achieved outstanding performance (mean R2 = 0.9160, mean MAE = 73.8160, mean RMSE = 110.0667, mean MAPE = 4.0641, mean sMAPE = 4.0463). It was only slightly inferior to BSA-RF and CTGAN in \({\text{mean~}}{{\text{R}}^2}\) and \({\text{mean~RMSE}}\) (BSA-RF: \({\text{mean~}}{{\text{R}}^2}=0.9187\), \({\text{mean~RMSE}}=108.3997\); CTGAN: \({\text{mean~}}{{\text{R}}^2}=0.9241\), \({\text{mean~RMSE}}=105.0371\)). However, considering the 95% CIs, BSA-RF obtained \({{\text{R}}^2}\left( {95{\text{\% CI}}} \right)=\left[ {0.8890,0.9498} \right]\) and \({\text{RMSE}}\left( {95{\text{\% CI}}} \right)=\left[ {86.0059,124.0570} \right]\), CTGAN obtained \({{\text{R}}^2}\left( {95{\text{\% CI}}} \right)=\left[ {0.9054,0.9408} \right]\) and \({\text{RMSE}}\left( {95{\text{\% CI}}} \right)=\left[ {93.2591,115.1987} \right]\), while ours were \({{\text{R}}^2}\left( {95{\text{\% CI}}} \right)=\left[ {0.8790,0.9521} \right]\) and \({\text{RMSE}}\left( {95{\text{\% CI}}} \right)=\left[ {84.1785,128.7718} \right]\). The upper bound of our \({{\text{R}}^2}\left( {95{\text{\% CI}}} \right)\) exceeds those of BSA-RF and CTGAN, and the lower bound of our \({\text{RMSE}}\left( {95{\text{\% CI}}} \right)\) lies below theirs, indicating that our method is more likely to reach higher performance. In conclusion, although DSCostPred trails BSA-RF and CTGAN slightly in \({\text{mean~}}{{\text{R}}^2}\) and \({\text{mean~RMSE}}\), the gap is not significant, and its comprehensive performance on the remaining metrics still holds a prominent position, indicating that DSCostPred possesses a certain statistical superiority.

In addition, it should be noted that each bootstrap resample draws different numbers of samples from different regions, so every resample carries its own noise and bias. That our method still performs well under these conditions suggests that it can, to a certain extent, resist the influence of noise and bias.

Ablation experiment

To investigate the effectiveness of variable stacking and to explore the predictive effects of different base-model and meta-model selections, we compared DSCostPred with the following variants:

  1. noVS: removes variable stacking and uses the original stacking ensemble learning; the input is also restored from the vertically cut data to the original horizontally cut data;

  2. BaseModel-LR: removes linear regression from the base models;

  3. BaseModel-RR: removes ridge regression from the base models;

  4. BaseModel-LASSO: removes LASSO regression from the base models;

  5. BaseModel-KRR: removes kernel ridge regression from the base models;

  6. BaseModel-RF: removes RF from the base models;

  7. BaseModel-MLP: removes MLP from the base models;

  8. Metamodel-LR: replaces the meta-model used in DSCostPred with linear regression;

  9. Metamodel-MLP: replaces the meta-model used in DSCostPred with MLP.

Among these variants, (1) replaces variable stacking with the original input method of stacking ensemble learning, to verify the validity of variable stacking; (2) to (7) delete the corresponding base learners, to verify the importance of each base learner in model stacking; (8) and (9) replace the meta-learner with other classic models, to demonstrate the necessity of choosing RF as the meta-learner. The larger the performance drop, the greater the contribution of the module removed in that variant. The results of the ablation experiment are shown in Table 5. Meanwhile, to clarify the contribution of each part to the model, we also calculated the marginal benefit of each part using SHAP values, listed in the last column of Table 5. When calculating the SHAP value, we take the \({{\text{R}}^2}\) of DSCostPred as the benchmark and the absolute difference between the \({{\text{R}}^2}\) obtained after removing a part and the \({{\text{R}}^2}\) of DSCostPred as the gain.
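The leave-one-out ablation over the base learners can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `StackingRegressor` as the model-stacking backbone and synthetic data in place of the cost dataset; the estimator set mirrors variants (2)–(7), and the gain computed at the end is the absolute \({R^2}\) drop described above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the cost data (the real dataset is not public).
X, y = make_regression(n_samples=240, n_features=10, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [
    ("lr", LinearRegression()),
    ("rr", Ridge()),
    ("lasso", Lasso()),
    ("krr", KernelRidge()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("mlp", MLPRegressor(max_iter=500, random_state=0)),
]

def fit_score(estimators):
    """Fit a stacking model with an RF meta-learner; return test R^2."""
    model = StackingRegressor(
        estimators=estimators,
        final_estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    )
    return r2_score(y_te, model.fit(X_tr, y_tr).predict(X_te))

r2_full = fit_score(base)
# Leave-one-out ablation: the gain of each base learner is the absolute
# R^2 drop when that learner is removed (the marginal benefit in Table 5).
gains = {name: abs(r2_full - fit_score([e for e in base if e[0] != name]))
         for name, _ in base}
```

`StackingRegressor` internally trains the base learners with cross-validated out-of-fold predictions, which matches the standard stacking protocol and avoids leaking training labels into the meta-features.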

Table 5 The results of ablation experiment.

From Table 5, it can be seen that variable stacking is an effective improvement: noVS (R2 = 0.8931, MAE = 93.1625, RMSE = 125.4364, MAPE = 5.0732, sMAPE = 5.0822) performs significantly worse than DSCostPred (R2 = 0.9197, MAE = 78.2025, RMSE = 108.7037, MAPE = 4.2848, sMAPE = 4.2492). Specifically, the R2 of noVS is clearly lower, while its MAE, RMSE, MAPE and sMAPE are all higher, indicating that variable stacking substantially enhances prediction accuracy. Moreover, the SHAP column shows that, except for BaseModel-RF, the contributions of all parts are of the same order of magnitude, and noVS ranks first among them, which supports the effectiveness of the proposed variable stacking. It might be effective because it adds more diverse options and more samples to the model; these two advantages stem directly from the vertical variable segmentation and the repeat operation in Step 2, which are among the core operations of variable stacking.

Then, for the ablation of the base models, we can see that the selection of RF is very important, because the performance of BaseModel-RF drops significantly (R2 = 0.6953, MAE = 163.3338, RMSE = 211.7348, MAPE = 8.8251, sMAPE = 8.7291). This is because RF can learn both continuous and categorical variables, so the internal features of the data can be learned in any group. Meanwhile, we noticed that the performance of BaseModel-MLP (R2 = 0.9128, MAE = 84.5581, RMSE = 113.2543, MAPE = 4.6489, sMAPE = 4.6199) shows almost no decline. We believe there might be two reasons: (i) insufficient training — an MLP requires a large amount of data to reach its full capacity, but our dataset contains only a few hundred samples, which cannot meet this requirement; (ii) the role of MLP can be covered by the combination of other models — the greatest advantage of MLP is fitting complex nonlinear mappings, but KRR and RF share this advantage, so even when MLP is removed, the missing capacity can be compensated by KRR and RF.

Finally, for the meta-model, we did not ablate all candidate models but selected LR and MLP as representatives. It can be seen that after changing the meta-model from RF to LR or MLP, prediction performance decreases significantly, which indicates that using RF as the meta-model is a suitable choice.

Meanwhile, to make the performance comparison more convincing, we also conducted experiments and evaluations using paired bootstrap. The results are shown in Table 6.

Table 6 The results of ablation experiment under paired bootstrap.

As can be seen from Table 6, even under these stricter evaluation conditions, the ablation results still support the previous conclusions, further proving that each part of our model is effective. In conclusion, the addition of variable stacking is effective: compared with traditional stacking ensemble learning, variable stacking increases the accuracy of cost prediction, which confirms the importance of vertically segmenting the data and handling different types of variables with different models. Meanwhile, the selection of the base models and the meta-model is effective, and RF plays a crucial role in both.

Feature importance analysis using SHAP analysis

To capture the influence of variables on the unit construction cost prediction, SHAP analysis is used to calculate the SHAP value of all variables. This kind of analysis is crucial for identifying which variables have the greatest impact on the results, enhancing the understanding of complex correlations in the dataset and providing a valuable perspective on the internal operation of the model. This in-depth analysis of variable importance adds additional transparency to the model’s decision-making process.

According to the SHAP analysis, we present the importance of all variables in Fig. 8. It can be seen that variables such as “Concrete Works” and “Reinforcement Works” are the most important, which is consistent with reality: as the structural skeleton and cost core of buildings, they directly determine a building's strength, stability and seismic resistance, and any design change involving structural type, seismic grade or floor height translates directly and significantly into increases or decreases in these two items. This shows that our model captures relationships consistent with reality. “Exterior Wall Material”, “Interior Finishing Material” and “Heating Works” are also near the top of the list, which likewise accords with reality: they have a huge impact on the unit construction cost because of their high cost elasticity. The choice of “Exterior Wall Material” and “Interior Finishing Material” can change the unit cost by several times or even tens of times, making them key variables for cost control. Heating works (such as floor heating and central air conditioning) are costly: their expense rises sharply with system complexity, equipment brand and energy-saving standards, and they are essential functional expenditures. They are therefore central to balancing cost, quality and selling price, and have a significant and flexible impact on the unit cost.
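For readers unfamiliar with the mechanics behind such importance plots, the sketch below computes exact Shapley values for a single prediction by enumerating feature coalitions, assuming the common mean-masking value function that SHAP approximates for larger feature sets. The toy linear model and its weights are hypothetical, chosen so the attributions are easy to check by hand.

```python
import numpy as np
from itertools import combinations
from math import comb

def shapley_values(predict, x, background):
    """Exact Shapley attribution for one prediction; features absent from
    a coalition are masked with the background mean."""
    n = len(x)
    base = background.mean(axis=0)

    def value(subset):
        z = base.copy()
        z[list(subset)] = x[list(subset)]   # reveal only features in subset
        return predict(z.reshape(1, -1))[0]

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):                  # coalition sizes 0 .. n-1
            for S in combinations(others, k):
                w = 1 / (n * comb(n - 1, k))
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi, value(())

# Toy model: a known linear function, so attributions are easy to verify.
weights = np.array([2.0, -1.0, 0.5])
predict = lambda X: X @ weights
background = np.zeros((1, 3))
x = np.array([1.0, 2.0, 3.0])

phi, base_value = shapley_values(predict, x, background)
print(phi, base_value)  # linear model: phi = weights * (x - background mean)
```

By the efficiency property, the base value plus the sum of the attributions always equals the model's prediction, which is what makes per-variable SHAP rankings additive and directly comparable.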

Fig. 8 The SHAP value of all variables.

Meanwhile, we also observed that “Above/Below-Ground Floor Area”, “Number of Floors” and “Above/Below-Ground Floors” ranked in the middle, since their influence is systematic rather than direct: small changes in them can trigger a significant cost response. For instance, if the floor height increases by 10 centimeters, the quantities of all vertical structures and exterior wall materials increase accordingly. This characteristic ensures that they still affect the final cost. Finally, we found that “Earthwork”, “Foundation Pit Support”, “Masonry Works” and “Waterproofing Works” ranked lower, because their unit engineering cost is relatively stable unless extreme geological conditions are encountered. In conclusion, the SHAP analysis of the variables effectively demonstrates that our model has a degree of reliability and interpretability.

Finally, to enhance the interpretability, we also provide the SHAP analysis results of each variable in four clusters (Figs. 9, 10, 11 and 12).

Fig. 9 The SHAP value of variables in cluster 1.

Fig. 10 The SHAP value of variables in cluster 2.

Fig. 11 The SHAP value of variables in cluster 3.

Fig. 12 The SHAP value of variables in cluster 4.

Discussion and conclusion

The prediction of construction project cost plays a crucial role in engineering construction projects. Before a project is initiated, accurate cost prediction serves as the primary basis for investment decisions, directly determining the economic feasibility of the project and providing a crucial economic criterion for comparing alternative schemes. However, construction project cost involves a multi-dimensional, dynamically varying system, and each major category can be subdivided into numerous specific factors. Moreover, the relationships among variables are far from simple superpositions; they form a complex network of nonlinearity and interaction, which seriously limits prediction accuracy. To address this problem, we proposed DSCostPred, a dual-stacking project cost prediction method based on variable stacking and model stacking. The method's premise is that classifying variables by function and attribute and applying different algorithms to each class can mitigate the interference caused by functional differences among variables and by complex system interactions. First, the variables are pre-classified by function and attribute, to avoid mutual interference among variables with different attributes. Then, different types of models are used to learn the variables under each attribute and function, capturing both their roles and the complex interactions among them. In terms of algorithm design, to achieve an organic combination of multiple attributes and multiple models, we incorporated variable stacking into stacking ensemble learning.
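The variable-stacking workflow summarized above can be sketched roughly as follows. The column clusters, their assigned per-cluster models, and the data are all hypothetical placeholders standing in for the paper's actual grouping; the sketch only illustrates the idea of cutting the data vertically and feeding each cut to its own level-0 model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# Hypothetical column clusters standing in for the paper's variable groups.
clusters = {
    "structure": ["concrete", "reinforcement"],   # continuous quantities
    "finishes":  ["ext_wall", "int_finish"],      # encoded categories
    "geometry":  ["floor_area", "n_floors"],
}
cluster_models = {
    "structure": Ridge(),
    "finishes":  RandomForestRegressor(n_estimators=50, random_state=0),
    "geometry":  KernelRidge(),
}

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((120, 6)),
                 columns=[c for cols in clusters.values() for c in cols])
y = X.sum(axis=1).to_numpy() + rng.normal(0, 0.1, 120)

# Level 0: each cluster is fed only to its assigned model; out-of-fold
# predictions become the meta-features (vertical, not horizontal, cuts).
meta = np.column_stack([
    cross_val_predict(cluster_models[name], X[cols], y, cv=5)
    for name, cols in clusters.items()
])

# Level 1: an RF meta-learner combines the per-cluster predictions.
meta_model = RandomForestRegressor(random_state=0).fit(meta, y)
```

Using out-of-fold predictions for the meta-features keeps the level-1 training honest, exactly as in ordinary stacking; the difference lies purely in slicing the input by variable group rather than giving every base model the full feature matrix.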

The method was evaluated on real data. Compared with methods such as LSSVM, PCA-LSSVM, BSA-RF, and PSO-BP, DSCostPred achieved better results, which demonstrates that the proposed method is effective. In addition, we conducted ablation experiments and SHAP analysis on variable stacking and model stacking. The results show that variable stacking significantly improves the prediction of construction project costs. In model stacking, the role of RF is crucial, which is consistent with the nature of the task: the variables involved in construction cost prediction include both discrete and continuous variables and are inconvenient to standardize. RF, as a tree model, splits nodes based only on the ordering of feature values, so the scale and distribution of the features have no impact on training, making it naturally suited to unstandardized and discrete data. All these factors underlie the superior performance of DSCostPred.

Although the comparison with a series of classic methods, together with the ablation experiments and SHAP interpretability analysis, has demonstrated that DSCostPred performs outstandingly, it also has some limitations, reflected in two aspects:

  1. It does not consider the impact of inflation and currency conversion. The data used in this article cover only the Chinese market from 2019 to 2022. During this period, although the cost index of China's engineering and construction market experienced phased fluctuations, it remained within a controllable range overall, and the macro price level was stable, so price-level changes were not sufficient to affect cross-year cost comparisons or the model's analysis results. Moreover, the research did not involve any cross-currency conversion: all data were analyzed directly in the RMB values provided by the domestic engineering cost system. As a consequence, inflation- and currency-related effects may have been overlooked in the model design;

  2. It does not consider the impact of different contracts (project types) or regions. To test the impact of regions and contracts on model performance, we collected records of 20 residential projects from Hunan, China (not included in the training set; same contract type, different region), as well as data on 53 teaching-building projects from across the country (different contract type). The results on the 20 records are \({R^2}=0.9181\), MAE = 43.7996, RMSE = 50.1480, MAPE = 2.5212, sMAPE = 2.5185, and on the 53 records are \({R^2}=0.6610\), MAE = 135.7463, RMSE = 187.2793, MAPE = 6.9371, sMAPE = 6.9660. This indicates that our model generalizes well to the same contract type in a different region, while its predictive ability for a different contract type drops sharply. This is unsurprising: the model was trained on national residential data, whereas the new data concern teaching buildings, and the two building types differ structurally in fundamental use, design standards, and functional requirements, leading to entirely different key cost-determining features and weights.

In the future, we will explore these two aspects. Predicting the construction costs of more diverse project types would be an attractive outcome, so we will attempt to collect contract data of various types from around the world and train our model on larger datasets to achieve more comprehensive performance.