Introduction

The burgeoning global population has led to a surge in urban water consumption, exacerbating the existing tension between water supply and demand. This escalating imbalance poses a formidable impediment to societal progress, particularly within the expansive water distribution networks that cater to the needs of residential and industrial sectors. Addressing this challenge is imperative for the advancement of sustainable development goals. The integration of Internet of Things (IoT) technology (Fig. 1) into smart water management systems has become a key strategy to enhance water security and operational efficiency, alleviating the complexities associated with water resource management planning1,2,3.

The essence of sustainable water management lies in the accurate forecasting of water demand and consumption, which serves as the foundation for informed decision-making in water resource planning. Applications such as leak detection and pump operation exemplify the practical utility of IoT in water management. However, the full potential of IoT platforms in water demand prediction remains largely untapped, often relying on human experts to provide estimates based on their expertise. Although valuable, this approach is insufficient for predictive analytics in real time, which is essential for the dynamic and responsive management of water resources. In this context, data-driven artificial intelligence (AI) methodologies offer a promising avenue for the precise and scientific prediction of water demand. These AI-driven approaches not only harness the wealth of data generated by IoT devices but also provide the analytical prowess necessary for the development of robust and forward-looking water supply strategies4,5,6.

As shown in Fig. 1, the IoT-enabled water distribution network combines instrumentation, interconnectivity, and intelligence to seamlessly integrate water infrastructure, culminating in a state-of-the-art smart water distribution system7,8,9,10. This advanced system facilitates a bidirectional data flow between sensors and the monitoring platform, enabling a timely and informed decision-making process. Despite these advancements, identifying patterns within the data remains the main challenge in water demand prediction, influenced by numerous factors11,12. Consequently, the extraction of inherent hidden features, crucial for reducing the impact of water demand pattern variability on hydraulic behavior and consumption within the water supply pipe network, has emerged as a critical scientific inquiry to be addressed.

Fig. 1
figure 1

IoT-based smart water management system.

An important property of water demand is its temporal variation13,14. Many modern approaches exploit this property using statistical, machine learning, and deep learning models15,16,17. Common methods for water demand time-series modeling are the autoregressive integrated moving average (ARIMA) model18 and the Naïve Bayes algorithm19. However, these methods rest on linear assumptions or a choice of prior distribution, and their ability to extract nonlinear features is poor. Other studies propose machine learning approaches for optimizing water demand prediction, such as Random Forest20, Support Vector Machine (SVM)21, and K-nearest neighbor (KNN)22. It has gradually been recognized that, due to the nonlinear nature of water demand variations, linear regression methods lack the accuracy and generalization required for practical applications. Recently, deep learning methods such as Long Short-Term Memory (LSTM)23, Graph Convolutional Recurrent Neural Network (GCRNN)8, and Gated Recurrent Unit (GRU)24 have emerged as notable and promising learning algorithms for water demand modeling. These deep learning methods can analyze high-frequency time-series signals but are limited by error accumulation during training.

To better capture the temporal characteristics and correlations of explanatory variables, researchers have concentrated on hybrid optimization strategies25,26. Preprocessing methods such as feature selection and time-series decomposition are often used in practical problems. For example, Vo27 developed a hybrid of convolutional neural networks and bidirectional long short-term memory networks for monthly household water consumption prediction; experimental results show that the hybrid method outperforms a traditional LSTM. Du28 combined principal component analysis with the wavelet transform for data preprocessing and used LSTM for urban daily water demand prediction. Xu29 decomposed water demand sequences using ensemble empirical mode decomposition and then reconstructed them into random and deterministic terms through the Fourier transform. Compared with other data processing methods, decomposition algorithms can effectively improve the prediction performance of the built models by splitting intermittent water demand sequences into several more stationary sub-layers. However, separately optimizing and tuning the different models in a hybrid scheme tends to restrict the overall performance. To achieve better flexibility and robustness, the optimization requires control parameters with good self-adaptive ability.

In the realm of water demand prediction models, parameter optimization has been a significant area of research focus. Several studies have explored different optimization techniques to enhance the performance of prediction models. For instance, some researchers have utilized genetic algorithms (GA) to optimize the parameters of traditional machine learning models like SVM. By evolving the parameter values over multiple generations, GA-based optimization has shown the potential to improve the accuracy of water demand forecasts30. In another approach, particle swarm optimization (PSO) has been employed to fine-tune the hyperparameters of deep learning architectures such as LSTM. PSO can effectively search the parameter space and converge towards optimal values, leading to enhanced prediction capabilities31. Additionally, ant colony optimization (ACO) has been applied in the context of water demand prediction to optimize the selection of input variables and model parameters simultaneously. This method has demonstrated its ability to handle complex relationships between variables and improve the overall model performance32. However, despite the progress made, existing parameter optimization methods still face challenges such as computational complexity and the potential for overfitting in certain scenarios. There is a need for more efficient and robust optimization strategies that can adapt to the dynamic nature of water demand data and the complexity of different prediction models.

Motivated by the quest for efficiency and sustainability, this research explores hybrid optimization strategies, a critical technique in the field of intelligent water demand prediction. The crux of our study harnesses meta-heuristic optimization algorithms to refine the accuracy of hybrid predictive models, which are pivotal for the sustainable management of water resources. Our approach is validated through a series of practical experiments designed to cover diverse water demand scenarios. The main contributions and novelty of this paper lie in the following innovations: (1) A robust adaptive decomposition strategy is introduced to address the multifaceted nature of water demand sequences; its dynamic parameter adjustment dismantles the rigidity of conventional fixed-mode decomposition, bolstering the model’s robustness and adaptability. (2) An objective function is crafted to gauge the prediction complexity of the decomposed sequences; by dynamically modulating the decomposition parameters, this strategy enhances the model’s robustness and generalization capacity. (3) A novel framework for short-term water demand prediction is proposed, with experimental results on varied real-world datasets underscoring its superiority and efficacy across distinct environmental and operational contexts.

This paper is organized as follows: Section “Methodologies” introduces the general framework and mathematical description of our method. We present the detailed analysis and practical experiments in Section “Experiments and discussion”, and give the evaluation results. Finally, Section “Conclusion” concludes this article.

Fig. 2
figure 2

Diagram of the proposed ROADLSTM model.

Methodologies

The complex temporal features and strong nonlinear characteristics of water demand make such series difficult to model and predict precisely with traditional prediction models. In this paper, we use the Complementary Ensemble Empirical Mode Decomposition (CEEMD) method to decompose the time series into different components and thus reduce its temporal complexity. Three core parameters affect the performance of CEEMD: the noise amplitude A, the number of components K, and the total number of ensemble averages N. Heuristic optimization algorithms are very useful for finding a global optimum or near-optimal solution to such parameter search problems. Therefore, an improved quantum genetic search algorithm is applied to obtain these parameters and improve the decomposition efficiently.

The architecture of the proposed robust adaptive decomposition with LSTM is shown in Fig. 2. In this method, a robust adaptive decomposition strategy decomposes the original sequence into k components, and the multi-scale permutation entropy variance between components is used as the error function. Components with similar temporal-feature complexity can thus be obtained. Prediction is then performed on each component in parallel. LSTM is a suitable choice of predictor because the decomposed detail parts of the water demand time series have stochastic characteristics and short-term dependency. This approach enables the model to focus on the temporal relationship between time segments within each component, which improves the overall prediction accuracy.

Complementary ensemble empirical mode decomposition

The Empirical Mode Decomposition (EMD) technique is an adaptive data analysis method that has been widely applied to non-linear and non-stationary data33,34. Although the data-adaptive EMD is a powerful method for decomposing nonlinear signals, it suffers from the frequent emergence of mode mixing, where a single IMF contains signals of widely dissimilar scales, or signals of similar scales appear in different IMFs. To address this problem, a common approach is to add different Gaussian white noise of the same amplitude during each decomposition pass to change the extreme-point characteristics of the signal. Multiple EMDs are then performed, and the corresponding IMFs are averaged so that the added white noise cancels out, which effectively suppresses mode mixing35,36. It is worth noting that paired positive and negative white noise signals should be used to minimize the signal reconstruction error; this is the basic idea of Complementary Ensemble Empirical Mode Decomposition (CEEMD)37,38.

For a water demand sequence \(\varvec{x} = [\varvec{x}_1, \varvec{x}_2, \dots ,\varvec{x}_T]\), pairwise white noise with a specific amplitude is added to generate new sequences:

$$\begin{aligned} \begin{aligned} \varvec{x}_{\text {pos}}^i&= \varvec{x}^i + A \varvec{n}^i \\ \varvec{x}_{\text {neg}}^i&= \varvec{x}^i - A \varvec{n}^i \end{aligned} \end{aligned}$$
(1)

where \(\varvec{x}_{\text {pos}}^i\) is the i-th subsequence with positive noise; \(\varvec{x}_{\text {neg}}^i\) is the i-th subsequence with negative noise; \(\varvec{n}^i\) is the white noise for the i-th subsequence; \(i=1,2,\dots ,N\), where N is the total number of noise realizations; and A is the amplitude of the white noise.

Then, the corresponding k Intrinsic Mode Function (IMF) components can be obtained by using EMD decomposition:

$$\begin{aligned} \begin{aligned} \varvec{x}_{\text {pos}}^i {\mathop {\rightarrow }\limits ^\textrm{EMD}} \sum _{j=1}^{k-1} \varvec{c}_{j, \text {pos}}^i+\varvec{\epsilon }^i \\ \varvec{x}_{\text {neg}}^i {\mathop {\rightarrow }\limits ^\textrm{EMD}} \sum _{j=1}^{k-1} \varvec{c}_{j, \text {neg}}^i+\varvec{\epsilon }^i \end{aligned} \end{aligned}$$
(2)

where \(\varvec{c}_{j}^i\) is the j-th IMF component, \(j=1,2,\dots ,k-1\); \(\varvec{\epsilon }^i\) is the residual high-frequency component.

These operations are repeated, and the j-th component is finally obtained by a lumped average. The signal can thus be represented as a combination of \(k-1\) IMFs and a residual:

$$\begin{aligned} \begin{aligned} \varvec{c}_j^i&=(\varvec{c}_{j,\text{ pos } }^i+\varvec{c}_{j,\text{ neg } }^i) / 2 \\ \varvec{c}_j&=\sum _{i=1}^N \frac{\varvec{c}_j^i}{N} \\ \varvec{x}&=\sum _{j=1}^{k-1} \varvec{c}_j+\varvec{\epsilon } \end{aligned} \end{aligned}$$
(3)
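The key property of the complementary noise pairs, namely that averaging \(\varvec{x}_{\text {pos}}^i\) and \(\varvec{x}_{\text {neg}}^i\) cancels the injected noise exactly and drives the reconstruction error to zero, can be verified with a minimal Python sketch (illustrative only; the EMD sifting step itself is omitted, so only the noise cancellation is shown):

```python
import math
import random

random.seed(0)

# Toy demand signal: one day at 5-minute resolution with a daily cycle.
N = 288
x = [50.0 + 10.0 * math.sin(2.0 * math.pi * t / N) for t in range(N)]

A = 0.05          # white-noise amplitude of Eq. (1)
pairs = 10        # number of +/- noise realizations

recon = [0.0] * N
for _ in range(pairs):
    n = [random.gauss(0.0, 1.0) for _ in range(N)]
    x_pos = [xi + A * ni for xi, ni in zip(x, n)]   # x + A n  (Eq. 1)
    x_neg = [xi - A * ni for xi, ni in zip(x, n)]   # x - A n  (Eq. 1)
    # Averaging each +/- pair cancels the injected noise exactly (Eq. 3).
    recon = [r + (p + q) / (2.0 * pairs) for r, p, q in zip(recon, x_pos, x_neg)]

max_err = max(abs(r - xi) for r, xi in zip(recon, x))   # ~ machine precision
```

The same cancellation carries over to the ensemble-averaged IMFs, which is why CEEMD suppresses the residual noise that ordinary ensemble EMD leaves behind.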

Proposed adaptive parameter optimizer

In practical applications, the time consumed by parameter optimization can be critical, so it is desirable to obtain optimal solutions with as few evaluations as possible. In this study, a quantum-inspired evolutionary algorithm is implemented to achieve computational efficiency. It uses a group of independent quantum bits with superposition characteristics to encode chromosomes and updates them through quantum logic gates, so as to solve the target problem efficiently. Unlike classical computing, the quantum-inspired evolutionary algorithm benefits from superposition, or parallelism, by considering all paths at the same time, thus increasing its processing capacity39,40.

Chromosome representation with quantum bits

In quantum computing, the qubit is the smallest unit of information and has the basis states \(|0\rangle\) and \(|1\rangle\)41,42,43. A general state \(\psi\) is represented by a linear combination of \(|0\rangle\) and \(|1\rangle\):

$$\begin{aligned} \begin{aligned} |\psi \rangle = \alpha |0\rangle + \beta |1\rangle \end{aligned} \end{aligned}$$
(4)

where \(\alpha\) and \(\beta\) are the probability amplitudes; \(|\alpha |^2\) is the probability of the quantum bit being in \(|0\rangle\), \(|\beta |^2\) is the probability of it being in \(|1\rangle\), and \(|\alpha |^2 + |\beta |^2 = 1\). Hence, the quantum states of two or more variables can be described by a single chromosome. The global optimum over the chromosomes can be obtained based on updates of the quantum rotation gate44:

$$\begin{aligned} \begin{aligned} \left[ \begin{array}{c} \alpha '_i \\ \beta '_i \end{array}\right] = \varvec{U}(\theta _i)\left[ \begin{array}{c} \alpha _i \\ \beta _i \end{array}\right] = \left[ \begin{array}{cc} \cos \theta _i & -\sin \theta _i \\ \sin \theta _i & \cos \theta _i \end{array}\right] \left[ \begin{array}{c} \alpha _i \\ \beta _i \end{array}\right] \end{aligned} \end{aligned}$$
(5)

where \((\alpha '_i, \beta '_i)^T\) is the probability amplitude after each update. Note that updating the quantum bit acts like a gradient-descent step on the rotation angle, which can be written as follows:

$$\begin{aligned} \begin{aligned} {U}\left( \Delta \theta \right) \left[ \begin{array}{c} \cos t \\ \sin t \end{array}\right]&=\left[ \begin{array}{cc} \cos \Delta \theta & -\sin \Delta \theta \\ \sin \Delta \theta & \cos \Delta \theta \end{array}\right] \left[ \begin{array}{c} \cos t \\ \sin t \end{array}\right] \\&=\left[ \begin{array}{c} \cos \left( t+\Delta \theta \right) \\ \sin \left( t+\Delta \theta \right) \end{array}\right] \end{aligned} \end{aligned}$$
(6)
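A minimal numerical check of Eqs. (5)-(6) (illustrative Python; `rotate` is an assumed helper name, not from the original sources):

```python
import math

def rotate(alpha, beta, dtheta):
    """Apply the quantum rotation gate U(dtheta) of Eq. (5) to (alpha, beta)."""
    c, s = math.cos(dtheta), math.sin(dtheta)
    return c * alpha - s * beta, s * alpha + c * beta

# Starting from (cos t, sin t), Eq. (6) predicts (cos(t + dt), sin(t + dt)).
t, dt = 0.3, 0.2
a1, b1 = rotate(math.cos(t), math.sin(t), dt)

# The rotation preserves the normalization |alpha|^2 + |beta|^2 = 1.
norm = a1 * a1 + b1 * b1
```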

Here \(\Delta \theta\) is the rotation-angle step, chosen by analyzing the gradient of the objective function at a given chromosome: when the gradient is large, \(\Delta \theta\) is increased to accelerate the search; when it is small, \(\Delta \theta\) is decreased for fine exploration. Using this gradient information helps improve convergence. We construct the step-size function of the rotation angle as follows:

$$\begin{aligned} \begin{aligned} \Delta \theta _{i,j}= \operatorname {sgn}\left| \begin{array}{cc} \alpha _0 & \alpha _1 \\ \beta _0 & \beta _1 \end{array}\right| \times \Delta \theta _0 \times \text {exp} \left( \frac{|\nabla f(\varvec{X}_i^j)|-\nabla f_{j,\min }}{\nabla f_{j,\max }-\nabla f_{j,\min }}\right) \\ \end{aligned} \end{aligned}$$
(7)

where \(\Delta \theta _0\) is the initial value of \(\Delta \theta\); \(\alpha _0\), \(\beta _0\) are the probability amplitudes of the best individual in the current population; \(\alpha _1\), \(\beta _1\) are the probability amplitudes of the current solution; \(f(\varvec{X})\) is the fitness function value of an individual; \(\nabla f(\varvec{X}_i^j)\) is the gradient of \(f(\varvec{X})\) at the point \(X_i^j\); and \(\nabla f_{j,\max }\), \(\nabla f_{j,\min }\) are the maximum and minimum gradient magnitudes over the current population, given by:

$$\begin{aligned} \begin{aligned} \nabla f_{j,\max }=&\max \left\{ \left| \frac{\partial f(\varvec{X}_1)}{\partial X_1^j}\right| , \cdots , \left| \frac{\partial f(\varvec{X}_m)}{\partial X_m^j}\right| \right\} \\ \nabla f_{j,\min }=&\min \left\{ \left| \frac{\partial f(\varvec{X}_1)}{\partial X_1^j}\right| , \cdots , \left| \frac{\partial f(\varvec{X}_m)}{\partial X_m^j}\right| \right\} \end{aligned} \end{aligned}$$
(8)

where \(\varvec{X}_m^j\) is the j-th component of vector \(\varvec{X}_m\).
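The step-size rule of Eq. (7) can be sketched as follows (illustrative Python; the sign of the determinant is passed in as an argument rather than computed from the amplitudes, and `rotation_step` is an assumed helper name):

```python
import math

def rotation_step(dtheta0, grad_ij, grad_min_j, grad_max_j, sign):
    """Adaptive rotation-angle step of Eq. (7).

    `sign` stands for the sign of the determinant |a0 a1; b0 b1| (supplied by
    the caller); the exp(.) factor scales the step by the normalized gradient
    magnitude of the fitness function.
    """
    scaled = (abs(grad_ij) - grad_min_j) / (grad_max_j - grad_min_j)
    return sign * dtheta0 * math.exp(scaled)

# At the smallest population gradient the step stays at dtheta0;
# at the largest it grows to e * dtheta0.
step_lo = rotation_step(0.01, 0.2, 0.2, 1.5, +1)
step_hi = rotation_step(0.01, 1.5, 0.2, 1.5, +1)
```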

Quantum interference crossover

Basically, a genetic algorithm maintains a fixed-size population of chromosomes that have the opportunity to survive into the next generation. Using genetic operators such as selection, crossover, and mutation, offspring are generated over the iterations. The crossover operator, however, has no counterpart in gradient-based algorithms, which often get stuck in poor local minima.

To overcome this drawback, a quantum interference crossover is proposed, as illustrated in Fig. 3. Each row represents a chromosome, and the interference crossover can be described as follows: take the 1st gene of chromosome one, the 2nd gene of chromosome two, the 3rd gene of chromosome three, and so on. No duplicates are permitted within the same offspring: if a gene is already present, the next gene not yet contained is chosen instead. In this way, the information in each gene can be fully utilized through quantum interference.
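A minimal sketch of this crossover for permutation-encoded chromosomes (illustrative Python; `interference_crossover` and the duplicate-skipping order are assumptions consistent with the description above):

```python
def interference_crossover(chromosomes):
    """Sketch of the diagonal interference crossover of Fig. 3, assuming
    permutation-encoded chromosomes of equal length.

    Gene j of the offspring is taken from chromosome j (cycling through the
    population); if that gene is already present, the donor's next unused
    gene is taken instead.
    """
    m, n = len(chromosomes), len(chromosomes[0])
    offspring = []
    for j in range(n):
        donor = chromosomes[j % m]
        for k in range(n):                    # skip genes already chosen
            gene = donor[(j + k) % n]
            if gene not in offspring:
                offspring.append(gene)
                break
    return offspring

pop = [[3, 1, 2, 0], [0, 2, 1, 3], [1, 3, 0, 2]]
child = interference_crossover(pop)
```

The offspring is again a valid permutation, so the operator can be applied repeatedly without repairing the chromosome.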

Fig. 3
figure 3

Quantum Interference Crossover.

Improved quantum mutation

The mutation operation modifies the chromosomes generated by the interference crossover operator45,46,47. A chromosome is selected at random, and the probability amplitude of one of its quantum bits is changed arbitrarily. An example of quantum mutation is the Hadamard-based strategy48:

$$\begin{aligned} \begin{aligned} \frac{1}{\sqrt{2}}\left[ \begin{array}{cc} 1 & 1\\ 1 & -1 \end{array}\right] \left[ \begin{array}{c} \cos \theta _{ij} \\ \sin \theta _{ij} \end{array}\right] = \left[ \begin{array}{c} \cos (\frac{\pi }{4}-\theta _{ij}) \\ \sin (\frac{\pi }{4}-\theta _{ij}) \end{array}\right] \end{aligned} \end{aligned}$$
(9)

Essentially, the quantum mutation problem reduces to a strategy for choosing the rotation angle of the quantum bits. Note that it is usually hard to find the global optimum, and the search may become stuck in a local optimum49. In practice, multiple random restarts can increase our chances of finding a “good” local optimum, and careful initialization can help as well. However, such randomized methods may oscillate, and convergence is not guaranteed. We therefore introduce the cataclysm mechanism50. First, an elite-retention strategy keeps the optimal individuals, ensuring the convergence of subsequent population evolution. Then, new individuals are reinitialized to replace those with poor fitness rankings, increasing the diversity and search range of the population. This mechanism enables the algorithm to escape local optima and improves the convergence and quality of the results.
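The cataclysm mechanism can be sketched as follows (illustrative Python; the elite count, the minimization convention, and the re-initialization routine are assumptions, not specified by the original sources):

```python
import random

def cataclysm(population, fitness, n_elite, reinit):
    """Cataclysm sketch: retain the n_elite fittest individuals unchanged and
    re-initialize the rest to restore population diversity."""
    ranked = sorted(population, key=fitness)      # lower fitness = better here
    return ranked[:n_elite] + [reinit() for _ in range(len(population) - n_elite)]

random.seed(1)
pop = [[random.random() for _ in range(3)] for _ in range(6)]
fit = lambda ind: sum(ind)                        # toy objective to minimize
new_pop = cataclysm(pop, fit, n_elite=2,
                    reinit=lambda: [random.random() for _ in range(3)])

# Elite retention guarantees the best solution never gets worse.
best_before = min(fit(ind) for ind in pop)
best_after = min(fit(ind) for ind in new_pop)
```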

Objective function improvement

For general optimization problems, the fitness function usually uses the root mean square error function (RMSE) to measure the difference between the reconstruction sequence and the original sequence. The CEEMD method eliminates residual noise by adding paired positive and negative white noise, making the RMSE of its decomposition result equal to 0. Thus, ordinary fitness functions cannot achieve the desired results.

The degree of complexity and regularity of a sequence can be described by multi-scale permutation entropy, which fully reflects the dynamical characteristics of sequences across different temporal scales. As a measure of the complexity or regularity of a sequence, a large entropy value indicates strong nonlinear dynamics. The stronger the nonlinearity of a sequence, the higher its uncertainty and complexity, and the more difficult it is to predict.

It is clear from the literature review in the Introduction that signal decomposition helps to increase the precision of water demand prediction. Low-entropy components have good regularity and are relatively easy to predict. We therefore introduce a new objective function, the multi-scale permutation entropy variance (MPEV) between components. By minimizing this objective, we obtain components with similar nonlinear complexity, thereby limiting the nonlinear complexity of the high-frequency components and achieving better prediction results. The multi-scale permutation entropy variance between components is calculated as follows:

$$\begin{aligned} \begin{aligned} H_p&=-\sum _{i=1}^n P_i \ln P_i \\ \textrm{MPEV}&=\sum _{s=1}^k(H_p^s-{\overline{H}_p})^2 \end{aligned} \end{aligned}$$
(10)

where \(P_i\) is the occurrence probability of the i-th ordinal pattern in the reconstructed components; \(H_p^s\) is the multi-scale permutation entropy of the s-th subsequence; and \(\overline{H}_p\) is the mean entropy over the k subsequences. After normalization, \(H_p\) lies in the range 0-1 and represents the randomness and complexity of a sequence.
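An illustrative single-scale Python sketch of the normalized permutation entropy and the MPEV objective (the helper names are assumptions, and the multi-scale coarse-graining step is omitted for brevity):

```python
import random
from math import log, factorial

def permutation_entropy(x, m=3, tau=1):
    """Normalized permutation entropy in [0, 1] (single-scale analogue of Eq. 10)."""
    counts = {}
    n = len(x) - (m - 1) * tau
    for i in range(n):
        window = [x[i + k * tau] for k in range(m)]
        pattern = tuple(sorted(range(m), key=window.__getitem__))  # ordinal pattern
        counts[pattern] = counts.get(pattern, 0) + 1
    probs = [c / n for c in counts.values()]
    h = -sum(p * log(p) for p in probs)
    return h / log(factorial(m))      # normalize by the maximum entropy ln(m!)

def mpev(components):
    """Variance of the components' permutation entropies (the MPEV objective)."""
    hs = [permutation_entropy(c) for c in components]
    mean_h = sum(hs) / len(hs)
    return sum((h - mean_h) ** 2 for h in hs)

random.seed(0)
trend = list(range(100))                       # fully regular -> entropy 0
noise = [random.random() for _ in range(100)]  # irregular -> entropy near 1
h_trend = permutation_entropy(trend)
h_noise = permutation_entropy(noise)
```

Minimizing MPEV drives the components toward similar entropy values, which caps the complexity of the hardest-to-predict high-frequency components.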

Robust adaptive optimization decomposition with long short-term memory model

For water demand data, we are interested in modelling quantities that are believed to be periodic over a 24-hour or weekly cycle. Therefore, in this research, the training process is computed over a sliding window51.

Consider a set of observations \(X(t)=\left[ x_1, x_2 \ldots x_T\right]\) of the water demand. It can be divided into two parts: the input part \(X_{\text {in}}(t)=\left[ x_1, x_2 \ldots x_{T-1}\right]\) and the actual target value \(X_{\text {truth}}^{1}(t)=x_T\). For a given input part, we can perform the improved signal decomposition method and obtain k subsequences:

$$\begin{aligned} X_{in}(t) {\mathop {\longrightarrow }\limits ^\textrm{CEEMD}} \sum _{s=1}^k X_{\text {sub}}^s \end{aligned}$$
(11)

where \(X_{\text {sub}}^s\) is the s-th subsequence. A parallel processing architecture is then integrated with the LSTM to facilitate feature extraction for each IMF and increase prediction accuracy.

$$\begin{aligned} \begin{aligned} Y_{\text {sub}}^s&=F(X_{\text {sub}}^s) \\ Y&=\sum _{s=1}^k Y_{\text {sub}}^s \end{aligned} \end{aligned}$$
(12)

where F(X) is the mapping relationship learned by the LSTM; \(Y_{\text {sub}}^s\) is the prediction result of the s-th subsequence, and Y is the final prediction result.
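The decompose-predict-sum pipeline of Eqs. (11)-(12) can be sketched as follows (illustrative Python; both the decomposition and the per-component LSTM are replaced by trivial stand-ins so the sketch stays self-contained):

```python
def decompose(x, k):
    """Stand-in for the CEEMD step of Eq. (11): the series is simply split
    into k equal additive strata so that the sketch stays self-contained."""
    return [[v / k for v in x] for _ in range(k)]

def component_predictor(sub):
    """Stand-in for the per-component LSTM F(.) of Eq. (12): a naive
    persistence forecast that repeats the last observed value."""
    return sub[-1]

def predict(x, k=3):
    subs = decompose(x, k)                              # Eq. (11)
    return sum(component_predictor(s) for s in subs)    # Eq. (12)

series = [1.0, 2.0, 3.0, 4.0]
y_hat = predict(series)      # persistence on an additive split returns the last value
```

Because the decomposition is additive, the component forecasts recombine by simple summation, exactly as in Eq. (12).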

After the decomposition step, we can compute the prediction for each subsequence. In general, a latent variable model, which can capture correlation between the visible variables via a set of latent common causes, can be applied efficiently for prediction. However, preserving long-term information while avoiding the loss of short-term inputs is difficult for generic latent variable models. In this approach, LSTM is used to model these long- and short-term dynamics. As shown in Fig. 4, an individual LSTM unit consists of an internal storage cell and three gates. The input, output, and forget operations of the memory cell are realized by multiplication with the corresponding gate. Through this gating mechanism, the LSTM maintains a certain level of gradient information to prevent gradient explosion. To alleviate the vanishing gradient problem, it selectively keeps and forgets past information, indicating its great power for capturing long-term temporal dependencies. This enables nonlinear learning on water demand time series.

Fig. 4
figure 4

Architecture of an LSTM cell.

Specifically, the LSTM structure is built considering the number of subsequences to be analyzed. The hidden-state weight vector is multiplied with the input by “sliding” it over the sequence, and all these steps can be performed in parallel on each subsequence. The final output is then obtained by computing the local prediction within each sliding window.

$$\begin{aligned} \begin{aligned} \text{ Pred }&=\left[ Y_{\text{1 } }, Y_{\text{2 } } \ldots Y_{ {num }}\right] \\ \text{ Truth }&=\left[ X_{\text{ truth } }^{\text{1 } }, X_{\text{ truth } }^{\text{2 } } \ldots X_{\text{ truth } }^{ {num }}\right] \end{aligned} \end{aligned}$$
(13)

where \(Y_{{num }}\) is the predicted value of the num-th window; \(X_{\text{ truth } }^{{num }}\) is the real water demand value of the num-th window.
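The sliding-window construction can be sketched as follows (illustrative Python; `sliding_windows` is an assumed helper name):

```python
def sliding_windows(x, window=577):
    """Split a series into overlapping windows of window-1 inputs plus one
    target, matching the 577-point window (576 inputs + 1 target) used here."""
    samples = []
    for start in range(len(x) - window + 1):
        chunk = x[start:start + window]
        samples.append((chunk[:-1], chunk[-1]))   # (X_in, X_truth)
    return samples

data = list(range(600))          # toy stand-in for a demand series
samples = sliding_windows(data)  # 600 - 577 + 1 = 24 (input, target) pairs
```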

Algorithm 1
figure a

ROADLSTM Model

Experiments and discussion

In this study, water demand data are collected from a real-world water distribution system at a 5-minute sampling interval52. The experimental data cover four scenes: mall, company, apartment, and school. For each scene, water demand data were collected continuously over one year and aggregated into the experimental dataset (105120 samples per scene), as shown in Fig. 5. Apartment water demand slightly exceeds that of the mall: apartment population flows are comparable to those of malls, while mall usage is limited to customer activities. These datasets contain various possible abnormalities in water consumption patterns53, such as sudden spikes that deviate significantly from an end-user’s typical behavior, or unusually low or zero consumption over an extended period, which might indicate a faulty meter or an unauthorized disruption of the water supply. From each dataset, 30 days of samples (8640 points) are randomly selected. We applied the 3\(\sigma\) principle to detect outliers and replaced them with the average of the two adjacent values; this preserves the original 8640 samples, maintaining consistent time intervals and preventing any disruption to the prediction results. The selected data are divided into an 8:1:1 split: the first 80% (the 1st to 6912th samples) for training, the middle 10% (the 6913th to 7776th samples) for validation, and the final 10% (the 7777th to 8640th samples) for testing. During training, a sliding window of size 577 is used, comprising two days of samples (576 points) and one one-step prediction value.
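The outlier handling and chronological split described above can be sketched as follows (illustrative Python; the paper specifies the 3-sigma rule, neighbour averaging, and the 8:1:1 split, while the helper names and implementation details are assumptions):

```python
from statistics import mean, stdev

def clean_and_split(x):
    """3-sigma outlier replacement followed by the chronological 8:1:1 split."""
    mu, sigma = mean(x), stdev(x)
    cleaned = list(x)
    for i in range(1, len(x) - 1):
        if abs(x[i] - mu) > 3 * sigma:
            cleaned[i] = (x[i - 1] + x[i + 1]) / 2   # neighbour average
    n = len(cleaned)
    train = cleaned[:int(n * 0.8)]
    val = cleaned[int(n * 0.8):int(n * 0.9)]
    test = cleaned[int(n * 0.9):]
    return train, val, test

data = [10.0] * 100
data[50] = 500.0                  # injected spike, > 3 sigma from the mean
train, val, test = clean_and_split(data)
```

The spike is replaced by the average of its two neighbours, so the sample count and time grid are preserved, matching the description above.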

Table 1 LSTM hyper-parameter settings.
Fig. 5
figure 5

Examples of four types of datasets (30-day).

Three commonly used metrics are applied to evaluate the model in this study: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the Nash-Sutcliffe Efficiency coefficient (NSE). RMSE and MAE reflect the error between the predicted and actual values. NSE evaluates the goodness of fit and stability of the model; a value closer to 1 indicates higher stability.
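The three metrics can be implemented directly (illustrative Python):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def nse(y_true, y_pred):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of the observations."""
    mean_t = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - sse / sst

y = [1.0, 2.0, 3.0, 4.0]
perfect = nse(y, y)                  # a perfect forecast gives NSE = 1
err = rmse(y, [1.0, 2.0, 3.0, 5.0])  # one miss of 1.0 over 4 points -> 0.5
```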

Fig. 6
figure 6

Decomposition and prediction results of ROADLSTM on the mall.

Model performance on IMFs

In the proposed ROADLSTM model, CEEMD is applied to decompose the original water demand sequence into several independent IMFs. Different signals have different frequency components, amplitude variations, and nonlinear characteristics, so a traditional IMF decomposition with fixed parameters may not handle all signal types effectively. For example, for signals containing abrupt changes or multi-scale frequency mixtures, fixed-parameter decomposition may lead to mode mixing, that is, components of different frequencies are wrongly decomposed into the same IMF, or components of the same frequency are spread across different IMFs, which hampers the accurate analysis of signal characteristics in subsequent steps. Adaptive IMF decomposition, by contrast, dynamically adjusts the key parameters of the decomposition process, such as the number of sifting iterations and the stopping criteria, according to the specific conditions of the signal, improving the accuracy and effectiveness of the decomposition. Through the robust adaptive optimization decomposition strategy, the number of IMFs is adjusted adaptively to the data characteristics, avoiding the accumulation of estimation errors caused by excessive IMFs. Taking the mall dataset as an example, Fig. 6 illustrates its decomposition and prediction results. It can be observed that the component sequences constrained by the robust adaptive optimization decomposition strategy exhibit similar nonlinear complexity. On this basis, the predictor can better learn the evolution pattern of each sequence, thereby generating more accurate prediction results.

Comparative study

To verify the effectiveness of the proposed ROADLSTM algorithm, we compare our model with eight models, including classic models like ARIMA, basic RNN, LSTM models, and hybrid models based on EMD and EEMD. Hybrid models based on EMD and EEMD specifically involve replacing CEEMD in the robust adaptive optimization decomposition strategy with EMD and EEMD. Additionally, the predictors support interchangeable use of both ARIMA and LSTM models. Note that ROADLSTM1 and ROADLSTM2 correspond to EMD and EEMD based LSTM model, while ROADAR-1 and ROADAR-2 correspond to EMD and EEMD based ARIMA model. The LSTM hyperparameter settings used in this article are shown in Table 1.

One-step prediction

Table 2 lists the performance of the above nine models on the four types of datasets. The basic models (LSTM, RNN, and ARIMA) yield average RMSE/MAE of 0.6551/0.5851 on the Mall dataset and 0.8203/0.7364 on the Apartment dataset. However, the average RMSE/MAE rise to 1.2222/0.8912 on the Company dataset and 3.7552/3.4696 on the School dataset, demonstrating poor prediction on these more nonlinear datasets. It is worth noting that RNN gives particularly poor predictions. RNNs are very prone to overfitting, especially when the noise level is high: training usually starts from small random weights, so the model is initially simple (the tanh activation is nearly linear near the origin), but as training progresses the weights grow, the model becomes nonlinear, and it eventually overfits.

Table 2 Results on Four Types of Datasets. (The best results are highlighted in bold; the second-best results are underlined.).

Utilizing the decomposition to reduce the non-stationary and non-linear characteristics in water demand data helps to obtain a sufficient prediction accuracy. From Table 2 and Fig. 7, it can be obviously found that the hybrid model can improve the performance of the LSTM model. It is shown that, compared with the single LSTM model, the LSTM-based hybrid models, consisting of ROADLSTM1, ROADLSTM2, and ROADLSTM, demonstrate reduced RMSE and MAE averages across four datasets. Specifically, for RMSE, the LSTM-based hybrid models’ averages achieve decreases of 0.1780, 0.6377, 0.2472, and 0.3989 on the Mall, Company, Apartment, and School datasets, respectively. Similarly, for MAE, the LSTM-based hybrid models’ averages achieve decreases of 0.1078, 0.2675, 0.1193, and 0.2731 on the Mall, Company, Apartment, and School datasets, respectively.

However, when ARIMA is employed as the predictor, the hybrid models suffer a degradation in accuracy. As shown in Table 2, the error metrics of the ARIMA-based hybrid models (ROADAR-1, ROADAR-2, and ROADARIMA) exhibit significant upward trends. Specifically, their average RMSE rises to 1.7045, 2.1833, 2.2335, and 10.3483 on the Mall, Company, Apartment, and School datasets, respectively, and their average MAE increases to 1.3763, 1.9939, 2.0434, and 9.8374 across the same datasets. This is because ARIMA, as a linear Gaussian model, struggles with high-frequency nonlinear data. When the water demand datasets are decomposed, most of the noise is concentrated in the residual component, so ROADARIMA and its variants show a cliff-like decline relative to the traditional ARIMA.
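The gap between linear models and high-frequency nonlinear dynamics can be illustrated with a toy least-squares AR(1) fit standing in for a full ARIMA model (the series and the AR(1) simplification are illustrative assumptions, not the paper's experiment):

```python
import numpy as np

def ar1_fit_predict(x):
    """Fit x[t] = a*x[t-1] + b by least squares and return
    one-step-ahead in-sample predictions."""
    X = np.column_stack([x[:-1], np.ones(x.size - 1)])
    coef, *_ = np.linalg.lstsq(X, x[1:], rcond=None)
    return X @ coef

rng = np.random.default_rng(1)
t = np.arange(400)

# Smooth, slowly varying series: well captured by a linear AR(1).
smooth = 10 + np.sin(2 * np.pi * t / 100) + rng.normal(0, 0.05, t.size)
# High-frequency nonlinear series: a linear one-step model struggles.
jagged = 10 + np.sin(2 * np.pi * t / 4) ** 3 + rng.normal(0, 0.05, t.size)

for name, x in (("smooth", smooth), ("jagged", jagged)):
    err = np.sqrt(np.mean((x[1:] - ar1_fit_predict(x)) ** 2))
    print(f"{name}: one-step RMSE = {err:.3f}")
```

The one-step RMSE on the jagged series is an order of magnitude larger, mirroring why the noisy, high-frequency residual component cripples the ARIMA-based hybrids.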

Fig. 7

Error scores of the 9 models on the four datasets. (Lower values indicate better results; only values between 0 and 1 are shown for readability.)

Fig. 8

Examples of one-step prediction on random sequences (10 points).

From Fig. 7, ROADLSTM has the lowest error bars across all experiments. As shown in Fig. 8, the predictions of the proposed ROADLSTM fit the true values most closely, which demonstrates its ability to model periodic variables flexibly. Compared with the second-best models, ROADLSTM achieves significant average improvements of 81.60% in RMSE, 78.03% in MAE, and 1.24% in NSE. Even when a dataset is severely nonlinear, ROADLSTM is found in practice to remain robust. Compared with its variants (ROADLSTM1 and ROADLSTM2), the proposed ROADLSTM optimizes the decomposition process by additionally cancelling noise residues, thereby obtaining a further significant improvement when processing seasonal and trend sequences.

Fig. 9

Multi-step prediction results of different models on the four datasets. (RMSE and MAE are measured in \(\text {m}^3\); NSE is unitless.)

Multi-step prediction

It is noted that the decomposition-based models achieve superior performance, with reduced error metrics compared with the corresponding standalone models. Therefore, in this section we investigate the LSTM-based hybrid models as prediction models for multi-step-ahead prediction. The RMSE, MAE, and NSE values of the proposed and comparison models are presented in Table 3, where the best value in each row is marked in boldface. As shown in Table 3, combined with CEEMD, the proposed model achieves the best performance among all models. This conclusion is further supported by Fig. 9, which shows the prediction performance of the proposed and comparison models over horizons of 1 to 7 steps ahead.

Table 3 Key steps' scores on the four types of datasets.
Table 4 Ablation studies of ROADLSTM on the school dataset.

Figure 9 illustrates that ROADLSTM produces better results than the other models on all metrics between 1 and 4 steps ahead. It is also evident that the performance of the comparison models on the School and Company datasets degrades rapidly as the step size increases. Note that this degradation is common to all of the prediction models between 4 and 7 steps ahead. It occurs because, in the recursive multi-step prediction process, the first-step prediction is fed back into the model as if it were the true value in order to obtain the subsequent predictions. The first-step prediction error therefore acts as injected noise, which significantly challenges the model's robustness and accuracy; once the accumulated error exceeds the model's tolerance, accuracy declines sharply. The error metrics of ROADLSTM, however, remain within the expected range, and in practice the proposed adaptive decomposition method prevents the predictions from rapidly becoming unwieldy and of limited practical utility.
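The error-accumulation mechanism of recursive multi-step forecasting can be seen even in a toy setting: an AR(1)-style model whose fitted coefficient is slightly wrong drifts further from the truth at every step, because each prediction is consumed as input for the next (the coefficients here are illustrative, not fitted values from the paper).

```python
import numpy as np

def recursive_forecast(last_value, coef, horizon):
    """Recursive multi-step strategy: each one-step prediction is fed
    back as if it were the true value for the next step."""
    preds, x = [], last_value
    for _ in range(horizon):
        x = coef * x          # apply the one-step model to its own output
        preds.append(x)
    return np.array(preds)

true_coef, est_coef = 0.95, 0.90   # the fitted model is slightly wrong
x0 = 10.0
truth = recursive_forecast(x0, true_coef, 7)
preds = recursive_forecast(x0, est_coef, 7)
print(np.abs(truth - preds))  # the gap widens with the horizon
```

Over a 7-step horizon the absolute error grows monotonically, which is exactly the degradation pattern observed for the comparison models between 4 and 7 steps ahead.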

Ablation studies

To better understand the contribution of the improved components to the overall model, ablation studies are conducted to investigate: (1) the forecasting performance of the variants of ROADLSTM; (2) the validity of time series decomposition for the model; (3) the impact of parameter search methods on the variants of ROADLSTM; and (4) the validity of the objective function modification of the parameter search method.

Variants of ROADLSTM

To verify the validity of the model, we performed ablation experiments on the School dataset, which has relatively large errors. The traditional LSTM is chosen as the baseline model. ROADLSTM3 uses random initialization to select the parameters of the decomposition module. For ROADLSTM4, we use RMSE, which measures the reconstruction error of the decomposed sequences, as the objective function of the search method to evaluate the decomposition results and select the parameters of the decomposition module. Finally, the full ROADLSTM uses multi-scale permutation entropy as the objective function.
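Multi-scale permutation entropy can be sketched from its standard definition: coarse-grain the series at several scales and compute the normalized ordinal-pattern entropy at each. This follows the textbook formulation; the paper's exact MPEV objective may differ in details such as scale range and normalization.

```python
import math
import numpy as np

def permutation_entropy(x, order=3, delay=1):
    """Normalized permutation entropy: Shannon entropy of the ordinal
    patterns of length `order`, scaled to [0, 1]."""
    n = len(x) - (order - 1) * delay
    counts = {}
    for i in range(n):
        pattern = tuple(np.argsort(x[i:i + order * delay:delay]))
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values())) / n
    return float(-np.sum(p * np.log(p)) / math.log(math.factorial(order)))

def coarse_grain(x, scale):
    """Average consecutive non-overlapping windows of length `scale`."""
    n = len(x) // scale
    return x[:n * scale].reshape(n, scale).mean(axis=1)

def multiscale_pe(x, scales=(1, 2, 3, 4), order=3):
    return [permutation_entropy(coarse_grain(x, s), order) for s in scales]

rng = np.random.default_rng(0)
noise = rng.normal(size=1000)
ramp = np.arange(1000.0)
print(multiscale_pe(noise))  # near 1 at every scale (highly irregular)
print(multiscale_pe(ramp))   # ~0 at every scale (fully ordered)
```

One plausible reading of the MPEV objective is then a statistic, such as the variance, of these per-scale entropies (e.g. `np.var(multiscale_pe(x))`), used to score candidate decomposition parameters; the precise definition is the paper's own.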

Result analysis

Table 4 summarizes the performance of the variants of ROADLSTM on the School dataset. In general, ROADLSTM shows the best results among all models, which verifies the effectiveness of this general framework for water demand forecasting. Specifically, the predictive performance of ROADLSTM3 outperforms the baseline model by 21.92%, which shows that decomposed sequences with fewer nonlinear features can achieve higher prediction accuracy than the original sequences. Compared with ROADLSTM3, ROADLSTM4 applies the parameter search method to the decomposition, but with no obvious difference in precision. This may be because the paired white noise of CEEMD is completely eliminated, so the reconstruction error of the decomposed sequences is always 0. In this case, ROADLSTM, which uses MPEV as the objective function, improves the predictive performance by 59.46% compared with ROADLSTM4. This means that RMSE, as used for the objective function in ROADLSTM4, is inappropriate. These experimental results verify the validity of the proposed objective function for water demand prediction.

Conclusion

This paper presents a comprehensive study on a robust multi-step water demand prediction approach within the framework of smart water management, which helps to address an imperative challenge with profound long-term economic implications in the industrial sector. The research introduces an innovative methodology for water demand forecasting, leveraging an adept decomposition technique to dissect the raw water demand series into multiple Intrinsic Mode Functions (IMFs), thereby facilitating a more nuanced analysis and forecasting process. This strategic decomposition effectively mitigates the inherent non-stationarity and nonlinearity of water demand time series, a critical step towards enhancing the accuracy of predictions.

To ensure practical applicability, the study advances a heuristic search algorithm designed to identify optimal parameters, aligning the model with real-world operational demands. Subsequently, an LSTM-based combined prediction model is proposed, tailored to forecast each IMF’s distinct characteristics with precision. Additionally, the introduction of a novel multi-scale permutation entropy variance function serves to quantify prediction error, bolstering the model’s resilience in the face of intricate and variable scenarios.

Through rigorous experimentation on real-world datasets, the proposed model demonstrates superior performance over established benchmarks across key metrics, including Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Nash-Sutcliffe Efficiency (NSE). The integration of an adaptive optimization strategy further endows the hybrid model with heightened stability and precision in multi-step predictions. This research, therefore, contributes a potent and reliable predictive tool to the arsenal of water management practices, underpinning sustainable development initiatives by optimizing water resource allocation and consumption.