Introduction

Water quality is a key indicator of ecological environmental protection and water resources management, and accurate prediction of water quality is of great significance to ensure the safety of water environment and realize the sustainable use of water resources. The focus of surface water management is to achieve high-precision prediction and early warning of water quality in key watersheds, lakes and reservoirs, so as to timely and accurate assessment of water quality trends to provide a scientific basis for decision-makers1.

In the field of water quality prediction, researchers have proposed various models and methods, mainly including traditional statistical models, machine learning models, and deep learning models. Traditional statistical models, such as Auto-Regressive Integrated Moving Average (ARIMA)2,3, Fuzzy Analytic Hierarchy Process (FAHP)4,5 and Extreme Learning Machine (ELM)6,7, perform well in handling linear time series data but are less effective in dealing with nonlinear and non-stationary data. Machine learning models, such as Support Vector Machines (SVM)8, Random Forests (RF)9, and Extreme Gradient Boosting (XGBoost)10, can handle certain nonlinear data, but they have limitations in capturing the spatiotemporal features of the data. Deep learning models11,12, such as Long Short-Term Memory (LSTM)13and Gated Recurrent Unit (GRU)14, have significant advantages in processing time series data and can capture long-term dependencies. However, single deep learning models may still have some shortcomings when dealing with complex water quality data, such as sensitivity to noise and outliers. In recent years, hybrid models have gained attention by combining different methods15,16, such as Convolutional Neural Network-Long short-term memory (CNN-LSTM)17 and Temporal Convolutional Network-Gated Recurrent Unit (TCN-GRU)18, to leverage their respective strengths and improve prediction accuracy and robustness. Nonetheless, there is still room for improvement in these hybrid models when dealing with complex water quality data19. Signal decomposition methods can effectively address the non-stationarity of data20,21, so they can be introduced into water quality prediction fusion models for data preprocessing, helping to extract short-term variation features and thereby enhancing the prediction accuracy of the fusion models22,23.

Using water quality data from the Shuanggangkou section of the Jiangshan River in Quzhou City, Zhejiang Province, as a sample, this paper proposes a SCV-CBA model that combines Sparrow Search Algorithm (SSA), Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN), Variational modal decomposition (VMD), Bidirectional Long Short-Term Memory Network (CNN, BiLSTM) and Attention for water quality prediction24,25,26. The methodology involves several steps: Firstly, the CEEMDAN algorithm decomposes the complex nonlinear, non-stationary water quality data into several Intrinsic Mode Functions (IMFs), effectively reducing mode mixing. Next, each IMF component’s Sample Entropy (SE) is calculated, and SSA-optimized K-means clustering is applied to classify the IMFs into high, medium, and low-frequency components, reducing model complexity. Subsequently, the SSA-optimized VMD algorithm is applied for a secondary decomposition of high-frequency IMFs to further extract useful high-frequency information and enhance the signal-to-noise ratio. For feature extraction, each component is fed into a CNN-BiLSTM-Attention model, where CNN captures local features within the water quality signal, BiLSTM models temporal dependencies, and Attention mechanism highlights key features, allowing the model to adaptively focus on the information most relevant to prediction accuracy. Finally, the individual prediction sequences are linearly combined to generate the final prediction result. Comparative experiments demonstrate that the SCV-CBA model significantly improves prediction accuracy, exhibits strong generalization ability, effectively mitigates non-stationary data interference, and is suitable for accurate prediction of complex water quality data.

Research methods

CEEMDAN

CEEMDAN is an improved adaptive signal decomposition method that effectively addresses the mode mixing problem present in Empirical Mode Decomposition (EMD) and Ensemble Empirical Mode Decomposition (EEMD) methods. By introducing noise-assisted decomposition and weighted averaging strategies, CEEMDAN significantly enhances the stability and reproducibility of signal decomposition. When dealing with complex, non-stationary time series data, CEEMDAN decomposes the signal into multiple IMFs, each containing different frequency components and being relatively stable. The decomposition accuracy and stability of CEEMDAN are influenced by the noise amplitude and the number of decomposition iterations.

K-means clustering algorithm

The K-means clustering algorithm is a classic unsupervised learning algorithm used to partition a dataset into K clusters. It minimizes the within-cluster sum of squared errors through an iterative optimization process, ensuring that the samples within each cluster are as close as possible in the feature space. The core idea of K-means clustering is to start with K initially randomly selected centroids, and then iteratively adjust the positions of the centroids and assign samples to the nearest centroid until the clusters reach a stable state.

VMD

VMD is an advanced adaptive signal decomposition technique specifically designed to address the mode mixing problem in traditional Empirical Mode Decomposition (EMD) methods. VMD uses a variational approach, converting the signal decomposition problem into a variational problem. Through the variational framework and bandwidth constraints, it effectively reduces mode mixing, thereby enhancing the accuracy and stability of signal processing. The decomposition performance of VMD is influenced by the number of modes k and the penalty factor α.

SSA

SSA is a bio-inspired optimization algorithm that mimics the foraging and alert behavior of sparrows. The core concept of SSA involves a population of sparrows, categorized as “producers” (foragers) and “scroungers” (watchers). Producers focus on exploring optimal areas to locate the best resources, Adjusts the position of producers using a shift based on the global best solution, which ensures convergence, while scroungers monitor for potential threats, scroungers move towards the best position found to avoid suboptimal solutions, enhancing global search adaptability.

CNN

CNN is a deep learning model composed of convolutional layers, pooling layers, and fully connected layers, capable of effectively extracting and learning feature information from input data. The convolutional layer uses convolution operations to extract local features, and by sharing weights, it reduces the number of model parameters, enhancing the model’s generalization ability. The pooling layer, usually following the convolutional layer, is used to reduce the dimensionality of the feature maps, decreasing computational complexity while retaining important feature information. Common pooling methods include max pooling and average pooling. The fully connected layer integrates the features extracted by the convolutional and pooling layers to perform the final classification or regression tasks.

BiLSTM

BiLSTM is an improved type of recurrent neural network specifically designed to handle sequence data and effectively capture contextual information. Unlike traditional LSTM networks, BiLSTM processes information from both the forward and backward directions of the sequence, allowing for a more comprehensive understanding of the temporal characteristics of the data. BiLSTM consists of two LSTM networks: one processes the forward sequence, and the other processes the backward sequence. At each time step of the input sequence, BiLSTM combines the forward and backward hidden state information to generate the output, thereby fully utilizing the dependencies between the preceding and following contexts in the sequence.

Attention mechanism

The Attention mechanism enhances learning by calculating the relevance between input features and the target task, thereby highlighting key information while suppressing irrelevant or minor details. This allows the model to focus more effectively on features that contribute to the task, improving both learning performance and predictive accuracy. In the attention mechanism, three elements are typically introduced: Query, Key, and Value. First, the similarity between the Query and Key is calculated to obtain the weights for each input feature. Next, these weights are multiplied by the corresponding values and summed to yield a weighted output. This process dynamically adjusts the importance of different features, enabling the model to adaptively focus on information from different positions. This flexible weight-adjustment mechanism makes attention especially effective in fields such as natural language processing and time-series prediction, where long-range dependencies are crucial.

Model building

The prediction of water quality data faces the problem of obvious non-stationarity and non-linearity, so the model combining CEEMDAN, K-means, VMD, CNN, BiLSTM and Attention provides an effective solution.

Data preprocessing and decomposition

Data preprocessing and decomposition is the first step in the model, with the primary goal of denoising and decomposing the raw water quality data to provide cleaner and more useful signals for subsequent feature extraction and modeling.

Data preprocessing

Data preprocessing includes handling outliers and missing values, as well as performing normalization operations to ensure the quality and consistency of the input data.

  1. 1.

    Handling of outliers and missing values

Due to possible situations such as power outages or sensor malfunctions at monitoring stations, water quality data may be missing or contain anomalies. To better capture and reflect the smooth changes and complex trends in water quality parameters, outliers are handled using removal methods, and missing values are addressed using spline interpolation. Spline interpolation divides the entire interval into multiple sub-intervals (typically with data points as boundaries), and within each sub-interval, a cubic polynomial is used for interpolation. Each polynomial is fitted through two adjacent data points.

Assuming that there are n + 1 known data points, the \((x_{0} ,y_{0} )\), \((x_{1} ,y_{1} )\), \(\ldots\), \((x_{n} ,y_{n} )\), spline interpolation method defines a cubic polynomial Si(x) on each subinterva \(\left[ {x_{i} ,x_{i + 1} } \right]\) l:

$$S_{i} (x) = a_{i} + b_{i} (x - x_{i} ) + c_{i} (x - x_{i} )^{2} + d_{i} (x - x_{i} )^{3}$$
(1)

Request:

$$\left\{ \begin{gathered} S_{i} (x_{i} ) = y_{i} \hfill \\ S_{i} (x_{i + 1} ) = y_{i + 1} \hfill \\ S_{i}^{\prime} (x_{i + 1} ) = S_{i + 1}^{\prime} (x_{i + 1} ) \hfill \\ S_{i}^{^{\prime\prime}} (x_{i + 1} ) = S_{i + 1}^{^{\prime\prime}} (x_{i + 1} ) \hfill \\ \end{gathered}\right.$$
(1)

Together with the natural boundary conditions \(S^{\prime\prime}_{0} (x_{0} ) = 0\) and \(S^{\prime\prime}_{n} (x_{n} ) = 0\), these conditions form a system of linear equations that can be solved to locate the coefficients \(a_{i} ,b_{i} ,c_{i} ,d_{i}\) of each sub-interval.

  1. 2.

    Data normalization

In order to make the training of the model more stable and effective, min–max normalization is used to map the water quality monitoring data to the range of [0,1], and after training and testing, the prediction results were reversed

$$x^{\prime} = \frac{{x - x_{\min } }}{{x_{\max } - x_{\min } }}$$
(2)

In the equation, x' is the normalized data, and x is the original data of water quality monitoring; xmin is the minimum value of the raw data of water quality monitoring; xmax is the maximum value of the raw data for water quality monitoring.

Data decomposition

  1. 1.

    CEEMDAN decomposition

Using CEEMDAN technology, the preprocessed raw water quality data is decomposed into several IMFs, with each IMF representing different frequency components of the data, thereby providing high-quality input for subsequent processing steps. The algorithm steps are:

  • Noise signal generation

Superimpose a sequence of white noise \(W_{i} (t)\) with the same variance on the original signal \(X(t)\) to generate a noise-assisted signal:

$$X_{i} (t) = X(t) + \varepsilon W_{i} (t)$$
(3)

In the equation, ε is a very small positive number that is used to control the amplitude of noise; i = ,2,…,N, Represents different noise implementations.

  • Calculate the first IMF

Perform EMD decomposition on each noise-assisted signal \(X_{i} (t)\) to obtain the first IMF, denoted as \(IMF_{1,i} (t)\).The first IMF of the original signal is obtained by averaging the first IMF of all noise-assisted signals:

$$IMF_{1} (t) = \frac{1}{N}\sum\limits_{i = 1}^{N} {IMF_{1,i} (t)}$$
(4)

N represents the length of the original signal or noise auxiliary signal, that is the total number of samples contained in these signals.

  • Residual calculation

Remove the first IMF from the original signal to get the residual signal:

$$R_{1} (t) = X(t) - IMF_{1} (t)$$
(5)
  • Recursive computation

The process of (1) and (2) is repeated with the residual signal \(R_{i} (t)\) as a new input signal to generate a new IMF. Recursively calculates until the residuals are close to zero. For the k-th IMF, the noise auxiliary signal is:

$$X_{k,i} (t) = R_{k - 1} (t) + \varepsilon \sum\limits_{i = 1}^{k - 1} {IMF_{j} (t)}$$
(6)

Then the k-th IMF is calculated by the same method

$$IMF_{k} (t) = \frac{1}{N}\sum\limits_{i = 1}^{N} {IMF_{k,i} (t)}$$
(7)
  • Residuals and stop criteria

If the residuals \(R_{k} (t)\) are close to zero, the decomposition process is terminated. Otherwise, continue to iterate.

  1. 2.

    K-means Clustering Process with SSA Optimization

  • Define fitness function

The objective of K-means clustering is to minimize the mean squared error (MSE) between data points and their nearest cluster centers. The MSE can be defined as:

$$MSE = \frac{1}{N}\sum\limits_{i = 1}^{N} {\left\| {x_{i} - c_{nearest} (x_{i} )} \right\|^{2} }$$
(8)

where xi denotes the i-th data point, cnearest(xi) is the closest cluster center to xi, and N is the total number of data points.

  • Initialize cluster centers with SSA

Randomly generate a set of candidate cluster center configurations, each evaluated based on MSE as the fitness metric. In each iteration, SSA updates the positions of producers and scroungers to reduce MSE. After reaching the maximum number of iterations or achieving satisfactory fitness, the best configuration identified by SSA is used as the optimized initial cluster centers for K-means.

  • Execute K-means clustering

Using the SSA-optimized initial centers, K-means clustering is applied to compute the distances and assign cluster labels.The cluster centers are updated iteratively until convergence criteria are met.

  1. 1.

    Calculate the distance from data points to cluster centers

For each data point in the dataset, calculate its distance to each of the SSA-optimized cluster centers. Typically, Euclidean distance is used for this measurement. For any data point xi and cluster center cj, the Euclidean distance is given by:

$$d(x_{i} ,c_{j} ) = \sqrt[{}]{{\sum\limits_{k = 1}^{p} {(x_{i,k} - c_{j,k} )} }}$$
(9)

where p is the dimension of the data.

  1. 2.

    Assign cluster labels

Each data point is assigned to the nearest cluster center. Specifically, for each data point xi, find the closest cluster center cj and assign xi to the j-th cluster. This process generates a cluster label array for the dataset, where each label indicates the cluster membership of each data point.

  1. 3.

    Update cluster centers

Once all data points have been assigned to a cluster, the center of each cluster is recalculated. The new center is the mean of all data points within the cluster. If the j-th cluster includes data points {\(x_{{i_{1}}}\), \(x_{{i_{2} }}\),…,\(x_{{i_{n} }}\)}, the updated cluster center cj is calculated as:

$$c_{j} = \frac{1}{n}\sum\limits_{k = 1}^{n} {x_{{i_{k} }} }$$
(10)

where n is the number of data points in that cluster.

  1. 4.

    Check for convergence

The updated cluster centers are the same as in the previous iteration, indicating that the algorithm has converged.

  1. 5.

    Clustering results

By optimizing the initial centers using SSA, K-means clustering achieves more accurate partitioning of Sample Entropy (SE) features into high, medium, and low-frequency components, thus enhancing feature extraction for the power prediction model.

VMD decomposition with SSA optimization

  1. 1.

    Define fitness function

The goal of VMD decomposition is to minimize the reconstruction error, which measures the difference between the original signal and the sum of decomposed modes. The reconstruction error is defined as:

$${\text{Re}} construction \, Error = \sum\limits_{t = 1}^{T} {(x(t) - \sum\limits_{k = 1}^{K} {u_{k} (t)} )^{2} }$$
(11)

where x(t) is the original signal, uk(t) is the k-th mode, and T is the length of the signal.

  1. 2.

    Optimize VMD Parameters with SSA

    • Population Initialization: Generate a set of candidate solutions, each representing a combination of k and α values.

    • Iterative Update: SSA updates the candidate solutions iteratively, optimizing for the best combination of k and α. Producer Update is Producers adjust their positions based on the best current solution, searching for more optimal VMD parameters.Scrounger Update is Scroungers move towards the best solution, enhancing the search effectiveness and avoiding poor configurations.

    • Select Optimal Parameters: After a set number of iterations, the SSA-optimized combination of k and α is selected for VMD.

Perform VMD decomposition

With the SSA-optimized k and α, the VMD algorithm is executed to decompose the signal into modes.These decomposed modes represent different frequency components of the signal, providing refined features for subsequent model input.

Feature extraction

Feature extraction mainly uses CNN to extract features from the decomposed signal to capture the spatial patterns and local features in the signal. By applying convolutional filters to the high-frequency, mid-frequency and low-frequency components, the CNN effectively extracts discriminative representations that reflect different fluctuation behaviors. These extracted features are subsequently fed into the BiLSTM-Attention module for sequential modeling and adaptive weighting.

  • Convolution operation: The convolution operation is carried out on the processed IMF components to extract the fluctuation characteristics of the data signal in the signal.

  • Pooling operation: The dimensionality of the convoluted features is reduced through the pooling layer, the key features are retained, the data redundancy is reduced, and the computational complexity is reduced.

Feature sequence modeling

Feature sequence modeling is to capture the long-term and short-term dependencies in time series data by using BiLSTM to model the features extracted by CNN.

  • Bidirectional processing: BiLSTM processes time series data in two directions (forward and backward), allowing it to simultaneously capture both past and future temporal dependencies.

  • Long short-term memory: Through the memory gating mechanism, BiLSTM can effectively retain long-term dependent information while filtering out irrelevant short-term noise.

  • Output feature sequence: The output feature sequence after BiLSTM processing not only contains time series information, but also retains the spatial features extracted by CNN.

BiLSTM is able to model complex time-dependent relationships, which is suitable for multiple factors and long-term trends that need to be considered in water quality prediction, and the bidirectional processing and memory mechanism ensures that the model can make full use of historical and future information, improving the accuracy of prediction results.

Dynamic attention mechanism

The Attention mechanism is then applied to the BiLSTM output to highlight key features dynamically, allowing the model to focus on the most relevant information for the task.

  • Weight calculation: The Attention mechanism computes weights by evaluating the similarity between Query and Key vectors, assigning higher weights to features more relevant to the prediction.

  • Weighted summation: These weights are applied to the Value vector, generating a weighted sum that emphasizes important features and refines the output for the final prediction layer.

  • Adaptive focus: The Attention mechanism enables adaptive focusing on different features, enhancing prediction accuracy in time-series data by highlighting the most influential features.

Model integration and prediction

After attention-enhanced feature extraction, the model integrates the final sequence for prediction.

  • Feature integration: The Attention-weighted BiLSTM feature sequence is consolidated as the final input feature set for the prediction layer.

  • Prediction output: A fully connected layer processes the integrated features to generate the final prediction output.

Figure 1 shows the flow of the SCV-CBA algorithm.

Fig. 1
Fig. 1The alternative text for this image may have been generated using AI.
Full size image

Flow of the SCV-CBA algorithm.

Experiments and analysis

Data description

The water quality dataset of this study is derived from the public data of China Environmental Monitoring Station (https://www.cnemc.cn/), and Qingyue Data (data.epmap.org) provides environmental data processing support. In this paper, a total of 6,839 water quality monitoring data from 0:00 on January 1, 2021 to 24:00 on May 1, 2024 in the Jiangshan Port Shuang Gangkou Section, the first batch of "most beautiful water stations" in China, were taken as the research object. The time interval of water quality monitoring is 4 h, and the data include water temperature (WT), pH, dissolved oxygen (DO), permanganate index (CODMn), ammonia nitrogen (NH3-N), and total phosphorus(TP), total nitrogen (TN), electrical conductivity (EC), turbidity (TU), a total of 9 water quality indicators.

DO is an important indicator of water ecological health, which is sensitive to changes in water pollution, temperature changes and photosynthesis. For water quality monitoring and management, changes in DO can reflect changes in water quality in a timely manner. According to the "Environmental Quality Standard for Surface Water" (GB3838-2002), the "Sanitary Standard for Drinking Water" (GB5749-2022) and the obtained waterquality data. In this study, DO was selected as the predictable indicator.

Missing data processing

In the process of data acquisition, there are problems that some data are abnormal or missing due to system or human factors. In order to ensure the validity of the experiment and better capture the trend and fluctuation of the data, the deletion method and spline interpolation method were used to deal with the abnormal and missing data. After the missing data was completed, the dataset was divided into the training set and the test set in a ratio of 7:3, and the changes of DO after filling in the missing data were shown in Fig. 2. As can be seen from Fig. 2, DO data fluctuates strongly with time and has poor stability. Therefore, after DO data is decomposed by CEEMDAN-VMD, the unstable data can be decomposed into stable eigenmodal data.

Fig. 2
Fig. 2The alternative text for this image may have been generated using AI.
Full size image

DO data curve.

Evaluation indicators

In order to evaluate the prediction performance of the model, the evaluation indexes such as mean square deviation (MSE), root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R2) are used for evaluation(12)–(15).

$$MSE = \frac{1}{N}\sum\limits_{i = 1}^{N} {(y_{i} - \hat{y}_{i} } )^{2}$$
(12)
$$RMSE = (\frac{1}{N}\sum\limits_{i = 1}^{N} {(y_{i} - \hat{y}_{i} )^{2} } )^{\frac{1}{2}}$$
(13)
$$MAE = \frac{1}{N}\sum\limits_{i = 1}^{N} {|y_{i} - \hat{y}_{i} |}$$
(14)
$$R^{2} = 1 - \frac{{\sum\nolimits_{i = 1}^{N} {(y_{i} - \hat{y}_{i} )^{2} } }}{{\sum\nolimits_{i = 1}^{N} {(y_{i} - \overline{y})^{2} } }}$$
(15)

where N is the number of samples; yi is the true value of the i-th sample;\(\hat{y}_{i}\) is the predict valur of the i-th sample;\(\overline{y}_{i}\) is the mean of the true values of the i-th sample.

MSE, RMSE, and MAE measure the deviation between the actual values and the predicted values; smaller values indicate more accurate model predictions. R2 reflects the accuracy of the model’s fit to the data, with a range from 0 to 1. The closer R2 is to 1, the better the model’s fit27.

Feature correlation analysis

The water environment is a complex and dynamic system, and various water quality indicators affect each other. When forecasting, it is necessary to analyze the correlation of characteristics and explore the internal relationship of data. In this paper, the correlation between each index and DO was analyzed by Pearson correlation coefficient method28. Select the feature indicators with a correlation coefficient greater than 0.2 with DO index (Fig. 3), namely WT, pH, TN, and EC, as multiple input variables for prediction.

Fig. 3
Fig. 3The alternative text for this image may have been generated using AI.
Full size image

Pearson correlation heat map of water quality indicators.

Result analysis

Model parameter settings

The noise standard deviation for CEEMDAN decomposition is set to 0.2, the number of noise additions is set to 500, and the maximum number of iterations is set to 5000. The number of clusters for K-means is set to 3, and the sparrow population size is set to 20. For VMD, the lower bounds for the number of modes K and the penalty factor α are set to [2, 1000], and the upper bounds are set to [6, 5000]. The optimizer used for the CNN-BiLSTM layers is Adam, with the learning rate set to 0.001, and the learning rate scheduling method is piecewise. The activation function is set to ReLU. The CNN convolution kernel size is set to1,3, using 16 filters. The pooling window size is set to1,2, the BiLSTM layer is set with 15 neurons, and the self-attention layer is set with 1 head and 2 key and value channels.

Result analysis

DO data of the Shuang Gangkou section was decomposed into 14 IMFs components using CEEMDAN,as shown in Fig. 4. IMF1 primarily represents high-frequency noise, caused by sensor errors or short-term environmental fluctuations. IMF2 captures short-term water quality variations, such as diurnal changes or local pollution events. IMF3 reflects more distinct periodic fluctuations, potentially linked to climate, watershed management, or seasonal changes. IMF4 indicates mid-frequency periodic variations, influenced by factors like rainfall or temperature changes on a monthly or quarterly scale. IMF5 and IMF6 represent longer-term periodic changes, associated with annual trends or long-term climate effects. IMF7 and lower-frequency components (IMF8-IMF14) gradually reveal long-term water quality trends, incorporating influences from environmental policies, watershed development, and climate change. Among them, IMF11 to IMF14 represent overall trends, with IMF14 serving as the baseline trend, reflecting the fundamental direction of water quality changes.

Fig. 4
Fig. 4The alternative text for this image may have been generated using AI.
Full size image

CEEMDAN’s decomposition of raw DO data.

The optimal K-means clustering centers obtained through SSA, with the average sample entropy values for each cluster being 0.7395, 0.4394 and 0.0760 respectively, are shown in Fig. 5. These represent the high-frequency IMF component Co-IMF1, the mid-frequency IMF component Co-IMF2, and the low-frequency IMF component Co-IMF3 generated after K-means clustering. The high-frequency IMF component represents rapid fluctuations in water quality data, reflects the dynamic characteristics of water quality in a very short period of time, usually associated with high-frequency disturbances or transient environmental changes. The mid-frequency IMF component reflects the medium-time scale changes in water quality data. These components capture changes within a daily, weekly, or monthly cycle, reveal the characteristics of fluctuations in the intermediate cycle of water quality, representing trend changes and fluctuations over the medium term. The low-frequency IMF component represents long-term trends and slow-changing processes in water quality data. The overall trend of water quality over time is revealed.

Fig. 5
Fig. 5The alternative text for this image may have been generated using AI.
Full size image

K-means clustering results of IMF components.

The optimal mode number k obtained from SSA for VMD is 4, and the optimal penalty factor α is 4513.3164. Figure 6 shows the secondary VMD decomposition of the high-frequency IMF component. This component captures the minute fluctuations at the shortest time scale in the water quality data, reflecting the very rapid and intense changes in the water quality data. In the Fig. 6, IMF5 and IMF6 correspond to the mid-frequency component Co-IMF2 and the low-frequency component Co-IMF3 from Fig. 5, respectively.

Fig. 6
Fig. 6The alternative text for this image may have been generated using AI.
Full size image

High-frequency, mid-frequency and low-frequency components decomposed by VMD.

It can be seen that the raw water quality data, decomposed into 6 components using CEEMDAN-VMD, shows a significant improvement in stationarity despite the increase in data volume. From Fig. 6, it can be observed that the temporal evolution of DO is composed of both a long-term trend component and a periodic seasonal component. The trend component (IMF6) exhibits slow, low-frequency fluctuations with relatively stable amplitude and a long periodicity, reflecting the overall evolutionary process of DO concentration over time. This trend suggests a slight degradation in water quality, potentially associated with long-term ecological pressures such as eutrophication and the accumulation of pollutants. In contrast, the seasonal component (IMF5) presents a relatively stable medium-frequency structure, characterized by clear annual periodicity. It captures the seasonal variation pattern of DO, which aligns with the typical biophysical behavior of lower oxygen levels in spring and summer, and higher levels in autumn and winter. Additionally, IMF1 through IMF4 display higher frequencies and more intense fluctuations, representing short-term or localized disturbances such as sudden pollution events or heavy rainfall. These components may serve as valuable indicators for anomaly detection and early warning systems. Therefore, accurately identifying and separating the trend and seasonal components in time series decomposition is essential for enhancing the model’s understanding of the intrinsic characteristics of water quality data.

Effect verification

Comparative test

The experimental results of this model are compared with the experimental results of the more popular prediction models SVM, TCN, TCN-GRU and parallel-transformer-LSTM. The fitting results of each model are shown in Fig. 7, and it can be seen from Fig. 7 that compared with other models, the model proposed in this study has the highest degree of fitting of the prediction curve. Although the prediction curves of other models are consistent with the general trend of the measured curves, these models do not deal well with the short-term fluctuations of data.

Fig. 7
Fig. 7The alternative text for this image may have been generated using AI.Fig. 7The alternative text for this image may have been generated using AI.
Full size image

Fitting results of each model.

It can be seen from the zoomed in on the sample points from 410 to 615 in the test set in Fig. 7 that although the SVM model performs well in processing linear data, its prediction performance is poor in the face of water quality data with complex nonlinear and noisy characteristics. It is difficult for the TCN model to fully capture the high-frequency fluctuation and trend change information in the water quality data, so there may be insufficient extraction of the actual fluctuation trend information in the prediction process. Because the TCN-GRU model does not fully preprocess the input non-stationary water quality data, its prediction results are lacking in capturing fluctuation information. The parallel-transformer-LSTM algorithm combines the strengths of Transformer and LSTM, effectively capturing both long-term trends and short-term fluctuations in water quality data. However, the model still has some limitations when dealing with high-frequency information and complex fluctuations. In contrast, the SCV-CBA model performs more detailed decomposition and feature extraction of water quality data, so as to better capture the complex fluctuations and nonlinear trends in the data, and the prediction accuracy is significantly improved.

In addition, the prediction performance of each model is quantitatively evaluated using four metrics: MSE, RMSE, MAE, R2 and predition time. The comparison of the prediction performance of different models is shown in Table 1. The results demonstrate that, although the proposed model entails a relatively complex architecture, it achieves significantly superior prediction accuracy while maintaining computation time within an acceptable range. In contrast, alternative methods offer faster execution but at the cost of markedly reduced predictive performance. Taken together, the proposed approach effectively balances accuracy and robustness in the context of complex water quality forecasting. Despite slightly higher computational overhead,its enhanced capability in processing multi-dimensional features provides a promising and practical pathway for real-time water quality prediction applications.

Table 1 Comparison of prediction performance.

Ablation test

Conducting ablation experiments comparing the model without SSA, CEEMDAN-VMD, and Attention against the original model, the prediction results of different ablation models are shown in Fig. 8. The quantitative performance metrics from the ablation study are compared in Table 2. As observed in Fig. 8, the prediction accuracy of models without SSA, dual decomposition, or Attention exhibit weaker alignment with the 45° fit line compared to the proposed model. Furthermore, Table 2 shows that the introduction of SSA improves the decomposition accuracy of water quality feature components, enhancing the model’s feature extraction and prediction accuracy, with reductions in MSE, RMSE, and MAE by 0.16098, 0.20418, and 0.14302, respectively, and an increase in R2 by 0.0406. Introducing the dual decomposition technique enables multi-scale decomposition of the original water quality data, effectively extracting components of different frequencies, enhancing the model’s ability to capture detailed features, and reducing data non-stationarity. This results in MSE, RMSE, and MAE reductions by 0.37645, 0.38367, and 0.28590, respectively, with an R2 improvement of 0.12308. Additionally, it was observed that incorporating the Attention mechanism slightly improved the model’s ability to capture critical information in the time series data. However, performance gains were limited, likely because the Attention mechanism is more effective with data exhibiting complex time dependencies, whereas the correlation between water quality data series is relatively weak. The CNN and BiLSTM layers in the model may have already effectively extracted key features and time dependencies, which could explain the minimal performance gains from Attention. In summary, the inclusion of SSA, dual decomposition, and Attention enhances the model’s prediction accuracy.

Fig. 8
Fig. 8The alternative text for this image may have been generated using AI.
Full size image

Predicted results of ablation test.

Table 2 Comparison of prediction performance in ablation assays.

Generalization ability test

To validate the model’s predictive ability for other water quality indicators at different monitoring stations, experiments were conducted using CODMn at Shuang Gangkou section and TN at Xia Tong section of the Qujiang River. Using Pearson correlation coefficients, it was determined that for predicting CODMn at Shuanggangkou, NH3-N, TP, and TN were selected as input features. For predicting TN at Xia Tong section, WT, DO, NH3-N, and TP were selected as input features. The results (Table 3) show that the proposed prediction model achieves the smallest values in RMSE, MAE, and MAPE, and a larger R2, indicating that the model is transferable for predicting water quality data at different sites. This demonstrates the model’s general applicability and stability in predicting time series data, and it performs well in water quality time series prediction, having significant practical implications.

Table 3 Comparison of prediction models with different parameters at each site.

Effect of feature dimensionality

To further investigate the impact of input feature dimensionality on DO prediction performance, four comparative experiments were designed: (1) single-parameter input using only the historical sequence of DO; (2) multi-parameter input based on features with Pearson correlation coefficients greater than 0.5 with DO (WT); (3) multi-parameter input based on features with Pearson correlation coefficients greater than 0.3 with DO (WT, pH); (4)multi-parameter input based on features with Pearson correlation coefficients greater than 0.2 with DO (WT, pH, TN, EC).

The results (Table 4) demonstrate that the single-parameter model, although capable of capturing temporal dependencies in DO, shows relatively lower prediction accuracy due to the lack of external influencing factors. In contrast, the multi-parameter input model achieves significantly improved performance, indicating that incorporating strongly correlated water quality indicators helps uncover the complex nonlinear relationships underlying DO fluctuations.

Table 4 Comparison of prediction performance in ablation assays.

The mixed-feature input model achieves a trade-off between feature dimensionality and predictive performance, outperforming the single-parameter model while maintaining lower input complexity than the full correlated feature set. This strategy proves especially beneficial in practical scenarios with limited sensing resources or computational constraints.

Overall, the results suggest that incorporating multiple correlated parameters as input significantly enhances prediction accuracy and model robustness, providing valuable guidance for feature selection and model deployment in real-world water quality monitoring applications.

Conclusion

According to the experimental results, it can be clearly seen that the SCV-CBA prediction model proposed in this paper shows excellent prediction performance on the whole test set. There is a good fitting relationship between the predicted value and the actual value of the water quality data, and a satisfactory prediction effect is obtained. These results further demonstrate the effectiveness and reliability of the prediction model in the water quality prediction task, and the water quality status can be further evaluated according to the prediction results.

  1. 1.

    Correlation analysis was used to extract water quality elements with strong correlation as input features, which effectively improved the prediction accuracy.

  2. 2.

    The use of CEEMDAN-VMD is conducive to fully mining the fluctuation trend and detailed characteristics of the data in the process of short-term oscillation, so as to effectively extract the characteristic information of DO and improve the accuracy of water quality prediction。

  3. 3.

    Using SSA helps optimize the clustering centers in K-means and the parameters in VMD decomposition, ensuring more stable IMF clustering and more accurate VMD parameters. This leads to improved feature extraction and higher prediction accuracy in subsequent forecasting.

  4. 4.

    The experimental results using the real-time monitoring data of the national water quality monitoring station show that the prediction method based on the mixed model has higher prediction accuracy, and the four performance indexes of MSE, RMSE, MAE and R2 are better than the comparison model.