Comparative analysis of machine learning models for detecting water quality anomalies in treatment plants

Prabu, P.; Alluhaidan, Ala Saleh; Aziz, Romana; Basheer, Shakila

doi:10.1038/s41598-025-15517-4

Article
Open access
Published: 19 August 2025

Scientific Reports volume 15, Article number: 30453 (2025)

Abstract

Water is one of the most critical and finite resources on our planet. As the demand for freshwater continues to grow, effectively managing and purifying existing water sources becomes increasingly important. This study introduces a machine learning-based approach for enhancing water quality monitoring and anomaly detection in treatment plants using a modified Quality Index (QI). The proposed method integrates an encoder-decoder architecture with real-time anomaly detection and adaptive QI computation, providing a dynamic evaluation of water quality. In addition to developing this model, we present a comparative analysis with several existing machine learning models, demonstrating the effectiveness of our approach in detecting water quality anomalies. The revised QI is continuously updated using real-time sensor data, aiding decision-making in treatment operations. Experimental results show that the proposed model achieves an accuracy of 89.18%, precision of 85.54%, recall of 94.02%, Critical Success Index of 93.42%, Matthews Correlation Coefficient of 88.40%, delta-P of 94.37%, and Fowlkes–Mallows Index of 89.47%. These results highlight the model's predictive capability and its practical utility in improving water treatment plant efficiency. By combining machine learning with adaptive quality assessment, this study contributes to advancing intelligent monitoring solutions in water management.

Introduction

Water is a fundamental resource for human survival and development. Human activity and natural disasters are increasingly contaminating and depleting water resources, so effective and reliable water filtration methods are urgently needed¹. Conventional purification techniques, such as filtration and chemical treatment, are limited by their cost, duration, and sustainability. These limitations motivate the application of machine learning to the regulation of water purification processes. Machine learning, a subset of artificial intelligence, trains computer models to identify patterns within large datasets². In the water purification sector, machine learning algorithms can detect and predict the presence of toxins in water by analysing data from various sources, including sensors, and can handle large volumes of intricate data efficiently and accurately³. Preserving the purity of water and preventing the spread of waterborne illnesses is crucial. Algorithms can be trained to identify pollutants in water samples, such as chemicals, bacteria, and viruses⁴, which enables treatment facilities to remove these contaminants before water is delivered to residences. The same approach is effective for monitoring variations in water quality over time. Through ongoing observation and analysis of data, machine learning algorithms can detect variations in water quality that result from seasonal shifts, natural disasters, or anthropogenic influences, enabling authorities to act proactively and avert potential issues before they arise⁵.

Conventional approaches to water purification, including chemical treatments, often incur high costs and can adversely affect the environment. Machine learning algorithms require few resources and can be applied across various water sources, presenting a cost-effective and sustainable solution⁶. Machine learning also offers valuable insights into patterns of water usage, aiding the management of water resources. By examining data from water meters, algorithms can pinpoint regions with elevated water usage and help officials formulate approaches to minimise water loss, which is particularly significant in regions facing water scarcity⁷. Machine learning can also improve the accuracy of water demand predictions. By examining historical data and considering factors such as population growth and weather patterns, algorithms can forecast future water demand⁸. This enables water treatment facilities to plan effectively and guarantee a steady provision of clean water to the community⁹. The effectiveness of machine learning in water purification management hinges on accurate and reliable data, so water quality data must be collected and monitored systematically at consistent intervals¹⁰.

The Quality Index (QI) serves as an essential component of water purification management, significantly contributing to the safety and cleanliness of the global water supply. The QI assesses the extent of contamination and the concentration of pollutants found in water sources. The evaluation encompasses a thorough analysis of water quality, considering chemical, physical, and biological elements¹¹. The QI serves as a crucial indicator of the overall health of a water source and plays a significant role in pinpointing potential risks to human health. The QI serves as a framework for regulatory agencies and water treatment facilities to assess the level of treatment required to ensure water is safe for consumption. The standard of acceptability for water quality is established, which is essential in the design of treatment processes and the implementation of control measures to guarantee that the treated water complies with the necessary standards¹². This guarantees that the water provided to residences and commercial establishments is safe for consumption, culinary purposes, and various everyday tasks. The QI helps in monitoring the effectiveness of water treatment processes. By regularly measuring the QI, treatment plant operators can track changes in water quality, identify any potential issues, and make necessary adjustments to the treatment process¹³. This is especially important in managing the treatment of different sources of water, as they may have varying levels of contamination and require different treatment methods. The QI also serves as a benchmark for evaluating the success of water purification management strategies and determining if any improvements or interventions are needed. The QI serves as an essential instrument for evaluating the overall health of aquatic ecosystems¹⁴. The index considers the presence of harmful substances that can negatively impact aquatic life and disturb the ecosystem. 
Through careful observation and upkeep of optimal quality indicators, effective water purification management plays a crucial role in safeguarding the environment and sustaining the fragile equilibrium of aquatic ecosystems. The primary contributions of the proposed study are as follows.

The research presents a comparative study of machine learning techniques integrated with a modified Quality Index (QI) to enhance anomaly detection and water quality classification. This approach is applied specifically to water treatment plant data, where several of the models evaluated have not been previously benchmarked, thus offering new insights into their effectiveness in this context.
The algorithm accurately classifies water quality parameters and integrates the modified QI to improve interpretability and operational efficiency. By assigning weights based on the importance of each parameter, the QI dynamically reflects the real-time status of water quality, reducing misclassification risk and supporting better decision-making in treatment plants.
Designed for real-time application, the proposed machine learning approach enables continuous monitoring and detection of anomalies in water quality. This supports rapid response and timely interventions, which are critical for effective and adaptive management in water treatment facilities.

The primary contribution of this study is a comparative analysis of various machine learning models for anomaly detection in water quality datasets. The focus lies in evaluating how different models perform in identifying critical deviations in water quality parameters under real-time monitoring settings. In addition, the proposed approach leverages a modified Quality Index to enhance model interpretability and decision-making relevance for water treatment operations.

The remainder of this manuscript is organised as follows. Section 2 discusses the advantages and disadvantages of existing research. Section 3 outlines the proposed methodology and algorithm. Section 4 provides a detailed comparative analysis, and Sect. 5 presents the results and discussion. Section 6 concludes the study and suggests possible avenues for further investigation.

Related works

Omeka, M. E. et al.¹⁴ have explored the application of a geographical information system (GIS) alongside machine learning techniques to forecast the quality of irrigation water. The approach leverages data from multiple sources to produce a precise forecast. The integration of these technologies provides a thorough approach to enhancing irrigation water management. Mellal, N. E. H., et al.¹⁵ have presented a computational model that draws inspiration from the architecture and operations of the human brain to examine data and forecast the quality of purified water. This method facilitates enhanced precision and effectiveness in the oversight and improvement of purified water systems.

Zhang, Y., et al.¹⁶ have explored a method that integrates machine learning with spectral analysis to effectively and precisely obtain information regarding water quality from high-resolution hyperspectral images. The integration of various feedback mechanisms enhances the precision and reliability of the retrieval process. Bakhtiarizadeh, A., et al.¹⁷ have explored a machine learning model that employs algorithms and data analysis techniques to forecast and enhance the quality of groundwater resources. This method can assist in recognising possible risks and enhancing management strategies for the sustainable utilisation of groundwater resources. Suman, S. K., et al.¹⁸ have explored a Machine learning algorithm aimed at analysing the adsorbents found in drinking water residuals and predicting the potential presence of harmful microplastics, referred to as HMS (hazardous microplastics). By examining the properties of adsorbents and their relationships with microplastics, this method can pinpoint regions with potentially elevated HMS levels and guide the development of preventative strategies. Yang, S., et al.¹⁹ have explored a machine learning model that is straightforward and interpretable, aimed at estimating the water quality index. The model considers multiple water quality parameters and employs them to compute a numerical index that aids in assessing the overall quality of the water. Sundar, L. S., et al.²⁰ have examined the topic of carbon neutrality in wastewater treatment systems. This involves reaching net zero carbon emissions through the application of Machine learning algorithms. This entails enhancing the treatment process to minimise energy usage and lower greenhouse gas emissions. The algorithm leverages data to facilitate real-time decision-making and enhance process efficiency, resulting in a more sustainable and eco-friendly waste water treatment system.

Cechinel, M. A. P., et al.²¹ have discussed an approach to predict the quality of effluent from wastewater treatment plants. Through precise predictions of effluent quality, operators are able to enhance treatment processes and minimise expenses. This technology holds the promise of greatly enhancing the efficiency and effectiveness of wastewater treatment processes. Xu, T., et al.²² have presented a predictive model that integrates adaptive sampling techniques with machine learning to forecast recreational water quality. This facilitates effective and economical oversight of aquatic environments, thereby minimising the likelihood of waterborne diseases among recreational participants. The model continually adjusts to varying circumstances, enhancing its forecasting capabilities and dependability. Zhong, H., et al.²³ have presented an approach for forecasting water quality in a membrane bioreactor (MBR) system through the application of machine learning techniques. The approach entails examining data collected from multiple sensors and pinpointing the key elements that influence water quality, which can improve the efficiency and accuracy of MBR water quality predictions. Sangwan, V., et al.²⁴ have examined a system that employs algorithms and data to forecast water quality based on multiple parameters. It takes in data such as water temperature, pH levels, and nutrient levels to classify water quality as excellent, good, fair, or poor. Cai, D., et al.²⁵ have discussed a method that uses a machine learning model based on transformers to predict the quality of tap water multiple steps ahead. This approach takes into account various factors and historical data to provide more accurate and reliable forecasts.

Liu, C., et al.²⁶ have explored the application of the YOLO v4 Machine learning algorithm in aquaponic systems, utilising image processing techniques to effectively monitor fish movement and behaviour, thereby offering significant insights for both scholars and practitioners in the field. This technology has the potential to enhance system optimisation, promoting improved fish health and growth, which contributes to more sustainable and efficient aquaponic practices. Ramesh, E., et al.²⁷ have explored a methodology that employs sophisticated computer algorithms to identify the optimal design for helically coiled tube flocculators, which are utilised in the water treatment process. This approach aims to improve the efficiency and effectiveness of water treatment systems. Recent years have seen a significant surge in the application of machine learning (ML) techniques for water quality assessment and prediction. Various models including artificial neural networks (ANN), support vector machines (SVM), decision trees, and ensemble methods have been extensively tested for their ability to analyze water quality parameters. However, a growing body of research has shifted toward comparative analysis to determine the most effective models in specific hydro-environmental contexts.

Chen et al.²⁸ conducted a large-scale comparative study to assess the predictive performance of various ML models using big data from surface water sources. Their work identified ensemble methods such as Random Forest and Gradient Boosting as superior in identifying key water quality parameters. This aligns with our approach of model performance benchmarking using multiple metrics including precision, recall, and Critical Success Index (CSI). Similarly, Shah et al.²⁹ compared ANN, SVM, and Random Forest models for surface water prediction and emphasized the importance of hyperparameter tuning and input feature selection, principles which we implement through SHAP-based feature analysis and quality index weighting.

Alqahtani et al.³⁰ investigated both individual and ensemble learning models for river water quality prediction, concluding that ensemble models outperform individual learners in heterogeneous datasets. Their results showed substantial improvements when hybrid approaches were used, a finding echoed in our study where hybrid encoder-decoder structures were deployed for anomaly detection. Asadollah et al.³¹ focused on water quality index (WQI) prediction and model uncertainty, comparing ML techniques in terms of robustness under fluctuating input data. Our model similarly introduces a Modified Quality Index (MQI) that adapts to real-time anomalies, thereby improving robustness and dynamic adaptability, especially in real-time IoT-enabled water treatment monitoring.

Furthermore, Sedighkia et al.³² proposed a hybrid evolutionary model for stream temperature prediction, demonstrating the value of combining traditional ML with advanced optimization heuristics. While our work doesn’t introduce a standalone optimization component, our encoder-decoder training incorporates loss-based optimization that supports pattern extraction in temporal water quality data.

In summary, while prior studies have presented important contributions, our proposed model distinguishes itself by (i) introducing a real-time adaptive Modified Quality Index (MQI), (ii) deploying an encoder-decoder architecture for robust anomaly detection, and (iii) providing an exhaustive comparative evaluation across a broad range of ML models. Our findings support and extend existing work by identifying that the proposed model achieves superior performance—especially in recall and precision—compared to both individual and ensemble-based models discussed in these benchmark studies.

Table 1 presents a thorough examination of the current body of research.

Table 1 Comprehensive analysis.

From the above comprehensive analysis, the following research gaps are identified.

Water quality is highly variable, influenced by the source, treatment processes, and environmental conditions. Incorporating this dynamic variability into computational models is challenging and often requires frequent updates to ensure accurate forecasting of purification needs.
Water purification is a multi-stage process, with each stage presenting its own operational complexities and susceptibility to error. Effective modeling must account for the optimization of individual stages as well as their interactions within the overall purification framework.
Most machine learning models require large, diverse datasets for reliable training and prediction. Integrating such data from heterogeneous sources—such as real-time sensor outputs, plant operations, and usage patterns—is both time-consuming and technically demanding.
Existing models often struggle to capture and interpret the intricate interactions between various water quality parameters. Developing a model that can reliably reflect these nonlinear relationships remains a significant challenge.

Based on the above research gaps, the proposed model was constructed to address these research problems. The following novelties distinguish the proposed model from existing research works.

The algorithm integrates multiple machine learning techniques to predict water quality parameters and make informed decisions regarding purification processes. By learning from large and complex datasets, it is well-suited to the demands of dynamic water quality management.
A key novelty lies in the use of a modified Quality Index (MQI) that aggregates multiple quality indicators with parameter-specific weights. This enables more accurate, context-aware assessment of water quality and improves the algorithm’s decision-making capacity.
The model is designed for real-time monitoring and anomaly detection, allowing for continuous tracking of water quality conditions. This capability ensures early identification of deviations and supports timely adjustments in treatment strategies, enhancing overall purification efficiency.

Proposed model

A revised Quality Index (QI) serves as an effective metric for assessing water quality, enhancing the management of water purification processes. The suggested model can be trained using historical water quality data³³, which may assist in forecasting potential water quality concerns and recommending suitable purification techniques. This model is capable of learning and adapting over time, thereby enhancing the accuracy and efficiency of water purification management. Figure 1 illustrates the development of the proposed model.

The first step entails gathering a dataset related to water quality, which includes a range of attributes. Relevant features from the dataset are selected for use in the preprocessing and model development stages. The dataset requires preprocessing before it can be used for model training. This includes tasks such as data cleansing, encoding categorical variables, and normalizing numerical features.

The dataset is then partitioned into two subsets: the training set and the testing set. The training set is used to develop the model, while the testing set is used to evaluate its effectiveness. Typically, this separation is carried out using a ratio of 70% for training and 30% for testing. This ensures that there is sufficient data for the model to learn effectively, while also providing a substantial portion for assessing the model's performance.

A stacked ensemble model integrates various base models to enhance overall performance and reliability. H2O Auto DL is an automated machine learning algorithm designed to manage extensive datasets and is well-suited for developing a stacked ensemble model.

Following model training, a feature analysis is performed to identify the key elements that influence water quality predictions. This aids in understanding which features exert the greatest influence and can be leveraged to enhance the model.

The final trained model is produced through the training process using the stacked ensemble H2O Auto DL method. It is equipped to generate forecasts on new data based on its analysis of the training set. The evaluation of the trained model is then conducted using the testing data. To determine the model's effectiveness, its performance can be compared with alternative models or industry standards.
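The 70/30 partition described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `train_test_split_70_30`, the fixed seed, and the toy sensor records are assumptions made for the example and are not taken from the study.

```python
import random

def train_test_split_70_30(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into train and test lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for repeatability
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical sensor records: (pH, turbidity, label)
records = [(7.0 + 0.01 * i, 1.0 + 0.1 * i, i % 2) for i in range(10)]
train, test = train_test_split_70_30(records)
print(len(train), len(test))
```

Shuffling before the cut avoids a split that follows the recording order; for strictly time-ordered sensor streams a chronological split may be more appropriate.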

Water quality detection

The suggested algorithms for detecting water quality incorporate a machine learning-based encoder and decoder. The encoder processes raw data obtained from water quality sensors and transforms it into a feature representation. This feature representation is then transmitted to the decoder, which reconstructs the data and detects any anomalies or issues related to water quality. This process leverages the capabilities of artificial neural networks, enabling them to learn and identify patterns within the data to generate precise predictions. Training the algorithm on an extensive dataset of water quality measurements allows it to accurately detect and classify various types of water quality issues. Figure 2 illustrates the encoder and decoder components of the proposed model.

The encoder and decoder serve as essential components within a machine learning architecture known as the Transformer. These components are responsible for processing and understanding input data and subsequently converting it into meaningful output.

The encoder begins by receiving the input data, which is first tokenized, that is, divided into smaller, manageable units called tokens. These tokens are then transformed into numerical representations using an embedding layer. To retain sequence information, positional encoding is applied, assigning a unique value to each token based on its position in the input sequence. This step helps the model recognize the order and relative position of tokens.

The encoded input then passes through a multi-head attention mechanism, which allows the model to focus on different parts of the input data simultaneously. This enhances the model's comprehension and learning capability. The output from this step is combined and normalized to reduce complexity and improve the interpretability of the information.

The decoder processes the tokenized target data, representing the desired output sequence. Like the encoder, it performs tokenization, embedding, and positional encoding. Additionally, it employs a masked multi-head attention mechanism on the target input. This prevents the model from accessing future tokens during training, ensuring predictions are made only from already generated parts of the output sequence.

The decoder then applies another multi-head attention layer, which allows it to attend to relevant parts of both the encoder's output and the target sequence simultaneously. This output is again added and normalized before being passed into a fully connected feedforward network for further processing.

Finally, the result from the feedforward network undergoes another addition and normalization process, and a final fully connected layer generates the predicted output sequence.
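The masking step in the decoder can be made concrete with a small sketch. The code below is a single-head, pure-Python illustration of scaled dot-product attention with a causal mask; it is not the study's implementation, it omits learned projections and multiple heads, and the function names and the toy sequence are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position t may only attend to positions 0..t, which is the effect
    the masked attention layer has on the target sequence.
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for t, query in enumerate(queries):
        visible_keys = keys[: t + 1]            # drop future positions
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible_keys
        ]
        row = softmax(scores)
        # Weighted sum of the visible value vectors (zip stops at len(row)).
        context = [
            sum(w * value[j] for w, value in zip(row, values))
            for j in range(len(values[0]))
        ]
        outputs.append(context)
        weights.append(row + [0.0] * (len(keys) - t - 1))  # pad masked slots
    return outputs, weights

# Hypothetical 3-step sequence of 2-dimensional embeddings.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(sequence, sequence, sequence)
print(weights[0])  # the first step can only attend to itself
```

Each row of `weights` sums to one, and every entry to the right of the diagonal is zero, which is what prevents a position from using later tokens.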

Classification

The input layer serves as the initial interface that acquires data from external sources. The Input Sample Layer specifically denotes a collection of input data samples. The Convolutional layer executes a mathematical operation on the input data by utilising a collection of filters (kernels) to identify features within the input image. The kernel size is 7 × 7 with a stride of 2, indicating that the filter advances 2 pixels at each step to encompass the full input image. Batch Normalisation serves as a method to enhance the stability and performance of a neural network through the standardisation of inputs to each layer. This approach minimises internal covariate shift, facilitating quicker learning and enhancing generalisation capabilities. Figure 3 illustrates the classification module.
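The effect of the 7 × 7 kernel with a stride of 2 on the feature-map size can be checked with the standard output-size formula, floor((n + 2p − k) / s) + 1. The sketch below assumes a square input; the input size of 224 and the padding values are illustrative assumptions, since the text does not state them.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    if input_size + 2 * padding < kernel_size:
        raise ValueError("kernel does not fit inside the padded input")
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Assumed 224-pixel input; padding is not specified in the text.
print(conv_output_size(224, kernel_size=7, stride=2, padding=3))  # 112
print(conv_output_size(224, kernel_size=7, stride=2, padding=0))  # 109
```

With a stride of 2 the layer roughly halves each spatial dimension, which is why it is commonly placed at the start of a network to reduce computation in later layers.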

Activation functions serve to incorporate non-linearity into the neural network. They are applied after the Batch Normalisation layer to introduce non-linear characteristics to the output. Following the initial convolutional layer, a max-pooling layer is implemented with a kernel size of 3 × 3 and a stride of 2. This layer reduces the dimensionality of the data by down-sampling the feature maps, enhancing the efficiency of the network. Subsequently, a Conv block is implemented, comprising a sequence of Conv layers featuring various filter sizes, followed by Batch Normalisation and Activation layers. The layers collaborate to enhance the extraction of features from the input data. An Identity block is subsequently implemented, resembling the Conv block, yet it incorporates an addition operation that merges the output of the Conv layers with the original input. This aids in maintaining crucial characteristics of the input data while also enhancing the network's performance. The min-max normalisation approach is used to scale each feature to the [0, 1] range.

$$t'=\frac{t-\min_{A}}{\max_{A}-\min_{A}}$$

(1)

Here, $\min_{A}$ and $\max_{A}$ are the minimum and maximum values of feature $A$.
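A short sketch of the min-max scaling in Eq. (1) follows. The function name and the sample pH readings are hypothetical, and the handling of a constant feature (mapping every value to 0.0) is a choice made for this example, since the equation is undefined when the maximum equals the minimum.

```python
def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range as in Eq. (1)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Constant feature: Eq. (1) would divide by zero, so return zeros.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

# Hypothetical pH readings.
print(min_max_scale([6.5, 7.0, 8.5]))  # [0.0, 0.25, 1.0]
```

In practice the minimum and maximum would be taken from the training set only and reused for the test set, so that no information from the test data leaks into preprocessing.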

$$WQI=\frac{\sum\nolimits_{b=1}^{M} s_b \times z_b}{\sum\nolimits_{b=1}^{M} z_b},$$

(2)

$$s_b=100 \times \left(\frac{T_b - T_{ideal}}{S_b - T_{ideal}}\right),$$

(3)

where $s_b$ is the sub-index obtained from the tested value $T_b$ of the $b$-th parameter and $M$ is the total number of attributes.

$${z_b}=\frac{Y}{{{Q_b}}},$$

(4)

where $Y$ is the proportionality constant.

$$Y=\frac{1}{\sum\nolimits_{b=1}^{M} q_b},$$

(5)
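The index in Eqs. (2) to (5) can be illustrated with a small calculation. The sketch assumes the common weighted arithmetic form: each weight is inversely proportional to the parameter's standard value, and the constant Y normalises the weights so they sum to one. Reading the denominators of Eqs. (4) and (5) this way is an assumption, as are the function names and the parameter values, which are invented for the example.

```python
def sub_index(measured, standard, ideal):
    """Quality rating of one parameter, following the form of Eq. (3)."""
    return 100.0 * (measured - ideal) / (standard - ideal)

def weighted_quality_index(parameters):
    """Weighted arithmetic quality index, following the form of Eq. (2).

    Each parameter is a dict with 'measured', 'standard' and 'ideal' keys.
    Weights are assumed to be Y / standard with Y = 1 / sum(1 / standard).
    """
    y = 1.0 / sum(1.0 / p["standard"] for p in parameters)
    weights = [y / p["standard"] for p in parameters]
    ratings = [
        sub_index(p["measured"], p["standard"], p["ideal"]) for p in parameters
    ]
    return sum(s * w for s, w in zip(ratings, weights)) / sum(weights)

# Hypothetical parameters, not taken from the study's dataset.
example = [
    {"measured": 2.0, "standard": 4.0, "ideal": 0.0},  # rating 50
    {"measured": 5.0, "standard": 5.0, "ideal": 0.0},  # rating 100
]
print(round(weighted_quality_index(example), 2))  # 72.22
```

A parameter measured exactly at its standard value contributes a rating of 100, and one at its ideal value contributes 0, so lower index values indicate cleaner water under this convention.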

$$K=P\left( {{Z_1}{H_1}+{Z_2}{H_2}+i} \right)$$

(6)

$$p\left( h \right)=\sum\limits_{{{x_a} \in q}} {{\alpha _a}{k_a}Y\left( {{h_a},h} \right)} +b$$

(7)

$$k={i_0}+{i_1}{h_1}+{i_2}{h_2}+\cdots+{i_b}{h_b}$$

(8)

The RMSE is computed using Eq. (9):

$$RMSE=\sqrt {\frac{1}{M}\sum\nolimits_{{b=1}}^{M} {{{\left( {ZSB_{J}^{b} - \widehat{ZSB}_{J}^{b}} \right)}^2}} }$$

(9)

The MAE measures prediction errors without considering their sign. It averages the absolute differences between the predicted values $\widehat{ZSB}_{J}^{b}$ and the actual values $ZSB_{J}^{b}$ over the test sample.

$$MAE=\frac{1}{M}\sum\nolimits_{{b=1}}^{M} {\left| {ZSB_{J}^{b} - \widehat{ZSB}_{J}^{b}} \right|}$$

(10)

$$H_{{CEF}}^{J}=\sum\limits_{{l,m}} {{\Theta _l}A_{l}^{m}\left\langle {{r^l}} \right\rangle \widehat {O}_{l}^{m}}$$

(11)

where $J$ is the total angular momentum quantum number. The ground-state constant is expressed as follows:

$${K_1}= - 3J\left( {J - \frac{1}{2}} \right){\Theta _2}A_{2}^{0}\left\langle {{r^2}} \right\rangle .$$

(12)

where $V_{l}^{m}$ is the complete harmonic factor.

$$V\left( r \right)=\sum\limits_{{l,m}} {V_{l}^{m}\left( r \right){Y_{l,m}}\left( {\theta ,\phi } \right)}$$

(13)

The diffusion model is expressed as follows:

$${S_v}={Y_f}{v^{0.5}}+D$$

(14)

The permeability of water is measured as follows:

$$A=\frac{T}{{J \times v \times F}}$$

(15)

A minimal water value of 0.0 is optimal for model simulation. Negative numbers indicate a bias of overestimation in the model, whereas positive values signify a bias of underestimation.

$$PREI=\left( {\frac{{{k_b} - \widehat {{{k_b}}}}}{{{k_b}}}} \right) \times 100$$

(16)

where ${k_b}$ is the original quality index for the $b$-th observation and ${\widehat {k}_b}$ is the corresponding predicted value.
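The sign convention of Eq. (16) can be verified with a two-line calculation. The function name `prei` and the example index values are hypothetical.

```python
def prei(observed, predicted):
    """Percentage relative error of one prediction, as in Eq. (16)."""
    if observed == 0:
        raise ValueError("observed value must be non-zero")
    return 100.0 * (observed - predicted) / observed

# Hypothetical quality-index values.
print(prei(80.0, 76.0))  # 5.0  -> prediction below the observation
print(prei(80.0, 84.0))  # -5.0 -> prediction above the observation
```

A positive result means the model predicted less than the observed value (underestimation) and a negative result means it predicted more (overestimation), matching the interpretation given in the text.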

$$Q{O_f}=B\left( {Z{S_{fb}}} \right)+{\alpha _b}$$

(17)

The distinction between predicted and observed variances is articulated as follows.

$$T\left( h \right)=T\left( {{\alpha _1}} \right)+T\left( {{\alpha _2}} \right)+...+T\left( {{\alpha _m}} \right)$$

(18)

$$QO={q_{\overline {h} }}=\sqrt {T\left( {\overline {h} } \right)} =\frac{{q{c_{\overline {h} }}}}{{\sqrt M }}$$

(19)

RSS is an average standard uncertainty for each variable.

$$DO={\left[ {\sum\limits_{{b=1}}^{m} {{{\left[ {{d_b}QO\left( {{h_b}} \right)} \right]}^2}} } \right]^{\frac{1}{2}}}$$

(20)

$${d_b}=\frac{{\partial p\left( {{h_b}} \right)}}{{\partial {h_b}}}=\frac{{\partial {k_b}}}{{\partial {h_b}}}$$

(21)
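Equations (19) and (20) follow the usual root-sum-of-squares pattern for combining uncertainties, which the sketch below illustrates. It assumes that the dispersion in Eq. (19) is the sample standard deviation and that the sensitivity coefficients of Eq. (21) are supplied as plain numbers; the function names and the sample readings are hypothetical.

```python
import math
import statistics

def standard_uncertainty(samples):
    """Standard uncertainty of a mean: sample standard deviation / sqrt(M)."""
    return statistics.stdev(samples) / math.sqrt(len(samples))

def combined_uncertainty(sensitivities, uncertainties):
    """Root-sum-of-squares combination in the form of Eq. (20)."""
    if len(sensitivities) != len(uncertainties):
        raise ValueError("one sensitivity coefficient is needed per input")
    return math.sqrt(
        sum((d * u) ** 2 for d, u in zip(sensitivities, uncertainties))
    )

# Hypothetical repeated readings of one input variable.
readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(standard_uncertainty(readings))
print(combined_uncertainty([1.0, 2.0], [3.0, 2.0]))  # 5.0
```

Squaring before summing means a single input with a large sensitivity or a large uncertainty dominates the combined value.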

In this case, y is an input variable that can be randomly chosen, k is a coverage factor, and Cu is the total uncertainty in the random data.

$$y={v_{t,1 - \alpha /2}}$$

(22)

The root mean square error (RMSE), mean absolute error (MAE), mean square error (MSE), and coefficient of determination ($R^2$) are used as evaluation metrics. The RMSE and MAE are defined as

$$RMSE=\sqrt {\frac{1}{m}\sum\limits_{{b=1}}^{m} {{{\left( {{k_b} - {{\widehat {k}}_b}} \right)}^2}} }$$

(23)

$$MAE=\frac{1}{m}\sum\limits_{{b=1}}^{m} {\left| {{k_b} - {{\widehat {k}}_b}} \right|}$$

(24)
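Equations (23) and (24) translate directly into code. The sketch below uses only the standard library; the function names and the sample values are illustrative and do not come from the study's experiments.

```python
import math

def rmse(actual, predicted):
    """Root mean square error, as in Eq. (23)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(k - k_hat) ** 2 for k, k_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error, as in Eq. (24)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(k - k_hat) for k, k_hat in zip(actual, predicted)) / len(actual)

# Hypothetical observed and predicted quality-index values.
observed = [3.0, 5.0, 7.0]
estimated = [2.0, 5.0, 9.0]
print(rmse(observed, estimated), mae(observed, estimated))
```

Because the RMSE squares each error before averaging, it is never smaller than the MAE and reacts more strongly to a few large errors.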

An average pooling layer is implemented, which reduces the spatial dimension of the output data by calculating the average of adjacent pixel values. The fully connected layer serves as the concluding stage in the convolutional neural network, where the outputs from the preceding layers are flattened and input into a conventional fully connected neural network. This layer conducts classification utilising the features that have been extracted from the input data. The Softmax layer is utilised to transform the output from the fully connected layer into a probability distribution across the various classes. This enables the network to generate a prediction based on the input data and categorise it into one of the established classes.

The proposed algorithm is designed to enhance water purification management by leveraging advanced Machine learning techniques and a modified Quality Index (QI). The algorithm begins by preprocessing the water quality data, including normalization and feature selection, to ensure that only the most relevant parameters are utilized for model training. The adaptive QI is calculated by weighting each feature’s importance, providing a dynamic evaluation of water quality that adjusts in real-time according to detected anomalies. This continuous adjustment enhances decision-making within water treatment plants, ensuring that the purification processes are responsive to changing water conditions. The algorithm’s Machine learning architecture, using an encoder-decoder approach, effectively captures complex patterns in water quality data, enabling accurate anomaly detection and trend forecasting.

To further improve water purification management, the model is deployed on IoT devices for real-time monitoring. Anomalies in water quality are detected by comparing real-time data with the values reconstructed by the machine learning model, using a threshold-based method. The algorithm then visualizes detected anomalies and QI trends, supporting predictive maintenance and strategic decision-making. A continuous feedback loop enables the model to learn from real-time data, ensuring adaptability to evolving water quality patterns. This approach not only forecasts water quality but also recommends appropriate purification methods, ultimately enhancing the operational efficiency of water treatment plants. By integrating machine learning with a dynamically adjusted Quality Index (QI), the algorithm provides a transformative solution for sustainable water resource management.
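The threshold-based comparison between live readings and reconstructed values could look roughly like the sketch below. The threshold rule (mean plus three standard deviations of errors seen on normal data) is an assumption chosen for illustration, as are all the numbers. The encoder-decoder that would produce the reconstructions is not modelled here.

```python
import statistics

def reconstruction_errors(observed, reconstructed):
    """Absolute difference between each reading and its reconstruction."""
    return [abs(o - r) for o, r in zip(observed, reconstructed)]

def fit_threshold(normal_errors, k=3.0):
    """Threshold = mean + k * standard deviation of errors on normal data (k is assumed)."""
    return statistics.mean(normal_errors) + k * statistics.pstdev(normal_errors)

def flag_anomalies(observed, reconstructed, threshold):
    """Mark a reading as anomalous when its reconstruction error exceeds the threshold."""
    return [e > threshold for e in reconstruction_errors(observed, reconstructed)]

# Hypothetical errors collected while the plant was operating normally.
baseline_errors = [0.1, 0.2, 0.1, 0.2]
threshold = fit_threshold(baseline_errors)

# Hypothetical live pH readings and the model's reconstructions of them.
observed = [7.0, 7.1, 9.0]
reconstructed = [7.1, 7.0, 7.2]

flags = flag_anomalies(observed, reconstructed, threshold)
print(threshold, flags)
```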

Comparative analysis

The performance of the proposed machine learning algorithm was compared with the existing heuristic GIS-based and machine learning approach (HML)¹⁴, artificial neural network (ANN)¹⁵, hybrid feedback Machine factorization machine model (HDFM)¹⁶, interpretable machine learning model (IMLM)¹⁹, Machine learning algorithm (DLA)²⁰, adaptive synthetic sampling algorithm (ASSA)²², Machine learning framework (MLF)²⁴, transformer-based Machine learning model (TDLM)²⁵ and multi-objective optimization model (MOOM)²⁷. The Water Quality dataset²⁸ was used, and the results were produced with a Python simulator.

Estimation of accuracy

The accuracy is determined by analysing the correlation between the predicted water quality values and the actual measured quality values, as shown in Fig. 4. The adjusted Quality Index serves to allocate weights to the various quality parameters and to determine a comprehensive accuracy score. A higher accuracy score signifies better performance of the algorithm in forecasting water quality. Table 2 presents a comparison of accuracy between the existing models and the proposed model.

Table 2 Comparison of Accuracy (in %).
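For a classification setting, accuracy is commonly computed from confusion-matrix counts. The sketch below uses that common definition as an assumption, since the section does not give a formula. The labels "poor" and "good" and the sample data are hypothetical.

```python
def confusion_counts(actual, predicted, positive="poor"):
    """Count true/false positives and negatives for one positive label."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Fraction of all samples whose predicted label matches the measured one."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical measured and predicted quality labels.
actual = ["poor", "good", "good", "poor", "good"]
predicted = ["poor", "good", "poor", "poor", "good"]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts))
```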

Estimation of precision

Precision is the ratio of correctly predicted instances of poor water quality to the total number of instances classified as poor quality. A high precision means that few of the samples flagged as poor quality are false positives, indicating that the algorithm identifies poor water quality reliably. Table 3 presents a comparison of precision between the existing models and the proposed model.

Table 3 Comparison of Precision (in %).

Figure 5 compares the precision of the models. At one comparison point, the proposed model reached 81.25% precision, whereas the existing models reached 38.26% (HML), 53.57% (ANN), 50.40% (HDFM), 39.96% (IMLM), 61.33% (DLA), 50.40% (ASSA), 70.22% (MLF), 66.73% (TDLM) and 65.10% (MOOM) over the same range. The proposed model therefore achieved the highest precision among the compared models.
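The ratio described for precision can be written directly from two boolean lists, one marking samples that truly are of poor quality and one marking samples the model flagged as poor. This is a small illustration with made-up data, not the evaluation code behind Table 3.

```python
def precision(actual_poor, flagged_poor):
    """Share of samples flagged as poor quality that really are poor quality."""
    true_pos = sum(1 for a, f in zip(actual_poor, flagged_poor) if a and f)
    false_pos = sum(1 for a, f in zip(actual_poor, flagged_poor) if f and not a)
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0

# Hypothetical ground truth and model flags for six samples.
actual_poor = [True, False, True, True, False, False]
flagged_poor = [True, True, True, False, False, False]

print(precision(actual_poor, flagged_poor))
```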

Estimation of recall

The proposed machine learning algorithm employs a revised Quality Index to assess the purity of water within a purification system. Recall is the percentage of correctly identified pure water samples out of the total number of pure water samples present in the system. It indicates how completely the algorithm identifies the samples of the class of interest. Table 4 presents a comparison of recall between the existing models and the proposed model.

Table 4 Comparison of Recall (in %).

Figure 6 compares the recall of the models. At one comparison point, the proposed model reached 96.21% recall, whereas the existing models reached 56.48% (HML), 73.88% (ANN), 69.23% (HDFM), 58.99% (IMLM), 84.59% (DLA), 69.23% (ASSA), 91.75% (MLF), 82.41% (TDLM) and 81.87% (MOOM) over the same range. The proposed model therefore achieved the highest recall among the compared models.
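Following the description above, recall can be sketched as the fraction of truly pure samples that the model also labels as pure. The data are hypothetical and the function is only an illustration of the ratio.

```python
def recall(actual_pure, predicted_pure):
    """Share of truly pure samples that the model also labels as pure."""
    true_pos = sum(1 for a, p in zip(actual_pure, predicted_pure) if a and p)
    false_neg = sum(1 for a, p in zip(actual_pure, predicted_pure) if a and not p)
    relevant = true_pos + false_neg
    return true_pos / relevant if relevant else 0.0

# Hypothetical ground truth and predictions for five samples.
actual_pure = [True, True, True, False, True]
predicted_pure = [True, False, True, False, True]

print(recall(actual_pure, predicted_pure))
```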

Estimation of critical success index (CSI)

The Critical Success Index (CSI) for the proposed machine learning algorithm is evaluated by applying the modified Quality Index to the water quality data and assessing how efficiently and accurately the algorithm detects and eliminates contaminants. It provides a quantitative measure of overall success in water purification. Table 5 presents a comparison of the Critical Success Index between the existing models and the proposed model.

Table 5 Comparison of Critical Success Index (in %).

Figure 7 compares the Critical Success Index of the models. At one comparison point, the proposed model reached a CSI of 97.71%, whereas the existing models reached 57.76% (HML), 75.99% (ANN), 69.35% (HDFM), 60.32% (IMLM), 87.00% (DLA), 69.35% (ASSA), 93.97% (MLF), 90.84% (TDLM) and 89.05% (MOOM) over the same range. The proposed model therefore achieved the highest Critical Success Index among the compared models.
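In its usual confusion-matrix form, the CSI (also called the threat score) is TP / (TP + FP + FN). The sketch below implements that standard form with hypothetical counts. The section does not state its CSI formula, so the reported percentages may rest on a different computation.

```python
def critical_success_index(tp, fp, fn):
    """Standard CSI (threat score): hits divided by hits, false alarms and misses."""
    total = tp + fp + fn
    return tp / total if total else 0.0

# Hypothetical counts: 45 hits, 5 false alarms, 10 misses.
csi = critical_success_index(tp=45, fp=5, fn=10)
print(csi)
```

Because true negatives do not appear in the formula, this form of the index is unaffected by how many normal samples the model correctly ignores.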

Estimation of Matthews correlation coefficient (MCC)

The MCC is determined by assessing the sensitivity and specificity of the algorithm in categorising water quality according to the modified Quality Index. The predicted outcomes are compared against the actual values, and the correlation between the two sets of data is calculated. The MCC value indicates the algorithm’s overall accuracy in predicting water quality, which is crucial for effective management in water purification. Table 6 presents a comparison of the Matthews Correlation Coefficient between the existing models and the proposed model.

Table 6 Comparison of Matthews Correlation Coefficient (in %).

Figure 8 compares the MCC of the models. At one comparison point, the proposed model reached an MCC of 93.42%, whereas the existing models reached 67.67% (HML), 70.46% (ANN), 75.39% (HDFM), 70.67% (IMLM), 80.66% (DLA), 75.39% (ASSA), 87.94% (MLF), 83.98% (TDLM) and 76.29% (MOOM) over the same range. The proposed model therefore achieved the highest MCC among the compared models.
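The correlation between predicted and actual binary labels is captured by the standard Matthews formula, sketched below from confusion-matrix counts. The counts are hypothetical. The result lies between -1 and 1, so a percentage such as those in Table 6 would correspond to the value multiplied by 100.

```python
import math

def matthews_corrcoef(tp, fp, tn, fn):
    """Standard MCC from confusion-matrix counts; 0.0 when a marginal total is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts for a binary water quality classifier.
mcc = matthews_corrcoef(tp=45, fp=5, tn=40, fn=10)
print(round(mcc, 4))
```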

Estimation of Delta-P

Delta-P is determined by calculating the difference between the existing Quality Index and the forecasted Quality Index. This value indicates the change in water quality following the implementation of the algorithm, serving as a metric to assess the algorithm’s efficacy in overseeing water purification processes. Table 7 presents a comparison of Delta-P between the existing models and the proposed model.

Table 7 Comparison of Delta-p (in %).

Figure 9 compares the Delta-P of the models. At one comparison point, the proposed model reached a Delta-P of 94.76%, whereas the existing models reached 49.12% (HML), 65.98% (ANN), 60.79% (HDFM), 51.30% (IMLM), 75.54% (DLA), 60.79% (ASSA), 83.22% (MLF), 79.85% (TDLM) and 78.04% (MOOM) over the same range. The proposed model therefore achieved the highest Delta-P among the compared models.

Estimation of Fowlkes–Mallows index (FMI)

The FMI is evaluated for the revised Quality Index, which was developed by analysing water purification data from multiple sources with advanced machine learning methods. The algorithm pinpointed the essential elements influencing water quality, facilitating more precise management of water purification systems. Table 8 presents a comparison of the Fowlkes–Mallows index between the existing models and the proposed model.

Table 8 Comparison of Fowlkes–Mallows index (in %).

Figure 10 compares the FMI of the models. At one comparison point, the proposed model reached an FMI of 88.92%, whereas the existing models reached 36.90% (HML), 57.40% (ANN), 57.83% (HDFM), 38.54% (IMLM), 65.72% (DLA), 57.83% (ASSA), 83.93% (MLF), 71.46% (TDLM) and 77.77% (MOOM) over the same range. The proposed model therefore achieved the highest FMI among the compared models.
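For binary classification, the Fowlkes–Mallows index is conventionally the geometric mean of precision and recall. The sketch below uses that standard definition with hypothetical counts. Whether the values in Table 8 were obtained this way is not stated in the text.

```python
import math

def fowlkes_mallows(tp, fp, fn):
    """Geometric mean of precision and recall, computed from confusion counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return math.sqrt(precision * recall)

# Hypothetical counts: 45 hits, 5 false alarms, 10 misses.
fmi = fowlkes_mallows(tp=45, fp=5, fn=10)
print(round(fmi, 4))
```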

Discussion

The proposed machine learning algorithm uses a modified QI as a metric to assess overall water quality. The algorithm is based on neural networks that can analyze large amounts of data from various sources. The QI is modified to incorporate specific parameters related to water purification. Through training and optimization, the algorithm can predict water quality in real time and identify potential issues. It can also suggest appropriate corrective actions and optimize the purification process. The aim is to improve water purification management and ensure the production of safe, high-quality drinking water.

Convergence of performance

The proposed machine learning algorithm using a modified Quality Index is expected to show a gradual improvement in performance as it processes more data and adjusts its parameters. The algorithm starts with lower accuracy while it learns from a limited dataset, but as it continues to train and gather more data, it is expected to reach a stable level of performance. At that point the algorithm can accurately predict the quality of water and manage the purification process efficiently. Performance has converged when the algorithm consistently produces accurate results with a low error rate, indicating its effectiveness in water purification management. Table 9 shows the convergence of performance of the comparative models.

Table 9 Convergence of Performance (in %).

Figure 11 shows the convergence of performance of the comparative models. For converged accuracy, the proposed machine learning algorithm obtained 89.18%, compared with 66.30% (HML), 61.81% (ANN), 68.76% (HDFM), 69.24% (IMLM), 70.76% (DLA), 68.76% (ASSA), 71.12% (MLF), 72.25% (TDLM) and 80.01% (MOOM). For converged precision, the proposed algorithm obtained 85.54%, compared with 41.90% (HML), 57.99% (ANN), 53.69% (HDFM), 43.76% (IMLM), 66.39% (DLA), 53.69% (ASSA), 74.52% (MLF), 71.36% (TDLM) and 69.52% (MOOM). For converged recall, the proposed algorithm obtained 94.02%, compared with 52.69% (HML), 70.21% (ANN), 65.34% (HDFM), 55.03% (IMLM), 80.38% (DLA), 65.34% (ASSA), 86.87% (MLF), 80.43% (TDLM) and 79.17% (MOOM).

For converged CSI, the proposed algorithm obtained 93.42%, compared with 54.13% (HML), 71.58% (ANN), 66.07% (HDFM), 56.54% (IMLM), 81.95% (DLA), 66.07% (ASSA), 89.68% (MLF), 86.23% (TDLM) and 84.63% (MOOM). For converged MCC, the proposed algorithm obtained 88.40%, compared with 64.05% (HML), 66.69% (ANN), 72.11% (HDFM), 66.89% (IMLM), 76.36% (DLA), 72.11% (ASSA), 83.65% (MLF), 79.36% (TDLM) and 71.87% (MOOM). For converged Delta-P, the proposed algorithm obtained 94.37%, compared with 48.61% (HML), 65.41% (ANN), 60.39% (HDFM), 50.76% (IMLM), 74.89% (DLA), 60.39% (ASSA), 82.71% (MLF), 79.35% (TDLM) and 77.60% (MOOM). For converged FMI, the proposed algorithm obtained 89.47%, compared with 37.07% (HML), 56.68% (ANN), 58.76% (HDFM), 38.72% (IMLM), 64.89% (DLA), 58.76% (ASSA), 84.07% (MLF), 71.67% (TDLM) and 78.24% (MOOM).
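One simple way to decide that a metric has reached a stable level is to check whether its most recent values stay within a small band. The sketch below illustrates that idea only. The window size, the tolerance and the accuracy history are all assumptions, and the section does not describe how convergence was actually determined.

```python
def has_converged(history, window=3, tolerance=0.5):
    """True when the last `window` scores differ by at most `tolerance` points."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) <= tolerance

def first_converged_index(history, window=3, tolerance=0.5):
    """Zero-based index of the first entry at which the plateau test passes, or None."""
    for end in range(window, len(history) + 1):
        if has_converged(history[:end], window, tolerance):
            return end - 1
    return None

# Hypothetical accuracy (in %) recorded after successive training rounds.
accuracy_history = [62.0, 74.5, 82.1, 86.9, 88.8, 89.1, 89.2]

print(first_converged_index(accuracy_history))
```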

Mean of performance

The proposed algorithm is able to effectively analyze large amounts of data and accurately predict the water quality index, which is a measure of the overall quality of the water. This allows for efficient and timely detection of any potential water contamination issues, enabling proactive management and maintenance of the water purification system. The modified Quality Index used in this algorithm takes into account various factors, making it a more comprehensive measure of water quality. The algorithm has proven to be a highly efficient and accurate tool for water purification management. Table 10 shows the mean of performance of comparative models.

Table 10 Mean of Performance (in %).

Figure 12 shows the mean performance of the comparative models. Compared with the other models, the proposed machine learning model obtained 19.30% higher accuracy, 26.34% higher precision, 23.42% higher recall, 20.44% higher CSI, 15.83% higher MCC, 27.69% higher Delta-P and 28.49% higher FMI. These gains are consistent with the difference, in percentage points, between the proposed model's converged score and the mean converged score of the nine comparative models.
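The reported gains can be cross-checked against the converged scores quoted in the convergence discussion. The sketch below subtracts the mean of the nine comparative models from the proposed model's score for each metric. Reading the gains as percentage-point gaps is an inference from the numbers rather than a formula stated by the authors. The results land within about 0.01 of the reported figures.

```python
from statistics import mean

METRICS = ["accuracy", "precision", "recall", "CSI", "MCC", "delta-P", "FMI"]

# Converged scores (in %) as quoted in the text, in the order of METRICS.
PROPOSED = [89.18, 85.54, 94.02, 93.42, 88.40, 94.37, 89.47]

BASELINES = {
    "HML":  [66.30, 41.90, 52.69, 54.13, 64.05, 48.61, 37.07],
    "ANN":  [61.81, 57.99, 70.21, 71.58, 66.69, 65.41, 56.68],
    "HDFM": [68.76, 53.69, 65.34, 66.07, 72.11, 60.39, 58.76],
    "IMLM": [69.24, 43.76, 55.03, 56.54, 66.89, 50.76, 38.72],
    "DLA":  [70.76, 66.39, 80.38, 81.95, 76.36, 74.89, 64.89],
    "ASSA": [68.76, 53.69, 65.34, 66.07, 72.11, 60.39, 58.76],
    "MLF":  [71.12, 74.52, 86.87, 89.68, 83.65, 82.71, 84.07],
    "TDLM": [72.25, 71.36, 80.43, 86.23, 79.36, 79.35, 71.67],
    "MOOM": [80.01, 69.52, 79.17, 84.63, 71.87, 77.60, 78.24],
}

def mean_gaps(proposed, baselines):
    """Proposed score minus the mean baseline score, per metric, in percentage points."""
    gaps = []
    for i, score in enumerate(proposed):
        baseline_mean = mean(scores[i] for scores in baselines.values())
        gaps.append(score - baseline_mean)
    return gaps

gaps = dict(zip(METRICS, mean_gaps(PROPOSED, BASELINES)))
for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points above the baseline mean")
```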

Conclusion

This study presents a comparative analysis of various machine learning models for detecting anomalies in water quality, with the proposed encoder-decoder algorithm enhanced by a modified Quality Index demonstrating superior performance. The model achieved high accuracy (89.18%), precision (85.54%), and recall (94.02%), along with strong evaluation scores in Matthews Correlation Coefficient (88.40%) and Fowlkes–Mallows Index (89.47%). These results confirm the effectiveness of the model in accurately classifying water quality anomalies and enabling real-time monitoring and proactive management in treatment plants. A key contribution of this research is the systematic evaluation of model performance across multiple metrics, allowing for a more informed selection of algorithms suitable for water quality prediction tasks. However, the study also acknowledges potential uncertainties such as sensitivity to input features, threshold calibration for anomaly classification, and data variability, which could affect real-world deployment. Future work will explore integrating uncertainty quantification techniques, as well as enhancing model scalability by incorporating predictive analytics and real-time IoT-based sensor data. This direction will improve the adaptability of the models for dynamic environments, ensuring consistent water safety while supporting sustainable water resource management and public health outcomes.

Data availability

The datasets generated and/or analysed during the current study are available in the Kaggle repository, https://www.kaggle.com/datasets/supriyoain/water-quality-data.

References

1. Lap, B. Q. et al. Predicting water quality index (WQI) by feature selection and machine learning: A case study of an Kim Hai irrigation system. Ecol. Inf. 74, 101991 (2023).
2. Wai, K. P., Chia, M. Y., Koo, C. H., Huang, Y. F. & Chong, W. C. Applications of machine learning in water quality management: A state-of-the-art review. J. Hydrol. 613, 128332 (2022).
3. Huang, R., Ma, C., Ma, J., Huangfu, X. & He, Q. Machine learning in natural and engineered water systems. Water Res. 205, 117666 (2021).
4. Im, Y., Song, G., Lee, J. & Cho, M. Machine learning methods for predicting tap-water quality time series in South Korea. Water 14 (22), 3766 (2022).
5. Xia, L. et al. Quality assessment and prediction of municipal drinking water using water quality index and artificial neural network: A case study of Wuhan, central China, from 2013 to 2019. Sci. Total Environ. 844, 157096 (2022).
6. Omeka, M. E. et al. Efficacy of GIS-based AHP and data-driven intelligent machine learning algorithms for irrigation water quality prediction in an agricultural-mine district within the lower Benue trough, Nigeria. Environ. Sci. Pollut. Res. 31 (41), 54204–54233 (2024).
7. Kuvayskova, Y. E., Klyachkin, V. N. & Krasheninnikov, V. R. Potable water quality assessment using machine training methods. In Advances in Artificial Intelligence-Empowered Decision Support Systems: Papers in Honour of Professor John Psarras 245–260 (Springer Nature Switzerland, 2024).
8. Bagheri, M., Akbari, A. & Mirbagheri, S. A. Advanced control of membrane fouling in filtration systems using artificial intelligence and machine learning techniques: A critical review. Process Saf. Environ. Prot. 123, 229–252 (2019).
9. Mathaba, M. & Banza, J. A comprehensive review on artificial intelligence in water treatment for optimization. Clean water now and the future. J. Environ. Sci. Health Part. A. 58 (14), 1047–1060 (2023).
10. Li, W. et al. Research progress in water quality prediction based on machine learning technology: a review. Environ. Sci. Pollut. Res. 31 (18), 26415–26431 (2024).
11. Oğuz, A. & Ertuğrul, Ö. F. A survey on applications of machine learning algorithms in water quality assessment and water supply and management. Water Supply. 23 (2), 895–922 (2023).
12. Dasgupta, R. et al. A comparative analysis of statistical, MCDM and machine learning based modification strategies to reduce subjective errors of DRASTIC models. Environ. Earth Sci. 83 (7), 211 (2024).
13. Ismail, W. et al. Water treatment and artificial intelligence techniques: a systematic literature review research. Environ. Sci. Pollut. Res. 1–19 (2021).
14. Omeka, M. E. Evaluation and prediction of irrigation water quality of an agricultural district, SE Nigeria: an integrated heuristic GIS-based and machine learning approach. Environ. Sci. Pollut. Res. 31 (41), 54178–54203 (2024).
15. Mellal, N. E. H., Tahar, W., Boumaaza, M., Belaadi, A. & Bourchak, M. Prediction of purified water quality in industrial hydrocarbon wastewater treatment using an artificial neural network and response surface methodology. J. Water Process. Eng. 58, 104757 (2024).
16. Zhang, Y., Wu, L., Deng, L. & Ouyang, B. Retrieval of water quality parameters from hyperspectral images using a hybrid feedback machine factorization machine model. Water Res. 204, 117618 (2021).
17. Bakhtiarizadeh, A., Najafzadeh, M. & Mohamadi, S. Enhancement of groundwater resources quality prediction by machine learning models on the basis of an improved DRASTIC method. Sci. Rep. 14 (1), 1–24 (2024).
18. Suman, S. K. et al. Detection and prediction of HMS from drinking water by analysing the adsorbents from residuals using Machine learning. Adsorpt. Sci. Technol. 2022, 3265366 (2022).
19. Yang, S., Liang, R., Chen, J., Wang, Y. & Li, K. Estimating the water quality index based on interpretable machine learning models. Water Sci. Technol. 89 (5), 1340–1356 (2024).
20. Sundar, L. S., Almujibah, H., Alshahri, A. H. & Ancha, V. R. Assessment of carbon neutrality in waste water treatment systems through machine learning algorithm. Water Reuse. 13 (3), 432–447 (2023).
21. Cechinel, M. A. P. et al. Enhancing wastewater treatment efficiency through machine learning-driven effluent quality prediction: A plant-level analysis. J. Water Process. Eng. 58, 104758 (2024).
22. Xu, T., Coco, G. & Neale, M. A predictive model of recreational water quality based on adaptive synthetic sampling algorithms and machine learning. Water Res. 177, 115788 (2020).
23. Zhong, H. et al. Water quality prediction of MBR based on machine learning: A novel dataset contribution analysis method. J. Water Process. Eng. 50, 103296 (2022).
24. Sangwan, V. & Bhardwaj, R. Machine learning framework for predicting water quality classification. Water Pract. Technol. 19 (11), 4499–4521 (2024).
25. Cai, D. et al. Multi-step tap-water quality forecasting in South Korea with transformer-based machine learning model. Urban Water J. 21 (9), 1109–1120 (2024).
26. Liu, C., Gu, B., Sun, C. & Li, D. Effects of Aquaponic system on fish locomotion by image-based YOLO v4 machine learning algorithm. Comput. Electron. Agric. 194, 106785 (2022).
27. Ramesh, E. & Jalali, A. Machine-learning based multi-objective optimization of helically coiled tube flocculators for water treatment. Chem. Eng. Res. Des. 197, 931–944 (2023).
28. Chen, K., Chen, H., Zhou, C., Huang, J. & Yan, Y. Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. Water Res. 171, 115454 (2020).
29. Shah, M. I. et al. Predictive modeling approach for surface water quality: development and comparison of machine learning models. Sustainability 13 (14), 7515 (2021).
30. Alqahtani, A., Shah, M. I., Aldrees, A. & Javed, M. F. Comparative assessment of individual and ensemble machine learning models for efficient analysis of river water quality. Sustainability 14 (3), 1183 (2022).
31. Sedighkia, M., Ghavidel, H. Z. & Sanikhani, H. Hybridizing evolutionary algorithms and multiple non-linear regression technique for stream temperature modeling. Acta Geophys. 73 (3), 2863–2878 (2025). https://doi.org/10.1007/s11600-024-01526-w
32. Asadollah, A., Khorsandi, R., Mesgari, A., Maity, M. & Mousavizadeh, M. River water quality index prediction and uncertainty analysis: A comparative study of machine learning models. J. Environ. Chem. Eng. 9 (1), 104599 (2021).
33. https://www.kaggle.com/datasets/adityakadiwal/water-potability

Funding

The authors extend their appreciation to the Deanship of Scientific Research and Libraries in Princess Nourah bint Abdulrahman University for funding this research work through the Research Group project, Grant No. (RG-1445–0018).

Author information

Authors and Affiliations

Information Systems Department, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, 11432, Saudi Arabia
P. Prabu
Department of Information Systems, College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, P.O. Box 84428, Riyadh, 11671, Saudi Arabia
Ala Saleh Alluhaidan, Romana Aziz & Shakila Basheer

Contributions

P. Prabu and Ala Saleh Alluhaidan were responsible for conceptualization (idea and planning), methodology (design and strategy), writing the original draft, and writing the final paper. Shakila Basheer and Romana Aziz supervised the work.

Corresponding author

Correspondence to Ala Saleh Alluhaidan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Prabu, P., Alluhaidan, A.S., Aziz, R. et al. Comparative analysis of machine learning models for detecting water quality anomalies in treatment plants. Sci Rep 15, 30453 (2025). https://doi.org/10.1038/s41598-025-15517-4

Received: 01 June 2025
Accepted: 08 August 2025
Published: 19 August 2025
DOI: https://doi.org/10.1038/s41598-025-15517-4
