Introduction

Groundwater is a plentiful resource that millions of people throughout the world rely on for drinking water. The rising prevalence of contaminants in groundwater makes it all the more important to evaluate its quality for human consumption1. Groundwater is a vital source of potable and irrigation water in arid and semi-arid regions. Groundwater pollution has become a global problem due to the increasing human population and the accompanying rapid urbanization and industrialization2. Over the past few decades, groundwater quality has declined because increased human activity has disrupted natural chemical processes3,4. The level of solids and soluble salts determines irrigation water quality, and evaluating that quality is vital for the long-term use of these natural resources for crop irrigation2. Both the quantity and quality of groundwater are negatively affected by changes to an area’s local terrain and drainage systems5. Evaluating the quality and quantity of groundwater is therefore crucial in establishing its suitability for drinking and irrigation purposes6,7,8. The importance of supplying freshwater for industrial, agro-industrial, and household uses has grown in tandem with the expansion of industrialization. Around 65% of groundwater is used for human consumption, about 20% for irrigation and domesticated animals, and about 15% for industrial uses and quarries9,10,11. The gradual deterioration of groundwater quality is now a major global problem. The rising scarcity of clean groundwater poses a health hazard, as billions of people around the world are forced to drink polluted water because there is not sufficient potable water. Groundwater quality is now widely recognized as a growing concern12,13,14,15.
Approximately eighty percent of worldwide water-related illnesses are caused by water that is not fit for human consumption, and such illnesses kill millions of people in a number of African, Asian, and Indian states16,17,6. According to previous studies, hypertension, hypocalcaemia, kidney stones, gastro-renal pain, arterial calcification, thrombosis, and other major human health problems have been linked to pollutants such as heavy metals, pesticides, and other organic and inorganic contaminants17,18,6. The Water Quality Index (WQI) is one of the simplest yet most comprehensive computational tools for evaluating water quality19,20,21. The WQI can be calculated by a variety of methods, each of which condenses the quality of the water into a single numerical score6. It is generally determined from measurements of electrical conductivity (EC), pH, sodium ions (Na+), chloride ions (Cl−), and bicarbonate ions (HCO3−)22,23,24,25,26. The quality of groundwater for irrigation is evaluated by sodium percentage (Na%), sodium adsorption ratio (SAR), residual sodium carbonate (RSC), permeability index (PI), Kelly’s index (KI), and magnesium hazard (MH)2,26,27,28. The indices used in this research, including the Water Quality Index (WQI) and Irrigation Water Quality Index (IWQI), are well-known and effective tools for simplifying complex water quality data. By integrating multiple physicochemical parameters into a single score, these indices provide a clear and comprehensive assessment of overall water quality for both drinking and agricultural purposes. The study uses these standardized methods to give a clear and comparable picture of groundwater quality, which is important for sound decisions about how to manage and remediate the water.

Machine learning algorithms enhance and supplement the Water Quality Index and its evaluation. Numerous studies have employed gene expression programming (GEP), support vector machines (SVM), artificial neural networks (ANN), and adaptive neuro-fuzzy inference systems (ANFIS) to assess water quality characteristics. In recent studies, the Automatic Linear Model (ALM) has been used in many industries to determine the interconnections and key elements that affect system behavior. This investigation employs indices and the automatic linear model to assess groundwater and identify contaminated sources8. The ANN model constructed in this work is robust, as shown by its precise estimate of the proportion of variance in recorded Water Quality Index values. Its close agreement with the testing-subset deviations and its lowest cross-validation error measurements indicate this performance. Moreover, the model shows the best R2 value and a strong relationship between predicted and actual WQI values, supporting its dependability in water quality prediction. Although XGBoost, ANN, and RF models are widely used in evaluating groundwater quality, few studies have applied them to the prediction of groundwater WQI. These models show promise, and a thorough risk assessment will help clarify the non-carcinogenic and carcinogenic health implications of polluted water. Research on WQI prediction remains limited, but the predictive ability of these models for water quality parameters gives hope for the future295,30,31.

The aims of this research are as follows:

(a) To investigate the physicochemical characteristics of groundwater at 23 sites in Kasganj, U.P., India, by examining 115 samples for key variables such as pH, alkalinity, total dissolved solids (TDS), fluoride, and different ionic components.

(b) To determine the suitability of groundwater for potable and irrigation purposes using the Water Quality Index (WQI) and Irrigation Water Quality Index (IWQI), which provide a comprehensive classification of water quality across the sampled sites.

(c) To evaluate the efficacy of three machine learning models, namely Random Forest (RF), Artificial Neural Networks (ANN), and Extreme Gradient Boosting (XGB), in predicting WQI from physicochemical characteristics.

(d) To identify contamination areas of concern and group areas by their risk of contamination, providing scientific evidence for urgent water management and cleanup plans in the study area.

Methods and materials

Hydrology of study area

The mean annual rainfall is 722.4 mm. The sub-humid climate has a lovely winter season and hot summers. The mean daily maximum temperature in May is 41 °C, the mean daily minimum is 27 °C, and the maximum temperature can reach over 46 °C. The monsoon, with its rapid drop in day temperatures, is a significant factor in the region’s climate. January is the coldest, with a mean daily high of 22 °C and a mean daily low of 8 °C. Groundwater occurs in unconsolidated alluvial sediment pore spaces in the sedimentation zone. The top silty, sandy clay beds with kankar support dugwells where groundwater occurs. Deeper aquifers have semi-confined groundwater. Our research, conducted in the Kasganj area, located in the northern portion of the Etah district, has provided precise data on the water levels. During the pre- and post-monsoon periods, the depth of the water level ranges from 3.11 to 10.24 mbgl and 2.58 to 9.79 mbgl, with a variation of 0.17 to 1.50 m. The water table height ranges from 168 to 157 m above mean sea level (m.a.s.l.), indicating a southeasterly regional groundwater flow32.

Analysis of water quality parameters

The study was carried out from August 2023 to July 2024. Groundwater samples were collected from 23 neighboring sites with contaminated water, shown on the map in Fig. 1. Samples were collected systematically from tube wells, submersible pumps, and hand pumps across the study area; stale water was first evacuated, and the samples were then stored in prewashed, high-thickness polypropylene (HDPP) bottles in accordance with standard protocols. Titration was used to measure alkalinity, hardness, chloride, Ca2⁺, and Mg2⁺ concentrations. A multi-parameter kit measured pH and TDS, and a flame photometer measured Na⁺ and K⁺ concentrations. Finally, a Shimadzu UV-1800 spectrophotometer was used to analyze nitrate, sulfate, and fluoride. Together, these methods gave a comprehensive chemical characterization of the groundwater and support the scientific quality of the research33,34. The estimated error was less than ± 5%. The flowcharts in Figs. 2 and 3 depict the methodology of the study.

Fig. 1

GIS Location of Kasganj district, Uttar Pradesh, Agra.

Fig. 2

Flowchart of calculation of water quality indexing (weighted arithmetic index method) and irrigation water quality indexing (SAR, Na%, MH and KR)39,40,41,6.

Fig. 3

Flow chart of the methodology implemented for WQI and IWQI analysis of water samples collected from several locations of Kasganj area.

Geochemical modelling

The PHREEQC geochemical modelling software was used to perform thermodynamic computations of the saturation indices (SI) of the different mineral phases that are common in groundwater (Eq. 1)35.

$${\text{SI}} = \log_{10} \left( {\frac{{{\text{IAP}}}}{{K_{sp} }}} \right)$$
(1)

In the above equation, IAP is the ion activity product of the solution and Ksp is the solubility product constant at a given temperature. Because carbonate rocks predominate among the aquifer materials in the study region, the estimates of thermodynamic saturation show carbonates to be the most significant minerals in this investigation. An SI below zero indicates that the groundwater is under-saturated with respect to the specific mineral: only a small amount of the mineral is in solution, and the water can still dissolve more of it. It also suggests that the groundwater has a shorter residence time36. An SI greater than zero means that the groundwater is oversaturated with respect to the particular mineral, so the water can no longer dissolve the mineral.
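As a minimal sketch of Eq. 1, the Python snippet below computes SI from an ion activity product and a solubility product and labels the resulting saturation state. The function names and the numerical inputs are illustrative assumptions, not values from this study; in practice PHREEQC derives the IAP from the full speciation of each sample.

```python
import math


def saturation_index(iap: float, ksp: float) -> float:
    """Return SI = log10(IAP / Ksp), as in Eq. 1."""
    if iap <= 0 or ksp <= 0:
        raise ValueError("IAP and Ksp must be positive")
    return math.log10(iap / ksp)


def saturation_state(si: float) -> str:
    """Label a mineral as under-saturated, at equilibrium, or oversaturated."""
    if si < 0:
        return "under-saturated"
    if si > 0:
        return "oversaturated"
    return "equilibrium"


# Hypothetical activities, chosen only to illustrate the sign convention.
example_si = saturation_index(iap=1e-9, ksp=1e-8)
example_state = saturation_state(example_si)
```

Here the IAP is one tenth of Ksp, so the SI is close to −1 and the mineral is labelled under-saturated.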

Estimation of the water quality index (WQI) and irrigational water quality index (IWQI) of the samples

The WHO guidelines and BIS standards for drinking water are given in Table 1, where TDS is total dissolved solids; Na+ is sodium; K+ is potassium; TH is total hardness; Ca2+ is calcium; Mg2+ is magnesium; TA is total alkalinity; pH is unitless; F− is fluoride; Cl− is chloride; NO3− is nitrate; and SO42− is sulphate. The estimation of WQI and IWQI is shown in Fig. 2. With this calculation, the water samples have been classified into five categories of WQI, as shown in Table 2.

Table 1 Prescribed water quality and unit weight standards37,38.
Table 2 A classification of drinking water according to the Water Quality Index ranges6,39,40,41.
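The weighted arithmetic index method named in Fig. 2 can be sketched as follows. This assumes the commonly used form of the method, in which each parameter receives a quality rating Q_i = 100 (V_i − V_ideal)/(S_i − V_ideal) and a unit weight W_i inversely proportional to its standard S_i, and WQI = Σ W_i Q_i / Σ W_i. The function name, the TDS and pH limits, and the ideal pH of 7.0 are assumptions for illustration; the standards and unit weights actually used are those of Table 1.

```python
def weighted_arithmetic_wqi(measured, standards, ideals=None):
    """Weighted arithmetic WQI: sum(W_i * Q_i) / sum(W_i)."""
    ideals = ideals or {}
    # Unit weight W_i = K / S_i, with K chosen so that the weights sum to 1.
    k = 1.0 / sum(1.0 / s for s in standards.values())
    weighted_sum = 0.0
    weight_total = 0.0
    for name, standard in standards.items():
        ideal = ideals.get(name, 0.0)
        # Quality rating Q_i: 0 at the ideal value, 100 at the permissible limit.
        rating = 100.0 * (measured[name] - ideal) / (standard - ideal)
        weight = k / standard
        weighted_sum += weight * rating
        weight_total += weight
    return weighted_sum / weight_total


# Assumed placeholder limits, apart from the 1.5 ppm fluoride limit cited in the text.
standards = {"fluoride": 1.5, "tds": 500.0, "ph": 8.5}
ideals = {"ph": 7.0}  # assumed ideal value; all other ideals default to 0
# Mean values reported in the Results, used here as a single illustrative sample.
sample = {"fluoride": 1.55, "tds": 942.0, "ph": 7.36}
wqi = weighted_arithmetic_wqi(sample, standards, ideals)
```

With this formulation a sample sitting exactly at every permissible limit scores 100, which is why parameters with strict limits, such as fluoride, dominate the index.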

Calculation of the water quality and irrigation water quality indexing (WQI and IWQI)

According to Table 3, IWQI could be classified into five different groups from excellent to unsuitable for irrigation purposes. Based on the results, it was concluded that water available from different sources in this region is not fit for irrigation.

Table 3 Classification of groundwater samples for irrigation purposes by ranges of Na%, SAR, MH and KR27,25,2.
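The four irrigation indices can be computed from major cation concentrations as in the sketch below. It assumes the widely used definitions (all concentrations in meq/L): SAR = Na/√((Ca + Mg)/2), Na% = 100 (Na + K)/(Ca + Mg + Na + K), MH = 100 Mg/(Ca + Mg), and KR = Na/(Ca + Mg). The function names and the example concentrations are hypothetical, and the definitions may differ in detail from the equations used by the authors.

```python
import math

# Equivalent weights (g/eq) used to convert mg/L to meq/L.
EQUIVALENT_WEIGHT = {"Ca": 20.04, "Mg": 12.15, "Na": 22.99, "K": 39.10}


def to_meq(mg_per_l):
    """Convert cation concentrations from mg/L to meq/L."""
    return {ion: mg_per_l[ion] / EQUIVALENT_WEIGHT[ion] for ion in EQUIVALENT_WEIGHT}


def irrigation_indices(meq):
    """Return SAR, Na%, MH and KR from concentrations given in meq/L."""
    ca, mg, na, k = meq["Ca"], meq["Mg"], meq["Na"], meq["K"]
    return {
        "SAR": na / math.sqrt((ca + mg) / 2.0),
        "Na%": 100.0 * (na + k) / (ca + mg + na + k),
        "MH": 100.0 * mg / (ca + mg),
        "KR": na / (ca + mg),
    }


# Hypothetical sample, already expressed in meq/L.
indices = irrigation_indices({"Ca": 2.0, "Mg": 2.0, "Na": 4.0, "K": 0.0})
```

Each index is then compared with the class boundaries of Table 3 to grade the sample for irrigation use.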

Machine learning models

A wide variety of machine learning classification and prediction techniques have been documented in the literature. Three noteworthy approaches that have demonstrated significant efficacy in a range of applications are examined in this study: Extreme Gradient Boosting (XGBoost), Artificial Neural Networks (ANN), and Random Forest (RF)42. Extreme gradient boosting, a novel algorithm gaining popularity for water quality forecasting, is paired with the adaptability of neural networks in handling a large number of inputs and learning nonlinear complex relationships. The three models used in this study are all capable of classification and regression, showcasing their versatility43.

XGBoost is a high-accuracy gradient boosting algorithm that builds a series of decision trees one after the other, allowing each tree to learn from and correct the errors of its predecessors. Its broad acceptance is largely due to its strong focus on avoiding overfitting, which improves its generalizability. This is made possible by regularization controlled through its input parameters. XGBoost has become a common choice in data science and applied machine learning, in great part owing to its dependability as a supervised learning algorithm and the efficiency of gradient-boosted models. The method works for both regression and classification. Experts recommend XGBoost for its fast execution and its out-of-core computation management for small data sets. XGBoost has been used in many studies to measure air and water pollution. Gradient-boosted trees combine weak classifiers to form a robust classifier44,45,46,47,48. The boosting process addresses the deficiencies of prior weak classifiers by increasing the weights of, or oversampling, particular data points. This directs the subsequent classifier to concentrate on the samples that are hardest to classify, allowing the model to learn from its prior mistakes. XGBoost, applied in an ensemble learning context, was used to predict regions with elevated lead contamination risk and to determine significant features strongly associated with increased lead levels in Flint, Michigan49,50,51,52.
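The core idea of sequential error correction can be illustrated with a deliberately small sketch: plain gradient boosting for squared error, using one-split regression stumps on a single feature. This is not the XGBoost algorithm itself (it has no regularization terms, second-order gradients, or multi-feature trees), and the function names and toy data are assumptions introduced only for illustration.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on one feature.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps one after another, each on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )


# Toy data: a single step in the target, not measurements from this study.
toy_x = [1, 2, 3, 4, 5, 6]
toy_y = [10.0, 10.0, 10.0, 50.0, 50.0, 50.0]
model = fit_boosted_stumps(toy_x, toy_y)
```

The learning rate shrinks each stump's contribution, so many small corrections accumulate rather than one tree fitting the residuals in a single step.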

The ANN is the next ML model applied. It is composed of interconnected neurons that collaborate to execute particular tasks, taking inspiration from the biological neural networks observed in nature. The output of a neuron arises from a defined process: the neuron takes in inputs, combines them with coefficients (weights and a bias), and passes the result through a non-linear activation function. Neurons are generally organized in layers, enabling the flow of information from the input layer to the output layer through one or more hidden layers of neurons53. The difference between predicted and actual results for different input data points is used to assess the performance of the network. This loss is used to adjust the weights of the network through backpropagation and gradient descent, which improves prediction accuracy and reduces the loss in subsequent iterations54. The essential steps in developing ANN models include selecting suitable inputs and a target variable, defining the network’s architecture, pre-processing and partitioning the input data, choosing a network design, establishing performance metrics, and performing training, testing, and validation31,55,56,57,5.
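The neuron computation described above can be written out in a few lines. The sketch below shows a forward pass through one hidden layer with ReLU activation and a linear output, followed by a squared-error loss; the weights, inputs, and function names are made-up values for illustration and are unrelated to the trained model.

```python
def relu(z):
    """Rectified Linear Unit activation."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation=None):
    """One layer: weighted sum plus bias per neuron, then optional activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Feed-forward pass: ReLU hidden layer followed by a linear output neuron."""
    hidden = dense(x, hidden_w, hidden_b, relu)
    return dense(hidden, [out_w], [out_b])[0]


def squared_error(predicted, actual):
    """Loss for one sample; training adjusts weights to reduce this value."""
    return (predicted - actual) ** 2


# Hypothetical two-input, two-hidden-neuron network.
prediction = forward(
    x=[1.0, 2.0],
    hidden_w=[[1.0, 1.0], [-1.0, -1.0]],
    hidden_b=[0.0, 0.0],
    out_w=[2.0, 5.0],
    out_b=1.0,
)
loss = squared_error(prediction, actual=6.0)
```

In training, backpropagation computes the gradient of this loss with respect to every weight and bias, and gradient descent moves each coefficient a small step against its gradient.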

A neural network-based regression model, a significant departure from the traditional non-linear regression model, is employed to predict the Water Quality Index (WQI). The model is a well-connected, parallel feed-forward network. The WQI is calculated using F, pH, TDS, Cl, Ca, Mg, Na, K, NO3, SO4, TH and TA. The main steps in building this model were choosing the network architecture and structure.

Our model is not only robust and reliable, but also highly adaptable. It uses twelve dimensions (fluoride, pH, TDS, Cl, Ca, Mg, Na, K, NO3, SO4, TH, and TA), hidden layers with various configurations, Rectified Linear Unit (ReLU) activation function, and L2 regularization, K-fold validation, and a ‘Linear’ output layer targeting WQI. The 5-fold cross-validation tests many hidden layers, learning rates, and regularization strength configurations, showcasing the model’s adaptability. Early stopping prevents overfitting, and the second stage chooses optimal training parameters. In parallel with the iteration count, the model is trained using the entire training set and evaluated using the test set. The learning rate and multi-retrain training method ensure the model’s robustness and performance58,59.

Random forest makes use of an ensemble of classification and regression trees. Each tree is built from a distinct bootstrap sample (drawn with replacement) of the original data set. Unlike conventional trees, which split each node using the best split among all variables, RF introduces a layer of randomness to the process: while building a tree, it splits a node using only a randomly chosen subset of the variables. This randomness helps RF resist overfitting in contrast to other techniques. Our model is trained and tested using a large number of trees, which typically improves stability and reduces variance. The model was implemented with a Random Forest Regressor, using up to 1000 trees, a maximum depth of 6, a random state of 42, and K-fold cross-validation. As good practice, the model is retrained for each fold during cross-validation and the data are standardized within each fold, which avoids data leakage60,61,30.
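The two sources of randomness in a random forest, bootstrap sampling of rows and random selection of candidate variables, can be sketched with the standard library alone. The helper names and the subset size of 4 are assumptions for illustration; the 115 samples, 12 input variables, and seed 42 echo figures given in the text, and tree construction itself is omitted.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def out_of_bag(n_rows, sampled):
    """Rows never drawn for a tree; they can serve as its internal test set."""
    return sorted(set(range(n_rows)) - set(sampled))


def random_feature_subset(n_features, subset_size, rng):
    """Candidate variables considered at one node split."""
    return sorted(rng.sample(range(n_features), subset_size))


def forest_prediction(tree_predictions):
    """Regression forests average the predictions of the individual trees."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(42)
rows = bootstrap_indices(115, rng)            # one bootstrap sample of the 115 samples
oob_rows = out_of_bag(115, rows)
features = random_feature_subset(12, 4, rng)  # assumed subset size of 4 out of 12 inputs
averaged = forest_prediction([100.0, 110.0, 120.0])
```

Because every tree sees a different sample and different candidate variables, the errors of individual trees are only partly correlated, and averaging them reduces variance.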

In the current research, three ML models, (1) RF, (2) ANN, and (3) XGBoost, were used to predict and analyze groundwater quality indices. Each model has certain strengths and limitations relevant to groundwater quality evaluation:

Random forest (RF)

Advantages: (1) Resistant to overfitting because of ensemble learning; handles high-dimensional data and nonlinear relationships well. (2) Offers feature importance for understanding. Disadvantages: (1) May be computationally expensive with big data. (2) Interpretability is relatively lower compared to linear models.

Artificial neural network (ANN)

Advantages: (1) Able to model intricate, nonlinear interactions between water quality parameters. (2) High predictive accuracy when well-trained and tuned. Disadvantages: (1) Needs great computing power and big data. (2) Functions as a “black box,” providing low interpretability of internal processes.

Extreme gradient boosting (XGBoost)

Advantages: (1) High performance and accuracy due to optimized gradient boosting. (2) Effective handling of missing values and overfitting through regularization. Disadvantages: (1) More sensitive to parameter tuning.

Results

Comprehensive hydrogeochemistry of groundwater of Ganga Basin Kasganj area, Uttar Pradesh, India

Table 4 displays the physico-chemical water quality characteristics of the sampled area. The alkalinity of the water samples ranged from 94 to 456 ppm. However, the TDS of the samples was alarmingly high, ranging from 252 to 2054 ppm, with an average of 942 ppm (Fig. 5a–d). The values of chloride, sodium, potassium, sulphate, nitrate, magnesium, calcium ions, and total hardness were within acceptable limits. The mean pH was 7.36, with a range from 6.99 to 7.81. The concentrations of fluoride in the water samples ranged from 0.21 to 3.80 ppm, with an average of 1.55 ppm, as shown in Fig. 4a–d. The mean fluoride concentration thus exceeded the World Health Organization acceptable limit of 1.5 ppm (Fig. 4a–d)37.

Table 4 Physical and chemical characteristics (minimum, maximum, mean and standard deviation values) of groundwater samples of Ganga Basin Kasganj area, Uttar Pradesh, India.
Fig. 4

(a–d) Spatial distribution of hydrogeochemistry (a) Chloride, (b) Nitrate, (c) Sulphate, (d) Fluoride of groundwater of sampled area.

Fig. 5

(a–d) Spatial distribution of hydrogeochemistry (a) TDS, (b) Calcium, (c) Sodium, (d) Magnesium of groundwater of sampled area.

Table 5 presents the statistical SI values for each mineral in groundwater during the year 2024. Fluorite (CaF2), gypsum (CaSO4·2H2O), halite (NaCl), and sylvite (KCl) were found to be dissolved in the groundwater in most wells. The study area is characterized by a shallow aquifer system, exhibiting a transition from unconfined to semi-confined groundwater conditions. The proximity of the water table to the surface has facilitated the formation of clay lenses, which have subsequently introduced an interfingering phenomenon within the sandy aquifer, effectively rendering it impermeable. This impermeable layer significantly restricts groundwater recharge. Furthermore, the presence of gravel nodules composed of calcium carbonate within the sandy aquifer influences the pH of the groundwater, thereby enabling the dissolution of minerals. The saturation index values of several dissolved minerals in water samples from different wells, such as anhydrite (CaSO4), gypsum (CaSO4·2H2O), halite (NaCl), and sylvite (KCl), indicated under-saturation. The main constituents of these minerals, SO4 and Cl, show high values in the study area owing to anthropogenic contamination (Table 6). Negative SI values suggest that the water is more aggressive, that is, more corrosive.

Table 5 Statistical saturation index values for each mineral in groundwater in the sampled area.
Table 6 Minimum, maximum, mean and standard deviation values of saturation index in the sampled area62,63,64.

The mineralogical analysis of sediment samples from the semi-arid Kasganj area (Table 6) reveals that carbonate minerals, especially Dolomite and Calcite, are the most abundant, as indicated by their relatively high mean values and moderate variability, reflecting favorable alkaline and evaporative conditions typical of such climates. Aragonite also shows a stable but lesser presence. In contrast, evaporite minerals like Halite and Sylvite are scarce, displaying very low mean values and limited variability, suggesting that the conditions required for their widespread deposition are rare and localized. The sulphate minerals Anhydrite and Gypsum appear in low quantities, possibly due to seasonal hydrological fluctuations that inhibit extensive precipitation. Fluorite exhibits the highest variability, likely linked to local differences in groundwater chemistry. Environmentally, this distribution pattern underscores the influence of high evaporation, intermittent water availability, and geochemical processes in shaping mineral assemblages in the region. In conclusion, the data indicate that the semi-arid environment of Kasganj primarily supports carbonate formation, with evaporite and sulphate minerals occurring only under specialized, occasionally met conditions.

Geochemical characterization of Kasganj, U.P., India

The hydrochemical characterization with the Piper diagram indicates that most groundwater samples belong to the Ca–Mg–Cl facies (Fig. 6). In the cation triangle, samples predominantly fall in the no-dominant type (Field D), although they appear to be calcium-rich, indicating mixed cationic contributions from silicate weathering and limited ion exchange processes. The anion triangle is dominated by chloride (Field G), reflecting the influence of evaporite dissolution, anthropogenic causes, or saline water intrusion. The central diamond field confirms the classification within the Ca–Mg–Cl + SO₄ hydrochemical zone. This zone is often linked with mineralized, hard water and indicates prolonged residence times or pollution from agricultural and residential activities. The close clustering of sample points implies a very similar hydrogeochemical signature across the study area. Furthermore, the minimal prevalence of bicarbonate-rich facies indicates that recent recharge or carbonate lithology had a limited impact. Overall, the findings provide insight into how the groundwater system is affected by natural geological processes and possibly by anthropogenic pressures.

Fig. 6

Piper diagram for samples of groundwater at Kasganj, U.P., India.

By relating the concentrations of cations (Na+, Ca2+) and anions (Cl−, HCO3−) to TDS (total dissolved solids), the Gibbs diagram provides a technique for determining the origin of ions in groundwater. The diagram was devised to clarify the relationship between the chemical components of water (Gibbs 1970; Eqs. 10 and 11). Three separate fields of the Gibbs diagram (precipitation dominance, evaporation dominance, and rock–water interaction dominance) are used to identify the quality features of water. All ions are expressed in mg/L.

$${\text{Gibbs ratio I }}\left( {\text{for anion}} \right) = { }\frac{{{\text{Cl}}^{ - } }}{{\left( {{\text{Cl}}^{ - } + {\text{HCO}}_{3}^{ - } } \right)}}$$
(10)
$${\text{Gibbs ratio II }}\left( {\text{for cation}} \right) = { }\frac{{{\text{Na}}^{ + } + {\text{K}}^{ + } }}{{\left( {{\text{Na}}^{ + } + {\text{K}}^{ + } + {\text{Ca}}^{2 + } } \right)}}$$
(11)
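Equations 10 and 11 translate directly into code. The short sketch below evaluates both ratios for one sample; the function names and the example concentrations are hypothetical, and each ratio would be plotted against TDS to place the sample in one of the Gibbs fields.

```python
def gibbs_anion_ratio(cl, hco3):
    """Gibbs ratio I (Eq. 10): Cl / (Cl + HCO3)."""
    return cl / (cl + hco3)


def gibbs_cation_ratio(na, k, ca):
    """Gibbs ratio II (Eq. 11): (Na + K) / (Na + K + Ca)."""
    return (na + k) / (na + k + ca)


# Hypothetical concentrations in mg/L, for illustration only.
ratio_anion = gibbs_anion_ratio(cl=50.0, hco3=150.0)
ratio_cation = gibbs_cation_ratio(na=30.0, k=10.0, ca=60.0)
```

Both ratios lie between 0 and 1 by construction, which is what allows every sample to be placed on the same horizontal axis of the diagram.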

Each cation and anion in groundwater has a rock-dominance origin, according to the Gibbs diagram based on TDS and the concentration of cations and anions in Fig. 7. This trait shows that groundwater ion dissolution from interactions with rock or soil is more prevalent than precipitation or other natural sources.

Fig. 7

Gibbs diagram for samples of groundwater at Kasganj, U.P., India.

Correlation analysis of Ganga basin area of Kasganj, U.P, Northern India

The present study investigates the correlation of fluoride concentration with other physicochemical characteristics in groundwater samples from the Ganga basin area of Kasganj, Uttar Pradesh, India. Table 7 shows that fluoride has a minimal correlation with pH, TA, and HCO3. We discovered a strong positive correlation between fluoride (F) ions and bicarbonate (HCO3), sodium (Na+), and hydrogen (H+) ions, which is in line with earlier research. This could be because fluoride-containing minerals like fluorite dissolve more readily in alkaline environments (pH > 7.5). However, we discovered a 0.03 correlation value between pH, TA, and HCO3 in our case. Localized geochemical control, such as weathering of calcium and fluoride minerals, or human influence, such as phosphate fertilizers (which include both Ca and F), could be the cause of this. The modest correlation between calcium and fluoride (0.48) supports mineral-based mobilization, potentially from complex fluorapatite or mixed silicates instead of pure CaF₂ routes. Low correlations with bicarbonate and pH indicate that local lithology and mineral composition are more important drivers than ion exchange or alkalinity processes. Because our water type is rock-dominated, as shown by the Gibbs diagram, fluoride exhibits a significant negative correlation of − 0.69 with potassium (K+). This suggests that fluoride solubility may be influenced by a reverse ion exchange involving Na⁺, Ca2⁺, and K⁺65,66.

Table 7 Correlation analysis of Ganga basin area of Kasganj, U.P, India.
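A correlation matrix of this kind is typically built from pairwise coefficients. The sketch below assumes the Pearson coefficient (the text does not state which coefficient was used), and the function name and the toy series are illustrative rather than data from Table 7.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series.

    Assumes neither series is constant (otherwise the denominator is zero).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Toy series showing the two extremes of the coefficient.
r_positive = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_negative = pearson_r([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
```

Values near +1 or −1 indicate a strong linear relationship, while values near 0, such as the 0.03 reported above, indicate almost none.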

Spatial distribution of WQI

As illustrated in Fig. 8a, b, the distribution of the water quality index presents a relatively clear picture. Of the water examined in the Kasganj area, 60.87 percent was deemed unsuitable for human consumption. None was classed as good, 13.04 percent was classed as moderately poor, and 26.08 percent as extremely poor. Table 8 shows the percentage distribution of the groundwater classes in the study area and underlines the serious and alarming character of the problem. Fluoride and TDS in the water exceed World Health Organization and IS 10500 standards37,38, indicating a serious issue that demands a swift and effective response and treatment. Experimental results confirmed a high fluoride concentration in water samples from the Ganga basin area of Kasganj, U.P., India, which might be due to its geological conditions. It was concluded that water available from different sources in this region is not fit for drinking.

Fig. 8

(a, b) Graphical and spatial distribution representation of the WQI of the Ganga basin area of Kasganj, U.P., India.

Table 8 Water quality indexing in Kasganj region, North India.

Table 8 summarizes the Kasganj WQI, showing the highest and lowest values across the sampled areas. The lowest WQI in the range, 63.64, was recorded at Saiyad Nagla, and the maximum, 221.18, at Tarora.

Table 9 demonstrated that the WQI of the study region, ranging from (233.16, 185.86, 1588, 221.18), reflects water quality influenced by both geogenic and anthropogenic sources, with fluorite rock playing a pivotal role by contributing minerals that significantly impact water chemistry and overall quality.

Table 9 Comparison of WQI with previous studies1,6,67.

Irrigation water quality

Irrigation water of low quality may affect crop yields and quality68. Salinity has been reported as the primary determinant of irrigation water quality69. In the present investigation, we assessed the water’s potential for agricultural use by calculating its Na%, SAR, MH, and KR.

Percentage of sodium, sodium adsorption ratio, MH and KR

The percentage of sodium, SAR, MH, and KR were calculated for all collected samples using Eqs. 5, 6, 7, and 8. The average values of Na%, SAR, MH, and KR in the samples were 25.30%, 7.45%, 26.07%, and 0.29% meq/L, respectively (Figs. 9 and 10a–d). These values not only indicate the suitability of the water for irrigation and agricultural uses (Table 10) but also help in understanding soil effects such as reduced soil permeability and soil hardening, which could lead to better agricultural practices68,69,70,71.

Fig. 9

Spatial graphical representation of the IWQI of the Ganga basin area of Kasganj, U.P., India.

Fig. 10

(a–d) Spatial distribution of IWQI (SAR, Mg Hazard, Na% and Kelly ratio) in the sampled area.

Table 10 The groundwater samples classification in the Kasganj, Uttar Pradesh, North India for irrigation purposes27,25,2.

Application of XGBoost (XGB), artificial neural network (ANN), and random forest (RF), models to predict the quality of water in Kasganj areas

This research used a data partitioning technique that designated 80% of the dataset for training and 10% each for validation and testing. The predictive performance of three machine learning models, XGBoost (XGB), Artificial Neural Network (ANN), and Random Forest (RF), in forecasting the Water Quality Index (WQI) over 23 monitored sites is assessed in this work. Cross-validation split the dataset so that 18 sites were set aside for training and 5 for testing in every iteration. With R2 values of 0.9568 for XGB, 0.9994 for ANN, and 0.9368 for RF, the models showed high accuracy throughout the training phase, indicating strong positive correlations between predicted and actual WQI values. The models maintained strong performance during testing, producing R2 values of 0.8427 (XGB), 0.8738 (ANN), and 0.9034 (RF). Together with the visual regression analysis in Figs. 11a–d, 12a–d, and 13a–d, these results confirm the models’ efficacy in WQI prediction; RF shows the best generalization capacity on unseen data. These findings may also be of interest to machine learning, environmental science and engineering, and other fields.
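A site-level cross-validation split of the kind described can be sketched as follows. The function name, the shuffling seed, and the choice of five folds are assumptions for illustration. Note that 23 sites do not divide evenly into five folds, so this sketch yields test folds of five or four sites rather than exactly five in every iteration.

```python
import random


def k_fold_splits(n_items, k, seed=42):
    """Return k (train, test) index pairs that partition range(n_items)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# 23 monitored sites, 5 folds (fold count assumed).
splits = k_fold_splits(23, 5)
fold_sizes = sorted(len(test) for _, test in splits)
```

Each site appears in exactly one test fold, so every site contributes once to the out-of-sample estimate; any scaling of the inputs should be fitted on the training indices of each split only, in line with the leakage concern raised for the RF model.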

Fig. 11

(a–d) Regression of XGBoost model during training, testing and validation.

Fig. 12

(a–d) Regression of ANN model during training, testing and validation.

Fig. 13

(a–d) Regression of RF model during training, testing and validation.

Performance of comparative analysis of XGBoost (XGB), artificial neural network (ANN), and random forest (RF), for regression

As detailed in Table 11, the current research assesses the performance of Water Quality Index (WQI) prediction using three popular machine learning algorithms: XGB, ANN, and RF. The theoretical foundations, forecasting accuracy, and overall efficacy of each model were investigated using basic evaluation metrics: Root Mean Square Error (RMSE), Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2). RF gave the lowest error values (RMSE: 5.97, MSE: 35.69, MAE: 5.49) together with a high R2 value of 0.951. ANN reached a slightly higher R2 of 0.957, while XGB achieved an R2 of 0.831. Judged on the error metrics, RF performed best in WQI prediction among the models tested.

Table 11 Performance of XGB, ANN, and RF on metrics (RMSE, MSE, and MAE) for WQI.

Discussions

The machine learning models perform well in predicting WQI, with RF showing the highest accuracy. The efficacy of the RF model for assessing fluoride- and sulfate-contaminated groundwater is confirmed by its high R2 value of 0.951 and its low error values. Although ANN and XGBoost also showed strong performance, RF’s consistent accuracy across training and validation sets underscores its dependability as a predictive tool. This study shows how traditional WQI techniques can be improved by integrating machine learning, offering a more reliable and effective means of water quality monitoring. The spatial distribution maps and statistical analysis indicate significant hydrogeochemical heterogeneity within the study area, with parameters such as fluoride, chloride, and sulfate showing concentrated zones of high values.

The results of this study are further supported by recent related work. In Kerala, Aju et al.72 used machine learning models to predict groundwater quality and found RF to be the most successful, with an R2 of 0.922. Similarly, for groundwater forecasting, Hussein et al.73 emphasized the superior predictive stability of RF and XGBoost over traditional models. Islam et al.31 further illustrated the usefulness of ANN-based hybrid approaches for probabilistic risk assessment in fluoride-endemic areas. By combining WQI-based evaluation with spatial distribution analysis, the present work both validates the effectiveness of RF in handling hydrogeochemical heterogeneity and extends these earlier studies. This combined approach supports the dependability and practical applicability of the methodology for sustainable groundwater monitoring.

Conclusion

Groundwater quality has been deteriorating owing to geogenic and anthropogenic sources. The 115 water samples from twenty-three locations in Kasganj reveal that groundwater in the study region is under serious threat. The Water Quality Index (WQI) and Irrigation Water Quality Index (IWQI) were used to assess the suitability of water at the sampled sites for drinking and agricultural purposes. Total Dissolved Solids (TDS) and fluoride (F⁻) concentrations exceed WHO guidelines, posing significant health risks. pH and hardness were also above permissible limits. The consistently elevated fluoride levels correlate with pH, alkalinity, and ion interactions (notably with hydrogen, sodium, and bicarbonate), pointing to the geochemical mechanisms that influence groundwater chemistry in the region. Notably, 60.87% of the samples were classified as unsuitable for human consumption, with several falling into the “extremely poor” category. This highlights both the health risks and the urgent need for sustainable groundwater management. Of the predictive models used to assess water quality, Random Forest (RF), Artificial Neural Network (ANN), and XGBoost (XGB), RF demonstrated the most balanced and reliable performance, achieving the lowest error metrics (RMSE: 5.97, MSE: 35.69, MAE: 5.49) and a strong coefficient of determination (R2 = 0.951). While ANN slightly outperformed RF in R2 (0.957), its higher error rates rendered RF the more robust choice overall. These results highlight the strong potential of machine learning for accurately predicting and monitoring groundwater quality and for supporting water resource management and public health strategies. The wide variation in water quality across the study area suggests that it is influenced by both natural geological conditions and human activities.
Additionally, the negative saturation index values for fluoride-bearing minerals such as fluorite indicate undersaturation, which may increase fluoride mobility and contribute to its elevated levels in groundwater. Overall, this research reveals serious groundwater quality issues in the study region and demonstrates how data-driven approaches, especially machine learning, can offer practical solutions for better groundwater monitoring. However, the effectiveness of these models depends heavily on the quality and representativeness of the input data, and the complexity of some algorithms may pose challenges of transparency and interpretability for stakeholders and decision-makers.