Introduction

Water is crucial to human nutrition, with approximately 80% of waterborne diseases attributed to contamination1,2. To mitigate disease risks and avoid high monitoring costs due to pollutants from population growth, industrial expansion, and waste disposal, it is essential to prioritize the accessibility and affordability of clean drinking water3.

The human body is exposed to both natural and artificial radiation. Natural radiation includes cosmic rays and primordial radionuclides such as uranium, thorium, actinium, and potassium, which have long half-lives4. Artificial radiation from human activities, such as nuclear testing and accidents, includes radionuclides such as Cs-137 and Sr-905. These substances can enter the body via inhalation, ingestion, and dermal absorption6. Contaminated water can introduce internal radiation sources, leading to DNA and RNA damage and potentially causing cancer and genetic mutations7. Industrial activities also introduce heavy metals like lead, chromium, arsenic, and cadmium into the body through contaminated water, air, and food, accumulating in tissues and bones8.

Underground water that interacts with geological formations and contaminants can have higher concentrations of radioactive materials than surface water9. The global demand for bottled water has increased by 31% due to economic and safety reasons10. This study evaluated the specific activities of radioactive isotopes and heavy metal levels in bottled water and compared them with WHO health standards and findings from other countries.

Machine learning (ML) offers new approaches to improving water quality, treatment processes, and resource management in the mineral water sector. ML algorithms can analyze data from various sensors for real-time monitoring of mineral water quality, which is essential for identifying contaminants and ensuring compliance with safety regulations11. ML models can enhance water treatment efficiency by forecasting optimal operational parameters, thereby leading to improved impurity removal and resource efficiency. In addition, ML can predict the maintenance needs of water treatment systems, thereby reducing downtime and ensuring continuous functionality11. In mineral processing, ML models can identify and assess the mineral composition of water, which is crucial for achieving the desired mineral balance in bottled water. These applications highlight ML’s transformative potential in the mineral water industry, particularly in terms of quality assurance, process optimization, and sustainable water management12.

Tackling a regression problem involves predicting a continuous outcome variable based on one or more predictor variables, which is common in fields such as finance, healthcare, and engineering. Accurate predictions require a structured approach, starting with an understanding of the data and relationships between variables through exploratory data analysis (EDA)13. Complex models such as polynomial regression, ridge regression, and lasso regression can capture nonlinear relationships and handle multicollinearity. Model evaluation and validation, such as cross-validation, ensures that the model generalizes well to unseen data, which enhances reliability14.

The goal of this study is to apply sophisticated ML models to improve prediction precision in future studies. The study also proposes recommendations for regulating radioactive isotopes and heavy metals in potable water and advocates for longitudinal studies to assess the health effects of consuming mineral water with low concentrations of radionuclides and heavy metals.

The paper is organized as follows: section “Materials and methods” details the materials, methods, and formulations for measuring radioactive isotope activity levels and reviews various ML models to identify the best regression model for predicting cancer risk. Section “Results and discussion” compares results from mineral water samples in Arak City with those predicted by ML models, discussing the most effective methods and cancer risk assessment. Finally, section “Summary and conclusion” summarizes the research findings.

Materials and methods

In this study, a random sampling technique was employed to assess the activity levels of the radioactive isotopes Ra-226, Cs-137, K-40, and Th-232 in mineral water consumed in Arak City. The investigation utilized commercially available mineral waters from the region. Initially, 15 samples of consumable mineral water, branded as Surprise, Miwa, Aquafina, Versailles, Deserni, Zamzam, Akarso, Vata, Damavand, Elham, Prolife, Kalis, Aqiq, Gohar, and Alis, were collected from local supermarkets in Arak. Each sample comprised 1.5 L of water to satisfy the measurement criteria. The samples were designated with coded identifiers WFR1-WFR15, and 800 cc of each was transferred into a Marinelli beaker that had been thoroughly cleaned with distilled water and alcohol. The Marinelli beakers were then sealed with aquarium adhesive to prevent air exchange with the external environment and the escape of radon gas. Equilibrium between Ra-226 and its decay products requires several radon half-lives to elapse; the half-life of Rn-222 is approximately 3.8 days, so eight half-lives correspond to about 30 days. The sealed samples were therefore stored for approximately 50 days, well beyond this period, before gamma spectrometry15.

To investigate the heavy-metal concentration in mineral water samples using the inductively coupled plasma technique, 10 cc of each sample was transferred into Falcon containers located in the central laboratory of Arak University. The samples were subsequently pumped into an atomizer, from which they were injected into the ICP-AES device. The plasma within the device reaches a temperature of 4000 °C, which facilitates the dissociation of chemical bonds and results in light emission at various wavelengths. These wavelengths are then used to quantify the heavy metal contents in the samples. To mitigate potential chemical interferences within the apparatus, three appropriate emission lines were selected for each element, and each analysis was conducted in triplicate. Table 1 presents the characteristics of 15 consumable mineral water samples sourced from various cities, each exhibiting distinct compositions, collected from supermarkets in Arak City.

Sample preparation

The gamma-ray spectrometry apparatus used in the nuclear laboratory at Arak University is designed to measure the activity of radioactive nuclei with high-energy resolution. This system employs a Semiconductor and coaxial HPGe p-type detector model 30,195 BSIGCD, which boasts a relative efficiency of 30% and is equipped with 4096 analyzer channels. The energy resolution of the Co-60 gamma line, characterized by energies of 1332.520 keV and operating at a voltage of 3000 V, was recorded at 1.95 keV. The energy and efficiency calibrations in gamma spectroscopy are conducted using a Cs-137 standard source with known specific activity, and the spectrometry process is facilitated by lsrmB SI software. The signals produced in the detector are transmitted to the MCA8192 compact system via a pre-amplifier, where they are recorded in a sped format. Subsequently, WINHEX and MATLAB software were employed to convert these files into the CHN format. Following this conversion, the specific activities of radionuclides in environmental samples were assessed using GammaVision32 (EG&G Ortec). To determine background radiation and ensure that the spectra are analyzed under consistent spatial and temporal conditions, spectroscopy is performed in empty Marinelli containers, and the resulting data is subtracted from the original spectrum16.

Table 1 Brief description of mineral water in Iran, consumable at Arak City.

In the central laboratory of Arak University, there exists an Inductively Coupled Plasma Atomic Emission Spectrometry (ICP-AES) instrument, specifically the 9100 Quant Plasma model, manufactured in Germany. This analytical technique is used for the detection of chemical elements in various samples. The device operates using inductively coupled plasma to generate excited atoms and ions, which subsequently emit electromagnetic radiation at characteristic wavelengths associated with specific elements. Plasma is a high-temperature source of ionized gas, typically argon. The device is maintained in the temperature range of 6000 to 10,000 K using inductive pairs of electric coils and operates at MHz frequencies. The absolute efficiency of the detector is determined by Eq. (1) based on the recorded gamma-ray spectrum.

$$\varepsilon\left(\%\right)=\frac{\text{N}_{\text{i}}}{\text{Act}\times \text{P}_{\text{n}}\left(\text{E}_{\text{i}}\right)\times \text{t}}\times 100,$$
(1)

where ε represents the absolute efficiency of the detector, Ni denotes the net count under the peak corresponding to the energy Ei, and Act refers to the specific radioactivity of the standard source measured in Bq/kg. Additionally, Pn(Ei) is the probability of photon emission at energy Ei, and t is the duration of the spectrometry measurement in seconds.

Specific activities of radioactive nuclei

The specific activity of each radionuclide is assessed from the gamma lines of the nuclide itself or of its decay products. The specific activity of Ra-226 was determined from the gamma lines of Pb-214 at 351.93 keV and Bi-214 at 609.31 keV. The specific activity of Th-232 was determined from the gamma lines of Ac-228 at 911.21 keV, with an intensity of 28%, and at 338.32 keV, with an emission percentage of 11%. The specific activity of K-40 was evaluated using its gamma line at 1460.70 keV, and that of Cs-137 using its gamma line at 661.66 keV. The specific radioactivity of these radioactive nuclei within the samples is given by

$$\text{A}=\frac{\text{NetArea}}{\varepsilon\left(\%\right)\times \text{BR}\left(\%\right)\times \text{t}\times \text{m}}\times 100,$$
(2)

where A denotes the specific activity of the sample expressed in Bq/kg. The term “NetArea” refers to the net area beneath the peak, while ε signifies the absolute efficiency of the detector. Additionally, BR represents the branching ratio expressed as a percentage, t is the counting time of the sample in seconds, and m is the mass of the sample in kilograms17.
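
As a rough illustration of Eq. (2), the sketch below recomputes specific activities from the net peak areas, efficiencies, and branching ratios quoted in the “Results and discussion” section. It assumes that the efficiency and branching ratio are entered as fractions, as they are reported there, so the percent signs and the factor of 100 in Eq. (2) are dropped. The function name is illustrative and is not taken from the original analysis software.

```python
def specific_activity(net_area, efficiency, branching_ratio, count_time_s, mass_kg):
    """Specific activity in Bq/kg (about Bq/L for water), following Eq. (2).

    ASSUMPTION: efficiency and branching_ratio are fractions, not percentages,
    so no factor of 100 is needed.
    """
    return net_area / (efficiency * branching_ratio * count_time_s * mass_kg)


COUNT_TIME_S = 86_400  # one day of counting, as stated in the text
MASS_KG = 0.8          # 800 cc of water

# Inputs quoted in the text for the samples with the highest net peak areas.
th232_sample7 = specific_activity(165, 0.011389, 0.28, COUNT_TIME_S, MASS_KG)
k40_sample9 = specific_activity(413, 0.008009, 0.1, COUNT_TIME_S, MASS_KG)
cs137_sample7 = specific_activity(134, 0.0139, 0.94, COUNT_TIME_S, MASS_KG)

print(f"Th-232: {th232_sample7:.4f}  K-40: {k40_sample9:.4f}  Cs-137: {cs137_sample7:.4f}")
```

Under these assumptions the three results fall within about 0.001 Bq/L of the reported 0.748, 7.460, and 0.148 Bq/L, and a net peak area of zero gives a specific activity of zero.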

Annual effective dose

The annual effective dose derived from the ingestion of natural and artificial radionuclides in drinking water is as follows:

$${\text{AED}}=\sum {{\text{A}}_{\text{i}}} \times {\text{DC}}{{\text{F}}_{\text{i}}} \times {\text{Cr}}.$$
(3)

The annual effective dose (AED) is expressed in sieverts per year and is calculated from the specific activities Ai of the isotopes Ra-226, Th-232, K-40, and Cs-137, measured in becquerels per liter (Bq/L). The dose conversion factor (DCFi) is expressed in sieverts per becquerel (Sv/Bq), and Cr represents the annual drinking water consumption for infants, children, and adults (250, 350, and 730 L, respectively)18. Table 2 lists the dose conversion factors of the radionuclides.

Table 2 Values of specific radiation conversion factor to dose.
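
A minimal sketch of Eq. (3) follows. The annual intakes are those given in the text. The dose conversion factors are assumed adult ingestion coefficients used only as placeholders, and the sample activities are hypothetical; the Table 2 values, which depend on age group, should be substituted in practice.

```python
# Annual drinking-water intake in litres, as given in the text.
CONSUMPTION_L = {"infant": 250, "child": 350, "adult": 730}

# ASSUMED adult ingestion dose coefficients in Sv/Bq (placeholders, not Table 2).
DCF_SV_PER_BQ = {"Ra-226": 2.8e-7, "Th-232": 2.3e-7, "K-40": 6.2e-9, "Cs-137": 1.3e-8}


def annual_effective_dose(activities_bq_per_l, dcf_sv_per_bq, consumption_l):
    """Eq. (3): sum over nuclides of A_i * DCF_i * Cr, in Sv per year."""
    return sum(
        activity * dcf_sv_per_bq[nuclide] * consumption_l
        for nuclide, activity in activities_bq_per_l.items()
    )


# HYPOTHETICAL sample activities in Bq/L, for illustration only.
sample = {"Ra-226": 0.0, "Th-232": 0.5, "K-40": 5.0, "Cs-137": 0.1}

for group, litres in CONSUMPTION_L.items():
    dose_sv = annual_effective_dose(sample, DCF_SV_PER_BQ, litres)
    print(f"{group}: {dose_sv * 1e6:.1f} µSv/y")
```

Because the same placeholder coefficients are used for every age group here, the doses differ only by the intake ratio; with age-specific coefficients the ordering could change.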

Potential cancer risk (ELCR)

The potential cancer risk associated with drinking water consumption throughout an individual’s lifetime was evaluated using Eq. (4):

$${\text{ELCR~}}={{\text{R}}_{\text{F}}}{\text{~}} \times {{\text{D}}_{\text{W}}} \times {{\text{F}}_{{\text{AR}}}}.$$
(4)

The ELCR represents the excess lifetime cancer risk, and Dw denotes the annual effective dose for the specified age group in sieverts per year. FAR is the duration of the target age range in years, and the risk factor RF is 7.3 × 10⁻² per sievert19.
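
Eq. (4) is a simple product, sketched below. The risk factor is the value given in the text, while the annual dose and the exposure duration in the example are hypothetical inputs chosen only to show the arithmetic.

```python
RISK_FACTOR_PER_SV = 7.3e-2  # R_F as given in the text


def excess_lifetime_cancer_risk(annual_dose_sv, duration_years,
                                risk_factor=RISK_FACTOR_PER_SV):
    """Eq. (4): ELCR = R_F * D_W * F_AR (dimensionless probability)."""
    return risk_factor * annual_dose_sv * duration_years


# HYPOTHETICAL inputs: 100 µSv/y received over an assumed 70-year span.
elcr = excess_lifetime_cancer_risk(100e-6, 70)
print(f"ELCR = {elcr:.2e}")
```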

Radium equivalent activity (Raeq)

The total radioactivity can be expressed in terms of the radium equivalent activity (Raeq) using Eq. (5)20:

$${\text{R}}{{\text{a}}_{{\text{eq}}}}={\text{ }}{{\text{A}}_{{\text{Ra}}}}+{\text{ 1}}.{\text{43 }}{{\text{A}}_{{\text{Th}}}}+{\text{ }}0.0{\text{77}}{{\text{A}}_{\text{K}}}.$$
(5)

The specific activities of Ra-226, Th-232, and K-40 (denoted as ARa, ATh, and AK, respectively) were measured in Bq/l. The internal risk indicators (Hin) and external risk indicators (Hex) were used to assess the risk of radiation exposure associated with specific isotopes within the radon gas decay series. The values of Hex and Hin can be derived from Eqs. (6) and (7)21 as follows:

$${{\text{H}}_{{\text{ex}}}}=\frac{{{{\text{A}}_{{\text{Ra}}}}}}{{370}}+\frac{{{{\text{A}}_{{\text{Th}}}}}}{{259}}+\frac{{{{\text{A}}_{\text{K}}}}}{{4810}} \leqslant 1,$$
(6)
$${{\text{H}}_{{\text{in}}}}=\frac{{{{\text{A}}_{{\text{Ra}}}}}}{{185}}+\frac{{{{\text{A}}_{{\text{Th}}}}}}{{259}}+\frac{{{{\text{A}}_{\text{K}}}}}{{4810}} \leqslant 1.$$
(7)
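
Equations (5) to (7) can be evaluated together, as in the sketch below. The activities passed in the example are hypothetical. The example also shows that the two hazard indices coincide when the Ra-226 activity is zero, since the radium term is the only one in which Eqs. (6) and (7) differ.

```python
def radium_equivalent(a_ra, a_th, a_k):
    """Eq. (5): Ra_eq = A_Ra + 1.43 * A_Th + 0.077 * A_K."""
    return a_ra + 1.43 * a_th + 0.077 * a_k


def external_hazard_index(a_ra, a_th, a_k):
    """Eq. (6): H_ex, which should not exceed 1."""
    return a_ra / 370 + a_th / 259 + a_k / 4810


def internal_hazard_index(a_ra, a_th, a_k):
    """Eq. (7): H_in, which should not exceed 1."""
    return a_ra / 185 + a_th / 259 + a_k / 4810


# HYPOTHETICAL activities in Bq/L, with no detectable Ra-226.
a_ra, a_th, a_k = 0.0, 0.5, 5.0
print(radium_equivalent(a_ra, a_th, a_k))
print(external_hazard_index(a_ra, a_th, a_k), internal_hazard_index(a_ra, a_th, a_k))
```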

Safe drinking water

The guidance level for each radionuclide in drinking water was determined using22:

$$\text{GL}=\frac{\text{IDC}}{\text{q}\times \text{h}_{\text{ing}}},$$
(8)

where GL represents the guidance level of the radionuclide in drinking water, measured in becquerels per liter. IDC denotes the individual dose criterion, set at 0.1 millisieverts per year. The variable q indicates the annual water consumption of adults (730 L per year), while hing refers to the dose conversion factor for ingestion, expressed in millisieverts per becquerel (mSv/Bq) (see Table 3).

Table 3 Guidance levels for radioactive nuclei in young adults’ drinking water.

The safety of the drinking water was then assessed using the following index:

$$\text{SFW}=\sum_{\text{i}} \frac{\text{C}_{\text{i}}}{\text{GL}_{\text{i}}} \leqslant 1.$$
(9)

Here, Ci denotes the specific activity of the i-th radionuclide in drinking water, while GLi represents the guidance level for the i-th radionuclide, as derived from Eq. (8). For drinking water to be considered safe, the sum SFW should not exceed 1.
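
The sketch below combines Eqs. (8) and (9). The dose criterion and adult intake are those stated in the text. The ingestion coefficients are assumed adult values used as placeholders for Table 3, and the sample activities are hypothetical.

```python
IDC_MSV_PER_YEAR = 0.1   # individual dose criterion from the text
ADULT_INTAKE_L = 730     # annual adult water consumption from the text

# ASSUMED adult ingestion coefficients in mSv/Bq (placeholders, not Table 3).
H_ING_MSV_PER_BQ = {"Th-232": 2.3e-4, "K-40": 6.2e-6, "Cs-137": 1.3e-5}


def guidance_level(h_ing, idc=IDC_MSV_PER_YEAR, q=ADULT_INTAKE_L):
    """Eq. (8): GL = IDC / (q * h_ing), in Bq/L."""
    return idc / (q * h_ing)


def safety_index(activities_bq_per_l, guidance_levels):
    """Eq. (9): SFW = sum of C_i / GL_i; values of at most 1 pass the criterion."""
    return sum(c / guidance_levels[n] for n, c in activities_bq_per_l.items())


levels = {n: guidance_level(h) for n, h in H_ING_MSV_PER_BQ.items()}

# HYPOTHETICAL sample activities in Bq/L.
sample = {"Th-232": 0.1, "K-40": 5.0, "Cs-137": 0.1}
sfw = safety_index(sample, levels)
print({n: round(gl, 2) for n, gl in levels.items()}, round(sfw, 3), sfw <= 1)
```

Guidance levels published by regulators are usually rounded to the nearest order of magnitude, so an index computed from unrounded levels, as here, is the more conservative of the two.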

ML methods

A suitable regression model is essential for an effective analysis. Linear regression is one of the most straightforward and commonly employed techniques, primarily because of its simplicity in application. Performance indicators such as Mean Squared Error (MSE) and R-squared serve to measure prediction precision. The model should then be refined by adjusting hyperparameters and integrating domain-specific insights, which is critical for developing accurate and reliable models. A systematic methodology is therefore essential when tackling a regression problem. The main aim of this study was to predict AEDs and assess ELCR. The key steps of this research are outlined below:

(a) Data Collection: This step acquires all pertinent data required for analysis, ensuring its relevance and comprehensiveness. The data obtained as described in sections “Sample preparation” to “Safe drinking water” are discussed primarily in the “Results and discussion” section.

(b) Data Preprocessing: Data preprocessing is a fundamental step in ML because it helps improve the quality of data and prepare them for modeling. In our dataset, there were no outliers or missing values. We used the one-hot encoding method to convert categorical mineral water type and age group data into numerical data.

(c) Data Splitting: The dataset was divided in a 70:30 ratio, with 70% allocated for training the model and 30% for testing its performance on previously unseen data.

(d) Model Selection: Various regression models, including ridge regression, Decision Tree (DT) Regression, and Random Forest (RF) Regression, were evaluated to determine the most appropriate option. Below, each model is presented with a concise description:

Ridge Regression: Ridge regression improves linear regression by addressing multicollinearity by adding a regularization term to the ordinary least squares objective function. The L2 penalty reduces model complexity by shrinking the coefficients toward zero. Ridge regression is particularly useful for handling multicollinearity, which occurs when independent variables are highly correlated23.

DT Regression: DT regression employs a top-down, greedy approach. It progressively divides a dataset into increasingly smaller subsets while simultaneously constructing an associated decision tree. The end product is a tree comprising decision and leaf nodes24.

RF Regression: RF Regression is an additive model that predicts outcomes by aggregating decisions from multiple base models. Each base model is a DT, and the final output of the RF model is the combined result of the individual DTs. This method of using several models to enhance predictive accuracy is referred to as model ensemble24.
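
The preprocessing and splitting steps (b) and (c) can be sketched with the standard library alone. The helper names, the random seed, and the reduced set of three brands are assumptions made for illustration; the study's actual feature table is not reproduced here.

```python
import random


def one_hot(values):
    """Return the sorted category list and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    position = {c: i for i, c in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        rows.append(row)
    return categories, rows


def split_rows(rows, test_fraction=0.3, seed=0):
    """Shuffle row indices reproducibly and hold out test_fraction of them."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# HYPOTHETICAL records: (brand, age group), using three of the fifteen brands.
records = [(b, a) for b in ("Damavand", "Zamzam", "Aquafina")
           for a in ("infant", "child", "adult")]
brand_names, brand_rows = one_hot([b for b, _ in records])
age_names, age_rows = one_hot([a for _, a in records])
features = [b + a for b, a in zip(brand_rows, age_rows)]  # concatenate indicators

train_rows, test_rows = split_rows(features)
print(age_names, len(train_rows), len(test_rows))
```

Each encoded row carries exactly one active indicator per categorical variable, and the split holds out roughly 30% of the rows for testing.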

(e) Model Training and Hyperparameter Tuning: The grid search method is used as a strategy to optimize the hyperparameters of ML models to enhance performance. This approach specifies a range of potential values for each hyperparameter, followed by an exhaustive evaluation of all possible combinations to identify the most effective configuration for improving model performance. Table 4 presents the hyperparameter configurations for three different models: Ridge, DT, and RF. These configurations were determined using a grid search technique, thereby outlining the ideal hyperparameters for each model.

Table 4 The hyperparameter configurations for three different models: Ridge, DT, and RF.
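
One way to run such an exhaustive search is with scikit-learn's `GridSearchCV`, sketched below for the RF model. The synthetic data, the grid values, and the use of 3-fold cross-validation inside the search are all assumptions made for illustration; they are not the configurations reported in Table 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# SYNTHETIC stand-in data: 45 rows (15 brands x 3 age groups), 4 activity features.
X = rng.uniform(0.0, 1.0, size=(45, 4))
y = X @ np.array([0.2, 0.5, 0.1, 0.05]) + rng.normal(0.0, 0.01, size=45)

# 70:30 split, as described in step (c).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# HYPOTHETICAL grid; every combination is fitted and scored.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 3, 5]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, scoring="r2", cv=3
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # R2 of the refitted best model on held-out data
```

The same pattern applies to the ridge and DT models by swapping the estimator and its grid.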

(f) Model Evaluation: The efficacy of the model was evaluated using various metrics, including the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Coefficient of Determination (R2). The MAE quantifies the disparity between the actual and forecasted values by computing the average of the absolute differences throughout the dataset25.

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left| y_i-\hat{y}_i \right|$$
(10)

The RMSE is derived by computing the square root of the mean square error (MSE). The MSE quantifies the disparity between the actual and predicted values by averaging the squared differences throughout the dataset25.

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}$$
(11)

Coefficient of Determination (R2): R2 indicates the degree to which the predicted values align with the observed values. For a well-fitting model, the coefficient lies between 0 and 1 and can be interpreted as the percentage of variance explained; it can become negative when a model fits worse than simply predicting the mean. A higher R2 value indicates superior model performance25.

$$R^2=1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}$$
(12)
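
The three metrics in Eqs. (10) to (12) can be written directly from their definitions, as in the following sketch. The four-point example is hypothetical and serves only to check the arithmetic by hand.

```python
import math


def mae(y_true, y_pred):
    """Eq. (10): mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Eq. (11): square root of the mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Eq. (12): 1 minus residual sum of squares over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# HYPOTHETICAL observed and predicted values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

For this example the absolute errors are 0.5, 0, 0.5, and 0, so the MAE is 0.25, the MSE is 0.125, and R2 is 1 − 0.5/5 = 0.9.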

(g) Selecting the Best Model and Result Interpretation:

Figure 1 presents a comparative analysis of the error metrics: MAE and RMSE across three ML models: RF, Decision Tree, and Ridge models. These metrics are critical for assessing the accuracy of regression models by quantifying the discrepancies between actual and predicted values. The MAE is represented by blue bars, and it quantifies the average absolute deviation between the actual and predicted values, regardless of error signs. A lower MAE typically indicates a more precise model. In this analysis, the RF model exhibits the lowest MAE, indicating superior accuracy compared to the other models. In contrast, the Ridge model obtained the highest MAE, indicating a greater degree of error in its predictions. The MAE of the DT model was marginally higher than that of the RF model, reflecting its moderate performance in this context.

Fig. 1 Comparison of the MAE and RMSE error metrics across three ML models: RF, DT, and ridge regression.

The RMSE, illustrated with yellow bars, penalizes larger discrepancies more heavily because the errors are squared before averaging. It is therefore more sensitive to large errors than the MAE. The RF model achieved the lowest RMSE, indicating high accuracy and few large errors. The ridge model had the highest RMSE and thus tended to produce larger prediction errors than the other models. Although the DT model achieved a lower RMSE than the ridge model, it still fell short of the accuracy of the RF model. Overall, the chart indicates that the RF model outperforms the other two models on both MAE and RMSE, highlighting its effectiveness in minimizing prediction errors.

Figure 2 compares the R2 index of the three ML models: RF, DT, and Ridge. The R2 index, also known as the “coefficient of determination,” is a key indicator of the accuracy of regression models. It gives the proportion of the variance in the dependent variable that is explained by the independent variables. The closer the R2 value is to 1, the better the model explains the variance in the data. In the graph, the RF model, represented in blue, exhibited the highest R2 value, which suggests that it is the most accurate of the three. RF is a complex model that combines multiple decision trees and is typically employed in scenarios with complex data. RF can also be used to identify correlations between features.

Fig. 2 Comparative analysis of the R2 index across three distinct ML models: RF, DT, and Ridge.

The DT model (green curve) is less accurate than the RF model. This can be attributed to the simplicity of the DT model, as it consists of a single tree rather than a combination of trees, as in the RF. Despite being less accurate than the RF, the DT model seems to outperform the Ridge model (red), which ranks last among the three models. This comparison reveals that the more intricate RF model performs better on our data and produces higher R2 values.

After evaluating the performance of several regression models based on various metrics, we found that the RF Regression model demonstrated the best results.

To assess the robustness of the results obtained from the proposed RF model, we evaluated model stability using two methods. First, we applied small noise (mean 0, standard deviation 0.01) to the test data and compared the original \(R^{2}\) with the perturbed \(R^{2}\). Both values remained at 0.9, demonstrating the model’s robustness against minor input variations. This suggests that the model can reliably maintain its predictive power, which is crucial for real-world applications in which data often contain small amounts of noise. Second, the bootstrap method was used to estimate the stability of the proposed model. With a dataset of limited size, cross-validation (CV) often produces unstable, high-variance results because the small subsets in each fold provide insufficient data for robust training and testing, which can lead to inaccurate estimates of model performance. The bootstrap method addresses this issue by repeatedly sampling with replacement from the entire dataset, thereby making fuller use of the limited data. In this study, the bootstrap procedure was repeated 100 times, enabling the calculation of the mean and standard deviation of \(R^{2}\). The results showed a mean \(R^{2}\) of 0.89 with a standard deviation of 0.08, indicating that the model is stable. The bootstrap method was therefore preferred over CV in this case because it delivers a more consistent and trustworthy evaluation.
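
Both stability checks are model-agnostic and can be sketched as follows. To keep the example self-contained, a one-feature least-squares line stands in for the RF model, the data are synthetic, and each bootstrap round is scored on the rows left out of the resample (out-of-bag); all three choices are assumptions, since the evaluation details are not specified in the text.

```python
import random
import statistics


def r2_score(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """STAND-IN model (not RF): least-squares line; returns a predict function."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return lambda xs: [intercept + slope * xi for xi in xs]


def noise_check(predict, x_test, y_test, sigma=0.01, seed=0):
    """R2 on clean inputs versus inputs perturbed by N(0, sigma) noise."""
    rng = random.Random(seed)
    noisy = [xi + rng.gauss(0.0, sigma) for xi in x_test]
    return r2_score(y_test, predict(x_test)), r2_score(y_test, predict(noisy))


def bootstrap_r2(x, y, fit, n_rounds=100, seed=0):
    """Mean and standard deviation of out-of-bag R2 over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(x)
    scores = []
    for _ in range(n_rounds):
        picked = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        picked_set = set(picked)
        held_out = [i for i in range(n) if i not in picked_set]
        if len(held_out) < 2:
            continue
        predict = fit([x[i] for i in picked], [y[i] for i in picked])
        scores.append(r2_score([y[i] for i in held_out],
                               predict([x[i] for i in held_out])))
    return statistics.mean(scores), statistics.stdev(scores)


# SYNTHETIC data: 45 points on a noisy line.
data_rng = random.Random(1)
x = [i / 44 for i in range(45)]
y = [0.1 + 2.0 * xi + data_rng.gauss(0.0, 0.05) for xi in x]

r2_clean, r2_noisy = noise_check(fit_line(x, y), x, y)
mean_r2, std_r2 = bootstrap_r2(x, y, fit_line)
print(r2_clean, r2_noisy, mean_r2, std_r2)
```

To apply the same checks to an RF, `fit_line` would be replaced by a function that trains the forest and returns its prediction function.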

Generally, the RF model demonstrated superior accuracy, robustness, and generalizability. As a result, the random forest model was chosen to make the final predictions. By leveraging its ability to capture complex patterns and relationships in the data, we can anticipate precise predictions for AED. Figure 3 summarizes the step-by-step process of the proposed ML regression model.

Fig. 3 Steps to construct the machine learning regression model.

Results and discussion

The specific activities of the radioactive nuclei across all samples are presented in Tables 5, 6, 7 and 8, with Fig. 4 facilitating their comparison. The annual dose received through water consumption, expressed in µSv/y, is detailed for the three age groups in Table 9, and Fig. 5 provides a visual comparison of the dose results. Additionally, data concerning the cancer risk associated with water consumption among the same three age groups are presented in Table 10, with Fig. 6 supporting the comparative evaluation of cancer-related outcomes. Furthermore, Table 11 presents results related to radium equivalent activity, internal and external risk indices, and assessments of drinking water safety. A comparative analysis of the findings from this study and those from other countries is presented in Table 12, complemented by a comparison chart in Fig. 7.

Fig. 4 Specific activities of 226Ra, 40K, 232Th, and 137Cs in the samples.

Fig. 5 Comparison of the received dose among infants, children, and adults.

Fig. 6 Comparison of cancer risk among adults, infants, and children.

Fig. 7 Comparison of the specific activities of the radionuclides in the mineral waters of this study with those of other countries.

Table 5 Summary of the specific activity results of 226Ra for all samples (Bq/l).

The specific activities of 226Ra are detailed in Table 5. In applying Eq. (2), the counting time for every sample was one day (86,400 s) and the mass of each sample was 0.8 kg (800 cc). The branching ratio for 226Ra was taken as 0.46, based on decay software data for the 609 keV line of 214Bi. The calculated absolute efficiency for 226Ra is 0.015127. The net peak area for all samples was zero. Because the specific activity is proportional to the net peak area, the specific activity of 226Ra in all samples is also zero.

Table 6 Summary of the specific activity results of 232Th for all samples (Bq/l).

The specific activities of 232Th are presented in Table 6. In applying Eq. (2), the counting time for every sample was one day (86,400 s) and the mass of each sample was 0.8 kg (800 cc). The branching ratio for 232Th, obtained from decay software for the 911 keV line of 228Ac, was taken as 0.28. The absolute efficiency for 232Th is 0.011389. Among the samples, sample 7 exhibited the highest net peak area (165), while sample 14 exhibited the lowest (0). Accordingly, the highest specific activity of 232Th was found in sample 7, at 0.748 Bq/L, whereas sample 14 had the lowest specific activity, at zero.

Table 7 Summary of the specific activity results of 40K for all samples (Bq/l).

Table 7 lists the specific activity values for 40K. In applying Eq. (2), the counting time for every sample was one day (86,400 s) and the mass of each sample was 0.8 kg (800 cc). The branching ratio for 40K, obtained from decay software for the 1460 keV line, was taken as 0.1. The absolute efficiency for 40K was calculated as 0.008009. Among the samples, sample 9 exhibited the highest net peak area (413), while sample 14 exhibited the lowest (zero). The highest specific activity of 40K was found in sample 9, at 7.460 Bq/L, whereas sample 14 had the lowest specific activity, recorded as zero.

Table 8 Summary of the specific activity results of 137Cs for all samples (Bq/l).

Table 8 lists the specific activities of 137Cs. In applying Eq. (2), the counting time for every sample was one day (86,400 s) and the mass of each sample was 0.8 kg (800 cc). The branching ratio for 137Cs, obtained from decay software for the 661.66 keV line, was taken as 0.94. The absolute efficiency for 137Cs is 0.0139. Among the samples, sample 7 exhibited the highest net peak area (134), while sample 14 exhibited the lowest (0). The specific activity of 137Cs was highest in sample 7, at 0.148 Bq/L, whereas sample 14 had the lowest specific activity, recorded as zero.

The data presented in Fig. 4 indicate that the radioactivity in the mineral water samples is associated with 40K, 232Th, and 137Cs, with 40K showing the highest specific activities. The concentration of 226Ra was recorded as zero across all samples, which is significant given that 98.5% of radiation-related damage is attributed to 226Ra. Consequently, the concentration of radium emerges as a critical parameter for mineral water factories seeking licensure from the Ministry of Health.

The information provided in Table 9 reveals that the annual effective dose for infants ranged from zero (sample 14) to 0.130 mSv (sample 7), and this maximum is below the UNSCEAR reference value. For children, the annual effective dose ranged from zero (sample 14) to 0.182 mSv (sample 7), with the maximum likewise below the value indicated by UNSCEAR. Lastly, for adults, the annual effective dose ranged from zero (sample 14) to 0.379 mSv, that is, 379 µSv (sample 7), and the peak dose again remains below the UNSCEAR threshold. The three maxima are in the ratio of the annual intakes of 250, 350, and 730 L.

Table 9 Annual dose for infants, children, and adults based on µSv/y.

Figure 5 compares the dosages received by infants, children, and adults. The annual consumption of drinking water is 730 L for adults, 350 L for children, and 250 L for infants. Consequently, the dosage received by each age group decreased in the following order: adults, children, and infants. The highest doses for all three age groups were associated with sample 7.

The data presented in Table 10 indicate that the cancer risk for infants ranges from 0 to 710 × 10⁻⁶. For children, it ranges from 0 to 995 × 10⁻⁶, and for adults from 0 to 2070 × 10⁻⁶. In each age group, the risk varies from sample to sample.

Table 10 Risk factors for cancer in infants, children, and adults.

Figure 6 illustrates the comparative risk of cancer for adults, infants, and children across the analyzed samples. This assessment is driven by the AED, which is the critical variable in Eq. (4). Accordingly, the cancer risk decreased in the same order as the effective dose received: adults, children, and infants.

Table 11 Risk indicators for mineral water samples consumed in Arak City.

Table 11 reports an average radium equivalent activity of 0.61. The internal and external hazard indices are identical and low because the specific activity of 226Ra was zero in every sample, so the only term in which Eqs. (6) and (7) differ vanishes. Furthermore, the analysis indicates that the water safety index is below one for all samples, suggesting that there is no significant health risk associated with these findings.

Table 12 Comparison of the obtained results with those from other countries (Bq/l).

To put the findings of this study in context, Table 12 lists the specific activities of radionuclides in bottled water from various countries. The results of this investigation agree well with those reported in other countries and with the values established by international organizations. As illustrated in Fig. 7, the radionuclide activity levels in all countries except Malaysia and Nigeria fall below the thresholds established by the World Health Organization. It can therefore be inferred that the radionuclide content of the bottled water consumed in Arak City, Iran, does not pose a risk to human health.

Fig. 8 Actual versus predicted values in the RF regression.

Fig. 9 A summary of feature importance.

Figure 8 displays the actual and predicted values for the RF model used in our regression analysis. Plotting the model’s predictions against the observed values makes it easier to evaluate the effectiveness of the regression model. The actual values are shown on the horizontal axis and the predicted values on the vertical axis, so each point represents one pair of actual and predicted values.

In this representation, points that lie close to the 45-degree line indicate that the model’s predictions are nearly equal to the actual values. A large number of points cluster around this line, particularly at the lower end of the X-axis, suggesting a high level of accuracy in this region. The low scatter over most of the plot shows that the RF model reproduces the data well. However, a few points deviate from the diagonal line and may reflect specific data characteristics that warrant further investigation. Overall, the chart indicates that the RF model performed well in this regression task.
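The visual impression of points hugging the 45-degree line can be quantified with standard regression metrics. The sketch below computes the root-mean-square error and the coefficient of determination from paired values. The numbers are hypothetical placeholders, not data from this study, and the function name is illustrative.

```python
import math


def regression_metrics(actual, predicted):
    """Return (RMSE, R^2) for paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equally long")
    n = len(actual)
    mean_actual = sum(actual) / n
    # Squared vertical distances from the 45-degree line (predicted == actual).
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Spread of the actual values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return rmse, r2


# Hypothetical actual and predicted values, for illustration only.
actual = [0.0, 50.0, 100.0, 200.0, 380.0]
predicted = [5.0, 48.0, 104.0, 190.0, 370.0]
rmse, r2 = regression_metrics(actual, predicted)
print(f"RMSE = {rmse:.2f}, R^2 = {r2:.4f}")
```

An RMSE close to zero and an R² close to one correspond to points lying on the diagonal, while isolated large residuals show up as the outlying points mentioned above.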

Table 13 Comparison of actual AECL and ECLR with predicted AECL and ECLR.

Table 13 compares the actual and predicted AECL and ECLR across the three age groups: Infants, Children, and Adults. The table includes columns for FAR (75 for all samples), RF (0.073 for all samples), and CR (ranging from 0.00264 to 0.00401). FAR and RF are constant, so of these three columns only CR varies between samples. The table thus sets the results of our prediction model against the actual values for each age category.

Figure 9 summarizes the feature importance. The CR feature had the highest importance, indicating its crucial role in the model analysis. The “Adults” feature ranked second, also with a considerable effect. Mineral water types 7, 6, and 5 had notable importance, although less than CR and Adults. The features related to infants and children were of moderate importance compared with the other features. This figure highlights the features that have the greatest influence on our model analysis.

Summary and conclusion

Ra-226 activity was not detected in any of the mineral water samples. The average Th-232, K-40, and Cs-137 concentrations were 0.311, 2.104, and 0.049 Bq/l, respectively, all below the WHO thresholds. The annual effective doses from bottled water consumption were 57.6 µSv/y for infants, 80.7 µSv/y for children, and 168 µSv/y for adults, well below the UNSCEAR limit of 1000 µSv/y. The cancer risk coefficients were 316 × 10⁻⁶ for infants, 442 × 10⁻⁶ for children, and 922 × 10⁻⁶ for adults, the last indicating a cancer risk of 922 × 10⁻⁶ for a 75-year-old. Hex and Hin values ranged from 0 to 0.002, indicating no health risk. The radium equivalent activity values ranged from 0 to 1.08 Bq/l, aligning with global averages, with the highest level observed in the WFR7 sample. Heavy elements such as Cd, Hg, Sn, Pb, and As were not detected (0 mg/L). The RF model’s performance was assessed by comparing actual and predicted values, which supported its reliability across the different age groups.