Introduction

Precipitation is the most important component of the water cycle and plays a principal role in hydrological and climatological systems1. Climate change affects the hydrological cycle and the flow regimes of waterways, which in turn changes the volume of water available at local, regional, and global scales2.

Precise proxies such as stable isotopes (δ18O and δ2H) are a key component in assessing, controlling, and protecting water resources3. The main variables that affect isotope ratios in precipitation are climate and location4. The use of stable isotopes to solve biogeochemical problems in ecosystem analysis is increasing rapidly because stable isotope data can contribute both source-sink (tracer) and process information5. Stable environmental isotopes are used as a valuable tool for studying the origins of water bodies, allowing for a good understanding of their capacity and more efficient utilization6. The stable isotopes of hydrogen and oxygen (δ2H and δ18O) provide a distinct signature for each water source, which makes it possible to evaluate sources, assess contamination risks, and investigate the movement and fate of pollutants. These isotopes can also provide insights into the paleoclimatic conditions at the time of water recharge. A comprehensive study was conducted across different regions of Iraq to establish an isotopic database of deuterium (δ²H) and oxygen-18 (δ¹⁸O) in groundwater and to examine its relationship with the isotopic signature of surface water7.

The factors driving variability in rainfall stable water isotopes (specifically δ18O and deuterium excess, d = δ2H − 8 δ18O) were studied in a 13-year data set of daily rainfall samples from coastal southwestern Western Australia (SWWA). Backwards dispersion modeling, automatic synoptic type classification, and a statistical model were used to establish reasons for variability on a daily scale. The predictions from the model were accumulated to longer temporal scales to find out the cause of variability on multiple timescales8.
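The deuterium excess defined above, d = δ2H − 8 δ18O, is a simple linear combination of the two delta values. The following is a minimal sketch of that calculation; the function name and the sample values are hypothetical and are not taken from the cited study.

```python
def deuterium_excess(delta_2h: float, delta_18o: float) -> float:
    """Return deuterium excess d = δ2H − 8·δ18O (all values in per mil)."""
    return delta_2h - 8.0 * delta_18o


# Hypothetical daily rainfall samples as (δ18O, δ2H) pairs in per mil.
samples = [(-4.0, -22.0), (-6.5, -40.0), (-2.0, -8.0)]

# Compute d-excess for every sample.
d_values = [deuterium_excess(d2h, d18o) for d18o, d2h in samples]
print(d_values)
```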

Artificial intelligence (AI) offers a powerful toolkit for unraveling these intricate relationships and gaining deeper insights. Forty-two precipitation sampling stations were chosen across the Islamic Republic of Iran to assess the fractional importance of the climatic and geographic factors influencing stable isotopes. Deep learning models were also employed to simulate the stable isotope content, with missing data first addressed using the predictive mean matching (PMM) method. Deep neural network (DNN) models were used to predict stable isotope values in precipitation, and validation using evaluation metrics demonstrated that the DNN-based model exhibited higher accuracy9. That study underscores the efficacy of ML techniques in both simulating and forecasting stable isotope contents with high precision. Several models, including Artificial Neural Networks (ANNs), Stepwise Regression, and Ensemble Machine Learning approaches, were applied to simulate stable isotope signatures in precipitation. Among the studied machine learning models, XGBoost gave the most accurate simulation, with R2 of 0.84 and 0.86, RMSE of 1.97 and 12.54, NSE of 0.83 and 0.85, AIC of 517.44 and 965.57, and BIC of 531.42 and 979.55 for 18O and deuterium, respectively10.

The link between environmental isotopes and changing rainfall patterns is a complex and fascinating topic, holding immense potential for understanding climate change and water resource management.

In Iraq, quantile regression forest and RBF-NN artificial intelligence models have been used to estimate evapotranspiration rates across different regions of the country based on climatic parameters and vegetation cover11,12.

The model provided robust estimations compared to traditional methods, aiding efficient water resource management. Another study developed a machine learning framework to calculate isotope time series at a monthly resolution using available climate and location data. Such a framework can improve precipitation isotope model predictions, which can serve as resources for probing historic patterns in the isotopic composition of precipitation with a high level of meteorological accuracy13. A further study suggested using data-driven methods, such as the multilayer perceptron, when appropriate laboratory isotope analysis is unavailable or laboratory costs are high. The coefficient of determination (R2), mean absolute error (MAE), and root mean square error (RMSE) were used to evaluate the performance of the models. In addition, visualization techniques (e.g., Taylor diagrams and heat maps) were used to assess the similarities between the measured and estimated δD and δ18O values14.

A previous study compared the performance of three artificial neural network (ANN) models—Radial Basis Function (RBF), Multilayer Perceptron (MLP), and Group Method of Data Handling (GMDH)—with the conventional Penman–Monteith (PM) method for estimating monthly reference evapotranspiration (ET₀) in Basrah City, southern Iraq15.

Several machine learning (ML) techniques were implemented, including shallow neural network (SNN), deep neural network (DNN), decision tree (DT), random forest (RF), and extreme gradient boosting (XGBoost). XGBoost showed the highest accuracy across the majority of studied stations, with R2 = 0.91, VNS = 0.90, AIC = 405, BIC = 410, and RMSE = 0.76. DNN exhibited superior accuracy in specific cases, with R2 = 0.87, VNS = 0.87, AIC = 445, BIC = 460, and an RMSE of 1.10 (ref. 16).

To enhance predictive capabilities, a Support Vector Machine (SVM) model was adapted to estimate δ¹⁸O values using multiple hydrochemical indicators, achieving strong performance (R² = 0.92, MSE = 2.89). The study links directly to national water security goals and supports the global Sustainable Development Goal 6 (SDG 6) for clean water and sanitation17.

An efficient ML model based on an ensemble Deep random vector functional link (EDRVFL) optimized by a robust optimization method was developed to forecast daily AQI in three cities (Chengdu, Wuhan, and Taiyuan) in China18.

One study proposed an external attention-based ensemble learning method (EA-ensemble) that combines five sub-models, namely XGBoost, RF, CNN, GRU, and MLP, for ET₀ prediction19. Another study focused on providing an optimal solution for monitoring and accurately estimating river streamflow using EML and ML techniques, developing optimal weighted ensemble models from several influential algorithms20.

Despite the availability of various predictive techniques, few studies have evaluated their reliability under diverse and extreme climatic conditions. This study addresses this gap by applying and validating the methodology in the Iraq region, which is characterized by complicated climatic variability.

This study shows the importance of a dedicated effort to integrate data obtained from the analysis of numerous rainwater samples collected over a long period with data from previously published literature. It also assesses the impact of ambient temperature, elevation, and relative humidity on the measured values of the stable isotopes of hydrogen and oxygen (δ2H and δ18O). The methodology outlined here is a promising technique for application in regions worldwide characterized by diverse and severe climatic conditions. The study aims to develop a mathematical model to predict isotopic values using artificial intelligence techniques.

Study area: location and climate

Iraq is situated in Southwest Asia, in a semi-arid region between latitudes 29.5° and 37.5° N and longitudes 38.45° and 48.45° E21. Figure 1 shows the six regions that make up the topography of the nation: the Upper Zagros mountain region, near the borders with Iran and Turkey; the Jazirah zone, north of Iraq between the Tigris and Euphrates rivers, which includes two regions, the Upper Meso. plain and foothills region and the Meso. plain region; the Lower Meso. plain region, which represents the alluvial plains in the center and southeast; the North desert region, together with the South desert region, which is located west of the Euphrates River; and the Marsh Estuary region22.

Fig. 1

Major topographic regions of Iraq according to24.

Iraq was chosen as a unique study area because of its subtropical continental climate, which is arid to semi-arid, with hot, dry summers and relatively cold winters23; more than 70% of the country consists of arid and semi-arid regions24. The climate also varies across the country. According to the Bailey humidity index classification, Iraq can be divided into three zones: a semi-humid zone in the far north (above 36° N), a semi-dry zone (33°–36° N), and a dry zone in the middle and south (below 33° N)25. This diversity is attributed to several factors, the most prominent of which is Iraq’s astronomical location between latitudes 29° and 37° N26.

Methodology

Data preparation

Stable isotope data for oxygen (δ¹⁸O) and deuterium (δ²H) in rainfall were collected from 34 stations distributed across Iraq over a 14-year period (2010–2024), as shown in Table 1. An isotopic database was obtained from the Water Isotope System for Data Analysis, Visualization, and Electronic Retrieval (WISER). This website is the common access point to the International Atomic Energy Agency’s (IAEA) scientific, technical, and regulatory information resources28,29. The rain samples were collected during the rainy season (November to April) with the support of the Iraqi Meteorological Organization and Seismology, under the guidance of authorities from meteorological stations across various governorates in Iraq. The spatial distribution of the weather stations, reflecting the different topography of Iraq, is illustrated in Fig. 2. Table 1 gives the mean isotope values for these samples. Negative δ values indicate water that is isotopically lighter than the standard, i.e., depleted in the heavier isotopes (¹⁸O and ²H), typically due to processes such as precipitation under colder climatic conditions or the removal of heavier isotopes through condensation or partial evaporation. Conversely, positive values indicate samples that are isotopically heavier than the standard, often reflecting warmer conditions, increased evaporation, or interaction with surface waters in arid environments.
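Table 1 reports weighted mean isotope values per station. The weighting variable is not stated in the text; the sketch below assumes the common choice of precipitation amount as the weight, so that large rain events contribute more than small ones. The function name and the sample values are hypothetical.

```python
def amount_weighted_mean(deltas, amounts_mm):
    """Precipitation-amount-weighted mean of delta values (per mil).

    Assumption: each sample is weighted by its rainfall amount in mm.
    """
    total = sum(amounts_mm)
    if total <= 0:
        raise ValueError("total precipitation must be positive")
    return sum(d * p for d, p in zip(deltas, amounts_mm)) / total


# Hypothetical δ18O values (per mil) and rain amounts (mm) at one station.
delta_18o = [-2.0, -6.0, -4.0]
rain_mm = [10.0, 30.0, 10.0]

weighted = amount_weighted_mean(delta_18o, rain_mm)
simple = sum(delta_18o) / len(delta_18o)
print(weighted, simple)
```

In this made-up example the weighted mean is more negative than the simple mean because the largest event carries the most depleted value.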

Table 1 Geographic location of sampling stations across Iraq and isotope weighted mean data for the period (2010–2024).
Fig. 2

Geographic location of sampling stations across Iraq.

Dataset and input features

The dataset comprised 279 samples, each with the following input features:

  • Rain amount (precipitation amount).

  • Temperature (Temp avg).

  • Relative humidity (RH%).

  • Elevation (m).

These variables were selected because of their known influence on isotopic fractionation in rainfall. A correlation test between all variables, as shown in Table 2, illustrates the relationships among them. The highest correlation between the inputs and the output is observed for elevation, followed by air temperature. The dataset was split into a training set with about 80% of the samples and a test set with the remaining 20%. Six distinct machine learning algorithms were then applied and compared in this study.
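The correlation screening and the 80/20 split described above can be sketched with the standard library alone. This is an illustrative sketch, not the study's code: the records, the column order, the Pearson coefficient as the correlation measure, the rounding of the test-set size, and the seed are all assumptions.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and return (train_rows, test_rows)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = math.ceil(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: (rain_mm, temp_c, rh_pct, elevation_m, delta_18o).
rows = [
    (12.0, 18.5, 55.0, 34.0, -3.1),
    (25.0, 12.0, 68.0, 223.0, -5.4),
    (8.0, 21.0, 48.0, 5.0, -2.2),
    (40.0, 9.5, 72.0, 853.0, -7.9),
    (15.0, 16.0, 60.0, 115.0, -4.0),
    (30.0, 11.0, 70.0, 331.0, -6.1),
    (5.0, 23.0, 42.0, 9.0, -1.5),
    (22.0, 13.5, 66.0, 255.0, -5.0),
    (18.0, 15.0, 62.0, 190.0, -4.6),
    (35.0, 10.0, 71.0, 650.0, -7.0),
]

feature_names = ["rain_mm", "temp_c", "rh_pct", "elevation_m"]
target = [r[4] for r in rows]

# Correlation of each input feature with the target (δ18O).
correlations = {
    name: pearson_r([r[i] for r in rows], target)
    for i, name in enumerate(feature_names)
}

train_rows, test_rows = split_rows(rows, test_fraction=0.2, seed=42)
print(correlations)
print(len(train_rows), len(test_rows))
```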

Table 2 Performance metrics of machine learning models.

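Support Vector Machine (SVM): A powerful and versatile machine learning model capable of performing linear or non-linear classification, regression, and even outlier detection. In classification, SVMs work by finding the hyperplane that best separates different classes in the feature space; in regression, as used here, the model instead fits a function that keeps as many training points as possible within a tolerance margin.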

Gradient boosting regressor (GBR): An ensemble learning method that builds a strong predictive model from a combination of weaker models, typically decision trees. It iteratively trains new models to correct the errors of previous models.

Artificial neural network (ANN): A computational model inspired by the structure and function of biological neural networks. ANNs consist of interconnected nodes (neurons) organized in layers, capable of learning complex patterns and relationships in data.

CatBoost: A gradient boosting library developed by Yandex. It is known for its ability to handle categorical features effectively and its robustness against overfitting.

XGBoost: An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework.

Random forest (RF): An ensemble learning method that operates by constructing a multitude of decision trees during training and outputting the mean prediction of the individual trees. It is highly effective for both classification and regression tasks and is known for its robustness and ability to handle high-dimensional data.

Minimal preprocessing was applied. The features were kept in their original units without scaling, for clear interpretation. A ±10% adjustment was used solely for data augmentation: feature values were randomly shifted upward or downward by up to 10% to simulate natural noise and make the model more robust. The only exception was the ‘Elevation (m)’ feature, which was left unchanged during augmentation so that the geographical location of the data was preserved.
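The augmentation step can be sketched as follows. This is one possible reading, under stated assumptions: each perturbed feature is multiplied by a factor drawn uniformly from [0.9, 1.1], the target isotope values are copied unchanged, and elevation is held fixed. The function name, dictionary keys, and sample record are hypothetical.

```python
import random


def augment_row(row, n_copies=100, max_shift=0.10,
                fixed_keys=("elevation_m", "delta_18o", "delta_2h"),
                rng=None):
    """Return n_copies noisy copies of one record.

    Assumption: every key not listed in fixed_keys is scaled by a random
    factor in [1 - max_shift, 1 + max_shift]; fixed keys (elevation and
    the isotope targets) are copied as-is.
    """
    rng = rng or random.Random(42)
    copies = []
    for _ in range(n_copies):
        new_row = dict(row)
        for key, value in row.items():
            if key in fixed_keys:
                continue
            new_row[key] = value * (1.0 + rng.uniform(-max_shift, max_shift))
        copies.append(new_row)
    return copies


# Hypothetical record for a single rainfall sample.
row = {
    "rain_mm": 12.0,
    "temp_c": 18.5,
    "rh_pct": 55.0,
    "elevation_m": 34.0,
    "delta_18o": -3.1,
    "delta_2h": -15.0,
}

augmented = augment_row(row, n_copies=100, rng=random.Random(0))
print(len(augmented), augmented[0])
```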

Results and discussion

The modelling results are summarized in Table 2 for each model type and compared across various performance indicators. The Random Forest (RF) model achieved exceptional results, outperforming the other algorithms across all key metrics (Table 2).

Hyper-parameter tuning for all machine learning models was conducted within a multi-output prediction structure, using 100 estimators and a random state of 42, which yielded the optimal performance metrics. The results indicate that the RF model achieved the highest predictive accuracy, with an R² value of 0.8983, explaining approximately 90% of the variability in isotopic composition. It also recorded the lowest error values (MAE: 1.39, RMSE: 3.60), signifying minimal deviation from the actual isotopic values. Both XGBoost and CatBoost also performed well but were outperformed by the RF model. In contrast, the SVM exhibited the weakest predictive capability, characterized by higher errors (MAE: 6.54) and a low R² (0.18), indicating limited ability to explain isotopic variability.
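For reference, the three evaluation metrics quoted above (R², MAE, and RMSE) can be computed as in the following sketch. The standard definitions are used; the measured and predicted values are hypothetical and are not results from this study.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted δ18O values (per mil).
measured = [-4.0, -2.0, -6.0, -3.0]
predicted = [-3.5, -2.5, -5.0, -3.0]

mae_val = mae(measured, predicted)
rmse_val = rmse(measured, predicted)
r2_val = r2_score(measured, predicted)
print(mae_val, rmse_val, r2_val)
```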

The Random Forest model identified rain amount and temperature as the most important features for predicting isotopes, followed by elevation and relative humidity (see Fig. 3). Ranked from highest to lowest impact, the features are rain amount, temperature, elevation, and relative humidity. This aligns with known hydroclimatic processes, in which rainfall amount and temperature drive isotopic fractionation.

Fig. 3

Feature importance (random forest).

To enlarge the original dataset (276 samples), reduce overfitting, and make the trained model more reliable for real-world use, we applied data augmentation. For each row, 100 new samples were created by adding slight random changes of ±10% to the input features (rain amount, temperature, and average RH), giving 27,600 new samples. This process expanded the dataset of 279 samples to the 22,300 samples that were used for training the models shown in Fig. 4.

Fig. 4

Predicted vs. actual isotopic values (random forest) with data augmentation.

The goal of this process is to expose the machine learning model to a wider variety of data than was originally available. This helps prevent the model from “memorizing” the training data (a problem known as overfitting) and improves its ability to make accurate predictions on new, unseen data; here, the test set comprised 5,576 samples. The high R² values (0.92 for δ18O and 0.93 for δ2H) in the model charts suggest that this approach was successful, as the models demonstrate a strong predictive capability on the test data.
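The sample counts quoted in this section can be checked with simple arithmetic. The sketch below shows one reading that reproduces both the 22,300 training samples and the 5,576 test samples. It rests on assumptions that the text does not state: that the 276 original rows were kept alongside their 100 augmented copies, and that the 20% test size was rounded up.

```python
import math

n_original = 276        # original rows, as quoted in the text
copies_per_row = 100    # augmented copies created per row
test_fraction = 0.2     # 80/20 split

# Assumption: the original rows are retained next to their copies.
n_total = n_original * (1 + copies_per_row)

# Assumption: the test-set size is rounded up to a whole sample.
n_test = math.ceil(test_fraction * n_total)
n_train = n_total - n_test

print(n_total, n_train, n_test)
```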

Figure 4 shows scatter plots comparing the predicted isotopic values against the actual measured values for δ18O and δ2H, along with their respective R² values. The strong linear correlation (indicated by the R² values for δ18O and δ2H) further confirms the high predictive power of the Random Forest model. The data points generally align well with the dashed red line (representing perfect prediction), demonstrating the model’s ability to accurately capture the variability in these isotopic compositions.

Conclusions and implications

Superiority of random forest

  • RF’s ensemble approach (multiple decision trees) effectively captured non-linear relationships between rainfall variables and isotopes.

  • Its low overfitting and high generalizability make it ideal for environmental isotope modeling.

Key finding for water resources management

  • The main outcome is that Random Forest can reliably reconstruct rainfall isotope signatures (δ¹⁸O and δ²H) from routine meteorological data, enabling spatially and temporally continuous isotope information even when direct isotope sampling is limited.

Practical applications

  • The model can be used for hydrological tracing, climate studies, and water resource management in regions with similar rainfall patterns.

  • Policymakers can leverage these predictions to assess water origin and movement in ecosystems.

  • The model can support groundwater recharge assessment and source identification by providing estimated rainfall isotope inputs for comparison with groundwater and surface-water isotopes.

Limitations and future work

  • Data dependency: Model accuracy relies on high-quality, region-specific rainfall data.

  • Algorithm tuning: Further optimization (e.g., hyperparameter tuning) could enhance performance.

For predicting environmental isotopes (δ¹⁸O and δ²H) from rainfall data, Random Forest is the recommended algorithm due to its high accuracy, interpretability, and robustness. Future studies could explore hybrid models (e.g., RF + ANN) or larger datasets to refine predictions.