Abstract
To effectively identify the source of water in coal mines and prevent water-related accidents, this paper utilises the hydrochemical characteristics of the aquifers in the Shanxi Hanzui Coal Mine. The fuzzy C-means (FCM) clustering method is employed to classify water sample data, followed by principal component analysis (PCA) for dimensionality reduction to extract key features. The SMOTE algorithm is then applied to address class imbalance. On this basis, a decision tree model (FPS-DT) is constructed using the CART algorithm. The model’s performance was evaluated with five-fold cross-validation. The average classification accuracy of the FPS-DT model was 93%, whereas the comparison model, which used only PCA and a decision tree, reached 78%, indicating that the proposed method has clear advantages in identification accuracy and generalisation capability. Additionally, the FPS-DT model features a clear structure and explicit classification rules, offering good interpretability and robustness. It can adapt to the real-time water source identification requirements of complex underground environments, providing theoretical support and technical assurance for coal mine safety production and water hazard prevention and control.
Introduction
China possesses abundant coal resources, with proven coal reserves reaching 218.57 billion tonnes by the end of 2023. The country has been mining over 3.5 billion tonnes annually1, solidifying coal’s crucial role in its energy mix2. However, the exploitation of these resources has long been hindered by water hazards. As mining operations have progressed to greater depths in recent years, the complexity and diversity of water hazards have escalated. Consequently, the challenges associated with flood prevention and control have intensified, leading to frequent mine flooding accidents that result in significant injuries, fatalities, and economic losses3. One such incident occurred in March 2010 at the Wangjialing coal mine, where water that had accumulated in a mining void caused a leakage4. This disaster trapped 153 miners and claimed 38 lives. These events underscore the urgent need for effective and timely water hazard prevention strategies. A critical prerequisite for such strategies is the rapid and accurate identification of potential water inrush sources, which forms the foundation for efficient mine water control measures5,6,7.
Several methods have been proposed for mine water source identification, including chemical analysis, hydrodynamic analysis, water temperature analysis, water level dynamic observation, and geophysical exploration8,9,10,11,12,13. Among these, chemical analysis identifies inrush sources from the chemical characteristics of the water; because it is convenient and rapid, it is the approach most frequently adopted by researchers14. Zhu Saijun15 proposed a model for identifying mine water inrush sources based on combination weighting and an improved grey relational degree theory. Despite its high accuracy and applicability, the model requires high-quality and complete data. Additionally, when two sets of data are highly similar in feature space, the model has difficulty distinguishing the boundaries between neighboring samples, which may reduce the accuracy and reliability of the classification results. Feng Dongmei and Wu Jianwei16 developed a correlation theory-driven hybrid model integrating the Support Vector Machine (SVM) algorithm for discriminating mine water inrush sources. Although this methodology is more robust in handling multi-indicator systems with inherent interdependencies, its performance on multiclass discrimination tasks remains suboptimal, primarily because SVM is inherently a binary classifier and must be extended to handle multiple categories. Zhao, W. et al.17 and Sun, F. X. et al.18 incorporated a centroid distance metric into Fisher discriminant analysis (FDA) for water source identification.
This method significantly enhanced classification accuracy, demonstrating its effectiveness in classifying water chemistry characteristics. However, it is sensitive to outliers and requires rigorous pre-processing (e.g., standardization or normalization) of the data in practical applications to mitigate the impact of magnitude differences on the distance calculations. Furthermore, since the centroid distance method does not account for the varying importance of features, the issue of feature weighting remains unresolved, so different chemical features may contribute to the classification in an unbalanced way.
Although the above methods have played a significant role in advancing water source identification, particularly for mine water inrush, they face several challenges. First, sample data are often imbalanced, which makes it difficult for discriminative models to capture the features of every category. Second, when neighboring samples are highly similar, the boundaries between them are hard to define accurately, which increases classification uncertainty. Third, in multi-class tasks, existing methods do not handle complex category structures with sufficient refinement, which limits recognition accuracy. More importantly, many models do not allow intuitive visualization of the decision process, which limits both the interpretability of the results and the efficiency of practical application. In-depth research on these issues is therefore of scientific significance and practical value, in particular on optimising the sample distribution, enhancing the model’s ability to identify complex boundaries, improving multi-class accuracy, and increasing the visualization and interpretability of models.
This study proposes a novel hybrid approach integrating Principal Component Analysis (PCA), Fuzzy C-means Clustering (FCM), and Synthetic Minority Oversampling Technique (SMOTE)-enhanced Decision Tree (DT) to address data complexity and class imbalance challenges. The framework leverages FCM for fuzzy pattern recognition, PCA for dimensionality reduction, SMOTE for minority class augmentation, and DT for interpretable classification, achieving balanced performance in high-dimensional imbalanced scenarios.
In view of the above analysis, this article combines a decision tree (DT) algorithm with Principal Component Analysis (PCA), fuzzy C-means clustering, and SMOTE oversampling to address the challenges in classifying water sources. This study focuses on old void water from the No. 2 coal seam and bedded sandstone fissure water in the Xiangning mining area. Systematic analysis of the water sample data showed that the following ions play a significant role in feature recognition: sodium (Na+) and potassium (K+), calcium (Ca2+), magnesium (Mg2+), chloride (Cl−), sulfate (SO42−), bicarbonate (HCO3−), and carbonate (CO32−). After preprocessing the data by addressing missing values, outliers, and duplicate data points, we obtained complete datasets. FCM clustering is then applied to identify water sample types from the resulting cluster centers. PCA is employed to reduce dimensionality and eliminate redundant features, thereby simplifying the model structure. The SMOTE oversampling technique is used to balance the sample categories19, addressing class imbalance and enhancing the generalization capability of the decision tree. Finally, a decision tree is trained with the CART algorithm, and the trained model is applied to the test set to assess its generalization ability and tune the model parameters so that the source of inrush water can be judged accurately. Based on reasonable feature selection and data balancing techniques, the decision tree model demonstrates high classification accuracy. Additionally, the decision tree model offers both high interpretability and strong robustness20,21,22, making it well-suited for complex classification tasks involving environmental or hydrochemical data.
By transforming decision processes into a binary tree structure—where each internal node represents a feature-based splitting rule and each leaf node corresponds to a predicted class—the model allows for full traceability of the decision pathway. This hierarchical structure enables researchers to visualize and interpret the model’s internal logic, thereby facilitating validation of the classification outcomes and ensuring transparency in model reasoning. In addition to interpretability, decision trees exhibit inherent robustness to outliers, owing to their use of impurity-based splitting criteria, such as the Gini index, which minimize the influence of extreme values on the overall tree structure. During the recursive partitioning process, the algorithm consistently prioritizes features with the highest discriminative power, thereby reducing the impact of irrelevant or noisy attributes. Furthermore, when applied in a reduced-dimensional principal component space, decision trees are capable of capturing localized nonlinear decision boundaries, leading to improved classification accuracy while preserving model simplicity and computational efficiency. This study presents a novel method for identifying water inrush sources in mines and contributes to the prevention and control of mine water hazards23.
Methods
Fuzzy C-means clustering algorithm
Fuzzy C-Means (FCM) is a widely used fuzzy clustering algorithm in data cluster analysis24,25,26. It clusters samples based on their Euclidean distances from the cluster centers and their membership degrees. By contrast, K-means is a classical hard clustering algorithm that partitions data by minimizing the Euclidean distance between each sample and its assigned cluster center27. K-means performs efficiently on datasets with well-separated, compact clusters and relatively uniform distributions, such as those found in image compression or market segmentation, but its effectiveness diminishes when handling overlapping classes or transitional data. Unlike K-means, FCM allows each data point to belong to multiple clusters with varying membership degrees. This characteristic makes FCM particularly effective in handling situations involving uncertainty or ambiguity, offering a more flexible and adaptable approach to clustering. The FCM algorithm minimizes a cost function, namely the membership-weighted sum of squared distances between data points and cluster centers. The algorithm iteratively updates both the cluster centers and membership values to achieve an optimal data partition. Through this fuzzy partitioning, FCM can uncover the underlying structure of the data, especially in cases where cluster boundaries are not clearly defined, thereby providing more detailed and nuanced clustering results.
Such a mechanism is particularly advantageous in the context of hydrochemical datasets, where transitions between water types are often continuous rather than discrete. The capacity of FCM to model this gradual change enables a more realistic representation of the inherent ambiguity in water source classification. Moreover, FCM minimizes a membership-weighted objective function based on the distances between samples and cluster centroids, thereby enhancing its robustness against noise and minor fluctuations in the data28. The objective function Jm is defined as:
$${J_m}=\sum\limits_{{i=1}}^{N} {\sum\limits_{{j=1}}^{C} {u_{{ij}}^{m}{{\left\| {{x_i} - {c_j}} \right\|}^2}} }$$(1)
where N is the number of samples, C is the number of clusters, uij represents the membership degree of the i-th sample point to the j-th cluster, m is the fuzziness factor (m > 1), which controls the degree of fuzziness in the membership, xi is the feature vector of the i-th sample, and cj is the center of the j-th cluster. The memberships of each sample sum to one over the C clusters.
The basic procedure of the FCM algorithm for clustering is as follows:
1. Initialize the membership matrix: randomly initialize the membership matrix U.
2. Update cluster centers: Based on the current membership matrix U, update the center of each cluster cj:
$${c_j}=\frac{{\sum\limits_{{i=1}}^{N} {u_{{ij}}^{m}} \cdot {x_i}}}{{\sum\limits_{{i=1}}^{N} {u_{{ij}}^{m}} }}$$(2)
3. Update the membership matrix: Using the new cluster centers, recalculate the membership matrix U, where the membership uij is adjusted according to the distance between the sample point and each cluster center:
$$u_{{ij}} = \frac{1}{{\sum\limits_{{k = 1}}^{C} {\left( {\frac{{\left\| {x_{i} - c_{j} } \right\|}}{{\left\| {x_{i} - c_{k} } \right\|}}} \right)^{{\frac{2}{{m - 1}}}} } }}$$(3)
4. Repeat until convergence: Repeat steps 2 and 3 until the objective function Jm converges to a predefined threshold or the maximum number of iterations is reached.
Clustering the data allows us to group seemingly irregular data into distinct clusters based on their similarity. This process enables the classification of data into different types, followed by labeling the data according to these clusters. These cluster labels can then be used as features to assist classifiers in more effectively categorizing the data, ultimately improving prediction accuracy.
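The alternating updates of Eqs. (2) and (3) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `fuzzy_c_means`, the default fuzziness `m=2.0`, the tolerance, and the synthetic two-blob data are all hypothetical choices made for the example.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, tol=1e-6, max_iter=300, seed=0):
    """Minimal FCM: alternate Eq. (2) and Eq. (3) until J_m stops changing."""
    rng = np.random.default_rng(seed)
    # Step 1: random membership matrix U (N x C) whose rows sum to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    prev_J = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # Step 2, Eq. (2): centers are membership-weighted means.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances ||x_i - c_j||, floored to avoid division by zero.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)
        # Objective J_m evaluated with the current U and the new centers.
        J = float((Um * dist ** 2).sum())
        # Step 3, Eq. (3): u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)).
        ratio = dist[:, :, None] / dist[:, None, :]
        U = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
        # Step 4: stop once the objective has converged.
        if abs(prev_J - J) < tol:
            break
        prev_J = J
    return centers, U


# Hypothetical demo data: two well-separated groups of 20 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
centers, U = fuzzy_c_means(X, n_clusters=2)
labels = U.argmax(axis=1)  # hard label = cluster with the largest membership
```

Taking the arg-max of each membership row, as in the last line, is one common way to turn fuzzy memberships into the hard cluster labels that a downstream classifier needs; the membership values themselves remain available for samples that sit between water types.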
Principal component analysis
Principal Component Analysis (PCA) is a fundamental technique in multivariate statistics19,29,30,31, employed to simplify datasets by projecting the original data onto a new coordinate system through a linear transformation. This method preserves the most significant features of the data, reducing its dimensionality while retaining as much of the key information as possible. PCA is extensively employed in data preprocessing, dimensionality reduction, visualization, and feature extraction. Its fundamental principle is to project the original data onto a new set of mutually orthogonal basis vectors, termed principal components. These principal components are ordered by the variance they capture: the first principal component lies along the direction of maximum variance, the second along the direction of largest remaining variance orthogonal to the first, and so forth. The steps to perform PCA are as follows:
(1) Standardization: First, calculate the mean and standard deviation of each feature, then rescale each feature to a mean of 0 and a standard deviation of 1. The standardization is expressed as:
$$Z=\frac{{X - \mu }}{\sigma }$$(4)
where X is the original data, µ is the mean, and σ is the standard deviation.
(2) Covariance matrix calculation: Next, compute the covariance matrix from the standardized data matrix Z:
$$\Sigma =\frac{1}{{n - 1}}{Z^T}Z$$(5)
where n is the number of samples.
(3) Eigenvalue and eigenvector calculation: Finally, derive the eigenvalues of the covariance matrix Σ, and select the eigenvectors corresponding to the largest eigenvalues as the principal components. Typically, the components with a cumulative variance contribution greater than or equal to 85% are chosen as the principal components (Ju and Hu, 2021). The formula for cumulative variance contribution is as follows:
$$C=\frac{{\sum\limits_{{i=1}}^{k} {{\lambda _i}} }}{{\sum\limits_{{j=1}}^{m} {{\lambda _j}} }}$$(6)
where C is the cumulative variance contribution, λi is the i-th largest eigenvalue, k is the number of selected principal components, and m is the total number of eigenvalues. The principal components selected in this manner retain enough information to meet the requirements of the majority of experiments.
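The three PCA steps can be sketched with NumPy as follows. This is an illustrative sketch only: the function name `pca_by_cumulative_variance`, the use of the sample standard deviation (`ddof=1`), and the synthetic 42 × 6 data matrix are assumptions introduced for the example and are not taken from the study.

```python
import numpy as np


def pca_by_cumulative_variance(X, threshold=0.85):
    """Standardize, eigendecompose the covariance matrix, keep k components."""
    # Step 1: z-score each feature (sample standard deviation, ddof=1).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2, Eq. (5): covariance matrix of the standardized data.
    cov = Z.T @ Z / (Z.shape[0] - 1)
    # Step 3: eigenpairs of the symmetric matrix, sorted by descending eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Eq. (6): cumulative variance contribution of the first k components.
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Smallest k whose cumulative contribution reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold) + 1)
    scores = Z @ eigvecs[:, :k]
    return scores, eigvals, cumulative, k


# Hypothetical data: 42 samples, 6 correlated features driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(42, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(42, 6))
scores, eigvals, cumulative, k = pca_by_cumulative_variance(X)
```

Because the data are standardized first, the eigenvalues sum to the number of features, so each eigenvalue divided by that number is the share of total variance carried by its component.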
The SMOTE algorithm
The Synthetic Minority Over-sampling Technique (SMOTE) is an oversampling algorithm designed to mitigate class imbalance in classification tasks32,33,34,35,36. As an improvement over traditional random oversampling methods, SMOTE generates synthetic samples for the minority class by interpolating between existing minority samples in the feature space. Specifically, for each minority class sample, SMOTE identifies its k nearest neighbors (K-NN) and creates new synthetic samples along the line segments connecting the sample to randomly selected neighbors. This process effectively increases the representation of the minority class, thereby shifting the decision boundary closer to the majority class. By introducing greater diversity among the minority samples, SMOTE helps to alleviate overfitting and substantially enhances the generalization capability of the classifier.
In this study, we apply the SMOTE algorithm to augment the minority class and rectify the sample imbalance within our dataset. The detailed algorithmic procedure is as follows:
(1) For each sample \(X_i\) in the minority class, use the Euclidean distance as the criterion, denoted by:
$$d\left( {X_i ,X_j } \right)=\sqrt {\sum\limits_{{m=1}}^{p} {{{\left( {x_{{im}} - x_{{jm}} } \right)}^2}} }$$(7)
where \(x_{im}\) and \(x_{jm}\) represent the m-th feature of samples \(X_i\) and \(X_j\), respectively, and p is the number of features. The distance between \(X_i\) and all other samples in the minority class dataset is computed to identify its K nearest neighbors. Let \(X_i\) be a sample from the minority class and \(N_i\) the set of its K nearest neighbors, such that \(N_{i} = \left\{ {X_{{i1}} ,\,X_{{i2}} ,\ldots ,X_{{iK}} } \right\}\).
(2) Randomly select a neighbor \(X_{ni}\) from \(N_i\) and compute the new synthetic sample using the following formula:
$$X_{{new}}=X_i+\delta \left( {X_{{ni}} - X_i } \right)$$(8)
where δ is a random number in the range [0, 1].
(3) Repeat the above steps until the desired number of minority class samples are generated.
By addressing class imbalance, SMOTE enables the model to better recognize patterns and generalize across classes, thereby enhancing overall performance. It helps mitigate the bias introduced by class imbalance by ensuring the model is not overly influenced by the majority class at the expense of the minority class (Fig. 1). This approach effectively increases the number of samples in the minority class without the need for additional data collection, making it a resource-efficient technique.
Fig. 1 Explanation of the over-sampling principle.
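The neighbor search and interpolation steps described above can be sketched in NumPy. This is a simplified illustration, not the implementation used in the study: the function name `smote`, the default `k=5`, and the random minority samples are hypothetical, and each synthetic point starts from a randomly drawn minority sample rather than cycling through the samples in order.

```python
import numpy as np


def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    k = min(k, n - 1)  # cannot have more neighbours than other samples
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(n)                 # pick a minority sample X_i
        j = neighbours[i, rng.integers(k)]  # pick one of its k nearest neighbours
        delta = rng.random()                # random weight in [0, 1)
        # Interpolate on the segment between X_i and the chosen neighbour.
        synthetic[row] = X_min[i] + delta * (X_min[j] - X_min[i])
    return synthetic


# Hypothetical minority class: 6 samples with 3 features.
rng = np.random.default_rng(0)
X_min = rng.normal(size=(6, 3))
X_new = smote(X_min, n_new=10, k=3)
```

Since every synthetic point is a convex combination of two existing minority samples, it always lies inside the range already spanned by the minority class, which is why SMOTE adds diversity without extrapolating beyond the observed data.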
Decision tree algorithms
Decision Tree (DT) is a tree-structured machine learning algorithm37,38,39,40,41 commonly used for classification and regression tasks. Decision tree algorithms are favored for their simplicity, interpretability, and computational efficiency, which is why they are widely applied in various classification and recognition domains. A decision tree works by recursively partitioning a dataset into smaller subsets based on a series of rules, while organizing decisions into a tree structure. Each internal node represents a decision based on a feature, each branch corresponds to the outcome of the decision, and each leaf node represents the final classification or regression result. The primary advantage of decision trees lies in their ability to break down a complex decision-making process into a sequence of straightforward judgment steps, constructing a model that is easy to interpret. The Classification and Regression Tree (CART) algorithm employed in this study constructs binary decision trees by selecting optimal feature splits based on the Gini index, which serves as the impurity measure. This criterion aims to maximize class separation while minimizing heterogeneity within each node42.
In this study, the classification model was constructed using the Classification and Regression Tree (CART) algorithm, a decision tree method particularly well-suited for classification tasks. One of the key advantages of CART lies in its interpretable, rule-based structure, which recursively partitions the dataset based on feature values to form a binary tree. The core mechanism of CART involves selecting, at each node, the optimal feature and corresponding split point that produce the most homogeneous child nodes. To achieve this, the CART algorithm employs Gini impurity as the splitting criterion. Gini impurity is a measure of the degree of class heterogeneity within a dataset (or node) and serves as an indicator of node purity. For a given dataset D, the Gini impurity is computed using the following formula:
$$\mathrm{Gini}(D)=1 - \sum\limits_{{i=1}}^{n} {p_{i}^{2}}$$(9)
where \(D\) represents the dataset, \(n\) denotes the number of classes, and \(p_i\) represents the proportion of samples belonging to class \(i\) within the dataset.
The Gini impurity is the probability that two samples drawn at random from a node belong to different classes. The smaller the Gini value (approaching 0), the higher the node purity; a value of 0 means that all samples within the node belong to a single class. Conversely, higher Gini impurity values reflect greater class heterogeneity within the node, implying increased uncertainty in classification at that point in the tree.
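As a worked illustration with hypothetical counts (not taken from the study's data), consider a node holding 10 samples from three classes, with 6, 3, and 1 samples respectively, so that the class proportions are 0.6, 0.3, and 0.1. Applying the definition of Gini impurity, one minus the sum of the squared class proportions, gives:

$$\mathrm{Gini}(D)=1-\left(0.6^{2}+0.3^{2}+0.1^{2}\right)=1-0.46=0.54$$

A pure node, in which all samples share one class, gives \(1-1^{2}=0\), while three equally represented classes give the three-class maximum \(1-3\cdot(1/3)^{2}=2/3\).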
The feature selection process is executed at each node of the decision tree. Initially, all 42 samples are grouped at the root node. For each candidate feature \(x_j\), the Gini impurity is calculated, and all possible split thresholds s within the range of \(x_j\) are enumerated to identify the optimal split point. For each candidate pair (\(x_j\), s), the current dataset D is partitioned into two subsets, D1 and D2, according to whether the feature values satisfy the splitting condition:
\(D_{1} = \{ x \in D|x_{j} \le s\} ,\,D_{2} = \{ x \in D|x_{j} > s\}\)
Subsequently, the weighted average Gini impurity corresponding to the split is computed, serving as a metric to evaluate the overall impurity of the resulting child nodes. This metric reflects the degree of class homogeneity achieved by the split and is used to assess its effectiveness:
$$\mathrm{Gini}(D,x_j ,s)=\frac{{\left| {D_1 } \right|}}{{\left| D \right|}}\mathrm{Gini}(D_1 )+\frac{{\left| {D_2 } \right|}}{{\left| D \right|}}\mathrm{Gini}(D_2 )$$(10)
where |D|, |D1|, and |D2| denote the numbers of samples in D, D1, and D2. The Gini impurities of the subsets D1 and D2 are computed according to Eq. (9), which defines the impurity measure used to evaluate the quality of the split.
At each decision node, the CART algorithm exhaustively evaluates all candidate features \(x_j\) and their corresponding split points s to identify the optimal split (\(x_j\), s) that minimizes the impurity measure Gini(D, \(x_j\), s). The selected split is then applied to partition the dataset into two child nodes, forming a binary tree structure. This splitting process is recursively repeated for each child node until a stopping condition is met, such as reaching a predefined maximum tree depth, reaching a minimum number of samples per leaf node, or finding that no further reduction in impurity can be achieved.
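The exhaustive search for the best (feature, threshold) pair at a single node can be sketched in plain Python. This is a minimal illustration under assumptions: the helper names `gini` and `best_split`, the choice of midpoints between consecutive distinct values as candidate thresholds, and the six-sample toy dataset are all hypothetical and are not taken from the study.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum_i p_i^2 over the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (weighted_gini, feature_index, threshold) of the best binary split."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[j] <= s]
            right = [y for row, y in zip(rows, labels) if row[j] > s]
            # Weighted average impurity of the two child nodes.
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, j, s)
    return best


# Hypothetical node: feature 0 separates the classes, feature 1 does not.
rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [6.0, 2.0], [7.0, 9.0], [8.0, 4.0]]
labels = ["A", "A", "A", "B", "B", "B"]
root_impurity = gini(labels)      # two balanced classes give 0.5
best = best_split(rows, labels)   # (0.0, 0, 4.5): split feature 0 at 4.5
```

A full CART tree applies `best_split` recursively to each child node until one of the stopping conditions is met; library implementations add pruning and efficiency optimizations that this sketch omits.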
By recursively selecting the feature and threshold that yield the greatest improvement in node purity, the CART algorithm constructs a decision tree that achieves strong classification performance while maintaining interpretability. The construction process of this FPS-DT model is shown in Fig. 2.
Fig. 2 Flowchart of the FPS-DT model.
Analysis of water samples
Overview of the study area
The Shanxi Hanzui Coal Mine is situated in the southern part of the Wangjialing Precision Exploration Area, within the Xiangning Mining District of the Hedong Coal Field, as depicted in Fig. 3. The mine field lies at the southernmost edge of the Lvliang Mountain range and is characterized by high mountains, deep ravines, and complex topography. The terrain slopes generally from east to west and is part of the Loess Plateau Zone. The surface water in the mine area belongs to the Yellow River Basin. A key feature of the area is the Anli River, which runs east to west along the northern boundary of the mine. This seasonal river flows out of the area to the west and eventually converges with the Yellow River. The main aquifers in the mine field, listed from top to bottom, are: the Quaternary gravel layer pore aquifer, the Permian clastic fissure aquifer, the karst fissure aquifer in the limestone of the Upper Carboniferous Taiyuan Formation, and the Middle Ordovician karst fissure aquifer43. Additionally, the primary aquifers in the region, in order, are: the aquifer at the base of the Cenozoic boundary, the aquifer within the Carboniferous diabase layers, and the aquifer of the Taiyuan Formation extending from the coal basement to the upper boundary of the Benxi Formation. The geological conditions of the mining area are complex, with abundant groundwater resources. As coal mining progresses, water gradually accumulates in the mining area, resulting in the formation of old void water and fissure water. Old void water not only directly affects the water quality of the mining area but also presents a potential threat to the regional ecological balance. The possible presence of pollutants within the void water and their interaction with surrounding water bodies further elevates the importance of this issue.
Consequently, understanding the water quality characteristics of old void water and its impact on the ecological environment of the mining area is critical for the sustainable development of the region and the integrated management of water resources and ecological protection44,45.
Fig. 3 Distribution of coal mines in the Xiangning mining area.
Water quality analysis of Hanzui coal mine
To construct a water source identification model based on 42 sets of old void water quality data from the Hanzui coal mine in Shanxi, the water quality at the sampling sites is first analyzed thoroughly (Table 1).
The primary water type in the uppermost Quaternary gravel layer pore aquifer is pore water, which is typically fresh. The main chemical constituents include bicarbonate (HCO₃⁻), calcium ions (Ca²⁺), and magnesium ions (Mg²⁺). The water has a high degree of hardness and is classified as hard or moderately hard. The thickness of the aquifer varies significantly, ranging from a few meters to tens of meters. The aquifer exhibits strong permeability and water storage capacity, with recharge primarily occurring through rainfall and river water. The water level is strongly influenced by seasonal changes.
Next is the Permian clastic fissure water-bearing rock system, where fissure water is the predominant water type. The water quality is more complex, with chemical constituents such as sulfate (SO₄²⁻), calcium ions (Ca²⁺), and magnesium ions (Mg²⁺). The water is hard and mineralized, typically classified as hard or very hard. The thickness of the aquifer is substantial, generally ranging from tens to hundreds of meters, with a dominant composition of clastic rocks, such as sandstones and shales. The degree of fissure development significantly influences the permeability and water storage capacity of this aquifer.
Following this, the Upper Carboniferous Taiyuan Formation limestone karst fissure aquifer primarily contains karst fissure water, which typically has relatively clean water quality. The main chemical components are bicarbonate (HCO₃⁻), calcium ions (Ca²⁺), and magnesium ions (Mg²⁺). The water is hard, usually classified as hard or very hard. The thickness of the aquifer varies, generally between tens and hundreds of meters, with limestone being the predominant rock type. The well-developed karst system, along with the extent of fissure and cave development, significantly affects the permeability and water storage capacity. Recharge conditions are favorable, mainly through precipitation, surface water, and neighboring aquifers.
Finally, the Middle Ordovician karst fissure aquifer is characterized by karst fissure water, usually of relatively clean quality. The main chemical constituents are bicarbonate (HCO₃⁻), calcium ions (Ca²⁺), and magnesium ions (Mg²⁺). The water is typically hard or very hard, with thickness varying between tens and hundreds of meters. The aquifer exhibits well-developed karst, with permeability and water storage capacity influenced by the degree of fissure and cave development. Recharge is also facilitated through precipitation, surface water, and neighboring aquifers.
Based on the analysis of 42 groups of old void water samples from the Hanzui Coal Mine in Shanxi Province, the water samples are predominantly distributed across the Quaternary gravel layer pore aquifer, the Permian clastic fissure aquifer, and the Middle Ordovician karst fissure aquifer43. Building on this analysis, the fuzzy C-means (FCM) algorithm is applied to cluster the water source data. The optimal number of clusters is determined using the elbow method and the DB index. Subsequently, the cluster centers are established based on the weighted calculation of various chemical indicators, and the water source types are classified into three categories (see Table 1). A Piper trilinear plot (Fig. 4) was constructed from the 42 water samples to summarise the distribution of their chemical features. Subsequently, 29 of the 42 data sets were designated as the training set, while the remaining 13 sets were used as the prediction set. The datasets were preprocessed using the chemical features K⁺ + Na⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, and HCO₃⁻ as discriminators. The preprocessed data were then utilized for training the decision tree model.
Fig. 4 Piper trilinear diagram of the water quality of the water samples.
Establishment and validation of the FPS-DT model
Fuzzy C-means clustering algorithm
The chemical composition of water sources in the study area is highly complex, with limited differentiation between water types, necessitating a rigorous and systematic approach to data analysis. To ensure the reliability of subsequent modeling, raw data were first preprocessed by removing or imputing missing values using mean substitution. Outliers were identified and manually adjusted based on observed data patterns to prevent their undue influence on model performance. Subsequently, Z-score normalization was applied to standardize feature scales and eliminate dimensional disparities among variables.
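The mean substitution and Z-score steps described above can be sketched with NumPy. This is an illustrative sketch under assumptions: the function name `impute_and_standardize` and the small concentration table are hypothetical, the sample standard deviation (`ddof=1`) is an assumed convention, and the manual outlier adjustment mentioned in the text is not reproduced because it depends on expert judgment.

```python
import numpy as np


def impute_and_standardize(X):
    """Replace NaNs with column means, then z-score each column."""
    X = np.asarray(X, dtype=float).copy()
    col_mean = np.nanmean(X, axis=0)
    # Mean substitution: fill each missing entry with its column mean.
    missing = np.isnan(X)
    X[missing] = np.take(col_mean, np.where(missing)[1])
    # Z-score normalization: zero mean, unit (sample) standard deviation.
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


# Hypothetical ion concentrations (three features), with one missing value.
X_raw = np.array([[120.0, 35.0, 410.0],
                  [98.0, np.nan, 380.0],
                  [143.0, 41.0, 455.0],
                  [110.0, 38.0, 402.0]])
Z = impute_and_standardize(X_raw)
```

An entry filled by mean substitution maps to exactly zero after standardization, so it neither pulls the sample toward nor away from any cluster along that feature.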
Following data preprocessing, the clustering performance was evaluated using both the Silhouette Coefficient and the Davies–Bouldin Index to determine the optimal number of clusters. As illustrated in Fig. 5, three clusters yield the most appropriate partitioning for the given dataset, corresponding to the highest silhouette score and the lowest Davies–Bouldin index. Based on these findings, the fuzzy C-means (FCM) clustering algorithm was employed to categorize the water source data, with cluster membership determined according to Eq. (1). The clustering results, presented in Fig. 6, show that the data were effectively partitioned into three distinct classes.
The resulting class labels derived from the FCM clustering were then used to construct the classification dataset for subsequent model development. This approach not only enhances the interpretability and granularity of hydrochemical data classification but also provides a robust foundation for the development of high-accuracy predictive models, thereby ensuring both scientific rigor and practical applicability.
Fig. 5 Optimal cluster number assessment diagram.
Fig. 6 Clustering results.
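The comparison of candidate cluster numbers by silhouette score and Davies–Bouldin index can be sketched with scikit-learn. This sketch makes two loud assumptions: the data are synthetic (three hypothetical groups of 14 samples, not the study's measurements), and `KMeans` is used as a stand-in to produce hard labels, whereas the study clusters with FCM. The two scoring functions only need hard labels, so FCM labels obtained from the largest membership could be scored the same way.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(0)
# Hypothetical standardized data: three well-separated groups of 14 samples.
X = np.vstack([rng.normal(c, 0.4, (14, 4)) for c in (-3.0, 0.0, 3.0)])

results = {}
for k in range(2, 7):
    # KMeans is a stand-in here; the paper's clustering method is FCM.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

# Higher silhouette is better; lower Davies-Bouldin is better.
best_k_silhouette = max(results, key=lambda k: results[k][0])
best_k_db = min(results, key=lambda k: results[k][1])
```

When both criteria point to the same number of clusters, as they do for data with clearly separated groups, the choice is well supported; disagreement between them would suggest overlapping or unevenly sized clusters and calls for closer inspection.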
Principal component analysis
Based on the above clustering analysis results, the clustered data were used as the database for model construction in this study. The data analysis revealed that the major cations were K⁺, Na⁺, Ca²⁺, and Mg²⁺, while the major anions included Cl⁻, SO₄²⁻, and HCO₃⁻. These chemical components are widely distributed across the study area and play a decisive role in determining the chemical type of groundwater sources.
To construct the classification model, 29 groups were randomly selected from 42 groups of field-collected data as the training set, with the remaining 13 groups serving as the prediction set. To enhance the efficiency of data analysis, principal component analysis (PCA) was applied for dimensionality reduction, thereby extracting key features from the chemical data. The model training was performed using the decision tree algorithm available in Python’s sklearn library. Before training, the sample set was balanced using the SMOTE oversampling technique to address the category imbalance problem, ensuring a more balanced representation during training. This approach not only captures the core information of the chemical features of water sources but also enhances the robustness and prediction accuracy of the model through effective training strategies.
The use of PCA helps in simplifying complex datasets by reducing the number of variables while retaining crucial information, and SMOTE ensures that the model is trained on a more balanced dataset. This method provides a scientific basis for classifying complex water sources based on their chemical characteristics. The data derived from the clustering results are subsequently subjected to Principal Component Analysis (PCA), which requires all variables to be on a consistent scale. Therefore, the data are standardized according to the method specified in Eq. (4). The standardized data are presented in Table 2.
To analyze the standardized data, its covariance matrix is calculated, which summarizes the correlations between the variables. As shown in Table 3, the correlation coefficients between SO₄²⁻ and Mg²⁺, K⁺ + Na⁺ and TDS, and Ca²⁺ and SO₄²⁻ are all greater than 0.6, indicating substantial information overlap among these variables. Therefore, directly constructing the water identification model using the eight variables may impair the accuracy of the model because of these strong correlations. Consequently, it is necessary to extract the principal components by evaluating their cumulative contribution in order to reduce the dimensionality of the data.
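The screening for strongly correlated variables can be sketched as follows. For variables standardized to unit variance, the covariance matrix coincides with the Pearson correlation matrix, so the sketch computes Pearson coefficients directly. The 0.6 threshold comes from the text; the function names and the sample values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def strongly_correlated_pairs(columns, threshold=0.6):
    """List variable pairs whose absolute correlation exceeds the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical measurements for three variables (not the values in Table 3).
columns = {
    "Ca": [1.0, 2.0, 3.0, 4.0],
    "SO4": [2.0, 4.0, 6.0, 8.5],
    "Cl": [4.0, 1.0, 3.0, 2.0],
}
pairs = strongly_correlated_pairs(columns)
```

Pairs flagged in this way carry overlapping information, which is the motivation for replacing the original variables with uncorrelated principal components.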
The eigenvalues are derived from the computed covariance matrix Σ, which captures the inter-variable relationships in the standardized dataset. Subsequently, principal components are selected by identifying the eigenvectors corresponding to the largest eigenvalues. The contributions of each principal component and their associated eigenvalues are presented in Table 4.
A common approach is to use the cumulative variance contribution to determine how many principal components should be retained. According to Table 4, the first four principal components (Y₁ through Y₄) account for a cumulative variance of 91%, indicating that these components effectively capture the essential information from the original dataset without introducing significant error.
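The cumulative-contribution rule can be written as a short sketch. The eigenvalues below are hypothetical (they are not the values in Table 4), and the 90% threshold is an assumption chosen for illustration; the text reports only that four components reach 91%.

```python
def n_components_for(eigenvalues, threshold=0.90):
    """Smallest number of components whose cumulative variance share
    reaches the threshold, together with that cumulative share."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += ev / total
        if cumulative >= threshold:
            return k, cumulative
    return len(eigenvalues), cumulative


# Hypothetical eigenvalues for eight standardized variables (they sum to 8).
eigenvalues = [3.2, 2.0, 1.3, 0.8, 0.3, 0.2, 0.1, 0.1]
k, share = n_components_for(eigenvalues)
```

Because each standardized variable contributes a variance of one, the eigenvalues of an eight-variable problem sum to eight, and each eigenvalue divided by that sum gives the contribution of its component.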
After selecting the appropriate key variables through Principal Component Analysis (PCA), the next step is to construct a decision tree model for water source identification.
The SMOTE algorithm
As shown in Table 1, the distribution of water samples in this study is highly uneven, with significant disparities in the number of samples across different water types. This imbalance adversely affects the accuracy of the trained decision tree model, leading to diminished classification performance. To address this issue, the Synthetic Minority Over-sampling Technique (SMOTE) is employed to augment the dataset post-Principal Component Analysis (PCA). By generating synthetic samples, SMOTE increases the representation of underrepresented water types, thereby mitigating class imbalance and enhancing the classification accuracy of the model. A sample is randomly selected from the minority class, and for each selected minority class sample, its k nearest neighbors within the same class are identified. In this study, the value of k is set to 3, meaning that the three nearest neighbor samples are used when generating synthetic samples. To ensure consistency across runs, the random seed is fixed at 6. Through the application of the SMOTE algorithm, the 29 samples in the training set are increased to 65. A comparison of the class distribution of sample data before and after applying the oversampling technique is presented in Fig. 7, illustrating the effectiveness of SMOTE in balancing the minority and majority classes. These 65 augmented samples are then utilized as the training set for constructing the decision tree model.
Comparison of data before and after SMOTE oversampling.
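The interpolation at the core of SMOTE can be sketched in a few lines. The values k = 3 and seed = 6 follow the text; the function name and the toy minority samples are hypothetical, and this is a simplified illustration rather than the implementation used in the study.

```python
import math
import random


def smote(minority, n_new, k=3, seed=6):
    """Generate n_new synthetic samples by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the same class, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples in a two-component PCA space.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new=5)
```

Every synthetic sample lies on a line segment between two real minority samples, so the oversampled class stays inside the region already occupied by that class. In practice a library implementation would typically be used; for example, imbalanced-learn exposes `SMOTE(k_neighbors=3, random_state=6)`, although the text does not state which implementation was applied.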
FPS-DT modelling
The dataset prepared above serves as the training sample for the model. A decision tree is constructed using the DecisionTreeClassifier from Python’s scikit-learn library, with PC1, PC2, PC3, and PC4 as input features. Prior to constructing the decision tree, the dataset is partitioned into training and validation sets. The decision tree undergoes pre-pruning, and its hyperparameters are optimized using grid search with 5-fold cross-validation. This approach ensures that the model’s performance is evaluated on different training and validation datasets, thereby reducing the risk of overfitting. The recognition accuracy is employed as the evaluation metric for the model. Finally, the decision tree model is built based on the optimal parameters derived from the grid search. The decision tree model constructed in this study employs the Classification and Regression Trees (CART) algorithm, utilizing Gini impurity as the criterion for feature selection. A lower Gini impurity indicates a more distinct classification of the samples. The training set comprises 65 samples, augmented through the Synthetic Minority Oversampling Technique (SMOTE) to address class imbalance. Modelling was carried out using the following procedure:
(1) Construct a root node containing all training data and configure its parameters: maximum depth of 4, minimum split node size of 2, minimum leaf node size of 1, and cost-complexity pruning parameter of 0.03. After completing the pre-pruning, select an optimal feature derived from Principal Component Analysis (PCA) and calculate its Gini coefficient using Eq. (4). Using this optimal feature as a basis, partition the training dataset into subsets such that each subset achieves the best possible classification under the current conditions.
(2) If all subsets are classified essentially correctly, the corresponding leaf nodes are constructed and the process ends. However, if there are still subsets that cannot be correctly classified, new optimal features are selected for these subsets. The division continues, and the corresponding nodes (branches) are constructed recursively, until all subsets of the training data are classified essentially correctly or no suitable features remain for further selection.
(3) When all subsets have been assigned to leaf nodes and the Gini coefficient reaches zero (i.e., each subset contains a single, unambiguous class label), the decision tree model is considered fully trained (Fig. 8). Once training is completed, data from the prediction set are input into the model for recognition testing. The results obtained from the decision tree are then output using Python’s visualization module.
Decision Tree.
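The node-splitting rule used by CART can be illustrated with a minimal sketch that evaluates the Gini impurity of candidate thresholds on a single feature. It assumes binary splits at midpoints between consecutive sorted values; the function names, the PC1 scores, and the class labels below are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the binary split on one feature
    that minimizes the size-weighted Gini impurity of the two children."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical PC1 scores and water-type labels (not the mine data).
pc1 = [-2.0, -1.5, -1.0, 0.5, 1.0, 2.0]
types = ["I", "I", "I", "II", "II", "III"]
threshold, score = best_split(pc1, types)
```

In this toy example the selected threshold isolates Type I in a pure child node with zero impurity, while the other child still mixes Types II and III and would be split again at the next level, mirroring the recursive procedure in steps (1) to (3).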
Analysis of discriminatory results
To evaluate the performance of the constructed decision tree model, classification accuracy was assessed using the ‘metrics’ module from the scikit-learn library. A classification report was generated to summarize key evaluation metrics, including precision, recall, and F1-score, as presented in Table 5. In addition, five-fold cross-validation was employed to estimate the model’s generalization ability. This method partitions the dataset into five subsets and iteratively uses four for training and one for validation, thereby providing a robust and unbiased estimate of predictive performance.
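The partitioning behind five-fold cross-validation can be sketched as follows. The shuffling seed and the function name are assumptions, and the sample count of 65 is used only because it matches the size of the augmented training set.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(65, k=5))
```

Each sample appears in exactly one validation fold, so averaging the five fold accuracies uses every sample once for validation and four times for training.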
To further explain the behaviour of the model, the confusion matrix for each validation fold of the five-fold cross-validation was output (Fig. 9). In this matrix, the horizontal axis represents the predicted labels and the vertical axis represents the true labels. Diagonal entries correspond to correctly classified instances, while off-diagonal entries indicate misclassification. This visualisation helps to intuitively understand the model’s ability to classify across different water types and reveals potential weaknesses in specific category distinctions.
Confusion matrix.
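The layout of the confusion matrix described above (rows for true labels, columns for predicted labels) can be reproduced with a short sketch. The labels and predictions are hypothetical and do not correspond to the folds in Fig. 9.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows index the true class, columns index the predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


classes = ["I", "II", "III"]
y_true = ["I", "I", "II", "II", "III", "III"]
y_pred = ["I", "I", "II", "II", "III", "II"]  # one Type III sample misclassified

cm = confusion_matrix(y_true, y_pred, classes)
accuracy = sum(cm[i][i] for i in range(len(classes))) / len(y_true)
```

The single off-diagonal entry sits in the Type III row and the Type II column, which is how a matrix of this kind exposes the specific pair of water types that the model confuses.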
The classification report indicates that the model achieves a discrimination accuracy of 89% for Type II, with both recall and F1 score of 100%, demonstrating a well-balanced performance between precision and recall. Additionally, the model exhibits exceptional performance for Type I, with accuracy, recall, and F1 score all at 100%, highlighting its high precision in discriminating Type I. The F1 score, which integrates both precision and recall, is particularly useful for evaluating the model’s performance in cases of class imbalance. The formula is as follows:
$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
where precision is the proportion of samples predicted as the current category that truly belong to it, and recall is the proportion of samples truly in the current category that are correctly identified.
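As a worked example of these per-class metrics, the sketch below computes precision, recall, and the F1 score (their harmonic mean) from the counts of true positives, false positives, and false negatives for one category. The counts are hypothetical and are not taken from Table 5.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from true positives (tp),
    false positives (fp), and false negatives (fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one water type.
precision, recall, f1 = precision_recall_f1(tp=6, fp=2, fn=1)
```

Because the F1 score is a harmonic mean, it always lies between precision and recall and is pulled toward the smaller of the two, which makes it sensitive to poor performance on sparsely represented classes.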
The model achieves an identification accuracy of 100% for Type III; however, the recall for Type III is relatively low. This discrepancy is likely due to the uneven distribution of samples, with a particularly low number of Type III samples in the test set. The insufficient sample size for Type III likely hinders the model’s ability to effectively identify this type. In conclusion, based on the old goaf water quality data from Hanzui Mine, the overall recognition accuracy of the model across all sample types reached 92%.
To further validate the accuracy and reliability of the FPS-DT model, a comparison experiment was conducted using identical data samples and the same software platform. The parameters of the decision tree were kept identical to those used in the FPS-DT model. Following Principal Component Analysis (PCA) preprocessing, the data were not subjected to the SMOTE oversampling technique and were directly modeled and predicted using the decision tree model. Classification reports for both the PCA-DT and FPS-DT models were generated. Finally, a comparative analysis of the accuracy, recall, and F1 scores of the two models was conducted for the Type III classification task to evaluate their respective classification performance (Fig. 10).
Comparison of FPS-DT and PCA-DT results graphs.
In this study, the PCA combined with decision tree model (PCA-DT) and the FPS-DT model are compared and analyzed under the same dataset and parameter settings. The results indicate that the accuracy of the standard PCA-DT model is only 78%, while the FPS-DT model demonstrates significant improvements in accuracy, recall, and F1 score after incorporating the SMOTE oversampling technique. These findings suggest that the decision tree model, when combined with PCA for feature extraction and SMOTE for data augmentation, offers a more reliable method for identifying groundwater sources, with enhanced accuracy and applicability, thereby effectively addressing the needs of groundwater source identification.
Conclusions
In this study, forty-two sets of water samples were randomly selected and plotted on Piper trilinear diagrams to examine their hydrochemical characteristics, with the old goaf water quality of the Hanzui mine serving as the study’s background. The chemical indicators Na⁺ + K⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, HCO₃⁻, and CO₃²⁻ constitute the predominant factors in these water sources, thereby defining the chemical composition of the water. These indicators were used as the key variables for identifying water inrush sources. This study employs the Fuzzy C-Means (FCM) algorithm to perform clustering analysis of these indicators, using the results as criteria for identifying potential water inrush sources. Additionally, Principal Component Analysis (PCA) is applied to process the data, with four principal components selected based on the cumulative variance contribution. This approach effectively reduces redundancy among the variables, enhancing the model’s efficiency. To address the issue of class imbalance in model training, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to augment the minority class samples. A groundwater source identification model is then constructed using a decision tree based on the Classification and Regression Tree (CART) algorithm. The Gini index is used as the feature splitting criterion: the feature with the smallest weighted Gini value is selected as the root node of the tree, and the process is repeated recursively until the complete decision tree is formed. The experimental results demonstrate that the proposed method exhibits high recognition accuracy and strong potential for practical application.
Furthermore, the model structure is simple and intuitive, making it easy to understand and interpret. This method not only enhances safety in future mining operations by reducing the risk of sudden water-related accidents but also offers a novel and effective approach for groundwater source identification. Through precise data analysis and feature identification, this method holds significant promise for broader applications in hydrogeological research and related fields.
Data availability
All data generated or analyzed during this study are included in the published paper. Detailed data are available from the corresponding author on request.
References
Chen, X., Li, L., Wang, L. & Qi, L. The current situation and prevention and control countermeasures for typical dynamic disasters in kilometer-deep mines in China. Saf. Sci. 115, 229–236 (2019).
Wen, W. A comparative study on the development of China’s coal industry under the Belt and Road Initiative. China Coal. 45, 5–9 (2019).
Wang, J. & Wang, X. Analysis on water inrush danger from roof and floor of 2 seam in Wangjialing mine. Coal Sci. Technol. 39, 120–124 (2011).
Yang, M. Rescue measures and analysis on 3·28 water inrush accident in Wangjialing coal mine. Energy Sci. Technol. 11, 42–44 (2013).
Huang, Y. et al. Research on development law of mining floor failure zone in deep coal seam. Coal 33, 25–29 (2024).
Wu, Q. et al. Type classification and main characteristics of mine water disasters. J. China Coal Soc. 38, 561–565 (2013).
Xu, K., Su, Y., Liu, H. & Guo, S. Model test of ground collapse induced by groundwater seepage at deep water level. J. Exp. Mech. 37, 221–233 (2022).
Bi, Y. et al. Discriminant analysis of mine water inrush sources with multi-aquifer based on multivariate statistical analysis. Environ. Earth Sci. 80, 144 (2021).
Liu, Q. et al. Hydrochemical analysis and identification of open-pit mine water sources: a case study from the Dagushan iron mine in Northeast China. Sci. Rep. 11, 23152 (2021).
Li, S. Application of mine comprehensive geophysical exploration technology to seal extra large water inrush. Coal Sci. Technol. 39, 23–26 (2011).
Lei, M. A., Jiazhong, Q. & Weidong, Z. An approach for quickly identifying water-inrush source of mine based on GIS and groundwater chemistry and temperature. Coal Geol. Explor. 42, 49–53 (2014).
Mou, L. Application of dynamic curve prediction method in discriminating water-bursting source. Coal Geol. Explor. 44, 70–74 (2016).
Xue, J. Quantitative analysis of mine water inrush using isotope method. Coal Eng. 51, 150–153 (2019).
Wang, T., Fang, G., Zhang, X. & Wang, S. Qualitative and quantitative study of water source in Mindong 1 mine based on water chemistry and hydrogen and oxygen isotopes characteristics. Saf. Coal Mines. 55, 190–197 (2024).
Zhu, S., Jiang, C., Bi, B., Xie, H. & An, S. Identification of mine water inrush source based on combination weight-theory of improved grey relational degree. Coal Sci. Technol. 50, 165–172 (2022).
Feng, D. & Wu, J. Recognition model for mine water inrush sources based on SVM. J. Liaoning Tech. Univ. (Nat. Sci.). 36, 23–27 (2017).
Zhao, W., Liu, Q., Chai, H., Zhang, M. & Xie, Z. Identification of water inrush source based on fisher discriminant method and centroid distance theory. Sci. Technol. Eng. 20, 3552–3556 (2020).
Sun, F., Wei, J., Wan, Y. & Liu, C. Recognition method of mine water source based on fisher’s discriminant analysis and centroid distance evaluation. Coal Geol. Explor. 45, 80–84 (2017).
Li, B., Zhang, H., Zhang, W. & Li, T. The PCA-KD-KNN-based water chemistry identification model of water inrush source type in mine and its application. Arab. J. Geosci. 14 (2021).
Jin, J. Research and construction of student management platform for special needs students with decision tree model and big data technology. Syst. Soft Comput. 7, 200310 (2025).
Adhab, A. H. et al. Application of robust hybrid tree-based machine learning methods in accurate prediction of underground rock saturation exponent. Measurement 255, 117916 (2025).
Hui, W. et al. Multimodal interpretable image classification method based on visual attributes. ACTA Automatica Sinica. 51, 445–456 (2025).
Ju, Q. & Hu, Y. Source identification of mine water inrush based on principal component analysis and grey situation decision. Environ. Earth Sci. (2021).
Izakian, H. & Abraham, A. Fuzzy C-means and fuzzy swarm for fuzzy clustering problem. Expert Syst. Appl. 38, 1835–1838 (2011).
Li, Y. et al. Discriminative embedded multi-view fuzzy C-means clustering for feature-redundant and incomplete data. Inf. Sci. 677, 120830 (2024).
Xiao, W. et al. Implementation of fuzzy C-Means (FCM) clustering based camouflage image generation algorithm. IEEE Access 9, 120203–120209 (2021).
Wang, X., Liu, L., Chen, H., Hu, L. & Xie, F. A pipeline stage corrosion prediction method based on K-means clustering algorithm and LSTM neural network. Corrosion Protect. 84–89 (2024).
He, N., Xi, K., Gao, F. & Liu, Y. S. Based on FCM-ELM-BBPS predictive control parameter tuning. 50, 168–177 (2023).
Huang, P., Wang, X., Cinzia, F. & Federico, C. Piper-PCA-Fisher recognition model of water inrush source: A case study of the Jiaozuo mining area. Geofluids 2018, 1–10 (2018).
Li, B., Liu, Z., Wu, Q., Ling-Li, Z. & Zhou, L. Identification of mine water inrush source based on PCA-FDA: Xiandewang coal mine case. Geofluids 2020, 1–8 (2020).
Yu, X., Liu, Y. & Zhai, P. Identification of mine water inrush source based on PCA-AWOA-ELM model. Coal Sci. Technol. 51, 182–189 (2023).
Elreedy, D., Atiya, A. F. & Kamalov, F. A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning. Mach. Learn. 113, 4903–4923 (2024).
He, Y. et al. An interpretable deep learning framework using FCT-SMOTE and BO-TabNet algorithms for reservoir water sensitivity damage prediction. 15, 1–16 (2025).
Tan, X. et al. Wireless sensor networks intrusion detection based on SMOTE and the random forest algorithm. 19, 203 (2019).
Bai, L. et al. An oversampling method based on adaptive artificial immune network and SMOTE. Genetic Program. Evol. Mach. 26, 1–71 (2025).
Swana, E. F., Doorsamy, W. & Bokoro, P. Tomek link and SMOTE approaches for machine fault classification with an imbalanced dataset. Sensors 22, 3246 (2022).
Ghiasi, M. M., Zendehboudi, S. & Mohsenipour, A. A. Decision tree-based diagnosis of coronary artery disease: CART model. Comput. Methods Programs Biomed. 192, 105400 (2020).
Lin, S. & Luo, W. A new multilevel CART algorithm for multilevel data with binary outcomes. Multivar. Behav. Res. 54, 578–592 (2019).
Pekel, E. Estimation of soil moisture using decision tree regression. Theor. Appl. Climatol. 139, 1111–1119 (2020).
Wang, J. Design of data mining and analysis algorithm based on improved decision tree. Electron. Des. Eng. 32, 84–88 (2024).
Li, Z., Du, X., Xu, A., Wu, T. & Cao, Y. Explaining tree ensembles through single decision trees. Inform. Fusion. 123, 103244 (2025).
Chen, L., Gao, X., Liao, Y., Deng, J. & Zhou, B. Wetland classification method of Dongting lake district based on CART using GF-2 image. Bull. Surveying Mapp. 12–15 (2021).
Chen, W. & Li, X. Analysis of hydrogeological characteristics and water filling factors in Hanzui coal mine. Shaanxi Meitan. 38, 128–131 (2019).
Zeng, Y. et al. Three Zones Method for Coal Mine Water Hazard Control and its Significance (J. China Coal Soc., 2023).
Li, X. Study on system of prevention and control of old goaf water during coal mine re-mining. Saf. Coal Mines. 54, 184–193 (2023).
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 52104222, No. 51909224), the Natural Science Basic Research Program of Shaanxi (2021JLM-48, 2025JC-BMS-511, 2019JM-182), and the Special Fund for the Launch of Scientific Research in Xijing University (XJ18T04, XJ24B12). The authors would like to express sincere thanks to the reviewers for their thorough reviews and valuable advice.
Author information
Contributions
Kaide Liu: Writing – review & editing, Writing-original draft, Funding acquisition, Formal analysis, Data curation, Conceptualization. Yu Xia: Writing – review & editing, Writing-original draft, Software, Project administration, Formal analysis, Conceptualization. Xiaolong Li: Writing – original draft, Data curation, Software, Conceptualization. Chaowei Sun: Investigation, Conceptualization. Wenping Yue: Supervision, Conceptualization. Qiyu Wang: Software, Validation, Visualization. Songxin Zhao: Methodology, Software, Supervision. Shufeng Chen: Software, Project administration, Investigation.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, K., Xia, Y., Li, X. et al. Identification of water sources of mine water bursts based on the FPS-DT model. Sci Rep 15, 27327 (2025). https://doi.org/10.1038/s41598-025-13301-y