Introduction

Machine learning algorithms aim to minimize the overall error and are accuracy-driven1. When these algorithms are applied to a problem with a balanced dataset, they show significant performance. However, in the case of imbalanced data, particularly for multiclass problems, their performance suffers because of the skewed class distribution and because samples of various classes overlap at the boundary region. Figure 1, adopted from2, illustrates the problems of imbalance and overlapping. Under an imbalanced distribution, the dataset is divided into majority and minority groups, where the samples of the former group (the majority class) outnumber those of the latter group (the minority class). This unfair distribution inclines the underlying classifier towards the majority class while ignoring the minority class samples, even though these minority class samples often carry important information. If a traditional classifier is applied to such imbalanced datasets, it may result in an impractical classification model3. However, some researchers are of the opinion that imbalanced problems with a small imbalance ratio do not severely affect classifier performance, and that the imbalance effect can be reduced by using sampling strategies4.

To be more precise, the imbalanced data itself is not a big problem for the classifier; rather, the overlapping region created at the boundaries of the imbalanced classes is the key issue behind the performance degradation of different classifiers5. Often, in real-world datasets6, samples in a particular problem space share similar attribute values. If they fall at the boundary region of two classes, it is difficult to clearly define a hyperplane between the classes7. Moreover, the co-occurrence of data imbalance and overlapping has a huge impact on the classifier and reduces its efficiency. In addition, an increase in the number of classes involved in data overlapping makes the classification more challenging8. Hence, the algorithm occasionally fails to properly learn this complex situation and correctly predict the target class for the overlapped samples, thus compromising the overall classification performance and resulting in an impractical classification model.

Fig. 1

Classification problems, (a) Sample distribution problem, (b) Overlapping problems in classes, and (c) Imbalance and overlapping issues, adopted from2.

To address the overlapping data problem, the proposed solutions are based on two widely reported approaches:

  i. Adaptation algorithms: In this approach, an existing algorithm is modified, or a new technique is introduced, to handle complex imbalanced and overlapped data while classifying multi-class problems. Different algorithm-based approaches such as ensemble methods9, one-class learning10, and cost-sensitive methods11 are available in the existing literature to improve the classification performance on imbalanced and overlapping data. Fu et al.12 highlight the Support Vector Machine (SVM) as an efficient and generalizable approach in the case of multiple classes, combining the auxiliary algorithm Recursive Feature Elimination (RFE)13, parameter optimization techniques14, and Two-Step Classification (TSC)15. Czarnecki and Tabor proposed 2eSVM16, a modified SVM, to address the overlapping issue. Xiong et al. present a modified Naïve Bayes17, CANB18, for overlapping data, which learns separately from the overlapped and non-overlapped regions.

  ii. Data preprocessing techniques: In data-level methods, the original data is modified, by adding complementary features or using data purification techniques to separate the overlapping classes, to alleviate the impact of overlapping data19. To reduce the imbalance effect on the classification model, alternative feature selection strategies are frequently combined with sampling procedures. Some of the prominent sampling strategies mentioned in the literature are oversampling, under-sampling, and hybrid sampling using the Synthetic Minority Oversampling Technique (SMOTE) and its derivatives20. E. Elyan21 suggested using an oversampling-based method to eliminate the negative examples of the majority class from the overlapping region. The borderline samples of both the majority and minority groups were included in the sampling-based methodology presented in22. The model may over-fit or lose relevant information if the imbalance and overlapping problem is addressed using data-level procedures alone23.

Regardless of the significance of both approaches, there is still a gap in showing how much a classifier’s performance is affected by the degree of overlapping, i.e., methods to measure the degree of overlap are not available for real datasets24. Secondly, most of the literature discusses the overlapping samples that already exist in the data and their effect on the underlying classifier, and we did not find any comprehensive study that shows the effect of different degrees of controlled synthetic overlapping on classifier performance. To fill this gap, the current study contributes in the following areas.

  • We propose four algorithms to introduce controlled synthetic overlapping into multi-class imbalanced datasets. These algorithms include the Majority-class Overlapping Scheme (MOS), the All-class Overlapping Scheme (AOS), the Random-class Overlapping Scheme (ROS), and the All-class Overlapping Scheme using the Synthetic Minority Oversampling Technique (SMOTE) (AOS-SMOTE).

  • These algorithms make it possible to precisely determine which samples lie in the overlapping and non-overlapping regions, as well as to adjust the degree of overlapping between the various classes. This thorough categorization of the different sample types in a dataset offers a new approach to understanding and assessing class overlapping in multi-class imbalanced data.

  • The algorithms are applied with different degrees of overlap: 10%, 20%, 30%, 40%, and 50%. The synthetic examples are added to the dataset’s majority class, to a randomly chosen class, and to all classes in order to test the generality of the suggested techniques.

  • For the experiments, well-known classifiers such as Random Forest (RF)25, K Nearest Neighbor (KNN)26, and SVM27 are applied to different multi-class imbalanced datasets to empirically validate the effect of overlapping on learning from multi-class imbalanced problems. The synthetic generation of samples through the different algorithms enables us to establish the adverse effect of overlapping on the learning of the underlying classifiers over 20 multi-class real datasets.

  • An extensive experiment on different artificial and real datasets is carried out to provide a solid basis for conclusions about the effect of class overlapping on classifier performance, using conventional confusion-matrix-based metrics, namely accuracy and precision.

The remainder of this paper is organized as follows. Related work is given in the “Related work” section. The “Quantification of class overlapping” section presents the quantification of class overlapping. The proposed oversampling approaches for generating synthetic samples are discussed in the “General synthetic samples in multiclass datasets using proposed algorithms” section. The “Working mechanism for all schemes” section provides details of the working mechanism of all schemes, while results are presented in the “Results and discussions” section. Finally, the “Conclusions” section concludes this study.

Related work

Classification of imbalanced data in multi-class problems is receiving increasing attention owing to its prevalence in real-world applications such as medicine, finance, mechanics, various industries, and automation, particularly in data mining and big data environments. In the majority of these applications, data collection based on objective conditions with predefined characteristics of the data attributes results in an overlapping region among the samples of different classes. With an increasing degree of class overlapping, the underlying classifier also compromises its overall performance by misclassifying the samples near the boundary line.

Different studies in the literature agree that, most of the time, the classification error is concentrated on the borderline of different classes, which is exactly the overlapped region consisting of samples of both classes28. Overlapping is a more crucial issue than the data imbalance problem, as reported by Denil and Trappenberg29. Garcia and co-authors30 generate artificial datasets to show the effect of overlapping data on classifier performance using KNN, based on the overall imbalance ratio and the local imbalance ratio.

Class imbalance and overlapping problems become even more challenging for conventional algorithms, as reported by Han Kyu Lee and Seoung Bum Kim2. While overlapping happens when a borderline region in the data space is made up of similar data from many classes, class imbalance is the outcome of an imbalance in the class samples, giving rise to majority and minority class(es). The authors deal with the overlapping zones by dividing them into less overlapped (soft overlapping) and densely overlapped (hard overlapping) regions through the proposed Overlap Sensitive Margin (OSM) classifier. The OSM classifier combines KNN with the Fuzzy SVM (FSVM) to address the issues of imbalanced and overlapping classes in multi-class domains. The weight that was previously given to FSVM to remedy the class imbalance issue is essentially different from the misclassification cost that the OSM classifier applies to FSVM. The KNN technique determines the level of sample overlap in the dataset. The trained OSM classifier then uses a hyperplane to divide the overlapping area into soft and hard overlapping areas. The authors in7 highlight the problem of imbalance and overlapping regions, where the probabilities of different class samples are almost equal because the samples have somewhat similar characteristics. These similar data attributes make it very difficult for traditional classifiers to separate the samples of different classes from each other.

Das et al.31 investigate the overlapping issue by modifying the SVM kernel function to map the lower-dimensional feature space into a high-dimensional space. This transformation helps make the original data more linearly separable. Additionally, the authors of32 suggest Tomek-link data purification techniques to deal with overlap issues. Wilson33 introduces the Edited Nearest Neighbor (ENN) rule, which uses convergence features of the nearest neighbors. Batista and colleagues34 combine the ENN rule with SMOTE and Tomek techniques, using SMOTE + ENN and SMOTE + Tomek to deal with overlapping problems.

Xiong et al.35 provide a systematic review of approaches to address data overlapping issues and the relationship between imbalanced and overlapping data. Several simple ways are demonstrated by first identifying the overlapping region with a support vector description and then utilizing different techniques to separate, merge, or reject the complex overlapped data. Data overlap is a major problem in machine learning and directly affects the generalization and classification performance of the model, as claimed in36. The authors presented the kernel-MTS to improve classification performance in overlapping contexts; the kernel function significantly reduces both the amount of computation and the dimensional complexity. The authors of16 devised the 2eSVM method, which maps low-dimensional features into a higher-dimensional feature space using the classical SVM framework. The mapping into the higher-dimensional space makes it easy for the classifier to apply the kernel function to separate the data linearly. After studying the existing literature, it is clear that imbalance and overlapping are closely related to each other, as highlighted in37.

The authors of38 propose a hybrid oversampling approach to effectively overcome the imbalance of classes, both inter- and intra-class, and avoid noise formation. K-means-SMOTE38 combines k-means clustering with synthetic minority oversampling and consists of three steps: clustering, filtering, and sampling. The k-means clustering algorithm divides the input space into k subsets for grouping. In the second phase, a subset of the k clusters is chosen for sampling, with an emphasis on minority class samples. In the third step, the sampling technique SMOTE is applied to each selected cluster to obtain an improved ratio of minority to majority samples. According to extensive testing and empirical findings, the suggested approach significantly enhances the classification results by oversampling the training data. Furthermore, k-means SMOTE is more widely used than other popular oversampling techniques.
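As an illustration of this three-step pipeline, the following is a minimal sketch using the KMeansSMOTE implementation from the imbalanced-learn library; the toy dataset, the number of clusters, and the threshold value are illustrative assumptions rather than settings from the cited study.

```python
# Minimal sketch of the k-means SMOTE pipeline using imbalanced-learn;
# the toy dataset, number of clusters, and threshold are illustrative
# assumptions, not settings from the cited study.
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

# Hypothetical imbalanced three-class problem.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=42)

# Step 1: cluster the input space; step 2: keep clusters with a sufficient
# share of minority samples; step 3: apply SMOTE inside each kept cluster.
sampler = KMeansSMOTE(kmeans_estimator=KMeans(n_clusters=10, random_state=42),
                      cluster_balance_threshold=0.1, random_state=42)
X_res, y_res = sampler.fit_resample(X, y)

print("before:", Counter(y), "after:", Counter(y_res))
```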

The study39 proposed an ensemble approach to work with imbalanced and class-overlapping data. The authors followed an integration of class overlap reduction and data undersampling to resolve this issue. The integration is done using an extreme gradient boosting model, which showed superior results compared to existing benchmarks. Entropy-based sampling, particularly hybrid sampling, is another attractive solution to this problem, as reported by Kumar et al.40. The entropy-based method is especially effective for highly class-overlapped and imbalanced problems. The method works by removing less informative samples from overlapping areas and generating highly informative samples. Results show performance improvements for various imbalanced datasets.

The use of Generative Adversarial Networks (GANs) has also been reported to resolve data imbalance problems. For example, the study41 made use of GANs for classifying imbalanced datasets. The proposed approach comprises modules for misclassified samples and generated minority samples, in addition to the GAN discriminator and generator modules. The misclassified samples are iteratively updated along with the other modules. Experiments using tabular and textual data show better results than existing sampling approaches. Similarly, the study42 leverages GANs to overcome data imbalance problems. The authors introduced an adaptive weighting method that uses local and global density approaches to identify noisy samples. The weights depend on the density of a sample and its proximity to neighboring instances. The GAN then builds the balanced dataset by generating minority class samples using these weights. An improved performance is reported using the GAN-generated balanced dataset.

Predominantly, existing works are based on data-level, algorithm-level, and hybrid approaches to address class overlapping issues in multi-class imbalanced datasets. Very few papers on the quantification of overlapping samples and the effect of these samples are reported in the literature43. Moreover, there is no well-defined mechanism for introducing controlled overlapping samples into an underlying multi-class dataset to analyze the learning performance of a particular classifier. Given the limited number of papers in the literature, we include in this paper some important references that demonstrate the class overlapping effect on learning from multi-class problems44.

From previous studies, one of the main issues in classifying problems with multiple classes is overlapping, which is even more challenging than the imbalance itself45. To address the issue of overlapping, studies such as46,47,48 propose different sampling techniques to incorporate into the data pre-processing steps; however, as pointed out in49,50,51, in the case of high-dimensional imbalanced data, sampling techniques might not be enough to deal with the overlapping issue. The distribution density of the data, particularly in a high-dimensional feature space, together with a minority class with a very small sample space, makes it extremely challenging to differentiate the decision boundaries between the negative and positive (majority, minority) classes. At the same time, it is evident from the literature that most classifiers are designed with little focus on how to handle redundant features. Using all the available features does not guarantee a better classification result because of redundancy problems, which leads to the conclusion that features in high-dimensional space may be completely or partially unrelated to impartial learning52,53. In a highly imbalanced environment, some of the features become irrelevant because of the class overlapping issue (when some of the instances fall into the overlapping regions of the minority and majority classes). If the same situation occurs in various feature spaces, the recognition of positive and negative classes becomes extremely problematic. Hence, there is a strong motivation to search for powerful, dominant feature subsets among all existing features, especially in the classification of multivariate datasets.

Quantification of class overlapping

As discussed in the previous section, most real-world problems exhibit overlapping issues co-occurring with an imbalanced distribution of data, which severely affects classification performance, as presented by Garcia et al. in30, who prove that the learning difficulty of a classifier on an imbalanced dataset is highly dependent on the overlap degree. Despite being a popular field of research, there is currently no clear-cut mathematical explanation of how classes overlap, even though various studies2,54 have been presented in the literature. A major limitation of these methods is that they assume the data follows a normal distribution, which is not valid for most real-world datasets. To address this, we modified the formula from21, given in Eq. (1), into Eq. (2) for a better approximation of the overlapping region. The original formula was intended for binary classification problems in a 2-dimensional feature space.

$$\begin{aligned} Overlap \, Degree (\%) = \frac{Overlapping \, Space}{Minority \, Class \, Samples} \times 100 \end{aligned}$$
(1)

As shown in Eq. (2), the level of overlapping in the majority class influences the classification performance.

$$\begin{aligned} Overlap \, Degree (\%) = \frac{Overlapping \, Region}{Majority \, Class \, Area} \times 100 \end{aligned}$$
(2)

The closest neighbor rule and Euclidean distance are used to calculate the majority class area, and the shared feature space with comparable features is known as the overlapping region. The overlap zone for the two classes \(C_i\) and \(C_j\) can be calculated using Eqs. (3) and (4).

$$\begin{aligned} & \text {Overlapping} = P(x | C_i) > 0 \end{aligned}$$
(3)
$$\begin{aligned} & \quad \text {Overlapping} = P(x | C_j) > 0 \end{aligned}$$
(4)

where \(x\) is a sample in the overlapping region.

Equations (3) and (4) identify, based on the probability densities, the similar samples that contribute to the overlapping region. If the probability density of class \(C_j\) for a sample is greater than zero, the same must also hold for class \(C_i\), meaning that the features of the class \(C_i\) samples and the class \(C_j\) samples are similar. Put simply, for a sample located in the overlapped region, if its probability density under class \(C_i\) is higher than zero, then its probability density under class \(C_j\) must likewise be higher than zero.
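The density-based criterion of Eqs. (3) and (4) can be approximated in practice, for example with kernel density estimates; the following hedged sketch flags samples whose estimated class-conditional density is effectively non-zero for at least two classes. The bandwidth, the threshold eps, and the toy data are assumptions for illustration only.

```python
# Hedged sketch of the overlap criterion in Eqs. (3)-(4): a sample is flagged
# as overlapping when its estimated class-conditional density is effectively
# non-zero for at least two classes. The KDE bandwidth and the threshold eps
# are illustrative assumptions, not values taken from the paper.
import numpy as np
from sklearn.neighbors import KernelDensity

def overlapping_mask(X, y, bandwidth=0.5, eps=1e-3):
    classes = np.unique(y)
    # One KDE per class approximates P(x | C_j) for every sample x.
    densities = np.column_stack([
        np.exp(KernelDensity(bandwidth=bandwidth).fit(X[y == c]).score_samples(X))
        for c in classes
    ])
    return (densities > eps).sum(axis=1) >= 2

# Example on a hypothetical 2-D toy dataset with two partially overlapping classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(1.5, 1.0, (40, 2))])
y = np.array([0] * 100 + [1] * 40)
print("samples in the overlapping region:", int(overlapping_mask(X, y).sum()))
```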

$$\begin{aligned} F_1 = \frac{1}{{1 + \max _{i=1}^{m} r_{fi}}} \end{aligned}$$
(5)

In this case, \(r_{fi}\) is the discriminative ratio of the dataset’s feature \(f_i\). F1 is computed from the largest discriminative ratio over all m features. The ratio \(r_{fi}\) can be calculated using the formula described by Orriols-Puig et al. in55, which is as follows:

$$\begin{aligned} r_{fi} = \frac{\sum _{j=1}^{n_c} \sum _{k=1, k \ne j}^{n_c} {p}_{c_j} \cdot {p}_{c_k} \cdot \left( {\mu }_{c_j}^{f_i} - {\mu }_{c_k}^{f_i} \right) ^2}{\sum _{j=1}^{n_c} {p}_{c_j} \cdot \left( {\sigma }_{c_j}^{f_i} \right) ^2} \end{aligned}$$
(6)

where \(p_{c_j}\) and \(p_{c_k}\) denote the proportions of samples of classes \(c_j\) and \(c_k\), \(\mu _{c_j}^{f_i}\) and \(\mu _{c_k}^{f_i}\) represent the means of feature \(f_i\) for the samples of classes \(c_j\) and \(c_k\), respectively, and \(\sigma _{c_j}^{f_i}\) is the standard deviation of feature \(f_i\) in class \(c_j\). Equations (5) and (7) are both applied after partitioning the underlying dataset into binary classification problems using a one-vs-one technique. An alternative formula for calculating the discriminative ratio is provided by Mollineda56. It is as follows:

$$\begin{aligned} r_{fi} = \frac{\sum _{j=1}^{n_c} {n}_{c_j} \left( {\mu }_{c_j}^{f_i} - {\mu }^{f_i} \right) ^2}{\sum _{j=1}^{n_c} \sum _{l=1}^{{n}_{c_j}} \left( {x}_{l}^{j,f_i} - {\mu }_{c_j}^{f_i} \right) ^2} \end{aligned}$$
(7)

where \(n_{c}\) is the number of classes, \(n_{c_j}\) is the number of samples in class \(c_j\), \(\mu _{c_j}^{f_i}\) denotes the mean of feature \(f_i\) for class \(c_j\), \(\mu ^{f_i}\) is the mean of feature \(f_i\) across all classes, and \(x_{l}^{j,f_i}\) is the value of feature \(f_i\) for the \(l\)-th sample of class \(c_j\).
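The following is a minimal sketch of Eqs. (5) and (6), computing the per-feature discriminative ratio and the F1 measure with NumPy; the function name and the random toy data are illustrative assumptions.

```python
# Minimal sketch of the per-feature discriminative ratio (Eq. 6) and the F1
# measure (Eq. 5) with NumPy; function name and toy data are assumptions.
import numpy as np

def fisher_ratio_f1(X, y):
    classes, counts = np.unique(y, return_counts=True)
    p = counts / len(y)                                        # class proportions p_c
    mu = np.array([X[y == c].mean(axis=0) for c in classes])   # per-class feature means
    var = np.array([X[y == c].var(axis=0) for c in classes])   # per-class feature variances
    num = np.zeros(X.shape[1])
    for j in range(len(classes)):
        for k in range(len(classes)):
            if j != k:
                num += p[j] * p[k] * (mu[j] - mu[k]) ** 2      # numerator of Eq. (6)
    r_f = num / (p[:, None] * var).sum(axis=0)                 # divide by weighted variances
    return r_f, 1.0 / (1.0 + r_f.max())                        # Eq. (5)

# Hypothetical usage with random data.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(60, 4)), rng.integers(0, 3, size=60)
print(fisher_ratio_f1(X, y))
```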

Based on vector direction, the Directional-vector Maximum Fisher’s Discriminative Ratio concept is presented in55; it pursues a vector onto which the samples of two classes can be projected and separated, and it is given by:

$$\begin{aligned} dF = \frac{{\textbf{d}^t \textbf{Bd}}}{{\textbf{d}^t \textbf{Wd}}} \end{aligned}$$
(8)

In Eq. (8), \(\textbf{d}\) represents the directional vector onto which the data are projected to exploit the class separation in the problem domain. The scatter matrix between the classes is denoted by \(\textbf{B}\), and \(\textbf{W}\) represents the scatter matrix within the classes. \(\textbf{d}\), \(\textbf{B}\), and \(\textbf{W}\) are defined in Eqs. (9), (10), and (11), respectively.

$$\begin{aligned} \textbf{d} = \textbf{W}^{-1} \left( \varvec{\mu }_{c_1} - \varvec{\mu }_{c_2} \right) \end{aligned}$$
(9)

where \(\varvec{\mu }_{c_1}\) and \(\varvec{\mu }_{c_2}\) represent the mean vectors or centroids of classes \(c_1\) and \(c_2\), and \(\textbf{W}^{-1}\) denotes the pseudo-inverse of the within-class scatter matrix \(\textbf{W}\).

$$\begin{aligned} \textbf{B} = \left( \varvec{\mu }_{c_1} - \varvec{\mu }_{c_2} \right) \left( \varvec{\mu }_{c_1} - \varvec{\mu }_{c_2} \right) ^t \end{aligned}$$
(10)

where \(\varvec{\mu }_{c_1}\) and \(\varvec{\mu }_{c_2}\) represent the mean vector or centroid of classes \(c_1\) and \(c_2\), respectively.

$$\begin{aligned} \textbf{W} = {p}_{c_1} \mathbf {\Sigma }_{c_1} + {p}_{c_2} \mathbf {\Sigma }_{c_2} \end{aligned}$$
(11)

where \({p}_{c_1}\) and \({p}_{c_2}\) are the sample proportions of classes \(c_1\) and \(c_2\), respectively, and \(\mathbf {\Sigma }_{c_1}\) and \(\mathbf {\Sigma }_{c_2}\) denote the scatter matrices of classes \(c_1\) and \(c_2\), respectively.

Now using the definitions from Eqs. (9), (10), and (11), the directional-vector-based discriminative ratio can be computed as:

$$\begin{aligned} F_{1_V} = \frac{1}{{1 + dF}} \end{aligned}$$
(12)

Lower values of \(F_{1_V}\) indicate that the underlying problem is simple, in the sense that most of the data can be separated by projecting the samples onto \(\textbf{d}\) and using a linear hyperplane.
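A hedged sketch of Eqs. (8)–(12) for one binary sub-problem (as produced by the one-vs-one decomposition mentioned earlier) is given below; the function name, the use of the pseudo-inverse via numpy.linalg.pinv, and the proportion-weighted covariance estimate of W are assumptions about one reasonable implementation.

```python
# Hedged sketch of Eqs. (8)-(12) for a pair of classes c1 and c2, as produced
# by a one-vs-one decomposition. The pseudo-inverse (numpy.linalg.pinv) and
# the proportion-weighted covariance estimate of W are implementation
# assumptions.
import numpy as np

def f1v(X, y, c1, c2):
    X1, X2 = X[y == c1], X[y == c2]
    p1 = len(X1) / (len(X1) + len(X2))          # class proportions
    p2 = 1.0 - p1
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    diff = (mu1 - mu2).reshape(-1, 1)
    B = diff @ diff.T                            # between-class scatter, Eq. (10)
    W = p1 * np.cov(X1, rowvar=False) + p2 * np.cov(X2, rowvar=False)  # Eq. (11)
    d = np.linalg.pinv(W) @ diff                 # directional vector, Eq. (9)
    dF = ((d.T @ B @ d) / (d.T @ W @ d)).item()  # Eq. (8)
    return 1.0 / (1.0 + dF)                      # Eq. (12)

# Hypothetical usage: f1v(X, y, c1=0, c2=1) on a feature matrix X with labels y.
```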

Overlapping occurs near the boundary region between classes in the problem domain. An important measure for quantifying overlapping samples is the volume of the overlapping region57, which is given by:

$$\begin{aligned} \mathcal {F}_{2} = \prod _{i=1}^{m} \frac{\text {overlap}\left( f_i\right) }{\text {range}\left( f_i\right) } = \prod _{i=1}^{m} \frac{\max \left\{ 0,\ \text {minmax}\left( f_i\right) - \text {maxmin}\left( f_i\right) \right\} }{\text {maxmax}\left( f_i\right) - \text {minmin}\left( f_i\right) } \end{aligned}$$
(13)

where,

\(\text {minmax}(f_i) = \min \left( \max \left( f_{i}^{c_1}\right) , \max \left( f_{i}^{c_2}\right) \right)\)

\(\text {maxmin}(f_i) = \max \left( \min \left( f_{i}^{c_1}\right) , \min \left( f_{i}^{c_2}\right) \right)\)

\(\text {maxmax}(f_i) = \max \left( \max \left( f_{i}^{c_1}\right) , \max \left( f_{i}^{c_2}\right) \right)\)

\(\text {minmin}(f_i) = \min \left( \min \left( f_{i}^{c_1}\right) , \min \left( f_{i}^{c_2}\right) \right)\)

The measure calculates the volume of the overlapping region between classes from the distribution of the feature values, by finding the minimum and maximum values of each feature in each class. The values \(\min (f_i^{c})\) and \(\max (f_i^{c})\) are the minimum and maximum values of feature \(f_i\) in class \(c\).
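The following sketch computes the overlap volume of Eq. (13) for a pair of classes directly from the minmax/maxmin/maxmax/minmin definitions above; the function and variable names, and the guard against constant features, are illustrative assumptions.

```python
# Minimal sketch of the overlap volume F2 in Eq. (13) for a pair of classes,
# following the minmax/maxmin/maxmax/minmin definitions above. Function and
# variable names are illustrative assumptions.
import numpy as np

def f2_overlap_volume(X, y, c1, c2):
    X1, X2 = X[y == c1], X[y == c2]
    minmax = np.minimum(X1.max(axis=0), X2.max(axis=0))
    maxmin = np.maximum(X1.min(axis=0), X2.min(axis=0))
    maxmax = np.maximum(X1.max(axis=0), X2.max(axis=0))
    minmin = np.minimum(X1.min(axis=0), X2.min(axis=0))
    overlap = np.maximum(0.0, minmax - maxmin)   # per-feature overlap length
    rng = maxmax - minmin                        # per-feature value range
    rng[rng == 0] = 1.0                          # guard against constant features (assumption)
    return float(np.prod(overlap / rng))
```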

General synthetic samples in multiclass datasets using proposed algorithms

In this section, different algorithms are designed to implement the proposed schemes that introduce a controlled synthetic overlapping in multi-class imbalanced datasets to show the impact of class overlapping on the underlying classifier.

Algorithms to synthetically overlap multi-class dataset

In the current research, four different algorithms are used to synthetically overlap the multi-class dataset, namely, Majority-class Overlapping Scheme (MOS), All-class Overlapping Scheme (AOS), Random-class Overlapping Scheme (ROS), and All-class Overlapping Scheme using SMOTE (AOS-SMOTE).

  • MOS: Synthetic samples are introduced into the majority class in increments of 10%. The MOS scheme targets only the majority class to create samples with shared attribute values in the overlapping region. The learning performance of the underlying classifier is first evaluated directly on the multi-class dataset, and then 10% of overlapping samples are added to the majority class. The percentage of overlapping samples is increased in intervals of 10% up to 50% to show the effect of different levels of class overlapping on learning from multi-class problems. The MOS scheme further reduces the visibility of the minority classes by increasing the majority class samples, an effect that grows with the increasing overlapping level, as is clear from the results section.

  • AOS: All classes in the problem domain are considered target classes to be synthetically overlapped. If the total number of classes is \(N\), then \(N\) sets of synthetic samples \(s_1, s_2, \ldots , s_N\) are generated and then merged with the original dataset. The degree of overlapping samples is increased in the same manner as described for MOS.

  • ROS: Before applying the algorithm, SMOTE is used to balance the sample distribution between the majority and minority classes. After applying SMOTE, a target class is selected randomly, and the proposed scheme generates a single set of synthetic samples \(S_{\text {rand}}\) that is inserted into the target class and then merged with the original dataset \(D\).

  • AOS-SMOTE: All classes are overlapped by applying SMOTE; no additional synthetic overlapping samples are created, and only the minority class samples are increased using SMOTE until they equal the majority class samples.

To implement the proposed algorithms, some concepts are based on58,59 as reported in the literature.

Algorithms for the proposed schemes

The first strategy is implemented by Algorithm 1, where different levels of synthetic samples are introduced only into the majority class in order to increase the overlap in steps of 10%. The classifier’s learning is assessed at each level in order to examine the impact of overlapping at that particular level. The synthetic samples are added to the majority class in increments of 10%, starting with 10% of the majority class samples generated using the MOS technique and increasing to 50%. This algorithm first identifies the samples of the majority class and of the other classes, and then employs Eq. (14) to determine the distance between samples from both groups.

$$\begin{aligned} {[} \Vert \overrightarrow{a} - \overrightarrow{b} \Vert = \sqrt{{\left( \overrightarrow{a} - \overrightarrow{b}\right) }^{2}}, \text {where} \ \overrightarrow{a}, \overrightarrow{b} \in \mathbb {R}^{n}] \end{aligned}$$
(14)

where \(\Vert \overrightarrow{a} - \overrightarrow{b} \Vert\) is the distance between the two observations and \(\overrightarrow{a}\), \(\overrightarrow{b}\) are two vectors (instances). Expanded component-wise, as in Eq. (15), the distance for an n-dimensional point is

$$\begin{aligned} {[}\left\| \overrightarrow{a} - \overrightarrow{b} \right\| = \sqrt{{\left( {a}_{1} - {b}_{1}\right) }^{2} + {\left( {a}_{2} - {b}_{2}\right) }^{2} + \ldots + {\left( {a}_{n} - {b}_{n}\right) }^{2}}, \quad \text {where} \ \overrightarrow{a}, \overrightarrow{b} \in \mathbb {R}^{n}] \end{aligned}$$
(15)

The working mechanism of the proposed algorithm is discussed in the subsequent section. Algorithm 1 is used to generate the overlapping samples, SS, in the majority class of the multi-class dataset, where \(B = SS = \emptyset\) at the beginning and k1 and k2 are two predefined parameters indicating the numbers of neighbors.

Algorithm 1

Algorithm for MOS: Creating Overlapping Samples (SS) in majority class.
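Because the Algorithm 1 listing is provided as a figure, the sketch below reconstructs only the steps named in the text (distance to the nearest other-class neighbor with k1 = 1, ascending sort by distance, SS equal to x% of the majority class, and interpolation as in Eq. (16)); the helper names and exact control flow are assumptions, not the authors' implementation.

```python
# Hedged sketch of the MOS procedure: since the Algorithm 1 listing is given
# as a figure, helper names and exact control flow here are assumptions; only
# the steps stated in the text are reproduced.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mos_overlap(X, y, degree=10, k2=3, seed=None):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    maj = classes[np.argmax(counts)]                    # target = majority class
    X_maj, X_other = X[y == maj], X[y != maj]

    nn = NearestNeighbors(n_neighbors=k2).fit(X_other)
    dist, idx = nn.kneighbors(X_maj)                    # distances to other-class samples

    order = np.argsort(dist[:, 0])                      # samples closest to the boundary first
    n_new = int(len(X_maj) * degree / 100)              # SS = (#majority-class samples * x) / 100

    synthetic = []
    for i in order[:n_new]:
        neighbour = X_other[rng.choice(idx[i])]         # random one of the k2 nearest neighbours
        synthetic.append(X_maj[i] + rng.random() * (neighbour - X_maj[i]))  # Eq. (16)
    SS = np.array(synthetic).reshape(-1, X.shape[1])

    X_aug = np.vstack([X, SS])
    y_aug = np.concatenate([y, np.full(len(SS), maj)])  # assumption: new samples keep the majority label
    return X_aug, y_aug
```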

Algorithm 2 is used for AOS. The idea is to build the set of overlapping samples, SS, in each of the multi-class dataset’s classes (with \(B = SS = \emptyset\) initially; the numbers of neighbors are indicated by the predefined parameters k1 and k2).

Algorithm 2

Algorithm for AOS: Creating Overlapping Samples (SS) in all classes.

Algorithm 3 is used for ROS. The idea is to apply SMOTE to a random class in the multi-class dataset and then generate the overlapping sample set, denoted by SS (initially \(B = SS = \emptyset\); k1 and k2 are two predefined parameters specifying the numbers of neighbors).

Algorithm 3

Algorithm for ROS: Creating Overlapping Samples (SS) in randomly selected classes

Working mechanism for all schemes

As discussed in the previous sections, the four schemes generate synthetic samples according to a predefined level to highlight the overlapping effects in the underlying dataset. The first three schemes follow the same working mechanism except for the selection of the target class to be overlapped. The explanation below is based on Algorithm 3, can be applied to all the schemes, and consists of three steps.

  i. Preprocessing step: This step involves applying several preprocessing operations to make the data suitable for model training. For example, the label-encoding scheme converts categorical data into numeric values, and normalization techniques scale the features to a common range centered around zero.

  ii. Assessment of class samples in the overlapping region: As shown in Fig. 2b, the basic assumption regarding the class overlapping samples is that they comprise the samples closest to the boundary. Based on this supposition, we identify target class samples that are near the boundaries. Line 9 of the algorithm determines the average distance \(dist\_i\) between each sample of the target class C and its \(k_1\) nearest neighbors (\(NS_{ample}\)) that do not belong to the target class. The nearest neighbor is selected by setting the value of \(k_1\) to 1. The Euclidean distance is used to compute the distance \(dist\_i\) between the sample and its neighbor60. The closer the sample is to the boundary, the lower its \(dist\_i\) value. The detrimental effects of noise in the sample space are lessened by a greater value of \(k_1\). As a result of this step, we obtain a tuple of four items: \(S_{ample}\) (a single sample of class C), \(dist\_i\) (the average distance to its \(k_1\) nearest neighbors from the other classes), \(maj\_i\) (the target class), and \(NS_{ample}\) (the closest sample from the classes other than the target class). Euclidean distance can lose its effectiveness in high-dimensional datasets due to several issues, including the distortion of distances in high-dimensional spaces and the curse of dimensionality. Since our data is not high-dimensional, we prefer the Euclidean distance because of its ease and simplicity; if the data were high-dimensional, we would have to consider Principal Component Analysis (PCA) or other approaches for dimensionality reduction. Figure 2a and c show the imbalanced distribution and the imbalanced and overlapping distribution, respectively.

    As a matter of fact, regions in the feature space where instances from several classes are closely packed together frequently cause data overlap. With an increasing number of overlapping samples, the underlying classifiers suffer in terms of accuracy, precision, etc. Figure 2 clearly depicts this scenario and shows that a higher density of mixed-class neighbors suggests more ambiguity in the class boundaries. Overlap is influenced by the geometric complexity of the decision boundary (linear vs. nonlinear separability) in addition to the closest-opponent rule. Since the datasets in this study do not have intrinsic characteristics that call for intricate decision boundaries, such a geometric treatment of how they overlap is not needed. In addition, the data has a low-dimensional space, which is why the class overlap is more visible, compared to high-dimensional data where sparsity makes the overlap less pronounced. For this reason, this study demonstrates the synthetic creation of overlapped samples with controlled characteristics to empirically demonstrate the effects of overlap.

  iii. Creation of the overlapping region via synthetic samples: The number of synthetic samples SS is computed in line 13 as \(SS = \left( \frac{\#T\_M\_SubSet \cdot x}{100}\right)\), where T_M_SubSet is the subset of \(Mul\_IOD\) that includes every sample of the target class C and x is the required overlapping percentage. The samples of the target class are sorted in ascending order of the distance determined in the preceding step so that they can be processed sequentially. For each sample, a synthetic sample SS is created nearby. In line 17 of Algorithm 3, one of the \(k_2\) nearest neighbors is chosen at random, where the sample is from target class C, \(NS_{ample}\) is the selected neighbor, and r indicates a random number between 0 and 1. The sample SS is created using the interpolation scheme58. For nominal attributes, a value is selected at random from \(S_{ample}\) and \(NS_{ample}\).

Fig. 2

Sample distribution with respect to, (a) Imbalanced distribution, (b) With data overlapping, and (c) Imbalanced and overlapping scenarios.

The suggested schemes’ synthetic sample generation is based on SMOTE interpolation58, with the exception that \(k=1\) and \(k=3\) are used to locate the closest neighbors from the other classes. Equation (16) is applied to construct the new synthetic sample.

$$\begin{aligned} {[}r = S_{ample} + \text {rnd} \cdot dist\_i] \end{aligned}$$
(16)

where rnd is a random number in (0, 1), \(S_{ample}\) is the selected sample from the training set of the class under consideration (the target class), and \(dist\_i\) is the difference between the selected sample \(S_{ample}\) and its nearest sample from a class other than the target class, found using the Euclidean distance formula. The difference \(dist\_i\) is computed using Eq. (17).

$$\begin{aligned} dist\_i = NS_{ample} - S_{ample} \end{aligned}$$
(17)

where \(dist\_i\) is the difference between the selected sample \(S_{ample}\) and the neighbor sample \(NS_{ample}\) of the other class.

Figure 3 shows the creation of synthetic samples from the target class sample \(S_{ample}\). Four different samples r1, r2, r3, and r4 are created, which are different from the target class samples. Here, \(S_{ample}\) is the original sample of the target class. We set \(K=1\) to find the nearest neighbors from classes other than the target class, which are \(S_{ample}1\), \(S_{ample}2\), \(S_{ample}3\), and \(S_{ample}4\), as shown in Fig. 3. To do this, the difference between each of the chosen neighbors and the feature vector (sample) under consideration is calculated. This difference is multiplied by a random value between 0 and 1 and then added to the feature vector under consideration.

Fig. 3

Four different samples (r1, r2, r3, r4) created from the target class sample \(S_{ample}\) and the sample (\(S_{ample}1\), \(S_{ample}2\), \(S_{ample}3\), \(S_{ample}4\)).
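A tiny numerical illustration of the construction in Fig. 3 is given below: four synthetic samples r1–r4 are generated from one target-class sample and four other-class neighbors via Eqs. (16) and (17); all values are made up for demonstration.

```python
# Tiny illustration of the construction in Fig. 3: four synthetic samples
# r1..r4 generated from one target-class sample and four other-class
# neighbours via Eqs. (16)-(17); all numbers are made up for demonstration.
import numpy as np

rng = np.random.default_rng(7)
s_ample = np.array([2.0, 3.0])                   # target-class sample
neighbours = np.array([[2.5, 3.2], [1.6, 2.7],   # nearest samples from other classes
                       [2.2, 3.8], [2.9, 2.9]])

dist_i = neighbours - s_ample                    # difference vectors, Eq. (17)
r = s_ample + rng.random((4, 1)) * dist_i        # r1..r4 via Eq. (16)
print(r)
```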

Figure 4 shows the mathematical description of how the synthetic samples are created from the target class sample. A distance vector is maintained to hold the distances to all the nearest samples not belonging to the target class.

Fig. 4

Mathematical description of how the synthetic samples are created using the target class sample, the nearest samples from the other classes, and the difference in distance between the target and the other classes.

Figure 5 shows the creation of synthetic samples and Fig. 6 shows a mathematical description of how to create synthetic samples from the same target class. The difference between Figs. 4 and 5 is whether the synthetic samples are created from classes other than the target class or from the target class itself. In Fig. 5, four different samples r1, r2, r3, and r4 are created from the same target class to which sample \(S_{ample}1\) belongs. The mathematical description shows how a synthetic sample is created from the target class sample \(S_{ample}1\). A distance vector is maintained to hold the distances to all the nearest samples belonging to the same target class from which sample \(S_{ample}1\) is selected.

Fig. 5

Different samples (r1, r2, r3, r4) created from class C and the sample \(S_{ample}\) and the sample (\(S_{ample}1\), \(S_{ample}2\), \(S_{ample}3\), \(S_{ample}4\)) from the same class C.

Figure 6 shows the description of generating synthetic samples using target class C.

Fig. 6

Mathematical description of how the synthetic samples were created using the target class C.

Results and discussions

Experimental setup

In the current research, four different schemes, namely MOS, AOS, ROS, and AOS-SMOTE, are implemented via algorithms to synthetically overlap the multiclass datasets. The real datasets used for the experiments are also described with the relevant information. The results are presented and discussed in this section.

Choice of evaluation metrics

Accuracy, precision, and F1 scores are employed for model evaluation for the following reasons.

  • Easy Communication and Ease of Understanding: Accuracy is straightforward to calculate and to explain to non-technical stakeholders. It is the simplest way to convey “how often the model gets things right.”

  • Precision draws attention to the significance of positive predictions, which is helpful in situations where false positives are costly. For greater insight, precision is frequently combined with recall (e.g., the F1 score).

  • Initial Metric: Even for imbalanced datasets, accuracy is frequently employed as a baseline statistic or starting point for preliminary testing. It helps to quickly determine whether the model is performing worse than random guessing. When concentrating on the performance of positive predictions in the early phases of analysis, precision can be helpful. Most importantly, accuracy and precision provide deep insights when the samples in the classes are equally distributed, or even when the imbalanced nature of the data can be handled easily.

In this study, since we create synthetic overlapped samples, it is appropriate to test the performance of the model with accuracy and precision as baseline metrics to show how the performance deteriorates as the imbalanced and overlapped samples increase; these metrics are also more convenient for an initial evaluation of the model.

Class overlapping effect on learning from multi-class real-world dataset

This study utilized 20 real-world datasets for the experiments, which can be found in the UCI repository61. The selection of the datasets is based on various characteristics, for example, the attributes in a dataset, the total number of samples, the number of classes, and the distribution of samples over those classes. Every dataset is different with respect to these characteristics. For example, the number of samples ranges from 150 to 12,960, the number of features from 4 to 65, and the number of classes from 3 to 20, as shown in Table 1.

Table 1 Multi-class datasets used for experiments.

Analysis of results for effect of imbalance and overlapping

For the experiments, synthetically generated samples were added to the multi-class datasets to increase both the imbalance and the overlapping effect, using the proposed schemes MOS, AOS, ROS, and AOS-SMOTE discussed in the previous section. A range of values for each classifier’s parameters is tested in this study using grid search and ten-fold cross-validation. This process is run to find the optimal parameter configuration that produces the best overall classification accuracy and precision for each classifier.

For SVM, we adjusted the gamma parameter within a 20-value search range and chose “100” as the optimal value, “0.1” for C, “O-vs-R” for the decision function shape, and “uniform” for the class-weight parameter. For KNN, n-neighbors is “5”, weights are “uniform”, and leaf-size is “3”. For RF, min-samples-leaf = “2”, number-of-estimators = “00”, and max-features = “all”.
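A hedged sketch of this tuning procedure, grid search with ten-fold cross-validation over the three classifiers, is shown below using scikit-learn; the parameter grids are illustrative assumptions and not the exact 20-value ranges used in the study.

```python
# Hedged sketch of the tuning procedure described above: grid search with
# ten-fold cross-validation for SVM, KNN, and RF. The parameter grids are
# illustrative assumptions, not the exact search ranges of the study.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    "SVM": (SVC(decision_function_shape="ovr"),
            {"C": [0.1, 1, 10], "gamma": [0.01, 1, 100]}),
    "KNN": (KNeighborsClassifier(),
            {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}),
    "RF": (RandomForestClassifier(),
           {"n_estimators": [100, 200], "min_samples_leaf": [1, 2]}),
}

def tune_all(X, y):
    best = {}
    for name, (estimator, grid) in models.items():
        search = GridSearchCV(estimator, grid, cv=10, scoring="accuracy")
        search.fit(X, y)
        best[name] = (search.best_params_, search.best_score_)
    return best
```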

Here, we present the effect of the synthetic samples on well-known classifiers, namely SVM, KNN, and RF, by measuring their accuracy, precision, and F1 scores. We used ten-fold cross-validation to split the samples into training and testing sets. Table 2 highlights the results for the proposed schemes with different degrees of overlapping on the Ecoli dataset. In the experimental part of this research, we implemented the proposed schemes for 20 multi-class datasets, but only the results for the Ecoli and Vehicle datasets are presented here.

Table 2 Accuracy and precision for the Ecoli dataset.

The first section of Table 2 displays the results for the MOS scheme, where synthetic samples are added to the dataset. Here we are not concerned with the working mechanism or with a performance comparison of the underlying classifiers; rather, we are interested in discussing how much the learning capability of a classifier is affected by increasing the overlapping samples in increments of 10%. It is evident from Table 2 that at 0% (when there is no synthetic overlapping), all three classifiers achieve maximum learning over the Ecoli dataset. However, with an increasing degree of overlapping, the learning accuracy decreases gradually for all the classifiers, with the worst performance at 50% synthetic overlapping. The accuracy results reveal that the accuracy gradually decreases with the increasing degree of overlapping. Most of the time, the majority-class observations contribute largely to the overlapped region, while the minority-class samples are less prominent. If we synthetically generate samples for the majority class based on the distance to the other class samples, the relative share of the minority class samples reduces even more, causing a dramatic decrease in accuracy and precision. With the increasing degree of overlapping, it becomes very difficult for the underlying classifiers to detect an optimal hyperplane at the boundary. In this case, a classifier may predict a sample of the minority class as a sample of the majority class, which leads to an impractical classification model by compromising the overall performance.

In the second section of Table 2, the results for the AOS scheme are presented, where all the classes in the dataset are overlapped by inserting synthetic samples. If we compare the MOS and AOS schemes, the former proves its significance over the latter. In the MOS scheme, synthetic samples were added only to the majority class, thus affecting the overlapping region between the majority class and the rest of the classes, whereas in the AOS scheme, synthetic samples were inserted into all the classes, thus enlarging the overlapping region among all the classes in the dataset.

The ROS scheme selects a random class for the synthetic samples at each level of overlapping, so at each level a different class is overlapped synthetically. If we compare the ROS scheme with MOS and AOS, ROS gives inconsistent results for both accuracy and precision. The reason behind the inconsistent results is the random selection of classes. The Ecoli dataset consists of 8 classes with an unequal distribution of samples: the first class (the majority class) has 143 samples, and the rest of the classes have fewer samples. If a class with 10 samples is selected randomly, then according to the formula used in the proposed algorithm, only one or two synthetic samples can be added to that class. With only one or two synthetic samples, there is almost no effect on the overlapping region.

Fig. 7

Accuracy graphs for the Ecoli dataset using the proposed schemes, (a) Majority class oversampling, (b) All class oversampling, (c) Random class oversampling, and (d) SMOTE and all class oversampling.

In the AOS-SMOTE scheme, SMOTE balances the distribution of samples between the majority and minority class(es). After balancing the distribution, a random class is selected to be overlapped by inserting synthetic samples. Instead of random selection, we could insert the synthetic samples into any class, as all the classes have an equal number of samples and show the same effects. The results for the existing overlapping data (without inserting synthetic samples, 0%) reveal that all three classifiers show better performance compared to the other three schemes. The SMOTE oversampling technique is the main reason behind this improved performance: the SMOTE algorithm also synthetically generates samples for the minority classes to bring them up to the size of the majority class. The rest of the results are nearly consistent for different degrees of overlapping, gradually decreasing with the increasing degree of overlapping. Figure 7a–d shows the respective accuracy graphs for the four proposed schemes MOS, AOS, ROS, and AOS-SMOTE.

Figure 8a–d shows the respective precision graphs for the four proposed schemes MOS, AOS, ROS, and AOS-SMOTE. The models report different precision values for the proposed schemes. For the MOS and AOS schemes, SVM performs better, while RF leads in precision for the ROS and AOS-SMOTE schemes.

Fig. 8

Precision graphs for the Ecoli dataset using the proposed schemes, (a) Majority class oversampling, (b) All class oversampling, (c) Random class oversampling, and (d) SMOTE and all class oversampling.

Table 3 highlights the comparative results for all three classifiers, SVM, KNN, and RF, for the MOS, AOS, ROS, and AOS-SMOTE schemes with different degrees of overlapping using the Vehicle dataset. From the overall effect of synthetic class overlapping for both datasets over all four schemes, it is evident that the increasing degree of overlapping has more effect on the Vehicle dataset than on the Ecoli dataset. The learning of the underlying classifier is more affected in the Vehicle multi-class imbalanced dataset with the increasing degree of synthetic overlapping. The Ecoli dataset consists of 336 samples with 8 classes and 9 features, whereas the Vehicle dataset consists of 846 samples with 19 features and 4 classes. The majority class in the Ecoli dataset has 143 samples, while the remaining 193 samples are distributed among the other seven classes, with class frequencies of (0:143, 1:77, 2:2, 3:2, 4:35, 5:20, 6:5, 7:52). Looking at this distribution, some of the classes have very few samples (two, two, and five samples); according to the overlapping degree, there will be no synthetic samples for those classes even in the case of 10% overlapping, thus greatly minimizing the overlapping region after the synthetic generation of samples. On the other hand, the Vehicle dataset samples are distributed as (0:218, 1:212, 2:217, 3:199); for every class, even at the lowest degree of overlapping (10%), a notable number of synthetic samples is generated and added to the overlapped region. Having more samples in the overlapping region makes it difficult for the classifier to correctly predict the target class; the misclassification rate increases with the increasing number of samples, compromising the overall classification performance.

Table 3 Accuracy and precision for proposed schemes using the vehicle dataset.
Fig. 9

Accuracy graphs for the Vehicle dataset using the proposed schemes, (a) Majority class oversampling, (b) All class oversampling, (c) Random class oversampling, and (d) SMOTE and all class oversampling.

Figure 9a–d shows the respective accuracy graphs for the four proposed schemes MOS, AOS, ROS, and AOS-SMOTE. Figure 10a–d shows the corresponding precision graphs for the models when used with the four proposed schemes.

Fig. 10

Precision graphs for the Vehicle dataset using the proposed schemes, (a) Majority class oversampling, (b) All class oversampling, (c) Random class oversampling, and (d) SMOTE and all class oversampling.

Table 4 highlights the comparative results, in terms of F1 score, for all three classifiers, SVM, KNN, and RF, following the standard procedure for the MOS, AOS, ROS, and AOS-SMOTE schemes with different degrees of overlapping using the Ecoli dataset. Similar to the accuracy and precision results reported in Tables 2 and 3 for the proposed schemes, the performance is better when a low ratio of synthetic samples is generated.

Table 4 F1 scores for the proposed overlapping schemes using the Ecoli dataset.

Figure 11a–d shows the respective F1-score graphs for the four proposed schemes MOS, AOS, ROS, and AOS-SMOTE. The F1 score varies with respect to the approach used for generating synthetic samples as well as the ratio of sample generation. Increasing the ratio of generated samples tends to degrade the models’ performance.

Fig. 11

F1 score graphs for the Ecoli dataset using the proposed schemes, (a) Majority class oversampling, (b) All class oversampling, (c) Random class oversampling, and (d) SMOTE and all class oversampling.

Table 5 highlights the comparative F1-score results for all three classifiers, SVM, KNN, and RF, following the standard procedure for the MOS, AOS, ROS, and AOS-SMOTE schemes with different degrees of overlapping using the Vehicle dataset.

Table 5 F1 scores for the proposed overlapping schemes using the Vehicle dataset.

Figure 12a–d shows the respective F1-score graphs for the four proposed schemes MOS, AOS, ROS, and AOS-SMOTE. SVM tends to show a better F1 score at the 50% overlapping level for all four proposed schemes.

Fig. 12

F1 score graphs for the Vehicle dataset using the proposed schemes, (a) Majority class oversampling, (b) All class oversampling, (c) Random class oversampling, and (d) SMOTE and all class oversampling.

Limitations of the study

This study proposes several approaches for the synthetic oversampling of data to analyze the impact of class overlapping, which is an important topic for understanding the lower performance of classifiers. Experiments on several datasets containing overlapped and imbalanced class samples are run to analyze the models’ performance. Although the experiments provide a solid basis for assessing the effect of class overlapping, the study has several limitations.

  • For class overlapping, only the closest-opponent rule is considered, with respect to the nature of the datasets used in this study. However, other rules, such as the local density of the nearest instances in a class, the type of data distribution, etc., might provide valuable insights into this problem.

  • For nearest neighbors, only the Euclidean distance metric is considered in this study, because the datasets have a low-dimensional feature space. However, other distance metrics could also be used to determine class overlapping.

  • This study considered several datasets for the experiments; however, only low-dimensional datasets are used, necessitating further experiments on high-dimensional datasets.

Conclusions

Class overlapping is a challenging issue that degrades the learning of classifiers and severely affects their performance, even more so for multi-class datasets. In the literature, many studies use data-level, algorithm-level, or hybrid techniques to address class overlapping difficulties. However, a well-presented study that demonstrates the impact of increasing levels of sample overlap on classifier performance could not be found. Our contribution in this paper is to design and implement four different schemes (algorithms) that generate different levels of synthetic samples in multi-class imbalanced datasets to make them more overlapped and imbalanced. In this way, we can conclude, based on the results, how much a particular level of overlapping samples hinders the learning of classifiers from multi-class imbalanced problems. The four schemes used for synthetic controlled overlapping are the Majority-class Overlapping Scheme (MOS), the All-class Overlapping Scheme (AOS), the Random-class Overlapping Scheme (ROS), and AOS using SMOTE (AOS-SMOTE). Each proposed scheme generates a controlled overlapping, starting from 0 and going up to 50% in intervals of 10%, and inserts it into the underlying multi-class dataset. The synthetically overlapped multi-class dataset is then classified with three well-known classifiers, the support vector machine, k nearest neighbor, and random forest, to show how much the learning of these classifiers is compromised with the increasing degree of overlapping.

This research provides a means for the extensive evaluation of the different sample types in a dataset, implying a novel way of understanding class overlapping and evaluating how a classifier’s performance is affected for multi-class datasets. In the future, researchers can estimate the correlation between the degree of overlapping and a classifier’s performance more precisely, and focus on feature engineering, data cleansing, and data preprocessing methods in combination with decomposition techniques to enhance the classifier’s performance when dealing with multi-class imbalanced problems with class overlapping.