An improved SMOTE algorithm for enhanced imbalanced data classification by expanding sample generation space

Li, Ying; Yang, Yali; Song, Peihua; Duan, Lian; Ren, Rui

doi:10.1038/s41598-025-09506-w

Article
Open access
Published: 02 July 2025

Ying Li^1,2,
Yali Yang¹,
Peihua Song^1,2,
Lian Duan^3,4 &
…
Rui Ren¹

Scientific Reports volume 15, Article number: 23521 (2025)

Abstract

Class imbalance in datasets often degrades the performance of classification models. Although the Synthetic Minority Over-sampling Technique (SMOTE) and its variants alleviate this issue by generating synthetic samples, they frequently overlook local density and distribution characteristics. Consequently, developing methods that incorporate local spatial information to synthesize samples that better preserve the original data distribution is critical for improving model robustness in class-imbalanced scenarios. To address this gap, we propose an enhanced SMOTE algorithm (ISMOTE), which modifies the spatial constraints for synthetic sample generation. Unlike SMOTE, the proposed method first generates a base sample between two original samples. Then the Euclidean distance between the two samples is multiplied by a random number to generate a random quantity. This random quantity is added or subtracted based on the distance between the base sample and the original samples, ensuring that new samples are generated around the two original samples. By adaptively expanding the synthetic sample generation space, ISMOTE effectively alleviates distortions in local data distribution and density. This study compared the ISMOTE algorithm with seven mainstream oversampling algorithms, using three classifiers on thirteen public datasets from the KEEL, UCI, and Kaggle databases. Comparative analysis of 2D and 3D scatter plots revealed that ISMOTE yields more realistic data distributions. Experimental results demonstrated relative improvements in classifier performance, with F1-score, G-mean, and AUC increasing by 13.07%, 16.55%, and 7.94%, respectively. Furthermore, ISMOTE’s parameter adaptability enables its application to multi-class imbalanced datasets.

Introduction

Class imbalance is a prevalent issue in real-world datasets, characterized by a significant disparity in the number of samples across different categories. This phenomenon is ubiquitous across diverse domains, including medical disease diagnosis^1,2, financial fraud detection^3,4, software defect prediction^5,6, and hardware fault detection^7,8. In binary classification tasks, the imbalance ratio (IR), defined as the ratio of majority-class to minority-class samples, can range from tens to hundreds in practical applications, presenting substantial challenges for conventional machine learning algorithms⁹. Traditional machine learning models trained on imbalanced datasets often exhibit a bias toward the majority class¹⁰, resulting in the misclassification of minority-class samples. Such errors can have critical real-world implications. For instance, in medical diagnostics, where cancer patients typically represent the minority class, a false negative (i.e., misclassifying a cancer patient as healthy) may delay life-saving interventions, with irreversible consequences. Thus, improving minority-class recognition accuracy is essential, underscoring the importance of research on class-imbalance mitigation.

Various techniques are currently available to address class imbalance^11,12. These techniques are categorized into data-level and algorithm-level¹³. At the data level, methods include undersampling the majority class^14,15, oversampling the minority class^16,17, and hybrid sampling, which combines undersampling of the majority class with oversampling of the minority class^18,19,20 to achieve class balance. At the algorithm level, machine learning algorithms are optimized to adapt to imbalanced data. The main methods include cost-sensitive learning^21,22, threshold moving strategies^23,24, ensemble learning^25,26 and neural networks^27,28. Compared with algorithm-level modifications, data-level techniques offer greater implementation flexibility and model independence, contributing to their wider practical adoption.

The Synthetic Minority Oversampling Technique (SMOTE) is one of the most common oversampling methods. Compared to other oversampling methods, SMOTE generates more diverse synthetic samples, which helps improve the generalization ability of models. Additionally, the generation logic of SMOTE is relatively simple and easy to implement, and it performs well in addressing class imbalance across various fields. Especially in small and medium-sized datasets, SMOTE is often more effective than other methods. Relevant studies^29,30 indicate that SMOTE is one of the most widely used data imbalance handling techniques among oversampling methods. However, SMOTE’s linear interpolation mechanism for synthetic sample generation presents two inherent limitations: (1) in high-density regions of minority class samples, excessive synthetic instances may be produced, potentially inducing model overfitting; and (2) the linearly interpolated samples may deviate from the underlying data distribution. To address these issues, this study examines SMOTE’s sample generation mechanism and proposes an Improved SMOTE (ISMOTE) algorithm. ISMOTE extends the feasible solution space for synthetic sample generation, effectively mitigating both the density over-amplification problem and distributional distortion inherent in conventional SMOTE.

The remainder of this paper is structured as follows. Section 2 presents a comprehensive review of existing oversampling techniques. Section 3 details the fundamental principles and implementation procedures of the standard SMOTE algorithm, followed by a thorough presentation of our proposed Improved SMOTE (ISMOTE) algorithm, including its conceptual framework, algorithmic workflow, implementation steps, and formal pseudocode. Section 4 describes the experimental setup, including: (1) the thirteen benchmark datasets employed, (2) the evaluation metrics for classification performance assessment, (3) the experimental design methodology, and (4) a comprehensive analysis of the empirical results. Finally, Section 5 concludes the study with key findings and contributions.

The main contributions of this paper are summarized as follows:

Expansion of sample generation space: ISMOTE modifies the sample generation conditions of the SMOTE algorithm, significantly expanding the space for generating new samples. Unlike SMOTE, which generates samples solely through linear interpolation, ISMOTE allows new samples to be generated not only between existing samples but also around them. This approach effectively mitigates the issue of generating excessive samples in high-density regions, thereby reducing the risk of overfitting and improving the quality of the generated data.

Improved sample distribution: ISMOTE introduces random quantities to dynamically adjust the positions of new samples. This ensures that the generated samples better align with the underlying distribution patterns of the original data. By enhancing the diversity and representativeness of the synthetic samples, ISMOTE improves the generalization capability of classifiers, particularly in imbalanced data scenarios.

Experimental validation: Extensive experiments were conducted on multiple public datasets to validate the effectiveness of ISMOTE. The results demonstrate that ISMOTE consistently outperforms existing mainstream oversampling algorithms.

Related work

Oversampling techniques are widely used to address class imbalance by increasing the number of minority class samples, thereby improving classifier performance. These techniques can be categorized based on the data types they handle, including numerical data, categorical data, and image data³¹. This study focuses on numerical data, as it is prevalent in many real-world applications such as medical diagnosis, fraud detection, and software defect prediction. Compared to undersampling, oversampling is often preferred because it retains all original samples and thus avoids information loss^32,33.

Batista et al.³⁴ proposed Random Oversampling (ROS), which duplicates minority class samples randomly to balance the dataset. While ROS is simple to implement, it often leads to overfitting, as the repeated samples do not introduce new information and may amplify noise in the data. Despite its limitations, ROS has been shown to outperform undersampling in certain scenarios when evaluated using metrics such as AUC.

To overcome the limitations of ROS, Chawla et al.³⁵ proposed SMOTE, which generates synthetic minority samples via interpolation between existing minority samples and their k-nearest neighbors. By generating diverse synthetic samples rather than merely replicating existing ones, SMOTE mitigates overfitting to a certain degree. However, SMOTE has drawbacks. First, it can generate synthetic samples within the overlapping regions between classes, which introduces noisy data and degrades classifier performance. Second, in areas where minority samples are densely concentrated, SMOTE tends to generate an excessive number of synthetic samples, which exacerbates overfitting. Third, SMOTE confines new samples to the line segments between existing samples, so the generated samples may deviate from the actual data distribution.

He et al.³⁶ proposed the adaptive synthetic sampling algorithm (ADASYN), which uses a weighted distribution based on the difficulty of learning different minority class samples. More synthetic data is generated for harder-to-learn minority samples. This method reduces the bias caused by class imbalance and adaptively shifts the classification decision boundary towards the difficult samples. However, it faces challenges in sampling boundary samples and does not handle noisy data effectively.

In addition to the classic oversampling algorithms mentioned above, several improved algorithms based on SMOTE have been developed. Han et al.³⁷ proposed the Borderline-SMOTE method, which effectively avoids oversampling noisy samples and reduces the generation of noisy samples by selectively oversampling boundary minority class samples. However, the voting selection strategy of this method involves a large number of high-density minority class samples in oversampling, without considering the problem of excessive generation of minority class samples and the distribution pattern of new samples. Douzas et al.³⁸ proposed a simple and effective oversampling method based on K-means clustering and SMOTE, which avoids noise generation and effectively overcomes both inter-class and intra-class imbalances. Intra-class imbalance refers to the uneven distribution of samples within the same class. However, K-Means SMOTE may increase classification errors for minority samples due to its reliance on clustering, which can be sensitive to data sparsity and noise. Douzas et al.³⁹ proposed Geometric SMOTE (G-SMOTE) as an enhancement to the SMOTE data generation mechanism. G-SMOTE generates synthetic samples within a geometric region in the input space, centered around each selected minority instance. While the default configuration defines this region as a hypersphere, G-SMOTE allows it to deform into a hyperellipsoid. G-SMOTE effectively parameterizes the data generation process and adapts to the specific characteristics of each imbalanced dataset. However, if the minority class samples contain noise or outliers, G-SMOTE may generate unrealistic synthetic samples around them, which can degrade classification performance.

Kunakorntum et al.⁴⁰ proposed a novel oversampling technique, Synthetic Minority based on Probability Distribution (SyMProD), to handle skewed datasets. This technique standardizes the data using Z-scores and removes noisy data. The proposed method then selects minority samples based on the probability distributions of the two classes. Synthetic instances are generated from the selected points and several nearest neighbors of the minority class. Wang et al.⁴¹ propose a new deep learning (DL) based data balancing technique using an Auxiliary-guided Conditional Variational Autoencoder (ACVAE) trained with contrastive learning. Additionally, Wang et al. investigate an ensemble method where ACVAE generates synthetic positive samples, followed by a data undersampling technique.

In the context of ensuring data privacy, Du et al.⁴² propose a secure privacy-preserving SMOTE (SP2-SMOTE) sampling method. It extends traditional SMOTE by allowing parties to independently generate synthetic samples without exposing the data, while effectively preventing unauthorized label inference through minority-class nearest neighbor interpolation. The SMOTE algorithm and its variants can be combined with ensemble learning⁴³ or machine learning algorithms⁴⁴ to solve data imbalance problems in specific fields. Imani et al.⁴⁵ also conducted a comprehensive analysis of the performance of Random Forest and XGBoost using SMOTE, ADASYN, and GNUS at different levels of imbalance.

Although the methods above have improved classification performance on imbalanced datasets, they share several limitations. First, regarding spatial distribution, existing methods often struggle to generate synthetic samples that follow the true distribution of the minority class, especially on complex or irregular data manifolds. Second, regarding density control, many methods generate too many synthetic samples in high-density regions, which leads to overfitting and weaker classifier generalization. Third, regarding noise handling, generating synthetic samples in overlapping or noisy regions remains difficult and can undermine classifier performance. To address these limitations, we propose ISMOTE, an enhanced SMOTE algorithm that extends the spatial scope for synthetic sample generation. Unlike existing methods, ISMOTE adaptively adjusts the positions of new samples according to local density and distribution characteristics. This ensures that synthetic samples are distributed around original samples instead of being restricted to linear paths, maintain the natural density gradients of the original minority class, and avoid over-saturation in high-density regions. By addressing these challenges, ISMOTE improves the robustness of classifiers in imbalanced data scenarios, particularly in applications such as early disease detection or fraud prevention, where high precision for minority classes is imperative.

An improved SMOTE algorithm

This section first outlines the concept, implementation steps, and limitations of the SMOTE algorithm. Subsequently, the ISMOTE algorithm proposed in this study is introduced, detailing its concept, flowchart, implementation steps, and pseudocode.

The SMOTE algorithm

SMOTE is one of the most well-established oversampling algorithms, effectively mitigating the overfitting issue caused by random oversampling and improving the generalization capability of models. The basic principle of SMOTE is to balance the dataset by inserting synthetic samples between minority class samples. It synthesizes new minority class samples through linear interpolation with the k-nearest neighbor algorithm. Specifically, it randomly selects a minority class sample and one of its k-nearest neighbors, then generates a new minority class sample by performing linear interpolation between the two. The steps of the algorithm are as follows.

1) Determine the number of minority class samples to be synthesized, denoted as n_samples.

2) For each selected minority class sample X_q, compute its Euclidean distance to all other minority class samples. Select the k nearest neighbors of X_q, where k = 5 by default.

3) Randomly select a sample X_j from the k nearest neighbor samples and synthesize a new minority class sample X_new with X_q using (1).

$${X_{new}}={X_q}+\alpha \times \left( {{X_j} - {X_q}} \right)$$

(1)

where α ∈ (0,1) is a random number generated for each new sample.

4) If the number of synthesized samples reaches the required amount, the algorithm terminates. Otherwise, repeat from step 2.

As shown in Fig. 1, SMOTE randomly selects a sample X_q from the minority class (X_q is called the seed sample) and uses the k-nearest neighbor algorithm to find its five nearest minority class samples. It then randomly selects one neighbor sample, X_j, from these five. The new synthetic sample, X_new, is generated through linear interpolation, as described in (1), and therefore lies on the line segment connecting the two minority class samples.

In the SMOTE algorithm, the position of the newly generated sample is constrained to the space between the two original minority class samples and is influenced by their positions. As more new samples are generated, the density of minority class samples increases. Therefore, the SMOTE algorithm may lead to the distribution of oversampled minority class samples not conforming to the original distribution pattern of minority class samples.
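The SMOTE steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `smote_sketch` and its parameters are hypothetical, the interpolation uses the standard form X_q + α(X_j − X_q), and `rng.random()` draws α from [0, 1) rather than the open interval (0, 1).

```python
import numpy as np

def smote_sketch(X_min, n_samples, k=5, seed=0):
    """Generate n_samples synthetic points by interpolating between a seed
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other samples
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_samples, X_min.shape[1]))
    for i in range(n_samples):
        q = rng.integers(n)                 # index of the seed sample X_q
        j = neighbours[q, rng.integers(k)]  # index of a random neighbour X_j
        alpha = rng.random()                # interpolation weight in [0, 1)
        synthetic[i] = X_min[q] + alpha * (X_min[j] - X_min[q])
    return synthetic

X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 0.5], [2., 2.]])
S = smote_sketch(X, 20, k=3, seed=1)
```

Because every synthetic point is a convex combination of two existing points, all rows of `S` stay inside the bounding box of `X`, which is the spatial constraint discussed above.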

The ISMOTE algorithm

The ISMOTE algorithm idea

The SMOTE algorithm generates new minority class samples through linear interpolation between existing minority class samples. While this approach mitigates overfitting to some extent, it confines new samples to linear paths between existing samples, which may not accurately reflect the true distribution of the data. To address this limitation, ISMOTE extends the spatial scope for generating new samples by adaptively adjusting their positions based on local density and distribution characteristics. To expand this space, we change the formula that SMOTE uses to generate new samples. First, the Euclidean distance between the selected minority class sample and its k-nearest neighbor is calculated and multiplied by a random number between 0 and 1 to generate a random quantity. Second, a base sample is generated by linear interpolation between the two samples. Third, the base sample is shifted: when it lies closer to the k-nearest neighbor sample, the random quantity is subtracted to generate a new sample near the original sample; when it lies closer to the original minority sample, the random quantity is added to generate a new sample near the neighbor.

As shown in Fig. 2, (a) and (c) illustrate the positions of new samples generated by the SMOTE algorithm, and (b) and (d) show the positions of new samples generated by the ISMOTE algorithm. According to the SMOTE algorithm, a base sample X_new1 is generated, which is located between the two selected minority class samples. Then, based on the distance of the base sample X_new1 from the two original samples, the new sample X_new generation position is adjusted. If the generated base sample X_new1 is far from the seed sample X_q, a random quantity is subtracted to position the new sample X_new around the seed sample X_q, as shown in Fig. 2(b). If the base sample X_new1 is close to the seed sample X_q, a random quantity is added to position the new sample X_new around the selected neighboring sample X_j, as shown in Fig. 2(d). The new samples X_new generated by the algorithm are randomly distributed between two samples and around single samples, which increases the space for generating new samples and makes the distribution of new samples more consistent with the distribution pattern of the original samples.

The formula for generating new sample positions is revised. First, the Euclidean distance between the seed sample X_q and its randomly selected neighboring sample X_j is calculated, and this distance is multiplied by a random number between 0 and 1 to generate a random quantity. Second, a base sample X_new1 is generated based on (2). Finally, this study proposes (3) to generate a new sample. According to the distance between X_new1 and the seed sample X_q, a random quantity is added to or subtracted from the position of X_new1 to generate a new sample X_new. If X_new1 is far from X_q, the random quantity is subtracted so that X_new is located around X_q. If X_new1 is close to X_q, the random quantity is added so that X_new is located around the neighboring sample X_j. The revised formula ensures that the generated samples are located around the two selected samples. Here, α ∈ (0,1) and β ∈ (0,1) are randomly generated during each operation.

$${X_{new1}}={X_q}+\alpha \times \left( {{X_j} - {X_q}} \right)$$

(2)

$${X_{new}}=\begin{cases} {X_q}+\alpha \times \left( {{X_j} - {X_q}} \right) - \beta \times \left( {{X_j} - {X_q}} \right), & \text{if } distance({X_{new1}},{X_q}) \geqslant 0.5 \times distance({X_q},{X_j}) \\ {X_q}+\alpha \times \left( {{X_j} - {X_q}} \right)+\beta \times \left( {{X_j} - {X_q}} \right), & \text{if } distance({X_{new1}},{X_q})<0.5 \times distance({X_q},{X_j}) \end{cases}$$

(3)
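As a worked reading of (2) and (3), assume the difference vector points from X_q toward X_j, which is what interpolation between the two samples requires, and write the new sample as X_new = X_q + γ × (X_j − X_q). The symbol γ is introduced here only for illustration. Since distance(X_new1, X_q) = α × distance(X_q, X_j), the branch condition in (3) reduces to a test on α alone:

$$\gamma=\begin{cases} \alpha - \beta \in (-0.5,\,1), & \text{if } \alpha \geqslant 0.5 \\ \alpha+\beta \in (0,\,1.5), & \text{if } \alpha<0.5 \end{cases}$$

Under this reading, SMOTE's range γ ∈ (0,1) widens to γ ∈ (−0.5, 1.5): the generated points remain on the line through X_q and X_j, but may fall up to half a segment length beyond either endpoint.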

The flowchart, steps and pseudo-code of the ISMOTE algorithm

The flowchart of the ISMOTE algorithm is shown in Fig. 3. According to the flowchart, the steps of the ISMOTE algorithm are as follows.

1) Determine the number of minority class samples to be synthesized, denoted as n_samples.

2) For each selected minority class sample X_q (a 1×N vector, where N is the number of features), calculate its Euclidean distance to all other minority class samples. Select the k nearest neighbors of X_q, with k set to 5 by default.

3) From the k nearest neighbors, randomly select a sample X_j (also a 1×N vector). Generate a base sample X_new1 between X_q and X_j according to (2). Then generate the new sample X_new by adjusting this position according to (3). If the number of new samples reaches the required amount, the algorithm terminates. Otherwise, repeat from step 2.

As shown in Fig. 4, the ISMOTE algorithm first randomly selects a sample X_q (the seed sample) from the minority class samples. Using the k-nearest neighbors algorithm, the five minority class samples nearest to X_q are identified, and one of them, denoted as X_j, is selected at random. A new synthetic minority class sample X_new is then generated using (2) and (3) proposed in this study. Compared to the SMOTE algorithm, the ISMOTE algorithm not only retains the region between the two original samples, but also extends into the space surrounding each individual sample. This makes the overall distribution of the new samples more consistent with the original sample distribution pattern, improves data quality, and helps improve model performance.

The expansion of the generation space for new samples helps alleviate the problem of high sample density to some extent. The balanced dataset can be used for model training, enabling the model to better learn the characteristics of minority class samples and improve classification performance. The ISMOTE algorithm can also be applied to multi-class classification tasks by adjusting the proportion of synthetic samples. The pseudocode is as follows.
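As an illustration only, and not the authors' listing, the ISMOTE steps can be sketched in Python as follows. The function name `ismote_sketch` and its parameters are hypothetical. The sketch assumes the difference vector in (2) and (3) points from X_q toward X_j, so that the base sample lies between the two samples, and it draws α and β from [0, 1) with `rng.random()`.

```python
import numpy as np

def ismote_sketch(X_min, n_samples, k=5, seed=0):
    """Sketch of ISMOTE-style generation: interpolate a base sample, then
    shift it back toward X_q or forward toward X_j by a random amount."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # k nearest minority-class neighbours by Euclidean distance.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_samples, X_min.shape[1]))
    for i in range(n_samples):
        q = rng.integers(n)                 # seed sample X_q
        j = neighbours[q, rng.integers(k)]  # random neighbour X_j
        diff = X_min[j] - X_min[q]          # assumed direction: X_q -> X_j
        alpha, beta = rng.random(), rng.random()
        base = X_min[q] + alpha * diff      # base sample X_new1, Eq. (2)
        # Eq. (3): compare distance(X_new1, X_q) with half of distance(X_q, X_j).
        if np.linalg.norm(base - X_min[q]) >= 0.5 * np.linalg.norm(diff):
            synthetic[i] = base - beta * diff  # pull back toward X_q
        else:
            synthetic[i] = base + beta * diff  # push on toward X_j
    return synthetic

# Two minority samples on the x-axis, so every neighbour pair is the same segment.
pair = np.array([[0.0, 0.0], [2.0, 0.0]])
S = ismote_sketch(pair, 200, k=1, seed=0)
```

With only the two points (0, 0) and (2, 0), the generated x-coordinates fall in the interval from −1 to 3 rather than from 0 to 2, which shows the expanded generation space on a single segment.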

Experiments and analysis

This section introduces the datasets and classifiers used in the research, as well as the metrics used to evaluate the experimental results. It details the experimental design, demonstrates and analyzes the visualization effects of datasets processed by seven oversampling algorithms, and compares and analyzes the results of three classifiers. The code for the ISMOTE algorithm and the dataset used in the experiment are available at https://github.com/Sunshine6828. 6/Improved-SMOTE-algorithm

Experimental datasets

Based on studies^20,48,52 and the requirements of this research, we selected thirteen classification datasets from three widely-used public databases for our experiments: the KEEL database, the UCI database⁴⁶, and the Kaggle competition platform. These databases serve as fundamental resources in data science and machine learning, providing extensive datasets for algorithm testing and validation across various domains and applications. The UCI database holds significant influence in machine learning research, while the KEEL database specializes in data mining and machine learning applications. Datasets from Kaggle competitions are particularly valued for their exceptional data quality and strong relevance to real-world business scenarios. The IR values of the thirteen selected datasets range from 1.25 to 29.17. Detailed information about the datasets is provided in Table 1. The features in the datasets are numerical. For non-numeric data, preprocessing is required to convert it into a numeric form before applying the ISMOTE algorithm. Techniques such as label encoding and one-hot encoding can convert non-numeric data into numeric form. Some multi-class datasets can be converted into binary classification datasets, resulting in multiple binary datasets with the same sample size but different minority class sizes and IR values. All datasets are processed as binary classification datasets. The method for calculating IR is given in (4).

$$IR=\frac{{{N_{maj}}}}{{{N_{min}}}}$$

(4)

Here, N_maj represents the number of majority class samples, and N_min represents the number of minority class samples.
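To make (4) and the multi-class conversion concrete, the following sketch relabels a multi-class label vector as binary (one class against the rest) and computes the IR of each resulting dataset. The helper names `imbalance_ratio` and `one_vs_rest` and the toy label counts are assumptions made for illustration and do not come from the paper's datasets.

```python
import numpy as np

def imbalance_ratio(y):
    """IR = N_maj / N_min for a binary label vector, following Eq. (4)."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return counts.max() / counts.min()

def one_vs_rest(y, positive_class):
    """Relabel a multi-class vector as binary: 1 for positive_class, else 0."""
    return (np.asarray(y) == positive_class).astype(int)

# Toy labels: 50 samples of class 0, 30 of class 1, 20 of class 2.
y_multi = np.array([0] * 50 + [1] * 30 + [2] * 20)
ir_by_class = {c: imbalance_ratio(one_vs_rest(y_multi, c)) for c in (0, 1, 2)}
```

Each choice of positive class yields a binary dataset with the same 100 samples but a different minority size; for example, treating class 2 as the minority gives 80 majority and 20 minority samples, so IR = 4.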

Table 1 The datasets information.
