Abstract
Class imbalance remains a critical challenge in machine learning: it often biases predictions toward the majority class, so minority-class instances are misclassified and overall model performance suffers. This study proposes an approach to class imbalance in Random Forests that combines ensemble pruning with resampling. Although pruning typically improves performance and reduces computational cost, its effectiveness can be limited in complex ensembles trained on imbalanced data. The proposed method therefore incorporates three resampling strategies: under-sampling the majority class, over-sampling the minority class, and a hybrid of the two. After the training data are balanced, multiple trees are grown from bootstrap samples, and only those with low out-of-bag (OOB) error rates are retained in the final ensemble. The classification performance of the proposed method is evaluated against standard algorithms, including k-Nearest Neighbors (k-NN), decision trees, Random Forest (RF), Balanced Random Forest (BRF), and Support Vector Machines (SVM). The results show that the proposed method outperforms its competitors in most cases.
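The pipeline described above — balance the training data, grow trees on bootstrap samples, then keep only trees with low out-of-bag error — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the under-sampling-only variant, the tree count, and the median-OOB cutoff are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic imbalanced data (~90% majority class) for illustration only.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

# Step 1: balance the training data (here, random under-sampling of the majority).
minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == 0), size=minority.size, replace=False)
idx = np.concatenate([minority, majority])
Xb, yb = X[idx], y[idx]

# Step 2: grow trees on bootstrap samples and record each tree's OOB error,
# i.e. its error on the training points left out of its bootstrap sample.
n_trees, n = 50, len(yb)
trees, oob_errors = [], []
for seed in range(n_trees):
    boot = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), boot)
    tree = DecisionTreeClassifier(random_state=seed).fit(Xb[boot], yb[boot])
    err = np.mean(tree.predict(Xb[oob]) != yb[oob]) if oob.size else 1.0
    trees.append(tree)
    oob_errors.append(err)

# Step 3: prune the ensemble -- retain only trees with below-median OOB error
# (the cutoff rule is an assumption for this sketch).
cutoff = np.median(oob_errors)
kept = [t for t, e in zip(trees, oob_errors) if e <= cutoff]

# Predict by majority vote over the retained trees.
votes = np.mean([t.predict(X) for t in kept], axis=0)
y_pred = (votes >= 0.5).astype(int)
print(f"kept {len(kept)}/{n_trees} trees")
```

The over-sampling and hybrid variants would differ only in Step 1 (replicating minority instances, or combining replication with under-sampling) before the same bootstrap-and-prune procedure.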
Data availability
All data generated or analyzed during this study are included in this published article and its supplementary information files.
Funding
This research is a joint collaboration between Arab Open University, AWKU Mardan, SBBWU Peshawar, and is funded by Arab Open University-Bahrain.
Author information
Authors and Affiliations
Contributions
N.F. and S.I. designed the study and developed the methodology. N.F. and M.H. performed the analysis and wrote the main manuscript text. D.M.K. and M.A. contributed to data preprocessing and experimental validation. S.J. applied the machine learning methods, prepared the figures and tables, and funded the APC. All authors reviewed and approved the final manuscript.
Corresponding author
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Faiz, N., Iftikhar, S., Jan, S. et al. Pruning tree forest and re-sampling for class imbalanced problem. Sci Rep (2026). https://doi.org/10.1038/s41598-026-38320-1
DOI: https://doi.org/10.1038/s41598-026-38320-1