  • Article
  • Open access
  • Published: 10 February 2026

Pruning tree forest and re-sampling for class imbalanced problem

  • Nosheen Faiz1,
  • Soofia Iftikhar2,
  • Salman Jan3,
  • Muhammad Hamraz1,
  • Muhammad Aamir1 &
  • Dost Muhammad Khan1 

Scientific Reports (2026)


We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Engineering
  • Mathematics and computing

Abstract

Class imbalance remains a critical challenge in machine learning, as it often leads to biased predictions in which algorithms disproportionately favor the majority class, resulting in misclassification of minority-class instances and reduced overall model performance. This study explores an approach to addressing class imbalance in Random Forests that combines pruning with resampling techniques. While pruning typically improves performance and reduces computational costs, its effectiveness can be limited in complex ensembles dealing with imbalanced data. To tackle this, the proposed method incorporates three resampling strategies: under-sampling the majority class, over-sampling the minority class, and a hybrid of both. After balancing the training data, multiple trees are grown from bootstrap samples, and only those with low out-of-bag error rates are selected for the final ensemble. The classification performance of the proposed method is evaluated and compared against standard algorithms including k-Nearest Neighbors (k-NN), Tree, Random Forest (RF), Balanced Random Forest (BRF), and Support Vector Machine (SVM). The results demonstrate that the proposed method outperforms its competitors in most cases.
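The two key steps the abstract describes — balancing the training data by resampling, then keeping only trees with low out-of-bag (OOB) error — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the midpoint target size for the hybrid strategy, and the `keep_fraction` cutoff are all illustrative assumptions, and tree growing/OOB estimation is abstracted away.

```python
import numpy as np

def hybrid_resample(X, y, seed=0):
    """Hybrid resampling sketch (assumption: imbalanced binary labels):
    under-sample the majority class and over-sample the minority class
    so both meet at the midpoint of the two original class sizes."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    target = (counts.min() + counts.max()) // 2  # illustrative midpoint target
    maj_idx = rng.choice(np.flatnonzero(y == majority), size=target, replace=False)
    min_idx = rng.choice(np.flatnonzero(y == minority), size=target, replace=True)
    idx = np.concatenate([maj_idx, min_idx])
    rng.shuffle(idx)
    return X[idx], y[idx]

def select_low_oob(trees, oob_errors, keep_fraction=0.5):
    """Ensemble pruning sketch: keep the fraction of trees with the
    lowest out-of-bag error (oob_errors computed elsewhere)."""
    order = np.argsort(oob_errors)
    n_keep = max(1, int(len(trees) * keep_fraction))
    return [trees[i] for i in order[:n_keep]]
```

For example, a 15-vs-5 training set is resampled to 10 instances per class, and a forest of four trees with OOB errors (0.3, 0.1, 0.4, 0.2) is pruned to the two trees with errors 0.1 and 0.2.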

Data availability

All data generated or analyzed during this study are included in this published article and its supplementary information files.


Funding

This research is a joint collaboration between Arab Open University, AWKU Mardan, SBBWU Peshawar, and is funded by Arab Open University-Bahrain.

Author information

Authors and Affiliations

  1. Department of Statistics, Abdul Wali Khan University, Mardan, 23200, Pakistan

    Nosheen Faiz, Muhammad Hamraz, Muhammad Aamir & Dost Muhammad Khan

  2. Department of Statistics, Shaheed Benazir Bhutto Women University, Peshawar, Pakistan

    Soofia Iftikhar

  3. Faculty of Computer Studies, Arab Open University-Bahrain, A’Ali, 18211, Bahrain

    Salman Jan


Contributions

N.F. and S.I. designed the study and developed the methodology. N.F. and M.H. performed the analysis and wrote the main manuscript text. D.M.K. and M.A. contributed to data preprocessing and experimental validation. S.J. applied machine learning methods, prepared the figures and tables, and funded the APC. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Salman Jan.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Faiz, N., Iftikhar, S., Jan, S. et al. Pruning tree forest and re-sampling for class imbalanced problem. Sci Rep (2026). https://doi.org/10.1038/s41598-026-38320-1

Download citation

  • Received: 06 November 2025

  • Accepted: 29 January 2026

  • Published: 10 February 2026

  • DOI: https://doi.org/10.1038/s41598-026-38320-1


Keywords

  • Bootstrap
  • Classification
  • Class imbalance
  • Pruning
  • Random forest
  • Over-sampling
  • Under-sampling
  • Trees