Scientific Reports
A novel generative oversampling for software defect prediction
  • Article
  • Open access
  • Published: 03 April 2026

  • Somya R. Goyal

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Engineering
  • Mathematics and computing

Abstract

Software defect prediction (SDP) is an active research area because it lets teams focus testing effort on the modules most likely to be defective and thereby reduce development cost. The imbalanced nature of defect data, in which defective modules are typically far fewer than clean ones, degrades the performance of software defect predictors. This study proposes GeNSDP, a novel generative-oversampling-based approach to software defect prediction. It oversamples the defect data by generating synthetic minority instances with a lightweight generative model; the oversampled data are then used to train a deep network for defect prediction. In experiments on the NASA and PROMISE datasets, GeNSDP achieves an average Area Under the Curve of 99.1% and an average F-measure of 0.92. The proposed model outperforms traditional oversampling methods (including ROS, SMOTE, and COSTE) by 30.1% and selected baseline models by 14.1%. Statistical evidence from an ANOVA test with a Bonferroni post-hoc test at the 95% confidence level indicates that the proposed model handles class imbalance effectively and achieves stable defect prediction.
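The oversampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the lightweight generative model, so the generator below is a stand-in: it interpolates between random pairs of minority instances (in the spirit of SMOTE, but without the nearest-neighbour search) and is not the GeNSDP generator. The function names `interpolate_oversample` and `balance`, the label convention (1 for buggy, 0 for clean), and the toy feature values are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def interpolate_oversample(
    minority: Sequence[Vector], n_new: int, rng: random.Random
) -> List[Vector]:
    """Create n_new synthetic points on segments between random minority pairs.

    Stand-in generator (SMOTE-like, no nearest-neighbour search);
    NOT the generative model used by GeNSDP.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(list(minority), 2)  # two distinct minority points
        t = rng.random()                      # position along the segment a -> b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


def balance(
    features: Sequence[Vector], labels: Sequence[int], seed: int = 0
) -> Tuple[List[Vector], List[int]]:
    """Oversample the buggy class (label 1) until it matches the clean class (label 0)."""
    rng = random.Random(seed)
    buggy = [x for x, y in zip(features, labels) if y == 1]
    n_clean = sum(1 for y in labels if y == 0)
    n_new = max(0, n_clean - len(buggy))
    synthetic = interpolate_oversample(buggy, n_new, rng) if n_new else []
    # Original data first, synthetic buggy instances appended at the end.
    return list(features) + synthetic, list(labels) + [1] * len(synthetic)


# Toy data (hypothetical): 6 clean modules, 2 buggy modules, two metrics each.
X = [[1.0, 2.0], [1.5, 1.8], [0.9, 2.2], [1.2, 2.1], [1.1, 1.9], [1.4, 2.3],
     [5.0, 8.0], [6.0, 9.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = balance(X, y)
print(y_bal.count(0), y_bal.count(1))  # prints "6 6"
```

In a pipeline like the one the abstract outlines, the balanced `X_bal`, `y_bal` would then be passed to the classifier; oversampling is normally applied to the training split only, so that synthetic instances do not leak into evaluation.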

Data availability

The data used in this paper are available at https://github.com/feiwww/PROMISE-backup (PROMISE) and https://github.com/ApoorvaKrisna/NASA-promise-dataset-repository (NASA). Further enquiries about data availability should be directed to the author (Somya R. Goyal, Manipal University Jaipur, at somyagoyal1988@gmail.com or somya.goyal@jaipur.manipal.edu).

Code availability

The code is available at https://github.com/SRGoyal/GeNSDP.git

Abbreviations

GeNSDP: Generative oversampling for software defect prediction

SDP: Software defect prediction

GAN: Generative adversarial network

ML: Machine learning

CIL: Class imbalance learning

DL: Deep learning

RQ: Research question

AUC: Area under the curve

EBSE: Evidence-based software engineering

\(\left|{\text{D}}_{\text{clean}}\right|\): Data points labelled as ‘Clean’

\(\left|{\text{D}}_{\text{buggy}}\right|\): Data points labelled as ‘Buggy’

ROS: Random oversampling

SMOTE: Synthetic Minority Oversampling TEchnique

COSTE: Complexity-based Oversampling TEchnique

ROC: Receiver operating characteristic

SOTA: State of the art

Funding

Open access funding provided by Manipal University Jaipur.

Author information

Authors and Affiliations

  1. Manipal University Jaipur, Jaipur, Rajasthan, 303007, India

    Somya R. Goyal

Contributions

SRG: Conceptualization, Methodology, Software, Data curation, Writing – Original draft preparation, Visualization, Investigation, Supervision, Validation, Writing – Reviewing and Editing.

Corresponding author

Correspondence to Somya R. Goyal.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Goyal, S.R. A novel generative oversampling for software defect prediction. Sci Rep (2026). https://doi.org/10.1038/s41598-026-41981-7

  • Received: 13 April 2025

  • Accepted: 24 February 2026

  • Published: 03 April 2026

  • DOI: https://doi.org/10.1038/s41598-026-41981-7

Keywords

  • Software defect prediction (SDP)
  • Class imbalance
  • Sampling
  • Generative oversampling
  • Deep networks