Abstract
Software defect prediction is a highly active research area because it lets teams focus testing effort on defective modules and reduce the cost of development. The imbalanced nature of defect data, however, threatens the performance of software defect predictors. This study proposes a novel generative oversampling-based software defect prediction approach named GeNSDP. It oversamples defect data by generating synthetic minority instances with a lightweight generative model; the oversampled data is then used for defect prediction with a deep network. NASA and PROMISE datasets are used for experimentation, on which GeNSDP achieves an average Area Under the Curve (AUC) of 99.1% and an average F-measure of 0.92. The proposed model outperforms the traditional oversampling methods (including ROS, SMOTE, and COSTE) by 30.1%, and selected baseline models by 14.1%. Statistical evidence from an ANOVA test with a Bonferroni post-hoc test at the 95% confidence level supports the conclusion that the proposed model handles class imbalance effectively and achieves stable defect prediction.
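To make the class imbalance problem concrete, the following is a minimal sketch of random oversampling (ROS), one of the traditional baselines named above. It is not the GeNSDP generative model, whose details are not given in this section; ROS simply duplicates existing minority rows, whereas GeNSDP generates synthetic ones. The function names, the toy metric values, and the label encoding (1 = buggy, 0 = clean) are assumptions made for illustration.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return |D_clean| / |D_buggy|, assuming 1 = buggy and 0 = clean."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no buggy instances present")
    return counts[0] / counts[1]


def random_oversample(features, labels, seed=0):
    """ROS baseline: duplicate randomly chosen buggy rows until both
    classes have the same number of instances."""
    rng = random.Random(seed)
    buggy = [i for i, y in enumerate(labels) if y == 1]
    clean = [i for i, y in enumerate(labels) if y == 0]
    if not buggy:
        raise ValueError("no buggy instances to oversample")
    # Number of extra minority copies needed to reach a 1:1 ratio.
    shortfall = max(0, len(clean) - len(buggy))
    extra = [rng.choice(buggy) for _ in range(shortfall)]
    keep = list(range(len(labels))) + extra
    return [features[i] for i in keep], [labels[i] for i in keep]


# Hypothetical module metrics: [lines of code, cyclomatic complexity].
X = [[120, 4], [300, 9], [45, 2], [80, 3], [500, 15], [60, 2]]
y = [0, 0, 0, 0, 1, 0]

X_bal, y_bal = random_oversample(X, y)
print(imbalance_ratio(y), imbalance_ratio(y_bal))  # prints: 5.0 1.0
```

Because ROS only repeats existing buggy rows, the balanced set contains no new information about the minority class, which is the limitation that generative oversampling aims to address.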
Data availability
The data used in this paper are available at https://github.com/feiwww/PROMISE-backup and https://github.com/ApoorvaKrisna/NASA-promise-dataset-repository. Further enquiries about data availability should be directed to the author (Somya R. Goyal, Manipal University Jaipur, at somyagoyal1988@gmail.com or somya.goyal@jaipur.manipal.edu).
Code availability
Abbreviations
- GeNSDP: Generative oversampling for software defect prediction
- SDP: Software defect prediction
- GAN: Generative adversarial network
- ML: Machine learning
- CIL: Class imbalance learning
- DL: Deep learning
- RQ: Research question
- AUC: Area under the curve
- EBSE: Evidence based software engineering
- \(\left|{\text{D}}_{\text{clean}}\right|\): Data-points labelled as ‘Clean’
- \(\left|{\text{D}}_{\text{buggy}}\right|\): Data-points labelled as ‘Buggy’
- ROS: Random over sampling
- SMOTE: Synthetic Minority Oversampling TEchnique
- COSTE: Complexity based Oversampling TEchnique
- ROC: Receiver operating characteristic
- SOTA: State of the art
Funding
Open access funding provided by Manipal University Jaipur.
Author information
Contributions
SRG: Conceptualization, Methodology, Software, Data curation, Writing—Original draft preparation, Visualization, Investigation, Supervision, Validation, Writing—Reviewing and Editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Goyal, S.R. A novel generative oversampling for software defect prediction. Sci Rep (2026). https://doi.org/10.1038/s41598-026-41981-7