Table 1 Details of the datasets and scenarios.
Dataset name | Description | Features (N) | Samples (M) | Class distribution |
|---|---|---|---|---|
Dataset 1 | NASA’s Metrics Data Program (MDP) dataset [51, 52] comprises software metrics and defect data from NASA projects, with attributes such as lines of code and bug counts. It is used to study the relationship between software metrics and defects and to refine bug prediction models. | 50 k | 1000 k | Class 1: 60%, Class 2: 40% |
Dataset 2 | The Eclipse Bug Dataset [53] is extracted from the open-source Eclipse projects. It contains bug reports, comments, and status transitions, which capture the lifecycle of software defects and support the study of bug resolution dynamics in collaborative development environments. | 75 k | 1500 k | Class 1: 70%, Class 2: 30% |
Dataset 3 | The Mozilla Bugzilla Dataset is derived from the Mozilla project’s Bugzilla bug tracking system [54]. It comprises bug descriptions, comments, timestamps, and severity ratings, giving a fine-grained view of the bug resolution process and of patterns in bug discussions. | 100 k | 2000 k | Class 1: 55%, Class 2: 45% |
Dataset 4 | The Apache JIRA Dataset [55] contains bug reports and issue tracking data from Apache Software Foundation projects, including issue descriptions, comments, and timestamps. It traces each bug from identification to resolution and is used to build predictive models suited to open-source collaboration. | 60 k | 1200 k | Class 1: 45%, Class 2: 55% |
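
The class distributions in Table 1 are given as percentages, so the absolute class sizes and the degree of imbalance have to be derived from the sample counts. The following is a minimal sketch of that arithmetic, assuming that "k" denotes thousands and that the percentages apply to the full sample count; the names `DATASETS`, `class_counts`, and `imbalance_ratio` are illustrative and do not come from the original study.

```python
# Values transcribed from Table 1. Reading "k" as thousands is an assumption.
DATASETS = {
    "Dataset 1": {"features": 50_000, "samples": 1_000_000, "class1_pct": 60},
    "Dataset 2": {"features": 75_000, "samples": 1_500_000, "class1_pct": 70},
    "Dataset 3": {"features": 100_000, "samples": 2_000_000, "class1_pct": 55},
    "Dataset 4": {"features": 60_000, "samples": 1_200_000, "class1_pct": 45},
}


def class_counts(samples: int, class1_pct: int) -> tuple[int, int]:
    """Return (class 1 count, class 2 count) for a two-class dataset."""
    class1 = samples * class1_pct // 100  # integer arithmetic avoids float rounding
    return class1, samples - class1


def imbalance_ratio(samples: int, class1_pct: int) -> float:
    """Return the majority-to-minority class ratio (1.0 means balanced)."""
    class1, class2 = class_counts(samples, class1_pct)
    return max(class1, class2) / min(class1, class2)


for name, d in DATASETS.items():
    c1, c2 = class_counts(d["samples"], d["class1_pct"])
    ratio = imbalance_ratio(d["samples"], d["class1_pct"])
    print(f"{name}: class 1 = {c1:,}, class 2 = {c2:,}, imbalance = {ratio:.2f}")
```

Under these assumptions, Dataset 2 is the most imbalanced of the four (roughly 2.33 to 1), while Datasets 3 and 4 are the closest to balanced (roughly 1.22 to 1).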