Introduction

Breast cancer is the leading cancer among women worldwide1,2 and the second leading cause of cancer death among women3,4. The early stages of many breast cancers often present no noticeable symptoms. Consequently, the extraction and analysis of pertinent information from the vast data pool for the scientific evaluation of breast cancer is both complex and time-intensive5,6. This complexity poses significant challenges for early diagnosis, affecting both treatment effectiveness and patient prognosis. Notably, accurate and early diagnosis substantially enhances the likelihood of patients receiving timely treatment, thereby reducing breast cancer mortality rates7,8.

Recently, many researchers have adopted diverse techniques for the early detection of breast cancer9, applying various machine learning algorithms to the WDBC dataset. Specifically, Tarek Khater et al.'s k-nearest neighbors model10 reached a remarkable 97.7% accuracy and 98.2% precision for breast cancer classification using WDBC data. Masri Ayob et al.11 successfully employed a Fast Learning Network (FLN), attaining an impressive 98.37% accuracy on the WBCD database. Further reinforcing these findings, Sheng Zhou et al.12, through extensive experimentation with various machine learning models on the same dataset, highlighted the superior performance of AdaBoost-Logistic, which exhibited commendable classification capability for both benign and malignant cases. Deepa Kumari et al.13 achieved 97% diagnostic accuracy by combining a hybrid multi-layer perceptron (MLP) with random forest (RF), as well as Xception (a type of convolutional neural network) with RF. Indu Chhillar et al.14 addressed class imbalance through the Synthetic Minority Over-sampling Technique-Edited Nearest Neighbor (SMOTEENN) and employed Boruta and Coefficient-Based Feature Selection (CBFS) for robust feature selection, ultimately proposing a soft voting ensemble model; their approach yielded an impressive 99.42% accuracy when utilizing the CBFS method. Vandana Rawat et al.15 employed several ML algorithms for classification and found that the Support Vector Machine delivered superior results; however, the model is not further explained, making it difficult to interpret. To address the widespread issue of imbalanced learning, a common challenge for standard machine learning algorithms16, T. R. Mahesh et al.6 implemented A-SMOTE for dataset balancing and achieved noteworthy outcomes. Nonetheless, A-SMOTE's occasional selection of unsuitable samples as synthetics introduces noise that impairs the classification capability of the model.

Feature selection is an essential step preceding classification tasks, particularly given the high dimensionality of biomedical datasets, which frequently contain irrelevant and redundant features17. In breast cancer research, Principal Component Analysis (PCA) has gained prominence as the preferred feature selection technique18,19. However, PCA synthesizes new components as linear combinations of the original features, potentially losing information from the initial dataset, and the newly formed components are often hard to interpret intuitively. Overall, challenges persist in dataset balancing, feature optimization, and model interpretability. The entire experimental process is shown in Fig. 1.

Fig. 1. Experimental procedure of breast cancer diagnosis.

Materials and methods

Dataset

In this study, we used the publicly accessible Wisconsin Diagnostic Breast Cancer (WDBC) dataset (https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic)20. The dataset comprises 569 samples, 357 benign and 212 malignant, with no missing values. Its features were extracted from digitized images of fine-needle aspirates (FNA) of breast masses and describe characteristics of the cell nuclei21.

Data preprocessing

Before employing machine learning (ML) for classification, the data were subjected to a series of preprocessing steps22. Initially, min-max normalization was applied to scale all feature values to the range between 0 and 1. Subsequently, the dataset was split into training and test sets at a ratio of 65:35. Thereafter, to mitigate class imbalance in the training set, the Borderline-SMOTE1 technique was applied23. The detailed process is shown in Fig. 2.
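A minimal sketch of this pipeline, assuming the scikit-learn copy of the WDBC data; the stratified split and the random seed are our assumptions, not stated in the paper:

```python
# Sketch of the preprocessing steps described above (assumptions noted inline).
from sklearn.datasets import load_breast_cancer        # WDBC ships with scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import BorderlineSMOTE     # from the 'imblearn' library

X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)                    # scale every feature to [0, 1]

# 65:35 train/test split; stratification and seed are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, stratify=y, random_state=42)

# Borderline-SMOTE1 (kind="borderline-1") balances the training set only.
sampler = BorderlineSMOTE(kind="borderline-1", random_state=42)
X_train, y_train = sampler.fit_resample(X_train, y_train)
```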

Fig. 2. Borderline-SMOTE1 algorithm.

Shapley additive explanations (SHAP)

SHAP is a technique used to explain the predictions made by machine learning models24. This method creates an interpretive framework by determining Shapley values, treating each feature as a "contributor": each feature in a given set of predictors is assigned a SHAP value that quantifies how much it contributes to the final prediction, whether it promotes or inhibits the target outcome, and how it interacts with the target variable25,26. The mean absolute SHAP values across features indicate their respective importance. The calculation formula is as follows:

$$\phi_i=\sum_{j \in F} \sum_{P \in S_j} \frac{w\left(\left|P\right|,j\right)}{L_j\binom{L_j-1}{\left|P\right|}}\left(p_o^{i,j}-p_z^{i,j}\right)\nu_j$$

(1)

where F is the set of leaf nodes, each leaf node containing a proportion of all possible feature subsets; \(\:{S}_{j}\:\) is the collection of feature subsets that appear at leaf node j; \(\:P\) is a subset in \(\:{S}_{j}\); \(\:{L}_{j}\) is the path length from the root node to leaf node j; \(\:w\left(\left|P\right|,j\right)\) is the proportion of subsets of size \(\:\left|P\right|\) at leaf node j; \(\:{p}_{o}^{i,j}\) and \(\:{p}_{z}^{i,j}\) are the proportions of subsets that include and exclude feature i, respectively; and \(\:{\nu\:}_{j}\) is the output value of leaf node j.
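In practice, Eq. (1) is evaluated by library code rather than by hand; a hedged sketch with the `shap` package's TreeExplainer (the random forest is illustrative, and `X_train`/`y_train` follow the preprocessing sketch above):

```python
# Sketch: mean-absolute SHAP values as per-feature importances.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

explainer = shap.TreeExplainer(rf)           # efficient Shapley values for trees
sv = explainer.shap_values(X_train)          # per-sample, per-feature contributions
sv = sv[1] if isinstance(sv, list) else sv   # older shap: one array per class
sv = sv[:, :, 1] if sv.ndim == 3 else sv     # newer shap: 3-D (sample, feature, class)
importance = np.abs(sv).mean(axis=0)         # mean |SHAP| ranks the features
```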

SHAP-RF-RFE

Recursive Feature Elimination (RFE) is an effective feature selection technique that systematically reduces feature set sizes via a recursive process27. In this study, we developed a unique algorithm, designated SHAP-RF-RFE, by integrating the Shapley additive explanation (SHAP) values with the Random Forest (RF) methodology within the RFE framework. This algorithm unfolds in a structured manner, as follows:

1. Initially, a Random Forest classifier is trained on the available dataset.

2. Subsequently, SHAP values are computed for each feature, quantifying its contribution to the prediction.

3. The feature with the smallest mean absolute SHAP value is then eliminated, signifying its minimal impact on the model's predictive accuracy.

Steps 1–3 are repeated on the reduced feature set until the desired number of features remains, yielding a nested feature subset of every size; a minimal sketch of this loop is shown below.
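This sketch reflects our reading of steps 1–3; the function and parameter names are illustrative, not the authors' code, and a pandas DataFrame input is assumed:

```python
# Hedged sketch of SHAP-RF-RFE: recursively drop the feature with the
# smallest mean-absolute SHAP value.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

def shap_rf_rfe(X, y, min_features=1, random_state=42):
    """Return nested feature subsets, from all features down to min_features."""
    remaining = list(X.columns)
    subsets = []
    while len(remaining) >= min_features:
        rf = RandomForestClassifier(random_state=random_state).fit(X[remaining], y)
        sv = shap.TreeExplainer(rf).shap_values(X[remaining])
        sv = sv[1] if isinstance(sv, list) else sv    # older shap: list per class
        sv = sv[:, :, 1] if sv.ndim == 3 else sv      # newer shap: 3-D array
        importance = np.abs(sv).mean(axis=0)          # mean |SHAP| per feature
        subsets.append(list(remaining))
        remaining.pop(int(np.argmin(importance)))     # step 3: eliminate weakest
    return subsets
```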

Machine learning models

The Random Forest (RF) algorithm is a sophisticated ensemble learning method. It comprises multiple distinct Decision Trees (DTs), each contributing to the final decision-making process. Unlike methods that depend on a single decision tree, RF aggregates the predictions of all trees and relies on the majority vote to form the final prediction28,29. In this framework, each decision tree node is split based on the Gini index, a measure of statistical dispersion.
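For reference, for a node whose samples fall into classes with proportions \(\:{p}_{k}\), the Gini index takes its standard form (this definition is added here for clarity and is not spelled out in the paper):

$$Gini = 1 - \sum_{k} p_k^2$$

A split is chosen to minimize the weighted Gini index of the resulting child nodes; a pure node, with all samples in one class, has a Gini index of 0.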

The Support Vector Machine (SVM) is a robust supervised learning model that is frequently employed to address classification and regression issues. Its fundamental premise is to identify an optimal hyperplane within the feature space that maximizes the distance between data points belonging to disparate categories, thereby facilitating effective classification18.

The logistic regression (LR) classification algorithm is a widely used tool in the field of machine learning. Its main goal is to predict the occurrence of an event by estimating probabilities, and it has the characteristics of easy implementation and strong interpretability of results30.

The K-Nearest Neighbor (KNN) algorithm is a fundamental and pervasive classification and regression technique. Its working principle is simple and intuitive, mainly relying on measuring the distance between different feature points to perform classification or regression31.

LightGBM represents an advanced iteration of the Gradient-Boosted Decision Tree (GBDT) framework32. LightGBM employs a histogram-based split-finding approach and a leaf-wise growth strategy, which accelerate training and reduce memory usage33,34. LightGBM retains data points with large gradients and downsamples the remaining points while preserving the essential characteristics of the data (a strategy known as Gradient-based One-Side Sampling, GOSS)35. Given the typically sparse nature of high-dimensional data, this sparsity enables a near-lossless method of feature-dimensionality reduction: in sparse feature spaces, many features are mutually exclusive and never take nonzero values simultaneously. LightGBM capitalizes on this by amalgamating such exclusive features into a single feature, in a process called Exclusive Feature Bundling (EFB).
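An illustrative configuration showing where these mechanisms surface in the LightGBM API; the values are placeholders, not the tuned hyperparameters reported in Table 2, and `X_train`/`y_train` come from the preprocessing sketch above:

```python
# Illustrative LightGBM setup; histogram binning and leaf-wise growth are
# controlled by max_bin and num_leaves, while EFB is applied automatically.
from lightgbm import LGBMClassifier

model = LGBMClassifier(
    num_leaves=31,       # leaf-wise growth: caps tree complexity per leaf
    max_bin=255,         # histogram-based split finding
    learning_rate=0.1,   # placeholder, not the PSO-tuned value
    n_estimators=100,
    # GOSS sampling is enabled with boosting_type="goss" in LightGBM 3.x
    # (data_sample_strategy="goss" in 4.x); the default here is plain GBDT.
)
model.fit(X_train, y_train)
```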

Hyperparameter optimization

Metaheuristic algorithms offer significant advantages for optimizing machine learning model parameters, particularly for large-scale, complex problems with no explicit gradient information. By mimicking natural search mechanisms, they perform effective global searches across extensive solution spaces and avoid the pitfalls of local optima. Among metaheuristics, Genetic Algorithms (GA) stand out for simulating biological evolution, using selection, crossover, and mutation to navigate the solution space and gradually refine candidate solutions toward optimality. Other notable members include Particle Swarm Optimization (PSO), Differential Evolution (DE), Artificial Bee Colony (ABC), the Firefly Algorithm (FA), the Coati Optimization Algorithm, and various hybrid intelligent algorithms36,37,38,39, which have found widespread application in domains such as healthcare, engineering, mathematics, and science40. This work adopts PSO for hyperparameter optimization because of its merits: minimal parameter tuning requirements, high computational efficiency, robust performance, and ease of implementation. PSO is a swarm intelligence optimization technique introduced by Kennedy and Eberhart41, inspired by the flocking behavior of birds. Its core principle is to leverage collaboration and information sharing among particles to reach optimal solutions. Fundamentally, PSO simulates the movement of a swarm of particles in the search space, continuously updating their positions and velocities until converging to the global optimum. Each particle maintains a position and velocity vector, and through iterative adjustment of these vectors the swarm collectively identifies the best solution to the problem at hand42,43.
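A hedged sketch of PSO-based hyperparameter tuning with the scikit-opt package listed in the paper's environment; the search space, swarm size, and iteration budget are illustrative assumptions, and `X_train`/`y_train` follow the preprocessing sketch:

```python
# Sketch: PSO minimizes the negative cross-validated accuracy of LightGBM.
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
from sko.PSO import PSO                      # from the 'scikit-opt' package

def objective(params):
    learning_rate, num_leaves = params
    clf = LGBMClassifier(learning_rate=learning_rate,
                         num_leaves=int(round(num_leaves)))
    # scikit-opt's PSO minimizes, so return the negated accuracy.
    return -cross_val_score(clf, X_train, y_train, cv=10,
                            scoring="accuracy").mean()

pso = PSO(func=objective, n_dim=2, pop=20, max_iter=30,
          lb=[0.01, 8], ub=[0.3, 64])
pso.run()
print("best hyperparameters:", pso.gbest_x, "best CV accuracy:", -pso.gbest_y)
```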

Performance assessment

Performance evaluation metrics include accuracy, precision, recall, specificity, and F-measure. The ROC curve graphically displays model performance at different classification thresholds, and ten-fold cross-validation is used to evaluate the effectiveness and stability of the model on unseen data44.
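A sketch of this evaluation step with standard scikit-learn utilities; specificity is not built in and is derived from the confusion matrix, and `model` and the data splits follow the earlier sketches:

```python
# Compute the reported metrics for a fitted binary classifier.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import cross_val_score

y_pred = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print("accuracy   :", accuracy_score(y_test, y_pred))
print("precision  :", precision_score(y_test, y_pred))
print("recall     :", recall_score(y_test, y_pred))
print("specificity:", tn / (tn + fp))               # true-negative rate
print("F-measure  :", f1_score(y_test, y_pred))
print("AUC        :", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print("10-fold CV :", cross_val_score(model, X_train, y_train, cv=10).mean())
```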

Results

Training and testing were performed on a Windows 11 machine with an Intel Core i5 processor and an NVIDIA RTX 2050 GPU. The models were implemented in Python 3.9. Data preprocessing relied mainly on the 'imblearn' and 'pandas' libraries, model development on the 'numpy', 'sklearn', 'shap', and 'scikit-opt' packages, and the online platform on the 'streamlit' package.

The best machine learning model

Initially, we employed and evaluated RF, SVM, LR, KNN, and LightGBM models to classify the WDBC dataset. Figure 3(a) shows the accuracy achieved with feature subsets ranging in size from 1 to 30, Fig. 3(b) presents the corresponding AUC values, and Fig. 3(c) shows the 10-fold cross-validation accuracy. Comparing the accuracy of the five models, the LightGBM model generally surpassed the others across most feature subsets, although its accuracy dipped slightly below theirs for subsets of four, seven, ten, eleven, or twelve features. Remarkably, the accuracy of the LightGBM model reached 99.0% with a subset of 26 features. A comparative analysis of the AUC values likewise showed that the LightGBM model typically outperformed the other models, achieving an AUC as high as 0.987 for the 26-feature subset, although its AUC values were marginally lower for the subsets of 4, 7, 10, 11, and 12 features. The ten-fold cross-validation comparison indicates that, across feature subsets of 1 to 30, there is no significant difference among the LightGBM, KNN, and RF models, whereas the SVM and LR models generally perform worse.

Fig. 3. Training results of the five models: (a) accuracy, (b) AUC, (c) 10-fold cross-validation accuracy.

Subsequently, we evaluated the performance of the RF, SVM, LR, KNN, and LightGBM models by selecting, for each, the feature subset (out of the 30 candidates) with the highest accuracy, AUC, and ten-fold cross-validation accuracy. Figure 4 displays the confusion matrices of the five best-performing models. Notably, the RF model with 28 features achieved a TP of 74, FN of 3, FP of 0, and TN of 123. The SVM model, equipped with 18 features, recorded a TP of 76, FN of 1, FP of 3, and TN of 120. Similarly, the LR model with 27 features showed a TP of 75, FN of 2, FP of 3, and TN of 120. The KNN model, using the fewest features (12), excelled with a TP of 76, FN of 1, FP of 2, and TN of 121. The LightGBM model, with 26 features, demonstrated superior performance with a TP of 75, FN of 2, FP of 0, and TN of 123. Figure 5 and Table 1 show the ROC curves and performance metrics of these models, highlighting the LightGBM model's top accuracy of 99%, which is 0.5% higher than both the RF and KNN models, 1.0% higher than the SVM model, and 1.5% higher than the LR model. This model also excelled in specificity (100%), precision (100%), recall (97.40%), F-measure (98.68%), AUC (0.9870), and ten-fold cross-validation accuracy (0.9808). Table 2 details the hyperparameters of the 26-feature LightGBM model, optimized with the PSO algorithm.

Fig. 4. Confusion matrices for the five models: (a) RF, (b) SVM, (c) LR, (d) KNN, (e) LightGBM.

Fig. 5. Receiver operating characteristic (ROC) curves for the five models: (a) RF, (b) SVM, (c) LR, (d) KNN, (e) LightGBM.

Table 1 The performance of different models.
Table 2 Hyperparameters optimized for the LightGBM model using a subset of 26 features with PSO.

Distribution of importance of 26 features

In the SHAP-RF-RFE feature-selection algorithm, the mean absolute SHAP value of each feature indicates its importance. Figure 6 illustrates the importance distribution of the 26 features in the best-performing LightGBM model obtained with the SHAP-RF-RFE algorithm, ranked in order of importance from top to bottom. Notably, 'radius_worst', 'area_worst', and 'perimeter_worst' are deemed pivotal. 'Radius_worst' represents the radius of the largest cross-section of the tumor; generally, a larger radius signifies a larger tumor, which could indicate a more aggressive form of cancer. 'Area_worst' refers to the area of the tumor's largest cross-section; typically, a larger tumor area implies a higher tumor load and may correlate with a higher degree of malignancy. 'Perimeter_worst' is the circumference of the tumor at its largest cross-section; the perimeter mirrors the tumor's morphology and the complexity of its growth, and a longer perimeter may suggest a more irregular tumor morphology, which is often associated with greater aggressiveness and malignancy.

Fig. 6. Ranking of SHAP values in the recommended algorithm.

The interpretation of the model

In the subsequent analysis, SHAP values were employed to interpret the LightGBM model built on the 26 features above. The SHAP beeswarm plot of this model is shown in Fig. 7, where positive SHAP values correlate with an increased probability of a breast cancer diagnosis and negative values with a decreased likelihood. To aid visual comprehension, higher feature values are shown in red and lower values in blue. Notably, the feature with the most substantial impact on the model is 'radius_worst': a high value indicates an elevated risk of breast cancer, whereas a low value indicates a diminished risk. Conversely, 'concavity_se' emerges as the feature with the least influence on the model. Figure 7 shows that higher values of the following 18 features are associated with increased breast cancer risk: radius_worst, texture_mean, area_worst, perimeter_worst, concave points_worst, smoothness_worst, texture_worst, concavity_worst, concave points_mean, area_se, symmetry_worst, radius_se, smoothness_mean, concavity_mean, area_mean, perimeter_mean, fractal_dimension_worst, and compactness_worst; conversely, lower values of these attributes imply reduced risk. For the next five features, compactness_se, symmetry_se, concave points_se, fractal_dimension_se, and compactness_mean, the relationship is inverse: higher values correlate with a decreased likelihood of breast cancer, whereas lower values suggest an increased risk. Notably, for symmetry_mean a low value yields an ambiguous prediction, and for radius_mean a high value likewise yields an ambiguous prediction, while the predictive value of concavity_se in breast cancer remains unclear.

Fig. 7. SHAP beeswarm plot for LightGBM-PSO.

Comparison with other models

Comparative analysis (detailed in Table 3) highlights the exceptional performance of our breast cancer prediction model, which leverages SHAP-RF-RFE for feature selection, LightGBM as the classifier, and PSO for hyperparameter tuning. Achieving 99% accuracy and 100% precision, the model surpasses counterparts in the literature, demonstrating superior predictive capability. While this high precision means no false positives on the test set, the slightly lower recall of 97.4% indicates that some actual cases may go undetected. Overall, this integrated approach shows strong predictive power and promising potential for enhancing breast cancer diagnosis.

Table 3 Accuracy comparison with other works from the literature.

Discussion and conclusion

This study introduces a breast cancer diagnostic model that is more accurate and efficient than prior approaches. We also use SHAP values to explain how the model makes its decisions.

Breast cancer poses a significant public health concern and is among the primary causes of mortality in women45,46. The early identification of breast cancer remains a pivotal focus of medical research. Traditionally, pathologists and radiologists manually inspect breast images and reach a consensus with other medical experts to make decisions and conduct analyses47,48. However, manually analyzing the large number of images used for diagnosing breast cancer is both laborious and time-consuming, and may lead to false positive or false negative results49. Automated systems are therefore needed to improve analysis efficiency and assist radiologists in the early diagnosis of breast cancer50, a setting in which machine learning is becoming increasingly vital. First, machine learning algorithms analyze breast imaging, encompassing mammography, ultrasound, and MRI, to aid physicians in pinpointing potential lesions indicative of breast cancer51,52. Jia Li et al.53 employed the Self-Attention Random Forest (SARF) model to classify breast X-ray images and achieved excellent accuracy. Second, through machine learning-driven analysis of extensive genomic data, researchers have delved into genetic mutations and biomarkers linked to breast cancer emergence, facilitating the identification of genetic predispositions and the crafting of tailored preventive and therapeutic strategies54,55. Byung-Chul Kim et al.56 constructed a high-accuracy model for predicting breast cancer metastasis using RNA-seq data and machine learning algorithms. Additionally, machine learning has been employed to scrutinize clinical patient data to discern potential risk factors and early indicators of breast cancer, utilizing both pathological findings and clinical histories to support more informed diagnostic and treatment decisions by medical professionals57,58. Mahendran Botlagunta et al.59 proposed a machine learning-based web application that utilizes blood feature data for the early detection of breast cancer metastasis. Finally, the integration of machine learning into telemedicine systems enables real-time screening and diagnostic services for breast cancer, addresses disparities in medical resources, and enhances the accessibility and effectiveness of early detection efforts60.

Despite the impressive performance of our model, several limitations warrant consideration. Primarily, its generalization capability across diverse datasets requires further validation. While we achieved outstanding results on a specific dataset, applicability in other clinical settings or varied populations remains to be comprehensively assessed. Additionally, the model’s complexity may incur higher computational costs during practical deployment, posing challenges particularly in resource-constrained healthcare environments. Our research has culminated in a novel breast cancer prediction model, marked by significant accuracy, adaptability, and scalability improvements. To democratize access to this advanced technology, we have launched the “Breast Cancer Prediction Tool” (https://breast-cancer-prediction-tool-cgbjlhkns7yig6bmzvztmc.streamlit.app/), an intuitive online platform offering accessible risk assessment services. Patients and healthcare professionals can input relevant health data through a user-friendly interface to receive personalized risk evaluations instantaneously. This immediate feedback mechanism empowers early interventions and tailored treatment plans, supporting clinicians in making more informed and precise diagnoses and treatment decisions. Future work will encompass several key directions to enhance the robustness and applicability of our model. First, we aim to train the model on a more diverse set of datasets and integrate various imaging modalities to enrich the assessment of disease manifestations. Second, we plan to explore advanced optimization algorithms such as Hybrid Particle Swarm Optimization (HPSO) and HPSO with Time-Varying Acceleration Coefficients (HPSO-TVAC)61. These techniques have demonstrated superior performance in tackling complex problems by efficiently converging towards optimal solutions, thereby boosting model accuracy62. Third, we intend to expand the scope of the model to predict the risk of other diseases such as lung cancer, thereby significantly enhancing its practical value.