Introduction

The incidence of cancer is rising over time: one in five individuals is expected to be diagnosed with cancer during their lifetime, and breast cancer now represents one in four of all cancers in women1.

Between 2020 and 2040, the number of breast cancer cases is projected to rise by 50%, according to machine learning forecasts2.

Mortality has risen alongside morbidity, with cancer deaths increasing from 6.2 million to 10 million between 2000 and 2020. Breast cancer is a significant global public health issue, and delays in screening, diagnosis, and treatment contribute greatly to cancer-related mortality3. Each year, 2.3 million new cases of breast cancer and 685,000 deaths from it are reported4.

Recent evidence also indicates that 70% of breast cancer cases and 60% of related deaths occur in low- and middle-income countries5,6.

Breast cancer (BC) screening tools, such as breast self-examination (BSE), clinical breast examination (CBE), and mammography, play a crucial role in early detection and prevention7. Breast self-examination is a cost-effective strategy for developing countries because it can reduce the cost of medical equipment, transportation, and other indirect costs related to medical services8,9. Furthermore, it enhances accessibility to healthcare services and plays a crucial role in mitigating the spread of infectious diseases in low- and middle-income countries, offering a sustainable solution for public health improvement10,11.

Additionally, BSE is easy to perform, feasible, convenient, and safe, requiring no specialized equipment or setup10. Breast cancer can be effectively treated if detected early through consistent screening12. However, a lack of awareness about BSE in many African countries results in delayed diagnoses and treatment of breast cancer13.

Breast self-examination (BSE) is “a method by which a woman examines her own breasts for changes that may indicate breast cancer, such as lumps, changes in size or shape, or skin changes. It involves both visual inspection and palpation (feeling the breasts with the hands) to detect any unusual changes”14.

Recent data revealed that gender is one of the key risk factors among the sixteen known factors, with women under the age of 50 having an 82% higher breast cancer rate in 2021 compared to men of the same age15.

According to evidence, factors such as education level, age, poor patient-provider interactions, income, employment status, use of technology and social media, health facility visits, exposure to healthcare information, and cultural beliefs are common predictors of breast self-examination16,17,18,19,20,21,22,23.

The majority of previous studies have used conventional statistical techniques, such as logistic and multilevel regression models. However, these approaches are limited by their reliance on predetermined assumptions and their inability to handle complicated, high-dimensional data, which makes them unable to find the most influential predictors.

In contrast, machine learning (ML) provides a novel approach that combines sophisticated pattern recognition, automated feature selection, and outstanding predictive accuracy24. Algorithms such as random forests, gradient boosting, and neural networks can automatically identify intricate, non-linear relationships in data without the need for human intervention. ML algorithms can process huge datasets to uncover hidden patterns that traditional methods might miss. Applying ML to the prediction of BSE awareness in Sub-Saharan Africa can yield valuable insights into the socio-demographic, cultural, and healthcare factors behind the knowledge gap.

Methods and materials

Study design and setting

A supervised machine learning study was carried out using Demographic and Health Survey (DHS) data collected between 2020 and 2023 in eight Sub-Saharan African nations: Burkina Faso, Côte d’Ivoire, Ghana, Kenya, Lesotho, Madagascar, Mozambique, and Tanzania.

Sub-Saharan Africa (SSA) is the region of the African continent south of the Sahara Desert and home to over 1 billion people.

Source and study population

The study population consists of all women of reproductive age living in Sub-Saharan Africa, using data from the most recent DHS surveys conducted between 2020 and 2023.

Study variables

Dependent variable: Awareness of breast examination to detect breast cancer, measured by the question “Have you ever examined your own breasts for breast cancer?” The response options were “Yes,” recoded as “1,” and “No,” recoded as “0.” The independent features were also recoded based on these concepts2.

Independent variables: Marital status, education status, social media use, wealth status, age of mother, place of delivery, smartphone availability, distance to healthcare facility (HF), residence, woman’s occupation, woman’s health status, previous visit to a health facility, awareness of STI, previous examination by a health professional, and number of children.

Data processing and management

For this study, the DHS datasets were imported into STATA version 17 and merged across the eight countries using the ‘append using’ command. After the data were integrated, a thorough check for missing values was conducted. Variables with more than 25% missing values were excluded from the analysis. For the remaining variables, with missing values below the 25% threshold, imputation techniques such as mode imputation or K-Nearest Neighbors (KNN) imputation were applied according to the nature of each variable, particularly for crucial variables.
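
The missing-value rule described above can be illustrated with a minimal Python sketch. It is not the authors' STATA workflow: the column names and records are hypothetical, only mode imputation is shown (KNN imputation is omitted), and the helper names `drop_sparse_columns` and `impute_mode` are illustrative.

```python
from collections import Counter


def drop_sparse_columns(rows, threshold=0.25):
    """Drop columns whose share of missing (None) values exceeds the threshold."""
    n = len(rows)
    keep = [c for c in rows[0]
            if sum(r[c] is None for r in rows) / n <= threshold]
    return [{c: r[c] for c in keep} for r in rows]


def impute_mode(rows):
    """Fill remaining None values with each column's most frequent observed value."""
    filled = [dict(r) for r in rows]
    for c in filled[0]:
        observed = [r[c] for r in filled if r[c] is not None]
        if not observed:  # nothing to learn a mode from
            continue
        mode = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[c] is None:
                r[c] = mode
    return filled


# Hypothetical records: "smartphone" is 50% missing, "education" is 25% missing.
rows = [
    {"education": "secondary", "residence": "rural", "smartphone": None},
    {"education": None, "residence": "urban", "smartphone": None},
    {"education": "secondary", "residence": "rural", "smartphone": "yes"},
    {"education": "primary", "residence": "rural", "smartphone": "no"},
]
cleaned = impute_mode(drop_sparse_columns(rows))
```

In this toy data the `smartphone` column exceeds the 25% threshold and is dropped, while the single missing `education` value is filled with the most frequent category.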

Feature selection

Recent studies indicate that irrelevant variables can weaken a model’s ability to generalize, increase its overall complexity, and potentially reduce the classifier’s accuracy in machine learning applications24. In this study, Recursive Feature Elimination (RFE) with K-fold cross-validation was employed to iteratively remove the least significant features. The model was then built using the remaining features, with the top ten features taken as the cutoff point (Fig. 1).
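
To make the elimination loop concrete, the following is a minimal sketch of recursive feature elimination in plain Python. The base estimator (a small logistic regression fitted by gradient descent, ranked by absolute weight), the toy data, and the function names are assumptions for illustration; the study does not state which estimator was used, and the K-fold cross-validation step is left out here.

```python
import math
from itertools import product


def fit_logistic(X, y, epochs=200, lr=0.5):
    """Fit a tiny logistic regression by batch gradient descent; return the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - target
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w


def rfe(X, y, n_keep):
    """Refit repeatedly, each time dropping the feature with the smallest |weight|."""
    kept = list(range(len(X[0])))
    while len(kept) > n_keep:
        sub = [[row[j] for j in kept] for row in X]
        w = fit_logistic(sub, y)
        weakest = min(range(len(kept)), key=lambda i: abs(w[i]))
        kept.pop(weakest)
    return kept


# Toy data: four features coded -1/+1; only features 0 and 1 determine the label.
X = [list(r) for r in product([-1, 1], repeat=4)]
y = [1 if r[0] == 1 and r[1] == 1 else 0 for r in X]
selected = rfe(X, y, n_keep=2)
```

Because the model is refitted after every removal, the ranking can change between rounds, which is what distinguishes RFE from a single one-off importance ranking.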

Fig. 1 Feature importance.

Splitting the data

According to machine learning evidence, splitting data into training and testing sets is crucial for accurately evaluating a model’s performance on unseen data and for managing overfitting25. Based on this concept, the data in this study were split into training and testing sets in an 80:20 ratio, and 10-fold cross-validation was used to measure the model’s performance.
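
A minimal sketch of an 80:20 split followed by 10-fold cross-validation is given below. Two points are assumptions rather than details from the study: the split is stratified by outcome class, and the folds are drawn from the training portion only. The label vector simply mimics a 23%/77% outcome, and the function names are illustrative.

```python
import random


def stratified_split(labels, test_ratio=0.2, seed=42):
    """Return (train_idx, test_idx), keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(len(idx) * test_ratio)
        test.extend(idx[:cut])
        train.extend(idx[cut:])
    return sorted(train), sorted(test)


def k_folds(indices, k=10, seed=42):
    """Yield (fit_idx, validation_idx) pairs for k-fold cross-validation."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]


# Hypothetical outcome vector: 23 "Yes" (1) and 77 "No" (0) responses.
labels = [1] * 23 + [0] * 77
train_idx, test_idx = stratified_split(labels)
folds = list(k_folds(train_idx, k=10))
```

Each of the ten folds serves once as the validation set while the other nine are used for fitting; the held-out 20% is not touched until the final evaluation.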

Handling imbalanced data

According to machine learning concepts, training on imbalanced data degrades model performance because the classifier becomes biased toward the majority class. This study employed SMOTE + Tomek Links for a dual purpose: to balance the data and to remove majority-class points that might create noise or mislead the classifier.

This approach seeks to improve the model’s generalization by ensuring balanced representation of both classes and reducing bias. As shown in Fig. 2, 23% and 77% of the outcome variable were ‘Yes’ and ‘No,’ respectively; after SMOTE + Tomek, the two classes were equally represented.
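
The two steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in the study: features are treated as numeric coordinates so that interpolation is meaningful, the number of neighbours `k` is arbitrary, only the majority-class member of each Tomek link is removed (as the text describes), and all function names are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def remove_tomek_links(points, labels, majority_label):
    """Drop majority points whose nearest neighbour is a minority point
    that also has them as its nearest neighbour (a Tomek link)."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    nn = [nearest(i) for i in range(len(points))]
    drop = {i for i, j in enumerate(nn)
            if nn[j] == i and labels[i] != labels[j] and labels[i] == majority_label}
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


def smote_tomek(points, labels, minority_label=1, majority_label=0):
    """Oversample the minority class to parity, then clean Tomek links."""
    minority = [p for p, l in zip(points, labels) if l == minority_label]
    n_new = labels.count(majority_label) - len(minority)
    all_points = list(points) + smote(minority, n_new)
    all_labels = list(labels) + [minority_label] * n_new
    return remove_tomek_links(all_points, all_labels, majority_label)


# Hypothetical 2-D data: 3 minority points and 6 well-separated majority points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0), (7.0, 7.0), (8.0, 8.0)]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0]
balanced_points, balanced_labels = smote_tomek(points, labels)

# A hand-made Tomek link: the majority point (0.1, 0.0) sits next to a minority point.
cleaned_points, cleaned_labels = remove_tomek_links(
    [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)], [1, 0, 0, 0], majority_label=0)
```

Survey variables of the kind listed under "Study variables" are largely categorical, so interpolation would in practice require a suitable numeric encoding or a categorical-aware variant of SMOTE; that choice is outside this sketch.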

Fig. 2 Class distribution before SMOTE.

Model selection

After thoroughly reviewing previous machine learning studies26,27, we selected eight supervised machine learning models for evaluation: Decision Tree (DT), Random Forest (RF), k-Nearest Neighbors (KNN), Artificial Neural Network (ANN), eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LGBM), Adaptive Boosting (AdaBoost), and Gradient Boosting (GB).

Model training

After model selection, the identified classifiers were trained on both the unbalanced and the balanced datasets, and tenfold cross-validation was applied to compare their performance. Subsequently, the best-performing model was identified and retrained on the balanced training data to make final predictions on the unseen test data.

Model evaluation

Recent studies emphasize the significance of model assessment in machine learning as it allows us to identify how well a trained model performs when handling unseen data28,29.

Machine learning (ML) holds immense potential in the healthcare system, where large and complex data sets can yield valuable information across a wide variety of areas, from patient care to policy-making. However, this potential also brings the responsibility to carefully monitor and mitigate biases and overfitting, ensuring that healthcare services are improved and robust evidence is generated30. To do this, the F1-Score, ROC-AUC, Accuracy, Precision, and Recall were used in this study’s model evaluation. See the results in Table 1 and Fig. 3.
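
For reference, the five metrics named above can be computed from first principles as in the sketch below. The labels, scores, and the 0.5 decision threshold are hypothetical and are not taken from the study's results; the function names are illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels coded 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half), which equals the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities for eight women.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas ROC-AUC is computed from the ranking of the scores alone; this is why both kinds of measure are usually reported together for imbalanced outcomes.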

Table 1 Model evaluation metrics results after SMOTE.

Fig. 3 Model comparison by ROC curve for BSE.

Awareness of breast self-examination by country distribution

According to this finding, Kenya has the highest awareness of breast self-examination (24.10%), whereas Madagascar has 14.14% and Burkina Faso 13.24%. Ghana and Tanzania have moderate awareness of breast self-examination (11.25% and 11.43%, respectively), whereas Côte d’Ivoire has 11.15%. See the results in Table 2.

Table 2 Awareness of breast self-examination by country distribution.

Descriptive results on breast self-examination (BSE) awareness and related factors

According to this evidence, the largest age group among women in this study is 15–24 years (39.14%), and most women in the study are married (60.47%). Additionally, the data show that 36.65% of women have received secondary education or higher; the remainder have only primary education or no education.

Regarding wealth status, the largest share of women are classified as ‘rich’ (46.60%), followed by those categorized as ‘poor’ (34.19%) and ‘middle’ (19.22%). Approximately 40% of women report that the distance to a health facility is a major barrier to accessing healthcare. Most reproductive-age women (67.75%) did not have access to a smartphone, while a significant proportion (67.04%) use other forms of media, such as radio, TV, and newspapers. Additionally, the majority of women in the sample live in rural areas (59.42%), and more than half (51.41%) have undergone HIV testing. Approximately 44.22% of women have mothers who are employed, and 64.81% of women who checked their health status in the last 12 months report having good health.

More than half of the women (52.19%) have visited a health facility in the last 12 months; however, only a small proportion of all participants (11.45%) have previously been examined by a healthcare provider for breast cancer. Among all reproductive-age women, 32% have one child. See Table 3.

Table 3 Individual characteristics of breast self-examination (BSE) awareness.

Predictors of breast self-examination (BSE) awareness among reproductive-age women

According to the Decision Tree machine learning model, woman’s age, smartphone availability, health facility visits, HIV testing, examination by a healthcare provider, mother’s occupation, education level, health status, and distance to health facility increase awareness of breast self-examination. However, marital status, number of children, wealth status, place of residence, and social media use decrease awareness of BSE. See Fig. 4.

Fig. 4 Predictors of breast self-examination (BSE) awareness.

Discussion

Breast cancer is among the leading causes of death among women globally, and early diagnosis is the key to effective treatment and better survival. Awareness of breast self-examination is crucial among women of childbearing age because it allows them to detect breast changes early. However, the majority of women lack the knowledge and skills required to perform BSE, which leads to delayed diagnosis and treatment. Breast self-examination is a very cost-effective intervention that significantly reduces expenditure on medical equipment, healthcare provider services, transport to health centers, and indirect costs. It also enhances the availability of health services and serves as a key element in controlling the spread of infectious diseases in low- and middle-income countries, offering a sustainable approach to enhancing public health. The aim of this study is to predict awareness of breast self-examination and identify its predictors among women of reproductive age in Sub-Saharan Africa using recent DHS data.

The study aims to generate representative and novel evidence through a machine learning approach with eight algorithms, among which Decision Tree was the best-performing model, with an AUC of 87% and an accuracy of 82%.

According to the Decision Tree prediction of this study, the prevalence of BSE is 23%, which is lower than that reported in studies conducted in Addis Ababa, Ethiopia (49.9%), Saudi Arabia (94.2%), Ghana (42.06%), and Nepal (46.7%)20,31,32,33. A possible reason for the lower prevalence of BSE is that our study population differs from those of previous studies in education or awareness level about BSE.

According to this supervised machine learning study, age, smartphone availability, marital status, visits to health facilities, HIV testing, number of children, healthcare provider exams, wealth status, place of residence, mother’s occupation, education level, social media usage, health status, and distance to health facilities are the top predictors of BSE awareness among women.

According to SHAP, women with secondary and higher education levels have better awareness of breast self-examination. This is supported by previous evidence from Ghana, Nepal, Southwest Ethiopia, and Bangladesh20,21,23,34.

A possible reason is that educated women have good health literacy, which helps them easily understand information about breast self-examination (BSE) from social media and healthcare providers. They also tend to have jobs, which gives them access to digital health tools for browsing health information on computers, smartphones, and social media platforms. In addition, they tend to develop healthcare-seeking behavior21,34.

This study also showed that breast examination by a healthcare provider increases women’s awareness of breast self-examination. This is confirmed by previous evidence from Rwanda and Iran35,36.

A possible reason is that an examination provided by healthcare professionals helps women consider the severity of breast cancer and understand the perceived benefit of breast examination as a preventive mechanism, because the community as a whole, including women, trusts healthcare providers.

The availability of smartphones has been shown to significantly increase awareness of breast self-examination among women of reproductive age. This is also supported by previous studies, including a systematic review and individual studies from Iran, Switzerland, and Southern Ethiopia37,38,39,40,41. A possible reason is that a smartphone is convenient, feasible, and easier to use than other digital devices. Smartphone availability enables access to mobile health apps, the internet, and social media, which are easily accessible platforms for health information, including information on breast self-examination. Indirectly, the majority of those who have smartphones are digitally literate, reside in urban areas, have good wealth status, and have better health literacy than their counterparts42.

The findings indicate that women who do not have major problems traveling to health facilities have better awareness of breast self-examination. This is supported by earlier studies conducted in Cameroon, Thailand, Rwanda, the United Kingdom, and Morocco19,43,44,45,46. A likely explanation is that these women face fewer barriers to obtaining relevant health information: because they can visit healthcare facilities easily and frequently, they receive health education from healthcare professionals and stay up to date.

The machine learning analysis indicates that women of reproductive age who have previously undergone HIV testing exhibit higher awareness of breast self-examination than those who have not been tested. This is supported by previous evidence from Northwest Ethiopia and sub-Saharan Africa47,48.

A likely explanation is that those who have visited healthcare facilities may develop a sense of perceived severity and recognize the usefulness of health-seeking behaviors through interactions with healthcare professionals, access to health resources, and a more proactive approach to personal health and well-being49.

The SHAP finding showed that being unmarried decreases awareness of breast self-examination. This is supported by previous evidence from Bangladesh, India, the United Kingdom, and Ghana19,20,43,50,51.

Possible reasons are that unmarried women may have less autonomy or a lower income, which may restrict their access to health education52. They may believe they are less likely to experience breast-related problems. Stigma, privacy concerns, or fear of being judged may also cause unmarried women to shy away from breast health-related conversations or services53.

According to the SHAP analysis, women aged 25 and above have increased awareness of breast self-examination. Evidence from Ghana, Rwanda, Lesotho, and India supports this20,44,54,55,56. A possible explanation is that women in this age range have a higher education level, better healthcare access, greater participation in health campaigns, and more frequent visits to healthcare services for reproductive and other health-related needs.

Those women who previously checked their health status have increased awareness of breast self-examination. This is in line with studies conducted in Yemen, Ghana, and Sydney57,58,59. This result can be explained by the idea that women who regularly monitor their health tend to be more knowledgeable about health practices, such as breast self-examination.

This finding shows that women who live in rural areas have decreased awareness of breast self-examination compared to those in urban areas. This is supported by earlier studies conducted in Niger, Ethiopia, and South India18,60,61.

One possible explanation is that rural women are less likely to be aware of breast self-examination (BSE) because they have less access to digital platforms, less engagement with healthcare providers, lower levels of education and health literacy, and limited access to health information. Economic hardships, cultural norms, and the smaller number of healthcare facilities and outreach programs in rural areas also contribute to lower awareness of breast self-examination62.

Women with low wealth status have decreased awareness of breast self-examination. This is in line with studies conducted in Ghana, Bangladesh, Côte d’Ivoire, and Rwanda20,44,50,63. A possible explanation is that financial limitations result in reduced focus on preventive health programs like screening, while women encounter difficulties accessing medical advice or learning about self-examination techniques due to time and transportation constraints.

According to the findings of this study, those who frequently visit healthcare facilities tend to have increased awareness of breast self-examination. This is in line with previous studies conducted in Southern Thailand, North West Ethiopia, Vietnam, and Saudi Arabia45,64,65,66.

This finding shows that women who use social media less frequently have decreased awareness of breast self-examination, which is consistent with previous studies conducted in Ghana, Malaysia, and Kenya4,58,67.

According to this study’s evidence, women with fewer than two total children ever born have decreased awareness of BSE compared to those with more than two. This is supported by previous studies conducted in Ethiopia, Lesotho, Pakistan, and Pokhara21,68,69,70.

The possible reason could be that women with a single child may be younger, unmarried, or have had an unplanned pregnancy, and may have lower socioeconomic status. As a result, they are less likely to attend campaigns, visit health facilities, or have contact with healthcare providers71.

Strengths and limitations of the study

One of the main strengths of this study is its use of a large and varied dataset, which improves the applicability of the results to a wider geographic area. Additionally, machine learning models are capable of identifying complex patterns and relationships that traditional statistical methods might miss. However, the study does have limitations, such as factors that could affect prediction accuracy, including self-reported data, recall bias, and incomplete information. Moreover, statistical methods for model comparison were not considered, and because of the cross-sectional design of the data, it is difficult to determine causal relationships.

Conclusion and recommendations

In conclusion, Decision Tree is the top-performing model, with an AUC of 87% and an accuracy of 82%, which can be attributed to its ability to capture non-linear relationships between predictors and the target variable and to its inherent feature importance mechanism, which keeps it relatively robust to irrelevant features. The level of awareness of breast self-examination is 23%, which is lower than that reported in previous studies.

The key predictors for this outcome include a woman’s age, smartphone availability, marital status, health facility visits, HIV testing, number of children, examination by healthcare providers, wealth status, place of residence, mother’s occupation, education level, social media use, health status, and distance to health facilities. Based on these findings, we recommend creating awareness among community leaders about breast cancer and the benefits of self-examination, deploying mobile health clinics and outreach programs, training health extension workers on proper BSE so that they can share it with the community, and launching radio and television campaigns in local languages to disseminate information to a large audience.

Implication of the study

The findings of this study have several important implications for policymakers in formulating evidence-based policies. For example, they can act on the predictors identified by machine learning by planning and conducting community-level campaigns, with a focus on rural communities, to create awareness of breast self-examination. These campaigns should be supported by radio and television programs to give special attention to the issue, reach a large population, and help save women’s lives through early detection and management of breast cancer. Additionally, this can help increase the productivity of women and decrease cancer-related deaths. Researchers can use this machine learning research as a benchmark.