SMOTE-augmented machine learning model predicts recurrent and metastatic breast cancer from microbiome analysis

Hong, Ji Eun; Kim, Yeon Eun; Kang, Yun Soo; Choi, Dong Hyeok; Ahn, So Hyun; An, Jeongshin

doi:10.1038/s41598-025-16790-z

Download PDF

Article
Open access
Published: 26 September 2025

SMOTE-augmented machine learning model predicts recurrent and metastatic breast cancer from microbiome analysis

Ji Eun Hong¹,
Yeon Eun Kim²,
Yun Soo Kang²,
Dong Hyeok Choi^3,4,5,
So Hyun Ahn^6,7^na1 &
…
Jeongshin An^8,9^na1

Scientific Reports volume 15, Article number: 33096 (2025) Cite this article

708 Accesses
1 Altmetric
Metrics details

Subjects

Abstract

Recurrence and metastasis of breast cancer (RMBC) have a decisive impact on patient survival, necessitating reliable biomarkers for its early prediction. This study used machine learning to evaluate blood microbiome profiles as predictive biomarkers of RMBC. A retrospective predictive analysis was conducted on 288 participants, including 96 patients with breast cancer and 192 healthy controls. After 7 years of follow-up, patients were classified into disease-free survival (DFS, n = 88) and RMBC (n = 8) groups. Blood microbiome composition was analysed using 16S rRNA sequencing, followed by quality control. The Synthetic Minority Oversampling Technique (SMOTE) was employed to address class imbalance. Eleven machine learning models were trained using leave-one-out cross-validation (LOOCV) and k-fold cross-validation, and evaluated based on the area under the receiver operating characteristic curve (AUROC), recall, precision, F1-score, and Matthews correlation coefficient (MCC). Alpha diversity was significantly lower in DFS and RMBC groups than in the healthy control group (p < 0.05), while beta diversity analysis revealed distinct clustering. The random forest achieved an AUROC of 0.94, a recall of 0.81, an F1-score of 0.83, and an MCC of 0.88. Enterobacter, Bacteroides, Klebsiella, and Bifidobacterium were among the key microbial genera predicting RMBC in the top five models. Blood microbiome profiling shows potential as a non-invasive RMBC biomarker. Machine learning effectively distinguished RMBC, warranting further validation.

Biopsy bacterial signature can predict patient tissue malignancy

Article Open access 17 September 2021

Prediction of breast cancer using blood microbiome and identification of foods for breast cancer prevention

Article Open access 29 March 2023

Predicting diagnosis and survival of bone metastasis in breast cancer using machine learning

Article Open access 25 October 2023

Introduction

Breast cancer (BC) is the most prevalent malignancy among women worldwide, accounting for the highest incidence and mortality rates¹. Despite advancements in diagnostic methods, 5–10% of patients with BC present with metastatic disease at initial diagnosis and 20% develop recurrence later². Recurrence and metastasis of breast cancer (RMBC) significantly increase the mortality rate^3,4. This trend highlights the need for reliable indicators for RMBC and early detection strategies.

Over the past several decades, traditional markers such as carcinoembryonic antigen and tumour-associated antigen CA-15.3 have been used to predict the risk of BC recurrence, with elevated levels suggesting recurrence before the emergence of clinical symptoms⁵. However, the American Society of Clinical Oncology noted insufficient evidence to support their routine use⁶. To address these limitations, traditional markers have been combined with molecular techniques⁷. While these approaches can improve diagnostic efficiency and accuracy, their high cost remains a considerable challenge⁸. Thus, the need to identify reliable and affordable predictors remains.

With the emergence of next-generation sequencing (NGS), human blood has been shown to carry complex microbiomes, even in healthy individuals. Furthermore, disruptions in microbial balance have been associated with numerous diseases, including cancer, and unique blood microbiome profiles have shown the potential for early cancer detection^9,10. However, most studies focus on cancer diagnosis rather than prediction of recurrence or metastasis, possibly due to a limited sample size¹¹.

In this study, we focused on developing a machine learning model to predict RMBC risk using blood microbiome data, by incorporating the synthetic minority oversampling technique (SMOTE) for data augmentation to address the small sample size of patients with RMBC. We further aimed to identify significant microbes associated with RMBC and evaluate their predictive performance. This approach will help advance our understanding of the role of the blood microbiome in BC pathophysiology and contribute to the management of its recurrence, metastasis, and prediction of prognosis.

Methods

Study population and demographics

This study was a retrospective predictive analysis comparing patients with BC and healthy controls to analyse the predictive power of the blood microbiome. Participants were recruited between 2014 and 2017 at Ewha Womans University Hospital and Inje University Haeundae Hospital, and their clinical outcomes were subsequently followed for 7–8 years. Blood samples from the healthy control group were collected at a health screening centre, and all participants were confirmed to have no underlying chronic or acute diseases based on routine comprehensive health examinations. In contrast, blood samples were collected prior to any treatment (surgery, chemotherapy, or radiation), and patients with BC were diagnosed after presenting with symptoms or abnormal radiological findings, confirmed by histological examination. After a 7-year follow-up, patients were stratified into disease-free survival (DFS; n = 88) and RMBC (n = 8) groups based on their recurrence or metastasis status¹². The RMBC group included patients with locoregional recurrence or distant metastases (Fig. 1). Ethical approval for the study was obtained from the Institutional Review Board of the Ewha Womans University Hospital (IRB No. EUMC 2014-10-005) and Inje University Haeundae Hospital (IRB No. 1297992-2015-064). All procedures followed institutional guidelines and informed consent was obtained from all participants.

DNA extraction and microbiome sequencing

Blood samples were processed for extracellular vesicle (EV) isolation through centrifugation and filtration. Bacterial DNA was extracted from EVs using the DNeasy PowerSoil Kit (QIAGEN) and quantified with the QIAxpert system. The 16S rRNA gene was amplified using the primers 16s_V3_F (5′-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG-3′) and 16s_V4_R (5′-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC-3′)¹³. Microbiome profiling was performed following the Illumina protocol using a MiSeq platform (MiSeq Software v4.1.0). To minimise the risk of contamination, bacterial DNA was extracted from EVs rather than directly from whole blood. The extracted DNA was then used for 16S rRNA sequencing to identify operational taxonomic units (OTUs)¹².

Data preprocessing

All preprocessing and analyses of the sequencing data were performed using the CLC Workbench (QIAGEN, Hilden, Germany). Adapter trimming and quality control were applied with a quality limit of 0.05 and a maximum of two ambiguous bases. The sequences were filtered to exclude reads shorter than 200 bp, longer than 550 bp, and those with insufficient coverage, based on CLC Workbench filtering parameters (minimum number of reads: 100; minimum percent from the median: 50.0). After excluding low-quality data, the final dataset consisted of 229 individuals, including 141 healthy controls, 81 patients with DFS, and seven patients with RMBC. OTUs were clustered using a reference-based method with a 97% similarity threshold, and taxonomic classification was performed using the SILVA 138 database in CLC Workbench. The OTUs with a relative abundance of less than 1% across all samples were excluded from further analysis¹⁴. Amplicon sequence variant (ASV)-based analysis was not performed, since the sequencing data were processed using CLC Workbench, which is limited to an OTU-based pipeline.

Data augmentation

SMOTE was applied to augment the RMBC group to address the class imbalance in the dataset. This technique synthesises new samples by interpolating existing samples within the minority class¹⁵. While the default SMOTE algorithm uses a k-value of five (considering the five nearest neighbors for synthesis), we selected a reduced k-value of three after evaluating multiple oversampling methods, as it was better suited to the limited RMBC samples and provided more stable performance (Supplementary Table S1). SMOTE (k = 3) and Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN; n_neighbors = 3) showed similar KS D values, indicating low distributional bias. We evaluated both methods using cross-validation, with detailed results presented in Supplementary Table S2. Specifically, for LOOCV, the results of SMOTE (k = 3) and ADASYN (n_neighbors = 3) are presented in Supplementary Tables S2.1 and S2.2, respectively. For k-fold cross-validation, the corresponding results are provided in Supplementary Tables S2.3 for SMOTE (k = 3) and in S2.4 for ADASYN (n_neighbors = 3). Additional validations were conducted using permutation tests (Supplementary Tables S3) and bootstrapping (Supplementary Table S4), with bootstrapping results for SMOTE (k = 3) and ADASYN (n_neighbors = 3) presented in Supplementary Tables S4.1 and S4.2, respectively. ADASYN exhibited superior performance metrics, which may be attributed to its adaptive approach. Nonetheless, SMOTE was ultimately selected due to its ability to generate synthetic data while preserving the original data distribution. The number of RMBC samples was increased from 7 to 35. The augmented dataset exhibited diversity comparable to that of the original RMBC data, thereby minimising the risk of overfitting.

Machine learning model development and evaluation

The data were split into training (80%) and test (20%) sets, preserving the class proportions in all groups. The SMOTE-augmented data were used for training, with the test set comprising only the original samples to ensure an unbiased evaluation. The resulting training set consisted of 204 samples (112 healthy control, 64 DFS, and 28 RMBC), whereas the test set consisted of 53 samples (29 healthy control, 17 DFS, and seven RMBC). Eleven machine learning algorithms were employed to predict the RMBC: random forest, gradient boosting, decision tree, support vector machine, neural network, AdaBoost, Naïve Bayes, nearest neighbors, linear discriminant analysis, quadratic discriminant analysis, and logistic regression. All models were implemented using Python 3.10 and the scikit-learn library. Because of the small number of RMBC class instances, leave-one-out cross-validation (LOOCV) and k-fold cross-validation were employed for model evaluation, wherein each instance served as a separate test set¹⁶. Model performance was evaluated using accuracy, recall, precision, F1-score, MCC, and area under the receiver operating characteristic curve (AUROC). Confusion matrices were generated to visualise the classification performance. Feature importance was assessed for the five best-performing models using permutation analysis for Naïve Bayes, Shapley additive explanation (SHAP) values for nearest neighbors, and scikit-learn attributes for tree-based models. Biological significance was inferred by analysing feature contributions across the healthy control, DFS, and RMBC groups. Permutation tests and bootstrapping analyses were performed to evaluate the model’s robustness (Supplementary Tables S3, S4).

Statistical analysis

Kruskal–Wallis test and false discovery rate (FDR) correction were conducted to identify differentially abundant OTUs between the healthy control and cancer groups. Taxa with Kruskal–Wallis and FDR-adjusted p-values below 0.05 were selected to reduce the risk of ‘false positives’. Differences between DFS and RMBC in baseline characteristics were assessed using the Mann–Whitney U test for age and Fisher’s two-sided exact test for categorical variables. Principal coordinate analysis (PCoA) using the Bray–Curtis dissimilarity distance was conducted to assess the clustering of groups. Permutational multivariate analysis of variance (PERMANOVA), centroid shift, and the Kolmogorov–Smirnov (KS) test were applied to analyse the metrics from principal component analysis (PCA). All analyses were performed using Python version 3.11 (Python Software Foundation, Beaverton, OR, USA), SPSS version 18.0 (SPSS Inc., Chicago, IL, USA), and R version 4.4.2 (R Foundation for Statistical Computing, Vienna, Austria). The data processing scripts used in this study are available at: https://github.com/dokkulla/breastcancer-microbiome.git.

Results

Clinical characteristics

The baseline characteristics of the study population are summarised in Table 1. The mean age at diagnosis was lower in the RMBC (48.7 ± 11.5 years) than in the DFS group (51.9 ± 11.5 years); however, the difference was not significant. Significant differences were not observed between the groups in terms of tumour stage, lymph node metastasis, histological grade, Ki67 expression, primary surgery type, and BC subtype.

Table 1 Patient’s characteristics (N = 229).

Full size table

Microbiome diversity

Alpha diversity

Alpha diversity metrics, including the observed OTUs, Chao1, Shannon, and Simpson indices, were calculated for each group (Fig. 2a). The observed OTUs, Chao1 and Simpson indices showed significantly reduced microbial diversity in the DFS and RMBC groups compared with the healthy control group (p-value < 0.05). In contrast, the Shannon index did not differ significantly from the healthy control in the RMBC groups.

Beta diversity

Beta diversity analysis was conducted using the Bray–Curtis dissimilarity metric. PCoA plots at different taxonomic levels revealed distinct clustering patterns among the three groups (Fig. 2b). PERMANOVA based on the Bray–Curtis distance indicated significant differences between the healthy control group and both the DFS and RMBC groups, whereas no significant difference was observed between the DFS and RMBC groups.

Comparative blood microbiome profiles

Significant differences in the relative abundance of specific genera were observed among the healthy control, DFS, and RMBC groups (Fig. 3). The stacked bar plot shows a clear difference between the healthy control group and the other groups, while the DFS and RMBC groups exhibit minimal differences relative to each other (Fig. 3a). The heatmap more specifically illustrates which bacteria differ among the three groups, particularly between the DFS and RMBC groups (Fig. 3b). The abundances of genera such as Corynebacterium, Pseudomonas, Acinetobacter, and Sphingomonas were significantly reduced in the DFS and RMBC groups compared with those in the healthy control group (p < 0.05) (Fig. 3c). In contrast, the abundances of Bacteroides, Bifidobacterium, Klebsiella, and Enterobacter were significantly enriched in the DFS and RMBC groups compared with those in the healthy control group (p < 0.05) (Fig. 3d).

Data augmentation

Principal component analysis (PCA) was conducted to visualise the data distribution before and after SMOTE augmentation (Fig. 4a). As shown in the PCA plots, the synthetic RMBC samples generated via SMOTE were distributed within the feature space of the original RMBC samples, confirming that the augmentation process preserved the underlying data structure. The relative abundance distributions of the top 20 genera were visualised using stacked bar charts by comparing the original RMBC samples with SMOTE-augmented data (Fig. 4b). As can be seen from these plots, the augmented data retained a composition consistent with that of the original samples, thus ensuring biological plausibility. The PCA-based statistical analyses, including centroid shift, KS test, and PERMANOVA, revealed no significant group differences; the detailed results are provided in Supplementary Table S1.

Machine learning model performance

Eleven machine learning classifiers were evaluated for their ability to classify the healthy control, DFS, and RMBC groups based on genus-level microbiome data. The performance of the eleven machine-learning classifiers were evaluated both before and after the SMOTE augmentation (Table 2). Before SMOTE augmentation, the highest recall and precision achieved among all classifiers were below 0.67. After SMOTE augmentation, a significant improvement was observed across all classifiers, with a marked increase in recall and precision. The best-performing classifier achieved a recall of 0.88 and precision of 0.94. All models demonstrated promising performance, with the mean AUROC values ranging from 0.75 to 0.94 (Fig. 5a). The random forest model exhibited the strongest discriminative ability with an AUROC of 0.94. It also demonstrated high accuracy (0.81), recall (0.94), precision (0.83), F1-score (0.98), and MCC (0.88). The random forest model correctly classified all 29 healthy control and 17 DFS samples in the test set, as well as three of the seven RMBC samples (Fig. 5b). All models were evaluated using LOOCV and k-fold cross-validation to ensure an unbiased assessment of classifier performance (Supplementary Table 2).

Table 2 Performance comparison of 11 classifiers before and after the SMOTE augmentation.

Full size table

Feature importance

Significant features were analysed for the top five-performing classifiers, using feature importance (random forest, AdaBoost, gradient boosting), SHAP (nearest neighbors), and permutation importance (Naïve Bayes), to identify key microbial genera contributing to the classification of the healthy control, DFS, and RMBC groups. For the five best-performing models, the top 10 features with the highest importance scores were identified and visualised as bar graphs (Fig. 6a–e), with each panel corresponding to random forest (a), gradient boosting (b), AdaBoost (c), nearest neighbors (d), and Naïve Bayes (e). Among the identified features, Enterobacter and Obscuribacteraceae emerged as key genera common to all five models, highlighting their consistent predictive value for RMBC. Additionally, Klebsiella and Bifidobacterium were shared across three of the models (Table 3).

Table 3 Top 10 features with the highest importance scores for the top five models.

Full size table

Discussion

In this study, important features associated with RMBC were identified using machine learning and augmented blood microbiome data. Due to the difficulty in acquiring recurrence and metastasis cases, the RMBC group had a limited sample size. This data imbalance reduces diversity and leads to a skewed distribution¹⁷. To address this issue, we utilised SMOTE, which relies on linear interpolation and may overlook compositional relationships in the data¹⁵. To better preserve these relationships, SMOTE was applied to the relative abundance data rather than OTU counts. After applying SMOTE, model performance substantially improved, particularly for recall, precision, and F1-score, compared with the performance prior to SMOTE application. Previous machine learning studies on RMBC risk primarily relied on clinical information and often required numerous examinations^18,19. Our study demonstrated the potential of blood microbiome-based models using SMOTE, which was validated using clinicopathological data. Augmentation techniques can be a valuable tool for studying rare diseases, their recurrence and metastasis, which is often challenging due to small sample sizes and limited numbers of participants²⁰.

In the present study, some bacteria were associated with RMBC. In a previous study, members of the Enterobacteriaceae family, including the genera Enterobacter and Klebsiella, which are crucial features in our models, were extensively studied for their association with cancer^21,22. The abundance of Enterobacteriaceae was significantly associated with immune-related pathways, including antigen processing and presentation and cytosolic DNA sensing, suggesting potential roles in modulating tumour immune surveillance and evasion^21,23. Yurdakul et al. elucidated the effects of Enterobacter strains and analysed the bacterial proteins that increased cancer cell viability and proliferation while reducing apoptosis²³. They found that cancer cells exhibited higher levels of nuclear factor kappa B (NF-κB) activation compared with that of normal colonic mucosa following exposure to bacterial proteins. This finding suggested that these bacterial influences may enhance breast tumour cell progression and inhibit apoptosis through NF-κB activation²⁴.

The genera Bacteroides and Bifidobacterium were also key features, showing higher relative abundances in the DFS and RMBC groups than in the healthy control group. These genera are part of the “estrobolome” that metabolises oestrogens²⁵. Elevated oestrogen levels are associated with the risk of RMBC²⁶. Kibria et al. reported higher Bacteroides and Blautia levels in the faecal samples of patients with breast cancer, and other studies have shown that toxins produced by certain Bacteroides spp. promote breast cancer progression^27,28. Furthermore, hormones like oestrogen and progesterone can increase their levels, influencing BC dynamics^29,30. The increased levels of these hormones may be involved in a positive feedback loop, triggering RMBC.

Among the taxa selected by the model, family Obscuribacteraceae belongs to under-characterised clade with no functional annotation currently available. Their inclusion among the top-ranked features may reflect microbial community shifts associated with disease status, but the biological interpretation is currently limited.

Recent studies have highlighted the blood microbiome as a promising tool for cancer diagnosis³¹. Building on this insight, we identified microbial taxa associated with BC prognosis. The findings of this study enhance our understanding of the intricate relationships between microbial communities and cancer progression, potentially mediated through immune regulation and hormone metabolism. Furthermore, this work may bring us one step closer to developing a tool that could improve clinical outcomes in BC patients.

The study size was determined based on a power analysis using G*Power to ensure sufficient statistical power to detect significant differences among the three study groups (healthy control, DFS, and RMBC)³². A priori power analysis for MANOVA was conducted with an effect size of f²(V) = 0.15, a significance level of 0.05, and a statistical power of 0.80. Given three groups, 80 predictors, and three response variables, the analysis indicated that a total sample size of 222 participants was required to achieve an actual power of 0.80 and Pillai’s V of 0.26.

This study included 229 participants: 141 healthy controls, 81 patients with DFS, and seven patients with RMBC. In this dataset, we collected pre-treatment samples from patients with breast cancer, and then followed up with them for approximately 7–8 years to confirm their recurrence and metastasis status. Recurrence and metastasis data are difficult to obtain due to the requirement for long-term follow-up. Therefore, analyses were conducted on this small yet essential dataset of recurrent cases. To address data imbalance, the SMOTE technique was applied, and group comparisons were performed using the Kruskal–Wallis test and FDR correction. Further studies involving larger cohorts will be necessary to longitudinally monitor recurrence and metastasis. Another major limitation of this study is the lack of external validation due to the rarity of comparable longitudinal datasets. However, this study remains valuable because it is a rare investigation that evaluates the potential of blood microbiome data to predict cancer recurrence.

As a future direction, this algorithm may be utilised for pre-treatment prediction of recurrence and metastasis in patients with breast cancer, enabling the identification of high-risk individuals who may benefit from more aggressive and personalised therapeutic strategies. Moreover, the integration of this model with imaging modalities—such as mammography, breast ultrasonography, and magnetic resonance imaging—and clinical features—such as age, body mass index, and hormone receptor status—holds promise for developing a multimodal fusion framework, which could further enhance predictive accuracy and support clinical decision-making in real-world settings.

Conclusion

In the present study, we identified potential biomarkers for RMBC prediction using augmented blood microbiome data. This algorithm could enable the early identification of high-risk breast cancer patients before clinical relapse, facilitating more aggressive treatment or closer monitoring to improve outcomes, and supporting personalised therapeutic strategies. Our findings highlight the importance of blood microbiome profiling for the early diagnosis and prognosis of BC, enabling personalised diagnostic strategies and improved long-term management.

Data availability

The raw sequence data are available in the Sequence Read Archive under BioProject IDs PRJNA834579, PRJNA834581, and PRJNA834582.

References

Bray, F. et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clinicians 74, 229–263 (2024).
Google Scholar
Courtney, D. et al. Breast cancer recurrence: factors impacting occurrence and survival. Irish J. Med. Sci. (1971-) 191, 2501–2510 (2022).
Article CAS Google Scholar
Dent, R. et al. Factors associated with breast cancer mortality after local recurrence. Curr. Oncol. 21, e418 (2014).
Article CAS PubMed PubMed Central Google Scholar
Valachis, A. et al. Overall survival of patients with metastatic breast cancer in Sweden: A nationwide study. Br. J. Cancer 127, 720–725 (2022).
Article PubMed PubMed Central Google Scholar
Safi, F., Kohler, I., Röttinger, E., Suhr, P. & Beger, H. Comparison of CA 15–3 and CEA in diagnosis and monitoring of breast cancer. Int. J. Biol. Markers 4, 207–214 (1989).
Article CAS PubMed Google Scholar
Harris, L. et al. American Society of Clinical Oncology 2007 update of recommendations for the use of tumor markers in breast cancer. J. Clin. Oncol. 25, 5287–5312 (2007).
Article CAS PubMed Google Scholar
Zhou, Y. et al. Tumor biomarkers for diagnosis, prognosis and targeted therapy. Signal Transduct. Target. Ther. 9, 132 (2024).
Article PubMed PubMed Central Google Scholar
Vitko, A. S. et al. Costs of breast cancer recurrence after initial treatment for HR+, HER2−, high-risk early breast cancer: Estimates from SEER-Medicare linked data. J. Med. Econ. 27, 84–96 (2024).
Article PubMed Google Scholar
Manos, J. The human microbiome in disease and pathology. APMIS 130, 690–705 (2022).
Article PubMed PubMed Central Google Scholar
Cheng, H. S. et al. The blood microbiome and health: Current evidence, controversies, and challenges. Int. J. Mol. Sci. 24, 5633 (2023).
Article CAS PubMed PubMed Central Google Scholar
Petinrin, O. O. et al. Machine learning in metastatic cancer research: Potentials, possibilities, and prospects. Comput. Struct. Biotechnol. J. 21, 2454–2470 (2023).
Article PubMed PubMed Central Google Scholar
An, J. et al. Prediction of breast cancer using blood microbiome and identification of foods for breast cancer prevention. Sci. Rep. 13, 5110 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
An, J., Kwon, H., Kim, Y. J. & Moon, B.-I. Microbiome-induced microenvironmental changes before and after breast cancer treatment. Microorganisms 13, 1057 (2025).
Article CAS PubMed PubMed Central Google Scholar
Zhou, R., Ng, S. K., Sung, J. J. Y., Goh, W. W. B. & Wong, S. H. Data pre-processing for analyzing microbiome data–A mini review. Comput. Struct. Biotechnol. J. 21, 4804–4815 (2023).
Article CAS PubMed PubMed Central Google Scholar
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. J. Artific. Intell. Res. 16, 321–357 (2002).
Article Google Scholar
Cha, G.-W., Moon, H.-J. & Kim, Y.-C. Comparison of random forest and gradient boosting machine models for predicting demolition waste based on small datasets and categorical variables. Int. J. Environ. Res. Public Health 18, 8530 (2021).
Article PubMed PubMed Central Google Scholar
Kumar, P., Bhatnagar, R., Gaur, K. & Bhatnagar, A. In IOP Conference Series: Materials Science and Engineering. 012077 (IOP Publishing).
Zuo, D. et al. Machine learning-based models for the prediction of breast cancer recurrence risk. BMC Med. Inform. Decis. Mak. 23, 276 (2023).
Article PubMed PubMed Central Google Scholar
Botlagunta, M. et al. Classification and diagnostic prediction of breast cancer metastasis on clinical data using machine learning algorithms. Sci. Rep. 13, 485 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Hirsch, M. C., Ronicke, S., Krusche, M. & Wagner, A. D. Rare diseases 2030: How augmented AI will support diagnosis and treatment of rare diseases in the future. Ann. Rheum. Dis. 79, 740–743 (2020).
Article PubMed Google Scholar
Uzelac, M. et al. Urinary microbiome dysbiosis and immune dysregulations as potential diagnostic indicators of bladder cancer. Cancers 16, 394 (2024).
Article CAS PubMed PubMed Central Google Scholar
Eid, M., Martínek, A., Dolina, J., Uvírová, M. & Dítě, P. Gut microbio-me and pancreatic cancer. Klinicka onkologie: casopis Ceske a Slovenske onkologicke spolecnosti 38, 20–26 (2024).
Google Scholar
Yurdakul, D., Yazgan-Karataş, A. & Şahin, F. Enterobacter strains might promote colon cancer. Curr. Microbiol. 71, 403–411 (2015).
Article CAS PubMed Google Scholar
Van Antwerp, D. J., Martin, S. J., Verma, I. M. & Green, D. R. Inhibition of TNF-induced apoptosis by NF-κB. Trends Cell Biol. 8, 107–111 (1998).
Article PubMed Google Scholar
Plottel, C. S. & Blaser, M. J. Microbiome and malignancy. Cell Host Microbe 10, 324–335 (2011).
Article CAS PubMed PubMed Central Google Scholar
Woolcott, C. G. et al. Plasma sex hormone concentrations and breast cancer risk in an ethnically diverse population of postmenopausal women: The Multiethnic Cohort Study. Endocr. Relat. Cancer 17, 125 (2010).
Article CAS PubMed PubMed Central Google Scholar
Kibria, M. K. et al. Identification of bacterial key genera associated with breast cancer using machine learning techniques. The Microbe 100228 (2024).
Ma, W. et al. Microbiota enterotoxigenic Bacteroides fragilis-secreted BFT-1 promotes breast cancer cell stemness and chemoresistance through its functional receptor NOD1. Protein Cell 15, 419–440 (2024).
Article CAS PubMed PubMed Central Google Scholar
Kornman, K. S. & Loesche, W. J. Effects of estradiol and progesterone on Bacteroides melaninogenicus and Bacteroides gingivalis. Infect. Immun. 35, 256–263 (1982).
Article CAS PubMed PubMed Central Google Scholar
Nuriel-Ohayon, M. et al. Progesterone increases Bifidobacterium relative abundance during late pregnancy. Cell Rep. 27, 730–736 (2019).
Article CAS PubMed Google Scholar
You, L. et al. Novel directions of precision oncology: Circulating microbial DNA emerging in cancer-microbiome areas. Precision Clin. Med. 5, pdac005 (2022).
Article Google Scholar
Faul, F., Erdfelder, E., Lang, A.-G. & Buchner, A. G* Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav. Res. Methods 39, 175–191 (2007).
Article PubMed Google Scholar

Download references

Acknowledgements

This work was supported by National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) (RS-2021-NR058775, RS-2022-NR067387, and RS-2023-00240003), Ministry of Health and Welfare (RS-2023-00262969), Korean Association for Women Surgeons (202201770001), and Ewha Womans University Research Grant of 2024.

Author information

So Hyun Ahn and Jeongshin An equally contributed to this work.

Authors and Affiliations

Department of Medical Science, Ewha Womans University College of Medicine, Seoul, Republic of Korea
Ji Eun Hong
Ewha Womans University College of Medicine, Seoul, Republic of Korea
Yeon Eun Kim & Yun Soo Kang
Department of Medicine, Yonsei University College of Medicine, Seoul, Republic of Korea
Dong Hyeok Choi
Medical Physics and Biomedical Engineering Lab (MPBEL), Yonsei University College of Medicine, Seoul, Republic of Korea
Dong Hyeok Choi
Department of Radiation Oncology, Yonsei Cancer Center, Heavy Ion Therapy Research Institute, Yonsei University College of Medicine, Seoul, Republic of Korea
Dong Hyeok Choi
Ewha Medical Research Institute, Ewha Womans University College of Medicine, Seoul, Republic of Korea
So Hyun Ahn
Ewha Medical Artificial Intelligence Research Institute, Ewha Womans University College of Medicine, Seoul, Republic of Korea
So Hyun Ahn
Department of Surgery, Ewha Womans University Mokdong Hospital, Ewha Womans University College of Medicine, 1071 Anyangcheon-ro, Yangcheon-gu, Seoul, 07985, Republic of Korea
Jeongshin An
Institute of Convergence Medicine Research, Ewha Womans University Mokdong Hospital, Ewha Womans University College of Medicine, 1071 Anyangcheon-ro, Yangcheon-gu, Seoul, 07985, Republic of Korea
Jeongshin An

Authors

Ji Eun Hong
View author publications
Search author on:PubMed Google Scholar
Yeon Eun Kim
View author publications
Search author on:PubMed Google Scholar
Yun Soo Kang
View author publications
Search author on:PubMed Google Scholar
Dong Hyeok Choi
View author publications
Search author on:PubMed Google Scholar
So Hyun Ahn
View author publications
Search author on:PubMed Google Scholar
Jeongshin An
View author publications
Search author on:PubMed Google Scholar

Contributions

J.E.H., Y.E.K., and Y.S.K wrote the manuscript. J.E.H., Y.E.K., Y.S.K, and D.H.C analysed the data. S.H.A and J.A. conceptualised, reviewed, and revised the manuscript.

Corresponding authors

Correspondence to So Hyun Ahn or Jeongshin An.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Hong, J.E., Kim, Y.E., Kang, Y.S. et al. SMOTE-augmented machine learning model predicts recurrent and metastatic breast cancer from microbiome analysis. Sci Rep 15, 33096 (2025). https://doi.org/10.1038/s41598-025-16790-z

Download citation

Received: 19 March 2025
Accepted: 19 August 2025
Published: 26 September 2025
DOI: https://doi.org/10.1038/s41598-025-16790-z

Subjects

Abstract

Similar content being viewed by others

Biopsy bacterial signature can predict patient tissue malignancy

Prediction of breast cancer using blood microbiome and identification of foods for breast cancer prevention

Predicting diagnosis and survival of bone metastasis in breast cancer using machine learning

Introduction

Methods

Study population and demographics

DNA extraction and microbiome sequencing

Data preprocessing

Data augmentation

Machine learning model development and evaluation

Statistical analysis

Results

Clinical characteristics

Microbiome diversity

Alpha diversity

Beta diversity

Comparative blood microbiome profiles

Data augmentation

Machine learning model performance

Feature importance

Discussion

Conclusion

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Additional information

Publisher’s note

Supplementary Information

Supplementary Information.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links