Introduction

Breast cancer (BC) is the second leading cause of cancer-related death in females, after lung cancer1. In 2020, there were an estimated 2.26 million new cases of BC in females globally, accounting for approximately 11.7% of all newly diagnosed cancer cases2, and 680,000 deaths, accounting for 6.9% of all cancer-related deaths in females2. Because the early symptoms of BC are insidious, most cases are diagnosed at an advanced stage, when the best opportunity for treatment has already been missed. Consequently, the 5-year survival rate of advanced BC is relatively low. It is therefore essential to identify effective biomarkers that aid early diagnosis and the development of novel treatments to improve BC prognosis3.

It has been proposed that genetic mutations leading to changes in DNA sequence, activation of oncogenes and inactivation of tumor suppressor genes are the main causes of tumorigenesis4,5. BC is a complex, heterogeneous disease whose primary risk factors include age, sex, high-fat diet, alcohol intake and obesity6. Its occurrence is also associated with two types of genetic mutations: (1) germline mutations, which are carried in egg or sperm cells7,8; and (2) somatic mutations, which occur due to errors in DNA repair or recombination9. The occurrence of BC can be described as a multi-step process in which each step is thought to be linked to one or more mutations in major regulatory genes.

Early diagnosis of cancer can improve treatment effectiveness. To date, clinical detection of BC is usually carried out by imaging and histopathology10. Imaging diagnosis of BC includes mammography, ultrasound, magnetic resonance imaging (MRI), computed tomography (CT) and positron emission tomography (PET). Of these, mammography, CT and PET carry a risk of radiation exposure10. Moreover, pathological diagnosis of BC is not suitable for rapid diagnosis because of the shortage of pathologists, difficulties in sample labeling and the heavy workload of manual diagnosis11. The growing volume of high-throughput data facilitates the identification of tumor biomarkers that accurately reflect the oncogenesis and progression of BC, which is of great significance for the clinical monitoring of tumors12. Hence, collaborative analysis of multiple biomarkers at the molecular level may be a better option for the early detection of BC.

The rapid development of bioinformatics has made it possible to uncover many potential diagnostic biomarkers for BC at the molecular level13. Jia et al. and Liu et al. constructed BC diagnostic models based on deep learning and bioinformatics analysis in an attempt to find biomarkers for the clinical diagnosis of BC, with AUC values of 0.99 and 0.93, respectively, but the robustness of these findings has yet to be validated in independent external cohorts14,15. Li et al. established a prediction model for BC based on a protein–protein interaction (PPI) network and support vector machines (SVM)16. Although the prediction accuracy of the SVM model reached 95.394%, the results also have yet to be verified in independent external cohorts or experiments16. Moreover, Chen et al. identified hub genes related to the occurrence of BC based on a PPI network, but this research has several shortcomings17: first, it analyzed only the independent diagnostic performance of each gene for BC instead of jointly constructing a classifier; second, it was limited to a single dataset with an insufficient sample size; third, the robustness of these hub genes for BC diagnosis remains to be evaluated.

In the current study, we sought to explore novel genes connected with the pathogenesis of BC. We downloaded gene expression profiles from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) database and identified differentially expressed genes (DEGs) between BC tissue and normal breast tissue samples via bioinformatics analysis. Subsequently, functional enrichment analysis was performed to further explore the biological processes in which the DEGs are involved. Finally, the most robust hub genes were extracted using LASSO-Logistic regression, support vector machine-recursive feature elimination (SVM-RFE) and Logistic regression analysis, and a Logistic model and a nomogram prediction model were established based on the hub genes to further understand the pathogenesis of BC. The workflow of this study is shown in Fig. 1.

Fig. 1 Flowchart for identifying breast cancer (BC) candidate diagnostic biomarkers.

Discussion

BC is one of the most common malignancies in females1. Most patients are diagnosed at an advanced stage, which poses a serious threat to women’s health. BC, similar to other cancers, occurs due to environmental and genetic factors8. Imaging techniques and pathological histology are the major diagnostic tools for BC10. However, due to various limitations, neither is suitable for rapid diagnosis. The European Group on Tumor Markers (EGTM) guidelines state that the application of various biomarkers is one of the substantial aspects of BC diagnosis and monitoring, contributing to a better understanding of the cellular and molecular pathways involved in the pathogenesis of BC18. Accordingly, there is an urgent need to explore specific and sensitive biomarkers for early BC diagnosis.

In the present study, a total of 154 DEGs were identified between BC and adjacent normal samples via a combination of integrated bioinformatics analysis and machine learning. Subsequently, six genes were selected as diagnostic biomarkers of BC by LASSO-Logistic regression, SVM-RFE and Logistic regression analysis: CD300LG, IGSF10, FAM83D, MAMDC2, COMP and SEMA3G. It is well known that the mutation status of BC-related genes is closely associated with the occurrence and progression of BC, especially mutations in BRCA1 and BRCA2. In this study, the six-gene biomarker set was screened based on gene expression profiles. Gene expression is not equivalent to gene mutation; the difference is whether the gene structure is altered. Therefore, the lack of overlap between the six-gene biomarker set obtained in this study and the known BC-related genes reflects the different aspect of the genes under investigation (expression rather than mutation). These six genes are discussed in detail below.

CD300LG (CD300 molecule like family member G), which encodes a protein belonging to the family of CD300 membrane-bound molecules, has a wide variety of immune effects, including the ability to recognize and interact with extracellular lipids19,20. Umemoto et al. reported that CD300LG is selectively expressed in microvessels and has a specific expression pattern in tissue, and may play a crucial role in immune function21. Its expression is rapidly down-regulated by acute inflammation or tumor-related signals21. Bloomstein and colleagues suggested that CD300LG may be involved in the pathophysiology of cervical cancer metastasis and may serve as a potential biomarker for cervical cancer lymph node metastasis22. It has been shown that CD300LG expression is down-regulated in BC tissue compared with normal tissue, which is in agreement with our findings23,24,25. In addition, recent research suggested a link between the Arg82Cys polymorphism of CD300LG and high-density lipoprotein cholesterol (HDLc) metabolism26.

IGSF10 (Immunoglobulin superfamily member 10), which has been implicated in delayed puberty and neuronal development27, is connected with differentiation and developmental processes28. To date, the biological role of IGSF10 in most cancers, and especially in the pathogenesis of BC, has been poorly investigated. Chang et al. identified a potential association between the potential passenger gene IGSF10 and endometrial carcinoma via whole-exome sequencing29. Ling and colleagues reported that IGSF10 expression is down-regulated in lung cancer patients, and confirmed that IGSF10 knockdown significantly promotes lung cancer cell proliferation, enhances the adhesion between cells and stroma, and activates the integrin-β1/FAK pathway30. Additionally, IGSF10 mutations may increase the risk of rectal and gastric cancer28. As a gene that is down-regulated in BC, IGSF10 is a novel potential biomarker for BC diagnosis, treatment and prognosis31,32, which is in accordance with our findings. IGSF10 expression is positively correlated with BC prognosis, and it has been hypothesized that reduced IGSF10 expression and activation of the integrin-β1/FAK pathway underlie the pathogenesis of BC32.

FAM83D (Family with sequence similarity 83 member D) belongs to the FAM83 family, which can regulate cell proliferation, growth, migration and epithelial–mesenchymal transition (EMT)33,34. It has been suggested that FAM83D is not an oncogenic factor, but may be an indirect indicator of tumors that are more dependent on cell division and mitosis35. Overexpression of FAM83D not only promotes cell proliferation, migration, invasion and EMT by down-regulating the tumor suppressor protein FBXW736, but is also involved in the activation of MAPK signaling, which promotes BC cell growth and drives epithelial cell transition37. Snijders and colleagues pointed out that, in BC, alterations in FAM83 family genes are significantly correlated with TP53 mutations38. FAM83D is involved in the development and progression of multiple cancers and is up-regulated in BC39, endometrial cancer40, colorectal cancer41 and lung cancer42; it can therefore be regarded as a potential candidate gene for the diagnosis and prognosis of various cancers. According to recent studies, FAM83D expression is significantly up-regulated in primary BC tissue and is associated with poor prognosis and distant metastasis in BC patients35,36,39,43.

Beckmann et al. identified the presence of the MAM domain in the functionally diverse receptors meprin, A5 protein and protein tyrosine phosphatase μ, and proposed that it may be widely present in various adhesive proteins44. MAMDC2 (MAM domain containing 2), a member of the MAM family, is recognized as a secreted protein. Although MAMDC2 is differentially expressed in several types of cancer, including oral squamous cell carcinoma45, gastric cancer46 and BC25, its exact molecular function has not been determined. In line with our findings, Lee et al. reported that MAMDC2 expression is down-regulated in BC tissue and has good diagnostic performance47. Moreover, MAMDC2 may be an extracellular regulator of cell proliferation, exerting its growth-inhibitory function by attenuating the MAPK signaling pathway47.

COMP (Cartilage oligomeric matrix protein), a metabolic marker of cartilage, is an extracellular matrix protein that can modulate cell phenotype in the course of histogenesis and remodeling48. Intriguingly, COMP can be expressed by cancer cells and aggravate malignant diseases33,49. It has been established that COMP expression in cancer cells renders them resistant to apoptosis and enhances their invasive potential and the Warburg metabolic effect33,49,50. The mechanisms of COMP in BC progression are still under investigation. Papadakos et al. confirmed that COMP affects the migration, invasion and metabolism of BC cells as well as the abundance of cancer stem cells51. A high expression level of COMP in BC makes the disease more aggressive52. The present study found that the COMP expression level was significantly elevated in BC tissue compared with adjacent normal tissue.

It has been proposed that SEMA3G (Semaphorin 3G) is a regulator of angiogenesis that contributes to the patterning of vascular networks53. In addition, as a potential transcription gene associated with cancer susceptibility candidate 9 (CASC9), SEMA3G is involved in the pathogenesis of lung squamous cell carcinoma (LUSC)54. Ji et al. noted that SEMA3G is a predictor of poor prognosis in testicular germ cell tumor patients55. SEMA3G expression is also connected with the prognosis of glioblastoma and lung cancer patients56,57. It has been reported that SEMA3G is a protective factor for lung cancer and glial tumor patients, and that its high expression predicts good clinical outcomes57,58. However, SEMA3G appeared to exert the opposite effect on the pathogenesis of BC in the current study. Since there are few studies on the relationship between SEMA3G and BC, its exact mechanism of action needs to be further investigated.

Next, in the training cohort, we used the six hub genes to construct a diagnostic nomogram for individual risk quantification. The AUC value of the ROC analysis for the model was 0.978 (0.962, 0.995), indicating that the model had considerable risk prediction power. The calibration curve showed that the probability predicted by the nomogram was in agreement with the actual results observed in the training cohort. Meanwhile, the decision curve and clinical impact curve showed that the model had good clinical benefit. In addition, the predictive performance of the model was satisfactory in the training and validation sets of the validation cohort: the AUC values were greater than 0.900. In conclusion, these results suggested that the six-gene diagnostic model was connected with BC pathogenesis, may carry good net benefit in clinical practice, and represents a potential diagnostic biomarker signature for BC.

Our study also found that the DEGs related to the pathogenesis of BC were mainly enriched in cell cycle pathways. Of note, according to the GSEA, the E2F targets and G2M checkpoint gene sets were strongly enriched in BC. E2F targets and the G2M checkpoint are involved in regulating the cell cycle, and play crucial roles in tumor progression and metastasis. The proteins encoded by E2F target genes are associated with DNA replication, cyclins and the E2F transcription factors59. Dysregulated expression of E2F target genes is common in a variety of malignancies, including BC59. The E2F transcription factor family acts as an indispensable downstream effector in pathways that control cell cycle progression, the G1/S transition and DNA replication-related gene expression60. Alterations in components that regulate the E2F pathway are found in almost all human malignancies, and many of these components, including the E2F transcription factors themselves, play important parts in BC progression, metastasis and therapeutic response61. Masanori and colleagues reported that the E2F transcription factor is a marker of BC aggressiveness and can predict the response to neoadjuvant chemotherapy in ER-positive/HER2-negative patients59. In addition, the G2M checkpoint is regarded as a biomarker of cancer progression and is strongly correlated with the E2F targets62. Both E2F targets and the G2M checkpoint are associated with TP53 mutations62. These results suggest that the DEGs may be associated with BC development through the cell cycle-related E2F transcription factor and G2M checkpoint pathways.

Previous studies on biomarkers for BC diagnosis have limitations14,15,16,17. The robustness of detection remains to be improved and validated in independent external cohorts or experiments. Small sample sizes and an insufficient number of markers compromised diagnostic reliability, suggesting that a classifier that includes more hub genes might improve accuracy. This study presented a novel diagnostic model that can effectively distinguish BC from healthy individuals through integrated bioinformatics analysis with machine learning.

Some limitations should be kept in mind when interpreting the data. First, the samples were obtained from publicly accessible databases, and further verification of diagnostic performance is warranted. Risk factors for BC such as age, family history and menstrual status should be taken into consideration, and we plan to evaluate the potential influence of these factors on the classifier in future work. Second, the datasets in the study mainly come from Asian, European and North American populations, and it remains unclear whether the conclusions apply to other populations. The diagnostic efficacy and therapeutic potential of these hub genes for BC patients with different stages, degrees of differentiation and other characteristics still need to be verified in multiple clinical cohorts, which will help to achieve clinical translation. Finally, the model constructed in this paper needs to be further compared with other state-of-the-art algorithms, so that the advantages of the prediction model can be more comprehensively demonstrated.

Conclusions

In this study, we constructed a six-gene diagnostic model that reliably distinguishes BC from healthy individuals and verified it in an independent external cohort. Our study revealed that these hub genes play complex roles in immune regulation, cell adhesion, cell proliferation and signaling pathway regulation in BC, and may serve as diagnostic markers and therapeutic targets for BC, helping to improve the prognosis of BC. Enrichment analysis showed that the DEGs related to BC are enriched in the cell cycle, E2F targets and G2M checkpoint pathways, which may shed new light on the underlying mechanism of tumor progression and aid the clinical diagnosis of BC.

Materials and methods

Gene expression datasets preparation

The microarray datasets used in the current study were downloaded from the GEO database63. The GEO database was searched with the keywords “breast cancer” and “normal”, and appropriate datasets were selected for subsequent analysis according to the following criteria: (1) samples were collected from human BC and normal breast tissue; (2) gene expression profiles were complete; and (3) at least 40 tissue samples were included. The training cohort comprised a total of 287 BC and 76 adjacent normal samples from GSE29044, GSE42568 and GSE162228. The validation cohort contained a total of 360 BC and 218 adjacent normal samples from GSE37751, GSE29431, GSE109169, GSE115144, GSE120129, GSE103512, GSE134359 and GSE38959. After batch effects were removed using the ComBat empirical Bayes method in the surrogate variable analysis (“sva”) package64, these datasets were integrated as the training cohort and the validation cohort, respectively. Additionally, the validation cohort was itself split into a training set and a validation set at a ratio of 7:3. The data in the datasets were normalized and log2 transformed. Supplementary Fig. 1 shows PCA and box plots before and after batch-effect removal. Details of the datasets used in this study are given in Table 1.
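The 7:3 partition of the validation cohort can be illustrated with a minimal Python sketch. This is not the authors' code: the text does not state how the split was drawn, so a simple seeded random shuffle is assumed here, and the function name `split_cohort` and the placeholder sample IDs are hypothetical.

```python
import random


def split_cohort(sample_ids, train_fraction=0.7, seed=0):
    """Randomly partition sample IDs into a training set and a validation set.

    Assumption: a plain (non-stratified) random split; the source only gives the ratio.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(sample_ids)    # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder IDs for the 578 validation-cohort samples (360 BC + 218 adjacent normal).
ids = [f"sample_{i}" for i in range(578)]
train_ids, valid_ids = split_cohort(ids)
print(len(train_ids), len(valid_ids))  # 405 173
```

A stratified split that preserves the BC-to-normal ratio in both subsets would be a reasonable alternative when the classes are imbalanced.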

Table 1 Datasets used in the training cohort and validation cohort.

Identification of differentially expressed genes and functional enrichment analysis

In the training cohort, the DEGs between BC and adjacent normal samples were identified via the linear models for microarray data (“limma”) package65 in R. Genes that satisfied the thresholds of absolute log2FoldChange (|log2FC|) > 2 and adjusted p-value < 0.05 were considered significantly differentially expressed and were selected for subsequent analysis.
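The thresholding step can be illustrated with a short Python sketch. It does not reproduce the limma analysis itself; it assumes that log2FC values and adjusted p-values have already been computed, and the function name `select_degs` and the toy gene names and values are hypothetical.

```python
def select_degs(results, lfc_cutoff=2.0, padj_cutoff=0.05):
    """Split genes into up- and down-regulated DEGs.

    results maps gene name -> (log2 fold change, adjusted p-value).
    A gene is kept when |log2FC| > lfc_cutoff and adjusted p < padj_cutoff.
    """
    up, down = [], []
    for gene, (log2fc, adj_p) in results.items():
        if adj_p < padj_cutoff and abs(log2fc) > lfc_cutoff:
            if log2fc > 0:
                up.append(gene)      # higher expression in BC
            else:
                down.append(gene)    # lower expression in BC
    return sorted(up), sorted(down)


# Toy input, for illustration only (not values from this study).
toy_results = {
    "GENE_A": (2.8, 1e-6),   # passes both cutoffs, up-regulated
    "GENE_B": (-3.1, 1e-4),  # passes both cutoffs, down-regulated
    "GENE_C": (2.5, 0.20),   # fails the adjusted p-value cutoff
    "GENE_D": (-1.2, 1e-8),  # fails the fold-change cutoff
}
up_genes, down_genes = select_degs(toy_results)
print(up_genes, down_genes)  # ['GENE_A'] ['GENE_B']
```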

To elucidate the potential biological processes and molecular functions of the DEGs, the “org.Hs.eg.db” package in R was used to convert gene symbols into the corresponding Entrez IDs. Subsequently, the “clusterProfiler” R package66 was used for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis at the cutoff criteria of p-value < 0.05 and false discovery rate (FDR) < 0.05. Furthermore, to explore the biomarker performance of the DEGs in the training cohort, Gene Set Enrichment Analysis (GSEA) was performed using the gene set variation analysis (“GSVA”) R package, with the annotated h.all.v7.0.entrez.gmt gene set retrieved from the Molecular Signatures Database (MSigDB, https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) as the reference gene set67,68. An absolute normalized enrichment score (|NES|) > 1, FDR q-value < 0.25 and p-value < 0.05 were used as the significance criteria.

Candidate diagnostic biomarkers evaluation and signature establishment

LASSO-Logistic regression was performed with tenfold cross-validation, as implemented in the R package “glmnet”69, and the λ value with the minimum cross-validation error was selected as the criterion for screening the best diagnostic features in the training cohort. Subsequently, the SVM-RFE algorithm was applied for further dimensionality reduction based on the results of the LASSO analysis70: the classification model with the highest tenfold cross-validation accuracy was selected as the optimal model71,72. Finally, multivariate stepwise Logistic regression was used to extract genes significantly associated with BC as hub genes (p < 0.05), and the optimal diagnostic model of BC was then obtained based on these genes. Moreover, a diagnostic nomogram was created to visualize the predicted probability for each individual. A calibration curve was established to assess the accuracy and reliability of the nomogram prediction, and the Hosmer–Lemeshow test was applied to verify it. Decision curve and clinical impact curve analyses were conducted via the R package “rmda” to evaluate the clinical effectiveness of the prediction model, and the results were visualized with the “DecisionCurve” R package73. The performance and discriminative power of the final diagnostic model were assessed using the area under the curve (AUC).
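How a fitted Logistic model turns gene expression values into a predicted probability, and how coefficients relate to odds ratios, can be sketched as follows. The coefficients and intercept below are hypothetical placeholders chosen only for illustration; they are not the fitted values reported in Table 2, and the function names are assumptions.

```python
import math


def predicted_probability(intercept, coefficients, expression):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum_i b_i * x_i)))."""
    linear = intercept + sum(
        coefficients[gene] * expression[gene] for gene in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear))


def odds_ratios(coefficients):
    """The odds ratio per unit increase in expression is exp(coefficient)."""
    return {gene: math.exp(beta) for gene, beta in coefficients.items()}


# Hypothetical coefficients: positive for a risk factor, negative for a protective factor.
toy_coef = {"FAM83D": 1.0, "CD300LG": -1.0}
toy_expr = {"FAM83D": 2.0, "CD300LG": 2.0}

prob = predicted_probability(0.0, toy_coef, toy_expr)
ors = odds_ratios(toy_coef)
print(prob)  # 0.5, because the two terms cancel in this toy example
```

A nomogram is a graphical form of the same calculation: each covariate value is mapped to points proportional to its contribution to the linear predictor, and the total is mapped to a probability.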

Validation of the diagnostic signature

To further verify the stability of the diagnostic model, we applied the model to the training and validation sets of the validation cohort, respectively. The expression profiles of the hub genes in the validation cohort were extracted for subsequent analysis. A split violin plot was drawn to evaluate the differential expression of the hub genes between BC and adjacent normal samples. Additionally, the predictive performance of the model was evaluated by receiver operating characteristic (ROC) curve and confusion matrix analysis, and the discriminative ability of the model in BC diagnosis was summarized by the AUC values together with sensitivity and specificity.
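The AUC used here has a simple rank-based interpretation: it is the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative sample. The Python sketch below computes it by direct pairwise comparison; it is an illustration of the quantity, not the pROC implementation used in this study, and the scores are toy values.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    if not scores_pos or not scores_neg:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy predicted probabilities, for illustration only.
toy_pos = [0.9, 0.8, 0.4]   # samples that truly belong to the positive class
toy_neg = [0.7, 0.3, 0.2]   # samples that truly belong to the negative class
toy_auc = auc(toy_pos, toy_neg)
print(round(toy_auc, 3))  # 0.889 (8 of 9 pairs are ranked correctly)
```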

Statistical analysis

All statistical analyses and graphs were produced in R software (version 4.1.2, https://www.r-project.org/). The “glmnet” package in R was used for LASSO-Logistic regression analysis, and the “caret” package and the glm function were used for SVM-RFE analysis and construction of the Logistic diagnostic model, respectively. The nomogram was established with the regression modeling strategies (“rms”) package, and the “pROC” and “InformationValue” packages were applied to ROC curve and confusion matrix analysis, respectively. Odds ratios (OR) and the corresponding 95% confidence intervals (CI) were estimated with the Logistic regression model. The independent-samples t-test or the Mann–Whitney U-test was used to compare continuous variables. P-values were adjusted by the Benjamini–Hochberg method to control the FDR74. Statistical significance was defined as p < 0.05.
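For readers unfamiliar with the Benjamini–Hochberg adjustment, the following Python sketch shows the standard step-up calculation. It is an illustration written for this text, not the R code used in the study; the function name and the toy p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    For the p-value of rank r (1 = smallest) among m tests, the raw adjusted
    value is p * m / r; a running minimum taken from the largest rank downward
    enforces monotonicity, and values are capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


toy_p = [0.01, 0.04, 0.03, 0.005]
toy_adj = benjamini_hochberg(toy_p)
print([round(p, 3) for p in toy_adj])  # [0.02, 0.04, 0.04, 0.02]
```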

Results

Identification of the DEGs in the training cohort

We used the “limma” R package to identify the DEGs between BC tissue and adjacent normal breast tissue. Of the 20,862 gene profiles in the training cohort after log transformation and normalization, 154 DEGs were identified at the thresholds of adjusted p-value < 0.05 and |log2FC| > 2, of which 51 were up-regulated and 103 were down-regulated in the BC group (Fig. 2A, B, Supplementary Table 1).

Fig. 2 Identification of DEGs in the training cohort at the cutoff criteria of adjusted p-value < 0.05 and |log2FoldChange| > 2. (A) Volcano plot of gene expression profiles of BC versus adjacent normal tissue samples; (B) Heatmap of the expression of the 154 DEGs in BC and adjacent normal tissue samples.

Functional enrichment analysis of the DEGs

To further reveal the potential biological mechanisms of the DEGs, we performed functional enrichment analysis on the 154 DEGs for GO terms, including biological process (BP), cellular component (CC) and molecular function (MF), and for KEGG pathways75,76,77. GSEA was also conducted based on the hallmark gene sets in MSigDB. In the three GO categories, the most enriched terms were nuclear division, collagen-containing extracellular matrix and extracellular matrix structural constituent, respectively (Fig. 3A–C). PPAR, Cell cycle, p53, Oocyte meiosis, Tyrosine metabolism and AMPK were the most enriched KEGG pathways (Fig. 3D). Moreover, the GSEA demonstrated that the HALLMARK_E2F_TARGETS (NES = 4.47, FDR q-value < 0.0001), HALLMARK_G2M_CHECKPOINT (NES = 4.36, FDR q-value < 0.0001) and HALLMARK_MITOTIC_SPINDLE (NES = 3.88, FDR q-value < 0.0001) gene sets were the most enriched gene sets in BC (p-value < 0.0001) (Fig. 3E–H).

Fig. 3 Enrichment analysis of the DEGs. (A) Biological process. (B) Cellular component. (C) Molecular function. (D) KEGG pathway analysis. (E–H) GSEA analysis using hallmark gene sets from MSigDB.

Identification of diagnostic biomarkers and establishment of diagnostic signature in the training cohort

To obtain the most robust diagnostic indicators for BC, we performed LASSO-Logistic regression on the 154 DEGs in the training cohort for dimensionality reduction. We used tenfold cross-validation to tune the parameters and limit over-fitting. The LASSO regression results showed that the λ value corresponding to the minimum cross-validation error was 0.057, at which point the model with 26 genes was the optimal predictor (Fig. 4A). In this model, all 26 candidate genes associated with diagnosis had non-zero LASSO coefficients (Fig. 4B, C). Subsequently, we applied the SVM-RFE algorithm to further identify diagnostic candidates that optimize the performance of the classifier. Grid search and tenfold cross-validation were used to tune and evaluate the SVM hyperparameters. A radial basis function (RBF) kernel was employed, with a cost parameter of 0.25 and a sigma of 0.15. We chose the classification model with the highest accuracy, which contained 10 candidate genes (Fig. 4D–F). The accuracy of the SVM-RFE model was 93.44%, with a kappa of 0.791, a positive predictive value (PPV) of 87.50% and an F1 score of 0.851 (Supplementary Table 2). Finally, multivariate stepwise Logistic regression was applied to examine the relationship between candidate gene expression levels and disease status, and the six biomarkers with the most robust diagnostic performance were retained to construct a diagnostic model for BC. According to the Logistic analysis, elevated expression of FAM83D, COMP and SEMA3G was a risk factor for BC, whereas expression of CD300LG, IGSF10 and MAMDC2 was a protective factor (Fig. 5A, Table 2). Moreover, the cluster heatmap of the six biomarkers revealed that the expression of these genes in the training cohort differed significantly between BC and adjacent normal breast tissue (Fig. 5B). Together, these data indicated that our classifier was robust in distinguishing BC from normal tissue.
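The summary statistics reported for the SVM-RFE model (accuracy, kappa, PPV and F1 score) can all be derived from a 2 × 2 confusion matrix. The Python sketch below shows the standard formulas; the function name is hypothetical and the counts are toy values chosen for easy arithmetic, not the counts behind Supplementary Table 2.

```python
def classification_summary(tp, fp, fn, tn):
    """Accuracy, PPV, recall, F1 and Cohen's kappa from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    ppv = tp / (tp + fp)                 # positive predictive value (precision)
    recall = tp / (tp + fn)              # sensitivity
    f1 = 2 * ppv * recall / (ppv + recall)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "ppv": ppv, "recall": recall,
            "f1": f1, "kappa": kappa}


# Toy counts, for illustration only.
summary = classification_summary(tp=40, fp=5, fn=10, tn=45)
print({name: round(value, 3) for name, value in summary.items()})
```

Kappa corrects accuracy for chance agreement, which makes it more informative than accuracy alone when the two classes are imbalanced, as they are in the training cohort.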

Fig. 4 LASSO-Logistic regression and SVM-RFE results of the DEGs. (A–C) The 26 genes most strongly correlated with BC were identified by LASSO-Logistic regression. (D–F) Accuracy, kappa values and confusion matrix heatmap of the SVM-RFE results.

Fig. 5 Construction and evaluation of the diagnostic model. (A) Multivariate stepwise Logistic regression model. (B) Cluster heatmap of the six diagnostic biomarkers in the training cohort. (C) Diagnostic nomogram for quantifying individual risk. (D) Calibration curve showing the agreement between the nomogram-predicted risk of BC and the observed outcomes. (E, F) Decision curve and clinical impact curve evaluating the clinical benefit of the model. (G) ROC analysis of the model for predicting BC risk. (H) Confusion matrix heatmap of the model in the training cohort.

Table 2 Odds ratio and coefficient of the diagnostic signature.

To quantify the risk for each individual, we constructed a nomogram involving the six candidate genes. We then randomly selected an individual in the training cohort, mapped the value of each covariate to the corresponding score, and calculated the total score to predict the probability of BC. The predicted probability of BC for individual No. 25 (GSM719696) was 0.247 (Fig. 5C). We carried out calibration curve analysis to evaluate the prediction performance of the diagnostic model, and the results revealed that the probability predicted by the nomogram was in agreement with the actual outcomes observed in the training cohort (Fig. 5D), indicating that the nomogram had good accuracy and reliability. Moreover, the decision curve, drawn with net benefit (NB) on the vertical axis and the high-risk threshold on the horizontal axis, showed that the NB of the model was greater than 0, which was of clinical significance. When the high-risk threshold was 0.6, the NB was 0.52, meaning that 52 out of every 100 BC patients would benefit from diagnosis and treatment guided by the model without harming others (Fig. 5E, Supplementary Table 3). The clinical impact curve of the model showed that the number of people classified as high risk by the diagnostic model was largely consistent with the number of true positives at the threshold probability of 0.6 (Fig. 5F). Both the decision curve and the clinical impact curve indicated that the diagnostic model had good net benefit and predictive performance. Additionally, the ROC curve analysis indicated that the AUC of the model was 0.978 (0.962, 0.995) (Fig. 5G), with the resulting confusion matrix heatmap of the classification performance shown in Fig. 5H. The diagnostic model had a sensitivity and specificity of 0.829 and 0.976, respectively, in distinguishing BC patients from healthy individuals in the training cohort (Table 3). In addition, the accuracy, recall and precision of the model were 0.945 (0.916, 0.966), 0.829 and 0.900, respectively, indicating that the classifier performed well (Table 3).
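The net benefit plotted on a decision curve follows the standard definition NB = TP/n − (FP/n) × pt/(1 − pt), where pt is the high-risk threshold. The Python sketch below evaluates this formula; it is an illustration of the definition rather than the rmda implementation, and the counts are toy values, not those underlying Supplementary Table 3.

```python
def net_benefit(true_pos, false_pos, n_total, threshold):
    """Net benefit at a given risk threshold.

    NB = TP/n - (FP/n) * pt / (1 - pt): true positives are credited and
    false positives are penalised by the odds implied by the threshold pt.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    weight = threshold / (1.0 - threshold)
    return true_pos / n_total - (false_pos / n_total) * weight


# Toy counts at a threshold of 0.6, for illustration only.
toy_nb = net_benefit(true_pos=60, false_pos=10, n_total=100, threshold=0.6)
print(round(toy_nb, 2))  # 0.45
```

Because the false-positive penalty grows with the threshold, a model needs proportionally fewer false positives to keep a positive net benefit at higher thresholds.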

Table 3 Diagnostic performance of the diagnostic model.

Validation of the diagnostic signature in the independent validation cohort

We validated the diagnostic model in the validation cohort to confirm its robustness. The expression levels of the six genes in the model differed significantly between BC tissue and normal tissue samples (Fig. 6A). Notably, the ROC analysis showed that the AUC values of the model in the training set and validation set of the validation cohort were 0.936 (0.910, 0.961) and 0.921 (0.870, 0.972), respectively (Fig. 6B, D, Table 3). The confusion matrices of classification performance are shown in Fig. 6C and E. These results revealed that the prediction performance of the diagnostic model was satisfactory, and that it performed well in distinguishing BC patients from healthy individuals in both the training set and the validation set.

Fig. 6 Evaluating the diagnostic model in the validation cohort. (A) Expression profiles of the six diagnostic biomarkers in BC and adjacent normal breast tissue samples. (B, D) ROC analysis of the model for predicting BC risk. (C, E) Confusion matrix heatmap of the model.

Contribution to the field statement

Breast cancer (BC) is the second leading cause of cancer-related death in females, after lung cancer. Conventional diagnostic techniques for BC have disadvantages, such as radiation risk. Therefore, it is essential to identify effective biomarkers that aid early diagnosis and the development of novel treatments to improve BC prognosis. The present study integrated bioinformatics analysis with machine learning to elucidate potential key candidate genes associated with the tumorigenesis of BC. As a result, a total of six hub genes connected with BC pathogenesis were identified, and a diagnostic model of BC based on these genes was established. ROC analysis of the diagnostic model showed that the AUC in the training cohort was 0.978 (0.962, 0.995). In the validation cohort, the AUCs of the training set and validation set were 0.936 (0.910, 0.961) and 0.921 (0.870, 0.972), respectively. The results indicated that the model was reliable in separating BC patients from healthy individuals. Our findings may assist in the early diagnosis of BC, with implications for improving the prognosis of BC patients.