Introduction

Breast cancer remains the most frequently diagnosed malignancy among women worldwide1. Axillary lymph node (ALN) status is a critical determinant of tumor staging, prognosis, and treatment decision-making in patients with early breast cancer2. The sentinel lymph node (SLN), defined as the first node(s) within the ALN drainage basin, is the key target for axillary staging. SLN biopsy (SLNB) has replaced ALN dissection (ALND) as the standard surgical procedure for early-stage patients without clinically evident axillary metastasis3. Recently, several large de-escalation trials, including SENOMAC4, AMAROS5, and the more recent INSEMA6, have further shifted surgical axillary management toward minimal invasiveness. INSEMA demonstrated that omission of SLNB may be safe in selected cT1–2, clinically node-negative patients. However, these trials primarily focus on patients with low axillary tumor burden and address whether axillary surgery can be safely reduced or avoided.

In contrast, accurate preoperative stratification of ALN tumor burden remains essential, because a subset of clinically node-negative (cN0) patients still harbor high tumor burden and may not be appropriate candidates for omission of SLNB. Timely identification of this subgroup would improve the safety of selectively omitting SLNB in low-risk patients, streamline surgical decision-making, and enhance the overall patient experience.

Axillary ultrasound (US) is the primary imaging method used to preoperatively assess ALN status7. However, the diagnostic accuracy of axillary US for detecting ALN metastasis varies widely (sensitivity from 48.8% to 87.1%; specificity from 55.6% to 97.3%)8,9,10, especially for clinically negative LNs. The main reason may be the considerable morphological overlap between LNs with and without metastasis. Small metastases that cause only minor changes on US images can yield false negatives, whereas nonspecific lymph node enlargement and cortical thickening that are unrelated to metastasis can yield false positives10,11,12. Moreover, conventional axillary US has limited value for identifying the SLN(s) and predicting the lymph node tumor burden13.

Deep learning (DL) has been widely used in medicine and can effectively improve imaging-based diagnosis14. Previous studies have shown that DL applied to US images of primary breast cancer lesions can predict ALN metastasis status, with AUCs ranging from 72% to 90%15,16,17,18,19,20,21,22,23. Nevertheless, such approaches are indirect, relying on features of the primary tumor rather than on nodal characteristics themselves. They cannot be applied to patients with multifocal lesions or those who have received prior local treatment15,17,18. Importantly, multifocal or treated cases are not rare, with reported proportions ranging from 6% to 60%24. Therefore, developing DL models that rely solely on ALN or SLN imaging features is essential for robust and generalizable preoperative axillary evaluation. To date, only a few DL models based on SLN images have been developed specifically for tumor burden stratification in cT1-2N0 breast cancer patients. Contrast-enhanced ultrasound (CEUS) can accurately identify SLN(s) by tracing lymphatic vessels, offering the opportunity for direct assessment of the SLN(s) in this study11,25.

Our study aimed to develop and validate a DL model based on SLN US images to evaluate the preoperative tumor burden of axillary lymph nodes in patients with cT1-2N0 breast cancer, with the goal of identifying those at high risk of extensive nodal involvement who may directly benefit from ALND.

Patients and methods

Development cohort and clinical test cohorts

This study was a prospective multicenter study and was approved by the Ethics Committee of Peking Union Medical College Hospital (approval number: ZA-2291). This study was registered at www.chictr.org.cn with the clinical trial registration number ChiCTR2000031231 on March 25, 2020. All the participants signed informed consent forms. We enrolled consecutive patients from the breast surgery department of Peking Union Medical College Hospital between April 2020 and July 2021 and Sichuan Cancer Hospital between April 2022 and July 2022. The inclusion criteria were as follows: (a) histopathologically confirmed clinically staged T1-T2 invasive breast cancer; (b) clinically negative ALNs; (c) successful detection of the SLN with CEUS; and (d) SLN biopsy for axillary lymph node staging. The exclusion criteria were as follows: (a) received axillary therapy (biopsy, neoadjuvant chemotherapy); (b) lacked full-screen conventional (grayscale or color Doppler mode) US images of SLNs identified by CEUS; and (c) incomplete clinicopathological information.

Acquisition of grayscale SLN images and pathological outcomes of ALNs

All sonographic examinations were performed preoperatively by experienced radiologists specializing in breast US (Q.Z., M.L., more than 20 years; J.L., 9 years; Z.N., 7 years) using high-frequency linear array transducers on iU22 and EPIQ 7 systems (Philips-Advanced Technology Laboratories, Bothell, WA, USA; L12-5 MHz). The SLN(s) were identified as the first visualized LN(s) traced by enhanced lymphatic channels on CEUS imaging. The dual display mode showed both CEUS and grayscale images simultaneously. Then, full-screen grayscale images of as many sections of the SLNs as possible were acquired. One or two days later, the surgeons identified the SLN(s) using methylene blue dye and indocyanine green. Both micro- and macrometastases were defined as metastases. The details of the percutaneous CEUS procedure and surgical management of ALNs are provided in Supplementary material 1.

Pathological assessment

Estrogen receptor (ER), progesterone receptor (PgR), and human epidermal growth factor receptor 2 (HER2) status were evaluated by immunohistochemistry (IHC) on surgical or biopsy specimens. ER and PgR positivity were defined as ≥ 1% of tumor cell nuclei showing immunoreactivity, according to current ASCO/CAP guidelines. HER2 status was determined following the ASCO/CAP recommendations: IHC 3+ was considered positive, whereas IHC 2+ cases underwent in situ hybridization (ISH), and HER2 amplification by ISH was classified as HER2-positive. Molecular subtypes were assigned based on ER, PgR, and HER2 status and the Ki-67 index according to the St. Gallen consensus. Histological grade was determined using the Nottingham Grading System, incorporating tubule formation, nuclear pleomorphism, and mitotic count. Ki-67 was assessed by IHC, and a 20% cutoff was used to distinguish low from high proliferative activity, consistent with current clinical practice.

Input preparation (images and clinical characteristics)

The input images were manually cropped by experienced radiologists, who tailored each image so that the target sentinel lymph node (SLN) occupied the predominant portion of the final image. The crop region was also chosen to prevent shape deformation or an imbalance between image height and width before the images were fed into the deep neural networks. The resulting crop coordinates were subsequently used for image cropping. All images were then resized to a standard size of 512 × 512 and subjected to supplementary image augmentations.
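The cropping constraint described above (a node-centered region with balanced height and width, later resized to 512 × 512) can be illustrated with a minimal sketch. The helper names `square_crop_box` and `resize_factor`, and the rule of growing the box to its longer side, are assumptions made for illustration and are not taken from the study's preprocessing code.

```python
def square_crop_box(x0, y0, x1, y1, img_w, img_h):
    """Return a square (left, top, right, bottom) box around a node bounding box.

    Hypothetical helper: the box is grown to the longer side of the input box,
    capped at the image size, and shifted so that it stays inside the image.
    All arguments are integer pixel coordinates.
    """
    side = max(x1 - x0, y1 - y0)
    side = min(side, img_w, img_h)  # a square cannot exceed the image
    cx = (x0 + x1) / 2.0            # center of the original box
    cy = (y0 + y1) / 2.0
    left = int(round(cx - side / 2.0))
    top = int(round(cy - side / 2.0))
    # Shift the square back inside the image if it overlaps a border.
    left = max(0, min(left, img_w - side))
    top = max(0, min(top, img_h - side))
    return left, top, left + side, top + side


def resize_factor(side, target=512):
    """Uniform scale factor that maps a square crop to target x target."""
    return target / float(side)


box = square_crop_box(100, 200, 300, 260, img_w=800, img_h=600)
print(box, resize_factor(box[2] - box[0]))
```

Because the crop is square before resizing, the same scale factor applies to both axes, so the node's shape is not distorted.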

In addition to the US images, we collected patient characteristics using a patient characteristics data selection procedure. The final set included age, the longitudinal and short diameters of the SLN, the longitudinal-to-short diameter ratio of the SLN, cortical thickness, and the estrogen receptor, progesterone receptor, human epidermal growth factor receptor 2, and Ki-67 values of the primary breast cancer. For multifocal cases, we consistently used the largest lesion to determine tumor size, molecular subtype, and other primary tumor characteristics. For patients who had received preoperative treatment, we incorporated the pre-treatment biopsy results and, whenever available, tumor size information documented in outside-hospital reports. All of these data served as inputs to our proposed model. Further preprocessing details are provided in Supplementary material 2.

Modality-adaptive network development

The training set was used to optimize the parameters of our proposed modality-adaptive network with clinicopathological information (MAN + C). The MAN + C pipeline is shown in Fig. 1. IBN-ResNet26 was adopted as the backbone of our DL model. The optimization was performed on top of an IBN-ResNet model pretrained on the ImageNet dataset. The bulk of the model was retained, except that the output linear layer was replaced by a squeeze-and-excitation27 module followed by a custom classification block tailored for predicting the ALN tumor burden. Moreover, we analyzed 9 different clinical characteristics, as described in the preceding section. The 15-element input, comprising 6 single-valued data points and 3 × 3 categorical encodings representing null, 1 and 0, is processed through a dedicated patient data network (PDN) with three distinct fully connected layers. After obtaining the US image features and clinical characteristic features, we merged them in a feature fusion module, in which the outputs of both IBN-ResNet and the PDN were batch normalized and concatenated. Finally, a feed-forward classifier layer was added to produce a scalar for each data sample, which was the final prediction of the model. Further details are provided in Supplementary material 3.
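The 15-element PDN input can be sketched as follows. This is an illustration only: the text gives the counts (6 single-valued features plus 3 categorical variables, each encoded over null, 1 and 0) but not which variables fall in which group, so the split below (receptor statuses as categorical), the field names, and the ordering of the one-hot slots are all assumptions.

```python
# Assumed split of the 9 characteristics; the field names are hypothetical.
CONTINUOUS = ["age", "sln_long_mm", "sln_short_mm",
              "long_short_ratio", "cortical_thickness_mm", "ki67"]
CATEGORICAL = ["er", "pr", "her2"]


def one_hot_status(value):
    """Encode a status as [null, positive, negative] (assumed slot order)."""
    if value is None:
        return [1.0, 0.0, 0.0]
    if value == 1:
        return [0.0, 1.0, 0.0]
    if value == 0:
        return [0.0, 0.0, 1.0]
    raise ValueError("status must be None, 1 or 0")


def encode_patient(record):
    """Build the 15-element vector: 6 continuous values + 3 x 3 one-hot codes."""
    vec = [float(record[name]) for name in CONTINUOUS]
    for name in CATEGORICAL:
        # A missing key is treated as a null (unknown) status.
        vec.extend(one_hot_status(record.get(name)))
    return vec


example = {"age": 52, "sln_long_mm": 14.0, "sln_short_mm": 6.0,
           "long_short_ratio": 14.0 / 6.0, "cortical_thickness_mm": 2.5,
           "ki67": 30, "er": 1, "pr": 0}  # HER2 intentionally missing
print(len(encode_patient(example)))
```

Giving the null state its own slot lets a network distinguish "unknown" from "negative" without imputing a value.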

Fig. 1

The pipeline of the proposed modality-adaptive network for ALN tumor burden prediction.

The location of the SLN(s) can be traced by CEUS. The dual display mode showed both CEUS and grayscale images simultaneously. Then, full-screen, conventional US images of the SLNs were acquired. Patients’ clinical data were processed into PDN information. US images of either modality (grayscale or color) are passed through an IBN-ResNet50-a for feature extraction and a custom classifier for final prediction. The output of the PDN (dimension of 8) is concatenated with the output of the IBN-ResNet50-a model (dimension of 128) and passed through a final linear layer for binary prediction.

In this work, we proposed a multipathway DL model (Fig. 1). By conducting comprehensive experiments across training, validation, independent test and external test datasets, we compared 4 different model architectures: IBN-ResNet-a26, EfficientNet-b428, VGG1629, and ResNet5030.

Statistical analysis

Continuous variables are expressed as medians with the first and third quartiles, and categorical variables are expressed as numbers and percentages (%). Kolmogorov–Smirnov tests were used to assess the normality of the data. Continuous variables were compared using one-way analysis of variance or the Kruskal–Wallis test, as appropriate. Categorical variables were compared using the chi-square test or Fisher’s exact test. A two-sided P value < 0.05 was considered statistically significant. In our proposed models, the AUC and F1 score were used as the main prediction metrics. The F1 score was included because the dataset exhibited class imbalance between patients with low and heavy ALN tumor burden. As the harmonic mean of precision and recall, the F1 score provides a balanced metric that reflects the model’s ability to correctly identify patients with heavy tumor burden better than accuracy alone. Low tumor burden was defined as isolated tumor cells or micrometastasis (≤ 2 mm) and ≤ 2 positive lymph nodes. Heavy tumor burden was defined as the presence of at least one macrometastasis (> 2 mm) and/or ≥ 3 positive axillary lymph nodes. This definition was used as the binary classification endpoint for all model evaluations. The strategy used to find the optimal threshold was to optimize the geometric mean (G-mean):

$${\text{G-Mean}}=\sqrt {{\text{SEN}} \cdot {\text{SPEC}}}$$

which balances sensitivity (SEN) and specificity (SPEC). The threshold that yielded the largest G-mean value was taken as the optimal threshold. All the statistical analyses were performed with SPSS (version 26.0, IBM), Python 3.8 and the NumPy library, and the statistical graphs were generated with Matplotlib.
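The G-mean threshold search and the F1 score described above can be sketched in a few lines of standard-library Python. The function names, the choice of candidate thresholds (every distinct predicted score), and the toy data are assumptions for illustration; the study's actual search grid is not stated.

```python
import math


def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN when score >= threshold is called heavy burden (1)."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        pred = 1 if score >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def g_mean(y_true, scores, threshold):
    """Geometric mean of sensitivity and specificity at one threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sen * spec)


def f1_score(y_true, scores, threshold):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores):
    """Candidate thresholds are the distinct scores; ties keep the smallest."""
    return max(sorted(set(scores)), key=lambda t: g_mean(y_true, scores, t))


# Toy data (not from the study): 1 = heavy burden, 0 = low burden.
y_true = [0, 0, 0, 1, 1, 0]
scores = [0.05, 0.10, 0.30, 0.35, 0.80, 0.15]
t = best_threshold(y_true, scores)
print(t, g_mean(y_true, scores, t), f1_score(y_true, scores, t))
```

Because the G-mean is zero whenever either sensitivity or specificity is zero, it penalizes thresholds that label every patient as one class, which matters for an imbalanced endpoint.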

Results

We prospectively included 222 patients from the inpatient department of breast surgery at Peking Union Medical College Hospital between April 2020 and March 2021 as the development dataset; these patients were randomly divided into training and validation sets at a ratio of 8:2. After one month, we included 53 patients between May 2021 and July 2021 as an independent test dataset. We further enrolled 99 patients from Sichuan Cancer Hospital between April 2022 and July 2022 as an external test dataset. The detailed baseline data of the patients are listed in Table 1. Figure 2 shows the patient recruitment workflow.

Table 1 Characteristics of the development and test cohorts.
Fig. 2

Overview of the patient enrollment workflows.

Base model selection

The statistical results of the base models are shown in Supplementary Table 1. The training, validation, and independent test ROC curve comparisons are illustrated in Fig. 3(a–c). IBN-ResNet-a outperformed the other models on the test set. All models were then evaluated with patient characteristics concatenated after the last custom classification block. The statistical results are shown in Supplementary Table 2. The training, validation, and independent test ROC comparisons are illustrated in Fig. 3(d–f). In this setting, IBN-ResNet-a again ranked first. The integration of patient characteristic data also improved overall performance by approximately 20%.

Fig. 3

ROC curves for different base models with or without clinicopathological characteristics in the training, validation and independent test cohorts. Panels (a–c) compare the base models for predicting ALN tumor burden in the training (a), validation (b), and independent test (c) sets. Panels (d–f) show the performance of each model with clinicopathological characteristics in the training (d), validation (e), and independent test (f) sets.

Performance of the models

Aided by the patient characteristic data, MAN + C achieved AUCs of 0.91 (95% CI: 0.899–0.943), 0.98 (95% CI: 0.950–1.000), 0.89 (95% CI: 0.850–0.935) and 0.84 (95% CI: 0.811–0.869) on the training, validation, independent test and external test datasets, respectively, as shown by the ROC curves in Fig. 4. The model showed consistent performance across the two test groups, which indicated its robustness to differences in time period and hospital.
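For readers less familiar with the metric, the AUC equals the probability that a randomly chosen heavy-burden patient receives a higher model score than a randomly chosen low-burden patient, with ties counted as one half. The sketch below computes it directly from that definition; the function name and toy data are illustrative assumptions, and it does not reproduce how the confidence intervals above were obtained.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data (not from the study).
print(auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of patients, which is acceptable for cohorts of a few hundred; rank-based formulas give the same value more efficiently.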

Threshold-dependent performance was further evaluated to determine clinically meaningful operating points for SLNB avoidance (Supplementary Table 3). At the conservative threshold of 0.1, sensitivity and NPV reached 100% but at the cost of an excessively high false-positive rate (specificity 43.5%). A threshold of 0.2 provided a more balanced clinical profile, maintaining high sensitivity (92.9%) and very high NPV (98.6%) while substantially reducing false positives (specificity 83.5%). Therefore, 0.2 was selected as the primary decision threshold for SLNB-avoidance analysis.
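The trade-off between the two operating points can be made concrete with a small sketch that reports sensitivity, specificity, and negative predictive value (NPV) at a chosen threshold. The function name and the toy scores are assumptions for illustration and are unrelated to the study data in Supplementary Table 3.

```python
def operating_point(y_true, scores, threshold):
    """Sensitivity, specificity and NPV when score >= threshold is called positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # NPV: among patients called low burden, the fraction truly low burden.
        "npv": tn / (tn + fn) if (tn + fn) else 0.0,
    }


# Toy data (not from the study): 1 = heavy burden, 0 = low burden.
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.25, 0.05, 0.15, 0.3, 0.08, 0.12, 0.02]
for t in (0.1, 0.2):
    print(t, operating_point(y_true, scores, t))
```

Raising the threshold in this toy example trades sensitivity and NPV for specificity, the same direction of trade-off weighed in the text when choosing between 0.1 and 0.2.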

Fig. 4

ROC curves for the ability of MAN+C to predict ALN tumor burden. The blue, orange, green and red curves represent the performance of MAN+C on the training, validation, independent and external test sets, respectively.

To explore the effect of clinicopathological characteristics, we removed radioclinicopathological information step by step according to clinical accessibility. Using the full set of patient characteristic data yielded the best results, particularly with the addition of C2 and C3 (C2 represents SLN min/max length and ratio; C3 represents cortical thickness), with AUCs ranging from 0.58 to 0.84 (Table 2; Fig. 5). The characteristics in C2 and C3 are derived from the US images of SLNs. In summary, as each group of patient characteristics was added, a steady improvement in performance was observed. When we used only clinicopathological information to predict ALN tumor burden, the AUC reached 0.82 (95% CI: 0.762–0.884), which suggests that information in US images that is not visually apparent could improve the AUC by 8.5% (Supplementary Fig. 1).

Table 2 Performance metrics of IBN-ResNet50 with different levels of patient characteristic data.
Fig. 5

ROC curves for MAN+C with stepwise removal of patient characteristic groups. The notation C after each type of model indicates the addition of patient characteristic data. C1 represents age, C2 represents SLN min/max length and ratio, C3 represents cortical thickness, and C4 represents ER, PR, HER2 and Ki-67.

Benefit and interpretability of MAN

As shown in Fig. 6(a), the MAN + C could be used to directly determine the ALN tumor burden independent of the morphology of the primary breast lesion, as more than 30% of patients with multifocal lesions or who received any primary lesion treatment could benefit from our model. In addition, even in the external dataset, which showed the worst performance, 88.9% of patients with cT1-2N0 breast cancer could receive accurate ALN tumor burden assessments, enabling the subsequent development of individualized treatment management by using this model.

We applied gradient-weighted class activation mapping (Grad-CAM)31 to generate a heatmap of the input image based on the prediction result of MAN + C. The heatmap highlights the regions to which the model paid the most attention. Figure 6(b) shows a visual representation of the Grad-CAM output. The regions highlighted by MAN generally matched the ROI. Depending on the geometry of the LN, the model highlighted the entry and exit drainage of the LN, suggesting that the model attended to clinically relevant LN features.

Fig. 6

The overall benefit of MAN+C and its visualization map. (a) The pie chart above shows the proportions of different lesions in cT1-2N0 patients who underwent axillary lymph node CEUS in the four datasets. The additional patients who were unable to benefit from earlier models relying solely on primary breast lesion data could now resort to our model, constituting 31% of the total patient population. The pie chart illustrates the proportion of patients diagnosed using MAN+C in the external validation dataset, with 88.9% of them accurately identified for lymph node tumor burden. (b) Grad-CAM visualization of four patient examples. The red region represents the larger attention given by the deep neural network. Note that since both manual cropping images and original images are square, not rectangular, the image is not scaled down equally.

Automatic detection reduces the impact of operator-dependence

Operator dependence has long been a difficulty for DL models in US. In our study, we incorporated saliency-detection cropping into the prediction pipeline to fully automate the end-to-end prediction process, allowing the computer to identify automatically where the lymph nodes were located. Experiments showed that our proposed model still yielded promising results with saliency-detection cropping (AUC of 0.82 on the independent test set; Supplementary Fig. 2), with only a minor performance drop compared with manual cropping. This opens the possibility of a fully automated end-to-end prediction pipeline without human intervention, which would avoid the errors and non-repeatability of manual operations and further increase the degree of automation.

Discussion

In this study, we applied CEUS for tracing SLN(s) to establish a one-to-one correlation between US images of SLN(s) and pathological results. Then, we meticulously designed the MAN + C model using these images with radioclinicopathological information to predict ALN tumor burden. The model demonstrated competitive performance compared to state-of-the-art models based on primary breast cancer lesions. Integrating radiologists’ prior knowledge with high-throughput imaging information from US images significantly enhances diagnostic capabilities. The wealth of quantitative imaging data extracted from these images not only complements the interpretative expertise of radiologists but also has the potential to substantially improve diagnostic accuracy when integrated with existing radiological readings and clinical pathological information.

Axillary ultrasonography is the standard initial tool for evaluating ALN status, but its sensitivity for detecting nodal tumor burden is limited. Although DL- and radiomics-based approaches have emerged, most high-quality models still rely on primary tumor features15,17,18, resulting in exclusion of patients with multifocal disease or prior treatment17,18,32. Consequently, there remains a clinical need for a direct assessment method independent of primary lesion status. Since sentinel lymph node (SLN) pathology is strongly associated with ALN involvement, nodal tumor burden, and survival33, our MAN approach addresses this unmet need because it is not limited by primary tumor characteristics, including size, multifocality, or pretreatment.

Recent DL studies have attempted to classify ALN metastasis directly from US images, yet they suffer from methodological constraints. Ozaki J. et al.34 compared only normal versus heavily metastatic nodes; David C. et al.35 excluded US-negative metastases, inflating positive rates; and Shawn S. et al.36 achieved an AUC of only 0.72. A key obstacle is the lack of one-to-one correspondence between preoperative ALN images and pathological references, which diverges from guideline-recommended SLN-based N staging. By employing CEUS-identified SLNs as the pathological standard, our study aligns with current practice and addresses the challenge of imaging-negative nodes, achieving comparatively high performance in real-world data.

Our study explored the capability of the DL methodology in depth. The design integrated both US images and selected patient characteristics into a unified network comprising two independent networks, IBN-ResNet50 and the PDN. The new design took advantage of domain knowledge, mainly by incorporating direct measurements of lymph node parameters from these images, which are crucial for enhancing the predictive power of our model. The relevance of all clinicopathological information included in our model to ALN metastasis has been validated in previous studies37,38,39,40,41. Because these clinical variables were jointly optimized within the PDN rather than modeled separately, their combined predictive contributions were inherently captured. Additionally, studies and experiments were conducted on patient data selection to retain only useful patient characteristic data while maintaining a prediction performance comparable to that of models from prior studies. The robust performance on the independent and external tests reflects the model’s resilience over time and across different hospital settings.

Compared with the previous application of multimodal imaging data such as elasticity and CEUS of primary breast cancer lesions18, MAN requires only grayscale or color Doppler images acquired by conventional axillary US. The implication of this is that we do not have stringent requirements for the quality of other modality images, which simplifies the image acquisition process and makes it more feasible to implement. This approach reduces the complexity and enhances the practicality of the procedure, allowing for broader application in various clinical settings without the need for specialized or advanced imaging equipment. A potential source of bias in our study is the manual cropping of ultrasound images during preprocessing. Although the regions of interest were selected by experienced radiologists using a standardized protocol, this step remains operator-dependent and may introduce subtle variability. Manual cropping is also time-consuming and limits scalability. Future incorporation of automated lesion localization or segmentation methods could improve reproducibility and reduce human-related variation.

However, this study has some limitations. Firstly, since CEUS of ALNs is not yet widely used, large multicenter external validation is warranted to verify the generalizability of this model for accurate diagnosis of ALN metastasis. Secondly, all the US images used to develop our model were of SLNs, and further work is needed to explore whether MAN combined with fully automated detection technology still performs well for whole axillary scanning. Thirdly, this study was designed as an observational diagnostic modeling study, and the model outputs did not influence clinical management. Therefore, no treatment-related or recurrence-related differences exist between model-defined groups, and time-to-event analyses such as Kaplan–Meier curves cannot be meaningfully performed. Future prospective studies, including potential randomized trials using the model to guide axillary management, are needed to determine whether risk stratification translates into differences in recurrence outcomes. Fourthly, in this exploratory diagnostic modeling study, the DL model provides preoperative risk stratification rather than guiding axillary treatment decisions. Its potential clinical impact, such as identifying patients who might safely avoid ALND, requires prospective validation in future trials and is beyond the scope of the present study. Lastly, we were unable to compare model performance with that of radiologists because estimating the exact number of metastatic lymph nodes preoperatively in cN0 participants is not feasible for human readers, and no clinical benchmark exists for this task.

MAN + C based on SLN US with clinicopathological information provides a direct and efficient method for accurate preoperative assessment of ALN tumor burden in cT1-2N0 breast cancer patients. These findings will provide new possibilities for determining appropriate axillary treatment options for breast cancer patients. Prospective multicenter validation is expected to provide high-level evidence for clinical application in follow-up studies.