Abstract
Distinguishing between pathologic complete response and residual cancer after neoadjuvant chemotherapy (NAC) is crucial for treatment decisions, but current imaging methods lack sufficient accuracy. To address this, we developed deep-learning models using post-NAC dynamic contrast-enhanced MRI and clinical data. A total of 852 women with human epidermal growth factor receptor 2 (HER2)-positive or triple-negative breast cancer were randomly divided into a training set (n = 724) and a validation set (n = 128). A 3D convolutional neural network model was trained on the training set and validated independently. The main models were developed using cropped MRI images, but models using uncropped whole images were also explored. The delayed-phase model demonstrated superior performance compared to the early-phase model (area under the receiver operating characteristic curve [AUC] = 0.74 vs. 0.69, P = 0.013) and the combined model integrating multiple dynamic phases and clinical data (AUC = 0.74 vs. 0.70, P = 0.022). Deep-learning models using uncropped whole images exhibited inferior performance, with AUCs ranging from 0.45 to 0.54. Further refinement and external validation are necessary for enhanced accuracy.
Introduction
Neoadjuvant chemotherapy (NAC) for breast cancer reduces primary tumor size and axillary lymph node metastatic burden, making breast-conserving surgery and sentinel lymph node biopsy viable alternatives to mastectomy and axillary lymph node dissection1,2. Exceptional responders can achieve a pathologic complete response (pCR), which is a surrogate marker indicative of an excellent prognosis in triple-negative and human epidermal growth factor receptor type 2 (HER2)-positive breast cancer3.
After the completion of NAC, evaluation of residual tumors using imaging examinations is crucial to determine the method and the extent of surgical intervention. Among imaging examinations, dynamic contrast-enhanced (DCE) MRI is the most accurate, although it is not perfect4. Overestimation or underestimation can occur because of decreased vascularity, fragmentation of tumors, fibrosis, and inflammation caused by chemotherapy4. In a systematic review by Lobbes et al., the median correlation coefficient was 0.70 (range, 0.21–0.98) between tumor size measured on MRI and tumor size reported on surgical pathology5. In meta-analyses, the pooled sensitivity and specificity of post-NAC MRI showed a wide range of heterogeneous values, suggesting difficulties in assessing residual tumors and the potential for inter-observer variability6,7,8,9,10.
For patients who achieve pCR, the added value of breast surgery remains uncertain. Therefore, several clinical trials are currently underway to evaluate the feasibility and safety of obviating the need for surgery in patients who achieve radiologic CR11,12,13. As imaging methods are not sufficiently precise, these trials use vacuum-assisted biopsy to predict pCR11,12,13. However, biopsy has limitations such as false-negative diagnosis, the need for skilled radiologists, and rare complications such as bleeding and pneumothorax. Therefore, the development of a non-invasive imaging method to accurately predict pCR is a timely and clinically relevant subject.
Deep-learning techniques based on MRI have been used to predict pCR more accurately14,15. Most studies have predicted pCR using pretreatment or early treatment MRI16,17,18,19,20,21,22. Until now, few studies have classified pCR versus non-pCR using MRI after NAC completion23,24. Additionally, most studies have pooled all phases of DCE-MRI instead of evaluating the performance of each phase independently16,17,18,19,20,21,22,23,24. Using the best-performing single dynamic phase as input may not only improve the model performance but also reduce the computational burden compared to using all dynamic phases as inputs. Furthermore, most studies have pooled all breast cancer subtypes16,18,19,20,21,23,24. Breast cancers have heterogeneous properties and different pCR rates depending on their subtypes3. Luminal (hormone receptor-positive and HER2-negative) breast cancers usually undergo NAC in the locally advanced stage, with pCR rates being low, approximately 5–15%. NAC is performed for these patients to reduce tumor extent rather than to achieve pCR. In contrast, HER2-positive and triple-negative breast cancers undergo NAC not only in the locally advanced stage but also in earlier stages, with pCR rates being significantly higher, up to 40–60%. Consequently, clinical trials aimed at omitting surgery after NAC often include HER2-positive and triple-negative breast cancers and exclude luminal breast cancers13. Accordingly, subtype-specific prediction models have greater clinical impact.
Therefore, we developed a deep-learning model for discriminating pCR status using DCE-MRI performed after completion of NAC in patients with HER2-positive and triple-negative breast cancers.
Materials and methods
This retrospective single-center study was approved by the Institutional Review Board of Seoul National University Hospital (No. 2303-028-1409). The requirement for informed consent was waived by the Institutional Review Board, because of the retrospective study design. All methods were conducted in accordance with the Declaration of Helsinki.
Study patients
The inclusion criteria comprised consecutive patients diagnosed with HER2-positive or triple-negative breast cancer who underwent NAC followed by surgery from 2017 to 2021 at our institution. Exclusion criteria included patients who did not receive the planned NAC regimen due to disease progression or adverse effects, those who received an off-protocol regimen, those lacking post-NAC MRI, those with a time interval of more than one month between post-NAC MRI and surgery, those diagnosed with bilateral breast cancer, and those who underwent palliative chemotherapy due to systemic metastasis at diagnosis. From our institution’s breast cancer registry, we identified 1,146 patients with HER2-positive or triple-negative breast cancer who underwent NAC followed by surgery between 2017 and 2021. Among them, 294 women were excluded for the following reasons: more than a one-month interval between post-NAC MRI and surgery (n = 150), absence of post-NAC MRI (n = 55), no receipt of the planned NAC regimen due to progression or side effects (n = 30), receipt of an off-protocol regimen (n = 24), bilateral breast cancer at diagnosis (n = 22), and palliative chemotherapy due to systemic metastasis at diagnosis (n = 13). Consequently, a total of 852 women (mean age, 51 ± 10 years) were included in this study (Fig. 1). Among the 852 women included in the study, 526 had HER2-positive breast cancers, and 326 had triple-negative breast cancers.
NAC regimen
During the study period, patients with HER2-positive breast cancers were administered one of two chemotherapy regimens at the discretion of the oncology specialists with 5–25 years of experience: AC#4-DH#4 (four cycles of doxorubicin [60 mg/m2] plus cyclophosphamide [600 mg/m2] followed by four cycles of docetaxel [75 mg/m2] plus trastuzumab [Herceptin; 8 mg/kg loading dose followed by 6 mg/kg]) or TCHP#6 (six cycles of docetaxel [75 mg/m2] plus carboplatin [area under the plasma concentration-time curve 5] plus trastuzumab [8 mg/kg loading dose followed by 6 mg/kg] plus pertuzumab [840 mg loading dose followed by 420 mg]). For patients with triple-negative breast cancer, the choice between the following two chemotherapy regimens was made: AC#4-D#4 (four cycles of doxorubicin [60 mg/m2] plus cyclophosphamide [600 mg/m2] followed by four cycles of docetaxel [75 mg/m2]) or AD#6 (six cycles of doxorubicin [50 mg/m2] plus docetaxel [75 mg/m2]). Specifically, a total of 526 women with HER2-positive breast cancers received either AC#4-DH#4 (n = 129) or TCHP#6 (n = 397), while a total of 326 women with triple-negative breast cancers received either AC#4-D#4 (n = 271) or AD#6 (n = 55).
Clinical data collection
The following clinical data were collected from the electronic medical records: age (years), tumor size at baseline MRI (cm), clinical T stage (1–4), clinical N stage (0–3), estrogen receptor (ER) status (negative or positive), progesterone receptor (PR) status (negative or positive), HER2 status (negative or positive), Ki-67 index (%), histologic grade (low or intermediate vs. high), and histologic type (ductal vs. others). Data on ER, PR, HER2, Ki-67, histologic grade, and histologic type were obtained from the biopsy specimens before starting NAC. ER and PR positivity were defined as nuclear staining in 1% or more of cancer cells by immunohistochemistry. HER2 positivity was defined as either a HER2 score of 3+ by immunohistochemistry or the presence of gene amplification by fluorescence in situ hybridization in tumors with a HER2 score of 2+.
Reference standard
pCR was defined as the complete absence of invasive and in situ tumors in the breast based on postoperative pathological examination, irrespective of the axillary nodal status (i.e., ypT0), for use in clinical trials to obviate the need for breast surgery.
DCE-MRI protocol and region-of-interest annotation
Breast MRI was performed using a 3T MRI scanner, either Philips Ingenia (55% [471 of 852]) or Siemens Skyra (45% [381 of 852]). The routine MRI protocol sequentially consisted of an axial fat-suppressed T2-weighted sequence, DCE-MRI with an axial 3D fat-suppressed T1-weighted gradient echo sequence, sagittal fat-suppressed T1-weighted sequence, axial fat-suppressed T1-weighted sequence for the axillary regions, and axial diffusion-weighted sequence. DCE-MRI was performed using one pre- and five post-contrast dynamic series (see Table E1 for detailed parameters). For DCE-MRI, gadobutrol (Gadovist, Bayer Healthcare, Berlin, Germany) was administered at a dose of 0.1 mmol/kg and at a rate of 2 mL/sec, followed by a 20 mL saline flush. Post-contrast dynamic images were acquired 90, 180, 270, 360, and 450 s after contrast injection. The subtraction images were created by subtracting the pre-contrast images from the contrast-enhanced images. In our institution, we routinely acquire subtraction images from the first, third, and fifth dynamic phases, while we do not obtain subtraction images from the second and fourth dynamic phases. In this study, the subtraction images from the first, third, and fifth dynamic phases (hereafter referred to as sub1, sub3, and sub5, respectively) were used to create a deep-learning model.
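As an illustration of how such subtraction images can be generated from the stored dynamic series, a minimal SimpleITK sketch is given below; the file names are illustrative assumptions, not identifiers used in the study.

```python
import SimpleITK as sitk

# Illustrative file names; the actual series identifiers are not specified in the study.
pre = sitk.ReadImage("dce_pre.nii.gz", sitk.sitkFloat32)            # pre-contrast series
post5 = sitk.ReadImage("dce_post_phase5.nii.gz", sitk.sitkFloat32)  # fifth post-contrast phase (450 s)

# Subtraction image: post-contrast minus pre-contrast, voxel-wise.
sub5 = sitk.Subtract(post5, pre)
sitk.WriteImage(sub5, "sub5.nii.gz")
```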
To create a cropped deep-learning model for the region of interest (ROI) of a residual enhancing lesion, a dedicated breast radiologist with 11 years of experience annotated the ROIs on the fifth dynamic subtraction images. Three rectangular ROIs were placed at three points (beginning, middle, and end) of a residual enhancing lesion, fitted to the lesion. In the absence of residual enhancement, ROIs were placed according to the location and size of the initial tumor on pre-NAC MRI and the location of the inserted clip. To specify, the rectangular ROIs were positioned at three key points along the axial dimension of the lesion: the starting point, the middle point (representing the largest rectangle), and the ending point. Because of their rectangular shape, these ROIs could encompass not only the target lesion but also a small area of surrounding tissue. Engineers then utilized these three boxes to generate a 3D voxel ROI, which served as input for the deep-learning models. The choice of the fifth dynamic phase (the last phase of DCE-MRI) was based on the tendency for the extent of the residual tumor to be more accurately and prominently visualized on delayed phases than on early phases, due to the effect of NAC on delaying enhancement25,26. The 3D voxel ROI obtained from the fifth dynamic phase was subsequently applied to the first and third subtraction images.
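The paper does not detail how the three rectangles were merged into a single 3D voxel ROI; the sketch below assumes a simple union bounding box spanning the annotated slices and is intended only as an illustration.

```python
import numpy as np

def boxes_to_3d_roi(boxes):
    """boxes: list of three (slice_index, y0, x0, y1, x1) rectangles annotated at the
    beginning, middle, and end of the lesion. Returns a union bounding box as
    (z0, z1, y0, x0, y1, x1). This merging rule is an assumption for illustration."""
    z = [b[0] for b in boxes]
    y0 = min(b[1] for b in boxes)
    x0 = min(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    x1 = max(b[4] for b in boxes)
    return min(z), max(z), y0, x0, y1, x1

def crop_volume(volume, roi):
    """volume: 3D array ordered (z, y, x); roi: output of boxes_to_3d_roi."""
    z0, z1, y0, x0, y1, x1 = roi
    return volume[z0:z1 + 1, y0:y1 + 1, x0:x1 + 1]
```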
Preprocessing
MR image preprocessing was performed using the SimpleITK Python package (V2.1) as follows: First, the voxel spacing values were equalized using linear interpolation in 3D. Second, the MR signal intensity was normalized using multidimensional contrast-limited adaptive histogram equalization to mitigate intensity diversity and standardize image intensities17. Third, to make the input sizes equal, the MR images were cropped to the same size of 72 × 66 × 22 voxels, which was determined to be the median size of all 3D voxel ROIs. The median size, rather than the largest size, was chosen to maintain higher resolution.
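A minimal sketch of these steps is given below, assuming an isotropic target spacing and using scikit-image's equalize_adapthist as one possible stand-in for the multidimensional CLAHE; the exact spacing, CLAHE parameters, and padding behavior used in the study are not reported.

```python
import numpy as np
import SimpleITK as sitk
from skimage.exposure import equalize_adapthist  # one possible CLAHE implementation

def resample_to_spacing(image, new_spacing=(1.0, 1.0, 1.0)):
    """Equalize voxel spacing with linear interpolation (target spacing assumed)."""
    old_spacing, old_size = image.GetSpacing(), image.GetSize()
    new_size = [int(round(osz * osp / nsp))
                for osz, osp, nsp in zip(old_size, old_spacing, new_spacing)]
    return sitk.Resample(image, new_size, sitk.Transform(), sitk.sitkLinear,
                         image.GetOrigin(), new_spacing, image.GetDirection(),
                         0.0, image.GetPixelID())

def preprocess(path, target_shape=(22, 66, 72)):  # (z, y, x) = 22 x 66 x 72 voxels
    image = resample_to_spacing(sitk.ReadImage(path))
    array = sitk.GetArrayFromImage(image).astype(np.float32)        # ordered (z, y, x)
    array = (array - array.min()) / (array.max() - array.min() + 1e-8)
    array = equalize_adapthist(array)                               # multidimensional CLAHE
    # Center-crop (or zero-pad) each axis to the fixed median ROI size.
    out = np.zeros(target_shape, dtype=np.float32)
    src, dst = [], []
    for a, t in zip(array.shape, target_shape):
        if a >= t:
            s = (a - t) // 2
            src.append(slice(s, s + t)); dst.append(slice(0, t))
        else:
            s = (t - a) // 2
            src.append(slice(0, a)); dst.append(slice(s, s + a))
    out[tuple(dst)] = array[tuple(src)]
    return out
```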
Data splitting
A total of 852 patients were divided into a training set (n = 724) and validation set (n = 128) at a ratio of 8.5:1.5. The division between the two sets was determined using a stratified split based on pCR versus non-pCR status, which is the prediction target of our study. Other clinical information was not considered during the division process and thus was randomly distributed. The deep-learning model was trained using 5-fold cross validation in the training set, where four-fifths of the training set was used for training the model and the remaining one-fifth was used to tune the hyperparameters of the model. The performance of the developed deep-learning model was validated in the validation set.
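For illustration, the stratified split and cross-validation folds could be set up as follows with scikit-learn; the variable names (patient_ids, pcr_labels) and the random seed are assumptions.

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# patient_ids and pcr_labels are assumed arrays of patient identifiers and
# binary pCR labels (1 = pCR, 0 = non-pCR).
train_ids, val_ids, y_train, y_val = train_test_split(
    patient_ids, pcr_labels,
    test_size=0.15,        # 8.5:1.5 split
    stratify=pcr_labels,   # stratified by pCR status
    random_state=42)

# Five-fold cross-validation within the training set for hyperparameter tuning.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (fit_idx, tune_idx) in enumerate(skf.split(train_ids, y_train)):
    pass  # train on fit_idx, tune hyperparameters on tune_idx, evaluate later on val_ids
```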
Development of clinical model
Based on the aforementioned clinical data (age, initial tumor size, clinical stage, ER, PR, HER2, Ki-67, histologic grade, and type), five clinical models were developed using a simple deep-learning method called multilayer perceptron (MLP) and four machine learning methods: Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and XGBoost (XGB). The detailed parameters of each clinical model were optimized using the Optuna Python package (V3.0). For the MLP model, five parameters were considered for optimization, including the number of layers, the type of activation function, and the solver for weight optimization. For the LR model, the parameters considered were penalty (L1 or L2), C as a regularization parameter, and the maximum number of iterations. For the RF model, eight parameters were considered, including the function to measure the quality of a split, the number of trees, and the maximum depth of the tree. For the SVM model, six parameters were considered, including C as a regularization parameter, the kernel type, and the degree of the kernel function. For the XGB model, ten parameters were considered, including the booster type, the maximum depth of a tree, and lambda or alpha as regularization parameters.
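As an example of how one of these models could be tuned, the sketch below pairs Optuna with XGBoost; the search ranges, the scoring metric, and the variable names (X_clin, y) are illustrative assumptions rather than the study's exact settings.

```python
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Illustrative search space; the study tuned ten XGBoost parameters in total.
    params = {
        "booster": trial.suggest_categorical("booster", ["gbtree", "dart"]),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-3, 10.0, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
    }
    model = xgb.XGBClassifier(**params, eval_metric="logloss")
    # X_clin and y are assumed to hold the tabular clinical features and pCR labels.
    return cross_val_score(model, X_clin, y, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```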
Deep-learning model architecture
Deep-learning models were created based on the 3D convolutional neural network (CNN) ResNet50 model architecture by Hara et al., with modifications considering the data size27. First, the max-pooling layer with a size of 3 × 3 × 3 in the first layer block was removed to retain the information of the input volume. Additionally, the fifth layer block with three 512-dimensional convolution layers was excluded in our modified model, as the previous four layer blocks were deemed sufficient for extracting the features of the input 3D image. Furthermore, the convolutional stride with a size of 2 × 2 × 2 was applied only at the first convolution layer of the third and fourth layer blocks, which prevented the extracted 3D feature map from becoming too small. Three types of deep-learning models were created using different input data as follows: (a) a single dynamic phase of DCE-MRI (sub1, sub3, or sub5), (b) multiple dynamic phases of DCE-MRI (sub1 + sub3 + sub5), and (c) combined MRI and clinical data (sub1 + sub3 + sub5 + clinical data) (Fig. 2).
Deep-learning model architecture. (a) single-phase MRI (sub1, sub3, or sub5), (b) multiple-phase MRI (sub1 + sub3 + sub5), (c) combined MRI and clinical data (sub1 + sub3 + sub5 + clinical data). sub1 = subtraction images of the first dynamic phase. sub3 = subtraction images of the third dynamic phase. sub5 = subtraction images of the fifth dynamic phase.
Both cropped and uncropped models were created. The cropped model used the ROIs annotated by the radiologist, as described above, whereas the uncropped model used the whole MR images without ROI annotation.
The single-phase model used the subtraction images of a single dynamic phase of DCE-MRI as the input (sub1, sub3, or sub5). The model comprised an initial 7 × 7 × 7 convolutional layer and a batch normalization layer, followed by three convolutional stages. The first convolutional stage encompassed three 64-dimensional residual blocks. The second stage involved four 128-dimensional residual blocks, with the initial block featuring a 2 × 2 × 2 stride. Similarly, the third stage comprised six 256-dimensional residual blocks, with the initial block using a 2 × 2 × 2 stride. Finally, a global average pooling layer was applied to extract the features as a 256-dimensional vector. The architecture of this model is shown in Fig. 2a.
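A minimal Keras sketch of this single-phase backbone is shown below. The residual blocks are written as basic two-convolution blocks for brevity (ResNet50 proper uses bottleneck blocks), and the 6-dimensional feature reduction and sigmoid output are assumptions added to match the description in the following paragraphs; the authors' exact implementation may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    shortcut = x
    y = layers.Conv3D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    y = layers.ReLU()(layers.BatchNormalization()(y))
    y = layers.Conv3D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or shortcut.shape[-1] != filters:  # projection shortcut when shapes differ
        shortcut = layers.Conv3D(filters, 1, strides=stride, use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def single_phase_model(input_shape=(22, 66, 72, 1)):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv3D(64, 7, padding="same", use_bias=False)(inputs)  # initial 7x7x7 conv, no max pooling
    x = layers.ReLU()(layers.BatchNormalization()(x))
    for _ in range(3):                         # stage 1: three 64-dimensional blocks
        x = residual_block(x, 64)
    x = residual_block(x, 128, stride=2)       # stage 2: four 128-dimensional blocks
    for _ in range(3):
        x = residual_block(x, 128)
    x = residual_block(x, 256, stride=2)       # stage 3: six 256-dimensional blocks
    for _ in range(5):
        x = residual_block(x, 256)
    x = layers.GlobalAveragePooling3D()(x)     # 256-dimensional feature vector
    x = layers.Dense(6, activation="relu")(x)  # reduced 6-dimensional feature (see text below)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # pCR probability (assumed head)
    return tf.keras.Model(inputs, outputs)
```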
The multiple-phase model used all three phases of the MRI as the input. MRI images from each phase were entered into separate single-phase models, and the resulting 6-dimensional features were concatenated to create a single 18-dimensional feature. This concatenated feature was then fed into a fully connected dense layer, as in the single-phase model. The architecture of the model is shown in Fig. 2b.
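The fusion step could look roughly as follows in Keras; whether the three backbones share weights and the exact classification head are not specified in the text, so the choices below are assumptions (the convolutional body is abbreviated; see the single-phase sketch above).

```python
import tensorflow as tf
from tensorflow.keras import layers

def phase_backbone(input_shape=(22, 66, 72, 1)):
    """Abbreviated single-phase feature extractor ending in a 6-dimensional vector."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv3D(64, 7, padding="same", activation="relu")(inputs)
    x = layers.GlobalAveragePooling3D()(x)
    x = layers.Dense(6, activation="relu")(x)
    return tf.keras.Model(inputs, x)

phase_inputs = [layers.Input(shape=(22, 66, 72, 1), name=f"sub{i}") for i in (1, 3, 5)]
phase_features = [phase_backbone()(inp) for inp in phase_inputs]  # three 6-dimensional features
merged = layers.Concatenate()(phase_features)                     # 18-dimensional feature
output = layers.Dense(1, activation="sigmoid")(merged)            # fully connected head -> pCR probability
multi_phase_model = tf.keras.Model(phase_inputs, output)
```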
The combined model utilized not only the three phases of MRI but also clinical data as inputs. Similar to the multiple-phase MRI model, 6-dimensional features were extracted from each phase of the MRI. In addition, the clinical data were encoded as a 6-dimensional feature using fully connected layers. The subsequent processes, including feature concatenation, were the same as those used for the multiple-phase MRI model. The architecture of the model is shown in Fig. 2c. To ensure objectivity in determining the optimal hyperparameters, we employed an automated grid search methodology utilizing the hyperparameter optimization framework Optuna. Through this approach, we found that reducing the features to 6 dimensions yielded the best performance for our study. We did not perform an independent test before building the combined model, as we expected that the grid search-based parameter optimization process would minimize the correlation between the features.

The deep-learning models were implemented using Python 3.6, TensorFlow 2.9, and Keras 2.9. Training was performed on an NVIDIA RTX 3090 with a learning rate of 2e-7, a batch size of 16, and early stopping. The learning rate and batch size were set to these values because they empirically demonstrated the best performance. Early stopping was applied with a patience of 30 steps, meaning that training was stopped if the validation loss did not improve for 30 consecutive steps. Data augmentation was performed by flipping the images in the sagittal plane, which effectively increases the diversity of the training data by simulating different orientations of the breast MRIs and helps the model become more robust to variations in patient positioning and anatomical differences.
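A minimal sketch of this training configuration is given below; the optimizer (Adam), the binary cross-entropy loss, and the array names are assumptions not stated in the text, and single_phase_model refers to the earlier sketch.

```python
import numpy as np
import tensorflow as tf

model = single_phase_model()  # from the single-phase sketch above
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-7),  # optimizer choice assumed
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])

def augment_sagittal_flip(volumes, labels):
    """Double the training data by mirroring each volume in the sagittal plane.
    Volumes are assumed to be shaped (n, z, y, x, 1), with x the left-right axis."""
    flipped = np.flip(volumes, axis=3)
    return np.concatenate([volumes, flipped]), np.concatenate([labels, labels])

x_aug, y_aug = augment_sagittal_flip(x_fit, y_fit)  # assumed training-fold arrays
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=30,
                                              restore_best_weights=True)
model.fit(x_aug, y_aug,
          validation_data=(x_tune, y_tune),  # held-out tuning fold
          batch_size=16, epochs=500, callbacks=[early_stop])
```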
Visualization of the results
To provide interpretability of the model, we used gradient-weighted class activation mapping (Grad-CAM). Grad-CAM is a technique used in computer vision and deep learning to identify the parts of an input image that are most influential in a neural network’s decision-making process for a particular class or category28.
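A minimal Grad-CAM sketch for a 3D Keras model is shown below, following the general approach of Selvaraju et al.; the layer name passed as conv_layer_name (the last convolutional layer of the trained model) and the upsampling for overlay are left to the user and are assumptions here.

```python
import numpy as np
import tensorflow as tf

def grad_cam_3d(model, volume, conv_layer_name):
    """Return a normalized 3D heat map for a single input volume shaped (z, y, x, 1)."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(volume[np.newaxis])  # add batch axis
        score = preds[:, 0]                               # predicted pCR probability
    grads = tape.gradient(score, conv_out)                # gradients w.r.t. feature maps
    weights = tf.reduce_mean(grads, axis=(1, 2, 3))       # channel weights (global-average gradients)
    cam = tf.reduce_sum(conv_out * weights[:, None, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0].numpy()                      # keep positive influence only
    return cam / (cam.max() + 1e-8)                       # upsample to input size before overlay
```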
Statistical analysis
Clinical characteristics of the training and validation sets were compared using the Mann-Whitney U test for continuous variables and the χ2 test or Fisher’s exact test for categorical variables. The area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, positive predictive value, and negative predictive value of each model were calculated for the validation set at the patient level. These metrics were derived by comparing the model’s predictions with the ground-truth labels. Among the five clinical models, the model with the maximum AUC was selected as the representative clinical model. The AUCs of the models were compared using the DeLong method29.
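The sketch below shows how these patient-level metrics could be computed with scikit-learn; the 0.5 probability threshold is an assumption, and the DeLong comparison is not available in scikit-learn, so it would require a separate implementation (for example, R's pROC package).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def validation_metrics(y_true, y_prob, threshold=0.5):
    """Patient-level metrics from ground-truth labels and predicted probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "auc": roc_auc_score(y_true, y_prob),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }
```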
Results
Patient characteristics
Table 1 shows the clinical and pathological characteristics of all patients (n = 852), patients in the training set (n = 724), and patients in the validation set (n = 128). Most characteristics were similar between the training and validation sets, although tumor size on pre-NAC MRI and clinical N stage were significantly different between the two sets (P = 0.014 and P = 0.003, respectively). The pCR rate was 35.9% (260 of 724) in the training set and 39.1% (50 of 128) in the validation set.
Performance of the clinical model
Five clinical models were developed using one deep-learning technique and four machine learning techniques, and the performance of each model was compared (Table E2). The AUCs of the five clinical models were low, ranging from 0.59 to 0.63. Among the five clinical models, the model developed using the XGBoost technique showed the highest AUC of 0.63 (95% CI: 0.59–0.67) and was therefore selected as the representative clinical model.
Performance of the deep-learning imaging models using the cropped images
The performances of the deep-learning imaging models using cropped images (with segmentations) in the validation set are provided in Table 2 and Fig. 3. The AUC values of the deep-learning models using a single phase of DCE-MRI were 0.69 (95% CI: 0.68–0.70) for sub1, 0.74 (95% CI: 0.74–0.74) for sub3, and 0.74 (95% CI: 0.73–0.75) for sub5. The sub3 and sub5 models exhibited identical AUC values up to the second decimal place (0.74) but differed at the third decimal place (0.743 and 0.741, respectively). The sub3 model showed a significantly higher AUC value than the sub1 model (0.74 vs. 0.69, P = 0.013) and the clinical model using the XGBoost technique (0.74 vs. 0.63, P < 0.001). The AUC value of the multiple-phase model was 0.71 (95% CI: 0.70–0.72), comparable with that of the sub3 model (0.71 vs. 0.74, P = 0.087). The AUC value of the combined model was 0.70 (95% CI: 0.69–0.71), significantly lower than that of the sub3 model (0.70 vs. 0.74, P = 0.022). The sensitivity of the deep-learning models using cropped images ranged from 0.44 to 0.56, and the specificity ranged from 0.75 to 0.79. Grad-CAM applied to the sub3 model is shown in Fig. 4 and Figure E1, providing interpretability of the results. From the Grad-CAM results, we confirmed that the 3D CNN model predicted pCR or non-pCR status based on MRI information of the tumor bed. For example, in true-negative cases, the activation of the CNN model occurred in the region corresponding to the residual enhancing lesion on the MRI image, as seen in Fig. 4b and Figure E1. In contrast, in true-positive cases, the CNN model was not activated in the tumor bed region but was activated at the image border or randomly throughout the image, as seen in Fig. 4a.
Receiver operating characteristic curves showing the AUC values of different deep-learning models in the validation set. Sub1 = subtraction images of the first dynamic phase of dynamic contrast-enhanced (DCE)-MRI. Sub3 = subtraction images of the third dynamic phase of DCE-MRI. Sub5 = subtraction images of the fifth dynamic phase of DCE-MRI. Multiple = Sub 1 + Sub 3 + Sub 5. Combined = Sub 1 + Sub 3 + Sub 5 + Clinical data.
Visualization of the activation of the convolutional layer using Grad-CAM. The first column shows center slices of residual enhancing lesions or the tumor bed in the subtraction images of the third dynamic phase of dynamic contrast-enhanced MRI. These are preprocessed cropped images. The second column shows the corresponding Grad-CAM activation heat map. The third column presents the overlay of the previous two columns. Representative images were selected based on the Grad-CAM results, with pathologic complete response (pCR) status as the reference standard, categorized as true-positive, true-negative, false-positive, and false-negative. True positive: A 55-year-old woman with human epidermal growth factor receptor 2 (HER2)-positive breast cancer achieved pCR. No enhancement was observed in the tumor bed, and Grad-CAM was not activated in the tumor bed. True negative: A 40-year-old woman with triple-negative breast cancer did not achieve pCR. A 2 cm enhancing mass remained, and Grad-CAM was activated over the enhancing area. False positive: A 31-year-old woman with triple-negative breast cancer did not achieve pCR. A 0.9 cm enhancing mass (arrow) remained, and Grad-CAM was not activated. False negative: A 42-year-old woman with HER2-positive breast cancer achieved pCR. A 0.8 cm subtle enhancement (arrow) remained, and Grad-CAM was activated over the enhancing lesion.
Performance of the deep-learning imaging models using whole images
The performances of the deep-learning models using whole images (without segmentations) in the validation set are listed in Table 3. Overall, the AUC values were low, ranging from 0.45 to 0.54. The sub3 model showed the highest AUC value of 0.54 (95% CI: 0.51–0.57) among the models listed in Table 3. The multiple-phase model and the combined model showed AUCs of less than 0.5 (0.47 and 0.45, respectively). The sensitivity ranged from 0 to 0.03, and the specificity ranged from 0.87 to 1.00.
Discussion
The precise discrimination of pCR through MRI acquired after the completion of NAC is becoming increasingly important in patients with HER2-positive and triple-negative breast cancers. This importance is underscored by the high pCR rates in these subtypes and the increasing interest in clinical trials that omit surgery in exceptional responders. We developed a deep-learning model based on DCE-MRI and clinical information in the training set (n = 724) and evaluated its performance in the validation set (n = 128). We found that the deep-learning model based on the delayed phase, rather than the early phase, of DCE-MRI showed better performance in discriminating pCR. Moreover, the single delayed-phase model showed comparable performance to the multiple-phase model and better performance than the combined model using multiple phases of DCE-MRI and clinical information. Additionally, deep-learning models that used whole MR images without segmentation as input showed near-chance performance, indicating that these models were scarcely trained.
To our knowledge, this was the first study to develop separate deep-learning MRI models for early and delayed phases in discriminating pCR from residual cancer. To date, only two studies have developed deep-learning models for pCR prediction using post-NAC DCE-MRI23,24. These studies collectively utilized all phases of DCE-MRI without distinguishing between early and delayed phases23,24. A comparison of the study designs of the two prior studies and our study is presented in Table E3 of the supplemental material. We observed that the delayed-phase model outperformed the early-phase model, which is in line with findings from prior studies25,26. Santamaria et al. found that the absence of enhancement on delayed-phase MRI increased the probability of pCR by 28 times compared to the presence of enhancement25. Kim et al. demonstrated a stronger correlation between tumor size measured in the delayed phase than in the early phase of DCE-MRI and the actual tumor size found in surgical specimens26. Moreover, our study found comparable performance between the single delayed-phase model and the multi-phase model. This comparable performance may be due to the limited data size, which possibly hindered the full optimization of the multi-phase model’s parameters. With a larger dataset, the model might better utilize the features from multiple phases. Nonetheless, our findings suggest that a deep-learning model using single delayed-phase MR images may suffice, obviating the need for multiple DCE-MRI phases. This approach could mitigate the computational demands and minimize potential registration errors.
Despite being developed using the diverse clinical information listed in Table 1, our clinical models generally showed poor discrimination performance, with AUC values ranging from 0.59 to 0.63. Furthermore, the combined model, which integrated multiple phases of DCE-MRI with clinical information, yielded lower performance than the single delayed-phase model. These results highlight the limited value of clinical information in discriminating pCR from residual cancer after NAC. However, further studies with larger data sizes are needed, as these findings seem inconsistent with previous studies demonstrating the superiority of a combined model utilizing both MRI and clinical information over an MRI-only model19,24. The disparities in breast cancer subtypes across studies likely contribute to this inconsistency. Previous studies encompassed all breast cancer subtypes, whereas our study focused solely on the HER2-positive and triple-negative subtypes, excluding the luminal (hormone receptor-positive and HER2-negative) subtypes19,24. Breast cancer subtype is expected to possess substantial predictive power for pCR within the realm of clinical information, likely because of the marked differences in pCR rates between the luminal and non-luminal subtypes. However, given that our study included only non-luminal subtypes, the predictive capacity of subtype within our study cohort was presumably diminished.
In our study, the deep-learning models using unsegmented whole MR images exhibited AUC values ranging from 0.45 to 0.54, indicating limited efficacy. The decreased performance observed when using whole images without segmentation may be attributed to the model learning unnecessary and excessive information that is not relevant for accurate judgment. Including unnecessary regions can introduce noise and irrelevant information, potentially confusing the model and degrading its performance. Models trained on such data may exhibit reduced accuracy and robustness30. These findings are inconsistent with those of Dammu et al., whose study demonstrated an acceptable AUC of 0.78 for a deep-learning model using unsegmented whole MR images in differentiating pCR24. Compared with the utilization of images segmented by human observers, the use of unsegmented whole images has the potential to alleviate issues related to labor intensity and inter-observer variability. However, as our study demonstrates, this might not lead to effective learning and could also increase the computational burden and cost. Given the insufficient number of studies addressing this issue, further research is required.
Furthermore, the MRI models exhibited relatively high specificity but low sensitivity, aligning with earlier research findings19,24. In simpler terms, there were instances where MRI incorrectly classified pCR as non-pCR. This could be attributed to the presence of slight enhancements caused by factors like inflammation, granulation, or fibrosis, even in cases of pCR status31.
Our study has several limitations. First, it was a single-center study; external validation is required to confirm the generalizability of our model. Second, we did not use multiparametric MRI to develop the deep-learning model. We used subtraction images from DCE-MRI and did not use other MRI sequences such as T2-weighted imaging or diffusion-weighted imaging. To the best of our knowledge, most deep-learning studies for predicting pCR have utilized only DCE-MRI, with only a few studies utilizing multiparametric MRI. For example, Dammu et al. used DCE-MRI and T2-weighted imaging, while Zhou et al. used DCE-MRI and DWI22,24. Third, we used post-treatment MRI data alone, without pretreatment or interim MRI data. The combined use of pre-treatment and post-treatment MRI could improve model performance, as pre-treatment MRI provides information on the location and the enhancement and shape characteristics of the initial tumor. The development of deep-learning models using multiple time-point images remains understudied because of the challenges of integrating multiple channels and neural networks while maintaining consistent correlations between multiple time-point images for the same patient14. Previously, two studies reported that combining pre- and post-treatment MRI data improved the predictive performance for pCR compared with using single time-point data23,24. Owing to the lack of relevant studies, further research is needed to elucidate whether using multiparametric MRI and multiple treatment time-point images can improve the predictive performance for pCR. Fourth, the utilization of subtraction images from all dynamic phases, rather than solely from the first, third, and fifth phases as conducted in this study, may enhance prediction performance. However, because only these phases are routinely post-processed to obtain subtraction images in our institution, they were exclusively utilized in our deep-learning model. Future studies should address this consideration to potentially improve model performance. Fifth, limitations related to the segmentation method include the fact that a single radiologist conducted the segmentation, raising concerns about potential intra- and inter-observer variability. Additionally, only three points (beginning, middle, and end) of each lesion were marked, instead of all points, which could lead to potentially erroneous segmentation. Machine-based automatic segmentation may ensure more accurate and objective results compared to the current manual method. Sixth, we did not combine MRI features with the original clinical information during modeling, concerned that this approach might limit model scalability and compromise effective feature selection. However, this method could simplify the model, reduce the computational burden, and potentially improve performance; future research should explore it. Seventh, we did not perform an independent test before building the combined model. Conducting an independent test may improve model performance, and future studies should consider incorporating one to better assess the complementarity of information and potentially enhance model robustness. Lastly, it should be mentioned that the AUC value of the single delayed-phase model, which had the highest AUC among the models, was only 0.74. Further model improvement is necessary for clinical application.
Future studies should consider training with larger sample sizes, utilizing multiparametric MRI data, combining pre-treatment and post-treatment MRI data, and employing more advanced segmentation methods.
Conclusion
We found that the deep-learning model utilizing a single delayed-phase MRI exhibited superior performance compared to both the single early-phase MRI model and the combined model incorporating multiple dynamic phases and clinical data in discriminating pCR. The application of our deep-learning model holds promise for evaluating potential candidates for obviating the need for breast surgery. However, further model improvement and validation are required to confirm its utility.
Data availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
References
Korde, L. A. et al. Neoadjuvant chemotherapy, endocrine therapy, and targeted therapy for breast Cancer: ASCO Guideline. J. Clin. Oncol.39, 1485–1505. https://doi.org/10.1200/JCO.20.03399 (2021).
National Comprehensive Cancer Network. Clinical practice guidelines in oncology. Breast cancer. version 6. (2020).
Cortazar, P. et al. Pathological complete response and long-term clinical benefit in breast cancer: the CTNeoBC pooled analysis. Lancet. 384, 164–172. https://doi.org/10.1016/S0140-6736(13)62422-8 (2014).
Fowler, A. M., Mankoff, D. A. & Joe, B. N. Imaging neoadjuvant therapy response in breast Cancer. Radiology. 285, 358–375. https://doi.org/10.1148/radiol.2017170180 (2017).
Lobbes, M. B. et al. The role of magnetic resonance imaging in assessing residual disease and pathologic complete response in breast cancer patients receiving neoadjuvant chemotherapy: a systematic review. Insights Imaging. 4, 163–175. https://doi.org/10.1007/s13244-013-0219-y (2013).
Yuan, Y., Chen, X. S., Liu, S. Y. & Shen, K. W. Accuracy of MRI in prediction of pathologic complete remission in breast cancer after preoperative therapy: a meta-analysis. AJR Am. J. Roentgenol.195, 260–268. https://doi.org/10.2214/AJR.09.3908 (2010).
Marinovich, M. L. et al. Meta-analysis of magnetic resonance imaging in detecting residual breast cancer after neoadjuvant therapy. J. Natl. Cancer Inst.105, 321–333. https://doi.org/10.1093/jnci/djs528 (2013).
Sheikhbahaei, S. et al. FDG-PET/CT and MRI for Evaluation of Pathologic Response to Neoadjuvant Chemotherapy in Patients With Breast Cancer: A Meta-Analysis of Diagnostic Accuracy Studies. Oncologist. 21, 931–939. https://doi.org/10.1634/theoncologist.2015-0353 (2016).
Gu, Y. L., Pan, S. M., Ren, J., Yang, Z. X. & Jiang, G. Q. Role of Magnetic Resonance Imaging in detection of pathologic complete remission in breast Cancer patients treated with Neoadjuvant Chemotherapy: a Meta-analysis. Clin. Breast Cancer. 17, 245–255. https://doi.org/10.1016/j.clbc.2016.12.010 (2017).
Janssen, L. M. et al. MRI to assess response after neoadjuvant chemotherapy in breast cancer subtypes: a systematic review and meta-analysis. NPJ Breast Cancer. 8, 107. https://doi.org/10.1038/s41523-022-00475-1 (2022).
Tasoulis, M. K. et al. Accuracy of Post-neoadjuvant Chemotherapy Image-guided breast biopsy to predict residual Cancer. JAMA Surg.155, e204103. https://doi.org/10.1001/jamasurg.2020.4103 (2020).
van Loevezijn, A. A. et al. Minimally Invasive Complete Response Assessment of the breast after Neoadjuvant systemic therapy for early breast Cancer (MICRA trial): interim analysis of a Multicenter Observational Cohort Study. Ann. Surg. Oncol.28, 3243–3253. https://doi.org/10.1245/s10434-020-09273-0 (2021).
Kuerer, H. M. et al. Eliminating breast surgery for invasive breast cancer in exceptional responders to neoadjuvant systemic therapy: a multicentre, single-arm, phase 2 trial. Lancet Oncol.23, 1517–1524. https://doi.org/10.1016/S1470-2045(22)00613-1 (2022).
Khan, N., Adam, R., Huang, P., Maldjian, T. & Duong, T. Q. Deep learning prediction of pathologic complete response in breast Cancer using MRI and other Clinical data: a systematic review. Tomography. 8, 2784–2795. https://doi.org/10.3390/tomography8060232 (2022).
Lo Gullo, R. et al. Artificial Intelligence-enhanced breast MRI: applications in breast Cancer Primary Treatment Response Assessment and Prediction. Invest. Radiol. https://doi.org/10.1097/RLI.0000000000001010 (2023).
Ha, R. et al. Prior to Initiation of Chemotherapy, can we predict breast tumor response? Deep learning convolutional neural networks Approach using a breast MRI tumor dataset. J. Digit. Imaging. 32, 693–701. https://doi.org/10.1007/s10278-018-0144-1 (2019).
Braman, N. et al. Deep learning-based prediction of response to HER2-targeted neoadjuvant chemotherapy from pre-treatment dynamic breast MRI: a multi-institutional validation study. arXiv preprint arXiv:2001.08570 (2020).
Comes, M. C. et al. Early prediction of neoadjuvant chemotherapy response by exploiting a transfer learning approach on breast DCE-MRIs. Sci. Rep. 11, 14123. https://doi.org/10.1038/s41598-021-93592-z (2021).
Joo, S. et al. Multimodal deep learning models for the prediction of pathologic response to neoadjuvant chemotherapy in breast cancer. Sci. Rep.11, 18800. https://doi.org/10.1038/s41598-021-98408-8 (2021).
Massafra, R. et al. Robustness evaluation of a deep learning model on sagittal and axial breast DCE-MRIs to predict pathological complete response to Neoadjuvant Chemotherapy. J. Pers. Med. 12. https://doi.org/10.3390/jpm12060953 (2022).
Peng, Y. et al. Pretreatment DCE-MRI-Based Deep Learning outperforms Radiomics Analysis in Predicting Pathologic Complete response to neoadjuvant chemotherapy in breast Cancer. Front. Oncol.12, 846775. https://doi.org/10.3389/fonc.2022.846775 (2022).
Zhou, Z. et al. Prediction of pathologic complete response to neoadjuvant systemic therapy in triple negative breast cancer using deep learning on multiparametric MRI. Sci. Rep.13, 1171. https://doi.org/10.1038/s41598-023-27518-2 (2023).
Qu, Y. H. et al. Prediction of pathological complete response to neoadjuvant chemotherapy in breast cancer using a deep learning (DL) method. Thorac. Cancer. 11, 651–658. https://doi.org/10.1111/1759-7714.13309 (2020).
Dammu, H., Ren, T. & Duong, T. Q. Deep learning prediction of pathological complete response, residual cancer burden, and progression-free survival in breast cancer patients. PLoS One. 18, e0280148. https://doi.org/10.1371/journal.pone.0280148 (2023).
Santamaria, G. et al. Neoadjuvant systemic therapy in breast Cancer: association of contrast-enhanced MR Imaging findings, diffusion-weighted imaging findings, and Tumor Subtype with Tumor Response. Radiology. 283, 663–672. https://doi.org/10.1148/radiol.2016160176 (2017).
Kim, S. Y. et al. Dynamic contrast-enhanced breast MRI for evaluating residual tumor size after Neoadjuvant Chemotherapy. Radiology. 289, 327–334. https://doi.org/10.1148/radiol.2018172868 (2018).
Hara, K., Kataoka, H. & Satoh, Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6546–6555 (2018).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626 (2017).
DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 44, 837–845 (1988).
Ketkar, N. & Santana, E. Deep Learning with Python Vol. 1 (Springer, 2017).
Wasser, K. et al. Accuracy of tumor size measurement in breast cancer using MRI is influenced by histological regression induced by neoadjuvant chemotherapy. Eur. Radiol.13, 1213–1223. https://doi.org/10.1007/s00330-002-1730-6 (2003).
Acknowledgements
This study was supported by Seoul National University Hospital Research Fund (04-2022-2150) and the National Research Foundation (NRF) grant funded by the Korean government (MSIT) (RS-2024-00343630). The funder did not participate in study design, data collection, data analysis, and manuscript preparation. We would like to thank Editage (www.editage.co.kr) for English language editing.
Author information
Contributions
S.Y.K. acquired the funding, collected MRI and clinical data, and wrote the manuscript. J.L. conducted the deep learning analyses and prepared the Tables and Figures. N.C. critically reviewed the manuscript. Y. G.K. guided the methodology and supervised the study.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kim, SY., Lee, J., Cho, N. et al. Deep-learning based discrimination of pathologic complete response using MRI in HER2-positive and triple-negative breast cancer. Sci Rep 14, 23065 (2024). https://doi.org/10.1038/s41598-024-74276-w