Introduction

Background

Breast cancer is a formidable challenge in the realm of public health, significantly influencing mortality statistics worldwide1. However, from a clinical perspective, approximately 90% of cancers can be cured if detected and treated at an early stage, thus emphasizing the importance of early diagnosis of breast cancer2. Detecting breast lesions, which are the principal indicators of breast cancer, during the nascent stages is crucial for improving survival probabilities and the efficacy of therapeutic interventions. Therefore, there is a continuous demand for improving the accuracy of breast lesion diagnosis based on high sensitivity and low false positive rates3. Mammography is an indispensable primary diagnostic modality in breast cancer screening, providing multiple images depending on the angle of view4. Among the various views, the craniocaudal view (CC view), obtained by compressing the breast up and down, and the mediolateral oblique view (MLO view), obtained by adjusting the angle to include the pectoralis muscles, are used as standard views for screening5,6. The use of these two views in the clinical interpretation of patients with suspected breast cancer has been reported to be clinically advantageous in reducing false positives (FPs) and improving sensitivity compared with using only a single view7,8,9,10.

In recent years, advances in artificial intelligence (AI) have led to remarkable progress in medical imaging as well as computer vision. AI has reduced the burden on clinicians by accelerating the diagnostic process. Moreover, AI-based computer-aided diagnosis (CAD) provides consistent and objective indicators, which has been shown to reduce inter- and intra-observer variability11,12,13,14. Convolutional neural networks (CNNs), a class of deep learning models, extract features as filter maps through convolution operations. Because this approach preserves the location information of the extracted feature values, it has found widespread use in computer vision and has proven effective on medical image data15,16. As a result, CNNs have become a key tool in the development of breast lesion detection algorithms for early diagnosis of breast cancer from mammography data.

Related works

AI algorithms for breast lesion detection have been mainly based on single images. The Faster R-CNN model is a convolutional neural network-based object detection algorithm that performs region detection and classification of the detected regions in two sequential stages17. Through direct application and structural modification, this model has been used to detect the location of breast masses with relatively high accuracy and to classify lesions as benign or malignant18,19,20. In addition, the YOLO model, which offers fast inference by simultaneously detecting candidate regions for the target object and classifying them, has been used either with its original structure or with modified layers21,22,23. The RetinaNet model, which improves the loss function to train efficiently on small objects, has also been utilized in various breast lesion detection studies24. However, these studies used single-view algorithms in which lesion detection was performed separately for each image, and the correlated information between the two standard views, which could potentially enhance detection accuracy, was not considered.

Some studies have utilized patches of breast lesion regions to exploit the matching information between the two standard images. One experiment extracted feature maps through a convolutional layer from patches manually extracted from each image and classified, patch by patch, whether the patches showed the same lesion by measuring the similarity of the neighboring values for each pixel in the two feature maps25. Other studies used deep learning-based detection models to extract lesion patches from each image and then classified whether the two patches matched. One study utilized Unet to extract individual patches from the two images. For each patch, a feature map was extracted using a convolutional filter, and the feature maps were combined in a fully connected layer to share the features of the two images26. Another study extracted patches using the YOLO model and calculated the cosine similarity between the two feature maps extracted from the patches through a convolutional layer to classify the matching27. These experiments aimed to remove false positives by determining the matching at the patch level, so the positional information of the whole image was not considered in the feature extraction process. Nevertheless, using the similarity of lesion patches from the two standard images improved the quantitative performance indicators, demonstrating the validity of utilizing both images.

Another line of work has sought to enhance performance by using multi-view mammograms. One study employed the two views by using the distance between the nipple and the lesion and considering the location information from the CC view; however, the MLO view image was not itself used to detect breast lesions and served only as an auxiliary input for detection in the CC view28. Other studies combined two or four mammogram views using an ensemble network across whole feature maps. However, they used the multiple views simultaneously only for classification problems (abnormality, or benign versus malignant), not for extracting the feature maps used to detect masses29,30,31.

Contributions

Therefore, in this study, we aimed to improve the detection performance by proposing an AI lesion detection algorithm that utilizes paired standard mammograms. The key points of the proposed contributions in this paper are as follows:

  1. We propose a deep learning-based PMVnet (paired mammogram view-based network) that simultaneously utilizes the paired craniocaudal and mediolateral oblique views of the whole mammogram image.

  2. To evaluate the proposed algorithm, we compared its performance with that of single mammogram view-based networks (SMVnet), which are standard Unet-based models.

  3. We analyzed the localization results to investigate whether PMVnet effectively extracts feature maps that are correlated between the two views.

Materials and methods

Data collection

Due to the retrospective nature of the study, the Institutional Review Board (IRB) of Gachon University Gil Medical Center (IRB Number: GCIRB2020-477) waived the need for obtaining informed consent. All experimental protocols were performed in accordance with the relevant guidelines and regulations in compliance with the Declaration of Helsinki. We collected data acquired from devices produced by two different manufacturers: the Lorad Selenia 2D Digital Mammography model, referred to as “Hologic” in this paper, from Hologic Inc. based in Massachusetts, USA; and the Senographe Essential model, referred to as “GE” in this paper, from General Electric Healthcare based in Illinois, USA. Figure 1 shows an example of the four standard views (RCC, LCC, RMLO, and LMLO) obtained from a single patient. To utilize paired images simultaneously, we grouped the two standard views of the right side (RCC and RMLO) and the two standard views of the left side (LCC and LMLO) into unified sets. The dataset comprises 818 sets of left and right views (1,636 images), with 525 sets (1,050 images) sourced from Hologic devices and 293 sets (586 images) obtained from GE devices. The Hologic data had a resolution of 2,560 × 3,328 pixels, whereas the GE data had a resolution of 3,328 × 4,096 pixels. Of the entire dataset, 738 cases (1,476 images) were allocated to train the models, and the remaining 80 cases (160 images) served as the evaluation data to validate the performance. In the training dataset, the average age of patients was 51.255 years (± 11.651), with 99.6% female and 0.4% male. For performance assessment, the patients were all female and had an average age of 54.263 years (± 11.392). All identifiable information was anonymized for privacy purposes, except for age and sex. All the data used in this research consisted of cases diagnosed as abnormal (malignant and/or benign lesions).

Fig. 1

Examples of standard mammogram views in the left and right directions. LCC left craniocaudal view, LMLO left mediolateral oblique view, RCC right craniocaudal view, RMLO right mediolateral oblique view.

Data preprocessing

Breast lesions have vague outlines when they overlap with fibroglandular tissue, particularly in high-density breast tissue32,33. Various studies have demonstrated that using pseudo-color images as a preprocessing technique can improve contrast and enhance mass detection performance in such tissue34,35,36. Therefore, we generated pseudo-color images as a preprocessing step to increase the contrast between breast lesions and their surroundings, as shown in Fig. 2. The window settings were adjusted in three different ways based on pixel intensity, and the three resulting grayscale images were assigned to channels 0, 1, and 2 and combined into a single pseudo-color image. Channel 0 holds the image at the default window width and window level set by the machine’s internal algorithm at the time of acquisition. Channel 1 uses a window width and level that span the range between the minimum and maximum pixel values of the breast region. Finally, channel 2 sets the window level to the most frequent pixel value in the breast region.
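The three-channel windowing described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the breast region is assumed to be every pixel greater than zero after background removal, and the window width for channel 2 is assumed to reuse the default width because the text does not state it.

```python
from collections import Counter


def apply_window(image, level, width):
    """Linearly map [level - width/2, level + width/2] to [0, 255], clipping outside."""
    width = max(width, 1)  # guard against a degenerate (zero-width) window
    low = level - width / 2.0
    return [
        [int(round(255 * min(1.0, max(0.0, (p - low) / width)))) for p in row]
        for row in image
    ]


def pseudo_color(image, default_level, default_width):
    """Build a 3-channel image from three window settings of one grayscale image.

    Assumption: pixels > 0 form the breast region (background already removed).
    Assumption: channel 2 reuses the default window width.
    """
    breast = [p for row in image for p in row if p > 0]
    lo, hi = min(breast), max(breast)
    mode = Counter(breast).most_common(1)[0][0]  # most frequent breast pixel value

    ch0 = apply_window(image, default_level, default_width)  # acquisition default
    ch1 = apply_window(image, (lo + hi) / 2.0, hi - lo)      # full breast range
    ch2 = apply_window(image, mode, default_width)           # centred on the mode

    # Interleave the three grayscale images into per-pixel (c0, c1, c2) tuples.
    return [
        [(a, b, c) for a, b, c in zip(r0, r1, r2)]
        for r0, r1, r2 in zip(ch0, ch1, ch2)
    ]
```

Calling `pseudo_color(image, default_level, default_width)` on a 2D list of pixel values returns an image of the same height and width whose pixels are 3-tuples, one value per channel.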

Fig. 2

Example of pseudo-color image generation process to enhance the contrast between the lesion and surrounding area. (a) Original image. (b) The denoised and background-removed breast region. (c) The pseudo-color image generation process and the synthesized pseudo-color image. (d) The input image with padding and scaling for training a network.

The collected mammogram images have resolutions of 2,560 × 3,328 pixels or 3,328 × 4,096 pixels, which poses challenges in terms of computational demand and memory usage. We therefore reduced the height of each image to 512 pixels, scaled the width correspondingly to maintain the original aspect ratio, and applied zero-padding to obtain a final image size of 512 × 512 pixels. Furthermore, because model performance depends strongly on the volume of training data, we applied data augmentation to enrich our dataset. Starting from the padded images, we resized them to 4/3 and 2 times the original reference image and generated additional variations through horizontal and vertical flips, increasing our dataset fourfold. An example of the data augmentation methodology is presented in Fig. 3.
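The resize-and-pad step can be illustrated with a short sketch. It assumes nearest-neighbour interpolation and padding split evenly between the left and right sides; neither choice is stated in the text, and the function names are hypothetical.

```python
def resized_shape(width, height, target=512):
    """Scale so that the height equals `target`, keeping the aspect ratio."""
    scale = target / height
    return max(1, round(width * scale)), target


def pad_offsets(new_width, target=512):
    """Zero-padding (left, right) needed to reach `target` columns."""
    total = target - new_width
    left = total // 2
    return left, total - left


def resize_nearest(image, new_w, new_h):
    """Nearest-neighbour resize of a 2D list (an assumed interpolation choice)."""
    src_h, src_w = len(image), len(image[0])
    return [
        [
            image[min(src_h - 1, y * src_h // new_h)][min(src_w - 1, x * src_w // new_w)]
            for x in range(new_w)
        ]
        for y in range(new_h)
    ]


def resize_and_pad(image, target=512):
    """Resize to `target` rows, then zero-pad columns to a square image."""
    new_w, new_h = resized_shape(len(image[0]), len(image), target)
    if new_w > target:
        raise ValueError("image is wider than it is tall; padding cannot make it square")
    left, right = pad_offsets(new_w, target)
    resized = resize_nearest(image, new_w, new_h)
    return [[0] * left + row + [0] * right for row in resized]
```

Under these assumptions, a 2,560 × 3,328 Hologic image is scaled to 394 × 512 and a 3,328 × 4,096 GE image to 416 × 512 before padding.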

Fig. 3

Examples of augmented images with the data scaling process applied. The final preprocessed image was zero-padded to be 512 × 512 in size. (a) Resized by keeping the original aspect ratio centered on the center of the image and applying padding on either side. (b) Resized to 512 × 512, ignoring the original proportions. (c) Resized to 4/3 times the original aspect ratio and flipped up and down. (d) Resized to 2 times the original aspect ratio and flipped left and right.

Paired standard mammogram views network

The Unet is a prominent CNN-based semantic segmentation model widely applied in medical image segmentation due to its exceptional performance, as noted across various studies37,38,39,40. This research introduces a PMVnet structure with a Unet comprising two core components: a contracting path (encoder) and an expanding path (decoder) with a multi-input design41.

To clearly understand the baseline model used for comparison, we define SMVnet as a single-view mammography network. Unlike PMVnet, which utilizes paired views to improve performance, SMVnet is a traditional Unet-based model that processes only one mammography view at a time. The SMVnet model in this study leverages the Unet architecture with various encoders such as VGG, ResNet, and Efficientnet to ensure a fair and thorough performance evaluation. By comparing the SMVnet-based model with PMVnet, we aim to directly evaluate the benefits of integrating paired-view information.

Figure 4 shows an example of the proposed model structure. This approach shares weights by treating the two standard images as a unified representation of one breast. We employed cosine similarity and an attention mechanism based on the squeeze-and-excitation (SE) technique42 to integrate the feature maps of the two images. The models were trained on an Nvidia V100 GPU with the following configuration: a batch size of 4; 100 epochs; a learning rate of 0.0001; focal loss; and a learning rate schedule that decreases the rate by a factor of 0.1 if the loss on the validation data does not decrease for 15 epochs.
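Two elements of this training configuration, the focal loss and the plateau-based learning rate schedule, can be sketched in plain Python. The focal loss parameters `gamma=2.0` and `alpha=0.25` are assumptions (commonly used defaults), since the text does not report them, and the scheduler is a simplified stand-in for what a deep learning framework would provide.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one pixel.

    p: predicted lesion probability, y: ground-truth label (0 or 1).
    gamma and alpha are assumed values, not taken from the paper.
    """
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


class ReduceOnPlateau:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=1e-4, factor=0.1, patience=15):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the (possibly reduced) rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

The focal loss down-weights pixels that are already classified confidently, which is useful when lesion pixels are a small fraction of the image. Framework schedulers may count the patience window slightly differently from this sketch.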

Let the encoder outputs for the two views be \(O_{cc}\) and \(O_{mlo}\), and let \(O\) denote their concatenation. Let the feature map of \(O\) have size \(H \times W \times C\), let \(F_{c}\) be its \(c\)th channel, and let \(S_{c}\) be the result of squeezing the spatial information of \(F_{c}\). Let \(O'\) be the output of applying the SE technique to \(O\), \(C_{n}(x)\) the operation of a fully connected layer with \(n\) outputs, \(\sigma(x)\) the sigmoid function, and \(R(x)\) the ReLU function. The following operations are then applied, with the reduction ratio set to 4 in this study.

$$S_{c}=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}F_{c}(i,j)$$
(1)
$$O'=\sigma\left(C_{n}\left(R\left(C_{\frac{n}{4}}\left(S_{c}\right)\right)\right)\right)$$
(2)

Letting the feature maps activated by the SE technique be \(O'_{cc}\) and \(O'_{mlo}\), they can be represented as follows:

$$O'_{cc}=O_{cc}\cdot O'$$
(3)
$$O'_{mlo}=O_{mlo}\cdot O'$$
(4)

The cosine similarity in Eq. (5) is computed on the results of Eqs. (3) and (4), and the resulting value is applied to the feature map of each image through a scalar product so that the similarity between the two images contributes to the feature extraction.

$$f_{\cos}(A,B)=\frac{A\cdot B}{\Vert A\Vert\,\Vert B\Vert}=\frac{\sum_{i=1}^{n}A_{i}B_{i}}{\sqrt{\sum_{i=1}^{n}A_{i}^{2}}\sqrt{\sum_{i=1}^{n}B_{i}^{2}}}$$
(5)
Fig. 4

The proposed PMVnet structure connects two U-shaped models to accept multiple inputs. The feature maps extracted by the CC and MLO encoders are combined and fused, and the squeeze-and-excitation method is then applied to each feature map at the decoder so that the dot product-based cosine similarity can be measured.
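The fusion in Eqs. (1) to (5) can be sketched on small nested lists shaped `[channel][row][column]`. Several details are assumptions made only so that the example runs: the two views are concatenated along the channel axis, the excitation vector is split in half to gate each view, and a single cosine similarity is computed over the flattened gated maps. The function names are hypothetical and the weights are random rather than trained.

```python
import math
import random


def squeeze(fmap):
    """Eq. (1): global average over H x W, one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def dense(x, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def excite(s, w1, b1, w2, b2):
    """Eq. (2): sigmoid(C_n(ReLU(C_{n/4}(s))))."""
    hidden = [max(0.0, h) for h in dense(s, w1, b1)]
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(hidden, w2, b2)]


def scale_channels(fmap, gates):
    """Multiply every value of channel c by gates[c] (Eqs. (3) and (4))."""
    return [[[g * v for v in row] for row in ch] for ch, g in zip(fmap, gates)]


def cosine_similarity(a, b, eps=1e-12):
    """Eq. (5), with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def flatten(fmap):
    return [v for ch in fmap for row in ch for v in row]


def random_dense(n_in, n_out, rng):
    """Random (untrained) weights and zero biases, for illustration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_views(o_cc, o_mlo, w1, b1, w2, b2):
    """Gate both views with shared SE weights, then scale by their cosine similarity."""
    o = o_cc + o_mlo                            # channel-wise concatenation (assumed)
    gates = excite(squeeze(o), w1, b1, w2, b2)  # one gate per concatenated channel
    c = len(o_cc)
    cc_se = scale_channels(o_cc, gates[:c])     # assumed split: first half gates CC
    mlo_se = scale_channels(o_mlo, gates[c:])   # second half gates MLO
    sim = cosine_similarity(flatten(cc_se), flatten(mlo_se))
    return scale_channels(cc_se, [sim] * c), scale_channels(mlo_se, [sim] * c), sim
```

With `n` concatenated channels and a reduction ratio of 4, the first dense layer has `n // 4` outputs and the second has `n` outputs, so `w1` has shape `n // 4` by `n` and `w2` has shape `n` by `n // 4`.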

Performance assessment

The performances of the models were evaluated and compared to verify their effectiveness. The data used for the performance evaluation were 80 mammograms of the CC view and 80 mammograms of the MLO view that were not used for training, and every model was evaluated on the same data. A true positive (TP) indicates that the model predicted a lesion area as a lesion area, a false positive (FP) that it predicted a no-lesion area as a lesion area, a false negative (FN) that it predicted a lesion area as a no-lesion area, and a true negative (TN) that it predicted a no-lesion area as a no-lesion area. The TP, FP, FN, and TN were calculated by comparing the predictions with the ground truth generated by the radiologist. For the semantic segmentation assessment, the breast lesion area predicted by each model was evaluated on a pixel-by-pixel basis using the Dice similarity coefficient (DSC), recall (equivalent to sensitivity), and precision.

To evaluate lesion detection performance and investigate the localization results, we generated a bounding box that included the lesion area (Fig. 5). The intersection over union (IoU) is an evaluation metric based on the overlap between two regions. In this study, we measured the IoU between the ground truth region and the region predicted by a model, both defined as the region containing the lesion. A predicted region was counted as a TP if its IoU was above 0.1 and as an FP if it was below 0.1, and a ground truth region was counted as an FN if there was no predicted region to compare with it. Using these counts, we calculated the recall, precision, and false positives per image (FPPI) to analyze the performance of the model in detecting lesion-containing regions. The evaluation formulae are as follows:

$$DSC=\frac{2TP}{2TP+FP+FN}$$
(6)
$$Recall=\frac{TP}{TP+FN}$$
(7)
$$Precision=\frac{TP}{TP+FP}$$
(8)
$$FPPI=\frac{FP}{\text{number of images}}$$
(9)
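Equations (6) to (9) translate directly into code. The sketch below counts TP, FP, FN, and TN from flattened binary masks and then evaluates the four metrics. Returning 0.0 when a denominator is zero is a convention chosen here, not something specified in the text.

```python
def confusion_counts(pred, truth):
    """Count (TP, FP, FN, TN) over two equal-length sequences of 0/1 labels."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def _safe_div(num, den):
    # Convention for this sketch: an undefined ratio is reported as 0.0.
    return num / den if den else 0.0


def dsc(tp, fp, fn):
    """Eq. (6): Dice similarity coefficient."""
    return _safe_div(2 * tp, 2 * tp + fp + fn)


def recall(tp, fn):
    """Eq. (7): recall (sensitivity)."""
    return _safe_div(tp, tp + fn)


def precision(tp, fp):
    """Eq. (8): precision."""
    return _safe_div(tp, tp + fp)


def fppi(fp, num_images):
    """Eq. (9): false positives per image."""
    return _safe_div(fp, num_images)
```

For segmentation the counts are taken per pixel, whereas for detection they are taken per region, so the same functions serve both evaluations.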
Fig. 5

Example of generating lesion regions for detection performance evaluation from semantic segmentation results. In (c) and (d), the red rectangle represents the converted rectangular format of the detection result containing the lesion. (a) Segmented lesions extracted from the model. The red-blended areas represent lesion regions predicted using the proposed model. (b) The converted area includes lesions generated by morphological operations from the segmentation results. (c) Conversion to the lesion detection format. The red squares represent the lesion-containing regions detected by the morphology operation based on the model predictions. (d) the final format of the lesion region reconstructed from the image size changes during preprocessing, considering the proportion and position of the original image.
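The conversion from a segmentation mask to a detection result, and its scoring against the ground truth with the 0.1 IoU threshold, can be sketched as follows. This simplified version takes one bounding box around all predicted pixels instead of the morphological operations used in the study, and it treats an IoU of exactly 0.1 as a TP, which the text leaves unspecified. Boxes are `(x0, y0, x1, y1)` with exclusive upper bounds, and the function names are hypothetical.

```python
def mask_to_box(mask):
    """Bounding box around all non-zero pixels of a 2D mask, or None if it is empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def classify_detection(pred_box, truth_box, threshold=0.1):
    """Label one ground-truth lesion as 'TP', 'FP', or 'FN'.

    FN: no predicted region exists for this lesion.
    TP: the IoU reaches the threshold (equality counted as TP by assumption).
    FP: a region was predicted but overlaps too little.
    """
    if pred_box is None:
        return "FN"
    return "TP" if iou(pred_box, truth_box) >= threshold else "FP"
```

A low threshold such as 0.1 rewards correct localization even when the predicted extent differs considerably from the annotated lesion, which is why a case can score a modest DSC and still count as a TP.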

Results

In this study, VGGnet16, Resnet50, and EfficientnetB5 were employed as encoder networks in the proposed U-shaped SMVnet and PMVnet-based models. The study aimed to assess the effectiveness of the proposed models and understand the underlying reasons by evaluating the performance of each configuration. In addition, a pretrained model with ImageNet weights was applied to the encoder structure of each model, and the results were compared43.

Table 1 lists the performance evaluation results for semantic segmentation, showing how the choice of encoder in each model influenced the outcomes. Among the models without pretrained weights (plain), the proposed PMVnet achieved a DSC of 0.709 with the VGGnet16 encoder, 0.652 with the Resnet50 encoder, and 0.707 with the EfficientnetB5 encoder. Except for the EfficientnetB5-based model, PMVnet outperformed SMVnet on all metrics. When the models were initialized with ImageNet weights, however, PMVnet outperformed SMVnet overall. Figure 6 illustrates the semantic segmentation outcomes of both SMVnet and PMVnet with VGGnet16, Resnet50, and EfficientnetB5 as the encoder.

Table 2 compares the detection performance of each model for the region containing the lesion. With PMVnet, the VGGnet16-based model improved by 0.137 in recall and 0.032 in FPPI, and the Resnet50-based model improved by 0.012 in recall and 0.050 in FPPI. The EfficientnetB5-based model showed an increase in sensitivity of 0.013 and a decrease in FPPI of 0.013. Thus, although the EfficientnetB5-based paired-view model performed worse than the single-view model across all metrics in the pixel-by-pixel comparison of semantic segmentation, it achieved better sensitivity in detecting the region containing the lesion.

Fig. 6

Example of segmentation results from models utilizing encoders based on VGGnet16, Resnet50, and EfficientnetB5 for the Hologic (left) and GE (right) cases. (a) Original image. (b) Ground truth. (c) SMVnet-based model results. (d) PMVnet-based model results.

Table 1 Performance results of semantic segmentation.
Table 2 Performance results of lesion detection.

Discussion

To investigate the effectiveness of PMVnet in improving the performance of breast lesion detection, we analyzed the performance assessment results for segmentation and detection with and without ImageNet weights. The segmentation results show that the VGGnet16-based encoders exhibit the largest performance improvement when using PMVnet, followed by Resnet50 and EfficientnetB5, and that the performance improvement decreases as the model structure becomes more complex. PMVnet, designed to extract individual feature maps for each view image and concatenate the feature maps, experiences a rapid increase in the number of parameters as the encoder structure deepens and becomes more complex. Due to this parameter proliferation, applying PMVnet to the same architecture without structural modifications may impede model optimization. Furthermore, the highest performance of VGGnet16, the shallowest and simplest structure in this study, indicates that it is optimized for extracting feature maps from data owing to its structural characteristics. Therefore, in this study, we show that the model size and depth are not necessarily correlated to the performance of the networks and data. Numerous studies indicate no direct correlation between model size and performance. To enhance performance, various strategies have been employed, including modifying the activation function, refining the loss function, and adjusting factors such as the resolution of the input image, the depth of the model layers, and the number of channels in the extracted feature map44,45. In the future, we must develop a model with an optimal architecture by considering components such as the size of the input data and the number of channel layers.

Moreover, to verify the suitability of the PMVnet for sharing significant positional information between the two standard views, we analyzed the detection performance by converting it into breast lesion inclusion regions. For all models, the recall value of the detection performance was proportional to the DSC value of the segmentation result; however, in contrast to the segmentation performance of the EfficientnetB5 model, the model with PMVnet showed higher performance. Figure 7 shows examples of conversion to breast lesion areas by the EfficientnetB5-based PMVnet model, which showed a low DSC performance below 0.7; however, the detection performance was measured as TP. The results demonstrated that the large size of the model and the excessive number of parameters in PMVnet did not significantly affect the extraction of feature maps required for details in pixels but were significant in extracting effective feature maps for localization. Additionally, the lower segmentation performance compared to detection is attributed to the challenges in accurately predicting the boundaries and shape of the breast lesion.

Breast lesions on mammography can be subdivided by shape, margin, and density according to the Breast Imaging Reporting and Data System (BI-RADS), and the types differ significantly from one another. However, in this study, the lesion type was not considered when constructing the training data, and all lesion types were treated as a single lesion class. Thus, the prediction of the center of the lesion, which shows relatively high contrast for most lesion types in mammograms, was successful, whereas the prediction of the boundaries of the various lesion types was not. The distinction between benign and malignant lesions is crucial in diagnosing breast cancer, and the type of lesion is a significant indicator of this distinction. Therefore, future studies may achieve higher accuracy by subdividing the lesion types and developing data distributions and learning models that take them into account. Developing additional models that classify lesions as benign or malignant is also expected to be valuable in assisting breast cancer diagnosis.

Except for VGGnet16, utilizing ImageNet weights resulted in an overall improvement in segmentation and detection performance compared with the models without pretrained weights. For VGGnet16, which was the best-optimized model for the medical data used in this study, initializing with weights fitted to natural images appears to have hindered feature map extraction. The detection performance results mirrored the improvement pattern seen in segmentation. However, the EfficientnetB5-based models exhibited lower precision and FPPI performance for detection, indicating that ImageNet initialization did not universally enhance performance across all models in this study. This result indicates that a comparative study of weight initialization strategies and optimized model structures is necessary in the future. In addition, because breast cancer diagnosis uses breast ultrasound and magnetic resonance imaging data as well as mammograms, we plan to develop PMVnet-based algorithms optimized for breast lesion detection by fusing data from various modalities.

Fig. 7

Example of lesion-containing region detection results from the PMVnet-based model with the EfficientnetB5 encoder, in which the Dice similarity coefficient (DSC) is less than or equal to 0.707 (the average) but the detection is rated as a true positive (TP) in the detection region performance evaluation. Red squares represent the regions estimated by the model; green squares represent the ground truth.

While this study presents a novel approach for effectively merging dual-view mammographic information through the proposed PMVnet, certain limitations exist. First, since this study mainly focused on the structural and methodological aspects of integrating paired mammography views to improve diagnostic accuracy, we did not analyze the computational efficiency. Processing time and efficiency are important for real-world clinical applications, so in future work we will evaluate and optimize the computational efficiency of the proposed method. Second, the dataset used in this study consisted of only abnormal cases, which limits the generalizability to datasets containing both normal and abnormal cases. In future studies, incorporating a balanced dataset will help ensure the proposed model’s broad applicability and robustness. Third, although the performance analysis was validated objectively through quantitative evaluation, it did not include subjective evaluation by expert radiologists, which affects real-world usability. Incorporating expert feedback in future studies would provide valuable insights into the proposed method’s clinical validity and utility. Finally, the effectiveness of PMVnet needs to be validated in larger and more diverse populations. Testing the model across varied demographic groups will help assess its generalizability and robustness, which is an essential next step for ensuring the clinical utility of the proposed method.

Conclusion

In this study, we proposed PMVnet, a breast mass detection algorithm based on paired mammogram views, motivated by the fact that both CC and MLO views are utilized as reference images in clinical diagnosis. PMVnet applies the SE method and cosine similarity to the feature maps extracted from the two view images and shares weights so that the feature information of each image can inform the other. To verify PMVnet, we compared its performance with that of the single-view-based algorithm, and the results showed overall improved detection performance compared with this baseline. However, as the model’s size increased, the proposed algorithm’s effectiveness decreased, and the Efficientnet-based model showed the same or even relatively lower performance values. This is likely because the PMVnet structure has many more parameters, and a more complex encoder does not improve performance when the number of parameters is excessive for fitting the data. In future work, we will develop a model structure that addresses this issue. In addition, applying the clinical diagnostic approach utilized in this study to mammograms and 3D digital breast tomosynthesis data is expected to help clinicians diagnose breast cancer as a computer-aided diagnosis system.