Introduction

Breast cancer is a significant health concern for women worldwide, as it is one of the most commonly diagnosed malignancies and a leading cause of cancer-related death among women1. Timely detection of breast cancer is crucial for improving patient prognosis, and medical imaging serves as a key tool in screening, diagnosis, and treatment planning2,3. Breast ultrasound imaging serves as a commonly adopted supplementary technique to mammography, owing to its non-invasive qualities, low expense, real-time diagnostic capability, and effectiveness in identifying tumors in dense breast tissue4,5. Unlike mammograms, which often struggle with overlapping tissue structures, ultrasound offers superior contrast for distinguishing solid masses from cystic structures and is especially beneficial for younger women with denser breast composition5,6.

Despite its advantages, automated breast lesion segmentation presents substantial challenges. Ultrasound images are inherently characterized by low signal-to-noise ratio, low contrast boundaries, speckle noise, and operator-dependent variability, which collectively hinder the reliable delineation of tumor margins7. Furthermore, intra-class variability and inter-class similarity between malignant and benign masses exacerbate the difficulty of precise segmentation, particularly in small datasets commonly encountered in medical imaging7. These challenges are less pronounced in mammographic images, where tissue structures are generally more consistent and the signal quality is higher8,9,10.

Medical image segmentation has recently seen considerable progress with deep learning approaches, especially convolutional neural networks (CNNs)11,12,13,14. Encoder-decoder frameworks, such as U-Net15 and its variants, have become widely used for biomedical applications due to their efficacy in capturing both fine-grained texture data and high-level semantic information16. Despite their popularity, conventional CNN-based models often struggle to model contextual relationships and long-range spatial dependencies that are necessary for precise segmentation in complex modalities such as breast ultrasound17. Furthermore, when dealing with substantial speckle noise and indistinct tumor boundaries, relying solely on basic skip connections, as adopted in many encoder–decoder architectures, may be insufficient for transmitting detailed spatial information from the encoder to the decoder, which can ultimately hinder segmentation accuracy17.

To overcome these limitations, we introduce a novel hybrid attention-based network for the segmentation of breast ultrasound lesions. It couples a pre-trained DenseNet-121 encoder18, used for reliable feature extraction, with a decoder equipped with multiple attention mechanisms. At the bottleneck, the model integrates Global Spatial Attention (GSA)19, Position Encoding (PE)20, and Scaled Dot-Product Attention (SDPA)21 to capture global context, spatial dependencies, and relative positional information. Furthermore, the Spatial Feature Enhancement Block (SFEB) is incorporated at the skip connections to refine spatial representations, allowing the model to concentrate more effectively on relevant regions. This architecture improves the precise localization of lesions and sharpens boundary definition, both of which are essential for reliable clinical use18,19. For optimization, a composite loss function combining Jaccard index loss and binary cross-entropy (BCE)18 is employed, balancing pixel-level classification with region-level overlap. This strategy improves robustness against class imbalance and accommodates irregular tumor shapes, resulting in more accurate and reliable segmentation outcomes18,19,22.

The key contributions of the proposed HA-Net are outlined below:

  • A hybrid attention network is proposed using an encoder composed of DenseNet-121, pre-trained on ImageNet, specifically designed for precise segmentation of breast ultrasound images (BUSI).

  • A transformer-based attention mechanism is introduced to incorporate spatial, positional, and semantic cues, improving segmentation precision.

  • Spatial Feature Enhancement Block (SFEB) is incorporated in skip connections to refine feature propagation and enhance focus on tumor regions.

  • A combined loss that integrates BCE and Jaccard Index loss is employed to optimize both pixel-level classification and region-level overlap, effectively tackling speckle noise and ambiguous tumor margins.

  • Extensive tests on publicly accessible breast ultrasound datasets show that our proposed HA-Net performs favorably when evaluated against competing state-of-the-art approaches, indicating its potential to help radiologists diagnose breast cancer early and accurately.

The remaining parts of this manuscript are organized as follows. The related work section reviews existing studies with a focus on the limitations of U-Net and conventional CNN-based architectures, the emergence of attention mechanisms, and the recent adoption of Transformer and Mamba-based models in medical imaging. The methodology section presents the proposed framework in detail, including the transformer attention module, transformer self-attention, global spatial attention, the SFEB, and the hybrid loss functions employed for optimization. The experiments section describes the datasets utilized, preprocessing strategies, implementation details, ablation studies, and evaluation metrics performed to validate the model design. The results section reports both numerical and visual outcomes, supported by ablation results, while the discussion provides statistical analysis and interprets the significance of findings in the context of clinical application. Finally, the conclusion highlights the main contributions, summarizes key insights, acknowledges limitations, and outlines potential ideas for future exploration.

Related work

The segmentation of breast tumors has become a major focus in recent research due to its importance in early detection and treatment planning23. Compared to other modalities like mammography, ultrasound imaging offers a number of benefits, such as real-time imaging, reduced ionizing radiation, cost-effectiveness, and improved visibility in dense breast tissue. However, ultrasound images pose unique challenges for automated analysis because of speckle noise, low contrast, operator variability, and anatomical ambiguities23,24. These difficulties have driven the development of advanced deep learning methods capable of extracting robust features and leveraging contextual information to improve segmentation accuracy23,25.

Limitations of U-Net and CNNs

Initial attempts at breast tumor segmentation relied on classical computer vision techniques such as filtering, active contours, and clustering methods26. For example, threshold-based segmentation and graph-based approaches were used in early studies to delineate lesions in ultrasound images27. However, these methods required extensive domain knowledge and struggled with noise sensitivity and over-segmentation23,24.

The field was revolutionized by CNNs, particularly the U-Net design, which enabled the learning of hierarchical features in an end-to-end manner23,25. Recent studies demonstrate that densely connected U-Net variants with attention mechanisms achieve Dice scores exceeding 0.83, outperforming traditional methods23. For instance, ACL-DUNet23 integrates channel attention modules and spatial attention gates to suppress irrelevant regions while enhancing tumor features. Similarly, SK-U-Net24 employs selective kernels with dilated convolutions to adapt receptive fields, achieving a mean Dice score of 0.826 compared with 0.778 for the standard U-Net.

To address limited contextual awareness, multi-branch architectures have emerged. One approach25 combines classification and segmentation branches, achieving an AUC of 0.991 for normal/abnormal classification and a Dice score of 0.898 for segmentation. These models reduce false positives in normal images while maintaining sensitivity, an important advance for clinical screening25. Hybrid designs like DeepCardinal-5028 further optimize computational efficiency, achieving 97% accuracy in tumor detection with real-time processing capabilities.

However, challenges persist in modeling long-range dependencies for lesions with irregular morphology. While attention mechanisms in ACL-DUNet improve spatial focus23, and scale attention modules enhance multi-level feature integration23, fuzzy boundaries in low-contrast ultrasound images remain difficult24. These constraints are being addressed by ongoing advancements in adaptive kernel selection and boundary-guided networks23,24.

Rise of attention mechanisms

Recent advancements in breast tumor segmentation in ultrasound imaging have been driven by the incorporation of attention mechanisms and hybrid network architectures. Early strategies focused on spatial-channel attention to address challenges such as fuzzy lesion boundaries and variable tumor sizes. For example, SC-FCN-BLSTM29 combined bi-directional LSTM with spatial-channel attention to exploit inter-slice contextual information in 3D automated breast ultrasound. Abraham et al.30 presented hybrid attention mechanisms that adaptively reweigh feature maps based on contextual saliency, improving segmentation performance in noisy ultrasound images. Similarly, adaptive attention modules such as HAAM31 replaced standard convolutions in U-Net variants, allowing dynamic adjustment of the receptive field across spatial and channel dimensions for more robust segmentation.

Further improvements were achieved with CBAM-RIUnet4, which combined convolutional block attention modules with residual inception blocks, yielding intersection-over-union (IoU) and Dice scores of 88.71% and 89.38%, respectively, by effectively suppressing irrelevant features. The authors32 presented ESKNet, which integrates selective kernel networks into the U-Net to dynamically modulate receptive fields using attention, enhancing segmentation accuracy across diverse lesion types. Although attention-based models have improved segmentation accuracy, many approaches are still limited in adequately representing long-range spatial relationships, specifically when relying on a single attention strategy. This has led to the exploration of hybrid models that combine multiple attention mechanisms to provide a richer representation of both local and global features.

ARF-Net33 was introduced for breast mass segmentation in both mammographic and ultrasound images, leveraging an encoder–decoder backbone integrated with a Selective Receptive Field Module (SRFM) to adaptively regulate receptive field sizes based on lesion scale, thereby balancing global context and local detail for improved accuracy. In34, the authors presented a lightweight CNN-based model for mammogram segmentation, incorporating feature strengthening modules for enhanced representation, a parallel dilated convolution block for multi-scale context and boundary refinement, and a mutual information loss to maximize consistency with ground truth. These innovations collectively enable accurate and efficient segmentation with low computational cost. ATFE-Net35 employed an Axial-Trans module to efficiently capture long-range dependencies and a Trans-FE module to enhance multi-level feature representations.

Transformer and Mamba-based architectures in medical imaging

Inspired by the breakthroughs of Transformer architectures in natural language processing, Vision Transformers (ViTs) and their variants have gained significant traction in medical image analysis, demonstrating strong capability in modeling global context and capturing long-range dependencies36. Transformers overcome CNNs’ local constraints by enabling global context modeling through self-attention processes. Several studies have successfully incorporated transformers into segmentation pipelines, either as standalone modules or in combination with CNN backbones37,38.

To integrate local convolutional features with long-range contextual information, a hybrid CNN-transformer architecture was presented by He et al.39 and Ma et al.35. While these models demonstrate strong performance, these architectures often face challenges in retaining fine-grained boundary details, which are essential for precise segmentation of medical images. Swin Transformer-based networks address this limitation by employing hierarchical attention and shifted windows to capture features at multiple scales. For instance, DS-TransUNet40 leverages these mechanisms to simultaneously extract coarse and fine features, enhancing segmentation precision. Similarly, Swin-Net41 combines a Swin Transformer encoder with feature refinement and hierarchical multi-scale feature fusion modules to achieve more accurate lesion delineation. SwinHR42 further enhances performance by adopting hierarchical re-parameterization with large kernel convolutions, capturing long-range dependencies efficiently while maintaining high accuracy through shifted window-based self-attention. Cao et al.43 took this further by developing a pixel-wise neighbor representation learning approach (NeighborNet), allowing each pixel to adaptively select its context based on local complexity. This approach is particularly suitable for ultrasound segmentation, where lesion boundaries may be fragmented or ambiguous.

In breast cancer segmentation, a critical research gap lies in the integration of transformer-based models with CNNs, where semantic mismatches between locally extracted CNN features and globally contextualized transformer representations often lead to suboptimal fusion44,45. Inflexible or disjointed fusion strategies, such as rigidly inserting transformer blocks into CNN architectures without addressing feature consistency, result in redundant or insufficient hierarchical representations45. This challenge is exacerbated in noisy or irregular data, such as breast ultrasound images, where speckle artifacts, shadowing, and blurred lesion boundaries create discordance between local texture details and global anatomical structures45,46. Current approaches frequently fail to link the semantic gap between CNNs’ localized feature extraction and transformers’ long-range dependency modeling, particularly in decoder stages where misaligned feature maps reduce segmentation precision for small lesions and complex margins39. Furthermore, the lack of adaptive cross-attention mechanisms to harmonize multi-scale features often diminishes model robustness against ultrasound-specific noise patterns45, highlighting the need for more sophisticated hybrid architectures that enable synergistic local-global feature interaction while maintaining computational efficiency39.

Accurate medical image segmentation is essential for clinical decision-making, but existing CNN-Transformer hybrid models often depend heavily on skip connections, which limits the extraction of contextual features. To address this, MRCTransUNet combines a lightweight MR-ViT with a reciprocal attention (RPA) module to close the semantic gap and retain fine details. The MR-ViT and RPA modules enhance long-range contextual learning in deeper layers, while skip connections are utilized only in the first layer, in contrast to conventional U-Net variants. Tests on breast, brain, and lung datasets show that MRCTransUNet exceeds the performance of current leading methods on Dice and Hausdorff metrics, demonstrating its potential for reliable clinical applications47.

The authors proposed HCMNet48, a hybrid CNN–Mamba network that integrates CNN's strength in local feature extraction with Mamba's capability for efficient global representation. A wavelet feature extraction module enriches feature learning by combining low- and high-frequency components, reducing spatial information loss during downsampling. Furthermore, an adaptive feature fusion module enhances skip connections by dynamically merging encoder and wavelet features, thereby preserving critical details and suppressing redundancy. The authors of AttmNet49 introduced a novel multiscale attention-Mamba (MAM) module within a U-shaped model. The MAM block pairs a Mamba unit, which couples self-attention with Mamba state-space processing, with multi-level convolutional layers to extract features across multiple spatial scales. This design allows the model to retain fine structural features.

Methodology

The proposed HA-Net consists of four key components: an encoder, a decoder, a transformer-based attention module, and a spatial feature enhancement block. For the encoder backbone, DenseNet-12150 is used to effectively capture both complex and fine-grained representations. DenseNet's dense connectivity, in which each layer within a dense block (DB) is directly connected to all subsequent layers, encourages feature reuse, improves gradient flow, and supports efficient information propagation. These characteristics are especially valuable in medical image segmentation, where subtle anatomical variations and boundary precision are critical for reliable lesion delineation. In the encoding path, four hierarchical encoding stages are constructed following the standard DenseNet-121 design. Each stage comprises multiple dense blocks interleaved with transition layers (TLs), as illustrated in Fig. 1. This hierarchical organization enables the model to progressively learn low-level texture features alongside high-level semantic information while maintaining spatial continuity. The dense connectivity within DBs strengthens feature propagation, while TLs serve to reduce dimensionality and regulate complexity without discarding critical details. Together, these mechanisms ensure that the encoder produces a rich, multi-scale, and highly discriminative representation suitable for subsequent decoding and attention operations. To further refine the extracted features, we append a convolutional block (consisting of a \(3 \times 3\) convolution, a ReLU activation, and batch normalization (BN)) after the pre-trained DenseNet-121.
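For concreteness, the sketch below shows how such an encoder could be assembled in tf.keras, the framework used in this work. The skip-connection layer names and the three-channel input (grayscale frames replicated across channels) are illustrative assumptions, as these details vary across Keras versions and are not fully specified here.

```python
# A minimal encoder sketch, assuming tf.keras. Skip-connection layer names
# below are typical of tf.keras DenseNet-121 releases but vary by version,
# so they are illustrative rather than the exact configuration.
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(input_shape=(256, 256, 3)):
    # Grayscale ultrasound frames are assumed replicated to three channels
    # so the ImageNet-pretrained weights can be reused.
    backbone = tf.keras.applications.DenseNet121(
        include_top=False, weights="imagenet", input_shape=input_shape)
    # One feature map per resolution stage for the skip connections.
    skip_names = ["conv1/relu", "pool2_conv", "pool3_conv", "pool4_conv"]
    skips = [backbone.get_layer(n).output for n in skip_names]
    x = backbone.output  # bottleneck features fed to the attention module
    # Appended refinement block: 3x3 convolution, ReLU, then BN,
    # following the ordering described in the text.
    x = layers.Conv2D(512, 3, padding="same")(x)
    x = layers.ReLU()(x)
    x = layers.BatchNormalization()(x)
    return tf.keras.Model(backbone.input, [x] + skips, name="encoder")
```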

Fig. 1
figure 1

The specifics of the HA-Net. The proposed HA-Net includes a transformer attention module, a spatial feature enhancement block, a pre-trained encoder, and a dedicated decoder.

The decoding path follows a simplified U-Net15 inspired design, optimized to maintain strong representational power while reducing the number of parameters for improved computational efficiency. Rather than relying on transposed convolutions, which are prone to introducing checkerboard artifacts and can substantially increase computational complexity, our approach utilizes bilinear upsampling followed by convolutional layers. This combination preserves spatial resolution and fine-grained feature details while minimizing parameter count and inference time. By preserving detailed feature reconstruction and precise boundary delineation, the proposed decoding pathway delivers accurate segmentation while keeping computational demands low.
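As a minimal illustration of this design choice, one decoder stage could be written as follows in tf.keras; the filter count and fusion order are illustrative assumptions.

```python
# A sketch of a single decoder stage, assuming tf.keras. Bilinear
# upsampling followed by convolution replaces transposed convolution,
# as motivated above.
from tensorflow.keras import layers

def up_block(x, skip, filters):
    # Double the spatial resolution without learnable upsampling weights,
    # avoiding checkerboard artifacts.
    x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
    # Fuse with the (SFEB-refined) encoder features from the skip path.
    x = layers.Concatenate()([x, skip])
    # Convolutional block: 3x3 conv -> BN -> ReLU.
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x
```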

The proposed HA-Net employs five sequential convolutional blocks in the decoder path, as shown in Fig. 1, to progressively extract hierarchical features. Each convolutional block is composed of a \(3 \times 3\) convolution layer, BN, and a ReLU activation function to stabilize training and enhance feature representation. This design ensures stable and efficient training while enabling the network to simultaneously capture high-level semantic representations and fine-grained textural details. BN reduces internal covariate shift, speeds up convergence, and enhances generalization, while the ReLU activation adds non-linearity to efficiently represent intricate patterns in breast ultrasound images.

Additionally, skip connections from the encoder are employed to preserve spatial information and facilitate multi-scale feature fusion across different resolution levels. The integration of SFEB within the decoding path further refines the feature representations by selectively emphasizing tumor-relevant regions, thereby improving segmentation accuracy while maintaining a reduced parameter count relative to a conventional U-Net. This optimized architecture not only enables efficient processing of high-resolution medical images but also ensures precise delineation of fine structural details, making the model highly suitable for practical clinical deployment and real-time applications.

Transformer Attention Module (TAM)

To strengthen the method's capacity to capture and fuse contextual information, we incorporate a self-aware attention module51. There are two main components to this module. First, contextual information is captured by the Transformer Self-Attention (TSA) block by taking into account relative positions within the input data. It integrates positional information by concatenating input features with positional embeddings, allowing the model to understand spatial relationships within the input data. Second, the Global Spatial Attention (GSA) block refines local contextual information by aggregating it with global features. By incorporating a broader perspective, this design enhances the model's ability to retain fine structural details while simultaneously maintaining a holistic understanding of the lesion's overall morphology. Collectively, these attention mechanisms improve feature representation, helping the model effectively balance local and global information for more precise segmentation.

Figure 2 depicts the Transformer Attention Module (TAM) architecture. The input feature map \(F_{in}\) is first enriched with positional encoding and passed to two parallel branches. In the top branch (TSA), the encoded features are projected into Q, K, and V for calculation of scaled dot-product attention, capturing long-range contextual dependencies. In the bottom branch (GSA), the features are embedded into two complementary representations whose dot product produces a spatial attention map, highlighting global positional relationships. Finally, the outputs of TSA, GSA, and the PE-enriched input are concatenated to generate \(F_{out}\), which jointly preserves local details, global context, and spatial correlations.

Fig. 2
figure 2

The components of the transformer attention module. The top block illustrates the transformer self-attention, while the bottom block displays the global spatial attention block.

Transformer Self-Attention (TSA)

Since multi-head attention effectively captures self-correlation but cannot learn spatial relationships, a common strategy is to introduce positional encoding before applying attention. Specifically, the incoming feature representation \(F\in \mathbb {R}^{h\times w\times c}\) is first enriched with positional information, and the resulting position-aware representation is fed into the multi-head attention block (Fig. 2). F is then reshaped into a two-dimensional representation \(F' \in \mathbb {R}^{c \times (h \times w)}\). Using learnable weight matrices, \(F'\) is projected into three distinct spaces: queries \(Q \in \mathbb {R}^{c \times (h \times w)}\), keys \(K \in \mathbb {R}^{c \times (h \times w)}\), and values \(V \in \mathbb {R}^{c \times (h \times w)}\), defined as

$$\begin{aligned} Q = W_q F', \quad K = W_k F', \quad V = W_v F', \end{aligned}$$
(1)

where \(W_q, W_k, W_v \in \mathbb {R}^{c \times c}\) are learnable projection matrices.

The scaled dot-product attention mechanism computes the similarity between different channels by applying the Softmax-normalized dot-product of Q and the transposed version of K. The resulting matrix represents the contextual attention map \(A \in \mathbb {R}^{c \times c}\). Finally, the contextual attention map A is applied to the value matrix V to produce attention-weighted feature representations. This mechanism allows the multi-head attention module to selectively aggregate relevant features while preserving essential contextual dependencies across spatial positions. Mathematically, the Transformer Self-Attention (TSA) operation can be expressed as:

$$\begin{aligned} A_{\textrm{TSA}}(Q, K, V) = \textrm{softmax}\Big (\frac{Q K^\top }{\sqrt{d_k}}\Big ) V, \end{aligned}$$
(2)

where \(d_k\) denotes the dimensionality of the key vectors, ensuring proper scaling of the dot-product attention. This formulation allows the TSA block to model long-range dependencies and refine feature aggregation while maintaining spatial coherence in the output representations.
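For illustration, a single-head sketch of this computation in tf.keras is given below; it follows the channel-wise \(c \times c\) attention map described above, while the multi-head arrangement and hyperparameters are omitted as implementation details.

```python
# A single-head sketch of the TSA computation in Eqs. (1)-(2), assuming
# tf.keras and static spatial dimensions.
import tensorflow as tf
from tensorflow.keras import layers

class TSA(layers.Layer):
    def build(self, input_shape):
        c = int(input_shape[-1])
        # Learnable projections W_q, W_k, W_v, each mixing the c channels.
        self.wq = layers.Dense(c, use_bias=False)
        self.wk = layers.Dense(c, use_bias=False)
        self.wv = layers.Dense(c, use_bias=False)

    def call(self, f):  # f: (batch, h, w, c), already position-encoded
        b = tf.shape(f)[0]
        h, w, c = f.shape[1], f.shape[2], f.shape[3]
        fp = tf.reshape(f, (b, h * w, c))  # F' flattened: tokens x channels
        q, k, v = self.wq(fp), self.wk(fp), self.wv(fp)
        # Channel-wise attention map A = softmax(Q K^T / sqrt(d_k)), c x c.
        a = tf.nn.softmax(
            tf.matmul(q, k, transpose_a=True)
            / tf.math.sqrt(tf.cast(c, tf.float32)), axis=-1)
        out = tf.matmul(v, a)  # attention-weighted values
        return tf.reshape(out, (b, h, w, c))
```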

Global Spatial Attention (GSA)

To further enhance contextual learning, the TAM incorporates the Global Spatial Attention (GSA) block, which captures correlations among different spatial positions across the feature map. The initial feature representation \(F\in \mathbb {R}^{h\times w\times c}\) is embedded into \(F^c\in \mathbb {R}^{h\times w\times c}\) and \(F^{cc}\in \mathbb {R}^{h\times w\times c'}\), where \(c' = c/2\). The latter is reshaped into \(F1^{cc}\in \mathbb {R}^{(h\times w)\times c'}\) and \(F2^{cc}\in \mathbb {R}^{c'\times (h\times w)}\). The scaled dot product of these matrices is computed and subsequently passed through a Softmax normalization layer, resulting in an attention map \(GSA \in \mathbb {R}^{(h \times w) \times (h \times w)}\) that encodes the pairwise correlations between different spatial positions. The spatial attention map is then formulated as:

$$\begin{aligned} A_{GSA} = \textrm{softmax}(F1^{cc} \cdot F2^{cc}) \end{aligned}$$
(3)

The outputs from TSA, GSA, and the original input are then concatenated to create the output feature map (\(F_{out}\in \mathbb {R}^{h\times w\times c}\)) of the self-aware attention module. This design ensures that both local spatial relationships and global context are well captured, improving the model's capacity to extract salient features for precise segmentation.
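A hedged sketch of the GSA branch is given below; realizing the embeddings with \(1 \times 1\) convolutions and applying the spatial map to the c-channel embedding \(F^c\) reflect our reading of Fig. 2 rather than a verbatim reproduction of the implementation.

```python
# A sketch of the GSA branch in Eq. (3), assuming tf.keras.
import tensorflow as tf
from tensorflow.keras import layers

class GSA(layers.Layer):
    def build(self, input_shape):
        c = int(input_shape[-1])
        self.embed_c = layers.Conv2D(c, 1)       # F^c, full channel depth
        self.embed_1 = layers.Conv2D(c // 2, 1)  # first c' = c/2 embedding
        self.embed_2 = layers.Conv2D(c // 2, 1)  # second c' = c/2 embedding

    def call(self, f):  # f: (batch, h, w, c)
        b = tf.shape(f)[0]
        h, w, c = f.shape[1], f.shape[2], f.shape[3]
        fc = tf.reshape(self.embed_c(f), (b, h * w, c))
        f1 = tf.reshape(self.embed_1(f), (b, h * w, c // 2))
        f2 = tf.reshape(self.embed_2(f), (b, h * w, c // 2))
        # Pairwise spatial correlations: an (hw x hw) attention map.
        a = tf.nn.softmax(tf.matmul(f1, f2, transpose_b=True), axis=-1)
        out = tf.matmul(a, fc)  # reweight all spatial positions globally
        return tf.reshape(out, (b, h, w, c))
```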

Spatial Feature Enhancement Block (SFEB)

Pooling operations play a critical role in deep learning by reducing the spatial dimensions of feature maps, accelerating computation, and enhancing feature robustness. In lesion segmentation, it is crucial to simultaneously capture fine-grained structural details and global contextual cues, since tumors often exhibit low contrast, small spatial extent, and heterogeneous textural patterns. To address these challenges, we incorporate an SFEB within the skip connections of our network, which strengthens feature fusion, spatial awareness, and residual learning, ultimately improving segmentation accuracy and the preservation of lesion boundaries.

To improve discriminative characteristics and refine spatial features before fusion, the SFEB is integrated into skip connections. Global max-pooling highlights sharp lesion boundaries, while global average-pooling preserves contextual information, and their combination ensures a balance between local detail and global context. The attention pathway further reweights channels to emphasize lesion-relevant features and suppress background noise. Finally, residual fusion preserves fine spatial details, making the SFEB particularly effective for refining skip connection features in noisy ultrasound images with irregular tumor boundaries.

The input tensor is first passed through a \(3 \times 3\) convolutional layer, BN, and a ReLU activation, resulting in an intermediate feature map \(I_1\).

$$\begin{aligned} I_{1} = \text {ReLU}(\mu (f^{3\times 3}(I))), \end{aligned}$$
(4)

where \(I \in \mathbb {R}^{H\times W\times C}\) represents the input tensor with height H, width W, and channel depth C. To extract global contextual information, the intermediate feature map \(I_1\) is subjected to both global max-pooling (\(G_m\)) and global average-pooling (\(G_a\)), producing the pooled feature maps \(GP_m\) and \(GP_a\), respectively:

$$\begin{aligned} GP_{m} = G_{m}(I_1), \quad GP_{a} = G_{a}(I_1) \end{aligned}$$
(5)

These pooled features are concatenated to form a complementary representation \(P_o\):

$$\begin{aligned} P_{o} = GP_{m} \copyright GP_{a}. \end{aligned}$$
(6)

The concatenated pooled features \(P_o\) are then refined by applying a \(3 \times 3\) convolution, BN, and ReLU activation:

$$\begin{aligned} F_{c} = \text {ReLU}(\mu (f^{3\times 3}(P_{o}))). \end{aligned}$$
(7)

In parallel, global average-pooling \(G_a\) is applied to the original input I, followed by a \(1 \times 1\) convolution, BN, and sigmoid activation, producing an attention map \(F_{cc}\):

$$\begin{aligned} F_{G} = G_{a}(I), \quad F_{cc} = \sigma (\mu (f^{1\times 1}(F_{G}))), \end{aligned}$$
(8)

where \(\sigma\) represents the sigmoid activation. The attention-enhanced features \(F_{em}\) are computed by element-wise multiplication of the refined feature map \(F_c\) and the attention coefficients \(F_{cc}\):

$$\begin{aligned} F_{em} = F_{c} \otimes F_{cc}. \end{aligned}$$
(9)

Finally, to preserve the original spatial information and maintain residual learning, the input tensor I is added element-wise to the attention-enhanced features:

$$\begin{aligned} F = F_{em} \oplus I. \end{aligned}$$
(10)

By maintaining fine structural details, this architecture effectively balances local feature intricacy with global contextual information, enabling the model to focus on relevant regions. The SFEB’s architecture is shown in Fig. 3.
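To make the data flow of Eqs. (4)-(10) concrete, a minimal tf.keras sketch is given below; keeping the pooled descriptors as \(1 \times 1\) maps so that Eqs. (9) and (10) broadcast over the spatial dimensions is our assumption where the text leaves the shapes implicit.

```python
# A minimal SFEB sketch following Eqs. (4)-(10), assuming tf.keras.
# `filters` is assumed to match the channel depth of I so that the
# residual addition in Eq. (10) is valid.
import tensorflow as tf
from tensorflow.keras import layers

def sfeb(i, filters):
    # Eq. (4): 3x3 conv -> BN -> ReLU on the input tensor I.
    i1 = layers.ReLU()(layers.BatchNormalization()(
        layers.Conv2D(filters, 3, padding="same")(i)))
    # Eq. (5): global max- and average-pooling, kept as 1x1 maps.
    gp_m = tf.reduce_max(i1, axis=[1, 2], keepdims=True)
    gp_a = tf.reduce_mean(i1, axis=[1, 2], keepdims=True)
    # Eq. (6): concatenate the complementary pooled descriptors.
    p_o = layers.Concatenate()([gp_m, gp_a])
    # Eq. (7): refine with a 3x3 conv -> BN -> ReLU.
    f_c = layers.ReLU()(layers.BatchNormalization()(
        layers.Conv2D(filters, 3, padding="same")(p_o)))
    # Eq. (8): channel-attention path on the original input I.
    f_g = tf.reduce_mean(i, axis=[1, 2], keepdims=True)
    f_cc = tf.sigmoid(layers.BatchNormalization()(
        layers.Conv2D(filters, 1)(f_g)))
    # Eq. (9): element-wise (broadcast) multiplication.
    f_em = f_c * f_cc
    # Eq. (10): residual addition with the input tensor.
    return f_em + i
```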

Fig. 3
figure 3

Spatial Feature Enhancement Block.

Loss functions

An appropriate choice of a loss function is crucial to train the model because it directly influences the convergence behavior, stability, and the balance between pixel-wise accuracy and region-level consistency in segmentation tasks52. BCE loss quantifies the pixel-wise difference between the predicted probability map and the ground truth mask. By computing the negative log-likelihood of the predicted probabilities, it penalizes incorrect classifications and enforces accurate pixel-level segmentation. Formally, for N pixels, BCE loss is defined as:

$$\begin{aligned} \textrm{Loss}_\textrm{bce} = -\sum _{i=1}^{N} \Big ( y_{i}\,\textrm{log}\,\hat{y}_{i} + (1-y_{i})\,\textrm{log}(1-\hat{y}_{i}) \Big ), \end{aligned}$$
(11)

where \(y_i \in \{0,1\}\) represents the ground truth label of the i-th pixel, and \(\hat{y}_i \in [0,1]\) denotes the predicted probability. This formulation ensures that confident misclassifications are penalized more heavily, guiding the model toward robust pixel-level discrimination.

Jaccard loss, also known as Intersection-over-Union (IoU) loss, is a region-level metric that evaluates the degree of overlap between the predicted segmentation mask and the ground truth, emphasizing accurate delineation of target regions. Because Jaccard loss prioritizes structural similarity over pixel-wise loss, it works especially well with highly unbalanced datasets in which the lesion or region of interest only takes up a small portion of the image. It has the following mathematical definition:

$$\begin{aligned} \textrm{Loss}_\textrm{jaccard} = 1-\frac{\sum _{i=1}^{N}(P_{i}G_{i})}{\sum _{i=1}^{N}(P_{i} + G_{i} - P_{i}G_{i}) }, \end{aligned}$$
(12)

where \(P_i \in [0,1]\) is the predicted probability for the i-th pixel, and \(G_i \in \{0,1\}\) is the corresponding ground truth label.

To leverage both pixel-level accuracy (captured by BCE loss) and region-level similarity (captured by Jaccard loss), a hybrid objective is formulated. The final training objective is defined as:

$$\begin{aligned} \textrm{Loss}_\textrm{combined} = \textrm{Loss}_\textrm{bce} + \textrm{Loss}_\textrm{jaccard}, \end{aligned}$$
(13)

where \(\textrm{Loss}_\textrm{bce}\) ensures fine-grained classification at each pixel, and \(\textrm{Loss}_\textrm{jaccard}\) enforces global shape and boundary consistency. This joint formulation stabilizes convergence and improves segmentation performance across varying lesion sizes and shapes.
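A minimal sketch of this objective in tf.keras is shown below; the smoothing constant is an added assumption for numerical stability and is not part of the stated formulation.

```python
# A sketch of the hybrid objective in Eqs. (11)-(13), assuming tf.keras
# and soft (probabilistic) predictions.
import tensorflow as tf

def combined_loss(y_true, y_pred, smooth=1e-6):
    # Eq. (11): pixel-wise binary cross-entropy.
    bce = tf.reduce_mean(
        tf.keras.losses.binary_crossentropy(y_true, y_pred))
    # Eq. (12): soft Jaccard (IoU) loss over the batch.
    inter = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - inter
    jaccard = 1.0 - (inter + smooth) / (union + smooth)
    # Eq. (13): unweighted sum of the two terms.
    return bce + jaccard
```

In a Keras workflow, this function can be passed directly as the loss argument of model.compile.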

Code availability

The source code implementing the proposed Hybrid Attention Network (HA-Net) for breast tumor segmentation in ultrasound images is openly available at the GitHub repository: https://github.com/nisarahmedrana/HA-Net. A DOI has been generated via Zenodo to ensure long-term accessibility: https://doi.org/10.5281/zenodo.17190194. The repository includes the processed dataset, a Jupyter notebook describing the architecture, preprocessing pipelines, training and evaluation scripts, and the usage instructions required to reproduce the results presented in this study. The code is released for research purposes only under the specified license.

Experiments

Datasets for breast ultrasound image segmentation

To rigorously evaluate the effectiveness of the HA-Net, we conducted extensive experiments on two publicly available breast ultrasound datasets, BUSI and UDIAT. Both datasets consist of grayscale ultrasound images with corresponding pixel-level annotations provided by clinical experts, serving as reliable benchmarks for tumor segmentation. The BUSI dataset contains ultrasonograms of multiple patients with varying lesion types (benign, malignant, and normal), thereby reflecting the heterogeneity of real clinical scenarios. The UDIAT dataset, on the other hand, offers high-quality ultrasound scans with consistent acquisition settings, enabling controlled evaluation. Together, these datasets provide complementary characteristics, ensuring that the proposed method is validated across diverse imaging conditions and lesion appearances.

BUSI Dataset: The BUSI dataset53 comprises 780 grayscale breast ultrasound images obtained from 600 female patients within the age range of 25 to 75 years. Each image has an approximate spatial resolution of \(500 \times 500\) pixels and is annotated into three diagnostic categories: normal, benign, and malignant. For tumor segmentation, only the benign and malignant categories were retained, as these are accompanied by expert-annotated binary masks delineating tumor regions. Images belonging to the normal class were excluded since they lack lesion annotations. To ensure uniformity in model input, all images and their corresponding masks were resized to \(256 \times 256\). This preprocessing step not only standardizes input dimensions across the dataset but also reduces computational overhead during training and evaluation.
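As an illustration, a preprocessing routine along these lines might look as follows; the OpenCV-based I/O, the mask binarization threshold, and the interpolation choices are assumptions rather than the authors' exact pipeline.

```python
# A minimal preprocessing sketch under the stated setup; file paths and
# the mask-naming convention are illustrative.
import cv2  # OpenCV for image I/O and resizing
import numpy as np

def load_pair(img_path, mask_path, size=(256, 256)):
    img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    # Bilinear resizing for images; nearest-neighbour for masks so the
    # expert annotations stay strictly binary.
    img = cv2.resize(img, size, interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, size, interpolation=cv2.INTER_NEAREST)
    img = img.astype(np.float32) / 255.0
    mask = (mask > 127).astype(np.float32)  # binarize the annotation
    return img[..., None], mask[..., None]
```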

UDIAT Dataset: The UDIAT dataset was introduced by54 and consists of 163 breast ultrasound images. These images are divided into benign and malignant categories and have a resolution of \(760 \times 570\) pixels. Each image is accompanied by a pixel-wise segmentation mask with expert annotations identifying tumor locations. Before training, all images and their corresponding masks were resized to \(256 \times 256\) pixels to ensure consistency. Table 1 provides details of the BUSI and UDIAT datasets’ separation into training and test sets.

Table 1 A summary of the datasets employed in this study, including the total number of images, diagnostic categories, and image resolution, is presented.

Implementation details

To ensure robust training and reliable performance evaluation, 20% of the training set was withheld for validation, enabling effective monitoring of learning progress and guiding hyperparameter adjustments. Model optimization was performed using the Adam optimizer55 with an initial learning rate of 0.001. To promote stable convergence and mitigate the risk of stagnation, the learning rate was reduced by a factor of 0.25 when the validation loss plateaued for four consecutive epochs. In addition, early stopping was employed to prevent overfitting and automatically terminate training once no further improvements were observed.

A hybrid loss function combining Binary Cross-Entropy (BCE) and Jaccard loss was utilized, allowing simultaneous optimization at both the pixel level and the region overlap level. Training was conducted with a batch size of 10, and the model achieved competitive performance without the application of explicit data augmentation strategies. The proposed framework was implemented in Keras with TensorFlow as the backend. All experiments were executed on a workstation equipped with an NVIDIA Tesla K80 GPU, an Intel Xeon 2.20 GHz CPU, 13 GB of system RAM, and 12 GB of dedicated GPU memory.
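The corresponding training setup can be sketched as follows; the learning-rate factor, plateau patience, validation split, and batch size follow the text, while the early-stopping patience, epoch budget, and data variables are illustrative assumptions.

```python
# A sketch of the stated training configuration, assuming tf.keras and the
# combined_loss function sketched earlier; `model`, `train_images`, and
# `train_masks` are placeholders.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
callbacks = [
    # Reduce the LR by a factor of 0.25 after 4 epochs without improvement.
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.25, patience=4),
    # Stop once the validation loss no longer improves.
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True),
]
model.compile(optimizer=optimizer, loss=combined_loss)
model.fit(train_images, train_masks,
          validation_split=0.2, batch_size=10,
          epochs=200, callbacks=callbacks)
```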

Evaluation metrics

The segmentation performance of the proposed HA-Net was quantitatively assessed using a set of widely adopted evaluation metrics in medical image analysis. These metrics capture both pixel-level accuracy and region-level overlap, providing a comprehensive view of model performance. The definitions and interpretations of all metrics are summarized in Table 2.

Table 2 Summary of evaluation metrics used for segmentation performance assessment.
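For reference, the standard definitions of these metrics for binary masks can be implemented as below; edge-case handling (e.g., empty masks) is left out and may differ from the exact conventions of Table 2.

```python
# Hedged reference implementations of the reported metrics for binary
# masks; assumes non-degenerate masks (no division by zero).
import numpy as np

def segmentation_metrics(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)    # true positives
    fp = np.sum(pred & ~gt)   # false positives
    fn = np.sum(~pred & gt)   # false negatives
    tn = np.sum(~pred & ~gt)  # true negatives
    return {
        "dice":        2 * tp / (2 * tp + fp + fn),
        "iou":         tp / (tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "precision":   tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
    }
```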

Ablation studies

A series of ablation studies were performed on the BUSI dataset to systematically assess the individual contributions of each component within the proposed HA-Net. The pre-trained DenseNet-121 encoder used in the backbone model was chosen for its strong feature extraction and multi-scale representation capabilities. To progressively enhance spatial and contextual understanding, we incrementally integrated three key modules into the baseline: a convolutional block, the SFEB, and the TAM.

The results in Table 3 clearly demonstrate the impact of each module, with sequential integration consistently improving performance across all evaluation metrics. In particular, the incorporation of SFEB and TAM yields significant gains in performance, underscoring their effectiveness in refining feature representation and enhancing lesion localization. These findings highlight the critical role of both spatial feature refinement and attention-based contextual modeling in enabling precise delineation of tumor regions, validating the design choices of the proposed architecture.

Table 3 Outcomes of ablation studies on the BUSI dataset.

To further explain the interpretability of the HA-Net, heatmaps of the SFEB are visualized using Grad-CAM56 on the BUSI dataset. The SFEB module is integrated into skip connections across four hierarchical levels of the network, enabling progressive refinement of feature representations. The visualization demonstrates how the SFEB adaptively emphasizes salient lesion regions while suppressing background noise throughout the encoding–decoding process. In the presented results in Fig. 4, the first column corresponds to the original ultrasound image, while the second column provides the ground truth segmentation mask. The subsequent columns depict the SFEB attention responses at the four skip-connection stages. These stage-wise heatmaps highlight the evolving focus of the network, where shallow layers capture broad structural context and deeper layers progressively concentrate on more discriminative lesion boundaries. This stage-wise visualization confirms that the SFEB effectively guides the network toward lesion-relevant regions, thereby improving the reliability of feature propagation through skip connections and contributing to more accurate segmentation outcomes.

Fig. 4
figure 4

Heatmaps of the SFEB at four skip-connection stages on the BUSI dataset. The first column shows the original image, the second column displays the ground truth, and the remaining columns present SFEB responses at successive stages, highlighting progressively focused lesion regions.

Results and discussion

Comparison with SOTA methods on the BUSI dataset

To comprehensively assess the efficiency of HA-Net, the outcomes of SOTA approaches on the BUSI breast ultrasound dataset are compared with the proposed HA-Net. The selected benchmark models encompass a range of architectures and design strategies, including classical encoder-decoder variants (U-Net, UNet++, Attention U-Net), transformer-based networks (Swin-UNet, Eh-Former, BGRD-TransUNet), and specialized attention-guided frameworks (BGRA-GSA, AAU-Net, MCRNet, DDRA-Net). These models represent the current landscape of approaches for the segmentation task and provide a rigorous reference for evaluating the HA-Net.

The quantitative outcomes are summarized in Table 4. The HA-Net consistently achieves strong performance across all quantitative metrics, including DSC, IoU, sensitivity, precision, specificity, and accuracy. The combined use of the SFEB and TAM equips the model with the ability to emphasize detailed boundary information while retaining a broader contextual understanding. This architectural design enables the network to effectively handle common challenges in breast ultrasound imaging.

Table 4 Comparison with cutting-edge segmentation methods on the BUSI dataset.

The improvements are particularly notable in metrics that emphasize overlap and boundary accuracy (DSC and IoU), highlighting the method’s ability to precisely delineate tumor regions. High sensitivity and precision scores further indicate that the model reliably identifies tumor pixels with lower false positives, which is critical in clinical practice for accurate diagnosis and reducing unnecessary interventions. Furthermore, the model maintains high specificity, demonstrating its ability to correctly classify normal tissue and avoid mislabeling background regions as lesions.

By effectively combining dense feature extraction, attention-based contextual modeling, and spatial feature refinement, the framework consistently outperforms existing SOTA methods, providing reliable and accurate segmentation results that could assist radiologists in early breast cancer detection.

Statistical significance analysis

To validate the observed performance improvements of HA-Net over other segmentation models on the BUSI dataset, we applied the Wilcoxon signed-rank test. Compared to Attention U-Net, HA-Net achieved a p-value of \(1.55 \times 10^{-14}\), and against U-Net, the p-value was \(2.71 \times 10^{-14}\). These highly significant results confirm that the superior performance of HA-Net is statistically robust, highlighting its reliability and effectiveness.
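For reproducibility, such a test can be run with SciPy as sketched below, assuming paired per-image Dice scores for the two models on the same test images; the variable names are illustrative.

```python
# A sketch of the paired significance test, assuming SciPy.
from scipy.stats import wilcoxon

# dice_hanet, dice_attention_unet: per-image Dice scores, same test set.
stat, p_value = wilcoxon(dice_hanet, dice_attention_unet)
print(f"Wilcoxon statistic = {stat:.3f}, p = {p_value:.2e}")
```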

Comparison with SOTA methods on the UDIAT dataset

The generalization capability of HA-Net was further examined through comparative experiments on the UDIAT breast ultrasound dataset. The benchmarked models encompass a wide range of recent SOTA approaches, including BGRA-GSA, AAU-Net, MCRNet, Swin-UNet, Eh-Former, U-Net, BGRD-TransUNet, Attention U-Net, and UNet++. These models collectively represent diverse architectural strategies, from encoder-decoder networks to attention-guided and transformer-based frameworks, providing a robust reference for performance assessment.

As presented in Table 5, HA-Net demonstrates strong and consistent performance across all evaluated metrics. It attains the highest scores in Jaccard Index, Dice coefficient, and specificity, which are critical indicators of precise tumor localization and accurate segmentation boundaries. Although BGRD-TransUNet exhibits slightly higher sensitivity and overall accuracy, our model demonstrates a more balanced performance profile, with notable advantages in overlap-based metrics that are particularly relevant for assessing segmentation quality in medical imaging.

These findings highlight the robustness and adaptability of the model across datasets with diverse imaging conditions and tumor characteristics, thereby demonstrating its strong potential for reliable integration into real-world clinical breast cancer diagnosis and screening workflows. Consistent results on UDIAT further demonstrate the suitability of the proposed HA-Net for clinical deployment, supporting its role in accurate tumor segmentation for early diagnosis and effective treatment planning.

Table 5 Comparison with cutting-edge segmentation approaches on the UDIAT dataset.

Statistical significance analysis

To evaluate the statistical significance of HA-Net’s performance on the UDIAT dataset, a Wilcoxon signed-rank test was conducted, comparing the proposed model against Attention U-Net and U-Net. The results indicate a p-value of \(1.76 \times 10^{-6}\) when compared to Attention U-Net and \(1.47 \times 10^{-5}\) against U-Net. These highly significant values demonstrate that HA-Net’s superior segmentation performance is statistically robust, confirming its effectiveness and reliability in accurately delineating breast tumors in ultrasound images.

Qualitative visualization results

To complement the quantitative results, we also provide qualitative segmentation examples from both the BUSI and UDIAT datasets. Figure 5 presents side-by-side comparisons between HA-Net and representative SOTA models, including U-Net15, UNet++63, and Attention U-Net62, on the BUSI dataset. These visual results emphasize how different methods perform on challenging conditions, such as poor contrast, heterogeneous lesion boundaries, and varying tumor sizes.

Fig. 5
figure 5

Qualitative comparisons with state-of-the-art methods on the BUSI dataset are illustrated, where green highlights correctly segmented tumor regions, red marks false-positive detections, and blue indicates missed tumor areas.

The proposed HA-Net consistently produces more precise segmentation boundaries with higher spatial alignment to the ground truth. It effectively suppresses false positives (highlighted in red) and recovers missed tumor regions (highlighted in blue), resulting in cleaner and more reliable segmentation maps. The SFEB and TAM contribute to these improvements by enhancing both local detail and global contextual understanding.

Similarly, Fig. 6 shows qualitative outcomes on the UDIAT dataset, comparing the proposed HA-Net against U-Net, UNet++, and Attention U-Net. The outcomes indicate that the model maintains high segmentation fidelity even in the presence of noise, low-intensity contrast, and irregular tumor morphology. These visualizations reinforce the quantitative findings reported earlier, particularly improvements in Dice coefficient, Jaccard Index, and specificity, emphasizing the model’s robustness.

Fig. 6
figure 6

Visual results comparison with SOTA approaches on the UDIAT dataset.

Discussion

The HA-Net consistently demonstrates strong performance, outperforming recent SOTA models in critical metrics such as the Dice coefficient, Jaccard index, and specificity. These advancements underscore the significance of the proposed HA-Net, which synergistically combines dense feature extraction, a spatial feature enhancement block, and a Transformer-based attention module. The model exhibits a strong ability to accurately delineate lesions even in challenging imaging conditions characterized by low contrast, speckle noise, and irregular tumor morphology, highlighting its robustness and generalizability.

Statistical analyses further validate the significance of these performance gains, particularly when compared against leading segmentation frameworks such as UNet++, Attention U-Net, and BGRD-TransUNet. These findings reinforce the potential clinical utility of HA-Net as a reliable tool for automated breast cancer detection and decision support in real-world scenarios.

Despite these promising results, limitations should be acknowledged. First, the model was trained and evaluated on two datasets of moderate size. While the outcomes are encouraging, further validation on larger, multi-center, or multi-device ultrasound datasets is essential to fully assess generalizability. Second, the proposed method can exhibit reduced performance on images with extremely low contrast or poorly defined lesion boundaries, which may challenge accurate feature extraction and segmentation. Future work could address this limitation by incorporating advanced contrast enhancement techniques, adaptive preprocessing, or specialized attention mechanisms to better handle such challenging cases.

General-purpose backbones such as OverLoCK65, SparX66, TransXNet67, and SegMAN68 have advanced visual recognition through novel architectural strategies, but their validation largely relies on natural image benchmarks. In contrast, HA-Net is explicitly designed for breast ultrasound segmentation, addressing domain-specific challenges including speckle noise, scale variation, and indistinct tumor boundaries. HA-Net combines a DenseNet-121 encoder with hybrid attention modules GSA, PE, and SDPA to model long-range dependencies and contextual feature interactions. The inclusion of SFEB in skip connections strengthens lesion-specific spatial details, while a composite BCE and Jaccard loss ensures balanced optimization across pixel- and region-level accuracy. Compared with OverLoCK’s biologically inspired attention, SparX’s cross-layer aggregation, and TransXNet’s dynamic token mixing, HA-Net adapts these concepts more effectively to the clinical setting by prioritizing feature clarity, spatial refinement, and noise robustness. Unlike SegMAN, which targets large-scale semantic segmentation, HA-Net demonstrates how domain-adaptive design can substantially improve segmentation reliability under the complex conditions of BUS imaging.

Computational complexity

To provide a rigorous evaluation of the proposed HA-Net against SOTA approaches, a computational complexity analysis was conducted on the BUSI dataset. The primary goal of this analysis is to establish the trade-off between segmentation performance and computational cost, which is decisive in resource-constrained environments such as portable ultrasound scanners or clinical workstations.

The comparison considers several complementary metrics. First, the number of trainable parameters is reported, which directly reflects the capacity of the model and its tendency toward overfitting or generalization; models with fewer parameters generally require less storage and allow faster inference, but may sacrifice representational power if overly simplified. Second, the IoU is adopted as the main performance metric, as it provides a reliable measure of region overlap between the predicted segmentation mask and the ground truth annotation and is especially suitable for medical segmentation tasks, where precise boundary delineation is critical. Third, we report the floating-point operations (FLOPs), which represent the theoretical computational cost of a single forward pass through the network; a lower FLOP count reflects reduced arithmetic complexity and improves the model's suitability for real-time clinical deployment. Finally, the memory footprint is reported, capturing the storage and runtime memory requirements, a measure that is crucial where computational resources are limited, such as on edge devices or in cloud-based telemedicine applications.

Together, these four measures (parameters, IoU, FLOPs, and memory consumption) provide a comprehensive view of model performance that extends beyond accuracy alone. The results, summarized in Table 6, enable a fair comparison between methods and highlight the balance between predictive reliability and computational feasibility, thereby guiding the choice of models for practical medical imaging applications.
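As a hedged illustration, these figures can be gathered in tf.keras roughly as follows; the profiler-based FLOPs estimate is approximate, and the memory figure counts float32 weights only, not activations.

```python
# A sketch of collecting complexity figures in tf.keras; `model` is the
# trained network and the input shape is the 256x256 setting used here.
import tensorflow as tf

params = model.count_params()
mem_mb = params * 4 / (1024 ** 2)  # 4 bytes per float32 parameter

@tf.function
def forward(x):
    return model(x)

# Trace a concrete graph and estimate FLOPs with the TF1-compat profiler.
graph = forward.get_concrete_function(
    tf.TensorSpec([1, 256, 256, 3], tf.float32)).graph
opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
flops = tf.compat.v1.profiler.profile(graph, options=opts).total_float_ops
print(f"{params / 1e6:.2f} M params, {flops / 1e9:.2f} GFLOPs, "
      f"{mem_mb:.1f} MB weights")
```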

Table 6 Comparison of computational complexity of different models on the BUSI dataset.

Conclusion

This study introduces HA-Net, a hybrid attention-based architecture specifically designed for the automated segmentation of breast tumors in ultrasound images. The architecture leverages a pre-trained DenseNet-121 encoder combined with an attention mechanism incorporating Global Spatial Attention (GSA), Position Encoding (PE), and Scaled Dot-Product Attention (SDPA), thereby allowing the model to effectively capture global contextual relationships while preserving fine-grained spatial details that are critical for precise tumor delineation. The Spatial Feature Enhancement Block was integrated into skip connections to preserve high-resolution information and refine focus on tumor regions. The segmentation process is guided by a combined loss function, thereby effectively mitigating challenges arising from class imbalance and the heterogeneous morphologies of breast lesions. Comprehensive experiments conducted on the BUSI and UDIAT datasets show that HA-Net consistently surpasses existing SOTA segmentation methods across multiple evaluation metrics. Both quantitative and qualitative assessments validate its robustness, high performance, and generalizability, highlighting its potential utility as a clinical tool for facilitating early and precise breast cancer diagnosis.

Future work will aim to improve cross-device and multi-center generalization via domain adaptation, incorporate lesion classification to create a comprehensive diagnostic framework, and enable real-time deployment in clinical workflows to enhance diagnostic efficiency and patient care.