Introduction

Small medical objects, such as HyperReflective Dots (HRDs) observed on fundus Optical Coherence Tomography (OCT), are increasingly recognized by ophthalmologists as potential biomarkers for retinal diseases, including age-related macular degeneration (AMD), diabetic macular edema (DME), and macular edema caused by branch retinal vein occlusion (BRVO). The presence, number, and spatial distribution of HRDs provide key insights into the severity and progression of these disorders1,2,3,4. However, manual annotation of these small objects is both time-consuming and labor-intensive. Consequently, there is a growing need to develop automated segmentation methods for small medical objects using advanced computer vision algorithms. Figure 1A provides examples of such small medical objects.

Despite the success of numerous image segmentation models such as U-Net5, ResUNet6, DenseUNet7, ResUNet++8, TransFuse9, Swin-Unet10, and the recent SAM11, their effectiveness in segmenting small medical objects like HRDs is limited. Two primary limitations—significant information loss and inadequate feature utilization—suggest that these models may not perform well for small object tasks. First, traditional methods5,12 restrict feature integration to the encoder-decoder stages, which limits the diversity of features used across decoding stages. Previous studies13,14 have shown that combining shallow and deep features improves segmentation accuracy. As a result, these models fail to use the full range of encoder information, leaving valuable data untapped. Second, many models use a single segmentation head for supervision at the final decoding stage6,12,15. Research indicates that early decoding stages, which process low-resolution features, are critical for accurately localizing small objects16,17,18. However, these stages often lose information during convolution and upsampling processes. Additionally, small medical objects inherently contain less data compared to larger ones, which worsens the impact of information loss. These limitations are evident in the SAM11 model, which performs poorly when segmenting small medical objects, as noted in prior research.

To address the challenge of information loss in segmenting small medical objects, we developed a novel encoder-decoder-based model that optimizes feature use at every stage. Figure 1B highlights the advantages of our method compared to conventional approaches. In the encoder, our model integrates the Cross-Stage Axial Attention Module (CSAA), which employs an attention mechanism to combine features from all stages. This enables the model to adjust to the informational needs of each decoding stage, minimizing the typical information loss. The CSAA ensures the decoder has access to relevant information throughout the process, improving segmentation accuracy. In the decoder, we introduce the Multi-Precision Supervision Module (MPS). This module applies segmentation heads with different precisions at various stages, effectively using low-resolution features. These features help capture broad contextual information while temporarily focusing less on fine local details. By using MPS, the model optimizes feature utilization at each stage, reducing information loss and enhancing segmentation precision.

To validate our model’s effectiveness, we conducted experiments on two datasets: S-HRD and S-Polyp, as shown in Fig. 1A. The S-HRD dataset was developed by our team, and the S-Polyp dataset is a subset of CVC-ClinicDB19. Our model’s performance on these datasets surpasses existing state-of-the-art models, as shown by higher scores in the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU). These results demonstrate the potential of our approach to improve the segmentation of small medical objects, offering significant benefits for medical imaging diagnostics.

Related work

Medical image segmentation

Recently, researchers have developed several innovative methods for semantic segmentation in medical images20,21,22,23,24,25,26,27,28,29. Agarwal et al.25 proposed a multi-scale dual-channel feature embedding decoder for biomedical image segmentation. Zhang et al.9 proposed a unique hybrid architecture that integrates both CNN and Transformer technologies, which helps in preserving low-level details that are often lost. Chen et al.30 enhanced the U-Net encoder by incorporating a Transformer module, which improved the model’s capability for long-range dependency modeling. Wang et al.31 tackled the issue of overfitting through the use of a Transformer encoder and furthered this innovation by introducing a progressive locality decoder, aimed at optimizing local information processing within medical images. While these approaches have significantly advanced the field of medical image segmentation, they typically do not adequately address the impact of object size on segmentation outcomes. Lou et al.32 introduced the Context Axial Reverse Attention module (CaraNet) to address this challenge by focusing on small objects’ local details. However, CaraNet’s reliance on bilinear interpolation during the decode phase results in considerable information loss, negatively impacting small object segmentation. In contrast, our model mitigates such information loss through the integration of the CSAA and MPS, substantially enhancing segmentation precision for small objects.

General segmentation model

Kirillov et al. introduced the SAM11, a versatile segmentation model that has made significant strides in natural image segmentation. Despite its success, SAM is not well-suited for many medical image segmentation tasks, especially those involving small objects and complex boundaries, which often require manual intervention33,34. To address these limitations, Ma et al. developed MedSAM35, an adaptation of SAM tailored for medical images. MedSAM outperforms SAM in handling complex segmentation tasks but still struggles with small object segmentation.

Attention mechanism

Numerous attention-based methods have emerged in recent years36,37,38,39,40,41,42,43,44, and have been applied to diverse tasks in computer vision and natural language processing. Vaswani et al.41 introduced the Transformer, which marked a shift from traditional convolutional models to attention-based ones. Woo et al.42 enhanced this idea by adding Channel and Spatial Attention to CNNs, improving their ability to handle complex inputs. Zhang et al.43 proposed the Pyramid Squeeze Attention Module (PSA), which captures spatial information across multiple channels. Building on these developments, we introduce the CSAA, which promotes better feature integration across the model while minimizing information loss. Yadav et al.44 proposed a modified recurrent residual attention U-Net model, which has advanced the segmentation of brain tumors in MRI images, further influencing small object segmentation approaches in medical imaging.

Contribution

  1.

    Innovative segmentation approach: We introduce EFCNet, a novel method designed to address the challenges of segmenting small medical objects. This approach enhances the extraction of diverse information by systematically processing features at every stage, reducing information loss—a common issue in small object segmentation. Our approach directly improves on existing techniques by emphasizing comprehensive feature utilization.

  2.

    Development of key modules: We have designed two innovative modules: the CSAA and MPS. These modules specifically address information loss during the encoder and decoder phases of segmentation models. Their integration results in a significant improvement in segmentation accuracy, capturing fine-grained details more effectively.

  3.

    Benchmark construction and model validation: We established a robust benchmark for small medical object segmentation, developing specialized datasets, S-HRD and S-Polyp. Our model was rigorously tested against this benchmark and consistently outperformed existing state-of-the-art models, setting a new standard for performance in small object segmentation in medical imaging.

Methods

This section describes the development of a fundus OCT dataset, the architectural details of EFCNet, including its CSAA and MPS, and the formulation of the loss function used in our experiments.

Dataset establishment

For this study, we established two distinct datasets, with medical image samples shown in Fig. 1A:

S-HRD dataset

This dataset was compiled from 313 retinal OCT scans obtained from patients treated for DME or BRVO-induced macular edema at the EENT Hospital of Fudan University within the last six months. Informed consent was obtained from all participants involved in the study. The study protocol was approved by the Ethics Committee of the EENT Hospital, adhering to the principles of the Declaration of Helsinki. Patient confidentiality was strictly maintained through rigorous anonymization. Each OCT scan, performed using the Heidelberg Spectralis OCT + HRA (Heidelberg Engineering, Heidelberg, Germany), was centered on the fovea. HRDs were defined as discrete intraretinal reflectivity changes no larger than 30 μm, with reflectivity similar to that of the nerve fiber layer and no associated back shadowing. The images were annotated independently by two ophthalmologists who cross-referenced clinical records to ensure the accuracy of annotations. Discrepancies were resolved by a senior specialist. In this dataset, each lesion occupies less than 1% of the total image area.

S-Polyp dataset

A subset of 229 images was curated from the publicly available CVC-ClinicDB19, which includes 612 high-resolution images from 31 colonoscopy video sequences. This database is commonly used for polyp detection and includes ground truth masks that delineate the polyp regions. The subset was specifically selected to include images where each lesion occupies less than 5% of the total image area, emphasizing small-scale features. This selection challenges conventional segmentation methods and aims to improve the accuracy of small polyp detection.

Problem setup and notations

In our segmentation task, we define the medical image dataset as \(X=\left\{x_{1},\dots,x_{m}\mid x_{i}\in\mathbb{R}^{C\times H\times W},\,i=1,2,\dots,m\right\}\), where each \(x_{i}\) represents an OCT image and \(i\) ranges over the dataset indices. Experienced ophthalmologists used ITK-SNAP, a versatile, open-source 3D medical image analysis software for user-guided segmentation, to segment HRD lesions on each OCT image. This meticulous process produced a corresponding ground truth dataset \(Y=\left\{y_{1},\dots,y_{m}\mid y_{i}\in\{0,1\}^{1\times H\times W},\,i=1,2,\dots,m\right\}\), where each \(y_{i}\) is a binary mask of the segmented lesions in \(x_{i}\). The combined dataset \(D=\{X,Y\}\) is split into a training set \(D_{\text{train}}=\{X_{\text{train}},Y_{\text{train}}\}\) and a testing set \(D_{\text{test}}=\{X_{\text{test}},Y_{\text{test}}\}\), facilitating the model’s training and evaluation phases.
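As a minimal illustration of this setup, the split of \(D\) into \(D_{\text{train}}\) and \(D_{\text{test}}\) can be sketched as follows; the helper name, the 80/20 ratio, and the fixed seed are illustrative assumptions, not values reported in this paper:

```python
import random

def split_dataset(pairs, train_fraction=0.8, seed=0):
    """Split a list of (image, mask) pairs into D_train and D_test.

    The 80/20 ratio and seed are hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)          # shuffle a copy, leave the input intact
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed makes the split reproducible across training and evaluation runs.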

The objective of this study is to develop an algorithm that enhances our model’s capability to effectively segment small medical objects from \(\:{D}_{\text{train}}\) and demonstrate its robustness on \(\:{D}_{\text{test}}\). The focus is on ensuring precise segmentation during training and consistent generalizability on unseen data during testing.

Overall architecture

We present EFCNet, a novel architecture tailored for precise segmentation of small medical objects. The model integrates two key modules: the CSAA and the MPS, both crucial for processing information throughout the encoding and decoding phases.

As shown in Fig. 1C, the process begins with an input image undergoing multi-stage encoding to produce detailed feature maps. The CSAA Module harmonizes these features across all encoder stages, refining the data for the decoding process. This ensures that each decoding stage leverages a complete dataset profile for accurate segmentation. In the decoding process, the MPS Module executes multi-precision predictions with each stage receiving specific supervision to fine-tune the segmentation results. During testing, the output from the final decoder stage serves as the definitive model result, optimized for high accuracy and reliability.

CSAA module

To address the challenge of critical information dispersion across encoder stages and reduce information loss during convolution and downsampling, we introduce the CSAA module. This module adaptively processes features from all encoder stages, refining their integration to efficiently inform the decoding process. As shown in Fig. 1D, the CSAA operates in four steps: resizing, W-dimensional axial attention, H-dimensional axial attention, and resizing back. This design optimizes feature fusion, ensuring better alignment between encoder and decoder stages, which ultimately enhances segmentation accuracy.

Resizing

To improve the integration of features across all encoder stages, each feature map within the encoder is resized to uniform dimensions \(\:\left({C}^{*},{H}^{*},{W}^{*}\right)\) using convolution operations. This adjustment aligns both spatial and channel dimensions to facilitate subsequent processing steps:

$$f_{i}^{*}=\sigma\left(BN\left(conv_{i}^{e}\left(f_{i}^{e}\right)\right)\right),\quad i=1,2,\dots,k,$$

where \(\:{f}_{i}^{e}\) represents the feature map at stage \(\:i\) of the encoder, \(\:{\upsigma\:}\) signifies the ReLU activation function, and \(\:BN\:\)denotes batch normalization. \(\:k\) represents the total number of encoder and decoder stages. These resized feature maps, \(\:\{{f}_{i}^{*}{\}}_{i=1}^{k}\), are then prepared as inputs for the subsequent W-dimensional axial attention step.
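The resizing operation above can be sketched in PyTorch as follows; the class name, the 3 × 3 kernel, and the use of a strided convolution for spatial alignment are our assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ResizeStage(nn.Module):
    """f_i* = ReLU(BN(conv(f_i^e))): project one encoder feature map to the
    shared target shape (C*, H*, W*). Kernel size and stride are assumptions."""

    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        # A strided convolution adjusts channels and downsamples spatially
        # in a single learned operation.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                              stride=stride, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, f):
        return self.act(self.bn(self.conv(f)))
```

For example, a 64-channel encoder map of size 88 × 88 projected to 32 channels at stride 2 yields a 32 × 44 × 44 map, ready for the W-dimensional attention step.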

W-dimensional CSAA

In the W-Dimensional CSAA module, we generate the Query \(\{Q_{i,w}\}_{i=1}^{k}\), Key \(\{K_{i,w}\}_{i=1}^{k}\), and Value \(\{V_{i,w}\}_{i=1}^{k}\) components based on the resized feature maps \(\{f_{i}^{*}\}_{i=1}^{k}\) in the width dimension. The process is defined by the following equations: \(Q_{i,w}=W_{i}^{Q}\left(f_{i,w}^{*}\right)\), \(K_{i,w}=W_{i}^{K}\left(f_{1,w}^{*},f_{2,w}^{*},\dots,f_{k,w}^{*}\right)\), \(V_{i,w}=W_{i}^{V}\left(f_{1,w}^{*},f_{2,w}^{*},\dots,f_{k,w}^{*}\right)\), where \(\{W_{i}^{Q}\}_{i=1}^{k}\), \(\{W_{i}^{K}\}_{i=1}^{k}\), and \(\{W_{i}^{V}\}_{i=1}^{k}\) are the weight matrices responsible for generating the corresponding Query, Key, and Value sets. Here, \(k\) denotes the number of stages in both the encoder and decoder, and \(f_{i,w}^{*}\) represents the feature map \(f_{i}^{*}\) along the width dimension. These matrices integrate information from all encoder stages to generate a comprehensive feature representation for each stage.

Subsequently, the output \(\{f_{i}^{w}\}_{i=1}^{k}\) of the W-dimensional axial attention is computed as follows:

$$f_{i}^{w}=\text{Softmax}\left(\frac{Q_{i,w}K_{i,w}^{T}}{\sqrt{C^{*}H^{*}}}\right)V_{i,w},\quad i=1,2,\dots,k.$$

This equation merges information across all stages of the encoder along the width dimension, effectively utilizing axial attention to enhance the detail and specificity of feature maps prior to the decoding phase.
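A simplified NumPy sketch of this W-dimensional attention step is given below. Width positions act as tokens and the \(C^{*}H^{*}\) values in each width column form the token embedding; using a single shared projection matrix per role (rather than per-stage \(W_{i}^{Q}\), \(W_{i}^{K}\), \(W_{i}^{V}\)) and concatenating the stages along the token axis are simplifying assumptions made for clarity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def w_axial_attention(f_star, W_q, W_k, W_v):
    """W-dimensional axial attention over k encoder stages (sketch).

    f_star : list of k resized feature maps, each of shape (C*, H*, W*).
    W_q, W_k, W_v : (C*H*, C*H*) projection matrices, shared across stages
    here for brevity (the paper uses per-stage matrices).
    """
    C, H, W = f_star[0].shape
    d = C * H  # embedding size of one width-column token
    # Keys and values aggregate width columns from ALL stages (k*W* tokens).
    tokens_all = np.concatenate([f.reshape(d, W).T for f in f_star], axis=0)
    K = tokens_all @ W_k
    V = tokens_all @ W_v
    outs = []
    for f in f_star:
        Q = f.reshape(d, W).T @ W_q                    # (W*, C*H*)
        attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (W*, k*W*)
        outs.append((attn @ V).T.reshape(C, H, W))     # back to (C*, H*, W*)
    return outs
```

Each stage's queries attend over the width columns of every stage, so the scaling factor \(\sqrt{C^{*}H^{*}}\) matches the equation above.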

H-dimensional CSAA

Continuing from the W-dimensional axial attention, we extend our methodology to the height (\(H\)) dimension by generating Query \(\{Q_{i,h}\}_{i=1}^{k}\), Key \(\{K_{i,h}\}_{i=1}^{k}\), and Value \(\{V_{i,h}\}_{i=1}^{k}\) sets based on the feature maps \(\{f_{i}^{w}\}_{i=1}^{k}\) processed in the previous step: \(Q_{i,h}=W_{i}^{Q}\left(f_{i,h}^{w}\right)\), \(K_{i,h}=W_{i}^{K}\left(f_{1,h}^{w},f_{2,h}^{w},\dots,f_{k,h}^{w}\right)\), \(V_{i,h}=W_{i}^{V}\left(f_{1,h}^{w},f_{2,h}^{w},\dots,f_{k,h}^{w}\right)\), \(i=1,2,\dots,k\), where \(\{W_{i}^{Q}\}_{i=1}^{k}\), \(\{W_{i}^{K}\}_{i=1}^{k}\), and \(\{W_{i}^{V}\}_{i=1}^{k}\) are the weight matrices that facilitate the generation of the Query, Key, and Value sets, respectively. These matrices serve to synthesize information across the encoder stages, now focusing on the height dimension.

The outcome of this axial attention process is computed as follows:

$$f_{i}^{h}=\text{Softmax}\left(\frac{Q_{i,h}K_{i,h}^{T}}{\sqrt{C^{*}W^{*}}}\right)V_{i,h},\quad i=1,2,\dots,k.$$

This equation efficiently combines information from all stages of the encoder, both in width and height dimensions, to produce feature maps that are rich in detail and spatial context. This enhanced feature integration ensures that each stage of the decoder is informed by a comprehensive, multi-dimensional understanding of the input data, which is crucial for accurate segmentation of small medical objects.

Resizing back

To facilitate the decoding process, the output features \(f_{i}^{h}\) from the axial attention stages are resized to align with the dimensions of the corresponding decoder stage feature maps. This resizing involves adjusting both spatial and channel dimensions through convolution operations:

$$f_{i}^{attn}=\sigma\left(BN\left(conv_{i}^{h}\left(f_{i}^{h}\right)\right)\right),\quad i=1,2,\dots,k,$$

where \(\sigma\) represents the ReLU activation function, \(BN\) indicates batch normalization, and \(k\) denotes the number of stages in both the encoder and decoder. The resized feature maps, denoted as \(f_{i}^{attn}\), are then concatenated with the feature map of the corresponding decoder stage along the channel dimension.

This approach addresses the computational challenges often associated with traditional two-dimensional attention mechanisms45, which require substantial resources. By employing a two-stage one-dimensional attention structure within the CSAA, our model processes feature maps sequentially across both width and height dimensions, optimizing computational efficiency without compromising performance.

The CSAA is instrumental in enhancing the model’s ability to extract and utilize detailed information about small medical objects throughout the encoding and decoding processes. This method ensures that each decoder stage is equipped with comprehensive and relevant information from all encoder stages, thereby improving segmentation accuracy and efficiency. Through this integration, the CSAA strengthens the connection between the encoder and decoder, reinforcing the model’s overall performance in segmenting small medical objects.

MPS module

The MPS is designed to optimize the utilization of low-resolution features within the decoder, which are pivotal for their robust global perception capabilities. Previous models, such as those detailed in Ronneberger et al.5, Chen et al.30, and Lou et al.32, often failed to fully capitalize on this globally perceptual information, resulting in significant data loss during convolution and upsampling stages. Our MPS addresses this deficiency by preserving and enhancing the information extracted from these low-resolution features throughout the decoding process.

Segmentation process

In the segmentation step of MPS, feature maps from each decoder stage are processed through dedicated segmentation heads tailored to their respective resolutions. This segmentation is conducted as follows:

$$P_{i}=S\left(\sigma\left(BN\left(conv_{i}^{d}\left(f_{i}^{d}\right)\right)\right)\right),\quad i=1,2,\dots,k,$$

where \(f_{i}^{d}\) is the feature map at decoder stage \(i\), \(\sigma\) represents the ReLU function, \(BN\) denotes batch normalization, and \(S\) is the sigmoid function. These operations yield segmentation results at varying resolutions, which are crucial for detailed object analysis.

Upsampling and integration

Following segmentation, these results are upsampled using nearest-neighbor interpolation to align with the ground truth dimensions, thereby facilitating multi-precision segmentation:

$$M_{i}=\text{Upsample}\left(P_{i}\right)\in\mathbb{R}^{C\times H\times W},\quad i=1,2,\dots,k.$$

This step ensures that each segmentation output matches the scale of the ground truth, enhancing the accuracy of the model’s output against actual data.
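The segmentation and upsampling steps above can be sketched as one PyTorch module; the single-channel output, the 3 × 3 convolution, and the class name are illustrative assumptions rather than the paper's exact head design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPSHead(nn.Module):
    """One multi-precision head: P_i = S(ReLU(BN(conv(f_i^d)))),
    followed by nearest-neighbor upsampling of P_i to the ground-truth size."""

    def __init__(self, in_ch, out_size):
        super().__init__()
        # Project the decoder feature map to a single-channel prediction
        # (an assumption matching the binary ground-truth masks).
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(1)
        self.out_size = out_size  # (H, W) of the ground truth

    def forward(self, f_dec):
        p = torch.sigmoid(F.relu(self.bn(self.conv(f_dec))))
        return F.interpolate(p, size=self.out_size, mode="nearest")
```

Each decoder stage gets its own head, so a low-resolution 44 × 44 prediction and a full-resolution one are both compared against the same 352 × 352 ground truth.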

Supervision strategy

The MPS adopts a varied precision supervision strategy, recognizing that while low-resolution features offer substantial global insight, they lack finer local details. By applying a lower precision of supervision to these features, the module maintains an emphasis on global attributes without overfitting to local noise. This approach not only preserves the benefits of traditional single-segmentation heads but also enhances them by integrating broad-scale perceptual strengths, thereby significantly improving the model’s efficacy in segmenting small medical objects.

Loss function

In the context of small medical object segmentation, where there is a significant imbalance between positive and negative pixels, we employ a hybrid loss function that combines DiceLoss46 and Binary Cross-Entropy (BCE) Loss. This approach is designed to optimize our model’s performance by addressing both the spatial and class imbalance challenges inherent in such tasks. The loss function for the segmentation outputs at each decoder stage is defined as follows:

$$\mathcal{L}_{i}=\lambda_{1}\cdot\mathcal{L}_{\text{Dice}}\left(M_{i},Y\right)+\lambda_{2}\cdot\mathcal{L}_{\text{BCE}}\left(M_{i},Y\right),$$

where \(i=1,2,\dots,k\) indexes the decoder stages, \(M_{i}\) represents the predicted segmentation map at stage \(i\), and \(Y\) is the ground truth. The hyperparameters \(\lambda_{1}\) and \(\lambda_{2}\) are used to balance the contributions of Dice Loss and BCE Loss, respectively.

To integrate the contributions from each decoder stage, the total model loss is computed as:

$$\mathcal{L}_{total}=\sum_{i=1}^{k}\alpha_{i}\cdot\mathcal{L}_{i},$$

where the set of hyperparameters \(\{\alpha_{i}\}_{i=1}^{k}\) weights the losses from different stages, acknowledging the varying precision of segmentation results across the decoder stages.

This formulation of the loss function ensures that the model is finely tuned to maximize accuracy across all stages of decoding, thereby improving the overall segmentation performance and addressing the unique challenges posed by small medical object segmentation. This balanced and stage-wise approach to loss computation helps to mitigate the potential for overfitting to less informative regions and emphasizes learning from critical features relevant to medical diagnostics.
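The stage-wise hybrid loss can be sketched in NumPy as follows; the \(\lambda\) and \(\alpha\) defaults mirror the values given later in the Implementation details, while the smoothing constants are conventional assumptions:

```python
import numpy as np

def dice_loss(m, y, eps=1e-6):
    """DiceLoss for a predicted mask m in [0, 1] and binary ground truth y."""
    inter = (m * y).sum()
    return 1.0 - (2.0 * inter + eps) / (m.sum() + y.sum() + eps)

def bce_loss(m, y, eps=1e-7):
    """Binary cross-entropy, with clipping for numerical stability."""
    m = np.clip(m, eps, 1.0 - eps)
    return -np.mean(y * np.log(m) + (1.0 - y) * np.log(1.0 - m))

def total_loss(masks, y, lambdas=(0.7, 0.3), alphas=(1.0, 0.9, 0.8, 0.7)):
    """L_total = sum_i alpha_i * (l1 * Dice(M_i, Y) + l2 * BCE(M_i, Y)).

    masks is the list of upsampled stage predictions M_1..M_k.
    """
    l1, l2 = lambdas
    return sum(a * (l1 * dice_loss(m, y) + l2 * bce_loss(m, y))
               for a, m in zip(alphas, masks))
```

A perfect prediction at every stage drives the total loss to (numerically) zero, while an empty prediction against a non-empty mask pushes the Dice term toward one.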

Fig. 1

(A) Samples of small medical objects from two datasets (S-HRD and S-Polyp). This panel displays examples from our datasets showing small medical objects, with emphasis on detailed views and annotated ground truths, to underscore the challenges in segmenting such minute features. (B) Methodological comparison. Panels (a) and (b), traditional methods5,6 merge encoder and decoder features by concatenation or addition, ending with a single segmentation head. Panel (c), advanced methods32,47 integrate attention mechanisms at single encoder stages but still use only one segmentation head. Panel (d), Our EFCNet employs the Cross-Stage Axial Attention Module (CSAA) for comprehensive feature integration across all encoder stages and the Multi-Precision Supervision Module (MPS) to refine outputs across decoder stages. (C) EFCNet architecture overview: Provides a comprehensive depiction of EFCNet, highlighting the integration of CSAA and MPS to mitigate information loss and enhance segmentation fidelity across encoder and decoder stages. (D) Cross-stage axial attention module (CSAA) details: Focuses on the mechanics of CSAA, showcasing how multi-stage encoder features are dynamically integrated and directed through the decoding process to enhance the detail and accuracy of the segmentation. (E) Visualization of EFCNet and Previous Methods on S-HRD and S-Polyp. Our EFCNet compared to Attn-UNet37 on S-HRD and SSFormer31 on S-Polyp. Green circles highlight where EFCNet successfully captures extremely small medical objects that previous methods did not detect. Yellow circles show EFCNet’s enhanced accuracy in segmenting the boundaries of small medical objects compared to previous methods. Red circles indicate areas where previous methods incorrectly segmented small medical objects, whereas EFCNet provides correct segmentation.

Experiment

Evaluation metrics

To assess the performance of our model relative to existing state-of-the-art methods, we utilize two widely recognized metrics. The first is the DSC, which quantifies the similarity between the predicted labels and the ground truth and is mathematically defined as:

$$\:DSC=\frac{2\times\:\left|P\cap\:G\right|}{\left|P\right|+\left|G\right|},$$

where P denotes the area covered by the predicted labels, and G represents the area covered by the ground truth labels. The second is the IoU, which measures the overlap between the predicted labels and the ground truth relative to their union:

$$\:IoU=\frac{{S}_{i}}{{S}_{u}},$$

where \(\:{S}_{i}\) is the intersection area between the predicted labels and the ground truth, and \(\:{S}_{u}\:\)is their union area. These metrics are critical for validating the accuracy and reliability of segmentation methods in medical imaging, providing a comprehensive evaluation of model performance.
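For binary masks, both metrics reduce to a few lines of NumPy (a sketch; it assumes non-empty masks so the denominators are nonzero):

```python
import numpy as np

def dsc(pred, gt):
    """Dice similarity coefficient for binary masks (assumes |P| + |G| > 0)."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def iou(pred, gt):
    """Intersection over union for binary masks (assumes a non-empty union)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union
```

For example, a two-pixel prediction that overlaps a one-pixel ground truth in one pixel gives DSC = 2/3 and IoU = 1/2, illustrating that DSC weighs the overlap more generously than IoU.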

Implementation details

Our experiments were conducted using an NVIDIA RTX A6000 GPU with 48GB of memory, utilizing the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01. The training procedure spanned 200 epochs with a batch size of 4. All input images were uniformly resized to 352 × 352 pixels. The architecture of our model includes four stages each in the encoder and decoder.

To fine-tune the loss function, we adjusted the weights of Dice Loss and BCE Loss using the hyperparameters \(\lambda_{1}=0.7\) and \(\lambda_{2}=0.3\). Additionally, to effectively manage the contributions from multi-precision segmentation outputs, we set the balancing coefficients \(\alpha_{1}=1.0\), \(\alpha_{2}=0.9\), \(\alpha_{3}=0.8\), and \(\alpha_{4}=0.7\) for the respective stages of the decoder.

This experimental setup ensures rigorous testing and validation of our model under controlled and replicable conditions, allowing for reliable comparison of its performance against established benchmarks in medical image segmentation.

Comparative models

To benchmark our approach in small medical object segmentation, we compare against a diverse set of state-of-the-art models:

  1. CNN-based methods: the foundational U-Net5 and its advanced variants Attention-UNet37, MSU-Net48, and CaraNet32.

  2. Transformer-based methods: TransFuse9, TransUNet30, SSFormer31, and Swin-UNet10, which utilize self-attention to enhance detail processing.

  3. SAM11 and related works: SAM without any prompt, SAM with point prompts, SAM with box prompts, and MedSAM35.

Additionally, we assess the impact of model scale on segmentation accuracy by testing an expanded version of U-Net (U-Net-Large) with 12 layers in both encoder and decoder, exploring the relationship between model size and performance efficacy.

Quantitative results analysis

As illustrated in Table 1, EFCNet consistently surpasses previous state-of-the-art models in both DSC and IoU across all folds for the S-HRD and S-Polyp datasets.

For the S-HRD dataset, EFCNet shows a significant improvement with an average increase of 4.88% in DSC and 3.77% in IoU compared to earlier methods. In the S-Polyp dataset, the model achieves an enhancement of 3.49% in DSC and 3.25% in IoU. These results highlight EFCNet’s robust performance, particularly in handling datasets with smaller medical objects where other models underperform due to limited available information.

The analysis reveals that the smaller the medical objects in a dataset, the greater the relative improvement of EFCNet over previous state-of-the-art methods, underscoring its effectiveness in small medical object segmentation. Notably, scaling up the architecture of U-Net (U-Net-Large) results in only marginal gains, suggesting that EFCNet’s superior performance is primarily driven by its innovative design rather than merely increased model size.

The integration of CSAA and MPS modules within EFCNet, although increasing computational costs, is justified by the significant performance gains in critical segmentation tasks. A detailed comparison of model costs and performance benefits is provided in the supplementary materials.

Visualization

Figure 1E presents visual comparisons of segmentation results using EFCNet against previous state-of-the-art methods on the S-HRD and S-Polyp datasets. As indicated by our quantitative analysis in Table 1, the prior state-of-the-art models are Attn-UNet37 for S-HRD and SSFormer31 for S-Polyp. The visual results distinctly highlight several advantages of our EFCNet:

  1. Precision in detecting small objects: EFCNet demonstrates a superior ability to capture extremely small medical objects, which are clearly marked within the green circled areas in Fig. 1E.

  2. Boundary accuracy: Our method exhibits enhanced accuracy in delineating the boundaries of small medical objects, as shown in the yellow circled areas.

  3. Reduced false positives: EFCNet is markedly less prone to incorrectly segmenting the background as part of the medical objects, as indicated by the red circled areas.

These visual outcomes underscore the effectiveness of the CSAA and MPS modules. CSAA effectively utilizes local information from low-level features in the encoder to refine the segmentation details, while MPS leverages global perceptual insights from low-resolution features early in the decoder, significantly improving the detection and segmentation of small medical objects.

Table 1 Performance comparison of EFCNet with baseline models on segmenting S-HRD and S-Polyp datasets in terms of dice similarity coefficient (DSC) and intersection over union (IoU).

Ablation studies

We conducted a series of ablation experiments on the S-HRD and S-Polyp datasets to confirm the individual and synergistic effects of the CSAA and MPS modules on segmentation performance.

Initially, CSAA and MPS were integrated into the U-Net backbone separately to assess their standalone contributions. The results, presented in Table 2, reveal that each module significantly enhances segmentation accuracy. Moreover, the concurrent application of both modules leads to even greater improvements, demonstrating their combined efficacy. To explore the optimal configuration for CSAA, we varied the number of encoder stages involved in feature aggregation: the ‘Concat-One’ configuration concatenates features from each encoder stage directly to the corresponding decoder stage without additional modifications; the ‘AA-One’ setup applies axial attention to features from a single encoder stage; and the ‘AA-All’ arrangement, which represents our final model configuration, aggregates features across all encoder stages. As indicated in Table 3, ‘AA-All’ outperforms the other configurations, underscoring the importance of extensive feature integration for enhancing segmentation performance.

Table 2 Ablation study results for CSAA and MPS modules on segmenting S-HRD and S-Polyp datasets.
Table 3 Ablation study on the CSAA module’s staging impact on S-HRD and S-Polyp segmentation performance.

Different levels of supervision within the MPS were evaluated by varying the number of connected segmentation heads: The ‘MPS-4’ configuration, which connects segmentation heads to all decoder stages, was tested against setups with fewer connections (‘MPS-3’, ‘MPS-2’, and ‘MPS-1’). As detailed in Table 4, increased supervision levels correlate positively with segmentation accuracy, confirming the benefits of comprehensive supervision in complex segmentation tasks.

The results from these ablation studies validate the effectiveness of CSAA and MPS in our model, particularly highlighting their roles in significantly improving the precision and reliability of medical object segmentation. The configurations tested affirm the model’s robustness and adaptability, making it particularly suited for challenging segmentation environments.

Table 4 Effects of MPS configuration variations on segmentation performance for S-HRD and S-Polyp datasets.

Conclusion

In this study, we introduced EFCNet, a novel framework specifically designed to enhance the segmentation of small medical objects in medical images. EFCNet addresses a common challenge in these tasks, significant information loss, by ensuring that features at every stage of the model are fully utilized. This is achieved through the integration of two key modules: the CSAA and MPS. These modules are strategically developed to minimize information loss during both the encoding and decoding phases, resulting in a substantial improvement in segmentation performance. Additionally, EFCNet establishes a new benchmark in the field of small medical object segmentation, offering a solid reference point for future research and development.

Our extensive experimental evaluation, conducted across two distinct datasets, demonstrates that the incorporation of CSAA and MPS enhances segmentation accuracy, allowing EFCNet to consistently outperform previous state-of-the-art methods. These results highlight the potential of EFCNet as a powerful tool in medical imaging, particularly in applications where accurate segmentation of small objects is essential.

However, it is important to acknowledge the limitations of our study. First, our method requires significant computational resources, which may pose challenges for its practical application in resource-constrained environments. Reducing the resource consumption of the model will be a key area of future work. Second, our approach primarily focuses on a single modality (medical images) and does not yet incorporate other modalities, such as natural language, to guide the segmentation process. Expanding the model to integrate multi-modal data will be another important direction for future research.

Despite these limitations, the insights gained from the development and evaluation of EFCNet provide valuable contributions to the field of medical image analysis and lay the groundwork for further advancements in small object segmentation.