Introduction

Small medical objects, such as HyperReflective Dots (HRDs) observed on fundus Optical Coherence Tomography (OCT), are increasingly recognized by ophthalmologists as potential biomarkers for retinal diseases, including age-related macular degeneration (AMD), diabetic macular edema (DME), and macular edema caused by branch retinal vein occlusion (BRVO). The presence, number, and spatial distribution of HRDs provide key insights into the severity and progression of these disorders1,2,3,4. However, manual annotation of these small objects is both time-consuming and labor-intensive. Consequently, there is a growing need to develop automated segmentation methods for small medical objects using advanced computer vision algorithms. Figure 1A provides examples of such small medical objects.

Despite the success of numerous image segmentation models such as U-Net5, ResUNet6, DenseUNet7, ResUNet++8, TransFuse9, Swin-Unet10, and the recent SAM11, their effectiveness in segmenting small medical objects like HRDs is limited. Two primary limitations—significant information loss and inadequate feature utilization—suggest that these models may not perform well for small object tasks. First, traditional methods5,12 restrict feature integration to the encoder-decoder stages, which limits the diversity of features used across decoding stages. Previous studies13,14 have shown that combining shallow and deep features improves segmentation accuracy. As a result, these models fail to use the full range of encoder information, leaving valuable data untapped. Second, many models use a single segmentation head for supervision at the final decoding stage6,12,15. Research indicates that early decoding stages, which process low-resolution features, are critical for accurately localizing small objects16,17,18. However, these stages often lose information during convolution and upsampling processes. Additionally, small medical objects inherently contain less data compared to larger ones, which worsens the impact of information loss. These limitations are evident in the SAM11 model, which performs poorly when segmenting small medical objects, as noted in prior research.

To address the challenge of information loss in segmenting small medical objects, we developed a novel encoder-decoder-based model that optimizes feature use at every stage. Figure 1B highlights the advantages of our method compared to conventional approaches. In the encoder, our model integrates the Cross-Stage Axial Attention Module (CSAA), which employs an attention mechanism to combine features from all stages. This enables the model to adjust to the informational needs of each decoding stage, minimizing the typical information loss. The CSAA ensures the decoder has access to relevant information throughout the process, improving segmentation accuracy. In the decoder, we introduce the Multi-Precision Supervision Module (MPS). This module applies segmentation heads with different precisions at various stages, effectively using low-resolution features. These features help capture broad contextual information while temporarily focusing less on fine local details. By using MPS, the model optimizes feature utilization at each stage, reducing information loss and enhancing segmentation precision.

To validate our model’s effectiveness, we conducted experiments on two datasets: S-HRD and S-Polyp, as shown in Fig. 1A. The S-HRD dataset was developed by our team, and the S-Polyp dataset is a subset of CVC-ClinicDB19. Our model’s performance on these datasets surpasses existing state-of-the-art models, as shown by higher scores in the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU). These results demonstrate the potential of our approach to improve the segmentation of small medical objects, offering significant benefits for medical imaging diagnostics.

Related work

Medical image segmentation

Recently, researchers have developed several innovative methods for semantic segmentation in medical images20,21,22,23,24,25,26,27,28,29. Agarwal et al.25 proposed a multi-scale dual-channel feature embedding decoder for biomedical image segmentation. Zhang et al.9 proposed a unique hybrid architecture that integrates both CNN and Transformer technologies, which helps in preserving low-level details that are often lost. Chen et al.30 enhanced the U-Net encoder by incorporating a Transformer module, which improved the model’s capability for long-range dependency modeling. Wang et al.31 tackled the issue of overfitting through the use of a Transformer encoder and furthered this innovation by introducing a progressive locality decoder, aimed at optimizing local information processing within medical images. While these approaches have significantly advanced the field of medical image segmentation, they typically do not adequately address the impact of object size on segmentation outcomes. Lou et al.32 introduced the Context Axial Reverse Attention module (CaraNet) to address this challenge by focusing on small objects’ local details. However, CaraNet’s reliance on bilinear interpolation during the decode phase results in considerable information loss, negatively impacting small object segmentation. In contrast, our model mitigates such information loss through the integration of the CSAA and MPS, substantially enhancing segmentation precision for small objects.

General segmentation model

Kirillov et al. introduced the SAM11, a versatile segmentation model that has made significant strides in natural image segmentation. Despite its success, SAM is not well-suited for many medical image segmentation tasks, especially those involving small objects and complex boundaries, which often require manual intervention33,34. To address these limitations, Ma et al. developed MedSAM35, an adaptation of SAM tailored for medical images. MedSAM outperforms SAM in handling complex segmentation tasks but still struggles with small object segmentation.

Attention mechanism

Numerous attention-based methods have emerged in recent years36,37,38,39,40,41,42,43,44, and have been applied to diverse tasks in computer vision and natural language processing. Vaswani et al.41 introduced the Transformer, which marked a shift from traditional convolutional models to attention-based ones. Woo et al.42 enhanced this idea by adding Channel and Spatial Attention to CNNs, improving their ability to handle complex inputs. Zhang et al.43 proposed the Pyramid Squeeze Attention Module (PSA), which captures spatial information across multiple channels. Building on these developments, we introduce the CSAA, which promotes better feature integration across the model while minimizing information loss. Yadav et al.44 proposed a modified recurrent residual attention U-Net model, which has advanced the segmentation of brain tumors in MRI images, further influencing small object segmentation approaches in medical imaging.

Contribution

  1.

    Innovative segmentation approach: We introduce EFCNet, a novel method designed to address the challenges of segmenting small medical objects. This approach enhances the extraction of diverse information by systematically processing features at every stage, reducing information loss—a common issue in small object segmentation. Our approach directly improves on existing techniques by emphasizing comprehensive feature utilization.

  2.

    Development of key modules: We have designed two innovative modules: the CSAA and MPS. These modules specifically address information loss during the encoder and decoder phases of segmentation models. Their integration results in a significant improvement in segmentation accuracy, capturing fine-grained details more effectively.

  3.

    Benchmark construction and model validation: We established a robust benchmark for small medical object segmentation, developing specialized datasets, S-HRD and S-Polyp. Our model was rigorously tested against this benchmark and consistently outperformed existing state-of-the-art models, setting a new standard for performance in small object segmentation in medical imaging.

Methods

This section describes the development of a fundus OCT dataset, the architectural details of EFCNet, including its CSAA and MPS, and the formulation of the loss function used in our experiments.

Dataset establishment

For this study, we established two distinct datasets, with medical image samples shown in Fig. 1A:

S-HRD dataset

This dataset was compiled from 313 retinal OCT scans obtained from patients treated for DME or BRVO-induced macular edema at the EENT Hospital of Fudan University within the last six months. Informed consent was obtained from all participants involved in the study. The study protocol was approved by the Ethics Committee of the EENT Hospital, adhering to the principles of the Declaration of Helsinki. Patient confidentiality was strictly maintained through rigorous anonymization. Each OCT scan, performed using the Heidelberg Spectralis OCT + HRA (Heidelberg Engineering, Heidelberg, Germany), was centered on the fovea. HRDs were defined as discrete intraretinal reflectivity changes no larger than 30 μm, with reflectivity similar to that of the nerve fiber layer and no associated back shadowing. The images were annotated independently by two ophthalmologists who cross-referenced clinical records to ensure the accuracy of annotations. Discrepancies were resolved by a senior specialist. In this dataset, each lesion occupies less than 1% of the total image area.

S-Polyp dataset

A subset of 229 images was curated from the publicly available CVC-ClinicDB19, which includes 612 high-resolution images from 31 colonoscopy video sequences. This database is commonly used for polyp detection and includes ground truth masks that delineate the polyp regions. The subset was specifically selected to include images where each lesion occupies less than 5% of the total image area, emphasizing small-scale features. This selection challenges conventional segmentation methods and aims to improve the accuracy of small polyp detection.

Problem setup and notations

In our segmentation task, we define the medical image dataset as \(X=\left\{x_{1},\dots,x_{m}\mid x_{i}\in\mathbb{R}^{C\times H\times W},\,i=1,2,\dots,m\right\}\), where each \(x_{i}\) represents an OCT image and \(i\) ranges over the dataset indices. Experienced ophthalmologists used ITK-SNAP, a versatile, open-source 3D medical image analysis software for user-guided segmentation, to segment HRD lesions on each OCT image. This meticulous process produced a corresponding ground truth dataset \(Y=\left\{y_{1},\dots,y_{m}\mid y_{i}\in\{0,1\}^{1\times H\times W},\,i=1,2,\dots,m\right\}\), where each \(y_{i}\) is a binary mask of the segmented lesions in \(x_{i}\). The combined dataset \(D=\{X,Y\}\) is split into a training set \(D_{\text{train}}=\{X_{\text{train}},Y_{\text{train}}\}\) and a testing set \(D_{\text{test}}=\{X_{\text{test}},Y_{\text{test}}\}\), facilitating the model’s training and evaluation phases.
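As a minimal illustration of this setup, the split of \(D\) into \(D_{\text{train}}\) and \(D_{\text{test}}\) can be sketched as follows; the helper name, the 80/20 ratio, and the fixed seed are illustrative assumptions, not values reported in this paper:

```python
import random

def split_dataset(pairs, train_fraction=0.8, seed=0):
    """Split a list of (image, mask) pairs into D_train and D_test.

    The 80/20 ratio and seed are hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)          # shuffle a copy, leave the input intact
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed makes the split reproducible across training and evaluation runs.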

The objective of this study is to develop an algorithm that enhances our model’s capability to effectively segment small medical objects from \(\:{D}_{\text{train}}\) and demonstrate its robustness on \(\:{D}_{\text{test}}\). The focus is on ensuring precise segmentation during training and consistent generalizability on unseen data during testing.

Overall architecture

We present EFCNet, a novel architecture tailored for precise segmentation of small medical objects. The model integrates two key modules: the CSAA and the MPS, both crucial for processing information throughout the encoding and decoding phases.

As shown in Fig. 1C, the process begins with an input image undergoing multi-stage encoding to produce detailed feature maps. The CSAA Module harmonizes these features across all encoder stages, refining the data for the decoding process. This ensures that each decoding stage leverages a complete dataset profile for accurate segmentation. In the decoding process, the MPS Module executes multi-precision predictions with each stage receiving specific supervision to fine-tune the segmentation results. During testing, the output from the final decoder stage serves as the definitive model result, optimized for high accuracy and reliability.

CSAA module

To address the challenge of critical information dispersion across encoder stages and reduce information loss during convolution and downsampling, we introduce the CSAA module. This module adaptively processes features from all encoder stages, refining their integration to efficiently inform the decoding process. As shown in Fig. 1D, the CSAA operates in four steps: resizing, W-dimensional axial attention, H-dimensional axial attention, and resizing back. This design optimizes feature fusion, ensuring better alignment between encoder and decoder stages, which ultimately enhances segmentation accuracy.

Resizing

To improve the integration of features across all encoder stages, each feature map within the encoder is resized to uniform dimensions \(\:\left({C}^{*},{H}^{*},{W}^{*}\right)\) using convolution operations. This adjustment aligns both spatial and channel dimensions to facilitate subsequent processing steps:

$$f_{i}^{*}=\sigma\left(BN\left(conv_{i}^{e}\left(f_{i}^{e}\right)\right)\right),\quad i=1,2,\dots,k,$$

where \(\:{f}_{i}^{e}\) represents the feature map at stage \(\:i\) of the encoder, \(\:{\upsigma\:}\) signifies the ReLU activation function, and \(\:BN\:\)denotes batch normalization. \(\:k\) represents the total number of encoder and decoder stages. These resized feature maps, \(\:\{{f}_{i}^{*}{\}}_{i=1}^{k}\), are then prepared as inputs for the subsequent W-dimensional axial attention step.
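The resizing operation above can be sketched in PyTorch as follows; the class name, the 3 × 3 kernel, and the use of a strided convolution for spatial alignment are our assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ResizeStage(nn.Module):
    """f_i* = ReLU(BN(conv(f_i^e))): project one encoder feature map to the
    shared target shape (C*, H*, W*). Kernel size and stride are assumptions."""

    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        # A strided convolution adjusts channels and downsamples spatially
        # in a single learned operation.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                              stride=stride, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, f):
        return self.act(self.bn(self.conv(f)))
```

For example, a 64-channel encoder map of size 88 × 88 projected to 32 channels at stride 2 yields a 32 × 44 × 44 map, ready for the W-dimensional attention step.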

W-dimensional CSAA

In the W-Dimensional CSAA module, we generate the Query \(\{Q_{i,w}\}_{i=1}^{k}\), Key \(\{K_{i,w}\}_{i=1}^{k}\), and Value \(\{V_{i,w}\}_{i=1}^{k}\) components based on the resized feature maps \(\{f_{i}^{*}\}_{i=1}^{k}\) in the width dimension. The process is defined by the following equations: \(Q_{i,w}=W_{i}^{Q}\left(f_{i,w}^{*}\right)\), \(K_{i,w}=W_{i}^{K}\left(f_{1,w}^{*},f_{2,w}^{*},\dots,f_{k,w}^{*}\right)\), \(V_{i,w}=W_{i}^{V}\left(f_{1,w}^{*},f_{2,w}^{*},\dots,f_{k,w}^{*}\right)\), where \(\{W_{i}^{Q}\}_{i=1}^{k}\), \(\{W_{i}^{K}\}_{i=1}^{k}\), and \(\{W_{i}^{V}\}_{i=1}^{k}\) are the weight matrices responsible for generating the corresponding Query, Key, and Value sets. Here, \(k\) denotes the number of stages in both the encoder and decoder, and \(f_{i,w}^{*}\) represents the feature map \(f_{i}^{*}\) along the width dimension. These matrices integrate information from all encoder stages to generate a comprehensive feature representation for each stage.

Subsequently, the output \(\{f_{i}^{w}\}_{i=1}^{k}\) of the W-dimensional axial attention is computed as follows:

$$f_{i}^{w}=\text{Softmax}\left(\frac{Q_{i,w}K_{i,w}^{T}}{\sqrt{C^{*}H^{*}}}\right)V_{i,w},\quad i=1,2,\dots,k.$$

This equation merges information across all stages of the encoder along the width dimension, effectively utilizing axial attention to enhance the detail and specificity of feature maps prior to the decoding phase.
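A simplified NumPy sketch of this W-dimensional attention step is given below. Width positions act as tokens and the \(C^{*}H^{*}\) values in each width column form the token embedding; using a single shared projection matrix per role (rather than per-stage \(W_{i}^{Q}\), \(W_{i}^{K}\), \(W_{i}^{V}\)) and concatenating the stages along the token axis are simplifying assumptions made for clarity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def w_axial_attention(f_star, W_q, W_k, W_v):
    """W-dimensional axial attention over k encoder stages (sketch).

    f_star : list of k resized feature maps, each of shape (C*, H*, W*).
    W_q, W_k, W_v : (C*H*, C*H*) projection matrices, shared across stages
    here for brevity (the paper uses per-stage matrices).
    """
    C, H, W = f_star[0].shape
    d = C * H  # embedding size of one width-column token
    # Keys and values aggregate width columns from ALL stages (k*W* tokens).
    tokens_all = np.concatenate([f.reshape(d, W).T for f in f_star], axis=0)
    K = tokens_all @ W_k
    V = tokens_all @ W_v
    outs = []
    for f in f_star:
        Q = f.reshape(d, W).T @ W_q                    # (W*, C*H*)
        attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (W*, k*W*)
        outs.append((attn @ V).T.reshape(C, H, W))     # back to (C*, H*, W*)
    return outs
```

Each stage's queries attend over the width columns of every stage, so the scaling factor \(\sqrt{C^{*}H^{*}}\) matches the equation above.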

H-dimensional CSAA

Continuing from the W-dimensional axial attention, we extend our methodology to the height (\(H\)) dimension by generating Query \(\{Q_{i,h}\}_{i=1}^{k}\), Key \(\{K_{i,h}\}_{i=1}^{k}\), and Value \(\{V_{i,h}\}_{i=1}^{k}\) sets based on the feature maps \(\{f_{i}^{w}\}_{i=1}^{k}\) processed in the previous step: \(Q_{i,h}=W_{i}^{Q}\left(f_{i,h}^{w}\right)\), \(K_{i,h}=W_{i}^{K}\left(f_{1,h}^{w},f_{2,h}^{w},\dots,f_{k,h}^{w}\right)\), \(V_{i,h}=W_{i}^{V}\left(f_{1,h}^{w},f_{2,h}^{w},\dots,f_{k,h}^{w}\right)\), \(i=1,2,\dots,k\), where \(\{W_{i}^{Q}\}_{i=1}^{k}\), \(\{W_{i}^{K}\}_{i=1}^{k}\), and \(\{W_{i}^{V}\}_{i=1}^{k}\) are the weight matrices that facilitate the generation of the Query, Key, and Value sets, respectively. These matrices serve to synthesize information across the encoder stages, now focusing on the height dimension.

The outcome of this axial attention process is computed as follows:

$$f_{i}^{h}=\text{Softmax}\left(\frac{Q_{i,h}K_{i,h}^{T}}{\sqrt{C^{*}W^{*}}}\right)V_{i,h},\quad i=1,2,\dots,k.$$

This equation efficiently combines information from all stages of the encoder, both in width and height dimensions, to produce feature maps that are rich in detail and spatial context. This enhanced feature integration ensures that each stage of the decoder is informed by a comprehensive, multi-dimensional understanding of the input data, which is crucial for accurate segmentation of small medical objects.

Resizing back

To facilitate the decoding process, the output features \(f_{i}^{h}\) from the axial attention stages are resized to align with the dimensions of the corresponding decoder stage feature maps. This resizing involves adjusting both spatial and channel dimensions through convolution operations:

$$f_{i}^{attn}=\sigma\left(BN\left(conv_{i}^{h}\left(f_{i}^{h}\right)\right)\right),\quad i=1,2,\dots,k,$$

where \(\sigma\) represents the ReLU activation function, \(BN\) indicates batch normalization, and \(k\) denotes the number of stages in both the encoder and decoder. The resized feature maps, denoted as \(f_{i}^{attn}\), are then concatenated with the feature map of the corresponding decoder stage along the channel dimension.

This approach addresses the computational challenges often associated with traditional two-dimensional attention mechanisms45, which require substantial resources. By employing a two-stage one-dimensional attention structure within the CSAA, our model processes feature maps sequentially across both width and height dimensions, optimizing computational efficiency without compromising performance.

The CSAA is instrumental in enhancing the model’s ability to extract and utilize detailed information about small medical objects throughout the encoding and decoding processes. This method ensures that each decoder stage is equipped with comprehensive and relevant information from all encoder stages, thereby improving segmentation accuracy and efficiency. Through this integration, the CSAA strengthens the connection between the encoder and decoder, reinforcing the model’s overall performance in segmenting small medical objects.

MPS module

The MPS is designed to optimize the utilization of low-resolution features within the decoder, which are pivotal for their robust global perception capabilities. Previous models, such as those detailed in Ronneberger et al.5, Chen et al.30, and Lou et al.32, often failed to fully capitalize on this globally perceptual information, resulting in significant data loss during convolution and upsampling stages. Our MPS addresses this deficiency by preserving and enhancing the information extracted from these low-resolution features throughout the decoding process.

Segmentation process

In the segmentation step of MPS, feature maps from each decoder stage are processed through dedicated segmentation heads tailored to their respective resolutions. This segmentation is conducted as follows:

$$P_{i}=S\left(\sigma\left(BN\left(conv_{i}^{d}\left(f_{i}^{d}\right)\right)\right)\right),\quad i=1,2,\dots,k,$$

where \(f_{i}^{d}\) is the feature map at decoder stage \(i\), \(\sigma\) represents the ReLU function, \(BN\) denotes batch normalization, and \(S\) is the sigmoid function. These operations yield segmentation results at varying resolutions, which are crucial for detailed object analysis.

Upsampling and integration

Following segmentation, these results are upsampled using nearest-neighbor interpolation to align with the ground truth dimensions, thereby facilitating multi-precision segmentation:

$$M_{i}=\text{Upsample}\left(P_{i}\right)\in\mathbb{R}^{C\times H\times W},\quad i=1,2,\dots,k.$$

This step ensures that each segmentation output matches the scale of the ground truth, enhancing the accuracy of the model’s output against actual data.
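The segmentation and upsampling steps above can be sketched as one PyTorch module; the single-channel output, the 3 × 3 convolution, and the class name are illustrative assumptions rather than the paper's exact head design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPSHead(nn.Module):
    """One multi-precision head: P_i = S(ReLU(BN(conv(f_i^d)))),
    followed by nearest-neighbor upsampling of P_i to the ground-truth size."""

    def __init__(self, in_ch, out_size):
        super().__init__()
        # Project the decoder feature map to a single-channel prediction
        # (an assumption matching the binary ground-truth masks).
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(1)
        self.out_size = out_size  # (H, W) of the ground truth

    def forward(self, f_dec):
        p = torch.sigmoid(F.relu(self.bn(self.conv(f_dec))))
        return F.interpolate(p, size=self.out_size, mode="nearest")
```

Each decoder stage gets its own head, so a low-resolution 44 × 44 prediction and a full-resolution one are both compared against the same 352 × 352 ground truth.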

Supervision strategy

The MPS adopts a varied precision supervision strategy, recognizing that while low-resolution features offer substantial global insight, they lack finer local details. By applying a lower precision of supervision to these features, the module maintains an emphasis on global attributes without overfitting to local noise. This approach not only preserves the benefits of traditional single-segmentation heads but also enhances them by integrating broad-scale perceptual strengths, thereby significantly improving the model’s efficacy in segmenting small medical objects.

Loss function

In the context of small medical object segmentation, where there is a significant imbalance between positive and negative pixels, we employ a hybrid loss function that combines DiceLoss46 and Binary Cross-Entropy (BCE) Loss. This approach is designed to optimize our model’s performance by addressing both the spatial and class imbalance challenges inherent in such tasks. The loss function for the segmentation outputs at each decoder stage is defined as follows:

$$\mathcal{L}_{i}=\lambda_{1}\cdot\mathcal{L}_{\text{Dice}}\left(M_{i},Y\right)+\lambda_{2}\cdot\mathcal{L}_{\text{BCE}}\left(M_{i},Y\right),$$

where \(i=1,2,\dots,k\) indexes the decoder stages, \(M_{i}\) represents the predicted segmentation map at stage \(i\), and \(Y\) is the ground truth. The hyperparameters \(\lambda_{1}\) and \(\lambda_{2}\) are used to balance the contributions of Dice Loss and BCE Loss, respectively.

To integrate the contributions from each decoder stage, the total model loss is computed as:

$$\mathcal{L}_{total}=\sum_{i=1}^{k}\alpha_{i}\cdot\mathcal{L}_{i},$$

where the set of hyperparameters \(\{\alpha_{i}\}_{i=1}^{k}\) weights the losses from different stages, acknowledging the varying precision of segmentation results across the decoder stages.

This formulation of the loss function ensures that the model is finely tuned to maximize accuracy across all stages of decoding, thereby improving the overall segmentation performance and addressing the unique challenges posed by small medical object segmentation. This balanced and stage-wise approach to loss computation helps to mitigate the potential for overfitting to less informative regions and emphasizes learning from critical features relevant to medical diagnostics.
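The stage-wise hybrid loss can be sketched in NumPy as follows; the \(\lambda\) and \(\alpha\) defaults mirror the values given later in the Implementation details, while the smoothing constants are conventional assumptions:

```python
import numpy as np

def dice_loss(m, y, eps=1e-6):
    """DiceLoss for a predicted mask m in [0, 1] and binary ground truth y."""
    inter = (m * y).sum()
    return 1.0 - (2.0 * inter + eps) / (m.sum() + y.sum() + eps)

def bce_loss(m, y, eps=1e-7):
    """Binary cross-entropy, with clipping for numerical stability."""
    m = np.clip(m, eps, 1.0 - eps)
    return -np.mean(y * np.log(m) + (1.0 - y) * np.log(1.0 - m))

def total_loss(masks, y, lambdas=(0.7, 0.3), alphas=(1.0, 0.9, 0.8, 0.7)):
    """L_total = sum_i alpha_i * (l1 * Dice(M_i, Y) + l2 * BCE(M_i, Y)).

    masks is the list of upsampled stage predictions M_1..M_k.
    """
    l1, l2 = lambdas
    return sum(a * (l1 * dice_loss(m, y) + l2 * bce_loss(m, y))
               for a, m in zip(alphas, masks))
```

A perfect prediction at every stage drives the total loss to (numerically) zero, while an empty prediction against a non-empty mask pushes the Dice term toward one.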

Fig. 1

(A) Samples of small medical objects from two datasets (S-HRD and S-Polyp). This panel displays examples from our datasets showing small medical objects, with emphasis on detailed views and annotated ground truths, to underscore the challenges in segmenting such minute features. (B) Methodological comparison. Panels (a) and (b), traditional methods5,6 merge encoder and decoder features by concatenation or addition, ending with a single segmentation head. Panel (c), advanced methods32,47 integrate attention mechanisms at single encoder stages but still use only one segmentation head. Panel (d), Our EFCNet employs the Cross-Stage Axial Attention Module (CSAA) for comprehensive feature integration across all encoder stages and the Multi-Precision Supervision Module (MPS) to refine outputs across decoder stages. (C) EFCNet architecture overview: Provides a comprehensive depiction of EFCNet, highlighting the integration of CSAA and MPS to mitigate information loss and enhance segmentation fidelity across encoder and decoder stages. (D) Cross-stage axial attention module (CSAA) details: Focuses on the mechanics of CSAA, showcasing how multi-stage encoder features are dynamically integrated and directed through the decoding process to enhance the detail and accuracy of the segmentation. (E) Visualization of EFCNet and Previous Methods on S-HRD and S-Polyp. Our EFCNet compared to Attn-UNet37 on S-HRD and SSFormer31 on S-Polyp. Green circles highlight where EFCNet successfully captures extremely small medical objects that previous methods did not detect. Yellow circles show EFCNet’s enhanced accuracy in segmenting the boundaries of small medical objects compared to previous methods. Red circles indicate areas where previous methods incorrectly segmented small medical objects, whereas EFCNet provides correct segmentation.

Experiment

Evaluation metrics

To assess the performance of our model relative to existing state-of-the-art methods, we utilize two widely recognized metrics. The first is the DSC, which quantifies the similarity between the predicted labels and the ground truth and is mathematically defined as:

$$\:DSC=\frac{2\times\:\left|P\cap\:G\right|}{\left|P\right|+\left|G\right|},$$

where P denotes the area covered by the predicted labels, and G represents the area covered by the ground truth labels. The second is the IoU, which measures the overlap between the predicted labels and the ground truth relative to their union:

$$\:IoU=\frac{{S}_{i}}{{S}_{u}},$$

where \(\:{S}_{i}\) is the intersection area between the predicted labels and the ground truth, and \(\:{S}_{u}\:\)is their union area. These metrics are critical for validating the accuracy and reliability of segmentation methods in medical imaging, providing a comprehensive evaluation of model performance.
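For binary masks, both metrics reduce to a few lines of NumPy (a sketch; it assumes non-empty masks so the denominators are nonzero):

```python
import numpy as np

def dsc(pred, gt):
    """Dice similarity coefficient for binary masks (assumes |P| + |G| > 0)."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def iou(pred, gt):
    """Intersection over union for binary masks (assumes a non-empty union)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union
```

For example, a two-pixel prediction that overlaps a one-pixel ground truth in one pixel gives DSC = 2/3 and IoU = 1/2, illustrating that DSC weighs the overlap more generously than IoU.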

Implementation details

Our experiments were conducted using an NVIDIA RTX A6000 GPU with 48GB of memory, utilizing the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01. The training procedure spanned 200 epochs with a batch size of 4. All input images were uniformly resized to 352 × 352 pixels. The architecture of our model includes four stages each in the encoder and decoder.

To fine-tune the loss function, we adjusted the weights of Dice Loss and BCE Loss using the hyperparameters \(\lambda_{1}=0.7\) and \(\lambda_{2}=0.3\). Additionally, to effectively manage the contributions from multi-precision segmentation outputs, we set the balancing coefficients \(\alpha_{1}=1.0\), \(\alpha_{2}=0.9\), \(\alpha_{3}=0.8\), and \(\alpha_{4}=0.7\) for the respective stages of the decoder.

This experimental setup ensures rigorous testing and validation of our model under controlled and replicable conditions, allowing for reliable comparison of its performance against established benchmarks in medical image segmentation.

Comparative models

To benchmark our approach in small medical object segmentation, we compare against a diverse set of state-of-the-art models:

  1. CNN-based methods: the foundational U-Net5 and its advanced variants Attention-UNet37, MSU-Net48, and CaraNet32.

  2. Transformer-based methods: TransFuse9, TransUNet30, SSFormer31, and Swin-UNet10, which utilize self-attention to enhance detail processing.

  3. SAM11 and related works: SAM without any prompt, SAM with point prompts, SAM with box prompts, and MedSAM35.

Additionally, we assess the impact of model scale on segmentation accuracy by testing an expanded version of U-Net (U-Net-Large) with 12 layers in both encoder and decoder, exploring the relationship between model size and performance efficacy.

Quantitative results analysis

As illustrated in Table 1, EFCNet consistently surpasses previous state-of-the-art models in both DSC and IoU across all folds for the S-HRD and S-Polyp datasets.

For the S-HRD dataset, EFCNet shows a significant improvement with an average increase of 4.88% in DSC and 3.77% in IoU compared to earlier methods. In the S-Polyp dataset, the model achieves an enhancement of 3.49% in DSC and 3.25% in IoU. These results highlight EFCNet’s robust performance, particularly in handling datasets with smaller medical objects where other models underperform due to limited available information.

The analysis reveals that the smaller the medical objects in a dataset, the greater the relative improvement of EFCNet over previous state-of-the-art methods, underscoring its effectiveness in small medical object segmentation. Notably, scaling up the architecture of U-Net (U-Net-Large) results in only marginal gains, suggesting that EFCNet’s superior performance is primarily driven by its innovative design rather than merely increased model size.

The integration of CSAA and MPS modules within EFCNet, although increasing computational costs, is justified by the significant performance gains in critical segmentation tasks. A detailed comparison of model costs and performance benefits is provided in the supplementary materials.

Visualization

Figure 1E presents visual comparisons of segmentation results using EFCNet against previous state-of-the-art methods on the S-HRD and S-Polyp datasets. As indicated by our quantitative analysis in Table 1, the prior state-of-the-art models are Attn-UNet37 for S-HRD and SSFormer31 for S-Polyp. The visual results distinctly highlight several advantages of our EFCNet:

  1. Precision in detecting small objects: EFCNet demonstrates a superior ability to capture extremely small medical objects, which are clearly marked within the green circled areas in Fig. 1E.

  2. Boundary accuracy: Our method exhibits enhanced accuracy in delineating the boundaries of small medical objects, as shown in the yellow circled areas.

  3. Reduced false positives: EFCNet is markedly less prone to incorrectly segmenting the background as part of the medical objects, as indicated by the red circled areas.

These visual outcomes underscore the effectiveness of the CSAA and MPS modules. CSAA effectively utilizes local information from low-level features in the encoder to refine the segmentation details, while MPS leverages global perceptual insights from low-resolution features early in the decoder, significantly improving the detection and segmentation of small medical objects.

Table 1 Performance comparison of EFCNet with baseline models on segmenting S-HRD and S-Polyp datasets in terms of dice similarity coefficient (DSC) and intersection over union (IoU).

Ablation studies

We conducted a series of ablation experiments on the S-HRD and S-Polyp datasets to confirm the individual and synergistic effects of the CSAA and MPS modules on segmentation performance.

Initially, CSAA and MPS were integrated into the U-Net backbone separately to assess their standalone contributions. The results, presented in Table 2, reveal that each module significantly enhances segmentation accuracy. Moreover, the concurrent application of both modules leads to even greater improvements, demonstrating their combined efficacy. To explore the optimal configuration for CSAA, we varied the number of encoder stages involved in feature aggregation: the ‘Concat-One’ configuration concatenates features from each encoder stage directly to the corresponding decoder stage without additional modifications; the ‘AA-One’ setup applies axial attention to features from a single encoder stage; and the ‘AA-All’ arrangement, which represents our final model configuration, aggregates features across all encoder stages. As indicated in Table 3, ‘AA-All’ outperforms the other configurations, underscoring the importance of extensive feature integration for enhancing segmentation performance.

Table 2 Ablation study results for CSAA and MPS modules on segmenting S-HRD and S-Polyp datasets.
Table 3 Ablation study on the CSAA module’s staging impact on S-HRD and S-Polyp segmentation performance.

Different levels of supervision within the MPS were evaluated by varying the number of connected segmentation heads: The ‘MPS-4’ configuration, which connects segmentation heads to all decoder stages, was tested against setups with fewer connections (‘MPS-3’, ‘MPS-2’, and ‘MPS-1’). As detailed in Table 4, increased supervision levels correlate positively with segmentation accuracy, confirming the benefits of comprehensive supervision in complex segmentation tasks.

The results from these ablation studies validate the effectiveness of CSAA and MPS in our model, particularly highlighting their roles in significantly improving the precision and reliability of medical object segmentation. The configurations tested affirm the model’s robustness and adaptability, making it particularly suited for challenging segmentation environments.

Table 4 Effects of MPS configuration variations on segmentation performance for S-HRD and S-Polyp datasets.

Conclusion

In this study, we introduced EFCNet, a novel framework specifically designed to enhance the segmentation of small medical objects in medical images. EFCNet addresses a common challenge in these tasks, significant information loss, by ensuring that features at every stage of the model are fully utilized. This is achieved through the integration of two key modules: the CSAA and MPS. These modules are strategically developed to minimize information loss during both the encoding and decoding phases, resulting in a substantial improvement in segmentation performance. Additionally, EFCNet establishes a new benchmark in the field of small medical object segmentation, offering a solid reference point for future research and development.

Our extensive experimental evaluation, conducted across two distinct datasets, demonstrates that the incorporation of CSAA and MPS enhances segmentation accuracy, allowing EFCNet to consistently outperform previous state-of-the-art methods. These results highlight the potential of EFCNet as a powerful tool in medical imaging, particularly in applications where accurate segmentation of small objects is essential.

However, it is important to acknowledge the limitations of our study. First, our method requires significant computational resources, which may pose challenges for its practical application in resource-constrained environments. Reducing the resource consumption of the model will be a key area of future work. Second, our approach primarily focuses on a single modality (medical images) and does not yet incorporate other modalities, such as natural language, to guide the segmentation process. Expanding the model to integrate multi-modal data will be another important direction for future research.

Despite these limitations, the insights gained from the development and evaluation of EFCNet provide valuable contributions to the field of medical image analysis and lay the groundwork for further advancements in small object segmentation.