Abstract
Medical image segmentation is vital for accurate diagnosis. While U-Net-based models are effective, they struggle to capture long-range dependencies in complex anatomy. We propose GH-UNet, a Group-wise Hybrid Convolution-ViT model within the U-Net framework, to address this limitation. GH-UNet integrates a hybrid convolution-Transformer encoder for both local detail and global context modeling, a Group-wise Dynamic Gating (GDG) module for adaptive feature weighting, and a cascaded decoder for multi-scale integration. Both the encoder and GDG are modular, enabling compatibility with various CNN or ViT backbones. Extensive experiments on five public and one private dataset show GH-UNet consistently achieves superior performance. On ISIC2016, it surpasses H2Former with 1.37% and 1.94% gains in DICE and IOU, respectively, using only 38% of the parameters and 49.61% of the FLOPs. The code is freely accessible via: https://github.com/xiachashuanghua/GH-UNet.
Introduction
As public interest in health increases, medical image analysis has emerged as a critical application of computer vision technology, aiding doctors in diagnosis and treatment planning. Medical image segmentation is designed to accurately delineate regions of interest (ROIs) in complex images, assisting clinicians in making precise decisions. However, the high noise, low contrast, and complex structures typical of medical images, coupled with patient-specific anatomical differences, pose significant challenges to automatic image segmentation1.
Recently, convolutional neural networks (CNNs) have significantly improved image classification, recognition, and segmentation, and have achieved notable success in medical image analysis2,3. U-Net4, a landmark architecture in this domain, integrates global context and local texture features via a symmetrical encoder-decoder structure and skip connections. Variants such as Attention U-Net5, which introduces spatial attention mechanisms to emphasize lesion regions, UNet++2, which employs nested and dense skip connections to refine feature fusion, and nnUNet3, an automated pipeline that adapts architecture and training parameters to the dataset, have further advanced segmentation accuracy and generalizability. However, CNNs inherently struggle to model long-range dependencies, limiting their capability in complex anatomical scenarios.
Following breakthroughs in natural language processing, Transformers were introduced to computer vision, giving rise to vision transformers (ViT)6. ViT leverages self-attention mechanisms to model global dependencies and capture intricate details in images. Transformer-based architectures such as SWIN-Unet7, which introduces hierarchical window-based attention, TransUNet8, which embeds Transformer layers into a CNN backbone for enriched global context, and UNETR9, which uses a pure Transformer encoder with skip connections to a CNN decoder, have been proposed for medical image segmentation, offering improved global contextual modeling. Yet, their high computational and memory costs restrict their scalability to high-resolution images and real-time applications.
To bridge this gap, hybrid architectures that integrate CNNs and transformers have emerged. These include CVT10, which combines convolutional token embedding with Transformer blocks. Rolling-Unet11 integrates multi-directional MLP modules to enhance global context learning. Building on this, subsequent studies have explored improvements in model efficiency and decoder design. EMCAD12 is a multi-scale convolutional attention decoder that reduces computational overhead while maintaining strong performance. It enhances feature representation using multi-scale deep convolutional blocks and integrates channel-, spatial-, and group-gated attention to capture complex spatial dependencies. FSCA-Net13 enhances U-Net’s skip connections to better fuse spatial and channel features. MixFormer14 combines the global modeling of Transformers with the local feature extraction of CNNs during encoding, enabling effective interaction between coarse- and fine-grained features. Despite promising results, challenges remain in feature fusion efficiency and inference cost.
Recently, the receptance weighted key value (RWKV) framework, which combines linear complexity with parallelism, has demonstrated promising results in computer vision15. This is attributed to RWKV’s integration of Transformer and RNN strategies, which enhances efficiency and enables long-range dependency modeling. This integration has also facilitated RWKV’s application in medical image segmentation. For instance, BSBP-RWKV16 was the first to apply the RWKV architecture to medical image segmentation, integrating Perona-Malik diffusion and ODE-based modeling. It significantly improves both efficiency and accuracy, making it well-suited for precision-focused and resource-constrained clinical scenarios. Zig-RiR introduced a nested RWKV-in-RWKV structure with zigzag cross-connections, effectively addressing RWKV’s limitations in local pattern extraction and continuity of dependencies17.
MobileViT18 and similar lightweight hybrid encoders combine CNN and Transformer modules to enhance feature extraction while reducing computational demand, yielding lightweight networks that preserve robust feature expression. MobileViT features a hybrid convolution-Transformer encoder that applies conventional convolutions to extract local information from low-level features and Transformer modules to capture global context in high-level features. This architecture combines the strengths of CNNs in local pattern recognition with the long-range contextual modeling of Transformers, supporting both high-resolution image processing and comprehensive context capture. Moreover, MobileViT’s lightweight design minimizes computational resource usage, making it well suited to efficient image processing in resource-limited settings.
To tackle existing limitations, we propose GH-UNet, an innovative medical image segmentation model built upon the U-Net framework. GH-UNet incorporates a multi-scale gated attention (MSGA) block to model both local and global dependencies, a channel-spatial gating (CSG) mechanism for multi-scale fusion, and a group-wise dynamic gating (GDG) component that dynamically adjusts channel-wise feature weights. Additionally, a cascading decoder integrates multi-scale information for precise reconstruction. Figure 1 compares various models on the ISIC2016 dataset, illustrating GH-UNet’s advantage in performance, parameter efficiency, and computational cost.
This figure presents a comprehensive comparison of different segmentation models in terms of Dice score, IoU score, number of parameters, and FLOPs on the ISIC2016 dataset. GH-UNet demonstrates superior segmentation performance while requiring fewer parameters and lower computational cost compared to state-of-the-art models such as H2Former and TransUNet.
The contributions of this work are summarized as follows:
- We developed a U-Net-based model (GH-UNet), achieving state-of-the-art performance with reduced computational complexity.
- We proposed a hybrid convolution-ViT encoder to efficiently capture both local and long-range dependencies.
- We introduced a GDG component for adaptive channel-wise feature modulation.
- We designed a multi-scale cascading decoder to improve spatial information integration.
Results
This study conducts a comprehensive performance evaluation of the novel GH-UNet architecture against current advanced methods on five public and one private 2D/3D medical image segmentation datasets, focusing on accuracy and parameter efficiency. Subsequently, systematic ablation experiments quantify the contributions of the key components of the GH-UNet model. Finally, visualization results illustrate the operational effectiveness of the GH-UNet model.
Datasets
ISIC201619, released by the International Skin Imaging Collaboration (ISIC), focuses on skin lesion image analysis, particularly melanoma classification and segmentation. Introduced in the ISIC 2016 challenge, this dataset comprises 900 training dermatoscopic images and 379 test images, with expert-annotated binary segmentation masks.
Kvasir-SEG20, published by the Norwegian Research Center (NORCE), is a public dataset for gastrointestinal polyp segmentation, specifically targeting the segmentation of polyp regions in endoscopic images. This dataset includes 1000 high-definition gastrointestinal endoscopic images with corresponding pixel-level segmentation annotations.
IDRiD21 is a public medical dataset specifically created for research on diabetic retinopathy (DR)-related tasks. This dataset was part of the retinal image segmentation and classification challenge held during the 2018 International Symposium on Biomedical Imaging (ISBI). It comprises 81 high-resolution color fundus images (resolution: 4288 × 2848) exhibiting signs of DR and associated abnormalities. The official dataset is divided into a training set of 54 images and a test set of 27 images. Due to the typically small size of fundus lesions and the significant imbalance in pixel distribution between lesion and background areas, pixel-level segmentation tasks face considerable challenges.
ACDC22 is a public medical imaging dataset that specializes in cardiac magnetic resonance imaging (CMR). It includes 100 patient samples with expert-annotated masks for the right ventricular cavity, left ventricular myocardium, and left ventricular cavity. These cases are partitioned into 70 training, 20 testing, and 10 validation samples.
Synapse23 comprises 30 abdominal CT scans capturing 13 organs. It is structured into 20 training and 10 testing samples. Differing from TransUNet, our approach utilizes all categories for experimentation.
BT-Seg is a private brain tumor segmentation dataset. It comprises high-resolution 3D brain MRI images from 100 patients with pathologically confirmed brain tumors, obtained from Second Xiangya Hospital, Central South University, in Hunan Province, China. All scans were acquired using the United Imaging u790 MR machine with a T1-weighted imaging sequence. Initial image registration was performed using SPM12 (statistical parametric mapping, version 12) with default parameters. All acquired 3D MRI sequences were co-registered to pre-contrast T1-weighted (T1W) images. This included post-contrast T1-weighted (T1+C) and fluid-attenuated inversion recovery (FLAIR) sequences, ensuring spatial alignment across all image sequences. Tumor ROIs were manually delineated on post-contrast T1W images, encompassing both enhancing tumor components and non-enhancing regions (including necrotic core). Pixel-level annotations were performed by experienced radiologists using 3D Slicer software. The annotated regions encompass the tumor core and surrounding edema. Each case has a volumetric dimension of 360 × 460 × 178 voxels. The dataset was divided into 80% training and 20% evaluation based on individual cases. Each case was further processed into 178 2D slices of size 360 × 460, along with corresponding masks.
Implementation details
All experiments were performed with Python 3.12 and PyTorch 2.3.0 on an NVIDIA RTX 4090 GPU with 24 GB of memory. The network was trained with the Adam optimizer for 400 epochs, using an initial learning rate of 2e-4 for the Synapse dataset and 1e-4 for the other datasets. Weight decay was consistently set at 1e-4 across all datasets. For the Kvasir-SEG dataset, image and label dimensions were resized to 244 × 244 with a batch size of 16. For the ISIC2016 dataset, image and label sizes were resized to 256 × 256 with a batch size of 8. For the IDRiD dataset, to preserve detail in tiny lesions of fundus images, the input size was set to 1024 × 1024 with a batch size of 1. For the ACDC and Synapse datasets, 3D data was converted to 2D slices, resized to 320 × 320 and 480 × 480, respectively, with batch sizes of 8 and 4. For the private datasets BT-Seg and LN-Seg, the 3D data were processed into 2D slices. The 2D slices and their corresponding labels were resized to 320 × 320 and 512 × 512, respectively, with batch sizes set to 10 and 4.
During training, images underwent random enhancements such as rotation, cropping, flipping, sharpening, brightness adjustments, color jitter, elastic transformations, grid shuffling, resizing, and normalization. Performance was evaluated using pixel accuracy (Acc), mean absolute error (MAE), dice similarity coefficient (Dice), intersection over union (IoU), and 95% Hausdorff distance (HD95).
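For illustration, such an augmentation pipeline could be assembled with the Albumentations library; the sketch below uses example parameters for the ISIC2016 input size rather than the exact configuration used in our experiments.

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Illustrative training-time augmentation pipeline; parameter values are
# examples rather than the exact settings used for training GH-UNet.
train_transform = A.Compose([
    A.Resize(288, 288),                         # resize slightly larger than the input
    A.RandomCrop(256, 256),                     # random cropping to the ISIC2016 input size
    A.Rotate(limit=30, p=0.5),                  # random rotation
    A.HorizontalFlip(p=0.5),                    # random flipping
    A.Sharpen(p=0.2),                           # sharpening
    A.RandomBrightnessContrast(p=0.3),          # brightness adjustment
    A.ColorJitter(p=0.2),                       # color jitter
    A.ElasticTransform(p=0.2),                  # elastic transformation
    A.RandomGridShuffle(grid=(3, 3), p=0.2),    # grid shuffling
    A.Normalize(),                              # normalization (ImageNet statistics by default)
    ToTensorV2(),
])

# The pipeline is applied jointly to an image and its mask so that spatial
# transforms stay aligned:
#   sample = train_transform(image=image, mask=mask)
```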
Table 1 presents the training time (TT, per epoch), inference time (IT, per image), frame rate (FPS), peak GPU memory usage during training (TPGM) and inference (IPGM), input size (IS), and batch size (BS) of GH-UNet across five datasets (ISIC2016, Kvasir-SEG, ACDC, IDRiD, and Synapse) on an NVIDIA RTX 4090 GPU. For example, the average training time per epoch on the ISIC2016 dataset is 1 min and 18 s, indicating rapid convergence. With a batch size of 8 and input size of 256 × 256, the average inference latency per image is 31.66 ms, corresponding to 31.58 FPS, demonstrating the model’s potential for real-time diagnostic applications. The peak GPU memory usage during training is 11.24 GB. During inference, peak GPU memory usage drops significantly to 1.077 GB, making the model suitable for deployment in resource-constrained clinical settings. Similar efficiency and resource usage were observed on the Kvasir-SEG, ACDC, IDRiD, and Synapse datasets. These results demonstrate that GH-UNet achieves both high performance and resource efficiency, indicating strong potential for clinical deployment.
Performance comparison
The performance of the proposed GH-UNet model was evaluated alongside comparison models on common medical image segmentation datasets, including ISIC2016, Kvasir-SEG, IDRiD, ACDC, and Synapse. The comparison models consist of CNN-based models, such as U-Net4, U-Net++2, Attention-UNet5, PSPNet24, DeepLabv3+25, SFA26, PraNet27, ACSNet28, nnUNet3, and Rolling-Unet11; Transformer-based models, including Swin-UNet7, nnFormer29, and MISSFormer30; hybrid approaches combining CNN and Transformer, such as ResT31, BoTNet32, TransUNet8, CvT10, EMCAD12, FSCA-Net13, MixFormer14, and H2Former33; and RWKV-based methods, including Zig-RiR17.
Skin lesion detection
Table 2 presents the performance comparison between the GH-UNet model and other models on the ISIC2016 dataset. The experimental results confirm that GH-UNet outperforms all other models on the ISIC2016 benchmark. GH-UNet achieves DICE and IOU scores of 93.78% and 88.39%, respectively, surpassing the SOTA H2Former model by 1.37% and 1.94%. Compared with the pure CNN-based SOTA model nnUNet and the pure Transformer-based SOTA model Swin-UNet, GH-UNet shows improvements of 3.33% and 5.14% in DICE scores, and 3.87% and 5.18% in IOU scores, respectively. Additionally, GH-UNet achieves the best MAE and ACC values among the compared models. These findings indicate that, compared with models using different architectures, GH-UNet excels in capturing lesion areas, excluding background interference, and producing more accurate segmentation boundaries. Compared with the current SOTA model H2Former, GH-UNet performs better across all aspects while using only 38.00% of H2Former’s parameters.
Polyp lesion detection
The performance of all models in identifying and detecting polyp lesion boundaries was evaluated on the Kvasir-SEG dataset, as summarized in Table 3. The results demonstrate the superior performance of GH-UNet in identifying polyp lesion boundaries. Specifically, GH-UNet achieved a Dice score of 92.68% and an IoU score of 87.19%, clearly exceeding the comparison models. Compared with the current state-of-the-art H2Former model, GH-UNet shows an improvement of 0.88% in Dice score and 0.9% in IoU score. Additionally, GH-UNet achieved the best MAE and ACC metrics among the compared models. These findings highlight the significant advantages of GH-UNet for polyp lesion image segmentation, demonstrating that the predicted results closely align with the actual lesion areas in both morphology and spatial positioning. The model accurately delineates polyp boundaries while significantly reducing over-segmentation and under-segmentation.
Fundus lesion detection
Detecting the boundaries of fundus lesions remains a significant challenge for advanced medical image segmentation models. The IDRiD dataset includes four diverse types of micro fundus lesions, providing a robust sample foundation for evaluating model performance and enabling effective verification across various complex lesion scenarios. Accordingly, we evaluated the performance of GH-UNet and comparative models on the IDRiD dataset to assess the effectiveness and generalizability of GH-UNet in medical image segmentation. Detailed performance metrics for GH-UNet and all comparative models are shown in Table 4. The results clearly demonstrate that GH-UNet excels across all metrics, achieving significant improvements over the H2Former model. For instance, the Dice score improves by 9.85%, and the IoU score increases by 6.31% compared to existing advanced models. These findings indicate that GH-UNet accurately segments micro fundus lesion areas and adapts well to complex challenges, including small targets and multiple lesion types.
Cardiac image segmentation
Accurate identification of the left and right ventricles and myocardium boundaries aids in medical diagnosis, treatment planning, cardiac function evaluation, and understanding the heart’s anatomical structure. This task is challenging due to the significant morphological differences between the ventricles and myocardium. Accordingly, we evaluated GH-UNet and comparative approaches on the automatic cardiac diagnosis task using the ACDC dataset. In the experiment, GH-UNet processes 3D data using a 2D slicing approach. Table 5 demonstrates that GH-UNet surpasses all comparative models, including specialized 3D medical image segmentation models such as nnUNet and nnFormer. These findings highlight the advantages of GH-UNet in handling complex medical images, particularly its adaptability to varying scales, such as those of the ventricular myocardium. In comparison, nnUNet and nnFormer occasionally fail to fully capture image details. By introducing a novel MSGA component, GH-UNet performs refined feature processing across multiple scales and dimensions, enabling superior identification of left and right ventricular and myocardial boundaries. These findings also underscore the potential for broad application of GH-UNet in medical image segmentation.
Multi-organ image segmentation
Additionally, we assessed GH-UNet and comparative approaches on the multi-organ image segmentation task using the Synapse dataset. In the experiment, GH-UNet employs 2D slicing to process 3D data. Table 6 presents the boundary annotation performance for 13 organs, evaluated primarily using HD95 and Dice metrics. The results indicate that GH-UNet improves the Dice metric by 3.29% and reduces the HD95 metric by 0.64 compared to the current SOTA model Zig-RiR, demonstrating enhanced performance. These findings confirm that GH-UNet maintains superior performance in complex multi-category scenarios and is well-suited for 3D medical image segmentation.
Visualization analysis
To further evaluate the model’s performance, we conducted several visualization analyses. We selected several images from the ISIC2016, Kvasir-SEG, IDRiD, ACDC, and Synapse datasets and used various models to delineate target boundaries. The first and second rows of Fig. 2 illustrate the boundary recognition performance of all models on skin lesion images. GH-UNet and the H2Former model demonstrate significantly better performance compared to other models. Additionally, GH-UNet outperforms the H2Former model in capturing finer details. The third and fourth rows of Fig. 2 depict the boundary recognition performance of all models on polyp images. Both GH-UNet and the H2Former models perform comparably and surpass other models significantly.
Figure 3A illustrates the performance of all models in outlining fundus lesion boundaries. Small lesions highlighted by white arrows are often misclassified or inaccurately predicted by other models, whereas GH-UNet identifies them more effectively. Furthermore, GH-UNet outperforms the current SOTA model H2Former in identifying very small lesions, as demonstrated in the green box. As shown in the yellow box, GH-UNet identifies lesion boundaries with high consistency to the actual lesion areas in morphology and spatial positioning, accurately outlining delicate lesion boundaries. The H2Former model often identifies lesions as larger, less detailed blocks. This highlights the strong performance of GH-UNet in handling complex scenarios, including large morphological differences, overlapping boundaries, and tiny targets. Figure 3B, C illustrate the performance of all models in outlining ventricular myocardium and multi-organ boundaries, respectively. In the experiments, 2D slicing was employed to process 3D image data. The H2Former model performed comparably to GH-UNet and outperformed other models. However, GH-UNet demonstrated significantly better performance than the H2Former model in certain details. For instance, as indicated by white arrows in Fig. 3C, GH-UNet accurately identifies small hole boundaries. As indicated by red arrows in Fig. 3C, GH-UNet effectively identifies narrow gaps and delicate boundaries. The H2Former model often classifies small targets as blocks and exhibits relatively low boundary sensitivity. This further demonstrates that GH-UNet is highly sensitive to details and effectively handles challenging boundary recognition. This capability is crucial for medical diagnosis and surgical treatment.
Evaluation on private dataset
This study evaluates the performance of the proposed GH-UNet on the private BT-Seg dataset, in comparison to H2Former and nnUNet. As shown in Table 7, GH-UNet significantly outperforms existing SOTA methods, such as H2Former and nnUNet, in segmenting both the tumor core and edema regions in the brain tumor segmentation task. GH-UNet achieves an overall Dice coefficient of 84.53%, surpassing H2Former (83.78%) and nnUNet (64.34%) by 0.75% and 20.19%, respectively. Additionally, 2D slicing was employed to process 3D image data, and the predicted slices were reconstructed into 3D for visualization. Figure 4 illustrates the segmentation results of various methods on representative cases, displayed at the 3D level and across three orthogonal 2D slicing planes. GH-UNet accurately captures the boundaries of the edema region while preserving the tumor core’s integrity, whereas H2Former and nnUNet exhibit significant mis-segmentation. These results demonstrate that GH-UNet offers substantial advantages in complex multi-region segmentation tasks.
Ablation experiments
A series of ablation experiments were carried out on the ISIC2016 dataset to examine the impact of various factors on model performance and validate the design rationale of GH-UNet. First, the contributions of the MSGA block, GDG block, and cascade mechanism to model performance were evaluated. Second, the impact of the downsampling strategy in the encoder module and the upsampling strategy in the decoder on model performance was assessed. Third, the impact of BCE, Dice, and IoU loss functions under varying weight combinations was evaluated. Additionally, we performed ablation studies on the configurations of the MSGA and GDG modules. Each set of experiments was run for 200 epochs.
Key component analysis
Table 8 presents the contributions of the pre-trained weights (PW), MSGA block, GDG block, and cascade mechanism to model performance. The results indicate that excluding the pre-trained weights, MSGA block, GDG block, and cascading mechanism significantly reduces GH-UNet’s performance. Additionally, model performance decreases when any single component is removed. Specifically, replacing the MSGA block with a standard double-layer convolution block results in the worst model performance. These findings demonstrate that the PW, MSGA block, GDG block, and cascade mechanism positively contribute to model performance, confirming the rationale behind these key modules.
Sampling technique analysis
Table 9 presents the impact of different upsampling and downsampling methods. The results indicate that the model performs worst when using max-pooling downsampling and interpolation upsampling strategies. Performance improves when the model employs the HWD downsampling or DySample upsampling strategy. The model performs best when both HWD downsampling and DySample upsampling are used simultaneously. These results highlight the importance of selecting appropriate upsampling and downsampling strategies in the encoder-decoder architecture. These results were obtained after 200 epochs; increasing the number of epochs may make the differences more apparent.
Loss weight analysis
Table 10 presents experiments on the BCE, Dice, and IoU loss functions under varying weight combinations, conducted on the ISIC2016, Kvasir-SEG, and Synapse datasets. Multiple weight combinations were tested: for instance, 0:1:0 uses only Dice loss; 1:0:0 uses only BCE loss; 0:0:1 uses only IoU loss; 1:1:1 assigns equal weight to all three; and 0.5:1.5:0.5 emphasizes Dice loss. The results demonstrate that combining all three loss functions significantly enhances segmentation performance. For example, on ISIC2016 the model performs best with the 0.5:1.5:0.5 weight configuration, achieving MAE = 3.54, ACC = 96.34%, Dice = 93.56%, and IoU = 88.01%. These findings highlight the critical role of Dice loss in segmenting images with ambiguous boundaries, while BCE and IoU losses improve convergence stability and spatial overlap. Similar trends are observed on the Kvasir-SEG and Synapse datasets. In summary, a weighted combination of loss functions, such as 0.5:1.5:0.5, effectively enhances model accuracy.
Analysis of convolution kernel size combinations in MSGA
This study investigates the effects of different kernel size combinations in the MSGA module on model performance. Tested combinations include common odd-sized kernels (e.g., {1,3,5}, {3,3,3}, {3,5,7}, {5,7,9}, {1,5,7}) and extended multi-scale variants such as {3,5,7,9} and {1,3,5,7}. The {1,3,5} combination emphasizes local detail due to its smaller receptive fields. Combinations such as {3,3,3}, {3,5,7}, and {5,7,9} offer larger receptive fields, enhancing global context modeling. The {3,5,7,9} and {1,3,5,7} configurations increase scale diversity through deeper stacking. The {1,5,7} configuration, currently adopted by GH-UNet, achieves a balance between local detail and global structure. These kernel combinations were evaluated on the ISIC2016, Kvasir-SEG, and Synapse datasets, with results summarized in Table 11. On the ISIC2016 dataset, the {1,5,7} configuration achieved the best performance, with a Dice score of 93.56% and an IoU of 88.01%. On Kvasir-SEG, the {1,5,7} configuration also yielded the best results (Dice: 92.46%, Accuracy: 97.66%). On Synapse, the {1,5,7} configuration achieved the highest Dice score (77.46%) and the lowest HD95 (12.77). These results demonstrate that combining small and large kernels (e.g., {1,5,7}) effectively balances local and contextual feature extraction, leading to improved segmentation performance.
Analysis of grouping strategies for GDG
As shown in Fig. 5, GH-UNet comprises seven GDG blocks. Without altering the network architecture or training strategy, we investigated two grouping strategies for the GDG modules. In the first setting, all seven modules use a uniform group count, such as {1,1,1,1,1,1,1} or {2,2,2,2,2,2,2}. The second is a layer-wise adaptive grouping strategy, where the number of groups varies across modules, gradually increasing with network depth to enhance mid- and high-level channel expressiveness. This results in an asymmetric configuration, e.g., {2, 4, 8, 8, 4, 2, 2}. As shown in Table 12, on the ISIC2016, Kvasir-SEG, and Synapse datasets, the {2, 4, 8, 8, 4, 2, 2} configuration yielded the best performance. For example, on ISIC2016, it achieved a Dice score of 93.56% and an IoU of 88.01%. On Kvasir-SEG, the Dice score reached 92.46% and the IoU was 86.81%. On Synapse, the highest Dice score (77.46%) and the lowest HD95 (12.77) were observed. This performance gain may stem from the model’s ability to adaptively balance channel-interaction granularity across network depths.
The GH-UNet architecture comprises (a) the overall structure and key components, including (b) a hybrid convolution-VIT encoder, (c) an MSGA block, (d) a CSG block, and (e) a GDG block. Concatenation denotes channel-wise concatenation, addition indicates element-wise addition, and multiplication refers to standard matrix multiplication.
Additionally, we provide a brief theoretical rationale for selecting the {2, 4, 8, 8, 4, 2, 2} configuration in this study. The GDG module captures dynamic inter-channel dependencies using group-wise convolution and channel-wise gating mechanisms. The number of groups directly influences the following two aspects: (1) Channel modeling granularity: Fewer groups lead to stronger representational capacity due to increased inter-channel interaction, whereas more groups improve computational efficiency at the cost of reduced interaction. (2) Trade-off between redundancy and isolation: Too few groups may cause parameter redundancy, while too many can lead to feature fragmentation and limited cross-channel communication.
Discussion
GH-UNet demonstrates strong performance across both public and private datasets, showcasing its capability to effectively capture complex anatomical structures while maintaining computational efficiency. Through the integration of a novel hybrid convolutional-ViT encoder and an MSGA mechanism, the model enhances its ability to extract both local and long-range contextual features. The introduction of the GDG component further improves feature expression by enabling dynamic weighting across channel groups. A multi-scale cascading decoder and enhanced upsampling/downsampling strategies contribute to refined spatial detail reconstruction, particularly at anatomical boundaries.
Despite these advancements, the model faces certain limitations that require further investigation. First, GH-UNet’s effectiveness may be constrained by the diversity of medical imaging modalities, particularly in low-contrast ultrasound or multi-modal imaging scenarios, where its generalizability remains to be validated. Second, although GH-UNet maintains a lightweight structure (only 12.81 M parameters), its inference speed on high-resolution 3D data is still limited due to the computational complexity of GDG operations, which may hinder deployment in real-time or edge environments. Third, GH-UNet currently processes volumetric data using 2D slices, which could affect spatial continuity and reduce segmentation consistency across slices.
To overcome these limitations, future work will explore extending GH-UNet to multi-modal applications (e.g., PET-MRI fusion) and cross-device domain generalization via adaptation techniques. We also plan to develop hierarchical GDG mechanisms that adjust group numbers based on feature resolution, enabling real-time optimization. Further improvements will focus on adapting the MSGA block to 3D space and incorporating sparse attention techniques (e.g., axial attention) to reduce global modeling cost. Additionally, we aim to explore weakly supervised or interactive learning strategies, such as training with scribbles or bounding boxes, to minimize the dependence on large fully annotated datasets.
In summary, GH-UNet presents an effective and generalizable medical image segmentation architecture that outperforms several state-of-the-art models across diverse datasets. Its modular design, high segmentation accuracy, and reduced computational burden make it a promising tool for future clinical applications in diagnosis and image-guided treatment.
Methods
This study introduces a novel medical image segmentation model within the U-Net framework, designed to efficiently capture both local and long-range dependencies and precisely delineate the boundaries of complex targets. This approach distinguishes itself from existing U-Net models in three primary ways. First, we developed a novel hybrid convolution-VIT encoder component that captures both local and long-range dependencies. Second, we introduced a GDG component that dynamically adjusts feature weights to enhance expression effectiveness. Third, we implemented a cascading mechanism in the decoder to integrate information across various scales effectively. Notably, the hybrid convolution-VIT encoder and GDG components are modular, enhancing their applicability across various CNN or VIT architectures. The remainder of this section describes these components and their underlying principles.
Model architecture
As depicted in Fig. 5, the GH-UNet architecture features a newly designed hybrid convolution-VIT encoder, wherein low-level features are handled by the MSGA block and high-level global features by the VIT layer. This configuration captures both local information and long-range dependencies while reducing computational demands. The GDG component dynamically adjusts feature weights to enhance expressiveness. The decoder’s cascading mechanism integrates multi-scale information effectively. The Decoder block, DL block, and Bottom block conform to the Rolling-Unet architecture. Additionally, in Fig. 5, orange arrows indicate upsampling and green arrows indicate downsampling. Downsampling and upsampling in the encoder and decoder utilize the Haar wavelet34 and DySample35 methods, respectively.
Novel hybrid convolution-VIT encoder
While U-Net-based models have significantly advanced medical imaging, they encounter inherent challenges, particularly in capturing long-range dependencies, a common limitation of convolutional architectures. This study introduces a novel convolution-VIT encoder component with several enhancements designed specifically for image segmentation tasks. First, we developed a new CSG block, based on a gating mechanism and tailored for image segmentation, to optimize task-specific performance. Second, an MSGA block was created, utilizing multi-scale fusion technology to enhance feature integration. Notably, both the CSG convolutional block and the MSGA block are modular and integrate easily into existing CNN or VIT architectures. In the encoder, our refinements focus on the convolutional block, while the Transformer module retains the original VIT design.
We devised a new MSGA block, grounded in multi-scale convolution and gating mechanisms, aimed at bolstering the model’s sensitivity to both local features and the global context. Particularly in medical image segmentation, the variety of details and background nuances necessitates a model with superior adaptability. The MSGA block contains a CSG convolution block, GSCONV36, and employs multi-scale fusion technology for robust feature processing. The MSGA block extracts and fuses features across multiple scales and dimensions, enabling more nuanced feature integration.
Channel-spatial gating (CSG) block
This study introduces a novel convolutional block based on CSG mechanisms, designed to enhance feature selectivity across various channels and spatial positions. The CSG block incorporates two types of gating mechanisms: channel gating and spatial gating. Channel gating assesses each channel’s importance via adaptive pooling and convolution, whereas spatial gating determines the weights of each spatial position using convolution operations.
The channel gating computation is formulated as follows:
where X ∈ RC×H×W, Achannel represents the channel attention map, σ denotes the Sigmoid activation function, AvgPool refers to adaptive average pooling, and Conv1×1 is the convolution operation.
The spatial gating computation is articulated as:
where Aspatial is the spatial attention map. The output feature Y can be derived from these gating mechanisms:
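A minimal PyTorch sketch of the CSG block consistent with this description is given below. The 3 × 3 spatial-gate kernel, the single-channel spatial attention map, and the multiplicative combination of both gates with the input are assumptions, since the exact formulation is not fixed by the text above.

```python
import torch
import torch.nn as nn

class ChannelSpatialGating(nn.Module):
    """Sketch of the CSG block: channel and spatial gates applied to the input."""
    def __init__(self, channels: int):
        super().__init__()
        # Channel gate: adaptive average pooling followed by a 1x1 convolution.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial gate: a convolution producing one weight per spatial position
        # (kernel size is an assumption).
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a_channel = self.channel_gate(x)   # (B, C, 1, 1)
        a_spatial = self.spatial_gate(x)   # (B, 1, H, W)
        return x * a_channel * a_spatial   # assumed multiplicative gating of X


# Example: y = ChannelSpatialGating(64)(torch.randn(2, 64, 32, 32))  # -> (2, 64, 32, 32)
```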
GSCONV block
This study incorporates GSConv within the MSGA block for feature extraction, aiming to balance model complexity and prediction precision. GSConv merges the benefits of standard convolution (SC) and depthwise separable convolution (DSC), harnessing SC for channel-specific information and DSC for spatial details to enhance collaborative processing. The channel shuffle technique is then employed to break the independent processing of SC and DSC, facilitating the exchange of local feature information across channels for natural and seamless integration. GSConv achieves computational efficiency and improved capability to discern complex medical image features without increasing FLOPs.
Define the input feature as X ∈ RC×H×W, where C represents the number of channels, and H and W denote the height and width of the feature map, respectively. In the GSConv layer, the initial input feature map X is processed by the first convolution layer:
The output of this convolution operation, feature X1 ∈ RC/2×H×W, has its channels reduced to C/2. Following this, X1 is processed by the second convolution layer with a kernel size of 5, a stride of 1, and C/2 channels:
The output feature X2 ∈ RC/2×H×W from this operation retains the same shape and number of channels as X1. Subsequently, X1 and X2 are concatenated along the channel dimension:
Next, apply the Channel Shuffle operation to X3:
Finally, the two segments of Yout are concatenated along the channel dimension:
where Y1 and Y2 constitute two segments of Yc.
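A minimal PyTorch sketch of this GSConv layer is shown below; the padding choice and the placement of batch normalization and activation inside each convolution are assumptions following the original GSConv design.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Sketch of the GSConv layer: standard conv to C/2, depthwise conv (k=5),
    concatenation, and channel shuffle. Assumes an even channel count."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        # Standard convolution reducing the channel count to C/2.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, half, kernel_size=1),
            nn.BatchNorm2d(half), nn.ReLU6(inplace=True),
        )
        # Depthwise convolution with kernel size 5 and stride 1 on the C/2 channels.
        self.dwconv = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=5, stride=1, padding=2, groups=half),
            nn.BatchNorm2d(half), nn.ReLU6(inplace=True),
        )

    @staticmethod
    def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
        b, c, h, w = x.shape
        # Split the channels into `groups` segments and interleave them.
        return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.conv(x)                           # (B, C/2, H, W)
        x2 = self.dwconv(x1)                        # (B, C/2, H, W)
        x3 = torch.cat([x1, x2], dim=1)             # (B, C, H, W)
        return self.channel_shuffle(x3, groups=2)   # mix SC and DSC channels
```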
Multi-scale gated attention fusion
The MSGA module features three convolutional layers with kernel sizes of 1 × 1, 5 × 5, and 7 × 7, each tailored to capture features at varying scales. Convolutional kernels within the GSConv module perform operations on input features across multiple scales. The three convolution operations proceed as follows:
together with the convolution operation performed by the GSConv module on the input. Each feature map is then processed with BatchNorm and ReLU6 activation functions, improving expressiveness:
Upon obtaining feature maps X1bn, X2bn, X3bn with different scales, the MSGA module employs channel-wise concatenation to produce the fused feature map:
The fused features are further refined through a convolution operation to derive the final output features:
To maintain feature information flow and prevent attenuation or gradient vanishing during training, the MSGA block incorporates residual connections:
where Residual(X) denotes the output of input X, adjusted for channel count via a convolution layer, serving as the residual term.
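The following sketch assembles the MSGA block from these steps, reusing the ChannelSpatialGating and GSConv sketches above. The fusion kernel size, the placement of the CSG gate after fusion, and the 1 × 1 residual projection are assumptions.

```python
import torch
import torch.nn as nn

class MSGABlock(nn.Module):
    """Sketch of the MSGA block: parallel 1x1/5x5/7x7 convolutions plus a GSConv
    branch, fused by a 1x1 convolution, gated by CSG, with a residual connection."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        def branch(k: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_channels, out_channels, k, padding=k // 2),
                nn.BatchNorm2d(out_channels), nn.ReLU6(inplace=True),
            )
        self.branch1, self.branch5, self.branch7 = branch(1), branch(5), branch(7)
        # GSConv branch operating on a 1x1-projected copy of the input.
        self.gsconv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            GSConv(out_channels),
        )
        self.fuse = nn.Conv2d(4 * out_channels, out_channels, kernel_size=1)
        self.gate = ChannelSpatialGating(out_channels)      # CSG gating of fused features
        self.residual = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat(
            [self.branch1(x), self.branch5(x), self.branch7(x), self.gsconv(x)], dim=1)
        return self.gate(self.fuse(feats)) + self.residual(x)  # residual connection
```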
Group-wise dynamic gating
In medical image segmentation tasks, dynamically adjusting feature weights can significantly improve the effectiveness of feature expression. To accomplish this, we introduced a new GDG component. The GDG component enhances the model’s focus by segmenting input features into multiple groups and dynamically applying gating mechanisms to each, adaptively adjusting their output weights. Notably, this component is pluggable, enhancing its ease of use.
Essential principles-grouping
The GDG component’s core concept involves dividing the input feature map into channel-based subgroups, applying separate convolution operations to each, and adjusting their outputs via a gating mechanism. This design enables dynamic adjustment of each group’s contribution based on feature importance, resulting in more flexible feature fusion.
The input feature map is divided into G subgroups, each comprising Cg channels such that Cg = C/G. Each subgroup’s convolution operation is executed by the GroupConv module:
where \({X}_{(i)}\in {R}^{{C}_{g}\times H\times W}\) denotes the i-th subgroup of the input feature map. Subsequently, the outputs from all subgroups are combined using a concatenation operation:
where Xout ∈ RC×H×W.
Gating mechanism
The GDG component employs a gating mechanism that controls the output of each subgroup by dynamically calculating their weights using adaptive average pooling. We perform adaptive average pooling on the output feature set Xout:
Subsequently, the pooled features undergo a convolution process to compute gating weights that control the contribution of each group:
The gating weights are determined using a Sigmoid activation function σ, which scales the outputs to a range of [0,1], directly influencing each group’s contribution to the final output.
Dynamic weighting and fusion
After computing the gating weights, the outputs of each group are multiplied by their respective gating weights to produce weighted outputs:
Subsequently, the weighted outputs from all groups are concatenated to form the final output:
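A minimal PyTorch sketch of the GDG component is given below. The 3 × 3 group convolution kernel and the choice of one gating weight per group (broadcast over that group's channels) are assumptions, and the channel widths in the usage example are illustrative.

```python
import torch
import torch.nn as nn

class GroupwiseDynamicGating(nn.Module):
    """Sketch of GDG: group-wise convolution followed by a dynamic gate computed
    from adaptive average pooling, rescaling each group's contribution."""
    def __init__(self, channels: int, groups: int):
        super().__init__()
        assert channels % groups == 0, "channels must be divisible by the group count"
        self.groups = groups
        # Group-wise convolution applied to all subgroups in parallel.
        self.group_conv = nn.Conv2d(channels, channels, kernel_size=3,
                                    padding=1, groups=groups)
        # Gating branch: pooling + convolution -> one weight per group in [0, 1].
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, groups, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        x_out = self.group_conv(x)                                    # concatenated group outputs
        weights = self.gate(x_out)                                    # (B, G, 1, 1)
        weights = weights.repeat_interleave(c // self.groups, dim=1)  # broadcast per group
        return x_out * weights                                        # dynamically weighted output


# Example: the layer-wise adaptive grouping schedule from the ablation study,
# with illustrative channel widths.
gdg_blocks = nn.ModuleList(
    GroupwiseDynamicGating(c, g)
    for c, g in zip((32, 64, 128, 256, 128, 64, 32), (2, 4, 8, 8, 4, 2, 2))
)
```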
Haar wavelet-based downsampling
In imaging tasks, downsampling crucially reduces feature map size, broadens the receptive field, and lessens computational demands. Traditional downsampling methods like maximum and average pooling often result in a loss of information, particularly high-frequency details and edge data. Medical image segmentation requires sensitivity to details like edges and textures, necessitating a downsampling method that preserves comprehensive band information. The Haar wavelet transform (HWT), with its superior multi-resolution analysis and information retention, is an ideal alternative to traditional downsampling methods.
This study implements Haar wavelet-based downsampling (HWD)34 between MSGA blocks, indicated by the green arrows in Fig. 5. HWD employs HWT for downsampling by conducting frequency domain analysis through wavelet decomposition, splitting the input feature map into low and high-frequency components across horizontal, vertical, and diagonal directions.
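The sketch below illustrates HWD-style downsampling: a single-level Haar decomposition computed with strided slicing, followed by a 1 × 1 convolution that merges the four sub-bands. The normalization convention and the channel-merging layer follow the spirit of the HWD module but are simplified assumptions.

```python
import torch
import torch.nn as nn

class HaarWaveletDownsample(nn.Module):
    """Sketch of Haar wavelet-based downsampling: the Haar transform halves the
    spatial resolution while retaining low- and high-frequency components."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels), nn.ReLU6(inplace=True),
        )

    @staticmethod
    def haar_transform(x: torch.Tensor) -> torch.Tensor:
        # 2x2 Haar analysis over even/odd rows and columns (orthonormal scaling omitted).
        a = x[..., 0::2, 0::2]
        b = x[..., 0::2, 1::2]
        c = x[..., 1::2, 0::2]
        d = x[..., 1::2, 1::2]
        ll = (a + b + c + d) / 4            # low-frequency approximation
        lh = (a - b + c - d) / 4            # high-frequency detail
        hl = (a + b - c - d) / 4            # high-frequency detail
        hh = (a - b - c + d) / 4            # high-frequency (diagonal) detail
        return torch.cat([ll, lh, hl, hh], dim=1)   # (B, 4C, H/2, W/2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.haar_transform(x))    # (B, out_channels, H/2, W/2)
```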
DySample dynamic upsampling
In medical image segmentation, anatomical boundaries of fine structures (e.g., lesion contours and organ edges) are often complex, irregular, and poorly defined. Traditional upsampling methods exhibit several limitations. For instance, bilinear interpolation employs fixed, uniform sampling, disregards feature distribution, and lacks semantic boundary awareness. PixelShuffle avoids artifacts from transposed convolution but remains a static rearrangement without content awareness. Transposed convolution introduces learnable parameters but often results in checkerboard artifacts, leading to instability.
To tackle these challenges, this study introduces a dynamic upsampling technique, DySample35, indicated by the orange arrow in Fig. 5. Its key advantages include: (1) combining PixelShuffle and deformable convolution principles to introduce position offsets and structure-aware sampling kernels; (2) employing an offset and scope control mechanism for flexible interpolation in feature space; (3) enabling structure-aligned upsampling, significantly enhancing boundary detail restoration. DySample predicts dynamic offset fields to generate content-aware sampling coordinates for each upsampling position, enabling adaptive feature sampling, with the offset calculated as follows:
where posinit is the initial position coordinate, and the sigmoid function adjusts these coordinates. Subsequently, network coordinates are adjusted based on the offset:
where a normalizer standardizes coordinates to fit within the [-1,1] sampling range. Finally, upsampling is completed using bilinear interpolation:
Additionally, anatomical boundaries in medical images often exhibit nonlinear structures, such as tumor margins and tissue interfaces. Fixed upsampling methods often result in over-smoothing and structural discontinuities. DySample dynamically adjusts the sampling region to align with semantic boundaries while preserving overall morphology, making it more suitable for fine-grained anatomical structure reconstruction. DySample thus enhances feature quality during decoding, thereby improving overall segmentation accuracy.
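A simplified sketch of this dynamic upsampling scheme for a scale factor of 2 is shown below. The 0.25 offset scope and the linear offset head follow the DySample design, while the grid construction and normalization details are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleUpsample(nn.Module):
    """Sketch of DySample-style upsampling: content-aware offsets perturb a
    regular sampling grid, and features are gathered with bilinear grid sampling."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Predict one (dx, dy) offset per output position from the input features.
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = self.scale
        # Offsets constrained to roughly (-0.25, 0.25) input pixels via a sigmoid.
        offset = (torch.sigmoid(self.offset(x)) - 0.5) * 0.5          # (B, 2*s*s, H, W)
        offset = F.pixel_shuffle(offset, s)                           # (B, 2, sH, sW)
        # Initial sampling positions of the upsampled grid in input pixel coordinates.
        ys = (torch.arange(h * s, device=x.device) + 0.5) / s - 0.5
        xs = (torch.arange(w * s, device=x.device) + 0.5) / s - 0.5
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        pos = torch.stack([gx, gy], dim=0).unsqueeze(0) + offset      # (B, 2, sH, sW)
        # Normalize coordinates to [-1, 1] for grid_sample and interpolate bilinearly.
        norm = torch.tensor([w - 1, h - 1], dtype=x.dtype, device=x.device).view(1, 2, 1, 1)
        grid = (2 * pos / norm - 1).permute(0, 2, 3, 1)               # (B, sH, sW, 2)
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```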
Loss function
We employed a supervised learning approach during the training phase. Specifically, outputs from each decoder block (p1, p2, p3) are compiled into the final output list, and losses are computed against the corresponding target segmentation masks. This study employs three loss functions: Binary Cross-Entropy (BCE) loss, Dice loss, and IoU loss. Dice loss quantifies the overlap between predicted and ground truth masks, making it particularly suitable for imbalanced scenarios such as medical image segmentation. The Dice loss is calculated as follows:
where pa denotes the predicted probability of the a-th pixel, ya is the corresponding ground truth label, and ε is a small constant added to prevent division by zero. Dice loss improves regional consistency by optimizing the ratio between the intersection and union of prediction and ground truth, making it effective in foreground-sparse segmentation tasks.
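Consistent with these definitions, a standard soft Dice formulation is \({{\mathcal{L}}}_{Dice}=1-\frac{2{\sum }_{a}{p}_{a}{y}_{a}+\varepsilon }{{\sum }_{a}{p}_{a}+{\sum }_{a}{y}_{a}+\varepsilon }\); the exact form used in our implementation may differ in minor details such as batch-wise reduction.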
Practically, we downsample the target segmentation masks to match the prediction resolutions for each stage’s outputs. Consequently, the final training objective is the total loss across all stages at varying resolutions. The calculation for the final training loss is as follows:
where \({{\mathcal{L}}}_{out}\) denotes the loss on the network output, while \({{\mathcal{L}}}_{p1}\), \({{\mathcal{L}}}_{p2}\), and \({{\mathcal{L}}}_{p3}\) denote the losses on the respective auxiliary feature maps. Each of these terms combines the BCE, Dice, and IoU losses. For all tasks, we apply uniform weightings: BCE (0.5), Dice (1.5), and IoU (0.5).
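As a concrete illustration, the deep-supervision objective can be sketched as follows for the binary-segmentation case; the helper names are ours, the multi-class extension and exact reduction details are omitted, and the nearest-neighbour mask downsampling is an assumption.

```python
import torch
import torch.nn.functional as F

def bce_dice_iou_loss(pred: torch.Tensor, target: torch.Tensor,
                      w_bce: float = 0.5, w_dice: float = 1.5, w_iou: float = 0.5,
                      eps: float = 1e-6) -> torch.Tensor:
    """Weighted BCE + Dice + IoU loss for one prediction map (pred holds logits,
    target is a float mask of the same shape)."""
    prob = torch.sigmoid(pred)
    bce = F.binary_cross_entropy_with_logits(pred, target)
    inter = (prob * target).sum(dim=(1, 2, 3))
    total = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1 - (2 * inter + eps) / (total + eps)
    iou = 1 - (inter + eps) / (total - inter + eps)
    return w_bce * bce + w_dice * dice.mean() + w_iou * iou.mean()

def deep_supervision_loss(outputs: list, mask: torch.Tensor) -> torch.Tensor:
    """Total loss over the network output and the auxiliary maps p1, p2, p3,
    downsampling the target mask to each prediction's resolution."""
    loss = 0.0
    for pred in outputs:  # e.g. [out, p1, p2, p3]
        target = F.interpolate(mask, size=pred.shape[-2:], mode="nearest")
        loss = loss + bce_dice_iou_loss(pred, target)
    return loss
```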
Data availability
All datasets used in this study are publicly available or available upon request. ISIC2016: International Skin Imaging Collaboration 2016 challenge dataset for skin lesion segmentation. Available at: https://challenge.isic-archive.com/data. Kvasir-SEG: Public polyp segmentation dataset collected from colonoscopy videos. Available at: https://datasets.simula.no/kvasir-seg/. ACDC: Automated Cardiac Diagnosis Challenge dataset for cardiac MRI segmentation. Available at: https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html. IDRiD: Indian Diabetic Retinopathy Image Dataset containing retinal fundus images and lesion masks. Available at: https://idrid.grand-challenge.org/. Synapse: Multi-organ CT dataset from the MICCAI 2015 Multi-Atlas Labeling Beyond the Cranial Vault challenge. Available at: https://www.synapse.org/#!Synapse:syn3193805. BT-Seg (Private): A proprietary multi-region brain tumor segmentation dataset collected in collaboration with clinical partners. This dataset is not publicly available but can be accessed from the corresponding author upon reasonable request.
Code availability
Our code is publicly available at: https://github.com/xiachashuanghua/GH-UNet.
References
Wang, Y. et al. Abdominal multi-organ segmentation with organ-attention networks and statistical fusion. Med. Image Anal. 55, 88–102 (2019).
Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N. & Liang, J. Unet++: a nested U-Net architecture for medical image segmentation. In Proc. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, 3–11 (Springer, 2018).
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J. & Maier-Hein, K. H. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18, 203–211 (2021).
Ronneberger, O., Fischer, P. & Brox, T. U-net: convolutional networks for biomedical image segmentation. In Proc. Medical Image Computing and Computer-assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, Proceedings, Part III 18, 234–241 (Springer, 2015).
Oktay, O. et al. Attention U-Net: Learning where to look for the pancreas. Preprint at https://openreview.net/forum?id=Skft7cijM (2018).
Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In Proc. International Conference on Learning Representations (2020).
Cao, H. et al. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proc. European conference on computer vision, 205–218 (Springer, 2022).
Chen, J. et al. Transunet: Transformers make strong encoders for medical image segmentation. Preprint at arXiv:2102.04306 (2021).
Hatamizadeh, A. et al. Unetr: transformers for 3D medical image segmentation. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision 574–584 (IEEE, 2022).
Wu, H. et al. Cvt: Introducing convolutions to vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision 22–31 (IEEE, 2021).
Liu, Y. et al. Rolling-unet: Revitalizing MLPs’ ability to efficiently extract long-distance dependencies for medical image segmentation. In Proc. AAAI Conference on Artificial Intelligence Vol. 38, 3819–3827 (ACM, 2024).
Rahman, M. M., Munir, M. & Marculescu, R. Emcad: Efficient multi-scale convolutional attention decoding for medical image segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 11769–11779 (IEEE, 2024).
Tan, D. et al. A novel skip-connection strategy by fusing spatial and channel wise features for multi-region medical image segmentation. IEEE J. Biomed. Health Inform. 28, 5396–5409 (2024).
Liu, J. et al. Mixformer: a mixed cnn-transformer backbone for medical image segmentation. IEEE Trans. Instrum. Meas. 74, 1–20 (2025).
Duan, Y. et al. Vision-rwkv: Efficient and scalable visual perception with rwkv-like architectures. In The Thirteenth International Conference on Learning Representations {ICLR}, Singapore (OpenReview.net, 2025).
Zhou, X. & Chen, T. Bsbp-rwkv: background suppression with boundary preservation for efficient medical image segmentation. In Proc. 32nd ACM International Conference on Multimedia 4938–4946 (ACM, 2024).
Chen, T. et al. Zig-rir: Zigzag rwkv-in-rwkv for efficient medical image segmentation. IEEE Trans. Med. Imaging https://doi.org/10.1109/TMI.2025.3561797 (2025).
Mehta, S. & Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In The Tenth International Conference on Learning Representations {ICLR}, Virtual Event (OpenReview.net, 2022).
Gutman, D. et al. Skin lesion analysis toward melanoma detection: a challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC). Preprint at https://arxiv.org/abs/1605.01397 (2016).
Jha, D. et al. Kvasir-seg: a segmented polyp dataset. In Proc. MultiMedia modeling: 26th international conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26, 451–462 (Springer, 2020).
Porwal, P. et al. Indian diabetic retinopathy image dataset (IDRID): a database for diabetic retinopathy screening research. Data 3, 25 (2018).
Bernard, O. et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE Trans. Med. Imaging 37, 2514–2525 (2018).
Landman, B. et al. MICCAI multi-atlas labeling beyond the cranial vault–workshop and challenge. In Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault-Workshop Challenge Vol. 5, 12 (Sage, 2015).
Zhao, H., Shi, J., Qi, X., Wang, X. & Jia, J. Pyramid scene parsing network. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2881–2890 (IEEE, 2017).
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F. & Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proc. European Conference on Computer Vision (ECCV), 801–818 (Springer, 2018).
Fang, Y., Chen, C., Yuan, Y. & Tong, K.-y. Selective feature aggregation network with area-boundary constraints for polyp segmentation. In Proc. Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part I 22, 302–310 (Springer, 2019).
Fan, D.-P. et al. Pranet: Parallel reverse attention network for polyp segmentation. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention 263–273 (Springer, 2020).
Zhang, R. et al. Adaptive context selection for polyp segmentation. In Proc. Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part VI 23, 253–262 (Springer, 2020).
Zhou, H.-Y. et al. nnFormer: volumetric medical image segmentation via a 3D transformer. IEEE Trans. Image Process. 32, 4036–4045 (2023).
Huang, X., Deng, Z., Li, D. & Yuan, X. Missformer: An effective transformer for 2d medical image segmentation. IEEE Trans. Image Process. 42, 1484–1494 (2021).
Zhang, Q. & Yang, Y.-B. Rest: an efficient transformer for visual recognition. Adv. Neural Inf. Process. Syst. 34, 15475–15485 (2021).
Srinivas, A. et al. Bottleneck transformers for visual recognition. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16519–16529 (IEEE, 2021).
He, A. et al. H2former: an efficient hierarchical hybrid transformer for medical image segmentation. IEEE Trans. Med. Imaging 42, 2763–2775 (2023).
Xu, G. et al. Haar wavelet downsampling: a simple but effective downsampling module for semantic segmentation. Pattern Recognit. 143, 109819 (2023).
Liu, W., Lu, H., Fu, H. & Cao, Z. Learning to upsample by learning to sample. In Proc. IEEE/CVF International Conference on Computer Vision 6027–6037 (IEEE, 2023).
Li, H. et al. Slim-neck by GSConv: a lightweight-design for real-time detector architectures. J. Real Time Image Process. 21, 62 (2024).
Acknowledgements
The study was supported by National Natural Science Foundation of China (Nos. 62476291, 62372158 and 62302339), the Hunan Provincial Natural Science Foundation for Distinguished Young Scholars (No. 2025JJ20097), the Research Foundation of Education Bureau of Hunan Province (No. 24B0003), and Wenzhou Key Scientific and Technological Projects (Nos. ZG2024007 and ZG2024012).
Author information
Authors and Affiliations
Contributions
S.W. and G.L. spearheaded the methodological framework design and manuscript drafting. M.G. conducted data acquisition and preprocessing across experimental cohorts. L.Z., M.L., and Z.M. provided experimental oversight and performed computational analysis of publicly available fundus lesion datasets. W.Z. and X.F. analyzed results derived from private datasets and executed critical revision of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
The collection of the BT-Seg private dataset adhered to ethical standards and received approval from the Ethics Review Committee of Second Xiangya Hospital, Central South University, which waived the requirement for patients’ informed consent in accordance with the Council for International Organizations of Medical Sciences (CIOMS) guidelines. All data were anonymized to ensure patient privacy. Due to institutional privacy policies, these datasets are not publicly available; however, access may be granted for legitimate academic and research purposes.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, S., Li, G., Gao, M. et al. GH-UNet: group-wise hybrid convolution-VIT for robust medical image segmentation. npj Digit. Med. 8, 426 (2025). https://doi.org/10.1038/s41746-025-01829-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41746-025-01829-2