Abstract
Transformer-based models have demonstrated strong performance in medical image segmentation, but their high computational complexity remains a significant challenge. Mamba offers a more computationally efficient alternative, although its segmentation performance is generally inferior to that of transformers. This study proposes a lightweight U-Net-based hybrid model, named SwiM-UNet, the first Mamba–transformer hybrid model designed specifically for three-dimensional data. Efficient TSMamba (eTSMamba) blocks are incorporated in the early stages of the U-Net architecture to contain computational overhead, while efficient Swin transformer (eSwin) blocks are employed in the later stages to capture long-range dependencies and local contextual information. The Mamba and Swin transformer components are integrated through a Mamba–Swin adapter (MS-adapter), which comprises three sub-adapters that emphasize local information along the \(x\)-, \(y\)-, and \(z\)-axes as well as channel-wise features between the eTSMamba and eSwin modules, together with gating mechanisms that balance the contributions of the sub-adapters. In addition, a low-rank MLP is used in the encoder and channel reduction is applied in the decoder to further improve computational efficiency. Evaluations on the publicly available BraTS2023 and BraTS2024 datasets show that the proposed model surpasses state-of-the-art benchmark models while maintaining low computational complexity.
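The MS-adapter description suggests a gated combination of axis-wise and channel-wise sub-adapters bridging the eTSMamba and eSwin features. The snippet below is a minimal PyTorch-style sketch of that idea only; the kernel sizes, the low-rank channel bottleneck, the softmax gating, and all module names are illustrative assumptions rather than the paper's implementation.

```python
# Hypothetical sketch of an MS-adapter-style module (not the authors' code):
# three sub-adapters, one per spatial axis, each mixing local context along its
# axis (depthwise conv) and channel-wise features (pointwise convs), blended by
# learned gates.
import torch
import torch.nn as nn

class MSAdapterSketch(nn.Module):
    def __init__(self, channels: int, k: int = 7, rank: int = 4):
        super().__init__()
        kernels = [(k, 1, 1), (1, k, 1), (1, 1, k)]  # z-, y-, x-axis kernels in (D, H, W) order
        self.sub_adapters = nn.ModuleList()
        for ks in kernels:
            pad = tuple(s // 2 for s in ks)
            self.sub_adapters.append(nn.Sequential(
                nn.Conv3d(channels, channels, ks, padding=pad, groups=channels),  # axis-local context
                nn.Conv3d(channels, channels // rank, 1),                          # channel-wise, low-rank
                nn.GELU(),
                nn.Conv3d(channels // rank, channels, 1),
            ))
        self.gate_logits = nn.Parameter(torch.zeros(len(kernels)))  # one gate per sub-adapter

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, D, H, W)
        gates = torch.softmax(self.gate_logits, dim=0)   # balance the sub-adapter contributions
        out = sum(g * sub(x) for g, sub in zip(gates, self.sub_adapters))
        return x + out  # residual bridge between the eTSMamba and eSwin stages
```

Under these assumptions the adapter adds only depthwise and pointwise convolutions plus three gate parameters, so the cost of bridging the two branches stays small relative to the attention and state-space blocks themselves.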
Data availability
The BraTS 2023 and 2024 datasets used in this study are publicly available. The BraTS 2023 data were accessed through the official challenge page on the Synapse platform under Synapse ID syn51156910. The BraTS 2024 data were accessed via the Synapse portal under Synapse ID syn53708249. Additionally, the Beyond the Cranial Vault (BTCV) multi-organ abdominal CT segmentation data were accessed via the official challenge page on the Synapse platform under Synapse ID syn3193805.
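For readers who wish to retrieve these datasets programmatically, the following is a minimal sketch using the official Synapse Python client (synapseclient); a Synapse account, a personal access token, and acceptance of each challenge's data-use terms are assumed, and the download path is illustrative.

```python
# Hedged example: fetch the Synapse entities listed above with synapseclient.
# Challenge projects typically organize their files in sub-folders, so the
# top-level entity may need to be traversed with syn.getChildren().
import synapseclient

syn = synapseclient.Synapse()
syn.login(authToken="YOUR_SYNAPSE_TOKEN")  # placeholder token

for dataset_id in ("syn51156910", "syn53708249", "syn3193805"):
    entity = syn.get(dataset_id, downloadLocation="./data")  # downloads file entities
    print(dataset_id, entity.name)
```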
References
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. Preprint at arXiv:2010.11929 (2020).
Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proc. of the IEEE/CVF International Conference on Computer Vision, 10012–10022 (2021).
Ghazouani, F., Vera, P. & Ruan, S. Efficient brain tumor segmentation using swin transformer and enhanced local self-attention. Int. J. Comput. Assist. Radiol. Surg. 19, 273–281 (2024).
Ferreira, A. et al. How we won BraTS 2023 adult glioma challenge? Just faking it! Enhanced synthetic data augmentation and model ensemble for brain tumour segmentation. Preprint at arXiv:2402.17317 (2024).
Gu, A. & Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. Preprint at arXiv:2312.00752 (2023).
Gu, A., Goel, K. & Ré, C. Efficiently modeling long sequences with structured state spaces. Preprint at arXiv:2111.00396 (2021).
Zhu, L. et al. Vision mamba: Efficient visual representation learning with bidirectional state space model. Preprint at arXiv:2401.09417 (2024).
Shi, D. TransNeXt: Robust foveal visual perception for vision transformers. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17773–17783, https://doi.org/10.1109/CVPR52733.2024.01683 (IEEE Computer Society, 2024).
Han, D. et al. Demystify mamba in vision: a linear attention perspective. In Proc. of the 38th International Conference on Neural Information Processing Systems, NIPS ’24 (Curran Associates Inc., 2024).
Han, D. et al. Agent attention: On the integration of softmax and linear attention. In Leonardis, A. et al. (eds.) Computer Vision – ECCV 2024, 124–140 (Springer Nature, 2025).
Lou, M. et al. Transxnet: Learning both global and local dynamics with a dual dynamic token mixer for visual recognition. IEEE Trans. Neural Netw. Learn. Syst. 36, 11534–11547. https://doi.org/10.1109/TNNLS.2025.3550979 (2025).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), 234–241 (Springer, 2015).
Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N. & Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (DLMIA 2018 and ML-CDS 2018, held in conjunction with MICCAI 2018), 3–11 (Springer, 2018).
Hatamizadeh, A. et al. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI Brainlesion Workshop, 272–284, https://doi.org/10.1007/978-3-031-08999-2_22 (Springer, 2022).
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J. & Maier-Hein, K. H. nnu-net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18, 203–211, https://doi.org/10.1038/s41592-020-01008-z (2021).
Goodfellow, I. et al. Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014).
ZongRen, L., Silamu, W., Yuzhen, W. & Zhe, W. Densetrans: Multimodal brain tumor segmentation using swin transformer. IEEE Access 11, 42895–42908 (2023).
Shi, Y., Li, M., Dong, M. & Xu, C. Vssd: Vision mamba with non-causal state space duality. In Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 10819–10829 (2025).
Lou, M., Fu, Y. & Yu, Y. Sparx: A sparse cross-layer connection mechanism for hierarchical vision mamba and transformer networks. In Proc. of the AAAI Conference on Artificial Intelligence, vol. 39, https://doi.org/10.1609/aaai.v39i18.34103 (2025).
Shi, Y., Dong, M. & Xu, C. Multi-scale vmamba: Hierarchy in hierarchy visual state space model. In Globerson, A. et al. (eds.) Advances in Neural Information Processing Systems, vol. 37, 25687–25708, https://doi.org/10.52202/079017-0808 (Curran Associates, Inc., 2024).
Fu, Y., Lou, M. & Yu, Y. Segman: Omni-scale context modeling with state space models and local attention for semantic segmentation. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 19077–19087, https://doi.org/10.1109/CVPR52734.2025.01777 (2025).
Lai, Y. et al. Advancing efficient brain tumor multi-class classification–new insights from the vision mamba model in transfer learning. Preprint at arXiv:2410.21872 (2024).
Bozinovski, S. & Fulgosi, A. The influence of pattern similarity and transfer of learning upon training of a base perceptron b2. In Proc. Symp. Informatica, 3–121–5 (Bled, 1976). Original in Croatian: Utjecaj slicnosti likova i transfera ucenja na obucavanje baznog perceptrona B2.
Dang, T. D. Q., Nguyen, H. H. & Tiulpin, A. Log-vmamba: Local-global vision mamba for medical image segmentation. In Proc. of the Asian Conference on Computer Vision, 548–565 (2024).
Xing, Z., Ye, T., Yang, Y., Liu, G. & Zhu, L. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 578–588, https://doi.org/10.1007/978-3-031-69023-9_57 (Springer, 2024).
Ding, X., Zhang, X., Han, J. & Ding, G. Scaling up your kernels to 31\(\times\)31: Revisiting large kernel design in cnns. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11953–11965, https://doi.org/10.1109/CVPR52688.2022.01166 (2022).
Ding, X. et al. Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-series and image recognition. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5513–5524, https://doi.org/10.1109/CVPR52733.2024.00527 (2024).
Lou, M. & Yu, Y. Overlock: An overview-first-look-closely-next convnet with context-mixing dynamic kernels. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), https://doi.org/10.1109/CVPR52734.2025.00021 (2025).
Zhou, R. et al. Cascade residual multiscale convolution and mamba-structured unet for advanced brain tumor image segmentation. Entropy 26, 385 (2024).
Hatamizadeh, A. & Kautz, J. Mambavision: A hybrid mamba-transformer vision backbone. Preprint at arXiv:2407.08083 (2024).
Zhang, M., Chen, Z., Ge, Y. & Tao, X. Hmt-unet: A hybird mamba-transformer vision unet for medical image segmentation. Preprint at arXiv:2408.11289 (2024).
Cao, A., Li, Z., Jomsky, J., Laine, A. F. & Guo, J. Medsegmamba: 3d cnn-mamba hybrid architecture for brain segmentation. Preprint at arXiv:2409.08307 (2024).
Peiris, H., Hayat, M., Chen, Z., Egan, G. & Harandi, M. A robust volumetric transformer for accurate 3d tumor segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention, 162–172 (Springer, 2022).
Gong, H. et al. nnmamba: 3d biomedical image segmentation, classification and landmark detection with state space model. In 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI), 1–5 (IEEE, 2025).
Chen, J. et al. Transunet: Transformers make strong encoders for medical image segmentation. Med. Image Anal. 97, 103280. https://doi.org/10.1016/j.media.2024.103280 (2024).
Hatamizadeh, A. et al. Unetr: Transformers for 3d medical image segmentation. In Proc. of the IEEE/CVF Winter Conference on Applications of Computer Vision, 574–584, https://doi.org/10.1109/WACV51458.2022.00181 (2022).
Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29 (2019).
NVIDIA Corporation. NVIDIA Jetson Xavier NX developer kit: Technical specifications (2023).
NVIDIA Corporation. NVIDIA Jetson Orin Nano and Orin NX product brief (2024).
Google LLC. Edge TPU system architecture (2022).
ARM Ltd. Arm Ethos NPU technical overview (2023).
Maier-Hein, L. et al. Surgical data science for next-generation interventions. Nat. Biomed. Eng. 1, 691–696 (2017).
Funding
This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00223501, No. 2022R1A5A8019303), and partly supported by Hallym University MHC (Mighty Hallym 4.0 Campus) project, 2025 (MHC-MHC-202502-002).
Author information
Authors and Affiliations
Contributions
Y. N. (Yeonwoo Noh) conceived and designed the study, developed the methodology, implemented the software, and performed the experiments. S. L. (Seongwook Lee) and S. J. (Seyong Jin) supported the data curation and validation process. Y. C. (Yunyoung Chang) contributed to visualization of the results. D. W. (Dong-Ok Won) contributed to analyzing real-world on-device and edge deployment scenarios. Y. N. and W. N. (Wonjong Noh) wrote the original draft of the manuscript. W. N., M. L. (Minwoo Lee) and D. W. contributed to supervision, critical review of the manuscript, and funding acquisition. W. N. was responsible for project administration. All authors reviewed and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Noh, Y., Lee, S., Jin, S. et al. Lightweight SwiM-UNet with multi-dimensional adaptor for efficient on-device medical image segmentation. Sci Rep (2026). https://doi.org/10.1038/s41598-026-35771-4