Abstract
Semantic segmentation of moving objects against dynamic backgrounds faces core challenges such as background interference and blurred target features. This study proposes an architecture that integrates Generative Adversarial Networks (GANs) with Transformers: the GAN module enhances adaptability to dynamic backgrounds through adversarial training, the Transformer's self-attention mechanism captures long-range semantic dependencies, and a gated fusion strategy dynamically balances the two streams of multimodal features. Specifically, a conditional GAN generates dynamic background samples with variations in illumination and motion blur, a Transformer-based encoder-decoder models global contextual relationships, a temporal attention module incorporates motion vector fields to improve temporal consistency, and a Kullback-Leibler (KL) divergence-constrained semantic consistency loss regularizes the plausibility of generated samples. Experiments are conducted on a multi-dimensional simulated dataset and the real-world KITTI dataset. The proposed model achieves an average Intersection over Union (IoU) of 85.6% in standard dynamic scenes, outperforming DeepLabv3+ by 9.2 percentage points. In low-light and high-speed motion scenarios, its robustness index reaches 92.0%, 8.5 points higher than the baseline models. Ablation studies show that removing the Transformer causes a 6.7% drop in mean IoU (mIoU), while excluding the feature fusion module reduces robustness by 4.0%, confirming that both components are necessary. Temporal analysis reveals that the model maintains stable performance of 84.5–86.5% over 20-frame sequences, with fluctuation reduced by 63% relative to the baseline. Adversarial training improves adaptability to lighting changes by 5.3%, the multi-head self-attention (MSA) mechanism reduces long-range misclassification by 6.7%, and the gated fusion strategy lowers the false-positive rate in background-disturbed regions by 12.8%. The framework optimizes segmentation through a generator-segmenter feedback loop, effectively balancing dynamic background noise suppression against semantic fidelity. The contributions are threefold: (1) the first semantic segmentation framework to deeply integrate GANs and Transformers; (2) a theoretical model for dynamic feature gating and semantic consistency constraints; and (3) a standardized evaluation system covering ten dynamic background types and five illumination gradients. This study provides key technical support for real-time environmental perception in autonomous driving and intelligent surveillance, advancing both the theoretical and practical frontiers of dynamic scene understanding.
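As a concrete illustration of the gated fusion strategy, the following PyTorch sketch shows one plausible realization: a 1×1 convolution followed by a sigmoid produces per-pixel, per-channel gate values from the concatenated GAN-branch and Transformer-branch features, which are then blended convexly. The class name, tensor shapes, and gate design are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hypothetical gated fusion of GAN-branch and Transformer-branch features."""

    def __init__(self, channels: int):
        super().__init__()
        # Gate learned from both feature maps; sigmoid keeps values in (0, 1)
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_gan: torch.Tensor, f_tr: torch.Tensor) -> torch.Tensor:
        # f_gan, f_tr: (B, C, H, W) feature maps from the two branches
        g = self.gate(torch.cat([f_gan, f_tr], dim=1))
        # Convex blend: g -> 1 favours the GAN branch, g -> 0 the Transformer branch
        return g * f_gan + (1.0 - g) * f_tr
```

A decoder would then consume `GatedFusion(256)(f_gan, f_tr)` in place of either branch alone, letting the network suppress whichever stream is unreliable in a given region.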
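The temporal attention module that incorporates motion vector fields could, under one reading of the abstract, be cross-attention from current-frame tokens to motion-compensated past-frame tokens. The sketch below assumes the motion field has already been applied upstream (e.g. by warping past features); all names and shapes are hypothetical.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Hypothetical temporal attention over motion-compensated frame features."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, current: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
        # current: (B, N, D) tokens of the current frame
        # past:    (B, T*N, D) tokens of T past frames, assumed already
        #          warped according to the motion vector field
        out, _ = self.attn(query=current, key=past, value=past)
        # Residual connection preserves the current frame's own semantics
        return current + out
```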
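Similarly, the KL-divergence-constrained semantic consistency loss can be read as a supervised cross-entropy term plus a KL penalty that pulls the segmenter's predictions on a GAN-generated variant toward its predictions on the corresponding real frame. The weight `lam` and the choice to detach the real-frame distribution as a fixed reference are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(logits_real: torch.Tensor,
                              logits_gen: torch.Tensor,
                              labels: torch.Tensor,
                              lam: float = 0.1) -> torch.Tensor:
    # Supervised segmentation term on the real frame
    ce = F.cross_entropy(logits_real, labels)
    # KL(P_real || P_gen), averaged over the batch; the real-frame
    # distribution is detached so it serves as the fixed reference
    log_p_gen = F.log_softmax(logits_gen, dim=1)
    p_real = F.softmax(logits_real, dim=1).detach()
    kl = F.kl_div(log_p_gen, p_real, reduction="batchmean")
    return ce + lam * kl
```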
Data availability
Data is provided within the manuscript or supplementary information files.
Funding
This study received no funding.
Author information
Contributions
Conceptualization, Y.Q.L., Z.B.L., T.C. and X.J.H.; methodology, C.Z.; software, G.Z.; validation, D.Z.J., C.C. and Y.Q.L.; formal analysis, Z.B.L.; investigation, Y.Z. and J.T.Z.; resources, X.J.H., P.C.G. and G.Z.; data curation, T.C.; writing—original draft preparation, Y.Q.L. and Z.B.L.; writing—review and editing, Y.Q.L. and Z.B.L.; visualization, X.J.H., G.Z. and D.Z.J.; supervision, P.C.G.; project administration, J.T.Z. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, Y., Luo, Z., Chen, T. et al. Dynamic background motion object semantic segmentation algorithm based on generative adversarial network and transformer collaboration. Sci Rep (2026). https://doi.org/10.1038/s41598-026-39249-1