Abstract
Infrared and visible image fusion aims to synthesize a more informative result by extracting and integrating the complementary salient features of two heterogeneous modalities. Recent research has shown that capturing explicit self-similarity and implicit cross-correlation with attention mechanisms offers several advantages, and this direction has attracted considerable interest. However, exploring the complementary relationships more comprehensively and quantitatively optimizing the degree of interaction between the two kinds of attention remain challenging. In this paper, a novel infrared and visible image fusion method built on a double-attention mechanism is proposed. Specifically, our approach extracts intra- and inter-attention features from the source images through a two-step feature extraction strategy and integrates them with an intra-attention block in the feature fusion stage. In addition, an adaptive interaction loss term is devised to optimally regulate the interaction between the two kinds of attention. In these ways, salient infrared targets and visible texture details can be integrated more effectively. In the experiments, the proposed method was compared with seven state-of-the-art methods on the TNO and RoadScene datasets; the comprehensive subjective and objective comparisons demonstrate its superiority. Finally, a thorough experiment and discussion on the interaction of intra- and inter-information is presented to further validate and analyze the effectiveness of our work.
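The paper's exact formulation of the adaptive interaction loss is not reproduced in this excerpt. As a rough illustration only, the PyTorch sketch below shows one common way such a term could let training learn the trade-off between an intra-attention term and an inter-attention term; the class name, the learnable weight `raw_alpha`, and the L1 distances are all hypothetical choices, not the authors' method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveInteractionLoss(nn.Module):
    """Hypothetical sketch: balance intra- and inter-attention terms
    with a single learnable interaction weight squashed into (0, 1)."""

    def __init__(self):
        super().__init__()
        # Unconstrained scalar; sigmoid maps it into (0, 1) below.
        self.raw_alpha = nn.Parameter(torch.zeros(1))

    def forward(self, fused, intra_feat, inter_feat):
        alpha = torch.sigmoid(self.raw_alpha)
        # Pull the fused features toward both attention branches,
        # letting optimization decide the relative emphasis.
        intra_term = F.l1_loss(fused, intra_feat)
        inter_term = F.l1_loss(fused, inter_feat)
        return alpha * intra_term + (1.0 - alpha) * inter_term
```

In such a setup, `raw_alpha` would be optimized jointly with the network parameters, so the emphasis placed on intra- versus inter-attention features is tuned by the data rather than fixed by hand.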
Data availability
The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.
Code availability
The code is available from the corresponding author on reasonable request.
Acknowledgements
We would like to thank Professor Liu Zheng for providing the fusion quality objective assessment toolbox.
Funding
This work was supported by the National Natural Science Foundation of China under Project Numbers 61274021 and 61902282.
Author information
Contributions
Z.W. designed and implemented the fusion framework, conducted experiments, and performed data curation. Y.H. and B.Z. conceived the research idea and supervised the project. Z.W. and Y.H. wrote the manuscript. All authors reviewed, edited, and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, Z., Hu, Y. & Zhang, B. Infrared-visible image fusion with double-attention mechanism and adaptive interaction loss. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45802-9
DOI: https://doi.org/10.1038/s41598-026-45802-9