Scientific Reports
Mamba-based modulated fusion model for video moment retrieval
  • Article
  • Open access
  • Published: 03 April 2026

  • Bing Yu1,2 na1,
  • Jingyu Li1,2 na1,
  • Youxian Di1,2,
  • Yingran Liu1,2,
  • Youdong Ding1,2 &
  • …
  • Dongjin Huang1,2 

Scientific Reports (2026) Cite this article

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Engineering
  • Mathematics and computing

Abstract

Video Moment Retrieval (VMR) is a fundamental task in video understanding: it bridges vision and language by localizing the temporal segments of an untrimmed video that are most relevant to a textual query. Existing approaches excel at fine-grained alignment but often fail to capture global temporal context effectively, particularly in long-form videos. To address this challenge, we propose the Hybrid Mamba Network (HM-Net), a two-level fusion architecture that unifies the strengths of attention and sequence modeling. In particular, its core is the Hybrid Modulated Bi-Mamba (HMB) Block, which integrates the temporal modeling capability of Mamba into the VMR framework to achieve effective long-range temporal reasoning. Extensive experiments on the challenging TACoS and QVHighlights benchmarks show that HM-Net consistently outperforms existing approaches, achieving a 3.84% improvement in R1@0.5 on TACoS and a 1.65% improvement in mAP on QVHighlights, with notable gains in localization accuracy, particularly on long-form videos.
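
For readers unfamiliar with the metric, R1@0.5 is conventionally the percentage of queries whose top-ranked predicted moment overlaps the ground-truth moment with a temporal IoU of at least 0.5. The sketch below illustrates that convention only. The function names and the toy spans are hypothetical and are not taken from the paper or its evaluation code.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair; this type alias is illustrative only.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1D temporal spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Span], gts: List[Span],
                iou_threshold: float = 0.5) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if len(top1_preds) != len(gts):
        raise ValueError("one top-1 prediction is required per ground-truth span")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    # Toy data (assumed): the first prediction overlaps well, the second misses.
    preds = [(0.0, 10.0), (2.0, 8.0)]
    truth = [(0.0, 12.0), (20.0, 30.0)]
    print(recall_at_1(preds, truth))
```

Benchmark toolkits may differ in details such as how ties or multiple ground-truth moments per query are handled, so reported numbers should be reproduced with each dataset's official evaluation script.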

Data availability

The datasets used and analysed during the current study are available in the following repositories: The TACoS dataset is available at https://zenodo.org/records/15379789. The QVHighlights dataset is available at https://github.com/jayleicn/moment_detr.

Acknowledgements

This work was supported by the Shanghai Municipal Fund for Promoting the Development of the Cultural and Creative Industries (2025020022) and the Shanghai Natural Science Foundation (25ZR1401130).

Author information

Author notes
  1. Bing Yu and Jingyu Li contributed equally to this work.

Authors and Affiliations

  1. Department of Film and Television Engineering, Shanghai University, Shanghai, 200072, China

    Bing Yu, Jingyu Li, Youxian Di, Yingran Liu, Youdong Ding & Dongjin Huang

  2. Shanghai Engineering Research Center of Motion Picture Special Effects, Shanghai University, Shanghai, 200072, China

    Bing Yu, Jingyu Li, Youxian Di, Yingran Liu, Youdong Ding & Dongjin Huang

Contributions

Bing Yu and Jingyu Li wrote the main manuscript text and performed the experiments. Youxian Di and Yingran Liu prepared all figures. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Bing Yu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Yu, B., Li, J., Di, Y. et al. Mamba-based modulated fusion model for video moment retrieval. Sci Rep (2026). https://doi.org/10.1038/s41598-026-44804-x

  • Received: 29 December 2025

  • Accepted: 13 March 2026

  • Published: 03 April 2026

  • DOI: https://doi.org/10.1038/s41598-026-44804-x

Keywords

  • Video moment retrieval
  • Cross-modal fusion
  • State space models
Scientific Reports (Sci Rep)

ISSN 2045-2322 (online)

© 2026 Springer Nature Limited
