Adaptive graph signal processing for robust multimodal fusion with dynamic semantic alignment
  • Article
  • Open access
  • Published: 20 March 2026


  • K. V. Karthikeya1,
  • Arun Sekar Rajasekaran2,
  • Ashok Kumar Das3,4,
  • Vivekananda Bhat K5 &
  • Shantanu Pal6 

Scientific Reports (2026)


We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Engineering
  • Mathematics and computing

Abstract

In this paper, we introduce an Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP-DSA) framework for robust multimodal data fusion across heterogeneous sources, including text, audio, and images. The proposed approach uses a dual-graph construction to learn both intra-modal and inter-modal relations, spectral graph filtering to enhance informative signals, and multi-scale Graph Convolutional Networks to produce effective node embeddings. A semantic-aware attention mechanism then allows each modality to contribute dynamically according to its contextual relevance. Experimental results on three benchmark datasets, namely the Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, the Audio-Visual Event (AVE) dataset, and the MultiModal Internet Movie Database (MM-IMDb) dataset, show that AGSP-DSA achieves state-of-the-art performance. Specifically, it reaches 95.3% accuracy, a 93.6% F1 score (the harmonic mean of precision and recall), and 92.4% mean average precision on CMU-MOSEI, improving over the MultiModal Graph Neural Network by 2.6% in accuracy. It further attains 93.4% accuracy and a 91.1% F1 score on AVE, and 91.8% accuracy and an 88.6% F1 score on MM-IMDb, demonstrating strong generalization and robustness under missing-modality settings. These findings confirm the effectiveness of the proposed AGSP-DSA framework for multimodal learning in sentiment analysis, event recognition, and multimedia classification.
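The authors' implementation is not included in this early-access version, so the following minimal PyTorch sketch is purely illustrative of the pipeline the abstract outlines: per-modality graph construction, graph filtering, multi-scale graph convolution, and semantic-aware attention fusion. Every name and dimension here (MultiScaleGCN, SemanticAttentionFusion, the random toy graphs) is a hypothetical stand-in for the paper's unpublished components, and the spectral filtering step is approximated by first-order propagation with a symmetrically normalized adjacency, the standard GCN approximation to a spectral filter.

```python
# Illustrative sketch only: all class names, dimensions, and graphs below are
# hypothetical stand-ins, not the authors' published implementation.
import torch
import torch.nn as nn

def normalized_adjacency(adj):
    """Symmetrically normalize an adjacency matrix: D^{-1/2} (A + I) D^{-1/2}."""
    adj = adj + torch.eye(adj.size(0))          # self-loops guarantee degree >= 1
    d_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)

class MultiScaleGCN(nn.Module):
    """Aggregates neighborhoods at several propagation depths (scales)."""
    def __init__(self, in_dim, out_dim, scales=3):
        super().__init__()
        self.scales = scales
        self.proj = nn.Linear(in_dim * scales, out_dim)

    def forward(self, x, adj_norm):
        feats, h = [], x
        for _ in range(self.scales):
            h = adj_norm @ h                    # one more propagation hop per scale
            feats.append(h)
        return torch.relu(self.proj(torch.cat(feats, dim=-1)))

class SemanticAttentionFusion(nn.Module):
    """Weights each modality's embedding by a learned relevance score."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, modality_embs):           # list of (N, dim) tensors
        stacked = torch.stack(modality_embs, dim=1)          # (N, M, dim)
        weights = torch.softmax(self.score(stacked), dim=1)  # (N, M, 1)
        return (weights * stacked).sum(dim=1)                # (N, dim)

# Toy usage: three modalities over 8 nodes with 16-dim input features.
torch.manual_seed(0)
adjs = [normalized_adjacency((torch.rand(8, 8) > 0.7).float()) for _ in range(3)]
gcn = MultiScaleGCN(16, 32)
fuse = SemanticAttentionFusion(32)
embs = [gcn(torch.randn(8, 16), a) for a in adjs]
print(fuse(embs).shape)                         # torch.Size([8, 32])
```

In the full framework, the dual-graph construction would produce separate intra-modal and inter-modal adjacencies, and the attention scores would be conditioned on dynamic semantic alignment; for brevity, the sketch collapses these into a single per-modality graph and a simple learned relevance score.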

Data availability

The data that support the findings of this study are available upon request from the corresponding author.


Acknowledgements

This paper was edited for grammar using Grammarly. The authors thank the associate editor and the anonymous reviewers for insightful comments that significantly improved the technical quality and presentation of the manuscript.

Funding

Open access funding provided by Manipal Academy of Higher Education, Manipal, India.

Author information

Authors and Affiliations

  1. Chief Information Office, AT&T, Hyderabad, India

    K. V. Karthikeya

  2. Department of ECE, SR University, Warangal, Telangana, 506371, India

    Arun Sekar Rajasekaran

  3. Center for Security, Theory and Algorithmic Research, International Institute of Information Technology, Hyderabad, 500032, India

    Ashok Kumar Das

  4. Department of Computer Science and Engineering, College of Informatics, Korea University, 145 Anam-ro, Seongbuk-gu, 02841, Seoul, South Korea

    Ashok Kumar Das

  5. Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, 576104, India

    Vivekananda Bhat K

  6. School of Information Technology, Deakin University, Melbourne, VIC, 3125, Australia

    Shantanu Pal


Contributions

KVK: Conceptualization, Formal analysis, Supervision, Resources, Validation, Writing – original draft, Writing – review & editing. ASR: Investigation, Resources, Formal analysis, Writing – original draft, Writing – review & editing. AKD: Supervision, Validation, Investigation, Writing – original draft, Writing – review & editing. VBK: Supervision, Validation, Investigation, Funding acquisition. SP: Supervision, Validation, Investigation, Writing – original draft, Writing – review & editing.

Corresponding authors

Correspondence to Ashok Kumar Das or Vivekananda Bhat K.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

The research does not involve any human participants or animals.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Karthikeya, K.V., Rajasekaran, A.S., Das, A.K. et al. Adaptive graph signal processing for robust multimodal fusion with dynamic semantic alignment. Sci Rep (2026). https://doi.org/10.1038/s41598-026-44641-y


  • Received: 28 October 2025

  • Accepted: 12 March 2026

  • Published: 20 March 2026

  • DOI: https://doi.org/10.1038/s41598-026-44641-y


Keywords

  • Adaptive graph signal processing
  • Graph convolutional networks
  • Multimodal fusion
  • Semantic attention
  • Sentiment analysis
  • Event recognition
  • Multimedia classification