Abstract
In this paper, we introduce an Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP-DSA) framework for robust multimodal data fusion across heterogeneous sources, including text, audio, and images. The proposed approach uses a dual-graph construction to learn both intra-modal and inter-modal relations, spectral graph filtering to enhance informative signals, and multi-scale graph convolutional networks to produce effective node embeddings. A semantic-aware attention mechanism allows each modality to contribute dynamically according to its contextual relevance. Experimental results on three benchmark datasets, the Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, the Audio-Visual Event (AVE) dataset, and the MultiModal Internet Movie Database (MM-IMDb) dataset, show that AGSP-DSA achieves state-of-the-art performance. Specifically, it attains 95.3% accuracy, a 93.6% F1 score (the harmonic mean of precision and recall), and 92.4% mean average precision on CMU-MOSEI, improving over the MultiModal Graph Neural Network by 2.6% in accuracy. It further achieves 93.4% accuracy and a 91.1% F1 score on the AVE dataset, and 91.8% accuracy and an 88.6% F1 score on the MM-IMDb dataset, demonstrating strong generalization and robustness under missing-modality settings. These findings confirm the effectiveness of the proposed AGSP-DSA in advancing multimodal learning for sentiment analysis, event recognition, and multimedia classification.
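For illustration only, the following minimal sketch shows how two of the components named above, a polynomial spectral filter over a per-modality graph and a semantic-aware attention over modality embeddings, could be combined. This is not the authors' implementation: the module names, the polynomial filter form, the attention scoring network, and all dimensions are assumptions introduced here for clarity.

```python
# Minimal, illustrative sketch (PyTorch). All names and hyperparameters are
# assumptions for exposition, not the AGSP-DSA implementation from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


def normalized_laplacian(adj: torch.Tensor) -> torch.Tensor:
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(dim=-1).clamp(min=1e-6)
    d_inv_sqrt = deg.pow(-0.5)
    norm_adj = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)
    return torch.eye(adj.size(0)) - norm_adj


class SpectralFilterGCN(nn.Module):
    """Low-order polynomial spectral filter (assumed form) over one modality graph."""

    def __init__(self, in_dim: int, out_dim: int, order: int = 2):
        super().__init__()
        self.order = order
        self.lin = nn.Linear(in_dim * (order + 1), out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        lap = normalized_laplacian(adj)
        feats, h = [x], x
        for _ in range(self.order):      # stack L^k X as multi-scale features
            h = lap @ h
            feats.append(h)
        return F.relu(self.lin(torch.cat(feats, dim=-1)))


class SemanticAttentionFusion(nn.Module):
    """Weights each modality's node embeddings by a learned relevance score."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, modality_embs: list) -> torch.Tensor:
        stacked = torch.stack(modality_embs, dim=0)           # (M, N, dim)
        weights = torch.softmax(self.score(stacked), dim=0)   # per-modality weights
        return (weights * stacked).sum(dim=0)                 # fused node embeddings


if __name__ == "__main__":
    # Toy usage: 3 modalities, 10 nodes, 16-dim input features.
    gcn = SpectralFilterGCN(16, 32)
    fuse = SemanticAttentionFusion(32)
    embs = [gcn(torch.randn(10, 16), torch.rand(10, 10)) for _ in range(3)]
    print(fuse(embs).shape)  # torch.Size([10, 32])
```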
Data availability
The data that support the findings of this study are available upon request from the corresponding author.
References
Baltrušaitis, T., Ahuja, C. & Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 423–443. https://doi.org/10.1109/TPAMI.2018.2798607 (2019).
Atrey, P., Hossain, M., Saddik, A. E. & Kankanhalli, M. Multimodal fusion for multimedia analysis: A survey. Multimedia Syst. 16, 345–379. https://doi.org/10.1007/s00530-010-0182-0 (2010).
Shuman, D. I., Narang, S. K., Frossard, P., Ortega, A. & Vandergheynst, P. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 30, 83–98. https://doi.org/10.1109/MSP.2012.2235192 (2013).
Georgiou, E., Papaioannou, C. & Potamianos, A. Deep hierarchical fusion with application in sentiment analysis. In Proc. of Interspeech, 1646–1650, https://doi.org/10.21437/Interspeech.2019-3243 (2019).
Jiang, B., Zhang, Z., Lin, D., Tang, J. & Luo, B. Semi-supervised learning with graph learning-convolutional networks. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11313–11320 (2019).
You, R. et al. Cross-modality attention with semantic graph embedding for multi-label classification. In Proc. of the AAAI Conference on Artificial Intelligence 34, 12709–12716. https://doi.org/10.1609/aaai.v34i07.6964 (2020).
Xi, X., Chow, C.-O., Chuah, J. H. & Kanesan, J. Cross-modal semantic relations enhancement with graph attention network for image-text matching. IEEE Access 13, 46124–46135. https://doi.org/10.1109/ACCESS.2025.3549781 (2025).
Zhang, L., Jiang, Y., Yang, W. & Liu, B. Tctfusion: A triple-branch cross-modal transformer for adaptive infrared and visible image fusion. Electronics 14, 731. https://doi.org/10.3390/electronics14040731 (2025).
Hu, X. & Yamamura, M. Global local fusion neural network for multimodal sentiment analysis. Appl. Sci. 12, 8453. https://doi.org/10.3390/app12178453 (2022).
Liu, X., Li, S. & Wang, M. Hierarchical attention-based multimodal fusion network for video emotion recognition. Comput. Intell. Neurosci. 2021, 5585041. https://doi.org/10.1155/2021/5585041 (2021).
He, L., Wang, Z., Wang, L. & Li, F. Multimodal mutual attention-based sentiment analysis framework adapted to complicated contexts. IEEE Trans. Circuits Syst. Video Technol. 33, 7131–7143. https://doi.org/10.1109/TCSVT.2023.3276075 (2023).
Yu, J., Li, J., Yu, Z. & Huang, Q. Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans. Circuits Syst. Video Technol. 30, 4467–4480. https://doi.org/10.1109/TCSVT.2019.2947482 (2019).
Mohsin, M. Y. et al. Challenges and applications of graph signal processing. Int. J. Electr. Eng. Emerg. Technol. 5, 08–15 (2022).
Yao, D. et al. Adaptive homophily clustering: Structure homophily graph learning with adaptive filter for hyperspectral image. IEEE Trans. Geosci. Remote Sens. 63, 1–13. https://doi.org/10.1109/TGRS.2025.3556276 (2025).
Liu, X. et al. Automatic assessment of Chinese dysarthria using audio-visual vowel graph attention network. IEEE Trans. Audio Speech Lang. Process. https://doi.org/10.1109/TASLPRO.2025.3546562 (2025).
Wu, Z. et al. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32, 4–24 (2020).
Zhu, X. et al. RMER-DT: Robust multimodal emotion recognition in conversational contexts based on diffusion and transformers. Inf. Fusion 123, 103268 (2025).
Xiang, J., Zhu, X. & Cambria, E. Integrating audio–visual text generation with contrastive learning for enhanced multimodal emotion analysis. Inf. Fusion 127, 103809 (2026).
Wang, R. et al. RAFT: Robust adversarial fusion transformer for multimodal sentiment analysis. Array 27, 100445 (2025).
Zhu, X. et al. A client–server based recognition system: Non-contact single/multiple emotional and behavioral state assessment methods. Comput. Methods Progr. Biomed. 260, 108564 (2025).
Wang, R. et al. CIME: Contextual interaction-based multimodal emotion analysis with enhanced semantic information. IEEE Trans. Comput. Soc. Syst. https://doi.org/10.1109/TCSS.2025.3572495 (2025).
Wang, R. et al. Contrastive-based removal of negative information in multimodal emotion analysis. Cogn. Comput. 17, 107. https://doi.org/10.1007/s12559-025-10463-9 (2025).
Chen, J. et al. DNLN: Image super-resolution with Deformable Non-Local attention and Multi-Branch Weighted Feature Fusion. Image Vis. Comput. 162, 105721 (2025).
Zhang, Y., Chen, H., Rida, I. & Zhu, X. A generative random modality dropout framework for robust multimodal emotion recognition. IEEE Intell. Syst. 40, 62–69 (2025).
Zhu, X. et al. EMVAS: End-to-end multimodal emotion visualization analysis system. Complex Intell. Syst. 11, 374 (2025).
Zhu, X. et al. A review of key technologies for emotion analysis using multimodal information. Cogn. Comput. 16, 1504–1530 (2024).
Wang, R. et al. Multi-modal emotion recognition using tensor decomposition fusion and self-supervised multi-tasking. Int. J. Multimed. Inf. Retrieval 13, 39 (2024).
Zhu, X., Huang, Y., Wang, X. & Wang, R. Emotion recognition based on brain-like multimodal hierarchical perception. Multimedia Tools Appl. 83, 56039–56057 (2024).
Vu, H.-T. et al. Label-representative graph convolutional network for multi-label text classification. Appl. Intell. 53, 1–16. https://doi.org/10.1007/s10489-022-04106-x (2022).
Zhou, H., Qian, Z., Li, P. & Zhu, Q. Graph attention network with cross-modal interaction for rumor detection. In Proc. of the International Joint Conference on Neural Networks (IJCNN), 1–8, https://doi.org/10.1109/IJCNN60899.2024.10650542 (2024).
Zhang, J., Wu, G., Bi, X. & Cui, Y. Video summarization generation network based on dynamic graph contrastive learning and feature fusion. Electronics 13, 2039. https://doi.org/10.3390/electronics13112039 (2024).
Desmarais, J., Klassen, R., Patel, E. & Chaudhry, T. Dmflc: Short video classification based on deep multimodal feature fusion and low rank representation, https://doi.org/10.21203/rs.3.rs-2662848/v1 (2023).
Wu, Z. et al. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32, 4–24. https://doi.org/10.1109/TNNLS.2020.2978386 (2021).
Yu, Q. et al. Bitmulv: Bidirectional-decoding based transformer with multi-view visual representation. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), 735–748, https://doi.org/10.1007/978-3-031-18907-4_57 (Springer, 2022).
Fang, P., Chen, Z. & Xue, H. On inferring prototypes for multi-label few-shot learning via partial aggregation. Pattern Recogn. 164, 111482 (2025).
CMU-MOSEI Dataset. https://github.com/CMU-MultiComp-Lab/CMU-MultimodalSDK (Accessed July 2025).
Tian, Y., Shi, J., Li, B., Duan, Z. & Xu, C. Audio-visual event localization in unconstrained videos. In Computer Vision – ECCV 2018 (eds Ferrari, V. et al.) 252–268 (Springer International Publishing, 2018).
MM-IMDb (Multimodal IMDb Dataset). https://www.innovatiana.com/en/datasets/mm-imdb-multimodal-imdb-dataset (Accessed July 2025).
Liu, Z. et al. Ensemble Pretrained Models for Multimodal Sentiment Analysis using Textual and Video Data Fusion. In Companion Proceedings of the ACM Web Conference 2024 (WWW ’24), 1841–1848 (2024).
Xie, Z., Yang, Y., Wang, J., Liu, X. & Li, X. Trustworthy multimodal fusion for sentiment analysis in ordinal sentiment space. IEEE Trans. Cir. Sys. for Video Technol. 34, 7657–7670 (2024).
Li, Y., Zhu, R. & Li, W. CorMulT: A semi-supervised modality correlation-aware multimodal transformer for sentiment analysis. IEEE Trans. Affect. Comput. 16, 2321–2333 (2025).
Mai, S., Zeng, Y. & Hu, H. Learning by comparing: Boosting multimodal affective computing through ordinal learning. In Proc. of the ACM on Web Conference 2025 (WWW ’25), 2120–2134 (2025).
Huo, F., Xu, W., Guo, J., Wang, H. & Guo, S. C2KD: Bridging the Modality Gap for Cross-Modal Knowledge Distillation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16006–16015, https://doi.org/10.1109/CVPR52733.2024.01515 (2024).
Chalk, J., Huh, J., Kazakos, E., Zisserman, A. & Damen, D. TIM: A Time Interval Machine for Audio-Visual Action Recognition. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18153–18163, https://doi.org/10.1109/CVPR52733.2024.01719 (2024).
Xie, Z. et al. Segment-level event perception with semantic dictionary for weakly supervised audio-visual video parsing. Knowl.-Based Syst. 310, 112884 (2025).
Zhang, J. et al. Enhancing semantic audio-visual representation learning with supervised multi-scale attention. Pattern Anal. Appl. 28, 40 (2025).
Li, J. et al. Incorporating Domain Knowledge Graph into Multimodal Movie Genre Classification with Self-Supervised Attention and Contrastive Learning. In Proc. of the 31st ACM International Conference on Multimedia (MM ’23), 3337–3345 (2023).
Li, Y., Quan, R., Zhu, L. & Yang, Y. Efficient multimodal fusion via interactive prompting. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2604–2613 (2023).
Ak, K. E., Lee, G.-G., Xu, Y. & Shen, M. Leveraging efficient training and feature fusion in transformers for multimodal classification. In 2023 IEEE International Conference on Image Processing (ICIP), 1420–1424, https://doi.org/10.1109/ICIP49359.2023.10223098 (2023).
Guerra-Manzanares, A. & Shamout, F. E. MILES: Modality-informed learning rate scheduler for balancing multimodal learning. In 2025 International Joint Conference on Neural Networks (IJCNN), 1–9, https://doi.org/10.1109/IJCNN64981.2025.11228348 (2025).
Acknowledgements
This paper was edited for grammar using “Grammarly”. The authors thank the associate editor and anonymous reviewers for insightful comments that significantly improved the technical quality and presentation of the manuscript.
Funding
Open access funding provided by Manipal Academy of Higher Education, Manipal, India.
Author information
Authors and Affiliations
Contributions
KVK: Conceptualization, Formal analysis, Supervision, Resources, Validation, Writing – original draft, Writing – review & editing; ASR: Investigation, Resources, Formal analysis, Writing – original draft, Writing – review & editing; AKD: Supervision, Validation, Investigation, Writing – original draft, Writing – review & editing; VBK: Supervision, Validation, Investigation, Funding acquisition; SP: Supervision, Validation, Investigation, Writing – original draft, Writing – review & editing.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
The research does not involve any Human Participants and/or Animals.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Karthikeya, K.V., Rajasekaran, A.S., Das, A.K. et al. Adaptive graph signal processing for robust multimodal fusion with dynamic semantic alignment. Sci Rep (2026). https://doi.org/10.1038/s41598-026-44641-y
DOI: https://doi.org/10.1038/s41598-026-44641-y