Abstract
In this paper, we introduce an Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP-DSA) framework for robust multimodal data fusion across heterogeneous sources, including text, audio, and images. The proposed approach uses a dual-graph construction to learn both intra-modal and inter-modal relations, spectral graph filtering to enhance informative signals, and multi-scale graph convolutional networks to produce effective node embeddings. A semantic-aware attention mechanism allows each modality to contribute dynamically according to its contextual relevance. Experimental results on three benchmark datasets, the Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, the Audio-Visual Event (AVE) dataset, and the MultiModal Internet Movie Database (MM-IMDb) dataset, show that AGSP-DSA achieves state-of-the-art performance. Specifically, it attains 95.3% accuracy, a 93.6% F1 score (the harmonic mean of precision and recall), and 92.4% mean average precision on CMU-MOSEI, improving over the MultiModal Graph Neural Network by 2.6% in accuracy. It further achieves 93.4% accuracy and a 91.1% F1 score on the AVE dataset, and 91.8% accuracy and an 88.6% F1 score on the MM-IMDb dataset, demonstrating strong generalization and robustness under missing-modality settings. These findings confirm the effectiveness of the proposed AGSP-DSA in advancing multimodal learning for sentiment analysis, event recognition, and multimedia classification.
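For illustration only, the following minimal sketch shows how two of the components named above, a polynomial spectral filter over a per-modality graph and a semantic-aware attention over modality embeddings, could be combined. This is not the authors' implementation: the module names, the polynomial filter form, the attention scoring network, and all dimensions are assumptions introduced here for clarity.

```python
# Minimal, illustrative sketch (PyTorch). All names and hyperparameters are
# assumptions for exposition, not the AGSP-DSA implementation from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


def normalized_laplacian(adj: torch.Tensor) -> torch.Tensor:
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(dim=-1).clamp(min=1e-6)
    d_inv_sqrt = deg.pow(-0.5)
    norm_adj = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)
    return torch.eye(adj.size(0)) - norm_adj


class SpectralFilterGCN(nn.Module):
    """Low-order polynomial spectral filter (assumed form) over one modality graph."""

    def __init__(self, in_dim: int, out_dim: int, order: int = 2):
        super().__init__()
        self.order = order
        self.lin = nn.Linear(in_dim * (order + 1), out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        lap = normalized_laplacian(adj)
        feats, h = [x], x
        for _ in range(self.order):      # stack L^k X as multi-scale features
            h = lap @ h
            feats.append(h)
        return F.relu(self.lin(torch.cat(feats, dim=-1)))


class SemanticAttentionFusion(nn.Module):
    """Weights each modality's node embeddings by a learned relevance score."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, modality_embs: list) -> torch.Tensor:
        stacked = torch.stack(modality_embs, dim=0)           # (M, N, dim)
        weights = torch.softmax(self.score(stacked), dim=0)   # per-modality weights
        return (weights * stacked).sum(dim=0)                 # fused node embeddings


if __name__ == "__main__":
    # Toy usage: 3 modalities, 10 nodes, 16-dim input features.
    gcn = SpectralFilterGCN(16, 32)
    fuse = SemanticAttentionFusion(32)
    embs = [gcn(torch.randn(10, 16), torch.rand(10, 10)) for _ in range(3)]
    print(fuse(embs).shape)  # torch.Size([10, 32])
```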
Data availability
The data that support the findings of this study are available upon request from the corresponding author.
References
Baltrušaitis, T., Ahuja, C. & Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 423–443. https://doi.org/10.1109/TPAMI.2018.2798607 (2019).
Atrey, P., Hossain, M., Saddik, A. E. & Kankanhalli, M. Multimodal fusion for multimedia analysis: A survey. Multimedia Syst. 16, 345–379. https://doi.org/10.1007/s00530-010-0182-0 (2010).
Shuman, D. I., Narang, S. K., Frossard, P., Ortega, A. & Vandergheynst, P. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 30, 83–98. https://doi.org/10.1109/MSP.2012.2235192 (2013).
Georgiou, E., Papaioannou, C. & Potamianos, A. Deep hierarchical fusion with application in sentiment analysis. In Proc. of Interspeech, 1646–1650, https://doi.org/10.21437/Interspeech.2019-3243 (2019).
Jiang, B., Zhang, Z., Lin, D., Tang, J. & Luo, B. Semi-supervised learning with graph learning-convolutional networks. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11313–11320 (2019).
You, R. et al. Cross-modality attention with semantic graph embedding for multi-label classification. In Proc. of the AAAI Conference on Artificial Intelligence 34, 12709–12716. https://doi.org/10.1609/aaai.v34i07.6964 (2020).
Xi, X., Chow, C.-O., Chuah, J. H. & Kanesan, J. Cross-modal semantic relations enhancement with graph attention network for image-text matching. IEEE Access 13, 46124–46135. https://doi.org/10.1109/ACCESS.2025.3549781 (2025).
Zhang, L., Jiang, Y., Yang, W. & Liu, B. Tctfusion: A triple-branch cross-modal transformer for adaptive infrared and visible image fusion. Electronics 14, 731. https://doi.org/10.3390/electronics14040731 (2025).
Hu, X. & Yamamura, M. Global local fusion neural network for multimodal sentiment analysis. Appl. Sci. 12, 8453. https://doi.org/10.3390/app12178453 (2022).
Liu, X., Li, S. & Wang, M. Hierarchical attention-based multimodal fusion network for video emotion recognition. Comput. Intell. Neurosci. 2021, 5585041. https://doi.org/10.1155/2021/5585041 (2021).
He, L., Wang, Z., Wang, L. & Li, F. Multimodal mutual attention-based sentiment analysis framework adapted to complicated contexts. IEEE Trans. Circuits Syst. Video Technol. 33, 7131–7143. https://doi.org/10.1109/TCSVT.2023.3276075 (2023).
Yu, J., Li, J., Yu, Z. & Huang, Q. Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans. Circuits Syst. Video Technol. 30, 4467–4480. https://doi.org/10.1109/TCSVT.2019.2947482 (2019).
Mohsin, M. Y. et al. Challenges and applications of graph signal processing. Int. J. Electr. Eng. Emerg. Technol. 5, 08–15 (2022).
Yao, D. et al. Adaptive homophily clustering: Structure homophily graph learning with adaptive filter for hyperspectral image. IEEE Trans. Geosci. Remote Sens. 63, 1–13. https://doi.org/10.1109/TGRS.2025.3556276 (2025).
Liu, X. et al. Automatic assessment of Chinese dysarthria using audio-visual vowel graph attention network. IEEE Trans. Audio Speech Lang. Process. https://doi.org/10.1109/TASLPRO.2025.3546562 (2025).
Wu, Z. et al. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32, 4–24 (2020).
Zhu, X. et al. RMER-DT: Robust multimodal emotion recognition in conversational contexts based on diffusion and transformers. Inf. Fusion 123, 103268 (2025).
Xiang, J., Zhu, X. & Cambria, E. Integrating audio–visual text generation with contrastive learning for enhanced multimodal emotion analysis. Inf. Fusion 127, 103809 (2026).
Wang, R. et al. RAFT: Robust adversarial fusion transformer for multimodal sentiment analysis. Array 27, 100445 (2025).
Zhu, X. et al. A client–server based recognition system: Non-contact single/multiple emotional and behavioral state assessment methods. Comput. Methods Progr. Biomed. 260, 108564 (2025).
Wang, R. et al. CIME: Contextual interaction-based multimodal emotion analysis with enhanced semantic information. IEEE Trans. Comput. Soc. Syst. https://doi.org/10.1109/TCSS.2025.3572495 (2025).
Wang, R. et al. Contrastive-based removal of negative information in multimodal emotion analysis. Cogn. Comput. 17, 107. https://doi.org/10.1007/s12559-025-10463-9 (2025).
Chen, J. et al. DNLN: Image super-resolution with Deformable Non-Local attention and Multi-Branch Weighted Feature Fusion. Image Vis. Comput. 162, 105721 (2025).
Zhang, Y., Chen, H., Rida, I. & Zhu, X. A generative random modality dropout framework for robust multimodal emotion recognition. IEEE Intell. Syst. 40, 62–69 (2025).
Zhu, X. et al. EMVAS: End-to-end multimodal emotion visualization analysis system. Complex Intell. Syst. 11, 374 (2025).
Zhu, X. et al. A review of key technologies for emotion analysis using multimodal information. Cogn. Comput. 16, 1504–1530 (2024).
Wang, R. et al. Multi-modal emotion recognition using tensor decomposition fusion and self-supervised multi-tasking. Int. J. Multimed. Inf. Retrieval 13, 39 (2024).
Zhu, X., Huang, Y., Wang, X. & Wang, R. Emotion recognition based on brain-like multimodal hierarchical perception. Multimedia Tools Appl. 83, 56039–56057 (2024).
Vu, H.-T. et al. Label-representative graph convolutional network for multi-label text classification. Appl. Intell. 53, 1–16. https://doi.org/10.1007/s10489-022-04106-x (2022).
Zhou, H., Qian, Z., Li, P. & Zhu, Q. Graph attention network with cross-modal interaction for rumor detection. In Proc. of the International Joint Conference on Neural Networks (IJCNN), 1–8, https://doi.org/10.1109/IJCNN60899.2024.10650542 (2024).
Zhang, J., Wu, G., Bi, X. & Cui, Y. Video summarization generation network based on dynamic graph contrastive learning and feature fusion. Electronics 13, 2039. https://doi.org/10.3390/electronics13112039 (2024).
Desmarais, J., Klassen, R., Patel, E. & Chaudhry, T. Dmflc: Short video classification based on deep multimodal feature fusion and low rank representation, https://doi.org/10.21203/rs.3.rs-2662848/v1 (2023).
Wu, Z. et al. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32, 4–24. https://doi.org/10.1109/TNNLS.2020.2978386 (2021).
Yu, Q. et al. Bitmulv: Bidirectional-decoding based transformer with multi-view visual representation. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), 735–748, https://doi.org/10.1007/978-3-031-18907-4_57 (Springer, 2022).
Fang, P., Chen, Z. & Xue, H. On inferring prototypes for multi-label few-shot learning via partial aggregation. Pattern Recogn. 164, 111482 (2025).
CMU-MOSEI Dataset. https://github.com/CMU-MultiComp-Lab/CMU-MultimodalSDK (Accessed July 2025).
Tian, Y., Shi, J., Li, B., Duan, Z. & Xu, C. Audio-visual event localization in unconstrained videos. In Computer Vision – ECCV 2018 (eds Ferrari, V. et al.) 252–268 (Springer International Publishing, 2018).
MM-IMDb (Multimodal IMDb Dataset). https://www.innovatiana.com/en/datasets/mm-imdb-multimodal-imdb-dataset (Accessed July 2025).
Liu, Z. et al. Ensemble Pretrained Models for Multimodal Sentiment Analysis using Textual and Video Data Fusion. In Companion Proceedings of the ACM Web Conference 2024 (WWW ’24), 1841–1848 (2024).
Xie, Z., Yang, Y., Wang, J., Liu, X. & Li, X. Trustworthy multimodal fusion for sentiment analysis in ordinal sentiment space. IEEE Trans. Cir. Sys. for Video Technol. 34, 7657–7670 (2024).
Li, Y., Zhu, R. & Li, W. CorMulT: A semi-supervised modality correlation-aware multimodal transformer for sentiment analysis. IEEE Trans. Affect. Comput. 16, 2321–2333 (2025).
Mai, S., Zeng, Y. & Hu, H. Learning by comparing: Boosting multimodal affective computing through ordinal learning. In Proc. of the ACM on Web Conference 2025 (WWW ’25), 2120–2134 (2025).
Huo, F., Xu, W., Guo, J., Wang, H. & Guo, S. C2KD: Bridging the Modality Gap for Cross-Modal Knowledge Distillation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16006–16015, https://doi.org/10.1109/CVPR52733.2024.01515 (2024).
Chalk, J., Huh, J., Kazakos, E., Zisserman, A. & Damen, D. TIM: A Time Interval Machine for Audio-Visual Action Recognition. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18153–18163, https://doi.org/10.1109/CVPR52733.2024.01719 (2024).
Xie, Z. et al. Segment-level event perception with semantic dictionary for weakly supervised audio-visual video parsing. Knowl.-Based Syst. 310, 112884 (2025).
Zhang, J. et al. Enhancing semantic audio-visual representation learning with supervised multi-scale attention. Pattern Anal. Appl. 28, 40 (2025).
Li, J. et al. Incorporating Domain Knowledge Graph into Multimodal Movie Genre Classification with Self-Supervised Attention and Contrastive Learning. In Proc. of the 31st ACM International Conference on Multimedia (MM ’23), 3337–3345 (2023).
Li, Y., Quan, R., Zhu, L. & Yang, Y. Efficient multimodal fusion via interactive prompting. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2604–2613 (2023).
Ak, K. E., Lee, G.-G., Xu, Y. & Shen, M. Leveraging efficient training and feature fusion in transformers for multimodal classification. In 2023 IEEE International Conference on Image Processing (ICIP), 1420–1424, https://doi.org/10.1109/ICIP49359.2023.10223098 (2023).
Guerra-Manzanares, A. & Shamout, F. E. MILES: Modality-informed learning rate scheduler for balancing multimodal learning. In 2025 International Joint Conference on Neural Networks (IJCNN), 1–9, https://doi.org/10.1109/IJCNN64981.2025.11228348 (2025).
Acknowledgements
This paper was edited for grammar using “Grammarly”. The authors thank the associate editor and anonymous reviewers for insightful comments that significantly improved the technical quality and presentation of the manuscript.
Funding
Open access funding provided by Manipal Academy of Higher Education, Manipal, India.
Author information
Authors and Affiliations
Contributions
KVK: Conceptualization, Formal analysis, Supervision, Resources, Validation, Writing – original draft, Writing – review & editing; ASR: Investigation, Resources, Formal analysis, Writing – original draft, Writing – review & editing; AKD: Supervision, Validation, Investigation, Writing – original draft, Writing – review & editing; VBK: Supervision, Validation, Investigation, Funding acquisition; SP: Supervision, Validation, Investigation, Writing – original draft, Writing – review & editing.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
The research does not involve any Human Participants and/or Animals.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Karthikeya, K.V., Rajasekaran, A.S., Das, A.K. et al. Adaptive graph signal processing for robust multimodal fusion with dynamic semantic alignment. Sci Rep (2026). https://doi.org/10.1038/s41598-026-44641-y
DOI: https://doi.org/10.1038/s41598-026-44641-y