Fig. 1
From: Multimodal GRU with directed pairwise cross-modal attention for sentiment analysis

The architecture of the MulG model consists of four main components: feature extraction, cross-modal processing, feature fusion, and residual connections. The model processes input sequences from text, audio, and image modalities to perform sentiment prediction.
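The data flow named in the caption can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The caption does not specify layer details, so the following are assumptions: each modality is already encoded as a sequence of equal-width vectors (the GRU feature extractors are omitted), attention is plain scaled dot-product without learned projections, each directed pair adds its attended output back to the target sequence as a residual, and fusion is mean pooling followed by concatenation. The names `cross_attend`, `mean_pool`, and `mulg_like_forward` are hypothetical.

```python
import math
from itertools import permutations


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(target, source):
    """Directed attention: queries come from `target`, keys/values from `source`.

    Assumption: no learned projections; both sequences share the same width d.
    """
    d = len(target[0])
    out = []
    for q in target:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in source]
        weights = softmax(scores)
        attended = [sum(w * v[i] for w, v in zip(weights, source)) for i in range(d)]
        # Residual connection: keep the target's own features alongside the attended ones.
        out.append([x + y for x, y in zip(q, attended)])
    return out


def mean_pool(seq):
    """Average a sequence of vectors over time."""
    n = len(seq)
    return [sum(step[i] for step in seq) / n for i in range(len(seq[0]))]


def mulg_like_forward(features):
    """Run every ordered (target, source) modality pair and concatenate the results."""
    fused = []
    for tgt, src in permutations(features, 2):
        enriched = cross_attend(features[tgt], features[src])
        fused.extend(mean_pool(enriched))
    return fused


# Toy inputs standing in for extracted features (made-up values, width 2).
features = {
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "audio": [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]],
    "image": [[0.2, 0.8]],
}
fused = mulg_like_forward(features)
print(len(fused))  # 12: six directed pairs, two dimensions each
```

Three modalities give six ordered pairs, so text attending to audio is computed separately from audio attending to text. In a full model the fused vector would feed a sentiment prediction head, which is left out here.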