Fig. 4: Overall framework.

The model comprises three parts: a Feature extraction: the paper-cut images and the textual information are embedded as raw features via the CLIP encoder, and these features serve as the input to CFMA-Net. b Feature fusion: CFMA-Net completes modal fusion using techniques such as cross multi-head attention and residual connections; its output is the class prototypes. c Similarity calculation: the similarity between the input paper-cut image and each class prototype is computed to obtain the predicted category.
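A minimal sketch of the three stages is given below, assuming 512-dimensional CLIP embeddings. The module name `CrossModalFusion`, the head count, and the prototype construction (per-class averaging of fused features) are illustrative assumptions, not the paper's actual CFMA-Net implementation; random tensors stand in for real CLIP image/text features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Cross multi-head attention with a residual connection (hedged sketch)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat, txt_feat):
        # Image tokens query the text tokens; the residual keeps the image signal.
        fused, _ = self.attn(img_feat, txt_feat, txt_feat)
        return self.norm(img_feat + fused).mean(dim=1)  # pool to one vector

# a) Feature extraction: stand-ins for CLIP embeddings (dim 512 assumed).
support_img = torch.randn(5, 3, 1, 512)  # 5 classes x 3 support images x 1 token
class_text  = torch.randn(5, 1, 512)     # one CLIP text embedding per class

# b) Feature fusion: fuse each support image with its class text embedding,
#    then average within a class to obtain that class's prototype.
fusion = CrossModalFusion()
fused = torch.stack([
    fusion(support_img[c], class_text[c].expand(3, 1, 512))
    for c in range(5)
])                                  # (5, 3, 512)
prototypes = fused.mean(dim=1)      # (5, 512): one prototype per class

# c) Similarity calculation: cosine similarity of a query image embedding
#    to every class prototype; the most similar prototype is the prediction.
query = torch.randn(1, 512)
sims = F.cosine_similarity(query, prototypes)   # (5,)
print("predicted class:", sims.argmax().item())
```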