Introduction

The rapid development of the Internet has ushered in an era of information explosion, with social media platforms such as Weibo, Douyin, and WeChat public accounts enriching everyday life and contributing vast amounts of data to the digital landscape. This surge in data, together with advances in computing hardware, has carried artificial intelligence (AI) from theoretical exploration into practical industrial application. AI and big data are now deeply integrated into daily life, influencing everything from basic needs to complex decision-making. Extracting valuable insights from this massive influx of data has therefore become essential, underscoring the importance of effective data mining and utilization techniques.

Amidst the myriad of information available online, ensuring the security and integrity of network content has become a focus for cybersecurity practitioners. Public opinion detection technology, in particular, plays a crucial role by monitoring and predicting public comments and views on the Internet, enabling a quick grasp of public opinion dynamics around trending events. Sentiment analysis, a core component of public opinion detection, infers emotional tendencies from public comments, providing valuable insights for public opinion analysis. However, satire, a literary and artistic form characterized by exaggeration, sarcasm, and humor, poses a significant challenge to sentiment analysis. Satirical content often disguises negative meaning behind seemingly positive expressions, leading to misjudgments in sentiment analysis and degrading the accuracy of public opinion assessments. This, in turn, can create risks for enterprises and governments.

Recognizing the need for accurate sarcasm detection, researchers have made strides in developing AI-based sarcasm detection technologies1. Early techniques focused on a single sensory channel such as facial movements, voice, or text, overlooking the multidimensional nature of emotion and failing to leverage the combined benefits of multi-sensory data2. With the proliferation of social media networks and advancements in computing power, multimedia content on the Internet has expanded rapidly, offering diverse ways for individuals to express emotions. Consequently, integrating multi-modal data, including text, sound, image, and video, has become increasingly important for effective sarcasm recognition.

Despite progress in multi-modal sarcasm detection, challenges remain, particularly in effectively training models with limited samples, especially for languages such as Chinese3,4. Addressing these challenges, this paper focuses on multi-modal sarcasm detection in small-sample scenarios. We propose leveraging prompt learning methods and the powerful generalization capabilities of pre-trained models to develop a robust multi-modal sarcasm detection model. This model aims to provide a practical solution for accurately identifying sarcastic content under data scarcity, thereby enhancing the reliability of sentiment analysis and public opinion detection across diverse linguistic and cultural contexts.

The symmetry in our approach is evident in the balanced integration of the two modalities, text and image, each contributing equally to the detection process. This symmetric co-learning of sarcasm cues enhances the model's performance, ensuring that neither modality dominates and that both are utilized harmoniously for accurate sarcasm detection. The main contributions of this work are summarized as follows:

  • We introduce a multi-modal sarcasm detection model that combines text and image modalities within the CLIP architecture to improve accuracy in detecting sarcasm.

  • We propose a discrete prompt generation method to expand sarcastic and non-sarcastic prompt lexicons, addressing data scarcity in low-resource languages.

  • Experiments show significant performance improvements, validating the model’s effectiveness in low-resource scenarios and across diverse linguistic and cultural contexts.

Related work

This section reviews the development of uni-modal and multi-modal sarcasm detection, and then motivates the research in this paper by analyzing the current state of sarcasm detection datasets.

Single-modal sarcasm detection

For uni-modal sarcasm detection, researchers have mainly proposed methods for detecting sarcasm in text on text-only datasets. Earlier methods extracted elaborate discrete features5,6,7 from text, including n-gram overlaps, punctuation, part-of-speech labels, sentiment, and emoticons.

With the development of artificial intelligence, researchers began applying deep learning to capture more precise semantics in sarcastic text. An early deep learning model8 employs a CNN to model relationships between adjacent semantics, then a multi-layer RNN to model the temporal structure of the text, supplemented by contextual features such as the historical behavior of the author and the audience.

In addition, artificially designed author, audience, and response features have been employed to enhance sarcasm detection1. After obtaining text embeddings, contextual features were derived through a series of rules, and the multiple features were connected using a Bidirectional Gated Recurrent Unit (Bi-GRU)9; these extra features improve performance over models relying entirely on text embeddings. Other work focused on user information, generating trainable user embeddings and employing a CNN to model user characteristics, thereby improving sarcasm detection10.

Further work extended the idea of classifying data by multiple features, modeling syntactic and emotional features alongside text embeddings and integrating them into a Long Short-Term Memory (LSTM) network for multi-task learning11,12. On the model-design side, a new multidimensional attention mechanism was proposed to explicitly model inconsistencies in text content2, and two bi-directional LSTMs were used to model text features at word and character granularity, further revealing the complex syntactic and semantic relationships in sarcastic text13.

Although unimodal methods have contributed to some degree of improvement in sarcasm detection, they are inherently limited by the reliance on a single modality. Sarcasm, however, is a highly nuanced and multidimensional phenomenon, often encompassing various elements such as linguistic cues, sentiment, contextual information, and non-verbal cues, which cannot be fully captured by a single modality alone14. This complexity makes it difficult for unimodal approaches to accurately identify sarcasm in all its forms, as they may overlook crucial cues present in other modalities. Consequently, there has been a growing shift in the research community towards multimodal sarcasm detection. By integrating data from multiple modalities, including text, images, and audio, multimodal approaches offer the potential to improve detection accuracy and robustness, as they enable a more holistic understanding of the sarcasm expression and its contextual subtleties15,16.

Multi-modal sarcasm detection

Since multi-modal data was found to effectively enhance sarcasm detection accuracy, various multi-modal sarcasm detection methods have been proposed. In 2019, a Twitter-based Multi-modal Sarcasm Detection Dataset (MSD) was first created, together with a Hierarchical Fusion Model (HFM)17. The model first obtains image features and image attribute features; attribute features and a bidirectional LSTM network are then used to extract text features; finally, the features of the three modalities are reconstructed and fused into a single feature vector for prediction.

Subsequently, modeling the inconsistency between modalities was found to be key to multi-modal sarcasm detection, leading to the Attribute BERT (Attr-BERT) model18. Inter-modal attention captures inconsistencies between modalities, and a co-attention mechanism models contradictions within the text. Results show that this model effectively improves multi-modal sarcasm detection. Decomposition and relation networks were also built to model cross-modal contrast and semantic association19: the decomposition network represents the commonality and difference between images and texts, while the relation network models semantic associations in a cross-modal context.

Multi-modal sarcasm detection has also been studied from a graph perspective, with in-modal and cross-modal graphs (InCrossMGS)20. Heterogeneous in-modal and cross-modal graphs are constructed for each multi-modal example to enable modality-specific and cross-modal feature learning, and an interactive graph convolution network jointly learns the inconsistencies between the two kinds of graphs to identify important sarcasm cues. Experimental results show that the model achieves state-of-the-art performance in multi-modal sarcasm detection.

A deep learning model with hierarchical fusion was proposed to address missing sentiment knowledge and sparse feature distributions across modalities in multi-modal sarcasm detection4. The method introduces the sentiment knowledge base SenticNet21 to mark the sentiment polarity of the input and models inter-modal dependence through cross-modal interaction; experiments show that it significantly improves sarcasm detection by effectively utilizing multi-modal information. Further, a lightweight multi-perspective interaction model was proposed to enrich the semantic information of the image and text modalities by integrating visual common-sense knowledge, targeting the large parameter counts of sarcasm detection models on Internet of Things devices22. Experiments show that, with far fewer parameters, its performance matches the multi-modal baselines on the multi-modal sarcasm detection task.

Farabi et al.23 provides a comprehensive review of the field, summarizing the latest advancements, technical challenges, and future research directions, thus offering valuable insights for researchers. This work not only consolidates existing achievements but also outlines key directions for future studies. In the domain of deep learning and multimodal data processing, Lu et al.24 introduces a novel approach by constructing a combination network that captures the incongruity between facts and sentiment, effectively addressing the complexity of sarcasm through deep fusion of text and image information. Li et al.25 further enhances sarcasm detection by utilizing inter-modality inconsistency and leveraging attention mechanisms to improve context sensitivity, resulting in higher detection accuracy. This method emphasizes the critical role of context in identifying sarcasm in multimodal data. In the area of knowledge-enhanced detection, Yue et al.26 integrates knowledge graphs with neural networks, enriching semantic understanding and improving the model’s ability to detect subtle sarcasm. Similarly, Ren et al.27 applies knowledge augmentation to neural networks, significantly enhancing sarcasm recognition by incorporating domain-specific knowledge. Finally, Kei et al.28 focuses on social media sarcasm detection using attention-based neural networks, providing detailed analysis of sarcastic language structure and offering a powerful tool for automatic detection. Together, these studies highlight key developments in multimodal sarcasm detection and propose new methods using advanced machine learning techniques, knowledge fusion, and attention mechanisms, paving the way for more accurate and robust sarcasm recognition systems.

Limitations of existing methods

Limitations of single-modal solutions

For unimodal solutions such as CNN, RNN, and Bi-GRU, inherent limitations emerge when handling sarcasm detection tasks. CNN primarily focuses on local feature extraction, which limits its ability to capture the global semantics and contextual information crucial for identifying sarcasm in text. Although RNNs can process sequential data and capture temporal dependencies, they are prone to issues such as vanishing or exploding gradients, which restrict their capacity to learn long-term dependencies. Bi-GRU, by combining both forward and backward information, enhances contextual understanding; however, these models remain constrained by unimodal data, failing to leverage complementary information from other modalities such as images or audio. This is particularly problematic in sarcasm detection, where cross-modal interactions are often central to identifying sarcastic content.

Limitations of multi-modal solutions

Regarding multimodal solutions, while they offer a more comprehensive perspective by integrating data from multiple modalities, they also face several challenges. One significant issue is the alignment of multimodal data, as discrepancies may exist in time or space between different modalities, complicating the model’s ability to effectively process them. Additionally, multimodal models tend to have higher computational complexity, as they must process and fuse data from multiple sources, increasing both the training and computational costs. The interpretability of multimodal models is also a concern, as their decision-making processes often lack transparency, making it difficult to understand the underlying mechanisms driving their predictions. Moreover, the scarcity of high-quality multimodal data and the high cost of manual annotation pose significant barriers to the widespread development and application of such models.

Datasets

Table 1 Chinese and English satire data set introduction.

As shown in Table 1, these datasets are usually drawn from social media and online platforms such as Amazon, Twitter, and Reddit: users of these platforms tend to express emotion through sarcasm, and hashtags make sarcastic data easy to collect, providing rich examples for sarcasm detection. However, research still faces several limitations. First, current datasets are mainly in English and lack samples from other languages; since the use and understanding of sarcasm differ across cultures, direct translation does not transfer well to other language environments, limiting model generalization and application in languages such as Chinese. Second, existing datasets focus on textual information, whereas sarcastic expression often relies on additional modalities. Moreover, existing work mainly targets models designed for datasets on the order of a thousand samples or more; because these models are highly complex, they require large amounts of data to learn effectively and cannot be trained well when data is limited. It is therefore necessary to study methods that can learn the multi-modal sarcasm detection task effectively from a small number of samples. Finally, sarcasm detection is closely related to sentiment analysis, yet most previous work does not exploit sentiment knowledge. Although Liu et al.4 introduced sentiment knowledge into their model, the effect was limited by the performance of the external sentiment analysis model, so it is also necessary to study how to use sentiment data directly, without relying on external models, to assist sarcasm detection.

Table 2 summarizes the single- and multi-modal methods from the related work, showing the different approaches, the datasets they use, and their limitations.

Table 2 Comparison of unimodal and multimodal sarcasm detection methods.

Model

Overview

Fig. 1 Prompt learning method based on the CLIP model.

To address the problem of limited sarcasm detection data in some languages, this paper designs a prompt learning method based on a multi-modal pre-trained model, tailored to the input format of multi-modal sarcasm detection data. By co-learning prompts from image and text data, the method effectively mines the sarcasm knowledge embedded in the multi-modal pre-trained model, so that a model with high sarcasm detection performance can be trained with limited data.

The model consists of four main parts: text prompt vector generation, image prompt vector generation, sarcastic and non-sarcastic prompt vector generation, and multi-modal prompt vector fusion. From the perspective of inference, a discrete prompt generation module first constructs the inputs for the sarcastic and non-sarcastic prompt vectors; the prompt vector generation method then produces the text, image, sarcastic, and non-sarcastic prompt vectors; finally, the multi-modal prompt vector fusion module fuses the text and image modalities into a multi-modal prompt vector37, and the sarcasm probability is computed from this vector together with the sarcastic and non-sarcastic prompt vectors. The model diagram of the prompt learning method based on the CLIP (Contrastive Language-Image Pre-training) model is shown in Fig. 1. The CLIP model serves as the backbone for multi-modal feature extraction and alignment: its shared embedding space for text and images facilitates the fusion of sarcasm-related features from both modalities.

Generation of discrete prompts

To construct sarcastic and non-sarcastic prompt vectors, the usual practice is to feed "This is a satirical tweet" and "This is not a satirical tweet" directly into the CLIP pre-trained model38. This approach, however, over-relies on the pre-trained model's sensitivity to the exact input text, so model performance is often suboptimal. To improve the effect, more robust methods are needed. This paper therefore proposes a discrete prompt generation method that makes the sarcastic and non-sarcastic input texts more robust. Specifically, three sets are first created: sarcastic discrete prompt templates, non-sarcastic discrete prompt templates, and a sarcasm-word set. The sarcasm-word set contains synonyms of the word sarcasm, such as ridicule, mockery, and teasing, each of which conveys sarcasm in different contexts39,40. Inserting each word in the sarcasm-word set into each sarcastic and non-sarcastic template then generates a large number of sarcastic and non-sarcastic discrete inputs. The generation method is illustrated in Fig. 2.

Fig. 2 Discrete prompt generation method.

Let the sarcastic discrete prompt templates, non-sarcastic discrete prompt templates, and sarcasm-word set be denoted A, B, and C, and the sarcastic and non-sarcastic discrete input sets D and E, respectively, with \(|A| = n_A, |B| = n_B, |C| = n_C\). Then \(|D| = n_A \cdot n_C\) and \(|E| = n_B \cdot n_C\).
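The combinatorial generation of D and E can be sketched as follows. The template strings and sarcasm words here are illustrative placeholders, not the paper's actual lexicons:

```python
from itertools import product

# Hypothetical example sets; the paper's real templates and word list differ.
A = ["This is a {} tweet", "A {} post"]            # sarcastic prompt templates
B = ["This is not a {} tweet", "Not a {} post"]    # non-sarcastic prompt templates
C = ["sarcastic", "mocking", "teasing"]            # sarcasm-word set

# Fill every template with every sarcasm word: |D| = |A|*|C|, |E| = |B|*|C|.
D = [tpl.format(w) for tpl, w in product(A, C)]
E = [tpl.format(w) for tpl, w in product(B, C)]
```

With 2 templates per polarity and 3 sarcasm words, this yields 6 sarcastic and 6 non-sarcastic discrete inputs, matching the \(n_A \cdot n_C\) and \(n_B \cdot n_C\) counts above.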

According to Fig. 2, although the discrete prompt generation method can generate enough input statements, not every statement contributes positively to sarcasm detection; some may carry irrelevant or misleading information that hurts the model's accuracy. Selecting sufficiently good prompts is an open problem in prompt learning. This paper therefore proposes a prompt filtering strategy that selects suitable prompts based on their posterior distribution over the data. The procedure is shown in Fig. 3.

Fig. 3 Discrete prompt screening method.

Suppose the sarcastic discrete input set D contains prompts \(\{D_1, D_2, \ldots \}\) and the non-sarcastic discrete input set E contains prompts \(\{E_1, E_2, \ldots \}\). First, each discrete prompt is fed into the CLIP text encoder to obtain its features \(\{\text {Emb}_{D_1}, \text {Emb}_{D_2}, \ldots \}\). The training data are then divided into four groups:

  • The sarcastic image set \(\{\text {Img}_{s_1}, \text {Img}_{s_2}, \ldots \}\),

  • The non-sarcastic image set \(\{\text {Img}_{n_1}, \text {Img}_{n_2}, \ldots \}\),

  • The sarcastic text set \(\{\text {Text}_{s_1}, \text {Text}_{s_2}, \ldots \}\),

  • The non-sarcastic text set \(\{\text {Text}_{n_1}, \text {Text}_{n_2}, \ldots \}\).

The two image sets are fed into the image encoder to obtain image features, and the two text sets into the text encoder to obtain text features, as computed in Eqs. (1) and (2).

$$\begin{aligned} & \left( \text {Emb}_{D}, \text {Emb}_{E}, \text {Emb}_{T_s}, \text {Emb}_{T_n}\right) = \text {TextTransformer}(D, E, \text {Text}_s, \text {Text}_n) \end{aligned}$$
(1)
$$\begin{aligned} & \left( \text {Emb}_{I_s}, \text {Emb}_{I_n}\right) = \text {ImgTransformer}(\text {Img}_s, \text {Img}_n) \end{aligned}$$
(2)

After obtaining these six features, specific scoring rules are needed to accurately evaluate the applicability of each discrete prompt, so that a suitable subset of discrete prompts can be screened out of the candidates.

$$\begin{aligned} \text {Score}_D= & \left| F(\text {Emb}_D, \text {Emb}_{T_s}) - F(\text {Emb}_D, \text {Emb}_{T_n}) \right| + \left| F(\text {Emb}_D, \text {Emb}_{I_s}) - F(\text {Emb}_D, \text {Emb}_{I_n}) \right| \end{aligned}$$
(3)
$$\begin{aligned} \text {Score}_E= & \left| F(\text {Emb}_E, \text {Emb}_{T_n}) - F(\text {Emb}_E, \text {Emb}_{T_s}) \right| + \left| F(\text {Emb}_E, \text {Emb}_{I_n}) - F(\text {Emb}_E, \text {Emb}_{I_s}) \right| \end{aligned}$$
(4)

This rule encodes the assumption that a sarcastic discrete prompt is more suitable the more similar its features are to the sarcastic data features and the less similar to the non-sarcastic data features, while a non-sarcastic discrete prompt is more suitable the more similar its features are to the non-sarcastic data features and the less similar to the sarcastic data features. Here, the F function measures the similarity of two feature vectors and can be instantiated in several ways, for example with Euclidean distance, cosine similarity, or the Pearson correlation coefficient. Consider two inputs \(A = \{a_0, a_1, \ldots , a_n\}\) and \(B = \{b_0, b_1, \ldots , b_n\}\). The following metrics can be calculated:

  • Euclidean Distance \(d(A, B)\):

    $$\begin{aligned} d(A, B) = \sqrt{\sum _{i=0}^n (a_i - b_i)^2} \end{aligned}$$
    (5)
  • Cosine Similarity \(\text {sim}(A, B)\):

    $$\begin{aligned} \text {sim}(A, B) = \frac{\sum _{i=0}^n a_i b_i}{\sqrt{\sum _{i=0}^n a_i^2} \sqrt{\sum _{i=0}^n b_i^2}} \end{aligned}$$
    (6)
  • Pearson Correlation Coefficient \(r(A, B)\):

    $$\begin{aligned} r(A, B) = \frac{n \sum a_i b_i - \sum a_i \sum b_i}{\sqrt{(n \sum a_i^2 - (\sum a_i)^2)(n \sum b_i^2 - (\sum b_i)^2)}} \end{aligned}$$
    (7)
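The scoring of Eqs. (3)-(4) and the subsequent ranking of Eqs. (8)-(9) can be sketched with cosine similarity as F. The embeddings here are random toy vectors, and pooling each data group with a mean is an assumption (the paper does not specify how set-level features are aggregated):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity (Eq. 6) between vector a and each row of matrix b."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return b @ a

def score_sarcastic(emb_d, text_s, text_n, img_s, img_n, F=cosine):
    # Eq. (3): a prompt scores high when its similarity to sarcastic data
    # differs strongly from its similarity to non-sarcastic data, in both
    # the text and image modalities.
    return (abs(F(emb_d, text_s).mean() - F(emb_d, text_n).mean())
            + abs(F(emb_d, img_s).mean() - F(emb_d, img_n).mean()))

rng = np.random.default_rng(0)
cands = rng.normal(size=(10, 8))                      # candidate prompt embeddings (toy)
text_s, text_n = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
img_s, img_n = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))

scores = np.array([score_sarcastic(c, text_s, text_n, img_s, img_n) for c in cands])
selected = np.argsort(scores)[::-1][:3]               # Eqs. (8)-(9): keep the top-n
```

The non-sarcastic score of Eq. (4) follows the same pattern with the roles of the sarcastic and non-sarcastic groups swapped.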

The scores of sarcastic and non-sarcastic discrete prompts are obtained by choosing a suitable scoring function. Let \(\text {Score}_D = \{\text {Score}_{D_1}, \text {Score}_{D_2}, \ldots \}\) and \(\text {Score}_E = \{\text {Score}_{E_1}, \text {Score}_{E_2}, \ldots \}\). By sorting these sets, the top-n sarcastic and non-sarcastic discrete prompts can be selected:

$$\begin{aligned} D_{\text {selected}}= & \{D_{s_1}, D_{s_2}, \ldots , D_{s_n}\} \end{aligned}$$
(8)
$$\begin{aligned} E_{\text {selected}}= & \{E_{s_1}, E_{s_2}, \ldots , E_{s_n}\} \end{aligned}$$
(9)

This ranking and selection process ensures that the model accurately extracts the most representative sarcasm and non-ironic expressions from a large number of candidates. The effectiveness of this method depends on accurate preliminary scores and a reasonable ranking mechanism, providing a certain level of interpretability to facilitate researchers’ exploration.

Finally, the selected discrete prompts are encoded by the text encoder's word-embedding layer to obtain their word embedding vectors. Average pooling is then applied over these vectors to combine the characteristics of the different discrete prompts, yielding the more robust sarcastic and non-sarcastic word vectors \(W_s\) and \(W_n\):

$$\begin{aligned} W_s= & \text {meanpooling} \left( \sum _{i=1}^n \text {TextEmbedding}(D_{s_i}) \right) \end{aligned}$$
(10)
$$\begin{aligned} W_n= & \text {meanpooling} \left( \sum _{i=1}^n \text {TextEmbedding}(E_{s_i}) \right) \end{aligned}$$
(11)

In subsequent training, suitable prompt learning methods can use the sarcastic and non-sarcastic word vectors \(W_s\) and \(W_n\) to generate sarcastic and non-sarcastic prompt vectors that are easy for the model to train.
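The average pooling of Eqs. (10)-(11) reduces the n selected prompt embeddings to one vector per polarity. A minimal sketch with toy embeddings:

```python
import numpy as np

# Toy word embeddings for n = 3 selected sarcastic prompts (shape: n x d).
selected_embs = np.array([[1.0, 2.0],
                          [3.0, 4.0],
                          [5.0, 6.0]])

# Eq. (10): mean-pool the embeddings into a single robust word vector W_s.
W_s = selected_embs.mean(axis=0)
```

The non-sarcastic vector \(W_n\) of Eq. (11) is obtained identically from the selected non-sarcastic prompt embeddings.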

Prompt vector generation

Although discrete word vectors possess inherent capabilities for sarcasm detection, their performance in prediction tasks is limited because they are static and not trainable. To address this, we introduce learnable continuous vectors into the representation. Specifically, we prepend a learnable continuous vector \(\textbf{P} = \{ p_i \in \mathbb {R}^d \}\) to both the sarcastic and non-sarcastic word vectors to form the prompted inputs \(W_s^\prime\) and \(W_n^\prime\), as shown in Eq. (12):

$$\begin{aligned} W_s^\prime = [\textbf{P}, W_s], \quad W_n^\prime = [\textbf{P}, W_n] \end{aligned}$$
(12)
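The prepending in Eq. (12) can be sketched in PyTorch. The sizes here are toy values, and initializing the prompt with a small normal distribution is an assumption; the key point is that only P carries gradients while the word vectors stay frozen:

```python
import torch
import torch.nn as nn

d, m = 8, 4                             # embedding width, prompt length (toy sizes)

# Learnable continuous prompt P: the only trainable part of this construction.
P = nn.Parameter(torch.zeros(m, d))
nn.init.normal_(P, std=0.02)

W_s = torch.randn(3, d)                 # frozen sarcastic word vectors (stand-ins)
W_s_prime = torch.cat([P, W_s], dim=0)  # Eq. (12): W_s' = [P, W_s]
```

The same concatenation with \(W_n\) yields \(W_n^\prime\); both prompted sequences are then passed through the frozen CLIP text encoder.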

These enhanced vectors are then fed into the multi-layer Transformer architecture of the CLIP text encoder. Each layer \(L_i\) transforms its input to obtain a richer vector representation, as delineated in Eq. (13):

$$\begin{aligned} {[}\textbf{W}_i] = L_i(\textbf{W}_{i-1}) \end{aligned}$$
(13)

The output of the last (K-th) layer for both the sarcastic and non-sarcastic inputs is projected into a common latent semantic space via the text mapping layer \(\text {TextProj}\), yielding the vectors \(Z_s\) and \(Z_n\):

$$\begin{aligned} Z_s = \text {TextProj}(W_K^s), \quad Z_n = \text {TextProj}(W_K^n) \end{aligned}$$
(14)

Unlike the sarcastic and non-sarcastic prompt vectors, text prompt vectors are generated without discrete prompt integration and screening. Instead, learnable continuous vectors \(\textbf{P}_i^t\) are generated at each layer of the text encoder and concatenated with the output of the respective layer. The operation at the i-th layer is described in Eq. (15):

$$\begin{aligned} {[}\textbf{P}_i^t, \textbf{W}_i^t] = L_i^t([\textbf{P}_{i-1}^t, \textbf{W}_{i-1}^t]) \end{aligned}$$
(15)

The final text prompt vector \(Z_t\) is computed as:

$$\begin{aligned} Z_t = \text {TextProj}([\textbf{P}_K^t, \textbf{W}_K^t]) \end{aligned}$$
(16)
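One reading of Eqs. (15)-(16) is deep prompting: a fresh learnable prompt is injected at every encoder layer, replacing the prompt slots of the previous layer while the token states flow through. The sketch below uses random matrices as stand-ins for the transformer layers and a mean as a stand-in for \(\text {TextProj}\); these simplifications, and the per-layer replacement itself, are assumptions about the exact mechanism:

```python
import numpy as np

rng = np.random.default_rng(1)
K, m, T, d = 3, 2, 5, 4        # layers, prompt tokens, text tokens, width (toy sizes)

# One learnable prompt per layer (random stand-ins for nn.Parameters here).
prompts = [rng.normal(size=(m, d)) for _ in range(K)]
weights = [rng.normal(size=(d, d)) for _ in range(K)]   # stand-ins for layers L_i^t
W = rng.normal(size=(T, d))                              # initial token embeddings

for i in range(K):
    # Eq. (15): concatenate this layer's prompt with the running token states,
    # pass through the layer, and keep only the token part for the next step.
    out = np.concatenate([prompts[i], W], axis=0) @ weights[i]
    W = out[m:]

Z_t = W.mean(axis=0)           # stand-in for TextProj on the final state (Eq. 16)
```

In the full model, \(\text {TextProj}\) would act on the concatenation \([\textbf{P}_K^t, \textbf{W}_K^t]\); the sketch pools only the token part for brevity.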

Unlike text prompt vectors, image prompt vectors are constructed using a collaborative multi-modal prompt method. This approach allows for the optimization of both text and image prompt vectors simultaneously by sharing some parameters between them. Specifically, the image prompt vector \(\textbf{P}_i^I\) is encoded using the text prompt vector \(\textbf{P}_i^t\) through a multi-layer perceptron (MLP), as outlined in Eq. (17):

$$\begin{aligned} \textbf{P}_i^I = \text {MLP}(\textbf{P}_i^t) \end{aligned}$$
(17)
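A minimal sketch of the coupling in Eq. (17): the image prompt is generated from the text prompt through an MLP, so the two modalities share parameters and are optimized jointly. The hidden width and ReLU activation are assumptions; the paper specifies only that an MLP is used:

```python
import torch
import torch.nn as nn

d_text, d_img = 8, 6   # toy widths; CLIP-large uses much larger dimensions

# Eq. (17): MLP mapping the text prompt tokens to image prompt tokens.
coupling = nn.Sequential(
    nn.Linear(d_text, d_text),
    nn.ReLU(),
    nn.Linear(d_text, d_img),
)

P_text = torch.randn(4, d_text)   # text prompt tokens P_i^t at layer i
P_img = coupling(P_text)          # image prompt tokens P_i^I at layer i
```

Because the image prompt is a function of the text prompt, gradients from the image branch update the shared text prompt parameters, which is what makes the learning collaborative.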

The encoded image prompt vectors are further processed through a series of transformation layers. Each layer \(L_i^I\) in the image encoder updates the vector representations by incorporating both the prompt vector and the previous layer’s output, as shown in Eq. (18):

$$\begin{aligned} {[}\textbf{P}_i^I, \textbf{W}_i^I] = L_i^I([\textbf{P}_{i-1}^I, \textbf{W}_{i-1}^I]) \end{aligned}$$
(18)

The final output from the last layer of this transformation process is projected into a common semantic space through the image projection layer \(\text {IMGProj}\), resulting in the final image prompt vector \(\textbf{Z}_I\):

$$\begin{aligned} \textbf{Z}_I = \text {IMGProj}(\textbf{W}_K^I) \end{aligned}$$
(19)

Sarcasm probability generation

The pre-training task of the CLIP model is contrastive learning, which pulls the representations of positive pairs together and pushes those of negative pairs apart in a unified representation space41. While in simple image classification tasks the similarity between images and category texts can be computed directly, multi-modal sarcasm detection is more complex because both candidate prompt texts may express sarcasm-related content, so direct similarity between these text inputs is not sufficient for classification.

To address this, multi-modal prompt vectors are constructed specifically for the sarcasm detection task. These vectors combine the contributions of both text and image modalities. The multi-modal prompt vector \(Z_M\) is generated through two projection functions \(F_t(\cdot )\) and \(F_I(\cdot )\), as shown in Eq. (20):

$$\begin{aligned} Z_M = F_t(Z_t) + F_I(Z_I) \end{aligned}$$
(20)

Finally, the sarcasm probability \(\hat{y}\) is computed with the cosine similarity function \(\text {sim}(\cdot )\) between the multi-modal prompt vector \(Z_M\) and the sarcastic and non-sarcastic prompt vectors \(Z_s\) and \(Z_n\), as given in Eq. (21):

$$\begin{aligned} \hat{y} = \frac{\exp (\text {sim}(Z_M, Z_s))}{\exp (\text {sim}(Z_M, Z_s)) + \exp (\text {sim}(Z_M, Z_n))} \end{aligned}$$
(21)
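Eqs. (20)-(21) amount to a softmax over the similarities to the two class prompts. In this sketch the projections \(F_t\) and \(F_I\) are taken as identity (an assumption; in the model they are learned), and random vectors stand in for the prompt vectors:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d = 8
Z_t, Z_I = rng.normal(size=d), rng.normal(size=d)   # text / image prompt vectors
Z_s, Z_n = rng.normal(size=d), rng.normal(size=d)   # sarcastic / non-sarcastic vectors

# Eq. (20): fuse the modalities (projections taken as identity in this sketch).
Z_M = Z_t + Z_I

# Eq. (21): two-way softmax over cosine similarities gives the sarcasm probability.
num = np.exp(cos(Z_M, Z_s))
y_hat = num / (num + np.exp(cos(Z_M, Z_n)))
```

Because Eq. (21) is a binary softmax, \(\hat{y}\) is always in (0, 1) and can be thresholded at 0.5 or trained with a cross-entropy loss.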

Experiment

Setup

Dataset

This paper conducts its experiments on the Twitter Multi-modal Sarcasm Detection Dataset (MSD), published by Cai et al.17 in 2019. The dataset collects English tweets that contain a picture and certain hashtags (e.g., #sarcasm) as sarcastic examples, and English tweets with pictures but without sarcasm-related hashtags as non-sarcastic examples, followed by data cleaning. Tweets containing the words sarcasm, sarcastic, irony, or ironic, and tweets containing URLs, were dropped to avoid introducing extra information, as were words that frequently co-occur with sarcastic tweets, such as jokes, humor, and exgag. The data are split into training, validation, and test sets at a ratio of 80%:10%:10%. Finally, to enable more accurate evaluation, Cai et al.17 manually checked the validation and test sets to ensure label accuracy. Statistics of the MSD dataset are shown in Table 3.

Table 3 Basic information of the MSD dataset.

To simulate a data-shortage scenario, this paper designs the experimental conditions for small-sample classification. From the original dataset, a specified number \(k\) of sarcastic and non-sarcastic samples are randomly selected from the training and validation sets. These samples form a balanced training set \(D_{\text {train}}\) and validation set \(D_{\text {dev}}\) with \(|D_{\text {train}}| = |D_{\text {dev}}|\), mimicking a realistic small-sample learning environment.

Additionally, the original test set is used unchanged as the test set \(D_{\text {test}}\). The experimental configuration therefore keeps the training and validation sets the same size, \(|D_{\text {train}}| = |D_{\text {dev}}|\), while \(D_{\text {test}}\) remains at full size.

This setup ensures that the challenges of small-sample learning are realistically represented, providing a robust framework for assessing the model’s performance under constrained data conditions. All experiments were conducted on a server with an NVIDIA Tesla V100 GPU (32GB), 128GB RAM, and an Intel Xeon processor.

To evaluate the models, Accuracy and F1-Score were selected as the primary metrics. Accuracy provides a high-level measure of the model’s ability to correctly classify samples, making it suitable for comparing overall performance across different models. In contrast, F1-Score balances precision and recall, offering deeper insights into the model’s effectiveness in detecting sarcasm, especially when class distributions are slightly imbalanced. The combination of these two metrics ensures a comprehensive assessment of the model’s overall performance and its ability to handle the nuances of sarcasm detection.
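For reference, the two metrics can be computed from raw predictions without any library; a minimal sketch, treating the sarcastic class (label 1) as the positive class:

```python
def accuracy_f1(y_true, y_pred):
    """Compute accuracy and binary F1-score (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1
```

F1 is the harmonic mean of precision and recall, which is why it remains informative when the class distribution is slightly imbalanced.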

Parameter

For the pre-trained model, this paper uses openai/clip-vit-large-patch14 (hereinafter CLIP-large), downloaded from the official Hugging Face model hub. During training, the CLIP-large parameters are frozen and only the prompt-vector parameters are trained. The parameters of the model are shown in Table 4.

Table 4 Parameters of the pre-trained model.

In the process of prompt learning training, various kinds of hyperparameter settings are shown in Table 5.

Table 5 Experimental hyperparameter settings.

Baselines

Since this paper targets multi-modal sarcasm detection in small-sample scenarios, model performance must be compared against multiple baselines on the MSD dataset. Besides several classic multi-modal sarcasm detection models, the baselines also include a CLIP-based method and several small-sample classification methods:

P-Tuning v242: A prompt learning method that adds and optimizes continuous prompt information in each layer of the pre-trained model to improve the model’s adaptability and performance on specific tasks; it is more flexible and efficient than the traditional approach of injecting prompts only at the input layer. For reproduction, the CLIP-large text encoder is used for training.

BBT v243: A prompt learning method that optimizes continuous prompts prepended to the input text via derivative-free optimization, improving the performance of large pre-trained language models in black-box scenarios where the model’s internal parameters are not accessible. For reproduction, the CLIP-large text encoder is used for training.

BERT44: A text classification method; a deep learning model based on the Transformer architecture, designed for natural language processing tasks. For reproduction, the CLIP-large text encoder is used for training.

ViT45: An image classification method; a deep learning model based on the Transformer architecture, designed for computer vision tasks. It splits images into fixed-size patches and transforms them into linear embedding vectors, capturing global context and long-range dependencies through self-attention. For reproduction, the CLIP-large image encoder is used for training.

HFN17: A multi-modal sarcasm detection algorithm that uses GloVe46 and ResNet47 as feature extractors and fuses the multi-modal features with a hierarchical fusion network containing a single attention layer. For reproduction, the GloVe and ResNet feature extractors were replaced with the CLIP-large text and image encoders for training.

Attr-BERT18: A multi-modal sarcasm detection algorithm that captures modal inconsistencies using a collaborative attention mechanism to improve model performance. For reproduction, the ResNet and BERT encoders were replaced with CLIP-large encoders.

InCrossMGS20: A multi-modal sarcasm detection algorithm that constructs a text-modal graph and an image-modal graph from the text and image features, then adopts a cross-modal interaction graph to integrate the two. For reproduction, the ResNet and BERT encoders were replaced with CLIP-large encoders.

SAHFN-BERT4: A multi-modal sarcasm detection algorithm that extracts emotion data from each modality and uses cross-modal transformations to describe modal inconsistencies. In addition, an emotion-aware image-text contrastive loss is introduced to better align the semantics of images and text. For reproduction, the DenseNet48 and RoBERTa encoders were replaced with CLIP-large encoders.

MVK22: A multi-modal sarcasm detection algorithm proposing a lightweight deep-learning-based multi-modal interaction model that integrates visual commonsense knowledge to enrich the image and text representations in the sarcasm detection model. For reproduction, the DenseNet and BERT encoders were replaced with CLIP-large encoders.

CLIP-MLP49: A multi-modal classification method based on the CLIP model. It extracts text and image features with the CLIP-large model, concatenates them, and obtains the prediction through a multi-layer perceptron classifier.
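As an illustration of the simplest CLIP-based baseline above, the following sketches a CLIP-MLP-style classification head in plain NumPy. The weight shapes and function name are placeholders, and feature extraction by CLIP-large is assumed to have already happened:

```python
import numpy as np

def clip_mlp_head(text_feat, image_feat, w1, b1, w2, b2):
    """CLIP-MLP sketch: concatenate the CLIP text and image features and
    classify with a two-layer perceptron (all weights are placeholders)."""
    x = np.concatenate([text_feat, image_feat])   # feature splicing
    h = np.maximum(0.0, w1 @ x + b1)              # ReLU hidden layer
    logits = w2 @ h + b2
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()                        # [p(non-sarcastic), p(sarcastic)]
```

In the actual baseline the MLP is trained end-to-end on top of frozen CLIP features; here only the forward pass is shown.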

Overall performance

Table 6 Comparison of sarcasm detection results of various models in 16-shot scenario.

We implemented the baseline models alongside our proposed model, and the experimental results are presented in Table 6. The baseline models were trained using the Adam optimizer, except for BBT v2, which instead relies on derivative-free optimization. The training setup used a batch size of 16, 100 training epochs, and a learning rate selected from {2e-3, 2e-4, 2e-5} based on validation performance. The comparison experiments were conducted under a 16-shot classification scenario (16 training samples), and the reported results are the mean and standard deviation of performance across five different random seeds.

Here, “seeds” refer to the random seeds used to initialize the random number generator during the training process. By varying the seeds, the experiments account for different sources of randomness, such as weight initialization, data shuffling, and other stochastic processes. This ensures a more robust evaluation by minimizing the impact of any specific random initialization on the results.
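A hedged sketch of this multi-seed protocol follows. It is framework-free: the real experiments would also seed the deep-learning library itself (e.g. `torch.manual_seed`), which is omitted here:

```python
import random
import numpy as np

def set_seed(seed):
    # Fix the seeds that govern data shuffling and weight initialization.
    random.seed(seed)
    np.random.seed(seed)

def run_with_seeds(train_fn, seeds=(0, 1, 2, 3, 4)):
    """Run training once per seed and report the mean and standard
    deviation of the resulting metric, as done in Table 6."""
    scores = []
    for s in seeds:
        set_seed(s)
        scores.append(train_fn())
    return float(np.mean(scores)), float(np.std(scores))
```

Because every source of randomness is re-seeded before each run, repeating the whole protocol reproduces identical mean and standard deviation.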

The results in Table 6 show that our proposed model achieves superior performance in the 16-shot scenario, with an accuracy of 72.22% and an F1-score of 72.57%. Compared to the second-best model, CLIP-MLP, our method achieves a 2.6% improvement in accuracy and a 3.1% improvement in F1-score. These results highlight the effectiveness of prompt-based learning in low-data scenarios. The comparisons between P-Tuning v2, BBT v2, and BERT show that prompt learning significantly outperforms traditional fine-tuning under small-sample conditions. In particular, traditionally fine-tuned models exhibit higher performance variance across seeds (standard deviations exceeding 4), while prompt-learning-based methods consistently maintain lower deviations (mostly within 2.5). This observation aligns with findings in the literature that prompt learning is more stable and reliable in limited-data scenarios. In contrast, models with large parameter counts, such as SAHFN-BERT, tend to overfit in low-sample settings, resulting in poorer performance stability. This further underscores the advantage of our lightweight prompt-based framework in achieving robust performance while avoiding overfitting.

To evaluate the scalability of our proposed method, we extended the experiments to 128-shot, 512-shot, 1024-shot, and full-data scenarios. The accuracy and F1-score results are shown in Figs. 4 and 5. As the data volume increases, the performance of baseline models such as SAHFN-BERT and InCrossMGS improves significantly, eventually surpassing our model from the 512-shot scenario onward. This is expected, as their larger parameter capacities allow them to better leverage larger datasets. Under small-sample conditions (16-shot and 128-shot), however, our method consistently achieves the best performance, demonstrating its superior ability to generalize with limited data. This trend reflects the inherent characteristics of prompt learning, which excels at efficiently utilizing small datasets but is limited in scaling to larger ones due to its fewer learnable parameters. Even so, our model maintains competitive performance in larger data settings, highlighting its flexibility and adaptability.

Fig. 4 Accuracy (acc) of each model under different data amounts.

We further conducted parameter sensitivity analysis to evaluate the impact of prompt length, discrete prompt filtering, and the number of discrete prompts on model performance. The F1-score results for different prompt lengths are shown in Fig. 6. The results indicate that prompt length plays a critical role in model performance. For sarcasm, text, and image inputs, the optimal prompt lengths align with the average lexeme lengths of the respective inputs after encoding: 11.8 for sarcasm, 19.3 for text, and 196 for image (due to a fixed embedding size of \(224 \times 224\) with a patch size of 16). Prompts that are too short fail to capture sufficient contextual information, while excessively long prompts lead to feature convergence and poor sample differentiation. Therefore, choosing an appropriate prompt length is crucial to optimizing the model’s ability to detect sarcasm effectively. In addition, the discrete prompt filtering method proposed in our scheme significantly improves model performance compared to using no filtering or no discrete prompts, as shown in Table 7. Cosine similarity achieves the best results, which is consistent with the CLIP model’s pretraining mechanism based on contrastive learning. This demonstrates that cosine similarity effectively measures the alignment between discrete prompts and sarcasm-related features, leading to better detection results.

Fig. 5 F1-score of each model under different data amounts.

Parameter sensitivity analysis

In addition, this paper presents several parameter sensitivity experiments on the sarcasm, image, and text prompt lengths, the discrete prompt quality calculation, and the number of discrete prompts, to demonstrate the soundness of the parameter and method choices in this scheme. The model F1 values for different prompt length settings are shown in Fig. 6.

Fig. 6 Model F1 value under different prompt length settings.

As the prompt vector length grows, model performance generally rises to a peak and then declines. Because sarcasm input, text input, and image input differ in length, the prompt length that achieves the best performance also differs. After word-vector encoding, the average lexeme length is 11.8 for sarcasm input and 19.3 for text input; with an input resolution of \(224 \times 224\) and a patch size of 16, the lexeme length of the image input after image-embedding encoding is fixed at 196. It can therefore be concluded that inputs with longer lexeme lengths usually require longer prompt vectors to introduce enough prompt information to stimulate the model’s perception of sarcasm, whereas inputs with relatively short lexeme lengths do not suit long prompts: an excessively long prompt causes feature convergence after word-vector encoding, so the outputs of the multi-layer encoder tend toward uniformity and different samples become poorly differentiated.
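The fixed image lexeme length of 196 follows directly from the ViT patching arithmetic; a short sketch:

```python
def image_token_length(resolution=224, patch_size=16):
    """Number of image patch tokens produced by a ViT-style encoder:
    (resolution / patch_size) patches per side, squared."""
    per_side = resolution // patch_size
    return per_side * per_side

# 224x224 input with 16x16 patches -> 14 * 14 = 196 tokens
```

(The class token that some ViT variants prepend is not counted here.)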

In addition, to generate the sarcastic and non-sarcastic prompt vectors, this scheme first uses a discrete prompt generation method to produce a sufficient number of discrete prompts. However, the quality of these prompts varies, and not all of them contribute positively to sarcasm detection. This scheme therefore proposes a discrete prompt filtering method, in which Euclidean distance, cosine similarity, and the Pearson correlation coefficient serve as candidate feature similarity functions for computing each discrete prompt’s quality score. The experimental results are shown in Table 7.
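A minimal sketch of the three candidate similarity functions and the quality-score computation (NumPy; the function names and feature shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def euclidean(a, b):
    # Negated so that, like the other two functions, higher = more similar.
    return -float(np.linalg.norm(a - b))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def quality_scores(prompt_feats, class_feat, sim_fn=cosine):
    """Score each discrete-prompt feature against the class feature;
    higher scores mark the prompts kept by the filter."""
    return [sim_fn(p, class_feat) for p in prompt_feats]
```

Swapping `sim_fn` reproduces the three configurations compared in Table 7.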

Table 7 Model performance under different feature similarity calculation functions.

The discrete prompt filtering method outperforms both using no filter function and using no discrete prompts at all. On the one hand, this verifies the validity of the discrete prompt generation method in this scheme; on the other hand, it confirms the conjecture that not every discrete prompt is effective. Among the filtering functions, cosine similarity performs best, likely because the contrastive-learning pre-training objective of the CLIP model is itself based on cosine similarity. The discrete prompt filtering method in this scheme therefore adopts cosine similarity, which better reflects how similar a discrete prompt is to the sarcastic and non-sarcastic data, yielding a better sarcasm detection effect.

In addition, the number of discrete prompts has a significant effect on model performance. Early in the experiment, performance gradually improves as the number of discrete prompts increases, indicating that more prompts enhance the robustness of the model and stabilize the learning process. Specifically, the standard deviation of model performance decreases markedly as the number of prompts grows, which further indicates that additional discrete prompts improve the model’s adaptability to variations in the data. However, once the number of discrete prompts exceeds 20, performance begins to decline. This suggests that only roughly 20 to 50 of the generated discrete prompts are effective and contribute positively to training; beyond that, noise or redundant information is introduced, degrading learning efficiency and performance. Properly controlling the number of discrete prompts is therefore essential to ensure their quality and efficiency and to optimize model performance. Model performance under different discrete prompt quantity settings is shown in Table 8, where Top-N denotes the N highest-ranked discrete prompts by quality score, reflecting how the number and quality of discrete prompts relate to sarcasm detection performance.
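Given the quality scores, the Top-N selection can be sketched as follows (the helper name is hypothetical):

```python
def top_n_prompts(prompts, scores, n=20):
    """Keep the N discrete prompts with the highest quality scores."""
    ranked = sorted(zip(prompts, scores), key=lambda x: x[1], reverse=True)
    return [p for p, _ in ranked[:n]]
```

Varying `n` over the settings in Table 8 reproduces the quantity sweep described above.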

Table 8 Model performance based on the number of discrete prompts.

Conclusion

This paper introduces a novel prompt learning method based on a multi-modal pre-training model to address the challenges of sarcasm detection in low-resource settings. The study began by selecting the CLIP model as the foundational pre-training model through comprehensive comparative experiments. A tailored prompt learning framework was then designed specifically for the CLIP model, encompassing several key components: discrete prompt generation, prompt vector generation, and multi-modal prompt vector fusion.

The discrete prompt generation method expands the sarcastic and non-sarcastic prompt lexicons through a series of predefined rules. These prompts are then filtered to identify the most effective discrete prompts, enhancing the robustness and accuracy of the sarcasm detection model. The prompt vector generation method leverages learnable continuous prompt vectors inserted into the sarcastic and non-sarcastic text inputs; these are encoded by the model to obtain distinct prompt vectors. Specific projection functions then facilitate interaction between multi-modal features, ultimately generating multi-modal prompt vectors through a fusion layer.

The proposed approach was rigorously evaluated through comparative experiments and a series of parameter sensitivity analyses. The results demonstrate the superior performance of this method in multi-modal sarcasm detection tasks, particularly under conditions of data scarcity. The discrete prompt filtering strategy, in particular, proved effective in refining prompt quality and thereby improving model performance. Furthermore, the integration of learnable continuous vectors and multi-modal fusion techniques significantly improved the model’s ability to detect sarcasm accurately across modalities.

In summary, the prompt learning method based on the CLIP model offers a practical and effective solution for multi-modal sarcasm detection in low-resource scenarios. The symmetric integration of text and image data not only advances the state of the art in sarcasm detection but also provides a robust framework for future research in multi-modal sentiment analysis and public opinion monitoring. The insights gained from this study can be extended to other domains where understanding nuanced human expression is crucial, paving the way for more sophisticated and contextually aware AI systems. The symmetry in the method’s design ensures balanced and effective use of diverse data sources: each modality contributes equally, fostering a more holistic and accurate analysis of sarcasm in multi-modal data.

While the proposed method demonstrates strong performance on the Twitter Multi-modal Sarcasm Detection Dataset (MSD), its generalizability to other datasets or real-world scenarios has not been explored due to the lack of publicly available large-scale multi-modal sarcasm detection datasets. Future research will focus on evaluating the model on additional datasets, including domain-specific and cross-lingual sarcasm detection tasks. Furthermore, the versatility of the proposed framework suggests potential applications beyond sarcasm detection, such as multi-modal sentiment analysis and humor detection. Exploring these areas will not only validate the robustness of the model but also extend its applicability to broader domains. Additionally, further investigations into optimizing prompt design and multi-modal fusion strategies could lead to even greater improvements in performance and adaptability.