Table 4 Parameters of the pre-trained model.

From: A multi-modal sarcasm detection model based on cue learning

Name                          Quantity/Content
Image Encoder Architecture    ViT
Input Image Resolution        224 × 224
Image Block Size              16 × 16
Image Encoder Layers          24
Image Encoder Dimension       1024
Image Encoder Heads           16
Text Encoder Layers           12
Text Encoder Dimension        768
Text Encoder Vocabulary       49408
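
The values in Table 4 imply a few derived quantities. The following is a minimal sketch that computes them. The class name `PretrainedConfig`, its field names, and the helper functions are hypothetical and are not taken from the paper's code. The sketch assumes non-overlapping square image blocks and an embedding dimension split evenly across attention heads, as in a standard ViT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainedConfig:
    """Values copied from Table 4; the field names are illustrative only."""
    image_resolution: int = 224   # input images are 224 x 224
    block_size: int = 16          # each image block is 16 x 16
    image_layers: int = 24
    image_dim: int = 1024
    image_heads: int = 16
    text_layers: int = 12
    text_dim: int = 768
    vocab_size: int = 49408


def blocks_per_image(cfg: PretrainedConfig) -> int:
    """Number of non-overlapping blocks the image is split into."""
    per_side = cfg.image_resolution // cfg.block_size
    return per_side * per_side


def image_head_dim(cfg: PretrainedConfig) -> int:
    """Per-head width, assuming the dimension is split evenly across heads."""
    return cfg.image_dim // cfg.image_heads


def token_embedding_params(cfg: PretrainedConfig) -> int:
    """Size of a vocabulary-by-dimension token embedding matrix."""
    return cfg.vocab_size * cfg.text_dim


if __name__ == "__main__":
    cfg = PretrainedConfig()
    print("blocks per image:", blocks_per_image(cfg))
    print("image head dimension:", image_head_dim(cfg))
    print("token embedding parameters:", token_embedding_params(cfg))
```

Under these assumptions, a 224 × 224 image yields 14 blocks per side, and each of the 16 image-encoder heads works on a 64-dimensional slice. Any extra tokens the encoder may prepend, such as a class token, are not counted here because the table does not mention them.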