Table 4. Parameters of the pre-trained model.
From: A multi-modal sarcasm detection model based on cue learning
| Name | Quantity/Content |
|---|---|
| Image Encoder Architecture | ViT |
| Input Image Resolution | 224 × 224 |
| Image Block Size | 16 × 16 |
| Image Encoder Layers | 24 |
| Image Encoder Dimension | 1024 |
| Image Encoder Heads | 16 |
| Text Encoder Layers | 12 |
| Text Encoder Dimension | 768 |
| Text Encoder Vocabulary | 49408 |
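
As a quick sanity check on how the image-encoder entries relate to one another, the short sketch below derives two quantities the table does not list: the number of image blocks per input and the per-head dimension. It assumes the usual ViT conventions, namely that the image is cut into non-overlapping square blocks and that the encoder dimension is divided evenly across attention heads. The source does not state either convention explicitly, and the variable names are illustrative only.

```python
# Values copied from Table 4; variable names are illustrative, not from the paper.
image_resolution = 224  # input image is 224 x 224 pixels
block_size = 16         # each image block (patch) is 16 x 16 pixels
encoder_dim = 1024      # image encoder dimension
num_heads = 16          # image encoder attention heads

# Assumption: non-overlapping blocks tile the image exactly.
blocks_per_side = image_resolution // block_size
num_blocks = blocks_per_side ** 2

# Assumption: the encoder dimension is split evenly across heads.
head_dim = encoder_dim // num_heads

print(blocks_per_side, num_blocks, head_dim)  # 14 196 64
```

Under these assumptions each image yields 196 block tokens (not counting any extra classification token) and each attention head operates on 64 dimensions.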