Table 2 Hyperparameter Settings for the VBG Model's ViT, BERT, and GAN Modules.
| Module | Hyperparameter | Value | Description |
|---|---|---|---|
| ViT | Image size | \(224 \times 224\) | Input image size for the ViT module. |
| | Patch size | \(16 \times 16\) | Each image is divided into patches of \(16 \times 16\) pixels. |
| | Hidden dimension | 768 | Size of the token embedding space in ViT. |
| | Number of layers | 12 | Number of transformer layers used in ViT. |
| | Number of heads | 12 | Number of attention heads in each transformer layer. |
| BERT | Max sequence length | 128 | Maximum length of the tokenized text sequence. |
| | Hidden size | 768 | Size of BERT's hidden layer. |
| | Number of layers | 12 | Number of transformer layers used in BERT. |
| | Batch size | 32 | Batch size for text input during training. |
| GAN | Latent space dimension | 100 | Dimensionality of the noise vector input to the generator. |
| | Generator learning rate | \(10^{-4}\) | Learning rate for the generator. |
| | Discriminator learning rate | \(10^{-4}\) | Learning rate for the discriminator. |
| | Batch size | 64 | Batch size for both the generator and the discriminator. |
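The sketch below shows how the settings in Table 2 could be instantiated, assuming PyTorch and the Hugging Face `transformers` configuration classes; the VBG model's actual module wiring is not specified by the table, so the generator and discriminator bodies here are illustrative placeholders.

```python
# Minimal sketch of the Table 2 settings, assuming PyTorch and the
# Hugging Face `transformers` library; module internals beyond the
# listed hyperparameters are hypothetical.
import torch
import torch.nn as nn
from transformers import ViTConfig, ViTModel, BertConfig, BertModel

# ViT module: 224x224 inputs, 16x16 patches, 768-dim tokens, 12 layers, 12 heads.
vit = ViTModel(ViTConfig(
    image_size=224, patch_size=16, hidden_size=768,
    num_hidden_layers=12, num_attention_heads=12,
))

# BERT module: 768-dim hidden states, 12 layers; sequences are padded or
# truncated to 128 tokens at tokenization time, with a training batch size of 32.
bert = BertModel(BertConfig(
    hidden_size=768, num_hidden_layers=12, num_attention_heads=12,
))
MAX_SEQ_LEN, TEXT_BATCH_SIZE = 128, 32

# GAN module: 100-dim noise vector, lr = 1e-4 for both networks, batch size 64.
# The placeholder networks below assume a 768-dim output to match the
# transformer embedding width; this mapping is an assumption, not from Table 2.
LATENT_DIM, GAN_BATCH_SIZE = 100, 64
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 768), nn.ReLU(), nn.Linear(768, 768),
)
discriminator = nn.Sequential(
    nn.Linear(768, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1),
)
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

# Example: sample one batch of noise and run it through the generator.
z = torch.randn(GAN_BATCH_SIZE, LATENT_DIM)
fake = generator(z)
```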