Table 2 Hyperparameter Settings for VBG Model’s ViT, BERT, and GAN Modules.

From: A Museum artifact classification model based on cross-modal attention fusion and generative data augmentation

| Module | Hyperparameter | Value | Description |
|--------|----------------|-------|-------------|
| ViT | Image size | \(224 \times 224\) | Input image resolution for the ViT module. |
| ViT | Patch size | \(16 \times 16\) | Each image is divided into \(16 \times 16\)-pixel patches. |
| ViT | Hidden dimension | 768 | Dimensionality of the token embedding space in ViT. |
| ViT | Number of layers | 12 | Number of transformer layers in ViT. |
| ViT | Number of heads | 12 | Number of attention heads in each transformer layer. |
| BERT | Max sequence length | 128 | Maximum length of the tokenized text sequence. |
| BERT | Hidden size | 768 | Size of BERT's hidden layer. |
| BERT | Number of layers | 12 | Number of transformer layers in BERT. |
| BERT | Batch size | 32 | Batch size for text input during training. |
| GAN | Latent space dimension | 100 | Dimensionality of the noise vector input to the generator. |
| GAN | Generator learning rate | \(10^{-4}\) | Learning rate for the generator. |
| GAN | Discriminator learning rate | \(10^{-4}\) | Learning rate for the discriminator. |
| GAN | Batch size | 64 | Batch size for both the generator and the discriminator. |
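For reference, the settings in Table 2 can be collected into plain configuration dictionaries. This is a minimal sketch; the dictionary and key names (`VIT_CONFIG`, `hidden_dim`, etc.) are illustrative and not taken from the paper's code. It also derives the ViT patch-token count implied by the image and patch sizes:

```python
# Hyperparameters from Table 2, grouped by module.
# Names are illustrative placeholders, not the paper's identifiers.

VIT_CONFIG = {
    "image_size": 224,        # input images are 224 x 224
    "patch_size": 16,         # 16 x 16-pixel patches
    "hidden_dim": 768,        # token embedding dimension
    "num_layers": 12,         # transformer layers
    "num_heads": 12,          # attention heads per layer
}

BERT_CONFIG = {
    "max_seq_length": 128,    # maximum tokenized sequence length
    "hidden_size": 768,       # hidden layer size
    "num_layers": 12,         # transformer layers
    "batch_size": 32,         # text batch size during training
}

GAN_CONFIG = {
    "latent_dim": 100,        # noise vector dimensionality
    "generator_lr": 1e-4,     # generator learning rate
    "discriminator_lr": 1e-4, # discriminator learning rate
    "batch_size": 64,         # shared by generator and discriminator
}

# Derived quantity: 224x224 images split into 16x16 patches give
# (224 // 16) ** 2 = 196 patch tokens (197 including a [CLS] token).
num_patches = (VIT_CONFIG["image_size"] // VIT_CONFIG["patch_size"]) ** 2
print(num_patches)  # 196
```

This grouping makes the shared choices easy to spot: ViT and BERT both use a 768-dimensional hidden size and 12 layers, which keeps the two modality encoders dimensionally compatible for cross-modal fusion.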