Table 2 Hyperparameter settings.

| Parameter | Description | Value |
|---|---|---|
| Number of transformer layers | Total layers in the transformer encoder stack | 4 |
| Hidden size (embedding dim.) | Dimensionality of token embeddings and hidden representations | 256 |
| Number of attention heads | Number of parallel attention mechanisms in each layer | 8 |
| Feed-forward network size | Size of the intermediate layer in the feed-forward block | 1024 |
| Dropout rate | Probability of dropout applied to layers to prevent overfitting | 0.1 |
| Learning rate | Initial step size for optimizer updates | 1e-4 |
| Batch size | Number of training samples per batch | 128 |
| Sequence length | Maximum length of user preference sequences | 50 |
| Optimizer | Optimization algorithm used | Adam |
| Activation function | Non-linear function in feed-forward layers | GELU |
| Training epochs | Number of complete passes through the training dataset | 100 |
| Warm-up steps | Number of steps over which the learning rate is gradually increased | 1000 |
| Weight decay | L2 regularization coefficient to prevent overfitting | 0.01 |
| Early stopping | Halts training when validation loss stagnates and restores the best weights | Enabled (monitor = val_loss, patience = 10) |
| Masked item prediction | BERT-style training objective via item masking | 15% of items masked |
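
For concreteness, the sketch below shows one way the settings in Table 2 could be assembled into a BERT4Rec-style encoder in PyTorch. It is an illustrative assumption, not the authors' implementation: the class name `CFTransformer`, the item-vocabulary size `NUM_ITEMS`, and the use of `nn.TransformerEncoder` with a linear warm-up scheduler are all choices made here for the example; only the numeric values come from the table.

```python
# Minimal sketch, assuming a BERT4Rec-style masked-item model; names such as
# CFTransformer and NUM_ITEMS are hypothetical and not from the paper.
import torch
import torch.nn as nn

NUM_ITEMS = 10_000            # assumed catalogue size; not specified in Table 2
MASK_ID = NUM_ITEMS + 1       # reserved id for the 15% masked-item objective

class CFTransformer(nn.Module):
    def __init__(self, num_items=NUM_ITEMS, hidden=256, layers=4, heads=8,
                 ffn=1024, dropout=0.1, max_len=50):
        super().__init__()
        # +2 embedding rows: padding (id 0) and the [MASK] token
        self.item_emb = nn.Embedding(num_items + 2, hidden, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, hidden)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=ffn,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.out = nn.Linear(hidden, num_items + 2)   # scores for masked positions

    def forward(self, item_ids):
        # item_ids: (batch, seq_len) with seq_len <= 50
        pos = torch.arange(item_ids.size(1), device=item_ids.device)
        x = self.item_emb(item_ids) + self.pos_emb(pos)
        return self.out(self.encoder(x))

model = CFTransformer()
# Adam with the table's learning rate and weight decay (L2-style regularization)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.01)
# Linear warm-up over the first 1000 optimizer steps, per the Warm-up Steps row
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=1000)
```

In this reading, training would proceed for up to 100 epochs with batches of 128 sequences, randomly masking 15% of the items in each sequence and computing a cross-entropy loss at the masked positions, with early stopping on validation loss (patience 10) deciding when to halt and which weights to keep.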