Table 6: Reinforcement Learning Hyperparameters

From: Large language models learning to write rhyming Tang poetry: A Xunzi Yayun R1 case study

| Hyperparameter | Description | Value |
| --- | --- | --- |
| batch_size | Batch size for training | 2 |
| learning_rate | Learning rate | 5e-6 |
| max_prompt_length | Maximum prompt length | 512 |
| num_epochs | Number of training epochs | 1 |
| gradient_accumulation_steps | Gradient accumulation steps | 4 |
| adam_beta1 | Adam optimizer beta1 (first-moment) decay coefficient | 0.9 |
| adam_beta2 | Adam optimizer beta2 (second-moment) decay coefficient | 0.99 |
| warmup_ratio | Fraction of training steps used for learning-rate warmup | 0.1 |
| max_grad_norm | Gradient clipping threshold | 0.1 |
| antithesis_penalty | Penalty for each parallelism error in Lüshi (Algorithm 4) | 0.5 |
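The table does not name the training framework. As a minimal sketch, assuming an R1-style GRPO run configured through Hugging Face TRL's `GRPOConfig` (an assumption; the paper's actual trainer may differ), the hyperparameters above would map roughly as follows. Note that `antithesis_penalty` is part of the reward function (Algorithm 4), not the optimizer configuration.

```python
from trl import GRPOConfig

# Hypothetical mapping of Table 6 onto TRL's GRPOConfig /
# TrainingArguments fields; field names on the left are TRL's,
# comments note the corresponding Table 6 entry.
config = GRPOConfig(
    per_device_train_batch_size=2,   # batch_size
    learning_rate=5e-6,              # learning_rate
    max_prompt_length=512,           # max_prompt_length
    num_train_epochs=1,              # num_epochs
    gradient_accumulation_steps=4,   # gradient_accumulation_steps
    adam_beta1=0.9,                  # adam_beta1
    adam_beta2=0.99,                 # adam_beta2
    warmup_ratio=0.1,                # warmup_ratio
    max_grad_norm=0.1,               # max_grad_norm
)

# antithesis_penalty lives in the reward computation: per Table 6,
# 0.5 is subtracted for each parallelism error in a Lüshi poem.
ANTITHESIS_PENALTY = 0.5
```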