Table 6: Reinforcement Learning Hyperparameters
From: Large language models learning to write rhyming Tang poetry: A Xunzi Yayun R1 case study
| Hyperparameter | Description | Value |
|---|---|---|
| batch_size | Batch size for training | 2 |
| learning_rate | Learning rate | 5e-6 |
| max_prompt_length | Maximum prompt length (tokens) | 512 |
| num_epochs | Number of training epochs | 1 |
| gradient_accumulation_steps | Gradient accumulation steps | 4 |
| adam_beta1 | Adam optimizer first-moment decay coefficient (β1) | 0.9 |
| adam_beta2 | Adam optimizer second-moment decay coefficient (β2) | 0.99 |
| warmup_ratio | Learning-rate warmup ratio | 0.1 |
| max_grad_norm | Gradient clipping threshold | 0.1 |
| antithesis_penalty | Penalty per parallelism (antithesis) error in Lüshi (Algorithm 4) | 0.5 |
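To make the table concrete, here is a minimal sketch of how these values might be wired into an RL training run. The table does not name a training framework, so `trl`'s `GRPOConfig`, the `output_dir` path, and the `shaped_reward` helper below are illustrative assumptions, not the authors' code; Algorithm 4 is represented here only by its fixed 0.5-per-error deduction.

```python
# Hypothetical configuration sketch; the paper does not specify its
# training framework, so trl's GRPOConfig is an assumption.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs",            # assumed output path
    per_device_train_batch_size=2,   # batch_size
    learning_rate=5e-6,              # learning_rate
    max_prompt_length=512,           # max_prompt_length (tokens)
    num_train_epochs=1,              # num_epochs
    gradient_accumulation_steps=4,   # gradient_accumulation_steps
    adam_beta1=0.9,                  # Adam first-moment decay coefficient
    adam_beta2=0.99,                 # Adam second-moment decay coefficient
    warmup_ratio=0.1,                # warmup_ratio
    max_grad_norm=0.1,               # gradient clipping threshold
)

# antithesis_penalty is paper-specific reward shaping (Algorithm 4):
# each parallelism (antithesis) error in a Lüshi poem deducts 0.5 from
# the reward. This helper is a hypothetical stand-in for that step.
ANTITHESIS_PENALTY = 0.5

def shaped_reward(base_reward: float, num_antithesis_errors: int) -> float:
    """Subtract the fixed penalty once per detected parallelism error."""
    return base_reward - ANTITHESIS_PENALTY * num_antithesis_errors
```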