Table 1 Pretraining hyperparameters of each MLM model.
| Hyperparameter | BERT | DistilBERT | ELECTRA | RoBERTa |
|---|---|---|---|---|
| Tokenization | WordPiece | WordPiece | WordPiece | Byte-pair encoding (BPE) |
| Model parameters | Hidden activation: GeLU; vocab size: 52,000; batch size: 32; sequence length: 512; dropout rate: 0.1; attention dropout: 0.1; learning rate: 3e-5; epochs: 5; hidden size: 768; attention heads: 12; hidden layers: 6; train steps: 5,000 (shared across all models) | | | |
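The values in Table 1 map directly onto standard configuration objects. The sketch below is a minimal illustration, assuming the Hugging Face `transformers` library (the table does not name a framework), of how the shared hyperparameters would be set for the BERT variant; the analogous `DistilBertConfig`, `ElectraConfig`, and `RobertaConfig` classes take the same architectural fields. The `output_dir` value is purely illustrative.

```python
from transformers import BertConfig, BertForMaskedLM, TrainingArguments

# Architectural hyperparameters from Table 1 (BERT column).
config = BertConfig(
    vocab_size=52_000,                 # Vocab size: 52,000
    hidden_size=768,                   # Hidden size: 768
    num_hidden_layers=6,               # Hidden layers: 6
    num_attention_heads=12,            # Attention heads: 12
    hidden_act="gelu",                 # Hidden activation: GeLU
    hidden_dropout_prob=0.1,           # Dropout rate: 0.1
    attention_probs_dropout_prob=0.1,  # Attention dropout: 0.1
    max_position_embeddings=512,       # Sequence length: 512
)
model = BertForMaskedLM(config)

# Optimization hyperparameters from Table 1.
# Note: when max_steps is set, the Hugging Face Trainer stops at that step
# count even if num_train_epochs has not been reached.
training_args = TrainingArguments(
    output_dir="mlm-pretrain",         # hypothetical output path
    per_device_train_batch_size=32,    # Batch size: 32
    learning_rate=3e-5,                # Learning rate: 3e-5
    num_train_epochs=5,                # Epochs: 5
    max_steps=5_000,                   # Train steps: 5,000
)
```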