Table 1. Corpus statistics for Assamese-English and Bodo-English after preprocessing.
| Language pair | Split | Sentences | Avg. length (tokens) | Vocab size |
|---|---|---|---|---|
| as-en | Train | 127,000 | 18.2 | 45,000 |
| as-en | Validation | 7,000 | 18.1 | – |
| as-en | Test | 7,000 | 18.0 | – |
| brx-en | Train | 80,000 | 16.7 | 38,000 |
| brx-en | Validation | 4,000 | 16.5 | – |
| brx-en | Test | 4,000 | 16.6 | – |