Table 1. Corpus statistics for the Assamese-English (as-en) and Bodo-English (brx-en) data after preprocessing.

From: Cross-lingual sparse-MoE distillation for efficient low-resource Assamese–English and Bodo–English translation

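| Language pair | Split | Sentences | Avg. length (tokens) | Vocab size |
|---|---|---|---|---|
| as-en | Train | 127,000 | 18.2 | 45,000 |
| as-en | Validation | 7,000 | 18.1 | |
| as-en | Test | 7,000 | 18.0 | |
| brx-en | Train | 80,000 | 16.7 | 38,000 |
| brx-en | Validation | 4,000 | 16.5 | |
| brx-en | Test | 4,000 | 16.6 | |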
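
For readers who want to produce comparable figures for their own data, the following is a minimal sketch of how the three reported quantities (sentence count, average length in tokens, vocabulary size) could be computed for one split. It assumes whitespace tokenization and counts distinct surface tokens as the vocabulary. The table does not state which tokenizer was used or which side of each pair the statistics describe, so the function name `corpus_stats` and the toy sentences are illustrative only.

```python
from collections import Counter


def corpus_stats(sentences):
    """Return (sentence count, average tokens per sentence, vocabulary size)."""
    vocab = Counter()
    total_tokens = 0
    for sentence in sentences:
        # Assumption: whitespace tokenization; the paper's tokenizer may differ.
        tokens = sentence.split()
        total_tokens += len(tokens)
        vocab.update(tokens)
    n = len(sentences)
    # Guard against an empty split to avoid division by zero.
    avg_len = total_tokens / n if n else 0.0
    return n, round(avg_len, 1), len(vocab)


# Toy data for illustration, not taken from the corpus.
sample = ["the river is wide", "the boat is small", "we cross the river"]
n_sentences, avg_length, vocab_size = corpus_stats(sample)
print(n_sentences, avg_length, vocab_size)
```

With a subword tokenizer, both the average length and the vocabulary size would change, so numbers produced this way are only comparable to the table if the same preprocessing is applied.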