Table 1 The languages employed and the corpora from which the word frequencies were extracted.

From: Language statistics as a window into mental representations

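| Language family | Sub-family | Language | Corpus size (million tokens) | Corpus name |
|---|---|---|---|---|
| Indo-European | Germanic | English | 1909 | ukWaC |
| Indo-European | Germanic | German | 1339 | deWaC |
| Indo-European | Germanic | Dutch | 2539 | nlTenTen14 |
| Indo-European | Romance | French | 1331 | frWaC |
| Indo-European | Romance | Italian | 1556 | itWaC |
| Indo-European | Romance | Spanish | 98 | SpanishWaC |
| Indo-European | Romance | Portuguese | 3896 | ptTenTen11 |
| Indo-European | Italic | Latin | 11 | LatinISE |
| Indo-European | Slavic | Russian | 14,554 | ruTenTen11 |
| Indo-European | Slavic | Polish | 7716 | plTenTen12 |
| Indo-European | Slavic | Czech | 10,502 | csTenTen17 |
| Indo-European | Slavic | Croatian | 1210 | hrWaC |
| Indo-European | Baltic | Latvian | 530 | LatvianWaC |
| Indo-European | Hellenic | Greek | 124 | gkWaC |
| Indo-European | Indo-Aryan | Urdu | 53 | UrduWaC |
| Indo-European | Indo-Aryan | Hindi | 108 | HindiWaC |
| Indo-European | Indo-Aryan | Bengali | 12 | bnWaC |
| Uralic | Finno-Ugric | Hungarian | 2573 | huTenTen12 |
| Turkic | Oghuz | Turkish | 33 | trWaC |
| Afro-Asiatic | Semitic | Arabic | 7476 | arTenTen12 |
| Afro-Asiatic | Semitic | Hebrew | 48 | hebWaC |
| Afro-Asiatic | Semitic | Amharic | 26 | amWaC |
| Afro-Asiatic | Cushitic | Somali | 72 | soWaC |
| Niger-Congo | Bantu | Swahili | 18 | SwahiliWaC |
| Niger-Congo | Volta-Niger | Yoruba | 3 | YorubaWaC |
| Dravidian | Southern | Tamil | 27 | TamilWaC |
| Austronesian | Malayo-Polynesian | Malay | 183 | MalaysianWaC |
| Austronesian | Malayo-Polynesian | Tagalog (Filipino) | 198 | tlTenTen19 |
| Sino-Tibetan | Sinitic | Chinese (simplified) | 13,531 | zhTenTen17 |
| Japonic | | Japanese (Kanji) | 337 | jpWaC |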

  1. The languages investigated in Studies 1 and 2 are displayed in boldface; all other languages were added in Study 3. Corpus size refers to the number of tokens in each corpus after non-alphabetic characters and annotation tags were removed.
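
The corpus-size definition in the table note (tokens counted after removing annotation tags and non-alphabetic characters) can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes that annotation tags are XML-style markup such as `<s>` or `<doc id="1">`, that tokens are separated by whitespace, and that "alphabetic" means Unicode letters plus the combining marks that scripts such as Devanagari need. The names `count_tokens`, `is_alphabetic`, and `TAG_PATTERN` are hypothetical.

```python
import re
import unicodedata

# Assumption: annotation tags are XML/SGML-style markup, e.g. <s> or <doc id="1">.
TAG_PATTERN = re.compile(r"<[^>]+>")


def is_alphabetic(ch: str) -> bool:
    """Treat Unicode letters (L*) and combining marks (M*) as alphabetic."""
    return unicodedata.category(ch)[0] in ("L", "M")


def count_tokens(raw_text: str) -> int:
    """Count whitespace-separated tokens left after cleaning.

    Step 1: replace annotation tags with a space.
    Step 2: delete every character that is neither whitespace nor alphabetic.
    Step 3: split on whitespace; tokens that became empty disappear.
    """
    without_tags = TAG_PATTERN.sub(" ", raw_text)
    cleaned = "".join(
        ch for ch in without_tags if ch.isspace() or is_alphabetic(ch)
    )
    return len(cleaned.split())


sample = '<doc id="1"><s>The cat sat on 2 mats!</s></doc>'
print(count_tokens(sample))  # 5
```

In this sketch the purely numeric token `2` vanishes and the punctuation is stripped from `mats!`, so five tokens remain. Dividing such a count by 1,000,000 would give a value in the unit used in the corpus-size column. Scripts written without spaces, such as Chinese or Japanese, would need a word segmenter before this step, which the sketch does not attempt.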