Table 1 The languages employed and the corpora from which the word frequencies were extracted.
From: Language statistics as a window into mental representations
| Language family | Sub-family | Language | Corpus size (millions) | Corpus name |
|---|---|---|---|---|
| Indo-European | Germanic | English | 1909 | ukWaC |
| Indo-European | Germanic | German | 1339 | deWaC |
| Indo-European | Germanic | Dutch | 2539 | nlTenTen14 |
| Indo-European | Romance | French | 1331 | frWaC |
| Indo-European | Romance | Italian | 1556 | itWaC |
| Indo-European | Romance | Spanish | 98 | SpanishWaC |
| Indo-European | Romance | Portuguese | 3896 | ptTenTen11 |
| Indo-European | Italic | Latin | 11 | LatinISE |
| Indo-European | Slavic | Russian | 14,554 | ruTenTen11 |
| Indo-European | Slavic | Polish | 7716 | plTenTen12 |
| Indo-European | Slavic | Czech | 10,502 | csTenTen17 |
| Indo-European | Slavic | Croatian | 1210 | hrWaC |
| Indo-European | Baltic | Latvian | 530 | LatvianWaC |
| Indo-European | Hellenic | Greek | 124 | gkWaC |
| Indo-European | Indo-Aryan | Urdu | 53 | UrduWaC |
| Indo-European | Indo-Aryan | Hindi | 108 | HindiWaC |
| Indo-European | Indo-Aryan | Bengali | 12 | bnWaC |
| Uralic | Finno-Ugric | Hungarian | 2573 | huTenTen12 |
| Turkic | Oghuz | Turkish | 33 | trWaC |
| Afro-Asiatic | Semitic | Arabic | 7476 | arTenTen12 |
| Afro-Asiatic | Semitic | Hebrew | 48 | hebWaC |
| Afro-Asiatic | Semitic | Amharic | 26 | amWaC |
| Afro-Asiatic | Cushitic | Somali | 72 | soWaC |
| Niger-Congo | Bantu | Swahili | 18 | SwahiliWaC |
| Niger-Congo | Volta-Niger | Yoruba | 3 | YorubaWaC |
| Dravidian | Southern | Tamil | 27 | TamilWaC |
| Austronesian | Malayo-Polynesian | Malay | 183 | MalaysianWaC |
| Austronesian | Malayo-Polynesian | Tagalog (Filipino) | 198 | tlTenTen19 |
| Sino-Tibetan | Sinitic | Chinese (simplified) | 13,531 | zhTenTen17 |
| Japonic | – | Japanese (Kanji) | 337 | jpWaC |