Table 1 Introduction of eye-tracking datasets across different languages.

From: Hong Kong Corpus of Chinese Sentence and Passage Reading

Corpus names (abbreviations)	Language	Participants	Word tokens read by one participant	Accumulated word tokens¹
Dundee Corpus	English L1 & French L1	10 native speakers each	Tokens: 56,216 (types: 9,776); Tokens: 52,173 (types: 11,321); newspaper texts	1,083,890
Potsdam Sentence Corpus (PSC)	German	222 native speakers	Tokens: 1,138; Sentences: 144	252,636
Dutch Eye-Movements ONline Internet Corpus (DEMONIC)	Dutch	55 native speakers	Tokens: 1,746; Sentences: 224	96,030
Balanced Corpus of Contemporary Written Japanese (BCCWJ-EyeTrack)	Japanese	24 native speakers	Bunsetsu²: 411 out of 1,643; 20 newspaper texts	9,864
Ghent Eye-Tracking Corpus (GECO)	Dutch L1 & English L2	19 unbalanced bilinguals	Tokens: 59,716 (types: 5,575); Gulliver’s Travels I	1,134,604
	Dutch L1 & English L2	19 unbalanced bilinguals	Tokens: 54,364 (types: 5,012); Gulliver’s Travels II	1,032,916
	English	14 monolinguals	Tokens: 54,364 (types: 5,012); Gulliver’s Travels	761,096
Provo Corpus	English	84 native speakers	Tokens: 2,689 (types: 1,197); Passages: 55	145,206
Zurich Cognitive Language Processing Corpus (ZuCo)	English	12 native adults	Tokens: 21,629; Sentences: 1,107	259,548
Russian Sentence Corpus (RSC)	Russian	96 Russian participants	Tokens: 1,362; Sentences: 144	196,128
Beijing Sentence Corpus (BSC)	Chinese	60 native speakers	Tokens: 1,685; Sentences: 120	101,100
Multilingual Eye-Movement Corpus (MECO)	Dutch	45 native speakers	Tokens: 2,231; Sentences: 112	100,395
	English	46 native speakers	Tokens: 1,540; Sentences: 112	70,840
	Estonian	52 native speakers	Tokens: 2,109; Sentences: 99	109,668
	Finnish	49 native speakers	Tokens: 1,487; Sentences: 110	72,863
	German	45 native speakers	Tokens: 2,027; Sentences: 115	91,215
	Greek	45 native speakers	Tokens: 2,083; Sentences: 99	93,735
	Hebrew	47 native speakers	Tokens: 1,950; Sentences: 121	91,650
	Italian	54 native speakers	Tokens: 2,114; Sentences: 90	114,156
	Korean	32 native speakers	Tokens: 1,796; Sentences: 101	57,472
	Norwegian	42 native speakers	Tokens: 2,106; Sentences: 116	88,452
	Russian	46 native speakers	Tokens: 1,894; Sentences: 107	87,124
	Spanish	48 native speakers	Tokens: 2,412; Sentences: 98	115,776
	Turkish	29 native speakers	Tokens: 1,697; Sentences: 104	49,213
Ghent Eye-tracking COrpus of sentence reading for Chinese-English bilinguals (GECO-CN)	Chinese L1 & English L2	32 bilinguals	Tokens: 59,403 (types: 5,053); Sentences: 5,066; The Mysterious Affair at Styles (Chapters 1–7)	1,900,896
	Chinese L1 & English L2	32 bilinguals	Tokens: 56,841 (types: 5,363); Sentences: 5,242; The Mysterious Affair at Styles (Chapters 8–13)	1,818,912
Copenhagen Corpus of eye tracking recordings from natural reading of Danish texts (CopCo)	Danish	22 native speakers	Tokens: 34,897; Sentences: 1,832; speech manuscripts	767,734
Chinese Eye-Movement Database (CEMD)	Simplified Chinese	1,718 native speakers	Types: 8,551; Sentences: 8,015	1,339,960³
TURead	Turkish	196 native speakers	Tokens: 2,943 (types: 2,185); 192 short texts, each composed of 1–3 sentences	576,828

Note. ¹Accumulated word tokens are approximated by multiplying the number of tokens read by one participant by the number of participants. ²A Japanese bunsetsu unit is composed of a content word plus functional morphology. ³This figure is the total number of fixation points, not accumulated word tokens.
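
As an illustration of the rule in note 1, the short Python sketch below recomputes the accumulated word tokens for a few rows of Table 1. The row values are copied from the table; the function name `accumulated_tokens` and the `rows` dictionary are hypothetical names introduced only for this example. For the Dundee Corpus, the listed total appears to correspond to the sum of the two token counts multiplied by the 10 participants per language, which is an inference from the arithmetic and not something the table states.

```python
# Sketch of note 1: accumulated tokens = tokens per participant x participants.
# Values are copied from Table 1; all names here are illustrative only.

def accumulated_tokens(tokens_per_participant: int, participants: int) -> int:
    """Return the approximate accumulated word tokens for one corpus row."""
    return tokens_per_participant * participants

# (tokens read by one participant, number of participants)
rows = {
    "PSC": (1138, 222),
    "BSC": (1685, 60),
    "MECO (Dutch)": (2231, 45),
    # Assumption: the two Dundee token counts are summed before multiplying.
    "Dundee": (56216 + 52173, 10),
}

totals = {name: accumulated_tokens(t, p) for name, (t, p) in rows.items()}

for name, total in totals.items():
    print(f"{name}: {total:,}")
```

Because the note describes the totals as approximate, a recomputed value may not match the listed figure for every row.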
