Table 1. Overview of eye-tracking datasets across different languages.

From: Hong Kong Corpus of Chinese Sentence and Passage Reading

| Corpus names (abbreviations) | Language | Participants | Word tokens read by one participant | Accumulated word tokens¹ |
| --- | --- | --- | --- | --- |
| Dundee Corpus | English L1 & French L1 | 10 native speakers each | Tokens: 56,216 (types: 9,776); newspaper texts | 1,083,890 |
| | | | Tokens: 52,173 (types: 11,321); newspaper texts | |
| Potsdam Sentence Corpus (PSC) | German | 222 native speakers | Tokens: 1,138; Sentences: 144 | 252,636 |
| Dutch Eye-Movements ONline Internet Corpus (DEMONIC) | Dutch | 55 native speakers | Tokens: 1,746; Sentences: 224 | 96,030 |
| Balanced Corpus of Contemporary Written Japanese (BCCWJ-EyeTrack) | Japanese | 24 native speakers | Bunsetsu²: 411 out of 1,643; 20 newspaper texts | 9,864 |
| Ghent Eye-Tracking Corpus (GECO) | Dutch L1 & English L2 | 19 unbalanced bilinguals | Tokens: 59,716 (types: 5,575); Gulliver’s Travels I | 1,134,604 |
| | | | Tokens: 54,364 (types: 5,012); Gulliver’s Travels II | 1,032,916 |
| | English | 14 monolinguals | Tokens: 54,364 (types: 5,012); Gulliver’s Travels | 761,096 |
| Provo Corpus | English | 84 native speakers | Tokens: 2,689 (types: 1,197); Passages: 55 | 145,206 |
| Zurich Cognitive Language Processing Corpus (ZuCo) | English | 12 native adults | Tokens: 21,629; Sentences: 1,107 | 259,548 |
| Russian Sentence Corpus (RSC) | Russian | 96 Russian participants | Tokens: 1,362; Sentences: 144 | 196,128 |
| Beijing Sentence Corpus (BSC) | Chinese | 60 native speakers | Tokens: 1,685; Sentences: 120 | 101,100 |
| Multilingual Eye-Movement Corpus (MECO) | Dutch | 45 native speakers | Tokens: 2,231; Sentences: 112 | 100,395 |
| | English | 46 native speakers | Tokens: 1,540; Sentences: 112 | 70,840 |
| | Estonian | 52 native speakers | Tokens: 2,109; Sentences: 99 | 109,668 |
| | Finnish | 49 native speakers | Tokens: 1,487; Sentences: 110 | 72,863 |
| | German | 45 native speakers | Tokens: 2,027; Sentences: 115 | 91,215 |
| | Greek | 45 native speakers | Tokens: 2,083; Sentences: 99 | 93,735 |
| | Hebrew | 47 native speakers | Tokens: 1,950; Sentences: 121 | 91,650 |
| | Italian | 54 native speakers | Tokens: 2,114; Sentences: 90 | 114,156 |
| | Korean | 32 native speakers | Tokens: 1,796; Sentences: 101 | 57,472 |
| | Norwegian | 42 native speakers | Tokens: 2,106; Sentences: 116 | 88,452 |
| | Russian | 46 native speakers | Tokens: 1,894; Sentences: 107 | 87,124 |
| | Spanish | 48 native speakers | Tokens: 2,412; Sentences: 98 | 115,776 |
| | Turkish | 29 native speakers | Tokens: 1,697; Sentences: 104 | 49,213 |
| Ghent Eye-tracking COrpus of sentence reading for Chinese-English bilinguals (GECO-CN) | Chinese L1 & English L2 | 32 bilinguals | Tokens: 59,403 (types: 5,053); Sentences: 5,066; The Mysterious Affair at Styles (Chapters 1–7) | 1,900,896 |
| | | | Tokens: 56,841 (types: 5,363); Sentences: 5,242; The Mysterious Affair at Styles (Chapters 8–13) | 1,818,912 |
| Copenhagen Corpus of eye tracking recordings from natural reading of Danish texts (CopCo) | Danish | 22 native speakers | Tokens: 34,897; Sentences: 1,832; speech manuscripts | 767,734 |
| Chinese Eye-Movement Database (CEMD) | Simplified Chinese | 1,718 native speakers | Types: 8,551; Sentences: 8,015 | 1,339,960³ |
| TURead | Turkish | 196 native speakers | Tokens: 2,943 (types: 2,185); 192 short texts, each composed of 1–3 sentences | 576,828 |

Note. ¹Accumulated word tokens are roughly calculated as the number of tokens multiplied by the number of participants. ²A Japanese bunsetsu unit is composed of a content word plus functional morphology. ³This figure is the total number of fixation points, not the accumulated word tokens.
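
The rough calculation described in the note can be illustrated with a minimal Python sketch. The figures are copied from four rows of Table 1; the function name `accumulated_tokens` and the `rows` dictionary are illustrative assumptions, not part of the original article.

```python
# Rough estimate described in the table note: tokens read by one participant
# multiplied by the number of participants. Figures are copied from Table 1;
# the function and variable names are illustrative only.

def accumulated_tokens(tokens_per_participant: int, participants: int) -> int:
    """Return the rough accumulated token count for one corpus."""
    return tokens_per_participant * participants

# corpus: (tokens read by one participant, participants, value reported in Table 1)
rows = {
    "PSC": (1138, 222, 252636),
    "DEMONIC": (1746, 55, 96030),
    "ZuCo": (21629, 12, 259548),
    "BSC": (1685, 60, 101100),
}

for name, (tokens, participants, reported) in rows.items():
    estimate = accumulated_tokens(tokens, participants)
    print(f"{name}: estimate {estimate:,} (table: {reported:,})")
```

For these four corpora the product matches the reported value. Because the note describes the calculation as rough, other rows need not equal this product exactly.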