Introduction

Mild cognitive impairment (MCI) is a neurocognitive condition characterised by cognitive deficits that exceed what might be expected for a person’s age and educational background but are not severe enough to disrupt instrumental daily activities. This impairment exists without other identifiable causes, such as neurological, psychiatric, or systemic disorders, metabolic dysfunction, or medication effects1. The term MCI was first used to describe the transition between normal cognition and dementia in Alzheimer’s disease (AD)2, but MCI precedes other types of dementia as well1. In clinical cohorts, the annual conversion rate from MCI to AD ranges between 10% and 15%2.

Reading is a sophisticated cognitive activity that relies on numerous mental abilities, including attention, visual and auditory processing, working, short-term and long-term memory, and sensory integration3. Several studies have examined reading ability in MCI in combination with eye-tracking. In one of the first studies to examine eye movement behaviour in patients with cognitive decline, the eye movements of 14 patients with mild to moderate AD were recorded and compared to those of 14 healthy controls (HC) while they read a predefined text4. Compared to HC, AD patients did not differ in saccade duration, but they had significantly longer fixation times, more forward saccades per line of text, and more saccadic regressions. Fernández et al.5 reported that patients with early AD had fewer words with only one fixation, a higher total number of fixations in the first and second pass, a higher total number of saccadic regressions, and a higher number of skipped words. They also had smaller outgoing saccades and, at the same time, longer fixation durations. In another study by the same group6, gaze duration changed in HC as a function of sentence predictability (the higher the predictability, the shorter the gaze duration). AD patients showed an increase in gaze duration compared to HC, both for sentences with low and high predictability. The information from these studies was later used to build a machine learning based model7 that could identify early AD with up to 89.78% accuracy. A further extension of this work8 on a different sample of patients with MCI and HC used the same basic descriptors as Biondi et al. (2017)7, modified with frequency information (how often a word appears in the text). A similar accuracy of 86% was obtained in distinguishing MCI from HC, confirming the feasibility of using machine learning approaches to distinguish individuals with (mild) cognitive decline from healthy controls.
These results were later confirmed by another study by the same group9, in which their best multimodal language classifier and the neuropsychological classifier reached AUROC = 0.90 and CA = 0.84. Reviews10,11 on this topic also suggest that eye movements can be used to discriminate between MCI and HC. Furthermore, this approach can be used to predict the transition from MCI to AD, as eye movements are sensitive to subtle changes in memory, visual, and executive functions10. Taken together, the available data suggest that reading combined with eye-tracking is a promising tool to discriminate between subjects with and without cognitive impairment (CI).

Most of the studies mentioned tested this approach in patients with early AD; only a few included patients with MCI. In addition, the number of patients (MCI/AD) and HC in these studies is rather small (ranging from 14 to 40). Moreover, only three studies7,8,9 used machine learning approaches to discriminate between patients and HC, and one of the prerequisites for the successful use of machine learning algorithms is a large sample size. Finally, it is not clear which basic descriptors, derived from eye-tracking alone or combined with neuropsychological and/or clinical data, are optimal to discriminate between MCI and HC.

Individuals with AD have been suggested to experience altered reading abilities due to primary oculomotor abnormalities such as instability in fixation4,12. Those with AD progressively show impairments in both inhibitory control and correction of eye movement errors12. In particular, the capacity to voluntarily shift the gaze away from a conspicuous stimulus in the antisaccade task (AST) is compromised in AD. Moreover, the frequency of errors in the AST is correlated with the severity of AD12. Eye movement deficits may appear early in the disease’s progression, well before cognitive impairments become apparent through standard neuropsychological assessments13. Consequently, the question arises as to whether eye-movement impairment can be detected in individuals with MCI. Furthermore, eye movement disorders have been described in other neurodegenerative types of dementia, including frontotemporal dementia (FTD)14 and Lewy body dementia (LBD)14,15. To the best of our knowledge, no studies have been performed so far on eye-tracking during reading in MCI preceding either FTD or LBD.

Identifying MCI using inexpensive and simple tools is thus essential, since invasive and/or expensive biomarkers may not be widely available, especially for population-level usage. Eye-tracking is a promising method that fulfils these criteria. Combined with reading, a task that is heavily dependent on impaired cognitive functions in the early stages of the disease, eye-tracking could be the ideal approach for convenient detection of cognitive decline.

In this study, we investigated the differences in eye movement during reading between people with various levels of cognitive impairment and healthy controls. Based on the observed differences, we then designed the features to use in a predictive model for automatic detection of MCI. Finally, we conducted a validation study to test the predictive model on a completely independent dataset.

The paper is structured as follows. First we report on study participants and their characteristics in Section 2 and describe the machine learning setup in Section 3. Section 4 presents the results of statistical comparisons and initial machine learning, while Section 5 presents the results of the independent follow-up validation study. The final two sections discuss our results and conclude the paper.

A preliminary version of this work has been reported in a paper16 accepted for presentation at the IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI) 2021.

Study subjects

The data used in this analysis was collected in two separate studies. The first study was performed to explore reading as a biomarker and to fit a predictive model. The follow-up study collected data from new participants, used to validate the proposed model. The experimental design was nearly identical for both studies. Based on user and expert feedback from the first study, some eye-tracking tasks were updated for the second one; however, the reading task investigated in this paper remained unchanged.

Enrollment criteria

In the initial study, subjects were recruited from nursing homes for the elderly. Recruitment occurred either through physician referrals, when individuals expressed concern about their potential cognitive issues (subjective cognitive impairment), or through volunteers who learned about the study from other participants or directly from the researchers. The follow-up study was conducted at a private medical centre in Slovenia, using the same recruitment methods as the initial study.

The following inclusion criteria were observed when recruiting participants:

  • subjects with CI older than 40 years;

  • subjects with no CI older than 40 years.

At the same time, the following exclusion criteria were used for recruitment:

  • subjects with uncorrected visual impairments;

  • subjects with concomitant neurological diseases;

  • subjects with psychiatric diseases, including those with GDS-15 (Geriatric Depression Scale – 15 questions 17) > 10;

  • subjects with drug or alcohol abuse;

  • subjects who declined to participate in the study or were unable to perform all the tests.

The data was collected under a clinical study approved by the National Medical Ethics Committee of the Republic of Slovenia (approval numbers: 0120-400/2015-5 dated April 2, 2016; 0120-400/2015/9 dated May 22, 2018; and 0120-400/2015/17 dated June 26, 2019; the Committee also issued a notice of notifications and minor changes no. 0120-400/2015/20, dated September 17, 2019). The clinical studies were conducted in accordance with a clinical protocol, good clinical practice and applicable national regulations to ensure that the rights, safety and welfare of the subjects were protected in accordance with the ethical principles enshrined in the Declaration of Helsinki.

The purpose and procedures of the study were explained to the subjects verbally and in writing prior to inclusion in the study. Based on this information, they were free to choose to participate in the study without any pressure or conditions. All subjects took part in the study voluntarily and were informed of the possibility of terminating the study or withdrawing their consent at any time.

Subjects’ characteristics

This subsection describes the characteristics of subjects in the first study, used for exploration and modelling. The characteristics of subjects in the validation study are given alongside the validation results in Section 5 for ease of understanding.

We analysed data from 115 subjects who were consecutively recruited for the study. The youngest subject was 43 years old, and the oldest was 94, with a median age of 68 years. The dataset comprised 85 female and 30 male subjects.

Diagnoses were made using the DSM-V criteria18. Based on neurological and psychological assessments (described below), each participant was classified into one of the following categories: healthy (no cognitive decline), borderline (some cognitive decline detected but not sufficient for an MCI diagnosis), MCI (mild cognitive impairment), or possible Alzheimer’s disease (AD).

Subjects with MCI needed to have at least one out of six cognitive domains affected (memory and learning, complex attention, executive function, language, perceptual-motor, or social cognition) with no impairments in basic activities of daily living and no dependence on others in any instrumental activities of daily living. Possible AD was diagnosed as impairment in at least one cognitive domain, but with impairments in basic activities of daily living and dependence on others in the instrumental activities of daily living. Subjects with some impairment in one cognitive domain, but not enough to diagnose MCI, were designated as borderline.

A total of 53 subjects were diagnosed as healthy, 32 as borderline, 19 as having MCI, and 11 as possible AD. For the purpose of machine learning, we defined a binary classification problem with two groups: (1) healthy controls (HC), and (2) cognitively impaired (denoted as CI), which includes borderline, MCI, and subjects with possible AD. Table 1 provides further details on the age and gender distribution for each diagnosis/group.

Table 1 Gender, age, and cognitive scores distribution per diagnosis/group.

Subjects assessment

Each subject underwent an assessment lasting approximately two hours, which was conducted in three steps:

  1. Neurological examination

    This step involved evaluating the participants’ cognitive condition and, when necessary, their motor and non-motor functions. Demographic information was gathered through a questionnaire that included details about family and medical history, current treatments, potential head injuries, and other relevant factors.

  2. Psychological assessment

    The psychological assessment, performed by a certified clinical psychologist, involved evaluating higher cognitive capabilities, including executive functions and memory. To assess cognitive abilities, the following standardised tests were employed: ACE-R (Addenbrooke’s Cognitive Examination-Revised)19, FAB (Frontal Assessment Battery)20, CTMT (Comprehensive Trail Making Test)21, and GDS-15 (Geriatric Depression Scale – 15 questions)17.

    Subjects with an ACE-R score of less than 86 (for subjects aged 50-59 years), less than 85 (for subjects aged 60-69 years), or less than 84 (for subjects aged 70-79 years), and/or subjects with a Mini-Mental State Examination (MMSE, part of ACE-R) score of less than 26, were candidates for placement into the borderline, MCI, or possible AD group. In addition to the participants’ age, their education level was also taken into account, in accordance with the psychological tests used. For the cutoffs, we followed established clinical practice, e.g. ACE-R19, MMSE22,23. The final decision about group placement, including for subjects flagged as healthy by the above-mentioned tests, was taken by a neurologist who was presented with complete information after the clinical examination of the patient and after reviewing the psychological report. More details about the distribution of cognitive scores per diagnosis/group are given in Table 1.

  3. Eye-tracking test battery

    The eye-tracking test battery was conducted using a 90 Hz Tobii 4C eye-tracker, specialised software developed by NEUS Diagnostics, d.o.o., and associated computer equipment, including a laptop for the examiner and a 23.6-inch display with a 1920 × 1080 pixel resolution for the participants. A trained technician administered the tests. Participants sat approximately 70 cm from the display and had no direct contact with the equipment, interacting with the test battery solely through their gaze. The NEUS software presented visual stimuli on the screen and guided participants through the entire test battery. It recorded their eye movements and stored the data in a database for later analysis. All data were anonymised, and access to the database was restricted to researchers involved in the study.

    After the initial 5-point eye-tracker calibration, subjects were asked to read the text displayed on the screen in a large font. This task was presented as another part of the calibration process before the actual test, in order to reduce the stress subjects might feel due to testing; the reasoning is that they might then perform the reading task more naturally. The text was arranged over 11 lines in three paragraphs, occupying most of the display area. After reading the text, subjects were instructed to explicitly confirm that they had finished reading by directing their gaze at the button in the bottom right corner of the screen (see Figure 1). On average, the reading task took from less than half a minute for healthy subjects to just over a minute for patients with possible AD.
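The age-dependent screening cutoffs from the psychological assessment step above can be summarised in a few lines. This is only an illustrative sketch: the function and constant names are hypothetical, education level is not modelled, no ACE-R cutoff is assumed for ages outside the three stated bands, and the outcome is merely a candidate flag, since the final group placement was decided by a neurologist.

```python
from typing import Optional

# ACE-R cutoffs per age band as stated in the text; a score below the cutoff
# flags the subject as a candidate for the borderline, MCI, or possible AD group.
ACE_R_CUTOFFS = {(50, 59): 86, (60, 69): 85, (70, 79): 84}
MMSE_CUTOFF = 26


def ace_r_cutoff(age: int) -> Optional[int]:
    """Return the ACE-R cutoff for the subject's age band, or None if the text states none."""
    for (low, high), cutoff in ACE_R_CUTOFFS.items():
        if low <= age <= high:
            return cutoff
    return None


def is_ci_candidate(age: int, ace_r: int, mmse: int) -> bool:
    """Flag a subject as a CI candidate (assumption: education level is ignored here)."""
    cutoff = ace_r_cutoff(age)
    below_ace_r = cutoff is not None and ace_r < cutoff
    return below_ace_r or mmse < MMSE_CUTOFF
```

For example, `is_ci_candidate(65, 84, 28)` flags the subject because 84 is below the cutoff of 85 for the 60-69 band, whereas `is_ci_candidate(65, 85, 28)` does not.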

Machine learning setup

The main machine learning objective was to differentiate between CI subjects and HC based on their gaze behaviour during reading. We therefore defined a binary class, CI versus HC. As previously described, there were 62 subjects in the CI group and 53 in the HC group. We used only the nine reading features described below, without any additional data to learn from. We made this choice because we wanted to investigate how well a fast, stand-alone reading test can work on its own.

Fig. 1

The short reading task (in Slovenian), comprising brief instructions and a confirmation button in the bottom right corner. The translation is as follows: “Dear user! Please read these brief instructions carefully. Their main purpose is to get your eyes used to the luminance and resolution of the computer screen for later testing. Any necessary adjustments will be detected by the computer itself and will happen automatically in the background. Before the start of the testing, we will present each of the short tasks to you. All tasks are performed by looking at the screen only. The test will take approximately 15 minutes. Brief instructions will be given for each task. These will also be given in audio form. You will hear a short warning signal before the start of each task. If you are ready, you can start with the test. Thank you for your participation!”

Eye-tracking reading features

The raw eye-tracking data are an event series of (x, y) screen coordinates with the accompanying timestamp t, sampled at 90 Hz. We transformed the raw data into higher-level features using domain knowledge to (1) better facilitate the machine learning process, and (2) gain human-comprehensible insights into differences between the observed groups.

Pre-processing

First, we discarded events with invalid coordinates — instances where the eye-tracker failed to detect the eye due to blinking, the user looking away from the screen, or similar issues. We also excluded the initial 5% and the final 10% of each recording since gaze behaviour can be erratic at the start and end of a recording session. These two thresholds were set based on the data from the pilot study (not reported here).
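The cleaning step can be sketched as follows. The details are assumptions made for illustration: samples are taken to be `(t, x, y)` tuples, invalid coordinates are assumed to be encoded as `None` or NaN, and the 5% and 10% margins are applied by sample count after invalid samples are dropped (the text does not specify whether trimming is by count or by time). The function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

# One raw sample: (timestamp in seconds, x in pixels, y in pixels).
Sample = Tuple[float, Optional[float], Optional[float]]


def is_valid(sample: Sample) -> bool:
    """A sample is valid if both coordinates are present and not NaN (assumed encoding)."""
    _, x, y = sample
    if x is None or y is None:
        return False
    return not (math.isnan(x) or math.isnan(y))


def clean_recording(
    samples: List[Sample],
    head_fraction: float = 0.05,
    tail_fraction: float = 0.10,
) -> List[Sample]:
    """Drop invalid samples, then trim the first 5% and the last 10% of the recording."""
    valid = [s for s in samples if is_valid(s)]
    n = len(valid)
    start = int(n * head_fraction)
    end = n - int(n * tail_fraction)
    return valid[start:end]
```

With 100 valid samples, this keeps samples 5 through 89, that is, 85 samples.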

After cleaning the eye-tracking data, we extracted fixations and saccades. A fixation was defined as maintenance of the gaze on a single point, while a saccade was defined as a rapid, simultaneous movement of both eyes between two fixation points. Operationally, we identified fixations as sequences of consecutive samples lasting at least 48 milliseconds with inter-sample distances no greater than 100 pixels (27.2 mm). These parameters were empirically determined to suit our sampling frequency and the relative size of the text displayed on the screen. Each pair of consecutive fixations identified in this manner defines one saccade.
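A minimal sketch of this extraction is given below. It reads "inter-sample distance" as the distance between successive samples (a dispersion-based reading is also conceivable), measures fixation duration from the first to the last sample in a group, and places each fixation at the centroid of its samples. These choices, as well as the class and function names, are assumptions made for illustration and are not taken from the study software.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Sample = Tuple[float, float, float]  # (t in seconds, x in pixels, y in pixels)


@dataclass
class Fixation:
    start: float  # timestamp of the first sample
    end: float    # timestamp of the last sample
    x: float      # centroid of the samples (assumption)
    y: float


@dataclass
class Saccade:
    start: float  # end of the preceding fixation
    end: float    # start of the following fixation
    dx: float     # horizontal displacement; positive means rightwards
    dy: float


def _close_group(group: List[Sample], min_duration_s: float, out: List[Fixation]) -> None:
    """Append the group as a fixation if it lasts long enough."""
    if group and group[-1][0] - group[0][0] >= min_duration_s:
        xs = [s[1] for s in group]
        ys = [s[2] for s in group]
        out.append(Fixation(group[0][0], group[-1][0], sum(xs) / len(xs), sum(ys) / len(ys)))


def detect_fixations(
    samples: Sequence[Sample],
    max_dist_px: float = 100.0,
    min_duration_s: float = 0.048,
) -> List[Fixation]:
    """Group consecutive samples that stay within max_dist_px of their predecessor."""
    fixations: List[Fixation] = []
    group: List[Sample] = []
    for sample in samples:
        if group:
            prev = group[-1]
            if math.hypot(sample[1] - prev[1], sample[2] - prev[2]) > max_dist_px:
                _close_group(group, min_duration_s, fixations)
                group = []
        group.append(sample)
    _close_group(group, min_duration_s, fixations)
    return fixations


def saccades_between(fixations: Sequence[Fixation]) -> List[Saccade]:
    """Each pair of consecutive fixations defines one saccade."""
    return [
        Saccade(a.end, b.start, b.x - a.x, b.y - a.y)
        for a, b in zip(fixations, fixations[1:])
    ]
```

At 90 Hz, six consecutive samples span about 55.6 ms and therefore qualify as a fixation under the 48 ms threshold, whereas five samples (about 44.4 ms) do not.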

Definitions and meaning of reading features

Using the extracted fixations and saccades, we defined several reading features: the distances of forward and backward saccades, the variability in backward saccade distances, the speeds of forward and backward saccades, fixation durations, the variation in fixation durations, the ratio of forward to backward saccades, and the number of fixations per second. The goal of selecting these features was to capture as many aspects of the reader’s behaviour as possible while maintaining a reasonable number of variables. Detailed descriptions of these features are given in16 and are also provided below for easier understanding of this paper.

Direction of a saccade can be either forward (right) or backward (left). It is detected from the screen positions of the fixations before and after the saccade in question. The forward saccade distance (fdist) and backward saccade distance (bdist) features are defined as the median distance of all detected saccades in the given direction. These features correlate with the speed of reading, but also contain additional information, e.g., bdist is closely associated with starting a new line of text or with re-reading some parts of the text.

The amount of variation in backward saccade distances (bdist.std) is the standard deviation of the distances of all detected backward saccades. As already mentioned, backward saccades are closely associated with starting a new line of text and thus tend to be long. However, variation in these distances can signal that the user is re-reading some words, is lost in the text, etc.

The forward saccade speed (fspeed) and backward saccade speed (bspeed) are defined as the median speed of all detected saccades in the given direction. While speed and distance correlate with each other, they are not the same, so these features can offer additional insight into the reading behaviour.

The duration of fixations (fdur) is defined as the median duration of all detected fixations, and the amount of variation in fixation duration (fdur.std) as the standard deviation of the durations of all detected fixations. Both of these features can be associated with erratic or confused behaviour of the reader.

The ratio between forward and backward saccades (fsVbs) is computed as the number of detected forward saccades divided by the number of detected backward saccades.

The number of fixations per second (fxt) is the time-normalised number of detected fixations. This feature is a more robust version of reading speed (or time spent during reading).
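The nine features defined above can be computed from a list of fixations roughly as follows. Several details are assumptions, since the text does not fix them: fixations are given as `(start, end, x, y)` tuples, saccade distance is Euclidean, saccade speed is distance divided by the time gap between the two fixations, the sample standard deviation is used, saccades with no horizontal displacement are skipped, and fxt is normalised by the time from the first fixation start to the last fixation end. The function name is hypothetical.

```python
import math
import statistics
from typing import Dict, List, Sequence, Tuple

Fixation = Tuple[float, float, float, float]  # (start s, end s, x px, y px)


def _median(values: List[float]) -> float:
    return statistics.median(values) if values else math.nan


def _stdev(values: List[float]) -> float:
    return statistics.stdev(values) if len(values) >= 2 else math.nan


def reading_features(fixations: Sequence[Fixation]) -> Dict[str, float]:
    """Compute the nine reading features from time-ordered fixations."""
    fwd_dist: List[float] = []
    bwd_dist: List[float] = []
    fwd_speed: List[float] = []
    bwd_speed: List[float] = []
    for (_, e0, x0, y0), (s1, _, x1, y1) in zip(fixations, fixations[1:]):
        dx = x1 - x0
        gap = s1 - e0  # saccade duration, assumed to be the gap between fixations
        if dx == 0 or gap <= 0:
            continue  # direction or speed undefined (assumption: skip)
        dist = math.hypot(dx, y1 - y0)
        if dx > 0:  # forward means rightwards
            fwd_dist.append(dist)
            fwd_speed.append(dist / gap)
        else:       # backward means leftwards
            bwd_dist.append(dist)
            bwd_speed.append(dist / gap)

    durations = [end - start for start, end, _, _ in fixations]
    total_time = fixations[-1][1] - fixations[0][0] if fixations else 0.0
    return {
        "fdist": _median(fwd_dist),
        "bdist": _median(bwd_dist),
        "bdist.std": _stdev(bwd_dist),
        "fspeed": _median(fwd_speed),
        "bspeed": _median(bwd_speed),
        "fdur": _median(durations),
        "fdur.std": _stdev(durations),
        "fsVbs": len(fwd_dist) / len(bwd_dist) if bwd_dist else math.nan,
        "fxt": len(fixations) / total_time if total_time > 0 else math.nan,
    }
```

Distances are in pixels and speeds in pixels per second; features that cannot be computed (for example, fsVbs when no backward saccade was detected) are returned as NaN in this sketch.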

Algorithms

There was no additional pre-processing of the data beyond the feature construction process described in the previous section. The only exception was excluding one case from the HC group that was an extreme outlier due to missing data despite multiple recalibrations.

We explored four machine learning algorithms: logistic regression (LR), naïve Bayes (NB), stochastic gradient boosting (XGBoost algorithm; XGB), and random forest (RF). Each algorithm’s parameters were optimised with internal cross-validation. The complete machine learning pipeline was implemented using the scikit-learn v0.23.1 library.

The algorithms were evaluated using the following measures: classification accuracy (CA), Brier score (Brier)24, area under the receiver-operating-characteristic curve (AUC), sensitivity, and specificity. We used 10-times repeated 10-fold cross validation. We also used the majority classifier (dummy; always classifying into the majority class) for comparison.
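The evaluation protocol can be illustrated with a dependency-free sketch; the study itself used scikit-learn, so the code below is not the authors' implementation. It assumes labels coded as 1 for CI and 0 for HC, a decision threshold of 0.5, the common binary form of the Brier score (mean squared difference between predicted probability and outcome), plain rather than stratified folds, and that both classes are present in the evaluated data. The function names are hypothetical.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def repeated_kfold(
    n_samples: int, n_splits: int = 10, n_repeats: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for n_repeats reshuffled k-fold partitions."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


def evaluate(
    y_true: Sequence[int], p_ci: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Compute CA, Brier score, AUC, sensitivity and specificity for binary predictions."""
    y_pred = [int(p >= threshold) for p in p_ci]
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    tn = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 0)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)
    pos = [p for y, p in zip(y_true, p_ci) if y == 1]
    neg = [p for y, p in zip(y_true, p_ci) if y == 0]
    # AUC as the probability that a random CI subject receives a higher score
    # than a random HC subject, counting ties as one half.
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    n = len(y_true)
    return {
        "CA": (tp + tn) / n,
        "Brier": sum((p - y) ** 2 for y, p in zip(y_true, p_ci)) / n,
        "AUC": wins / (len(pos) * len(neg)),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

With the default settings, `repeated_kfold` yields 100 train/test splits, and the measures would typically be averaged over them.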

Results

In this section we first explore how various feature values are distributed among groups, both as a binary problem (HC versus CI) and with increasing level of cognitive impairment (from HC to possible AD). Next, we present the machine learning results of distinguishing between HC and CI. The results in this section are based on the initial study involving 115 subjects. Data from the second study, involving 99 new subjects, was intentionally withheld exclusively for validation purposes and is presented in its own section later.

Statistical distributions of reading features with increasing level of cognitive impairment

We were interested in how the reading features are distributed along all four groups of participants, from healthy controls to subjects with possible AD. Since most feature values were not normally distributed, we used the nonparametric Mann-Whitney U-test to make these comparisons. We report the results of the omnibus test comparing all four groups. The significance level for all comparisons in this study was set at 0.05 and all tests were two-tailed.

From the graphs in Figure 2 and the associated p-values we can see that all the features except bdist.std exhibit different distributions between groups. It can further be seen that the groups are ordered with respect to the level of cognitive impairment, progressing from HC to the borderline and MCI groups and finally to subjects with possible AD. This result, although expected, is nevertheless interesting and important to confirm in practice.

Fig. 2

Distributions of reading features over all four groups, along with their associated p-values.

Statistical distributions of reading features between HC and CI

While it is of clinical interest to observe how reading features are distributed with increasing level of cognitive impairment, it is the binary problem of detecting any level of cognitive impairment that is of primary interest when designing a decision support system for early detection of the disease. Therefore we also compared the distributions of the feature values between CI and HC groups. As before, we used the nonparametric Mann-Whitney U-test to test for differences. The results of these comparisons are reported in Table 2, along with some parameters of the feature value distributions.
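For a two-group comparison such as HC versus CI, the Mann-Whitney U statistic can be computed from pooled ranks, as sketched below. This is a simplified illustration, not the procedure used in the study: the two-sided p-value relies on the normal approximation without tie or continuity correction, and the function name is hypothetical. In practice, a statistics library routine would normally be used instead.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (U for group a, two-sided p-value from the normal approximation)."""
    # Pool the observations and remember which group each came from (0 = a, 1 = b).
    pooled = sorted((value, group) for group, values in enumerate((a, b)) for value in values)

    # Sum the ranks of group a, giving tied observations their average rank (midrank).
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    n_a, n_b = len(a), len(b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_two_sided
```

A U value close to 0 or to the product of the two group sizes indicates that one group's values are almost entirely below or above the other's; a value near half that product indicates heavily overlapping distributions.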

Table 2 Distributions of eye-tracking feature values between CI and HC groups.

Except for the feature bdist.std, all other metrics consistently indicate significant differences in eye movement behaviour between the two groups. Healthy controls (HC) tend to have a notably higher number of fixations per unit of time, shorter fixation durations, and a more uniform distribution of these durations. These three features are particularly distinct between the groups. Additionally, HCs make longer saccades in both forward and backward directions and perform them slightly faster. They also exhibit a higher ratio of forward to backward saccades.

Machine learning results

The results from the 10-fold cross-validation, presented in Table 3, indicate that logistic regression is the top-performing algorithm overall. Naïve Bayes, however, slightly surpasses it in the AUC metric. Furthermore, the two algorithms seem to trade sensitivity and specificity between them. Specifically, logistic regression achieves a classification accuracy of 73.1%, an AUC of 79.7%, and a Brier score of 0.189. The confusion matrix for logistic regression, shown in Table 4, reveals the types of misclassifications it makes. As evident from the matrix, the algorithm errs equally in both classes, which is expected given the relatively balanced nature of the dataset.

Table 3 The performance estimation of classification algorithms.
Table 4 Confusion matrices for the best performing logistic regression classifier (left: initial study; right: validation study).

Independent validation

As already mentioned, we performed an independent clinical study to validate our findings. This validation study observed the same protocol as the original study.

The validation study was officially registered with the Agency for Medicinal Products and Medical Devices of the Republic of Slovenia (approval no. 340-4/2019-8 dated 27 August, 2019). As such, the outcome of the examination had to be defined for all subjects – meaning that the borderline assessment was not an option in this validation study. The participants were therefore diagnosed as either HC, MCI, or possible AD.

We recruited 99 consecutive new subjects for the validation study. Five of the subjects had other diagnoses (e.g. anxiety or depression) and were not eligible for the analysis. This left us with 94 subjects.

The youngest subject was 46 and the oldest 86 years old. The median age of the subjects was 69 years. The data consists of 60 female and 34 male subjects. Based on neurological and psychological evaluation, the validation set included 53 HC, 29 subjects with MCI, and 12 subjects with dementia. The latter two groups were merged into a single group of subjects with CI (as before denoted as CI). The detailed age and gender distribution of the subjects based on their diagnoses/group is given in Table 5 along with the cognitive scores distribution.

Table 5 Gender, age, and cognitive scores distribution per diagnosis/group for subjects in the validation dataset.

The model for differentiating between HC and CI subjects, regardless of the level of impairment, was not changed: it was learned from the data of the 115 subjects in the initial study and evaluated there using cross-validation, as described in the previous section. The new data was used solely to evaluate its performance.

The results on the new data set are given in Table 6. The performance of all algorithms decreased somewhat, with CA decreasing the most, by about 5%. The LR and RF algorithms perform best and are quite similar in their performance.

As before, the misclassifications remain approximately equal in both directions, HC to CI and vice versa (Table 4).

Table 6 The performance of the classification algorithms on validation study data.

Discussion

Our results show that gaze behaviour differs significantly between HC and subjects with at least some degree of CI. Compared to HC, CI subjects have significantly fewer fixations per time unit. Their fixations are on average longer, and the durations are not distributed as uniformly. This points to their reading behaviour being slower and especially more irregular. Our findings are broadly in line with the earlier studies that investigated reading behaviour in individuals with AD or MCI4,5,6,25. Evidence from various studies and groups shows that eye-movement measures (including fixation duration, number of fixations, and the regularity of fixations) are robust indicators of CI.

Nevertheless, several methodological factors and language-related aspects can lead to variations in specific findings. First, the texts used in previous work have ranged from simple sentences to proverbs, potentially influencing the complexity and predictability of the reading materials. Second, while some studies focused on silent reading, others involved reading aloud, and the resulting eye-movement patterns can differ between these modes. Third, variations in how fixations, saccades, and regressions are defined or preprocessed (for example, thresholds for detecting fixations or noise removal procedures) inevitably affect the reported outcomes. Additionally, the languages used in the prior studies – such as Spanish in5,25 vs. English in8 – differ in their orthographic rules and word predictability patterns. Languages with more transparent orthographies or more inflectional complexity may shift the baseline for normal reading behaviour.

The graphs in Figure 2 show remarkably well that CI is really a continuum. Most of the constructed eye-tracking features increase or decrease smoothly from HC to borderline subjects, then MCI and finally to possible AD. We can see how the increasing level of impairment affects the gaze behaviour during reading: the fixations become longer and more erratic, the number of fixations per time unit steadily decreases etc. The only exception is the bdist.std feature, which is not significant in a binary setting.

While this continuous gaze behaviour is in a way an expected result, it was still important to confirm it in practice as it further validates the use of eye-tracking to detect early stages of cognitive impairment. It also means that even if the impairment is not immediately detected, it will likely become more easily detectable as it worsens over time.

As is often the case, the practical utility of machine learning results depends on their intended application. Comparing our findings with a majority classifier makes it clear that the proposed reading features carry significant information for the early detection of MCI. Additionally, the near 80% AUC score achieved through cross-validation on the training dataset suggests that a clinical decision support system (CDSS) based on a brief reading test could be practically applicable. This system could serve either as a quick, standalone test with lower precision or be integrated into a more comprehensive and time-consuming neuropsychological test battery. The validation results on an independent dataset were about 5% worse, but within the expected range. The result of AUC near the 75% mark now points more towards the second option of using reading as part of a larger test battery including other neuropsychological tests.

Most of the previous studies presented in the introduction focused on using eye-tracking during reading in patients with early AD, and only a few included patients with MCI. In contrast, our main goal was to detect the signs of cognitive impairment as early as possible, so the great majority of our CI patients were diagnosed as either MCI or borderline. Additionally, the numbers of patients and HC in the mentioned studies are rather small (ranging from 14 to 40), and the participants were specifically selected for those studies. Our study, on the other hand, included two independent cohorts. The first one, with 115 subjects, was used to build a diagnostic model to differentiate between the HC and CI groups. The second cohort included 94 subjects and was used for validation and performance evaluation of the model built on the eye-tracking data from the first cohort. We believe a larger sample size (compared to other studies) is an important addition to MCI detection research, since all accuracy measures are prone to vary, especially with smaller samples.

As mentioned before, reading could be a good addition to a larger test battery in combination with other neuropsychological tests. That said, the test battery should be as short as possible so as not to overburden the users while still giving relevant results. In this respect, a short reading task is most welcome. In previous studies the reading tasks consisted of reading a much longer text (75 sentences in5, 120 sentences in6 and26, 120 sentences and an additional 64 proverbs in7, and two texts with 146 words per text in8,9). In contrast, we used a reading task consisting of just 12 sentences, making it suitable for incorporation as one of the tasks in the envisioned test battery.

Reading is a universally practised human activity and is not an artificially designed test, which grants it substantial ecological validity in experimental settings. In our observations, participants — including those with dementia — did not struggle with reading a few paragraphs displayed on a computer screen. Conversely, some alternative tasks felt less natural and prompted numerous questions from the participants.

However, a drawback of using reading is the need to verify whether the machine-learned models are language-dependent or language-independent. While we believe that the features are general enough to be language-independent, this assumption requires confirmation. In particular, it may not hold for fundamentally different writing systems, such as Chinese, or for right-to-left scripts such as Arabic. On the positive side, in this relatively large-sample study we tested the use of eye-tracking based reading features in subjects with a spectrum of CI, including MCI and possible AD.

One could argue that the groups were not matched by age. However, the main objective of the study and of the model was to detect eye movement abnormalities, by means of eye-tracking, in subjects with cognitive decline (including borderline, MCI and AD) and without cognitive decline (healthy subjects), regardless of age. The whole sample consisted of participants randomly and consecutively recruited to the study, and since cognitive decline is correlated with age, there are some differences in age between the subgroups. Stratifying by age would be quite difficult and would also shift the sample distribution away from the one expected in the envisioned practical use; these two competing objectives cannot be satisfied at the same time.

In certain types of dementia other than AD, such as Primary Progressive Aphasia (a subtype of Frontotemporal Dementia), language abilities are the first to deteriorate. Reading plays a significant role in language skills, alongside writing, naming, language production, and comprehension. In these disorders, the proposed reading features and a clinical decision support system based on them might perform even more effectively. Moreover, such a system could be utilised not only for the earliest possible detection of these conditions, but also to differentiate them from the altered reading abilities observed in other forms of dementia, such as AD, Lewy Body Dementia, and Vascular Dementia.

Despite the current lack of comparative studies, there are strong theoretical reasons to suspect that different neurodegenerative diseases might produce disease-specific signatures. AD, for instance, is frequently associated with pronounced memory deficits (particularly episodic memory), whereas Parkinson’s Disease often involves distinct motor symptoms and executive dysfunction. Such clinical differences may well translate into different patterns in eye-tracking or linguistic measures (e.g., lexical retrieval, reading speed, error rates) if thoroughly investigated.

Conclusions

The study confirmed the hypothesis that eye-movement behaviour during reading differs significantly between cognitively impaired individuals and healthy controls. Through the proposed features, it also gave further insight into the underlying nature of these differences.

Furthermore, the study confirmed that gaze behaviour during reading deteriorates with increasing level of CI. This further validates the eye-tracking approach proposed in this paper and suggests that detection becomes increasingly likely as the level of CI increases.

The machine learning experiments show that reading and the associated features proposed in this work are a promising approach to early detection of cognitive impairment using a simple, cheap, and noninvasive test. The somewhat lower results on the independent validation dataset indicate that the short task used in our study might not suffice as a stand-alone test. We have, however, shown that reading is a significant biomarker suitable as one of the fundamental components of a clinical decision support system.