Introduction

Undiagnosed cognitive impairment (CI) is a global challenge1, with 60–90% of individuals with CI never receiving a formal diagnosis2,3. Individuals with undiagnosed CI miss out on timely clinical care4 (e.g. cognitive enhancers, behavioral management, and caregiver support)5,6,7,8, which can affect their well-being9,10 and increase their risk of premature nursing home placement11,12,13. They may also not receive adequate support to manage and coordinate the care of their chronic diseases14,15, resulting in suboptimal disease management, inappropriate healthcare utilization, and higher healthcare costs16,17. Recently, the importance of early diagnosis has been further underscored by growing literature on early interventions for CI18,19, such as risk factor modification20 and anti-amyloid monoclonal antibodies21,22.

Early symptoms of CI are often subtle. Without objective cognitive tests, these symptoms are easily mistaken for normal ageing1,5,23,24. To address this inherent challenge, various international bodies23,24,25 have advocated the use of brief cognitive tests to facilitate case-finding among high-risk individuals in the community25. Although many brief cognitive tests exist in the literature (e.g. Montreal Cognitive Assessment26, Mini-Mental State Examination27, Mini-Cog28, Memory Impairment Screen29, Brief Cognitive Assessment Tool30), most are labor-intensive and require trained professionals1,23,24,31, which limits their scalability in community settings. Equally important, most tests were developed in populations with high literacy (e.g. White populations)32 and are predicated on the assumption that respondents can read and write in a language33. This may limit the usefulness of cognitive tests in underserved communities with lower literacy (e.g. in some non-White communities, and in lower- and middle-income countries [LMICs]), which often have the largest numbers of individuals with undiagnosed CI32,34. It has also prompted a call by the 2024 Lancet Commission on dementia care32 to address the unmet need for brief cognitive tests suited to individuals with lower literacy.

Digital cognitive tests hold promise as scalable tools for detecting CI in community settings: by leveraging artificial intelligence (AI) to automate the administration and scoring of brief cognitive assessments35, they reduce dependence on trained professionals in case-finding efforts. However, despite this potential, digital cognitive testing is still a relatively nascent field35. Few digital tests have undergone rigorous validation for the detection of CI in community settings36, especially in populations with lower literacy37. To address the unmet need for scalable case-finding tools suited to lower-literacy groups, we purpose-built an AI-based digital cognitive test (denoted PENSIEVE-AI™) with the following features:

  • Designed to be self-administered (using touch-screen tablets and pre-recorded audio instructions), thus reducing dependence on trained professionals.

  • Takes <5 min to complete (comprising only four drawing tasks), making it well-suited as a brief case-finding tool in community settings.

  • Relies on drawing tasks alone, thus reducing dependence on respondents’ ability to read or write in a language33, and potentially allowing broader implementation in communities with varying literacy (such as in Singapore and other Asian populations). Arguably, drawing tasks can still be affected by literacy level38,39; but drawing is among the earliest skills that individuals develop before learning to read or write, and the ability to draw has been shown to pre-date written language in human civilizations33.

Using a large, community-representative sample from Singapore, this study aimed to:

  1. Train an image-based deep-learning model to detect mild cognitive impairment and dementia (MCI/dementia) using the four drawing tasks in PENSIEVE-AI.

  2. Examine the effects of key demographic features (e.g. education, test language) on model performance, given prior literature on the potential influence of these features on drawing tasks38,39.

  3. Compare the performance of the deep-learning model with several commonly used assessment tools in detecting MCI/dementia, across participants with lower and higher literacy.

Of note, as a city-state in South-East Asia, Singapore offers a unique testbed to develop the new digital tool. Its 6-million-strong population serves as a microcosm of Asia, representing an amalgamation of Asian culture and comprising multiple Asian ethnicities, including Chinese, Malay, Indian, and other ethnic groups40. This diversity provides a robust testing ground for assessing the new tool’s performance across varied cultural and linguistic backgrounds. Additionally, the current cohort of older individuals in Singapore witnessed the country’s transformation from a traditional, lower-income, Asian society to a more westernized, higher-income country41. Consequently, this cohort of older Singaporeans encompasses a wide range of educational backgrounds, from minimal formal education to tertiary education. By validating the digital tool in such a heterogeneous population, we sought to demonstrate its potential for broader implementation in similar multiethnic and literacy-diverse settings beyond Singapore, such as in populations across East and South Asia, and potentially in some LMICs.

Results

A total of 1758 participants were included (Table 1), with 239 (13.6%) having clinically-adjudicated MCI/dementia. Given the nature of community recruitment, most cases were in the early stages of CI (CDR global ≤1). Participants had a median age of 72 years and a median education of 10 years. Most participants could self-administer PENSIEVE-AI in <5 min (69.1% self-administered, and 77.0% completed it in <5 min), with a median completion time of 3.7 min. However, participants with MCI/dementia were more likely to need some supervision to navigate the digital interface, and took longer to complete PENSIEVE-AI (4.6–6.7 min).

Table 1 Characteristics of the study participants (n = 1758)

The study sample was split into approximately 40% for the Training sample and 20% for the Validation sample (rounded to whole numbers), with the remainder set aside as the Test sample. The split was random, stratified by clinical diagnosis (i.e. normal cognition, MCI, and dementia) to ensure balanced representation of clinical diagnoses across the split samples. Following the random split, participant characteristics were largely comparable across the three samples, as seen in Table 2. The Training sample was used to train deep-learning models to distinguish MCI/dementia from normal cognition, and the Validation sample was used to fine-tune model hyperparameters. Meanwhile, the Test sample (i.e. a single hold-out test set) was used to evaluate the actual performance of the trained models in distinguishing MCI/dementia from normal cognition, and to select the best-performing model and the optimal cutoffs.
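A stratified 40/20/40 split of this kind can be sketched in a few lines. The snippet below is an illustrative reconstruction using scikit-learn, not the study’s actual code; all function and variable names are assumptions.

```python
from sklearn.model_selection import train_test_split

def split_samples(ids, diagnoses, train_frac=0.40, val_frac=0.20, seed=0):
    """Split participants into Training/Validation/Test samples, stratified
    by clinical diagnosis, mirroring the ~40/20/40 split described above."""
    n = len(ids)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    # First carve out the Training sample, stratified by diagnosis.
    train_ids, rest_ids, _, rest_dx = train_test_split(
        ids, diagnoses, train_size=n_train, stratify=diagnoses, random_state=seed)
    # Then split the remainder into Validation and (hold-out) Test samples.
    val_ids, test_ids = train_test_split(
        rest_ids, train_size=n_val, stratify=rest_dx, random_state=seed)
    return train_ids, val_ids, test_ids
```

Because both calls stratify on the diagnosis labels, the proportions of normal cognition, MCI, and dementia are preserved (up to rounding) in each of the three samples.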

Table 2 Comparison of participant characteristics across training, validation and test samples

Table 3 presents the results of the trained models in the Test sample (n = 658). VGG-16 performed better than Swin Transformer among image-based models (Table 3A), and CLIP performed better than CNN-GRU among alternative models (Table 3B). Adding drawing-activity data (e.g. replaying audio instructions, repeated drawing attempts, long pauses between drawing strokes) further improved the performance of image-based models, with VGG-16 + Drawing activities emerging as the best-performing model (area under receiver-operating-characteristic curve, AUC = 93.2%; area under precision-recall curve, PR-AUC = 70.8%). Using this best model, we further examined the effects of basic demographics (i.e. age, sex, education, and test language) (Table 3C); of these, only education improved model performance further (similar AUC of 93.1%, with PR-AUC improving to 74.1%), and hence VGG-16 + Drawing activities + Education was selected as the final model (bold-faced in Table 3). Based on this final model, we conducted ablation studies to understand the relative contributions of the four drawing tasks in detecting MCI/dementia (Table 3D): Complex figure recall alone had the greatest utility in detecting MCI/dementia (AUC = 89.8%); adding Complex figure copy improved the AUC to 91.8%, and further adding Clock drawing improved it to 92.1%.
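For readers less familiar with the two metrics reported above, a minimal sketch of how AUC and PR-AUC are typically computed is shown below (via scikit-learn, using average precision as a common estimator of PR-AUC). This is illustrative only, not the study’s evaluation code.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(y_true, y_prob):
    """Return (AUC, PR-AUC) as percentages, matching the reporting style above.

    y_true: binary labels (1 = MCI/dementia, 0 = normal cognition)
    y_prob: model probability scores for MCI/dementia
    """
    auc = roc_auc_score(y_true, y_prob) * 100              # area under ROC curve
    pr_auc = average_precision_score(y_true, y_prob) * 100  # area under precision-recall curve
    return auc, pr_auc
```

A perfectly separating model scores 100% on both metrics; unlike AUC, PR-AUC is sensitive to class imbalance, which is why both are reported.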

Table 3 Comparison of the performance of trained models for distinguishing MCI/dementia from normal cognition in the Test sample (n = 658)

Table 4 compares the performance of PENSIEVE-AI and other commonly used assessment tools in the Test sample (n = 658). PENSIEVE-AI had performance comparable to the NTB (Neuropsychological Test Battery) and MoCA (Montreal Cognitive Assessment) for detecting MCI/dementia (AUC = 93.1–95.3%), both in the lower-education subgroup (AUC = 90.0–95.0%) and in the higher-education subgroup (AUC = 95.0–98.2%). In contrast, the iAD8 (Eight-item Informant Interview to Differentiate Aging and Dementia) had a significantly lower AUC for MCI/dementia, particularly among participants with ≤10 years of education (AUC = 73.2%, p < 0.001 compared with PENSIEVE-AI). For the detection of dementia, all four tools (i.e. PENSIEVE-AI, NTB, MoCA, iAD8) had comparable AUCs of >90%. AUC results remained largely similar in the two sensitivity analyses (Table 4), in which the prevalence of MCI/dementia was increased to reflect the average prevalence in most communities (i.e. 20%42,43,44,45,46,47 and 35%43,44,46,47, respectively). Additionally, several post-hoc analyses were conducted to examine potential AI biases in PENSIEVE-AI’s performance across demographic subgroups. As seen in Supplementary Table 1, PENSIEVE-AI maintained similar AUCs in detecting MCI/dementia across subgroups of age, sex, ethnicity, test language, and mode of administration.
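The prevalence readjustment used in the sensitivity analyses can be pictured as random subsampling of each diagnostic group to a target composition. The sketch below is hypothetical; the interface and names are assumptions, not the study’s code.

```python
import random

def readjust_prevalence(groups, target_shares, n_total, seed=0):
    """Randomly subsample each diagnostic group to hit a target composition.

    groups: dict mapping diagnosis -> list of participant IDs
    target_shares: dict mapping diagnosis -> desired share of the dataset
    n_total: desired total size of the readjusted dataset
    """
    rng = random.Random(seed)  # fixed seed for a reproducible subsample
    out = {}
    for dx, share in target_shares.items():
        out[dx] = rng.sample(groups[dx], round(n_total * share))
    return out
```

With `n_total = 320` and target shares of 80%/15%/5% for normal cognition/MCI/dementia, this yields the 256/48/16 composition reported for Sensitivity analysis 1.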

Table 4 Performance of PENSIEVE-AI for detecting cognitive impairment in the Test sample (n = 658), and a comparison with the performance of other commonly used assessment tools

Test statistics of PENSIEVE-AI are plotted in Fig. 1a. Adopting a two-cutoff approach, the lower cutoff (probability ≥13%) had 85.7% sensitivity and 97.5% negative predictive value, and was used to rule out MCI/dementia (for individuals with probability scores below this cutoff); the upper cutoff (probability ≥45%) had 98.8% specificity and 85.1% positive predictive value, and identified those who were likely to have MCI/dementia (i.e. to rule in MCI/dementia). The two cutoffs demarcate an intermediate range between them (greyed area in Fig. 1a), identifying those who may be at higher risk and potentially require further monitoring or assessment. The optimal cutoffs varied slightly with changing prevalence of MCI/dementia, as seen in Fig. 1b, c.
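A two-cutoff selection of this kind can be sketched as a scan over candidate thresholds. The >85% criteria for the paired statistics follow the text above, but the selection rule (largest threshold meeting the rule-out criteria, smallest meeting the rule-in criteria) and all names are illustrative assumptions, not the study’s actual procedure.

```python
def two_cutoffs(y_true, y_prob, target=0.85):
    """Return (lower, upper) probability cutoffs: the largest threshold with
    sensitivity and NPV > target (rule-out), and the smallest threshold with
    specificity and PPV > target (rule-in). None if no threshold qualifies."""
    def stats(cut):
        tp = sum(1 for p, t in zip(y_prob, y_true) if p >= cut and t)
        fp = sum(1 for p, t in zip(y_prob, y_true) if p >= cut and not t)
        fn = sum(1 for p, t in zip(y_prob, y_true) if p < cut and t)
        tn = sum(1 for p, t in zip(y_prob, y_true) if p < cut and not t)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        ppv = tp / (tp + fp) if tp + fp else 0.0
        npv = tn / (tn + fn) if tn + fn else 0.0
        return sens, spec, ppv, npv

    cuts = sorted(set(y_prob))  # candidate thresholds = observed scores
    rule_out = [c for c in cuts if stats(c)[0] > target and stats(c)[3] > target]
    rule_in = [c for c in cuts if stats(c)[1] > target and stats(c)[2] > target]
    return (max(rule_out) if rule_out else None,
            min(rule_in) if rule_in else None)
```

Scores below the lower cutoff rule out MCI/dementia, scores above the upper cutoff rule it in, and the gap between them forms the intermediate (monitoring) range.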

Fig. 1: Plot of sensitivity, specificity, NPV and PPV based on probability scores of PENSIEVE-AI in the Test sample (n = 658).

a Main results based on the full Test sample (n = 658). Adopting a two-cutoff approach, the lower cutoff yields sensitivity and negative predictive value (NPV) (red lines) of >85% each, and is used to rule out mild cognitive impairment and dementia (MCI/dementia) when probability scores fall below this threshold. The upper cutoff yields specificity and positive predictive value (PPV) (blue lines) of >85% each, and is used to rule in MCI/dementia when probability scores exceed this threshold. The greyed area (demarcated by the lower and upper cutoffs) represents the intermediate range, identifying those who may be at higher risk and may require further monitoring or assessment. b Results based on Sensitivity analysis 1, in which the prevalence of MCI/dementia was readjusted to 20% in the Test sample, based on prior meta-analytic findings that community prevalence was ~15% for MCI and ~5% for dementia. A subset of participants with MCI and dementia was randomly selected from the Test sample to readjust the prevalence in the dataset (see Methods section for further details). The resulting dataset comprised 256 participants with normal cognition (80%), 48 participants with MCI (15%), and 16 participants with dementia (5%). c Results based on Sensitivity analysis 2, in which the prevalence of MCI/dementia was readjusted to 35% in the Test sample, based on prior meta-analytic findings that community prevalence could be as high as ~25% for MCI and ~10% for dementia. A subset of participants with MCI and dementia was randomly selected from the Test sample to readjust the prevalence in the dataset (see Methods section for further details). The resulting dataset comprised 104 participants with normal cognition (65%), 40 participants with MCI (25%), and 16 participants with dementia (10%). Source data are provided as a Source Data file.

Effectively, the two cutoffs identified 3 risk categories for cognitive impairment: (1) Less likely to have cognitive impairment; (2) Higher risk of cognitive impairment; and (3) Likely to have cognitive impairment. These 3 categories, along with their cross-tabulation against the final diagnoses, are presented in Table 5. In the first category (Less likely to have cognitive impairment), 92–98% of individuals had normal cognition. In the second category (Higher risk of cognitive impairment), 20–40% of individuals were diagnosed with MCI. In the third category (Likely to have cognitive impairment), 85–88% of individuals had MCI/dementia, with a large proportion having dementia (26–36%). Distinctions between these 3 risk categories are also visible in Fig. 2: the first category (white region, with probability scores below the lower cutoff) identified those with normal cognition; the third category (dark grey region, with probability scores above the upper cutoff) identified almost all individuals with dementia; and the second category (light grey region between the lower and upper cutoffs) mostly captured those with MCI. Detailed results on test statistics are available in Supplementary Tables 2–4.
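The three-tier categorization above amounts to a simple threshold rule over the probability score. The cutoffs (13% and 45%) are those reported for the main analysis; the function itself is purely illustrative.

```python
def risk_category(prob, lower=0.13, upper=0.45):
    """Map a PENSIEVE-AI probability score to one of the 3 risk categories,
    using the lower (rule-out) and upper (rule-in) cutoffs described above."""
    if prob < lower:
        return "Less likely to have cognitive impairment"
    if prob < upper:
        return "Higher risk of cognitive impairment"
    return "Likely to have cognitive impairment"
```

The `lower` and `upper` parameters can be re-tuned for populations with a different prevalence of MCI/dementia, as in the sensitivity analyses.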

Table 5 Cross-tabulation between the output from PENSIEVE-AI and the final diagnosis in Test sample (n = 658)
Fig. 2: Box plots showing the distribution of PENSIEVE-AI’s probability scores in the Test sample (n = 658).

a Main results based on the full Test sample (n = 658). The box plot’s center line, box limits, and whiskers denote the median, lower and upper quartiles, and 1.5× interquartile range, respectively. The red dots represent individual datapoints. The two horizontal dashed lines represent the two optimal cutoffs for PENSIEVE-AI. The lower cutoff has high sensitivity and negative predictive value (>85% each), and is used to rule out mild cognitive impairment and dementia (MCI/dementia) when probability scores fall below this threshold (white region). The upper cutoff has high specificity and positive predictive value (>85% each), and identifies individuals likely to have MCI/dementia when probability scores exceed this threshold (dark grey region). The light grey region (demarcated by the lower and upper cutoffs) represents the intermediate range, identifying individuals who may be at higher risk and may require further monitoring or assessment. b Results based on Sensitivity analysis 1, in which the prevalence of MCI/dementia was readjusted to 20% in the Test sample, based on prior meta-analytic findings that community prevalence was ~15% for MCI and ~5% for dementia. A subset of participants with MCI and dementia was randomly selected from the Test sample to readjust the prevalence in the dataset (see Methods section for further details). The resulting dataset comprised 256 participants with normal cognition (80%), 48 participants with MCI (15%), and 16 participants with dementia (5%). c Results based on Sensitivity analysis 2, in which the prevalence of MCI/dementia was readjusted to 35% in the Test sample, based on prior meta-analytic findings that community prevalence could be as high as ~25% for MCI and ~10% for dementia. A subset of participants with MCI and dementia was randomly selected from the Test sample to readjust the prevalence in the dataset (see Methods section for further details). The resulting dataset comprised 104 participants with normal cognition (65%), 40 participants with MCI (25%), and 16 participants with dementia (10%). Source data are provided as a Source Data file.

Discussion

Summary of findings

Brief cognitive tests are crucial for detecting subtle, early symptoms of CI. However, most require trained professionals, limiting their scalability, and many were developed in high-literacy populations, limiting their usefulness in lower-literacy subgroups. In this study, we developed a purpose-built AI tool for early detection of CI, based on 4 drawing tasks that can be self-administered by most participants in <5 min and that do not rely on the ability to read or write in a language. The new PENSIEVE-AI was trained and validated against clinically-adjudicated diagnoses in a large, prospectively-recruited community sample. Among the trained deep-learning models, VGG-16 demonstrated the highest performance; adding Drawing activities (e.g. pauses between drawing strokes) significantly improved performance, adding Education marginally improved performance, and adding Test language did not improve performance. The best-performing model (VGG-16 + Drawing activities + Education) demonstrated excellent performance in detecting MCI/dementia, comparable to detailed neuropsychological testing and the MoCA. Results remained consistent across education subgroups, and when the prevalence of MCI/dementia was readjusted to reflect the average prevalence in most communities (i.e. ~15–25% for MCI42,43,44 and ~5–10% for dementia45,46,47).

Interpretation of findings

Our findings highlight a key strength of PENSIEVE-AI in particular, and of digital cognitive tests in general. Despite PENSIEVE-AI’s brevity (comprising only 4 test items and taking <5 min to complete), it achieves an AUC comparable to detailed neuropsychological testing (which often requires at least 1–2 h to complete). This success is plausibly attributable to the capture of additional data on test processes (i.e. drawing activities)35, which provides valuable information to offset the reduced number of test items. As seen in our findings, this test-process data was highly informative in guiding diagnosis. Plausibly, nuanced behaviors during cognitive testing reflect subtle cognitive changes better than final test scores do, especially at early stages of CI. In a way, this process data mimics the conventional practice of recording qualitative observations during detailed neuropsychological testing, providing information complementary to final test scores. Such process data could not feasibly be captured in pen-and-paper versions of brief cognitive tests, given the labor intensity of recording such qualitative observations in routine clinical practice.

Although many digital cognitive tests have been developed in the literature37, most were pilot studies with smaller samples that primarily correlated the digital test with another neuropsychological test (i.e. without evaluation against actual clinical diagnoses)37. The Brain Health Assessment (BHA) is among the few that have shown promising results36. BHA shares some similarities with PENSIEVE-AI: both were trained on gold-standard clinical diagnoses in large community samples, and BHA also involves 4 tasks capturing various cognitive domains, with a similar reported AUC (up to 91.9%) for detecting MCI/dementia36. However, BHA differs in design from PENSIEVE-AI, requiring trained professionals to administer its 10-min test, whereas PENSIEVE-AI is designed to be primarily self-administered in <5 min. BHA was also developed in White populations with high literacy (average education of 16–17 years in the development sample36, with recent pilot validations in non-White populations36,48,49), in contrast to PENSIEVE-AI’s development in a multiethnic Asian population with lower literacy (average education of 10 years).

In the extant literature, few digital cognitive tests rely solely on drawing tasks, with clock drawing being the most widely adopted50. Consistent with the literature, our findings indicate that clock drawing alone is insufficient to detect early CI51,52, and must be combined with at least one other test that evaluates another cognitive domain52,53,54. Our findings further show that memory tasks are crucial for detecting early CI, possibly because they capture early memory decline related to the most common aetiology (i.e. Alzheimer’s disease). The final model, incorporating a memory task and 3 other drawing tasks, achieved an AUC of >93%, which is among the highest reported to date for drawing-based digital cognitive tests. Consistent with the literature, our findings also suggest some influence of educational attainment on drawing-based tasks38,39, whereby the inclusion of education as a covariate further improved model performance (Table 3C). At the same time, the findings affirm our initial hypothesis that drawing tasks may be less affected by literacy: the education covariate only marginally improved model performance, and after its inclusion, the final model demonstrated performance comparable to detailed neuropsychological testing even among individuals with lower literacy (Table 4).

Implications of findings

PENSIEVE-AI offers a scalable solution for case-finding of CI in the community. To address the global challenge of undiagnosed CI1,2,3, the International Association of Gerontology and Geriatrics has advocated annual evaluation of cognitive function among older age-groups (e.g. all individuals ≥70 years)25. Yet few viable options are available to date for large-scale deployment. Unlike most brief cognitive tests, PENSIEVE-AI does not require trained professionals to administer, making it well-suited as a scalable tool for case-finding of CI in large populations. Given its brevity, PENSIEVE-AI can be easily embedded within routinely-conducted comprehensive geriatric assessments in the community, or used as a follow-up assessment in conjunction with subjective questionnaires55,56 (i.e. to provide more conclusive evidence of CI among individuals who screen positive on subjective questionnaires)57. Considering that PENSIEVE-AI can be self-administered by a majority of participants, it may also be deployed as standalone kiosks in community settings with high volumes of higher-risk older persons (e.g. primary care clinics), allowing individuals with cognitive concerns to complete brief cognitive evaluations. This approach can be especially cost-saving, as it does not require professional staff to run the community kiosks, needing at most lay volunteers on standby to supervise those with difficulty navigating the digital interface.

At the population level, PENSIEVE-AI can serve as an efficient risk-stratification tool. As shown in Fig. 1, cutoffs for PENSIEVE-AI can be adjusted depending on the prevalence of MCI/dementia in different populations, to identify individuals with varying risks of CI. Low-risk individuals (<10% probability of MCI/dementia) may be reassured and advised to repeat the test after a longer time horizon (e.g. 3–5 years). Intermediate-risk individuals (~25–40% probability of MCI/dementia) can be advised to consult a physician if concerned about cognition, or to repeat the test in 1 year for closer monitoring. High-risk individuals (>85% probability of MCI/dementia) will benefit from direct referral to memory clinics for further assessment and management. Notably, the high-risk group captures most of the individuals with dementia (Fig. 2); thus, this category can also be the primary focus in communities more interested in detecting dementia than MCI. The risk-stratification approach described here is summarized in Table 6.

Table 6 Potential clinical implications based on output from PENSIEVE-AI

Future directions

Moving forward, there are several avenues to further expand the applicability and utility of PENSIEVE-AI. An immediate direction is to translate the tool into other local languages and dialects in Singapore (e.g. Malay, Tamil, Cantonese, Hokkien, and Teochew), to enhance accessibility and inclusivity within Singapore’s multiethnic population. This effort is readily attainable, as PENSIEVE-AI is largely language-neutral (i.e. the test input is based on drawing data alone) and requires only the translation of test instructions. This approach is also supported by findings from the current study, which show that test language had minimal impact on PENSIEVE-AI’s overall performance (as seen in Table 3C and Supplementary Table 1). On a related note, PENSIEVE-AI may also hold potential for broader implementation in other literacy-diverse populations similar to Singapore (e.g. those across East and South Asia and some LMICs), given current findings that it is less affected by literacy (as shown in Table 3C and Table 4). However, this potential will need to be verified through future validation in populations beyond Singapore, considering prior literature on the potential cultural impact on drawing tasks similar to those used in PENSIEVE-AI38,58. Lastly, although current findings demonstrated the usefulness of PENSIEVE-AI for detecting the presence of MCI/dementia cross-sectionally, efforts are also ongoing to evaluate its utility in generating a global cognitive score, alongside further longitudinal evaluations of the psychometrics of this score with respect to test-retest reliability and validity in tracking cognitive decline over time.

Limitations

Several limitations are notable. First, PENSIEVE-AI is less useful for individuals with severe visual impairment or hand movement difficulties, as it requires the ability to see on-screen figures and draw with a stylus. Second, while being language-neutral is a strength of PENSIEVE-AI, it can also pose a limitation. The drawing-based tasks may plausibly capture less information on the language domain, potentially reducing PENSIEVE-AI’s sensitivity in detecting language dysfunction. This limitation is particularly relevant in young-onset CI, where language dysfunction can be more prevalent as the initial presentation (due to higher proportions of non-Alzheimer’s diseases in young-onset CI, e.g. frontotemporal lobar degeneration). Third, while digital cognitive tests have inherent strengths and appeal35, they also present new barriers, particularly for individuals with lower literacy and in LMICs. For example, individuals with lower literacy may be unfamiliar with using technology, and some LMICs may have limited access to touch-screen tablets and technology infrastructure. To mitigate these limitations, we conducted extensive user design iterations in this study to tailor PENSIEVE-AI to the needs of older individuals with less digital literacy (Supplementary Method 1). We also ensured that PENSIEVE-AI is compatible with generic, low-specification touch-screen tablets, and requires only intermittent internet connection to generate results from cloud-hosted deep-learning models. Additionally, we designed PENSIEVE-AI as an assessor- or center-based tool (i.e. not installed on older individuals’ personal devices), so that only a limited number of tablets are needed for large-scale assessments in the community. Fourth, residual AI biases may still exist despite our best efforts to minimize the biases (e.g. 
through extensive recruitment efforts to obtain community-representative samples, ensuring that the project team was diverse in age, sex, ethnicity, and professional discipline [i.e. geriatric psychiatrist, geriatrician, neurologist, psychologist], and conducting post-hoc analyses to ensure no systematic biases across demographic subgroups). As an example of residual AI biases, individuals who chose to participate in this study may differ from those who opted not to, potentially reducing the community-representativeness of the recruited samples. We mitigated this limitation by employing diverse sources of community recruitment, as well as by emphasizing ‘Detect dementia early’ in our recruitment publicity (rather than a conventional invitation to participate in research, which tends to attract a distinct group of individuals) (Supplementary Method 2). Fifth, while the clinicians who determined the diagnoses in this study were blinded to the drawing data from PENSIEVE-AI, they were not blinded to participants’ demographic information (e.g. age, sex and education), because such details are often essential for accurate diagnosis (e.g. information on previous levels of cognitive abilities is critical when making clinical judgments on the presence of “significant cognitive decline”)59. While access to this demographic information might introduce potential bias into the results of PENSIEVE-AI, the risk is arguably low: as demonstrated in Table 3C, baseline models (using demographic information alone) contributed minimally to PENSIEVE-AI’s overall performance, in contrast to the substantial contributions of the drawing tasks. Sixth, PENSIEVE-AI is not intended to replace comprehensive clinical and neuropsychological assessments, as it provides neither a definitive diagnosis nor granular information on specific cognitive deficits18,60,61.

Conclusions

Using a large community sample, we developed an AI-based cognitive test, built entirely on drawing tasks, that can be self-administered in <5 min by most participants. Despite its brevity and ease of use, PENSIEVE-AI demonstrated excellent performance in detecting MCI/dementia, comparable to detailed neuropsychological testing. It can be a valuable tool where detailed neuropsychological testing is not feasible, such as when embedded within community assessments or deployed as community kiosks to identify individuals requiring further intervention. As PENSIEVE-AI is less affected by language or literacy, it holds potential for broader implementation in other literacy-diverse settings similar to Singapore, such as populations across East and South Asia and some LMICs.

Methods

Ethical approval

This study complies with all relevant ethical regulations. The research protocol was reviewed and approved by the SingHealth Centralized IRB (reference: 2021/2590). Informed consent was obtained from all participants, or from their legally authorized next-of-kin (for participants without the mental capacity to consent)62. Participants who completed the research assessments received S$80 as compensation for their time, inconvenience, and transportation costs.

Study procedures

This was a nationally-funded study in Singapore to develop an AI tool for early detection of CI (Project PENSIEVE). From March 2022 to August 2024, we prospectively recruited community-dwelling older persons based on the following criteria: (1) at higher risk of CI (i.e. aged ≥65 years25 and having at least one of three chronic diseases: diabetes mellitus, hypertension, or hyperlipidemia); (2) able to follow simple instructions in English or Mandarin Chinese; (3) did not have severe visual impairment that could affect the ability to complete drawing tasks (note: to ensure generalizability, participants were included as long as they could see pictures on a piece of paper held before them); and (4) had an informant who knew the participant well (e.g. a family member or friend). Recruitment sources included 14 community roadshows by the study team, clients of community partners, home visits by community volunteers, media publicity (radio, online articles, and posters), and word-of-mouth referrals from participants who had completed research assessments. To ensure that the recruited samples were representative of the community, the study’s publicity materials emphasized the key message of ‘Detect dementia early’ (along with direct referrals to memory clinics in the event of significant findings), rather than the conventional invitation to participate in research (which may inadvertently attract a distinct group of individuals). Samples of these publicity materials (e.g. study banner, poster, brochure) are presented in Supplementary Method 2.

The recruited participants received comprehensive assessments, which included semi-structured interviews with participants and their informants, detailed neuropsychological testing, and observational notes of participants’ behavior during assessments. Details on the comprehensive assessments are available in Supplementary Methods 3, 4. Diagnoses of MCI and dementia were made via consensus conference (by 3 dementia specialists). Dementia was diagnosed using the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders–Fifth Edition) criteria59. MCI was diagnosed using the modified Petersen criteria63. Normal cognition was diagnosed when participants were found not to have dementia or MCI.

Measures

The new digital cognitive test (henceforth denoted as PENSIEVE-AI™) comprises 4 drawing tasks, namely: (1) complex figure copy; (2) simple figure copy; (3) clock drawing; and (4) complex figure recall (i.e. recall of the complex figure from the first task). Respondents were provided with a 12.4-inch touch-screen tablet and a stylus, and asked to follow on-screen voice instructions to complete the 4 drawing tasks on the tablet (of note, the same drawing prompts were used for every assessment, to ensure consistency in administration across different assessments). Throughout the 4 tasks, drawing activities (e.g. drawing motions, replaying audio instructions, repeated drawing attempts) were also captured within the tablet and included as input data for model training. The 4 tasks were designed to cover the cognitive domains of Visuospatial abilities (tasks 1 and 2)64, Attention and Executive function (task 3)64, Memory (task 4)64, and Language (ability to follow audio instructions). Details on the user design of PENSIEVE-AI are available in Supplementary Method 1. Of note, PENSIEVE-AI was completed by participants before the start of the comprehensive assessments, and the dementia specialists determining the diagnosis in consensus conference were blinded to the drawings and drawing activities from PENSIEVE-AI (but not blinded to participants’ demographic information such as age, sex and education).

Three alternative assessment tools were included in the analyses as comparators to PENSIEVE-AI. These tools represent three common types of assessments in cognitive evaluations: an informant questionnaire (iAD8; the Eight-item Informant Interview to Differentiate Aging and Dementia)55, a brief cognitive test (MoCA; Montreal Cognitive Assessment)26 and detailed neuropsychological testing (NTB; Neuropsychological Test Battery)65. They are briefly described in the next paragraph, with further details available in Supplementary Method 4. It is important to note that the dementia specialists in this study were blinded to iAD8 results, but not blinded to those of MoCA or NTB. Given that the data from MoCA and NTB were used to inform the diagnostic process, the performance of MoCA and NTB was likely overestimated in this study (i.e. their actual performance would be lower than reported). Accordingly, readers should exercise caution when comparing these results to those of PENSIEVE-AI, interpreting them as general indicators rather than reflections of actual, real-world performance.

iAD855 is a brief questionnaire that requires informants to rate changes in participants’ cognition and function in the past few years (through yes/no responses). Its 8 items can be completed in ~3–5 min, with higher scores indicating greater cognitive problems. MoCA26 comprises 12 items that test participants in various cognitive domains. It can be completed in ~15–20 min, with higher scores reflecting better cognitive function. The NTB65 takes ~60 min to complete, and includes seven neuropsychological tests measuring the key cognitive domains of Visuospatial abilities (Benson Complex Figure Copy), Working memory (Craft Story 21 Immediate Recall), Delayed memory (Craft Story 21 Delayed Recall and Benson Complex Figure Recall), Language (Verbal Fluency–Animal), Attention/Processing speed (Trail Making Test–Part A), and Executive function (Trail Making Test–Part B).

Statistics & reproducibility

In the Training and Validation samples, we experimented with image-based models (i.e. VGG-1666 and Swin Transformer)67, sequential models (i.e. CNN-GRU)68, and zero-shot vision-language models (i.e. CLIP)69. While the drawings and the drawing activities were the main input data for model training, we also explored the inclusion of basic demographic features (e.g. age, sex, educational attainment, and test language) to assess their potential effects in improving model performance. Models were trained using focal loss70 because the dataset was imbalanced. Focal loss directs the model’s attention toward harder-to-classify examples that it often misclassifies, rather than those it already classifies correctly. Based on the predicted probability of each example, it dynamically down-weights the loss contribution of well-classified examples, so that harder, misclassified examples dominate training. This is particularly useful when one class (i.e. normal cognition) far outnumbers another (i.e. MCI/dementia) and could otherwise overwhelm the model; by focusing on the harder, less frequent examples, the model improves its ability to identify the rarer cases of MCI/dementia. Further details on model training are presented in Supplementary Method 5.
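The weighting behavior of focal loss can be sketched as follows for the binary case. This is a minimal illustration, not the study’s actual training code; the parameter defaults (γ = 2, α = 0.25) are those proposed in the original focal loss paper, and the function name is our own.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single example.

    p: predicted probability of the positive class (MCI/dementia)
    y: true label (1 = MCI/dementia, 0 = normal cognition)
    gamma down-weights well-classified examples; alpha balances the classes.
    With gamma = 0 and alpha = 1, this reduces to ordinary cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction contributes far less loss than a misclassified one,
# so rare MCI/dementia cases that the model gets wrong dominate the gradient signal.
easy = focal_loss(0.95, 1)  # well-classified positive example
hard = focal_loss(0.10, 1)  # badly misclassified positive example
```

The modulating factor (1 − p_t)^γ is what shrinks the contribution of easy examples: at p_t = 0.95 it is 0.05² = 0.0025, while at p_t = 0.10 it is 0.9² = 0.81.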

In the Test sample, predicted probabilities of the trained models were compared using the area under the receiver-operating-characteristic curve (AUC) – supplemented by the area under the precision-recall curve (PR-AUC) – to select the best-performing model for PENSIEVE-AI in distinguishing MCI/dementia from normal cognition. Thereafter, the AUC of PENSIEVE-AI was compared to the AUCs of 3 other commonly-used assessment tools (i.e. iAD8, MoCA, NTB) using the non-parametric approach proposed by DeLong et al.71,72,73,74, with analyses stratified by education subgroups (i.e. ≤10 years and >10 years of education, based on a median split). A two-cutoff approach75,76,77,78,79 was adopted for PENSIEVE-AI. The first cutoff has high sensitivity and negative predictive value (each >85%), and is used to rule out MCI/dementia (i.e. when probability scores fall below this cutoff). The second cutoff has high specificity and positive predictive value (each >85%), and identifies those who are likely to have MCI/dementia. This two-cutoff approach has been recommended in recent literature79, as it enhances test performance75,76,77,78, reduces the effects of prevalence on test performance76, and prioritizes healthcare resources for those more likely to benefit75.
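The two-cutoff selection amounts to searching the observed probability scores for the highest cutoff that still rules out reliably (sensitivity and NPV above target) and the lowest cutoff that rules in reliably (specificity and PPV above target). The sketch below is an illustrative simplification with hypothetical scores, not the study’s code.

```python
def confusion_stats(probs, labels, cutoff):
    """Sensitivity, specificity, PPV, NPV at a probability cutoff.
    labels: 1 = MCI/dementia, 0 = normal cognition; positive if prob >= cutoff."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 0)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    npv = tn / (tn + fn) if tn + fn else 0.0
    return sens, spec, ppv, npv

def two_cutoffs(probs, labels, target=0.85):
    """Highest rule-out cutoff keeping sensitivity and NPV above target,
    and lowest rule-in cutoff keeping specificity and PPV above target."""
    candidates = sorted(set(probs))
    rule_out = max((c for c in candidates
                    if confusion_stats(probs, labels, c)[0] > target
                    and confusion_stats(probs, labels, c)[3] > target),
                   default=None)
    rule_in = min((c for c in candidates
                   if confusion_stats(probs, labels, c)[1] > target
                   and confusion_stats(probs, labels, c)[2] > target),
                  default=None)
    return rule_out, rule_in
```

Scores below the rule-out cutoff are treated as negative screens; scores at or above the rule-in cutoff flag likely MCI/dementia; scores in between remain indeterminate.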

As a secondary analysis, the performance of PENSIEVE-AI was evaluated for distinguishing dementia from non-dementia. Additionally, two sensitivity analyses were conducted in the Test sample to evaluate the robustness of results when the prevalence of MCI/dementia was readjusted to reflect the average prevalence in most communities:

(1) Prevalence of MCI/dementia was artificially readjusted to 20%, based on prior meta-analytic findings that community prevalence was ~15% for MCI42,43,44 and ~5% for dementia45,46,47. Readjustment of prevalence was done by randomly selecting only a subset of participants with MCI and normal cognition – for each participant with dementia, 3 participants with MCI and 16 participants with normal cognition were randomly selected (i.e. so that the final dataset corresponded to 5% prevalence for dementia and 15% prevalence for MCI).

(2) Prevalence of MCI/dementia was artificially readjusted to 35%, based on prior meta-analytic findings that community prevalence could be as high as ~25% for MCI43,44 and ~10% for dementia46,47. Readjustment of prevalence was done by randomly selecting only a subset of participants with MCI and normal cognition – for each participant with dementia, 2.5 participants with MCI and 6.5 participants with normal cognition were randomly selected (i.e. so that the final dataset corresponded to 10% prevalence for dementia and 25% prevalence for MCI).
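The prevalence readjustment described in both sensitivity analyses amounts to stratified subsampling anchored on the dementia cases. A minimal sketch is shown below; the function and label names are ours, and fractional ratios such as 2.5 are rounded to the nearest whole count of participants.

```python
import random

def readjust_prevalence(participants, mci_per_dementia, nc_per_dementia, seed=0):
    """Subsample so that each dementia case is matched by a fixed number of
    randomly chosen MCI and normal-cognition (NC) participants.

    participants: list of (participant_id, diagnosis) tuples,
                  with diagnosis in {'dementia', 'mci', 'nc'}.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    dementia = [p for p in participants if p[1] == 'dementia']
    mci = [p for p in participants if p[1] == 'mci']
    nc = [p for p in participants if p[1] == 'nc']
    n_mci = round(mci_per_dementia * len(dementia))
    n_nc = round(nc_per_dementia * len(dementia))
    return dementia + rng.sample(mci, n_mci) + rng.sample(nc, n_nc)

# Ratios 1 : 3 : 16 (dementia : MCI : NC) yield 5% dementia and 15% MCI,
# i.e. a combined MCI/dementia prevalence of 20%; ratios 1 : 2.5 : 6.5 yield 35%.
```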

Statistical analyses were conducted in Stata (version 18). No statistical method was used to predetermine sample size. During the initial study planning, we estimated that at least 1000 samples would be required (with each sample providing 4 drawings), guided by a well-known classification challenge in recent literature (Tiny ImageNet)80. No data were excluded from the analyses.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.