Introduction

Drawing tests are well documented for their comprehensive assessment capabilities, which include evaluating visuospatial skills, visual memory, and executive function, and they are commonly used as a cognitive screening tool for dementia in the elderly population, in both clinical and research settings1. Among the most prominent drawing tests are the Pentagon Drawing Test (PDT), the Clock Drawing Test (CDT), and the Rey Complex Figure Test (RCFT). The PDT, for example, requires participants to draw two intersecting pentagons, with scoring typically binary (pass or fail)2. The CDT assesses executive function and visuospatial skills by having subjects draw a clock face set to a specific time, with scoring methods varying considerably, from binary systems to detailed point assignments based on the accuracy of the contour, number sequence, and hand placement3,4,5. The RCFT, designed by Rey6, challenges participants to copy and recall a complex figure, with a widely used 36-point scoring system developed by Osterrieth7.

Recent advancements have seen the application of machine learning approaches to enhance the predictive accuracy of cognitive status from these tests. This is particularly valuable because drawing tests are simple to administer, which makes them useful for screening early stages of dementia in clinical practice. For example, deep-learning approaches have been applied to the digitized PDT8, CDT9, and RCFT10 to distinguish MCI patients from CN subjects. Additionally, multi-dimensional kinematic parameters extracted from a digital pen and tablet during the RCFT have been analyzed using logistic regression11.

However, previous studies have some limitations. Primarily, most had small sample sizes and lacked an external test set, which undermines the reliability of the reported model performance. Even in cases where sample sizes were not small, model performance was not sufficiently robust for screening early stages of dementia. This could be attributed to the challenges inherent in using image data in deep learning models. For instance, image data often contain a vast amount of information but are also prone to noise because of their high dimensionality12,13. Moreover, image data encompass diverse patterns and features, making them challenging for models to learn effectively, especially when sample sizes are limited14.

In this paper, we propose a novel multi-stream deep learning framework composed of a spatial stream that processes raw RCFT images and a scoring stream that integrates RCFT scores generated by a previously developed AI-based scoring model along with demographic features15. The model was trained to distinguish MCI patients from CN subjects using a total of 1,740 subjects (947 CN, 793 MCI). An additional 222 subjects (106 CN, 116 MCI) served as an external test set to strengthen the reliability of the performance evaluation.

Materials and methods

Datasets

The study was approved by the Institutional Review Boards of Chonnam National University Hospital (CNUH‐2019‐279) and Wonkwang University Hospital (2022–01-024–004). All research was performed in accordance with relevant guidelines and regulations, including the Declaration of Helsinki. Informed consent was obtained from all participants and/or their legal guardians.

GARD cohort

We enrolled 1,740 subjects from the Gwangju Alzheimer’s and Related Dementia (GARD) cohort registry at Chosun University in Gwangju, Korea, during 2015–2019. The diagnostic criteria for CN and MCI have been described in Seo et al.16. Briefly, CN subjects were included if they were aged 60 or older, had a Clinical Dementia Rating (CDR) score of 0, and exhibited normal cognitive function, with all neuropsychological test z-scores above −1.5 standard deviations (SD) of age-, education-, and gender-adjusted norms. MCI patients were aged 60 or older, had a CDR score of 0.5, and met the MCI criteria established in17.

WUH cohort

The Wonkwang University Hospital (WUH) cohort includes 106 CN subjects and 116 MCI patients enrolled between 2017 and 2022. In alignment with our training set criteria, subjects were classified based on their CDR scores: a CDR score of 0 indicated a CN diagnosis, while a score of 0.5 indicated MCI.

Deep learning architecture

Figure 1A provides an overview of the proposed method. Our model predicts the probability of an individual being classified as an MCI patient using three pre-processed RCFT images along with age, sex, and years of education. The pre-processing method for the RCFT images follows the protocol outlined by Park et al.15. Our prediction model employs a dual-stream architecture: a spatial stream and a scoring stream. Both streams produce class probabilities through softmax functions, and their outputs are merged using average fusion to yield the final classification probability. In the spatial stream, each 512 × 512 image is input into a CNN model that uses EfficientNet18 as its backbone. We selected EfficientNet-B2 for its efficiency and suitability in medical applications, given its lower parameter count and adequate performance on limited datasets. EfficientNet-B2 incorporates a 3 × 3 convolution layer followed by multiple 3 × 3 and 5 × 5 mobile inverted bottleneck convolution (MBConv) blocks, a design borrowed from MobileNet19 (Fig. 1B). After the CNN, the feature map is flattened and a multi-head self-attention layer is applied, enhancing the model’s focus on significant spatial regions. The multi-head self-attention mechanism, as defined by20, combines multiple self-attention layers to capture diverse features, expressed as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O},$$
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
Fig. 1

Model architecture. (A) Overall model architecture featuring a dual-stream design: The spatial stream uses EfficientNet-B2 with a multi-head self-attention layer and the scoring stream integrates AI-generated RCFT scores (from a previously developed scoring model) together with demographic features (sex, age and years of education). Outputs from both streams are fused via average fusion to yield the final CN/MCI classification. (B) Detailed architecture of the EfficientNet-B2 model used in the spatial stream, including convolutional layers and Mobile Inverted Bottleneck Convolution (MBConv) layers.

where \(Q\), \(K\), and \(V\) are the query, key, and value matrices, respectively, and we use four attention heads (\(h\) = 4). The output of the multi-head self-attention layer is then processed through two fully connected (FC) layers followed by a softmax function.
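The post-backbone stage of the spatial stream can be sketched as follows. Only the four attention heads, the two FC layers, and the softmax come from the text; the embedding dimension, token pooling, and FC width are illustrative assumptions, not the authors' exact values:

```python
import torch
import torch.nn as nn

class SpatialStreamHead(nn.Module):
    """Flatten the CNN feature map into spatial tokens, apply multi-head
    self-attention (h = 4, Q = K = V), pool, then two FC layers + softmax.
    embed_dim and the hidden FC width are illustrative assumptions."""

    def __init__(self, embed_dim=128, num_heads=4, num_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.fc1 = nn.Linear(embed_dim, 64)
        self.fc2 = nn.Linear(64, num_classes)

    def forward(self, feat_map):
        # feat_map: (batch, embed_dim, H, W) from the EfficientNet-B2 backbone
        tokens = feat_map.flatten(2).transpose(1, 2)     # (batch, H*W, embed_dim)
        attended, _ = self.attn(tokens, tokens, tokens)  # self-attention: Q = K = V
        pooled = attended.mean(dim=1)                    # pool over spatial tokens
        logits = self.fc2(torch.relu(self.fc1(pooled)))
        return torch.softmax(logits, dim=1)

head = SpatialStreamHead()
probs = head(torch.randn(2, 128, 16, 16))                # two images, 16 x 16 feature map
```

Because the attention is applied to flattened feature-map positions, each spatial location can attend to every other location, which is what lets the model emphasize informative regions of the drawing.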

In parallel, the scoring stream incorporates RCFT scores generated by an AI-based scoring model15 and demographic variables (age, sex, and years of education). These features, the RCFT scores from the three images together with the demographic variables, are concatenated and passed through a fully connected layer followed by a softmax function. Importantly, the weights of the AI-based scoring model are frozen, so the scoring stream receives fixed RCFT score outputs during training and no parameter updates occur within this module.
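A minimal sketch of the scoring stream and the average-fusion step is shown below; the feature ordering is a hypothetical example, and a randomly initialized FC layer stands in for the trained one:

```python
import torch

def average_fusion(spatial_probs, scoring_probs):
    # average fusion of the two streams' softmax outputs
    return (spatial_probs + scoring_probs) / 2.0

torch.manual_seed(0)
# hypothetical input: three frozen AI-generated RCFT scores + age, sex, education
features = torch.tensor([[30.0, 25.0, 22.0, 72.0, 1.0, 9.0]])
scoring_fc = torch.nn.Linear(6, 2)             # single FC layer, then softmax
scoring_probs = torch.softmax(scoring_fc(features), dim=1)

spatial_probs = torch.tensor([[0.40, 0.60]])   # stand-in spatial-stream output
final_probs = average_fusion(spatial_probs, scoring_probs)
```

Since both streams emit softmax probabilities, their average is itself a valid probability distribution over the CN/MCI classes.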

Baseline models

The proposed model was evaluated against four baseline models: three logistic regression models and one deep learning model. The first baseline used MMSE scores. The second used the three RCFT scores assessed by trained experts, while the third used the three RCFT scores generated by the AI-based scoring model. The final baseline was a deep learning model that used only the spatial stream network. All baseline models included age, sex, and years of education as covariates.

Quality control of RCFT scoring using an AI-based scoring model

To mitigate potential errors arising from manual scoring, scanning, and digitization, we applied the AI-based scoring model to the external test set (n = 666 images; 222 subjects × three drawings each) to enhance data quality. For images in which the discrepancy between the expert-assessed score and the AI-generated score exceeded ten points, trained human experts re-evaluated the drawings. The updated expert scores were then compared with the AI-generated scores to ensure scoring accuracy and reliability.
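This QC rule reduces to a simple absolute-difference filter; a pure-Python sketch follows (the ten-point threshold is from the text, the helper name and toy scores are ours):

```python
def flag_for_review(expert_scores, ai_scores, threshold=10):
    """Return indices of drawings whose expert-assessed and AI-generated
    scores differ by more than `threshold` points."""
    return [i for i, (e, a) in enumerate(zip(expert_scores, ai_scores))
            if abs(e - a) > threshold]

# toy example: drawings 1 and 3 exceed the ten-point discrepancy
flagged = flag_for_review([36, 20, 5, 28], [35, 33, 4, 12])
```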

Experiments

We built the prediction model and evaluated its performance using data from the GARD and WUH cohorts. The GARD cohort was used to construct the prediction model. Throughout training, we used binary cross-entropy as the loss function and adopted the Adam optimizer to minimize it. To prevent overfitting, we reduced the learning rate to 10% of its current value every five epochs and applied early stopping if the validation loss did not improve for 30 epochs, ensuring that the final model weights corresponded to the lowest validation loss.
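The schedule and stopping rule above can be sketched in PyTorch; the toy data, stand-in network, and epoch budget are placeholders, while the step decay (10% every five epochs), binary cross-entropy loss, Adam optimizer, 30-epoch patience, and best-weight restoration follow the description:

```python
import copy
import torch

torch.manual_seed(0)
x = torch.randn(64, 10)
y = (x[:, 0] > 0).float()                        # synthetic binary labels

model = torch.nn.Sequential(torch.nn.Linear(10, 1), torch.nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
criterion = torch.nn.BCELoss()                   # binary cross-entropy

best_loss, best_state, wait, patience = float("inf"), None, 0, 30
for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x).squeeze(1), y)
    loss.backward()
    optimizer.step()
    scheduler.step()                             # LR -> 10% every five epochs
    val_loss = loss.item()                       # stand-in for a real validation pass
    if val_loss < best_loss:
        best_loss, best_state, wait = val_loss, copy.deepcopy(model.state_dict()), 0
    else:
        wait += 1
        if wait >= patience:                     # early stopping after 30 stale epochs
            break

model.load_state_dict(best_state)                # restore lowest-validation-loss weights
```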

To evaluate our model’s performance, the GARD cohort was randomly divided into training, validation, and test sets in a 6:2:2 ratio. This division process was repeated fifty times. External validation was performed using the WUH cohort. Model performance was assessed using the area under the receiver operating characteristic curve (AUC), accuracy (ACC), sensitivity (SEN), and specificity (SPE).
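The four metrics can be computed from scratch; this NumPy sketch uses the rank-sum (Mann–Whitney U) formulation of AUC, and the 0.5 decision threshold is our assumption:

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold=0.5):
    """AUC, accuracy, sensitivity, and specificity for a binary classifier
    (class 1 = MCI, class 0 = CN); illustrative evaluation sketch."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / len(y_true)
    sen = tp / (tp + fn)           # sensitivity: recall on MCI
    spe = tn / (tn + fp)           # specificity: recall on CN
    # AUC via the rank-sum (Mann-Whitney U) formulation (no tied scores assumed)
    ranks = y_prob.argsort().argsort() + 1
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    auc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return auc, acc, sen, spe

auc, acc, sen, spe = binary_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In practice, running this over the fifty random splits and reporting the mean and percentile interval of each metric reproduces the evaluation scheme described above.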

All experiments were conducted using the PyTorch library (v 2.0.0) in Python (v 3.8.8) on NVIDIA GTX 1080 Ti GPUs with 48 GB of memory per GPU.

Results

Characteristics

Table 1 summarizes the clinical characteristics of subjects in the GARD and WUH cohort datasets. In the GARD dataset, the average ages were 71.8 (\(\pm\) 6.1) years for CN subjects and 73.5 (\(\pm\) 6.4) years for MCI patients (P < 0.01). Education levels and MMSE scores also differed significantly between CN subjects (education: 10.4 \(\pm\) 4.6 years; MMSE: 27.5 \(\pm\) 2.1) and MCI patients (9.8 \(\pm\) 4.7; 25.5 \(\pm\) 3.1) (P < 0.01). Sex ratios showed a similar pattern in both groups. Conversely, the WUH dataset revealed no significant difference in average age between CN subjects (69.9 \(\pm\) 7.7) and MCI patients (71.4 \(\pm\) 8.3) (P > 0.05), nor in education level between the CN (8.7 \(\pm\) 4.2) and MCI (9.2 \(\pm\) 4.5) groups (P > 0.05). Comparing the two datasets, the external test set consistently showed lower age, education level, and RCFT scores across both groups, with the exception of the education level and RCFT copy score in the CN group of the GARD dataset.

Table 1 Descriptive statistics. A dataset of 1,740 subjects from the Gwangju Alzheimer’s and Related Dementia (GARD) cohort was used for training, and an external test set of 222 subjects from Wonkwang University Hospital (WUH) was used for validation.

Improved agreement after AI-assisted scoring quality control

The initial correlation (R2) between scores generated by the AI-based scoring model and expert-assessed scores was 0.81, with a mean absolute error (MAE) of 3.0 points (Fig. 2A). Among the 666 external test images, 30 cases showed discrepancies greater than ten points between the expert-assessed and AI-generated scores. After re-evaluation by trained experts, scores for 26 images were corrected. Following this correction, the agreement improved substantially, yielding an \(R^{2}\) of 0.95 and an MAE of 2.0 points (Fig. 2B).
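For reference, the two agreement statistics can be reproduced as follows. Note that we compute R² here as the coefficient of determination; the paper describes it as a correlation, which may instead denote a squared Pearson r. The scores below are made-up examples:

```python
import numpy as np

def r2_and_mae(ground_truth, predicted):
    """Coefficient of determination (R^2) and mean absolute error between
    expert-assessed and AI-generated RCFT scores."""
    gt = np.asarray(ground_truth, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    ss_res = np.sum((gt - pred) ** 2)             # residual sum of squares
    ss_tot = np.sum((gt - gt.mean()) ** 2)        # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mae = np.mean(np.abs(gt - pred))
    return r2, mae

r2, mae = r2_and_mae([10, 20, 30, 36], [12, 19, 29, 35])
```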

Fig. 2

Comparative validation of AI-generated and expert-assessed scores. (A) Pre–AI-assisted quality control (QC), showing the agreement between scores generated by an AI-based scoring model and expert-assessed scores. Significant discrepancies (greater than ten points) between the AI-generated scores and human expert scores (highlighted in red) led to re-examination by trained experts. (B) Post–AI-assisted QC, demonstrating improved agreement between AI-generated scores and expert-corrected scores following the expert re-evaluation. In both panels, ‘predicted scores’ refer to scores generated by an AI-based RCFT scoring model, and ‘ground truth scores’ refer to expert-assessed (or expert-corrected) scores.

Comparison of model performance via internal test using GARD cohort

We evaluated the classification performance of five models, including the proposed method: (1) logistic regression using MMSE scores; (2) logistic regression using RCFT scores assessed by experts; (3) logistic regression using RCFT scores generated by the AI-based scoring model; (4) a deep learning model using only the spatial stream network; and (5) a deep learning model employing the multi-stream network (Fig. 3). The mean performance of these models is shown in Table 2A.

Fig. 3

ROC curve for external test set (WUH cohort dataset). The ROC curve is plotted using the median AUC results from 50 bootstrap samples, illustrating the performance of different models.

Table 2 Results of model prediction performance. (A) Internal test using the GARD cohort dataset. The baseline models consisted of three logistic regression models using (1) MMSE scores, (2) expert-assessed RCFT scores, and (3) AI-generated RCFT scores produced by a previously developed AI-based scoring model, as well as (4) a deep learning model using only the spatial stream. All baseline models included chronological age, sex, and education as covariates. The data was split into 6:2:2 (training, validation, and testing sets), and this process was repeated 50 times. (B) External test using the WUH cohort dataset. Expert-assessed RCFT scores refers to the models using the initial expert-assessed scores before QC, while expert-corrected scores indicates the models using the expert-corrected scores obtained after re-evaluation based on comparisons with the AI-generated RCFT scores.

The logistic regression model with MMSE scores demonstrated the lowest performance, with an AUC of 0.714 [95% confidence interval: 0.706–0.712], an ACC of 0.660 [0.652–0.667], a SEN of 0.625 [0.613–0.636], and a SPE of 0.694 [0.685–0.704]. The logistic regression model using expert-assessed RCFT scores recorded an AUC of 0.776 [0.768–0.782], an ACC of 0.705 [0.699–0.712], a SEN of 0.700 [0.689–0.711], and a SPE of 0.710 [0.700–0.722]; the performance of the model using RCFT scores generated by the AI-based scoring model was similar, with an AUC of 0.777 [0.770–0.783], an ACC of 0.710 [0.703–0.717], a SEN of 0.699 [0.689–0.709], and a SPE of 0.721 [0.710–0.731].

Performance improvements were evident with the spatial stream network model, which achieved an AUC of 0.803 [0.768–0.837], an ACC of 0.731 [0.702–0.761], a SEN of 0.701 [0.661–0.741], and a SPE of 0.762 [0.720–0.804]. Finally, our proposed deep learning model using the two-stream network outperformed all baseline models across all metrics, with an AUC of 0.852 [0.837–0.869], an ACC of 0.771 [0.755–0.787], a SEN of 0.742 [0.718–0.767], and a SPE of 0.800 [0.774–0.823].

External validation using WUH cohort

Performance metrics for the trained models on this set are detailed in Table 2B. The logistic regression model using expert-assessed RCFT scores from the initial dataset demonstrated an AUC of 0.750 [0.750–0.751], an ACC of 0.709 [0.707–0.712], a SEN of 0.832 [0.829–0.835], and a SPE of 0.575 [0.571–0.579]. With the validated dataset based on the re-rated RCFT scores, the model’s performance improved to an AUC of 0.813 [0.812–0.814], an ACC of 0.750 [0.748–0.753], a SEN of 0.799 [0.718–0.767], and a SPE of 0.800 [0.774–0.823]. The logistic model with RCFT scores generated by the AI-based scoring model displayed performance comparable to that of human experts (AUC = 0.804 [0.803–0.805], ACC = 0.722 [0.721–0.725], SEN = 0.799 [0.797–0.802], and SPE = 0.639 [0.634–0.722]). The deep learning model employing the spatial stream network achieved a higher AUC (0.837 [0.814–0.860]), ACC (0.744 [0.719–0.768]), and SPE (0.745 [0.697–0.792]) but a lower SEN (0.743 [0.690–0.800]). Our proposed deep learning method using the two-stream network outperformed all baseline models across all metrics: AUC = 0.872 [0.862–0.882], ACC = 0.781 [0.768–0.795], SEN = 0.836 [0.807–0.864], and SPE = 0.722 [0.687–0.757].

Discussion

In this article, we developed a multi-stream deep learning network to differentiate between MCI patients and CN subjects. Our approach surpasses previous methods using drawing tests (PDT, CDT, and RCFT) by leveraging a larger sample size and an external test set, thereby enhancing the robustness and performance of the model. Notably, our model outperformed existing studies, achieving the highest reported performance metrics.

Our multi-stream network combines a scoring stream and a spatial stream. The scoring stream incorporates RCFT scores generated by an AI-based RCFT scoring model, which reduces scoring time, minimizes human resource demands, and proactively prevents human scoring errors, thus improving accuracy. This advantage was demonstrated by our results: when AI-generated RCFT scores were used during the QC process, overall model performance improved substantially compared with performance based on the initial expert-assessed scores without QC. Furthermore, while expert scoring requires approximately 5 min per subject, the AI-based scoring model produces scores in about 10 s, highlighting its efficiency and scalability in clinical settings. The spatial stream uses raw RCFT images as input and captures subtle details within the images, such as pen thickness and stroke shape, that are not reflected in the standard human scoring system (0–36 points). This complementary information leads to substantial performance gains compared to models that rely solely on scoring. However, although raw image data are rich in information, they also contain considerable noise. Accordingly, the integration of multi-head self-attention layers enables the model to prioritize crucial spatial regions within the feature map, improving performance. Nonetheless, models that depend exclusively on raw images have shown higher variability in performance than logistic models based on RCFT scores, and the performance of the spatial stream network may be compromised by resolution differences between the training images and newly acquired test images. By combining the advantages of the scoring stream, which leverages RCFT scores generated by an AI-based model trained on the human scoring system, and the spatial stream, which processes raw images, our proposed method achieves high and robust performance.

The proposed method provides a clinically practical and scalable approach for screening individuals at risk of early-stage cognitive impairment at medical check-up centers. Currently, the MMSE is the most commonly used screening tool because of its simplicity and quick administration time of approximately 5–10 min2. However, our results indicate that the MMSE is less informative for predicting MCI and showed limited accuracy in distinguishing between CN subjects and MCI patients (AUC = 0.714), consistent with previous findings (AUC = 0.733, N = 2,577)8. In contrast, comprehensive cognitive function tests such as the Neuropsychological Test Battery require substantial time, often up to 2 h, as well as additional effort for scoring and interpretation, making them impractical for large-scale screening21. Although the RCFT requires more administration time than the MMSE, approximately 30 min including a 20-min delay interval22, our RCFT-based model significantly outperformed the MMSE-based model (AUC > 0.85). Furthermore, since the model requires only RCFT drawings and basic demographic information that are already collected routinely at medical check-up centers, no additional procedures or data collection steps are needed, making its integration into existing workflows straightforward. The model also produces AI-generated RCFT scores and a predicted risk of cognitive impairment within a few seconds, eliminating the 5–10 min of clinician time typically needed for manual expert scoring. This reduction in time and personnel burden substantially enhances efficiency while maintaining high performance. Together, these advantages highlight the strong potential of the proposed method for real-world clinical adoption, offering a practical, accurate, and workflow-friendly alternative to traditional cognitive assessments.

In addition to these advantages, it is important to note that our RCFT-based approach offers differentiated clinical value compared with existing digital cognitive assessment tools (DCATs). Many widely used DCATs (e.g., ANAM, CogniCA) primarily assess reaction time, processing speed, and attentional control through brief computerized tasks. Although comprehensive platforms such as CANTAB evaluate a broader set of cognitive domains, they do not capture high-level visuospatial constructional abilities or non-verbal visual memory through complex figure copying and recall. The copy task captures spatial planning and structural integration, while the recall task provides a language-independent measure of visual memory that is particularly useful in low-education elderly populations. Furthermore, the RCFT allows qualitative evaluation of drawing strategies that can reflect executive dysfunction, information that is not available from single-score outputs of typical DCATs. By integrating these rich cognitive signals through a multi-stream deep learning framework, our model leverages a type of information fundamentally different from what existing digital tools can provide.

Despite the strengths of the proposed method, our study had some limitations and areas for future development. First, our model was developed and validated using only Korean cohorts. Although we included an external validation dataset from an independent institution, it was also collected within the same country and therefore does not fully address potential ethnic or cultural biases. While the RCFT is a nonverbal, visuospatial test with minimal linguistic influence, validation in larger and more diverse international cohorts is needed to confirm broader generalizability. Second, our model relied solely on static RCFT drawings, as both cohorts used a traditional paper-and-pencil administration. Consequently, kinematic information such as drawing speed, pressure, temporal patterns, and the sequence of strokes could not be incorporated, despite evidence that these features provide meaningful biomarkers of cognitive decline11,23. We have recently developed a tablet-based RCFT platform that records real-time drawing trajectories and extracts kinematic parameters, which will allow future models to integrate these signals and potentially achieve further performance improvements. Another limitation concerns the interpretability of the proposed model. Although our hybrid framework integrates image-derived features with conventional RCFT scores generated by an AI-based model, the final prediction remains a black-box output without explicit explanations for its decisions. Since clinical adoption requires transparency, future work should incorporate explainable AI tools such as Grad-CAM, attention-based visualizations, and feature-attribution methods to improve interpretability and clinician trust.

In conclusion, our multi-stream deep learning network outperformed previous studies in distinguishing MCI patients from CN subjects. By integrating AI-generated RCFT scores with image-based information, our model demonstrated robust performance across internal and external datasets. Our findings suggest potential clinical utility as a time-efficient screening tool for cognitive impairment.