Abstract
Alzheimer’s disease (AD), a progressive neurodegenerative condition, notably impacts cognitive functions and daily activities. One method of detecting dementia involves a task where participants describe a given picture, and extensive research has been conducted using the participants’ speech and transcribed text. However, very few studies have explored the modality of the image itself. In this work, we propose a method that predicts dementia automatically by representing the relationship between images and texts as a graph. First, we transcribe the participants’ speech into text using an automatic speech recognition system. Then, we employ a vision language model to represent the relationship between the parts of the image and the corresponding descriptive sentences as a bipartite graph. Finally, we use a graph convolutional network (GCN), treating each subject as an individual graph, to classify AD patients through a graph-level classification task. In experiments conducted on the ADReSSo Challenge dataset, our model surpassed the existing state-of-the-art performance by achieving an accuracy of 88.73%. Additionally, ablation studies that removed the relationship between images and texts demonstrated the critical role of the graph in improving performance. Furthermore, by utilizing the sentence representations learned through the GCN, we identified the sentences and keywords critical for AD classification.
Introduction
Alzheimer’s disease (AD), a progressive neurodegenerative condition, significantly affects cognitive functions and the ability to perform daily activities, primarily impairing memory. The progression of AD underlines the importance of early detection and intervention, which can potentially slow the cognitive decline associated with the disease. Consequently, research initiatives are increasingly focused on the development of diagnostic tools, including the analysis of speech and transcribed text, to identify AD in its early stages.
Conventional AD screening methods, such as the Mini Mental Status Examination (MMSE)1 and the Montreal Cognitive Assessment (MoCA)2, rely on the subjective judgments of clinicians, which may result in errors and high inter-rater variability3. To address these issues, there are numerous studies on automatically assessing AD4,5. One promising approach for AD screening involves utilizing speech signals4. Speech signals provide the benefit of being naturally and effortlessly gathered continuously over the day, thereby enabling the accumulation of substantial data volumes without overburdening the participants or researchers6.
A leading approach in speech-based AD detection is the Cookie Theft picture description task7, one of the tasks from the Boston Diagnostic Aphasia Examination (BDAE)8. In this task, participants are asked to describe everything they see in a picture using spontaneous speech, and the responses are recorded as audio files. The most notable dataset associated with this task is the Pitt corpus7. The ADReSS Challenge dataset9 and the ADReSSo Challenge dataset10, both subsets of the Pitt corpus matched for age and gender, are used as benchmark datasets for AD detection. The ADReSS Challenge dataset provides human-transcribed texts along with the audio files, which does not align with the objective of diagnosing dementia automatically without human intervention. Hence, this study employs the ADReSSo Challenge dataset, which provides only audio files, to classify two health statuses: Healthy Control (HC) and AD.
Previous studies on AD classification based on picture description have typically involved three types of models: (1) audio-only models11,12, which extract acoustic features from speech signals using either traditional methods or deep embeddings; (2) text-only models13,14, which convert audio into text and utilize pre-trained language models such as BERT15 to extract features; and (3) multimodal models16,17,18, which utilize both the audio and textual modalities, either by fusing their features or by employing attention mechanisms during the training phase. Recently, studies have also leveraged feedback from large language models (LLMs) as features to improve performance19,20.
Our model incorporates the image modality alongside the text modality to automatically predict dementia by representing the relationships between images and texts as a graph. The spontaneous speech of the participants is transcribed into text using Whisper21, a representative automatic speech recognition (ASR) system. Subsequently, a vision language model (VLM), specifically Bootstrapping Language-Image Pre-training (BLIP)22, is used to represent the connections between different parts of the image and their respective descriptive sentences as a bipartite graph. In the final step, a graph convolutional network (GCN) performs graph-level classification of AD patients, treating each subject as a separate graph.
The main contributions of this paper are summarized as follows:
-
This work is the first to utilize a graph for modeling the image-text relationship with a vision language model (VLM) in the domain of dementia detection.
-
The bipartite graph encapsulates the image-text relationship, which is crucial for performance enhancement. The image-text relations between the AD and HC groups are structurally distinct enough to enable differentiation by a standard GCN.
-
Our model achieved superior AD classification performance on the ADReSSo Challenge dataset compared to existing state-of-the-art (SOTA) models, attaining an accuracy of 88.73%, which surpasses the previously highest recorded SOTA model accuracy of 87.32%19. The peak accuracy achieved during one of the runs was 91.55%.
-
In explainability experiments using embedding vectors from a trained graph neural network, we identified key sentences and keywords essential for AD classification by comparing pooled embeddings of AD and HC groups with embeddings of individual sentences.
-
By employing a graph-based approach and avoiding computationally intensive Transformers, the model not only simplifies its architecture but also significantly reduces computational costs and improves memory efficiency.
Our proposed model is capable of automatically recognizing AD from audio transcription to classification, without relying on human-derived handcrafted features. Given any audio file from a picture description task, the model’s performance is reproducible with only minor adjustments to hyperparameters.
Related work
Dementia detection
In this section, we primarily introduce studies focused on dementia detection using the ADReSSo Challenge dataset.
Unimodal model
The most natural approach to using spontaneous speech data involves directly feeding the audio modality into the model. One study11 combined traditional acoustic features with acoustic embeddings from wav2vec 2.023, subsequently employing a support vector machine (SVM) for AD classification, resulting in an accuracy of 67.6%. Another study12 employed various pre-trained audio models, including wav2vec 2.0, to acquire acoustic embeddings and applied deep learning approaches for AD classification, achieving an accuracy of 78.9%.
In general, transcribing audio signals into text and then training models on the textual modality tends to yield superior performance compared to solely utilizing the audio modality. One study13 combined the final three states of the pre-trained BERT sequence classifier with the confidence score input produced by the ASR system, resulting in an accuracy of 84.51%. In addition to deep textual embeddings from BERT, another study14 utilized a set of handcrafted features (including syntactic, readability, and lexical diversity) alongside preprocessing steps that integrated silence segments. They trained Logistic Regression (LR) and SVM classifiers, achieving an accuracy of 84.51%.
Fig. 1 Framework of the bipartite graph neural network model for AD classification. The image modality (in red) and text modality (in blue) are connected through bipartite graphs (in violet) based on the vision language model. Here, image-text similarity, inspired by VLM, is the edge weight of the bipartite graphs.
Multimodal model
The most common multimodal approach involves considering both the audio and the text modalities simultaneously. In this approach, the audio modality can be utilized either as it is or processed into an image domain through a log-Mel spectrogram. One study24 integrated various acoustic features, including x-vectors, prosody, and emotional embeddings, along with word embeddings, resulting in an accuracy of 80.30%. Another study25 introduced the WavBERT model, which involved converting the wav2vec output into the BERT input to retain non-semantic information. They also incorporated sentence-level pauses into ASR transcripts, leading to an accuracy of 83.10%.
In another study16, global fusion combining BERT with several acoustic models, such as x-vectors and encoder-decoder ASR embeddings, yielded an accuracy of 84.51%. Meanwhile, one study17 leveraged full transcripts as prompts to enhance speech-segment training, addressing the limited perspective of Whisper caused by the constrained audio segment lengths during fine-tuning. They achieved an accuracy of 84.51%. The research presented in18 introduced a multimodal model incorporating Co-attention, Deep Context, and label smoothing techniques. Co-attention enables simultaneous consideration of different representations, Deep Context captures both low- and high-level syntactic and semantic information, and label smoothing prevents overconfidence. Texts were encoded using BERT, while audio signals were transformed into log-Mel spectrograms and fed into Data-efficient image Transformers (DeiT)26. They achieved an accuracy of 85.35%. In another study19, feedback from ChatGPT was treated as an opinion feature and concatenated with text embeddings from BERT and audio embeddings from wav2vec 2.0, achieving an accuracy of 87.32%. A study utilizing another LLM, Mistral 7B, on the ADReSS dataset achieved an accuracy of 81.3%20.
One study27 leveraged both images and descriptive texts, utilizing insights from extensively pre-trained image-text alignment models, particularly Contrastive Language-Image Pre-training (CLIP)28, to enhance accuracy. However, their research differs from ours in that they do not examine the relationship between images and texts using a graph.
Another study29 introduced a tensor fusion layer to integrate transcribed text, audio, and log-Mel spectrograms, achieving an accuracy of 86.25% on the ADReSS dataset. Additionally, a separate study30 employed Neural Architecture Search to propose an optimal CNN structure and presented a novel approach to integrating text and log-Mel spectrogram modalities, resulting in an accuracy of 92.08% on the ADReSS dataset. Furthermore, another study31 utilized audio, lexical, and disfluency features, combining them through LSTM and a gating mechanism, achieving an accuracy of 79.2% on the ADReSS dataset.
Background
Vision language model
Vision language models (VLMs) enhance downstream vision and language tasks by pre-training on large image-text pair datasets. CLIP employs contrastive learning to match text and image embeddings by selecting the most similar pair, while ALign the image and text representations BEfore Fusing (ALBEF)32 aligns unimodal representations before fusing them into a multimodal encoder with the assistance of momentum distillation. Although CLIP and ALBEF are pre-trained on web image-text pairs, such noisy data are not ideal for learning. BLIP addresses this with Captioning and Filtering (CapFilt) and a Multimodal mixture of Encoder-Decoder (MED). BLIP-233 builds on BLIP with a more computationally efficient approach. To leverage the unique functionality of BLIP, we selected BLIP over BLIP-2 for image-text embedding and similarity measurement.
Graph convolutional network
Graph neural networks (GNNs)34 are designed to process graph data, similar to how convolutional neural networks (CNNs) process adjacent pixels in images. In GNNs, node information is exchanged between neighboring nodes through message passing to update embeddings. GCN35, a type of GNN, applies convolutional operations to graphs, aggregating information from neighboring nodes. In this study, we use the following GCN model:

$$\textbf{x}_{i}^{l} = \textbf{W}_{1}^{l}\textbf{x}_{i}^{l-1} + \textbf{W}_{2}^{l}\sum_{j \in \mathscr {N}(i)} e_{j,i}\,\textbf{x}_{j}^{l-1}, \qquad (1)$$
where \(\textbf{x}_{i}^{l}\) and \(\textbf{x}_{i}^{l-1}\) are the node embedding vectors for the node i of the l-th layer and the \((l-1)\)-th layer, respectively. During the training phase, node i can represent either an image or a text node. If node i is an image node, then node j must be a text node, and vice versa. The edge weight is denoted as \(e_{j,i}\) from source node j to target node i. \(\mathscr {N}(i)\) is the set of neighboring nodes of node i, and \(\textbf{W}_{1}^{l}\) and \(\textbf{W}_{2}^{l}\) are learnable parameters.
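For concreteness, the following is a minimal PyTorch sketch of this weighted update for a single target node; the function name and tensor layout are illustrative rather than taken from the paper's code.

```python
import torch

def graphconv_update(x_i, neighbor_x, edge_weights, W1, W2):
    """One GraphConv-style update (Eq. 1) for a single target node i.

    x_i:          (d_in,) embedding of node i at layer l-1
    neighbor_x:   (|N(i)|, d_in) embeddings of neighbors j at layer l-1
    edge_weights: (|N(i)|,) edge weights e_{j,i} (image-text cosine similarities)
    W1, W2:       (d_in, d_out) learnable weight matrices
    """
    self_term = x_i @ W1                                                     # the node's own features
    neighbor_term = (edge_weights.unsqueeze(1) * neighbor_x).sum(dim=0) @ W2  # weighted sum over neighbors
    return self_term + neighbor_term
```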
Methods
Figure 1 illustrates the overall framework of our model. The framework consists of four main components: 1) image node processing (in red), 2) sentence node processing (in blue), 3) bipartite graph construction (in purple), which includes image-text similarity based on a VLM, and 4) graph convolutional network (in green) and AD classification.
Image node processing
-
Colorization The BLIP model is optimized for color images as it is pre-trained on the COCO dataset36. Therefore, using grayscale cartoon-style images such as the Cookie Theft picture (shown in Fig. 2a) directly for feature extraction is not ideal. When grayscale images are fed into BLIP as input, there are quite a few instances where inappropriate captions are generated. Hence, we colorized the Cookie Theft image using a generative AI tool. The effectiveness of colorization is discussed in Table 4.
-
Crop Subsequently, we cropped the image into 10 square sub-images, as depicted in Fig. 2c. The sub-images were cropped at varying sizes and then resized to a common size. Each cropped image represents a distinct scenario; for instance, one depicts water overflowing from the sink, while another portrays a boy stealing cookies. To accommodate descriptions of the overall context, crops close to the full image size were also included (a cropping sketch follows this list).
Since the Cookie Theft picture is the standard image for AD recognition, several studies have manually divided it into halves or quadrants37,38,39 to analyze and classify AD groups. Another study manually divided the picture into 10 cropped areas using words as seeds40. Some studies employ eye-tracking techniques to obtain areas of interest (AOI)41,42. In addition to these methods, we identified important areas in the picture by analyzing the relationship between frequently uttered words and the picture using Grad-CAM43 applied to the VLM, as shown in Fig. 2b. Based on the aforementioned references and the Grad-CAM results, we meticulously cropped the picture into 10 sub-images.
-
Embeddings Following that, we utilized BLIP to extract the embedding vector for each cropped image. Here, the image embedding vector is obtained considering the corresponding sentence. In other words, even for the same cropped image, the value of the embedding vector varies depending on the corresponding sentence. Then, for each sample, we average the image embedding vectors across all sentences to obtain the final cropped image embedding vector. Each cropped image yields one embedding vector, which serves as an attribute of the image node in the bipartite graph. The practical implementation details for image embeddings are provided in the Experiments section.
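A minimal sketch of the cropping step with PIL is shown below. The crop boxes are hypothetical placeholders; the paper's actual 10 regions were chosen from prior studies and the Grad-CAM analysis, and the target size is also an assumption.

```python
from PIL import Image

# Hypothetical crop boxes (left, upper, right, lower) in pixels.
CROP_BOXES = [
    (0, 0, 350, 350),      # e.g., boy on the stool reaching for the cookie jar
    (300, 0, 650, 350),    # e.g., girl reaching out beside him
    # ... 8 further regions, including near-full-size crops for global context
]

def crop_cookie_theft(image_path, boxes=CROP_BOXES, size=(384, 384)):
    """Crop the colorized Cookie Theft picture into sub-images of a common size."""
    image = Image.open(image_path).convert("RGB")
    return [image.crop(box).resize(size) for box in boxes]
```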
Sentence node processing
-
Transcription We transcribed the given speech signals into text using the prominent ASR system, Whisper-large21. This process yielded one text file per subject.
-
Segmentation Subsequently, to facilitate matching with cropped images, we segmented the entire text into individual sentences, using punctuation marks such as periods, exclamation points, and question marks as delimiters (see the sketch following this list).
-
Embeddings Following this, we employed BLIP to extract the embedding vector for each sentence. Here, the sentence embedding vector is obtained considering the corresponding cropped image. In other words, even for the same sentence, the value of the embedding vector varies depending on the corresponding cropped image. Then, for each sample, we averaged the sentence embedding vectors across all cropped images to obtain the final sentence embedding vector. Each sentence yields one embedding vector, which serves as an attribute of the text node in the bipartite graph. The practical implementation details for sentence embeddings are presented in the Experiments section.
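As referenced in the Segmentation step, a minimal sketch of the sentence splitting is given below; the regular expression is an illustrative choice, not the paper's exact rule.

```python
import re

def segment_transcript(transcript: str) -> list[str]:
    """Split a Whisper transcript into sentences at period, exclamation, and question marks."""
    sentences = re.split(r"(?<=[.!?])\s+", transcript.strip())
    return [s.strip() for s in sentences if s.strip()]
```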
Fig. 3 Examples of bipartite graphs for AD and HC group samples, respectively. The darker edges represent higher edge weights. In the AD sample, the edges (image node, text node) = (2, 2), (6, 2), (10, 7), (2, 10), (6, 10) have higher weights, indicating a tendency for the text to focus on specific crop images. In contrast, in the HC sample, the edges (image node, text node) = (4, 1), (5, 3), (2, 4), (6, 4), (1, 5), (2, 5), (6, 5), (9, 5), (8, 7) have relatively higher weights, showing that the text is more evenly connected across multiple crop images.
Bipartite graph construction: image-text similarity based on vision language model
We employ BLIP to extract the cosine similarity between the cropped images and sentences. A well-describing sentence yields a high cosine similarity. Upon examining several sample sentences, we confirmed that BLIP computes highly relevant image-text similarities. In contrast to research utilizing CLIP, we chose BLIP due to its superior performance. Furthermore, CLIP tends to focus on single-word aspects, which is less aligned with our research direction.
The image-text cosine similarity is computed for all pairs of cropped images and sentences, subsequently becoming the edge weights of the bipartite graph. The cosine similarity matrix essentially serves as the adjacency matrix of the graph. This approach enables the establishment of informative and reliable connectivity between the image and text modalities.
The nodes of the bipartite graph consist of image nodes, with embedding vectors of the cropped images as attributes, and text nodes, with embedding vectors of the sentences as attributes. Each subject corresponds to one bipartite graph. The number of image nodes is fixed at 10, while the number of text nodes corresponds to the number of sentences in the text. The practical implementation details for calculation of image-text similarity are provided in the Experiments section.
The bipartite graph captures the complex relational information between participants’ spoken descriptions and sub-images, encompassing several crucial aspects. Firstly, when participants thoroughly describe each part of the picture, the corresponding sentence associated with the sub-image receives a large edge weight. Secondly, comprehensive descriptions of all situations within the picture are essential; thus, if a participant provides an all-encompassing description, every image node will have at least one large edge weight connecting it to the corresponding related sentence node. Thirdly, utterances unrelated to the image will not have large edge weights with any image node. The bipartite graph encapsulates various implicit pieces of information, including the aforementioned aspects, which can significantly enhance performance if leveraged properly. If only the textual (or audio) modality is considered, this relational information cannot be included in model training.
Figure 3 presents examples of bipartite graphs for the AD and HC group samples, illustrating the different patterns of edge weights between the two samples. Since the edge weights represent the image-text similarity in the VLM, it is evident that, in the HC group samples, each text is more broadly connected to various crop images. The practical implementation details for constructing a bipartite graph dataset compatible with GNNs are presented in the Experiments section.
Graph convolutional network and AD classification
Graph convolutional network
Using GCN, the embedding vectors of image and text nodes can be updated by incorporating the connectivity information from neighboring nodes. In this process, information from neighbors is reflected proportionally to the edge weights. In our case, the edge weights are determined by the cosine similarity between the cropped images and sentences. Hence, the GCN model updates each node’s information more prominently when the relationship between images and texts is closer.
We employ three GCN layers; therefore, l can take on the values of 1, 2, or 3. When \(l=1\), \(\textbf{x}_{i}^{l-1}\) represents the initial embedding acquired through BLIP. At the final layer (\(l=L\)), we can obtain final node embeddings \(\textbf{x}_{i}^{L}\) for all nodes i, where L is 3 for our case. Here, we denote the final image and text node embeddings as \(\textbf{h}_{s,i}^{v}\) and \(\textbf{h}_{s,j}^{t}\), respectively, where s is the subject index, i is the cropped image index, and j is the sentence index.
Through a GCN that considers edge weights, the relationships between sub-images and sentences are learned by accounting for local structural information up to 3 hops. The final embedding of each node reflects information from neighboring image nodes and neighboring text nodes. For instance, in the case of the Cookie Theft picture, consider the sub-image where a boy is standing on a stool to grab a cookie while a girl reaches out beside him. Without using a GCN, all of these actions would need to be captured in a single sentence for the model to learn from this sub-image. However, by using a GCN, the utterance describing the boy standing on the stool, the utterance about the boy grabbing the cookie, and the utterance about the girl reaching out can all be incorporated as neighboring text nodes in the learning process, making generalization more feasible.
Graph-level classification
Once we have obtained the final embedding vectors for all nodes, we need to aggregate them into a graph-level embedding vector for the graph-level classification task. We employed mean pooling, where the graph-level embedding vector is the mean of all node embedding vectors. The global mean pooling vector for all nodes and for subject s, denoted as \(\textbf{h}_{s}\), is defined as

$$\textbf{h}_{s} = \frac{1}{2}\left( \langle \textbf{h}_{s,i}^{v} \rangle_{i} + \langle \textbf{h}_{s,j}^{t} \rangle_{j} \right), \qquad (2)$$
where \(\langle \cdot \rangle\) denotes the mean operation, and \(\textbf{h}_{s,i}^{v}\) and \(\textbf{h}_{s,j}^{t}\) represent the final image and text node embedding vectors, respectively. According to (2), regardless of the number of sentences for each sample, texts are pooled with equal weight to images.
Then, the pooled vector \(\textbf{h}_{s}\) is fed into the linear layer for classification of AD versus HC as follows:

$$\hat{\textbf{y}}_{s} = \mathrm{softmax}\left(\textbf{W}\,\textbf{h}_{s} + \textbf{b}\right). \qquad (3)$$
The training and validation loss are calculated using cross entropy. As a result, our proposed model predicts whether a given data sample belongs to AD or HC. Detailed practical information is provided in Experiments section.
Dataset
The basic statistics of the datasets, including the number of sentences per sample and the average number of words per sentence, are shown in Table 1. The ADReSSo Challenge dataset is designed for three tasks10, but we only utilize the audio files and ground truth labels pertaining to the Cookie Theft picture description task. The ADReSSo Challenge dataset, a benchmark dataset for AD detection, is a subset of the Pitt corpus matched for age and gender; while the Pitt corpus comprises 548 samples, the ADReSSo Challenge dataset consists of 237 samples. The ADReSSo Challenge dataset was carefully constructed by considering the age and gender distribution when dividing the training and test sets, thus reducing potential bias due to participants’ demographics.
The Pitt corpus does not have a standardized train-test split, meaning that performance can vary across studies depending on the samples included in the training or test sets. As a result, accuracy comparisons between different models may not be reliable. Therefore, the Pitt corpus was primarily used to support the ablation study on the ADReSSo challenge dataset.
Although the statistics depend on the transcription results from Whisper, two primary observations can be made. First, the average number of sentences per participant’s utterance is higher in the AD group for the ADReSSo Challenge dataset, whereas for the Pitt corpus the two groups are comparable. The higher sentence count in the AD group is attributed to the prevalence of short sentences such as Okay, Yeah, Uh-huh, and similar expressions. This is evident when examining the average word count per sentence: in both datasets, the HC group exhibits a higher average word count per sentence, indicating that participants in the HC group tend to articulate sentences with more words. The detailed train-test split for the Pitt corpus is outlined in the Experiments section.
Experiments
Implementation of graph data
Image and text processing using LAVIS
We utilize LAVIS44, a Python library that includes a wide range of VLMs, for two key processes: 1) the embedding of images and texts, and 2) the calculation of image-text similarity. LAVIS provides access to over ten image-text tasks and more than thirty pretrained weights from SOTA foundation VLMs, including CLIP28, ALBEF32, BLIP22, and BLIP-233.
Specifically, we utilize the [CLS] token from the blip_feature_extractor with the base model type, which consists of pretrained weights from CapFilt by the BLIP large model, to acquire image and text embeddings. For text embeddings, we first extract 10 embedding vectors per sentence, each of 768 dimensions, reflecting the relationships with the 10 cropped images. Similarly, for image embeddings, we first extract one 768-dimensional embedding vector per sentence for each cropped image, so that the number of vectors equals the number of sentences. We then average the 10 vectors of each sentence to obtain the final text embedding, and average the per-sentence vectors of each cropped image to obtain the final image embedding.
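A hedged sketch of this feature-extraction step with the LAVIS API is given below; it assumes the multimodal [CLS] embedding of each (image, sentence) pair serves as the conditioned embedding, and the helper name is illustrative.

```python
import torch
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# BLIP feature extractor with the base model type, as described in the text.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_feature_extractor", model_type="base", is_eval=True, device=device)

def multimodal_cls(raw_image, sentence):
    """Return the 768-d [CLS] embedding of an (image, sentence) pair from BLIP."""
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    text = txt_processors["eval"](sentence)
    feats = model.extract_features({"image": image, "text_input": [text]})
    return feats.multimodal_embeds[:, 0, :].squeeze(0)   # [CLS] token of the last layer

# Node attributes: average one crop's embedding over all sentences (and vice versa), e.g.
# image_node_emb = torch.stack([multimodal_cls(crop, s) for s in sentences]).mean(dim=0)
```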
The cosine similarity between images and texts is calculated using blip_image_text_matching with the base model type, which is fine-tuned with BLIP retrieval weights on the COCO dataset36. For each subject, we compute the cosine similarity between all pairs of cropped images \(N_{i}\) and sentences \(N_{t}\), yielding a similarity matrix of size \(N_{i} \times N_{t}\), which serves as the adjacency matrix from a graph perspective.
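Under the same assumptions, the similarity (adjacency) matrix could be computed roughly as follows, using LAVIS's image-text contrastive ("itc") head, which returns a cosine-similarity score; the loop structure and function name are illustrative.

```python
import torch
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
itm_model, vis_proc, txt_proc = load_model_and_preprocess(
    name="blip_image_text_matching", model_type="base", is_eval=True, device=device)

def similarity_matrix(crops, sentences):
    """Cosine-similarity matrix of shape (N_i, N_t) used as the bipartite adjacency matrix."""
    sim = torch.zeros(len(crops), len(sentences))
    for i, crop in enumerate(crops):
        img = vis_proc["eval"](crop).unsqueeze(0).to(device)
        for j, sent in enumerate(sentences):
            txt = txt_proc["eval"](sent)
            # match_head="itc" returns BLIP's image-text cosine similarity
            score = itm_model({"image": img, "text_input": txt}, match_head="itc")
            sim[i, j] = float(score)
    return sim
```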
Construction of bipartite graph using PyG
PyG (PyTorch Geometric)45 is a library built upon PyTorch, designed to seamlessly manage GNNs for diverse applications involving structured data. The HeteroData object in PyG describes a heterogeneous graph holding multiple node and edge types. When certain constraints are applied (there are exactly two node types, and edges are only allowed between nodes of different types), a heterogeneous graph becomes a bipartite graph. Thus, we use the HeteroData type to construct the bipartite graph dataset.
Then, we utilize the to_hetero module in PyG to transform a homogeneous GNN model into its heterogeneous counterpart. Subsequently, we opt for the GraphConv GNN operator46, which accounts for edge weights.
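The following is a minimal sketch, under our reading of this setup, of how one subject's bipartite graph could be assembled as a HeteroData object; the node and edge type names are illustrative, not taken from the paper's code.

```python
import torch
from torch_geometric.data import HeteroData

def build_bipartite_graph(image_emb, text_emb, sim, label):
    """Build one bipartite HeteroData graph per subject.

    image_emb: (10, 768) cropped-image node features
    text_emb:  (N_t, 768) sentence node features
    sim:       (10, N_t) BLIP image-text cosine similarities (edge weights)
    """
    n_i, n_t = sim.shape
    src = torch.arange(n_i).repeat_interleave(n_t)   # every image node ...
    dst = torch.arange(n_t).repeat(n_i)              # ... connects to every text node
    data = HeteroData()
    data["image"].x = image_emb
    data["text"].x = text_emb
    data["image", "matches", "text"].edge_index = torch.stack([src, dst])
    data["image", "matches", "text"].edge_weight = sim.flatten()
    # Reverse edges so messages also flow from text nodes back to image nodes.
    data["text", "rev_matches", "image"].edge_index = torch.stack([dst, src])
    data["text", "rev_matches", "image"].edge_weight = sim.flatten()
    data.y = torch.tensor([label])                   # 0 = HC, 1 = AD
    return data
```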
Graph neural network structure
Based on the experiments examining the dependency on the number of GCN layers shown in the Results section, the number of GCN layers is set to 3. After each GCN layer, batch normalization is applied. Following the three GCN layers and three batch normalization operations, a linear layer performs the binary classification of AD versus HC.
We compare two configurations: (1) the light GCN, in which the hidden dimension is halved at each successive layer, and (2) the full GCN, in which the hidden dimension remains constant regardless of the number of layers. For the light GCN structure in the 256-dimensional case, the network initially receives 768-dimensional features; at \(L = 1\) it transforms them from 768 to 256 dimensions, at \(L = 2\) from 256 to 128, and so forth, until \(L = 5\), where it reduces from 32 to 16 dimensions. Conversely, for the full GCN structure in the 256-dimensional case, after reducing from 768 to 256 dimensions at \(L = 1\), the dimension remains at 256 thereafter.
The potential hidden dimensions considered for the first layer of GCN were \(d=64, 128, 256, 384,\) and 768. Among these options, the optimal performance was observed with \(d=256\), therefore, subsequent experiments were conducted using this dimension.
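A minimal sketch of the light three-layer configuration with \(d=256\) is given below, assuming GraphConv layers with edge weights, ReLU activations, and a to_hetero transformation; the activation choice and module names are assumptions, not the paper's released code.

```python
import torch
from torch_geometric.nn import GraphConv, global_mean_pool, to_hetero

class LightGCN(torch.nn.Module):
    """Three GraphConv layers with halving hidden dimensions (768 -> 256 -> 128 -> 64)."""
    def __init__(self):
        super().__init__()
        self.conv1, self.bn1 = GraphConv(768, 256), torch.nn.BatchNorm1d(256)
        self.conv2, self.bn2 = GraphConv(256, 128), torch.nn.BatchNorm1d(128)
        self.conv3, self.bn3 = GraphConv(128, 64), torch.nn.BatchNorm1d(64)

    def forward(self, x, edge_index, edge_weight):
        x = self.bn1(self.conv1(x, edge_index, edge_weight)).relu()
        x = self.bn2(self.conv2(x, edge_index, edge_weight)).relu()
        x = self.bn3(self.conv3(x, edge_index, edge_weight)).relu()
        return x

# Duplicate the homogeneous model per node/edge type of the bipartite graph.
metadata = (["image", "text"],
            [("image", "matches", "text"), ("text", "rev_matches", "image")])
gnn = to_hetero(LightGCN(), metadata, aggr="sum")
classifier = torch.nn.Linear(64, 2)   # AD vs. HC head on the pooled graph embedding

# Forward pass over a mini-batch of HeteroData graphs, roughly:
# out = gnn(batch.x_dict, batch.edge_index_dict, batch.edge_weight_dict)
# h = 0.5 * (global_mean_pool(out["image"], batch["image"].batch)
#            + global_mean_pool(out["text"], batch["text"].batch))   # Eq. (2)
# logits = classifier(h)
```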
Experimental settings
All experiments were carried out using PyTorch47. The experimental settings are as follows: a dropout rate of 0.2, a learning rate of \(1\times 10^{-6}\), and a batch size of 4. The maximum number of epochs is set to 2000; however, training rarely proceeds until the final epoch because we employ early stopping to mitigate overfitting. With a patience of 300, if the validation loss does not improve for 300 consecutive epochs, training terminates and the best model, from 300 epochs earlier, is saved.
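A minimal training-loop sketch with this early-stopping scheme follows; `model(batch)` stands in for the full forward pass described above, and the Adam optimizer is an assumption, since the paper does not name one.

```python
import copy
import torch

def train_with_early_stopping(model, train_loader, val_loader,
                              max_epochs=2000, patience=300, lr=1e-6):
    """Early-stopped training with a patience of 300 epochs on the validation loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # optimizer choice is assumed
    criterion = torch.nn.CrossEntropyLoss()
    best_val, best_state, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(batch), batch.y)
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(b), b.y).item() for b in val_loader)
        if val_loss < best_val:                       # improvement: reset the patience counter
            best_val, wait = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            wait += 1
            if wait >= patience:                      # stop and restore the best checkpoint
                break
    model.load_state_dict(best_state)
    return model
```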
The training time, on a PC equipped with an NVIDIA GeForce RTX 3080Ti, averages around 10 minutes per fold, excluding the embedding process. The embedding process is designed to be run only once upon receiving the dataset, enabling it to be reused later, and is thus excluded from the computation time. The total time for both the embedding process and the similarity calculation is approximately 10 minutes for the entire ADReSSo Challenge dataset.
The experiments utilized a 5-fold cross-validation (CV) approach. During the evaluation phase with the test set, predictions from the five models chosen from each fold were combined through voting to obtain the final prediction. We assessed performance using five metrics: precision, recall, F1-score, specificity, and accuracy. Of these, accuracy was prioritized as the primary metric for performance comparison, mirroring the approach in the ADReSSo Challenge.
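The fold-level voting can be sketched as follows; the function name and the use of logits as inputs are illustrative assumptions.

```python
import torch

def majority_vote(fold_logits):
    """Combine the predictions of the five fold models by majority voting.

    fold_logits: list of (N, 2) logit tensors, one per fold model, for N test samples.
    """
    votes = torch.stack([logits.argmax(dim=1) for logits in fold_logits])  # (5, N) hard labels
    return (votes.float().mean(dim=0) >= 0.5).long()                       # per-sample majority
```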
The ADReSSo Challenge dataset is provided with separate training and test sets, whereas the Pitt corpus is not. Previous studies targeting the Pitt corpus have employed various methods for train-test split48,49, resulting in challenges for comparing performance across studies. Therefore, the experimental results on the Pitt corpus are provided to support ablation studies rather than for direct comparison with existing models.
In experiments conducted on the Pitt corpus, the train-test split ensures that the number of AD patients and HC participants is almost equal in both the training and test sets by adjusting the random seed. The sample ratio between the training and test sets was set at 8:2.
Results
AD classification results on ADReSSo dataset
The performance of our model on the ADReSSo Challenge test set is shown in Table 2, along with the performance of previous SOTA architectures. Our proposed model, which utilizes only the image and text modalities, achieves an accuracy of 88.73%, surpassing the previous SOTA model19 that achieved 87.32% accuracy with additional features such as audio and ChatGPT’s opinion. A t-test for statistical significance yielded a p-value of \(4.3\times 10^{-6}\), indicating that the accuracy of our model is statistically significantly different from that of the existing SOTA model. The performance in the table represents the average of four runs, with the highest performance reaching 91.55% during one of the runs. With an F1-score of 88.23%, our model also outperforms the best F1-score of the previous SOTA model, which was 87.25%. While accuracy stands as the primary performance metric, the F1-score holds significant importance as well: unlike specificity, the F1-score evaluates how well the model detects AD patients correctly.
The GCN with the full structure achieves an accuracy of 86.97%, which is lower than the accuracy of 88.73% attained by the light structure. This discrepancy can be attributed to the tendency of the full structure to have a larger number of parameters and a higher susceptibility to overfitting. Increasing the dropout rate to mitigate overfitting yields comparable results. Similarly, for the results of the Pitt corpus in Table 5, the accuracy of the light structure surpasses that of the full structure by 3.31% (the proposed model in the table adopts the light structure). Applying the softmax function to edge weights normalizes information propagation and emphasizes significant neighbors, enhancing training stability but not improving overall performance, as accuracy remained at 87.32%. However, it reduced fluctuations in the learning curve.
Number of layers of GCN
Table 3 illustrates the dependency on the number of GCN layers. The variable L represents the total number of GCN layers, and the hidden dimension of the first GCN layer is indicated in parentheses. In Table 3, the highest accuracy is generally achieved when \(L=3\). When L is less than 3, the model may not adequately learn the graph structure due to insufficient propagation, while for L greater than 3, oversmoothing occurs50. As the number of layers increases, instead of aggregating local information from neighboring nodes, global information from all nodes in the graph is aggregated. This causes all node embeddings in the graph to become similar to each other, leading to oversmoothing and thereby impeding proper graph learning. Consequently, the permissible maximum number of layers decreases as the graph size decreases. In our case, the reference point is \(L=3\), and thus the proposed models in Table 2 are all based on \(L=3\).
Removing image-text relationship
In this section, we conducted ablation studies to assess the influence of the relationship between image and text on performance by eliminating the image-text relation through three approaches: (1) shuffling edge weights, (2) independent embeddings, and (3) a combination of the first two methods.
Shuffling edge weights
This ablation experiment involves randomly shuffling the weights of existing edges in the bipartite graph. As meaningful connections are replaced by random ones, the image-text relation is eliminated. Experimental results demonstrate that our proposed model shows a significant improvement in accuracy by 13.00% compared to the shuffling edge weights method (see Table 4). This indicates a substantial performance enhancement, underscoring the importance of the relationship between images and texts. Similarly, experiments on the Pitt corpus show a 3.88% increase in accuracy due to proper edge weights (see Table 5).
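The shuffling itself amounts to a random permutation of the edge-weight vector while keeping the graph topology and weight distribution unchanged; a minimal sketch (with an illustrative function name) is shown below.

```python
import torch

def shuffle_edge_weights(edge_weight: torch.Tensor) -> torch.Tensor:
    """Randomly permute the edge weights, destroying the image-text relation
    while preserving the same weight distribution and connectivity."""
    perm = torch.randperm(edge_weight.numel())
    return edge_weight[perm]
```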
Independent embeddings
When extracting embeddings using BLIP, image embeddings are influenced by text, and text embeddings are influenced by images. To remove this effect, images and texts must be embedded independently. In the independent embedding ablation study, image embeddings were generated using ViT51, while text embeddings were generated using Sentence Transformers52, a Python framework for sentence, text, and image embeddings. The ViT models were pre-trained on the ImageNet and ImageNet-21k datasets53. Specifically, we utilized the vit-base-patch16-224 pre-trained model to embed the image nodes, extracting the 768-dimensional embedding vector from the [CLS] token of the last hidden layer. For sentence nodes, we leveraged the all-mpnet-base-v2 pre-trained model from Sentence Transformers, which was trained on a vast dataset comprising over one billion sentence pairs. This model was employed to generate 768-dimensional embedding vectors for the sentence nodes.
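A hedged sketch of this independent-embedding setup is shown below; the Hugging Face identifier google/vit-base-patch16-224 is the standard hub name for the checkpoint mentioned in the text and is an assumption on our part.

```python
import torch
from transformers import ViTImageProcessor, ViTModel
from sentence_transformers import SentenceTransformer

# Image nodes: 768-d [CLS] embedding from ViT, independent of any sentence.
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

def vit_cls_embedding(pil_image):
    """768-d [CLS] embedding of a cropped image from the last ViT layer."""
    inputs = vit_processor(images=pil_image, return_tensors="pt")
    with torch.no_grad():
        outputs = vit(**inputs)
    return outputs.last_hidden_state[:, 0]          # shape (1, 768)

# Text nodes: 768-d sentence embeddings from all-mpnet-base-v2, independent of any image.
sbert = SentenceTransformer("all-mpnet-base-v2")
sentence_embeddings = sbert.encode(["the boy is stealing cookies"], convert_to_tensor=True)
```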
The ablation results of the independent embedding are presented in Table 4 for the ADReSSo Challenge dataset and in Table 5 for the Pitt corpus. When comparing the accuracy of the proposed model with that of independent embedding, it can be observed that the embedding through VLM resulted in an improvement of 9.56% for the ADReSSo Challenge dataset and 3.88% for the Pitt corpus, respectively. While this enhancement is less pronounced than that achieved through proper edge weights, it remains a significant effect.
Combination of two effects
The combined impact of shuffling edge weights and independent embeddings is presented in Table 4 for the ADReSSo Challenge dataset and Table 5 for the Pitt corpus. When considering the combined influence of proper edge weights and embedding through the VLM, we observe an increase in accuracy of 7.23% for the ADReSSo Challenge dataset and 5.04% for the Pitt corpus. The enhancement on the ADReSSo dataset is smaller than when the effects are considered individually, possibly because the randomness of shuffling dampens its impact. In the case of the Pitt corpus, accuracy improved more than when the effects are considered individually, in line with our expectations.
Effect of colorization
In the case of BLIP, as the pre-training data utilized comprises the COCO dataset, it accurately provides captions for color images and demonstrates precise features along with image-text alignment. However, for grayscale drawings such as the Cookie Theft picture, it may provide less precise captions and may not fully exhibit proper feature extraction and image-text alignment. To evaluate the colorization effect, we conducted an ablation experiment. Table 4 presents the results, indicating a 9.56% improvement in accuracy attributable to the colorization process.
We performed colorization through minor retouching. As future work to enhance robustness, we intend to create several colorized images with slight variations. This ensures consistency in how participants describe the images, even though the VLM perceives them slightly differently.
Dependence on pooling method
In our main experiment, we employed a global mean pooling to aggregate all node embeddings into a graph-level embedding. Concerned about the inclusion of unnecessary information in the averaging process, we conducted the ablation experiment using a global max pooling, which utilizes only the embeddings of the most significant nodes. Contrary to expectations, the accuracy with max pooling decreased by 3.17% compared to mean pooling, as shown in Table 4. Similarly, in the ablation experiments on the Pitt corpus presented in Table 5, replacing mean pooling with max pooling results in a decrease in accuracy of 0.80%. We attribute this decrease in performance to information loss resulting from the exclusion of less important node information when using max pooling.
Discussion
Critical sentences and keywords
We conducted an analysis to extract crucial sentences and keywords in classifying AD using the trained graph model. Experiments for explainability allow us to gain insights into scenarios where a participant is more likely to have AD based on specific types of sentences uttered or certain keywords frequently appearing in their speech. The method involves obtaining embedding vectors representing either the AD group or the HC group and comparing them with embedding vectors of individual sentences to investigate associations.
If we denote the representative embedding vector of the AD group and HC group as \(\textbf{h}_{\text {AD}}\) and \(\textbf{h}_{\text {HC}}\), respectively, the process of obtaining these two vectors is as follows. Firstly, for each subject s, the pooled embedding vector \(\textbf{h}_{s}\) is computed using the best model. Then, the representative embedding vector of the AD group is calculated as \(\textbf{h}_{\text {AD}}=\langle \textbf{h}_{s} \rangle\) for \(s\in \text {AD}\), and the representative embedding vector of the HC group is computed as \(\textbf{h}_{\text {HC}}=\langle \textbf{h}_{s} \rangle\) for \(s\in \text {HC}\). However, in this context, subject s includes only cases where the model’s prediction matches the ground truth.
Critical sentences for AD classification
After obtaining the representative embedding vectors for each group, comparison is conducted in two ways. The first method, similarity-based comparison, involves comparing the embedding vectors of sentences from a specific group with the representative embedding vector of the same group. For each sentence in the AD group, \(\textbf{h}_{s,j}^{t}\) for \(s\in \text {AD}\), the cosine similarity with \(\textbf{h}_{\text {AD}}\) is computed. Similarly, for each sentence in the HC group, \(\textbf{h}_{s,j}^{t}\) for \(s\in \text {HC}\), the cosine similarity with \(\textbf{h}_{\text {HC}}\) is computed. Extracting sentences with the highest cosine similarity values up to the top 20% yields sets of sentences \(S_{\text {AD},\sim }\) and \(S_{\text {HC},\sim }\), containing sentences close to the prototype of the AD and HC groups, respectively. Thus, \(S_{\text {AD},\sim }\) and \(S_{\text {HC},\sim }\) represent sets of sentences crucial for distinguishing between AD and HC.
Fig. 5 Examples of sentences corresponding to the centroids of each cluster in Fig. 4: (a) \(S_{\text {HC},\sim }\), (b) \(S_{\text {AD},\sim }\), (c) \(S_{\text {HC},\not \sim }\), and (d) \(S_{\text {AD},\not \sim }\). Sentences shaded in gray represent statements made by the investigator, those shaded in sky blue highlight characteristics of the HC group, and those shaded in pink effectively represent characteristics of the AD group.
The second method, dissimilarity-based comparison, involves comparing the embedding vector of sentences from a specific group with the representative embedding vector of the other group. For each sentence in the AD group, \(\textbf{h}_{s,j}^{t}\) for \(s\in \text {AD}\), cosine similarity with \(\textbf{h}_{\text {HC}}\) is computed, while for each sentence in the HC group, \(\textbf{h}_{s,j}^{t}\) for \(s\in \text {HC}\), cosine similarity with \(\textbf{h}_\text {AD}\) is calculated. Extracting sentences with the bottom 20% lowest cosine similarity values yields sets of sentences \(S_{\text {AD},\not \sim }\) representing AD group sentences distant from the HC group prototype and \(S_{\text {HC},\not \sim }\) representing HC group sentences distant from the AD group prototype, which form another important set of sentences for distinguishing between AD and HC.
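A minimal sketch of this prototype-and-selection procedure is given below; the function names and the fraction argument are illustrative.

```python
import torch
import torch.nn.functional as F

def group_prototype(pooled_embeddings):
    """Mean of the correctly classified subjects' pooled graph embeddings h_s for one group."""
    return torch.stack(pooled_embeddings).mean(dim=0)

def select_sentences(sentence_embs, sentences, prototype, closest=True, frac=0.2):
    """Rank sentences by cosine similarity to a group prototype and keep a fraction.

    closest=True  -> similarity-based set (top 20% closest to the own-group prototype)
    closest=False -> dissimilarity-based set (bottom 20% farthest from the other-group prototype)
    """
    sims = F.cosine_similarity(torch.stack(sentence_embs), prototype.unsqueeze(0))
    k = max(1, int(frac * len(sentences)))
    idx = sims.topk(k, largest=closest).indices
    return [sentences[i] for i in idx]
```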
Each individual sentence in these groups is embedded using a Sentence Transformer. Subsequently, k-means clustering with six clusters is performed on each set, followed by two-dimensional visualization using t-SNE, as shown in Fig. 4. t-SNE, short for t-distributed stochastic neighbor embedding, is a statistical technique used to visualize high-dimensional data by assigning a position to each data point on a two- or three-dimensional space. Figure 4a–d represent clustering results for \(S_{\text {HC},\sim }\), \(S_{\text {AD},\sim }\), \(S_{\text {HC},\not \sim }\), and \(S_{\text {AD},\not \sim }\), respectively. Overall, sentences from the HC group are well-clustered, while those from the AD group exhibit a tendency to be dispersed. This is attributed to the fact that sentences from the HC group describe situations well, leading to effective clustering, whereas in the case of AD, there is a significant proportion of sentences unrelated to image descriptions, resulting in dispersion.
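The clustering and visualization step could be sketched as follows; the specific Sentence Transformer checkpoint used here is not named in the paper, so all-mpnet-base-v2 is an assumption.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def cluster_and_project(sentences, n_clusters=6, seed=0):
    """Embed sentences, cluster them with k-means (k=6), and project to 2-D with t-SNE."""
    embeddings = SentenceTransformer("all-mpnet-base-v2").encode(sentences)
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)
    coords = TSNE(n_components=2, random_state=seed).fit_transform(np.asarray(embeddings))
    return labels, coords
```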
Examining the sentences within each cluster, shown in Fig. 5a–d, we observe common descriptions across both groups, such as "the mother is washing dishes" or "water overflowing from the sink". However, the HC group notably contains more detailed descriptions, such as those detailing the cookie jar lid or scenes outside the window. In contrast, the AD group includes many sentences like "I don't know". The key insight here is that these observations are facilitated by utilizing the final pooled embedding vector from the GCN.
Critical keywords for AD classification
Using the aforementioned approach, we analyzed the keywords essential for classifying AD and HC. From the selected sets \(S_{\text {HC},\sim }\), \(S_{\text {AD},\sim }\), \(S_{\text {HC},\not \sim }\), \(S_{\text {AD},\not \sim }\), we extracted words exclusive to each group; specifically, words present only in the HC group (and vice versa for the AD group). We refer to these words as relevant keywords, as they play a significant role in distinguishing between AD and HC.
In contrast, while the previous steps extracted sentences up to the top 20% based on cosine similarity in the similarity-based comparison and up to the bottom 20% in the dissimilarity-based comparison, we extracted sentences up to the bottom 5% in the similarity-based comparison and up to the top 5% in the dissimilarity-based comparison. We refer to the keywords extracted from these sentences as irrelevant keywords, as they are not particularly helpful in distinguishing between AD and HC.
The word cloud visualization results for these words are shown in Fig. 6. Each of Fig. 6a,b can be divided into four areas: the left represents keywords from the HC group, the right represents keywords from the AD group, the top represents relevant keywords, and the bottom represents irrelevant keywords. For instance, cocoa is in the AD and relevant keywords group.
Examining the trends of the keywords, notable words from the HC group, which play a significant role in distinguishing from AD, include window, curtain, tree, grass, cabinet, lid, and counter. On the other hand, significant words from the AD group, crucial in distinguishing from HC, include summer, cocoa, eat, kid, lady, and ladder. Words from the HC group that are not crucial in distinguishing from AD include dish, woman, shoes, towel, and floor, while words from the AD group that are not crucial in distinguishing from HC include plate, water, hair, and hand.
The straightforward words such as dish, woman, water do not play a significant role, whereas words like curtain, tree, grass, lid are crucial. This finding aligns with the research results presented in the heatmap over the area of interests on the Cookie Theft image42. An important takeaway from this analysis is the ability to extract words crucial for AD classification using the results of graph embeddings.
AD patients have been reported to use a reduced number of nouns and display a more limited vocabulary compared to HCs, showing an increased tendency to rely on pronouns while the diversity of nouns diminishes54; the keywords in Fig. 6 align with this trend. For instance, AD patients often exhibit a tendency to use common nouns like "thing", which fail to specify concrete objects.
Comparison of description quality from a graph perspective
We conducted an analysis comparing the quality of sentences describing the picture between the AD and HC groups from a graph perspective. A sentence that effectively describes the picture has high relevance to the cropped images, resulting in a high image-text cosine similarity, i.e., a large edge weight. By setting a threshold and eliminating edges with weights below it, we remove relations with low relevance between the image and text. During edge removal, if a node is left with zero degree, we remove that node as well. After measuring the remaining number of image nodes for each subject, we averaged them for each group; the results are depicted in Fig. 7. The horizontal axis represents the threshold, while the vertical axis represents the average number of surviving image nodes after thresholding. As the threshold value increases, the average number of image nodes decreases, with the AD group showing a faster decline than the HC group. The Kolmogorov-Smirnov test revealed a significant difference between the AD and HC groups in the ADReSSo Challenge dataset (p = 0.0026). Through this analysis, we confirm from a graph perspective that the quality of sentences describing the image is better for the HC group than for the AD group.
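The per-subject count of surviving image nodes reduces to a simple thresholding of the similarity matrix; a minimal sketch (with an illustrative function name) follows.

```python
import torch

def surviving_image_nodes(sim: torch.Tensor, threshold: float) -> int:
    """Count image nodes that keep at least one edge after thresholding edge weights.

    sim: (N_i, N_t) image-text cosine-similarity (edge-weight) matrix of one subject.
    """
    kept = sim >= threshold                 # drop edges with weights below the threshold
    return int(kept.any(dim=1).sum())       # image nodes with non-zero remaining degree

# Averaging this count over the subjects of each group, for a sweep of thresholds,
# yields the AD and HC curves compared in Fig. 7.
```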
Fig. 7 Threshold for edge weights versus the average number of cropped images that survive after node removal for (a) the ADReSSo Challenge and (b) the Pitt corpus datasets. The HC group is presented in blue, and the AD group is presented in red. The shaded regions represent plus and minus one standard deviation.
Limitations
The limitations of our proposed model lie in the manual cropping process of the picture, which may introduce subjectivity due to human intervention. Determining the optimal crop area and the ideal number of cropped images is necessary to further improve performance. By generating an optimal set of cropped images for a given picture, we can ensure high accuracy for any incoming spontaneous speech sample.
Conclusion
We introduce a novel approach to Alzheimer’s disease detection by leveraging both the text and image modalities of a picture description task. Our proposed method employs a VLM to construct bipartite graphs that encapsulate the relationships between image segments and corresponding textual descriptions. Our model effectively learns the structural information of the bipartite graph via a GCN. The experimental results on the ADReSSo Challenge dataset demonstrated a high accuracy of 88.73%, exceeding that of previous SOTA models. Ablation studies highlighted the critical role of the image-text relationship in enhancing classification accuracy. Additionally, the ability to identify specific sentences and keywords crucial for AD classification significantly enhances the explainability of our method.
For future work, we can further extend our proposed model to other types of picture description tasks, such as those found in the Delaware corpus55, a dataset used for mild cognitive impairment (MCI) screening. This dataset includes two additional pictures, Cat Rescue and Going and Coming. Incorporating additional modalities, such as audio, presents another opportunity for future research. The inclusion of embedding information from the audio modality could enhance the performance of the AD classification task.
Data availability
The data that support the findings of this study are available from DementiaBank (https://dementia.talkbank.org) but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available.
References
Folstein, M. F., Folstein, S. E. & McHugh, P. R. "Mini-mental state": A practical method for grading the cognitive state of patients for the clinician. J. Psychiatr. Res. 12, 189–198 (1975).
Nasreddine, Z. S. et al. The Montreal Cognitive Assessment, MoCA: A brief screening tool for mild cognitive impairment. J. Am. Geriatr. Soc. 53, 695–699 (2005).
Chen, S. et al. Automatic dementia screening and scoring by applying deep learning on clock-drawing tests. Sci. Rep. 10, 20854 (2020).
De la Fuente Garcia, S., Ritchie, C. W. & Luz, S. Artificial intelligence, speech, and language processing approaches to monitoring Alzheimer’s disease: A systematic review. J. Alzheimer’s Dis. 78, 1547–1574 (2020).
Vigo, I., Coelho, L. & Reis, S. Speech-and language-based classification of Alzheimer’s disease: A systematic review. Bioengineering 9, 27 (2022).
Chen, J., Ye, J., Tang, F. & Zhou, J. Automatic detection of Alzheimer’s disease using spontaneous speech only. In Interspeech, Vol. 2021, 3830 (NIH Public Access, 2021).
Becker, J. T., Boiler, F., Lopez, O. L., Saxton, J. & McGonigle, K. L. The natural history of Alzheimer’s disease: Description of study cohort and accuracy of diagnosis. Arch. Neurol. 51, 585–594 (1994).
Goodglass, H., Kaplan, E. & Weintraub, S. BDAE: The Boston Diagnostic Aphasia Examination (Lippincott Williams & Wilkins, 2001).
Luz, S., Haider, F., de la Fuente, S., Fromm, D. & MacWhinney, B. Alzheimer's dementia recognition through spontaneous speech: The ADReSS challenge. arXiv preprint arXiv:2004.06833 (2020).
Luz, S., Haider, F., de la Fuente, S., Fromm, D. & MacWhinney, B. Detecting cognitive decline using speech only: The ADReSSo challenge. arXiv preprint arXiv:2104.09356 (2021).
Balagopalan, A. & Novikova, J. Comparing acoustic-based approaches for Alzheimer's disease detection. arXiv preprint arXiv:2106.01555 (2021).
Gauder, M. L., Pepino, L. D., Ferrer, L. & Riera, P. Alzheimer disease recognition using speech-based embeddings from pre-trained models. In Proc. Interspeech 2021 3795–3799. https://doi.org/10.21437/Interspeech.2021-753 (2021).
Pan, Y. et al. Using the outputs of different automatic speech recognition paradigms for acoustic-and bert-based alzheimer’s dementia detection through spontaneous speech. In Interspeech 3810–3814 (2021).
Syed, Z. S., Syed, M. S. S., Lech, M. & Pirogova, E. Tackling the adresso challenge 2021: The muet-rmit system for alzheimer’s dementia recognition from spontaneous speech. In Interspeech 3815–3819 (2021).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Pappagari, R. et al. Automatic detection and assessment of Alzheimer disease using speech and language technologies in low-resource scenarios. Interspeech 2021, 3825–3829 (2021).
Li, J. & Zhang, W.-Q. Whisper-based transfer learning for alzheimer disease classification: Leveraging speech segments with full transcripts as prompts. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 11211–11215 (IEEE, 2024).
Ilias, L. & Askounis, D. Context-aware attention layers coupled with optimal transport domain adaptation and multimodal fusion methods for recognizing dementia from spontaneous speech. Knowl.-Based Syst. 277, 110834 (2023).
Bang, J.-U., Han, S.-H. & Kang, B.-O. Alzheimer’s disease recognition from spontaneous speech using large language models. ETRI Journal (2024).
Botelho, C. et al. Macro-descriptors for alzheimer’s disease detection using large language models. In Interspeech 2024, 1975–1979. https://doi.org/10.21437/Interspeech.2024-1255 (2024).
Radford, A. et al. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning 28492–28518 (PMLR, 2023).
Li, J., Li, D., Xiong, C. & Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning 12888–12900 (PMLR, 2022).
Baevski, A., Zhou, Y., Mohamed, A. & Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020).
Wang, N., Cao, Y., Hao, S., Shao, Z. & Subbalakshmi, K. Modular multi-modal attention network for Alzheimer’s disease detection using patient audio and language data. In Interspeech 3835–3839 (2021).
Zhu, Y., Obyat, A., Liang, X., Batsis, J. A. & Roth, R. M. Wavbert: Exploiting semantic and non-semantic speech using wav2vec and bert for dementia detection. In Interspeech vol. 2021, 3790 (NIH Public Access, 2021).
Touvron, H. et al. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning 10347–10357 (PMLR, 2021).
Zhu, Y. et al. Evaluating picture description speech for dementia detection using image-text alignment. arXiv preprint arXiv:2308.07933 (2023).
Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning 8748–8763 (PMLR, 2021).
Ilias, L., Askounis, D. & Psarras, J. A multimodal approach for dementia detection from spontaneous speech with tensor fusion layer. In 2022 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI) 1–5 (IEEE, 2022).
Chatzianastasis, M., Ilias, L., Askounis, D. & Vazirgiannis, M. Neural architecture search with multimodal fusion methods for diagnosing dementia. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1–5 (IEEE, 2023).
Rohanian, M., Hough, J. & Purver, M. Multi-modal fusion with gating using audio, lexical and disfluency features for Alzheimer’s dementia recognition from spontaneous speech. In Interspeech 2020 2187–2191. https://doi.org/10.21437/Interspeech.2020-2721 (2020).
Li, J. et al. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 34, 9694–9705 (2021).
Li, J., Li, D., Savarese, S. & Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning 19730–19742 (PMLR, 2023).
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M. & Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 20, 61–80 (2008).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
Lin, T.-Y. et al. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 740–755 (Springer, 2014).
Fromm, D. et al. The case of the cookie jar: Differences in typical language use in dementia. J. Alzheimer’s Dis. 1–18 (2024).
Field, T. S., Masrani, V., Murray, G. & Carenini, G. [td-p-002]: Improving diagnostic accuracy of Alzheimer’s disease from speech analysis using markers of hemispatial neglect. Alzheimer’s Dementia 13, P157–P158 (2017).
Ambadi, P. S. et al. Spatio-semantic graphs from picture description: Applications to detection of cognitive impairment. Front. Neurol. 12, 795374 (2021).
Bouazizi, M., Zheng, C., Yang, S. & Ohtsuki, T. Dementia detection from speech: What if language models are not the answer?. Information 15, 2 (2023).
Barral, O. et al. Non-invasive classification of Alzheimer’s disease using eye tracking and language. In Machine Learning for Healthcare Conference 813–841 (PMLR, 2020).
Mirheidari, B. et al. Detecting Alzheimer's disease by estimating attention and elicitation path through the alignment of spoken picture descriptions with the picture prompt. arXiv preprint arXiv:1910.00515 (2019).
Selvaraju, R. R. et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision 618–626 (2017).
Li, D. et al. Lavis: A one-stop library for language-vision intelligence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) 31–41 (2023).
Fey, M. & Lenssen, J. E. Fast graph representation learning with pytorch geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds (2019).
Morris, C. et al. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence 33, 4602–4609 (2019).
Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32 (2019).
Bertini, F., Allevi, D., Lutero, G., Calzà, L. & Montesi, D. An automatic Alzheimer’s disease classifier based on spontaneous Spoken English. Comput. Speech Lang. 72, 101298 (2022).
Ortiz-Perez, D. et al. A deep learning-based multimodal architecture to predict signs of dementia. Neurocomputing 548, 126413 (2023).
Rusch, T. K., Bronstein, M. M. & Mishra, S. A survey on oversmoothing in graph neural networks. arXiv preprint arXiv:2303.10993 (2023).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (2020).
Reimers, N. & Gurevych, I. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019).
Russakovsky, O. et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
Williams, E., Theys, C. & McAuliffe, M. Lexical-semantic properties of verbs and nouns used in conversation by people with Alzheimer’s disease. PLoS ONE 18, e0288556 (2023).
Lanzi, A. M. et al. Dementiabank: Theoretical rationale, protocol, and illustrative analyses. Am. J. Speech-Language Pathol. 32, 426–438 (2023).
Acknowledgements
This research was supported by the National Research Council of Science & Technology (NST) grant by the Korean government (MSIT) (No. CAP21054-300).
Author information
Authors and Affiliations
Contributions
B.L., B.O.K and H.J.S. conceived the experiment, B.L. conducted the experiment, all authors analysed the results, B.L. wrote the manuscript. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.