Introduction

Primary progressive aphasia (PPA) is a neurodegenerative syndrome characterized by the progressive deterioration of language and/or speech1,2. Although additional cognitive, behavioral, and motoric deficits emerge over time, speech and language deficits are the primary contributors to impaired activities of daily living in early stages of disease. There are three PPA subtypes, each with a distinct speech-language profile1. The semantic variant (svPPA) is associated with a loss of core semantic knowledge, leading to deficits in word retrieval and word comprehension. The logopenic variant (lvPPA) is associated with impaired phonological processing, with associated deficits in word retrieval and repetition. Lastly, the non-fluent variant (nfvPPA) is characterized by impaired expressive grammar and/or motor speech impairment.

Early and accurate diagnosis is essential for optimal provision of care for individuals with PPA, both in terms of speech-language services and potential forthcoming pharmacological interventions. With regard to speech-language intervention, the most appropriate restitutive interventions differ by PPA subtype3,4,5. Interventions targeting word retrieval may be most relevant for svPPA and lvPPA, whereas interventions targeting motoric aspects of speech and/or grammar may be of most benefit in nfvPPA. For disease-modifying treatments, it is important to note that PPA subtypes are associated with distinct underlying pathological profiles6. As a consequence, early clinical diagnosis contributes to pathological prediction which, in turn, may facilitate identification of appropriate pharmacological interventions as they become available. However, differential diagnosis by PPA subtype can be challenging, even for experienced speech-language clinicians.

Differential diagnosis by PPA subtype

In standard clinical care, differential diagnosis by PPA subtype requires comprehensive cognitive-linguistic assessment7. Diagnostic assessment typically requires hours of testing with tasks requiring overt responses (e.g., naming pictures, yes/no responses, repeating words and phrases), which may lead to fatigue and potentially compromise the validity and reliability of the results. Perhaps more importantly, even after comprehensive cognitive-linguistic assessment, a definitive diagnosis may be elusive. Whereas svPPA and nfvPPA are typically straightforward to differentiate behaviorally, distinguishing lvPPA from nfvPPA can be challenging due to overlapping clinical features, including reduced speech fluency in both subtypes8,9. Fluency is a multidimensional construct, reflecting motor speech, grammar, word finding, and prosody. Thus, although the source of impaired fluency in lvPPA and nfvPPA differs (deficits in phonological processing vs. motor speech and/or grammar, respectively), the two PPA subtypes may present similarly, particularly in mild, early stages10. Moreover, phonological paraphasias, which are common in lvPPA, can be difficult to distinguish from apraxic speech sound errors, which are common in nfvPPA. Differentiating lvPPA from svPPA also presents challenges, as anomia is a core feature for both subtypes. Moreover, additional overlapping clinical features emerge over time; for example, in lvPPA, semantic deficits may become apparent with progression11.

Differential diagnosis using biomarkers and machine learning

Given the challenges of differential diagnosis based on behavioral assessment, clinicians and researchers seek alternative or complementary tools for confirming a diagnosis12. Blood, cerebrospinal fluid (CSF), and neuroimaging (e.g., magnetic resonance imaging [MRI] and positron emission tomography) biomarkers have shown promise for identifying the underlying etiology of PPA13,14,15,16,17,18,19,20,21,22,23,24,25,26,27. To further improve diagnostic accuracy and efficiency, researchers have used neuroimaging biomarkers with machine learning (ML)13,14,19,26. Most studies using neuroimaging with ML have focused on structural MRI14,19,26, although resting-state electroencephalography (EEG)/magnetoencephalography (MEG), which reflects network dynamics, has also been used with ML for PPA subtype classification28,29,30. Although each of these studies achieved high classification accuracy for some diagnostic tasks (e.g., differentiating lvPPA vs. controls), poorer classification accuracy was achieved for other tasks (e.g., differentiating nfvPPA vs. lvPPA).

EEG has fewer contraindications relative to MRI and MEG (which exclude patients with implanted metal, for example) and is significantly less expensive (cost to record EEG data is negligible compared to the hundreds of dollars per hour for MRI and MEG). However, only one study has used ML with EEG for classification of PPA. Moral-Rubio et al.28 used resting-state EEG data as input into seven ML classification algorithms (random forest, decision tree, k-nearest neighbors (kNN), support vector machine (SVM), elastic net, Gaussian Naive Bayes, and multinomial Naive Bayes). They achieved good classification of controls vs. PPA (F1 = 0.83), and relatively worse, but still better-than-chance, four-way classification of controls vs. lvPPA vs. nfvPPA vs. svPPA (F1 = 0.60).

In sum, although the use of neuroimaging biomarkers with ML classification algorithms has proven useful for differential diagnosis, the identification of novel, reliable biomarkers and accompanying analytical approaches will continue to benefit the field. Biomarkers derived using techniques that are non-invasive and affordable, such as EEG, are particularly valuable. Despite the language-based nature of PPA syndromes, the utility of neuroimaging data obtained during language processing tasks has yet to be evaluated. Considering the nature of PPA and the distinct language phenotypes associated with each PPA subtype, a language-based EEG biomarker could prove particularly effective for differential diagnosis.

In recent years, temporal response function (TRF) modeling has gained traction as an ecologically valid approach for characterizing neural processing of acoustic and linguistic features of continuous speech31,32. In TRF modeling, a linear function is estimated to map acoustic and/or linguistic features of speech to neurophysiological data. The accuracy of the resulting TRF can be tested by comparing the observed neurophysiological data with the TRF-predicted response, providing a measure of the fidelity of the neural representation in the brain. The TRF itself provides additional information about the time course of processing for a given feature. Researchers have argued that the TRF approach has potential as a tool for improving clinical diagnosis33,34, but TRF-derived measures have not been evaluated as diagnostic tools. In the current study, we sought to provide preliminary evidence regarding the diagnostic utility of TRF modeling and ML algorithms for differential diagnosis of PPA subtypes.

Current study

In this proof-of-concept study, we examined the utility of ML classification algorithms for diagnosis of PPA using EEG data collected while participants listened to 30 one-minute segments of a continuous speech narrative (15 minutes each from two audiobooks). TRF modeling was used to derive a linear function to map acoustic and linguistic features of the audiobooks onto each participant's EEG data. TRFs were estimated separately for the delta (1–4 Hz) and theta (4–8 Hz) EEG frequency bands, as these bands have been argued to support different levels of speech processing (e.g., delta band: word- and phrase-level representations; theta band: syllable-level representations32). Our first research question was whether the TRF holds promise for classifying participants by clinical subtype. Our second research question was whether TRFs provide additional benefit compared to using the (preprocessed) EEG data alone; in other words, do the TRF-derived beta weights improve classification compared to the EEG data alone (without TRF mapping to the acoustic and linguistic features)? We predicted that the TRF beta weights would outperform the EEG-only data because they reflect processing of the acoustic and linguistic features of the continuous narrative, whereas EEG waveforms contain neural activity both related and unrelated to processing the narrative. The study workflow is presented in Figure 1.

Figure 1

Study workflow. EEG data were acquired while participants listened to 30 one-minute tracks of a continuous narrative. Acoustic features were derived from the audio. Additionally, for each word in the stimulus, linguistic feature values were derived using natural language processing (NLP). Acoustic and linguistic features were used to estimate a TRF to map feature values to a participant’s EEG responses. The resulting TRF beta weights were then used as input to a ML-based classifier.

Method

Participants

Participants included 10 healthy, age-, education-, and hearing-matched control participants, 10 individuals with svPPA, 10 individuals with nfvPPA, and 10 individuals with lvPPA (Table 1; note that control participants and participants with lvPPA are also presented in33). Participants with PPA were recruited as part of a speech-language intervention trial conducted by the Aphasia Research and Treatment Lab at the University of Texas at Austin35,36,37,38. Individuals with PPA were required to have a Mini-Mental State Exam39 score greater than 15 and to meet criteria for one of the canonical subtypes of PPA based on international consensus criteria1. Clinical diagnosis was based on comprehensive neurological and cognitive-linguistic assessment. Exclusion criteria for controls included a history of stroke, neurodegenerative disease, severe psychiatric disturbance, or developmental speech and language deficits. Due to the acoustic nature of the stimuli, hearing thresholds at 500, 1000, 2000, and 4000 Hz were collected for both ears. The pure tone average across frequencies and ears is reported in Table 1 for each participant group. The study was approved by the Institutional Review Board of the University of Texas at Austin and participants provided written informed consent. The study was conducted in accordance with relevant guidelines and regulations. Because control participants were not recruited as part of the larger clinical trial, they were paid $15/hour for their participation. All participants were native English speakers who spoke English as their primary language.

Table 1 Demographic characteristics, results of neuropsychological assessments of cognitive and linguistic processing, and performance on comprehension questions used in the current study.

Stimuli and task

Stimuli consisted of 15-minute segments from each of two audiobooks, Alice’s Adventures in Wonderland50, and Who Was Albert Einstein?51, the latter of which has been validated for use in stroke-induced aphasia52. Each audiobook was divided into 15 one-minute tracks, ensuring that each track started and ended with a complete sentence. Stimuli were presented binaurally using insert earphones (ER-3A, Etymotic Research, Elk Grove Village, IL). After listening to each track, participants were asked two multiple choice questions to encourage close attention to the audiobook (accuracy presented in Table 1). These questions were not evaluated for their validity in assessing story comprehension, though we note that an analysis of variance revealed significant differences across the groups (F (3, 26) = 8.21, p < 0.001); post hoc comparisons performed using Tukey’s Honestly Significant Difference test indicated that individuals with lvPPA and svPPA performed significantly worse than control participants, and individuals with svPPA also performed significantly worse than individuals with nfvPPA. To mitigate fatigue, participants were given the opportunity to take a break between tracks and were instructed to press the spacebar when they were ready to move on. For two participants with svPPA and five participants with nfvPPA, data were only available for the 15 tracks from Alice’s Adventures in Wonderland (see Supplementary Materials, Supplementary Table 1), creating an imbalance in samples between subtypes. We chose to use F1 for ranking classifier performance in order to minimize issues related to this imbalance (see Analyzing model performance for definition and justification for using F1).

EEG data collection and preprocessing

While participants listened to the audiobooks, EEG data and audio were sampled at 25,000 Hz using a 32-channel (10–20 system) BrainVision actiCHamp active electrode system and BrainVision StimTrak, respectively (Brain Products, Gilching, Germany). The data were re-referenced offline using the common average reference. EEG data were preprocessed using EEGLAB 2019.153 in MATLAB 2016b (MathWorks Inc., Natick, MA, USA). Data were downsampled to 128 Hz, then filtered from 1 to 15 Hz using a non-causal, Hamming windowed-sinc FIR filter (high pass filter cut-off = 1 Hz, filter order = 846; low pass filter cut-off = 15 Hz, filter order = 212). Channels whose activity was > 3 standard deviations from surrounding channels were rejected and replaced via spherical spline interpolation. Large artifacts were suppressed using artifact subspace reconstruction54, with sixty seconds of manually-defined clean data used as calibration data. Lastly, independent component analysis using the infomax algorithm was performed to correct for eye movement, muscle, and electrocardiographic artifacts, with components manually identified for correction. The cleaned EEG data were further filtered into the delta (1–4 Hz) and theta (4–8 Hz) bands, as these two frequency bands have been identified as important for speech processing but may support different aspects of processing. Specifically, the delta band has been linked to processing longer speech units (e.g., words and phrases) and the theta band has been linked to processing shorter speech units (e.g., syllables)32.
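
Preprocessing was performed in EEGLAB/MATLAB as described above. As an illustration of the final band-splitting step only, the sketch below applies zero-phase FIR filters to a placeholder preprocessed EEG array using SciPy; the filter length is illustrative and does not reproduce the EEGLAB filter orders reported above.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 128  # sampling rate after downsampling (Hz)

def bandpass(data, low, high, fs=FS, numtaps=423):
    """Zero-phase FIR band-pass filter along the time (last) axis; numtaps is illustrative."""
    taps = firwin(numtaps, [low, high], pass_zero=False, fs=fs, window="hamming")
    return filtfilt(taps, [1.0], data, axis=-1)

# eeg: placeholder for a cleaned, downsampled recording, shape (n_channels, n_samples).
eeg = np.random.randn(30, FS * 60)

delta_eeg = bandpass(eeg, 1.0, 4.0)   # delta band: linked to word/phrase-level processing
theta_eeg = bandpass(eeg, 4.0, 8.0)   # theta band: linked to syllable-level processing
```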

Acoustic feature derivation

Cortical tracking of the speech envelope has proven sensitive to hearing impairment in neurotypical older adults55 and multiband envelope tracking has been shown to differ significantly between individuals with lvPPA and neurotypical older adults33. Thus, we investigated whether TRFs reflecting cortical tracking of acoustic features would be successful in PPA classification. Two acoustic features, the multiband speech envelope and broadband envelope derivative, were calculated for each of the audio tracks to be used for TRF modeling.

Multiband speech envelope

The multiband speech envelope reflects syllable, word, and phrase boundaries as well as prosodic cues56,57. To derive the multiband speech envelope, auditory stimuli from the audiobooks were first filtered through 16 gammatone filters to produce 16 bands58. The absolute value of the Hilbert transform in each of the 16 bands comprised the multiband stimulus envelope, which was then raised to a power of 0.6 to mimic the compression characteristics of the inner ear59. This resulted in 16 band-specific speech envelopes. TRFs were estimated for each of the 16 bands. The TRF beta weights were averaged across the 16 bands for ML classification.
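
Below is a minimal sketch of the envelope extraction and compression steps. The `filterbank` argument is a hypothetical helper standing in for whatever gammatone implementation is used; it is assumed to return the 16 band-limited signals.

```python
import numpy as np
from scipy.signal import hilbert

def multiband_envelope(audio, fs, filterbank, compression=0.6):
    """Compressed Hilbert envelopes of each gammatone band.

    `filterbank(audio, fs)` is a hypothetical helper returning an array of shape
    (16, n_samples) of band-limited signals; any gammatone implementation could be used.
    """
    bands = filterbank(audio, fs)                 # (16, n_samples)
    envelopes = np.abs(hilbert(bands, axis=-1))   # analytic amplitude in each band
    return envelopes ** compression               # 0.6 power compression mimics the inner ear

# One TRF is then estimated per band, and the beta weights are averaged for classification.
```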

Broadband envelope derivative

The broadband envelope derivative reflects acoustic onsets and offsets critical for identifying syllable, word, and phrase boundaries60. The auditory cortex, including the superior temporal gyrus, has been shown to be particularly sensitive to acoustic edges61. Considering that the superior temporal gyrus is a site of prominent atrophy in lvPPA62, we sought to determine whether cortical tracking of the broadband envelope derivative would be useful for PPA classification. Thus, we took the first temporal derivative of the broadband envelope to be used for TRF estimation. Only the positive values of the derivative were used.
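
A short sketch of the derivative feature, assuming `broadband_env` is a broadband speech envelope already aligned to the EEG sampling rate (here a random placeholder):

```python
import numpy as np

# Placeholder broadband envelope aligned to the EEG sampling rate (128 Hz, one-minute track).
broadband_env = np.abs(np.random.randn(128 * 60))

derivative = np.diff(broadband_env, prepend=broadband_env[0])  # first temporal derivative
onset_feature = np.clip(derivative, 0.0, None)                 # keep only positive values (onsets)
```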

Linguistic feature derivation

Linguistic features were selected that correspond to the core language domains implicated in PPA and used for PPA subtype classification. Specifically, we selected features reflecting phonological processing (significantly impaired in lvPPA), semantic processing (significantly impaired in svPPA), and syntactic processing (significantly impaired in nfvPPA). Critically, the specific linguistic features we selected to represent each of these levels of processing have been demonstrated to have better-than-chance prediction accuracy in previous studies utilizing TRF modeling63,64,65. Prosodylab-Aligner66 was used to temporally align phonemes and words with the audio tracks (i.e., for identification of phoneme and word onsets and offsets), with manual correction by expert linguists and highly trained research assistants. [Because of coarticulation, there is no "ground truth" for where one phoneme/word begins and another ends, so we emphasized consistency in alignment by having the first author review each track, making edits as needed. We note that although "errors" in alignment would impact the accuracy of TRF modeling, this would be consistent across participants and therefore should not impact classification performance.] Phoneme and word onsets were subsequently used to temporally align linguistic features with the EEG responses.

Phonological feature: cohort entropy

Cohort entropy quantifies the degree of uncertainty regarding word identity at the current phoneme based on competition among words in the cohort (the list of words with the same phonemes up to that point in the word). It was derived at the phoneme level and mapped to phoneme onsets for TRF estimation. Notably, the first phoneme in each word lacks a feature value. A phoneme’s cohort entropy is defined as the Shannon entropy for the cohort of words consistent with the phonemic makeup up to that phoneme64. Each word’s entropy is defined as its word frequency multiplied by the natural log of its word frequency. To derive word frequency, the frequency count of the word was determined based on the SUBTLEX_us_2007 corpus67 and then divided by the total number of words in the corpus, forming a probability distribution among the words; frequency is then defined as the natural logarithm of each word’s probability. For the ith phoneme in a word, the following formula was used to compute cohort entropy.

$$\sum_{word \in cohort} freq(word) \cdot \ln\bigl(freq(word)\bigr)$$
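
As a minimal sketch of this computation, the snippet below uses a toy pronunciation lexicon and made-up corpus counts in place of the aligned transcripts and SUBTLEX. Here freq(word) is taken as the word's corpus probability, so each term has the familiar p·ln(p) form; this is an assumption of the sketch, not necessarily the exact quantity used in the study.

```python
import numpy as np

# Toy lexicon (word -> phonemes) and made-up corpus counts standing in for SUBTLEX.
lexicon = {"cat": ["K", "AE", "T"], "cap": ["K", "AE", "P"], "dog": ["D", "AO", "G"]}
counts = {"cat": 120, "cap": 30, "dog": 200}
total = sum(counts.values())
prob = {w: c / total for w, c in counts.items()}   # corpus probability of each word

def cohort_entropy(word, i):
    """Sum of freq(word) * ln(freq(word)) over the cohort sharing the first i phonemes of `word`.

    freq is taken here as the corpus probability (an assumption of this sketch).
    """
    prefix = lexicon[word][:i]
    cohort = [w for w, phones in lexicon.items() if phones[:i] == prefix]
    return sum(prob[w] * np.log(prob[w]) for w in cohort)

# Uncertainty at the second phoneme of "cat" ("K AE ..."), where "cat" and "cap" still compete.
print(cohort_entropy("cat", 2))
```
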
Semantic features

These features were derived at the word level and were subsequently mapped to word onsets for TRF estimation.

Word frequency

Word frequency represents how frequently a word appears in the English language. As previously indicated, to derive word frequency, the frequency count of the word was determined based on the SUBTLEX_us_2007 corpus67 and then divided by the total number of words in the corpus, forming a probability distribution among the words; frequency is defined as the natural logarithm of each word’s probability, assuming no prior context64. For any word w, its word frequency can be mathematically formulated as its natural log probability, \(ln\left( {p\left( w \right)} \right)\), where p represents probability as defined above, independent of context.

Semantic dissimilarity

Semantic dissimilarity represents how semantically dissimilar a word is compared to the preceding words in a sentence63. To calculate semantic dissimilarity, we first used the well-established NLP model GPT2 to derive a semantic feature vector for each word68. GPT2 was chosen because it is a widely used neural language model yielding contextualized word representations (i.e., a "feature vector")69 that are sensitive to the preceding context. Computations were run on Google Colab Pro's GPUs and TPUs. Semantic dissimilarity was then derived by taking each word's GPT2 feature vector and obtaining 1 minus the correlation coefficient between that vector and the mean of the vectors for all previous words in the sentence. As such, the first word of each sentence does not have a feature value. Dissimilarity values ranged from 0 to 2, with larger values reflecting larger dissimilarity. SciPy was used to compute the mean feature vector across words and NumPy was used to compute the correlations across feature vectors. For the ith word in a text, its semantic dissimilarity is mathematically formulated as

$$1 - r\left[ {f\left( {w_{i} } \right),\,mean\left[ {f\left( {w_{i - 1} } \right),f\left( {w_{i - 2} } \right), \ldots ,f\left( {w_{2} } \right),f\left( {w_{1} } \right)} \right]} \right]$$

where r represents Pearson’s correlation and f(w) represents a word’s feature vector.
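
A condensed sketch of this computation using Hugging Face's GPT2 implementation is shown below. Taking the last-layer hidden state of each word's final sub-token as the word's feature vector, and the simple sub-token-to-word mapping, are assumptions of the sketch rather than the study's exact procedure.

```python
import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

def word_vectors(words):
    """One contextualized vector per word: last hidden state of the word's final sub-token."""
    enc = tokenizer(" ".join(words), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]      # (n_sub_tokens, hidden_dim)
    vecs, idx = [], 0
    for w_i, w in enumerate(words):
        piece = w if w_i == 0 else " " + w              # GPT2 sub-tokens carry the leading space
        idx += len(tokenizer(piece)["input_ids"])       # advance past this word's sub-tokens
        vecs.append(hidden[idx - 1].numpy())
    return vecs

def semantic_dissimilarity(words):
    """1 - Pearson r between each word's vector and the mean vector of all preceding words."""
    vecs = word_vectors(words)
    values = [np.nan]                                   # the first word of a sentence has no value
    for i in range(1, len(words)):
        r = np.corrcoef(vecs[i], np.mean(vecs[:i], axis=0))[0, 1]
        values.append(1.0 - r)
    return values

print(semantic_dissimilarity(["the", "rabbit", "checked", "his", "pocket", "watch"]))
```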

Syntactic feature: syntactic surprisal

Syntactic surprisal was derived at the word level and subsequently mapped to word onsets for TRF estimation. Syntactic surprisal represents how surprising the part of speech (POS) tag of the current word is given the preceding words. A word’s syntactic surprisal is defined as the log probability of its POS tag conditioned on previous text65, where the next-word probability distribution was extracted using GPT270. As with semantic dissimilarity, GPT2 was chosen because of its contextualized word representations. To form the next-word probability distribution with GPT2, the text preceding the current word was fed into GPT2, which outputted logits. A softmax was applied to the logits to form a probability distribution. From this distribution, we decoded using the nucleus sampling algorithm70 with p = 0.9 (i.e., the smallest set of next-word predictions such that the cumulative probability was 0.9). Each word in this nucleus sample was then tagged with the POS tagger from SpaCy’s en_core_web_lg model (https://spacy.io/models/en#en_core_web_lg). From this, counts of each POS tag were computed and then normalized to form the POS tag probability distribution. For the ith word of a text, its syntactic surprisal can be mathematically formulated as

$$ln\left[ {p\left[ {pos\left( {w_{i} } \right)\left| {w_{i - 1} ,w_{i - 2} , \ldots ,w_{2} ,w_{1} } \right.} \right]} \right]$$

where \(p(pos\left( {w_{i} } \right)\left| {w_{i - 1} ,w_{i - 2} , \ldots ,w_{2} ,w_{1} } \right.)\) is computed from the nucleus sampling outlined above.
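
The sketch below illustrates the surprisal pipeline described above (GPT2 next-word probabilities, nucleus truncation, spaCy POS tagging). Details such as decoding candidate sub-tokens into taggable strings and tagging the target word in isolation are simplifications of this sketch, not necessarily the study's implementation.

```python
import numpy as np
import torch
import spacy
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()
nlp = spacy.load("en_core_web_lg")   # POS tagger used in the study (model must be downloaded)

def syntactic_surprisal(words, i, top_p=0.9):
    """ln p(POS(word_i) | preceding words), via nucleus-truncated GPT2 predictions (requires i >= 1)."""
    context = " ".join(words[:i])
    target_pos = nlp(words[i])[0].pos_                  # simplification: target word tagged in isolation

    enc = tokenizer(context, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(lm(**enc).logits[0, -1], dim=-1)   # next-token distribution

    # Nucleus: smallest set of candidates whose cumulative probability reaches top_p.
    sorted_p, sorted_idx = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_p, dim=0) <= top_p
    keep[0] = True                                      # always keep the most probable candidate
    candidates = sorted_idx[keep].tolist()

    # POS-tag each candidate continuation and normalize counts into a POS distribution.
    pos_counts = {}
    for tok_id in candidates:
        text = tokenizer.decode([tok_id]).strip()
        if text:
            pos = nlp(text)[0].pos_
            pos_counts[pos] = pos_counts.get(pos, 0) + 1
    pos_prob = {pos: c / sum(pos_counts.values()) for pos, c in pos_counts.items()}

    return float(np.log(pos_prob.get(target_pos, 1e-12)))   # ln probability of the observed POS tag
```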

TRF modeling

TRF estimation was conducted using EEG data that were z-scored to each participant's mean across channels. TRFs were constructed to map each track's acoustic or linguistic features to a participant's corresponding EEG data, with separate TRFs estimated for each acoustic and linguistic feature. For the multiband envelope, TRFs were estimated separately for each of the 16 frequency bands, then averaged across those bands. For the broadband envelope and linguistic features, a single TRF was estimated. Each TRF was estimated by minimizing the least-squares distance between EEG values predicted from a given feature and the participant's observed EEG data. Time lags of −500 to 1000 ms were used. TRFs were derived using regularized linear ridge regression and validated using leave-one-out cross-validation, implemented in the mTRF Toolbox71. Each resulting TRF was a vector of beta weights that was then used as input to the ML algorithms described below.
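
TRFs were estimated with the mTRF Toolbox in MATLAB; the sketch below illustrates the same idea in Python for a single feature and a single EEG channel, using a time-lagged design matrix and ridge regression. The lag range matches the study (−500 to 1000 ms); the regularization value and placeholder signals are illustrative.

```python
import numpy as np

def lagged_design_matrix(stimulus, fs, tmin=-0.5, tmax=1.0):
    """Stack time-shifted copies of the stimulus feature: one column per lag from tmin to tmax."""
    lags = np.arange(int(tmin * fs), int(tmax * fs) + 1)
    n = len(stimulus)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stimulus[: n - lag]
        else:
            X[:lag, j] = stimulus[-lag:]
    return X

def estimate_trf(stimulus, eeg, fs=128, ridge_lambda=1.0):
    """Ridge-regression TRF: beta weights mapping the stimulus feature to one EEG channel."""
    X = lagged_design_matrix(stimulus, fs)
    XtX = X.T @ X + ridge_lambda * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ eeg)     # one beta weight per time lag

# Toy usage for one one-minute track sampled at 128 Hz.
fs = 128
stim = np.random.randn(fs * 60)                # e.g., one band of the multiband envelope
eeg_cz = np.random.randn(fs * 60)              # z-scored EEG from one channel
betas = estimate_trf(stim, eeg_cz, fs)         # this beta-weight vector is the classifier input
```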

Classification

Classification tasks

The broadest task was to classify each participant as either a control participant or an individual with PPA (controls vs. PPA). Differential classification across participant groups (four-way classification, controls vs. svPPA vs. lvPPA vs. nfvPPA) and by PPA subtype (three-way classification, svPPA vs. lvPPA vs. nfvPPA) was also pursued. Additionally, we sought to classify one type of PPA by ruling out the other two types of PPA (svPPA vs. nfvPPA and lvPPA; lvPPA vs. svPPA and nfvPPA; nfvPPA vs. svPPA and lvPPA), which would be clinically useful in cases where an overall PPA diagnosis has been conferred and one PPA subtype is suspected. This one-vs-rest framing is also a common approach to multiclass classification that often improves the performance of ML classification algorithms. Lastly, we sought pairwise (two-way) classification by PPA subtype (svPPA vs. lvPPA; svPPA vs. nfvPPA; and lvPPA vs. nfvPPA), which would be useful in cases where a PPA diagnosis has been conferred and narrowed down to one of two possible subtypes.

Reading in EEG and TRF data

All participants had EEG data from 30 EEG channels, but only data from channel Cz (10–20 electrode system placement72) were fed into ML classification algorithms (see ML classification algorithms) because a single vector concatenating all channels (i.e., 30 channels × 8307 timestamps) would be too large for our computational constraints. Channel Cz was selected based on its common use for analysis and display purposes in previous TRF literature73,74. Further, it is not as susceptible to bias by hemispheric differences, which is particularly important in a population like PPA, where there is asymmetric neurodegeneration. Lastly, Cz has also been linked to language-related ERPs, such as the N40075,76. Participant-level data were reorganized into track-level data, resulting in 1095 tracks (33 participants with 30 tracks and 7 participants with 15 tracks) that were used for training and evaluating the ML classification algorithms. The number of data points (1095) used for training exceeds any in the literature on automated approaches to PPA classification. The number of data points for each subgroup overall is presented in Supplementary Table 1. The results reported in the main text reflect classification performance at the track-level. In the Supplementary Materials, we also report the classification performance when track-level predictions are merged into individual-level predictions (Supplementary Table 2).

TRF beta weights were available for every audio track. As with EEG data, each participant's channel Cz TRF beta weights were used to build an ML-based classifier. Standardization of both TRF and EEG data is discussed in the "ML classification algorithms" section. We note that the input to the model was a single vector, both for each classification task's TRF-based model and the EEG-based model. The "Hyperparameter tuning" section describes the process used to select the single acoustic/linguistic feature and the single ML classification algorithm used in each classification task's model.

ML classification algorithms

It is common practice to test a variety of classification algorithms to achieve the best classification performance77,78,79,80,81,82. In this study, we evaluated nine ML classification algorithms from the Python ScikitLearn package83: decision tree, random forest, extremely randomized trees (aka ExtraTrees), SVM, kNN, logistic regression, Gaussian Naive Bayes, Adaboost, and Multilayer Perceptron (MLP). This is similar to the seven ML classification algorithms used by Moral-Rubio et al.28 for PPA classification. Note that kNN, SVM, and MLP required prior scaling as these algorithms are based on the notion of distance between data points; scaling here refers to standardizing all input TRF/EEG values by subtracting the mean and scaling to unit variance. The other six ML classification algorithms did not require any preprocessing of the TRF beta weights or EEG data as they are not based on distance between data points.
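
As an illustration of this distinction, the sketch below (with illustrative, default-parameter models) attaches a StandardScaler to the distance-based classifiers via scikit-learn Pipelines while leaving the tree-based classifiers unscaled.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Distance-based algorithms get a StandardScaler (zero mean, unit variance) in front.
scaled_models = {
    "svm": make_pipeline(StandardScaler(), SVC()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "mlp": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000)),
}

# Tree-based algorithms take the raw TRF beta weights / EEG values directly.
unscaled_models = {
    "random_forest": RandomForestClassifier(),
    "extra_trees": ExtraTreesClassifier(),
}
```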

Cross validation

At the participant level, the data were split into 5 stratified outer folds, where 80% of each fold was designated for training and 20% was designated for testing. Special care was taken to ensure this was done at the participant level instead of the track level so that results generalize across individuals. In other words, all tracks for a given participant were either in the training set or the test set (not both). The classifier's predictions on each outer fold's test set were merged to form a set of predictions for all data points, which were then compared to ground truth (see "Analyzing model performance"). This use of cross validation ensures the reported results are applicable across all participants in our sample. This is in contrast to the 80–20 train-test split, where the classifier would be trained on 80% of the data and only evaluated on 20% of the data (i.e., results only reflect a fifth of the dataset). Our decision to use cross validation instead of train-test split is motivated by the small N of our dataset.
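
One way to implement such participant-level splits is scikit-learn's StratifiedGroupKFold with participant ID as the group label, as in the sketch below with placeholder data; the study's exact splitting code is not specified, so this is illustrative only.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# Placeholder track-level data: 40 participants x 30 tracks, labels repeated per track.
groups = np.repeat(np.arange(40), 30)                     # participant ID for each track
y = np.repeat(np.arange(4), 10)[groups]                   # diagnostic label for each track
X = np.random.default_rng(0).normal(size=(len(y), 193))   # stand-in features

outer_cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in outer_cv.split(X, y, groups):
    # Every participant's tracks fall entirely in train or entirely in test.
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```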

Hyperparameter tuning

For each classification task's model, we built a classifier for each possible combination of EEG frequency band, single acoustic/linguistic feature used to derive TRF weights, and single classification algorithm into which the TRFs were fed. The combination that resulted in the best performance on the nested cross-validation per classification task is reported in Tables 2, 3, 4, 5, 6. The classification performance for all classifiers constructed (i.e., each combination of frequency band, acoustic/linguistic feature, and classification algorithm) is reported for each classification task in Supplementary Tables 3–12, and the best classification performance for delta and theta bands, specifically, is reported in Supplementary Table 13. The percentage of classifiers outperforming the random sampling baseline is reported in Supplementary Table 14 (see "Analyzing model performance"). For each classifier built, we used 5-fold nested cross validation to determine the internal hyperparameters of the ML classification algorithm. For each of the five outer folds, its training set was split into five stratified inner folds (i.e., running a 5-fold cross validation on an outer fold's training set, where 80% of each fold's training set is designated for training and 20% is designated for validation). When evaluating a particular set of hyperparameters, classification performance was computed for each inner fold (i.e., trained on the inner fold's training set and evaluated on the inner fold's validation set) and then averaged. This process was repeated for several sets of hyperparameters, from which the best performing hyperparameters were identified. Note that only the outer fold's training set was used to determine the best hyperparameter combination. Then, only the best performing hyperparameters were used to train a model on all of the outer fold's training data, which was evaluated on the outer fold's test set (which was not seen/used in the hyperparameter tuning process, thus giving an unbiased estimate of the hyperparameters' true performance). This process was then repeated for the second outer fold and so on, where each outer fold may select different hyperparameters from its training set, with the resulting model evaluated on its test set. Finally, each outer fold's test set predictions were merged to form a set of predictions for all data points, which were then compared to ground truth (see "Analyzing model performance"). This nested cross validation process allows us to optimize each classifier's hyperparameters without compromising the validity of its evaluation and generalization to new patients. In sum, for each classification task, the inner folds were used for selecting the model's best hyperparameter combination and the outer folds were used for final evaluation of the model itself.
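
A condensed sketch of the nested loop is shown below: GridSearchCV tunes hyperparameters on inner splits of each outer training fold, and the tuned model is evaluated once on that fold's held-out participants. The classifier, grid, and placeholder data are illustrative rather than the study's exact configuration.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder track-level data: 40 participants x 30 tracks, 193 TRF beta weights per track.
groups = np.repeat(np.arange(40), 30)                     # participant ID for each track
y = np.repeat(np.arange(4), 10)[groups]                   # one diagnostic label per participant
X = np.random.default_rng(0).normal(size=(len(y), 193))   # stand-in for TRF beta weights

outer_cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}   # illustrative grid

all_true, all_pred = [], []
for train_idx, test_idx in outer_cv.split(X, y, groups):
    # Inner folds are drawn only from the outer training participants.
    inner_cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=1)
    inner_splits = list(inner_cv.split(X[train_idx], y[train_idx], groups[train_idx]))
    search = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                          param_grid, cv=inner_splits, scoring="f1_macro")
    search.fit(X[train_idx], y[train_idx])        # hyperparameters chosen on inner folds only
    all_true.extend(y[test_idx])
    all_pred.extend(search.predict(X[test_idx]))  # one evaluation on the unseen outer test fold
```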

Analyzing model performance

Recall, precision, and F1 score were metrics of interest. A class's recall reflects the proportion of true positive cases predicted as positive relative to all true positive cases (e.g., how many individuals with PPA were classified as having PPA). Precision reflects the proportion of true positive cases predicted as positive relative to all predicted positive cases (e.g., for all samples classified as PPA, how many actually had PPA). Lastly, a class's F1 score reflects the harmonic mean of its precision and recall, ranging from 0 to 1, where 1 reflects perfect classification. F1 was used to evaluate each model's performance in lieu of accuracy for two reasons. First, for many of the selected classification tasks, there was an uneven class distribution; for example, for the classification task of svPPA vs. lv/nfvPPA, there were twice as many lv/nfvPPA samples as svPPA samples. Using the macro (i.e., unweighted) average of each class's F1 is ideal in situations where there is class imbalance because it gives equal weighting to both the dominant and non-dominant class, avoiding artificial inflation of the F1 score by the dominant class (which could potentially have a higher F1 score). Using accuracy, as many previous studies have done, can result in a classifier achieving seemingly good performance by always predicting the dominant class; for example, given that there are three times as many PPA samples as controls, our classifier for controls vs. PPA would achieve 75% accuracy by classifying every sample as PPA. For F1, however, this would correspond to a much lower score. Second, unlike accuracy, F1 balances the need for simultaneously good precision and recall. To show that our classifiers achieved meaningful, above-chance performance, baseline F1 scores were derived by randomly sampling each prediction using the uniform distribution and the sample-label distribution (Supplementary Table 14). These baselines were computed through ScikitLearn's DummyClassifier model, where its strategy parameter was set to either "uniform" or "stratified".
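
The sketch below illustrates the macro-averaged F1 computation and the two DummyClassifier baselines on placeholder labels and predictions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)     # placeholder ground truth (e.g., svPPA vs. lv/nfvPPA)
y_pred = rng.integers(0, 2, size=300)     # placeholder merged predictions from the outer folds
X = rng.normal(size=(300, 193))           # placeholder features for fitting the dummy baselines

print("model macro F1:", f1_score(y_true, y_pred, average="macro"))

# Chance-level baselines: predictions drawn from a uniform or label-frequency distribution.
for strategy in ("uniform", "stratified"):
    dummy = DummyClassifier(strategy=strategy, random_state=0).fit(X, y_true)
    print(strategy, "baseline macro F1:", f1_score(y_true, dummy.predict(X), average="macro"))
```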

For all classification tasks, McNemar tests from the mlxtend package84 were used to compare the best EEG-only model that used EEG waveforms as input against the best model that used TRF beta weights as input in order to determine whether the derivation of the TRF beta weights provided additional benefit to classification performance. From the predictions of the best TRF-based and the best EEG-only classifiers, a 2 × 2 contingency table was formed using the mcnemar_table function from the mlxtend package. From this contingency table, the McNemar test statistic and corresponding p-value were computed using the mcnemar function from the mlxtend package.
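
A minimal sketch of this comparison with placeholder predictions:

```python
import numpy as np
from mlxtend.evaluate import mcnemar, mcnemar_table

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)     # placeholder ground-truth labels
pred_trf = rng.integers(0, 2, size=300)   # placeholder predictions of the best TRF-based classifier
pred_eeg = rng.integers(0, 2, size=300)   # placeholder predictions of the best EEG-only classifier

# 2 x 2 contingency table of correct/incorrect predictions for the two classifiers.
table = mcnemar_table(y_target=y_true, y_model1=pred_trf, y_model2=pred_eeg)
chi2, p = mcnemar(ary=table, corrected=True)
print(chi2, p)
```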

Results

Classification using TRF beta weights

Our first research question was whether TRF beta weights can be used to successfully classify individuals with PPA across the classification tasks described above. First, for classification of samples as healthy controls or PPA, we achieved an F1 score of 0.60 (Table 2), which outperformed random sampling predictions from either a uniform or sample-label distribution by 0.14 (Supplementary Table 14). Based on precision and recall, PPA samples were more likely to be accurately classified than control samples. Second, for four-way classification by participant group (controls vs. svPPA vs. lvPPA vs. nfvPPA), we achieved an F1 score of 0.34 (Table 3), which outperformed our baseline of randomly sampling predictions by 0.10 (Supplementary Table 14). Based on precision and recall, control (Precision = 0.41, Recall = 0.39) samples were more likely to be accurately classified than the other groups (Precision and Recall ranging from 0.28 to 0.40). Next, for differential classification of samples by PPA subtype, we achieved an F1 score of 0.48 (Table 4), which outperformed our baseline of randomly sampling predictions by more than 0.16 (Supplementary Table 14). Based on precision and recall, however, confidence in this model's classification would be relatively low, regardless of how a sample was classified. Subsequently, we sought to classify one PPA subtype by ruling out the other two PPA subtypes (Table 5). For classification of samples as svPPA or lvPPA/nfvPPA, we achieved an F1 score of 0.67; for classification of samples as lvPPA or svPPA/nfvPPA, we achieved an F1 score of 0.73; and for classification of samples as nfvPPA or lvPPA/svPPA, we achieved an F1 score of 0.68. Each of these three classification tasks outperformed baselines by more than 0.15 (Supplementary Table 14). Based on precision and recall, our classifiers did a better job at ruling out one PPA subtype relative to the other two subtypes than they did at diagnosing that subtype (e.g., for the classification task of svPPA vs. lv/nfvPPA, the model had a much higher precision score and a slightly higher recall score for classifying a case as belonging to the lv/nfvPPA class than for classifying a case as belonging to the svPPA class). Lastly, we conducted pairwise classification by PPA subtype (Table 6). For differentiating nfvPPA from lvPPA, we achieved an F1 score of 0.73. Differentiation of nfvPPA from svPPA had an F1 score of 0.74, as did the differentiation of lvPPA and svPPA. Classifiers for pairwise classification by PPA subtype outperformed baselines by more than 0.22 (Supplementary Table 14). Notably, although a relation between PPA subtypes and linguistic features most relevant for classification might be anticipated, this was not the case, as no clear pattern emerged regarding classification accuracy and the specific linguistic features used to derive TRF beta weights. Further, the different EEG frequency bands used as input to the models had no clear effect on classification accuracy and no single classification algorithm had the best performance across a majority of classification tasks.

Table 2 Differentiation of healthy controls from individuals with PPA with TRF beta weights as input.
Table 3 Four-way classification by participant group (controls vs. svPPA vs. lvPPA vs. nfvPPA) with TRF beta weights as input.
Table 4 Three-way classification by PPA subtype (svPPA vs. lvPPA vs. nfvPPA) with TRF beta weights as input.
Table 5 Classification of a single PPA subtype relative to the other two PPA subtypes with TRF beta weights as input.
Table 6 Pairwise classification by PPA subtype with TRF beta weights as input.

Classification performance for TRF beta weights versus EEG

Our second research question was whether the use of TRF beta weights would improve classification performance over the use of (preprocessed) EEG waveforms alone. Accordingly, for each classification task, channel Cz of the EEG data was fed into ML classification algorithms. The outcomes from the best EEG-based classifier were then compared to the best TRF-based classifier (Tables 2, 3, 4, 5, 6). Equivalent or superior performance of EEG data relative to TRF beta weights for PPA classification would indicate that TRF modeling is not necessary. For every classification task except the broad classification of controls vs. PPA, the best TRF-based model outperformed the best EEG-based model at the 99.9% confidence level (Table 7). This provides preliminary evidence that TRF modeling is worth the time and expertise required to extract TRF beta weights because it improved predictive accuracy relative to EEG alone.

Table 7 Comparison of the best TRF and EEG models for all classification tasks.

Discussion

In the current study, we explored the potential utility of temporal response function (TRF) modeling for classification of primary progressive aphasia (PPA) using electroencephalography (EEG) and machine learning (ML) classification algorithms in order to provide an initial demonstration of the feasibility of the approach. Individuals with PPA and healthy controls listened to 30 minutes of continuous speech while EEG responses were recorded. TRF modeling was used to derive a linear function to map acoustic and linguistic features of the continuous speech onto the EEG data. Either the resulting TRF beta weights or (preprocessed) EEG data constituted input to the ML classification algorithms, which were used to perform a number of different classification tasks. We addressed two research questions in the current study.

Our first research question was whether TRF beta weights hold promise for use in PPA classification. The findings of the current study indicate that TRF beta weights may be useful for PPA classification, with better-than-chance classification performance observed for all tasks, although success varied across classification tasks. The most successful models were pairwise classification of PPA subtypes, with the best classification performance observed for svPPA vs. nfvPPA and svPPA vs. lvPPA (F1s = 0.74), followed by nfvPPA vs. lvPPA (F1 = 0.73). Relatively good classification performance was also observed for classifying lvPPA vs. svPPA/nfvPPA (F1 = 0.73), with poorer classification performance observed for nfvPPA vs. svPPA/lvPPA (F1 = 0.68), svPPA vs. nfvPPA/lvPPA (F1 = 0.67), PPA vs. controls (F1 = 0.60), three-way classification by PPA subtype (F1 = 0.48), and four-way classification (controls vs. svPPA vs. lvPPA vs. nfvPPA, F1 = 0.34). However, we would note that, clinically, a PPA diagnosis must be conferred before differential diagnosis by PPA subtype. Considering this hierarchical approach to diagnosis (general PPA diagnosis to specific subtype diagnosis), four-way classification is less clinically relevant. The poor classification of PPA vs. controls could potentially emerge from the heterogeneity in TRFs across the PPA subtypes, precluding clear differentiation from controls. The F1 score of 0.73 for classification of nfvPPA vs. lvPPA is especially notable, given that differential diagnosis of nfvPPA vs. lvPPA can be challenging for clinicians8,9. Taken together, the findings are particularly promising for situations where a diagnosis of PPA has been established, but differential diagnosis by subtype remains elusive, particularly if diagnosis has been narrowed to one of two subtypes. These results provide preliminary evidence regarding the potential value of TRF-based biomarkers for facilitating differential diagnosis in PPA.

Our second research question was whether there was an added benefit of incorporating TRF beta weights compared to utilizing preprocessed EEG waveforms alone. The findings of the current study indicate that use of TRF beta weights leads to significantly better classification performance over EEG alone, except in the classification of PPA vs. controls, where performance was similar between TRF- and EEG-derived classifications. Overall, we provide preliminary evidence that TRF modeling is worth the additional effort compared to EEG data alone, although future work should focus on how to make TRF modeling accessible within clinical practice settings since the current methods require access to proprietary software and technical expertise.

Previous research on automated approaches to diagnosis of PPA with neuroimaging data has utilized a variety of different inputs to the models, including structural magnetic resonance imaging (MRI)14,26, functional connectivity from magnetoencephalography (MEG)29, power spectral density from resting-state EEG30, and graph theory-derived measures from resting-state EEG28. Of most relevance to the current study is the work of Moral-Rubio and colleagues28, in which two classification tasks were performed (PPA vs. controls and four-way classification of controls, svPPA, nfvPPA, and lvPPA). In that study, classification of PPA vs. controls was superior to that in our study (F1 = 0.83 vs. F1 = 0.60), as was four-way classification of controls, svPPA, nfvPPA, and lvPPA (F1 = 0.60 vs. F1 = 0.39). Although Moral-Rubio et al.28 achieved better classification performance for PPA vs. controls and for four-way classification for some ML algorithms, we extend their work by performing a larger number of classification tasks and using EEG data collected while participants engaged with language stimuli. Differences in the number of data points used for model training and in the model architectures themselves preclude direct comparison of classification performance across these studies. However, our results are largely consistent with previous research in supporting a potential role for automated approaches to PPA diagnosis.

Overall, we demonstrate that ML utilizing TRF-based biomarkers derived from EEG data holds promise as a means to support diagnostic decision-making in PPA. In contrast to automated approaches using MRI or MEG as input, EEG has the benefit of being affordable, with no contraindications for use. Further, EEG is non-invasive, as opposed to positron emission tomography or cerebrospinal fluid-based biomarkers that are currently used in standard clinical practice, making it a safer approach to informing diagnosis. These findings in PPA add to the evidence base suggesting a role of TRF modeling in improving diagnostic decision-making in clinical populations more broadly. Automated approaches developed to aid diagnosis hold potential for addressing health disparities associated with diagnosis/misdiagnosis as a function of race/ethnicity (see85 for discussion) or English-speaking status. For example, only ~8% of America’s speech-language pathologists speak a language other than English (ASHA 202386) and many standard assessment materials are developed in English only. The development of automated approaches to diagnosis in languages other than English could mitigate the influence of these factors.

Limitations and future directions

The current study marks an important step toward use of automated approaches to diagnosis of PPA, and the exploratory nature of this study presents multiple avenues for further research. The current study included a relatively small number of participants (n = 10 per participant group) who were not perfectly matched for demographic characteristics (e.g., there is a larger proportion of female participants in the control group than in the PPA groups), limiting the generalizability of findings (although we would note that the 1095 data points included in the ML classification are at least an order of magnitude more than in any previous research on automated classification of PPA). Future research should be conducted with a larger number of participants to further improve classification performance and generalizability to new samples. It will also be important to consider whether and to what extent the current approach improves upon the current gold standard cognitive-linguistic assessments used for diagnosis.

It was somewhat surprising that one of the poorest performing classification tasks was for classification of PPA vs. controls. This is also the only classification task where TRF beta weights did not outperform the EEG-only classification. As indicated previously, it is possible that this is a consequence of the heterogeneity of TRFs across PPA subtypes, making it difficult to clearly identify a TRF profile that distinguishes all PPA subtypes from controls. However, distinguishing neurotypical older adults from persons with PPA is likely to be the least relevant for standard clinical practice, as individuals with PPA are more likely to be misdiagnosed with a different neurodegenerative syndrome or psychiatric condition87,88, rather than classified as healthy. In other words, the potential utility of a TRF-based classifier for differentiating PPA vs. controls is likely limited. Instead, classification of PPA vs. Alzheimer’s dementia or PPA vs. severe clinical depression, for example, would be more clinically useful. Thus, future research may focus on the development of automated tools for differential diagnosis across neurodegenerative syndromes and/or other neurological or psychiatric conditions.

There is a great deal to be learned regarding factors contributing to the relative success of one TRF model over another. For example, in the current study, analyses were restricted to electrode Cz in order to determine whether the approach was useful for classification of PPA and PPA subtypes. Given the modest success in the current study, future work should seek to identify optimal electrode configurations that maximize classification success. Along these lines, an important next step is to apply more advanced deep learning approaches, such as convolutional neural networks, to PPA classification30. Applying more advanced deep learning approaches has the potential to improve classification performance while providing more interpretability, allowing for the identification of features of the input that most strongly contribute to classification accuracy. In contrast to the ML classification algorithms used in the current study, which took only channel Cz as input, deep learning classification algorithms can be fed all channels of EEG data; thus, it will be possible to identify which channels (i.e., electrodes) are most useful for classification. Relatedly, due to the lack of interpretability offered by the ML models paired with the TRF beta weights in the current study, a number of questions remain unanswered. For example, why was there no clear relation between classification accuracy and the specific linguistic features used to derive TRF beta weights, and why did certain features perform better than others? Future work should focus on developing a better understanding of the factors that influence classification, with a particular emphasis on identifying acoustic and linguistic features that maximize classification accuracy. The results of such work may provide valuable insights into the nature and diagnosis of PPA syndromes as well as into the neural processing of the specific acoustic and linguistic features being modeled.

Conclusion

In the current study, we showed that TRF-derived beta weights for acoustic and linguistic features of a continuous narrative hold promise for use in PPA classification. In doing so, we demonstrate the potential clinical utility of this automated approach using a TRF-based biomarker derived from EEG. With recent efforts to draw attention to the amount of testing required of individuals with PPA89, automated approaches to diagnosis will likely continue to gain traction. The current study marks an important first step toward more automated approaches to diagnosis, particularly those using TRF modeling. It provides proof-of-concept for the utility of TRF modeling for use in clinical diagnostic decision-making, motivating future research seeking to fine-tune the specific parameters used for classification. Future work should seek to make these automated approaches more accessible to clinicians, moving this research a step closer to use in clinical practice.