Table 1 Technical Implementation and Evaluation of XAI Methods

From: A systematic review of explainable artificial intelligence methods for speech-based cognitive decline detection

| Study | XAI method(s) and implementation | Feature types and base models | Performance metrics | Evaluation framework and stability assessment | Quantitative metrics | Datasets used |
| --- | --- | --- | --- | --- | --- | --- |
| Heitz et al. (2024)17 | SHAP; applied to Random Forest; mean absolute SHAP value for importance | Lexical, syntactic, semantic. Models: BERT, Random Forest | AUROC: Manual BERT 0.899; Manual RF 0.888; ASR BERT 0.837; ASR RF 0.865 | Cross-validation; feature stability measurement; most stable: avg_word_length, mattr; most unstable: flesch_kincaid; balanced dataset validation | Feature stability (s_j metric); SHAP value distribution | ADReSS Challenge Dataset |
| Ilias & Askounis (2022)18 | LIME; 5000 samples; model-agnostic approach | Lexical, syntactic, semantic. Models: BERT variants, BioBERT, Siamese networks | Accuracy: 87.50%; F1-score: 86.73% | Statistical significance testing; feature correlation analysis; POS tag stability analysis | Point-biserial correlation; Jaccard’s index; 5-fold cross-validation | ADReSS Dataset |
| de Arriba-Pérez et al. (2024)27 | Meta-transformer wrapper; tree algorithm-based; component-based analysis | Content-independent, high-level reasoning, context-dependent. Models: Random Forest, Decision Tree, Naive Bayes, LLM (ChatGPT) | Accuracy: 98.47%; Precision: 98.49% | Component-based analysis; feature selection metrics; context-independent features | Mean Decrease in Impurity; Pearson correlation; 10-fold cross-validation | Celia web application dataset |
| Ambrosini et al. (2024)28 | SHAP; feature attribution method; feature importance ranking | Voice periodicity, spectral, syllabic. Models: SVM, CatBoost, Logistic Regression | Italian: 80–86%; Spanish: lower performance | Multi-language validation; cross-lingual stability; language-specific features | Feature attribution scores; cross-lingual metrics; multi-center validation | Custom dataset |
| Tang et al. (2023)19 | SHAP; open-source Python package; global feature importance | Lexical, syntactic, semantic. Models: SVM, MLP, AdaBoost, Ensemble | Accuracy: 89.58%; AUC: 0.9531 | Global–local interpretation; feature ranking; ASR stability analysis | Feature elimination; SHAP values; feature importance scores | ADReSS-IS2020 dataset |
| Chandler et al. (2023)30 | Feature attribution methods; Decision Tree extraction; statistical testing | Lexeme-level, syntactic, semantic. Models: Random Forest | Overall: 75%; F1-scores: 0.72–0.77 | Statistical testing; clinical correlation; temporal stability | F-statistic; univariate selection; clinical validation | Custom dataset |
| Iqbal et al. (2024)24 | LIME + SHAP; feature importance extraction; statistical testing | Lexical, syntactic, POS tags. Models: Random Forest | Accuracy: 80%; F1-score: 79–81% | LIME–SHAP comparison; statistical validation; POS tag stability | Confidence metrics; feature importance; random search | DementiaBank (ADReSSo challenge dataset) |
| Han et al. (2025)25 | SHAP (tree-based); counterfactual generation via LLM; chain-of-thought prompting | TF-IDF features, pause features. Model: XGBoost | Sensitivity: 4% → 42%; F1-score: 4% → 35% | Feature analysis before/after generation; Euclidean distance metrics | SHAP values; feature elimination; distance metrics | Pitt corpus from DementiaBank |
| Oiza-Zapata & Gallardo-Antolín (2025)20 | SHAP; Mutual Information; dual use for selection and interpretation | eGeMAPS (88 features). Models: SVM, Random Forest, XGBoost | Accuracy: 75.00%; AUC: 0.76; CUI: 0.5643 | Clinical Utility Index; ~70% computation reduction; systematic evaluation | SHAP importance; MI scores; clinical utility | ADReSS Dataset |
| Jang et al. (2021)29 | Feature importance analysis; t-statistics from LR coefficients; odds ratio interpretation | Acoustic (MFCCs), linguistic, eye-tracking. Models: Logistic Regression, Random Forest, Gaussian Naïve Bayes | AUC: 0.83 ± 0.01; task fusion performance | 10-fold cross-validation; feature ranking; multi-modal analysis | t-statistics; odds ratios; confidence intervals | Custom dataset |
| Li et al. (2025)26 | SHAP; attention mechanisms; correlation analysis | Acoustic, linguistic, topic modeling (DTM). Models: SVM, TITAN (custom) | Accuracy: 71.0%; AUC: 0.8120; F1: 0.7238 | Spearman correlation; cross-modal consistency; temporal analysis | SHAP beeswarm plots; attention weights; R² = 0.3876 | CU-MARVEL-RABBIT Corpus, ADReSS |
| Lima et al. (2025)21 | SHAP (TreeExplainer); feature importance; risk stratification | NLP features (100), eGeMAPS, GPT embeddings. Models: Random Forest, XGBoost, DNN | Accuracy: 76.5%; AUC: 0.857; MAE: 3.7 (MMSE) | 10-fold cross-validation; external validation; demographic parity | SHAP values; feature rankings; risk categories | ADReSSo, Lu Corpus, Pilot study |
| Ntampakis et al. (2025)22 | Attention visualization; RAG-based explanations; literature grounding | Acoustic (47), Wav2Vec2 embeddings, DeBERTa embeddings. Model: Multimodal Transformer | Accuracy: 95.77%; F1: 0.9576 | Medical professional evaluation; interpretability: 3.96/5; clinical relevance: 3.85/5 | Attention maps; explanation quality scores; clinical utility: 3.70/5 | IS2021 ADReSSo Challenge Dataset |

AD Alzheimer’s disease, ADReSS Alzheimer’s dementia recognition through spontaneous speech, ASR automatic speech recognition, AUC area under the curve, AUROC area under the receiver operating characteristic curve, BERT bidirectional encoder representations from transformers, BioBERT biomedical BERT, CNN convolutional neural network, CUI Clinical Utility Index, CV cross-validation, DeBERTa decoding-enhanced BERT with disentangled attention, DTM dynamic topic model, eGeMAPS extended Geneva Minimalistic Acoustic Parameter Set, F1 F1-score, GPT generative pre-trained transformer, LIME local interpretable model-agnostic explanations, LIWC linguistic inquiry and word count, LLM large language model, LR logistic regression, LSTM long short-term memory, MAE mean absolute error, MCI mild cognitive impairment, MFCC Mel-frequency cepstral coefficients, MI mutual information, MLP multi-layer perceptron, MMSE mini-mental state examination, NLP natural language processing, POS part-of-speech, QUADAS-2 Quality Assessment of Diagnostic Accuracy Studies-2, RAG retrieval-augmented generation, RF random forest, RNN recurrent neural network, SHAP SHapley Additive exPlanations, SVM support vector machine, TF-IDF term frequency–inverse document frequency, TITAN text-image temporal alignment network, Wav2Vec2 wave-to-vector version 2, XAI explainable artificial intelligence, XGBoost eXtreme gradient boosting
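
Several of the tabulated studies (e.g., Heitz et al.17, Tang et al.19, Lima et al.21) report the same SHAP workflow: fit a tree ensemble, compute per-sample SHAP values with TreeExplainer, and rank features by their mean absolute SHAP value. The sketch below illustrates that pattern; it is not code from any cited study, and the data and feature names are synthetic placeholders.

```python
# Illustrative sketch only: SHAP on a tree ensemble, with global importance
# taken as the mean absolute SHAP value per feature. Data and feature names
# are placeholders, not the features used in the reviewed studies.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["avg_word_length", "mattr", "flesch_kincaid", "pause_rate"]
X = rng.normal(size=(200, len(feature_names)))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# TreeExplainer computes exact SHAP values for tree-based models.
explainer = shap.TreeExplainer(model)
sv = np.array(explainer.shap_values(X))

# Depending on the shap version, binary-classifier output is either
# (n_classes, n_samples, n_features) or (n_samples, n_features, n_classes);
# keep the positive-class attributions in both cases.
if sv.ndim == 3:
    sv = sv[1] if sv.shape[0] == 2 else sv[..., 1]

# Global importance: mean absolute SHAP value per feature.
importance = np.abs(sv).mean(axis=0)
for name, score in sorted(zip(feature_names, importance), key=lambda t: -t[1]):
    print(f"{name}: {score:.4f}")
```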
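
Ilias & Askounis (2022)18 report LIME with 5000 perturbation samples per instance. A minimal sketch of that configuration is given below, assuming a stand-in TF-IDF plus logistic regression classifier rather than the BERT variants used in the study; because LIME only queries the probability function, the approach is model-agnostic.

```python
# Illustrative sketch only: LIME text explanation with num_samples=5000.
# The toy classifier and texts are placeholders, not the study's models or data.
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "the boy is reaching for the cookie jar",
    "um the the water is uh overflowing",
]
train_labels = [0, 1]  # 0 = control, 1 = dementia (toy labels)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

explainer = LimeTextExplainer(class_names=["control", "dementia"])
explanation = explainer.explain_instance(
    "the um the girl is uh standing on the stool",
    clf.predict_proba,   # model-agnostic: only a probability function is needed
    num_features=10,
    num_samples=5000,    # number of perturbed samples around the instance
)
print(explanation.as_list())  # (token, weight) pairs for the explained class
```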
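
Oiza-Zapata & Gallardo-Antolín (2025)20 use mutual information both to prune the 88 eGeMAPS descriptors and as an interpretability ranking alongside SHAP. The following sketch shows only the selection half of such a pipeline with scikit-learn; the feature matrix is random placeholder data, not eGeMAPS measurements.

```python
# Illustrative sketch only: mutual information used for feature selection
# before modelling, with the MI scores doubling as a global ranking.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 88))          # stand-in for 88 eGeMAPS descriptors
y = rng.integers(0, 2, size=120)

# Keep the 25 features with the highest mutual information with the label.
selector = SelectKBest(mutual_info_classif, k=25).fit(X, y)
X_reduced = selector.transform(X)

# MI scores provide a simple global interpretability ranking.
top = np.argsort(selector.scores_)[::-1][:5]
print("Top feature indices by MI:", top)

clf = SVC(probability=True).fit(X_reduced, y)
```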