Table 1. Technical Implementation and Evaluation of XAI Methods
| Study | XAI method(s) and implementation | Feature types and base models | Performance metrics | Evaluation framework and stability assessment | Quantitative metrics | Datasets used |
|---|---|---|---|---|---|---|
| Heitz et al. (2024)17 | - SHAP<br>- Applied to Random Forest<br>- Mean absolute SHAP value for importance | - Lexical<br>- Syntactic<br>- Semantic<br>Models:<br>- BERT<br>- Random Forest | AUROC:<br>- Manual BERT: 0.899<br>- Manual RF: 0.888<br>- ASR BERT: 0.837<br>- ASR RF: 0.865 | - Cross-validation<br>- Feature stability measurement<br>- Most stable: avg_word_length, mattr<br>- Most unstable: flesch_kincaid<br>- Balanced dataset validation | - Feature stability (sj metric)<br>- SHAP value distribution | ADReSS Challenge Dataset |
| Ilias & Askounis (2022)18 | - LIME<br>- 5000 samples<br>- Model-agnostic approach | - Lexical<br>- Syntactic<br>- Semantic<br>Models:<br>- BERT variants<br>- BioBERT<br>- Siamese networks | - Accuracy: 87.50%<br>- F1-score: 86.73% | - Statistical significance testing<br>- Feature correlation analysis<br>- POS tag stability analysis | - Point-biserial correlation<br>- Jaccard’s index<br>- 5-fold cross-validation | ADReSS Dataset |
| de Arriba-Pérez et al. (2024)27 | - Meta-transformer wrapper<br>- Tree algorithm-based<br>- Component-based analysis | - Content-independent<br>- High-level reasoning<br>- Context-dependent<br>Models:<br>- Random Forest<br>- Decision Tree<br>- Naive Bayes<br>- LLM (ChatGPT) | - Accuracy: 98.47%<br>- Precision: 98.49% | - Component-based analysis<br>- Feature selection metrics<br>- Context-independent features | - Mean Decrease in Impurity<br>- Pearson correlation<br>- 10-fold cross-validation | Celia web application dataset |
| Ambrosini et al. (2024)28 | - SHAP<br>- Feature attribution method<br>- Feature importance ranking | - Voice periodicity<br>- Spectral<br>- Syllabic<br>Models:<br>- SVM<br>- CatBoost<br>- Logistic Regression | - Italian: 80-86%<br>- Spanish: lower performance | - Multi-language validation<br>- Cross-lingual stability<br>- Language-specific features | - Feature attribution scores<br>- Cross-lingual metrics<br>- Multi-center validation | Custom dataset |
| Tang et al. (2023)19 | - SHAP<br>- Open-source Python package<br>- Global feature importance | - Lexical<br>- Syntactic<br>- Semantic<br>Models:<br>- SVM<br>- MLP<br>- AdaBoost<br>- Ensemble | - Accuracy: 89.58%<br>- AUC: 0.9531 | - Global-local interpretation<br>- Feature ranking<br>- ASR stability analysis<br>- Feature elimination | - SHAP values<br>- Feature importance scores | ADReSS-IS2020 dataset |
| Chandler et al. (2023)30 | - Feature attribution methods<br>- Decision Tree extraction<br>- Statistical testing | - Lexeme-level<br>- Syntactic<br>- Semantic<br>Models:<br>- Random Forest | - Overall: 75%<br>- F1-scores: 0.72-0.77 | - Statistical testing<br>- Clinical correlation<br>- Temporal stability | - F-statistic<br>- Univariate selection<br>- Clinical validation | Custom dataset |
| Iqbal et al. (2024)24 | - LIME + SHAP<br>- Feature importance extraction<br>- Statistical testing | - Lexical<br>- Syntactic<br>- POS tags<br>Models:<br>- Random Forest | - Accuracy: 80%<br>- F1-score: 79-81% | - LIME-SHAP comparison<br>- Statistical validation<br>- POS tag stability | - Confidence metrics<br>- Feature importance<br>- Random search | DementiaBank (ADReSSo challenge dataset) |
| Han et al. (2025)25 | - SHAP (Tree-based)<br>- Counterfactual generation via LLM<br>- Chain-of-thought prompting | - TF-IDF features<br>- Pause features<br>Model:<br>- XGBoost | - Sensitivity: 4% → 42%<br>- F1-score: 4% → 35% | - Feature analysis before/after generation<br>- Euclidean distance metrics | - SHAP values<br>- Feature elimination<br>- Distance metrics | Pitt corpus from DementiaBank |
| Oiza-Zapata & Gallardo-Antolín (2025)20 | - SHAP<br>- Mutual Information<br>- Dual use for selection & interpretation | - eGeMAPS (88 features)<br>Models:<br>- SVM<br>- Random Forest<br>- XGBoost | - Accuracy: 75.00%<br>- AUC: 0.76<br>- CUI: 0.5643 | - Clinical Utility Index<br>- ~70% computation reduction<br>- Systematic evaluation | - SHAP importance<br>- MI scores<br>- Clinical utility | ADReSS Dataset |
| Jang et al. (2021)29 | - Feature importance analysis<br>- T-statistics from LR coefficients<br>- Odds ratio interpretation | - Acoustic (MFCCs)<br>- Linguistic<br>- Eye-tracking<br>Models:<br>- Logistic Regression<br>- Random Forest<br>- Gaussian Naïve Bayes | - AUC: 0.83 ± 0.01<br>- Task fusion performance | - 10-fold cross-validation<br>- Feature ranking<br>- Multi-modal analysis | - T-statistics<br>- Odds ratios<br>- Confidence intervals | Custom dataset |
| Li et al. (2025)26 | - SHAP<br>- Attention mechanisms<br>- Correlation analysis | - Acoustic<br>- Linguistic<br>- Topic modeling (DTM)<br>Models:<br>- SVM<br>- TITAN (custom) | - Accuracy: 71.0%<br>- AUC: 0.8120<br>- F1: 0.7238 | - Spearman correlation<br>- Cross-modal consistency<br>- Temporal analysis | - SHAP beeswarm plots<br>- Attention weights<br>- R² = 0.3876 | CU-MARVEL-RABBIT Corpus, ADReSS |
| Lima et al. (2025)21 | - SHAP (TreeExplainer)<br>- Feature importance<br>- Risk stratification | - NLP features (100)<br>- eGeMAPS<br>- GPT embeddings<br>Models:<br>- Random Forest<br>- XGBoost<br>- DNN | - Accuracy: 76.5%<br>- AUC: 0.857<br>- MAE: 3.7 (MMSE) | - 10-fold cross-validation<br>- External validation<br>- Demographic parity | - SHAP values<br>- Feature rankings<br>- Risk categories | ADReSSo, Lu Corpus, Pilot study |
| Ntampakis et al. (2025)22 | - Attention visualization<br>- RAG-based explanations<br>- Literature grounding | - Acoustic (47)<br>- Wav2Vec2 embeddings<br>- DeBERTa embeddings<br>Model:<br>- Multimodal Transformer | - Accuracy: 95.77%<br>- F1: 0.9576 | - Medical professional evaluation<br>- Interpretability: 3.96/5<br>- Clinical relevance: 3.85/5 | - Attention maps<br>- Explanation quality scores<br>- Clinical utility: 3.70/5 | IS2021 ADReSSo Challenge Dataset |
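Several of the stability assessments in the table (the feature-stability measurement in Heitz et al., the Jaccard's index in Ilias & Askounis) reduce to asking whether the same features stay at the top of the importance ranking across cross-validation folds. A minimal stdlib-only sketch of that idea; the function name and the fold rankings below are illustrative, not taken from any of the studies:

```python
from itertools import combinations

def topk_jaccard_stability(fold_rankings, k=3):
    """Mean pairwise Jaccard index of the top-k feature sets across folds.

    fold_rankings: one feature list per fold, ordered from most to least
    important (e.g., by mean absolute SHAP value within that fold).
    Returns a value in [0, 1]; 1.0 means every fold agrees on the top-k set.
    """
    top_sets = [set(r[:k]) for r in fold_rankings]
    pairs = list(combinations(top_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Hypothetical importance rankings from three CV folds (feature names
# echo those reported as stable/unstable by Heitz et al.).
folds = [
    ["avg_word_length", "mattr", "pause_rate", "flesch_kincaid"],
    ["mattr", "avg_word_length", "pause_rate", "flesch_kincaid"],
    ["avg_word_length", "pause_rate", "mattr", "flesch_kincaid"],
]

print(topk_jaccard_stability(folds, k=3))  # 1.0: identical top-3 set in all folds
```

A score near 1.0 supports reporting a single global ranking; a low score suggests the "most important features" depend on the data split and should be presented with caution.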
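The mutual-information half of Oiza-Zapata & Gallardo-Antolín's dual selection-and-interpretation setup scores each feature by I(X;Y) between the (discretized) feature and the diagnosis label. A stdlib-only sketch for discrete variables; the binarized feature and label values below are made up for illustration:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from paired samples of two discrete variables."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # joint counts
    px = Counter(xs)             # marginal counts of X
    py = Counter(ys)             # marginal counts of Y
    mi = 0.0
    for (x, y), c in pxy.items():
        # (c/n) * log2( p(x,y) / (p(x) p(y)) ), with counts cancelled into c*n
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi

# Hypothetical binarized feature (e.g., "long pauses present") vs. label.
feature = [1, 1, 1, 0, 0, 0, 1, 0]
label   = [1, 1, 1, 0, 0, 0, 0, 1]
print(round(mutual_information(feature, label), 3))  # 0.189
```

Ranking features by this score and keeping only the top ones is what enables the roughly 70% computation reduction the table reports, since most eGeMAPS features can be dropped before training.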
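The global importance rankings that recur in the table come from mean absolute SHAP values, which require the shap library and a trained model. Permutation importance is a simpler model-agnostic stand-in that captures the same intent: shuffle one feature column and measure how much accuracy drops. A self-contained sketch with a toy threshold model (all names and data here are hypothetical):

```python
import random

def permutation_importance(predict, X, y, n_features, n_repeats=10, seed=0):
    """Per-feature drop in accuracy when that feature's column is shuffled.

    predict: callable mapping one feature row to a predicted label.
    Larger drops mean the model relies on the feature more heavily.
    """
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = accuracy(X)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the feature-label association
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - accuracy(X_perm))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy model: predicts positive when feature 0 exceeds 0.5; feature 1 is ignored.
def threshold_model(row):
    return int(row[0] > 0.5)

X = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.3]]
y = [1, 1, 0, 0]
imps = permutation_importance(threshold_model, X, y, n_features=2)
# Feature 0 gets a positive score; feature 1 scores exactly 0.0.
```

Unlike SHAP, this gives only a global ranking, with no per-sample attributions, but it makes the underlying question concrete: how much does the model's behaviour change when a feature is made uninformative?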