Fig. 2 | Scientific Reports

Fig. 2

From: Language biomarker screening using AI: a transdiagnostic approach to the brain

NeuroScreen machine learning pipeline architecture for automated neurological assessment. The comprehensive workflow shows the development and validation of a diagnostic system that analyzes language production to distinguish between neurological conditions. Input data comprises speech and text samples from participants across six diagnostic groups: Left Hemisphere Damage (LHD), Right Hemisphere Damage (RHD), Dementia, Mild Cognitive Impairment (MCI), Traumatic Brain Injury (TBI), and Healthy Controls. Language production tasks undergo automated linguistic feature extraction across six domains: Lexicon (vocabulary richness), Phonology (speech sound patterns), Morphology (word formation), Syntax (grammatical structure), Semantics (meaning content), and Readability (text complexity). The preprocessing pipeline includes quality control checks, speaker leakage detection, correlated feature removal, mean imputation for missing values, z-score standardization, and principal component analysis for dimensionality reduction (retaining 95% variance). Five machine learning algorithms are systematically evaluated: Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), Gradient Boosting (GB), and Deep Neural Network (DNN). Model optimization employs hyperparameter tuning with GroupKFold cross-validation and randomized/halving grid search. Synthetic Minority Oversampling Technique (SMOTE) addresses class imbalance. The validated models comprise the NeuroScreen diagnostic tool for objective, automated neurological assessment based on quantitative linguistic analysis.
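
Three of the preprocessing steps named in the caption (mean imputation, z-score standardization, and keeping each speaker inside a single cross-validation fold) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the greedy fold assignment, and the toy data are assumptions made for illustration, and PCA, SMOTE, and model tuning are omitted. The point it shows is that imputation and scaling statistics are estimated on the training folds only, and that no speaker appears on both sides of a split.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_folds(speaker_ids, n_folds):
    """Assign every sample of a speaker to the same fold.

    Greedy heuristic (an assumption, in the spirit of GroupKFold):
    place the largest speakers first into the currently smallest fold.
    """
    counts = defaultdict(int)
    for speaker in speaker_ids:
        counts[speaker] += 1
    fold_sizes = [0] * n_folds
    fold_of = {}
    for speaker, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        k = fold_sizes.index(min(fold_sizes))
        fold_of[speaker] = k
        fold_sizes[k] += count
    return [fold_of[s] for s in speaker_ids]


def fit_imputer_scaler(train_rows):
    """Estimate per-feature means and standard deviations from training rows only."""
    n_features = len(train_rows[0])
    means = []
    for j in range(n_features):
        observed = [row[j] for row in train_rows if row[j] is not None]
        means.append(mean(observed))
    # Impute first, then compute the spread on the imputed training data.
    imputed = [
        [means[j] if row[j] is None else row[j] for j in range(n_features)]
        for row in train_rows
    ]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [pstdev([row[j] for row in imputed]) or 1.0 for j in range(n_features)]
    return means, stds


def transform(rows, means, stds):
    """Mean-impute missing values (None) and z-score each feature."""
    return [
        [((m if v is None else v) - m) / s for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy data (hypothetical): six samples from four speakers, two features.
speakers = ["s1", "s1", "s2", "s3", "s3", "s4"]
features = [
    [1.0, None], [3.0, 2.0], [5.0, 4.0],
    [7.0, None], [9.0, 6.0], [11.0, 8.0],
]

folds = group_folds(speakers, n_folds=2)

# Hold out fold 1; fit preprocessing statistics on fold 0 only.
train_idx = [i for i, f in enumerate(folds) if f != 1]
test_idx = [i for i, f in enumerate(folds) if f == 1]
means, stds = fit_imputer_scaler([features[i] for i in train_idx])
train_z = transform([features[i] for i in train_idx], means, stds)
test_z = transform([features[i] for i in test_idx], means, stds)

# Speaker leakage check: no speaker may appear in both partitions.
leaked = {speakers[i] for i in train_idx} & {speakers[i] for i in test_idx}
```

In a fuller pipeline, the same fit-on-training-only rule would also apply to correlated feature removal, PCA, and any oversampling step, so that held-out speakers never influence the fitted transforms.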
