Abstract
Psychosis poses substantial social and healthcare burdens. The analysis of speech is a promising approach for diagnosing and monitoring psychosis, capturing symptoms such as thought disorder and flattened affect. Recent advances in Natural Language Processing (NLP) enable the automated extraction of informative speech features, which has been leveraged for early psychosis detection and the assessment of symptomatology. However, critical gaps persist, including the absence of standardized sample-collection protocols, small sample sizes, and a lack of multi-illness classification, limiting clinical applicability. Our study aimed to (1) identify an optimal assessment approach for the online and remote collection of speech in the context of assessing the psychosis spectrum, and (2) evaluate whether a fully automated, speech-based machine learning (ML) pipeline can discriminate among different conditions on the schizophrenia-bipolar spectrum (SSD-BD-SPE), help-seeking comparison subjects (MDD), and healthy controls (HC) at varying layers of analysis and diagnostic complexity. We adopted online data-collection methods to collect 20 min of speech and demographic information from individuals. Participants were categorized as healthy controls (HC), as having a schizophrenia-spectrum disorder (SSD), bipolar disorder (BD), or major depressive disorder (MDD), or as being on the psychosis spectrum with sub-clinical psychotic experiences (SPE). SPE status was determined based on self-reported clinical diagnosis and responses to the PHQ-8 and PQ-16 screening questionnaires, while the other diagnoses were determined based on participant self-report. Linguistic and paralinguistic features were extracted, and ensemble learning algorithms (e.g., XGBoost) were used to train models. A 70–30% train-test split and 30-fold cross-validation were used to validate model performance. The final analysis sample included 1140 individuals and 22,650 min of speech.
Using 5 min of speech, our model could discriminate between HC and those with a serious mental illness (SSD or BD) with 86% accuracy (AUC = 0.91, Recall = 0.7, Precision = 0.98). Furthermore, our model could discern among the HC, SPE, BD and SSD groups with 86% accuracy (F1 macro = 0.855, Recall macro = 0.86, Precision macro = 0.86). Finally, in a 5-class discrimination task including individuals with MDD, our model achieved 76% accuracy (F1 macro = 0.757, Recall macro = 0.758, Precision macro = 0.766). Our ML pipeline demonstrated disorder-specific learning, achieving excellent or good accuracy across several classification tasks. We demonstrated that the screening of mental disorders is possible via a fully automated, remote speech-assessment pipeline. We tested our model on a relatively high number of conditions (5 classes) compared with the existing literature, and in a stratified sample of the psychosis spectrum comprising HC, SPE, SSD and BD (4 classes). We tested our model on a large sample (N = 1140) and demonstrated best-in-class accuracy with remotely collected speech data in the psychosis spectrum; however, further clinical validation is needed to test the reliability of model performance.
Introduction
Alterations of speech in psychiatric disorders have been proposed as an objective, reproducible, and time-efficient biomarker [1,2,3], given their high prognostic value for onset, course, chronicity, and treatment response [4,5,6,7,8,9,10,11,12,13,14,15,16] across numerous conditions. Objective and scalable speech markers have gained special interest in psychosis, where they can be combined with machine learning (ML) algorithms to detect illness, gauge symptom severity, or predict relapse from speech [6, 13, 16,17,18,19,20]. The putative predictive power of speech for psychotic disorder prognosis and for the development of psychosis from an at-risk state has received particular attention, as this opens up the promise of early, objective detection and targeted prevention [6, 9, 10, 21, 22]. Importantly, speech and language have been directly linked to key symptoms of psychosis, including formal thought disorder, changes in vocal expression, and flattened affect [23,24,25,26,27,28]. Several speech features, including pitch, rhythm, and connectivity, could signal changes in motor control and neural connectivity, which correlate with psychotic symptom severity [29,30,31,32,33]. These connections provide a neurobiological and phenomenological basis for why speech-based assessment can be useful in psychosis.
Despite these advancements, several critical gaps remain in the existing literature, limiting the application of speech analysis and assessment in clinical settings. Firstly, there is no standardized speech sample collection procedure. Speech elicitation methods vary widely across studies, with differing lengths and protocols. Prior studies often used speech recordings from laboratory settings with intensive human involvement in recording and transcription, which limits the comparability of findings. Secondly, partly driven by the burden of recruitment and assessment procedures, previous studies have been conducted with relatively limited sample sizes and relatively narrow clinical scope (e.g., patients with acute psychotic episodes). This limits the complexity of applicable computational modelling of speech data and does not address the translation of findings to more heterogeneous clinical populations. Thirdly, research to date has focused primarily on binary classification between “healthy” individuals and individuals with one specific disorder, such as bipolar disorder or schizophrenia. This does not translate to the discriminatory diagnosis or detection of subtle changes that clinicians perform routinely. Moreover, the focus on binary discrimination provides limited insight into whether speech-based ML models capture disorder-specific alterations in speech rather than merely learning the difference between healthy and unhealthy speech patterns.
To address these gaps, we collected speech data remotely and online, using various elicitation procedures, from a large cross-diagnostic sample of individuals (N = 1140). Participants included those at different points on the schizophrenia-bipolar continuum: people diagnosed with schizophrenia spectrum disorders (SSD), people diagnosed with bipolar disorder (BD), people with sub-clinical psychotic experiences (SPE) suggesting some vulnerability to psychosis without meeting the diagnostic criteria for full-blown psychotic conditions, and healthy control subjects (HC), who might experience some distress regarding their mental health but have not received any psychiatric diagnosis and do not report SPE. Based on significant biological, genetic, neurological and phenomenological overlap, the schizophrenia-bipolar continuum can be understood as spanning the psychosis spectrum, where one dimension represents the affective nature of the condition (from affective to non-affective psychosis) and another represents the severity and clinical nature of the symptoms (from sub-clinical experiences to full-blown, clinical conditions) [29, 34,35,36,37]. Studies aiming to identify biomarkers and intermediate phenotypes have found an unexpectedly high overlap in the biological characteristics of psychotic disorders across the schizophrenia and bipolar spectrum, with no clear biomarker-based distinctions apparent, which underlines the continuum interpretation [36, 37]. However, within the current diagnostic system and at the current level of scientific understanding, it is still worthwhile to distinguish between BD and SSD, as they have different prognoses and may require different treatments; we therefore wanted to test whether speech markers can capture these differences.
In addition to the primary clinical focus on the schizophrenia-bipolar spectrum, we collected speech data from people with major depressive disorder (MDD) to serve as a non-psychotic comparison sample. MDD is a highly prevalent condition that overlaps with psychosis both in terms of comorbidity and shared clinical presentations (e.g., anhedonia, amotivation) [5, 29, 38, 39]. MDD has also been connected to speech alterations that can overlap with speech changes observed in the psychosis spectrum, posing a potential challenge to speech-based discriminatory diagnosis [5, 38]. Lastly, we collected speech from healthy individuals (HC) to provide a control condition and to help evaluate the discriminatory power of speech further by adding an additional layer to our analyses.
We created a ML pipeline adopting state-of-the-art text-based and paralinguistic approaches and tested its ability to perform classification tasks across multiple conditions. In brief, our study aimed to:
1. Identify an optimal assessment approach for the online and remote collection of speech, in the context of assessing the psychosis spectrum.

2. Evaluate whether our ML pipeline can discriminate among different conditions on the schizophrenia-bipolar spectrum (SSD-BD-SPE), help-seeking comparison subjects (MDD), and healthy controls (HC) at varying layers of analysis.
Methods
Participants
Participants from the United States of America or the United Kingdom were recruited via the online platform Prolific (https://www.prolific.co). Potential participants were screened via self-report for the inclusion and exclusion criteria (N = 10,000). Diagnostic group was determined based on self-reported prior clinical diagnosis and responses to the 16-item Prodromal Questionnaire [40, 41] and the 8-item Patient Health Questionnaire [42,43,44]. Inclusion criteria were age of 18 years or above, being a native speaker of English, capability to record voice on a digital device, and ability to provide informed consent. We assigned participants to the following groups:
- Healthy control (HC): no history of a diagnosed mental disorder and scored less than 3 on both the PQ-16 symptom scale (prodromal psychotic symptoms) and the PHQ-8 (depression symptoms) [40, 41, 44].

- Sub-clinical psychotic experiences (SPE): no history of any diagnosed psychiatric disorder and scored more than 6 on the PQ-16 symptom scale (prodromal psychotic symptoms), a proxy for a clinical at-risk mental state [40].

- Bipolar disorder (BD): reported a clinical diagnosis of bipolar disorder (type I or II) but no diagnosis of schizophrenia or schizoaffective disorder. Participants who reported co-morbid conditions other than schizophrenia and schizoaffective disorder (e.g., generalized anxiety, substance use disorders) were included. Notably, participants in the BD group have not necessarily experienced psychotic symptoms.

- Schizophrenia-spectrum disorders (SSD): reported a clinical diagnosis of schizophrenia or schizoaffective disorder but no diagnosis of bipolar disorder. Participants who reported co-morbid conditions other than bipolar disorder (e.g., depression, anxiety) were included.

- Major depressive disorder (MDD): reported a diagnosis of major depression, no diagnosis of bipolar disorder, schizophrenia, or schizoaffective disorder, and scored less than 3 on the PQ-16. Participants who reported co-morbid conditions other than BD or SSD were included.
Exclusion criteria were a diagnosis of autism spectrum disorder or another developmental disorder affecting speech and language, and a history of cognitive impairment or dementia. Written, informed consent was collected from all participants.
Online assessment of speech
Speech was collected from eligible participants using an online experiment platform where participants provided secure voice samples in response to standardised prompts. Participants were instructed to record in a quiet environment, alone, and were explicitly asked to avoid conditions that could reduce the quality of the audio recording. After an initial microphone test, each participant completed five different speech-based tasks (detailed in Table 1), taking between 16 and 20 min. We selected the five tasks based on their use in previous research [17, 21, 45, 46]. Participants were allowed to stop and resume the testing procedure within a 1-week window.
Participants were compensated for their time via Prolific: the equivalent of 3 GBP for screening and 9 GBP for the speech recording tasks.
Data pre-processing
Incoming data was uploaded to a secure cloud server. Audio data was automatically transcribed into text using the Amazon Transcribe API. Files containing fewer than 5 words per task were excluded from analyses, and stop and filler words (e.g., hm, ahmm, uhmm, hmm) were removed before analysing the text, using a predefined list of target words suited to our Amazon ASR system. Stereo WAV audio files were converted to mono.
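As an illustration, the filler-word removal and minimum-length filter could be sketched as follows. The filler list and tokenisation here are illustrative; the actual predefined list was tuned to the Amazon Transcribe output.

```python
import re

# Hypothetical filler list for illustration; the study used a predefined
# list suited to the Amazon ASR output, which may differ from this one.
FILLERS = {"hm", "hmm", "ahmm", "uhmm", "uh", "um", "er"}

def clean_transcript(text, min_words=5):
    """Remove filler tokens; return None if fewer than min_words remain,
    mirroring the <5 words/task exclusion rule."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in FILLERS]
    if len(kept) < min_words:
        return None  # file excluded from analysis
    return " ".join(kept)

print(clean_transcript("Hm, um, I saw a dog uh near the park today"))
```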
Feature extraction
Following pre-processing, both raw audio data and text transcripts went through a feature extraction pipeline. Features were selected based on their demonstrated relevance to psychiatric conditions, their ability to capture abnormalities across multiple levels of language processing, and their robustness to automated transcription [47].
Text-based features
Natural Language Processing (NLP) techniques were applied to extract features reflecting abnormalities across semantic, morphological, and syntactic levels. Feature selection was guided by previous research demonstrating their discriminative power in psychosis-related populations and viability with automated transcription.
Semantic coherence was captured with 7 features. We first applied word-embedding methodology: each word was represented as a vector, such that words used in similar contexts (e.g., ‘desk’ and ‘table’) were represented by similar vectors. Vector representations were generated using word embeddings from the pre-trained word2vec Google News model [48]. Based on these word vectors, sentence embeddings were calculated using Smooth Inverse Frequency (SIF) to produce a single vector for each sentence [28]. We used word2vec and SIF embeddings because they previously yielded the greatest group differences between patients with schizophrenia and control subjects and have been widely applied in psychosis research [21, 28, 45, 48, 49]. Finally, cosine similarities between adjacent sentences and words were calculated to represent coherence at several levels. Coherence features included the mean coherence score of the sentences, repetition, repetition divided by the number of tokens in the speech sample, mean sentence coherence divided by the number of tokens in the speech sample, and the mean number of tokens per sentence.
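A minimal sketch of this coherence computation, using toy 3-dimensional vectors in place of the 300-dimensional pre-trained word2vec embeddings, and omitting SIF's principal-component removal step for brevity (the frequencies and vectors below are invented for illustration):

```python
import math

# Toy embeddings and corpus frequencies standing in for word2vec / corpus stats.
EMB = {"desk": [0.9, 0.1, 0.0], "table": [0.85, 0.2, 0.05],
       "dog": [0.0, 0.9, 0.3], "barked": [0.1, 0.8, 0.4]}
FREQ = {"desk": 1e-4, "table": 1e-4, "dog": 2e-4, "barked": 5e-5}
A = 1e-3  # SIF smoothing constant

def sif_embed(sentence):
    """Smooth Inverse Frequency sentence embedding: mean of word vectors
    weighted by a / (a + p(w)). Full SIF also subtracts the projection onto
    the first principal component, omitted here."""
    words = [w for w in sentence if w in EMB]
    dim = len(next(iter(EMB.values())))
    vec = [0.0] * dim
    for w in words:
        weight = A / (A + FREQ[w])
        for i in range(dim):
            vec[i] += weight * EMB[w][i]
    return [v / len(words) for v in vec]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Coherence: cosine similarity between adjacent sentence embeddings.
s1 = sif_embed(["desk", "table"])
s2 = sif_embed(["dog", "barked"])
print(cosine(s1, s1), cosine(s1, s2))
```

Semantically close adjacent sentences yield similarities near 1; topic jumps yield low similarities, and the mean over a sample gives the coherence score.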
Speech connectivity was measured by 30 features. Speech connectivity captures both semantic and syntactic information, measuring both semantic coherence and grammatical complexity. Overall, connectivity measures are considered a proxy of narrative planning [50], as they can capture markers of short-range and long-range recurrence of a subject in the text [40]. To measure speech connectivity, we generated speech graphs from the text, where words were represented as nodes and the grammatical connections between words as edges, following the procedure of Mota et al. [51]. Graphs were generated using three different types of text pre-processing (on tokens, word stems and part-of-speech tags), and different parameters of the graphs (e.g., the largest connected component, number of repeated edges, diameter) were used as connectivity features [18, 19, 22].
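A simplified, standard-library sketch of the graph construction and a few of the derived parameters (the full procedure of Mota et al. applies the three pre-processing variants and many more graph metrics; the feature names below are illustrative):

```python
from collections import Counter

def speech_graph_features(tokens):
    """Build a directed speech graph: each word is a node and each link
    between consecutive words is an edge, then derive a few connectivity
    parameters of the kind used as model features."""
    edges = list(zip(tokens, tokens[1:]))
    edge_counts = Counter(edges)
    return {
        "number_of_nodes": len(set(tokens)),
        "number_of_edges": len(edge_counts),  # unique edges
        "repeated_edges": sum(1 for c in edge_counts.values() if c > 1),
        "loops_of_one_node": sum(1 for a, b in edge_counts if a == b),
    }

feats = speech_graph_features("the dog chased the cat the dog ran".split())
print(feats)
```

Repetitive, poorly planned narratives produce graphs with fewer nodes and more repeated edges relative to text length, which is what makes these parameters informative.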
Syntactic complexity was measured by 6 features. These metrics were based on automated part-of-speech tagging using the NLTK (Natural Language Toolkit) Python package. Our features measured the frequency of tags relative to the number of tokens and relative to each other (e.g., number of Wh-determiners compared to all tags, number of unique tag types in the text) [2]. These metrics were chosen for their ability to quantify grammatical complexity and have shown discriminative power in previous studies on language in psychotic conditions [2, 47].
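Such tag-ratio features can be computed from (token, tag) pairs of the kind returned by NLTK's `pos_tag`. The specific ratios below are illustrative examples, not the exact six features used in the study:

```python
from collections import Counter

def syntactic_features(tagged):
    """Tag-frequency ratios computed from (token, tag) pairs, e.g. the
    output of nltk.pos_tag. The three ratios here are illustrative."""
    tags = [tag for _, tag in tagged]
    counts = Counter(tags)
    n = len(tags)
    return {
        "wdt_per_token": counts["WDT"] / n,   # Wh-determiners vs. all tokens
        "unique_tag_types": len(counts),      # tag diversity
        "noun_verb_ratio": counts["NN"] / max(counts["VBD"], 1),
    }

# Pre-tagged toy sentence (Penn Treebank tags, as NLTK would produce).
tagged = [("which", "WDT"), ("dog", "NN"), ("barked", "VBD"),
          ("loudly", "RB"), ("yesterday", "NN")]
print(syntactic_features(tagged))
```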
For a complete description of all features and their technical definitions, please refer to the Text-based features section of the Appendix.
Paralinguistic features
Paralinguistic features were designed to capture changes in emotional state (such as flattened affect) and articulation (a proxy of changes in motor control) and have demonstrated sensitivity to various psychiatric conditions [11, 46]. 116 paralinguistic parameters were extracted from the audio files, both at the level of the task and in 10 s frames with 5 s overlap. Our feature set comprises temporal parameters (e.g., average syllable and pause ratio), prosodic parameters (e.g., glottal pulse period, jitter, shimmer), frequency parameters (e.g., mean F1 frequency), spectral parameters (e.g., Mel-frequency cepstral coefficients) and amplitude parameters (e.g., loudness).
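The framing scheme (10 s frames with 5 s overlap, i.e., a 5 s hop) can be sketched as follows; this is a generic illustration of the windowing, not the extraction code itself:

```python
def frame_spans(duration_s, frame_s=10.0, hop_s=5.0):
    """Start/end times (in seconds) of overlapping analysis frames:
    10 s frames advanced by 5 s give 5 s of overlap between neighbours."""
    spans, start = [], 0.0
    while start + frame_s <= duration_s:
        spans.append((start, start + frame_s))
        start += hop_s
    return spans

# A 25 s recording yields four overlapping frames.
print(frame_spans(25.0))
```

Frame-level values of each parameter are then summarised (e.g., averaged) per task to yield the task-level features.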
Features were extracted using the audeering openSMILE and Praat software [52, 53]. For a complete description of all features and their technical definitions, please refer to the Paralinguistic features section of the Appendix.
Modelling outcomes
Following the automated transcription and feature extraction, the speech and language features were used to address the stated objectives, as summarized in Fig. 1. Data analyses were conducted in Python 3.7 and Jupyter Notebook v6.5.3.
Performance metrics of accuracy and balanced accuracy were used to compare predictive power. We chose these metrics because accuracy is a standard, widely used metric in ML applications that makes our results comparable with previous findings, while balanced accuracy accounts for the sensitivity of each class, making it particularly useful when the dataset is imbalanced. Balanced accuracy prevents bias toward the majority class, providing a more realistic picture of classification performance in our case. Accuracy is defined as the ratio of correctly predicted instances to the total number of instances in the dataset: Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100, where TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives. Balanced accuracy evaluates classification performance in multiclass problems by calculating the arithmetic mean of the sensitivity (true positive rate) of each class, where the sensitivity for a class is the number of true positive predictions for that class divided by the total number of actual instances of that class. In the binary case, Balanced Accuracy = ((TP / (TP + FN)) + (TN / (TN + FP))) / 2.
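The two metrics and the motivation for balanced accuracy can be illustrated directly from the confusion-matrix counts (the numbers below are a toy imbalanced example, not study data):

```python
def accuracy(tp, tn, fp, fn):
    """Percentage of correctly predicted instances."""
    return (tp + tn) / (tp + tn + fp + fn) * 100

def balanced_accuracy(tp, tn, fp, fn):
    """Binary case: arithmetic mean of per-class recall."""
    sensitivity = tp / (tp + fn)  # recall of the positive class
    specificity = tn / (tn + fp)  # recall of the negative class
    return (sensitivity + specificity) / 2

# Toy imbalanced test set: 90 negatives, 10 positives.
print(accuracy(tp=2, tn=88, fp=2, fn=8))           # high despite missing most positives
print(balanced_accuracy(tp=2, tn=88, fp=2, fn=8))  # exposes the poor minority-class recall
```

With these counts, plain accuracy is 90% even though only 2 of 10 positives are found, while balanced accuracy drops to roughly 0.59, which is why the latter is reported alongside accuracy for imbalanced groups.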
Objective 1: Identify an optimal assessment approach for the online and remote collection of speech, in the context of assessing conditions on the psychosis spectrum
We first evaluated the predictive power of the different types of speech tasks. We separated the speech data per task and fed each group of speech data corresponding to a given task into four types of ML algorithm (Random Forest, XGBoost, CatBoost, Extra Trees). We then compared the predictive power of the algorithms on the multiclass classification task among the HC, SPE, BD and SSD groups. All four algorithms are forms of ensemble learning, which combines the predictions of multiple individual models to create a stronger, more accurate, and more robust model. This approach has shown outstanding performance in speech and health applications, even when compared to more complex modelling approaches (e.g., Deep Neural Networks).
In this scenario, we aggregated data per file and per participant, taking the mean of each feature at the level of the audio file and of the participant. ML algorithms were trained on 70% of the dataset using leave-one-out cross-validation and validated on the stratified remaining 30% of the dataset.
Objective 2: Evaluate whether our ML pipeline can discriminate among different conditions on the schizophrenia-bipolar spectrum (SSD-BD-SPE), help-seeking comparison subjects (MDD), and healthy controls (HC) at varying layers of analysis
In the second part of our analysis, we tested the discriminative ability of online-collected speech at increasing levels of diagnostic complexity (Fig. 1, Objective 2). Firstly, we used the relatively simple task of binary classification, in which our speech-based machine learning pipeline had to discriminate between HC and serious mental illness (SSD and BD), modelling discrimination between relatively distinct conditions. Then, to increase complexity, we tested our speech-based ML pipeline on a multiclass classification problem (4 classes), in which the model had to discriminate among different conditions on the psychosis spectrum (SPE, BD, SSD) and HC simultaneously. In addition to reducing the random chance of correct classification, this task tested the model's ability to discriminate between conditions that are less distinct, show overlapping symptom profiles, and are challenging to separate using other biomarkers. Lastly, we tested our speech-based ML pipeline on a multiclass classification problem (5 classes), in which the model also had to discriminate a help-seeking psychiatric control group (MDD) in addition to the psychosis continuum classes, adding another layer of diagnostic complexity to the task.
We then tested whether our ML pipeline, based on an optimised 5 min speech battery, could discriminate between healthy controls and people with serious mental illness (defined as the combination of our SSD and BD groups). Basic demographic variables of age, gender and ethnicity were added to the pipeline as additional predictive features. Demographic and speech features were used to train an ensemble learning algorithm, XGBoost, as this algorithm outperformed the other ensemble methods (Random Forest, CatBoost, Extra Trees) in the previous experiment. XGBoost was optimised for Area Under the ROC Curve (AUC) using leave-one-out cross-validation in the training set. A 70–30% train-test split was applied to validate performance on the test set.
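AUC, the optimisation target here, has a useful rank interpretation: it is the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. A standard-library sketch (toy scores, not study outputs):

```python
def auc(scores_pos, scores_neg):
    """Rank-based AUC: fraction of (positive, negative) pairs in which the
    positive case is scored higher; ties count as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model scores for SMI cases (positives) vs. HC (negatives).
print(auc([0.9, 0.8, 0.35], [0.4, 0.3, 0.2]))
```

An AUC of 0.91, as reported for this binary task, means a randomly drawn SMI participant outscores a randomly drawn control 91% of the time, independent of the chosen decision threshold.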
Next, we tested how our ML pipeline, based on the optimised 5 min speech battery, performed on the multiclass classification problem of discriminating between the HC, SPE, BD and SSD groups. The same feature set was used, but we applied stratified SMOTE (Synthetic Minority Over-sampling Technique) to the training data to maximise the model's training capability on the unbalanced dataset. The same ensemble learning algorithm, XGBoost, was applied to the feature set to train the model using leave-one-out cross-validation. A 70–30% train-test split was applied to validate performance on a stratified test set.
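The core idea of SMOTE is to synthesise new minority-class samples by interpolating between a real sample and one of its nearest minority-class neighbours. In practice this is typically done with a library such as imbalanced-learn; the following is a minimal standard-library sketch of the interpolation step only (feature vectors and parameters are illustrative):

```python
import random

def smote_oversample(minority, n_new, k=2, seed=0):
    """Minimal SMOTE sketch: for each synthetic sample, pick a minority
    sample, pick one of its k nearest neighbours (Euclidean), and place a
    new point at a random position on the segment between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
new = smote_oversample(minority, n_new=2)
print(new)  # two points lying between existing minority samples
```

Because synthetic points are convex combinations of real minority samples, they enlarge the minority class without duplicating rows, which helps the classifier learn its decision boundary on the unbalanced groups.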
Evaluate whether our ML pipeline can further discriminate across the psychosis spectrum and major depressive disorder (HC-SPE-BD-SSD-MDD)
Finally, we evaluated whether our ML pipeline could discriminate across all 5 groups simultaneously: HC, SPE, SSD, BD, and MDD. By including MDD, a purely affective condition, alongside conditions on the psychosis spectrum, we could evaluate whether our ML architecture could identify disorder-specific speech patterns, as this requires the model to learn features that differentiate between fundamentally different types of psychiatric conditions. While our earlier analyses showed that a 5 min speech battery was sufficient for psychosis spectrum classification, we hypothesized that a broader range of speech samples would be needed to fully capture MDD's distinct characteristics, particularly its associated psychomotor and articulatory changes. Previous research has shown that MDD is associated with specific alterations in speech timing, prosody, and articulation [11, 12]. These changes are often most evident in structured tasks like text reading, where deviations from expected patterns can be more readily detected. Therefore, we utilized the complete 16–20 min speech battery, including all speech elicitation methods (dream recall, picture description, free speech, and text reading tasks). This comprehensive approach allowed us to evaluate both the thought disorder-related features characteristic of psychosis spectrum conditions and the psychomotor/articulatory changes associated with MDD. As in previous steps, an ensemble learning algorithm, XGBoost, was trained and tested using speech and demographic features. 30-fold cross-validation was applied to measure performance.
Explainability
To enhance model interpretability and understand feature contributions, we conducted several analyses. First, we calculated SHAP (SHapley Additive exPlanations) values for each feature to quantify their influence on model predictions. SHAP values are based on concepts from cooperative game theory and provide a unified measure of feature importance. For each prediction, SHAP values are calculated by evaluating the model over all possible combinations of features being present or absent, then averaging the marginal contribution of each feature across all these combinations. This process determines how each feature contributes to moving the prediction away from the baseline (average model output).
For each feature, SHAP values indicate how much that feature contributed to the difference between a model’s actual prediction and the average prediction for all samples. A positive SHAP value indicates the feature increased the predicted probability of a particular diagnosis, while a negative value indicates it decreased the probability. The magnitude of the SHAP value represents the strength of the feature’s influence on the prediction.
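The averaging of marginal contributions described above can be made concrete with an exact Shapley computation on a tiny toy "model" whose coalition value is simply the sum of fixed per-feature effects (invented numbers). Production SHAP libraries use efficient approximations such as TreeSHAP for tree ensembles like XGBoost rather than this brute-force enumeration:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values: each feature's marginal contribution to
    value_fn, weighted over all coalitions of the remaining features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for r in range(n):
            for coal in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += weight * (value_fn(set(coal) | {f}) - value_fn(set(coal)))
        phi[f] = total
    return phi

# Toy additive model: each feature pushes the prediction by a fixed amount,
# so the Shapley value recovers each effect exactly.
effects = {"pitch": 0.3, "coherence": 0.5}
phi = shapley_values(list(effects), lambda s: sum(effects[f] for f in s))
print(phi)
```

For this additive toy game the Shapley values equal the per-feature effects, illustrating the "fair attribution" property that makes SHAP values interpretable as per-feature pushes away from the baseline prediction.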
Additionally, we performed Mann-Whitney U tests with Bonferroni correction for multiple comparisons to identify features that showed significant differences (p < 0.05) between at least two diagnostic groups. These results are presented as bar charts in the Appendix, displaying means and standard errors for each diagnostic group. The combination of SHAP values and between-group statistical comparisons provides complementary perspectives on feature relevance: SHAP values indicate which features were most influential for model predictions, while the statistical comparisons reveal specific group differences in individual features.
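In practice such tests are run with a statistics library (e.g., `scipy.stats.mannwhitneyu`); the sketch below shows only the underlying U statistic and the Bonferroni adjustment, on invented data:

```python
def mann_whitney_u(x, y):
    """U statistic: count of (x, y) pairs where the x value exceeds the
    y value, with ties counted as 0.5 (rank-based, no normality assumed)."""
    u = 0.0
    for a in x:
        for b in y:
            u += 1.0 if a > b else 0.5 if a == b else 0.0
    return u

def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: each of the m tests is compared against
    alpha / m, controlling the family-wise error rate."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

print(mann_whitney_u([5, 6, 7], [1, 2, 6]))   # toy feature values, two groups
print(bonferroni([0.001, 0.03, 0.2]))         # which tests survive correction
```

With three comparisons, only p-values below 0.05 / 3 ≈ 0.0167 remain significant, which is why a nominally significant p = 0.03 is rejected after correction.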
Results
Sample
The sample consisted of 1140 individuals and 22,650 min of speech. The groups were imbalanced, due to the feasibility constraints of recruiting individuals with serious mental illness. The SSD group consisted of 84 participants, the BD group consisted of 227 participants, the SPE group consisted of 343 participants, the MDD group consisted of 156 participants and the HC group consisted of 330 participants. Demographic characteristics of the sample are displayed in Table 2.
Comparing assessment approaches
Using all 20 min of speech recordings without demographic information, the predictive power of the different tasks ranged between 52.8% and 66.9% accuracy for discriminating among the HC, SPE, BD and SSD groups (Table 3). The best performing models were consistently XGBoost algorithms. The neutral text reading task had the lowest predictive power, and the picture description task the highest. Sample duration did not show a strong relationship with predictive power: for example, 8 min of picture description achieved only a slight improvement over the dream recall task (1 min) and, on some metrics, underperformed the condition in which only four picture descriptions were considered.
Considering time constraints and predictive power, we optimised our short speech task as 4 min of picture description and one dream recall task.
Discriminating between healthy control and serious mental illness: HC vs. SSD/BD
Using computational features derived from the optimised 5 min of speech data and basic demographic information, our machine learning model could discriminate between the HC group and individuals with serious mental illness (SSD and BD) with 86% accuracy (AUC: 0.91, Recall: 0.7, Precision: 0.98) in the test set (Fig. 2). The most influential features in the model were the number of nodes in Part-of-Speech tag-based graphs; loops of one node, three nodes and two nodes in Part-of-Speech graphs; number of edges in Part-of-Speech and stem-based graphs; standard deviation of second formant frequency; average fundamental frequency; number of nodes in stem-based graphs; and standard deviation of first formant bandwidth, in descending order of importance (Fig. 3).
Left: The confusion matrix shows the true positive (TP, bottom right), true negative (TN, top left), false positive (FP, top right), and false negative (FN, bottom left) values on the test set. Shades reflect the number of cases in a given cell. Right: The ROC curve (Receiver Operating Characteristic curve) is a graphical representation of a classifier's performance. It plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various thresholds, showing how well the model can distinguish between classes. The area under the ROC curve (AUC) quantifies the classifier's overall performance; higher AUC values indicate better performance.
This horizontal bar chart displays Global SHAP (SHapley Additive exPlanations) values in a model discriminating between Healthy Controls (HC) and Serious Mental Illness (SMI) subjects based on speech features. The x-axis represents SHAP values ranging from 0 to 1.4, while features are listed on the y-axis. The bars are stacked to show relative contributions, with blue representing Healthy Controls (HC) and orange representing Serious Mental Illness (SMI) groups. Features shown include number of nodes in Part-of-Speech tag-based graphs (number_of_nodes_P), loops of one node (L1_P), three nodes (L3_P), and two nodes (L2_P) in Part-of-Speech graphs, number of edges in Part-of-Speech (number_of_edges_P) and stem-based graphs (number_of_edges_S), standard deviation of second formant frequency (F2frequency_stdev), average fundamental frequency (Pitch_Mean), number of nodes in stem-based graphs (number_of_nodes_S), and standard deviation of first formant bandwidth (F1bandwidth_stdev), arranged in descending order of importance.
Discriminating among the schizophrenia-bipolar spectrum and healthy controls: HC vs. SPE vs. BD vs. SSD
Using the same optimised 5 min speech task and demographic variables, our machine learning model could discriminate between the HC, SPE, BD and SSD groups simultaneously with 86% accuracy (F1 macro = 0.855, Recall macro: 0.86, Precision macro: 0.86) (Fig. 4) in our test set. The most influential features in the model were the number of strongly connected components in naive graphs, number of edges in naive graphs, standard deviation of first formant bandwidth, participant sex, average spectral flux for unvoiced sounds, average pitch, mean second formant frequency, loops of one node in Part-of-Speech graphs, mean first formant frequency, and mean of the third mel-frequency cepstral coefficient, in descending order of importance (Fig. 5).
This horizontal bar chart displays Global SHAP values in a model discriminating between Healthy Controls (HC), Sub-clinical Psychotic Experiences (SPE), Bipolar Disorder (BD), and Schizophrenia Spectrum Disorders (SSD) based on speech features. The x-axis represents SHAP values ranging from 0 to 3.5. Features shown include number of strongly connected components in naive graphs (LSC_N), number of edges in naive graphs (number_of_edges_N), standard deviation of first formant bandwidth (F1bandwidth_stdev), participant sex (Sex), average spectral flux for unvoiced sounds (spectralFluxUV_sma3nz_amean), average pitch (Pitch_Mean), mean second formant frequency (F2frequency_mean), loops of one node in Part-of-Speech graphs (L1_P), mean first formant frequency (F1frequency_mean), and mean of third mel-frequency cepstral coefficient (mfcc3_sma3_amean), arranged in descending order of importance. The bars are stacked to show relative contributions from each diagnostic group.
Discriminating among 5-classes: HC vs. MDD vs. SPE vs. BD vs. SSD
Using all of the speech data available (around 16 min), our machine learning model could discriminate among the HC, SPE, BD, SSD and MDD groups with 76% accuracy (F1 macro = 0.757, Recall macro = 0.758, Precision macro = 0.766) (Fig. 6). The most influential features in the model, in descending order of importance, were the number of nodes in Part-of-Speech tag-based graphs and speech duration, followed by adjective frequency, largest strongly connected component in Part-of-Speech graphs, participant age, maximum similarity between any two sentences (corrected by number of words), mean semantic similarity between adjacent sentences (corrected by number of words), frequency of loudness peaks in the speech signal, total count of syllables, and frequency of WH-pronouns (Fig. 7).
This horizontal bar chart displays Global SHAP (SHapley Additive exPlanations) values in a model discriminating between Healthy Controls (HC), Sub-clinical Psychotic Experiences (SPE), Bipolar Disorder (BD), Schizophrenia Spectrum Disorder (SSD) and Major Depressive Disorder (MDD) subjects based on speech characteristics. The x-axis represents SHAP values ranging from 0 to 1.75, while features are listed on the y-axis. The bars are stacked to show relative contributions, with different colors representing each group: blue (HC), orange (SPE), green (BD), red (SSD), and purple (MDD). The number of nodes in Part-of-Speech tag-based graphs and speech duration emerge as the most influential features, followed by adjective frequency (JJR_freq), largest strongly connected component in Part-of-Speech graphs (LSC_P), age (Age), maximum similarity between any two sentences, corrected by number of words (MaxSimCorr), mean semantic similarity between adjacent sentences, corrected by number of words (CoherenceCorr), frequency of loudness peaks in the speech signal (loudnessPeaksPerSec), total count of syllables spoken (Number_Syllables) and frequency of WH-pronouns (Wp-freq), in descending order of importance.
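The macro-averaged metrics reported for the multi-class models weight each diagnostic class equally, regardless of its prevalence, which matters in an unbalanced sample. A minimal scikit-learn sketch of their computation; the labels and predictions below are illustrative placeholders, not study data.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Illustrative true labels and model predictions over the five classes.
y_true = ["HC", "HC", "MDD", "SPE", "BD", "SSD", "SSD", "MDD", "BD", "SPE"]
y_pred = ["HC", "SPE", "MDD", "SPE", "BD", "SSD", "BD", "MDD", "BD", "SPE"]

# Macro averaging computes each metric per class, then takes the unweighted
# mean, so a minority class (e.g., SSD) counts as much as the majority class.
acc = accuracy_score(y_true, y_pred)
f1_macro = f1_score(y_true, y_pred, average="macro")
recall_macro = recall_score(y_true, y_pred, average="macro")
precision_macro = precision_score(y_true, y_pred, average="macro")

print(f"Accuracy = {acc:.3f}, F1 macro = {f1_macro:.3f}, "
      f"Recall macro = {recall_macro:.3f}, Precision macro = {precision_macro:.3f}")
```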
Discussion
Principal findings
In this study, we demonstrated the feasibility of using online-collected, self-recorded speech data to classify different conditions on the psychosis spectrum and MDD. The results demonstrate a fully automated, scalable approach that requires only 5 min of speech. More specifically, our findings highlight that speech, even when collected remotely and online, has a sufficient degree of between-group variability to discriminate between different forms and stages of psychosis spectrum conditions, as well as to discriminate between affective and psychotic conditions.
Our first objective was to compare the predictive power of different online speech-based tasks in discerning different conditions on the psychosis spectrum. The results showed that each task had different predictive power, and that increasing the length of a speech sample does not necessarily translate to better predictive power. Tasks that involved the generation of speech, rather than the recitation of prewritten text, were more informative. We speculate that this observation stems from the additional variation in textual information yielded by response-based tasks, relative to reading tasks that only generate variance in paralinguistic parameters. While affective changes and deviations in motor control can be clearly captured from acoustic information, thought disorder, one of the key symptoms of psychotic conditions, is more closely connected to language alterations [21, 49,50,51, 54,55,56,57,58]. It is therefore reasonable that tasks generating variance in language are better suited to capturing these conditions. Our findings are consistent with previous research showing the suitability of a dream recall task [59] and picture description tasks [45], particularly when compared against interview-based waking reports.
Once we identified the optimal speech task battery, we trained and tested our machine learning algorithms to assess their classification efficacy. Our ML pipeline showed excellent performance in a classification task between HC and individuals with serious mental illness. Previous studies leveraging either acoustic or NLP-based models demonstrated varied performance, ranging from moderate (min. AUC = 0.63) to excellent (max. accuracy = 97.68%), in discriminating between healthy controls and patients with schizophrenia [17, 52, 60,61,62,63,64,65,66]. Prior reports have also explored the use of AI techniques to discriminate between HC and patients with BD, with moderate performance reported (AUC ranges between 0.742 and 0.8) [67,68,69]. It is difficult to directly compare the performance of our ML pipelines, given the differences in study designs. Previous work used more homogeneous samples, had smaller sample sizes (N ranged between 15 and 288) and generally collected longer speech samples in offline, laboratory conditions. We had a diverse, large sample and utilised online, remote speech data collection. Yet, the performance of our model (AUC = 0.91) was comparable to, and possibly improved upon, the best-performing models in the field. Broadly, our results suggest that strong performance and accuracy in identifying psychotic conditions can be maintained even when using scalable assessment methods in a larger and more robust sample.
In addition to this binary classification task, our ML pipeline showed relatively strong performance in classifying across the HC, SPE (individuals with significant psychosis symptoms who do not meet the threshold for a clinical psychotic disorder diagnosis), BD, and SSD groups. Notably, the overall accuracy of the model did not drop when we introduced two extra classes compared to the binary classification (HC vs. SSD/BD), suggesting that speech has sufficient variance between different schizophrenia-bipolar spectrum conditions to enable high discriminatory power.
To our knowledge, this is the first time that AI-based algorithms have been used for diagnostic classification across this range of disorders; no prior study has leveraged linguistic markers to evaluate classification across more than three conditions on the psychosis spectrum. The only comparable study is from Espinola et al. [67], who trained a model to discriminate between major depression, bipolar disorder, schizophrenia, anxiety and healthy controls, using paralinguistic parameters from speech samples recorded offline. They achieved similar performance (accuracy = 76%) on a small sample of 76 people. Our results reinforce and expand on this work, using shorter, self-recorded, online-collected speech in a more robust sample. Spencer et al. [21] have also experimented with discriminating between healthy controls, people at clinical high risk and first-episode patients using automated language markers. They reported associations of different speech markers with specific conditions and with transition to psychosis from the at-risk state; however, their samples were limited in size, and they did not apply predictive modelling, which makes it difficult to compare results. Studies that aimed to discriminate between bipolar disorder and schizophrenia reported high performance (AUC ranged between 0.87 and 0.92) [67, 70].
While these results are promising, some misclassifications did occur. Notably, most misclassification occurred between the SPE and HC groups. We hypothesise that this stems from the relatively weak proxy used to label the SPE group in our study. Caseness for the SPE group was defined as having endorsed more than 6 symptoms on the PQ-16 scale. Although this threshold has previously been suggested as a screening threshold for clinical at-risk services [41], it is known to provide sensitivity at the cost of poor specificity. Consequently, it is reasonable to presume that a good portion of the SPE group does not meet the threshold for clinical high risk of psychosis and overlaps with the “healthy” group. This broad definition of psychosis spectrum status might have blurred the boundary between our healthy group and the SPE group. We speculate that using a more stringent threshold may reduce the number of misclassifications, but testing this is beyond the scope of this investigation.
There was also some misclassification between healthy individuals and the SSD group. This may suggest that our definition of healthy control subjects was also relatively lenient. Individuals were considered healthy if they did not report a clinical diagnosis of a mental disorder and were below the threshold on the PQ-16 and PHQ-8; however, these criteria do not guarantee that such individuals are free of mental health conditions. Along similar lines, we speculate that the majority of participants in the SSD group were stable and in remission while taking part in the study. This is a fair assumption given the proactivity, concentration, online presence and continuous engagement required to comply with the remote study. Although the composition of the group allows us to assume that the model captures trait-related changes in speech as opposed to disorder state-related changes, it also decreases the observable differences between the speech of the HC group and the SSD group. Therefore, future studies should focus on validating and replicating the current findings on clinically validated and labelled data. Future research is also needed to validate the ability of automated, speech-based AI models to capture dynamic, state-like variations such as symptom severity.
This misclassification pattern was replicated in our final analyses, where we trained and tested our algorithms to discriminate across all five groups, including MDD. This reinforces the notion that the definition of these groups was a relatively lenient proxy. Nevertheless, the best-performing ML models still achieved acceptable performance on the discrimination task. Notably, the model performed well in discerning between serious mental illness, HC and MDD, as well as between MDD and BD. Differential diagnosis across these conditions is a challenge in clinical settings, particularly in primary care [61, 71]. As such, the capability of our ML pipeline to distinguish between these conditions suggests a strong potential for translating our technology into clinical care, complementing clinical decision-making.
We observed significant differences in multiple speech features (both text-based and paralinguistic) between groups (see Appendix), with several features discriminating significantly between two or more conditions. However, no single feature showed significant differences across all conditions; rather, each feature discriminated certain conditions from others. Similarly, the most influential features varied across our models. This pattern suggests the importance of using models that incorporate multiple features, rather than focusing on a few, to achieve high discriminatory power across conditions. The presence of both text-based and acoustic features among the top 10 influential features in all three models reinforces that acoustic and text-based information should be analyzed together, as both demonstrate strong discriminative power. Graph-based features were the most influential in all three models, with particularly strong dominance in the HC-SMI model. This aligns with previous literature demonstrating the utility of graph-based analysis in discriminating between serious mental illnesses [13, 18, 52]. Demographic factors (age and sex) emerged as significant covariates in the other models, which could reflect either the importance of demographic factors in speech-based analyses [23] or characteristics specific to our sample, given the significant age differences between the HC and SPE groups and sex distribution differences among the HC, SPE, and BD groups.
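The graph-based speech features discussed above follow the speech-graph tradition of Mota et al., in which each word is a node and each consecutive word pair a directed edge (a “naive” word graph). A minimal networkx sketch of this construction, on a made-up transcript rather than study data, computing three of the reported features:

```python
import networkx as nx

# Illustrative transcript (not study data): each word becomes a node,
# each consecutive word pair a directed edge ("naive" word graph).
transcript = "i went to the park and then i went home and the park was empty"
words = transcript.split()

G = nx.DiGraph()
G.add_edges_from(zip(words, words[1:]))

# Feature names mirror those in the SHAP figures.
number_of_nodes_N = G.number_of_nodes()
number_of_edges_N = G.number_of_edges()
# LSC_N: size of the largest strongly connected component, i.e. the largest
# set of words mutually reachable along the direction of speech.
LSC_N = len(max(nx.strongly_connected_components(G), key=len))

print(number_of_nodes_N, number_of_edges_N, LSC_N)
```

Part-of-Speech and stem-based variants replace each word with its POS tag or stem before building the graph, which is why their node and edge counts differ from the naive graph's.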
It is important to note that our feature set was not comprehensive. Additional features can be and have been developed by researchers or learned through deep learning architectures. Therefore, while the attributed influence of features provides insight into our algorithm’s decision-making mechanism, they do not necessarily represent the most optimal way to discriminate between the investigated conditions.
Strengths & limitations
This study possesses several strengths. We included multiple psychiatric conditions across the psychosis and affective spectrum, rather than focusing solely on binary algorithmic classifications. Furthermore, our sample is, to our knowledge, the largest to date, reinforcing the robustness of our findings across a heterogeneous population. As study participants were not recruited during the acute stage of their illness, our results are more likely to rely on trait variations in speech specific to each disorder rather than “state” related alterations. In addition, the online nature of all data collection and assessments improves the ecological validity of our findings, reflective of real-world situations.
It is also important to acknowledge the limitations of this study. Firstly, our study does not comprehensively consider cognitive measures, which could potentially confound our results. Cognition has previously been connected to speech changes and is a predictor of psychosis severity [72]. However, the magnitude of cognition’s effects is ambiguous: some studies suggest that cognition has a strong effect on some features (especially connectivity-based ones), though it appears independent of other features [51, 56, 72, 73]. While we included several demographic variables in our analyses, it was beyond the scope of this study to consider cognition-related measures. Future research should consider certain cognitive variables and their impact on the predictive power of speech-based models; for example, education can be used as a proxy.
Secondly, this study utilizes self-report questionnaires and self-reports of clinical diagnosis as a proxy for having a certain condition. This is particularly pertinent to our definition of the SPE group, which was defined using the self-report PQ-16 questionnaire. Though we employed a previously used cut-off, the reliance on self-reports might have introduced biases into our analyses. It is feasible that more stringent or strictly clinical labels would improve the performance of our models, given this would reduce the potential overlap between classes. It is therefore imperative to further validate our models in additional clinical samples.
Thirdly, it is important to underscore the unbalanced nature of our sample (e.g. the SSD group was underrepresented compared to the HC or SPE groups), which might well skew results, particularly when comparing different conditions. However, we employed analytical methods (specifically SMOTE, comparing balanced accuracy, and macro F1 scores) to mitigate the effects of this unbalanced sample.
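SMOTE mitigates class imbalance by synthesising new minority-class samples through linear interpolation between a minority sample and one of its nearest minority neighbours. The sketch below illustrates the core idea in plain numpy; it is a simplified illustration on made-up 2-D data, not the imbalanced-learn implementation used in practice.

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_oversample(X_min, n_new, k=3):
    """Generate n_new synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from the chosen sample to all minority samples.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Illustrative imbalance: suppose 50 majority vs. 8 minority samples.
X_minority = rng.normal(loc=2.0, size=(8, 2))
X_new = smote_oversample(X_minority, n_new=42)
print(X_new.shape)  # 42 synthetic samples bring the minority class to 50
```

Because the synthetic points lie on segments between real minority samples, they stay within the minority class's feature range rather than duplicating existing samples.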
Fourthly, the current research was designed as a proof-of-concept study for a fully automated, remote, short, speech-based assessment of different conditions on the psychosis spectrum. Accordingly, we kept our ML models lean and comparable to each other. However, it is likely that deploying more complex ML architectures (such as Deep Neural Networks) and experimenting with a wider range of models and hyperparameter optimisation of such models would lead to better performance. Therefore, we interpret the reported model capabilities as promising findings rather than an upper marker of the potential capabilities of fully automated, remote, short, speech-based assessment technologies.
Fifthly, our analysis focused on structured speech tasks rather than natural conversations. While structured tasks offer a controlled environment for speech analysis, they may not fully capture the complexity and nuances of everyday speech patterns and social interactions. Natural interactions often involve different purposes, such as connecting, expressing opinions, or commiserating, which can impact linguistic choices and structures. For example, in casual conversation, individuals might use more interjections, repetitions for emphasis, or intensifiers, which may not be as prevalent in structured tasks. Similarly, pressured speech may be more apparent in unstructured settings where the individual has more freedom to express their rapid thoughts. Therefore, differences in speech characteristics between groups and predictive power of models should be interpreted solely in the context of monologic, prompted, structured speech.
Sixthly, the current study had certain exclusion criteria that may restrict our findings’ generalisability. The sample included only native English speakers and excluded neurodiverse individuals. Further research should be conducted in samples that include neurodiverse groups and non-English languages, to evaluate whether our models remain effective in more representative samples, a crucial step towards application in clinical settings.
Finally, the cross-sectional nature of our study restricts our ability to draw conclusions about causality and long-term trends. Longitudinal studies would help address this gap, allowing researchers to train models for the prediction of symptom changes, and allowing for the continuous emergence of speech alterations. The use of remote, quick assessments, as demonstrated in our study, might help facilitate such longitudinal studies, by enabling a more convenient way to continuously collect speech samples.
Conclusions
This study provides valuable insights into the scalability of speech-based assessment methods for the identification and classification of psychosis spectrum and affective conditions. Our findings broadly suggest that our ML pipeline learnt disorder- and severity-specific information from speech. We leveraged a large and robust sample to demonstrate best-in-class classification accuracy across psychosis spectrum conditions, using online, self-recorded speech data. Furthermore, this is the first report of accurate classification across five psychiatric groups. Overall, this study showcases the considerable potential of online, remotely collected speech for application in clinical contexts.
Data availability
Data is available upon reasonable request to authors for research purposes.
References
Corcoran CM, Mittal VA, Bearden CE, E Gur R, Hitczenko K, Bilgrami Z, et al. Language as a biomarker for psychosis: a natural language processing approach. Schizophr Res. 2020;226:158–66.
Corcoran CM, Cecchi GA. Using language processing and speech analysis for the identification of psychosis and other disorders. Biol Psychiatry. 2020;5:770–9.
Palaniyappan L. More than a biomarker: could language be a biosocial marker of psychosis? NPJ Schizophr. 2021;7:42.
Parola A, Simonsen A, Lin JM, Zhou Y, Wang H, Ubukata S, et al. Voice patterns as markers of schizophrenia: building a cumulative generalizable approach via a cross-linguistic and meta-analysis based investigation. Schizophr Bull. 2023;49:S125–S141.
Schneider K, Leinweber K, Jamalabadi H, Teutenberg L, Brosch K, Pfarr J, et al. Syntactic complexity and diversity of spontaneous speech production in schizophrenia spectrum and major depressive disorders. NPJ Schizophr. 2023;9:35.
Tang SX, Kriz R, Cho S, Park SJ, Harowitz J, Gur RE, et al. Natural language processing methods are sensitive to sub-clinical linguistic differences in schizophrenia spectrum disorders. NPJ Schizophr. 2021;7:25.
Tognin S, van Hell HH, Merritt K, Winter-van Rossum I, Bossong MG, Kempton MJ, et al. Towards precision medicine in psychosis: benefits and challenges of multimodal multicenter studies—PSYSCAN: translating neuroimaging findings from research into clinical practice. Schizophr Bull. 2020;46:432–41.
Gupta T, Hespos SJ, Horton WS, Mittal VA. Automated analysis of written narratives reveals abnormalities in referential cohesion in youth at ultra high risk for psychosis. Schizophr Res. 2017;192:82–88.
Corcoran CM, Carrillo F, Fernández‐Slezak D, Bedi G, Klim C, Javitt DC, et al. Prediction of psychosis across protocols and risk cohorts using automated language analysis. World Psychiatry. 2018;17:67–75.
Rezaii N, Walker E, Wolff P. A machine learning approach to predicting psychosis using semantic density and latent content analysis. NPJ Schizophr. 2019;5:9–12.
Cummins N, Dineley J, Conde P, Matcham F, Siddi S, Lamers F. Multilingual markers of depression in remotely collected speech samples: A preliminary analysis. J Affect Disord. 2023;341:128–36.
Hoffman GMA, Gonze JC, Mendlewicz J. Speech pause time as a method for the evaluation of psychomotor retardation in depressive illness. Br J Psychiatry. 1985;146:535–8.
Voppel A, de Boer J, Brederoo S, Schnack H, Sommer I. Quantified language connectedness in schizophrenia-spectrum disorders. Psychiatry Res. 2021;304:114130.
Chang X, Zhao W, Kang J, Xiang S, Xie C, Corona-Hernández H, et al. Language abnormalities in schizophrenia: binding core symptoms through contemporary empirical evidence. NPJ Schizophr. 2022;8:95.
Alonso-Sánchez MF, Ford SD, MacKinley M, Silva A, Limongi R, Palaniyappan L. Progressive changes in descriptive discourse in first episode schizophrenia: a longitudinal computational semantics study. NPJ Schizophr. 2022;8:36.
Mackinley M, Chan J, Ke H, Dempster K, Palaniyappan L. Linguistic determinants of formal thought disorder in first episode psychosis. Early Interv Psychiatry. 2021;15:344–51.
De Boer JN, Voppel AE, Brederoo SG, Schnack HG, Truong KP, Wijnen FNK, et al. Acoustic speech markers for schizophrenia-spectrum disorders: a diagnostic and symptom recognition tool. Psychol Med. 2021;53:1302–12.
Ciampelli S, de Boer JN, Voppel AE, Corona Hernandez H, Brederoo SG, van Dellen E, et al. Syntactic network analysis in schizophrenia-spectrum disorders. Schizophr Bull. 2023;49:S172–S182.
Palaniyappan L, Mota NB, Oowise S, Balain V, Copelli M, Ribeiro S, et al. Speech structure links the neural and socio-behavioural correlates of psychotic disorders. Prog Neuropsychopharmacol Biol Psychiatry. 2019;88:112–20.
Corona Hernández H, Corcoran C, Achim AM, de Boer JN, Boerma T, Brederoo SG, et al. Natural language processing markers for psychosis and other psychiatric disorders: emerging themes and research agenda from a cross-linguistic workshop. Schizophr Bull. 2023;49:S86–S92.
Spencer TJ, Thompson B, Oliver D, Diederen K, Demjaha A, Weinstein S, et al. Lower speech connectedness linked to incidence of psychosis in people at clinical high risk. Schizophr Res. 2021;228:493–501.
Bedi G, Carrillo F, Cecchi GA, Slezak DF, Sigman M, Mota NB, et al. Automated analysis of free speech predicts psychosis onset in high-risk youths. NPJ Schizophr. 2015;1:15030.
Bedwell J, Cohen A, Trachik B, Deptula A, Mitchell J. Speech prosody abnormalities and specific dimensional schizotypy features: are relationships limited to male participants? J Nerv Ment Dis. 2014;202:745–51.
Cohen AS, Hong SL. Understanding constricted affect in schizotypy through computerized prosodic analysis. J Personal Disord. 2011;25:478–91.
Cohen AS, Cox CR, Cowan T, Masucci MD, Le TP, Docherty AR, et al. High predictive accuracy of negative schizotypy with acoustic measures. Clin Psychol Sci. 2022;10:310–23.
Cohen A, Elvevåg B. Automated computerized analysis of speech in psychiatric disorders. Curr Opin Psychiatry. 2014;27:203–9.
Hinzen W, Rosselló J, McKenna P. Can delusions be understood linguistically? Cognit Neuropsychiatry. 2016;21:281–99.
Hinzen W, Rosselló J. The linguistics of schizophrenia: thought disturbance as language pathology across positive symptoms. Front Psychol. 2015;6:971.
Stein F, Gruber M, Mauritz M, Brosch K, Pfarr J, Ringwald KG, et al. Brain Structural network connectivity of formal thought disorder dimensions in affective and psychotic disorders. Biol Psychiatry. 2024;95:629–38.
Ehlen F, Montag C, Leopold K, Heinz A. Linguistic findings in persons with schizophrenia-a review of the current literature. Front Psychol. 2023;14:1287706.
Abi‐Dargham A, Moeller SJ, Ali F, DeLorenzo C, Domschke K, Horga G, et al. Candidate biomarkers in psychiatric disorders: state of the field. World Psychiatry. 2023;22:236–62.
Agurto C, Pietrowicz M, Norel R, Eyigoz EK, Stanislawski E, Cecchi G, et al. Analyzing acoustic and prosodic fluctuations in free speech to predict psychosis onset in high-risk youths. Annu Int Conf IEEE Eng Med Biol Soc. 2020;2020:5575–9.
Birnbaum ML, Abrami A, Heisig S, Ali A, Arenare E, Agurto C, et al. Acoustic and facial features from clinical interviews for machine learning-based psychiatric diagnosis: algorithm development. JMIR Ment Health. 2022;9:e24699.
Keshavan MS, Morris DW, Sweeney JA, Pearlson G, Thaker G, Seidman LJ, et al. A dimensional approach to the psychosis spectrum between bipolar disorder and schizophrenia: the schizo-bipolar scale. Schizophr Res. 2011;133:250–4.
Jablensky A. The diagnostic concept of schizophrenia: its history, evolution, and future prospects. Dialogues Clin Neurosci. 2010;12:271–87.
Tamminga CA, Pearlson G, Keshavan M, Sweeney J, Clementz B, Thaker G. Bipolar and schizophrenia network for intermediate phenotypes: outcomes across the psychosis continuum. Schizophr Bull. 2014;40:S131–S137.
Reininghaus U, Böhnke JR, Chavez‐Baldini U, Gibbons R, Ivleva E, Clementz BA, et al. Transdiagnostic dimensions of psychosis in the bipolar‐schizophrenia network on intermediate phenotypes (B‐SNIP). World Psychiatry. 2019;18:67–76.
Berardi M, Brosch K, Pfarr J, Schneider K, Sültmann A, Thomas-Odenthal F, et al. Relative importance of speech and voice features in the classification of schizophrenia and depression. Transl Psychiatry. 2023;13:298.
Koutsouleris N, Meisenzahl EM, Borgwardt S, Riecher-Rössler A, Frodl T, Kambeitz J, et al. Individualized differential diagnosis of schizophrenia and mood disorders using neuroanatomical biomarkers. Brain. 2015;138:2059–73.
Ising HK, Veling W, Loewy RL, Rietveld MW, Rietdijk J, Dragt S, et al. The validity of the 16- item version of the prodromal questionnaire (PQ-16) to screen for ultra high risk of developing psychosis in the general help-seeking population. Schizophr Bull. 2012;38:1288–96.
McDonald M, Christoforidou E, Van Rijsbergen N, Gajwani R, Gross J, Gumley AI, et al. Using online screening in the general population to detect participants at clinical high-risk for psychosis. Schizophr Bull. 2019;45:600–9.
Rancans E, Vrublevska J, Trapencieris M, Snikere S, Ivanovs R, Logins R, et al. Validity of patient health questionnaire (PHQ-9) in detecting depression in primary care settings in Latvia – the results of the National Research Project BIOMEDICINE. Eur Neuropsychopharmacol. 2016;26:S481.
Kroenke K, Spitzer RL. The PHQ-9: a new depression diagnostic and severity measure. Psychiatr Ann. 2002;32:509–15.
Kroenke K, Strine TW, Spitzer RL, Williams JBW, Berry JT, Mokdad AH. The PHQ-8 as a measure of current depression in the general population. J Affect Disord. 2009;114:163–73.
Morgan SE, Diederen K, Vértes PE, Ip SHY, Wang B, Thompson B, et al. Natural language processing markers in first episode psychosis and people at clinical high-risk. Transl Psychiatry. 2021;11:630.
Olah J, Diederen K, Gibbs-Dean T, Kempton MJ, Dobson R, Spencer T, et al. Online speech assessment of the psychotic spectrum: exploring the relationship between overlapping acoustic markers of schizotypy, depression and anxiety. Schizophr Res. 2023;259:11–19.
Olah J, Cummins N, Arribas M, et al. Towards a scalable approach to assess speech organization across the psychosis-spectrum -online assessment in conjunction with automated transcription and extraction of speech measures. Transl Psychiatry. 2024;14:156 https://doi.org/10.1038/s41398-024-02851-w.
Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013. Available from: https://arxiv.org/abs/1301.3781.
Nettekoven CR, Diederen K, Giles O, Duncan H, Stenson I, Olah J, et al. Semantic speech networks linked to formal thought disorder in early psychosis. Schizophr Bull. 2023;49:S142–S152.
Mota NB, Copelli M, Ribeiro S. Thought disorder measured as random speech structure classifies negative symptoms and schizophrenia diagnosis 6 months in advance. NPJ Schizophr. 2017;3 https://doi.org/10.1038/s41537-017-0019-3.
Mota NB, Weissheimer J, Finger I, Ribeiro M, Malcorra B, Hübner L. Speech as a graph: developmental perspectives on the organization of spoken language. Biol Psychiatry Cogn Neurosci Neuroimaging. 2023;8:985–93.
Eyben F, Wöllmer M, Schuller B. openSMILE – the munich versatile and fast open-source audio feature extractor. Proceedings of the 18th ACM International conference on multimedia. Firenze, Italy. New York: ACM; 2010. p. 1459–62.
Boersma P, Weenink D. Praat: doing phonetics by computer [Computer program]. Version 6.4.26. http://www.praat.org/. Published 2023. https://www.fon.hum.uva.nl/praat/.
Premkumar P, Kuipers E, Kumari V. The path from schizotypy to depression and aggression and the role of family stress. Eur Psychiatry. 2020;63:e79.
Xu W, Wang W, Portanova J, Chander A, Campbell A, Pakhomov S, et al. Fully automated detection of formal thought disorder with time-series augmented representations for detection of incoherent speech (TARDIS). J Biomed Inform. 2022;126:103998.
Mota NB, Vasconcelos NAP, Lemos N, Pieretti AC, Kinouchi O, Cecchi GA, et al. Speech graphs provide a quantitative measure of thought disorder in psychosis. PLoS ONE. 2012;7:e34928.
Harvey PD. Speech competence in manic and schizophrenic psychoses: The association between clinically rated thought disorder and cohesion and reference performance. J Abnorm Psychol. 1983;92:368–77.
Ayer A, Yalınçetin B, Aydınlı E, Sevilmiş Ş, Ulaş H, Binbay T, et al. Formal thought disorder in first-episode psychosis. Compr Psychiatry. 2016;70:209–15.
Mota NB, Furtado R, Maia PPC, Copelli M, Ribeiro S. Graph analysis of dream reports is especially informative about psychosis. Sci Rep. 2014;4:3691.
Iter D, Yoon J, Jurafsky D. Automatic detection of incoherent speech for diagnosing schizophrenia. In: Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic; 2018. https://doi.org/10.18653/v1/w18-0615.
Benoit J, Onyeaka H, Keshavan M, Torous J. Systematic review of digital phenotyping and machine learning in psychosis spectrum illnesses. Harv Rev Psychiatry. 2020;28:296–304.
Martínez-Sánchez F, Muela-Martínez JA, Cortés-Soto P, García Meilán JJ, Vera Ferrándiz JA, Egea Caparrós A, et al. Can the acoustic analysis of expressive prosody discriminate schizophrenia? Span J Psychol. 2015;18:E86.
Huang Y, Lin Y, Liu C, Lee L, Hung S, Lo J, et al. Assessing schizophrenia patients through linguistic and acoustic features using deep learning techniques. IEEE Trans Neural Syst Rehabil Eng. 2022;30:947–56.
Silva AM, Limongi R, MacKinley M, Ford SD, Alonso-Sánchez MF, Palaniyappan L. Syntactic complexity of spoken language in the diagnosis of schizophrenia: a probabilistic Bayes network model. Schizophr Res. 2022.
Turek A, Machalska K, Chrobak AA, Tereszko A, Siwek M, Dudek D. Speech graph analysis of verbal fluency tests distinguish between patients with schizophrenia and healthy controls. Eur Neuropsychopharmacol. 2017;27:S914–S915.
de Boer JN, Voppel AE, Brederoo SG, Schnack HG, Truong KP, Wijnen FNK, et al. Acoustic speech markers for schizophrenia-spectrum disorders: a diagnostic and symptom recognition tool. Psychol Med. 2023;53:1302–12.
Pan W, Deng F, Wang X, Hang B, Zhou W, Zhu T. Exploring the ability of vocal biomarkers in distinguishing depression from bipolar disorder, schizophrenia, and healthy controls. Front Psychiatry. 2023;14:1079448.
Naderi H, Haji Soleimani B, Matwin S. Multimodal deep learning for mental disorders prediction from audio speech samples. arXiv:1909.01067v5 [Preprint]. 2020. Available from: https://arxiv.org/abs/1909.01067.
Wanderley Espinola C, Gomes JC, Mônica Silva Pereira J, dos Santos WP. Detection of major depressive disorder, bipolar disorder, schizophrenia and generalized anxiety disorder using vocal acoustic analysis and machine learning: an exploratory study. Res Biomed Eng. 2022;38:813–29.
Gosztolya G, Bagi A, Szalóki S, Szendi I, Hoffmann I. Making a distinction between schizophrenia and bipolar disorder based on temporal parameters in spontaneous speech. Interspeech 2020. Published online October 25, 2020. https://doi.org/10.21437/interspeech.2020-49.
Buckley PF, Miller BJ, Lehrer DS, Castle DJ. Psychiatric comorbidities and schizophrenia. Schizophr Bull. 2009;35:383–402.
Moberget T, Ivry RB. Prediction, psychosis, and the cerebellum. Biol Psychiatry Cogn Neurosci Neuroimaging. 2019;4:820–31.
Mota NB, Sigman M, Cecchi G, Copelli M, Ribeiro S. The maturation of speech structure in psychosis is resistant to formal education. NPJ Schizophr. 2018;4:25–10.
Funding
This research was funded by an Innovate UK Fast Start Grant. SXT is supported by the NIH (K23 MH130750) and a Brain and Behavior Research Foundation Young Investigator Grant.
Author information
Contributions
JO and SXT conceptualised the study. EW, R Ch and JO administered the project and conducted data collection. JO wrote the first version of the manuscript. JO and OM analysed the data and ran the experiments. All authors contributed to the investigation, interpretation, review and editing of the manuscript, and all authors read and approved the final version.
Ethics declarations
Competing interests
JO, EW and R Ch are co-founders and employees of Psyrin and own equity in Psyrin. SXT owns equity in and serves as a consultant for North Shore Therapeutics, has received research funding from and serves as a consultant for Winterlight Labs, and is on the advisory board of and owns equity in Psyrin.
Ethics approval and consent to participate
The study has been approved by Advarra IRB (IRB#00000971), under reference number Pro00080888. All methods were performed in accordance with the relevant guidelines and regulations. Informed consent was obtained from all participants.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Olah, J., Wong, W.L.E., Chaudhry, Au.R.R. et al. Detecting schizophrenia, bipolar disorder, psychosis vulnerability and major depressive disorder from 5 minutes of online-collected speech. Transl Psychiatry 15, 241 (2025). https://doi.org/10.1038/s41398-025-03433-0