Analysis of conceptual overlap among formal thought disorder rating scales in psychosis: a systematic semantic synthesis

Voppel, Alban; Ciampelli, Silvia; Kircher, Tilo; Liddle, Peter F.; Massuda, Raffael; Stein, Frederike; Tang, Sunny X.; Ray, Manaan Kar; Park, Sohee; Palaniyappan, Lena

doi:10.1038/s41537-025-00712-z

Published: 15 December 2025

Schizophrenia volume 12, Article number: 9 (2026)

Abstract

Measuring formal thought disorder (FTD), a common symptom dimension that cuts across diagnoses of mental disorders, is plagued by numerous inconsistencies. Clinicians use either FTD-specific scales or items from generic scales. While these tools are based on extensive clinical observations, they suffer from inconsistent terminology. Different scales may use the same term for distinct concepts or different terms for the same concept. This lack of conceptual standardization prevents the identification of underlying FTD subconstructs. Using natural language processing, we compared the definitions, labeling and overlap of FTD symptoms, i.e., the definitions of single items, across psychopathological scales. We used a three-pronged validation approach to analyze semantic clusters of single-item definitions from FTD rating scales. First, we used sentence-BERT to divide 30 Thought and Language Disorder scale (TALD) items into positive or negative FTD clusters, validating this approach by checking for correspondence with published factor-analytic divisions (approach validation). Second, we created a sparse item-to-item similarity matrix from 103 items across seven scales to identify semantically converging cross-scale FTD items; a clinician-researcher described the resulting four clusters, and we compared our automated classification with that of six blinded experts to establish expert-machine semantic correspondence. Finally, we analyzed data from 98 participants (49 healthy controls and 49 with schizophrenia or affective psychosis), identifying the highest-correlating Clinical Language Disorder Scale (CLANG) item for each Thought, Language and Communication (TLC) scale item and mapping these to our BERT-derived clusters to establish data-level correspondence. When assigning TALD items to BERT-derived positive or negative FTD groupings, we observed a 73% match with prior factor analyses.
The BERT-informed clustering of cross-scale items highlighted four coherent FTD groupings: (1) muddled communication & incomprehension, (2) abrupt topic shifts, (3) inconsistent narrative structure, (4) restricted speech. Expert raters showed moderate-to-high overlap (Fleiss’ kappa = 0.617) with computational clusters. A binomial test indicated that at the level of individual participants, correlations among CLANG-TLC item pairs were significantly more likely than chance to fall into the expected semantic cluster (p < 0.001). FTD rating scales measure overlapping, semantically related constructs that drive item-level correlations. Semantic clustering acts as a novel method to harmonize multi-scale data and pinpoint discrepancies between expert and machine classifications. Computational linguistics has the potential to improve consistency across rating scales especially when measuring complex constructs such as FTD.

Introduction

Formal thought disorder (FTD) is present across many mental disorders and is a hallmark symptom of schizophrenia-spectrum disorders in particular. It manifests as disruptions in the structure and flow of thought, often resulting in incoherent, fragmented or diminished speech¹. There is no single approach to measuring the severity of FTD; clinicians use scales such as the Thought, Language, and Communication scale (TLC)², the Thought and Language Disorder scale (TALD)³, or items from general symptom scales such as the Positive and Negative Syndrome Scale (PANSS)⁴. These scales are grounded in extensive clinical observations and historical conceptualizations of FTD^1,5,6,7. Although these scales set out to capture an overlapping construct, some incorporate items that do not appear in others, give notably different labels to the same phenomena, or describe differing phenomena using the same label. For instance, the phenomenon described as “logorrhea” in TALD broadly aligns with “pressured speech” in TLC, but the differences in nosology can obscure their equivalence. These variations make it difficult to compare scores across studies that use different scales, and raise questions about the reproducibility and interpretation of empirical observations (e.g., neuroimaging correlates, outcome predictions, treatment effects) reported from various FTD scales⁸. To address this problem of ‘incommensurability’ among scales⁹, here we combine observations of FTD scales from patients, human experts, and machines that detect patterns in human language to provide a more structured understanding of this domain.

One major source of heterogeneity among FTD scales stems from the differences in the concepts the authors set out to measure via items when constructing the rating scales. For instance, “poverty of thought” has been conceptualised as a problem in the quantity of speech (TLC), the patient’s subjective experience (TALD), or their ability to sustain a conversation (PANSS). An obvious source of conceptual divergence is the level within the thought-language-communication system that each scale has focussed on. Some authors have explicitly endorsed an aspect of thought, language or communication over the others, though the resulting scales retained the broad aim of measuring FTD. A less obvious source of divergence is the varied theoretical assumptions held by the authors of rating scales. To date, it is not clear how these conceptual differences influence the building blocks (sub-constructs) that constitute FTD. A systematic examination of the meaning conveyed by operational descriptions of items across scales can clarify this issue.

Some of the measurement heterogeneity has become evident from factor-analytic approaches that focused on item validity and symptom overlap (e.g.,^1,5,10,11; for a review, see⁸). Although factor analysis can illuminate latent dimensions and guide conceptual refinements, its scope is inherently constrained by the need for substantial, overlapping datasets. To examine cross-scale consistency using this approach, the same participants need to be rated on multiple scales. This is not only time-consuming, but also prone to implicit rater bias (if done by the same rater) or unknown inter-rater differences, and is often not scalable beyond two instruments^7,12,13.

An emerging alternative approach is the use of content analysis. Here, the substrate of analysis is the descriptive text of rating scales rather than participant-level scores. This approach has been used to explore, for example, screening questionnaires in clinical high-risk psychosis¹⁴, depression¹⁵ and neurological symptoms¹⁶. The advent of Natural Language Processing (NLP) approaches has advanced content analysis by enabling granular examination of semantics, i.e., the meaning carried by descriptors in rating scales. Such NLP approaches employ large language models trained on extensive corpora that capture subtle semantic relationships among words and sentences, providing an objective complement to expert-driven content analysis. Models such as Bidirectional Encoder Representations from Transformers (BERT) and GPT^17,18 offer theory-agnostic approaches for language assessment, enabling the concepts that drive textual descriptions to be made explicit (e.g., see¹⁹). By embedding each item of a rating scale in a high-dimensional semantic space, it becomes feasible to quantify item-to-item similarity in meaning (semantic clustering) across multiple rating instruments²⁰. While the application of NLP to study FTD has mostly focused on patient-derived speech data^21,22,23,24, we employ this approach to examine how different scales define, label, and describe the phenomena that constitute FTD.

Here, we apply semantic clustering to rating scales that are commonly used in operationalising the concept of FTD. Our aim is to identify the elements that constitute the “backbone” of the FTD construct across the scales. We anticipate that this semantic structure reflects the core overlapping concepts that likely influenced the itemisation of the FTD construct across scales. To this end, we first demonstrate that the item-level semantic clustering approach applied to one of the most exhaustive FTD scales (TALD) identifies the relationships that define the well-established subconstructs of positive and negative FTD (approach validation). We then analyze 103 item-level descriptions from 7 FTD scales to identify meaningful clusters based on conceptual overlap and lack thereof (cluster generation). We estimate the agreement between machine-generated clusters and human experts in assigning items to each cluster (expert assignment). Finally, we relate these findings to a clinical dataset in which two diverging scales, CLANG and TLC, were administered (clinical alignment), comparing how real-world item correlations map onto the NLP-derived clusters. To reconcile the observed differences and to aid harmonisation of FTD measurements across empirical studies, we provide the means to place individual items onto potential semantic clusters.

Methods

Rating scale items

We selected seven commonly used rating scales for measuring FTD: the TALD³, the Scale for the Assessment of Positive Symptoms and Scale for the Assessment of Negative Symptoms (SAPS-SANS)^25,26, the PANSS⁴, the Thought and Language Index (TLI)²⁷, the Scale for the Assessment of Thought, Language, and Communication (TLC)², the Assessment of Bizarre-Idiosyncratic Thinking (BIT)²⁸, and the Clinical Language Disorder Rating Scale (CLANG)²⁹.

For the multi-dimensional scales (SAPS-SANS and PANSS), only subscales directly related to disorganized thinking or FTD were included. In total, we extracted 103 FTD-related items from these scales (see Supplemental Table 1 for a list of included items). Since these scales vary in their level of detail—some providing extensive examples, others using concise descriptors—we removed illustrative parenthetical examples to maintain consistency in focus and length. For instance, when processing the TALD item “restricted thinking”, we removed the words “(e.g., a depressive patient who is preoccupied with his indigestion)”. This allowed us to retain only the core phrasing of each item for semantic comparison.

Semantic embeddings

All item-level descriptions were converted to lowercase for consistency. We selected a sentence-level BERT model, specifically the all-mpnet-base-v2 model (https://huggingface.co/sentence-transformers/all-mpnet-base-v2), because the symptom descriptions predominantly comprise single sentences or short text segments, making a sentence-centric approach particularly suitable. The all-mpnet-base-v2 pre-trained model maps text (at the sentence or paragraph level) to a 768-dimensional dense vector space and was trained on a broad range of natural language tasks, thereby capturing contextual and semantic nuances within and between sentences¹⁷. Consequently, each FTD item description was transformed into a unique 768-dimensional embedding. By using a single, standardized embedding approach, we avoid biases that might arise from employing multiple specialized models or custom vocabularies.
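The preprocessing and embedding steps described above could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `clean_item` and `embed_items` are hypothetical, the regular expression only handles parentheticals that open with "e.g." (the source does not specify how examples were identified), and the embedding step assumes the `sentence-transformers` package is installed.

```python
import re

# Assumption: illustrative examples are parentheticals that open with "e.g.".
_EXAMPLE_PAREN = re.compile(r"\s*\(\s*e\.g\.[^()]*\)")


def clean_item(description: str) -> str:
    """Drop '(e.g., ...)' parentheticals, collapse whitespace, and lowercase."""
    without_examples = _EXAMPLE_PAREN.sub("", description)
    return " ".join(without_examples.split()).lower()


def embed_items(descriptions):
    """Return one 768-dimensional vector per cleaned item description."""
    # Imported lazily so clean_item() can be used without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    return model.encode([clean_item(text) for text in descriptions])
```

Calling `embed_items` downloads the pre-trained model on first use; `clean_item` can be checked on its own, for example on a shortened, hypothetical version of the "restricted thinking" item.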

Approach validation - positive and negative FTD

We assessed sentence cosine similarity between 30 TALD item descriptions and published descriptions of the widely used constructs of positive and negative FTD³. We paired the TALD with factor descriptions written by the same authors so that item and construct wording remained consistent. Specifically, we compared each TALD item-level description to the following text for positive FTD: “Positive FTD is best represented by derailment and loosening of associations, an increased amount of produced speech (e.g., logorrhoea, pressured speech), the use of new words (neologisms), and stilted speech phenomena (manneristic speech)”; and to the following text for negative FTD: “Negative FTD has been conceptualised as a quantitative deficit in speech and thought production (e.g., poverty of speech, slowed thinking, and blocking)”³⁰. Each of the 30 TALD items was then assigned to a positive or negative FTD group, according to whichever similarity score (positive or negative) was higher. The resulting two-way classification was subsequently compared with a factor analysis-derived grouping of the same items³.
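The assignment rule can be illustrated with a short sketch. The three-dimensional vectors are toy stand-ins for the 768-dimensional embeddings, the function names are hypothetical, and the handling of exact ties (which the text does not address) is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def assign_polarity(item_vec, positive_vec, negative_vec):
    """Label an item by whichever reference text it is more similar to."""
    pos = cosine(item_vec, positive_vec)
    neg = cosine(item_vec, negative_vec)
    # Assumption: an exact tie falls to "negative"; the source does not say.
    label = "positive" if pos > neg else "negative"
    return label, pos, neg


# Toy vectors standing in for the two reference descriptions and two items.
positive_ref = [1.0, 0.0, 0.0]
negative_ref = [0.0, 1.0, 0.0]
label_a, _, _ = assign_polarity([0.9, 0.2, 0.1], positive_ref, negative_ref)
label_b, _, _ = assign_polarity([0.1, 0.8, 0.3], positive_ref, negative_ref)
```

With real embeddings, the two reference vectors would come from embedding the quoted positive and negative FTD descriptions with the same model as the items.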

Clustering across scales

Using the 768-dimensional embeddings, we computed pairwise semantic similarity for all 103 items. Similarity was defined as the cosine of the angle between vectors, normalized to yield similarity scores ranging from 0 (no similarity) to 1 (identical vectors). This resulted in a 103 × 103 connectivity matrix whose cells indicated how semantically similar each item was to every other item.
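A minimal sketch of building such a matrix is given below. The text does not say how scores were normalized to the 0 to 1 range, so clipping negative cosines to zero is an assumption made here for illustration; rescaling with (cosine + 1) / 2 would be another reading. The two-dimensional vectors are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_matrix(vectors):
    """Symmetric item-by-item matrix of cosine similarities in [0, 1]."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Assumption: negative cosines are clipped to 0 ("no similarity").
            score = max(0.0, cosine(vectors[i], vectors[j]))
            matrix[i][j] = matrix[j][i] = score
    return matrix


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_matrix = similarity_matrix(toy_vectors)
```

For 103 items the same function would return the 103 × 103 matrix; a vectorized implementation would be preferable at larger scale.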

To facilitate interpretation, we sparsified this matrix, preserving only the strongest semantic connections. Specifically, we treated the TLC items as a backbone, ensuring that each item from the other six scales retained only its single highest similarity link to an item in TLC. This procedure effectively filtered out less relevant connections, aiding in the discovery of cohesive clusters (or “communities”) of related items. The TLC was chosen as backbone for its widespread use and comparable number of items to other rating scales, as well as the availability of the patient dataset with CLANG & TLC.
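The backbone step can be sketched as below, assuming a precomputed similarity matrix and a parallel list naming the scale of each item. The function name and the toy matrix are hypothetical. The sketch stops at the retained links; how the linked items were then grouped into communities is not detailed in this section, so that step is left out.

```python
def sparsify_to_backbone(similarity, scale_of_item, backbone="TLC"):
    """Keep, for each non-backbone item, only its strongest backbone link.

    Returns {item_index: (backbone_item_index, similarity)}.
    """
    backbone_idx = [i for i, s in enumerate(scale_of_item) if s == backbone]
    links = {}
    for i, scale in enumerate(scale_of_item):
        if scale == backbone:
            continue  # backbone items act as anchors, not as link sources
        best = max(backbone_idx, key=lambda j: similarity[i][j])
        links[i] = (best, similarity[i][best])
    return links


# Toy example: two backbone items and two items from other scales.
toy_scales = ["TLC", "TLC", "CLANG", "PANSS"]
toy_similarity = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
]
toy_links = sparsify_to_backbone(toy_similarity, toy_scales)
```

In the toy data the CLANG item attaches to the first TLC item and the PANSS item to the second, and all other connections are dropped.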

Expert-machine semantic grouping

Author LP described each of the clusters obtained with the TLC backbone, avoiding any words used in item titles. Expert raters, all members of the DISCOURSE consortium clinical harmonization group, then assigned each of the 103 items to one of the four backbone groups based on these descriptions. This allowed us to compare the BERT group structure with that of human raters by examining the overlap in group membership between machine-derived and human-assigned clusters.

Clinical alignment and clustering

In a separate sample of 98 participants (49 healthy controls and 49 individuals with schizophrenia or affective psychosis), both CLANG and TLC scales were administered. Participants were recruited from community-based clinics and hospitals through the Oxford Mental Health Services in Oxford, United Kingdom as part of the Cerebral Asymmetry and Functional Language in Psychosis (CAFLIP) study and gave written informed consent. For each TLC item, we computed its correlation with all CLANG items across the participant group and identified the strongest CLANG correlate, pairing items across the scales. We then checked whether each TLC–CLANG pair appeared in the same BERT-derived cluster, providing a real-world assessment of how clinical item relationships corresponded to the NLP-based semantic groupings.
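The pairing procedure might look like the sketch below. Several choices are assumptions, since the text does not specify them: Pearson correlation is used, "strongest" is read as the highest signed coefficient, and items with no variance are skipped because their correlation is undefined. Item names, scores, and cluster labels are toy values.

```python
import math


def pearson(x, y):
    """Pearson correlation, or None if either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def strongest_correlates(tlc_scores, clang_scores):
    """Map each TLC item to the CLANG item with the highest correlation."""
    pairs = {}
    for tlc_item, x in tlc_scores.items():
        scored = [(pearson(x, y), name) for name, y in clang_scores.items()]
        scored = [(r, name) for r, name in scored if r is not None]
        if scored:
            pairs[tlc_item] = max(scored)[1]
    return pairs


def same_cluster_count(pairs, cluster_of):
    """Count pairs whose two items share a semantic cluster label."""
    return sum(cluster_of[t] == cluster_of[c] for t, c in pairs.items())


# Toy ratings for five participants (hypothetical item names).
toy_tlc = {"TLC-a": [0, 1, 2, 3, 4], "TLC-b": [4, 3, 2, 1, 0]}
toy_clang = {
    "CLANG-x": [0, 1, 2, 3, 5],
    "CLANG-y": [5, 3, 2, 1, 0],
    "CLANG-z": [0, 0, 0, 0, 0],  # never endorsed, so it is skipped
}
toy_pairs = strongest_correlates(toy_tlc, toy_clang)
toy_clusters = {"TLC-a": 1, "CLANG-x": 1, "TLC-b": 2, "CLANG-y": 3}
toy_matches = same_cluster_count(toy_pairs, toy_clusters)
```

The count of matching pairs is the quantity that can then be compared against the proportion expected by chance.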

Results

Rating scales items

We extracted and embedded 103 items from the seven FTD rating scales (TALD, SAPS-SANS, PANSS, TLI, TLC, BIT, and CLANG) and observed substantial variation in their descriptive richness. CLANG items were the most concise (14.4 words on average), while TLC items were the longest (95.1 words on average, up to 242 words). SAPS-SANS showed similarly high word counts (90.2 on average; note that the TLC and SAPS-SANS share the same author), with BIT, TALD, PANSS, and TLI occupying middle ranges. This broad range in word count and exemplification underscores how each scale captures FTD with different levels of detail and granularity.

Semantic embeddings

Item-level descriptions from 7 scales were embedded in 768-dimensional vector space, and the resulting cosine similarity scores were grouped in a matrix illustrating the relationships among individual items (Fig. 1). Higher similarity values (closer to 1) indicate stronger semantic overlap, revealing potential clusters of conceptually related symptoms across the different scales. To examine broader patterns at the scale level, we computed a second similarity matrix by first averaging item embeddings within each rating scale to form a scale “centroid”, and then calculating cosine similarity between every pair of centroids (Fig. 2). The diagonal entries in this matrix reflect the mean semantic consistency of the items within each scale, i.e., the semantic distance between TLI item 1 and TLI items 2–7, repeated for each item within the scale. The off-diagonal cells indicate the degree of overlap of “centroids” between different scales.
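The two kinds of entries in the scale-level matrix can be sketched separately: centroid-to-centroid cosine for the off-diagonal cells, and mean pairwise item similarity for the diagonal. The function names and the two-dimensional toy vectors are hypothetical, and averaging over all distinct item pairs is an assumption about how the within-scale value was computed.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def centroid_similarity(embeddings_by_scale):
    """Off-diagonal cells: cosine between each pair of scale centroids."""
    centroids = {s: centroid(v) for s, v in embeddings_by_scale.items()}
    return {
        (a, b): cosine(centroids[a], centroids[b])
        for a, b in combinations(centroids, 2)
    }


def within_scale_similarity(vectors):
    """Diagonal cells: mean cosine over all distinct item pairs in a scale."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two items per scale")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


toy_scales = {
    "scale_a": [[1.0, 0.0], [0.0, 1.0]],
    "scale_b": [[1.0, 1.0], [2.0, 2.0]],
}
toy_between = centroid_similarity(toy_scales)
```

The toy data show why the two measures differ: the centroids of both scales point in the same direction, yet the items within `scale_a` are orthogonal to one another.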

**Fig. 1: Similarity matrix of individual FTD rating items.**

**Fig. 2: Scale-level similarity matrix of FTD rating scales.**

Approach validation - positive and negative FTD

When classifying TALD items into positive or negative FTD groups based on higher cosine similarity, 23 of 30 items (73.3%) matched the factor analysis–derived classification reported by Kircher et al. (2014), representing a statistically significant association (χ²(1, N = 30) = 6.04, p = 0.014). For instance, item TALD-14 (“Neologisms”) had a higher similarity to the positive FTD text (0.367) than to the negative FTD text (0.220), consistent with the factor-analytic assignment. This significant overlap serves as a proof-of-concept that semantic similarity–based categorizations can align closely with traditional factor-analytic divisions of FTD.

Clustering across scales

Network sparsification, using TLC as a backbone and linking each item from the other six scales only to its single highest-similarity TLC item (Fig. 3), substantially reduced the complexity of the full similarity matrix. This streamlined network revealed four internally connected clusters of item-level symptoms, reflecting coherent groupings of semantically related FTD features across different rating scales. See Table 1 for a list of individual items for each of the four groups.

**Fig. 3: Semantic network of all items to TLC backbone.**

Table 1 Four clusters based on semantic embeddings with TLC as backbone.

Expert-machine semantic grouping

Following the identification of the four BERT-derived clusters based on TLC embeddings, one author (LP) provided short, non-scale-specific descriptions for each group; in brief: (1) muddled communication & incomprehension, (2) abrupt topic shifts, (3) inconsistent narrative structure, and (4) restricted speech (see Table 1 for full descriptions). These descriptions were generated through a visual inspection of the clusters, without accessing the full list of items that belonged to each cluster. The descriptions were constructed in a manner that avoided the verbatim repetition of item names. Functional effects (e.g., difficulties in sequencing) rather than cognitive or mechanistic processes (e.g., lack of associations) were included in the descriptions. Six expert raters from the DISCOURSE consortium Clinical Harmonization Group then independently assigned each of the 103 items to one of these four clusters. Their assignments were compared with the original BERT-based groupings; Table 2 summarizes both overall and cluster-specific Fleiss’ kappa values. Raters demonstrated substantial consensus (κ = 0.617, 95% CI: 0.585–0.648 for all 103 items), while agreement per cluster ranged from 0.476 to 0.716, suggesting some categories may be more intuitively distinguished than others. The three items with the best total agreement between raters and BERT cluster assignment were Poverty of Speech (TLC-1), Neologisms (TALD-14) and Tangentiality (TLC-5), while the three items with the worst agreement were Distractibility (TLI-8), Lack of details (CLANG-8) and Stereotyped Thinking (PANSS-N7).
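For readers unfamiliar with the statistic, Fleiss' kappa can be computed from per-item category counts as in the sketch below. The ratings are toy data, and whether the BERT assignment was treated as an additional rater in the reported values is not stated here, so the example uses six raters only.

```python
from collections import Counter


def rating_counts(ratings_per_item, categories):
    """Turn per-item lists of rater labels into per-item category counts."""
    return [[Counter(r)[c] for c in categories] for r in ratings_per_item]


def fleiss_kappa(counts):
    """Fleiss' kappa for an items-by-categories table of rating counts.

    Every row must sum to the same number of raters. The statistic is
    undefined (division by zero) if all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    # Observed agreement: mean proportion of agreeing rater pairs per item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall share of ratings in each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(counts[0]))
    ]
    expected = sum(p * p for p in shares)
    return (observed - expected) / (1 - expected)


# Toy data: six raters agree perfectly on four items, one per cluster.
unanimous = rating_counts([[1] * 6, [2] * 6, [3] * 6, [4] * 6], [1, 2, 3, 4])
# Toy data: six raters split evenly between two clusters on both items.
split = rating_counts([[1, 1, 1, 2, 2, 2], [1, 1, 1, 2, 2, 2]], [1, 2])
```

Perfect agreement gives a kappa of 1, while the evenly split toy ratings give a negative value, meaning agreement below what chance alone would produce.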

Table 2 Human-BERT group assignment agreement.

Clinical alignment and clustering

In the sample of 98 participants (49 healthy controls and 49 individuals with Schizophrenia, Bipolar Disorder or Depression; see Table 3 for demographics), both CLANG and TLC items were rated. Four items, CLANG-1 (phonetic association), TLC-9 (clanging), TLC-15 (echolalia), and TLC-16 (blocking), were excluded from further analysis because no participant received a non-zero score on them. For the remaining items, each rating item was correlated with every other item. The highest CLANG-TLC item correlations were identified and compared to the four-domain semantic clustering applied to the same scales (Fig. 4). Eleven of the 17 correlational item pairs fell into the same semantic domain, a result that significantly exceeded the 25% chance-level expectation (binomial test, p < 0.001).
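The binomial result can be checked with a few lines of arithmetic. The sketch assumes a one-sided (upper-tail) test against the 25% chance level, which the text does not state explicitly; the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, chance):
    """P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 11 of 17 item pairs in the same cluster, chance level 1 in 4.
p_value = binomial_upper_tail(11, 17, 0.25)
```

Under these assumptions the tail probability is roughly 0.0006, in line with the reported p < 0.001.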

**Fig. 4: TLC-CLANG semantic cluster and correlations.**

Table 3 CLANG-TLC participant demographic and clinical characteristics.

Discussion

In this study, we compared items of symptom scales based on their semantic content rather than on participant scores, leveraging sentence-level BERT embeddings of FTD rating scales in different ways. First, we effectively distinguished the well-known positive and negative dimensions of FTD within the TALD scale, aligning with previous factor-analytic findings, thus providing a proof-of-concept for the semantic embedding approach. We then applied the same approach across seven rating scales, using TLC items as a backbone to sparsify the similarity matrix and reveal four distinct symptom clusters (Fig. 3), reducing the complexity of cross-scale comparisons. When evaluated against expert human judgment, these NLP-derived clusters showed substantial alignment with ratings, indicating that semantic similarity–based groupings largely matched expert perceptions. Finally, using a separate dataset with both CLANG and TLC, we found that items with the highest cross-scale correlations generally fell into the same BERT-derived cluster, suggesting meaningful convergence between real-world clinical data and our data-driven semantic domains.

In doing so, our study addresses four major points. First, NLP-based approaches offer an integrative, theory-agnostic framework for comparing conceptually related content across diverse rating instruments. Second, the identification of four distinctive semantic clusters, each containing items from different scales, shows that seemingly heterogeneous rating tools often capture overlapping constructs; for instance, “derailment” may be called “loose associations” or “flight of ideas,” yet here they converge within a single NLP-derived cluster. Third, our alignment with expert consensus and published factors suggests that clinicians’ intuitive groupings and conventional psychometric divisions do correspond, to a substantial degree, with computationally derived similarity networks—strengthening the claim that these clusters hold practical meaning. Fourth, analyzing real-world clinical data (the CLANG-TLC correlation set) revealed that items with the strongest cross-scale associations largely fell into the same NLP-derived cluster, reinforcing that these semantic groupings reflect meaningful patterns observable in patient populations.

While we provide a novel cluster and group division, it is not our intent to claim that we now know what FTD is, or how it should be definitively divided, factored, or conceptualized. Rather, our approach illustrates how NLP can serve as an “intuition pump” or tool for thinking³¹, guiding the integration of diverse clinical views, factor analyses and historical frameworks towards a data-driven model of FTD.

Semantic analysis can compare and align items across different scales even when they were not administered to the same participants, whereas factor analysis and latent profile analysis require a single, multivariate dataset and typically large samples. Practically, we see semantic analysis as a mapping tool to generate and refine groupings and correspondences between rating scale items. When joint datasets are available, those hypotheses should be confirmed with factor analysis. Interestingly, and somewhat self-referentially, the linguistic patterns captured in LLMs that allow them to quantify semantic distance are a subset of the patterns described as being breached in the description of FTD symptoms. Our approach adds to the toolbox for clarifying the clinical conceptualization of disordered thought and provides a means to derive ‘concept clusters’ of items across scales. Just as prior studies have examined short conceptual phrases to identify core semantic themes¹⁹, our approach demonstrates how large language models can distill rating-scale descriptions into interpretable clusters, a step forward in characterizing the varied symptom measurement tools for psychiatric disorders.

Strengths and weaknesses

One strength of this work lies in its clear demonstration that items from seven different FTD rating scales can be reconciled semantically despite disparate clinical histories and descriptive styles. The inclusion of expert rater groupings and TALD-based positive–negative classifications bolsters the credibility of the NLP-based approach. Our findings also provide a springboard for practical scale development: reviewing item clusters may help condense multiple rating instruments into a smaller, more cohesive item set.

However, a few limitations warrant mention. First, antonymous statements (e.g., “responses are too slow” vs. “responses are too fast”) can appear highly similar to embedding models. For these example snippets, the cosine similarity of 0.931 using the all-mpnet-base-v2 model illustrates how purely text-based approaches may cluster together opposite ends of a scale that could be conceptualized as separate in a positive-negative FTD division. Second, the item-level focus excluded severity descriptors, which contain additional semantic information regarding symptoms. Consider that the description for the most severe rating of TALD item 4, “Dissociation of Thinking”, reads: “Scattered speech: Syntax is absent (paragrammatism, parasyntax), resulting in an incomprehensible, meaningless word and syllable mixture (“word salad”)”³. This description contains terms that are absent from our analysis, which takes only the base item description, but that would likely lead to higher cosine similarity with other rating scale item descriptions containing syntactic terms. Third, our clinician-described, embedding-seeded clusters warrant caution. Group 4, for example (described as “restricted/poverty of speech”), reflects reduced quantity or diminished meaning despite intact production and can overlap with Group 1 (“muddled communication”). As with any division and description of a complex, overlapping phenomenon like FTD, overlap and alternate interpretations of clusters remain present. Finally, some of the scales were conceived in different languages before translation to English, and all scales were composed by writers and researchers with different literary and scientific styles. Despite these challenges, the measured convergence between computational and human-driven groupings is promising.

Future directions

This item-level unification framework can be extended in several important ways. Semantic embeddings have been used to characterize clinical speech directly³²; the approach taken here aimed to tackle an upstream problem, namely the semantic analysis of the rating scales used to characterize these clinical samples. However, the two approaches could be complementary, for example by measuring the distance between patient speech and rating clusters. Extending the CLANG-TLC approach with more patient-level data will clarify whether semantically similar items show correlated patient ratings across all rating scales. It may be equally useful to explore whether specific clusters map onto distinct neural substrates (e.g., connectivity within language-relevant networks), offering a biological anchor for these conceptual groupings^9,33. In addition, identifying items with low connectivity could highlight underdefined or infrequent symptom domains that warrant refinement, integration or outright removal, thus reducing taxonomic incommensurability⁹ and refining a conceptual “hub” with core aspects of disorganized thinking. Taken together, these steps would not only refine item-level semantics but could also guide the design of more consistent, multidimensional assessment of FTD.

Conclusion

By applying sentence-level embeddings to unify different FTD rating scales, we offer a novel conceptual framework to investigate disorganized thinking. Mapping 103 items from seven scales into four overarching domains reveals a shared semantic space that converges with expert consensus, while the positive–negative division correlates with prior factor-analytic findings. In addition, the alignment of CLANG-TLC correlations with these domains affirms the real-world applicability of our NLP-based clusters. This approach aims to enrich our theoretical grasp of how different theoretical and clinical perspectives on FTD intersect. Going forward, these insights derived from semantic embedding may serve as a flexible scaffold to refine how FTD is classified, discussed, and examined experimentally—expanding our tools for disentangling this complex yet foundational symptom domain in mental disorders.

Data availability

Inquiries regarding anonymous CAFLIP data access should be directed to Lena Palaniyappan - lena.palaniyappan@mcgill.ca.

References

Kircher, T., Bröhl, H., Meier, F. & Engelen, J. Formal thought disorders: from phenomenology to neurobiology. Lancet Psychiatry 5, 515–526 (2018).
Article PubMed Google Scholar
Andreasen, N. C. Scale for the assessment of thought, language, and communication (TLC). Schizophr. Bull. 12, 473–482 (1986).
Article CAS PubMed Google Scholar
Kircher, T. et al. A rating scale for the assessment of objective and subjective formal thought and language disorder (TALD). Schizophr. Res. 160, 216–221 (2014).
Article PubMed Google Scholar
Kay, S. R., Fiszbein, A. & Opler, L. A. The positive and negative syndrome scale (PANSS) for schizophrenia. Schizophr. Bull. 13, 261–276 (1987).
Article CAS PubMed Google Scholar
Covington, M. A. et al. Schizophrenia and the structure of language: the linguist’s view. Schizophr. Res. 77, 85–98 (2005).
Article PubMed Google Scholar
Docherty, N. M. Cognitive impairments and disordered speech in schizophrenia: thought disorder, disorganization, and communication failure perspectives. J. Abnorm. Psychol. 114, 269–278 (2005).
Article PubMed Google Scholar
Rodriguez-Ferrera, S., McCarthy, R. A. & McKenna, P. J. Language in schizophrenia and its relationship to formal thought disorder. Psychol. Med. 31, 197–205 (2001).
Article CAS PubMed Google Scholar
Zamperoni, G., Tan, E. J., Rossell, S. L., Meyer, D. & Sumner, P. J. Evidence for the factor structure of formal thought disorder: a systematic review. Schizophr. Res. 264, 424–434 (2024).
Article PubMed Google Scholar
Wulff, D. U. & Mata, R. Semantic embeddings reveal and address taxonomic incommensurability in psychological measurement. Nat. Hum. Behav. 1–11 https://doi.org/10.1038/s41562-024-02089-y (2025).
Andreasen, N. C. Thought, language, and communication disorders: II Diagnostic significance. Arch. Gen. Psychiatry 36, 1325–1330 (1979).
Article CAS PubMed Google Scholar
Peralta, V. & Cuesta, M. J. Negative symptoms in schizophrenia: a confirmatory factor analysis of competing models. Am. J. Psychiatry 152, 1450–1457 (1995).
Peralta, V., Cuesta, M. J. & de Leon, J. Formal thought disorder in schizophrenia: a factor analytic study. Compr. Psychiatry 33, 105–110 (1992).
Roche, E. et al. The factor structure and clinical utility of formal thought disorder in first episode psychosis. Schizophr. Res. 168, 92–98 (2015).
Bernardin, F., Gauld, C., Martin, V. P., Laprévote, V. & Dondé, C. The 68 symptoms of the clinical high risk for psychosis: low similarity among fourteen screening questionnaires. Psychiatry Res. 330, 115592 (2023).
Fried, E. I. The 52 symptoms of major depression: Lack of content overlap among seven common depression scales. J. Affect. Disord. 208, 191–197 (2017).
Chrobak, A. A., Krupa, A., Dudek, D. & Siwek, M. How soft are neurological soft signs? Content overlap analysis of 71 symptoms among seven most commonly used neurological soft signs scales. J. Psychiatr. Res. 138, 404–412 (2021).
Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Vol. 1, 4171–4186 (2019).
Floridi, L. & Chiriatti, M. GPT-3: its nature, scope, limits, and consequences. Minds Mach. 30, 681–694 (2020).
Bolt, T. & Uddin, L. Q. “The brain is…”: a survey of the brain’s many definitions. Neuroinformatics 23, 4 (2025).
Böke, A. et al. Enhancing diagnostic precision: using large-language models to evaluate content overlap in mental health questionnaires. JMIR https://preprints.jmir.org/preprint/79868 (2025).
Bedi, G. et al. A window into the intoxicated mind? Speech as an index of psychoactive drug effects. Neuropsychopharmacology 39, 2340–2348 (2014).
Corcoran, C. M. & Cecchi, G. A. Using language processing and speech analysis for the identification of psychosis and other disorders. Biol. Psychiatry Cogn. Neurosci. Neuroimaging 5, 770–779 (2020).
Corona Hernández, H. et al. Natural language processing markers for psychosis and other psychiatric disorders: emerging themes and research agenda from a cross-linguistic workshop. Schizophr. Bull. 49, S86–S92 (2023).
Voppel, A. E., de Boer, J., Brederoo, S., Schnack, H. & Sommer, I. Quantified language connectedness in schizophrenia-spectrum disorders. Psychiatry Res. 304, 114130 (2021).
Andreasen, N. C. Scale for the Assessment of Positive Symptoms (SAPS). https://doi.org/10.1037/t48377-000 (1984).
Andreasen, N. C. Scale for the Assessment of Negative Symptoms (SANS). 49–58 (1984).
Liddle, P. F. et al. Thought and Language Index: an instrument for assessing thought and language in schizophrenia. Br. J. Psychiatry 181, 326–330 (2002).
Marengo, J. T., Harrow, M., Lanin-Kettering, I. & Wilson, A. Evaluating bizarre-idiosyncratic thinking: a comprehensive index of positive thought disorder. Schizophr. Bull. 12, 497–511 (1986).
Chen, E. Y. H. et al. Language disorganisation in schizophrenia: validation and assessment with a new clinical rating instrument. Hong Kong J. Psychiatry 6, 4–13 (1996).
Kircher, T., Stein, F. & Nagels, A. Differences in single positive formal thought disorder symptoms between closely matched acute patients with schizophrenia and mania. Eur. Arch. Psychiatry Clin. Neurosci. 272, 395–401 (2022).
Dennett, D. C. Intuition Pumps And Other Tools for Thinking (W. W. Norton & Company, 2013).
Palominos, C. et al. Approximating the semantic space: word embedding techniques in psychiatric speech analysis. Schizophrenia 10, 114 (2024).
Stein, F. et al. Transdiagnostic types of formal thought disorder and their association with gray matter brain structure: a model-based cluster analytic approach. Mol. Psychiatry 30, 4286–4295 (2025).

Acknowledgements

We thank the DISCOURSE consortium Clinical Harmonization working group (www.discourseinpsychosis.org) for discussing the manuscript and responding to a call for expert raters; the late Timothy W. Crow for his work on the CAFLIP dataset; and all participants for their role in the research. Funding: A. Voppel is supported by NARSAD Young Investigator Grant 32574 from the Brain & Behavior Research Foundation. F. Stein is supported by the German Research Foundation (DFG) through grant STE3301/1-1 (project number 527712970) and the CRC/TRR 393 consortium (“Trajectories of Affective Disorders”, project number 521379614), as well as by the Von Behring-Röntgen Society (project number 72_0013). T. Kircher receives funding from the German Research Foundation (DFG) FOR 2107, SFB/TRR 393 (“Trajectories of Affective Disorders”, project grant no. 521379614), and Germany’s Excellence Strategy (EXC 3066/1 “The Adaptive Mind”, Project No. 533717223), as well as the DYNAMIC center, funded by the LOEWE program of the Hessian Ministry of Science and Arts (grant number: LOEWE1/16/519/03/09.001(0009)/98). P.F. Liddle is supported by the Medical Research Council (Grant Nos. G0901321 and MR/J01186X/1). S.X. Tang is supported by the National Institutes of Health through grants K23 MH130750 and R01 MH140013, as well as by NARSAD Young Investigator Grant 30975 from the Brain & Behavior Research Foundation. S. Park’s research is supported by MH128967 and the Gertrude Conaway Vanderbilt Endowment. L. Palaniyappan’s research is supported by the Canada First Research Excellence Fund, awarded to the Healthy Brains, Healthy Lives initiative at McGill University (through a New Investigator Supplement to L.P.), and the Monique H. Bourgeois Chair in Developmental Disorders. He receives a salary award from the Fonds de recherche du Québec-Santé (FRQS).

Author information

Authors and Affiliations

Department of Psychiatry, Douglas Mental Health University Institute, McGill University, Montreal, QC, Canada
Alban Voppel & Lena Palaniyappan
Center for Clinical Neuroscience and Cognition, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
Silvia Ciampelli
Dept. of Psychiatry, Marburg University, Marburg, Germany
Tilo Kircher & Frederike Stein
The Institute of Mental Health, University of Nottingham, Nottingham, UK
Peter F. Liddle
Department of Psychiatry, Federal University of Paraná (UFPR), Curitiba, Brazil
Raffael Massuda
Zucker Hillside Hospital, Northwell Health, Glen Oaks, NY, USA
Sunny X. Tang
Institute of Behavioral Science, Feinstein Institutes for Medical Research, Manhasset, NY, USA
Sunny X. Tang
Department of Psychiatry, Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY, USA
Sunny X. Tang
Addiction and Mental Health Services, Princess Alexandra Hospital, Metro South Hospital and Health Service, Brisbane, QLD, Australia
Manaan Kar Ray
Australian Institute for Suicide Research and Prevention, Griffith University, Brisbane, QLD, Australia
Manaan Kar Ray
PA-Southside Clinical Unit, Princess Alexandra Hospital, Faculty of Medicine, University of Queensland, Brisbane, QLD, Australia
Manaan Kar Ray
Department of Psychology, Vanderbilt University, Nashville, TN, USA
Sohee Park
Department of Psychiatry, Schulich School of Medicine & Dentistry, University of Western Ontario, London, ON, Canada
Lena Palaniyappan
Robarts Research Institute & Lawson Health Research Institute, London, ON, Canada
Lena Palaniyappan

Contributions

A.V. and L.P. conceived the study. A.V. performed the experiments, statistical analysis and visualization. S.C., T.K., P.L., R.M., F.S., and S.X.T. were expert raters. M.K.R., S.P., and L.P. collected clinical participant data. L.P. provided expert descriptions of symptoms. A.V. wrote the first version of the manuscript and revised the manuscript together with L.P. All authors contributed to the writing of the final manuscript.

Corresponding author

Correspondence to Alban Voppel.

Ethics declarations

Competing interests

L.P. reports personal fees from Janssen Canada, Otsuka Canada, SPMM Course Limited, UK, and the Canadian Psychiatric Association; book royalties from Oxford University Press; and investigator-initiated educational grants from Sunovion, Janssen Canada, and Otsuka Canada outside the submitted work. R.M. reports personal fees from Janssen, Adium, and Daiichi-Sankyo outside the submitted work. S.X.T. owns equity in and serves on the board of and as a consultant for North Shore Therapeutics, has received research funding from and serves as a consultant for Winterlight Labs, is on the advisory board of and owns equity in Psyrin, and serves as a consultant for Catholic Charities Neighborhood Services and LB Pharmaceuticals. The other authors report no conflicts of interest.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplemental materials

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Voppel, A., Ciampelli, S., Kircher, T. et al. Analysis of conceptual overlap among formal thought disorder rating scales in psychosis: a systematic semantic synthesis. Schizophr 12, 9 (2026). https://doi.org/10.1038/s41537-025-00712-z

Received: 10 September 2025
Accepted: 01 December 2025
Published: 15 December 2025
Version of record: 20 January 2026
DOI: https://doi.org/10.1038/s41537-025-00712-z
