Table 1. Description of each feature extraction tool.

From: Leveraging computational linguistics and machine learning for detection of ultra-high risk of mental health disorders in youths

Tool

Number of Features

Description

EVA

42

EVA captures entropic scores derived from sentiment polarity and intensity [15]. It detects occurrences of polarized sentiments (i.e., positive and negative valence words) and the effects of valence modifiers (e.g., amplifiers, de-amplifiers, negators, adversative conjunctions). EVA expresses sentiment variability using 21 unique features, including the length, variance, frequency, and intensity of persistent sentiment states, flip frequencies, and moving averages. A filtered version of EVA, in which neutral valence words are removed, produces another 21 variants of the original features.

TAACO

168

TAACO measures the degree of lexical and semantic overlap across a text [13]. Lexical overlap is measured by counting overlapping lemmas and part-of-speech tags across sentences and paragraphs [13], while semantic overlap is measured using latent semantic analysis (LSA), latent Dirichlet allocation, and word2vec scores [23]. Other features include type-token ratios, connectives, and givenness measures.

TAALES

485

TAALES measures lexical sophistication with n-gram frequencies, ranges, and strength-of-association scores that were calculated using various reference corpora [24]. An n-gram’s frequency refers to the number of times it appears in the reference corpus, while its range refers to the number of documents in the corpus in which it appears. An n-gram’s strength-of-association score measures the probability of its components co-occurring as an n-gram. Additional features include psycholinguistic word information, word recognition scores, and word neighborhood information.

TAMMI

66

TAMMI extracts morphological information, including basic morpheme counts, morphological variety and complexity, and morpheme type-token counts [17]. Basic morphemes include derivational and inflectional morphemes. Morphological variety and complexity are measured using scores derived from the Morphological Complexity Index [46]. TAMMI also integrates information from MorphoLex to compute morpheme frequencies, family sizes, and hapax counts [47].

TAASSC

355

TAASSC evaluates syntactic complexity and sophistication using classic complexity and verb argument construction (VAC) features [48]. Classic complexity features measure the length and diversity of word structures such as sentences, T-units, and clauses [49], while VAC features measure verb, VAC, and verb-VAC frequencies with reference to the Corpus of Contemporary American English (COCA) [50].

TAALED

38

TAALED measures lexical diversity across three dimensions: volume, abundance, and variety [26]. Volume refers to the total number of words, while abundance refers to the total number of unique lemmas. Lexical variety features include hypergeometric distribution scores, moving average type-token ratios, and Measure of Textual Lexical Diversity (MTLD) scores.
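
Two of the measures defined in the table, n-gram frequency and range (as described for TAALES) and the moving average type-token ratio (listed for TAALED), can be illustrated with a short Python sketch. This is not the implementation used by either tool: the function names `ngram_frequency_and_range` and `moving_average_ttr` are hypothetical, the input is assumed to be pre-tokenized (the tools work with lemmas and curated reference corpora), and the default window of 50 tokens is an assumption chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequency_and_range(corpus_docs, n=2):
    """Count n-gram frequency and range over a toy reference corpus.

    corpus_docs is a list of documents, each a list of tokens.
    Frequency: total occurrences of the n-gram in the corpus.
    Range: number of documents that contain the n-gram at least once.
    """
    frequency = Counter()
    doc_range = Counter()
    for doc in corpus_docs:
        grams = ngrams(doc, n)
        frequency.update(grams)        # every occurrence counts
        doc_range.update(set(grams))   # each document counts at most once
    return frequency, doc_range


def moving_average_ttr(tokens, window=50):
    """Average the type-token ratio over every sliding window of tokens.

    Texts shorter than the window fall back to a plain type-token ratio
    (an assumption made here to keep the sketch simple).
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```

For example, in a toy corpus of the two documents `["the", "cat", "the", "cat"]` and `["the", "cat", "ran"]`, the bigram `("the", "cat")` has a frequency of 3 but a range of 2, which shows why the two measures are reported separately.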