Table 1 Description of each feature extraction tool.
| Tool | Number of Features | Description |
|---|---|---|
| EVA | 42 | EVA captures entropic scores derived from sentiment polarity and intensity [15]. It detects occurrences of polarized sentiments (i.e., positive and negative valence words) and the effects of valence modifiers (e.g., amplifiers, de-amplifiers, negators, adversative conjunctions). EVA expresses sentiment variability using 21 unique features, including the length, variance, frequency, and intensity of persistent sentiment states, flip frequencies, and moving averages. A filtered version of EVA, which removes neutral valence words, produces another 21 variants of the original features. |
| TAACO | 168 | TAACO measures the degree of lexical and semantic overlap across a text [13]. Lexical overlap is measured by counting overlapping lemmas and part-of-speech tags across sentences and paragraphs [13], while semantic overlap is measured using latent semantic analysis (LSA), latent Dirichlet allocation, and word2vec scores [23]. Other features include type-token ratios, connectives, and givenness measures. |
| TAALES | 485 | TAALES measures lexical sophistication with n-gram frequencies, ranges, and strength-of-association scores calculated from various reference corpora [24]. An n-gram’s frequency is the number of times it appears in the reference corpus, while its range is the number of documents in that corpus in which it appears. An n-gram’s strength-of-association score measures the probability of its components co-occurring as an n-gram. Additional features include psycholinguistic word information, word recognition scores, and word neighborhood information. |
| TAMMI | 66 | TAMMI extracts morphological information, including basic morpheme counts, morphological variety and complexity, and morpheme type-token counts [17]. Basic morphemes include derivational and inflectional morphemes. Morphological variety and complexity are measured using scores derived from the Morphological Complexity Index [46]. TAMMI also integrates information from MorphoLex to compute morpheme frequencies, family sizes, and hapax counts [47]. |
| TAASSC | 355 | TAASSC evaluates syntactic complexity and sophistication using classic complexity features and verb argument construction (VAC) features [48]. Classic complexity features measure the length and diversity of syntactic structures such as sentences, T-units, and clauses [49], while VAC features measure verb, VAC, and verb-VAC frequencies with reference to the Corpus of Contemporary American English (COCA) [50]. |
| TAALED | 38 | TAALED measures lexical diversity across three dimensions: volume, abundance, and variety [26]. Volume refers to the total number of words, while abundance refers to the total number of unique lemmas. Lexical variety features include hypergeometric distribution scores, moving-average type-token ratios, and measure of textual lexical diversity (MTLD) scores. |
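
To make two of the definitions in Table 1 concrete, the following is a minimal Python sketch of the frequency/range distinction described for TAALES and of a moving-average type-token ratio of the kind listed for TAALED. It is not the implementation used by either tool: the function names, the three-document toy corpus, whitespace tokenization, and the default window of 50 tokens are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every n-gram in a token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_and_range(corpus_docs, n):
    """For each n-gram, count total occurrences in the corpus (frequency)
    and the number of documents that contain it (range)."""
    frequency, doc_range = Counter(), Counter()
    for doc in corpus_docs:
        grams = ngrams(doc, n)
        frequency.update(grams)       # every occurrence counts
        doc_range.update(set(grams))  # each document counts at most once
    return frequency, doc_range


def mattr(tokens, window=50):
    """Moving-average type-token ratio: the mean TTR over all sliding
    windows of `window` tokens (window size is an assumed default)."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text shorter than one window: fall back to the plain TTR.
        return len(set(tokens)) / len(tokens)
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)


# Hypothetical toy reference corpus, tokenized on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the cat ate the fish".split(),
    "a dog sat on the mat".split(),
]

unigram_freq, unigram_range = frequency_and_range(corpus, 1)
bigram_freq, bigram_range = frequency_and_range(corpus, 2)

print(unigram_freq[("the",)], unigram_range[("the",)])
print(bigram_freq[("the", "cat")], bigram_range[("the", "cat")])
print(mattr("a a b b".split(), window=2))
```

In this toy corpus, "the" occurs more often than the number of documents containing it, which is exactly the gap between frequency and range. The tools in Table 1 derive such counts from large reference corpora and operate on lemmatized or tagged text, which this sketch does not attempt.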