Table 3 List of features enriched in at least 95 out of 100 Boruta iterations.
| Feature set | Feature name | Description |
|---|---|---|
| EVA | FS_MAXHI | Maximum sentiment of the longest happy island. A happy island is a series of sentences with positive sentiment scores that are within the top 25% of all positive sentiment scores in the transcript. |
| EVA | FS_SPP | Sum of positive peak values. Positive peaks are sentences with sentiment scores that are higher than those of adjacent sentences in the transcript. |
| EVA | FS_AVG | Average sentiment score of all sentences in the transcript. |
| EVA | FS_SVAR | Variance of the longest sad island. A sad island is a series of sentences with negative sentiment scores that are within the bottom 25% of all negative sentiment scores in the transcript. |
| TAALES | COCA_Fiction_Trigram_Range | Average trigram range of the transcript with reference to COCA’s fiction register<sup>50</sup>. A trigram is a sequence of three words. Its range refers to the number of corpus documents it appears in. |
| TAALES | COCA_spoken_Trigram_Range | Average trigram range of the transcript with reference to COCA’s spoken register<sup>50</sup>. |
| TAALES | COCA_magazine_tri_prop_10k | Proportion of trigrams in the transcript that are among the 10,000 most frequent trigrams in COCA’s magazine register<sup>50</sup>. |
| TAALES | WN_SD_CW | Average standard deviation (SD) of the naming latencies of all content words in the transcript. A word’s naming latency refers to the time taken to read it aloud. Values were derived from the English Lexicon Project’s (ELP) word naming task<sup>51</sup>. |
| TAALES | OG_N | Average number of phonographic neighbors of all words in the transcript. Phonographic neighbors are words that differ in one letter and one phoneme. Values were derived from ELP<sup>51</sup>. |
| TAALES | MRC_Imageability_CW | Average imageability score of all content words in the transcript. Values were derived from the Medical Research Council Psycholinguistic Database<sup>52</sup>. |
| TAALES | LD_Mean_Accuracy_CW | Average lexical decision accuracy of all content words in the transcript. A word’s decision accuracy is the percentage of times it is correctly identified as a real word. Values were derived from ELP’s lexical decision task<sup>51</sup>. |
| TAALES | COCA_spoken_tri_2_MI | Average trigram mutual information (MI) score of the transcript with reference to COCA’s spoken register<sup>50</sup>. A trigram’s MI score is the joint probability of its bigram and unigram components occurring together. |
| TAMMI | Inflected_Tokens | Average number of inflected tokens. |
| TAMMI | suffix_freq_per_cw | Average suffix frequency of content words. Values were derived from MorphoLex<sup>47</sup>. |
| TAACO | trigram_lemma_ttr | Number of unique trigram lemmas divided by the total number of trigram lemmas. |
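
To make a few of the definitions in Table 3 concrete, the following is a minimal Python sketch of three of the features (FS_SPP, FS_MAXHI, and trigram_lemma_ttr). It is an illustration under stated assumptions, not the EVA or TAACO implementation: the function names are hypothetical, a peak is taken to be an interior sentence whose score is strictly higher than both neighbors, "top 25%" is read as the ceil(n/4) highest positive scores, an island is taken to be a run of consecutive sentences, and lemmatization is assumed to have been done beforehand.

```python
from math import ceil


def sum_positive_peaks(scores):
    """FS_SPP-like value: sum of scores strictly higher than both neighbours.

    Assumption: only interior sentences can be peaks.
    """
    return sum(
        s
        for prev, s, nxt in zip(scores, scores[1:], scores[2:])
        if s > prev and s > nxt
    )


def longest_happy_island(scores):
    """Longest run of consecutive sentences scoring in the top quarter
    of all positive scores (first such run wins on ties)."""
    positives = sorted((s for s in scores if s > 0), reverse=True)
    if not positives:
        return []
    # Assumption: "top 25%" means the ceil(n / 4) highest positive scores.
    cutoff = positives[ceil(len(positives) / 4) - 1]
    best, run = [], []
    for s in scores:
        if s >= cutoff:
            run.append(s)
            if len(run) > len(best):
                best = list(run)
        else:
            run = []
    return best


def trigram_ttr(lemmas):
    """trigram_lemma_ttr-like value: unique trigrams / total trigrams."""
    trigrams = list(zip(lemmas, lemmas[1:], lemmas[2:]))
    return len(set(trigrams)) / len(trigrams) if trigrams else 0.0


# Hypothetical per-sentence sentiment scores for one transcript.
scores = [0.1, 0.8, 0.9, -0.2, 0.7, 0.3, -0.5, 0.2]
fs_spp = sum_positive_peaks(scores)
fs_maxhi = max(longest_happy_island(scores))
ttr = trigram_ttr(["the", "cat", "sit", "the", "cat", "sit", "down"])
```

In this toy transcript the peaks are the third and fifth sentences, the longest happy island spans the second and third sentences, and four of the five lemma trigrams are unique. FS_AVG would simply be the mean of `scores`, and FS_SVAR would apply the same island logic to the most negative scores and take the variance of the resulting run.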