Table 3. List of features enriched in at least 95 out of 100 Boruta iterations.

From: Leveraging computational linguistics and machine learning for detection of ultra-high risk of mental health disorders in youths

| Feature set | Feature name | Description |
| --- | --- | --- |
| EVA | FS_MAXHI | Maximum sentiment score of the longest happy island. A happy island is a series of sentences with positive sentiment scores that are within the top 25% of all positive sentiment scores in the transcript. |
|  | FS_SPP | Sum of positive peak values. Positive peaks are sentences with sentiment scores that are higher than those of adjacent sentences in the transcript. |
|  | FS_AVG | Average sentiment score of all sentences in the transcript. |
|  | FS_SVAR | Sentiment variance of the longest sad island. A sad island is a series of sentences with negative sentiment scores that are within the bottom 25% of all negative sentiment scores in the transcript. |
| TAALES | COCA_Fiction_Trigram_Range | Average trigram range of the transcript with reference to COCA’s fiction register [50]. A trigram is a sequence of three words. Its range is the number of corpus documents in which it appears. |
|  | COCA_spoken_Trigram_Range | Average trigram range of the transcript with reference to COCA’s spoken register [50]. |
|  | COCA_magazine_tri_prop_10k | Proportion of trigrams in the transcript that are among the 10,000 most frequent trigrams in COCA’s magazine register [50]. |
|  | WN_SD_CW | Average standard deviation (SD) of the naming latencies of all content words in the transcript. A word’s naming latency is the time taken to read it aloud. Values were derived from the English Lexicon Project’s (ELP) word naming task [51]. |
|  | OG_N | Average number of phonographic neighbors of all words in the transcript. Phonographic neighbors are words that differ from a given word by one letter and one phoneme. Values were derived from ELP [51]. |
|  | MRC_Imageability_CW | Average imageability score of all content words in the transcript. Values were derived from the Medical Research Council Psycholinguistic Database [52]. |
|  | LD_Mean_Accuracy_CW | Average lexical decision accuracy of all content words in the transcript. A word’s decision accuracy is the percentage of trials in which it was correctly identified as a real word. Values were derived from ELP’s lexical decision task [51]. |
|  | COCA_spoken_tri_2_MI | Average trigram mutual information (MI) score of the transcript with reference to COCA’s spoken register [50]. A trigram’s MI score is the joint probability of its bigram and unigram components occurring together. |
| TAMMI | Inflected_Tokens | Average number of inflected tokens. |
|  | suffix_freq_per_cw | Average suffix frequency of content words. Values were derived from MorphoLex [47]. |
| TAACO | trigram_lemma_ttr | Number of unique trigram lemmas divided by the total number of trigram lemmas. |
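
To make a few of the definitions above concrete, the following is a minimal Python sketch of how FS_AVG, FS_SPP, FS_MAXHI and trigram_lemma_ttr could be computed from a list of per-sentence sentiment scores and a list of lemmas. It is not the EVA or TAACO implementation. All function names are hypothetical, and several details are assumptions: peaks are taken as strict local maxima among interior sentences only, an island is a run of consecutive sentences, and the top-25% cutoff is the smallest score in the top quarter of positive scores.

```python
from statistics import mean


def average_sentiment(scores):
    """FS_AVG-style value: mean sentiment score over all sentences."""
    return mean(scores)


def sum_positive_peaks(scores):
    """FS_SPP-style value: sum of scores strictly higher than both neighbours.

    Assumption: the first and last sentences are never counted as peaks.
    """
    return sum(
        s
        for prev, s, nxt in zip(scores, scores[1:], scores[2:])
        if s > prev and s > nxt
    )


def longest_run(scores, keep):
    """Return the longest run of consecutive scores for which keep(score) is true."""
    best, current = [], []
    for s in scores:
        current = current + [s] if keep(s) else []
        if len(current) > len(best):
            best = current
    return best


def max_of_longest_happy_island(scores):
    """FS_MAXHI-style value: maximum score within the longest happy island.

    Assumption: the cutoff is the smallest score in the top quarter
    of all positive scores (at least one score is always kept).
    """
    positives = sorted(s for s in scores if s > 0)
    if not positives:
        return None
    cutoff = positives[-max(1, len(positives) // 4)]
    island = longest_run(scores, lambda s: s >= cutoff)
    return max(island)


def trigram_ttr(lemmas):
    """trigram_lemma_ttr-style value: unique trigrams divided by total trigrams."""
    trigrams = list(zip(lemmas, lemmas[1:], lemmas[2:]))
    return len(set(trigrams)) / len(trigrams) if trigrams else None
```

Each function takes a plain Python list, so the sketch can be tried on any sentence-level sentiment output, for example `sum_positive_peaks([0.1, 0.8, 0.2])` or `trigram_ttr(["i", "go", "to", "school"])`. A sad-island variant would follow the same pattern with negative scores and a bottom-quarter cutoff.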