Schizophrenia

Table 3 List of features enriched in at least 95 out of 100 Boruta iterations.

From: Leveraging computational linguistics and machine learning for detection of ultra-high risk of mental health disorders in youths

Feature set	Feature name	Description
EVA	FS_MAXHI	Maximum sentiment of the longest happy island. A happy island is a series of sentences with positive sentiment scores that are within the top 25% of all positive sentiment scores in the transcript.
	FS_SPP	Sum of positive peak values. Positive peaks are sentences with sentiment scores that are higher than those of adjacent sentences in the transcript.
	FS_AVG	Average sentiment score of all sentences in the transcript.
	FS_SVAR	Variance of the longest sad island. A sad island is a series of sentences with negative sentiment scores that are within the bottom 25% of all negative sentiment scores in the transcript.
TAALES	COCA_Fiction_Trigram_Range	Average trigram range of the transcript with reference to COCA’s fiction register⁵⁰. A trigram is a sequence of three words. Its range is the number of corpus documents in which it appears.
	COCA_spoken_Trigram_Range	Average trigram range of the transcript with reference to COCA’s spoken register⁵⁰.
	COCA_magazine_tri_prop_10k	Proportion of trigrams in the transcript that are among the 10,000 most frequent trigrams in COCA’s magazine register⁵⁰.
	WN_SD_CW	Average standard deviation (SD) of the naming latencies of all content words in the transcript. A word’s naming latency refers to the time taken to read it aloud. Values were derived from the English Lexicon Project’s (ELP) word naming task⁵¹.
	OG_N	Average number of phonographic neighbors of all words in the transcript. A word’s phonographic neighbors are words that differ from it by one letter and one phoneme. Values were derived from ELP⁵¹.
	MRC_Imageability_CW	Average imageability score of all content words in the transcript. Values were derived from the Medical Research Council Psycholinguistic Database⁵².
	LD_Mean_Accuracy_CW	Average lexical decision accuracy of all content words in the transcript. A word’s decision accuracy is the percentage of trials in which it was correctly identified as a real word. Values were derived from ELP’s lexical decision task⁵¹.
	COCA_spoken_tri_2_MI	Average trigram mutual information (MI) score of the transcript with reference to COCA’s spoken register⁵⁰. A trigram’s MI score measures how strongly its bigram and unigram components are associated, comparing the probability of their occurring together with the probability expected if they were independent.
TAMMI	Inflected_Tokens	Average number of inflected tokens.
	suffix_freq_per_cw	Average suffix frequency of content words. Values were derived from MorphoLex⁴⁷.
TAACO	trigram_lemma_ttr	Number of unique trigram lemmas divided by the total number of trigram lemmas.
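
Three of the definitions above (FS_AVG, FS_SPP and trigram_lemma_ttr) are simple enough to sketch in code. The Python below is a minimal illustration under stated assumptions, not the implementation used by EVA or TAACO: the function names are hypothetical, sentence-level sentiment scores and lemmas are assumed to be precomputed, a positive peak is assumed to be an interior sentence with a positive score strictly higher than both neighbors, and trigrams are assumed to be overlapping windows of three lemmas.

```python
from statistics import mean


def average_sentiment(scores):
    """FS_AVG-style value: mean sentiment score over all sentences."""
    return mean(scores)


def sum_positive_peaks(scores):
    """FS_SPP-style value: sum of positive peak scores.

    Assumption: a peak is an interior sentence whose score is positive
    and strictly higher than the scores of both adjacent sentences.
    """
    total = 0.0
    for i in range(1, len(scores) - 1):
        is_peak = scores[i] > scores[i - 1] and scores[i] > scores[i + 1]
        if is_peak and scores[i] > 0:
            total += scores[i]
    return total


def trigram_ttr(lemmas):
    """trigram_lemma_ttr-style value: unique trigrams / total trigrams.

    Assumption: `lemmas` is an already lemmatized token list and
    trigrams are overlapping windows of three consecutive lemmas.
    """
    trigrams = [tuple(lemmas[i:i + 3]) for i in range(len(lemmas) - 2)]
    if not trigrams:
        return 0.0  # fewer than three lemmas: no trigram can be formed
    return len(set(trigrams)) / len(trigrams)


# Hypothetical per-sentence sentiment scores and lemma sequence.
example_scores = [0.1, 0.6, 0.2, -0.3, 0.4, 0.1]
example_lemmas = ["the", "cat", "sit", "the", "cat", "sit", "the"]

fs_avg = average_sentiment(example_scores)
fs_spp = sum_positive_peaks(example_scores)
ttr = trigram_ttr(example_lemmas)
```

In this toy input, the sentences scored 0.6 and 0.4 count as positive peaks, and the seven lemmas yield five trigrams of which three are distinct. How the original tools handle ties, transcript boundaries and tokenization is not specified in the table, so those choices here are illustrative only.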
