Table 1. List of all lyrical descriptors extracted for the two datasets, including a brief description.

From: Song lyrics have become simpler and more repetitive over the last five decades

Name

Description

Lexical descriptors

 Line counts

Total number of lines, blank lines, unique lines, ratio of blank and repeated lines

 Token counts

Number of tokens, characters, repeated token ratio, unique tokens per line, and avg. tokens per line

 Character counts

Number of punctuation characters [!?.,:;”-()] and digits (total count and individual counts per character), ratio of punctuation and digits

 Token length

Average length of tokens

 n-gram ratios

Ratio of unique bigrams and trigrams

 Legomenon ratios

Ratio of hapax legomena, dis legomena and tris legomena

 Parts of speech

Frequency of adjectives, adverbs, nouns, pronouns, verbs

 Past tense

Percentage of verbs in past tense

 Stop words

Number and ratio of stop words, stop words per line

 Uncommon words

Number of uncommon words (i.e., words not contained in WordNet [60])
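
As an illustration of how the n-gram and legomenon ratios listed above could be computed, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the tokenizer, the function names, and the choice of the vocabulary size as the denominator of the legomenon ratio are all assumptions, since the table does not specify them.

```python
from collections import Counter
import re


def tokenize(lyrics: str) -> list[str]:
    # Assumption: lowercase word tokens; the table does not name a tokenizer.
    return re.findall(r"[a-z']+", lyrics.lower())


def unique_ngram_ratio(tokens: list[str], n: int) -> float:
    # Number of distinct n-grams divided by the total number of n-grams.
    ngrams = list(zip(*(tokens[i:] for i in range(n))))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def legomenon_ratio(tokens: list[str], k: int) -> float:
    # Share of distinct words occurring exactly k times
    # (k=1 hapax, k=2 dis, k=3 tris legomena).
    # Assumption: the denominator is the vocabulary size.
    counts = Counter(tokens)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == k) / len(counts)


lyrics = "la la la\nhey hey you\nla la la"
tokens = tokenize(lyrics)
bigram_ratio = unique_ngram_ratio(tokens, 2)
hapax_ratio = legomenon_ratio(tokens, 1)
```

Lower values of the unique n-gram ratio indicate more repetition in the lyrics.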

Diversity descriptors

 Compression ratio

Ratio of the size of zlib compressed lyrics vs. the original, uncompressed lyrics

 Diversity measures

Measure of Textual Lexical Diversity (MTLD), Herdan’s C, Summer’s S, Dugast’s \(U^2\) and Maas’ \(a^2\)

The diversity descriptors were extracted using the Python lexical_diversity and lexicalrichness libraries.
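
To make two of the diversity descriptors concrete, here is a hedged sketch of the zlib compression ratio and of Herdan's C, written with the standard library rather than the libraries named above. The function names, the UTF-8 encoding, and the default zlib compression level are assumptions; the table only states that the ratio compares compressed and uncompressed lyrics. Herdan's C is computed from its standard definition, log(V) / log(N), where V is the number of distinct tokens and N the total number of tokens.

```python
import math
import zlib


def compression_ratio(lyrics: str) -> float:
    # Size of the zlib-compressed lyrics divided by the original size.
    # Assumptions: UTF-8 encoding and the default compression level.
    raw = lyrics.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def herdan_c(tokens: list[str]) -> float:
    # Herdan's C = log(V) / log(N); undefined for fewer than two tokens,
    # so this sketch returns 0.0 in that case.
    n = len(tokens)
    if n < 2:
        return 0.0
    return math.log(len(set(tokens))) / math.log(n)


repetitive = "na na na na\n" * 50
varied = "the quick brown fox jumps over the lazy dog"
ratio_repetitive = compression_ratio(repetitive)
ratio_varied = compression_ratio(varied)
```

Highly repetitive lyrics compress well and yield a small ratio, whereas very short or varied texts can yield ratios close to or above 1 because of the fixed overhead of the compressed format.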

Readability descriptors

 Readability formulas

Flesch Reading Ease, Flesch Kincaid Grade, SMOG (Simple Measure of Gobbledygook), Automated Readability Index, Coleman Liau Index, Dale Chall Readability Score, Linsear Write Formula, Gunning Fog, Fernandez Huerta, Szigriszt Pazos and Gutierrez Polini

 Difficult words

Number of difficult words (consisting of three or more syllables)

The readability descriptors were extracted using the Python textstat library.

Rhyme descriptors

 Rhyme structures

Numbers of couplets, clerihews, alternating rhymes and nested rhymes

 Rhyme words

Number of unique rhyming words, percentage of rhyming lines in the lyrics

 Alliterations

Number of alliterations of length two, three, and four or more

The rhyme descriptors were extracted using the Python pronouncing library, which provides an interface to the Carnegie Mellon University Pronouncing Dictionary.

Structural descriptors

 Element counts

Number of sections and verses

 Distribution

Relation between the number of verses and the number of sections, and between the number of choruses and the number of sections

 Title occurrences

Number of times the song’s title appears

 Pattern

Verse and chorus alternating, two verses and at least one chorus, two choruses and at least one verse

 Start

Starts with chorus (binary attribute)

 Ending

Ends with two chorus repetitions (binary attribute)

Emotional descriptors

 Sentiment scores

Positivity and negativity scores via AFINN [61], the sentiment lexicon by Bing Liu et al. [62], the MPQA opinion corpus [63], the sentiment140 dataset [64] and the SentiWordNet lexicon [65]

 NRC

Emotion scores according to the NRC affect intensity lexicon [66]

 LIWC

Descriptors provided by LIWC [39]

 Happiness

Happiness score according to labMT [67]