Table 8 An overview of acoustic features. For more details, see the cooperative voice analysis repository (COVAREP).

From: A systematic review on automated clinical depression diagnosis

Acoustic feature

Description

Source features

Features reflecting airflow from the lungs through the glottis (i.e., glottal features) or vocal fold vibrations (i.e., voice quality features). Under the source-filter theory of speech production, these constitute the sound source that is later filtered by the vocal tract.

Jitter (%)

Deviations in the lengths of consecutive f0 periods, which suggest irregular and uneven vocal fold vibrations.

Shimmer (%)

The variation in the peak amplitudes of consecutive f0 periods, which implies unevenness in voice loudness.

Tremor (Hz)

The frequency of the strongest low-frequency component modulating the fundamental frequency within a defined analysis range.

Harmonics-to-noise ratio (HNR) (dB)

Ratio between the harmonic (periodic) components and the noise components of the signal, which indirectly correlates with perceived aspiration.

Frequency disturbance ratio (FDR) (%)

The relative mean value of the frequency variation across five consecutive cycles (calculated as a five-point average).

Amplitude disturbance ratio (ADR) (%)

Relative mean amplitude value over a set of windows.

Quasi-open quotient (QOQ)

Ratio of the time the vocal folds are open to the length of the glottal cycle. Functional dysphonias often reduce the QOQ range.

Normalized amplitude quotient (NAQ)

The ratio of the peak-to-peak amplitude of the glottal flow to the amplitude of the negative peak of the differentiated flow glottogram, normalized with respect to the period length. It can be used as an approximation of glottal adduction.

Peak slope

Slope of the regression line that is fit to log10 of the maxima of each frame.
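The perturbation measures listed among the source features can be sketched as follows. This is a minimal illustration, assuming the common "local" convention (mean absolute difference between consecutive cycles divided by the mean value); the table does not specify the exact formula, and the function names and sample values are hypothetical.

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference between consecutive f0 period lengths,
    relative to the mean period, in percent (assumed 'local' definition)."""
    p = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(p))) / np.mean(p)

def local_shimmer(amplitudes):
    """Mean absolute difference between consecutive peak amplitudes,
    relative to the mean amplitude, in percent (assumed 'local' definition)."""
    a = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(a))) / np.mean(a)

# Hypothetical per-cycle measurements (period lengths in seconds, linear amplitudes)
periods_s = [0.0100, 0.0102, 0.0099, 0.0101]
peak_amps = [0.80, 0.78, 0.82, 0.80]

print("jitter (%):", local_jitter(periods_s))
print("shimmer (%):", local_shimmer(peak_amps))
```

Both functions expect values that have already been extracted per glottal cycle; detecting the cycles themselves is a separate step that is not shown here.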

Filter features

The resonant properties of the vocal and nasal tracts filter the sound source from the vocal folds: the filter attenuates certain frequencies and strengthens others by the shape of the vocal and nasal tracts.

F1 mean (Hz)

First peak in the spectrum that results from a resonance of the human vocal tract.

F2 mean (Hz)

Second peak in the spectrum that results from a resonance of the human vocal tract.

F1 variability (Hz)

Measures of dispersion of F1 (variance, standard deviation).

F2 variability (Hz)

Measures of dispersion of F2 (variance, standard deviation).

F1 range (Hz)

Difference between the lowest and highest F1 values.

Vowel space

Two-dimensional space spanned by F1 and F2 for the vowels.

Linear predictive coding (LPC) coefficients

Coefficients that best predict the value of the audio signal at the next time point from its values at the previous n time points; they are used to reconstruct the filter properties.
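The linear prediction idea in the LPC row can be illustrated with a short sketch. It assumes the autocorrelation method and solves the resulting Toeplitz system directly with NumPy; toolkits may use other estimators (for example the Levinson-Durbin recursion or the covariance method), and the function name and test signal are hypothetical.

```python
import numpy as np

def lpc_coefficients(signal, order):
    """Estimate a[1..order] such that x[t] is approximated by
    sum_k a[k] * x[t - k] (autocorrelation method, assumed here)."""
    x = np.asarray(signal, dtype=float)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz matrix with entries R[i, j] = r[|i - j|]
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[lags]
    # Normal equations: R a = [r[1], ..., r[order]]
    return np.linalg.solve(R, r[1 : order + 1])

# Hypothetical test signal: each sample is 0.9 times the previous one
x = 0.9 ** np.arange(200)
a = lpc_coefficients(x, order=1)
print(a)  # the single coefficient should be very close to 0.9
```

In practice the coefficients are estimated on short windowed frames rather than on the whole recording.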

Spectral features

Features characterizing the frequency distribution of the speech signal at a particular moment in time.

Mel-frequency cepstral coefficients (MFCCs)

The coefficients obtained by applying a discrete cosine transform to the log-magnitude Mel-spectrum of an audio segment.
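A rough sketch of how MFCCs can be computed for a single frame is given below. It assumes a Hamming window, a power spectrum, 26 triangular Mel filters, and an unnormalized DCT-II; these are common but not universal choices, none of them is specified by the table, and all function names and parameter defaults are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers spaced evenly on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def dct_ii(x):
    """Unnormalized DCT-II: X[k] = sum_m x[m] * cos(pi * k * (2m + 1) / (2n))."""
    n = len(x)
    k = np.arange(n)
    basis = np.cos(np.pi * np.outer(k, 2 * k + 1) / (2 * n))
    return basis @ x

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    """MFCCs of one frame: window, spectrum, Mel filtering, log, DCT."""
    n_fft = len(frame)
    windowed = np.asarray(frame, dtype=float) * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(windowed)) ** 2
    mel_energies = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    # Floor the energies so that the logarithm is always defined
    log_mel = np.log(np.maximum(mel_energies, 1e-10))
    return dct_ii(log_mel)[:n_coeffs]

# Hypothetical input: one 512-sample frame of a 440 Hz tone at 16 kHz
frame = np.sin(2 * np.pi * 440.0 * np.arange(512) / 16000.0)
coeffs = mfcc_frame(frame, 16000)
print(coeffs.shape)
```

Library implementations typically add pre-emphasis, overlapping frames, and a normalized DCT, so the values from this sketch will not match them numerically.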

Prosodic features

Changes over longer segments of time, which are perceived as the rhythm, stress, and intonation of speech.

f0 mean (Hz)

Fundamental frequency: lowest frequency of the speech signal, perceived as pitch (mean, median).

f0 variability (Hz)

Measures of dispersion of f0 (variance, standard deviation).

f0 range (Hz)

Difference between the lowest and highest f0.

Intensity (dB)

Acoustic intensity (i.e., the power carried by sound per unit area in a direction perpendicular to that area), expressed in decibels relative to a reference value and perceived as loudness.

Intensity variability (dB)

Measures of dispersion of intensity (variance, standard deviation).

Energy velocity

The mean-squared central difference of energy across frames, which may correlate with motor coordination.

Maximum phonation time (s)

The maximum time for which phonation of a vowel can be sustained in an upright position, after a deep breath, at a comfortable pitch and loudness. The mean of three attempts is taken.

Speech rate

Number of speech utterances per second over the duration of the speech sample (including pauses).

Articulation rate

Number of speech units per second throughout the speech sample (excluding pauses).

Time talking (s)

Sum of the duration of all speech segments.

Utterance duration mean (s)

Mean duration of utterances.

Pause duration mean (s)

Mean duration of pauses.

Pause variability (s)

Measures of dispersion of pause duration (variance, standard deviation).

Pause rate

Total length of pauses divided by the total length of speech (including pauses).

Pause total (s)

Total duration of pauses.
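The timing-related prosodic features above can be derived from a list of detected speech segments, as in the following sketch. It assumes that pauses are the gaps between consecutive speech segments, that the sample starts and ends with speech, and that a count of speech units (for example syllables or utterances) is already available; the function name, dictionary keys, and sample values are hypothetical.

```python
import statistics

def timing_features(segments, n_units):
    """segments: sorted, non-overlapping (start_s, end_s) speech intervals.
    n_units: number of speech units counted in the sample (assumed given).
    Needs at least three segments so that pause variability is defined."""
    utterances = [end - start for start, end in segments]
    # Pauses are taken here as the gaps between consecutive speech segments
    pauses = [nxt[0] - cur[1] for cur, nxt in zip(segments, segments[1:])]
    time_talking = sum(utterances)
    pause_total = sum(pauses)
    total = time_talking + pause_total  # sample length including pauses
    return {
        "time_talking_s": time_talking,
        "utterance_duration_mean_s": statistics.mean(utterances),
        "pause_total_s": pause_total,
        "pause_duration_mean_s": statistics.mean(pauses),
        "pause_variability_s": statistics.stdev(pauses),
        "pause_rate": pause_total / total,
        "speech_rate": n_units / total,              # pauses included
        "articulation_rate": n_units / time_talking, # pauses excluded
    }

# Hypothetical sample: three speech segments containing 18 speech units
features = timing_features([(0.0, 2.0), (2.5, 4.0), (5.0, 6.0)], n_units=18)
for name, value in features.items():
    print(name, value)
```

How segments and speech units are detected (voice activity detection, syllable nuclei, transcripts) differs between studies and strongly affects these values.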