Figure 2

Network for depression detection using speaker embeddings as input. S, C, K refers to the stride, number of channels and kernel size of the convolutional layer, respectively. FC refers to a fully-connected layer. The same network is used for OpenSMILE and COVAREP features.