Fig. 1: Model approach.
From: Decoding speech perception from non-invasive brain recordings

We aim to decode speech from the brain activity of healthy participants recorded with MEG or EEG while they listen to stories and/or sentences. For this, our model extracts the deep contextual representations of 3 s speech signals (Y, of F features by T time samples) from a pretrained ‘speech module’ (wav2vec 2.0; ref. 29) and learns the representations (Z) of the brain activity in the corresponding 3 s window (X, of C recording channels by T time samples) that maximally align with these speech representations, using a contrastive loss (CLIP; ref. 44). The representation Z is given by a deep convolutional network. At evaluation, we feed the model left-out sentences and compute the probability of each 3 s speech segment given each brain representation. The resulting decoding can thus be ‘zero-shot’, in that the audio snippets predicted by the model need not be present in the training set. This approach is therefore more general than standard classification approaches, in which the decoder can only predict the categories learnt during training.
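To make the pipeline concrete, the following is a minimal PyTorch sketch of the CLIP-style contrastive objective and the zero-shot scoring step described above. It assumes the wav2vec 2.0 features Y are precomputed; the BrainModule architecture, all dimensions, and the flattened-similarity computation are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class BrainModule(nn.Module):
    """Toy stand-in for the deep convolutional network mapping MEG/EEG
    windows X (C channels x T samples) to representations Z (F x T).
    Layer sizes are arbitrary placeholders."""

    def __init__(self, n_channels: int, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_channels, 256, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(256, n_features, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, T) -> (B, F, T)
        return self.net(x)


def clip_loss(z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """CLIP-style contrastive loss: each brain representation z_i should be
    most similar to its own speech representation y_i within the batch.
    z, y: (B, F, T)."""
    z = F.normalize(z.flatten(1), dim=-1)  # (B, F*T), unit norm per sample
    y = F.normalize(y.flatten(1), dim=-1)
    logits = z @ y.t()                     # (B, B) similarity matrix
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, targets)


@torch.no_grad()
def zero_shot_probs(z: torch.Tensor, candidate_y: torch.Tensor) -> torch.Tensor:
    """Probability of each candidate 3 s speech segment given one brain
    representation. The candidates need not appear in the training set.
    z: (F, T); candidate_y: (N, F, T) -> (N,) probabilities."""
    z = F.normalize(z.flatten(), dim=-1)
    y = F.normalize(candidate_y.flatten(1), dim=-1)
    return (y @ z).softmax(dim=-1)


# Example usage (all shapes are illustrative assumptions):
# brain = BrainModule(n_channels=208, n_features=768)
# x = torch.randn(16, 208, 360)   # batch of 3 s MEG windows
# y = torch.randn(16, 768, 360)   # matching precomputed wav2vec 2.0 features
# loss = clip_loss(brain(x), y)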