Figure 1

Extraction of semantic components from the visual stream of the film. (a) Semi-automatic pipeline for extracting the semantic components. First, a film frame is passed through an automatic visual concept recognition system (Clarifai) to extract concept labels. The extracted labels are then passed through a language model (fastText) to obtain 300-dimensional semantic vectors, or word embeddings. The semantic vectors are averaged over all labels assigned to a frame, yielding one averaged semantic vector per frame. The dimensionality of the vectors is further reduced by applying a principal component analysis (PCA) to the averaged semantic vectors. The final result is a set of 50-dimensional semantic components that are subsequently used to model the neural responses. (b) Illustration of how averaging all concept labels of a frame affects the semantic representation. Each word in the language model (fastText) can be seen as a point in a 300-dimensional semantic space, where neighboring words are assumed to capture similar semantics. Averaging in this space yields a new point that lies in the neighborhood of all the averaged words, and can therefore represent combined, complex meaning. In this example, the new point lies between the individual words ‘horse’, ‘carriage’, and ‘roof’, combining the meanings of these words.
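The averaging and PCA steps of the pipeline can be sketched in a few lines. The sketch below assumes the Clarifai labels have already been extracted per frame and that a pretrained 300-dimensional fastText model in word2vec text format is available; the file name `cc.en.300.vec` and the toy `all_frame_labels` list are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

# Pretrained 300-d fastText embeddings in word2vec text format
# (file name is an assumption; any fastText .vec file works).
word_vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec")

def frame_vector(labels):
    """Average the embeddings of all concept labels assigned to one frame."""
    return np.mean([word_vectors[w] for w in labels], axis=0)

# Hypothetical concept-recognition output: one label list per film frame.
all_frame_labels = [
    ["horse", "carriage", "roof"],
    ["street", "crowd", "building"],
]
frame_vectors = np.stack([frame_vector(labels) for labels in all_frame_labels])

# Reduce the per-frame averages with PCA. The paper uses 50 components,
# which requires at least 50 frames; the toy example above has only two,
# so we cap the component count accordingly.
pca = PCA(n_components=min(50, len(frame_vectors)))
semantic_components = pca.fit_transform(frame_vectors)

# Panel (b): the averaged vector lies in the neighborhood of the words it
# averages, so its nearest neighbors in the embedding space reflect the
# combined meaning of 'horse', 'carriage', and 'roof'.
print(word_vectors.similar_by_vector(
    frame_vector(["horse", "carriage", "roof"]), topn=5))
```

On a full film, `frame_vectors` would hold one row per frame and the 50-component PCA would apply directly; only the label lists and model file change.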