Fig. 5: Analysis of a three-layer model with 64 kernels trained on images with human labels.
From: Human gloss perception reproduced by tiny neural networks

a, In the first convolutional layer, the input image was convolved with a set of 64 kernels, 9 of which (7 by 7 pixels) are visualized on the left. In the second and third convolutional layers, the activation maps from the first convolutional layer were convolved successively with two further sets of kernels, and each of the 64 activation maps from the third convolutional layer was average-pooled. Between convolutional layers 2 and 3, batch normalization and/or ReLU were applied. Skip connections between layers allowed information to bypass a layer where beneficial. The pooled features were then used as input to a linear regression model, which predicted gloss level. b, Example of internal representations in a three-layer network model. We fed the network images it had not seen during training and extracted the activation maps from each convolutional layer. Each map was aggregated over its entire extent by max pooling (layers 1 and 2) or average pooling (final layer), and the resulting values were used as inputs to t-SNE for projection onto a 2D plane. The bottom two plots show representations from the first and third convolutional layers. For comparison, we also performed t-SNE analysis on the same images in pixel space (leftmost plot), where matte and glossy objects are not separated and objects are primarily clustered by shape (not shown here).
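The pipeline described in panels a and b can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. Several details are assumptions that the caption does not state: a three-channel input, 3 by 3 kernels in layers 2 and 3, zero-padded "same" convolution, additive skip connections, and ReLU after every layer with batch normalization omitted. The names `conv2d_same`, `init_params`, `forward` and `pooled_features` are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def conv2d_same(x, kernels):
    """Zero-padded 'same' cross-correlation.

    x: (C_in, H, W); kernels: (C_out, C_in, k, k) with odd k.
    Returns activation maps of shape (C_out, H, W).
    """
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    # windows has shape (C_in, H, W, k, k): one k-by-k patch per pixel.
    windows = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    return np.einsum("chwij,ocij->ohw", windows, kernels)


def relu(x):
    return np.maximum(x, 0.0)


def init_params(rng, n_kernels=64, in_channels=3, scale=0.05):
    """Random (untrained) weights; kernel sizes for layers 2-3 are assumed."""
    return {
        "k1": rng.normal(0.0, scale, (n_kernels, in_channels, 7, 7)),
        "k2": rng.normal(0.0, scale, (n_kernels, n_kernels, 3, 3)),
        "k3": rng.normal(0.0, scale, (n_kernels, n_kernels, 3, 3)),
        "w": rng.normal(0.0, scale, n_kernels),  # linear regression weights
        "b": 0.0,                                # linear regression intercept
    }


def forward(image, params):
    """Return the predicted gloss level and the three sets of activation maps."""
    a1 = relu(conv2d_same(image, params["k1"]))
    a2 = relu(conv2d_same(a1, params["k2"])) + a1  # assumed additive skip
    a3 = relu(conv2d_same(a2, params["k3"])) + a2  # assumed additive skip
    pooled = a3.mean(axis=(1, 2))                  # global average pooling
    gloss = pooled @ params["w"] + params["b"]     # linear readout
    return gloss, (a1, a2, a3)


def pooled_features(acts):
    """One 64-dimensional vector per layer, aggregated as in panel b."""
    a1, a2, a3 = acts
    return a1.max(axis=(1, 2)), a2.max(axis=(1, 2)), a3.mean(axis=(1, 2))
```

Calling `forward` on an array of shape `(3, H, W)` returns a scalar prediction and the per-layer maps. Stacking the vectors from `pooled_features` over many held-out images gives one matrix per layer, which could then be passed to a t-SNE implementation such as scikit-learn's `TSNE` to obtain 2D projections like those in panel b.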