Fig. 1: Schematic of our approach. | Nature Human Behaviour

From: Human gloss perception reproduced by tiny neural networks

a, The scenes contained an object with a glossy surface (Ward reflectance model (refs. 69,70), surface roughness fixed at 0.05). Random specular reflectance values and body colours were applied. The full set of 3,888 images was rendered by combining 36 shapes, 36 lighting environments and 3 random viewpoints per shape–lighting pair (36 × 36 × 3). Images were generated in the CIE XYZ (1931) colour space, converted to linear sRGB and then displayed using standard sRGB gamma correction (γ = 2.4).

b, Left: design of online experiment. The 3,888 images were randomly divided into 54 sets of 72 images each, plus 12 additional images that were shared across sets and rated by all online observers. These shared images, taken from our previous study (see fig. 5a in ref. 30), were used solely to assess inter-observer variability and were not included in the main set of 3,888 images or used for training the network models. Two image sets were also tested in a lab-based experiment (see ‘Laboratory experiment’ in the Supplementary Information), to validate the online data quality. At least three observers were recruited for each set. Right: stimulus configuration for the asymmetric gloss matching task. By moving a slider, observers adjusted the gloss (that is, Pellacini’s c (ref. 73)) of the reference object to match the perceived gloss level of the two objects.

c, For cross-validation, the 3,888 images were split such that each validation set consisted of a block of either three novel shapes (3 shapes × 36 lighting environments × 3 viewpoints = 324 images) or three novel lighting environments (36 shapes × 3 lighting environments × 3 viewpoints = 324 images), which were excluded from the corresponding training set. There were 12 such shape-based splits and 12 lighting-based splits, resulting in 24 unique, non-overlapping combinations of training (3,564 images) and validation (324 images) datasets.
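The split arithmetic in panel c can be checked with a short Python sketch. It is only an illustration, not the authors' code: the index-tuple representation of an image, the names `images` and `make_splits`, and the choice of contiguous index blocks for the held-out shapes or lighting environments are all assumptions, since the caption does not say how the blocks of three were chosen.

```python
from itertools import product

# Counts taken from the caption; BLOCK is the number of held-out shapes
# or lighting environments per validation set.
N_SHAPES, N_LIGHTS, N_VIEWS, BLOCK = 36, 36, 3, 3

# Assumption: each image is identified by (shape, lighting, viewpoint) indices.
images = list(product(range(N_SHAPES), range(N_LIGHTS), range(N_VIEWS)))


def make_splits(images):
    """Return 24 (kind, train, val) splits: 12 by shape block, 12 by lighting block.

    Assumption: held-out blocks are contiguous index ranges (0-2, 3-5, ...).
    """
    splits = []
    for kind, axis, n in (("shape", 0, N_SHAPES), ("lighting", 1, N_LIGHTS)):
        for start in range(0, n, BLOCK):
            held_out = set(range(start, start + BLOCK))
            val = [im for im in images if im[axis] in held_out]
            train = [im for im in images if im[axis] not in held_out]
            splits.append((kind, train, val))
    return splits


splits = make_splits(images)
```

Under these assumptions, each of the 24 entries in `splits` holds 3,564 training and 324 validation images, and no shape (or lighting environment) in a validation set appears in its training set.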
We trained CNNs with varying numbers of intermediate layers using images labelled by human gloss judgements (‘human-like networks’) or by physical ground-truth labels (‘ground-truth networks’), where each ‘label’ consisted of a continuous value of perceived gloss or physical specular reflectance as captured by Pellacini’s c. Gloss levels predicted by the networks were compared against human responses and physical ground-truth labels. The trained networks were analysed to understand the computational mechanisms that emerged within them.
