Fig. 3
From: Activity in perceptual classification networks as a basis for human subjective time perception

Human and model duration estimation and its modulation by scene type. a Human duration estimates by duration of presented video, between 1 and 64 s (log-log scale). b–d As for a, but for the model versions Full frame, “Gaze”, and “Shuffled”. The data in each plot a–d are based on the same set of 4251 video presentations, reported on by human participants a or estimated by the different model versions b–d. Shaded areas in a–d show ±1 standard deviation of the mean. Human reports in a show the typical qualities of human temporal estimation, with overestimation of short durations and underestimation of long durations. b Model estimates when the input was the full video frame replicated these qualitative properties, but estimation was poorer than for humans. c Model estimates when the input was constrained to approximate human visual-spatial attention, based on human gaze data, very closely approximated human reports made on the same videos. When the gaze contingency was “Shuffled”, such that the gaze data were applied to a different video from the one on which they were obtained d, performance decreased. e Comparison of the normalised mean error (NME) of model and human estimates (error normalised by physical duration, across the presented durations). f Root mean squared error (RMSE) of model estimates relative to the human data. The “Gaze” model matches the human data most closely (Full frame: 11.99, “Gaze”: 10.71, “Shuffled”: 11.79). g Mean deviation of duration estimates from the overall mean estimate, by scene type, for human participants (mean shown in a; City: mean 6.20 (1040 trials), Campus & outside: mean 2.01 (1170 trials), Office & cafe: mean −4.23 (2080 trials); total 4290 trials). h As for g, but for the “Gaze” model (mean shown in c; City: mean 24.41 (1035 trials), Campus & outside: mean −5.75 (1167 trials), Office & cafe: mean −9.00 (2068 trials); total 4270 trials). i The number of accumulated salient perceptual changes over time in the different network layers (lowest to highest: conv2, pool5, fc7, output), by scene type, for the “Gaze” model shown in h
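
The summary metrics in panels e and f can be illustrated with a short sketch. This is not the authors' code: it assumes that NME is the signed estimation error divided by the physical duration and averaged across trials, and that RMSE is computed between model and human estimates on matched trials; all variable names and example values are illustrative only.

```python
import numpy as np

def normalised_mean_error(estimates, true_durations):
    """Mean of (estimate - true) / true across trials (assumed definition of NME)."""
    estimates = np.asarray(estimates, dtype=float)
    true_durations = np.asarray(true_durations, dtype=float)
    return np.mean((estimates - true_durations) / true_durations)

def rmse_vs_human(model_estimates, human_estimates):
    """Root mean squared error between model and human estimates on the same trials."""
    model_estimates = np.asarray(model_estimates, dtype=float)
    human_estimates = np.asarray(human_estimates, dtype=float)
    return np.sqrt(np.mean((model_estimates - human_estimates) ** 2))

# Made-up example values (seconds), showing the typical over-/under-estimation pattern:
true_s  = np.array([1.0, 8.0, 32.0, 64.0])
human_s = np.array([1.6, 7.1, 25.0, 44.0])
model_s = np.array([1.4, 7.8, 27.0, 47.0])
print(normalised_mean_error(human_s, true_s))  # signed, duration-normalised error (panel e)
print(rmse_vs_human(model_s, human_s))         # model-vs-human RMSE (panel f)
```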
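Panel i counts "accumulated salient perceptual changes" per network layer. Below is a minimal sketch of one way such a count could be produced, assuming a salient change is registered whenever the Euclidean distance between successive feature activations in a layer exceeds a running threshold that decays over time and resets after each detected change. The function name, threshold dynamics, and parameter values are assumptions for illustration, not the published implementation.

```python
# Hypothetical sketch (not the published implementation): counting salient
# perceptual changes over time for one network layer, as plotted in panel i.
import numpy as np

def accumulate_salient_changes(layer_activations, threshold_init=2.0,
                               decay=0.99, reset=2.0):
    """Count frames whose activation change exceeds a decaying threshold.

    layer_activations: array of shape (n_frames, n_features) for one layer
    (e.g. conv2, pool5, fc7, or output). All parameter values are illustrative.
    """
    threshold = threshold_init
    count = 0
    counts_over_time = []
    for t in range(1, len(layer_activations)):
        change = np.linalg.norm(layer_activations[t] - layer_activations[t - 1])
        if change > threshold:
            count += 1          # register a salient change...
            threshold = reset   # ...and reset the threshold
        else:
            threshold *= decay  # otherwise let the threshold relax
        counts_over_time.append(count)
    return counts_over_time

# Example: random activations standing in for one layer's features.
rng = np.random.default_rng(0)
fake_layer = rng.normal(size=(300, 128))           # 300 frames, 128 features
print(accumulate_salient_changes(fake_layer)[-1])  # total accumulated changes
```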