Fig. 4: Experimental results of video action recognition on the KTH dataset.
From: High-order tensor flow processing using integrated photonic circuits

a The adopted neural network model. The input is a video segment of five frames. The network comprises two convolutional layers with activation and pooling layers, a recurrent layer (RL), and a fully connected layer (FC). The 'Conv. 1' and 'Conv. 2' layers are computed by the PTFP chip. b, c Convolution results of the PTFP chip for Conv. 1 and Conv. 2, respectively. Subplots from top to bottom display the convolved images of different frames; from left to right, several convolved video segments are displayed. Computer-calculated results are provided alongside for reference. d, e Confusion matrices of recognition by the PTFP chip and by a digital computer, respectively. Numbers on the diagonal record correct predictions. f Simulated accuracy of the neural network under additive Gaussian noise with different standard deviations (σnoise). The solid curve represents the average recognition accuracy and the shading indicates the 90% confidence interval. The yellow triangle marks the experimental accuracy of the PTFP chip.
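The pipeline described in panel a (per-frame Conv → activation → pooling twice, then a recurrent layer over the five frames, then a fully connected readout) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the frame size, kernel sizes, hidden width, ReLU/tanh activations, and max pooling are all assumptions, since the caption does not specify them; only the layer ordering and the six KTH action classes come from the source.

```python
import numpy as np

# Assumed dimensions -- the caption does not give layer sizes.
N_FRAMES, H, W = 5, 16, 16      # video segment of five frames (16x16 assumed)
N_CLASSES = 6                   # KTH has six action classes

rng = np.random.default_rng(0)

def conv2d(img, kernel):
    """Valid 2-D convolution (the operation offloaded to the PTFP chip)."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def pool2(img):
    """2x2 max pooling (pooling type assumed)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

relu = lambda x: np.maximum(x, 0)

k1 = rng.standard_normal((3, 3))  # Conv. 1 kernel (3x3 assumed)
k2 = rng.standard_normal((3, 3))  # Conv. 2 kernel (3x3 assumed)

# Per-frame feature extraction: Conv -> activation -> pooling, twice.
features = []
for frame in rng.standard_normal((N_FRAMES, H, W)):
    x = pool2(relu(conv2d(frame, k1)))   # Conv. 1 stage
    x = pool2(relu(conv2d(x, k2)))       # Conv. 2 stage
    features.append(x.ravel())

# Recurrent layer (RL): a plain tanh RNN stepping over the five frames.
d_in, d_hid = features[0].size, 8
W_in = rng.standard_normal((d_hid, d_in)) * 0.1
W_h = rng.standard_normal((d_hid, d_hid)) * 0.1
h = np.zeros(d_hid)
for f in features:
    h = np.tanh(W_in @ f + W_h @ h)

# Fully connected layer (FC): class scores for the six KTH actions.
W_fc = rng.standard_normal((N_CLASSES, d_hid)) * 0.1
scores = W_fc @ h
print(scores.shape)
```

With random weights this only verifies shapes and data flow; in the experiment the two convolution stages are the operations executed optically on the PTFP chip, while the RL and FC stages run electronically.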