Fig. 5: Crossmodal audio-to-visual motion prediction with the RP-RC system.

From: Dynamic machine vision with retinomorphic photomemristor-reservoir computing

a Schematic of the audio-to-motion data flow, including audio feature extraction, crossmodal learning (CML), and recurrent vision prediction. b Schematic of the crossmodal audio-to-motion prediction system. The PMA-trained CAE is identical to the one used in Fig. 4. c Audio-to-visual motion prediction results. The audio signals ‘A person is moving right’, ‘A person is moving left’, ‘A car is moving right’, and ‘A car is moving left’ are used as input. For each input, Mel spectrograms and MFCC features are extracted, and a trained DNN crossmodally recognizes the first motion frame (X1’). The recognition accuracy for X1’ from audio input is 90% over 120 random test datasets (Supplementary Fig. 20). Audio-to-motion prediction succeeds for three of the four motions (the first 25 predicted frames are shown). The motion of ‘A person is moving right’ is predicted correctly for the first 9 frames but then fades (Supplementary Fig. 21). Providing another audio input at step 9 re-establishes the correct motion prediction, with weaker residual imprints from the previous frames.
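The audio feature-extraction step in (c) can be illustrated with a minimal sketch using librosa; the library choice, sampling rate, feature dimensions, and frame parameters below are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch: Mel spectrogram + MFCC extraction of the kind that could feed
# a trained DNN for crossmodal recognition of the first motion frame (X1').
# All parameters (sr, n_mels, n_mfcc, n_fft, hop_length) are assumptions.
import numpy as np
import librosa

sr = 16000                      # assumed sampling rate (Hz)
duration = 1.0                  # assumed clip length (s)

# Synthetic stand-in for a spoken command such as 'A person is moving right'.
t = np.linspace(0.0, duration, int(sr * duration), endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

# Mel spectrogram (power), converted to a log (dB) scale.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=128, n_mels=40)
log_mel = librosa.power_to_db(mel, ref=np.max)

# MFCC features computed from the same signal.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=128)

# Stacked feature matrix of the kind a recognition DNN could take as input.
features = np.concatenate([log_mel, mfcc], axis=0)
print(log_mel.shape, mfcc.shape, features.shape)
```

This sketch only shows the feature-extraction stage; the recognized frame X1’ would then seed the recurrent vision prediction described in (a) and (b).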
