Fig. 5: Auditory-visual/smell/taste crossmodal recognition and imagination.

a Illustration of the human ability to recognize and visualize audio input. b Schematic of the artificial auditory-vision/olfactory/gustatory system. Mel spectrograms convert the audio inputs into 13 × 3-dimensional features that feed the ANN. Visual data processed by 12 × 12 photodetectors and photomemristors, together with olfactory and gustatory vectors (Supplementary Fig. 6), are encoded into 12-dimensional features by an autoencoder (Supplementary Fig. 7) to represent the image, smell, and taste information. The ANN consists of 4 layers: 39 input neurons, two hidden layers of 12 neurons each, and 12 output neurons (the image/smell/taste representation). c Detected image (spiking rate of the PSC) and vision memory (PSC values after visual input) of an apple, pear, blueberry, heart, and dog. The memorized vision, smell, and taste vectors (Supplementary Fig. 6) are encoded into representations by the autoencoder, which supervise the training of the ANN on the audio inputs /ˈapəl/, /pɛː/, /ˈbluːbəri/, music from the song ‘My Heart Will Go On’, and the barking of a dog. d Recognized and reproduced image, smell, and taste of an apple, pear, and blueberry, and the reproduced image of a heart and dog upon the associated audio input (spoken words, music, barking). Here, 2200 data sets with different accents (British/Chinese, male/female, child/adult) and two kinds of dog barking (Labrador Retriever and Cocker Spaniel) were split into 1980 data sets for training and 220 for testing. e Illustration of supervised training of the auditory-vision system using colors and apples. A blue apple is neither ‘seen’ nor ‘heard’ during training. f Imagination of a blue apple by the trained system when /bluː, ˈapəl/ is given as audio input.
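
To make the audio front end of panel b concrete, the sketch below shows one way the 13 × 3-dimensional mel features could be computed. The caption does not specify the spectrogram parameters, so the choice of 13 mel bands, the average-pooling of the time axis into 3 segments, and the use of librosa are assumptions; it is a minimal sketch, not the authors' pipeline.

```python
# Hedged sketch: reducing an audio clip to a 13 x 3 mel feature block,
# flattened to the 39-dimensional input expected by the ANN in panel b.
import numpy as np
import librosa

def mel_features(path, n_mels=13, n_segments=3):
    y, sr = librosa.load(path, sr=None)
    # Mel spectrogram with n_mels frequency bands, converted to dB.
    s = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    s_db = librosa.power_to_db(s, ref=np.max)
    # Pool the time axis into n_segments equal chunks by averaging.
    chunks = np.array_split(s_db, n_segments, axis=1)
    feats = np.stack([c.mean(axis=1) for c in chunks], axis=1)  # (13, 3)
    return feats.flatten()  # 39 values, one per ANN input neuron
```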
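The 12-dimensional multisensory representation in panel b is produced by an autoencoder (Supplementary Fig. 7). Below is a minimal sketch assuming plain fully connected layers; the 12 × 12 image size and the 12-dimensional code come from the caption, but the smell/taste vector lengths and the hidden-layer width are hypothetical placeholders, as the actual dimensions are only given in the supplementary material.

```python
# Hedged sketch of the multisensory autoencoder (Supplementary Fig. 7).
import torch
import torch.nn as nn

IMG_DIM = 12 * 12  # flattened 12 x 12 photodetector/photomemristor image
SMELL_DIM = 8      # assumed length of the olfactory vector (Suppl. Fig. 6)
TASTE_DIM = 8      # assumed length of the gustatory vector (Suppl. Fig. 6)
CODE_DIM = 12      # 12-dimensional image/smell/taste representation

class MultisensoryAutoencoder(nn.Module):
    def __init__(self, in_dim=IMG_DIM + SMELL_DIM + TASTE_DIM, code_dim=CODE_DIM):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, code_dim), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 64), nn.ReLU(),
            nn.Linear(64, in_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)       # 12-dim multisensory representation
        return self.decoder(code), code
```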
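Finally, the 4-layer ANN (39 → 12 → 12 → 12 neurons) maps the audio features onto the autoencoder codes. The sketch below assumes sigmoid activations, a mean-squared-error loss, and Adam optimization, none of which are stated in the caption; training targets are the encoded vision/smell/taste representations, and the ‘imagination’ of panels d and f corresponds to decoding the ANN output with the autoencoder decoder.

```python
# Hedged sketch of the 4-layer crossmodal ANN of panel b.
import torch
import torch.nn as nn

ann = nn.Sequential(
    nn.Linear(39, 12), nn.Sigmoid(),  # 39 inputs: flattened 13 x 3 mel features
    nn.Linear(12, 12), nn.Sigmoid(),  # second hidden layer of 12 neurons
    nn.Linear(12, 12), nn.Sigmoid(),  # 12 outputs: image/smell/taste code
)

def train(ann, audio_feats, target_codes, epochs=500, lr=1e-2):
    """Supervised training: regress audio features onto the autoencoder
    codes of the memorized vision/smell/taste vectors. Per the caption,
    1980 of the 2200 data sets are used for training and 220 for testing."""
    opt = torch.optim.Adam(ann.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(ann(audio_feats), target_codes)
        loss.backward()
        opt.step()

# "Imagination" (panels d and f): a test audio cue, such as the mel
# features of /bluː, ˈapəl/, is pushed through the ANN and the predicted
# 12-dim code is decoded with the autoencoder:
#   code = ann(test_feats)
#   reproduced = autoencoder.decoder(code)
```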