Figure 1
From: Audio-visual combination of syllables involves time-sensitive dynamics following from fusion failure

(A) Proposed neurophysiological mechanisms for fusion versus combination. We posit that, after being processed by primary auditory and motion-sensitive areas (bottom row), AV inputs converge in the left Superior Temporal Sulcus (STS, middle row), which works as a multidimensional feature space, here reduced to a simple 2D space whose main dimensions are lip motion and the second speech formant. The STS is relatively insensitive to AV asynchrony [as depicted in (B)] but encodes both physical inputs in the 2D space, converging on the most likely cause of a common speech source given these inputs. In the visual [aga]—auditory /aba/ condition, the coordinates in the 2D space fall close to those of the existing syllable ‘ada’, which is selected as the solution, such that the subject senses no conflict. In the visual [aba]—auditory /aga/ condition, the absence of an existing ‘aCa’ syllable at the crossing of the input coordinates triggers the post-hoc reconstruction of a plausible sequence of the inputs via complex consonant transitions (i.e., ‘abga’ or ‘agba’). Both combination outputs require additional interaction with time-sensitive (prefrontal and auditory) brain regions. Grey arrows represent the STS output as read out by higher-order areas. Blue and red arrows represent visual and auditory inputs, respectively. (B) Discrepant audio (A) and visual (V) syllabic speech units ‘aCa’ are represented within a critical time window for integrating them as a single item coming from the same source. The auditory percept is either a McGurk fusion ‘ada’ (left) or a combination percept (right). For combination, the percept is either ‘abga’ or ‘agba’ and arises online from the actual order in which each phoneme is detected. Image made using Microsoft PowerPoint, version 16.41.
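The fusion-versus-combination logic described in panel (A) can be sketched as a nearest-neighbour lookup in the 2D (lip motion, F2) feature space: if the crossing of the visual and auditory coordinates lies close enough to a stored syllable, that syllable is read out (fusion); otherwise no single solution exists and a phoneme sequence must be reconstructed post hoc (combination). This is only an illustrative toy model, not the authors' implementation; the syllable coordinates, the threshold, and the function names below are hypothetical choices made for the sketch.

```python
# Toy sketch of the proposed STS readout (hypothetical values throughout).
import math

# Hypothetical normalized coordinates of stored 'aCa' syllables:
# (lip_motion, second_formant)
SYLLABLES = {
    "aba": (1.0, 0.0),  # strong lip closure, low F2 locus
    "ada": (0.2, 0.4),  # intermediate on both dimensions
    "aga": (0.0, 1.0),  # no lip closure, high F2 locus
}
FUSION_THRESHOLD = 0.8  # assumed radius for accepting a stored syllable

def sts_readout(visual_lip_motion, auditory_f2):
    """Return ('fusion', syllable) if a stored syllable lies near the
    combined (visual, auditory) coordinates; otherwise ('combination', None),
    signalling that a plausible phoneme sequence must be rebuilt post hoc."""
    point = (visual_lip_motion, auditory_f2)
    best, dist = min(
        ((name, math.dist(point, coords)) for name, coords in SYLLABLES.items()),
        key=lambda pair: pair[1],
    )
    if dist <= FUSION_THRESHOLD:
        return ("fusion", best)
    return ("combination", None)

# Visual [aga] (lip motion of 'aga') + auditory /aba/ (F2 of 'aba'):
print(sts_readout(0.0, 0.0))  # → ('fusion', 'ada')
# Visual [aba] (lip motion of 'aba') + auditory /aga/ (F2 of 'aga'):
print(sts_readout(1.0, 1.0))  # → ('combination', None)
```

With these (assumed) coordinates the model reproduces the asymmetry in the caption: the visual [aga]—auditory /aba/ crossing lands near ‘ada’ and fuses, whereas the visual [aba]—auditory /aga/ crossing is equidistant from all stored syllables, so no fusion solution is available.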