arising from C. Mares et al. Communications Biology https://doi.org/10.1038/s42003-023-04976-y (2023)

The ability to synchronize a motor response to an auditory signal is central to human activities such as dancing, joint music making, or conversing with others. Examining the temporal alignment between hands or articulators as motor effectors and speech syllables or music tones as auditory prompts, Mares et al.1 concluded that sensorimotor synchronization (SMS) ability varies greatly in the general population, with a group of people (called “low synchronizers”) being particularly disrupted when asked to synchronize to auditory sequences containing variable units. However, this conclusion rests on a methodological oversight: the stimuli of the study do not take into account the P-center effect, which is central to the perception of temporal structure in speech and other acoustically complex sounds. This makes it difficult to draw meaningful comparisons between SMS to sequences containing identical vs. variable prompts and undermines the conclusions of the study.

Mares et al.1 replicate the results of previous studies2,3 showing that the task of synchronizing articulatory gestures of the syllable “tah” with sequences of isochronous but varied syllables leads to a split of the general population into two groups: the “high synchronizers” who are able to repeat the syllable “tah” at a constant rate resembling the rate of the auditory prompt, and the “low synchronizers” who cannot maintain a steady rate of the “tah” syllable production when listening to sequences of varied syllables. Mares et al.1 confirm that the difficulties of low synchronizers persist during the synchronization with sequences of variable tones and when the motor effector changes from articulators to hands. They further add to existing evidence that these difficulties disappear when the low synchronizers are asked to synchronize with auditory prompts containing sequences of identical units (either syllables or tones). Moreover, the authors demonstrate that sensorimotor priming with a sequence of identical tones can temporarily restore the low synchronizers’ ability to maintain a steady train of motor gestures during subsequent exposure to sequences of variable tones or syllables.

Unfortunately, the study suffers from a methodological oversight that limits direct comparison of SMS with identical vs. varied sequences. The phenomenon overlooked in the study is the well-documented P-center effect. The P-center (the “perceptual center”) of a sound refers to its subjective moment of occurrence and reflects the fact that the acoustic and the perceptual onsets of a sound do not coincide4. The P-center tends to be located after the acoustic onset of the corresponding sound, though its exact location has been a matter of debate5 and differs across languages and possibly individuals6,7,8,9. Studies broadly agree that the P-center approximates (and possibly anticipates)8 vowel onsets7,10,11. The P-center has been attested in a wide range of speech materials and by means of different tasks6,8, with emerging evidence for its role in neural speech tracking12. Moreover, it is not unique to speech: it has also been documented in musical sounds such as tones5,9,13,14 and will therefore apply to the tonal stimuli of the Mares et al. study1 in a similar way.

Overall, previous research has convincingly documented that evenly concatenated syllables (i.e., sequences like the ones used in the experiment by Mares et al.1) sound irregular to listeners4,10,15,16,17. Similarly, when asked to synchronize the production of varied syllables with a metronome, speakers do not align syllable onsets in time with the metronome beat10,18,19,20,21,22. To illustrate this, Fig. 1 compares the timing of concatenated syllables used in the experiment by Mares et al.1 to the timing of the same syllables produced by a male speaker in time with a metronome set at a comparable pace (here, one beat every 250 ms). As can be seen in Fig. 1, stimuli of the speech-to-speech synchronization task (panel 1-A) display higher variability of inter-vocalic (m(IVI) = 227.67 ms, s(IVI) = 19.29 ms) than inter-syllabic (m(ISI) = 232.5 ms, s(ISI) = 9.32 ms) intervals. In contrast, the timing of syllables produced with the metronome (panel 1-B) shows the P-center effect, as indicated by a lower regularity of inter-syllabic intervals (m(ISI) = 248.89 ms, s(ISI) = 18.86 ms) and a higher regularity of inter-vocalic intervals (m(IVI) = 249.78 ms, s(IVI) = 6.10 ms), with the latter approaching the metronome rate20,21,22.

Fig. 1: Timing of inter-syllabic and inter-vocalic intervals in a synthesised vs. naturally spoken train of syllables.

Temporal analyses of inter-syllabic (ISI) and inter-vocalic (IVI) intervals in the materials of Mares et al.1 (A, taken from the onset of a stimulus steadily paced at 4.3 units/second)46 as compared to the production of a male speaker articulating the same materials (here, syllables) in time with a metronome paced at one beat every 250 ms (B). Annotations of syllable and vowel onsets were made manually by the author.
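The interval statistics reported for Fig. 1 can be reproduced in principle from annotated onset times. The following minimal sketch assumes that m and s denote the mean and the sample standard deviation of successive inter-onset intervals; the onset values and consonant durations are hypothetical and are not the annotations behind Fig. 1.

```python
import statistics


def interval_stats(onsets_ms):
    """Return (mean, sample SD) of the intervals between successive onsets."""
    intervals = [later - earlier
                 for earlier, later in zip(onsets_ms, onsets_ms[1:])]
    return statistics.mean(intervals), statistics.stdev(intervals)


# Hypothetical annotations in ms, NOT the measurements behind Fig. 1.
syllable_onsets = [0, 232, 464, 696, 928]    # evenly concatenated syllables
consonant_durations = [40, 80, 55, 95, 60]   # assumed to vary by consonant
vowel_onsets = [s + c for s, c in zip(syllable_onsets, consonant_durations)]

isi_mean, isi_sd = interval_stats(syllable_onsets)
ivi_mean, ivi_sd = interval_stats(vowel_onsets)
print(f"ISI: m = {isi_mean:.2f} ms, s = {isi_sd:.2f} ms")
print(f"IVI: m = {ivi_mean:.2f} ms, s = {ivi_sd:.2f} ms")
```

With evenly spaced syllable onsets and consonants of unequal duration, the inter-vocalic intervals necessarily become more variable than the inter-syllabic ones, which is the pattern described for panel A.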

The methodological oversight is problematic because SMS requires auditory prompts to have temporal regularity and predictability23,24. Given that the perceived temporal regularity in varied spoken and tonal units is a matter of the P-center timing, all sequences of the Mares et al. study1 containing varied prompts may have sounded irregular to all participants. These irregularities (see Fig. 1A) resemble stimuli of previous experiments that used finger taps as motor effectors and examined responses to temporal perturbations of local inter-onset intervals in isochronous metronome sequences24,25. Such phase perturbations have been shown to elicit error correction responses reflective of perceptual monitoring for temporal reference-frames within an incoming auditory stimulus24,25,26. For example, when the onset timing of a local event is slightly shifted to deviate from isochrony of the remaining events in the sequence, participants shift their synchronization, even if explicitly instructed to ignore occasional perturbations27. The process of phase correction is therefore considered automatic and different from deliberate period correction elicited in response to global tempo changes within a sequence27.
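The effect of a local onset shift on subsequent responses can be illustrated with a first-order linear phase-correction rule of the kind commonly used to describe tapping data. This is a minimal sketch under stated assumptions, not a model taken from the studies cited above: the correction gain `alpha`, the noise-free responses, and all timing values are hypothetical.

```python
def simulate_taps(stimulus_onsets_ms, period_ms, alpha):
    """Noise-free taps under linear phase correction.

    Each tap follows the previous one by one period, minus a fraction
    `alpha` of the previous asynchrony (tap time minus stimulus time).
    """
    taps = [float(stimulus_onsets_ms[0])]  # assume the first tap is on time
    for stimulus in stimulus_onsets_ms[:-1]:
        asynchrony = taps[-1] - stimulus
        taps.append(taps[-1] + period_ms - alpha * asynchrony)
    return taps


# Hypothetical isochronous sequence (250 ms) whose third event is 30 ms late.
stimulus = [0, 250, 530, 750, 1000]
taps = simulate_taps(stimulus, period_ms=250, alpha=0.5)
asynchronies = [tap - stim for tap, stim in zip(taps, stimulus)]
print(taps)
print(asynchronies)
```

In this sketch the tap after the shifted event moves in the direction of the shift even though the underlying period is unchanged, and the resulting asynchrony then shrinks over the following taps.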

Given these properties of SMS, the task of the Mares et al. study1 could only be performed if synchronization with temporally jittered prompts was not attempted at all. This means that “high synchronizers” performed well by ignoring the precise acoustic timing of local synchronization attractors and kept producing “tah” or clapping their hands at a rate broadly commensurate with the auditory prompt (listeners excel at establishing the distal rate of spoken input28,29). “Low synchronizers”, on the other hand, may have conscientiously followed the prompt trying to synchronize with the jittered P-centers of concatenated syllables, repeatedly deploying phase correction and failing to establish synchronization. The measure of synchronization used in the study – the phase-locking value – captures exactly this SMS property by calculating distal phase covariance between amplitude envelopes of the perceived and produced sequences, disregarding the actual synchronization accuracy23.
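Why a phase-locking value can be high despite poor synchronization accuracy can be shown with a simplified, event-based version of the measure. Note the assumption: the study derives phases from amplitude envelopes, whereas this sketch converts hypothetical response asynchronies directly into relative phases and takes the length of their mean resultant vector.

```python
import cmath
import math


def phase_locking_value(asynchronies_ms, period_ms):
    """Length of the mean resultant vector of the relative phases (0 to 1)."""
    vectors = [cmath.exp(2j * math.pi * a / period_ms)
               for a in asynchronies_ms]
    return abs(sum(vectors) / len(vectors))


PERIOD_MS = 250  # hypothetical prompt period

# Hypothetical asynchronies (response time minus prompt time, in ms).
constant_lag = [100] * 8                          # inaccurate but stable
jittered = [-60, 45, -30, 70, -50, 20, 65, -40]   # accurate on average, unstable

plv_constant = phase_locking_value(constant_lag, PERIOD_MS)
plv_jittered = phase_locking_value(jittered, PERIOD_MS)
print(f"constant lag: {plv_constant:.3f}")
print(f"jittered:     {plv_jittered:.3f}")
```

A response train that lags every prompt by the same large amount yields a value at ceiling, while a train that is close to the prompts on average but keeps adjusting its phase yields a much lower one.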

In a sense, then, “low synchronizers” were actually better at synchronizing with the external auditory prompt than “high synchronizers”. Since the grouping of participants into “high” and “low synchronizers” could no longer be maintained when the task involved acoustic prompts containing repetitions of the same unit (syllable or tone), it is very likely that the bipartite grouping of participants1 does not arise from individual differences in synchronization in its classic definition23,24. Indeed, it has been well established that individuals can vary in their general synchronization ability with different kinds of prompts30,31 and in their ability to adapt synchronization to tempo-changing32 or temporally perturbed33 prompts – but so far, without a strong indication that synchronization may be non-unimodally distributed in the non-clinical population. One piece of evidence currently missing is an experiment with varied syllables concatenated so as to establish equal spacing between successive P-centers (rather than concatenating jittered syllables, see Fig. 1). Such an experiment would help to illuminate the role of the P-center timing in the task1.
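The proposed stimulus construction can be sketched as follows. The sketch assumes that each syllable's P-center is available as an offset from its acoustic onset (approximated here by the vowel onset); the offsets, syllable durations, and the 250 ms period are hypothetical values chosen for illustration.

```python
def pcenter_aligned_onsets(pcenter_offsets_ms, period_ms):
    """Syllable onsets that place successive P-centers one period apart.

    The first syllable starts at 0 ms; every later syllable is shifted so
    that its P-center falls on the grid defined by the first P-center.
    """
    first = pcenter_offsets_ms[0]
    return [i * period_ms + first - offset
            for i, offset in enumerate(pcenter_offsets_ms)]


def has_overlap(onsets_ms, durations_ms):
    """True if any syllable would still be sounding when the next one starts."""
    return any(onset + duration > next_onset
               for onset, duration, next_onset
               in zip(onsets_ms, durations_ms, onsets_ms[1:]))


# Hypothetical per-syllable measurements in ms.
pcenter_offsets = [40, 80, 55, 95, 60]       # onset-to-P-center distance
durations = [200, 210, 205, 215, 200]        # total syllable duration

onsets = pcenter_aligned_onsets(pcenter_offsets, period_ms=250)
pcenters = [on + off for on, off in zip(onsets, pcenter_offsets)]
print(onsets)
print(pcenters)
print(has_overlap(onsets, durations))
```

The overlap check matters at fast rates: aligning P-centers shifts syllable onsets earlier or later, so a given rate is only feasible if no syllable runs into its successor.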

Even though the speech-to-speech production studied by Mares et al.1 is unlikely to test auditory-motor synchronization proper, the consistency with which it divides the general population into two groups2,34 is remarkable and worth further consideration. In this context, the grouping can be hypothesized to arise from hitherto poorly understood individual differences in the interplay of feedback and feedforward control mechanisms during speech production35,36. According to the neurocomputational DIVA model, for example, speech production can best be understood to emerge from the relations between brain activity, speech motor commands and their sensory output, and to be governed by two control mechanisms35,36. Feedback control operates by identifying discrepancies between anticipated and actual outcomes of articulatory actions and adjusting motor commands in response. If feedback control detects auditory or somatosensory errors, corrections start to apply to feedforward processes. Feedforward control constitutes an internal motor program of speech sounds and syllables. During the production of a syllable like “tah”, the two mechanisms are assumed to interact, starting with the activation of the sensorimotor representations of the consonant and vowel gestures whose execution is monitored by feedback control. The model has found extensive support in auditory perturbation experiments37,38,39,40 – a paradigm that in some ways resembles the task of the Mares et al. study1. Within this framework, “low synchronizers” may be primarily recruiting feedback control for adjusting the timing of the articulatory gestures to align with the P-centers of the input syllables, while “high synchronizers” may be exclusively relying on feedforward commands to perform the task41. The task likely involves somatosensory (rather than auditory) feedback mechanisms, since the grouping of “high” vs. “low” synchronizers persists across effectors1 while loudness adjustments do not affect the performance on this task. Open questions remain about how the perceptual and motor abilities of the speaker, as well as the auditory stimulus itself, influence the moment-to-moment balance of feedforward representations and feedback information (for relevant discussion, see refs. 38,42,43).

Other explanations (e.g., the presence of subjective rhythmization at a fast input tempo44 or socio-psychological factors45) are, however, also conceivable and would warrant careful examination.