Abstract
Recent research has revealed cross-linguistic and individual variations in the processing of acoustic cues for phonetic categorization. This study extends this line of inquiry by examining the auditory perception of native Spanish listeners and Chinese learners of Spanish, focusing on their ability to map acoustic signals onto intonation categories. Through two identification tasks employing synthesized stimuli with systematically varied acoustic and stress patterns, we investigated how listeners navigate multiple cues in recognizing Spanish sentence types. Results indicated that changes in fundamental frequency (F0), duration, and intensity significantly influenced native Spanish listeners’ intonation judgments, while Chinese learners predominantly relied on F0 modulations to differentiate statements from yes/no questions. Compared to native Spanish listeners, Chinese learners demonstrated lower sensitivity to changes across the three cues and less proficiency in reconciling cue trade-offs. Furthermore, our study revealed that both Spanish and Chinese listeners’ perceptual performance was modulated by stress patterns and their chronological age. Overall, our research elucidates the multifaceted nature of intonation perception, underscoring the critical role of linguistic background, individual characteristics, and lower-level prosodic context in the transformation of acoustic details into intonation categories.
Introduction
In the intricate realm of speech perception, listeners frequently harness perceptual intelligence to weave together acoustic cues from multiple dimensions, ensuring a refined comprehension of the intended message (Holt and Lotto, 2006; Kim and Tremblay, 2022). A wealth of studies has shown that native speakers of varied languages such as English, German, Cantonese, and Korean can discern intonation contrasts using changes in fundamental frequency (F0), intensity, and duration, although their use of these cues may differ based on factors like listening conditions, stimulus types, and the speaker’s chronological age (Chang, 2013; Ma et al., 2008; Morrow and Liu, 2013; Niebuhr, 2007; Peng et al., 2012). A parallel cue integration process is observed in the perception of diverse linguistic categories including stop-voicing (Kong and Edwards, 2016; Schertz et al., 2020; Yu, 2022) and vowel contrasts (Lipski et al., 2012; Tillman et al., 2017) in English, alongside lexical stress (Zhang and Francis, 2010) and lexical tone (Chandrasekaran et al., 2010; Zhang et al., 2022) in Mandarin. Yet, despite this similarity in multi-cue integration, listeners from different languages tend to perceive the same acoustic stimuli in different ways, because the phonetic categories of their native language (L1) attune them to certain acoustic cues over others (Kim and Tremblay, 2022; Ou et al., 2023). For example, previous studies have found significant variations in cue weights among native and non-native English speakers in stop-voicing (Dmitrieva, 2019), lexical tone (Wiener, 2017), and intonation (Feng et al., 2019; Shang et al., 2022, 2024a).
An influential theoretical framework that accounts for such cross-linguistic perceptual variances is the cue-weighting theory, which posits that the cue-weighting patterns that develop to differentiate speech contrasts in one’s L1 can extend to their perception of a second or foreign language (L2) (Holt and Lotto, 2006; Kim and Tremblay, 2020; Tremblay et al., 2018; Zhang and Francis, 2010).
Supporting this cue-weighting transfer hypothesis are numerous studies. Take, for example, the work by Souza et al. (2017), which determined that the size of the L1 vowel inventory can impact how listeners discern L2 English vowels. Notably, Danish learners of English, when compared to those from Catalan, Portuguese, and Russian backgrounds, demonstrated perceptions more closely aligned with L1 speakers (Souza et al., 2017). This occurrence is credited to the beneficial transfer of cue-weighting strategies from their expansive L1 vowel space, which sharpened their perception of the nuanced spectral differences inherent in L2 vowels. In a similar vein, existing research underscores that the perceptual processing of English lexical stress by L2 listeners hinges largely on the relative weighting of acoustic cues to lexical contrasts in their L1 (Connell et al., 2018; Cooper et al., 2002; Cutler et al., 2007; Kim and Tremblay, 2021; Tremblay et al., 2021; Van Heuven and De Jonge, 2011; Zhang and Francis, 2010). To illustrate, while English and Mandarin listeners displayed a greater reliance on F0 in perceiving English lexical stress, Russian listeners had a more pronounced dependence on duration, relegating F0 to a secondary position (Chrabaszcz et al., 2014). Such a propensity can likely be attributed to the marked relevance of duration cues in demarcating stress contrasts within Russian.
Continuing this narrative, subsequent research has delved into the scope of cue-weighting strategies, illustrating their transferability from L1 to L2 perception, even across different types of speech contrasts (Choi, 2022; Choi et al., 2019; Kim and Tremblay, 2021; Qin et al., 2019; Wiener and Goss, 2019). For instance, Seoul Korean listeners exhibited enhanced acuity in discerning intonationally cued lexical stress in L2 English when compared to French listeners. This heightened proficiency is postulated to arise from the positive transfer of tonal cues, which are crucial in differentiating the three-way laryngeal stop contrasts inherent to Seoul Korean (Kim and Tremblay, 2022). Based on these observations, it is reasonable to postulate that the reliance placed on certain acoustic cues in L1 categories conditions the degree to which listeners utilize and depend on those dimensions when processing the L2.
Regarding intonation perception, the working model of automatic selective perception (ASP) posits that particular acoustic dimensions, primarily the F0, when consistently and reliably associated with intonation categorization, generally carry greater weight than secondary cues like duration and intensity, in auditory decision-making (Strange, 2009, 2011; Ortega-Llebaria et al., 2017). For instance, research focusing on Spanish intonation has highlighted the pivotal role of the final F0 movement, distinguishing it as the strongest intonational cue for perceiving question-statement contrast that can even override any preceding conflicting cues (Face, 2007; Shang et al., 2024b, 2024c). Conversely, for tonal languages like Mandarin Chinese, the F0 contour undertakes diverse linguistic functions, spanning both lexical tone and sentence-level intonation. In this context, listeners predominantly perceive intonation through global F0 trends, relegating surface F0 shapes to a more confined role of indicating lexical tone identity (Chen, 2022; Yuan, 2004). Given the strong informational emphasis on F0 in Chinese, it is often proposed that listeners of tonal languages might exhibit enhanced abilities in pitch perception tasks, displaying higher behavioral sensitivity to F0 cues compared to non-tonal language listeners. Studies affirm this, showing that Chinese listeners excel in perceiving Cantonese tones, are more sensitive to F0 shifts, and quickly spot F0 mismatches in English words, relative to non-tonal language speakers (Chang et al., 2017; Deroche et al., 2019; Ortega-Llebaria et al., 2017). However, the extent to which this F0 advantage applies to intonation perception, especially in L2, remains debated.
Some neurobehavioral research revealed differential neural encoding for pitch processed as tone and intonation, hinting that the observed F0 advantage may be specific to certain pitch events and not necessarily generalizable to all pitch perception tasks (Chien et al., 2020; Doherty et al., 2004; Gandour, 2009; Zatorre and Gandour, 2008). To truly grasp F0 perception among tonal and non-tonal language listeners, expansive studies probing varied pitch dimensions and linguistic pairings are indispensable.
Furthermore, while certain acoustic cues might inherently hold more significance than others during perception, their weights are not fixed. Listeners may often adapt cue weighting in response to fluctuating acoustic contexts, a phenomenon explored via theories like phonetic trading relation (Mann and Repp, 1980; Repp, 1982) and perceptual compensation (Kuang and Cui, 2018; Jiao and Xu, 2019). The degree of adaptation or compensation listeners apply in perceptual processing seems to depend on their sensitivity to the input acoustic dimensions aligned with intended phonetic categories (Hodgson and Miller, 1996). For example, English listeners, displaying nuanced perceptual differences in F1 discrimination tasks, showed greater compensations in response to acoustic perturbations in F1 vowel formants (Villacorta et al., 2007).
Despite the consistent perceptual patterns observed within the same linguistic community, individual variations remain a significant consideration for understanding the intricacies of perceptual landscapes in both L1 and L2 (Fuchs et al., 2015; Ou et al., 2023). Developmental studies using cross-sectional methods have captured shifts in cue weighting for L1 segmental contrasts from childhood to adulthood, demonstrating that children’s cue prioritization evolves towards an adult-like pattern over time (Mayo and Turk, 2004, 2005). Similar changes in cue-weighting strategies with increasing age have been observed in individual L2 learners. For instance, Kim et al. (2018) found that adult and child Korean learners of English exhibit changes in cue weighting over time and across different vowel contrasts. However, research exploring how perceptual weighting shifts across the adult age spectrum, particularly in L2, remains scarce. In a revelatory study on sound categorization across varied age groups, Toscano and Lansing (2019) discovered that older adults rely more heavily on onset F0 cues for L1 voicing judgments compared to younger adults. This finding suggests that age-related changes in perceptual strategies persist well into adulthood, at least for L1 perception. Nonetheless, it remains unclear whether similar age-related effects exist in the intonation perception of adult L2 learners, particularly in terms of cue-weighting strategies. Thus, further research in this area is warranted to bridge this gap in our understanding of L2 intonation processing across the lifespan.
Apart from age, gender may also play a role in shaping perceptual variations. Gender differences in speech perception may be attributed to a combination of physiological, sociolinguistic, and cognitive factors. For instance, Krizman et al. (2012) found that females exhibit stronger interhemispheric collaboration in speech processing, while Labov (1990) noted that women tend to use more standard language forms, potentially influencing their phonetic sensitivity. These gender-related differences have been observed in various aspects of speech perception. For example, earlier research has shed light on gender-related differences in how English prosodic phrase boundaries (Zhang, 2012) and Afrikaans tonogenesis (Pfiffner, 2020) are perceived. There is also evidence that the perceptual cue weighting of L1 English stop-voicing was modulated by the listener’s gender and their subjective evaluation of the talker (Yu, 2022). However, similar studies in L2 perception are relatively scarce. Despite various accounts of anatomical and functional differences between genders, Bryła-Cruz (2021) suggests that neither sex can be considered superior in L2 phonetic perception. Consequently, the role of gender in cross-linguistic intonation perception warrants further investigation.
Lastly, empirical studies on Chinese prosody show that question intonations ending with a rising tone (Tone2) are harder to identify than those with other tone endings (Liu et al., 2022; Yuan, 2006, 2011; Yuan and Shih, 2004). The root of this difficulty seems to be the overlap of Tone2 and question intonation in their primary encoding within the F0 dimension, paving the way for potential acoustic conflicts in perception. However, given that a single acoustic dimension can be used to encode multiple categories, such acoustic cue competition is not unique to the simultaneous processing of lexical tone and sentence intonation (Xu, 2004). Given this, and considering that in Spanish both stress and intonation are consistently signaled by F0 and duration (Ortega-Llebaria, 2006; Ortega-Llebaria and Prieto, 2011; Ortega-Llebaria et al., 2013), a compelling question is whether the perceptual weighting of Spanish intonation changes when stress is processed concurrently, particularly when these linguistic categories coincide within the same phonetic unit. In response to this query, our study examines the perception of oxytone and paroxytone words, which respectively carry and lack stress on the final syllable, to discern potential acoustic conflicts arising from stress processing during Spanish intonation perception.
Aims and hypotheses
Building upon earlier findings and observations, our study seeks to investigate the perception of Spanish intonation among native Spanish listeners and Chinese L2 learners within dynamic acoustic change contexts. To this end, we engaged the Spanish L1 and L2 participants in two experiments. In these experiments, we first synthesized pitch stimuli by gradually transitioning the final F0 contour from falling to rising. On top of this base manipulation, we then manipulated duration (Experiment 1) and intensity (Experiment 2) in those pitch stimuli. By evaluating listeners’ responses to these stimuli, we sought to address the following research questions (RQs):
RQ1: To what extent do changes in F0, duration, and intensity influence question-statement recognition among Spanish L1 and Chinese L2 listeners? Are there perceptual differences between listeners from different linguistic backgrounds?
RQ2: What are the potential relationships among various acoustic cues in the perceptual processing of Spanish intonation?
RQ3: Do individual differences in age and gender influence L1 and L2 listeners’ perception of Spanish statements and yes/no questions?
RQ4: To what degree do lower-level prosodic factors, such as the final stress pattern, influence listeners’ perception of Spanish intonation?
For RQ1, we first hypothesized that changes in F0 would significantly impact intonation perception across both L1 and L2 groups, based on the cue-weighting theory and the prominent role of F0 cues in Chinese and Spanish prosody (Chen, 2022; Face, 2007; Kim and Tremblay, 2020). We also expected to observe similar weighting or sensitivity to the intonation-related F0 cue between L1 and L2 listeners, because studies have found that pitch processed as intonation activated the same brain areas among tonal and non-tonal language speakers, regardless of their language-specific realization of intonation (Chien et al., 2020).
Additionally, previous data indicate that yes/no questions in Spanish and Chinese are usually characterized by elongated duration towards the end compared to statements, although in Chinese this pattern is further modulated by the final tone type (Yuan, 2006; Romera Barrios et al., 2007). Thus, we postulated that changes in duration could influence the perception of sentence types by listeners of the two languages. Likewise, question intonation might be characterized by heightened intensity in both languages, but unlike F0 and duration, which consistently cue intonation contrasts, intensity seems to be an optional cue that conveys less reliable information in Spanish prosody (Romera Barrios et al., 2007; Yuan, 2006; Yuan and Shih, 2004). Given this, we hypothesized only a marginal, if any, impact of this cue on intonation perception. We also expected to observe a lower sensitivity of Chinese learners to duration and intensity compared to Spanish listeners due to a negative transfer from their L1 cue-weighting patterns.
For RQ2, we expected to find a perceptual trade-off between the primary (F0) and secondary cues (duration/intensity) of intonation based on the principle of phonetic trading relation (Mann and Repp, 1980; Repp, 1982).
For RQ3, considering prior studies of individual differences in L1 and L2 perceptual processing (Bryła-Cruz, 2021; Kim et al., 2018; Toscano and Lansing, 2019; Yu, 2022), we anticipated potential age and gender effects on Spanish intonation judgments by Spanish and Chinese listeners, without predicting specific directional outcomes.
For RQ4, given the acoustic overlap between stress and intonation in oxytone words’ final syllable (Ortega-Llebaria, 2006; Ortega-Llebaria and Prieto, 2011; Ortega-Llebaria et al., 2013), we hypothesized that oxytone words with stress on the final syllable might be less likely to be identified as questions compared to paroxytone words.
Methodology
Participants
A total of 46 native speakers of Castilian Spanish (hereafter, SP) and 95 Chinese L2 learners (hereafter, CH) of Spanish participated in the study. To control for the effect of age of acquisition, we excluded data from two CH participants who started learning Spanish before the age of 16. The remaining CH participants, all born in mainland China, reported being predominantly exposed to Castilian Spanish during their language learning process and began learning L2 Spanish, on average, at the age of 19.73 years (SD = 3.00). Moreover, to ensure the L2 participants had a sufficient understanding of L2 intonation, we excluded data from CH learners (N = 5) with an A1 or A2 proficiency level. As a result, the remaining CH learners had either an intermediate (B1), advanced (B2), or superior (C1) proficiency in Spanish. Thus, the final participant pool for Experiment 1 consisted of 39 SP listeners and 78 CH learners of Spanish, with ages ranging from 18 to 59 years (N = 117, 90 women, 28 men, Mage = 28.27, SD = 8.43). For Experiment 2, the participant pool comprised 33 SP listeners and 77 CH learners, with ages ranging from 19 to 58 years (N = 110, 84 women, 26 men, Mage = 28.02, SD = 8.46). No participants reported any history of hearing or communication disorders at the time of testing. The socio-demographic information and language immersion conditions of the analyzed participants are presented in Table 1.
Stimulus synthesis
In Peninsular Spanish, the intonation of yes/no questions and statements differs in both the prenucleus and nucleus. To isolate the effects of the intonation nucleus, single-word utterances with one stressed syllable were used as our base stimuli, thereby excluding any potential prenucleus influences. The materials comprised two trisyllabic Spanish words: “Sevilla” and “Alcalá”, with penultimate and final syllable stress patterns, respectively. To ensure the ecological validity of our study, we employed a Discourse Completion Task (DCT) to elicit the declarative forms of these words from a female Spanish speaker in Barcelona. The DCT approach allowed us to elicit the production of broad focus statements, ensuring that our base stimuli were produced in a natural and contextually relevant manner.
The final syllable of each stimulus word was selected for acoustic manipulation. Specifically, the F0 contour of the utterance-final syllable was replaced with a multi-step continuum using Praat (Boersma and Weenink, 2020). To achieve this, the original stylized F0 contour was first defined by two anchor points, one at vowel onset (A1) and one at vowel offset (A2), between which values were determined through interpolation (see Fig. 1). The anchor point A1 was situated at the beginning of the final vowel, preserving the original F0 height. The anchor point A2 was positioned at the last glottal pulse visible in the spectrogram, where the original F0 diverged from A1 by 3 Hz. The A2 continuum comprised eleven steps spaced 20 Hz apart, nine ascending and one descending in relation to A1, spanning 200 Hz in total. This process generated 22 F0 contours (11 steps for each of the two stress patterns) for the synthesized final syllable, which then served as the basis for subsequent manipulations of duration and intensity.
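The continuum can be sketched numerically as follows. This is a minimal illustration rather than the Praat procedure actually used: the A1 height of 200 Hz is a hypothetical value (the actual onset F0 is not reported here), and the layout of the eleven A2 targets, running from one 20 Hz step below A1 through a level step to nine steps above it, is one reading of the description that yields the stated 200 Hz span.

```python
def a2_continuum(a1_hz, step_hz=20.0, n_down=1, n_up=9):
    """Return the A2 (vowel offset) targets in Hz, ordered from lowest to highest.

    Assumed layout: n_down steps below A1, one level step (A2 == A1),
    and n_up steps above A1, all spaced step_hz apart.
    """
    return [a1_hz + k * step_hz for k in range(-n_down, n_up + 1)]


def interpolate_contour(a1_hz, a2_hz, n_points=5):
    """Linearly interpolate F0 between the onset anchor A1 and the offset anchor A2."""
    return [a1_hz + (a2_hz - a1_hz) * i / (n_points - 1) for i in range(n_points)]


A1_HZ = 200.0  # hypothetical onset F0, chosen only for illustration
targets = a2_continuum(A1_HZ)
contours = [interpolate_contour(A1_HZ, a2) for a2 in targets]
```

Applying the same eleven targets to each of the two words gives the 22 contours mentioned above.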
For Experiment 1, duration modifications were combined with the 22 synthesized F0 contours. Three duration conditions (short, original, and long) were generated by compressing or expanding the vowel nucleus duration of the final syllable while setting segment boundaries at zero crossings to avoid spectral discontinuities. Long-duration stimuli were created by copying 50 ms of periodic cycles from the center of the original vowel nucleus and inserting this segment after the 5th glottal cycle of the last vowel. The increase was selected based on prior research showing that final vowel durations were approximately 40–70 ms longer in yes/no questions than in statements in Spanish (Romera Barrios et al., 2007). The short durations were generated by removing 40 ms of periodic cycles from the same region of the final vowel nucleus. This decrease was defined with reference to the shortest comparable statement production by the speaker. Combining the 2 stress patterns, 11 F0 steps, and 3 duration levels therefore resulted in 66 stimuli for Experiment 1. Schematic representations of the duration manipulation are provided in Supplementary Fig. S1.
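The fully crossed design of Experiment 1 can be enumerated in a few lines. The sketch below only reproduces the factorial structure described above; the condition labels and dictionary keys are hypothetical names introduced for illustration, and the duration shifts are the +50 ms and −40 ms values given in the text.

```python
from itertools import product

STRESS_PATTERNS = ("paroxytone", "oxytone")  # "Sevilla" vs. "Alcalá"
F0_STEPS = tuple(range(1, 12))               # 11 continuum steps, numbered 1 to 11
DURATION_SHIFT_MS = {"short": -40, "original": 0, "long": 50}

# One record per stimulus: every stress pattern x F0 step x duration level.
stimuli = [
    {"stress": stress, "f0_step": step, "duration": level,
     "shift_ms": DURATION_SHIFT_MS[level]}
    for stress, step, level in product(STRESS_PATTERNS, F0_STEPS, DURATION_SHIFT_MS)
]
```

The resulting list has 2 × 11 × 3 = 66 entries, matching the stimulus count reported for Experiment 1.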
For Experiment 2, the intensity was manipulated using the Constant Amplification function in Cool Edit Pro 2.1 on the 22 F0 contours. Three intensity conditions were synthesized by applying a −7, 0, and +7 dB modification to the final syllable relative to the normalized non-final syllable intensity of 70 dB. Thus, the low, original, and high intensities were set at 63 dB, 70 dB, and 77 dB. By crossing the 2 stress patterns, 11 F0 steps, and 3 intensity levels, a total of 66 stimuli were generated for Experiment 2. A schematic representation of the intensity manipulation can be found in Supplementary Fig. S2.
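To make the size of the ±7 dB manipulation concrete, the decibel offsets can be converted to linear amplitude scaling factors with the standard relation ratio = 10^(ΔdB/20). The sketch below is illustrative only and does not reproduce the Cool Edit Pro workflow; the function and variable names are hypothetical.

```python
def db_to_amplitude_ratio(delta_db):
    """Convert a level change in dB to a linear amplitude (sound pressure) ratio."""
    return 10.0 ** (delta_db / 20.0)


BASE_DB = 70.0  # normalized non-final syllable intensity reported in the text
OFFSETS_DB = {"low": -7.0, "original": 0.0, "high": 7.0}

# Target level of the final syllable and the gain needed to reach it from 70 dB.
levels_db = {name: BASE_DB + offset for name, offset in OFFSETS_DB.items()}
gains = {name: db_to_amplitude_ratio(offset) for name, offset in OFFSETS_DB.items()}
```

Under this relation, +7 dB corresponds to multiplying the waveform amplitude by roughly 2.24, and −7 dB to multiplying it by roughly 0.45.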
Procedure
Data were collected via an online survey platform (https://www.alchemer.com/). The survey comprised three sections. The first elicited participants’ demographic and linguistic background details. The second and third sections contained the auditory stimuli for Experiments 1 and 2, respectively. The stimulus audios can be found in Supplementary Audios S1 and S2. The text of each stimulus was displayed without punctuation. Participants could complete one or both experiments based on interest. Participants were instructed to use headphones in a quiet environment. A practice trial preceded the experiments to familiarize participants with the procedure. Responses were captured on a 5-point Likert scale to capture graded perceptual shifts. For each stimulus, participants selected one of five descriptions of the intonation: “statement,” “more statement than question,” “either statement or question,” “more question than statement,” and “question.”
Statistics
Given the ordinal nature of the 5-point response scale, ordered generalized linear models (OGLM) were employed, which allow the parallel-lines assumption to be relaxed when it is violated (Abrudan et al., 2020). This involved estimating separate coefficients across the five outcome levels. Two independent OGLM models were fit, one for Experiment 1 (model 1) and one for Experiment 2 (model 2), using the oglmx R package (Carroll, 2020). To address our RQs, we incorporated several interaction terms in our statistical models. For RQ1, we included Language Group × F0 change and Language Group × Duration in model 1, and Language Group × F0 change and Language Group × Intensity in model 2. RQ2 was examined through F0 change × Duration in model 1 and F0 change × Intensity in model 2. For RQ3, we added Language Group × Age and Language Group × Gender interactions in both models. Finally, to address RQ4, we incorporated Stress Type (paroxytone vs. oxytone) and its potential interactions with other factors in both models. Additionally, Stimulus Order was included in each model and, along with the Age factor, was z-transformed (mean-centered and scaled) prior to inclusion.
Results
Results of Experiment 1
The model fitted for Experiment 1 yields four threshold parameters, representing the cut-points between the five ordered response levels. These thresholds and their corresponding statistics are presented in Table 2. The results indicate that while the first threshold was not statistically significant, the subsequent thresholds were all highly significant. This suggests that the model effectively distinguishes between the “more statement than question,” “either statement or question,” “more question than statement,” and “question” categories, but may not clearly differentiate between the “statement” and “more statement than question” categories. The increasing values of the thresholds (0.097 < 1.813 < 2.889 < 5.088) are consistent with the ordinal nature of the response variable. The gaps between adjacent thresholds (1.716, 1.076, and 2.199) are widest between the last two cut-points, suggesting that the “question” category was the most clearly separated from its neighbor.
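The role of the thresholds can be illustrated with a short sketch that converts a linear predictor into probabilities for the five response levels. Several assumptions apply: a logit link is used (the link function is not stated here, and a probit link would give different numbers), the basic cumulative-link form P(Y ≤ j) = F(τ_j − η) is used without the level-specific coefficients of the fitted models, and the predictor values 0.0 and 3.0 are arbitrary illustrative inputs rather than fitted quantities.

```python
import math


def logistic(x):
    """Standard logistic CDF (assumed link; the fitted models may use another)."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probabilities(eta, thresholds):
    """Probabilities of each ordered category given linear predictor eta.

    Uses P(Y <= j) = F(threshold_j - eta); successive differences of the
    cumulative probabilities give the per-category probabilities.
    """
    cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


THRESHOLDS = (0.097, 1.813, 2.889, 5.088)  # Experiment 1 cut-points from the text

probs_low = category_probabilities(0.0, THRESHOLDS)   # arbitrary low predictor
probs_high = category_probabilities(3.0, THRESHOLDS)  # arbitrary higher predictor
```

Raising the linear predictor moves probability mass from the "statement" end of the scale toward the "question" end, which is how a rising F0 step would act in a model of this form.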
Overall, the statistical analysis revealed that listeners’ perception of intonation was significantly influenced by several factors. Four two-way interactions emerged as particularly significant: language group × F0 change [χ2(1) = 23.54, p < 0.001], language group × duration [χ2(2) = 27.39, p < 0.001], F0 change × duration [χ2(2) = 16.97, p < 0.001], and language group × age [χ2(1) = 26.04, p < 0.001]. Stress type [χ2(1) = 63.31, p < 0.001] and stimulus order [χ2(1) = 7.48, p < 0.01] also proved to be significant predictors. However, the interaction between gender and language group was not statistically significant [χ2(1) = 0.01, p = 0.93]. Given that changes in the predictors affect the response categories differently, we discuss these relationships in terms of their marginal effects on each response level. The comprehensive statistical details on the marginal effects are provided in Supplementary Table S1.
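The significance levels attached to these χ2 statistics can be checked with closed-form tail probabilities, which exist for one and two degrees of freedom: p = erfc(√(x/2)) for df = 1 and p = exp(−x/2) for df = 2. The sketch below applies these standard formulas to the statistics reported above; the dictionary labels are illustrative names, and the helper handles only the two df values that occur here.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail probability of a chi-square statistic for df = 1 or df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


# (statistic, degrees of freedom) as reported for Experiment 1
REPORTED = {
    "language group × F0 change": (23.54, 1),
    "language group × duration": (27.39, 2),
    "F0 change × duration": (16.97, 2),
    "language group × age": (26.04, 1),
    "stress type": (63.31, 1),
    "stimulus order": (7.48, 1),
    "language group × gender": (0.01, 1),
}

p_values = {name: chi2_upper_tail(x, df) for name, (x, df) in REPORTED.items()}
```

Small discrepancies from the printed p values are expected, because the χ2 statistics are themselves rounded to two decimals.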
In Table 3, the main effect of F0 indicates that the baseline group (CH group) consistently associates rising pitch with higher levels of question perception. The interaction of L1SP × F0 further reveals that the SP group had stronger F0 effects compared to the CH group. To elucidate the F0 sensitivity of the SP group, we calculated their total effects at each response level. As shown in Table 4, the SP group showed approximately 11% higher sensitivity across all response levels while maintaining the same effect direction as the baseline group. This consistent pattern supports our first hypothesis that language background influences listeners’ utilization of F0 cues for question-statement identification. Specifically, SP listeners may have developed a more fine-grained perceptual mechanism for intonational F0 cues due to their L1 experience.
The main effect of L1 reveals a distinct perceptual bias between language groups. SP listeners exhibited a propensity to categorize sentences as statements, while CH listeners were more inclined to perceive the same utterances as questions. This finding underscores the influence of linguistic background on prosodic interpretation.
The interaction between L1 and duration provided further insights into the differential effects of stimulus length on perception. For SP listeners, a significant positive interaction was observed between L1 and duration, particularly for the long-duration condition. This interaction manifested as a substantial improvement in question identification as the stimulus duration increased from short to long. Conversely, CH listeners demonstrated a different pattern of responses across duration conditions. The main effects of duration for CH listeners, serving as the baseline group, were non-significant for both original and long durations. This lack of significance indicates that CH listeners were not sensitive to duration changes in Spanish intonation.
While F0 is a strong predictor of sentence type, its effect is modulated by the duration. Specifically, Table 3 shows that the pitch identification curve was markedly steeper in the long-duration level than in the short-duration level. This implies that an alteration in F0 had a more pronounced effect on question recognition when the final syllable was lengthened. In other words, both groups could utilize a lower F0 contour to identify a question when the final syllable duration was extended (see Fig. 2). Conversely, a higher F0 contour was required to identify a question when the duration was short. Additionally, the overall changes in the slope of the identification curve between the two language groups (see Fig. 2) indicate that SP listeners, being more sensitive to F0 and duration cues, made greater compensations for changes in duration than CH listeners.
Table 3 presents the main effect of Age, which corresponds to the age-related changes observed in the baseline group. The results indicate that with increasing age, CH listeners exhibited a greater propensity to perceive utterances as questions (levels 4–5). For the SP group, the age effects were derived by combining the main age effects with the L1SP × Age interaction effects, yielding the following values across the five response levels: −0.00822, −0.02000, −0.00847, 0.02375, and 0.01295. These combined effects reveal a milder age-related trend for SP listeners compared to their CH counterparts, particularly at levels 4 and 5. Finally, Table 3 reveals a significant effect of stress type on intonation perception. Words with penultimate stress were more likely to be perceived as questions compared to those with final stress. The underlying explanations for these observations will be elaborated upon in section “Discussion”.
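The combination of main and interaction effects described above is a simple level-by-level sum, and the resulting values can be sanity-checked: because the five response probabilities always sum to one, the marginal effects of a predictor across the five levels should sum to approximately zero. The sketch below applies this check to the combined SP age effects quoted in the text. The helper `total_effects` is a hypothetical name, and the component values from Table 3 are not reproduced here.

```python
def total_effects(main, interaction):
    """Level-by-level total effect for the non-baseline group: main + interaction."""
    return [m + i for m, i in zip(main, interaction)]


# Combined age effects for the SP group at response levels 1 to 5 (from the text).
SP_AGE_EFFECTS = [-0.00822, -0.02000, -0.00847, 0.02375, 0.01295]

# Marginal effects across all levels should cancel out up to rounding error.
residual = sum(SP_AGE_EFFECTS)

# Net shift toward the question end of the scale (levels 4 and 5).
toward_question = SP_AGE_EFFECTS[3] + SP_AGE_EFFECTS[4]
```

The residual here is on the order of 10⁻⁵, which is consistent with rounding in the reported values.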
Results of Experiment 2
Table 5 presents the threshold parameters of the model fitted for Experiment 2. All four threshold parameters were statistically significant, indicating clear distinctions between the five response levels. The negative value of the first threshold indicates a slight bias towards perceiving stimuli as statements, while the notably high value of the final threshold suggests that very strong cues were needed for participants to categorize a stimulus as a definite question.
The output of the model revealed that listeners’ perception of intonation was significantly influenced by several factors. Five two-way interactions emerged as particularly significant: language group × F0 change [χ2(1) = 33.54, p < 0.001], language group × intensity [χ2(2) = 6.55, p < 0.05], F0 change × intensity [χ2(2) = 12.72, p < 0.01], intensity × stress type [χ2(2) = 15.06, p < 0.001], and language group × age [χ2(1) = 28.56, p < 0.001]. However, neither stimulus order [χ2(1) = 0.14, p = 0.71] nor the interaction between gender and language group [χ2(1) = 2.37, p = 0.12] was statistically significant. Statistical details on the marginal effects of Experiment 2 are provided in Supplementary Table S2.
Table 6 demonstrates that the effect of F0 on intonation recognition aligns with the results in Experiment 1. As depicted in Fig. 3, there is an increased likelihood of question-like responses as the final F0 contour rises. The negative margins noted for the SP group at levels 4 and 5 imply that CH listeners had a markedly higher tendency to perceive single-word sentences as yes/no questions compared to the SP listeners. Regarding the interaction of F0 × L1SP, consistent with Experiment 1, we found that SP listeners exhibited greater sensitivity than CH learners to linear F0 changes perceived as intonation.
In line with the F0 × duration interaction, a significant perceptual trade-off between F0 and intensity was identified in Experiment 2. Specifically, Table 6 delineates the positive coefficients of F0 × 77 dB at levels 4 and 5, indicating that the slope of the identification curves as a function of F0 was notably steeper at 77 dB compared to 63 dB. This suggests that listeners required fewer F0 cues for question recognition when the intensity escalated to the maximum level (77 dB). This observation substantiates the existence of phonetic trading relations and implies a counter-directional compensatory mechanism employed by listeners in response to the amplification or diminution of acoustic cues.
Table 6 also revealed significant effects of intensity on intonation perception, with notable differences between L1 groups. For CH listeners, intensity increases from the 63 dB baseline yielded consistent and significant effects, particularly at 77 dB. Specifically, as the intensity increased from 63 dB to 70 dB/77 dB, CH listeners showed a decreased probability of perceiving questions and an increased likelihood of perceiving statements. SP listeners exhibited a nuanced response pattern to intensity variations. As intensity increased, they demonstrated an increased propensity to perceive sentences as questions. However, this perceptual shift was characterized by non-linearity across the intensity spectrum. The most pronounced changes were observed at 70 dB, suggesting a heightened sensitivity to moderate intensity increases. At 77 dB, while the trend toward question perception persisted, the magnitude of the effect was attenuated.
Furthermore, the analysis revealed significant effects of stress type on identification across language groups. The CH group’s response pattern is directly reflected in the main effect. Specifically, for CH learners, Table 6 indicates that penultimate-stressed words (compared to final-stressed words) were less likely to be interpreted as question-like sentences. In contrast, SP listeners showed a stronger tendency to categorize words with penultimate stress as “more question than statement” or “question”. Besides, the interaction between intensity and stress type reveals a significant modulation of stress effects on intonation perception by intensity levels. Notably, at higher intensity levels, particularly at 77 dB, words with penultimate stress demonstrate an increased likelihood of being interpreted as questions compared to the baseline level (63 dB).
Consistent with Experiment 1, the main effect of age indicates that with increasing age, CH learners exhibited a subtle yet significant increase in perceiving utterances as questions (levels 4–5). For the SP group, the age effects were derived by combining the main age effects with the L1SP×Age interaction effects, yielding the following values across the five levels: −0.02080, −0.04824, −0.02884, −0.05729, and 0.03517. The negative effect at level 4 and the slightly reduced positive effect at level 5 suggest that the impact of aging on intonation perception is less pronounced in the SP group compared to CH learners.
Estimation of model performance
The two OGLM models were evaluated with the log loss metric, a criterion for assessing the quality of probabilistic predictions, particularly in multi-class classification. The log loss values for the test dataset were computed with the mlogLoss function in the R ModelMetrics package (Hunt, 2020). The resulting values were 1.0467 and 1.0286 for the models of Experiments 1 and 2, respectively. Generally, a proficient model is characterized by a log loss value lower than the baseline, often referred to as the “naive” or “dumb” log loss, which is computed assuming a uniform distribution over the M response categories (Brown, 2020). In the present study, M = 5 (the five response levels), so each category has a probability of 20%. Consequently, the “naive” log loss for both experiments was 1.6094, as shown in Eq. (1):
$$-\sum_{i=1}^{M}\frac{1}{M}\ln\frac{1}{M} = -\ln\frac{1}{5} \approx 1.6094 \qquad (1)$$
However, the empirical data revealed a non-uniform distribution of the outcome variable’s classes, necessitating the computation of a non-informative log loss that reflects the specific distribution of listeners’ responses. This was achieved by setting the probability of each class to its observed proportion. For instance, in Experiment 1, the response proportions for the categories “statement,” “more statement than question,” “either statement or question,” “more question than statement,” and “question” were 21.77%, 16.49%, 10.88%, 24.13%, and 26.74%, respectively. Therefore, the non-informative log loss for the Experiment 1 model was computed as 1.5662, as detailed in Eq. (2):
$$-\sum_{i=1}^{5} p_i \ln p_i = -(0.2177\ln 0.2177 + 0.1649\ln 0.1649 + 0.1088\ln 0.1088 + 0.2413\ln 0.2413 + 0.2674\ln 0.2674) \approx 1.5662 \qquad (2)$$
Similarly, the non-informative log loss for the Experiment 2 model was determined to be 1.6113. Overall, with log loss values of 1.0467 and 1.0286, respectively, both models substantially outperformed the “naive” and non-informative benchmarks.
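The two baselines described above can be reproduced with a few lines of arithmetic. The sketch below is written in Python rather than R and does not use the ModelMetrics package. All function names are illustrative, the only real inputs are the five response proportions reported for Experiment 1, and the short label list is a hypothetical toy example used only to show that uniform predictions yield the “naive” value.

```python
import math


def multiclass_log_loss(true_classes, predicted_probs):
    """Mean negative log probability assigned to the true class of each trial."""
    losses = [-math.log(probs[k]) for k, probs in zip(true_classes, predicted_probs)]
    return sum(losses) / len(losses)


def uniform_log_loss(num_classes):
    """Log loss of a predictor that assigns 1/M to each of M classes."""
    return -math.log(1.0 / num_classes)


def prior_log_loss(class_proportions):
    """Expected log loss when every trial is predicted with the observed class proportions."""
    total = sum(class_proportions)  # reported percentages sum to 100.01 after rounding
    probs = [p / total for p in class_proportions]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Response proportions reported for Experiment 1 (levels 1 to 5)
exp1_proportions = [0.2177, 0.1649, 0.1088, 0.2413, 0.2674]

naive = uniform_log_loss(5)
non_informative = prior_log_loss(exp1_proportions)

# Hypothetical toy labels: uniform predictions give the naive value for any labels.
toy_labels = [0, 3, 4, 1]
toy_uniform = multiclass_log_loss(toy_labels, [[0.2] * 5 for _ in toy_labels])

print(f"naive log loss:           {naive:.4f}")
print(f"non-informative log loss: {non_informative:.4f}")
print(f"uniform predictions:      {toy_uniform:.4f}")
```

A fitted model is informative to the extent that its test-set log loss, computed as in `multiclass_log_loss`, falls below both baseline values.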
Discussion
This study delved into the perception of acoustic cues inherent to Spanish intonation among native Spanish L1 listeners and Chinese L2 learners. With respect to RQ1, our findings partially corroborated the hypothesized differences in intonation cue weighting between the two language groups. Specifically, changes in the final F0 contour significantly influenced intonation categorization for both L1 and L2 listeners, underscoring the importance of the F0 cue. However, contrary to our initial assumption for RQ1, Spanish L1 listeners showed greater sensitivity to F0 modulations in intonation processing compared to Chinese learners. We theorize that the elevated F0 sensitivity shown by Spanish L1 listeners could be ascribed to their innate familiarity and adeptness with native pitch patterns, which empowers them to more accurately and quickly identify intonation contrasts based on F0 linear transitions. Conversely, Chinese learners might possess restricted L2 experience in processing F0 signals in alignment with language-specific and well-defined intonation categories, culminating in a less steep slope for question-statement identification in Spanish. Additionally, the functional view posits that if certain phonetic cues are harnessed in one grammatical dimension, they will not be employed to a comparable extent in another phonological domain (Seddoh, 2002; Gandour et al., 1995). Consistent with this notion, several studies have highlighted that the inclination of tonal language listeners to primarily perceive F0 information related to word meanings (i.e., lexical tone) is a crucial factor in their reduced sensitivity to F0 cues processed as sentence intonation (Chen, 2005; Liang and Heuven, 2007). 
Adhering to this rationale and considering the parallel processing of stress and intonation in Spanish sentences, we propose that the increased effort by Chinese listeners to prioritize F0 cues for L2 Spanish stress—an essential component for word recognition—arising from the negative transfer of L1 prosodic realization, might also be a relevant factor influencing their diminished sensitivity to intonational F0 cues in Spanish.
Regarding the duration cue (RQ1), our findings largely aligned with our initial hypothesis, confirming that duration changes can significantly impact Spanish L1 listeners’ intonation perception. Long-duration patterns effectively increased their likelihood of question identification. This heightened sensitivity to temporal cues among Spanish L1 listeners suggests a greater reliance on duration as a prosodic marker for question-statement distinctions in their cue-weighting strategy. However, contrary to our expectations, we did not observe significant perceptual improvements in the Chinese group as the final stimulus duration increased. This unexpected result challenges our initial hypothesis and suggests a more complex relationship between L1 background and duration cues in cross-linguistic perception. The Chinese listeners’ relative insensitivity to duration changes suggests a preferential reliance on F0 contour as the primary cue for question-statement discrimination. This perceptual pattern likely stems from the tonal system inherent in their L1, where pitch variations play a crucial role in lexical and sentential distinctions. Such a prosodic transfer effect underscores the profound influence of L1 phonology on the L2 cue-weighting strategy.
Regarding the intensity cue (RQ1), our results partially corroborate the predictions for the Spanish group while revealing distinct perceptual patterns across language groups. Both Chinese L2 and Spanish L1 listeners demonstrated sensitivity to intensity variations, albeit with divergent response trajectories. Chinese L2 listeners exhibited a more pronounced and consistent shift towards statement perception as intensity increased, with the effect most salient at 77 dB. In contrast, Spanish L1 listeners showed an increased propensity to perceive sentences as questions with rising intensity, particularly at 70 dB. This differential response pattern suggests the existence of a language-specific intensity threshold, approximating 70 dB, at which native Spanish listeners display maximum sensitivity to intensity modulations. Notably, beyond this threshold, further intensity increments did not elicit proportional increases in response differentiation for the Spanish group. A potential explanation for the differences in secondary cue weighting between L1 and L2 listeners lies in their perceptual compensation capabilities (Feng et al., 2019). Since F0 contours synthesized in our study deviated from the intonation patterns of naturally spoken Spanish statements and questions (not being entirely linear at the end of the utterance), listeners probably increased the weight of other secondary cues to offset the loss of F0 information and bolster their perceptual decisions. However, native listeners might be better equipped to accurately compensate for acoustic changes than L2 listeners, given their extensive familiarity with the phonetic details of target intonation categories. 
Additionally, building on previous cue-weighting transfer research in cross-linguistic settings (Choi, 2022; Choi et al., 2019; Kim and Tremblay, 2021; Qin et al., 2019; Wiener and Goss, 2019), it is plausible that the reduced sensitivity of Chinese listeners to duration and intensity originates from a transfer of their L1 cue weighting. Considering that F0 plays a pivotal role in Chinese prosody, its speakers might not allocate as much attention to subtler variations in duration and intensity as speakers of non-tonal languages do when discerning an intonation category, especially in non-native listening contexts (Chang and Yao, 2007; Feng et al., 2019; Jiao and Xu, 2019).
With respect to RQ2, our data substantiate the previous hypothesis by demonstrating a strong interaction between the effects of F0 and the duration/intensity cues in intonation perception. In particular, we found that acoustic attenuation in one of these cues could be compensated by an increased contribution from another, such that the original percept was preserved. This perceptual trade-off has been documented in the perception of various phonetic contrasts, such as ongoing sound changes in Southern Yi (Kuang and Cui, 2018) and stop-consonant voicing contrasts in American English (Holt et al., 2001; Jacewicz et al., 2009). Our results also indicated that Spanish L1 listeners, with heightened sensitivity to cues under phonetic trading relations, were more adept at compensating for acoustic variations in Spanish intonation. This observation aligns with previous perceptual compensation research, which proposed that listeners’ auditory compensation depends on their sensitivity to co-varying cues consistently correlated with the recognition of a specific phonetic category (Villacorta et al., 2007; Naul and Munhall, 2020).
With respect to RQ3, our study partially supports the hypothesis that listeners’ perceptual strategies are influenced by certain individual differences. Crucially, age emerged as a significant factor influencing both L1 and L2 listeners’ perceptual processing, whereas gender did not demonstrate a discernible impact on intonation perception. Specifically, older listeners were more likely to give question-like responses, and this age-related trend was more pronounced in the Chinese group. Building upon earlier findings (Kim et al., 2018; Mayo and Turk, 2005; Toscano and Lansing, 2019), we posit that the observed age-related effects in both L1 and L2 groups likely reflect a general age-related cognitive change. Age-related alterations in the auditory system may affect the perception of certain acoustic features, consequently influencing sentence-type judgments. The observed trend could also indicate a compensatory mechanism in response to cognitive aging. As auditory processing capabilities decline with age, perceiving utterances as questions may serve as a strategic adaptation to auditory input uncertainty. This compensatory strategy could (a) minimize potential communication errors in ambiguous situations and (b) require fewer cognitive resources, which may be particularly beneficial as cognitive processing efficiency changes with age. This adaptive tendency reflects the flexibility of the human cognitive system in response to age-related changes. Its more pronounced manifestation in L2 listeners may be attributed to the additional cognitive demands of processing a non-native language. Conversely, the milder effect in the Spanish group suggests that while cognitive aging affects all individuals, its impact on native language processing may be less severe due to deeply ingrained L1 prosodic patterns. However, we acknowledge the limitations of our cross-sectional approach in fully exploring these hypotheses.
To address this and provide more robust evidence for our cognitive aging hypothesis, future research should employ longitudinal studies. Such studies would enable tracking of intonation perception changes within individuals over time, offering a clearer picture of how aging affects prosodic processing in both L1 and L2 contexts.
Finally, with respect to RQ4, our findings confirm the initial assumption that the perceptual processing of Spanish intonation was influenced by the stress pattern. Specifically, in Experiment 1, we found that paroxytone words were more likely to be recognized as questions than oxytone words under identical acoustic conditions. This finding can be interpreted in light of the principle of least effort (Zipf, 2016), which posits that humans naturally prefer the least cognitively demanding course of action. Therefore, since paroxytone is the most frequent, unmarked stress pattern in Spanish (Defior and Serrano, 2017; Roca, 2019), it is logical that both L1 and L2 listeners more frequently categorized paroxytone words as yes/no questions. In contrast, the potential conflict in F0 encodings between stress and intonation on the last syllable of oxytone words may have complicated the processing of intonational F0 cues for listeners, particularly those in the L2 group, thereby reducing the likelihood of question responses. In Experiment 2, while Spanish L1 listeners exhibited a stress effect similar to that observed previously, Chinese L2 listeners showed the opposite pattern. Although the underlying mechanisms for this difference are not yet fully elucidated, we hypothesize that the interaction between stress and intensity may be key to understanding this phenomenon. For L2 listeners, the challenge of integrating non-native stress with intensity may engender perceptual strategies that markedly differ from those employed by L1 listeners. This contrasting trend could indicate either an overcompensation or a fundamentally distinct approach to processing intensity modulations in their L2.
Moreover, this finding underscores the importance of considering language-specific and lower-level prosodic features (e.g., stress) when studying intonation perception across different linguistic groups. The interaction between stress and intensity at phrase boundaries may be particularly crucial in Spanish, a language with contrastive lexical stress, and may pose unique challenges for learners from tonal language backgrounds like Chinese. Further research is needed to unravel this complex issue. Future studies should examine the interaction between stress and intensity cues during intonation perception, specifically at the final boundary of Spanish questions.
Conclusion
By examining the perception of Spanish intonation, the present study revealed important cross-linguistic similarities and differences in the processing of acoustic cues across multiple input dimensions among listeners from tonal and non-tonal language backgrounds. Our findings on intonation cue weighting corroborate and extend previous research (Holt and Lotto, 2006; Peng et al., 2012; Feng et al., 2019; Meng et al., 2020), demonstrating that listeners with diverse linguistic backgrounds employ distinct strategies in utilizing acoustic information to identify the most representative exemplars of prosodic categories. Importantly, our study goes beyond confirming the influence of language background on cross-linguistic perception. We provide novel evidence that listeners’ auditory performance is modulated by additional factors, notably chronological age and lower-level prosodic features. These results reveal the complex interplay between linguistic experience, cognitive development, and acoustic properties in shaping Spanish intonation perception.
This study has several limitations that point to directions for future research. Firstly, although F0, duration, and intensity are the most salient intonational cues, recognizing question-statement contrasts is not confined to these acoustic properties. Other contextual variables, such as speaking rate and phonetic environment, may also influence the evaluation of intonation categories. Secondly, while our use of a Discourse Completion Task provided some contextual grounding for the auditory stimuli, the subsequent acoustic manipulations may have reduced real-world applicability. Future research could explore these perceptual weightings in more naturalistic settings, potentially bridging the gap between laboratory findings and practical applications in multilingual environments. Additional limitations include the imbalance between male and female listeners, which may have limited our ability to detect potential gender effects. While our study did not reveal significant gender differences, a more balanced sample in future investigations could uncover subtle gender-related variations in intonation processing. Finally, although our findings suggest an age effect on intonation perception, the cross-sectional nature of our study limits inferences about the developmental trajectory of these perceptual abilities. To address this, longitudinal studies are essential. Such research would allow for tracking changes in intonation perception across the lifespan, revealing how these skills evolve over time and how they are influenced by ongoing linguistic experiences and cognitive development.
Data availability
The data pertinent to this study can be found in the Supplementary Information section.
References
Abrudan IN, Pop CM, Lazăr PS (2020) Using a general ordered logit model to explain the influence of hotel facilities, general and sustainability-related, on customer ratings. Sustainability 12. https://doi.org/10.3390/su12219302
Boersma P, Weenink D (2020) Praat: doing phonetics by computer (Version 5.3.82). Software available at http://www.praat.org
Brown MKW (2020) Evaluating an ordinal output using data modeling, algorithmic modeling, and numerical analysis. Dissertation, Murray State University
Bryła-Cruz A (2021) The gender factor in the perception of English segments by non-native speakers. Stud Second Lang Learn Teach 11:103–131. https://doi.org/10.14746/ssllt.2021.11.1.5
Carroll N (2020) Oglmx: a package for estimation of ordered generalized linear models (Version 3.0.0.0). Cran R
Chandrasekaran B, Sampath PD, Wong PCM (2010) Individual variability in cue-weighting and lexical tone learning. J Acoust Soc Am 128:456–465. https://doi.org/10.1121/1.3445785
Chang CB, Yao Y (2007) Tone production in whispered Mandarin. Paper presented at the 16th International Congress of Phonetic Sciences, Saarbrücken, Germany, 6–10 August 2007
Chang SE (2013) Effects of fundamental frequency and duration variation on the perception of South Kyungsang Korean tones. Lang Speech 56:211–228. https://doi.org/10.1177/0023830912443951
Chang YS, Yao Y, Huang BH (2017) Effects of linguistic experience on the perception of high-variability non-native tones. J Acoust Soc Am 141. https://doi.org/10.1121/1.4976037
Chen SH (2005) The effects of tones on speaking frequency and intensity ranges in Mandarin and Min dialects. J Acoust Soc Am 117:3225–3230. https://doi.org/10.1121/1.1872312
Chen Y (2022) Mind the subtle f0 modifications: the interaction of tone and intonation in Sinitic varieties. Stellenbosch Papers Linguist Plus 62. https://doi.org/10.5842/62-2-904
Chien PJ, Friederici AD, Hartwigsen G et al. (2020) Neural correlates of intonation and lexical tone in tonal and non-tonal language speakers. Hum Brain Mapp 41:1842–1858. https://doi.org/10.1002/hbm.24916
Choi W (2022) Theorizing positive transfer in cross-linguistic speech perception: the acoustic-attentional-contextual hypothesis. J Phon 91:101135. https://doi.org/10.1016/j.wocn.2022.101135
Choi W, Tong X, Samuel AG (2019) Better than native: tone language experience enhances English lexical stress discrimination in Cantonese-English bilingual listeners. Cognition 189. https://doi.org/10.1016/j.cognition.2019.04.004
Chrabaszcz A, Winn M, Lin CY et al. (2014) Acoustic cues to perception of word stress by English, Mandarin, and Russian speakers. J Speech Lang Hear Res 57. https://doi.org/10.1044/2014_JSLHR-L-13-0279
Connell K, Hüls S, Martínez-García MT, Qin Z et al. (2018) English learners’ use of segmental and suprasegmental cues to stress in lexical access: an eye-tracking study. Lang Learn 68. https://doi.org/10.1111/lang.12288
Cooper N, Cutler A, Wales R (2002) Constraints of lexical stress on lexical access in English: evidence from native and non-native listeners. Lang Speech 45:207–228. https://doi.org/10.1177/00238309020450030101
Cutler A, Wales R, Cooper N, Janssen J (2007) Dutch listeners’ use of suprasegmental cues to English stress. Paper presented at the 16th International Congress of Phonetic Sciences, Saarbrücken, Germany, 6–10 August 2007
Defior S, Serrano F (2017) Learning to read Spanish. In: Verhoeven L, Perfetti C (eds) Learning to read across languages and writing systems. Cambridge University Press, Cambridge, pp. 243–269
Deroche MLD, Lu HP, Kulkarni AM et al. (2019) A tonal-language benefit for pitch in normally-hearing and cochlear-implanted children. Sci Rep 9. https://doi.org/10.1038/s41598-018-36393-1
Dmitrieva O (2019) Transferring perceptual cue-weighting from second language into first language: cues to voicing in Russian speakers of English. J Phon 73:128–143. https://doi.org/10.1016/j.wocn.2018.12.008
Doherty CP, West WC, Dilley LC et al. (2004) Question/statement judgments: an fMRI study of intonation processing. Hum Brain Mapp 23:85–98. https://doi.org/10.1002/hbm.20042
Face TL (2007) The role of intonational cues in the perception of declaratives and absolute interrogatives in Castilian Spanish. Estudios Fonética Exp 16:186–225
Feng J, Tao S, Wu X et al. (2019) The effects of amplitude and duration on the perception of English statements vs questions for native English and Chinese listeners. J Acoust Soc Am 145:EL449–EL455. https://doi.org/10.1121/1.5109046
Fuchs S, Pape D, Petrone C et al. (eds) (2015) Individual differences in speech production and perception. Peter Lang GmbH, Lausanne
Gandour J, Larsen J, Dechongkit S et al. (1995) Speech prosody in affective contexts in Thai patients with right hemisphere lesions. Brain Lang 51:422–443. https://doi.org/10.1006/brln.1995.1069
Gandour JT (2009) Neural substrates underlying the perception of linguistic prosody. In: Experimental studies in word and sentence prosody, vol 2E. De Gruyter Mouton, Munich, pp. 3–26
Hodgson P, Miller JL (1996) Internal structure of phonetic categories: evidence for within‐category trading relations. J Acoust Soc Am 100:565–576. https://doi.org/10.1121/1.415867
Holt LL, Lotto AJ (2006) Cue weighting in auditory categorization: implications for first and second language acquisition. J Acoust Soc Am 119:3059–3071. https://doi.org/10.1121/1.2188377
Holt LL, Lotto AJ, Kluender KR (2001) Influence of fundamental frequency on stop-consonant voicing perception: a case of learned covariation or auditory enhancement? J Acoust Soc Am 109:764–774. https://doi.org/10.1121/1.1339825
Hunt T (2020) ModelMetrics: rapid calculation of model metrics. R package version 1.2.2.2
Jacewicz E, Fox RA, Lyle S (2009) Variation in stop consonant voicing in two regional varieties of American English. J Int Phon Assoc 39:313–334. https://doi.org/10.1017/S0025100309990156
Jiao L, Xu Y (2019) Whispered Mandarin has no production-enhanced cues for tone and intonation. Lingua 218:24–37. https://doi.org/10.1016/j.lingua.2018.01.004
Kim D, Clayards M, Goad H (2018) A longitudinal study of individual differences in the acquisition of new vowel contrasts. J Phon 67. https://doi.org/10.1016/j.wocn.2017.11.003
Kim H, Tremblay A (2020) Testing the cue-weighting transfer hypothesis with Korean listeners’ perception of English lexical stress. J Acoust Soc Am 148. https://doi.org/10.1121/1.5147839
Kim H, Tremblay A (2021) Korean listeners’ processing of suprasegmental lexical contrasts in Korean and English: A cue-based transfer approach. J Phon 87. https://doi.org/10.1016/j.wocn.2021.101059
Kim H, Tremblay A (2022) Intonational cues to segmental contrasts in the native language facilitate the processing of intonational cues to lexical stress in the second language. Front Commun 7. https://doi.org/10.3389/fcomm.2022.845430
Krizman J, Skoe E, Kraus N (2012) Sex differences in auditory subcortical function. Clin Neurophysiol 123:590–597. https://doi.org/10.1016/j.clinph.2011.07.037
Kong EJ, Edwards J (2016) Individual differences in categorical perception of speech: cue weighting and executive function. J Phon 59:40–57. https://doi.org/10.1016/j.wocn.2016.08.006
Kuang J, Cui A (2018) Relative cue weighting in production and perception of an ongoing sound change in Southern Yi. J Phon 71:194–214. https://doi.org/10.1016/j.wocn.2018.09.002
Labov W (1990) The intersection of sex and social class in the course of linguistic change. Lang Var Change 2:205–254. https://doi.org/10.1017/S0954394500000338
Liang J, Heuven VJ (2007) Chinese tone and intonation perceived by L1 and L2 listeners. In: Gussenhoven C, Riad T (eds) Experimental studies in word and sentence prosody, vol 2E. De Gruyter Mouton, Berlin, pp. 27–62
Lipski SC, Escudero P, Benders T (2012) Language experience modulates weighting of acoustic cues for vowel perception: an event-related potential study. Psychophysiology 49. https://doi.org/10.1111/j.1469-8986.2011.01347.x
Liu M, Chen Y, Schiller NO (2022) Context matters for tone and intonation processing in Mandarin. Lang Speech 65:52–72. https://doi.org/10.1177/0023830920986174
Ma JKY, Ciocca V, Whitehill TL (2008) Acoustic cues for the perception of intonation in Cantonese. Paper presented at the 9th Annual Conference of the International Speech Communication Association, Brisbane, 22–26 September 2008
Mann VA, Repp BH (1980) Influence of vocalic context on perception of the [∫]-[s] distinction. Percept Psychophys 28(3):213–228. https://doi.org/10.3758/BF03204377
Mayo C, Turk A (2004) Adult–child differences in acoustic cue weighting are influenced by segmental context: children are not always perceptually biased toward transitions. J Acoust Soc Am 115. https://doi.org/10.1121/1.1738838
Mayo C, Turk A (2005) The influence of spectral distinctiveness on acoustic cue weighting in children’s and adults’ speech perception. J Acoust Soc Am 118. https://doi.org/10.1121/1.1979451
Meng Y, Zhang J, Liu S, Wu C (2020) Influence of different acoustic cues in L1 lexical tone on the perception of L2 lexical stress using principal component analysis: an ERP study. Exp Brain Res 238:1489–1498. https://doi.org/10.1007/s00221-020-05823-w
Morrow K, Liu C (2013) Intonation perception in English: effects of stimulus amplitude and listeners’ language background. J Acoust Soc Am 133. https://doi.org/10.1121/1.4806314
Naul DR, Munhall KG (2020) Individual variability in auditory feedback processing: responses to real-time formant perturbations and their relation to perceptual acuity. J Acoust Soc Am 148:3709–3721
Niebuhr O (2007) Categorical perception in intonation: a matter of signal dynamics? In: Paper presented at the 8th Annual Conference of the International Speech Communication Association, Antwerp, 27–31 August 2007
Ortega-Llebaria M (2006) Phonetic cues to stress and accent in Spanish. Paper presented at the 2nd conference on laboratory approaches to Spanish phonetics and phonology, Somerville, MA, 17–19 September 2006
Ortega-Llebaria M, Gu H, Fa J (2013) English speakers’ perception of Spanish lexical stress: context-driven L2 stress perception. J Phon 41:186–197. https://doi.org/10.1016/j.wocn.2013.01.006
Ortega-Llebaria M, Nemogá M, Presson N (2017) Long-term experience with a tonal language shapes the perception of intonation in English words: how Chinese-English bilinguals perceive “Rose?” vs. “Rose”. Bilingualism 20:367–383. https://doi.org/10.1017/S1366728915000723
Ortega-Llebaria M, Prieto P (2011) Acoustic correlates of stress in Central Catalan and Castilian Spanish. Lang Speech 54:73–97
Ou J, Xiang M, Yu ACL (2023) Individual variability in subcortical neural encoding shapes phonetic cue weighting. Sci Rep 13. https://doi.org/10.1038/s41598-023-37212-y
Peng SC, Chatterjee M, Lu N (2012) Acoustic cue integration in speech intonation recognition with cochlear implants. Trends Amplif. 16:67–82. https://doi.org/10.1177/1084713812451159
Pfiffner AM (2020) Tonogenesis in Afrikaans: age and gender differences in cue weighting. Poster presented at the 17th Conference on Laboratory Phonology, Canada, 6–8 July 2020
Qin Z, Tremblay A, Zhang J (2019) Influence of within-category tonal information in the recognition of Mandarin-Chinese words by native and non-native listeners: an eye-tracking study. J Phon 73. https://doi.org/10.1016/j.wocn.2019.01.002
Repp BH (1982) Phonetic trading relations and context effects: new experimental evidence for a speech mode of perception. Psychol Bull 92:81–110. https://doi.org/10.1037/0033-2909.92.1.81
Roca I (2019) Spanish word stress. In: Goedemans R, Heinz J, van der Hulst H (eds) The study of word stress and accent: theories, methods and data. Cambridge University Press, Cambridge, pp. 256–292
Romera Barrios L, Fernández-Planas AM, Salcioli Guidi V et al. (2007) Una muestra del español de Barcelona en el marco AMPER. Estudios Fonética Exp 16:147–184
Schertz J, Carbonell K, Lotto AJ (2020) Language specificity in phonetic cue weighting: monolingual and bilingual perception of the stop voicing contrast in English and Spanish. Phonetica 77. https://doi.org/10.1159/000497278
Seddoh SA (2002) How discrete or independent are affective prosody and linguistic prosody? Aphasiology 16:683–692. https://doi.org/10.1093/scan/nst124
Shang P, Elvira-García W, Li X (2022) Cue weighting differences in perception of Spanish sentence types between native listeners of Chinese and Spanish. In: Proceedings of 11th International Conference of Speech Prosody, Lisboa, 23–26 May 2022
Shang P, Li Y, Liang Y (2024a) Unraveling the contributions of prosodic patterns and individual traits on cross-linguistic perception of Spanish sentence modality. PLoS ONE 19(2). https://doi.org/10.1371/journal.pone.0298708
Shang P, Roseano P, Elvira-García W (2024b) Dynamic multi-cue weighting in the perception of Spanish intonation: differences between tonal and non-tonal language listeners. J Phon 102. https://doi.org/10.1016/j.wocn.2023.101294
Shang P, Roseano P, Elvira-García W (2024c) Cross-linguistic perception of Spanish intonation by Chinese speakers: effects of linguistic experience and prosodic features. Porta Linguarum (42):127–145. https://doi.org/10.30827/portalin.vi42.27042
Souza KDH, Carlet A, Jułkowska IA et al. (2017) Vowel inventory size matters: assessing cue-weighting in L2 vowel perception. Ilha do Desterro 70. https://doi.org/10.5007/2175-8026.2017v70n3p33
Strange W (2009) Automatic selective perception (ASP) of first‐language (L1) and second‐language (L2) speech: a working model. J Acoust Soc Am 125. https://doi.org/10.1121/1.4784716
Strange W (2011) Automatic selective perception (ASP) of first and second language speech: a working model. J Phon 39:456–466. https://doi.org/10.1016/j.wocn.2010.09.001
Tillman G, Benders T, Brown SD et al. (2017) An evidence accumulation model of acoustic cue weighting in vowel perception. J Phon 61:1–12. https://doi.org/10.1016/j.wocn.2016.12.001
Toscano JC, Lansing CR (2019) Age-related changes in temporal and spectral cue weights in speech. Lang Speech 62:61–79. https://doi.org/10.1177/0023830917737112
Tremblay A, Broersma M, Coughlin CE (2018) The functional weight of a prosodic cue in the native language predicts the learning of speech segmentation in a second language. Bilingualism Lang Cogn 21:640–652. https://doi.org/10.1017/S136672891700030X
Tremblay A, Broersma M, Zeng Y et al. (2021) Dutch listeners’ perception of English lexical stress: a cue-weighting approach. J Acoust Soc Am 149:3703–3714. https://doi.org/10.1121/10.0005086
Van Heuven VJ, De Jonge M (2011) Spectral and temporal reduction as stress cues in Dutch. Phonetica 68. https://doi.org/10.1159/000329900
Villacorta VM, Perkell JS, Guenther FH (2007) Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception. J Acoust Soc Am 122:2306–2319. https://doi.org/10.1121/1.2773966
Wiener S (2017) Changes in early L2 cue-weighting of non-native speech: evidence from learners of Mandarin Chinese. Paper presented at the 18th Annual Conference of the International Speech Communication Association, Stockholm, 20–24 August 2017
Wiener S, Goss S (2019) Second and third language learners’ sensitivity to Japanese pitch accent is additive: an information-based model of pitch perception. Stud Second Lang Acquis 41:897–910. https://doi.org/10.1017/S0272263119000068
Xu Y (2004) Transmitting tone and intonation simultaneously: the parallel encoding and target approximation (PENTA) model. Paper presented at the International Symposium on Tonal Aspects of Languages: With Emphasis on Tone Languages, Beijing, 28–31 March 2004
Yu ACL (2022) Perceptual cue weighting is influenced by the listener’s gender and subjective evaluations of the speaker: the case of English stop voicing. Front Psychol 13. https://doi.org/10.3389/fpsyg.2022.840291
Yuan J (2004) Intonation in Mandarin Chinese: acoustics, perception, and computational modeling. Dissertation, Cornell University
Yuan J (2006) Mechanisms of question intonation in Mandarin. Paper presented at the 5th International Symposium on Chinese Spoken Language Processing, Singapore, 13–16 December 2006
Yuan J (2011) Perception of intonation in Mandarin Chinese. J Acoust Soc Am 130:4063–4069. https://doi.org/10.1121/1.3651818
Yuan J, Shih C (2004) Confusability of Chinese intonation. Paper presented at the International Conference on Speech Prosody, Nara, 23–26 March 2004
Zatorre RJ, Gandour JT (2008) Neural specializations for speech and pitch: moving beyond the dichotomies. Philos Trans R Soc B Biol Sci 363:1087–1104. https://doi.org/10.1098/rstb.2007.2161
Zhang H, Wiener S, Holt LL (2022) Adjustment of cue weighting in speech by speakers and listeners: evidence from amplitude and duration modifications of Mandarin Chinese tone. J Acoust Soc Am 151:992–1005. https://doi.org/10.1121/10.0009378
Zhang X (2012) A comparison of cue-weighting in the perception of prosodic phrase boundaries in English and Chinese. Dissertation, University of Michigan
Zhang Y, Francis A (2010) The weighting of vowel quality in native and non-native listeners’ perception of English lexical stress. J Phon 38. https://doi.org/10.1016/j.wocn.2009.11.002
Zipf GK (2016) Human behavior and the principle of least effort: an introduction to human ecology. Martino Fine Books, Connecticut
Acknowledgements
This research was funded by the Ministry of Education’s Youth Fund for Humanities and Social Sciences (Project No. 24YJC740058). We would like to thank all the Chinese and Spanish subjects who participated in this study.
Author information
Contributions
All authors jointly supervised this work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical statements
The research was conducted in accordance with the guidelines of the Declaration of Helsinki. Given that the data were collected anonymously online without physical harm to the participants, an exemption from ethical review was granted by the Phonetics Laboratory of the Universitat de Barcelona and the Ethical Committee of Beijing Institute of Technology (protocol code: BIT-EC-H-2023132).
Informed consent
Informed consent was obtained electronically from all study participants. They were informed of the objectives and potential implications of the research before enrolling in the experiment.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Shang, P., Wu, Y. The impact of multifaceted factors on auditory mapping between acoustic cues and Spanish intonation categories in a cross-linguistic context. Humanit Soc Sci Commun 11, 1701 (2024). https://doi.org/10.1057/s41599-024-04216-6
DOI: https://doi.org/10.1057/s41599-024-04216-6