Introduction

In speech perception, listeners routinely integrate acoustic cues from multiple dimensions to arrive at a refined understanding of the intended message (Holt and Lotto, 2006; Kim and Tremblay, 2022). A wealth of studies has shown that native speakers of varied languages such as English, German, Cantonese, and Korean can discern intonation contrasts using changes in fundamental frequency (F0), intensity, and duration, although their use of these cues may differ based on factors like listening conditions, stimulus types, and the speaker’s chronological age (Chang, 2013; Ma et al., 2008; Morrow and Liu, 2013; Niebuhr, 2007; Peng et al., 2012). A parallel cue integration process is observed in the perception of diverse linguistic categories including stop-voicing (Kong and Edwards, 2016; Schertz et al., 2020; Yu, 2022) and vowel contrasts (Lipski et al., 2012; Tillman et al., 2017) in English, alongside lexical stress (Zhang and Francis, 2010) and lexical tone (Chandrasekaran et al., 2010; Zhang et al., 2022) in Mandarin. Yet, despite this similarity in multi-cue integration, listeners from different languages tend to perceive the same acoustic stimuli in different ways, because the phonetic categories of their native language (L1) attune them to certain acoustic cues over others (Kim and Tremblay, 2022; Ou et al., 2023). To illustrate, previous studies have found significant differences in cue weights between native and non-native English speakers in stop-voicing (Dmitrieva, 2019), lexical tone (Wiener, 2017), and intonation (Feng et al., 2019; Shang et al., 2022, 2024a).
An influential theoretical framework that accounts for such cross-linguistic perceptual variances is the cue-weighting theory, which posits that the cue-weighting patterns that develop to differentiate speech contrasts in one’s L1 can extend to their perception of a second or foreign language (L2) (Holt and Lotto, 2006; Kim and Tremblay, 2020; Tremblay et al., 2018; Zhang and Francis, 2010).

Supporting this cue-weighting transfer hypothesis are numerous studies. Take, for example, the work by Souza et al. (2017), which determined that the size of the L1 vowel inventory can impact how listeners discern L2 English vowels. Notably, Danish learners of English, when compared to those from Catalan, Portuguese, and Russian backgrounds, demonstrated perceptions more closely aligned with L1 speakers (Souza et al., 2017). This occurrence is credited to the beneficial transfer of cue-weighting strategies from their expansive L1 vowel space, which sharpened their perception of the nuanced spectral differences inherent in L2 vowels. In a similar vein, existing research underscores that the perceptual processing of English lexical stress by L2 listeners hinges largely on the relative weighting of acoustic cues to lexical contrasts in their L1 (Connell et al., 2018; Cooper et al., 2002; Cutler et al., 2007; Kim and Tremblay, 2021; Tremblay et al., 2021; Van Heuven and De Jonge, 2011; Zhang and Francis, 2010). To illustrate, while English and Mandarin listeners displayed a greater reliance on F0 in perceiving English lexical stress, Russian listeners had a more pronounced dependence on duration, relegating F0 to a secondary position (Chrabaszcz et al., 2014). Such a propensity can likely be attributed to the marked relevance of duration cues in demarcating stress contrasts within Russian.

Subsequent research has examined the scope of cue-weighting strategies, illustrating their transferability from L1 to L2 perception, even across different types of speech contrasts (Choi, 2022; Choi et al., 2019; Kim and Tremblay, 2021; Qin et al., 2019; Wiener and Goss, 2019). For instance, Seoul Korean listeners exhibited enhanced acuity in discerning intonationally cued lexical stress in L2 English when compared to French listeners. This heightened proficiency is thought to arise from the positive transfer of tonal cues, which are crucial in differentiating the three-way laryngeal stop contrasts inherent to Seoul Korean (Kim and Tremblay, 2022). Based on these observations, it is reasonable to postulate that the reliance placed on certain acoustic cues in L1 categories conditions the degree to which listeners utilize and depend on those dimensions when processing the L2.

Regarding intonation perception, the working model of automatic selective perception (ASP) posits that particular acoustic dimensions, primarily the F0, when consistently and reliably associated with intonation categorization, generally carry greater weight than secondary cues like duration and intensity, in auditory decision-making (Strange, 2009, 2011; Ortega-Llebaria et al., 2017). For instance, research focusing on Spanish intonation has highlighted the pivotal role of the final F0 movement, distinguishing it as the strongest intonational cue for perceiving question-statement contrast that can even override any preceding conflicting cues (Face, 2007; Shang et al., 2024b, 2024c). Conversely, for tonal languages like Mandarin Chinese, the F0 contour undertakes diverse linguistic functions, spanning both lexical tone and sentence-level intonation. In this context, listeners predominantly perceive intonation through global F0 trends, relegating surface F0 shapes to a more confined role of indicating lexical tone identity (Chen, 2022; Yuan, 2004). Given the strong informational emphasis on F0 in Chinese, it is often proposed that listeners of tonal languages might exhibit enhanced abilities in pitch perception tasks, displaying higher behavioral sensitivity to F0 cues compared to non-tonal language listeners. Studies affirm this, showing that Chinese listeners excel in perceiving Cantonese tones, are more sensitive to F0 shifts, and quickly spot F0 mismatches in English words, relative to non-tonal language speakers (Chang et al., 2017; Deroche et al., 2019; Ortega-Llebaria et al., 2017). However, the extent to which this F0 advantage applies to intonation perception, especially in L2, remains debated.
Some neurobehavioral research revealed differential neural encoding for pitch processed as tone and intonation, hinting that the observed F0 advantage may be specific to certain pitch events and not necessarily generalizable to all pitch perception tasks (Chien et al., 2020; Doherty et al., 2004; Gandour, 2009; Zatorre and Gandour, 2008). To truly grasp F0 perception among tonal and non-tonal language listeners, expansive studies probing varied pitch dimensions and linguistic pairings are indispensable.

Furthermore, while certain acoustic cues might inherently hold more significance than others during perception, their weights are not fixed. Listeners may often adapt cue weighting in response to fluctuating acoustic contexts, a phenomenon explored via theories like phonetic trading relation (Mann and Repp, 1980; Repp, 1982) and perceptual compensation (Kuang and Cui, 2018; Jiao and Xu, 2019). The degree of adaptation or compensation listeners apply in perceptual processing seems to depend on their sensitivity to the input acoustic dimensions aligned with intended phonetic categories (Hodgson and Miller, 1996). For example, English listeners, displaying nuanced perceptual differences in F1 discrimination tasks, showed greater compensations in response to acoustic perturbations in F1 vowel formants (Villacorta et al., 2007).

Despite the consistent perceptual patterns observed within the same linguistic community, individual variations remain a significant consideration for understanding the intricacies of perceptual landscapes in both L1 and L2 (Fuchs et al., 2015; Ou et al., 2023). Developmental studies using cross-sectional methods have captured shifts in cue weighting for L1 segmental contrasts from childhood to adulthood, demonstrating that children’s cue prioritization evolves towards an adult-like pattern over time (Mayo and Turk, 2004, 2005). Similar changes in cue-weighting strategies with increasing age have been observed in individual L2 learners. For instance, Kim et al. (2018) found that adult and child Korean learners of English exhibit changes in cue weighting over time and across different vowel contrasts. However, research exploring how perceptual weighting shifts across the adult age spectrum, particularly in L2, remains scarce. In a revelatory study on sound categorization across varied age groups, Toscano and Lansing (2019) discovered that older adults rely more heavily on onset F0 cues for L1 voicing judgments compared to younger adults. This finding suggests that age-related changes in perceptual strategies persist well into adulthood, at least for L1 perception. Nonetheless, it remains unclear whether similar age-related effects exist in the intonation perception of adult L2 learners, particularly in terms of cue-weighting strategies. Thus, further research in this area is warranted to bridge this gap in our understanding of L2 intonation processing across the lifespan.

Apart from age, gender may also play a role in shaping perceptual variations. Gender differences in speech perception may be attributed to a combination of physiological, sociolinguistic, and cognitive factors. For instance, Krizman et al. (2012) found that females exhibit stronger interhemispheric collaboration in speech processing, while Labov (1990) noted that women tend to use more standard language forms, potentially influencing their phonetic sensitivity. These gender-related differences have been observed in various aspects of speech perception. For example, earlier research has shed light on gender-related differences in how English prosodic phrase boundaries (Zhang, 2012) and Afrikaans tonogenesis (Pfiffner, 2020) are perceived. There is also evidence that the perceptual cue weighting of L1 English stop-voicing was modulated by the listener’s gender and their subjective evaluation of the talker (Yu, 2022). However, similar studies in L2 perception are relatively scarce. Despite various accounts of anatomical and functional differences between genders, Bryła-Cruz (2021) suggests that neither sex can be considered superior in L2 phonetic perception. Consequently, the role of gender in cross-linguistic intonation perception warrants further investigation.

Lastly, empirical studies on Chinese prosody show that question intonations ending with a rising tone (Tone2) are harder to identify than those with other tone endings (Liu et al., 2022; Yuan, 2006, 2011; Yuan and Shih, 2004). The root of this difficulty seems to be the overlap of Tone2 and question intonation in their primary encoding within the F0 dimension, paving the way for potential acoustic conflicts in perception. However, given that a single acoustic dimension can be used to encode multiple categories, such acoustic cue competition is not unique to the simultaneous processing of lexical tone and sentence intonation (Xu, 2004). Given this, and considering that in Spanish, both stress and intonation are consistently signaled by F0 and duration (Ortega-Llebaria, 2006; Ortega-Llebaria and Prieto, 2011; Ortega-Llebaria et al., 2013), a compelling question is whether the perceptual weighting of Spanish intonation changes when stress is processed concurrently, particularly when these linguistic categories coincide within the same phonetic unit. To address this question, our study examines the perception of oxytone and paroxytone words, with and without stress on the final syllable, respectively, to discern potential acoustic conflicts arising from stress processing during Spanish intonation perception.

Aims and hypotheses

Building upon earlier findings and observations, our study seeks to investigate the perception of Spanish intonation among native Spanish listeners and Chinese L2 learners within dynamic acoustic change contexts. To this end, we engaged the Spanish L1 and L2 participants in two experiments. In these experiments, we first synthesized pitch stimuli by gradually shifting the final F0 contour from falling to rising. On top of this base manipulation, we added manipulations of duration (Experiment 1) and intensity (Experiment 2) to those pitch stimuli. By evaluating listeners’ responses to these stimuli, we sought to address the following research questions (RQs):

RQ1: To what extent do changes in F0, duration, and intensity influence question-statement recognition among Spanish L1 and Chinese L2 listeners? Are there perceptual differences between listeners from different linguistic backgrounds?

RQ2: What are the potential relationships among various acoustic cues in the perceptual processing of Spanish intonation?

RQ3: Do individual differences in age and gender influence L1 and L2 listeners’ perception of Spanish statements and yes/no questions?

RQ4: To what degree do lower-level prosodic factors, such as the final stress pattern, influence listeners’ perception of Spanish intonation?

For RQ1, we first hypothesized that changes in F0 would significantly impact intonation perception across both L1 and L2 groups, based on the cue-weighting theory and the prominent role of F0 cues in Chinese and Spanish prosody (Chen, 2022; Face, 2007; Kim and Tremblay, 2020). We also expected to observe similar weighting or sensitivity to the intonation-related F0 cue between L1 and L2 listeners, because studies have found that pitch processed as intonation activated the same brain areas among tonal and non-tonal language speakers, regardless of their language-specific realization of intonation (Chien et al., 2020).

Additionally, previous production data show that yes/no questions in Spanish and Chinese are usually characterized by longer final durations than statements, although in Chinese this pattern is further modulated by the final tone type (Yuan, 2006; Romera Barrios et al., 2007). Thus, we postulated that changes in duration could influence the perception of sentence types by listeners of the two languages. Likewise, question intonation might be characterized by heightened intensity in both languages, but unlike F0 and duration that consistently cue intonation contrasts, intensity seems to be an optional cue that conveys less reliable information in Spanish prosody (Romera Barrios et al., 2007; Yuan, 2006; Yuan and Shih, 2004). Given this, we hypothesized only a marginal, if any, impact of this cue on intonation perception. We also expected to observe a lower sensitivity of Chinese learners to duration and intensity compared to Spanish listeners due to a negative transfer from their L1 cue-weighting patterns.

For RQ2, we expected to find a perceptual trade-off between the primary (F0) and secondary cues (duration/intensity) of intonation based on the principle of phonetic trading relation (Mann and Repp, 1980; Repp, 1982).

For RQ3, considering prior studies of individual differences in L1 and L2 perceptual processing (Bryła-Cruz, 2021; Kim et al., 2018; Toscano and Lansing, 2019; Yu, 2022), we anticipated potential age and gender effects on Spanish intonation judgments by Spanish and Chinese listeners, without predicting specific directional outcomes.

For RQ4, given the acoustic overlap between stress and intonation in oxytone words’ final syllable (Ortega-Llebaria, 2006; Ortega-Llebaria and Prieto, 2011; Ortega-Llebaria et al., 2013), we hypothesized that oxytone words with stress on the final syllable might be less likely to be identified as questions compared to paroxytone words.

Methodology

Participants

A total of 46 native speakers of Castilian Spanish (hereafter, SP) and 95 Chinese L2 learners (hereafter, CH) of Spanish participated in the study. To control for the effect of age of acquisition, we excluded data from two CH participants who started learning Spanish before the age of 16. The remaining CH participants, all born in mainland China, reported being predominantly exposed to Castilian Spanish during their language learning process. They began learning L2 Spanish, on average, at the age of 19.73 years (SD = 3.00). Moreover, to ensure the L2 participants had a sufficient understanding of L2 intonation, we excluded data from CH learners (N = 5) with an A1 or A2 proficiency level. As a result, the remaining CH learners had either an intermediate (B1), advanced (B2), or superior (C1) proficiency in Spanish. Thus, the final participant pool for Experiment 1 consisted of 39 SP listeners and 78 CH learners of Spanish, with ages ranging from 18 to 59 years (N = 117, 90 women, 28 men, Mage = 28.27, SD = 8.43). For Experiment 2, the participant pool comprised 33 SP listeners and 77 CH learners, with ages ranging from 19 to 58 years (N = 110, 84 women, 26 men, Mage = 28.02, SD = 8.46). No participants reported any history of hearing or communication disorders at the time of testing. The socio-demographic information and language immersion conditions of the analyzed participants are presented in Table 1.

Table 1 Socio-demographic characteristics and language immersion status of participants.

Stimulus synthesis

In Peninsular Spanish, the intonation of yes/no questions and statements differs in both the prenucleus and nucleus. To isolate the effects of the intonation nucleus, single-word utterances with one stressed syllable were used as our base stimuli, thereby excluding any potential prenucleus influences. The materials comprised two trisyllabic Spanish words: “Sevilla” and “Alcalá”, with penultimate and final stress, respectively. To ensure the ecological validity of our study, we employed a Discourse Completion Task (DCT) to elicit the declarative forms of these words from a female Spanish speaker in Barcelona. The DCT approach allowed us to elicit the production of broad focus statements, ensuring that our base stimuli were produced in a natural and contextually relevant manner.

The final syllable of each stimulus word was selected for acoustic manipulation. Specifically, the F0 contour of the utterance-final syllable was replaced with a multi-step continuum using Praat (Boersma and Weenink, 2020). To achieve this, the original stylized F0 contour was first defined by two anchor points, at vowel onset (A1) and offset (A2), between which values were determined through interpolation (see Fig. 1). The anchor point A1 was situated at the beginning of the final vowel, preserving the original F0 height. The anchor point A2 was positioned at the ultimate glottal pulse visible in the spectrogram, with a 3 Hz divergence from A1. The A2 continuum comprised eleven steps spaced 20 Hz apart (one below A1, one level with A1, and nine above A1), spanning 200 Hz in total. Crossing these 11 steps with the two stress patterns generated 22 F0 contours for the synthesized final syllable, which then served as the basis for subsequent manipulations of duration and intensity.

Fig. 1: Schematic illustration of F0 manipulation in the oxytone word “Alcalá”.

The start point A1 was 196 Hz. The endpoint A2 was manipulated from 176 Hz to 376 Hz, with a 20 Hz step size. The original duration of the final vowel (185 ms) was set as the medium duration for the stimulus.
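The continuum described above can be reproduced numerically. The following is a minimal sketch, assuming linear interpolation between the two anchors and an arbitrary number of sample points per contour (the text specifies only "interpolation"); the function and variable names are illustrative and do not correspond to the Praat script used in the study.

```python
import numpy as np

A1_HZ = 196.0    # onset anchor for "Alcalá", from the Fig. 1 caption
STEP_HZ = 20.0   # step size of the A2 continuum
N_STEPS = 11     # number of continuum steps

# Offset anchor (A2) values: 176, 196, ..., 376 Hz.
a2_values = (A1_HZ - STEP_HZ) + STEP_HZ * np.arange(N_STEPS)


def stylized_contour(a1_hz, a2_hz, n_points=10):
    """Interpolate F0 between the two anchors.

    Linear interpolation and n_points are assumptions made for illustration.
    """
    return np.linspace(a1_hz, a2_hz, n_points)


contours = [stylized_contour(A1_HZ, a2) for a2 in a2_values]

# Classify each step relative to the onset anchor A1.
n_falling = int(np.sum(a2_values < A1_HZ))
n_level = int(np.sum(a2_values == A1_HZ))
n_rising = int(np.sum(a2_values > A1_HZ))

print(a2_values)
print(n_falling, n_level, n_rising)
```

With these values the continuum contains one falling, one level, and nine rising contours, and its endpoints are 200 Hz apart.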

For Experiment 1, duration modifications were applied to the 22 synthesized F0 contours. Three duration conditions (short, medium [original], and long) were generated by compressing or expanding the vowel nucleus duration of the final syllable while setting segment boundaries at zero crossings to avoid spectral discontinuities. Long-duration stimuli were created by extracting 50 ms of periodic cycles from the center of the original vowel nucleus. This segment was then appended after the 5th glottal cycle of the last vowel. The increased value was selected based on prior research showing that final vowel durations were approximately 40–70 ms longer in yes/no questions compared to statements in Spanish (Romera Barrios et al., 2007). The short-duration stimuli were generated by excising 40 ms of periodic cycles from the same region of the final vowel nucleus. This decrease was defined by referencing the shortest comparable statement production by the speaker. Crossing the 2 stress patterns, 11 F0 steps, and 3 duration levels thus resulted in 66 stimuli for Experiment 1. Schematic representations of the duration manipulation are provided in Supplementary Fig. S1.
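The factorial structure of the Experiment 1 stimulus set can be enumerated as follows. This is a sketch for illustration only: the labels are hypothetical, and the nominal vowel durations assume the 185 ms original final vowel reported for “Alcalá” in the Fig. 1 caption (the corresponding value for “Sevilla” is not given in the text).

```python
from itertools import product

STRESS_PATTERNS = ("paroxytone", "oxytone")  # "Sevilla", "Alcalá"
F0_STEPS = tuple(range(1, 12))               # 11 continuum steps
# Change in final-vowel duration relative to the original, in ms.
DURATION_SHIFT_MS = {"short": -40, "medium": 0, "long": 50}

# Full crossing of the three factors.
stimuli = list(product(STRESS_PATTERNS, F0_STEPS, DURATION_SHIFT_MS))

# Nominal final-vowel durations for "Alcalá" (185 ms original, Fig. 1 caption).
ORIGINAL_VOWEL_MS = 185
nominal_duration_ms = {
    level: ORIGINAL_VOWEL_MS + shift
    for level, shift in DURATION_SHIFT_MS.items()
}

print(len(stimuli))
print(nominal_duration_ms)
```

The crossing yields 66 stimuli, and the nominal durations for the oxytone word are 145, 185, and 235 ms.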

For Experiment 2, the intensity was manipulated using the Constant Amplification function in Cool Edit Pro 2.1 on the 22 F0 contours. Three intensity conditions were synthesized by applying a −7, 0, or +7 dB modification to the final syllable relative to the normalized non-final syllable intensity of 70 dB. Thus, the low, original, and high intensities were set at 63 dB, 70 dB, and 77 dB. By crossing the 2 stress patterns, 11 F0 steps, and 3 intensity levels, a total of 66 stimuli were generated for Experiment 2. A schematic representation of the intensity manipulation can be found in Supplementary Fig. S2.
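To give a sense of the size of these intensity steps, the sketch below converts each dB change into a linear amplitude scaling factor using the standard definition of the decibel for amplitude ratios. It does not reproduce the internals of Cool Edit Pro; the function and variable names are illustrative.

```python
BASE_LEVEL_DB = 70                                   # normalized non-final syllable level
GAIN_DB = {"low": -7, "original": 0, "high": 7}      # change applied to the final syllable


def db_to_amplitude_ratio(gain_db):
    """Convert a gain in dB to a linear amplitude ratio: 10 ** (dB / 20)."""
    return 10 ** (gain_db / 20)


final_level_db = {name: BASE_LEVEL_DB + g for name, g in GAIN_DB.items()}
amplitude_ratio = {name: db_to_amplitude_ratio(g) for name, g in GAIN_DB.items()}

print(final_level_db)
print(amplitude_ratio)
```

A ±7 dB change corresponds to scaling the waveform amplitude by roughly 2.24 or 0.45, respectively.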

Procedure

Data were collected via an online survey platform (https://www.alchemer.com/). The survey comprised three sections. The first elicited participants’ demographic and linguistic background details. The second and third sections contained the auditory stimuli for Experiments 1 and 2, respectively. The stimulus audios can be found in Supplementary Audios S1 and S2. The text of each stimulus was displayed without punctuation. Participants could complete one or both experiments based on interest. Participants were instructed to use headphones in a quiet environment. A practice trial preceded the experiments to familiarize participants with the procedure. Responses were recorded on a 5-point Likert scale to capture graded perceptual shifts. For each stimulus, participants selected one of five descriptions of the intonation: “statement,” “more statement than question,” “either statement or question,” “more question than statement,” and “question.”

Statistics

Given the ordinal nature of the 5-point response scale, ordered generalized linear models (OGLM) were employed to allow flexibility in relaxing parallel lines assumptions when violated (Abrudan et al., 2020). This involved estimating separate coefficients across five outcome levels. To achieve this, two independent OGLM models were fit for Experiments 1 (model 1) and 2 (model 2) using the oglmx R package (Carroll, 2020). To address our RQs, we incorporated several interaction terms in our statistical models. For RQ1, we included Language Group × F0 change/Duration in model 1 and Language Group × F0 change/Intensity in model 2. RQ2 was examined through F0 change × Duration in model 1 and F0 change × Intensity in model 2. For RQ3, we added Language Group × Age and Language Group × Gender interactions in both models. Finally, to address RQ4, we incorporated Stress Type (paroxytone vs. oxytone) and its potential interactions with other factors in both models. Additionally, Stimulus Order was included in each model, and along with the Age factor, was z-transformed and mean-centered prior to inclusion.
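The z-transformation applied to the continuous predictors can be sketched as follows. The ages shown are hypothetical placeholder values, not data from the study, and the use of the sample standard deviation (ddof = 1) is an assumption made here for illustration.

```python
import numpy as np


def z_transform(values):
    """Center on the mean and scale by the sample standard deviation (ddof=1, assumed)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)


# Hypothetical ages for illustration only.
example_ages = [18, 22, 25, 31, 40, 59]
age_z = z_transform(example_ages)

print(age_z.round(3))
```

After the transformation the predictor has a mean of zero and unit standard deviation, so its coefficient expresses the effect of a one-standard-deviation change.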

Results

Results of Experiment 1

The model fitted for Experiment 1 yields four threshold parameters, representing the cut-points between the five ordered response levels. These thresholds and their corresponding statistics are presented in Table 2. The results indicate that while the first threshold was not statistically significant, the subsequent thresholds were all highly significant. This suggests that the model effectively distinguishes between the “more statement than question,” “either statement or question,” “more question than statement,” and “question” categories, but may not clearly differentiate between the “statement” and “more statement than question” categories. The increasing values of the thresholds (0.097 < 1.813 < 2.889 < 5.088) are consistent with the ordinal nature of the response variable, and the largest gap, between the third and fourth thresholds, suggests a more pronounced distinction between the two highest response levels.
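The role of the thresholds can be illustrated by converting a value of the linear predictor into probabilities for the five response levels. The sketch below is a simplification under stated assumptions: it uses a logit link (the link function is not reported in the text) and a single linear predictor shared by all levels, so it does not reproduce the level-specific coefficients of the fitted OGLM. The value of `eta` is a free illustrative input.

```python
import math

# Threshold estimates for Experiment 1, as reported in the text.
THRESHOLDS = (0.097, 1.813, 2.889, 5.088)


def logistic_cdf(x):
    """Standard logistic CDF (logit link; an assumption, the link is not reported)."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probabilities(eta, thresholds=THRESHOLDS, cdf=logistic_cdf):
    """Return P(Y = k) for k = 1..5 using P(Y <= k) = cdf(threshold_k - eta)."""
    cumulative = [cdf(t - eta) for t in thresholds] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Distances between adjacent thresholds.
gaps = [upper - lower for lower, upper in zip(THRESHOLDS, THRESHOLDS[1:])]

baseline = category_probabilities(eta=0.0)
shifted = category_probabilities(eta=2.0)
print([round(p, 3) for p in baseline])
print([round(p, 3) for p in shifted])
print([round(g, 3) for g in gaps])
```

Raising the linear predictor moves probability mass toward the “question” end of the scale. The gaps between adjacent thresholds are 1.716, 1.076, and 2.199.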

Table 2 Threshold parameters of the OGLM model fitted for Experiment 1.

Overall, the statistical analysis revealed that listeners’ perception of intonation was significantly influenced by several factors. Four two-way interactions were significant: language group × F0 change [χ2(1) = 23.54, p < 0.001], language group × duration [χ2(2) = 27.39, p < 0.001], F0 change × duration [χ2(2) = 16.97, p < 0.001], and language group × age [χ2(1) = 26.04, p < 0.001]. Stress type [χ2(1) = 63.31, p < 0.001] and stimulus order [χ2(1) = 7.48, p < 0.01] were also significant predictors. However, the interaction between gender and language group was not statistically significant [χ2(1) = 0.01, p = 0.93]. Given that changes in the predictors affect the response categories differently, we discuss these relationships in terms of their marginal effects on each response level. The comprehensive statistical details on the marginal effects are provided in Supplementary Table S1.

In Table 3, the main effect of F0 indicates that the baseline group (CH group) consistently associates rising pitch with higher levels of question perception. The interaction of L1SP × F0 further reveals that the SP group had stronger F0 effects compared to the CH group. To elucidate the F0 sensitivity of the SP group, we calculated their total effects at each response level. As shown in Table 4, the SP group showed approximately 11% higher sensitivity across all response levels while maintaining the same effect direction as the baseline group. This consistent pattern supports our first hypothesis that language background influences listeners’ utilization of F0 cues for question-statement identification. Specifically, SP listeners may have developed a more fine-grained perceptual mechanism for intonational F0 cues due to their L1 experience.

Table 3 Summary of marginal effects from the OGLM model for Experiment 1 (estimated coefficients with significance levels. ***p < 0.001; **p < 0.01; *p < 0.05; p < 0.1).
Table 4 Comparison of F0 sensitivity between CH and SP groups across all response levels.

The main effect of L1 reveals a distinct perceptual bias between language groups. SP listeners exhibited a propensity to categorize sentences as statements, while CH listeners were more inclined to perceive the same utterances as questions. This finding underscores the influence of linguistic background on prosodic interpretation.

The interaction between L1 and duration provided further insights into the differential effects of stimulus length on perception. For SP listeners, a significant positive interaction was observed between L1 and duration, particularly for the long-duration condition. This interaction manifested as a substantial improvement in question identification as the stimulus duration increased from short to long. Conversely, CH listeners demonstrated a different pattern of responses across duration conditions. The main effects of duration for CH listeners, serving as the baseline group, were non-significant for both original and long durations. This lack of significance indicates that CH listeners were not sensitive to duration changes in Spanish intonation.

While F0 is a strong predictor of sentence type, its effect is modulated by the duration. Specifically, Table 3 shows that the pitch identification curve was markedly steeper in the long-duration level than in the short-duration level. This implies that an alteration in F0 had a more pronounced effect on question recognition when the final syllable was lengthened. In other words, both groups could utilize a lower F0 contour to identify a question when the final syllable duration was extended (see Fig. 2). Conversely, a higher F0 contour was required to identify a question when the duration was short. Additionally, the overall changes in the slope of the identification curve between the two language groups (see Fig. 2) indicate that SP listeners, being more sensitive to F0 and duration cues, made greater compensations for changes in duration than CH listeners.

Fig. 2: Marginal effects of duration on each language group while varying the final F0 contour in Experiment 1.

The lines colored in blue, yellow, and gray represent the short, medium, and long-duration conditions, respectively.

Table 3 presents the main effect of Age, which corresponds to the age-related changes observed in the baseline group. The results indicate that with increasing age, CH listeners exhibited a greater propensity to perceive utterances as questions (levels 4–5). For the SP group, the age effects were derived by combining the main age effects with the L1SP × Age interaction effects, yielding the following values across the five response levels: −0.00822, −0.02000, −0.00847, 0.02375, and 0.01295. These combined effects reveal a milder age-related trend for SP listeners compared to their CH counterparts, particularly at levels 4 and 5. Finally, Table 3 reveals a significant effect of stress type on intonation perception. Words with penultimate stress were more likely to be perceived as questions compared to those with final stress. The underlying explanations for these observations will be elaborated upon in section “Discussion”.
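One property of these combined effects can be checked directly. Because the five response probabilities must sum to one, the marginal effects of a predictor on those probabilities sum to zero across levels, and the same holds for a sum of a main effect and an interaction effect. The sketch below applies this check to the SP values reported above; the variable names are illustrative.

```python
# Combined age effects for the SP group (main effect + L1SP × Age), levels 1-5,
# as reported in the text.
sp_age_effects = (-0.00822, -0.02000, -0.00847, 0.02375, 0.01295)

# Marginal effects on category probabilities should sum to (approximately) zero.
total = sum(sp_age_effects)

# Net shift toward the question-like levels (4 and 5).
question_shift = sp_age_effects[3] + sp_age_effects[4]

print(round(total, 5))
print(round(question_shift, 5))
```

The reported values sum to zero within rounding error, and the net shift toward levels 4 and 5 is about 0.037 per standard deviation of age.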

Results of Experiment 2

Table 5 presents the threshold parameters of the model fitted for Experiment 2. All four threshold parameters were statistically significant, indicating clear distinctions between the five response levels. The negative value of the first threshold indicates a slight bias towards perceiving stimuli as statements, while the notably high value of the final threshold suggests that very strong cues were needed for participants to categorize a stimulus as a definite question.

Table 5 Threshold parameters of the OGLM model fitted for Experiment 2.

The output of the model revealed that listeners’ perception of intonation was significantly influenced by several factors. Five two-way interactions emerged as particularly significant: language group × F0 change [χ2(1) = 33.54, p < 0.001], language group × intensity [χ2(2) = 6.55, p < 0.05], F0 change × intensity [χ2(2) = 12.72, p < 0.01], intensity × stress type [χ2(2) = 15.06, p < 0.0001], and language group × age [χ2(1) = 28.56, p < 0.0001]. However, the stimulus order [χ2(1) = 0.14, p = 0.71] and the interaction between gender and language group did not yield statistically significant results for the prediction [χ2(1) = 2.37, p = 0.12]. Statistical details on the marginal effects of Experiment 2 are provided in Supplementary Table S2.

Table 6 demonstrates that the effect of F0 on intonation recognition aligns with the results in Experiment 1. As depicted in Fig. 3, there is an increased likelihood of question-like responses as the final F0 contour rises. The negative margins noted for the SP group at levels 4 and 5 imply that CH listeners had a markedly higher tendency to perceive single-word sentences as yes/no questions compared to the SP listeners. Regarding the interaction of F0 × L1SP, consistent with Experiment 1, we found that SP listeners exhibited greater sensitivity than CH learners to linear F0 changes perceived as intonation.

Table 6 Summary of the marginal effects from the OGLM model for Experiment 2 (estimated coefficients with significance levels. ***p < 0.001; **p < 0.01; *p < 0.05; p < 0.1).
Fig. 3: Effect displays between F0 change, language group, and intensity for each response level in Experiment 2.

The five lines in different colors represent five different response categories in the perceptual identification task, respectively.

In line with the F0 × duration interaction, a significant perceptual trade-off between F0 and intensity was identified in Experiment 2. Specifically, Table 6 delineates the positive coefficients of F0 × 77 dB at levels 4 and 5, indicating that the slope of the identification curves as a function of F0 was notably steeper at 77 dB compared to 63 dB. This suggests that listeners required fewer F0 cues for question recognition when the intensity escalated to the maximum level (77 dB). This observation substantiates the existence of phonetic trading relations and implies a counter-directional compensatory mechanism employed by listeners in response to the amplification or diminution of acoustic cues.

Table 6 also revealed significant effects of intensity on intonation perception, with notable differences between L1 groups. For CH listeners, intensity increases from the 63 dB baseline yielded consistent and significant effects, particularly at 77 dB. Specifically, as the intensity increased from 63 dB to 70 dB/77 dB, CH listeners showed a decreased probability of perceiving questions and an increased likelihood of perceiving statements. SP listeners exhibited a nuanced response pattern to intensity variations. As intensity increased, they demonstrated an increased propensity to perceive sentences as questions. However, this perceptual shift was characterized by non-linearity across the intensity spectrum. The most pronounced changes were observed at 70 dB, suggesting a heightened sensitivity to moderate intensity increases. At 77 dB, while the trend toward question perception persisted, the magnitude of the effect was attenuated.

Furthermore, the analysis revealed significant effects of stress type on identification across language groups. The main effect directly reflects the CH group’s response pattern: Table 6 indicates that, for CH learners, penultimate-stressed words were less likely than final-stressed words to be interpreted as question-like sentences. In contrast, SP listeners showed a stronger tendency to categorize words with penultimate stress as “more question than statement” or “question”. In addition, the interaction between intensity and stress type shows that intensity significantly modulated the effect of stress on intonation perception. Notably, at higher intensity levels, particularly at 77 dB, words with penultimate stress were more likely to be interpreted as questions than at the baseline level (63 dB).

Consistent with Experiment 1, the main effect of age indicates that with increasing age, CH learners exhibited a subtle yet significant increase in perceiving utterances as questions (levels 4–5). For the SP group, the age effects were derived by combining the main age effects with the L1SP×Age interaction effects, yielding the following values across the five levels: −0.02080, −0.04824, −0.02884, −0.05729, and 0.03517. The negative effect at level 4 and the slightly reduced positive effect at level 5 suggest that the impact of aging on intonation perception is less pronounced in the SP group compared to CH learners.
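
One way to read this derivation, assuming that “combining” means a simple sum of the two marginal effects at each response level k, is the following; the notation is introduced here for illustration only and does not appear in the model output:

$$\mathrm{ME}^{\mathrm{age}}_{\mathrm{SP},k}=\mathrm{ME}^{\mathrm{age}}_{k}+\mathrm{ME}^{\mathrm{L1SP}\times \mathrm{age}}_{k},\qquad k=1,\ldots ,5$$

Under this reading, the main age effect on its own describes the CH (reference) group, and the interaction term is the adjustment that yields the SP group’s age effect.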

Estimation of model performance

The two OGLM models were evaluated with the log loss metric, a standard criterion for assessing the quality of probabilistic predictions, particularly in multi-class classification. The log loss values for the test dataset were computed with the mlogLoss function in the R ModelMetrics package (Hunt, 2020), yielding 1.0467 and 1.0286 for the models of Experiments 1 and 2, respectively. Generally, a useful model has a log loss lower than the baseline, often referred to as the “naive” or “dumb” log loss, which is computed by assuming a uniform distribution over the M response categories (Brown, 2020). In the present study, M = 5 (the five response levels), so each category has a probability of 20%. Consequently, the “naive” log loss for both experiments was 1.6094, as shown in Eq. (1):

$$\mathrm{log\ loss}=-\ln \left(\frac{1}{M}\right)=-\ln \left(\frac{1}{5}\right)=\ln \left(5\right)=1.6094$$
(1)

However, the empirical data revealed a non-uniform distribution of the outcome variable’s classes, so we also computed a non-informative log loss that reflects the observed distribution of listeners’ responses. This was done by setting the probability of each class to its observed proportion. For instance, in Experiment 1, the response proportions for the categories “statement,” “more statement than question,” “either statement or question,” “more question than statement,” and “question” were 21.77%, 16.49%, 10.88%, 24.13%, and 26.74%, respectively. Therefore, the non-informative log loss for the Experiment 1 model was computed as 1.5662, as detailed in Eq. (2). Similarly, the non-informative log loss for the Experiment 2 model was determined to be 1.6113. Overall, both models achieved substantially lower average log loss values (1.0467 and 1.0286, respectively) than the “naive” and non-informative benchmarks, indicating higher predictive accuracy.

$$\begin{array}{l}\mathrm{log\ loss}=-\left(0.2177\ln \left(0.2177\right)+0.1649\ln \left(0.1649\right)+0.1088\ln \left(0.1088\right)\right.\\\left.\qquad\qquad\qquad+0.2413\ln \left(0.2413\right)+0.2674\ln \left(0.2674\right)\right)=1.5662\end{array}$$
(2)
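
The two benchmark values can be checked with a few lines of code. The sketch below is a hand-written Python analogue for illustration, not the R `mlogLoss` routine used in the study. The function names are hypothetical, and the only data it uses are the Experiment 1 response proportions reported above.

```python
import math


def uniform_log_loss(num_classes):
    """Log loss of a predictor that assigns probability 1/M to each of M classes."""
    return -math.log(1.0 / num_classes)


def prior_log_loss(proportions):
    """Expected log loss of a predictor that always outputs the observed class
    proportions (equivalently, the entropy of that distribution, in nats)."""
    return -sum(p * math.log(p) for p in proportions if p > 0)


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log probability assigned to the true class.

    true_labels: 0-based class indices; predicted_probs: one probability row per item.
    """
    total = 0.0
    for label, row in zip(true_labels, predicted_probs):
        p = min(max(row[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(true_labels)


# Response proportions reported for Experiment 1 ("statement" ... "question").
exp1_proportions = [0.2177, 0.1649, 0.1088, 0.2413, 0.2674]

naive = uniform_log_loss(5)
non_informative = prior_log_loss(exp1_proportions)

print(f"naive log loss: {naive:.4f}")                              # 1.6094
print(f"non-informative log loss (Exp. 1): {non_informative:.4f}")  # 1.5662
```

A fitted model is informative to the extent that its test-set `multiclass_log_loss` falls below both of these reference values.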

Discussion

This study delved into the perception of acoustic cues inherent to Spanish intonation among native Spanish L1 listeners and Chinese L2 learners. With respect to RQ1, our findings partially corroborated the hypothesized differences in intonation cue weighting between the two language groups. Specifically, changes in the final F0 contour significantly influenced intonation categorization for both L1 and L2 listeners, underscoring the importance of the F0 cue. However, contrary to our initial assumption for RQ1, Spanish L1 listeners showed greater sensitivity to F0 modulations in intonation processing compared to Chinese learners. We theorize that the elevated F0 sensitivity shown by Spanish L1 listeners could be ascribed to their innate familiarity and adeptness with native pitch patterns, which empowers them to more accurately and quickly identify intonation contrasts based on F0 linear transitions. Conversely, Chinese learners might possess restricted L2 experience in processing F0 signals in alignment with language-specific and well-defined intonation categories, culminating in a less steep slope for question-statement identification in Spanish. Additionally, the functional view posits that if certain phonetic cues are harnessed in one grammatical dimension, they will not be employed to a comparable extent in another phonological domain (Seddoh, 2002; Gandour et al., 1995). Consistent with this notion, several studies have highlighted that the inclination of tonal language listeners to primarily perceive F0 information related to word meanings (i.e., lexical tone) is a crucial factor in their reduced sensitivity to F0 cues processed as sentence intonation (Chen, 2005; Liang and Heuven, 2007). 
Adhering to this rationale and considering the parallel processing of stress and intonation in Spanish sentences, we propose that the increased effort by Chinese listeners to prioritize F0 cues for L2 Spanish stress—an essential component for word recognition—arising from the negative transfer of L1 prosodic realization, might also be a relevant factor influencing their diminished sensitivity to intonational F0 cues in Spanish.

Regarding the duration cue (RQ1), our findings largely aligned with our initial hypothesis, confirming that duration changes can significantly impact Spanish L1 listeners’ intonation perception. Long-duration patterns effectively increased their likelihood of question identification. This heightened sensitivity to temporal cues among Spanish L1 listeners suggests a greater reliance on duration as a prosodic marker for question-statement distinctions in their cue-weighting strategy. However, contrary to our expectations, we did not observe significant perceptual improvements in the Chinese group as the final stimulus duration increased. This unexpected result challenges our initial hypothesis and suggests a more complex relationship between L1 background and duration cues in cross-linguistic perception. The Chinese listeners’ relative insensitivity to duration changes suggests a preferential reliance on F0 contour as the primary cue for question-statement discrimination. This perceptual pattern likely stems from the tonal system inherent in their L1, where pitch variations play a crucial role in lexical and sentential distinctions. Such a prosodic transfer effect underscores the profound influence of L1 phonology on the L2 cue-weighting strategy.

Regarding the intensity cue (RQ1), our results partially corroborate the predictions for the Spanish group while revealing distinct perceptual patterns across language groups. Both Chinese L2 and Spanish L1 listeners demonstrated sensitivity to intensity variations, albeit with divergent response trajectories. Chinese L2 listeners exhibited a more pronounced and consistent shift towards statement perception as intensity increased, with the effect most salient at 77 dB. In contrast, Spanish L1 listeners showed an increased propensity to perceive sentences as questions with rising intensity, particularly at 70 dB. This differential response pattern suggests the existence of a language-specific intensity threshold, approximating 70 dB, at which native Spanish listeners display maximum sensitivity to intensity modulations. Notably, beyond this threshold, further intensity increments did not elicit proportional increases in response differentiation for the Spanish group. A potential explanation for the differences in secondary cue weighting between L1 and L2 listeners lies in their perceptual compensation capabilities (Feng et al., 2019). Since F0 contours synthesized in our study deviated from the intonation patterns of naturally spoken Spanish statements and questions (not being entirely linear at the end of the utterance), listeners probably increased the weight of other secondary cues to offset the loss of F0 information and bolster their perceptual decisions. However, native listeners might be better equipped to accurately compensate for acoustic changes than L2 listeners, given their extensive familiarity with the phonetic details of target intonation categories. 
Additionally, building on previous cue-weighting transfer research in cross-linguistic settings (Choi, 2022; Choi et al., 2019; Kim and Tremblay, 2021; Qin et al., 2019; Wiener and Goss, 2019), it is plausible that the reduced sensitivity of Chinese listeners to duration and intensity originates from a transfer of their L1 cue weighting. Considering that F0 plays a pivotal role in Chinese prosody, its speakers might not allocate as much attention to subtler variations in duration and intensity as speakers of non-tonal languages do when discerning an intonation category, especially in non-native listening contexts (Chang and Yao, 2007; Feng et al., 2019; Jiao and Xu, 2019).

With respect to RQ2, our data support our hypothesis by demonstrating a strong interaction between the effects of F0 and the duration/intensity cues in intonation perception. In particular, we found that acoustic attenuation in one cue dimension could be compensated by an increased contribution from another, such that the original percept is preserved. Such perceptual trade-offs have been documented for a range of phonetic contrasts, for example in ongoing sound changes in Southern Yi (Kuang and Cui, 2018) and in stop-consonant voicing contrasts in American English (Holt et al., 2001; Jacewicz et al., 2009). Our results also indicated that Spanish L1 listeners, with heightened sensitivity to cues under phonetic trading relations, were more adept at compensating for acoustic variations in Spanish intonation. This observation aligns with previous perceptual compensation research, which proposed that listeners’ auditory compensation depends on their sensitivity to co-varying cues that are consistently correlated with the recognition of a specific phonetic category (Villacorta et al., 2007; Naul and Munhall, 2020).

With respect to RQ3, our study partially supports the hypothesis that listeners’ perceptual strategies are influenced by certain individual differences. Crucially, in our study, we found that age emerged as a significant factor influencing both L1 and L2 listeners’ perceptual processing, whereas gender did not demonstrate a discernible impact on intonation perception. Specifically, older listeners were more likely to have question-like responses, and this age-related trend is more pronounced in the Chinese group. Building upon earlier findings (Kim et al., 2018; Mayo and Turk, 2005; Toscano and Lansing, 2019), we posit that the observed age-related effects in both L1 and L2 groups likely reflect a general age-related cognitive change. Age-related alterations in the auditory system may affect the perception of certain acoustic features, consequently influencing sentence-type judgments. The observed trend could also indicate a compensatory mechanism in response to cognitive aging. As auditory processing capabilities decline with age, perceiving utterances as questions may serve as a strategic adaptation to auditory input uncertainty. This compensatory strategy could: (a) minimize potential communication errors in ambiguous situations; (b) require fewer cognitive resources, which may be particularly beneficial as cognitive processing efficiency changes with age. This adaptive tendency reflects the flexibility of the human cognitive system in response to age-related changes. Its more pronounced manifestation in L2 listeners may be attributed to the additional cognitive demands of processing a non-native language. Conversely, the milder effect in the Spanish group suggests that while cognitive aging affects all individuals, its impact on native language processing may be less severe due to deeply ingrained L1 prosodic patterns. However, we acknowledge the limitations of our cross-sectional approach in fully exploring these hypotheses. 
To address this and provide more robust evidence for our cognitive aging hypothesis, future research should employ longitudinal studies. Such studies would enable tracking of intonation perception changes within individuals over time, offering a clearer picture of how aging affects prosodic processing in both L1 and L2 contexts.

Finally, with respect to RQ4, our findings confirm the initial assumption that the perceptual processing of Spanish intonation was influenced by the stress pattern. Specifically, in Experiment 1, we found that paroxytone words were more likely to be recognized as questions than oxytone words under identical acoustic conditions. This finding can be interpreted in light of the principle of least effort (Zipf, 2016), which posits that humans naturally prefer the least cognitively demanding course of action. Therefore, since paroxytone is the most frequent, unmarked stress pattern in Spanish (Defior and Serrano, 2017; Roca, 2019), it is logical that both L1 and L2 listeners more frequently categorized paroxytone words as yes/no questions. In contrast, the potential conflict in F0 encodings between stress and intonation on the last syllable of oxytone words may have complicated the processing of intonational F0 cues for listeners, particularly those in the L2 group, thereby reducing the likelihood of question responses. In Experiment 2, while Spanish L1 listeners exhibited a stress effect similar to that observed in Experiment 1, Chinese L2 listeners showed the opposite pattern. Although the underlying mechanisms for this difference are not yet fully understood, we hypothesize that the interaction between stress and intensity may be key to understanding this phenomenon. For L2 listeners, the challenge of integrating non-native stress with intensity may engender perceptual strategies that markedly differ from those employed by L1 listeners. This contrasting trend could indicate either an overcompensation or a fundamentally distinct approach to processing intensity modulations in their L2.

Moreover, this finding underscores the importance of considering language-specific and lower-level prosodic features (e.g., stress) when studying intonation perception across different linguistic groups. The interaction between stress and intensity at phrase boundaries may be particularly crucial in Spanish, a language with contrastive lexical stress, and may pose unique challenges for learners from tonal language backgrounds such as Chinese. Further research is needed to unravel this complex issue. Future studies should examine in detail the interaction between stress and intensity cues during intonation perception, specifically at the final boundary of Spanish questions.

Conclusion

By examining the perception of Spanish intonation, the present study revealed important cross-linguistic similarities and differences in the processing of acoustic cues across multiple input dimensions among listeners from tonal and non-tonal language backgrounds. Our findings on intonation cue weighting corroborate and extend previous research (Holt and Lotto, 2006; Peng et al., 2012; Feng et al., 2019; Meng et al., 2020), demonstrating that listeners with diverse linguistic backgrounds employ distinct strategies in utilizing acoustic information to identify the most representative exemplars of prosodic categories. Importantly, our study goes beyond confirming the influence of language background on cross-linguistic perception. We provide novel evidence that listeners’ auditory performance is modulated by additional factors, notably chronological age and lower-level prosodic features. This multifaceted approach reveals the complex interplay between linguistic experience, cognitive development, and acoustic properties in shaping Spanish intonation perception.

This study has several limitations that point to directions for future research. Firstly, although F0, duration, and intensity are the most salient intonational cues, recognizing question-statement contrasts is not confined to these acoustic properties. Other contextual variables, such as speaking rate and phonetic environment, may also influence the evaluation of intonation categories. Secondly, while our use of a Discourse Completion Task provided some contextual grounding for the auditory stimuli, the subsequent acoustic manipulations may have reduced real-world applicability. Future research could explore these perceptual weightings in more naturalistic settings, potentially bridging the gap between laboratory findings and practical applications in multilingual environments. Additional limitations include the imbalance between male and female listeners, which may have limited our ability to detect potential gender effects. While our study did not reveal significant gender differences, a more balanced sample in future investigations could uncover subtle gender-related variations in intonation processing. Finally, although our findings suggest an age effect on intonation perception, the cross-sectional nature of our study limits inferences about the developmental trajectory of these perceptual abilities. To address this, longitudinal studies are essential. Such research would allow for tracking changes in intonation perception across the lifespan, revealing how these skills evolve over time and how they are influenced by ongoing linguistic experiences and cognitive development.