Developmental features of multi-word expressions in spoken discourse by Chinese learners of English

Zhang, Huiping; Wang, Xingzuo

doi:10.1057/s41599-024-04206-8

Article
Open access
Published: 18 December 2024

Humanities and Social Sciences Communications volume 11, Article number: 1663 (2024) Cite this article

Abstract

Multi-word expressions (MWEs) serve as vital indicators of language development and have been a primary focus in second language acquisition research. The use of MWEs in spoken discourse, however, remains relatively under-explored. To address this gap, this study aims to investigate the current status and developmental features of MWE use in spoken English by Chinese learners of English across three proficiency levels based on Nattinger and DeCarrico’s classification framework of MWEs. A comparative analysis of MWE usage patterns across these groups yields two key findings: (1) The overall proficiency in using MWEs is relatively low, marked by an imbalance and inaccuracy in MWE types. Specifically, learners exhibit minimal use of “polywords” and “institutionalised expressions” within the structural dimension, and “discourse devices” and “social interactions” within the functional dimension. Moreover, learners demonstrate a high error rate across various MWE types. (2) Although the overall proficiency in using MWEs shows no significant improvement across the three levels, an upward trend is observed in the usage of overall tokens, types, and various categories of MWEs in both dimensions, culminating in a significant increase in the variation of MWEs at the highest proficiency level. Drawing upon these findings, this study proposes several pedagogical implications for enhancing the teaching and learning of MWEs in spoken discourse.

Introduction

The usage-based approach is a linguistic theory emphasising that language learning is accomplished through statistical learning of the frequency and distribution of linguistic input (Ellis, 2002). Its central tenet is that language acquisition arises from actual language use (Tyler, 2010), significantly influenced by frequency of exposure (Tyler and Ortega, 2018). Specifically, learners gradually acquire the structures and rules of a language by being exposed to a vast amount of linguistic input (VanPatten and Cadierno, 1993).

Multi-word expressions (MWEs) are frequently occurring word groups in language (Yi, 2018; Yi and Zhong, 2024), including collocations, phrasal verbs, idioms, etc., characterised by their high fixedness and conventionality (Ramisch, 2015). Research indicates that approximately 70% of daily communication consists of MWEs (Altenberg and Granger, 2001). This high frequency signifies not only the fixed nature of MWEs in language but also their automaticity in cognitive processing: frequently encountered structures are more readily internalised and retrieved for use (Gries and Ellis, 2015). In other words, frequency significantly influences the acquisition of MWEs and accelerates their processing (Sonbul, 2015). More specifically, the higher the frequency and semantic specificity of an MWE, the faster its processing speed and the more fluent the language production. Through repeated exposure to high-frequency MWEs, learners can consolidate them, forming strong mental representations. Conversely, low-frequency MWEs, due to limited exposure, struggle to establish robust mental representations. Consequently, after contextual activation, learners tend to prioritise the retrieval and utilisation of high-frequency MWEs based on the principle of least effort, while low-frequency MWEs become challenging to retrieve and produce automatically (Hu et al. 2020). Therefore, increasing exposure to MWEs within authentic contexts is crucial to enhance their processing speed and foster strong mental representations (Wolter and Yamashita, 2015).

Within language acquisition research, MWEs in spoken language have garnered considerable attention due to their demonstrable impact on cognitive processing and fluency. The storage and retrieval of MWEs as holistic units reduces the mental effort required for speech production, leading to smoother and more efficient articulation (Wray and Perkins, 2000). This lightening of cognitive load, as observed by Wood (2010), directly facilitates greater fluency. Moreover, as linguistic knowledge becomes proceduralised, speakers retrieve MWEs with increased speed, enabling longer stretches of uninterrupted speech and thus enhancing overall fluidity (Towell et al. 1996). This streamlined retrieval process further promotes cohesion by facilitating smoother transitions and increased coherence within spoken discourse (Qi and Xia, 2016). Empirical evidence from studies like Underwood et al. (2004) and Dahlmann and Adolphs (2007) indicates that MWEs contribute to uninterrupted speech flow by minimising pauses, further supporting their fluency-enhancing role. In demanding, high-speed communicative contexts such as sports commentary, MWEs are particularly prevalent, enabling speakers to maintain a natural and fluid expression (Kuiper, 2004). Furthermore, Strik et al. (2010) observed that MWEs often exhibit considerable pronunciation reduction, including frequent phoneme and syllable deletions. This characteristic distinguishes them from regular speech patterns and contributes to their rapid processing. Consequently, the efficient retrieval of MWEs enhances both fluency and accuracy in spoken communication (Yi and Zhong, 2024). However, L2 learners frequently demonstrate limited proficiency in using MWEs. Hu et al. (2020) reported a restricted variety in L2 learners’ MWE usage, while Martinez and Schmitt (2012) highlighted the frequent occurrence of inaccuracies. Therefore, enhancing MWE instruction in the context of spoken discourse is of paramount importance, carrying both significant theoretical and practical implications.

In recent years, studies on MWEs in the domain of second language acquisition have primarily focused on writing and reading (Cai, 2021; Yi et al. 2023; Yi and Zhong, 2024). For example, one study on the use of MWEs in English writing instruction analysed the limitations of traditional teaching methods, highlighting issues such as outdated pedagogical models and negative cultural transfer, which can hinder learners’ effective use of MWEs (Cai, 2021). This research underscored the benefits of MWEs in enhancing writing fluency, organising essay structure, and improving coherence and logic. To address instructional gaps, it proposed methods like diversifying MWE acquisition strategies, emphasising cultural connotations, and training students to recognise MWEs in context. Similarly, a meta-analysis on MWEs in reading examined their role in comprehension and memory, revealing significant positive effects. By synthesising findings across multiple empirical studies, the research systematically assessed the effectiveness of MWEs in reading tasks, emphasising how learners’ sensitivity to these expressions aids in overall comprehension and information processing (Yi and Zhong, 2024). Further exploring this area, research on phrase intuition demonstrated that both native and non-native speakers rely on lexical and phrasal statistical information to judge phrase frequency and association strength. Experimental findings showed that language users draw on various linguistic cues (such as word frequency, co-occurrence probability, and phonetic features) to assess MWE familiarity, supporting the importance of the usage-based approach in language acquisition and offering fresh insights into the application of MWEs in teaching (Yi et al. 2023).

However, research on MWEs in spoken discourse in the field of second language acquisition remains limited. Academic explorations in this area primarily fall into three categories. The first category examines the relationship between L2 learners’ use of MWEs in spoken discourse and their oral proficiency. Studies show that increased MWE usage often correlates with higher fluency and accuracy, signalling more advanced language skills (Qi and Xia, 2016). Furthermore, familiarity with MWEs enhances learners’ confidence and communicative effectiveness, emphasising their motivational impact (Rafieyan, 2018). Moreover, studies of student speech reveal clear patterns: learners use MWEs more frequently in speech than in writing and tend to select familiar expressions to fulfil specific communicative needs (Huang, 2018). The second category investigates the accuracy of MWE usage, with a focus on common errors made by learners. Research in this area identifies recurring inaccuracies that point to areas where learners may benefit from targeted practice and guidance (Hu et al. 2020). Additionally, comparisons in classroom discussions show functional differences in MWE application: native speakers primarily use MWEs for discourse organisation, while non-native speakers rely on them more to express their stance, highlighting varied communicative approaches (Kashiha and Chan, 2015). Within the third category, researchers have compared the similarities and differences in the usage of MWEs among learners of different grades and varying levels of second language proficiency, aiming to summarise the developmental features of MWEs in spoken discourse (Hou et al. 2018; Jiang et al. 2024; Tavakoli and Uchihara, 2020; Wang and Qian, 2009). For example, in a study focusing on advanced Chinese learners of English, findings indicated that more proficient learners not only used MWEs more frequently, particularly collocations, but also increased their usage significantly over time (Hou et al. 2018). Similarly, research on intermediate and advanced L2 learners shows a notable increase in MWE use with rising proficiency levels (Tavakoli and Uchihara, 2020). Meanwhile, comparisons between middle school and university students showed that university students used a wider variety of MWE types and demonstrated more context-appropriate usage, reflecting a developmental progression in MWE sophistication (Wang and Qian, 2009).

Progression in proficiency also appears to influence learners’ choice of MWE types. As Chinese learners of English advance, they increasingly adopt discourse-organising MWEs to manage complex narratives, whereas lower-level learners rely more on simpler, informational MWEs to convey basic messages (Qi and Xia, 2016). Corpus-based analyses further highlight this developmental shift, showing that lower-proficiency learners favour verb-based MWEs, which align with conversational discourse, while higher-proficiency learners are more inclined toward phrase-based MWEs typical of academic language. This trend underscores how expanding one’s repertoire of MWEs contributes to both fluency and accuracy in language production, marking a crucial aspect of overall language development (Jiang et al. 2024). Additionally, some researchers have tracked the developmental features of spoken MWEs in the same group of L2 learners, analysing their developmental patterns and changes over time (Qi and Ding, 2011; Vercellotti et al. 2021). The evolution of MWEs among Chinese learners shows that as proficiency increases, both the frequency and accuracy of MWE usage tend to improve, reflecting a higher level of familiarity and natural integration into the spoken language (Qi and Ding, 2011). In contrast, studies examining MWEs alongside lexical variety reveal a different trend, where MWE use initially rises and then declines, while lexical variety follows an inverse trajectory, first decreasing and then increasing over time, suggesting a negative relationship between the two (Vercellotti et al. 2021). Together, these findings underscore the diverse developmental patterns of MWE usage across proficiency levels.

Current research

Previous cross-sectional studies on MWEs in spoken discourse have primarily examined the frequency, types, and functional features of their use, with relatively little attention paid to errors in their usage. Moreover, these studies often involved a limited number of participants and primarily focused on intermediate and advanced English learners, neglecting beginners. As beginners are at the initial stage of the English learning continuum, exploring the developmental features of their MWE usage can contribute to a comprehensive understanding of second language acquisition. Additionally, spoken language, due to its spontaneous nature, can more authentically reflect learners’ language proficiency.

Therefore, based on The Spoken Corpus of Chinese Learners of English, this study investigated the commonalities and differences in the use of MWEs by Chinese learners of English, including beginners, across three learning stages: junior high school, senior high school, and university. The investigation was conducted from both structural and functional perspectives. The research aimed to identify the general patterns in the language acquisition process by analysing the common features of MWEs in spoken discourse used by learners at different stages. Simultaneously, it sought to compare the differences in the use of MWEs across various learning stages to uncover the developmental trends in MWEs.

This study addressed the following research questions:

(1) What are the common features of MWE usage in spoken discourse among Chinese learners of English at the junior high school, senior high school, and university levels in terms of MWE frequency proportion, error frequency, proportion of error frequencies, and error types?

(2) What are the distinctive features of MWE usage in spoken discourse among Chinese learners of English at the junior high school, senior high school, and university levels in terms of MWE total types and tokens, as well as structural and functional dimensions?

This study hypothesised that:

(1) Chinese learners of English across junior high school, senior high school, and university levels are expected to exhibit similar patterns in the use of MWEs in spoken discourse, characterised by a consistent MWE frequency proportion, error frequency, proportion of error frequencies, and error types.

(2) Significant differences are anticipated in the use of MWEs in spoken discourse among Chinese learners of English at junior high school, senior high school, and university levels, particularly in terms of MWE total types and tokens, as well as in the structural and functional dimensions of MWE usage.

Research methods

Classification framework for MWEs

Previous studies on MWEs have largely relied on the classification methods proposed by Biber et al. (1999) and Biber et al. (2004). These methods categorise MWEs based on their structural forms and pragmatic functions, further dividing them according to word class clustering and clause types (Hu et al. 2020; Ma, 2009). However, such classifications are more suitable for analysing written academic discourse and are less applicable to spoken language research. For example, in certain spoken tasks, learners primarily focus on describing events rather than expressing opinions, resulting in limited or no use of stance MWEs.

In contrast, the classification framework developed by Nattinger and DeCarrico (1992) aligns more closely with second language acquisition and teaching. It not only categorises MWEs based on their structural variability and continuity but also takes into account their functional meanings, thereby facilitating a comprehensive examination of learners’ acquisition of MWEs in spoken discourse. Therefore, this study adopted Nattinger and DeCarrico’s (1992) classification framework, summarised in Table 1.

Table 1 The taxonomy of MWEs by Nattinger and DeCarrico (1992).

Corpus description

The present study used The Spoken Corpus of Chinese Learners of English (SCCLE), which was compiled by the School of International Studies at Northeast Normal University. The corpus data was collected from 47 middle schools (including key urban middle schools, ordinary urban middle schools, and rural middle schools, covering grades 7 to 12) and 4 universities (English majors from year 1 to year 3) in Northeast China. The participants consisted of 110 junior high school students aged 12 to 15 years old, 107 senior high school students aged 15 to 18 years old, and 102 university students aged 18 to 21 years old. Table 2 provides a summary of the participant demographic information and corpus details across the three proficiency stages, including age, age of acquisition, average length of instruction, average proficiency scores, number of texts, and MWE types and tokens.

Table 2 Participant Demographic Information and Corpus Details Across Proficiency Stages.

The construction of the corpus strictly followed the standards for building spoken corpora. First, a feasibility study was conducted. Then, sampling standards and norms were established, followed by the collection of learners’ spoken language materials. During the recording process, learners’ background information (e.g., age, grade, school type, years of formal English education, spoken English practice methods, most recent English comprehensive test scores, etc.) was collected in the form of dialogue. This was followed by the collection of English spoken language materials. The corpus collection adopted a combination of free conversation and picture elicitation, and the duration was controlled within 20 minutes. No preparation was made by the students prior to the recording. Topics included after-school life, favourite food, pocket money, interesting stories, birthday parties, sports, watching TV, etc. All the recordings were subsequently named, sorted, and transcribed. The “Transcriber” software was used for audio transcription, and paralinguistic information (e.g., pauses, repetitions, stresses, unfinished utterances, code-switching, etc.) was annotated during the transcription process. After the transcription was completed, the text was converted into a corpus text format (TXT format) and marked with metadata (e.g., learner background information). Finally, manual proofreading was conducted to finalise the spoken corpus.

The corpus contains a total of 285,382 tokens and 500 files. According to the continuity of language proficiency development, this study divides learners into three learning stages: junior high school as stage one, senior high school as stage two, and university as stage three. For balanced comparisons, a subset of approximately 52,500 tokens was sampled from each stage, with totals of 52,487, 52,496, and 52,483 tokens for stages one, two, and three, respectively. This sampling was achieved by randomly selecting files from each group until the target token count was reached, ensuring a consistent number of tokens across stages. Additionally, files were selected to cover similar spoken discourse contexts within each stage, maintaining comparability in MWE usage across proficiency levels.
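The file-sampling step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the source does not say how the stopping rule handled a file that would overshoot the target, so the hypothetical function `sample_files_to_target` simply skips such files, and the file names and token counts are made up.

```python
import random


def sample_files_to_target(token_counts, target, seed=0):
    """Randomly pick files until the running token total approaches the target.

    token_counts: dict mapping a file name to its token count (hypothetical input).
    A file that would push the total past the target is skipped, so the
    returned total never exceeds the target (an assumption of this sketch).
    """
    rng = random.Random(seed)
    names = sorted(token_counts)  # sort first so the shuffle is reproducible
    rng.shuffle(names)
    chosen, total = [], 0
    for name in names:
        n = token_counts[name]
        if total + n <= target:
            chosen.append(name)
            total += n
        if total == target:
            break
    return chosen, total


# Illustrative data only: 200 made-up files of 400 to 700 tokens each.
demo_rng = random.Random(1)
demo_counts = {f"file_{i:03d}.txt": demo_rng.randint(400, 700) for i in range(200)}
chosen, total = sample_files_to_target(demo_counts, target=52_500)
print(len(chosen), total)
```

Running the same procedure once per stage would give three subsets of near-identical size, which is the property the balanced comparison relies on.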

Extraction and filtering of MWEs

This section analyses the usage distribution of MWEs of different lengths across the three learning stages to determine the focus of this study. MWEs with six or more words are rare (Wei, 2004); therefore, this study focused on retrieving MWEs of 2–5 words. First, the cleaned corpus was imported into AntConc. Second, the N-Grams function was used to generate candidate MWEs (with a minimum frequency of 10 per million words and occurrence in at least 5 texts). Finally, two native English-speaking linguistic experts assisted in manually filtering out meaningless MWEs, identifying MWE errors, and categorising these errors according to the study criteria. To assess inter-rater reliability in the identification and categorisation of MWEs, Cohen’s Kappa was calculated, yielding a value of 0.758 (p < 0.001). This value falls within the 0.61–0.80 range, indicating a “substantial” level of agreement between raters. Any discrepancies were resolved through discussion, ensuring consistency and reliability in the final MWE dataset.
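For readers unfamiliar with the agreement statistic, the following is a minimal sketch of how Cohen’s Kappa is computed for two raters. The function name `cohens_kappa` and the ten example labels are hypothetical; the raters’ actual judgements are not available in the source, so the sketch does not reproduce the reported value of 0.758.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal proportions per label.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up judgements on ten candidate n-grams (illustration only).
rater_a = ["MWE"] * 6 + ["not"] * 4
rater_b = ["MWE"] * 5 + ["not"] * 4 + ["MWE"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

In this toy case the raters agree on 8 of 10 items, but because chance agreement is 0.52, kappa is noticeably lower than the raw agreement rate.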

Based on the statistics of MWE errors in this corpus, we found that MWE errors mainly comprise grammatical errors and discoursal errors. Grammatical errors primarily consist of incorrect number agreement in subject-verb concord, mix-ups in subject-verb collocation, incorrect combinations of verb + non-finite verb, and errors in verb tense and number. Discoursal errors mainly involve the incorrect use of causal and contrastive conjunctions.

As shown in Table 3, the frequency of learners’ use of MWEs decreases as the length of the MWEs increases. Specifically, 3-word MWEs are the most frequently used, accounting for the largest proportion, and the combined usage of 3-word and 4-word MWEs is close to 100%. Therefore, this study focuses on analysing 3-word and 4-word MWEs.

Table 3 Usage of MWEs of different lengths at different stages.

Statistical analysis

This section details the statistical techniques employed in this study, including the rationale behind their selection, the coding of variables, and the data cleaning process prior to analysis.

Data cleaning

The raw corpus data underwent several cleaning procedures before analysis. These included:

Removal of irrelevant data: Unnecessary information, such as participant background data, was removed to focus exclusively on the spoken English data.

Correction of transcription errors: Any transcription errors identified by the native English-speaking linguistic experts were rectified to ensure data accuracy.

Exclusion of low-frequency MWEs: MWEs with a frequency below 10 per million words or appearing in fewer than 5 texts were excluded from the analysis to guarantee the reliability of the results.
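The frequency and range thresholds used to exclude low-frequency MWEs can be illustrated with a short sketch. It is not AntConc’s implementation; the function `extract_ngrams`, its parameters, and the toy texts are assumptions introduced here, and real transcripts would first need tokenising and stripping of annotation marks.

```python
from collections import Counter, defaultdict


def extract_ngrams(texts, n_min=2, n_max=5, min_per_million=10.0, min_texts=5):
    """Count 2- to 5-word sequences and keep those meeting both thresholds.

    texts: dict mapping a text id to its list of tokens (hypothetical input).
    Returns {ngram: (raw_count, number_of_texts)} for the retained n-grams.
    """
    freq = Counter()
    text_range = defaultdict(set)
    total_tokens = 0
    for text_id, tokens in texts.items():
        total_tokens += len(tokens)
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                freq[gram] += 1
                text_range[gram].add(text_id)
    kept = {}
    for gram, count in freq.items():
        per_million = count / total_tokens * 1_000_000
        # An n-gram is retained only if it passes BOTH criteria.
        if per_million >= min_per_million and len(text_range[gram]) >= min_texts:
            kept[gram] = (count, len(text_range[gram]))
    return kept


# Toy corpus: five identical short texts plus one different text.
toy = {f"t{i}": "i like to play basketball".split() for i in range(5)}
toy["t5"] = "we went home".split()
kept = extract_ngrams(toy)
print(kept[("i", "like", "to")])
```

With such a tiny corpus the per-million threshold is trivially met, so only the range criterion (at least 5 texts) does any work; on a corpus of realistic size both thresholds matter.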

Statistical techniques

Standardised frequencies: For ease of comparison, the frequencies in this study are converted to standardised frequencies (i.e., frequency per 1000 words). The formula is: standardised frequencies = (observed frequencies/total number of tokens) * 1000.
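As a worked check of this formula, the sketch below recomputes standardised frequencies from figures given elsewhere in the article: the sampled token totals per stage (52,487, 52,496, and 52,483) and the raw counts of the misused “There are many/some/a lot of…” pattern reported in the Results (15, 12, and 21 instances). The function and variable names are illustrative.

```python
def standardised_frequency(observed, total_tokens, per=1000):
    """Frequency per `per` words: (observed / total tokens) * per."""
    return observed / total_tokens * per


stage_tokens = {"stage one": 52_487, "stage two": 52_496, "stage three": 52_483}
there_are_errors = {"stage one": 15, "stage two": 12, "stage three": 21}

standardised = {
    stage: round(standardised_frequency(there_are_errors[stage], stage_tokens[stage]), 2)
    for stage in stage_tokens
}
print(standardised)
```

The rounded values come out as 0.29, 0.23, and 0.40 per 1000 words, matching the figures quoted for this error type.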

Significance testing: To determine the significance of differences in MWE usage and error rates across the three learning stages, a log-likelihood ratio (LL) test was employed. This test was selected over the chi-square test due to its specific advantages. First, the LL test is well-suited for multiple comparisons, as it effectively assesses the independence of categorical variables across multiple groups, an essential feature for comparing the three proficiency levels in this study (Gelbukh et al. 2010). Additionally, the LL test offers greater robustness with smaller sample sizes, enhancing its reliability in studies with limited corpus data (Pojanapunya and Todd, 2018). Together, these characteristics make the LL test a reliable and well-suited method for accurately analysing differences in MWE usage and error rates across proficiency levels in this study. The log-likelihood ratio is calculated using the following formula:

$$\mathrm{LL}=2\sum_{i} O_{i}\,\ln\left(\frac{O_{i}}{E_{i}}\right)$$

where the summation is over all observations, $O_i$ denotes the observed frequencies, and $E_i$ refers to the expected frequencies. The log-likelihood ratio statistic is used to test the independence of two categorical variables by comparing the deviation between observed and expected values to determine whether the difference between groups is significant. The following critical values (for one degree of freedom) are used to determine the level of significance: p < 0.05, 3.84; p < 0.01, 6.63; p < 0.001, 10.83; p < 0.0001, 15.13.
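The calculation can be sketched as below. Two points are assumptions rather than statements from the source: the expected frequency for each stage is taken to be that stage’s share of the pooled frequency in proportion to its size (the usual corpus-comparison setup), and p-values come from the chi-square distribution with one degree of freedom for a pairwise comparison and two for a three-stage comparison. The latter is at least consistent with the reported results, since LL = 4.19 corresponds to p ≈ 0.123 with two degrees of freedom. The example counts are the “There are…” error counts from the Results; the article does not report an LL value for that pattern, so the output here is illustrative.

```python
import math


def log_likelihood(observed, corpus_sizes):
    """LL = 2 * sum(O_i * ln(O_i / E_i)) across the compared corpora.

    E_i is assumed to be N_i * sum(O) / sum(N), i.e. the pooled rate
    applied to each corpus size.
    """
    total_obs = sum(observed)
    total_size = sum(corpus_sizes)
    ll = 0.0
    for o, n in zip(observed, corpus_sizes):
        expected = n * total_obs / total_size
        if o > 0:  # a zero count contributes nothing (0 * ln 0 is taken as 0)
            ll += o * math.log(o / expected)
    return 2 * ll


def p_value(ll, df):
    """Upper-tail chi-square probability for df = 1 or df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(ll / 2))
    if df == 2:
        return math.exp(-ll / 2)
    raise ValueError("only df = 1 or 2 are handled in this sketch")


sizes = [52_487, 52_496, 52_483]
ll_demo = log_likelihood([15, 12, 21], sizes)
print(round(ll_demo, 2), round(p_value(ll_demo, 2), 3))
```

Closed forms are used for the p-values only to keep the sketch free of third-party dependencies; a statistics library would normally be used instead.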

Although multiple comparison correction is generally required for multiple comparisons to prevent an increase in false positive results, the commonly used Bonferroni correction method is not applicable in this study as there is no multiple sampling. Therefore, we employed the method provided by Lancaster University’s significance testing website (http://corpora.lancs.ac.uk/sigtest/) to conduct significance testing between the three stages and calculate the log-likelihood ratio for the three stages. This method effectively handles multiple comparisons, preventing an increase in false positive results and ensuring the reliability of the findings.

Polynomial regression analysis

To explore the developmental trends of MWE usage and error rates across the three learning stages, a quadratic polynomial curve model^{Footnote 1} was employed using SPSS software. This model was chosen due to its ability to capture non-linear trends and provide a better fit compared to linear models. The quadratic polynomial curve model accommodates both upward and downward trends, accurately reflecting the dynamic changes in the data.
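To make the shape of such a model concrete, here is a minimal sketch that fits y = a + b·x + c·x² to one value per stage, using the overall error rates reported in the Results (2.79%, 2.38%, and 1.70%). It is not the SPSS procedure, and it assumes a single aggregate value per stage, which the source does not specify. Under that assumption a quadratic passes exactly through the three points, so the curve describes the trend’s shape rather than testing goodness of fit.

```python
def quadratic_through_stages(y1, y2, y3):
    """Coefficients (a, b, c) of y = a + b*x + c*x**2 through x = 1, 2, 3."""
    c = (y1 - 2 * y2 + y3) / 2   # half the second difference
    b = (y2 - y1) - 3 * c        # from y2 - y1 = b + 3c
    a = y1 - b - c               # from y1 = a + b + c
    return a, b, c


def predict(coeffs, x):
    a, b, c = coeffs
    return a + b * x + c * x * x


error_rates = (2.79, 2.38, 1.70)  # overall MWE error rate (%) for stages 1 to 3
coeffs = quadratic_through_stages(*error_rates)

# The sign of c gives the curvature: positive means U-shaped, negative
# means the decline steepens (or a rise flattens) at later stages.
shape = "U-shaped" if coeffs[2] > 0 else "not U-shaped"
print([round(v, 3) for v in coeffs], shape)
```

For these three values the curvature term is negative, which fits the description of a steady decline rather than a U-shaped pattern.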

Variables and coding

The dependent variables in this study include:

MWE frequency: the frequency of MWE usage across the three learning stages.

MWE error rate: the rate of MWE errors across the three learning stages.

The independent variable is:

Learning stage: coded as 1 for junior high school, 2 for senior high school, and 3 for university.

Results

Based on cross-sectional comparisons of spoken discourse from stage one to stage three, the following subsections report the commonalities and differences observed in the MWEs used by Chinese learners of English.

Common features across the three stages

Common features of MWE frequency proportion

As shown in Table 4, in terms of the structural dimension, the usage of different MWE types is uneven across all three stages: “sentence builders” are the most frequently used (all exceeding 60%), followed by “phrasal constraints” (all close to 20%), then “polywords” (all between 7 and 10%), and finally “institutionalised expressions” are the least used (all less than 4%).

Table 4 Frequency proportion of MWEs across two dimensions and three stages.

In terms of the functional dimension, the usage of MWEs across all stages is also uneven: “necessary topics” are the most frequently used (all exceeding 50%), while the proportions of “discourse devices” and “social interactions” fluctuate across stages, but both remain relatively low (under 30%).

In summary, the usage proportions of different MWE types are roughly similar across the three stages, with an uneven distribution. Certain MWE types are used less frequently.

Common features of MWE errors

Common features of error frequency

As shown in Table 5, the overall error rate of MWEs gradually decreases across the three stages (2.79%, 2.38%, and 1.70% respectively). From Stage one to two, although the total error frequency slightly decreases (LL = 1.82), the difference is not statistically significant. Similarly, from Stage two to three, the change in the total error frequency is even smaller (LL = 0.42). Overall, across all three stages, while the total error frequency gradually decreases, the change remains statistically insignificant (LL = 4.19, p = 0.123).

Table 5 Frequency, proportion, and log-likelihood ratio of MWEs misuse across three stages.

Combining this with Fig. 1, we can see that for the structural dimension, the error frequencies of “phrasal constraints” decrease from Stage one to two (LL = 2.85) and slightly increase from Stage two to three (LL = 0.12), resulting in a statistically insignificant decrease across the three stages (LL = 3.26, p = 0.196). The error frequencies of “sentence builders” remain almost unchanged between Stage one and two (LL = 0.36) and slightly decrease from Stage two to three (LL = 0.85), leading to a statistically insignificant decrease across all three stages (LL = 2.34, p = 0.311).

**Fig. 1: Quadratic polynomial model curves of MWE error frequencies across three stages.**

As for the functional dimension, the error frequencies of “social interactions,” “necessary topics,” and “discourse devices” do not show statistically significant differences across the three stages. “Social interactions” slightly decrease from Stage one to two (LL = 1.93) and slightly increase from Stage two to three (LL = 0.07), exhibiting no significant change across the three stages (LL = 2.20, p = 0.332). “Necessary topics” slightly decrease from Stage one to two (LL = 0.27) and from Stage two to three (LL = 0.49), indicating no statistically significant change across all three stages (LL = 1.48, p = 0.476). “Discourse devices” exhibit a slight decrease from Stage one to two (LL = 0.16) and from Stage two to three (LL = 0.53), resulting in a statistically insignificant change across the three stages (LL = 1.31, p = 0.519).

According to Fig. 1, which presents results from quadratic regression analysis, the error frequencies of different MWEs by learners show only a slight trend of change across the three stages. The error frequencies of “sentence builders,” “necessary topics,” and “discourse devices” gradually decline. For “sentence builders,” the error frequency starts at a relatively high point in stage one, gradually decreases through stage two, and reaches its lowest level in stage three. This trend suggests that as learners progress, they become more adept at using sentence-building expressions accurately. Similarly, “necessary topics” follows a comparable downward trend, with error frequency steadily declining as learners advance, indicating improvement in using MWEs related to essential topics or common themes. “Discourse devices” also displays a consistent decline, suggesting that learners make fewer errors in organising discourse as they gain proficiency, perhaps due to greater exposure to and familiarity with discourse markers or cohesive devices. In contrast, “phrasal constraints” and “social interactions” demonstrate a weak U-shaped trend in their error frequencies. This U-shaped pattern may imply that while learners initially make gains in controlling phrasal structures, they encounter renewed challenges as they tackle more complex uses at higher proficiency levels. However, the total error frequency of MWEs, the total error rate of MWEs, and the error frequencies of various MWE types all lack statistically significant changes across the three stages. Consequently, the accuracy rate remains generally low, indicating limited improvement in MWE accuracy as learners move to higher proficiency levels.

Common features of the proportion of error frequencies

From Table 5, it is observed that in terms of the structural dimension, the error of MWEs is primarily concentrated in “sentence builders” (with a proportion ranging from 72 to 80%) and “phrasal constraints” (with a proportion ranging from 20 to 28%). Notably, no errors were observed for “polywords” and “institutionalised expressions”. In the functional dimension, the proportions of errors are relatively even.

Common features of error types

Upon corpus retrieval and observation, it was found that there are many types of MWE error, but some types have very low frequencies, such as “it can/will/must __,” which appeared only 5 times. For instance: “I think it is must very difficult.”

The most frequent MWE errors can be categorised into the following five types:

(1)
The occurrence of the construction “There are many/some/a lot of…”

“There are many/some/a lot of + plural countable noun” represents a valid collocation. However, this particular pattern is frequently misused by learners, which is quite typical at all stages of learning, with frequencies of 0.29 (15 instances), 0.23 (12 instances), and 0.40(21 instances) respectively. For example: “And there is three door in this school.” This type of error can be categorized under “sentence builders” in the structural dimension and “social interactions” in the functional dimension.
(2) The utilisation of the construction “It/this/there is”

This pattern was also commonly misused in the spoken discourse of learners, with frequencies of 0.36 (19 instances), 0.29 (15 instances), and 0.19 (10 instances) respectively. For example: “And it is also make me relax.” This type of error falls under “sentence builders” structurally and “social interactions” functionally.
(3) The error of the “like doing/to do” construction

Native speakers typically use “like doing__/ to do__” rather than “like do__.” This error occurs with frequencies of 0.10 (5 instances), 0.17 (9 instances), and 0.21 (11 instances) across the stages. For example: “And I like speak English with my classmates.” This error belongs to “phrasal constraints” structurally and “necessary topics” functionally.
(4) The error of tense and number of verbs in constructions

L2 learners frequently struggle with the correct use of verb tense and number in English due to the morphological differences between English and Chinese verbs. These errors, with frequencies of 0.27 (14 instances), 0.15 (8 instances), and 0.23 (12 instances) across the three learning stages, are particularly prevalent in patterns such as “don’t see__” and “don’t like__”. An example of an erroneous sentence is “After that I don’t like see ghost film.” Such errors can be categorised under “phrasal constraints” in the structural dimension and “necessary topics” in the functional dimension.
(5) The error of “Because__so” and “although/though/even if__but”

“Because__so” appears with frequencies of 0.30 (16 instances), 0.23 (12 instances), and 0.29 (15 instances) across the stages, while “although/though/even if__but” appears with frequencies of 0.13 (7 instances), 0.15 (8 instances), and 0.21 (11 instances) respectively. Together, these two misused structures appear with frequencies of 0.44 (23 instances), 0.38 (20 instances), and 0.50 (26 instances), and can be categorised under “sentence builders” and “discourse devices.”

The first four types are grammatical errors, primarily related to verb tense and number, and occur with relatively high frequencies across the stages: 1.01 (53 instances), 0.84 (44 instances), and 1.03 (54 instances). The fifth type is a discourse error.
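The per-stage totals quoted above can be checked directly against the instance counts reported for each error type. The short Python tally below is only an arithmetic sketch: the dictionary keys are shorthand labels introduced here for illustration, not the authors’ coding scheme.

```python
# Instance counts per stage (one, two, three) as reported in the text above.
# The keys are shorthand labels chosen for this sketch only.
error_counts = {
    "there_are": [15, 12, 21],
    "it_this_there_is": [19, 15, 10],
    "like_doing_to_do": [5, 9, 11],
    "verb_tense_number": [14, 8, 12],
    "because_so": [16, 12, 15],
    "although_but": [7, 8, 11],
}

GRAMMATICAL = ["there_are", "it_this_there_is", "like_doing_to_do", "verb_tense_number"]
DISCOURSE = ["because_so", "although_but"]


def stage_totals(labels):
    """Sum the instance counts of the given error types for each stage."""
    return [sum(error_counts[label][stage] for label in labels) for stage in range(3)]


grammatical_totals = stage_totals(GRAMMATICAL)
discourse_totals = stage_totals(DISCOURSE)
print(grammatical_totals)  # [53, 44, 54]
print(discourse_totals)    # [23, 20, 26]
```

The sums reproduce the 53, 44, and 54 grammatical-error instances and the 23, 20, and 26 discourse-error instances given in the text.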

Table 5 shows that the total error frequencies across the three stages are 1.87, 1.52, and 1.37, indicating a decreasing trend, although the changes are not statistically significant (LL = 4.19, p = 0.123). This suggests that the main types of MWE errors are similar across the three stages, and while the error frequency gradually decreases, the overall difference is not statistically significant.
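As a plausibility check on the reported p-value, the sketch below assumes that the log-likelihood statistic for the three-stage comparison is referred to a chi-square distribution with two degrees of freedom. The degrees of freedom are not stated in this passage, so this is an assumption, and the function name is illustrative. Under that assumption the p-value has the closed form exp(−LL/2).

```python
import math


def p_value_df2(ll):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom.

    For df = 2 the survival function reduces to exp(-x / 2).
    """
    if ll < 0:
        raise ValueError("a log-likelihood statistic cannot be negative")
    return math.exp(-ll / 2.0)


print(round(p_value_df2(4.19), 3))  # 0.123
```

The result agrees with the reported p = 0.123, which is above the conventional 0.05 threshold.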

In conclusion, the total error frequency, overall error rate, as well as the error frequency and proportion of different MWE types, are all relatively similar across the three stages, and the main types of MWE errors are consistent. This suggests that learners’ accuracy in MWE use did not significantly improve across the stages.

Distinctive features across the three stages

Generally, higher MWE variability indicates greater variation in MWE usage and stronger productive MWE ability (Read, 2000). The type-token ratio (TTR) is a validated measure of lexical variation when text length is held constant (Treffers-Daller et al. 2018). This section first examines the differences in total types and tokens of MWEs used across the three stages and compares the TTR at each stage to analyse the developmental features of MWE variation. It then analyses the distinctive features in the structural and functional dimensions of MWEs.

Distinctive features of MWE total types and tokens

As Table 6 illustrates, statistically significant differences exist in the total MWE tokens used across the three stages: a slight decrease between Stages one and two (LL = 3.22), followed by a significant increase between Stages two and three (LL = 93.45). Overall, the total number of MWE tokens used exhibits a trend of decreasing initially, followed by a significant increase (LL = 106.94, p < 0.001).
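The log-likelihood (LL) values reported here compare frequencies across stage sub-corpora. The exact formula is not reproduced in this passage, so the Python sketch below assumes the observed-versus-expected form that is widely used for corpus comparison. The function names, counts, and corpus sizes are hypothetical and serve only to illustrate the calculation.

```python
import math


def log_likelihood(counts, sizes):
    """Log-likelihood statistic for one item's frequency across several corpora.

    counts[i] is the observed frequency in corpus i and sizes[i] is that
    corpus's total size. Expected counts assume the same relative frequency
    in every corpus: E_i = sizes[i] * sum(counts) / sum(sizes).
    """
    if len(counts) != len(sizes) or len(counts) < 2:
        raise ValueError("need matching counts and sizes for at least two corpora")
    total_count = sum(counts)
    total_size = sum(sizes)
    ll = 0.0
    for observed, size in zip(counts, sizes):
        expected = size * total_count / total_size
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return max(2.0 * ll, 0.0)  # guard against tiny negative rounding error


def p_value_df1(ll):
    """Upper-tail chi-square probability with 1 degree of freedom (pairwise case)."""
    return math.erfc(math.sqrt(max(ll, 0.0) / 2.0))


# Hypothetical counts and corpus sizes, for illustration only.
example_ll = log_likelihood([10, 20], [1000, 1000])
print(round(example_ll, 2))
print(p_value_df1(3.22) > 0.05)    # pairwise LL reported for Stages one and two
print(p_value_df1(93.45) < 0.001)  # pairwise LL reported for Stages two and three
```

For a pairwise comparison with one degree of freedom, an LL of 3.22 falls below the conventional 3.84 cut-off for p < 0.05, which is consistent with the text describing the Stage one to Stage two change as slight.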

Table 6 Log-likelihood ratios for MWEs total types and tokens, and type-token ratios.


Regarding the total MWE types used, significant differences are also observed across the stages: a slight decrease between Stages one and two (LL = 0.47), followed by a substantial increase between Stages two and three (LL = 25.27, p < 0.001). Overall, the data reveals a trend of initial decrease followed by an increase in the total types of MWEs used (LL = 30.51, p < 0.001).

Comparing the type-token ratios across the stages, Stages one and two remain constant (0.04), while Stage three shows a significant increase (0.06). This indicates that the variation of MWE use notably improves in Stage three.
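The type-token ratio used here is simply the number of distinct MWEs divided by the total number of MWE occurrences. A minimal Python sketch follows; the function name, the lower-casing step, and the sample list are assumptions made for illustration and are not taken from the authors’ procedure.

```python
def type_token_ratio(mwe_tokens):
    """Return distinct MWE types divided by total MWE tokens."""
    if not mwe_tokens:
        raise ValueError("cannot compute a ratio for an empty list")
    # Normalise case and surrounding whitespace so that "I think" and
    # "i think" count as the same type (an assumption of this sketch).
    normalised = [token.strip().lower() for token in mwe_tokens]
    return len(set(normalised)) / len(normalised)


# Hypothetical extracted MWE tokens, for illustration only.
sample = ["a lot of", "I think", "a lot of", "in front of", "i think"]
print(type_token_ratio(sample))  # 3 types / 5 tokens = 0.6
```

Because TTR tends to fall as samples get longer, comparisons such as the one above are meaningful only when the sub-corpora are of comparable size.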

Distinctive features in structural and functional dimensions

Regarding the structural dimension in Table 7, there are significant differences across all three stages in the usage of “polywords,” “institutionalised expressions,” “phrasal constraints,” and “sentence builders”. Notably, “institutionalised expressions” show a significant increase from Stage one to two (LL = 6.91). In contrast, “polywords,” “phrasal constraints,” and “sentence builders” all demonstrate significant decreases from Stage one to two. From Stage two to three, only “institutionalised expressions” show a slight increase (LL = 2.76), while “polywords,” “phrasal constraints,” and “sentence builders” all exhibit significant increases.

Table 7 Frequencies and log-likelihood ratios for the two dimensions across the three stages.


Analysing the functional dimension, significant differences are observed across the three stages in the usage of “social interactions,” “necessary topics,” and “discourse devices.” From Stage one to two, “social interactions” exhibit a slight increase (LL = 3.50), while both “necessary topics” and “discourse devices” show significant decreases. From Stage two to three, “social interactions” slightly increase (LL = 0.49), while both “necessary topics” and “discourse devices” display significant increases (p < 0.001).

Figure 2 further illustrates the changing trends in the frequencies of MWEs across seven dimensions throughout the three stages of learning, based on quadratic regression analysis. Notably, apart from “institutionalised expressions” and “social interactions,” the remaining five dimensions (“polywords,” “phrasal constraints,” “sentence builders,” “necessary topics,” and “discourse devices”) exhibit a typical U-shaped trend in their MWE frequency curves. Specifically, these dimensions show higher MWE frequencies in the first stage, a decline to their lowest point in the second stage, and a gradual increase in the third stage. This trend suggests that learners may initially use these MWEs more frequently. However, as the learning progresses and becomes more challenging, they might reduce their usage in the second stage. In the third stage, with improved language proficiency and fluency, the frequency of using these MWEs may rebound significantly. Unlike the five dimensions mentioned above, “institutionalised expressions” and “social interactions” display an “increasing” trend in their frequency curves. This indicates a gradual increase in their frequencies starting from the first stage and reaching the highest point in the third stage. This pattern suggests that learners accumulate “institutionalised expressions” and “social interactions” progressively throughout their language learning journey. Their increasing frequencies across stages reflect the learners’ gradual mastery and application of these MWEs.

**Fig. 2: Quadratic polynomial model curves of MWE usage frequencies across different dimensions in the three stages.**

In summary, except for “institutionalised expressions” and “social interactions,” the frequencies of MWEs in other dimensions are generally higher in the initial and later stages of learning, with a decline in the intermediate stage, presenting a “U” shaped curve. Conversely, “institutionalised expressions” and “social interactions” demonstrate a steady upward trend in their frequencies as learners progress through the learning stages, indicating their continuous growth in language acquisition.
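The two curve shapes described above can be illustrated with a small Python sketch. With only three stages, a quadratic curve passes exactly through the three stage values, so the shape is determined by the sign of the slope at the first and last stage. The function names and frequency values below are hypothetical and are not taken from Table 7 or Fig. 2.

```python
def fit_quadratic(xs, ys):
    """Return (a, b, c) of y = a*x**2 + b*x + c through exactly three points."""
    (x1, x2, x3), (y1, y2, y3) = xs, ys
    d1 = (y2 - y1) / (x2 - x1)  # slope between the first two points
    d2 = (y3 - y2) / (x3 - x2)  # slope between the last two points
    a = (d2 - d1) / (x3 - x1)   # second divided difference
    b = d1 - a * (x1 + x2)
    c = y1 - a * x1 ** 2 - b * x1
    return a, b, c


def classify_trend(xs, ys):
    """Label the fitted curve by its slope at the first and last stage."""
    a, b, _ = fit_quadratic(xs, ys)
    slope_start = 2 * a * xs[0] + b
    slope_end = 2 * a * xs[2] + b
    if slope_start < 0 < slope_end:
        return "U-shaped"
    if slope_start > 0 > slope_end:
        return "inverted U"
    if slope_start >= 0 and slope_end >= 0:
        return "increasing"
    return "decreasing"


stages = [1, 2, 3]
# Hypothetical normalised frequencies, for illustration only.
print(classify_trend(stages, [5.0, 3.0, 6.0]))
print(classify_trend(stages, [1.0, 2.0, 4.0]))
```

The first series dips at the second stage and is labelled U-shaped, while the second rises throughout and is labelled increasing, mirroring the two patterns reported for the seven categories.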

Discussion

Commonalities across three stages

Frequency distribution of MWEs

Regarding the structural dimension, “sentence builders” are the most frequently used across all three stages, followed by “phrasal constraints,” consistent with Nattinger and DeCarrico’s (1992) observations. As discussed by Nattinger and DeCarrico (1992), “sentence builders”, providing a framework for the entire sentence, are highly variable and discontinuous, making them the most flexible, abundant, and frequently occurring in discourse. Conversely, “phrasal constraints” are shorter with some degree of variability, contributing to their relatively high frequency in discourse. Psycholinguistic research suggests that learners exhibit greater familiarity and faster processing speeds for frequently encountered MWEs in language input (Shi and Chai, 2021). This indicates that learners have more frequent exposure to “sentence builders” and “phrasal constraints”, leading to increased familiarity, faster processing speeds, and stronger mental representations. Consequently, these two types of MWEs are readily accessible from the mental lexicon. Furthermore, the selection and usage of MWEs are also linked to their inherent pragmatic capabilities (Nattinger and DeCarrico, 1992). These two types of MWEs exhibit a high degree of variability and are capable of fulfilling various pragmatic functions (Nattinger and DeCarrico, 1992), which accounts for learners’ tendency to employ them.

The infrequent use of “polywords” and “institutionalised expressions” aligns with the findings of Wang and Qian (2009). According to Nattinger and DeCarrico (1992), these two types of MWEs, characterised by their fixed and continuous nature, have lower recurrence rates in discourse compared to “sentence builders” and “phrasal constraints.” Additionally, their relatively fixed structures make it challenging to predict their forms based on general grammatical principles, resulting in higher structural arbitrariness (Nattinger and DeCarrico, 1992). Comparatively, the study reveals that learners have less exposure to these two types of MWEs, leading to low familiarity, slower processing speeds, and a failure to establish strong mental representations, thus explaining their lower usage.

Concerning the functional dimension, “necessary topics” emerge as the most frequently used category. Given their close association with daily life, these MWEs receive the most exposure and thus attain the highest level of familiarity among learners, resulting in strong mental representations. In SCCLE, conversations primarily revolve around familiar themes like school and birthdays. Consequently, “necessary topics” emerge as the dominant category, while the other two categories exhibit lower usage. This observation is consistent with findings by Huang (2018). “Discourse devices” function as connectors of meaning and structure within a discourse. While communicators utilise this category to enhance fluency and logical coherence, it involves discourse competence and a higher level of abstraction, presenting challenges for learners. Consequently, despite some usage, their application remains limited to a few fixed types, including logical connectives like “because I__,” temporal connectors like “and then__,” and fluency markers like “I think__.” Psycholinguistic research indicates that the concreteness of language can impact the processing speed of MWEs (Wang, 2019). Due to the relatively abstract nature of “discourse devices” compared to “necessary topics,” their processing speed is slower. Within specific contexts, “necessary topics” are readily retrieved by learners for efficient expression, while “discourse devices” are utilised only when discourse cohesion is required. Since learners with lower English proficiency also tend to use simple and familiar MWEs, the output of “discourse devices” is relatively low. Lastly, regarding “social interactions,” learners may have less exposure to this type of MWE, resulting in lower familiarity and weaker mental representations. Additionally, their awareness of using them is relatively poor, leading to a very low usage proportion of this category.

The errors of MWEs

Analysing the distribution of error frequency within the structural dimension, errors across all stages primarily concentrate on “sentence builders,” followed by “phrasal constraints.” Despite learners’ relative familiarity with these two categories, their higher variability and discontinuity in structural forms (Nattinger and DeCarrico, 1992) pose challenges. Learners need to consider appropriate collocations while adhering to grammatical rules. However, their limited proficiency in handling MWEs makes accurate production difficult during time-constrained spoken communication, leading to higher error rates. This contrasts with Gao’s (2015) findings, which revealed significantly higher error frequencies for “phrasal constraints” than for “polywords” and “sentence builders” among English majors. This discrepancy can be attributed to several factors.

Firstly, the overall English proficiency levels of participants differed between the studies. According to Biber et al. (2011), L2 development involves a gradual transition from clause-based features to phrase-based features. In Gao’s (2015) study, the participants were senior English majors with higher English proficiency and greater exposure to MWEs. They tended to use a variety of “phrasal constraints”. In contrast, the participants in the present study were primarily middle school students and some lower-year English majors, whose English proficiency was relatively lower, and whose language development focused mainly on clause features. They tended to use “sentence builders”, thus resulting in a higher incidence of errors in this category.

Secondly, learners, particularly middle school students, often exhibit limited variation in the use of MWEs in spoken discourse, tending to overuse certain MWEs (Wang and Qian, 2009). In this study, the use of “phrasal constraints” primarily centred around certain easily mastered MWEs (such as “in front of,” “a lot of,” and “next to the__”), resulting in a lower number of errors compared to “sentence builders.” Both “polywords” and “institutionalised expressions” have very fixed structures, minimising the likelihood of errors once memorised. Additionally, learners demonstrate low familiarity with these two types of MWEs. This results in infrequent usage across all learning stages, accompanied by correspondingly low error rates and even an absence of errors in some cases.

Regarding the types of error, grammatical errors, particularly those related to verb tense and number, constitute the dominant form, aligning with Huang’s (2018) findings. Analysing the corpus reveals that exposure frequency and context exert a combined influence on the errors of MWEs. In Chinese, verbs like “Xihuan” (like) and “Zuo” (do) can be directly combined without morphological changes. Conversely, in English, “like” must be followed by the infinitive form “to do” or the gerund form “doing.” According to the usage-based perspective, context plays a crucial role in MWE usage, enabling their accurate application (Wang, 2015). However, learners’ limited exposure to certain MWEs leads to strong interference from their L1 context. This mismatch between the usage of MWEs and the English target context (Wang, 2019) gives rise to “Chinglish” errors.

In summary, learners exhibit an unbalanced use of different types of MWEs, with certain types being underused, and the overall accuracy rate is relatively low. In other words, their capability in using MWEs in spoken English is insufficient. The Ministry of Education of the People’s Republic of China (2022) mandates that junior high school students should master 200–300 idioms or fixed collocations (see Note 2). In this study, Stage one yielded 158 distinct MWEs, Stage two had 146, and Stage three only reached 245. This suggests that learners’ overall proficiency of MWEs in spoken discourse remains low, deviating significantly from the set requirements. Furthermore, the accuracy of learners’ use of MWEs did not improve progressively across stages, indicating that they still face challenges in the accurate expression of MWEs.

A usage-based approach to language acquisition posits that language exposure plays a crucial role in the acquisition of MWEs. In the classroom, the primary source of exposure to MWEs comes from textbooks (Northbrook et al. 2022). However, upon closer examination of these textbooks, it becomes apparent that the number of MWEs in textbooks is insufficient, the types are limited, and the frequency of occurrence is uneven. At the junior high school level, it has been observed that teachers primarily emphasize MWEs with higher frequencies in textbooks, such as “sentence builders” and “phrasal constraints”, while “polywords” and “institutionalised expressions” are rarely covered. Besides, teachers seldom provide extracurricular reading materials, resulting in limited opportunities for learners to be exposed to less frequent MWEs.

At the senior high school level, the Ministry of Education of the People’s Republic of China (2017) has explicitly put forward the learning requirements for MWEs. Most teachers would increase the teaching content of MWEs and provide relevant reading materials accordingly. However, because acquiring MWEs in spoken discourse requires massive repeated exposure, the amount of exposure provided by the current classroom teaching and extracurricular reading is still insufficient. Consequently, even after graduating from senior high school, learners still experience significant difficulties using MWEs in spoken English.

At the university level, English majors are usually offered a rather comprehensive curriculum, which normally includes courses like Intensive Reading, English Listening and Speaking, English Reading and Writing, and Translation Theory and Practice. MWE instruction is integrated into language and professional skills training within these courses. The objective is to develop students’ proficient use of various MWEs through systematic training, thereby enhancing their accuracy and fluency. Spoken English, in particular, receives focused attention on correct MWE usage.

Despite all these efforts, observations indicated that learners still suffer from insufficient exposure to MWEs and fail to improve their accuracy in using MWEs significantly due to various possible reasons. For example, under limited classroom time, teachers need to make trade-offs among various teaching objectives, which might lead to insufficient time for the explanation and practice of MWEs; the coverage and frequency of occurrence of MWEs in textbooks are still insufficient to provide adequate exposure; some teachers may still adopt conventional teaching methods and lack innovative and diversified teaching activities, resulting in learners’ lack of opportunities for sufficient exposure to and practice of MWEs in authentic contexts.

Differences across three stages

Despite the overall low proficiency in MWE usage, learners demonstrated a U-shaped developmental trend in terms of the total tokens, total types, and the usage of different MWE categories within both dimensions. This suggests that learners attempted to use a wide range of MWEs during Stage one. However, Stage two witnessed a learning plateau (Flynn and O’Neil, 1988), characterised by a significant decline in the frequency of use. By Stage three, after being exposed to a larger number of MWEs, learners attempted to use them extensively. Notably, however, the first two stages (middle school stages) revealed minimal changes in the variation of MWEs, suggesting that despite fluctuations in the frequency of use, learners remained confined to a limited repertoire of MWEs. This aligns with Wang and Qian’s (2009) observation that middle school students tend to overuse a limited set of MWEs in spoken discourse, resulting in low variation of MWEs. Conversely, Stage three exhibited a marked increase in the variation of MWEs, corroborating findings from studies such as those by Wang and Qian (2009) and Jiang et al. (2024), which highlighted significantly higher variation of MWEs among university students compared to their middle school counterparts. These usage patterns are largely influenced by the amount of exposure to MWEs.

As previously mentioned, in the junior high school stage, students have low frequencies and variation in the use of MWEs, primarily due to the limited language exposure provided by both classroom and extracurricular reading. However, upon entering high school, due to the focus of exam-oriented education on written language and test-taking skills, students have less actual exposure to spoken language, leading to a decrease in the frequency of MWE usage. Both stages suffer from insufficient exposure offered through classroom activities and extracurricular reading materials. This limited exposure can lead to inappropriate use and slow down the acquisition process. This, coupled with the inherent difficulty in acquiring MWEs, explains why learners’ overall MWE usage in spoken discourse did not increase but rather plateaued and significantly declined during the first years of study.

In contrast, at the university level, most teachers intentionally increase students’ exposure to MWEs. As teachers continuously introduce and emphasise MWEs, students are exposed to a significantly larger volume of MWE materials. Consequently, in the third stage, we can observe a significant improvement in both the frequency and variation of students’ use of MWEs.

Conclusion

This study analysed the use of MWEs in spoken discourse by learners at three stages and found their overall MWE proficiency to be relatively low, evidenced by the imbalance and inaccuracy in their usage. For example, categories like “polywords” and “institutionalised expressions” in the structural dimension, and “discourse devices” and “social interactions” in the functional dimension, were used far less frequently than other categories; the learners also made various types of errors with a high overall error rate. The overall improvement in the use of MWEs across the three stages was not significant, but there was still a U-shaped trend in the total number of tokens, types, and categories of MWEs used. Specifically, after experiencing a plateau in the learning of MWEs in Stage two, learners improved in Stage three, mainly manifested in a significant increase in the total frequency and variation of MWEs used.

The present study makes several notable contributions concerning the use of MWEs in spoken discourse by L2 learners. First, the study unveiled the developmental trajectory of MWEs across two distinct dimensions, providing valuable insights into the acquisition process. While previous research primarily focused on the development of MWEs in written language, examining aspects like average length and type, this study delved into the structural and functional dimensions of MWEs in spoken discourse, including their overall variation. Results indicated a U-shaped developmental pattern, with the variation of MWEs being relatively lower during the secondary school stage and showing marked improvement at the university level.

Second, this research corroborates the findings of Verspoor et al. (2012) and Hou et al. (2018), underscoring the crucial role of ample input of MWEs in language acquisition. Additionally, the study identified a plateau phase in the learning of MWEs during the second stage, further emphasising the importance of consistent language exposure for the effective acquisition and development of MWEs.

Finally, this study analysed common errors of MWEs made by learners in spoken discourse. This analysis serves a dual purpose: it allows learners to recognise their own systematic errors, fostering a clearer understanding of their current proficiency of MWEs, and it provides teachers with comprehensive insights into the current state of MWE usage amongst their students.

In conclusion, this study explores the multidimensional development of MWEs within spoken discourse, offering practical guidance for both educators and learners in assessing and monitoring the development of MWEs. Furthermore, this research contributes valuable insights to the existing body of knowledge concerning MWE usage in spoken language.

Based on the aforementioned findings, this study proposes the following recommendations for the instruction of MWEs: First, teachers should explicitly teach lesser-known MWEs to raise learners’ awareness and encourage their appropriate application in spoken discourse. Second, explicit instruction on the grammatical aspects of MWEs will provide learners with the necessary knowledge to minimise errors. Third, teachers should offer a variety of learning materials that expose students to a wide range of MWEs within diverse contexts. By guiding learners to comprehend MWEs from both structural and functional perspectives, educators can foster greater variation in MWE usage in their students’ spoken language. Finally, given the multi-faceted nature of MWE acquisition, it is imperative that teachers not only increase learners’ exposure to MWEs but also address individual factors influencing their learning. This includes cultivating learner motivation to acquire a rich repertoire of MWEs and fostering their confidence to make consistent progress in their language learning journey.

However, it is crucial to acknowledge two limitations of this study. First, the sample size of the spoken corpus is relatively limited. Second, the present study adopts a cross-sectional design, focusing on the developmental features of MWEs in spoken discourse across three learning stages. Consequently, it does not allow the longitudinal development of MWEs in the spoken discourse of individual learners to be traced over time. To address these limitations, future studies could adopt a longitudinal design and expand the size of the spoken corpus to comprehensively investigate the developmental pattern of MWEs employed by L2 learners in their spoken discourse over an extended period.

Data availability

The data analysed in this study is derived from The Spoken Corpus of Chinese Learners of English (SCCLE), which was compiled by the School of International Studies at Northeast Normal University. The corpus consists of spoken English samples collected from Chinese junior and senior high school students, as well as university students, for the purpose of studying the development of MWEs. Participants consented to the use of their spoken discourse for research purposes. To protect privacy and confidentiality, individual identifying details of participants are not publicly shared. The published study presents aggregated analyses of MWEs across proficiency levels, rather than individual speaker data. Access and permission to analyse the full SCCLE corpus are granted only with approval from the School of International Studies at Northeast Normal University. Specific excerpts from the corpus used as examples in the paper are available from the corresponding author upon reasonable request. The analysis methods and results are comprehensively described within the paper to enable reproducibility. Additional information to support validation may be provided by the authors upon request and the establishment of appropriate data sharing agreements.

Notes

1. Our approach was inspired by Yi (2022), who suggests that analyzing dynamic patterns can benefit from using mixed-effects models or growth-curve analysis (GCA). Yi’s study offers valuable guidance on conducting GCA, which influenced our decision to employ a quadratic polynomial curve model for a more nuanced analysis of our data.
2. The “fixed collocations” mentioned in the curriculum standards can be essentially equated with MWEs in terminological terms.

References

Altenberg B, Granger S (2001) The grammatical and lexical patterning of make in native and non-native student writing. Appl Linguist 22(2):173–195. https://doi.org/10.1093/applin/22.2.173
Biber D, Conrad S, Cortes V (2004) If you look at …: lexical bundles in university teaching and textbooks. Appl Linguist 25(3):371–405. https://doi.org/10.1093/applin/25.3.371
Biber D, Gray B, Poonpon K (2011) Should we use characteristics of conversation to measure grammatical complexity in L2 writing development? TESOL Q 45(1):5–35. https://doi.org/10.5054/tq.2011.244483
Biber D, Johansson S, Leech G, Conrad S, Finegan E (1999) Longman grammar of spoken and written English. Longman, London
Cai F (2021) Research on improving English writing ability by lexical chunks approach. Front Educ Res 4(8):38–43. https://doi.org/10.25236/FER.2021.040808
Dahlmann I, Adolphs S (2007) Pauses as an indicator of psycholinguistically valid multi-word expressions (MWEs)? In: Gregoire N, Evert S, Kim SN (eds) Proceedings of the workshop on a broader perspective on multiword expressions. Association for Computational Linguistics, Stroudsburg, PA, pp 49–56
Ellis NC (2002) Frequency effects in language processing: a review with implications for theories of implicit and explicit language acquisition. Stud Second Lang Acquis 24(2):143–188. https://doi.org/10.1017/S0272263102002024
Flynn S, O’Neil W (eds) (1988) Linguistic theory in second language acquisition. Springer, Dordrecht
Gao X (2015) A corpus-based study on lexical chunk errors in Chinese English majors’ spoken English. Master’s Thesis, Shandong University
Gelbukh A, Sidorov G, Lavin-Villa E, Chanona-Hernandez L (2010) Automatic term extraction using log-likelihood based comparison with general reference corpus. In: Hopfe CJ, Rezgui Y, Métais E, Preece A, Li H (eds) Natural language processing and information systems: 15th international conference on applications of natural language to information systems. Springer, Berlin, pp 248–255
Gries ST, Ellis NC (2015) Statistical measures for usage-based linguistics. Lang Learn 65(Suppl 1):228–255. https://doi.org/10.1111/lang.12119
Hou J, Loerts H, Verspoor MH (2018) Chunk use and development in advanced Chinese L2 learners of English. Lang Teach Res 22(2):148–168. https://doi.org/10.1177/1362168816662290
Hu Y, Shao M, Ji P (2020) A study of the structural categories and pragmatic functions of lexical bundles in Chinese EFL learners’ oral performance. J PLA Univ Foreign Lang 43(4):10–18
Huang K (2018) Register features of lexical bundles used by Chinese EFL majors: a contrastive analysis of spoken and written English. Foreign Lang World 39(5):71–79
Jiang L, Kang M, Xiao Y (2024) Structural and functional features of lexical bundles used by English learners at different proficiency levels. J Northeast Univ (Soc Sci) 26(3):127–136. https://doi.org/10.15936/j.cnki.1008-3758.2024.03.014
Kashiha H, Chan SH (2015) A little bit about: differences in native and non-native speakers’ use of formulaic language. Aust J Linguist 35(4):297–310. https://doi.org/10.1080/07268602.2015.1067132
Kuiper K (2004) Formulaic performance in conventionalised varieties of speech. In: Schmitt N (ed) Formulaic sequences: acquisition, processing and use. John Benjamins Publishing Company, Amsterdam, pp 37–54
Ma G (2009) Lexical bundles in L2 timed writing of English majors. Foreign Lang Teach Res 41(1):54–60
Martinez R, Schmitt N (2012) A phrasal expressions list. Appl Linguist 33(3):299–320. https://doi.org/10.1093/applin/ams010
Ministry of Education of the People’s Republic of China (2017) General senior high school English curriculum standards. People’s Education Press, Beijing
Ministry of Education of the People’s Republic of China (2022) English curriculum standards for compulsory education. People’s Education Press, Beijing
Nattinger JR, DeCarrico JS (1992) Lexical phrases and language teaching. Oxford University Press, Oxford
Northbrook J, Allen D, Conklin K (2022) ‘Did you see that?’—The role of repetition and enhancement on lexical bundle processing in English learning materials. Appl Linguist 43(3):453–472. https://doi.org/10.1093/applin/amab063
Pojanapunya P, Todd RW (2018) Log-likelihood and odds ratio: Keyness statistics for different purposes of keyword analysis. Corpus Linguist Linguist Theory 14(1):133–167. https://doi.org/10.1515/cllt-2015-0030
Qi Y, Ding Y (2011) A contrastive analysis of chunks in the monologues by Chinese and American college students. Foreign Lang World 32(3):52–59
Qi Y, Xia J (2016) The effect of chunk recitation on English writing and speaking proficiency. J PLA Univ Foreign Lang 39(1):96–103
Rafieyan V (2018) Knowledge of formulaic sequences as a predictor of language proficiency. Int J Appl Linguist Engl Lit 7(2):64–69. https://doi.org/10.7575/aiac.ijalel.v.7n.2p.64
Ramisch C (2015) Multiword expressions acquisition: a generic and open framework. Springer, Cham
Read JAS (2000) Assessing vocabulary. Cambridge University Press, New York, NY
Shi H, Chai X (2021) Factors influencing English chunks processing in Chinese learners. J PLA Univ Foreign Lang 44(3):102–110
Sonbul S (2015) Fatal mistake, awful mistake, or extreme mistake? Frequency effects on off-line/on-line collocational processing. Biling Lang Cogn 18(3):419–437. https://doi.org/10.1017/S1366728914000674
Strik H, Hulsbosch M, Cucchiarini C (2010) Analyzing and identifying multiword expressions in spoken language. Lang Resour Eval 44(1):41–58. https://doi.org/10.1007/s10579-009-9095-y
Tavakoli P, Uchihara T (2020) To what extent are multiword sequences associated with oral fluency? Lang Learn 70(2):506–547. https://doi.org/10.1111/lang.12384
Towell R, Hawkins R, Bazergui N (1996) The development of fluency in advanced learners of French. Appl Linguist 17(1):84–119. https://doi.org/10.1093/applin/17.1.84
Treffers-Daller J, Parslow P, Williams S (2018) Back to basics: how measures of lexical variation can help discriminate between CEFR levels. Appl Linguist 39(3):302–327. https://doi.org/10.1093/applin/amw009
Tyler AE (2010) Usage-based approaches to language and their applications to second language learning. Annu Rev Appl Linguist 30:270–291. https://doi.org/10.1017/S0267190510000140
Tyler AE, Ortega L (2018) Usage-inspired L2 instruction: an emergent, researched pedagogy. In: Tyler AE, Ortega L, Uno M, Park HI (eds) Usage-inspired L2 instruction: researched pedagogy. John Benjamins Publishing Company, Amsterdam, pp 3–26
Underwood G, Schmitt N, Galpin A (2004) The eyes have it: an eye-movement study into the processing of formulaic sequences. In: Schmitt N (ed) Formulaic sequences: acquisition, processing and use. John Benjamins Publishing Company, Amsterdam, pp 153–172
VanPatten B, Cadierno T (1993) Input processing and second language acquisition: a role for instruction. Mod Lang J 77(1):45–57. https://doi.org/10.2307/329557
Vercellotti ML, Juffs A, Naismith B (2021) Multiword sequences in English language learners’ speech: the relationship between trigrams and lexical variety across development. System 98:102494. https://doi.org/10.1016/j.system.2021.102494
Verspoor M, Schmid MS, Xu X (2012) A dynamic usage based perspective on L2 writing. J Second Lang Writ 21(3):239–263. https://doi.org/10.1016/j.jslw.2012.03.007
Wang C (2015) Construction, constructional context and L2 learning. Mod Foreign Lang 38(3):357–365
Wang L, Qian J (2009) A corpus-based study on chunk patterns of Chinese EFL public speakers. Foreign Lang Res 32(2):115–120. https://doi.org/10.16263/j.cnki.23-1071/h.2009.02.031
Wang Q (2019) Ensuring both conventionality and productivity: the collocation-as-the-default model of language use. Mod Foreign Lang 42(1):72–84
Wei N (2004) A preliminary study of the characteristics of Chinese learners’ spoken English. Mod Foreign Lang 27(2):140–149. https://doi.org/10.3969/j.issn.1003-6105.2004.02.004
Wolter B, Yamashita J (2015) Processing collocations in a second language: a case of first language activation? Appl Psycholinguist 36(5):1193–1221. https://doi.org/10.1017/S0142716414000113
Wood D (2010) Formulaic language and second language speech fluency: background, evidence and classroom applications. Bloomsbury Publishing, London
Wray A, Perkins MR (2000) The functions of formulaic language: an integrated model. Lang Commun 20(1):1–28. https://doi.org/10.1016/S0271-5309(99)00015-4
Yi W (2018) Statistical sensitivity, cognitive aptitudes, and processing of collocations. Stud Second Lang Acquis 40(4):831–856. https://doi.org/10.1017/S0272263118000141
Yi W (2022) Processing of novel L2 compounds across repeated exposures during reading: a growth curve analysis. Appl Psycholinguist 43(3):551–579. https://doi.org/10.1017/S0142716422000017
Yi W, Man K, Maie R (2023) Investigating first and second language speaker intuitions of phrasal frequency and association strength of multiword sequences. Lang Learn 73(1):266–300. https://doi.org/10.1111/lang.12521
Yi W, Zhong Y (2024) The processing advantage of multiword sequences: a meta-analysis. Stud Second Lang Acquis 46(2):427–452. https://doi.org/10.1017/S0272263123000542


Acknowledgements

The authors wish to extend their heartfelt thanks to the National Social Science Foundation of China for the financial support provided under Grant Number 20BYY209. This funding has been instrumental in enabling the research and composition of this study.

Author information

Authors and Affiliations

School of International Studies, Northeast Normal University, Changchun, China
Huiping Zhang & Xingzuo Wang

Authors

Huiping Zhang
Xingzuo Wang

Contributions

HZ led the study’s conception and design, managed data collection and analysis, and was responsible for drafting and refining the paper. XW verified experimental procedures, assisted with data analysis, and provided technical feedback during paper revisions.

Corresponding author

Correspondence to Huiping Zhang.

Ethics declarations

Competing interests

The authors of this article declare that they have no financial or personal relationships that could potentially bias their work or influence their interpretation of the results. Specifically, the authors have no financial interests or relationships with any organizations that might have an interest in the submitted work. Additionally, the authors have no personal relationships with any individuals who might have an interest in the submitted work. Furthermore, the authors declare that they have no other conflicts of interest to disclose, including any non-financial interests that could be perceived as having an influence on the research or its interpretation. The authors confirm that this article is an original work and has not been previously published, nor is it currently under consideration for publication elsewhere. We hope that this Declaration of Interest Statement meets the requirements of your journal, and we look forward to the opportunity to share our research with your readership.

Ethical approval

The present study, which investigates the use of MWEs by Chinese learners of English within The Spoken Corpus of Chinese Learners of English (SCCLE), was conducted in accordance with ethical standards. Participation in the study was entirely voluntary, with all participants being fully informed about the aims and objectives of the SCCLE. The SCCLE was developed with the explicit consent of the learners involved, and all data have been appropriately anonymized to protect the privacy and confidentiality of the participants. The research conducted aligns with the intended purpose of the SCCLE, which is to provide a resource for educational and research purposes related to Chinese learners’ English speaking skills. As such, no additional ethical approval was required for this study, as it is based on the pre-existing, ethically approved corpus. The integrity of the ethical principles underlying the SCCLE has been maintained throughout the research process.

Informed consent

The study investigating the use of MWEs by Chinese learners of English within The Spoken Corpus of Chinese Learners of English (SCCLE) was conducted in strict adherence to ethical guidelines. The SCCLE, which serves as the primary data source for this research, was established with the explicit consent of the learners whose data it contains. Participants were provided with comprehensive information about the corpus’s purpose and scope, and their engagement in the corpus’s creation was voluntary. All data utilized in this study have been meticulously anonymized to safeguard participant privacy and confidentiality.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Zhang, H., Wang, X. Developmental features of multi-word expressions in spoken discourse by Chinese learners of English. Humanit Soc Sci Commun 11, 1663 (2024). https://doi.org/10.1057/s41599-024-04206-8


Received: 11 April 2024
Accepted: 25 November 2024
Published: 18 December 2024
Version of record: 18 December 2024
DOI: https://doi.org/10.1057/s41599-024-04206-8
