A lexical network approach to second language development

Chen, Heng

doi:10.1057/s41599-023-02151-6

Download PDF

Article
Open access
Published: 25 October 2023

A lexical network approach to second language development

Heng Chen ORCID: orcid.org/0000-0003-3678-9827¹

Humanities and Social Sciences Communications volume 10, Article number: 735 (2023) Cite this article

3766 Accesses
7 Citations
2 Altmetric
Metrics details

Subjects

Abstract

This study investigates lexical development in second language (L2) learning from the perspective of complex dynamic system theory (CDST) using a complex network method. Based on authentic written output texts from L2 Chinese learners of different proficiency levels and language backgrounds, we successfully differentiate between different proficiency levels using a bi-gram lexical network model at a corpus level. A more in-depth investigation reveals that when compared to traditional lexical complexity indices, such as average word length and hapax legomena percentage (though Guiraud proves to be a robust predictor), the lexical network indices, such as network density and network clusters, provide a more profound understanding of L2 proficiency distinctions and a more precise approximation of the target language. Moreover, our findings illuminate the consistent manifestation of complex network characteristics within L2 Chinese lexical networks across all proficiency levels. Additionally, word association features, encompassing more than just word frequency information, provide comprehensive properties of the interlanguage system, as supported by their information gain values. We argue that studies within the CDST framework should integrate both lexical frequency and lexical network features to gain a comprehensive understanding of L2 lexical development.

The developmental trajectories of L2 lexical-semantic networks

Article Open access 25 March 2023

Developmental features of Tibetan students’ Chinese L2 mental lexicon

Article Open access 28 March 2025

Lexibank, a public repository of standardized wordlists with computed phonological and lexical features

Article Open access 16 June 2022

Introduction

Interlanguage is a unique linguistic system that learners develop in the process of second language (L2) acquisition, which reflects features from both their first language (L1) and the target language (Selinker, 1972). In the past two decades, Complex Dynamic System Theory (CDST), which stems from Complex System Sciences (Larsen-Freeman, 1997; Holland, 2006), has gained increasing attention in investigating the complexities and dynamics of L2 development, shedding new light on how human language structures emerge from dynamic language usages (Ellis, Larsen-Freeman (2009a); Chen & Wang, 2023).

Under the CDST framework, previous empirical studies have measured lexical and syntactic complexity indices such as vocabulary richness, mean lengths of language units, and fine-grained phrasal and clausal features (Wolfe-Quintero et al., 1998; Kyle & Crossley, 2018; Chen & Xu, 2019), revealing several critical properties of dynamic interlanguage systems such as non-linearity, phase transition, and self-organization (De Bot et al., 2007; Hiver et al., 2022). However, two major challenges remain: how to measure linguistic features more consistently (as highlighted by Deng et al., 2021) and how to explore interactions between language sub-systems in L2 development. For instance, Verspoor et al. (2012) demonstrated that word length and word frequency remained unchanged across five different proficiency levels, while the Guiraud index (root-Type Token Ratio, root-TTR) showed variations. Additionally, chunks have emerged as one of the most accurate predictors of proficiency level. Therefore, further examinations, including the exploration of new indices, are necessary across various second languages, proficiency levels, and language backgrounds.

Complex network sciences, which emerged at the end of the 20th century, may provide a possible solution. A network is a graph composed of nodes and links, where nodes represent the basic elements of a system, and links represent the interactions between them (Liu, 2008a). Real systems are usually complex network systems with non-trivial topological features that are not found in simple networks (e.g., random networks), such as small-worldness and scale-freeness (Watts & Strogatz, 1998; Barabási & Albert, 1999). Language is a real complex adaptive system (CAS) that can be modeled as a complex network (Cong & Liu, 2014). Linguistic networks at various levels can be seen as language sub-systems that have been constructed and explored by researchers, such as phonological networks, morphological networks, phrasal networks, syntactic networks, and semantic networks (Cong & Liu, 2014; Chen et al., 2018; Dattner et al., 2022).

The complex network model is intimately linked to the complex systems theory, which posits that a system is a group of interconnected and interdependent elements that behave in accordance with a set of rules to constitute a cohesive whole (Backlund, 2000). Unlike the traditional connectionism paradigm in cognitive sciences (Williams, 2006), the complex network approach (also known as Cognitive Network Science in cognitive sciences) can not only capture the connections between relevant properties of the language system and human cognition (Sizemore et al. 2018) but also provide practical tools and methods to process and visualize large-scale materials and data (Mehler et al., 2016). For example, the complex network approach has found extensive applications in diverse areas such as authorship attributions (Machicao et al., 2018), stylometry analyses (Quispe et al., 2021), and investigations into language complexity (Cong & Liu, 2014).

Moreover, in recent years, the complex network approach has gained significant popularity in the field of language development and acquisition, particularly in psycholinguistics (Castro & Siew, 2020; Ciaglia et al., 2023). In L2 studies, the network approach has been applied to analyze syntactic structures based on dependency treebanks, demonstrating its ability to differentiate between L2 learners of different proficiency levels. Recent data-driven studies have used network analyses to investigate overall syntactic development in L2 learners, providing new insights into applied linguistics research. For example, Jiang et al. (2019) applied a syntactic network approach and discovered that English L2 learning is characterized by a gradual approximation to the target language rather than a sudden emergence as in L1 development. In another study, Hao et al. (2021) found varying regularities in the development of a syntactic network between interlanguage and native language acquisition, highlighting the network parameters’ ability to differentiate interlanguage modalities and indicate L2 proficiency in L2 Chinese.

However, the syntactic network approach particularly relies on dependency-treebank resources, which require a lot of manual work to ensure parsing accuracy, making it difficult to use frequently to monitor learners’ interlanguage development (especially in the Chinese context). Additionally, previous studies have shown that the annotation scheme used can significantly affect the results of dependency measures (Yan & Liu, 2022).

Given that complex linguistic structures can emerge from the co-occurrence of linguistic units in actual language use (Bybee & Hopper, 2001; Cong & Liu, 2021), this paper adopts a simpler word co-occurrence network approach to measure L2 Chinese development. Word co-occurrences, or bi-gram words, are increasingly being recognized as an important feature for distinguishing L2 learners’ proficiencies under the phraseological complexity construct in recent years (Arnon & Christiansen, 2017; Garner et al., 2019; Vandeweerd et al., 2022). Moreover, unlike phraseological complexity indices that need a reference corpus, the word co-occurrence network approach does not require extra resources.

Previous studies consistently highlight the reliability of bigrams or word co-occurrences as predictors of L2 proficiency, with higher proficiency levels being associated with more established and conventionalized word associations, as evidenced by indices such as TTR-bigram (Arnon & Christiansen, 2017; Bestgen & Granger, 2014; Durrant & Schmitt, 2009; Kyle et al., 2015). Furthermore, the predominant storage of linguistic information in the form of multi-word expressions or chunks/n-grams has led researchers to extensively investigate their role in L2 research, emphasizing the significance of rapid chunking of language input for language proficiency levels (Verspoor et al., 2012; Ellis, 2014; Christiansen & Chater, 2016; Saito, 2020). Notably, Liu (2008b)’s study based on 20 languages revealed that approximately 50% of syntactic dependencies consist of adjacent words, specifically bigrams, further highlighting the relevance of bigrams or word co-occurrences in capturing syntagmatic associations. These associations play a critical role in linguistic pattern recognition and language learning within the usage-based perspective of language development (Ellis et al., 2016; Schmid, 2020).

In addition, language processing models based on spreading activation, a fundamental concept in cognitive psychology (Collins & Loftus, 1975), necessitate the representation of language as a network (Hudson, 2007). Consequently, language can be viewed as a cognitive network (Cong & Liu, 2014). Linguistic networks based on word associations have the capacity to model various levels of language system architecture (Liu & Cong, 2014). Word co-occurrence networks, constructed using bigrams, have proven valuable in capturing the intrinsic relationships among numerous words within a relatively short period, providing an advantageous approach to revising and extending theories of lexical acquisition (Meara, 2009; Chen, 2021).

To illustrate the advantages of the bigram-network approach, consider the following example, a dialog between a native speaker (NS) and a non-NS (NNS):

NS: Are you a cat person or a dog person?

NNS: I’m a dog person. I really like dogs.

Traditional indices based on word frequency treat the text as a collection of isolated words, disregarding the meaningful relationships between words in their specific order. For both the NS and NNS texts, the TTR value is 0.7778. However, by analyzing bigrams, which capture adjacent word pairs, we can uncover the syntagmatic relationships between words. In the given example, bigrams like “cat person” and “dog person” are deeply entrenched in the mental lexicon of NSs and hold significant importance in language communication and language learning. These bigrams can exemplify the alignment or priming effects observed between the NS and NNS, as evidenced by the NNS’s repetition of the bigram “dog person”. Such alignment effects are known to have significance in L2 learning (Wang & Wang, 2015), suggesting that the NNS is attuned to and influenced by the language patterns and choices of the NS. The presence of alignment or priming effects can play a crucial role in facilitating language acquisition and the development of linguistic proficiency in L2 learners (Chen, 2021).

Furthermore, the lexical network approach offers a more comprehensive understanding beyond the analysis of individual bigrams. In the aforementioned example, both the NS and the NNS produced texts with TTR-bigram values of 1, indicating no apparent differences at first glance. However, upon closer examination of their respective lexical networks, distinct patterns emerge, as evident from network parameters such as node degrees and network density. Notably, in the NS’s lexical network, the shortest path length between “cat” and “dog” is 2, suggesting that the lexical network provides insights into the practical usage patterns of words in contexts, going beyond mere sequential information in sentences.

Therefore, this study aims to investigate the effectiveness of the lexical network approach in differentiating L2 Chinese proficiency levels, comparing it with traditional lexical complexity indices. It is worth noting that research on the complexity of L2 Chinese based on CDST has only emerged in recent years, with relatively limited studies in this area. Hu (2021) found that traditional syntactic complexity measures based on clauses and T-units have limited explanatory power for writing performance, while collocation-based phrase-level complexity measures can effectively predict writing proficiency. Additionally, studies on traditional lexical complexity indices have demonstrated their relevance under specific proficiency levels, registers, or language backgrounds, including measures of lexical diversity (Uber index, Guiraud index) and lexical sophistication (average word length (AWL), low-frequency words, lexical profile) (Wu et al., 2019; Qi & Liao, 2019; Chen & Xu, 2019; Liao, 2021). Specifically, this study will carefully select lexical network indices to compare with those lexical-complexity indices (including TTR-bigram and lexical density (LD), refer to Section “Network parameters and lexical complexity indices”). This study aims to address the following three research questions:

RQ 1: Are L2 Chinese lexical networks complex networks?

RQ 2: Can lexical network parameters distinguish different proficiency levels in L2 Chinese?

RQ 3: Are lexical network parameters more effective than traditional linguistic complexity indices?

Materials and methods

The corpora profile

We use the Guangwai-Lancaster Chinese Learner Corpus (GLCLC), a 1.2 million-word balanced L2 learner corpus of Mandarin Chinese^{Footnote 1}, as our data source. This corpus consists of spoken and written texts produced by 1492 Chinse L2 learners of three proficiency levels (i.e., beginner, intermediate, and advanced levels) from different native language backgrounds.

As shown in Table 1, we chose to study L2 learners from two different language-family backgrounds: Thai and Romance languages (including French and Spanish)^{Footnote 2}. We also included native Chinese texts from the Lancaster Corpus of Mandarin Chinese^{Footnote 3} for comparison. Since text genres can influence the quantitative results, we chose narrative texts from the written sub-corpus, with topics on my life, my story, or a memory trip. To ensure comparability between different language proficiencies, each sub-corpus consists of about 10,000 Chinese characters. Notably, in the subsequent analyses, we designate Thai/Romance 1, 2, and 3 to represent beginner, intermediate, and advanced levels, respectively.

Table 1 Profile of the sub-corpora chosen from the Guangwai–Lancaster Chinese Learner Corpus.

Full size table

Network parameters and lexical complexity indices

After selecting learner texts, we used NlPIR-ICTCLAS^{Footnote 4} to segment words and annotate part of speech. The network parameters were calculated using Pajek^{Footnote 5}. To compare the classification effect of network parameters, we also calculated lexical complexity indices using CTAP^{Footnote 6}. All network parameters and lexical complexity indices are briefly introduced in the following:

Network parameters

The network parameters that we are most interested in are those closely related to language proficiencies (see Jiang et al., 2019; Hao et al., 2021; Chen, 2021), including clustering coefficient (C), average path length (<l>), and clusters (cl), etc. For comparison, we also investigated some common network parameters, such as network size (n), number of edges (m), network density (ND), and network centrality (NC). Overall, we calculated a total of seven network parameters.

Figure 1 depicts two bigram networks constructed based on the dialog example provided in the Introduction section:

NS: Are you a cat person or a dog person?

NNS: I’m a dog person. I really like dogs.

**Fig. 1: Examples of lexical networks.**

In Fig. 1, the left plot represents the bigram network constructed from the text produced by the NS, while the right plot represents the bigram network from the text produced by the NNS. It is important to note that all the words in the network have been lemmatized to ensure consistent representation and analysis. However, it should be mentioned that this lemmatization process is not applicable to Chinese words, as they do not undergo inflectional changes in the same way as many other languages.

Network size (n) refers to the number of nodes in the network, which represents the words (types) in this context. In Fig. 1, both the NS’s and NNS’s networks have seven nodes.

The number of edges (m) represents the connections between the nodes, specifically the number of distinct bigram types. In Fig. 1, the NS’s network has 8 edges, while the NNS’s network has seven edges.

ND is a measure ranging from 0 to 1 that indicates the proportion of realized connections out of all possible connections in the network. For an undirected network with n nodes and m edges, the density is calculated as 2m/(n * (n − 1)). It provides insights into the overall density of the network. In language networks, sparsity is commonly observed, indicating a low probability of connections between random words. In this case, the ND values for the NS and NNS networks are 0.3809 and 0.3333, respectively. It seems that the ND value decreases in L2 Chinese syntactic development (see Hao et al., 2021).

The centrality of a node provides insights into its significance within the network. In this study, we focus on the most commonly used centrality measure, degree centrality, which represents the number of connections a node has. In Fig. 1, the node “person” has degrees of 3 and 1 in the NS and NNS networks, respectively. The centrality of the entire network is referred to as network centralization (NC), and it is roughly calculated as (K_max/n) − ND, where K_max represents the largest degree in the network. Degree centralization specifically indicates the strength of hubs or central nodes within the network. In language networks, degree centralization provides information about the prominence of nodes with high degrees. A higher degree of centralization indicates stronger combinational abilities of these nodes, whereas a lower degree of centralization suggests weaker combinational abilities. The NC value shows a growing trend in L2 syntactic development (see Hao et al., 2021).

The clustering coefficient (C) is a measure of how likely it is that nodes in a network form clusters or groups. It quantifies the probability that two neighbors of a node are also connected. For a node i with k_i neighbors, the maximum number of edges between these neighbors is k_i(k_i − 1)/2. If E_i is the actual number of edges between these neighbors, then the clustering coefficient C_i of node i is 2E_i/k_i(k_i − 1). The clustering coefficient C of the whole network is the average of C_i for all nodes. A study based on written Chinese lexical networks reveals that the C value decreases as the language develops (Chen et al., 2018), which is consistent with the findings in L2 syntactic development (see Hao et al., 2021).

The average path length <l> measures the mean of the shortest distances between any two nodes in the network. For a pair of nodes i and j, the distance L_ij is the number of edges in the shortest path connecting them. In Fig. 1, both the NS and NNS networks have a distance of 2 between the nodes “a” and “person”. It is important to note that, in general, the average path length values are typically not greater than 6. For instance, Chen et al. (2018) revealed an average path length of 4.746 for modern Chinese. This suggests that, on average, it requires a relatively small number of steps or connections to navigate between words within the language network.

The average path length is a valuable measure for assessing the efficiency and interconnectedness of a network. Smaller average path lengths signify higher connectivity and accessibility between nodes, facilitating efficient information flow and communication within the network. However, it should be noted that as the language system becomes more complex, the <l> value seems to increase to some extent, as evidenced by Chen et al. (2018). This implies that in more intricate lexical networks, the average path length may be slightly larger, indicating a potentially increased number of steps needed to navigate between words or concepts.

If a network has a small average path length <l>(<l>–<l_rnd>), and a large clustering coefficient (C), then we can call it a Small-World network. However, this definition is rather qualitative because it is difficult to decide whether the observed value is small or large. In general, it requires comparing the observed <l> and (C) to that of a stochastic network. The stochastic network model used in this paper is the ER model proposed by Erdos & Rényi (1960). Humphries & Gurney (2008) defined a quantitative parameter, the S-Small-worldness, for measuring the Small World feature. The formula is as follows:

$$S = \frac{{C/C_{rnd}}}{{\langle{l}\rangle/\langle{l}_{rnd}\rangle}}$$

In this formula, C denotes the network clustering coefficient, C_rnd represents the clustering coefficient of the ER random network with the same number of nodes and the same average degree; <l> denotes the average minimum path, and <l_rnd>represents the average path length of the ER random network. When the S is greater than 1, the network can be considered as a small-world network. Moreover, if the network is also a scale-free network (as revealed by the power-law distribution of the degrees of nodes), it can be called a complex network (Chen et al., 2018).

A cluster in a network refers to a group of nodes that are highly connected to each other within the group, forming a cohesive substructure. Clustering is the process of identifying and delineating these groups or communities within a network. The size of a cluster, represented by the cluster value (Cl), can indicate the extent of interconnectivity and cohesion within the linguistic network. In the context of lexical networks, a larger cluster value typically suggests a richer linguistic conceptual system. This is because a larger cluster implies that there are more tightly-knit groups of words or concepts that are interconnected and share strong associations within the network. These clusters represent cohesive semantic or syntactic domains, indicating a greater depth and complexity in the organization and structure of language.

Lexical complexity indices

The Uber index (U), proposed by Dugast (1978), is commonly used to measure lexical diversity or variation in L2 development studies. It is computed as follows:

$$U = \left( {\frac{{\left( {\rm{log}}\,{\rm{Tokens}} \right)^2}}{{{\rm{log}}\,{\rm{Tokens}}} - {{\rm{log}}\,{\rm{Types}}}}} \right)$$

The Guiraud index, another commonly used index to assess the lexical richness or diversity of a text (Guiraud, 1954), is calculated by dividing the number of unique words (types) in the text by the square root of the total number of words (tokens):

$${Guiraud} = \frac{\rm{Types}}{{\sqrt {\rm{Tokens}} }}$$

The Guiraud index takes into account the impact of text length on lexical diversity. By dividing the number of unique words by the square root of the total words, it provides a normalized measure that allows for fair comparisons across texts of different lengths. A higher Guiraud index indicates a greater lexical diversity, suggesting a wider range of vocabulary used in the text.

AWL is the number of syllables per word in a text. It is obtained by dividing the total number of syllables by the total number of words in the text. AWL can indicate the sophistication of a text.

LD measures the linguistic sophistication of a text based on the proportion of content words (lexical units) and function words (grammatical units). The formula for calculating the LD is to divide the number of lexical items by the total number of words in the text.

A hapax legomena (HL) is a word that appears only once in a text or corpus. HL refers to the count of such words in a text or corpus.

Long word percentage (LWP) is the ratio of words with three or more syllables to the total number of words in a text or corpus.

The type-token ratio of bigrams (TTR-Bigram), or simply the ratio of the number of unique bigrams to the total number of bigrams in a text or corpus, is used to measure the diversity or variation of bigrams within the text.

Statistical analysis

We used hierarchical clustering with SPSS 20 to test whether the indices described above can differentiate L2 learners’ language proficiency levels. The dendrograms based on Euclidean distance show the levels of similarity at which the clusters were formed.

We also computed the information gain of each index to assess how well a feature predicts language proficiency. Information gain is the amount of information obtained about the target variable (proficiency) when a feature (network parameter or lexical complexity index) is known. It measures the reduction in uncertainty about the target variable after splitting the data by the feature (see Witten & Frank, 2005). This reduction is quantified by comparing the entropy of the target variable before and after the split. The difference in entropy is the information gain.

Here is the formula for computing information gain (taking the index of AWL as an example):

$${{{\mathrm{IG}}}}\left( {{{\mathrm{X}}}} \right) = {{{\mathrm{H}}}}\left( {{{\mathrm{Y}}}} \right) - {{{\mathrm{H}}}}\left( {{{{\mathrm{Y}}}}\left| {{{\mathrm{X}}}} \right.} \right)$$

where IG(AWL) is the information gain of feature AWL, H(Y) is the entropy of the target variable (language proficiency), and H(Y | AWL) is the conditional entropy of the target variable given feature AWL (i.e., the entropy of language proficiency after splitting the data based on the values of the Average Word Length index).

The information gain tells us how much information about the target variable (language proficiency) is gained by knowing the value of the Average Word Length index. A higher information gain indicates that the Average Word Length index is more useful in predicting language proficiency.

Results

We report the results of sub-corpora classification based on the lexical network parameters and lexical complexity indices. Further analysis was also conducted using information gain measures of each predictive variable.

The Lexical network approach results

We constructed and measured the seven word-co-occurrence networks with Pajek. Figure 2 presents the lexical network plots, with weighted nodes based on degree centralization and colored node groups based on degree partitions.

**Fig. 2: Lexical networks based on the learner corpora.**

As depicted in Fig. 2, the lexical networks of all seven sub-corpora, similar to the syntactic networks of L2 learners (Jiang et al., 2019), exhibit characteristics of complex networks, which differ significantly from the tree-like networks observed in young children (less than 2 years old), as described by Corominas-Murtra et al. (2009). Additionally, it appears that higher-level networks tend to have fewer hub words. To verify this, we conducted a count of words with a degree of at least 100 for each network. The results confirm this observation to some extent, with Thai 1 to 3 containing 8, 7, and 7 such words and Romance 1 to 3 containing 9, 8, and 7 such words. In contrast, Native Chinese possess only 6 such words.

Furthermore, the S-Small-worldness values in Table 2 corroborate that all seven networks in Fig. 2 are small-world networks. As these networks also exhibit scale-free characteristics (as evidenced by the power law distributions of node degrees), they can be classified as complex networks. Specific network parameters are detailed in Table 2.

Table 2 Network parameter results.

Full size table

The seven network parameters in Table 2 reveal some patterns regarding language proficiency level. As learners’ proficiency level increases, they produce more nodes, edges, and lexical community clusters (as evidenced by n, m, and Cl). At the same time, the global network expands (as indicated by average path length), and the local network becomes spare (as indicated by network density, network centralization, and clustering coefficient). It should be noted that corpus size can significantly influence these network parameter patterns. Since the corpus sizes are all around 10,000 Chinese characters in this study, we may infer that as L2 learners’ language proficiency increases, their vocabulary increases at the same time, then their bi-gram types also increase. However, despite the overall expansion of the network, the local network becomes sparse due to the limitations imposed by the corpus size, leading to fewer local connections between words. As the network grows, the immediate connections between words within local neighborhoods become insufficient to sustain high local network density and clustering, ultimately causing a decrease in these measures. This suggests that the network’s local structure becomes less interconnected and clustered as the language proficiency of L2 Chinese learners increases, possibly due to the limited size of the analyzed corpus.

The above analysis suggests that the lexical network parameters can be used to classify L2 Chinese learners of different proficiencies. Figure 3 shows the results of hierarchical clustering with standardized lexical parameter values.

**Fig. 3: Hierarchical clustering results based on lexical network parameters.**

Figure 3 shows that the clustering results are good, with each proficiency level clustered together and finally grouped with Native Chinese. To determine which parameter variable contributes most to the hierarchical clustering, we calculated the information gain of each variable. The results are shown in Table 3.

Table 3 Information gains of the network-based indices.

Full size table

Based on the information gain values in Table 3, network size (n), number of edges (m), ND, and clustering coefficient (C) all contribute significantly to the clustering, performing better than NC and average path length (<l>). As for clusters (Cl), it contributes the least to the clustering.

The lexical complexity approach results

To compare the classification effect of network-based indices with traditional indices, we measured seven lexical complexity indices, including chunk-based indices, i.e., TTR-bigram, as displayed in Table 4.

Table 4 Values based on the Lexical complexity indices.

Full size table

As indicated in Table 4, intermediate-advanced L2 learners demonstrate larger lexical diversity (as indicated by U and Guiraud indices) and bigram diversity (TTR-bigram), along with shorter word length (AWL), and an increased number of Hapax words but a decreased percentage of longer words (longer words percentage) compared to beginners. However, there are no significant changes in lexical density. Moreover, when compared to Native Chinese, intermediate and advanced L2 learners exhibit more similar linguistic features, diverging from beginners.

Based on the seven lexical complexity indices, we used the hierarchical clustering method again to classify the corpora. The clustering results are presented in Fig. 4.

**Fig. 4: Hierarchical clustering results based on lexical complexity indices.**

The clustering results in Fig. 4 show that only beginners and Native Chinese are well grouped, while intermediate and advanced learners are grouped based on language backgrounds but not language proficiency. This means that the seven lexical complexity indices work well for low-level and extremely high-level learners (Native Chinese) but not for distinguishing intermediate and advanced L2 learners.

To assess which lexical complexity index has stronger predictive power, we also calculated the information gain of each variable, as shown in Table 5.

Table 5 Information gains of the lexical complexity indices.

Full size table

Based on the information gain values in Table 5, it is evident that lexical diversity (Guiraud) and bigram diversity (TTR-bigram) exhibit the strongest predictive power, while lexical sophistication (as indicated by AWL and LWP) shows a slightly weaker predictive ability. On the other hand, lexical diversity indices U, LD, and HL demonstrate the least predictive power among the analyzed linguistic features.

Discussions and conclusions

Language is a complex system with a probabilistic nature and complexity at the lexical, syntactic, and semantic levels (Cong & Liu, 2014). Traditional linguistic research methods often struggle to capture the intricacies of language systems. However, the complex network approach offers a potential solution. Previous applied-linguistic studies have focused primarily on specific semantic or syntactic structures in L2 complex systems (Ellis, Larsen‐Freeman (2009b); Ellis, 2014; Fogal, 2020). Researchers have found that the syntactic networks of L2 English learners exhibit small-world and scale-free properties similar to those of other complex networks (Jiang et al., 2019). Nevertheless, the lexical networks of L2 Chinese learners have received limited attention in this area of research.

This study explored word co-occurrence networks, or lexical networks, to measure L2 Chinese development, especially lexical development. We compared the clustering results based on lexical network parameters with traditional lexical complexity indices. The findings can be summarized as follows. Firstly, the L2 Chinese lexical networks are typical complex networks. Secondly, the lexical-network-based method can accurately classify the different proficiency levels, i.e., Beginner, Intermediate, and Advanced, as well as Native Chinese. Among the individual network parameters, network size (n), number of edges (m), ND, and clustering coefficient (C) have stronger predictive power than the others according to information gain values. Thirdly, the lexical-complexity-based method can distinguish beginners from intermediate-advanced learners, as well as native Chinese. However, it can hardly distinguish the intermediate from the advanced learners. Seen from the information gain values, lexical diversity indices Guiraud and bigram diversity indices TTR-bigram predict better than the others. Lastly, the classifications based on two different approaches show that the lexical-network-based method performs better than the lexical-complexity-based one. Moreover, it seems that the lexical-network-based method is not influenced by the learners’ native language backgrounds, while the other method may be affected at the intermediate and advanced levels.

Lexical complexity or richness is a multifaceted concept that encompasses lexical diversity, lexical sophistication, lexical density, and lexical errors (Read, 2000). Lexical diversity measures the variety of active vocabulary in a text, while lexical sophistication measures the proportion of relatively unusual or advanced words in a text. Our findings indicate that among the various aspects of lexical richness, the lexical diversity indices, as represented by Guiraud and TTR-bigram, exhibit greater effectiveness compared to the lexical sophistication indices (AWL and LWP) and the LD. These results suggest that measures of lexical diversity provide more valuable insights into the linguistic richness of the text or corpus under consideration. However, based on the lexical network classification results, we also propose that the co-occurrence relationships between words or word associations (as revealed by network density, clustering coefficient, and network cluster, etc.) are more informative than simple lexical frequency features.

Both the lexical-complexity-based approach and the lexical-network-based approach demonstrate that, besides isolated words, bigrams and word associations are crucial linguistic features for distinguishing language proficiency levels. Furthermore, while lexical networks are constructed based on bigrams, the network parameters encompass much deeper information beyond mere sequential combinations. This is evident in the analysis of word clustering and word connections, which provide valuable insights into the underlying structure and organization of the lexical network. Therefore, we argue that the framework of CDST should consider both lexical frequency features and lexical network features to better understand lexical richness and L2 development.

Recently, the multi-layer complex network approach has also risen in language cognition studies (Zaharchuk & Karuza, 2021), e.g., L1 development, L2 development, psycholinguistic relevant studies, etc. Based on these studies, we can see that the multi-layer complex network approach may offer a possible solution to L2 development studies in a more consistent and integrated way. Moreover, since the multi-layer network approach can simultaneously model the interactions between language users as well as between language units, the critical social structure effects in language learning and development (Duff, 2019) can be scrutinized. Taking the dialog between NS and NNS aforementioned for an example, a multi-level system can not only model the dynamic linguistic structures but also capture the interactions between the two speakers, probing into language acquisition mechanisms such as alignment effects (Wang & Wang, 2015; Chen, 2021). Furthermore, we can even analyze the role of generative AI (such as Chat-GPT) in facilitating language learning, extending the effect of data-driven learning.

In conclusion, the complex network approach provides a promising avenue for understanding the complexities of L2 development. Through data-driven analysis, this approach can reveal global and local patterns in the linguistic system of L2 learners, shedding light on the development of specific aspects of language use. Moreover, the emerging multi-layer networks approach can provide an even greater depth of analysis, offering a more comprehensive understanding of the complexities and dynamics of the multi-level language system. Future studies should continue to explore and refine these methods, as they hold great potential for advancing our understanding of L2 development.

Data availability

All data generated during this study are included in this article. The dataset (corpus) analyzed during the current study is available in the Guangwai-Lancaster Chinese Learner Corpus (https://www.sketchengine.eu/guangwai-lancaster-chinese-learner-corpus/) and the Lancaster Corpus of Mandarin Chinese (https://www.lancaster.ac.uk/fass/projects/corpus/LCMC/).

Notes

https://www.sketchengine.eu/guangwai-lancaster-chinese-learner-corpus/
The sub-corpora and 2-gram data are available for access at https://github.com/chenheng1003/Lexical-Network.git
https://www.lancaster.ac.uk/fass/projects/corpus/LCMC/
http://ictclas.nlpir.org/
http://mrvar.fdv.uni-lj.si/pajek/
https://ctap.litmind.ink/

References

Arnon I, Christiansen MH (2017) The role of multiword building blocks in explaining L1–L2 differences. Top Cogn Sci 9(3):621–636. https://doi.org/10.1111/tops.12271
Article PubMed Google Scholar
Backlund A (2000) The definition of system. Kybernetes 29(4):444–451. https://doi.org/10.1108/03684920010322055
Article MATH Google Scholar
Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512. https://doi.org/10.1126/science.286.5439.509
Article MathSciNet MATH ADS PubMed Google Scholar
Bestgen Y, Granger S (2014) Quantifying the development of phraseological competence in L2 English writing: an automated approach. J Second Lang Writ 26:28–41. https://doi.org/10.1016/j.jslw.2014.09.004
Article Google Scholar
Bybee J, Hopper P (2001) Frequency and the emergence of linguistic structure. John Benjamins, Amsterdam
Castro N, Siew CS (2020) Contributions of modern network science to the cognitive sciences: Revisiting research spirals of representation and process. Proc R Soc A 476(2238):20190825. https://doi.org/10.1098/rspa.2019.0825
Article MathSciNet MATH ADS PubMed PubMed Central Google Scholar
Chen H (2021) A quantitative and network approach to alignment effects in L2 continuation tasks. Chin J Appl Linguisti 44(3):399–416. https://doi.org/10.1515/CJAL-2021-0025
Article ADS Google Scholar
Chen H, Wang Y (2023) How does language evolve as a multi-level system? A quantitative exploration of written Chinese. Lang Sci 98(4):1–14. https://doi.org/10.1016/j.langsci.2023.101554
Article Google Scholar
Chen H, Xu H (2019) Quantitative linguistics approach to interlanguage development: a study based on the Guangwai-Lancaster Chinese Learner Corpus. Lingua 230:102736. https://doi.org/10.1016/j.lingua.2019.102736
Article Google Scholar
Chen H, Chen X, Liu H (2018) How does language change as a lexical network? An investigation based on written Chinese word co-occurrence networks. PloS ONE 13(2):e0192545. https://doi.org/10.1371/journal.pone.0192545
Article CAS PubMed PubMed Central Google Scholar
Christiansen MH, Chater N (2016) The now-or-never bottleneck: a fundamental constraint on language. Behav Brain Sci 39:e62. https://doi.org/10.1017/S0140525X1500031X
Article PubMed Google Scholar
Ciaglia F, Stella M, Kennington C (2023) Investigating preferential acquisition and attachment in early word learning through cognitive, visual and latent multiplex lexical networks. Physica A 612:128468. https://doi.org/10.1016/j.physa.2023.128468
Article Google Scholar
Collins AM, Loftus EF (1975) A spreading-activation theory of semantic processing. Psychol Rev 82(6):407–428. https://doi.org/10.1037/0033-295X.82.6.407
Article Google Scholar
Cong J, Liu H (2014) Approaching human language with complex networks. Phys Life Rev 11(4):598–618. https://doi.org/10.1016/j.plrev.2014.04.004
Article ADS PubMed Google Scholar
Cong J, Liu H (2021) Linguistic emergence from a networks approach: the case of modern Chinese two-character words. PloS ONE 16(11):e0259818. https://doi.org/10.1371/journal.pone.0259818
Article CAS PubMed PubMed Central Google Scholar
Corominas-Murtra B, Valverde S, Solé R (2009) The ontogeny of scale-free syntax networks: phase transitions in early language acquisition. Adv Complex Syst 12(3):371–392. https://doi.org/10.1142/S0219525909002192
Article MATH Google Scholar
Dattner E, Ashkenazi O, Ravid D, Levie, R (2022) Explaining dynamic morphological patterns in acquisition using Network Analysis. Morphology 1–46. https://doi.org/10.1007/s11525-022-09394-0
De Bot K, Lowie W, Verspoor M (2007) A dynamic systems theory approach to second language acquisition. Bilingualism 10(1):7–21. https://doi.org/10.1017/S1366728906002732
Article Google Scholar
Deng Y, Lei L, Liu D (2021) Calling for more consistency, refinement, and critical consideration in the use of syntactic complexity measures for writing. Appl Linguist 42(5):1021–1028. https://doi.org/10.1093/applin/amz069
Article Google Scholar
Duff PA (2019) Social dimensions and processes in second language acquisition: multilingual socialization in transnational contexts. Mod Lang J 103:6–22. https://doi.org/10.1111/modl.12534
Article Google Scholar
Dugast D (1978) Sur quoi se fonde la notion d’étendue théoretique du vocabulaire? Le Fr Mod 46(1):25–32
Google Scholar
Durrant P, Schmitt N (2009) To what extent do native and non-native writers make use of collocations? Int Rev Appl Linguist Lang Teach 47:157–177. https://doi.org/10.1515/iral.2009.007
Article Google Scholar
Ellis NC (2014) Frequency-based accounts of second language acquisition. In Gass S, Mackey A (eds) The Routledge handbook of second language acquisition. Routledge, London, pp 193–210
Ellis NC, Larsen-Freeman D (eds) (2009a). Language as a complex adaptive system. Wiley-Blackwell
Ellis NC, Larsen‐Freeman D (2009b) Constructing a second language: analyses and computational simulations of the emergence of linguistic constructions from usage. Lang Learn 59:90–125. https://doi.org/10.1111/j.1467-9922.2009.00537.x
Article Google Scholar
Ellis NC, Römer U, O’Donnell MB (2016) Usage-based approaches to language acquisition and processing: cognitive and corpus investigations of Construction Grammar. Wiley, New York
Google Scholar
Erdos P, Rényi A (1960) On the evolution of random graphs. Publ Math Inst Hung Acad Sci 5(1):17–61
MathSciNet MATH Google Scholar
Fogal GG (2020) Investigating variability in L2 development: extending a complexity theory perspective on L2 writing studies and authorial voice. Appl Linguist 41(4):575–600. https://doi.org/10.1093/applin/amz005
Article Google Scholar
Garner J, Crossley S, Kyle K (2019) N-gram measures and L2 writing proficiency. System 80:176–187. https://doi.org/10.1016/j.system.2018.12.001
Article Google Scholar
Guiraud P (1954) Les Charactères Statistiques du Vocabulaire. Essai de méthodologie. Presses Universitaires de France, Paris
Google Scholar
Hao Y, Wang X, Wu M, Liu H (2021) Syntactic networks of interlanguage across L2 modalities and proficiency levels. Front Psychol 12:643120. https://doi.org/10.3389/fpsyg.2021.643120
Article PubMed PubMed Central Google Scholar
Hiver P, Al-Hoorie A, Evans R (2022) Complex dynamic systems theory in language learning: a scoping review of 25 years of research. Stud Second Lang Acquis 44(4):913–941. https://doi.org/10.1017/S0272263121000553
Article Google Scholar
Holland JH (2006) Studying complex adaptive systems. J Syst Sci Complex 19(1):1–8. https://doi.org/10.1007/s11424-006-0001-z
Article MathSciNet MATH Google Scholar
Hu R (2021) On the relationship between collocation-based syntactic complexity and Chinese second language writing. Yuyan Wenzi Yingyong 117(1):132–144
Google Scholar
Hudson RA (2007) Language networks: the new word grammar. Oxford University Press, New York
Google Scholar
Humphries MD, Gurney K (2008) Network ‘small-world-ness’: a quantitative method for determining canonical network equivalence. PloS ONE 3(4):e0002051. https://doi.org/10.1371/journal.pone.0002051
Article CAS ADS Google Scholar
Jiang J, Yu W, Liu H (2019) Does scale-free syntactic network emerge in second language learning? Front Psychol 10:925. https://doi.org/10.3389/fpsyg.2019.00925
Article PubMed PubMed Central Google Scholar
Kyle K, Crossley SA (2018) Measuring syntactic complexity in L2 writing using fine‐grained clausal and phrasal indices. Mod Lang J 102(2):333–349. https://doi.org/10.1111/modl.12468
Article Google Scholar
Kyle K, Crossley SA, Kim YJ (2015) Native language identification and writing proficiency. Int J Learn Corpus Res 1(2):187–209. https://doi.org/10.1075/ijlcr.1.2.01kyl
Article Google Scholar
Larsen-Freeman D (1997) Chaos/complexity science and second language acquisition. Appl Linguist 18(2):141–165. https://doi.org/10.1093/applin/18.2.141
Article Google Scholar
Liao J (2021) The effects of linguistic measures in the analysis of L2 Chinese descriptive writing. Chin Second Lang 56(2):118–145. https://doi.org/10.1075/csl.21012.lia
Article ADS Google Scholar
Liu HT (2008a) The complexity of Chinese syntactic dependency networks. Physica A 387(12):3048–3058. https://doi.org/10.1016/j.physa.2008.01.069
Article ADS Google Scholar
Liu HT (2008b) Dependency distance as a metric of Language comprehension difficulty. J Cogn Sci 9:159–191
Article Google Scholar
Liu HT, Cong J (2014) Empirical characterization of modern Chinese as a multi-level system from the complex network approach. J Chin Linguist 42(1):1–38
Google Scholar
Machicao J, Corrêa Jr EA, Miranda GH, Amancio DR, Bruno OM (2018) Authorship attribution based on life-like network automata. PloS ONE 13(3):e0193703. https://doi.org/10.1371/journal.pone.0193703
Article CAS PubMed PubMed Central Google Scholar
Meara PM (2009) Connected words: word associations and second language vocabulary acquisition. Benjamins, Amsterdam
Mehler A, Lücking A, Banisch S, Blanchard P, Job B (eds) (2016) Towards a theoretical framework for analyzing complex linguistic networks. Springer, Berlin
Qi H, Liao J (2019) An investigation into Chinese linguistic development in l2 narrative and argumentative writing. Shijie Hanyu Jiaoxue 33(4):563–576
Google Scholar
Quispe LV, Tohalino JA, Amancio DR (2021) Using virtual edges to improve the discriminability of co-occurrence text networks. Physica A 562:125344. https://doi.org/10.1016/j.physa.2020.125344
Article Google Scholar
Read J (2000) Assessing vocabulary. Cambridge University Press, Cambridge
Saito K (2020) Multi‐or single‐word units? The role of collocation use in comprehensible and contextually appropriate second language speech. Lang Learn 70(2):548–588. https://doi.org/10.1111/lang.12387
Article Google Scholar
Schmid HJ (2020) The dynamics of the linguistic system: usage, conventionalization, and entrenchment. Oxford University Press, Oxford
Book Google Scholar
Selinker L (1972) Interlanguage. Intl Rev Appl Linguist 10:209–231. https://doi.org/10.1515/iral.1972.10.1-4.209
Article Google Scholar
Sizemore AE, Karuza EA, Giusti C, Bassett DS (2018) Knowledge gaps in the early growth of semantic feature networks. Nat Hum Behav 2(9):682–692. https://doi.org/10.1038/s41562-018-0422-4
Article PubMed PubMed Central Google Scholar
Vandeweerd N, Housen A, Paquot M (2022) Comparing the longitudinal development of phraseological complexity across oral and written tasks. Stud Second Lang Acquis 1–25. https://doi.org/10.1017/S0272263122000389
Verspoor M, Schmid MS, Xu X (2012) A dynamic usage based perspective on L2 writing. J Second Lang Writ 21(3):239–263. https://doi.org/10.1016/j.jslw.2012.03.007
Article Google Scholar
Wang C, Wang M (2015) Effect of alignment on L2 written production. Appl Linguist 36(5):503–526. https://doi.org/10.1093/applin/amt051
Article CAS Google Scholar
Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature 393(6684):440–442. https://doi.org/10.1038/30918
Article CAS MATH ADS PubMed Google Scholar
Williams JN (2006) Associationism and Connectionism. In Brown K (ed) Encyclopedia of language & linguistics (Second Edition). Elsevier Science, pp. 559-562
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann Publishers, San Francisco
MATH Google Scholar
Wolfe-Quintero K, Inagaki S, Kim HY (1998) Second language development in writing: Measures of fluency, accuracy, & complexity (No. 17). University of Hawaii Press
Wu J, Zhou W, Lu D (2019) Assessing Chinese L2 writing quality on basis of language features and content quality. Shijie Hanyu Jiaoxue 33(1):130–144
Google Scholar
Yan J, Liu HT (2022) Semantic roles or syntactic functions: the effects of annotation scheme on the results of dependency measures. Stud Linguist 76(2):406–428. https://doi.org/10.1111/stul.12177
Article Google Scholar
Zaharchuk HA, Karuza EA (2021) Multilayer networks: an untapped tool for understanding bilingual neurocognition. Brain Lang 220:104977. https://doi.org/10.1016/j.bandl.2021.104977
Article PubMed Google Scholar

Download references

Acknowledgements

This study was partly supported by the Social Science Foundation of the Education Ministry of China (Grant No. 20YJC740005), the MOE Project of Key Research Institute of Humanities and Social Sciences at Universities in China (22JJD740018), and the Xu Lab at Center for Linguistics and Applied Linguistics, Guangdong University of Foreign Studies.

Author information

Authors and Affiliations

Center for Linguistics and Applied Linguistics, Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou, China
Heng Chen

Authors

Heng Chen
View author publications
Search author on:PubMed Google Scholar

Contributions

The author contributed to all parts of this study. He is the only author of this paper.

Corresponding author

Correspondence to Heng Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Informed consent

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Chen, H. A lexical network approach to second language development. Humanit Soc Sci Commun 10, 735 (2023). https://doi.org/10.1057/s41599-023-02151-6

Download citation

Received: 04 May 2023
Accepted: 11 September 2023
Published: 25 October 2023
DOI: https://doi.org/10.1057/s41599-023-02151-6