Artificial intelligence in linguistics: a GBRT model approach to forecast Cantonese levels among Chinese Malaysians

Peng, Yuqing; Xie, Junxian; Zhang, Lin; Lyu, Yuwen

doi:10.1057/s41599-025-05520-5

Article
Open access
Published: 26 September 2025

Humanities and Social Sciences Communications volume 12, Article number: 1494 (2025)

Subjects

Language and linguistics

Abstract

This study leverages a Gradient Boosted Regression Trees (GBRT) machine learning model to explore how Cantonese media exposure and cultural identity affect Cantonese language proficiency among Chinese Malaysians. By integrating sociolinguistic insights with predictive modeling, we address the multidimensional nature of language use factors. Using survey data from 642 Chinese Malaysian respondents, the GBRT model achieved a high predictive accuracy (R² ≈ 0.90) for Cantonese proficiency. The model identified key predictors, such as daily Cantonese use in social settings, media engagement, and generational cohort, underscoring their significant roles in language maintenance. These findings demonstrate the potential of machine learning to advance sociolinguistic research and provide practical insights for preserving linguistic heritage in multicultural societies.

Introduction

The Gradient Boosted Regression Trees (GBRT) algorithm is a machine learning technique that has been applied across many research domains. Examples include predicting soil pollutant accumulation in environmental science (Nie et al., 2021), forecasting the timing and magnitude of earthquakes in geological studies (Corbi et al., 2019), and anticipating traffic flow and accidents in urban management (Chen et al., 2019; Zhang et al., 2020). Within the humanities and social sciences, machine learning has mainly been used to predict individual and collective psychological and behavioral outcomes in support of social governance. However, the intersection of machine learning with sociology remains underexplored, particularly in sociolinguistics and in broader sociological work on group cultural identity.

Sociolinguistics has long played a central role in cultural studies, especially concerning cultural identity and its transmission. The evolution of civilizations is closely tied to linguistic culture. Language, a core marker of human identity, is considered a major factor shaping national and ethnic identities in multicultural nations. Malaysia, second only to Singapore in its Chinese population proportion, stands out as one of the countries outside China with the most frequent use of Cantonese media and a high degree of Chinese cultural identification. At present, overseas Cantonese media, which carry strong native cultural and ethnic attributes, are evolving autonomously. The phonetic characteristics and lexical shifts in Malaysian Mandarin are notably influenced by local Cantonese usage (Sun, 2020). Investigating the language usage of Chinese Malaysians can therefore clarify the link between native-language media and culture and serve as a case for studying Cantonese media and cultural identity across nations. However, existing research on language usage and cultural identity variables remains fragmented and lacks a holistic exploration of Cantonese media and language use. There is a pressing need to examine multiple dimensions of Cantonese language acquisition, heritage, and usage.

This study sheds light on the influential factors affecting Cantonese language proficiency among Chinese Malaysians, particularly examining the role of Cantonese media usage and cultural identity. By employing the GBRT model, we aim to predict the Cantonese proficiency levels of individuals in this population. This research bridges micro-level individual behaviors with macro-level cultural linguistic trends and demonstrates how machine learning can bolster sociolinguistic analysis. Specifically, this study addresses two key research questions:

a. Which sociocultural factors significantly influence Cantonese proficiency among Chinese Malaysians?
b. How accurately can a GBRT model predict Cantonese proficiency based on these factors?

Literature review

Research on language use among Chinese Malaysians has explored various sociolinguistic dimensions, particularly those related to ethnic identity and language retention. However, few studies have systematically investigated the predictive relationship between Cantonese media exposure, cultural identity, and Cantonese language proficiency using data-driven modeling approaches. This literature review outlines relevant scholarly contributions and highlights the theoretical and empirical foundation for the present study.

Cantonese usage and cultural identity in Malaysia

Malaysia is home to a linguistically diverse Chinese community, with Cantonese being one of the most widely spoken dialects. Prior studies have shown that Cantonese functions not only as a linguistic medium but also as a vehicle for transmitting cultural identity (David et al., 2008; Carstens, 2018). Language choice is often shaped by intergenerational transmission, family language policy, and cultural affiliation. Research by Wang and Chong (2011) demonstrates that language maintenance is supported by demographic, institutional, and settlement factors within the Malaysian Chinese community.

Media exposure and language acquisition

Exposure to language through media plays a crucial role in language acquisition and retention, especially in multilingual societies. Liu et al. (2023) show that Cantonese media content significantly enhances Chinese cultural identification among Malaysian Chinese. Chua (2012) argues that Cantonese media fosters a sense of shared community and contributes to cultural continuity through entertainment. Studies in bilingual language learning also underscore the importance of media exposure for sustaining proficiency (Unsworth et al., 2018).

Machine learning in sociolinguistics

While traditional sociolinguistic studies rely heavily on qualitative or small-scale quantitative methods, recent advances have introduced machine learning to capture complex patterns in language use. GBRT and other ensemble models have been used in various domains to predict outcomes based on high-dimensional sociobehavioral variables (Fisher et al., 2019). Yet, few studies have applied such models to predict language proficiency or media engagement. The current study addresses this gap by employing GBRT to model Cantonese proficiency, drawing on demographic, cultural, and media-use factors.

Theoretical contribution

This study contributes to the literature by integrating machine learning with sociolinguistic theory. It expands upon the usage-based theory of language acquisition, which posits that frequency and context of use are central to linguistic competence. By identifying the predictive importance of different variables, the study provides an empirical foundation for understanding how media engagement and cultural identity shape language proficiency. In summary, prior research offers insights into Cantonese use and cultural identity but lacks predictive, data-driven modeling. This study bridges sociolinguistic theory and machine learning, offering a novel approach to understanding language maintenance in multilingual societies.

Data and methods

Data source

This study was conducted in collaboration with researchers from the Southeast Asian Studies Department of the University of Malaya and the Journalism Department of Xiamen University’s Malaysia Campus. The survey employed a direct distribution and collection method: questionnaires were distributed, completed, and collected on the spot, with researchers providing neutral explanations to ensure respondents understood the questions. Data collection included both online electronic questionnaires and offline paper-based surveys, conducted in September 2021. Given that some older Chinese Malaysians received education only in English and are not literate in Chinese, the questionnaire was prepared in both Chinese and English. Throughout data collection, the research team verified that each participant fit the study’s target demographic to ensure data relevance.

The survey targeted residents of Kuala Lumpur, focusing on those aged 15 and above. A multistage sampling method was employed, selecting five areas of Kuala Lumpur: Bukit Bintang, Setiawangsa, Kepong, Lembah Pantai, and Cheras. Within each area, an initial pool of 2000 households was obtained using an equidistant sampling technique. From these, Chinese Malaysian households were identified, and one individual aged 15 or above in each selected household was chosen using the Kish grid method, yielding 479 valid adult responses. For minors (14 years old or younger), five Chinese-medium schools in Kuala Lumpur (a Chinese national primary school, an international primary school, a national primary school, an independent high school, and a national high school) were selected, from which 163 student respondents were randomly chosen based on student lists. Given the varied circumstances of respondents aged 14 or younger, all youth respondents completed offline paper-based questionnaires with assistance from teachers. In total, 642 valid questionnaires were obtained. The full English version of the questionnaire is provided in Appendix 1.

Kuala Lumpur was chosen as the survey location due to its sizable Cantonese-speaking community and cultural significance, ensuring a meaningful context for studying Cantonese usage. The five selected districts provided a broad representation of urban Chinese Malaysian communities in the city. This multistage sampling approach (households for adults and schools for youth) was designed to maximize representativeness: it included both general community members and school-aged individuals. Separating the under-15 group via school-based sampling ensured that younger participants—who might otherwise be underrepresented—were appropriately included. Overall, this sampling design balances feasibility with diversity, lending credibility to the generalizability of the findings.
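The equidistant (systematic) selection used in the household stage can be illustrated with a short sketch. It is only an illustration: the size of the sampling frame, the household identifiers, and the function name `systematic_sample` are assumptions, and the study's actual frame and interval are not reported here.

```python
import random

def systematic_sample(frame, n, seed=0):
    """Equidistant sampling: pick a random start, then take every k-th unit."""
    k = len(frame) // n                       # sampling interval
    start = random.Random(seed).randrange(k)  # random start within the first interval
    return [frame[start + i * k] for i in range(n)]

# Hypothetical frame of 10,000 household IDs for one area (size is an assumption).
frame = list(range(10000))
sample = systematic_sample(frame, n=2000, seed=1)

print(len(sample))   # number of households drawn
print(sample[:3])    # first few selected IDs, spaced k apart
```

A fixed seed keeps the draw reproducible. The Kish grid step that selects one person within each household is not shown.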

GBRT model

This study uses the Gradient Boosted Regression Trees (GBRT) model to predict Cantonese language proficiency, which is measured as a continuous outcome. GBRT is a powerful ensemble machine learning algorithm that performs well in regression tasks, particularly when the relationship between predictors and outcomes is non-linear or complex. It works by combining multiple decision trees, where each tree corrects the errors of the previous ones, leading to progressively improved performance (Friedman, 2001).

In this study, the model is implemented using the scikit-learn library in Python. Before modeling, all categorical variables (e.g., education level, region, cultural practices) were transformed using One-Hot Encoding, which creates binary variables for each category. The dataset was split into a training set (80%) and a test set (20%). These preprocessing steps ensured compatibility with the GBRT model and reproducibility of results.
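The two preprocessing steps described above can be sketched in plain Python. This is an illustration under assumptions, not the authors' pipeline: the records, column names, and helper functions are hypothetical, and in scikit-learn the same steps would typically be handled by `OneHotEncoder` and `train_test_split`.

```python
import random

def one_hot_encode(rows, column):
    """Replace one categorical column with a 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}={cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

def split_80_20(rows, seed=0):
    """Shuffle with a fixed seed, then keep 80% for training and 20% for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy records; the names and values are illustrative only.
records = [
    {"education": "primary", "media_score": 2, "proficiency": 1.5},
    {"education": "secondary", "media_score": 4, "proficiency": 3.0},
    {"education": "tertiary", "media_score": 5, "proficiency": 4.5},
    {"education": "secondary", "media_score": 3, "proficiency": 2.5},
    {"education": "primary", "media_score": 1, "proficiency": 1.0},
]

encoded = one_hot_encode(records, "education")
train, test = split_80_20(encoded, seed=42)

print(encoded[0])
print(len(train), len(test))
```

Fixing the seed is what makes the split reproducible across runs.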

The GBRT model builds its prediction additively: at each iteration m, a new weak learner h_m(x) is added to the current ensemble F_{m-1}(x), and the final ensemble serves as the strong learner:

$$F_{m}(x)=F_{m-1}(x)+\gamma_{m}h_{m}(x)$$

where:

- F₀(x) is the initial model, often the mean of the target variable;

- h_m(x) is the decision tree fitted to the negative gradient of the loss function at iteration m;

- γ_m is the step size determined through line search to minimize the loss L(y, F(x)).

This additive optimization approach allows the model to learn residuals and iteratively improve predictions.
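The additive update can be made concrete with a from-scratch sketch. It is deliberately simplified and is not the study's configuration: it uses squared-error loss (so the negative gradient equals the residual), one-split trees ("stumps") on a single feature, and a fixed `learning_rate` in place of the line-searched step size γ_m. The data and function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    f0 = sum(y) / len(y)              # F_0: mean of the target
    pred = [f0] * len(y)
    trees, history = [], [mse(y, pred)]
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_stump(x, residuals)   # h_m fitted to the negative gradient
        pred = [pi + learning_rate * h(xi) for pi, xi in zip(pred, x)]
        trees.append(h)
        history.append(mse(y, pred))

    def predict(xi):
        return f0 + learning_rate * sum(h(xi) for h in trees)

    return predict, history

# Hypothetical data with a threshold-like rise.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 1.1, 3.0, 3.9, 4.2]
predict, history = gradient_boost(x, y)

print(round(history[0], 3), round(history[-1], 3))  # training MSE before and after boosting
```

Each round fits the next tree to what the current ensemble still gets wrong, so the training error cannot increase from one round to the next in this setup.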

To interpret the model results, two analysis tools were employed:

Permutation feature importance

To evaluate the contribution of each variable, permutation feature importance was computed. This method, introduced by Breiman (2001) and refined by Fisher et al. (2019), estimates how much the model’s performance deteriorates when a feature’s values are randomly shuffled. A significant increase in prediction error implies that the variable is important.

The feature importance FI_s for a given variable s is calculated as:

$$\mathrm{FI}_{s}=L\left(y,f\left(X_{\mathrm{perm}}^{(s)}\right)\right)-L\left(y,f(X)\right)$$

where:

- $X_{\mathrm{perm}}^{(s)}$ is the dataset with the values of feature s randomly permuted;

- L is the loss function (mean squared error);

- f is the trained model.
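The permutation importance formula can be sketched as follows. The sketch makes several assumptions: `model` is a hypothetical stand-in for a trained regressor, the data are synthetic, and the loss increase is averaged over several shuffles (`n_repeats`) to reduce noise, a step the formula above does not specify.

```python
import random

def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

def permutation_importance(model, X, y, feature, n_repeats=30, seed=0):
    """Average increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mse(y, [model(row) for row in X])          # L(y, f(X))
    increases = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)                               # break the feature-target link
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, column)]
        permuted = mse(y, [model(row) for row in X_perm]) # L(y, f(X_perm))
        increases.append(permuted - baseline)
    return sum(increases) / n_repeats

# Hypothetical stand-in for a trained model: it depends on feature 0 only.
def model(row):
    return 2.0 * row[0]

X = [[float(i), float((i * 7) % 5)] for i in range(10)]
y = [2.0 * row[0] for row in X]

fi_used = permutation_importance(model, X, y, feature=0)
fi_unused = permutation_importance(model, X, y, feature=1)
print(fi_used > fi_unused)
```

A feature the model never uses gets an importance of zero, because shuffling it leaves every prediction unchanged.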

Partial dependence plots (PDPs)

Partial dependence plots (PDPs) help visualize how individual features affect the model’s predictions, controlling for other variables. The partial dependence function for a feature x_s is defined as:

$$\hat{f}_{\mathrm{PDP}}(x_{s})=\frac{1}{n}\sum_{i=1}^{n}f\left(x_{s},x_{i}^{c}\right)$$

where $x_{i}^{c}$ are the values of all other features for instance i, n is the number of instances, and f(·) is the trained model. This average prediction shows how changes in x_s affect the output once the observed values of the other variables have been averaged over.

In the context of this study, PDPs were used to illustrate the effect of key variables—such as daily use of Cantonese in social contexts, media exposure, and generation cohort—on predicted Cantonese proficiency. For example, the model showed a strong non-linear increase in predicted proficiency with greater daily use of Cantonese in social interactions.
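The partial dependence computation can be sketched directly from its definition: fix the feature of interest at a grid value for every instance, predict, and average. The `model` below is a hypothetical stand-in with a threshold-shaped response, chosen only to show a flat-then-rising curve. It is not the fitted GBRT model.

```python
def partial_dependence(model, X, feature, grid):
    """For each grid value, overwrite one feature in every row and average the predictions."""
    curve = []
    for value in grid:
        preds = [model(row[:feature] + [value] + row[feature + 1:]) for row in X]
        curve.append(sum(preds) / len(preds))
    return curve

# Hypothetical stand-in for a trained model: flat up to a score of 2, rising afterwards.
def model(row):
    return max(0.0, row[0] - 2.0) + 0.5 * row[1]

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [2.0, 0.0]]
grid = [1.0, 2.0, 3.0, 4.0, 5.0]
curve = partial_dependence(model, X, feature=0, grid=grid)

print(curve)  # [1.5, 1.5, 2.5, 3.5, 4.5]
```

The other feature's contribution is averaged over its observed values (0.5 × 3.0 = 1.5 here), which is why the curve starts at 1.5 rather than 0.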

Results

Descriptive statistical results

Sample characteristics are presented in Table 1. Based on the actual survey situation and past experience with language usage questionnaires, the respondents’ gender, generation, education level, family income, whether they attended a Chinese school, whether they lived in a Chinese new village, and Cantonese proficiency level were included in the statistics. From the empirical survey, out of the 642 valid questionnaires received, the male-to-female ratio was nearly equal. The respondents’ ages were divided into four generations (Youth: 14 years old or younger, Young Adults: 15–35 years old, Middle-aged: 36–60 years old, and Elderly: 61 years old and above), with a relatively small difference in the number of respondents across these generations. Over 96% of the respondents had attended a Chinese school, and more than 76% had experience living in a Chinese new village. In daily life, 39.1% can fluently understand and use Cantonese, 27.26% can understand and basically use Cantonese, 20.87% can understand but seldom use it, 9.19% can understand but cannot speak, and 3.58% do not know Cantonese at all.

Table 1 Sample characteristics statistics.

GBRT prediction results

This study employs the GBRT model using the sklearn toolkit in the Python programming language. The primary parameter settings for the model can be seen in Table 2.

Table 2 Parameter settings.

Before running the model, categorical variables were One-Hot encoded, and parameter tuning was performed during model creation. The data were split into a training set (80%) and a test set (20%). On the test set, the R-squared value between the actual and predicted levels of Cantonese proficiency among Malaysian Chinese respondents is 0.9021, and the reported accuracy rate is 83.50%. These values indicate that the model fits well and that the predicted results are close to the actual values. The details can be seen in Fig. 1.
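For readers unfamiliar with the R-squared statistic, the following sketch shows how it is computed from actual and predicted values as one minus the ratio of residual to total sum of squares. The numbers are invented for illustration and are unrelated to the survey data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted proficiency scores.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.5, 2.0, 2.5, 4.0, 5.0]

print(round(r_squared(y_true, y_pred), 4))  # 0.95
```

A value of 1 means perfect prediction, and a value of 0 means the model does no better than always predicting the mean.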

**Fig. 1: Comparison of actual and predicted Cantonese proficiency scores using the GBRT model.**

The machine learning results are shown in Figs. 2 and 3. Twenty factors significantly predict variation in Cantonese proficiency, show distinct changes, and emerge as the main influencing factors. Figures 2 and 3 display the partial dependence plots for these 20 factors. As scores change, each influencing factor exhibits a different pattern of variation in predicting Cantonese language use.

**Fig. 2: Partial dependence plots of significant predictors (Set 1).**

The plots display the marginal effect of each predictor on the model’s output, controlling for all other variables. A rising curve suggests a positive contribution to predicted Cantonese proficiency, while a flat or declining curve indicates weaker influence.

**Fig. 3: Partial dependence plots of significant predictors (Set 2).**

These plots visualize how each variable affects predicted Cantonese proficiency, holding all other variables constant. Curves with pronounced slopes indicate a stronger influence on the model’s prediction.

The partial dependence plots clearly show the significant predictive role of Cantonese media usage and exposure on Cantonese language usage. Daily social language use in Cantonese, acquiring Cantonese through chatting with friends, and family communication are significant representative factors for Cantonese media usage. Among them, social language in Cantonese is the strongest predictive factor.

When the score for this factor is less than 2, it essentially has no predictive power. Scores greater than 2 show a significant increase in their predictive power, and scores greater than 4 show a relatively steady growth in their predictive influence. The degree and frequency of Cantonese media usage are also important predictive factors. When the score for Cantonese media usage is less than 3, Cantonese language usage remains at a high level. Scores between 3 and 4 result in a rapid drop to a lower level or even an inability to predict Cantonese language usage, followed by a slow recovery. Other factors, such as the frequency of exposure to Hong Kong movies and Malaysian Cantonese broadcast television programs, also have significant predictive effects.

From a sociological perspective, generation, native Mandarin speaker status, generation of Chinese descent, and education level emerged prominently as predictors of Cantonese proficiency among Malaysian Chinese. Generation is the strongest demographic predictor, which might imply that older generations are more likely to use Cantonese. Overall, the machine learning predictions align with past empirical research results. When there are many variables and the influencing factors are complex and general, machine learning can extract systematic, multi-dimensional variable relationships that linear analyses struggle to untangle. The computational process demonstrates good data robustness, and because the variable ranking is derived from the data rather than from the researcher's prior judgment, it is less exposed to subjective bias.

Discussion

Social interaction and cultural embeddedness in Cantonese language proficiency

This study reveals that Cantonese proficiency among Malaysian Chinese is deeply rooted in patterns of social interaction and cultural engagement, resonating with core sociolinguistic frameworks. Most notably, frequent interpersonal use of Cantonese in daily interactions, especially with family and close friends, emerged as the strongest predictor of proficiency. This finding affirms the theoretical construct of language socialization (Ochs and Schieffelin, 2001), which posits that language learning is a socially situated process, driven by participation in culturally meaningful communicative practices. In heritage language contexts, family discourse has been widely recognized as a central mechanism for linguistic and identity transmission (Arriagada, 2005). Similarly, peer interactions act as a reinforcing mechanism for informal acquisition, consistent with Hamat and Hassan’s (2019) theory of social networks as facilitators of language use and retention.

Beyond interpersonal interaction, cultural embeddedness also plays a critical role. Variables such as participation in traditional activities (e.g. Qingming Festival, Lo Hei) and engagement in ethnic community events significantly contributed to language proficiency, demonstrating that cultural practices serve as not only symbolic affirmations of identity but also functional domains for language use. These findings align with Fishman’s (1991) argument that language maintenance depends on the integration of minority languages into symbolic domains of community life, where cultural rituals sustain both linguistic practices and identity continuity. In this regard, Cantonese use is not merely linguistic, but deeply enmeshed in the ethnic and intergenerational lifeworlds of Malaysian Chinese.

Media exposure, generational dynamics, and linguistic capital

In parallel, the results indicate that media consumption is a powerful driver of Cantonese proficiency, particularly through exposure to Hong Kong films, local Cantonese-language TV, and pop culture. These findings reinforce the role of media as both a source of linguistic input and a platform for symbolic identity construction. Fishman (1991) underscores the importance of such domains in his model of reversing language shift, where media serve to legitimize and normalize minority language use in public and private spheres. Moreover, participants’ emotional attachment to Cantonese media and their perception of its practical value—such as for staying informed or expressing cultural belonging—exemplify what Bourdieu (1991) defines as linguistic capital: the symbolic and instrumental value attached to language varieties within specific social fields.

Generational status further emerged as the most significant demographic predictor of Cantonese proficiency. Older respondents—those who came of age before the widespread institutional dominance of Malay, English, or Mandarin—demonstrated significantly higher language ability, reflecting typical language shift trajectories in multilingual diaspora settings (Zhang, 2010). In contrast, younger generations—particularly those with limited exposure to Cantonese in educational or media contexts—exhibited signs of attrition. Education level and Chinese school attendance were additional predictive factors, suggesting that institutional exposure to Chinese linguistic and cultural norms enhances the retention of Cantonese as a heritage language. This supports Wang and Chong’s (2011) argument that language vitality is tied to both community-level institutional support and individual sociocultural investment.

Taken together, these results indicate that Cantonese proficiency is not determined by a single factor, but rather emerges from the complex interplay among media engagement, generational alignment, and symbolic valuation of language. The application of a Gradient Boosted Regression Trees (GBRT) model enabled the detection of nuanced, non-linear interactions among these predictors—demonstrating the value of machine learning in capturing sociolinguistic complexity that might otherwise be obscured in traditional statistical analyses.

Toward a multilayered understanding of heritage language maintenance

The integrated findings of this study point to a multilayered sociolinguistic ecology in which heritage language proficiency is not merely a byproduct of individual competence or exposure but a reflection of interdependent systems of social practice, cultural affiliation, symbolic valuation, and institutional reinforcement. As this research illustrates, language use is relational—constructed within networks of interpersonal communication, shaped by generational histories, sustained through culturally embedded practices, and mediated by both traditional and digital forms of media. This conception echoes the ethnography of communication approach (Hymes, 1972), which emphasizes that language behaviors must be understood within their broader sociocultural contexts and communicative functions.

More broadly, the study reinforces the notion that language maintenance in diasporic communities requires more than linguistic input alone. It demands a confluence of factors: familial transmission, peer reinforcement, culturally resonant content, community participation, and symbolic legitimacy. These dimensions interact dynamically across temporal, spatial, and generational lines, creating either fertile ground or fragile terrain for the survival of heritage languages. From a policy perspective, this underscores the importance of intergenerational language planning, culturally sensitive media production, and educational programs that recognize the value of dialectal diversity within the Chinese Malaysian context.

Methodologically, the study also contributes to the field by demonstrating how machine learning models such as GBRT can enhance sociolinguistic inquiry. By uncovering non-linear, high-dimensional relationships among variables, such approaches complement traditional qualitative and quantitative frameworks, offering a hybrid model of empirical sophistication and theoretical depth. Ultimately, Cantonese language proficiency—as mapped here—emerges not as a static attribute, but as a dynamic cultural resource negotiated across individual lives and collective histories.

Conclusion

The preservation and transmission of the Cantonese language among overseas Chinese communities faces not only structural challenges but also psychological and attitudinal barriers. Negative language beliefs and insufficient language awareness may accelerate language attrition more than any external constraints. This study contributes to the growing literature on heritage language maintenance by investigating the sociocultural, media-related, and demographic factors shaping Cantonese usage among Malaysian Chinese.

By integrating traditional statistical methods and Gradient Boosted Regression Trees (GBRT), the study identified twenty key predictors—spanning dimensions of media exposure, cultural identity, and generational background—that significantly influence Cantonese proficiency. Partial dependence analyses further revealed how each of these predictors affects language usage non-linearly, underscoring the value of data-driven modeling for uncovering nuanced sociolinguistic patterns.

In practical terms, the findings point to actionable strategies for strengthening Chinese cultural identity and supporting the vitality of Cantonese in diaspora settings. These include fostering immersive Cantonese-speaking environments in families and communities, producing culturally resonant media content (e.g., TV dramas, music, films), and establishing sustainable transnational media dissemination platforms. Such initiatives are essential not only for language retention but also for reinforcing emotional ties between overseas Chinese and their cultural heritage.

Beyond its practical implications, this study contributes to sociolinguistics by demonstrating the effectiveness of machine learning models in capturing the complex interplay of linguistic, cultural, and demographic variables. It also contributes to applied machine learning by extending its usage to sociocultural research contexts, thus offering an example of how computational tools can enrich theoretically grounded inquiries.

Future research can build upon this work in several directions. First, comparative studies across diaspora communities in North America, Europe, and Oceania can test the generalizability of these findings. Second, longitudinal approaches may examine how language beliefs and media consumption patterns evolve over time. Third, further methodological innovations, such as incorporating deep learning models or natural language processing techniques, could enhance predictive accuracy and interpretability. In sum, this study not only offers insights into the mechanisms of language maintenance but also opens new pathways for interdisciplinary collaboration between sociolinguistics, communication studies, and artificial intelligence.

Data availability

The data presented in this study are available from the corresponding author upon reasonable request. Any shared dataset will be provided in a de-identified form to maintain participant confidentiality.

References

Arriagada PA (2005) Family context and Spanish‐language use: a study of Latino children in the United States. Soc Sci Q 86(3):599–619
Article Google Scholar
Bourdieu P (1991) Language and symbolic power. Harvard University Press
Google Scholar
Breiman L (2001) Random forests. Machine learning. vol. 45, pp. 5–32. Springer
Carstens S (2018) Multilingual Chinese Malaysians: the global dimensions of language choice. Grazer Ling Stud 89:7–34
Google Scholar
Chen X, Zhang S, Li L et al. (2019) Multi‐model ensemble for short‐term traffic flow prediction under normal and abnormal conditions. IET Intell Transp Syst 13(2):260–268
Article Google Scholar
Chua BH (2012) Structure, audience and soft power in East Asian pop culture (vol. 1). Hong Kong University Press
Corbi F, Sandri L, Bedford J et al. (2019) Machine learning can predict the timing and size of analog earthquakes. Geophys Res Lett 46(3):1303–1311
Article ADS Google Scholar
David MK, Cavallaro F, Coluzzi P et al. (2008) Language policies: impact on language maintenance and teaching focus on Malaysia, Singapore and the Philippines. In: Foundation for Endangered Languages Conference (FEL XII). pp. 25–2. Kuala Lumpur: Persatuan Bahasa Modern Malaysia and Macquarie Library Pty. Ltd
Fisher A, Rudin C, Dominici F (2019) All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. J Mach Learn Res 20(177):1–81
MathSciNet CAS Google Scholar
Fishman JA (1991) Reversing language shift: Theoretical and empirical foundations of assistance to threatened languages. Multi Ling Mat 76
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232. https://doi.org/10.1214/aos/1013203451
Article MathSciNet Google Scholar
Hamat A, Hassan HA (2019) Use of social media for informal language learning by Malaysian university students. 3L: Lang Linguist Lit 25(4):68–83
Article Google Scholar
Hymes, D. (1972). Models of the interaction of language and social life. In J. J. Gumperz & D. Hymes (Eds.), Directions in Sociolinguistics: The Ethnography of Communication (pp. 35–71). New York: Holt, Rinehart and Winston
Liu N, Chen T, Peng Y, Xie Y et al. (2023) Cantonese media promotes Chinese cultural identification: structural equation modeling based on Malaysian Chinese. Front Psychol. 14. https://doi.org/10.3389/fpsyg.2023.1217340
Nie P, Roccotelli M, Fanti MP et al. (2021) Prediction of home energy consumption based on gradient boosting regression tree. Energy Rep. 7:1246–1255. https://doi.org/10.1016/j.egyr.2021.02.006
Article Google Scholar
Ochs E, Schieffelin B (2001) Language acquisition and socialization: three developmental stories and their implications. Linguist Anthropol Read 2001:263–301
Google Scholar
Sun LP (2020) Ji in Malaysia Mandarin: a perspective from dialect contact and grammaticalization. J Chin Linguist 48(1):147–173
Google Scholar
Unsworth S, Chondrogianni V, Skarabela B (2018) Experiential measures can be used as a proxy for language dominance in bilingual language acquisition research. Front Psychol 9:1809. https://doi.org/10.3389/fpsyg.2018.01809
Wang X, Chong SL (2011) A hierarchical model for language maintenance and language shift: focus on the Malaysian Chinese community. J Multiling Multicult Dev 32(6):577–591. https://doi.org/10.1080/01434632.2011.617820
Zhang D (2010) Language maintenance and language shift among Chinese immigrant parents and their second-generation children in the US. Bilingual Res J 33(1):42–60. https://doi.org/10.1080/15235881003733258
Zhang Z, Yang W, Wushour S et al. (2020) Traffic accident prediction based on LSTM‐GBRT model. J Control Sci Eng 2020(1):4206919. https://doi.org/10.1155/2020/4206919

Acknowledgements

This study was supported by the National Social Science Foundation of China (21BSH023) and the Scientific Research Project of Guangzhou Medical University (2025SRP036).

Author information

Authors and Affiliations

School of Journalism and Communication, Guangzhou University, Guangzhou, China
Yuqing Peng
Key Laboratory of Maritime Silk Road, Guangzhou University, Guangzhou, China
Junxian Xie
School of Environment and Society, Tokyo Institute of Technology, Tokyo, Japan
Lin Zhang
School of Marxism, Guangzhou Medical University, Guangzhou, China
Yuwen Lyu

Authors

Yuqing Peng
Junxian Xie
Lin Zhang
Yuwen Lyu

Contributions

Conceptualization: YP. Methodology: YL and LZ. Software: YL and LZ. Validation: YP and JX. Formal analysis: YP, YL, and JX. Investigation: YP. Data curation: YP, YL. Original draft preparation: YP and LZ. Draft review and editing: YP and YL. Visualization: JX and LZ. Project administration: YL and YP. All authors have read and approved the final version of the manuscript.

Corresponding authors

Correspondence to Lin Zhang or Yuwen Lyu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

The study was approved on June 10, 2021, by the Institutional Review Board (IRB) of the authors’ affiliated institution (Approval No. GZHU2021012) and was conducted in accordance with relevant ethical standards, including the 1964 Declaration of Helsinki and its subsequent amendments. All participants’ personal data were anonymized, and survey responses were handled with strict confidentiality. Online questionnaire data were collected through encrypted channels with access limited to the research team, and offline survey forms were securely stored and digitized in a protected manner. These measures ensured that participant privacy and data security were safeguarded throughout the study.

Informed consent

Written informed consent was obtained from all participants prior to data collection, between September 1 and September 24, 2021, for both participation in the study and the publication of anonymized data and research findings. For participants under legal age or those requiring assistance, written informed consent was obtained from their legal guardians or next of kin.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Survey

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Peng, Y., Xie, J., Zhang, L. et al. Artificial intelligence in linguistics: a GBRT model approach to forecast Cantonese levels among Chinese Malaysians. Humanit Soc Sci Commun 12, 1494 (2025). https://doi.org/10.1057/s41599-025-05520-5

Received: 12 September 2024
Accepted: 04 July 2025
Published: 26 September 2025
DOI: https://doi.org/10.1057/s41599-025-05520-5