Introduction

Developing effective second language (L2) speaking skills is integral to language acquisition and communicative competence. However, Foreign Language Speaking Anxiety (FLSA)—characterized by apprehension and fear of speaking in a non-native language—remains a pervasive challenge for many English as a Foreign Language (EFL) learners (Bárkányi, 2021; Horwitz et al., 1986). This anxiety, which is prevalent among language learners (MacIntyre, 2017), can manifest as avoidance behaviors and hinder progress in oral proficiency (Ozdemir & Papi, 2022; Woodrow, 2006), ultimately disrupting broader language acquisition processes by diminishing learners’ willingness to communicate and engage actively (MacIntyre et al., 1998; Zhou et al., 2023). Consequently, there is an urgent need for interventions that can effectively mitigate FLSA while fostering L2 speaking proficiency. To address this need, researchers and educators have explored various pedagogical approaches and interventions, including traditional classroom activities (Horwitz et al., 1986) and technology-enhanced language learning programs (Aktaş, 2023; Chen, 2022; Derakhshan et al., 2024b). However, a comprehensive understanding of how different pedagogical approaches, learning environments, and individual learner characteristics interact to influence FLSA, and of the long-term effects of such interventions, is still lacking.

One promising avenue for addressing FLSA lies in the integration of Artificial Intelligence (AI) technologies into language education. While traditional L2 speaking practices—such as structured classroom interactions, peer discussions, and teacher-led instruction—are foundational to language learning, they often fail to address the specific challenges of learners with high levels of FLSA. Fear of judgment and the pressure to perform in front of peers can inhibit participation, particularly for anxious learners, who may feel paralyzed by the prospect of making mistakes (MacIntyre, 2017; Renandya & Nguyen, 2022). Moreover, these methods lack the flexibility to provide individualized and frequent practice, a limitation that is particularly pronounced in larger classroom settings or environments with limited instructional resources. Research has also pointed out that these traditional methods often overlook learners’ psychological needs, which can exacerbate anxiety, especially among adult or beginner-level learners (Çakmak, 2022; Zhou et al., 2023). These gaps have prompted the exploration of innovative solutions, with AI technologies emerging as transformative tools in language education.

AI-powered chatbots, equipped with natural language processing (NLP) and machine learning capabilities, present a novel approach to addressing both FLSA and proficiency gaps (Wang et al., 2024; Xin & Derakhshan, 2025). These tools create safe, judgment-free environments where learners can engage in authentic conversational scenarios without fear of social repercussions (Çakmak, 2022). By providing real-time feedback on grammar, pronunciation, and fluency, chatbots enable learners to identify and address their mistakes during practice sessions, fostering self-awareness and active learning (Chen et al., 2021; Du & Daniel, 2024). Furthermore, the adaptability of these tools allows learners to tailor their practice sessions to specific needs and schedules, offering the flexibility and personalization that traditional methods often lack (Divekar et al., 2022; Kartal & Yeşilyurt, 2024). Despite their growing popularity, existing studies often examine these tools in controlled environments or with homogeneous learner populations, leaving questions about their efficacy across varied demographic or proficiency groups.

Recent advancements in AI-based educational technologies further underscore their potential for improving L2 speaking skills and reducing anxiety. Research has shown that AI chatbots can enhance learners’ fluency, accuracy, and engagement while simultaneously fostering a supportive environment that alleviates anxiety (Hapsari & Wu, 2022; Hwang et al., 2024; Tai & Chen, 2024). However, much of the literature has examined these benefits independently, focusing either on skill development or on anxiety reduction (Çakmak, 2022; Fathi & Rahimi, 2024). Additionally, there remains limited exploration of how these tools support long-term behavioral changes, such as sustained willingness to communicate or enhanced confidence outside classroom contexts. Given that anxiety can significantly undermine the effectiveness of speaking practices, tools that reduce FLSA while promoting speaking proficiency hold particular promise for language learning outcomes.

This study aims to bridge this gap by employing a mixed-methods approach to evaluate the dual benefits of AI-powered conversation bots in EFL instruction. Specifically, the research investigates how these tools enhance key aspects of L2 speaking skills—fluency, accuracy, and overall proficiency—while also alleviating FLSA among learners. By integrating quantitative measures with qualitative insights, this study provides a holistic perspective on the efficacy of AI chatbots, addressing not only measurable outcomes but also learners’ subjective experiences and perceptions. The findings are expected to inform not only classroom strategies but also broader policies aimed at integrating AI into language curricula, supporting teachers in diverse settings, and fostering more inclusive learning environments. By emphasizing practical applications, this research seeks to demonstrate how AI-powered tools can contribute to sustainable advancements in EFL education, ultimately benefiting learners across various cultural and institutional contexts.

Literature review

Theoretical framework

This study is grounded in Vygotsky’s (1978) sociocultural theory (SCT). SCT provides a particularly relevant lens for this research because it emphasizes that learning, including language acquisition, is fundamentally a social and mediated process. Unlike theories focusing primarily on individual cognitive mechanisms or input processing, SCT foregrounds the role of interaction with others, or with cultural tools, in driving cognitive and linguistic development. This framework highlights learning as occurring through guided participation in socially meaningful activities, particularly within the Zone of Proximal Development (ZPD). The ZPD represents the gap between what learners can accomplish independently and what they can achieve with the support of a more knowledgeable other (MKO). In the context of L2 speaking development, the concept of the MKO is crucial, representing a source of guidance that enables learners to perform beyond their independent capabilities. In this study, the AI-powered conversation bot functions as a virtual MKO, guiding learners in developing L2 speaking skills within their ZPD through dynamic interaction and feedback (Hwang et al., 2025).

Crucially, the chatbot’s ability to function as an effective MKO across diverse learner contexts is rooted in its capacity to adapt to varying proficiency levels dynamically. This adaptability is achieved through advanced algorithms that analyze learner performance in real time, leveraging NLP and machine learning to adjust the difficulty and complexity of tasks based on individual needs (Huang et al., 2023; Jeon et al., 2023). For instance, if a learner struggles with a particular grammar point, the chatbot may provide additional examples, simplify the language used in subsequent interactions, or offer more targeted feedback. Conversely, if a learner demonstrates proficiency, the chatbot may introduce more challenging vocabulary, complex grammatical structures, or open-ended tasks that encourage greater elaboration and critical thinking. This dynamic adaptation aligns with key concepts from SCT, providing a robust framework for understanding how the chatbot facilitates language learning within the ZPD (Lantolf & Poehner, 2014). By offering structured practice, scaffolding, and real-time feedback, the chatbot helps learners engage in speaking tasks that exceed their current abilities, such as navigating conversational scenarios like role-playing or opinion-sharing, which stretch their linguistic capabilities (Lantolf, 2011). Through these guided interactions, the chatbot supports learners in bridging the gap between their present performance and their potential proficiency, aligning with Vygotsky’s assertion that optimal learning occurs within the ZPD (Lantolf et al., 2014).
The chatbot’s design incorporates scaffolding mechanisms tailored to learners’ individual needs, providing contextualized hints, breaking down complex tasks into manageable steps, and delivering corrective feedback when errors occur, enabling learners to perform tasks they might initially find challenging and gradually become more autonomous in their speaking abilities (Rahimi et al., 2024; Tai & Chen, 2024; Wood et al., 1976). This scaffolding process is not static but evolves dynamically as learners progress, ensuring that the chatbot continuously challenges learners at the appropriate level of their ZPD (Graesser et al., 2014). This dynamic, individualized support system embodies the core SCT principle of mediated learning, where the technological tool is specifically designed to facilitate development through carefully calibrated assistance.
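As a purely illustrative sketch, the adaptive loop described above can be reduced to two recurring decisions: adjusting task difficulty after each exchange and selecting an appropriate scaffolding move. The function names, thresholds, and five-level difficulty scale below are hypothetical and are not drawn from any system cited in this article.

```python
# Toy sketch of ZPD-style adaptation: raise or lower task difficulty
# based on recent accuracy, and pick a scaffolding move to match.
# All thresholds and levels are illustrative assumptions.

def adjust_difficulty(level: int, accuracy: float,
                      min_level: int = 1, max_level: int = 5) -> int:
    """Raise difficulty after strong performance, lower it after weak
    performance, and hold it steady otherwise."""
    if accuracy >= 0.8:
        return min(level + 1, max_level)
    if accuracy < 0.5:
        return max(level - 1, min_level)
    return level

def select_scaffold(accuracy: float) -> str:
    """Choose a scaffolding move analogous to the hints, simplification,
    and elaboration prompts described in the text."""
    if accuracy < 0.5:
        return "simplify prompt and give a worked example"
    if accuracy < 0.8:
        return "offer a contextual hint and targeted feedback"
    return "pose an open-ended follow-up to encourage elaboration"
```

The point of the sketch is not the specific rules but the structure: difficulty and support are recalibrated continuously from performance evidence, which is what keeps the task within the learner’s ZPD rather than fixed at a preset level.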

Finally, through repeated interaction with the chatbot, learners gradually internalize the language skills and strategies modeled during practice, transforming external guidance into internalized knowledge that can be applied independently in new contexts (Jeon et al., 2023; Lantolf & Poehner, 2014; Vygotsky, 1978). This process of internalization, central to Vygotsky’s theory, underscores how external social interactions evolve into internal cognitive functions. Internalization is not merely cognitive; it can also have affective consequences. As learners successfully internalize language skills through scaffolded interactions, their perceived competence and self-efficacy may increase, potentially contributing to reduced anxiety when faced with similar communicative tasks in the future. This development of self-regulation and metacognitive awareness, facilitated by the internalization of interactional strategies, is central to becoming an autonomous language learner (Zhou et al., 2023).

Therefore, SCT allows this study to frame the AI chatbot not merely as a tool, but as a mediational artifact facilitating a unique form of social learning. This perspective enables examination of how interaction with a non-human MKO supports ZPD progression, fosters skill internalization, and potentially mitigates the social pressures inherent in human interaction, thus altering the affective dimension of L2 learning. This highlights the interplay between guided interaction, cognitive development, and AI’s transformative potential in L2 acquisition. The internalization fostered by chatbot practice can also promote enduring self-regulatory skills and metacognitive awareness, encouraging learners to monitor their progress, identify weaknesses, and apply learned strategies in future real-life communication and self-study.

Foreign language speaking anxiety

Foreign language anxiety (FLA), a common phenomenon among language learners (Bárkányi, 2021; Horwitz et al., 1986), can significantly hinder their ability to acquire and utilize an L2 effectively (Botes et al., 2020; Onwuegbuzie et al., 1999). Horwitz et al. (1986) proposed a three-pronged model of FLA, identifying communication apprehension, test anxiety, and fear of negative evaluation as its key components. Communication apprehension refers to fear of or anxiety about communicating with others in the target language, while test anxiety involves the worry and nervousness associated with language assessments. Fear of negative evaluation encompasses the concern about being judged or criticized by others when using the foreign language (Horwitz et al., 2009). Research suggests that high levels of FLA can hinder students’ willingness to communicate (MacIntyre et al., 1998), reduce their participation in classroom activities (Horwitz et al., 1986), and impede their language proficiency development (MacIntyre & Gardner, 1994). Moreover, FLA has been linked to decreased motivation and self-confidence in language learners (Fathi & Mohammaddokht, 2021; MacIntyre & Gardner, 1994). Individual differences, such as shyness and apprehension, significantly impact learners’ speaking abilities in a foreign language (Fallah, 2014; Kardaş, 2024). These personality traits, along with language aptitude and prior language learning experiences (Chen & Chang, 2004), contribute to the levels of anxiety experienced by learners. Additionally, situational factors such as classroom environment, teaching methods, and social interactions can influence learners’ levels of FLA (Horwitz et al., 1986; Kim, 2009).

FLSA, as a specific manifestation of FLA, is centered on the apprehension and distress learners experience during oral communication in a non-native language (Mora et al., 2024; Woodrow, 2006). Unlike general FLA, which encompasses anxiety across all language domains (e.g., reading, writing, listening), FLSA targets oral interactions, where the immediacy and spontaneity of speaking amplify feelings of vulnerability (Kasbi & Elahi Shirvan, 2017; Ozdemir & Papi, 2022). Learners with FLSA often struggle with fears of making grammatical errors, concerns about pronunciation, and worries about being misunderstood. Anxiety may also stem from navigating unfamiliar cultural norms in communication, further intensifying learners’ feelings of inadequacy (Akkakoson, 2016). FLSA arises from various sources, such as fear of negative evaluation by peers or instructors, performance anxiety during speaking assessments, and self-doubt regarding linguistic competence (Horwitz et al., 1986; Ozdemir & Papi, 2022; Sadighi & Dastpak, 2017). These anxieties can manifest in physiological symptoms, such as trembling, sweating, and increased heart rate, further exacerbating learners’ discomfort during speaking tasks (MacIntyre & Gardner, 1989).

To address the challenges posed by FLSA, researchers have explored various interventions, including AI-powered chatbots. While multiple affective factors influence language acquisition, including motivation, self-efficacy, and enjoyment (Botes et al., 2020; Dewaele & MacIntyre, 2014), anxiety stands out for its pervasive negative effects on learners’ communicative abilities. The focus on anxiety in this study is therefore motivated by its significant role in inhibiting speaking performance and by the potential of technological interventions to mitigate its impact. Recent research has accordingly begun to explore the capacity of AI-powered chatbots to reduce FLSA and enhance speaking skills (Qiao & Zhao, 2023).

Çakmak (2022) examined the effects of chatbot-human interaction on EFL students’ speaking performance and anxiety using the chatbot Replika over a 12-week intervention with 89 participants, finding that while chatbot interaction led to improved speaking performance, it did not necessarily reduce L2 speaking anxiety. This highlights the complex relationship between chatbot use, speaking performance, and anxiety, suggesting that factors beyond simply interacting with a chatbot may contribute to anxiety reduction. Similarly, Hapsari and Wu (2022) introduced an AI chatbot model designed to alleviate speaking anxiety, foster enjoyment, and encourage critical thinking among university-level EFL learners. Although their findings, based on interviews with teachers, suggested that this model could enhance the learning process, the study’s focus on a specific educational context raises questions about the generalizability of these results to other settings. Furthermore, Naseer et al. (2024) examined chatbot-mediated interactions for 320 language learners and found notable gains in speaking proficiency (22% on average), increased vocabulary retention (19%), and reduced anxiety levels (78% of learners reported lower anxiety). These improvements, attributed to the non-judgmental environment provided by conversational AI, highlight the potential of chatbots as low-pressure supplements to regular instruction. However, further research is needed to investigate whether these positive effects can be generalized to diverse learner populations and educational contexts. Recent research by Du and Daniel (2024) offered a systematic review of AI-powered chatbots for EFL speaking practice and concluded that, while chatbots can effectively reduce anxiety and enhance speaking skills, further studies in diverse educational settings are warranted to verify their broader impact. 
Moreover, Zheng (2024) investigated a GenAI-enhanced chatbot for reading instruction and found that it helped reduce foreign language reading anxiety, suggesting the potential of AI chatbots to address anxiety across different language skills.

AI-powered chatbots for language learning

The integration of AI technologies is reshaping language learning, offering new tools for personalized instruction, immediate feedback, and enhanced language acquisition (Derakhshan, 2025; Derakhshan & Ghiasvand, 2024; Divekar et al., 2022; Pedró et al., 2019; Xiu-Yi, 2024; Yuan & Liu, 2025). Central to this shift are AI algorithms, NLP techniques, and machine learning models. A key benefit of AI in this context is its capacity to individualize the learning experience. Adaptive platforms analyze learner data to adjust content and pacing based on specific needs, creating customized learning pathways that enhance engagement and motivation (Alam, 2021; Yin & Fathi, 2025; Zhai & Wibowo, 2023). In addition to personalization, AI delivers timely and detailed feedback. Systems equipped with NLP can evaluate both written and spoken language, detect errors, and provide corrective suggestions in real time (Chen et al., 2021; Derakhshan et al., 2024b; Zhang & Zou, 2022). This continuous feedback process helps learners monitor their progress, address weaknesses, and build proficiency effectively. Moreover, AI platforms utilize interactive technologies to simulate real-world communication contexts, making language learning more immersive and practical (Derakhshan et al., 2024a; Godwin-Jones, 2023; Schmidt & Strasser, 2022). Conversational agents and virtual language exchange tools engage learners in dialogs, role-plays, and simulations, enabling authentic language use and comprehension (Liao et al., 2023). These interactive experiences move beyond rote memorization, promoting communicative competence and fostering cultural understanding in the target language. By integrating these features, AI-driven solutions offer significant potential to transform traditional approaches to language learning.

Among these AI-powered tools, chatbots have gained significant attention for their potential to address FLA and provide personalized language practice (Xin & Derakhshan, 2025). Although Mondly, with its focus on immersive scenarios and gamified learning, represents one approach to chatbot-mediated language learning, other AI platforms also offer diverse functionalities and pedagogical designs. For example, Duolingo leverages AI for personalized learning paths and adaptive exercises, while Babbel emphasizes curriculum design aligned with the CEFR and integrates speech recognition technology for pronunciation feedback (Kessler et al., 2023). Rosetta Stone, known for its immersive approach, includes AI-powered features like TruAccent for real-time pronunciation correction (Handley, 2024). These varied platforms reflect different instructional philosophies, raising questions about how effectively AI solutions address specific challenges, such as speaking anxiety or pronunciation errors, compared to traditional language classrooms (Hsu et al., 2023; Zou et al., 2023). For instance, while AI chatbots provide immediate feedback and flexible practice, teacher-led or peer-based methods may offer deeper socio-emotional support and nuanced cultural insights, suggesting a complementary role rather than a full replacement. These diverse approaches have led to a growing body of research investigating the efficacy of AI chatbots in language learning.

Research has consistently demonstrated the positive impact of AI chatbots on various aspects of language learning. Recent studies highlight the potential of modern chatbots to deliver interactive and context-sensitive feedback, which enhances learners’ confidence and engagement in speaking activities (Kim & Su, 2024; Yang et al., 2022). For instance, Yang et al. (2022) showed that an AI chatbot used as an English conversation partner in EFL classes improved students’ willingness to communicate while reducing speaking anxiety. Similarly, Kim and Su (2024) found that implementing a chatbot in Korean language learning settings significantly increased learners’ communication confidence and reduced anxiety levels. Additional research supports these findings. Naseer et al. (2024) demonstrated that chatbots acting as conversational partners effectively reduced FLA and facilitated language acquisition, creating supportive environments for learners. Shafiee Rad (2024) reported that tools like Speeko substantially enhanced L2 speaking proficiency and learners’ willingness to communicate, highlighting the transformative potential of AI applications in language education. These studies suggest that AI chatbots can be valuable tools for improving speaking skills, reducing anxiety, and fostering a more positive and engaging learning experience.

Despite these advancements, challenges persist. Early chatbot models often failed to deliver meaningful feedback due to limitations in processing input and generating nuanced responses, leading to suboptimal learning outcomes (Luo et al., 2022). While newer models, such as GPT-4, have made substantial progress in replicating human interaction, issues like ambiguity in meaning and occasional misinterpretation of user input remain. These challenges can hinder the learning process by causing confusion or frustration for users. Although improvements in chatbot accuracy and contextual understanding are evident, consistent and reliable performance across diverse user interactions remains an area for further development (Luo et al., 2022). Looking forward, ongoing research continues to advance the capabilities of AI-powered chatbots. Innovations in natural language understanding, dialog management, and user engagement are improving their functionality (Wollny et al., 2021). The integration of multimodal features, such as voice recognition and gesture interpretation, promises to enrich interactions with conversational agents further (Serban et al., 2017). These advancements point to a promising future for chatbots in language education. While they already demonstrate substantial potential in reducing speaking anxiety and improving proficiency, continued development is essential to address current limitations and enhance their effectiveness. As AI technologies evolve, their role in transforming language learning will likely become even more impactful.

AI chatbots in EFL speaking instruction

The burgeoning field of AI presents exciting possibilities for transforming EFL speaking instruction. The current research on AI chatbots highlights both their potential and limitations in promoting speaking skills and reducing speaking anxiety (Fathi et al., 2024; Hsu et al., 2023; Hwang et al., 2024; Tai & Chen, 2024). The effectiveness of AI chatbots in language learning can be traced back to the communicative language teaching (CLT) approach, which emphasizes interaction as a means of learning (Fathi et al., 2025; Littlewood, 1981). Additionally, Vygotsky’s (1978) SCT underscores the importance of social interaction in cognitive development, providing a theoretical foundation for the use of conversational agents in language acquisition. A critical area of focus is chatbot design and functionality, which not only influences skill enhancement but also impacts anxiety reduction among learners.

To understand the potential of AI chatbots in EFL speaking instruction, it is essential to examine their efficacy in comparison to both human-led instruction and other AI tools. Although AI chatbots offer the potential for personalized practice and anxiety reduction (Hwang et al., 2024), their ability to fully replicate the nuances of human interaction and feedback remains a subject of ongoing investigation (Jinming & Daniel, 2024). For example, while chatbots can provide immediate feedback on pronunciation and grammar (Hsu et al., 2023), they may not be able to consider individual learning styles, offer nuanced explanations, or adapt feedback based on subtle cues in the same way that a teacher or peer can. Similarly, while chatbots can simulate conversational practice, they may not fully replicate the dynamic and unpredictable nature of real-life human interactions, which are crucial for developing communicative competence (Hsu et al., 2023; Jeon, 2024). In contrast, more traditional approaches—such as teacher-facilitated group discussions or guided role-plays—may provide deeper socio-cultural insights and immediate adjustments based on live human feedback, underscoring a complementary rather than exclusive relationship between AI chatbots and conventional practices.

Research on the effectiveness of AI chatbots in EFL speaking instruction has yielded promising results. For example, Yang et al. (2022) introduced “Ellie,” a task-based voice chatbot specifically tailored for EFL learners. Their study demonstrated that Ellie effectively encouraged student engagement and facilitated meaningful interaction, highlighting the potential of chatbots to foster active participation in language learning. Another crucial aspect is automated feedback, which plays a significant role in both skill development and reducing learner anxiety by providing immediate, non-judgmental responses. Zou et al. (2023) investigated AI speech evaluation programs that provide feedback to learners. Their study, conducted over a one-month period, found that participants reported significant gains in speaking skills. However, the study lacked a control group, which limits the ability to draw definitive conclusions about the effectiveness of the AI feedback programs.

The use of AI chatbots has shown promising results in enhancing learners’ speaking skills, including fluency, coherence, vocabulary, grammar, and pronunciation (Hsu et al., 2023; Hwang et al., 2024; Shin et al., 2021; Tai & Chen, 2024). Studies have demonstrated that chatbots effectively improve speaking proficiency by providing learners with opportunities for practice and immediate feedback in a controlled environment (Dizon, 2020; Lin & Mubarok, 2021; Qiao & Zhao, 2023; Yang et al., 2022). For instance, integrating AI chatbots as conversational partners in speaking classes has led to increased learner engagement and proficiency (Yang et al., 2022). Similarly, mind map-guided AI chatbots have outperformed conventional chatbots by offering more structured and effective interactions (Lin & Mubarok, 2021), highlighting the importance of structured chatbot design in maximizing instructional potential. However, it is important to note that some studies have found that while chatbots can improve certain aspects of speaking, they may not significantly impact other language areas, such as listening comprehension (Dizon, 2020), suggesting that chatbots may be more effective for certain language domains than others.

AI chatbots also play a significant role in reducing language-related anxiety and enhancing learner motivation. By creating a less stressful and non-judgmental learning environment, chatbots encourage learners to participate more actively (Hapsari & Wu, 2022; Jinming & Daniel, 2024). Studies suggest that interacting with chatbots can alleviate speaking anxiety and increase willingness to communicate (Hapsari & Wu, 2022; Muthmainnah, 2024). For example, a semester-long study by Kim and Su (2024) involving Korean-as-a-foreign-language learners demonstrated that eight structured chatbot sessions effectively reduced speaking anxiety while enhancing learners’ willingness to engage in communicative tasks. These findings underscore the value of designing chatbot interactions to build a comfortable and engaging learning experience, particularly for learners who may experience higher levels of anxiety in traditional classroom settings.

Moreover, AI chatbots have been found to enhance constructs closely tied to language learning success, such as foreign language enjoyment, language-specific grit, motivation, and positive attitudes (Hwang et al., 2024). Foreign language enjoyment refers to the positive emotional experiences learners have during language learning, while language-specific grit involves the perseverance and determination to overcome challenges (Derakhshan & Fathi, 2024; Hwang et al., 2024). Han and Ryu (2024) observed that voice-based AI chatbot activities positively influenced learners’ motivation and attitudes toward English learning, reinforcing the idea that chatbots can simultaneously improve speaking skills and strengthen the drive to learn. Motivation underpins sustained engagement in language study, pushing learners to persist despite difficulties, and attitudes such as enjoyment and grit further bolster motivation by shaping how learners perceive the learning experience and the effort needed for success. Yuan and Liu (2024) also discovered that AI tools increased learners’ engagement and enjoyment, both of which are central to improved speaking performance. Engagement, defined as the level of involvement and interest in learning tasks, is strongly correlated with better learning outcomes and is thus integral to the broader success of AI chatbot interventions.

The effectiveness of AI chatbots is closely linked to their design and ability to cater to individual learner needs, reflecting the principles of personalized learning (Wang & Odell, 2002). Personalized chatbot interactions maximize learning gains by addressing diverse psychological needs (Jeon, 2024). Incorporating aspects of social interaction, such as relevant topics and effective feedback strategies, is crucial for building rapport with learners (Engwall et al., 2022). Differences in learner profiles, such as age and educational background, may influence the perceived effectiveness of chatbots. For example, high school students reported greater enjoyment when using a text-based chatbot compared to college students, suggesting that chatbot design may need to be tailored to different learner groups (Shin et al., 2021). In addition, it is important to recognize that few studies investigating chatbot-mediated interventions have focused specifically on speaking anxieties, highlighting the need for research that targets the distinct fears learners face when speaking in an L2 (Ozdemir & Papi, 2022).

Furthermore, while many existing studies focus on single-institution or single-culture contexts, it is essential to recognize that the effectiveness of AI chatbots may vary considerably across broader cultural and educational settings (Chin et al., 2023; Yuan, 2024). For example, in collectivist cultures that prioritize group cohesion and collaborative learning, chatbots might be optimally deployed in peer-driven tasks or cooperative exercises, whereas in individualistic cultures, self-paced or one-on-one chatbot interactions may be more suitable. Likewise, factors such as institutional support, teacher readiness, and technological infrastructure can shape how effectively chatbots are integrated into curricula (Merelo et al., 2024). These contextual nuances suggest that generalizing findings across regions or learner populations should be approached with caution, and future research would benefit from cross-cultural comparisons of chatbot-mediated instruction (Saihi et al., 2024).

Despite the positive outcomes observed in the research on AI chatbots for language learning, challenges remain in optimizing their use. While chatbots can effectively improve speaking skills, their impact on other language areas, such as listening comprehension, may be limited (Dizon, 2020). Furthermore, the effectiveness of chatbots can vary depending on individual learner characteristics, highlighting the need for personalized approaches (Jeon, 2024). Technical challenges, such as ambiguous responses, can also hinder progress and user experience (Naseer et al., 2024). Therefore, future research should prioritize the development of adaptive chatbot systems that tailor interactions based on learner feedback and performance, ensuring a more holistic approach to language learning and anxiety reduction. This includes conducting replication studies with diverse participant groups and in varied contexts to validate existing findings and inform best practices (Porte, 2012).

The present study

AI-powered chatbots are increasingly recognized as valuable tools in language learning, offering personalized, interactive, and judgment-free environments that address challenges often encountered in traditional classroom instruction. Research has shown that chatbots can enhance language proficiency and reduce speaking anxiety, particularly for learners who face difficulties with conventional teaching methods (Çakmak, 2022; Yang et al., 2022). Nevertheless, more work is needed to confirm and extend these findings, especially regarding the distinctive role of speaking-specific anxiety in different language-learning contexts. Replication studies help refine our understanding of how chatbots shape both affective and linguistic outcomes (Fathi et al., 2024; Porte, 2012).

This study contributes to this growing body of research by investigating the impact of AI-powered chatbots on L2 speaking skills and speaking-specific anxiety among EFL learners. By adopting a mixed-methods approach, we integrate quantitative and qualitative data to provide a more comprehensive understanding of how learners engage with chatbot-mediated speaking tasks and how these interactions may enhance proficiency and reduce anxiety. Focusing on speaking-specific anxiety enables a more precise exploration of learner fears tied to oral communication, complementing broader investigations of general language anxiety. The study is guided by two primary research questions:

  1. To what extent do AI-powered conversation bots enhance L2 speaking skills, including fluency, accuracy, and overall proficiency, in EFL learners?

  2. How does interacting with AI chatbots affect L2 speaking anxiety among EFL learners?

Based on these research questions and the existing literature, the study proposes the following hypotheses:

  • H1: EFL learners using AI-powered chatbots will show significantly greater improvements in L2 speaking skills (fluency, accuracy, and overall proficiency) compared to learners using traditional speaking practice methods.

  • H2: EFL learners interacting with AI-powered chatbots will report significantly lower speaking anxiety compared to those engaging in traditional speaking practice.

  • H3: The effects of AI-powered chatbots on reducing speaking anxiety and improving speaking skills will vary based on baseline anxiety levels, with learners experiencing higher initial anxiety deriving greater benefits from chatbot interactions.

Methodology

Participants

This study employed an explanatory sequential mixed-methods design to evaluate the effects of AI-powered conversation bots on L2 speaking skills and speaking anxiety. The mixed-methods approach combined quantitative data for statistical analysis with qualitative data to capture participants’ subjective experiences. This integration ensured that observed changes in speaking skills and anxiety levels were both quantified and contextualized, offering a nuanced understanding of the intervention’s impact (Creswell & Creswell, 2017). Triangulating these data sources enhanced the study’s validity and provided deeper insights into the intervention’s effectiveness.

The participants were recruited from two intact undergraduate classes at an IELTS preparation center in China. Using pre-existing classes reflects common practices in educational research, balancing logistical feasibility with the authenticity of real-world settings (Creswell & Creswell, 2017). Sixty students participated, with 30 assigned to the experimental group and 30 to the control group through convenience sampling. To ensure baseline equivalence, participants were grouped based on official IELTS scores ranging from 5 to 6, corresponding to the B1 (low intermediate) level on the Common European Framework of Reference (CEFR). This stratification ensured that both groups started with comparable English proficiency, isolating the effects of the AI intervention.

Demographic data provided additional context for the sample. Participants’ mean age was 22.36 years (SD = 2.84), with no significant age differences between the groups (p > 0.05). Gender distribution included 36 females and 24 males, consistent with the demographics of the IELTS preparation center, and showed no significant group differences (p > 0.05). Participants reported an average of seven years of prior language learning experience (SD = 1.86), also comparable across groups (p > 0.05). These demographic variables, collected via a self-reported questionnaire during recruitment, confirmed the sample’s homogeneity regarding age, gender, and language learning background. The study also incorporated qualitative interviews with 13 participants from the experimental group to enrich the quantitative findings. These interviews provided detailed insights into how participants engaged with the AI intervention, their perceived benefits, and challenges encountered during the study. The qualitative phase captured variability in experiences that pre-test and post-test scores could not fully explain, offering a comprehensive understanding of the intervention’s impact on learners.

Ethical considerations were a core aspect of the research. Written informed consent was obtained from all participants prior to data collection, ensuring voluntary participation. The study protocol was reviewed and approved by the Institutional Review Board of the corresponding author’s institution. These ethical measures ensured transparency, participant well-being, and the confidentiality of collected data, reinforcing the study’s integrity and adherence to research standards.

Instruments

This mixed-methods study utilized a range of instruments to assess the research objectives:

L2 speaking skill

The study employed the IELTS Speaking test, a well-established standardized assessment of spoken English proficiency, as the primary tool for evaluating participants’ speaking skills. This test was selected because it assesses the four key components of speaking proficiency targeted by the Mondly intervention: fluency and coherence, lexical resource, grammatical range and accuracy, and pronunciation. Administered as a face-to-face interview, the test assesses these components through a structured three-part format. In Part 1, participants answer introductory questions about familiar topics such as their personal lives and interests. Part 2 requires them to deliver a structured response to a specific topic using prompts provided on a cue card. Part 3 involves a more advanced discussion expanding on the topic from Part 2, encouraging participants to express and justify complex ideas.

The version of the IELTS Speaking test used in this study strictly followed the official format and guidelines set by IELTS. To ensure consistent and reliable assessment, two IELTS-certified examiners independently evaluated the recorded responses. These examiners were experienced IELTS raters with a minimum of 5 years of experience each and were selected from a pool of certified examiners at the IELTS testing center where the study was conducted. The selection process aimed to ensure that the examiners had similar levels of experience and familiarity with the IELTS scoring rubrics. Before the study, the examiners underwent a calibration session to align their interpretation and application of the IELTS scoring rubrics. This step was essential to maintaining uniformity in evaluations. The examiners adhered to the official rubrics, which offer detailed criteria for assessing various proficiency levels across the four domains. To confirm inter-rater reliability, 10% of the recordings were independently rated by both examiners. The analysis yielded a Cohen’s kappa coefficient of 0.87, indicating strong agreement and reinforcing the reliability of the scoring process.

The IELTS Speaking test was particularly well-suited for this study because its assessment criteria directly aligned with the speaking skills targeted by the Mondly intervention. Mondly’s speaking practice activities were designed to simulate real-world communication scenarios, providing participants with opportunities to develop skills assessed in the test. For example, its simulated conversations focused on functional language use in practical contexts, such as ordering food or asking for directions, corresponding to the communicative tasks emphasized in Part 1 of the IELTS Speaking test. Additionally, Mondly’s open-ended response activities encouraged learners to elaborate on complex topics or express their opinions, mirroring the type of discourse evaluated in Part 3. Moreover, the platform’s pronunciation practice, powered by speech recognition technology, provided targeted feedback on phonetic accuracy, aligning closely with the pronunciation assessment criteria of the test. Mondly’s focus on grammar and vocabulary development further supported the lexical resource and grammatical range dimensions of the IELTS Speaking test, ensuring that participants’ speaking practice addressed all components critical for proficiency.

L2 speaking anxiety

To gauge participants’ L2 speaking anxiety, the study employed a validated 19-item scale developed by Ozdemir and Papi (2022). This scale builds upon the well-established Foreign Language Classroom Anxiety Scale by Horwitz et al. (1986), but is specifically adapted to focus on anxieties associated with spoken communication rather than general language learning anxiety. We selected this adapted tool because it provides a more precise assessment of speaking-specific anxiety, which aligns directly with our study’s focus on L2 speaking skills and speaking anxiety. Participants indicated their level of agreement with each statement on a 6-point Likert scale, ranging from “strongly agree” to “strongly disagree.” For example, one sample item reads: “I get nervous when I am speaking English in my class.” The scale demonstrated strong internal consistency within this study, with a Cronbach’s Alpha coefficient of 0.81. To capture potential changes in anxiety levels, the scale was administered to all participants in both groups at the pre-test and post-test stages of the study.

Semi-structured interviews

To complement the quantitative data, semi-structured interviews were conducted with 13 participants from the experimental group who used Mondly. The interview protocol was carefully designed to gather in-depth insights into the participants’ experiences, focusing on both positive outcomes and areas requiring improvement. These one-on-one interviews were conducted either in person or via video conferencing, depending on participants’ availability, and each session lasted ~30–45 min. The interviews explored several key areas through open-ended questions, such as: “Can you describe your experience using Mondly for practicing speaking?”; “Did Mondly help you feel more confident in speaking English?”; “What were some of the strengths and weaknesses of using Mondly for your speaking practice?”; and “How would you compare this experience to traditional methods of speaking practice (e.g., practicing with a partner)?” These core questions guided the exploration of the perceived impact of Mondly on speaking confidence and fluency, features participants found beneficial or challenging, and comparisons between chatbot-assisted learning and traditional speaking practice methods.

The semi-structured format provided the flexibility to delve deeper into participants’ responses, enabling a more nuanced understanding of their perspectives. This approach, widely recognized as a valuable method for eliciting rich qualitative data (e.g., King et al., 2019), allowed interviewers to ask follow-up questions and clarify responses as needed, ensuring the data captured reflected the full range of participant experiences. The interview transcripts were then subjected to thematic analysis, which identified recurring patterns and key themes. To enhance the trustworthiness of the analysis, member checking was employed, allowing participants to review and verify the accuracy of the interpretations. These qualitative findings enriched the study’s mixed-methods framework, providing valuable context to interpret the quantitative results and offering a comprehensive view of the chatbot’s effectiveness in supporting language learning.

Although the interview protocol was guided by these main questions, the semi-structured format allowed for extensive exploration of participants’ experiences. Interviewers used probing and follow-up questions to encourage participants to elaborate on their responses, share specific examples, and reflect on various aspects of their interaction with Mondly. This approach enabled us to gather rich, detailed data that went beyond the initial questions. For instance, when participants mentioned feeling more confident, interviewers would ask, “What aspects of Mondly do you think contributed to this increased confidence?” or “Can you provide an example of a situation where you felt this change?” This allowed participants to delve deeper into their experiences, providing nuanced insights that informed the thematic analysis.

Mondly

Mondly was selected as the primary tool for the experimental group’s L2 speaking practice due to its alignment with established pedagogical frameworks, particularly communicative and task-based language teaching. The platform provides task-based conversational activities, adaptive learning features, and real-time feedback, making it well-suited for developing speaking skills. Its speech recognition technology evaluates pronunciation, fluency, and grammar, offering immediate corrective feedback to help learners identify and address specific skill gaps while refining their language use. The design of Mondly also incorporates gamified elements, including points, badges, and leaderboards, which sustain learner motivation and encourage consistent practice. These features foster an engaging and immersive environment, aligning with the study’s aim of improving speaking proficiency while reducing speaking anxiety. These attributes made Mondly an appropriate tool for this intervention.

To confirm the suitability of Mondly for the study, a pilot test was conducted with five EFL learners before full implementation. Over a two-week period, participants used the platform, and their feedback was gathered through semi-structured interviews and usability surveys. The findings demonstrated that Mondly effectively met learners’ needs and aligned with the study’s objectives. Based on this feedback, minor adjustments were made to its integration into the curriculum, such as providing orientation sessions for participants unfamiliar with AI-based platforms. These steps ensured that Mondly was appropriately positioned to support measurable improvements in L2 speaking skills and alleviate speaking anxiety in the experimental group.

Procedure

This mixed-methods study employed an explanatory sequential design to assess the impact of an AI-powered language learning application, Mondly, on L2 speaking skills and speaking anxiety. The study implemented structured procedures for both the experimental and control groups, aiming to maintain methodological rigor, equivalence in practice time, and isolation of the treatment variable.

Experimental group

The experimental group utilized Mondly as an out-of-class supplementary tool integrated into their IELTS preparation curriculum, replacing one hour of self-study practice per week with Mondly activities. We recommended a total of 540 min across 6 weeks, advising participants to complete three 30 min sessions per week. This approach aligns with our pilot data and prior research (e.g., Engwall et al., 2022; Fathi et al., 2024) suggesting that sustained practice over a structured time frame supports improvements in speaking proficiency and reductions in language anxiety. Importantly, 540 min was not mandated as an absolute maximum or minimum; rather, it served as a practical target engagement level, balancing participants’ cognitive load with the need for repeated, scaffolded interaction.

Participants engaged with Mondly through structured modules targeting critical aspects of speaking practice. These modules were selected to align with the skills assessed in the IELTS Speaking test, as well as to address common challenges faced by EFL learners in spoken communication. For each task, clear instructions and examples were provided to ensure learners understood the requirements and interacted with the chatbot effectively. Specifically, these modules included role-playing exercises that simulated real-world scenarios such as ordering food or making appointments, requiring learners to use functional language and respond appropriately to the chatbot’s prompts. Moreover, open-ended tasks prompted extended responses to questions about personal experiences or opinions, encouraging learners to develop fluency, coherence, and the ability to express and justify their ideas. To further integrate speaking with other skills, reading and responding tasks involved reading short texts and answering comprehension questions posed by the chatbot. Finally, pronunciation drills focused on practicing individual sounds, word stress, and intonation patterns, with the chatbot providing immediate feedback on pronunciation accuracy.

To ensure task relevancy and maintain focus, participants were guided through each session by clear objectives. Weekly instructions outlined specific tasks and provided strategies for maximizing engagement with Mondly’s features. At the end of each session, participants reflected on their performance and documented their progress in personal journals, detailing completed tasks, challenges encountered, and observations about their practice. These journals offered qualitative insights into participants’ engagement and satisfaction with the intervention.

Ensuring adherence and engagement

To clarify and validate our 540 min target, we employed a two-tier monitoring system. First, Mondly’s tracking capabilities recorded each participant’s login times, session durations, and completed tasks. Although Mondly’s data served as an unofficial log rather than an official usage report, it allowed us to approximate participant activity and compare individual patterns over the study period. Second, participants maintained weekly time logs, documenting their start and end times for each session. These logs were cross-referenced with Mondly’s usage data to address any discrepancies (e.g., instances where participants logged more or less time than the app data suggested). In such cases, participants were contacted for clarification. If a participant consistently failed to meet the recommended 540 min or could not resolve discrepancies, their data were excluded from the final analysis. This process ensured that participants who remained in the study had reasonably met the engagement target.

We also instructed participants to limit other English practice activities to a maximum of one hour per week, to help control for out-of-class exposure. Random audits of interaction logs checked whether learners were meaningfully engaged in conversation rather than simply navigating the application. By triangulating self-reports, usage data, and regular follow-ups, we aimed to keep participants’ exposure as consistent as possible across the experimental group.

Control group

The control group followed an equivalent practice regimen designed to mirror the duration and variety of the Mondly tasks. They, too, aimed to complete three 30 min sessions per week over 6 weeks, equating to 540 min of directed practice. To maintain similar conditions to the experimental group, the control group’s practice also replaced one hour of their weekly self-study time. To account for potential variations in out-of-class language exposure, both groups were instructed to limit their English practice outside of the assigned activities (Mondly for the experimental group, alternative activities for the control group) to a maximum of one hour per week. This helped to ensure that any observed differences between the groups could be more confidently attributed to the intervention.

Participants accessed online platforms offering interactive exercises, educational videos, and podcasts tailored to improving speaking skills. These resources exposed learners to diverse topics and enhanced their listening and speaking capabilities. To ensure focused engagement, the research team provided specific instructions and links to curated resources, guiding learners to practice similar skills as those targeted in the Mondly intervention. In addition, participants engaged in online language exchange programs, specifically those with structured conversation prompts and tasks to ensure focused practice, enabling conversational practice with native English speakers. This provided opportunities for authentic interaction, mimicking the contextual engagement facilitated by Mondly. To control for extraneous interaction, learners were instructed to limit their language exchange sessions to the assigned tasks and time limits.

Self-directed activities formed a significant component of the control group’s practice. These included shadowing exercises, where participants imitated native speaker recordings to refine pronunciation and intonation, and self-assessment tasks, such as recording and reviewing their spoken responses. These activities encouraged learners to focus on specific language skills, aligning with the experimental group’s structured practice approach. To ensure adherence, control group participants documented their activities in weekly logs, specifying the tasks completed and the time spent on each. These logs were reviewed regularly to confirm compliance. The research team also conducted random follow-ups to address discrepancies and verify engagement.

Data analysis

Descriptive statistics, including means, standard deviations, and confidence intervals (CIs), summarized the distribution of L2 speaking skills and speaking anxiety levels. Paired-samples t-tests assessed within-group changes across pre-test and post-test phases, evaluating improvements in speaking skills and reductions in anxiety. To compare the effectiveness of the AI-based language learning application and the control group’s activities, two separate one-way between-groups analyses of covariance (ANCOVAs) were conducted. Separate ANCOVAs were performed because the two dependent variables—L2 speaking skills and L2 speaking anxiety—represent distinct constructs, each requiring independent evaluation to ensure precise modeling of intervention effects. Running separate analyses avoided potential violations of statistical assumptions, such as multicollinearity or overlapping error variance, which could arise if both dependent variables were included in a single multivariate analysis.

For the ANCOVA assessing L2 speaking skills, the pre-test IELTS speaking scores served as the covariate to control for initial proficiency differences between groups, ensuring that any observed post-test differences were attributable to the intervention rather than pre-existing disparities (Pallant, 2020). Similarly, for the ANCOVA on L2 speaking anxiety, the pre-test anxiety scores were used as the covariate, controlling for baseline anxiety levels. This approach allowed for a nuanced understanding of the intervention’s effectiveness on each outcome while maintaining the integrity of the statistical analyses. Additionally, paired-samples t-tests within each group further examined changes in self-reported speaking anxiety levels and speaking proficiency across the pre-test and post-test phases. This dual approach ensured robust evaluation of both within-group and between-group effects.
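The ANCOVA logic described above amounts to a model comparison: a full linear model (intercept, pre-test covariate, group) against a reduced model without the group term. The sketch below illustrates this with synthetic data seeded with values close to the reported group descriptives; all numbers are illustrative, not the study's raw data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic illustration (NOT the study's raw data): 30 learners per group,
# with means/SDs close to the reported pre-test descriptives.
n = 30
pre_exp = rng.normal(5.06, 1.23, n)                # experimental pre-test
pre_ctl = rng.normal(5.14, 1.08, n)                # control pre-test
post_exp = pre_exp + 0.72 + rng.normal(0, 0.5, n)  # larger gain
post_ctl = pre_ctl + 0.28 + rng.normal(0, 0.5, n)  # smaller gain

pre = np.concatenate([pre_exp, pre_ctl])
post = np.concatenate([post_exp, post_ctl])
group = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = experimental

def ancova_group_effect(post, pre, group):
    """F-test for the group term after adjusting for the pre-test covariate,
    via comparison of full and reduced ordinary-least-squares models."""
    N = len(post)
    X_full = np.column_stack([np.ones(N), pre, group])
    X_red = np.column_stack([np.ones(N), pre])
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, post, rcond=None)
        return float(np.sum((post - X @ beta) ** 2))
    rss_f, rss_r = rss(X_full), rss(X_red)
    df_err = N - X_full.shape[1]                   # 60 - 3 = 57, as in the tables
    F = (rss_r - rss_f) / (rss_f / df_err)
    p = stats.f.sf(F, 1, df_err)
    return F, p

F, p = ancova_group_effect(post, pre, group)
print(f"group effect: F(1, 57) = {F:.2f}, p = {p:.4f}")
```

The error degrees of freedom (N − 3 = 57) match those reported for both ANCOVA models; the same comparison run with anxiety scores as the outcome yields the second model described above.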

The qualitative data, derived from thematic analysis of interview transcripts from a subset of experimental group participants, provided additional insights into learners’ experiences with the AI-based application. This analysis (King et al., 2019) identified recurring themes and patterns, offering context to complement the quantitative findings and shedding light on participants’ perceptions and attitudes toward conversational AI for improving speaking skills.

Results

Quantitative data

Table 1 presents the descriptive statistics, including means, standard deviations, and 95% CIs for L2 speaking skills and L2 speaking anxiety at pre-test and post-test for both the experimental and control groups. The experimental group showed a mean score on the pre-test of 5.06 (SD = 1.23, 95% CI [4.72, 5.40]), which increased to 5.78 (SD = 1.16, 95% CI [5.45, 6.11]) on the post-test. This reflects a gain of 0.72 points in L2 speaking skills following the intervention. The control group also demonstrated improvement, with a mean score of 5.14 (SD = 1.08, 95% CI [4.85, 5.43]) at pre-test rising to 5.42 (SD = 0.95, 95% CI [5.15, 5.69]) at post-test, reflecting a smaller gain of 0.28 points.

Table 1 Descriptive Statistics for L2 Speaking Skill and L2 Speaking Anxiety.

In terms of L2 speaking anxiety, the experimental group’s mean anxiety score decreased from 31.85 (SD = 5.42, 95% CI [30.13, 33.57]) at pre-test to 27.72 (SD = 4.78, 95% CI [26.21, 29.23]) at post-test, indicating a substantial reduction in speaking anxiety. The control group also showed a slight decrease in anxiety, with scores reducing from 29.83 (SD = 4.97, 95% CI [28.26, 31.40]) at pre-test to 28.76 (SD = 4.42, 95% CI [27.32, 30.20]) at post-test.

Table 2 summarizes the results of paired-samples t-tests conducted to examine within-group changes in L2 speaking skills and L2 speaking anxiety from pre-test to post-test for both the experimental and control groups. The experimental group displayed a statistically significant improvement in L2 speaking skills (t(29) = 4.25, p < 0.001), with a moderate effect size according to Cohen’s d (d = 0.60) (Cohen, 1988). This suggests that the intervention had a meaningful positive impact on the speaking abilities of participants in the experimental group. The control group also showed a significant increase in L2 speaking skills (t(29) = 2.54, p = 0.016), albeit with a smaller effect size (d = 0.36). The experimental group also exhibited a statistically significant decrease in L2 speaking anxiety (t(29) = 5.78, p < 0.001), with a large effect size (d = 0.76). This indicates that the intervention was successful in reducing self-reported speaking anxiety within the experimental group. The control group also showed a significant decrease in anxiety (t(29) = 2.17, p = 0.043), with a small effect size (d = 0.22).

Table 2 Results of Paired-Samples t-Tests for L2 Speaking Skills and L2 Speaking Anxiety.
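As an illustration of the within-group tests reported in Table 2, the sketch below runs a paired-samples t-test and computes Cohen's d as the mean pre-to-post difference divided by the standard deviation of the differences (the d_z variant; the study does not state which variant was used, so this is an assumption). The scores are synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic pre/post speaking scores for one group of 30 learners
pre = rng.normal(5.1, 1.2, 30)
post = pre + 0.7 + rng.normal(0, 1.0, 30)

t, p = stats.ttest_rel(post, pre)       # paired-samples t-test, df = 29
diff = post - pre
d = diff.mean() / diff.std(ddof=1)      # Cohen's d for paired data (d_z)
print(f"t(29) = {t:.2f}, p = {p:.4f}, d = {d:.2f}")
```

Under this variant, d_z = t / sqrt(n), so the effect size is fully determined by the t statistic and sample size.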

Following the within-group analyses using paired-samples t-tests, a one-way ANCOVA was conducted to examine the between-group differences in L2 speaking skills and L2 speaking anxiety after controlling for pre-existing group disparities. The pre-test scores from the L2 speaking skills assessment served as the covariate in the first ANCOVA model, while the pre-test scores from the L2 speaking anxiety measure were used as the covariate in the second ANCOVA model.

The ANCOVA results for L2 speaking skill (Table 3) revealed a statistically significant main effect for the pre-test covariate (F(1, 57) = 23.76, p < 0.001, η² = 0.45). This indicates that pre-existing differences in speaking skills between the groups at baseline significantly influenced post-test scores. Additionally, a significant main effect for group membership was found (F(1, 57) = 5.79, p = 0.022, η² = 0.11). After controlling for pre-test speaking skills, participants in the experimental group who interacted with the AI-based language learning application showed statistically significant gains in L2 speaking skills compared to the control group engaged in alternative speaking practice activities.

Table 3 ANCOVA Results for L2 Speaking Skill.

The ANCOVA results for L2 speaking anxiety (Table 4) indicated a statistically significant main effect for the pre-test covariate (F(1, 57) = 39.04, p < 0.001, η² = 0.72), suggesting that pre-test anxiety levels had a substantial influence on post-test self-reported anxiety. This large effect size indicates that baseline differences in anxiety were a critical factor in predicting post-test outcomes. Additionally, a significant main effect for group membership was observed (F(1, 57) = 8.89, p = 0.007, η² = 0.17), with the experimental group reporting significantly greater reductions in speaking anxiety compared to the control group after controlling for pre-test anxiety levels.

Table 4 ANCOVA Results for L2 Speaking Anxiety.

To assess how varying baseline anxiety levels influenced the intervention’s effectiveness, we categorized participants in the experimental group into three subgroups based on their pre-test anxiety scores: Low (<25th percentile), Moderate (25th–75th percentile), and High (>75th percentile). Table 5 details the descriptive statistics, mean reductions in post-test anxiety, and the results of a one-way ANOVA comparing these subgroups.

Table 5 Subgroup Analysis of Baseline Anxiety and Post-test Anxiety Reduction.

Participants with higher baseline anxiety (High subgroup) showed the largest mean reduction in speaking anxiety (6.12 points), compared to smaller reductions observed in the Moderate (3.21 points) and Low (2.14 points) subgroups. The one-way ANOVA revealed a significant effect of baseline anxiety level (F(2, 27) = 4.56, p = 0.018), confirming that learners entering the intervention with higher initial anxiety levels experienced more substantial improvements in post-test anxiety than their lower-anxiety counterparts. Tukey post-hoc tests further identified significant differences between the High and Low subgroups (p = 0.013), indicating that participants with pronounced baseline anxiety benefited more from the intervention. In contrast, no significant differences were observed between the Moderate and Low subgroups.
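The percentile-based subgrouping and one-way ANOVA described above can be sketched as follows; the anxiety values are synthetic and constructed only to mimic the reported pattern of larger reductions at higher baselines.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic baseline anxiety and pre-to-post reductions (illustrative only):
# reductions are built to grow with baseline anxiety.
baseline = rng.normal(31.9, 5.4, 30)
reduction = 0.4 * (baseline - baseline.mean()) + rng.normal(4.0, 2.0, 30)

q25, q75 = np.percentile(baseline, [25, 75])
low = reduction[baseline < q25]                         # Low subgroup
mod = reduction[(baseline >= q25) & (baseline <= q75)]  # Moderate subgroup
high = reduction[baseline > q75]                        # High subgroup

F, p = stats.f_oneway(low, mod, high)  # one-way ANOVA across subgroups
print(f"F(2, {len(baseline) - 3}) = {F:.2f}, p = {p:.4f}")
```

With 30 participants in three subgroups, the error degrees of freedom are 30 − 3 = 27, matching the reported F(2, 27). Pairwise post-hoc comparisons (e.g., Tukey's HSD) would then locate which subgroups differ.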

These findings suggest that the AI-powered intervention was particularly effective for participants with higher initial levels of speaking anxiety, potentially because the supportive and non-judgmental environment provided by the chatbot offered greater relative relief. The strong effect size of baseline anxiety levels (η² = 0.72) underscores the importance of considering initial participant profiles when interpreting the intervention’s impact.

Post-hoc power analysis

Given the relatively small sample size (n = 60) divided between the experimental and control groups, a post-hoc power analysis was conducted to evaluate whether the sample size was sufficient to detect meaningful effects for both L2 speaking skills and L2 speaking anxiety. The analysis utilized the effect sizes obtained from the ANCOVA results, with partial eta-squared values converted to Cohen’s f values for the calculations.

For L2 speaking skills, the ANCOVA results indicated a partial eta-squared of 0.11, corresponding to a Cohen’s f of 0.35. Using this effect size, an alpha level of 0.05, and the sample size of 60, the observed power (1 - β) was calculated to be 0.87, exceeding the conventional threshold of 0.80 for adequate statistical power (Cohen, 2013). This indicates that the sample size was sufficient to detect a medium-sized effect in L2 speaking skills with confidence. For L2 speaking anxiety, the partial eta-squared value of 0.17 yielded a Cohen’s f of 0.45. Based on this effect size and the same parameters, the observed power was calculated as 0.94, further affirming the adequacy of the sample size for detecting the large effect observed in the reduction of speaking anxiety.
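The effect-size conversion used above follows the standard relation between partial eta-squared and Cohen's f, f = √(η² / (1 − η²)); a minimal check of the two reported values:

```python
import math

def eta_sq_to_cohens_f(eta_sq):
    # Cohen's f from (partial) eta-squared: f = sqrt(eta^2 / (1 - eta^2))
    return math.sqrt(eta_sq / (1 - eta_sq))

print(round(eta_sq_to_cohens_f(0.11), 2))  # speaking skills: 0.35
print(round(eta_sq_to_cohens_f(0.17), 2))  # speaking anxiety: 0.45
```

The observed-power figures themselves additionally depend on the noncentral F distribution and the exact design (number of groups and covariates) and are typically obtained from software such as G*Power rather than computed by hand.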

Qualitative Results

This section explores the qualitative findings derived from semi-structured interviews conducted with a subset of students (n = 13) selected from the experimental group. Thematic analysis followed the procedures outlined by Braun and Clarke (2006): familiarization with the data, generating initial codes, searching for themes, reviewing themes, and defining and naming themes. This rigorous approach ensured that the analysis captured the depth and complexity of the data collected. To enhance the trustworthiness of the analysis, several strategies were employed. Firstly, an audit trail was maintained, documenting the coding process and decision-making throughout the analysis. Secondly, two researchers independently coded a 20% subset of the transcripts using a multi-category scheme in which each segment could be assigned to a single best-fitting theme (or an open category when no existing theme applied). To assess inter-rater reliability, we calculated Cohen’s kappa for each thematic category, treating each category as a discrete binary variable. Cohen’s kappa measures agreement between raters beyond what would be expected by chance, where 0 indicates chance-level agreement and 1 perfect agreement. A kappa value of 0.86 was obtained, indicating strong agreement between the raters; initial discrepancies in coding were discussed and resolved through consensus. Finally, member checking was conducted: a subset of participants was invited to review the identified themes and provide feedback on their accuracy and representativeness.
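As a sketch of the reliability computation, Cohen's kappa compares observed agreement between two raters with the agreement expected by chance from each rater's marginal category frequencies. The version below is the standard multi-category form (the per-category binary variant applies the same formula to one-vs-rest indicator codes), and the theme labels are hypothetical, not drawn from the study's transcripts:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    # p_e is chance agreement from the raters' marginal label frequencies.
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical theme codes from two raters for ten transcript segments
rater_1 = ["fluency"] * 5 + ["anxiety"] * 5
rater_2 = ["fluency"] * 4 + ["anxiety"] * 6
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.8
```

In practice the same value can be obtained from `sklearn.metrics.cohen_kappa_score`; the hand computation is shown only to make the chance-correction explicit.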

The interviewees were purposefully sampled to ensure representation across key demographic and proficiency variables present in the experimental group. Participants were selected to reflect diversity in terms of age, gender, and pre-test proficiency scores. Comparisons of demographic and proficiency-related data between the interviewees and the broader experimental group indicated no statistically significant differences, ensuring the sample was representative of the experimental cohort as a whole. Although the sample size for the interviews (n = 13) may appear small, it aligns with qualitative research practices aimed at achieving thematic saturation, where no new themes emerge from additional data collection. Additionally, the proportional size of the interview subgroup (43% of the experimental group) supports the representativeness of the qualitative findings. This approach strengthens the validity of the themes derived, providing nuanced insights into the participants’ experiences that complement the quantitative findings.

Theme 1: perceived enhancement of speaking fluency

A prominent theme that emerged from the interviews centered around participants’ perceived improvement in speaking fluency after utilizing the AI-based application. Students consistently expressed a sense of increased comfort and confidence in expressing themselves verbally. This theme was identified by analyzing the frequency of statements about increased comfort and confidence, with nine out of 13 participants reporting such improvements. For instance, Student A stated, “I undeniably feel that I can now speak more fluidly. Previously, I would hesitate frequently and struggle to find the appropriate words, but now it comes more naturally.” Similarly, Student B highlighted the application’s role in facilitating speaking practice without judgment, allowing them to make mistakes and receive corrective feedback. This iterative process of practice and correction, as perceived by Student B, contributed to a sense of enhanced flow and fluency in their spoken communication.

Theme 2: augmented pronunciation and grammar

Several participants (eight of 13) mentioned the positive impact of Mondly on their pronunciation and grammar. The frequency of specific mentions of pronunciation improvement and grammatical accuracy helped identify this theme. Student C specifically mentioned improvements in mastering challenging sounds, stating, “I previously struggled with specific sounds, but the application’s pronunciation exercises helped me master them. Now, I feel more confident speaking and know I’m pronouncing things accurately.” Similarly, Student D highlighted the application’s role in refining grammatical skills. They noted, “I observed my grammar becoming more accurate as I utilized the application. It would identify my errors and suggest corrections, which assisted me in learning from them and refining my overall grammar skills.” These quotes illustrate how the application’s features, such as pronunciation exercises and corrective feedback on grammar, were perceived by participants as valuable tools for enhancing their spoken language proficiency.

Theme 3: amplified motivation and engagement

The interactive and personalized nature of the AI-based language learning application emerged as a key factor contributing to participants’ increased motivation and engagement in language learning. This was evident in statements from seven participants who expressed that the app’s game-like features and personalized feedback motivated them to engage more fully. Student E expressed how the application transformed the learning experience, stating, “The application made learning English more enjoyable and engaging. It felt less like studying and more like playing a game, which maintained my motivation to practice.” This sentiment regarding the gamified elements and increased enjoyment of the learning process was further echoed by Student F, who highlighted the application’s personalized feedback mechanism. Student F appreciated how the application catered to individual learning styles, stating, “I appreciated the personalized feedback provided by the application. It made me feel like I was learning at my own pace and targeting specific areas that needed improvement.” These quotes suggest that the application’s ability to create a more stimulating and individualized learning environment fostered a sense of agency and motivation among participants.

Theme 4: mitigating anxiety and building confidence

A consistent theme identified by eight participants was the reduction of speaking anxiety. Frequency analysis revealed that more than half of the participants described overcoming their fear of speaking English through the use of Mondly. For instance, Student G described overcoming their fear of speaking English, stating, “Previously, I was genuinely anxious about speaking English, but the application helped me overcome that fear. Now, I feel more confident and comfortable expressing myself in English.” Similarly, Student H highlighted the application’s role in creating a safe space for practicing without judgment. They explained, “The application created a safe space for me to practice speaking without feeling anxious about making mistakes. This aided me in building my confidence and becoming a more comfortable speaker.” These quotes suggest that the application’s features, such as personalized practice and a non-judgmental environment, contributed to a reduction in speaking anxiety and a subsequent increase in participants’ confidence in their spoken English abilities.

Theme 5: recommendations for optimization

Although participants generally reported positive experiences using the AI-based language learning application, some provided valuable suggestions for improving the app, including a desire for more diverse topics (mentioned by six participants) and the addition of peer interaction (noted by four participants). Student I expressed a desire for greater variety in topics and exercises, stating, “It would be beneficial if the application offered a greater variety in topics and exercises. This would maintain interest and engagement in the long run.” This feedback highlights the importance of ongoing content development to cater to user preferences and sustain long-term engagement. For instance, several participants suggested incorporating more culturally relevant topics, such as discussions about popular music or movies, to increase their motivation and engagement. Others expressed a desire for more challenging exercises, such as debates or presentations, to further develop their speaking skills. Additionally, Student J suggested incorporating peer interaction, noting, “I would have preferred the option to practice speaking with other learners in addition to the AI-based interactions.” This suggestion underscores the potential value of integrating social learning elements within the application to complement the benefits of AI-powered language learning. Specifically, participants suggested features like virtual classrooms or online discussion forums where they could interact with other learners, practice their speaking skills in a more social setting, and receive feedback from their peers. These insights offer valuable considerations for future iterations of the application in enhancing the overall user experience.

Theme 6: benefits beyond speaking skills

Interestingly, some participants mentioned unexpected benefits of using the app, such as deeper insights into English grammar and increased interest in English culture. This was identified as a secondary theme: five participants specifically mentioned that Mondly prompted them to explore additional resources, such as movies and news articles. These insights suggest that the app might foster a holistic learning experience extending beyond the enhancement of speaking skills. For instance, Student K noted that the application not only enhanced their speaking fluency but also solidified their grasp of grammatical concepts, stating, “I discovered that the application not only assisted me in speaking better, but it also solidified my grasp of English grammar rules.” This deeper understanding stemmed from the application’s ability to provide contextualized grammar explanations and examples within the conversational exercises; Student K explained that seeing grammar rules applied in real-life conversations helped them understand the concepts more clearly and remember them more easily. Similarly, Student L described how the application sparked their curiosity about English culture, leading them to explore additional learning materials: “Using the application sparked my curiosity in learning more about English culture. I began watching English movies and reading English news articles, which further improved my overall language skills.” Student L attributed this interest to the application’s use of authentic materials and culturally relevant scenarios, noting that the conversations about everyday life in English-speaking countries motivated them to learn more about the culture. These experiences illustrate the potential of the AI-based language learning application to foster a more holistic learning experience that extends beyond spoken language proficiency. The findings suggest that the application can serve as a springboard for learners to develop a deeper appreciation for the English language and its associated culture.

Discussion

The present study explored the effectiveness of AI-powered chatbots in improving L2 speaking skills and reducing speaking anxiety among EFL learners. Quantitative findings demonstrated that participants using the chatbot achieved significant enhancements in speaking proficiency, notably in areas such as fluency and accuracy, compared to those in the control group. Additionally, the experimental group showed notable reductions in speaking anxiety, particularly among individuals with higher baseline anxiety levels. Qualitative insights enriched these findings by revealing that learners valued the chatbot as a practical tool for improving their speaking abilities, fostering motivation, and alleviating anxiety.

These results not only align with but also extend existing research on AI chatbots in language learning. For instance, while Yang et al. (2022) found that chatbots can enhance overall proficiency by promoting learner engagement, the current study highlighted significant gains in specific areas of speaking proficiency, namely fluency and accuracy. Such findings suggest that out-of-class chatbots, which offer personalized and varied interactive activities, may exert a more pronounced effect on targeted skills compared to in-class chatbots with limited activities. Furthermore, whereas Yang et al. focused on in-class usage, our results underscore the potential of chatbots in self-directed learning contexts, enabling learners to practice at their own pace and based on individual needs. Similarly, Shin et al. (2021) emphasized the importance of personalized chatbot interactions, and our study’s focus on learners with higher anxiety levels reinforces the value of adaptive technology in language education.

The observed improvements in L2 speaking skills appear to stem from specific chatbot features that promote language development in ways traditional methods may not. By enabling real-time, interactive conversations, the chatbot facilitated authentic communication scenarios, such as role-playing (e.g., ordering food, asking for directions) and open-ended discussions that encouraged learners to articulate opinions or elaborate on complex topics. This kind of spontaneous speech practice in a low-pressure environment is difficult to replicate in conventional classrooms, where time constraints and fear of negative evaluation often impede participation. These varied, meaningful interactions likely fostered fluency by prompting learners to experiment with language forms in context-driven tasks. Such an approach is reminiscent of the zone of proximal development (ZPD; Vygotsky, 1978), whereby learners tackle tasks slightly beyond their independent capabilities but progress with scaffolded guidance.

Moreover, the chatbot’s immediate, private feedback on grammar and pronunciation appeared more effective than delayed or public feedback common in traditional settings, allowing learners to internalize corrections and refine their skills without anxiety. This feedback mechanism aligns with principles of scaffolded learning (Wood et al., 1976), where timely support helps learners progress beyond their current capabilities. Receiving targeted guidance promptly may have been especially crucial for reducing anxiety, as it helped learners gain confidence in their ability to use the language correctly, minimizing the fear of public scrutiny that often accompanies group-based feedback. By framing errors as learning opportunities rather than failures, the chatbot promoted a supportive environment that contrasts with the discomfort some learners feel in a traditional classroom (Horwitz et al., 1986). This private, corrective feedback likely facilitated self-regulated learning and ongoing skill refinement, aligning with Zou et al. (2023) and Chen et al. (2021), who underscore the importance of immediate, personalized feedback. Hsu et al. (2023) likewise found that automated feedback mechanisms improved grammatical accuracy and pronunciation, indicating that this type of targeted support can serve as a tool for mediated language development. This feedback loop, where the chatbot provides support and guidance within the learner’s ZPD, enables them to internalize new language structures and strategies, ultimately facilitating language development.

Additionally, the reductions in speaking anxiety observed in this study support prior research on AI chatbots as enablers of low-stress learning environments (Çakmak, 2022; Hapsari & Wu, 2022; Xin & Derakhshan, 2025). Speaking anxiety—a well-documented barrier to language learning (Horwitz et al., 1986; MacIntyre & Gardner, 1994)—often results from fear of negative evaluation and limited control over the learning process. In our study, the chatbot’s flexibility allowed learners to manage their learning pace, select topics of interest, and practice privately, thereby mitigating some of these anxiety-inducing factors. This sense of autonomy contrasts with settings where learners feel compelled to speak under peer pressure and strict time limits. Subgroup analysis revealed that learners with higher baseline anxiety benefited the most, suggesting that the chatbot’s features were particularly advantageous for those intimidated by conventional classrooms. Offering a personalized environment and targeted feedback likely boosted these learners’ confidence, encouraging them to engage more fully in speaking tasks.

Nevertheless, individual differences extend beyond anxiety alone. While our study concentrated on learners with heightened anxiety, other factors—such as prior proficiency, digital literacy, and learning style preferences—may also influence how learners respond to AI chatbots (Jeon, 2024). For instance, participants with limited digital experience or lower baseline proficiency could face an initial adaptation period but might ultimately gain significant benefits if chatbot tasks are carefully calibrated to their skill level. Clarifying these diverse learner profiles would further enrich understanding of how AI chatbots could be tailored to various educational contexts and learner needs.

The qualitative findings add depth to these outcomes by revealing learners’ perceptions of enhanced fluency, which they attributed to consistent practice and prompt feedback on mistakes. Regular chatbot interactions likely contributed to the automatization of language structures, leading to smoother and more confident speech. This observation aligns with Han and Ryu’s (2024) findings that chatbot activities positively influence motivation and attitudes. Moreover, participants noted substantial improvements in pronunciation and grammar, with instant feedback helping them address difficult sounds and refine sentence structures—a result consistent with Fathi et al. (2024), who emphasize the importance of feedback in bridging interlanguage gaps. Gamified elements, such as points or badges, may have further sustained engagement, aligning with Yuan and Liu’s (2024) report that such features foster motivation. As learners concentrated on achieving these goals, their anxiety appeared to subside, supporting Dörnyei and Ushioda’s (2011) assertion that motivated learners are more resilient in overcoming language barriers.

Participants also mentioned additional benefits beyond improved speaking proficiency and anxiety reduction, including a deeper understanding of grammar and greater interest in specific aspects of English culture, such as idiomatic expressions, popular media, and conversational norms. This indicates that the chatbot’s influence transcended linguistic gains, potentially fostering both cultural competence and foreign language enjoyment (Dewaele & MacIntyre, 2014). The consistent engagement noted among learners may also reflect the development of “grit,” defined as persistence and passion for long-term goals (Hwang et al., 2024). However, participants suggested broadening the chatbot content—such as adding more diverse topics and options for peer collaboration—to maintain interest and social interaction (Engwall et al., 2022). While AI chatbots offer distinct advantages, incorporating them alongside teacher- or peer-led opportunities is critical to ensure a balanced, communicative learning experience (Han & Ryu, 2024).

In acknowledging the positive outcomes, it is also important to consider the design limitations of the chatbot that may have shaped the results. For instance, scripted or generic prompts could limit spontaneity, reducing the extent to which learners can practice genuinely creative language use. Additionally, feedback—though immediate—sometimes lacked context-specific explanations, potentially hindering learners from fully grasping the rationale behind corrections. Refining chatbot design to offer more adaptive, context-aware responses could enhance both communicative authenticity and depth of feedback, leading to more sustained improvements in speaking skills.

Conclusion

This mixed-methods study examined the effectiveness of AI-powered conversation bots in improving L2 speaking skills and reducing FLSA among EFL learners. The results demonstrated that the experimental group, which used Mondly, an AI conversation bot application, showed significant gains in speaking proficiency and reductions in FLSA compared to the control group, which engaged in alternative speaking activities. Qualitative data from participant interviews supported these findings, highlighting perceived improvements in fluency, pronunciation, grammar, motivation, and confidence.

This study makes several important contributions to the field of computer-assisted language learning. Firstly, it provides valuable evidence for the efficacy of AI-powered chatbots in improving L2 speaking skills and reducing speaking anxiety, particularly among learners who may find traditional classroom environments challenging. Secondly, by adopting a mixed-methods approach, this study offers a more holistic understanding of the impact of chatbot interventions, capturing both the measurable outcomes and the nuanced experiences of learners. Thirdly, this research specifically focuses on FLSA, a critical area that has received less attention than general foreign language anxiety (FLA). Finally, by conducting a detailed exploration of Mondly’s features and functionalities, this study offers practical insights for educators and researchers interested in understanding how specific chatbot features can be leveraged to support language learning and reduce anxiety.

AI conversation bots provide EFL educators with a versatile tool for integration into curricula, offering frequent, judgment-free speaking practice, personalized feedback on linguistic performance, and gamified features to enhance learner motivation and engagement. These platforms enable dynamic, interactive learning experiences both in and outside the classroom, allowing learners to refine their speaking skills independently or in small group settings. This suggests that language curricula should be designed to incorporate AI tools strategically, providing opportunities for learners to engage in chatbot-mediated practice that complements traditional classroom activities. For example, chatbots could be used for targeted pronunciation practice, vocabulary building, or fluency development, allowing teachers to focus on other aspects of language instruction, such as facilitating discussions or providing individualized feedback. Furthermore, these findings suggest a need for teacher training programs that focus not only on the technical operation of AI tools but also on pedagogical strategies for integrating them effectively, such as designing appropriate chatbot-mediated tasks, guiding learners in interpreting AI feedback, and scaffolding their use to maximize learning gains while managing cognitive load.

Beyond classroom instruction, AI conversation bots also hold significant potential as supplementary resources for language institutions. By incorporating these tools into their offerings, institutions can provide students with opportunities for additional speaking practice, enhancing overall learning outcomes. Institutions could consider integrating chatbots into language labs, self-access centers, or online learning platforms, providing learners with flexible and accessible tools for independent practice. Moreover, institutions could offer workshops or training sessions for teachers on how to effectively utilize AI chatbots in their teaching, ensuring that these tools are integrated thoughtfully and purposefully. For independent learners, AI conversation bots facilitate self-directed learning, enabling them to practice at their own pace and focus on specific areas for improvement. This flexibility empowers learners to take greater control of their language development and fosters a more tailored and effective learning experience. At a policy level, these results support considering strategic investment in AI-driven language learning resources, alongside initiatives to ensure equitable access and effective implementation across educational institutions.

The study also holds implications for the design and development of educational AI chatbots. The findings, particularly the qualitative feedback suggesting a desire for more topic variety and peer interaction options, point towards areas for future enhancement. Developers should consider incorporating more adaptive content generation based on learner interests and proficiency, as well as exploring features that facilitate guided social interaction alongside AI practice. Collaboration between language educators, researchers, and AI developers could lead to more pedagogically sound and engaging tools that better align with communicative language teaching (CLT) principles.

The positive impact on both skills and anxiety underscores the potential of AI chatbots, particularly as tools to foster learner autonomy and self-directed practice. Learners can use the low-stakes environment to build confidence and experiment with language, potentially developing self-assessment skills by reflecting on the feedback received. Nevertheless, while AI-powered chatbots prove beneficial in alleviating speaking anxiety and enhancing communicative abilities, it is important to integrate them as complementary tools rather than replacements for human interaction. Peer practice and teacher guidance remain central to addressing nuances in language use and sociocultural contexts that chatbots may not fully replicate. For instance, collaborative exercises with classmates provide real-time negotiation of meaning and emotional support, while teacher-led feedback can offer in-depth, contextual explanations. Balancing chatbot use with these human elements can help learners experience both the individualized, low-pressure practice afforded by AI and the rich interpersonal dynamics essential for holistic language development. Specifically, for learners identified with high speaking anxiety, chatbots could serve as an initial, confidence-building step before engaging in higher-stakes peer or classroom interactions.

This study acknowledges several limitations that should be addressed in future research. First, the modest sample size and limited diversity of participants constrain the generalizability of the findings. While the observed improvements in L2 speaking skills and reductions in speaking anxiety are encouraging, larger-scale studies with participants from diverse linguistic, cultural, and educational backgrounds are needed to validate these results. Such research could also examine how AI-powered chatbots perform across different learner demographics, including variations in age, proficiency levels, and educational settings.

Second, the study’s reliance on Mondly, a commercially available AI chatbot, restricted the ability to control or customize specific features for experimental purposes. While Mondly offered diverse interactions and immediate feedback, its proprietary design limited the exploration of individual functionalities, such as feedback types or conversational complexity. Future studies could address this limitation by utilizing custom-built chatbots with adjustable parameters to facilitate experimental flexibility. For example, researchers could compare the effectiveness of adaptive difficulty settings, different feedback modalities (e.g., text-based versus voice-based), or various task types (e.g., structured versus open-ended) to identify the features that most effectively enhance speaking skills or reduce anxiety. Such investigations would provide more granular insights into the mechanisms driving the efficacy of AI-powered tools.

Finally, speaking anxiety was assessed solely through self-reported measures, which, while useful, may not fully capture the multifaceted nature of FLA. Self-reports are inherently subjective and susceptible to social desirability bias or participants’ perceptions of their progress. While this study focused on short-term outcomes, future research could incorporate physiological measures, such as heart rate variability or galvanic skin response, alongside self-reports to provide a more comprehensive assessment of anxiety. Additionally, longitudinal studies are needed to investigate the long-term impact of chatbot interventions on anxiety reduction and language development. These studies could examine the sustained effects of chatbot use on anxiety levels, as well as how these interventions might influence learners’ motivation, self-confidence, and willingness to communicate over time.

In sum, this study confirms the significant potential of AI-powered conversation bots as effective tools in EFL contexts, capable of concurrently enhancing speaking proficiency and mitigating speaking anxiety. While acknowledging the need for ongoing technological refinement and pedagogical integration alongside human interaction, the findings suggest that such AI tools represent a valuable asset for creating more personalized, supportive, and ultimately more effective language learning environments. As AI continues to evolve, its thoughtful application holds considerable promise for transforming L2 speaking instruction.