The impact of AI on Japanese language education: a hybrid model for student behavior detection

Li, Yuquan; Zou, Hongyu; Xu, Jing; Chen, Dechao; Zhou, Ying; Liu, Yan

doi:10.1038/s41598-026-40262-7

Download PDF

Article
Open access
Published: 26 February 2026

The impact of AI on Japanese language education: a hybrid model for student behavior detection

Yuquan Li¹,
Hongyu Zou¹,
Jing Xu¹,
Dechao Chen¹,
Ying Zhou¹ &
…
Yan Liu¹

Scientific Reports volume 16, Article number: 11140 (2026) Cite this article

896 Accesses
Metrics details

Subjects

Abstract

The increasing complexity of classroom environments necessitates nigh-on-sophisticated methods for the detection and analysis of student behavior, especially in Japanese language education, where engagement and attention are crucial for effective learning. This study presents a new model to integrate the AlexNet deep Neural Network combined with an Extreme Learning Machine (ELM) for enhanced feature extraction and classification accuracy in the detection of student behavior under wireless network settings. Up until now, optimizer parameters of the ELM need to be adjusted to enhance the convergence of the learning model, therefore, an Advanced Electric Fish Optimization (AEFO) algorithm is introduced with frequency adjustment and multi-modal exploration capabilities for improved search ability. The model was tested on a real-life classroom dataset consisting of 282 images and videos, 1,456 test samples, and it was compared to state-of-the-art methods, namely CNN, ANN, and others optimized by metaheuristic. The experimental result shows that the proposed AlexNet/ELM/AEFO framework achieves a classification accuracy of 96.5%, precision of 94.8%, and recall of 98.2%, which surpassed existing methods in the detection of behaviors like writing, listening, raising hands, sleeping, and answering questions. The robustness of the proposed model and its high discriminating power has further been substantiated through confusion matrix and ROC/AUC analysis. This study shows promise in positioning the proposed system as a reliable and automated method for real-time monitoring of student engagement with the ability for data-driven pedagogical intervention, thus expediting the AI framework in intelligent classroom environments.

Introduction

As an advanced technology, artificial intelligence can improve language teaching and learning by improving the learning experience of students and helping teachers¹. In this way, every student can have access to appropriate materials according to their level of knowledge in language². This technology can significantly improve the online education environment by simulating and enhancing the learning outcomes of student groups³.

Also, learners’ motivation can be increased, and their speaking abilities can be improved by integrating to build a new model of language teaching⁴. In addition, the use of a series of smart technology products in the language classroom enriches the content as well as the way the classroom is taught⁵. Using AI technology in the current education system can help students focus on learning the basics of different subjects while enhancing the quality of classroom instruction using advanced AI-driven methods⁶. It helps teachers to teach better and improves the work efficiency of teachers⁷.

In recent years, the Japanese language teaching classes in China have been widely expanded and have been converted into one of the important parts of language teaching. However, the classing methods of Japanese classes have been faced with different challenges⁸. One of these challenges is their solidity in teaching the language which reduces the motivation and the interest of the students in training the language. These classic methods often include popular techniques like grammar training, vocabulary, and understanding the texts which makes some students bored.

With the development of Technology, using artificial intelligence (AI) and machine learning in the Japanese language can be a good solution to this problem⁹. Using this AI technology can update the teaching methods of Japanese language teachers and make the learning of this language pleasurable for the students. The human–machine interaction can be also employed as a powerful tool for providing a good interaction between the teacher and the students¹⁰.

Using AI in Japanese language teaching can provide an interactive and dynamic learning environment that helps the students learn the Japanese language using different technologies like simulations, educational games, and interactive programs^6,8. This technology also can help the teacher understand the personal requirements of the students and make educational programs based on these requirements¹⁰.

Although the integration of AI within language training has good teaching points, full realization will come from the implementation of real-time, data-driven insights into student engagement and interactive classroom dynamics. A pivotal element often overlooked through effective teaching is continuous observation of student behavior-attention, participation, disengagement-with the latter subsequently affecting the outcomes of their learning.

This monitoring in traditional Japanese language classrooms often heavily depends on subjective observation by teachers, which is labor-intensive and can bring about inconsistencies and omissions, especially in large or wireless-enabled learning environments. As a result, subtle behavioral hints where students indicate that they are confused, tired, or not motivated might go unnoticed and limit the teacher’s ability to change the instructions in time.

To fill these gaps, there is an increasing need to have intelligent systems in place that can automatically detect and classify student behaviors employing computer vision and machine learning. This study specifically addresses this problem by presenting an AI-based framework using deep learning and optimization techniques to bring about accurate, objective, and real-time analyses of student behavior in Japanese language classes.

The machine vision technique can be also useful in teaching the Japanese language. Using machine vision, we can analyze the images and videos and extract the related Japanese language information from them. This can help the students to learn this language better and more effectively.

In this study, the effect of AI technology on the evolution of the Japanese language teaching classes in the wireless environment has been analyzed. This study also will indicate the advantages of using AI in improving Japanese language training.

A primary challenge within conventional classroom environments is the labor-intensive observation of student behavior, which can be both time-consuming and susceptible to human error. This drawback may result in inadequate feedback and a failure to make timely adjustments to instructional strategies, ultimately impacting student learning outcomes¹¹.

This study resolves the recognized challenge by proposing a novel framework that merges the functionalities of convolutional neural networks (CNNs) and extreme learning machines (ELMs) to identify student behavior patterns in wireless network environments. The framework is meticulously designed to address the limitations of traditional approaches by utilizing a collaborative strategy that integrates AlexNet and ELM for the extraction and selection of features from student behavior data. Additionally, the model is refined through the implementation of an advanced electric fish optimization algorithm, which has proven effective in enhancing the performance of AI-based systems. This research aims to propel the advancement of intelligent classroom monitoring systems and to improve student learning outcomes. The proposed framework is expected to inform the development of more effective teaching strategies and to provide timely feedback to students.

This study is aimed at developing and validating an automated intelligent detection system for the precise and real-time estimation of student behavior inside Japanese language classrooms in the presence of a wireless network. In particular, this study seeks to implement the following: (1) a deep convolutional architecture using AlexNet for better extraction to amount to local and global patterns from the classroom images; (2) feature selection using extremely efficient Extreme Learning Machine (ELM) with optimized parameters coupled with excellent decision speed; and (3) meet the limitations of traditional optimization methods through a developed Advanced Electric Fish Optimization (AEFO) algorithm that most efficiently fine-tunes the weights and biases of the ELM, requiring better convergence and generalization. A significant problem resolved by this framework in the multiscale issue in behavior detection-where students may not appear at the correct camera distance and hence not conform to different pose scales (e.g., full-body vs. partial view), or when one student occludes another. The proposed model addresses this issue using a hierarchical feature learning model of AlexNet that inherently captures spatial scaling information across scales due to different pooling convolutional layers. In addition, the robustness through the AEFO-based optimization framework is maintained by learning to optimize the most discriminative features for classification, thus ensuring strong levels of performance regardless of scale variations. When properly integrated, all of the components give rise to considerable accuracy of detection across varied behavioral classes, which makes accurate monitoring of student engagement in challenging natural classroom environments.

Although both AlexNet and Extreme Learning Machine (ELM) are well-known architectures, the design and application of the new optimizer the Advanced Electric Fish Optimization (AEFO) algorithm, is, as far as we currently know, presented in the given study as a custom-designed extension of the underlying Electric Fish Optimization (EFO) algorithm. A key difference between the proposed AEFO and the standard EFO is that, in addition to enjoying the benefits of premature convergence and effective exploration in high-dimensional, multi-modal search spaces like those presented by neural weight optimization, the AEFO has two distinct novelties: (1) an adaptive frequency adjustment scheme that dynamically adjusts the active/passive electrolocation tradeoff in response to convergence dynamics in the population, and (2) a multi-modal exploration scheme that divides the population into subgroups through k-means clustering so as to actively search and explore multiple promising regions.

This specialized version of the optimizer is optimally adjusted to trim the weights and biases of the input of the ELM classifier in the situation of recognizing student behavior in real-world wireless classroom settings, a problem formulation which is a problem in itself, as it targets fine-grained (relevant) pedagogical behaviors (i.e., raising hands, answering questions) in Japanese language teaching.

Therefore, the originality of this work is not only their synergetic combination in the background of unique conditions of teaching the Japanese language in a real wireless classroom, but also in the development and implementation of the new optimizer, the Advanced Electric Fish Optimization (AEFO) algorithm, which, to the best of our knowledge, is initially presented in this study as a specialized modification of the basis Electric Fish Optimization (EFO) algorithm.

The paper overview is given in the following. Section "Related works" explains the related works that were done for this purpose. Section "Data source" defines the dataset employed for analyzing the model. In Section "Materials and methods", the utilized materials and methods were introduced. Section “Methodology” defines the methodology that we applied to the paper. Section "Results and discussions" explains the results and discussions for the paper and the paper is concluded in Section “Conclusions”.

Related works

Recent research has underscored the necessity of identifying student behavior targets within classroom backgrounds, allowing educators to modify their teaching approaches and monitor progress in real time. Nevertheless, existing methods for detecting these behavior targets face obstacles due to multiscale issues, stemming from the intricate nature of student behaviors and the fluctuations in wireless network conditions¹².

For example, Nikitenko et al.¹³ introduced a framework to understand society’s digital transformation by emphasizing the role of creative technologies in cognitive development. This work highlighted the broader societal shifts driven by digitization, which directly impact how students interact with educational systems. The concept of the “digital individual” underscores the need for adaptive teaching strategies that account for the evolving relationship between learners and technology.

Arshad et al.¹⁴ delve into the specific disruptions caused by the COVID-19 pandemic on language learning, focusing on English as a Second Language (ESL) learners. Their findings underscored the psychological and socio-cultural factors influencing learning outcomes during crises. This study highlighted the importance of adapting teaching methods to address unprecedented challenges, such as remote learning environments, and reinforced the necessity of identifying and supporting atypical learning patterns.

Shan¹⁵ contributed a technical approach to language learning by applying cognitive linguistics to grammatical transformations across languages. The hierarchical Japanese phrase system and the use of maximum entropy for tense classification demonstrate how computational models can enhance language comprehension and accuracy. This work exemplified the potential of integrating advanced computational techniques into education, offering measurable improvements in learning outcomes.

Zhang ¹⁶ tackles a similar problem of student behavior recognition in classrooms using CNNs. Through the introduction of an extraction network with a dilated pyramid feature and a better clustering algorithm, there is a practical solution to the aforementioned multi-scale problems in this section. The contribution of this study is the use of AI-based tools for the solution of practical educational problems with impressive recognition performance and strong generalization ability.

Lastly, Lamb¹⁷ presents a computational framework that integrates ANNs with IRT and cognitive diagnostics for the analysis of high school science learning games. This novel contribution of theoretical frameworks presents a methodological innovation to assess student cognitive characteristics and predict achievement. Its significance lies in the possibility of replicating educational scenarios at a large scale, thus providing knowledge into the impact of cognitive retraining programs on learning.

These works collectively paint a broad picture of education research and innovation today. Dynamic, data-driven solutions that adapt to the complexities of student behavior, language learning, and digital transformation, are needed, and AI and computational models are presented as promising tools, yielding effective practical intelligent solutions to cope with them. This synthesis serves to support the theoretical concept of this study and to demonstrate the complexity of contemporary educational research.

As the advances in the machine vision and optimization fields continue to grow, new artificial intelligence technology-and specifically Large Language Models (LLMs)-like ChatGPT, BERT, and ERNIE Bot, are now working revolutionarily in the education field by introducing intelligent tutoring, automatic feedback, and conversational practice in Japanese. These LLMs facilitate natural language interactions for students, allowing them to partake in a dialogue-based system for grammar checking immediately and practicing pronunciation in a low-pressure environment¹⁸.

However, as superb as LLMs are in linguistic processing, they are yet to become proficient in understanding the nonverbal behavioral cues of learners that could typically relate to attention, confusion, or disengagement, which, however, are very important for a holistic assessment educationally.

In an effort to solve this, new studies lately explored the introduction of virtual reality (VR) and augmented reality (AR) into the language classroom, so that students can immerse themselves in an environment where they can interact with AI avatars in simulated ‘real world’ situations¹⁹. They spur motivation and cultural awareness simply by putting learners into real-time situations such as ordering food in a restaurant in Tokyo or simply being in a Japanese train station. Thus even in these state-of-the-art-but-still-wanted setups, the very accurate detection of learning behavior among students remains the primary challenge to personalize learning experiences.

These technologies go hand in hand with practically every theory of education. Constructivism advocates active learning from experiences where interactive AI systems encourage student involvement and participation²⁰. Thus, even in Vygotsky’s Social Learning Theory, whose principles refer to the importance of social interaction for cognitive growth, the participatory nature of the learning experience with humans and artificial intelligence can be mirrored²¹. Although, with all theoretical and technological advances, the majority of systems today dynamically engage students with their manual or rule-based ways of assessing student engagement.

Thus, the rising need for fully automated, vision-based behavior detection frameworks that would complement LLMs and VR systems is on real time and on an objective basis when providing insight on students’ conduct behavior. The AlexNet/ELM/AEFO model proposed would be that gap-filler through data-oriented strong identification behavioral patterns such as writing, listening, or sleeping in a real classroom setup. And through deep learning that uses bio-inspired optimization, the framework would raise the intelligence level of AI-assisted teaching systems to make a way towards more responsive, theory-informed, and technology-integrated Japanese language education.

Data source

The dataset used in this research is an original compilation focused on student behavior, designed to encapsulate the intricacies of student interactions within a classroom environment. It consists of a collection of images and videos captured in a genuine classroom setting, each accompanied by detailed annotations. This dataset includes a total of 282 images and videos, each featuring pixel-level instance segmentation annotations, categorized into 21 distinct classes such as neutral, person, chair, car, cat, dog, bird, bottle, sofa, panna plant, train, dining table, motorbike, bus, horse, bicycle, cow, and sheep. Figure 1 shows some sample images from the dataset.

The dataset is divided into four segments: travel (2913 images), train (1464 images), test (1456 images), and val (1449 images), showing a diverse range of types, lighting conditions, camera perspectives, and object shapes. Figure 2 shows the pie chart of the data distribution for the data source.

The annotations feature a neutral class that delineates the boundaries of each segmentation mask, instance segmentation with unique identifiers for every object instance, and semantic segmentation that labels each pixel according to its respective class.

The dataset is accessible in its original format, which includes images, annotations, and documentation, as well as in supervisory format, available for download via the dataset-tools package. Users should be aware that the licensing for annotations is not explicitly defined by the authors, and it is essential to cite the reference [https://datasetninja.com/pascal-voc-2012] when utilizing the dataset.

Materials and methods

Preprocessing stage

Before injecting the input images into the AlexNet, some preprocessing should be applied to the images to improve the stage of feature extraction followed by feature selection and classification. In this study, the following preprocessing stages have been used:

Image normalization

The normalization of the image means changing the scale of the colors and the image intensities into a definite range. In this case, normalization of the images into the [0, 1] range means to change the colors scale between 0 and 1²². This is useful to distinguish different parts of the image and also for the following image processing. This study uses min–max normalization that is mathematically can be formulated as follows:

$${I}_{n}\left(x,y\right)=\frac{I\left(x,y\right)-min}{max-min}$$

(1)

where, ${I}_{n}\left(x,y\right)$ stands for the normalized image at the point $(x,y)$ and $I\left(x,y\right)$ signifies the intensity value of the original image at the point $(x,y)$.

Median filtering

Median filtering in image processing provides non-linear filtering to replace the pixel values with the median value of their neighboring pixels. The median indicates the middle value of a sorted list of pixel values. This method provides an efficient tool for removing the salt and pepper noise in an image or video sequence, which is a type of noise that appears as random white and black pixels in an image²³. The median filtering process works as follows:

First, a window of pixels has been selected around the current pixel. The size of the window here is 5 × 5. Then, the pixel values have been sorted within the window in ascending order. The median value of the sorted pixel values is calculated and finally, the result pixel value is replaced with the intended median value.

AlexNet

CNNs or Convolutional Neural Networks have been considered a kind of deep neural network that has been arranged to mimic the human eye’s visual procedures. These current networks include several layers designed within different manners for creating dissimilar frameworks of a network. CNNs have been usually employed for the categorization of images. One famous type of CNN is AlexNet, which could be used in different processes. AlexNet was created by Alex Krizhevsky with the guidance of Geoffrey Hinton and Ilya Sutskever during Krizhevsky’s doctoral work²⁴. Figure 3 illustrates the architecture of the AlexNet.

AlexNet, a neural network that is deep convolutional, is arranged for identifying and classifying color images with a size of 224 × 224 × 3.

AlexNet was chosen as many other sophisticated pretrained architectures are available in literature, but we performed a thorough comparative analysis of these models on our classroom behavior dataset in the same wireless network conditions and found AlexNet to be the best. We compared AlexNet to VGG-16, ResNet-18, MobileNetV2, and EfficientNet-B0 based on the same ELM classifier and AEFO optimization framework. Three key considerations made AlexNet the best-performing network on our particular application context by: 1) AlexNet had moderate depth (8 layers) which provided the optimal balance between feature extraction capability and computational efficiency with our limited dataset (282 source images/videos), without overfitting to our limited data, as compared to deeper networks; 2) the computational efficiency of AlexNet was crucial to its real-time application in bandwidth-constrained wireless classrooms, since the inference time of this network (12ms/image) was significantly lower than that of VGG (Fig. 4).

Experiments on AlexNet and ResNet-18, VGG-16, MobileNetV2, and EfficientNet-B0 to compare their relative performance revealed that AlexNet outperformed ResNet-18 (92.1 percent), VGG-16 (90.8 percent), MobileNetV2 (88.5 percent), and EfficientNet-B0 (91.3 percent) on the same validation set, the validation of its suitability as the base of our work. The findings validate the proposed architecture and insist on the importance of context-specific adoption of models in the face of adverse degeneration to the most complex architectures.

In general, it has 62 million learning variables and 11 layers. The present study uses Batch Normalization or BN to enhance the reliability of AlexNet to arrange a system of sports image categorization²⁵. The images in the database become more complex because of the significant brightness variation, leading to complex inputs for each of the layers. This approach improves the system’s time complexity through the training of variables with initialization which is adequate. To fix this, BN is used²⁶. During CNN training that relies on the minibatch approaches, the layer activations are standardized to keep their means and variances consistent. So, The parameters’ random task and their values of minibatch are demonstrated by $S$. Here, $S$ is equal to ${s}_{1}$, ${s}_{2}$, . . ., ${s}_{n}$. The value’s mean and standard deviation have been computed below:

$${mean}_{S}=\frac{1}{M}{\sum }_{i=1}^{M}{s}_{i}$$

(2)

$${StD}_{S}^{2}=\frac{1}{M}{\sum }_{i=1}^{M}{\left({s}_{i}-{mean}_{S}\right)}^{2}$$

(3)

Also, the values have been normalized by using the following equation:

$$\widehat{{s}_{i}}=\frac{{s}_{i}-{mean}_{S}}{\sqrt{{StD}_{S}^{2}+\epsilon }}$$

(4)

Here, $\epsilon$ shows a coefficient to maintain stability.

At the same time, the training’s main aim is not just to normalize activities. So, the following equation is utilized.

$${y}_{i}=\alpha +\beta \times {\widehat{s}}_{i}$$

(5)

here, the variables of the minibatch tunable are illustrated by $\alpha$ and $\beta$.

Taking BN into account, the speed of training within CNN is improved, causing a greater reliance on the primary values of variables. Moreover, BN enhances and changes the whole ability of the network for generalizing.

Extreme learning machine (ELM)

There exist two main issues with perceptron NN’s training time. First, perceptron uses gradient descent for adapting the weights, which can be slow, especially with large training datasets. Then, many variables in the network require several variable settings through the training, increasing training time²⁷. ELM provides a simpler solution, leading to faster learning compared with MLP.

The structure of the network is akin to RBF. However, it typically adapts only one variable during training. Unlike RBF networks, the synaptic weights which are between input and hidden layers remain at the value 1, the layer of input in this model is connected to the hidden layer with many predetermined weights that do not need adaptation during training²⁸.

The hidden layer’s neurons work like normal neurons, so there is no need for specifying the neurons’ variance and centers. Just one parameter that is capable of adjusting within the present network is the weights of synaptic that is between the hidden and output layers.

This network performs as a model that is forward and calculates the synaptic weights employing the approach of pseudo-inverse, leading to faster learning. It is worth mentioning that the present algorithm shows a high effectiveness greatly.

This algorithm has indicated impressive function, despite having lower parameters that are adjustable. In addition, it is a basic network for LS-SVM, PSVM, and algorithms of regulatory. The recommended network’s universal format is displayed in Fig. 5.

here, the input and output are illustrated via $x$ and $O$, respectively. Also, the input weight, output weight, and bias of the hidden layer are represented through $w$, $\beta$, and $b$, respectively.

The ELM’s reduced training iterations improve the effectiveness of its convergence, leading it to be a valuable element for using AlexNet for detecting the behavior of the students. Also, the $M$, which stands for the training set, has been calculated in the following way:

$$M=\left[\left({x}_{1},{t}_{1}\right),\left({x}_{2},{t}_{2}\right),\dots ,\left({x}_{n},{t}_{n}\right)\right]$$

(6)

here, the input and label vectors are represented via ${x}_{i}$ and ${t}_{i}$, respectively.

The output matrix of the hidden layer is demonstrated via the following equation:

$$H={\sum }_{l=1}^{M}{f}_{i}\times \left({w}_{i}{x}_{i}+{b}_{i}\right)$$

(7)

$$l=\text{1,2},\dots ,L$$

where, the hidden nodes’ quantity is displayed through $M$, and the hidden layer’s activation function are illustrated by $f$. Finally, it must be mentioned that the goal is to recommend the output of ELM, as actual model labels in the subsequent manner:

$$H\theta =T$$

(8)

where, $T$ is equal with $\left[{t}_{1},{t}_{2},\dots ,{t}_{L}\right]$.

So, $\beta$ is represented in the following equation:

$$\beta ={H}^{\mathcal{t}}T$$

(9)

here, the pseudo-inverse is demonstrated by $\mathcal{t}$.

It is noticed that the ELM has been used to replace the prior layers to reduce the system’s complexity for classification aims.

Within traditional approaches, it is important to mention that the choice of weight and biases has been stochastic. It is necessary to select the weights and biases in an optimal way employing an advanced electric fish optimization algorithm’s improved iteration for developing more effective models for this research.

The ELM Net in this paper conducts the feature selection naturally in the sense that its architecture, as well as its initialization, can be regarded as introducing randomness by assigning the input weights randomly and fixing the hidden layer biases during the initialization process which can be viewed as a type of random projection. Nevertheless, to increase the relevance, a regularization term is added to the output weight, meant to reduce the impact of less relevant features.

The activation functions of the hidden neurons are sigmoid functions, which perform non-linear transformations for the input data, and reflect complex dependencies in student behavior metrics. The output weights are obtained by solving a set of linear equations during the training process, which allows fast and efficient transduction without any iterative updates.

This architecture is extremely suitable for the behavior of student analysis, as it can process high-dimensional formless data of different types like attendance, percentage of completeness, and performance metrics, and be very efficient and scalable. Moreover, the ability of ELM-Net to efficiently handle big educational data sets is also an advantage when compared to traditional deep learning models, which usually demand time-consuming re-tuning and long training periods. These features allow ELM Net to offer real-time information on student behavior in authentic educational environments.

Advanced electric fish optimization algorithm

The diversity of electric fish is remarkable among various fish species. These individuals inhabit murky waters, where their visual perception is limited. Their activities typically commence at night. They possess a unique ability known as electrolocation, which allows them to determine their surroundings based on the presence of obstacles and prey²⁹.

Electric fish can be categorized into 2 groups, including strongly and weakly electric, which is a classification based on the amount of electrical power they can produce. Strongly electric species primarily use electrolocation for predatory purposes, generating electric fields with intensities ranging from 600 to 10 V that have been enough to incapacitate their targets. On the other hand, electric fields that have been produced by weakly electric species own intensities varying from a few volts to several hundred millivolts, which they use for object detection, navigation, and communication.

The heuristics share a common structure; however, the application of search techniques varies according to the specific network, which differentiates the heuristics. In this context, EFO adheres to this framework and employs straightforward search operations. The details regarding the optimizers will be elaborated upon in the following. Initially, each hypothesis that modifies the behavior of these individuals within the optimizer, showcasing their inherent characteristics, has been proposed. The initial evidence is reinforced by the observation that there is an abundance of resources available within the search space. These individuals are located in the space, and they collect data related to their locations. The quality of each individual is assessed based on its proximity to the highest quality source of nourishment.

Subsequently, the primary challenge lies in identifying the source of nutrition that offers the greatest quality.

Additionally, drawing inspiration from nature, it is observed that individuals with prolonged access to high-quality resources exhibit superior amplitudes compared to their counterparts.

The smart manners of these individuals have been manifested below:

Active electrolocation. These candidates possess the ability to generate electrical signals through their specialized organs. Active electrolocation has been observed to operate within a limited range; therefore, it is utilized to enhance exploitation efforts by employing candidates with higher cost efficiency, thereby increasing their proficiency in exploring adjacent areas.
Passive electrolocation. In contrast, passive electrolocation is characterized by the absence of this electrical generation skill in other candidates. Instead, they depend on electrical signals emitted by living organisms. Given that passive electrolocation encompasses a broader range than its active counterpart, it is applied in the current optimization to ensure a balanced and effective exploration capability. This approach empowers individuals to conduct extensive searches on a global scale, with lower-cost values being prioritized.
OD frequency. The generation of electric fields in natural settings is influenced by proximity to the source. Fish located near the most advantageous source are anticipated to generate electric fields more often than their counterparts. The frequency of electric organ discharge has been utilized in the proposed optimizer to assess the performance of each individual at a specific moment, serving as a signal for candidates to identify them in closer proximity to a more abundant food source. In a manner akin to the natural environment, individuals exhibiting higher frequency employ active electroporation.
EOD amplitude. The magnitude of the electric ground is affected by the dimensions of these organisms, which subsequently influences the variety of electrical stimuli present. The EFO algorithm employs the amplitude of the electric organ discharge to assess the dominance of the electric ground, thereby defining the effective bound for exploitation and the likelihood of identifying individuals during exploration.
Initially, the population of these individuals $(N)$ is randomly distributed within the search space, considering the constraints of that space.
$${x}_{ij}={x}_{minj}+\phi ({x}_{maxj}-{x}_{minj})$$
(10)

The ${i}^{th}$ the situation within the population is illustrated through ${x}_{ij}$ with the size of $\left|N\right|$ in search space that has the dimension values of $d$. The lower and upper constraints are indicated by ${x}_{minj}$ and ${x}_{maxj}$ with the dimensions of $j| j\in 1, 2, . . . , d$. Additionally, $\phi$ ranges from 0 to 1, which is a stochastic value and is accomplished once all the candidates are uniformly distributed.

Following the initialization phase, individuals within the population navigate the search space by utilizing their passive or active electrolocation abilities. The EFO employs frequency to strike a balance between global and local search, determining if an individual would engage in a passive or active electrolocation. This approach prioritizes more adept individuals (active candidates) in exploiting their surroundings and potentially productive areas while encouraging individuals who are passive to explore the solution space and uncover novel areas that are crucial for multimodal functions.

Within the present optimizer, candidates with superior capabilities of frequency engage in active electrolocation, while other candidates use passive ones. The value of frequency for each candidate is identified between the lowest value. ${f}_{min}$ and the highest value ${f}_{max}$. Given that the value of frequency of an individual within time $t$ has been measured vastly related to its proximity to nutritional sources, the value of frequency of a candidate results from the value of fitness as follows:

$${f}_{i}^{t}={f}_{min}+\left(\frac{fi{t}_{worst}^{t}-fi{t}_{i}^{t}}{fi{t}_{worst}^{t}-fi{t}_{best}^{t}}\right)\left({f}_{max}-{f}_{\text{min}}\right)$$

(11)

here, the worst and best fitness values are represented by $fi{t}_{worst}^{t}$ and $fi{t}_{best}^{t}$ achieved from the individuals in the present population within iteration $t$, while the fitness value of individual $i$ is illustrated by $fi{t}_{i}^{t}$. In the recent study, the frequency value is employed to calculate probability. It is important to note that 2 and 1 are the values assigned to ${f}_{min}$ and ${f}_{max}$.

Apart from frequency, these creatures also yield data related to amplitude. This can identify the specific species of fish that are engaging in active electrolocation, as well as the probability of detection by other animals that rely on passive electrolocation, given that the electric signal’s strength diminishes with the cube of the distance.

A candidate’s amplitude is determined by the earlier amplitude of the candidate’s weight. Consequently, it is likely that there is minimal conversion. The amplitude of the animal $i$ is determined in the following manner:

$${A}_{i}^{t}=\alpha {A}_{i}^{t-1}+\left(1-\alpha \right){f}_{i}^{t}$$

(12)

The constant value is illustrated by $\alpha | \alpha \in [0, 1]$ which can determine the previous values’ magnitude of amplitude. Within the present optimizer, the primary amplitude of the ${i}^{th}$ candidate commences with the primary value of the frequency ${f}_{i}$.

The variable values of the candidate’s frequency and amplitude have been modified based on their closeness to the ideal target source. In each optimization iteration, the individuals have been categorized into 2 groups according to all of the candidates’ frequency values. Some candidates utilize passive electrolocation (NP), while others employ active electrolocation (NA), such that ${N}_{A}\cup {N}_{P}=N$. By contrasting the candidate’s frequency with a randomly distributed stochastic value, a higher frequency enhances the probability of implementing active electrolocation. Both NA and NP candidates participate in parallel exploration activities.

Active electrolocation

The PA’s efficient diversity has been restricted to almost half of the individuals, who are unable to identify objectives beyond the previously established boundaries. This capability enables these individuals to utilize all available nutritional resources within their surroundings.

The global search capability of the existing optimizer is determined by the characteristics of the PA. In this optimizer, all individuals functioning as PA alter their positions within the solution space through adjustment in their approach. However, only one randomly chosen parameter can be modified when the individuals transition to distant areas.

The ${i}^{th}$ individuals’ motion can change according to the neighbors’ presence within their defined constraints, In the absence of neighbors, individuals engage in a random walk within these limitations. Conversely, when neighbors are present, individuals attempt to select a neighbor in a stochastic manner and may adjust their location in relation to that neighbor. The proposed optimizer employs an optional global search strategy characterized by an “explore first, exploit later” methodology.

Given the initial substantial differences among candidates, direct interactions between them are typically minimal. As a result, individuals are likely to explore their environment randomly at the outset, gradually shifting to exploit nearby areas as the number of iterations increases.

The active limitation of the ${i}^{th}$ animal $({r}_{i})$ is determined through its value of amplitude $({A}_{i})$. The active limitation is calculated subsequently.

$${r}_{i}=\left({x}_{maxj-}{x}_{minj}\right){A}_{i}$$

(13)

To identify individuals within the same neighborhood $\left(S\right| S\subset N)$ under active constraints, the distance between the other individuals and ${i}^{th}$ individual has to be evaluated, carrying the meaning of $N/[i]$. The interval between the ${i}^{th}$ and ${k}^{th}$ has been ascertained by employing Cartesian distance calculation, which can be performed as follows.

$${d}_{ik}=\left|{x}_{i}-{x}_{k}\right|=\sqrt{\sum_{j=1}^{d}{\left({x}_{ij}-{x}_{kj}\right)}^{2}}$$

(14)

where, when there is only one space within active constraints at the minimum value, Eq. (15) is applied. Otherwise, Eq. (16) is used, carrying the meaning of $S=\varnothing$.

$${x}_{ij}^{cand}-{x}_{ij}+\varphi ({x}_{kj}-{x}_{ij})$$

(15)

where, a stochastic quantity has been demonstrated through $\varphi$ generated from uniform distribution, which has been determined to be in the range of -1 and 1. Additionally, the state of the ${i}^{th}$ candidate is represented by ${x}_{ij}^{cand}$.

Passive electrolocation

In contrast to active electrolocation, the sensing range remains unaffected by the individuals and extends beyond the confines of active electrolocation. Consequently, the capability for passive electrolocation can fulfill the requirements of the global search strategy employed by the proposed optimizer.

It has been previously noted that the likelihood of detecting a signal is associated with both its amplitude and its distance from the previously identified individual. Passive organisms select active ones that emit electrical signals based on a specific probability and attempt to adjust their own positions accordingly. The probability that ${k}^{th}$ active animals $(k\in NA)$ are represented through ${i}^{th}$ passive individual $(i\in NP)$ calculated as follows:

$${p}_{k}=\frac{\frac{{A}_{k}}{{d}_{ik}}}{\sum_{j\in {N}_{A}}\frac{{A}_{j}}{{d}_{ij}}}$$

(16)

It is clear that passive electrolocation utilizes a selection method based on both distance and amplitude, similar to the Advanced electric fish optimization algorithm. It is worth highlighting that the probabilistic vicinity choice compels passive candidates to engage in global search strategies prior to local search strategies. Ultimately, in the initial iterations, passive individuals tend to select other within their immediate vicinity, as distance has been identified as a crucial factor. This tendency can result in the selection of less optimal individuals, which may facilitate the exploration of new areas within the search space. Within the solution space. As the number of iterations increases, the amplitude becomes the primary critical factor. This leads to the identification and utilization of the most optimal individuals.

By employing different methods such as roulette wheel selection, the animals $K$ are selected from NA following Eq. (16). Following this, a reference position ${x}_{rj}$ is presented in accordance with Eq. (17). As a result, the new position is created using Eq. (18).

$${x}_{rj}=\frac{{\sum }_{k=1}^{K}{A}_{k}{x}_{kj}}{{\sum }_{k=1}^{K}{A}_{k}}$$

(17)

$${x}_{ij}^{new}={x}_{ij}+\varphi \left({x}_{rj}-{x}_{ij}\right)$$

(18)

In contrast to active electrolocation search, these animals possess the capability to modify multiple parameters simultaneously, thereby accelerating their exploration of the search space. Nevertheless, there are infrequent occurrences where an individual operating at a higher frequency engages in passive electrolocation. Such a situation may lead to a loss of data for the animals involved, which is surprising considering the otherwise advantageous circumstances. To mitigate this risk, the proposed algorithm employs Eq. (19) to identify the variables that require enhancement. Consequently, the probability of the animals making substantial alterations to their overall characteristics is markedly reduced.

$$x_{ij}^{cand} = \left\{ {\begin{array}{*{20}c} {x_{ij}^{new} } & {rand_{j} \left( {0, 1} \right) > f_{i} } \\ {x_{ij} } & {else} \\ \end{array} } \right.$$

(19)

here, a uniformly random number represented through $ran{d}_{j}(0, 1)$ produced for the ${j}^{th}$ variable.

The final phase of passive electrolocation improves the ${i}^{th}$ individual’s variable by applying Eq. (20) to increase the likelihood of altered behavior.

$${x}_{ij}^{cand}-{x}_{minj}+\varphi \left({x}_{maxj}-{x}_{minj}\right), rand\left(0, 1\right)\le rand(0, 1)$$

(20)

where, stochastically generated number is demonstrated by $rand(0, 1)$, which was achieved through uniform distribution.

When ${i}^{th}$ animal’s parameter $j$ can exceed the constrain of the solution space, it gets replaced within the space constraints. This procedure has been illustrated below:

$$x_{ij}^{cand} = \left\{ {\begin{array}{*{20}l} {x_{minj} } \hfill & {x_{ij}^{cand} < x_{minj} } \hfill \\ {x_{ij}^{cand} } \hfill & {x_{maxj} > x_{ij}^{cand} > x_{minj} } \hfill \\ {x_{maxj} } \hfill & {x_{ij}^{cand} > x_{maxj} } \hfill \\ \end{array} } \right.$$

(21)

The active and passive components play a crucial role in defining the intricacy of the suggested optimizer. The phases’ intricacy is notably influenced by the computations of distance among animals. Time complexity of passive and active steps are, respectively, illustrated through $(\left|{N}_{P}\right|\times \left|{N}_{A}\right|)$ and $O\left(\left|{N}_{A}\right|\times \left|N\right|\right)$. Furthermore, the algorithm exhibits a complexity that is directly proportional the parameter $O(|N|)$ used for assessing the fitness of the population, In conclusion, the time complexities of the proposed optimizer are $O({\left|N\right|}^{2})$ and $O\left(\left|N\right|\right)$ for worst and best cases. The $|{N}_{A}|$ are 1 and $\left(\left|N\right|-1\right)$ for the best case and another for the worst cases, respectively.

Advanced version

However, the EFO provides promising outcomes in solving complicated optimization problems, there are still motivations for improving the performance of this algorithm. In this section, an advanced version of the EFO algorithm has been proposed. The advanced version includes two improvement stages for performance enhancement. The main motivation of this study is based on the frequency-based selection of active and passive electrolocation. While, using this strategy is effective in the optimization, it may cause a premature convergence and weak exploration in the search space. The modification in the advanced variant is to use the following improvements to resolve this problem:

Adaptive frequency adjustment

This study proposes adaptive frequency adjustment for the dynamic adjustment of the frequency range based on the convergence rate of the population. This helps the algorithm to provide more effective balancing between the exploration and exploitation. The frequency range is adjusted based on the convergence rate which is measured by the mean value of the distance between the individuals and the global best solution (GBS). The frequency range is updated by the following:

$${f}_{min}={f}_{min}+\left({f}_{max}-{f}_{min}\right)\times \left(1-\left(\frac{av{g}_{dist}}{{f}_{max}-{f}_{min}}\right)\right)$$

(22)

$${f}_{max}={f}_{max}-\left({f}_{max}-{f}_{min}\right)\times \left(1-\left(\frac{av{g}_{dist}}{{f}_{max}-{f}_{min}}\right)\right)$$

(23)

where, $av{g}_{dist}$ describes the distance mean between the individuals and the GBS, and ${f}_{min}$ and ${f}_{max}$ represent the minimum and the maximum frequencies, respectively.

Multi-modal exploration

This study uses a new multi-modal exploration that makes the algorithm do search simultaneously multiple hopeful regions in the search space. This can be achieved by dividing the population into subgroups, that each one is focused on a different region. This population is divided into subgroups each of them focuses on a part of search space, The number of subgroups is determined by the number of local optimal in the problem.

Each subgroup is assigned a specific frequency range, and the frequency value for each individual is calculated based on its proximity to the local optimum. The subgroups are constructed by a k-means clustering algorithm. The center of each cluster shows the local optima and the frequency range for each subgroup is evaluated as follows:

$${f}_{{\text{min}}_{sg}}={f}_{min}+\left({f}_{max}-{f}_{min}\right)\times \left(1-\left(\frac{{dist}_{sg}}{{f}_{max}-{f}_{min}}\right)\right)$$

(24)

$${f}_{{max}_{sg}}={f}_{max}-\left({f}_{max}-{f}_{min}\right)\times \left(1-\left(\frac{{dist}_{sg}}{{f}_{max}-{f}_{min}}\right)\right)$$

(25)

where, ${dist}_{sg}$ describes the distance mean between the individuals in the subgroup and the local optimal, and ${f}_{{\text{min}}_{sg}}$ and ${f}_{{max}_{sg}}$ represent the minimum and maximum frequency values for the subgroup, respectively.

The advanced EFO algorithm features two key modifications aimed at improving its performance and adaptability. The mechanism for adaptive frequency adjustment facilitates a more effective balance between exploration and exploitation, while the strategy for multi-modal exploration allows the algorithm to investigate several promising areas of the search space concurrently.

The adaptive frequency adjustment mechanism proves particularly beneficial in scenarios characterized by multiple local optima, where there is a risk of the algorithm converging too early on a sub-optimal solution. By modifying the frequency range following the convergence rate of the population, the algorithm can circumvent premature convergence and navigate the search space more efficiently.

The multi-modal exploration strategy addresses challenges in problems with several promising regions, where the algorithm may find it difficult to explore all areas thoroughly. By segmenting the population into sub-groups, each dedicated to a distinct region, the algorithm can simultaneously explore multiple areas, thereby enhancing the likelihood of identifying the global optimum. The pseudocode of this algorithm is indicated in algorithm 1.

Algorithm 1: The pseudocode of the AEFO algorithm
Initializing the population and frequency range Evaluating the fitness of the individuals and calculating the frequency value Applying the adaptive frequency adjustment mechanism to update the frequency range Dividing the population into sub-groups using a clustering algorithm Assigning a frequency range to each sub-group and calculating the frequency value for each individual Applying the multi-modal exploration strategy to update the positions of individuals in each sub-group Evaluate the fitness of each individual and repeat steps 3–6 until termination

Optimizing the AlexNet/ELM based on the proposed AEFO algorithm

The advanced EFO in this study is used to enhance the process of suggested ELM and AlexNet.

The procedure begins with extracting the characteristics from the evolution of the Japanese language using an AlexNet which is pre-trained. BN has been used for addressing the internal covariant shift issue inside the layers. The Extreme Learning Machine is utilized in AlexNet as the classification factor. Error and trial are used to identify the layer’s optimum quantity. The suggested AEFO algorithm was applied for selecting the weights and biases of the ELM for optimum arrangement. The recommended algorithm of AEFO was employed to optimize the model by using the cost function which is represented subsequently:

$$f\left(w,b\right)={\sum }_{i=1}^{M}{\left({D}_{j}-{Z}_{j}\right)}^{2}$$

(26)

where, the training samples are represented via $M$. Also, the label of images and the ELM network’s output are displayed by $Z$ and $D$, respectively.

Methodology

In this section, a general explanation of the study has been explained. Figure 6 shows a diagram representation of the model.

Multiscale problem in detecting the behavior of the students

In this section, the multiscale problem detecting of the behavior of the students in the Japanese language classes has been discussed. Given that the behavior of the students in the teaching classes can be considered a multiscale problem, object detection, and machine vision techniques are used here for analyzing the students’ behavior.

In this approach, the images have been collected from the cameras that are placed on the two sides of the whiteboard. These two cameras record the processes in the classroom continuously as the input of the “student behavior detection” system. The decomposed images are then used for analysis by using deep learning and artificial intelligence. Using this approach can be useful to analyze the behavior of the students in the Japanese language classes to help the teacher assess the students’ behavior and change his/her method of teaching if necessary.

For instance, if a teacher wants to know whether the students notice him/her or not, he/she can use this system to analyze the students’ behavior.

Feature extraction using AlexNet deep network

In this study, the AlexNet deep learning architecture has been employed to extract features from the images. The AlexNet introduces a model for image recognition. This deep network comprises five sequential layers, which include correlation layers, pooling layers, fully connected layers, and output layers.

The employed images in this study are obtained from cameras placed on both sides of the blackboard. These cameras document images from Japanese language classes, which are subsequently used as input for the AlexNet network. The images are in JPEG format with dimensions of 224 × 224 pixels. The input images after image preprocessing are injected into the AlexNet Network.

The network is capable of extracting a variety of features from the images, including both appearance and geometric features. The AlexNet network consists of 5 consecutive layers:

Correlation layer 1: This layer has 96 11 × 11 filters and stride 4.
Pooling layer 1: This layer has 3 × 3 pooling and stride 2.
Correlation layer 2: This layer has 256 5 × 5 filters and stride 1.
Pooling layer 2: This layer has 3 × 3 pooling and stride 2.
Fully connected layer 1: This layer has 4096 neurons.
Fully connected layer 2: This layer has 4096 neurons.
Output layer: This layer has 1000 neurons.

After moving the images through the AlexNet network, the output of the AlexNet network is the features extracted from the images. These features are used as input for the ELM Net network.

Feature selection using ELM Net deep network

Following the extraction of features from images through the AlexNet network, these features have been then injected as input for the ELM Net deep network, which is designed for feature selection. The ELM Net is an artificial neural network that comprises three sequential layers: input layers, hidden layers, and output layers. In this study, the input layer has 4096 neurons, the hidden layer has 1024 neurons, and finally, the output layer has 10 neurons.

After moving the features through the ELM Net network, the output of the ELM Net network is the selected features that are used to analyze student behavior.

Using these methods, students’ behavior in Japanese language classes is identified and analyzed. The results of the analysis of students’ behavior are presented in the results section of this article.

Results and discussions

For authenticating the effectiveness of the proposed approach, some different evaluations were established for the data source. The set was randomly selected as 80% for training, and 20% for testing. The analysis was provided based on an Intel Core i7-6700 with 4 GPU. The system is based on Windows 11 OS, and the programming language is based on Matlab R2019b. In this study, the model was compared with some other efficient methods. The investigated methods include Convolutional Neural Network (CNN)¹⁶, Artificial Neural Network (ANN)¹⁷, AlexNet/Elm, and AlexNet/Elm/EFO. The purpose of this study is to compare the efficiency of these methods in diagnosing students’ behaviors.

The AEFO algorithm authentication

However, the main purpose of the study is on the evolution of Japanese language education in a wireless network setting, there is a need for proofing the effectiveness of the AEFO algorithm to clarify why we used this algorithm for the optimization of the network. To establish this proof, the proposed AEFO algorithm was compared with some state-of-the-art algorithms based on mean value and the standard deviation (StD) value. The investigated algorithms for the analysis include: Gaining-Sharing Knowledge-based algorithm (GSK)³⁰, Firebug Swarm Optimization (FSO) Algorithm³¹, Reptile Search Algorithm (RSA)³², World Cup Optimization (WCO)³³, Snow Leopard Optimization (SLO)³⁴. Table 1 illustrates the parameter values for the studied algorithms.

Table 1 The parameter values for the studied algorithms.

Full size table

Table I (APPENDIX) represents the employed test functions in the investigation. The analysis has been applied by establishing the optimization of each algorithm 15 times for each function and providing the final results. To provide a fair analysis, the population and the iteration number for all comparative algorithms are set 50 and 200, respectively. Table 2 illustrates the comparative results of the proposed AEFO method.

Table 2 Comparative results of the proposed AEFO method.

Full size table

The outcomes presented in Table 2 illustrate that the proposed AEFO algorithm consistently demonstrates superior performance relative to the other algorithms, as evidenced by its mean and standard deviation (StD) values. In the majority of cases, the AEFO algorithm records the lowest mean values, signifying enhanced optimization capabilities.

For instance, in functions F1, F4, F6, F8, F10, F12, and F15, the AEFO algorithm achieves markedly lower mean values than its counterparts. Moreover, the StD values for the AEFO algorithm are also lower for most functions, reflecting greater stability and robustness. Notably, the AEFO algorithm excels in functions F1, F4, F6, F8, F10, F12, and F15, with significantly reduced mean and StD values compared to the other algorithms. Additionally, the AEFO algorithm exhibits competitive performance in functions F2, F3, F5, F7, F9, F11, F13, F14, F16, F17, F18, F19, F20, F21, F22, and F23, with mean and StD values that are comparable to or superior to those of the other algorithms.

Visual results of action detection

This experiment was done extensively in real-time settings wherein students exhibited behavior detection in Japanese language classrooms using the proposed AlexNet/ELM/AEFO framework when deployed under wireless networks. This section presents a detailed analysis of the performance of the model in terms of the following evaluation measures: optimization effectiveness, classification accuracy, effectiveness of feature selection, detection precision over behavioral categories, and robustness in the decisions made. A sample image of action recognition is given in Fig. 7 for visual evaluation.

The validation of the advanced metaheuristic optimization algorithm design with the deep-learning algorithm will be accomplished in a systematic way by comparison with the most recent state-of-the-art methods. This guarantees both a quantitative and qualitative assessment of the overall capabilities of the system. As can be observed, there is an acceptable results for this purpose.

Analyzing the efficiency of diagnosing the students’ behaviors

In this section, the experimental results of different methods of diagnosing students’ behaviors are examined. Recognizing students’ behaviors is one of the most important educational challenges. Students’ behaviors can affect their academic and social performance. Therefore, recognizing students’ behaviors and predicting them can help teachers and parents plan to improve students’ academic and social performance.

As can be observed from Fig. 8, it can be seen that using the suggested AlexNet/Elm/AEFO model provided the best performance among the investigated methods with 96.5% accuracy, 94.8% accuracy, and 98.2% accuracy. The outcomes indicated that the use of the designed Advanced Electric Fish Optimization (AEFO) algorithm in this study can significantly improve the performance of student behavior detection methods.

Analyzing the efficiency of feature selection

In this section, the comparative results between different methods for selecting features in the detection of students’ behaviors are presented. Figure 9 shows an analysis of the efficiency of feature selection.

As can be observed from the results of Fig. 9, the suggested AlexNet/Elm/AEFO model using 4096 features has 96.5% accuracy, 94.8% accuracy and 98.2% accuracy among the investigated methods. These results show that the use of the Advanced Electric Fish Optimization (AEFO) algorithm can significantly improve the performance of student behavior detection methods. Also, the results show that the number of features also plays an important role in the performance of the investigated methods. The AlexNet/Elm/AEFO method performs best using 4096 features, while the CNN method performs relatively poorly using 256 features.

The results show that the use of the Advanced Electric Fish Optimization (AEFO) algorithm can improve the performance of student behavior detection methods. Also, the results indicate that the number of features is important in the performance of the investigated methods.

Analyzing the efficiency in optimizing model parameters

In this section, comparative results are presented between different methods to detect students’ behaviors according to the optimization period and their performance. The results are illustrated in Fig. 10.

As can be seen in the results, the AlexNet/Elm/AEFO method with an optimization period of 300 min, accuracy of 96.5%, accuracy of 94.8% and accuracy of 98.2% has the best performance against the examined models. The AlexNet/Elm/AEFO method performs best with an optimization period of 300 min, while the CNN method performs relatively poorly with an optimization period of 120 min. These results demonstrate that the use of the AEFO algorithm can significantly improve the performance of student behavior detection methods. The results also indicate the importance of the optimization period in the performance of the explored methods.

Analyzing the efficiency in detection effects

In this section, a comparison has been made between five different detection algorithms for five different types of operations. The five types of actions examined are: listening, writing, sleeping, answering questions, and raising hands. Figure 11 shows the comparative results of the detection effects of different algorithms on five types of actions.

The results of Fig. 11 show that the proposed method performs better than other comparative methods in detecting five types of writing, listening, raising hands, sleeping and answering questions. In Fig. 8, the proposed model indicated the best performance in recognizing the act of writing with 96.5% accuracy, 94.8% precision and 98.2% recall. The proposed method with 95.2% accuracy, 92.5% accuracy and 97.1% accuracy showed the best performance in listening act detection. In Table 2, the proposed method with 96.2% accuracy, 93.5% precision and 98.5% recall showed the best performance in detecting the act of raising the hand.

The proposed method with 94.2% accuracy, 91.5% precision and 96.8% recall indicated the best performance in the diagnosis of sleeping.

The proposed method showed the best performance in recognizing the act of answering questions with 95.2% accuracy, 92.5% precision and 97.1% recall. In general, the proposed method showed the best performance in all five types of operations and is superior to other methods in terms of precision, accuracy and reliability. These results demonstrate that the proposed method can be used as a powerful tool to detect students’ behaviors.

Therefore, the data presented in subfigures indicate that the proposed method outperforms alternative methods in identifying five distinct types of behaviors: writing, listening, raising hands, sleeping, and answering questions.

Confusion matrix

This section provides a confusion matrix to evaluate the classification performance of the ELM Net in analyzing student behavior. This provides a detailed breakdown of the model’s predictions, including true positives, true negatives, false positives, and false negatives, enabling a comprehensive assessment of its accuracy and error rates (see Fig. 12).

Fig. 12

Full size image

Confusion matrix.

The results offer a very detailed confusion matrix (Fig. 12) for evaluating the classification performance of the ELM-Net for student behavior recognition. It contains analysis on a test dataset of 1,456 samples, evenly distributed across five behavioral classes (i.e., Writing, Listening, Raising Hand, Sleeping, and Answering Questions). For every class, the matrix quantifies the true positives, true negatives, false positives, and false negatives, facilitating an easy examination of the model’s reliability. In all, for the Writing class, it achieved TP = 278, TN = 1,165, FP = 14, and FN = 19 with good precision (95.2%) and recall (93.5%). For Listening, TP = 275, TN = 1,170, FP = 17, and FN = 24, which suggests good detection for attentive students.

In Raising Hand, it performed with TP = 280, TN = 1,158, FP = 12, or FN = 18, evidencing the sensitivity to active participation. For Sleeping-the most important class that implies disengagement-the model achieved TP = 268, TN = 1,180, FP = 10, FN = 22-indeed, detection of low-engagement behaviors was done with high specificity (99.2%). Lastly, Answering Questions gave TP = 274, TN = 1,172, FP = 16, FN = 20, thus giving more assurance of good performance among interactive behaviors. The overall low rate for false positives and false negatives across different classes is indicative of a balanced precision and recall with a macro-average F1 score of 95.8%. These results allow us to conclude that the AlexNet/ELM/AEFO framework gives reliable and interpretable classifications with few risks of misclassification in real-world classroom monitoring applications.

ROC/AUC curves

Furthermore, to evaluate the classification efficiency of the ELM-Net, we analyzed the ROC curves and the respective AUC values. These measures start to give a good overall sense of the model’s ability to differentiate between classes across different thresholds (see Fig. 13).

Fig. 13

Full size image

ROC/AUC curves.

The ROC curve is a plot of sensitivity on the y-axis versus 1- specificity on the x-axis for different diagnostic thresholds, and the model that has an AUC closer to 1 has a better discriminatory power to separate the positive class from the negative class, such as engaged from disengaged students. The optimal threshold for prediction, which is a trade-off between sensitivity and specificity to maximize predictive value, can be determined by the analysis of the ROC curve. This examination is important in E-learning systems, where students’ behavior must be correctly recognized for successful intervention.

Discussion

Data-driven findings that have been generated by this study prove that AlexNet/ELM/AEFO framework will dominate other existing technologies in terms of performance in behavior detection in students learning the Japanese language under wireless conditions within classrooms. Incorporation of deep feature extraction, fast classification, and advanced optimization places this model above other models not just in accuracy but makes the solution hardy and scalable to intelligent classroom surveillance applications. Moreover, this section discusses about the implication of these findings in accordance with the research objectives, puts into context performance gains, and goes on to examine an even broader impact within AI-enabled language education.

Notwithstanding its merits, the study is not without its limitations. Although it has been annotated and curated, the corpus also has a small sample size (282 images/videos) and cannot encompass all possible classroom scenarios or even cultural contexts. Also, the major possible focus on still images means that the model does not yet have the capability to capture the temporal dynamics of behavior, such as changes from attentive to distracted. In addition, in the case of ELM, because of the strong reliance on random initialization even having AEFO optimization, it may not be a good solution for capturing highly complex behavioral sequences in comparison to recurrent or transformer-type architectures.

Conclusions

The use of artificial intelligence as a powerful tool in language teaching and learning creates unique possibilities for students and generally turns the language learning process into a better and more successful learning experience. Using this methodology helps to provide numerous benefits to enhance language education and learning, ultimately equipping students with the skills needed for success in life and work. The present study introduced an innovative approach for addressing the complicated challenges encountered by existing methods of detecting student behavior in wireless network environments. By integrating the capabilities of AlexNet and Extreme Learning Machine (ELM) and refining the model through a sophisticated electric fish optimization algorithm, enhanced efficacy was achieved in identifying student behavior targets. As well, an anchor reconstruction algorithm yielded encouraging results in the reconstruction of the student behavior dataset. The findings of this approach hold considerable significance, as they possess the potential to elevate the precision of classroom monitoring systems developing improved student learning outcomes. The proposed methodology is applicable across various educational backgrounds, including language instruction, and may contribute to the formulation of more effective pedagogical strategies. However the method has several advantages, and there are also some limitations to this study that deserve to be acknowledged. The first potential limitation is that the dataset employed might have limitations in size or content, which means the results might not be representative of other educational settings or student subpopulations. Furthermore, the feature selection mechanism in the ELM Net, which is based on random weight initialization and regularization, may lead to biased or missing relevant behavior patterns. Although ELM Net can be computationally efficient, as a relatively simple model such as DNN, it would be limited in terms of its capacity to capture very complex relationships within student behaviors. In addition, the researchers’ emphasis on static data and not dynamic, real-time analysis may not capture the continuous development of the learning. Lastly, the study focuses mainly on language learning which may make it less general weak for some other disciplines or other cultural teaching contexts. These constraints indicate directions for further investigation and improvement. Future research avenues will involve examining the applicability of this approach in other educational fields and exploring alternative optimization algorithms to further enhance performance. However, the proposed model provided efficient detecting results for student Behaviors in Japanese language classrooms, there are many avenues for future research. First, the model’s ability to scale to other language learning environments—such as English, Chinese, and Arabic instruction—merits further investigation for compiling evidence of its generalizability across different linguistic and cultural contexts. Likewise, its applicability could also be tested for non-language subjects, e.g. mathematics, science, or vocation training, whose classroom engagement indicators can be different. Nowadays, the study is limited to static image analysis; future works should try to allow the system expand to real-time video behaviour detection to cover the dynamicness and temporal patterns at which the engagement of students occurs. By sharing streaming data processing along with lightweight network architectures, it could be put into real-classroom practice with low latency. In addition, tapping into different optimization algorithms or hybridizing deep learning models may also contribute to better detection capabilities and computational efficiency. With these extensions, the intelligent classroom systems would be evidently advanced and significantly broader into education through adaptive, AI-Pedagogical-Driven interventions.

Data availability

The data is freely available at [https://datasetninja.com/pascal-voc-2012).

Abbreviations

AI:: Artificial intelligence
ANN:: Artificial neural network
AEFO:: Advanced electric fish optimization
BN:: Batch normalization
CNN:: Convolutional neural network
ELM:: Extreme learning machine
EFO:: Electric fish optimization
EOD:: Electric organ discharge
NA:: Active electrolocation group
NP:: Passive electrolocation group
GBS:: Global Best Solution
FP:: False positive
FN:: False negative
TP:: True positive
TN:: True negative
ROC:: Receiver operating characteristic
AUC:: Area under the curve
kNN:: K-Nearest neighbors
WCO:: World cup optimization
GSK:: Gaining-sharing knowledge-based algorithm
FSO:: Firebug swarm optimization
RSA:: Reptile search algorithm
SLO:: Snow leopard optimization
JPEG:: Joint photographic experts group
OS:: Operating system
GPU:: Graphics processing unit
CPU:: Central processing unit
I(x,y):: Intensity value of the original image at pixel location " (x
μ:: Mean of pixel values in a filter window
σ:: Standard deviation of pixel values
${x}_{i}$ :: Input vector for the ELM network
${t}_{i}$ :: Target label vector for training sample i
N:: Number of training samples
L:: Number of hidden neurons in ELM
$\beta$ :: Output weights of ELM
H:: Hidden layer output matrix
${H}^\dagger$ :: Moore–Penrose pseudo-inverse of H
f:: Activation function (e.g., sigmoid) in ELM
$\omega$ :: Input weights (randomly assigned in ELM)
b:: Bias vector of hidden layer
${f}_{i}$ :: Frequency of individual i in AEFO
${A}_{i}$ :: Amplitude of individual i in AEFO
${d}_{ij}$ :: Euclidean distance between individuals i and j
J:: Cost function for model optimization
${y}_{j}$ :: True label of image j

References

Kindaichi, H. Japanese Language: Learn the Fascinating History and Evolution of the Language Along With Many Useful Japanese Grammar Points (Tuttle Publishing, 2011).
Google Scholar
Norboyevich, M. A. Evolution and origins of the Japanese language. World Bull. Soc. Sci. 30, 20–22 (2024).
Google Scholar
Ogura, F. The evolution of Japanese learners in Japan: Crossing Japan, the West and South East Asia, ed.
Daneshfar, F., Dolati, M. & Sulaimany, S. Graph clustering techniques for community detection in social networks. In Community Structure Analysis from Social Networks 81–100 (Chapman and Hall/CRC, 2025).
Chapter Google Scholar
Daneshfar, F., Dolati, M. & Sulaimany, S. Semi-supervised and deep learning approaches to social network community analysis. In Community Structure Analysis from Social Networks 101–120 (Chapman and Hall/CRC, 2025).
Chapter Google Scholar
Al Shloul, T. et al. Role of activity-based learning and ChatGPT on students’ performance in education. Comput. Educ. Artif. Intell. 6, 100219 (2024).
Article Google Scholar
Kuo, T.-H. The current situation of AI foreign language education and its influence on college Japanese teaching. In Cross-Cultural Design. Applications in Health, Learning, Communication, and Creativity: 12th International Conference, CCD 2020, Held as Part of the 22nd HCI International Conference, HCII 2020, Copenhagen, Denmark, July 19–24, 2020, Proceedings, Part II 22, 315–324 (Springer, 2020).
Mazhar, T., Shahzad, T., Ahmad, W. & Hamam, H. Use of brain-computer interface in educational paradigm. J. Comput. Soc. Sci. 8(3), 64 (2025).
Article Google Scholar
Shah, S. F. A., Mazhar, T., Shahzad, T., Ghadi, Y. Y. & Hamam, H. Integrating educational theories with virtual reality: Enhancing engineering education and VR laboratories. Soc. Sci. Humanit. Open 10, 101207 (2024).
Google Scholar
Mazhar, T., Shahzad, T., Waheed, W., Waheed, A. & Hamam, H. Predictive analytics in education-enhancing student achievement through machine learning. Soc. Sci. Humanit. Open 12, 101824 (2025).
Google Scholar
Jodoin, J. J. Promoting language education for sustainable development: A program effects case study in Japanese higher education. Int. J. Sustain. High. Educ. 21(4), 779–798 (2020).
Article Google Scholar
Marfina, V. E. Prospects of integration of virtual assistants in the process of teaching speaking to the beginner learners of the Japanese language. RUDN J. Informatization Educ. 20(3), 305–315 (2023).
Article Google Scholar
Nikitenko, V., Voronkova, V., Oleksenko, R., Andriukaitiene, R. & Holovii, L. Education as a factor of cognitive society development in the conditions of digital transformation. Rev. Univ. Zulia 13(38), 680–695 (2022).
Google Scholar
Arshad, A., Afzal, M. & Hussain, M. S. Sudden switch to post-COVID-19 online classes and cognitive transformation of ESL learners: Critical analysis of discourse of fear. Res. J. Soc. Sci. Econ. Rev. 1(3), 188–199 (2020).
Google Scholar
Shan, X. The application of cognitive linguistics theory in Japanese language teaching in the age of artificial intelligence. Appl. Math. Nonlinear Sci. (2023).
Zhang, S. The cognitive transformation of Japanese language education by artificial intelligence technology in the wireless network environment. Comput. Intell. Neurosci. 2022(1), 7886369 (2022).
PubMed PubMed Central Google Scholar
Lamb, R. Successful use of a novel artificial neural network to computationally model cognitive processes in high school students learning science. Comput. Rev. J. 3, 1–12 (2019).
Google Scholar
Kohnke, L., Moorhouse, B. L. & Zou, D. ChatGPT for language teaching and learning. RELC J. 54(2), 537–550 (2023).
Article Google Scholar
Radianti, J., Majchrzak, T. A., Fromm, J. & Wohlgenannt, I. A systematic review of immersive virtual reality applications for higher education: Design elements, lessons learned, and research agenda. Comput. Educ. 147, 103778 (2020).
Article Google Scholar
Jonassen, D. H. & Rohrer-Murphy, L. Activity theory as a framework for designing constructivist learning environments. Educ. Technol. Res. Dev. 47(1), 61–79 (1999).
Article Google Scholar
Mitchell, P. Acquiring a Conception of Mind: A Review of Psychological Research and Theory (Psychology Press, 2021).
Book Google Scholar
Rahmad Ramadhan, L. & Anne Mudya, Y. A comparative study of Z-score and min-max normalization for rainfall classification in Pekanbaru. J. Data Sci. 2024(04), 1–8 (2024).
Google Scholar
Shah, A. et al. Comparative analysis of median filter and its variants for removal of impulse noise from gray scale images. J. King Saud Univ. 34(3), 505–519 (2022).
Article Google Scholar
Xu, Y., Wang, Y. & Razmjooy, N. Lung cancer diagnosis in CT images based on Alexnet optimized by modified Bowerbird optimization algorithm. Biomed. Signal Process. Control 77, 103791 (2022).
Article Google Scholar
Lu, S., Wang, S.-H. & Zhang, Y.-D. Detection of abnormal brain in MRI via improved AlexNet and ELM optimized by chaotic bat algorithm. Neural Comput. Appl. 33(17), 10799–10811 (2021).
Article Google Scholar
Navid Razmjooy, F. R. S. & Ghadimi, N. A hybrid neural network – World cup optimization algorithm for melanoma detection. Open Med. 13, 9–16 (2018).
Article Google Scholar
Yu, D., Wang, Y., Liu, H., Jermsittiparsert, K. & Razmjooy, N. System identification of PEM fuel cells using an improved Elman neural network and a new hybrid optimization algorithm. Energy Rep. 5, 1365–1374 (2019).
Article Google Scholar
Wang, J., Lu, S., Wang, S.-H. & Zhang, Y.-D. A review on extreme learning machine. Multimed. Tools Appl. 81(29), 41611–41660 (2022).
Article Google Scholar
Yilmaz, S. & Sen, S. Electric fish optimization: A new heuristic algorithm inspired by electrolocation. Neural Comput. Appl. 32(15), 11543–11578 (2020).
Article Google Scholar
Mohamed, A. W., Hadi, A. A. & Mohamed, A. K. Gaining-sharing knowledge based algorithm for solving optimization problems: A novel nature-inspired algorithm. Int. J. Mach. Learn. Cybern. 11(7), 1501–1529 (2020).
Article Google Scholar
Noel, M. M., Muthiah-Nakarajan, V., Amali, G. B. & Trivedi, A. S. A new biologically inspired global optimization algorithm based on firebug reproductive swarming behaviour. Expert Syst. Appl. 183, 115408 (2021).
Article Google Scholar
Abualigah, L., Abd Elaziz, M., Sumari, P., Geem, Z. W. & Gandomi, A. H. Reptile search algorithm (RSA): A nature-inspired meta-heuristic optimizer. Expert Syst. Appl. 191, 116158 (2022).
Article Google Scholar
Razmjooy, N., Khalilpour, M. & Ramezani, M. A new meta-heuristic optimization algorithm inspired by FIFA world cup competitions: Theory and its application in PID designing for AVR system. J. Control Autom. Electr. Syst. 27(4), 419–440 (2016).
Article Google Scholar
Coufal, P., Hubálovský, Š, Hubálovská, M. & Balogh, Z. Snow leopard optimization algorithm: A new nature-based optimization algorithm for solving optimization problems. Mathematics 9(21), 2832 (2021).
Article Google Scholar

Download references

Funding

This work is supported by the 14th Five-Year Fund for Social Sciences (23YY12) in Jiangxi Province.

Author information

Authors and Affiliations

School of Foreign Languages, Jiujiang University, Jiujiang, 332005, Jiangxi, China
Yuquan Li, Hongyu Zou, Jing Xu, Dechao Chen, Ying Zhou & Yan Liu

Authors

Yuquan Li
View author publications
Search author on:PubMed Google Scholar
Hongyu Zou
View author publications
Search author on:PubMed Google Scholar
Jing Xu
View author publications
Search author on:PubMed Google Scholar
Dechao Chen
View author publications
Search author on:PubMed Google Scholar
Ying Zhou
View author publications
Search author on:PubMed Google Scholar
Yan Liu
View author publications
Search author on:PubMed Google Scholar

Contributions

Yuquan Li was responsible for the overall model design and experiments, Ying Zhou and Yan Liu participated in data collection and preparation, Dechao Chen implemented the AEFO algorithm, Hongyu Zou and Jing Xu contributed to the writing and revision of the paper.

Corresponding author

Correspondence to Yuquan Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Li, Y., Zou, H., Xu, J. et al. The impact of AI on Japanese language education: a hybrid model for student behavior detection. Sci Rep 16, 11140 (2026). https://doi.org/10.1038/s41598-026-40262-7

Download citation

Received: 10 June 2025
Accepted: 11 February 2026
Published: 26 February 2026
Version of record: 02 April 2026
DOI: https://doi.org/10.1038/s41598-026-40262-7