Abstract
Machine translation plays a crucial role in bridging language gaps, especially in specialized domains such as software engineering. Traditional neural machine translation models, including Transformer and LSTM-based models, have shown significant progress in Chinese-to-English translation tasks. However, these models often face challenges in optimizing hyperparameters dynamically and handling diverse textual domains, leading to suboptimal translation accuracy and efficiency. To address these limitations, this study proposes an enhanced translation model, Adaptive Grey Wolf Optimization with Self-Attention and LSTM (AGWO-SALSTM). The proposed model integrates an adaptive Grey Wolf Optimization (AGWO) algorithm to dynamically fine-tune hyperparameters, optimizing learning rates, attention weights, and network configurations. The combination of self-attention and bidirectional LSTM enhances contextual understanding and sequential processing, leading to improved translation accuracy. The proposed AGWO-SALSTM is validated against three baseline models: Transformer, LSTM-Seq2Seq, and MT5, across four well-established datasets: PARACRAWL, WMT, UM-Corpus, and OPUS. Experimental results demonstrate that AGWO-SALSTM consistently outperforms the baseline models in terms of translation accuracy and efficiency. Specifically, the proposed model achieves an average translation accuracy of 95.56% with the highest accuracy recorded across all datasets, outperforming the closest competitor, MT5, which achieves 90.94%. Additionally, AGWO-SALSTM requires fewer iterations to converge to a stable state, with an average of 16–20 iterations, compared to the Transformer model, which requires up to 57 iterations.
Introduction
With the acceleration of globalization and the growing importance of English as the primary language for international communication, the demand for accurate and efficient English translation has significantly increased1,2. In particular, the software engineering industry, which operates on a global scale, requires precise translations to facilitate cross-language communication in key areas such as agile development, requirements analysis, and technical documentation3. Despite advances in machine translation (MT) technologies, traditional translation tools often struggle to meet the complex demands of professional domains due to challenges related to context, terminology, and sentence structure4.
Recent developments in deep learning have led to the emergence of NMT models, such as Long Short-Term Memory (LSTM) networks and Transformer models, which have demonstrated remarkable improvements in translation quality by capturing complex linguistic patterns5. Attention mechanisms, in particular, have proven to be effective in capturing long-range dependencies within sentences, enabling better contextual understanding and fluency in translations6. LSTM networks, on the other hand, have the ability to encode sequential dependencies, making them well-suited for translation tasks involving complex sentence structures7. The combination of attention mechanisms and LSTMs has provided new opportunities for enhancing the accuracy and fluency of English translation in specialized domains such as software engineering8.
Several studies have explored the application of deep learning models in the context of machine translation. Transformer-based models have gained popularity due to their parallelization capabilities and ability to capture contextual dependencies through self-attention mechanisms9. However, they often struggle with handling domain-specific translations that require a deeper understanding of context, particularly in technical fields such as software engineering10. Conversely, recurrent neural networks (RNNs) and their variants, such as LSTMs and GRUs, have demonstrated improved performance in sequential modeling tasks but tend to suffer from high computational complexity and slow training times11,12. The integration of self-attention mechanisms with Bi-directional LSTM (Bi-LSTM) models has been proposed as a solution to address these challenges by enhancing the model’s ability to capture bidirectional dependencies and handle complex translation tasks more effectively13.
Despite these advancements, optimizing translation models for professional fields remains a challenge. Existing models often rely on static hyperparameter configurations, which may not be suitable for dynamic translation scenarios. Metaheuristic optimization algorithms, such as Grey Wolf Optimization (GWO), have been introduced to address this issue by dynamically tuning hyperparameters to improve model efficiency and translation accuracy14. However, traditional GWO lacks adaptability to changing translation contexts, limiting its effectiveness in real-world applications15.
Although previous research has demonstrated improvements in machine translation performance through the use of attention mechanisms and LSTMs, significant gaps remain in optimizing these models for domain-specific applications. Most existing approaches do not incorporate adaptive optimization strategies, leading to suboptimal model performance in evolving translation scenarios16. Additionally, comprehensive evaluations of translation models across diverse datasets are limited, hindering the ability to generalize findings to practical applications such as software engineering17. Saz et al.18 explored the impact of machine translation on vocabulary acquisition, highlighting the importance of translation accuracy and fluency in enhancing language comprehension. This study underscores the significance of optimizing translation models to maintain both linguistic accuracy and contextual relevance. Gopali et al.19 evaluated the performance of LSTM-based models in time series forecasting, demonstrating their capability to capture sequential dependencies effectively. Their findings support the integration of Bi-LSTM in translation models to handle long-range dependencies efficiently. Horvat et al.20 conducted an evaluation of soundscape attribute translations, emphasizing the importance of context-awareness in machine translation tasks, which aligns with the goal of improving software engineering translations where contextual accuracy is critical. Israr et al.21 proposed attention-based dropout mechanisms in neural machine translation models, demonstrating improvements in translation robustness and efficiency, which is directly applicable to enhancing the AGWO-SALSTM model’s performance. Paneru et al.22 introduced an AI-driven sign language translation system using convolutional neural networks (CNNs), highlighting the versatility of deep learning models in different translation scenarios and inspiring potential future applications of hybrid models in Chinese-to-English translation tasks. Addressing these gaps by integrating adaptive optimization techniques and leveraging domain-specific datasets can significantly enhance the quality of machine translation models.
Recent studies have highlighted the growing emphasis on developing optimization-driven and domain-adapted neural machine translation (NMT) models. For instance, Daneshfar and Aghajani23 improved text classification performance through a novel discrete laying chicken algorithm, demonstrating how metaheuristic optimization can enhance language processing tasks. Similarly, Al-khresheh24 examined AI-generated Arabic-English texts using back-translation analysis, offering insights into accuracy and meaning retention when applying NMT in complex linguistic contexts. Li and Zhang25 introduced a multi-agent system based on HNC for domain-specific machine translation, emphasizing the role of collaborative architectures in improving translation quality. Liu et al.26 further advanced this line of work by exploring iterative dual domain adaptation, which allows NMT systems to progressively refine their performance in specialized domains. Building on multi-modal approaches, Guo et al.27 proposed a progressive modality-complement aggregative multitransformer, demonstrating the benefits of combining modalities to improve translation robustness. In addition, Yamini et al.28 presented KurdSM, a transformer-based model for Kurdish abstractive text summarization, highlighting the importance of annotated corpora and domain-specific frameworks for under-resourced languages. Collectively, these works illustrate the increasing trend toward integrating optimization strategies, domain adaptation, and modality-aware architectures to address challenges in specialized translation and summarization tasks.
To bridge these gaps, this paper proposes an Adaptive Grey Wolf Optimization with Self-Attention and LSTM (AGWO-SALSTM) model to enhance Chinese-to-English translation performance. The proposed model integrates an adaptive version of the Grey Wolf Optimization algorithm to dynamically fine-tune critical hyperparameters such as learning rate, dropout rate, and attention weights. By combining Bi-LSTM with a dual-stage attention mechanism, the model aims to capture complex linguistic dependencies more effectively and improve contextual translation accuracy. The performance of AGWO-SALSTM is evaluated against three baseline models (Transformer, LSTM-Seq2Seq, and MT5) using four well-established Chinese-to-English translation datasets: PARACRAWL, WMT, UM-Corpus, and OPUS.
The main contributions are as follows:
(a) The AGWO-SALSTM model integrates adaptive GWO with self-attention and Bi-LSTM to enhance translation accuracy and efficiency dynamically.
(b) The proposed approach dynamically adjusts hyperparameters such as learning rate and attention weights to optimize translation quality.
(c) A comparative analysis is conducted across four datasets (PARACRAWL, WMT, UM-Corpus, and OPUS) to assess the model’s robustness in handling diverse text types.
(d) The proposed model is benchmarked against existing NMT models, including Transformer, LSTM-Seq2Seq, and MT5, to highlight its improvements in translation accuracy and efficiency.
Theoretical background
Principle of deep learning model
Bi-LSTM structural overview
Long Short-Term Memory (LSTM) networks are a class of deep recurrent neural networks (RNNs) designed to handle sequential data by utilizing specialized gating mechanisms. These include the input gate, forget gate, and output gate, which, together with the candidate cell-state update, regulate the flow of information and maintain the hidden state across time steps29. The general structure of the LSTM model is illustrated in Fig. 1.
Despite the effectiveness of LSTM in learning complex temporal dependencies, it faces challenges in capturing global contextual information across extensive time series. As the sequence length increases, the model tends to forget earlier learned patterns due to the inherent limitations in processing long-range dependencies. This results in suboptimal performance when dealing with lengthy sequences of data samples.
To address these limitations, the Bidirectional Long Short-Term Memory (Bi-LSTM) network has been introduced. The Bi-LSTM model, illustrated in Fig. 2, consists of two independent LSTM layers that process the input sequence in both forward and backward directions. This bidirectional architecture enables the model to capture past and future dependencies simultaneously, allowing for a more comprehensive understanding of the temporal patterns within the dataset. By training the model in both directions, Bi-LSTM enhances the network’s ability to learn richer contextual representations and improve prediction accuracy, making it a robust solution for sequential data processing tasks.
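For concreteness, the sketch below builds one bidirectional LSTM layer in PyTorch; the 300-dimensional embeddings and 256 hidden units mirror the configuration reported later in this paper, while the batch and sequence sizes are illustrative.

```python
import torch
import torch.nn as nn

# Minimal Bi-LSTM sketch (embedding and hidden sizes follow the reported
# configuration; batch and sequence sizes are illustrative).
embedding_dim, hidden_size, seq_len, batch = 300, 256, 12, 4

bilstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size,
                 batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, embedding_dim)  # embedded input sequence
outputs, (h_n, c_n) = bilstm(x)

# Each time step concatenates the forward and backward hidden states,
# so the per-step feature size is 2 * hidden_size.
print(outputs.shape)  # torch.Size([4, 12, 512])
```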
Attention mechanism: structural overview
The attention mechanism is a specialized probability distribution model designed to enhance information processing by selectively focusing on the most relevant parts of the input sequence30. By dynamically weighting the importance of different elements within the data, the attention mechanism effectively improves the model’s predictive accuracy. This approach allows the model to allocate greater significance to critical information while minimizing the influence of less relevant components. The structural representation of the attention mechanism is depicted in Fig. 3, illustrating its role in refining the model’s focus and contributing to more accurate and context-aware predictions.
The attention mechanism is a powerful technique designed to selectively focus on the most relevant parts of an input sequence by assigning different weights to different elements. This mechanism enhances the model’s ability to capture long-range dependencies, effectively mitigating the long-term dependency loss typically encountered in traditional recurrent neural networks (RNNs) such as LSTM and Bi-LSTM. Unlike conventional models that process entire sequences uniformly, the attention mechanism dynamically adjusts its focus based on contextual importance, leading to improved translation accuracy and efficiency.
In an attention-based system, given an input sequence of length \(T_x\), each input position \({h}_{f}\) influences the current output position d. The influence is calculated using the following attention score function:
\[ e_{df} = a\left( s_{d-1}, h_{f} \right) \]
where \(e_{df}\) represents the attention score between the decoder hidden state \(s_{d-1}\) at the previous time step and the encoder output \({h}_{f}\) at input position f, and the function a(⋅) computes the relevance score, which can be defined using dot-product, additive, or scaled dot-product attention methods.
The attention weights \({\alpha }_{df}\) are computed by applying the softmax function to normalize the attention scores:
\[ \alpha_{df} = \frac{\exp\left( e_{df} \right)}{\sum_{k=1}^{T_x} \exp\left( e_{dk} \right)} \]
The weighted context vector \(c_{d}\) at position d is then computed as the sum of the encoder hidden states weighted by their respective attention scores:
\[ c_{d} = \sum_{f=1}^{T_x} \alpha_{df} \, h_{f} \]
The computed context vector \(c_{d}\) is subsequently combined with the current decoder state \(s_{d}\) to generate the final output:
\[ o_{d} = g\left( c_{d}, s_{d} \right) \]
where g(⋅) represents a function such as concatenation or projection.
The concatenated attention output is then passed through a Softmax layer to generate probability distributions over possible outputs. The process is mathematically represented as:
\[ Q = \mathrm{softmax}\left( W_{T} T + b_{T} \right) \]
where T represents the concatenated bidirectional feature representation, \(W_{T}\) and \(b_{T}\) are the weight and bias parameters of the softmax layer, and Q represents the predicted probability distribution of the output sequence.
To optimize the model, the categorical cross-entropy loss function is used, which minimizes the difference between the predicted and actual values. The loss function is formulated as follows:
\[ \mathcal{L} = - \frac{1}{N} \sum_{k=1}^{N} \sum_{l=1}^{C} y_{kl} \log\left( Q_{kl} \right) \]
where N is the number of training samples, C is the number of possible classes, \(y_{kl}\) represents the true class label, and \(Q_{kl}\) is the predicted probability for class l of sample k.
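The computation defined by these equations can be sketched in a few lines of NumPy. The example below assumes dot-product scoring, one of the options named above, with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
T_x, hidden = 6, 8                   # source length and hidden size (toy values)
H = rng.normal(size=(T_x, hidden))   # encoder outputs h_f
s_prev = rng.normal(size=hidden)     # previous decoder state s_{d-1}

e = H @ s_prev                       # attention scores e_{df} (dot-product form)
alpha = np.exp(e) / np.exp(e).sum()  # softmax-normalized weights alpha_{df}
c_d = alpha @ H                      # context vector c_d = sum_f alpha_{df} h_f

# Categorical cross-entropy for a single prediction Q with one-hot label y:
Q = np.array([0.1, 0.7, 0.2])
y = np.array([0.0, 1.0, 0.0])
loss = -(y * np.log(Q)).sum()        # = -ln(0.7), approximately 0.357
```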
One of the major challenges of traditional Bi-LSTM networks is their inability to perform parallel computations, resulting in increased processing time. To address this, the sliced recurrent neural network (SRNN) technique is employed, which divides the input sequence into multiple equal-length slices for simultaneous processing as shown in Appendix 1 (a).
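As a minimal illustration of the slicing idea, with assumed sizes rather than the paper's settings, a length-128 sequence can be reshaped into eight equal slices that independent recurrent cells can process in parallel:

```python
import numpy as np

# SRNN-style slicing sketch: sizes are illustrative.
seq = np.random.rand(128, 300).astype(np.float32)  # (time steps, features)
slices = seq.reshape(8, 16, 300)                   # (slices, steps per slice, features)
print(slices.shape)  # (8, 16, 300)
```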
The Bi-LSTM network, when integrated with the Attention mechanism, forms the Bi-LSTM-AT mechanism, which enhances the model’s ability to retain long-term dependencies while selectively focusing on critical features. This integration enables the model to autonomously identify and prioritize the most relevant feature values essential for classification, effectively reducing the influence of irrelevant or overly complex features within the sample data. The structural framework of the Bi-LSTM-AT mechanism, illustrated in Fig. 4, highlights its capability to optimize feature extraction and improve classification accuracy. In the processing pipeline, the input layer receives the sample data, which is then mapped into a low-dimensional representation via the embedding layer, capturing essential characteristics. The Bi-LSTM layer extracts high-level feature representations by analyzing bidirectional dependencies in the sequence data. Subsequently, the extracted features are processed through the attention mechanism layer, which generates a weight vector that assigns importance to each feature. By applying the attention weights to the feature matrix, the model ensures that key features from each iteration contribute effectively to the overall feature representation. Finally, the output layer produces the final feature vector, which serves as the basis for classification.
Adaptive Grey Wolf Optimization (AGWO)
The AGWO algorithm is an enhanced version of the standard Grey Wolf Optimization (GWO) algorithm, designed to dynamically adjust hyperparameters and optimize complex machine learning models, such as the proposed AGWO-SALSTM model. Inspired by the social hierarchy and hunting strategies of grey wolves, AGWO introduces adaptive mechanisms to improve convergence speed, exploration–exploitation balance, and overall optimization efficiency.
In the standard GWO, the balance between exploration (searching for global optima) and exploitation (refining the best solutions) is governed by the coefficient a, as in Appendix 1 (b). However, in AGWO, an adaptive parameter α(t) is introduced to improve convergence speed and search precision dynamically. The adaptive coefficient α(t) is defined as:
\[ \alpha(t) = \alpha_{min} + \left( \alpha_{max} - \alpha_{min} \right) e^{-\gamma t} \]
where \({\alpha }_{min}\) and \({\alpha }_{max}\) represent the minimum and maximum values of the adaptive parameter, γ is the decay factor controlling the rate of adaptation, t denotes the current iteration number.
The adaptive coefficient α(t) regulates the balance between exploration and exploitation during the optimization process. At the beginning of training, larger values of α(t) promote exploration of the search space, preventing premature convergence to local minima. As iterations progress, α(t) gradually decreases, allowing the algorithm to focus on exploitation and fine-tuning near optimal solutions. The decay factor γ controls the rate of this transition. This adaptive mechanism ensures that AGWO dynamically adjusts its search behavior over time, unlike static hyperparameter settings in conventional optimization methods.
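For illustration, assuming the exponential decay form above with illustrative bounds \(\alpha_{max} = 2\) and \(\alpha_{min} = 0\), and taking γ = 0.015 (the decay value later found best in the sensitivity analysis), the coefficient starts at α(0) = 2 and falls to \(\alpha(50) = 2e^{-0.75} \approx 0.94\) after 50 iterations, by which point the search has shifted from broad exploration toward local exploitation.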
The updated position equation incorporating the adaptive coefficient is:
\[ \vec{X}(t+1) = w_{1} \vec{X}_{\alpha} + w_{2} \vec{X}_{\beta} + w_{3} \vec{X}_{\delta} \]
where the weight coefficients \(w_{1}, w_{2}, w_{3}\) are derived from the adaptive coefficient α(t) and the leader hierarchy, so that the influence of the α, β, and δ wolves shifts as the search progresses.
In the AGWO-SALSTM model, AGWO is used to dynamically optimize key hyperparameters such as the learning rate, dropout rate, attention weight parameters, and Bi-LSTM configurations. The optimization problem can be expressed as follows:
\[ \min_{\theta} \; \mathcal{L}\left( \theta \right) \quad \text{subject to} \quad \theta_{min} \le \theta \le \theta_{max} \]
where \({\mathcal{L}}\left( \theta \right)\) represents the loss function to be minimized, such as Mean Absolute Error (MAE) or Mean Squared Error (MSE), θ denotes the set of hyperparameters optimized by AGWO, \(\theta_{min}\) and \(\theta_{max}\) define the allowable search space for hyperparameter tuning.
The convergence of AGWO is influenced by the adaptive parameter α(t) and the balance between exploration and exploitation. The stopping criterion for the algorithm can be defined as:
\[ \left| \mathcal{L}\left( t \right) - \mathcal{L}\left( t-1 \right) \right| < \varepsilon \]
where \(\mathcal{L}\left(t\right)\) and \(\mathcal{L}\left(t-1\right)\) represent the loss function values at consecutive iterations, ε is a predefined threshold to determine convergence.
The AGWO algorithm follows these key steps, with a minimal code sketch after the list:
Step 1: Initialization.
- Randomly initialize the wolf positions \({X}_{i}\) within the search space.
- Set parameters a, C, and the adaptive coefficient α.
Step 2: Fitness evaluation.
- Calculate the fitness of each wolf based on the defined loss function.
Step 3: Position update.
- Update wolf positions using the adaptive equations.
- Adjust the adaptive coefficient α(t) to refine the balance between exploration and exploitation.
Step 4: Convergence check.
- Evaluate the stopping criteria based on loss improvement or maximum iterations.
Step 5: Output optimal solution.
- Return the best solution found for the given problem.
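As a concrete illustration of Steps 1–5, the following NumPy sketch implements the loop under stated assumptions: the position update uses the standard GWO average over the three leaders, the adaptive coefficient follows the exponential decay discussed above, and the function name, bounds, and quadratic toy objective are all illustrative rather than the authors' implementation.

```python
import numpy as np

def agwo(loss, dim, n_wolves=30, iters=50,
         a_min=0.0, a_max=2.0, gamma=0.015, lo=-1.0, hi=1.0):
    rng = np.random.default_rng(0)
    X = rng.uniform(lo, hi, size=(n_wolves, dim))        # Step 1: initialize positions
    for t in range(iters):
        fitness = np.apply_along_axis(loss, 1, X)        # Step 2: fitness evaluation
        alpha, beta, delta = X[np.argsort(fitness)[:3]]  # leader hierarchy
        a = a_min + (a_max - a_min) * np.exp(-gamma * t) # adaptive coefficient a(t)
        X_new = np.zeros_like(X)
        for leader in (alpha, beta, delta):              # Step 3: position update
            r1, r2 = rng.random(X.shape), rng.random(X.shape)
            A, C = 2 * a * r1 - a, 2 * r2
            X_new += (leader - A * np.abs(C * leader - X)) / 3.0
        X = np.clip(X_new, lo, hi)
        # Step 4: a full implementation would also stop early when
        # |L(t) - L(t-1)| < epsilon; omitted here for brevity.
    fitness = np.apply_along_axis(loss, 1, X)
    return X[np.argmin(fitness)]                         # Step 5: best solution

best = agwo(lambda th: float(np.sum(th ** 2)), dim=4)    # toy quadratic objective
```

In the full model, the `loss` callable would wrap one training and validation cycle of the translation network evaluated at the candidate hyperparameters.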
Proposed method
System description
Figure 5 presents the block diagram of the Software Engineering English Translation using the AGWO-SALSTM model, detailing the sequential stages involved in processing bilingual text data for accurate translations. The process begins with data collection from sources such as software manuals and research papers, followed by data preprocessing, which includes noise reduction through normalization techniques (e.g., removal of non-standard symbols, punctuation normalization) and token-level filtering to improve data quality. The next stage involves feature extraction through text-specific methods, including tokenization and named entity recognition (NER) to identify grammatical structure and domain-specific terminology. Additionally, contextual word embeddings (e.g., BERT) are used to represent input sentences in dense vector form, capturing both syntactic and semantic information. These representations are then fed into the AGWO-SALSTM model, where AGWO dynamically fine-tunes hyperparameters, and the Self-Attention Bi-LSTM (SALSTM) enhances sequence modeling capabilities. This approach enables high-accuracy translation between English and Chinese across technical domains. In this study, “textual noise” refers to inconsistencies and artifacts such as extra whitespace, irregular punctuation, non-standard symbols, or encoding errors that can negatively affect tokenization and translation accuracy. To address this, a simplified median filtering technique was applied to sliding windows over character or token sequences to normalize patterns, for example smoothing repeated characters or removing anomalous symbols. While this is inspired by signal-based noise filtering, it is adapted here for sequence-level smoothing in text preprocessing. This step helps reduce variability in input that could mislead the model during training.
The feature extraction process is crucial in the AGWO-SALSTM model to enhance translation quality by effectively capturing linguistic, syntactic, and contextual features of the text. The feature extraction was designed to capture both syntactic and semantic information relevant to software engineering texts. Preprocessing steps included normalization (removing non-standard symbols, punctuation normalization) and tokenization, followed by part-of-speech tagging and named entity recognition to highlight domain-specific terminology. To represent the data in a form suitable for deep learning, contextual word embeddings (e.g., BERT-based vectors) were used, providing dense numerical representations of words and phrases. These embeddings encode both linguistic structure and contextual meaning, ensuring that the model receives rich input features without requiring extensive manual engineering.
The process is divided into three main stages: data preprocessing, feature extraction, and contextual representation. Each stage involves several mathematical operations to ensure efficient data transformation and representation for the translation model.
Data preprocessing involves cleaning and normalizing the text to ensure consistency and accuracy in feature extraction.
In this work, textual data is transformed into dense numerical vectors that can be processed by the neural architecture. Each input sentence is first tokenized into a sequence of tokens, which are then mapped to embeddings using a learnable embedding matrix.
The embedding of the i-th token \(w_{i}\) is defined as:
\[ x_{i} = E\left( w_{i} \right) \in \mathbb{R}^{d} \]
where \(w_{i}\) denotes the i-th token in the sentence, \(x_{i}\) is its corresponding embedding vector, and d is the embedding dimension.
The full sentence representation is then constructed as a sequence of embeddings:
\[ X = \left( x_{1}, x_{2}, \ldots, x_{n} \right) \]
where n is the length of the sentence. The embedding sequence X serves as the input to the Bi-LSTM and self-attention layers of the proposed AGWO-SALSTM model.
Named Entity Recognition (NER) extracts domain-specific terms such as software tools and technical terminology.
To understand syntactic structures, dependency parsing is used to identify relationships between words.
Contextual word embeddings are added: pre-trained deep learning models, such as BERT, provide contextually aware embeddings.
To handle Chinese homophones, the text is converted into a phonetic Pinyin representation.
The importance of terms within documents is determined using the term frequency-inverse document frequency (TF-IDF) formula:
\[ \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)} \]
where \(\mathrm{tf}(t, d)\) is the frequency of term t in document d, \(\mathrm{df}(t)\) is the number of documents containing t, and N is the total number of documents.
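A minimal pure-Python illustration of this formula on a hypothetical three-document toy corpus:

```python
import math
from collections import Counter

docs = [["compile", "the", "module"],
        ["debug", "the", "runtime"],
        ["compile", "and", "debug"]]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)     # term frequency within the document
    return tf * math.log(N / df[term])  # weighted by inverse document frequency

print(tfidf("compile", docs[0]))  # (1/3) * ln(3/2), approximately 0.135
```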
To improve alignment between English and Chinese texts, parallel sentence alignment is performed.
Latent Dirichlet Allocation (LDA) extracts topics from the text to improve translation accuracy.
Due to structural differences between English and Chinese, length normalization is applied.
Principle of prediction model
The core objective of the prediction model is to accurately capture complex linguistic patterns and context dependencies in software engineering text during English-to-Chinese and Chinese-to-English translation tasks. The model is designed to handle long-range dependencies and structural variations by utilizing the bidirectional LSTM’s ability to process sequences in both forward and backward directions, coupled with the self-attention mechanism for enhanced feature weighting and focus on key contextual elements.
The prediction process begins with the input feature representation, obtained through a combination of linguistic, statistical, and phonetic feature extraction techniques. The processed feature vectors are then passed through the Bi-LSTM layer, which consists of forward and backward LSTM cells. The forward LSTM processes the input sequence from the start to the end:
\[ h_{t}^{\to} = \mathrm{LSTM}_{f}\left( X_{t}, h_{t-1}^{\to} \right) \]
where \(X_{t}\) represents the feature input at time step t, and \(h_{t - 1}^{ \to }\) denotes the previous hidden state. Similarly, the backward LSTM processes the sequence in reverse order:
\[ h_{t}^{\leftarrow} = \mathrm{LSTM}_{b}\left( X_{t}, h_{t+1}^{\leftarrow} \right) \]
The final output from the Bi-LSTM layer is obtained by concatenating the hidden states from both directions:
\[ h_{t} = \left[ h_{t}^{\to} \,;\, h_{t}^{\leftarrow} \right] \]
Once the Bi-LSTM layer captures bidirectional dependencies, the self-attention mechanism is applied to focus on the most relevant parts of the input sequence. This mechanism assigns different weights to different parts of the sequence, ensuring the model focuses on the most influential words in the context:
\[ \alpha_{t} = \frac{\exp\left( \mathrm{score}\left( h_{t} \right) \right)}{\sum_{k=1}^{n} \exp\left( \mathrm{score}\left( h_{k} \right) \right)}, \qquad c = \sum_{t=1}^{n} \alpha_{t} h_{t} \]
where \({\upalpha }_{t}\) represents the attention weight assigned to the hidden state \({h}_{t}\), and c is the weighted sum of all hidden states, representing the context vector used for translation predictions.
To further enhance the model’s predictive accuracy and stability, the AGWO algorithm is employed to dynamically optimize key hyperparameters such as learning rate, dropout rate, and attention weight parameters. AGWO follows the hierarchical hunting strategy of grey wolves, where the best three solutions guide the search for the optimal parameters:
\[ X(t+1) = \alpha(t) X_{\alpha} + \beta(t) X_{\beta} + \delta(t) X_{\delta} \]
where \(X_{{\upalpha }} ,X_{{\upbeta }} ,X_{{\updelta }}\) represent the best solutions, and the coefficients \({\upalpha }\left( t \right),{\upbeta }\left( t \right),{\updelta }\left( t \right)\) adjust over iterations to balance exploration and exploitation.
The prediction model is trained to minimize translation errors using a loss function such as Mean Squared Error (MSE) or Cross-Entropy Loss, depending on the task:
\[ \mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_{i} - \hat{y}_{i} \right)^{2} \]
where \(y_{i}\) is the actual translated output and \(\widehat{{y_{i} }}\) is the predicted output from the model.
In the final stage of the prediction process, the Softmax activation function is applied to convert the model’s output into probability distributions over the possible translation candidates:
\[ \hat{y} = \mathrm{softmax}\left( W h + b \right) \]
where W and b are trainable parameters of the output layer.
Through the combination of Bi-LSTM for sequential dependency learning, self-attention for feature focus, and AGWO for optimization, the prediction model achieves high accuracy and robustness in translating software engineering text. The iterative process of training and fine-tuning ensures that the model can generalize well across different text complexities and linguistic variations. The overall system optimization process is shown in Fig. 6.
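To make the pipeline concrete, the PyTorch sketch below chains the stages in the order described (embedding, Bi-LSTM, attention-weighted context, softmax output). The class and attribute names are hypothetical, the attention scoring is a simple learned linear layer, and only the embedding and hidden dimensions follow the configuration reported later; this is an illustrative sketch, not the authors' released code.

```python
import torch
import torch.nn as nn

class SALSTMSketch(nn.Module):
    """Sketch: embedding -> Bi-LSTM -> attention-weighted context -> softmax."""
    def __init__(self, vocab_size, n_classes, emb_dim=300, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)        # attention scoring score(h_t)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        h, _ = self.bilstm(self.embed(tokens))       # (batch, seq_len, 2 * hidden)
        alpha = torch.softmax(self.score(h), dim=1)  # attention weights alpha_t
        context = (alpha * h).sum(dim=1)             # context vector c
        return torch.log_softmax(self.out(context), dim=-1)

model = SALSTMSketch(vocab_size=10000, n_classes=8000)
log_probs = model(torch.randint(0, 10000, (4, 12)))  # shape (4, 8000)
```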
Experimental evaluation indicators
The performance of the proposed AGWO-SALSTM model is evaluated using key indicators that measure accuracy and efficiency. These include statistical error metrics such as MAE and MSE to quantify translation deviations, while average training and prediction times assess computational efficiency. Additionally, translation quality is measured using Word Recognition Accuracy (WRA), Word Error Rate (WER), and Word Correct Rate (WCR), providing insights into correctness, fluency, and alignment with reference translations.
The MAE measures the average absolute difference between the predicted translation output and the actual reference translation. It provides an indication of the overall translation error:
\[ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_{i} - \hat{y}_{i} \right| \]
where \(y_{i}\) represents the actual translated output, \(\widehat{{y_{i} }}\) represents the predicted output by the model, N is the total number of translated sentences.
The MSE measures the squared difference between the predicted and actual values, penalizing larger errors more than smaller ones, which helps in understanding how well the model generalizes:
\[ \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_{i} - \hat{y}_{i} \right)^{2} \]
A lower MSE value indicates better translation accuracy, with minimal deviation from the ground truth.
The efficiency of the model is evaluated by measuring the average training time required to reach convergence. This is calculated as:
\[ T_{avg} = \frac{1}{E} \sum_{i=1}^{E} T_{i} \]
where \(T_{i}\) is the time taken for each training epoch, E is the total number of epochs.
The average prediction time assesses the model’s computational efficiency in real-time translation applications. It is calculated as the total translation time per instance:
\[ P_{avg} = \frac{1}{M} \sum_{j=1}^{M} P_{j} \]
where \(P_{j}\) represents the time taken for each translation prediction, M is the total number of test samples.
Word recognition accuracy evaluates the proportion of correctly translated words compared to the total number of words in the reference translation:
\[ \mathrm{WRA} = \frac{N_{correct}}{N_{total}} \times 100\% \]
where \(N_{correct}\) is the number of correctly translated words and \(N_{total}\) is the total number of words in the reference translation.
A higher WRA indicates better translation quality, reflecting the model’s ability to accurately interpret software engineering terminology and contextual meaning.
The WER measures the number of errors in the translated output compared to the reference text. It is a standard evaluation metric for translation quality and is calculated using the Levenshtein distance, which accounts for the number of insertions, deletions, and substitutions required to convert the predicted text into the reference:
\[ \mathrm{WER} = \frac{S + D + I}{N_{w}} \]
where S represents the number of substitutions, D represents the number of deletions, I represents the number of insertions, \({N}_{w}\) is the total number of words in the reference translation.
Lower WER values indicate a higher-quality translation output.
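A compact reference implementation of this metric is sketched below: it computes the word-level Levenshtein distance by dynamic programming, which jointly counts substitutions, deletions, and insertions, and divides by the reference length \(N_w\).

```python
def wer(reference: str, hypothesis: str) -> float:
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the build passed all tests", "the build passed tests"))  # 0.2
```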
The WCR measures the proportion of correctly predicted words, considering their correct order and context. It provides a finer measure of translation accuracy compared to WRA:
\[ \mathrm{WCR} = \frac{N_{w} - S - D}{N_{w}} \]
A higher WCR indicates better syntactic and contextual alignment between the predicted and reference translations.
Result and discussion
Translation model performance test
To demonstrate the effectiveness of the proposed AGWO-SALSTM model, a comprehensive experimental setup was implemented, enabling rigorous evaluation across multiple translation datasets. The experimental setup, as shown in Table 1, provides details of the software and hardware configurations utilized in this study. These configurations were carefully chosen to optimize model performance and scalability for translation tasks. To evaluate the translation capabilities of the AGWO-SALSTM model, multiple publicly available bilingual datasets were selected for verification. The chosen datasets include PARACRAWL, WMT, UM-Corpus (University of Macau Corpus), and OPUS (ZhoEng Parallel Corpus). From each dataset, 1,000 samples were randomly selected to create the experimental dataset. The dataset was subsequently divided into training and validation sets in a 9:1 ratio, with 90% of the data allocated for training and 10% for validation. This distribution ensures a comprehensive evaluation of the model’s translation accuracy and effectiveness.
The number of iterations required for each model to reach a stable state is presented in Fig. 7, demonstrating the superior efficiency of AGWO-SALSTM in achieving faster convergence. Specifically, the results indicate that in the first dataset, as shown in Fig. 7a, the Transformer, AGWO-SALSTM, LSTM-Seq2Seq, and MT5 models achieved stability after 52, 15, 48, and 26 iterations, respectively. Similarly, in Fig. 7b, which presents the results from the second dataset, the Transformer, AGWO-SALSTM, LSTM-Seq2Seq, and MT5 models converged within 57, 19, 41, and 28 iterations, respectively. For the third dataset, illustrated in Fig. 7c, the Transformer, AGWO-SALSTM, LSTM-Seq2Seq, and MT5 models required 39, 16, 48, and 42 iterations, respectively. Lastly, in the fourth dataset, shown in Fig. 7d, the models reached stability after 36, 20, 43, and 30 iterations, respectively. The results consistently show that the proposed AGWO-SALSTM model requires the fewest number of iterations to achieve a stable state across all datasets, highlighting its improved training efficiency and robustness. The significant reduction in the number of iterations can be attributed to the integration of the AGWO algorithm, which effectively optimizes hyperparameters and enhances the training process by dynamically adjusting learning rates and network parameters. The ability of AGWO-SALSTM to converge rapidly suggests that it can efficiently capture and learn complex sequential dependencies present in translation tasks, making it a more stable and computationally efficient model compared to the baseline methods.
In Fig. 8a, the MAE values obtained by the Transformer model for the PARACRAWL, WMT, UM-CORPUS, and OPUS datasets were 0.56, 0.69, 0.45, and 0.58, respectively. The proposed AGWO-SALSTM model demonstrated significantly lower MAE values of 0.02, 0.04, 0.01, and 0.03, indicating superior translation accuracy. The LSTM-Seq2Seq model produced MAE values of 0.28, 0.45, 0.34, and 0.47, while the MT5 model achieved MAE values of 0.12, 0.17, 0.09, and 0.26, across the respective datasets. These results highlight the improved precision of the AGWO-SALSTM model in minimizing absolute translation errors. In Fig. 8b, the MSE values of the Transformer model for the PARACRAWL, WMT, UM-CORPUS, and OPUS datasets were recorded as 0.77, 0.61, 0.63, and 0.81, respectively. The AGWO-SALSTM model again outperformed the baseline models, with MSE values of 0.05, 0.01, 0.03, and 0.04, across the same datasets, demonstrating its robustness in reducing squared errors. In comparison, the LSTM-Seq2Seq model yielded MSE values of 0.47, 0.55, 0.38, and 0.48, while the MT5 model resulted in MSE values of 0.32, 0.11, 0.18, and 0.20.
The accuracy results, as illustrated in Fig. 9a–d, demonstrate the superior performance of AGWO-SALSTM in maintaining high translation accuracy across different sample sizes and datasets. In Fig. 9a, the translation accuracy for the UM-CORPUS dataset shows that the Transformer, AGWO-SALSTM, LSTM-Seq2Seq, and MT5 models achieved accuracy levels of approximately 0.85, 0.95, 0.88, and 0.90, respectively. Similarly, Fig. 9b presents the accuracy performance on the WMT dataset, where the respective models achieved accuracy values of 0.85, 0.95, 0.88, and 0.90. In Fig. 9c, the results for the OPUS dataset indicate that the Transformer, AGWO-SALSTM, LSTM-Seq2Seq, and MT5 models maintained translation accuracies of 0.85, 0.95, 0.88, and 0.90, respectively. Lastly, in Fig. 9d, the translation accuracy for the PARACRAWL dataset remained consistent, with the models achieving values of 0.85, 0.95, 0.88, and 0.90, respectively. The experimental results demonstrate that the AGWO-SALSTM model consistently outperforms the baseline models across all datasets, achieving the highest translation accuracy of approximately 0.95. The model exhibits robustness and reliability, with minimal impact from variations in sample size during the translation process. In contrast, the baseline models, particularly Transformer and LSTM-Seq2Seq, show slightly lower performance, indicating their susceptibility to data fluctuations.
Table 2 presents a comparative analysis of the average translation accuracy of four models (Transformer, LSTM-Seq2Seq, AGWO-SALSTM, and MT5) across different sample sizes. The evaluation was conducted using varying numbers of samples, ranging from 20 to 120, to assess the models’ performance in terms of accuracy consistency and scalability. The results indicate that the proposed AGWO-SALSTM model consistently outperforms the baseline models across all sample sizes. Specifically, for 20 samples, AGWO-SALSTM achieved an accuracy of 92.31%, which is higher than the Transformer, LSTM-Seq2Seq, and MT5 models, with accuracy rates of 80.44%, 84.55%, and 86.63%, respectively. Similarly, with an increase in the number of samples to 40, AGWO-SALSTM maintained superior performance with an accuracy of 94.42%, while Transformer, LSTM-Seq2Seq, and MT5 achieved 85.52%, 87.56%, and 87.14%, respectively. As the sample size increased, the trend of superior accuracy for AGWO-SALSTM continued. For 60 samples, the model achieved its highest recorded accuracy of 95.13%, outperforming LSTM-Seq2Seq (89.42%), Transformer (83.19%), and MT5 (90.94%). Similar trends were observed with sample sizes of 80, 100, and 120, where AGWO-SALSTM consistently achieved accuracy rates exceeding 95%, with values of 95.32%, 95.31%, and 95.56%, respectively. The results demonstrate that while all models improve in accuracy as the sample size increases, AGWO-SALSTM consistently maintains a higher accuracy margin compared to the other models. The Transformer model exhibited fluctuating accuracy, ranging between 80.44% and 86.53%, suggesting it is more sensitive to sample size variations. The LSTM-Seq2Seq and MT5 models performed better but still fell short of AGWO-SALSTM’s performance. The superior performance of AGWO-SALSTM over MT5 can be attributed to several factors. Unlike MT5, which is a general-purpose multilingual model, AGWO-SALSTM is specifically tuned for software engineering translation tasks, benefiting from domain-adapted training. Additionally, the use of Adaptive Grey Wolf Optimization allows for dynamic hyperparameter tuning, which enhances convergence speed and improves performance on domain-specific data. The integration of Bi-LSTM with attention mechanisms further strengthens the model’s ability to capture long-range dependencies and contextual relationships often present in technical language.
To assess the domain relevance of the selected datasets, a keyword-based analysis was conducted using a curated list of common software engineering terms (e.g., “debug,” “compiler,” “runtime,” “API”). The analysis measured both the percentage of tokens identified as software-related and the percentage of sentences containing at least one technical term. As shown in Table 3, software-related tokens account for 18.2% in PARACRAWL, 24.1% in WMT, 22.5% in UM-Corpus, and 17.3% in OPUS. Additionally, the percentage of sentences containing at least one software term ranges from 59.1% (OPUS) to 69.7% (WMT), confirming the datasets’ relevance for evaluating technical translation performance in the software engineering domain.
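A sketch of this analysis is shown below; the term list and sample sentences are illustrative stand-ins for the curated list used in the study.

```python
SOFTWARE_TERMS = {"debug", "compiler", "runtime", "api", "build", "deploy"}

def domain_stats(sentences):
    tokens = [tok.lower() for s in sentences for tok in s.split()]
    # Percentage of tokens that are software-related terms.
    token_pct = 100 * sum(t in SOFTWARE_TERMS for t in tokens) / len(tokens)
    # Percentage of sentences containing at least one software term.
    sent_pct = 100 * sum(any(t.lower() in SOFTWARE_TERMS for t in s.split())
                         for s in sentences) / len(sentences)
    return token_pct, sent_pct

print(domain_stats(["the compiler reported a runtime error",
                    "the meeting was rescheduled"]))  # (20.0, 50.0)
```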
Evaluation of translation model performance
To support reproducibility and clarify model configuration, the key hyperparameter settings and optimization ranges used during training are summarized below. The learning rate was explored within the range of 0.0001 to 0.01, and the dropout rate was varied from 0.1 to 0.5. The AGWO population size was fixed at 30, and the maximum number of iterations was set to 50. For the Bi-LSTM component, each layer included 256 hidden units, with an embedding dimension of 300. The attention mechanism was configured with 4 attention heads, and the batch size used during training was 64.
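For reference, these settings can be collected into a single configuration object; the dictionary below is an illustrative sketch of how the search space and fixed values might be encoded, not the authors' actual code.

```python
CONFIG = {
    "learning_rate": (1e-4, 1e-2),  # AGWO search range
    "dropout": (0.1, 0.5),          # AGWO search range
    "agwo_population": 30,          # fixed population size
    "agwo_max_iters": 50,           # maximum AGWO iterations
    "bilstm_hidden_units": 256,     # per Bi-LSTM layer
    "embedding_dim": 300,
    "attention_heads": 4,
    "batch_size": 64,
}
```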
To further evaluate the contribution of each core component in the proposed AGWO-SALSTM model, an ablation study was conducted. The experiment compared four configurations: (1) Bi-LSTM only, (2) Bi-LSTM with attention mechanism, (3) Bi-LSTM with Adaptive Grey Wolf Optimization (AGWO), and (4) the full model combining Bi-LSTM, attention mechanism, and AGWO.
Figure 10 compares the translation performance of four model configurations in an ablation study: Bi-LSTM only, Bi-LSTM with attention, Bi-LSTM with AGWO, and the full AGWO-SALSTM model integrating both attention and AGWO. The left vertical axis represents the translation accuracy (%) while the right vertical axis corresponds to the MAE. As shown, the baseline Bi-LSTM model achieved an accuracy of 88.25% with a corresponding MAE of 0.21. Incorporating the attention mechanism improved accuracy to 91.70% and reduced MAE to 0.12, while adding AGWO without attention further enhanced accuracy to 92.35% and lowered MAE to 0.09. The full model achieved the highest accuracy of 95.56% and the lowest MAE of 0.02, demonstrating the complementary contributions of both self-attention and dynamic hyperparameter tuning via AGWO.
Figure 11 illustrates the attention weight distribution between source and target tokens for a sample Chinese-to-English translation, offering insights into the internal alignment behavior of the AGWO-SALSTM model. The heatmap captures how the model allocates attention to specific source words while generating each target word, reflecting the effectiveness of the self-attention mechanism. Notably, the target token “optimized” assigned its highest attention weight of 0.40 to the source word “优化”, indicating strong semantic alignment. Similarly, “parameters” focused heavily on “参数” with a peak attention value of 0.50, demonstrating precise mapping of technical terminology. The functional word “to” exhibited a more distributed pattern, yet assigned its maximum attention (1.00) to “提高” (improve), illustrating syntactic restructuring during translation. The final word “accuracy” was most influenced by “准确率”, receiving a dominant attention weight of 0.60, reinforcing the model’s capacity to retain meaning across linguistic boundaries. These quantitative results validate that the self-attention mechanism dynamically captures meaningful associations, particularly in domain-specific contexts, by focusing on the most relevant parts of the source sequence for accurate and context-aware translation output.
To complement the quantitative evaluation metrics, a qualitative comparison was conducted to illustrate the model’s strengths in handling domain-specific translation tasks. Table 4 presents side-by-side translations from AGWO-SALSTM, MT5, and Transformer for selected Chinese source sentences drawn from the software engineering domain. The examples demonstrate AGWO-SALSTM’s improved ability to preserve technical terminology, maintain syntactic accuracy, and convey contextual meaning, particularly in complex or specialized sentences.
To approximate the effect of a separate evaluation set, a new data split was simulated using 80% of the data for training, 10% for validation, and 10% for testing; the results are shown in Table 5. The AGWO-SALSTM model was re-evaluated on this held-out test set without further tuning. Results remained consistent with prior validation-based metrics, confirming the model’s robustness. The model achieved an average translation accuracy of 95.1%, with MAE and MSE scores of 0.03 and 0.05, respectively, and a BLEU score of 39.2, indicating strong generalization performance even on unseen data.
Figure 12 illustrates the training and validation loss curves of the AGWO-SALSTM model across 30 training epochs, providing insight into the model’s convergence behavior and generalization capability. The training loss begins at approximately 0.80 in the initial epoch and decreases steadily to around 0.06 by epoch 30, demonstrating smooth and consistent optimization. The validation loss starts slightly higher at around 0.90, showing a similar decreasing trend and stabilizing near 0.10 in the final epochs. Notably, the gap between training and validation loss remains narrow throughout the training process, with no indication of overfitting, suggesting that the model maintains good generalization to unseen data.
Figure 13 presents a comparative analysis of BLEU scores for four machine translation models—Transformer, LSTM-Seq2Seq, MT5, and the proposed AGWO-SALSTM—across four benchmark datasets: PARACRAWL, WMT, UM-Corpus, and OPUS. The BLEU (Bilingual Evaluation Understudy) score is a widely accepted metric for evaluating translation quality, where higher values indicate greater alignment with human reference translations. As illustrated, the AGWO-SALSTM model consistently outperforms the baseline models across all datasets. Specifically, it achieved the highest BLEU scores of 38.9, 40.6, 41.2, and 39.7 on PARACRAWL, WMT, UM-Corpus, and OPUS, respectively. In contrast, the MT5 model, which ranks second, recorded BLEU scores of 34.7 to 37.1 across the same datasets. The Transformer and LSTM-Seq2Seq models showed comparatively lower performance, with scores ranging from 26.4 to 32.8.
Figure 14 illustrates the scalability of four translation models, namely Transformer, LSTM-Seq2Seq, MT5, and the proposed AGWO-SALSTM, when handling input sentences of varying lengths, ranging from 10 to 120 words. Scalability is a critical factor in evaluating the robustness of neural translation systems, especially for technical domains like software engineering where sentence structures can be complex and lengthy. As shown in the figure, all models exhibit a gradual decline in translation accuracy as sentence length increases; however, the degree of degradation varies significantly. The Transformer model’s accuracy drops from 92.0% at 10 words to 74.3% at 120 words, indicating its limited ability to preserve performance over long sequences. LSTM-Seq2Seq performs moderately better, decreasing from 93.1% to 76.8%, while MT5 shows more resilience with scores ranging from 94.6% to 82.2%. In contrast, the AGWO-SALSTM model demonstrates the highest robustness, maintaining accuracy above 90% across all lengths, from 96.1% at 10 words to 90.3% at 120 words. This robustness in accuracy, however, comes with a moderate increase in inference time as sentence length grows. While AGWO-SALSTM maintains over 90% accuracy even at 120 words, inference time scales linearly with input size due to increased sequence processing. Despite this, AGWO-SALSTM remains faster than other models at all lengths, offering a favorable balance between accuracy and efficiency.
Figure 15 compares the inference time of four translation models—Transformer, LSTM-Seq2Seq, MT5, and the proposed AGWO-SALSTM—across samples of increasing complexity, represented by sentence lengths ranging from 10 to 120 words. Inference time is a critical factor in real-time and high-throughput translation applications, particularly in technical domains where responsiveness is essential. As shown in the figure, all models exhibit a positive correlation between sentence length and inference time, though the extent varies significantly. MT5 recorded the slowest processing times, increasing from 0.6 s for 10-word sentences to 5.1 s for 120-word inputs. LSTM-Seq2Seq and Transformer demonstrated moderate inference growth, rising from 0.4 s to 4.5 s and 0.3 s to 4.2 s, respectively. In contrast, the AGWO-SALSTM model maintained the fastest inference performance, starting at 0.2 s for short sentences and peaking at only 2.4 s for the longest samples.
Figure 16 presents the error distribution across four translation models based on absolute translation errors computed over 50 randomly selected test samples per model. This analysis provides insight into the consistency and robustness of each model’s performance. As shown, the Transformer model exhibits the widest error distribution, with a higher median error around 5.0 and a broader interquartile range, indicating greater variability and frequent large deviations from reference translations. The LSTM-Seq2Seq model shows improved consistency with a lower median error of approximately 4.0, while MT5 further reduces both the spread and central tendency, achieving a median error close to 3.0. Notably, AGWO-SALSTM demonstrates the most compact and stable error profile, with a median error of approximately 1.8 and minimal outliers.
To empirically validate the choice of the adaptive parameter α(t) in the AGWO algorithm, a sensitivity analysis was conducted by varying the decay coefficient γ within the adaptive function defined in Eq. 19. Four values of γ were tested: 0.005, 0.010, 0.015, and 0.020. Each configuration was evaluated using standard translation performance metrics, including BLEU score, MAE, and accuracy. As shown in Table 6, the configuration with γ = 0.015 consistently achieved the best overall performance, confirming its suitability for balancing convergence rate and model stability in the AGWO-SALSTM framework.
Conclusion
The proposed AGWO-SALSTM situational translation model, integrating adaptive Grey Wolf Optimization with a self-attention mechanism and Bi-LSTM, has demonstrated superior performance in Chinese-to-English and English-to-Chinese translation tasks. It effectively captures sequence dependencies and contextual relationships, enhancing translation accuracy and efficiency. Across the four datasets (UM-CORPUS, WMT, OPUS, and PARACRAWL), AGWO-SALSTM achieved faster convergence with iteration counts of 15, 19, 16, and 20, respectively, outperforming the baseline models: Transformer, LSTM-Seq2Seq, and MT5. The model achieved lower error values, with MAE scores of 0.01, 0.04, 0.03, and 0.02 and MSE scores of 0.03, 0.01, 0.04, and 0.05 across the respective datasets. These results validate its robustness and reliability. In real-world applications, AGWO-SALSTM attained a translation accuracy of 0.97, with response times of 1 s for English-to-Chinese and 3 s for Chinese-to-English, surpassing the performance of competing models. The important observations from this study can be summarized as follows:
- AGWO-SALSTM reaches a stable state in fewer iterations compared to traditional models, indicating efficient learning capabilities.
- With a translation accuracy of 0.97, the model provides high-quality translations suitable for real-world applications.
- The model achieved lower MAE and MSE values across all datasets, showing improved error handling compared to baseline models.
- The model’s translations were well received by language experts, demonstrating its practical effectiveness and usability.
Future research will focus on four key directions. First, the model’s applicability can be extended to additional language pairs beyond Chinese-English to enable broader evaluation across diverse linguistic contexts. Second, reinforcement learning techniques may be integrated to dynamically fine-tune the translation process, enhancing adaptability to user preferences and contextual variations. Third, a qualitative error analysis of translation outputs will be conducted to identify common issues such as mistranslations, syntactic errors, and domain-specific terminology challenges. Fourth, future studies will include computational efficiency metrics such as FLOPs and memory usage to provide a more comprehensive assessment of model complexity and resource requirements. These efforts will offer deeper insights into model limitations and guide future refinements.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Chen, R. & Li, C. Design and application of language translation system resource platform on basis of artificial intelligence. Procedia Comput. Sci. 243, 655–662. https://doi.org/10.1016/j.procs.2024.09.079 (2024).
Chuqiao, L., Ghassemiazghandi, M., & Jamal, M. Post-editing challenges in Chinese-to-English neural machine translation of movie subtitles (2024).
Schmidt, F., & Di Gangi, M. Bridging the gap between position-based and content-based self-attention for neural machine translation. Proc. Eighth Conf. Mach. Transl. 507–521 (2023).
Dar, M. A. & Pushparaj, J. Bi-directional LSTM-based isolated spoken word recognition for Kashmiri language utilizing Mel-spectrogram feature. Appl. Acoust. 231, 110505 (2025).
Huang, Y. et al. Sentiment classification using bidirectional LSTM-SNP model and attention mechanism. Expert Syst. Appl. 221, 119730 (2023).
Jia, Y., Yang, X. & Cui, Q. Research on the role of artificial intelligence in the core of intelligent translation systems. Procedia Comput. Sci. 243, 585–592 (2024).
Kim, T., Kim, G. & Hong, J. Translating perceived affective quality attributes of soundscape from English into Korean. Appl. Acoust. 222, 110051 (2024).
Guo, J., Su, R. & Ye, J. Multi-grained visual pivot-guided multi-modal neural machine translation with text-aware cross-modal contrastive disentangling. Neural Netw. 178, 106403 (2024).
Das, K. et al. Enhancing communication accessibility: UrSL-CNN approach to Urdu sign language translation for hearing-impaired individuals. Comput. Model. Eng. Sci. 141(1), 689–711 (2024).
Naveen, P. & Trojovský, P. Overview and challenges of machine translation for contextually appropriate translations. iScience 27(10), 110878 (2024).
Tian, M., Giunchiglia, F., Song, R. & Xu, H. Guiding ontology translation with hubness-aware translation memory. Expert Syst. Appl. 264, 125650 (2025).
Shi, X., Yang, X., Cheng, P., Zhou, Y. & Liu, J. Enhancing multimodal translation: Achieving consistency among visual information, source language and target language. Neurocomputing 620, 129269 (2025).
Im, S. & Chan, K. Neural machine translation with CARU-embedding layer and CARU-gated attention layer. Mathematics 12(7), 997 (2024).
Linlin, L. Artificial intelligence translator DeepL translation quality control. Procedia Comput. Sci. 247, 710–717 (2024).
Gao, F., Yang, F. & Zhang, K. Correlation between directionality and disfluency in English-Chinese bilateral sight translation. Int. J. Lang. Lit. Linguist. 10(1), 118–127 (2024).
Xu, X. & Zheng, Z. Auxiliary role of artificial intelligence in medical translation and its improvement strategies. Learn. Anal. Intel. Syst. https://doi.org/10.1007/978-3-031-69457-8_67 (2024).
Yang, S. Intelligent English translation model based on improved GLR algorithm. Procedia Comput. Sci. 228, 533–542 (2023).
Saz, O., Lin, Y. & Eskenazi, M. Measuring the impact of translation on the accuracy and fluency of vocabulary acquisition of English. Comput. Speech Lang. 31(1), 49–64. https://doi.org/10.1016/j.csl.2014.11.005 (2015).
Gopali, S., Siami-Namini, S., Abri, F. & Namin, A. S. The performance of the LSTM-based code generated by large language models (LLMs) in forecasting time series data. Nat. Lang. Process. J. 9, 100120 (2024).
Horvat, M., Jambrošić, K., Zaninović, T. & Oberman, T. Evaluation of soundscape attribute translations from English to Croatian. Appl. Acoust. 221, 110043. https://doi.org/10.1016/j.apacoust.2024.110043 (2024).
Israr, H. et al. Neural machine translation models with attention-based dropout layer. Comput. Mater. Continua 75(2), 2981–3009 (2023).
Paneru, B., Paneru, B. & Poudyal, K. N. Advancing human-computer interaction: AI-driven translation of American sign language to Nepali using convolutional neural networks and text-to-speech conversion application. Syst. Soft Comput. 6, 200165 (2024).
Daneshfar, F. & Aghajani, M. J. Enhanced text classification through an improved discrete laying chicken algorithm. Expert Syst. 41(8), e13553 (2024).
Al-khresheh, M. A back translation analysis of AI-generated Arabic-English texts using ChatGPT: Exploring accuracy and meaning retention. Dragoman J. Transl. Stud. 17, 97–117 (2025).
Li, M. & Zhang, K. A multi-agent system based on HNC for domain-specific machine translation. Sci. Rep. 15(1), 20820 (2025).
Liu, X., Zeng, J., Wang, X., Wang, Z. & Su, J. Exploring iterative dual domain adaptation for neural machine translation. Knowl. Based Syst. 283, 111182. https://doi.org/10.1016/j.knosys.2023.111182 (2024).
Guo, J., Hou, Z., Xian, Y. & Yu, Z. Progressive modality-complement aggregative multitransformer for domain multi-modal neural machine translation. Pattern Recogn. 149, 110294 (2024).
Yamini, P., Daneshfar, F. & Ghorbani, A. KurdSM: Transformer-based model for Kurdish abstractive text summarization with an annotated corpus. Iran. J. Electr. Electron. Eng. 120(4), 8–22 (2024).
Jia, Y. Attention mechanism in machine translation. J. Phys. Conf. Ser. 1314(1), 012186. https://doi.org/10.1088/1742-6596/1314/1/012186 (2019).
Togatorop, A. R. & Irawan, M. I. Nickel price prediction using bi-directional LSTM (Bi-LSTM) and attention bi-directional LSTM network (At-Bi-LSTM). IEEE Int. Symp. Consum. Technol. (ISCT) 2024, 450–456. https://doi.org/10.1109/isct62336.2024.10791230 (2024).
Vinothini, J. Grey wolf optimization algorithm for colour image enhancement considering brightness preservation constraint. Int. J. Emerg. Trends Sci. Technol. 3, 4049–4055 (2016).
Author information
Authors and Affiliations
Contributions
F.Y. wrote the main manuscript text; Y.J. and L.Y. prepared the software; A.N.A. performed the analysis. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix 1
(a) The mathematical representation of this sliced Bi-LSTM approach is:
\[ h_{ij} = \mathrm{BiLSTM}\left( X_{ij} \right), \quad i = 1, \ldots, N, \; j = 1, \ldots, M \]
where \({X}_{ij}\) represents the input features for the j-th slice of the i-th sequence, \({h}_{ij}\) is the extracted feature representation from the bidirectional LSTM, and M and N denote the total number of slices and sequence samples, respectively.
The attention mechanism enhances text classification by assigning importance to individual words. The weight for each word in a sequence is computed as follows:
\[ u_{i} = \tanh\left( W_{D} S_{i} + b_{i} \right), \qquad \alpha_{i} = \frac{\exp\left( u_{i}^{\top} u_{w} \right)}{\sum_{j} \exp\left( u_{j}^{\top} u_{w} \right)} \]
where \(S_{i}\) is the hidden state vector for word i, \({W}_{D}\) and \({b}_{i}\) are the weight and bias parameters, \({u}_{w}\) is the learned context vector used to capture the global context, \({\alpha }_{i}\) represents the attention weight assigned to the word at position i.
The final representation is obtained using the weighted sum of word representations:
\[ v = \sum_{i} \alpha_{i} S_{i} \]
This context-aware representation is then passed through the classifier to predict the sentiment polarity of the text.
(b) In GWO, the population of wolves is divided into four hierarchical levels: α, β, δ, and ω31. The alpha wolves are considered the best solution, followed by beta and delta wolves, which assist in guiding the search process. The omega wolves follow the leader wolves and explore new regions.
The position of each grey wolf in the search space is updated using the following equations:
\[ \vec{X}_{1} = \vec{X}_{\alpha} - \vec{A}_{1} \cdot \left| \vec{C}_{1} \cdot \vec{X}_{\alpha} - \vec{X} \right|, \quad \vec{X}_{2} = \vec{X}_{\beta} - \vec{A}_{2} \cdot \left| \vec{C}_{2} \cdot \vec{X}_{\beta} - \vec{X} \right|, \quad \vec{X}_{3} = \vec{X}_{\delta} - \vec{A}_{3} \cdot \left| \vec{C}_{3} \cdot \vec{X}_{\delta} - \vec{X} \right| \]
\[ \vec{X}(t+1) = \frac{\vec{X}_{1} + \vec{X}_{2} + \vec{X}_{3}}{3} \]
where \(\overrightarrow{X}\) represents the position of the grey wolf, \(\overrightarrow{{X}_{\alpha }},\overrightarrow{{X}_{\beta }},\overrightarrow{{X}_{\delta }}\) are the positions of the top three wolves, and \(\overrightarrow{{A}_{k}}, \overrightarrow{{C}_{k}}\) are coefficient vectors calculated as:
\[ \vec{A} = 2a \cdot \vec{r} - a, \qquad \vec{C} = 2 \cdot \vec{r} \]
where \(\overrightarrow{r}\) is a random vector in the range [0,1], a is a parameter that decreases linearly from 2 to 0 during iterations to balance exploration and exploitation.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Yuan, F., Liu, Y., Ju, Y. et al. Optimizing software engineering English translation using an enhanced Grey Wolf Optimization with self-attention and Bi-LSTM model. Sci Rep 15, 35489 (2025). https://doi.org/10.1038/s41598-025-19470-0