Introduction

With the rapid growth of internet media, the volume of user-generated content has increased significantly, leading to active research in areas such as sentiment analysis and opinion mining. A key challenge in this domain is the automatic identification of sentiment-bearing expressions or keywords. Most existing approaches primarily focus on explicit sentiment words that appear directly in the text, while overlooking implicit or potential sentiment indicators that may not be explicitly mentioned. These latent sentiment cues, which often constitute a substantial portion of the underlying emotional content, capture important aspects of the writer’s attitude and perspective. By effectively analyzing both explicit and implicit sentiment cues, researchers can address diverse real-world challenges, including predicting stock market trends and election outcomes, assessing public reactions to events or news, and understanding attitudes toward specific population groups1,2,3. Thus, modern techniques such as machine learning and sentiment analysis are needed to assist in accurately verifying specific information within these enormous volumes of data.

In this context, sentiment analysis can be formulated as a text classification problem, evolving from polarity detection (positive, negative, neutral) to fine-grained emotion recognition (e.g., anger, fear, joy, sadness). A key challenge in this domain is severe class imbalance in real-world data, where emotions such as surprise and disgust occur far less frequently than more readily expressed sentiments like anger and joy. This imbalance increases class-level uncertainty and degrades the performance of conventional deep learning models4,5, which typically rely on softmax-based decision mechanisms.

To address this, we propose an optimization-aided deep learning framework that refines sentiment vector representations to reduce uncertainty in underrepresented classes. The novelty of our proposed framework is the joint utilization of ALBERT, a parameter-efficient transformer encoder, with a stacked Bi-GRU → Bi-LSTM architecture, along with parameter tuning driven by Elk Herd optimization (EHO). Each of these components has been examined individually in the literature; however, their simultaneous use has not been thoroughly investigated for sentiment analysis. ALBERT provides rich contextual embeddings with a lightweight architecture, Bi-GRU offers efficient gated modeling of sequential information, and Bi-LSTM captures deeper bidirectional dependencies; their individual strengths complement one another and open design possibilities not attainable with any single architecture alone. EHO employs a nature-inspired metaheuristic optimization strategy, based on herd behavior, to adaptively adjust model parameters during training, achieving more stable convergence than conventional optimizers, including Particle Swarm Optimization (PSO) and genetic algorithms (GA).

Although these components are individually known in the literature, their integrated use within a unified sentiment-analysis pipeline remains largely unexplored. This tri-level design addresses gaps in the literature, where existing models either rely on heavy transformer architectures or lack dynamic optimization for domain-sensitive sentiment tasks. More importantly, this configuration is intentionally crafted to address practical challenges such as class imbalance, domain variability, and the need for improved computational efficiency. Therefore, our hybrid AI framework offers more consistent classification across diverse datasets and represents a meaningful improvement over previous models.

Unlike accuracy metrics that are biased toward majority classes, intra-class performance metrics are employed to ensure fair evaluation across all sentiment categories. Experimental results demonstrate improved robustness and accuracy in sentiment classification under high uncertainty. The key contributions of the proposed model are as follows:

  1. ALBERT was selected for its parameter efficiency and reduced memory footprint compared to BERT and RoBERTa, making it well-suited for large-scale sentiment datasets without compromising contextual embedding quality.

  2. The hybrid use of GRU and LSTM leverages the strengths of both recurrent units: GRU efficiently captures short-term dependencies with fewer parameters, while LSTM is more effective at modeling long-term dependencies in sequential data. Their combination provides a balanced representation of temporal features in text.

  3. Elk Herd optimization was employed to improve parameter tuning and avoid local minima during training, thereby enhancing overall classification robustness, particularly in imbalanced and noisy sentiment datasets.

  4. To evaluate how well the model performs, we use popular sentiment analysis datasets such as SST-5, Laptop14, Twitter, Restaurant14, Restaurant15, and Restaurant16, focusing on F-score and Accuracy.

  5. A comprehensive ablation study is carried out to investigate the effects of learning rate on accuracy, as well as to assess execution time and memory utilization.

Related work

Sentiment analysis was initially adopted in the commercial domain, where it is widely applied to personal blogs, Twitter, and Facebook posts for purposes such as brand management and customer relationship management. In addition, automated sentiment analysis systems have been developed to evaluate customer inquiries and complaints. In education, sentiment analysis has been integrated into automated tutoring and student evaluation systems. Some studies aim to assess the accuracy of students’ responses, while others focus on detecting learners’ emotional states6. Evidence further suggests that students achieve better learning outcomes in calm and positive environments.

Applications have also been developed to track emotional trends on social media, as well as to investigate the psychological impact of disasters such as earthquakes7. Additional research has focused on monitoring public health8, detecting cyberbullying9, and identifying personality traits such as extroversion and narcissism from textual expressions10. Further studies have analyzed linguistic patterns to explore gender differences in language use, demonstrating that an author’s gender can often be inferred from emotional cues in their writing. Other research has examined societal perspectives on politics, elections, and national values by analyzing social media content, as well as identifying partisan bias in news and online discussions11.

This section explores key advancements in textual sentiment analysis methods. Traditional machine learning (ML) methods typically operate on hand-crafted features rather than raw text. These features are derived from textual representations such as Bag of Words (BoW), word n-grams, punctuation patterns, emojis, and emotion lexicons. To improve input quality and reduce dimensionality, feature engineering operations—such as feature selection, weighting, and extraction—are applied. However, these operations may inadvertently discard semantically important information, thereby reducing classification accuracy. Moreover, BoW-based approaches fail to preserve word order and syntactic integrity within sentences, limiting their ability to capture contextual dependencies. To address these limitations, deep learning (DL) methods have been introduced, enabling direct processing of raw text while retaining the semantic context of co-occurring words. DL models, composed of multiple hidden layers within artificial neural networks, learn hierarchical feature representations by iteratively optimizing inter-layer weights through backpropagation. This process highlights salient features relevant to sentiment prediction, leading to superior classification performance compared to traditional ML approaches. However, DL-based sentiment models require large-scale labeled datasets to achieve high generalization capability, a significant challenge for low-resource languages, where annotated sentiment corpora remain scarce.

Recent advancements in word embeddings and pre-trained language models (PLMs)12 have mitigated the large-data requirement of DL-based NLP systems. By leveraging prior knowledge encoded in embeddings, models can be fine-tuned rather than trained from scratch, thus accelerating convergence and improving generalization. In addition to word-level embedding techniques13, sentence-level representation methods14 have been proposed to capture broader semantic relationships. Progressively, general-purpose PLMs have emerged, capable of learning not only semantic and syntactic patterns but also pragmatic usage tendencies. These models reduce data requirements for sentiment analysis through effective transfer learning. PLMs have achieved state-of-the-art performance in sentiment analysis tasks, even in low-resource scenarios. While extensively applied in English, their adoption for other languages remains limited.

Recently, several studies have utilized attention mechanisms to enhance aspect-based sentiment analysis. Huang et al.15 proposed a Layered Attention Model for joint aspect–context representation, while the Attentional Encoder Network linked contexts through attention-based encoders. Qing et al.16 captured both local and global contexts with multi-level awareness, and Wu et al.17 introduced a multi-attention mechanism to model long-distance dependencies and filter irrelevant information. Li et al.18 further enhanced sentiment representation by incorporating vocabulary relations among aspects to capture syntactic patterns and multi-dimensional sentiments.

Lu and Huang19 constructed concurrence word graphs for sentiment classification, while Huang et al.20 applied graph convolutional networks (GCNs) on sentence tree structures to capture word dependencies. Zhao et al.21 and Xiao et al.22 proposed GCN-based methods leveraging dependency graphs where words are nodes and grammatical relations are edges.

Zheng et al.23 demonstrated that GCNs can identify aspect-relevant words by aggregating neighbor information across layers. Cao et al.24 developed a multimodal sentiment analysis framework for microblogs by integrating textual and visual features. You et al.25 introduced a cross-modality consistent regression (CCR) model for multimodal sentiment prediction. Tiwari et al.26 predicted bitcoin prices using an improved stacked LSTM model calibrated with particle swarm optimization (PSO), successfully capturing latent contextual and co-occurrence information from multilingual social media data. Dobrojevic et al.27 integrated NLP and ML for cyberbullying detection by encoding text with TF-IDF and BERT, classifying with XGBoost, and fine-tuning the model using an enhanced wolf optimization algorithm alongside other meta-heuristics. Aziz et al.28,29,30,31,32,33 proposed a range of deep attention-based models for aspect-based sentiment analysis, incorporating multimodal learning for Urdu text.

Recent advances in natural language processing, particularly the adoption of neural network architectures like LSTMs and Bidirectional LSTMs, have considerably enhanced text analysis. SOTA transformer models like BERT are computationally expensive due to their large number of parameters, making them resource-intensive for real-world applications. In this study, we propose a unified framework that integrates ALBERT embeddings, Bi-GRU, Bi-LSTM, and a multi-head attention mechanism, followed by Elk Herd optimization (EHO), to address the aforementioned challenges. Specifically, ALBERT embeddings are employed to enhance contextual representation, while the combined Bi-GRU and Bi-LSTM architecture, augmented with multi-head attention, strengthens the model’s capacity to capture salient sentiment-related features. EHO, in turn, improves parameter tuning and helps avoid local minima during training, thereby enhancing classification robustness, particularly on imbalanced and noisy sentiment datasets.

Problem definition and theoretical background

Liu’s34 formal definition of an opinion is structured as a quadruple (g, s, h, t), where each component plays a critical role. Here, g refers to the subject of the sentiment, or the entity the opinion is directed toward. s represents the sentiment or attitude expressed towards g. h indicates the individual sharing the opinion, while t marks the specific time at which the opinion is expressed. These four elements are essential for accurately capturing the essence of an opinion. If any of these elements are missing in the document containing the opinion, the utility of the extracted data may be significantly diminished.
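To make the quadruple concrete, the following minimal Python sketch represents Liu's (g, s, h, t) structure as a small data class; the field names and the example values are illustrative choices, not part of the original formalization.

```python
# A minimal sketch of Liu's opinion quadruple (g, s, h, t) as a data structure;
# field names are spelled out for readability and are illustrative.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Opinion:
    target: str        # g: the entity the opinion is directed toward
    sentiment: str     # s: the sentiment or attitude expressed towards g
    holder: str        # h: the individual sharing the opinion
    time: datetime     # t: when the opinion is expressed

example = Opinion(target="battery life", sentiment="positive",
                  holder="user_42", time=datetime(2024, 5, 1))
print(example)
```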

Recurrent neural network

Recurrent neural networks (RNNs)35 are a class of artificial neural networks specifically designed to model sequential and temporally dependent data through an internal memory mechanism. Unlike feed-forward architectures such as multi-layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs), which process each input independently without considering temporal dependencies, RNNs are capable of capturing contextual relationships between successive input elements. This capability is critical for tasks where input samples are inherently sequential, such as time-series analysis and natural language processing (NLP). This makes RNNs particularly effective in applications such as speech recognition, language modeling, and machine translation.

In an RNN, a sequence of word vectors \(X=(x_{m_{1}},\dots,x_{m_{T}})\) is processed element by element. Specifically, an RNN generates an output vector yt and an internal state vector ht, which functions as the RNN’s memory. As input, the RNN takes a sequence element xmt along with the previous internal state ht−1. For an RNN with a single hidden layer, the hidden-layer output defines the internal state vector:

$$h_{t}=\sigma\left(W_{h}x_{m_{t}}+U_{h}h_{t-1}+b_{h}\right)$$
(1)
$$y_{t}=\sigma\left(W_{y}h_{t}+b_{y}\right)$$
(2)

However, due to repeated gradient multiplications across time steps, RNNs are susceptible to the vanishing gradient and exploding gradient problems35, which can severely hinder the learning of long-term dependencies. To mitigate these limitations, advanced RNN variants such as long short-term memory (LSTM) networks and gated recurrent units (GRUs) have been developed, offering improved stability and the ability to model long-range dependencies. These architectures are discussed in the following sections.
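As a concrete illustration of the recurrence in Eqs. (1)–(2), the following NumPy sketch unrolls a vanilla RNN over a toy sequence; the layer sizes and random parameters are illustrative assumptions, not values from this work.

```python
# A minimal NumPy sketch of the vanilla RNN recurrence in Eqs. (1)-(2).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 8, 16, 4   # illustrative sizes

# Parameters W_h, U_h, b_h, W_y, b_y as in Eqs. (1)-(2)
W_h = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
U_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_y = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_y = np.zeros(output_dim)

def rnn_forward(X):
    """X: sequence of word vectors, shape (T, input_dim)."""
    h = np.zeros(hidden_dim)                     # initial internal state h_0
    outputs = []
    for x_t in X:
        h = sigmoid(W_h @ x_t + U_h @ h + b_h)   # Eq. (1): internal state update
        y_t = sigmoid(W_y @ h + b_y)             # Eq. (2): output at step t
        outputs.append(y_t)
    return np.stack(outputs), h

X = rng.normal(size=(5, input_dim))              # a toy 5-step sequence
Y, h_T = rnn_forward(X)
print(Y.shape, h_T.shape)                        # (5, 4) (16,)
```

Repeatedly multiplying gradients through the same recurrence is exactly what makes this formulation prone to vanishing or exploding gradients over long sequences.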

LSTM

Long short-term memory (LSTM) networks were introduced to address the limitations of conventional RNNs. LSTMs achieve this by incorporating specialized gating mechanisms—namely the input gate, forget gate, and output gate—along with an internal cell state that enables controlled information flow over extended sequences. The internal architecture of an LSTM cell is depicted in (Fig. 1).

Formally, at each time step t, the input vector xt is processed alongside the previous hidden state ht-1 through gate-specific transformations, governed by weight matrices W, bias vectors b, and the logistic sigmoid activation function σ. The cell state ct acts as a memory conveyor, selectively updated and read by the gates to retain or discard information as required. This gated memory mechanism allows LSTMs to maintain context across longer temporal spans, making them well-suited for sequence modeling tasks.

Fig. 1
figure 1

Architecture of LSTM network.

To further enhance contextual awareness, Bidirectional LSTMs (Bi-LSTMs) extend the unidirectional LSTM architecture by processing the input sequence in both forward and backward directions. In text classification, this enables each token in a sentence to be informed by both its preceding and succeeding context, analogous to reading the sentence from start to end and end to start simultaneously. The concatenation of forward and backward hidden states yields richer contextual representations, thereby improving learning efficacy and classification accuracy.
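The following PyTorch sketch shows the bidirectional encoding described above; the embedding and hidden sizes are illustrative and not the configuration used in the experiments.

```python
# A minimal PyTorch sketch of a bidirectional LSTM encoder.
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=128, hidden_size=64,
                 batch_first=True, bidirectional=True)

x = torch.randn(2, 20, 128)      # (batch, tokens, embedding_dim), toy input
out, (h_n, c_n) = bilstm(x)

# Forward and backward hidden states are concatenated per token,
# so the per-token representation has size 2 * hidden_size.
print(out.shape)                 # torch.Size([2, 20, 128])
```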

Gated recurrent units

GRUs are a variant of LSTMs in which the input and forget gates are combined into a single update gate. This gate decides how much information to keep and how much to refresh, making GRUs computationally efficient. Instead of a forget gate, GRUs use a reset gate with a different structure and position. Figure 2 illustrates the GRU cell. The GRU employs two gates: the update gate, which decides how much of the cell’s current content should be updated with the new candidate state, and the reset gate, which, when its value is close to 0, effectively clears the cell’s memory, allowing the unit to treat the next input as the first in the sequence.

Fig. 2
figure 2

Architecture of GRU network.

$$i_{t}=\sigma\left(W_{ix}x_{t}+W_{ih}h_{t-1}+b_{i}\right)$$
(3)
$$f_{t}=\sigma\left(W_{fx}x_{t}+W_{fh}h_{t-1}+b_{f}\right)$$
(4)

When the update gate is set to 0, the entire previous cell state is passed to the current state, meaning the current input has no effect. On the other hand, if the update gate is set to 1, all of the current input is included in the cell’s current state, with no information carried over from the previous cell state. Essentially, the input gate it acts as the inverse of the forget gate ft, such that it =1 - ft.

$$r_{t}=\sigma\left(W_{rx}x_{t}+W_{rh}h_{t-1}+b_{r}\right)\quad\text{(reset gate)}$$
(5)
$$\bar{h}_{t}=\tanh\left(W_{hx}x_{t}+W_{hh}\left(r_{t}h_{t-1}\right)+b_{n}\right)\quad\text{(candidate hidden state)}$$
(6)
$$z_{t}=\sigma\left(W_{zx}x_{t}+W_{zh}h_{t-1}+b_{z}\right)\quad\text{(update gate)}$$
(7)
$$h_{t}=z_{t}\bar{h}_{t}+\left(1-z_{t}\right)h_{t-1}\quad\text{(final hidden state)}$$
(8)

Traditional unidirectional GRU neural networks can only predict target data from one direction, lacking the ability to incorporate future information to constrain current predictions. A Bidirectional GRU (BiGRU) follows the same principle as the BiLSTM described above, utilizing two GRU layers: one for the forward sequence and another for the reversed sequence. By concatenating the hidden states from both directions, BiGRU provides a comprehensive understanding of the entire sequence, making it highly effective for tasks like sentiment analysis.

$$\overrightarrow{h}_{t}=GRU\left(x_{t},\overrightarrow{h}_{t-1}\right)$$
(9)
$$\overleftarrow{h}_{t}=GRU\left(x_{t},\overleftarrow{h}_{t-1}\right)$$
(10)
$$h_{t}=w_{t}\overrightarrow{h}_{t}+v_{t}\overleftarrow{h}_{t}+b_{t}$$
(11)

Given that some dependencies relate to both past and future states, the model employs both forward and backward directions in the Bi-GRU layer.
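The sketch below illustrates one possible reading of Eqs. (9)–(11) in PyTorch, where the forward and backward GRU states are combined through learnable weights; the use of linear layers for the mixing weights is an assumption for illustration, not the paper's exact implementation.

```python
# A minimal PyTorch sketch of a bidirectional GRU whose directional states are
# combined per Eqs. (9)-(11); sizes and the mixing layers are illustrative.
import torch
import torch.nn as nn

class BiGRUMix(nn.Module):
    def __init__(self, emb_dim=128, hidden=64):
        super().__init__()
        self.bigru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.w = nn.Linear(hidden, hidden)   # weight applied to the forward state
        self.v = nn.Linear(hidden, hidden)   # weight applied to the backward state

    def forward(self, x):
        out, _ = self.bigru(x)               # (batch, T, 2*hidden)
        fwd, bwd = out.chunk(2, dim=-1)      # Eqs. (9) and (10)
        return self.w(fwd) + self.v(bwd)     # Eq. (11): combined hidden state

h = BiGRUMix()(torch.randn(2, 20, 128))
print(h.shape)                               # torch.Size([2, 20, 64])
```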

Word embedding

Word embeddings are vector representations of words in a continuous, fixed-length, dense, multi-dimensional space, designed to capture semantic and syntactic relationships between terms. These embeddings are typically learned using unsupervised machine learning techniques applied to large-scale text corpora. They can be categorized based on factors such as the underlying training corpus, vector dimensionality, and the algorithmic approach used for their generation. Among the most prominent embedding methods are Word2Vec36, which leverages neural network architectures, and GloVe37, which is derived from statistical co-occurrence analysis. The subsequent sections provide an overview of these embedding techniques and their application in sentiment classification tasks.

Word2Vec is a family of models that learn distributed word representations, mapping each word to a high-dimensional vector in which semantically or contextually similar words occupy nearby positions in the embedding space. Trained on large corpora, these models capture contextual meaning by leveraging two complementary neural network architectures: continuous bag of words (CBOW) and continuous skip-gram. CBOW predicts a target word based on its surrounding context, whereas Skip-Gram predicts the surrounding context given a target word. The resulting embeddings typically reside in a vector space with hundreds of dimensions, enabling fine-grained representation of word semantics for downstream tasks, including sentiment analysis.

GloVe, similar to Word2Vec, is a word embedding method that maps words to fixed-length dense vectors representing their semantic meaning, trained on large-scale text corpora. Unlike Word2Vec, which learns embeddings by predicting word-context relationships directly during training, GloVe adopts a global statistical approach based on aggregated co-occurrence information. The process begins by scanning the entire corpus to identify word co-occurrences within a predefined sliding context window. All co-occurrence counts are stored in a large co-occurrence matrix, where rows and columns correspond to words, and each matrix entry Xij represents the number of times word wi appears in the context of word wj. Once this matrix is constructed, GloVe applies a weighted least squares regression model to learn word vectors such that their dot product predicts the logarithm of the probability of co-occurrence between two words.
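As a concrete illustration of the two Word2Vec training modes just described, the following gensim sketch trains CBOW and skip-gram models on a toy corpus; the corpus, vector size, and window are illustrative assumptions.

```python
# A minimal gensim sketch contrasting CBOW (sg=0) and skip-gram (sg=1) Word2Vec.
from gensim.models import Word2Vec

corpus = [
    ["the", "battery", "life", "is", "great"],
    ["the", "screen", "is", "dull", "and", "the", "battery", "is", "poor"],
    ["service", "was", "friendly", "and", "fast"],
]

cbow = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1)  # skip-gram

print(cbow.wv["battery"].shape)                     # (50,) dense vector for a word
print(skipgram.wv.most_similar("battery", topn=2))  # nearest neighbours in the space
```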

FastText38 extends the Word2Vec framework by incorporating subword information into word embeddings. Unlike Word2Vec and GloVe, where the smallest processing unit is a complete word, FastText represents each word as a combination of character n-grams in addition to the word itself. This enables the model to capture morphological patterns and internal word structure, which is particularly beneficial for morphologically rich or agglutinative languages. The training objective follows the same predictive modeling principle as Word2Vec—employing either the Continuous Bag of Words (CBOW) or Skip-Gram architectures—to optimize word and subword vectors jointly. A key advantage of FastText is its ability to generate meaningful embeddings for out-of-vocabulary (OOV) words or rare tokens by composing them from their constituent subword n-grams. This makes FastText particularly suitable for low-resource scenarios and domains with high lexical variability.

Bidirectional encoder representations from transformers (BERT)39 is a language representation model designed to pre-train unlabeled text bidirectionally using a technique called Masked Language Modeling (MLM). To understand BERT, we must first examine two foundational concepts it employs Attention and Transformers. Attention, as defined in40, is a mechanism that enables the model to assign varying weights to different elements in the input sequence, allowing it to capture relationships and dependencies that provide critical contextual cues. Transformers, built upon stacked self-attention layers and feed-forward networks, allow BERT to model long-range dependencies without recurrent connections, resulting in efficient parallel training and strong context-aware representations. An extension of BERT, A Lite BERT (ALBERT)41, addresses key limitations of the original architecture—namely, large model size and high memory consumption—while preserving accuracy. ALBERT introduces two main innovations: (1) Factorized Embedding Parameterization, which decomposes the large vocabulary embedding matrix into two smaller matrices, significantly reducing the number of parameters; and (2) Cross-Layer Parameter Sharing, where the same parameters are shared across multiple transformer layers, further reducing memory usage without sacrificing depth. Additionally, ALBERT replaces the Next Sentence Prediction (NSP) objective with a Sentence Order Prediction (SOP) task, which better models inter-sentence coherence. These modifications allow ALBERT to achieve competitive or superior performance to BERT on various NLP tasks, including sentiment analysis, while being more computationally efficient and scalable.
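The sketch below shows how contextual token embeddings can be obtained from ALBERT with the HuggingFace transformers library; "albert-base-v2" is a standard public checkpoint used here only for illustration.

```python
# A minimal sketch of extracting contextual token embeddings with ALBERT.
import torch
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")

inputs = tokenizer("The food was great but the service was slow.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state   # (1, seq_len, hidden_size=768)
print(token_embeddings.shape)
```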

Particle swarm optimization

The particle swarm optimization (PSO) algorithm was initially proposed by Kennedy and Eberhart in 199542. The algorithm is based on the social behavior of a flock of birds searching for food, where each bird is modeled as a particle representing a candidate solution to the problem. PSO has already proven robust for several applications.

Algorithm 1
figure a

EHO algorithm.

Animal-inspired swarm optimization algorithms, in the spirit of particle swarm optimization (PSO), have been widely explored for machine learning parameter optimization, drawing from natural behaviors to enhance exploration and exploitation. Examples include cat swarm optimization (CSO), mimicking the hunting and resting patterns of cats; the whale optimization algorithm (WOA), based on humpback whale bubble-net feeding; cuckoo search (CS), inspired by the brood parasitism of cuckoos; the bat algorithm (BA), simulating echolocation in bats; the grey wolf optimizer (GWO), modeling the leadership hierarchy and hunting strategy of grey wolves; and Elk Herd optimization (EHO), reflecting elk social grouping and mating behavior (Algorithm 1). These nature-inspired adaptations introduce problem-specific dynamics that improve convergence speed and solution quality when tuning machine learning models.

Elk Herd optimization (EHO)43 is a swarm intelligence metaheuristic inspired by elk social and reproductive behaviour. The process involves four main phases:

  1. Elk Herd generation – The population (bulls and harems) is initialized within predefined bounds, with fitness values computed and ranked.

In a continuous search space, each elk candidate X(j) is generated using:

$$X_{i}^{(j)}=L_{i}+\left(U_{i}-L_{i}\right)\times r_{i},\qquad r_{i}\sim U(0,1),\; i=1,2,\ldots,D$$
(12)

Here Li and Ui denote the lower and upper bounds of dimension i, D is the number of attributes, X(j) is the jth elk solution in the herd. The entire elk population is represented as:

$$H=\begin{bmatrix} X_{1}^{(1)} & X_{2}^{(1)} & \cdots & X_{D}^{(1)} \\ X_{1}^{(2)} & X_{2}^{(2)} & \cdots & X_{D}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ X_{1}^{(M)} & X_{2}^{(M)} & \cdots & X_{D}^{(M)} \end{bmatrix}$$
(13)

The fitness of each solution is evaluated and the population is sorted in ascending order:

$$f\left(X^{(1)}\right)\le f\left(X^{(2)}\right)\le\dots\le f\left(X^{(M)}\right)$$
  2. Rutting season – During the rutting phase, bulls are selected based on fitness scores, and these bulls then compete to form harems. Harems are assigned via roulette wheel selection proportional to each bull’s strength. The number of families is given by:

$$F=\left\lfloor \beta M\right\rfloor,\qquad \beta \text{ is the bull selection ratio}$$
(14)

From the sorted herd, the top F elks are designated as bulls:

$$B=\arg\min_{j\in\{1,\dots,F\}}f\left(X^{(j)}\right)$$
(15)

The roulette-wheel selection probability is defined as:

$$P_{j}=\frac{1/f\left(X^{(j)}\right)}{\sum_{k=1}^{F}1/f\left(X^{(k)}\right)}$$
(16)

Each bull receives harem members in proportion to its selection probability \(P_{j}\).

  3. Calving season – During the calving season, each new offspring is produced through one of two reproductive mechanisms, depending on whether it inherits the index of the bull or of the harem member. In the first case, the calf inherits the bull’s index, and its traits are formed through a weighted combination of the bull and the mother harem member. The reproduction rule for this case is:

$$C_{i}\left(t+1\right)=\alpha B_{i}\left(t\right)+\left(1-\alpha\right)H_{i}\left(t\right);\qquad \alpha\sim U(0,1)$$
(17)

Here \(C_{i}(t+1)\) denotes the new calf, \(B_{i}(t)\) the bull, and \(H_{i}(t)\) the mother harem member.

In the other case, the calf inherits the mother’s index, and its traits result from the mother, the bull, and an additional randomly selected elk from the herd. The reproduction rule for this case is:

$$C_{i,d}\left(t+1\right)=H_{i,d}\left(t\right)+\delta_{1}\left(B_{i,d}\left(t\right)-H_{i,d}\left(t\right)\right)+\delta_{2}\left(R_{d}\left(t\right)-H_{i,d}\left(t\right)\right);\qquad \delta_{1},\delta_{2}\sim U(0,2)$$
(18)

Here d is the dimension index and \(R_{d}(t)\) is a randomly selected elk from the herd.

  4. Selection season – In the final phase, all individuals are merged as

$$P=B\cup H\cup C$$
(19)

This combined population P is then ranked according to fitness and the top M elks form the next generation: \(H^{(t+1)}=\mathrm{Top}_{M}(P)\).

This iterative process continues until convergence or a termination condition is met, with computational complexity proportional to the herd size, decision variables, and iterations.
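The following Python sketch condenses the four phases (Eqs. 12–19) into a single minimization routine over a box-bounded search space. It is a simplified reading of EHO for illustration: the 50/50 choice between the two calving rules, the population size, and the bull ratio are assumptions rather than prescribed settings.

```python
# A compact sketch of the four EHO phases (Eqs. 12-19) for minimizing f.
import numpy as np

def eho_minimize(f, lower, upper, M=30, beta=0.2, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    L, U = np.asarray(lower, float), np.asarray(upper, float)
    D = L.size
    herd = L + (U - L) * rng.random((M, D))                  # Eq. (12): generation
    for _ in range(iterations):
        fitness = np.apply_along_axis(f, 1, herd)
        order = np.argsort(fitness)                          # sort ascending by fitness
        herd, fitness = herd[order], fitness[order]
        F = max(1, int(beta * M))                            # Eq. (14): number of bulls
        bulls = herd[:F]                                     # top-F elks act as bulls
        probs = 1.0 / (fitness[:F] + 1e-12)
        probs = probs / probs.sum()                          # Eq. (16): roulette wheel
        harem = herd[F:]
        harem_owner = rng.choice(F, size=M - F, p=probs)     # rutting: harem assignment
        calves = []
        for i, h in enumerate(harem):                        # calving season
            b = bulls[harem_owner[i]]
            if rng.random() < 0.5:                           # calf follows the bull's index
                alpha = rng.random()
                calves.append(alpha * b + (1 - alpha) * h)   # Eq. (17)
            else:                                            # calf follows the mother's index
                r = herd[rng.integers(M)]
                d1, d2 = rng.uniform(0, 2, size=2)
                calves.append(h + d1 * (b - h) + d2 * (r - h))  # Eq. (18)
        pool = np.vstack([herd, np.clip(np.asarray(calves), L, U)])  # Eq. (19): B ∪ H ∪ C
        pool_fit = np.apply_along_axis(f, 1, pool)
        herd = pool[np.argsort(pool_fit)][:M]                # selection: keep best M elks
    return herd[0], f(herd[0])

# Toy usage: minimize the sphere function in 5 dimensions.
best, best_val = eho_minimize(lambda x: float(np.sum(x**2)),
                              lower=[-5] * 5, upper=[5] * 5)
print(best_val)
```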

Proposed model

In this section, we have presented a sentiment classification framework that integrates deep learning-based feature extraction with metaheuristic optimization for class label determination. The approach employs optimal sentiment vectors to represent each sentiment category and guide the classification process. Initially, a deep learning model is trained on the available training dataset to generate intermediate feature representations (hidden layer or output vectors) for each input sentence. These representations are subsequently utilized in the optimization stage, where an optimization algorithm constructs optimal sentiment vectors corresponding to each sentiment class by minimizing/maximizing predefined objective functions.

The framework adopts a two-phase architecture, wherein the deep learning training phase and the optimization phase are executed independently and sequentially. This design ensures that the optimization process leverages stable deep learning representations without influencing the network training dynamics. During inference, the class label of a test instance is assigned by computing the similarity between its deep learning output representation and the precomputed optimal sentiment vectors. The sentiment label corresponding to the maximum similarity score is selected as the predicted class.
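The inference rule just described can be sketched as follows: each class is represented by a precomputed optimal sentiment vector, and a test instance receives the label of the most similar class vector. Cosine similarity is an illustrative choice here; the similarity measure itself is not fixed by the description above.

```python
# A minimal sketch of similarity-based label assignment against per-class
# "optimal sentiment vectors"; vectors and dimensions are illustrative.
import numpy as np

def predict_label(representation, class_vectors):
    """representation: (d,) deep feature of a test sentence.
    class_vectors: dict mapping label -> optimal sentiment vector of shape (d,)."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    scores = {label: cosine(representation, v) for label, v in class_vectors.items()}
    return max(scores, key=scores.get)        # label with maximum similarity

# Toy usage with random 8-dimensional vectors for three sentiment classes.
rng = np.random.default_rng(1)
classes = {c: rng.normal(size=8) for c in ("negative", "neutral", "positive")}
print(predict_label(rng.normal(size=8), classes))
```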

Algorithm 2
figure b

Proposed algorithm.

The input text is first processed using ALBERT embeddings to capture contextual information and semantic nuances. These embeddings are then passed through stacked Bi-GRU and Bi-LSTM layers, which model sequential dependencies in the text, and the recurrent outputs are enhanced with self-attention. The final output predicts sentiments based on the processed text data, leveraging contextual embeddings for improved accuracy. To further improve classification accuracy and convergence speed, the Elk Herd optimizer is employed to fine-tune the neural network parameters, ensuring optimal weight adjustments for robust sentiment prediction. The proposed method involves four primary steps: pre-processing, word embedding, model construction, and hyper-parameter optimization, as illustrated in (Fig. 3). The details of each step are outlined below.

Pre-processing layer

Data preprocessing is the conversion of unstructured data into structured data. To achieve this, the initially collected data is cleaned, which can be done differently for each type of data or application. After this cleaning, the data is structured so that the algorithms can process it. For the proposed scheme, the following preprocessing techniques are applied: case normalization, handling repeated characters, removing accents, eliminating numbers and special characters, filtering out laughter expressions, and removing URLs and stop words.
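A minimal sketch of these cleaning steps is given below; the regular expressions, the laughter patterns, and the stop-word list are illustrative assumptions rather than the exact rules used in the implementation.

```python
# A minimal sketch of the listed preprocessing steps (case normalization,
# repeated characters, accents, numbers/special characters, laughter tokens,
# URLs, stop words); patterns and the stop-word set are illustrative.
import re
import unicodedata

STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of"}  # assumed subset

def preprocess(text: str) -> str:
    text = text.lower()                                          # case normalization
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)            # remove URLs
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")         # strip accents
    text = re.sub(r"\b(?:a*(?:ha)+h?|(?:he){2,}|lo+l+)\b", " ", text)  # laughter expressions
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                    # squeeze repeated characters
    text = re.sub(r"[^a-z\s]", " ", text)                         # drop numbers/special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]     # remove stop words
    return " ".join(tokens)

print(preprocess("LOOOOL check https://example.com the screeeen is soooo GOOD!!! 10/10"))
# -> "check screen soo good"
```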

Fig. 3
figure 3

As shown in the figure, ALBERT is first employed to extract lightweight contextual embeddings. These embeddings are sequentially modeled using a Bi-GRU layer to capture short-term dependencies, followed by a Bi-LSTM layer for long-term contextual refinement. The Elk Herd optimization (EHO) algorithm is integrated to optimize key hyperparameters and enhance classification performance, thereby enabling a synergistic interaction between representation learning, sequence modeling, and optimization.

Embedding layer (ALBERT)

One of the most critical steps in natural language processing is word embedding, as it serves as the input and forms the basis for sentence embeddings. For this purpose, ALBERT (A Lite BERT) is employed. ALBERT leverages a deep neural network with Transformer encoders to generate contextualized word embeddings, ensuring that the representation of each word captures its meaning within the surrounding context. These embeddings are carefully generated to enable the estimation of similarity between words and sentences using advanced attention mechanisms and contextual relationships.

Model construction and optimization

Algorithm 2 outlines the sequential phases of the proposed model. Initially, it takes a set of documents, tokenizes them into sentences, and then further tokenizes each sentence into individual words. Word embeddings for each token are computed using ALBERT, a model known for its efficiency, as it uses parameter sharing to reduce the model’s overall size while maintaining strong performance in generating compact embedding dimensions. After preprocessing, the algorithm splits the data into training and testing sets, using 80% of the data for training and the remaining 20% for testing.

The core of the model combines Bi-GRU (bidirectional gated recurrent units) and Bi-LSTM (bidirectional long short-term memory) layers, enhanced with an attention mechanism. The Bi-GRU layers are used for primary feature extraction, while the Bi-LSTM layers capture more complex hierarchical features. The attention mechanism enables the model to focus on important parts of the input text, adjusting the attention weights based on token relevance. Once the sentiment features are extracted, the model normalizes the outputs and processes them through fully connected layers to generate a predicted sentiment. The model then adjusts the weights between layers using back-propagation, fine-tuning its predictions by minimizing the difference between predicted and actual sentiment values. During the optimization phase, a stochastic optimization method from the swarm intelligence domain, known as Elk Herd optimization (EHO), is employed to optimize the parameters. EHO is inspired by the natural herding behavior of elks, where the population is divided into clans led by dominant males, and members collaborate through mating, fighting, and movement strategies to explore and exploit the search space. The proposed model applies a fitness function to assess its accuracy and refines its parameters through Elk Herd optimization (EHO), improving the final output’s alignment with the correct sentiment. Finally, the model’s performance is evaluated using the test dataset, comparing the predicted sentiments with actual sentiments to determine the model’s effectiveness.
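As a concrete illustration of the forward path described in Algorithm 2, the following PyTorch sketch stacks ALBERT, Bi-GRU, Bi-LSTM, and multi-head attention before a dense classifier. The recurrent sizes (128/64) follow the settings reported in the experiments; the pooling strategy, head count, and checkpoint name are illustrative assumptions, and the EHO-based parameter refinement is applied separately (see the EHO sketch above).

```python
# A minimal PyTorch sketch of the ALBERT -> Bi-GRU -> Bi-LSTM -> multi-head
# attention -> dense pipeline; not the authors' reference implementation.
import torch
import torch.nn as nn
from transformers import AlbertModel

class HybridSentimentModel(nn.Module):
    def __init__(self, num_classes=3, gru_units=128, lstm_units=64, heads=4):
        super().__init__()
        self.encoder = AlbertModel.from_pretrained("albert-base-v2")
        self.bigru = nn.GRU(self.encoder.config.hidden_size, gru_units,
                            batch_first=True, bidirectional=True)
        self.bilstm = nn.LSTM(2 * gru_units, lstm_units,
                              batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * lstm_units, num_heads=heads,
                                          batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(2 * lstm_units),
            nn.Linear(2 * lstm_units, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        emb = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        seq, _ = self.bigru(emb)            # primary feature extraction (short-term)
        seq, _ = self.bilstm(seq)           # hierarchical / long-term refinement
        ctx, _ = self.attn(seq, seq, seq)   # self-attention over the token sequence
        pooled = ctx.mean(dim=1)            # mean pooling over tokens (assumed)
        return self.classifier(pooled)      # logits for the sentiment classes
```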

Experiments and results

To validate the proposed approach, three primary objectives are set: (1) to assess the performance of the proposed models, (2) to study the effects of EHO, and (3) to compare the proposed method with graph-based, supervised, and other similar approaches.

Datasets

For sentiment analysis and classification experiments, we utilized six datasets: Twitter44, Laptop1445, Rest1445, Rest1546, Rest1647, and SST-548. These datasets reflect different ways in which sentiment is expressed in the real world. The SST-5 dataset involves fine-grained sentiment classification into five emotional classes, which allows the model to learn the nuances of polarity discrimination. The Restaurant14, Restaurant15, Restaurant16, and Laptop14 datasets target aspect-based sentiment analysis; this is particularly relevant since it reflects the sentiment associated with specific features of products and services as expressed by users. The Twitter dataset adds the dimension of informal, high-variance social-media language, complementing the other datasets. This variability allows us to assess the generalization ability, robustness, and adaptability of the proposed methods, and the combination of these varied but complementary datasets ensures that performance is assessed across realistic, heterogeneous sentiment analysis scenarios.

In Table 1, we outline the datasets used and their characteristics. To compare the datasets systematically, we analyze them across key aspects such as training and testing size, sentiment classes, text length, and domain of application. While Twitter provides short, real-time social media posts, the Restaurant and Laptop datasets capture detailed customer reviews in specific domains. SST-5, on the other hand, introduces a richer five-class sentiment structure, allowing nuanced analysis.

The datasets exhibit notable variation in text length, which directly influences the complexity of sentiment classification. Twitter samples are constrained to a maximum of 280 characters, typically comprising 10–20 words, thereby producing short and concise inputs with high linguistic variability. In contrast, Restaurant14, Restaurant15, Restaurant16, and Laptop14 reviews generally consist of 1–3 sentences with an average length of 20–40 words. These can be categorized as medium-length reviews, as they provide richer contextual information compared to tweets while remaining less extensive than long-form narratives. The SST-5 dataset demonstrates variable text length, encompassing both short phrases with minimal lexical content and complete sentences containing 20–30 or more words. This variability enables fine-grained sentiment modeling across multiple levels of granularity.

The datasets span diverse application domains, ranging from social media content (Twitter) to aspect-specific product and service reviews (Restaurant14, Restaurant15, Restaurant16, and Laptop14), while SST-5 serves as a general-purpose benchmark for sentiment analysis.

Table 1 Datasets characteristics.

Experimental settings and metrics used

We conducted the experiments on an Intel Core i7 processor running at 4.20 GHz, equipped with 32 GB of RAM, and using Windows 10. The metrics used in this study include the F1 measure and overall accuracy. The F1 metric integrates precision and recall into one value, making it easier to compare these two performance indicators across various solutions. Accuracy measures the percentage of cases correctly identified by the model.
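Both metrics can be computed with scikit-learn as sketched below; the toy label arrays and the macro-averaging choice for F1 are illustrative assumptions.

```python
# A minimal sketch of the two evaluation metrics used in the study.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["pos", "neg", "neu", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "pos", "pos", "neu", "neu"]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))  # averaging scheme assumed
```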

Baselines

The performance of our proposed and the selected combination of GRU-LSTM models is compared against both pioneering studies and recently developed systems, which include TensorGCN49, BUGE50, CGA2TC51, HyperGAT52, TextING53, SSGC54, GFN55, DGEDT-RoBERTa56, HRLN-RoBERTa56, RSSG + BERT57, dotGCN40, KE-IGCN58, ASGCN-DG59 and CL-GAT-XLNET60.

Result analysis

The proposed method was systematically evaluated against established baselines for sentiment analysis, which employ diverse approaches to infer sentiment from datasets such as tweets, reviews, and news articles. These baselines encompass both binary and multi-class classification settings and typically provide sentiment scores corresponding to positive, negative, and neutral categories. To ensure comparability, the proposed method was adapted to operate under the same framework, producing results in terms of positive, negative, and neutral sentiments rather than its original fine-grained class outputs. This alignment facilitated an equitable comparison with the supervised baseline methods.

We employed the 128-dimensional ALBERT-BASE model, optimized with the Adam optimizer at a learning rate of 2 × 10⁻⁵. The ALBERT-BASE architecture, comprising 12 transformer layers, a hidden size of 768, 12 self-attention heads, and approximately 12 million parameters, leverages factorized embeddings and cross-layer parameter sharing to enhance efficiency. The BiGRU and BiLSTM layers are configured with 128 and 64 hidden units, respectively. To optimize network parameters beyond conventional gradient descent, the Elk Herd optimization (EHO) metaheuristic is applied at the output layer, thereby mitigating premature convergence and enhancing generalization under imbalanced and noisy sentiment data conditions. The network is trained for 15 epochs, ensuring stable convergence and robustness in performance. The quality of sentiment analysis results for the proposed model on the various datasets is displayed in (Table 2).
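For reference, the settings reported above can be collected into a single configuration block as sketched below; the dictionary keys, the checkpoint name, and the 80/20 split (taken from the model construction section) are written out for convenience and are illustrative in form.

```python
# The reported experimental settings, gathered into one configuration mapping.
CONFIG = {
    "encoder": "albert-base-v2",       # ALBERT-BASE: 12 layers, hidden size 768, ~12M parameters
    "embedding_dim": 128,              # 128-dimensional ALBERT embedding configuration
    "bigru_units": 128,
    "bilstm_units": 64,
    "optimizer": "adam",
    "learning_rate": 2e-5,
    "epochs": 15,
    "output_layer_tuning": "elk_herd_optimization",  # EHO applied at the output layer
    "train_test_split": (0.8, 0.2),
}
print(CONFIG)
```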

Restaurant14 results

For the positive class, the Optimized vector yields a 16.87% improvement in Precision, 15.87% in Recall, and 16.34% in F1-Score. In the negative class, the Optimized vector shows an 11.77% increase in Precision, a 16.49% rise in Recall, and a 13.98% improvement in F1-Score. Finally, for the neutral class, the Optimized vector demonstrates an 11.32% enhancement in Precision, a 17.39% increase in Recall, and a 14.32% boost in F1-Score.

Restaurant15 results

In the case of the positive class, the optimized vector yields substantial performance improvements, with gains of 16.50% in F1-Score, 18.47% in Recall, and 14.11% in Precision relative to the baseline vector. For the negative class, the Optimized vector shows rises of 14.80%, 14.54%, and 14.64% in Precision, Recall, and F1-Score, respectively. Finally, for the neutral class, a significant boost in Precision, Recall, and F1-Score is observed.

Restaurant16 results

The most notable improvement occurs in Precision for the neutral class, with a striking 52.30% increase, indicating that the Optimized vector significantly enhances the model’s ability to correctly classify neutral sentiment. The negative class benefits the most from improvements in Recall, while the positive class experiences balanced improvements across all metrics. Overall, the Optimized vector leads to a substantial boost in model accuracy, particularly for the neutral sentiment.

Table 2 The comparison experiments on all datasets.

Twitter results

For the Twitter dataset, the neutral class exhibits marginal improvements, with a 3.69% increase in Precision accompanied by a slight decline in Recall (-0.24%), resulting in a modest gain in F1-Score (1.67%). In contrast, both the positive and negative classes demonstrate more pronounced enhancements. The negative class, in particular, shows substantial improvements in Recall (15.49%) and Precision (13.38%). Similarly, the positive class records notable gains, with Precision, Recall, and F1-Score increasing by 10.34, 13.92, and 12.16%, respectively.

Laptop14 results

The Lap14 dataset reveals overall improvements in the Optimized vector across all classes, with the neutral class showing the most significant gains. Specifically, the neutral class experiences a 14.61% increase in Precision, a 15.70% boost in Recall, and a 15.07% improvement in F1-Score. The positive class exhibits a 14.11% enhancement in Precision, an 11.06% increase in Recall, and a 12.74% rise in F1-Score. In contrast, the negative class shows more modest improvements, with Precision increasing by 11.03% and Recall improving by a smaller margin of 5.50%.

SST-5 results

For the SST-5 dataset, the optimized vector yields consistent performance improvements across all sentiment classes. The Neutral class shows the largest gain in Precision (27.61%), though Recall improves only marginally (4.02%), indicating higher accuracy in identifying neutral instances but limited improvement in coverage. The Very Positive class demonstrates balanced gains in Precision (14.46%), Recall (17.82%), and F1-Score (16.15%). Similarly, the Positive class improves by 13.10% in Precision, 17.64% in Recall, and 15.21% in F1-Score, while the Very Negative class records increases of 14.41%, 16.67%, and 15.84%, respectively. The Negative class achieves a notable Precision gain (17.12%) with moderate Recall improvement (14.67%), resulting in a balanced F1-Score increase (15.92%). Overall, the optimized vector significantly enhances classification performance, with the Neutral class excelling in precision, and the Very Positive and Very Negative classes showing well-rounded improvements across all metrics.

Table 3 The comparison experiments on Laptop14, Twitter, restaurant 14, 15, 16 and SST-5.

Table 3 summarizes the performance of the evaluated approaches on the different datasets. The results demonstrate that the proposed model consistently outperforms existing methods across all evaluation metrics, underscoring the robustness and effectiveness of the proposed model. From the results in (Table 3), we observe that the proposed model improves accuracy, and when using Elk Herd optimization, the accuracy and F-score show a slight further increase. For the Laptop14, Twitter, Restaurant14, Restaurant15, Restaurant16, and SST-5 datasets, the proposed model demonstrates fairly good performance, with accuracy rates of 85.9, 83.1, 89.6, 85.7, 95.7, and 53.5% and F-scores of 83.2, 81.5, 84.4, 86.1, 87, and 53%, respectively. For Laptop14, Twitter, Restaurant14, Restaurant15, and Restaurant16, it reports average improvements of 6.1 and 6.5%, 6.4 and 5.9%, 4 and 4.8%, 0.6 and 18.4%, and 2.7 and 5.7%, respectively, in comparison to graph convolutional network (GCN) based approaches. For the SST-5 dataset, the proposed model reports average improvements of 16.8 and 17.8% in comparison to graph embedding and hypergraph based models. The results indicate that the architectural features of the proposed model align well with the consistent patterns found in the respective datasets.

Table 4 Estimated memory usage per dataset.

This enables the model to generalize effectively, achieving strong accuracy and F-scores. Its ability to manage complexity and expressiveness without overfitting is key to maintaining stability and performance across different datasets. The variety and quality of the training datasets likely contribute to providing the model with diverse and representative data, further enhancing its generalization capabilities.

Our proposed approach demonstrates clear superiority across the above-mentioned datasets, as evidenced by the detailed analysis of the confusion matrices shown in (Fig. 4). The proposed method exhibits remarkable improvements in fine-grained sentiment classification, successfully distinguishing between sentiment categories with a marked reduction in confusion between adjacent classes.

Fig. 4
figure 4

Confusion matrix for various datasets.

In our study, Restaurant14, Restaurant15, Restaurant16, Twitter, and Laptop14 are the imbalanced and domain-specific datasets. For all datasets, the Optimized vector significantly improves performance; however, the influence depends on the characteristic features of the dataset. Stronger gains occur with imbalanced and domain-specific datasets, especially for the neutral class, which indicates improved handling of minority and ambiguous sentiments. Balanced datasets show more consistent improvement across classes, whereas informal open-domain text, such as Twitter, displays limitations for neutral sentiment detection. Overall, the Optimized vector enhances both robustness and discriminative ability, though challenges remain for noisy or highly ambiguous text. Model performance also declines when the vocabulary shifts across domains; significant changes in linguistic patterns can cause the model to overfit to domain-specific terms, reducing its generalization ability.

Ablation study

We conduct four sets of ablation experiments across different datasets to assess the contribution of each component. The study also examines how the learning rate affects the accuracy of the proposed model. The experimental results reveal that integrating ALBERT and EHO into the learning framework results in an overall performance boost for our sentiment analysis model. To further explore this impact, we performed tests on each dataset type, incorporating the multi-head attention mechanism, which assigns weights to each token, and EHO.

Execution time and memory usage

Using the same datasets as in our suggested approach, we assessed the memory consumption and execution time of baseline sentiment analysis methods. We used the time data directly reported in55 for models like HyperGAT, BUGE, and CGA2TC. We did not include the preprocessing time in our analysis because all models need preprocessing procedures such as tokenization before they can make predictions. Our proposed model performs better than SOTA techniques in terms of time and memory efficiency.

Table 5 Estimated training time per epoch.

Compared to graph-based models such as HyperGAT, BUGE, and CGA2TC reported in earlier studies, it uses less memory and computes more quickly, since it can directly express sequential relationships without the need for edge dependencies or graph-based structures.

The models’ memory utilization can be seen in (Table 4). This decreased memory utilization demonstrates how our model’s architecture has been optimized to handle massive amounts of information with greater accuracy while using fewer resources. Since graph-based embeddings have to handle edge dependencies, adjacency matrices, and neighborhood aggregation, all of which become more complex and scale with the size of the graph structure, they usually require more memory.

ALBERT embeddings, on the other hand, considerably reduce memory usage because they are generated using parameter optimization strategies like factorized embedding parameterization and cross-layer parameter sharing. To determine the computation time of the suggested models, we evaluated variations of our methodology on multiple datasets. The training and testing time for each epoch is summarized in (Tables 5 and 6). According to the tables, the Twitter, Laptop14, and SST-5 datasets have very short training and testing times, while the Restaurant 14–16 datasets have much longer training and testing times. Here we have tested Model 1 (GRU → LSTM → multi-head attention) and Model 2, the proposed model (GRU → LSTM → EHO).

Fig. 5
figure 5

The effect of learning rate on accuracy.

To evaluate our model’s performance on various datasets, we change the learning rate from 0.00001 to 0.1. The optimal learning rate for all datasets is 0.01 as shown in (Fig. 5). Interestingly, we observed a steady improvement in the model’s performance as we progressively raised the learning rate. However, going above 0.01 in the learning rate caused instability, which resulted in significant performance swings and, in certain situations, completely stopped convergence. We set the learning rate to 0.01 in order to achieve stability and peak performance.

Table 6 Estimated testing time per batch.

Effect of removal of each component on the model’s performance

We aim to assess the impact of combinations of the various components of the proposed scheme. For this purpose, we employed a component removal strategy. In Table 7 we present the impact of component removal. The proposed model, powered by all of the mentioned components, produces significant results. It is clear from the results that the model’s variants, namely without ALBERT, without BiGRU, without BiLSTM, and without EHO, report lower performance. The variant without EHO observes a slight decline in accuracy and F-score for all datasets, whereas a marginal decline in performance is reported by the variants without ALBERT, Bi-GRU, and Bi-LSTM.

Table 7 Component removal effect on model’s overall performance.

Conclusion

In the proposed framework, ALBERT is utilized due to its parameter efficiency and reduced memory requirements compared to models such as BERT and RoBERTa, enabling effective handling of large-scale sentiment datasets without compromising contextual embedding quality. To capture sequential dependencies, a hybrid architecture integrating GRU and LSTM is employed, where GRU efficiently models short-term dependencies with fewer parameters, while LSTM excels at capturing long-term dependencies, resulting in a balanced temporal feature representation. Additionally, Elk Herd optimization (EHO) is incorporated to enhance parameter tuning and mitigate the risk of local minima during training, thereby improving robustness in the presence of noisy and imbalanced sentiment datasets. Model performance is evaluated on widely used sentiment analysis benchmarks, including SST-5, Laptop14, Twitter, Restaurant14, Restaurant15, and Restaurant16, using Accuracy and F-score as evaluation metrics. Furthermore, a detailed ablation study is conducted to analyze the influence of learning rate on accuracy, alongside assessments of execution time and memory utilization.

While ALBERT is a highly efficient and lightweight backbone for our proposed sentiment analysis framework, it does have some shortcomings that we would like to acknowledge. A notable limitation is the reduction in the depth of semantic and contextual understanding compared to larger transformer models like BERT or RoBERTa. Additionally, the collected data may not adequately reflect the complete variety of sentiment expressions across different domains, languages, or cultural settings; due to internal dataset bias, the model's generalizability to unseen data may suffer. In the future, we aim to improve our model by incorporating multimodal datasets from different domains. While the current study focuses exclusively on English datasets, extending the model to multilingual and cross-lingual sentiment analysis, including various low-resource languages, remains an important future direction. Due to the complexity and resource requirements of multilingual evaluation, this was beyond the scope of the present work but will be explored in future research. For future work, we also recommend exploring diverse applications, including marketing, finance, education, and healthcare.