Introduction

Emotion analysis is crucial for the development of affective interfaces that provide suitable emotional responses, thereby creating a sense of emotional engagement and facilitating online communication1. Emotion recognition plays a significant role in several fields, particularly driver assistance systems (DAS) and active and assisted living (AAL). It is one of the technical enablers of AAL, as it can provide substantial assistance in observing and monitoring the mental state of elderly and disabled persons2. Recent publications show that the classification performance of emotion recognition methods is steadily improving, and the opportunities for automated emotion recognition have likewise expanded. Multiple modalities are utilized to identify emotions in humans, including body movements, facial expressions, textual information, heartbeat, and blood pressure3. In computational linguistics, detecting human emotion from text has become increasingly important from an application perspective, given the massive amount of textual data now available on the internet. Textual emotion recognition is a computational task that analyses natural language text to identify its associations with emotions such as fear, anger, sadness, joy, and others4. It is applicable in various government organizations, industries, and media applications. Textual emotion detection aims to identify the primary emotion a writer conveys by analyzing their input text5.

It rests on the assumption that a person who is happy tends to use positive words. In business development, emotion detection can help marketers devise effective approaches for new product development, service delivery, and customer relationship management (CRM)6. Psychologists can infer an individual's emotions from the text they write and use them to predict their mental state7. This knowledge might be practically employed to predict customer preferences and user behaviour for corporate economic gain. Text-based emotion detection can thus be applied in psychology, education, business, and other fields. In the past decade, emotion detection from text has been explored alongside users' emotional states inferred from multimodal resources, including gestures, eye gazes, and audio8. Technology equipped with emotion detection models can recognize emotions automatically. One specific field that stands to benefit from dependable emotion detection methods is artificial intelligence (AI)9. AI software equipped with effective emotion recognition could enhance human-computer interaction devices. Earlier investigations in affective computing employed classical machine learning (ML) approaches, unlike the latest advances in deep learning (DL)10. ML- and DL-based methods are now used to estimate emotions and achieve significantly better results than traditional ones.

This manuscript presents an Optimised Ensemble Model for Precise Textual Emotion Recognition using an Improved Sand Cat Swarm Optimization (OEMPTER-ISCSO) method. The primary objective of the OEMPTER-ISCSO method is to accurately recognize emotions in text, facilitating enhanced communication with individuals with disabilities. Initially, the text pre-processing stage involves multiple steps to normalize and clean the input text. Next, the FastText method is employed for the word embedding process, transforming words into numerical vector representations. For textual emotion detection, an ensemble of three classifiers is employed: the enhanced deep belief network (EDBN), the Elman neural network (ELNN), and an improved temporal convolutional network (ITCN). Finally, a hyperparameter selection procedure based on the improved sand cat swarm optimization (ISCO) method is executed to optimize the detection outcomes of the ensemble models. The OEMPTER-ISCSO technique is evaluated on a textual emotion detection dataset. The key contributions of the OEMPTER-ISCSO technique are listed below.

  • The OEMPTER-ISCSO model undergoes a pre-processing stage that includes cleaning, normalization, and tokenization to enhance data quality. This process ensures the removal of noise and irrelevant data, making the input more consistent. It significantly improves the efficiency of emotion detection by enhancing the quality of the extracted features.

  • The OEMPTER-ISCSO method utilizes FastText-based word embeddings to capture the semantic and contextual nuances of words, allowing for richer textual representation. This approach facilitates the encoding of sub-word data, thereby enhancing comprehension of rare and misspelt words. These embeddings play a crucial role in improving the performance of the downstream emotion classification models.

  • The OEMPTER-ISCSO approach implements an ensemble framework by integrating the EDBN, ELNN, and ITCN models. This hybrid setting leverages each model's individual strengths to capture spatial and temporal features effectively. As a result, it significantly improves the accuracy and robustness of emotion classification.

  • The OEMPTER-ISCSO methodology utilizes the ISCO approach to fine-tune the hyperparameters of the ensemble architecture. The ISCO model improves the convergence rate and exploration capabilities by simulating adaptive search behaviours. This results in enhanced model accuracy, reduced training time, and improved overall efficiency.

  • The OEMPTER-ISCSO model is novel in integrating a hybrid ensemble architecture incorporating EDBN, ELNN, and ITCN methods with the ISCO model for hyperparameter tuning. This fusion, combined with FastText-based word embeddings, offers a novel approach to textual emotion detection. Such a configuration has not been previously explored in this domain, resulting in improved accuracy and robustness.

Related works

Islam et al.11 proposed a model using advanced sensor applications and ML techniques. The techniques utilized include ontology-based knowledge representation, multimodal sensor fusion, and DL approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Romero and Armenta12 introduced a technique utilizing real-time image processing. The approach employs a CNN for facial feature extraction and emotion classification, deployed on a Raspberry Pi 3b+ platform to ensure low-cost and efficient real-time implementation. Asha et al.13 developed a modern voice-controlled assistant incorporating voice recognition and natural language processing (NLP) for identifying and retrieving key words, along with DL models for analysis. It takes user voice commands, performs web searches, opens Google Chrome, and converts text to speech. Computer vision (CV) capabilities include real-world object detection and web scraping for pricing data, while emotion detection utilizing pre-trained models improves human-centric interaction. Pavithra et al.14 introduced the development of a speech emotion recognition (SER) method employing ML models, especially RNNs and DL. By examining critical audio features such as prosody, pitch, and rhythm, this method aims to achieve precise emotion recognition for novel speech instances. Brilli et al.15 developed AIris, an AI-powered wearable device that offers visually impaired persons (VIPs) environmental awareness and interaction capabilities. AIris combines an advanced camera fixed on eyewear with an NLP interface, allowing users to receive real-time auditory descriptions of their surroundings. The work also delivered a functional prototype that works effectively in real-world circumstances. Bertacchini et al.16 proposed a model that integrates the Pepper robot with the chat generative pre-trained transformer (ChatGPT) for real-time, natural language dialogue.
Utilizing human-robot interaction (HRI) and NLP, the work simulates interactions with individuals diagnosed with autism spectrum disorder (ASD). Reddy et al.17 presented an innovative method that combines hand gesture recognition (HGR) with real-time voice output, developed to help people with paralyzed hands monitor and enhance their hand movements. This novel method utilizes progressive technologies to bridge the gap between intention and action for individuals with limited hand mobility, representing a significant advancement in assistive technology. Begum et al.18 proposed a sign language translation system that utilizes a quantized You Only Look Once version 4 tiny (YOLOv4-Tiny) model for detecting 49 Bengali sign characters, and an LSTM network for generating meaningful text from the recognized characters. Kandula et al.19 presented a sign language recognition (SLR) system that utilizes webcam-recorded hand gestures to enhance communication for individuals with hearing or speech impairments. Di Luzio, Rosato, and Panella20 proposed a method to strengthen emotion classification through video analysis by utilizing explainability models to optimize facial landmark features. Deep models, such as 2D-CNNs and deep neural networks (DNNs), are employed, along with an improved integrated gradients method, to detect and refine crucial facial points. This approach enhances accuracy while minimizing noise and reducing computational costs.

Slade et al.21 proposed an SER model by integrating the audio spectrogram transformer (AST) with DL techniques such as the one-dimensional CNN (1D CNN), Bidirectional LSTM (BiLSTM), and CNN-BiLSTM, optimized using a novel cluster search optimization (CSO) technique. CSO utilizes cluster centroid search, reinforcement learning (RL), and noise-tempered K-means (NTKM) to enhance model performance across multiple emotion datasets. Neeraja et al.22 developed an effective driver somnolence detection system by utilizing DL methods integrated with CV and physiological signal analysis. ML models are integrated to improve detection precision and scalability. Ali and Hughes23 proposed an efficient emotion recognition model utilizing a unified biosensor–vision multimodal transformer (UBVMT) model, which integrates self-supervised learning techniques, including masked autoencoding and contrastive modelling. By incorporating 2D representations of ECG/PPG signals with facial features, the model mitigates memory load through homogeneous Transformer blocks, enabling scalable emotion classification in the arousal-valence space. Paul et al.24 proposed a real-time attendance system that integrates facial recognition and emotion detection using a dual-path architecture. It leverages ResNet-50 for face recognition, the Vision Transformer (ViT) for emotion detection, and a custom dataset. Choi, Zhang, and Watkins25 presented a novel variant of the self-supervised audio spectrogram transformer (SSAST) model. The approach integrates dual representations from both middle and final layers using mean, max, and min patch-wise pooling, improving feature richness and accuracy across multiple benchmark datasets. Wang and Chai26 enhanced personalized learning path optimization and learning efficiency by proposing the LSTM-Transformer model.
This model utilizes LSTM to capture learners’ behavioural sequences and the Transformer’s self-attention mechanism to enhance context understanding, enabling accurate prediction and adaptive optimization of individual learning trajectories. Ramani et al.27 explored emotion detection using a deep Bidirectional LSTM on multimodal mobile sensor data, eliminating the need for manual feature engineering and demonstrating its efficiency for human-robot interaction applications. Prithi and Tamizharasi28 improved customer relationship management (CRM) by integrating facial expression recognition into the customer information system (CIS) using a feature fusion deep multi-layer classification (FFDMLC) model. The model employs DL methods for feature computation and classification, with hyperparameters optimized using the COOT optimization algorithm to enhance recognition accuracy. Selvaraju et al.29 presented a real-time system for Indian SL and speech-to-text translation in video conferencing using CNN, YOLOv5, Hidden Markov Model (HMM), and WebRTC, improving communication accessibility for the deaf and speech-impaired. Ghadami, Taheri, and Meghdari30 utilized transformer encoder-based networks with early and late fusion techniques, optimized by a genetic algorithm (GA), to recognize Iranian Sign Language words. Key features such as hand and lip keypoints, along with spatial metrics, are used for training the model using multi-task learning, enabling accurate word and sentence recognition.

Khanum et al.31 proposed an IoT-based wearable device for women's safety, enabling real-time audio tracking, location monitoring, and emergency alerts, even in offline conditions. Siju and Selvam32 developed an HGR system by employing Google Mediapipe to extract 21-point hand landmark vectors, which are later utilized for training a lightweight DNN in TensorFlow. The model recognizes various gestures and is examined in real-time with a live webcam stream, making it appropriate for edge devices. Naik et al.33 developed a robust and reliable multimodal emotion recognition system by utilizing DL models across text, audio, and video data. The model integrates Bidirectional Encoder Representations from Transformers (BERT) for text-based emotion detection. The Term Frequency-Inverse Document Frequency (TF-IDF) technique is also utilized for feature extraction. Furthermore, a CNN with audio augmentation is used for audio signals, and a CNN with OpenCV is used for real-time facial expression analysis in video. Liu et al.34 introduced a model that employs an Adaptive Evolutionary Computational Integrated Learning Model (AdaECELM), integrating TF-IDF for feature selection, Cuckoo Search Optimisation (CSO), and AdaBoost for ensemble learning through soft voting. Filahi et al.35 presented a technique using diverse ML methods comprising logistic regression (LR), naïve Bayes (NB), support vector machine (SVM), random forest (RF), and AdaBoost, and DL models such as the gated recurrent unit (GRU) and long short-term memory (LSTM). Sandulescu et al.36 developed NeuroPredict, an AI-driven healthcare platform that utilizes Internet of Medical Things (IoMT) devices and AI models. The technique also integrates AI-based predictive models with voice-based emotion detection algorithms, employing voice features as non-invasive indicators of mental health changes.
Muhammad et al.37 introduced a CNN technique integrating transformer models, such as DeBERTa-v3-large, Electra, XLNet, RoBERTa, and T5, to improve model performance in recognizing complex emotional variations. The International Survey on Emotion Antecedents and Reactions (ISEAR) dataset was utilized for testing the model. Thiab, Alawneh, and Mohammad38 proposed a method utilizing DL and transformer-based models. RNNs and transformer architectures are evaluated individually, and their outputs are integrated using an ensemble learning approach with majority voting to improve performance. Kumar, Khan, and Choi39 developed a novel methodology employing a hybrid DL technique that integrates RoBERTa with parameter-efficient adapter layers, Bidirectional LSTM (BiLSTM), and attention mechanisms (AM). Geethanjali and Valarmathi40 proposed a hybrid model, the Improved Chimp Optimisation Algorithm–CNN-LSTM (IChOA-CNN-LSTM). The technique is evaluated on the GeoCoV19 dataset. Arbaizar et al.41 utilized Hidden Markov Models (HMMs) for handling missing data and a transformer DNN for multivariate time-series forecasting, incorporating classification algorithms to predict emotional valence and responses to psychiatric questionnaires. Kohneh Shahri, Afshar Kazemi, and Pourebrahimi42 presented a technique using AI methods, including DL-based motion detection, body language recognition, image processing, sound and text processing, CV, and NLP. Table 1 summarises the existing studies on emotion recognition for individuals with disabilities.

Table 1 Summary of existing studies comprising methods, datasets, and key findings.

Despite crucial improvements in DL, CV, and transformer-based models across various emotion recognition, SL, and safety applications, several limitations still exist. Many models require large annotated datasets, which are often scarce, posing challenges to generalization and robustness. Some models are inefficient on resource-constrained edge devices due to high computational complexity and memory requirements. Existing multimodal fusion approaches often encounter issues with synchronization and effective feature integration, which can impact accuracy. Furthermore, few systems comprehensively address offline functionality and privacy concerns, particularly in safety-critical applications. There is also limited research on adaptive models that can dynamically optimize performance based on varying input quality and user contexts. Various techniques additionally suffer from reliance on specific datasets, restricted generalizability across diverse data sources, and difficulty handling noisy or imbalanced data, while a few models incur increased computational complexity. The research gap in addressing these concerns involves developing lightweight, scalable architectures capable of efficient multimodal fusion, enhancing both offline and real-time capabilities, and improving model adaptability with minimal manual intervention. Moreover, many models rely heavily on manual feature engineering, which limits scalability and adaptability across diverse datasets and applications; addressing this requires end-to-end self-supervised or semi-supervised learning frameworks that reduce dependency on labelled data while maintaining high accuracy and efficiency. Finally, a gap exists in developing scalable and efficient systems that sustain high accuracy across diverse real-world scenarios while effectively managing data heterogeneity and model complexity.

Proposed methodology

This study presents the OEMPTER-ISCSO model. The primary objective of the OEMPTER-ISCSO method is to enhance the communication of individuals with disabilities by accurately recognizing emotions in text. The proposed OEMPTER-ISCSO method comprises several stages, including text pre-processing, word embedding, classification, and hyperparameter tuning. The overall workflow of the OEMPTER-ISCSO method is portrayed in Fig. 1.

Fig. 1

Overall working process of OEMPTER-ISCSO model.

Text pre-processing

Initially, the text pre-processing stage involves multiple steps to normalize and clean the input text43. Text pre-processing transforms text into a form that is predictable and analyzable for the task at hand. This includes eliminating unimportant data such as stop-words and URLs, applying stemming and lemmatization, and executing tokenization, which removes unrelated data and prepares the dataset for further processing. The pre-processing steps are determined by the goal of the task, such as retaining relevant features and eliminating unrelated data to enhance the algorithm's performance. They are crucial to achieving optimal outcomes, as they directly impact the accuracy and quality of the produced analyses. The pre-processing phase plays a vital role in converting raw text into a structured format appropriate for feature extraction and classification. Stemming reduces words to their root forms by removing suffixes, often resulting in non-lexical stems. In contrast, lemmatization considers the context and reduces words to their dictionary base form, thereby improving the quality of extracted features. By applying these targeted pre-processing techniques, the system effectively captures semantic, statistical, and linguistic characteristics, thereby enhancing the accuracy and coherence of emotion recognition.

The initial data cleaning phase involves removing duplicate URLs, handling dynamic URLs, and preserving those with embedded HTML components to ensure that only English-language-related content is retained for evaluation. Removing stop words (common words like "a," "an," and "the") helps mitigate noise and improves concentration on crucial terms that carry semantic weight. The tokenization procedure further simplifies the text by splitting large blocks into small units, such as splitting sentences into individual words for easier analysis.

The stemming procedure is applied to reduce words to their root form, which may not be an actual word but captures the base idea of related terms. For example, "running," "runner," and "ran" are reduced to "run". This process helps integrate semantically similar words and enhances text analysis. Lemmatization, by contrast, refines this process by converting words to their proper base form using a dictionary, ensuring grammatical accuracy and preserving the original context. For instance, for "sharing," lemmatization yields the correct base form "share," contributing to a more precise and meaningful analysis. This reduces redundancy and enhances consistency, allowing for a more precise and accurate representation of emotions.
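The cleaning, stop-word removal, tokenization, and lemmatization steps above can be sketched as follows. This is a minimal illustration: the stop-word set and the lemma table are toy stand-ins (hypothetical) for the full linguistic resources a real pipeline would use.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "and"}
# Toy lemma table standing in for a full dictionary-based lemmatizer.
LEMMAS = {"running": "run", "runner": "run", "ran": "run", "sharing": "share"}

def preprocess(text):
    text = re.sub(r"https?://\S+", "", text)       # strip URLs
    text = re.sub(r"<[^>]+>", "", text)            # strip embedded HTML tags
    tokens = re.findall(r"[a-z']+", text.lower())  # normalize case + tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]      # lemmatize known forms
```

For example, `preprocess("The runner was running! ran <b>fast</b> http://example.com")` yields `["run", "was", "run", "run", "fast"]`, mapping the inflected forms onto a shared base form before feature extraction.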

FastText-based word embedding

Next, the FastText method is employed for the word embedding process, transforming words into numerical vector representations44. This model is chosen due to its capability to capture subword data, which is advantageous for handling out-of-vocabulary (OOV) words and morphologically rich languages. This technique depicts words as a sum of character n-grams, facilitating the comprehension of the internal word structures, unlike conventional embeddings such as Word2Vec or GloVe. This improves its robustness in noisy or domain-specific datasets. Furthermore, the model is computationally effective and gives meaningful vectors even for rare or misspelt words. Its pre-trained models on large corpora contribute to enhanced semantic and contextual representation, making it an ideal choice for downstream NLP tasks, such as emotion detection. Fig. 2 illustrates the flow of the FastText model.

Fig. 2

Architecture of FastText model.

A word embedding is a representation of words in a vector space that captures semantic relations among them. Specifically, it is a mathematical mapping that positions related words closer together in the vector space. These representations are frequently applied in ML and NLP tasks. The idea behind word embeddings is to transform words into numerical vectors while preserving their semantic relationships. This enables ML models to capture meanings and relationships more effectively. Word embeddings mainly benefit sentiment analysis, text classification, and language translation tasks.

FastText is a lightweight, free, and open-source library designed for efficient text representation and classification. It is built to process large text corpora efficiently and is primarily suited to tasks such as word embedding, language identification, and text classification. FastText produces continuous vector representations (embeddings) for the words in a text. These embeddings encode semantic information and are beneficial for various NLP tasks. It also supports training a text classifier using a shallow neural network (NN), making it efficient when labelled data is available for training, such as in topic classification or sentiment analysis.

The fastText model is based on an NN structure that combines the Bag-of-Words (BoW) representation with sub-word information. Its objective function is given in Eq. (1).

$$-\frac{1}{N}{\sum }_{n=1}^{N}{y}_{n} log\left(f\left(BA{x}_{n}\right)\right)$$
(1)

Where \({x}_{n}\) is the normalized bag of features of the \(nth\) document, \({y}_{n}\) is its label, and \(A\) and \(B\) are the weight matrices. The model is trained asynchronously on multiple CPUs using stochastic gradient descent with a linearly decaying learning rate. The training process updates the NN parameters to minimize this objective function. Hierarchical softmax and negative sampling are utilized to make the training process more efficient. It is essential to understand that, although this provides an overall review, the actual performance details, optimizations, and hyperparameters may differ according to the specific settings and version of FastText used.
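Eq. (1) can be made concrete with a small NumPy sketch. The shapes and the plain softmax output layer below are illustrative assumptions (fastText itself typically replaces the full softmax with hierarchical softmax or negative sampling for speed):

```python
import numpy as np

def fasttext_loss(X, Y, A, B):
    """Negative log-likelihood of Eq. (1): softmax(B A x_n) scored against y_n.

    X: (N, d) normalized bag-of-feature rows, Y: (N,) integer labels,
    A: (h, d) embedding matrix, B: (k, h) output matrix (toy shapes).
    """
    logits = X @ A.T @ B.T                        # (N, k) class scores
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # softmax f(BAx_n)
    return -np.mean(np.log(probs[np.arange(len(Y)), Y]))
```

With zero weights the classifier is uniform over the k classes, so the loss reduces to log k, which is a quick sanity check on the implementation.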

Classification using ensemble models

An ensemble of three classifiers is employed for textual emotion detection and classification: the EDBN model, ELNN technique, and ITCN method. The ensemble model is chosen to utilize the unique merits of every model, improving overall performance in textual emotion detection. The model is prevalent because it can effectively capture hierarchical and abstract feature representations, enhancing emotion recognition from complex text patterns. ELNN, with its feedback connections, outperforms modelling temporal dependencies in sequential data. ITCN offers the benefits of capturing long-range dependencies with reduced complexity and faster training compared to RNNs. Altogether, these models complement each other, confirming enhanced generalization, robustness, and classification accuracy over single-model approaches, particularly in emotionally diverse and context-sensitive textual datasets.
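The section does not specify how the three classifiers' outputs are combined; a common choice for such an ensemble is soft voting over class probabilities, sketched below (the uniform weights are an assumption, not the paper's configuration):

```python
import numpy as np

def ensemble_predict(prob_list, weights=None):
    """Soft voting: average the class-probability outputs of the base
    classifiers (stand-ins for EDBN, ELNN, ITCN) and take the arg-max."""
    probs = np.stack(prob_list)                 # (models, N, classes)
    if weights is None:
        weights = np.ones(len(prob_list)) / len(prob_list)
    avg = np.tensordot(weights, probs, axes=1)  # weighted mean -> (N, classes)
    return avg.argmax(axis=1)
```

A weighted variant (e.g. weights tuned by the optimizer) drops in by passing a `weights` vector instead of the uniform default.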

EDBN classifier

Comparable to the conventional DBN, the EDBN learning process primarily consists of two phases: pretraining and fine-tuning. During the pretraining phase, the contrastive divergence (CD) algorithm is applied to perform unsupervised training of the RBM45. Subsequently, during the fine-tuning phase, the complete NN is unfolded into a feed-forward network, and the weights of the whole network are adjusted using the error back-propagation (EBP) algorithm. The activation probabilities of the basic RBM component are given in Eqs. (2-3):

$$\begin{array}{c}p({h}_{i}=1|v)=\sigma \left({\sum }_{j=1}^{m}{w}_{ji}{v}_{j}+{b}_{i}\right)\\ =\frac{1}{1+{e}^{-\left({\sum }_{j=1}^{m}{w}_{ji}{v}_{j}+{b}_{i}\right)}}\end{array}$$
(2)
$$\begin{array}{c}p({v}_{j}=1|h)=\sigma \left({\sum }_{i=1}^{n}{w}_{ji}{h}_{i}+{a}_{j}\right)\\ =\frac{1}{1+{e}^{-\left({\sum }_{i=1}^{n}{w}_{ji}{h}_{i}+{a}_{j}\right)}}\end{array}$$
(3)

Whereas \({v}_{j}\) designates the input of the \(jth\) node of the visible layer, \({h}_{i}\) characterizes the value of the \(ith\) node of the hidden layer (HL), \({a}_{j}\) and \({b}_{i}\) represent the biases of the visible and hidden neurons, respectively, \({w}_{ji}\) refers to the connection weight between hidden neuron \(i\) and visible neuron \(j\), \(\sigma\) signifies the sigmoid activation function \(1/(1+{e}^{-x})\), \(m\) stands for the number of visible neurons, and \(n\) indicates the number of hidden neurons. Letting the parameters be \(\theta =(w, a,b)\), the update rule for the weights and biases is shown as

$${\theta }^{(p+1)}={\theta }^{(p)}+\Delta \theta ,\quad \Delta \theta \propto \langle {h}_{i}^{0}{v}_{j}^{0}\rangle -\langle {h}_{i}^{1}{v}_{j}^{1}\rangle$$
(4)

Whereas \(\langle \cdot \rangle\) characterizes the average value obtained from the sampled states, \({h}_{i}^{0}{v}_{j}^{0}\) signifies the initial state distribution, \({h}_{i}^{1}{v}_{j}^{1}\) specifies the state obtained after one Markov-chain iteration, and \(p\) signifies the number of unsupervised training iterations.
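As a concrete illustration of Eqs. (2)-(4), the following NumPy sketch performs one CD-1 update for a binary RBM. The array shapes, the learning rate, and the fixed random seed are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1, rng=None):
    """One contrastive-divergence (CD-1) step for a binary RBM.

    v0: (batch, m) visible data, W: (m, n) weights,
    a: (m,) visible biases, b: (n,) hidden biases.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    ph0 = sigmoid(v0 @ W + b)                  # p(h_i = 1 | v), Eq. (2)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample hidden state
    pv1 = sigmoid(h0 @ W.T + a)                # p(v_j = 1 | h), Eq. (3)
    v1 = (rng.random(pv1.shape) < pv1) * 1.0   # one Markov-chain iteration
    ph1 = sigmoid(v1 @ W + b)
    # CD-1 gradient <h^0 v^0> - <h^1 v^1>, cf. Eq. (4)
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1)
    a = a + lr * (v0 - v1).mean(axis=0)
    b = b + lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```

Repeating this step over mini-batches constitutes the unsupervised pretraining phase; fine-tuning then proceeds by back-propagation over the unfolded network.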

In the EDBN, \({p}_{i}\) and \({p}_{j}\) correspond to the continuous states of the \(ith\) hidden neuron and the \(jth\) visible-layer node, respectively. The sigmoid activation in Eqs. (2-3) is retained, while the stochastic binary sampling is eliminated. A continuous transformation of Eqs. (2-3) is achieved by adding zero-mean Gaussian noise to the input of the sigmoid activation function; the transformed expressions are given in Eqs. (5-6):

$${p}_{i}={\phi }_{i}\left({\sum }_{j}^{m}{w}_{ji}{p}_{j}+\beta \cdot {N}_{i}\left(\text{0,1}\right)\right)$$
(5)
$${p}_{j}={\phi }_{j}\left({\sum }_{i}^{n}{w}_{ji}{p}_{i}+\beta \cdot {N}_{j}\left(\text{0,1}\right)\right)$$
(6)

Whereas,

$${\phi }_{i}\left({x}_{i}\right)={\theta }_{l}+\left({\theta }_{h}-{\theta }_{l}\right)\cdot \frac{1}{1+{e}^{-{q}_{i}{x}_{i}}}$$
(7)
$${\phi }_{j}\left({x}_{j}\right)={\theta }_{l}+\left({\theta }_{h}-{\theta }_{l}\right)\cdot \frac{1}{1+{e}^{-{q}_{j}{x}_{j}}}$$
(8)

Eqs. (5) and (6) represent the inference and learning process of the EDBN, where \(N(\text{0,1})\) denotes a Gaussian random variable with mean 0 and variance 1, \(\beta\) is a constant, \(\phi (\cdot )\) characterizes a sigmoid function with asymptotes \({\theta }_{h}\) and \({\theta }_{l}\), and \(q\) designates the noise-control variable, applied to control the slope of the sigmoid function. Based on the contrastive divergence rule, the update equations for the weights and biases are presented in Eqs. (9)-(11):

$$\Delta {w}_{ij}={\alpha }_{w}\left(\langle {p}_{j}^{0}{p}_{i}^{0}\rangle -\langle {p}_{j}^{1}{p}_{i}^{1}\rangle \right)$$
(9)
$$\Delta a=\frac{{\alpha }_{a}}{{a}^{2}}\left(\langle {p}_{i}^{{0}^{2}}\rangle -\langle {p}_{i}^{{1}^{2}}\rangle \right)$$
(10)
$$\Delta b=\frac{{\alpha }_{b}}{{b}^{2}}\left(\langle {p}_{j}^{{0}^{2}}\rangle -\langle {p}_{j}^{{1}^{2}}\rangle \right)$$
(11)

Now, \({\alpha }_{w},\) \({\alpha }_{a}\), and \({\alpha }_{b}\) characterize the learning rates of the network.

ELNN classifier

Elman proposed the ELNN as a kind of recurrent NN (RNN) containing many interconnected neurons46. It originates from the basic architecture of the back-propagation NN (BPNN) and adds a context layer to the HL. This additional layer functions as a one-step delay component, enabling the system to retain information about the overall system configuration encoded in the associations among neurons. NNs in general comprise self-organizing, feed-forward, and recurrent architectures. Feedback networks are especially distinctive in that they transfer data in either direction, backwards or forward; feedback information may affect neurons through various network layers or be limited to a particular layer. The BPNN is a widely adopted multilayered feed-forward NN with outstanding generalizability and nonlinear feature mapping. Its training procedure modifies the network weights during forward data propagation, adjusting the thresholds and weights so that the BPNN's predicted output gradually approaches the target output. The hierarchical structure of the Elman network typically consists of four distinct layers. In the HL, the signals are processed through the activation function; this layer additionally provides the feedback characteristics. Finally, the output layer produces the outcomes. Fig. 3 portrays the structure of the ELNN model.

Fig. 3

Structure of the ELNN model.
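A minimal sketch of the Elman forward pass described above, assuming a tanh hidden activation and a linear output layer (both assumptions; the section does not specify them):

```python
import numpy as np

def elman_forward(xs, W_in, W_rec, W_out, b_h, b_out):
    """Forward pass of a toy Elman network: the context layer feeds the
    previous hidden state back into the hidden layer at every time step."""
    h = np.zeros(W_rec.shape[0])       # context starts at zero
    outputs = []
    for x in xs:
        # hidden state mixes current input with the one-step-delayed context
        h = np.tanh(W_in @ x + W_rec @ h + b_h)
        outputs.append(W_out @ h + b_out)
    return np.array(outputs)
```

The `W_rec @ h` term is exactly the one-step delay component: at step t it injects the hidden state from step t-1, which is what lets the network model temporal dependencies in the token sequence.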

ITCN classifier

The TCN has been utilized for time series prediction by multiple researchers; nevertheless, in sparse modes it struggles to precisely capture the time-series process during computation, resulting in a lower signal-to-noise ratio, and the gradient may vanish or explode during training47. Based on the ideas of global average pooling (GAP) and soft thresholding, the ITCN is presented in this research.

Building on the standard residual connection, a sub-network that determines the threshold is inserted into the ITCN. The soft-thresholding function is presented in Eq. (12):

$$f\left(x\right)=\left\{\begin{array}{ll}x+\xi & x<-\xi \\ 0& \left|x\right|\le \xi \\ x-\xi & x>\xi \end{array}\right.$$
(12)

In Eq. (12), \(\xi\) represents the determined threshold, \(x\) signifies the input variable, and \(f(x)\) denotes the soft-thresholding function. Soft thresholding sets inputs with small magnitudes to zero while shrinking the remaining values toward zero. Determining the value of \(\xi\) is therefore critical; hence, a sub-network based on global average pooling is introduced, which determines it adaptively according to the characteristics of the input variables.

In the sub-network, GAP is applied to the output of the dropout layer. The resulting 1D vector is then fed into fully connected (FC) layers, and the final \(Sigmoid\) layer normalizes the output to values between zero and one; this scaling weight is denoted \(\gamma\). The threshold \(\xi\) can then be defined as in Eq. (13):

$$\xi =\gamma GAP\left(\left|x\right|\right)$$
(13)

Eq. (13) defines the threshold \(\xi\) that determines which inputs are set to zero. Because the threshold is derived from the features of the sample data, the method is adaptive and better able to extract effective features from the input data.
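A minimal sketch of the adaptive threshold of Eq. (13): GAP over \(|x|\) feeds a sub-network whose sigmoid output gives the scaling weight \(\gamma\). The two-weight sub-network here is an illustrative assumption; the paper's exact FC layout is not reproduced:

```python
import numpy as np

rng = np.random.default_rng(1)

def adaptive_soft_threshold(x, w1, w2):
    """Eq. (13): xi = gamma * GAP(|x|), with gamma produced by a tiny FC + sigmoid."""
    gap = np.mean(np.abs(x))                                  # global average pooling of |x|
    gamma = 1.0 / (1.0 + np.exp(-(w2 * np.tanh(w1 * gap))))   # sigmoid keeps gamma in (0, 1)
    xi = gamma * gap                                          # adaptive threshold
    shrunk = np.sign(x) * np.maximum(np.abs(x) - xi, 0.0)     # Eq. (12) applied with xi
    return shrunk, xi

x = rng.normal(size=32)
y, xi = adaptive_soft_threshold(x, w1=rng.normal(), w2=rng.normal())
```

Because the sigmoid output is strictly between zero and one, the learned threshold always stays below the mean absolute activation, so the layer never zeroes everything out.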

ISCO-based hyperparameter tuning

The ISCO-based hyperparameter selection procedure optimizes the recognition outcomes of the ensemble models48. This model is chosen for its superior exploration-exploitation balance and fast convergence rate. Inspired by the hunting behaviour of sand cats, it dynamically alters search directions using adaptive coefficients, a mechanism absent from conventional grid or random search. This ensures optimal parameter selection, particularly in high-dimensional search spaces. ISCO also avoids local optima more effectively than standard evolutionary or swarm-based techniques such as PSO or GA. Its lightweight structure and minimal computational cost make it ideal for fine-tuning complex ensemble models, enhancing accuracy and mitigating overfitting. Fig. 4 illustrates the working flow of the ISCO model.

Fig. 4 Workflow of the ISCO methodology.

The SCSO model simulates two key survival behaviours of sand cats: hunting and foraging. Compared to other population-based intelligence models, SCSO exhibits robust optimization abilities suitable for complex multi-objective problems. However, its search accuracy and convergence speed are limited, making it prone to local optima. To address this, three improvements are introduced to strengthen its global search capability, yielding the ISCSO method, which is applied to optimize edge node utilization. In particular, logistic chaotic mapping is used to initialize the population, exploiting its non-linearity, ergodicity, and randomness to improve convergence speed and precision, as shown in Eqs. (14)-(15).

$${X}_{i+1}=\gamma {X}_{i}\left(1-{X}_{i}\right)$$
(14)
$${Y}_{i+1}={l}_{\text{min}}+{X}_{i}\cdot \left({l}_{\text{max}}-{l}_{\text{min}}\right)$$
(15)

Here, \(\gamma\) serves as the control parameter; with \(\gamma >1\), the chaotic values \({X}_{i}\) generated by Eq. (14) fall within \(0<{X}_{i}<1\). \({Y}_{i}\) characterizes the position of the \(i\)th individual, and \({l}_{\text{max}}\) and \({l}_{\text{min}}\) delimit the search region of the population. At this stage, the sand cat identifies prey by evaluating the optimal position, the current position, and the sensitivity range, as described by Eq. (16):

$$pos\left(i+1\right)=r\left(po{s}_{bc}\left(i\right)-rand\left(\text{0,1}\right)\cdot po{s}_{c}\left(i\right)\right)$$
(16)

Here, \(po{s}_{bc}(i)\) denotes the optimal solution, \(po{s}_{c}(i)\) refers to the present location, and \(r\) signifies the sensitivity range. This mechanism enables the discovery of numerous search routes, helping individuals adjust their locations effectively. Next, combined with spiral exploration, individuals search the region in a spiral pattern. Expanding the model's exploration ability improves its chances of escaping local optima and enhances its overall global search performance. The updated position is characterized by Eq. (17):

$$pos\left(i+1\right)=\theta \cdot r\left(po{s}_{bc}\left(i\right)-rand\left(\text{0,1}\right)\cdot po{s}_{c}\left(i\right)\right)$$
(17)
$$\theta =exp\left(bg\right)cos\left(2\pi g\right)$$
(18)

Eq. (18) calculates the spiral exploration coefficient, signified as \(\theta\), where \(b\) denotes the spiral shape constant and \(g\) characterizes the route coefficient, with \(g\in [-\text{1,1}]\).
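The spiral coefficient of Eq. (18) can be evaluated directly; \(b=1\) below is an assumed value, not one specified by the paper:

```python
import numpy as np

def spiral_coeff(g, b=1.0):
    """Eq. (18): theta = exp(b*g) * cos(2*pi*g), route coefficient g in [-1, 1]."""
    return np.exp(b * g) * np.cos(2.0 * np.pi * g)

for g in (-1.0, 0.0, 1.0):
    print(g, spiral_coeff(g))
# g = 0 gives theta = 1 (no scaling); the magnitude grows with exp(b*g).
```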

During the prey-attacking stage, an arbitrary location, denoted as \(po{s}_{rand}(i)\), is generated from the best and current locations via Eq. (19). An arbitrary angle \(\alpha\) is then selected using the roulette model, and the attack is implemented according to Eq. (20):

$$po{s}_{rand}\left(i\right)=\left|rand\left(\text{0,1}\right)\cdot po{s}_{bc}\left(i\right)-po{s}_{c}\left(i\right)\right|$$
(19)
$$pos\left(i+1\right)=po{s}_{bc}\left(i\right)-po{s}_{rand}\left(i\right)\cdot r\cdot cos\left(\alpha \right)$$
(20)

In the standard model, prey attacks are performed at arbitrary angles, which may cause some optimal solutions to be overlooked; a Levy-flight-based update is therefore introduced, as defined in Eq. (21):

$$pos\left(i+1\right)=po{s}_{bc}\left(i\right)+\left(po{s}_{bc}\left(i\right)-po{s}_{c}\left(i\right)\right)\cdot C\cdot levy$$
(21)
$$levy=\frac{u}{{\left|v\right|}^{1/\beta }}$$
(22)

Here, the variables \(u\) and \(v\) follow normal distributions, \(u\sim N\left(0,{\sigma }_{u}^{2}\right)\) and \(v\sim N\left(0,{\sigma }_{v}^{2}\right)\), and \(C\) denotes a constant step adjustment coefficient. The comprehensive stages of the ISCSO model are presented in Algorithm 1.

Algorithm 1:

Pseudocode of the ISCO model

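The update loop above can be sketched compactly. The sphere test objective, population size, \(b=1\), \(C=0.5\), and the 50/50 phase-switching rule below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(42)

def chaotic_init(n, dim, lo, hi, gamma=4.0):
    """Eqs. (14)-(15): iterate the logistic map, then scale into [lo, hi]."""
    x = rng.uniform(0.01, 0.99, (n, dim))   # random seeds for the map
    for _ in range(10):                     # Eq. (14), applied elementwise
        x = gamma * x * (1.0 - x)
    return lo + x * (hi - lo)               # Eq. (15)

def levy_step(dim, beta=1.5):
    """Eq. (22): Levy-distributed step from two normal draws."""
    u = rng.normal(0.0, 1.0, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1.0 / beta)

def iscso(obj, dim=5, n=20, iters=100, lo=-5.0, hi=5.0):
    pop = chaotic_init(n, dim, lo, hi)
    fit = np.array([obj(p) for p in pop])
    best = pop[fit.argmin()].copy()
    for t in range(iters):
        r = 2.0 * (1.0 - t / iters)         # sensitivity range shrinks over time
        for i in range(n):
            g = rng.uniform(-1.0, 1.0)
            theta = np.exp(g) * np.cos(2.0 * np.pi * g)      # Eq. (18), b = 1
            if rng.random() < 0.5:          # spiral exploration, Eq. (17)
                cand = theta * r * (best - rng.random() * pop[i])
            else:                           # Levy-flight attack, Eq. (21), C = 0.5
                cand = best + (best - pop[i]) * 0.5 * levy_step(dim)
            cand = np.clip(cand, lo, hi)
            if obj(cand) < fit[i]:          # greedy acceptance
                pop[i], fit[i] = cand, obj(cand)
        best = pop[fit.argmin()].copy()
    return best, fit.min()

best, val = iscso(lambda x: np.sum(x ** 2))
print(val)   # near-zero minimum of the sphere function
```

Greedy acceptance guarantees the population fitness never worsens, while the Levy attack supplies the occasional long jump that helps escape local optima.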

The ISCO model facilitates the effective tuning of the OEMPTER-ISCSO model by utilizing adaptive behaviours inspired by sand cats, including spiral exploration, chaotic initialization, and Levy flights. This modification enhances the global search capability of the model and prevents local optima, ensuring faster convergence in high-dimensional spaces. ISCO achieves optimal model performance with minimal computational cost by dynamically adjusting parameters such as the learning rate, dropout, and layer configuration. This results in enhanced accuracy, mitigated overfitting, and efficient emotion recognition from text, making the system highly suitable for real-time communication in sustainable environments for individuals with disabilities. Table 2 depicts the hyperparameter values of the OEMPTER-ISCSO technique.

Table 2 Key parameters of the ISCO model for tuning the OEMPTER-ISCSO technique in high-dimensional search spaces.

Fitness selection is a substantial factor influencing the outcome of the ISCO model. The hyperparameter selection procedure concludes by evaluating the efficiency of the candidate solution encoded in the model. The ISCO model adopts accuracy as the foremost criterion for constructing the fitness function, expressed as follows:

$$Fitness =\text{ max }(P)$$
(23)
$$P=\frac{TP}{TP+FP }$$
(24)

Here, \(TP\) signifies the true positive count and \(FP\) denotes the false positive count.
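The fitness computation of Eqs. (23)-(24) is a single ratio; the counts below are arbitrary examples:

```python
def precision_fitness(true_positives, false_positives):
    """Eq. (24): P = TP / (TP + FP); ISCO maximizes this value per Eq. (23)."""
    return true_positives / (true_positives + false_positives)

print(precision_fitness(969, 31))   # 0.969
```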

Experimental analysis

The experimental validation of the OEMPTER-ISCSO approach is examined using the Emotion Detection from Text dataset49. The technique is simulated in Python 3.6.5 on a PC with an i5-8600K CPU, a GeForce 1050 Ti (4 GB) GPU, 16 GB RAM, a 250 GB SSD, and a 1 TB HDD. The parameter settings are: learning rate 0.01, ReLU activation, 50 epochs, dropout 0.5, and batch size 5. The dataset comprises 22,280 samples across eight emotion classes, as shown in Table 3; Table 4 illustrates sample texts.

Table 3 Details of the dataset.
Table 4 Sample texts.

Fig. 5 displays the classifier results of the OEMPTER-ISCSO approach under 80%TRPH and 20%TSPH. Fig. 5a and 5b present the confusion matrices, showing precise classification and identification of the distinct class labels. Fig. 5c-5d shows the PR and ROC analyses, which indicate high performance across all class labels. The confusion matrix illustrates robust classification for classes such as Sadness, with 3,960 correct predictions, and Happiness, with 3,932 correct predictions. In contrast, classes such as Worry and Neutral have lower TP counts, indicating challenges in these categories. During testing, the model maintained robust performance, with notable TP counts for Sadness (969) and Happiness (1011), though lower recall is observed for Worry (85) and Neutral (90). The PR and ROC curves exhibit strong TP rates for most classes, illustrating consistently high precision and accuracy. While recall for the Worry and Neutral classes is comparatively lower, the model delivers robust and reliable performance across the majority of emotion categories.

Fig. 5 80%TRPH and 20%TSPH of (a-b) confusion matrices and (c-d) PR and ROC curves.

Table 5 and Fig. 6 depict the text emotion recognition results of the OEMPTER-ISCSO approach under 80%TRPH and 20%TSPH. The performance implies that the OEMPTER-ISCSO approach has gained efficient results. According to 80%TRPH, the OEMPTER-ISCSO approach attained average \(acc{u}_{y}\), \(pre{c}_{n}\), \(rec{a}_{l}\), \({F1}_{measure}\), \(MCC\), and Kappa of 95.10%, 95.82%, 95.10%, 95.45%, 95.09%, and 96.88%, respectively. Similarly, according to 20%TSPH, the OEMPTER-ISCSO technique achieved average \(acc{u}_{y}\), \(pre{c}_{n}\), \(rec{a}_{l}\), \({F1}_{measure}\), \(MCC\), and Kappa of 95.33%, 96.05%, 95.33%, 95.67%, 95.34%, and 97.16%, respectively.
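The averaged metrics reported above can be computed from predictions and labels as follows. This macro-averaging sketch is an assumption, since the paper does not state its exact averaging scheme; the toy labels are arbitrary:

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    """Macro-averaged precision, recall, F1, plus overall accuracy."""
    precs, recs = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precs.append(tp / (tp + fp) if tp + fp else 0.0)
        recs.append(tp / (tp + fn) if tp + fn else 0.0)
    prec, rec = np.mean(precs), np.mean(recs)
    acc = np.mean(y_true == y_pred)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(macro_metrics(y_true, y_pred, 3))
```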

Table 5 Text emotion detection of OEMPTER-ISCSO model under 80%TRPH and 20%TSPH.
Fig. 6 Average of OEMPTER-ISCSO model under 80%TRPH and 20%TSPH.

In Fig. 7, the training (TRA) \(acc{u}_{y}\) and validation (VAL) \(acc{u}_{y}\) performances of the OEMPTER-ISCSO model under 80%TRPH and 20%TSPH are shown. The \(acc{u}_{y}\) values are calculated over 0–30 epochs. The figure shows that the TRA and VAL \(acc{u}_{y}\) values present an increasing trend, indicating the competency of the OEMPTER-ISCSO method with maximum performance across numerous repetitions. Moreover, the TRA and VAL \(acc{u}_{y}\) values remain close throughout the epochs, indicating diminished overfitting and demonstrating the optimal outcome of the OEMPTER-ISCSO method, which ensures reliable predictions on unseen samples.

Fig. 7 \(Acc{u}_{y}\) curve of OEMPTER-ISCSO model under 80%TRPH and 20%TSPH.

Fig. 8 demonstrates the TRA loss (TRALOS) and VAL loss (VALLOS) graph of the OEMPTER-ISCSO model under 80%TRPH and 20%TSPH. The loss values are computed across 0 to 30 epochs. The TRALOS and VALLOS values represent a diminishing tendency, indicating the proficiency of the OEMPTER-ISCSO method in balancing the tradeoff between data fitting and generalization. The consecutive decrease in loss values confirms the enhanced outcome of the OEMPTER-ISCSO method and the progressively tuned forecast solutions.

Fig. 8 Loss curve of OEMPTER-ISCSO model under 80%TRPH and 20%TSPH.

Fig. 9 exhibits the classifier analysis of the OEMPTER-ISCSO technique under 70%TRPH and 30%TSPH. Fig. 9a and Fig. 9b display the confusion matrices, which provide precise classification and identification of all classes. Fig. 9c-9d displays the PR and ROC curves, which show superior performance across all class labels. The TRPH exhibits robust classification performance, with high correct predictions in major classes such as Sadness and Happiness, while the TSPH consistently yields accurate predictions across categories. The PR curve illustrates high precision for most emotions, and the ROC curve demonstrates robust true positive rates, especially for classes such as Sadness, Surprise, Fun, and Happiness, emphasizing the efficiency of the model and its discrimination ability across the dataset.

Fig. 9 70%TRPH and 30%TSPH of (a-b) confusion matrices and (c-d) PR and ROC curves.

Table 6 and Fig. 10 depict the text emotion detection results of the OEMPTER-ISCSO approach under 70%TRPH and 30%TSPH. The results indicate that the OEMPTER-ISCSO approach has achieved effective performance. According to 70%TRPH, the OEMPTER-ISCSO method attains an average \(acc{u}_{y}\), \(pre{c}_{n}\), \(rec{a}_{l}\), \({F1}_{measure}\), \(MCC\), and Kappa of 95.93%, 96.68%, 95.93%, 96.30%, 96.02%, and 97.77%, respectively. Likewise, according to 30%TSPH, the OEMPTER-ISCSO method attains an average \(acc{u}_{y}\), \(pre{c}_{n}\), \(rec{a}_{l}\), \({F1}_{measure}\), \(MCC\), and Kappa of 95.55%, 96.85%, 95.55%, 96.16%, 95.88%, and 97.36%, respectively.

Table 6 Text emotion detection of OEMPTER-ISCSO model under 70%TRPH and 30%TSPH.
Fig. 10 Average of OEMPTER-ISCSO model under 70%TRPH and 30%TSPH.

Fig. 11 shows the TRA \(acc{u}_{y}\) and VAL \(acc{u}_{y}\) performances of the OEMPTER-ISCSO methodology under 70%TRPH and 30%TSPH. The \(acc{u}_{y}\) values are calculated over 0–30 epochs. The figure shows that the TRA and VAL \(acc{u}_{y}\) values exhibit an increasing trend, indicating the proficiency of the OEMPTER-ISCSO technique with enhanced performance through multiple repetitions. Additionally, the TRA and VAL \(acc{u}_{y}\) values remain relatively close across the epochs, indicating lesser overfitting and suggesting improved performance of the OEMPTER-ISCSO technique, which ensures steady predictions on unseen samples.

Fig. 11 \(Acc{u}_{y}\) curve of OEMPTER-ISCSO model under 70%TRPH and 30%TSPH.

Fig. 12 presents the TRALOS and VALLOS graphs of the OEMPTER-ISCSO model under 70%TRPH and 30%TSPH. The loss values are computed over a period of 0 to 30 epochs. The values of TRALOS and VALLOS exhibit a reducing trend, which indicates the competency of the OEMPTER-ISCSO approach in balancing the tradeoff between generalization and data fitting. The successive reduction in loss values also ensures the maximum performance of the OEMPTER-ISCSO approach and tunes the prediction results over time.

Fig. 12 Loss curve of OEMPTER-ISCSO model under 70%TRPH and 30%TSPH.

Table 7 and Fig. 13 present the comparative study of the OEMPTER-ISCSO method against existing methodologies20,21,50. The results indicate that the bc-LSTM, CRN, PCN, BERT-BiLSTM, XLNet, Bert, XLNet-BIGRU-Att, Base ViT, CrossViT, Cross Former, Early Convolutional ViT (Early ConViT), Mobile ViT, and Pooling-based Vision Transformer (PiT) techniques attain poorer performance. In contrast, the proposed OEMPTER-ISCSO approach achieves \(acc{u}_{y}\), \(pre{c}_{n}\), \(rec{a}_{l},\) and \({F1}_{measure}\) values of 95.93%, 96.68%, 95.93%, and 96.30%, respectively.

Table 7 Comparative study of OEMPTER-ISCSO model with existing approaches20,21,50.
Fig. 13 Comparative analysis of OEMPTER-ISCSO model with existing approaches.

The comparative analysis of the OEMPTER-ISCSO technique in terms of computation time (CT) is presented in Table 8 and Fig. 14. The results indicate that the OEMPTER-ISCSO model achieves superior performance. The OEMPTER-ISCSO approach presents a minimal CT of 4.71 sec, while the bc-LSTM, CRN, PCN, BERT-BiLSTM, XLNet, Bert, XLNet-BIGRU-Att, Base ViT, CrossViT, Cross Former, Early ConViT, Mobile ViT, and PiT models incur higher CT values of 7.10 sec, 18.89 sec, 18.82 sec, 16.78 sec, 14.84 sec, 8.56 sec, 13.83 sec, 13.05 sec, 11.30 sec, 8.406 sec, 9.708 sec, 10.18 sec, and 12.05 sec, respectively.

Table 8 CT outcome of OEMPTER-ISCSO technique with existing models.
Fig. 14 CT outcome of OEMPTER-ISCSO technique with existing models.

Table 9 demonstrates the ablation study of the OEMPTER-ISCSO methodology. The outputs show that applying ISCO to each model, the Deep Belief Network (DBN), Elman Neural Network (ELNN), and Temporal CNN (TCNN), consistently enhances performance across all metrics, including \(acc{u}_{y}\), \(pre{c}_{n}\), \(rec{a}_{l},\) and \({F1}_{measure}\). For instance, the TCNN model with ISCO attains the highest \(acc{u}_{y}\) of 95.84% and an \({F1}_{measure}\) of 65.45%, compared to 95.02% and 64.87% without ISCO, highlighting the efficiency of the ISCO method in improving generalization and fine-tuning. The fusion model without hyperparameter tuning attains lower performance than the individual ISCO-optimized models, emphasizing the significance of ISCO in optimizing model parameters and its substantial contribution to enhanced emotion recognition performance.

Table 9 Comparative performance evaluation of the OEMPTER-ISCSO methodology through ablation study against existing techniques.

Conclusion

This manuscript presents the OEMPTER-ISCSO method. Initially, the text pre-processing stage involves multiple levels to normalize and clean the input text. Then, the FastText method is employed for word embedding, transforming words into numerical vector representations. An ensemble of three classifiers, the EDBN, ELNN, and ITCN methods, is used for textual emotion detection. Additionally, the ISCO-based hyperparameter selection process is executed to optimize the detection outcomes of the ensemble models. The experimentation of the OEMPTER-ISCSO technique is accomplished using the Emotion Detection from Text dataset. The performance validation of the OEMPTER-ISCSO technique demonstrated a superior accuracy value of 95.84% over existing models. The limitations of the OEMPTER-ISCSO technique include reliance on specific datasets that may not fully represent the diverse range of real-world scenarios, potentially restricting the generalizability of the findings. Moreover, the proposed models' computational complexity and resource-intensive behaviour may affect their deployment in resource-constrained environments. The study also faces challenges in handling noisy and incomplete data, which could impact the accuracy of predictions. Furthermore, the real-time performance of the system under varying conditions needs additional optimization to ensure scalability. Future work should focus on improving the robustness of the model by integrating more diverse datasets and optimizing computational efficiency for real-time applications. Additionally, it could extend its practical utility by incorporating hybrid approaches and exploring the model's applicability in other domains, such as healthcare and industrial automation.