Introduction

Disability remains one of the significant societal challenges. Its residual effects appear as barriers in the form of mental or physical impairments that affect the development and participation of an individual1. As a result, intense efforts are made to eliminate such obstructions, since individuals with disabilities often rely on others to fulfil their needs. Artificial intelligence (AI) is a field of computer science that aims to devise intelligent computer systems displaying qualities associated with human intelligence, including learning, problem-solving, language comprehension, and decision-making2. One significant role of AI lies in natural language processing (NLP), which combines linguistic and computational methods to enable computers to support human-computer interaction (HCI) and human language understanding. Sentiment analysis (SA), text summarisation, machine translation, and speech recognition are among the research fields within NLP3. Emotions are commonly characterized into six kinds: sadness, joy, surprise, fear, disgust, and anger; further types, such as optimism and love, are also defined4. In contrast to facial expressions and speech, text sentences often convey emotion only indirectly. Due to this difficulty and vagueness, accurately identifying the emotions expressed in text is a challenging task. Determining the emotion of a given text is a complex process, as each word may have a distinct morphological style and meaning5.

Emotional interaction is a prevalent cognitive event in humans' everyday life, and precise emotion recognition underpins efficient human communication, decision-making, and interaction. With the growth of AI and big data, emotion recognition has become a standard research topic in both industry and academia6. As the most straightforward and fundamental medium, textual data arising from emotional communication is frequently employed to understand emotional states. Textual emotion recognition (TER) for people with disabilities is dedicated to automatically recognizing emotional categories in textual expressions, such as Angry, Sad, and Happy. TER is a more fine-grained analysis than SA and has attracted substantial interest from academic circles7. Humans identify emotions rapidly depending on their emotional states, whereas automated TER systems require computational approaches that must be continually created and enhanced to attain more precise emotion prediction. AI implemented in NLP leverages computational and linguistic methods to enable machines to comprehend phenomena such as emotions or sentiments from texts8. Thus, a primary objective is to examine thoughts, ideas, and opinions through polarity tasks encompassing both positive and negative perspectives. In the NLP domain, most TER analyses are performed at a coarse-grained sentiment level and are based on conventional machine learning (ML) methods9. Moreover, while existing surveys review advanced TER methodologies, deep learning (DL)-driven TER methods are often only concisely described without a detailed and thorough outline. DL models have since removed the reliance on manual feature extraction and achieved good results, frequently attaining state-of-the-art outcomes in TER tasks10.

This paper develops an Intelligent Emotion Recognition from Text Using a Hybrid Deep Learning Model and Word Embedding Process (IERT-HDLMWEP) model. The key contributions of this manuscript are mentioned below:

  • A unique IERT-HDLMWEP model is designed for a TER system for people with disabilities.

  • The text pre-processing step involves various standard stages designed to enhance analysis and reduce input data dimensionality, thereby supporting better feature extraction and improving overall model performance by filtering noise and standardizing input formats.

  • The IERT-HDLMWEP model adopts a hybrid feature representation by integrating Word2Vec, TF-IDF-CDW, and POS encoding to improve emotion detection in textual data. This integration captures semantic meaning, contextual weight, and syntactic structure, enriching feature quality. It enables the model to detect subtle emotional cues across varied linguistic expressions and improves classification robustness and generalization across diverse emotional contexts.

  • The hybrid C-BiG-A technique, integrating a CNN and BiGRU with an AM model, is utilized for the final classification. This architecture captures both local and long-range dependencies in the text, improving contextual understanding. The AM also refines its focus on emotion-relevant words, thereby improving interpretability. Overall, it enhances the model’s capability to distinguish complex emotional patterns in textual data.

  • This IERT-HDLMWEP methodology uniquely integrates convolutional and BiGRU with an AM to capture multi-level contextual features. By effectively incorporating spatial and temporal data, it improves emotion detection beyond conventional methods. The AM dynamically highlights crucial emotional cues, enhancing model focus and accuracy. This novel integration advances the handling of intrinsic linguistic patterns in emotional text classification.

Related works on TER

Kumar et al.11 implemented a multimodal emotion recognition method, Visual Spoken Textual Additive Net (VISTANet), for classifying emotions displayed by input images, text, and speech into separate types. A novel interpretability approach, K-Average Additive exPlanation (KAAP), is established, which identifies the significant textual, spoken, and visual features that lead to the prediction of a specific emotion type. VISTANet merges information from the text, speech, and image modalities, utilizing a hybrid of late and intermediate fusion, and autonomously adjusts the weights of the intermediate outputs by computing their weighted average. Di Luzio et al.12 explored explainability methods for binary deep neural network (DNN) frameworks in the context of emotion recognition via video analysis. The authors investigated the input features for binary classifiers that recognize emotions, utilizing facial behaviour analysis and an enhanced form of the Integrated Gradients explainability technique. Fu et al.13 proposed a spectral domain reconstruction graph neural network (SDR-GNN) model for incomplete multimodal learning in conversational emotion recognition. This model creates an utterance semantic interaction graph utilizing a sliding window that relies on context and speaker relations. Li et al.14 proposed a new emotion recognition system based on a curriculum learning (CL) approach (ERNetCL). This approach integrates a spatial encoder (SE), a temporal encoder (TE), and a CL loss. To mitigate the severe effects of emotional shifts and to progress from simple to complex samples, the CL concept is employed in the emotion recognition in conversation (ERC) task to continually refine the network's parameters. Kusal et al.15 presented a hybrid DL network based on a convolutional-recurrent architecture for detecting an individual's emotions from conversational text. The convolutional network extracts local dependencies and patterns and is intrinsically shift-invariant, while the recurrent network captures long-term relationships in sequential data. Feng et al.16 developed a multimodal speech emotion recognition technique that relies on multiscale MFCCs and a multiview AM, which can capture numerous audio emotional features and effectively merge emotion-specific features from the two feature types. Under various attention configurations and audio input states, the best emotion recognition precision is achieved by jointly leveraging four attention modules and three diverse scales of MFCCs. Zhang et al.17 proposed a novel technique for building emotion classification labels using language resources and density-based spatial clustering of applications with noise (DBSCAN). The method also incorporates the spatial- and frequency-domain features of emotional EEG signals and passes them to a serial network integrating a long short-term memory (LSTM) network and a convolutional neural network (CNN) for EEG emotion learning and identification. Omarov and Zhumanov18 proposed an innovative Bi-LSTM methodology for analyzing emotions in textual content. This technique leverages the strength of recurrent neural networks (RNNs) to capture both past and future context, providing a detailed interpretation of emotional content. By incorporating the backward and forward layers of the LSTM model, it efficiently learns the semantic representation of words and their dependencies within sentences.

Hicham and Nassera19 proposed stacked DL models integrating a robustly optimized BERT pretraining approach (RoBERTa) with the gated recurrent unit (GRU), LSTM, bidirectional GRU (BiGRU), and bidirectional LSTM (BiLSTM). These hybrid architectures are optimized using the adaptive moment estimation (Adam) optimizer. Mahajan, More, and Shah20 developed a novel multilabel dataset and evaluated various ML and DL techniques, comprising logistic regression (LR), support vector machine (SVM), naïve Bayes (NB), random forest (RF), LSTM, BiLSTM, GRU, and CNN, to capture both single and mixed emotional states. Zhu et al.21 developed a reliable medical question-answering system by utilizing knowledge embedding and a transformer-based architecture. The model also integrates a knowledge understanding layer and an answer generation layer to improve both the accuracy and ethical quality of responses. Khan et al.22 improved violence detection in surveillance videos by utilizing a two-stream DL model that integrates 3D convolutional networks with depth-wise convolutions. The model also utilizes RGB frame analysis with background suppression and optical flow to accurately capture violent actions while maintaining computational efficiency suitable for edge devices. Arumugam et al.23 proposed an Audio, Visual, and Text Emotions Fusion Network (AVTEFN) model that utilizes graph attention networks (GAT), a hybrid wav2vec 2.0 with CNN, and bidirectional encoder representations from transformers (BERT) combined with bidirectional gated recurrent units (Bi-GRU). Khan et al.24 proposed an advanced multimodal emotion recognition (MER) approach by utilizing a Joint Multi-Scale Multimodal Transformer (JMMT) with recursive cross-attention. Alyoubi and Alyoubi25 presented an optimized multimodal emotion recognition framework that integrates BERT/RoBERTa for text, wav2vec 2.0 for speech, and Residual Network (ResNet50)/Visual Geometry Group Network (VGG16) for facial expressions. The model also utilizes a transformer-based cross-modal attention mechanism and Shapley Additive Explanations (SHAP) to improve both classification accuracy and interpretability. Vani et al.26 introduced Text Fusion+, an integrated application that utilizes optical character recognition (OCR), NLP, and text-to-speech (TTS) technologies. The model also employs DL-based summarization and an NLP-driven question-answering module. Khan et al.27 presented a technique utilizing sequence learning methods, including LSTM networks and GRUs, together with their advanced forms such as bi-directional and multi-layer architectures, to enhance auditory emotion recognition. Ghous, Najam, and Jalal28 detected emotional states in individuals with cognitive disabilities by applying advanced ML methods. Initially, bandpass filtering (BF) and downsampling are used for pre-processing. The model also incorporates an adaptive automated feature selection and transformation (AAFST) technique with a multi-class SVM to improve the accuracy and reliability of emotion recognition. Patil et al.29 enabled early prediction of learning disabilities, specifically dyslexia, by employing handwritten text recognition and DNNs. Mishra et al.30 introduced a model utilizing MLP and transformer techniques such as GPT-4 and BERT.
The model performs sentiment analysis and emotion classification and is trained using the Adam optimizer to achieve high accuracy and generalization. Table 1 presents a comparative analysis of existing textual emotion recognition systems for individuals with disabilities.

Table 1 A comparative study of the reviewed techniques.

Despite their advancements, existing studies exhibit various limitations and research gaps. Multimodal approaches, such as VISTANet and SDR-GNN, are computationally intensive, which restricts their real-time applicability. Explainability methods, while improving interpretability, often concentrate on binary classification or limited modalities. Most models do not adequately address the complexity of mixed or multilabel emotions, particularly in conversational or code-mixed contexts. Additionally, few studies integrate curriculum learning or adaptive fusion mechanisms to improve model robustness. There is also a research gap in utilizing lightweight yet effective hybrid DL architectures for scalable emotion recognition. Addressing these gaps can improve generalizability and efficiency across diverse datasets and applications.

Research design and methodology

In this work, a new IERT-HDLMWEP model is developed for emotion identification using textual data. The proposed model aims to enhance the accuracy of a DL-based TER system in identifying and interpreting emotions in text, thereby improving communication support for individuals with disabilities. It encompasses text pre-processing, word embedding, and a hybrid classification method. Figure 1 indicates the complete process of the IERT-HDLMWEP model.

Fig. 1

Overview of the IERT-HDLMWEP model, comprising data preprocessing, word embedding, and a hybrid DL with AM. Evaluation metrics include accuracy, precision, recall, F-measure, and AUC score.

Text pre-processing method

Initially, the text pre-processing step involves several standard stages to enhance analysis and reduce the dimensionality of the input data. Information gathered from various resources, primarily social media, is often unstructured31. Raw data may be noisy and contain grammatical and spelling errors; therefore, texts need to be cleaned before examination. Considering that several words are insignificant and add nothing to the text (for example, special characters, prepositions, stop words, and punctuation), pre-processing is used to improve the analysis and reduce the input data dimensionality. The main steps in the complete process are listed below, followed by a minimal implementation sketch.

  • Substituting negative words. Negations are words like never, not yet, and no that invert the meaning of phrases or words. The SA aims to substitute the negation with an antonym; for instance, not good is substituted with bad, the opposite of good, so a sentence like “The car is not good” is converted into “The car is bad.” Still, particular negative words, like never, not, and no, and negative contractions, like doesn’t, mustn’t, and couldn’t, are frequently part of stop-word lists. Therefore, each negative word and contraction is first replaced with not, and the issue is then corrected after stop-word removal as part of the spelling corrections.

  • Lowercasing. This converts each character in the dataset to lowercase, in contrast to capitalizing proper nouns, names, and sentence-initial words. Capitalization is inconsistent on Twitter, as users often avoid capital letters, which creates a relaxed, conversational tone. Every letter in the text is therefore converted to the same case; the word ‘Ball’, for example, becomes ‘ball’.

  • Converting emoticons. Users frequently employ emoticons to express their emotions, thoughts, and feelings; converting each emoticon into its corresponding words therefore yields better outcomes.

  • Removing redundant information, comprising hashtags (#), additional spaces, punctuation, special characters ($, &, %, …), @usernames, URL references, stop words (e.g. ‘the’, ‘is’, ‘at’), non-ASCII characters, and numbers, which are discarded to preserve the consistency of English text encoding. Such information does not help predict the emotions expressed by users.

  • Expanding acronyms to their full forms using an acronym dictionary. Slang and abbreviations are informally written words often used on Twitter and should be restored to their original forms.

  • Changing words with repeated characters to their English roots. Users frequently elongate words with repeated letters (for example, ‘coooool’) to express their emotions.

  • PoS tagging. Constituent units of the text, such as nouns, verbs, adjectives, and adverbs, are recognized in this phase.

  • Tokenization. It deconstructs text into smaller textual units (for example, documents into sentences, and sentences into words).

  • Lemmatization. There are derivationally related families of words with similar meanings, like presidential, president, and presidency. This step aims to reduce derivational and inflectional forms of a word to a common base. It uses morphological and vocabulary analysis to remove inflectional endings and return the dictionary or base form of the word, i.e., the lemma. Applied to the word ‘saw’, lemmatization would return either ‘see’ or ‘saw’ according to whether the word was used as a verb or a noun. It reduces a word to a simpler form, comparable to the stem, but preserves word-related information such as PoS tags.
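The steps above can be combined into a compact pipeline. The following is a minimal sketch using NLTK; the cleaning rules, resource names, and stop-word handling are simplifying assumptions for illustration, not the exact implementation used in this work.

```python
# A minimal pre-processing sketch (illustrative, not the paper's exact pipeline).
# Assumes the standard NLTK punkt/stopwords/wordnet/tagger resources are installed.
import re
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

NEGATIONS = {"not", "no", "never"}                         # kept despite stop-word removal
STOP = set(stopwords.words("english")) - NEGATIONS
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                    # lowercasing
    text = re.sub(r"https?://\S+|@\w+|#", " ", text)       # URLs, @usernames, hashtag marks
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)             # 'coooool' -> 'cool'
    text = re.sub(r"[^a-z'\s]", " ", text)                 # punctuation, digits, non-ASCII
    tokens = [t for t in word_tokenize(text) if t not in STOP]
    # Lemmatize with a coarse PoS hint so 'saw' (verb) maps to 'see'
    return [lemmatizer.lemmatize(tok, "v" if tag.startswith("VB") else "n")
            for tok, tag in pos_tag(tokens)]

print(preprocess("The car is NOT good!! coooool http://x.co @user"))
```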

Word embedding-based Word2Vec approach

For the word embedding process, the IERT-HDLMWEP model employs a hybrid feature representation that combines Word2Vec, TF-IDF-CDW, and POS encoding for enhanced emotion detection in textual data32. Word2Vec is chosen for its superior capability to capture semantic relationships between words by representing them as vectors in a continuous space, which conventional bag-of-words models fail to do. Context-dependent word representations are effectively learned by the Word2Vec model, enabling it to comprehend subtle variations and similarities in language usage. This results in richer feature representations that enhance downstream tasks such as emotion detection. The model is appropriate for handling extensive textual datasets and is computationally efficient and scalable to large corpora. Its robustness stems from its ability to capture both syntactic and semantic information, outperforming simpler encoding techniques on complex language patterns. Overall, Word2Vec offers a balance of effectiveness and efficiency, making it a strong choice over conventional or more resource-intensive embedding methods.

Word2Vec

Word2Vec is a widely used static embedding model that encodes words as dense vectors in a fixed semantic space, where semantically related terms lie close to one another. It offers two training mechanisms: CBOW and Skip-gram. Skip-gram infers context words from a specified centre word, while CBOW forecasts the centre word by combining data from its neighbouring words. Here, the CBOW model is leveraged to develop the word embeddings. CBOW calculates the likelihood of the centre word \(w_{n}\), given its adjacent context words \(w_{c}\), depending on the embeddings of the surrounding words:

$$p\left(w_{n}\mid w_{c}\right)=\frac{\exp\left(w_{n}h_{n}\right)}{\sum_{w^{\prime}\in corpus}\exp\left(w^{\prime}h_{n}\right)}$$
(1)

Here, \(h_{n}\) denotes the average embedding vector of the surrounding context window, \(w_{n}\) is the centre word, and \(w_{c}\) represents the context words. Figure 2 illustrates the architecture of the Word2Vec model.

Fig. 2

Architecture of the Word2Vec method consisting of an input vector, a hidden layer with linear neurons, and an output layer with a softmax classifier.
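As a brief illustration, CBOW embeddings of this kind can be trained with the gensim library; the toy corpus and parameter values below are assumptions for demonstration only.

```python
# A minimal CBOW training sketch with gensim (sg=0 selects CBOW; sg=1 would select Skip-gram).
from gensim.models import Word2Vec

sentences = [["i", "feel", "so", "happy"], ["i", "feel", "very", "sad"]]  # toy pre-processed corpus
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=0)
vec = model.wv["happy"]                        # 300-D dense embedding for 'happy'
print(model.wv.most_similar("happy", topn=2))  # nearest neighbours in the embedding space
```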

TF-IDF-CDW

TF-IDF is a widely endorsed term-weighting model in NLP intended for assessing the importance of words in specific documents. Term frequency \((TF)\) reflects how frequently a term \(t_{i}\) occurs within a document \(d_{j}\):

$$TF\left(t_{i},d_{j}\right)=\frac{n_{i,j}}{\left|d_{j}\right|}$$
(2)

Here, \(n_{i,j}\) represents the number of occurrences of \(t_{i}\) in document \(d_{j}\), and \(\left|d_{j}\right|\) refers to the total word count of \(d_{j}\). IDF assesses the rarity of the term across the overall corpus:

$$IDF\left(t_{i}\right)=\log\frac{\left|D\right|}{1+\left|\left\{d_{j}:t_{i}\in d_{j}\right\}\right|}$$
(3)

Here, \(\left|D\right|\) denotes the total number of documents in the corpus, and \(\left|\left\{d_{j}:t_{i}\in d_{j}\right\}\right|\) is the number of documents containing term \(t_{i}\). Multiplying \(TF\) and \(IDF\) yields the TF-IDF score, which depicts the importance of term \(t_{i}\) in document \(d_{j}\):

$$TF\text{-}IDF\left(t_{i},d_{j}\right)=\frac{n_{i,j}}{\left|d_{j}\right|}\times\log\frac{\left|D\right|}{1+\left|\left\{d_{j}:t_{i}\in d_{j}\right\}\right|}$$
(4)

Consequently, some terms that occur regularly within particular classes but rarely elsewhere can receive inappropriately low weights. To tackle this restriction, an improved metric named CDW is introduced:

$$CDW\left(t_{i}\right)=1+\frac{\sum_{c\in C}P\left(c\mid t_{i}\right)\log P\left(c\mid t_{i}\right)}{\log\left|C\right|+1}$$
(5)

Here, \(P\left(c\mid t_{i}\right)\) denotes the proportion of documents containing \(t_{i}\) that belong to class \(c\), and \(\left|C\right|\) is the total number of classes. This weighting is designed to down-weight terms that are evenly distributed across classes while highlighting those with category-specific preferences; CDW is thus particularly effective at recognizing terms strongly associated with specific classes and assigns them a higher weight. Finally, the TF-IDF-CDW score is calculated by multiplying the TF-IDF value with the corresponding CDW value, thus capturing both document- and category-level term significance:

$$TF\text{-}IDF\text{-}CDW\left(t_{i},d_{j}\right)=TF\left(t_{i},d_{j}\right)\cdot IDF\left(t_{i}\right)\cdot CDW\left(t_{i}\right)$$
(6)

This formulation enables the method to assign higher significance to terms that contribute more to class distinction, thereby enhancing classification performance and improving the quality of the semantic representation.
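A compact NumPy sketch of this weighting is given below; the function name and the count-matrix input format are illustrative assumptions rather than the paper's code.

```python
# A sketch of TF-IDF-CDW (Eqs. 2-6). `counts` is a (documents x terms) raw-count matrix
# and `labels` a NumPy array of per-document class ids; both are assumed inputs.
import numpy as np

def tf_idf_cdw(counts, labels):
    tf = counts / counts.sum(axis=1, keepdims=True)                 # Eq. (2)
    df = (counts > 0).sum(axis=0)                                   # documents containing each term
    idf = np.log(counts.shape[0] / (1 + df))                        # Eq. (3)
    classes = np.unique(labels)
    # P(c|t): share of the documents containing term t that belong to class c
    p = np.stack([(counts[labels == c] > 0).sum(axis=0)
                  for c in classes]) / np.maximum(df, 1)
    logp = np.log(p, where=p > 0, out=np.zeros_like(p, dtype=float))
    cdw = 1 + (p * logp).sum(axis=0) / (np.log(len(classes)) + 1)   # Eq. (5)
    return tf * idf * cdw                                           # Eqs. (4) and (6) combined
```

A term concentrated in one class has zero entropy, so its CDW stays at 1, while a term spread evenly across classes is down-weighted, matching the category-specific emphasis described above.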

POS encoding

POS acts as a basic syntactic indicator in language analysis, depicting the grammatical role of a word within a sentence. Integrating POS data can enhance the understanding of the grammatical framework and the semantic context. In text classification tasks in particular, POS tags such as verbs and nouns tend to contribute more to classifier outcomes. Therefore, integrating POS features enhances the syntactic awareness of the methodology and its classification performance.

A random vector-based POS encoding approach is adopted. In particular, every distinct POS class is assigned a random 10-D vector that is adjusted during training. The TF-IDF-CDW weight for every word, \(TIC_{1:n}=\{tic_{1},tic_{2},\cdots,tic_{n}\}\), is then calculated as described in the equations above, and the POS encoding vectors \(POS_{1:n}=\{pos_{1},pos_{2},\cdots,pos_{n}\}\) are derived using this method. Finally, the improved embedding vector \(V_{1:n}\) for every word is formed by scaling the Word2Vec vector \(pv_{i}\) with its corresponding TF-IDF-CDW weight \(tic_{i}\) and appending its 10-D POS encoding vector \(pos_{i}\), producing a 310-D representation:

$$V_{1:n}=\left\{pv_{1}\cdot tic_{1}+pos_{1},\;pv_{2}\cdot tic_{2}+pos_{2},\dots,\;pv_{n}\cdot tic_{n}+pos_{n}\right\}$$
(7)

The resulting improved word embedding vector reflects how the importance of the same word differs across diverse texts and additionally incorporates POS data. The constructed embedding matrix emphasizes key semantic components while reducing noise interference in the classification method.
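A sketch of how Eq. (7) can be assembled is shown next; `w2v`, `tic_weight`, and `pos_table` are hypothetical lookup structures introduced for illustration only.

```python
# Building the 310-D hybrid embedding of Eq. (7): the 300-D Word2Vec vector pv_i scaled
# by its TF-IDF-CDW weight tic_i, with the word's 10-D POS vector pos_i appended.
import numpy as np

rng = np.random.default_rng(0)
pos_table = {t: rng.normal(size=10) for t in ("NN", "VB", "JJ", "RB")}  # adjusted during training

def hybrid_embed(tokens, tags, w2v, tic_weight):
    rows = []
    for tok, tag in zip(tokens, tags):
        pv = w2v.get(tok, np.zeros(300))          # 300-D Word2Vec vector pv_i
        tic = tic_weight.get(tok, 1.0)            # TF-IDF-CDW scalar weight tic_i
        pos = pos_table.get(tag, np.zeros(10))    # 10-D POS encoding pos_i
        rows.append(np.concatenate([pv * tic, pos]))
    return np.stack(rows)                         # (n, 310) sentence embedding matrix
```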

Hybrid classification process

Finally, the hybrid C-BiG-A technique is employed for the classification process. This hybrid model is chosen for its ability to effectively capture both local patterns and long-range dependencies in textual data. The Bi-GRU model processes sequential data in both forward and backward directions, while the CNN excels at extracting spatial features and local n-gram patterns, capturing contextual relationships. The AM further enhances the model by selectively focusing on the most relevant parts of the input, thereby improving both interpretability and performance. This fusion model effectively balances complexity and efficiency, yielding superior emotion classification accuracy, particularly for complex and mixed emotions, compared to standalone models. Its architecture is well suited to handling varying text lengths and diverse linguistic structures, making it a robust choice over simpler or less adaptive models. Figure 3 shows the structure of the C-BiG-A technique.

Fig. 3

Framework of C-BiG-A technique consisting of input, convolutional, recurrent, attention, and dense layers resulting in the softmax-based output.

The CNN can effectively extract features and information33. By sliding the convolutional kernel across various input sequence positions, it can successfully capture the changing local features and patterns of the sequence data; the convolutional layer extracts local features from the time-series data using a sliding window. While the RNN has the benefit of extracting contextually related data, it may capture only a limited range of contextual information and suffers from the long-term dependency problem. It is therefore essential to introduce a gating mechanism into the RNN architecture to retain the required data; this not only offers an efficient solution to the problems of gradient explosion and gradient vanishing, but also partially addresses the difficulty of transferring information over longer distances. The commonly applied RNN frameworks with gate mechanisms are the LSTM and GRU structures, which possess strong temporal processing abilities. Unlike the LSTM, the GRU does not maintain a separate memory cell; its internal architecture has only two gates, which reduces the number of training parameters and results in faster training. The forward computation of the GRU is as follows:

$$r_{t}=\sigma\left(W_{r}\left[h_{t-1},x_{t}\right]+b_{r}\right)$$
(8)
$$z_{t}=\sigma\left(W_{z}\left[h_{t-1},x_{t}\right]+b_{z}\right)$$
(9)
$$\tilde{h}_{t}=\tanh\left(W_{h}\left[r_{t}*h_{t-1},x_{t}\right]+b_{h}\right)$$
(10)
$$h_{t}=\left(1-z_{t}\right)*h_{t-1}+z_{t}*\tilde{h}_{t}$$
(11)
$$y_{t}=softmax\left(W_{0}\cdot h_{t}+b_{0}\right)$$
(12)

where \(x_{t}\) denotes the input vector at the current time step; \(r_{t}\) and \(z_{t}\) are the reset and update gates; \(\tilde{h}_{t}\) is the candidate hidden state; \(\sigma\) is the sigmoid function; \(W_{r}\), \(W_{z}\), and \(W_{h}\) denote the weight matrices; and \(b_{r}\), \(b_{z}\), and \(b_{h}\) denote the bias vectors.
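A direct NumPy transcription of Eqs. (8)–(11) is sketched below, assuming \([h,x]\) denotes vector concatenation; the dimensions and random weights are purely illustrative.

```python
# One forward GRU step, following Eqs. (8)-(11) literally.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    hx = np.concatenate([h_prev, x_t])                                # [h_{t-1}, x_t]
    r = sigmoid(W_r @ hx + b_r)                                       # reset gate, Eq. (8)
    z = sigmoid(W_z @ hx + b_z)                                       # update gate, Eq. (9)
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)  # candidate state, Eq. (10)
    return (1 - z) * h_prev + z * h_tilde                             # new hidden state, Eq. (11)

d, h = 4, 3                                        # toy input and hidden sizes
rng = np.random.default_rng(0)
W = lambda: rng.normal(size=(h, h + d))
h_t = gru_step(rng.normal(size=d), np.zeros(h), W(), W(), W(),
               np.zeros(h), np.zeros(h), np.zeros(h))
```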

The Bi-GRU model can learn from both past and future information, so it learns the input sequence more comprehensively and avoids losing information when processing longer sequences. \(g_{t}^{\prime}\) and \(g_{t}\) denote the corresponding outputs of the backward and forward GRU layers at time \(t\), computed as follows:

$$g_{t}^{forward}=GRU\left(x_{t},h_{t-1}^{forward}\right)$$
(13)
$$g_{t}^{backward}=GRU\left(x_{t},h_{t+1}^{backward}\right)$$
(14)

The Bi-GRU output \(O_{t}\) is given by:

$$O_{t}=\overrightarrow{W}g_{t}+\overleftarrow{W}g_{t}^{\prime}+b_{t}$$
(15)

where \(\overleftarrow{W}\) and \(\overrightarrow{W}\) denote the weight matrices of the backward and forward GRU structures, respectively, and \(b_{t}\) denotes the bias vector of the output layer.

The model employs a linear generalized attention mechanism, which achieves near-linear growth in both time and space complexity and significantly reduces training time, particularly when processing longer sequences. The attention matrix \(A\in\mathbb{R}^{L\times L}\) is defined as:

$$A(i,j)=K\left(q_{i}^{T},k_{j}^{T}\right)$$
(16)

Here, \(q_{i}\) and \(k_{j}\) denote the \(i\)th and \(j\)th row vectors of the query \(Q\) and key \(K\), respectively. The kernel \(K\) is represented as:

$$K\left(x,y\right)=\mathbb{E}\left[\varphi\left(x\right)^{T}\varphi\left(y\right)\right]$$
(17)

Here, \(\varphi\left(x\right)\) signifies a feature mapping function. If \(Q^{\prime},K^{\prime}\in\mathbb{R}^{L\times p}\), their row vectors are given by \(\varphi\left(q_{i}^{T}\right)\) and \(\varphi\left(k_{j}^{T}\right)\), respectively. The efficient attention, based on this kernel description, is formulated as:

$$\widehat{Att}\left(Q,K,V\right)={\widehat{D}}^{-1}\left(BV\right)$$
(18)

where \(B=Q^{\prime}(K^{\prime})^{T}\) and \(\widehat{D}=diag\left(B1_{L}\right)\). Here, \(\widehat{Att}\) denotes the approximated attention, and the brackets specify the order of computation. Table 2 specifies the key hyperparameters of the C-BiG-A technique.
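The overall classifier can be sketched in PyTorch as below; the layer sizes are illustrative assumptions (Table 2 lists the actual hyperparameters), and \(\varphi(x)=\text{elu}(x)+1\) is used here as one possible feature map for the kernel attention of Eqs. (16)–(18).

```python
# A minimal sketch of the C-BiG-A pipeline: Conv1d -> BiGRU -> kernel-based linear
# attention -> dense softmax head. Sizes are assumptions, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_attention(q, k, v):
    qp, kp = F.elu(q) + 1, F.elu(k) + 1                           # positive feature maps φ(·)
    kv = torch.einsum("bld,ble->bde", kp, v)                      # B ~ K'^T V, computed first
    z = 1.0 / (torch.einsum("bld,bd->bl", qp, kp.sum(1)) + 1e-6)  # normalizer D^{-1}
    return torch.einsum("bld,bde,bl->ble", qp, kv, z)             # ≈ D^{-1}(Q'K'^T V), Eq. (18)

class CBiGA(nn.Module):
    def __init__(self, emb=310, ch=128, hid=128, n_classes=12):
        super().__init__()
        self.conv = nn.Conv1d(emb, ch, kernel_size=3, padding=1)  # local n-gram features
        self.bigru = nn.GRU(ch, hid, batch_first=True, bidirectional=True)
        self.qkv = nn.Linear(2 * hid, 6 * hid)                    # project to Q, K, V
        self.head = nn.Sequential(nn.Dropout(0.5), nn.Linear(2 * hid, n_classes))

    def forward(self, x):                                  # x: (batch, seq, emb)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.bigru(h)                               # (batch, seq, 2*hid)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        return self.head(linear_attention(q, k, v).mean(1))  # pooled class logits

print(CBiGA()(torch.randn(2, 40, 310)).shape)              # torch.Size([2, 12])
```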

Table 2 Key hyperparameters of the C-BiG-A model.

Experimental validation

The performance assessment of the IERT-HDLMWEP method is examined using the Emotion Detection from Text dataset34. The technique runs on Python 3.6.5 with an i5-8600k CPU, a 4 GB GPU, 16 GB RAM, a 250 GB SSD, and a 1 TB HDD, using a learning rate of 0.01, ReLU activation, 50 epochs, a dropout of 0.5, and a batch size of 5. The dataset consists of a total of 39,173 samples, categorized into 12 sentiments, as outlined in Table 3 below. Table 4 presents sample text.

Table 3 Details of the dataset.
Table 4 Sample text.

Figure 4 shows the confusion matrices produced by the IERT-HDLMWEP approach at 80:20 and 70:30 ratios of the training phase (TRPHE) to the testing phase (TSPHE). The results show that the IERT-HDLMWEP technique effectively detects and identifies each class.

Fig. 4

Confusion matrices of (a, c) TRPHE of 80% and 70% and (b, d) TSPHE of 20% and 30%.

Table 5 and Fig. 5 present the textual emotion detection outcomes of the IERT-HDLMWEP technique at 80:20. Under 80% TRPHE, the IERT-HDLMWEP technique attains an average \(accu_{y}\) of 99.64%, \(prec_{n}\) of 94.95%, \(reca_{l}\) of 89.19%, \(F_{Measure}\) of 91.22%, \(AUC_{Score}\) of 94.49%, and Kappa of 94.56%. On 20% TSPHE, the IERT-HDLMWEP model obtains an average \(accu_{y}\) of 99.67%, \(prec_{n}\) of 93.61%, \(reca_{l}\) of 88.54%, \(F_{Measure}\) of 89.97%, \(AUC_{Score}\) of 94.18%, and Kappa of 94.25%.

Table 5 Textual emotion detection outcome of IERT-HDLMWEP model under 80%:20%.

Table 6 and Fig. 6 portray the textual emotion detection outcomes of the IERT-HDLMWEP method at 70:30. Under 70% TRPHE, the IERT-HDLMWEP method attains an average \(accu_{y}\) of 99.31%, \(prec_{n}\) of 90.71%, \(reca_{l}\) of 80.81%, \(F_{Measure}\) of 82.83%, \(AUC_{Score}\) of 90.12%, and Kappa of 90.28%. Similarly, on 30% TSPHE, the IERT-HDLMWEP technique obtains an average \(accu_{y}\) of 99.38%, \(prec_{n}\) of 93.88%, \(reca_{l}\) of 80.87%, \(F_{Measure}\) of 82.93%, \(AUC_{Score}\) of 90.26%, and Kappa of 90.32%.

Fig. 5

Average values of the IERT-HDLMWEP model under 80%:20%.

Fig. 6

Average values of the IERT-HDLMWEP model under 70%:30%.

Table 6 Textual emotion detection outcome of IERT-HDLMWEP model under 70%:30%.

Figure 7 exemplifies the training (TRAIN) \(accu_{y}\) and validation (VALID) \(accu_{y}\) of the IERT-HDLMWEP approach at an 80:20 ratio over 25 epochs. Initially, both TRAIN and VALID \(accu_{y}\) rise quickly, representing efficient pattern learning from the data. In later epochs, the VALID \(accu_{y}\) slightly exceeds the TRAIN accuracy, signifying good generalization without over-fitting. As training advances, performance increases and the gap between TRAIN and VALID narrows. The close alignment of both curves during training implies that the model is well-regularised and generalizes well, demonstrating the approach's strong capability to learn and retain useful features across both seen and unseen data.

Fig. 7

\(Accu_{y}\) curve of the IERT-HDLMWEP model under 80:20.

Figure 8 demonstrates the TRAIN and VALID losses of the IERT-HDLMWEP technique at 80:20 over 25 epochs. Initially, both TRAIN and VALID losses are higher, showing that the method begins with a partial understanding of the data. As training evolves, both losses continually decrease, indicating that the technique is efficiently learning and optimizing its parameters. The close alignment between the TRAIN and VALID loss curves in training implies that the model hasn’t overfitted and retains good generalization to unseen data. This reliable and steady decrease in loss shows a well-trained, stable, and consistent DL model.

Fig. 8

Loss curve of the IERT-HDLMWEP model under 80:20.

In Fig. 9, the precision-recall (PR) inspection study of the IERT-HDLMWEP methodology on the 80:20 dataset provides insights into its performance by charting Precision against Recall for all classes. The figure illustrates that the IERT-HDLMWEP technique consistently achieves increased PR values across diverse classes, indicating its potential in maintaining a significant share of true positive predictions among all positive predictions (precision), while also capturing a substantial portion of actual positives (recall). The steady improvement in PR results across each class depicts the efficiency of the IERT-HDLMWEP technique during the classifier process.

In Fig. 10, the ROC analysis of the IERT-HDLMWEP technique is examined under an 80:20 ratio. The results indicate that the IERT-HDLMWEP method achieves elevated ROC values across all classes, demonstrating a significant ability to differentiate between class labels. This consistent pattern of increased values of ROC across numerous class labels indicates the effective results of the IERT-HDLMWEP method in class prediction, underscoring the robust nature of the classification process.

Fig. 9

PR curve of the IERT-HDLMWEP method under 80:20.

Fig. 10

ROC curve of the IERT-HDLMWEP method at 80:20.

Table 7 and Fig. 11 demonstrate the comparative analysis of the IERT-HDLMWEP method with current techniques under various metrics19,20,35,36. The outcomes show that the IERT-HDLMWEP model attained higher \(accu_{y}\), \(prec_{n}\), \(reca_{l}\), and \(F_{Measure}\) of 99.67%, 93.61%, 88.54%, and 89.97%, respectively, whereas the existing methodologies, namely RoBERTa, BiGRU, RF, U-Net, Improved DBN-SVM, Bi-GRU, XLNet, DAN, Bi-LSTM, and SVM, showed worse performance under these metrics.

Fig. 11

Comparative analysis of the IERT-HDLMWEP model with existing methods.

Table 7 Comparative analysis of the IERT-HDLMWEP model with existing methods19,20,35,36.

Table 8 and Fig. 12 compare the computational time (CT) of the IERT-HDLMWEP methodology with that of existing techniques. The IERT-HDLMWEP methodology presents a lower CT of 8.06 s, while the RoBERTa, BiGRU, RF, U-Net, Improved DBN-SVM, Bi-GRU, XLNet, DAN, Bi-LSTM, and SVM methodologies attained higher CTs of 16.99 s, 13.00 s, 17.90 s, 18.56 s, 21.82 s, 28.89 s, 15.36 s, 21.64 s, 12.59 s, and 10.43 s, respectively.

Fig. 12

CT outcome of IERT-HDLMWEP methodology with existing models.

Table 8 CT outcome of IERT-HDLMWEP methodology with existing models.

Table 9 and Fig. 13 present the error analysis of the IERT-HDLMWEP technique against existing methods; since these values represent error rates, lower is better. RoBERTa exhibited the highest overall errors, with an \(accu_{y}\) error of 10.25%, \(prec_{n}\) error of 13.19%, \(reca_{l}\) error of 13.27%, and \(F_{Measure}\) error of 17.07%. U-Net followed closely with errors of 11.03%, 13.80%, 13.95%, and 17.63%, respectively. Conventional models such as RF and SVM attained moderate error levels: RF showed an \(accu_{y}\) error of 1.87%, \(prec_{n}\) error of 6.97%, \(reca_{l}\) error of 17.86%, and \(F_{Measure}\) error of 17.84%, while SVM recorded 1.97%, 8.93%, 13.58%, and 15.67%. BiGRU and Bi-GRU obtained low \(accu_{y}\) errors of 5.21% and 2.61%, respectively, but relatively high recall errors above 17%, indicating a bias in detecting positives. XLNet recorded an \(accu_{y}\) error of 3.43%, \(prec_{n}\) error of 7.82%, \(reca_{l}\) error of 19.25%, and the highest \(F_{Measure}\) error of 18.78%. The IERT-HDLMWEP model attained the lowest errors overall, with an \(accu_{y}\) error of 0.33%, \(prec_{n}\) error of 6.39%, \(reca_{l}\) error of 11.46%, and \(F_{Measure}\) error of 10.03%. These results indicate that the proposed representations and architectural enhancements consistently reduce both precision and recall errors, resulting in more balanced and reliable classification performance.

Fig. 13

Error analysis of IERT-HDLMWEP technique with existing methods.

Table 9 Error analysis of IERT-HDLMWEP technique with existing methods.

Table 10 and Fig. 14 present the ablation study of the IERT-HDLMWEP approach, which illustrates the progressive improvement in performance as diverse components are integrated into the model. Starting with Word2Vec, the baseline model attained an \(accu_{y}\) of 97.06%, \(prec_{n}\) of 90.92%, \(reca_{l}\) of 85.81%, and \(F_{Measure}\) of 87.06%. The TF-IDF-CDW approach slightly improved performance, with \(accu_{y}\) of 97.81%, \(prec_{n}\) of 91.46%, \(reca_{l}\) of 86.57%, and \(F_{Measure}\) of 87.75%. Incorporating the PSE technique further enhanced the metrics to \(accu_{y}\) of 98.50%, \(prec_{n}\) of 92.23%, \(reca_{l}\) of 87.30%, and \(F_{Measure}\) of 88.51%. The C-BiG-A model significantly outperformed the previous variants, reaching \(accu_{y}\) of 99.03%, \(prec_{n}\) of 92.98%, \(reca_{l}\) of 87.80%, and \(F_{Measure}\) of 89.28%. The final IERT-HDLMWEP model attained the best results across all metrics, with an \(accu_{y}\) of 99.67%, \(prec_{n}\) of 93.61%, \(reca_{l}\) of 88.54%, and \(F_{Measure}\) of 89.97%, confirming the efficiency of the complete hybrid architecture.

Fig. 14

Ablation study-based comparative analysis of the IERT-HDLMWEP methodology.

Table 10 Ablation study-based comparative analysis of the IERT-HDLMWEP methodology.

Conclusion

In this study, a novel emotion detection model, named the IERT-HDLMWEP method, is developed to accurately identify and interpret emotions in text, thereby enhancing communication support for people with disabilities. First, the text pre-processing stage involves several standard steps to enhance analysis and minimize the dimensionality of the input data. For the word embedding process, the IERT-HDLMWEP method generates a hybrid feature representation by integrating pre-trained Word2Vec vectors weighted using TF-IDF-CDW and enriched with POS feature vectors to enhance emotion detection in textual data. Finally, the hybrid C-BiG-A technique is employed for the classification process. A comprehensive simulation was implemented to verify the performance of the IERT-HDLMWEP methodology, and the empirical results indicated that it improved over other recent techniques. The limitations of the IERT-HDLMWEP methodology include its dependence on general textual data. Although disability-specific data is not required due to the text-based nature of the approach, the present model may not capture all contextual variations relevant to specific user groups and may not generalize well to other languages or domains with diverse linguistic styles or informal expressions. Performance may also degrade on highly informal or noisy text, which is typical of some real-world communications. Future work should focus on enhancing the system's robustness to diverse linguistic styles and expanding its adaptability to various domains. Incorporating user feedback mechanisms could further personalize emotion recognition, and integrating this text-based model with other assistive technologies may improve overall support for people with intellectual disabilities. Finally, developing real-time processing capabilities and user-friendly interfaces will be crucial for practical deployment.