Introduction

Corpus linguistics plays an essential role in textual emotion recognition (TER) by offering empirical linguistic data and principled methods for detecting emotions in text. Through the systematic examination of large, representative text corpora, researchers can reveal linguistic patterns, keywords, and syntactic structures that are widely associated with diverse emotional states1. Corpus linguistics aids in building emotion-labelled datasets, identifying culture- or context-specific emotions, and modelling emotion lexicons that enhance the outcomes of emotion identification approaches. Moreover, corpus-driven works expose refined linguistic cues, such as metaphor use, intensifiers, or negation patterns, which are significant for precisely recognizing emotions in text2. Emotion recognition holds a promising position in the domains of artificial intelligence (AI) and human-computer interaction (HCI). Different kinds of signals are employed to identify emotions in humans, such as textual information, facial expressions, heartbeat, body movements, and other physiological cues. In computational linguistics, detecting emotions in text is gaining importance from a practical perspective: the internet now contains a vast volume of textual data, and extracting emotions from it for objectives such as business analysis is of great interest3. As the most fundamental and straightforward medium, textual data produced during emotional communication is frequently employed to infer an individual’s emotional condition. Furthermore, in recent years, with the swift progress of social platforms and intelligent devices, online social networks such as Facebook, Weibo, Line, and Twitter have emerged as an unmatched worldwide trend. TER aims at automatically detecting emotional states in textual expressions, such as happiness, sadness, and anger4. TER involves a deeper examination than sentiment analysis (SA) and has attracted significant attention from the academic community. Individuals recognize emotions swiftly through their internal feelings, whereas automated TER systems require computational approaches that must be regularly enhanced and refined to provide more precise emotion detection5. The increasing attention to TER is driven by the expansion of social networking platforms and online data, which enable users to interact and express views on diverse subjects. This interest has also led to novel natural language processing (NLP) techniques centred on TER detection and classification.

Studies on TER have influenced numerous practical applications, such as healthcare, HCI, consumer evaluation, and education. NLP encompasses the computational and linguistic approaches that enable machines to interpret and generate human language through text and speech6, and it focuses on building models for various functions such as emotion, interpretation, thought, and sentiment. SA identifies the sentiment of a given text as positive, neutral, or negative. Textual data frequently holds and conveys emotion in multiple forms, posing challenges in capturing and understanding it7. Emotions can be communicated clearly through emotion-related terms and strong language, making them easy to examine; however, that is not always the case. Factors like sentence formation, word choice, structure, and the use of punctuation can make emotion recognition more challenging8. In today’s world, technological progress is indispensable, impacting lives globally by improving productivity and efficiency. AI is increasingly valued for its ability to perform tasks similar to humans, driving significant advancements9. Deep learning (DL) is among the well-known branches of AI, and experts are exploring, implementing, and applying its strengths in different technological sectors, such as software engineering and information processing. Lately, DL methods have reduced the reliance on manual feature extraction and attained satisfactory outcomes10. However, according to recent surveys on deep emotion recognition approaches, the majority of works emphasize visual, audio, and multimodal modalities. Automatic detection of emotional cues assists businesses and researchers in gaining deeper insights into user sentiments and behaviours. Using linguistic patterns within extensive text collections allows for more accurate and culturally aware emotion detection. Thus, a model is needed that captures subtle discrepancies in language, such as sarcasm or emphasis, resulting in more reliable emotion analysis across diverse contexts.

This paper proposes an Emotion Detection in Text via Integrated Word Vector Representation and Multi-Model Deep Neural Networks (EDTIWVR-MDNN) approach. The EDTIWVR-MDNN approach intends to develop effective textual emotion recognition and analysis to enhance sentiment understanding in natural language. Initially, the text pre-processing stage involves various levels for extracting the relevant data from the text. Furthermore, the word embedding process is performed by the fusion of the term frequency-inverse document frequency (TF-IDF), bi-directional encoder representations from transformers (BERT), and GloVe techniques to transform textual data into numerical vectors. For the classification process, the EDTIWVR-MDNN technique employs a hybrid of an attention mechanism-based temporal convolutional network and a bi-directional gated recurrent unit (AM-T-BiG) model. The experimental evaluation of the EDTIWVR-MDNN methodology is examined under the Emotion Detection from Text dataset. The significant contributions of the EDTIWVR-MDNN methodology are listed below.

  • The framework employs a comprehensive multi-level text pre-processing pipeline for cleaning, normalizing, and refining raw textual data, ensuring the extraction of relevant and noise-free information. This process includes tokenization, stop-word removal, stemming or lemmatization, and case normalization. It also improves the semantic quality of features and lays a robust foundation for reliable word embedding and accurate classification.

  • The integrated TF-IDF, BERT, and GloVe techniques are utilized in the word embedding process to transform textual data into rich and informative numerical vectors. This integration leverages the merits of both statistical and contextual embeddings, significantly enhancing the representation of semantic relationships and improving model performance in intricate text classification tasks.

  • The classification process involves hybrid TCN and BiGRU techniques for effectively capturing both long-range dependencies and local contextual features in sequential data. This hybrid architecture enhances the model’s ability to focus on relevant temporal patterns, resulting in improved accuracy, robustness, and interpretability in intricate text classification scenarios.

  • The novelty of the EDTIWVR-MDNN technique lies in its integration of a hybrid word embedding scheme, combining TF-IDF, BERT, and GloVe, with the attention-enhanced AM-T-BiG model, which unites TCN and BiGRU. This unique combination enables deeper semantic understanding and temporal pattern recognition, and improves classification performance on complex textual data while maintaining interpretability and generalization.

The rest of the paper is organized as follows. “Survey of existing emotion detection approaches” comprehensively reviews the related literature and mechanisms that use existing models to detect emotion. The proposed design is presented in “Materials and methods”. Performance assessment and a comparative study are given in “Evaluation and results”. “Conclusion” summarizes the research.

Survey of existing emotion detection approaches

Moorthy and Moon11 proposed a multimodal learning approach to recognize emotions, known as the hybrid multi-attention network (HMATN) model: a collaborative cross-attentional model for audio-visual amalgamation that focuses on efficiently capturing salient features across modalities while preserving both inter- and intra-modal associations. This paradigm computes cross-attention weights by examining the association between merged distinct modes and feature representations. Meanwhile, it leverages the hybrid attention of a single and parallel cross-modal (HASPCM) module, comprising a single-modal attention module and a parallel cross-modal attention module, to exploit multimodal data and hidden features to enhance representation. Mengara et al.12 presented a new Mixture of Experts (MoE) architecture that addresses these issues by incorporating a dynamic gating mechanism and transformer-driven sub-expert networks alongside cross-modal attention and sparse Top-k activation mechanisms. Each modality is refined by various dedicated sub-experts created to extract complex contextual and temporal patterns, while the dynamic gating adaptively weights the contributions of the most relevant experts. Jon et al.13 presented the SER in naturalistic conditions (SERNC) challenge, which addresses emotional state assessment and categorical emotion recognition. To manage the difficulties of natural speech, such as intra- and inter-subject inconsistency, the authors proposed multi-level acoustic-textual emotion representation (MATER). This new hierarchical approach incorporates textual and acoustic features at the text, word, and embedding levels. By combining lower-level acoustic and lexical cues with higher-level contextualized representations, MATER efficiently extracts semantic nuances and fine-grained prosodic variations. Al Ameen and Al Maktoum14 presented SVM-ERATI, an SVM-assisted technique for emotion recognition that involves audio and text information (ATI). The extracted audio features comprise prosody, such as pitch and energy, and Mel-frequency cepstral coefficients (MFCCs) representing the audio emotion properties. Tahir et al.15 introduced a new emotion recognition paradigm, which utilizes DL techniques to detect five essential human emotions and the pleasure dimension (valence) related to these emotions, utilizing keystroke and text dynamics. DL techniques are then utilized to forecast an individual’s emotions from text data; semantic analysis of the text is attained through the global vector (GloVe) representation of words. In Ref.16, an innovative multimodal continuous emotion recognition architecture is introduced to overcome the above difficulties. The multimodal attention fusion (MAF) technique is presented for multimodal fusion to exploit redundancy and complementarity among several modalities. To resolve temporal context relationships, the global contextual TCN (GC-TCN) and the local contextual TCN (LC-TCN) were introduced; these models gradually combine multiscale temporal contextual data from input streams of various modalities. Zhang et al.17 proposed a novel MER methodology that leverages both text and audio by utilizing hybrid attention networks (MER-HAN). In particular, an audio and text encoder (ATE) block equipped with DL approaches and a local intra-modal attention mechanism is first developed for learning higher-level text and audio feature representations from the related text and audio sequences.

In Ref.18, a new hybrid multimodal emotion recognition system, InceptionV3DenseNet, is recommended for enhancing detection precision. First, contextual features are captured from various modalities, namely text, audio, and video: zero-crossing rate, MFCCs, pitch, and energy are extracted from the audio modality, and Term Frequency–Inverse Document Frequency (TF-IDF) features are extracted from the textual modality. Liu et al.19 proposed SemantiCodec, an advanced audio codec that leverages a dual-encoder architecture integrating a semantic encoder based on a self-supervised Audio Masked Autoencoder (AudioMAE) and an acoustic encoder. Bárcena Ruiz and Gil Herrera20 proposed a model using an ensemble of partially trained Bidirectional Encoder Representations from Transformers (BERT) models within Condorcet’s Jury Theorem (CJT) framework; the approach utilizes the Jury Dynamic (JD) algorithm with reinforcement learning (RL) to adaptively improve performance despite limited training data. Noor et al.21 proposed a Stacking Ensemble-based Deep learning (SENSDeep) technique integrating multiple pre-trained transformer models, including BERT, Robustly Optimized BERT Pre-training Approach (RoBERTa), A Lite BERT (ALBERT), Distilled BERT (DistilBERT), XLNet, and Bidirectional and Auto-Regressive Transformers (BART), to improve detection accuracy and robustness. Luo et al.22 explored the correlation between single-hand gestural features and emotional expressions using statistical correlation analysis and participant interviews. Hu et al.23 improved speech emotion recognition (SER) by using Label Semantic-driven Contrastive Learning (LaSCL), integrating emotion Label Embeddings (LE) with Contrastive Learning (CL) and Label Divergence Loss (LDL) to strengthen the relationship between semantic labels and speech representations. Tang et al.24 introduced a methodology integrating convolutional neural networks (CNNs) and state space sequence models (SSMs) for capturing both local features and long-range dependencies; the FaceMamba and attention feature injection (AFI) approaches are also utilized to improve hierarchical feature integration. Zhang et al.25 proposed the Implicit Expression Recognition model of Emojis in Social Media Comments (IER-SMCEM) technique for enabling more accurate sentiment recognition in financial texts. Shawon et al.26 developed RoboInsight, a museum guide robot that utilizes robotics for precise line-following navigation, Image Processing (IP) for accurate exhibit recognition, and NLP for seamless multilingual chatbot interactions. Kodati and Tene27 developed an advanced multi-task learning (MTL) model employing soft-parameter sharing (SPS) integrated with attention mechanisms (AMs), such as long short-term memory with AM (LSTM-AM), bidirectional gated recurrent units with self-head AM (BiGRU-SAM), and bidirectional neural networks with multi-head AM (BNN-MHAM), for the analysis. Xia et al.28 introduced a model utilizing systematic design space exploration and critical evaluation methods. Table 1 presents a comparative analysis of existing textual emotion recognition approaches.

Table 1 A comparative study of diverse advanced models.

Materials and methods

In this paper, the EDTIWVR-MDNN approach is proposed. The primary intention of the EDTIWVR-MDNN methodology is to develop effective textual emotion recognition and analysis on linguistic corpora. It encompasses various processes, including text pre-processing techniques, word embedding strategies, and hybrid DL-based emotion classification. Figure 1 represents the entire flow of the EDTIWVR-MDNN methodology.

Fig. 1. Overall flow of the EDTIWVR-MDNN approach.

Dataset description

The utilized Emotion Detection from Text dataset contains 39,173 samples under 12 sentiment classes, as depicted below in Table 2. The dataset, sourced from the data.world platform29, provides a valuable resource for creating precise emotion detection models from short, informal text, despite issues like class imbalance and the complexity of distinguishing diverse emotional categories. Table 3 presents sample texts.

Table 2 Details of the dataset.
Table 3 Sample texts.

Algorithm: emotion detection on textual data using the EDTIWVR-MDNN model

Algorithm 1 illustrates the steps involved in the EDTIWVR-MDNN methodology, covering the text pre-processing, word embedding, and classification stages. A minimal runnable skeleton of these stages is sketched after the listing.


Algorithm 1: EDTIWVR-MDNN methodology.

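Since Algorithm 1 is rendered as an image in the published version, the following minimal runnable skeleton outlines its three stages; every function name and body here is an illustrative placeholder of our own, not the authors' released code, and each stage is expanded in the section sketches that follow.

```python
# Skeleton of Algorithm 1 with placeholder stage bodies (illustrative only);
# each stage is expanded in the corresponding section sketches below.

def preprocess(text):
    """Stage 1: multi-level pre-processing (tokenize, clean, lemmatize)."""
    return text.lower()                       # placeholder for the full chain

def embed_fused(text):
    """Stage 2: fused TF-IDF + BERT + GloVe representation."""
    return [float(len(text.split()))]         # placeholder feature vector

def classify(features):
    """Stage 3: AM-T-BiG classification over the fused features."""
    return "happiness" if features[0] > 3 else "neutral"   # placeholder rule

text = "I am feeling really wonderful today"
print(classify(embed_fused(preprocess(text))))   # -> 'happiness'
```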

Pre-processing techniques for text

The text pre-processing techniques are executed to extract the relevant data from the text30. The various text pre-processing steps are explained below; a minimal code sketch of the full chain follows the list.

Sentence tokenization: NLTK, a general-purpose Python library, is mainly employed for NLP. The individual text files were loaded, and the raw text was sentence-tokenized. The sentences were assembled into a CSV file and recorded with a file name.

Expanding contraction: The Python library called contractions is employed to expand shortened words in a sentence. For instance, ‘you’re’ is expanded to ‘you are’. This step helps the detection of grammatical classes in the POS tag.

Part-of-speech (POS) tag: The POS tagger processes a series of words and assigns a POS tag to each word. For instance, VB denotes a verb, JJ indicates an adjective, NN represents a noun, and IN denotes a preposition. Adjectives, nouns, and verbs are considered the meaningful words and are mined to reduce redundancy. To avoid inconsistencies, only non-meaningful words (e.g., stopwords or words with tags other than V, N, and J) are eliminated, while the meaningful words (V/N/J) are retained and transformed to lowercase for consistent processing, ensuring reproducibility.

Word tokenization: A well-defined regular expression was used to retain only words consisting of alphabetic characters.

Lower case conversion: Every retained word is converted to lowercase to reduce duplicate entries in the vocabulary. Also, when compared against a fixed dictionary, this helps to eliminate stop words.

Removal of stop words: English stop words, such as he, is, the, and has, do not contribute meaning to the text and were eliminated. A few additional words were also eliminated, namely highly common verbs in the corpus, like know, feel, said, and come, as well as character names.

Lemmatization: This step removes the affixes of words that are in its vocabulary, producing a list of root words and thereby reducing redundant word forms in the text. For instance, ‘plays’ is lemmatized to ‘play’.
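A minimal sketch of this pre-processing chain is given below, assuming NLTK and the third-party contractions package; the exact package choices and the custom stop-word additions mirror the steps above but are our own illustration, not the authors' published code.

```python
import nltk
import contractions                       # pip install contractions
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

def preprocess(text):
    text = contractions.fix(text)                     # "you're" -> "you are"
    tokens = []
    for sent in nltk.sent_tokenize(text):             # sentence tokenization
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sent)):
            # Keep only alphabetic verbs/nouns/adjectives, lower-cased.
            if word.isalpha() and tag[0] in ("V", "N", "J"):
                tokens.append(word.lower())
    # English stop words plus a few corpus-specific common verbs (see above).
    stops = set(stopwords.words("english")) | {"know", "feel", "said", "come"}
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stops]

print(preprocess("You're playing games; he said he knows the plays."))
```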

Combined semantic feature extraction

Besides, the word embedding process is performed by using the fusion of TF-IDF, BERT, and GloVe to transform textual data into numerical vectors. The TF-IDF model effectively captures the significance of words based on their frequency and distribution across documents, providing a robust statistical foundation. GloVe contributes global co-occurrence data, capturing semantic relationships between words efficiently. BERT, a contextualized transformer-based model, excels at comprehending word meanings in varying contexts, which is significant for intricate and complex textual data. Altogether, this hybrid approach utilizes both conventional and DL embeddings to improve feature richness, semantic depth, and context awareness, outperforming models that depend on any single embedding method alone. These embeddings are concatenated to form a rich, unified feature vector, incorporating conventional and DL representations. Thus, the integration enhances the generalization and classification accuracy of the model, specifically in scenarios with diverse and complex language patterns. A compact sketch of this fusion follows.
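The sketch below illustrates the fusion idea, assuming the pre-trained bert-base-uncased checkpoint and the glove-wiki-gigaword-100 vectors from gensim's downloader; these model choices, and the mean-pooling of GloVe word vectors per document, are our illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np
import gensim.downloader as api
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer, AutoModel

texts = ["i feel wonderful today", "this is so sad and empty"]

# Statistical channel: TF-IDF document vectors.
tfidf_vecs = TfidfVectorizer().fit_transform(texts).toarray()

# Global co-occurrence channel: mean of GloVe word vectors per document.
glove = api.load("glove-wiki-gigaword-100")
glove_vecs = np.array([
    np.mean([glove[w] for w in t.split() if w in glove], axis=0) for t in texts
])

# Contextual channel: BERT [CLS] embedding per document.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    out = bert(**tok(texts, padding=True, return_tensors="pt"))
bert_vecs = out.last_hidden_state[:, 0, :].numpy()   # [CLS] token

# Fusion: concatenate the three channels into one unified feature vector.
fused = np.hstack([tfidf_vecs, glove_vecs, bert_vecs])
print(fused.shape)   # (2, |vocab| + 100 + 768)
```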

TF-IDF

Each extracted feature value is weighted using the TF-IDF model, where a weight \(\:w\) is assigned to each term \(\:t\)31.

The number of times \(\:t\) appears in text \(\:d\) is calculated using the term frequency in Eq. (1).

$$\:tf\left(t,\:d\right)=\frac{{f}_{d}\left(t\right)}{\max_{{t}^{{\prime}}}{f}_{d}\left({t}^{{\prime}}\right)}$$
(1)

Here, \(\:{f}_{d}\left(t\right)\) denotes the frequency of term \(\:t\) in text \(\:d\).

The relevance of a term is projected utilizing IDF. A term’s rarity is defined utilizing Eq. (2).

$$\:IDF\left(t\right)=log\left(\frac{N}{d{f}_{t}}\right)$$
(2)

While \(\:"N"\) represents the total number of texts, \(\:"\text{d}\text{f}\:\text{t}"\) denotes the number of texts that contain the word \(\:t\). To attain TF-IDF, it is necessary to multiply Eqs. (1) and (2).

$$\:W=tf\left(t,\:d\right)\text{*}IDF\left(t\right)$$
(3)

To group and classify the data, the proposed method utilizes ML models, for which a well-constructed feature set is a vital component. The actual writing style of every author depends on the specific processes by which they develop and demonstrate their skill, and this information serves as a basis for extracting the style-dimension features for every author in the text database. The statistical measure of the TF-IDF vectorizer is employed to define the weight by computing how relevant a term is to each text. TF-IDF is a multi-purpose model that converts text into a useful numerical or vector representation: it is essentially an occurrence-based vector that normalizes word frequency by document length and by the number of documents, combining two metrics. A count vectorizer and an n-gram analysis of words are employed to manage the cleaned text database. Formally, the measure is expressed below.

$$\:tf\left(w,\:e\right)=\frac{freq\left(w,\:e\right)}{\left|e\right|}$$
(4)
$$\:idf\left(w,\:E\right)=\text{log}\frac{\left|E\right|}{\left|\left\{e\in\:E:w\in\:e\right\}\right|}$$
(5)
$$\:tfidf\left(w,\:e,\:E\right)=tf\left(w,\:e\right)\cdot\:idf\left(w,\:E\right)$$
(6)

Here, \(\:freq(w,\:e)\) denotes the raw frequency of term \(\:w\) in text \(\:e\), and \(\:tf(w,\:e)\) is that frequency normalized by the text length \(\:\left|e\right|\) (a sublinear variant, \(\:\text{log}\left(1+freq\left(w,\:e\right)\right)\), is also common). \(\:\left|E\right|\) denotes the number of texts in the complete database, and \(\:\left|\left\{e\in\:E:w\in\:e\right\}\right|\) is the number of texts that contain the word \(\:w\), so \(\:idf(w,\:E)\) down-weights terms that occur in many texts. Figure 2 illustrates the general configuration of the TF-IDF model.
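The weighting in Eqs. (4)–(6) can be reproduced directly; the following self-contained sketch is our own illustration on a toy corpus, not the authors' code.

```python
import math
from collections import Counter

corpus = [["play", "game", "happy"],
          ["sad", "game"],
          ["happy", "play", "play"]]

def tf(w, e):
    return Counter(e)[w] / len(e)                     # Eq. (4)

def idf(w, E):
    df = sum(w in e for e in E)                       # texts containing w
    return math.log(len(E) / df)                      # Eq. (5)

def tfidf(w, e, E):
    return tf(w, e) * idf(w, E)                       # Eq. (6)

print(round(tfidf("play", corpus[2], corpus), 3))     # weight of "play" in text 3
```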

Fig. 2. General structure of TF-IDF.

BERT

The encoder layer of BERT is a vital part of the text classification method: it transforms an input text into a vector representation that captures rich contextual information32. Its key characteristics are described below:

  • Contextual embedding: Unlike traditional word embeddings, which deliver a static representation for every word regardless of context, the BERT encoder layer forms contextual embeddings. These embeddings capture a word’s meaning based on the entire input sequence, which permits the system to recognize semantic differences and disambiguate words with dissimilar meanings.

  • Transformer structure: The encoder layer uses the Transformer structure, which features a self-attention mechanism that permits recognizing contextual associations and longer-term dependencies in an input. By executing self-attention with multiple heads, BERT processes an input sequence in a way that captures both local and global context data.

  • Pre-trained language representation: This model is typically pre-trained on massive databases using unsupervised learning objectives such as masked language modelling and next-sentence prediction. Throughout the pre-training stage, the technique acquires the capability to encode text into contextual and effective representations suited to diverse language understanding tasks. The pre-trained representation is a foundation for subsequent functions, allowing the system to use the extensive linguistic knowledge acquired during pre-training.

  • Refining for a specific task: While the pre-trained representation of BERT captures broad linguistic patterns, the encoder layer is fine-tuned on task-relevant data through supervised learning. This adaptation of the network’s representation enhances its efficiency on a specific classification task.

  • Vector representation with higher dimension: The encoder layer produces a high-dimensional vector representation for every token in an input sequence, typically spanning several hundred dimensions:

$$\:e\left({w}_{i}\right)\in\:{R}^{d}.$$
(7)

where the \(\:i\)th word is signified as \(\:{w}_{i}\), the word embedding vector is denoted as \(\:e\left({w}_{i}\right)\), and the \(\:d\)-dimensional vector space is signified as \(\:{R}^{d}\).

The BERT pre-trained language model’s main feature is its execution of Transformer encoding in dual directions. BERT handles input sequences of up to 512 tokens. Every sentence is pre-processed by including a [CLS] token at the start and appending a [SEP] token at the end to represent its closure. After tokenization and the insertion of special tokens, three distinct types of embeddings are created; a brief tokenization sketch follows the list below. Figure 3 shows the architecture of the BERT method.

  • Word vector: This embedding represents each individual token in an input sentence. The tokens are mapped to vector representations obtained during the pre-training process; its goal is to encode semantic data about context and token meaning.

  • Segment vector: This is employed for handling an input sequence that contains multiple sentences or segments. The vector delivers an identifier for every token that signifies the segment or sentence it is derived from, enabling the system to differentiate among the various components and reflect their contextual significance in processing.

  • Location embedding vector: In addition to word and segment embeddings, BERT includes a location (position) embedding vector to encode each token’s position within an input sequence.
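As a brief illustration of the [CLS]/[SEP] insertion and the segment vector, the following sketch uses the Hugging Face bert-base-uncased tokenizer; the checkpoint choice is ours for illustration.

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")

# Sentence pair -> token ids and segment (token_type) ids; positions are
# added internally by BERT for indices 0..511.
enc = tok("I feel great.", "The game was fun.")

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'i', 'feel', 'great', '.', '[SEP]', 'the', 'game', 'was', 'fun', '.', '[SEP]']
print(enc["token_type_ids"])  # segment vector: 0 for sentence A, 1 for sentence B
```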

Fig. 3. Architecture of the BERT model.

GloVe

GloVe joins a local context-window model with global matrix factorization to learn word-space representations. In the Global Vectors (GloVe) method, the statistics of word occurrences in the corpus are the primary information source accessible to any unsupervised model learning word representations33. While numerous models exist, the question remains how meaning is derived from these statistics and how the resultant word vectors may characterize that meaning. GloVe minimizes the weighted least-squares objective:

$$\:J={\sum\:}_{i,j=1}^{V}f\left({X}_{i,j}\right){\left({u}_{i}^{T}{v}_{j}+{b}_{i}+{c}_{j}-\:\text{l}\text{o}\text{g}\:{X}_{i,j}\right)}^{2}$$
(8)
(8)

where \(\:V\) refers to the vocabulary size, \(\:X\) signifies the word co-occurrence matrix (\(\:{X}_{i,j}\) counts how many times word \(\:j\) occurs in the context of word \(\:i\)), the weighting \(\:f\) is given by \(\:f\left(x\right)=(x/{x}_{\text{m}\text{a}\text{x}}{)}^{\alpha\:}\) if \(\:x<{x}_{\text{m}\text{a}\text{x}}\) and 1 otherwise, with \(\:{x}_{\text{m}\text{a}\text{x}}=100\) and \(\:\alpha\:=0.75\) (empirically established), \(\:{u}_{i}\) and \(\:{v}_{j}\) denote the two layers of word vectors, and \(\:{b}_{i},{c}_{j}\) are bias terms. In essence, this is a weighted matrix factorization with bias terms. Figure 4 demonstrates the infrastructure of GloVe. A toy evaluation of this objective is sketched below.
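As a toy illustration of Eq. (8), the sketch below evaluates the weighted least-squares objective for random vectors and a random co-occurrence matrix; all sizes are our own illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 5, 8                                  # toy sizes
X = rng.integers(1, 50, size=(vocab, vocab))       # toy co-occurrence counts

u = rng.normal(size=(vocab, dim))                  # first word-vector layer
v = rng.normal(size=(vocab, dim))                  # second word-vector layer
b = np.zeros(vocab)                                # bias terms b_i
c = np.zeros(vocab)                                # bias terms c_j

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function from Eq. (8)."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss():
    total = 0.0
    for i in range(vocab):
        for j in range(vocab):
            err = u[i] @ v[j] + b[i] + c[j] - np.log(X[i, j])
            total += f(X[i, j]) * err ** 2
    return total

print(glove_loss())
```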

Fig. 4. Infrastructure of the GloVe model.

Hybrid neural network architecture-based emotion detection

Finally, the hybrid AM-T-BiG technique is utilized for the classification operation34. This integration is highly effective for contextual understanding and sequence modelling. The AM enables the model to focus on the most relevant parts of the input, enhancing interpretability and handling long-range dependencies better than conventional methods. The TCN technique presents efficient parallel processing and stable gradient flow over long sequences, outperforming standard recurrent networks in temporal feature extraction. The BiGRU model efficiently captures information from both past and future contexts, which is significant for precise sequential data classification. This hybrid technique presents superior performance and robustness compared to standalone models, making it appropriate for intricate textual classification tasks where understanding context and sequence is crucial.

This method intends to precisely acquire the short-term and long-term dynamic features represented by key parameters. This model effectively utilizes the TCN module’s capability to capture long-term dependencies and periodic features during processing. The time-series data is efficiently modelled by the TCN technique due to its unique causal convolutional framework, while avoiding gradient explosion or vanishing gradients common in conventional recurrent neural network (RNN) techniques, thereby accurately capturing long-term dependencies during processing.

Then, these enhanced long-term features from the TCN module are passed to the Bi-GRU layer to further capture the short-term volatility and bi-directional temporal correlations in the data. The past and future influences, along with the current data, are efficiently captured by the forward and backward GRU components, thereby significantly enhancing sensitivity to immediate changes and improving the feature representations.

$$\:\overrightarrow{{h}_{t}}=GRU\left({x}_{t},\overrightarrow{{h}_{t-1}}\right)$$
(9)
$$\:\overleftarrow{{h}_{t}}=GRU\left({x}_{t},\overleftarrow{{h}_{t+1}}\right)$$
(10)
$$\:{y}_{t}=\overrightarrow{{h}_{t}}\oplus\:\overleftarrow{{h}_{t}}$$
(11)

Here, \(\:\overrightarrow{{h}_{t}}\) and \(\:\overleftarrow{{h}_{t}}\) refer to the forward and backward outputs of the Bi-GRU, and \(\:{y}_{t}\) indicates the complete output of the Bi-GRU.

To enhance the accuracy of prediction, this method integrates the AM as an optimization tool. Attention is a key technology in neural networks that intelligently assigns computational resources to prioritize or highlight the data most significant for completing a task. The TCN-Bi-GRU framework demonstrates exceptional long-sequence modelling capability, with its AM enabling adaptive temporal feature extraction while preventing gradient vanishing. The attention computation is specified as:

$$\:Attention\:\left(Q,\:K,\:V\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$
(12)

Here, \(\:\sqrt{{d}_{k}}\) refers to the scaling factor; \(\:Q\) signifies the query vectors, \(\:K\) the key vectors that capture the feature relations among the data, and \(\:V\) the value vectors. The softmax normalizes the attention weights to values between zero and one. A minimal numerical rendering of Eq. (12) is given below.
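Equation (12) is the standard scaled dot-product attention, which can be rendered in a few lines of NumPy; the shapes are chosen by us for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (12)."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # each row sums to 1
    return weights @ V

rng = np.random.default_rng(1)
T, d_k = 6, 16                                  # sequence length, key dim
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)                 # (6, 16)
```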

Table 4 indicates the hyperparameters of the AM-T-BiG technique, chosen to ensure effective learning and stable convergence. Speed and accuracy are effectively balanced by the learning rate and batch size, while dropout helps prevent overfitting. The TCN layers capture long-term dependencies utilizing dilated convolutions, and the Bi-GRU layers model short-term bi-directional relations. The AM additionally refines significant temporal features, resulting in enhanced generalization across emotion categories.

Table 4 Hyperparameter settings utilized for training the AM-T-BiG hybrid neural network model for textual emotion classification.
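For concreteness, a compact Keras sketch of an AM-T-BiG-style stack is given below; the filter counts, dilation schedule, and pooling are our assumptions, since the paper does not publish the implementation, while the learning rate, dropout, and 12-class output follow the settings reported in this paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_am_t_big(seq_len=100, feat_dim=300, n_classes=12):
    inp = layers.Input(shape=(seq_len, feat_dim))
    # TCN block: stacked causal dilated convolutions for long-range patterns.
    x = inp
    for dilation in (1, 2, 4):
        x = layers.Conv1D(64, kernel_size=3, padding="causal",
                          dilation_rate=dilation, activation="relu")(x)
    # BiGRU: forward/backward short-term context (Eqs. 9-11).
    x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
    # Self-attention over time steps (Eq. 12), then pooling.
    x = layers.Attention()([x, x])
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

build_am_t_big().summary()
```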

Evaluation and results

The simulation validation of the EDTIWVR-MDNN model is examined under the Emotion Detection from Text dataset29. The technique is simulated using Python 3.6.5 on a PC with an i5-8600K CPU, a 250 GB SSD, a GeForce GTX 1050 Ti (4 GB) GPU, 16 GB of RAM, and a 1 TB HDD. Training parameters include a learning rate of 0.01, ReLU activation, 50 epochs, a dropout of 0.5, and a batch size of 5.

Figure 5 shows a countplot graph that visualizes the distribution of several emotional sentiments in a dataset. Every bar in the chart corresponds to a specific emotion. The x-axis is labelled as “sentiment”, whereas the y-axis demonstrates the count, offering a clear and colourful distinction between the emotional categories and their respective frequencies.

Fig. 5. Countplot graph of the EDTIWVR-MDNN model under various sentiments.

Figure 6 displays the histplot graph that exemplifies the distribution of multiple emotions in a dataset. Every bar in the histogram signifies the frequency (count) of a specific emotional category. The bar height denotes how frequently each emotion occurs.

Fig. 6. Histplot graph of the EDTIWVR-MDNN model under various emotions.

Table 5 and Fig. 7 depict the textual emotion detection results of the EDTIWVR-MDNN approach under the 80:20 split. Under the 80% training phase (TRPHE), the EDTIWVR-MDNN model obtains an average \(\:acc{u}_{y}\) of 99.18%, \(\:pre{c}_{n}\) of 91.42%, \(\:rec{a}_{l}\) of 79.73%, \(\:{F1}_{Score}\) of 82.16%, and \(\:{G}_{Measure}\) of 83.60%. Likewise, under the 20% testing phase (TSPHE), the EDTIWVR-MDNN model obtains an average \(\:acc{u}_{y}\) of 99.26%, \(\:pre{c}_{n}\) of 93.13%, \(\:rec{a}_{l}\) of 80.14%, \(\:{F1}_{Score}\) of 82.54%, and \(\:{G}_{Measure}\) of 84.14%. A sketch of how such macro-averaged metrics can be computed is given below.
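For reference, these macro-averaged metrics can be computed as follows; taking the G-measure as the geometric mean of precision and recall is our assumption, as the paper does not define it explicitly, and the labels are toy values for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = np.array([0, 1, 2, 1, 0, 2, 1])   # toy ground-truth emotion labels
y_pred = np.array([0, 1, 2, 0, 0, 2, 1])   # toy predictions

prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", prec)
print("recall   :", rec)
print("F1-score :", f1_score(y_true, y_pred, average="macro"))
print("G-measure:", np.sqrt(prec * rec))    # geometric mean of prec. and recall
```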

Table 5 Textual emotion detection of EDTIWVR-MDNN model under 80:20.
Fig. 7. Average values of the EDTIWVR-MDNN model under 80%:20%.

Figure 8 exhibits the training (TRAIN) \(\:acc{u}_{y}\) and validation (VALID) \(\:acc{u}_{y}\) of the EDTIWVR-MDNN method on the 80:20 split over 30 epochs. In the beginning, both TRAIN and VALID \(\:acc{u}_{y}\) rise quickly, denoting efficient pattern learning from the data. At some epochs, the VALID \(\:acc{u}_{y}\) slightly exceeds the TRAIN \(\:acc{u}_{y}\), signifying good generalization without over-fitting. As training advances, performance peaks and the gap between TRAIN and VALID remains small. The close alignment of both curves throughout training suggests that the method is well-regularised and generalizes well, revealing its strong capability to learn and retain valuable features across both seen and unseen data.

Fig. 8. \(\:Acc{u}_{y}\) curve of the EDTIWVR-MDNN method under 80:20.

Figure 9 exhibits the TRAIN and VALID losses of the EDTIWVR-MDNN method at 80:20 over 30 epochs. Initially, both TRAIN and VALID losses are higher, denoting that the process begins with a partial understanding of the data. As training evolves, both losses persistently reduce, displaying that the method is efficiently learning and updating its parameters. The close alignment between the TRAIN and VALID loss curves during training suggests that the technique hasn’t over-fitted and upholds good generalization to unseen data.

Fig. 9. Loss curve of the EDTIWVR-MDNN method under 80:20.

Table 6 and Fig. 10 illustrate the textual emotion detection results of the EDTIWVR-MDNN technique under the 70:30 split. Under the 70% TRPHE, the EDTIWVR-MDNN model attains an average \(\:acc{u}_{y}\) of 98.89%, \(\:pre{c}_{n}\) of 81.37%, \(\:rec{a}_{l}\) of 73.72%, \(\:{F1}_{Score}\) of 74.93%, and \(\:{G}_{Measure}\) of 75.76%. Similarly, under the 30% TSPHE, the EDTIWVR-MDNN model attains an average \(\:acc{u}_{y}\) of 98.74%, \(\:pre{c}_{n}\) of 81.10%, \(\:rec{a}_{l}\) of 72.86%, \(\:{F1}_{Score}\) of 74.32%, and \(\:{G}_{Measure}\) of 75.23%.

Table 6 Textual emotion detection of EDTIWVR-MDNN approach under 70:30.
Fig. 10. Average values of the EDTIWVR-MDNN approach under 70%:30%.

Figure 11 exemplifies the TRAIN \(\:acc{u}_{y}\) and VALID \(\:acc{u}_{y}\) of the EDTIWVR-MDNN technique on the 70:30 split over 30 epochs. At first, both TRAIN and VALID \(\:acc{u}_{y}\) rise rapidly, specifying effective pattern learning from the data. At some epochs, the VALID \(\:acc{u}_{y}\) slightly exceeds the TRAIN \(\:acc{u}_{y}\), indicating good generalization without over-fitting. As training progresses, performance peaks and the gap between TRAIN and VALID remains small. The close alignment of both curves in training suggests that the model is well-regularised and generalized, establishing its strong capability to learn and retain informative features across both seen and unseen data.

Fig. 11. \(\:Acc{u}_{y}\) curve of the EDTIWVR-MDNN technique under 70:30.

Figure 12 exemplifies the TRAIN and VALID losses of the EDTIWVR-MDNN methodology at 70:30 over 30 epochs. Initially, both TRAIN and VALID losses are high, signifying that the model begins with a partial understanding of the data. As training progresses, both losses persistently decline, indicating that the model is effectively learning and enhancing its parameters. The close alignment among the TRAIN and VALID loss curves during training demonstrates that the model hasn’t over-fitted and retains good generalization to unseen data.

Fig. 12. Loss curve of the EDTIWVR-MDNN methodology under 70:30.

Table 7 and Fig. 13 portray the comparative investigation of the EDTIWVR-MDNN methodology with existing approaches under the Emotion Detection from Text dataset20,21,35,36,37. The outcomes underscore that the presented EDTIWVR-MDNN approach achieved the highest \(\:acc{u}_{y},\) \(\:pre{c}_{n}\), \(\:rec{a}_{l},\) and \(\:{F1}_{Score}\) of 99.26%, 93.13%, 80.14%, and 82.54%, respectively. In contrast, existing methodologies, such as RL, RoBERTa, XLNet, ALBERT, Bi-RNN, BiLSTM + CNN, TextCNN, Dialogue RNN, Naïve Bayes (NB), and VC(LR-SGD), showed lower performance across the various metrics.

Table 7 Comparative investigation of EDTIWVR-MDNN methodology with other existing models under the emotion detection from text dataset20,21,35,36,37.
Fig. 13. Comparative analysis of the EDTIWVR-MDNN methodology: (a) \(\:Acc{u}_{y}\), (b) \(\:Pre{c}_{n}\), (c) \(\:Rec{a}_{l}\), and (d) \(\:{F1}_{Score}\).

Table 8 and Fig. 14 indicate the comparison assessment of the EDTIWVR-MDNN technique with existing methods under the Emotion Analysis Based on Text (EAT) dataset38,39. The MFN model attained an \(\:acc{u}_{y}\) of 86.50%, \(\:pre{c}_{n}\) of 79.56%, \(\:rec{a}_{l}\) of 77.88%, and \(\:{F1}_{Score}\) of 86.08%. Additionally, the MCTN achieved an \(\:acc{u}_{y}\) of 80.50%, \(\:pre{c}_{n}\) of 86.77%, \(\:rec{a}_{l}\) of 83.06%, and \(\:{F1}_{Score}\) of 91.77%. The RAVEN model depicted an \(\:acc{u}_{y}\) of 77.00%, \(\:pre{c}_{n}\) of 88.53%, \(\:rec{a}_{l}\) of 83.19%, and \(\:{F1}_{Score}\) of 90.27%, and HFusion reached an \(\:acc{u}_{y}\) of 74.30%, \(\:pre{c}_{n}\) of 88.52%, \(\:rec{a}_{l}\) of 80.60%, and \(\:{F1}_{Score}\) of 90.62%. MulT obtained an \(\:acc{u}_{y}\) of 84.80%, \(\:pre{c}_{n}\) of 86.12%, \(\:rec{a}_{l}\) of 80.84%, and \(\:{F1}_{Score}\) of 91.80%, while SSE-FT achieved an \(\:acc{u}_{y}\) of 86.50%, \(\:pre{c}_{n}\) of 89.30%, \(\:rec{a}_{l}\) of 84.31%, and \(\:{F1}_{Score}\) of 86.87%. The EDTIWVR-MDNN model demonstrated greater performance with an \(\:acc{u}_{y}\) of 97.45%, \(\:pre{c}_{n}\) of 97.09%, \(\:rec{a}_{l}\) of 97.06%, and \(\:{F1}_{Score}\) of 96.91%.

Table 8 Comparison assessment of the EDTIWVR-MDNN technique with other existing methods under the EAT dataset38,39.
Fig. 14. Comparison assessment of the EDTIWVR-MDNN technique with other existing methods under the EAT dataset.

Table 9 and Fig. 15 specify the comparison evaluation of the EDTIWVR-MDNN method with existing techniques under the Text Emotion Recognition (TER) dataset40,41. The SVM model attained an \(\:acc{u}_{y}\) of 78.97%, \(\:pre{c}_{n}\) of 81.45%, \(\:rec{a}_{l}\) of 78.36%, and \(\:{F1}_{Score}\) of 79.67%. Moreover, RF achieved an \(\:acc{u}_{y}\) of 76.25%, \(\:pre{c}_{n}\) of 79.42%, \(\:rec{a}_{l}\) of 75.66%, and \(\:{F1}_{Score}\) of 77.02%. The NB model illustrated an \(\:acc{u}_{y}\) of 68.94%, \(\:pre{c}_{n}\) of 61.75%, \(\:rec{a}_{l}\) of 51.41%, and \(\:{F1}_{Score}\) of 49.61%, and DT reached an \(\:acc{u}_{y}\) of 69.42%, \(\:pre{c}_{n}\) of 72.48%, \(\:rec{a}_{l}\) of 69.70%, and \(\:{F1}_{Score}\) of 70.94%. GRU obtained an \(\:acc{u}_{y}\) of 78.02%, \(\:pre{c}_{n}\) of 71.19%, \(\:rec{a}_{l}\) of 73.91%, and \(\:{F1}_{Score}\) of 72.35%, while CNN achieved an \(\:acc{u}_{y}\) of 79.32%, \(\:pre{c}_{n}\) of 77.30%, \(\:rec{a}_{l}\) of 75.21%, and \(\:{F1}_{Score}\) of 74.10%. The proposed EDTIWVR-MDNN model outperformed the existing techniques with an \(\:acc{u}_{y}\) of 97.26%, \(\:pre{c}_{n}\) of 96.68%, \(\:rec{a}_{l}\) of 96.20%, and \(\:{F1}_{Score}\) of 95.94%, thus emphasizing its capability in precisely detecting emotions from textual data.

Table 9 Comparison evaluation of the EDTIWVR-MDNN method with other existing techniques under the TER dataset40,41.
Fig. 15. Comparison evaluation of the EDTIWVR-MDNN method with other existing techniques under the TER dataset.

Table 10 and Fig. 16 specify the computational time (CT) analysis of the EDTIWVR-MDNN technique with existing models. The EDTIWVR-MDNN technique significantly outperforms all baseline models with the lowest CT of 5.11 s. Other models, such as TextCNN and BiLSTM + CNN, required higher CTs of 28.39 and 24.84 s, respectively. Models such as XLNet and Bi-RNN recorded CT values of 21.86 and 21.43 s, while ALBERT and RoBERTa took 18.27 and 13.42 s. Moreover, the NB and VC(LR-SGD) models reported CTs of 15.27 and 16.19 s, respectively. This indicates that the EDTIWVR-MDNN model is highly efficient, making it appropriate for real-time or resource-constrained applications.

Table 10 CT analysis of the EDTIWVR-MDNN technique with other existing methods under the emotion detection from text dataset.
Fig. 16. CT analysis of the EDTIWVR-MDNN technique with other existing methods under the Emotion Detection from Text dataset.

Table 11 and Fig. 17 portray the error analysis of the EDTIWVR-MDNN approach against existing models. The EDTIWVR-MDNN approach depicts the lowest error values, with an \(\:acc{u}_{y}\) error of 0.74%, a \(\:pre{c}_{n}\) error of 6.87%, a \(\:rec{a}_{l}\) error of 19.86%, and an \(\:{F1}_{Score}\) error of 17.46%. The very low accuracy and precision errors highlight the strong classification capability of the model. The \(\:rec{a}_{l}\) error of 19.86% exhibits that the model still misses some relevant emotional instances, while the \(\:{F1}_{Score}\) error of 17.46% reflects a residual imbalance between identifying and correctly labelling emotions. Overall, the EDTIWVR-MDNN model effectively captures contextual and semantic variance in text, with the remaining errors likely due to subtle or mixed emotional expressions within intricate linguistic patterns.

Table 11 Error assessment of the EDTIWVR-MDNN approach with other existing models under the emotion detection from text dataset.
Fig. 17. Error assessment of the EDTIWVR-MDNN approach with other existing models under the Emotion Detection from Text dataset.

Table 12 and Fig. 18 present the ablation study of the EDTIWVR-MDNN methodology. The full EDTIWVR-MDNN methodology, which integrates the AM-T-BiG model with the fused word embedding process, attained an \(\:acc{u}_{y}\) of 99.26%, \(\:pre{c}_{n}\) of 93.13%, \(\:rec{a}_{l}\) of 80.14%, and \(\:{F1}_{Score}\) of 82.54%, highlighting superior performance over the other configurations. Furthermore, TCN alone (without attention and BiGRU) achieved an \(\:acc{u}_{y}\) of 95.71%, \(\:pre{c}_{n}\) of 88.93%, \(\:rec{a}_{l}\) of 75.74%, and \(\:{F1}_{Score}\) of 78.47%, while BiGRU alone (without TCN and attention) achieved an \(\:acc{u}_{y}\) of 96.35%, \(\:pre{c}_{n}\) of 89.71%, \(\:rec{a}_{l}\) of 76.57%, and \(\:{F1}_{Score}\) of 79.02%. The attention-equipped AM-T-BiG model improved the results to an \(\:acc{u}_{y}\) of 97.02%, \(\:pre{c}_{n}\) of 90.54%, \(\:rec{a}_{l}\) of 77.46%, and \(\:{F1}_{Score}\) of 79.84%. Word embedding techniques alone, such as TF-IDF, BERT, and GloVe, achieved varying performance, with \(\:acc{u}_{y}\) ranging from 94.23% to 97.98%, \(\:pre{c}_{n}\) from 90.45% to 92.70%, \(\:rec{a}_{l}\) from 75.44% to 79.89%, and \(\:{F1}_{Score}\) from 75.92% to 79.27%. Thus, the ablation emphasizes the effectiveness of integrating corpus linguistics with DL approaches for improved semantic-driven emotion detection on textual data.

Table 12 Ablation study results comparing EDTIWVR-MDNN method with existing techniques.
Fig. 18. Ablation study results comparing the EDTIWVR-MDNN method with existing techniques.

Table 13 illustrates the computational efficiency of diverse emotion recognition models42. The SET model required 22.23 G FLOPs, 2277 MB of GPU memory, and 29.9470 s for inference, while MET utilized 16.41 G FLOPs, 2863 MB of GPU memory, and 21.6060 s. The m-CNN-LSTM required 21.77 G FLOPs, 2307 MB of GPU memory, and 17.9080 s, and m-MLP required 29.65 G FLOPs, 2917 MB of GPU memory, and 27.0900 s. The m-LSTM model recorded 27.83 G FLOPs, 2176 MB of GPU memory, and 29.2230 s, and HADLTER-NLPT depicted higher efficiency with 10.58 G FLOPs, 536 MB of GPU memory, and 11.0400 s. The EDTIWVR-MDNN model portrayed the highest computational efficiency, with only 7.90 G FLOPs, 205 MB of GPU memory, and 8.5900 s for inference, highlighting a significant reduction in resource needs without losing performance.

Table 13 Comparison of computational efficiency of emotion recognition models utilizing FLOPs, GPU memory usage, and inference time.

Conclusion

In this paper, the EDTIWVR-MDNN model is proposed. The EDTIWVR-MDNN model intends to develop effective textual emotion recognition and analysis on language corpora. It encompasses various processes, including text pre-processing techniques, word embedding strategies, and hybrid DL-based emotion classification. Initially, the text pre-processing stage involves various levels for extracting the relevant data from the text. Furthermore, the word embedding process is applied through the fusion of TF-IDF, BERT, and GloVe to transform textual data into numerical vectors. Finally, the hybrid AM-T-BiG technique is utilized for the classification process. The experimental evaluation of the EDTIWVR-MDNN methodology is examined under the Emotion Detection from Text dataset. The comparison analysis of the EDTIWVR-MDNN methodology portrayed a superior accuracy value of 99.26% over existing approaches. The limitations comprise the dependence on a relatively limited dataset, which may affect the generalizability of the findings across diverse textual domains and languages. Additionally, performance may degrade when capturing highly complex or context-dependent emotions, particularly mixed or subtle emotional expressions, and the model also faces difficulty with sarcasm and idiomatic language, which often alter the intended emotional meaning. Future work may concentrate on expanding the dataset with more varied and multilingual text sources to enhance robustness and to address challenges related to sarcasm, idiomatic expressions, and linguistic diversity. Integrating multimodal data such as audio or visual cues could further improve emotion recognition accuracy. Moreover, exploring adaptive learning methods to better handle evolving language use and emotional expressions in real-time applications is recommended. Finally, addressing computational efficiency to enable deployment in resource-constrained environments will be a valuable direction.