Table 1 A comparative study of the reviewed techniques.
| Author | Objective | Method | Dataset | Result |
|---|---|---|---|---|
| Kumar et al.11 | Presents VISTANet, an interpretable, hybrid-fusion-based multimodal emotion recognition method that classifies an input comprising an image, its corresponding text, and speech into discrete emotion classes. | KAAP | IIT-R MMEmoRec dataset | Accuracy of 80.11% |
| Di Luzio et al.12 | An explainable AI model to recognize key facial movements and the distinct emotions they convey. | DNN | Extended Cohn-Kanade (CK+) dataset | – |
| Fu et al.13 | A framework for handling incomplete conversational information in MERC that exploits higher-order modality and multi-frequency information and fully leverages semantic dependencies. | GRU and Spectral Domain Reconstruction Graph Neural Network (SDR-GNN) | IEMOCAP, CMU-MOSI, and CMU-MOSEI datasets | Runtime of SDR-GNN-mini is 7.52 s |
| Li et al.14 | ERNetCL, a model that combines RNN and multi-head attention (MHA) techniques in a simplified design to capture spatial and temporal contextual information. | TE, SE, and CL | MELD, IEMOCAP, EmoryNLP, and DailyDialog datasets | Weighted F1 of 66.31%, 69.73%, 39.71%, and 53.09% on the four datasets, respectively |
| Kusal et al.15 | Presents a hybrid DL model based on a convolutional-recurrent network to identify individuals' emotions from conversational text. | Neural Network Language Model (NNLM), CNN, Recurrent Neural Network (RNN) | Empathetic Dialogues dataset | Accuracy of 73.62% |
| Feng et al.16 | Establishes a multimodal technique that combines text and speech information to fully exploit emotion-relevant cues, applying a multiscale MFCC representation with a multi-view attention mechanism. | Multiscale MFCC and Multi-Head Attention (MHA) | IEMOCAP and MSP-IMPROV datasets | WA of 0.754 and UA of 0.742 on IEMOCAP |
| Zhang et al.17 | An automated emotion analysis method that enables a machine to infer the emotional state conveyed by a person's EEG signals. | CNN and LSTM | DEAP dataset | Accuracy of 92.98% |
| Omarov and Zhumanov18 | Proposes a Bi-LSTM technique for emotion analysis and detection in textual content, able to exploit both preceding and following context to improve performance. | Bi-LSTM | Kaggle Emotion Detection Dataset | Weighted-average precision, recall, and F-score of 90% |
| Hicham and Nassera19 | To develop a stacked DL technique for efficient multilingual opinion mining. | RoBERTa-GRU, RoBERTa-LSTM, RoBERTa-BiGRU, RoBERTa-BiLSTM, Adam optimizer, AEDA, SMOTE, GPT | French, English, and Arabic corpora | High efficacy and improved classification, evaluated with accuracy, Cohen's kappa, ROC-AUC, MCC, and k-fold cross-validation |
| Mahajan, More, and Shah20 | To develop and evaluate models for recognizing single and mixed emotions in multilingual, code-mixed YouTube comments. | LR, SVM, NB, RF, LSTM, BiLSTM, GRU, CNN | 13,000 multilabel YouTube comments | SVM achieved the highest accuracy (evaluated with accuracy and F1-score) |
| Zhu et al.21 | To develop an accurate and efficient medical question-answering system using advanced AI techniques. | Knowledge Embedding, Transformer-based Architecture, Knowledge Understanding Layer, Answer Generation Layer | MCMLE and USMLE datasets | 82.92% on MCMLE and 64.02% on USMLE |
| Khan et al.22 | To develop an efficient DL method for accurate violence detection in surveillance videos using a two-stream approach. | Two-Stream Architecture, 3D Convolution Network, Background Suppression, Optical Flow Analysis, Depth-Wise 3D Convolutions | RWF2000, RLVS | Accuracy, Efficiency |
| Arumugam et al.23 | To develop a novel multimodal framework integrating audio, visual, and text inputs for accurate emotion recognition. | AVTEFN, Hybrid Wav2Vec 2.0 + CNN, BERT with Bi-GRU, Attention-Based Fusion | Benchmark dataset | Accuracy of 98.7%, precision of 98.2%, recall of 97.2%, and F1-score of 97.49% |
| Khan et al.24 | To enhance multimodal emotion recognition by capturing inter- and intra-modal relationships using a joint transformer-based model. | MER, JMMT | IEMOCAP, MELD | Accuracy, F1-Score |
| Alyoubi and Alyoubi25 | To develop an optimized transformer-based multimodal emotion recognition framework for accurate emotion classification. | BERT/RoBERTa for Text, wav2vec 2.0 for Speech, ResNet50/VGG16 for Visuals, Cross-Modal Attention, SHAP Explainability | Multimodal EmotionLines (MELD) Dataset | Accuracy, Explainability |
| Vani et al.26 | To develop and evaluate Text Fusion+, combining advanced text analysis with audio output. | OCR, NLP, TTS, DL Summarization, NLP-based Q&A Module | Standard dataset | Summarization Accuracy, User Accessibility |
| Khan et al.27 | To explore sequence learning techniques for accurate auditory emotion recognition using advanced RNNs. | LSTM, GRU, Bidirectional LSTM, Deep/Multilayer Architectures | Emotion Dataset | Accuracy, Model Robustness |
| Ghous, Najam, and Jalal28 | To detect emotional states in individuals with cognitive disabilities using EEG data and advanced ML models. | BF, Downsampling, AAFST, Multi-Class SVM | SEED-IV | Accuracy, Emotion Detection |
| Patil et al.29 | To predict learning disabilities using handwritten text analysis and intelligent systems. | DNN, Character Confusion Detection, Pattern Extraction Techniques | IAM Handwriting Dataset | Accuracy, Scalability |
| Mishra et al.30 | To develop a transformer-based NLP model to support mental health monitoring. | Transformer Architecture, GPT-4 and BERT, Adam Optimizer | English Twitter Dataset | Accuracy of 94% |
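Several of the works summarized in Table 1 (e.g., Omarov and Zhumanov18, and the recurrent baselines of Mahajan, More, and Shah20 and Khan et al.27) rely on Bi-LSTM layers, which read a token sequence in both directions so that each position is informed by preceding and following context. The snippet below is only a minimal illustrative sketch of such a text emotion classifier in Keras; the toy corpus, label set, vocabulary size, and hyperparameters are assumptions made for illustration and are not taken from the cited papers.

```python
# Minimal, illustrative Bi-LSTM text emotion classifier (Keras).
# The toy data, label set, and hyperparameters are assumptions for
# illustration only; they do not reproduce any of the reviewed works.
import numpy as np
from tensorflow.keras import layers, models

# Toy corpus standing in for a conversational-text emotion dataset.
texts = ["i am so happy today", "this is terrible news", "i feel calm and relaxed"]
labels = np.array([0, 1, 2])  # e.g., 0 = joy, 1 = anger, 2 = neutral (assumed label set)

# Map words to integer ids and pad to a fixed sequence length.
vectorizer = layers.TextVectorization(max_tokens=10_000, output_sequence_length=32)
vectorizer.adapt(texts)
x = vectorizer(np.array(texts))

# Embedding -> Bi-LSTM (forward and backward context) -> softmax over emotion classes.
model = models.Sequential([
    layers.Embedding(input_dim=10_000, output_dim=64, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x, labels, epochs=3, verbose=0)  # tiny run, purely to show the training call
```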