Abstract
Facial Emotion Recognition (FER) is a very challenging task due to the varying nature of facial expressions, occlusions, illumination, pose variations, cultural and gender differences, and many other aspects that cause a drastic degradation in the quality of facial images. In this paper, an anti-aliased deep convolution network (AA-DCN) model has been developed and proposed to explore how anti-aliasing can improve the recognition fidelity of facial emotions. The AA-DCN model detects eight distinct emotions from image data. Furthermore, their features have been extracted using the proposed model and numerous classical deep learning algorithms. The proposed AA-DCN model has been applied to three different datasets to evaluate its performance: the Extended Cohn-Kanade (CK+) database, achieving an ultimate accuracy of 99.26% in (5 min, 25 s); the Japanese Female Facial Expressions (JAFFE) dataset, obtaining 98% accuracy in (8 min, 13 s); and one of the most challenging FER datasets, the Real-world Affective Face (RAF) dataset, reaching 82% in a low training time (12 min, 2 s). The experimental results demonstrate that the anti-aliased DCN model significantly improves emotion recognition while mitigating the aliasing artifacts caused by the down-sampling layers.
Introduction
In human communication, emotions are the first signs that express how people feel on the inside. These emotions enable them to communicate with one another and with their environment, and they have been revolutionizing the way people interact with technology, whether through facial expressions, physiological signals, or tone of voice1. In daily life, the influence of facial expressions on overall communication varies from 55 to 93%, so a large amount of useful emotional data may be acquired by detecting facial expressions2. This is why, when compared to other technologies, automated FER has received the greatest attention from researchers. Automated FER has been widely applied in the discipline of computer vision, such as human-computer interaction, smartphones, security, behavioral psychology (criminal psychic analysis), medical treatment, observation of driver exhaustion, animation, and other fields3. It is also a fundamental technique in robot vision, allowing robots to understand human emotions. For many years, Deep Convolutional Neural Networks (DCNs) had been considered invariant to small image transformations such as scaling, translation, and other minor modifications. As a result, they are frequently employed in the recognition of facial emotions. However, numerous researchers have lately demonstrated that this is not the case and that DCNs are in fact shift-variant4. One frequent reason is down-sampling (stride) strategies that disregard the sampling theorem, resulting in the aliasing problem. Aliasing in a DCN happens when high-frequency image components are mistakenly represented as low-frequency ones during the down-sampling process, leading to data loss. This causes a loss of critical features and jagged edges, which can negatively impact the DCN's overall performance. For example, aliasing might cause a DCN to incorrectly label one emotion as another when performing facial emotion classification tasks, leading to a significant decrease in accuracy. Anti-aliasing is one potential fix to this issue; it employs a fundamental signal-processing principle, namely that one ought to always blur just before subsampling, yet recent CNN architectures do not follow this approach5. Unlike numerous prior studies that employed anti-aliasing techniques in deep learning (DL), this work presents anti-aliasing in a CNN methodology designed to tackle the aliasing difficulty in FER systems.
For example, Zou et al.6 developed an enhanced low-pass filtering layer that addresses aliasing issues, which remain an obstacle in deep learning. This layer estimates filter weights for each channel group and spatial location in the input feature maps. The approach was then evaluated on a variety of applications, including COCO instance segmentation, ImageNet classification, and landscape segmentation. The results indicate that this technique easily adapts to different feature frequencies, eliminating aliasing while retaining key identifying information7. Furthermore, Ning et al. recently employed the existing WaveCNet anti-aliasing approach for tiny-object detection. In each ResNet residual block pathway, the authors deployed WaveletPool uniformly. WaveCNet reliably avoids aliasing by replacing standard down-sampling procedures in CNNs with wavelet pooling (WaveletPool) layers. Experiments on the WiderFace, DOTA, and TinyPerson datasets demonstrate how important anti-aliasing is for tiny object detection and how competently the recommended method succeeds in yielding new state-of-the-art results on all three datasets8.
In this paper, after the dataset collection, preprocessing, augmentation, and analysis of the recognition accuracy of traditional CNN models, the datasets go through two main phases: (i) the first phase extracts features from facial images based on an optimized deep CNN model; (ii) the second phase employs a hybrid (AA-DCN) model using a tuned blur filter to achieve an optimal anti-aliasing effect, resulting in more accurate emotion recognition. Figure 1 shows the main components of the two proposed recognition algorithms.
The key contributions made in this research can be summarized as follows:
a) Preprocessing and augmenting the utilized datasets to expand and balance their size, enhancing the training capacity.
b) Evaluating and analyzing the performance of several classical deep CNN models (VGG16, VGG19, InceptionV3, Xception, EfficientNetB0, ResNet50, and DenseNet121) for classifying emotions from facial expressions.
c) Proposing a deep learning-based DCN approach (Algorithm 1) to extract features that provide a significant impact for enhanced facial emotion recognition.
d) Proposing a hybrid (AA-DCN) model (Algorithm 2) that uses tuned blur-pool layers together with max-pool layers in the DCN model to increase emotion recognition accuracy by improving image quality for an optimal anti-aliasing effect.
e) Tuning the hyper-parameters of both proposed models while testing them on various FER benchmark datasets.
f) Comparing the performance of existing studies with the proposed approaches.
The remainder of this paper is organized as follows: Sect. (2) discusses related work, Sect. (3) focuses on classical CNN architectures, Sect. (4) introduces the developed DCN method employed in this study, Sect. (5) describes the utilized datasets, Sect. (6) explores the experimental outcomes and discussions of the applied DCN methodology, Sect. (7) presents the AA-DCN model along with the detailed experimental process, and Sect. (8) concludes the paper, discusses the limitations of this study, and makes suggestions for further developments.
Related work
A novel FER system has been presented by Umer et al. using deep learning. The algorithm is divided into three steps: (a) a face detection process to define a region of interest, (b) feature learning through a CNN architecture, and (c) data augmentation techniques employed to enrich the learning, which leads to a great enhancement in the performance of the FER method. The experimental results showed high performance in comparison to current state-of-the-art approaches3. Chowdary et al. have investigated transfer learning approaches for facial emotion classification. The authors eliminated only the fully connected layers of the pre-trained models and added new fully connected layers that were more suitable for the task. The MobileNet model achieved the highest performance among all four pre-trained models because of its faster performance and small number of parameters. One of the limitations of the proposed model was using only one dataset in the experiments9. Abate et al. have investigated the influence of masked faces on recognizing emotions from facial images. They discussed how the best-performing algorithms, such as CNN, ResNet, and ARM, could be retrained in three different occlusion scenarios in the presence of facial masks. The results reported in this study were useful to draw attention to the challenging occlusion problem, but they were not the best10. Shaik et al. have aimed to develop a novel deep learning strategy known as the "Visual-Attention-based Composite Dense Neural Network" (VA-CDNN) that focuses on extracting attention-based features from several faces. To extract global features from a normalized face, Viola-Jones methods and the Xception model have been used to extract localized landmarks (the mouth and eye pairs). Then, to categorize the facial expressions, a deep neural network has been constructed that accepts both local and global attention information. Although the suggested model outperformed many recent advances in FER, this strategy only operated on frontal pictures and was confined to real-time invariant face data11. Saurav et al. have published a real-time Dual Integrated CNN (DICNN) model for facial emotion categorization in the wild. Face detection, alignment, and recognition using the suggested DICNN model are the three phases of the FER approach. This methodology was developed and implemented on an embedded platform. Although the model efficiently recognizes facial expressions, it suffers from misclassification, mostly in the fear category12.
Rajan et al., on the other hand, have presented a hybrid, layered CNN methodology for real-time FER. The proposed model is split into three stages: first, two pre-processing procedures have been conducted, one to improve the edges and another to cope with illumination variations. Second, weighted histogram equalization (input 1) and edge enhancement (input 2) have been fed into a dual CNN layer to obtain the feature maps. Finally, these characteristics have been integrated and fed into the LSTM, and then connected to the global average pooling (GAP) layer to reduce the number of characteristics. Following that, the SoftMax layer estimated the expression. This model has been evaluated using a self-created database as well as three publicly available FER datasets. The recommended approach performs well in distinguishing surprised and joyful reactions but poorly for sadness and anger13. Khattak et al. have revealed a CNN technique for classifying age, emotions, and gender from face data. Unlike prior studies, which faced degradation in image quality resulting from a poor selection of CNN layers, this model utilized an appropriately optimized number of layers to improve the classification accuracy. However, the experiments carried out on gender and age employed just one domain dataset, and the other datasets used in classification were restricted14.
Bentoumi et al. have presented a hybrid approach for FER associating classical models (VGG16, ResNet50) with a multilayer perceptron (MLP) classifier. The classical models have been employed as feature extractors by adding only the GAP layer; no fine-tuning was done to the network parameters. Early stopping has been utilized to avert overfitting in the MLP and has also improved the overall accuracy in terms of generalization. The method still needs to be tested on large datasets for recognizing facial emotions15. Liu et al. have developed a new deep learning model to improve the prediction accuracy of facial emotion. To combat the effects of ambient noise, a pose-guided face alignment approach has been developed to eliminate intra-class differences. A fused ResNet and VGG-16 model has also been created to reduce training time. The suggested approach has various benefits, including the full utilization of facial alignment to minimize the influence of ambient noise, including changes in posture, lighting, and occlusion. Furthermore, the model efficiently distinguishes between comparable sentiments such as fear and disgust. However, the classification performance still has to be improved16.
Wang et al. have coupled the benefits of the attention mechanism with multi-task learning. The suggested multi-task attention network (MTAN) has been enhanced in two ways: task and feature. Using the self-attention mechanism, the MTACN network focused on the relevance of each attention module for each individual task. Furthermore, the MTCAN model has been presented to solve the problem of task divergence. As a result, the self-attention mechanism is added to capture the distance dependency between the attention modules of particular tasks, depending on the two tasks (classification and regression). The aspects of each task are then thoroughly learned. The suggested classification task and emotion recognition accuracy still need to be improved17. Taskiran et al. have proposed another hybrid face recognition (HFR) method to increase the robustness of face recognition. The HFR system comprises six steps: face detection from video frames; detection of facial landmarks to extract dynamic characteristics during the smiling action; extraction of appearance features from landmarks during a smile using three different pretrained architectures (ArcFace, VGGFace, and VGGFace2); extraction of dynamic facial features for gender detection; and feature selection and classification using an Extremely Randomized Trees classifier. The proposed model could be useful for recognizing faces in videos extracted from systems whose images may contain illumination variations, noise, and blur. However, the accuracy still needs to be improved for better face recognition18. EmNet (Emotion Network), a deep integrated CNN model, has been investigated by Saurav et al. The EmNet model improved the integrated variation of two structurally comparable deep CNN models using a joint optimization approach. The new FER technique's efficiency has been evaluated on an embedded platform with limited resources, and it achieved a significant gain in accuracy over current methods. Furthermore, EmNet's three prediction outputs were joined using two integration algorithms (averaged and weighted maximum). The suggested model functioned well in identifying facial images in the neutral, surprise, disgust, and happiness classes but struggled in the sad and afraid classes19.
Devi et al. have used a novel Deep Regression (DR) classifier to recognize facial emotions. The DR model is divided into six phases: pre-processing with the Gamma-HE algorithm, facial point extraction with the Pyramid Histogram of Oriented Gradients (PHOG) algorithm, segmentation with the Viola-Jones Algorithm (VJA), feature extraction, feature selection, and finally classification. In comparison to current algorithms, the presented FER model achieved significant accuracy. However, the major issue in this work was the high training time20. Li et al. have presented an improved FER methodology based on ResNet-50. The method uses a CNN model for expression recognition. Also, to overcome the overfitting problem that may occur, the 10-group cross-validation technique has been chosen; each group consisted of 10 images representing the seven emotions. Even though the proposed technique had a good recognition effect and good accuracy, more images need to be collected than in this experiment to further improve facial recognition21. Arora et al. have presented a hybrid automatic system that can differentiate the emotions connoted on the face. Principal Component Analysis (PCA) and a gradient filter were used for feature extraction, and Particle Swarm Optimization (PSO) is used to optimize the extracted features for each emotion.
The authors have achieved high classification accuracy, but with only one dataset in the testing phase22. Zheng et al. have constructed a hybrid of Inception-ResNetV2 and an attention mechanism called the Convolutional Block Attention Module (CBAM) to increase the capacity of instructors to recognize expressions in real-world environments. The Inception-ResNetV2 was utilized to extract the deep expression features and was deployed as a globalization network to mitigate the issue of over-fitting during the learning phase. The attention module (CBAM) is included to focus on expression details. In addition, a new dataset of intensity-based facial expressions known as EIDB-13 is generated. The model might also assess students' interest in educational material. For better feature extraction, this method needs to be optimized further23.
Fontaine et al. have focused their research on the role of AI in assessing postoperative pain. To categorize and identify distinct patients' facial expressions, a DCNN system (ResNet-18) is presented and evaluated. Their data was collected before and after surgery using self-reported pain intensity (NRS, from 0 to 10). The suggested DL method accurately predicted pain intensity among these 11 available ratings. The findings indicated that facial expression analysis-based AI might be highly beneficial in recognizing severe pain, particularly in persons who are unable to adequately describe their suffering. However, the authors did not compare the predicted results to human observers' assessments, and they used a pre-trained ResNet-18 model due to the low data availability24. Ching et al. have presented a real-time entertainment greeting system using a CNN model to improve the down mood of any passerby. The CNN model has been used to detect eyes, faces, and mouths from a captured image using the VJA. The emotions are recognized from the eyes and mouth, while the face is used to recognize a known user. After that, a funny 3-D animation is played depending on the detected mood. The experimental results showed that the presented model recognized and identified the face and emotion well, but the proposed approach was limited to three emotions only (happy, regular, and unhappy), as the main aim of the system was to locate passers-by who were unhappy25.
A deep convolution neural network approach based on a local gravitational force descriptor was presented by Mohan et al. as a means of classifying facial expressions. There are two components to the suggested approach: a unique deep convolution neural network (DCNN) model is fed with the local gravitational force descriptor, which is first used to extract local characteristics from face photos. The DCNN comprises two branches: the first branch identifies geometric aspects, including edges, curves, and lines, while the second branch extracts holistic information. Lastly, the final categorization score is calculated using a score-level fusion approach. The long training time of this work hindered its performance, even if the findings show that it beat all state-of-the-art approaches on all databases6. Furthermore, FER-net, a convolutional neural network designed to effectively differentiate facial expressions, was developed by Mohan et al. Features are automatically extracted from facial regions using FER-net and then passed to a Softmax classifier in order to identify the expressions. FER-net was evaluated on five benchmarking datasets: FER2013, JAFFE, CK+, KDEF, and RAF, with average accuracy rates of 78.9%, 96.7%, 97.8%, 82.5%, and 81.68%, respectively. The acquired findings show that FER-net is superior when compared to recent research26.
A deep convolutional neural network called LieNet was developed by Mohan et al. to accurately identify the multiscale variations of deceit. First, the first 20 frames from each video are retrieved and synthesized to create a single image, and an audio signal is also extracted from the video. In addition, 13 channels of EEG signals are plotted on a 2D plane and concatenated to create an image. Second, features are extracted from each modality independently by the LieNet model. Third, a Softmax classifier is used to estimate scores across all modalities. Experimental results show that LieNet outperforms previous research on the BoL database's Set-A and Set-B, with average accuracies of 95.91% and 96.04%, respectively. LieNet achieved 97% and 98% accuracy on the RL trial and MU3D datasets, respectively27.
In data-limited circumstances, Suzuki et al. devised a knowledge-transferred fine-tuning method for producing anti-aliased convolutional neural networks (CNNs). While fine-tuning the anti-aliased CNN, the authors applied knowledge from a pre-trained CNN that had not been overfitted to the restricted training data. To accomplish this goal, they use two forms of loss to transfer information: pixel-level loss for detailed knowledge and global-level loss for general detection knowledge. The findings on the ImageNet 2012 dataset reveal that the knowledge-transferred fine-tuning yields high precision with hyper-parameter modifications28.
Zhang presented an anti-aliased CNN model, which incorporates blur filters into the standard down-sampling processes, such as strided convolution and pooling layers. The (low-pass) blur filter in the anti-aliased CNN eliminates the aliasing effects produced by down-sampling. As a result, anti-aliased CNNs outperform standard CNNs without blur filters in recognizing facial images. Based on this, numerous studies have refined the anti-aliased CNN and proven that blur filters work well for a wide range of visual recognition tasks, although their use depends on the nature of the given task and the data used5. A synopsis of past relevant work is given in Table 1.
FER based deep learning architecture
Deep learning models have lately demonstrated more promising performance in FER than other conventional technologies, thanks to the availability of high-performance computing facilities. Deep learning refers to "any training methodology capable of training a system with more than two or three non-linear hidden nodes." The fusion of carefully weighted, multi-layered data extraction makes it a better FER method in comparison to other strategies such as the Bayesian network, artificial neural network (ANN), hidden Markov model, support vector machine, etc. The deep convolutional neural network (DCNN) is the most effective deep learning network for extracting features from facial expressions1.
Convolutional neural network
A CNN, also known as a ConvNet, is a subtype of neural network that has become the most popular approach for computer vision due to its superior performance in handling images, videos, audio signals, and other visual inputs. A simple CNN is composed of three major kinds of layers: an input layer, one or more convolutional and pooling layers, and a fully connected (FC) layer. The data, in this case facial images, is passed to the CNN through the input layer and then travels through numerous hidden layers before reaching the output (FC) layer. The output layer demonstrates the network's prediction, in which the facial expression is classified based on the output of the FER classifier. This output is compared to the real labels to evaluate the network performance. The deep CNN (DCN) has been commonly used and provides a more scalable approach in FER by adopting linear algebra methods, especially matrix multiplication, to find patterns in an image. While the CNN algorithm has made considerable progress in identifying facial expressions, numerous flaws remain, such as overly long training intervals that can be computationally intensive and necessitate the use of graphics processing units (GPUs) to train models, the aliasing problem, and low recognition rates in complicated scenarios24.
Traditional CNN architectures
There are several classic CNN architectures; however, this work concentrates on and compares the proposed framework to VGG16, VGG19, ResNet50, DenseNet121, InceptionV3, Xception, and EfficientNetB0. In the next section, these CNN models are briefly discussed from the viewpoint of this study.
- VGG: VGG comes in two versions (16 and 19). VGG-19 optimizes the network by replacing larger kernel-size filters with several (3 × 3) kernel-size filters, one after the other.
- ResNet50: ResNet50 improved the CNN conceptual design by including the concept of residual learning, often known as an "identity shortcut connection", which allows the network to be trained on hundreds of layers without compromising performance. In the residual formulation, \(F_{1}^{k}\) is the input of the first layer and \(y_{c}\left(F_{1\to m}^{k}\right)\) is the transformed signal, which produces a cumulative output \(F_{m+1}^{k}\) that is then provided to the next layer after being combined with the activation function \(y_{a}\).
- DenseNet: a convolutional network in which every layer is connected to all deeper layers in the network. DenseNet was designed primarily to reduce the vanishing-gradient loss and the decline in accuracy in very deep neural networks.
- Inception-V3: Inception-V3 aims to decrease the computational complexity of deep networks while maintaining generalization.
- Xception: Xception is a deep CNN architecture with depth-wise separable convolutions.
- EfficientNet: a scaling method that employs a compound coefficient to uniformly scale the depth, width, and resolution dimensions.
Algorithm 1: proposed DCN overview
This study demonstrates two effective FER systems that can identify up to eight different facial expressions (angry, neutral, surprise, happy, sad, disgust, contempt, and fear). The input to the system is an image containing a facial region with a specific expression. The first provided DCN approach is made up of several convolutional, dropout, and dense layers; a combination of mixing, matching, and layering has been used to develop an optimal structure that outperforms previous architectures. The convolutional model's whole layout is revealed in Fig. 2.
Following the input layer, three convolutional layers are applied (Stage-1). For each input channel, the convolution operation generates numerous feature maps. Subsequently, another three convolutional layers (Stage-2) have been applied; the fourth and fifth convolutional layers have 32 filters, and the sixth convolutional layer has 16 filters. Another two convolutional layers (Stage-3) are conducted, both having 16 filters. Kernels of size 3 × 3 and a stride of one are employed in all the convolutional layers, with no padding. The convolution operation can be expressed using Eq. (4)29.
where \(F_{m}(m,n)\) is the convolution value in the resulting feature map at location \((m,n)\), \(I\) is the input image, \(b\) is the bias, and \(k\) is the kernel with rows \(t\), columns \(z\), and size \(s\).
A batch normalization layer was added to speed up the training process, and a max pooling layer is placed after each convolutional stage (Stages 1, 2, and 3). Each of the three pooling layers has a kernel size of 2 × 2 with a stride of 2. The max pooling layers are employed for down-sampling. Equations (5) and (6) define the output feature map size following the pooling process29.
where \(W^{\prime}\) and \(h^{\prime}\) are the output feature map's width and height, \(W\) and \(h\) are the input feature map's width and height, \(p\) is the pool dimension, and \(s\) is the pooling layer's stride size. The formula in Eq. (7) is used to calculate the feature map's size after each convolution operation29.
where \(L\) stands for the given input size, \(K\) for the kernel size within each layer, \(P\) for padding, and \(s\) for stride size.
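Since the referenced equation images are not reproduced in this text, the following standard forms, consistent with the symbol definitions above, are given as a reconstruction (an assumption about the notation of Eqs. (4)-(7), not a verbatim copy):

\(F_{m}(m,n)=b+\sum_{t}\sum_{z} I(m+t,\,n+z)\,k(t,z)\)  (cf. Eq. (4))

\(W^{\prime}=\left\lfloor \frac{W-p}{s}\right\rfloor+1,\qquad h^{\prime}=\left\lfloor \frac{h-p}{s}\right\rfloor+1\)  (cf. Eqs. (5), (6))

\(L^{\prime}=\left\lfloor \frac{L-K+2P}{s}\right\rfloor+1\)  (cf. Eq. (7))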
Afterward, a fully connected network is applied that is composed of two dense layers (2020 and 128 units, respectively). Dropout is applied to regularize the convolutional layers, with a 0.7 chance of preserving every neuron. Throughout the network, a ReLU activation function is employed. The Adadelta, SGD, and Adam optimizers have all been used in the tuning phase for optimizing the hyperparameters; however, Adam delivers the best results, hence it has been utilized. The loss is evaluated with the categorical cross-entropy function. The output layer is composed of seven units to detect the seven facial expressions. In the final dense layer, the SoftMax function has been applied as an activation function to produce the most probable class of the input data in the classification phase. The ReLU function can be computed by Eq. (8)29.
The goal of this DCN model is not only to determine the most effective network for the model but rather to compare the classification capabilities of various traditional CNN models on different well-known datasets. This is why the core idea behind the architectural selections relies on a fairly regular network design premised on well-known regularization methods, small kernels, and a smooth set of hyper-parameters. Adding a dropout layer also stabilizes the learning. Figure 2 illustrates the layering structure of each stage. The output features from each stage are the input to the following stage, i.e., the output of Stage-1 is the input to Stage-2, and so on. A sketch of this layering is given below, and the next algorithm illustrates the procedures applied in the proposed DCN methodology.
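As an illustrative aid only, the following Keras sketch approximates the layering described above (three convolutional stages with batch normalization and max pooling, followed by two dense layers and a SoftMax head). The Stage-1 filter count and the input size are assumptions, since the text does not state them; the remaining settings follow the description.

```python
# Minimal sketch of the described DCN (Algorithm 1) in Keras.
# Assumptions: Stage-1 filter count (64) and the input size are placeholders;
# dense sizes (2020, 128), 3x3 kernels, stride 1, no padding, 2x2 max pooling,
# ReLU, Adam and categorical cross-entropy follow the text above.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dcn(input_shape=(128, 128, 1), num_classes=7):
    def conv_stage(x, filters_list):
        for f in filters_list:
            x = layers.Conv2D(f, (3, 3), strides=1, padding="valid",
                              activation="relu")(x)
        x = layers.BatchNormalization()(x)             # speeds up training
        return layers.MaxPooling2D((2, 2), strides=2)(x)  # down-sampling

    inputs = layers.Input(shape=input_shape)
    x = conv_stage(inputs, [64, 64, 64])   # Stage-1 (filter count assumed)
    x = conv_stage(x, [32, 32, 16])        # Stage-2 (as described)
    x = conv_stage(x, [16, 16])            # Stage-3 (as described)
    x = layers.Flatten()(x)
    x = layers.Dense(2020, activation="relu")(x)
    x = layers.Dropout(0.3)(x)             # keep probability 0.7
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```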
Experiment procedure
In the proposed framework, the emotion data employed to train the facial emotion recognizer comes from a standardized dataset, and the recognizer is then evaluated using the remaining data that was not included in the training set. As a consequence, the following key phases of interest have been accomplished: preprocessing and augmentation of the facial data, followed by implementation of all deep learning techniques. The detailed structure of the suggested deep CNN model is demonstrated in Fig. 3. The primary goal is to estimate the recognition performance for all facial expressions (disgust, angry, fear, happy, sad, neutral, contempt, and surprise) in the datasets mentioned before.
Data sets description
This study makes use of three well-known, freely accessible databases of real-world facial emotions: RAF-DB, JAFFE, and CK+. The datasets are provided in References31,32,33. These datasets provide a variety of complex scenarios as well as unbalanced samples for various condition states. The details are as follows:
RAF_DB (Real-world affective face database)
It is a real-world database with different real-world problems, such as those involving masks, smoking, hair covering the eyes, etc. Some of the challenges in this dataset are shown in Fig. 4. In addition to the different poses and illumination, it can be considered one of the most challenging facial datasets to deal with.
The RAF dataset was gathered and built using the Internet and labelled using crowdsourcing. It is divided into two types of expression groups: seven basic and eleven compound. This experiment places an emphasis on basic emotion recognition and uses six basic emotion subjects (surprise, happy, sad, disgust, fear, and angry) along with the (neutral) expression. In particular, RAF-DB contains 15,339 basic emotion images, all 100 × 100 RGB images, of which approximately 80% (i.e., 12,271 images) are used in the training phase and 20% (i.e., 3,068 images) in the testing phase. A significant number of them are aligned facial images with relatively low resolution.
JAFFE (Japanese female facial expressions)
This is a commonly used dataset for facial expressions, consisting of 213 grayscale images with a resolution of 256 × 256 from 10 Japanese women. Each individual posed three to four examples of each basic expression as well as the neutral mood. This dataset is challenging to analyze because it provides only a few samples of facial expression images.
CK+ (Extended Cohn-Kanade)
CK+, a more detailed version of the original CK dataset (Kanade et al., 2000), is frequently used for expression recognition tasks and contains 593 frame sequences from subjects ranging in age from 18 to 50 years old, with a variety of genders and cultures. Out of these samples, 327 sequences involving 118 different individuals have been annotated. The emotions include anger, sadness, happiness, contempt, disgust, surprise, fear, and neutral, with a resolution of 48 × 48 and in PNG format. All of these image sequences were captured in a laboratory-controlled setting. The CK+ dataset has been augmented to increase its size to 4,021 images. Figure 3 displays a sample of facial images involving multiple facial emotions from the three datasets; an illustration of their strengths and gaps is summarized in Table 2, and a summary of them is shown in Table 3.
Dataset preprocessing
Data preparation is a vital step in computer vision. The term "preprocessing" refers to all of the adjustments applied to the raw data prior to delivery to the deep CNN model. The first stage of the preprocessing phase is to import all the required libraries, such as NumPy, Matplotlib, and Pandas.
The JAFFE dataset has been rescaled to 128 × 128 in JPG format. Following that, various augmentation procedures (image rotation, zooming, width and height shifting, shear mapping, horizontal and vertical flipping, and brightness enhancement) have been carried out on CK+, JAFFE, and RAF to boost the capacity of the input datasets and balance them, delivering more accurate and faster performance with the suggested DCN approach. Aside from that, Ref.33 provides details on these augmentation techniques. Table 3 shows the new distribution of the facial datasets after augmentation.
By increasing the dataset size, augmentation can assist in making the model perform better by decreasing overfitting. When a model learns the patterns in the training data too effectively yet lacks the ability to generalize to new data, overfitting occurs. By generating new data that is similar to the training data but not identical, augmentation can aid in the reduction of overfitting. As opposed to learning the precise patterns in the training data, this can assist the model in learning the general patterns in the data. Also, balancing the datasets will improve the accuracy of identifying each emotion correctly, so it is an important preprocessing step. Figure 5 illustrates the effect of augmentation on Ck+, JAFFE, and RAF-DB. The final step is dividing the dataset into train and test sets to be delivered to the deep CNN-model.
Keras ImageDataGenerator has been employed for augmenting the three datasets since it offers significant advantages: adopting a generator architecture for data augmentation provides customized, consistent, and efficient augmentation with little code compared with manual techniques, permitting great versatility to improve the diversity of the CNN training data. A hedged configuration sketch is shown below.
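For illustration, the following sketch shows how such an augmentation pipeline could be configured with Keras ImageDataGenerator; the specific parameter values (rotation range, zoom range, etc.) and the directory path are assumptions, since the text names the transformation types but not their magnitudes.

```python
# Hedged sketch of the augmentation setup; directory name and parameter
# values are placeholders, only the transformation types follow the text.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,           # normalize pixel values
    rotation_range=20,           # image rotation
    zoom_range=0.15,             # zooming
    width_shift_range=0.1,       # width shifting
    height_shift_range=0.1,      # height shifting
    shear_range=0.1,             # shear mapping
    horizontal_flip=True,        # horizontal flipping
    vertical_flip=True,          # vertical flipping
    brightness_range=(0.8, 1.2)  # brightness enhancement
)

# Stream augmented batches from a directory of per-emotion sub-folders
# (hypothetical path, one sub-folder per class).
train_generator = train_datagen.flow_from_directory(
    "data/ck_plus/train",
    target_size=(128, 128),
    color_mode="grayscale",
    class_mode="categorical",
    batch_size=32,
)
```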
Evaluation metrics
The performance of facial emotion recognition is evaluated with respect to precision, computing time, and complexity level. By delving into the deep learning techniques, it can be observed how parameter changes impact the model's performance during the training process. The most widely used performance measurements are the confusion matrix, recall, and accuracy34.
The four values that must be provided by the assessment techniques are true positives (\(T_{P}\)), false positives (\(F_{P}\)), true negatives (\(T_{N}\)), and false negatives (\(F_{N}\)). When an instance is correctly classified, it counts as a \(T_{P}\) or \(T_{N}\); when it is incorrectly classified, it counts as an \(F_{P}\) or \(F_{N}\).
- Precision: demonstrates the model's performance on the positive predictions in the testing set. It depicts how many of all predicted positive categories were properly anticipated. It can be calculated from Eq. (9)35.
- Sensitivity (Recall): this metric represents the number of positive samples that have been appropriately labeled as true positives, and it can be measured by Eq. (10)36.
- F1-score: a metric that incorporates sensitivity together with precision; it is calculated from Eq. (11)37.
- Accuracy: the number of correctly detected instances, determined by dividing the number of correct classifications by the total number of classifications. It is computed from Eq. (12)37. (Standard forms of these metrics are sketched after this list.)
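Since the equation images for Eqs. (9)-(12) are not reproduced in this text, the standard definitions these metrics follow are sketched here as a reconstruction consistent with the symbols defined above:

\(\text{Precision}=\frac{T_{P}}{T_{P}+F_{P}}\)  (cf. Eq. (9))

\(\text{Recall}=\frac{T_{P}}{T_{P}+F_{N}}\)  (cf. Eq. (10))

\(\text{F1-score}=\frac{2\cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}\)  (cf. Eq. (11))

\(\text{Accuracy}=\frac{T_{P}+T_{N}}{T_{P}+T_{N}+F_{P}+F_{N}}\)  (cf. Eq. (12))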
Results and discussion
In this research, several experimental studies using various deep learning models have been included. The first study analyzed classical CNN networks such as Inception, VGG, ResNet, and the other previously discussed architectures, while the second experiment concentrated on the proposed DCN model (Algorithm-1).
Following that are the experimental studies of the proposed AA-DCN model (Algorithm-2). The last experiment compares the proposed FER model against typical topologies. These experiments have been applied to the three datasets specified in Sect. 5.1. To evaluate the proposed DCN model, the following metrics have been utilized: confusion matrix, recall, precision, and F1 score.
All tests were carried out on a laptop with the following specifications: Microsoft Windows 10 operating system, Intel(R) Core(TM) i7-7600U CPU @ 2.80 GHz-2.90 GHz, 8 GB of RAM, and Intel HD Graphics 620. In addition, the experimental results have been evaluated on the Kaggle API with GPU100.
These parameters have been assigned after many trials in the training phase to reach the best recognition rate with minimal computing time. The parameters are assigned the values shown in Table 4, with batches of 32. Relative to other sizes, this predefined value is appropriate for the learning process and does not require a lot of computing time. Furthermore, the number of epochs in the training phase has been set to 30, which gave the best performance after evaluating a variety of other values from 10 to 100. The learning rate has been fixed at 0.0006. The TensorFlow framework and a Google open-source deep learning library are used to implement the suggested network, as sketched below.
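As a minimal sketch only, the training configuration described above could be expressed as follows in Keras; the build_dcn helper refers to the illustrative model sketched earlier, and the data generators are assumed placeholders.

```python
# Hedged sketch of the training setup: batch size 32 (set on the generator),
# 30 epochs, learning rate 0.0006 with the Adam optimizer, as described above.
import tensorflow as tf

model = build_dcn(input_shape=(128, 128, 1), num_classes=7)  # illustrative model from the earlier sketch
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0006),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    train_generator,                 # augmented training batches of size 32 (placeholder)
    validation_data=val_generator,   # held-out validation batches (placeholder)
    epochs=30,
)
```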
Evaluation of classical CNN on CK + dataset
In this work, some of the most popular and impactful pre-trained models that are often used for image classification tasks, with their best hyperparameters, have been adopted as starting points for comparison with the outcomes of the proposed methods. These classical models are deep learning models with complex neural networks that have already been trained on a massive and diverse dataset, learning general patterns and features, and are ready to be used or fine-tuned for a specific task. All of these models have been evaluated on the three datasets using the hyper-parameters in Table 7.
The capability of CNNs has been shown by deploying seven alternative architectures: ResNet5028, InceptionV338, VGG34, EfficientNet39, Xception40, and DenseNet41. The augmented CK+ training dataset has been utilized with those architectures.
The performance of each CNN model is shown in Table 5. Table 6 illustrates the classical CNN models' strengths and gaps. The confusion matrices of the classical CNN models are depicted in Fig. 6. The hyperparameter settings of the traditional CNN models are illustrated in Table 7. DenseNet121 had the lowest precision at 74%, then InceptionV3 at 77%; the findings improved with each subsequent architecture: VGG16 at 83%, ResNet50 at 87%, VGG19 at 88%, and Xception at 91%, with the best accuracy of about 93% achieved by EfficientNetB0.
The evaluation of the provided DCN on CK + dataset
By observing the pros and cons of the seven CNN architectures stated previously, two robust models have been suggested that unify all effective layers. The first DCN model (Algorithm-1) has been tested to analyze the effect of using augmentation and stride layers after effectively fine-tuning the model's hyperparameters, as illustrated in Sect. 4.
The DCN was initially evaluated on the augmented CK+ dataset after dividing it into two sets, a train set and a test set, with a 9:1 ratio. The model accuracy and loss curves are shown in Fig. 7. The confusion matrix of the suggested FER model is illustrated in Fig. 8, and Table 8 shows the classification report.
It has been proven that the recommended DCN model employing the CK + dataset has performed the best, with a training accuracy of 98.09%, while the validation accuracy has reached about 98.32% in only 3.32 min, achieving the best computational time with the highest recognition rate compared to the classical deep CNNs.
Evaluation of proposed DCN model utilizing more datasets
This experiment aims to test and evaluate the FER model's performance on two additional datasets in order to highlight the potential of the provided CNN method.
Using the JAFFE dataset
The following results have been evaluated on the JAFFE dataset described previously in Sect. 5.1. Table 9 shows the classification report of the proposed deep CNN model using the JAFFE dataset. Figures 9 and 10 show the accuracy and loss curves, and the confusion matrix, respectively.
It is concluded that the suggested DCN model employing the JAFFE dataset has performed efficiently, with the recognition rate rising to 95% and the training accuracy reaching 95.75% in only 6 min, achieving optimal results compared to the other classical CNN models.
Using the RAF-DB database
To further understand how well the proposed model performs, another dataset has been investigated. The same augmentation steps have been attempted to resolve the dataset's similarity issues, but the results did not significantly change; therefore, the original data has been used. Table 10 shows the results of the proposed deep CNN using the RAF dataset. Figure 11 illustrates the accuracy and loss curves, whereas Fig. 12 provides the confusion matrix. The developed FER model, utilizing the RAF dataset, has accomplished an accuracy of 76% and a training accuracy of 93.5% in 10 min.
The authors believe that the numerous issues within the dataset that have been described in Sect. 5.1 and displayed in Table 3 may be the cause of the inadequate recognition accuracy on RAF-DB. In order to enhance the outcomes, another FER model (AA-DCN) is introduced in this paper.
Figure 13 compares the training accuracy (a), test accuracy (b), and processing time (c) of the suggested approach applied to three different datasets (CK+, JAFFE, and RAF-DB) with the seven classical CNN models. Moreover, the evaluation of the classical CNN models on the three datasets is displayed in Table 11. It can be seen from the results that the suggested DCN for the FER methodology delivers the most effective test and training recognition performance and requires the minimum processing time. Therefore, the proposed DCN approach is computationally efficient when compared with the classical CNN models.
Anti-aliasing and deep CNNs
Modern deep CNN models were previously believed to be invariant to tiny image alterations, but many authors have recently demonstrated that they actually are not. Their convolutional structures are a potential reason for this. DCNs have been developed with an architecture that is essentially the same: convolution and sub-sampling processes alternate. The root cause of the loss of invariance is striding; convolution and pooling operations themselves, without striding, are shift-equivariant. Striding, or the subsampling process, is an aspect of both convolution and pooling layers, and that is where the issue arises, since it usually overlooks the traditional Nyquist sampling theorem.
According to that theorem, data loss (aliasing) occurs if the sampling rate is not at least twice the highest frequency of the signal, which can have a detrimental impact on the DCN's overall performance.
These pooling and strided convolution (down-sampling) operations provide a spatial resolution reduction of the intermediate feature maps and efficient computation.
In signal processing, "anti-aliasing" is a common technique that involves low-pass filtering (blur filtering) the input signal before sub-sampling. However, performance is decreased by merely adding this component naively to deep networks. Due to this, the down-sampling processes used in modern CNNs frequently lack anti-aliasing.
In his recent work, Zhang5 promoted an impressive architectural change in how to use this idea in today's CNNs. The author demonstrated that deep CNNs with blur pooling outperform deep CNNs without blur filtering in terms of precision and shift-invariance when applied to the CIFAR10 dataset35.
These modifications are shown in the following Eqs. (13), (14), (15), and (16)5.

where \(Blur_{m}\) is an anti-aliasing filter with kernel size \((m\times m)\), the pool kernel size is denoted by \(k\), and \(s\) is the stride.
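Because the equations themselves are not reproduced in this text, the anti-aliased pooling substitution, in the form popularized by Zhang5 and consistent with the symbols above, is sketched here as a reconstruction (presumably corresponding to Eqs. (13)-(16)):

\(\text{MaxPool}_{k,s}(x)=\text{Subsample}_{s}\left(\text{Max}_{k}(x)\right)\)

\(\text{MaxBlurPool}_{k,s}(x)=\text{Subsample}_{s}\left(Blur_{m}\left(\text{Max}_{k,\,1}(x)\right)\right)\)

with analogous replacements for strided convolution (\(\text{Conv}_{k,s}\rightarrow \text{Subsample}_{s}\circ Blur_{m}\circ \text{ReLU}\circ \text{Conv}_{k,\,1}\)) and average pooling (\(\text{AvgPool}_{k,s}\rightarrow \text{BlurPool}_{m,s}\)).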
The MaxBlurPool layer is a combination of: (i) a MaxPool layer (stride 1) that preserves shift-equivariance but performs no downsampling, and (ii) a BlurPool filter (stride 2), which combines an anti-aliasing filter, denoted \(Blur_{m}\), with subsampling. Thus, the blur filter is used mainly to avoid aliasing artifacts. An illustration of the MaxBlurPool layer is shown in Fig. 14, Fig. 15 details the operation, and a minimal code sketch follows.
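For illustration only, a MaxBlurPool layer of this kind could be sketched in Keras as below, assuming a fixed, normalized binomial blur kernel applied depthwise with stride 2 after a stride-1 max pool; this is a generic reading of the technique, not the authors' exact implementation.

```python
# Minimal sketch of a MaxBlurPool layer: stride-1 max pooling followed by a
# fixed (non-trainable) blur filter applied with stride 2 (anti-aliased
# downsampling).
import math
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class MaxBlurPool2D(layers.Layer):
    def __init__(self, pool_size=2, blur_size=3, stride=2, **kwargs):
        super().__init__(**kwargs)
        self.pool = layers.MaxPooling2D(pool_size, strides=1, padding="same")
        self.blur_size = blur_size
        self.stride = stride

    def build(self, input_shape):
        channels = int(input_shape[-1])
        # Binomial (Pascal-triangle) vector, e.g. [1, 2, 1] for size 3,
        # normalized so the 2-D kernel sums to one.
        vec = np.array([math.comb(self.blur_size - 1, i)
                        for i in range(self.blur_size)], dtype=np.float32)
        kernel2d = np.outer(vec, vec)
        kernel2d /= kernel2d.sum()
        # The same fixed kernel is applied to every channel (depthwise filtering).
        kernel = np.tile(kernel2d[:, :, None, None], (1, 1, channels, 1))
        self.blur_kernel = tf.constant(kernel)

    def call(self, x):
        x = self.pool(x)                        # (i) max, stride 1 (shift-equivariant)
        return tf.nn.depthwise_conv2d(          # (ii) blur + subsample, stride 2
            x, self.blur_kernel,
            strides=[1, self.stride, self.stride, 1],
            padding="SAME")
```

In a network built this way, each stride-2 MaxPool would simply be replaced by such a MaxBlurPool2D layer.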
Inspired by that work, this manuscript employs the MaxPool improvements shown in Eqs. (13) and (14) in the introduced DCN model (Algorithm-1), blurring before subsampling to clear aliasing effects in the intermediate feature maps. This could be a promising way to improve the emotion recognition efficiency of FER systems, and specifically to increase the accuracy on challenging datasets like RAF-DB; it may then be considered in other DCN models.
Compared to other anti-aliasing techniques for CNNs, using blur filters (like a Gaussian blur) provides a strong balance of simplicity, customization, and effectiveness. Some techniques, like learned downsampling, can adapt more optimally to the data, frequency-space filtering more directly minimizes aliases, and multi-scale training creates invariance to resolution changes. So, while other solutions can provide further benefits, blur filtering gives the best blend of conceptual simplicity and customizable integration while effectively improving model robustness. Its intuitive nature and straightforward implementation keep it accessible and adaptable compared to more complex options.
Algorithm 2: proposed AA-DCN overview
A novel hybrid model combining anti-aliasing (blur filters) and the proposed DCN model, known as AA-DCN, has been deployed and applied to the three augmented CK+, JAFFE, and RAF datasets to investigate whether applying blur filters to anti-alias the DCN increases or decreases accuracy in the case of facial emotion recognition.
As a frequent cause of aliasing is down-sampling (stride) strategies that disregard the sampling theorem, and the proposed DCN model adopts stride = 1 in all convolutional layers but stride = 2 in the MaxPool layers, aliasing artifacts could appear and lead to degradation in facial emotion recognition. For this reason, the MaxPool layers of the proposed DCN model have been replaced with MaxBlurPool layers in the proposed AA-DCN model.
The AA-DCN model follows the same layering structure as the proposed DCN in Sect. 4, with each MaxPool (stride 2) layer replaced by the MaxBlurPool operation. Simply put, after Stage-1, a MaxPool layer (stride 1) with a kernel size of 4 × 4 and a BlurPool layer with a kernel size of 2 × 2 have been utilized. Stage-2 follows with both MaxPool and BlurPool layers, and Stage-3 is followed by the same two steps. Blur filters of size m × m, with m ranging from 2 to 5, have all been tested in the kernel tuning phase. The weights are normalized, and the filters are the outer product of the following tuned vectors with themselves.

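The specific vectors are not reproduced in this text; presumably they are the normalized binomial kernels commonly used for blur pooling5, listed here as an assumption:

\(m=2{:}\;[1,\,1]\), \(m=3{:}\;[1,\,2,\,1]\), \(m=4{:}\;[1,\,3,\,3,\,1]\), \(m=5{:}\;[1,\,4,\,6,\,4,\,1]\).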
However, a MaxPool layer with a kernel size of 4 × 4 and a BlurPool layer with a kernel size of 2 × 2 have been utilized based on the best empirical evaluation. Afterward, the flatten layer and the same dense layers as in the proposed DCN are deployed. Also, for optimizing the hyperparameters, the Adam, Adadelta, and SGD optimizers have all been used in the tuning phase; however, SGD delivers the best results, hence it has been utilized. Figure 16 presents the overall structure of the proposed AA-DCN model, and a brief assembly sketch is given below.
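As a hedged sketch only, the AA-DCN could be assembled by reusing the earlier illustrative stage structure and swapping each MaxPool for the MaxBlurPool2D layer sketched above, compiled with SGD as described; filter counts and the learning rate marked as assumptions earlier remain assumptions here.

```python
# Illustrative AA-DCN assembly: identical to the DCN sketch except that each
# stride-2 MaxPool is replaced by MaxBlurPool2D (MaxPool kernel 4x4, stride 1,
# followed by a 2x2 blur filter with stride 2), and SGD is used for training.
from tensorflow.keras import layers, models, optimizers

def build_aa_dcn(input_shape=(128, 128, 1), num_classes=7):
    def aa_stage(x, filters_list):
        for f in filters_list:
            x = layers.Conv2D(f, (3, 3), strides=1, activation="relu")(x)
        x = layers.BatchNormalization()(x)
        return MaxBlurPool2D(pool_size=4, blur_size=2, stride=2)(x)

    inputs = layers.Input(shape=input_shape)
    x = aa_stage(inputs, [64, 64, 64])   # Stage-1 (filter count assumed)
    x = aa_stage(x, [32, 32, 16])        # Stage-2
    x = aa_stage(x, [16, 16])            # Stage-3
    x = layers.Flatten()(x)
    x = layers.Dense(2020, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.SGD(learning_rate=0.0006),  # rate reused as a placeholder
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```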
Algorithm-2 illustrates the applied procedures of the MaxBlurPool layers applied on the AA-DCN model.
Results and discussion
The optimal tuned hyper-parameters in Table 12 have been used to evaluate the proposed AA-DCN model, and the same metrics as in Sect. 5.3 have been utilized. Then, in the next section, the effect of adding the anti-aliased MaxPool layers is tested on the three (CK+, JAFFE, and RAF) datasets, and the results are compared with the suggested DCN model using the regular MaxPool layer.
The evaluation of the AA-DCN model
The three datasets (CK+, JAFFE, and RAF) have been used for evaluating the anti-aliased DCN model (Algorithm-2).
In Fig. 17, the accuracy and loss curves are depicted. It is concluded that the suggested AA-DCN model employing the CK+ dataset has performed efficiently, with the recognition rate rising to 99.26% and the training accuracy reaching 98.89% in only 3 min, 23 s. Also, when employing the AA-DCN model on the JAFFE dataset, it achieves a higher recognition rate of 98%, with training accuracy reaching 97.63% in only 6 min, 5 s. On the RAF dataset, the AA-DCN model has an 82% recognition rate, with training accuracy reaching 97% in only 12 min, 2 s. The confusion matrices shown in Fig. 18 have been used to derive the experimental results for the suggested model, where (a) is the confusion matrix of the proposed model when applied to the CK+ dataset, (b) using JAFFE, and (c) utilizing the RAF dataset. Tables 13, 14, and 15 illustrate the classification reports of the proposed AA-DCN model applied to the CK+, JAFFE, and RAF datasets, respectively. Moreover, each table presents the precision, recall, and F1-score results for each emotion, as well as the overall accuracy of the proposed model for each utilized dataset.
Table 16 and Fig. 19 show a comparison between the evaluated results from the DCN (Algorithm-1) and AA-DCN (Algorithm-2) models conducted on the three datasets. The results of this study demonstrate that it is possible to considerably increase the invariance of networks trained on facial emotion datasets by replacing the conventional MaxPool layers with the two-layer MaxBlurPool operation and then retraining. The proposed AA-DCN methodology has scored 99.26% on the CK+ dataset in 5.25 min; moreover, it scored 98% on the JAFFE dataset in 8.13 min, which improved the accuracy value by 3% despite using more epochs (from 30 to 75) than Algorithm-1, which applied the regular MaxPool layers. In the case of RAF-DB, the anti-aliased methodology (Algorithm-2) outperforms the results from Algorithm-1 by 6% in 12.2 min.
To demonstrate that the proposed AA-DCN with the MaxBlurPool layers provides the best level of precision compared to other pooling techniques, this work further evaluates this model on the three datasets, but with the MaxBlurPool layers replaced with average pooling layers, as an additional analysis that reveals which of these pooling techniques yields the most reliable FER. Table 17 compares MaxBlurPool layers based on the AA-DCN model to average pooling layers based on the AA-DCN technique.
Consequently, the main contribution of this research is to provide an answer to the question posed in Sect. 7.1: "Will applying the blur filters to anti-alias the DCNs increase or decrease accuracy in the case of facial emotion recognition?". The answer is clearly that accuracy increases: anti-aliased MaxPooling enhances the intermediate feature maps, which in turn significantly enhances recognition accuracy, particularly on tricky data like RAF. In the proposed AA-DCN model, the right selection of the layering structure with fine-tuned hyperparameters is a significant step toward a high recognition rate in minimum time, and the low complexity of the proposed model is another factor that helps enhance the recognition rate; but adopting the MaxBlurPool instead of MaxPool layers in the proposed model plays a crucial role in achieving the best recognition rate.
Additionally, Table 18 displays the test results of the proposed model using a one-way ANOVA test, and Table 19 presents the results of a Wilcoxon signed-rank test model comparison. The table observations indicate that there is a substantial variation in the model's performance on distinct emotion classes, with the p-value scoring below 0.05 even when applied to three different datasets. This is consistent with expected behavior, since emotions naturally vary in complexity and facial expression. It also shows that the model has adapted to the emotion-specific patterns and differences in the dataset, rather than treating all emotions the same. Note that a model that functions flawlessly across all emotions may be overgeneralizing and missing crucial details. A small sketch of how such tests can be computed follows.
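For reference, such significance tests could be computed per emotion class with SciPy as sketched below; the per-class score arrays are hypothetical placeholders, not the paper's reported numbers.

```python
# Hedged sketch: one-way ANOVA across emotion classes and a Wilcoxon
# signed-rank test between two models' per-class scores (placeholder values).
from scipy.stats import f_oneway, wilcoxon

# Hypothetical per-class accuracies (e.g., one value per run/fold).
happy = [0.99, 0.98, 0.99]
sad   = [0.95, 0.96, 0.94]
fear  = [0.90, 0.91, 0.89]

f_stat, p_anova = f_oneway(happy, sad, fear)
print(f"one-way ANOVA: F={f_stat:.3f}, p={p_anova:.4f}")

# Paired comparison of DCN vs. AA-DCN per-class accuracies (placeholders).
dcn_scores    = [0.96, 0.93, 0.88, 0.90, 0.92, 0.94, 0.91]
aa_dcn_scores = [0.99, 0.97, 0.92, 0.95, 0.96, 0.97, 0.95]
w_stat, p_wilcoxon = wilcoxon(dcn_scores, aa_dcn_scores)
print(f"Wilcoxon signed-rank: W={w_stat:.3f}, p={p_wilcoxon:.4f}")
```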
Some of the unsolved challenges and the solutions offered by the proposed model are shown in Table 20. Accordingly, a comparative study has been formed with recent research in facial expression recognition. This comparison is introduced in Table 21, which shows that the best prior performance was around 98% on the original CK+ dataset, which has a small number of sample images. However, the recommended methodology has reached about 98.32% on an augmented, larger CK+ dataset when applying Algorithm-1 and 99.26% when applying the anti-aliasing model (AA-DCN). In comparison to models with seven expressions, the proposed AA-DCN-based model is more accurate by around 1-8%.
Generally, the most noteworthy outcomes from this research are:
- The current FER model topology outperforms the prior architectures of standard CNN models in terms of accuracy.
- The claimed DCN technique balances processing time and efficiency adequately; however, adding a MaxBlurPool layer greatly increases the rate of emotion identification, and it should be adapted for implementation with different network topologies to enhance facial expression recognition performance.
- Even though they may require a bit more recognition time due to incorporating a further stage (a blur filter), the anti-aliasing layers can detect emotion much more effectively and precisely.
To keep things simple, CNNs can effectively learn hierarchical patterns in facial features (edges, textures, and shapes), and facial images have a spatial structure well suited to convolutional feature learning. However, faces contain textures that can alias without anti-aliasing, which can hurt fine-grained expression classification. Expression changes can be very subtle, so anti-aliasing improves signal clarity and makes the model more robust to image variations, orientations, etc. Overall, CNNs align well architecturally to extract visual features from faces, and anti-aliasing then makes the model more robust to aliasing-based degradation of those facial cues, which would otherwise worsen performance for subtle expression recognition. The combination enables reliably discernible facial feature extraction and representation learning for sensitive expression classification tasks, so the two complement each other nicely for this application area.
Conclusion and future scope
In this paper, two innovative FER models based on deep CNNs have been introduced. Initially, a DCN model has been designed and improved with the traditional MaxPool layers and by finely adapting its parameters to analyze facial expressions. For emotion categorization, a group of eight different and distinctive emotion classes has been focused on. The experiments conducted on three publicly available datasets illustrate how convolutional networks with strides behave and their impact on the emotion recognition problem.
The proposed DCN methodology's validation accuracy has reached 98.32% and 95% on the CK+ and JAFFE datasets, respectively, and an average accuracy of 76% on the RAF dataset, with the minimum number of epochs and the lowest processing time compared with the classical models. The average result on the RAF dataset demonstrates that, similar to face biometric recognition, occluded faces continue to be a challenging limitation for computer vision solutions. However, this paper provides a solution: by implementing the AA-DCN model on the RAF dataset, the recognition rate has been significantly increased by 6%, showing the effect of anti-aliasing in improving FER accuracy.
The second model (AA-DCN) has been subsequently developed by replacing the MaxPool layers with the anti-aliased MaxPool operations. The AA-DCN has also been applied to the same datasets to study and investigate the influence of applying a tuned low-pass filter prior to the subsampling operation in the MaxPool layer, as well as its collateral effects on recognizing emotions in FER models. For the CK+ and JAFFE datasets, the AA-DCN technique has achieved 99.26% and 98%, respectively. Additionally, a comparison study between the suggested model and other recent studies has been conducted. Both of the suggested models outperform other current methods, according to the experimental results.
In conclusion, incorporating blur filters is a promising way to enhance accuracy and should be considered when designing modern deep CNNs. For future work, applying the anti-aliasing process to complex or compound emotions is planned, such as shock mixed with pleasure, surprise with frustration, dissatisfaction with anger, surprise with sorrow, and so on.
Data availability
The data that support the findings of this study are available within this article.
References
Bhattacharya, S. A survey on facial expression recognition using various deep learning techniques. Adv. Comput. Paradigms Hybrid. Intell. Comput. 1373, 619–631 (2021).
Khan, A. R. Facial emotion recognition using conventional machine learning and deep learning methods: current achievements, analysis and remaining challenges. Information 13, 1–17 (2022).
Umer, S., Rout, K. R., Pero, C. & Nappi, M. Facial expression recognition with tradeoffs between data augmentation and deep learning features. J. Ambient Intell. Humaniz. Comput. 13, 721–735 (2022).
Zou, X., Xiao, F., Yu, Z. & Lee, J. Y. Delving deeper into anti-aliasing in ConvNets. Int. J. Comput. Vis. 131, 67–81 (2023).
Zhang, R. Making convolutional networks shift-invariant again. In Proc. 36th International Conference on Machine Learning (ICML) (2019).
Mohan, K. et al. Facial expression recognition using local gravitational force descriptor-based deep convolution neural networks. IEEE Trans. Instrum. Meas. 70, 1–12 (2020).
Zou, X., Xiao, F., Yu, Z. & Lee, J. Y. Delving deeper into anti-aliasing in ConvNets. Preprint at arXiv:2008.09604, 1–13 (2023).
Ning, J. & Spratling, M. The importance of anti-aliasing in tiny object detection. Preprint at arXiv:2310.14221, 1–17 (2023).
Chowdary, K. M., Nguyen, N. T. & Hemanth, J. D. Deep learning-based facial emotion recognition for human–computer interaction applications. Neural Comput. Appl. (2021).
Abate, A. F., Cimmino, L., Mocanu, B. C., Narducci, F. & Pop, F. The limitations for expression recognition in computer vision introduced by facial masks. Multimed. Tools Appl. 82(8), 11305–11319 (2023).
Shaik, S. N. & Cherukuri, K. T. Visual attention based composite dense neural network for facial expression recognition. J. Ambient Intell. Humaniz. Comput. 194, 16229–16242 (2022).
Saurav, S., Gidde, P., Saini, R. & Singh, S. Dual integrated convolutional neural network for real-time facial expression recognition in the wild. Visual Comput. 38, 1083–1096 (2022).
Rajan, S., Chenniappan, P., Devaraj, S. & Madian, N. Novel deep learning model for facial expression recognition based on maximum boosted CNN and LSTM. IET Image Process. 14, 1373–1381 (2020).
Khattak, A., Asghar, Z. M., Ali, M. & Batool, U. An efficient deep learning technique for facial emotion recognition. Multimed. Tools Appl. 81, 1649–1683 (2022).
Bentoumi, M., Daoud, M., Benaouali, M. & Ahmed, T. A. Improvement of emotion recognition from facial images using deep learning and early stopping cross-validation. Multimed. Tools Appl. 81, 29887–29917 (2022).
Liu, J. & Feng, Y. Facial expression recognition using pose-guided face alignment and discriminative features based on deep learning. IEEE Access 9, 69267–69277 (2021).
Wang, X., Yu, C., Gu, Y., Hu, M. & Ren, F. Multi-task and Attention Collaborative Network for Facial Emotion Recognition. IEEJ Trans. Electr. Electron. Eng. 16, 568–576 (2021).
Taskiran, M., Kahraman, N. & Erdem, E. C. Hybrid face recognition under adverse conditions using appearance-based and dynamic features of smile expression. IET Biom. 10, 99–115 (2021).
Saurav, S., Saini, R. & Singh, S. EmNet: a deep integrated convolutional neural network for facial emotion recognition in the wild. Appl. Intell. 51, 5543–5570 (2021).
Satyanarayana, D. D. S. A. An efficient facial emotion recognition system using novel deep learning neural network-regression activation classifier. Multimed. Tools Appl. 80, 17543–17568 (2021).
Li, B. & Lima, D. Facial expression recognition via ResNet-50. Int. J. Cogn. Comput. Eng. 2, 57–64 (2021).
Arora, M. & Kumar, M. AutoFER: PCA and PSO based automatic facial emotion recognition. Multimed. Tools Appl. 80, 3039–3049 (2021).
Zheng, K., Yang, D., Liu, J. & Cui, J. Recognition of teachers’ facial expression intensity based on Convolutional Neural Network and attention mechanism. IEEE Access. 8, 226437–226444 (2020).
Fontaine, D. et al. Artificial intelligence to evaluate postoperative pain based on facial expression recognition. Eur. J. Pain. 26, 1282–1291 (2022).
Lu, T. et al. An interactive greeting system using convolutional neural networks for emotion recognition. Entertain. Comput. 40 (2022).
Mohan, K. et al. FER-net: facial expression recognition using deep neural net. Neural Comput. Appl. 33, 9125–9136 (2021).
Karnati, M. et al. LieNet: a deep convolution neural network framework for detecting deception. IEEE Trans. Cogn. Dev. Syst. 143, 971–984 (2021).
Suzuki, S. et al. Knowledge transferred fine-tuning: convolutional neural network is born again with anti-aliasing even in data-limited situations. IEEE Access 10, 68384–68396 (2022).
Ghosh, A. et al. Fundamental concepts of convolutional neural network. Intell. Syst. Ref. Libr. Springer. 172, 519–567 (2020).
Dzakula, N. B. Convolutional neural network layers and architectures. In Sinteza: International Scientific Conference on Information Technology and Data Related Research, Singidunum University, 445–451 (2019).
Debnath, T. et al. Four-layer ConvNet to facial emotion recognition with minimal epochs and the significance of data diversity. Sci. Rep. 12, 6991 (2022).
Hung, A. J. et al. A deep-learning model using automated performance metrics and clinical features to predict urinary continence recovery after robot-assisted radical prostatectomy. BJU Int. 124(3), 487–495 (2019).
Shreffler, J. & Huecker, M. R. Diagnostic testing accuracy: sensitivity, specificity, predictive values and likelihood ratios (2020).
Helaly, R., Messaoud, S., Bouaafia, S., Hajjaji, M. A. & Mtibaa, A. DTL-I-ResNet18: facial emotion recognition based on deep transfer learning and improved ResNet18. Signal Image Video Process. 157 (2023).
Krstinic, D., Braovic, M., Seric, L. & Bozic-Stulic, D. Multi-label classifier performance evaluation with confusion matrix. Comput. Sci. Inf. Technol. 10, 1–14 (2020).
Qassim, H., Verma, A. & Feinzimer, D. Compressed residual-VGG16 CNN model for big data places image recognition. IEEE (2018).
Theckedath, D. & Sedamkar, R. R. Detecting affect states using VGG16, ResNet50 and SE-ResNet50 networks. SN Comput. Sci. 1, 1–7 (2020).
Cheah, H. K., Nisar, H., Yap, V. V., Lee, C. Y. & Sinha, G. R. Optimizing residual networks and VGG for classification of EEG signals. J. Healthc. Eng. 5599615 (2021).
Li, K., Jin, Y., Akram, W. M., Han, R. & Chen, J. Facial expression recognition with convolutional neural networks via a new face cropping and rotation strategy. Visual Comput. 36, 391–404 (2020).
Azulay, A. & Weiss, Y. Why do deep convolutional networks generalize so poorly to small image transformations? J. Mach. Learn. Res. 20, 1–25 (2019).
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Author information
Authors and Affiliations
Contributions
Reham A. Elsheikh: Conceptualization, Methodology, Software, Data curation, Visualization and Investigation. M. A. Mohamed: Supervision, Reviewing and Editing. Ahmed Mohamed Abou-Taleb: Supervision, Reviewing and Editing. Mohamed Maher Ata: Supervision, Methodology, Software, Data curation, Writing - Original draft preparation, Reviewing and Editing.
Corresponding authors
Ethics declarations
Competing interests
We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us. The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.