Introduction

Globally, there are nearly 70 million people with hearing impairment. Deafness leads to significant communication difficulties: deaf people cannot hear, and many of them are unable to use written languages fluently, facing considerable difficulty when expressing themselves through written text1. They often struggle with verb tenses and with number and gender agreement, and find it hard to give concrete form to abstract ideas. These issues may affect individuals with hearing impairments in accessing information, participating in social activities, and forming social relationships, among other aspects of daily life2. Deaf people rely on sign language (SL) to communicate, yet sign-language communication aids and interpreter services remain insufficient. Depending on its severity, hearing loss is classified as mild, moderate, severe, or profound3. People with severe or profound hearing impairment often cannot hear others and therefore experience communication difficulties. Deaf people use SL gestures for communication, but hearing people usually do not understand these gestures, which creates a communication barrier between the two groups. There are nearly 200 SLs worldwide, and, like spoken languages, they differ from one another. SL is a structured form of expression used as a medium of communication by the deaf4. Unlike spoken natural languages, it conveys messages through physical actions known as signs or gestures. Emotion is the attitude and experience formed when humans relate objective things to their own needs5. It reflects people's current psychological and physiological states, playing an integral part in decision-making, communication, and cognition. Researchers believe that human emotions are accompanied by physical changes, such as muscle contraction and relaxation, facial expressions, and visceral responses6.

Several investigators note that, owing to hearing impairment, individuals with hearing loss often cannot receive information from the external world as completely and accurately as those without hearing loss7. Consequently, they may develop a reasoning bias that leads to significant misunderstandings in communication, resulting in a relational cognitive bias. It is therefore particularly important to examine the facial emotion recognition capability of deaf people. Facial expression recognition (FER) has gained significant importance in computer vision (CV) recently. It is used for the classification and analysis of observed facial expressions8. FER is applied in various domains, including driving assistance, robotics, lie detection, mental health disorder prediction, security, and others9. With the development of deep learning (DL), FER technology has achieved significant gains in detection accuracy compared to conventional techniques. Due to their ability to extract image features, convolutional neural networks (CNNs) are extensively applied to image classification tasks, particularly FER10. Nevertheless, training a CNN for FER poses particular challenges. Hearing-impaired individuals find it difficult to communicate effectively because SL is not widely understood by the general population. This communication gap restricts the independence and social participation of people with hearing disabilities. The gap can be bridged by accurately recognizing gestures and enabling real-time interpretation of SL into understandable formats. Utilizing advanced spatio-temporal models for gesture recognition (GR) provides opportunities for inclusive technologies in education, healthcare, and public services. Hence, this work aims to support meaningful interaction and improve daily life accessibility for the hearing-impaired community.

This article presents a novel Smart Comprehend Gesture-Based Emotions Recognition System (SCGERS) utilizing a Spatio-Temporal Graph Convolutional Network (STGCN) approach for individuals with hearing disabilities. The SCGERS-STGCN approach enables the recognition of gestures and emotions to enhance communication for individuals with hearing impairments. Initially, the SCGERS-STGCN model utilizes Gaussian filtering (GF) in the image pre-processing stage to reduce noise and improve the quality of input images. For feature extraction, the Vision Transformer (ViT) model is utilized to capture complex patterns and relationships within the gestures and facial expressions that indicate emotions. Additionally, the spatio-temporal graph convolutional network (ST-GCN) approach is employed for facial emotion detection and classification. Finally, the parameter tuning of the ST-GCN model is performed using the developed African vulture optimization algorithm (DAVOA) model. The experimentation of the SCGERS-STGCN model is performed on the Emotion detection dataset. The significant contributions of the SCGERS-STGCN model are listed below.

  • The SCGERS-STGCN technique utilizes GF-based image pre-processing to enhance visual clarity by mitigating noise while preserving key facial structures, thereby facilitating more reliable feature extraction. This improves the accuracy of downstream emotion classification tasks. Its integration strengthens the robustness of the overall emotion recognition process.

  • The SCGERS-STGCN methodology utilizes ViT-based feature extraction to capture long-range dependencies and contextual cues in facial expressions, thereby enabling a rich and discriminative feature representation. This enhances the model’s ability to distinguish subtle emotional discrepancies. Its application improves the overall performance of the recognition framework.

  • The SCGERS-STGCN method utilizes ST-GCN-based emotion classification to model both spatial and temporal dynamics of facial landmarks, enabling the detection and classification of intrinsic emotions with high precision. This approach enhances the technique’s ability to comprehend complex emotional changes over time. As a result, it significantly improves accuracy and robustness in emotion recognition tasks.

  • The SCGERS-STGCN approach implements DAVOA-based hyperparameter optimization by effectively fine-tuning the model’s parameters, resulting in improved classification accuracy and faster convergence across diverse datasets. This optimization ensures the model adapts well to varying data characteristics. Consequently, it enhances overall performance and efficiency in emotion recognition tasks.

  • The SCGERS-STGCN model presents a novel hybrid framework that integrates GF, ViT, and ST-GCN into a cohesive emotion recognition process, optimized by the DAVOA technique. This integration effectively balances computational efficiency with high detection accuracy. The approach is specially designed for real-time facial emotion analysis, making it appropriate for dynamic and complex environments.

The article is structured as follows: "Literature survey" section presents the literature review, "Materials and methods" section outlines the proposed method, "Performance validation" section details the results evaluation, and "Conclusion" section concludes the study.

Literature survey

Miah et al.11 presented a temporal and spatial attention method combined with a general neural network (NN) for a sign language recognition (SLR) scheme. The design comprises three branches: a graph-based spatial branch, a graph-based temporal branch, and a general NN branch. The spatial branch captures spatial dependencies, whereas the temporal branch models temporal dependencies in the data; the general NN branch improves the structure's generalizability, thereby strengthening its robustness. Sreemathy et al.12 introduced a model for the automated detection of dual-handed signs of Indian SL (ISL). The work consists of three stages: pre-processing, feature extraction, and classification. The trained approach is applied to test real-time gestures, and DL methods were additionally applied using GoogleNet, AlexNet, VGG-19, and VGG-16. Nedjar and M'hamedi13 developed RSA, an interactive scheme intended for the simulation and recognition of letters in Arabic SL. Nekkanti et al.14 presented a real-time SL detection method that incorporates CNNs and image processing to decode gestures in video recordings; a Flask-based API deployed to the cloud makes the solution widely accessible. González-Rodríguez et al.15 implemented a bi-directional SL translation system. DL techniques such as GRU, LSTM, recurrent NNs (RNN), Transformers, and bi-directional RNN (BRNN) are compared to discover the most precise method for SL translation and recognition, with keypoint recognition using MediaPipe employed to track and interpret SL gestures. Srinivasan et al.16 proposed a sign detector built with the Keras and OpenCV libraries in Python, demonstrating a user-friendly model that identifies SL gestures for individuals with hearing impairments. Jia and Li17 implemented a simpler and more precise SLR-YOLO model that enhances YOLOv8: the SPPF module was replaced with the RFB module in the backbone network to strengthen feature extraction, BiFPN was applied to improve feature fusion, and the Ghost module was included to simplify the network. Sreemathy et al.18 developed a Python-based model that categorizes 80 words from SL; the paper presents two different methods, a support vector machine (SVM) and YOLOv4, both utilizing MediaPipe, with the SVM using radial basis function (RBF), linear, and polynomial kernels. Slade et al.19 proposed a model utilizing an audio spectrogram transformer (AST) with hyperparameter and architecture optimisation through a novel cluster search optimisation (CSO) approach; it introduces optimised models including a One-Dimensional CNN (1D-CNN), Bidirectional Long Short-Term Memory (BiLSTM), and CNN-BiLSTM with attention, assisted by the Noise Tempered Kmeans (NTKM) clustering model. Imtiaz and Khan20 introduced the gradual proximity-guided target data selection (GPTDS) methodology for reliable sample selection and a prediction confidence-aware test-time augmentation (PC-TTA) technique to enhance inference accuracy with minimal computational requirements. Utilizing the DEAP and SEED datasets, the method demonstrates superior performance and efficiency in classifying emotions related to healthcare.

Phan and Phan21 proposed a hybrid DL neural network (Vis-Net) integrating ViT with CNN models such as MobileNet-V2, Inception-V3, ResNet152-V2, NASNetLarge, and DenseNet for multi-level driver drowsiness detection based on Katajima's scale. It also integrates Mostafa's emotion detection framework to enhance real-time performance by prioritizing fatigue-relevant frames and reducing latency. Alotaibi, Sundarapandi, and Rajendran22 proposed a computational linguistics-based sentiment analysis with enhanced beetle antenna search and DL (CLSA-EBASDL) model. The approach utilizes bidirectional encoder representations from transformers (BERT) for word embedding, an attention-based BiLSTM (ABiLSTM) network for mood classification, and the enhanced beetle antenna search (EBAS) technique for hyperparameter optimization. Madhan et al.23 developed an emotion-based music player that automatically recognizes facial expressions using the Haar cascade algorithm for face detection and an SVM for emotion classification; the system additionally employs the k-nearest neighbour (KNN) classification method. Pradeep et al.24 proposed a real-time vision-based SLR system using the hidden Markov model (HMM) for speech-to-text translation, integrated with natural language processing (NLP) techniques to generate AVATAR-based sign language projections. Balasubramani and Surendran25 aimed to enhance early detection of autism spectrum disorder (ASD) by analyzing facial emotions using a self-attention-based progressive generative adversarial network (SA-PGAN) methodology optimized with the gorilla troops optimizer (GTO) model. Krishnan et al.26 improved multimodal emotion recognition by utilizing an advanced transformer model with a fine-grained correlation fusion (AT-FGCF) method, which incorporates the dense Swin Transformer (DSwinT) technique for audio and video, and emotion-aware cognitive BERT (EAC-BERT) for text representation. Palermo et al.27 surveyed the landscape of context recognition in edge computing, focusing on smart eyewear by analyzing real-time, multimodal sensor data. Key techniques such as sensor fusion, noise mitigation, and energy-efficient processing are highlighted to support low-latency, context-aware applications in domains like healthcare and augmented reality. Jayalakshmi et al.28 improved emotion recognition by employing a multimodal optimized emotion analysis using the DL (MOEA-DL) technique, which integrates CNNs for images, RNNs for text, and LSTM networks for audio. Thilakavathy et al.29 improved large-scale sentiment analysis by employing the contextual detection text analysis using DL (CDTA-DL) methodology, which integrates RNNs and CNNs to capture both sequential and spatial text features.

Jiang et al.30 presented a model to convert speech from electroencephalogram (EEG) signals using a novel functional areas spatio-temporal transformer (FAST) framework. By converting EEG data into tokens and using transformer-based sequence encoding, the model captures spatio-temporal neural patterns associated with covert speech. Mazhar et al.31 proposed a lightweight ML-based approach for aspect-oriented emotion classification (AOEC) in facial expression recognition from video data. Utilizing models such as Naive Bayes (NB), SVM, Random Forest (RF), and CNN, the method improves sentiment analysis accuracy. Taware and Thakare32 introduced a robust multimodal emotion recognition (MER) system that utilizes a parallel deep CNN (PDCNN) technique for enhanced feature representation and a BiLSTM for capturing temporal dependencies. To mitigate computational complexity and select optimal features, it employs a hybrid particle swarm optimization model with multi-attribute utility theory and the Archimedes optimization algorithm (PMA). Saqib et al.33 proposed an innovative image-to-text summarisation approach by integrating the You Only Look Once (YOLO) object detection model with the WordNet lexical dataset. YOLO accurately localizes objects within images, while WordNet provides semantic context. This fusion improves automated image understanding, contributing to both the CV and NLP domains. Bouhanou and Aboutabit34 enhanced Arabic sign language (ArSL) recognition by utilizing a hybrid DL method that integrates CNN for spatial feature extraction and LSTM networks for temporal sequence modelling. Khan et al.35 proposed an enhanced real-time fall detection system using an improved You Only Look Once Version 8 Small (YOLOV8S) model. The Convolutional Block Attention Modules (CBAMs) technique is also utilized for improving detection accuracy in intrinsic environments. Rahman et al.36 presented an accurate emotion classification system from EEG signals by utilizing an extended independent component analysis (E-ICA) model for artifact removal, multi-class common spatial pattern (M-CSP) technique for feature extraction, and BiLSTM for classification and tuning. Shah et al.37 developed an explainable three-way face recognition mechanism (E3FRM) methodology by utilizing principal component analysis/fisher linear discriminant (PCA/FLD) model for feature extraction and a three-way decision model with dual verification for better authorization and detection. Khan et al.38 developed an accurate and robust salient object detection (SOD) system by utilizing a novel progressive multi-stage iterative feature refinement network (PIFRNet) technique integrated with a pyramidal attention mechanism (PAM) method. Ullah et al.39 introduced the Attention-Enhanced Fire Recognition Network (AEFRN) method by incorporating convolutional self-attention (CSA), recursive atrous self-attention (RASA), and an enhanced convolutional block attention module (CBAM) model for robust feature extraction in challenging environments. Ahmad et al.40 presented the Advanced Multi-View Deep Feature Learning (AMV-DFL) methodology with DL and ML techniques to improve epileptic seizure detection from EEG signals. Ivanko and Ryumin41 developed an intelligent system integrating SLR, audiovisual speech recognition, SL synthesis, and speech synthesis by utilizing transformer-based models and neural vocoders for accurate bidirectional translation between SL and spoken language. 
Mira and Hellwich42 developed diverse ensembles of frame-wise features integrating deep neural networks (DNNs), self-organizing map (SOM), and RBF networks, and using RNNs to effectively capture temporal information. Gaikwad and Shete43 presented the adaptive thresholding-based region growing and canny edge detection (ATRG-CED) technique optimized by modified Tasmanian devil optimization (MTDO) and Multiscale Attention Embedded Residual DenseNet (MAERDNet) methods for accurate SL and gesture recognition.

The reviewed studies illustrate improvements in SLR, emotion recognition, sentiment analysis, and context awareness using diverse DL and ML models. However, a key limitation is the lack of standardization across datasets, which affects the model’s generalizability. Most methods depend on controlled environments, mitigating their robustness in real-world, noisy conditions. Another challenge is the high computational complexity, which limits deployment on resource-constrained devices. Privacy concerns are rarely addressed, particularly in video and EEG-based systems. A notable research gap exists in integrating lightweight architectures with real-time adaptability, multimodal fusion, and cross-lingual or cross-domain scalability to enhance inclusivity and reliability.

Materials and methods

In this study, a novel SCGERS-STGCN technique is proposed for individuals with hearing disabilities. The SCGERS-STGCN method facilitates the recognition of gestures and emotions, thereby enhancing communication for individuals with hearing impairments. It involves different stages, including GF-based image pre-processing, ViT-based feature extraction, gesture-based emotion recognition using ST-GCN, and DAVOA-based parameter tuning. Figure 1 illustrates the complete working process of the SCGERS-STGCN method.

Fig. 1
figure 1

Workflow of the SCGERS-STGCN method.

Stage I: image pre-processing

Initially, the SCGERS-STGCN model employs GF for the image pre-processing stage to reduce noise and enhance the quality of input images44. This model is chosen for its efficiency in reducing noise while preserving crucial facial features, which is critical for accurate emotion recognition. Unlike median or bilateral filters, GF provides a smooth and consistent blurring effect that mitigates high-frequency noise without significantly distorting edges. The model also exhibits superiority in its simplicity and computational efficiency, making it appropriate for real-time applications. Furthermore, GF is less sensitive to outliers compared to other filters, ensuring that subtle facial expressions remain intact. This balance between noise removal and feature preservation makes GF an ideal choice over more complex or computationally expensive methods for pre-processing facial images in emotion recognition systems.

GF is a vital image pre-processing model that is extensively employed in gesture-based emotion detection methods to enhance image quality and reduce noise. By applying a GF, the system efficiently smooths images, removing unwanted variations that could hinder the recognition of subtle gestures and facial expressions. This pre-processing stage is crucial for enhancing the accuracy of subsequent classification and feature extraction processes. The filtered images provide clearer and more distinct representations of gestures, enabling more accurate detection of emotional states. Ultimately, GF significantly contributes to the overall reliability and robustness of gesture-based emotion detection methods, facilitating a deeper understanding of human emotions.
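A minimal sketch of this pre-processing step is shown below, assuming OpenCV is available; the kernel size and sigma are illustrative choices, as the exact filter settings are not reported here.

```python
import cv2
import numpy as np

def preprocess_face(image_path, size=(48, 48), ksize=(5, 5), sigma=1.0):
    """Load a face image, apply Gaussian filtering, and resize it.

    ksize and sigma are illustrative values, not the settings used in the paper.
    """
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(image_path)
    smoothed = cv2.GaussianBlur(img, ksize, sigma)   # suppress high-frequency noise
    resized = cv2.resize(smoothed, size)             # match the 48 x 48 input resolution
    return resized.astype(np.float32) / 255.0        # normalize intensities to [0, 1]
```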

Stage II: ViT using feature extractor

For feature extraction, the ViT model is employed to capture intricate relationships and patterns in the gestures and facial expressions indicative of emotions45. This model is chosen for its superior ability to capture long-range dependencies and contextual relationships in images through its self-attention mechanism. It processes the entire image as a sequence of patches, enabling it to learn global features more effectively, unlike conventional CNNs that focus on local receptive fields. The model is also effective in recognizing subtle discrepancies in facial expressions. Additionally, ViT exhibits scalability and robustness when trained on massive datasets, often outperforming CNNs in terms of accuracy. Its architecture is highly flexible and can be fine-tuned for specific tasks, giving superior representation power without the inductive biases inherent in CNNs. These merits position ViT as a powerful alternative to conventional methods for facial emotion recognition. Figure 2 represents the architecture of the ViT model.

Fig. 2
figure 2

Structure of the ViT model.

The Transformer differs from earlier CNN and RNN architectures in that it contains no convolutional or recurrent structures and relies entirely on a self-attention mechanism. It has attained state-of-the-art results in NLP, multimodal, and CV tasks. The Transformer mainly comprises an encoder and a decoder. In the ViT, only the encoder is used, with positional encoding applied before the encoder; the decoder is replaced by fully connected (FC) layers.

Encoder transformer

The encoder performs feature extraction by repeatedly stacking Encoder blocks. The incoming data are first normalized by a Norm layer and then processed by the Multi-Head Attention layer, which allows the method to attend to information from different subspaces and positions and to assign appropriate weights. The normalized output is then passed to the MLP block to enhance the model's expressive capacity. The MLP block contains a GELU activation function, two Dropout layers, and linear layers.

The Multi-Head Attention layer, a central module of the ViT structure, involves the following steps:

For a sequence of \(i\) input nodes \(x_1, x_2, \dots, x_i\), the inputs are first mapped to \(a_1, a_2, \dots, a_i\) through \(f(x)\). A dot product is then taken with the three learnable transformation matrices \(W_q\), \(W_k\), and \(W_v\) to obtain the corresponding \(q_i\), \(k_i\), and \(v_i\):

$$\:{q}_{i}={a}_{i}{W}_{q},{k}_{i}={a}_{i}{W}_{k},{v}_{i}={a}_{i}{W}_{v}$$
(1)

Here, \(q_i\) denotes the query vector, \(k_i\) the key vector, and \(v_i\) the information extracted from \(a_i\). Matching \(q_i\) against \(k_i\) measures the association between the two; the stronger the correlation, the larger the weight assigned to the corresponding \(v_i\). In the ViT, the correlation weights are computed with a scaled dot product:

$$weight\left({q}_{t},{k}_{i}\right)=\frac{{q}_{t}^{T}{k}_{i}}{\sqrt{{d}_{k}}}$$
(2)

Here, \(d_k\) denotes the vector length. Scaling adjusts the magnitude of the dot products to prevent the gradients from becoming too small after normalization, which would hamper network training. The weights over \((q_t, k_i)\) are normalized by a softmax. Finally, the self-attention output is obtained by matrix multiplication:

$$\:Attention\left(Q,K,V\right)=soft\text{m}\text{a}\text{x}\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$
(3)

The input \(a_i\) is projected through \(W_q\), \(W_k\), and \(W_v\) to obtain \(q_i\), \(k_i\), and \(v_i\), which are then split into \(h\) portions according to the number of heads \(h\). Each head applies self-attention as in Eq. (3); the \(h\) outputs are then concatenated to produce the final result:

$$\:MultiHead\left(Q,K,V\right)=Concat\left(hea{d}_{1},\:\dots\:,hea{d}_{h}\right){W}^{O}\\ \:where\:hea{d}_{i}=Attention\left(Q{W}_{i}^{Q},K{W}_{i}^{K},\:V{W}_{i}^{V}\right)$$
(4)
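As a compact illustration of Eqs. (1)-(4), a minimal PyTorch sketch of multi-head self-attention is given below; the embedding dimension and head count are illustrative and do not correspond to the exact encoder configuration used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention following Eqs. (1)-(4)."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # W_q, W_k, W_v fused into one projection (Eq. 1)
        self.proj = nn.Linear(dim, dim)      # output projection W^O (Eq. 4)

    def forward(self, x):                    # x: (batch, tokens, dim)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, heads, tokens, head_dim)
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        # Eqs. (2)-(3): scaled dot-product attention with softmax normalization
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)   # Eq. (4): concatenate heads
        return self.proj(out)
```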

MLP head

The shape of the output tokens remains unchanged after the Transformer Encoder. The class token is then extracted, and the final classification result is obtained through an MLP Head consisting of a linear layer, a tanh activation function, and a further linear layer.
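In practice, the encoder described above is usually taken from a pretrained ViT backbone and used to produce frame-level feature embeddings for the downstream graph model. The sketch below assumes the timm library and an illustrative checkpoint; the paper does not specify the exact backbone or weights.

```python
import timm
import torch

# Pretrained ViT used as a feature extractor (illustrative checkpoint choice).
# num_classes=0 removes the classification head so the forward pass returns embeddings.
vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
vit.eval()

frames = torch.randn(8, 3, 224, 224)   # a batch of pre-processed face/gesture frames
with torch.no_grad():
    features = vit(frames)             # shape (8, 768): one global feature vector per frame
```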

Stage III: recognition process using ST-GCN

Additionally, the ST-GCN approach is utilized for the detection and classification of facial emotions46. This model is chosen for its ability to robustly capture spatial and temporal dependencies in facial landmark data. The technique treats facial landmarks as nodes in a graph, capturing the relational structure between different facial points over time. This enables precise modelling of dynamic facial expressions by considering how features evolve sequentially. Its ability to jointly learn spatial configurations and temporal patterns results in enhanced accuracy in recognizing intrinsic emotions. Moreover, the model is computationally efficient for processing sequential data and outperforms many conventional methods that fail to explicitly integrate temporal dynamics, making it ideal for real-time emotion recognition tasks.

GCN is widely applied in modelling human skeleton data. In this approach, the human skeleton is commonly represented as a spatio-temporal graph \(G=(V,E)\) with \(T\) frames and \(N\) joints, where \(V\) denotes the joints of the skeleton and \(E\) the edges linking these joints. The skeleton coordinates of human movements are formulated as \(X\in R^{C\times T\times N}\), where \(C\) denotes the number of channels, \(T\) the number of frames in the video, and \(N\) the number of nodes in the human skeleton. The GCN-based method primarily consists of two segments: temporal and spatial graph convolutions.
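A minimal sketch of this graph representation is given below; the node count, edge list, and channel/frame sizes are toy values for illustration and do not reproduce the actual landmark topology used in this work.

```python
import numpy as np

# Represent landmarks / skeleton joints as a graph G = (V, E) with a feature tensor X.
N = 5                                         # number of nodes (joints or landmarks)
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]      # toy edge list E, not the paper's topology

A = np.zeros((N, N), dtype=np.float32)        # adjacency matrix encoding E
for i, j in edges:
    A[i, j] = A[j, i] = 1.0                   # undirected connections between joints

C, T = 2, 16                                  # channels (e.g. x, y coordinates) and frames
X = np.random.randn(C, T, N).astype(np.float32)   # feature tensor X in R^{C x T x N}
```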

In the spatial dimension, the graph convolution operation for extracting the feature of a joint \(v_{ti}\) in the skeleton graph is stated as:

$$f_{out}\left(v_{ti}\right)=\sum_{v_{tj}\in B\left(v_{ti}\right)}\frac{1}{Z_{ti}\left(v_{tj}\right)}f_{in}\left(p\left(v_{ti},v_{tj}\right)\right)\cdot w\left(v_{ti},v_{tj}\right)$$
(5)

Here, \(f_{in}\) and \(f_{out}\) denote the input and output features, respectively; \(B\left(v_{ti}\right)=\left\{v_{tj}\mid d\left(v_{tj},v_{ti}\right)\le R\right\}\) is the set of neighbouring nodes of \(v_{ti}\), where \(R\) controls the range of selected neighbours; \(Z_{ti}\left(v_{tj}\right)=\left|\left\{v_{tk}\mid l_{ti}\left(v_{tk}\right)=l_{ti}\left(v_{tj}\right)\right\}\right|\) is the normalization term; and \(w\) is the weighting function over neighbouring joint points.

The graph convolution in the time domain extends the spatial graph convolution: using the parameter \(\Gamma\) as the temporal range that controls the neighbour set, the neighbour set across the spatial and temporal dimensions is given by:

$$B\left(v_{ti}\right)=\left\{v_{qj}\;\middle|\;d\left(v_{tj},v_{ti}\right)\le K,\;\left|q-t\right|\le\lfloor\Gamma/2\rfloor\right\}$$
(6)

The equivalent label mapping set for its neighbouring nodes is:

$$l_{ST}\left(v_{qj}\right)=l_{ti}\left(v_{tj}\right)+\left(q-t+\Gamma/2\right)\times K$$
(7)

Here, \(l_{ti}\left(v_{tj}\right)\) denotes the label mapping of \(v_{tj}\) in the single-frame case.

Hence, for a skeleton input described by the feature \(X\) and graph structure \(A\), the output of a graph convolution layer is given by:

$$f_{out}=\sigma\left(D^{-\frac{1}{2}}\tilde{A}D^{-\frac{1}{2}}f_{in}W\right)$$
(8)

Here, \(\tilde{A}=A+I\) represents the skeletal graph structure of the human body, where the \(N\times N\) adjacency matrix \(A\) encodes the connections between joints and \(I\) is the identity matrix; \(D\) denotes the degree matrix of the joints; \(D^{-1/2}(A+I)D^{-1/2}\) is the normalized skeleton structure; \(W\) is the network's learnable weight matrix; and \(\sigma\) is the activation function.
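Equation (8) amounts to a spatial graph convolution followed, in ST-GCN, by a temporal convolution along the frame axis. The PyTorch sketch below is a minimal illustration under these assumptions; layer widths, the temporal kernel size, and the toy graph are illustrative and not the exact block used in this work.

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """Minimal spatial graph convolution (Eq. 8) followed by a temporal convolution."""

    def __init__(self, in_channels, out_channels, A, t_kernel=9):
        super().__init__()
        A_tilde = A + torch.eye(A.size(0))                  # A + I: add self-loops
        deg = A_tilde.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        self.register_buffer("A_hat", d_inv_sqrt @ A_tilde @ d_inv_sqrt)  # D^{-1/2}(A+I)D^{-1/2}
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # learnable W
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(t_kernel, 1),
                                  padding=((t_kernel - 1) // 2, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                                   # x: (batch, C, T, N)
        x = self.spatial(x)                                 # apply weight matrix W channel-wise
        x = torch.einsum("bctn,nm->bctm", x, self.A_hat)    # aggregate features over the graph
        return self.relu(self.temporal(x))                  # temporal convolution along T

# Usage with a toy 5-node chain graph and a batch of 4 sequences of 16 frames:
A = torch.tensor([[0, 1, 0, 0, 0], [1, 0, 1, 0, 0], [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1], [0, 0, 0, 1, 0]], dtype=torch.float32)
block = STGCNBlock(in_channels=2, out_channels=16, A=A)
out = block(torch.randn(4, 2, 16, 5))                       # -> shape (4, 16, 16, 5)
```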

Stage IV: DAVOA-based parameter tuning

Finally, the hyperparameter tuning of the ST-GCN approach is performed by the DAVOA model47. This model is chosen for its superior capability in balancing exploration and exploitation during optimization, ensuring efficient convergence to optimal hyperparameters. Compared to conventional optimization methods, such as grid or random search, DAVOA dynamically adapts its search strategy based on population behaviour, thereby mitigating the risk of getting trapped in local minima. The nature-inspired mechanism of the technique presents faster convergence and higher accuracy in tuning intrinsic models. Moreover, DAVOA handles multimodal and high-dimensional search spaces effectively, making it appropriate for optimizing DL techniques where many interdependent parameters exist. This results in enhanced model performance, reduced training time, and improved generalization compared to conventional optimization techniques.

The AVOA is a metaheuristic method inspired by the foraging behaviour of vultures, whose lifestyle motivates a metaheuristic model for addressing optimization problems. The population is first initialized with a fixed number of vultures, and the best vultures within the whole set are identified as demonstrated below:

$$H\left(i\right)=\left\{\begin{array}{ll}Best\;vulture\;1,&if\;{Z}_{i}={f}_{1}\\ Best\;vulture\;2,&if\;{Z}_{i}={f}_{2}\end{array}\right.$$
(9)

Here, \(f_1+f_2=1\), and \(f_1\) and \(f_2\) are parameters in the interval \((0,1)\) set before the optimization. A roulette wheel is used to compute the selection probability of each candidate solution and to choose the best solution according to Eq. (10):

$$z_{i}=\frac{G_{i}}{\sum_{j=1}^{m}G_{j}}$$
(10)

Here, \(G\) denotes the fitness of the vulture. When the \(\alpha\) value is close to 1, the \(\beta\) value approaches 0, and conversely.

The starvation rate of the vultures is then examined, modelled by the following expressions:

$$\:l=d\times\:\left({\text{s}\text{i}\text{n}}^{\gamma\:}\left(\frac{\pi\:}{2}\times\:\frac{ite{r}_{i}}{{\text{m}\text{a}\text{x}}_{iter}}\right)+\text{c}\text{o}\text{s}\left(\frac{\pi\:}{2}\times\:\frac{ite{r}_{i}}{{\text{m}\text{a}\text{x}}_{iter}}\right)-1\right)$$
(11)
$$\:G=\left(2\times\:\delta\:+1\right)\times\:s\times\:\left(1-\frac{ite{r}_{i}}{{\text{m}\text{a}\text{x}}_{iter}}\right)+l$$
(12)

Here, \(\delta\) denotes a random number in \((0,1)\), \(iter_i\) is the current iteration, \(s\) is a predetermined constant associated with the optimizer's exploration and exploitation behaviour, \(max_{iter}\) is the total number of iterations, \(\gamma\) is a random number in \((0,1)\), and \(d\) is a random number between \(-2\) and \(2\). When \(s\) decreases towards \(0\) the vulture is starving, and when it increases towards \(1\) the vulture is satiated. The exploration phase of the vultures comprises two different strategies, with a parameter \(R_1\) in the interval \((0,1)\) used to select between them. The search for food is expressed as follows:

If \(\:{R}_{1}\ge\:ran{d}_{{R}_{1}}\):

$$\:H\left(i+1\right)=BV\left(i\right)-T\left(i\right)\times\:H$$
(13)

If \(\:{R}_{1}<ran{d}_{{R}_{1}}\):

$$\:H\left(i+1\right)=BV\left(i\right)-H+ran{d}_{2}\times\:\left(\left(ub-lb\right)\times\:ran{d}_{3}+lb\right)$$
(14)

Whereas,

$$\:T\left(i\right)=\left|U\times\:BV\left(i\right)-H\left(i\right)\right|$$
(15)

Here, \(T(i)\) in Eq. (15) denotes the distance by which the vultures move randomly to protect food from other vultures, with \(U=2\times rand\); \(lb\) and \(ub\) represent the lower and upper bounds; \(BV\) denotes the best vulture found so far; and \(rand_2\) and \(rand_3\) are two random numbers in \((0,1)\).

When \(|H|<1\), exploitation takes place. It consists of two sections, each with two designs, governed by the parameters \(R_2\) and \(R_3\), both of which lie in \((0,1)\). The first exploitation section applies when \(0.5\le|H|<1\) and contains two designs, including a rotating flight; \(|H|\ge 0.5\) indicates that the vulture still has sufficient energy. It is modelled by Eqs. (16) and (17):

$$\:H\left(i+1\right)=T\left(i\right)\times\:\left(H+ran{d}_{4}\right)-j\left(t\right)$$
(16)
$$\:j\left(t\right)=BV\left(i\right)-H\left(i\right)$$
(17)

Here, \(\:ran{d}_{4}\) defines a random quantity in the interval \(\:(0,\:1)\).

The spiral motion of the vultures is modelled by the following equations:

$$\:{y}_{1}=BV\left(i\right)\times\:\left(\frac{ran{d}_{5}\times\:H\left(i\right)}{2\pi\:}\right)\times\:\text{c}\text{o}\text{s}\left(H\left(i\right)\right)$$
(18)
$$\:{y}_{2}=BV\left(i\right)\times\:\left(\frac{ran{d}_{6}\times\:H\left(i\right)}{2\pi\:}\right)\times\:\text{s}\text{i}\text{n}\left(H\left(i\right)\right)$$
(19)
$$\:H\left(i+1\right)=BV\left(i\right)-\left({y}_{1}+{y}_{2}\right)$$
(20)

Here, \(rand_5\) and \(rand_6\) are random numbers in \((0,1)\). In the second exploitation section, applied when \(|H|<0.5\), several vultures gather around the food source and compete for it, with the stronger vultures using their energy to secure the food. A random number \(rand_{R_3}\) is drawn from \((0,1)\): when \(rand_{R_3}\ge R_3\), various types of vultures accumulate over the food source; when \(rand_{R_3}<R_3\), an aggressive siege-fight strategy is executed, producing a larger contest for food. This is expressed mathematically by the following equations:

$$B_{1}=BestVulture_{1}\left(i\right)-\frac{BestVulture_{1}\left(i\right)\times H\left(i\right)}{BestVulture_{1}\left(i\right)-H{\left(i\right)}^{2}}\times H$$
(21)
$$B_{2}=BestVulture_{2}\left(i\right)-\frac{BestVulture_{2}\left(i\right)\times H\left(i\right)}{BestVulture_{2}\left(i\right)-H{\left(i\right)}^{2}}\times H$$
(22)

Here, \(BestVulture_1\left(i\right)\) and \(BestVulture_2\left(i\right)\) denote the best vultures of the first and second groups, and \(H\left(i\right)\) is the current position vector of the vulture, which is then updated according to Eq. (23):

$$\:H\left(i+1\right)=\frac{{B}_{1}+{B}_{2}}{2}$$
(23)

When \(|H|<0.5\), the stronger vultures exhaust their energy and can no longer prevail over the others. This is expressed below:

$$\:H\left(i+1\right)=BV\left(i\right)-\left|j\left(t\right)\right|\times\:H\times\:Levy\left(j\right)$$
(24)

Here, the step modification is described by the Levy flight (LF):

$$LF\left(x\right)=\frac{a\times\:\sigma\:}{100\times\:|b{|}^{2}}$$
(25)
$$\sigma={\left(\frac{\Gamma\left(1+\zeta\right)\times\text{sin}\left(\frac{\pi\zeta}{2}\right)}{\Gamma\left(\frac{1+\zeta}{2}\right)\times\zeta\times{2}^{\left(\frac{\zeta-1}{2}\right)}}\right)}^{\frac{1}{\zeta}}$$
(26)

Here, \(a\) and \(b\) are two random numbers in the interval \((0,1)\), and \(\zeta\) is a constant set to 1.5. In the developed variant, the weaker vultures, which cannot improve their positions because of poor fitness estimates in each iteration, are updated according to the following scheme.

$$\:{R}_{1}<ran{d}_{{R}_{1}}$$
$$\:H\left(i+1\right)=\left\{\begin{array}{l}BV\:\left(i\right)-H+{r}_{1}\times\:\text{s}\text{i}\text{n}\left({r}_{2}\right)\times\:\left|{r}_{3}\times\:{H}_{best}^{i}-{H}_{worst}^{i}\right|,\:{r}_{4}<0.5\\\:BV\left(i\right)-H+{r}_{1}\times\:\text{c}\text{o}\text{s}\left({r}_{2}\right)\times\:\left|{r}_{3}\times\:{H}_{best}^{i}-{H}_{worst}^{i}\right|,\:{r}_{4}\ge\:0.5\end{array}\right.$$
(27)

Here, \(H_{best}^{i}\) and \(H_{worst}^{i}\) denote the best and worst solutions of group \(i\), and \(r_1, r_2, r_3, r_4\) are computed from the following expressions to determine the new position update of the vultures.

$$\:{r}_{1}=a-itr\times\:\left(\frac{a}{N}\right)$$
(28)
$$\:{r}_{2}=2\pi\:\times\:rnd$$
(29)
$$\:{r}_{3}=2\times\:rnd$$
(30)
$$\:{r}_{4}=rnd$$
(31)

Here, \(itr\) denotes the current iteration, \(N\) the total number of iterations, \(rnd\) a random number in \([0,1]\), and \(a\) a constant equal to 2. The opposition-based learning (OBL) method is then applied: when \(rnd\) is smaller than a constant \(c\), the newly updated solutions and their corresponding opposite solutions are generated, and the fitter of the two is retained based on their fitness values. The opposite position of a given vulture \(x_i\) is computed using Eqs. (32) and (33):

$$\:{X}_{i}={U}_{b}+{L}_{b}-{x}_{i},$$
(32)
$$\:{x}_{i}\in\:\left[{L}_{b},\:{U}_{b}\right]$$
(33)

Here, \(U_b\) and \(L_b\) represent the upper and lower bounds of the search space. Algorithm 1 illustrates the DAVOA model.

Algorithm 1
figure a

DAVOA technique
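As a concrete illustration of the OBL step in Eqs. (32)-(33), a minimal sketch is shown below; the hyperparameter bounds are illustrative and do not reflect the ranges used in this work.

```python
import numpy as np

def opposition(x, lb, ub):
    """Eq. (32): opposite solution X_i = U_b + L_b - x_i, computed element-wise."""
    return ub + lb - x

# Illustrative bounds for two hyperparameters (learning rate, dropout) -- assumed values.
lb = np.array([1e-4, 0.1])
ub = np.array([1e-1, 0.6])
x = np.random.uniform(lb, ub)      # a candidate vulture position within [L_b, U_b] (Eq. 33)
x_opp = opposition(x, lb, ub)      # its opposite candidate; the fitter of the two is retained
```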

Fitness selection (FS) is a crucial factor that impacts the efficiency of the DAVOA. The hyperparameter selection procedure uses a solution-encoding scheme to assess the competence of the candidate solutions. Here, the DAVOA adopts precision as the chief criterion for designing the fitness function, as demonstrated below.

$$\:Fitness\:=\:\text{m}\text{a}\text{x}\:\left(P\right)$$
(34)
$$\:P=\frac{TP}{TP+FP}$$
(35)

Here, \(\:TP\) signifies the true positive value, and \(\:FP\) designates the false positive value.
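A hedged sketch of this precision-based fitness evaluation is given below, assuming scikit-learn is available; `train_and_evaluate` is a hypothetical placeholder for training the ST-GCN with a candidate configuration and returning validation labels and predictions, and the macro averaging over the emotion classes is an assumption.

```python
from sklearn.metrics import precision_score

def fitness(candidate_hparams, train_and_evaluate):
    """Eqs. (34)-(35): fitness = precision P = TP / (TP + FP) of the tuned classifier.

    `train_and_evaluate` is a placeholder callable that trains the ST-GCN with the
    candidate hyperparameters and returns (y_true, y_pred) on a validation split.
    """
    y_true, y_pred = train_and_evaluate(candidate_hparams)
    return precision_score(y_true, y_pred, average="macro")  # averaging choice is assumed
```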

Performance validation

In this segment, the experimental validation of the SCGERS-STGCN approach is performed on the Emotion detection dataset48. The method runs on Python 3.6.5 with an i5-8600k CPU, 4GB GPU, 16GB RAM, 250GB SSD, and 1 TB HDD, using a 0.01 learning rate, ReLU activation, 50 epochs, 0.5 dropout, and a batch size of 5.
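These reported settings can be collected into a simple configuration dictionary, as sketched below; the optimizer entry is an assumption, since it is not specified here.

```python
# Training settings reported for the SCGERS-STGCN experiments.
config = {
    "learning_rate": 0.01,
    "activation": "relu",
    "epochs": 50,
    "dropout": 0.5,
    "batch_size": 5,
    "optimizer": "sgd",   # assumption: the optimizer is not specified in the paper
}
```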

The dataset contains 25,400 samples of 48 × 48 pixel grayscale images of faces, divided into training and testing datasets. Images are categorized depending on the emotion expressed in the facial expressions (happy, neutral, sad, angry, surprised, disgusted, and fearful), as shown in Table 1.

Table 1 Details of the dataset.

Figure 3 presents the confusion matrices generated by the SCGERS-STGCN technique for 80:20 and 70:30 splits of the training phase (TRPH) and testing phase (TSPH). The outcomes indicate that the SCGERS-STGCN model accurately detects and classifies all distinct classes.

Fig. 3
figure 3

Confusion matrices at (a, c) TRPH of 80% and 70% and (b, d) TSPH of 20% and 30%.

Table 2 and Fig. 4 illustrate the emotion detection performance of the SCGERS-STGCN approach under 80%TRPH and 20%TSPH. The results show that the SCGERS-STGCN approach correctly identified samples across all classes. With an 80% TRPH, the SCGERS-STGCN model presents an average \(acc{u}_{y}\), \(pre{c}_{n}\), \(rec{a}_{l}\), \({F1}_{score}\), and \({AUC}_{score}\) of 98.45%, 92.75%, 87.32%, 89.08%, and 93.19%, respectively. Also, with 20% TSPH, the SCGERS-STGCN model presents an average \(acc{u}_{y}\), \(pre{c}_{n}\), \(rec{a}_{l}\), \({F1}_{score}\), and \({AUC}_{score}\) of 98.53%, 93.79%, 89.49%, 91.13%, and 94.30%, correspondingly.

Table 2 Emotion detection of SCGERS-STGCN model under 80%TRPH and 20%TSPH.
Fig. 4
figure 4

Average of SCGERS-STGCN technique at 80%TRPH and 20%TSPH.

In Fig. 5, the training \(acc{u}_{y}\) (TRAAY) and validation \(acc{u}_{y}\) (VLAAY) results of the SCGERS-STGCN method at 80%TRPH and 20%TSPH are represented. The \(acc{u}_{y}\) values are computed over the interval of 0–25 epochs. The outcomes show that the TRAAY and VLAAY values exhibit a growing trend, which is attributed to the capacity of the SCGERS-STGCN method to deliver superior performance across several iterations. Additionally, the TRAAY and VLAAY values remain relatively close throughout the epochs, indicating superior performance and minimal overfitting of the SCGERS-STGCN technique, which ensures consistent predictions on unobserved samples.

Fig. 5
figure 5

\(\:Acc{u}_{y}\) curve of SCGERS-STGCN method at 80%TRPH and 20%TSPH.

In Fig. 6, the TRA loss (TRALOS) and VLA loss (VLALOS) outcomes of the SCGERS-STGCN method under 80%TRPH and 20%TSPH are exhibited. The loss values are computed throughout 0–25 epochs. It is noted that the TRALOS and VLALOS values exhibit a diminishing trend, indicating the proficiency of the SCGERS-STGCN technique in balancing model fit and generalization. The continual reduction in loss values further confirms the strong performance of the SCGERS-STGCN technique and refines its prediction results.

Fig. 6
figure 6

Loss curve of SCGERS-STGCN method at 80%TRPH and 20%TSPH.

Table 3 and Fig. 7 denote the emotion detection results of the SCGERS-STGCN model under 70%TRPH and 30%TSPH. The outcomes indicate that the SCGERS-STGCN technique appropriately recognized samples across all classes. With 70% TRPH, the SCGERS-STGCN technique presents an average \(acc{u}_{y}\), \(pre{c}_{n}\), \(rec{a}_{l}\), \({F1}_{score}\), and \({AUC}_{score}\) of 98.12%, 91.83%, 86.98%, 88.64%, and 92.92%, respectively. Simultaneously, with 30% TSPH, the SCGERS-STGCN technique presents an average \(acc{u}_{y}\), \(pre{c}_{n}\), \(rec{a}_{l}\), \({F1}_{score}\), and \({AUC}_{score}\) of 98.10%, 92.49%, 86.40%, 88.27%, and 92.63%, correspondingly.

Table 3 Emotion detection of SCGERS-STGCN model under 70%TRPH and 30%TSPH.
Fig. 7
figure 7

Average of SCGERS-STGCN method at 70%TRPH and 30%TSPH.

In Fig. 8, the TRAAY and VLAAY results of the SCGERS-STGCN technique under 70%TRPH and 30%TSPH are shown. The \(acc{u}_{y}\) values are computed over the range of 0–25 epochs. The figure highlights that both values show an increasing trend, indicating the competence of the SCGERS-STGCN methodology, with maximal performance achieved across several iterations. At the same time, the TRAAY and VLAAY values remain close across epochs, which indicates minimal overfitting and demonstrates the strong generalization of the SCGERS-STGCN methodology, ensuring accurate predictions on unseen samples.

Fig. 8
figure 8

\(\:Acc{u}_{y}\) outcome of SCGERS-STGCN method at 70%TRPH and 30%TSPH.

In Fig. 9, the TRALOS and VLALOS graph of the SCGERS-STGCN approach under 70%TRPH and 30%TSPH is displayed. The loss outcomes are computed over the range of 0–25 epochs. It is shown that the TRALOS and VLALOS values exhibit a decreasing trend, indicating the competency of the SCGERS-STGCN technique in balancing model fit and generalization. The sequential reduction in the loss values further confirms the superior performance of the SCGERS-STGCN technique.

Fig. 9
figure 9

Loss curve of SCGERS-STGCN model under 70%TRPH and 30%TSPH.

The comparative study of the SCGERS-STGCN technique with existing approaches19,20,21,49,50,51 is illustrated in Table 4 and Fig. 10. The simulation performance specified that the SCGERS-STGCN technique outperformed other techniques. Based on \(acc{u}_{y}\), the SCGERS-STGCN model has an improved \(acc{u}_{y}\) of 98.53%, while the CNN-BiLSTM, 1D-CNN, GPTDS, PC-TTA, Vis-Net, ViT-B/16/SAM, 5-Layer, ResNet-50, CNN, GoogleNet, Inception, and CNN-Raspberry approaches have lower \(acc{u}_{y}\) values of 88.47%, 87.62%, 87.95%, 86.86%, 84.24%, 81.40%, 87.70%, 87.27%, 86.30%, 83.56%, 91.60%, and 94.97%, correspondingly.

Table 4 Comparative outcome of SCGERS-STGCN model with existing approaches19,20,21,49,50,51.
Fig. 10
figure 10

Comparative outcome of SCGERS-STGCN model with existing approaches.

Additionally, for \(pre{c}_{n}\), the SCGERS-STGCN model attains a maximum \(pre{c}_{n}\) of 93.79%, while the CNN-BiLSTM, 1D-CNN, GPTDS, PC-TTA, Vis-Net, ViT-B/16/SAM, 5-Layer, ResNet-50, CNN, GoogleNet, Inception, and CNN-Raspberry methods achieve lower \(pre{c}_{n}\) values of 87.51%, 86.61%, 86.85%, 86.01%, 92.11%, 88.40%, 83.70%, 86.27%, 85.30%, 91.56%, 90.60%, and 91.97%, respectively. Similarly, based on \(rec{a}_{l}\), the SCGERS-STGCN approach attained a \(rec{a}_{l}\) of 89.49%, while the CNN-BiLSTM, 1D-CNN, GPTDS, PC-TTA, Vis-Net, ViT-B/16/SAM, 5-Layer, ResNet-50, CNN, GoogleNet, Inception, and CNN-Raspberry methods obtained lower \(rec{a}_{l}\) values of 85.77%, 88.67%, 85.09%, 87.89%, 81.87%, 81.87%, 86.37%, 84.37%, 87.10%, 81.27%, 86.77%, and 81.02%, respectively.

In Table 5 and Fig. 11, the comparative outcome of the SCGERS-STGCN approach is examined in terms of execution time (ET). The table values suggest that the SCGERS-STGCN model yields better outcomes. According to ET, the SCGERS-STGCN model provides a minimum ET of 2.40s, whereas the CNN-BiLSTM, 1D-CNN, GPTDS, PC-TTA, Vis-Net, ViT-B/16/SAM, 5-Layer, ResNet-50, CNN, GoogleNet, Inception, and CNN-Raspberry approaches require higher ET values of 10.67s, 13.09s, 15.00s, 9.56s, 6.99s, 10.96s, 13.43s, 11.34s, 9.12s, 17.94s, 10.42s, and 12.19s, respectively. The significantly lower execution time of the SCGERS-STGCN model highlights its suitability for real-time emotion recognition in time-sensitive assistive environments. Furthermore, this efficiency enables seamless integration into low-power or edge-based systems without compromising responsiveness.

Table 5 ET outcome of the SCGERS-STGCN approach with existing techniques.
Fig. 11
figure 11

ET outcome of SCGERS-STGCN approach with existing techniques.

Table 6 and Fig. 12 illustrate the error analysis of the SCGERS-STGCN methodology with existing models. The SCGERS-STGCN methodology exhibits an \(acc{u}_{y}\) error of 1.47%, a \(pre{c}_{n}\) error of 6.21%, a \(rec{a}_{l}\) error of 10.51%, and an \({F1}_{score}\) error of 8.87%. Compared with existing models such as ViT-B/16/SAM, with an accuracy error of 18.60%, or the GoogleNet method, with a recall error of 18.73%, the SCGERS-STGCN model depicts considerably lower error values, confirming its reduced misclassification across all metrics.

Table 6 Error analysis of the SCGERS-STGCN methodology with existing models.
Fig. 12
figure 12

Error analysis of the SCGERS-STGCN methodology with existing models.

The ablation study of the SCGERS-STGCN model is specified in Table 7 and Fig. 13. The SCGERS-STGCN model demonstrates an \(acc{u}_{y}\) of 98.53%, \(pre{c}_{n}\) of 93.79%, \(rec{a}_{l}\) of 89.49%, and \({F1}_{score}\) of 91.13%, thus illustrating its consistent performance across all key metrics compared to other configurations such as GF with an \(acc{u}_{y}\) of 95.86% and \({F1}_{score}\) of 88.47%, or even advanced components like ST-GCN with an \(acc{u}_{y}\) of 97.86% and \({F1}_{score}\) of 90.54%. The ablation results clearly show that the SCGERS-STGCN technique achieves the best balance between correctly identifying relevant instances and minimizing errors, making it a highly reliable solution.

Table 7 Result analysis of the ablation study of SCGERS-STGCN model.
Fig. 13
figure 13

Result analysis of the ablation study of SCGERS-STGCN model.

Conclusion

In this paper, a new SCGERS-STGCN technique is proposed for individuals with hearing disabilities. The SCGERS-STGCN method facilitates the recognition of gestures and emotions, thereby enhancing communication for individuals with hearing impairments. Initially, the SCGERS-STGCN model employs GF in the image pre-processing stage to reduce noise and improve the quality of input images. For feature extraction, the ViT model is utilized to capture the complex relationships and patterns within gestures and facial expressions that indicate emotions. Additionally, the ST-GCN approach is employed for detecting facial emotions based on gestures. Finally, the hyperparameters of the ST-GCN classifier are tuned by DAVOA to enhance the system's accuracy and efficacy. The experimentation of the SCGERS-STGCN model is performed on the Emotion detection dataset. The comparison analysis of the SCGERS-STGCN model revealed a superior accuracy value of 98.53% compared to existing techniques. The limitations of the SCGERS-STGCN model include its reliance on a relatively small and homogeneous dataset, which may affect the model's generalizability across diverse populations and varied environmental conditions. The model may encounter difficulties with occlusions, varying lighting conditions, and rapid facial movements, which can impact detection accuracy. Although the computational requirements are optimized, they may still pose challenges for deployment on low-power edge devices. Furthermore, the model primarily focuses on static facial features and may not fully capture subtle microexpressions or contextual emotional cues. Future work may explore the integration of multimodal data sources, such as audio and physiological signals, to enhance the robustness of recognition. Investigating lightweight model architectures and real-time adaptive learning mechanisms could also improve performance and usability in real-world applications, specifically in mobile or embedded environments. Expanding on multimodal data fusion represents a logical and impactful progression of this research, presenting broader applicability and greater resilience to variability across users and settings.