Abstract
Gesture recognition (GR) is an emerging and wide-ranging area of research. GR is extensively applied in sign language interpretation, immersive game technology, and other computer interfaces. People with visual impairments face challenges in completing everyday tasks, including navigating environments, using technologies, and engaging in social interactions. They also face challenges in balancing their independence with the need for assistance in their day-to-day activities. The communication of visually challenged and deaf people can be recognized by recording their gestures and comparing them with recent datasets, thereby establishing their intent. Conventional machine learning (ML) models rely on handcrafted features but often underperform in real-time environments, whereas deep learning (DL) models have recently attracted wide attention among researchers and have largely superseded conventional ML techniques. Therefore, this study presents a new approach, Enhancing Gesture Recognition for the Visually Impaired using Deep Learning and an Improved Snake Optimization Algorithm (EGRVI-DLISOA), in an IoT environment. The EGRVI-DLISOA approach is an advanced GR system powered by DL in an IoT environment, designed to provide real-time interpretation of gestures to assist the visually impaired. Initially, the EGRVI-DLISOA technique utilizes the Sobel filter (SF) technique for the noise elimination process. For feature extraction, the SqueezeNet model is utilized due to its efficiency in capturing meaningful features from complex visual data. For an accurate GR process, the long short-term memory (LSTM) approach is implemented. To fine-tune the hyperparameter values of the LSTM classifier, the improved snake optimization algorithm (ISOA) is utilized. The experimentation of the EGRVI-DLISOA technique is investigated under the hand gestures dataset. The comparison study of the EGRVI-DLISOA technique revealed a superior accuracy value of 98.62% compared to existing models.
Introduction
In day-to-day life, visually impaired individuals face various challenges, such as navigating unfamiliar environments and visually recognizing their surroundings, including people, faces, or objects1. Several research studies have been conducted to enable computers and/or machines to perceive the world like humans, thereby improving visual recognition algorithms in areas such as object recognition and object detection. One of the capabilities of visual recognition is to detect a person's facial expressions during communication2. In general, non-verbal information, such as facial expressions, gestures, and body movements, plays a major role in face-to-face communication3. Additionally, facial expression is strongly associated with human emotion, which indicates the inner state of the speaker and/or the listener during interaction. Such information is essential for effective communication with others. Visually impaired individuals often face challenges in accessing such information, which restricts their social interaction4. In particular, those who lost their vision early in life experience greater difficulty in social interaction. To overcome this, several assistive devices are available for visually impaired individuals to capture non-verbal information during social interactions5.
Communication between humans and computers is growing extensively, and the field continues to improve as novel approaches and methods emerge6. GR is among the most developed fields of artificial intelligence (AI) and computer vision (CV), and it has improved interaction with visually impaired individuals as well as gesture-based signalling systems7. The current advancements in hand gesture detection across diverse regions have captured the attention of various industries. In recent decades, the Internet of Things (IoT) has greatly simplified daily life through the use of smart devices. Similarly, these devices are designed to improve the quality of life for individuals with visual impairments. The major challenge every visually impaired individual faces is obstacle detection and navigation8. Several real-world assistive systems have been developed to help blind people navigate and avoid obstacles on their own9. Diverse approaches and systems employing sensors, brain-computer interfacing, CV, robotics, and human–machine communication have been developed to aid individuals with disabilities10. Currently, DL and ML methods are becoming increasingly popular, enabling visually impaired individuals to interact with others more efficiently.
This study presents a new approach, Enhancing Gesture Recognition for the Visually Impaired using Deep Learning and an Improved Snake Optimization Algorithm (EGRVI-DLISOA), in an IoT environment. The EGRVI-DLISOA approach is an advanced GR system powered by DL, designed to provide real-time interpretation of gestures to assist the visually impaired. Initially, the EGRVI-DLISOA technique utilizes the Sobel filter (SF) technique for the noise elimination process. For feature extraction, the SqueezeNet model is utilized due to its efficiency in capturing meaningful features from complex visual data. For an accurate GR process, the long short-term memory (LSTM) approach is implemented. To fine-tune the hyperparameter values of the LSTM classifier, the improved snake optimization algorithm (ISOA) is utilized. The experimentation of the EGRVI-DLISOA technique is investigated under the hand gestures dataset. The key contributions of the EGRVI-DLISOA technique are listed below.
- The EGRVI-DLISOA model incorporates the SF-based noise elimination technique to mitigate background noise while preserving crucial edge structures in gesture input data, resulting in sharper spatial representations that support more accurate feature extraction and recognition, particularly under low-quality or distorted input conditions.
- The EGRVI-DLISOA approach employs a lightweight SqueezeNet architecture to extract rich spatial features from gesture inputs while significantly reducing computational overhead. This enables efficient deployment on resource-constrained devices and improves real-time responsiveness without compromising accuracy in complex gesture recognition scenarios.
- The EGRVI-DLISOA methodology utilizes the LSTM model to effectively capture long-range temporal dependencies in sequential gesture data, thereby enabling the precise interpretation of dynamic hand movements while improving recognition consistency across varying input durations and user-specific gesture variations.
- The ISOA technique is incorporated into the EGRVI-DLISOA method to effectively fine-tune the hyperparameters, thereby improving GR performance and convergence while maintaining robustness across diverse input conditions, ultimately enhancing model stability and adaptability during training.
- The novelty of the EGRVI-DLISOA technique lies in the integration of the SF, SqueezeNet, LSTM, and ISOA models, which creates a compact and noise-resilient GR framework. This integration enhances real-time efficiency and accuracy by optimizing feature extraction and tuning simultaneously. The approach outperforms existing methods by integrating advanced noise elimination, lightweight spatial feature extraction, temporal modelling, and adaptive optimization into a single robust system.
Literature survey
Zainuddin et al.11 developed the Rehabilitation IoT (RIOT) structure and examined the efficiency of an ML model combined with the MediaPipe framework for gesture detection calibration. The Design of Experiments (DoE) model permits a systematic exploration of the relationship between precise hand GR and RIOT. To ensure accurate rehabilitation assessments, this initiative aims to enhance manageable home-based stroke rehabilitation by implementing secure and optimized hand gesture detection methods. In12, Kinect-based movement recognition is thoroughly examined, and a novel movement recognition approach based on the dynamic-signal evidence hypothesis and the Hidden Markov Model (HMM) is presented. In13, a signal-based gesture detection approach is proposed that performs gesture coding based on stroke trajectory and feature representation based on the stages of dynamic variation. Specifically, the approach incorporates an endpoint recognition method to ensure accurate GR. In addition, a sub-carrier selection method is designed to select the optimal sub-carriers while taking complete gesture information into account. Fan et al.14 developed a new Amphibious Hierarchical GR (AHGR) technique. This technique can adaptively switch between lightweight and more complex gesture recognition techniques, depending on environmental variations, to ensure the effectiveness and accuracy of GR. The more complex technique relies on the proposed SqueezeNet-BiLSTM model, specifically designed for the land-based environment. Sruthi et al.15 propose a technique for classifying hand gestures, in which four diverse approaches, including the proposed one, are evaluated. Several ML methods, such as SVM, KNN, and decision trees (DTs), are then utilized to classify whether the left or right hand is used to signal the type of gesture. In16, a new vision-based hybrid DNN model is presented. The spatial feature extractor for sign gestures utilizes a 3D DNN with atrous convolution, while sequential and temporal feature extraction is implemented using an attention-based Bi-LSTM. Additionally, discriminative features are extracted using modified autoencoders (AEs), and an optimized hybrid attention component distinguishes sign gestures from unwanted transition gestures. Mallik et al.17 propose a new virtual keyboard for character input through different hand gestures, focused on two key features: character input mechanisms and hand GR. A new technique is proposed with fully connected (FC) layers and LSTM for improved hand GR and sequential data processing. This technique also incorporates CNN, dropout layers, and max pooling for enhanced spatial feature extraction. The architecture processes both the spatial and temporal features of hand gestures, employing LSTM to extract complex patterns from frame sequences, thereby providing a comprehensive understanding of the input data.
Hao et al.18 proposed a gesture recognition methodology for complex scenes using millimetre wave radar and a lightweight Multi-Convolutional Neural Network Long Short-Term Memory (Multi-CNN-LSTM) technique. Dawood et al.19 developed ARNet, a human action recognition framework that integrates a refined InceptionResNet-V2 with Parametric Rectified Linear Unit (PReLU) activation for improved spatial feature extraction, and a Bidirectional Long Short-Term Memory (Bi-LSTM) network for capturing temporal dynamics in video sequences. Seifi et al.20 proposed XentricAI, an explainable hand gesture recognition system that incorporates a variational autoencoder (VAE) for gesture anomaly detection and transfer learning (TL) for user adaptability. Lamsellak et al.21 presented a technique that incorporates multiple sensors providing acceleration and rotation signals and employs ML techniques, namely random forest (RF), support vector machine (SVM), and k-nearest neighbours (kNN) methods. Zhou et al.22 proposed a novel Covariance-based Graph Convolutional Network (CovGCN) technique, incorporating the Covariance-based Topology Refinement Module (CovTRM) model, to enhance hand gesture recognition. Rezaee et al.23 presented a hand gesture classification method that utilizes surface electromyography data, integrating a U-Net architecture with a MobileNetV2 encoder. The model is optimized using Bi-LSTM and Bayesian optimization (BO) models, with edge computing for real-time processing. Qu, Yu, and Tan24 developed a low-cost, real-time gesture recognition system by utilizing OpenMV, TensorFlow, and EdgeImpulse. Chopparapu, Chopparapu, and Vasagiri25 introduced a real-time image enhancement approach using deep reinforcement learning (DRL) with a convolutional neural network (CNN)-based Q-learning model, optimizing filter selection based on the structural similarity index measure, peak signal-to-noise ratio, and mean squared error. Zhou et al.26 proposed CSSA-YOLO, a real-time behaviour detection model for smart classrooms that combines Cross-Scale Shuffle Attention with the Complete Intersection over Union loss function to improve small-scale and occluded behaviour recognition. Su et al.27 introduced a distributed acoustic 3D spatial beamforming model by utilizing in-car speakers and microphones to create in-car occupancy grids.
Limitations and research gap in existing gesture recognition techniques
The existing studies portray various limitations despite improvements in GR systems. Various models heavily depend on constrained environments, making them less adaptable to real-world complexities, such as lighting variations, occlusions, or dynamic backgrounds. Approaches based on Kinect or mmWave radar require expensive hardware, thereby restricting scalability. A few methods also encounter difficulty in generalizing, specifically in multi-user or multi-gesture scenarios. Several studies lack effective spatiotemporal modelling or fail to balance accuracy with computational efficiency. Moreover, hybrid techniques that combine CNN, LSTM, Bi-LSTM, and attention mechanisms (AM) incur high training costs and are also sensitive to hyperparameter tuning. The research gap lies in developing lightweight, adaptive, and robust GR models that can classify accurately in diverse, real-time, and resource-constrained environments.
The proposed methodology
In this paper, a new EGRVI-DLISOA model is presented in an IoT environment. The EGRVI-DLISOA model is an advanced GR system powered by DL in an IoT environment, designed to provide real-time interpretation of gestures to assist the visually impaired. To accomplish this, the EGRVI-DLISOA technique contains four processes: noise removal, feature extraction, LSTM-based gesture detection, and ISOA-based parameter tuning. Figure 1 illustrates the entire workflow of the EGRVI-DLISOA technique.
Image preprocessing: SF model
Initially, the EGRVI-DLISOA technique uses the SF model for the noise elimination process28. This technique is chosen for its efficiency in edge detection and noise reduction, which is crucial for preserving the significant contours of the gesture. This method also enhances edge features while reducing noise, thereby improving the clarity of the features. The computational simplicity of the model facilitates fast processing and also enables better discrimination of subtle gestures compared to standard filters, such as Gaussian or median filters. This balance of accuracy and efficiency makes the SF an ideal choice for robust GR preprocessing.
The SF is generally applied in GR systems for visually impaired people, improving edge recognition by emphasizing object contours. By computing vertical and horizontal gradients, it highlights gesture outlines, making the shapes more distinct for machine interpretation. This helps in precisely capturing gesture patterns and hand movements from videos or images. Combined with other processing methods, the SF enables consistent gesture-based interaction and navigation, thereby improving accessibility for individuals with visual impairments.
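The following sketch illustrates Sobel-based, edge-preserving preprocessing of a gesture frame, assuming OpenCV is available; the kernel size and normalisation choices are illustrative and not the exact configuration used in the experiments.

```python
import cv2
import numpy as np

def sobel_preprocess(image_path: str, ksize: int = 3) -> np.ndarray:
    """Compute a Sobel gradient-magnitude map that emphasises gesture contours."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Horizontal and vertical gradients (edge responses along x and y).
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=ksize)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=ksize)
    # Gradient magnitude combines both directions into a single edge map.
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    # Normalise to [0, 255] so downstream feature extractors see a standard range.
    magnitude = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX)
    return magnitude.astype(np.uint8)
```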
Feature extraction: SqueezeNet
For feature extraction, the SqueezeNet model is utilized for its efficiency in capturing meaningful features from complex visual data29. The model's lightweight architecture significantly reduces the number of parameters while maintaining high accuracy. The technique also requires less computational power and memory, making it ideal for resource-constrained and real-time environments. The utilization of "fire modules" assists in effectively capturing the spatial features by incorporating squeeze and expand layers, thus improving the feature representation without excessive complexity. This balance of compactness and performance facilitates faster training and inference, making SqueezeNet a practical and ideal choice over heavier CNN models for GR tasks.
SqueezeNet is a relatively lightweight CNN architecture designed to achieve AlexNet-level accuracy with 50 times fewer parameters. The architecture mainly concentrates on decreasing the model size (to less than half a megabyte when compressed) while upholding robust performance, primarily utilizing nine Fire modules (fire 1 to fire 9), each of which comprises a Squeeze layer followed by an Expand convolutional layer, as shown in Eq. (1). Every Fire module contains a Squeeze layer with 1 × 1 convolutions to reduce the number of input channels, followed by an Expand layer with a combination of 1 × 1 and 3 × 3 convolutions, effectively balancing the number of parameters and feature learning. This process is formulated in the mathematical formulation below:
For classification, the design of SqueezeNet effectively extracts features. The subtle differences in gesture patterns necessitate a model that can distinguish fine-grained features while being computationally efficient. The architecture's ability to compress input data through the Squeeze layers, followed by the expansion of feature maps using 1 × 1 and 3 × 3 convolutions, enables SqueezeNet to capture both global and local patterns efficiently. SqueezeNet's compact model allows it to be deployed on devices with limited memory and processing power, such as embedded systems and mobile phones, for real-world classification tasks without compromising classification accuracy. Furthermore, the architecture's efficiency enables quicker inference times, making it suitable for real-world applications where prompt decisions are vital.
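To make the squeeze-and-expand design concrete, the following PyTorch-style sketch implements a single Fire module; the channel counts are illustrative defaults rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire module: a 1x1 squeeze layer followed by parallel 1x1 and 3x3 expand layers."""

    def __init__(self, in_ch: int, squeeze_ch: int, expand1x1_ch: int, expand3x3_ch: int):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)  # channel reduction
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.relu(self.squeeze(x))
        # Concatenate the two expand branches along the channel dimension.
        return torch.cat([self.relu(self.expand1x1(s)), self.relu(self.expand3x3(s))], dim=1)

# Example: a 128-channel feature map squeezed to 16 channels and expanded back to 128.
out = Fire(in_ch=128, squeeze_ch=16, expand1x1_ch=64, expand3x3_ch=64)(torch.randn(1, 128, 56, 56))
```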
Classification: LSTM
For an accurate GR process, the LSTM model is employed30. The model effectively addresses the vanishing gradient issue, unlike conventional RNNs, thus allowing it to retain relevant data over extended time steps. The method is also suitable due to its superior capability in modelling long-term temporal dependencies in sequential data, which makes it ideal for recognizing gesture patterns that unfold over time. The LSTM also captures motion dynamics and temporal correlations efficiently compared to CNNs, which are more spatially focused. Its gated architecture enables selective memory updates, thereby improving recognition accuracy, particularly in time-sensitive real-time gesture recognition applications. Figure 2 represents the architecture of LSTM.
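A minimal PyTorch-style sketch of how per-frame feature vectors (for example, from SqueezeNet) could feed an LSTM classification head is shown below; the feature dimension, hidden size, and class count are illustrative assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn

class GestureLSTM(nn.Module):
    """Classify a sequence of per-frame feature vectors into gesture classes."""

    def __init__(self, feat_dim: int = 512, hidden: int = 128, num_classes: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim); the final hidden state summarises the sequence.
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])

logits = GestureLSTM()(torch.randn(8, 20, 512))  # 8 clips of 20 frames each
```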
A recurrent neural network (RNN) often struggles to retain information over an extended period. The LSTM method tackles this problem by incorporating memory cells equipped with gating mechanisms. These gates allow the network to determine which information should be discarded or retained.
Forget gate
This gate regulates the elimination of data from the LSTM memory. By using a sigmoid activation function, it computes a value, represented as \(f_{t}\), which lies between 0 and 1. It states the degree to which the previously learned data \(h_{t - 1}\) and the present input \(x_{t}\) must be discarded or preserved. This procedure is stated in the mathematical formulation below:
Input gate
This gate defines which new data must be incorporated into the LSTM network. It contains two layers: a hyperbolic tangent (tanh) layer and a sigmoid layer. The sigmoid layer yields an update signal, indicating what should be updated, while the tanh layer creates a vector of candidate values, \(\tilde{c}_{t}\), which is considered for addition to the memory. Together, these two layers determine the memory update, as evaluated below:
The updated memory \(c_{t}\) in Eq. (5) results from combining the forgetting of the old value \(c_{t - 1}\) with the new candidate value \(i_{t} \tilde{c}_{t}\):
Output gate
This gate determines which part of the LSTM memory contributes to the output. It starts with a sigmoid layer, which computes an output gate signal, \(o_{t}\). A \(\tanh\) layer maps the memory values to the range between − 1 and 1, and this result is multiplied by the sigmoid output gate signal to create the final output \(h_{t}\):
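Because the numbered gate equations are not reproduced above, the following block restates the standard LSTM formulation that the description follows (a hedged reconstruction; \(\sigma\) denotes the sigmoid function, \(\odot\) the element-wise product, and \(W\), \(U\), and \(b\) the learned weights and biases of each gate):

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{forget gate} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right), \qquad
\tilde{c}_t = \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{input gate and candidate} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{memory update} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right), \qquad
h_t = o_t \odot \tanh\left(c_t\right) && \text{output gate}
\end{aligned}
```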
Parameter optimizer: ISOA
To fine-tune the parameter values of the LSTM classifier, the ISOA is utilized31. The ISOA model is chosen for its efficient exploration and exploitation capabilities, which result in faster convergence and better avoidance of local optima compared to conventional optimization methods, such as particle swarm optimization (PSO) or genetic algorithms (GAs). The searching strategy is dynamically adjusted by the ISOA model, thus enhancing the fine-tuning of model parameters. The model also effectively balances global search with local refinement, exhibiting higher stability and accuracy. This makes it particularly effective in optimizing complex models, such as LSTM networks, and ensuring efficient training and enhanced overall performance in GR tasks.
SOA is an optimization technique that divides the snake population into male and female groups containing an equal number of individuals. Let the total population size be \(N_{p}\); the numbers of male and female snakes are then \(N_{m} = N_{f} = N_{p}/2\), respectively. The algorithm is separated into local exploitation and global exploration phases. Although SOA is an effective optimizer, it frequently becomes trapped in local optima and lacks adequate global search ability. To improve the global search capability of SOA and hasten convergence, the ISOA is suggested.
Snake colony location initialization based on the quasi-reverse learning approach and tent chaotic mapping (TCM)
The diversity of the initial positions is crucial for achieving the global optimum solution in SOA. Chaotic mapping is characterized by ergodicity, regularity, sensitivity, and strong randomness, which enables it to escape local optima and achieve superior global search capability. Moreover, the TCM randomly generates numbers within the interval [0, 1] to determine the initial positions of the population. TCM is therefore introduced to initialize the snake colony positions. The mathematical equation of the chaotic sequence based on TCM is given below:
Here, \(z_{i}\) denotes the ith chaotic value, and \(\varepsilon\) refers to the control parameter, \(\varepsilon \in [0, 1]\), whose value is fixed at 0.6. According to Eq. (8), the initial positions of the snake colony individuals based on TCM are obtained as given below:
where \(b_{0}\) and \(b_{1}\) specify the lower and upper bounds, respectively. To expand the range of the initial positions, a quasi-reverse learning approach is also introduced. According to the basic principle of quasi-reverse learning, for \(x \in [c_{1}, c_{2}]\), where \(c_{1}\) and \(c_{2}\) denote the minimum and maximum values of \(x\), respectively, the quasi-reverse point \(x^{*}\) is expressed below:
According to Eq. (10), \(x^{*}\) is an evenly distributed random number in the range \(\left[ (c_{1} + c_{2})/2,\; c_{1} + c_{2} - x \right]\). The quasi-reverse learning approach and TCM are combined to form a TCM quasi-reverse learning strategy. Based on Eqs. (9) and (10), the initial position of each individual under the TCM quasi-reverse learning strategy is obtained. Its mathematical formulation is given below:
where \(P_{i}\) denotes the initial position of the ith snake, \(p_{r}\) designates the probability of quasi-reverse learning, and \(p_{r} = 0.3\) is used in this paper.
During the snake group position initialization, the initial positions \(X\) generated by TCM are combined with the positions \(P\) produced by the TCM quasi-reverse learning approach to form a mixed initial set \(X_{c} = \left\{ X \cup P \right\}\). According to the fitness values, the \(N_{p}\) positions with the best fitness in \(X_{c}\) are chosen as the initial population.
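A minimal sketch of this initialization is given below, assuming the standard tent map and the quasi-reverse formula described above; the chaotic seeding, bound handling, and fitness function are illustrative placeholders rather than the paper's exact implementation.

```python
import numpy as np

def tcm_quasi_reverse_init(fitness, n_pop, dim, b0, b1, eps=0.6, p_r=0.3):
    """Initialise a snake population with tent chaotic mapping plus quasi-reverse learning.

    `fitness` scores a position (lower is better here); eps and p_r follow the paper's
    stated settings, while the chaotic seed is an illustrative random choice.
    """
    # 1) Tent chaotic sequence mapped into the search bounds.
    z = np.empty((n_pop, dim))
    zi = np.random.rand(dim)  # illustrative seed of the chaotic sequence
    for i in range(n_pop):
        zi = np.where(zi < eps, zi / eps, (1.0 - zi) / (1.0 - eps))
        z[i] = zi
    X = b0 + z * (b1 - b0)

    # 2) Quasi-reverse candidates: sample between the bound midpoint and (b0 + b1 - x).
    mid = (b0 + b1) / 2.0
    rev = b0 + b1 - X
    lo, hi = np.minimum(mid, rev), np.maximum(mid, rev)
    P = np.where(np.random.rand(n_pop, dim) < p_r, np.random.uniform(lo, hi), X)

    # 3) Keep the n_pop best positions from the union of both candidate sets.
    pool = np.vstack([X, P])
    scores = np.array([fitness(p) for p in pool])
    return pool[np.argsort(scores)[:n_pop]]
```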
Snake population location and fitness update based on SOA
Let the solution space be defined as the foraging space of the snake colonies; the food quantity is formulated in the mathematical formulation below:
where \(t_{c}\) refers to the current iteration count and \(N_{\max}\) denotes the maximum iteration count. If the food quantity is less than the threshold value \(T_{h1} = 0.25\), i.e., if \(Q < 0.25\), the snake colony enters the global search phase, and the position update is formulated in the mathematical expression below:
where \(X_{i,m}\) and \(X_{i,f}\) refer to the positions of the ith male and female snakes, respectively, while \(X_{r,m}\) and \(X_{r,f}\) specify the positions of randomly selected individuals in the male and female snake colonies, respectively. \(f_{r,m}\) and \(f_{r,f}\) are the fitness values of \(X_{r,m}\) and \(X_{r,f}\), respectively, and \(f_{i,m}\) and \(f_{i,f}\) represent the fitness of the ith male and female snakes, respectively. \(r_{d}\) denotes a randomly generated value from the interval \((0, 1)\). If the food quantity exceeds the threshold value \(T_{h1}\), i.e., \(Q > 0.25\), the snake population enters the local exploitation phase, which is mathematically expressed below:
In the local exploitation phase, if the ambient temperature exceeds the threshold \(T_{h2} = 0.6\), i.e., Q > 0.25 and T > 0.6, the snakes move only toward the food, and the position update of the male and female individuals is described below.
Here, \(x_{fd}\) designates the food position that corresponds to the global optimum.
Otherwise, the snakes engage in either mating or fighting behaviour, determined by a randomly generated probability \(P_{r} \in (0, 1)\).
If \(P_{r} > 0.6\), the snakes engage in mating behaviour, and the position updates of the male and female individuals are formulated below:
If \(P_{r} < 0.6\), the snakes abandon mating behaviour and enter fight mode, and the position updates of both male and female individuals are given below:
where \(f_{b,f}\) and \(f_{b,m}\) specify the best fitness values of the female and male snake colonies, respectively, and \(X_{b,f}\) and \(X_{b,m}\) refer to the current best positions of the female and male snake groups, respectively. Table 1 specifies the hyperparameters of the ISOA model.
The fitness selection is a critical factor in determining the efficiency of the ISOA. The parameter selection process includes an encoding scheme for evaluating candidate solutions. In this paper, classification accuracy is the primary criterion used to construct the fitness function (FF), which is defined as:
where \(FP\) and \(TP\) depict the false positive and true positive rates, respectively.
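As an illustration of how a candidate LSTM hyperparameter set could be scored during ISOA tuning, the following hedged sketch uses validation classification error as the objective; `build_and_train_lstm` is a placeholder standing in for the actual training pipeline, and the decoded hyperparameter ranges are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def build_and_train_lstm(lr, hidden, dropout):
    """Placeholder for the real training pipeline; returns a validation accuracy.

    In an actual run this would train the SqueezeNet+LSTM model with the given
    hyperparameters and evaluate it on held-out data.
    """
    # Dummy surrogate so the sketch runs end-to-end; replace with real training.
    return 0.9 - abs(np.log10(lr) + 2) * 0.01 - abs(hidden - 128) * 1e-4 - abs(dropout - 0.5) * 0.05

def lstm_fitness(position):
    """Score one ISOA candidate: decode hyperparameters and return validation error."""
    lr = 10 ** np.clip(position[0], -4, -1)          # learning rate, 1e-4 .. 1e-1
    hidden = int(np.clip(position[1], 32, 256))      # LSTM hidden units
    dropout = float(np.clip(position[2], 0.1, 0.6))  # dropout rate
    val_acc = build_and_train_lstm(lr, hidden, dropout)
    return 1.0 - val_acc                             # the optimizer minimises the error

# Example: evaluate one candidate position proposed by the optimizer.
print(lstm_fitness(np.array([-2.0, 128.0, 0.5])))
```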
Performance validation
In this section, the analysis of the EGRVI-DLISOA approach is examined under the hand gestures dataset32,33. The method runs on Python 3.6.5 with an Intel Core i5-8600K CPU, 4 GB GPU, 16 GB RAM, 250 GB SSD, and 1 TB HDD, using a 0.01 learning rate, ReLU activation, 50 epochs, 0.5 dropout, and a batch size of 5. The dataset contains 4000 instances across four class labels, as portrayed in Table 2. Figure 3 demonstrates the sample images.
Figure 4 represents the classifier results of the EGRVI-DLISOA approach. Figure 4a, b displays the confusion matrices with precise classification and identification of every class under 70%TRPH and 30%TSPH. Figure 4c shows the analysis of PR, which represents the highest performance across all classes. Finally, Fig. 4d illustrates the analysis of the ROC curve, indicating proficient outcomes with higher values of the ROC for different class labels.
Table 3 and Figs. 5 and 6 illustrate the GR outcome of the EGRVI-DLISOA model under 70%TRPH and 30%TSPH. The outcomes denote that the EGRVI-DLISOA approach correctly recognized the samples. With 70%TRPH, the EGRVI-DLISOA approach provides an average \(accu_{y}\), \(prec_{n}\), \(reca_{l}\), \(F1_{score}\) and \(MCC\) of 98.52%, 97.03%, 97.05%, 97.04%, and 96.05%, respectively. Moreover, with 30%TSPH, the EGRVI-DLISOA approach presents an average \(accu_{y}\), \(prec_{n}\), \(reca_{l}\), \(F1_{score}\) and MCC of 98.62%, 97.22%, 97.21%, 97.21%, and 96.30%, respectively.
In Fig. 7, the training (TRA) \(accu_{y}\) and validation (VAL) \(accu_{y}\) curves of the EGRVI-DLISOA approach are presented. The \(accu_{y}\) values are computed over a range of 0–25 epochs. The outcome highlights that the VAL and TRA \(accu_{y}\) values show a rising tendency, which indicates the capability of the EGRVI-DLISOA approach to provide a better solution over many iterations. Moreover, the TRA and VAL \(accu_{y}\) remain close over the epochs, indicating minimal overfitting and revealing a more robust solution of the EGRVI-DLISOA model, which guarantees reliable predictions on unseen instances.
In Fig. 8, the TRA loss (TRALOS) and VAL loss (VALLOS) curves of the EGRVI-DLISOA model are shown. The values of loss are computed in the range of 0–25 epochs. The TRALOS and VALLOS outcomes exhibit a declining tendency, indicating the proficiency of the EGRVI-DLISOA technique in balancing the trade-off between data fitting and generalization. The constant decrease in loss values also ensures the improved performance of the EGRVI-DLISOA technique and fine-tunes the prediction outcomes over time.
Table 4 and Fig. 9 inspect the comparison results of the EGRVI-DLISOA approach with the existing techniques19,20,33. The values in the table indicate that the EGRVI-DLISOA method has effective performance. The results underscored that the InceptionResNet-V2, ARNet, VAE, SqueezeNet, ResNet-50, VGG-16, InceptionV3, MobileNet v2, Efficient-B1, and Xception methodologies reported worse performance. The DenseNet121 model provides somewhat closer outcomes with an \(accu_{y}\) of 97.56%, \(prec_{n}\) of 96.51%, \(reca_{l}\) of 96.03%, and \(F1_{score}\) of 96.27%. Besides, the EGRVI-DLISOA method illustrated upgraded performance with a maximum \(accu_{y}\) of 98.62%, \(prec_{n}\) of 97.22%, \(reca_{l}\) of 97.21%, and \(F1_{score}\) of 97.21%.
Table 5 and Fig. 10 specify the computational time (CT) analysis of the EGRVI-DLISOA technique compared to existing models. The EGRVI-DLISOA technique outperformed existing models with the lowest CT of 8.32 s. In contrast, MobileNet v2, ARNet, SqueezeNet, ResNet-50, and InceptionResNet-V2 models achieved CT times of 11.26 s, 13.52 s, 14.49 s, 16.80 s, and 17.78 s, respectively. More time-consuming models, such as VAE, Xception, Efficient-B1, DenseNet121, InceptionV3, and VGG-16, attained CTs of 17.00 s, 18.42 s, 19.60 s, 20.19 s, 23.67 s, and 24.31 s, respectively. These results demonstrate the high efficiency of the EGRVI-DLISOA model compared to existing techniques. Its low computation time makes it highly appropriate for time-sensitive applications.
The error analysis of various models reveals that the EGRVI-DLISOA model achieves the lowest error rates across all performance metrics in Table 6 and Fig. 11. The EGRVI-DLISOA model illustrated an \(accu_{y}\) error of 1.38%, \(prec_{n}\) error of 2.78%, \(reca_{l}\) error of 2.79%, and \(F1_{score}\) error of 2.79%. Compared to other methods, such as the VGG-16 model with an \(accu_{y}\) error of 9.67% and \(F1_{score}\) error of 24.79%, or ARNet with an \(accu_{y}\) error of 9.16% and \(F1_{score}\) error of 24.28%, the EGRVI-DLISOA model demonstrates a significant reduction in classification errors. These results highlight its superior reliability and \(prec_{n}\) in minimizing prediction inaccuracies across diverse scenarios.
Table 7 and Fig. 12 specify the ablation study of performance metrics of the EGRVI-DLISOA approach. The EGRVI-DLISOA approach attained the highest \(accu_{y}\) of 98.62%, \(prec_{n}\) of 97.22%, \(reca_{l}\) of 97.21%, and \(F1_{score}\) of 97.21%, outperforming all compared approaches. The LSTM model achieves an \(accu_{y}\) of 97.94%, \(prec_{n}\) of 96.45%, \(reca_{l}\) of 96.59%, and \(F1_{score}\) of 96.60%. ISOA achieved an \(accu_{y}\) of 97.14% and an \(F1_{score}\) of 95.85%, while the SqueezeNet method attained an \(accu_{y}\) of 96.39% and an \(F1_{score}\) of 95.08%. The baseline SF model had the lowest values, with an \(accu_{y}\) of 95.62% and an \(F1_{score}\) of 94.41%. These results confirm that the EGRVI-DLISOA model delivers improved classification performance with higher consistency and robustness.
Conclusion
In this study, a new EGRVI-DLISOA model is presented in an IoT environment. The EGRVI-DLISOA model is an advanced GR system powered by DL in an IoT environment, designed to provide real-time interpretation of gestures to assist the visually impaired. To accomplish this, the EGRVI-DLISOA technique contains four processes: noise removal, feature extraction, LSTM-based gesture detection, and ISOA-based parameter tuning. Initially, the EGRVI-DLISOA technique uses the SF technique for the noise elimination process. For feature extraction, the SqueezeNet model is utilized for its efficiency in capturing meaningful features from complex visual data. For an accurate GR process, the LSTM model is employed. To fine-tune the parameter values of the LSTM classifier, the ISOA is applied. The experimentation of the EGRVI-DLISOA technique is investigated under the hand gestures dataset. The comparison study of the EGRVI-DLISOA technique revealed a superior accuracy value of 98.62% compared to existing models. The limitations of the EGRVI-DLISOA technique comprise restricted analysis across diverse real-world environments. The generalization ability across various lighting conditions and gesture discrepancies may also be affected. The presented model focuses on static gesture recognition, which limits its efficiency in recognizing continuous or dynamic gestures. Integration with wearable IoT devices remains limited, which affects portability and the feasibility of real-time applications. The model also relies on labelled datasets and shows restrictions in adaptability to unseen gestures without retraining. Computational overhead may pose challenges for deployment on low-power edge devices. Future enhancements include expanding dataset diversity, incorporating dynamic gesture support, and optimizing the model for real-time processing in low-resource IoT settings.
Data availability
The data that support the findings of this study are openly available at https://www.dlsi.ua.es/~jgallego/datasets/gestures/.
References
Rahman, M. M., Islam, M. M., Ahmmed, S. & Khan, S. A. Obstacle and fall detection to guide the visually impaired people with real time monitoring. SN Comput. Sci. 1(4), 219 (2020).
Sadi, M. S. et al. Finger-gesture controlled wheelchair with enabling IoT. Sensors 22(22), 8716 (2022).
de Souza, L. S., Francisco, R., da Rosa Tavares, J. E. & Barbosa, J. L. V. Intelligent Environments and Assistive Technologies for Assisting Visually Impairment People: A Systematic Literature Review (Springer, Cham, 2023).
El-Wahed Khalifa, H. A. et al. Enhancing neutrosophic fuzzy compromise approach for solving stochastic bi-level linear programming problems with right-hand sides of constraints follow normal distribution. Int. J. Neutrosophic Sci. (IJNS) 23(1), 287 (2024).
Mueen, A., Awedh, M. & Zafar, B. Multi-obstacle aware smart navigation system for visually impaired people in fog connected IoT-cloud environment. Health Inform. J. 28(3), 14604582221112608 (2022).
Subhashini, S. & Revathi, S. Static and dynamic hand gesture recognition system with deep convolutional levy flight whale optimization. Multimed. Tools Appl. 83(1), 1559–1588 (2024).
Dhilip Karthik, M., Kareem, R. M., Nisha, V. M. & Sajidha, S. A. Smart walking stick for visually impaired people. In Privacy Preservation of Genomic and Medical Data 361–381 (Wiley, New York, 2023).
Punsara, K. K. T., Premachandra, H. H. R. C., Chanaka, A. W. A. D., Wijayawickrama, R. V. & Nimsiri, A. IoT based sign language recognition system. In 2020 2nd International Conference on Advancements in Computing (ICAC) 1, 162–167 (IEEE, 2020).
Mujahid, A. et al. Real-time hand gesture recognition based on deep learning YOLOv3 model. Appl. Sci. 11(9), 4164 (2021).
Hamed, K. Artificial intelligence based automated sign gesture recognition solutions for visually challenged people. J. Intell. Syst. Internet Things 14(2), 127–139 (2025).
Zainuddin, A. A., Mohd Dhuzuki, N. H., Puzi, A. A., Johar, M. N. & Yazid, M. Calibrating hand gesture recognition for stroke rehabilitation internet-of-things (RIOT) using MediaPipe in smart healthcare systems. Int. J. Adv. Comput. Sci. Appl. 15(7), 568 (2024).
Mondal, A. & Roy, S. Design and development of dynamic gesture recognition system based on deep neural network for driver assistive devices. Int. J. Syst. Syst. Eng. 13(3), 271–283 (2023).
Ding, X. et al. Robust gesture recognition method toward intelligent environment using Wi-Fi signals. Measurement 231, 114525 (2024).
Fan, L. et al. Smart-data-glove-based gesture recognition for amphibious communication. Micromachines 14(11), 2050 (2023).
Sruthi, P., Satapathy, S. & Udgata, S. K. HandFi: WiFi sensing based hand gesture recognition using channel state information. Proc. Comput. Sci. 235, 426–435 (2024).
Rajalakshmi, E. et al. Multi-semantic discriminative feature learning for sign gesture recognition using hybrid deep neural architecture. IEEE Access 11, 2226–2238 (2023).
Mallik, B., Rahim, M. A., Miah, A. S. M., Yun, K. S. & Shin, J. Virtual keyboard: A real-time hand gesture recognition-based character input system using LSTM and mediapipe holistic. Comput. Syst. Sci. Eng. 48(2), 555–570 (2024).
Hao, Z., Sun, Z., Li, F., Wang, R. & Peng, J. Millimeter wave gesture recognition using multi-feature fusion models in complex scenes. Sci. Rep. 14(1), 13758 (2024).
Dawood, H. et al. ARNet: Integrating spatial and temporal deep learning for robust action recognition in videos. Comput. Model. Eng. Sci. (CMES) 144(1), 429 (2025).
Seifi, S., Sukianto, T., Carbonelli, C., Servadei, L. & Wille, R. Complying with the EU AI act: Innovations in explainable and user-centric hand gesture recognition. Mach. Learn. Appl. 20, 100655 (2025).
Lamsellak, O., Hdid, J., Benlghazi, A., Chetouani, A. & Benali, A. Multi-approach learning with embedded sensors application in gesture recognition. Int. J. Interact. Mobile Technol. 18(24), 51 (2024).
Zhou, H., Le, H. T., Zhang, S., Phung, S. L. & Alici, G. Hand gesture recognition from surface electromyography signals with graph convolutional network and attention mechanisms. IEEE Sens. J. 25, 9081 (2025).
Rezaee, K., Khavari, S. F., Ansari, M., Zare, F. & Roknabadi, M. H. A. Hand gestures classification of sEMG signals based on BiLSTM-metaheuristic optimization and hybrid U-Net-MobileNetV2 encoder architecture. Sci. Rep. 14(1), 31257 (2024).
Qu, X., Yu, S. & Tan, X. Implementation of gesture recognition technology optimized by neural networks in OpenMV. Int. J. Inf. Commun. Technol. 26(5), 1–21 (2025).
Chopparapu, S., Chopparapu, G. & Vasagiri, D. Enhancing visual perception in real-time: A deep reinforcement learning approach to image quality improvement. Eng. Technol. Appl. Sci. Res. 14(3), 14725–14731 (2024).
Zhou, L., Liu, X., Guan, X. & Cheng, Y. CSSA-YOLO: Cross-scale spatiotemporal attention network for fine-grained behavior recognition in classroom environments. Sensors 25(10), 3132 (2025).
Su, Y., Zhang, F., Jin, B. & Zhang, D. Manipulation of acoustic focusing for multi-target sensing with distributed microphones in smart car cabin. Proc. ACM Interact. Mobile Wearable Ubiquitous Technol. 9(2), 1–28 (2025).
Pchelkina, O. & Luzhnov, P. An algorithm for automatic image segmentation using the Sobel method for an optical coherence tomography. In 2024 6th International Youth Conference on Radio Electronics, Electrical and Power Engineering (REEPE) 1–6 (IEEE, 2024).
Iyer, H. & Jeong, H. PE-USGC: Posture estimation-based unsupervised spatial Gaussian clustering for supervised classification of near-duplicate human motion. IEEE Access https://doi.org/10.1109/ACCESS.2024.3491655 (2024).
Al-Husseini, H., Hosseini, M. M., Yousofi, A. & Alazzawi, M. A. Whale optimization algorithm-enhanced long short-term memory classifier with novel wrapped feature selection for intrusion detection. J. Sens. Actuator Netw. 13(6), 73 (2024).
Zhao, M. et al. Capacity optimization of wind–solar–storage multi-power microgrid based on two-layer model and an improved snake optimization algorithm. Electronics 13(21), 4315 (2024).
Alashhab, S., Gallego, A. J. & Lozano, M. Á. Efficient gesture recognition for the assistance of visually impaired people using multi-head neural networks. Eng. Appl. Artif. Intell. 114, 105188 (2022).
Acknowledgements
The authors extend their appreciation to the King Salman Center for Disability Research for funding this work through Research Group no. KSRG-2024-229.
Funding
King Salman Center for Disability Research (KSRG-2024-229).
Author information
Authors and Affiliations
Contributions
Hanan Abdullah Mengash: Conceptualization, methodology, validation, investigation, writing—original draft preparation, funding. Basma S. Alqadi: Conceptualization, methodology, writing—original draft preparation, writing—review and editing. Radwa Marzouk: Software, validation, data curation, writing—review and editing.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Mengash, H.A., Alqadi, B.S. & Marzouk, R. Enhancing gesture recognition for assisting visually impaired persons using deep learning in an IoT environment-based improved snake optimisation algorithm. Sci Rep 15, 38149 (2025). https://doi.org/10.1038/s41598-025-22070-7