Introduction

Deafness is a disability that impairs hearing, while muteness is a disability that impairs speech1. Both affect only hearing and speaking; individuals with these disabilities can still do many other things. The only barrier separating them from other people is interaction. If there were a way for hearing people and deaf-mute individuals to interact, deaf-mute individuals could live much like everyone else2, and the primary means of such interaction is sign language (SL). SL is the main communication method between hearing-impaired individuals and other people. Its features comprise manual components, namely hand and body motions, and non-manual components, namely facial expressions. These characteristics combine to form utterances that convey the meaning of words or sentences3. SL technology covers a wide-ranging spectrum, from capturing signs to rendering them realistically, to support interaction among hearing-impaired individuals and between hearing and hearing-impaired individuals. More particularly, SL capture involves precisely extracting mouth, body, and hand expressions using suitable sensing devices in marker-based or marker-less setups4.

Accurate SL capture is presently limited by sensor discrimination and resolution, and by the fact that occlusions and fast hand movements pose substantial challenges to capturing signs precisely. SLR involves extending prevailing ML models to robustly categorize human articulations into continuous sentences or isolated signs5. Existing challenges in SLR include the absence of massive annotated databases, which significantly affects the precision and generalization capability of SLR approaches, and the difficulty of identifying sign boundaries in continuous SLR. An automated SLR method can identify sign gestures, which are typically performed with hand movements supported by body postures and facial expressions6. SLR covers the entire process of identifying and tracking the signs performed and converting them into semantically meaningful words and expressions7. SLR methods utilizing the IoT and Deep Learning (DL) have been advanced8. DL is a set of models and techniques that builds higher-level abstractions through structures formed by several non-linear transformations. DL models use large amounts of data to extract features automatically, aiming to emulate the human brain's capability to observe, analyze, learn, and make inferences, particularly for enormous, diverse complexities9. DL structures capture relationships beyond immediate neighbours in the data and learn patterns, deriving representations directly from data without human intervention10. Various DL methods have recently been presented for hand gesture detection, including a deep convolutional neural network (DCNN) for static SLR.

This study presents a Smart Assistive Communication System for the Hearing-Impaired using Sign Language Recognition with Hybrid Deep Learning (SACHI-SLRHDL) methodology in IoT. The SACHI-SLRHDL technique aims to assist people with hearing impairments by creating an intelligent solution. At the primary stage, the SACHI-SLRHDL technique utilizes bilateral filtering (BF) for image pre-processing to increase the quality of the captured images by reducing noise while preserving edges. Furthermore, the improved MobileNetV3 model is employed for the feature extraction process. Moreover, the convolutional neural network with a bidirectional gated recurrent unit and attention (CNN-BiGRU-A) classifier is implemented for the SLR process. Finally, the attraction-repulsion optimization algorithm (AROA) adjusts the hyperparameter values of the CNN-BiGRU-A method optimally, resulting in improved classification performance. To exhibit the superior performance of the SACHI-SLRHDL method, a comprehensive experimental analysis is performed on an Indian SL dataset. The key contributions of the SACHI-SLRHDL method are listed below.

  • The SACHI-SLRHDL model utilizes BF for pre-processing to improve image quality by effectively reducing noise while preserving edges. This enhances the clarity of input images, which is significant for accurate SLR. By improving image features, the model ensures better performance in the subsequent stages of the recognition process.

  • The SACHI-SLRHDL approach uses an improved MobileNetV3 architecture for feature extraction, optimizing performance and computational efficiency. This adaptation enables faster processing without compromising the accuracy of extracted features. Mitigating complexity makes the model more appropriate for real-time applications in SLR.

  • The SACHI-SLRHDL methodology introduces a novel CNN-BiGRU-A framework that integrates CNN for spatial feature extraction and BiGRU for processing sequential data. Attention mechanisms (AMs) are integrated to prioritize critical features, improving the method’s capability to recognize intrinsic SL gestures. This approach improves both the accuracy and interpretability of the recognition system.

  • The SACHI-SLRHDL approach employs the AROA model for fine-tuning, improving the training process by effectively optimizing the model’s parameters. This method enhances overall performance by balancing exploration and exploitation during optimization. The role of AROA is crucial in achieving greater efficiency and accuracy in the recognition task.

  • The key novelty of the SACHI-SLRHDL approach is the integration of CNN, BiGRU, AMs, and the AROA model to improve SLR. This hybrid model enables effective spatial and temporal feature extraction while the AM assists in prioritizing critical features. Additionally, AROA optimizes the model’s performance, ensuring higher accuracy and efficiency in real-time applications.

Literature review

Akhila Thejaswi et al.11 investigate a combination of SL Recognition (SLR) and SL Translation (SLT) methods to guarantee accurate real-world sign gesture recognition. A CNN is utilized in the presented work as a DL method to train on a large database of hand gestures, and image analysis is performed using the MediaPipe library for detection and landmark estimation. The study is organized around a vast dataset analysis. In12, an IoT-based device that can fit on a ring finger is presented. This device can learn and translate English and Arabic braille into audio using DL methods improved with transfer learning (TL). Detection of the captured braille image is achieved through a TL-based CNN. Shwany et al.13 propose a real-world approach to detect threatening signs made by criminals during interrogation. The presented approach installs a suitable camera in front of the offender, records hand gestures in a particular region of the hand, applies image processing methods such as contrast enhancement to the input image to aid detection, and then categorizes the image using a CNN tailored to that region alongside AlexNet. Lakshmi et al.14 intend to create a real-world, video-based interactive SLT education method. The Flask framework and pre-trained classification models are parts of the method to enhance communication and understanding of SL. The presented method uses Python libraries and TensorFlow for image input prediction, which can efficiently resolve several interaction complexities, and represents an engaging and innovative way to teach and boost communication skills in deaf and mute children. Akdag and Baykan15 present a novel method for isolated SL word recognition utilizing an innovative DL method that combines the strengths of both spatio-temporally separated (R(2 + 1)D) and residual three-dimensional (R3D) convolutional blocks. The R3(2 + 1)D-SLR method can capture the complicated temporal and spatial features vital for precise sign detection.

In16, an ensemble meta-learning method is projected. The work trains and tests the deep ensemble meta-learning method utilizing dual synthetically created assistive service databases. The DL method utilizes several ensemble input learners whose outputs are shared with a meta-classification system, representing individual assistive services. This method attains substantially greater outcomes than classical ML methods and simpler feed-forward neural network methods without the ensemble. Parveen et al.17 present a gadget that can convert a deaf individual's hand gestures into voice and text. The video recorded by the camera is processed with the open-source CV toolkit OpenCV. The video then undergoes image processing methods, including histogram of oriented gradients (HOG) and CNN. Afterwards, a Raspberry Pi 4 examines the recorded motions and compares the outcomes with a database. Faisal et al.18 projected three components: (i) a sign recognition module (SRM) that identifies the signs of deaf people, (ii) a speech recognition and synthesis module (SRSM) that processes the speech of non-deaf people and converts it into text, and (iii) an avatar module (AM) to perform and create the equivalent sign of the non-deaf speech; these are incorporated into a sign translation companion method named SDCS to assist interaction between deaf and hearing people and conversely. Li et al.19 present an updated Archive File Integrity Check Method (AFICM) using a hybrid DL model integrating Bidirectional Long Short-Term Memory (Bi-LSTM) with adaptive gating and Temporal Convolutional Neural Networks (TCNN). Ghadi et al.20 explore ML methods used to address security issues in wireless sensor networks while considering their functionality, adaptation challenges, and open problems in the field. Zholshiyeva et al.21 explore automating SLT by integrating ML and DL techniques for Kazakh SL (QazSL) recognition. Five algorithms are employed, trained on a dataset of over 4,400 images.

Ghadi et al.22 explore the challenges and safety issues of integrating federated learning (FL) with IoT, focusing on its applications in smart businesses, cities, transportation, and healthcare while addressing encrypted data transmission requirements. Thakur, Dangi, and Lalwani23 introduce two Hybrid Learning Algorithms (HLA) combining CNN and recurrent neural network (RNN) to capture spatial and sequential patterns, improved by optimization techniques using Whale Optimization and Grey Wolf Optimizer for feature selection. Mazhar et al.24 explore the motivation for IoT device installation in smart buildings and grids, focusing on incorporating artificial intelligence (AI), IoT, and smart grids to improve energy efficiency, security, and comfort while examining ML methods for forecasting energy demand. John and Deshpande25 propose a hybrid deep RNN with Chaos Game Optimization (CGO) for effectual hand gesture recognition, aiming to classify alphabet signs from 2D gesture images through pre-processing, feature extraction, selection, and classification stages. Renjith, Manazhy, and Suresh26 present a hybrid model for Indian SL (ISL) recognition integrating CNNs for spatial feature extraction and RNNs for capturing temporal relationships, aiming to improve the capability of the method to identify complex sign gestures from a dataset of 36 ISL sign classes. Paul et al.27 introduce a novel Human Motion Recognition (HMR) method for medical-related human activities (MRHA) detection, integrating EfficientNet for spatial feature extraction and ConvLSTM for spatio-temporal pattern recognition, followed by a classification module for final predictions. Palanisamy et al.28 present a hybrid approach incorporating DL and graph theory for SLR, illustrating crucial enhancements in accuracy and computational efficiency, making it a competitive solution for enhancing communication for the hearing impaired.

The primary limitation in current research is the lack of sufficiently large and diverse datasets for SLR, which limits the generalizability and accuracy of models across diverse SLs and real-world scenarios. Many existing methods also face difficulty capturing both spatial and temporal features of sign gestures, resulting in reduced recognition accuracy. Additionally, challenges related to integrating DL and ML methods with IoT, FL, and privacy-preserving techniques must be addressed. Moreover, research on real-time SLT and interaction systems is still in its early stages, with high accuracy and low-latency performance being difficult to attain consistently. Finally, the diversity in regional SL gestures and the dynamic nature of signs pose further hurdles in creating universally applicable systems.

The article is structured as follows: Sect. 2 presents the literature review, Sect. 3 outlines the proposed method, Sect. 4 details the results evaluation, and Sect. 5 concludes the study.

Proposed methodology

This study presents a SACHI-SLRHDL methodology in IoT. The SACHI-SLRHDL technique aims to develop an effective SLR technique that assists people with hearing impairments by creating an intelligent solution. It comprises four distinct processes: image pre-processing, improved MobileNetV3-based feature extraction, hybrid DL classification, and AROA-based parameter tuning. Figure 1 depicts the entire flow of the SACHI-SLRHDL methodology.

Fig. 1
figure 1

Overall flow of the SACHI-SLRHDL model.

Image pre-processing: BF model

Initially, the SACHI-SLRHDL approach utilizes BF for image pre-processing to improve the quality of the captured images by decreasing noise while preserving edges29. This model was chosen for image pre-processing due to its superior capability to preserve edges while reducing noise, which is significant for maintaining the integrity of SL gestures. Unlike conventional smoothing techniques, BF effectively smooths out noise without blurring crucial details, ensuring that key features of the SL images remain intact. Furthermore, BF works well in scenarios with varying lighting conditions and complex backgrounds, which are common in real-world applications. This makes it ideal for pre-processing SL images that may suffer from such challenges. Moreover, the BF model is computationally efficient, allowing it to be implemented in real-time systems, which is significant for SLR tasks. By improving image quality without compromising key spatial details, BF assists in enhancing the overall performance of subsequent DL methods in SLR. Figure 2 specifies the BF architecture.

Fig. 2
figure 2

Structure of BF model.

BF is an effective image pre-processing model that enhances the quality of images used by SLR techniques. It aids in reducing noise while preserving edges, which is vital for precisely capturing the hand gestures in SL. In IoT-based SLR, BF ensures that the images captured by IoT devices, such as sensors or cameras, are clear and free from distortion. This pre-processing stage considerably improves the accuracy of feature extraction by retaining significant spatial details in the images. By eliminating irrelevant noise, BF permits the detection method to concentrate on meaningful gestures, safeguarding superior performance in real-time SLT. Therefore, it contributes to the efficacy of IoT-enabled models in helping persons with hearing loss.
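As a concrete illustration, the pre-processing stage can be sketched with OpenCV's `cv2.bilateralFilter`; the kernel diameter, sigma values, and target resolution below are illustrative assumptions rather than settings prescribed by this work:

```python
import cv2

def preprocess_sign_image(path, size=(224, 224)):
    """Denoise a captured sign image while preserving gesture edges."""
    image = cv2.imread(path)            # frame captured by an IoT camera
    image = cv2.resize(image, size)     # match the network input resolution
    # d: pixel neighbourhood diameter; sigmaColor: range (intensity) smoothing;
    # sigmaSpace: spatial smoothing. Larger sigmas smooth more aggressively.
    return cv2.bilateralFilter(image, d=9, sigmaColor=75, sigmaSpace=75)
```

Unlike a plain Gaussian blur, the range term keeps pixels on opposite sides of an intensity edge from being averaged together, which is why hand contours survive the smoothing.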

Feature extraction: improved MobileNetV3

Next, the improved MobileNetV3 model extracts relevant features from input images30. This technique was chosen due to its capability to balance high performance with low computational cost, making it appropriate for real-time applications like SLR. Unlike larger networks that require extensive computational resources, MobileNetV3 presents effectual processing without sacrificing accuracy, which is significant for deployment on resource-constrained devices such as IoT systems. Its optimized architecture uses depthwise separable convolutions, which mitigate the number of parameters and computational complexity, making it faster and more efficient. Furthermore, MobileNetV3 performs exceptionally well in extracting discriminative features from images, which is crucial for accurately recognizing SL gestures. By implementing the improved MobileNetV3, the model attains high recognition accuracy while maintaining efficiency, even under varying conditions. This makes it a robust choice compared to conventional, heavier CNN architectures. Figure 3 illustrates the MobileNetV3 model.

Fig. 3
figure 3

MobileNetV3 architecture.

This work selected MobileNetV3 from the MobileNet series. The MobileNetV3 method retains its lightweight character while continuing to use the depthwise separable convolution and inverted residual module from MobileNetV2. It improves the bottleneck architecture by incorporating Squeeze-and-Excitation (SE) units, reinforcing important characteristics and suppressing unimportant ones. Furthermore, the novel hard-swish activation function has been adopted to enhance the architecture. The MobileNetV3 approach is available in Large and Small versions according to the availability of resources, and this work utilizes the MobileNetV3-Large approach as a base.
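As a minimal sketch, the MobileNetV3-Large trunk can be loaded from torchvision and used as the feature extractor; note that the CA-based modification described below is not part of the stock torchvision model and would have to be patched in separately:

```python
import torch
from torchvision import models

# Pretrained MobileNetV3-Large backbone (assumes torchvision >= 0.13).
backbone = models.mobilenet_v3_large(
    weights=models.MobileNet_V3_Large_Weights.DEFAULT)
feature_extractor = backbone.features      # convolutional trunk, classifier dropped

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)        # one pre-processed sign image
    feats = feature_extractor(x)           # -> (1, 960, 7, 7) feature maps
```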

Although integrating the SE modules into the bottleneck architecture of MobileNetV3-Large has improved the model's performance, the SE modules select information among channels to determine the significance of each channel while neglecting the important positional information within the visual field. Therefore, this method can only capture local feature information, resulting in problems such as scattered regions of interest and limited performance. To deal with these restrictions, the ECA unit improves upon the SE module by avoiding dimensionality reduction and capturing cross-channel interaction information more effectively.

Although the ECA unit improves on the SE module, it still only selects information among channels. In this paper, the SE module in the MobileNetV3 architecture is substituted with the coordinate attention (CA) module to enhance MobileNetV3. The complete framework of the enhanced MobileNetV3-CA method is presented. The CA module can concentrate attention on the region of interest through efficient positional encoding in pixel coordinates, thus gaining information that considers both position and channel, decreasing attention to interfering data and enhancing the feature representation capability of the technique. For a given feature map \(X\) with width \(W\), channel count \(C\), and height \(H\), the CA module first pools the input \(X\) along the two spatial directions, width and height, to obtain a feature map for each direction. It then concatenates the feature maps from these two directions along the spatial dimension and reduces the channel dimension to \(C/r\) using a \(1\times1\) convolutional transformation. It then applies batch normalization and Swish activation to obtain the intermediate feature map containing information from both directions, as the equation below shows.

$$f=\delta\left(F_{1}\left(\left[\frac{1}{W}\sum_{0\le j\le W}x_{c}\left(h,j\right),\ \frac{1}{H}\sum_{0\le i\le H}x_{c}\left(i,w\right)\right]\right)\right)$$
(1)

Here, \(f\) denotes the intermediate feature map obtained by encoding spatial information in the two directions, \(\delta\) represents the Swish activation function, and \(F_{1}\) refers to the \(1\times1\) convolution transformation. \(x_{c}\) denotes the feature value at a particular location of the feature map in channel \(c\), \(h\) denotes a specific height of the feature map, and \(j\) indexes the feature map width with values in \([0, W]\). Likewise, \(w\) denotes a specific width of the feature map, and \(i\) indexes the feature map height with values in \([0, H]\). \(f\) is then split into two tensors, \(f^{h}\) and \(f^{w}\), along the spatial dimension. Through two \(1\times1\) convolution transformations, \(f^{h}\) and \(f^{w}\) are transformed into tensors with the same channel count as the input \(X\). Lastly, the resulting attention weights are multiplied with \(X\) to obtain the CA module output, as the equation below shows.

$$y_{c}=x_{c}\left(i,j\right)\cdot\sigma\left[F_{h}\left(f^{h}\right)\right]\cdot\sigma\left[F_{w}\left(f^{w}\right)\right]$$
(2)

Here, \(y_{c}\) denotes the output of the \(c\)-th channel, \(\sigma\) denotes the sigmoid activation function, and \(F_{h}\) and \(F_{w}\) represent the convolution transformation functions along height and width.
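A minimal PyTorch sketch of a CA block following Eqs. (1)-(2) is given below; the reduction ratio and the use of `nn.SiLU` for the Swish activation \(\delta\) are illustrative choices, not values fixed by this work:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention block in the spirit of Eqs. (1)-(2)."""

    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)                    # C/r channels
        self.conv1 = nn.Conv2d(channels, mid, 1)               # F1: 1x1 transform
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.SiLU()                                   # Swish (delta)
        self.conv_h = nn.Conv2d(mid, channels, 1)              # F_h
        self.conv_w = nn.Conv2d(mid, channels, 1)              # F_w

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                      # pool over width: (n,c,h,1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # pool over height: (n,c,w,1)
        f = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))  # Eq. (1)
        f_h, f_w = torch.split(f, [h, w], dim=2)               # split per direction
        f_w = f_w.permute(0, 1, 3, 2)
        # Eq. (2): per-direction sigmoid weights re-scale the input features.
        return x * torch.sigmoid(self.conv_h(f_h)) * torch.sigmoid(self.conv_w(f_w))
```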

Classification process: hybrid DL models

For the SLR process, the hybrid CNN-BiGRU-A classifier is employed31. This hybrid model was chosen for its ability to handle spatial and temporal information effectively, which is significant for accurate SLR. The CNN excels at extracting spatial features from images, while the BiGRU captures sequential dependencies, making it ideal for comprehending the temporal aspect of sign gestures. Adding AMs allows the model to concentrate on the most crucial features in a sequence, improving recognition accuracy by mitigating noise and irrelevant data. This integration enables the model to process dynamic, real-world SL data more effectively than conventional methods that may only focus on one aspect (spatial or temporal) at a time. Furthermore, this hybrid methodology ensures that the model can handle the complexity and variability of SL gestures, giving superior performance to simpler architectures. The integration of these techniques presents a robust solution to the challenges in SLR, particularly for continuous and dynamic gestures. Figure 4 portrays the structure of the CNN-BiGRU-A model.

Fig. 4
figure 4

Structure of CNN-BiGRU-A method.

The CNN-BiGRU-A method comprises three core elements. Initially, the CNN is applied to extract local temporal features from the time-series data, assisting the process in recognizing short-term patterns within the data across various monitoring scores. The BiGRU handles longer-range dependencies in the time series, permitting the method to consider past and upcoming trends, which improves overall prediction precision. Finally, the AM concentrates on the most significant time intervals, allocating higher weight to important moments of change and improving the performance of the model by prioritizing primary data. This combination allows the method to capture complex temporal patterns successfully and make precise predictions. For instance, in a mining region with complex subsidence behaviour, the CNN identifies fast, localized variations at different monitoring points, the BiGRU tracks longer-range tendencies by combining historical and present data to identify gradually growing subsidence patterns, and the AM emphasizes moments of abrupt change, directing the model's attention to crucial transitions such as sudden growth in the subsidence rate. Together, these modules ensure timely and precise predictions, making the method useful in dynamic environments.

The CNN module contains various layers that work together to extract essential patterns from the input data. The convolution layer recognizes particular attributes within the data by computing weighted sums, whereas the pooling and activation layers introduce nonlinearity and reduce the data dimensions, effectively allowing the system to identify complex patterns. Normalization and fully connected (FC) layers refine the final predictions, with normalization improving training speed and model robustness. The major equations are as shown:

$$\left(I*K\right)_{ij}=\sum_{m}\sum_{n}I_{m+i,n+j}\cdot K_{mn}$$
(3)
$$P_{ij}=\max\left(I_{i-m,j-n}\right)$$
(4)
$$O=\sigma\left(W\cdot I+b\right)$$
(5)
$$\hat{x}=\frac{x-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}},\quad y=\gamma\hat{x}+\beta$$
(6)

In Eq. (3), \((I*K)_{ij}\) signifies the output feature map value at location \((i,j)\) after the convolution operation; \(I\) denotes the input data, \(K\) represents the convolution kernel, and \(i, j\) and \(m, n\) are position indices over the output feature map and the convolution kernel, respectively. In Eq. (4), \(P_{ij}\) is the output feature map value at location \((i,j)\) after the pooling operation, with \(m\) and \(n\) the position indices of the pooling window. In Eq. (5), \(O\) refers to the output, \(I\) the input features, \(W\) the weight matrix, \(b\) the bias, and \(\sigma\) the activation function. Equation (6) characterizes the normalization layer, where \(x\) is the input data, \(\hat{x}\) the standardized input data, and \(\sigma_{B}^{2}\) and \(\mu_{B}\) the variance and mean of the current minibatch, respectively; \(\epsilon\) is a constant for numerical stability, whereas \(\gamma\) and \(\beta\) are learnable parameters.
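The layer stack of Eqs. (3)-(6) corresponds to a standard convolutional block; a minimal PyTorch sketch with illustrative channel sizes:

```python
import torch.nn as nn

# Convolution (Eq. 3), batch normalization (Eq. 6), activation (Eq. 5),
# and max-pooling (Eq. 4) composed into one feature-extraction block.
cnn_block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)
```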

The GRU module enhances prediction precision by controlling the flow of information through its update and reset gates, which determine which data to keep or discard at each step. This mechanism permits the method to effectively capture important patterns in the data from both recent inputs and previous observations. For instance, in predicting trends, the GRU uses previous and present data to recognize a steady pattern, improving the model's ability to predict upcoming values precisely. Using the hidden layer (HL) \(h_{t-1}\) from the preceding time step and the present input \(x_{t}\), the GRU can be expressed as shown:

$$r_{t}=\sigma\left(W_{r}\cdot\left[h_{t-1},\ x_{t}\right]+b_{r}\right)$$
(7)
$$z_{t}=\sigma\left(W_{z}\cdot\left[h_{t-1},\ x_{t}\right]+b_{z}\right)$$
(8)

Here, \(W_{r}\) and \(W_{z}\) are the weight matrices, \(b_{r}\) and \(b_{z}\) the bias vectors, and \(\sigma\) the sigmoid activation function. The reset gate \(r_{t}\) defines which data from the preceding HL \(h_{t-1}\) must be discarded, while the update gate \(z_{t}\) selects the mixing ratio of the new and old memories.

Then, the final output is obtained by computing the candidate HL \(\tilde{h}_{t}\) and the HL \(h_{t}\). The HL is then passed to the next layer or used as the final output.

$$\tilde{h}_{t}=\tanh\left(W\cdot\left[r_{t}\odot h_{t-1},\ x_{t}\right]+b\right)$$
(9)
$$h_{t}=\left(1-z_{t}\right)\odot h_{t-1}+z_{t}\odot\tilde{h}_{t}$$
(10)
(10)

Here, \(W\) denotes the weight matrix; \(\tilde{h}_{t}\) is estimated using \(x_{t}\) and \(r_{t}\) to obtain the candidate HL. Lastly, \(\tilde{h}_{t}\) and \(h_{t-1}\) are weighted by the update gate to obtain the final state, and \(\odot\) denotes the Hadamard (element-wise) product.
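For clarity, one GRU time step following Eqs. (7)-(10) can be written out directly; the weight shapes are an assumption here (each matrix acts on the concatenation \([h_{t-1}, x_{t}]\)):

```python
import torch

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU update following Eqs. (7)-(10)."""
    hx = torch.cat([h_prev, x_t], dim=-1)
    r_t = torch.sigmoid(hx @ W_r.T + b_r)          # reset gate, Eq. (7)
    z_t = torch.sigmoid(hx @ W_z.T + b_z)          # update gate, Eq. (8)
    rhx = torch.cat([r_t * h_prev, x_t], dim=-1)
    h_tilde = torch.tanh(rhx @ W_h.T + b_h)        # candidate state, Eq. (9)
    return (1 - z_t) * h_prev + z_t * h_tilde      # new hidden state, Eq. (10)
```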

This bidirectional model allows the method to recognize patterns within the data more efficiently, which is helpful for tracking changes that evolve over time.

The AM enhances this ability by selectively targeting significant portions of the data. It allocates greater weight to crucial data at every step, selecting the main characteristics that might indicate essential variations. For instance, in a real-time setting, the BiGRU captures tendencies in past and upcoming contexts, whereas the AM highlights unexpected shifts or crucial points within the information, such as regions where rates rise quickly. This integration permits the method to adapt better to practical requirements in settings with wide-ranging behaviours. The corresponding equations are shown in (11)-(13):

$$\alpha_{ij}=\frac{\exp\left(\mathrm{score}\left(h_{i},\bar{h}_{j}\right)\right)}{\sum_{n,m}\exp\left(\mathrm{score}\left(h_{n},\bar{h}_{m}\right)\right)}$$
(11)
$$c_{i}=\sum_{j}\alpha_{ij}\bar{h}_{j}$$
(12)
$$\alpha_{i}=f\left(c_{i},h_{i}\right)=\tanh\left(W_{c}\cdot\left[c_{i},h_{i}\right]\right)$$
(13)
(13)

Here, \(\alpha_{ij}\) denotes the attention score computed between the encoder output at the \(j\)-th time step and the decoder state at the \(i\)-th time step, \(h\) signifies the HL at each time step, \(W_{c}\) symbolizes the weight matrix related to the input or HL, and \(\alpha_{i}\) signifies the final attention weighting obtained through the AM.

The prediction process includes three significant stages. Initially, the data are pre-processed in the CNN layers via convolution and pooling to produce feature-rich data. Next, these vectors are passed to a BiGRU layer that captures both short- and long-term patterns within the data while avoiding gradient problems. Lastly, the AM allocates weight to the main features, decreasing irrelevant data and enhancing model efficacy. This allows the method to concentrate on essential patterns in the data, leading to precise predictions.
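Pulling the three elements together, a compact sketch of one plausible CNN-BiGRU-A arrangement is shown below; treating the CNN feature map as a spatial sequence for the BiGRU, as well as all layer sizes, are assumptions made for illustration only:

```python
import torch
import torch.nn as nn

class CNNBiGRUAttention(nn.Module):
    """Sketch of a CNN-BiGRU-A classifier for 20 sign classes."""

    def __init__(self, num_classes=20, gru_hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                    # spatial features
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.bigru = nn.GRU(64, gru_hidden, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * gru_hidden, 1)      # attention scores
        self.fc = nn.Linear(2 * gru_hidden, num_classes)

    def forward(self, x):                            # x: (B, 3, H, W)
        f = self.cnn(x)                              # (B, 64, H/4, W/4)
        seq = f.flatten(2).transpose(1, 2)           # (B, L, 64) spatial sequence
        out, _ = self.bigru(seq)                     # (B, L, 2*hidden)
        alpha = torch.softmax(self.att(out), dim=1)  # weights, cf. Eq. (11)
        ctx = (alpha * out).sum(dim=1)               # context vector, cf. Eq. (12)
        return self.fc(ctx)                          # class logits
```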

Parameter optimizing process: AROA

Finally, the AROA optimally adjusts the hyperparameter values of the CNN-BiGRU-A approach, resulting in improved classification performance32. This method was chosen for parameter optimization due to its ability to balance exploration and exploitation effectively during the optimization process. Unlike conventional optimization methods that may get stuck in local minima, AROA uses attraction and repulsion mechanisms to explore the solution space more thoroughly and avoid suboptimal solutions. This is beneficial for DL models with complex parameter spaces. Furthermore, AROA's simplicity and efficiency make it a robust choice for optimizing resource-intensive models like DL networks without needing extensive computational power. The algorithm's flexibility in fine-tuning hyperparameters enhances model convergence and accuracy, improving overall performance. The AROA model's adaptability and capability to optimize parameters such as learning rates, batch sizes, and network architecture make it superior to more conventional optimization techniques like grid or random search. AROA is a practical and effectual solution for improving the model's performance in SLR tasks. Figure 5 demonstrates the structure of the AROA model.

Fig. 5
figure 5

Structure of AROA method.

This method imitates the natural phenomenon of attraction-repulsion. The initial phase in AROA is to initialize the values of the \(n\) individuals \(X\).

$$X_{i}=rand\odot\left(X_{up}-X_{low}\right)+X_{low}$$
(14)

In Eq. (14), \(X_{i}\) refers to the value of the \(i^{th}\) individual, and \(X_{low}\) and \(X_{up}\) denote the lower and upper limits of the search space, respectively. \(rand\) denotes a randomly generated vector.

Then, the fitness value of each \(X_{i}\) is calculated and the best individual is identified based on the problem under test. The next stage in AROA applies the theory of attraction and repulsion, which relies on the distance between individuals \(X\). Hence, the value of \(X\) can be updated by considering the fitness levels of neighbouring individuals. The distance between the \(i^{th}\) and \(j^{th}\) individuals is calculated as shown:

$$D=\left[\begin{array}{lllll}d_{1,1}&d_{1,2}&d_{1,3}&\dots&d_{1,n}\\ d_{2,1}&d_{2,2}&d_{2,3}&\dots&d_{2,n}\\ d_{3,1}&d_{3,2}&d_{3,3}&\dots&d_{3,n}\\ \dots&\dots&\dots&\dots&\dots\\ d_{n,1}&d_{n,2}&d_{n,3}&\dots&d_{n,n}\end{array}\right]$$
(15)
$$d^{2}\left(X_{i},X_{j}\right)=\sum_{k=1}^{dim}\left(x_{i}^{k}-x_{j}^{k}\right)^{2}$$
(16)
(16)

Here, \(X_{i}\) and \(X_{j}\) are the values of the \(i^{th}\) and \(j^{th}\) individuals, respectively, and \(dim\) denotes the dimension count of \(X_{i}\).
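Equations (15)-(16) amount to a pairwise distance matrix over the population; a short NumPy sketch with an illustrative population size:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5, 3))        # population: n=5 individuals, dim=3
diff = X[:, None, :] - X[None, :, :]          # (n, n, dim) pairwise differences
D = np.sqrt((diff ** 2).sum(axis=-1))         # distance matrix D of Eq. (15)
d_max = D.max(axis=1)                         # d_{i,max}: furthest member per row
```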

The following operation updates the attraction-repulsion operator \(n_{i}\) according to the distance from the \(i^{th}\) individual to the furthest member of \(X\), \(d_{i,\max}\), and \(d_{i,j}\in D\). This can be described as shown:

$$n_{i}=\frac{1}{n}\sum_{j=1}^{k}\left(X_{j}-X_{i}\right)\cdot\left(1-\frac{d_{i,j}}{d_{i,\max}}\right)\cdot s\left(f_{i},f_{j}\right)$$
(17)

Here, \(c\) stands for the step size, and \(s\) signifies the function that controls the direction of the change based on fitness values; the \(s\) value is updated as:

$$s\left(f_{i},f_{j}\right)=\left\{\begin{array}{ll}1&f_{i}>f_{j}\\ 0&f_{i}=f_{j}\\ -1&f_{i}<f_{j}\end{array}\right.$$
(18)

Additionally, the value \(k\) in Eq. (17) denotes the number of neighbours, which decreases as the iterations progress; it is updated as:

$$k=\left(1-\frac{t}{t_{\max}}\right)\cdot n+1$$
(19)

Here, \(t\) is the present iteration and \(t_{\max}\) is the maximal iteration count.

The next step is to utilize attraction to move towards the optimal solution. This process characterizes the exploration stage, as in other metaheuristic (MH) models, determining the promising area. The attraction operator \(b_{i}\) can be described as:

$$b_{i}=\left\{\begin{array}{ll}c\cdot m\cdot\left(X_{best}-X_{i}\right)&r_{1}\ge p_{1}\\ c\cdot m\cdot\left(a_{1}X_{best}-X_{i}\right)&r_{1}<p_{1}\end{array}\right.$$
(20)

Here, \(X_{best}\) signifies the optimal solution and \(a_{1}\) designates a randomly generated vector. The parameter \(r_{1}\in[0,1]\) is a randomly generated number, and \(p_{1}\) indicates a probability threshold. The parameter \(m\) is utilized to mimic the impact of the best solution and is necessary for controlling the balance between exploitation and exploration; it is defined as follows:

$$m=\frac{1}{2}\left(\frac{\exp\left(18\cdot\left(\frac{t}{t_{\max}}\right)-4\right)-1}{\exp\left(18\cdot\left(\frac{t}{t_{\max}}\right)-4\right)+1}+1\right)$$
(21)

Consequently, the exploration phase of AROA is employed to improve the probability of finding the optimal solution. This process can be described as shown:

$$X_{i}\left(t\right)=X_{i}\left(t-1\right)+n_{i}+b_{i}+r_{i}$$
(22)
$$r_{i}=\left\{\begin{array}{ll}\left\{\begin{array}{ll}r_{B}&r_{3}>0.5\cdot\frac{t}{t_{\max}}+0.25\\ r_{tri}&r_{3}\le 0.5\cdot\frac{t}{t_{\max}}+0.25\end{array}\right.&r_{2}<p_{2}\\ r_{R}&r_{2}\ge p_{2}\end{array}\right.$$
(23)
(23)

Here, \(r_{B}\) represents the operator that models Brownian motion by updating the standard deviation based on the search-space limits, and it can be described as:

$$r_{B}=u_{1}\odot N\left(0,\ fr_{1}\left(1-\frac{t}{t_{\max}}\right)\cdot\left(X_{up}-X_{low}\right)\right)$$
(24)

Here, \(u_{1}\) denotes a binary vector, \(N\) signifies a randomly generated vector drawn from a normal distribution, and \(fr_{1}\) symbolizes a constant value.

Besides, \(r_{tri}\) denotes the second operator, which relies on trigonometric functions and on an individual chosen using roulette-wheel selection. This can be outlined as shown:

$$r_{tri}=\left\{\begin{array}{ll}fr_{2}\cdot u_{2}\cdot\left(1-\frac{t}{t_{\max}}\right)\cdot\sin\left(2r_{5}\pi\right)\odot\left|a_{2}\odot X_{w}-X_{i}\right|&r_{4}<0.5\\ fr_{2}\cdot u_{2}\cdot\left(1-\frac{t}{t_{\max}}\right)\cdot\cos\left(2r_{5}\pi\right)\odot\left|a_{2}\odot X_{w}-X_{i}\right|&r_{4}\ge 0.5\end{array}\right.$$
(25)

Here, \(fr_{2}\) denotes a multiplier, \(u_{2}\) refers to a binary vector, and \(r_{4}\) and \(r_{5}\) are randomly generated numbers in \((0,1)\). \(a_{2}\) is a randomly generated vector with values in \((0,1)\), and \(X_{w}\) denotes a randomly chosen solution from \(X\).

In Eq. (23), \(r_{R}\) refers to the third operator, applied to improve the value of \(X_{i}\); it is defined as:

$$r_{R}=u_{3}\odot\left(2\cdot a_{3}-\mathbf{1}\right)\odot\left(X_{up}-X_{low}\right)$$
(26)

Here, \(u_{3}\) denotes the binary vector obtained using the threshold \(tr_{3}\) applied to every solution, \(a_{3}\) indicates a randomly generated vector, and \(\mathbf{1}\) stands for the unit (all-ones) vector.

Additionally, the eddy formation theory can be used to improve the solution, and this can be expressed as:

$$X_{i}=\left\{\begin{array}{ll}X_{i}+c_{f}\left(u_{4}\odot\left(a_{4}\left(X_{up}-X_{low}\right)+X_{low}\right)\right)&r_{6}<e_{f}\\ X_{i}+\left(e_{f}\cdot\left(1-r_{7}\right)+r_{7}\right)\left(X_{r8}-X_{r9}\right)&r_{6}\ge e_{f}\end{array}\right.$$
(27)

Here, \(r_{7}\) signifies a random number in \((0,1)\), and \(e_{f}\) signifies a probability cutoff. \(u_{4}\) indicates a binary vector obtained with the threshold \(1-e_{f}\), and \(a_{4}\) represents a vector containing arbitrary numbers. \(r_{8}\) and \(r_{9}\) are agent indexes randomly selected from \(X\), and \(c_{f}\) is a parameter updated as shown:

$$c_{f}=\left(1-\frac{t}{t_{\max}}\right)^{3}$$
(28)

Afterwards, memory is considered as the final influence applied to update the solutions. This is done by comparing the new value of each solution with its old value and preserving the better of the two, as expressed in Eq. (29).

$$X_{i}\left(t\right)=\left\{\begin{array}{ll}X_{i}\left(t\right)&f\left(X_{i}\left(t\right)\right)<f\left(X_{i}\left(t-1\right)\right)\\ X_{i}\left(t-1\right)&f\left(X_{i}\left(t\right)\right)\ge f\left(X_{i}\left(t-1\right)\right)\end{array}\right.$$
(29)
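The overall loop can be condensed into the following sketch. It keeps the initialization of Eq. (14), the shrinking neighbourhood of Eq. (19), an attraction-repulsion step after Eqs. (17)-(18), attraction to the best solution in the spirit of Eq. (20), and the greedy memory of Eq. (29); the stochastic operators \(r_B\), \(r_{tri}\), and \(r_R\) of Eq. (23) are folded into a single decaying Gaussian perturbation for brevity, so this is an indicative reduction rather than the full algorithm:

```python
import numpy as np

def aroa_minimize(fitness, dim, low, high, n=20, t_max=50, c=0.5, seed=0):
    """Condensed AROA sketch (minimization)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(n, dim))        # Eq. (14)
    f = np.array([fitness(x) for x in X])
    for t in range(1, t_max + 1):
        k = int((1 - t / t_max) * n) + 1             # neighbour count, Eq. (19)
        best = X[f.argmin()].copy()
        for i in range(n):
            d = np.linalg.norm(X - X[i], axis=1)
            nbr = np.argsort(d)[1:k + 1]             # k nearest neighbours
            s = np.sign(f[i] - f[nbr])[:, None]      # Eq. (18): attract fitter ones
            w = (1 - d[nbr] / (d.max() + 1e-12))[:, None]
            n_i = ((X[nbr] - X[i]) * w * s).mean(axis=0)       # cf. Eq. (17)
            b_i = c * (best - X[i])                  # attraction to best, cf. Eq. (20)
            r_i = 0.1 * rng.normal(0, 1 - t / t_max + 1e-12, dim) * (high - low)
            cand = np.clip(X[i] + n_i + b_i + r_i, low, high)  # Eq. (22)
            f_cand = fitness(cand)
            if f_cand < f[i]:                        # greedy memory, Eq. (29)
                X[i], f[i] = cand, f_cand
    return X[f.argmin()], f.min()
```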

The AROA derives a fitness function (FF) to improve classifier performance. It defines a positive number to characterize the efficiency of a candidate solution; here, the classification error rate, to be decreased, is taken as the FF. Its formulation is mathematically expressed in Eq. (30).

$$fitness\left(x_{i}\right)=ClassifierErrorRate\left(x_{i}\right)=\frac{\text{no. of misclassified samples}}{\text{total no. of samples}}\times 100$$
(30)
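In a hyperparameter-tuning run, each AROA candidate \(x_{i}\) would be decoded into a configuration (for example learning rate, batch size, and hidden units; the exact encoding is not fixed here), the CNN-BiGRU-A model trained with it, and the validation error rate of Eq. (30) returned as the fitness:

```python
def classifier_error_rate(y_true, y_pred):
    """FF of Eq. (30): percentage of misclassified samples (to be minimized)."""
    misclassified = sum(t != p for t, p in zip(y_true, y_pred))
    return misclassified / len(y_true) * 100
```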

Performance validation

The SACHI-SLRHDL model is examined on an ISL dataset33. This dataset comprises 800 images across 20 class labels, as shown in Table 1. Figure 6 presents sample images.

Table 1 Details of dataset.
Fig. 6
figure 6

Sample images.

Evaluation metrics for classification models: \(Accu_{y}\), \(Prec_{n}\), \(Reca_{l}\), \(F_{Score}\), and \(MCC\)

The performance of the classification models is evaluated using diverse metrics. Equation (31) represents \(\:acc{u}_{y}\), which measures the overall proportion of correct predictions. Equation (32) computes \(\:pre{c}_{n}\), the ratio of correct positive predictions. Equation (33) defines \(\:rec{a}_{l}\), which evaluates the capability of the model to detect true positives. Equation (34) shows the \(\:{F}_{Score}\), a metric that integrates \(\:pre{c}_{n}\) and \(\:rec{a}_{l}\) into a single value to balance their trade-offs, particularly in cases with imbalanced classes. Finally, Eq. (35) represents \(\:MCC\), which evaluates the balance between classification accuracy for both classes, providing a more balanced performance evaluation in imbalanced datasets. These metrics give a comprehensive assessment of model performance as represented by the following equations:

$$Accu_{y}=\frac{TP+TN}{TP+TN+FP+FN}$$
(31)
$$Prec_{n}=\frac{TP}{TP+FP}$$
(32)
$$Reca_{l}=\frac{TP}{TP+FN}$$
(33)
$$F_{Score}=\frac{2\cdot Prec_{n}\cdot Reca_{l}}{Prec_{n}+Reca_{l}}$$
(34)
$$MCC=\frac{\left(TP\cdot TN\right)-\left(FP\cdot FN\right)}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}}$$
(35)

Here, \(\:TP\) represents True Positives, \(\:TN\) indicates True Negatives, \(\:FP\) denotes False Positives, and \(\:FN\) stands for False Negatives. These metrics provide a comprehensive evaluation of the model’s performance, including its capability to correctly detect both positive and negative instances and its handling of class imbalances. The inclusion of \(\:pre{c}_{n}\), \(\:rec{a}_{l}\), \(\:{F}_{Score}\), and \(\:MCC\) presents insights into the efficiency of the model in detecting relevant patterns while minimizing errors across diverse classification scenarios.
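In practice these quantities can be computed with scikit-learn; macro averaging over the 20 classes is an assumption consistent with the per-class treatment above:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

def evaluate(y_true, y_pred):
    """Metrics of Eqs. (31)-(35), macro-averaged across classes."""
    return {
        "accu_y": accuracy_score(y_true, y_pred),
        "prec_n": precision_score(y_true, y_pred, average="macro"),
        "reca_l": recall_score(y_true, y_pred, average="macro"),
        "f_score": f1_score(y_true, y_pred, average="macro"),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }
```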

Analysis and evaluation of experimental results

Figure 7 shows the confusion matrices generated by the SACHI-SLRHDL approach under the 80%:20% and 70%:30% TRAPS/TESPS splits. The results indicate that the SACHI-SLRHDL model detects and identifies all 20 classes accurately.

Fig. 7
figure 7

Confusion matrix of (a-c) TRAPS of 80% and 70% and (b-d) TESPS of 20% and 30%.

Table 2 and Fig. 8 illustrate the SL detection of the SACHI-SLRHDL approach under 80%TRAPS and 20%TESPS. The table values imply that the SACHI-SLRHDL approach attains efficient performance. On 80%TRAPS, the SACHI-SLRHDL approach gains an average \(accu_{y}\) of 98.89%, \(prec_{n}\) of 88.89%, \(reca_{l}\) of 88.85%, \(F_{score}\) of 88.80%, and MCC of 88.25%. Moreover, on 20%TESPS, the SACHI-SLRHDL method obtains an average \(accu_{y}\) of 99.19%, \(prec_{n}\) of 91.54%, \(reca_{l}\) of 93.21%, \(F_{score}\) of 91.87%, and MCC of 91.72%.

Table 2 SL detection of SACHI-SLRHDL approach under 80%TRAPS and 20%TESPS.
Fig. 8
figure 8

Average of SACHI-SLRHDL approach under 80%TRAPS and 20%TESPS.

Table 3 and Fig. 9 demonstrate the SL detection of the SACHI-SLRHDL approach under 70%TRAPS and 30%TESPS. The table values imply that the SACHI-SLRHDL approach attains efficient performance. On 70%TRAPS, the SACHI-SLRHDL method obtains an average \(accu_{y}\) of 97.98%, \(prec_{n}\) of 79.93%, \(reca_{l}\) of 80.04%, \(F_{score}\) of 79.76%, and \(MCC\) of 78.83%. Additionally, on 30%TESPS, the SACHI-SLRHDL technique reaches an average \(accu_{y}\) of 98.04%, \(prec_{n}\) of 79.32%, \(reca_{l}\) of 78.56%, \(F_{score}\) of 78.44%, and \(MCC\) of 77.69%.

Table 3 SL detection of SACHI-SLRHDL approach under 70%TRAPS and 30%TESPS.
Fig. 9
figure 9

Average of SACHI-SLRHDL approach under 70%TRAPS and 30%TESPS.

Figure 10 depicts the training (TRA) \(accu_{y}\) and validation (VAL) \(accu_{y}\) performance of the SACHI-SLRHDL technique under 80%TRAPS and 20%TESPS. The \(accu_{y}\) values are computed across an interval of 0-50 epochs. The figure shows that the TRA and VAL \(accu_{y}\) values exhibit an increasing trend, indicating the capability of the SACHI-SLRHDL approach to improve performance across repeated iterations. Furthermore, the TRA and VAL \(accu_{y}\) values remain close through the epochs, indicating lesser overfitting and showcasing the higher performance of the SACHI-SLRHDL approach, which guarantees reliable prediction on unseen samples.

Fig. 10
figure 10

\(\:Acc{u}_{y}\) curve of SACHI-SLRHDL approach under 80%TRAPS and 20%TESPS.

Figure 11 shows the TRA loss (TRALOS) and VAL loss (VALLOS) graph of the SACHI-SLRHDL approach with 80%TRAPS and 20%TESPS. The loss values are computed across an interval of 0-50 epochs. The TRALOS and VALLOS values demonstrate a decreasing trend, which indicates the proficiency of the SACHI-SLRHDL approach in balancing the trade-off between generalization and data fitting. The continual reduction in loss values also assures the superior performance of the SACHI-SLRHDL method and the subsequent tuning of the prediction results.

Fig. 11
figure 11

Loss curve of SACHI-SLRHDL approach under 80%TRAPS and 20%TESPS.

In Fig. 12, the PR curve inspection of the SACHI-SLRHDL approach under 80%TRAPS and 20%TESPS offers an understanding of its outcome by plotting Precision against Recall for the 20 distinct class labels. The figure exhibits that the SACHI-SLRHDL technique consistently attains enhanced PR values over distinct class labels, which indicates its proficiency in keeping a high proportion of true positive predictions (precision) while effectively capturing a significant share of actual positives (recall).

Fig. 12
figure 12

PR curve of SACHI-SLRHDL approach at 80%TRAPS and 20%TESPS.

Figure 13 examines the ROC outcome of the SACHI-SLRHDL approach under 80%TRAPS and 20%TESPS. The results show that the SACHI-SLRHDL approach gains superior ROC values across each class label, representing noteworthy proficiency in discriminating between the classes. This consistent tendency of maximum ROC values across several classes illustrates the skilful outcomes of the SACHI-SLRHDL technique in predicting classes, implying the robust nature of the classification system.

Fig. 13
figure 13

ROC curve of SACHI-SLRHDL approach under 80%TRAPS and 20%TESPS.

Comparative analysis of the SACHI-SLRHDL model performance across different techniques under the ISL dataset

Table 4 and Fig. 14 compare the performance of the SACHI-SLRHDL method with existing techniques19,34,35. The results highlight that the DMM-MobileNet, Bi-SRN, Skeletal Feature + LSTM, ANFIS Networks, MLP-MDC, PCNN, Modified K-NN, Bi-LSTM, TCNN, CNN-BiLSTM, and DCNN methodologies have exhibited poorer performance. Likewise, SVM models have attained closer outcomes with \(prec_{n}\), \(reca_{l}\), \(accu_{y}\), and \(F_{score}\) of 89.04%, 89.53%, 98.26%, and 89.30%, respectively. The SACHI-SLRHDL technique reported enhanced performance with higher \(prec_{n}\), \(reca_{l}\), \(accu_{y}\), and \(F_{score}\) of 91.54%, 93.21%, 99.19%, and 91.87%, respectively.

Table 4 Comparative outcomes of the SACHI-SLRHDL technique with recent models19,34,35.
Fig. 14
figure 14

Comparative outcome of SACHI-SLRHDL technique with recent models.

Comparative evaluation of computational time for the SACHI-SLRHDL model across different techniques under the ISL dataset

Table 5 and Fig. 15 depict the computational time (CT) analysis of the SACHI-SLRHDL technique compared to existing methods. The SACHI-SLRHDL model demonstrates the fastest CT at 6.98 s. Compared to other methods such as DMM-MobileNet at 22.61 s and Bi-SRN at 22.37 s, the SACHI-SLRHDL model significantly outperforms, suggesting its effectiveness in real-time applications. While models such as Bi-LSTM at 17.15 s and Skeletal Feature plus LSTM at 13.74 s are faster than some conventional methods, the SACHI-SLRHDL method stands out in minimizing CT without sacrificing performance. Other models, including PCNN at 24.67 s and CNN-BiLSTM at 23.24 s, require more time, highlighting the SACHI-SLRHDL model's advantage in speed. This efficiency makes the SACHI-SLRHDL model particularly appropriate for applications with low-latency requirements, presenting an optimal balance between performance and computational cost.

Table 5 CT evaluation of the SACHI-SLRHDL technique with existing methods.
Fig. 15
figure 15

CT evaluation of the SACHI-SLRHDL technique with existing methods.

Conclusion

In this study, a SACHI-SLRHDL methodology in IoT was presented. The model involved four distinct processes: image pre-processing, improved MobileNetV3-based feature extraction, hybrid DL classification, and AROA-based parameter tuning. At the primary stage, the SACHI-SLRHDL model utilized BF for image pre-processing to enhance the quality of the captured images by reducing noise while preserving edges. Next, the improved MobileNetV3 model extracted relevant features from the input images. For the SLR process, the hybrid CNN-BiGRU-A classifier was employed. Finally, the AROA optimally adjusted the CNN-BiGRU-A model's hyperparameter values, resulting in better classification performance. A comprehensive experimental analysis was performed on an ISL dataset to exhibit the superior solution of the SACHI-SLRHDL method. The experimental validation of the SACHI-SLRHDL method portrayed a superior accuracy value of 99.19% over existing techniques. The limitations of the SACHI-SLRHDL method comprise the restricted size and diversity of the dataset, which may affect the generalization of the model to diverse sign languages and real-world scenarios. Additionally, the performance of the model in challenging environments, such as low lighting or occlusions, has not been fully tested. The computational complexity of the model may hinder its deployment on low-resource devices or IoT platforms. Furthermore, the scalability of the technique for large-scale applications in diverse settings remains unexamined. Future work should concentrate on expanding the dataset to encompass a wider variety of signs and gestures, testing the model under various environmental conditions, and optimizing the approach for deployment on resource-constrained devices. Moreover, further exploration of cross-lingual and cross-cultural adaptability could improve the effectiveness of the model in global applications.