Abstract
Arabic Sign Language (ArSL) is a visual-manual language that facilitates communication among Deaf people in Arabic-speaking nations. Recognizing ArSL is crucial for a variety of reasons, including its impact on the Deaf population, education, healthcare, and society as a whole. Previous approaches to Arabic sign language recognition have limitations, especially in terms of accuracy and their capability to capture the detailed features of the signs. To overcome these challenges, a new model named DeepArabianSignNet is proposed, which incorporates DenseNet, EfficientNet, and an attention-based Deep ResNet. The model uses a newly introduced G-TverskyUNet3+ to detect regions of interest in preprocessed Arabic sign language images. In addition, a novel metaheuristic algorithm, the Crisscross Seed Forest Optimization Algorithm, which combines the Crisscross Optimization and Forest Optimization algorithms, is employed to determine the best features from the extracted texture, color, and deep learning features. The proposed model is assessed on two databases with training rates of 70% and 80%; Database 2 was exceptional, with an accuracy of 0.97675 for 70% of the training data and 0.98376 for 80%. The results presented in this paper prove that DeepArabianSignNet is effective in improving Arabic sign language recognition.
Introduction
Gestures have long been a fundamental mode of communication, and Deaf and Hard of Hearing (DHH) individuals today are the primary users of recognized sign languages1,2,3. These languages contain not only manual features such as hand gestures4 but also non-manual features, including facial actions and body language5. Although sign language assists DHH people in their communication, the gap between DHH and hearing people has not been closed. Worldwide, 466 million people have hearing impairments and face communication problems every day. It is therefore important to recognize that sign language users are not a linguistic minority that can be disregarded6. Different sign languages have different signs for the letters of the alphabet, and these signs mostly take the form of the letters themselves7. However, sign language differs from one country to another, and there are 144 sign languages in the world, including Arabic Sign Language (ArSL), which has 30 alphabet signs specific to the Arabic region6,8,9,10.
Arabic Sign Language detection is essential for ensuring that hearing-impaired individuals in Arabic-speaking countries can communicate easily with other people11,12,13. Considering the cultural and linguistic heterogeneity within the Arab world, such a system can significantly improve communication in daily practice, education, and work14. For people who use ArSL as their main method of communication, sign language detection technologies can translate signs into text or speech, allowing a person with hearing impairment to interact with the rest of society with minimal interference15,16,17,18. This technology is most beneficial in government services, healthcare, education, and social communication, where sign language interpreters may not always be available19,20.
The benefits that can be derived from ArSL detection are numerous. It improves the chances of easy communication and integration of deaf people into society, so that they do not feel locked out because of language issues. With the development of machine learning and computer vision, gesture recognition accuracy and response speed have improved19,21, making such systems more stable. However, some issues still need to be solved. Arabic sign language is a complex language with multiple gestures and regional dialect variations that can affect the consistency of detection22. Furthermore, some systems may require costly hardware or high computational power, which may not be feasible in some settings23.
The uses of ArSL detection are varied. It can be incorporated into mobile applications, allowing people to use their smartphones to interact in real time, or deployed in public service facilities such as airports, banks, and government offices for the benefit of deaf people24. In education, it can help deaf students communicate with teachers and classmates. Furthermore, the technology can be applied in the health sector to improve interaction between patients and medical practitioners, specifically to avoid communication barriers for deaf patients25.
It is worth mentioning that different approaches have been used for the detection of Arabic Sign Language (ArSL), each with its pros and cons. A Faster R-CNN method employs deep learning models such as VGG-16 and ResNet-1826 for gesture recognition but suffers from high computational cost27. Another approach uses a 3D CNN skeleton network for real-time sign detection but is very sensitive to environmental conditions, especially lighting. Systems employing KNN and SVM with the Leap Motion Controller exhibit high accuracy for double-hand gestures, while single-hand recognition is less reliable28. Vision-based techniques perform well in terms of accuracy for certain gestures but suffer from dataset dependency23. Because of these limitations, a new method is developed in this paper to overcome these drawbacks and improve ArSL detection.
Contribution
The contributions of the proposed work are as follows:
- To introduce a new G-TverskyUNet3+ based segmentation model for accurate identification of Region of Interest (ROI) areas from the pre-processed images.
- To select the optimal features from the extracted multi-features (texture, color, and deep learning) using a new hybrid optimization model referred to as CSFOA, a combination of FOA and CSO.
- To introduce a new DeepArabianSignNet model based on DenseNet-EfficientNet and an attention-based Deep ResNet for accurate recognition of Arabic sign language.
Organization
Section “Introduction” presents the importance of Arabic sign language recognition and the goals of the research. Section “Literature review” reviews the literature on sign language recognition and identifies the gaps the study intends to fill. Section “Proposed methodology for Arabian sign language (ArSL) detection” focuses on the DeepArabianSignNet model, describing its architecture, the pre-processing of the input data, and the training process. Section “Result and discussion” presents the results, assesses the performance of the proposed model, compares it with other methods, and finally concludes on the overall significance of the suggested approach.
Literature review
The recent improvements in ArSL recognition have motivated the development of numerous novel systems based on deep learning, computer vision, and wearable technologies. Alawwad et al.26 used Faster R-CNN with ResNet-18 and VGG-16 to achieve 93% accuracy in recognizing ArSL alphabets under varying backgrounds. Bencherif et al.29 proposed a video-based model using both 2D and 3D CNNs for dynamic and static sign recognition. Hisham and Hamouda30 used the Leap Motion Controller with machine learning models such as SVM, KNN, and DTW to reach 92% accuracy on average. Tharwat et al.31 presented a vision-based system to recognize Quranic letters with nearly 99% accuracy. Alani and Cosma32 proposed ArSL-CNN based on the ArSL2018 dataset and further improved recognition accuracy using SMOTE to reach up to 97.29%. Bansal et al.33 applied mRMR-PSO for optimal feature selection using HOG features, showing better recognition accuracy on several sign language datasets than other methods. Miah et al.34 defined BenSignNet for Bengali Sign Language, giving an accuracy of more than 94% on three datasets. Sharma and Singh35 created an ISL dataset and designed a CNN model that achieved good performance with little processing time. Alyami et al.36 introduced a pose-based Transformer utilizing MediaPipe keypoints for the hand and face, increasing the recognition rate by 4%. Sharma and Singh35 also designed SISLA, a speech-to-sign avatar system with multilingual capabilities and up to 91% accuracy. Abdul Ameer et al.12 presented an attention-based LSTM with MediaPipe for temporal gesture recognition and achieved an accuracy of more than 85%. AlKhuraym et al.37 used EfficientNet-Lite0 along with label smoothing to obtain 94% accuracy under background-noise conditions. Shanableh38 proposed a two-stage temporal segmentation model with CNN transfer learning, beating prior models with 97.3% word and 92.6% phrase recognition. Rwelli et al.39 proposed a CNN-based wearable sensor system using DG5-V gloves with real-time vocalized output; the system achieved an accuracy of 90%, enhancing accessibility for the hearing impaired.
Table 1 summarizes the benefits and drawbacks of state-of-the-art approaches to sign language recognition.
Proposed methodology for Arabian sign language (ArSL) detection
In this research work, a novel ArSL model is introduced with the assistance of AI-based approaches and a hybrid meta-heuristic optimization model. The overall architecture of the proposed model is shown in Fig. 1.
Image acquisition
The proposed approach for detecting ArSL includes a detailed image acquisition process based on the following datasets.
Arabic Sign Language ArSL dataset
Dataset 1 (https://www.kaggle.com/datasets/sabribelmadoui/arabic-sign-language-augmented-dataset), used for image acquisition, is well prepared for developing a reliable Arabic sign language recognition system. It consists of 290 test images, 13,926 training images, and 870 validation images, for a total of 15,086 images, each with a dimension of 416 × 416 pixels. These images were captured in different settings, with different backgrounds and different hand angles, using a cell phone camera.
RGB Arabic Alphabets Sign Language Dataset
Dataset 2 (https://www.kaggle.com/datasets/muhammadalbrham/rgb-arabic-alphabets-sign-language-dataset) is a valuable resource for image acquisition and model training. The RGB Arabic Alphabets Sign Language (AASL) dataset includes 7857 raw, fully labelled RGB images of Arabic sign language alphabets and, to our knowledge, was the first public RGB database of its kind. It is intended to assist anyone interested in building realistic Arabic sign language classification models for real-world scenarios. AASL was gathered from more than 200 individuals under various conditions of background, illumination, image orientation, size, and resolution. To guarantee a high-quality dataset, the gathered photos were supervised, verified, and filtered by subject-matter experts.
KArSL database
Dataset 3 can be downloaded from https://www.kaggle.com/datasets/yousefdotpy/karsl-502. KArSL is one of the largest video databases for word-level Arabic sign language. The database is built around 502 isolated sign words collected with the Microsoft Kinect V2. Each sign in the database is performed by three professional signers, and each signer repeated each sign 50 times, resulting in a total of 75,300 samples (502 × 3 × 50).
Pre-processing
Image resizing
The collected RGB images were resized to a standard size of 224 × 224 pixels for all images in the dataset. This resizing is important to achieve a uniform input size, which the machine learning models require for processing. Figure 2 shows the resized images for database 1 and database 2.
L*a*b* color space conversion
After resizing, the images go through L*a*b* color space conversion, a very important step in image processing and computer vision. The space consists of three main components: L*, the lightness; a*, the green–red axis; and b*, the blue-yellow axis. In this 3D color space, L* ranges from 0 to 100, corresponding to black and white respectively, while a* and b* range from −128 to +128 and represent the green–red and blue-yellow opponent axes, respectively. The L*a*b* color space analysis for dataset 1 and dataset 2 is presented in Fig. 3.
RGB is converted directly into L*a*b* by utilizing Eq. (1)
Here Q is obtained as per Eq. (2),
In the above matrix, \({P}_{k}={p}_{k}/{m}_{k}\), \({M}_{k}=1\), \({N}_{k}=\left(1-{p}_{k}-{m}_{k}\right)/{m}_{k}\left(k\in \left\{r,g,b\right\}\right)\) and [H] is obtained as per Eq. (3):
After the values of \(P\), \(M,\) and \(N\) are acquired, they are used to find the values of \({L}^{*}\), \({a}^{*}\) and \({b}^{*}\). Equation (4) provides the set of formulas to obtain \({L}^{*}{a}^{*}{b}^{*}\).
Here, \({P}_{c}, {M}_{c},\) and \({N}_{c}\) represent the coordinates of the white reference illuminant. The function \(f\) is defined as per Eq. (5).
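Since Eqs. (1)-(5) are not reproduced in the extracted text, the following NumPy sketch illustrates the standard RGB → tristimulus → L*a*b* pipeline they describe; the transformation matrix and the D65 white reference are common values assumed here, not taken from the paper, and gamma linearization is omitted for brevity.

```python
import numpy as np

# Assumed sRGB-to-XYZ matrix and D65 white reference (Xn, Yn, Zn); the paper's
# Eqs. (1)-(3) define the corresponding [H] matrix and the P, M, N values.
RGB_TO_XYZ = np.array([[0.4124, 0.3576, 0.1805],
                       [0.2126, 0.7152, 0.0722],
                       [0.0193, 0.1192, 0.9505]])
WHITE = np.array([0.9505, 1.0000, 1.0890])

def f(t):
    """Nonlinear compression of the normalized tristimulus values (cf. Eq. (5))."""
    delta = 6 / 29
    return np.where(t > delta ** 3, np.cbrt(t), t / (3 * delta ** 2) + 4 / 29)

def rgb_to_lab(rgb):
    """Convert an H x W x 3 image with values in [0, 1] to L*a*b* (cf. Eq. (4))."""
    xyz = rgb.reshape(-1, 3) @ RGB_TO_XYZ.T / WHITE   # normalized tristimulus values
    fx, fy, fz = f(xyz[:, 0]), f(xyz[:, 1]), f(xyz[:, 2])
    L = 116 * fy - 16          # lightness, 0-100
    a = 500 * (fx - fy)        # green-red axis
    b = 200 * (fy - fz)        # blue-yellow axis
    return np.stack([L, a, b], axis=1).reshape(rgb.shape)
```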
Image augmentation
Following L*a*b* color space conversion, the images were subjected to image augmentation to increase dataset variability and resilience. The techniques used in this procedure include rotation, flipping, and scaling. Rotation is performed at specific angles, such as 45, 90, and 180 degrees, producing several images of the same object with different orientations and making the model rotationally invariant. Flipping is done along both the horizontal (left-right) and vertical (top-bottom) axes, enabling the model to recognize objects in any orientation. Scaling, the process of enlarging or reducing the size of the images, helps the model learn objects at different sizes and distances. Augmented images of dataset 1 and dataset 2 are depicted in Fig. 4. The augmented image is the pre-processed image from which the ROI regions are identified in the segmentation stage.
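As a concrete illustration of the augmentation step above, the sketch below applies rotation (45, 90, and 180 degrees), horizontal and vertical flips, and rescaling with SciPy and NumPy; the specific zoom factors are illustrative assumptions, not values reported in the paper.

```python
import numpy as np
from scipy import ndimage

def augment(image):
    """Return augmented variants of an H x W x C image: rotations, flips, rescaling."""
    variants = []
    for angle in (45, 90, 180):                               # rotational invariance
        variants.append(ndimage.rotate(image, angle, reshape=False, mode="nearest"))
    variants.append(np.fliplr(image))                         # horizontal (left-right) flip
    variants.append(np.flipud(image))                         # vertical (top-bottom) flip
    for factor in (0.8, 1.2):                                 # illustrative scale factors
        variants.append(ndimage.zoom(image, (factor, factor, 1), order=1))
    return variants
```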
G-TverskyUNet3+: proposed architecture
The 3D U-Net provides the basis of G-TverskyUNet3+: the newly trained GNeXt backbone conserves computing power, an attention gate module works as a noise filter, and the new skip-connection architecture of U-Net 3+ functions as a low-level feature extractor. Together, these components form a new attention 3D U-Net with numerous skip connections that is utilized to segment the Arabic sign language images. The architecture of the model is shown in Fig. 5.
For image segmentation, the 3D U-Net model employs an encoder-decoder architecture, in which convolutional layers gradually increase feature channels and decrease spatial size. In order to blend high-level features from deeper layers with fine-grained information from early layers, skip connections are used. This architecture maintains spatial details throughout the upsampling process, guaranteeing precise segmentation. It works very well with 3D input graphics, including gesture data from sign language. For high-quality output, the network’s topology helps to preserve both coarse and fine features.
New GNeXt
A novel backbone called GNeXt combines Ghost Convolution and ConvNeXt to effectively extract features while lowering computational overhead. ConvNeXt optimizes for accuracy and scalability by building on ResNet with improvements from Transformer models. Ghost Convolution improves processing performance by using inexpensive convolutions to generate ghost features, which lowers the model’s parameters and effort. The GNeXt backbone maintains low processing costs while handling 3D data well. The lightweight design of the model’s framework enhances accuracy and model convergence.
A three-dimensional U-Net model constitutes the basis of this framework, which includes the primary encoder and decoder elements. The use of the new GNeXt backbone in the encoder, together with the numerous skip connections and attention modules, constitutes the primary distinction between the suggested model and the original 3D U-Net. Figure 6 depicts the full architecture.
The model supports 3D Arabic sign language images of up to 240 × 240 × 155 voxels. On the left, the encoder blocks execute convolution operations, ReLU activations, and batch normalization. Up to the final encoder block F, the input image's spatial size progressively decreases and its channel count rises over the course of five stages. The convolution block in the final encoder stage receives the input, processes it as in the previous blocks, and then passes it, via transposed convolution, towards the decoder blocks for upsampling. The encoder's upper blocks extract low-level semantic information from the input image, while the lower blocks extract high-level features. The decoder then carries out the reverse operation, reconstructing the image's original size using upsampling. In this procedure, skip connections help backpropagate the outcomes to compute the loss while giving the network access to high-level semantic image information. Skip connections were added in this model to maximize the flow of features between the encoder and decoder blocks. To conserve a substantial amount of computational resources and increase accuracy, the attention gate modules filter out noisy information and pass only essential features.
GNeXt
The recently developed GNeXt combines the ConvNeXt backbone and group convolutions. On the ImageNet dataset, ConvNeXt, one of the most advanced backbone models, obtained a top-1 accuracy of 87–88%. The model consists entirely of convolutional networks, retaining the simplicity and efficiency of a standard CNN while outperforming Transformers in terms of scalability and accuracy. ConvNeXt builds upon the basic ResNet architecture and gradually enhances the model by adopting design choices from the Swin Transformer. Additionally, a 3D version of the novel GNeXt backbone is used here as the encoder block, together with Ghost convolution, a cost-effective linear operation for building feature maps that efficiently reduces model parameters and computing workload. As shown in Fig. 7, this design is an enhanced version of MobileNetv1, with shortcut connections between bottlenecks and linear bottlenecks between layers.
The block first performs a 1 × 1 × 1 convolution with ReLU6 and batch normalization on a feature set defined by width, height, and depth. In the second layer, following the original architecture, the input is processed by a 3 × 3 × 3 depth-wise convolution with batch normalization and ReLU6, which in the 2D case convolves the three RGB channels. However, volumetric data is employed in this instance; consequently, since the input does not consist of three RGB channels, 3D depth-wise convolution is applied to the voxels of the 3D input image.
An additional 1 × 1 × 1 convolution, without an activation function, forms the block's final layer. After that, the output is added to the earlier input. Because of the high processing overhead associated with 3D convolutions, this block helps the model remain small. This allows low-cost depth-wise convolution to extract as many features as possible from the input image at every stage. Additionally, the new GNeXt is employed with 3D pretrained weights, and a transfer learning technique is applied. This approach improved the accuracy slightly while enabling the model to converge more quickly and smoothly than previous models.
Ghost convolution
Ghost convolution, introduced in GhostNet, is a linear operation that generates feature maps at a low cost and dramatically reduces model parameters and computational workload.
The design of Ghost convolution is shown in Fig. 8, where the conventional convolution operation is split into two parts: the primary convolution and the cheap convolution. The primary convolution restricts the total number of kernels to significantly fewer than in traditional convolution, but is otherwise practically the same as conventional convolution. The cheap convolution, on the other hand, performs group convolution on the feature map produced by the primary convolution, producing redundant feature maps referred to as Ghost feature maps. Group convolution drastically lowers the complexity of the model, using less computation and operating more quickly than ordinary convolution. To create the output feature maps needed for feature extraction, the primary and Ghost feature maps are combined; with this method the primary and Ghost feature maps are kept at the same size. Ghost convolution is used several times throughout the network to keep the detector from producing an excessive number of parameters.
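A minimal PyTorch sketch of the Ghost convolution split described above (a primary convolution followed by a cheap group/depth-wise convolution, with the two outputs concatenated); the channel ratio and kernel sizes are illustrative assumptions, and a 2D module is shown for brevity rather than the paper's exact 3D configuration.

```python
import torch
import torch.nn as nn

class GhostConv2d(nn.Module):
    """Primary convolution + cheap group convolution, concatenated into one output."""
    def __init__(self, in_ch, out_ch, ratio=2, kernel_size=1, cheap_kernel=3):
        super().__init__()
        primary_ch = out_ch // ratio       # channels produced by the primary convolution
        ghost_ch = out_ch - primary_ch     # channels produced by the cheap convolution
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(primary_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, ghost_ch, cheap_kernel, padding=cheap_kernel // 2,
                      groups=primary_ch, bias=False),   # group (depth-wise) convolution
            nn.BatchNorm2d(ghost_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        primary = self.primary(x)
        ghost = self.cheap(primary)        # "Ghost" feature maps from the cheap branch
        return torch.cat([primary, ghost], dim=1)

# e.g. GhostConv2d(32, 64)(torch.randn(1, 32, 56, 56)).shape -> torch.Size([1, 64, 56, 56])
```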
The ConvNeXt blocks in each stage are arranged in 1:1:9:1 ratios. In addition, the first downsampling module creates the Patchify layer using a convolutional layer with a 4 × 4 kernel size. Compared to ResNet, the ConvNeXt block has an inverted bottleneck structure and a larger 7 × 7 kernel size. The layer normalization (LN) layer and the Gaussian error linear unit (GELU) function of Eq. (6), which are employed in Transformers, replace the batch normalization (BN) and ReLU activation widely used in CNN micro-architectures, resulting in fewer layers. Only one activation layer remains between the two 1 × 1 convolution layers in the block, and one LN layer precedes the convolution layers. Additionally, a separate 2 × 2 convolutional downsampling layer with a stride of 2 is employed, and an LN layer is added afterwards to stabilize training.
In Eq. (6), Φ(x) denotes the cumulative distribution function of the Gaussian distribution.
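Equation (6) is not reproduced in the extracted text; for reference, the standard GELU definition it refers to is \(\mathrm{GELU}(x)=x\,\Phi(x)=\tfrac{x}{2}\left[1+\operatorname{erf}\!\left(x/\sqrt{2}\right)\right]\), with \(\Phi(x)\) the standard Gaussian cumulative distribution function.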
Multiple skip connections
The problem of vanishing gradients in deep networks is addressed by multiple skip connections, which enhance information flow between encoder and decoder layers. These connections preserve fine-grained details by allowing low-level information from the encoder to be sent straight to the decoder. By capturing both coarse and fine information through skip links throughout the network, the model ensures accurate segmentation. Thanks to this strategy, deeper networks avoid losing important information during the upsampling process. Skip connections improve the accuracy of the model, particularly for intricate tasks such as gesture recognition. A deep 3D architecture is employed, which enables the modules to extract more high-level features with high accuracy; however, deeper networks suffer from less efficient backpropagation, which affects how the loss is computed. For the ResNet and U-Net algorithms, skip connections have been successfully offered as a solution to this issue. To address the vanishing-gradient problem and the deteriorating accuracy of ResNet, skip connections skip over one or more layers and their associated operations. A traditional neural network layer first sums the weighted input with the output of the preceding layer, \({X}_{0}\), and then applies the activation function. The procedure is then typically repeated twice, as per Eq. (7).
With skip connections, the same procedure is repeated with one additional mechanism: \({X}_{0}\) is passed forward and added to \({Z}_{2}\) before the second activation function, as represented by Eq. (8).
If all the values are positive, the activation function directly outputs the sum of the two inputs it received. When \({Z}_{2}\) is negative, only \({X}_{0}\) is output, as illustrated by Eq. (9).
In contrast, without skip connections, a \({Z}_{2}\) value of 0 would simply be deactivated. By avoiding outputs in the 0 range, the skip connection operation allows the network to perform convolution in deeper networks without loss. Nearly identical operations are needed in the U-Net architecture: rather than going only to subsequent encoder layers, the output of preceding encoder layers is transferred to the decoder layers. Furthermore, U-Net uses a concatenation procedure in place of an addition operation. Later on, redesigned versions of skip connections were suggested in U-Net3+ and U-Net++. This study uses U-Net3+'s skip connection model to improve low-level feature extraction and feature flow. The output of the first encoder block is shared, as a skip connection, with the top three decoder blocks. Starting from the second encoder block, the output is shared with the two decoder blocks below, and this procedure is repeated until the fourth encoder block. The decoder blocks go through the same procedure, this time proceeding from the bottom decoder block towards the final decoder block, with the outputs shared with the blocks above.
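To make Eqs. (7)-(9) concrete, the following PyTorch sketch shows a residual block in which the block input \({X}_{0}\) is added back before the second activation; the 3D convolutions and single channel count are assumptions chosen to match the 3D U-Net setting, not the paper's exact layer configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convolutions whose output is added to the block input (cf. Eqs. (7)-(9))."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x0):
        z1 = self.act(self.conv1(x0))    # Eq. (7): first convolution + activation
        z2 = self.conv2(z1)              # Eq. (8): second convolution, no activation yet
        return self.act(z2 + x0)         # Eq. (9): skip connection preserves x0 when z2 <= 0
```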
Figure 9 demonstrates how several encoder and decoder blocks provide multiple inputs to a single decoder block, D. By transmitting the pre-processed feature maps to each convolution block to extract additional distinct features, these numerous skip connections help protect the fine features extracted by every encoder and decoder block. Inside the network, each decoder block combines the large feature maps generated by decoder blocks with the smaller, same-scale feature maps from encoder blocks to capture both coarse-grained semantics and fine-grained features at full scale.
Attention models
One of the primary modules in this study is the attention module. The attention gate's core is the score function: it receives a query and a key as inputs and outputs a score for each input. The final result is then processed using a value to establish the relative importance of each component. Overall, the attention mechanism computes a weighted average of the components that depends on the key and the query. After the successful application of attention in sequential language tasks, the CNN community developed the corresponding module, referred to as the Attention Gate. The attention mechanism concentrates on the most significant aspects of the input material by eliminating extraneous information. To assess the importance of each component of the input image, the Attention Gate assigns a score. This enables the model to ignore extraneous details and concentrate on key aspects. The attention mechanism improves feature extraction, which raises the effectiveness and performance of the model. It is especially useful for applications where some attributes are more crucial than others, such as sign language recognition.
The explainability of attention mechanisms is one of its most notable features. It is challenging to comprehend the logic behind predictions using traditional deep learning models, especially convolutional or recurrent neural networks, which are frequently referred to as “black-box” models. On the other hand, attention methods offer transparency by indicating which aspects of the input the model is considering while making decisions. Attention heatmaps, which show the areas of the input data that the model considers most significant, can be used to display this. These heatmaps, for instance, show the parts of the hand or body that the model concentrates on when identifying a gesture in ArSL identification. Because it clarifies for stakeholders why the model generated a specific forecast, this visual depiction increases confidence in the model.
Additionally, attention mechanisms make the model more reliable and make debugging easier. Attention visualization makes it simple to spot instances where the model is concentrating on unimportant aspects, like background or noise, which can lead to changes in the training data or model’s architecture. Attention mechanisms are particularly good at capturing long-range relationships between various sequence pieces, which might be difficult for traditional models to do in sequential data applications like sign language recognition. Convolutional Neural Networks (CNNs) and attention are combined in this study to improve the model’s performance by enabling it to prioritize significant gesture elements, increasing classification accuracy while preserving computational economy34.
Integrating CNNs with attention processes enhances explainability and model performance in the context of ArSL recognition. By assisting the network in concentrating on the most important components of a gesture, the attention mechanism improves recognition accuracy. Additionally, the model’s behavior may be better understood and interpreted thanks to its transparency, which is particularly helpful in applications involving assistive devices for the hard of hearing. In addition to producing a more successful recognition system, the combination of CNNs and attention enhances the model’s dependability and credibility, which eventually increases its applicability in real-world situations.
Unet group convolution block
Group convolution divides the input feature map into smaller groups and applies convolution to each group independently, lowering the computing cost. This method speeds up the model and uses less memory by reducing the number of operations needed for processing. It handles large inputs effectively, particularly in settings with limited resources. Group convolution preserves the quality of feature extraction while allowing for faster processing. Lightweight models frequently employ this method to increase efficiency without compromising performance. Group convolution was first implemented in AlexNet to work around limited video memory. It is currently utilized in multiple lightweight modules to reduce the number of operations and parameters, as demonstrated by Fig. 10.
This approach divides the input feature map evenly into several groups along the channel dimension and then performs a standard convolution on every group. Assume an input feature map \(X\in {R}^{C\times H\times W}\), where \(C\) indicates the number of channels and \(H\) and \(W\) indicate the height and width of the input feature map, respectively. Similarly, for the output feature map \(Y\in {R}^{{C}^{\prime}\times {H}^{\prime}\times {W}^{\prime}}\), \({C}^{\prime}\) indicates the number of channels and \({H}^{\prime}\) and \({W}^{\prime}\) indicate the height and width of the output feature map, respectively. The computational cost of conventional convolution is evaluated by Eq. (10):
where \(k\) indicates the height and width of convolution kernel.
The evaluation of group convolution is represented as per Eq. (11):
where \(g\) indicates the number of groups the input feature map is divided into, \(\frac{C}{g}\) indicates the number of channels in every group of the input feature map, and \(\frac{{C}^{\prime}}{g}\) represents the number of channels in every group of the output feature map. Group convolution decreases both the computation and the number of parameters of conventional convolution to \(\frac{1}{g}\). It is crucial to remember that the convolution kernel of each group only convolves its own group of the input feature map; it does not convolve with feature maps from other groups. Figure 11 shows the segmented images of dataset 1 and dataset 2.
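The 1/g reduction implied by Eqs. (10)-(11) can be checked directly; the sketch below compares the parameter counts of a standard and a grouped 2D convolution in PyTorch (the channel sizes and group count are arbitrary illustrative values).

```python
import torch.nn as nn

c_in, c_out, k, g = 64, 128, 3, 4

standard = nn.Conv2d(c_in, c_out, k, padding=1)             # ordinary convolution (Eq. (10))
grouped  = nn.Conv2d(c_in, c_out, k, padding=1, groups=g)   # group convolution (Eq. (11))

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(standard))   # c_out * c_in * k * k + c_out        = 73,856
print(params(grouped))    # c_out * (c_in / g) * k * k + c_out  = 18,560  (roughly 1/g of the weights)
```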
Feature extraction
A feature extraction module is proposed to refine the significant features from the segmented images. It includes:
Multi-Scale Feature Extraction with Deep Convolutional Layers: Multi-Scale Feature Extraction is typically achieved by employing convolutional layers with different receptive fields within the same network. A convolutional layer applies a filter \(\mathcal{F}\) of size \(s\times s\) to the input feature map \(X\), producing an output feature map \(Y\) as given in Eq. (12), in which \({\mathcal{F}}_{j}\) indicates filter weights, \({X}_{i+j}\) stands for input pixels, and \(b\) defines bias.
Convolutions with different kernel sizes are applied to capture features at multiple scales, as shown in Eq. (13), in which \(Con{v}_{s\times s}\) represents convolution with a kernel size of \(s\times s\). Specifically, smaller kernels capture fine details, while larger kernels capture coarser features.
Dilated convolutions are used to expand the receptive field without increasing the number of parameters, as expressed in Eq. (14), in which \(Con{v}_{s\times s}^{d}\) stands for convolution with dilation rate \(d\), which spaces out the kernel elements, effectively enlarging the receptive field.
Pooling layers down-sample the feature maps, allowing the model to capture broader contextual information as stated in Eq. (15), in which \(Poo{l}_{p\times p}\) signifies pooling operation with a window size of \(p\times p\).
The outputs from the different convolutional layers are combined to form the final multi-scale feature representation, as given in Eq. (16), in which the combination operation applied is summation, which merges the multi-scale features.
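The sketch below is one possible PyTorch realization of the multi-scale extraction in Eqs. (13)-(16): parallel convolutions with different kernel sizes, a dilated convolution, and a pooled branch, merged by summation. The kernel sizes, dilation rate, and pooling window are illustrative assumptions, and even spatial dimensions are assumed so the pooled branch restores the input size.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel convolutions at several scales, merged by summation (cf. Eqs. (13)-(16))."""
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)               # fine details
        self.conv5 = nn.Conv2d(channels, channels, 5, padding=2)               # coarser features
        self.dil3  = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)   # enlarged receptive field (Eq. (14))
        self.pool  = nn.AvgPool2d(2)                                           # broader context (Eq. (15))
        self.up    = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):
        pooled = self.up(self.pool(x))                                   # pooled context restored to input size
        return self.conv3(x) + self.conv5(x) + self.dil3(x) + pooled     # summation merge (Eq. (16))
```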
(1) LBP Texture Features: LBP captures the local texture information by comparing each pixel with its neighborhood and encoding this information into a binary number. Given an input image \({I}_{segment}(m,n)\) and a pixel at position \(\left({m}_{c},{n}_{c}\right)\) with intensity \(I({m}_{c},{n}_{c}),\) the LBP value is computed by comparing the intensity of the central pixel with each of its neighbors \(\left({m}_{g},{n}_{g}\right)\) as defined in Eq. (17), in which \(S\left(\cdot \right)\) indicates a sign function that outputs a binary result based on the intensity comparison.
Equation (18) gives the LBP value for the central pixel, obtained by summing the binary results weighted by powers of 2 according to the position of the neighbour, in which \(N\) refers to the number of neighbours.
After computing the LBP value for each pixel in the image, the next step is to construct a histogram \(H\) that represents the frequency of each possible LBP value, as specified in Eq. (19), in which \(\delta \left(m,n\right)\) denotes the Kronecker delta function, which is 1 if \(m=n\) and 0 otherwise, and \(v\) ranges over \(\left[0,{2}^{N}-1\right]\).
The LBP histogram serves as a feature descriptor that represents the texture of the entire image.
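A compact NumPy sketch of Eqs. (17)-(19) for an 8-neighbour LBP on a grayscale image follows; border pixels are simply skipped, which is a simplification rather than the paper's exact handling.

```python
import numpy as np

def lbp_histogram(gray, n_neighbors=8):
    """LBP code per pixel (Eqs. (17)-(18)) and its normalized histogram (Eq. (19))."""
    # 8 neighbour offsets around the central pixel, clockwise
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.int32)
    for g, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (neighbour >= center).astype(np.int32) << g   # sign function weighted by 2^g
    hist, _ = np.histogram(codes, bins=2 ** n_neighbors, range=(0, 2 ** n_neighbors))
    return hist / hist.sum()                                   # normalized texture descriptor
```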
(2) Color-Based Features using Color Moments: These are statistical measures used to capture the distribution of color intensities in an image. Let the color image be represented by a pixel matrix, where each pixel \(\rho\) in the image has three color channels \(\complement\) \(\left(\complement \epsilon \left\{R,G,B\right\}\right)\) for an RGB image. The moments are computed for each channel. The mean of a color channel \(\complement\) is the average color value in that channel, as shown in Eq. (20), in which \({\rho }_{\complement }\left(i\right)\) denotes the color value of the \({i}\)th pixel in channel \(\complement\), and \(n\) indicates the number of pixels in the image.
Equation (21) defines the variance which measures the spread of color values around the mean.
The skewness \({\varphi }_{\complement }\) measures the asymmetry of the distribution of color values as given in Eq. (22).
These moments are calculated for each color channel, providing a compact yet informative feature vector for image analysis. The final feature vector \({F}_{ext}\) is the combination of the features from the deep convolutional layers, LBP, and color moments, as shown in Eq. (23).
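The color moments of Eqs. (20)-(22) can be computed per channel as in the NumPy sketch below; the skewness here uses the common variance-normalized form, which is an assumption since the paper's exact normalization is not shown.

```python
import numpy as np

def color_moments(rgb):
    """Mean, variance and skewness per channel (cf. Eqs. (20)-(22)), concatenated into one vector."""
    feats = []
    for c in range(3):                                    # R, G, B channels
        values = rgb[..., c].astype(np.float64).ravel()
        mean = values.mean()                                           # Eq. (20)
        var = np.mean((values - mean) ** 2)                            # Eq. (21)
        skew = np.mean((values - mean) ** 3) / (var ** 1.5 + 1e-12)    # Eq. (22), normalized form
        feats.extend([mean, var, skew])
    return np.array(feats)                                # 9-dimensional color descriptor
```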
Feature selection
A feature selection module is developed to choose the optimal features from the extracted feature set \({F}_{ext}\). It is developed using:
FOA: It is inspired by the natural processes in forests, particularly seed dispersion and tree growth. The primary idea is to simulate the random dispersion of seeds and the growth of trees to explore the solution space effectively.
(i) Initialize trees
The Forest Optimization Algorithm (FOA) views possible solutions as trees, each of which has a unique age and set of variable values. To regulate the number of trees in the forest, the age of each tree is set to '0' for newly generated trees and then grows by '1' following each local seeding step, excluding new trees. A tree is considered as an array of length \(1\times ({N}_{var}+1)\) in Eq. (24), where \({N}_{var}\) represents the dimension of the problem and “Age” indicates the age of the related tree.
The ‘life time’ parameter is a predetermined parameter that establishes the maximum allowable age of a tree. When a tree reaches this age, it is removed from the forest and added to the candidate population; this is decided at the beginning of the process. While a small value results in older trees being excluded at the start of the competition, decreasing the likelihood of local searches, a large value raises the age of trees.
(ii) Local seeding of the trees
In the natural world, seeds sprout into young trees when they fall close to trees. Trees that have better growing conditions like sunshine and location compete with one another to survive. Local seeding adds neighbors to trees that are 0 years old in an effort to mimic this process, making all trees older than new ones by 1. The algorithm raises the age of promising trees to regulate the number of trees in a forest. If a tree shows promise, it is reset to '0' so that neighbors can be added via local seeding. As they get older, unpromising trees eventually die. The 'Local Seeding Changes’ (or ‘LSC’) parameter of the algorithm controls how many seeds drop next to trees and become neighbors. The dimension of the problem domain should be used to determine this parameter.
A local seeding operator is applied to every tree in the algorithm, which begins with all trees having an age of 0. New trees are added for every zero-aged tree. As iterations continue, fewer trees are introduced, since older trees do not take part in the local seeding step. By simulating local search and truncating values that fall below the lower bound or exceed the upper bound, the technique keeps variable values within their associated boundaries. In this way the algorithm controls the number of trees.
(iii) Population limiting
Two criteria are employed to stop the spread of the forest: “life time” and “area limit”. Trees whose age exceeds the “life time” threshold are removed and form the candidate population. If there are more trees than the forest's “area limit”, the extra trees are also added to the candidate population. The number of initial trees is regarded as the same as the “area limit” parameter. A portion of the candidate population is subjected to the global seeding stage following population limiting.
(iv) Global seeding of the trees
Numerous tree species are found in forests, and animals feeding on the seeds and fruits of these trees expand their habitats. Natural forces like wind and water also aid in the distribution of seeds across the forest, maintaining the presence of many tree species in various locations. The global seeding step uses a predetermined percentage of the candidate population as a parameter to replicate the dispersal of tree seeds in the forest. The global seeding operator chooses trees from the candidate population, selects variables at random from each tree, and swaps their values with other randomly generated values; a tree with age 0 is then introduced to the forest. The number of variables whose values are altered, referred to as Global Seeding Changes (GSC), controls this global search.
(v) Updating the best so far tree
After the trees have been sorted based on their fitness values, the tree with the highest fitness value is chosen as the best tree. To prevent the best tree from aging as a result of the local seeding stage, its age is thereafter set to 0. Because local seeding is performed on trees that are '0' years old, the best tree is able to locally optimize its location through the local seeding operator.
The dispersion of seeds around each tree is simulated, and the seeds represent potential new solutions, as illustrated in Eq. (25), in which \({S}_{i,j}\) represents the position of the \(j\)th seed of the \(i\)th tree, \({C}_{i,j}\) stands for the current position of the \(i\)th tree, \(r\) denotes the dispersal radius, and \(rand\left(-1,1\right)\) generates a random number in \(\left[-1,1\right]\).
The fitness of all seeds is evaluated and the best seed is selected to replace the corresponding tree; the tree's position is updated to the best seed's position as given in Eq. (26).
(vi) Stop condition
Three stop conditions are considered: (1) reaching the initial number of iterations; (2) the optimal tree's fitness value remaining constant over multiple iterations; (3) reaching the designated level of accuracy.
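As a minimal illustration of the FOA seeding step in Eqs. (25)-(26), the sketch below disperses random seeds around a tree within a dispersal radius and keeps the best candidate; the number of seeds, the radius, and the greedy acceptance are hypothetical parameters and simplifications, not the paper's full FOA procedure.

```python
import numpy as np

def local_seeding(tree, radius, n_seeds, fitness):
    """Disperse seeds around one tree (Eq. (25)) and keep the best one (Eq. (26))."""
    seeds = tree + radius * np.random.uniform(-1, 1, size=(n_seeds, tree.size))
    scores = np.array([fitness(s) for s in seeds])
    best = seeds[scores.argmax()]                       # best seed replaces the tree if it is fitter
    return best if fitness(best) > fitness(tree) else tree
```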
CSO: It mimics the pattern of crisscrossing, where solutions are combined and exchanged between different points to enhance convergence towards the global optimum40.
(i) Horizontal crossover
A horizontal crossover is an arithmetic crossover applied in every dimension between two distinct individuals. If the horizontal crossover operation is performed in the dth dimension by the ith parent individual \(X(i)\) and the jth parent individual \(X(j)\), Eqs. (27) and (28) are utilized to reproduce their offspring:
Here, \({r}_{1}\) and \({r}_{2}\) are uniformly distributed random values between 0 and 1, \({c}_{1}\) and \({c}_{2}\) are uniformly distributed random values between −1 and 1, and \({MS}_{hc}\left(i,d\right)\) and \({MS}_{hc}\left(j,d\right)\) represent the moderation solutions that are the offspring of \(X\left(i,d\right)\) and \(X\left(j,d\right).\) Equations (27) and (28) state that the horizontal crossover in a multidimensional solution space searches for new solutions (i.e., \({MS}_{hc}(i)\)) with a higher probability inside a hypercube that has the two paired parent individuals (i.e., \(X(i)\) and \(X(j)\)) as its diagonal vertices. In the meantime, to reduce the blind region that the parent individuals are unable to search, the horizontal crossover may sample new locations on the hypercube's periphery with a lower probability. The horizontal crossover's cross-border search technique sets it apart from the genetic algorithm.
The horizontal crossover search is a technique for locating better solutions within an iteration. Individuals are paired using a random permutation of the numbers from \(1\) to \(M\). \({MS}_{hc}(no1)\) and \({MS}_{hc}(no2)\) are the moderation solutions produced by the selected individuals \(X(no1)\) and \(X(no2)\). To find as many solutions as feasible, the horizontal crossover probability (\({P}_{1}\)) is usually set to 1. An individual's search scope is greatly influenced by the expansion coefficient (\({c}_{1}\) or \({c}_{2}\)). After the moderation solutions are generated, \({MS}_{hc}\) and its parent population \(X\) engage in a competitive operation; only the competition winners survive and are kept in the \({DS}_{hc}\) matrix.
(ii) Vertical crossover
A vertical crossover is an arithmetic crossover applied to every individual between two distinct dimensions. Assume that the individual's \(d1\)th and \(d2\)th dimensions are used; Eq. (29) then reproduces the child \({MS}_{vc}(i)\) through the vertical crossover procedure.
where \(r\) is a uniformly distributed random value between 0 and 1 and \({MS}_{vc}\left(i,d1\right)\) denotes the offspring of \(X\left(i,d1\right)\) and \(X\left(i,d2\right)\) (i.e., \({DS}_{hc}(i,d1)\) and \({DS}_{hc}(i,d2)\)). The vertical crossover search ensures that individuals stay inside the boundaries of each dimension by normalizing the population of dominant solutions (\({DS}_{hc}\)) from the horizontal crossover. Because it takes place between distinct dimensions of the same individual, it keeps swarm dimensions from being trapped in local minima. Stagnant dimensions can escape local optima without destroying another globally optimal dimension, since each vertical crossover operation produces a single offspring. The probability of vertical crossover is lower than that of horizontal crossover because only a small number of dimensions are caught in local minima.
(iii) Competitive operator
The competitive operator facilitates competition between the parent population and the offspring population. In the horizontal crossover, for instance, the offspring individual (i.e., the moderation solution \({MS}_{hc}(i)\)) survives as the dominant solution only when it performs better than its parent individual \(X(i)\); the same holds after the vertical crossover, where the winner is preserved in \({DS}_{vc}(i)\). Otherwise, the parent individual survives. This simple competitive process moves the population quickly to search regions with better fitness and accelerates the convergence rate to the global optimum. The crisscross pattern is applied to exchange information between solutions as described in Eq. (30), in which \({C}_{i,j}^{\prime}\) indicates the updated position of the \(i\)th tree, \({C}_{k,j}\) and \({C}_{i,j}\) refer to the positions of two other trees in the population, and \(\rho\) denotes the crisscross factor.
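For concreteness, the sketch below implements the standard CSO horizontal and vertical crossover formulas that Eqs. (27)-(29) describe; it follows the usual formulation of the algorithm and is not taken verbatim from the paper.

```python
import numpy as np

def horizontal_crossover(xi, xj):
    """Eqs. (27)-(28): arithmetic crossover of two individuals in every dimension."""
    r1, r2 = np.random.rand(xi.size), np.random.rand(xi.size)
    c1, c2 = np.random.uniform(-1, 1, xi.size), np.random.uniform(-1, 1, xi.size)
    ms_i = r1 * xi + (1 - r1) * xj + c1 * (xi - xj)   # offspring of X(i)
    ms_j = r2 * xj + (1 - r2) * xi + c2 * (xj - xi)   # offspring of X(j)
    return ms_i, ms_j

def vertical_crossover(x, d1, d2):
    """Eq. (29): arithmetic crossover between two dimensions of the same individual."""
    r = np.random.rand()
    child = x.copy()
    child[d1] = r * x[d1] + (1 - r) * x[d2]
    return child
```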
Proposed CSFOA: For optimal feature selection, the hybrid optimizer CSFOA selects the most relevant features by simulating biological and natural search processes. Its update mechanism is designed to balance exploration and exploitation effectively by alternating between the seed dispersal strategy of FOA and the crisscross pattern update of CSO using a weighted combination, as shown in Eq. (31), in which \({C}_{i,j}^{new}\) denotes the newly updated position of the \(i\)th tree in the \(j\)th dimension.
Also, \(a\) and \(b\) denote weighting factors that balance the contributions of the FOA and CSO components, respectively. These are dynamically adjusted depending on the iteration count to control exploration and exploitation, as expressed in Eq. (32), in which \(t\) and \({M}_{t}\) indicate the current and maximum iteration, respectively.
Early on, \(a\) is high, promoting exploration. As \(t\) approaches \({M}_{t}\), \(b\) increases, focusing more on exploitation. The fitness \(fit(x)\) is calculated based on the objective of accuracy \(A\) maximization as given in Eq. (33).
The fitness of the updated solution \({C}_{i,j}^{new}\) is evaluated using the objective function \(fit\left({C}_{i,j}^{new}\right)\). If the new position yields a better fitness value, the tree is updated to this new position, as stated in Eq. (34).
The fitness evaluation ensures that only improvements are retained, guiding the algorithm toward the global optimum. Algorithm 1 explains the pseudocode of developed CSFOA, and its hyper-parameters are manifested in Table 2.
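Since Algorithm 1 is not reproduced here, the following sketch only indicates how the weighted FOA/CSO update of Eqs. (31)-(32) and the greedy acceptance of Eq. (34) could be combined; the exact form of the FOA and CSO terms, the dispersal radius, and the choice of partner trees are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def csfoa_update(tree, partner_a, partner_b, radius, t, max_t, fitness):
    """Weighted FOA + CSO update (cf. Eqs. (31)-(32)); keep the new position only if fitter (Eq. (34))."""
    a = 1 - t / max_t       # exploration weight, decays over iterations (Eq. (32))
    b = t / max_t           # exploitation weight, grows over iterations
    foa_step = tree + radius * np.random.uniform(-1, 1, tree.size)        # FOA-style seed dispersal
    cso_step = tree + np.random.rand(tree.size) * (partner_a - partner_b) # crisscross-style exchange
    candidate = a * foa_step + b * cso_step                               # weighted combination (Eq. (31))
    return candidate if fitness(candidate) > fitness(tree) else tree      # greedy acceptance (Eq. (34))
```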
ArabSignNet-based detection
A new ArabSignNet using DenseNet-EfficientNet and attention-based Deep ResNet is proposed for enhanced detection. The architecture of the proposed model is shown in Fig. 15.
(3) Hybrid DenseNet and EfficientNet: DenseNet-12141 connects every layer to every other layer in a feed-forward fashion. EfficientNet-b042 scales the network's width, depth, and resolution based on a compound scaling method, achieving higher accuracy with fewer parameters. The input \({F}_{opt}\) is processed through DenseNet and EfficientNet in parallel to extract diverse and rich feature representations. The features from DenseNet, \({F}_{densenet}\left(x\right)\), and EfficientNet, \({F}_{efficientnet}\left(x\right)\), are concatenated to form a unified feature map \({F}_{DE}\left(x\right)\), as formulated in Eq. (35), in which \(x\) indicates the input. The architectures of the proposed hybrid DenseNet and EfficientNet are shown in Figs. 12 and 13, respectively.
(4) Attention-Based Deep ResNet: The optimal features \({F}_{opt}\) are fed into a deep ResNet3 with integrated attention mechanisms. This stage focuses on refining the features by emphasizing the most informative parts of the image, as stated in Eq. (36), in which \({F}_{att}\left(x\right)\) indicates the attention-weighted feature map, \({\varphi }_{i}\) stands for the weights assigned by the attention mechanism to different features, and \({F}_{resnet}^{i}\left(x\right)\) denotes the feature maps processed by ResNet. The architecture of the model is shown in Fig. 14.
This DL-based detection architecture focuses on both accuracy and computational efficiency.
(A) Ensemble Learning (EL) with Model Averaging.
An EL approach is implemented by combining the predictions from multiple models (DenseNet, EfficientNet, and ResNet) using model averaging to enhance the final detection accuracy. DenseNet, EfficientNet, and ResNet are trained separately on the same detection task, and each model learns to extract features and make predictions based on its unique architecture. After training, each model generates its prediction for a given input image as \({F}_{DE}\left(x\right)\) and \({F}_{att}\left(x\right)\). The predictions from the three models are combined by a concatenation layer, as shown in Eq. (37).
The final prediction \({F}_{final}\left(x\right)\) is used to make the detection decision. For classification, this involves selecting the class with the highest probability. Each model \({M}_{i}\) outputs a probability distribution over the classes, as defined in Eq. (38), in which \({P}_{{M}_{i}}\left(x\right)\) is used for the final decision.
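A minimal sketch of the model-averaging step: the per-class probabilities of the three branches are averaged and the arg-max class is returned. The function name and the simple mean are assumptions; the paper's Eq. (37) additionally concatenates the feature maps before this decision.

```python
import numpy as np

def ensemble_predict(prob_densenet, prob_efficientnet, prob_resnet):
    """Average the per-class probabilities of the three branches and pick the arg-max class."""
    probs = np.stack([prob_densenet, prob_efficientnet, prob_resnet])   # shape: (3, n_classes)
    averaged = probs.mean(axis=0)                                       # model averaging (cf. Eq. (38))
    return int(averaged.argmax()), averaged
```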
Table 3 summarizes the hyperparameter settings of suggested method.
Result and discussion
Python was employed to implement the proposed technique for Arabian sign language detection. Two databases, the Arabic Sign Language (ArSL) dataset (Database 1) and the RGB Arabic Alphabets Sign Language Dataset (Database 2), were utilized to verify the method. The performance of the suggested method was compared with multiple existing techniques, namely CNN35, ArSL-CNN32, AArSLRS31, and SVM30. Two different partition types were utilized on each database to assess the method. In the first setup, 30% of the data was employed for testing and 70% for training. In the second setup, 20% of the data was employed for testing and the remaining 80% for training. The method's performance metrics, including FPR, FNR, NPV, MCC, F-measure, accuracy, precision, sensitivity, and specificity, were evaluated. Table 4 outlines the database sizes, their impact on model accuracy, and potential overfitting risks during model training.
Analysis of the proposed model for Dataset 1 (70% variation in training rate)
The proposed model is compared with the existing models SVM30, AArSLRS31, ArSL-CNN32, and CNN35, as shown in Table 5. The above analyses are verified using 70% of the data from Database 1 for training. The comparison is evaluated employing significant performance metrics such as MCC, sensitivity, and accuracy. From the results, it can be determined that the newly proposed method surpasses the existing methods. In terms of accuracy, the proposed approach achieved a score of 98.14%, which is substantially higher than the 95.58% of SVM30 and an even larger advantage over the other approaches, which produce lower results. For sensitivity, the proposed model obtains a similar outcome, scoring 97.85%, which is better than the other competitors. In addition, the proposed model gained an MCC of 98.32%, surpassing SVM30 at 94.48%, AArSLRS31 at 93.82%, ArSL-CNN32 at 95.49%, and CNN35 at 94.01%. Therefore, the outcomes demonstrate that the suggested method surpasses the existing models for the Arabian sign language recognition process (Fig. 15).
Analysis of the proposed model for Dataset 1 (80% variation in training rate)
Table 6 compares the proposed method with SVM30, AArSLRS31, ArSL-CNN32, and CNN35, utilizing 80% of the data from Database 1 for training. The overall comparison concentrates on four major evaluation metrics: FPR, NPV, accuracy, and precision. The findings illustrate that the proposed method surpasses the existing models. The proposed method (99.92%) exceeds SVM30 (96.28%), ArSL-CNN (96.34%), and the other approaches in terms of accuracy. The precision results show an identical pattern, with the proposed model exceeding all other models with a score of 99.96%. With an NPV of 98.05%, the proposed method exceeded traditional methods such as SVM30, which generated an NPV of 95.82%. In addition, the CNN method35 recorded the highest FPR value of 0.058, while the proposed approach achieved the lowest value of 0.018. The findings demonstrate that the proposed approach beats existing techniques in accurately identifying Arabic sign language. Figure 16 provides a visual representation of the evaluation analysis.
Analysis of the suggested model for Dataset 2 (70% variation in training rate)
As demonstrated in Table 7, the proposed approach has been assessed against existing methods such as SVM30, AArSLRS31, ArSL-CNN32, and CNN35. To this end, 70% of Database 2's data is utilized to train the proposed Arabian sign language detection technique. Crucial performance metrics such as accuracy, specificity, F-measure, and FNR were utilized in the comparison study. It is obvious that the proposed approach performs significantly better than the methods presently in use. The results indicate that the proposed method accomplished an accuracy of 97.67%, which is higher than the accuracy rates of comparable models such as SVM30 at 95.24% and AArSLRS31 at 96.09%. In addition, the proposed method's specificity produced an optimal outcome of 97.91%. Also, the F-measure of the proposed method is 97.21%, which is greater than the values of SVM30 (95.22%) and AArSLRS31 (94.27%). The proposed approach surpassed existing methods, as demonstrated by the lowest FNR of 0.01357 obtained with the developed method. Results like these illustrate how accurate and effective the approach is in recognizing Arabian sign language. The visual representation of the evaluation is presented in Fig. 17.
Analysis of the proposed model for Dataset 2 (80% variation in training rate)
The proposed method is evaluated in Table 8, utilizing 80% of the data from Database 2 for training and comparing it with existing methods, including SVM30, AArSLRS31, ArSL-CNN32, and CNN35. Evaluation criteria including precision, accuracy, MCC, and FPR are employed in the comparative analysis. It is evident from the findings that the proposed method performed significantly more efficiently than the existing methods. In comparison with SVM30 at 97.13%, ArSL-CNN32 at 97.38%, and CNN35 at 95.06%, the proposed method achieves an accuracy of 98.37%. The proposed approach obtains 98.98% precision, which is the highest of all methods evaluated. It also achieved an MCC of 98.91%, exceeding SVM30 at 97.10%. In comparison with CNN35, whose maximum FPR is 0.0485, the minimum FPR of the proposed method is 0.0137. These findings demonstrate that the proposed approach exceeds all other models presently in use with respect to the detection parameters for Arabian sign language.
The training time and scalability metrics for various models on two datasets at different training rates (70% and 80%) are shown in Table 9. The suggested model outperforms other models like SVM (training time: 200 s, scalability: 56.02) and CNN (training time: 300 s, scalability: 64.06) for Dataset 1, with the lowest training time of 120 s and the highest scalability of 89.07 at 70% training. Likewise, the suggested model retains its effectiveness at 80% training with a 150-s training duration and 87.08 scalability. The proposed approach again records the greatest performance for Dataset 2 at 70% training, with a training time of 130 s and a scalability of 80. At 80% training, it records a training time of 160 s and a scalability of 88. In contrast to existing methods like SVM, AArSLRS, ArSL-CNN, and CNN, these findings show the suggested model’s greater efficiency and scalability, underscoring its applicability for Arabic sign language recognition.
Analysis of meta-heuristic algorithms with the proposed model
In the comparative analysis, several meta-heuristic algorithms are compared, namely Atom Search Optimization (ASO)43, Deer Hunting Optimization (DHO)44, and Particle Swarm Optimization (PSO)33. The performance in terms of several metrics, such as accuracy, precision, sensitivity, specificity, MCC, NPV, F-measure, FPR, and FNR, is evaluated and analysed in Tables 10, 11, 12 and 13.
Table 11 compares the proposed model with the meta-heuristic techniques (ASO, DHO, and PSO) on Database 1 with 70% training. The proposed model performs best, attaining the highest specificity (98.58%), accuracy (98.14%), and precision (99.20%). ASO comes second with an accuracy of 96.58%, while DHO and PSO trail with 96.05% and 95.48%, respectively. The proposed model also exhibits the lowest false-negative rate (1.67%) and false-positive rate (2.23%), demonstrating exceptional robustness. Compared with the meta-heuristic methods, it offers the best overall balance of accuracy, sensitivity, and specificity.
Table 12 assesses the proposed model and the meta-heuristic techniques (ASO, DHO, and PSO) on Database 1 with 80% training. The proposed model performs exceptionally well, with the lowest FPR (1.83%) and FNR (0.96%) along with the best accuracy (99.93%), precision (99.96%), and specificity (98.93%). ASO follows with an accuracy of 97.58%, ahead of DHO (97.04%) and PSO (96.48%). The proposed model also shows strong results across the remaining criteria, including the F-measure (98.76%) and sensitivity (98.17%). Overall, it offers the best trade-off between accuracy and error minimization.
Table 13 evaluates the proposed model against the meta-heuristic techniques (ASO, DHO, and PSO) on Database 2 with 70% training. In addition to having the lowest false-positive rate (2.23%) and false-negative rate (1.34%), the proposed model has the highest accuracy (97.68%), precision (97.69%), and specificity (97.91%). With an accuracy of 96.23%, ASO outperforms DHO (95.64%) and PSO (95.18%). The proposed model performs best overall, leading all others in MCC (97.30%) and F-measure (97.22%) while minimizing errors.
The proposed model is compared with the meta-heuristic techniques (ASO, DHO, and PSO) on Database 2 with 80% training in Table 12. Along with the lowest FPR (1.37%) and FNR (0.98%), the proposed model has the highest accuracy (98.38%), precision (98.99%), and specificity (98.54%). ASO performs well, although it lags behind the proposed model with an accuracy of 97.56%. DHO outperforms PSO on the majority of metrics, though their results are close. Overall, the proposed model provides the best trade-off between accuracy, sensitivity, and error reduction.
Comparative evaluation of the proposed model on the KArSL dataset against existing Arabic sign language recognition methods
The evaluation results (shown in Table 14) clearly show that the proposed model outperforms the existing models on practically all criteria. It reaches a peak accuracy of 99.20%, roughly 2.82% above the nearest competitor, ArSL-CNN32, and nearly 9% above SVM30, demonstrating its superiority in learning complex and variable sign forms. Its precision, recall, and F1-score are consistently above 99%, reflecting high discriminative power and balanced performance across all classes. MCC and Kappa values close to 1 strongly indicate agreement and robustness of the classification, even in imbalanced data scenarios. The proposed approach also shows very little variation between training and testing performance, suggesting that it benefits from a regularized architecture and from advanced augmentation techniques, a combination that largely avoids overfitting. In contrast, SVM30 is highly prone to overfitting, likely because it cannot generalize well to highly repetitive visual inputs involving only a few signers. Although SVM30 has the fastest training time and smallest model size, its lower classification precision makes it unsuitable for a production-grade Arabic Sign Language recognition system. The proposed model, while slightly larger and slower to train, compensates with high precision, fast inference, and lower error rates. Hence, the proposed architecture is suitable for real-time deployment, offering sufficient efficiency, generalization, and reliability in recognizing Arabic Sign Language from the KArSL dataset.
Comparative analysis of the proposed model with meta-heuristic optimization techniques for KArSL dataset
The proposed model, enhanced with the new meta-heuristic optimization approach, performs considerably better than conventional meta-heuristic schemes such as ASO, DHO, and PSO. As shown in Table 15, it achieves the best accuracy (99.20%), log loss (0.012), and convergence rate (63 iterations), demonstrating both its effectiveness and its computational efficiency.
- With reference to ASO: ASO shows good exploration, but its slow convergence leads to lower generalization, probably owing to instability in a high-dimensional parameter space. It is also more sensitive to hyperparameters, which can diminish its robustness.
- With reference to DHO: DHO delivers good accuracy and stability and follows the proposed method quite closely, especially in MCC and F1-score, but its convergence speed and optimization cost lag behind, making the proposed approach better suited to real-time or large-scale applications.
- With reference to PSO: PSO performs moderately on most metrics but falls short in accuracy and convergence time because of its tendency to become trapped in local optima, which limits its optimization capacity in complex deep learning parameter spaces.
The newly proposed optimization method strikes a better balance between exploration and exploitation, making it more dependable for navigating the loss landscape of deep learning models trained on complex visual-spatial gesture data such as KArSL.
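To make the exploration/exploitation trade-off concrete, the following minimal sketch shows a generic population-based wrapper for binary feature selection. It is not the paper’s Crisscross Seed Forest Optimization Algorithm; it only imitates the two ingredients discussed above (a crossover-style exploitation step toward the best mask and a random bit-flip exploration step) on synthetic data, with a stand-in KNN classifier as the fitness function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=40, n_informative=10, random_state=0)

def fitness(mask):
    """Validation accuracy of a simple classifier on the selected features."""
    if mask.sum() == 0:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

pop_size, n_iter, n_feat = 12, 20, X.shape[1]
pop = rng.integers(0, 2, size=(pop_size, n_feat))
scores = np.array([fitness(ind) for ind in pop])

for _ in range(n_iter):
    best = pop[scores.argmax()]
    for i in range(pop_size):
        # Exploitation: copy part of the current best mask (crossover-like step).
        child = pop[i].copy()
        cross = rng.random(n_feat) < 0.5
        child[cross] = best[cross]
        # Exploration: random bit flips (seed-dispersal-like step).
        flip = rng.random(n_feat) < 0.05
        child[flip] ^= 1
        s = fitness(child)
        if s > scores[i]:
            pop[i], scores[i] = child, s

print("best accuracy:", scores.max(), "features kept:", int(pop[scores.argmax()].sum()))
```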
Statistical analysis for Database 1, Database 2 and Database 3
The proposed model has been rigorously assessed against state-of-the-art methods on three databases, and the outcomes are presented in Tables 16, 17 and 18, respectively. On Database 1, the proposed model outperformed all competitors, with a mean accuracy of 98.14% against 95.58% for SVM, 94.81% for AArSLRS, 95.18% for ArSL-CNN, and 93.35% for CNN. A small standard deviation (± 0.23) and a tight 95% confidence interval (97.91–98.37) confirm that the model performs stably and consistently. The high t-test values and extremely low p values (e.g., 0.00005 vs CNN) indicate that the improvement is statistically significant. A similar pattern appears on Database 2, where the proposed model yields an accuracy of 97.67% with very limited deviation (± 0.18), surpassing methods such as AArSLRS (96.09%) and SVM (95.24%); p values well below 0.01 again confirm that the difference is statistically significant and reaffirm the model’s superiority across domains. Most notably, the model obtained 99.20% accuracy on Database 3, far ahead of all other approaches (e.g., SVM at 90.45% and ArSL-CNN at 96.38%). The low variance (± 0.12) and extremely low p values (e.g., < 0.00001) attest to the model’s stability and strong generalizability on highly complex datasets.
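A statistical comparison of this kind can be reproduced along the following lines. The per-run accuracies below are hypothetical placeholders, and it is assumed that an independent two-sample (Welch) t-test and a Student-t confidence interval are appropriate, since the paper does not specify the exact test variant.

```python
import numpy as np
from scipy import stats

# Hypothetical per-run accuracies (%); the paper reports means with +/- deviations.
proposed = np.array([98.3, 97.9, 98.1, 98.2, 98.0])
baseline = np.array([95.7, 95.4, 95.6, 95.5, 95.7])   # e.g., an SVM baseline

t_stat, p_value = stats.ttest_ind(proposed, baseline, equal_var=False)

mean = proposed.mean()
sem = stats.sem(proposed)
ci_low, ci_high = stats.t.interval(0.95, len(proposed) - 1, loc=mean, scale=sem)

print(f"t = {t_stat:.2f}, p = {p_value:.5f}")
print(f"mean = {mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```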
Ablation study
The ablation findings (presented in Tables 19, 20 and 21) further emphasize the contribution of each component of the architecture. The full model consistently leads every ablated variant on every dataset, indicating synergy between the modules. Removing the attention module causes a clear drop in accuracy: from 98.14% to 96.72% on Database 1 and from 99.20% to 98.14% on Database 3, underlining how much attention contributes to feature focus and discriminative learning. Data augmentation is also crucial: its removal reduces accuracy by 1.25–1.57%, highlighting its contribution to generalization. Removing the metaheuristic optimization module severely degrades performance, indicating that this optimizer promotes more efficient feature selection and convergence. Finally, the residual connections greatly help stabilize deep model training, since their removal triggers the sharpest drops in accuracy (up to 3.64% on Database 3), corroborating their role in preventing vanishing gradients and promoting feature reuse.
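The reported degradations can be read either as absolute percentage-point drops or as relative percentages; the paper does not say which. The short sketch below computes both for the attention-ablation figures quoted above.

```python
# Accuracy drop when an ablated variant is compared to the full model.
# Whether the reported degradations are absolute (percentage points) or
# relative (%) is not stated, so this sketch computes both.
def degradation(full_acc, ablated_acc):
    absolute = full_acc - ablated_acc                        # percentage points
    relative = 100.0 * (full_acc - ablated_acc) / full_acc   # percent of the full score
    return absolute, relative

for name, full, ablated in [
    ("no attention, Database 1", 98.14, 96.72),
    ("no attention, Database 3", 99.20, 98.14),
]:
    abs_drop, rel_drop = degradation(full, ablated)
    print(f"{name}: -{abs_drop:.2f} points ({rel_drop:.2f}% relative)")
```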
Cross-validation
The model’s ability to generalize is demonstrated by its strong results in the cross-validation analysis (K = 5) across all three databases (presented in Table 22), with accuracy ranging from 97.70 to 99.20%. Similarly high precision and recall (sensitivity) values show that the model reliably detects both positive and negative instances, which is important for tasks requiring precise classification. Its specificity and MCC, which consistently exceed 97%, demonstrate its ability to avoid false positives and to sustain balanced performance across all metrics. Its efficacy is further supported by low FPR (false-positive rate) and FNR (false-negative rate) values, especially on Database 3, where the FPR is as low as 0.70%. Overall, the high F-measure and steady performance demonstrate the model’s robustness.
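A minimal sketch of the K = 5 cross-validation protocol is given below. It uses synthetic data and a stand-in scikit-learn classifier rather than the proposed deep network, purely to illustrate how stratified folds and macro-averaged metrics would be aggregated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.linear_model import LogisticRegression

# Stand-in data and classifier; the paper applies K = 5 folds to its deep model.
X, y = make_classification(n_samples=500, n_features=30, n_classes=3,
                           n_informative=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=cv,
                        scoring=["accuracy", "precision_macro", "recall_macro"])

for key in ["test_accuracy", "test_precision_macro", "test_recall_macro"]:
    print(f"{key}: {scores[key].mean():.4f} +/- {scores[key].std():.4f}")
```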
Robustness tests
The model performs well in the robustness tests (see Table 23), with log loss values between 0.012 and 0.013, indicating strong calibration and close alignment between the predicted probabilities and the actual labels. Its dependability is demonstrated by the near-perfect agreement between predicted and actual labels, as indicated by a Kappa score ranging from 0.972 to 0.993. Generalization scores between 97.1 and 99.0% indicate a low risk of overfitting, showing that the model performs well on unseen data and retains a high capacity for generalization without becoming overly focused on the training set.
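Log loss and the Kappa score can be computed directly from predicted probabilities and labels, as in the sketch below; the arrays shown are illustrative placeholders, not outputs of the proposed model.

```python
import numpy as np
from sklearn.metrics import log_loss, cohen_kappa_score

# Illustrative predictions only; replace with the model's test-set outputs.
y_true = np.array([0, 1, 2, 1, 0, 2])
y_prob = np.array([[0.90, 0.05, 0.05],
                   [0.10, 0.85, 0.05],
                   [0.05, 0.05, 0.90],
                   [0.20, 0.70, 0.10],
                   [0.80, 0.15, 0.05],
                   [0.10, 0.10, 0.80]])
y_pred = y_prob.argmax(axis=1)

print("log loss:", log_loss(y_true, y_prob))           # calibration of predicted probabilities
print("kappa:   ", cohen_kappa_score(y_true, y_pred))  # chance-corrected agreement
```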
Computational cost analysis
The model remains computationally efficient, as shown by the outcomes in Table 24. Training is fast enough for rapid prototyping and deployment, ranging from 75.0 to 85.0 s across the three databases. The low inference time of 2.1 to 3.9 ms/sample provides the real-time prediction capability required by time-sensitive applications. Furthermore, the model size remains manageable, between 18.3 MB and 27.5 MB, making it suitable for deployment on devices with constrained storage.
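The quantities in Table 24 (training time, per-sample inference time, and model size) can be measured with standard tooling. The sketch below does so for a stand-in scikit-learn model; the deployed deep network would be timed the same way, and the file name model.pkl is a hypothetical choice.

```python
import os, time, pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Stand-in model and data; the same measurements apply to the deployed network.
X, y = make_classification(n_samples=2000, n_features=64, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

t0 = time.perf_counter()
model.fit(X, y)
train_time = time.perf_counter() - t0

t0 = time.perf_counter()
model.predict(X)
infer_ms_per_sample = 1000.0 * (time.perf_counter() - t0) / len(X)

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
size_mb = os.path.getsize("model.pkl") / 1e6

print(f"training time: {train_time:.1f} s, "
      f"inference: {infer_ms_per_sample:.2f} ms/sample, size: {size_mb:.1f} MB")
```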
Performance evaluation and comparative analysis of the proposed model across ArSL, RGB Arabic alphabet, and KArSL datasets
As per Tables 25, 26 and 27, the proposed model consistently surpasses both traditional machine learning and deep learning models in accuracy across the three benchmark Arabic Sign Language datasets: ArSL, RGB Arabic Alphabet, and KArSL. On the ArSL dataset, it achieved an accuracy of 98.14%, clearly ahead of SVM (95.58%) and ArSL-CNN (95.18%), indicating stronger pattern recognition and generalization capabilities. On the RGB Arabic Alphabet dataset, it achieved 97.67%, higher than AArSLRS at 96.09% and CNN at 94.19%, reflecting its robustness in alphabet-level gesture recognition under the RGB visual modality. On the KArSL dataset, the model’s efficacy was further established, with an accuracy of 99.2%, an F1-score of 99.2%, a ROC-AUC of 99.45%, and a low log loss of 0.012. It also achieved highly reliable classification, with a Kappa score of 0.993 and an MCC of 0.99. Furthermore, the model remains resistant to overfitting and has a very fast inference time of 2.1 ms/sample, making it well suited for real-time deployment. Collectively, these results show that the proposed approach not only achieves the best accuracy but is also efficient, stable, and scalable across diverse Arabic Sign Language datasets.
Conclusion
In conclusion, the proposed G-TverskyUNet3+ model represents a significant improvement in Arabic Sign Language (ArSL) detection, providing a more accurate and efficient method for recognizing the gestures used by deaf and hard-of-hearing communities in the Arab world. By integrating the advanced GNeXt backbone with multiple skip connections and attention modules, the method enhances feature extraction and training efficiency compared with the previous 3D U-Net architecture. The use of image pre-processing and augmentation methods further contributes to the model’s robustness, enabling it to handle complex and varied input data effectively. The results from two datasets validate the model’s performance, with the best accuracy achieved on Database 2 and a clear improvement as the training rate increases from 70 to 80%. The G-TverskyUNet3+ model demonstrates high accuracy, reaching 0.97675 with 70% of the training data and 0.98376 with 80%, underscoring its potential for real-world applications. This method addresses the unique challenges of ArSL detection, such as the nuances of Arabic body language, hand gestures, and facial expressions, making it a valuable tool for improving communication accessibility in educational, social, and everyday settings. Ultimately, this research sets a new standard for ArSL detection technology and opens the door to future innovations aimed at further improving the accuracy, speed, and applicability of sign language recognition systems.
Data availability
The Arabic Sign Language datasets provide essential resources for developing recognition systems. Dataset 1 (available at https://www.kaggle.com/datasets/sabribelmadoui/arabic-sign-language-augmented-dataset ) contains 15,086 images (13,926 for training, 870 for validation, and 290 for testing), each sized 416 × 416 pixels and captured in diverse settings to enhance model robustness. Dataset 2 (available at https://www.kaggle.com/datasets/muhammadalbrham/rgb-arabic-alphabets-sign-language-dataset ) includes 7857 labeled RGB images of Arabic sign language alphabets, collected from over 200 participants under various conditions, ensuring quality for effective classification model training. Dataset 3 can be downloaded from https://www.kaggle.com/datasets/yousefdotpy/karsl-502. KArSL is one of the largest video databases for word-level Arabic sign language. The database comprises 502 isolated sign words collected with the Microsoft Kinect V2. Each sign is performed by three professional signers, and each signer repeated each sign 50 times, yielding 75,300 samples in total (502 × 3 × 50).
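A possible loading recipe for Dataset 1 is sketched below. The directory names arsl_dataset/train and arsl_dataset/valid are assumptions about how the extracted Kaggle archive might be organized (one sub-folder per sign class); the actual layout may differ.

```python
import tensorflow as tf

# Assumes the archive has been extracted to "arsl_dataset/" with one
# sub-folder per sign class; adjust the paths to the real folder layout.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "arsl_dataset/train",
    image_size=(416, 416),   # Dataset 1 images are 416 x 416 pixels
    batch_size=32,
    label_mode="categorical",
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "arsl_dataset/valid",
    image_size=(416, 416),
    batch_size=32,
    label_mode="categorical",
)

print("classes:", train_ds.class_names)
```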
References
Awwad, S., Idwan, S. & Gharaibeh, H. Real-time sign languages character recognition. Int. J. Comput. Appl. Technol. 65(1), 36–44 (2021).
Baihan, A., Alutaibi, A. & Sharma, S. Sign language recognition using modified deep learning network and hybrid optimization: A hybrid optimizer (HO) based optimized CNNSa-LSTM approach (2024). https://doi.org/10.21203/rs.3.rs-4876563/v1.
Alshehri, M., Sharma, S., Gupta, P. & Shah, S. Empowering the visually impaired: Translating handwritten digits into spoken language with HRNN-GOA and Haralick features. J. Disability Res. 3, 1–23. https://doi.org/10.57197/JDR-2023-0051 (2024).
Moin, A. et al. A wearable biosensing system with in-sensor adaptive machine learning for hand gesture recognition. Nature Electronics 4(1), 54–63 (2021).
Luqman, H. & El-Alfy, E. S. M. Towards hybrid multimodal manual and non-manual Arabic sign language recognition: MArSL database and pilot study. Electronics 10(14), 1739 (2021).
Swain, S. K. Hearing Loss and its Impact in the Community. Matrix Sci. Medica 8(1), 1–5 (2024).
Power, J. M., Grimm, G. W. & List, J. M. Evolutionary dynamics in the dispersal of sign languages. R. Soc. Open Sci. 7(1), 191100 (2020).
Ahmed, A. M. et al. Arabic sign language intelligent translator. Imaging Sci. J. 68(1), 11–23 (2020).
Herbaz, N., El Idrissi, H. & Badri, A. Advanced sign language recognition using deep learning: A study on Arabic Sign Language (ArSL) with VGGNet and ResNet50 Models (2025).
Uddin, M. Z., Boletsis, C. & Rudshavn, P. Real-time Norwegian sign language recognition using MediaPipe and LSTM. Multimodal Technol. Interact. 9(3), 23 (2025).
Alsaadi, Z. et al. A real time Arabic sign language alphabets (ArSLA) recognition model using deep learning architecture. Computers 11(5), 78 (2022).
Abdul Ameer, R. S., Ahmed, M. A., Al-Qaysi, Z. T., Salih, M. M. & Shuwandy, M. L. Empowering communication: A deep learning framework for Arabic sign language recognition with an attention mechanism. Computers 13(6), 153 (2024).
Noor, T. H. et al. Real-time arabic sign language recognition using a hybrid deep learning model. Sensors 24(11), 3683 (2024).
Hasasneh, A. Arabic sign language characters recognition based on a deep learning approach and a simple linear classifier. Jordanian J. Comput. Inf. Technol. 6(3), 281–290 (2020).
Boulesnane, A., Bellil, L. & Ghiri, M. G. A hybrid CNN-random forest model with landmark angles for real-time arabic sign language recognition. Neural Comput. Appl. 37(4), 2641–2662 (2025).
Al Ahmadi, S., Muhammad, F. & Al Dawsari, H. Enhancing Arabic sign language interpretation: Leveraging convolutional neural networks and transfer learning. Mathematics 12(6), 823 (2024).
Hussein, L. A. & Mohammed, Z. S. ArSLR-ML: A Python-based machine learning application for arabic sign language recognition. Softw. Impacts 24, 100746 (2025).
Balat, M., Awaad, R., Zaky, A. B. & Aly, S. A. Revolutionizing communication with deep learning and XAI for enhanced Arabic sign language recognition (2025). arXiv preprint arXiv:2501.08169.
Jamil, T. Design and implementation of an intelligent system to translate Arabic text into arabic sign language. In 2020 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), 1–4. (IEEE, 2020).
Dhanjal, A. S. & Singh, W. An automatic machine translation system for multi-lingual speech to Indian sign language. Multimedia Tools Appl. 81(3), 4283–4321 (2022).
Aly, S. & Aly, W. DeepArSLR: A novel signer-independent deep learning framework for isolated arabic sign language gestures recognition. IEEE Access 8, 83199–83212 (2020).
Balaha, M. M. et al. A vision-based deep learning approach for independent-users Arabic sign language interpretation. Multimedia Tools Appl. 82(5), 6807–6826 (2023).
Alamri, F. S., Rehman, A., Abdullahi, S. B. & Saba, T. Intelligent real-life key-pixel image detection system for early Arabic sign language learners. PeerJ Comput. Sci. 10, e2063 (2024).
Aldhahri, E. et al. Arabic sign language recognition using convolutional neural network and mobilenet. Arab. J. Sci. Eng. 48(2), 2147–2154 (2023).
Lahiani, H. & Frikha, M. Exploring CNN-based transfer learning approaches for Arabic alphabets sign language recognition using the ArSL2018 dataset. Int. J. Intell. Eng. Inform. 12(2), 236–260 (2024).
Alawwad, R. A., Bchir, O. & Ismail, M. M. B. Arabic sign language recognition using Faster R-CNN. Int. J. Adv. Comput. Sci. Appl. 12(3), 692–700 (2021).
Zakariah, M., Alotaibi, Y. A., Koundal, D., Guo, Y. & Mamun Elahi, M. Sign language recognition for Arabic alphabets using transfer learning technique. Comput. Intell. Neurosci. 2022(1), 4567989 (2022).
Luqman, H. ArabSign: A multi-modality dataset and benchmark for continuous Arabic Sign Language recognition. In 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG), 1–8. (IEEE, 2023).
Bencherif, M. A. et al. Arabic sign language recognition system using 2D hands and body skeleton data. IEEE Access 9, 59612–59627 (2021).
Hisham, B. & Hamouda, A. Arabic sign language recognition using Ada-Boosting based on a leap motion controller. Int. J. Inf. Technol. 13(3), 1221–1234 (2021).
Tharwat, G., Ahmed, A. M. & Bouallegue, B. Arabic sign language recognition system for alphabets using machine learning techniques. J. Electr. Comput. Eng. 2021(1), 2995851 (2021).
Alani, A. A. & Cosma, G. ArSL-CNN: a convolutional neural network for Arabic sign language gesture recognition. Indones. J. Electr. Eng. Comput. Sci. 22, 1096–1107 (2021).
Bansal, S. R., Wadhawan, S. & Goel, R. mrmr-pso: A hybrid feature selection technique with a multiobjective approach for sign language recognition. Arab. J. Sci. Eng. 47(8), 10365–10380 (2022).
Miah, A. S. M., Shin, J., Hasan, M. A. M. & Rahim, M. A. Bensignnet: Bengali sign language alphabet recognition using concatenated segmentation and convolutional neural network. Appl. Sci. 12(8), 3933 (2022).
Sharma, S. & Singh, S. Recognition of Indian sign language (ISL) using deep learning model. Wirel. Pers. Commun. 123(1), 671–692 (2022).
Alyami, S., Luqman, H. & Hammoudeh, M. Isolated arabic sign language recognition using a transformer-based model and landmark keypoints. ACM Trans. Asian Low-Resource Lang. Inf. Process. 23(1), 1–19 (2024).
AlKhuraym, B. Y., Ismail, M. M. B., & Bchir, O. Arabic sign language recognition using lightweight CNN-based architecture. Int. J. Adv. Comput. Sci. Appl. 13(4) (2022).
Shanableh, T. Two-stage deep learning solution for continuous Arabic Sign Language recognition using word count prediction and motion images. IEEE Access 11, 126823–126833 (2023).
Rwelli, R. E., Shahin, O. R. & Taloba, A. I. Gesture based Arabic sign language recognition for impaired people based on convolution neural network (2022). arXiv preprint arXiv:2203.05602.
Meng, A., Li, Z., Yin, H., Chen, S. & Guo, Z. Accelerating particle swarm optimization using crisscross search. Inf. Sci. 329, 1–13 (2015).
Arulananth, T. S. et al. Classification of paediatric pneumonia using modified DenseNet-121 deep-learning model. IEEE Access 1–1 (2024).
Pamungkas, Y. et al. Implementation of EfficientNet-B0 architecture in malaria detection system based on patient red blood cell (RBC) images. In Proc. ICITRI 2024, 123–128 (2024).
Marzouk, R., Alrowais, F., Al-Wesabi, F. N. & Hilal, A. M. Atom search optimization with deep learning enabled arabic sign language recognition for speaking and hearing disability persons. Healthcare 10(9), 1606 (2022).
Al-onazi, B. B. et al. Arabic sign language gesture classification using deer hunting optimization with machine learning model. Comput. Mater. Continua 75(2), 3413–3429 (2023).
Acknowledgements
The author extends appreciation to the Deanship of Postgraduate Studies and Scientific Research at Majmaah University for funding this research work through project number (R-2025-1806).
Author information
Contributions
All authors have contributed equally.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Mohamed, A.A., Al-Saleh, A., Sharma, S.K. et al. Attention-based hybrid deep learning model with CSFOA optimization and G-TverskyUNet3+ for Arabic sign language recognition. Sci Rep 15, 20313 (2025). https://doi.org/10.1038/s41598-025-03560-0