Table 3 Step-by-step algorithm for the dual-path feature ViT model.

From: Recognizing American Sign Language gestures efficiently and accurately using a hybrid transformer model

Step 1 (Input processing): Load an input RGB image or video frame from the ASL Alphabet dataset, resize it to 64 × 64 pixels for uniformity, and normalize pixel values to the [0, 1] range.
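
As an illustration, a minimal preprocessing pipeline for this step might look as follows in PyTorch/torchvision; the interpolation mode and the file path `asl_sample.jpg` are assumptions, not details from the table.

```python
import torch
from torchvision import transforms
from PIL import Image

# Assumed preprocessing: resize to 64x64 and scale pixels to [0, 1].
# ToTensor() converts a PIL image (H, W, C) in [0, 255] to a float tensor
# (C, H, W) in [0, 1], matching the normalization described in step 1.
preprocess = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])

image = Image.open("asl_sample.jpg").convert("RGB")  # hypothetical file path
x = preprocess(image).unsqueeze(0)   # add batch dimension -> (1, 3, 64, 64)
print(x.shape, x.min().item(), x.max().item())
```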

Step 2 (Data augmentation): Apply random rotation (±20°), random horizontal flipping, brightness adjustment, and additive Gaussian noise to improve generalization.
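
These augmentations map directly onto standard torchvision transforms. Only the ±20° rotation magnitude comes from the table; the flip probability, brightness factor, and noise standard deviation below are assumed values.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Additive Gaussian noise on a [0, 1] tensor; std is an assumed value."""
    def __init__(self, std=0.05):
        self.std = std
    def __call__(self, x):
        return (x + torch.randn_like(x) * self.std).clamp(0.0, 1.0)

# Augmentations listed in step 2, applied to PIL images before ToTensor().
augment = transforms.Compose([
    transforms.RandomRotation(degrees=20),   # +/-20 degrees, per the table
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2),  # brightness adjustment
    transforms.ToTensor(),
    AddGaussianNoise(std=0.05),
])
```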

Step 3 (Dual-path feature extraction): Feed the preprocessed image into two parallel feature-extraction paths: (1) a global feature path that captures the full hand structure, and (2) a hand-specific path that focuses on key hand details.
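
The table does not describe the internals of the two paths, so the following PyTorch skeleton only captures the parallel-path structure itself, with placeholder encoders.

```python
import torch
import torch.nn as nn

class DualPathFeatures(nn.Module):
    """Skeleton of step 3: the same image is processed by two parallel
    encoders. The internal design of each path is not given in the table,
    so the paths are injected as arbitrary modules."""
    def __init__(self, global_path: nn.Module, hand_path: nn.Module):
        super().__init__()
        self.global_path = global_path   # captures full hand structure
        self.hand_path = hand_path       # focuses on key hand details

    def forward(self, x):
        return self.global_path(x), self.hand_path(x)

# Placeholder paths; in the real model each would be a patch-embedding plus
# transformer stack as in steps 4-5.
dual = DualPathFeatures(nn.Identity(), nn.Identity())
g, h = dual(torch.randn(1, 3, 64, 64))
```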

Step 4 (Patch embedding): Split the image into non-overlapping 16 × 16 patches and linearly project each patch into a fixed-length embedding vector.
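
A common way to implement ViT patch embedding is a strided convolution whose kernel size equals the patch size; the embedding width of 256 below is an assumption. With 64 × 64 inputs and 16 × 16 patches, each path produces 16 tokens.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Standard ViT-style patch embedding: a Conv2d with stride equal to its
    kernel size splits the image into non-overlapping patches and linearly
    projects each one. embed_dim=256 is an assumed value."""
    def __init__(self, img_size=64, patch_size=16, in_chans=3, embed_dim=256):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 16 for 64x64
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 64, 64)
        x = self.proj(x)                     # (B, embed_dim, 4, 4)
        return x.flatten(2).transpose(1, 2)  # (B, 16, embed_dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 64, 64))
print(tokens.shape)  # torch.Size([1, 16, 256])
```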

Step 5 (Vision Transformer encoding): Add positional encodings to the patch embeddings, then pass them through multi-head self-attention and feed-forward layers to capture long-range dependencies in both feature paths.
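
One possible realization of this step with PyTorch built-ins is sketched below; the encoder depth, head count, and embedding width are assumptions, and each path would hold its own encoder instance.

```python
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    """Sketch of step 5 using nn.TransformerEncoder. Depth, head count,
    and embed_dim are assumed; the table does not specify them."""
    def __init__(self, num_patches=16, embed_dim=256, num_heads=4, depth=4):
        super().__init__()
        # Learnable positional embeddings, added before self-attention.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):   # tokens: (B, num_patches, embed_dim)
        return self.encoder(tokens + self.pos_embed)
```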

Step 6 (Element-wise feature fusion): Fuse the global and hand-specific feature maps by element-wise multiplication, which amplifies discriminative hand features present in both paths.
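
In tensor terms the fusion is a Hadamard product of the two paths' outputs. The shapes below, and the mean pooling used to collapse the patch dimension into a single vector, are assumptions not stated in the table.

```python
import torch

# Element-wise (Hadamard) fusion of the two paths' outputs (assumed shapes).
global_feats = torch.randn(1, 16, 256)  # output of the global path
hand_feats = torch.randn(1, 16, 256)    # output of the hand-specific path
fused = global_feats * hand_feats       # (B, 16, 256), amplifies shared activations
fused_vec = fused.mean(dim=1)           # (B, 256) pooled feature vector (assumed pooling)
```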

Step 7 (Fully connected layer): Pass the fused feature vector through a dense layer (1024 neurons, ReLU activation) to learn complex gesture representations.
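
A minimal version of this dense head, assuming the 256-dimensional fused vector from the step 6 sketch; only the 1024-neuron width and ReLU come from the table, and the dropout is an added assumption.

```python
import torch.nn as nn

# Dense head from step 7: 1024 neurons with ReLU activation.
dense_head = nn.Sequential(
    nn.Linear(256, 1024),   # 256 input width assumed from the fusion sketch
    nn.ReLU(),
    nn.Dropout(0.5),        # regularization, an assumption not in the table
)
```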

Step 8 (Classification head): Process the final feature vector through a softmax layer to obtain a probability distribution over the 29 ASL Alphabet classes (the letters A–Z plus space, delete, and nothing); see the combined sketch after step 9.

Step 9 (Prediction output): Select the class with the highest probability score from the softmax layer as the predicted label.
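
Steps 8 and 9 together reduce to a linear layer, a softmax, and an argmax. The class count matches the 29-class ASL Alphabet dataset, and the 1024 input width follows from step 7.

```python
import torch
import torch.nn as nn

# Steps 8-9: map the 1024-d representation to 29 class probabilities and
# take the argmax as the prediction.
classifier = nn.Linear(1024, 29)

features = torch.randn(1, 1024)        # stand-in for the step-7 output
logits = classifier(features)
probs = torch.softmax(logits, dim=-1)  # probability distribution (step 8)
pred = probs.argmax(dim=-1)            # highest-probability class (step 9)
print(pred.item(), probs.max().item())
```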

Step 10 (Model evaluation): Evaluate the model on the test set using accuracy, precision, recall, F1-score, and inference speed in frames per second (FPS).
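
These metrics are straightforward to compute with scikit-learn; the labels and predictions below are hypothetical placeholders, and the macro averaging is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical test labels and predictions; in practice these come from
# running the trained model over the held-out test set.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1])

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
# FPS is computed separately as: number of frames / total inference time.
```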

Step 11 (Real-time deployment, optional): Integrate the trained model into a real-time ASL recognition system that classifies live webcam input for sign language detection.
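
A minimal OpenCV webcam loop for this step might look as follows. The randomly initialized stand-in model exists only so the sketch runs end to end; in practice it would be replaced by the trained dual-path ViT, with preprocessing matching step 1.

```python
import cv2
import torch
import torch.nn as nn

# Stand-in for the trained model; class names follow the ASL Alphabet dataset.
class_names = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + ["space", "del", "nothing"]
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, len(class_names)))
model.eval()

cap = cv2.VideoCapture(0)  # default webcam
with torch.no_grad():
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # Match the step-1 preprocessing: RGB, 64x64, [0, 1] scaling.
        rgb = cv2.cvtColor(cv2.resize(frame, (64, 64)), cv2.COLOR_BGR2RGB)
        x = torch.from_numpy(rgb).permute(2, 0, 1).float().div(255).unsqueeze(0)
        pred = model(x).softmax(-1).argmax(-1).item()
        cv2.putText(frame, class_names[pred], (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        cv2.imshow("ASL recognition", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()
```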