Table 3 Step-by-step algorithm for the dual-path feature ViT model.
Step | Description |
|---|---|
1. Input processing | Load an input RGB image/video frame from the ASL Alphabet dataset. Resize the image to 64 × 64 pixels to maintain uniformity. Normalize pixel values to the [0, 1] range |
2. Data augmentation | Apply random rotation (± 20°), random horizontal flipping, brightness adjustments, and Gaussian noise addition to enhance generalization |
3. Dual-path feature extraction | The input image is fed into two parallel feature extraction paths: (1) Global Feature Path (captures full hand structure) and (2) Hand-Specific Path (focuses on key hand details) |
4. Patch embedding | Convert the input image into non-overlapping 16 × 16 patches (16 patches for a 64 × 64 input), followed by linear projection into fixed-length feature vectors |
5. Vision transformer encoding | Positional encodings are added to the patch embeddings, which then pass through multi-head self-attention and feed-forward layers to capture long-range dependencies in both feature paths |
6. Element-wise feature fusion | Fuse the feature maps from the global and hand-specific paths via element-wise multiplication to enhance discriminative hand features |
7. Fully connected layer | Pass the fused feature vector through dense layers (1024 neurons, ReLU activation) to learn complex gesture representations |
8. Classification head | The final feature vector is processed through a softmax layer, outputting a probability distribution over the 29 ASL Alphabet classes |
9. Prediction output | The predicted class label (one of the 29 ASL Alphabet classes) is generated based on the highest probability score from the softmax layer |
10. Model evaluation | The model is evaluated using metrics such as accuracy, precision, recall, F1-score, and inference speed (FPS) on the test set |
11. Real-time deployment (Optional) | The trained model can be integrated into a real-time ASL recognition system using webcam input for live sign language detection |
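Steps 1–2 correspond to a standard preprocessing and augmentation pipeline. The following is a minimal sketch assuming PyTorch/torchvision; the resize, rotation range, flipping, brightness adjustment, and Gaussian noise follow the table, but the noise standard deviation and brightness range are illustrative assumptions rather than values reported for the model.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add zero-mean Gaussian noise to a [0, 1] tensor image (assumed std)."""
    def __init__(self, std=0.02):
        self.std = std

    def __call__(self, img):
        return (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)

train_transform = transforms.Compose([
    transforms.Resize((64, 64)),             # Step 1: resize to 64 x 64
    transforms.RandomRotation(20),           # Step 2: random rotation within +/- 20 degrees
    transforms.RandomHorizontalFlip(),       # Step 2: random horizontal flipping
    transforms.ColorJitter(brightness=0.2),  # Step 2: brightness adjustment (assumed range)
    transforms.ToTensor(),                   # Step 1: pixel values scaled to [0, 1]
    AddGaussianNoise(std=0.02),              # Step 2: Gaussian noise addition (assumed std)
])
```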
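Steps 3–9 can be illustrated with a compact dual-path ViT sketch. This is not the authors' implementation: the embedding dimension, encoder depth, and number of attention heads are assumed, and because the table does not specify how the hand-specific input is derived, both paths receive the same image here (in practice the hand-specific path would typically operate on a cropped or hand-localized view).

```python
import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    """One feature-extraction path: patch embedding + ViT encoder (Steps 4-5)."""
    def __init__(self, img_size=64, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        n_patches = (img_size // patch) ** 2                   # 16 patches for 64x64 input
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # linear patch projection
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))          # positional encoding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        x = self.embed(x).flatten(2).transpose(1, 2)   # (B, n_patches, dim)
        x = self.encoder(x + self.pos)                 # multi-head self-attention + feed-forward
        return x.mean(dim=1)                           # pooled feature vector for this path

class DualPathViT(nn.Module):
    """Dual-path feature ViT (Steps 3-9): two parallel paths, element-wise fusion."""
    def __init__(self, num_classes=29, dim=256):
        super().__init__()
        self.global_path = PathEncoder(dim=dim)        # global hand structure
        self.hand_path = PathEncoder(dim=dim)          # hand-specific details (same input assumed here)
        self.fc = nn.Sequential(nn.Linear(dim, 1024), nn.ReLU())  # Step 7: dense layer, 1024 neurons
        self.head = nn.Linear(1024, num_classes)       # Step 8: logits for 29 classes

    def forward(self, x):
        fused = self.global_path(x) * self.hand_path(x)   # Step 6: element-wise multiplication
        return self.head(self.fc(fused))                  # softmax applied at inference / in the loss

# Usage: class probabilities for a random 64 x 64 RGB input (Step 9)
probs = torch.softmax(DualPathViT()(torch.rand(1, 3, 64, 64)), dim=-1)
```

Element-wise multiplication in Step 6 acts as a gating operation: hand-specific activations amplify or suppress the corresponding global features, which is the mechanism behind "enhancing discriminative hand features" in the table.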
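Step 10 reports accuracy, precision, recall, F1-score, and inference speed. A sketch of such an evaluation loop, assuming a PyTorch DataLoader over the test set and scikit-learn metrics (macro averaging is an assumption; the table does not state the averaging mode):

```python
import time
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(model, test_loader):
    """Step 10: accuracy, precision, recall, F1-score, and inference speed (FPS)."""
    model.eval()
    y_true, y_pred, n_images, elapsed = [], [], 0, 0.0
    with torch.no_grad():
        for images, labels in test_loader:
            start = time.time()
            logits = model(images)
            elapsed += time.time() - start
            n_images += images.size(0)
            y_pred.extend(logits.argmax(dim=1).tolist())
            y_true.extend(labels.tolist())
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='macro', zero_division=0)
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'fps': n_images / elapsed,
    }
```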
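Step 11 describes optional real-time deployment. A minimal OpenCV webcam loop, assuming the trained model and a class-name list; hand localization and frame buffering are not addressed in this sketch:

```python
import cv2
import torch

def run_webcam(model, class_names, device='cpu'):
    """Step 11: live ASL recognition from a webcam feed (illustrative only)."""
    model.eval()
    cap = cv2.VideoCapture(0)                                 # default webcam
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)          # OpenCV delivers BGR frames
        rgb = cv2.resize(rgb, (64, 64))                       # Step 1 sizing
        x = torch.from_numpy(rgb).float().div(255.0)          # normalize to [0, 1]
        x = x.permute(2, 0, 1).unsqueeze(0).to(device)        # (1, 3, 64, 64)
        with torch.no_grad():
            pred = model(x).softmax(dim=-1).argmax(dim=-1).item()
        cv2.putText(frame, class_names[pred], (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        cv2.imshow('ASL recognition', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):                 # quit with 'q'
            break
    cap.release()
    cv2.destroyAllWindows()
```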