Table 1 Comparative summary of recent studies on deep learning approaches for visual speech and lip-reading recognition.
| Methodology | Dataset | Results | Contributions | Significance | Limitations |
|---|---|---|---|---|---|
| Long-term recurrent convolutional network with three convolutional layers (LRCN-3Conv) | MIRACL-VC1 dataset | 95.42% accuracy on word test data and 95.63% on phrases; 90.67% accuracy on the word-labeled class | Attributes the gains to MediaPipe Face Mesh detection, which enables precise localization of the lip region, combined with advanced deep learning methods and accurate landmark detection | Results point to increased communication accessibility for people with hearing impairments | A more practical and efficient technique is still needed for lip-reading recognition to perform well in real-world settings |
| 3D-CNN | MIRACL-VC1 dataset | Precision of around 89% on words | Comprises a Conv3D algorithm that matches words to their corresponding visemes and a feature extraction technique that converts lip features into a visual feature cube | Outperforms the previous system and achieves higher classification accuracy | N/A |
| Authentication method that combines face recognition with the distinct temporal motion of facial features while a password is spoken | MIRACL-VC1 dataset | Accuracy of 98.1% | The approach is data-efficient, benchmarked against other face recognition and lip-reading algorithms, and produces good results with as few as ten positive video samples | Users can set passwords in any language, so the model is not constrained by language | N/A |
| Deep neural network combining a convolutional neural network (CNN) and a recurrent neural network (RNN) | MIRACL-VC1 dataset | Accuracy of 96.1% | Uses face recognition together with each person's distinct temporal facial-feature movements while they pronounce a password | Imposes no language restrictions on the password specification | Optimizing testing time remains a significant obstacle to deploying the authentication system on PCs and mobile devices |
| Convolutional neural network (CNN) | MIRACL-VC1 dataset | Accuracy of 76% on key words | Predicts phrases from speakers in silent videos across a variety of languages | Performs noticeably better than earlier proposed methods | N/A |
| Visual feature extraction (ResNet), contextual relationship modeling (Transformer encoder with multi-head attention), alignment (CTC), and decoding (prefix beam search) | 22 sentences (a conversation between a vendor and a customer in a shop) | Best Word Error Rate (WER) of 12.70 on the test split, improving on the single-stream model's best WER of 17.41 and suggesting that a multi-modal approach improves overall SLR | Proposes a novel deep learning model for continuous sign language recognition (CSLR) of Sri Lankan Sign Language (SSL) that integrates the signer's hand and lip movements, addressing the lack of suitable vision-based public datasets | Has the potential to significantly enhance communication accessibility and quality of life for the hearing-impaired in Sri Lanka and beyond | Owing to significant syntax differences, the model may not apply to other sign languages such as American Sign Language or Persian Sign Language |
| Cross-modal attention module that can be integrated into any existing network for unimodal continuous sign language recognition or translation | RWTH-PHOENIX-2014 dataset | Reduced WER by 0.9 on the recognition task and increased most BLEU scores by approximately 0.6 on the translation task | Shows the feasibility of incorporating a new modality with a lightweight cross-modal encoder, eliminating the need for a separate feature extractor in an end-to-end manner | Applies cross-modal attention to stochastic transformer networks with linear competing units, with the cross-modal attention module used in addition to the original pipeline | No experiments were conducted with the ensembled version in this work |
| PiSLTRc: position-informed sign language Transformer with content-aware and temporal convolution layers | RWTH-PHOENIX-Weather-2014T | Accuracy of 87%; improved BLEU (better translation quality) | Introduces position-aware temporal convolution for better spatial-temporal understanding | Achieves state-of-the-art translation quality on gloss-free SLT tasks | Computationally intensive; requires a high-end GPU; not optimized for mobile or real-time settings |
| SF-Transformer: encoder-decoder model with 2D/3D convolutions from SF-Net and Transformer decoders | Chinese Sign Language (CSL) dataset | Accuracy of 84%; fast convergence | Combines spatial-temporal CNNs with a Transformer to improve speed and edge deployability | Demonstrates practical feasibility for mobile sign translation systems | Limited to the CSL dataset; generalization to other sign languages not demonstrated |
| RTG-Net: region-aware temporal graph convolutional network with keypoint selection and re-parameterization | RWTH-PHOENIX-Weather-2014T | Accuracy of 83% | Designed for real-time sign translation on edge devices using a lightweight GCN architecture | Efficient for embedded systems; good trade-off between accuracy and runtime | Slightly lower accuracy than Transformer-based models; not yet validated across diverse datasets |
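
Several of the studies above report results as Word Error Rate (WER) rather than accuracy (e.g., the multi-stream CSLR model and the cross-modal attention module). As a point of reference for comparing these numbers, WER is the word-level edit distance (substitutions + insertions + deletions) between the predicted and reference transcripts, divided by the number of reference words. The sketch below is a minimal, illustrative implementation of this metric; it is not taken from any of the surveyed papers, and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,             # deletion
                           dp[i][j - 1] + 1,             # insertion
                           dp[i - 1][j - 1] + sub)       # match/substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: one substitution and one deletion against a
# five-word reference gives WER = 2/5 = 0.4 (40%).
print(word_error_rate("the vendor sells fresh fruit", "the vendor sold fruit"))
```

A reported WER of 12.70, as in the multi-stream CSLR study, therefore corresponds to roughly 12.7 substituted, inserted, or deleted words per 100 reference words.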