Table 1 Comparative summary of recent studies on deep learning approaches for visual speech and lip-reading recognition.

From: A multistream attention based neural network for visual speech recognition and sign language understanding

Each entry below lists the Study, Methodology, Dataset, Results, Contributions, Significance, and Limitations.

Study 23
Methodology: Long-term recurrent convolutional network with three convolutional layers (LRCN-3Conv); see the first code sketch after this table for the general CNN + RNN pattern.
Dataset: MIRACL-VC1
Results: 95.42% accuracy on word test data and 95.63% on phrases; 90.67% accuracy on the word-labeled class.
Contributions: The performance is attributed to MediaPipe Face Mesh detection, which enables precise identification of the lip region, combined with sophisticated deep learning methods and accurate landmark detection.
Significance: The results point to increased communication accessibility for people with hearing impairments.
Limitations: A more practical and efficient technique is still needed for lip-reading recognition to perform well in real-world settings.

Study 22
Methodology: 3D-CNN
Dataset: MIRACL-VC1
Results: Precision of around 89% on words.
Contributions: The system comprises a Conv3D algorithm that matches words to their corresponding visemes, and a feature-extraction technique that converts lip features into a visual feature cube.
Significance: Outperforms the previous system and gives higher classification accuracy.
Limitations: N/A

Study 19
Methodology: An additional authentication method that uses face recognition together with the distinct temporal motion of facial features while a password is spoken.
Dataset: MIRACL-VC1
Results: Accuracy of 98.1%.
Contributions: The approach is data-efficient, performing well with as few as ten positive video samples, and its effectiveness is demonstrated by benchmarking against other face recognition and lip-reading algorithms.
Significance: Language constraints do not impede the proposed paradigm, because users can set passwords in any language.
Limitations: N/A

Study 24
Methodology: A deep neural network combining a convolutional neural network (CNN) and a recurrent neural network (RNN).
Dataset: MIRACL-VC1
Results: Accuracy of 96.1%.
Contributions: The approach uses face recognition and each person's distinct temporal facial-feature movements as they pronounce a password.
Significance: The methodology imposes no language restrictions on password specification.
Limitations: Optimizing testing time remains a significant obstacle to introducing the authentication system on PCs and mobile devices.

Study 16
Methodology: Convolutional neural network (CNN)
Dataset: MIRACL-VC1
Results: Accuracy of 76%.
Contributions: Predicts phrases spoken in silent videos across a variety of languages.
Significance: Produces performance that is noticeably better than earlier proposed methods.
Limitations: N/A

Study 17
Methodology: Visual feature extraction (ResNet), contextual relationship modeling (Transformer encoder with multi-head attention), alignment (CTC), and decoding (prefix beam search); see the second code sketch after this table.
Dataset: 22 sentences (a conversation between a vendor and a customer in a shop).
Results: The proposed model achieves a best Word Error Rate (WER) of 12.70 on the test split, improving on the single-stream model's best WER of 17.41 and suggesting that a multi-modal approach improves overall SLR.
Contributions: A novel deep learning model for continuous sign language recognition (CSLR) of Sri Lankan Sign Language (SSL) that integrates a signer's hand and lip movements, addressing the lack of suitable vision-based public datasets.
Significance: The model has the potential to significantly enhance communication accessibility and quality of life for the hearing-impaired in Sri Lanka and beyond.
Limitations: Owing to significant syntactic differences, the model may not apply to other sign languages such as American Sign Language or Persian Sign Language.

Study 25
Methodology: Introduces a cross-modal attention module that can be integrated into any existing network for unimodal continuous sign language recognition or translation.
Dataset: RWTH-PHOENIX-2014
Results: Reduced the WER by 0.9 on the recognition task and improved most BLEU scores by approximately 0.6 on the translation task.
Contributions: Explores the feasibility of incorporating a new modality through a lightweight cross-modal encoder, eliminating the need for a separate feature extractor, in an end-to-end manner.
Significance: Applies cross-modal attention to stochastic Transformer networks with linear competing units, with the cross-modal attention module used in addition to the original pipeline.
Limitations: No experiments were conducted with the ensembled version in this work.

Study 10
Methodology: PiSLTRc: Position-Informed Sign Language Transformer with Content-Aware and Temporal Convolution Layers.
Dataset: RWTH-PHOENIX-Weather-2014T
Results: Accuracy of 87%; improved BLEU scores (better translation quality).
Contributions: Introduced position-aware temporal convolution for better spatial-temporal understanding.
Significance: Achieved state-of-the-art translation quality on gloss-free sign language translation (SLT) tasks.
Limitations: Computationally intensive; requires a high-end GPU; not optimized for mobile or real-time settings.

Study 11
Methodology: SF-Transformer: an encoder-decoder model with 2D/3D convolutions from SF-Net and Transformer decoders.
Dataset: Chinese Sign Language (CSL)
Results: Accuracy of 84%; fast convergence.
Contributions: Combines spatial-temporal CNNs with a Transformer to improve speed and edge deployability.
Significance: Demonstrates practical feasibility for mobile sign-translation systems.
Limitations: Limited to the CSL dataset; generalization to other sign languages not demonstrated.

Study 20
Methodology: RTG-Net: a region-aware temporal graph convolutional network with keypoint selection and re-parameterization.
Dataset: RWTH-PHOENIX-Weather-2014T
Results: Accuracy of 83%.
Contributions: Designed for real-time sign translation on edge devices using a lightweight GCN architecture.
Significance: Efficient for embedded systems; good trade-off between accuracy and runtime.
Limitations: Slightly lower accuracy than Transformer-based models; not yet validated across diverse datasets.
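
For the lip-reading entries that pair a CNN front end with a recurrent back end (studies 23 and 24), the following minimal PyTorch sketch illustrates the general LRCN pattern. It is an illustrative reconstruction, not the cited authors' code: the class name LRCNLipReader, the frame size, channel widths, hidden size, and class count are all assumptions.

```python
# Minimal LRCN-style sketch (assumed shapes/hyper-parameters): a 2D CNN
# encodes each mouth-crop frame, an LSTM models the frame sequence, and a
# linear head classifies the spoken word or phrase.
import torch
import torch.nn as nn

class LRCNLipReader(nn.Module):
    def __init__(self, num_classes=10, hidden_size=256):
        super().__init__()
        # Three convolutional blocks, mirroring the "LRCN-3Conv" description.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # -> (B*T, 128, 1, 1)
        )
        self.lstm = nn.LSTM(128, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, frames):                       # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)   # (B*T, 128)
        out, _ = self.lstm(feats.view(b, t, -1))             # (B, T, hidden)
        return self.head(out[:, -1])                 # logits from last step

# Usage: a batch of 8 clips, each with 22 RGB mouth crops of 64x64 pixels.
logits = LRCNLipReader()(torch.randn(8, 22, 3, 64, 64))
print(logits.shape)  # torch.Size([8, 10])
```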
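Study 17's pipeline (ResNet visual features, a Transformer encoder with multi-head attention, CTC alignment, and beam-search decoding) can be sketched as below. Again this is only a sketch under assumed shapes, vocabulary size, and hyper-parameters; the name VisualCTCRecognizer is hypothetical, prefix beam search is omitted, and the cited work's actual architecture may differ.

```python
# Sketch of a ResNet -> Transformer encoder -> CTC recognizer (assumptions
# throughout; decoding by prefix beam search is not shown).
import torch
import torch.nn as nn
import torchvision.models as models

class VisualCTCRecognizer(nn.Module):
    def __init__(self, vocab_size=1000, d_model=512):
        super().__init__()
        resnet = models.resnet18(weights=None)        # per-frame visual front end
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop fc
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.classifier = nn.Linear(d_model, vocab_size + 1)  # +1 for CTC blank

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).flatten(1)   # (B*T, 512)
        feats = self.encoder(feats.view(b, t, -1))                # (B, T, 512)
        return self.classifier(feats).log_softmax(-1)             # (B, T, V+1)

# CTC training step on dummy data (blank index 0, the nn.CTCLoss default).
model = VisualCTCRecognizer()
log_probs = model(torch.randn(2, 30, 3, 112, 112))                # (B, T, V+1)
targets = torch.randint(1, 1001, (2, 8))                          # label indices
loss = nn.CTCLoss()(log_probs.permute(1, 0, 2),                   # (T, B, V+1)
                    targets,
                    input_lengths=torch.full((2,), 30),
                    target_lengths=torch.full((2,), 8))
print(loss.item())
```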