Table 1 Comparative summary of recent studies on deep learning approaches for visual speech and lip-reading recognition.
| Methodology | Dataset | Results | Contributions | Significance | Limitations |
|---|---|---|---|---|---|
| Long-term recurrent convolutional network with three convolutional layers (LRCN-3Conv) | MIRACL-VC1 dataset | 95.42% accuracy on word test data and 95.63% on phrases; 90.67% accuracy on the word-labeled class | Attributes the gains to MediaPipe Face Mesh detection, which enables precise localization of the lip region, combined with advanced deep learning methods and accurate landmark detection | Results point to increased communication accessibility for people with hearing impairments | A more practical and efficient technique is still needed for lip-reading recognition to perform well in real-world settings |
| 3D-CNN | MIRACL-VC1 dataset | Precision of around 89% on words | Comprises a Conv3D algorithm that matches words to their corresponding visemes and a feature extraction technique that converts lip features into a visual feature cube | Outperforms the previous system and achieves higher classification accuracy | N/A |
| Authentication method that combines face recognition with the distinct temporal motion of facial features while a password is spoken | MIRACL-VC1 dataset | Accuracy of 98.1% | The approach is data-efficient, benchmarked against other face recognition and lip-reading algorithms, and produces good results with as few as ten positive video samples | Users can set passwords in any language, so the model is not constrained by language | N/A |
| Deep neural network combining a convolutional neural network (CNN) and a recurrent neural network (RNN) | MIRACL-VC1 dataset | Accuracy of 96.1% | Uses face recognition together with each person's distinct temporal facial-feature movements while they pronounce a password | Imposes no language restrictions on the password specification | Optimizing testing time remains a significant obstacle to deploying the authentication system on PCs and mobile devices |
| Convolutional neural network (CNN) | MIRACL-VC1 dataset | Accuracy of 76% on key words | Predicts phrases from speakers in silent videos across a variety of languages | Performs noticeably better than earlier proposed methods | N/A |
| Visual feature extraction (ResNet), contextual relationship modeling (Transformer encoder with multi-head attention), alignment (CTC), and decoding (prefix beam search) | 22 sentences (a conversation between a vendor and a customer in a shop) | Best Word Error Rate (WER) of 12.70 on the test split, improving on the single-stream model's best WER of 17.41 and suggesting that a multi-modal approach improves overall SLR | Proposes a novel deep learning model for continuous sign language recognition (CSLR) of Sri Lankan Sign Language (SSL) that integrates the signer's hand and lip movements, addressing the lack of suitable vision-based public datasets | Has the potential to significantly enhance communication accessibility and quality of life for the hearing-impaired in Sri Lanka and beyond | Owing to significant syntax differences, the model may not apply to other sign languages such as American Sign Language or Persian Sign Language |
| Cross-modal attention module that can be integrated into any existing network for unimodal continuous sign language recognition or translation | RWTH-PHOENIX-2014 dataset | Reduced WER by 0.9 on the recognition task and increased most BLEU scores by approximately 0.6 on the translation task | Shows the feasibility of incorporating a new modality with a lightweight cross-modal encoder, eliminating the need for a separate feature extractor in an end-to-end manner | Applies cross-modal attention to stochastic transformer networks with linear competing units, with the cross-modal attention module used in addition to the original pipeline | No experiments were conducted with the ensembled version in this work |
| PiSLTRc: position-informed sign language Transformer with content-aware and temporal convolution layers | RWTH-PHOENIX-Weather-2014T | Accuracy of 87%; improved BLEU (better translation quality) | Introduces position-aware temporal convolution for better spatial-temporal understanding | Achieves state-of-the-art translation quality on gloss-free SLT tasks | Computationally intensive; requires a high-end GPU; not optimized for mobile or real-time settings |
| SF-Transformer: encoder-decoder model with 2D/3D convolutions from SF-Net and Transformer decoders | Chinese Sign Language (CSL) dataset | Accuracy of 84%; fast convergence | Combines spatial-temporal CNNs with a Transformer to improve speed and edge deployability | Demonstrates practical feasibility for mobile sign translation systems | Limited to the CSL dataset; generalization to other sign languages not demonstrated |
| RTG-Net: region-aware temporal graph convolutional network with keypoint selection and re-parameterization | RWTH-PHOENIX-Weather-2014T | Accuracy of 83% | Designed for real-time sign translation on edge devices using a lightweight GCN architecture | Efficient for embedded systems; good trade-off between accuracy and runtime | Slightly lower accuracy than Transformer-based models; not yet validated across diverse datasets |
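
Several of the studies above report results as Word Error Rate (WER) rather than accuracy (e.g., the multi-stream CSLR model and the cross-modal attention module). As a point of reference for comparing these numbers, WER is the word-level edit distance (substitutions + insertions + deletions) between the predicted and reference transcripts, divided by the number of reference words. The sketch below is a minimal, illustrative implementation of this metric; it is not taken from any of the surveyed papers, and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,             # deletion
                           dp[i][j - 1] + 1,             # insertion
                           dp[i - 1][j - 1] + sub)       # match/substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: one substitution and one deletion against a
# five-word reference gives WER = 2/5 = 0.4 (40%).
print(word_error_rate("the vendor sells fresh fruit", "the vendor sold fruit"))
```

A reported WER of 12.70, as in the multi-stream CSLR study, therefore corresponds to roughly 12.7 substituted, inserted, or deleted words per 100 reference words.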