Abstract
Traditional English corpora mainly collect data from a single modality and lack multimodal information, which lowers the quality of the corpus and limits recognition accuracy. To address these problems, this paper proposes introducing depth information into multimodal corpora and studies both a construction method for an English multimodal corpus that integrates electronic images and depth information and a speech recognition method for that corpus. The adopted multimodal fusion strategy integrates speech signals with image information, including key visual cues such as the speaker's lip movements and facial expressions, and uses deep learning to mine acoustic and visual features. The acoustic models in the Kaldi toolkit were used for the experimental research, which led to the following conclusions. With 15-dimensional lip features at a signal-to-noise ratio (SNR) of 10 dB, the accuracy of corpus A under the monophone model was 2.4% higher than that of corpus B, and its accuracy under the triphone model was 1.7% higher than that of corpus B. With 32-dimensional lip features at an SNR of 10 dB, the accuracy of corpus A under the monophone model was 1.4% higher than that of corpus B, and its accuracy under the triphone model was 2.6% higher than that of corpus B. The English multimodal corpus with image and depth information thus achieves high accuracy, and the depth information helps to improve the accuracy of the corpus.
Introduction
At present, most English corpora are small in scale and unevenly distributed across domains, which limits model generalization in specific scenarios. In addition, annotation quality varies: especially in the fine-grained alignment of images and texts, some data may contain errors or inconsistencies. Moreover, existing multimodal corpora offer limited descriptions of complex interactive behaviors in real situations and cannot fully cover the rich relationships among language, vision, audio, and other modalities. Therefore, this paper studies a method of incorporating depth information into English multimodal corpora, in order to improve the speech recognition performance of the corpus and provide a valuable reference for related research.
With the rapid development of artificial intelligence, speech recognition is becoming increasingly important in human-computer interaction. However, traditional speech recognition systems based only on audio signals often perform poorly in practical applications. To solve this problem, researchers have begun to explore multimodal corpora that combine image and depth information, providing new solutions for speech recognition. This paper focuses on a speech recognition method based on an English multimodal corpus, which aims to improve the accuracy and stability of speech recognition by integrating multimodal information such as speech and images. In recent years, deep learning has achieved remarkable results in speech recognition, and its powerful feature learning ability has significantly improved recognition performance. Nevertheless, in complex and changeable real environments, single-modality speech recognition systems still face many problems, such as noise interference and differences among accents. Combining non-speech modalities such as images has therefore become a key way to improve speech recognition. Image information can capture non-speech features such as the speaker's lip movements and facial expressions, which play an important role in resolving ambiguity and noise in speech recognition. By integrating such multimodal information, the accuracy and robustness of speech recognition can be further improved, promoting its wider adoption in practical applications.
The innovation of the English multimodal corpus speech recognition method proposed in this paper is that it combines image and depth information and contributes in several respects. First, the method adopts an advanced multimodal fusion strategy that effectively integrates speech signals and image information within a deep learning framework. This strategy not only mines the acoustic characteristics of the speech signal but also uses image data to capture key visual information such as the speaker's lip movements and facial expressions, significantly improving recognition accuracy. Second, this paper makes an important contribution to corpus construction: by building an English multimodal corpus that combines image and depth information, it provides a valuable data resource for speech recognition research. The corpus contains high-quality voice recordings together with precisely corresponding lip-motion and facial-expression images, providing a solid data foundation for training multimodal fusion models. Finally, rigorous experimental verification confirms that the proposed multimodal speech recognition method significantly improves accuracy and stability compared with traditional methods. This result not only demonstrates the effectiveness of the method in practical applications but also opens a new path for the development of speech recognition technology.
Related work
Many scholars have studied speech recognition. To study the application of multimodal NLP teaching combined with speech recognition based on hybrid deep learning in spoken English practice, Xu Jin introduced the basic principles of speech recognition technology, expounded the concept of the hidden Markov model and three key algorithms, and realized their simulation and implementation in speech recognition applications [1]. Lin Yi proposed a multi-scale convolutional neural network architecture that adapts to different data distributions in order to handle the diversity of radio transmission noise and speakers, thereby improving the accuracy of multilingual speech recognition [2]. Wang Mou developed a multimodal Mandarin corpus containing simultaneous air- and bone-conducted speech; the multimodal speech was recorded with headphones equipped with air-conduction and bone-conduction microphones, and the semantic embeddings and adaptive weights of the two speech sources were dynamically fused through an MMT module [3]. Benkerzaz Saliha outlined the main definitions of automatic speech recognition and proposed several methods for it [4]. Singh Amitoj conducted a systematic survey of the existing literature on automatic speech recognition of Indian languages, providing a reference for the best available research in this area [5]. Ran Duan, based on artificial intelligence speech recognition technology, improved and analyzed speech recognition algorithms, adopted an effective algorithm as the system algorithm of an artificial intelligence model, and designed a control experiment to verify and analyze the resulting speech recognition correction model [6]. Kaur Jaspreet discussed many problems and challenges related to tonal languages and systematically surveyed their automatic speech recognition [7]. These scholars have all contributed to research on speech recognition and put forward valuable suggestions.
Many scholars have also studied multimodal corpora [8]. Hiippala Tuomo established a multimodal corpus of English diagrams, proposed a new multi-level annotation scheme, and provided a rich description of its multimodal structure [9]. Snaith Mark introduced the design, construction, and output of a patient consultation corpus, a multimodal healthcare corpus [10]. Tian Miao realized retrieval and query functions for keywords, wrong sentences, and English character errors; the designed corpus enriched the existing Madagascar corpus and established a separate English character error query database [11]. Hiippala Tuomo also introduced AI2D-RST, a multimodal corpus of 1000 English-language diagrams representing primary school natural science topics such as food webs, life cycles, moon phases, and human physiology [12]. Pinto Sara Ramos proposed a new method of multimodal corpus analysis, which provides a valuable reference for related research [13]. Anderson Jemima Asabea argued that speakers of New Englishes borrow interjections and other language forms from their indigenous languages to express their feelings at a specific moment, and examined the use and pragmatic functions of the four local interjections Ei, ehe, eh, and eish in Ghanaian English on online platforms [14]. Although many scholars have studied multimodal corpora, few have specifically studied multimodal corpora that integrate image and depth information.
This paper proposes methods for constructing an English multimodal corpus, covering corpus collection, corpus preprocessing, multimodal data collection, and multimodal data preprocessing, as well as color image feature extraction and depth information feature extraction. It also presents an experimental study on speech recognition with the English multimodal corpus.
Speech recognition technology and depth imaging technology
Speech recognition technology
Speech recognition technology is a scientific technology that draws on digital signal processing, artificial intelligence, linguistics, mathematics, psychology, and other disciplines [15]. Its essence is a pattern recognition process that classifies the input speech according to the corresponding pattern to find the optimal matching result. Pattern recognition includes preprocessing, feature extraction, and other basic modules. Feature extraction parameters include Mel-frequency cepstral coefficients (MFCCs), gammatone filter coefficients, and so on. Speech recognition methods can be divided into stochastic model methods, neural network-based methods, probabilistic and statistical methods, and so on.
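As a small illustration of the feature extraction step, the sketch below computes MFCCs with the librosa library; the toolkit choice and the input file name are assumptions, since the paper does not specify them.

```python
# A minimal MFCC extraction sketch using librosa (an assumed toolkit);
# "speech.wav" is a hypothetical input file.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)        # load audio resampled to 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 Mel-frequency cepstral coefficients
print(mfcc.shape)                                   # (13, num_frames)
```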
Depth imaging technology
The depth imaging technologies for acquiring depth information include the structured light method, the time-of-flight method, the triangulation method, and the frequency-modulated continuous wave (FMCW) method [16]. Structured light method: binocular-matching approaches to depth acquisition depend heavily on the characteristics of the image itself and are highly sensitive to changes in ambient light, so matching errors may increase when image texture features are insufficient or the lighting environment is poor. The structured light method does not rely on image texture; by preprocessing the light signal, it can quickly extract the depth features of objects in the scene, is less affected by external environmental factors, and can improve matching accuracy. Based on the number of projections, structured light methods can be divided into single-projection and multiple-projection variants. Although the single-projection structured light method has certain advantages, it is limited by a shallow depth of field and low vertical spatial resolution [17,18]. The multiple-projection structured light method can achieve higher spatial resolution, but it offers only average vibration resistance, poor real-time performance, and high equipment requirements. Time-of-flight method: the time-of-flight method takes a light source and a photosensitive module as its core components and outputs depth information directly from the round-trip time of the light (see the relation below). It offers a long recognition range and fast response, but it is costly and does not provide high-precision imaging. Triangulation method: the triangulation method determines the spatial coordinates of the detected object through the imaging relationship among the laser source, the detected object, and the detector. Frequency-modulated continuous wave method: the FMCW method offers high sensitivity, precision, and accuracy as well as strong anti-interference capability, but it requires long measurement times and performs poorly at medium and long distances.
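As a worked example of the time-of-flight principle, a pulsed sensor recovers the depth $d$ from the measured round-trip time $\Delta t$ of the light and the speed of light $c$ (the exact formula used by any particular sensor is an assumption):

$$d = \frac{c \cdot \Delta t}{2}$$

For instance, a round-trip time of 10 ns corresponds to a depth of about $3 \times 10^8 \cdot 10^{-8} / 2 = 1.5$ m.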
Construction of English multimodal corpus
The construction of English multimodal corpus includes the principles of data collection, data preprocessing, multimodal data collection, and multimodal data preprocessing, as shown in Fig. 1.
Principles of corpus collection
When collecting multimodal corpus, the principles of timeliness, representativeness, balance and appropriateness should be taken into account.
The principle of timeliness means that the collected language materials should keep pace with the times and reflect real scenes of English use. The principle of representativeness means that, because corpus samples are finite, samples that represent real scenes should be selected as far as possible, and the corpus should be collected and organized using scientific methods so that the collection is representative. The principle of balance means that each type of audio-visual oral corpus should occupy a reasonable proportion of the corpus. The principle of appropriateness is embodied in the following aspects: first, the collected corpus should match the user's level, with vocabulary and syntax kept at an appropriate difficulty; second, its content should be healthy, avoiding strong negative emotions and anything that violates correct social values or public order and good customs; third, it should not involve politically sensitive topics; fourth, standard pronunciation and clear images should be ensured.
Corpus collection
Corpus collection can select a large number of language texts as the original corpus through news websites, literary websites, movie websites, and other channels, and screen the corpus according to people's language habits, so that the original texts meet the requirements of the subsequent construction of the multimodal English corpus. Corpus collection includes two processes: text conversion and sampling analysis. Text conversion refers to converting the unqualified original language texts collected from news websites and other channels into Word documents. Sampling analysis refers to using random sampling to review problems in the original corpus texts, including special symbols, redundant punctuation marks, blank lines, garbled characters, and non-English content.
Corpus preprocessing
Corpus preprocessing, also called corpus cleaning, resolves problems such as special symbols and redundant punctuation marks in the original corpus texts. Text management software can be used to check whether the original texts contain blank lines, garbled characters, non-English content, and similar phenomena, and to handle them promptly. Manual inspection can be used to check whether the collected content is too profound, involves politically sensitive topics, or carries negative emotions; if so, it should be removed from the corpus.
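The sketch below illustrates this cleaning step in Python; the regular expressions and the printable-ASCII heuristic for non-English content are illustrative assumptions, not the exact rules used to build the corpus.

```python
# A minimal corpus-cleaning sketch: strip non-English symbols, collapse
# redundant punctuation, and drop blank lines. The rules are illustrative.
import re

def clean_corpus_text(text: str) -> str:
    text = re.sub(r"[^\x20-\x7E\n]", "", text)   # keep printable ASCII (crude non-English filter)
    text = re.sub(r"([!?.,]){2,}", r"\1", text)  # collapse redundant punctuation
    lines = (ln.strip() for ln in text.splitlines())
    return "\n".join(ln for ln in lines if ln)   # remove blank lines

raw = "Hello   world!!!\n\n\u00e9\u00e9 garbled \ufffd line..."
print(clean_corpus_text(raw))
```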
Multimodal data acquisition
A multimodal data acquisition system can be composed of computers, recording pens, cameras, teleprompter screens, and so on [19]. During data collection, the objects at the collection site can be marked to ensure that they always remain in the same position, so as not to affect subsequent multimodal corpus speech recognition. A headrest can also be placed on the seat to avoid invalid recordings caused by large head movements of the recorder and to keep acquisition on schedule. When the recorder clicks the start button, the system automatically collects the recorder's audio, color image sequences, depth images, and other data; when the recorder finishes a sentence, he or she clicks the end button. Before starting the next recording, the recorder needs to wait for the data to be written to a binary file. Multimodal data acquisition is carried out in groups [20]. If a recorder misreads, omits, or repeats content during acquisition, that recorder's group needs to record again.
Multimodal data preprocessing
Because the multimodal data acquisition method proposed in this paper is carried out in groups, the multimodal data must be aligned [21]. To solve this problem, a timestamp can be attached to the multimodal data, and synchronization is realized according to the timestamp and the frame rate of each stream (see the sketch below). For voice annotation, the Penn Phonetics Lab Forced Aligner can be used to realize automatic annotation; its output files are in the Praat TextGrid format [22]. Although this software annotates speech well, machine annotation may contain errors, in which case manual intervention is required.
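The sketch below shows one way to realize this timestamp-based synchronization; the field layouts and frame rates are assumptions for illustration.

```python
# Align each audio feature frame to the video/depth frame with the nearest
# capture timestamp (all timestamps in seconds).
import bisect

def align_frames(audio_times, video_times):
    pairs = []
    for t in audio_times:
        i = bisect.bisect_left(video_times, t)
        # candidate indices on both sides of the insertion point
        cands = [j for j in (i - 1, i) if 0 <= j < len(video_times)]
        pairs.append(min(cands, key=lambda k: abs(video_times[k] - t)))
    return pairs

audio_times = [k * 0.01 for k in range(10)]    # 100 frames/s audio features
video_times = [k / 30.0 for k in range(4)]     # 30 frames/s camera stream
print(align_frames(audio_times, video_times))  # video frame index per audio frame
```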
Multimodal feature extraction
Color image feature extraction
Electronic image feature extraction refers to processing the information of an electronic image and extracting its symbolic information as features. Principal component analysis (PCA) is an important tool for electronic image feature extraction [23]. The process of extracting facial lip features from color images is shown in Fig. 2.
The process of extracting facial lip features through PCA includes the following steps: first, preprocess the facial image, crop the lip area, and align the sizes; then flatten each lip image into a one-dimensional vector to form a data matrix and center it; next, calculate the covariance matrix of the data and perform eigenvalue decomposition to obtain a set of eigenvectors and eigenvalues; finally, select the leading eigenvectors that capture the main variance as the principal components and project the lip image data onto them, obtaining low-dimensional vectors as the feature representation of the lip area and thereby achieving information extraction and dimensionality reduction.
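A minimal NumPy sketch of these PCA steps, assuming size-aligned grayscale lip crops (the array shapes and component count are illustrative):

```python
import numpy as np

def pca_lip_features(lip_images: np.ndarray, n_components: int = 15) -> np.ndarray:
    """lip_images: (num_samples, height, width) cropped, size-aligned lip regions."""
    X = lip_images.reshape(len(lip_images), -1).astype(np.float64)  # flatten to vectors
    X -= X.mean(axis=0)                         # center the data matrix
    cov = np.cov(X, rowvar=False)               # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1]           # sort by descending variance
    components = eigvecs[:, order[:n_components]]
    return X @ components                       # low-dimensional lip features

lips = np.random.rand(100, 32, 32)                    # 100 synthetic 32x32 lip crops
print(pca_lip_features(lips, n_components=15).shape)  # (100, 15)
```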
Face detection: color image face detection methods include the knowledge rule method, the invariant feature method, the template matching method, and the statistical model method. The knowledge rule method analyzes the distribution of the eyes, nose, and other organs according to face rules to achieve detection [24]. The invariant feature method includes contour-based and skin-color-based approaches. Template matching distinguishes whether a face exists in a color image by comparing the correlation between standard face templates and color image sub-windows. The statistical model method transforms face detection into a statistical recognition problem; its methods include neural network methods, Bayesian estimation, and others. Lip localization: by converting the image to a color space that separates lip and skin color, the Dlib machine learning library can be used to calculate the four critical points of the lip features and obtain the region of interest of the lip area [25], as sketched below.
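A sketch of Dlib-based lip localization follows; it uses the standard 68-point landmark model (indices 48–67 cover the mouth), so the predictor file path and the padding are assumptions, and the full set of mouth landmarks is used rather than only the four critical points mentioned above.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # downloaded separately

def lip_roi(image_bgr, pad=5):
    """Return the cropped lip region of interest, or None if no face is found."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    xs = [shape.part(i).x for i in range(48, 68)]  # mouth landmark x-coordinates
    ys = [shape.part(i).y for i in range(48, 68)]  # mouth landmark y-coordinates
    return image_bgr[min(ys) - pad:max(ys) + pad, min(xs) - pad:max(xs) + pad]
```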
Normalization: after face detection and lip localization, normalization is performed. Color image parameters are high-dimensional, and their extraction can be regarded as a dimensionality reduction process [26]. PCA works well for color image parameter extraction, and the discrete cosine transform can be used to compress the color image and remove its high-frequency part, thereby improving the processing speed of PCA. Because individuals open their lips to different extents, color image features vary across individuals; to avoid this, lip features can be normalized.
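As an illustration (the exact normalization scheme here is an assumption, since schemes vary), a common choice is min-max scaling of each lip feature dimension:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum of that feature over a speaker's frames; this maps each feature into $[0, 1]$ and suppresses individual differences in mouth-opening range.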
Feature extraction of depth information
The facial model is studied in the Kinect three-dimensional coordinate system. The depth information is preprocessed as follows: with the center point between the mouth corners of the face model as the origin, the depth feature points are linearly transformed so that the two mouth corners take the same value in the Z direction. The left and right mouth corners of the original face model are:

$$P_l = (x_l, y_l, z_l)^T \quad (2)$$

$$P_r = (x_r, y_r, z_r)^T \quad (3)$$
In Formulas (2) and (3), x, y and z represent the spatial coordinate axes, and the subscripts l and r denote the left and right mouth corners.
Taking the mouth-corner center point $P_c = (P_l + P_r)/2$ as the origin, the regularized left and right corners of the face model are:

$$P_l' = P_l - P_c, \qquad P_r' = P_r - P_c$$
The angle of mouth corner rotation is:

$$\theta = \arctan\frac{z_l' - z_r'}{x_l' - x_r'}$$
The length of the mouth corner line is:

$$d = \sqrt{(x_l' - x_r')^2 + (y_l' - y_r')^2 + (z_l' - z_r')^2}$$
Setting the transformation matrix as N of size 3×3, if the face model is rotated around the spatial Y axis to the position facing the camera, the rotated mouth corners must share the same depth:

$$(N P_l')_z = (N P_r')_z$$
According to the calculation, N is the rotation matrix about the Y axis:

$$N = \begin{pmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{pmatrix}$$
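To make the transformation concrete, the following NumPy sketch implements the regularization above; the landmark values are synthetic and the function name is illustrative.

```python
import numpy as np

def regularize_depth_points(points, left, right):
    """points: (n, 3) depth landmarks; left/right: mouth-corner coordinates."""
    center = (left + right) / 2.0
    pts, l, r = points - center, left - center, right - center  # origin at mouth-corner midpoint
    theta = np.arctan2(l[2] - r[2], l[0] - r[0])  # rotation angle about the Y axis
    N = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                  [0.0,           1.0, 0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
    return pts @ N.T                              # rotated, regularized landmarks

left = np.array([-30.0, 0.0, 12.0])
right = np.array([30.0, 0.0, 8.0])
pts = np.array([left, right, [0.0, 10.0, 10.0]])
out = regularize_depth_points(pts, left, right)
print(np.isclose(out[0, 2], out[1, 2]))           # True: the corners share the same Z
```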
Experiment on English multimodal corpus speech recognition
In this paper, 72 students from school H were invited to participate in the collection of multimodal data, and two English multimodal corpora were built according to the proposed method: corpus A, a multimodal corpus combining color images and depth information, and corpus B, a multimodal corpus with color images only. The acoustic models in the Kaldi toolkit, namely the monophone model and the triphone model, were selected for the speech recognition experiments on these two corpora (a training sketch is given below). The basic characteristics of the 72 students participating in the recording are shown in Table 1.
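A minimal sketch of how the two acoustic models can be trained with the standard Kaldi recipe scripts, assuming an egs-style directory with data/train and data/lang already prepared; the job counts and Gaussian/leaf counts are illustrative.

```python
import subprocess

def run(cmd: str) -> None:
    subprocess.run(cmd, shell=True, check=True)

# monophone model
run("steps/train_mono.sh --nj 4 --cmd run.pl data/train data/lang exp/mono")
# align with the monophone model, then train the triphone (delta-feature) model
run("steps/align_si.sh --nj 4 --cmd run.pl data/train data/lang exp/mono exp/mono_ali")
run("steps/train_deltas.sh --cmd run.pl 2000 10000 data/train data/lang exp/mono_ali exp/tri1")
```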
Speech recognition based on monophone model with 15-dimensional lip features
Babble noise was added to the audio at signal-to-noise ratios of −5 dB, 0 dB, 5 dB, and 10 dB (a mixing sketch is given below). The audio-video data of corpus A fused the audio with 15-dimensional lip depth image features, and the audio-video data of corpus B fused the audio with 15-dimensional lip color image features. The speech recognition results for the two corpora under the monophone model are shown in Fig. 3.
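A sketch of mixing noise into clean speech at a target SNR; the babble signal here is a random stand-in, and both signals are assumed to share the same sample rate and length.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so that 10*log10(P_speech / P_noise_scaled) equals snr_db."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)   # average signal power
    p_noise = np.mean(noise ** 2)     # average noise power
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s synthetic tone
babble = rng.standard_normal(16000)                          # stand-in for babble noise
noisy = {snr: add_noise_at_snr(speech, babble, snr) for snr in (-5, 0, 5, 10)}
```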
As shown in Fig. 3, after fusing the audio with 15-dimensional lip depth image features, the accuracy of corpus A at a signal-to-noise ratio of −5 dB was 5.8%, while that of corpus B was 5.6%; recognition accuracy was low at −5 dB. At 0 dB, the accuracy of corpus A was 31.4% and that of corpus B was 30.2%. At 5 dB, the accuracy of corpus A was 67.8% and that of corpus B was 65.3%, so corpus A was again more accurate than corpus B. At 10 dB, the accuracy of corpus A was 83.7% and that of corpus B was 81.3%, making corpus A 2.4% more accurate than corpus B. These data show that the smaller the signal-to-noise ratio, the lower the accuracy of the corpus: at −5 dB the recognition performance was poor, but the accuracy of corpus A was always higher than that of corpus B.
Speech recognition based on triphone model with 15-dimensional lip features
The speech recognition of audio and video in the corpora under the triphone model was observed, as shown in Fig. 4.
As shown in Fig. 4, when the signal-to-noise ratio was −5 dB, the accuracy of corpus A under the triphone model was 6.1% and that of corpus B was 5.9%. At 0 dB, the accuracy of corpus A was 46.2% and that of corpus B was 43.8%; both corpora improved. At 5 dB, the accuracy of corpus A was 80.9% and that of corpus B was 78.3%, with corpus A again higher. At 10 dB, the accuracy of corpus A was 90.3% and that of corpus B was 88.6%, so corpus A was 1.7% more accurate than corpus B under the triphone model. Comparing Figs. 3 and 4, at an SNR of 10 dB the recognition accuracy under the triphone model was higher than under the monophone model.
Speech recognition based on monophone model with 32-dimensional lip features
The audio-video data of corpus A fused the audio with 32-dimensional lip depth image features, and the audio-video data of corpus B fused the audio with 32-dimensional lip color image features. The speech recognition results for the two corpora under the monophone model are shown in Fig. 5.
As shown in Fig. 5, when the signal-to-noise ratio was −5 dB, the accuracy of the monophone model on 32-dimensional lip features was 6.4% for corpus A and 5.7% for corpus B, with corpus A higher. At 0 dB, the accuracy was 32.5% for corpus A and 30.8% for corpus B, both higher than at −5 dB. At 5 dB, the accuracy was 69.2% for corpus A and 66.9% for corpus B, so corpus A remained higher. At 10 dB, the accuracy was 84.7% for corpus A and 83.3% for corpus B, making corpus A 1.4% more accurate than corpus B. Comparing Fig. 5 with Fig. 3, at an SNR of 10 dB the monophone model achieved higher accuracy with 32-dimensional lip features than with 15-dimensional lip features.
The 32-dimensional lip features can capture more information and detail than the 15-dimensional features and thus provide richer speech-related cues. As the signal-to-noise ratio increases, the influence of background noise decreases and the model can recognize speech signals more accurately; the marked rise in accuracy from −5 dB to 10 dB indicates that the model is highly sensitive to clean speech signals.
Speech recognition based on triphone model with 32-dimensional lip features
The speech recognition of audio and video in the corpora under the triphone model was observed, as shown in Fig. 6.
As shown in Fig. 6, when the signal-to-noise ratio was −5 dB, the accuracy of the triphone model on 32-dimensional lip features was 7.1% for corpus A and 6.5% for corpus B. At 0 dB, the accuracy was 48.2% for corpus A and 45.8% for corpus B. At 5 dB, the accuracy was 83.8% for corpus A and 79.2% for corpus B. At 10 dB, the accuracy was 92.3% for corpus A and 89.7% for corpus B, so the triphone model of corpus A was 2.6% more accurate than that of corpus B. These data show that with 32-dimensional lip features, the triphone model of corpus A was consistently more accurate than that of corpus B. Comparing Fig. 6 with Fig. 4, at an SNR of 10 dB the triphone model achieved higher accuracy with 32-dimensional lip features than with 15-dimensional features. The triphone model considers more contextual information (the influence of preceding and following phonemes) than the monophone model, which yields higher recognition accuracy under the same conditions, especially at high signal-to-noise ratios.
Conclusions
This paper summarizes speech recognition methods based on English multimodal corpora that combine image and depth information, and discusses in detail the fusion of speech recognition and depth imaging technology as well as the principles to follow when building English multimodal corpora. The empirical results show that, under both the monophone model and the triphone model, the corpus combining color images and depth information outperforms the corpus using only color images, and the triphone model outperforms the monophone model. Nevertheless, this study has limitations: the scale and diversity of the corpus need to be expanded, and the stability and accuracy of speech recognition in complex environments still need improvement. In the future, we will continue to optimize the multimodal fusion strategy, explore feature extraction methods of more dimensions, and strive to build a larger and higher-quality English multimodal corpus to promote the further development of speech recognition technology.
Data availability
Datasets generated and/or analyzed during the current study are available from the corresponding author on request.
References
Xu, J. & Li, T. Application of multimodal NLP instruction combined with speech recognition in oral English practice. Mob. Inf. Syst. 2022, 2262696 (2022).
Lin, Y. A unified framework for multilingual speech recognition in air traffic control systems. IEEE Trans. Neural Netw. Learn. Syst. 32 (8), 3608–3620 (2020).
Wang, M., Chen, J., Zhang, X. L. & Rahardja, S. End-to-end multi-modal speech recognition on an air and bone conducted speech corpus. IEEE/ACM Trans. Audio Speech Lang. Process. 31, 513–524 (2022).
Benkerzaz, S., Elmir, Y. & Dennai, A. A study on automatic speech recognition. J. Inform. Technol. Rev. 10 (3), 80–83 (2019).
Singh, A. ASRoIL: a comprehensive survey for automatic speech recognition of Indian languages. Artif. Intell. Rev. 53 (5), 3673–3704 (2020).
Ran, D., Yingli, W. & Haoxin, Q. Artificial intelligence speech recognition model for correcting spoken English teaching. J. Intell. Fuzzy Syst. 40 (2), 3513–3524 (2021).
Kaur, J., Singh, A. & Kadyan, V. Automatic speech recognition system for tonal languages: state-of-the-art survey. Arch. Comput. Methods Eng. 28 (3), 1039–1068 (2021).
Huang, L. Toward multimodal corpus pragmatics: Rationale, case, and agenda. Digit. Scholarsh. Humanit. 36 (1), 101–114 (2021).
Hiippala, T. AI2D-RST: a multimodal corpus of 1000 primary school science diagrams. Lang. Resour. Eval. 55 (3), 661–688 (2021).
Snaith, M. A multimodal corpus of simulated consultations between a patient and multiple healthcare professionals. Lang. Resour. Eval. 55 (4), 1077–1092 (2021).
Tian, M. Construction of computer English corpus assisted by Internet of Things information perception and interaction technology. Comput. Intell. Neurosci. 2022, 6803802 (2022).
Hiippala, T. et al. AI2D-RST: a multimodal corpus of 1000 primary school science diagrams. Lang. Resour. Eval. 55 (3), 661–688 (2021).
Ramos Pinto, S. & Mubaraki, A. Multimodal corpus analysis of subtitling: the case of non-standard varieties. Target Int. J. Transl. Stud. 32 (3), 389–419 (2020).
Anderson, J. A., Agbaglo, E. & Rachel, G. A. Exploring Ghanaians' usage of Ei, ehe, eh, and eish in the global web-based English corpus. Corpus Pragmat. 8 (2), 131–148 (2024).
Cave, R. & Bloch, S. The use of speech recognition technology by people living with amyotrophic lateral sclerosis: a scoping review. Disabil. Rehabil. Assist. Technol. 18 (7), 1043–1055 (2023).
Zhou, S. et al. A review of deep learning in medical imaging: imaging traits, technology trends, case studies with progress highlights, and future promises. Proc. IEEE 109 (5), 820–838 (2021).
Zhang, J. Adaptive modulation method of structured light projection based on the bidirectional reflection distribution function model. Acta Optica Sin. 41 (9), 0912001 (2021).
Polat, H. & Oyucu, S. Building a speech and text corpus of Turkish: large corpus collection with initial speech recognition results. Symmetry 12 (2), 290 (2020).
Sharma, K. & Giannakos, M. Multimodal data capabilities for learning: what can multimodal data tell us about learning? Br. J. Edu. Technol. 51 (5), 1450–1484 (2020).
Gao, J., Li, P., Chen, Z. & Zhang, J. A survey on deep learning for multimodal data fusion. Neural Comput. 32 (5), 829–864 (2020).
Lipkova, J. et al. Artificial intelligence for multimodal data integration in oncology. Cancer Cell 40 (10), 1095–1110 (2022).
Mahr, T. J., Berisha, V., Kawabata, K., Liss, J. & Hustad, K. C. Performance of forced-alignment algorithms on children’s speech. J. Speech Lang. Hear. Res. 64, 2213–2222 (2021).
Jiang, J. SuperPCA: a superpixelwise PCA approach for unsupervised feature extraction of hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 56 (8), 4581–4593 (2018).
Chen, B., Liu, X., Zheng, Y., Zhao, G. & Shi, Y. Q. A robust GAN-generated face detection method based on dual-color spaces and an improved Xception. IEEE Trans. Circuits Syst. Video Technol. 32 (6), 3527–3538 (2021).
Alshammari, A. K. et al. Influence of lip position on esthetics perception with respect to profile divergence using silhouette images. BMC Oral Health 23 (1), 791 (2023).
Huang, L. et al. Normalization techniques in training dnns: methodology, analysis and application. IEEE Trans. Pattern Anal. Mach. Intell. 45 (8), 10173–10196 (2023).
Author information
Authors and Affiliations
Contributions
Bing Wang: writing (original draft preparation), conducting the experiments, analysing the results, editing, data curation, and supervision. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, B. Speech recognition using an english multimodal corpus with integrated image and depth information. Sci Rep 14, 27000 (2024). https://doi.org/10.1038/s41598-024-78557-2