Introduction

Research background and motivations

With the rapid development of artificial intelligence (AI) technology, speech recognition, as one of the core technologies of human-computer interaction, has been widely used in real-time interactive scenarios such as emotional social applications, online education and intelligent customer service1,2. However, the speech signals in these scenarios are highly continuous, uncertain and strongly context-dependent, and traditional speech recognition models show obvious limitations in dynamic scene adaptability, semantic coherence and processing efficiency. In particular, when handling multi-language, multi-accent and context-changing tasks, they often suffer from information loss and semantic deviation3,4,5. At the same time, the demand for real-time processing further aggravates the bottleneck of computing resources, making it a key challenge to build an efficient and robust speech recognition model6,7. Therefore, combining innovative optimization of the Transformer architecture with multi-modal feature fusion provides an important technical direction and theoretical basis for solving these problems.

Research objectives

Aiming at the key challenges of speech recognition in real-time interactive scenarios, this study proposes an end-to-end model that can dynamically adapt to complex contexts and efficiently process diverse speech signals. The specific objectives are as follows. First, by optimizing the Transformer architecture, dynamic encoding and multi-scale feature modeling of speech signals are realized, enhancing the semantic consistency and context-capture ability of the model in complex scenes. Second, a semantic optimization generator and a context-aware decoding mechanism are introduced to improve the information retention rate and event recognition accuracy during speech-to-text conversion. Finally, a real-time speech recognition system with high adaptability and robustness under multi-language and multi-scene conditions is constructed, laying the foundation for improving the real-time performance and interactivity of speech technology.

Literature review

Application and optimization of Transformer in end-to-end speech recognition

In recent years, with its powerful feature modeling capabilities, the Transformer architecture has gradually become the mainstream technology in end-to-end automatic speech recognition (E2E-ASR). Shahamiri et al.8 designed the Dysarthric Speech-Transformer (DS-Transformer) model in response to the complexity and scarcity of dysarthric speech. Through the neural freezing strategy and pre-training with healthy speech data, they achieved a performance improvement of up to 23%. Dong et al.9 proposed a soft beam pruning algorithm combined with a prefix module, which achieved a dynamic balance between accuracy and efficiency when optimizing the decoding paths of speech recognition in specific domains. Vitale et al.10 studied acoustic syllable boundary detection through the encoding layer of E2E-ASR, revealing the potential advantages of the Transformer in capturing the rhythmic features of syllables. Rybicka et al.11 effectively dealt with the complexity of multi-speaker recording scenarios through the attractor refinement module and the k-means clustering algorithm. Their performance improvement in real data reached 15%, demonstrating the adaptability of the Transformer in flexibly handling complex speech tasks. Regarding the speech synthesis of rare languages, Lu et al.12 introduced cross-lingual context encoding features and Conformer blocks into the FastSpeech2 model. Combined with the token-average mechanism, they optimized the generation quality in scenarios with scarce data, successfully achieving a significant decrease in the character error rate and providing a breakthrough solution for the speech synthesis of minority languages.

Verification of the speech recognition performance of Transformer end-to-end models on specific datasets and scenarios

The performance of the Transformer architecture in speech recognition for specific scenarios and datasets has also been widely verified. Tang13 designed the Denoising and Mandarin Recognition Supplement-Transformer (DMRS-Transformer) network, which integrated a denoising module and a Mandarin recognition supplement mechanism. It achieved a reduction in the character error rate of 0.8% and 1.5% on the Aishell-1 and HKUST datasets respectively. Hadwan et al.14 proposed the Acoustic Feature Strategy-Transformer (AFS-Transformer) model. By embedding speaker information in acoustic features and optimizing the processing strategy for silent frames, it effectively improved the model’s adaptability in scenarios with two speakers. Aiming at the bottleneck problems of high parameter quantity and low deployment efficiency, Ben-Letaifa and Rouas (2023) proposed a variable-rate-based pruning algorithm to dynamically optimize the parameter distribution of the feed-forward layer of the Transformer, achieving an optimized balance between performance and resource utilization15. Loubser et al.16 combined Convolutional Neural Network (CNN) with a lightweight Transformer architecture, which significantly reduced the computational cost while maintaining a low Word Error Rate (WER). Based on Squeezeformer, Guo et al.17 designed a multimodal pronunciation error detection model. Experiments on the PSC-Reading Mandarin dataset showed that the proposed model significantly improved the F1 score and diagnostic accuracy, further verifying the practical value of the Transformer architecture in speech error diagnosis. Pondel-Sycz et al.18 conducted a systematic analysis of five Transformer-based models in multilingual datasets (Mozilla Common Voice, LibriSpeech, and VoxPopuli). The results showed that these models demonstrated excellent performance and adaptability in both clean audio and degraded signal scenarios.

Existing research and analysis

In conclusion, current research on Transformer-based end-to-end speech recognition mainly focuses on three aspects: first, improving the robustness and accuracy of the model for complex speech signals through structural optimization; second, controlling the model scale in resource-limited scenarios to reduce the burden of training and inference; third, enhancing the perception of specific contexts and rare expressions. However, despite the positive progress of existing approaches, when facing real-time interaction tasks, existing models still fall short in dynamics, cross-scenario adaptability, and decoding efficiency. Especially in scenarios such as multilingual alternation, dialect interference, and complex context switching, semantic breaks and recognition lags are likely to occur. At the same time, large parameter counts and complex decoding paths limit their application on edge devices or in resource-sensitive systems19,20,21. Therefore, starting from the dual dimensions of real-time performance and interactivity, this study proposes an end-to-end model that can dynamically adapt to context changes with high computational efficiency. Through structural reconstruction and optimization of the semantic guidance mechanism, the practicality and performance boundaries of the Transformer in real-time speech recognition scenarios are extended.

Research model

Method of model design

The proposed Dynamic Adaptive Transformer for Real-Time Speech Recognition (DATR-SR) model is built on the E2E-ASR architecture and embeds an improved Transformer for efficient speech signal processing. Its core design comprises four key modules: an adaptive coding module, a multi-scale feature extraction module, a context-aware decoding module and a semantic optimization generator.

Addressing the continuity, uncertainty and context dependence of speech signals in real-time interactive scenarios, DATR-SR adopts a dynamic hierarchical adaptive encoder that allocates computing resources according to signal complexity to avoid redundant operations. The multi-scale feature extraction module captures local and global features to enhance adaptability to continuous speech22,23. In the decoding stage, a context-aware event-driven mechanism is introduced to adjust the decoding path in real time, and semantic associations are optimized in combination with a Graph Neural Network (GNN)24. The semantic optimization generator provides context guidance through semantic prediction, improves the dynamic adaptation ability of the model in real-time speech recognition, and enables efficient processing and real-time output for diverse requirements.
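To make the data flow among these four modules concrete, the following sketch outlines a minimal PyTorch-style skeleton of the pipeline. It is an illustrative reading of the description above rather than the released DATR-SR code; the class name, the argument names and the exact calling order are assumptions introduced here.

```python
import torch.nn as nn

class DATRSR(nn.Module):
    """Skeleton of the four-module DATR-SR pipeline described above (names are illustrative)."""
    def __init__(self, encoder, multi_scale, decoder, semantic_generator):
        super().__init__()
        self.encoder = encoder                          # dynamic hierarchical adaptive encoder
        self.multi_scale = multi_scale                  # multi-scale feature extraction
        self.decoder = decoder                          # context-aware, event-driven decoder (with GNN)
        self.semantic_generator = semantic_generator    # prior semantic guidance

    def forward(self, speech_feats):
        encoded = self.encoder(speech_feats)
        fused = self.multi_scale(encoded)
        prior = self.semantic_generator(fused)
        return self.decoder(fused, prior)
```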

Improvement principles of the DATR-SR model in two stages

Coding stage

In the coding stage, DATR-SR employs a dynamic hierarchical adaptation mechanism to adjust the number of encoding layers and the allocation of computational resources based on the complexity of the speech signal. This avoids over-modeling of low-complexity signals, which is common in traditional architectures, and enhances the processing efficiency for high-complexity signals25,26,27. At the same time, the multi-scale feature extraction module uses a hierarchical attention mechanism to model local and global features separately. This effectively handles long segments of continuous speech and alleviates the issues of feature loss and redundant calculations. It achieves a balance between resource allocation and representation efficiency28.

The core calculation in the coding stage can be expressed as:

(1) Signal complexity evaluation:

$$C_{t}=\alpha \cdot \mathrm{Var}(x_{t})+\beta \cdot \mathrm{Entropy}(x_{t})+\gamma \cdot \mathrm{Sparsity}(x_{t})$$
(1)

\(x_{t}\) represents the feature vector of the input speech signal; \(\mathrm{Var}\), \(\mathrm{Entropy}\) and \(\mathrm{Sparsity}\) are the variance, entropy and sparsity of the feature, respectively; \(\alpha\), \(\beta\) and \(\gamma\) are adjustment parameters.

(2) Dynamic computing resource allocation:

$$L_{t}=\min\left(L_{max},\ \frac{C_{t}}{\tau}\cdot L_{total}\right)$$
(2)

\(L_{max}\) is the maximum number of layers, \(L_{total}\) is the total number of encoder layers, and \(\tau\) is the complexity threshold.

(3) Layer-by-layer weight computation:

$$h_{l}=\sigma(W_{l}\cdot h_{l-1}+b_{l}),\quad \forall l\in\{1,\dots,L_{t}\}$$
(3)

\(h_{l}\) represents the output of the \(l\)-th layer, \(\sigma\) is the activation function, and \(W_{l}\) and \(b_{l}\) are the layer weights and biases.

(4) Multi-head attention mechanism:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V$$
(4)

\(Q\), \(K\) and \(V\) are the query, key and value matrices, respectively, and \(d_{k}\) is the dimension of the key vectors.

(5) Multi-scale aggregation:

$$H_{\text{multi-scale}}=\sum_{i=1}^{N}\alpha_{i}\cdot H_{i},\quad \alpha_{i}=\frac{\exp(w_{i})}{\sum_{j=1}^{N}\exp(w_{j})}$$
(5)

\(H_{i}\) represents the feature representation of the \(i\)-th scale, \(\alpha_{i}\) is the corresponding scale weight, and \(N\) is the number of scales.
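A minimal PyTorch sketch of how Eqs. (1)-(5) could be combined is given below. It is an illustration under stated assumptions rather than the released DATR-SR implementation: the softmax-based entropy estimate, the near-zero sparsity proxy, the default values of α, β, γ and τ, and the helper names (signal_complexity, allocate_layers, MultiScaleAggregation, DynamicEncoder) are all introduced here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def signal_complexity(x, alpha=0.4, beta=0.4, gamma=0.2, eps=1e-8):
    """Eq. (1): complexity score C_t from variance, entropy and sparsity of a feature frame x."""
    var = x.var()
    p = F.softmax(x, dim=-1)                        # treat the frame as a distribution (assumption)
    entropy = -(p * (p + eps).log()).sum()
    sparsity = (x.abs() < 1e-3).float().mean()      # fraction of near-zero entries (assumption)
    return alpha * var + beta * entropy + gamma * sparsity

def allocate_layers(c_t, l_total, l_max, tau=5.0):
    """Eq. (2): number of encoder layers to run for the current input, bounded by l_max."""
    l_t = torch.clamp((c_t / tau) * l_total, min=1.0, max=float(l_max))
    return int(l_t.item())

class MultiScaleAggregation(nn.Module):
    """Eq. (5): softmax-weighted sum of N per-scale feature representations."""
    def __init__(self, num_scales):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_scales))

    def forward(self, scale_feats):                 # list of N tensors with identical shapes
        alphas = F.softmax(self.w, dim=0)
        return sum(a * h for a, h in zip(alphas, scale_feats))

class DynamicEncoder(nn.Module):
    """Eqs. (3)-(4): run only the first L_t Transformer layers for a given input."""
    def __init__(self, d_model=256, n_heads=4, n_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, x):                           # x: [batch, time, d_model]
        c_t = signal_complexity(x.mean(dim=(0, 1)))
        l_t = allocate_layers(c_t, l_total=len(self.layers), l_max=len(self.layers))
        h = x
        for layer in self.layers[:l_t]:
            h = layer(h)                            # Eq. (3); multi-head attention of Eq. (4) inside
        return h, l_t
```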

Decoding stage

In the decoding stage, DATR-SR adopts a context-aware event-driven mechanism and a semantic optimization strategy to monitor key events (such as pauses and speech-rate changes) in real time and dynamically adjust the decoding path to adapt to multi-scene changes. Combined with the GNN, semantic associations along the decoding path are optimized to reduce decoding isolation29. The semantic optimization generator produces prior semantic information through context prediction, guides the decoder toward the best path, reduces the invalid search space, and comprehensively improves semantic coherence and real-time responsiveness.

The core calculation in the decoding stage can be expressed as:

(1) Speech event detection:

$$E_{t}=\delta \cdot \|\Delta h_{t}\|_{2}+\epsilon \cdot \mathrm{Energy}(x_{t})+\eta \cdot \mathrm{Duration}(x_{t})$$
(6)

\(\Delta h_{t}\) is the feature difference between adjacent time steps; \(\mathrm{Energy}\) and \(\mathrm{Duration}\) are the energy and duration of the speech signal, respectively; \(\delta\), \(\epsilon\) and \(\eta\) are weighting parameters.

(2) Dynamic path adjustment:

$$z_{t+1}=f_{adjust}(z_{t},E_{t})$$
(7)

\(z_{t}\) represents the current decoding state, and \(f_{adjust}\) is the dynamic path adjustment function.

(3) Prior semantic generation:

$$s_{t}=\mathrm{softmax}(W_{s}\cdot h_{t}+b_{s})$$
(8)

\(s_{t}\) is the generated prior semantic information, and \(W_{s}\) and \(b_{s}\) are the corresponding weight and bias.

(4) GNN path optimization:

$$H_{GNN}=\sigma\left(\sum_{v\in\mathcal{V}}\frac{1}{|\mathcal{N}(v)|}\sum_{u\in\mathcal{N}(v)}W_{edge}\cdot h_{u}+b_{edge}\right)$$
(9)

\(\mathcal{V}\) is the set of graph nodes, \(\mathcal{N}(v)\) is the neighborhood of node \(v\), and \(W_{edge}\) and \(b_{edge}\) are the edge weight and bias.

(5) Decoding path update:

$$z_{t+1}=f_{decode}(z_{t},s_{t},H_{GNN})$$
(10)

\(f_{decode}\) is the decoding path update function.

(6) Final decoding output:

$$y_{T}=\mathrm{argmax}\left(\prod_{t=1}^{T}P(y_{t}\mid z_{t},s_{t})\right)$$
(11)

\(y_{t}\) is the decoded output at time step \(t\).

In this process, the related parameters and implementation of GNN are shown in Table 1:

Table 1 Relevant parameters and implementation of GNN.
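For illustration, the sketch below shows one way Eqs. (6)-(10) might fit together in code. It is a simplified, hedged reading of the decoding mechanism, not the authors' implementation: the GRU cells standing in for f_adjust and f_decode, the duration proxy, the mean-aggregation GNN layer and all names (event_score, MeanAggregationGNN, ContextAwareDecoderStep) are assumptions introduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def event_score(h_t, h_prev, x_t, delta=0.5, eps_w=0.3, eta=0.2):
    """Eq. (6): event score from the feature difference, frame energy and a duration proxy."""
    energy = (x_t ** 2).mean()
    duration = torch.tensor(float(x_t.numel()))     # crude stand-in for a real duration estimate
    return delta * (h_t - h_prev).norm(p=2) + eps_w * energy + eta * duration

class MeanAggregationGNN(nn.Module):
    """Eq. (9): one round of neighbourhood mean aggregation over the decoding graph."""
    def __init__(self, dim):
        super().__init__()
        self.w_edge = nn.Linear(dim, dim)           # plays the role of W_edge and b_edge

    def forward(self, node_feats, neighbours):      # node_feats: [V, d]; neighbours: list of index lists
        out = []
        for v, nbrs in enumerate(neighbours):
            msg = self.w_edge(node_feats[nbrs]).mean(dim=0) if nbrs else torch.zeros_like(node_feats[v])
            out.append(torch.sigmoid(msg))          # sigma as the nonlinearity
        return torch.stack(out)

class ContextAwareDecoderStep(nn.Module):
    """Eqs. (7), (8) and (10): adjust the decoding state and emit a token choice."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.prior = nn.Linear(d_model, vocab_size)                 # Eq. (8)
        self.adjust = nn.GRUCell(1, d_model)                        # f_adjust driven by the event score
        self.decode = nn.GRUCell(d_model + vocab_size, d_model)     # f_decode over [H_GNN ; s_t]
        self.out = nn.Linear(d_model, vocab_size)                   # scores P(y_t | z_t, s_t)

    def forward(self, z_t, h_t, e_t, h_gnn):
        s_t = F.softmax(self.prior(h_t), dim=-1)                    # prior semantic information
        z_adj = self.adjust(e_t.view(1, 1), z_t.unsqueeze(0)).squeeze(0)                            # Eq. (7)
        z_next = self.decode(torch.cat([h_gnn, s_t]).unsqueeze(0), z_adj.unsqueeze(0)).squeeze(0)   # Eq. (10)
        y_t = self.out(z_next).argmax(dim=-1)       # greedy choice toward the product in Eq. (11)
        return z_next, s_t, y_t
```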

Experimental design and performance evaluation

Datasets collection

In order to meet the diversity requirements of speech recognition in real-time interactive scenarios, this study uses two kinds of datasets for training and testing. The first kind includes Aishell-1, HKUST, LibriSpeech and CommonVoice, covering multilingual environments and pronunciation variants and enhancing the adaptability of the model to multiple languages and accents. The second kind consists of audio crawled from ten Chinese TV series that have been popular in recent years, which offers rich scene and language diversity while avoiding privacy issues.

For the second kind of dataset, FFmpeg is used to extract audio with high precision, and the short-time Fourier transform is applied for feature decomposition. To meet the requirements of speech recognition on segment length, and according to the distribution of speech duration, the audio is cut into sub-segments ranging from 1 min to 5 min, yielding 88,125 sub-data items. In addition, to reduce the influence of environmental noise and scene-specific bias, the data are processed with zero-mean and unit-variance normalization, after which 76,232 items that meet the standard are retained. The two kinds of datasets contain 174,966 items in total, 80% of which are used for training and 20% for testing.
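A possible preprocessing pipeline along these lines is sketched below in Python. It is a hedged illustration: the exact FFmpeg options, the 16 kHz sampling rate, the STFT window and hop sizes, and the segmentation policy are assumptions, since the text specifies only the tools and the 1-5 min segment range.

```python
import subprocess
import numpy as np
from scipy.signal import stft

def extract_audio(video_path, wav_path, sr=16000):
    """Pull a mono WAV track out of a video file with FFmpeg (sampling rate is an assumption)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", str(sr), wav_path],
        check=True,
    )

def segment_and_featurize(samples, sr=16000, min_len_s=60, max_len_s=300):
    """Cut a waveform into 1-5 min segments, apply zero-mean / unit-variance normalization,
    and compute STFT magnitude features for each segment."""
    features = []
    pos = 0
    while pos < len(samples):
        seg_len = min(max_len_s * sr, len(samples) - pos)
        if seg_len < min_len_s * sr:                 # drop trailing fragments shorter than 1 min
            break
        seg = samples[pos:pos + seg_len].astype(np.float64)
        seg = (seg - seg.mean()) / (seg.std() + 1e-8)
        _, _, spec = stft(seg, fs=sr, nperseg=512, noverlap=384)
        features.append(np.abs(spec))
        pos += seg_len
    return features
```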

The sorting results of the first kind of dataset are shown in Table 2:

Table 2 The first kind of dataset.

The sorting results of the second kind of dataset are shown in Table 3:

Table 3 The second kind of dataset.

Experimental environment

In this study, the performance of the DATR-SR model is tested from two aspects: evaluating its robustness in a massive-data environment, and testing its speech recognition effect, which is further compared with cutting-edge models to verify its superiority. The specific analyses are as follows:

(1) Robustness analysis of DATR-SR in a massive-data environment.

Using the first kind of (public) dataset, the data are divided into six proportions (10%, 20%, 40%, 50%, 80% and 100%), and DATR-SR is tested in stages. The Character Error Rate (CER) and training convergence time are analyzed to evaluate the influence of data volume on model performance30. At the same time, the inference latency and the utilization rate of computing resources are monitored to verify the computational efficiency and hardware adaptability of the model in large-scale data processing31.

(2) Analysis of the speech recognition effect of DATR-SR on the two kinds of datasets.

Based on the first kind of public dataset and the second kind of TV series dataset, the speech recognition performance of DATR-SR is evaluated, and the core recognition ability of DATR-SR in multi-language and multi-scene conditions is tested by calculating the WER, F1 index of pronunciation error detection and recognition accuracy on each dataset.

(3) Comparative analysis of DATR-SR and cutting-edge speech recognition models.

In order to objectively compare the advantages and disadvantages of DATR-SR and other cutting-edge speech recognition models, a unified comparative experiment is designed on the two kinds of datasets, with the same training and testing process adopted to ensure fairness. The comparison covers three dimensions, as shown in Table 4:

Table 4 Three evaluation dimensions of comparative analysis.

In Table 4 above, the calculation equation of CCR is:

$$CCR=\frac{\sum_{i=1}^{N}\left(\frac{1}{M_{i}}\sum_{j=1}^{M_{i}}f_{coherence}(w_{i,j},w_{i,j+1})\right)}{N}\times 100$$
(12)

\(N\) is the total number of test speech samples, \(M_{i}\) is the total number of sentences in sample \(i\), \(w_{i,j}\) is the \(j\)-th sentence in sample \(i\), and \(f_{coherence}\) is the coherence scoring function of sentences \(w_{i,j}\) and \(w_{i,j+1}\) based on the angle between their semantic vectors.
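As an illustration of Eq. (12), the sketch below scores coherence between adjacent sentences with a cosine-similarity function. The choice of cosine similarity (rescaled to [0, 1]) for f_coherence, the averaging over adjacent sentence pairs, and the external embed function are assumptions; the text only states that coherence is scored from the angle between semantic vectors.

```python
import numpy as np

def ccr(samples, embed):
    """Eq. (12): context coherence rate over N samples.
    `samples` is a list of sentence lists; `embed` maps a sentence to a semantic vector (assumed given)."""
    def coherence(u, v):
        a, b = embed(u), embed(v)
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        return (cos + 1.0) / 2.0                    # rescale the cosine of the angle to [0, 1]

    per_sample = []
    for sentences in samples:
        pairs = list(zip(sentences[:-1], sentences[1:]))
        if pairs:
            per_sample.append(np.mean([coherence(u, v) for u, v in pairs]))
    return 100.0 * float(np.mean(per_sample))
```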

The calculation equation of MSD is:

$$MSD=\frac{\sum_{i=1}^{N}\left(\frac{1}{K_{i}}\sum_{k=1}^{K_{i}}\|V_{input}(w_{i,k})-V_{output}(w_{i,k})\|^{2}\right)}{N}\times 100$$
(13)

\(K_{i}\) is the total number of words in sample \(i\), \(V_{input}(w_{i,k})\) is the semantic embedding vector of the input speech, and \(V_{output}(w_{i,k})\) is the semantic embedding vector of the model output.

The calculation equation of IRR is:

$$IRR=\frac{\sum_{i=1}^{N}\left(\frac{1}{T_{i}}\sum_{t=1}^{T_{i}}\delta_{t}\cdot P(w_{i,t}\mid x_{i,t})\right)}{\sum_{i=1}^{N}T_{i}}\times 100$$
(14)

\(T_{i}\) is the total number of time steps of sample \(i\), \(\delta_{t}\) is the indicator of whether the information is correctly retained at time step \(t\) (1 for correct, 0 for error), and \(P(w_{i,t}\mid x_{i,t})\) is the probability that the input signal \(x_{i,t}\) is transcribed into the word \(w_{i,t}\) at time step \(t\).

The calculation equation of STRT is:

$$STRT=\frac{\sum_{i=1}^{N}\left(\frac{1}{S_{i}}\sum_{s=1}^{S_{i}}\left(t_{end}(x_{i,s})-t_{start}(x_{i,s})\right)\right)}{N}$$
(15)

\(S_{i}\) is the total number of scene changes in sample \(i\), \(t_{start}(x_{i,s})\) is the start time of the voice-signal switch in scene \(s\), and \(t_{end}(x_{i,s})\) is the end time of the voice-signal switch in scene \(s\).

The calculation equation of SAR is:

$$SAR=\frac{\sum_{i=1}^{N}\left(\frac{1}{M_{i}}\sum_{m=1}^{M_{i}}\left(1-\frac{e_{i,m}}{T_{i,m}}\right)\right)}{N}\times 100$$
(16)

\(M_{i}\) is the total number of scenes in sample \(i\), \(e_{i,m}\) is the number of error frames of the model in scene \(m\), and \(T_{i,m}\) is the total number of frames of scene \(m\).
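For completeness, minimal NumPy sketches of Eqs. (13) and (16) are given below. They assume that word-level input and output embeddings are already aligned (for MSD) and that per-scene error-frame and total-frame counts are available (for SAR); those alignments and counts are inputs assumed here, not specified in the text.

```python
import numpy as np

def msd(inputs, outputs):
    """Eq. (13): mean semantic deviation. `inputs[i]` and `outputs[i]` are [K_i, d] arrays of
    aligned word-level semantic embeddings for sample i."""
    per_sample = [np.mean(np.sum((vi - vo) ** 2, axis=1)) for vi, vo in zip(inputs, outputs)]
    return 100.0 * float(np.mean(per_sample))

def sar(error_frames, total_frames):
    """Eq. (16): scene adaptation rate. `error_frames[i][m]` and `total_frames[i][m]` are the
    error-frame and total-frame counts of scene m in sample i."""
    per_sample = [
        np.mean([1.0 - e / t for e, t in zip(errs, tots)])
        for errs, tots in zip(error_frames, total_frames)
    ]
    return 100.0 * float(np.mean(per_sample))
```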

Parameter settings

DATR-SR needs to be compared with the most advanced speech recognition models of the past two years. First, the study collates the implementation methods of these advanced models, and then tests them on a unified data configuration to reduce errors. Drawing on the preceding literature review, seven state-of-the-art speech recognition models are selected; they are listed in Table 5:

Table 5 Seven cutting-edge speech recognition models.

The software and hardware environment of the study is arranged as shown in Table 6:

Table 6 The software and hardware environment of the study.

Performance evaluation

Robustness analysis

The robustness analysis results of DATR-SR model are shown in Fig. 1:

Fig. 1 Robustness analysis of DATR-SR: (a) Aishell-1, (b) HKUST, (c) LibriSpeech, (d) CommonVoice.

In Fig. 1, DATR-SR demonstrates strong robustness and adaptability across different datasets and data-volume ratios. With the increase of data volume, the CER decreases from a maximum of 5.2% to a minimum of 2.7%. Even on the CommonVoice dataset with complex contexts, it keeps the semantic error rate low (2.8–3.1%), reflecting its recognition accuracy under multi-language and multi-scene conditions. The training convergence time with 100% of the data is controlled within 74–90 s, which shows high efficiency for large-scale training. In terms of inference latency, all datasets stay within 15 ms, and the resource utilization rate remains above 75%, further proving the efficiency and hardware adaptability of the model in real-time interactive scenarios. These results show that DATR-SR can effectively balance complexity and efficiency and meet the real-time requirements of diverse speech recognition tasks.

Analysis of speech recognition effect

The analysis result of speech recognition effect of DATR-SR model is shown in Fig. 2:

Fig. 2 Speech recognition effect of DATR-SR: (a) the first kind of dataset, (b) the second kind of dataset.

In Fig. 2, DATR-SR shows high speech recognition ability and cross-scene adaptability on two kinds of datasets. On the first kind of dataset, the WER is maintained at 4.3-6.2%, and the accuracy is over 91%, which reflects the stable performance of DATR-SR in multi-language and multi-accent environment. The F1 index is as high as 0.91, which verifies the accurate capture of semantic information. On the second kind of dataset, DATR-SR faces TV drama scenes with complex contexts, and the WER fluctuates slightly, but the accuracy rate is always above 90%, and the F1 index reaches 0.91. It proves that DATR-SR can effectively adapt to different scenes and language styles. These results further confirm that DATR-SR not only has cross-domain speech recognition ability in real-time interactive scenes, but also shows excellent dynamic adaptability in complex semantic conversion tasks.

Comparative analysis with cutting-edge speech recognition models

The comparative analysis results of DATR-SR model and frontier speech recognition model are shown in Fig. 3:

Fig. 3 Comparative analysis results of DATR-SR and cutting-edge speech recognition models: (a) \(A_{1}\), (b) \(A_{2}\), (c) \(A_{3}\).

In Fig. 3, DATR-SR shows clear performance advantages in all three dimensions. In the context consistency analysis, the CCR of DATR-SR reaches 92.3%, significantly better than the other models, and its MSD is the lowest at only 4.2%, reflecting its semantic coherence and accuracy in complex dialogue contexts. In the event recognition analysis, the ERR reaches 91.3% and the EER is controlled at 4.2%, showing its ability to accurately detect diverse voice events. In the dynamic scene adaptation analysis, the STRT is only 485 ms, below the 500 ms standard, the SAR reaches 91.8%, and the SSER remains at 4.2%, demonstrating its efficiency and stability in real-time interactive scenarios.

Looking deeper, DATR-SR achieves accurate capture of phonetic continuity and semantic consistency by optimizing context modeling and semantic mapping. The event detection module performs well in complex scenes, effectively reducing semantic ambiguity. The rapid response mechanism significantly enhances the adaptability and robustness of the model in multi-scene switching, indicating its wide application potential in real-time interactive speech recognition tasks.

Discussion

From the experimental results, the advantages of DATR-SR in speech recognition tasks lie in its multi-dimensional optimization and efficient dynamic adaptability. The model can dynamically adjust computing resources according to signal complexity, and achieves efficient handling of semantic consistency and information retention through multi-scale feature extraction and context-aware decoding. Unlike traditional fixed computing frameworks, DATR-SR responds flexibly to input changes and avoids redundant computation while improving efficiency. In addition, its high recall rate and low error rate under high load and diverse voice events verify its adaptability and stability in multi-task parallel processing. The introduction of the GNN further optimizes the semantic path and strengthens the depth of language understanding and the accuracy of information transmission. This design offers a new technical perspective for the field of speech recognition and lays a theoretical foundation for complex multimodal fusion and the construction of real-time speech processing systems.

Conclusion

Research contribution

In this study, the DATR-SR model is proposed, which contributes both theoretically and practically to the field of end-to-end speech recognition. The model innovatively combines a dynamic hierarchical adaptive encoder with a context-aware decoding module; through real-time evaluation of signal complexity and optimal allocation of computing resources, the processing efficiency and semantic consistency of speech signals are significantly improved. At the same time, the study further introduces multi-scale feature extraction and a semantic optimization generation mechanism, realizing synchronous optimization of the semantic coherence rate and the information retention ability in complex contexts. Extensive tests on open datasets and diverse scenes verify the robustness and dynamic adaptability of the model in cross-language and cross-scene tasks, which provides theoretical support and an innovative path for speech recognition technology in real-time interactive scenarios and offers efficient technical solutions for industrial development.

Future works and research limitations

Future research can further explore the scope of application and deeper optimization of the model. Although DATR-SR demonstrates strong adaptability and stability across multiple languages and scenes, its performance in more complex multi-modal data fusion and high-noise environments has not been fully verified, which provides an important direction for future optimization. In addition, to further enhance the universality and extensibility of the model, future work can explore combining DATR-SR with pre-trained language models and other deep learning frameworks to strengthen its multitasking ability. At the same time, with the continuous improvement of hardware performance, how to further reduce computational complexity and energy consumption in resource-constrained environments is also worth in-depth investigation. These efforts will open broader space for the development of intelligent voice technology and promote its popularization and application in practical scenarios.