Introduction

The use of robots has increased in recent years, drawing growing attention to human-machine coordination in manufacturing, military, service delivery and other fields1,2,3. In this field, enabling the robot to understand human movement is particularly important4,5. Computer vision-based methods6 and sEMG-based methods7 were proposed and have been widely adopted for several years. However, computer vision-based models often need fixed cameras and a bright environment, and occlusion of the body significantly degrades the prediction result. They also need substantial computing resources and introduce a delay between the predicted result and the real movement. In contrast, sEMG-based models can acquire the kinematics a few instants before the movement occurs and do not depend on the conditions required by computer vision-based models. sEMG signals contain rich information about the muscles and are correlated with movement. sEMG has been used in human-machine interaction control systems for decades8,9,10, so sEMG-based methods are the focus of this study. The most widely used sEMG-based methods, however, are myoelectric pattern recognition techniques11, which can only provide discrete movement classification. To provide continuous kinematic estimation, sEMG-based simultaneous and proportional control was proposed; it yields continuous movement estimates and is closer to everyday human activity.

To achieve intuitive and natural control of robotic arms, Lin et al.12 employed a non-negative matrix factorization (NMF) method based on the concept of muscle synergy to estimate kinematics from multichannel sEMG13. Qing et al.14 proposed a Hill-based muscle model that estimates continuous elbow angles from sEMG signals. Wang et al.15 used Long Short-Term Memory (LSTM) networks to acquire continuous hand kinematics of six movements from forearm sEMG signals. However, musculoskeletal models require complex calculations, so they are hard to adopt widely in practical applications. NMF-based methods are usually used for wrist movement estimation, and they provide a control matrix rather than joint angles. Most machine learning models rely on sensors to collect the angles, and the collected angles become unstable when the muscles move. Moreover, traditional models cannot analyze temporal features and the relationships between different sEMG channels simultaneously, which limits their estimation accuracy.

Thus, a method is needed that can both collect joint angles reliably and estimate them accurately. Traditional angle sensors must be fixed over the muscles, so muscle movement may cause instabilities. Motion capture technology measures angles without sensors fixed over the muscles: it calculates the angles from key points fixed at the human joints, which makes the collected joint angles more accurate. We therefore used a motion capture device to measure joint angles, and used custom-designed software to synchronize the sEMG signals with the joint angles so that the neural signals and the kinematics were aligned. To improve accuracy, we employed Linear-Attention16 in this study. Linear-Attention can extract features of the input data along two dimensions, so we applied it and compared it with commonly adopted deep learning-based methods (MLP17, TCN18, LSTM19, and dot-product attention20).

Method

Subjects

The dataset used in this study was obtained from eight able-bodied individuals (all male, ages 24–40, all right-handed). None of the participants had any known neurological disorders. All participants read and signed an informed consent form before proceeding with the experiment. The experimental protocol and all methods adhered to the principles outlined in the Declaration of Helsinki and were approved by the Research Ethics Committee of West China Hospital (#2022-505).

Experimental protocol

As noted in the Introduction, the sEMG signals and angle measurements were obtained from the same arm so that the sEMG signals and arm movements could be synchronized. The sEMG signals were recorded with a Noraxon system (Noraxon Ultium EMG, China), the angle measurements were obtained from a Vicon system (Vicon, England), and the two streams were synchronized by custom-designed software.

During the measurement, each subject sat in front of a laptop displaying a video of the repeated movement. The eight sEMG sensors were placed on the anterior deltoid, the middle deltoid, the posterior deltoid, the biceps, the posterior triceps, the medial triceps, the pectoralis major, and the supraspinatus, as shown in Fig. 2. The joint angles were captured by the Vicon system; the four recorded angles are listed in Table 1.

Table 1 Recorded Joint Angles.

For the target movements, four single-DoF movements, three simultaneous multiple-DoF movements, and one free movement were selected; they are functionally relevant and representative of most common shoulder-elbow flexions, translations, and rotations. They are listed in Table 2. The subjects were told to perform the eight types of movements sequentially with the dominant arm, starting from the natural state with the arms extended, motionless, and palms facing forward. A prompt on the laptop screen told the subjects when to start and end each movement in the sequence. The subjects imitated the movements shown on the screen, repeated each movement six times, and rested for at least 90 seconds between movements to preserve the quality of the acquired sEMG signals.

Fig. 1

The experimental setup. The subject sits on a chair and imitates the movements shown on a laptop, while the sEMG sensors on the dominant arm record the sEMG signals and the Vicon system records the angles of the four joints of the subject.

Fig. 2

sEMG sensor placement on the subjects. The red dots mark the positions of the sensors.

Table 2 Eight recorded movements.

The Noraxon (Noraxon Ultium EMG, China) and Vicon (Vicon, England) systems were used to acquire the sEMG signals and the angles. The Noraxon system recorded sEMG signals on 8 channels at a sampling frequency of 2000 Hz. The Vicon system can measure all known joint angles of the human body; in this measurement, the four required angles (Table 1) were recorded. The Vicon system recorded angles at a sampling frequency of 100 Hz, and the angles were resampled to 2000 Hz after measurement. The synchrony of the sEMG and angle sampling points was ensured by custom-designed software. A 10 Hz to 500 Hz Butterworth band-pass filter was applied to the recorded sEMG signals. The experimental setup is shown in Fig. 1.
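The resampling step can be illustrated with a minimal sketch. The text does not state which resampling method was used, so linear interpolation is an assumption here, and the function name `upsample_angles` and the synthetic ramp are hypothetical.

```python
import numpy as np

def upsample_angles(angles, fs_in=100.0, fs_out=2000.0):
    """Linearly interpolate joint angles of shape (T, n_angles) from fs_in to fs_out.

    Assumption: linear interpolation; the original resampling method is not stated.
    """
    angles = np.asarray(angles, dtype=float)
    n_in = angles.shape[0]
    t_in = np.arange(n_in) / fs_in
    # Number of output samples covering the same time span as the input.
    n_out = int(round((n_in - 1) * fs_out / fs_in)) + 1
    t_out = np.arange(n_out) / fs_out
    cols = [np.interp(t_out, t_in, angles[:, j]) for j in range(angles.shape[1])]
    return np.column_stack(cols)

# Synthetic example: a 1 s ramp from 0 to 90 degrees on four angles at 100 Hz.
demo = np.column_stack([np.linspace(0.0, 90.0, 101)] * 4)
up = upsample_angles(demo)  # now sampled at 2000 Hz, aligned with the sEMG rate
```

After this step each angle sample has a matching sEMG sample, which is what allows the two streams to be paired point by point.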

The root mean square (RMS) value was chosen as the feature extracted from the sEMG signals because it carries a large amount of information. The RMS feature was extracted with a sliding window of length 100 ms and a step of 0.5 ms. It is calculated as follows:

$$\begin{aligned} RMS=\sqrt{\frac{1}{N}\sum _{i=1}^{N}(n_i-\overline{n})^2} \end{aligned}$$
(1)

where \(N\) is the size of the window, \(n_i\) is the \(i\)-th sampling point of the sEMG signal in the sliding window, and \(\overline{n}\) is the mean value of the data in the sliding window.
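As a rough illustration of Eq. (1), the sketch below computes the mean-removed RMS per channel over a sliding window. At 2000 Hz, a 100 ms window is 200 samples and a 0.5 ms step is 1 sample. The function name `sliding_rms` and the array layout (time by channels) are assumptions for illustration.

```python
import numpy as np

def sliding_rms(emg, fs=2000, win_ms=100.0, step_ms=0.5):
    """Apply Eq. (1) per channel over a sliding window; emg has shape (T, C)."""
    emg = np.asarray(emg, dtype=float)
    win = int(round(win_ms * fs / 1000.0))             # 200 samples at 2000 Hz
    step = max(1, int(round(step_ms * fs / 1000.0)))   # 1 sample at 2000 Hz
    starts = range(0, emg.shape[0] - win + 1, step)
    out = np.empty((len(starts), emg.shape[1]))
    for k, s in enumerate(starts):
        seg = emg[s:s + win]
        centered = seg - seg.mean(axis=0)   # subtract the window mean, as in Eq. (1)
        out[k] = np.sqrt(np.mean(centered ** 2, axis=0))
    return out
```

Because the window mean is subtracted, a constant offset in a channel contributes nothing to the feature.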

After RMS extraction, \(\mu\)-law normalization was applied. \(\mu\)-law normalization21 amplifies input sEMG values close to 0 using a logarithmic mapping, allowing them to contribute more to the predicted results. It has been shown to effectively improve model performance on EMG signals. The \(\mu\)-law normalization formula is as follows:

$$\begin{aligned} F(x_i)=sign(x_i)\frac{\ln (1+\mu |x_i|)}{\ln (1+\mu )} \end{aligned}$$
(2)

where \(x_i\) is the \(i\)-th sampling point of the input, and the hyperparameter \(\mu\) determines the range after normalization.
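Eq. (2) can be sketched in a few lines. The value of \(\mu\) is not given in the text, so the default `mu=255.0` below is only a placeholder assumption, and the input is assumed to have been scaled to [-1, 1] beforehand.

```python
import numpy as np

def mu_law(x, mu=255.0):
    """Eq. (2): sign(x) * ln(1 + mu*|x|) / ln(1 + mu).

    Assumptions: mu=255 is a placeholder (the value used in the study is not
    stated) and x has already been scaled to [-1, 1].
    """
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
```

With this placeholder \(\mu\), an input of 0.01 maps to roughly 0.23, which shows how small amplitudes are expanded while the endpoints 0 and ±1 stay fixed.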

Model development

Dot-product attention

The dot-product attention mechanism22 was introduced by researchers at Google in 2017 and has demonstrated exceptional performance in natural language processing. It generates outputs based on the interdependence among the input vector sequences, so it can extract correlations between different sEMG channels and produce better angle estimates. Dot-product attention is considered the most basic attention mechanism, so it was chosen as a baseline model in this study. Its computation is given below.

Given the input matrix \(X\) with dimension \(l\times c\), where \(l\) is the length of the input sequence (set to 100 ms) and \(c\) is the number of sEMG channels in the experiment, the attention layer generates an attention matrix with the same dimension as \(X\), utilizing a set of parallel and independent heads. The attention layer contains three linear layers, which generate the query matrix \(Q\), key matrix \(K\), and value matrix \(V\), respectively. The corresponding formulas are defined as follows:

$$\begin{aligned} & Q=W_qX+b_q \end{aligned}$$
(3)
$$\begin{aligned} & K=W_kX+b_k \end{aligned}$$
(4)
$$\begin{aligned} & V=W_vX+b_v \end{aligned}$$
(5)
$$\begin{aligned} & \alpha =softmax\left( \frac{QK^T}{\sqrt{d_k}}\right) \end{aligned}$$
(6)

where \(W_q\), \(W_k\), and \(W_v\) are three learnable matrices in \(R^{l\times l}\); \(b_q\), \(b_k\), and \(b_v\) are in \(R^{l\times c}\); and \(d_k\) is a scalar equal to the first dimension of \(W_q\), \(W_k\), and \(W_v\). The dot product of the matrices \(Q\) and \(K\) is divided by \(\sqrt{d_k}\). \(\alpha\) is the attention matrix, normalized to [0,1].
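The following single-head sketch follows Eqs. (3) to (6) with the shapes stated above. The final step, multiplying \(\alpha\) by \(V\), is the standard dot-product attention output and is an assumption here, since the text defines only \(\alpha\). The small sizes and random weights are purely illustrative.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(X, Wq, Wk, Wv, bq, bk, bv):
    """X: (l, c); W*: (l, l); b*: (l, c). Returns alpha (l, l) and alpha @ V (l, c)."""
    Q = Wq @ X + bq                      # Eq. (3)
    K = Wk @ X + bk                      # Eq. (4)
    V = Wv @ X + bv                      # Eq. (5)
    d_k = Wq.shape[0]                    # first dimension of the weight matrices
    alpha = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # Eq. (6)
    return alpha, alpha @ V              # assumption: standard attention output

# Illustrative sizes only: l = 6 time steps, c = 8 sEMG channels.
rng = np.random.default_rng(0)
l, c = 6, 8
X = rng.standard_normal((l, c))
Wq, Wk, Wv = (rng.standard_normal((l, l)) * 0.1 for _ in range(3))
bq, bk, bv = (np.zeros((l, c)) for _ in range(3))
alpha, out = dot_product_attention(X, Wq, Wk, Wv, bq, bk, bv)
```

Each row of `alpha` sums to 1, which is the normalization to [0,1] described above.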

Linear-attention based model

The Linear-Attention mechanism was first proposed by Katharopoulos et al.16 in 2020. Unlike the dot-product attention mechanism, Linear-Attention uses a structure similar to an RNN, which handles sequence data better than other models, while retaining the advantages of dot-product attention. Combining the advantages of RNNs and dot-product attention makes it perform better in continuous sEMG estimation.

The dot-product attention mechanism can calculate the relationship between every pair of sEMG channels, but it cannot capture the temporal features in the sEMG data. Linear-Attention overcomes this disadvantage.

In the Linear-Attention mechanism, the attention layer is formalized as a recurrent neural network. The resulting RNN has two hidden states, namely the attention memory \(s\) and the normalizer memory \(z\). Subscripts denote the timestep in the recurrence:

$$\begin{aligned} & s_0=0 \end{aligned}$$
(7)
$$\begin{aligned} & z_0=0 \end{aligned}$$
(8)
$$\begin{aligned} & s_i=s_{i-1}+\phi (x_iW_K)(x_iW_V)^T \end{aligned}$$
(9)
$$\begin{aligned} & z_i=z_{i-1}+\phi (x_iW_K) \end{aligned}$$
(10)
$$\begin{aligned} & y_i=f_l\left( \frac{\phi (x_iW_Q)^Ts_i}{\phi (x_iW_Q)^Tz_i}+x_i\right) \end{aligned}$$
(11)
$$\begin{aligned} & \phi (x)=elu(x)+1 \end{aligned}$$
(12)

where \(x_i\) denotes the \(i\)-th input and \(y_i\) denotes the \(i\)-th output of a specific transformer layer, and \(W_Q\), \(W_K\), and \(W_V\) are three learnable matrices. The function elu() denotes the exponential linear unit.
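The recurrence in Eqs. (7) to (12) can be sketched as a plain loop. This is an illustrative NumPy version, not the implementation used in the study: \(f_l\) is taken as the identity by default, \(W_V\) is assumed to map back to the input width so that the residual term \(x_i\) can be added, and all sizes and weights are synthetic.

```python
import numpy as np

def phi(x):
    """Eq. (12): elu(x) + 1, i.e. x + 1 for x > 0 and exp(x) otherwise."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(X, WQ, WK, WV, f_l=lambda h: h):
    """X: (T, c); WQ, WK: (c, d); WV: (c, c). f_l defaults to identity (assumption)."""
    T = X.shape[0]
    s = np.zeros((WK.shape[1], WV.shape[1]))  # attention memory, Eq. (7)
    z = np.zeros(WK.shape[1])                 # normalizer memory, Eq. (8)
    Y = np.empty((T, WV.shape[1]))
    for i in range(T):
        q, k, v = phi(X[i] @ WQ), phi(X[i] @ WK), X[i] @ WV
        s += np.outer(k, v)                   # Eq. (9)
        z += k                                # Eq. (10)
        Y[i] = f_l((q @ s) / (q @ z) + X[i])  # Eq. (11)
    return Y

# Synthetic example: 5 time steps, 8 channels, feature width 4.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 8))
WQ, WK = rng.standard_normal((8, 4)), rng.standard_normal((8, 4))
WV = rng.standard_normal((8, 8))
Y = linear_attention(X, WQ, WK, WV)
```

Only `s` and `z` are carried from one step to the next, so the cost per time step does not grow with the sequence length. Because \(\phi\) is strictly positive, the denominator in Eq. (11) is never zero.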

As shown in Fig. 3, Linear-Attention uses a structure similar to an RNN. \(s_{t-1}\) and \(s_t\) represent the attention memory at time \(t-1\) and time \(t\), and \(z_{t-1}\) and \(z_t\) represent the normalizer memory at time \(t-1\) and time \(t\). They record the features of previous sample points, which allows the Linear-Attention mechanism to model the sEMG signals better. \(x_t\) represents the eight sEMG channels at time \(t\); it is processed by the Linear-Attention cell and then by an exponential linear unit, which allows the mechanism to capture the relationships between different sEMG channels and perform better in angle estimation.

Fig. 3

The linear-attention based model (left) and the linear-attention algorithm (right).

The linear-attention based model is displayed in Fig. 3. The output of the Linear-Attention cell is passed through an exponential linear unit and then added to the raw input to prevent vanishing and exploding gradients23. The result then goes through a fully connected layer, which maps it to a \(1\times 4\) vector corresponding to the four estimated angles.

Comparison of the models

The multilayer perceptron (MLP) is considered the most basic neural network and has been used frequently in investigating human-machine systems24,25. Therefore, the MLP was chosen as the baseline model for comparison in this study. The MLP has been shown to be able to fit a continuous curve or a sequence of values. Three layers of non-fully connected networks were used as the hidden layers. The sEMG data were flattened into a 1-dimensional form before being sent into the neural network. The last two layers compressed the expanded data into 4 channels to predict the joint angles. Finally, to prevent overfitting, spatial dropout with a rate of 0.3 was added between layers during training.

We also employed a Temporal Convolutional Network (TCN)18 as a comparison model. A TCN uses stacked one-dimensional convolutional layers with dilated kernels to model long-range temporal dependencies in sequential data. Critically, it enforces causality, ensuring that predictions depend solely on past inputs, while still enabling parallel computation over entire sequences, which offers computational advantages over recurrent architectures. TCNs have been used in previous studies26 and performed well in continuous angle estimation.
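The causal, dilated convolution at the core of a TCN can be illustrated on a single channel. This is a simplified sketch of the operation only, not the TCN used in the study; the function name and the kernel values are hypothetical.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    """y[t] = sum_j kernel[j] * x[t - j*dilation], treating x as zero before t = 0."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    pad = (len(kernel) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])   # left padding only, so no future samples leak in
    y = np.zeros_like(x)
    for j, w in enumerate(kernel):
        start = pad - j * dilation
        y += w * xp[start:start + len(x)]
    return y

y = causal_dilated_conv1d([1, 2, 3, 4, 5], kernel=[1, 1], dilation=2)
# Each output is x[t] + x[t-2]: [1, 2, 4, 6, 8]
```

Stacking such layers with growing dilation widens the receptive field quickly while every output still depends only on current and past samples.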

The Long Short-Term Memory (LSTM) network19 has also been used in previous studies. LSTM is a recurrent neural network (RNN) variant specifically designed with gating mechanisms (input, forget, output gates) and a cell state to learn long-term temporal dependencies in sequential data. This architecture effectively addresses the vanishing gradient problem inherent in standard RNNs, making it well-suited for modeling physiological signals like sEMG.

Settings

The backpropagation algorithm was used for training in this study. Both the dot-product attention and Linear-Attention models used 1 layer, and the LSTM used 3 layers. The dropout value was set to 0.3 to prevent overfitting. All models were trained for 300 epochs; the learning rate was set to 0.001 and halved after 100 epochs during training. All hyperparameters were set to the best-performing values among the groups of values tested.
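The learning-rate schedule can be written as a small function. The text does not say whether the rate was halved once at epoch 100 or again every 100 epochs, so both readings are sketched; the function name and the `repeat` flag are hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, halve_after=100, repeat=False):
    """Learning rate at a given 0-based epoch.

    repeat=False: halve once after `halve_after` epochs (one reading of the text).
    repeat=True:  halve again every `halve_after` epochs (the other reading).
    """
    if epoch < halve_after:
        return base_lr
    n_halvings = epoch // halve_after if repeat else 1
    return base_lr * 0.5 ** n_halvings

schedule = [learning_rate(e) for e in range(300)]  # 300 training epochs
```

Under the first reading the rate is 0.001 for epochs 0 to 99 and 0.0005 afterwards; under the second it drops further to 0.00025 from epoch 200.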

The Friedman test27 and the Wilcoxon signed-rank test28 were used to evaluate the significance of the results.

Results

Before a detailed analysis of the proposed and comparison networks, a performance comparison was made between \(\mu\)-law normalized input and raw input. Table 5 shows this comparison for all models in this study.

As shown in Table 5, all models improved greatly when \(\mu\)-law normalization was adopted, because it amplifies input sEMG values close to 0 using a logarithmic mapping, allowing them to contribute more to the predicted results. It has been demonstrated that \(\mu\)-law normalization can effectively improve model performance on EMG signals.

After the comparison of raw input and \(\mu\)-law normalization, all models were implemented in PyTorch29. All movements were combined; four repetitions of each movement were used as the training set and the remaining two repetitions as the testing set. All results are shown in Table 3. As an example, the performance of each model for subject 3 is compared in Fig. 4, which shows the measured and estimated joint angles over time.

Fig. 4

Estimated outputs compared with the measured joint angles for (a) the MLP model, (b) the TCN model, (c) the LSTM model, (d) DABD, and (e) LABD.

Table 3 Individual-subject correlation coefficients between directly measured and estimated joint angles for each analysis model.
Table 4 Average performance (CC, RMSE, DTW) for individual subjects between directly measured and estimated joint angles for each analysis model.
Table 5 Average performance (CC) comparison between all models with raw input and \(\mu\)-law normalized input.
Table 6 Average correlation coefficients, RMSE, and DTW between directly measured and estimated joint angles for the models in the cross-subject scenario.

Panel (a) of the figure shows that the MLP model is hardly able to match the pattern of any of the four measured joint angles and performed particularly badly when estimating the SDV angle. The TCN model performed better, but when the real joint angle became stable, the estimated joint angle became unstable. The LSTM performed better in estimating the elbow joint angle but performed poorly for the SDV angle. The DABD performed better than the preceding models, but there were still some inaccuracies in the SBV and SDV angles. The LABD compensated for the shortcomings of the preceding models and achieved the best performance in the comparison. The Pearson correlation coefficient (CC) was used to provide a quantitative measure of the association between the estimated and real joint angles of each subject, as shown in Table 3.

Fig. 5

Comparison of the EBH angle values and the RMS values extracted from the recorded sEMG data from the biceps.

To analyze the relevance of the features, we plotted a line chart of the predictions against a single input feature. The EBH angle depends mostly on the biceps, so we compared the angle with the RMS value extracted from the sEMG data recorded from the biceps, as shown in Fig. 5; the figures for the remaining angles and RMS channels are provided via a GitHub link. The EBH angle and the biceps RMS have a similar trend, so all models performed well on the EBH angle, as shown in Fig. 4. The remaining angles depend on several muscles, so they do not share a similar trend with any single RMS channel. Our model can efficiently acquire information from different sEMG (RMS) channels, which is why it performed well in estimating the remaining angles.

We also report the average CC, root mean square error (RMSE), and Dynamic Time Warping (DTW) distance in Table 4. Our method achieved the best performance in RMSE and DTW, while the MLP performed the worst, which shows the advantage of our LABD model.
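For reference, the three metrics can be computed for a single angle trace as in the sketch below. The DTW variant used in the study is not specified, so the classic unconstrained dynamic-programming form with an absolute-difference local cost is an assumption, and the function names are hypothetical.

```python
import numpy as np

def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between two 1-D traces."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def rmse(y_true, y_pred):
    d = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

def dtw_distance(a, b):
    """Classic O(n*m) DTW with |a_i - b_j| as local cost (assumed variant)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

CC measures how well the shape of the estimate follows the measurement, RMSE measures the pointwise error in angle units, and DTW tolerates small timing shifts between the two traces.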

To evaluate the generalization ability of our method, we conducted leave-one-subject-out tests for our model and the other models: each model was trained on seven subjects and tested on the remaining subject. The average performance (CC, RMSE, and DTW) is reported in Table 6. Our LABD model continued to perform well in CC and performed almost the same as the DABD and LSTM models in RMSE and DTW. This indicates that our model can generalize in cross-subject tasks and performs better than the baseline models (TCN, MLP, and LSTM), which shows its potential for cross-subject tasks.
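The leave-one-subject-out protocol amounts to eight folds, each holding out one subject. A minimal sketch of the fold generation follows; the subject numbering 1 to 8 and the function name are assumptions for illustration.

```python
def leave_one_subject_out(subject_ids):
    """Yield (training subjects, held-out subject) pairs, one fold per subject."""
    subjects = sorted(set(subject_ids))
    for held_out in subjects:
        train = [s for s in subjects if s != held_out]
        yield train, held_out

folds = list(leave_one_subject_out(range(1, 9)))  # eight subjects, eight folds
```

The reported cross-subject figures would then be averages of the per-fold metrics over all eight held-out subjects.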

According to the Friedman test, the different methods showed significantly different performance across subjects. The Wilcoxon signed-rank test after Bonferroni correction indicated that the LABD (CC=0.9430) significantly outperformed the MLP (CC=0.8204, p=0.0419), the TCN (CC=0.8575, p=0.0374), the LSTM (CC=0.9015, p=0.0714), and the DABD (CC=0.9083, p=0.0799).

Table 7 Comparison of inference time, training time, and model parameters of the different models for 4 joint angles on an i9-13900H (CPU), an NVIDIA RTX 4060 (GPU), and a Raspberry Pi 4B (ARM).

To evaluate the models from a deployment perspective, we measured the inference time of the different models on different devices: CPU, GPU, and Raspberry Pi 4B. The results are shown in Table 7. The MLP and TCN achieved shorter inference times on all three devices and shorter training times than our model, but their accuracy was poor. Our LABD model performed better than the LSTM and DABD models in inference time and training time and was well within the real-time requirement for human-machine interaction, which shows that it is suitable for real-time deployment.

Discussion

In this study, a Linear-Attention based model (LABD) was proposed to estimate joint angles from non-invasively collected sEMG signals, and it was compared with MLP, LSTM, TCN, and DABD. The experimental results showed that LABD outperformed MLP, LSTM, TCN, and DABD in exploiting the kinematic information in sEMG signals in each case, and the differences were statistically significant.

The experiment shows that estimating the joint angles adequately with existing models remains a great challenge. The MLP model cannot estimate the joint angles precisely because of its structure: the input data must first be flattened, and this step makes the sEMG signal lose its information in the time and channel dimensions. TCN is a convolutional neural network designed for time-series data. It handles sEMG signals better than traditional CNN models, but the convolution operation introduces some unwanted noise. Compared with the MLP and TCN models, the LSTM model generates smoother and more accurate joint angles because its memory cell combines prior and current information, but the cell is too simple to handle the sEMG information and cannot extract features from the sEMG channels, so joint angles determined by more than one channel are badly estimated. Unlike the LSTM, the DABD can combine not only prior information but also future information, and its structure is sufficiently complex to handle the sEMG signals, which makes it outperform the LSTM; however, like the LSTM model, the DABD cannot adequately extract features from the sEMG channels. The proposed LABD solved this problem through its structure and yielded the best estimation results.

Conclusion

The proposed LABD model achieved a mean Pearson correlation coefficient of 0.94 (±0.02) and an RMSE of 3.75, outperforming MLP (0.82±0.2, p=0.0419), TCN (0.86±0.05, p=0.0374), LSTM (0.90±0.01, p=0.0714), and DABD (0.93±0.04, p=0.0799). Its model parameters (0.68 MB) and inference time (CPU: 8.36 ms, GPU: 0.99 ms, ARM: 22.97 ms) are also better than those of the other well-performing models (DABD and LSTM).

Through the synchronized measurements, the kinematic information acquired from sEMG signals can be used to estimate the four joint angles of the upper limb. The results show that the Linear-Attention based model achieved significantly better accuracy than the MLP, TCN, LSTM, and dot-product attention models, and the structure of LABD keeps its time and memory costs modest. Eight movements were used to ensure that the model can estimate a sufficiently broad range of movements.

The model was trained separately for every single subject; this training strategy is cumbersome and makes the model difficult to generalize. Cross-subject training can train a model across subjects and adapt it to multiple subjects30. Transfer learning can reduce model training time to a small number of epochs by preserving the parts that are similar between subjects. Our future work will focus on improving cross-subject performance, on transfer learning, and on synthetic EMG data to improve robustness in scenarios with small sample sizes, as demonstrated in recent generative AI studies on rare diseases31. In short, our method is an accurate and suitable approach for single-subject estimation.