Introduction

Exoskeletons have garnered significant attention in recent years due to their broad applications in rehabilitation, industrial, and military domains1,2,3,4. In industrial applications5, exoskeleton devices are designed to reduce workers’ fatigue, prevent injuries, and increase productivity. By enhancing human strength and endurance, these devices enable workers to perform high-demand physical tasks for extended periods. However, accurately classifying and responding to a diverse range of human motion states in real time remains a significant challenge for assistive exoskeletons6. For industrial exoskeletons, it is essential to accurately identify various movement states, including types of motion and load conditions. These systems must dynamically respond to different actions and provide appropriate assistance based on real-time motion classification.

Despite progress, the development of exoskeleton systems, especially for industrial use, has not fully addressed these complex challenges. Human actions are highly variable under actual conditions; the same movement may exhibit substantial differences across individuals and load conditions. Existing algorithms struggle to manage this variability effectively. Consequently, developing algorithms that can robustly extract motion features and adapt to different tasks and load conditions has become a pressing challenge in the field of assistive exoskeletons.

Surface electromyography (sEMG)-based methods offer significant advantages in predicting motion intent, making them particularly suitable for real-world applications. sEMG signals generated 30 to 150 ms before actual human movement provide direct insight into user intent before visible motion occurs7,8; this capability allows for predictive, real-time responses that can enhance the precision of exoskeleton control. Additionally, sEMG can monitor muscle fatigue, further demonstrating the assistive effects of exoskeleton devices9,10,11. Given its rich information on muscle activity and ease of acquisition, sEMG is widely used in human–machine interaction systems, making it an effective tool for predicting motion intent and supporting effective control in assistive technologies.

Previous studies have explored various human motion state classification methods. Early sEMG-based upper limb motion classification methods typically relied on traditional classifiers. These methods generally involve collecting sEMG signals, pre-processing, manual feature extraction in both time and frequency domains, training models with the extracted features, and classifying input data12. For example, Gaussian mixture models were used in 13 to classify six upper limb movements, while 14 utilized a linear programming boosting algorithm to classify seven upper limb actions. Reference 15 applied a logistic polynomial regression approach to classify dynamic lifting tasks across three load conditions, achieving over 80% accuracy. In 16, a cubic support vector machine (SVM) model was employed for similar lifting task classification under varying loads, attaining 99% accuracy. However, traditional machine learning algorithms lack the capability to adapt in real time to user intent and cannot effectively capture the complex features of sEMG signals17,18.

The recent introduction of deep learning has provided new approaches for sEMG signal classification, as sEMG signals typically contain abundant high- and low-frequency information, exhibiting complex spatial patterns and local features. Convolutional neural networks (CNNs), known for capturing spatial structures, have been widely applied in sEMG signal processing19,20,21. For example, in 22, a combination of deformable convolutional neural networks (DCNN) and magnitude-based short-time Fourier transform achieved an accuracy of 82.03% in classifying six basic arm movements (flexion, extension, abduction, adduction, pronation, and supination).

Long short-term memory (LSTM) networks, known for capturing long-term dependencies, are well suited to sEMG signals23, which often have strong temporal characteristics and complex dynamic changes within short periods. Through memory cells, LSTMs effectively capture and retain these long-term temporal dependencies. For instance, 6 combined a CNN with an LSTM to classify four shoulder movements of subjects wearing an exoskeleton, achieving an average accuracy of 96.2%. In 24, a six-axis inertial sensor combined with an LSTM model was used to recognize activities and estimate loads for subjects wearing an exoskeleton, achieving 90.80% accuracy in activity recognition and 87.14% in load estimation.

Attention-based LSTM was initially proposed in 25 to address relation classification in natural language processing. Given that sEMG signals exhibit temporal dependencies and local feature variability, the importance of features at different time points varies for the final classification outcome. Thus, attention-based LSTM is highly suitable for processing sEMG signals10,26,27. In the face of complex motion patterns or varying load conditions, the attention mechanism helps models focus on the most representative moments, enhancing adaptability across different motion states.

These studies demonstrate the significant research value and application potential of deep learning methods incorporating sEMG data for exoskeleton applications. However, existing studies lack a comprehensive classification of motion states and loads, and most do not consider the effects of wearing an exoskeleton.

In this work, we propose an sEMG-based solution that leverages a convolutional bidirectional LSTM (BiLSTM) model with an attention mechanism for human activity recognition, classifying five typical human motion states with an accuracy of 97.29%.

This research focuses on an AI-assisted approach to enhance human motion intent recognition during the use of upper limb exoskeletons. The paper primarily addresses the perceptual aspect of recognizing motion intent while wearing an exoskeleton. The main contributions are: (a) classification of common upper limb motion states (covering motion types and loads) while wearing an exoskeleton, especially the rarely discussed Static Load-Bearing State (SLB), in which the person is stationary while carrying a load; and (b) implementation of a convolutional BiLSTM model with an attention mechanism, yielding promising classification results.

Methods

Participants

A total of 10 participants were recruited for the experiment, aged 24–42 years (M = 27.70, S.D. = 5.57), with heights ranging from 1.68 to 1.82 m (M = 1.75 m, S.D. = 0.04 m) and weights from 65 to 90 kg (M = 78.70 kg, S.D. = 7.52 kg). All participants were free from any conditions affecting the experiment and refrained from engaging in strenuous physical activity the day before testing. All participants provided written informed consent before participation. The study was conducted in accordance with the principles and guidelines described in the Declaration of Helsinki and was approved by the Ethics Committee of Zhejiang Provincial People’s Hospital (KY2024220). Participants also signed written informed consent forms for the publication of any identifying information or images in an online open-access publication.

Apparatus

We used a self-designed exoskeleton as the experimental platform, shown in Fig. 1. This exoskeleton is powered by UniTree A1 motors (Hangzhou Unitree Technology Co., LTD, China), which can be driven in zero-torque mode to minimize friction within the motor components.

Fig. 1 Exoskeleton experimental platform.

Following the experimental motion design from 28, we designed a device to measure maximum voluntary isometric contraction (MVIC), as shown in Fig. 2. This device facilitates MVIC experiments for primary muscle groups, such as the shoulder and hip joints.

Fig. 2 MVIC test platform.

The MVIC test rig was placed in a motion capture room equipped with 16 cameras and LED light bands to capture participants from multiple angles, helping to standardize participant movements. In Fig. 2, A represents the camera, B denotes the LED light band, C is the data acquisition laptop, D1–D5 are the load sensors (SBT710, SIMBATOUCH INC, China), and E is the digital transducer (SBT904D, SIMBATOUCH INC, China) with a power supply module. The platform provides force measurements with an accuracy of 0.01 N across various movements. We utilized an sEMG acquisition device from Sichiray Technology Co., Ltd., China, which has a sampling rate of 200 Hz.

Testing procedures

First, each participant completed an MVIC test using the MVIC test platform. Then, sEMG electrodes were placed on the middle deltoid (DT), biceps brachii (BB), and triceps brachii (TB) according to the SENIAM (surface EMG for the non-invasive assessment of muscles) guidelines29, as shown in Fig. 5. Afterwards, participants wore the exoskeleton and performed the following tasks sequentially, representing five states, as shown in Fig. 3: Resting State (RS), a resting or non-movement state; Mild Activity State (MA), a mild movement state; Rapid Movement State (RM), rapid limb movement; Dynamic Load-Bearing State (DLB), dynamic load-bearing; and Static Load-Bearing State (SLB), static heavy load-bearing.

Fig. 3 Schematic diagram of experimental motion design.

In this study, the movement patterns were designed as follows: MA involved normal arm swinging during walking, while RM was designed as a rapid shoulder joint extension of 90 degrees within one second. To simulate a static load-bearing state, SLB was defined as the subject holding a 25 kg dumbbell with maximum effort for 5 s without movement. DLB required the subject to perform a front raise using a 7.5 kg dumbbell, with duration dependent on the subject’s adaptation to the motion. During the experiment, each subject rested for 5 min between actions and took a 30-min break between sets to prevent fatigue from affecting data quality. sEMG signals were recorded using an sEMG collection module. After data processing, the total recording time for each type of action was standardized to 30 s.

Data analysis

Data processing and acquisition

The original signals collected in this paper are shown in Table 1:

Table 1 The original data collected by the experiment.

The original sEMG data contain many outliers, and directly feeding these signals into the network increases the complexity of model training. Furthermore, due to the limited size of the constructed dataset (N = 10 subjects), sEMG features from a single subject may disproportionately influence the model. To address these challenges, we implemented systematic feature engineering to extract comprehensive time-domain, frequency-domain, and morphological features from the signal, as illustrated in Table 2. These features have been shown in multiple studies16,30 to be effective in improving the performance of sEMG classification models. By utilizing these universal features, we aimed to capture the inherent characteristics of sEMG signals that remain consistent across different subjects, thereby improving the model’s ability to recognize motion states regardless of individual physiological variations.

Table 2 The original data collected by the experiment.

First, we extracted time-domain features to characterize the amplitude characteristics of sEMG signals. The root mean square (RMS) represents the average signal energy, reflecting muscle activation levels, while peak-to-peak (P-P) value indicates the range of muscle contraction intensity.
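The two time-domain features can be illustrated with a minimal NumPy sketch. It assumes each window is an array of shape (samples, channels); the function names and the synthetic test window are hypothetical and are not taken from the study's code.

```python
import numpy as np

def rms(window: np.ndarray) -> np.ndarray:
    """Root mean square per channel; window has shape (samples, channels)."""
    return np.sqrt(np.mean(np.square(window), axis=0))

def peak_to_peak(window: np.ndarray) -> np.ndarray:
    """Difference between the largest and smallest sample of each channel."""
    return np.max(window, axis=0) - np.min(window, axis=0)

# Synthetic stand-in for one 100-sample, 3-channel window sampled at 200 Hz.
t = np.arange(100) / 200.0
window = np.stack(
    [np.sin(2 * np.pi * 20 * t), 2 * np.sin(2 * np.pi * 20 * t), np.zeros(100)],
    axis=1,
)
rms_values = rms(window)          # one value per channel
pp_values = peak_to_peak(window)  # one value per channel
```

Doubling the amplitude of a channel doubles both features, which is why they track activation level and contraction range rather than waveform shape.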

Next, we performed frequency-domain analysis to capture the spectral properties of sEMG signals. Mean frequency (MNF) and median frequency (MDF) were extracted, revealing muscle fiber recruitment patterns and potential fatigue indicators, which are crucial for identifying variations in motor unit action potentials across different motion states.
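As an illustration of the two spectral features, the sketch below uses the common definitions: MNF as the power-weighted mean of the frequency bins, and MDF as the frequency that splits the spectral power into two equal halves. The paper does not state its exact estimator, so the periodogram based on `numpy.fft.rfft`, the mean removal, and the function names are assumptions.

```python
import numpy as np

def power_spectrum(x, fs):
    """One-sided periodogram of a single-channel signal (mean removed)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    return freqs, power

def mean_frequency(x, fs):
    """MNF: power-weighted average frequency (undefined for a constant signal)."""
    freqs, power = power_spectrum(x, fs)
    return float(np.sum(freqs * power) / np.sum(power))

def median_frequency(x, fs):
    """MDF: first frequency bin at which cumulative power reaches half the total."""
    freqs, power = power_spectrum(x, fs)
    cumulative = np.cumsum(power)
    idx = np.searchsorted(cumulative, cumulative[-1] / 2.0)
    return float(freqs[idx])

# Synthetic check signal: a 20 Hz sine sampled at 200 Hz for 0.5 s.
fs = 200.0
t = np.arange(100) / fs
tone = np.sin(2 * np.pi * 20 * t)
mnf = mean_frequency(tone, fs)
mdf = median_frequency(tone, fs)
```

With a 100-sample window at 200 Hz the frequency resolution is 2 Hz, so MDF in this sketch can only take values on that grid.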

Additionally, we calculated shape factor (SF) to provide waveform morphological information, and root sum of squares (RSS) to offer a comprehensive measurement of signal intensity that emphasizes peaks more than RMS, making it particularly suitable for detecting brief but intense muscle activations during specific upper limb movements.
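A short sketch of these two features follows, under the assumption that the usual signal-processing definitions apply: shape factor as RMS divided by the mean absolute value, and RSS as the square root of the summed squares. The paper does not give its formulas, so these definitions and the function names are assumptions.

```python
import numpy as np

def shape_factor(window):
    """RMS divided by mean absolute value, per channel (undefined for all-zero channels)."""
    window = np.asarray(window, dtype=float)
    rms = np.sqrt(np.mean(window ** 2, axis=0))
    mean_abs = np.mean(np.abs(window), axis=0)
    return rms / mean_abs

def root_sum_of_squares(window):
    """Square root of the summed squared samples, per channel."""
    window = np.asarray(window, dtype=float)
    return np.sqrt(np.sum(window ** 2, axis=0))

# Tiny hypothetical two-sample, two-channel window for illustration.
window = np.array([[1.0, 3.0], [-1.0, -4.0]])
sf = shape_factor(window)
rss = root_sum_of_squares(window)
```

Under these definitions RSS equals RMS multiplied by the square root of the window length, so for a fixed window size the two differ only by a constant scale.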

To capture subtle spectral characteristics, we employed Mel-frequency cepstral coefficients (MFCCs), specifically utilizing the first and third coefficients (MFCC1 and MFCC3) to detect nuanced frequency distribution patterns in sEMG signals associated with different motion states.

These eight features were calculated for each of the three sEMG channels, creating a 24-dimensional feature vector (8 features × 3 channels). This multi-domain feature set serves as a comprehensive representation of sEMG signals, establishing a solid foundation for motion state analysis.

Classification model

Perceiving exoskeleton users’ upper limb motion patterns is a critical issue in human-in-the-loop exoskeleton systems. However, many exoskeletons, including most passive exoskeletons and some active exoskeletons, operate based on predefined motion patterns and trajectories. This approach places humans in a passive role within the human-exoskeleton system, resulting in poor human–machine interaction. To address this issue, this study proposes a CNN-BiLSTM-Attention model for motion intention classification, as shown in Fig. 5. The model leverages the local temporal features extraction capabilities of CNN, the long-term temporal dependencies processing abilities of BiLSTM, and the attention mechanism’s focus on critical information to achieve improved motion classification performance.

To process the collected sEMG data, an overlapping sliding window mechanism was first applied, with a window length of 500 ms and an overlap of 450 ms, so that consecutive windows are shifted by 50 ms. This approach enhances the inclusion of short-term dynamic information. Due to the sequential nature of sEMG data, information between adjacent time points changes rapidly. Using overlapping windows allows the model to capture more detailed variations while reducing the loss of sEMG information at the window boundaries. Subsequently, the sEMG signal was segmented into multiple 100 × 3 matrices (100 samples per window at the 200 Hz sampling rate, across three channels), which were then subjected to systematic feature engineering to extract multi-domain discriminative patterns. A 1D CNN layer was then applied for local temporal feature extraction.
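The windowing step can be sketched as follows, using the stated 200 Hz sampling rate, 500 ms window, and 450 ms overlap. The function name and the random stand-in recording are hypothetical.

```python
import numpy as np

FS_HZ = 200        # sampling rate reported for the sEMG device
WINDOW_MS = 500    # window length
OVERLAP_MS = 450   # overlap between consecutive windows

def sliding_windows(signal, fs=FS_HZ, window_ms=WINDOW_MS, overlap_ms=OVERLAP_MS):
    """Split a (samples, channels) recording into overlapping windows.

    Returns an array of shape (n_windows, window_samples, channels).
    """
    signal = np.asarray(signal)
    win = int(round(window_ms * fs / 1000))                  # 100 samples
    step = int(round((window_ms - overlap_ms) * fs / 1000))  # 10 samples
    if signal.shape[0] < win:
        return np.empty((0, win) + signal.shape[1:])
    starts = np.arange(0, signal.shape[0] - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])

# Random stand-in for one 30 s, 3-channel recording.
rng = np.random.default_rng(0)
recording = rng.normal(size=(30 * FS_HZ, 3))
windows = sliding_windows(recording)
```

Because the step is 10 samples, each new window shares 90 of its 100 samples with the previous one, which multiplies the number of training examples but also makes neighbouring windows strongly correlated.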

For temporal feature extraction in sEMG data, we considered recurrent neural networks (RNNs)31. However, RNNs can suffer from gradient vanishing or exploding issues during the training process for sequential data. LSTM can partially address these issues32. Figure 4 illustrates the data processing flow within an LSTM cell, which forms the core component of LSTM layers. The LSTM layer’s states include the hidden state \({h}_{t}\), which serves as the LSTM layer’s output at the current time step, and the cell state \({c}_{t}\), which is responsible for carrying information across time steps. Information regulation is achieved through gates, which selectively allow specific information to pass. LSTM includes three types of gates (illustrated by the dashed boxes in Fig. 4): the forget gate, the input gate, and the output gate. These gates are responsible, respectively, for discarding irrelevant information, updating the LSTM cell state, and determining which information is included in the LSTM output state. Additionally, the candidate cell state \({\widetilde{C}}_{t}\), processed through a tanh layer, combines with the output of the input gate \({i}_{t}\) to determine cell state updates.

Fig. 4 LSTM unit structure.

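The parameters of the LSTM network are the weight matrices \(W\) and the biases \(b\). Each \(W\) acts on the concatenation \(\left[{h}_{t-1},{x}_{t}\right]\) of the previous hidden state and the current input, so it contains both the recurrent and the input weights. Each gate and the candidate cell state are computed as follows: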

$${f}_{t}=\sigma \left({W}_{f}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{f}\right)$$
(1)
$${i}_{t}=\sigma \left({W}_{i}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{i}\right)$$
(2)
$${o}_{t}=\sigma \left({W}_{o}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{o}\right)$$
(3)
$${\widetilde{C}}_{t}=\text{tanh}\left({W}_{C}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{C}\right)$$
(4)

The cell state \({c}_{t}\) and hidden state \({h}_{t}\) are calculated as:

$${c}_{t}={f}_{t}*{c}_{t-1}+{i}_{t}*{\widetilde{C}}_{t}$$
(5)
$${h}_{t}={o}_{t}*\text{tanh}\left({c}_{t}\right)$$
(6)
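Equations (1) to (6) can be traced in a few lines of NumPy. The sketch below performs one forward step of a single LSTM cell; the dictionary keys, the toy dimensions, and the random weights are hypothetical and serve only to make the equations executable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following Eqs. (1)-(6).

    Each W_* has shape (hidden, hidden + features) and acts on [h_{t-1}, x_t].
    """
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # Eq. (1), forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # Eq. (2), input gate
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # Eq. (3), output gate
    c_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])  # Eq. (4), candidate state
    c_t = f_t * c_prev + i_t * c_tilde                         # Eq. (5), element-wise
    h_t = o_t * np.tanh(c_t)                                   # Eq. (6)
    return h_t, c_t

# Toy dimensions and random weights, for illustration only.
rng = np.random.default_rng(0)
hidden, features = 4, 3
params = {}
for gate in ("f", "i", "o", "C"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden, hidden + features))
    params[f"b_{gate}"] = np.zeros(hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, features)):  # five time steps
    h, c = lstm_step(x_t, h, c, params)
```

A bidirectional layer would run a second cell with its own parameters over the reversed sequence and concatenate the two hidden states at each time step.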

Compared with a unidirectional LSTM, which processes data in a single direction and therefore relies solely on information from previous time steps for predictions, BiLSTM processes data in both directions simultaneously. This enables a more comprehensive understanding of the temporal variations in sEMG signals, thereby enhancing classification performance. Because feature importance varies across motion states, especially for large movements such as RM and DLB, relying solely on CNN and BiLSTM may not sufficiently emphasize these critical features. To address this issue, we introduced an attention mechanism33. The attention mechanism selectively focuses on the most relevant parts of the sEMG signals associated with each motion pattern, assigning them higher weights, thus improving the robustness and interpretability of the classification results.
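The exact form of the attention layer is not specified in the text. As a rough illustration, the sketch below uses one common formulation for attention over BiLSTM outputs: each time step is scored with a learned vector applied to tanh(H), the scores are normalized with a softmax, and the outputs are averaged with the resulting weights. The function names, the shapes, and the pooling into a single context vector are assumptions, not the paper's confirmed design.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_pool(H, w):
    """Weight BiLSTM outputs by learned importance.

    H: (time_steps, features) sequence of BiLSTM outputs.
    w: (features,) scoring vector (a trainable parameter in a real model).
    Returns the weighted context vector and the per-step attention weights.
    """
    scores = np.tanh(H) @ w  # one scalar score per time step
    alpha = softmax(scores)  # non-negative weights that sum to 1
    context = alpha @ H      # weighted average of the time steps
    return context, alpha

# Hypothetical sizes: 5 time steps, 256 features (2 x 128 BiLSTM units).
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 256))
w = rng.normal(scale=0.1, size=256)
context, alpha = attention_pool(H, w)
```

The vector `alpha` is the kind of quantity an attention heatmap displays: near-uniform values mean all time steps contribute about equally, while a peaked distribution means a few steps dominate the decision.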

During model training, cross-entropy loss was used as the optimization criterion, with the ADAM optimizer set to a learning rate of 0.0001. The model was trained for 200 epochs with a batch size of 128. The detailed model architecture is outlined below:

First, the pre-processed sEMG signal passes through a 1D convolutional layer (with 64 filters of size 3), which extracts local temporal features from the signal. This is followed by a max-pooling layer (pooling size of 2) to reduce feature dimensionality. The output of the pooling layer is then fed into the BiLSTM layer, where the number of units is set to 128, with a dropout rate of 0.2 to prevent overfitting. The BiLSTM captures the long-term temporal dependencies of the signal. We then applied an attention mechanism to strengthen the model’s focus on critical sEMG features. The output is then flattened for processing in subsequent fully connected layers. The flattened features pass through a fully connected layer (with 256 neurons and ReLU activation), followed by another dropout layer (0.5) to further prevent overfitting. Finally, a softmax function outputs the probability distribution across five categories, representing different motion states, as shown in Fig. 5.
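The final softmax output and the cross-entropy training criterion can be illustrated with a small NumPy sketch. The ordering of the five class indices and the example logits are assumptions made for illustration.

```python
import numpy as np

# Assumed label order; the paper does not state which index maps to which state.
STATES = ("RS", "MA", "RM", "DLB", "SLB")

def softmax(logits):
    """Row-wise softmax over a (batch, classes) array of logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-12  # guards against log(0)
    picked = probs[np.arange(labels.size), labels]
    return float(-np.mean(np.log(picked + eps)))

# Two hypothetical samples: one confidently RS, one completely undecided.
logits = np.array([[4.0, 0.5, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0]])
labels = np.array([0, 2])
probs = softmax(logits)
loss = cross_entropy(probs, labels)
predicted = [STATES[k] for k in probs.argmax(axis=1)]
```

An undecided prediction over five classes costs ln 5 (about 1.61), which is a useful reference value when reading loss curves for a five-class problem.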

Fig. 5 Proposed CNN-BiLSTM-attention framework for sEMG-based motion classification.

We collected 10 sets of sEMG data from 10 participants, using 80% for training and 20% for testing and model validation. As shown in Fig. 6, as the number of training epochs increases, both the training and validation losses gradually decrease, indicating good convergence. The accuracy of the training and validation sets improves with additional epochs and stabilizes in the later stages, demonstrating the model’s effectiveness.

Fig. 6 Training process of the proposed model.

Evaluation methods

First, to understand our proposed model’s ability to differentiate between various motion states, we utilized t-distributed stochastic neighbor embedding (t-SNE) for feature space visualization analysis34. This dimensionality reduction technique provides an intuitive visualization of the separation between the five motion states at different processing stages, effectively revealing the model’s internal representation learning capabilities. By extracting and visualizing features after each major component of our architecture, we can systematically track how signal representations evolve and become increasingly discriminative throughout the network.

Subsequently, to investigate the attention layer’s focus on different motion states, we visualized the attention weight matrix as a heatmap. By analyzing the attention heatmap, we were able to evaluate the model’s allocation of attention and verify whether it aligns with the motion patterns.

To comprehensively evaluate the classification performance of the proposed model, it is imperative to draw conclusions based on relevant evaluation metrics. While accuracy is a robust metric that reflects the model’s overall performance, relying solely on accuracy is insufficient—especially in multi-class classification tasks where the ability to differentiate between classes can vary significantly. A single accuracy metric does not capture the nuances of performance for each individual category. This is particularly evident for motion modes that are easily confused, such as RS vs. MA and RM vs. DLB, where accuracy alone might mask underlying misclassification issues. Therefore, we introduced the confusion matrix as an evaluation tool to provide a more intuitive and detailed view of the model’s discriminative capabilities. By analyzing the confusion matrix, we can observe the classification accuracy for each category and pinpoint which motion modes are prone to confusion. This deeper analysis yields valuable insights into the model’s performance across various motion modes, thereby identifying potential areas for improvement and offering targeted directions for future optimization.
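As a minimal sketch of this evaluation, the code below builds a confusion matrix (rows are true classes, columns are predicted classes) and derives per-class precision, recall, and F1 from it. The function names and the toy labels are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def per_class_metrics(cm):
    """Precision, recall, and F1 per class; classes with no samples score 0."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how often each class truly occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Toy three-class example, not data from the study.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
precision, recall, f1 = per_class_metrics(cm)
accuracy = np.trace(cm) / cm.sum()
```

Off-diagonal entries identify which pairs of states are confused; for example, `cm[0, 1]` counts class-0 samples labelled as class 1, information that the single accuracy value hides.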

Furthermore, to assess the generalization capability of the proposed model across different subjects, we incorporated leave-one-subject-out cross-validation (LOOCV) into our evaluation. LOOCV is particularly well-suited for datasets with inherent subject-specific variations35, as it systematically designates each subject’s data as the test set while using the remaining data for training. This approach captures the variability in performance across individuals, revealing both the strengths and weaknesses of the model when applied to unseen subjects. The LOOCV results provide additional insights into the model’s robustness and underscore potential areas for further enhancement, particularly in addressing misclassification issues among closely related motion modes.
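The splitting logic can be sketched in a few lines: every window carries the ID of the subject it came from, and each fold holds out all windows of exactly one subject. The generator name and the four-windows-per-subject layout are hypothetical.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for every subject."""
    subject_ids = np.asarray(subject_ids)
    for held_out in np.unique(subject_ids):
        test_mask = subject_ids == held_out
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Ten subjects with four hypothetical windows each.
subject_ids = np.repeat(np.arange(1, 11), 4)
folds = list(leave_one_subject_out(subject_ids))
```

Splitting by subject rather than by window matters here because heavily overlapping windows from the same recording are nearly identical; a random window-level split would place close copies of test samples in the training set.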

For statistical comparison of model performances, we employed Dunn’s test with Bonferroni correction36. This non-parametric approach was selected after normality testing revealed that cross-subject performance data for some models did not conform to normal distribution assumptions. Given the multiple comparison scenario involving five different models, Dunn’s test provides a robust framework for identifying significant performance differences while the Bonferroni correction controls the family-wise error rate, ensuring statistical rigor in our conclusions.
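For readers unfamiliar with the procedure, the sketch below implements Dunn's pairwise rank test with a Bonferroni adjustment using only NumPy and the standard library. It follows the textbook formulation (pooled average ranks, tie-corrected variance, two-sided normal p-values multiplied by the number of comparisons); it is not the authors' analysis code, and the group names and scores are synthetic.

```python
import math
from itertools import combinations
import numpy as np

def average_ranks(values):
    """1-based ranks of the pooled data, with ties sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(values.size)
    i = 0
    while i < values.size:
        j = i
        while j + 1 < values.size and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def dunn_bonferroni(groups):
    """groups: dict name -> 1D scores. Returns {(a, b): Bonferroni-adjusted p}.

    Assumes the pooled values are not all identical (variance would be zero).
    """
    names = list(groups)
    sizes = [len(groups[n]) for n in names]
    pooled = np.concatenate([np.asarray(groups[n], dtype=float) for n in names])
    ranks = average_ranks(pooled)
    n_total = pooled.size
    _, counts = np.unique(pooled, return_counts=True)
    tie_term = np.sum(counts ** 3 - counts) / (12.0 * (n_total - 1))
    base_var = n_total * (n_total + 1) / 12.0 - tie_term
    bounds = np.cumsum([0] + sizes)
    mean_rank = [ranks[bounds[k]:bounds[k + 1]].mean() for k in range(len(names))]
    pairs = list(combinations(range(len(names)), 2))
    adjusted = {}
    for a, b in pairs:
        se = math.sqrt(base_var * (1.0 / sizes[a] + 1.0 / sizes[b]))
        z = (mean_rank[a] - mean_rank[b]) / se
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
        adjusted[(names[a], names[b])] = min(1.0, p * len(pairs))
    return adjusted

# Synthetic scores for three hypothetical models, not results from the paper.
scores = {
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [11, 12, 13, 14, 15],
}
p_values = dunn_bonferroni(scores)
```

With five models there are ten pairwise comparisons, so each raw p-value is multiplied by ten before being compared with the significance level.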

Results

Visualization of the feature extraction process

Figure 7 illustrates the feature distribution in the 2D t-SNE space prior to deep learning model processing. The visualization reveals that after manual feature extraction, the sEMG signals demonstrate an initial level of separability between motion states. Figure 8 shows that following convolutional processing, while preliminary clustering patterns emerge, substantial overlap persists between different motion states. As shown in Fig. 9, the incorporation of temporal modeling through the BiLSTM layer significantly enhances the class structure, with distinct motion clusters becoming more apparent. Finally, Fig. 10 demonstrates that after processing through the complete network architecture, the motion states form well-defined and distinctly separated clusters. This pronounced separation in the feature space validates the effectiveness of our proposed model in learning discriminative features for robust motion state classification.

Fig. 7 sEMG feature visualization using t-SNE (without model processing).

Fig. 8 sEMG feature visualization using t-SNE (after CNN layer).

Fig. 9 sEMG feature visualization using t-SNE (after BiLSTM layer).

Fig. 10 sEMG feature visualization using t-SNE (after attention and dense layers).

Visualization of the attention

Figures 11 and 12 indicate that the attention weights for RS and MA are relatively uniform, with standard deviations of 0.08 and 0.05, respectively. Figures 13 and 14 reveal that RM and DLB share similar characteristics, as the model assigns greater importance to information from time steps T4 to T5; this is also reflected in their higher attention weight variability, with standard deviations of 0.23 for RM and 0.12 for DLB. Figure 15 illustrates the attention weights for SLB, showing that the model places more emphasis on information at time step T1 when making classification decisions, with a corresponding standard deviation of 0.17.

Fig. 11 Attention heatmap (Subject 1, RS).

Fig. 12 Attention heatmap (Subject 1, MA).

Fig. 13 Attention heatmap (Subject 1, RM).

Fig. 14 Attention heatmap (Subject 1, DLB).

Fig. 15 Attention heatmap (Subject 1, SLB).

Ablation analysis

The ablation study results, as summarized in Table 3 and Fig. 16, demonstrate that the proposed model outperforms all compared architectures across key evaluation metrics. Specifically, it achieves an accuracy of 97.29%, precision of 97.29%, recall of 97.29%, and an F1 score of 0.9729, surpassing the performance of individual CNN (96.00%), BiLSTM (90.30%), CNN + BiLSTM (96.64%), and CNN + Attention (94.96%) models. Furthermore, the model maintains a reasonable inference time of 57.20 ± 2.38 ms, which is comparable to simpler architectures while delivering superior performance. Figure 17 displays the confusion matrix of the proposed model, showing clear diagonal dominance, which indicates high classification accuracy across all motion states.

Table 3 Ablation study: impact of different module combinations on classification performance.

Fig. 16 Accuracy of different module combinations in ablation study.

Fig. 17 Confusion matrix of the proposed model.

As detailed in Table 4, the proposed model achieved excellent performance across all motion states. The RS showed the highest performance with 98.39% precision and 98.87% recall, resulting in an F1 score of 0.9863. Similarly, SLB demonstrated strong results with 97.75% precision and 97.90% recall (F1: 0.9782). While RM and DLB showed slightly lower metrics with F1 scores of 0.9608 and 0.9655 respectively, they still maintained robust classification performance. MA also achieved impressive results with 97.59% precision and 97.12% recall (F1: 0.9735), confirming the model’s consistent performance across all motion states.

Table 4 Per-class performance metrics.

Comparison with other studies

A challenge in comparing upper limb motion classification models is the absence of standardized datasets, unlike the established benchmarks available for gesture recognition tasks15,35. Most researchers in this field construct their own datasets, which complicates direct performance comparisons across studies. While the EMAHA-DB1 dataset, which includes 22 common upper limb movements15, offers potential for standardization, it has not yet been widely adopted in related research. To establish a meaningful evaluation framework, we implemented several baseline algorithms, including CNN + LSTM 6, Cubic SVM 16, LSTM 24, and DCNN 22, and compared their performance against our proposed model on our custom dataset. Additionally, we used LOOCV to validate all algorithms, ensuring a robust assessment of generalization capabilities across different subjects. Table 5 presents a comprehensive comparison of classification accuracy between these established approaches and our proposed method.

Table 5 Model performance comparison of LOOCV results.

The experimental results comparing different models are presented in Fig. 18 and Table 5. The proposed model achieved the highest performance across all metrics, with an accuracy of 88.17 ± 5.39%, precision of 88.76 ± 4.97%, recall of 88.13 ± 5.47%, and F1 score of 0.8799 ± 0.5555. Statistical analysis using Dunn’s test with Bonferroni correction confirmed that our model significantly outperformed the LSTM model (p < 0.001) and the DCNN model (p < 0.001) in F1 score. The CNN-LSTM architecture showed the second-best performance (accuracy: 77.96 ± 10.38%), followed by the SVM model (accuracy: 77.65 ± 6.42%). Notably, our proposed model maintained more consistent performance, with a considerably lower standard deviation (5.39%) than CNN-LSTM (10.38%), indicating better stability across subjects. The DCNN model achieved moderate results (accuracy: 56.54 ± 24.20%), while the basic LSTM model had the lowest accuracy (46.71 ± 11.54%). These results demonstrate that our proposed architecture provides robust sEMG signal classification with enhanced generalization capabilities for upper limb motion recognition across different subjects.

Fig. 18 Accuracy distribution of different models using LOOCV.

Discussion

The experimental results demonstrate the effectiveness of our proposed model for sEMG-based movement classification in human-exoskeleton systems. The training dynamics illustrated in Fig. 6 reveal stable convergence without overfitting, as evidenced by the steady decrease of both training and validation losses. This stability, combined with the plateauing accuracy curves, indicates the model’s robustness and appropriate complexity for the task.

The progressive feature visualization through t-SNE (Figs. 7, 8, 9, 10) provides insights into our model’s transformation of sEMG signals into discriminative features. While overlap between classes in these 2D projections should be interpreted cautiously, as t-SNE cannot perfectly preserve high-dimensional spatial relationships, the overall trend from entangled to increasingly distinct clusters validates our architecture’s feature learning capabilities. The visualization reveals a clear evolutionary pattern: initial convolutional processing establishes basic clustering tendencies (Fig. 8), temporal modeling through BiLSTM enhances motion-specific patterns (Fig. 9), and the attention mechanism further refines class separation (Fig. 10). This progression supports our architectural design hypothesis that each component contributes unique and complementary discriminative capabilities to the final classification task.

Figures 11 and 12 show that the attention weights for RS and MA are relatively uniform, with standard deviations of 0.08 and 0.05, respectively. This indicates that under these conditions, the model does not rely heavily on any specific time step but integrates features evenly across the entire time window.

Figures 13 and 14 reveal that, for RM and DLB, the model assigns higher attention during the T4–T5 interval, suggesting that significant muscle activations are detected during this period, which leads the model to focus on this information for classification.

In contrast, Fig. 15 shows that for SLB, the attention is predominantly focused on time step T1, indicating that when maximum effort is exerted, the signal characteristics at certain time steps become more pronounced, prompting the model to assign greater weight to these cues.

Overall, the attention mechanism exhibits adaptive capabilities: the model distributes its focus evenly in stable or mild conditions, whereas in high-intensity or dynamic tasks it emphasizes distinctive features at specific time steps. This behavior confirms the potential of attention-based models in capturing subtle variations in sEMG signals and provides a theoretical basis for further optimizing motion state recognition systems.

The ablation study findings offer valuable insights into the role of different neural network components in sEMG-based motion recognition, as shown in Fig. 16. While the CNN effectively captures local temporal patterns in sEMG signals, achieving 96.00% accuracy, the relatively lower performance of the standalone BiLSTM (90.30%) suggests that long-term dependencies alone are insufficient for robust classification. However, the improvement observed in the CNN + BiLSTM architecture (96.64%) demonstrates that combining local feature extraction with long-range temporal modeling enhances the model’s ability to distinguish between similar motion patterns. The superior performance of our proposed model (97.29%) over both CNN + BiLSTM and CNN + Attention architectures indicates that the attention mechanism’s selective focus on relevant signal segments complements both local pattern detection and sequential feature extraction. This architectural synergy is particularly important for real-world applications, where the ability to automatically identify and emphasize discriminative signal components can help overcome individual variations in muscle activation patterns.

The per-class performance analysis reveals interesting patterns in motion state recognition (Fig. 17, Table 4). The high F1 scores for RS (0.9863) and SLB (0.9782) suggest that these conditions produce distinctly different muscle activation patterns. The slightly lower scores for RM and DLB (0.9608 and 0.9655, respectively) reflect the inherent challenge of capturing load-related information from muscle activation signals alone.
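For readers who want to reproduce this kind of per-class analysis, the sketch below computes F1 scores from a confusion matrix. The counts are invented toy values, not the paper's results, and the helper name `per_class_f1` is hypothetical; only the four class labels are taken from the text.

```python
def per_class_f1(confusion, labels):
    """Compute F1 per class.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(labels)
    scores = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                          # missed samples of class k
        fp = sum(confusion[i][k] for i in range(n)) - tp     # other classes predicted as k
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

labels = ["RS", "RM", "DLB", "SLB"]

# Toy confusion matrix (illustrative only): most errors fall between RM and DLB.
toy_confusion = [
    [98, 1, 1, 0],
    [1, 94, 5, 0],
    [0, 4, 95, 1],
    [0, 0, 2, 98],
]

f1 = per_class_f1(toy_confusion, labels)
```

In this toy matrix the mutual confusion between RM and DLB lowers both of their F1 scores relative to RS and SLB, which mirrors the qualitative pattern described above.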

A critical aspect of our study is the cross-subject generalization evaluation through LOOCV (Fig. 18, Table 5). The observed accuracy reduction when testing on completely new subjects aligns with previous research findings15,35 and highlights a persistent challenge in sEMG-based motion recognition. However, our model is more resilient to this performance degradation than the baseline approaches, maintaining the highest mean accuracy and lowest standard deviation across subjects. This suggests that our architecture better captures universal movement patterns while being less susceptible to individual-specific signal characteristics.
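The cross-subject protocol can be sketched as a leave-one-subject-out split, in which every window from the held-out subject is excluded from training. This is a minimal illustration under stated assumptions: the subject IDs are hypothetical, the function names are not from the paper, and the use of the sample standard deviation is an assumption, since the section does not state which estimator was used.

```python
import statistics

def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx) once per unique subject.

    subject_ids[i] is the subject that produced sample i.
    """
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx

def summarize(fold_accuracies):
    """Return mean and sample standard deviation across folds (assumed estimator)."""
    return statistics.mean(fold_accuracies), statistics.stdev(fold_accuracies)

# Hypothetical subject label for each sample (two samples per subject).
subject_ids = ["S1", "S1", "S2", "S2", "S3", "S3"]
folds = list(leave_one_subject_out(subject_ids))

# Hypothetical per-fold accuracies, used only to show the summary step.
mean_acc, std_acc = summarize([0.90, 0.80, 0.85])
```

Splitting by subject rather than by sample prevents windows from the same person appearing in both the training and test sets, which would otherwise inflate the reported accuracy.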

Despite these promising results, several limitations must be acknowledged. First, individual differences in sEMG signals remain a significant challenge for cross-user recognition37,38. Variations in muscle physiological characteristics, electrode placement, and movement execution styles among different users result in substantial differences in sEMG patterns. Second, the current study focused on a specific set of movement patterns; the model’s performance on more complex or transitional movements requires further investigation. Third, the study concentrated solely on motion intent classification, which lays the groundwork for human–exoskeleton interaction. Fourth, because each classification requires a full 500 ms sliding window (with a 450 ms overlap between consecutive windows), the windowing scheme may introduce a system delay. These issues need to be addressed in future work.
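The timing implied by the windowing scheme can be worked through with a short sketch. The 500 ms window and 450 ms overlap come from the text; the 1000 Hz sampling rate, the 2 s recording length, and the function name `window_starts` are assumptions made only for illustration.

```python
def window_starts(n_samples, fs_hz, window_ms=500, overlap_ms=450):
    """Return (window_len, stride, start_indices) in samples."""
    window = int(round(window_ms * fs_hz / 1000))
    stride = int(round((window_ms - overlap_ms) * fs_hz / 1000))
    if stride <= 0:
        raise ValueError("overlap must be smaller than the window length")
    starts = list(range(0, n_samples - window + 1, stride))
    return window, stride, starts

FS_HZ = 1000        # assumed sampling rate, not taken from this section
N_SAMPLES = 2000    # 2 s of signal at the assumed rate

window, stride, starts = window_starts(N_SAMPLES, FS_HZ)

# Time at which each window is complete and can be classified, in ms.
decision_times_ms = [(s + window) * 1000 / FS_HZ for s in starts]
```

Under these assumptions the first decision is only available once 500 ms of signal has been buffered, after which a new decision can be produced every 50 ms, excluding any inference time.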

Future work should address these limitations by:

  1. Developing transfer learning frameworks based on common features, extracting universal movement pattern representations from multi-user data to establish foundational models that enable new users to adapt quickly with minimal personalized data, and designing incremental learning mechanisms that allow the system to recognize and memorize newly emerging movement patterns online, thereby improving adaptability to unknown behaviors;

  2. Expanding the movement pattern vocabulary to encompass more diverse and complex activities, and establishing standardized benchmarks for upper-limb movements in human–exoskeleton systems to enable fair comparisons across different sEMG-based recognition methods;

  3. Exploring the integration of motion state classification with trajectory prediction through transformer-based sequence-to-sequence models, combined with model predictive control frameworks, as a potential approach to enable real-time anticipatory exoskeleton actuation.

Conclusion

This paper presents a novel deep learning architecture for sEMG-based upper-limb motion recognition in human–exoskeleton systems. By integrating a CNN for local pattern extraction, a BiLSTM for temporal dependency modeling, and an attention mechanism for selective feature emphasis, our model achieves robust classification performance across different motion states. The experimental results demonstrate the effectiveness of this approach: the model achieves 97.29% accuracy in the ablation study and maintains 88.17 ± 5.39% accuracy in cross-subject validation, surpassing traditional approaches and baseline deep learning models.

The comprehensive evaluation through feature visualization, ablation studies, and cross-subject testing validates our architectural design choices. The t-SNE visualizations reveal the progressive improvement in feature discrimination through different network components, while the ablation study quantitatively confirms each component’s contribution to the final performance. Particularly noteworthy is the model’s ability to distinguish between similar motion patterns, such as rapid movements and dynamic load-bearing activities, with F1 scores exceeding 0.96 for all motion states.

While the results are promising, challenges remain in achieving consistent performance across different subjects due to individual variations in sEMG patterns. Future work should focus on developing transfer learning frameworks for better cross-user adaptation, designing incremental learning mechanisms for online pattern recognition, and expanding the movement pattern vocabulary. These improvements will be crucial for advancing the practical application of sEMG-based motion recognition in human–exoskeleton systems.

The proposed approach represents a significant step toward more reliable and adaptable human–exoskeleton interaction systems, offering potential benefits for rehabilitation, assistive technology, and industrial applications. Our findings contribute to the broader understanding of deep learning applications in biosignal processing and human motion recognition, while also highlighting important directions for future research in this field.