Introduction

Scientific rehabilitation exercises have been shown to play a pivotal role in facilitating functional recovery in patients after surgical interventions1. However, in many countries, the financial burden of rehabilitation is significant due to a shortage of trained rehabilitation professionals2. Consequently, a considerable number of patients are unable to receive timely and effective rehabilitation in their own homes. Therefore, researchers are increasingly focusing on the accurate and efficient assessment of the quality of patients’ rehabilitation manoeuvres.

Based on how the final results of movement quality evaluation are presented, current methods can be categorized into classification-based methods3,4,5 and regression-based methods6,7,8. Classification-based methods are dominated by traditional machine learning. They utilize features extracted from video or sensor data to train classification models, thereby yielding discrete action categories. However, these methods struggle to capture the complex non-linear relationships between motion nodes and perform poorly in accurately assessing the quality of actions. In contrast, regression-based methods are capable of predicting continuous action quality scores by pre-processing the data. Lee et al.9 proposed a hybrid model that learns linear or nonlinear relationships in sensor data to output continuous scores, while Capecci et al.10 used a Hidden Markov Model to learn the probability distribution of the hidden states given the inputs. Nevertheless, these methods face significant challenges in efficiently predicting quality from substantial volumes of action data. Moreover, the pre-training process introduces additional complexity to the action evaluation procedure. More recently, deep learning-based methodologies11,12,13,14 have demonstrated significant advancements in the domain of action evaluation. Liao et al.11 utilized a deep pyramidal network to process encoded motion data; by leveraging subnetworks, the model captures the spatial characteristics of human movement to analyze joint displacements of individual body parts. These motion assessment models achieve higher accuracy and better generalization compared to traditional approaches. However, although graph convolutional networks can capture the spatial information of human motion, constructing adjacency matrices from near neighborhoods limits the network’s ability to prioritize the motion states of distal nodes. As a result, the network cannot capture valuable semantic information from joints with larger movement amplitudes in rehabilitation exercises. Rehabilitation movements also have distinctive characteristics: patients’ slow recovery movements produce only subtle differences between motion frames, and motion speeds that vary with patients’ subjectivity introduce temporal complexity, further challenging the accuracy of action assessment in this field.

Fig. 1
figure 1

Rehabilitation exercise process and joint topology construction. (a) Establishing connections solely between adjacent joint points fails to capture the coordination of different joints. (b) Local joint connections, including adjacent joints and second-order joints, which neglect the exchange of information between distant joints. (c) A fused topology integrates second-order joints and distal nodes (hands and feet), aligning with coordinated limb movements in rehabilitation and enhancing joint information representation.

To address the above problems, we propose a Frame Topology Fusion Hierarchical Graph Convolution Network (FTF-HGCN). The network consists of Frame Topology Fusion (FTF) blocks and Hierarchical Temporal Convolution Attention (HTCA) blocks. To enhance the information representation of distal nodes, we design a fused topological structure, as shown in Fig. 1. This fused topology builds upon the initial adjacency matrix of the human body topology and further integrates motion information from the distal limbs, effectively considering the interactions and coordination of limb movements. An adaptive learnable matrix is constructed for each action frame based on the spatial information of distal keypoints, thereby enhancing the network’s ability to understand subtle differences between action frames. Considering that different patients complete rehabilitation movements at varying speeds, we design a multi-level temporal convolution attention module consisting of four Branch Temporal Convolution (BTC) modules. Each branch extracts motion features from a different range of the time series, and these outputs are ultimately integrated to improve the network’s comprehension of motion information. To evaluate the efficacy of this approach, extensive experiments are conducted on the UI-PRMD and KIMORE rehabilitation datasets, demonstrating that our method achieves the best performance compared to other assessment methods. The contributions of our method can be summarized as follows:

  • We propose a frame topology fusion hierarchical graph convolutional network that jointly models spatial and temporal features, significantly enhancing the precision of rehabilitation motion quality assessment for stroke patients.

  • We construct joint topology representations dynamically by integrating adjacent human body topology with distal limb motion information. This creates an adaptive and learnable topological matrix that captures subtle differences between motion frames, improving the representation of human motion information.

  • We design a multi-level temporal feature extraction module that learns motion information across different scales to accommodate variations in patients’ motion speeds.

  • Our approach achieves leading performance on the two public datasets, offering new improvements and insights for motion quality assessment. This provides reliable support for home-based rehabilitation.

Related works

Rehabilitation exercise quality assessment

Research methods for evaluating the quality of rehabilitation movements can be broadly categorized into traditional machine learning approaches and deep learning-based methods. Early machine learning approaches were only capable of classifying movements as correct or incorrect. Techniques such as k-nearest neighbors (KNN)15, support vector machines (SVM)16, and random forests17 have been employed for binary classification tasks. To achieve a continuous quality evaluation of movements, Houmanfar et al.18 utilized Mahalanobis distance to quantify the quality level of rehabilitation movements by analyzing repetitive motions of patients and healthy individuals. Additionally, Medury et al.19 adopted dynamic time warping methods to convert movement distance functions into quality scores. Furthermore, Vakanski et al.20 introduced a probabilistic approach based on Gaussian mixture models, extracting individual sequences from the trained model to evaluate movement quality.

Deep learning-based methods have demonstrated superior capabilities in the analysis and processing of consecutive frames of motions. Deb et al.12 enhanced temporal feature extraction by integrating graph convolutional networks (GCNs) with long short-term memory networks (LSTMs)21. Mourchid et al.13 effectively processed the spatio-temporal features of skeletal data by integrating an enhanced STGCN and Transformer architecture, enabling the automatic assessment of physical rehabilitation exercises without clinician supervision. Although deep learning-based methods often outperform traditional approaches, they pay insufficient attention to issues such as uncertainty in patients’ rehabilitation movements, which may lead to the loss of subtle information between motion frames. Additionally, challenges remain in fully understanding the temporal information of movements when the speed of rehabilitation exercises varies. Therefore, there is still room for improvement in current deep learning methods.

Graph convolutional networks

The representation of information in graph structures has garnered increasing attention in real-world applications. Such non-Euclidean data poses significant challenges to traditional neural network methods. To address these challenges, researchers have developed graph convolutional networks (GCNs)22, which have demonstrated considerable practical value. For instance, Sofianos et al.23 and Dang et al.24 applied GCNs to human pose prediction, while Shi et al.25 utilized sparse graphs for pedestrian trajectory prediction. Additionally, Liu et al.26 employed graph convolutions on skeletal data for action recognition tasks. Building on these works, Deb et al.12 were the first to propose the use of GCNs to evaluate the quality of rehabilitation movements.

Although GCNs excel at processing non-Euclidean data, for long-term prediction tasks, challenges arise in handling temporal information and understanding the spatial semantics of movements. To better capture spatiotemporal information, researchers have introduced improvements to spatiotemporal graph convolutional networks (ST-GCNs)27,28,29, incorporating temporal processing and spatial feature aggregation modules into traditional GCNs, thereby effectively addressing issues related to data temporal dependencies. For example, Zhong et al.27 introduced the concept of spatio-temporal gating in ST-GCNs to learn the complex dependencies of various actions, while Cai et al.29 fused spatial dependencies with temporal consistency in ST-GCNs, proposing a novel graph-based method for solving 3D human pose estimation problems. The application of spatiotemporal graph convolutional networks enables a deeper aggregation of spatial features among nodes, effectively addressing the long-term dependency challenges in prediction tasks.

Frame topology fusion hierarchical graph convolution network (FTF-HGCN)

This paper uses 3D joint data captured by Vicon or Kinect sensors as input, and the predicted quality scores of rehabilitation movements as output. Video information is processed by the Frame Topology Fusion (FTF) block, which builds upon the multi-order adjacency matrix \(\mathscr {A}\). Specifically, it integrates the information matrix \(\mathscr {A}_{end}\) from distant motion nodes to construct a learnable feature matrix for each motion frame. Subsequently, this information is fed into a hierarchical temporal convolutional layer, where different temporal convolution branches extract motion features from varying temporal ranges. The fused information from these temporal layers is then processed by an attention module, enabling the network to focus more effectively on information relevant to limb movements. Finally, deep motion features are extracted through Long Short-Term Memory (LSTM) layers and fully connected layers to produce the output results. Figure 2 illustrates the network architecture of our proposed FTF-HGCN, which consists mainly of the FTF blocks and the Hierarchical Temporal Convolution Attention (HTCA) module.

Fig. 2
figure 2

(a) The architecture of frame topology fusion hierarchical graph convolution network (FTF-HGCN). (b) Frame topology fusion (FTF) block. (c) Attention aggregation module (AAM). The network module takes 3D joints as input and processes them through the architecture. Subsequently, a fully connected layer outputs the final score, enabling the quality assessment of rehabilitation motions.

Preliminaries

In this study, the proposed graph-based video data processing model targets motion analysis and feature extraction, systematically handling input video data. The model first defines the video input as \(O=(O_1,O_2,\ldots ,O_n)\), where n denotes the total number of collected videos, and each video \(O_i\) is further decomposed into frame-level data. Specifically, the j-th frame \(f_j \in R^{(V,C)}\), where V is the set of human joint nodes and C is the feature dimension of each node; together, the frames form the spatial feature representation. To capture dynamic interactions between nodes within a frame, we construct a graph structure \(G = (V,E,X)\), where E represents the connectivity relationships between nodes and \(X \in R^{(T,V,C)}\) serves as the node attributes encapsulating temporal skeletal data, with T indicating the temporal length. Additionally, the adjacency matrix entry \(A_{ij}\) quantifies the spatial dependency between the i-th and j-th nodes. In the data processing pipeline, the input frame sequences undergo feature extraction and transformation via the graph convolutional operation \(Y = {\mathscr {A}_k}X{W_k}\), where \(\mathscr {A}_k\) is the adjacency matrix, X is the input feature matrix, and \(W_k\) is a learnable weight matrix. Through this combination of graph topology and convolution operations, the model systematically analyzes temporal features and spatial relationships among nodes, establishing a solid theoretical and computational framework for subsequent motion-related tasks.
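To make the operation concrete, the following minimal Python/TensorFlow sketch applies \(Y = \mathscr {A}_k X W_k\) frame by frame; the tensor shapes and the random adjacency matrix are illustrative assumptions, not the paper’s exact configuration.

```python
import tensorflow as tf

# Illustrative shapes: T frames, V joints, C input channels, C_out output channels.
T, V, C, C_out = 39, 25, 3, 64

X = tf.random.normal((T, V, C))                   # per-frame node features
A_k = tf.random.uniform((V, V))                   # adjacency matrix (placeholder)
W_k = tf.Variable(tf.random.normal((C, C_out)))   # learnable weight matrix

# Y = A_k X W_k: aggregate neighbour features per frame, then project channels.
Y = tf.einsum('uv,tvc->tuc', A_k, X)              # spatial aggregation
Y = tf.einsum('tvc,co->tvo', Y, W_k)              # channel projection
print(Y.shape)  # (39, 25, 64)
```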

Frame topology fusion block (FTF block)

Topology construction

Building upon the multi-order neighborhood topology matrix \(\mathscr {A}\), we establish informational connections for distal motion joints through a manually designed approach. The set of distal nodes N is selected based on the following formula:

$$\begin{aligned} dis(n)\le 1,\quad n\in N, \end{aligned}$$
(1)

where \(dis(\cdot)\) denotes the distance from a given node to the nearest primary distal node. We designate limb-end nodes as primary distal nodes, satisfying \(dis=0\), while the secondary distal nodes are defined by \(dis=1\). The set N comprises both primary and secondary distal nodes. Next, to obtain the edge set \(\varepsilon\) of the distal nodes, we construct the self-connection feature edges \(\varepsilon _{sc}\) and the fully connected feature edges \(\varepsilon _{fc}\) for the distal nodes. \(\varepsilon _{sc}\) is formed by the self-connections of the distal nodes. To construct \(\varepsilon _{fc}\), we loop through each distal node and pass it to the edge selector, which sequentially establishes connections between the given distal node and the other distal nodes. The establishment of the fully connected feature edges \(\varepsilon _{fc}\) is described by Eq. (2).

$$\begin{aligned} \varepsilon _{fc}[i][j]= \begin{cases} 0, & i\not \in {N}\vee j\not \in {N} \\ 1, & i\in {N}\wedge j\in {N} \end{cases}\end{aligned}$$
(2)

The final edge set of the distal nodes is obtained through the following formula:

$$\begin{aligned} \varepsilon =\varepsilon _{sc}\cup \varepsilon _{fc} \end{aligned}$$
(3)

Finally, the graph generator constructs the feature map based on the connection characteristics of the distal edges through the following formula:

$$\begin{aligned} \mathscr {A}_{g}=sum(g,\dim =1)\times \left((1-I)\odot g+(I\odot g)^{-1}\right) \end{aligned}$$
(4)

After standardizing the resulting feature maps of \(\varepsilon _{sc}\) and \(\varepsilon _{fc}\), they are concatenated to obtain the final output:

$$\begin{aligned} \mathscr {A}_{end}=\left\| \mathscr {A}_{g},(\textrm{g}\in G,G=g_{sc}\vee g_{fc})\right\| \end{aligned}$$
(5)

where \(g_{sc}\) and \(g_{fc}\) are the feature maps formed by \(\varepsilon _{sc}\) and \(\varepsilon _{fc}\), respectively. I is the identity matrix and \(\odot\) denotes the Hadamard product. \(\mathscr {A}_{g}\) is the topological matrix corresponding to the feature map formed by the distal edge set.
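The edge construction of Eqs. (2)–(3) can be sketched in a few lines of Python; the distal joint indices below are hypothetical placeholders, since the actual indices depend on the dataset’s skeleton layout.

```python
import numpy as np

V = 25                                   # number of joints (KIMORE-style skeleton)
primary_distal = {7, 11, 14, 18}         # hands and feet, dis = 0 (assumed indices)
secondary_distal = {6, 10, 13, 17}       # dis = 1 neighbours (assumed indices)
N = primary_distal | secondary_distal    # distal node set of Eq. (1)

# Self-connection edges eps_sc: each distal node connects to itself.
eps_sc = np.zeros((V, V))
for n in N:
    eps_sc[n, n] = 1.0

# Fully connected edges eps_fc, Eq. (2): 1 iff both endpoints are distal.
eps_fc = np.zeros((V, V))
for i in N:
    for j in N:
        eps_fc[i, j] = 1.0

# Eq. (3): the final distal edge set is the union of the two.
eps = np.maximum(eps_sc, eps_fc)
```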

Frame topology fusion

The constructed fusion topological structure is applied to each action frame, and a learnable feature matrix is established for this structure, as shown in Fig. 2b. To better capture the subtle differences in inter-frame actions, more detailed topological information of the distal joints in the human skeleton is incorporated. Following the construction method of the neighborhood node topological matrix, a similar learnable matrix \(\mathscr {A}_{end}\) is set up, allowing \(\mathscr {A}\) and \(\mathscr {A}_{end}\) to learn adaptive node-related weight information along the frame dimension. These are then fed in parallel into the network module for graph convolution operations, and the output results are concatenated:

$$\begin{aligned} \begin{aligned} F_{out}=\left\| \mathscr {A}_{k}XW_{k},(k\in K,K=\mathscr {A}\vee \mathscr {A}_{end})\right\| \end{aligned}\end{aligned}$$
(6)

where \(\mathscr {A}_{k}\) represents the k-th subset of the fusion topological structure, and \(W_k\) denotes the trainable parameters for each topological subset. This deepens the network’s focus on and understanding of the motion information of distal nodes on top of the aggregation of the overall skeleton information, thereby extending the features to the frame level. As a result, it expresses the different spatial node topological information between motion frames, which facilitates the learning of subtle differences across frames.
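As a sketch of Eq. (6), the two topologies can be convolved in parallel and their outputs concatenated; shapes and random matrices are again placeholder assumptions.

```python
import tensorflow as tf

T, V, C, C_out = 39, 25, 3, 32
X = tf.random.normal((T, V, C))
A = tf.random.uniform((V, V))       # multi-order neighbourhood topology
A_end = tf.random.uniform((V, V))   # distal topology from Eq. (5)

# Eq. (6): run graph convolutions over both topologies and concatenate.
outputs = []
for A_k in (A, A_end):
    W_k = tf.Variable(tf.random.normal((C, C_out)))    # per-topology weights
    Y = tf.einsum('uv,tvc->tuc', A_k, X)               # spatial aggregation
    outputs.append(tf.einsum('tvc,co->tvo', Y, W_k))   # channel projection

F_out = tf.concat(outputs, axis=-1)  # (T, V, 2 * C_out)
```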

In addition, considering the motion correlation between consecutive motion frames, a ConvLSTM30 network layer is employed to predict the topological relationships. This approach enables the connection weights between nodes at different time instances to vary, thereby more effectively capturing the interdependencies of actions. Subsequently, the predicted temporal topology matrix is combined with the manually constructed fusion topology matrix to derive a new learnable matrix, with the output presented as follows:

$$\begin{aligned} \begin{aligned} \mathscr {A}^{{\prime }}=softmax\left[\sigma \left(convLSTM\left(X^{T}\right)\right)+\mathscr {A}\right] \end{aligned}\end{aligned}$$
(7)

where X is the input feature, \(\sigma\) is the activation function, and \(\mathscr {A}\) is the fusion topology matrix formed with \(g_{sc}\) and \(g_{fc}\).

In summary, the overall formula for graph convolution is:

$$\begin{aligned} \begin{aligned} H^{(l)}=\sigma \left(D^{-\frac{1}{2}}A^{{\prime }}D^{-\frac{1}{2}}H^{(l-1)}W^{(l-1)}\right) \end{aligned}\end{aligned}$$
(8)

where D is the diagonal degree matrix, H denotes the joint action features, and W represents the learnable weight parameters.
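A minimal sketch of the layer update in Eq. (8) is given below; the matrix A_prime stands in for the ConvLSTM-predicted topology of Eq. (7), which is replaced here by a random placeholder.

```python
import tensorflow as tf

V, C_in, C_out = 25, 64, 64
A_prime = tf.nn.softmax(tf.random.uniform((V, V)), axis=-1)  # stand-in for Eq. (7)
H = tf.random.normal((V, C_in))                    # joint features at layer l-1
W = tf.Variable(tf.random.normal((C_in, C_out)))   # learnable weights

# Symmetric normalization D^{-1/2} A' D^{-1/2} with the diagonal degree matrix.
deg = tf.reduce_sum(A_prime, axis=-1)
D_inv_sqrt = tf.linalg.diag(tf.math.rsqrt(deg))
A_norm = D_inv_sqrt @ A_prime @ D_inv_sqrt

H_next = tf.nn.relu(A_norm @ H @ W)                # Eq. (8) with sigma = ReLU
```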

Hierarchical temporal convolution attention module (HTCA)

Branch temporal convolution

To extract information from actions of varying durations, this paper introduces a multi-level information extraction module consisting of four different Branch Temporal Convolutions (BTC). The module construction and parameter settings of each BTC are shown in Fig. 3, where Dil denotes the dilation rate of the dilated convolution and MP represents the kernel size of the max pooling. Specifically, this approach splits the input into four branches according to convolution kernel size, with each branch comprising two dilated convolutions, one convolutional long short-term memory network (ConvLSTM), and one max pooling operation. In the design of the branch network, the dilated convolutions expand the receptive field within a consistent network structure. The ConvLSTM captures long-term dependencies across motion frames, boosting the network’s ability to learn from extended sequences. Additionally, the max pooling operation prioritizes nodes with the largest displacement amplitudes, allowing the network to learn more detailed motion features.
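A hedged sketch of one BTC branch is shown below; kernel sizes, dilation rates, and pooling windows are placeholders, with the paper’s actual settings given in Fig. 3b.

```python
import tensorflow as tf
from tensorflow.keras import layers

def btc_branch(x, filters=32, k=3, dil=2, mp=3):
    # x: (batch, T, V, C) -- time as rows, joints as columns.
    x = layers.Conv2D(filters, (k, 1), dilation_rate=(dil, 1),
                      padding='same', activation='relu')(x)   # dilated conv 1
    x = layers.Conv2D(filters, (k, 1), dilation_rate=(dil, 1),
                      padding='same', activation='relu')(x)   # dilated conv 2
    h = tf.expand_dims(x, axis=3)                             # (batch, T, V, 1, F)
    h = layers.ConvLSTM2D(filters, (3, 1), padding='same',
                          return_sequences=True)(h)           # long-term dependencies
    x = tf.squeeze(h, axis=3)                                 # (batch, T, V, F)
    # Max pooling along time highlights the largest activations.
    return layers.MaxPool2D((mp, 1), strides=(1, 1), padding='same')(x)

x = tf.random.normal((2, 39, 25, 3))
out = tf.concat([btc_branch(x, k=k) for k in (3, 5, 7, 9)], axis=-1)  # four branches
```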

Fig. 3
figure 3

Components of the BTC module and specific parameter settings. (a) The BTC module consists of four branches, each primarily comprising Dilation Temporal Convolution (DTC), ConvLSTM, and MaxPool components. The information within each branch is concatenated to form the output of that branch module. (b) Specific parameter design of convolution kernels for different BTC modules. Convolutions at different scales effectively ensure the network’s ability to acquire a wider range of contextual information.

Attention aggregation module

To enable the network to pay more attention to joints that are richer in motion information, this paper integrates and extracts the information from the different branches through the Attention Aggregation Module (AAM), as shown in Fig. 2c. First, the output information of each branch is concatenated, and the local patterns and spatial relations of the data are captured through an initial convolution, serving as the input X to the module. Second, a shared weight pool is employed through multilayer deep convolution to derive T, C, and R as the target, context, and result of the motion evaluation, respectively. Subsequently, the aggregated information is obtained through the following process:

$$\begin{aligned} X_{node}=softmax\left(\frac{T\cdot \textrm{C}^{T}}{\sqrt{d}}\right)\cdot {Conv}(R+\mathfrak {J}(R)), \end{aligned}$$
(9)

where d is the normalization constant, and \(\mathfrak {J}\) is an activation function that processes the time stride, consisting of both nonlinear and linear activations. To accelerate network convergence and effectively balance local and global contextual features, we introduce a residual connection as compensation for the initial input X:

$$\begin{aligned} \Delta x=Conv(\mathfrak {L}(Res(x))), \end{aligned}$$
(10)

where \(\mathfrak {L}\) is the reshape operation. Finally, these two components are summed to produce the ultimate output \(X^{\prime }\):

$$\begin{aligned} X^{\prime }=X_{node}+\Delta x. \end{aligned}$$
(11)

The integration of the AAM with the residual module enhances the network’s robustness to noise while enabling finer-grained focus on high-amplitude motion information. Consequently, this approach effectively improves the network’s ability to assess the rehabilitation quality of subtle movements. The overall process of FTF-HGCN is presented in Algorithm 1.
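The sketch below illustrates the aggregation of Eqs. (9)–(11); the 1×1-convolution projections, GELU in place of \(\mathfrak {J}\), and a plain convolution in place of the reshape-and-residual path of Eq. (10) are simplifying assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aam(x, d=64):
    # x: (batch, n, d), the concatenated branch features after the initial conv.
    t = layers.Conv1D(d, 1)(x)   # target
    c = layers.Conv1D(d, 1)(x)   # context
    r = layers.Conv1D(d, 1)(x)   # result
    # Eq. (9): scaled dot-product attention over T and C, applied to Conv(R + J(R)).
    attn = tf.nn.softmax(tf.matmul(t, c, transpose_b=True) / tf.sqrt(float(d)), axis=-1)
    x_node = tf.matmul(attn, layers.Conv1D(d, 1)(r + tf.nn.gelu(r)))
    # Eq. (10): residual compensation branch from the initial input.
    delta_x = layers.Conv1D(d, 1)(x)
    return x_node + delta_x      # Eq. (11)

y = aam(tf.random.normal((2, 39 * 25, 64)))
```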

Algorithm 1
figure a

The entire process of FTF-HGCN.

Experiments

Datasets

To evaluate the performance of the model, we conduct experiments on two rehabilitation motion datasets: KIMORE31 and UI-PRMD30. The KIMORE dataset comprises real score annotations and RGBD videos for five types of movements, with each action represented by a 75-dimensional joint angle displacement sequence. It is divided into two groups: an experimental group and a control group. The experimental group consists of 34 patients with motor impairments, including Parkinson’s disease, back pain, and stroke. The control group includes 12 rehabilitation therapists and experts, along with 32 healthy non-expert participants. The UI-PRMD dataset used Vicon sensors to collect three-dimensional data for 10 rehabilitation movements from 10 healthy subjects. Each subject performed every movement 10 times, with each action represented by a 117-dimensional joint angle displacement sequence. Because real-world capture inevitably introduces noise, we adopt a padding strategy in the data pre-processing stage: missing joint data are filled with zeros to ensure dimensional alignment, leveraging the robust fitting capabilities of the neural network to ultimately derive quality scores. In addition, the hierarchical network architecture enables the model to learn patient motion characteristics across multiple dimensions and to integrate information from different levels, which inherently facilitates the fitting and generalization of these deviation patterns. This allows the model to effectively handle motion deviations arising from patient subjectivity.
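The zero-padding step can be sketched as follows; the NaN convention for missing captures is an assumption about how sensor dropouts are flagged.

```python
import numpy as np

def pad_missing_joints(seq):
    # seq: (T, V, 3) array of 3D joint coordinates, NaN where a joint was lost.
    seq = np.asarray(seq, dtype=np.float32)
    return np.nan_to_num(seq, nan=0.0)  # fill missing joints with zeros
```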

Evaluation metrics

The quality assessment of rehabilitation actions typically focuses on the discrepancy between predicted and actual results. To better evaluate the performance of the model, we employ mean absolute deviation (MAD), mean absolute percentage error (MAPE), and root mean square error (RMSE) as evaluation metrics, because they are widely used with the KIMORE and UI-PRMD datasets to quantify deviations from expert standards and align with existing methods12,13,14. This makes it straightforward to compare our method with state-of-the-art approaches; our method clearly outperforms comparative approaches on these metrics, validating its effectiveness in overall motion quality assessment. For all three metrics, lower values indicate higher prediction accuracy. MAD represents the average of absolute deviations between actual and predicted values, serving as a scale-dependent measure. MAPE is a relative, scale-independent metric that is unaffected by the sign of errors. RMSE is defined as the square root of the mean squared error, which quantifies the average difference between predicted and actual values. The formulas for MAD, MAPE, and RMSE are provided below:

$$\begin{aligned} MAD= & \frac{1}{n}\sum _{i=1}^{n}\left| y_i-\hat{y}_i\right| \end{aligned}$$
(12)
$$\begin{aligned} MAPE= & \frac{1}{n}\sum _{i=1}^{n}\left| \frac{y_i-\hat{y}_i}{y_i}\right| \times 100 \end{aligned}$$
(13)
$$\begin{aligned} RMSE= & \sqrt{\frac{1}{n}\sum _{i=1}^{n}(y_i-\hat{y}_i)^{2}} \end{aligned}$$
(14)

where n represents the sample size, \(\hat{y}_i\) is the predicted score of the i-th action, and \(y_i\) is its actual score.
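Equations (12)–(14) translate directly into NumPy:

```python
import numpy as np

def mad(y, y_hat):
    return np.mean(np.abs(y - y_hat))                 # Eq. (12)

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y)) * 100     # Eq. (13); assumes y != 0

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))         # Eq. (14)

y, y_hat = np.array([42.0, 35.5, 48.1]), np.array([41.2, 36.0, 47.5])
print(mad(y, y_hat), mape(y, y_hat), rmse(y, y_hat))
```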

Experimental setup

We implement our experiments using the TensorFlow 2.0 deep learning framework, with all computations performed on an NVIDIA GeForce RTX 3090 GPU. The Adam optimizer is employed to train the model on the training set for 1000 epochs. We adopt an annealing strategy with an initial learning rate of 0.0001, which is dynamically adjusted as the number of training epochs increases. For the UI-PRMD and KIMORE datasets, batch sizes are set to 16 and 20, respectively, and a dropout rate of 0.2 is applied to mitigate overfitting. The overall network architecture integrates the ResNet residual mechanism32. To ensure the reliability of the results, we conduct ten training and testing runs, storing the performance metrics (MAD, RMSE, MAPE) after each run and computing their average as the final comparative outcome.
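A sketch of this training configuration is given below; the cosine schedule is one plausible realization of the annealing strategy, since only the initial learning rate and its dynamic adjustment are specified above.

```python
import tensorflow as tf

# Initial learning rate 1e-4, annealed as training progresses (schedule assumed).
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-4, decay_steps=1000)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

EPOCHS = 1000
BATCH_SIZE = 16      # 16 for UI-PRMD, 20 for KIMORE
DROPOUT_RATE = 0.2   # applied to mitigate overfitting
```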

Experimental results

We conduct extensive experiments on the publicly available rehabilitation training datasets KIMORE and UI-PRMD and compare the results with those of similar methods. For the KIMORE dataset, we use continuous video frames containing 25 three-dimensional joint nodes as input to the model, outputting predicted quality scores for each movement. These scores are evaluated using three metrics: MAD, RMSE, and MAPE. The results of this comparison are shown in Table 1.

Table 1 Comparison of MAD, RMSE, and MAPE metrics across exercises Ex1 to Ex5 for different methods.

Our method achieves the best results in quality prediction for all of these rehabilitation exercises. Compared to the previous state-of-the-art results on the MAD/RMSE/MAPE metrics, our approach demonstrates superiority across the five exercises. In contrast, Liao et al.11 used traditional Graph Convolutional Networks (GCNs) to model joint topological relationships based on static adjacency matrices, which overlooked the dynamic changes in topology during motion execution. Mourchid et al.13 relied on handcrafted motion features, which required domain expertise and exhibited limited generalization to complex motions. Our FTF-HGCN introduces the Frame Topology Fusion (FTF) module, which dynamically adjusts inter-joint spatial relationships through a learnable frame-level topological matrix, enhancing the modeling of motion-specific phases. This dynamic topology modeling aligns theoretically with the nonlinear spatio-temporal characteristics of rehabilitation motions, offering particular advantages in capturing interactions between distal and supporting joints. Similarly, Sardari et al.14 utilized Long Short-Term Memory (LSTM) networks to model temporal motion information, but their approach did not exploit the spatial topology of joints, making it challenging to capture joint interactions. By combining graph convolution with frame-level topological information, the FTF-HGCN achieves unified modeling of spatial and temporal features, enhancing the learning of dynamic motion characteristics. It employs end-to-end learning to automatically extract multi-order neighborhood features of joints and dynamically allocates attention weights, mitigating the subjectivity of manual feature design and theoretically improving adaptability to diverse motion patterns. Figure 4 presents the visualization of rehabilitation quality for four exercises. The left side of the figure presents the attention maps for expert movements, illustrating the joint intensity of motion information. The right side corresponds to the rehabilitation patients’ movements, showing the difference in the attention maps relative to the expert movements and the quality scores of the patients’ movements. Higher scores indicate greater movement accuracy and closer alignment with the standard.

Fig. 4
figure 4

Attention heatmaps of rehabilitation actions and their corresponding visualized joint attention weights. Four movements from the KIMORE dataset are visualized, with one professional example and two control patient examples included in each movement. The joint attention weights of these two patient movements exhibit varying differences from the expert standard, with smaller discrepancies corresponding to higher scores for movement standardization.

For the UI-PRMD dataset, we use 39 consecutive video frames of 3D joint points as input to the model and compare the results on the MAD evaluation metric, as shown in Table 2. For Ex3, our method is not as effective as that of Song et al.33. This is because Ex3 places greater reliance on proximal joint coordination than on substantial displacement at the termination stage. Ex3 involves single-leg support and a significant shift in the body’s center of gravity, demanding greater balance and joint coordination, particularly in dynamic adjustments of the knee, hip, and upper limbs. Additionally, the temporal characteristics of this motion are relatively complex, as the kneeling and arm-raising actions must be precisely synchronized within a short time frame, potentially leading to larger deviations in skeletal data across certain frames.

To better assist patients in understanding the scores associated with rehabilitation motions, we have integrated hospital standards for evaluating the completion of rehabilitation motions into a reference table of rehabilitation outcome grades. We categorize the quality of motion completion into five grades (Non-proficient: 0–10, Poor-proficient: 10–15, Moderate-proficient: 15–30, Proficient: 30–45, and High-proficient: 45–50) based on the range of motion scores. Each grade corresponds to a distinct level of rehabilitation motion effectiveness, enabling patients to directly understand the meaning of their motion scores.
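The grade mapping can be expressed as a simple lookup; handling of the shared boundary values is an assumption.

```python
def proficiency_grade(score):
    # Five-grade mapping over the 0-50 score range.
    if score < 10:
        return "Non-proficient"
    elif score < 15:
        return "Poor-proficient"
    elif score < 30:
        return "Moderate-proficient"
    elif score < 45:
        return "Proficient"
    return "High-proficient"  # 45-50

print(proficiency_grade(38.5))  # Proficient
```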

Table 2 Comparison of MAD metric across different methods for exercises Ex1 to Ex10.

Computational cost

In Table 3, we compare the number of parameters and the running time of the training and testing phases of the models on the KIMORE dataset. Taking Ex5 as an example, our model has 0.572M parameters, with an average per-frame runtime of 53 milliseconds. This is approximately 20 ms slower than the method proposed by Mourchid et al.13. The increased runtime is due to the additional parameters and computational complexity introduced by our hierarchical temporal convolutional attention mechanism, which extends the network’s inference time. However, this module significantly enhances motion evaluation quality, achieving the best performance across all three metrics. Our design keeps the parameter increase within an acceptable range by employing \(1\times 1\) convolutional kernels in the hierarchical network and implementing the attention mechanism through a single-layer convolution and channel multiplication. Compared to the method of Deb et al.12, our approach reduces the parameter count by 0.115M and achieves approximately 80 ms faster per-frame inference.

Table 3 Comparison of computational cost on different models.
Table 4 Ablation study on FTF module.
Table 5 Ablation study on HTC module.

Ablation study

We conduct a series of ablation studies on the KIMORE dataset to further validate the effectiveness of the proposed method. The experiments are divided into two modules: the frame topology fusion (FTF) module and the hierarchical temporal convolution (HTC) module. In the ablation study on the FTF module, we retain the base topology matrix of the neighborhood nodes along with the HTC module and perform the evaluation on the Ex2 exercise. The effectiveness of the distal node topology construction is evaluated separately, and the experimental results are shown in Table 4.

In the ablation experiments conducted on the HTC module, we utilize the integrated topology matrix of the nodes as the baseline network and assess the effectiveness of each module within the BTC branch network on the Ex5 exercise. The experimental results, shown in Table 5, demonstrate that the comprehensive integration of the branch modules significantly enhances the network’s attention to the overall rehabilitation movement while mitigating prediction bias.

Figure 5 shows the visualization of the joint topology matrix, where darker colors indicate stronger information interaction between nodes. The fusion topology matrix improves the attention mechanism by facilitating information interaction among joint points, particularly by improving the extraction of motion information from distal nodes.

Fig. 5
figure 5

The attention map of joint information after the training process. (a) Initial topology. (b) Integrated topology. The fusion topology construction approach is noteworthy for its increased emphasis on distal joint nodes and its enhanced aggregation of node information.

Limitation and future directions

We acknowledge that the global error metrics (MAD, RMSE, and MAPE) may not directly quantify misalignments at specific time points or individual joints. To address this limitation, we plan to incorporate more fine-grained evaluation metrics in future work, such as dynamic time warping (DTW) to capture temporal errors, or localized error analysis of specific joints to identify misalignments. Additionally, our five-level motion evaluation has been validated by clinical experts. Moving forward, we will deepen our collaboration with clinical experts to verify the correlation between model scores and actual rehabilitation outcomes. We will also integrate medical evaluation scales, such as loss of range of motion (ROM), to further clarify the clinical utility of our model.

To enhance the model’s adaptability to non-ideal conditions, we will employ data augmentation techniques to maintain accurate motion evaluation across diverse environments. We will expand the repertoire of rehabilitation motion types to include comprehensive exercises targeting the upper limbs, lower limbs, hip joints, elbow joints, wrist joints, and knee joints. This will allow us to address the diverse clinical manifestations of stroke patients and overcome the limitations of single-mode rehabilitation exercises. Furthermore, we will also collaborate with the First Affiliated Hospital of Zhengzhou University to deploy the model in real-world home environments, leveraging feedback from clinical experts to validate its performance and practical utility. We believe these improvements will enable the system to provide clear and patient-friendly result presentations, allowing stroke patients to perform autonomous and effective rehabilitation exercises at home.

In future work, we plan to explore advanced graph representation learning37 (GRL) techniques to optimize the model’s ability to capture complex motion patterns. The current FTF-HGCN framework effectively utilizes the Frame Topology Fusion (FTF) module and the Hierarchical Temporal Convolution Attention (HTCA) module to dynamically model spatial and temporal relationships among joints, achieving robust representation of graph structures for rehabilitation motion assessment. However, GRL offers promising opportunities to further improve the model’s expressive power and computational efficiency. By adaptively learning low-dimensional embeddings of graph structures, GRL can more effectively capture latent patterns in complex motions and optimize non-local dependencies among joints. This approach is expected to enhance the model’s capability to handle diverse motion patterns and accommodate individualized variations in patient movements. These advances will further strengthen the model’s support for home-based stroke rehabilitation, addressing the limitations of traditional graph convolution networks and aligning with the evolving needs of clinical applications.

We will conduct practical testing with stroke patients through our collaboration with the First Affiliated Hospital of Zhengzhou University. This testing will refine the presentation of system results, making them easier for patients to understand and implement. Furthermore, we aim to enhance personalized services tailored to the specific symptoms of individual patients.

Conclusion

This paper proposes a frame topology fusion hierarchical graph convolution network that integrates spatial information from distant joints based on the neighborhood topology of keypoints. On this basis, it constructs a learnable frame-level spatial topology matrix and subsequently employs a hierarchical temporal convolution module to fuse the topological structures of patient motion information across different temporal scales. The network effectively captures subtle differences between nodes and discerns the motion characteristics of rehabilitation motions at varying rates. Consequently, it facilitates more precise predictions of movement quality in the field of rehabilitation. To improve the patients’ understanding of score significance, we will present attention error maps of key joints in the motions, thereby helping them better comprehend the standards for normative motions. Future research will concentrate on efficiently integrating and utilizing multi-order neighborhood joint information and distal joint information to further refine and improve the evaluation performance.