Introduction

Scientific rehabilitation exercises have been shown to play a pivotal role in facilitating functional recovery in patients after surgical interventions1. However, in many countries, the financial burden of rehabilitation is significant due to a shortage of trained rehabilitation professionals2. Consequently, a considerable number of patients are unable to receive timely and effective rehabilitation in their own homes. Therefore, researchers are increasingly focusing on the accurate and efficient assessment of the quality of patients’ rehabilitation manoeuvres.

Based on how the final results of movement quality evaluation are presented, current methods can be categorized into classification-based methods3,4,5 and regression-based methods6,7,8. Classification-based methods are dominated by traditional machine learning. They utilize features extracted from video or sensor data to train classification models, thereby yielding discrete action categories. However, these methods struggle to capture the complex non-linear relationships between motion nodes and perform poorly in accurately assessing the quality of actions. In contrast, regression-based methods are capable of predicting continuous action quality scores by pre-processing the data. Lee et al.9 proposed a hybrid model that learns linear or nonlinear relationships in sensor data to output continuous scores, while Capecci et al.10 used a Hidden Markov Model to learn the probability distribution of the hidden states given the inputs. Nevertheless, these methods face significant challenges in efficiently predicting quality from substantial volumes of action data. Moreover, the pre-training process introduces additional complexity to the action evaluation procedure. More recently, deep learning-based methodologies11,12,13,14 have demonstrated significant advancements in the domain of action evaluation. Liao et al.11 utilized a deep pyramidal network to process encoded motion data; by leveraging subnetworks, the model captures the spatial characteristics of human movement to analyze joint displacements of individual body parts. These motion assessment models achieve higher accuracy and better generalization compared to traditional approaches. However, although graph convolutional networks can capture the spatial information of human motion, constructing adjacency matrices from near neighborhoods limits the network’s ability to prioritize the motion states of distal nodes. As a result, the network cannot capture valuable semantic information from joints with larger movement amplitudes in rehabilitation exercises. Rehabilitation movements also have distinctive characteristics: patients’ slow recovery movements produce only subtle differences between motion frames, and motion speeds that vary with patients’ subjectivity introduce temporal complexity, further challenging the accuracy of action assessment in this field.

Fig. 1
figure 1

Rehabilitation exercise process and joint topology construction. (a) Establishing connections solely between adjacent joint points fails to capture the coordination of different joints. (b) Local joint connections, including adjacent joints and second-order joints, which neglect the exchange of information between distant joints. (c) A fused topology integrates second-order joints and distal nodes (hands and feet), aligning with coordinated limb movements in rehabilitation and enhancing joint information representation.

To address the above problems, we propose a Frame Topology Fusion Hierarchical Graph Convolution Network (FTF-HGCN). The network consists of Frame Topology Fusion (FTF) blocks and Hierarchical Temporal Convolution Attention (HTCA) blocks. To enhance the information representation of distal nodes, we design a fused topological structure, as shown in Fig. 1. This fused topology builds upon the initial adjacency matrix of the human body topology and further integrates motion information from the distal limbs, effectively considering the interactions and coordination of limb movements. An adaptive learnable matrix is constructed for each action frame based on the spatial information of distal keypoints, thereby enhancing the network’s ability to understand subtle differences between action frames. Considering that different patients complete rehabilitation movements at varying speeds, we design a multi-level temporal convolution attention module consisting of four Branch Temporal Convolution (BTC) modules. Each branch extracts motion features from a different range of the time series, and these outputs are ultimately integrated to improve the network’s comprehension of motion information. To evaluate the efficacy of this approach, extensive experiments are conducted on the UI-PRMD and KIMORE rehabilitation datasets, demonstrating that our method achieves the best performance compared to other assessment methods. The contributions of our method can be summarized as follows:

  • We propose a frame topology fusion hierarchical graph convolutional network that jointly models spatial and temporal features, significantly enhancing the precision of rehabilitation motion quality assessment for stroke patients.

  • We construct joint topology representations dynamically by integrating adjacent human body topology with distal limb motion information. This creates an adaptive and learnable topological matrix that captures subtle differences between motion frames, improving the representation of human motion information.

  • We design a multi-level temporal feature extraction module that learns motion information across different scales to accommodate variations in patients’ motion speeds.

  • Our approach achieves leading performance on the two public datasets, offering new improvements and insights for motion quality assessment. This provides reliable support for home-based rehabilitation.

Related works

Rehabilitation exercise quality assessment

Research methods for evaluating the quality of rehabilitation movements can be broadly categorized into traditional machine learning approaches and deep learning-based methods. Early machine learning approaches were only capable of classifying movements as correct or incorrect. Techniques such as k-nearest neighbors (KNN)15, support vector machines (SVM)16, and random forests17 have been employed for binary classification tasks. To achieve a continuous quality evaluation of movements, Houmanfar et al.18 utilized Mahalanobis distance to quantify the quality level of rehabilitation movements by analyzing repetitive motions of patients and healthy individuals. Additionally, Medury et al.19 adopted dynamic time warping methods to convert movement distance functions into quality scores. Furthermore, Vakanski et al.20 introduced a probabilistic approach based on Gaussian mixture models, extracting individual sequences from the trained model to evaluate movement quality.

Deep learning-based methods have demonstrated superior capabilities in the analysis and processing of consecutive frames of motions. Deb et al.12 enhanced temporal feature extraction by integrating graph convolutional networks (GCNs) with long short-term memory networks (LSTMs)21. Mourchid et al.13 effectively processed the spatio-temporal features of skeletal data by integrating an enhanced STGCN and Transformer architecture, enabling the automatic assessment of physical rehabilitation exercises without clinician supervision. Although deep learning-based methods often outperform traditional approaches, they pay insufficient attention to issues such as uncertainty in patients’ rehabilitation movements, which may lead to the loss of subtle information between motion frames. Additionally, challenges remain in fully understanding the temporal information of movements when the speed of rehabilitation exercises varies. Therefore, there is still room for improvement in current deep learning methods.

Graph convolutional networks

The representation of information in graph structures has garnered increasing attention in real-world applications. Such non-Euclidean data poses significant challenges to traditional neural network methods. To address these challenges, researchers have developed graph convolutional networks (GCNs)22, which have demonstrated considerable practical value. For instance, Sofianos et al.23 and Dang et al.24 applied GCNs to human pose prediction, while Shi et al.25 utilized sparse graphs for pedestrian trajectory prediction. Additionally, Liu et al.26 employed graph convolutions on skeletal data for action recognition tasks. Building on these works, Deb et al.12 were the first to propose the use of GCNs to evaluate the quality of rehabilitation movements.

Although GCNs excel at processing non-Euclidean data, for long-term prediction tasks, challenges arise in handling temporal information and understanding the spatial semantics of movements. To better capture spatiotemporal information, researchers have introduced improvements to spatiotemporal graph convolutional networks (ST-GCNs)27,28,29, incorporating temporal processing and spatial feature aggregation modules into traditional GCNs, thereby effectively addressing issues related to data temporal dependencies. For example, Zhong et al.27 introduced the concept of spatio-temporal gating in ST-GCNs to learn the complex dependencies of various actions, while Cai et al.29 fused spatial dependencies with temporal consistency in ST-GCNs, proposing a novel graph-based method for solving 3D human pose estimation problems. The application of spatiotemporal graph convolutional networks enables a deeper aggregation of spatial features among nodes, effectively addressing the long-term dependency challenges in prediction tasks.

Frame topology fusion hierarchical graph convolution network (FTF-HGCN)

This paper uses 3D joint data captured by Vicon or Kinect sensors as input, and the predicted quality scores of rehabilitation movements as output. Video information is processed by the Frame Topology Fusion (FTF) block, which builds upon the multi-order adjacency matrix \(\mathscr {A}\). Specifically, it integrates the information matrix \(\mathscr {A}_{end}\) from distant motion nodes to construct a learnable feature matrix for each motion frame. Subsequently, this information is fed into a hierarchical temporal convolutional layer, where different temporal convolution branches extract motion features from varying temporal ranges. The fused information from these temporal layers is then processed by an attention module, enabling the network to focus more effectively on information relevant to limb movements. Finally, deep motion features are extracted through Long Short-Term Memory (LSTM) layers and fully connected layers to produce the output results. Figure 2 illustrates the network architecture of our proposed FTF-HGCN, which consists mainly of the FTF blocks and the Hierarchical Temporal Convolution Attention (HTCA) module.

Fig. 2
figure 2

(a) The architecture of frame topology fusion hierarchical graph convolution network (FTF-HGCN). (b) Frame topology fusion (FTF) block. (c) Attention aggregation module (AAM). The network module takes 3D joints as input and processes them through the architecture. Subsequently, a fully connected layer outputs the final score, enabling the quality assessment of rehabilitation motions.

Preliminaries

In this study, the proposed graph-based video data processing model targets motion analysis and feature extraction, systematically handling input video data. The model first defines the video input as \(O=(O_1,O_2,\ldots ,O_n)\), where n denotes the total number of collected videos, and each video \(O_i\) is further decomposed into frame-level data. Specifically, the j-th frame \(f_j \in R^{(V,C)}\), where V is the set of human joint nodes and C is the feature dimension of each node; together, the frames form the spatial feature representation. To capture dynamic interactions between nodes within a frame, we construct a graph structure \(G = (V,E,X)\), where E represents the connectivity relationships between nodes and \(X \in R^{(T,V,C)}\) serves as the node attributes encapsulating temporal skeletal data, with T indicating the temporal length. Additionally, the adjacency matrix entry \(A_{ij}\) quantifies the spatial dependency between the i-th and j-th nodes. In the data processing pipeline, the input frame sequences undergo feature extraction and transformation via the graph convolutional operation \(Y = {\mathscr {A}_k}X{W_k}\), where \(\mathscr {A}_k\) is the adjacency matrix, X is the input feature matrix, and \(W_k\) is a learnable weight matrix. Through this combination of graph topology and convolution operations, the model systematically analyzes temporal features and spatial relationships among nodes, establishing a solid theoretical and computational framework for subsequent motion-related tasks.
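To make the operation concrete, the following minimal Python/TensorFlow sketch applies \(Y = \mathscr {A}_k X W_k\) frame by frame; the tensor shapes and the random adjacency matrix are illustrative assumptions, not the paper’s exact configuration.

```python
import tensorflow as tf

# Illustrative shapes: T frames, V joints, C input channels, C_out output channels.
T, V, C, C_out = 39, 25, 3, 64

X = tf.random.normal((T, V, C))                   # per-frame node features
A_k = tf.random.uniform((V, V))                   # adjacency matrix (placeholder)
W_k = tf.Variable(tf.random.normal((C, C_out)))   # learnable weight matrix

# Y = A_k X W_k: aggregate neighbour features per frame, then project channels.
Y = tf.einsum('uv,tvc->tuc', A_k, X)              # spatial aggregation
Y = tf.einsum('tvc,co->tvo', Y, W_k)              # channel projection
print(Y.shape)  # (39, 25, 64)
```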

Frame topology fusion block (FTF block)

Topology construction

Building upon the multi-order neighborhood topology matrix \(\mathscr {A}\), we establish informational connections for distal motion joints through a manually designed approach. The set of distal nodes N is selected based on the following formula:

$$\begin{aligned} dis(n)\le 1,\quad n\in N, \end{aligned}$$
(1)

where \(dis(\cdot)\) denotes the distance from a given node to the nearest primary distal node. We designate limb-end nodes as primary distal nodes, satisfying \(dis=0\), while the secondary distal nodes are defined by \(dis=1\). The set N comprises both primary and secondary distal nodes. Next, to obtain the edge set \(\varepsilon\) of the distal nodes, we construct the self-connection feature edges \(\varepsilon _{sc}\) and the fully connected feature edges \(\varepsilon _{fc}\) for the distal nodes. \(\varepsilon _{sc}\) is formed by the self-connections of the distal nodes. To construct \(\varepsilon _{fc}\), we loop through each distal node and pass it to the edge selector, which sequentially establishes connections between the given distal node and the other distal nodes. The establishment of the fully connected feature edges \(\varepsilon _{fc}\) is described by Eq. (2).

$$\begin{aligned} \varepsilon _{fc}[i][j]= \begin{cases} 0, & i\not \in {N}\vee j\not \in {N} \\ 1, & i\in {N}\wedge j\in {N} \end{cases}\end{aligned}$$
(2)

The final edge set of the distal nodes is obtained through the following formula:

$$\begin{aligned} \varepsilon =\varepsilon _{sc}\cup \varepsilon _{fc} \end{aligned}$$
(3)

Finally, the graph generator constructs the feature map based on the connection characteristics of the distal edges through the following formula:

$$\begin{aligned} \mathscr {A}_{g}=sum(g,\dim =1)\times \left((1-I)\odot g+(I\odot g)^{-1}\right) \end{aligned}$$
(4)

After standardizing the resulting feature maps of \(\varepsilon _{sc}\) and \(\varepsilon _{fc}\), they are concatenated to obtain the final output:

$$\begin{aligned} \mathscr {A}_{end}=\left\| \mathscr {A}_{g},(\textrm{g}\in G,G=g_{sc}\vee g_{fc})\right\| \end{aligned}$$
(5)

where \(g_{sc}\) and \(g_{fc}\) are the feature maps formed by \(\varepsilon _{sc}\) and \(\varepsilon _{fc}\), respectively. I is the identity matrix and \(\odot\) denotes the Hadamard product. \(\mathscr {A}_{g}\) is the topological matrix corresponding to the feature map formed by the distal edge set.
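The edge construction of Eqs. (2)–(3) can be sketched in a few lines of Python; the distal joint indices below are hypothetical placeholders, since the actual indices depend on the dataset’s skeleton layout.

```python
import numpy as np

V = 25                                   # number of joints (KIMORE-style skeleton)
primary_distal = {7, 11, 14, 18}         # hands and feet, dis = 0 (assumed indices)
secondary_distal = {6, 10, 13, 17}       # dis = 1 neighbours (assumed indices)
N = primary_distal | secondary_distal    # distal node set of Eq. (1)

# Self-connection edges eps_sc: each distal node connects to itself.
eps_sc = np.zeros((V, V))
for n in N:
    eps_sc[n, n] = 1.0

# Fully connected edges eps_fc, Eq. (2): 1 iff both endpoints are distal.
eps_fc = np.zeros((V, V))
for i in N:
    for j in N:
        eps_fc[i, j] = 1.0

# Eq. (3): the final distal edge set is the union of the two.
eps = np.maximum(eps_sc, eps_fc)
```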

Frame topology fusion

The constructed fusion topological structure is applied to each action frame, and a learnable feature matrix is established for this structure, as shown in Fig. 2b. To better capture the subtle differences in inter-frame actions, more detailed topological information of the distal joints in the human skeleton is incorporated. Following the construction method of the neighborhood node topological matrix, a similar learnable matrix \(\mathscr {A}_{end}\) is set up, allowing \(\mathscr {A}\) and \(\mathscr {A}_{end}\) to learn adaptive node-related weight information along the frame dimension. These are then fed in parallel into the network module for graph convolution operations, and the output results are concatenated:

$$\begin{aligned} \begin{aligned} F_{out}=\left\| \mathscr {A}_{k}XW_{k},(k\in K,K=\mathscr {A}\vee \mathscr {A}_{end})\right\| \end{aligned}\end{aligned}$$
(6)

where \(\mathscr {A}_{k}\) represents the k-th subset of the fusion topological structure, and \(W_k\) denotes the trainable parameters for each topological subset. This deepens the network’s focus on and understanding of the motion information of distal nodes on top of the aggregation of the overall skeleton information, thereby extending the features to the frame level. As a result, it expresses the different spatial node topological information between motion frames, which facilitates the learning of subtle differences across frames.
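As a sketch of Eq. (6), the two topologies can be convolved in parallel and their outputs concatenated; shapes and random matrices are again placeholder assumptions.

```python
import tensorflow as tf

T, V, C, C_out = 39, 25, 3, 32
X = tf.random.normal((T, V, C))
A = tf.random.uniform((V, V))       # multi-order neighbourhood topology
A_end = tf.random.uniform((V, V))   # distal topology from Eq. (5)

# Eq. (6): run graph convolutions over both topologies and concatenate.
outputs = []
for A_k in (A, A_end):
    W_k = tf.Variable(tf.random.normal((C, C_out)))    # per-topology weights
    Y = tf.einsum('uv,tvc->tuc', A_k, X)               # spatial aggregation
    outputs.append(tf.einsum('tvc,co->tvo', Y, W_k))   # channel projection

F_out = tf.concat(outputs, axis=-1)  # (T, V, 2 * C_out)
```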

In addition, considering the motion correlation between consecutive motion frames, a ConvLSTM30 network layer is employed to predict the topological relationships. This approach enables the connection weights between nodes at different time instances to vary, thereby more effectively capturing the interdependencies of actions. Subsequently, the predicted temporal topology matrix is combined with the manually constructed fusion topology matrix to derive a new learnable matrix, with the output presented as follows:

$$\begin{aligned} \begin{aligned} \mathscr {A}^{{\prime }}=softmax\left[\sigma \left(convLSTM\left(X^{T}\right)\right)+\mathscr {A}\right] \end{aligned}\end{aligned}$$
(7)

where X is the input feature, \(\sigma\) is the activation function, and \(\mathscr {A}\) is the fusion topology matrix formed with \(g_{sc}\) and \(g_{fc}\).

In summary, the overall formula for graph convolution is:

$$\begin{aligned} \begin{aligned} H^{(l)}=\sigma \left(D^{-\frac{1}{2}}A^{{\prime }}D^{-\frac{1}{2}}H^{(l-1)}W^{(l-1)}\right) \end{aligned}\end{aligned}$$
(8)

where D is the diagonal degree matrix, H denotes the joint action features, and W represents the learnable weight parameters.
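A minimal sketch of the layer update in Eq. (8) is given below; the matrix A_prime stands in for the ConvLSTM-predicted topology of Eq. (7), which is replaced here by a random placeholder.

```python
import tensorflow as tf

V, C_in, C_out = 25, 64, 64
A_prime = tf.nn.softmax(tf.random.uniform((V, V)), axis=-1)  # stand-in for Eq. (7)
H = tf.random.normal((V, C_in))                    # joint features at layer l-1
W = tf.Variable(tf.random.normal((C_in, C_out)))   # learnable weights

# Symmetric normalization D^{-1/2} A' D^{-1/2} with the diagonal degree matrix.
deg = tf.reduce_sum(A_prime, axis=-1)
D_inv_sqrt = tf.linalg.diag(tf.math.rsqrt(deg))
A_norm = D_inv_sqrt @ A_prime @ D_inv_sqrt

H_next = tf.nn.relu(A_norm @ H @ W)                # Eq. (8) with sigma = ReLU
```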

Hierarchical temporal convolution attention module (HTCA)

Branch temporal convolution

To extract information from actions of varying durations, this paper introduces a multi-level information extraction module consisting of four different Branch Temporal Convolutions (BTC). The module construction and parameter settings of each BTC are shown in Fig. 3, where Dil denotes the dilation rate of the dilated convolution and MP represents the kernel size of the max pooling. Specifically, this approach splits the input into four branches according to convolution kernel size, with each branch comprising two dilated convolutions, one convolutional long short-term memory network (ConvLSTM), and one max pooling operation. In the design of the branch network, the dilated convolutions expand the receptive field within a consistent network structure. The ConvLSTM captures long-term dependencies across motion frames, boosting the network’s ability to learn from extended sequences. Additionally, the max pooling operation prioritizes nodes with the largest displacement amplitudes, allowing the network to learn more detailed motion features.
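A hedged sketch of one BTC branch is shown below; kernel sizes, dilation rates, and pooling windows are placeholders, with the paper’s actual settings given in Fig. 3b.

```python
import tensorflow as tf
from tensorflow.keras import layers

def btc_branch(x, filters=32, k=3, dil=2, mp=3):
    # x: (batch, T, V, C) -- time as rows, joints as columns.
    x = layers.Conv2D(filters, (k, 1), dilation_rate=(dil, 1),
                      padding='same', activation='relu')(x)   # dilated conv 1
    x = layers.Conv2D(filters, (k, 1), dilation_rate=(dil, 1),
                      padding='same', activation='relu')(x)   # dilated conv 2
    h = tf.expand_dims(x, axis=3)                             # (batch, T, V, 1, F)
    h = layers.ConvLSTM2D(filters, (3, 1), padding='same',
                          return_sequences=True)(h)           # long-term dependencies
    x = tf.squeeze(h, axis=3)                                 # (batch, T, V, F)
    # Max pooling along time highlights the largest activations.
    return layers.MaxPool2D((mp, 1), strides=(1, 1), padding='same')(x)

x = tf.random.normal((2, 39, 25, 3))
out = tf.concat([btc_branch(x, k=k) for k in (3, 5, 7, 9)], axis=-1)  # four branches
```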

Fig. 3
figure 3

Components of the BTC module and specific parameter settings. (a) The BTC module consists of four branches, each primarily comprising Dilation Temporal Convolution (DTC), ConvLSTM, and MaxPool components. The information within each branch is concatenated to form the output of that branch module. (b) Specific parameter design of convolution kernels for different BTC modules. Convolutions at different scales effectively ensure the network’s ability to acquire a wider range of contextual information.

Attention aggregation module

To enable the network to pay more attention to joints that are richer in motion information, this paper integrates and extracts the information from the different branches through the Attention Aggregation Module (AAM), as shown in Fig. 2c. First, the output information of each branch is concatenated, and the local patterns and spatial relations of the data are captured through an initial convolution, serving as the input X to the module. Second, a shared weight pool is employed through multilayer deep convolution to derive T, C, and R as the target, context, and result of the motion evaluation, respectively. Subsequently, the aggregated information is obtained through the following process:

$$\begin{aligned} X_{node}=softmax\left(\frac{T\cdot \textrm{C}^{T}}{\sqrt{d}}\right)\cdot {Conv}(R+\mathfrak {J}(R)), \end{aligned}$$
(9)

where d is the normalization constant, and \(\mathfrak {J}\) is an activation function that processes the time stride, consisting of both nonlinear and linear activations. To accelerate network convergence and effectively balance local and global contextual features, we introduce a residual connection as compensation for the initial input X:

$$\begin{aligned} \Delta x=Conv(\mathfrak {L}(Res(x))), \end{aligned}$$
(10)

where \(\mathfrak {L}\) is the reshape operation. Finally, these two components are summed to produce the ultimate output \(X^{\prime }\):

$$\begin{aligned} X^{\prime }=X_{node}+\Delta x. \end{aligned}$$
(11)

The integration of the AAM with the residual module enhances the network’s robustness to noise while enabling finer-grained focus on high-amplitude motion information. Consequently, this approach effectively improves the network’s ability to assess the rehabilitation quality of subtle movements. The overall process of FTF-HGCN is presented in Algorithm 1.
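The sketch below illustrates the aggregation of Eqs. (9)–(11); the 1×1-convolution projections, GELU in place of \(\mathfrak {J}\), and a plain convolution in place of the reshape-and-residual path of Eq. (10) are simplifying assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aam(x, d=64):
    # x: (batch, n, d), the concatenated branch features after the initial conv.
    t = layers.Conv1D(d, 1)(x)   # target
    c = layers.Conv1D(d, 1)(x)   # context
    r = layers.Conv1D(d, 1)(x)   # result
    # Eq. (9): scaled dot-product attention over T and C, applied to Conv(R + J(R)).
    attn = tf.nn.softmax(tf.matmul(t, c, transpose_b=True) / tf.sqrt(float(d)), axis=-1)
    x_node = tf.matmul(attn, layers.Conv1D(d, 1)(r + tf.nn.gelu(r)))
    # Eq. (10): residual compensation branch from the initial input.
    delta_x = layers.Conv1D(d, 1)(x)
    return x_node + delta_x      # Eq. (11)

y = aam(tf.random.normal((2, 39 * 25, 64)))
```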

Algorithm 1
figure a

The entire process of FTF-HGCN.

Experiments

Datasets

To evaluate the performance of the model, we conduct experiments on two rehabilitation motion datasets: KIMORE31 and UI-PRMD30. The KIMORE dataset comprises real score annotations and RGBD videos for five types of movements, with each action represented by a 75-dimensional joint angle displacement sequence. It is divided into two groups: an experimental group and a control group. The experimental group consists of 34 patients with motor impairments, including Parkinson’s disease, back pain, and stroke. The control group includes 12 rehabilitation therapists and experts, along with 32 healthy non-expert participants. The UI-PRMD dataset used Vicon sensors to collect three-dimensional data for 10 rehabilitation movements from 10 healthy subjects. Each subject performed every movement 10 times, with each action represented by a 117-dimensional joint angle displacement sequence. Because real-world capture inevitably introduces noise, we adopt a padding strategy in the data pre-processing stage: missing joint data are filled with zeros to ensure dimensional alignment, leveraging the robust fitting capabilities of the neural network to ultimately derive quality scores. In addition, the hierarchical network architecture enables the model to learn patient motion characteristics across multiple dimensions and to integrate information from different levels, which inherently facilitates the fitting and generalization of these deviation patterns. This allows the model to effectively handle motion deviations arising from patient subjectivity.
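The zero-padding step can be sketched as follows; the NaN convention for missing captures is an assumption about how sensor dropouts are flagged.

```python
import numpy as np

def pad_missing_joints(seq):
    # seq: (T, V, 3) array of 3D joint coordinates, NaN where a joint was lost.
    seq = np.asarray(seq, dtype=np.float32)
    return np.nan_to_num(seq, nan=0.0)  # fill missing joints with zeros
```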

Evaluation metrics

The quality assessment of rehabilitation actions typically focuses on the discrepancy between predicted and actual results. To better evaluate the performance of the model, we employ mean absolute deviation (MAD), mean absolute percentage error (MAPE), and root mean square error (RMSE) as evaluation metrics, because they are widely used with the KIMORE and UI-PRMD datasets to quantify deviations from expert standards and align with existing methods12,13,14. This makes it straightforward to compare our method with state-of-the-art approaches; our method clearly outperforms comparative approaches on these metrics, validating its effectiveness in overall motion quality assessment. For all three metrics, lower values indicate higher prediction accuracy. MAD represents the average of absolute deviations between actual and predicted values, serving as a scale-dependent measure. MAPE is a relative, scale-independent metric that is unaffected by the sign of errors. RMSE is defined as the square root of the mean squared error, which quantifies the average difference between predicted and actual values. The formulas for MAD, MAPE, and RMSE are provided below:

$$\begin{aligned} MAD= & \frac{1}{n}\sum _{i=1}^{n}\left| y_i-\hat{y}_i\right| \end{aligned}$$
(12)
$$\begin{aligned} MAPE= & \frac{1}{n}\sum _{i=1}^{n}\left| \frac{y_i-\hat{y}_i}{y_i}\right| \times 100 \end{aligned}$$
(13)
$$\begin{aligned} RMSE= & \sqrt{\frac{1}{n}\sum _{i=1}^{n}(y_i-\hat{y}_i)^{2}} \end{aligned}$$
(14)

where n represents the sample size, \(\hat{y}_i\) is the predicted score of the i-th action, and \(y_i\) is its actual score.
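Equations (12)–(14) translate directly into NumPy:

```python
import numpy as np

def mad(y, y_hat):
    return np.mean(np.abs(y - y_hat))                 # Eq. (12)

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y)) * 100     # Eq. (13); assumes y != 0

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))         # Eq. (14)

y, y_hat = np.array([42.0, 35.5, 48.1]), np.array([41.2, 36.0, 47.5])
print(mad(y, y_hat), mape(y, y_hat), rmse(y, y_hat))
```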

Experimental setup

We implement our experiments using the TensorFlow 2.0 deep learning framework, with all computations performed on an NVIDIA GeForce RTX 3090 GPU. The Adam optimizer is employed to train the model on the training set for 1000 epochs. We adopt an annealing strategy with an initial learning rate of 0.0001, which is dynamically adjusted as the number of training epochs increases. For the UI-PRMD and KIMORE datasets, batch sizes are set to 16 and 20, respectively, and a dropout rate of 0.2 is applied to mitigate overfitting. The overall network architecture integrates the ResNet residual mechanism32. To ensure the reliability of the results, we conduct ten training and testing runs, storing the performance metrics (MAD, RMSE, MAPE) after each run and computing their average as the final comparative outcome.
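A sketch of this training configuration is given below; the cosine schedule is one plausible realization of the annealing strategy, since only the initial learning rate and its dynamic adjustment are specified above.

```python
import tensorflow as tf

# Initial learning rate 1e-4, annealed as training progresses (schedule assumed).
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-4, decay_steps=1000)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

EPOCHS = 1000
BATCH_SIZE = 16      # 16 for UI-PRMD, 20 for KIMORE
DROPOUT_RATE = 0.2   # applied to mitigate overfitting
```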

Experimental results

We conduct extensive experiments on the publicly available rehabilitation training datasets KIMORE and UI-PRMD and compare the results with those of similar methods. For the KIMORE dataset, we use continuous video frames containing 25 three-dimensional joint nodes as input to the model, outputting predicted quality scores for each movement. These scores are evaluated using three metrics: MAD, RMSE, and MAPE. The results of this comparison are shown in Table 1.

Table 1 Comparison of MAD, RMSE, and MAPE metrics across exercises Ex1 to Ex5 for different methods.

Our method achieves the best results in quality prediction for all of these rehabilitation exercises. Compared to the previous state-of-the-art results on the MAD/RMSE/MAPE metrics, our approach demonstrates superiority across the five exercises. In contrast, Liao et al.11 used traditional Graph Convolutional Networks (GCNs) to model joint topological relationships based on static adjacency matrices, which overlooked the dynamic changes in topology during motion execution. Mourchid et al.13 relied on handcrafted motion features, which required domain expertise and exhibited limited generalization to complex motions. Our FTF-HGCN introduces the Frame Topology Fusion (FTF) module, which dynamically adjusts inter-joint spatial relationships through a learnable frame-level topological matrix, enhancing the modeling of motion-specific phases. This dynamic topology modeling aligns theoretically with the nonlinear spatio-temporal characteristics of rehabilitation motions, offering particular advantages in capturing interactions between distal and supporting joints. Similarly, Sardari et al.14 utilized Long Short-Term Memory (LSTM) networks to model temporal motion information, but their approach did not exploit the spatial topology of joints, making it challenging to capture joint interactions. By combining graph convolution with frame-level topological information, the FTF-HGCN achieves unified modeling of spatial and temporal features, enhancing the learning of dynamic motion characteristics. It employs end-to-end learning to automatically extract multi-order neighborhood features of joints and dynamically allocates attention weights, mitigating the subjectivity of manual feature design and theoretically improving adaptability to diverse motion patterns. Figure 4 presents the visualization of rehabilitation quality for four exercises. The left side of the figure presents the attention maps for expert movements, illustrating the joint intensity of motion information. The right side corresponds to the rehabilitation patients’ movements, showing the difference in the attention maps relative to the expert movements and the quality scores of the patients’ movements. Higher scores indicate greater movement accuracy and closer alignment with the standard.

Fig. 4
figure 4

Attention heatmaps of rehabilitation actions and their corresponding visualized joint attention weights. Four movements from the KIMORE dataset are visualized, with one professional example and two control patient examples included in each movement. The joint attention weights of these two patient movements exhibit varying differences from the expert standard, with smaller discrepancies corresponding to higher scores for movement standardization.

For the UI-PRMD dataset, we use 39 consecutive video frames of 3D joint points as input to the model and compare the results on the MAD evaluation metric, as shown in Table 2. For Ex3, our method is not as effective as that of Song et al.33. This is because Ex3 places greater reliance on proximal joint coordination than on substantial displacement at the termination stage. Ex3 involves single-leg support and a significant shift in the body’s center of gravity, demanding greater balance and joint coordination, particularly in dynamic adjustments of the knee, hip, and upper limbs. Additionally, the temporal characteristics of this motion are relatively complex, as the kneeling and arm-raising actions must be precisely synchronized within a short time frame, potentially leading to larger deviations in skeletal data across certain frames.

To better assist patients in understanding the scores associated with rehabilitation motions, we have integrated hospital standards for evaluating the completion of rehabilitation motions into a reference table of rehabilitation outcome grades. We categorize the quality of motion completion into five grades (Non-proficient: 0–10, Poor-proficient: 10–15, Moderate-proficient: 15–30, Proficient: 30–45, and High-proficient: 45–50) based on the range of motion scores. Each grade corresponds to a distinct level of rehabilitation motion effectiveness, enabling patients to directly understand the meaning of their motion scores.
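The grade mapping can be expressed as a simple lookup; handling of the shared boundary values is an assumption.

```python
def proficiency_grade(score):
    # Five-grade mapping over the 0-50 score range.
    if score < 10:
        return "Non-proficient"
    elif score < 15:
        return "Poor-proficient"
    elif score < 30:
        return "Moderate-proficient"
    elif score < 45:
        return "Proficient"
    return "High-proficient"  # 45-50

print(proficiency_grade(38.5))  # Proficient
```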

Table 2 Comparison of MAD metric across different methods for exercises Ex1 to Ex10.

Computational cost

In Table 3, we compare the number of parameters and the running time of the training and testing phases of the models on the KIMORE dataset. Taking Ex5 as an example, our model has 0.572M parameters, with an average per-frame runtime of 53 milliseconds. This is approximately 20 ms slower than the method proposed by Mourchid et al.13. The increased runtime is due to the additional parameters and computational complexity introduced by our hierarchical temporal convolutional attention mechanism, which extends the network’s inference time. However, this module significantly enhances motion evaluation quality, achieving the best performance across all three metrics. Our design keeps the parameter increase within an acceptable range by employing \(1\times 1\) convolutional kernels in the hierarchical network and implementing the attention mechanism through a single-layer convolution and channel multiplication. Compared to the method of Deb et al.12, our approach reduces the parameter count by 0.115M and achieves approximately 80 ms faster per-frame inference.

Table 3 Comparison of computational cost on different models.
Table 4 Ablation study on FTF module.
Table 5 Ablation study on HTC module.

Ablation study

We conduct a series of ablation studies on the KIMORE dataset to further validate the effectiveness of the proposed method. The experiments are divided into two modules: the frame topology fusion (FTF) module and the hierarchical temporal convolution (HTC) module. In the ablation study on the FTF module, we retain the base topology matrix of the neighborhood nodes along with the HTC module and perform the evaluation on the Ex2 exercise. The effectiveness of the distal node topology construction is evaluated separately, and the experimental results are shown in Table 4.

In the ablation experiments conducted on the HTC module, we utilize the integrated topology matrix of the nodes as the baseline network and assess the effectiveness of each module within the BTC branch network on the Ex5 exercise. The experimental results, shown in Table 5, demonstrate that the comprehensive integration of the branch modules significantly enhances the network’s attention to the overall rehabilitation movement while mitigating prediction bias.

Figure 5 shows the visualization of the joint topology matrix, where darker colors indicate stronger information interaction between nodes. The fusion topology matrix improves the attention mechanism by facilitating information interaction among joint points, particularly by improving the extraction of motion information from distal nodes.

Fig. 5
figure 5

The attention map of joint information after the training process. (a) Initial topology. (b) Integrated topology. The fusion topology construction approach is noteworthy for its increased emphasis on distal joint nodes and its enhanced aggregation of node information.

Limitation and future directions

We acknowledge that the global error metrics (MAD, RMSE, and MAPE) may not directly quantify misalignments at specific time points or individual joints. To address this limitation, we plan to incorporate more fine-grained evaluation metrics in future work, such as dynamic time warping (DTW) to capture temporal errors, or localized error analysis of specific joints to identify misalignments. Additionally, our five-level motion evaluation has been validated by clinical experts. Moving forward, we will deepen our collaboration with clinical experts to verify the correlation between model scores and actual rehabilitation outcomes. We will also integrate medical evaluation scales, such as loss of range of motion (ROM), to further clarify the clinical utility of our model.

To enhance the model’s adaptability to non-ideal conditions, we will employ data augmentation techniques to maintain accurate motion evaluation across diverse environments. We will expand the repertoire of rehabilitation motion types to include comprehensive exercises targeting the upper limbs, lower limbs, hip joints, elbow joints, wrist joints, and knee joints. This will allow us to address the diverse clinical manifestations of stroke patients and overcome the limitations of single-mode rehabilitation exercises. Furthermore, we will also collaborate with the First Affiliated Hospital of Zhengzhou University to deploy the model in real-world home environments, leveraging feedback from clinical experts to validate its performance and practical utility. We believe these improvements will enable the system to provide clear and patient-friendly result presentations, allowing stroke patients to perform autonomous and effective rehabilitation exercises at home.

In future work, we plan to explore advanced graph representation learning37 (GRL) techniques to optimize the model’s ability to capture complex motion patterns. The current FTF-HGCN framework effectively utilizes the Frame Topology Fusion (FTF) module and the Hierarchical Temporal Convolution Attention (HTCA) module to dynamically model spatial and temporal relationships among joints, achieving robust representation of graph structures for rehabilitation motion assessment. However, GRL offers promising opportunities to further improve the model’s expressive power and computational efficiency. By adaptively learning low-dimensional embeddings of graph structures, GRL can more effectively capture latent patterns in complex motions and optimize non-local dependencies among joints. This approach is expected to enhance the model’s capability to handle diverse motion patterns and accommodate individualized variations in patient movements. These advances will further strengthen the model’s support for home-based stroke rehabilitation, addressing the limitations of traditional graph convolution networks and aligning with the evolving needs of clinical applications.

We will conduct practical testing with stroke patients through our collaboration with the First Affiliated Hospital of Zhengzhou University. This testing will refine the presentation of system results, making them easier for patients to understand and implement. Furthermore, we aim to enhance personalized services tailored to the specific symptoms of individual patients.

Conclusion

This paper proposes a frame topology fusion hierarchical graph convolution network that integrates spatial information from distant joints based on the neighborhood topology of keypoints. On this basis, it constructs a learnable frame-level spatial topology matrix and subsequently employs a hierarchical temporal convolution module to fuse the topological structures of patient motion information across different temporal scales. The network effectively captures subtle differences between nodes and discerns the motion characteristics of rehabilitation motions at varying rates. Consequently, it facilitates more precise predictions of movement quality in the field of rehabilitation. To improve the patients’ understanding of score significance, we will present attention error maps of key joints in the motions, thereby helping them better comprehend the standards for normative motions. Future research will concentrate on efficiently integrating and utilizing multi-order neighborhood joint information and distal joint information to further refine and improve the evaluation performance.